Internal Study Session Material: History of LLaVA

NABLAS | 13 slides | Aug 28, 2024

About This Presentation

We have published "LLaVA", a presentation from our internal engineer/researcher study session!

It covers LLaVA, a large multimodal model that processes images and text by combining an image encoder with an LLM, and its successor models (LLaVA-1.5 through LLaVA-OneVision).


Slide Content

Slide 1
History of LLaVA
NABLAS

Paper Discussion

Slide 2: What is LLaVA?
"LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding"
Typically, the vision encoder and the LLM are frozen during pre-training.
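A minimal sketch of this architecture, assuming hypothetical module names and dimensions (the real implementation is in the LLaVA repository): a vision encoder produces patch features, a trained projector maps them into the LLM's embedding space, and the projected visual tokens are prepended to the text embeddings.

```python
import torch
import torch.nn as nn

class LlavaSketch(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP ViT; frozen during pre-training
        self.projector = nn.Linear(vision_dim, llm_dim)  # trained connector (an MLP from LLaVA-1.5 on)
        self.llm = llm  # e.g. a LLaMA-family decoder; frozen during pre-training

    def forward(self, pixel_values, text_embeds):
        vision_feats = self.vision_encoder(pixel_values)  # (B, N_patches, vision_dim)
        visual_tokens = self.projector(vision_feats)      # (B, N_patches, llm_dim)
        # Prepend the visual tokens to the text embeddings and run the LLM.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)  # assumes a Hugging-Face-style decoder call
```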

Slide 3: How LLaVA Evolves
Timeline (2023-2024):
●LLaVA (2023/04): the original version
●LLaVA-1.5 / LLaVA-1.5-HD (2023/10): MLP connector, better prompt during pre-training, re-composed dataset, higher input image resolution, larger language model, AnyRes
●LLaVA-NeXT (2024/01): higher input image resolution, re-composed dataset, SGLang support
●LLaVA-NeXT-Video (2024/04): video extension
●LLaVA-NeXT (Stronger) (2024/05): better LLM
●LLaVA-OneVision (2024/08): multiple-image & video extension, higher AnyRes
And many other variants!

Slide 4: Problems
●Fixed (low) image resolution
→ AnyRes (LLaVA-1.5-HD)
●Limitation on the number of visual tokens
→ Higher AnyRes (LLaVA-OneVision)
→ Are 64 visual tokens enough? (Idefics2)
●Image encoder (CLIP? SigLIP? InternViT?)
●LLM (LLaMA? Qwen?)

Slide 5: AnyRes (fixed (low) image resolution)
The input image resolution of a ViT can be changed by interpolating its positional encodings, but this requires fine-tuning. → LLaVA-1.5-HD solves this problem by splitting the input image into grids and running the model on each grid independently.
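A minimal sketch of the grid-splitting idea, with a hypothetical helper name and a fixed 2x2 grid (the actual AnyRes implementation selects the grid shape from a set of candidates to match the input aspect ratio):

```python
from PIL import Image

def anyres_crops(image: Image.Image, base=336, grid=(2, 2)):
    """Split a high-resolution image into grid[0] x grid[1] crops of
    base x base pixels, plus one downscaled overview, so a fixed-resolution
    ViT can be reused without interpolating its positional encodings."""
    w, h = grid[0] * base, grid[1] * base
    resized = image.resize((w, h))
    crops = [
        resized.crop((x * base, y * base, (x + 1) * base, (y + 1) * base))
        for y in range(grid[1])
        for x in range(grid[0])
    ]
    overview = image.resize((base, base))  # global view of the whole image
    # Each crop (and the overview) is encoded independently by the ViT;
    # the resulting visual tokens are concatenated before the projector.
    return [overview] + crops
```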

Slide 6: Higher AnyRes (limitation on the number of visual tokens)
AnyRes increases the total number of visual tokens, which restricts how many input images and video frames can be fed in. → Higher AnyRes sets a threshold τ on the maximum number of visual tokens: when the total number of visual tokens (the number of crops × the tokens per crop) exceeds τ, the per-crop tokens are reduced so that the total stays within the threshold.
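A minimal sketch of this thresholding step, assuming the reduction is done by bilinear interpolation over each crop's square token grid (the function name and tensor shapes are hypothetical):

```python
import torch
import torch.nn.functional as F

def cap_visual_tokens(crop_tokens: torch.Tensor, tau: int) -> torch.Tensor:
    """crop_tokens: (num_crops, tokens_per_crop, dim). If the total number
    of visual tokens exceeds the threshold tau, shrink each crop's token grid."""
    num_crops, tokens_per_crop, dim = crop_tokens.shape
    if num_crops * tokens_per_crop <= tau:
        return crop_tokens  # already under the threshold
    side = int(tokens_per_crop ** 0.5)           # assume a square token grid
    new_side = int((tau / num_crops) ** 0.5)     # target tokens per crop
    grid = crop_tokens.reshape(num_crops, side, side, dim).permute(0, 3, 1, 2)
    shrunk = F.interpolate(grid, size=(new_side, new_side),
                           mode="bilinear", align_corners=False)
    return shrunk.permute(0, 2, 3, 1).reshape(num_crops, new_side * new_side, dim)
```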

Slide 7: Idefics2 (limitation on the number of visual tokens)
Idefics2 uses a Perceiver instead of an MLP to control the number of visual tokens.
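A minimal sketch of a Perceiver-style resampler, with hypothetical dimensions (Idefics2's actual resampler stacks several such layers with normalization and gating): a fixed set of 64 learned latent queries cross-attends to the image patch features, so the output is always 64 visual tokens regardless of how many patches come in.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, dim=1152, num_latents=64, num_heads=16):
        super().__init__()
        # Learned latent queries; their count fixes the output token budget.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, N_patches, dim) -> output: (B, num_latents, dim)
        q = self.latents.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, image_feats, image_feats)
        return attended + self.ffn(attended)
```

Because the latents, not the image, set the output length, a 2000-patch image and a 500-patch image both come out as 64 visual tokens.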

Slide 8: Image encoder (CLIP? SigLIP? InternViT?)

Model           | Vision Encoder           | Image Resolution
LLaVA           | CLIP ViT-L/14            | 224x224
LLaVA-1.5       | CLIP ViT-L/14-336        | 336x336
LLaVA-1.5-HD    | CLIP ViT-L/14-336        | 336x336
LLaVA-NeXT      | CLIP ViT-L/14-336        | 336x336
LLaVA-OneVision | SigLIP                   | 384x384
InternVL2       | InternViT-6B             | 448x448
SPHINX          | CLIP / ConvNeXt / DINOv2 | 224x224
Idefics2        | SigLIP-SO400M            | 384x384

Slide 9: LLM (LLaMA? Qwen?)

Slide 10: Common training pipeline
Pre-training: update the randomly initialized projector (the module connecting the vision encoder and the LLM) while freezing the vision encoder and the LLM, using an image/text captioning dataset (some other VLMs, such as BLIP-3, instead set a text next-token-prediction task on an image/text interleaved dataset).
Instruction tuning: train the projector, the vision encoder, and the LLM on an image/text instruction-tuning dataset.
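A minimal sketch of this stage-dependent freezing, assuming hypothetical attribute names (projector, vision_encoder, llm) on a LLaVA-style model object:

```python
import torch

def set_stage(model, stage: str):
    if stage == "pretrain":
        # Stage 1: only the randomly initialized projector learns to align
        # vision features with the LLM's embedding space.
        trainable = {"projector"}
    elif stage == "instruction_tuning":
        # Stage 2: per the slide, projector, vision encoder, and LLM are
        # all updated on the instruction-tuning data.
        trainable = {"projector", "vision_encoder", "llm"}
    else:
        raise ValueError(stage)
    for name in ("projector", "vision_encoder", "llm"):
        for p in getattr(model, name).parameters():
            p.requires_grad = name in trainable

# The optimizer should then be built over trainable parameters only, e.g.:
# optim = torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)
```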

Slide 11: Pre-training datasets and instruction tuning datasets

Dataset                    | Size                      | Note                                                                   | Type
LLaVA-Pretrain / OneVision | 558K image-text pairs / - | -                                                                      | Pre-training / Instruction Tuning
OBELICS                    | 350M images / 115B tokens | -                                                                      | Pre-training
LAION-5B                   | -                         | This model can be used to remove inappropriate items from the dataset  | Pre-training
OCR-IDL                    | 19M documents             | OCR                                                                    | Pre-training
PDFA                       | 18M documents             | OCR                                                                    | Pre-training
MINT-1T                    | 3.4B images and 1T tokens | -                                                                      | Pre-training
The Cauldron               | -                         | -                                                                      | Instruction Tuning

Slide 12: Comparison with proprietary models

Slide 13: Future work
●Extremely long videos, many images, and large images
→ Need better methods to reduce the total number of visual tokens
●Better training pipeline
→ The authors of Idefics2 reported that unfreezing the vision encoder and the LLM during pre-training leads to training-loss divergence unless LoRA is used
●Better training data mixture schemes
●Better training objectives
●MoE extension
●More efficient architectures