Internal Study Session Material: Object Recognition as Next Token Prediction

NABLAS | 23 slides | May 13, 2024

About This Presentation

This presentation introduces an approach that uses a language model decoder to efficiently recognize objects in an image.


Slide Content

Slide 1
Paper Discussion #15
Object Recognition as Next Token Prediction (CVPR 2024)

Slide 2
Idea
Use a pair consisting of an image encoder and a language decoder as an (open-ended) image recognizer
that returns a list of all objects in a given image.
In this case, the output is a sequence of tokens, e.g.
[“so”, “fa”, “[SEP]”, “cat”, “[SEP]”, “blank”, “et”, “[SEP]”]
→ “sofa”, “cat”, “blanket” after post-processing (see the sketch below)
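A minimal post-processing sketch (mine, not from the slides): subword tokens are concatenated and labels are split at the [SEP] delimiter. Plain string concatenation is assumed for illustration; a real tokenizer has its own detokenization rules.

```python
def tokens_to_labels(tokens, sep="[SEP]"):
    """Join subword tokens into label strings, splitting at the [SEP] delimiter."""
    labels, current = [], []
    for tok in tokens:
        if tok == sep:
            if current:
                labels.append("".join(current))
            current = []
        else:
            current.append(tok)
    if current:  # trailing label without a final [SEP]
        labels.append("".join(current))
    return labels

print(tokens_to_labels(["so", "fa", "[SEP]", "cat", "[SEP]", "blank", "et", "[SEP]"]))
# ['sofa', 'cat', 'blanket']
```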

Slide 3
Problems that current open-ended image recognizers (e.g. CLIP) have
●They need a predefined set of class descriptions (a minimal CLIP example follows below)
●As the set grows larger, accuracy decreases
← Is it possible to eliminate this step?
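For reference, a sketch of the predefined-class-set workflow the slide refers to, using the Hugging Face CLIP API; the model checkpoint, image path, and class list are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# The class set must be fixed in advance; this is the step the paper wants to remove.
classes = ["sofa", "cat", "blanket", "dog", "table"]
image = Image.open("living_room.jpg")

inputs = processor(text=[f"a photo of a {c}" for c in classes],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_classes)
probs = logits.softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```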

Slide 4
Straightforward way: using an LLM
●With few-shot learning, it requires good samples (and it doesn’t scale?)
●With zero-shot learning, there is no explicit way to specify target classes → low accuracy

Slide 5
Pipeline in more detail
[Pipeline diagram]
●Image encoder: CLIP image encoder + FC (※ frozen except for the last 6 blocks)
●Language decoder: only the first 6 blocks and the last block are used
●Decoder input: image embeddings [IMG] followed by the prompt “the objects in the image are”
●“Learnable” marks the trainable components in the diagram (sketch below)
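A minimal sketch of my reading of the diagram (not the authors' code): CLIP patch features are projected into the decoder's embedding space and prepended to the prompt, and the decoder then predicts object tokens. All module and variable names are placeholders.

```python
import torch
import torch.nn as nn

class NextTokenRecognizer(nn.Module):
    """Illustrative pipeline: CLIP image encoder -> projection -> truncated LM decoder."""

    def __init__(self, image_encoder, decoder, clip_dim=1024, lm_dim=4096):
        super().__init__()
        self.image_encoder = image_encoder        # CLIP ViT (partially frozen)
        self.proj = nn.Linear(clip_dim, lm_dim)   # maps image tokens into the LM embedding space
        self.decoder = decoder                    # truncated language decoder

    def forward(self, pixel_values, prompt_embeds):
        # (B, num_patches, clip_dim) patch token embeddings, no [cls] token
        img_tokens = self.image_encoder(pixel_values)
        img_embeds = self.proj(img_tokens)
        # Prepend image embeddings to the prompt "the objects in the image are"
        inputs_embeds = torch.cat([img_embeds, prompt_embeds], dim=1)
        # The decoder returns next-token logits over the text vocabulary
        return self.decoder(inputs_embeds=inputs_embeds)
```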

Slide 6
Data preprocess
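The slide body is a figure. As context, turning image-caption pairs into training labels essentially amounts to extracting nouns from each caption; a hedged sketch using spaCy follows (the paper's exact parsing and filtering rules may differ).

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def caption_to_labels(caption: str) -> list[str]:
    """Extract lemmatized, deduplicated nouns from a caption as candidate object labels."""
    doc = nlp(caption)
    nouns = [tok.lemma_.lower() for tok in doc if tok.pos_ in ("NOUN", "PROPN")]
    return sorted(set(nouns))

print(caption_to_labels("A cat sleeping on a sofa under a blanket"))
# e.g. ['blanket', 'cat', 'sofa']
```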

Slide 7
Formulation: current image recognizers (e.g. ResNet, CLIP)
[Annotated equation: softmax over class scores; reconstruction below]
●Input features: feature map (ResNet) / set of token (image patch) vectors (CLIP)
●Pooling: average pooling (ResNet) / [cls] token or token pooling (CLIP)
●Class scores: fully-connected layer (ResNet) / similarity with the embedding vectors of the predefined class descriptions (CLIP)
●Softmax over the class scores
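A reconstruction of the standard formulation the slide annotates (notation mine, not copied from the paper):

```latex
% ResNet-style: pool the feature map F, apply a fully-connected layer W, then softmax
P(c \mid \mathbf{X}) = \mathrm{softmax}_c\!\bigl(\mathbf{W}\,\mathrm{avgpool}(\mathbf{F})\bigr)

% CLIP-style: pool the patch tokens (or take [cls]), then score against the
% embeddings w_c of a predefined set of class descriptions
P(c \mid \mathbf{X}) = \mathrm{softmax}_c\!\bigl(\mathbf{w}_c^{\top}\,\mathrm{pool}(\{\mathbf{x}_i\})\bigr)
```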

Slide 8
Formulation: proposed image recognizer (in the case where each class is represented by a single token)
[Annotated equation: next-token distribution; reconstruction below]
●Input features: set of token (image patch) vectors
●Decoder: projection layer + LLM
●Output head: fully-connected layer (+ layer normalization)
●Softmax over the vocabulary
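A reconstruction in the same spirit (notation mine): with single-token classes, the class probability is simply the decoder's next-token probability.

```latex
% Patch tokens are projected into the LM space and decoded together with the prompt;
% the FC head (+ layer normalization) maps the last hidden state to vocabulary logits.
\mathbf{h} = f_{\mathrm{LLM}}\bigl(\mathrm{Proj}(\{\mathbf{x}_i\}),\ \text{prompt}\bigr), \qquad
P(c \mid \mathbf{X}) = \mathrm{softmax}_c\!\bigl(\mathbf{W}\,\mathrm{LN}(\mathbf{h})\bigr)
```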

Slide 9
Formulation: proposed image recognizer (in the case where each class is represented by possibly multiple tokens)
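The slide body is an equation figure; the standard next-token factorization it corresponds to (notation mine) is:

```latex
% A label L composed of tokens (w_1, ..., w_T) is scored by the chain rule of
% next-token prediction, conditioned on the image tokens X (and the prompt).
P(L \mid \mathbf{X}) = \prod_{t=1}^{T} P\bigl(w_t \mid w_{<t},\, \mathbf{X}\bigr)
```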

Slide 10
Final objective function (multiple labels with multiple tokens each)
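Again the slide is an equation figure. A hedged reconstruction of the objective: token-level cross-entropy summed over the K labels of an image, where each label is conditioned only on the image and its own prefix (thanks to the non-causal mask).

```latex
\mathcal{L} = -\sum_{k=1}^{K} \sum_{t=1}^{T_k}
  \log P\bigl(w^{(k)}_{t} \mid w^{(k)}_{<t},\, \mathbf{X}\bigr)
```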

Slide 11
Customized non-causal attention mask
[Figure: standard causal attention mask vs. the proposed non-causal attention mask, drawn as Query \ Key matrices]
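A minimal sketch (mine, based on the figure's idea) of such a non-causal mask: every token attends to the image/prompt prefix; tokens of one label attend causally to that label's own earlier tokens but never to tokens of other labels.

```python
import torch

def build_noncausal_mask(prefix_len, label_lens):
    """Boolean attention mask (True = may attend), query rows x key columns.

    prefix_len: number of image + prompt tokens.
    label_lens: token length of each label. Labels are independent: a label
    token attends to the prefix and to earlier tokens of the same label only.
    """
    total = prefix_len + sum(label_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # The prefix attends causally to itself, as in a standard decoder
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool))
    start = prefix_len
    for n in label_lens:
        rows = slice(start, start + n)
        mask[rows, :prefix_len] = True                                       # image + prompt
        mask[rows, rows] = torch.tril(torch.ones(n, n, dtype=torch.bool))    # own prefix only
        start += n
    return mask

print(build_noncausal_mask(prefix_len=3, label_lens=[2, 1]).int())
```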

Slide 12
One-shot sampling (or parallel sampling)
[Figure: the top tokens sampled at the first position become the first token of the first label, the second label, and so on]
The key to its parallelism lies in the non-causal masking mechanism, which also avoids the repetition issue (?)
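A rough sketch of the idea (mine, not the authors' code): take the top-k tokens of the first next-token distribution as the first tokens of k labels, then complete each label greedily until [SEP]; the non-causal mask keeps the labels independent. All names and IDs are placeholders.

```python
import torch

@torch.no_grad()
def one_shot_sample(next_token_logits_fn, prefix, k=10, sep_id=102, max_len=8):
    """Illustrative parallel decoding of k labels.

    next_token_logits_fn(sequences) returns next-token logits, one row per sequence;
    prefix is the shared image + prompt token sequence (1-D LongTensor).
    """
    # Step 1: the top-k first tokens become the first token of k independent labels.
    first_logits = next_token_logits_fn(prefix.unsqueeze(0))[0]
    labels = [[t] for t in first_logits.topk(k).indices.tolist()]

    # Step 2: greedily extend all labels in parallel; with the non-causal mask,
    # each label only sees the shared prefix and its own earlier tokens.
    for _ in range(max_len - 1):
        batch = torch.stack([torch.cat([prefix, torch.tensor(l)]) for l in labels])
        next_tokens = next_token_logits_fn(batch).argmax(dim=-1).tolist()
        labels = [l + [sep_id if l[-1] == sep_id else t]
                  for l, t in zip(labels, next_tokens)]

    # Truncate each label at its first [SEP].
    return [l[: l.index(sep_id)] if sep_id in l else l for l in labels]
```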

Slide 13
Experiment settings
Train datasets
(1) G3M - CC3M / COCO Captions / SBU
(2) G70M - 67M images from LAION-Synthetic-115M + G3M

Eval datasets
Eval splits of CC3M / COCO Captions / OpenImages V7

Input image preprocessing
●Same as for the CLIP image encoder
●224 x 224 resolution

Others
●No [cls] token in the CLIP image encoder
●Output text vocabulary of (32K - 1) tokens
●No [eos] token (a [sep] token is used instead)
●Labels are shuffled for each image during training (?)
●Global batch size of 512

Slide 14
Metric
BERTScore is used
[Annotated formulas: normalized by the number of objects in a given image and by the number of predicted objects, respectively; sketch below]
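A hedged reconstruction of what such greedy-matching metrics typically look like (M ground-truth objects g_i, N predicted objects p_j, sim = the BERTScore-style similarity); the exact definition is in the paper.

```latex
R = \frac{1}{M} \sum_{i=1}^{M} \max_{j}\ \mathrm{sim}(g_i, p_j), \qquad
P = \frac{1}{N} \sum_{j=1}^{N} \max_{i}\ \mathrm{sim}(g_i, p_j)
```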

Slide 15
Recall@10 is higher while Precision@10 is lower
→ What does this mean? → The model generates a variety of classes that cover the ground truth, but some of them do not match it

Slide 16
The first 11 blocks are more important for image recognition

Slide 17
Truncating a larger LM is better than using a smaller LM as-is

Slide 18
The proposed sampling performs comparably (and beam search performs worse for some reason)

Slide 19
The proposed attention masking contributes slightly to the results

Slide 20
With the larger training dataset, LLaMA 1 works better (??)

Slide 21
vs GPT-4V Preview (gray)

Slide 22
They say that training on CC13M (which is noisier) underperforms training on CC3M

Slide 23
Removing intermediate blocks of the LLM doesn’t affect the score much
It even works with a single block (??)