From CLIP to JinaCLIP: General Text-Image Representation Learning for Search and Multimodal RAG


About This Presentation

"CLIP (Contrastive Language-Image Pretraining) is commonly used to train models that can connect images and text by representing them as vectors in the same embedding space. These models are crucial for tasks like multimodal information retrieval, where you need to search and match across both ...


Slide Content

From CLIP to JinaCLIP
General Text-Image Representation Learning for Search and Multimodal RAG
Engineering Manager
[email protected]

What is a good embedding model?

A good embedding model should...
1. Have generally good performance across a variety of domains without fine-tuning.
2. Connect related inputs (the "left" and "right" sides of a pair), reflected in a high similarity score.
3. Not necessarily be limited to text only.

Text-Image Search: the old days
(Before OpenAI CLIP)

Connecting Modalities is HARD
1. There is actually no way to connect them directly.
2. You need to "convert" the image into text, either manually or based on some hypothesis.
3. Manual conversion is largely done by users: e.g. tagging photos on Flickr, or asking users to give a description of the image.
4. Others use "surrounding text" as a hint to connect text and images, built on the hypothesis that the surrounding text describes the image it surrounds.
5. Modern supervised approaches classify images automatically and use the classifier labels as tags, but they produce unstable results and poor out-of-distribution (OOD) performance.

OpenAI CLIP

OpenAI CLIP is a breakthrough, yet nothing new
1. The training paradigm is old: text-image alignment.
2. The loss function is mature: temperature-scaled cross-entropy loss (bidirectional).
3. The only thing that really matters: they trained CLIP on 400 million text-image pairs, and no one had done that before.

CLIP is an embedding model, and also not
1. Strong performance at connecting text and images.
2. Weak (practically unusable) capability at modeling text on its own.
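The snippet below is a minimal sketch of using CLIP as a plain embedding model via Hugging Face transformers: both modalities map into the same space, so text-image cosine similarity works well, while the same text embeddings compared against each other are much weaker, as the tables below show. The image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

texts = ["a photo of a cat sleeping on a sofa", "a dog playing in the park"]
image = Image.open("cat.jpg")  # placeholder path

with torch.no_grad():
    text_emb = model.get_text_features(
        **processor(text=texts, return_tensors="pt", padding=True)
    )
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))

# L2-normalize so dot products are cosine similarities
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

print(image_emb @ text_emb.T)     # text-image similarity: CLIP's strength
print(text_emb[0] @ text_emb[1])  # text-text similarity: the weak side
```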

model           | BIOSSES | SICK-R | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | STS22 | STSB  | AVG
OpenAI/ViT-B-16 | 67.78   | 69.08  | 72.07 | 64.44 | 55.71 | 65.37 | 72.44 | 77.23 | 53.63 | 64.40 | 66.22
JinaAI-V2-B     | 81.23   | 79.65  | 74.27 | 84.18 | 78.81 | 87.55 | 85.35 | 88.88 | 62.20 | 84.84 | 80.70
Spearman correlation based on the model's cosine similarity

model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaAI-V2-B     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95
nDCG@10

Why can CLIP not model text well?
1. During CLIP training, texts were truncated to 77 tokens.
2. Most CLIP training texts, i.e. image captions, are very short.
3. There is no guidance from hard negatives.

Two major shortcomings of CLIP: R@1 increases very slowly once the input length exceeds 20 tokens, indicating that the true effective text length of CLIP is no longer than 20 tokens.
Long-CLIP: Unlocking the Long-Text Capability of CLIP
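A quick way to see the 77-token cap is to run the CLIP tokenizer directly; this is a minimal sketch, and the long caption is made up.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")
print(tokenizer.model_max_length)  # 77 for the standard CLIP checkpoints

long_caption = " ".join(["a very detailed description of the scene"] * 30)
encoded = tokenizer(long_caption, truncation=True)
print(len(encoded["input_ids"]))   # capped at 77; everything beyond is silently dropped
```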

jina-clip-v1
1. A small model with 223M parameters.
2. Trained for 70k steps (3 stages in total).
3. Per-step text-image batch size is 32,768; the model has seen 2.29 billion pairs.
4. Per-step text-text batch size is 32,768; the model has seen 2.29 billion pairs/triplets.
5. 1,600 hours of Nvidia H100 80G compute.
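For reference, a minimal usage sketch; it assumes the model's custom code (loaded with trust_remote_code=True) exposes encode_text / encode_image helpers as described on the model card, so verify the exact method names there. The image path is a placeholder.

```python
from transformers import AutoModel

# Assumption: the jina-clip-v1 remote code provides encode_text / encode_image.
model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

text_embeddings = model.encode_text(["organic skincare products for sensitive skin"])
image_embeddings = model.encode_image(["product_photo.jpg"])  # placeholder path or URL

# Both outputs live in one space: the same model serves text-text and text-image retrieval.
```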

model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaV2-Base     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95
JinaCLIP-S1     | 28.41   | 57.50 | 36.11 | 40.24    | 25.85   | 31.65    | 40.07 | 81.55 | 49.26     | 43.40
nDCG@10

model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaV2-Base     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95
JinaCLIP-S1     | 28.41   | 57.50 | 36.11 | 40.24    | 25.85   | 31.65    | 40.07 | 81.55 | 49.26     | 43.40
JinaCLIP-S2     | 30.33   | 56.72 | 38.10 | 43.87    | 27.60   | 32.17    | 41.23 | 84.32 | 52.15     | 45.17
nDCG@10

model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaV2-Base     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95
JinaCLIP-S1     | 28.41   | 57.50 | 36.11 | 40.24    | 25.85   | 31.65    | 40.07 | 81.55 | 49.26     | 43.40
JinaCLIP-S2     | 30.33   | 56.72 | 38.10 | 43.87    | 27.60   | 32.17    | 41.23 | 84.32 | 52.15     | 45.17
JinaCLIP-S3     | 36.64   | 76.28 | 38.27 | 61.89    | 36.91   | 33.52    | 58.09 | 87.88 | 71.61     | 55.68
nDCG@10

How about overall performance?

Modality Gap in Multimodal Embeddings

Why Does the Modality Gap Exist?
1. The cone effect at weight initialisation creates the gap.
2. The temperature-scaled cross-entropy loss preserves the gap.
3. In-batch false negatives strengthen the gap.
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
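The gap can be made concrete by measuring the distance between the centroids of the two modalities, in the spirit of the Mind the Gap paper. A minimal sketch, assuming image_embs and text_embs are L2-normalized (N, d) embedding matrices from any CLIP-style model:

```python
import torch

def modality_gap(image_embs: torch.Tensor, text_embs: torch.Tensor) -> float:
    """Euclidean distance between the centroids of the image and text embeddings.

    Both inputs are (N, d) L2-normalized matrices; a value near 0 would mean the
    two modalities occupy the same region of the shared embedding space.
    """
    image_center = image_embs.mean(dim=0)
    text_center = text_embs.mean(dim=0)
    return (image_center - text_center).norm().item()
```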

[Diagram: in-batch contrastive matrix of images img1-img8 against captions caption1-caption8; the diagonal (matched) pairs are positives, all off-diagonal pairs are treated as negatives.]
Contrastive pre-training is a multi-class, bidirectional classification problem, but the logits are scaled by a temperature: logits = cos(left, right) / temperature
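A minimal PyTorch sketch of that objective (the CLIP-style bidirectional, temperature-scaled cross-entropy over in-batch pairs), assuming image_embs and text_embs are L2-normalized and row i of each matrix forms a matched pair:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_embs: torch.Tensor, text_embs: torch.Tensor,
              temperature: float = 0.02) -> torch.Tensor:
    """Temperature-scaled, bidirectional cross-entropy over in-batch pairs.

    image_embs, text_embs: (N, d) L2-normalized embeddings; pair i is the positive,
    every other item in the batch is treated as a negative.
    """
    logits = image_embs @ text_embs.T / temperature        # cosine similarity / temperature
    targets = torch.arange(image_embs.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # pick the right caption for each image
    loss_t2i = F.cross_entropy(logits.T, targets)          # pick the right image for each caption
    return (loss_i2t + loss_t2i) / 2
```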

A higher temperature value in the contrastive loss function smooths the similarity scores, reducing overconfidence and encouraging a more uniform distribution.
A Simple Framework for Contrastive Learning of Visual Representations

With a lower temperature value, the model becomes more confident, encouraging a more skewed (peaked) distribution.
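A quick numerical illustration of the temperature effect, using made-up cosine similarities for one row of the in-batch similarity matrix:

```python
import torch

sims = torch.tensor([0.9, 0.7, 0.3, 0.1])  # made-up cosine similarities

for temperature in (1.0, 0.1, 0.02):
    probs = torch.softmax(sims / temperature, dim=0)
    print(f"temperature={temperature}: {[round(p, 3) for p in probs.tolist()]}")
# temperature=1.0  -> probabilities stay close to uniform
# temperature=0.02 -> almost all mass on the highest-similarity item
```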

[Diagram: the same in-batch contrastive matrix of img1-img8 vs. caption1-caption8, with positives on the diagonal and negatives off the diagonal.]
What if caption8 is a better description of img1 than caption1 is?

We intentionally manipulate the training data
1. In Flickr8k, each image has 5 corresponding captions (we use 4).
2. The dataset is stored in sequential order.
3. During training, we wrap the data in a PyTorch Dataset and feed it into a PyTorch DataLoader, which samples each batch.
4. We disable the RandomSampler and use a SequentialSampler instead, as shown in the sketch below.
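A minimal sketch of the sampler swap, using a hypothetical Flickr8k-style dataset in which consecutive items are captions of the same image (the dataset class, dummy data, and field names are illustrative, not the actual training code):

```python
import torch
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler

class Flickr8kPairs(Dataset):
    """Hypothetical dataset: item i is (image_i, caption_i), stored so that the
    4 captions of the same image sit next to each other in sequential order."""
    def __init__(self, pairs):
        self.pairs = pairs  # list of (image_tensor, caption_str)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]

# Dummy data standing in for Flickr8k order: 4 consecutive items share one image.
pairs = [(torch.randn(3, 224, 224), f"img{i // 4}-cap{i % 4 + 1}") for i in range(64)]
dataset = Flickr8kPairs(pairs)

# Random sampling: captions of the same image rarely land in the same batch.
random_loader = DataLoader(dataset, batch_size=16, sampler=RandomSampler(dataset))

# Sequential sampling: every batch of 16 holds 4 images x 4 captions, so each pair
# sees at least 3 in-batch "negatives" that are actually positives.
sequential_loader = DataLoader(dataset, batch_size=16, sampler=SequentialSampler(dataset))
```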

[Diagram: one sequential batch in which img1 is paired with its four captions (img1-cap1 ... img1-cap4), img2 with its four captions (img2-cap1 ... img2-cap4), and so on; only the exact pair is a positive, the other captions of the same image are false negatives, and the remaining items are true negatives.]

Given a batch size of 16, this ensures that for each text-image pair there are at least 3 in-batch negatives that should really be considered positives (mismatches) and at most 12 "correct" in-batch negatives.

Setting               | Sampler    | Temperature | Batch Size | Cos Distance
w/ in-batch mismatch  | Sequential | 0.02        | 16         | 0.7840
w/o in-batch mismatch | Random     | 0.02        | 16         | 0.7521

Carefully merge the scores when using Multimodal Embeddings!
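Because of the modality gap, text-text and text-image cosine scores live on different ranges, so averaging them directly can let one modality dominate. One possible recipe (an assumption for illustration, not the method from this talk) is to normalize each modality's scores before merging:

```python
import numpy as np

def merge_scores(text_scores: np.ndarray, image_scores: np.ndarray,
                 alpha: float = 0.5) -> np.ndarray:
    """Min-max normalize each modality's scores before a weighted merge.

    One possible recipe, not the one from the talk: it only ensures both score
    lists share the [0, 1] range before they are combined.
    """
    def minmax(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    return alpha * minmax(text_scores) + (1 - alpha) * minmax(image_scores)
```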

https://jina.ai/