From CLIP to JinaCLIP: General Text-Image Representation Learning for Search and Multimodal RAG
chloewilliams62
194 views
45 slides
Sep 11, 2024
About This Presentation
"CLIP (Contrastive Language-Image Pretraining) is commonly used to train models that can connect images and text by representing them as vectors in the same embedding space. These models are crucial for tasks like multimodal information retrieval, where you need to search and match across both ...
"CLIP (Contrastive Language-Image Pretraining) is commonly used to train models that can connect images and text by representing them as vectors in the same embedding space. These models are crucial for tasks like multimodal information retrieval, where you need to search and match across both images and text.
However, when it comes to purely text-based tasks, CLIP models don’t perform as well as models that are specifically built for text. This causes inefficiencies because current systems often need to maintain separate models and embeddings for text-only and multimodal tasks, which adds complexity.
In this talk, Bo will explain the multi-task contrastive training scheme behind JinaCLIP, discuss the modality gap between different data types, and introduce JinaCLIP V2—our latest and most capable multilingual multimodal embedding model."
Size: 9.15 MB
Language: en
Added: Sep 11, 2024
Slides: 45 pages
Slide Content
From CLIP to JinaCLIP
General Text-Image Representation Learning for Search and Multimodal RAG
Engineering Manager [email protected]
What is a good embedding model?
A good embedding model should:
1. Have generally good performance across a variety of domains without fine-tuning.
2. Connect the left and right inputs (e.g. a query and its matching document, or a caption and its image), reflected in a high similarity score.
3. Not necessarily be limited to text only.
Text-Image Search: the old days
(Before OpenAI CLIP)
Connecting Modalities is HARD
1. There is no direct way to connect the two modalities.
2. You need to "convert" the image into text, either manually or based on some "hypothesis".
3. Manual conversion is largely done by users: e.g. tagging photos on Flickr, or asking users to write a description of the image.
4. Others use "surrounding text" as a hint to connect text and image, building on the hypothesis that surrounding text describes the image it surrounds.
5. Modern supervised approaches classify images automatically and use the classifier labels as tags, but they produce unstable results and poor out-of-distribution (OOD) performance.
OpenAI CLIP
OpenAI CLIP is a breakthrough, yet nothing in it is new
1. The training paradigm is old: text-image alignment.
2. The loss function is mature: a temperature-scaled cross-entropy loss (bidirectional), sketched below.
3. The only thing that matters is scale: they trained CLIP on 400 million text-image pairs, which no one had done before.
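For concreteness, here is a minimal sketch of the bidirectional, temperature-scaled cross-entropy (InfoNCE) objective mentioned in point 2; the function name, batch shapes, and temperature value are illustrative, not CLIP's exact training configuration.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(text_emb: torch.Tensor, image_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    # L2-normalise so the dot product equals cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # (batch, batch) similarity matrix: entry [i, j] compares text i with image j.
    logits = text_emb @ image_emb.T / temperature

    # The matching pair sits on the diagonal: text i belongs to image i.
    targets = torch.arange(text_emb.size(0), device=text_emb.device)

    # Cross-entropy in both directions: text-to-image and image-to-text.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.T, targets)
    return (loss_t2i + loss_i2t) / 2
```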
CLIP is an embedding model, and also not quite
1. Strong performance in connecting text and images.
2. Weak (practically unusable) capability for modeling text on its own.
model           | BIOSSES | SICK-R | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | STS22 | STSB  | AVG
OpenAI/ViT-B-16 | 67.78   | 69.08  | 72.07 | 64.44 | 55.71 | 65.37 | 72.44 | 77.23 | 53.63 | 64.40 | 66.22
JinaAI-V2-B     | 81.23   | 79.65  | 74.27 | 84.18 | 78.81 | 87.55 | 85.35 | 88.88 | 62.20 | 84.84 | 80.70

Spearman correlation based on the model's cosine similarity.
model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaAI-V2-B     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95

nDCG@10
Why can't CLIP model text well?
1. During CLIP training, the text is truncated to 77 tokens only.
2. Most CLIP texts, i.e. image captions, are very short.
3. There is no guidance from hard negatives.
Two major shortcomings of CLIP: R@1 increases very slowly once the input length exceeds 20 tokens, indicating that the true effective length of CLIP is even shorter than 20 tokens.
Long-CLIP: Unlocking the Long-Text Capability of CLIP
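To see the truncation concretely, here is a small sketch using the Hugging Face CLIP tokenizer; the checkpoint name and caption are just illustrative assumptions, but the 77-position context window matches CLIP's text tower.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")

# A deliberately long caption, repeated so it clearly exceeds the context window.
long_caption = " ".join(["a photo of a cat sitting on a windowsill"] * 20)

# CLIP's text tower accepts at most 77 token positions (including special tokens);
# with truncation enabled, everything beyond that is silently dropped.
encoded = tokenizer(long_caption, truncation=True, max_length=77)
print(len(encoded["input_ids"]))  # 77, no matter how long the caption is
```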
jina-clip-v1
1. A small model with 223M parameters.
2. Trained for 70k steps (3 stages in total).
3. Per-step text-image batch size of 32,768; the model has seen 2.29 billion pairs.
4. Per-step text-text batch size of 32,768; the model has seen 2.29 billion pairs/triplets.
5. 1,600 hours of Nvidia H100 80G compute.
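A hedged usage sketch of jina-clip-v1 via Hugging Face transformers, following the pattern shown on the model card at the time of writing; the image URL is a placeholder and the exact API may change.

```python
import numpy as np
from transformers import AutoModel

# trust_remote_code is required because the encoding logic ships with the model repo.
model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

text_embs = np.asarray(model.encode_text(["a photo of a golden retriever"]))
image_embs = np.asarray(model.encode_image(["https://example.com/dog.jpg"]))  # URL or local path

# Text and image embeddings share one space, so cross-modal cosine similarity is meaningful.
t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
i = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
print((t @ i.T)[0, 0])
```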
model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaV2-Base     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95
JinaCLIP-S1     | 28.41   | 57.50 | 36.11 | 40.24    | 25.85   | 31.65    | 40.07 | 81.55 | 49.26     | 43.40

nDCG@10
model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaV2-Base     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95
JinaCLIP-S1     | 28.41   | 57.50 | 36.11 | 40.24    | 25.85   | 31.65    | 40.07 | 81.55 | 49.26     | 43.40
JinaCLIP-S2     | 30.33   | 56.72 | 38.10 | 43.87    | 27.60   | 32.17    | 41.23 | 84.32 | 52.15     | 45.17

nDCG@10
model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaV2-Base     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95
JinaCLIP-S1     | 28.41   | 57.50 | 36.11 | 40.24    | 25.85   | 31.65    | 40.07 | 81.55 | 49.26     | 43.40
JinaCLIP-S2     | 30.33   | 56.72 | 38.10 | 43.87    | 27.60   | 32.17    | 41.23 | 84.32 | 52.15     | 45.17
JinaCLIP-S3     | 36.64   | 76.28 | 38.27 | 61.89    | 36.91   | 33.52    | 58.09 | 87.88 | 71.61     | 55.68

nDCG@10
How about overall performance?
Modality Gap in Multimodal Embeddings
Why does the modality gap exist?
1. The cone effect during weight initialisation creates the gap.
2. The temperature-scaled cross-entropy loss preserves the gap.
3. In-batch false negatives strengthen the gap.
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
[Diagram: an 8x8 text-image similarity matrix over img1 ... img8 and caption1 ... caption8; the diagonal (imgN, captionN) cells are positives, all other cells are in-batch negatives.]
This is a multi-class, bidirectional classification problem, but the logits are scaled by a temperature: logits = cos(left, right) / temperature.
A higher temperature value in the contrastive loss function helps smooth the similarity scores, reducing overconfidence and encouraging more uniform distributions.
(A Simple Framework for Contrastive Learning of Visual Representations)
With a lower temperature, the model becomes more confident, encouraging a more skewed distribution.
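A small numeric sketch of this effect, using made-up cosine scores for one row of the similarity matrix; the values and temperatures are illustrative only.

```python
import torch
import torch.nn.functional as F

cosine_scores = torch.tensor([0.60, 0.55, 0.20, 0.10])  # one anchor vs. 4 candidates

for temperature in (1.0, 0.07, 0.01):
    probs = F.softmax(cosine_scores / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")

# T=1.0  -> nearly uniform: smoothed, less confident scores.
# T=0.01 -> almost all mass on the top candidate: a sharply skewed distribution.
```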
[Diagram: the same 8x8 text-image similarity matrix, repeated to highlight a potential in-batch false negative.]
What if caption8 is a better description of img1 than caption1?
We intentionally manipulate the training data
1. In Flickr8k, each image has 5 corresponding captions (we use 4).
2. The dataset is stored in sequential order.
3. During training, we load the data with a PyTorch Dataset and feed it into a PyTorch DataLoader, which normally samples each batch at random.
4. We disable the RandomSampler and use the SequentialSampler instead (see the sketch below).
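A minimal sketch of the sampler swap, with a stand-in dataset class; Flickr8kPairs here is hypothetical scaffolding, not the actual training code.

```python
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

class Flickr8kPairs(Dataset):
    """Stand-in for the Flickr8k pair data: 4 captions per image, stored sequentially."""
    def __init__(self, num_images: int = 100, captions_per_image: int = 4):
        self.pairs = [(f"img{i}", f"img{i}-cap{c}")
                      for i in range(1, num_images + 1)
                      for c in range(1, captions_per_image + 1)]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]

dataset = Flickr8kPairs()

# Default setup: random sampling, so captions of the same image rarely share a batch.
random_loader = DataLoader(dataset, batch_size=16, sampler=RandomSampler(dataset))

# Manipulated setup: sequential order, so each batch of 16 holds 4 images x 4 captions
# and is guaranteed to contain in-batch false negatives.
sequential_loader = DataLoader(dataset, batch_size=16, sampler=SequentialSampler(dataset))

images, captions = next(iter(sequential_loader))
print(images)  # img1 x4, img2 x4, img3 x4, img4 x4
```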
[Diagram: a sequential batch where img1 is paired with img1-cap1 ... img1-cap4 and img2 with img2-cap1 ... img2-cap4; aligned pairs are positives, other captions of the same image are false negatives, the rest are true negatives.]
Given a batch size of 16, this ensures that for each text-image pair there are at least 3 in-batch negatives that should actually be considered positives (mismatches) and at most 12 "correct" in-batch negatives (see the check below).
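A quick check of that arithmetic, assuming the sequential batch layout described above (4 images with 4 captions each).

```python
import torch

batch_size, captions_per_image = 16, 4
image_ids = torch.arange(batch_size // captions_per_image).repeat_interleave(captions_per_image)

same_image = image_ids[:, None] == image_ids[None, :]  # (16, 16): pairs sharing an image
positives = torch.eye(batch_size, dtype=torch.bool)    # the aligned text-image pair
false_negatives = same_image & ~positives              # other captions of the same image
true_negatives = ~same_image

print(false_negatives.sum(dim=1))  # 3 per anchor: negatives that are really positives
print(true_negatives.sum(dim=1))   # 12 per anchor: "correct" in-batch negatives
```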
Setting               | Sampler    | Temperature | Batch Size | Cos Distance
w/ in-batch mismatch  | Sequential | 0.02        | 16         | 0.7840
w/o in-batch mismatch | Random     | 0.02        | 16         | 0.7521
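One straightforward way to compute a number like the Cos Distance column above is the cosine distance between the centroids of the two modalities' embeddings. The sketch below assumes you already have image and text embedding matrices; the random inputs only demonstrate the call shape, and the table values come from the authors' runs, not this snippet.

```python
import numpy as np

def modality_gap(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Cosine distance between the image-embedding centroid and the text-embedding centroid."""
    img_center = image_embs.mean(axis=0)
    txt_center = text_embs.mean(axis=0)
    img_center /= np.linalg.norm(img_center)
    txt_center /= np.linalg.norm(txt_center)
    return float(1.0 - img_center @ txt_center)

# Shape check with random data (a real run would pass embeddings from the trained model).
rng = np.random.default_rng(0)
print(modality_gap(rng.normal(size=(1000, 512)), rng.normal(size=(1000, 512))))
```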
Carefully merge the scores when using Multimodal Embeddings!