From CLIP to JinaCLIP: General Text-Image Representation Learning for Search and Multimodal RAG
chloewilliams62
194 views
45 slides
Sep 11, 2024
About This Presentation
"CLIP (Contrastive Language-Image Pretraining) is commonly used to train models that can connect images and text by representing them as vectors in the same embedding space. These models are crucial for tasks like multimodal information retrieval, where you need to search and match across both ...
"CLIP (Contrastive Language-Image Pretraining) is commonly used to train models that can connect images and text by representing them as vectors in the same embedding space. These models are crucial for tasks like multimodal information retrieval, where you need to search and match across both images and text.
However, when it comes to purely text-based tasks, CLIP models don’t perform as well as models that are specifically built for text. This causes inefficiencies because current systems often need to maintain separate models and embeddings for text-only and multimodal tasks, which adds complexity.
In this talk, Bo will explain the multi-task contrastive training scheme behind JinaCLIP, discuss the modality gap between different data types, and introduce JinaCLIP V2—our latest and most capable multilingual multimodal embedding model."
Size: 9.15 MB
Language: en
Added: Sep 11, 2024
Slides: 45 pages
Slide Content
From CLIP to JinaCLIP
General Text-Image Representation Learning for Search and Multimodal RAG
Engineering Manager [email protected]
What is a good embedding model?
A good embedding model should:
1. Have generally good performance across a variety of domains without fine-tuning.
2. Connect the left and right inputs (e.g. a query and its matching document, or a caption and its image), reflected in a high similarity score.
3. Not necessarily be limited to text only.
Text-Image Search: the old days
(Before OpenAI CLIP)
Connecting Modalities is HARD
1. There is no direct way to connect the two modalities.
2. You need to "convert" the image into text, either manually or based on some "hypothesis".
3. Manual conversion is largely done by users: e.g. tagging photos on Flickr, or asking users to write a description of the image.
4. Others use "surrounding text" as a hint to connect text and image, building on the hypothesis that surrounding text describes the image it surrounds.
5. Modern supervised approaches classify images automatically and use the classifier labels as tags, but they produce unstable results and poor out-of-distribution (OOD) performance.
OpenAI CLIP
OpenAI CLIP is a breakthrough, yet nothing in it is new
1. The training paradigm is old: text-image alignment.
2. The loss function is mature: a temperature-scaled cross-entropy loss (bidirectional), sketched below.
3. The only thing that matters is scale: they trained CLIP on 400 million text-image pairs, which no one had done before.
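For concreteness, here is a minimal sketch of the bidirectional, temperature-scaled cross-entropy (InfoNCE) objective mentioned in point 2; the function name, batch shapes, and temperature value are illustrative, not CLIP's exact training configuration.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(text_emb: torch.Tensor, image_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    # L2-normalise so the dot product equals cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # (batch, batch) similarity matrix: entry [i, j] compares text i with image j.
    logits = text_emb @ image_emb.T / temperature

    # The matching pair sits on the diagonal: text i belongs to image i.
    targets = torch.arange(text_emb.size(0), device=text_emb.device)

    # Cross-entropy in both directions: text-to-image and image-to-text.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.T, targets)
    return (loss_t2i + loss_i2t) / 2
```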
CLIP is an embedding model, and also not quite
1. Strong performance in connecting text and images.
2. Weak (practically unusable) capability for modeling text on its own.
model           | BIOSSES | SICK-R | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | STS22 | STSB  | AVG
OpenAI/ViT-B-16 | 67.78   | 69.08  | 72.07 | 64.44 | 55.71 | 65.37 | 72.44 | 77.23 | 53.63 | 64.40 | 66.22
JinaAI-V2-B     | 81.23   | 79.65  | 74.27 | 84.18 | 78.81 | 87.55 | 85.35 | 88.88 | 62.20 | 84.84 | 80.70

Spearman correlation based on the model's cosine similarity.
model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaAI-V2-B     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95

nDCG@10
Why can't CLIP model text well?
1. During CLIP training, the text is truncated to 77 tokens only.
2. Most CLIP texts, i.e. image captions, are very short.
3. There is no guidance from hard negatives.
Two major shortcomings of CLIP: R@1 increases very slowly once the input length exceeds 20 tokens, indicating that the true effective length of CLIP is even shorter than 20 tokens.
Long-CLIP: Unlocking the Long-Text Capability of CLIP
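To see the truncation concretely, here is a small sketch using the Hugging Face CLIP tokenizer; the checkpoint name and caption are just illustrative assumptions, but the 77-position context window matches CLIP's text tower.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")

# A deliberately long caption, repeated so it clearly exceeds the context window.
long_caption = " ".join(["a photo of a cat sitting on a windowsill"] * 20)

# CLIP's text tower accepts at most 77 token positions (including special tokens);
# with truncation enabled, everything beyond that is silently dropped.
encoded = tokenizer(long_caption, truncation=True, max_length=77)
print(len(encoded["input_ids"]))  # 77, no matter how long the caption is
```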
jina-clip-v1
1. A small model with 223M parameters.
2. Trained for 70k steps (3 stages in total).
3. Per-step text-image batch size of 32,768; the model has seen 2.29 billion pairs.
4. Per-step text-text batch size of 32,768; the model has seen 2.29 billion pairs/triplets.
5. 1,600 hours of Nvidia H100 80G compute.
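A hedged usage sketch of jina-clip-v1 via Hugging Face transformers, following the pattern shown on the model card at the time of writing; the image URL is a placeholder and the exact API may change.

```python
import numpy as np
from transformers import AutoModel

# trust_remote_code is required because the encoding logic ships with the model repo.
model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

text_embs = np.asarray(model.encode_text(["a photo of a golden retriever"]))
image_embs = np.asarray(model.encode_image(["https://example.com/dog.jpg"]))  # URL or local path

# Text and image embeddings share one space, so cross-modal cosine similarity is meaningful.
t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
i = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
print((t @ i.T)[0, 0])
```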
model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaV2-Base     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95
JinaCLIP-S1     | 28.41   | 57.50 | 36.11 | 40.24    | 25.85   | 31.65    | 40.07 | 81.55 | 49.26     | 43.40

nDCG@10
model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaV2-Base     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95
JinaCLIP-S1     | 28.41   | 57.50 | 36.11 | 40.24    | 25.85   | 31.65    | 40.07 | 81.55 | 49.26     | 43.40
JinaCLIP-S2     | 30.33   | 56.72 | 38.10 | 43.87    | 27.60   | 32.17    | 41.23 | 84.32 | 52.15     | 45.17

nDCG@10
model           | DBPedia | FEVER | FiQA  | HotpotQA | MSMARCO | NFCorpus | NQ    | Quora | TRECCOVID | AVG
OpenAI/ViT-B-16 | 14.94   | 33.45 | 5.78  | 9.30     | 9.36    | 16.44    | 5.28  | 76.63 | 22.60     | 21.53
JinaV2-Base     | 35.05   | 72.33 | 41.58 | 61.38    | 40.92   | 32.45    | 60.04 | 88.20 | 71.60     | 55.95
JinaCLIP-S1     | 28.41   | 57.50 | 36.11 | 40.24    | 25.85   | 31.65    | 40.07 | 81.55 | 49.26     | 43.40
JinaCLIP-S2     | 30.33   | 56.72 | 38.10 | 43.87    | 27.60   | 32.17    | 41.23 | 84.32 | 52.15     | 45.17
JinaCLIP-S3     | 36.64   | 76.28 | 38.27 | 61.89    | 36.91   | 33.52    | 58.09 | 87.88 | 71.61     | 55.68

nDCG@10
How about overall performance?
Modality Gap in Multimodal Embeddings
Why does the modality gap exist?
1. The cone effect during weight initialisation creates the gap.
2. The temperature-scaled cross-entropy loss preserves the gap.
3. In-batch false negatives strengthen the gap.
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
[Diagram: an 8x8 text-image similarity matrix over img1 ... img8 and caption1 ... caption8; the diagonal (imgN, captionN) cells are positives, all other cells are in-batch negatives.]
This is a multi-class, bidirectional classification problem, but the logits are scaled by a temperature: logits = cos(left, right) / temperature.
A higher temperature value in the contrastive loss function helps smooth the similarity scores, reducing overconfidence and encouraging more uniform distributions.
(A Simple Framework for Contrastive Learning of Visual Representations)
With a lower temperature, the model becomes more confident, encouraging a more skewed distribution.
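A small numeric sketch of this effect, using made-up cosine scores for one row of the similarity matrix; the values and temperatures are illustrative only.

```python
import torch
import torch.nn.functional as F

cosine_scores = torch.tensor([0.60, 0.55, 0.20, 0.10])  # one anchor vs. 4 candidates

for temperature in (1.0, 0.07, 0.01):
    probs = F.softmax(cosine_scores / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")

# T=1.0  -> nearly uniform: smoothed, less confident scores.
# T=0.01 -> almost all mass on the top candidate: a sharply skewed distribution.
```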
[Diagram: the same 8x8 text-image similarity matrix, repeated to highlight a potential in-batch false negative.]
What if caption8 is a better description of img1 than caption1?
We intentionally manipulate the training data
1. In Flickr8k, each image has 5 corresponding captions (we use 4).
2. The dataset is stored in sequential order.
3. During training, we load the data with a PyTorch Dataset and feed it into a PyTorch DataLoader, which normally samples each batch at random.
4. We disable the RandomSampler and use the SequentialSampler instead (see the sketch below).
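A minimal sketch of the sampler swap, with a stand-in dataset class; Flickr8kPairs here is hypothetical scaffolding, not the actual training code.

```python
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

class Flickr8kPairs(Dataset):
    """Stand-in for the Flickr8k pair data: 4 captions per image, stored sequentially."""
    def __init__(self, num_images: int = 100, captions_per_image: int = 4):
        self.pairs = [(f"img{i}", f"img{i}-cap{c}")
                      for i in range(1, num_images + 1)
                      for c in range(1, captions_per_image + 1)]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]

dataset = Flickr8kPairs()

# Default setup: random sampling, so captions of the same image rarely share a batch.
random_loader = DataLoader(dataset, batch_size=16, sampler=RandomSampler(dataset))

# Manipulated setup: sequential order, so each batch of 16 holds 4 images x 4 captions
# and is guaranteed to contain in-batch false negatives.
sequential_loader = DataLoader(dataset, batch_size=16, sampler=SequentialSampler(dataset))

images, captions = next(iter(sequential_loader))
print(images)  # img1 x4, img2 x4, img3 x4, img4 x4
```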
[Diagram: a sequential batch where img1 is paired with img1-cap1 ... img1-cap4 and img2 with img2-cap1 ... img2-cap4; aligned pairs are positives, other captions of the same image are false negatives, the rest are true negatives.]
Given a batch size of 16, this ensures that for each text-image pair there are at least 3 in-batch negatives that should actually be considered positives (mismatches) and at most 12 "correct" in-batch negatives (see the check below).
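A quick check of that arithmetic, assuming the sequential batch layout described above (4 images with 4 captions each).

```python
import torch

batch_size, captions_per_image = 16, 4
image_ids = torch.arange(batch_size // captions_per_image).repeat_interleave(captions_per_image)

same_image = image_ids[:, None] == image_ids[None, :]  # (16, 16): pairs sharing an image
positives = torch.eye(batch_size, dtype=torch.bool)    # the aligned text-image pair
false_negatives = same_image & ~positives              # other captions of the same image
true_negatives = ~same_image

print(false_negatives.sum(dim=1))  # 3 per anchor: negatives that are really positives
print(true_negatives.sum(dim=1))   # 12 per anchor: "correct" in-batch negatives
```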
Setting               | Sampler    | Temperature | Batch Size | Cos Distance
w/ in-batch mismatch  | Sequential | 0.02        | 16         | 0.7840
w/o in-batch mismatch | Random     | 0.02        | 16         | 0.7521
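One straightforward way to compute a number like the Cos Distance column above is the cosine distance between the centroids of the two modalities' embeddings. The sketch below assumes you already have image and text embedding matrices; the random inputs only demonstrate the call shape, and the table values come from the authors' runs, not this snippet.

```python
import numpy as np

def modality_gap(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Cosine distance between the image-embedding centroid and the text-embedding centroid."""
    img_center = image_embs.mean(axis=0)
    txt_center = text_embs.mean(axis=0)
    img_center /= np.linalg.norm(img_center)
    txt_center /= np.linalg.norm(txt_center)
    return float(1.0 - img_center @ txt_center)

# Shape check with random data (a real run would pass embeddings from the trained model).
rng = np.random.default_rng(0)
print(modality_gap(rng.normal(size=(1000, 512)), rng.normal(size=(1000, 512))))
```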
Carefully merge the scores when using Multimodal Embeddings!