A recent and exciting development in the world of Generative AI has been the use of language to understand images, video, and sound. One example is multi-modal retrieval, which is the process of using one modality, like text, to search another modality, like images. It is not only useful for search engines across media types, but also for grounding LLMs in factual data and reducing hallucinations. In this talk, I explain how to build a simple but performant multi-modal retrieval pipeline using completely open-source tools and models: the vector database Milvus and HuggingFace libraries for modeling and data. I discuss techniques to use multimodal retrieval most effectively and increase recall, as well as some interesting and diverse industry applications.
Getting Started with Vector Databases
zilliz.com/cloud
Searching the Web with Gen AI
[Slide: example web searches such as "Apple", "Rising dough", or "Change car tire". For the query "Rising dough", the semantically related result "Proofing" is marked ✔ while the keyword-level match "Bread" is marked ❌]
Why is Semantic Search Difficult?
Why is Semantic Search Important?
90% of newly generated data in 2025 will be unstructured; the other 10% is structured.
Data source: The Digitization of the World by IDC
Solution: Deep Learning
Similarity Search
Vector Embedding
Vector Space
Embedding Models
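To make these terms concrete, here is a minimal sketch of embedding-based similarity search using the open-source sentence-transformers library; the model name and example texts are illustrative choices, not ones taken from the talk.

```python
# Minimal sketch: an embedding model maps text into a vector space where
# semantic similarity becomes geometric closeness.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative public model

docs = [
    "How to proof bread dough",
    "Apple quarterly earnings report",
    "Replacing a flat car tire",
]
query = "rising dough"

doc_vecs = model.encode(docs)    # one embedding vector per document
query_vec = model.encode(query)  # embedding for the search query

# Cosine similarity ranks the documents; the proofing article should win
# even though it shares no keywords with the query.
scores = util.cos_sim(query_vec, doc_vecs)[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
```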
New Challenge: Search in Vector Spaces
How to Index and Search?
● High-dimensional: > 1000 dims
How to Scale?
● 10-100 million vectors? Billions? Trillions?
● Billions of users?
Multiple Data Types?
● Text
● Images
● Audio
● Graphs
● …
Milvus is an open-source vector database to store, index, manage, and use the massive number of embedding vectors generated by deep neural networks and LLMs.
400+ contributors | 30K+ stars | 66M+ docker pulls | 2.7K+ forks
Milvus: High-performance, scalable vector database
Common AI Use Cases
● Retrieval Augmented Generation (RAG): expand LLMs' knowledge by incorporating external data sources into LLMs and your AI applications
● Recommender System (RecSys): match user behavior or content features with other similar ones to make effective recommendations
● Text/Semantic Similarity Search: search for semantically similar texts across vast amounts of natural language documents
● Molecular Similarity Search: search for similar substructures, superstructures, and other structures for a specific molecule
● Fraud & Anomaly Detection: detect data points, events, and observations that deviate significantly from the usual pattern
● Multimodal Similarity Search: search over multiple types of data simultaneously, e.g. text, audio, images, video
Deployment Options
Milvus Lite
● Locally hosted
● Suitable for prototyping and demos (see the sketch below)
● 10s of millions of vectors
Milvus Standalone
● Single remote/local server
● "Medium" scale
● Simplified setup, maintenance, etc. compared to cluster
● 100s of millions of vectors
Milvus Cluster
● Distributed system
● Many different types of nodes
● 100s of billions of vectors
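As a starting point, here is a minimal sketch of the prototyping-scale option, Milvus Lite, via the pymilvus client; the collection name, dimension, and random vectors are placeholders. Passing a server URI to MilvusClient instead of a local file path targets a Standalone or Cluster deployment.

```python
# Minimal Milvus Lite sketch (pip install pymilvus): a local file path gives
# the embedded, prototyping-scale deployment with no server to run.
import random
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")  # local file = Milvus Lite
client.create_collection(collection_name="demo", dimension=768)

# Insert toy vectors; in practice these come from an embedding model.
rows = [{"id": i, "vector": [random.random() for _ in range(768)]}
        for i in range(1000)]
client.insert(collection_name="demo", data=rows)

# Approximate nearest-neighbor search for the 5 closest vectors.
query = [random.random() for _ in range(768)]
print(client.search(collection_name="demo", data=[query], limit=5))
```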
Why Open-Source?
Cost-effective · Innovation · Community
Why Not Traditional Databases?
Suboptimal Indexing / Search · Scaling · Inadequate Query & Analytics Support
Benchmarks
● 3-20x faster compared with open-source Milvus
● At least 6x faster than other vector databases
https://github.com/zilliztech/VectorDBBench
Multi-Modal Embeddings
[Figure: a text encoder and an image encoder map captions ("the lion sleeps", "a lion roars", "a dog is walked") and their paired images into a shared vector space; ✕ marks the matching (image, text) cells in the similarity matrix]
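The figure above describes exactly the setup of CLIP-style models. Here is a sketch using an off-the-shelf CLIP checkpoint from the HuggingFace transformers library; the checkpoint name and local image path are illustrative.

```python
# Sketch: score the three captions against one image with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["the lion sleeps", "a lion roars", "a dog is walked"]
image = Image.open("lion.jpg")  # placeholder local image

# Both modalities are embedded into the same space; logits_per_image holds
# one row of the image-text similarity matrix from the slide.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))
```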
Why Multi-Modal?
[Diagram: multi-modal retrieval sits at the intersection of Retrieval / Similarity Search (backed by a vector database) and Foundation Models (a large vision-language model), combined in RAG. Example prompts: "photos and recordings of Iberian lynx"; "what is the user clicking on?" answered with "from the screenshot provided, I see the user is…"; "produce a graph of revenue from 2021-2023"]
How Does it Work?
• Dataset of e.g., (image, text) pairs
• Typically mined from the web, e.g., (<img src>, <img alt>) pairs
• Pre-processing is required for good performance
• Train the encoders so that embeddings for matched (image, text) pairs are close
• Encoders are either initialized from pre-trained models or trained from scratch
• Simultaneously, push apart the embeddings for mismatched (image, random text) pairs
• This is called contrastive learning; a minimal sketch follows
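Here is the sketch referenced above: a generic, CLIP-style symmetric contrastive (InfoNCE) objective in PyTorch. It shows the standard formulation, not any particular model's exact training code; the temperature value is a common default, not a prescribed one.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) embeddings of paired (image, text)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Pairwise similarity matrix: the diagonal holds true (image, text) pairs,
    # off-diagonal cells are the (image, random text) negatives.
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy pulls matched pairs together and pushes mismatches apart,
    # in both the image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```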
MagicLens: Task
[Figure 1; DeepMind, 2024]
Input:
• a query image, and
• an instruction that specifies a semantic relation
Output:
• matching image(s) from our database (a retrieval sketch follows)
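Here is the retrieval sketch referenced above: one way to serve this task from a vector database. Since MagicLens itself is not invoked here, the composed query embedding uses a simple CLIP feature-addition baseline (summing the image and text embeddings), a common stand-in for composed image retrieval; the collection name, image path, and instruction are placeholders.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from pymilvus import MilvusClient

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def compose_embed(image_path: str, instruction: str) -> list[float]:
    """Baseline composed embedding: normalized sum of CLIP image and text
    features. A stand-in for a trained composed-retrieval model."""
    inputs = processor(text=[instruction], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return F.normalize(img + txt, dim=-1)[0].tolist()

# Assumes an "images" collection of 512-dim CLIP image embeddings.
client = MilvusClient("milvus_demo.db")
query = compose_embed("golden-gate.jpg", "same bridge at night")  # placeholders
print(client.search(collection_name="images", data=[query], limit=5))
```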
MagicLens: Data
[Figure 2; DeepMind, 2024]
Steps:
• Find image pairs on the same webpages
• With PaLI, use the images + alt text to generate a description of each image and of the relationship between them
• Filter the pairs for sensitive content and for similarity
• With PaLM 2, form an instruction that relates the query to the target
MagicLens: Model
[Figure 4; DeepMind, 2024]
1⃣ initialized to pre-trained models
2⃣ additional layers
3⃣ query image + instruction text
4⃣ target image + empty text
MagicLens: Training
Minimize the contrastive loss, which for one element $i$ of a minibatch $B$ is:

$$\ell_i = -\log \frac{\exp\big(\mathrm{sim}(q_i,\, t_i)/\tau\big)}{\sum_{j \in B} \exp\big(\mathrm{sim}(q_i,\, t_j)/\tau\big) + \exp\big(\mathrm{sim}(q_i,\, \bar{q}_i)/\tau\big)}$$

where $q_i$ embeds (query image, instruction), $t_j$ embeds (target image, ""), $\bar{q}_i$ embeds (query image, ""), and $\tau$ is a temperature.
1⃣ numerator: similarity between (query, instruction) and (target, "")
2⃣ the loss is summed over the minibatch
3⃣ denominator: similarities between (query, instruction) and each (non-target, "")
4⃣ plus the similarity between (query, instruction) and (query, ""), so the model cannot succeed by simply echoing the query image
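The same loss as a PyTorch sketch; the tensor shapes, the L2-normalization assumption, and the temperature value are my additions, not from the slide.

```python
import torch
import torch.nn.functional as F

def magiclens_style_loss(q: torch.Tensor, t: torch.Tensor, q0: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """q:  (batch, dim) embeddings of (query image, instruction)
    t:  (batch, dim) embeddings of (target image, "")
    q0: (batch, dim) embeddings of (query image, "")
    All assumed L2-normalized."""
    logits_t = q @ t.t() / temperature                       # 1⃣/3⃣: targets and non-targets
    logits_q = (q * q0).sum(-1, keepdim=True) / temperature  # 4⃣: the query itself as a negative
    logits = torch.cat([logits_t, logits_q], dim=1)

    # The correct class for row i is its own target t_i; the extra column
    # penalizes simply echoing the query image.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)                   # 2⃣: averaged over the minibatch
```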