How Vector Databases are Revolutionizing Unstructured Data Search in AI Applications

chloewilliams62 136 views 35 slides Jul 10, 2024

About This Presentation

"Powered by the popularity of ChatGPT, Llama2, and other LLMs, we've seen a huge surge in interest for vector databases in 2023 and 2024. Vector databases are commonly used to connect relevant documents with LLMs, through a process called retrieval augmented generation (RAG). RAG has seen w...


Slide Content

Frank Liu

Vector Databases and
Unstructured Data Search

2 | © Copyright Zilliz2
Speaker
Frank Liu
Head of AI & ML
https://www.linkedin.com/in/fzliu
https://www.twitter.com/frankzliu

Milvus: The most widely adopted vector database
Milvus is an open-source vector database to store, index, manage, and use the massive number of embedding vectors generated by deep neural networks and LLMs.
Contributors: 267+
Stars: 27K+
Docker pulls: 11M+
Forks: 2K+

The GenAI Landscape

It’s a jungle out there

Three Pillars of GenAI Revisited
Models
Computation
Data

Opportunities in Unstructured Data
Vector Database
Data Pipeline
Data ETL
Data Observability
Data Security
Data Encryption
Data Compliance

Unstructured Data Meetups
Our Mission: Foster innovations and collaborations in unstructured data

Vector Embedding Primer

Vector embeddings are something computers can understand

Vectors unlock unstructured data…
Knowledge Base (Documents) → Embedding Models → Vectors → Vector Databases
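
The pipeline on this slide (documents, embedding model, vectors, vector database) can be sketched in a few lines. This is a minimal illustration, not how Milvus works internally: `embed` is a toy hashing function standing in for a real embedding model, and `top_k` is a brute-force scan standing in for a vector database index. Both names are hypothetical.

```python
import math
import zlib

def embed(text, dim=64):
    """Toy stand-in for an embedding model (an assumption, not a real model):
    hashes each word into one of `dim` buckets, then L2-normalizes."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def top_k(query_vec, doc_vecs, k=2):
    """Brute-force search: score every stored vector by dot product
    (equal to cosine similarity for unit vectors) and keep the best k."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]

documents = [
    "milvus is an open source vector database",
    "embedding models turn documents into vectors",
    "the musician played improvised electronic music",
]
doc_vecs = [embed(d) for d in documents]  # the "knowledge base" as vectors

hits = top_k(embed("open source vector database"), doc_vecs, k=2)
for index, score in hits:
    print(f"{score:.3f}  {documents[index]}")
```

A real deployment would replace `embed` with a trained model and `top_k` with an approximate nearest-neighbor index, since a linear scan does not scale to very large collections.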

…and power a variety of different use cases
•Retrieval Augmented Generation (RAG): Expand LLMs' knowledge by incorporating external data sources into LLMs and your AI applications.
•Recommender System: Match user behavior or content features with other similar ones to make effective recommendations.
•Text/Semantic Search: Search for semantically similar texts across vast amounts of natural language documents.
•Image Similarity Search: Identify and search for visually similar images or objects from a vast collection of image libraries.
•Video Similarity Search: Search for similar videos, scenes, or objects from extensive collections of video libraries.
•Audio Similarity Search: Find similar audio clips in large datasets for tasks like genre classification or speech recognition.
•Molecular Similarity Search: Search for similar substructures, superstructures, and other structures for a specific molecule.
•Anomaly Detection: Detect data points, events, and observations that deviate significantly from the usual pattern.
•Multimodal Similarity Search: Search over multiple types of data simultaneously, e.g. text and images.

History of Embeddings

Embedding models: workhorses of AI apps

Back in the day…

Back in the day…
“Handcrafted features”

Modern day embeddings
Source: Arize Phoenix

Embedding Models

Recurrent neural networks
Source: CS230 notes

Convolutional networks
Source: CS230 notes

Transformer encoder
Source: Illustrated Transformer

Transformer encoder
Source: Illustrated Transformer
One embedding per token!
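
Because the encoder emits one vector per token, some pooling step is needed to get a single vector for a whole text. The slides do not say which pooling is used; the sketch below assumes mean pooling over non-padding tokens, one common choice. The function name `mean_pool` and the tiny 2-dimensional vectors are illustrative only.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j, x in enumerate(vec):
                total[j] += x
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [x / count for x in total]

# Three token vectors; the last one is padding and is masked out.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_embeddings, attention_mask)
print(pooled)  # [2.0, 1.0]
```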

BERT
Source: BERT paper

BERT
Source: BERT paper
Cross-encoder (one inference per query/document pair)

Sentence BERT
Source: SBERT paper

Sentence BERT
Source: SBERT paper
Bi-encoder
1 embedding per text
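
The practical difference between the cross-encoder and the bi-encoder can be seen by counting model calls. In the sketch below, `toy_cross_score` and `toy_encode` are hypothetical stand-ins for the two kinds of model (simple word matching, not neural networks); only the call counts matter.

```python
calls = {"cross": 0, "bi": 0}
VOCAB = ["vector", "database", "music", "search"]

def toy_cross_score(query, doc):
    """Hypothetical cross-encoder: one forward pass per (query, doc) pair."""
    calls["cross"] += 1
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def toy_encode(text):
    """Hypothetical bi-encoder: one forward pass per text, one vector out."""
    calls["bi"] += 1
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

docs = ["vector database search", "electronic music", "vector search"]
queries = ["vector database", "music"]

# Cross-encoder: every query must be run together with every document.
cross_scores = [[toy_cross_score(q, d) for d in docs] for q in queries]

# Bi-encoder: documents are embedded once; each query adds one more pass,
# and scoring is a cheap dot product between stored vectors.
doc_vecs = [toy_encode(d) for d in docs]
bi_scores = []
for q in queries:
    q_vec = toy_encode(q)
    bi_scores.append([sum(a * b for a, b in zip(q_vec, d)) for d in doc_vecs])

print(calls)  # cross grows as queries * docs, bi as queries + docs
```

With one embedding per text, document vectors can be computed ahead of time and stored, which is what makes a vector database useful for this kind of model.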

Vision transformer
Source: CS230 notes
Source: ViT paper

Embedding models today
•Text embeddings: some flavor of SBERT
  •Self-supervised pre-training (masking + NSP)
  •Contrastive (regression) or triplet loss
  •Changes to model arch or training, e.g. sparse attention, masking entities, or a different objective function

•Image embeddings: hybrid convnet/self-attention
  •Trained across a large labelled or weakly labelled dataset
  •Categorical cross-entropy or binary cross-entropy loss
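
As an illustration of the triplet loss mentioned above, here is a minimal sketch using Euclidean distance and a margin of 1.0. Both choices are assumptions for the example; the slides do not specify the distance function or margin.

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor
    than the positive is; otherwise the size of the shortfall."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1.0 from the anchor
far_negative = [3.0, 4.0]    # distance 5.0: already well separated
near_negative = [0.0, 1.5]   # distance 1.5: inside the margin

print(triplet_loss(anchor, positive, far_negative))   # 0.0
print(triplet_loss(anchor, positive, near_negative))  # 0.5
```

Training pushes the loss toward zero, which pulls similar texts together and pushes dissimilar ones apart in the embedding space.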

Bonus: Multimodal RAG

Girdhar, et al.
Multi-modal Retrieval

Vanilla RAG (RAG 1.0)
Your Documents → Embedding Model → Milvus
Question → Search → Question + Context → Gen AI Model → Reliable Answers
Question: What kind of music did they play in the pre-show?
Answer: The musician played improvised electronic music.
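
The vanilla RAG flow can be sketched end to end without any external services. Everything here is a simplification under stated assumptions: `embed` is a bag-of-words stand-in for the embedding model, the in-memory list stands in for Milvus, and the final call to a generative model is left out, so the sketch stops at the "Question + Context" prompt. All function names are hypothetical.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9-]+", text.lower())

def embed(text, vocab):
    """Bag-of-words stand-in for an embedding model, L2-normalized."""
    counts = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            counts[vocab[tok]] += 1.0
    norm = math.sqrt(sum(x * x for x in counts))
    return [x / norm for x in counts] if norm else counts

def retrieve(question, chunks, chunk_vecs, vocab, k=1):
    """Search step: return the k chunks most similar to the question."""
    q = embed(question, vocab)
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda cv: -sum(a * b for a, b in zip(q, cv[1])),
    )
    return [chunk for chunk, _ in scored[:k]]

def build_prompt(question, context_chunks):
    """Combine the retrieved context with the question for the model."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

chunks = [
    "The musician played improvised electronic music in the pre-show.",
    "Milvus is an open-source vector database.",
    "Embedding models map text to vectors.",
]
vocab = {}
for chunk in chunks:
    for tok in tokenize(chunk):
        vocab.setdefault(tok, len(vocab))
chunk_vecs = [embed(c, vocab) for c in chunks]

question = "What kind of music did they play in the pre-show?"
context = retrieve(question, chunks, chunk_vecs, vocab, k=1)
prompt = build_prompt(question, context)
print(prompt)  # this prompt would then be sent to the Gen AI model
```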

RAG 2.0
Multimodal Model → Milvus
Question → Search → Question + Context → Gen AI Model → Reliable Answers
Question: What is the default AUTOINDEX distance metric in Milvus Client?
Answer: The default AUTOINDEX distance metric in Milvus Client is L2.
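
The example answer refers to L2 as a distance metric. For readers unfamiliar with the term, the sketch below compares three metrics commonly offered by vector databases: L2 distance, inner product, and cosine similarity. It is plain Python for illustration and does not use or represent the Milvus API; the vectors are made up.

```python
import math

def l2_distance(u, v):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def inner_product(u, v):
    """Dot product: larger means more similar; sensitive to vector length."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product of the normalized vectors: depends only on direction."""
    norm_u = math.sqrt(inner_product(u, u))
    norm_v = math.sqrt(inner_product(v, v))
    return inner_product(u, v) / (norm_u * norm_v)

query = [1.0, 0.0]
a = [3.0, 0.0]   # same direction as the query, but much longer
b = [1.0, 0.5]   # close to the query in space, slightly different direction

for name, vec in (("a", a), ("b", b)):
    print(name, l2_distance(query, vec), inner_product(query, vec),
          cosine_similarity(query, vec))
```

Here L2 ranks `b` as the nearer vector, while inner product and cosine similarity both rank `a` first, so the metric should match how the embedding model was trained.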

Demo Time

THANK YOU
We need your stars!
https://github.com/milvus-io/milvus

Join our Discord: https://discord.gg/FjCMmaJng6