Introduction to Open Source RAG and RAG Evaluation

chloewilliams62 · 53 slides · May 23, 2024

About This Presentation

You’ve heard that good data matters in Machine Learning, but does it matter for Generative AI applications? Corporate data often differs significantly from the general Internet data used to train most foundation models. Join me for a demo on building an open source RAG (Retrieval Augmented Generation) ...


Slide Content

Slide 1
Speaker
Christy Bergman
Developer Advocate, Zilliz
[email protected]
https://www.linkedin.com/in/christybergman/

https://github.com/milvus-io/milvus
Discord: https://discord.gg/FjCMmaJng6

Slide 2
Image source: https://thedataquarry.com/posts/vector-db-1/

Slide 3
27K+ GitHub Stars · 25M+ Downloads · 250+ Contributors · 2,600+ Forks
Milvus is an open-source vector database for GenAI projects. Pip-install it on your laptop, plug it into popular AI dev tools, and push to production with a single line of code.
Easy Setup: Pip-install to start coding in a notebook within seconds.
Reusable Code: Write once, and deploy with one line of code into the production environment.
Integration: Plug into OpenAI, LangChain, LlamaIndex, and many more.
Feature-rich: Dense & sparse embeddings, filtering, reranking, and beyond. (A minimal quickstart sketch follows.)
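
As a minimal quickstart sketch (not from the deck; the collection name and dimension are illustrative), this is roughly what "pip-install on your laptop" looks like with Milvus Lite:

```python
# pip install pymilvus  (Milvus Lite runs embedded, backed by a local file)
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")                          # local Milvus Lite
client.create_collection(collection_name="demo", dimension=384)  # 384-dim example vectors
print(client.list_collections())                                 # -> ['demo']
```

Swapping the local file for a server or Zilliz Cloud URI is the "one line" change when moving to production.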

Slide 4
Zilliz Cloud is a fully managed vector database built atop OSS Milvus.
Open Source: Stable Milvus versions are continuously deployed to Zilliz Cloud.
Flexible & Secure Deployment: Enterprise features for production readiness.
Cardinal Search Engine & Use-Case-Optimized Compute: Milvus completely re-engineered to be optimized.
Pipelines, Connectors, Model Library: A streamlined unstructured data platform.

Slide 5
Milvus: Open Source, Self-Managed
github.com/milvus-io/milvus
Getting Started with Vector Databases
Join our community on the Milvus Discord: milvus.io/discord

Slide 6
AGENDA
01 AI Hallucinations and RAG
02 4 Challenges
03 Demo RAG
04 RAG Evaluation Methods
05 Demo Eval

Slide 7
01 AI Hallucinations and RAG

Example AI Hallucination
[Screenshots: a Gemini answer shown next to the Wikipedia source, with the hallucinated answer highlighted.]

Why do models hallucinate?
LLMs hallucinate because they are trained on sequences of words (tokens): they learn to continue token sequences plausibly, and the training data can contain arbitrary, even nonsensical, sequences.
Sample data:
The hamster cabinet …
!!@#%# …
Monkey eats shark …
trees in the moons …
trees in the moons…

Where do Vectors Come From?
Unstructured data is fed through pre-trained deep learning models; the embeddings they output are the vectors stored in a vector database.
[Diagram: Unstructured Data → Embedding model → Vectors → Vector Database. A separate Generator Model or LLM produces the final answer.]
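
As a concrete sketch of the diagram above (not from the deck; the library and model name are example choices), unstructured text becomes vectors like this:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # a pre-trained embedding model
docs = ["Milvus is an open-source vector database.",
        "HNSW organizes vectors in a multi-layered graph."]
vectors = model.encode(docs)                      # one 384-dim vector per document
print(vectors.shape)                              # -> (2, 384)
```

These vectors are what get inserted into the vector database; the generator model only sees the retrieved text, not the vectors.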

Semantic Similarity
Image from Sutor et al.
Toy 2-D word vectors:
Woman = [0.3, 0.4]
Man = [0.5, 0.2]
Queen = [0.3, 0.9]
King = [0.5, 0.7]
Vector arithmetic: Queen - Woman + Man = King
[0.3, 0.9] - [0.3, 0.4] = [0.0, 0.5]
[0.0, 0.5] + [0.5, 0.2] = [0.5, 0.7] = King
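
The same arithmetic, checked in a few lines of NumPy (a minimal verification, not from the deck):

```python
import numpy as np

woman, man = np.array([0.3, 0.4]), np.array([0.5, 0.2])
queen, king = np.array([0.3, 0.9]), np.array([0.5, 0.7])
result = queen - woman + man
print(result)                     # -> [0.5 0.7]
print(np.allclose(result, king))  # -> True: Queen - Woman + Man = King
```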

Slide 15
Retrieval Augmented Generation (RAG)
[Diagram: Your Data → Embedding Model → Vector Database. The Question is embedded for Search; the Question + retrieved Context go to the Gen AI Model, which returns Reliable Answers.]
Example question: "What is the default AUTOINDEX distance metric in Milvus Client?"
Example answer: "The default AUTOINDEX distance metric in Milvus Client is L2."
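
A minimal end-to-end sketch of this pipeline using Milvus Lite (the collection name, embedding model, and the `ask_llm` helper are illustrative assumptions, not the deck's demo code):

```python
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")     # example embedding model (384-dim)
client = MilvusClient("rag_demo.db")                   # Milvus Lite, local file
client.create_collection(collection_name="docs", dimension=384)

# 1) Your Data -> Embedding Model -> Vector Database
chunks = ["The default AUTOINDEX distance metric in Milvus Client is L2."]
client.insert(collection_name="docs",
              data=[{"id": i, "vector": embedder.encode(c).tolist(), "text": c}
                    for i, c in enumerate(chunks)])

# 2) Question -> Search -> Question + Context -> Gen AI Model
question = "What is the default AUTOINDEX distance metric in Milvus Client?"
hits = client.search(collection_name="docs",
                     data=[embedder.encode(question).tolist()],
                     limit=3, output_fields=["text"])
context = "\n".join(hit["entity"]["text"] for hit in hits[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = ask_llm(prompt)   # hypothetical call to whatever Gen AI model you use
```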

Slide 16
02 4 Challenges and Lessons Learned

Slide 17
Pain Point #1: Choosing an Embedding Model
https://huggingface.co/spaces/mteb/leaderboard

Slide 18
Pain Point #1: Choosing an Embedding Model

Creator | Model | Embedding Dim | Context Length | Use Case Tasks | Open Source | MTEB Score
OpenAI | text-embedding-3-small | 512-1536 | 8K | Real-time multilingual text chatbots | No | 62 (1536), 62 (512)
OpenAI | text-embedding-3-large | 256-3072 | 8K | Real-time multilingual text chatbots | No | 65 (3072), 62 (256)

Matryoshka Representation Learning: https://arxiv.org/pdf/2205.13147v4.pdf
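
The variable dimensions in the table come from Matryoshka Representation Learning: the model is trained so that a prefix of the embedding is itself a usable embedding. A hedged sketch of requesting a shortened vector (assumes the official `openai` Python client and an OPENAI_API_KEY in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="Corporate data often differs from general Internet data.",
    dimensions=512,                 # truncate from the native 1536 dims
)
print(len(resp.data[0].embedding))  # -> 512
```

Per the table, dropping text-embedding-3-small from 1536 to 512 dimensions costs essentially no MTEB score (62 vs 62) while cutting storage 3x.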

Slide 19
Pain Point #2: Choosing an Index
https://milvus.io/docs/index.md

Slide 20
Pain Point #2: Choosing an Index
● In-memory
○ Floating-point dense
■ FLAT: an exhaustive, brute-force index that compares the query vector against every single vector in the dataset to find the nearest neighbors. Suitable for small datasets where perfect accuracy is required and search latency is not a concern.
■ IVF_FLAT (Inverted File FLAT): a quantization-based index that divides the vector space into clusters. During indexing, vectors are assigned to the nearest cluster centroid; during search, only the vectors within the clusters closest to the query vector are compared.
■ HNSW: organizes vectors in a hierarchical, multi-layered graph, so search complexity is logarithmic. The basic idea is to separate nearest neighbors into layers, where the top layer is the sparsest and the lowest layer forms the complete graph. Search proceeds from top to bottom. (A sketch of creating an HNSW index follows this list.)
○ Floating-point sparse: SPLADE, BGE-M3
○ Binary
● On-disk: DiskANN, for when your data is too large to fit in memory
● Hardware-optimized: GPU (CAGRA), ARM
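
A minimal sketch of creating an HNSW index with the Milvus client (field and collection names and parameter values are illustrative; assumes pymilvus ≥ 2.4):

```python
from pymilvus import MilvusClient

client = MilvusClient("index_demo.db")
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="HNSW",
    metric_type="L2",
    params={"M": 16,                 # max edges per node in the graph layers
            "efConstruction": 200},  # candidate-list size while building the graph
)
client.create_index(collection_name="docs", index_params=index_params)
```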

Slide 21
Pain Point #2: Choosing an Index
[Diagrams: IVF-Flat and HNSW]
https://arxiv.org/abs/1603.09320

Slide 22
Pain Point #3: Chunking
Different data shapes chunk differently:
● Conversation data
● Documentation data
● Lecture or Q/A data

Slide 23
Pain Point #3: Chunking
● Conversation data: add conversation memory
● Question-answer data: use Q&A pair formatting
● Documentation data

Slide 24
Pain Point #3: Chunks need more context
[Example: a "Tesla Roadster" page with 2018 and 2023 sections. Naive chunking yields Chunk #1 and Chunk #2 containing only the section body text, so each chunk loses the page title that says what it is about.]

Slide 25
Pain Point #3: Chunks need more context
[Example, continued: better chunks prepend the title 1 level above (the year) and the title 2 levels above (the page title), yielding "Tesla Roadster 2018 …" and "Tesla Roadster 2023 …" instead of bare body text.]
Tools that preserve this context: HTMLHeaderTextSplitter and ParentDocumentRetriever (LangChain); HierarchicalNodeParser and AutoMergingRetriever (LlamaIndex). A sketch using the first of these follows.
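
A hedged sketch of the LangChain splitter named above (assumes `pip install langchain-text-splitters`; the HTML is a toy stand-in for the Tesla Roadster page):

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

html = """
<h1>Tesla Roadster</h1>
<h2>2018</h2><p>Lorem ipsum about the 2018 model...</p>
<h2>2023</h2><p>Lorem ipsum about the 2023 model...</p>
"""
splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "title"), ("h2", "year")])
for doc in splitter.split_text(html):
    print(doc.metadata, "|", doc.page_content)
# Each chunk carries {'title': 'Tesla Roadster', 'year': ...} metadata,
# which can be prepended to the chunk text before embedding.
```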

Slide 26
Example [screenshot]

Slide 27
Example [screenshot]

Slide 28
Pain Point #4: Keyword or Semantic Search?
Keyword search is good for:
● Exact product names
● Jargon words
Example: product name = "2022 RF GT 6MT"

Semantic search is good for:
● Similar meaning, but maybe not an exact match
Examples:
● Similar image search
● Related wiki articles

Slide 29
Pain Point #4: Keyword or Semantic Search? Best of both worlds!
[Diagram: the Prompt & Question feed both searches. Keyword Search runs over sparse vectors (TF-IDF, BM25, SPLADE, Lucene WAND pruning, BGE-M3); Semantic Search runs over dense vectors. Each returns its own candidate list (a Top10 and a Top5), which a reranker (linear combination, cross-encoder, or neural reranker) merges into the final top_k.]
● Rerank the keyword AND semantic top_k together
● Put the reranked results into the prompt context for improved context
A hybrid-search sketch follows.
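
A hedged sketch of the hybrid search above in Milvus (assumes pymilvus ≥ 2.4.4 and a collection with both a dense and a sparse vector field; the query vectors are placeholders):

```python
from pymilvus import MilvusClient, AnnSearchRequest, RRFRanker

client = MilvusClient("hybrid_demo.db")
dense_q = [0.1] * 384            # placeholder dense (semantic) query embedding
sparse_q = {17: 0.8, 4096: 0.3}  # placeholder sparse (keyword-like) embedding {dim: weight}

dense_req = AnnSearchRequest(data=[dense_q], anns_field="dense_vector",
                             param={"metric_type": "IP"}, limit=5)
sparse_req = AnnSearchRequest(data=[sparse_q], anns_field="sparse_vector",
                              param={"metric_type": "IP"}, limit=10)

# Reciprocal Rank Fusion merges the two ranked candidate lists into the final top_k.
hits = client.hybrid_search(collection_name="docs", reqs=[dense_req, sparse_req],
                            ranker=RRFRanker(), limit=5, output_fields=["text"])
```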

Slide 30
Rerankers: when are they computed?
- Plain cosine similarity is called "no interaction"; this is dense-embedding "semantic search".
- BERT was an early-interaction model, meaning the relationship between the question and the documents is pre-computed offline, as part of the embedding model.
- Cross-encoders are ML-model late interaction, calculated at query time. They are too computation-heavy to run in real time except on a small top_k, which they reduce to a smaller list (e.g., a top 2). Cross-encoder reranking adds a classifier over (question, answer) pairs; see the sketch after this list.
- ColBERT v2 is neural-model late interaction calculated offline, before the user asks their question! It claims ~2% increased accuracy, but requires storing extra embeddings.
- Cohere's Rerank 3 claims ~26% improvement over sparse-only retrieval and ~6% over dense.
- Jina AI's Reranker claims ~20% improvement over sparse-only retrieval.
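
A minimal sketch of cross-encoder reranking at query time (the model name is one common public choice, not the deck's):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What do the parameters for HNSW mean?"
candidates = ["HNSW's M sets the graph degree; efConstruction sets build-time search width.",
              "IVF_FLAT assigns vectors to the nearest cluster centroid.",
              "FLAT is an exhaustive brute-force comparison."]
scores = reranker.predict([(query, doc) for doc in candidates])  # one score per pair
ranked = sorted(zip(scores, candidates), reverse=True)
top_2 = [doc for _, doc in ranked[:2]]  # shrink the candidate list before prompting
```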

Slide 31
BERT vs ColBERT
[Diagram: the query and the top_k candidates are scored together to produce the final top_k. BERT-style sparse models: SPLADE, BGE-M3.]
https://arxiv.org/pdf/2112.01488.pdf

Slide 32
ColBERT v2 Reranker
https://arxiv.org/pdf/2112.01488.pdf

Slide 33
Slide from Tengyu Ma's April 2024 talk at Unstructured Data (with Milvus metadata filtering added).
[Diagram annotation: metadata filtering (hash)]

Slide 34
BGE M3-Embedding
● "Multi-vec": multi-vector retrieval uses fine-grained interactions between the query's and the passage's embeddings to compute the relevance score; re-ranks the top-200 dense candidates for efficient processing.
● "Dense+Sparse": retrieve the top-1000 candidates with the dense and sparse methods, then re-rank using the sum of the two scores.
● "All": re-rank based on the sum of all three scores.

Multilingual retrieval performance on the MIRACL dev set (measured by nDCG@10).
https://arxiv.org/pdf/2402.03216

Slide 35
https://chat.lmsys.org/?leaderboard
Chart by @maximelabonne

Slide 36

Slide 37
Mixtral 8x22B-Instruct-v0.1 with Anyscale Endpoints
https://console.anyscale.com/v2/playground

Slide 38
Question: What do the parameters for HNSW mean?
[The same prompt run against GPT-3.5-turbo and, via Anyscale Endpoints, Mixtral-8x22B-Instruct-v0.1.]

Slide 39
Is RAG dead?
2023: Lost in the Middle, https://arxiv.org/pdf/2307.03172
2024: Needle-in-a-haystack experiments, https://github.com/gkamradt/LLMTest_NeedleInAHaystack

Slide 40
Is RAG dead?
Needle-in-a-haystack experiments
Slide from Lance Martin, LangChain
https://blog.langchain.dev/multi-needle-in-a-haystack/

Slide 41
03 Demo Custom RAG

Slide 42
04 RAG Evaluation Methods

Where do Vectors Come From? (recap)
[Diagram: Unstructured Data → Embedding model → Vectors; Generator Model or LLM.]

Slide 45
Retrieval Augmented Generation (RAG) (recap)
[Diagram: Your Data → Embedding Model → Vector Database; Question → Search → Question + Context → Gen AI Model → Reliable Answers.]
Example question: "What is the default AUTOINDEX distance metric in Milvus?"
Example answer: "The default AUTOINDEX distance metric in Milvus is L2."

Slide 46
Model Evals vs Production System Evals
[Two panels: "Arena Elo score" (model evals) vs "Your RAG system" (production system evals).]

Slide 47
RAG Evaluation Methods
https://arxiv.org/pdf/2306.05685.pdf
GPT-4 favors itself with a 10% higher win rate; Claude-v1 favors itself with a 25% higher win rate.
The open-weight Prometheus-Eval aligns with human judgments up to 85%, as of May 2024.

Slide 48
Known Problems with LLM-as-Judge
https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG
GPT-4 matches human judgments on correctness and readability, but it is not a good judge of comprehensiveness.

Slide 49
Known Problems with LLM-as-Judge
https://arxiv.org/pdf/2305.17926
AI judges score the extremes (max/min) higher; human judges score the medians higher.

Slide 50
RAG Evaluation Methods
https://github.com/explodinggradients/ragas
[Diagram: Ragas metrics and what they compare.
context_precision: Query ↔ Context
context_recall: Ground Truth ↔ Context
faithfulness: Context ↔ Response
answer_relevancy: Query ↔ Response
answer_correctness, answer_similarity: Ground Truth Answer ↔ Response]
A sketch of running these metrics follows.
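
A hedged sketch of computing these metrics with the ragas library (assumes ragas' older Dataset-based `evaluate` API and an OpenAI key for the judge LLM; the sample row is invented):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

data = {
    "question": ["What is the default AUTOINDEX distance metric in Milvus?"],
    "contexts": [["AUTOINDEX in Milvus defaults to the L2 distance metric."]],
    "answer": ["The default AUTOINDEX distance metric in Milvus is L2."],
    "ground_truth": ["L2"],
}
result = evaluate(Dataset.from_dict(data),
                  metrics=[faithfulness, answer_relevancy,
                           context_precision, context_recall])
print(result)  # per-metric scores between 0 and 1
```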

Slide 51
05 Demo RAG Eval

Slide 52
THANK YOU
We need your stars! https://github.com/milvus-io/milvus
Join our Discord: https://discord.gg/FjCMmaJng6

Slide 53
Open Source Zilliz Architecture