Retrieval Augmented Generation Evaluation with Ragas

chloewilliams62 693 views 29 slides Jul 18, 2024

About This Presentation

Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judges has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo ...


Slide Content

Slide 1 | © Copyright 11/17/23 Zilliz
Speaker
Christy Bergman
Developer Advocate, Zilliz

https://www.linkedin.com/in/christybergman/

https://github.com/milvus-io/milvus
discord: https://discord.gg/FjCMmaJng6

Slide 2
Image source: https://thedataquarry.com/posts/vector-db-1/

Slide 3
3 Pillars of Generative AI

Slide 4
3 Pillars of Generative AI

Slide 5
Opportunities in Unstructured Data

Slide 6

Slide 7
THANK YOU
We need your stars!
https://github.com/milvus-io/milvus

Join our discord: https://discord.gg/FjCMmaJng6

Slide 8
AGENDA
01 AI Hallucinations and RAG
02 4 Challenges
03
04 RAG Evaluation Methods
Demo

Slide 9
01 AI Hallucinations and RAG

Slide 10
Example AI Hallucination
gemini
wikipedia

Slide 11
Example AI Hallucination
gemini
wikipedia
hallucinated answer

Slide 12
Why do models hallucinate?
• LLMs hallucinate because they are trained on sequences of words (tokens).

Sample Data
The hamster cabinet …
!!@#%# …
Monkey eats shark …
trees in the moons…

Slide 13
Where do Vectors Come From?
Unstructured Data → Pre-trained Deep Learning Models (embeddings here) → Vectors → Vector Database

Slide 14
Where do Vectors Come From?
Unstructured Data

Vectors

Slide 15
Semantic Similarity
Image from Sutor et al
Woman = [0.3, 0.4]
Queen = [0.3, 0.9]
King = [0.5, 0.7]
Man = [0.5, 0.2]

Queen - Woman + Man = King
  Queen = [0.3, 0.9]
- Woman = [0.3, 0.4]
        = [0.0, 0.5]
+ Man   = [0.5, 0.2]
  King  = [0.5, 0.7]
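
The arithmetic on this slide can be checked directly. Below is a minimal sketch in plain Python that uses the slide's toy 2-D vectors. The helper names (`sub`, `add`, `nearest`) are made up for illustration, and real embeddings have hundreds of dimensions and satisfy such analogies only approximately.

```python
import math

# Toy 2-D "embeddings" copied from the slide.
vectors = {
    "Woman": [0.3, 0.4],
    "Queen": [0.3, 0.9],
    "King": [0.5, 0.7],
    "Man": [0.5, 0.2],
}


def sub(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    """Element-wise a + b."""
    return [x + y for x, y in zip(a, b)]


def nearest(query, table):
    """Return the word whose vector has the smallest Euclidean distance to query."""
    return min(table, key=lambda word: math.dist(query, table[word]))


# Queen - Woman + Man, then look up the closest known word.
result = add(sub(vectors["Queen"], vectors["Woman"]), vectors["Man"])
closest = nearest(result, vectors)
print(result, closest)
```

The nearest-neighbour lookup at the end is the same idea a vector database applies at scale: find the stored vectors closest to a query vector.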

Slide 16
Retrieval Augmented Generation (RAG)
Your Data
Embedding Model
Vector Database
Question
Question + Context
Search
Gen AI Model
Reliable Answers
What is the default AUTOINDEX distance metric in Milvus?
The default AUTOINDEX distance metric in Milvus is L2.
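
The flow in this diagram (embed your data, search with the question, send question plus context to the model) can be sketched without any external services. This is an illustrative toy, not the speaker's demo code: `embed` is a bag-of-words stand-in for a real embedding model, the in-memory `index` list stands in for a vector database, and the corpus sentences other than the slide's example answer are invented. The final call to a generative model is left out; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter

# Illustrative corpus: the first sentence is the example answer from the slide,
# the other two are made-up filler documents.
documents = [
    "The default AUTOINDEX distance metric in Milvus is L2.",
    "Chunks need more context, such as the titles above them.",
    "Ragas is an open-source tool for RAG evaluation.",
]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Stand-in for the vector database: each document stored next to its vector.
index = [(doc, embed(doc)) for doc in documents]


def search(question, top_k=1):
    """Return the top_k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


def build_prompt(question, contexts):
    """Combine the retrieved context and the question into one prompt string."""
    context_block = "\n".join(contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"


question = "What is the default AUTOINDEX distance metric in Milvus?"
contexts = search(question)
prompt = build_prompt(question, contexts)
print(prompt)
```

In a real system the prompt would be passed to the Gen AI model, and the quality of the answer depends on whether `search` returned the right context, which is what the retrieval metrics later in the deck measure.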

Slide 17
Pain Point #3: Chunking
Conversation Data
Documentation Data
Lecture or Q/A Data

Slide 18
Pain Point #3: Chunking
Conversation Data: add conversation memory
Documentation Data
Question Answer Data: use Q&A tuple formatting

Slide 19
Pain Point #3: Chunks need more context
Tesla Roadster

2018
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem

2023
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem

Naive Chunks: Chunk #1, Chunk #2

Slide 20
Pain Point #3: Chunks need more context

Naive Chunks
Tesla Roadster (title 2 levels above)

2018 (title 1 level above)
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem

2023 (title 1 level above)
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem

Better Chunks
Tesla Roadster 2018
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem

Tesla Roadster 2023
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem

Tools: HTMLHeaderTextSplitter, ParentDocumentRetriever (LangChain); HierarchicalNodeParser, AutoMergingRetriever (LlamaIndex)

Slide 21
Pain Point #3: Chunks need more context
Naive Chunks
Better Chunks
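
The difference between naive and better chunks can be shown with a small sketch. It does not use the splitters named on the slide; it is a hand-rolled illustration under the assumption that the document has already been parsed into a title, section headings, and bodies. The `document` structure and both function names are hypothetical.

```python
# Hypothetical parsed document mirroring the slide: title -> section heading -> body.
BODY = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem"
document = {
    "title": "Tesla Roadster",
    "sections": [
        {"heading": "2018", "body": BODY},
        {"heading": "2023", "body": BODY},
    ],
}


def naive_chunks(doc):
    """One chunk per section body, with no surrounding titles."""
    return [section["body"] for section in doc["sections"]]


def better_chunks(doc):
    """Prefix each body with the title two levels above and the heading one level above."""
    return [
        f'{doc["title"]} {section["heading"]}\n{section["body"]}'
        for section in doc["sections"]
    ]


naive = naive_chunks(document)
better = better_chunks(document)

# The naive chunks cannot be told apart; the better chunks carry model and year.
print(naive[0] == naive[1])
print(better[0].splitlines()[0], "|", better[1].splitlines()[0])
```

With the naive chunks, a question about the 2023 model has no way to prefer one chunk over the other, because the distinguishing context lived only in the headings.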

Slide 22
04 RAG Evaluation Methods

Slide 23
Foundation Model Evals vs Production System Evals
Foundation model evals: Arena Elo score
Production system evals: your RAG system

Slide 24
RAG Evaluation Methods
https://arxiv.org/pdf/2306.05685.pdf
GPT-4 favors itself with a 10% higher win rate; Claude-v1 favors itself with a 25% higher win rate.

Open-weight Prometheus-eval aligns with human judgments up to 85% as of May 2024.

Slide 25
Known Problems with LLM-as-Judge
https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG
GPT-4 is not a good judge of comprehensiveness.
GPT-4 matches human judgements on correctness and readability.

Slide 26
Known Problems with LLM-as-Judge
https://arxiv.org/pdf/2305.17926
AI scores max/min higher.
Humans score medians higher.

Slide 27
RAG Evaluation Methods
https://github.com/explodinggradients/ragas
Inputs: Query, Context, Response, Ground Truth Answer
Metrics: faithfulness, context_precision, context_recall, answer_relevancy, answer_correctness, answer_similarity
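
Of the metrics listed here, context_precision is the one reported for retrieval in the demo results. As I understand the Ragas documentation, it is an average-precision style score over the ranked retrieved chunks, with each chunk's relevance decided by an LLM judge. The sketch below computes that style of score from hand-supplied relevance flags so the ranking effect is visible. The function name and the flags are illustrative assumptions and are not the Ragas API.

```python
def context_precision_at_k(relevance):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    relevance: list of 1/0 flags for the retrieved chunks, in ranked order.
    In Ragas these flags would come from an LLM judge; here they are given by hand.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, flag in enumerate(relevance, start=1):
        if flag:
            hits += 1
            score += hits / k  # precision@k, counted only at relevant ranks
    return score / total_relevant


# Same two relevant chunks, ranked first versus ranked last.
relevant_first = context_precision_at_k([1, 1, 0])
relevant_last = context_precision_at_k([0, 1, 1])
print(relevant_first, relevant_last)
```

The score rewards retrievers that rank relevant chunks near the top, which is why a change in chunking strategy alone can move it.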

Slide 28
05 Demo RAG Eval

Slide 29
RETRIEVAL +46%, GENERATION +6%
####################################################
# Avg Context Precision htmlsplitter score = 0.67 (46% improvement)
# Avg Context Precision simple score = 0.46
####################################################

####################################################
# Avg mistralai mixtral_8x7b_instruct score = 0.7031 (6% improvement over gpt-3.5-turbo)
# Avg llama3_70b_anyscale_chat score = 0.6888
# Avg llama3_70b_groq_instruct score = 0.6867
# Avg llama_3_70b_octoai_instruct score = 0.6863
# Avg llama_3_8b_ollama_instruct score = 0.6783
# Avg openai gpt-3.5-turbo score = 0.665
####################################################
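
The two headline percentages are relative improvements over the baseline scores shown above. A short sketch of the arithmetic, using only the numbers on this slide (the function name is illustrative):

```python
def relative_improvement(new, baseline):
    """Percentage change of new relative to baseline."""
    return (new - baseline) / baseline * 100


# Retrieval: htmlsplitter vs simple chunking (average context precision).
retrieval = relative_improvement(0.67, 0.46)
# Generation: mixtral_8x7b_instruct vs gpt-3.5-turbo (average score).
generation = relative_improvement(0.7031, 0.665)

print(f"retrieval +{retrieval:.0f}%, generation +{generation:.0f}%")
```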