Evaluating RAG pipelines built on unstructured data

chloewilliams62 · 118 views · 11 slides · Sep 18, 2024

About This Presentation

This talk covers different techniques for evaluating a RAG pipeline built on unstructured data. Standing up a basic RAG pipeline is becoming easier every day; however, identifying weak points in your application or dataset remains a challenge. We'll review how you can use traditional assertio...


Slide Content

Evaluating Agentic RAG Pipelines
September 2024
Hakan Tekgul
Arize AI
Solution Architect

© All Rights Reserved | We Make Models Work
Transition from Twitter Demo to Real Product is Hard

One Big Pain: Even Small Changes Can Cause Performance Regressions
The reality of AI engineering this past year: change a prompt or model, break a use case. Repeat.
(Illustration: an engineer prompts "You are an assistant debugging RAG, investigate the retrieved results and evals…" and the assistant replies "LGTM!")

Solution: Evaluation-Driven Development
1. Curate a dataset of examples.
2. Track each change (model, prompt, retriever) as an experiment.
3. Evaluate the experiment: score the new output (e.g., 0.8).
(Example prompt under test: "You're a helpful assistant. When the user asks about the return policy, respond with {vars}…")
LLM apps require iterative performance improvements.
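The loop above can be sketched in a few lines. This is an illustrative harness, not the Arize/Phoenix experiments API: the pipelines are canned stand-ins for two variants of a RAG app, and the scorer is a simple substring check.

```python
# Evaluation-driven development sketch: score each pipeline variant
# against a curated dataset so regressions surface before shipping.

def run_experiment(dataset, pipeline):
    """Run one pipeline variant over the dataset and return its mean score."""
    scores = [1.0 if ex["expected"] in pipeline(ex["question"]) else 0.0
              for ex in dataset]
    return sum(scores) / len(scores)

def pipeline_v1(question):
    # Stand-in for the real RAG app: canned answers keyed by question.
    answers = {
        "What is your return policy?": "Returns accepted within 30 days.",
        "Do you ship internationally?": "Yes, we ship to 40+ countries.",
    }
    return answers.get(question, "I don't know.")

def pipeline_v2(question):
    # A "small change" that silently regresses the shipping use case.
    answers = {"What is your return policy?": "Returns accepted within 30 days."}
    return answers.get(question, "I don't know.")

dataset = [
    {"question": "What is your return policy?", "expected": "30 days"},
    {"question": "Do you ship internationally?", "expected": "ship"},
]

baseline = run_experiment(dataset, pipeline_v1)   # 1.0
candidate = run_experiment(dataset, pipeline_v2)  # 0.5: one use case broke
```

Comparing `candidate` against `baseline` before promoting a change is exactly the "change a prompt, break a use case" guardrail the slide describes.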

RAG Architecture
(Diagram: the user query is embedded and sent to search & retrieval against a vector store; the retrieved context is combined with the user query into a prompt for the LLM; the LLM response goes back to the user, who provides feedback.)
LLM infra stack: LLM providers, vector DB, orchestration/agent, and LLM evaluation and observability.
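The search-and-retrieval step in the diagram can be sketched with toy bag-of-words vectors standing in for a real embedding model and vector DB. All names and the scoring function here are illustrative.

```python
# Minimal retrieval sketch: embed the query, rank documents by cosine
# similarity, and build a prompt from the top hit.
import math

def embed(text):
    """Toy embedding: word-count vector (a real pipeline uses a model)."""
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def cosine(a, b):
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "Our return policy allows returns within 30 days.",
    "We ship to over 40 countries worldwide.",
]
context = retrieve("what is the return policy", docs)[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: what is the return policy"
```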

LLM Evals for RAG Applications (LLM-as-a-Judge)
(Diagram: the input data and output data of the span/chain under test — an LLM call, an embedding retriever, or a chain — are fed into an eval library such as Phoenix, which combines an eval template with model params and an eval LLM to run the eval chain.)

How Do Evals Work? (LLM-as-a-Judge)
Example: retrieval. Within a trace of spans, the retrieval span is the span we want to evaluate. Its input (the user query) and output (the retrieved documents) are filled into an eval template; the Phoenix library sends the filled template, with model params, to the eval LLM, which returns a label such as "relevant" or "irrelevant".
Eval Template:
You are comparing a reference text to a question and trying to determine
if the reference text contains information relevant to answering the
question. Here is the data:
[BEGIN DATA]
************
[Question]: {query}
************
[Reference text]: {reference}
[END DATA]

Compare the Question above to the Reference text. Determine whether
the Reference text contains information that can answer the Question.

RAG Evaluation Overview
For evaluating a RAG application, you need to consider two types of evaluations:
- Relevancy evaluation: is the retrieved context relevant to the user query?
- Response evaluations:
  - Hallucination: is the response hallucinated relative to the retrieved data? Is it faithful to the context?
  - Q&A correctness: is the response correct given the question and context?
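These evaluation types can be organized as a table of judge templates applied to every RAG trace (query, retrieved context, response). The templates, label sets, and stub judge below are illustrative sketches, not the exact Phoenix prompts; a real deployment sends each filled template to an eval LLM.

```python
# One judge template per evaluation type, each with its allowed labels.
EVALS = {
    "relevancy": (
        "Is this context relevant to the query?\nQuery: {query}\nContext: {context}",
        ["relevant", "irrelevant"]),
    "hallucination": (
        "Is this answer supported by the context?\nContext: {context}\nAnswer: {response}",
        ["factual", "hallucinated"]),
    "qa_correctness": (
        "Is this answer correct for the question?\nQuestion: {query}\nAnswer: {response}",
        ["correct", "incorrect"]),
}

def evaluate_trace(trace, judge):
    """Run every eval over one RAG trace and collect the labels."""
    results = {}
    for name, (template, labels) in EVALS.items():
        raw = judge(template.format(**trace)).strip().lower()
        # Unparseable judge output falls back to the failing label.
        results[name] = raw if raw in labels else labels[-1]
    return results

trace = {"query": "What is the return policy?",
         "context": "Returns accepted within 30 days.",
         "response": "You can return items within 30 days."}

# Stub judge keyed off each template's wording; use a real eval LLM in practice.
stub_judge = lambda p: ("relevant" if "relevant" in p
                        else "factual" if "supported" in p
                        else "correct")
results = evaluate_trace(trace, stub_judge)
```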

Unstructured RAG: Knowledge Base Analysis with Embeddings
● Leverage query and knowledge-base embeddings to analyze RAG performance
● Understand gaps within your knowledge base
● Essential for unstructured/multimodal RAG
Live Demo

Thank you.

[email protected]
pip install arize-phoenix