Debugging & fixing Gen-AI applications


About This Presentation

Generative AI has revolutionized the world by making complex AI/ML techniques very easy to use. It has enabled non-Data-Science users, like business and engineering folks, to create AI solutions with ease. But as we know, there are no free lunches; ease of use comes with challenges of handling non-t...


Slide Content

Debugging & fixing Gen-AI
applications
Kuldeep Jiwani
VP, Head of AI Solutions

“You only change what you understand.
What you don’t understand, are unaware of, you repress.
But when you understand it, it changes.”

Agenda
●Gen AI enterprise application
○Where and how to debug
●RAG components
○Embedding quality
●Fixing RAG
●Evaluating LLM response
●Tuning LLM behaviors

Enterprise Gen AI Applications
[Architecture diagram: the user's question plus enterprise input data feed prompt creation; RAG pulls relevant context / few shots from the enterprise data store; the prompt (question + document + context / few shots) goes to an LLM (online / offline) or agentic LLM, which returns an answer based on the document to the user]


Enterprise Gen AI Applications
[Same architecture diagram, with a callout on the LLM: a "Black Box in API world"]

Enterprise Gen AI Applications
[Same architecture diagram, with a callout on the RAG block, which breaks down into: embedding quality, vector lookup, data type]

RAG components
●Embedding quality
○Concept differentiation
○Information aggregation
●Vector lookup
○Search optimization
○Similarity function
●Data type
○Chunk
○Graph

RAG
Differentiating power of embeddings

RAG Embeddings Overview
[Diagram: documents pass through the embedding model to produce embedding vectors, one row of floats per document]
A similarity function (vector similarity metric), e.g. cosine similarity, scores pairs of vectors:
$\mathrm{sim}(x, y) = \frac{\sum_i x_i y_i}{\lVert x \rVert\,\lVert y \rVert}$
[Pairs of document vectors with their similarity scores, e.g. 0.85, 0.89, 0.75, 0.92, …]

RAG: Embedding differentiation power
●Food
○"I just tried the most amazing sushi restaurant downtown
last night.",
○"I want some of Grandma's famous apple pie.",
○"A pinch of sugar balances the acidity of the tomatoes
beautifully.",
○"The street tacos are so flavorful with that homemade
salsa.",
○"I love fluffy pancakes topped with maple syrup and fresh
berries.",
○"This butter chicken is incredible.",
○"I'm craving spicy green curry and spring rolls.”
●Cars
○"My new sports car is a beast on the highway, the
acceleration is mind-blowing.",
○"Driving along the coast was so serene.",
○"The quiet ride and zero emissions make my drive to work
guilt-free.",
○"Driving those hills and navigating through the mud was
such a thrilling experience!",
○"I've been working on my classic Mustang it's a blast to
drive.",
○"My old beater had character I have so many memories in
that car.",
○"I just got a new SUV, it's spacious, comfortable, and has
all the latest safety features."
●Physics
○"I find quantum mechanics fascinating.",
○"We're constantly exploring the mysteries of black holes and
the expansion of universe.",
○"It's amazing how time and space are intertwined, and how
gravity warps them.",
○"Physics is all around us, just think about the laws of
motion.",
○"Learning about the Higgs boson makes me appreciate the
complexity of the universe.",
○"I love when science fiction explores theoretical physics
concepts like time travel.",
○"It's all thanks to the discoveries of physicists like Faraday
and Maxwell.”
●Medical
○"Patient came to ed with complaints of vomiting in blood.",
○"Admitted to hospital for scheduled transcatheter arterial
embolization procedure.",
○"Patient came to hospital consulted GI and started on
pantoprazole drip.",
○"Patient came to emergency via ambulance after a motor
vehicle accident with extensive head injury.",
○"Pt with long hx of spinal instability from past 5 yrs with relief
from medication and therapy.",
○"Pt with irregular heartbeat and chest pain presented to ed.",
○"Patient with failed medical and conservative treatment
advised for right knee joint replacement."

RAG: Embedding differentiation power
[Manifold plot of GPT 3.5 Turbo (OpenAI) embeddings, colored by class: Food, Car, Physics, Medical]

RAG: Embedding differentiation power
[Manifold plot of S-BERT (Sentence Transformer) embeddings, same four classes]

RAG: Embedding differentiation power
[Manifold plot of PubMed-BERT (domain-tuned model, medical domain) embeddings, same four classes]

Visualizing Embeddings
●Manifolds are a way to visualize embeddings and assess the semantic similarity in the data
●Closer points correspond to similar concepts, as captured by the LLM's training method
●Check the proximity of points in a manifold that belong to the same class
○This reflects the LLM's semantic learning ability
●LLM embeddings are very high-dimensional and complex
○They come with many nuances such as the curse of dimensionality, non-linear relations, etc.
○One should choose the right manifold to understand the true relations
●Important: one should choose distance-preserving manifolds: MDS, LLE, UMAP, … (a sketch follows below)
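
A minimal sketch of such a manifold visualization, assuming sentence embeddings are already available as a NumPy array X of shape [n_sentences, dim] with matching class labels (the random data here is only a stand-in):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS, LocallyLinearEmbedding

X = np.random.randn(40, 384)                       # placeholder for real embeddings
labels = np.repeat(["Food", "Car", "Physics", "Medical"], 10)

# Distance-preserving projections: MDS (global structure), LLE (local structure)
for name, reducer in [("MDS", MDS(n_components=2)),
                      ("LLE", LocallyLinearEmbedding(n_components=2, n_neighbors=8))]:
    Z = reducer.fit_transform(X)
    for cls in np.unique(labels):
        m = labels == cls
        plt.scatter(Z[m, 0], Z[m, 1], label=cls)   # one color per class
    plt.title(name); plt.legend(); plt.show()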

Manifolds: Global vs Local
[Four panels: raw data (3-D), clustered data (3-D), global manifold MDS (2-D), local manifold LLE (2-D)]

Visualizing Embeddings
●Not all manifolds are distance preserving
●t-SNE is not distance preserving
○Cluster sizes in a t-SNE plot mean nothing
○Distances between clusters might not mean anything
●t-SNE plots can lead to misinterpretations (see the sketch below)
○The algorithm adapts to the underlying data, performing different transformations on different regions
■These differences can be a major source of confusion
○t-SNE has a tunable parameter, "perplexity"
■Which says (loosely) how to balance attention between local and global aspects of your data
■The parameter is, in a sense, a guess about the number of close neighbors each point has
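
To see this instability concretely, a small sketch (reusing X from the previous snippet): projecting the same embeddings with different perplexity values can produce very different layouts, which is why cluster sizes and inter-cluster distances in t-SNE plots should not be over-read.

from sklearn.manifold import TSNE

for perplexity in (5, 15, 30):
    # same data, different perplexity: compare the resulting layouts
    Z = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    plt.scatter(Z[:, 0], Z[:, 1])
    plt.title(f"t-SNE, perplexity={perplexity}"); plt.show()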

Vector Embeddings (MDS)
[Two MDS plots of Food / Car / Physics / Medical embeddings: one with unclear concept separation, one with clear concept separation]

RAG
Query & Retrieval Embeddings

RAG query: "What are the new projects in research and development?”
Document: Company’s annual report
Evaluating RAG quality: Via embedding manifolds
Query vector:
Retrieved vectors:
Good quality retrieval:
Query embedding is close to
retrieved chunks embeddings

Evaluating RAG quality: Via embedding manifolds
[0]--> our success is based on our ability to create new and compelling products, services, and experiences for our users, to initiate and embrace disruptive technology trends, to enter new geographic and product markets, and to drive broad adoption of our products and services. we invest in a range of emerging technology trends and breakthroughs that we believe offer significant opportunities to deliver value to our customers and growth for the company. based on our assessment of key technology trends, we maintain our long-term commitment to research and development across a wide spectrum of technologies, tools, and platforms spanning digital work and life experiences, cloud computing, ai, devices, and operating systems. while our main product research and development facilities are located in redmond, washington, we also operate research and development facilities in other parts of the u.s. and around the world. this global approach helps us remain
[1]--> competitive in local markets and enables us to continue to attract top talent from across the world. we plan to continue to make significant investments in a broad range of product research and development activities, and as appropriate we will coordinate our research and development across operating segments and leverage the results across the company. in addition to our main research and development operations, we also operate microsoft research. microsoft research is one of the world's largest corporate research organizations and works in close collaboration with top universities around the world to advance the state-of-the-art in computer science and a broad range of other disciplines, providing us a unique perspective on future trends and contributing to our innovation.
[2]--> company is better positioned to help them than microsoft. every day this past fiscal year i have had the privilege to witness our customers use our platforms and tools to connect what technology can do with what the world needs it to do. here are just a few examples: • ferrovial, which builds and manages some of the world's busiest airports and highways, is using our cloud infrastructure to build safer roads as it prepares for a future of autonomous transportation. • peace parks foundation, a nonprofit helping protect natural ecosystems in southern africa, is using microsoft dynamics 365 and power bi to secure essential funding, as well as our azure ai and iot solutions to help rangers scale their park maintenance and wildlife crime prevention work. • one of the world's largest robotics companies, kawasaki heavy industries, is using the breadth of our tools, from azure iot and hololens, to create an industrial metaverse solution that brings its distributed workforce
[3]--> but generally include parts and labor over a period generally ranging from 90 days to three years. for software warranties, we estimate the costs to provide bug fixes, such as security patches, over the estimated life of the software. we regularly reevaluate our estimates to assess the adequacy of the recorded warranty liabilities and adjust the amounts as necessary. research and development research and development expenses include payroll, employee benefits, stock-based compensation expense, and other headcount-related expenses associated with product development. research and development expenses also include third-party development and programming costs, localization costs incurred to translate software for international markets, and the amortization of purchased software code and services content. such costs related to software development are included in research and development expense until the point that technological feasibility is reached, which for our
[4]--> also increased the number of identified partners in the black partner growth initiative and continue to invest in the partner community through the black channel partner alliance by supporting events focused on business growth, accelerators, and mentorship. progress does not undo the egregious injustices of the past or diminish those who continue to live with inequity. we are committed to leveraging our resources to help accelerate diversity and inclusion across our ecosystem and to hold ourselves accountable to accelerate change – for microsoft, and beyond. investing in digital skills the covid-19 pandemic led to record unemployment, disrupting livelihoods of people around the world. after helping over 30 million people in 249 countries and territories with our global skills initiative.

Evaluating RAG quality: Via embedding manifolds
RAG query: "What are companies future plans?"
Document: company's annual report
[Manifold plot of the query vector and the retrieved vectors]
Lower quality retrieval: the query embedding is not close to the retrieved chunks' embeddings

Evaluating RAG quality: Via embedding manifolds
[0]--> continue," "will likely result," and similar expressions. forward-looking statements are based on current expectations and assumptions that are subject to risks and uncertainties that may cause actual results to differ materially. we describe risks and uncertainties that could cause actual results and events to differ materially in "risk factors," "management's discussion and analysis of financial condition and results of operations," and "quantitative and qualitative disclosures about market risk" in our fiscal year 2022 form 10-k. readers are cautioned not to place undue reliance on forward-looking statements, which speak only as of the date they are made. we undertake no obligation to update or revise publicly any forward-looking statements, whether because of new information, future events, or otherwise. business general embracing our future microsoft is a technology company whose mission is to empower every person and every organization on the planet to
[1]--> our future growth depends on our ability to transcend current product category definitions, business models, and sales motions. we have the opportunity to redefine what customers and partners can expect and are working to deliver new solutions that reflect the best of microsoft.
[2]--> 1 dear shareholders, colleagues, customers, and partners: we are living through a period of historic economic, societal, and geopolitical change. the world in 2022 looks nothing like the world in 2019. as i write this, inflation is at a 40-year high, supply chains are stretched, and the war in ukraine is ongoing. at the same time, we are entering a technological era with the potential to power awesome advancements across every sector of our economy and society. as the world's largest software company, this places us at a historic intersection of opportunity and responsibility to the world around us. our mission to empower every person and every organization on the planet to achieve more has never been more urgent or more necessary. for all the uncertainty in the world, one thing is clear: people and organizations in every industry are increasingly looking to digital technology to overcome today's challenges and emerge stronger. and no
[3]--> projections of any evaluation of effectiveness to future periods are subject to the risk that controls may become inadequate because of changes in conditions, or that the degree of compliance with the policies or procedures may deteriorate. /s/ deloitte & touche llp seattle, washington july 28, 2022
[4]--> 11 note about forward-looking statements this report includes estimates, projections, statements relating to our business plans, objectives, and expected operating results that are "forward-looking statements" within the meaning of the private securities litigation reform act of 1995, section 27a of the securities act of 1933, and section 21e of the securities exchange act of 1934. forward-looking statements may appear throughout this report, including the following sections: "business" in our fiscal year 2022 form 10-k and "management's discussion and analysis of financial condition and results of operations" in our fiscal year 2022 form 10-k. these forward-looking statements generally are identified by the words "believe," "project," "expect," "anticipate," "estimate," "intend," "strategy," "future," "opportunity," "plan," "may," "should," "will," "would," "will be," "will

Manifolds: Closer points indicate good retrieval
RAG query: "What are the new projects in research and development?"
[Four 2-D manifold projections (UMAP, LLE, IsoMap, MDS) of the query and retrieved-chunk embeddings; in each, the query point lies close to the retrieved chunks]

Manifolds: Sparse points indicate lower quality retrieval
RAG query: "What are companies future plans?"
[The same four projections (UMAP, LLE, IsoMap, MDS); here the query point sits far from the retrieved chunks]

RAG
Embedding Aggregations

Information Aggregation / Pooling strategy: Mean pooling
[Diagram: the tokens "Attention is all you need" pass through the encoder, producing one word embedding per token]
Mean pooling averages the word embeddings into a single sentence embedding:
$v = \frac{1}{n}\sum_{i=1}^{n} x_i$

Information Aggregation / Pooling strategy: CLS pooling
[Diagram: a CLS token is prepended to "Attention is all you need" before encoding]
CLS pooling takes the CLS token's embedding as the sentence embedding:
$v = x_{\mathrm{CLS}}$

Information Aggregation / Pooling strategy: Pooler vector
[Same diagram, with a CLS token prepended before encoding]
tanh pooling passes the CLS embedding through a dense layer with a tanh activation (BERT's pooler):
$v = \tanh(W x_{\mathrm{CLS}} + b)$

RAG quality: Vector Aggregation / Pooling strategy
[Three manifold plots comparing sentence embeddings produced by tanh CLS (pooler), CLS pooling, and mean pooling; a code sketch of the three strategies follows]
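
A minimal sketch of the three pooling strategies with Hugging Face transformers; the model name is an assumption (any BERT-style encoder with a pooler head works):

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")    # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tok(["Attention is all you need"], return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

mask = batch["attention_mask"].unsqueeze(-1)                        # ignore padding
mean_pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)   # mean pooling
cls_pooled = out.last_hidden_state[:, 0]                            # CLS pooling
pooler_vec = out.pooler_output                                      # tanh(W·x_CLS + b)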

RAG
Vector Indexing

RAG quality: Vector Indexing algorithms
●HNSW - Hierarchical Navigable Small World
○Skip lists, NSW – Navigable Small World
●LSH – Locality Sensitive Hashing
●ALSH – Asymmetric Locality Sensitive Hashing
●NSG – Navigating Spreading-out Graph
○ANN – Approximate Nearest Neighbor
●IVF – Inverted Files
○K-Means
●MSTG – Multi-Stage Tree Graph

Vector Indexing Pitfalls: LSH
[Diagram: a random vector splits the space into a positive (+ve) half, hashed to 1, and a negative (-ve) half, hashed to 0, with collision probabilities P₁ and P₂]

LSH: Region identification via random projections
•Know the sensitivity of the hash function
•Hash function: $h_{w,b}(v) = \left\lfloor \frac{w \cdot v + b}{r} \right\rfloor$
•$w$ is a d-dimensional vector drawn from a p-stable distribution
•$H$ is called $(R, cR, P_1, P_2)$-sensitive if, for any two points $p$ and $q$:
•if $d(p,q) \le R$, then $\Pr[h(p) = h(q)] \ge P_1$ (i.e. $p$ and $q$ collide)
•if $d(p,q) \ge cR$, then $\Pr[h(p) = h(q)] \le P_2$
•$\rho = \frac{\log(1/P_1)}{\log(1/P_2)}$
•The smaller $\rho$, the better the search performance
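
A minimal NumPy sketch of this p-stable hash family; the dimensionality, bucket width r, and number of hash functions are illustrative choices, not values from the slides:

import numpy as np

d, n_hashes, r = 384, 16, 4.0
rng = np.random.default_rng(0)
W = rng.standard_normal((n_hashes, d))    # w ~ N(0, 1), a 2-stable distribution
b = rng.uniform(0, r, size=n_hashes)

def lsh_bucket(v: np.ndarray) -> tuple:
    # h_{w,b}(v) = floor((w . v + b) / r); nearby points collide with high probability
    return tuple(np.floor((W @ v + b) / r).astype(int))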

ALSH – Asymmetric LSH (ScaNN - Google)
q – query vector
⟨q, x⟩ – projection length of x on q
x₁, x₂ – embedding vectors
x̃₁, x̃₂ – quantized vectors of x₁, x₂
c₁, c₂ – centers of regions
[Diagram: naive quantization can invert the ranking, giving ⟨q, x̃₁⟩ > ⟨q, x̃₂⟩ while ⟨q, x₁⟩ < ⟨q, x₂⟩; the quantization should instead preserve ⟨q, x̃₁⟩ < ⟨q, x̃₂⟩]

Vector search: Speed vs Accuracy
Source: https://research.google/blog/announcing-scann-efficient-vector-similarity-search/

RAG: Assess Vector Indexing quality
●Prepare a set of queries and relevant chunks on your real enterprise data
●Generate relevant answers via Vector DB lookup
●Generate relevant answers via a full scan over the stored embedding vectors
●The difference in results between the two indicates the quality of the index store (see the sketch after this list)
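
A minimal sketch of this check using FAISS (the library choice and index parameters are assumptions; random vectors stand in for real chunk and query embeddings), reporting recall@k of the approximate index against the exact full scan:

import faiss
import numpy as np

dim, k = 128, 10
xb = np.random.rand(10_000, dim).astype("float32")   # stored chunk embeddings
xq = np.random.rand(100, dim).astype("float32")      # query embeddings

flat = faiss.IndexFlatL2(dim)          # exact full scan: the ground truth
flat.add(xb)
_, exact_ids = flat.search(xq, k)

hnsw = faiss.IndexHNSWFlat(dim, 32)    # approximate index under test
hnsw.add(xb)
_, ann_ids = hnsw.search(xq, k)

recall = np.mean([len(set(a) & set(e)) / k for a, e in zip(ann_ids, exact_ids)])
print(f"recall@{k}: {recall:.3f}")     # the gap from 1.0 is the index's accuracy cost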

RAG
Similarity function

RAG Quality: Similarity function
●Is Euclidean distance a good similarity measure?
○In high-dimensional spaces, points tend to be far apart by the Euclidean measure
○This sometimes distorts the notion of similarity

Euclidean distance
●2-dimensional uncorrelated data: (X, Y)
●Add correlated dimensions
○By adding 1% noise (ϵ) to X and repeating
○3-D data: (X, Y, X+ϵ)
○4-D data: (X, Y, X+ϵ, X+ϵ)
○5-D data: (X, Y, X+ϵ, X+ϵ, X+ϵ)
○6-D data: (X, Y, X+ϵ, X+ϵ, X+ϵ, X+ϵ)
○7-D data: (X, Y, X+ϵ, X+ϵ, X+ϵ, X+ϵ, X+ϵ)
●Compute the distance matrix at each dimension and observe its distribution, mean, and std. dev. (see the sketch after this list)
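
A minimal sketch of this experiment in NumPy/SciPy:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
base = rng.standard_normal((500, 2))                 # (X, Y), uncorrelated
data = base
for dim in range(2, 8):
    d = pdist(data)                                  # all pairwise Euclidean distances
    print(f"{dim}-D: mean={d.mean():.3f}  std={d.std():.3f}")
    noisy_x = base[:, :1] + 0.01 * rng.standard_normal((500, 1))   # X + 1% noise
    data = np.hstack([data, noisy_x])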

Euclidean distance
[Histograms of pairwise distances at increasing dimensionality, each annotated with the mean and ±2σ]

Euclidean distance
●Distances spread further apart as dimensionality increases
○The standard deviation grows with the number of dimensions

RAG Quality: Similarity function
●Is cosine similarity a good similarity measure?
○In higher dimensions, the angle between vectors is a more effective measure
○Cosine similarity is 1 for identical vectors, 0 for orthogonal vectors, and -1 for opposite vectors
●Is cosine similarity the best similarity function?

RAG
Data Type

RAG Data Type: Chunk RAG vs Graph RAG
●As per Deadline’s report on 28th July 2023, Barbie
has amassed $528.6 million at the worldwide box
office as the numbers grew on Wednesday compared
to Tuesday. Apart from the domestic market, the film
is enjoying superbly in overseas, with $291.4 million
coming in so far. With the kind of pace it is
witnessing, the mark of $600 million is expected to be
crossed by the second weekend.
●Speaking about Oppenheimer, it has collected
$265.1 million at the worldwide box office in the first
six days. By Friday, the film is expected to hit the
$300 million milestone and will again see a spike on
Saturday and Sunday.

RAG
Improving performance

RAG Components
●Query
●Embedding model
●Retrieval
[Pipeline: a query such as "Fetch relevant documents for …" passes through the encoding model to produce embeddings, which drive retrieval]

RAG
Improving:
Query performance

RAG: Improving query performance
●Query augmentation: to increase the chances of semantically matching relevant answers (see the sketch after this list)
●Augmenting query with hypothetical answers
○Using domain knowledge
■Adding sample answers
○Using LLMs
■Generating possible hypothetical answer
●Augmenting query with multiple similar queries
○Using domain knowledge
■Adding alternative ways of asking the same questions
○Using LLMs
■Generating similar questions
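
A minimal sketch of both augmentation styles via an LLM; the client, model name, and prompt wording are assumptions, and all returned variants would be embedded and their retrieval results merged downstream:

from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def augment_query(query: str) -> list[str]:
    # HyDE-style: a hypothetical answer often lands nearer the relevant chunks
    hypo = ask(f"Write a short hypothetical answer to: {query}")
    # Multi-query: alternative phrasings of the same question
    alts = ask(f"Rephrase this question in three different ways, one per line: {query}")
    return [query, hypo] + [q for q in alts.splitlines() if q.strip()]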

RAG
Improving:
Embedding models performance

RAG Embeddings: Type of Encoders
●Encoder
●Bi-Encoder
●Cross Encoder
●Dual Encoder

Bi-Encoder: Sentence Transformers
A Bi-Encoder is a fine-tuned version of an Encoder (BERT), trained as a Siamese network
Usage of Bi-Encoders is the same as Encoders, but training differs: it uses pairs of sentences
[Diagram: Sent-1 and Sent-2 each pass through the encoder (shared weights & params) with mean pooling; the cosine similarity $\frac{\sum_i x_i y_i}{\lVert x \rVert\,\lVert y \rVert}$ of the two sentence embeddings is regressed against the labeled score: $\mathrm{MSE\_Loss}(\mathrm{Sim}(S_1, S_2), \mathrm{TrueSim})$]
Training data (STS benchmark): (S1, S2, 0.3) (S1, S3, 0.7) (S2, S4, 0.9) (S3, S4, 0.1) …

Bi-Encoder via Contrastive Loss / Triplet Loss
[Diagram: Anchor, Positive, and Negative sentences pass through encoders with shared weights & params and mean pooling; the similarity $\frac{\sum_i x_i y_i}{\lVert x \rVert\,\lVert y \rVert}$ or distance $\lVert x - y \rVert$ feeds the trainer]
Contrastive loss, trained on {(Anchor, Positive/Negative), (Label: 1/0)} pairs, e.g. (S1, S2, 1) (S1, S3, 0) (S2, S4, 1) (S3, S4, 0) … (S5, S6, 1):
$\mathcal{L}_{\mathrm{contrastive}} = -\frac{1}{n}\sum_i \log \frac{\exp(\mathrm{sim}(a_i, p_i))}{\sum_j \exp(\mathrm{sim}(a_i, p_j))}$
Triplet loss, trained on {Sentence, Class} data, e.g. (S1, "Sports") (S2, "Travel") (S3, "Social") … (S4, "Food"):
$\mathcal{L}_{\mathrm{triplet}} = \max(0, \lVert a - p \rVert - \lVert a - n \rVert + \epsilon)$

Cross Encoder
A Cross Encoder is a fine-tuned version of an Encoder (BERT), trained as a binary classifier
Usage and training both differ: it takes pairs of sentences and predicts [1 (Similar), 0 (Not Similar)]
[Diagram: the Query + Document are encoded together; a classifier head over the CLS token outputs $s_{q,d} = \mathrm{softmax}(T_{\mathrm{CLS}} W + b)$]
Cross-entropy loss: $-\sum \log(s_{q,d^{+}}) - \sum \log(1 - s_{q,d^{-}})$
Training data: (S1, S2, 1) (S1, S3, 0) (S2, S4, 1) (S3, S4, 0) … (S5, S6, 1)

RAG pipeline with Re-Rankers
[Pipeline: the query ("Fetch relevant documents for …") passes through an Encoder / Bi-Encoder to produce embeddings; the raw retrieval results are then re-ranked by a Cross-Encoder re-ranker to give the ranked retrieval]
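
A minimal sketch of the re-ranking stage with the sentence-transformers library; the cross-encoder model name is a common public checkpoint, assumed rather than taken from the slides:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, raw_hits: list[str], top_k: int = 5) -> list[str]:
    # The cross encoder scores each (query, document) pair jointly:
    # slower than bi-encoder similarity, but more accurate
    scores = reranker.predict([(query, doc) for doc in raw_hits])
    ranked = sorted(zip(raw_hits, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]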

Dual Encoders
A Dual-Encoder is a fine-tuned version of 2 separate Encoders (BERT), one for the Question, one for the Answer
[Diagram: the Question passes through the Question Encoder and the Answer through the Answer Encoder, each with CLS pooling; the cosine similarity $\frac{\sum_i x_i y_i}{\lVert x \rVert\,\lVert y \rVert}$ feeds the trainer]
Cross-entropy loss: $-\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N} \exp(S_{ij})}$
Training data: (Q1, A1, 1) (Q1, A2, 0) (Q2, A1, 0) (Q2, A2, 1) …

Dual Encoders
Indexing phase: documents pass through the Answer Encoder* and the resulting vectors are stored in the Vector DB
Retrieval phase: the query passes through the Question Encoder* and is matched against the Vector DB by cosine similarity
* Pre-trained version available from Facebook research as DPR (Dense Passage Retriever)

Bi-Encoders vs Dual Encoders
Question { 'Which is the biggest continent in the world?’ }
Answers = [
"Which is the biggest continent in the world?",
"Which is the largest continent?",
"Asia is the biggest continent in the world.",
"United States of America is big.",
"India has 2nd largest population.",
"Children are playing in the park."
]

[Two charts comparing the similarity scores each model assigns between the question and the candidate answers: Bi-Encoders (Sentence Transformers) vs Dual Encoders]

RAG
Improving:
Retrieval performance

RAG: Retrieval performance
●Feedback scores on retrieved documents are used to improve retrieval performance
●Feedback on retrievals for a RAG system can be obtained from:
○Human SMEs, as per domain relevance
■+1 : Thumbs up
■-1 : Thumbs down
○LLMs, through expert-reviewer prompts
■+1 : Relevant
■-1 : Irrelevant

RAG Retrieval performance: Adapter matrix
[Diagram: (query vector, document vector, feedback) triples, e.g. (Q1, D1, Feedback(Q1, D1)), (Q1, D2, Feedback(Q1, D2)), …, (Q2, D3, Feedback(Q2, D3)), feed a regression that learns the adapter matrix]
Regression objective: $\mathrm{MSE\_Loss}(\mathrm{Sim}(Q, D), \mathrm{Feedback})$

RAG: Retrieval performance
●During retrieval, vectors first undergo a transformation before being ranked on similarity
○Transforming implies a change of basis for the embedding vectors in the high-dimensional space: $(u_1, u_2) \to (v_1, v_2)$, $(x_1, x_2) \to (x'_1, x'_2)$
Transformed vector = Adapter matrix @ Retrieved vector (a training sketch follows below)
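
A minimal PyTorch sketch of learning the adapter matrix from (query, document, feedback) triples; the shapes, training loop, and random stand-in data are assumptions:

import torch
import torch.nn.functional as F

dim = 384
Q = torch.randn(1000, dim)                         # query embeddings
D = torch.randn(1000, dim)                         # retrieved-document embeddings
fb = torch.randint(0, 2, (1000,)).float() * 2 - 1  # +1 / -1 feedback per pair

A = torch.nn.Parameter(torch.eye(dim))             # adapter matrix, start at identity
opt = torch.optim.Adam([A], lr=1e-3)

for _ in range(200):
    sim = F.cosine_similarity(Q @ A.T, D @ A.T)    # similarity in the transformed basis
    loss = F.mse_loss(sim, fb)                     # MSE_Loss(Sim(Q, D), Feedback)
    opt.zero_grad(); loss.backward(); opt.step()

# At query time, rank on the cosine similarity of the transformed vectors
# (A @ q vs A @ d) instead of the raw embedding vectors.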

RAG: Retrieval performance

LLMs

Enterprise Gen AI Applications
[Recap of the architecture diagram: question + input data → RAG over the enterprise data store → prompt creation (question + document + context / few shots) → LLM (online / offline) / Agentic LLM → answer based on the document, returned to the user; focus now shifts to the LLM]

Evaluating LLMs
●Evaluating LLM via
○Human SMEs
○LLM itself
○LLM diagnostics
●Tradeoff between domain precision and scalability
●Evaluate LLMs differently based upon task type
○Objective tasks
○Subjective tasks

Evaluating LLMs: Objective tasks
●Objective tasks like entity extraction, classification, statistical analysis, etc.
●Go the classical way: prepare ground truth
○Generate validation sets and compare model output
○Quantify the results; generate Precision, Recall, F1
●Measure the consistency of the model, even at temperature=0
○Run random cases from the validation set multiple times and look at the distribution to assess variance
○Non-zero variance indicates inconsistency

Evaluating LLMs: Subjective tasks
●Subjective tasks like text summarization, translation, etc.
●Lexical evaluation (see the sketch after this list)
○ROUGE scores compare the output with a reference; higher ROUGE indicates more matching words
○ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence)
○Obtain Precision, Recall, F1 for each type of ROUGE
●Semantic evaluation
○Use embedding models to generate a similarity score between question and answer
○A related answer should not have a low score
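
A minimal lexical-evaluation sketch with the rouge-score package (pip install rouge-score); the reference and candidate texts here are placeholders:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The model summarizes the quarterly report accurately.",         # reference
    "The model gives an accurate summary of the quarterly report.")  # candidate
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")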

LLMs
Judging LLM via LLM

Evaluating LLM via LLM
●Ask another LLM to judge the entire response
●Context relevance
○Is the retrieved context relevant to the query?
■Validates the efficiency of the RAG system
●Answer relevance
○Is the response relevant to the query?
●Groundedness
○Is the response supported by the context?
●Obtain a score from 0 to 1 on each of the above

Evaluating LLM via LLM
Question: "Which field has the most exciting career opportunities?"
Answer: "AI and data science field has the most exciting career opportunities."
Answer Relevance: 1.0 | Context Relevance: 0.95 | Groundedness: 0.67

Question: "What are interesting avenues in AI?"
Answer: "Developing side projects, exploring various application areas, and collaborating with others are interesting avenues in AI."
Answer Relevance: 0.5 | Context Relevance: 0.85 | Groundedness: 0.23

Question: "What is the right AI job for applying?"
Answer: "The right AI job for applying is one that aligns with your interests and skills, offers opportunities for growth and learning, and enables you to make a meaningful impact in the field of artificial intelligence. It is important to research roles and companies, consider informational interviews to gain insights into different positions, and focus on building a strong resume and portfolio showcasing your relevant experience and expertise. Ultimately, the right AI job for you is one that supports your career goals and allows you to continue developing your skills in the field."
Answer Relevance: 1.0 | Context Relevance: 1.0 | Groundedness: 0.17

Evaluating LLM via LLM
●Sample evaluation prompt for an article text-summarization task
●Evaluate on four criteria:
○Relevance
○Coherence
○Consistency
○Fluency

Evaluating LLM via LLM
EVALUATION_PROMPT_TEMPLATE = """
You will be given one summary written for an article. Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions very carefully.
Please keep this document open while reviewing and refer to it as needed.
Evaluation Criteria: {criteria}
Evaluation Steps: {steps}
Source Text: {document}
Summary: {summary}
Evaluation Form (scores ONLY):
- {metric_name}"""

Evaluating LLM via LLM
# Metric 1: Relevance
RELEVANCY_SCORE_CRITERIA = """
Relevance(1-5) - selection of important content from the source. \
The summary should include only important information from the source document. \
Annotators were instructed to penalize summaries which contained redundancies and excess information.
"""
RELEVANCY_SCORE_STEPS = """
1. Read the summary and the source document carefully.
2. Compare the summary to the source document and identify the main points of the article.
3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant
information it contains.
4. Assign a relevance score from 1 to 5.
"""

Evaluating LLM via LLM
# Metric 2: Coherence
COHERENCE_SCORE_CRITERIA = """
Coherence(1-5) - the collective quality of all sentences. \
We align this dimension with the DUC quality question of structure and coherence \
whereby "the summary should be well-structured and well-organized. \
The summary should not just be a heap of related information, but should build from sentence to a\
coherent body of information about a topic."
"""
COHERENCE_SCORE_STEPS = """
1. Read the article carefully and identify the main topic and key points.
2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,
and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation
Criteria.
"""

Evaluating LLM via LLM
# Metric 3: Consistency
CONSISTENCY_SCORE_CRITERIA = """
Consistency(1-5) - the factual alignment between the summary and the summarized source. \
A factually consistent summary contains only statements that are entailed by the source document. \
Annotators were also asked to penalize summaries that contained hallucinated facts.
"""
CONSISTENCY_SCORE_STEPS = """
1. Read the article carefully and identify the main facts and details it presents.
2. Read the summary and compare it to the article. Check if the summary contains any factual errors that
are not supported by the article.
3. Assign a score for consistency based on the Evaluation Criteria.
"""

Evaluating LLM via LLM
# Metric 4: Fluency
FLUENCY_SCORE_CRITERIA = """
Fluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and
sentence structure.
1: Poor. The summary has many errors that make it hard to understand or sound unnatural.
2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points
are still comprehensible.
3: Good. The summary has few or no errors and is easy to read and follow.
"""
FLUENCY_SCORE_STEPS = """
Read the summary and evaluate its fluency based on the given criteria. Assign a fluency score from 1 to 3.
"""

LLMs
LLM Diagnostics

LLM Diagnostics: Obtaining confidence via "logprobs"
●Enabling logprobs in OpenAI's API returns the log probabilities of each output token

from openai import OpenAI

client = OpenAI()

def get_completion(messages: list[dict[str, str]]):
    params = {"model": "gpt-4", "messages": messages, "max_tokens": 500,
              "temperature": 0, "stop": None, "seed": 567,
              "logprobs": True, "top_logprobs": 2}
    completion = client.chat.completions.create(**params)
    return completion

LLM Diagnostics: Confidence score
CLASSIFICATION_PROMPT = """
You will be given a sentence.
Classify the sentence into one of the following categories: Physics, Medical, Maths, Geography and History.
Return only the name of the category, and nothing else.
MAKE SURE your output is one of the five categories stated.
Sentence: {sent}"""

sentences = [
    "I find quantum mechanics fascinating.",
    "The terrain on the western ghats is very rough.",
    "In calculating the trajectory of a projectile, we need to solve the quadratic equations",
    "The Aryans crossed the rough terrain of Himalayan mountains",
]

LLM Diagnostics: Confidence score
Sentence: I find quantum mechanics fascinating.
Output token 1: Physics, logprobs: -2.808727e-05, linear probability: 100.0%
Output token 2: Physics, logprobs: -10.987374, linear probability: 0.0%
Sentence: The terrain on the western ghats is very rough.
Output token 1: Ge, logprobs: -0.0011652225, linear probability: 99.88%
Output token 2: Geography, logprobs: -6.789795, linear probability: 0.11%
Sentence: In calculating the trajectory of a projectile, we need to solve the quadratic equations
Output token 1: Math, logprobs: -0.11094211, linear probability: 89.5%
Output token 2: Physics, logprobs: -2.2538788, linear probability: 10.5%
Sentence: The Aryans crossed the rough terrain of Himalayan mountains
Output token 1: History, logprobs: -0.044092752, linear probability: 95.69%
Output token 2: Ge, logprobs: -3.1452622, linear probability: 4.31%
Linear probability = $e^{\mathrm{logprob}} \times 100$

LLM Diagnostics: Checking Hallucination
Q_n_A_PROMPT = """
You retrieved this article: {article}. The question is: {question}.
Before even answering the question, consider whether you have sufficient information in the article to answer
the question fully.
Your output should JUST be the boolean true or false, indicating whether you have sufficient information in the
article to answer the question.
Respond with just one word, the boolean true or false. You must output the word 'True', or the word 'False',
nothing else."""

LLM Diagnostics: Checking Hallucination
News_Article = """As per Deadline's report on 28th July 2023, Barbie has amassed $528.6 million at the
worldwide box office as the numbers grew on Wednesday compared to Tuesday. Apart from the domestic
market, the film is enjoying superbly in overseas, with $291.4 million coming in so far. With the kind of pace it
is witnessing, the mark of $600 million is expected to be crossed by the second weekend.
Speaking about Oppenheimer, it has collected $265.1 million at the worldwide box office in the first six
days. By Friday, the film is expected to hit the $300 million milestone and will again see a spike on
Saturday and Sunday."""

Questions = [
    "Was Barbie going to France?",
    "Was Barbie happy?",
    "Did Oppenheimer do better than Barbie?",
]

LLM Diagnostics: Checking Hallucination
Question: Was Barbie going to France?
has_sufficient_context_for_answer: False, logprobs: -0.4304386, linear probability: 65.02%
Question: Was Barbie happy?
has_sufficient_context_for_answer: True, logprobs: -0.3446536, linear probability: 70.85%
Question: Did Oppenheimer do better than Barbie?
has_sufficient_context_for_answer: True, logprobs: -0.166682, linear probability: 84.65%

LLM Diagnostics: Hallucination via token confidence
Q_n_A_PROMPT = """
You retrieved this article: {article}. The question is: {question}.
Answer the question based upon the article. Just provide the answer and don't give an explanation.
"""

Questions = [
    "Was Barbie going to France?",
    "Was Barbie happy?",
    "Did Oppenheimer do better than Barbie?",
]

LLM Diagnostics: Hallucination via token confidence
Question: Did Oppenheimer do better than Barbie?
token: No, logprobs: -0.22814107, linear probability: 79.6%
token: ,, logprobs: -0.013628241, linear probability: 98.65%
token: Barbie, logprobs: -0.22969326, linear probability: 79.48%
token: did, logprobs: -0.41809866, linear probability: 65.83%
token: better, logprobs: -5.5577775e-06, linear probability: 100.0%
token: than, logprobs: -0.006389799, linear probability: 99.36%
token: Opp, logprobs: -3.4121115e-06, linear probability: 100.0%
token: en, logprobs: -2.7848862e-05, linear probability: 100.0%
token: heimer, logprobs: -3.5835506e-05, linear probability: 100.0%
token: ., logprobs: -0.7416629, linear probability: 47.63%

LLM Diagnostics: Hallucination via token confidence
Question: Was Barbie going to France?
token: No, logprobs: -0.6140927, linear probability: 54.11%
token: ,, logprobs: -0.035805512, linear probability: 96.48%
token: the, logprobs: -0.42739838, linear probability: 65.22%
token: article, logprobs: -0.0002354833, linear probability: 99.98%
token: does, logprobs: -0.061123967, linear probability: 94.07%
token: not, logprobs: -7.3458323e-06, linear probability: 100.0%
token: mention, logprobs: -0.004370779, linear probability: 99.56%
token: Barbie, logprobs: -0.3683057, linear probability: 69.19%
token: going, logprobs: -0.00020318278, linear probability: 99.98%
token: to, logprobs: -1.9816675e-06, linear probability: 100.0%
token: France, logprobs: -7.89631e-07, linear probability: 100.0%
token: ., logprobs: -0.0005583932, linear probability: 99.94%

LLM Diagnostics: Hallucination via token confidence
Question: Was Barbie happy?
token: Yes, logprobs: -0.04654289, linear probability: 95.45%
token: ., logprobs: -0.40668586, linear probability: 66.59%

LLM Diagnostics: Measuring Uncertainty
$\mathrm{Uncertainty} = e^{-\frac{1}{N}\sum_{i=1}^{N} \log p_i}$
Q_n_A_PROMPT = """
You retrieved this article: {article}. The question is: {question}.
Answer the question based upon the article."""

Questions = [
    "Was Barbie going to France?",
    "Was Barbie happy?",
    "Did Oppenheimer do better than Barbie?",
]

LLM Diagnostics: Measuring Uncertainty
Question: Did Oppenheimer do better than Barbie?
Based on the information provided in the article, Barbie has amassed $528.6 million at the worldwide box office,
while Oppenheimer has collected $265.1 million in the first six days. Therefore, Barbie has performed better than
Oppenheimer in terms of box office earnings.
Uncertainty: 1.0905944809259287
Question: Was Barbie happy?
Based on the information provided in the article, it can be inferred that Barbie would likely be happy with the
success of the film at the box office. The film has already amassed over $500 million worldwide and is expected
to cross the $600 million mark by the second weekend. This indicates that the film is performing well both
domestically and overseas, which would likely bring satisfaction to Barbie and the team behind the movie.
Uncertainty: 1.2061617920899619
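
A minimal sketch computing this uncertainty (the perplexity of the generated answer) from the returned token logprobs, matching the formula above:

import math

def uncertainty(completion) -> float:
    logprobs = [t.logprob for t in completion.choices[0].logprobs.content]
    return math.exp(-sum(logprobs) / len(logprobs))   # e^(-mean log p_i)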

LLMs
Improving LLM performance

Improving LLM performance
●Ensure the LLM gets the right context
●For objective tasks, ensure a good number of few-shot examples
●Break the problem into smaller problems
○If a big multi-task prompt is not working well, try smaller prompts with narrower objectives
●Find the right context length; a bigger context does not ensure accuracy
●Use multi-agents
○Reflection: a reviewer agent to assess its own mistakes
○ReAct, CoT (Chain of Thought): break a complex task into multiple tasks with multiple agents
●Use automated frameworks like DSPy
