Evaluation as an Essential Component of the Generative AI Lifecycle
webmaxru
93 views
23 slides
Feb 27, 2025
Slide 1 of 23
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
About This Presentation
In this session, we’ll explore how systematic evaluation ensures generative AI applications are reliable, safe, and effective across their lifecycle. From selecting the right base model to rigorous pre-production testing and ongoing post-deployment monitoring, evaluation helps teams address risks ...
In this session, we’ll explore how systematic evaluation ensures generative AI applications are reliable, safe, and effective across their lifecycle. From selecting the right base model to rigorous pre-production testing and ongoing post-deployment monitoring, evaluation helps teams address risks like misinformation, biases, and security vulnerabilities. Learn how to integrate evaluation into every stage of development to build AI solutions that deliver high-quality user experiences, foster trust, and adapt seamlessly to real-world demands.
Size: 1.06 MB
Language: en
Added: Feb 27, 2025
Slides: 23 pages
Slide Content
EVALUATION AS AN ESSENTIAL COMPONENT OF
THE GENERATIVE AI LIFECYCLE
Maxim Salnikov
Evaluation in
GenAI Lifecycle
I’M MAXIM SALNIKOV
•Building on web platform since 90s
•Organizing developer communities and
technical conferences
•Speaking, training, blogging: Webdev,
Cloud, Generative AI,Prompt Engineering
Helping developers to succeed with the Dev Tools, Cloud & AI in Microsoft
PROMPTENGINEERING.ROCKS
Evaluation in
GenAI Lifecycle
WHY EVALUATION MATTERS
•Ensures AI reliability and trustworthiness
•Prevents misinformation, bias, and security risks
•Optimizes quality and performance
Evaluation in
GenAI Lifecycle
GENAIOPS LIFECYCLE
Evaluation in
GenAI Lifecycle
STAGE 1: BASE MODEL SELECTION
•Accuracy/quality
How well does the model generate relevant and coherent responses?
•Performance on specific tasks
Can the model handle the type of prompts and content required for your use case? How is its
latency and cost?
•Bias and ethical considerations
Does the model produce any outputs that might perpetuate or promote harmful stereotypes?
•Risk and safety
Are there any risks of the model generating unsafe or malicious content?
Evaluation in
GenAI Lifecycle
Evaluation in
GenAI Lifecycle
STAGE 2: PRE-PRODUCTION TESTING
•Evaluate responses using test datasets
•Identify edge cases
•Assessing robustness
•Measuring key metrics
Evaluation in
GenAI Lifecycle
PRE-PRODUCTION EVALUATION
Evaluation in
GenAI Lifecycle
TEST DATASETS
•Bring your own data
•Use simulators:
•Context-appropriate
•Adversarial
Evaluation in
GenAI Lifecycle
HANDLING EDGE CASES
•Unexpected or adversarial inputs
•Ethical dilemmas and bias detection
•Performance under varying conditions
Evaluation in
GenAI Lifecycle
COMMON AI EVALUATION METRICS
•Groundedness and relevance
•Fairness and bias detection
•Safety and security assessments
Evaluation in
GenAI Lifecycle
RISK AND SAFETY EVALUATORS
•Hateful and unfair content
•Sexual content
•Violent content
•Self-harm-related content
•Indirect attack jailbreak
•Direct attack jailbreak
•Protected material content
Severity levels:
•Very low
•Low
•Medium
•High
Evaluation in
GenAI Lifecycle
GENERATION QUALITY METRICS
Evaluation in
GenAI Lifecycle
EXAMPLE: ROUGE
Application:
•Text summarization, paraphrase generation, machine translation, etc. It emphasizes recall - the
proportion of reference content that is captured in the generated text.
Definition:
•Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measures the overlap of n-grams
(sequences of n words), words sequences, and word-pairs between the candidate text (the model's
output) and one or more reference texts (human-provided ground truths).
ROUGE-2:
•Candidate: “the cat was found under the bed”
Bigrams: the cat, cat was, was found, found under, under the, the bed
•Reference: “the cat was under the bed”
Bigrams: the cat, cat was, was under, under the, the bed
•Rouge-2 Recall = # overlaps / # reference bigrams = 4 / 5 = 0.8
•Rouge-2 Precision = # overlaps / # candidate bigrams = 4 / 6 = 0.67
Evaluation in
GenAI Lifecycle
AI-ASSISTED EVALUATORS
•Groundedness
•Retrieval
•Relevancy
•Coherence
•Fluency
•Similarity
RAG triad
Business writing
NLP
Evaluation in
GenAI Lifecycle
EXAMPLE: FLUENCY
Application:
•Generative business writing such as summarizing meeting notes, creating marketing materials,
and drafting email.
Definition:
•Fluency refers to the effectiveness and clarity of written communication, focusing on
grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability.
It assesses how smoothly ideas are conveyed and how easily the text can be understood by the
reader.
Ratings:
•[Fluency: 1] (Emergent Fluency)Definition: The response shows minimal command of the
language. It contains pervasive grammatical errors, extremely limited vocabulary, and
fragmented or incoherent sentences. The message is largely incomprehensible, making
understanding very difficult.
•…
•[Fluency: 5] (Exceptional Fluency)Definition: The response demonstrates an exceptional
command of language with sophisticated vocabulary and complex, varied sentence structures. It's
coherent, cohesive, and engaging, with precise and nuanced expression. Grammar is flawless,
and the text reflects a high level of eloquence and style.
Evaluation in
GenAI Lifecycle
Evaluation in
GenAI Lifecycle
STAGE 3: POST-PRODUCTION MONITORING
•Track ongoing performance metrics
•Detect and address failures in real-world use
•Ensure adaptability to evolving user behavior
Evaluation in
GenAI Lifecycle
BONUS: EVALUATING AI AGENTS
•The initial model request
•The agent's ability to identify the intent of the user
•The agent's ability to identify the right tool to perform the task
•The tool's response to the agent's request
•The agent's ability to interpret the tool's response
•The user's feedback to the agent's response
Evaluation in
GenAI Lifecycle
METACOGNITION
•Self-Reflection: Agents can assess their own performance and
identify areas for improvement.
•Adaptability: Agents can modify their strategies based on
past experiences and changing environments.
•Error Correction: Agents can detect and correct errors
autonomously, leading to more accurate outcomes.
•Resource Management: Agents can optimize the use of
resources, such as time and computational power, by planning
and evaluating their actions.
Evaluation in
GenAI Lifecycle
BUILDING A STRONG EVALUATION STRATEGY
•Use diverse evaluation datasets
•Implement iterative, automated testing
•Continuously refine based on feedback
Evaluation in
GenAI Lifecycle
MAKING AI EVALUATION A PRIORITY
•Evaluation ensures trust and reliability
•It must be an ongoing, iterative process
•A well-evaluated AI system is safer, more accurate, and more
effective
Evaluation in
GenAI Lifecycle
THANK YOU!
Let’s connect and chat:
•Maxim Salnikov on LinkedIn