[DSC DACH 24] Evaluation and Observability of Gen AI Applications - Igor Nikolaienko

About This Presentation

Discusses the importance of evaluation and observability in generative AI applications, focusing on metrics, monitoring tools, and methodologies that ensure application performance and reliability.


Slide Content

Post & Parcel Germany | September 2024. Evaluation and Observability of Generative AI Applications. Igor Nikolaienko, Generative AI Architect, P&P Innovation Management. Vienna, 12 September 2024.

Key points: Why do we use GenAI? Why is building a GenAI app complex? Why is observability essential for GenAI? How does GenAI evaluation work? Which evaluation frameworks are available?

Business Benefits of GenAI. Productivity: 45% of organizations report improved productivity. User Experience: 85% of organizations report improved user experience. Security: 82% of organizations report an improved ability to identify threats. Revenue increase: 86% of organizations seeing revenue growth above 6%.

Building a GenAI App. Full-Stack Web App: Back-End & Front-End Development, Access Control & User Management, API Integration, DevOps, Networking, Penetration Tests. LLM-Driven (RAG) App Logic: Vector Database Management; Data Chunking, Re-ranking, Filtering; Metadata Enrichment; Guardrails Implementation; Agentic Functionality; Tracing and Monitoring. My Midjourney prompt: "Badly observable spaghetti-code system".
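
As an illustration of the LLM-driven (RAG) app logic listed on this slide, here is a minimal, self-contained sketch of chunking, a toy in-memory vector store, and retrieval. It is not taken from the presentation: the "embedding" is a stand-in term-frequency vector, and names such as chunk_text and InMemoryVectorStore are hypothetical; a real system would use an embedding model and a managed vector database.

```python
# Illustrative sketch only: chunking, a toy in-memory vector store, and retrieval.
# All names are hypothetical; a real RAG app would use a proper embedding model
# and a managed vector database.
from dataclasses import dataclass
import math


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split a document into overlapping character chunks."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def embed(text: str) -> dict[str, float]:
    """Toy 'embedding': a normalized term-frequency vector (stand-in for a real model)."""
    counts: dict[str, float] = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0.0) + 1.0
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {tok: v / norm for tok, v in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two normalized sparse vectors."""
    return sum(a[t] * b[t] for t in a if t in b)


@dataclass
class Chunk:
    text: str
    vector: dict[str, float]


class InMemoryVectorStore:
    """Stand-in for a managed vector database."""

    def __init__(self) -> None:
        self.chunks: list[Chunk] = []

    def add_document(self, text: str) -> None:
        for piece in chunk_text(text):
            self.chunks.append(Chunk(piece, embed(piece)))

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        qvec = embed(query)
        ranked = sorted(self.chunks, key=lambda c: cosine(qvec, c.vector), reverse=True)
        return [c.text for c in ranked[:top_k]]


if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.add_document("Generative AI applications combine retrieval, prompting and monitoring.")
    context = store.retrieve("How do GenAI applications use retrieval?")
    print(context)  # in a real app, this context would be inserted into the LLM prompt
```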

Tackling GenAI Complexity. Observability: Predictability (Guardrails, Policies); Evaluation (Metrics/KPIs, Production Monitoring); Explainability (False Output Analysis, Tracing, Audit Logging); Testing (A/B Testing). "Introduce AI Observability to Supervise Generative AI." Modular architecture is key: Modular Open System Approach (MOSA).
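
To make "Tracing" and "Audit Logging" concrete, here is a hedged sketch of a hypothetical trace_llm_call decorator that records inputs, output, latency and errors for every LLM call as structured log lines. It is illustrative only and not part of the slides; a production setup would ship these records to a monitoring backend.

```python
# Illustrative sketch only: a hypothetical trace_llm_call decorator that adds
# basic observability (audit logging, latency, error capture) around any LLM call.
import functools
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("genai.observability")


def trace_llm_call(func):
    """Wrap an LLM call with an audit-log record: inputs, output preview, latency, errors."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        trace_id = str(uuid.uuid4())
        start = time.perf_counter()
        record = {"trace_id": trace_id, "function": func.__name__, "kwargs": kwargs}
        try:
            result = func(*args, **kwargs)
            record["status"] = "ok"
            record["output_preview"] = str(result)[:200]
            return result
        except Exception as exc:  # failures become analyzable events (false output analysis)
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            record["latency_s"] = round(time.perf_counter() - start, 3)
            logger.info(json.dumps(record))
    return wrapper


@trace_llm_call
def answer_question(prompt: str, model: str = "example-model") -> str:
    # Placeholder for the real LLM call (API client, guardrails, etc.)
    return f"[{model}] stub answer to: {prompt}"


if __name__ == "__main__":
    answer_question(prompt="Why is observability essential for GenAI?")
```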

Testing and Evaluation Methods. Prompt and Parameter Testing: Objective: optimize LLM performance by testing different prompts and application parameters. Method: A/B testing. Evaluation: Objective: assess LLM performance against reference answers using KPIs and metrics. Method: LLM-as-a-Judge. Reference Answers: comparison of LLM output to ground-truth or manually curated Q&As as benchmarks. Synthetic Q&A: leveraging automatically LLM-generated Q&As.
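
The LLM-as-a-Judge method mentioned above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: judge_llm() is a placeholder for a real model call, and the 1-to-5 grading prompt is an assumed template.

```python
# Illustrative LLM-as-a-Judge sketch. judge_llm() stands in for a real model call;
# the prompt template and the 1-5 scale are assumptions, not a fixed standard.
JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Rate the candidate answer from 1 (wrong) to 5 (fully correct and grounded).
Reply with only the number."""


def judge_llm(prompt: str) -> str:
    """Placeholder for a call to the judging model (e.g. via an API client)."""
    return "4"  # a real implementation would send `prompt` to an LLM and return its reply


def score_answer(question: str, reference: str, candidate: str) -> int:
    """Ask the judge model to grade a candidate answer against a reference answer."""
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate)
    return int(judge_llm(prompt).strip())


def evaluate_benchmark(qa_pairs: list[dict], generate_answer) -> float:
    """Average judge score over ground-truth or synthetic Q&A pairs."""
    scores = [
        score_answer(item["question"], item["reference"], generate_answer(item["question"]))
        for item in qa_pairs
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    benchmark = [{"question": "What is RAG?", "reference": "Retrieval-Augmented Generation."}]
    print(evaluate_benchmark(benchmark, lambda q: "RAG means Retrieval-Augmented Generation."))
```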

Metrics for a Talk-to-Your-Data (RAG) App. Flow: (1) User Query → (2) Knowledge Base → (3) Relevant Context → (4) LLM Call → (5) LLM Response. Evaluation signals: Correctness (against a Reference Response), Context Relevance, Faithfulness, Latency, Cost, User Feedback. "RAG is the Taylor Swift of Gen AI."
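
One way to operationalize these signals is to log a structured evaluation record per user query. The sketch below is an assumption, not taken from the slides: the metric values would in practice come from LLM-as-a-Judge scorers or an evaluation framework, with latency and cost from the serving layer and user feedback from the UI.

```python
# Assumed sketch (not from the slides): one structured evaluation record per
# user query in a Talk-to-Your-Data (RAG) app. Metric values are placeholders.
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class RagEvalRecord:
    query: str
    retrieved_context: list[str]
    response: str
    reference_response: str | None  # ground-truth answer, if available
    correctness: float              # response compared to the reference response
    context_relevance: float        # retrieved context compared to the query
    faithfulness: float             # response grounded in the retrieved context
    latency_s: float                # end-to-end response time in seconds
    cost_usd: float                 # token / API cost of the call
    user_feedback: int | None       # e.g. thumbs up (+1) / thumbs down (-1)


def log_rag_evaluation(record: RagEvalRecord) -> None:
    """Emit one evaluation record per query, e.g. to a monitoring store."""
    print(json.dumps(asdict(record)))


if __name__ == "__main__":
    log_rag_evaluation(RagEvalRecord(
        query="Which delivery options are available?",
        retrieved_context=["Standard and express delivery are available."],
        response="You can choose standard or express delivery.",
        reference_response="Standard and express delivery options exist.",
        correctness=0.9,
        context_relevance=0.8,
        faithfulness=1.0,
        latency_s=1.2,
        cost_usd=0.003,
        user_feedback=1,
    ))
```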

Evaluation Frameworks (not an exhaustive list).

GenAI Platform: LangSmith
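
Below is a minimal sketch of how such a platform is typically wired in, assuming the langsmith Python SDK and its @traceable decorator; the retrieval and LLM functions are placeholders, and an API key is expected via environment variables (check the current LangSmith documentation for exact configuration and run types).

```python
# Minimal sketch of tracing a RAG pipeline with LangSmith's @traceable decorator.
# Assumes the `langsmith` SDK is installed and tracing is configured via
# environment variables; the pipeline functions are placeholders, not real calls.
from langsmith import traceable


@traceable(run_type="retriever")
def retrieve_context(query: str) -> list[str]:
    # Placeholder for a vector-store lookup
    return ["Parcel tracking is available via the mobile app."]


@traceable(run_type="llm")
def call_llm(prompt: str) -> str:
    # Placeholder for the actual model call
    return "You can track parcels in the mobile app."


@traceable(run_type="chain", name="rag_pipeline")
def rag_pipeline(query: str) -> str:
    context = retrieve_context(query)
    prompt = f"Answer using this context: {context}\n\nQuestion: {query}"
    return call_llm(prompt)


if __name__ == "__main__":
    print(rag_pipeline("How can I track my parcel?"))
    # With tracing configured, each nested call appears as a run in LangSmith,
    # where datasets, evaluators and user feedback can be attached.
```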

Summary: GenAI is an opportunity. Building GenAI is complex. Modular architecture is key. Observability is essential. Establish evaluation methods. Define metrics and KPIs. Choose an evaluation framework.

Thank you! "Sell me this pen." "It has Generative AI."