Discuss the importance of evaluation and observability in generative AI applications, focusing on metrics, monitoring tools, and methodologies to ensure AI app performance and reliability.
Slide Content
Evaluation and Observability of Generative AI Applications
Igor Nikolaienko, Generative AI Architect, P&P Innovation Management
Post & Parcel Germany | Vienna, 12 September 2024
Key Points
- Why do we use GenAI?
- Why is building a GenAI app complex?
- Why is observability essential for GenAI?
- How does GenAI evaluation work?
- What evaluation frameworks are available?
Business Benefits of GenAI
- Productivity: 45% of organizations report improved productivity
- User Experience: 85% of organizations report improved user experience
- Security: 82% of organizations report an improved ability to identify threats
- Revenue: 86% of organizations see revenue growth above 6%
Building a GenAI App
Full-Stack Web App:
- Back-End & Front-End Development
- Access Control & User Management
- API Integration
- DevOps, Networking
- Penetration Tests
LLM-Driven (RAG) App Logic:
- Vector Database Management
- Data Chunking, Re-ranking, Filtering
- Metadata Enrichment
- Guardrails Implementation
- Agentic Functionality
- Tracing and Monitoring
My Midjourney prompt: "Badly observable spaghetti-code system"
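To make the "LLM-Driven (RAG) App Logic" column concrete, here is a minimal sketch of the chunking, retrieval, and generation steps. All names (Chunk, chunk_text, retrieve, call_llm, answer_query) are illustrative placeholders rather than anything from the slides; a real app would use a vector database and an embedding model instead of the toy keyword overlap used here.

```python
# Minimal RAG pipeline sketch: chunking -> retrieval -> LLM call.
# Everything below is an illustrative assumption, not a specific product's API.

from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str

def chunk_text(doc_id: str, text: str, size: int = 200) -> list[Chunk]:
    """Data chunking: split a document into fixed-size character windows."""
    return [Chunk(doc_id, text[i:i + size]) for i in range(0, len(text), size)]

def retrieve(query: str, chunks: list[Chunk], k: int = 3) -> list[Chunk]:
    """Toy retrieval: rank chunks by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_words & set(c.text.lower().split())),
                    reverse=True)
    return scored[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for the real LLM call (hosted API or local model)."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

def answer_query(query: str, chunks: list[Chunk]) -> str:
    """Steps: user query -> knowledge base -> relevant context -> LLM call -> response."""
    context = "\n---\n".join(c.text for c in retrieve(query, chunks))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)
```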
Tackling GenAI Complexity: Observability
- Predictability: Guardrails, Policies
- Evaluation: Metrics / KPIs, Production Monitoring
- Explainability: False-Output Analysis, Tracing, Audit Logging
- Testing: A/B Testing
"Introduce AI Observability to Supervise Generative AI"
Modular architecture is key: Modular Open Systems Approach (MOSA)
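One way to get the tracing and audit logging mentioned above is to wrap every LLM call so that inputs, outputs, latency, and errors are recorded for later false-output analysis. The sketch below is a generic, hand-rolled example; the traced() decorator and the log format are assumptions, not tied to any specific observability product.

```python
# Minimal tracing / audit-logging sketch for LLM calls.

import functools
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("genai.audit")

def traced(fn):
    """Record every call's inputs, output, latency, and errors as JSON log lines."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        trace_id = str(uuid.uuid4())
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            audit_log.info(json.dumps({
                "trace_id": trace_id, "call": fn.__name__,
                "latency_s": round(time.perf_counter() - start, 3),
                "inputs": repr(args)[:500], "output": repr(result)[:500],
                "status": "ok",
            }))
            return result
        except Exception as exc:
            audit_log.error(json.dumps({
                "trace_id": trace_id, "call": fn.__name__,
                "latency_s": round(time.perf_counter() - start, 3),
                "error": str(exc), "status": "error",
            }))
            raise
    return wrapper

@traced
def call_llm(prompt: str) -> str:
    return "stubbed model answer"   # replace with the real model call
```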
Testing and Evaluation Methods
Prompts and Parameters Testing
- Objective: Optimize LLM performance by testing different prompts and application parameters.
- Method: A/B Testing
Evaluation
- Objective: Assess LLM performance against reference answers using KPIs and metrics.
- Method: LLM-as-a-Judge
- Reference Answers: Compare LLM output to ground-truth or manually curated Q&As used as benchmarks.
- Synthetic Q&A: Leverage automatically LLM-generated Q&As.
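A minimal sketch of the LLM-as-a-Judge method follows: a second model grades the application's answer against a reference (ground-truth) answer. judge_llm() is a placeholder for a real model call, and the rubric and 1-to-5 scale are illustrative assumptions, not prescribed by the slides.

```python
# LLM-as-a-Judge sketch: grade an application answer against a reference answer.

JUDGE_PROMPT = """You are grading a generative AI application.
Question: {question}
Reference answer: {reference}
Application answer: {answer}

Rate the application answer for correctness against the reference on a
scale from 1 (wrong) to 5 (fully correct). Reply with the number only."""

def judge_llm(prompt: str) -> str:
    """Placeholder for the judge model (typically a strong general-purpose LLM)."""
    return "4"

def llm_as_a_judge(question: str, reference: str, answer: str) -> int:
    reply = judge_llm(JUDGE_PROMPT.format(question=question,
                                          reference=reference,
                                          answer=answer))
    return int(reply.strip())

# Illustrative curated Q&A pair (invented for this example):
score = llm_as_a_judge(
    question="What is the maximum parcel length?",
    reference="Parcels may not exceed 120 cm in length.",
    answer="The maximum parcel length is 120 cm.",
)
print(score)
```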
Metrics for a Talk-to-Your-Data (RAG) App
Pipeline: 1. User Query → 2. Knowledge Base → 3. Relevant Context → 4. LLM Call → 5. LLM Response
Metrics: Reference Response, Correctness, Context Relevance, Faithfulness, Latency, Cost, User Feedback
"RAG is the Taylor Swift of Gen AI"
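The sketch below shows one way to collect these per-request metrics. The RagMetrics record and estimate_cost() are illustrative assumptions: correctness, context relevance, and faithfulness would typically come from LLM-as-a-Judge calls like the one sketched above, while latency, cost, and user feedback are recorded directly by the application.

```python
# Per-request RAG metrics record (illustrative sketch, not a framework API).

import time
from dataclasses import dataclass

@dataclass
class RagMetrics:
    correctness: float         # answer vs. reference response (judge score)
    context_relevance: float   # retrieved context vs. user query (judge score)
    faithfulness: float        # answer grounded in retrieved context (judge score)
    latency_s: float           # wall-clock time of the full request
    cost_usd: float            # token usage priced with the model's rates
    user_feedback: int | None  # e.g. thumbs up (+1) / thumbs down (-1), if given

def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  in_rate: float = 0.15 / 1e6, out_rate: float = 0.60 / 1e6) -> float:
    """Illustrative per-token pricing; real rates depend on the model used."""
    return prompt_tokens * in_rate + completion_tokens * out_rate

start = time.perf_counter()
# ... run steps 1-5 of the RAG pipeline here ...
metrics = RagMetrics(
    correctness=0.8, context_relevance=0.9, faithfulness=1.0,  # placeholder judge scores
    latency_s=time.perf_counter() - start,
    cost_usd=estimate_cost(prompt_tokens=1200, completion_tokens=300),
    user_feedback=None,
)
print(metrics)
```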
Evaluation Frameworks (not an exhaustive list)
GenAI Platform: LangSmith
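For the platform named on this slide, the sketch below shows the basic pattern of sending traces to LangSmith by decorating application functions. It assumes `pip install langsmith` and an API key; the environment variable names and the @traceable decorator follow the LangSmith documentation at the time of writing, so verify them against the current docs before relying on this.

```python
# Minimal LangSmith tracing sketch (verify names against current LangSmith docs).

import os
from langsmith import traceable

os.environ.setdefault("LANGSMITH_TRACING", "true")             # enable tracing
os.environ.setdefault("LANGSMITH_API_KEY", "<your API key>")   # normally set outside the code
os.environ.setdefault("LANGSMITH_PROJECT", "pnp-genai-demo")   # hypothetical project name

@traceable(name="rag_answer")   # each call becomes a trace in LangSmith
def rag_answer(question: str) -> str:
    context = "retrieved context goes here"                    # placeholder retrieval step
    return f"Answer to {question!r} based on: {context}"       # placeholder LLM step

if __name__ == "__main__":
    print(rag_answer("How do I track a parcel?"))
```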
Summary
- GenAI is an opportunity.
- Building GenAI is complex.
- Modular architecture is key.
- Observability is essential.
- Establish evaluation methods.
- Define metrics and KPIs.
- Choose an evaluation framework.
Thank you!
- "Sell me this pen."
- "It has Generative AI."