Infrastructure Challenges in Scaling RAG with Custom AI models


About This Presentation

Read more: https://zilliz.com/blog/infrastructure-challenges-in-scaling-rag-with-custom-ai-models

Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance...


Slide Content

www.bentoml.com
Unstructured Data Meetup
Infrastructure Challenges in Scaling RAG with Custom AI models
Chaoyu Yang, Founder/CEO of BentoML
Jun 3, 2024

Agenda
•Custom AI models in RAG systems - why you should care about leveraging your data and custom models for improving RAG performance
•Deploying custom model inference APIs - learn best practices in serving your fine-tuned text embedding model or LLM as inference APIs for RAG
•Advanced inference patterns for RAG - running multi-model inference pipelines as online inference APIs and offline batch inference jobs for RAG

Simple RAG System
[Diagram: unstructured and structured data are split into chunks and embedded by a Text Embedding Model into a Vector DB (Embeddings); at query time, retrieved chunks are passed to a Large Language Model for response generation.]
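As a concrete illustration of this flow, here is a minimal sketch of a simple RAG loop. It uses an in-memory cosine-similarity search as a stand-in for the vector DB; the embedding checkpoint, sample chunks, and the `llm_generate` stub are illustrative assumptions, not part of the original deck.

```python
# Minimal RAG sketch: embed chunks, retrieve by cosine similarity, generate an answer.
# The embedding model name, sample chunks, and llm_generate stub are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a fine-tuned model

chunks = [
    "BentoML serves custom models as inference APIs.",
    "RAG retrieves relevant chunks before generation.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)  # acts as our "vector DB"

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q                      # cosine similarity (vectors are normalized)
    return [chunks[i] for i in np.argsort(-scores)[:k]]

def llm_generate(prompt: str) -> str:
    # Placeholder: call your hosted LLM here (e.g., an OpenAI-compatible endpoint).
    return f"(LLM answer based on prompt: {prompt[:60]}...)"

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using the context:\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)
```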

Production RAG Challenges
Retrieval Performance
•Recall: Not all chunks that are relevant to the user query are retrieved.
•Precision: Not all chunks retrieved are relevant to the user query.
•Data Ingestion: Complex document and unstructured data
Response Synthesis
•Safeguarding: Is the user query toxic or offensive?
•Context Accuracy: Retrieved chunks lacking necessary context or containing misaligned context
Evaluate RAG responses
•Synthetic evaluation dataset: Use LLMs to bootstrap evaluation dataset
•LLMs as evaluators: use LLMs to evaluate end-to-end RAG performance
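To make the "LLMs as evaluators" idea concrete, here is a rough sketch of grading a RAG answer with an LLM judge via an OpenAI-compatible API; the prompt wording, model name, and 1-5 scale are illustrative assumptions, not from the deck.

```python
# Sketch: use an LLM as a judge to score a RAG answer for faithfulness to the retrieved context.
# Model name, prompt wording, and the 1-5 scale are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # works against OpenAI or any OpenAI-compatible endpoint via base_url

def judge_answer(question: str, context: str, answer: str) -> str:
    prompt = (
        "Rate from 1 to 5 how faithful the answer is to the context.\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
        "Reply with a single number."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```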

Why a fine-tuned Text Embedding Model
The "default" text-embedding-ada-002 is ranked 57th on the MTEB leaderboard for English
Fine-tuning optimizes embedding representations for your specific dataset
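As a rough sketch of what such fine-tuning can look like, assuming the sentence-transformers (v2-style) training API; the base checkpoint and the query-passage pairs below are placeholders for your own data.

```python
# Sketch: fine-tune a text embedding model on query–passage pairs with sentence-transformers.
# The base model and training pairs are placeholders for your own dataset.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=["How do I reset my password?", "Go to Settings > Security > Reset password."]),
    InputExample(texts=["What is the refund policy?", "Refunds are issued within 30 days of purchase."]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss treats the other in-batch passages as negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
model.save("my-finetuned-embedding-model")
```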

Why host your own LLM?
Best performing LLMs are still mostly proprietary models, but fine-tuned OSS models offer comparable performance on specific tasks.
Key questions to ask:
•What level of control do you need for security & data privacy?
•What are your latency and SLA requirements?
•What specific capabilities do you need from the LLM?
•What's the cost of running the LLM at scale?
•What's the total cost of ownership for hosting and maintaining custom LLMs?

Document Processing & Understanding
•Document layout analysis (LayoutLM)
•Table detection, structure recognition, and analysis (Table Transformer, TATR)
•Optical character recognition (EasyOCR, Tesseract)
•Visual document QA (LayoutLMv3, Donut)
•Fine-tuning on your specific documents
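As one small example from the list above, a hedged sketch of running OCR with EasyOCR on a scanned page; the image path is hypothetical.

```python
# Sketch: extract text from a scanned document page with EasyOCR.
# The image path is a placeholder; swap in your own scanned pages.
import easyocr

reader = easyocr.Reader(["en"])                 # loads detection + recognition models
results = reader.readtext("scanned_page.png")   # list of (bounding_box, text, confidence)

for box, text, conf in results:
    if conf > 0.5:                              # keep reasonably confident detections
        print(text)
```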

Metadata extraction for improved retrieval accuracy
and additional context for response synthesis
Text chunk: "I recently purchased the product and I'm extremely satisfied with it! …"
Extracted metadata: { datetime: '2024-04-11', product: "..", user_id: "..", sentiment: "positive", summary: "..", topics: [..] }
Related techniques: context-aware chunking, global-concept-aware chunking, and many more.
A reranker model fine-tuned with your dataset generally performs ~10-30% better.
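To illustrate the reranking step mentioned above, a minimal sketch using a cross-encoder from sentence-transformers; the checkpoint is a stock public model, standing in for one fine-tuned on your own query-chunk relevance data.

```python
# Sketch: rerank retrieved chunks with a cross-encoder; in practice you would
# fine-tune this reranker on your own query–chunk relevance data.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Is the customer happy with the product?"
retrieved_chunks = [
    "I recently purchased the product and I'm extremely satisfied with it!",
    "Shipping took two weeks longer than promised.",
]

scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
reranked = [c for _, c in sorted(zip(scores, retrieved_chunks), reverse=True)]
print(reranked[0])  # most relevant chunk after reranking
```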

Building
Inference APIs
For Custom Models

From Inference Script to Serving Endpoint
So you’ve got a fine-tuned embedding model ready

From Inference Script to Serving Endpoint
Simple things simple - build inference APIs for your fine-tuned text embedding model
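The original slides show this step as code; as a stand-in, here is a minimal sketch of such an endpoint using BentoML's service API. The model path, service name, and resource settings are hypothetical.

```python
# Sketch: serve a fine-tuned embedding model as an inference API with BentoML.
# Model path and resource settings are placeholders.
import bentoml
from sentence_transformers import SentenceTransformer

@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 30})
class EmbeddingService:
    def __init__(self) -> None:
        self.model = SentenceTransformer("my-finetuned-embedding-model")

    @bentoml.api
    def embed(self, texts: list[str]) -> list[list[float]]:
        # Returns one embedding vector per input text.
        return self.model.encode(texts, normalize_embeddings=True).tolist()
```

Running `bentoml serve` against this file would expose the embed method as an HTTP endpoint.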

Serving Optimizations: Dynamic Batching
Dynamically forming small batches, breaking down large batches, and auto-tuning batch size.
Dynamic batching typically brings up to 3x faster response times and ~200% improved throughput for embedding serving.
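In BentoML this is typically enabled by marking the API as batchable; a sketch building on the embedding service above, where the batch-size and latency values are illustrative and should be tuned per workload.

```python
# Sketch: enable adaptive/dynamic batching on the embedding API.
# max_batch_size and max_latency_ms are illustrative values to tune per workload.
import bentoml
from sentence_transformers import SentenceTransformer

@bentoml.service(resources={"gpu": 1})
class BatchedEmbeddingService:
    def __init__(self) -> None:
        self.model = SentenceTransformer("my-finetuned-embedding-model")

    @bentoml.api(batchable=True, max_batch_size=64, max_latency_ms=20)
    def embed(self, texts: list[str]) -> list[list[float]]:
        # BentoML groups concurrent requests into one batch; we encode them together.
        return self.model.encode(texts, normalize_embeddings=True).tolist()
```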

Deployment & Serving Infrastructure
External Queue, Auto-scaling, Instance Selection, Traffic control, Concurrency Control, and more

Self-hosting LLMs
Serving Optimizations
•Continuous batching, KV caching
•PagedAttention, FlashAttention
•Speculative decoding
•Operator fusion
•Quantization
•Output streaming
Important Metrics (see the measurement sketch below)
•Time to first token (TTFT)
•Time per output token (TPOT)
•End-to-end latency
•Throughput
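As a rough way to observe TTFT and TPOT in practice, a sketch that times a streaming completion against an OpenAI-compatible endpoint, such as one exposed by a self-hosted backend; the base URL and model name are placeholders.

```python
# Sketch: measure time-to-first-token (TTFT) and time-per-output-token (TPOT)
# against an OpenAI-compatible streaming endpoint. URL and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
num_tokens = 0

stream = client.chat.completions.create(
    model="my-llm",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        num_tokens += 1  # counting streamed chunks as a rough proxy for tokens

end = time.perf_counter()
# Assumes at least one token was streamed back.
print(f"TTFT: {first_token_at - start:.3f}s")
print(f"TPOT: {(end - first_token_at) / max(num_tokens - 1, 1):.4f}s/token")
```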

Self-hosting LLMs
Recommendations
•The field of LLM inference backends is rapidly evolving and heavily researched
•Choosing the right backend depends on your workload type, optimization target, quantization method, and model type
•Developer experience can be a significant factor, given the complexity of model compilation and of integrating fine-tuned models

Self-hosting LLMs: Inference Backend Comparison (LMDeploy, TensorRT-LLM, vLLM, MLC-LLM, TGI)

LMDeploy
•Quantization support: Yes. Supports 4-bit AWQ and 8-bit quantization options; also supports 4-bit KV quantization.
•Supported model architectures: About 20 models supported by the TurboMind engine.
•Hardware limitation: Only optimized for Nvidia CUDA.

TensorRT-LLM
•Quantization support: Partially. Quantization via modelopt, but note that quantized data types are not implemented for all models.
•Supported model architectures: 30+ models supported.
•Hardware limitation: Only supports Nvidia CUDA.

vLLM
•Quantization support: Not fully supported as of now. Need to quantize the model through AutoAWQ or find pre-quantized models on HF; performance is under-optimized.
•Supported model architectures: 30+ models supported.
•Hardware limitation: Nvidia CUDA, AMD ROCm, AWS Neuron, CPU.

MLC-LLM
•Quantization support: Yes. Supports 3-bit and 4-bit group quantization options; AWQ quantization support is still experimental.
•Supported model architectures: 20+ models supported. Does not include some models like Cohere Command-R, Arctic, etc.
•Hardware limitation: Nvidia CUDA, AMD ROCm, Metal, Android, iOS, WebGPU.

TGI
•Quantization support: Offers AWQ, GPTQ, and bits-and-bytes quantization.
•Supported model architectures: 20+ models supported.
•Hardware limitation: Nvidia CUDA, AMD ROCm, Intel Gaudi, AWS Inferentia.

Scaling LLMs Inference Service
Autoscaling: request-based metrics vs. resource utilization metrics
GPU Utilization
•A "fully utilized" GPU can often handle a lot more traffic in LLM serving
•Utilization reflects usage only after resources have been consumed, resulting in conservative scale-up behavior that doesn't match demand
QPS
•For LLM APIs, the cost of each request is not uniform
•Hard to configure the right QPS target
Concurrency (see the sketch below)
•Accurately reflects load and desired replica count
•Easy to configure based on GPU count and max batch size
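A back-of-the-envelope sketch of why concurrency is a convenient autoscaling signal; the numbers and the ceiling formula are illustrative, not from the deck.

```python
# Sketch: derive a desired replica count from in-flight requests and per-replica concurrency.
# Per-replica concurrency is roughly max_batch_size * gpus_per_replica for an LLM server.
import math

def desired_replicas(inflight_requests: int,
                     max_batch_size: int = 32,
                     gpus_per_replica: int = 1,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    per_replica_concurrency = max_batch_size * gpus_per_replica
    target = math.ceil(inflight_requests / per_replica_concurrency)
    return max(min_replicas, min(target, max_replicas))

print(desired_replicas(inflight_requests=100))  # -> 4 replicas with the defaults
```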

Scaling LLMs Inference Service
Cold start optimization for large container images with many unused files
Most files in a container image are never used. Stream-loading the container image based on the files actually requested can drastically speed up container downloading and startup time, from minutes to seconds.

Scaling LLMs Inference Service
Cold start optimization for loading large model weight files
GenAI inference requires specialized infrastructure, such as streaming model loading and efficient caching, to accelerate this process.

Scaling LLMs Inference Service
Auto-scaling based on traffic, request queue, and resource utilization
bentoml.com/blog/scaling-ai-model-deployment

Advanced
Inference Patterns
For RAG systems

Apply "Small" Language Models
Example: Toxic query detection with GPT-3.5 vs. a fine-tuned text classification model
•GPT-3.5: prompt "Determine if a user query is toxic. Reply yes or no. Here are some examples: {EXAMPLES} {User Query}" → Yes / No. Assuming 400 tokens per input sequence: cost $200 per 1 million sequences, latency ~8 seconds.
•BertForSequenceClassification: {User Query} → Yes / No. Cost: $0.07 per 1 million sequences (2800x improvement); latency: ~50 milliseconds (160x improvement).
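A minimal sketch of the small-model side, using a Hugging Face text-classification pipeline; the public checkpoint is a stand-in for your own fine-tuned BertForSequenceClassification model, and the threshold is an illustrative choice.

```python
# Sketch: toxic query detection with a small fine-tuned classifier instead of an LLM.
# The checkpoint is a public stand-in; in practice you would use your own fine-tuned model.
from transformers import pipeline

toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def is_toxic(user_query: str, threshold: float = 0.5) -> bool:
    result = toxicity(user_query)[0]            # e.g. {"label": "toxic", "score": 0.98}
    return result["label"] == "toxic" and result["score"] >= threshold

print(is_toxic("You are an idiot."))             # likely True
print(is_toxic("How do I reset my password?"))   # likely False
```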

Apply "Small" Language Models
Cold start and scaling with huge container and model files
[Diagram: Distributed Bento Deployment — a user request to /generate goes to an LLMRouter, which calls a Toxic Classifier and a Mistral LLM, each running as its own component.]
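A rough sketch of such a router composed from separate services with BentoML's bentoml.depends(); the class names, method bodies, and refusal message are hypothetical simplifications, not the deck's actual implementation.

```python
# Sketch: compose a router, a small toxic classifier, and an LLM as separate BentoML services.
# Class names, method bodies, and the refusal message are hypothetical simplifications.
import bentoml

@bentoml.service
class ToxicClassifier:
    @bentoml.api
    def classify(self, text: str) -> bool:
        # Placeholder: run a small fine-tuned classifier here.
        return "idiot" in text.lower()

@bentoml.service(resources={"gpu": 1})
class MistralLLM:
    @bentoml.api
    def generate(self, prompt: str) -> str:
        # Placeholder: run the actual LLM inference here.
        return f"(model output for: {prompt})"

@bentoml.service
class LLMRouter:
    classifier = bentoml.depends(ToxicClassifier)
    llm = bentoml.depends(MistralLLM)

    @bentoml.api
    def generate(self, prompt: str) -> str:
        # Route: cheap classifier first, LLM only for safe queries.
        if self.classifier.classify(prompt):
            return "Sorry, I can't help with that request."
        return self.llm.generate(prompt)
```

Each component can then be assigned its own hardware and scaled independently.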

Document Processing Pipelines
Layout analysis, Table extraction, Image understanding and OCR for RAG data ingestion pipeline

Long-running inference tasks
Async task submission is ideal for large PDF ingestion jobs
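The pattern, in a rough client-side sketch against a hypothetical task-style HTTP API; the endpoint paths and response fields are illustrative placeholders, not a documented interface.

```python
# Sketch: submit a long-running PDF ingestion job, then poll for its result.
# The /ingest_pdf endpoints and response fields are hypothetical placeholders.
import time
import requests

BASE_URL = "http://localhost:3000"

# 1. Submit the task and get back a task id immediately.
with open("large_report.pdf", "rb") as f:
    task_id = requests.post(f"{BASE_URL}/ingest_pdf/submit", files={"pdf": f}).json()["task_id"]

# 2. Poll until the task finishes, instead of holding a request open for minutes.
while True:
    status = requests.get(f"{BASE_URL}/ingest_pdf/status", params={"task_id": task_id}).json()
    if status["state"] in ("success", "failure"):
        break
    time.sleep(5)

# 3. Fetch the final result (e.g., extracted chunks ready for indexing).
result = requests.get(f"{BASE_URL}/ingest_pdf/result", params={"task_id": task_id}).json()
print(result)
```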

Large-scale Batch Inference Jobs
Ingest and index documents right from your cloud data warehouses
•Bring compute (models) to your data via BentoCloud BYOC
•Custom deployment and easy integration via Snowflake External Functions or Spark UDFs (see the sketch below)
•Right-size GPU clusters based on your actual workload, leveraging the same fast auto-scaling infrastructure as real-time inference
github.com/bentoml/rag-tutorials
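A hedged sketch of the Spark UDF integration: a pandas UDF that sends batches of document text to a remote embedding endpoint. The endpoint URL, its JSON response shape, and the storage paths are assumptions for illustration.

```python
# Sketch: call a remote embedding inference API from a Spark pandas UDF,
# so documents stored in the warehouse/lake are embedded in batch.
# The endpoint URL, its JSON response shape, and the paths are hypothetical.
import pandas as pd
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

EMBEDDING_URL = "http://my-embedding-service/embed"

@pandas_udf("array<float>")
def embed_text(texts: pd.Series) -> pd.Series:
    resp = requests.post(EMBEDDING_URL, json={"texts": texts.tolist()})
    return pd.Series(resp.json()["embeddings"])  # one vector per input row

spark = SparkSession.builder.getOrCreate()
docs = spark.read.parquet("s3://my-bucket/documents/")
docs.withColumn("embedding", embed_text("text")).write.parquet("s3://my-bucket/embeddings/")
```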

RAG-as-a-service
Fully private RAG deployment in your own infrastructure
•Define the entire RAG service's components within one Python file
•Compile to one versioned unit for evaluation and deployment
•Each model inference component can be assigned to different GPU shapes and scaled independently
•Model serving and inference best practices and optimizations baked in
•Auto-generated production monitoring dashboard
github.com/bentoml/rag-tutorials

Production RAG System Components
Model selection and AI-powered tooling across the pipeline:
•LLMs
•Reranker
•Text Embedding
•Layout Analysis
•OCR
•Visual Reasoning
•Chunking
•Classification for router and tool use
•Summarization
•Synthetic Data Generation
•LLM-based Evaluators
•Entity Extraction

Summary
•Production RAG often involves running multiple open-source or custom models for better performance and data privacy protection.
•"State-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models." - Compound AI Systems (Zaharia et al., 2024)
•Scaling RAG with multiple components, each with different computation and scaling needs, requires specialized infrastructure for the inference workload.

BentoML
Inference Platform for Fast-Moving AI Teams
github.com/bentoml/BentoML