"Scaling RAG Applications to serve millions of users", Kevin Goedecke
fwdays
Jun 18, 2024
About This Presentation
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from the technical challenges of managing high load for LLMs, RAG pipelines and vector databases.
Size: 2.09 MB
Language: en
Added: Jun 18, 2024
Slides: 22 pages
Slide Content
Scaling RAG applications to serve millions of users
Lessons learned scaling LLMs, Vector databases and more…
What we’ll cover today…
Agenda
- Who is SlideSpeak? What do we do?
- Our growth so far
- Mistakes we’ve made
- Architecture overview
- Challenges of scaling RAG applications
- Challenges of scaling Vector databases
- What we’re currently busy with…
- Q&A
What we do
SlideSpeak is an AI-powered platform to create and automate presentations.
We save people hours of work by speeding up manual presentation workflows.
Some of our key features:
- Create presentations from Word, PDF or Excel files
- Create presentations for any topic
Our growth so far
We launched 6 months ago, here’s what we’ve achieved:
- >2m files uploaded
- >2m LLM tokens consumed per minute
- >1000 LLM calls per minute
- 250k MAUs
Quick recap on RAG
[Diagram: the user’s Query goes to the Vector DB (Retrieval), the retrieved Context plus the Prompt go to the LLM (Inference), and the LLM returns the Response to the user.]
Mistakes we’ve made… so far…
- Using LLMs for everything
- Storing vector data forever
- No monitoring
- Downtimes cost us thousands of $$$
Storing vectors is expensive, like really expensive…
Avoid using LLMs if not absolutely necessary.
Scaling LLM providers
The problem
- Rate limits are not as high as they seem
- Difficult to balance RPM and TPM
- A 40-page document has on avg. 24k tokens; with a 2m TPM limit that’s 83 documents per minute, or roughly 1.4 per second
OpenAI Rate Limits (6/10/24)
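The throughput arithmetic above can be checked directly; the limit and token figures are the slide’s examples, not guaranteed quota tiers:

```python
TPM_LIMIT = 2_000_000     # example tokens-per-minute limit from the slide
TOKENS_PER_DOC = 24_000   # average tokens in a ~40-page document

docs_per_minute = TPM_LIMIT // TOKENS_PER_DOC     # whole documents per minute
docs_per_second = TPM_LIMIT / TOKENS_PER_DOC / 60

print(docs_per_minute)            # 83
print(round(docs_per_second, 2))  # 1.39
```

So a single 2m-TPM deployment tops out at roughly 1.4 such documents per second, regardless of how many requests per minute are nominally allowed.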
Scaling LLM providers
The solution
- We’ve migrated to Azure OpenAI
- Not because the rate limits are higher, but because you can load balance
https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models

Scaling LLM providers
Load Balancing in Azure
Source: https://techcommunity.microsoft.com/t5/fasttrack-for-azure/smart-load-balancing-for-openai-endpoints-and-azure-api/ba-p/3991616
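The load-balancing idea can be sketched as a round-robin over several Azure OpenAI deployments. The endpoint URLs are hypothetical and the actual HTTP call is omitted; Microsoft’s article linked above describes a fuller API-Management-based setup.

```python
import itertools

class EndpointBalancer:
    """Round-robin across deployments, skipping ones that are rate limited."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self._cycle = itertools.cycle(self.endpoints)

    def next_endpoint(self, unhealthy=frozenset()):
        # Skip deployments that recently returned HTTP 429 (rate limited);
        # try at most one full lap around the pool
        for _ in range(len(self.endpoints)):
            ep = next(self._cycle)
            if ep not in unhealthy:
                return ep
        raise RuntimeError("all deployments unavailable")

# Hypothetical regional deployments
balancer = EndpointBalancer([
    "https://eastus.example-openai.azure.com",
    "https://westeurope.example-openai.azure.com",
])
```

Spreading traffic over per-region deployments is what makes the aggregate TPM budget larger than any single deployment’s limit.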
Why do we need Vector Databases?
- Find similar information in a lot of data
- Feed that to the LLM as context
Challenges of scaling Vector Databases
5 Problems when scaling Postgres with PGVector
- Slow queries
- Memory intensive
- Combining vector search and metadata search (also called hybrid search) needs careful query optimization
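To illustrate the hybrid-search point: in PGVector a metadata filter and vector similarity can live in one query (`<=>` is pgvector’s cosine-distance operator). The schema and column names below are hypothetical; the function only assembles the SQL string.

```python
def hybrid_search_sql(table="documents", top_k=5):
    # Metadata filter (filetype) plus vector-similarity ordering in one query.
    # With a selective WHERE clause the planner may bypass the vector index,
    # which is why this pattern needs careful query optimization.
    return (
        f"SELECT id, content FROM {table} "
        f"WHERE filetype = %(filetype)s "
        f"ORDER BY embedding <=> %(query_embedding)s "
        f"LIMIT {top_k};"
    )
```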
Challenges of scaling Vector Databases
A vector is an array of 4-byte floating-point numbers. With 1536-dimensional embeddings:

Number of Vectors    Total Size
7 Thousand           7,000 × 1536 × 4 ≈ 43 Megabytes
1 Million            1,000,000 × 1536 × 4 ≈ 5.7 Gigabytes
10 Million           10,000,000 × 1536 × 4 ≈ 57 Gigabytes
1 Billion            1,000,000,000 × 1536 × 4 ≈ 5.5 Terabytes
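The table’s arithmetic, assuming 1536-dimensional embeddings of 4-byte floats (raw storage only; index overhead comes on top):

```python
def vector_storage_bytes(num_vectors, dims=1536, bytes_per_float=4):
    # Raw embedding storage only; an HNSW index adds further overhead
    return num_vectors * dims * bytes_per_float

for n in (7_000, 1_000_000, 10_000_000, 1_000_000_000):
    gib = vector_storage_bytes(n) / 2**30
    print(f"{n:>13,} vectors -> {gib:,.2f} GiB")
```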
Challenges of scaling Vector Databases
Build an index over the vector data.
HNSW: efficient approximate nearest neighbor (ANN) search algorithm for high-dimensional data.
- ef_construction: defines how many similarity candidates to consider while building the index
- m: defines how many of the closest neighbors to pick from the ef_construction candidate list
** This might be more tricky if you use hybrid search (metadata + vector data)
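For reference, creating such an index with pgvector (0.5.0+) looks like this. The table and column names are placeholders, and the `m`/`ef_construction` values shown are pgvector’s defaults, not a tuning recommendation:

```python
def hnsw_index_sql(table="documents", column="embedding",
                   m=16, ef_construction=64):
    # Higher m / ef_construction -> better recall, but slower and more
    # memory-hungry index builds
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )

print(hnsw_index_sql())
```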
What we’ve done to scale PGVector
Use partitioning.
Define what could be a good partition value for you (date, filetype, category, …)
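A sketch of date-based range partitioning for the vector table; the schema is hypothetical, and list partitioning by filetype or category follows the same pattern:

```python
# Hypothetical DDL: range-partition the documents table by creation date
PARTITIONED_TABLE_SQL = """
CREATE TABLE documents (
    id bigserial,
    created_at date NOT NULL,
    embedding vector(1536)
) PARTITION BY RANGE (created_at);

CREATE TABLE documents_2024_h1 PARTITION OF documents
    FOR VALUES FROM ('2024-01-01') TO ('2024-07-01');
"""

print(PARTITIONED_TABLE_SQL)
```

Queries that filter on the partition key only touch the matching partitions, which keeps both the scanned rows and the per-partition vector indexes smaller.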
General tips when working with PGVector
01 Make sure to delete unnecessary vector data
02 Never use the same Postgres database for vectors and other data

What we’ve done to scale PGVector
If nothing helps… scale horizontally
What we’re currently busy with…
Here’s what currently keeps us up at night…

RAG Evaluation
- Implementing robust evaluation testing methods
- Exploring Ragas for advanced system performance

System Extension
- Extend system to cover images and other non-textual data

Azure AI Postgres Extension
- Enabling direct creation of embeddings within Azure