AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
Alluxio
May 24, 2024
About This Presentation
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Junchen Jiang (Assistant Professor of Computer Science, @University of Chicago)
Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. While better scheduling can mitigate prefill’s impact, it would be fundamentally better to avoid (most of) prefill. This talk introduces our preliminary effort towards drastically minimizing prefill delay for LLM inputs that naturally reuse text chunks, such as in retrieval-augmented generation. While keeping the KV cache of all text chunks in memory is difficult, we show that it is possible to store them on cheaper yet slower storage. By improving the loading process of the reused KV caches, we can still significantly speed up prefill delay while maintaining the same generation quality.
Size: 7.98 MB
Language: en
Added: May 24, 2024
Slides: 31 pages
Slide Content
Reducing Prefill for LLM Serving in RAG (Using Knowledge Sharing)
Junchen Jiang
1
The Internet age is punctuated with innovations
2
[Timeline: each wave of applications paired with its key system innovations]
Search engine: data centers
Web apps: CDN
Video: MPEG-DASH, 5G
Big Data: cloud, MapReduce
Gen AI: ????
OpenAI alone already has 1.8 billion monthly visitors
(YouTube has 2.7 billion monthly visitors)
What will be the key system innovations for generative AI?
Do we know how to build the Gen AI system yet?
3
Internet video:
1990s: basics (how to build websites & players)
2000s - 2020s: building a global distributed system (P2P or CDN, video transcoding, scale-out streaming, streaming quality monitoring, DNS redirection, video caching, ...)
These took us 20 years.
Gen AI (LLMs):
2022 - 2024: basics (how to build AI apps and servers)
2024 - ????: building a global distributed system??? (we are here)
We are still at the very early stage of LLM infrastructure.
This talk: sharing knowledge across LLMs
LLMs are more powerful when paired with "knowledge"
LLMs need to read a large amount of data in real time:
the user sends a (short) query along with a (looooooooooooog) context, and the LLM produces the output text.
The context can be news, business docs, chat/shopping history, a book, ...
The prefill delay will only grow (longer contexts, bigger models),
while users get less patient.
Yet, it takes time to "learn" (prefill) the context
Today, each LLM instance answering queries about a book prefills the same context on its own,
re-deriving the same LLM-learned knowledge (the KV cache).
Knowledge Sharing
You Only Learn Once:
Once one LLM learns something, other LLMs will immediately know.
Vision: Knowledge Sharing
With knowledge sharing, only one LLM prefills the context; the LLM-learned knowledge (KV cache)
is shared, so the other LLMs answering queries about the same book skip prefilling.
Feel the speedup!
Demo: Mistral 7B on an A40 GPU, 13K-token context text.
Query 1 (w/o KV cache): 6.5 sec
Query 2 (with KV cache sharing): 0.9 sec (7x faster)
With efficient KV cache sharing (explained shortly)
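To make the idea concrete, here is a minimal sketch of prefill reuse with the Hugging Face transformers API. It assumes a recent transformers version in which generate() accepts a prefilled past_key_values; the model id, the book.txt file, and the deep copy of the cache are illustrative choices, not the speaker's setup.

# Minimal sketch of KV cache reuse across queries; not the speaker's system.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"          # assumed model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

context = open("book.txt").read()                # the long, shared context (hypothetical file)
ctx_ids = tok(context, return_tensors="pt").input_ids.to(model.device)

# Prefill the context once ("learn" it) and keep the resulting KV cache.
with torch.no_grad():
    ctx_kv = model(ctx_ids, use_cache=True).past_key_values

def answer(question: str, max_new_tokens: int = 64) -> str:
    # Reuse the context's KV cache: only the question tokens need prefilling.
    q_ids = tok(question, return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([ctx_ids, q_ids], dim=1)
    out = model.generate(input_ids,
                         past_key_values=copy.deepcopy(ctx_kv),  # copy so the cache can be reused again
                         max_new_tokens=max_new_tokens)
    return tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)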
Vision: Knowledge Sharing
Why will the same knowledge (KV cache) be reused?
20% of your knowledge is used 80% of the time (the 20-80 rule).
Faster (shorter time-to-first-token):
Ex. 5,000-token document (context) + 100-token question
With the document's KV cache, time-to-first-token is at least 50x faster.
Higher throughput:
Without prefill, generation (decoding) would be easier to batch.
On an A100 GPU, vLLM running Llama2-7B can process 5x as many requests per second.
Will it be too expensive to store the KV cache?
The KV cache is bigger than the text, but storing it on SSD is 4x cheaper than re-computing it on GPUs (see the back-of-envelope sketch below).
With longer contexts (or bigger models), KV cache size grows more slowly than prefill delay.
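A back-of-envelope estimate of why KV cache size matters for storage; the dimensions below assume a Llama2-7B-like model (32 layers, hidden size 4096, fp16) and are not taken from the slides.

# Back-of-envelope KV cache size (a sketch; dimensions assume a Llama2-7B-like model).
num_layers = 32        # transformer layers
hidden_size = 4096     # num_heads * head_dim
bytes_per_value = 2    # fp16

# K and V each store one hidden_size vector per token per layer.
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value
context_tokens = 5_000

kv_cache_bytes = kv_bytes_per_token * context_tokens
print(f"KV cache per token: {kv_bytes_per_token / 2**10:.0f} KiB")                 # ~512 KiB
print(f"KV cache for {context_tokens} tokens: {kv_cache_bytes / 2**30:.1f} GiB")   # ~2.4 GiB

The 5,000-token text itself is only tens of kilobytes, which is why the KV cache goes to cheaper but slower storage rather than staying in GPU memory.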
Architecting Efficient Knowledge Sharing
A Knowledge-Sharing System sits alongside the LLMs and has three components:
knowledge synthesis, knowledge caching, and knowledge retrieval.
Perfect fit for storage solutions, like Alluxio.
Architecting Efficient Knowledge Sharing
Challenge: the KV cache is 100,000x bigger than the text.
Simply loading it remotely is too slow.
Key technique #1: fast KV retrieval via KV encoding
(speeds up KV loading by 3-10x)
Architecting Efficient Knowledge Sharing
Challenge: if a text is not at the prefix, its KV cache cannot be reused.
Key technique #2: flexible join of multiple KV caches
Architecting Efficient Knowledge Sharing
Key technique #1: fast KV retrieval via KV encoding (speeds up KV loading by 3-10x)
Key technique #2: flexible join of multiple KV caches
CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving
14
Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu,
Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh Ananthanarayanan, Junchen Jiang
ACM SIGCOMM 2024
CacheGen: Compressing KV cache for fast prefill
15
Without reuse: the LLM prefills the loooooo...oooooong context + query, then generates the output text.
Reusing a stored KV cache: loading the KV cache + prefill on query + generate output.
Reusing a compressed KV cache: loading the compressed KV cache & decompressing it + prefill on query + generate output.
The compressed KV cache is much smaller, so prefill is faster even if the reused KV cache is loaded remotely.
10,000 ft view of CacheGen
16
Encoding: KV cache (K tensor, V tensor) → binary representations → storage
Decoding: binary representations → decompressed KV cache (K tensor, V tensor)
CacheGen: encode the KV cache into a compact binary representation
Several emerging approaches to KV compression
17
Quantizing the KV cache directly?
Dropping less important tokens from the text?
Dropping less important tokens from the KV cache?
They all keep the KV tensor's shape → complementary to CacheGen.
CacheGen can improve them too!
Can KV cache be encoded efficiently?
Analogy with video compression: a video codec encodes a video into a small size with only a small degradation in video quality (size of encoded video vs. video quality).
Likewise, we want to encode the KV cache into a small size with only a small degradation in the generated text (size of encoded KV cache vs. text quality).
We could borrow from the 20-year research literature of video compression.
Why can fewer bits represent KV cache?
19
Key distributional properties of the KV cache:
The KV cache is similar between neighboring tokens.
Some parts of the KV cache are less sensitive to quantization.
The quantized KV cache can be entropy-encoded with fewer bits.
Opportunity 1: Locality of KV cache values
20
The K tensor has shape (# of layers) x (# of tokens) x (# of channels), so K at layer j is a tokens-by-channels matrix.
Opportunity: the KV values at nearby tokens have similar values.
Delta values have much smaller variance
21
For any token i:
Original: |K_i|, |V_i|
Delta: |K_i − K_{i−1}|, |V_i − V_{i−1}|
Encode the delta between neighboring tokens, rather than the tokens themselves.
Delta values have much smaller variance → easier to quantize.
[Figure: CDF of absolute values, original vs. delta]
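To make the locality idea concrete, here is a minimal sketch (illustrative only, not CacheGen's code; the tensor shape and the toy random-walk data are assumptions) that delta-encodes a K tensor along the token dimension and compares the spread of the values.

# Sketch: delta-encode K/V along the token dimension (illustrative, not CacheGen's exact code).
import torch

def token_deltas(kv: torch.Tensor) -> torch.Tensor:
    """kv: [num_layers, num_tokens, num_channels]; returns per-token deltas (first token kept as-is)."""
    deltas = kv.clone()
    deltas[:, 1:, :] = kv[:, 1:, :] - kv[:, :-1, :]
    return deltas

# Toy tensor standing in for a real K tensor: correlated across tokens, like real KV values.
k = torch.randn(32, 1024, 4096).cumsum(dim=1) * 0.01
d = token_deltas(k)
print("std of original values:", k.std().item())
print("std of delta values:   ", d.std().item())   # much smaller -> easier to quantize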
Opportunity 2: Heterogeneous sensitivity to quantization
22
The output quality of LLM is more sensitive to losses in the KV cache values
of the shallower layers than to those in the deeper layers.
[Figure: LLM output quality (accuracy) when quantization losses are applied to different layer groups, [0, 3] through [20, 23]]
Opportunity 3: Arithmetic coding
23
Pipeline: KV cache → compute delta → quantize → adaptive arithmetic coding → more compact binary representation (e.g., 001101001110010…)
The compact binary representation can be stored on disk or sent via the network.
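Putting the three opportunities together, here is a minimal sketch of the encoding pipeline. The per-layer bit widths, the layer grouping, and the use of an empirical-entropy estimate in place of a real adaptive arithmetic coder are illustrative assumptions, not CacheGen's actual codec.

# Sketch of the encoding pipeline: delta -> per-layer quantization -> entropy coding.
# Bit widths per layer group and the entropy estimate are illustrative assumptions.
import torch

def quantize(x: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Uniformly quantize x into 2^num_bits integer levels."""
    levels = 2 ** num_bits - 1
    lo, hi = x.min(), x.max()
    return torch.round((x - lo) / (hi - lo + 1e-8) * levels).to(torch.int32)

def entropy_bits(symbols: torch.Tensor) -> float:
    """Empirical entropy (bits/symbol): what an ideal arithmetic coder would approach."""
    _, counts = symbols.flatten().unique(return_counts=True)
    p = counts.float() / counts.sum()
    return float(-(p * p.log2()).sum())

def encode_kv(kv: torch.Tensor) -> float:
    """kv: [num_layers, num_tokens, num_channels]; returns estimated compressed size in MB."""
    deltas = kv.clone()
    deltas[:, 1:, :] = kv[:, 1:, :] - kv[:, :-1, :]          # Opportunity 1: deltas between tokens
    total_bits = 0.0
    for layer, layer_delta in enumerate(deltas):
        bits = 8 if layer < 8 else (6 if layer < 16 else 4)  # Opportunity 2: finer bins for shallow layers
        q = quantize(layer_delta, bits)
        total_bits += entropy_bits(q) * q.numel()            # Opportunity 3: entropy coding
    return total_bits / 8 / 1e6

k = torch.randn(32, 1024, 4096).cumsum(dim=1) * 0.01
print(f"~{encode_kv(k):.0f} MB vs. {k.numel() * 2 / 1e6:.0f} MB in fp16")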
Reducing decoding overhead?
24
Pipeline: KV cache (K tensor, V tensor) → encoding → binary representations → loading → GPU-based decoding → decompressed KV cache (K tensor, V tensor)
Decoding and loading can be pipelined.
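A minimal sketch of how loading and decoding could be overlapped with a producer/consumer queue; the chunking, the loader, and the placeholder decoder are assumptions for illustration, not CacheGen's implementation.

# Sketch: overlap loading of KV chunks with decoding using a simple producer/consumer queue.
import queue
import threading

def load_chunks(paths, q):
    for path in paths:
        with open(path, "rb") as f:      # network or disk I/O
            q.put(f.read())
    q.put(None)                          # sentinel: no more chunks

def decode_chunk(blob: bytes):
    # Placeholder for the GPU-based entropy decode + dequantize step.
    return len(blob)

def pipelined_load_and_decode(paths):
    q = queue.Queue(maxsize=4)           # bounded buffer keeps memory use small
    threading.Thread(target=load_chunks, args=(paths, q), daemon=True).start()
    decoded = []
    while (blob := q.get()) is not None: # decode chunk i while chunk i+1 is being loaded
        decoded.append(decode_chunk(blob))
    return decoded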
Quality vs. Size & TTFT (time to first token)
26
[Figure: accuracy vs. size of KV cache (MB), and accuracy vs. time to first token (TTFT) in seconds, comparing CacheGen, uniform quantization, and full prefill (no caching)]
Setup
Dataset: LongChat (200 contexts, ~9.6K tokens each)
Model: Llama-70B
Link to load KV cache (1.6 GB at 8-bit): 3 Gbps
3x smaller KV cache size → 3-6x lower time to first token (TTFT)
Impact of context length
27
[Figure: KV cache size (MB) vs. context length (3K-20K tokens), CacheGen vs. uniform 8-bit quantization]
Setup
Model: Llama-70B
The size reduction remains across various context lengths.
Breakdown of Time to First Token (TTFT)
28
Full recompute: prefill on input (6.1 sec) + prefill on query
Naïve KV cache loading (w/o KV encoding): load (7.2 sec) + prefill on query (0.12 sec)
CacheGen (KV encoding): load + decompress (1.8 sec) + prefill on query (0.12 sec)
Setup
Dataset: LongChat (200 contexts, ~9.6K tokens each)
Model: Llama-70B
Link to load KV cache: 3 Gbps
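As a rough sanity check on these numbers (simple arithmetic using the setup above; the 3x compression ratio is taken from the earlier results slide, and pure transfer time understates the slide's measured load time, which includes other overhead):

# Back-of-envelope TTFT from the setup above (sketch; assumes load time = size / bandwidth).
kv_size_gb = 1.6          # 8-bit KV cache size from the setup slide
link_gbps = 3.0           # loading bandwidth
compression = 3.0         # CacheGen's ~3x size reduction over 8-bit quantization
prefill_query_s = 0.12    # prefill on the short query

naive_load_s = kv_size_gb * 8 / link_gbps        # ~4.3 s of pure transfer (the measured 7.2 s includes other overhead)
cachegen_load_s = naive_load_s / compression      # ~1.4 s, plus decompression time
print(f"naive:    ~{naive_load_s + prefill_query_s:.1f} s")
print(f"cachegen: ~{cachegen_load_s + prefill_query_s:.1f} s")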
Towards Efficient Knowledge Storing & Sharing
The Knowledge-Sharing System stores LLM-learned knowledge and shares it across LLMs.
Key technique #1: fast KV cache loading via KV codec (speeds up KV loading by 3-10x)
Key technique #2: flexible join of multiple KV caches
Happy to chat about technique #2 after the talk.
Try it yourself!
30
Code repo: https://github.com/uchi-jcl/cachegen
Research paper: https://arxiv.org/pdf/2310.07240.pdf
Efficient Knowledge Sharing System
31
[Figure: delay (time to first token) vs. cost (storage, compute, communication) for GPU prefill, storing the KV cache in CPU, SSD, or S3, and efficient knowledge sharing (lower is better on both axes)]
Contact me if you are a potential user or contributor to our Knowledge-Sharing System!! [email protected]