AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG


About This Presentation

AI/ML Infra Meetup
May 23, 2024
Organized by Alluxio

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Junchen Jiang (Assistant Professor of Computer Science, @University of Chicago)

Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. Whi...


Slide Content

Reducing Prefill for LLM Serving in
RAG (Using Knowledge Sharing)
Junchen Jiang
1

The Internet age is punctuated with innovations
2
Timeline: Web apps → CDN | Video → MPEG-DASH, 5G | Big Data → Cloud, MapReduce | Search engine → Data centers | Gen AI → ????
OpenAI alone already has 1.8 billion monthly visitors
(YouTube has 2.7 billion monthly visitors)
What will be the key system innovations for generative AI?

Do we know how to build the Gen AI system yet?
3
Internet video
- 1990s: basics (how to build websites & players)
- 2000s – 2020s: building a global distributed system (P2P or CDN, video transcoding, scale-out streaming, streaming quality monitoring, DNS redirection, video caching, …)
Gen AI (LLMs)
- 2022 – 2024: basics (how to build AI apps and servers)
- 2024 – ????: building a global distributed system (???); we are here
These took us 20 years for Internet video; we are still at the very early stage of LLM infrastructure
This talk: sharing knowledge across LLMs

LLMs are more powerful when paired with "knowledge"
LLMs need to read a large amount of data in real time
(loooooooooooong) contexts + (short) query from the user → LLM → output text
Knowledge sources: news, business docs, chat/shopping history, books

Yet, it takes time to "learn" (prefill) the context
Queries about a book: every LLM instance must prefill the same context to build the LLM-learned knowledge (KV cache)
The prefill delay will only grow (longer contexts, bigger models), while users get less patient.

Vision: Knowledge Sharing
You Only Learn Once: once one LLM learns something, other LLMs will immediately know it
Queries about a book: only one LLM prefills; the LLM-learned knowledge (KV cache) is shared with the other LLMs

Feel the speedup!
Setup: Mistral 7B on an A40 GPU, context text of 13K tokens
Query 1 (w/o KV cache): 6.5 sec
Query 2 (with shared KV cache): 0.9 sec (7x faster)
With efficient KV cache sharing (explained shortly)
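As an aside (not code from the talk), a minimal sketch of the idea using the Hugging Face transformers API: prefill the long context once, keep the returned KV cache, and reuse it for every query so only the short query is prefilled. The model name, file name, and prompts are placeholders, and how the cache object is represented varies across transformers versions.

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"   # assumption: any decoder-only HF model works here
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

context = open("book.txt").read()        # the long, shared context (hypothetical file)
ctx_ids = tok(context, return_tensors="pt").input_ids.to(model.device)

# Prefill the context once; this is the expensive step we want to avoid repeating.
with torch.no_grad():
    shared_kv = model(ctx_ids, use_cache=True).past_key_values   # the "learned knowledge"

def answer(query, max_new_tokens=64):
    # Only the short query is prefilled; the context's KV cache is reused.
    q_ids = tok(query, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
    ids = torch.cat([ctx_ids, q_ids], dim=-1)
    out = model.generate(ids,
                         past_key_values=copy.deepcopy(shared_kv),  # copy so queries don't interfere
                         max_new_tokens=max_new_tokens,
                         do_sample=False)
    return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)

print(answer("Question: who is the protagonist? Answer:"))
print(answer("Question: where does chapter 3 begin? Answer:"))  # no context prefill this time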

Vision: Knowledge Sharing
Why will the same knowledge (KV cache) be reused?
20% of your knowledge is used 80% of the time. (20-80 rule)
Faster (shorter time-to-first-token)
Ex. 5,000-token document (context) + 100-token question
With document's KV cache, time-to-first-token is at least 50x faster
Higher throughput
Without prefill, generation (decoding) would be easier to batch
On an A100 GPU, vLLM running Llama2-7B can process 5x more requests per second*
Will it be too expensive to store KV cache?
KV cache is bigger than text but storing it on SSD is 4x cheaper than re-computing it on GPUs.
With longer contexts (or bigger models), KV cache size grows slower than prefill delay.
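To make "KV cache is bigger than text" concrete, here is a back-of-envelope size estimate. The architecture numbers are assumptions for a Llama-2-7B-like model (32 layers, 32 KV heads, head dim 128, fp16), not figures from the slides.

# Back-of-envelope: KV cache size vs. raw text size for a 5,000-token document.
layers, kv_heads, head_dim, bytes_per_val = 32, 32, 128, 2   # fp16, assumed architecture
tokens = 5_000

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val   # K and V
kv_cache_bytes = tokens * kv_bytes_per_token
text_bytes = tokens * 4                      # ~4 bytes of raw text per token, rough

print(f"KV cache : {kv_cache_bytes / 2**30:.2f} GiB")        # ~2.4 GiB
print(f"raw text : {text_bytes / 2**10:.0f} KiB")            # ~20 KiB
print(f"ratio    : ~{kv_cache_bytes // text_bytes:,}x")      # on the order of 100,000x

This is consistent with the "KV cache is 100,000x bigger than text" figure that appears a few slides later.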

Architecting Efficient Knowledge Sharing
Knowledge-Sharing System serving many LLMs, with three parts: knowledge synthesis, knowledge caching, knowledge retrieval

Architecting Efficient Knowledge Sharing
Knowledge-Sharing System: knowledge synthesis, knowledge caching, knowledge retrieval
Perfect fit for storage solutions, like Alluxio

Architecting Efficient Knowledge Sharing
Challenge: KV cache is 100,000x bigger than text; simply loading it remotely is too slow
Key technique #1: fast KV retrieval via KV encoding (speeds up KV loading by 3-10x)

Architecting Efficient Knowledge Sharing
Challenge: if a text is not at the prefix, its KV cache cannot be reused
Key technique #2: flexible join of multiple KV caches

Architecting Efficient Knowledge Sharing
Key technique #1: fast KV retrieval via KV encoding (speeds up KV loading by 3-10x)
Key technique #2: flexible join of multiple KV caches

CacheGen: KV Cache Compression and
Streaming for Fast Language Model Serving
14
Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang,
Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh
Ananthanarayanan, Junchen Jiang
ACM SIGCOMM 2024

CacheGen: Compressing KV cache for fast prefill
15
(loooooo…oooooong) context + query → LLM → output text
Without CacheGen: load KV cache, then prefill on query, then generate output
With CacheGen: load compressed KV cache & decompress, then prefill on query, then generate output
Faster prefill even if the reused KV cache is loaded remotely

10,000 ft view of CacheGen
16
Encoding: KV cache (K tensor, V tensor) → compact binary representations → storage
Decoding: binary representations → decompressed KV cache (K tensor, V tensor)
CacheGen: encode KV cache to compact binary representation

Several emerging approaches to KV compression
17
Quantizing KV cache directly?
Dropping less important tokens from the text?
Dropping less important tokens from the KV cache?
They all keep the KV's tensor shape → complementary to CacheGen. CacheGen can improve them too!
CacheGen: encode KV cache to compact binary representation

Can KV cache be encoded efficiently?
Analogy with video compression: encode a video in a small size with small degradation in video quality
Video ↔ KV cache; video quality ↔ quality of generated text; size of encoded video ↔ size of encoded KV cache
We could borrow from the 20-year research literature on video compression

Why can fewer bits represent KV cache?
19
KV cache is similar between neighboring tokens
Some parts of a KV cache are less sensitive to quantization
Quantized KV cache can be entropy-encoded with fewer bits
Key distributional properties of KV cache

Opportunity 1: Locality of KV cache values
20
[Diagram: the K tensor spans (# of layers) x (# of tokens) x (# of channels); K @ layer j is a tokens-by-channels slice]
Opportunity: the KV values at nearby tokens have similar values

Delta values have much smaller variance
21
Encode the delta between neighboring tokens, rather than the tokens themselves
For any token i: original values |K_i|, |V_i|; delta values |K_i - K_{i-1}|, |V_i - V_{i-1}|
Delta values have much smaller variance → easier to quantize
[Plot: CDF of absolute values, original vs. delta]
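A small PyTorch sketch of the delta trick (the tensor is a random, token-correlated stand-in, not real KV values from a model): differencing along the token axis concentrates values near zero, so the same number of quantization bins covers a much smaller range.

import torch

# Stand-in K tensor for one layer, shape (tokens, channels); the cumulative sum
# makes neighboring tokens correlated, loosely mimicking real KV locality.
k = torch.randn(9600, 1024).cumsum(dim=0) * 0.01

delta = torch.empty_like(k)
delta[0] = k[0]                      # keep the first token as an anchor
delta[1:] = k[1:] - k[:-1]           # difference w.r.t. the previous token

print("std of original values:", k.std().item())
print("std of delta values   :", delta.std().item())   # much smaller spread

recon = delta.cumsum(dim=0)          # cumulative sum undoes the differencing
print("max reconstruction error:", (recon - k).abs().max().item())  # ~0 up to float rounding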

Opportunity 2: Heterogeneous sensitivity to quantization
22
The LLM's output quality is more sensitive to losses in the KV cache values of the shallower layers than to those in the deeper layers.
[Plot: LLM output quality (accuracy) vs. layer group, from layers [0, 3] to [20, 23]]
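One way to act on this observation (a sketch only; the bits-per-layer-group schedule below is made up for illustration, not CacheGen's actual configuration): quantize shallow layers finely and deep layers coarsely.

import torch

# Illustrative schedule: more bits for shallow layers, fewer for deep ones.
BITS_BY_LAYER_GROUP = {range(0, 8): 8, range(8, 16): 6, range(16, 32): 4}

def bits_for_layer(layer):
    for group, bits in BITS_BY_LAYER_GROUP.items():
        if layer in group:
            return bits
    raise ValueError(layer)

def quantize(x, bits):
    # Uniform symmetric quantization to `bits` bits; returns int8 codes and the scale.
    levels = 2 ** (bits - 1) - 1
    scale = x.abs().max() / levels
    codes = torch.clamp(torch.round(x / scale), -levels, levels).to(torch.int8)
    return codes, scale

k_layer5 = torch.randn(4096, 1024)                      # stand-in K values for layer 5
codes, scale = quantize(k_layer5, bits_for_layer(5))    # a shallow layer gets 8 bits
print(codes.dtype, float(scale))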

Opportunity 3: Arithmetic coding
23
KV cache → compute delta → quantize → adaptive arithmetic coding → more compact binary representation (001101001110010…)
- stored on disk
- sent via network
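Putting the three steps together, a toy encoder for one layer (delta, then quantize, then entropy code). zlib stands in for the adaptive arithmetic coder only to show where entropy coding slots in; CacheGen uses its own codec, and scale/shape metadata is omitted here for brevity.

import zlib
import torch

def encode_layer(k, bits=6):
    # 1) delta along the token axis (first row kept as an anchor)
    delta = torch.cat([k[:1], k[1:] - k[:-1]])
    # 2) uniform quantization to `bits` bits (scale metadata omitted for brevity)
    levels = 2 ** (bits - 1) - 1
    scale = delta.abs().max() / levels
    codes = torch.clamp(torch.round(delta / scale), -levels, levels).to(torch.int8)
    # 3) entropy coding; zlib is a stand-in for the adaptive arithmetic coder
    return zlib.compress(codes.numpy().tobytes(), level=9)

k = torch.randn(4096, 1024).cumsum(dim=0) * 0.01   # stand-in, token-correlated K values
fp16_bytes = k.numel() * 2
enc_bytes = len(encode_layer(k))
print(f"fp16: {fp16_bytes/2**20:.1f} MiB -> encoded: {enc_bytes/2**20:.2f} MiB "
      f"({fp16_bytes/enc_bytes:.1f}x smaller)")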

Reducing decoding overhead?
24
Encoding: KV cache (K tensor, V tensor) → binary representations
Loading + GPU-based decoding: binary representations → decompressed KV cache (K tensor, V tensor)
Decoding and loading can be pipelined
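A sketch of the pipelining idea: a background thread fetches the next chunk while the current one is decoded. The chunking, fetch_chunk, and gpu_decode are hypothetical placeholders, not CacheGen's real interfaces.

from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(i):
    # Pretend to fetch the i-th encoded KV chunk over the network.
    return bytes(1024)                         # stand-in payload

def gpu_decode(chunk):
    # Pretend to decode one chunk of the KV cache on the GPU.
    return len(chunk)

def load_kv_pipelined(num_chunks):
    decoded = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch_chunk, 0)            # start loading chunk 0
        for i in range(num_chunks):
            chunk = pending.result()                   # wait for the current chunk
            if i + 1 < num_chunks:
                pending = io.submit(fetch_chunk, i + 1)   # prefetch the next chunk...
            decoded.append(gpu_decode(chunk))          # ...while this one is decoded
    return decoded

print(sum(load_kv_pipelined(num_chunks=8)))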

Evaluation setup
Models: Llama-70B, Llama-34B, Mistral-7B
Datasets (with various context length distributions): LongChat, TriviaQA, NarrativeQA, WikiText
Quality metrics: accuracy, F1 score, perplexity
Bandwidth to load KV cache: 3 Gbps (cloud server bandwidth)

Quality vs. Size & TTFT (time to first token)
26
[Plots: accuracy vs. size of KV cache (MB), and accuracy vs. time to first token (TTFT) in seconds, comparing CacheGen, uniform quantization, and full prefill (no caching)]
Setup
Dataset: LongChat (200 contexts, ~9.6K tokens each)
Model: Llama-70B
Link to load KV cache (1.6 GB at 8-bit): 3 Gbps
3x smaller KV cache size → 3-6x lower time to first token (TTFT)

Impact of context length
27
[Plot: KV cache size (MB) vs. context length (3K to 20K tokens), CacheGen vs. uniform 8-bit quantization]
Setup
Model: Llama-70B
The size reduction holds across various context lengths

Breakdown of Time to First Token (TTFT)
28
Full recompute: prefill on input = 6.1 sec
Naïve KV cache loading (w/o KV encoding): load = 7.2 sec, prefill on query = 0.12 sec
CacheGen (KV encoding): load + decompress = 1.8 sec, prefill on query = 0.12 sec
Setup
Dataset: LongChat (200 contexts, ~9.6K tokens each)
Model: Llama-70B
Link to load KV cache: 3 Gbps
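Reading the breakdown as arithmetic (all numbers copied from this slide): naïve remote loading is actually slower than recomputing, while CacheGen's encoded cache makes loading pay off.

# TTFT totals implied by the breakdown above (numbers from the slide).
full_recompute = 6.1                          # prefill on the whole input
naive_loading  = 7.2 + 0.12                   # load raw KV cache + prefill on query
cachegen       = 1.8 + 0.12                   # load + decompress encoded KV + prefill on query

print(f"naive loading : {naive_loading:.2f} s")                       # 7.32 s (slower than recompute!)
print(f"CacheGen      : {cachegen:.2f} s")                            # 1.92 s
print(f"vs recompute  : {full_recompute / cachegen:.1f}x faster")     # ~3.2x
print(f"vs naive load : {naive_loading / cachegen:.1f}x faster")      # ~3.8x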

Towards Efficient Knowledge Storing & Sharing
Key technique #1: fast KV cache loading via KV codec (speeds up KV loading by 3-10x)
Key technique #2: flexible join of multiple KV caches
Happy to chat about technique #2 after the talk

Try it yourself!
30
Code repo: https://github.com/uchi-jcl/cachegen
Research paper: https://arxiv.org/pdf/2310.07240.pdf

Efficient Knowledge Sharing System
31
[Plot: delay (time to first token) vs. cost (storage, compute, communication) for GPU prefill and for storing the KV cache in CPU, SSD, or S3; efficient knowledge sharing targets the better corner of low delay and low cost]
Contact me if you are a potential user or contributor to our Knowledge-Sharing System!
[email protected]