LLM KV Cache Offloading: Analysis and Practical Considerations by Eshcar Hillel
ScyllaDB
About This Presentation
LLM deployments are driving massive GPU demand and cost. This talk presents a generic architecture for offloading KV-cache tensors to a disaggregated shared store, enabling GPU-initiated IO for efficient storage and retrieval. We’ll cover the system requirements for offloading and retrieval, their impact on platform design and performance, and a mathematical model for predicting gains across LLMs and hardware, supported by initial results.
Size: 2.25 MB
Language: en
Added: Oct 09, 2025
Slides: 24 pages
Slide Content
A ScyllaDB Community
LLM KV Cache Offloading: Analysis and Practical Considerations
Eshcar Hillel, Principal Research Scientist
Eshcar Hillel (she/her)
Principal Research Scientist, Leading AI Research at Pliops
■ Building the leading disaggregated KV-store to accelerate LLM inference
■ P99s reveal the real bottlenecks that averages conceal
■ PhD in CS; authored 25+ papers and patents in distributed systems & AI
■ Competes as a long-distance triathlete
Questions Covered in the Talk
Why is KV-cache offloading increasingly relevant in LLM inference?
How much acceleration can be expected, in both prefill and end-to-end inference?
What factors determine the gain we get?
Prompt Computation vs. Token Generation
Multi-Turn Inferencing with LLM Models
In a multi-turn conversation or a multi-shot agent task, each turn's prefill recomputes the KV cache over an ever-expanding history.
Prefill Speedup Analysis
Prefill Acceleration By KV-Cache Retrieval
■ R – time to retrieve a token's KV from storage
● R = |KV_token| / BW_IO
■ T – compute time per input token
● T = (2·|model| + N_in · O(√|model|)) / C_GPU (*)
■ Effective token acceleration e = T / R
■ α – fraction of the prompt's KV cache available in storage
■ N_in – number of input tokens
(*) approximate attention compute; assumes all GPU compute is utilized, which is not always true
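As a rough illustration of these definitions, the sketch below plugs hypothetical numbers into R, T and e. The per-token KV size, storage bandwidth, model size and GPU throughput are assumptions for illustration, not figures from the talk, and the attention term in T is dropped.

```python
# Back-of-the-envelope estimate of R, T and e = T/R.
# All constants are illustrative assumptions, not measured values.

KV_BYTES_PER_TOKEN = 320 * 1024   # per-token KV-cache size, e.g. a 70B-class model with GQA
BW_IO = 25e9                      # storage/fabric read bandwidth (bytes/s)
MODEL_PARAMS = 70e9               # |model|: number of parameters
GPU_FLOPS = 1e15                  # C_GPU: sustained GPU compute (FLOP/s)

R = KV_BYTES_PER_TOKEN / BW_IO    # time to retrieve one token's KV from storage (s)
T = 2 * MODEL_PARAMS / GPU_FLOPS  # compute time per input token (s), attention term omitted
e = T / R                         # effective token acceleration

print(f"R = {R * 1e6:.1f} us/token, T = {T * 1e6:.1f} us/token, e = T/R = {e:.1f}")
```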
Prefill Acceleration Analysis
■ TTFT_c = N_in · T
■ Offloading writes the newly generated KV cache
● Let W denote the write time of a single token's KV
■ Assuming IO and compute can be overlapped
● At each layer, prefetch the next layer's KV cache and write the previous layer's KV cache
● TTFT_io = max( α·N_in·T/e , (1−α)·N_in·W , (1−α)·N_in·T )
(the three terms: IO time to fetch the existing KV cache from storage; IO time to write the new part of the prompt; compute time of the new part of the prompt)
■ When W ≤ T, TTFT is not affected by write IOs:
● TTFT_io = max( α·N_in·T/e , (1−α)·N_in·T )
■ TTFT_c = N_in · T
Performance gain (speedup) x = TTFT_c / TTFT_io
■ compute-bound: x = 1/(1−α)
■ IO-bound: x = e/α
Prefill Acceleration Analysis (Cont.)
Insight: acceleration depends on read performance and hit rate
Insight: writes only need to match compute performance
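A minimal sketch of the speedup model above, assuming IO and compute overlap as described; the function name and all inputs are illustrative, not code from the talk.

```python
# TTFT_io = max(read of cached part, write of new part, compute of new part);
# speedup x = TTFT_c / TTFT_io.  Inputs are illustrative assumptions.

def prefill_speedup(n_in: int, t: float, e: float, alpha: float, w: float) -> float:
    """x = TTFT_c / TTFT_io for a prompt of n_in tokens.

    t     -- compute time per input token (T)
    e     -- effective token acceleration (T/R), so R = t / e
    alpha -- fraction of the prompt's KV cache already in storage
    w     -- write time per token's KV (W)
    """
    ttft_c = n_in * t
    ttft_io = max(alpha * n_in * t / e,     # read the cached part from storage
                  (1 - alpha) * n_in * w,   # write the new part of the prompt
                  (1 - alpha) * n_in * t)   # compute the new part of the prompt
    return ttft_c / ttft_io

# Example: e = 9, 80% hit rate, writes no slower than compute (W = T):
print(prefill_speedup(n_in=4096, t=1.0, e=9.0, alpha=0.8, w=1.0))  # 5.0 = 1/(1-alpha)
```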
Prefill Acceleration Analysis (Cont.)
Crossover point: α* = e/(1+e), where e is the acceleration from offloading (e = T/R)
For example, e = 9 gives α* = 0.9
α < α*: compute-bound; α > α*: IO-bound
Insight: no advantage for higher IO speed (HBM, DRAM)
For maximal gain when α > α*, retrieve only an α* fraction of the prompt, yielding an e+1 speedup
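The crossover insight translates into a few lines: cap the retrieved fraction at α* = e/(1+e) and the speedup saturates at e+1. The helper names and the e = 9 example are illustrative.

```python
# Cap the fraction retrieved from storage at the crossover point alpha*;
# beyond it, extra hits (or faster IO) add nothing.  Illustrative sketch.

def best_retrieval_fraction(alpha: float, e: float) -> float:
    """Fraction of the prompt worth retrieving from storage."""
    alpha_star = e / (1 + e)            # crossover between compute- and IO-bound
    return min(alpha, alpha_star)

def speedup(alpha: float, e: float) -> float:
    a = best_retrieval_fraction(alpha, e)   # a <= alpha* < 1
    return min(e / a, 1 / (1 - a))          # IO-bound term vs. compute-bound term

e = 9.0
for alpha in (0.5, 0.9, 0.99):
    print(alpha, round(speedup(alpha, e), 2))
# 0.5  -> 2.0   compute-bound
# 0.9  -> 10.0  at the crossover, speedup = e + 1
# 0.99 -> 10.0  retrieve only alpha* = 0.9; no further gain
```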
IO-Compute Parallelism - Nsight View
During prefill, IOs execute in parallel with the new prompt's KV-cache computation:
● Orange: GPU-initiated Put-IOs of the previous layer
● Green: GPU-initiated Get-IOs for the next layer
● Yellow: compute of the current layer
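A structural sketch of the per-layer overlap the trace shows. get_kv, put_kv and compute_layer are hypothetical stand-ins (here backed by a thread pool and sleeps) rather than the actual GPU-initiated IO API; only the scheduling pattern is the point.

```python
# While layer i is computed (yellow), the Get-IO for layer i+1 (green) and the
# Put-IO for layer i-1's new KV (orange) are in flight.
import time
from concurrent.futures import ThreadPoolExecutor

io = ThreadPoolExecutor(max_workers=2)   # stand-in for the GPU-initiated IO path

def get_kv(layer):                       # hypothetical Get-IO of a layer's cached KV
    time.sleep(0.01)
    return f"kv[{layer}]"

def put_kv(layer, kv):                   # hypothetical Put-IO of a layer's new KV
    time.sleep(0.01)

def compute_layer(layer, kv_prefix):     # hypothetical attention/MLP kernels
    time.sleep(0.02)
    return f"new_kv[{layer}]"

def prefill_with_overlap(num_layers: int) -> None:
    pending_get = io.submit(get_kv, 0)                  # prefetch layer 0
    pending_put = None
    for i in range(num_layers):
        kv_prefix = pending_get.result()                # cached prefix for this layer
        if i + 1 < num_layers:
            pending_get = io.submit(get_kv, i + 1)      # green: Get-IO for the next layer
        new_kv = compute_layer(i, kv_prefix)            # yellow: compute the current layer
        if pending_put:
            pending_put.result()
        pending_put = io.submit(put_kv, i, new_kv)      # orange: Put-IO overlaps the next layer
    if pending_put:
        pending_put.result()

prefill_with_overlap(num_layers=4)
```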
E2E Gain - Continuous Batching Analysis
■ B – number of concurrent requests
■ N_out – average number of output tokens
■ Number of concurrent prefills ~ Binom(B, 1/N_out)
■ E[TTFT(B)] = TTFT(1) + (B−1)/N_out · TTFT(1) = (1 + (B−1)/N_out) · TTFT(1)
● (B−1)/N_out is the expected number of additional simultaneous prefills in a prefill slot
■ E[TPOT(B)] = TPOT(1) + (B−1) · (TTFT(1)/N_out + |KV(1)|/BW_mem)
● |KV(1)| is the average KV-cache size of a single input prompt
● BW_mem is the memory bandwidth; |KV(1)|/BW_mem is the time to transfer a single prompt's KV cache to the compute engines
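A small sketch of this batching model; the helper names and every numeric input are illustrative assumptions, not measurements.

```python
# E[TTFT(B)] and TPOT(B) from the continuous-batching model above.

def expected_ttft(b: int, ttft1: float, n_out: float) -> float:
    """E[TTFT(B)]: each prefill shares its slot with ~(B-1)/N_out other prefills."""
    return (1 + (b - 1) / n_out) * ttft1

def tpot(b: int, tpot1: float, ttft1: float, n_out: float,
         kv_bytes: float, bw_mem: float) -> float:
    """TPOT(B): decode steps stall behind other requests' prefills and KV transfers."""
    delta_p = ttft1 / n_out         # prefill overhead added per output token
    delta_h = kv_bytes / bw_mem     # time to move one prompt's KV cache over memory
    return tpot1 + (b - 1) * (delta_p + delta_h)

# Example with hypothetical values (times in seconds, sizes in bytes):
print(expected_ttft(b=32, ttft1=0.5, n_out=256))
print(tpot(b=32, tpot1=0.02, ttft1=0.5, n_out=256, kv_bytes=1e9, bw_mem=3e12))
```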
Measured TPOT Overhead for Llama-3-70B
Impact of Prefill Speedup on Decode Speedup
■ TPOT(B) ≈ TPOT(1) + (B−1) · (TTFT(1)/N_out + |KV(1)|/BW_mem) = TPOT(1) + (B−1) · (ΔP + ΔH)
● ΔP = TTFT(1)/N_out, ΔH = |KV(1)|/BW_mem
■ The system is required to meet an SLA on TPOT: TPOT(B) ≤ SLA_TPOT
■ Maximal B to meet SLA_TPOT: ⌊(SLA_TPOT − TPOT(1)) / (ΔP + ΔH)⌋
■ With offloading, ΔP_io = ΔP/x, where x is the speedup of the prefill using IO
■ B_io / B_baseline = (ΔP + ΔH) / (ΔP_io + ΔH) = (ΔP + ΔH) / (ΔP/x + ΔH)
■ When x → ∞, B_io / B_baseline → 1 + ΔP/ΔH
Insight: this ratio bounds the achievable TPS gain
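The SLA-constrained batch size and the resulting gain ratio follow directly from these formulas; the sketch below uses the slide's ⌊(SLA_TPOT − TPOT(1))/(ΔP + ΔH)⌋ expression, with all numeric inputs assumed for illustration.

```python
# Maximal batch size under a TPOT SLA, and the batching gain from a prefill
# speedup of x.  Inputs are illustrative assumptions.
import math

def max_batch(sla_tpot: float, tpot1: float, delta_p: float, delta_h: float) -> int:
    """Maximal B to meet SLA_TPOT, per the slide's formula."""
    return math.floor((sla_tpot - tpot1) / (delta_p + delta_h))

def batching_gain(delta_p: float, delta_h: float, x: float) -> float:
    """B_io / B_baseline when prefill is accelerated by x (asymptote: 1 + dP/dH)."""
    return (delta_p + delta_h) / (delta_p / x + delta_h)

delta_p, delta_h = 0.002, 0.0005     # illustrative per-token overheads (s)
print(max_batch(sla_tpot=0.05, tpot1=0.02, delta_p=delta_p, delta_h=delta_h))  # 12
print(batching_gain(delta_p, delta_h, x=10))   # ~3.6 with a finite prefill speedup
print(1 + delta_p / delta_h)                   # 5.0, the asymptotic bound
```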
Asymptotic E2E Gain Analysis
■ TPS gain ≤ 1 + ΔP/ΔH, so aim to maximize ΔP/ΔH
■ Model params: model size, KV-cache compression (GQA, MQA, MLA)
■ App params: in/out token ratio
■ GPU params: memory-bandwidth-to-compute ratio
Experiment: Llama-3.1-70B on H100, varying number of clients
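Combining the earlier approximation T ≈ 2·|model|/C_GPU with the definitions of ΔP and ΔH shows how the three parameter groups enter the bound (N_in cancels, so the output length carries the in/out dependence). This derivation and the numbers below are an assumption-based sketch, not results from the talk.

```python
# TPS-gain bound 1 + dP/dH expressed through model, app and GPU parameters.
# Illustrative assumptions throughout.

def tps_gain_bound(model_params: float, kv_bytes_per_token: float,
                   gpu_flops: float, bw_mem: float, n_out: float) -> float:
    t = 2 * model_params / gpu_flops                        # compute time per input token
    # dP/dH = (N_in*t/n_out) / (N_in*kv_bytes_per_token/bw_mem); N_in cancels
    dp_over_dh = (t * bw_mem) / (n_out * kv_bytes_per_token)
    return 1 + dp_over_dh

# Hypothetical 70B-class model on an H100-class GPU:
print(tps_gain_bound(model_params=70e9, kv_bytes_per_token=320e3,
                     gpu_flops=1e15, bw_mem=3e12, n_out=256))
```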
Shift the Efficiency-Latency Tradeoff Frontier
Pliops
Summary and Conclusions
✓KV-cache offloading avoids recomputation and accelerates prefill
✓Write speed only needs to match compute
✓Speedup depends mainly on read performance
✓In compute-bound cases, no benefit for faster-than-SSD memory
✓GPU-initiated IO allows full IO-compute overlap with zero CPU overhead
✓Throughput gains depend on the model, the GPU, and the workload
KV-cache offloading increases efficiency and reduces cost while meeting SLAs