LLM KV Cache Offloading: Analysis and Practical Considerations by Eshcar Hillel

ScyllaDB · 24 slides · Oct 09, 2025

About This Presentation

LLM deployments are driving massive GPU demand and cost. This talk presents a generic architecture for offloading KV-cache tensors to a disaggregated shared store, enabling GPU-initiated IO for efficient storage and retrieval. We’ll cover the system requirements for offloading and retrieval, their...


Slide Content

A ScyllaDB Community
LLM KV Cache Offloading: Analysis and Practical Considerations
Eshcar Hillel, Principal Research Scientist

Eshcar Hillel (she/her)

Principal Research Scientist, Leading AI Research at Pliops
■Building the leading disaggregated KV-store to accelerate LLM inference
■P99s reveal the real bottlenecks that averages conceal
■PhD in CS; author of 25+ papers and patents in distributed systems & AI
■Competes as a long-distance triathlete

Questions Covered in the Talk
■Why is KV-cache offloading increasingly relevant in LLM inference?
■How much acceleration can be expected, in both prefill and end-to-end inference?
■What factors determine the gain we get?

Prompt Computation vs. Token Generation

Multi-Turn Inferencing with LLM Models
In a multi-turn conversation or a multi-shot agent task, each turn prefills a KV-cache over an ever-expanding history.

Prefill Speedup Analysis

Prefill Acceleration By KV-Cache Retrieval
■R – time to retrieve one token's KV from storage
●R = |KV_token| / BW_IO
■T – compute time per input token
●T = (2∙|model| + N_in ∙ O(√|model|)) / TFLOPs (*)
■Effective token acceleration e = T/R
■α – fraction of the KV-cache cached in storage
■N_in – number of input tokens
(*) approximate attention compute; assumes all GPU compute is utilized, which is not always true
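As a minimal sketch of these definitions in Python (using the A100 example numbers that appear later in the talk; T is taken as a measured constant, since the attention constant inside O(√|model|) is not given):

    # Per-token retrieval time R and effective acceleration e = T/R.
    # |KV_token| and BW_IO are the A100 example values from a later slide.
    KV_TOKEN_BYTES = 820e3   # |KV_token|: KV-cache footprint of one token
    BW_IO = 20e9             # BW_IO: storage bandwidth in bytes/sec

    R = KV_TOKEN_BYTES / BW_IO   # retrieval time per token: 41 us
    T = 110e-6                   # measured compute time per input token
    e = T / R                    # effective token acceleration: ~2.7

    print(f"R = {R * 1e6:.0f} us, e = {e:.1f}")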

Prefill Acceleration Analysis
■TTFT_base = N_in ∙ T
■Offloading writes the newly generated KV cache
●Let W denote the write time of a single token's KV
■Assuming IO and compute can be overlapped
●At each layer, prefetch the next layer's KV cache and write the previous layer's KV cache
●TTFT_KV = max( α∙N_in∙R , (1−α)∙N_in∙W , (1−α)∙N_in∙T )
The three terms are, in order: the IO time to fetch the existing KV cache from storage, the IO time to write the new part of the prompt, and the compute time of the new part of the prompt.
■When W ≤ T, TTFT is not affected by write IOs:
●TTFT_KV = max( α∙N_in∙R , (1−α)∙N_in∙T )
■TTFT_base = N_in ∙ T
Performance gain (speedup):
■compute-bound: x = 1/(1−α)
■IO-bound: x = e/α
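A sketch of this TTFT model and the resulting speedup (symbols as defined above; W is passed explicitly so the W ≤ T simplification stays visible, and the W value below is an illustrative assumption):

    def ttft_base(n_in, T):
        # Baseline prefill: compute every input token.
        return n_in * T

    def ttft_kv(n_in, T, R, W, alpha):
        # Overlapped pipeline: fetch the cached fraction while writing and
        # computing the new part of the prompt.
        return max(alpha * n_in * R,        # fetch existing KV from storage
                   (1 - alpha) * n_in * W,  # write the new KV
                   (1 - alpha) * n_in * T)  # compute the new part

    def speedup(T, R, alpha):
        # TTFT_base / TTFT_KV when W <= T (write IOs hide behind compute);
        # N_in cancels. Compute-bound: 1/(1-alpha). IO-bound: e/alpha.
        return T / max(alpha * R, (1 - alpha) * T)

    # Both forms agree (W = 30 us <= T is illustrative):
    print(ttft_base(1000, T=110e-6) /
          ttft_kv(1000, T=110e-6, R=41e-6, W=30e-6, alpha=0.88))  # ~3.0
    print(speedup(T=110e-6, R=41e-6, alpha=0.88))                 # ~3.0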


Prefill Acceleration Analysis (Cont.)
Insight: acceleration depends on read performance and hit rate.
Insight: writes only need to match compute performance.

Prefill Acceleration Analysis (Cont.)
■Crossover point: α* = e/(1+e), where e is the acceleration from offloading (e = T/R)
●For example, e = 9 gives α* = 0.9
■α < α*: compute-bound; α > α*: IO-bound
Insight: in the compute-bound regime there is no advantage to higher IO speed (HBM, DRAM)!
■For maximal gain when α > α*, retrieve only an α* fraction and recompute the rest
●This yields an e+1 speedup
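The crossover point and the retrieve-only-α* trick as a short numeric sketch (same symbols; alpha_star and best_speedup are names chosen here, not from the talk):

    def alpha_star(e):
        # Crossover hit rate: below it prefill is compute-bound,
        # above it retrieval IO becomes the bottleneck.
        return e / (1 + e)

    def best_speedup(e, alpha):
        # Retrieve at most an alpha* fraction even when more is cached;
        # recomputing the excess keeps prefill compute-bound and caps
        # the achievable gain at e + 1.
        a = min(alpha, alpha_star(e))
        return 1 / (1 - a)

    print(alpha_star(9))           # 0.9
    print(best_speedup(9, 0.95))   # 10.0 == e + 1 (vs e/alpha ~ 9.5 if all 95% is fetched)
    print(best_speedup(9, 0.5))    # 2.0, compute-bound: 1/(1-alpha)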

IO-Compute Parallelism – Nsight View
During prefill, IOs execute in parallel with the new prompt's KV-cache computation:
■Orange: GPU-initiated Put-IOs of the previous layer
■Green: GPU-initiated Get-IOs for the next layer
■Yellow: compute of the current layer

Prefill Speedup – IO-Bound Example
■A100 GPU
■α = 0.88
■|KV_token| = 820KB
■BW_IO = 20GBps
■R = 41us, T = 110us
■e = 2.7
■α* = 0.73
Prefill is IO-bound (α > α*): x = e/α ≈ 3

Prefill Speedup – Compute-Bound Example
■H100 GPU, B = 4
■α = 0.88
■|KV_token| = 163KB (GQA8)
■BW_IO = 20GBps
■R = 8us, T = 70us
■e = 9
■α* = 0.9
Prefill is compute-bound (α < α*): x = 7
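Plugging both examples into the speedup model above (a sketch; for the H100 case the simple max() model gives ≈ 8.3, while the slide reports x = 7, presumably reflecting batching effects at B = 4 that the model ignores):

    def prefill_speedup(T, R, alpha):
        # TTFT_base / TTFT_KV with writes hidden behind compute (W <= T).
        return T / max(alpha * R, (1 - alpha) * T)

    # A100: 820 KB/token over 20 GB/s -> R = 41 us, IO-bound.
    print(prefill_speedup(T=110e-6, R=41e-6, alpha=0.88))  # ~3.0

    # H100: GQA8 shrinks the KV to 163 KB/token -> R = 8 us, compute-bound.
    print(prefill_speedup(T=70e-6, R=8e-6, alpha=0.88))    # ~8.3 (slide: x = 7)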

End-to-End (E2E) Gain for Continuous Batching

E2E Gain – Continuous Batching Analysis
■B – number of concurrent requests
■N_out – average number of output tokens
■Number of concurrent prefills ~ Binom(B, 1/N_out)
■E[TTFT(B)] = TTFT(1) + (B−1)/N_out ∙ TTFT(1) = (1 + (B−1)/N_out) ∙ TTFT(1)
●(B−1)/N_out is the expected number of additional simultaneous prefills in a prefill slot
■E[TPOT(B)] = TPOT(1) + (B−1) ∙ (TTFT(1)/N_out + |KV(P)|/BW_mem)
●|KV(P)| is the average KV cache size of a single input prompt
●BW_mem is the memory bandwidth; |KV(P)|/BW_mem is the time to transfer a single prompt's KV cache to the compute engines
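A sketch of both expectations; the numeric inputs below are illustrative placeholders, not measurements from the talk:

    def e_ttft(B, ttft1, n_out):
        # E[TTFT(B)]: a prefill slot hosts ~(B-1)/N_out additional
        # simultaneous prefills (Binom(B, 1/N_out) concurrent arrivals).
        return (1 + (B - 1) / n_out) * ttft1

    def e_tpot(B, tpot1, ttft1, n_out, kv_prompt_bytes, bw_mem):
        # E[TPOT(B)]: each decode step pays for concurrent prefills plus the
        # transfer of each prompt's KV cache to the compute engines.
        return tpot1 + (B - 1) * (ttft1 / n_out + kv_prompt_bytes / bw_mem)

    # Illustrative values only:
    print(e_ttft(B=32, ttft1=0.5, n_out=256))              # ~0.56 s
    print(e_tpot(B=32, tpot1=0.02, ttft1=0.5, n_out=256,
                 kv_prompt_bytes=1.6e9, bw_mem=3.35e12))   # ~0.10 s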

Measured TPOT Overhead for Llama-3-70B

Impact of Prefill Speedup on Decode Speedup
■TPOT(B) ≈ TPOT(1) + (B−1) ∙ (ΔP + ΔH), where ΔP = TTFT(1)/N_out and ΔH = |KV(P)|/BW_mem
■The system is required to meet TPOT(B) ≤ SLA_TPOT
■Maximal B to meet SLA_TPOT: ⌊(SLA_TPOT − TPOT(1)) / (ΔP + ΔH)⌋
■With offloading, ΔP_KV = ΔP/x, where x is the prefill speedup from IO
■B_KV / B_vanilla = (ΔP + ΔH)/(ΔP_KV + ΔH) = (ΔP + ΔH)/(ΔP/x + ΔH)
■When x → ∞, B_KV / B_vanilla → 1 + ΔP/ΔH
Insight: a larger batch at the same SLA is a TPS (throughput) gain.
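The SLA-driven batch-size bound and its limit, sketched numerically (the SLA and Δ values below are illustrative assumptions):

    import math

    def max_batch(sla_tpot, tpot1, dP, dH):
        # Slide's bound from TPOT(B) <= SLA_TPOT:
        # floor((SLA_TPOT - TPOT(1)) / (dP + dH))
        return math.floor((sla_tpot - tpot1) / (dP + dH))

    def batch_gain(dP, dH, x):
        # B_KV / B_vanilla: offloading shrinks only the prefill term dP
        # by the prefill speedup x; the HBM term dH is untouched.
        return (dP + dH) / (dP / x + dH)

    # Illustrative deltas: dP = 3 ms, dH = 1 ms per extra request.
    print(max_batch(sla_tpot=0.1, tpot1=0.02, dP=3e-3, dH=1e-3))  # 20
    print(batch_gain(dP=3e-3, dH=1e-3, x=7))                      # ~2.8
    print(batch_gain(dP=3e-3, dH=1e-3, x=1e9))                    # -> 4.0 == 1 + dP/dH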

Asymptotic E2E Gain Analysis
■TPS gain ≤ 1 + ΔP/ΔH, so aim to maximize ΔP/ΔH
■ΔP = TTFT(1)/N_out, ΔH = |KV(P)|/BW_mem
●TTFT(1) = N_in ∙ (2|model|/TFLOPs + N_in ∙ O(√|model|)/TFLOPs)
●The first term is the "model" compute time; the second is the attention compute time
■ΔH = N_in ∙ |KV_token|/BW_mem, where |KV_token| is the KV cache size of a single token

Asymptotic E2E Gain Analysis (Cont.)
TTFT(1) behaves differently in two distinct regimes (d denotes the model's hidden dimension):
■Short prompt regime (N_in ≪ 6d): ΔP/ΔH ∝ |model|/|KV_token| ∙ BW_HBM/TFLOPs ∙ 1/N_out
■Long prompt regime (N_in ≫ 6d): ΔP/ΔH ∝ √|model|/|KV_token| ∙ BW_HBM/TFLOPs ∙ N_in/N_out
The ratio is shaped by three groups of parameters:
■Model params: model size; KV-cache compression (GQA, MQA, MLA)
■App params: input/output token ratio
■GPU params: memory-bandwidth-to-compute ratio
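A sketch of the ΔP/ΔH ratio that caps the TPS gain; the attention constant c_attn and all numeric inputs are assumptions for illustration (the slide only specifies O(√|model|)):

    import math

    def dp_over_dh(model_bytes, kv_token_bytes, n_in, n_out,
                   tflops, bw_hbm, c_attn=1.0):
        # TTFT(1) per the slide:
        # N_in * (2|model| + N_in * c_attn * sqrt(|model|)) / TFLOPs
        ttft1 = n_in * (2 * model_bytes
                        + n_in * c_attn * math.sqrt(model_bytes)) / tflops
        dP = ttft1 / n_out                    # prefill overhead per decode step
        dH = n_in * kv_token_bytes / bw_hbm   # HBM transfer of the prompt's KV
        return dP / dH

    # Short prompts (N_in << 6d): ratio is independent of N_in.
    # Long prompts  (N_in >> 6d): ratio grows with N_in / N_out.
    print(dp_over_dh(model_bytes=140e9, kv_token_bytes=163e3, n_in=1024,
                     n_out=256, tflops=1e15, bw_hbm=3.35e12))  # ~22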

Shift the Efficiency-Latency Tradeoff Frontier
■Llama-3.1-70B
■H100
■Varying number of clients
(Pliops measurement figure)

Summary and Conclusions
✓KV-cache offloading avoids recomputation and accelerates prefill
✓Write speed only needs to match compute
✓Speedup depends mainly on read performance
✓In compute-bound cases, there is no benefit from faster-than-SSD memory
✓GPU-initiated IO allows full IO-compute overlap with zero CPU overhead
✓Throughput gains depend on the model, the GPU, and the workload
KV-cache offloading increases efficiency and reduces cost while meeting SLAs

Thank you! Let’s connect.
Eshcar Hillel
[email protected]
linkedin.com/in/eshcar