LLM KV Cache Offloading: Analysis and Practical Considerations by Eshcar Hillel

ScyllaDB · 24 slides · Oct 09, 2025

About This Presentation

LLM deployments are driving massive GPU demand and cost. This talk presents a generic architecture for offloading KV-cache tensors to a disaggregated shared store, enabling GPU-initiated IO for efficient storage and retrieval. We’ll cover the system requirements for offloading and retrieval, their...


Slide Content

A ScyllaDB Community
LLM KV Cache Offloading: Analysis and Practical Considerations
Eshcar Hillel, Principal Research Scientist

Eshcar Hillel (she/her)

Principal Research Scientist, Leading AI Research at Pliops
■Building the leading disaggregated KV-store to accelerate LLM inference
■P99s reveal the real bottlenecks that averages conceal
■PhD in CS; author of 25+ papers and patents in distributed systems & AI
■Competes as a long-distance triathlete

Questions Covered in the Talk
■Why is KV-cache offloading increasingly relevant in LLM inference?
■How much acceleration can be expected, in both prefill and end-to-end inference?
■What factors determine the gain we get?

Prompt Computation vs. Token Generation

Multi-Turn Inferencing with LLM Models
In a multi-turn conversation or a multi-shot agent task, each turn prefills a KV-cache over an ever-expanding history.

Prefill Speedup Analysis

Prefill Acceleration By KV-Cache Retrieval
■R – time to retrieve one token's KV from storage
●R = |KV_token| / BW_IO
■T – compute time per input token
●T = (2∙|model| + N_in ∙ O(√|model|)) / TFLOPs (*)
■Effective token acceleration e = T/R
■α – fraction of the KV-cache cached in storage
■N_in – number of input tokens
(*) approximate attention compute; assumes all GPU compute is utilized, which is not always true
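As a minimal sketch of these definitions in Python (using the A100 example numbers that appear later in the talk; T is taken as a measured constant, since the attention constant inside O(√|model|) is not given):

    # Per-token retrieval time R and effective acceleration e = T/R.
    # |KV_token| and BW_IO are the A100 example values from a later slide.
    KV_TOKEN_BYTES = 820e3   # |KV_token|: KV-cache footprint of one token
    BW_IO = 20e9             # BW_IO: storage bandwidth in bytes/sec

    R = KV_TOKEN_BYTES / BW_IO   # retrieval time per token: 41 us
    T = 110e-6                   # measured compute time per input token
    e = T / R                    # effective token acceleration: ~2.7

    print(f"R = {R * 1e6:.0f} us, e = {e:.1f}")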

Prefill Acceleration Analysis
■TTFT_base = N_in ∙ T
■Offloading writes the newly generated KV cache
●Let W denote the write time of a single token's KV
■Assuming IO and compute can be overlapped
●At each layer, prefetch the next layer's KV cache and write the previous layer's KV cache
●TTFT_KV = max( α∙N_in∙R , (1−α)∙N_in∙W , (1−α)∙N_in∙T )
The three terms are, in order: the IO time to fetch the existing KV cache from storage, the IO time to write the new part of the prompt, and the compute time of the new part of the prompt.
■When W ≤ T, TTFT is not affected by write IOs:
●TTFT_KV = max( α∙N_in∙R , (1−α)∙N_in∙T )
■TTFT_base = N_in ∙ T
Performance gain (speedup):
■compute-bound: x = 1/(1−α)
■IO-bound: x = e/α
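A sketch of this TTFT model and the resulting speedup (symbols as defined above; W is passed explicitly so the W ≤ T simplification stays visible, and the W value below is an illustrative assumption):

    def ttft_base(n_in, T):
        # Baseline prefill: compute every input token.
        return n_in * T

    def ttft_kv(n_in, T, R, W, alpha):
        # Overlapped pipeline: fetch the cached fraction while writing and
        # computing the new part of the prompt.
        return max(alpha * n_in * R,        # fetch existing KV from storage
                   (1 - alpha) * n_in * W,  # write the new KV
                   (1 - alpha) * n_in * T)  # compute the new part

    def speedup(T, R, alpha):
        # TTFT_base / TTFT_KV when W <= T (write IOs hide behind compute);
        # N_in cancels. Compute-bound: 1/(1-alpha). IO-bound: e/alpha.
        return T / max(alpha * R, (1 - alpha) * T)

    # Both forms agree (W = 30 us <= T is illustrative):
    print(ttft_base(1000, T=110e-6) /
          ttft_kv(1000, T=110e-6, R=41e-6, W=30e-6, alpha=0.88))  # ~3.0
    print(speedup(T=110e-6, R=41e-6, alpha=0.88))                 # ~3.0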


Prefill Acceleration Analysis (Cont.)
Insight: acceleration depends on read performance and hit rate.
Insight: writes only need to match compute performance.

Prefill Acceleration Analysis (Cont.)
■Crossover point: α* = e/(1+e), where e is the acceleration from offloading (e = T/R)
●For example, e = 9 gives α* = 0.9
■α < α*: compute-bound; α > α*: IO-bound
Insight: in the compute-bound regime there is no advantage to higher IO speed (HBM, DRAM)!
■For maximal gain when α > α*, retrieve only an α* fraction and recompute the rest
●This yields an e+1 speedup
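The crossover point and the retrieve-only-α* trick as a short numeric sketch (same symbols; alpha_star and best_speedup are names chosen here, not from the talk):

    def alpha_star(e):
        # Crossover hit rate: below it prefill is compute-bound,
        # above it retrieval IO becomes the bottleneck.
        return e / (1 + e)

    def best_speedup(e, alpha):
        # Retrieve at most an alpha* fraction even when more is cached;
        # recomputing the excess keeps prefill compute-bound and caps
        # the achievable gain at e + 1.
        a = min(alpha, alpha_star(e))
        return 1 / (1 - a)

    print(alpha_star(9))           # 0.9
    print(best_speedup(9, 0.95))   # 10.0 == e + 1 (vs e/alpha ~ 9.5 if all 95% is fetched)
    print(best_speedup(9, 0.5))    # 2.0, compute-bound: 1/(1-alpha)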

IO-Compute Parallelism – Nsight View
During prefill, IOs execute in parallel with the new prompt's KV-cache computation:
■Orange: GPU-initiated Put-IOs of the previous layer
■Green: GPU-initiated Get-IOs for the next layer
■Yellow: compute of the current layer

Prefill Speedup – IO-Bound Example
■A100 GPU
■α = 0.88
■|KV_token| = 820KB
■BW_IO = 20GBps
■R = 41us, T = 110us
■e = 2.7
■α* = 0.73
Prefill is IO-bound (α > α*): x = e/α ≈ 3

Prefill Speedup – Compute-Bound Example
■H100 GPU, B = 4
■α = 0.88
■|KV_token| = 163KB (GQA8)
■BW_IO = 20GBps
■R = 8us, T = 70us
■e = 9
■α* = 0.9
Prefill is compute-bound (α < α*): x = 7
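Plugging both examples into the speedup model above (a sketch; for the H100 case the simple max() model gives ≈ 8.3, while the slide reports x = 7, presumably reflecting batching effects at B = 4 that the model ignores):

    def prefill_speedup(T, R, alpha):
        # TTFT_base / TTFT_KV with writes hidden behind compute (W <= T).
        return T / max(alpha * R, (1 - alpha) * T)

    # A100: 820 KB/token over 20 GB/s -> R = 41 us, IO-bound.
    print(prefill_speedup(T=110e-6, R=41e-6, alpha=0.88))  # ~3.0

    # H100: GQA8 shrinks the KV to 163 KB/token -> R = 8 us, compute-bound.
    print(prefill_speedup(T=70e-6, R=8e-6, alpha=0.88))    # ~8.3 (slide: x = 7)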

End-to-End (E2E) Gain for Continuous Batching

E2E Gain – Continuous Batching Analysis
■B – number of concurrent requests
■N_out – average number of output tokens
■Number of concurrent prefills ~ Binom(B, 1/N_out)
■E[TTFT(B)] = TTFT(1) + (B−1)/N_out ∙ TTFT(1) = (1 + (B−1)/N_out) ∙ TTFT(1)
●(B−1)/N_out is the expected number of additional simultaneous prefills in a prefill slot
■E[TPOT(B)] = TPOT(1) + (B−1) ∙ (TTFT(1)/N_out + |KV(P)|/BW_mem)
●|KV(P)| is the average KV cache size of a single input prompt
●BW_mem is the memory bandwidth; |KV(P)|/BW_mem is the time to transfer a single prompt's KV cache to the compute engines
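A sketch of both expectations; the numeric inputs below are illustrative placeholders, not measurements from the talk:

    def e_ttft(B, ttft1, n_out):
        # E[TTFT(B)]: a prefill slot hosts ~(B-1)/N_out additional
        # simultaneous prefills (Binom(B, 1/N_out) concurrent arrivals).
        return (1 + (B - 1) / n_out) * ttft1

    def e_tpot(B, tpot1, ttft1, n_out, kv_prompt_bytes, bw_mem):
        # E[TPOT(B)]: each decode step pays for concurrent prefills plus the
        # transfer of each prompt's KV cache to the compute engines.
        return tpot1 + (B - 1) * (ttft1 / n_out + kv_prompt_bytes / bw_mem)

    # Illustrative values only:
    print(e_ttft(B=32, ttft1=0.5, n_out=256))              # ~0.56 s
    print(e_tpot(B=32, tpot1=0.02, ttft1=0.5, n_out=256,
                 kv_prompt_bytes=1.6e9, bw_mem=3.35e12))   # ~0.10 s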

Measured TPOT Overhead for Llama-3-70B

Impact of Prefill Speedup on Decode Speedup
■TPOT(B) ≈ TPOT(1) + (B−1) ∙ (ΔP + ΔH), where ΔP = TTFT(1)/N_out and ΔH = |KV(P)|/BW_mem
■The system is required to meet TPOT(B) ≤ SLA_TPOT
■Maximal B to meet SLA_TPOT: ⌊(SLA_TPOT − TPOT(1)) / (ΔP + ΔH)⌋
■With offloading, ΔP_KV = ΔP/x, where x is the prefill speedup from IO
■B_KV / B_vanilla = (ΔP + ΔH)/(ΔP_KV + ΔH) = (ΔP + ΔH)/(ΔP/x + ΔH)
■When x → ∞, B_KV / B_vanilla → 1 + ΔP/ΔH
Insight: a larger batch at the same SLA is a TPS (throughput) gain.
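The SLA-driven batch-size bound and its limit, sketched numerically (the SLA and Δ values below are illustrative assumptions):

    import math

    def max_batch(sla_tpot, tpot1, dP, dH):
        # Slide's bound from TPOT(B) <= SLA_TPOT:
        # floor((SLA_TPOT - TPOT(1)) / (dP + dH))
        return math.floor((sla_tpot - tpot1) / (dP + dH))

    def batch_gain(dP, dH, x):
        # B_KV / B_vanilla: offloading shrinks only the prefill term dP
        # by the prefill speedup x; the HBM term dH is untouched.
        return (dP + dH) / (dP / x + dH)

    # Illustrative deltas: dP = 3 ms, dH = 1 ms per extra request.
    print(max_batch(sla_tpot=0.1, tpot1=0.02, dP=3e-3, dH=1e-3))  # 20
    print(batch_gain(dP=3e-3, dH=1e-3, x=7))                      # ~2.8
    print(batch_gain(dP=3e-3, dH=1e-3, x=1e9))                    # -> 4.0 == 1 + dP/dH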

Asymptotic E2E Gain Analysis
■TPS gain ≤ 1 + ΔP/ΔH, so aim to maximize ΔP/ΔH
■ΔP = TTFT(1)/N_out, ΔH = |KV(P)|/BW_mem
●TTFT(1) = N_in ∙ (2|model|/TFLOPs + N_in ∙ O(√|model|)/TFLOPs)
●The first term is the "model" compute time; the second is the attention compute time
■ΔH = N_in ∙ |KV_token|/BW_mem, where |KV_token| is the KV cache size of a single token

Asymptotic E2E Gain Analysis (Cont.)
TTFT(1) behaves differently in two distinct regimes (d denotes the model's hidden dimension):
■Short prompt regime (N_in ≪ 6d): ΔP/ΔH ∝ |model|/|KV_token| ∙ BW_HBM/TFLOPs ∙ 1/N_out
■Long prompt regime (N_in ≫ 6d): ΔP/ΔH ∝ √|model|/|KV_token| ∙ BW_HBM/TFLOPs ∙ N_in/N_out
The ratio is shaped by three groups of parameters:
■Model params: model size; KV-cache compression (GQA, MQA, MLA)
■App params: input/output token ratio
■GPU params: memory-bandwidth-to-compute ratio
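A sketch of the ΔP/ΔH ratio that caps the TPS gain; the attention constant c_attn and all numeric inputs are assumptions for illustration (the slide only specifies O(√|model|)):

    import math

    def dp_over_dh(model_bytes, kv_token_bytes, n_in, n_out,
                   tflops, bw_hbm, c_attn=1.0):
        # TTFT(1) per the slide:
        # N_in * (2|model| + N_in * c_attn * sqrt(|model|)) / TFLOPs
        ttft1 = n_in * (2 * model_bytes
                        + n_in * c_attn * math.sqrt(model_bytes)) / tflops
        dP = ttft1 / n_out                    # prefill overhead per decode step
        dH = n_in * kv_token_bytes / bw_hbm   # HBM transfer of the prompt's KV
        return dP / dH

    # Short prompts (N_in << 6d): ratio is independent of N_in.
    # Long prompts  (N_in >> 6d): ratio grows with N_in / N_out.
    print(dp_over_dh(model_bytes=140e9, kv_token_bytes=163e3, n_in=1024,
                     n_out=256, tflops=1e15, bw_hbm=3.35e12))  # ~22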

Shift the Efficiency-Latency Tradeoff Frontier
■Llama-3.1-70B
■H100
■Varying number of clients
(Pliops measurement figure)

Summary and Conclusions
✓KV-cache offloading avoids recomputation and accelerates prefill
✓Write speed only needs to match compute
✓Speedup depends mainly on read performance
✓In compute-bound cases, there is no benefit from faster-than-SSD memory
✓GPU-initiated IO allows full IO-compute overlap with zero CPU overhead
✓Throughput gains depend on the model, the GPU, and the workload
KV-cache offloading increases efficiency and reduces cost while meeting SLAs

Thank you! Let’s connect.
Eshcar Hillel
[email protected]
linkedin.com/in/eshcar