KV Caching Strategies for Latency-Critical LLM Applications by John Thomson

ScyllaDB · 12 slides · Oct 14, 2025

About This Presentation

LLM inference is split into two phases: Prefill and Decode. The Prefill phase fills the KV cache with the context, while the Decode phase is an autoregressive process that generates tokens one at a time. Maximizing the KV cache hit rate is critical to minimizing time-to-first-token (TTFT) latency. In this talk, I’ll sh...


Slide Content

A ScyllaDB Community
KV Caching Strategies for Latency-Critical LLM Applications
John Thomson
Deep Learning Algorithms Engineer

John Thomson (He/Him)

■Computer Engineering student at the University of Waterloo
■3x past intern in Deep Learning Algorithms at Nvidia
■Focus: Scalable + Performant LLM Inference

Agenda
■A Mental Model of LLMs
■Self Attention and KV Caching
■Optimizations
●Smart Cache Eviction
●KV-cache-aware routing

A Mental Model of LLMs
■Token: A small group of characters
■LLM’s Goal: Given a sequence of tokens, predict the next token
■Autoregressive: Append the output token back into the input and repeat

Amirova, A., Fteropoulli, T., Ahmed, N., Cowie, M., & Leibo, J. (2024). Framework-based qualitative analysis of free responses of Large Language Models: Algorithmic fidelity. PLOS ONE, 19, e0300024. https://doi.org/10.1371/journal.pone.0300024
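
To make the autoregressive loop concrete, here is a minimal Python sketch. `predict_next_token`, `eos_token`, and `max_new_tokens` are hypothetical placeholders standing in for a real model call; the talk does not prescribe this code.

```python
def generate(prompt_tokens, predict_next_token, eos_token, max_new_tokens=64):
    """Autoregressive decoding: predict the next token, append it back
    onto the input, and repeat until EOS or the token budget runs out."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # the LLM's goal: next-token prediction
        tokens.append(next_token)                # feed the output back into the input
        if next_token == eos_token:
            break
    return tokens
```

Without caching, each call to `predict_next_token` would re-process the entire growing sequence; avoiding that recomputation is exactly what the KV cache is for.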

A Mental Model of LLMs (cont.)
https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse

Self-Attention and KV Caching
■Self-attention: Transforms the hidden state based on the current + previous tokens
■Key ideas:
●Once we compute hidden states for a token, they never change.
●We can reuse the KV cache both within and between requests
https://livebook.manning.com/wiki/categories/llm/causal+attention
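
As a rough illustration of why this helps, the NumPy sketch below runs one decode step of single-head causal attention with a KV cache: only the newest token's query, key, and value are computed, while the keys and values of earlier tokens are read back from the cache unchanged. This is a simplified sketch (one head, no batching), not code from the talk.

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, kv_cache):
    """One decode step of single-head causal attention with a KV cache.

    x_new:    (d,) hidden state of the newest token.
    kv_cache: dict with "K" and "V" arrays of shape (t, d) holding the
              previous tokens' projections; those rows never change.
    """
    q = x_new @ W_q                         # only the new token needs a query
    kv_cache["K"] = np.vstack([kv_cache["K"], x_new @ W_k])  # append, never rewrite
    kv_cache["V"] = np.vstack([kv_cache["V"], x_new @ W_v])

    scores = kv_cache["K"] @ q / np.sqrt(q.shape[-1])        # attend over all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ kv_cache["V"]          # attention output for the new token

# Usage: start with empty caches of shape (0, d) and call decode_step per token, e.g.
# d = 8; kv_cache = {"K": np.zeros((0, d)), "V": np.zeros((0, d))}
```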

Self-Attention (cont.)
https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse

KV Caching: Implementation
■Borrow an idea from the OS: virtual memory
■Arrange the KV cache into blocks of tokens
■Logically structured as a radix tree
■Key observation: the KV cache can be reused between requests → reduces TTFT

https://arxiv.org/abs/1706.03762
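
A hypothetical sketch of the lookup side of block-based caching: each full block of tokens is keyed by a hash of the whole prefix up to and including that block, so a block can only be reused when everything before it matches, which is the same property a radix tree gives you. The block size, hash choice, and `cache` dict are illustrative assumptions, not the layout of any particular engine.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative choice)

def block_keys(tokens):
    """Key each complete block by a hash of the entire prefix ending at
    that block, so reuse requires an exact prefix match."""
    keys, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        h.update(repr(tokens[start:start + BLOCK_SIZE]).encode())
        keys.append(h.hexdigest())
    return keys

def cached_prefix_length(tokens, cache):
    """How many leading tokens already have KV blocks in `cache`
    (a dict mapping prefix hash -> stored K/V block)?  Prefill then
    only needs to run on tokens[matched:], which is what cuts TTFT."""
    matched = 0
    for key in block_keys(tokens):
        if key not in cache:
            break
        matched += BLOCK_SIZE
    return matched
```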

Optimization 1: Smart Cache Eviction
https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/
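
The linked NVIDIA post describes the cache-reuse work in TensorRT-LLM; the sketch below is not that implementation, just one plausible "smart" policy: evict the least-recently-used leaf block of the radix tree, so shared prefixes that many requests still branch off of are kept. Class and field names are invented for this example.

```python
import time

class CacheBlock:
    """One KV-cache block in a radix-tree-like structure (hypothetical)."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []          # blocks that extend this prefix
        self.last_used = time.monotonic()
        if parent is not None:
            parent.children.append(self)

def evict_lru_leaf(blocks):
    """Evict the least-recently-used *leaf* block.  Interior blocks are
    shared prefixes of other cached sequences, so removing a leaf never
    invalidates anything still reachable."""
    leaves = [b for b in blocks if not b.children]
    victim = min(leaves, key=lambda b: b.last_used, default=None)
    if victim is not None:
        blocks.remove(victim)
        if victim.parent is not None:
            victim.parent.children.remove(victim)
    return victim
```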

Optimization 2: KV-Cache-Aware Routing
https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/
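
Again the slide defers to the NVIDIA blog for the real mechanism; as a hedged sketch of the idea, a router can score each worker by how much of the incoming prompt it already has cached (less prefill, lower TTFT), minus a penalty for how busy it is. The `workers` structure and the weight below are invented for illustration.

```python
def prefix_overlap(tokens, cached_seq):
    """Length of the common prefix between a request and one cached sequence."""
    n = 0
    for a, b in zip(tokens, cached_seq):
        if a != b:
            break
        n += 1
    return n

def choose_worker(request_tokens, workers, load_weight=0.5):
    """KV-cache-aware routing sketch: prefer the worker holding the longest
    matching cached prefix, penalized by its queued work.

    workers: list of dicts like {"cached_seqs": [list of token ids, ...],
             "queued_tokens": int} -- field names are assumptions.
    """
    def score(w):
        best = max((prefix_overlap(request_tokens, seq) for seq in w["cached_seqs"]),
                   default=0)
        return best - load_weight * w["queued_tokens"]
    return max(workers, key=score)
```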

Conclusion
■KV Caching minimizes recomputation, both within and between requests
■Smart eviction + routing enable superlinear scaling

Thank you! Let’s connect.
John Thomson
[email protected]
https://www.linkedin.com/in/jthomson04/