KV Caching Strategies for Latency-Critical LLM Applications by John Thomson
ScyllaDB
About This Presentation
LLM inference is split into two phases: Prefill and Decode. The Prefill phase fills the KV Cache with the context, while the Decode phase is an autoregressive process which generates tokens. Maximizing the KV Cache hit rate is critical to minimize time-to-first-token latency. In this talk, I’ll share techniques NVIDIA’s TensorRT-LLM uses to maximize KV Cache hit rates for structured LLM workloads.
Slide Content
A ScyllaDB Community
KV Caching Strategies
for Latency-Critical
LLM Applications
John Thomson
Deep Learning
Algorithms Engineer
John Thomson (He/Him)
■Computer Engineering student at the University of Waterloo
■3x past Deep Learning Algorithms intern at NVIDIA
■Focus: Scalable + Performant LLM Inference
Agenda
■A Mental Model of LLMs
■Self-Attention and KV Caching
■Optimizations
●Smart Cache Eviction
●KV-cache-aware routing
A Mental Model of LLMs
■Token: A small group of characters.
■LLM’s Goal: Given a sequence of tokens, predict the next token
■Autoregressive: Append the output token back into the input (see the sketch below)
Amirova, A., Fteropoulli, T., Ahmed, N., Cowie, M., & Leibo, J. (2024). Framework-based qualitative analysis of free responses of Large Language Models: Algorithmic fidelity. PLOS ONE, 19, e0300024. https://doi.org/10.1371/journal.pone.0300024
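A minimal sketch of this autoregressive loop in Python. `model`, `tokenize`, and `detokenize` are hypothetical stand-ins for a real LLM and tokenizer, not part of any specific library; the point is only the control flow: predict one token, append it to the input, repeat until end-of-sequence or a length limit.

```python
# Hypothetical autoregressive generation loop (illustrative only).
# `model.predict_next` and `model.eos_token_id` are assumed interfaces.

def generate(model, tokenize, detokenize, prompt, max_new_tokens=32):
    tokens = tokenize(prompt)                    # prompt -> list of token ids
    for _ in range(max_new_tokens):
        next_token = model.predict_next(tokens)  # LLM's goal: predict the next token
        tokens.append(next_token)                # autoregression: feed it back into the input
        if next_token == model.eos_token_id:     # stop at end-of-sequence
            break
    return detokenize(tokens)
```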
A Mental Model of LLMs (cont.)
https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse
Self-Attention and KV Caching
■Self-attention: Transforms a token's hidden state based on the current + previous tokens
■Key ideas:
●Once we compute the hidden states for a token, they never change.
●We can reuse the KV cache both within and between requests (see the sketch below)
https://livebook.manning.com/wiki/categories/llm/causal+attention
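To make the caching opportunity concrete, here is a hedged NumPy sketch of a single decode step with a KV cache (one attention head, no batching; `W_q`, `W_k`, `W_v` are assumed projection matrices, not taken from the slides). Because causal self-attention only looks at the current and previous tokens, the keys and values appended for earlier tokens are never recomputed.

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One causal-attention decode step with a KV cache.

    x_new:           hidden state of the newest token, shape (d_model,)
    W_q/W_k/W_v:     projection matrices, shape (d_model, d_head)
    k_cache/v_cache: cached keys/values, shape (t, d_head);
                     start them as np.empty((0, d_head))
    """
    q = x_new @ W_q                              # query for the new token only
    k_cache = np.vstack([k_cache, x_new @ W_k])  # append the new key; old keys are untouched
    v_cache = np.vstack([v_cache, x_new @ W_v])  # append the new value; old values are untouched

    d_head = W_q.shape[1]
    scores = (k_cache @ q) / np.sqrt(d_head)     # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over past + current tokens
    return weights @ v_cache, k_cache, v_cache   # weighted sum of cached values
```

In a real engine the cache holds keys and values for every layer and head, but the invariant is the same: entries for already-processed tokens are write-once, which is what makes reuse within and across requests possible.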
KV Caching: Implementation
■Borrow an idea from the OS: virtual memory
■Arrange the KV cache into blocks of tokens
■Logically structure it as a radix tree
■Key observation: the KV cache can be reused between requests -> reduced TTFT (see the sketch below)
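A simplified sketch of the block + radix-tree idea described above (not TensorRT-LLM's actual data structures; `BLOCK_SIZE`, `BlockNode`, and `PrefixCache` are names invented for this example). Token ids are grouped into fixed-size blocks, and a trie keyed by block contents lets a new request reuse every KV block along its longest cached prefix, so prefill only needs to run on the unmatched suffix.

```python
BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)

class BlockNode:
    """One radix/trie node; each edge is a full block of token ids."""
    def __init__(self):
        self.children = {}       # tuple of BLOCK_SIZE token ids -> BlockNode
        self.kv_block_id = None  # handle to the stored KV data for this block

class PrefixCache:
    def __init__(self):
        self.root = BlockNode()

    def match_prefix(self, tokens):
        """Return (matched_token_count, kv_block_ids) for the longest cached prefix."""
        node, matched, block_ids = self.root, 0, []
        for i in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
            block = tuple(tokens[i:i + BLOCK_SIZE])
            if block not in node.children:
                break                           # first divergence: stop matching
            node = node.children[block]
            block_ids.append(node.kv_block_id)  # reuse this block's KV data
            matched += BLOCK_SIZE
        return matched, block_ids

    def insert(self, tokens, kv_block_ids):
        """Register the full blocks of a finished prefill so later requests can reuse them."""
        node = self.root
        for idx, i in enumerate(range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE)):
            block = tuple(tokens[i:i + BLOCK_SIZE])
            node = node.children.setdefault(block, BlockNode())
            node.kv_block_id = kv_block_ids[idx]
```

The same structure hints at the two optimizations on the agenda: eviction could prefer leaf blocks that no live prefix still references, and a KV-cache-aware router could run `match_prefix` against each worker's cache and send the request wherever the matched prefix is longest.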