KV Caching Strategies for Latency-Critical LLM Applications by John Thomson

ScyllaDB · 12 slides · Oct 14, 2025

About This Presentation

LLM inference is split into two phases: Prefill and Decode. The Prefill phase fills the KV cache with the context, while the Decode phase is an autoregressive process that generates tokens one at a time. Maximizing the KV cache hit rate is critical to minimizing time-to-first-token (TTFT) latency. In this talk, I’ll sh...


Slide Content

A ScyllaDB Community
KV Caching Strategies for Latency-Critical LLM Applications
John Thomson
Deep Learning Algorithms Engineer

John Thomson (He/Him)

■Computer Engineering student at the University of Waterloo
■3x past intern in Deep Learning Algorithms at Nvidia
■Focus: Scalable + Performant LLM Inference

Agenda
■A Mental Model of LLMs
■Self Attention and KV Caching
■Optimizations
●Smart Cache Eviction
●KV-cache-aware routing

A Mental Model of LLMs
■Token: A small group of characters
■LLM’s Goal: Given a sequence of tokens, predict the next token
■Autoregressive: Append the output token back into the input and repeat

Amirova, A., Fteropoulli, T., Ahmed, N., Cowie, M., & Leibo, J. (2024). Framework-based qualitative analysis of free responses of Large Language Models: Algorithmic fidelity. PLOS ONE, 19, e0300024. https://doi.org/10.1371/journal.pone.0300024
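
To make the autoregressive loop concrete, here is a minimal Python sketch. `predict_next_token`, `eos_token`, and `max_new_tokens` are hypothetical placeholders standing in for a real model call; the talk does not prescribe this code.

```python
def generate(prompt_tokens, predict_next_token, eos_token, max_new_tokens=64):
    """Autoregressive decoding: predict the next token, append it back
    onto the input, and repeat until EOS or the token budget runs out."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # the LLM's goal: next-token prediction
        tokens.append(next_token)                # feed the output back into the input
        if next_token == eos_token:
            break
    return tokens
```

Without caching, each call to `predict_next_token` would re-process the entire growing sequence; avoiding that recomputation is exactly what the KV cache is for.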

A Mental Model of LLMs (cont.)
https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse

Self-Attention and KV Caching
■Self-attention: Transforms the hidden state based on the current + previous tokens
■Key ideas:
●Once we compute hidden states for a token, they never change.
●We can reuse the KV cache both within and between requests
https://livebook.manning.com/wiki/categories/llm/causal+attention
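
As a rough illustration of why this helps, the NumPy sketch below runs one decode step of single-head causal attention with a KV cache: only the newest token's query, key, and value are computed, while the keys and values of earlier tokens are read back from the cache unchanged. This is a simplified sketch (one head, no batching), not code from the talk.

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, kv_cache):
    """One decode step of single-head causal attention with a KV cache.

    x_new:    (d,) hidden state of the newest token.
    kv_cache: dict with "K" and "V" arrays of shape (t, d) holding the
              previous tokens' projections; those rows never change.
    """
    q = x_new @ W_q                         # only the new token needs a query
    kv_cache["K"] = np.vstack([kv_cache["K"], x_new @ W_k])  # append, never rewrite
    kv_cache["V"] = np.vstack([kv_cache["V"], x_new @ W_v])

    scores = kv_cache["K"] @ q / np.sqrt(q.shape[-1])        # attend over all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ kv_cache["V"]          # attention output for the new token

# Usage: start with empty caches of shape (0, d) and call decode_step per token, e.g.
# d = 8; kv_cache = {"K": np.zeros((0, d)), "V": np.zeros((0, d))}
```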

Self-Attention (cont.)
https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse

KV Caching: Implementation
■Borrow an idea from the OS: virtual memory
■Arrange the KV cache into blocks of tokens
■Logically structured as a radix tree
■Key observation: the KV cache can be reused between requests → reduces TTFT

https://arxiv.org/abs/1706.03762
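
A hypothetical sketch of the lookup side of block-based caching: each full block of tokens is keyed by a hash of the whole prefix up to and including that block, so a block can only be reused when everything before it matches, which is the same property a radix tree gives you. The block size, hash choice, and `cache` dict are illustrative assumptions, not the layout of any particular engine.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative choice)

def block_keys(tokens):
    """Key each complete block by a hash of the entire prefix ending at
    that block, so reuse requires an exact prefix match."""
    keys, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        h.update(repr(tokens[start:start + BLOCK_SIZE]).encode())
        keys.append(h.hexdigest())
    return keys

def cached_prefix_length(tokens, cache):
    """How many leading tokens already have KV blocks in `cache`
    (a dict mapping prefix hash -> stored K/V block)?  Prefill then
    only needs to run on tokens[matched:], which is what cuts TTFT."""
    matched = 0
    for key in block_keys(tokens):
        if key not in cache:
            break
        matched += BLOCK_SIZE
    return matched
```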

Optimization 1: Smart Cache Eviction
https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/
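
The linked NVIDIA post describes the cache-reuse work in TensorRT-LLM; the sketch below is not that implementation, just one plausible "smart" policy: evict the least-recently-used leaf block of the radix tree, so shared prefixes that many requests still branch off of are kept. Class and field names are invented for this example.

```python
import time

class CacheBlock:
    """One KV-cache block in a radix-tree-like structure (hypothetical)."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []          # blocks that extend this prefix
        self.last_used = time.monotonic()
        if parent is not None:
            parent.children.append(self)

def evict_lru_leaf(blocks):
    """Evict the least-recently-used *leaf* block.  Interior blocks are
    shared prefixes of other cached sequences, so removing a leaf never
    invalidates anything still reachable."""
    leaves = [b for b in blocks if not b.children]
    victim = min(leaves, key=lambda b: b.last_used, default=None)
    if victim is not None:
        blocks.remove(victim)
        if victim.parent is not None:
            victim.parent.children.remove(victim)
    return victim
```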

Optimization 2: KV-Cache-Aware Routing
https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/
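
Again the slide defers to the NVIDIA blog for the real mechanism; as a hedged sketch of the idea, a router can score each worker by how much of the incoming prompt it already has cached (less prefill, lower TTFT), minus a penalty for how busy it is. The `workers` structure and the weight below are invented for illustration.

```python
def prefix_overlap(tokens, cached_seq):
    """Length of the common prefix between a request and one cached sequence."""
    n = 0
    for a, b in zip(tokens, cached_seq):
        if a != b:
            break
        n += 1
    return n

def choose_worker(request_tokens, workers, load_weight=0.5):
    """KV-cache-aware routing sketch: prefer the worker holding the longest
    matching cached prefix, penalized by its queued work.

    workers: list of dicts like {"cached_seqs": [list of token ids, ...],
             "queued_tokens": int} -- field names are assumptions.
    """
    def score(w):
        best = max((prefix_overlap(request_tokens, seq) for seq in w["cached_seqs"]),
                   default=0)
        return best - load_weight * w["queued_tokens"]
    return max(workers, key=score)
```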

Conclusion
■KV Caching minimizes recomputation, both within and between requests
■Smart eviction + routing enable superlinear scaling

Thank you! Let’s connect.
John Thomson
[email protected]
https://www.linkedin.com/in/jthomson04/