LLM Inference Optimization by Chip Huyen

ScyllaDB · 37 slides · Oct 09, 2025

About This Presentation

This talk discusses why LLM inference is slow and covers key latency metrics. It also covers techniques that make LLM inference fast, including different batching strategies, parallelism, and prompt caching. Not all latency problems are engineering problems, though. The talk will also cover interesting tricks to h...


Slide Content

Inference optimization
Chip Huyen
huyenchip.com
@chipro

LLM compute
[Diagram: Training · Finetuning · Inference]

LLM economics
[Diagram: Training · Finetuning · Inference]

LLM economics
[Diagram: Training · Finetuning · Inference w/o reasoning vs. Training · Finetuning · Inference w/ reasoning]

LLM economics
[Diagram: Training compute : Inference compute]

Inference cost determines LLM profitability



Optimization: quality, cost, latency
[Diagram: Good · Fast · Cheap]

Optimization: quality, cost, latency
[Diagram: Good · Fast · Cheap]
What’s missing?

⚠ There are many, many techniques; this talk will only focus on core, easier-to-apply ones ⚠


Input: Who wrote the book Chaos? → Output: James Gleick did.
TTFT: Time To First Token
TPOT: Time Per Output Token
Total latency = TTFT + TPOT × (N − 1), where N is the number of output tokens

Input: Who wrote the book Chaos? → Output: James Gleick did.
TTFT*: Time To First Token
TPOT: Time Per Output Token
Total latency = TTFT + TPOT × (N − 1)
*First generated token != first visible token

Input: Who wrote the book Chaos? → Output: James Gleick did.
TTFT: Time To First Token
TPOT: Time Per Output Token
Total latency = TTFT + TPOT × (N − 1)
Which metric is more important to you?

Input: Who wrote the book Chaos? → Output: James Gleick did.
TTFT: Time To First Token
TPOT: Time Per Output Token
Total latency = TTFT + TPOT × (N − 1)
TTFT vs. TPOT tradeoff?
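
A minimal sketch of the total-latency formula above, with assumed example numbers (not from the talk):

```python
def total_latency(ttft: float, tpot: float, n_output_tokens: int) -> float:
    """Total latency = TTFT + TPOT * (N - 1): the first token arrives after
    TTFT, and each of the remaining N - 1 tokens takes TPOT on average."""
    return ttft + tpot * (n_output_tokens - 1)

# Assumed numbers: 200 ms TTFT, 50 ms TPOT, 100 output tokens
print(total_latency(0.200, 0.050, 100))  # -> 5.15 seconds
```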

Throughput vs. goodput
Example:
● Goal
  ○ TTFT < 200 ms
  ○ TPOT < 100 ms
● Throughput: 10 requests/m
What’s goodput?
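
Goodput counts only the requests that meet the latency targets. A minimal sketch, assuming hypothetical per-request measurements and the SLOs from the slide:

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_ms: float   # time to first token
    tpot_ms: float   # average time per output token

# Hypothetical measurements for requests completed in some time window
requests = [
    RequestMetrics(ttft_ms=150, tpot_ms=80),
    RequestMetrics(ttft_ms=250, tpot_ms=60),   # misses the TTFT target
    RequestMetrics(ttft_ms=180, tpot_ms=120),  # misses the TPOT target
    RequestMetrics(ttft_ms=190, tpot_ms=90),
]

def goodput(reqs, ttft_slo_ms=200, tpot_slo_ms=100):
    """Number of requests that satisfy *all* latency targets."""
    return sum(r.ttft_ms < ttft_slo_ms and r.tpot_ms < tpot_slo_ms for r in reqs)

print(f"throughput: {len(requests)} requests, goodput: {goodput(requests)} requests")
```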

How to make LLMs run fast?
[Diagram: Apps send requests to an inference server hosting Model 1, Model 2, Model 3 on top of hardware, and receive responses]

How to make LLMs run fast?
[Diagram: Apps ⇄ inference server (Model 1, Model 2, Model 3) ⇄ hardware]
Mm, can we just use a lot of compute?

Given X machines & Y models, how to efficiently load them?
[Diagram: models of different sizes (8B, 13B, 34B, 70B) to be packed onto the available machines]

How to make LLMs run fast?
[Diagram: Apps ⇄ inference server (Model 1, Model 2, Model 3) ⇄ hardware]
0. Make more powerful hardware

[Diagram: Apps ⇄ inference server (Model 1, Model 2, Model 3) ⇄ hardware]
0. Make more powerful hardware
1. Make models more efficient

[Diagram: Apps ⇄ inference server (Model 1, Model 2, Model 3) ⇄ hardware]
0. Make more powerful hardware
1. Make models more efficient
2. Make service more efficient


[Diagram: Apps ⇄ inference server (Model 1, Model 2, Model 3) ⇄ hardware]
0. Make more powerful hardware
1. Make models more efficient (can change model quality)
2. Make service more efficient (shouldn’t modify output quality)

Make models more efficient
● Quantization
  ○ Reducing precision, e.g. 32 bit → 8 bit
    7B params × 32 bits = 28 GB
    7B params × 8 bits = 7 GB
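
A minimal sketch of the memory arithmetic above, plus naive symmetric int8 weight quantization in NumPy (the quantization scheme here is an illustrative assumption, not a recipe from the talk):

```python
import numpy as np

def model_memory_gb(n_params: float, bits: int) -> float:
    """Memory needed to store the weights alone."""
    return n_params * bits / 8 / 1e9

print(model_memory_gb(7e9, 32))  # ~28 GB at 32-bit
print(model_memory_gb(7e9, 8))   # ~7 GB at 8-bit

# Naive symmetric int8 quantization of one weight tensor
def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0          # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale      # approximate original weights

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # small quantization error
```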

Make models more efficient
● Quantization
● Distillation
[Diagram: Teacher → Data → Student]
Can student outperform teacher?
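
A minimal sketch of the distillation idea, assuming PyTorch and placeholder teacher/student logits; the softened KL-divergence loss follows Hinton et al.'s formulation and is not a specific recipe from the talk:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Train the student to match the teacher's softened output distribution."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)    # teacher probabilities
    log_probs = F.log_softmax(student_logits / t, dim=-1)   # student log-probabilities
    # KL divergence between the two distributions, scaled by T^2
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * t * t

# Hypothetical shapes: a batch of 4 positions over a 32k-token vocabulary
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```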

Make service more efficient



Batching
[Diagram: incoming requests over time grouped into batches under static batching vs. dynamic batching]
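
A minimal sketch of dynamic batching, assuming a hypothetical request queue: the server dispatches a batch once it is full or once the oldest request has waited long enough (static batching would instead always wait for a fixed-size batch):

```python
import time
from queue import Queue, Empty

def dynamic_batcher(request_queue: Queue, max_batch_size=8, max_wait_s=0.01):
    """Yield batches: dispatch when the batch is full OR the oldest
    request has waited max_wait_s, whichever comes first."""
    while True:
        batch = [request_queue.get()]          # block until one request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except Empty:
                break
        yield batch

# Usage sketch: for batch in dynamic_batcher(queue): run_inference(batch)
```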

Continuous batching
[Diagram: with normal batching, the next batch (R5–R8) starts only after every request in the current batch (R1–R4) finishes; with continuous batching, a finished request’s slot is immediately given to a new request]
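
A minimal sketch of the continuous batching schedule, assuming a hypothetical step() function that advances every active request by one token and reports which ones finished; this shows only the scheduling logic, not a real serving engine:

```python
def continuous_batching(waiting, step, max_batch_size=4):
    """waiting: list of pending requests; step(batch) advances every
    active request by one token and returns the ones that finished."""
    active = []
    while waiting or active:
        # Fill any free slots immediately instead of waiting for the
        # whole batch to finish (the key difference from static batching).
        while waiting and len(active) < max_batch_size:
            active.append(waiting.pop(0))
        finished = step(active)
        active = [r for r in active if r not in finished]
        for r in finished:
            yield r
```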

Input: Who wrote the book Chaos? → Output: James Gleick did.
Prefill → Decode → Decode → Decode → Decode
Input tokens can be processed in parallel; output tokens are generated sequentially

Input: Who wrote the book Chaos? → Output: James Gleick did.
Prefill → Decode → Decode → Decode → Decode
Prefill is compute-bound; decode is memory bandwidth-bound
Decoupling prefill & decode
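
A minimal sketch of the two phases, assuming a hypothetical model.forward() that returns logits and a KV cache: prefill processes all prompt tokens in one parallel pass (roughly the TTFT), then decode generates one token per step (roughly one TPOT each):

```python
def generate(model, prompt_token_ids, max_new_tokens, eos_id):
    # Prefill: one forward pass over all prompt tokens in parallel (compute-bound)
    logits, kv_cache = model.forward(prompt_token_ids, kv_cache=None)
    next_token = int(logits[-1].argmax())

    output = [next_token]
    # Decode: one token per step, reusing the KV cache (memory bandwidth-bound)
    for _ in range(max_new_tokens - 1):
        if next_token == eos_id:
            break
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        next_token = int(logits[-1].argmax())
        output.append(next_token)
    return output
```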

Tensor parallelism
[Diagram: non-distributed computation vs. the same computation split across Device 1 and Device 2, with a gather step combining the partial results]
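
A minimal NumPy sketch of the tensor parallelism idea: split one weight matrix column-wise across "devices" (here just array shards), compute partial results, then gather. A real implementation would use collective ops (e.g. all-gather) across GPUs:

```python
import numpy as np

x = np.random.randn(2, 512)          # activations: (batch, d_model)
W = np.random.randn(512, 2048)       # one big weight matrix

# Non-distributed: one device holds the full matrix
y_full = x @ W

# Tensor parallelism: split W column-wise across 2 "devices",
# each computes a partial result, then gather (concatenate)
W_shards = np.split(W, 2, axis=1)         # each shard: (512, 1024)
partials = [x @ w for w in W_shards]      # would run on separate GPUs
y_tp = np.concatenate(partials, axis=1)   # the "gather" step

assert np.allclose(y_full, y_tp)
```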

Pipeline parallelism
[Diagram: Layers 1–4 assigned to Devices 1–4; the input is split into micro-batches MB1–MB4 that flow through the devices in a staggered schedule over time]
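
A minimal sketch of pipeline parallelism with micro-batches, using toy Python functions as stand-ins for the per-device stages; it illustrates only the staggered schedule, not real inter-device communication:

```python
# Each "device" owns one pipeline stage (a subset of layers)
stages = [lambda x: x + 1 for _ in range(4)]   # toy stages on Devices 1-4

def pipeline_forward(micro_batches, stages):
    """Process micro-batches through the stages in a staggered schedule."""
    n_mb, n_stages = len(micro_batches), len(stages)
    outputs = list(micro_batches)
    # At clock tick t, stage s works on micro-batch t - s (if it exists),
    # so different devices process different micro-batches at the same time.
    for t in range(n_mb + n_stages - 1):
        for s in range(n_stages):
            mb = t - s
            if 0 <= mb < n_mb:
                outputs[mb] = stages[s](outputs[mb])
    return outputs

print(pipeline_forward([0, 10, 20, 30], stages))  # -> [4, 14, 24, 34]
```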

Prompt caching

Prompt caching
Structure your prompt to maximize prompt cache hits!
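
Prompt caches typically match on a shared prefix, so keeping the stable parts of the prompt first and the per-request parts last maximizes cache hits. A minimal sketch with hypothetical content:

```python
# Stable across requests -> put first so the cached prefix can be reused
SYSTEM_PROMPT = "You are a support assistant for Acme Corp."   # hypothetical
FEW_SHOT_EXAMPLES = "Q: ...\nA: ...\n"                          # hypothetical
PRODUCT_DOCS = "<long, rarely-changing documentation>"          # hypothetical

def build_prompt(user_question: str) -> str:
    # Cache-friendly: static prefix first, per-request content last
    return "\n\n".join([SYSTEM_PROMPT, FEW_SHOT_EXAMPLES, PRODUCT_DOCS, user_question])
```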

Claude Code: prompt cache
sniffly.dev

Key takeaways
1. Inference optimization is essential for LLM economics
2. Evaluate an inference service not just by cost & latency but also quality (techniques used can change a model’s quality)
3. Structure your prompts to maximize prompt cache hits

Thank you!

@chipro
chiphuyen
huyenchip.com/blog
huyenchip.com/llama-police (track top 1200+ AI repos)