This talk discusses why LLM inference is slow and the key latency metrics. It also covers techniques that make LLM inference fast, including batching strategies, parallelism, and prompt caching. Not all latency problems are engineering problems, though: the talk also covers tricks for hiding latency at the application level.
Optimization: quality, cost, latency
[Diagram: the Good / Fast / Cheap triangle]
What’s missing?
Many, many techniques
Will only focus on core, easier-to-apply techniques
[Diagram: input "Who wrote the book Chaos?" → output "James Gleick did."]
TTFT*: Time To First Token
TPOT: Time Per Output Token
Total latency = TTFT + TPOT × (N − 1)
*First generated token != first visible token
Which metric is more important to you?
TTFT vs. TPOT tradeoff?
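A minimal sketch of how these two metrics can be measured from a streaming response. `token_stream` is a placeholder for any iterator that yields generated tokens (e.g. a streaming API client), not a specific SDK:

```python
import time

def measure_latency(token_stream):
    """Measure TTFT, TPOT, and total latency from a token iterator.

    Sketch of the slide's formula: total = TTFT + TPOT * (N - 1).
    """
    start = time.perf_counter()
    # Record the arrival time of each token as the stream produces it.
    arrival_times = [time.perf_counter() for _ in token_stream]
    assert arrival_times, "stream produced no tokens"

    n = len(arrival_times)
    ttft = arrival_times[0] - start                      # Time To First Token
    tpot = (arrival_times[-1] - arrival_times[0]) / (n - 1) if n > 1 else 0.0
    total = ttft + tpot * (n - 1)                        # == arrival_times[-1] - start
    return ttft, tpot, total
```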
Throughput vs. goodput
Examples:
● Goal (SLO)
  ○ TTFT < 200 ms
  ○ TPOT < 100 ms
● Throughput: 10 requests/min
What’s goodput? The number of requests per unit of time that actually meet the latency targets.
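A small illustration of the difference, using made-up per-request measurements and the SLO from the slide. Goodput counts only the requests that meet both targets:

```python
# Hypothetical request records: (ttft_ms, tpot_ms) for each completed request
# in a one-minute window.
requests = [(150, 80), (250, 90), (180, 120), (190, 95)]
window_minutes = 1

SLO_TTFT_MS = 200
SLO_TPOT_MS = 100

throughput = len(requests) / window_minutes
goodput = sum(
    1 for ttft, tpot in requests
    if ttft < SLO_TTFT_MS and tpot < SLO_TPOT_MS
) / window_minutes

print(f"throughput: {throughput} req/min, goodput: {goodput} req/min")
```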
[Diagram: apps send requests to an inference server, which runs Model 1, Model 2, and Model 3 on the underlying hardware and returns responses]
How to make LLMs run fast?
Mm, can we just use a lot of compute?
[Diagram: a mix of models of different sizes (8B, 13B, 34B, 70B) to be placed onto machines]
Given X machines & Y models, how to efficiently load them?
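The talk poses the question without prescribing an answer; as one illustration only, here is a simple first-fit-decreasing sketch. The model sizes, machine capacities, and the packing strategy are all assumptions, and real placement also has to account for KV cache, replication, and traffic:

```python
def assign_models(model_sizes_gb, machine_capacity_gb, num_machines):
    """Greedy first-fit-decreasing packing of model weights onto machines."""
    machines = [[] for _ in range(num_machines)]
    free = [machine_capacity_gb] * num_machines

    for size in sorted(model_sizes_gb, reverse=True):    # place largest models first
        for i in range(num_machines):
            if free[i] >= size:
                machines[i].append(size)
                free[i] -= size
                break
        else:
            raise ValueError(f"no machine can fit a {size} GB model")
    return machines

# e.g. 70B / 34B / 13B / 8B models at roughly 2 bytes per parameter
print(assign_models([140, 68, 26, 16, 16], machine_capacity_gb=160, num_machines=2))
```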
0. Make more powerful hardware
1. Make models more efficient (can change model quality)
2. Make service more efficient (shouldn’t modify output quality)
Make models more efficient
● Quantization
  ○ Reducing precision, e.g. 32-bit → 8-bit
    7B params × 32 bits = 28 GB
    7B params × 8 bits = 7 GB
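The arithmetic behind the slide, as a quick back-of-the-envelope helper (weights only; KV cache and activations are ignored):

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight memory for a model at a given precision."""
    return num_params * bits_per_param / 8 / 1e9    # bits -> bytes -> GB

params = 7e9                                        # a 7B-parameter model
print(model_memory_gb(params, 32))                  # ~28 GB at FP32
print(model_memory_gb(params, 8))                   # ~7 GB at INT8
```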
Make models more efficient
● Quantization
● Distillation
[Diagram: a teacher model training a smaller student model on data]
Can the student outperform the teacher?
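A generic sketch of the soft-label distillation objective: KL divergence between the teacher's and the student's temperature-softened token distributions, following the classic Hinton et al. recipe. This is illustrative and not necessarily the exact setup referred to in the talk:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # batchmean + T^2 scaling is the usual convention for soft-label distillation.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t
```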
Make service more efficient
Batching
● Static batching
● Dynamic batching
[Diagram: requests arriving over time, grouped into batches]
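A minimal sketch of dynamic batching: instead of waiting for a fixed-size batch (static batching), serve a batch as soon as it is full or the oldest request has waited long enough. `max_batch_size` and `max_wait_s` are illustrative parameters, not values from the talk:

```python
import queue
import time

def collect_batch(request_queue: "queue.Queue", max_batch_size=8, max_wait_s=0.01):
    """Pull a batch of requests: full batch OR max wait time, whichever comes first."""
    batch = [request_queue.get()]                 # block until the first request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                 # oldest request has waited long enough
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```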
[Diagram: prefill processes the input ("Who wrote the book Chaos?") in one pass, then decode generates the output tokens ("James", "Gleick", "did", ".") one at a time]
● Prefill: input tokens can be processed in parallel → compute-bound
● Decode: output tokens are generated sequentially → memory bandwidth-bound
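A minimal prefill/decode loop using Hugging Face transformers, with GPT-2 as a stand-in model; greedy decoding and the fixed output length are assumptions for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("Who wrote the book Chaos?", return_tensors="pt")

with torch.no_grad():
    # Prefill: all input tokens go through the model in one parallel pass
    # (compute-bound); the KV cache is built here.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    for _ in range(20):
        # Decode: one token at a time, reusing the KV cache
        # (memory bandwidth-bound).
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```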
Decoupling prefill & decode
Prompt caching
Structure your prompt to maximize prompt cache hits!
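A sketch of what "structure your prompt" means in practice, assuming the provider caches and reuses the longest prefix that is identical across requests; the prompt names and contents below are placeholders:

```python
SYSTEM_PROMPT = "You are a helpful assistant for ..."    # stable across requests, long
FEW_SHOT_EXAMPLES = "Example 1: ...\nExample 2: ..."     # stable across requests, long

def build_prompt(user_question: str) -> str:
    # Good: the stable parts come first, so every request shares a long
    # cacheable prefix and only the tail changes.
    return f"{SYSTEM_PROMPT}\n\n{FEW_SHOT_EXAMPLES}\n\nQuestion: {user_question}"

def build_prompt_bad(user_question: str) -> str:
    # Bad: per-request content up front breaks the shared prefix,
    # so the prompt cache is missed on every request.
    return f"Question: {user_question}\n\n{SYSTEM_PROMPT}\n\n{FEW_SHOT_EXAMPLES}"
```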
Claude Code: prompt cache
sniffly.dev
Key takeaways
1. Inference optimization is essential for LLM economics
2. Evaluate an inference service not just by cost & latency but also quality (the techniques used can change a model’s quality)
3. Structure your prompts to maximize prompt cache hits
Thank you!
@chipro
chiphuyen
chiphuyen
huyenchip.com/blog
huyenchip.com/llama-police (track top 1,200+ AI repos)