This talk discusses why LLM inference is slow and the key latency metrics. It also covers techniques that make LLM inference fast, including batching strategies, parallelism, and prompt caching. Not all latency problems are engineering problems, though: the talk also covers tricks for hiding latency at the application level.
Optimization: quality, cost, latency
[Diagram: the Good / Fast / Cheap triangle]
What’s missing?
Many, many techniques
Will only focus on core, easier-to-apply techniques
[Diagram: input "Who wrote the book Chaos?" → output "James Gleick did."]
TTFT*: Time To First Token
TPOT: Time Per Output Token
Total latency = TTFT + TPOT × (N − 1)
*First generated token != first visible token
Which metric is more important to you?
TTFT vs. TPOT tradeoff?
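A minimal sketch of how these two metrics can be measured from a streaming response. `token_stream` is a placeholder for any iterator that yields generated tokens (e.g. a streaming API client), not a specific SDK:

```python
import time

def measure_latency(token_stream):
    """Measure TTFT, TPOT, and total latency from a token iterator.

    Sketch of the slide's formula: total = TTFT + TPOT * (N - 1).
    """
    start = time.perf_counter()
    # Record the arrival time of each token as the stream produces it.
    arrival_times = [time.perf_counter() for _ in token_stream]
    assert arrival_times, "stream produced no tokens"

    n = len(arrival_times)
    ttft = arrival_times[0] - start                      # Time To First Token
    tpot = (arrival_times[-1] - arrival_times[0]) / (n - 1) if n > 1 else 0.0
    total = ttft + tpot * (n - 1)                        # == arrival_times[-1] - start
    return ttft, tpot, total
```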
Throughput vs. goodput
Examples:
● Goal (SLO)
  ○ TTFT < 200 ms
  ○ TPOT < 100 ms
● Throughput: 10 requests/min
What’s goodput? The number of requests per unit of time that actually meet the latency targets.
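A small illustration of the difference, using made-up per-request measurements and the SLO from the slide. Goodput counts only the requests that meet both targets:

```python
# Hypothetical request records: (ttft_ms, tpot_ms) for each completed request
# in a one-minute window.
requests = [(150, 80), (250, 90), (180, 120), (190, 95)]
window_minutes = 1

SLO_TTFT_MS = 200
SLO_TPOT_MS = 100

throughput = len(requests) / window_minutes
goodput = sum(
    1 for ttft, tpot in requests
    if ttft < SLO_TTFT_MS and tpot < SLO_TPOT_MS
) / window_minutes

print(f"throughput: {throughput} req/min, goodput: {goodput} req/min")
```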
[Diagram: apps send requests to an inference server, which runs Model 1, Model 2, and Model 3 on the underlying hardware and returns responses]
How to make LLMs run fast?
Mm, can we just use a lot of compute?
[Diagram: a mix of models of different sizes (8B, 13B, 34B, 70B) to be placed onto machines]
Given X machines & Y models, how to efficiently load them?
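The talk poses the question without prescribing an answer; as one illustration only, here is a simple first-fit-decreasing sketch. The model sizes, machine capacities, and the packing strategy are all assumptions, and real placement also has to account for KV cache, replication, and traffic:

```python
def assign_models(model_sizes_gb, machine_capacity_gb, num_machines):
    """Greedy first-fit-decreasing packing of model weights onto machines."""
    machines = [[] for _ in range(num_machines)]
    free = [machine_capacity_gb] * num_machines

    for size in sorted(model_sizes_gb, reverse=True):    # place largest models first
        for i in range(num_machines):
            if free[i] >= size:
                machines[i].append(size)
                free[i] -= size
                break
        else:
            raise ValueError(f"no machine can fit a {size} GB model")
    return machines

# e.g. 70B / 34B / 13B / 8B models at roughly 2 bytes per parameter
print(assign_models([140, 68, 26, 16, 16], machine_capacity_gb=160, num_machines=2))
```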
0. Make more powerful hardware
1. Make models more efficient (can change model quality)
2. Make service more efficient (shouldn’t modify output quality)
Make models more efficient
● Quantization
  ○ Reducing precision, e.g. 32-bit → 8-bit
    7B params × 32 bits = 28 GB
    7B params × 8 bits = 7 GB
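The arithmetic behind the slide, as a quick back-of-the-envelope helper (weights only; KV cache and activations are ignored):

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight memory for a model at a given precision."""
    return num_params * bits_per_param / 8 / 1e9    # bits -> bytes -> GB

params = 7e9                                        # a 7B-parameter model
print(model_memory_gb(params, 32))                  # ~28 GB at FP32
print(model_memory_gb(params, 8))                   # ~7 GB at INT8
```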
Make models more efficient
● Quantization
● Distillation
[Diagram: a teacher model training a smaller student model on data]
Can the student outperform the teacher?
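A generic sketch of the soft-label distillation objective: KL divergence between the teacher's and the student's temperature-softened token distributions, following the classic Hinton et al. recipe. This is illustrative and not necessarily the exact setup referred to in the talk:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # batchmean + T^2 scaling is the usual convention for soft-label distillation.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t
```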
Make service more efficient
Batching
● Static batching
● Dynamic batching
[Diagram: requests arriving over time, grouped into batches]
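A minimal sketch of dynamic batching: instead of waiting for a fixed-size batch (static batching), serve a batch as soon as it is full or the oldest request has waited long enough. `max_batch_size` and `max_wait_s` are illustrative parameters, not values from the talk:

```python
import queue
import time

def collect_batch(request_queue: "queue.Queue", max_batch_size=8, max_wait_s=0.01):
    """Pull a batch of requests: full batch OR max wait time, whichever comes first."""
    batch = [request_queue.get()]                 # block until the first request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                 # oldest request has waited long enough
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```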
[Diagram: prefill processes the input ("Who wrote the book Chaos?") in one pass, then decode generates the output tokens ("James", "Gleick", "did", ".") one at a time]
● Prefill: input tokens can be processed in parallel → compute-bound
● Decode: output tokens are generated sequentially → memory bandwidth-bound
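A minimal prefill/decode loop using Hugging Face transformers, with GPT-2 as a stand-in model; greedy decoding and the fixed output length are assumptions for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("Who wrote the book Chaos?", return_tensors="pt")

with torch.no_grad():
    # Prefill: all input tokens go through the model in one parallel pass
    # (compute-bound); the KV cache is built here.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    for _ in range(20):
        # Decode: one token at a time, reusing the KV cache
        # (memory bandwidth-bound).
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```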
Decoupling prefill & decode
Prompt caching
Structure your prompt to maximize prompt cache hits!
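A sketch of what "structure your prompt" means in practice, assuming the provider caches and reuses the longest prefix that is identical across requests; the prompt names and contents below are placeholders:

```python
SYSTEM_PROMPT = "You are a helpful assistant for ..."    # stable across requests, long
FEW_SHOT_EXAMPLES = "Example 1: ...\nExample 2: ..."     # stable across requests, long

def build_prompt(user_question: str) -> str:
    # Good: the stable parts come first, so every request shares a long
    # cacheable prefix and only the tail changes.
    return f"{SYSTEM_PROMPT}\n\n{FEW_SHOT_EXAMPLES}\n\nQuestion: {user_question}"

def build_prompt_bad(user_question: str) -> str:
    # Bad: per-request content up front breaks the shared prefix,
    # so the prompt cache is missed on every request.
    return f"Question: {user_question}\n\n{SYSTEM_PROMPT}\n\n{FEW_SHOT_EXAMPLES}"
```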
Claude Code: prompt cache
sniffly.dev
Key takeaways
1. Inference optimization is essential for LLM economics
2. Evaluate an inference service not just by cost & latency but also quality (the techniques used can change a model’s quality)
3. Structure your prompts to maximize prompt cache hits
Thank you!
@chipro
chiphuyen
chiphuyen
huyenchip.com/blog
huyenchip.com/llama-police (track top 1,200+ AI repos)