The LLM
Dilemma
“ Self-host or just call the API? ”
Why APIs?
Ease of Use
Access Private Models
No infra needed
Standard API client libraries
Easy integration with application frameworks
Private models (OpenAI GPT-4, Claude Sonnet) are only available via API access
Limited customization
Easy to get started and scale usage
Latency and behavior can be unpredictable
Why Self-Host LLMs?
Control
Data Security and Privacy Regulations
Predictable Quality and Latency
Ensured Availability and Reproducibility
Customization
Use Custom Fine-tuned Models
Flexible Cost/Performance Optimization
Advanced Inference Strategies
Self-Hosting LLMs
sounds fun!?
Why NOT to Self-Host?
High Cost
Building and maintaining the system
Hard to optimize
Hard to predict cost based on workload
Complexity
Slows down development iterations
Under-utilized GPU compute
Scaling challenges
Let’s
Dive
In!
Technical considerations for
making the decision
https://github.com/bentoml/OpenLLM
Run or deploy with a single command
Any open-source or fine-tuned LLM
SOTA Inference Performance
OpenAI-compatible API
Built-in Chat UI
Running LLM Inference has become
easier than ever
Why does OpenAI-compatible API matter?
Compatible with OpenAI’s API specification and SDK library
Switch LLMs by changing only 1-2 lines of code in your LLM app (see the sketch below)
Build your prototype with any API now, and change it later if needed
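For example, a minimal sketch using the OpenAI Python SDK: pointing base_url at a self-hosted, OpenAI-compatible endpoint is typically the only change needed. The endpoint URL and model ID below are placeholders, not defaults of any particular server.

```python
from openai import OpenAI

# Hosted API: the default base_url points at OpenAI.
# client = OpenAI(api_key="sk-...")

# Self-hosted: only the base_url and the model name change.
# The endpoint URL and model ID below are placeholders.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model ID
    messages=[{"role": "user", "content": "Summarize continuous batching."}],
)
print(response.choices[0].message.content)
```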
“OpenAI-Compatible” is sometimes a lie
Model
Performance
The performance gap between
different LLMs may directly
impact your application.
Coupling
With
Application
Models are often coupled and co-optimized with other components like retrievers or prompts.
Function
Calling
Differences in function calling formats and capabilities can lead to various API incompatibilities.
Multi-modal
and Vision
Most open-source Vision Language Models do not support multiple image inputs (Qwen-VL being a notable exception).
LLM Inference Optimization
The Table Stakes
Continuous Batching
Batch incoming requests with iteration-level scheduling
Up to 23x higher throughput and reduced p50 latency; requires additional GPU memory beyond the model weights.
Token Streaming
Returns tokens incrementally as they are being generated
Improves perceived latency and user experience; critical for real-time chat applications (see the sketch after this list).
Quantization
Reduces model precision to lower bit representation
Decreases model size and inference latency. May degrade output quality on certain tasks.
Kernel Optimizations
Optimizes low-level GPU operations specifically for LLM workloads
Improves computational efficiency and speed. May limit portability across different hardware platforms.
Model Parallelism
Distributes model across multiple GPUs for larger models
Enables inference of extremely large models (typically >70B parameters) and improves throughput. Introduces communication overhead.
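As a concrete illustration of token streaming through an OpenAI-compatible endpoint, here is a minimal sketch; the endpoint URL and model ID are placeholders.

```python
from openai import OpenAI

# Placeholder endpoint for a self-hosted OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

# stream=True returns chunks as tokens are generated instead of one final
# response, which improves perceived latency for chat-style applications.
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model ID
    messages=[{"role": "user", "content": "Explain token streaming in one paragraph."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```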
Different from “Local LLM”?
Cloud LLM Deployments
Optimized for throughput
Continuous batching
Reliability and scalability
Local LLM
Run models on constrained devices
Simplified installation for non-technical users
[Chart: OpenLLM vs. Ollama token generation rate]
Pricing Model
BENTOML 13
Pay by Token
1 token ~= 3/4 words
Predictable cost based on traffic
Pay only for what you use
Pay by Compute
Dedicated compute resources
Pay for compute time used
Harder to predict cost and budget resources (see the sketch below)
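A rough back-of-the-envelope comparison of the two pricing models, as a sketch; every price and traffic figure below is an illustrative assumption, not a real quote.

```python
# Illustrative cost comparison; all numbers here are assumptions for the sketch.
WORDS_PER_TOKEN = 0.75          # rule of thumb: 1 token ~= 3/4 words

def tokens_from_words(words: int) -> int:
    return round(words / WORDS_PER_TOKEN)

# Pay by token: cost scales directly with traffic.
def api_monthly_cost(requests_per_month, in_words, out_words,
                     usd_per_1k_input_tokens, usd_per_1k_output_tokens):
    in_tok = tokens_from_words(in_words)
    out_tok = tokens_from_words(out_words)
    per_request = (in_tok / 1000 * usd_per_1k_input_tokens
                   + out_tok / 1000 * usd_per_1k_output_tokens)
    return requests_per_month * per_request

# Pay by compute: cost scales with the GPU hours kept provisioned.
def self_host_monthly_cost(gpu_count, usd_per_gpu_hour, hours_per_month=730):
    return gpu_count * usd_per_gpu_hour * hours_per_month

# Example with made-up figures: 1M requests/month, 600-word prompts, 150-word answers.
print(api_monthly_cost(1_000_000, 600, 150, 0.0005, 0.0015))  # hypothetical token prices
print(self_host_monthly_cost(2, 2.0))                         # hypothetical GPU pricing
```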
Token Generation Speed vs. Throughput
Higher concurrency leads to higher total tokens per second but a lower per-request generation rate (see the benchmark sketch after this list)
TTFT (time to first token) increases as concurrency goes up
Token generation rate strongly correlates
with GPU utilization
Performance varies depending on the
inference runtime
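A minimal load-test sketch that measures TTFT and per-request generation rate at increasing concurrency against an OpenAI-compatible endpoint; the endpoint URL, model ID, and prompt are placeholders, and chunk counts only approximate token counts.

```python
import asyncio
import time
from openai import AsyncOpenAI

async def one_request(client, model):
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1
    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    gen_time = end - (first_token_at or end)
    rate = chunks / gen_time if gen_time > 0 else 0.0  # chunks roughly approximate tokens
    return ttft, rate

async def main():
    # Placeholder endpoint and model ID for a self-hosted OpenAI-compatible server.
    client = AsyncOpenAI(base_url="http://localhost:3000/v1", api_key="na")
    model = "meta-llama/Meta-Llama-3-8B-Instruct"
    for concurrency in (1, 4, 16):
        results = await asyncio.gather(*(one_request(client, model) for _ in range(concurrency)))
        ttfts, rates = zip(*results)
        print(f"concurrency={concurrency:>2}  "
              f"avg TTFT={sum(ttfts) / len(ttfts):.2f}s  "
              f"avg per-request tok/s={sum(rates) / len(rates):.1f}")

asyncio.run(main())
```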
Scaling Test-Time Compute with
Custom Inference Strategy
BentoML 15
Scaling inference often requires customization at the LLM inference layer (e.g., CoT-decoding, Equilibrium Search).
Use-case-specific inference strategies are only possible with self-hosting.
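As an illustration of the kind of control self-hosting enables, here is a sketch of a simple test-time-compute strategy (best-of-N sampling with majority voting, often called self-consistency). It is a stand-in, not an implementation of CoT-decoding or Equilibrium Search; the endpoint URL and model ID are placeholders.

```python
from collections import Counter
from openai import OpenAI

# Placeholder endpoint for a self-hosted OpenAI-compatible deployment.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

def self_consistency(question: str, n: int = 8) -> str:
    """Sample n reasoning paths at high temperature and return the majority answer."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model ID
            messages=[
                {"role": "system",
                 "content": "Think step by step, then end with 'Answer: <value>'."},
                {"role": "user", "content": question},
            ],
            temperature=0.8,
        )
        text = resp.choices[0].message.content or ""
        if "Answer:" in text:
            answers.append(text.rsplit("Answer:", 1)[-1].strip())
    # Majority vote over the sampled final answers.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```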
Prefix Caching
90%+
Cost savings
Cache intermediate states for requests with identical prefixes to avoid redundant computation.
Reduces latency and improves throughput.
Requires additional GPU memory.
[Diagram: three requests share the same system prompt (“You’re a helpful assistant.”) and document as a common prefix; only the question (Question 1/2/3) differs per query, so Queries 2 and 3 get a cache hit on the shared prefix.]
Prefix Caching Best Practices
Source: lmsys.org
1
Front-load Static Information
Place any constant or rarely changing information at the beginning of your prompt, and push frequently changing content towards the end (see the sketch after this list).
2
Batch Similar Requests
Group similar requests with common prefixes together to maximize cache utilization (e.g., contextual retrieval).
3
Monitor Cache Hit Rate
Regularly review your cache performance to identify opportunities for
optimization.
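A minimal sketch of practice 1: keep the static system prompt and document at the front so repeated requests share the longest possible cached prefix. The document path and questions are placeholders.

```python
# Structure prompts so the shared, static content comes first and only the
# per-request question changes at the end; requests then share a cacheable prefix.
SYSTEM_PROMPT = "You're a helpful assistant."   # static across requests
DOCUMENT = open("report.txt").read()            # placeholder document, identical for all questions

def build_messages(question: str):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        # Static document first ...
        {"role": "user", "content": f"<document>\n{DOCUMENT}\n</document>\n\n{question}"},
        # ... dynamic question last, so the prefix up to it stays identical.
    ]

for q in ["Question 1", "Question 2", "Question 3"]:
    messages = build_messages(q)
    # send `messages` to the same OpenAI-compatible endpoint as in the earlier sketches
```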
Why k8s
Does NOT
Work for
Scaling LLMs
Scaling Metrics
Cloud Provision
Container Image Pulling
Model Loading
Autoscaling Metrics
1
Resource Utilization
CPU/GPU utilization is straightforward but limited: it does not accurately reflect the real load of AI workloads, which leads to overly conservative scaling.
2
QPS (Query Per Second)
While simple and widely recognized, QPS can be inaccurate for LLMs because the cost per request varies with input and output token lengths.
3
Concurrency
Represents the number of active requests being queued or processed. It proves ideal because it accurately reflects system load, scales precisely, and is easy to configure based on batch size.
[Chart: timeline of Requests 1–9 showing how concurrency, the number of in-flight requests, rises and falls over time as requests overlap.]
Concurrency-Based Autoscaling
Concurrency-Based Scaling in BentoML
Set a per-service concurrency target in BentoML
Configure scaling behavior via the BentoCloud UI
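A sketch of what a per-service concurrency target can look like in a BentoML service definition; the exact configuration keys and values here are assumptions that may differ across BentoML versions.

```python
import bentoml

# Sketch of a per-service concurrency target; keys are assumptions and may
# differ by BentoML version.
@bentoml.service(
    resources={"gpu": 1},
    traffic={
        "concurrency": 32,   # target number of in-flight requests per replica
        "timeout": 300,
    },
)
class LLMService:
    @bentoml.api
    def generate(self, prompt: str) -> str:
        # placeholder for the actual inference call
        return prompt
```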
The
Cold Start
Problem
Cold Start
Cloud Provision → Container Image Pulling → Model Loading
Cloud Provision
Pre-heated instances for multiple active models, ready to scale up
Instances actively serving traffic
Container Image Pulling
The BentoCloud image stream-loading solution cuts container start time from minutes to seconds
python:3.10-slim base image: 154 MB
Python + CUDA + PyTorch + vLLM image: 6.7 GB
LLM workloads require many large dependencies
Network bandwidth limitations
Most files in the container image are not immediately used
Model Loading
Model Loading Acceleration in BentoCloud
Who Should Consider
Self-Hosting LLMs?
CONTROL
Run LLMs on your own terms. Comply with security and regulatory requirements.
CUSTOMIZATION
Optimize LLM inference for your
specific use case for better speed,
accuracy and cost efficiency.
LONG-TERM COST BENEFITS
Leverage open-source platforms
like BentoML/OpenLLM to speed up
your journey in self-hosting private
LLM endpoints.