Minimizing Request Latency of Self-Hosted ML Models by Julia Kroll

ScyllaDB · Oct 16, 2024

About This Presentation

Join our session on minimizing request latency for self-hosted ML models in cloud environments. Learn strategies for deploying Deepgram's speech-to-text models on your own hardware, including concurrency limits, auto-scaling, input chunk granularity, and efficient model loading. Optimize your ML inference.


Slide Content

A ScyllaDB Community
Minimizing Request Latency of Self-Hosted ML Models
Julia Kroll
Applied Engineer @ Deepgram

Julia Kroll (she/her)

Applied Engineer @ Deepgram
■I’ve contributed to global language AI products spanning dozens of languages
■P99s are outliers and should be removed accordingly
■Dark mode is overrated
■I brew my own kombucha

ML Models 101
■Training: learning by looking at lots of data
■Inference: predicting based on past learning

■Both require GPUs
■Training is more compute-intensive than inference
●Meta is training Llama 3 on a 24,576-GPU cluster
●Llama can run inference for a request on 1 GPU
■We’ll focus on inference
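
To make the training/inference split concrete, here is a minimal sketch of serving a single inference request on one GPU with an open model via Hugging Face transformers. The model ID and hardware are assumptions, not something from the talk (Llama 3 weights are gated and require accepting Meta's license).

```python
# Illustrative single-GPU inference with an open model via Hugging Face
# transformers. The model ID is an assumption (Llama 3 weights are gated).
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device=0,  # a single GPU is enough to serve one inference request
)

print(pipe("Why is inference cheaper than training?", max_new_tokens=50))
```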

What is self-hosting?

■You have a trained ML model
■You want to serve inference requests with the model
■You use your own hardware (bare-metal or cloud VMs) to do so
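
As a rough illustration of what "serving inference requests with the model" looks like, here is a minimal sketch of an HTTP inference endpoint wrapping a model loaded once at startup. The endpoint path and run_inference() helper are hypothetical stand-ins, not Deepgram's API.

```python
# Minimal sketch of self-hosting: one HTTP endpoint on your own hardware,
# wrapping a model that is loaded once at startup. run_inference() is a
# hypothetical stand-in for your model's API (not Deepgram's interface).
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_inference(payload):
    # Hypothetical: replace with a call into your loaded model.
    return {"echo": payload}

@app.route("/infer", methods=["POST"])
def infer():
    return jsonify(run_inference(request.get_json()))

if __name__ == "__main__":
    # Bind to all interfaces so the VM / container can accept traffic.
    app.run(host="0.0.0.0", port=8000)
```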

Which models can you self-host?
✔ Open-source
●Llama, Falcon, Mistral
●Models on Hugging Face
✔ Encrypted, distributed models
●Deepgram
✔ Privately trained by you
●Your proprietary, in-house models
❌ Not distributed
●GPT, Gemini, Claude

How do you self-host at scale?

■Containerization via Docker, Kubernetes, etc.
■Each container has a copy of the model
■Each container serves some inference requests
■You may scale to more or fewer containers as needed
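
A minimal sketch of this layout using the Docker SDK for Python: N identical replicas of a (hypothetical) inference image, each with its own copy of the model, one GPU, and its own host port. The image name is an assumption, and in production this would more likely be a Kubernetes Deployment or a compose file.

```python
# Sketch of "each container has a copy of the model": start N replicas of a
# hypothetical inference image, each with one GPU and its own host port.
import docker

client = docker.from_env()
IMAGE = "my-inference-image:latest"  # hypothetical image bundling the model
REPLICAS = 3

for i in range(REPLICAS):
    client.containers.run(
        IMAGE,
        name=f"inference-{i}",
        detach=True,
        ports={"8000/tcp": 8000 + i},  # one host port per replica
        device_requests=[              # give each replica one GPU
            docker.types.DeviceRequest(count=1, capabilities=[["gpu"]])
        ],
    )
```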

Pros and cons of hosted API versus self-hosting
Hosted
■Pros:
●You don’t have to think about it.
●It’s abstracted.
●It just works.
■Cons:
●You lack control.
●It may not suit your needs.
●It may not scale.
●It’s expensive.
●You send data to a 3rd party.

Self-Hosted
■Pros:
●You control everything.
●You can trick out your perfect setup.
●It’s low-margin.
●You retain your own data.
■Cons:
●You’re responsible for everything.
●It takes engineering time and effort.
●You might have to remember that servers exist in the physical world.

Demo: self-hosted Deepgram

So you want to self-host
in prod?

3 questions to ask yourself…

1. How many requests can I serve?

■Each server will have a maximum level of traffic it can serve before it falls over.
■Latency will likely degrade before total failure.
■To determine a threshold:
●Define maximum acceptable latencies
●Benchmark to obtain maximum number of requests
●Monitor and set limits in prod
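
A sketch of the benchmarking step: ramp up concurrency against a (hypothetical) inference endpoint and find where P99 crosses your latency budget. The URL, payload, and 500 ms budget are placeholders for your own values.

```python
# Hypothetical load test: increase concurrency until P99 exceeds the budget.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/infer"    # assumed endpoint
PAYLOAD = {"input": "sample request"}  # assumed request body
LATENCY_BUDGET_MS = 500                # your maximum acceptable P99

def one_request() -> float:
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=10)
    return (time.perf_counter() - start) * 1000  # latency in ms

def p99_at_concurrency(concurrency: int, total: int = 200) -> float:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(total)))
    return statistics.quantiles(latencies, n=100)[98]  # 99th percentile

for concurrency in (1, 2, 4, 8, 16, 32, 64):
    p99 = p99_at_concurrency(concurrency)
    print(f"concurrency={concurrency:3d}  P99={p99:.0f} ms")
    if p99 > LATENCY_BUDGET_MS:
        print(f"max sustainable concurrency is roughly {concurrency // 2}")
        break
```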

1. How many requests can I serve?
■n requests on a g4dn instance (T4 GPU): P99 = 560 ms, spiky P99 over time
■2n requests on a g6 instance (L4 GPU): P99 = 360 ms, smooth P99 over time

2. How many servers do I need?
■Now that you know how many requests 1 unit of compute can serve, how many units do you need?
●Baseline, min, max, periodicity (daily, weekly)
■Scale on number of requests, given max requests for acceptable latency
●Kubernetes: HorizontalPodAutoscaler
●Example:
■Values of max_requests = 100, target_ratio = 0.9
■Active requests = 91 → utilization ratio > 0.9 → scale up
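
The slide's scaling rule sketched in Python, in the spirit of the Kubernetes HorizontalPodAutoscaler formula desired = ceil(current × metric / target). In a real cluster this would be an HPA on a custom "active requests" metric; the replica count here is only an illustration.

```python
# Scaling decision on active requests per replica, given a benchmarked
# max_requests and a target utilization ratio (numbers from the slide).
import math

MAX_REQUESTS = 100   # benchmarked max concurrent requests per replica
TARGET_RATIO = 0.9   # leave 10% headroom
TARGET_PER_REPLICA = MAX_REQUESTS * TARGET_RATIO  # 90

def desired_replicas(current_replicas: int, active_per_replica: float) -> int:
    ratio = active_per_replica / MAX_REQUESTS
    desired = math.ceil(current_replicas * active_per_replica / TARGET_PER_REPLICA)
    action = "scale up" if desired > current_replicas else "hold / scale down"
    print(f"utilization ratio = {ratio:.2f} -> {action}")
    return max(desired, 1)

# Slide example: 91 active requests -> ratio 0.91 > 0.90 -> scale up (4 -> 5)
print(desired_replicas(current_replicas=4, active_per_replica=91))
```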

3. Which set of models am I serving?

■For low-latency inference, models must stay in memory.
■Are you serving one model, a few, or many?
■How large are your models? How much memory do your instances have?
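
A minimal sketch of the "models must stay in memory" point: budget the model set against instance memory and load everything once at startup, so no request pays a model-load penalty. The model names, sizes, and loader below are illustrative stand-ins, not real Deepgram components.

```python
# Keep a fixed model set resident in memory; requests only look models up.
MODELS_TO_SERVE = {      # model name -> approximate size in GB (illustrative)
    "en-general": 2.0,
    "en-medical": 2.5,
    "es-general": 2.0,
}
GPU_MEMORY_GB = 16.0
HEADROOM_GB = 4.0        # leave room for activations and batching

class DummyModel:
    """Hypothetical stand-in for a real model object."""
    def __init__(self, name):
        self.name = name
    def infer(self, payload):
        return f"{self.name}: result for {payload!r}"

def load_model(name):
    # Hypothetical loader; replace with your framework's loading call.
    return DummyModel(name)

assert sum(MODELS_TO_SERVE.values()) <= GPU_MEMORY_GB - HEADROOM_GB, \
    "model set does not fit on one instance; split it across servers"

# Load everything once at startup; request handling never reloads a model.
LOADED = {name: load_model(name) for name in MODELS_TO_SERVE}

def handle(model_name, payload):
    return LOADED[model_name].infer(payload)  # in-memory lookup, no reload

print(handle("en-medical", "audio-chunk-1"))
```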

3. Which set of models am I serving?
■1 model: P99 < 100 ms (not pictured)
■5 models: P99 = 800 ms
■15 models: P99 = 2000 ms

3. Which set of models am I serving?

To serve many models with low latency:
■Several servers, each with a subset of models
■Load-balance to distribute traffic across the servers
■Send each request to a server containing the right model
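
A small sketch of that routing idea: a static map of which servers keep which models warm, and a router that only forwards a request to a server already holding the requested model. Hostnames and model names are illustrative.

```python
# Model-aware routing: each server holds a subset of models; route a request
# only to a server that already has the requested model in memory.
import random

SERVERS = {  # illustrative static assignment of models to servers
    "inference-1:8000": {"en-general", "en-medical"},
    "inference-2:8000": {"en-general", "es-general"},
    "inference-3:8000": {"fr-general", "de-general"},
}

def pick_server(model: str) -> str:
    candidates = [host for host, models in SERVERS.items() if model in models]
    if not candidates:
        raise ValueError(f"no server is hosting model {model!r}")
    return random.choice(candidates)  # or round-robin / least-connections

# "en-general" can land on inference-1 or inference-2; "fr-general" always
# goes to inference-3, where it is already warm.
print(pick_server("en-general"))
print(pick_server("fr-general"))
```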

Conclusion
For low-latency self-hosted ML inference:
■Select your model(s)
■Define and benchmark your request limits
■Auto-scale on those known limits
■Distribute many models (and their requests) across servers

Thank you! Let’s connect.
Julia Kroll
[email protected]
https://www.linkedin.com/in/juliakroll