Minimizing Request Latency of Self-Hosted ML Models by Julia Kroll

ScyllaDB · Oct 16, 2024

About This Presentation

Join our session on minimizing request latency for self-hosted ML models in cloud environments. Learn strategies for deploying Deepgram's speech-to-text models on your own hardware, including concurrency limits, auto-scaling, input chunk granularity, and efficient model loading. Optimize your ML inference.


Slide Content

A ScyllaDB Community
Minimizing Request Latency of Self-Hosted ML Models
Julia Kroll
Applied Engineer @ Deepgram

Julia Kroll (she/her)

Applied Engineer @ Deepgram
■I’ve contributed to global language AI products spanning dozens of languages
■P99s are outliers and should be removed accordingly
■Dark mode is overrated
■I brew my own kombucha

ML Models 101
■Training: learning by looking at lots of data
■Inference: predicting based on past learning

■Both require GPUs
■Training is more compute-intensive than inference
●Meta is training Llama 3 on a 24,576-GPU cluster
●Llama can run inference for a request on 1 GPU
■We’ll focus on inference
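
To make the training/inference split concrete, here is a minimal sketch of serving a single inference request on one GPU with an open model via Hugging Face transformers. The model ID and hardware are assumptions, not something from the talk (Llama 3 weights are gated and require accepting Meta's license).

```python
# Illustrative single-GPU inference with an open model via Hugging Face
# transformers. The model ID is an assumption (Llama 3 weights are gated).
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device=0,  # a single GPU is enough to serve one inference request
)

print(pipe("Why is inference cheaper than training?", max_new_tokens=50))
```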

What is self-hosting?

■You have a trained ML model
■You want to serve inference requests with the model
■You use your own hardware (bare-metal or cloud VMs) to do so
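
As a rough illustration of what "serving inference requests with the model" looks like, here is a minimal sketch of an HTTP inference endpoint wrapping a model loaded once at startup. The endpoint path and run_inference() helper are hypothetical stand-ins, not Deepgram's API.

```python
# Minimal sketch of self-hosting: one HTTP endpoint on your own hardware,
# wrapping a model that is loaded once at startup. run_inference() is a
# hypothetical stand-in for your model's API (not Deepgram's interface).
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_inference(payload):
    # Hypothetical: replace with a call into your loaded model.
    return {"echo": payload}

@app.route("/infer", methods=["POST"])
def infer():
    return jsonify(run_inference(request.get_json()))

if __name__ == "__main__":
    # Bind to all interfaces so the VM / container can accept traffic.
    app.run(host="0.0.0.0", port=8000)
```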

Which models can you self-host?
✔ Open-source
●Llama, Falcon, Mistral
●Models on Hugging Face
✔ Encrypted, distributed models
●Deepgram
✔ Privately trained by you
●Your proprietary, in-house models
❌ Not distributed
●GPT, Gemini, Claude

How do you self-host at scale?

■Containerization via Docker, Kubernetes, etc.
■Each container has a copy of the model
■Each container serves some inference requests
■You may scale to more or fewer containers as needed
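
A minimal sketch of this layout using the Docker SDK for Python: N identical replicas of a (hypothetical) inference image, each with its own copy of the model, one GPU, and its own host port. The image name is an assumption, and in production this would more likely be a Kubernetes Deployment or a compose file.

```python
# Sketch of "each container has a copy of the model": start N replicas of a
# hypothetical inference image, each with one GPU and its own host port.
import docker

client = docker.from_env()
IMAGE = "my-inference-image:latest"  # hypothetical image bundling the model
REPLICAS = 3

for i in range(REPLICAS):
    client.containers.run(
        IMAGE,
        name=f"inference-{i}",
        detach=True,
        ports={"8000/tcp": 8000 + i},  # one host port per replica
        device_requests=[              # give each replica one GPU
            docker.types.DeviceRequest(count=1, capabilities=[["gpu"]])
        ],
    )
```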

Pros and cons of hosted API versus self-hosting
Hosted
■Pros:
●You don’t have to think about it.
●It’s abstracted.
●It just works.
■Cons:
●You lack control.
●It may not suit your needs.
●It may not scale.
●It’s expensive.
●You send data to a 3rd party.

Self-Hosted
■Pros:
●You control everything.
●You can trick out your perfect setup.
●It’s low-margin.
●You retain your own data.
■Cons:
●You’re responsible for everything.
●It takes engineering time and effort.
●You might have to remember that servers exist in the physical world.

Demo: self-hosted Deepgram

So you want to self-host
in prod?

3 questions to ask yourself…

1. How many requests can I serve?

■Each server will have a maximum level of traffic it can serve before it falls over.
■Latency will likely degrade before total failure.
■To determine a threshold:
●Define maximum acceptable latencies
●Benchmark to obtain maximum number of requests
●Monitor and set limits in prod
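
A sketch of the benchmarking step: ramp up concurrency against a (hypothetical) inference endpoint and find where P99 crosses your latency budget. The URL, payload, and 500 ms budget are placeholders for your own values.

```python
# Hypothetical load test: increase concurrency until P99 exceeds the budget.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/infer"    # assumed endpoint
PAYLOAD = {"input": "sample request"}  # assumed request body
LATENCY_BUDGET_MS = 500                # your maximum acceptable P99

def one_request() -> float:
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=10)
    return (time.perf_counter() - start) * 1000  # latency in ms

def p99_at_concurrency(concurrency: int, total: int = 200) -> float:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(total)))
    return statistics.quantiles(latencies, n=100)[98]  # 99th percentile

for concurrency in (1, 2, 4, 8, 16, 32, 64):
    p99 = p99_at_concurrency(concurrency)
    print(f"concurrency={concurrency:3d}  P99={p99:.0f} ms")
    if p99 > LATENCY_BUDGET_MS:
        print(f"max sustainable concurrency is roughly {concurrency // 2}")
        break
```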

1. How many requests can I serve?
■n requests on a g4dn instance (T4 GPU): P99 = 560 ms, spiky P99 over time
■2n requests on a g6 instance (L4 GPU): P99 = 360 ms, smooth P99 over time

2. How many servers do I need?
■Now that you know how many requests 1 unit of compute can serve, how many units do you need?
●Baseline, min, max, periodicity (daily, weekly)
■Scale on number of requests, given max requests for acceptable latency
●Kubernetes: HorizontalPodAutoscaler
●Example:
■Values of max_requests = 100, target_ratio = 0.9
■Active requests = 91 → utilization ratio > 0.9 → scale up
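
The slide's scaling rule sketched in Python, in the spirit of the Kubernetes HorizontalPodAutoscaler formula desired = ceil(current × metric / target). In a real cluster this would be an HPA on a custom "active requests" metric; the replica count here is only an illustration.

```python
# Scaling decision on active requests per replica, given a benchmarked
# max_requests and a target utilization ratio (numbers from the slide).
import math

MAX_REQUESTS = 100   # benchmarked max concurrent requests per replica
TARGET_RATIO = 0.9   # leave 10% headroom
TARGET_PER_REPLICA = MAX_REQUESTS * TARGET_RATIO  # 90

def desired_replicas(current_replicas: int, active_per_replica: float) -> int:
    ratio = active_per_replica / MAX_REQUESTS
    desired = math.ceil(current_replicas * active_per_replica / TARGET_PER_REPLICA)
    action = "scale up" if desired > current_replicas else "hold / scale down"
    print(f"utilization ratio = {ratio:.2f} -> {action}")
    return max(desired, 1)

# Slide example: 91 active requests -> ratio 0.91 > 0.90 -> scale up (4 -> 5)
print(desired_replicas(current_replicas=4, active_per_replica=91))
```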

3. Which set of models am I serving?

■For low-latency inference, models must stay in memory.
■Are you serving one model, a few, or many?
■How large are your models? How much memory do your instances have?
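
A minimal sketch of the "models must stay in memory" point: budget the model set against instance memory and load everything once at startup, so no request pays a model-load penalty. The model names, sizes, and loader below are illustrative stand-ins, not real Deepgram components.

```python
# Keep a fixed model set resident in memory; requests only look models up.
MODELS_TO_SERVE = {      # model name -> approximate size in GB (illustrative)
    "en-general": 2.0,
    "en-medical": 2.5,
    "es-general": 2.0,
}
GPU_MEMORY_GB = 16.0
HEADROOM_GB = 4.0        # leave room for activations and batching

class DummyModel:
    """Hypothetical stand-in for a real model object."""
    def __init__(self, name):
        self.name = name
    def infer(self, payload):
        return f"{self.name}: result for {payload!r}"

def load_model(name):
    # Hypothetical loader; replace with your framework's loading call.
    return DummyModel(name)

assert sum(MODELS_TO_SERVE.values()) <= GPU_MEMORY_GB - HEADROOM_GB, \
    "model set does not fit on one instance; split it across servers"

# Load everything once at startup; request handling never reloads a model.
LOADED = {name: load_model(name) for name in MODELS_TO_SERVE}

def handle(model_name, payload):
    return LOADED[model_name].infer(payload)  # in-memory lookup, no reload

print(handle("en-medical", "audio-chunk-1"))
```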

3. Which set of models am I serving?
■1 model: P99 < 100 ms (not pictured)
■5 models: P99 = 800 ms
■15 models: P99 = 2000 ms

3. Which set of models am I serving?

To serve many models with low latency:
■Several servers, each with a subset of models
■Load-balance to distribute traffic across the servers
■Send each request to a server containing the right model
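
A small sketch of that routing idea: a static map of which servers keep which models warm, and a router that only forwards a request to a server already holding the requested model. Hostnames and model names are illustrative.

```python
# Model-aware routing: each server holds a subset of models; route a request
# only to a server that already has the requested model in memory.
import random

SERVERS = {  # illustrative static assignment of models to servers
    "inference-1:8000": {"en-general", "en-medical"},
    "inference-2:8000": {"en-general", "es-general"},
    "inference-3:8000": {"fr-general", "de-general"},
}

def pick_server(model: str) -> str:
    candidates = [host for host, models in SERVERS.items() if model in models]
    if not candidates:
        raise ValueError(f"no server is hosting model {model!r}")
    return random.choice(candidates)  # or round-robin / least-connections

# "en-general" can land on inference-1 or inference-2; "fr-general" always
# goes to inference-3, where it is already warm.
print(pick_server("en-general"))
print(pick_server("fr-general"))
```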

Conclusion
For low-latency self-hosted ML inference:
■Select your model(s)
■Define and benchmark your request limits
■Auto-scale on those known limits
■Distribute many models (and their requests) across servers

Thank you! Let’s connect.
Julia Kroll
[email protected]
https://www.linkedin.com/in/juliakroll