AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale

Alluxio | 16 slides | Mar 11, 2025

About This Presentation

AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Sean Po (Staff SWE @ Uber)
- Tse-Chi Wang (Senior SWE @ Uber)

This talk provided a deep dive into how Uber manages its Generative AI Gateway, which powers all generative AI applications inside Uber.


Slide Content

Slide 1
Deployment, Discovery and Serving of LLMs at Uber Scale
Tse-Chi Wang, Sean Po

Slide 2
Generative AI Gateway

Powers all GenAI applications inside Uber
● 5M+ requests daily; 40+ services
● Unified interface
  ○ OpenAI compatible
● More models
  ○ OpenAI - GPTs, o-series
  ○ Vertex AI - Gemini
  ○ OSS, FT models - Llama, Qwen, etc.
    ■ Hosted in-house at Uber
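Because the Gateway's unified interface is OpenAI-compatible, callers build the same request shape whether the model is a GPT, Gemini, or an in-house Llama deployment. A minimal sketch of that idea (model names and the helper are illustrative, not Uber's actual code):

```python
# Sketch of an OpenAI-compatible chat request body; the gateway would use
# the "model" field as the routing key to pick the backing provider.

def chat_request(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,  # routing key: gateway maps this to a provider
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

# The same shape works regardless of the backing provider:
for model in ("gpt-4o", "gemini-1.5-pro", "llama-3-70b-uber-ft"):
    req = chat_request(model, "Summarize this trip receipt.")
    print(req["model"], sorted(req))
```

The payoff is that switching a service from a third-party model to an in-house one is a one-field change for the caller.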

Slide 3
Generative AI Gateway

In addition to standardizing APIs for in-house and third-party LLMs, the Gateway provides:

● Authentication/Authorization
● Logging
● Auditing
● Metrics
● Cost tracking
● Guardrails
● Caching
● Model fallback
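Two of these concerns, caching and model fallback, can be sketched as a wrapper around an ordered list of providers. This is a minimal illustration with stubbed providers, not Uber's implementation:

```python
# Gateway sketch: cache responses, and on provider failure fall back to the
# next provider in order. Providers here are stub callables.

def make_gateway(providers, cache=None):
    """providers: ordered list of (name, fn); fn raises on failure."""
    cache = {} if cache is None else cache

    def call(prompt: str) -> str:
        if prompt in cache:             # caching: reuse a prior answer
            return cache[prompt]
        last_err = None
        for name, fn in providers:      # model fallback: try in order
            try:
                result = fn(prompt)
                cache[prompt] = result
                return result
            except Exception as e:
                last_err = e            # logging/metrics hook would go here
        raise RuntimeError("all providers failed") from last_err

    return call

def flaky(prompt):   # stub primary model that always fails
    raise TimeoutError("primary timed out")

def backup(prompt):  # stub fallback model
    return f"echo: {prompt}"

gateway = make_gateway([("primary", flaky), ("backup", backup)])
print(gateway("hello"))  # falls back to the backup model
```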

Slide 4
Michelangelo’s Control Plane

● Project - Use case
● Deployment - Serving entity for the underlying model
● One project can have multiple deployments
● Different deployments can have the same underlying model
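The project/deployment relationship above can be sketched with two small data classes (names and fields are hypothetical, not Michelangelo's actual schema):

```python
# One project (use case) owns many deployments; two deployments may serve
# the same underlying model.
from dataclasses import dataclass, field

@dataclass
class Deployment:
    name: str
    model: str          # underlying model, e.g. "llama-3-70b"

@dataclass
class Project:
    name: str           # maps to a use case
    deployments: list = field(default_factory=list)

support = Project("support-summarizer")
support.deployments.append(Deployment("prod-a", "llama-3-70b"))
support.deployments.append(Deployment("prod-b", "llama-3-70b"))  # same model
support.deployments.append(Deployment("canary", "qwen-2-72b"))

models = {d.model for d in support.deployments}
print(len(support.deployments), len(models))  # 3 deployments, 2 models
```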

Slide 5

Slide 6
Discovery

Deployments are created outside of GenAI Gateway (e.g. via UI) and stored in Michelangelo’s Control Plane. How does GenAI Gateway discover added/removed deployments?

● Query Michelangelo’s Control Plane on each request
● Periodically poll Michelangelo’s Control Plane
● Subscribe to changes in Michelangelo’s Control Plane
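These three options trade freshness against load on the control plane. The polling option reduces to diffing the deployment set each cycle; a minimal sketch with the control plane stubbed as a set:

```python
# One polling cycle: compare the last-seen deployment set against the
# control plane's current set to find additions and removals.

def diff_deployments(known: set, current: set):
    """Return (added, removed) between last-seen and current sets."""
    return current - known, known - current

control_plane = {"dep-a", "dep-b"}   # stub for Michelangelo's Control Plane
known = set()

# first cycle: everything is new
added, removed = diff_deployments(known, set(control_plane))
known = set(control_plane)
print(sorted(added), sorted(removed))

# a deployment is deleted and another created before the next cycle
control_plane = {"dep-a", "dep-c"}
added, removed = diff_deployments(known, set(control_plane))
print(sorted(added), sorted(removed))  # ['dep-c'] ['dep-b']
```

Subscribing to change events gives the same diff pushed to the Gateway instead of pulled, avoiding both per-request queries and polling lag.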

Slide 7
Safe Model Deployments

The deployment controller enables safe model deployments for all types of models. Machine learning/deep learning models are deployed incrementally by zone, keeping both the new and the previous models in memory.

Blue: has existing model
Yellow: loading new model
Green: has new model
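The zone-by-zone rollout can be simulated in a few lines: each zone passes through the yellow state (both models in memory), and only one zone transitions at a time. Zone names and states are illustrative:

```python
# Simulate incremental rollout: blue -> yellow (both models) -> green,
# one zone at a time, so a bad model never takes down every zone at once.

def rollout_by_zone(zones, new):
    """Return the fleet state after each step of the rollout."""
    fleet = {z: ("old",) for z in zones}   # blue: existing model everywhere
    history = []
    for z in zones:
        fleet[z] = ("old", new)            # yellow: both models in memory
        history.append(dict(fleet))
        fleet[z] = (new,)                  # green: new model only
        history.append(dict(fleet))
    return history

h = rollout_by_zone(["zone-a", "zone-b"], "v2")
# at most one zone holds two models at any point in the rollout
assert all(sum(len(m) == 2 for m in s.values()) <= 1 for s in h)
print(h[-1])  # {'zone-a': ('v2',), 'zone-b': ('v2',)}
```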

Slide 8
Safe LLM Deployments

For LLMs, Red/Black deployments are used instead, since two LLMs can neither fit in GPU vRAM together nor serve traffic without latency degradation.

Blue: has existing model
Yellow: loading new model
Green: has new model
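In a Red/Black deployment, a separate fleet is provisioned with the new model and traffic is cut over atomically, so no single GPU fleet ever holds two LLMs. A minimal sketch (fleet names are made up):

```python
# Red/Black sketch: the router points at exactly one fleet; a swap replaces
# the active fleet in one step and returns the old one for teardown.

class Router:
    def __init__(self, fleet):
        self.active = fleet          # fleet currently taking traffic

    def red_black_swap(self, new_fleet):
        old = self.active
        self.active = new_fleet      # atomic cutover: no mixed traffic
        return old                   # old fleet can now be torn down

router = Router("fleet-red:llama-v1")
retired = router.red_black_swap("fleet-black:llama-v2")
print(router.active, "| retired:", retired)
```

The cost of this pattern is double the GPU capacity during the swap window, which is why it is reserved for models too large to co-load.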

Slide 9
Optimized Inference Servers: Low Latency and Cost

Powered by Triton Inference Server
● Streamlined API
● Deep learning backends:
  ○ TensorFlow
  ○ Torch
  ○ Python
● Generative AI backends:
  ○ vLLM
  ○ TensorRT-LLM
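To make the backend list concrete, here is a sketch of the model-repository layout Triton's vLLM backend expects: a `config.pbtxt` that selects the backend, and a `model.json` carrying the vLLM engine arguments. All values are illustrative examples, not Uber's settings:

```python
# Build a throwaway Triton model repository for a vLLM-backed model.
import json, pathlib, tempfile

repo = pathlib.Path(tempfile.mkdtemp()) / "llama-3-70b"
(repo / "1").mkdir(parents=True)

# config.pbtxt: tell Triton to serve this model with the vLLM backend
(repo / "config.pbtxt").write_text(
    'backend: "vllm"\n'
    'instance_group [{ count: 1, kind: KIND_MODEL }]\n'
)

# 1/model.json: vLLM engine arguments (example values)
(repo / "1" / "model.json").write_text(json.dumps({
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "tensor_parallel_size": 4,
    "gpu_memory_utilization": 0.9,
}))

print(sorted(p.name for p in repo.rglob("*")))
```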

Slide 10
Uber Assistant Builder: Overview

● 540+ weekly active users (WAU)
● 15+ unique assistants with 50+ weekly engagements

Slide 11
Uber Assistant Builder: Configuration

Simple UI for building chatbots

● Access to any model available in the Generative AI Gateway, and all the benefits the Gateway provides
● Customize system instructions, tools, model hyperparameters, etc.
● Access internal files and ingested knowledge base content
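The configuration surface described above can be sketched as a validated builder function. Field names here are hypothetical, not Uber's actual schema:

```python
# Assemble an assistant configuration from the knobs listed above:
# model choice, system instructions, tools, hyperparameters, knowledge bases.

def build_assistant(name, model, system_prompt, tools=(),
                    temperature=0.3, knowledge_bases=()):
    """Validate and assemble an assistant configuration dict."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature out of range")
    return {
        "name": name,
        "model": model,                      # any model the Gateway exposes
        "system_prompt": system_prompt,      # custom instructions
        "tools": list(tools),                # e.g. search, ticket lookup
        "temperature": temperature,          # model hyperparameter
        "knowledge_bases": list(knowledge_bases),  # ingested content
    }

bot = build_assistant(
    "infra-helper", "gpt-4o",
    "Answer questions about deploy tooling.",
    tools=["wiki_search"], knowledge_bases=["deploy-runbooks"],
)
print(bot["name"], bot["tools"])
```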

Slide 12
Uber Assistant Builder: Architecture Diagram

Slide 13
Uber Assistant Builder: Architecture Diagram

Slide 14
Uber Assistant Builder: Connectivity

Connect to any Slack channel to provide domain-specific support

● Different support channels have different processes. The flexibility of Assistant Builder allows channel owners to self-serve assistants that work for their specific domains.
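The channel-to-assistant mapping implied above can be sketched as a simple routing table, so one bot framework serves many domain-specific channels. Channel and assistant names are made up:

```python
# Route an incoming Slack message to the assistant its channel owners
# configured, falling back to a general-purpose assistant.

ROUTES = {
    "#deploy-support": "infra-helper",
    "#payments-support": "payments-helper",
}

def route_message(channel: str, default: str = "general-helper") -> str:
    """Pick the assistant for a channel."""
    return ROUTES.get(channel, default)

print(route_message("#deploy-support"))  # infra-helper
print(route_message("#random"))          # general-helper
```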

Slide 15
Uber Assistant Builder: What's Next

● We're seeing limitations of the ReAct architecture for building assistants.
● Some channels also require more deterministic workloads.
● We are working on multi-agent architectures next.

Slide 16
Questions?
To learn more, see our blog post "Scaling AI/ML Infrastructure at Uber".