AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale

Alluxio | 16 slides | Mar 11, 2025

About This Presentation

AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Sean Po (Staff SWE @ Uber)
- Tse-Chi Wang (Senior SWE @ Uber)

This talk provided a deep dive into how Uber manages its Generative AI Gateway, which powers all generative AI applications inside Uber.


Slide Content

Slide 1
Deployment, Discovery and Serving of LLMs at Uber Scale
Tse-Chi Wang, Sean Po

Slide 2
Generative AI Gateway

Powers all GenAI applications inside Uber
● 5M+ requests daily; 40+ services
● Unified interface
  ○ OpenAI compatible
● More models
  ○ OpenAI - GPTs, o-series
  ○ Vertex AI - Gemini
  ○ OSS, FT models - Llama, Qwen, etc.
    ■ Hosted in-house at Uber
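Because the Gateway's unified interface is OpenAI-compatible, callers build the same request shape whether the model is a GPT, Gemini, or an in-house Llama deployment. A minimal sketch of that idea (model names and the helper are illustrative, not Uber's actual code):

```python
# Sketch of an OpenAI-compatible chat request body; the gateway would use
# the "model" field as the routing key to pick the backing provider.

def chat_request(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,  # routing key: gateway maps this to a provider
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

# The same shape works regardless of the backing provider:
for model in ("gpt-4o", "gemini-1.5-pro", "llama-3-70b-uber-ft"):
    req = chat_request(model, "Summarize this trip receipt.")
    print(req["model"], sorted(req))
```

The payoff is that switching a service from a third-party model to an in-house one is a one-field change for the caller.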

Slide 3
Generative AI Gateway

In addition to standardizing APIs for in-house and third-party LLMs, the Gateway provides:

● Authentication/Authorization
● Logging
● Auditing
● Metrics
● Cost tracking
● Guardrails
● Caching
● Model fallback
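Two of these concerns, caching and model fallback, can be sketched as a wrapper around an ordered list of providers. This is a minimal illustration with stubbed providers, not Uber's implementation:

```python
# Gateway sketch: cache responses, and on provider failure fall back to the
# next provider in order. Providers here are stub callables.

def make_gateway(providers, cache=None):
    """providers: ordered list of (name, fn); fn raises on failure."""
    cache = {} if cache is None else cache

    def call(prompt: str) -> str:
        if prompt in cache:             # caching: reuse a prior answer
            return cache[prompt]
        last_err = None
        for name, fn in providers:      # model fallback: try in order
            try:
                result = fn(prompt)
                cache[prompt] = result
                return result
            except Exception as e:
                last_err = e            # logging/metrics hook would go here
        raise RuntimeError("all providers failed") from last_err

    return call

def flaky(prompt):   # stub primary model that always fails
    raise TimeoutError("primary timed out")

def backup(prompt):  # stub fallback model
    return f"echo: {prompt}"

gateway = make_gateway([("primary", flaky), ("backup", backup)])
print(gateway("hello"))  # falls back to the backup model
```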

Slide 4
Michelangelo’s Control Plane

● Project - Use case
● Deployment - Serving entity for the underlying model
● One project can have multiple deployments
● Different deployments can have the same underlying model
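The project/deployment relationship above can be sketched with two small data classes (names and fields are hypothetical, not Michelangelo's actual schema):

```python
# One project (use case) owns many deployments; two deployments may serve
# the same underlying model.
from dataclasses import dataclass, field

@dataclass
class Deployment:
    name: str
    model: str          # underlying model, e.g. "llama-3-70b"

@dataclass
class Project:
    name: str           # maps to a use case
    deployments: list = field(default_factory=list)

support = Project("support-summarizer")
support.deployments.append(Deployment("prod-a", "llama-3-70b"))
support.deployments.append(Deployment("prod-b", "llama-3-70b"))  # same model
support.deployments.append(Deployment("canary", "qwen-2-72b"))

models = {d.model for d in support.deployments}
print(len(support.deployments), len(models))  # 3 deployments, 2 models
```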

Slide 5

Slide 6
Discovery

Deployments are created outside of GenAI Gateway (e.g. via UI) and stored in Michelangelo’s Control Plane. How does GenAI Gateway discover added/removed deployments?

● Query Michelangelo’s Control Plane on each request
● Periodically poll Michelangelo’s Control Plane
● Subscribe to changes in Michelangelo’s Control Plane
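These three options trade freshness against load on the control plane. The polling option reduces to diffing the deployment set each cycle; a minimal sketch with the control plane stubbed as a set:

```python
# One polling cycle: compare the last-seen deployment set against the
# control plane's current set to find additions and removals.

def diff_deployments(known: set, current: set):
    """Return (added, removed) between last-seen and current sets."""
    return current - known, known - current

control_plane = {"dep-a", "dep-b"}   # stub for Michelangelo's Control Plane
known = set()

# first cycle: everything is new
added, removed = diff_deployments(known, set(control_plane))
known = set(control_plane)
print(sorted(added), sorted(removed))

# a deployment is deleted and another created before the next cycle
control_plane = {"dep-a", "dep-c"}
added, removed = diff_deployments(known, set(control_plane))
print(sorted(added), sorted(removed))  # ['dep-c'] ['dep-b']
```

Subscribing to change events gives the same diff pushed to the Gateway instead of pulled, avoiding both per-request queries and polling lag.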

Slide 7
Safe Model Deployments

The deployment controller enables safe model deployments for all types of models. Machine learning/deep learning models are deployed incrementally by zone, keeping both the new and the previous models in memory.

Blue: has existing model
Yellow: loading new model
Green: has new model
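The zone-by-zone rollout can be simulated in a few lines: each zone passes through the yellow state (both models in memory), and only one zone transitions at a time. Zone names and states are illustrative:

```python
# Simulate incremental rollout: blue -> yellow (both models) -> green,
# one zone at a time, so a bad model never takes down every zone at once.

def rollout_by_zone(zones, new):
    """Return the fleet state after each step of the rollout."""
    fleet = {z: ("old",) for z in zones}   # blue: existing model everywhere
    history = []
    for z in zones:
        fleet[z] = ("old", new)            # yellow: both models in memory
        history.append(dict(fleet))
        fleet[z] = (new,)                  # green: new model only
        history.append(dict(fleet))
    return history

h = rollout_by_zone(["zone-a", "zone-b"], "v2")
# at most one zone holds two models at any point in the rollout
assert all(sum(len(m) == 2 for m in s.values()) <= 1 for s in h)
print(h[-1])  # {'zone-a': ('v2',), 'zone-b': ('v2',)}
```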

Slide 8
Safe LLM Deployments

For LLMs, Red/Black deployments are used instead, since two LLMs can neither fit in GPU vRAM together nor serve traffic without latency degradation.

Blue: has existing model
Yellow: loading new model
Green: has new model
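In a Red/Black deployment, a separate fleet is provisioned with the new model and traffic is cut over atomically, so no single GPU fleet ever holds two LLMs. A minimal sketch (fleet names are made up):

```python
# Red/Black sketch: the router points at exactly one fleet; a swap replaces
# the active fleet in one step and returns the old one for teardown.

class Router:
    def __init__(self, fleet):
        self.active = fleet          # fleet currently taking traffic

    def red_black_swap(self, new_fleet):
        old = self.active
        self.active = new_fleet      # atomic cutover: no mixed traffic
        return old                   # old fleet can now be torn down

router = Router("fleet-red:llama-v1")
retired = router.red_black_swap("fleet-black:llama-v2")
print(router.active, "| retired:", retired)
```

The cost of this pattern is double the GPU capacity during the swap window, which is why it is reserved for models too large to co-load.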

Slide 9
Optimized Inference Servers: Low Latency and Cost

Powered by Triton Inference Server
● Streamlined API
● Deep learning backends:
  ○ TensorFlow
  ○ Torch
  ○ Python
● Generative AI backends:
  ○ vLLM
  ○ TensorRT-LLM
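To make the backend list concrete, here is a sketch of the model-repository layout Triton's vLLM backend expects: a `config.pbtxt` that selects the backend, and a `model.json` carrying the vLLM engine arguments. All values are illustrative examples, not Uber's settings:

```python
# Build a throwaway Triton model repository for a vLLM-backed model.
import json, pathlib, tempfile

repo = pathlib.Path(tempfile.mkdtemp()) / "llama-3-70b"
(repo / "1").mkdir(parents=True)

# config.pbtxt: tell Triton to serve this model with the vLLM backend
(repo / "config.pbtxt").write_text(
    'backend: "vllm"\n'
    'instance_group [{ count: 1, kind: KIND_MODEL }]\n'
)

# 1/model.json: vLLM engine arguments (example values)
(repo / "1" / "model.json").write_text(json.dumps({
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "tensor_parallel_size": 4,
    "gpu_memory_utilization": 0.9,
}))

print(sorted(p.name for p in repo.rglob("*")))
```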

Slide 10
Uber Assistant Builder: Overview

● 540+ weekly active users (WAU)
● 15+ unique assistants with 50+ weekly engagements

Slide 11
Uber Assistant Builder: Configuration

Simple UI for building chatbots

● Access to any model available in the Generative AI Gateway, and all the benefits the Gateway provides
● Customize system instructions, tools, model hyperparameters, etc.
● Access internal files and ingested knowledge base content
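The configuration surface described above can be sketched as a validated builder function. Field names here are hypothetical, not Uber's actual schema:

```python
# Assemble an assistant configuration from the knobs listed above:
# model choice, system instructions, tools, hyperparameters, knowledge bases.

def build_assistant(name, model, system_prompt, tools=(),
                    temperature=0.3, knowledge_bases=()):
    """Validate and assemble an assistant configuration dict."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature out of range")
    return {
        "name": name,
        "model": model,                      # any model the Gateway exposes
        "system_prompt": system_prompt,      # custom instructions
        "tools": list(tools),                # e.g. search, ticket lookup
        "temperature": temperature,          # model hyperparameter
        "knowledge_bases": list(knowledge_bases),  # ingested content
    }

bot = build_assistant(
    "infra-helper", "gpt-4o",
    "Answer questions about deploy tooling.",
    tools=["wiki_search"], knowledge_bases=["deploy-runbooks"],
)
print(bot["name"], bot["tools"])
```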

Slide 12
Uber Assistant Builder: Architecture Diagram

Slide 13
Uber Assistant Builder: Architecture Diagram

Slide 14
Uber Assistant Builder: Connectivity

Connect to any Slack channel to provide domain-specific support

● Different support channels have different processes. The flexibility of Assistant Builder allows channel owners to self-serve assistants that work for their specific domains.
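The channel-to-assistant mapping implied above can be sketched as a simple routing table, so one bot framework serves many domain-specific channels. Channel and assistant names are made up:

```python
# Route an incoming Slack message to the assistant its channel owners
# configured, falling back to a general-purpose assistant.

ROUTES = {
    "#deploy-support": "infra-helper",
    "#payments-support": "payments-helper",
}

def route_message(channel: str, default: str = "general-helper") -> str:
    """Pick the assistant for a channel."""
    return ROUTES.get(channel, default)

print(route_message("#deploy-support"))  # infra-helper
print(route_message("#random"))          # general-helper
```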

Slide 15
Uber Assistant Builder: What's Next

● We're seeing limitations of the ReAct architecture for building assistants.
● Some channels also require more deterministic workloads.
● We are working on multi-agent architectures next.

Slide 16
Questions?
To learn more, see our blog post "Scaling AI/ML Infrastructure at Uber".