Michelangelo’s Control Plane
●Project - Use case
●Deployment - Serving entity for the underlying model
●One project can have multiple deployments
●Different deployments can have the same underlying model
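The project/deployment relationship can be sketched as a small data model. This is only an illustration: the class and field names (`Project`, `Deployment`, `model`) and the example values are assumptions, not Michelangelo's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Deployment:
    """Serving entity for an underlying model (field names are illustrative)."""
    name: str
    model: str  # identifier of the underlying model


@dataclass
class Project:
    """A use case that owns any number of deployments."""
    name: str
    deployments: dict[str, Deployment] = field(default_factory=dict)

    def add(self, deployment: Deployment) -> None:
        # Deployment names are unique within a project in this sketch.
        if deployment.name in self.deployments:
            raise ValueError(f"duplicate deployment: {deployment.name}")
        self.deployments[deployment.name] = deployment

    def models(self) -> set[str]:
        """Distinct underlying models served by this project."""
        return {d.model for d in self.deployments.values()}


# One project, two deployments, one shared underlying model (hypothetical names).
support = Project("support-assistant")
support.add(Deployment("prod", model="llm-a"))
support.add(Deployment("canary", model="llm-a"))
```

Here `support.models()` contains a single entry even though the project has two deployments, which mirrors the last bullet above.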
Deployment, Discovery and Serving of LLMs at Uber Scale
Discovery
Deployments are created outside of GenAI Gateway (e.g. through the UI) and stored in Michelangelo’s Control Plane.
How does GenAI Gateway discover added or removed deployments?
●Query Michelangelo’s Control Plane on each request
●Poll Michelangelo’s Control Plane periodically
●Subscribe to changes in Michelangelo’s Control Plane
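To make the trade-off between polling and subscribing concrete, here is a minimal in-memory sketch. `ControlPlane`, `PollingGateway`, and `SubscribingGateway` are hypothetical stand-ins written for this example; they are not the real Michelangelo or GenAI Gateway interfaces, and the slides do not say which option is used in production.

```python
from typing import Callable


class ControlPlane:
    """In-memory stand-in for the store that holds deployments."""

    def __init__(self) -> None:
        self._deployments: set[str] = set()
        self._subscribers: list[Callable[[str, str], None]] = []

    def list_deployments(self) -> set[str]:
        return set(self._deployments)

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        self._subscribers.append(callback)

    def add(self, name: str) -> None:
        self._deployments.add(name)
        self._notify("added", name)

    def remove(self, name: str) -> None:
        self._deployments.discard(name)
        self._notify("removed", name)

    def _notify(self, event: str, name: str) -> None:
        for callback in self._subscribers:
            callback(event, name)


class PollingGateway:
    """Sees changes only when poll_once() runs, so its view can be stale."""

    def __init__(self, control_plane: ControlPlane) -> None:
        self._control_plane = control_plane
        self.known: set[str] = set()

    def poll_once(self) -> None:
        self.known = self._control_plane.list_deployments()


class SubscribingGateway:
    """Takes an initial snapshot, then applies pushed change events."""

    def __init__(self, control_plane: ControlPlane) -> None:
        self.known = control_plane.list_deployments()
        control_plane.subscribe(self._on_change)

    def _on_change(self, event: str, name: str) -> None:
        if event == "added":
            self.known.add(name)
        elif event == "removed":
            self.known.discard(name)


control_plane = ControlPlane()
control_plane.add("llm-a-prod")
poller = PollingGateway(control_plane)
subscriber = SubscribingGateway(control_plane)
control_plane.add("llm-b-prod")  # subscriber sees this at once; poller does not
```

Querying on every request would always be fresh but puts the control plane on the request path; polling removes that dependency at the cost of staleness up to one polling interval; subscribing keeps the local view current without per-request calls.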
Safe Model Deployments
The deployment controller enables safe model deployments for all types of models.
Machine learning and deep learning models are deployed incrementally, zone by zone, keeping
both the new and the previous model in memory.
Blue: has existing model
Yellow: loading new model
Green: has new model
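A zone-by-zone rollout that keeps both versions resident might look like the sketch below. The `Zone` type, the `healthy` callback, and the rollback-on-failure behavior are assumptions made for illustration; the slides only state that rollout is incremental by zone and that both models stay in memory.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Zone:
    name: str
    active: str                    # version currently serving traffic
    standby: Optional[str] = None  # other version also held in memory


def rolling_deploy(
    zones: list[Zone],
    new_version: str,
    healthy: Callable[[Zone], bool],
) -> bool:
    """Deploy one zone at a time; halt at the first unhealthy zone."""
    for zone in zones:
        previous = zone.active
        # Load the new model next to the old one, then switch traffic to it.
        zone.active, zone.standby = new_version, previous
        if not healthy(zone):
            # The previous model is still in memory, so reverting is a swap.
            zone.active, zone.standby = previous, None
            return False  # later zones are left untouched
    return True


# Hypothetical run in which the second zone fails its health check.
zones = [Zone("zone-1", "v1"), Zone("zone-2", "v1"), Zone("zone-3", "v1")]
completed = rolling_deploy(zones, "v2", healthy=lambda z: z.name != "zone-2")
```

Because the previous model never leaves memory during the switch, reverting a zone does not require reloading weights.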
Safe LLM Deployments
For LLMs, Red/Black deployments are used instead, since two LLMs either
cannot fit in GPU VRAM at the same time or cannot serve traffic without latency degradation.
Blue: has existing model
Yellow: loading new model
Green: has new model
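The red/black pattern can be sketched as follows: the new model is loaded onto a separate set of instances, all traffic is switched at once, and the old instances are then released. `Instance`, `Router`, the instance names, and the VRAM and model sizes are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    name: str
    vram_gb: float
    model: Optional[str] = None

    def load(self, model: str, size_gb: float) -> None:
        # One LLM per instance: never co-locate old and new models.
        if self.model is not None:
            raise RuntimeError(f"{self.name} already holds {self.model}")
        if size_gb > self.vram_gb:
            raise MemoryError(f"{model} does not fit on {self.name}")
        self.model = model

    def unload(self) -> None:
        self.model = None


class Router:
    """Sends all traffic to exactly one fleet at a time."""

    def __init__(self, live: list[Instance]) -> None:
        self.live = live


def red_black_deploy(
    router: Router, standby: list[Instance], model: str, size_gb: float
) -> list[Instance]:
    """Load the new model on the standby fleet, cut over, free the old fleet."""
    for instance in standby:
        instance.load(model, size_gb)
    retired = router.live
    router.live = standby  # single switch; no instance serves both models
    for instance in retired:
        instance.unload()
    return retired


red = [Instance("red-1", vram_gb=80.0)]
red[0].load("llm-v1", size_gb=60.0)
black = [Instance("black-1", vram_gb=80.0)]
router = Router(red)
retired = red_black_deploy(router, black, "llm-v2", size_gb=60.0)
```

The cost of this approach is temporary extra capacity for the standby fleet, in exchange for never placing two models on one GPU.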
Optimized Inference Servers: Low Latency and Cost
Powered by Triton Inference Server
●Streamlined API
●Deep Learning backends:
○TensorFlow
○Torch
○Python
●Generative AI backends:
○vLLM
○TensorRT-LLM
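Triton exposes an HTTP/REST inference API that follows the KServe v2 protocol, so a request can be built with the standard library alone. The sketch below only constructs the request and does not send it. The host, the model name `my_llm`, and the input tensor name `text_input` are assumptions; the real tensor names, shapes, and datatypes depend on the deployed model's configuration.

```python
import json
from urllib.request import Request


def build_infer_request(base_url: str, model_name: str, prompt: str) -> Request:
    """Build a KServe v2 style inference request for a text prompt."""
    payload = {
        "inputs": [
            {
                "name": "text_input",  # assumed tensor name
                "shape": [1],          # assumed shape; depends on model config
                "datatype": "BYTES",   # v2 protocol datatype used for strings
                "data": [prompt],
            }
        ]
    }
    return Request(
        url=f"{base_url}/v2/models/{model_name}/infer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Port 8000 is Triton's default HTTP port; host and model name are placeholders.
request = build_infer_request("http://localhost:8000", "my_llm", "Hello")
```

Passing `request` to `urllib.request.urlopen` would send it to a running server; the same request shape applies regardless of which backend serves the model.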
Uber Assistant Builder: Overview
●540+ weekly active users (WAU)
●15+ unique assistants with 50+ weekly engagements
Uber Assistant Builder: Configuration
Simple UI for building chatbots
●Access any model available in the Generative AI Gateway, with all the benefits the Gateway provides
●Customize system instructions, tools, model hyperparameters, etc.
●Access internal files and ingested knowledge base content
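The configurable pieces listed above could be captured in a structure like the one below. Every field name, default, and validation rule here is an assumption made for illustration (including the 0 to 2 temperature range); the slides do not describe Assistant Builder's actual configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class AssistantConfig:
    name: str
    model: str                      # any model exposed through the Gateway
    system_instructions: str
    tools: list[str] = field(default_factory=list)
    knowledge_sources: list[str] = field(default_factory=list)
    temperature: float = 0.0        # example hyperparameter
    max_output_tokens: int = 1024   # example hyperparameter

    def validate(self) -> list[str]:
        """Return a list of human-readable problems; empty means valid."""
        problems = []
        if not self.system_instructions.strip():
            problems.append("system_instructions must not be empty")
        if not 0.0 <= self.temperature <= 2.0:
            problems.append("temperature must be between 0 and 2")
        if self.max_output_tokens <= 0:
            problems.append("max_output_tokens must be positive")
        if len(set(self.tools)) != len(self.tools):
            problems.append("tools must not contain duplicates")
        return problems


# Hypothetical assistant for an on-call support channel.
config = AssistantConfig(
    name="oncall-helper",
    model="llm-a",
    system_instructions="Answer questions using the team runbooks.",
    tools=["search_runbooks"],
    knowledge_sources=["runbooks"],
)
```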
Uber Assistant Builder: Architecture Diagram
Uber Assistant Builder: Connectivity
Connect to any Slack channel to provide domain-specific support
●Different support channels have different processes. The flexibility of Assistant Builder lets
channel owners self-serve assistants that work for their specific domains.
Uber Assistant Builder: What's Next
●We're seeing limitations of the ReAct architecture for building assistants.
●Some channels also require more deterministic workloads.
●We are working on multi-agent architectures next.
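For readers unfamiliar with ReAct, the pattern interleaves model reasoning with tool calls: at each step the model either picks a tool to call or returns a final answer. The sketch below is a generic illustration with a scripted stand-in for the model; `react_loop`, `scripted_model`, and the `lookup` tool are hypothetical and do not represent Assistant Builder's implementation.

```python
from typing import Callable


def react_loop(
    model: Callable[[list[str]], dict],
    tools: dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> str:
    """Alternate between model decisions and tool observations."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(transcript)  # the model decides what happens next
        if "answer" in step:
            return step["answer"]
        observation = tools[step["action"]](step["input"])
        transcript.append(f"Action: {step['action']}({step['input']})")
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("no answer within the step budget")


def scripted_model(transcript: list[str]) -> dict:
    """Stand-in for an LLM: call one tool, then answer with its result."""
    observations = [t for t in transcript if t.startswith("Observation: ")]
    if not observations:
        return {"action": "lookup", "input": "oncall"}
    return {"answer": observations[-1][len("Observation: "):]}


answer = react_loop(
    scripted_model,
    tools={"lookup": lambda query: f"runbook for {query}"},
    question="Who do I page?",
)
```

With a real LLM in place of `scripted_model`, the sequence of actions is chosen by the model on every run, which is one reason a fixed, deterministic workflow is hard to guarantee with this pattern.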
Questions?
To learn more, see our blog post Scaling AI/ML Infrastructure at Uber