LLMariner - Transform your Kubernetes Cluster Into a GenAI platform

kenjikaneda2 · 14 slides · Oct 01, 2024

About This Presentation

Overview of LLMariner (https://llmariner.ai)


Slide Content

LLMariner
Transform your Kubernetes Cluster Into a GenAI platform

LLMariner
Provide a unified AI/ML platform
with efficient GPU and K8s management


[Diagram: LLMariner provides LLM workloads (inference, fine-tuning, RAG), a workbench (Jupyter Notebook), and non-LLM training on top of GPUs of multiple generations and architectures (gen G1/arch A1, gen G2/arch A2) spread across public/private clouds.]

Example Use Cases
●Develop LLM applications with an API compatible with the OpenAI API
○Leverage the existing ecosystem to build applications

●Fine-tune models while keeping data safe and secure in your on-premises datacenter



[Examples: chat bot, code auto-completion]
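As a sketch of what building against the OpenAI-compatible API looks like, the snippet below sends a chat request to an LLMariner endpoint through the standard OpenAI Python library. The base URL, API key, and model name are hypothetical placeholders, not values from this deck.

```python
from openai import OpenAI

# Point the standard OpenAI client at an LLMariner API endpoint.
# Base URL, API key, and model name are hypothetical placeholders.
client = OpenAI(
    base_url="http://llmariner.example.com/v1",
    api_key="<LLMARINER_API_KEY>",
)

completion = client.chat.completions.create(
    model="google-gemma-2b-it",  # any model the cluster serves
    messages=[{"role": "user", "content": "What is Kubernetes?"}],
)
print(completion.choices[0].message.content)
```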

Key Features

For AI/ML team:
●LLM inference
●LLM fine-tuning
●RAG
●Jupyter Notebook
●General-purpose training

For infrastructure team:
●Flexible deployment model
●Efficient GPU management
●Security / access control
●GPU visibility/showback (*)
●Highly-reliable GPU management (*)

(*) under development

High Level Architecture
[Diagram: a control-plane K8s cluster runs the LLMariner Control Plane for AI/ML and exposes the API endpoint; multiple worker GPU K8s clusters each run an LLMariner Agent for AI/ML connected to the control plane.]

Key Feature Details

LLM Inference Serving
●Compatible with OpenAI API
○Can leverage the existing ecosystem and applications

●Advanced capabilities beyond standalone inference runtimes such as vLLM
○Optimized request serving and GPU management
○Multiple inference runtime support
○Multiple model support
○Built-in RAG integration
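Because the endpoint speaks the OpenAI wire protocol, client-side ecosystem features such as streaming work unchanged. A minimal sketch, reusing the `client` from the earlier snippet (the model name is again a placeholder):

```python
# Stream tokens from the same OpenAI-compatible endpoint.
stream = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",  # hypothetical model name
    messages=[{"role": "user", "content": "Explain GPU autoscaling briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```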

Multiple Model and Runtime Support
●Multiple model support
○Open models from Hugging Face
○Private models in customers' environment
○Fine-tuned models generated with LLMariner

●Multiple inference runtime support
○vLLM
○Ollama
○Hugging Face TGI
○NVIDIA TensorRT-LLM (upcoming)
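Since model management rides the same OpenAI-compatible API, discovering what a cluster serves is one call; a sketch reusing the `client` from the first snippet:

```python
# List every model the cluster currently serves: open models,
# private models, and fine-tuned models all appear here.
for model in client.models.list():
    print(model.id)
```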

Optimized Inference Serving
●Efficiently utilize GPUs to achieve high throughput and low latency
●Key technologies:
○Autoscaling
○Model-aware request load balancing & routing
○Multi-model management & caching
○Multi-cluster/cloud federation

[Diagram: the LLMariner Inference Manager Engine autoscales runtimes across clusters, e.g., vLLM serving Llama 3.1 and Gemma 2 in Cluster X, and vLLM (Llama 3.1) and Ollama (Deepseek Coder) in Cluster Y.]

Built-in RAG Integration
●Use the OpenAI-compatible API to manage vector stores and files
○Milvus serves as the underlying vector DB
●The inference engine retrieves relevant data when processing requests

[Diagram: files are uploaded and converted into embeddings; the LLMariner Inference Engine retrieves the relevant data when serving requests.]
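A minimal sketch of the upload-and-embed flow, assuming LLMariner mirrors OpenAI's file and vector-store endpoints (which sat under `beta` in 2024-era OpenAI Python SDKs); the file name and store name are hypothetical:

```python
# Upload a file, then attach it to a vector store so the inference
# engine can retrieve from it at request time (reuses `client` above).
file = client.files.create(
    file=open("product-manual.pdf", "rb"),
    purpose="assistants",
)

vector_store = client.beta.vector_stores.create(name="product-docs")
client.beta.vector_stores.files.create(
    vector_store_id=vector_store.id,
    file_id=file.id,
)
```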

Beyond LLM Inference
●Provide LLM fine-tuning, general-purpose training, and Jupyter Notebook management

●Empower AI/ML teams to harness the full power of GPUs in a secure, self-contained environment

[Diagram: a GPU K8s cluster running workloads such as a Supervised Fine-tuning Trainer.]

A Fine-tuning Example
●Submit a fine-tuning job using the OpenAI Python library (see the sketch below)
○The fine-tuning job runs in an underlying Kubernetes cluster
●Enforce quotas through integration with the open-source Kueue

[Diagram: fine-tuning jobs submitted to a GPU K8s cluster, with quota enforcement by Kueue.]
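A minimal sketch of such a submission, reusing the `client` from the first snippet; the training-file name and base model name are hypothetical, and quota enforcement via Kueue happens server-side, invisibly to the caller:

```python
# Upload training data, then submit a fine-tuning job; LLMariner
# runs the job as a workload on the underlying GPU K8s cluster.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="google-gemma-2b-it",  # hypothetical base model
)
print(job.id, job.status)
```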

Enterprise-Ready Access Control
●Control API scope with "organizations" and "projects"
○A user in Project X can access fine-tuned models generated by other users in Project X
○A user in Project Y cannot access the fine-tuned models in Project X
●Can be integrated with a customer's identity management platform (e.g., SAML, OIDC)

[Diagram: Users 1 and 2 in Project X create and read a fine-tuned model; User 3 in Project Y cannot access it.]
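A client-side sketch of that project boundary, assuming project-scoped API keys; the key values, model ID, and the exact HTTP status returned on denial are assumptions, not details from the deck:

```python
from openai import OpenAI, APIStatusError

# Clients whose API keys are scoped to different projects (hypothetical keys).
client_x = OpenAI(base_url="http://llmariner.example.com/v1",
                  api_key="<PROJECT_X_API_KEY>")
client_y = OpenAI(base_url="http://llmariner.example.com/v1",
                  api_key="<PROJECT_Y_API_KEY>")

model_id = "ft:gemma-2b:project-x:demo"  # hypothetical fine-tuned model ID

print(client_x.models.retrieve(model_id).id)  # Project X user: allowed

try:
    client_y.models.retrieve(model_id)        # Project Y user: denied
except APIStatusError as err:
    print(f"Access denied for Project Y (HTTP {err.status_code})")
```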

Supported Deployment Models
●Single public cloud
●Single private cloud
●Air-gapped env
●Appliance
●Hybrid cloud (public & private)
●Multi-cloud federation

[Diagram: a single-cloud setup with the LLMariner Control Plane and Agent in one K8s cluster; a hybrid setup with the Control Plane in a public cloud and an Agent in a private cloud; a multi-cloud federation with the Control Plane in one cloud and Agents in others.]

※ No need to open incoming ports in worker clusters; only outgoing port 443 is required.