LLMariner - Transform your Kubernetes Cluster Into a GenAI platform

kenjikaneda2 · 1,233 views · 15 slides · Oct 11, 2024

About This Presentation

Overview of LLMariner (https://llmariner.ai)


Slide Content

© 2024 CloudNatix, All Rights Reserved
LLMariner
Transform your Kubernetes Cluster Into a GenAI platform

LLMariner
Provide a unified AI/ML platform
with efficient GPU and K8s management


[Diagram: LLMariner sits on top of multiple public/private clouds and provides LLM capabilities (inference, fine-tuning, RAG), a workbench (Jupyter Notebook), and non-LLM training. Each cloud hosts GPUs of different generations and architectures (e.g., gen G1 / arch A1, gen G2 / arch A2).]

Example Use Cases
●Develop LLM applications with an OpenAI-compatible API
○Leverage the existing ecosystem to build applications

●Fine-tune models while keeping data safe and secure
in your on-premise datacenter
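Because the API is OpenAI-compatible, any HTTP client or OpenAI SDK can talk to it. Below is a minimal stdlib-only sketch of building a chat-completion request; the base URL, API key, and model name are placeholder assumptions, not values from the slides.

```python
import json
import urllib.request

# Hypothetical values -- replace with your LLMariner deployment's endpoint and key.
BASE_URL = "http://llmariner.example.com/v1"
API_KEY = "your-api-key"

def chat_completion_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible POST /chat/completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_completion_request("google-gemma-2b-it", "Hello!")
# Sending it is identical to calling the OpenAI API:
#   urllib.request.urlopen(req)
print(req.full_url)
```

Because the request shape matches OpenAI's, existing tooling (the OpenAI Python library, LangChain, etc.) can be pointed at the same endpoint by overriding the base URL.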



[Examples: chat bot, code auto-completion]

Key Features
For AI/ML team:
●LLM inference
●LLM fine-tuning
●RAG
●Jupyter Notebook
●General-purpose training

For infrastructure team:
●Flexible deployment model
●Efficient GPU management
●Security / access control
●GPU visibility/showback (*)
●Highly-reliable GPU management (*)

(*) under development

High Level Architecture
[Diagram: a control-plane K8s cluster runs the LLMariner Control Plane and exposes the API endpoint; multiple worker GPU K8s clusters each run an LLMariner Agent.]

Key Feature Details

Features for AI/ML team and infra team
[Diagram: APIs for the AI/ML team, served across multiple K8s clusters.
●OpenAI-compatible API (chat completion, embedding, RAG, fine-tuning, …), workbench with Jupyter Notebooks, and general-purpose training jobs
●Model management: open models, closed models owned by your org, fine-tuned models
●Runtime management (e.g., autoscaling, routing) across vLLM, Nvidia Triton, and Ollama
●GPU workload management: inference engines, fine-tuning jobs, Jupyter Notebooks, training jobs (with Kueue)
●Storage management: files, vector DBs
●Cluster federation across K8s clusters
●User management (Dex), API authn/authz, API key management, orgs & projects management, cluster management, secure session management, API usage audits]

LLM Inference Serving
●Compatible with OpenAI API
○Can leverage the existing ecosystem and applications

●Advanced capabilities surpassing standard inference runtimes,
such as vLLM
○Optimized request serving and GPU management
○Multiple inference runtime support
○Multiple model support
○Built-in RAG integration
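Since the endpoint follows the OpenAI API surface, discovering which models a cluster serves uses the standard `GET /models` route. A stdlib-only sketch; the base URL and key are illustrative placeholders:

```python
import urllib.request

# Hypothetical endpoint -- replace with your LLMariner deployment's values.
BASE_URL = "http://llmariner.example.com/v1"

def list_models_request(api_key: str) -> urllib.request.Request:
    """Build an OpenAI-compatible GET /models request to enumerate served models."""
    return urllib.request.Request(
        f"{BASE_URL}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )

req = list_models_request("your-api-key")
print(req.full_url)
```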

Multiple Model and Runtime Support
●Multiple model support:
○Open models from Hugging Face
○Private models in customers’ environments
○Fine-tuned models generated with LLMariner

●Multiple inference runtime support:
○vLLM
○Ollama
○Nvidia Triton Inference Server and Hugging Face TGI (upcoming/experimental)

Optimized Inference Serving
●Efficiently utilize GPU to achieve high throughput and low latency
●Key technologies:
○Autoscaling
○Model-aware request load balancing & routing
○Multi-model management & caching
○Multi-cluster/cloud federation
[Diagram: the LLMariner Inference Manager Engine autoscales runtimes across clusters X and Y, e.g., vLLM instances serving Llama 3.1 and Gemma 2, and an Ollama instance serving Deepseek Coder.]

Built-in RAG Integration
●Use an OpenAI-compatible API to manage vector stores and files
○Uses Milvus as the underlying vector DB
●Inference engine retrieves relevant data when processing requests
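Vector-store management mirrors the OpenAI API shape (`POST /vector_stores` over previously uploaded files, with embeddings computed server-side). A stdlib-only sketch; the base URL, key, store name, and file ID are illustrative placeholders:

```python
import json
import urllib.request

# Hypothetical endpoint -- replace with your LLMariner deployment's values.
BASE_URL = "http://llmariner.example.com/v1"

def create_vector_store_request(name: str, file_ids: list[str]) -> urllib.request.Request:
    """Build an OpenAI-compatible POST /vector_stores request; the server
    creates embeddings for the referenced files (backed by Milvus)."""
    payload = {"name": name, "file_ids": file_ids}
    return urllib.request.Request(
        f"{BASE_URL}/vector_stores",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer your-api-key",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = create_vector_store_request("product-docs", ["file-abc123"])
print(req.full_url)
```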



[Diagram: files are uploaded and embeddings are created; the LLMariner inference engine retrieves relevant data at request time.]

Beyond LLM Inference
●Provide LLM fine-tuning, general-purpose training, and Jupyter
Notebook management

●Empower AI/ML teams to harness the full power of GPUs in a
secure self-contained environment
[Diagram: a supervised fine-tuning trainer running in a GPU K8s cluster.]

A Fine-tuning Example
●Submit a fine-tuning job using the OpenAI Python library
○The fine-tuning job runs in the underlying Kubernetes cluster
●Enforce quotas through integration with the open-source Kueue
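The submission path is the standard OpenAI fine-tuning route (`POST /fine_tuning/jobs`), so the OpenAI Python library works unchanged against the LLMariner endpoint. A stdlib-only sketch of the same request; the base URL, key, model, and training-file ID are illustrative placeholders:

```python
import json
import urllib.request

# Hypothetical endpoint -- replace with your LLMariner deployment's values.
BASE_URL = "http://llmariner.example.com/v1"

def fine_tuning_job_request(model: str, training_file: str) -> urllib.request.Request:
    """Build an OpenAI-compatible POST /fine_tuning/jobs request; LLMariner
    runs the job as a workload in the underlying K8s cluster, subject to
    Kueue quota enforcement."""
    payload = {"model": model, "training_file": training_file}
    return urllib.request.Request(
        f"{BASE_URL}/fine_tuning/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer your-api-key",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = fine_tuning_job_request("google-gemma-2b-it", "file-abc123")
print(req.full_url)
```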
[Diagram: fine-tuning jobs are submitted to a GPU K8s cluster, with quota enforcement by Kueue.]

Enterprise-Ready Access Control
●Control API scope with “organizations” and “projects”
○A user in Project X can access fine-tuned models generated by
other users in Project X
○A user in Project Y cannot access the fine-tuned models in Project X
●Can be integrated with a customer’s identity management platform
(e.g., SAML, OIDC)
[Diagram: Users 1 and 2 in Project X create and read a fine-tuned model; User 3 in Project Y cannot access it.]

Supported Deployment Models
●Single public cloud
●Single private cloud
●Air-gapped environment
●Appliance
●Hybrid cloud (public & private)
●Multi-cloud federation

[Diagram: in a hybrid deployment, the LLMariner Control Plane runs in a public cloud and an LLMariner Agent runs in a private cloud; in a multi-cloud federation, the control plane in one cloud federates LLMariner Agents running in other clouds.]
Note: there is no need to open incoming ports in worker clusters; only outgoing port 443 is required.