Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024

TobiasSchneck 112 views 51 slides May 28, 2024
Slide 1
Slide 1 of 51
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51

About This Presentation

As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologie...


Slide Content

Kubernetes & AI
?????? Beauty and the ?????? Beast!?!

Tobias Schneck
@
[email protected]
@toschneck
Principal Architect
@toschneck

As Kubernetes folks
why should we care about AI?
??????

… will it be the next big thing?!
??????

By 2028, the adoption of AI will culminate in
over 50% of cloud compute
resources devoted to AI workload, up from
less than 10% in 2023.
Gartner® states, 2023

OK … so what’s about this AI thingy?
????????????

Drawings created with Excalidraw, thanks Koray Oksay (@korayoksay) for the hint ??????

??????

… a lot of Data and Math for an
Infrastructure guy ??????

… how does such data get compute?
?????? ??????

Credits to Andrej Karpathy ?????? Awesome Intro to LLMs
[1hr Talk] Intro to Large Language Models

Credits to Andrej Karpathy ??????

Credits to Andrej Karpathy ??????

Credits to Andrej Karpathy ??????

Credits to Andrej Karpathy ??????

Credits to Andrej Karpathy ??????

How does our normal Job look like?
??????

Based on Adel Zaalouk (@ZaNetworker) drawings from the CNCF Cloud Native AI white paper ??????

What will change in our Infra?
??????

Based on Adel Zaalouk (@ZaNetworker) drawings from the CNCF Cloud Native AI white paper ??????

Based on Adel Zaalouk (@ZaNetworker) drawings from the CNCF Cloud Native AI white paper ??????

So, Why Kube ?

Flexibility & Standardization
Standard
Container
High-Cube
Container
Hardtop
Container
Open Top
Container
Flat Platform (Plat) Ventilated
Container
Cooling Container Bulk Container
Tank
Container
Container Types

Data Center I




Infrastructure Layer
Standardization with Kubernetes





App Services
IT Space I IT Space II IT Space III





Backend Services
DB Services Analytics Observability
Data Center II




Data Center III




Edge

Cloud Providers

Caches AI / ML
DDoS
Protect
Managed
Services
Real Time
Analysis
Intelligent
Edge Devices
Smart
Automation
Data
Processing

Kube for AI ⇔ Kube for Applications
?????? Kube is a de facto standard as “cloud operating systems”
✅ API abstraction layer for multiple types of network, storage, and compute resources
?????? Standard interfaces for support of DevOps best practices like GitOps
?????? Variation of Cloud Providers and Services are consumable via standard APIs

… and OpenAI uses Kube already 2017 ??????

So …. ??????

How to build an AI Platform?
?????? ??????

How does the ecosystem look
like!?
?????? ??????

AI Frameworks

Options to Adapt the Frameworks ??????
Feature
“Kube Native”
via Kubeflow
Managed Platforms
(e.g., SageMaker)
Focused Tools
(e.g., MLflow)
Scope End-to-end MLOps platformManaged MLOps service
Specific functionalities within
ML lifecycle
Open Source Yes No Yes
Scalability & Portability High Depends on cloud provider Moderate
Setup & Management Complex Simpler Simpler
Portability Everywhere Mostly Cloud Mostly Machine based
Vendor Lock-in No Yes (to specific cloud provider)No

AI Frameworks
Could use

KubeFlow ⁉

?????? Currently the most
feature complete
choice for Kube

?????? But Setup is complex!

KubeFlow ⁉

The Beauty ?????? :
●Incubating CNCF Project
●Serving AI Platform in Multi-Tenancy
●Popularity 13.7k ⭐ ~ long-term Maintenance Chance
●Alternatives like MLflow / KServe are integrated

The Beast ??????
●Mostly vendor specific installer instructions ??????
○No maintained automated installer for generic Kubernetes
○Helm chart issue #3173
●Dependency “hell”
○A lot of different 3rd party dependencies constraints
○Hard to adapt again to existing company defaults
●Only support EOL Kubernetes <= 1.26❗
○Usability is then questionable in production

Sounds good, but what about
on-prem / offline cases?
??????

[Cloud] Data Center I









GPU / TPU Powered Services
based on Argo CD




AI Model Serving
[AI] Application Service
Application
⚙ Separate Model Training / Model Usage Example
Infrastructure Layer
Data Center II




Data Center III




Edge

Cloud Providers

Real Time
Analysis
Intelligent
Edge Devices
Smart
Automation
Data
Processing
Data Delivery
Model
Export
Local AI
Consume
Scale for
Training
Vanilla Setup

Starting a POC
??????
github.com/toschneck/kubernetes-and-ai

Kubeflow | Katib Architecture for Hyperparameter
Tuning (aka optimization run)

Server & Local Model with LocalAI

Any Questions?

THANKS FOR JOINING!
kubermatic
@toschneck
[email protected]