BUILDING A MULTI-TENANT
MACHINE LEARNING
PLATFORM ON AWS EKS
WITH RAY AND JUPYTERHUB
About Me
Focus areas:
-Distributed Systems
-Cloud-Native
-Contributing to open-source projects in free time
Connect:
-Daniel Pepuho (danielcristho)
DISCLAIMER
In this session, the discussion will focus on the infrastructure and architectural aspects of the Machine Learning platform, rather than on the development or fine-tuning of AI or LLM models.
Background on Multi-Tenant ML Infrastructure
Challenges:
-ML and AI workloads are growing rapidly
-High cost of GPU investment and maintenance
-Many GPUs remain idle or underutilized
Focus:
-Implementing this concept on AWS EKS
-Integrating it with JupyterHub to provide a seamless user experience
-Leveraging Ray for efficient distributed workload orchestration
Multi-Tenant ML Platform Concept
1. Single infrastructure serving multiple users (researchers, teams, or students).
2. Each user has their own isolated workspace.
3. Resource pooling (CPU/GPU) is managed dynamically and fairly (see the sketch below).
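To make per-user isolation and fair pooling concrete, here is a minimal sketch using the official kubernetes Python client: it creates a dedicated namespace and a ResourceQuota for one user. The username, namespace pattern, and quota values are hypothetical; the real platform would set these per team or per profile.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
core = client.CoreV1Api()

username = "alice"                 # hypothetical user
namespace = f"ml-user-{username}"  # hypothetical naming scheme

# A dedicated namespace gives the user an isolated workspace.
core.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=namespace)))

# A ResourceQuota caps the user's share of the pooled CPU/GPU resources.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="user-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "4",
            "requests.memory": "16Gi",
            "requests.nvidia.com/gpu": "1",  # hypothetical GPU share
        }
    ),
)
core.create_namespaced_resource_quota(namespace=namespace, body=quota)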
Component | Role | Technology
Infrastructure | Running container workloads | Amazon EKS
Add-ons | Functionality (GPU, Ingress) | Ray, NVIDIA Operator, Nginx Ingress
User Interface | Interactive workspace | JupyterHub
Scheduling | Workload distribution | Ray
Storage | Storing datasets & artifacts | Amazon S3
Infrastructure as Code | Infrastructure provisioning | Terraform
JupyterHub
-Web-based platform providing interactive notebooks for multiple users (multi-tenant).
-Integrates with Ray, Dask, Spark, or Airflow.
-User servers are scheduled by Kubernetes with CPU/GPU awareness and scale automatically with the cluster.
JupyterHub Core Components
-Hub -> Central component for authentication and user session management.
-Proxy -> Routes user traffic to the corresponding notebook server.
-Single-User Notebook Server -> A dedicated container per user, deployed as a Kubernetes Pod.
-Spawner (KubeSpawner) -> Manages the lifecycle of user pods and integrates with the EKS scheduler.
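To show how the Spawner ties these pieces to Kubernetes, below is a rough jupyterhub_config.py sketch using KubeSpawner. In a Zero to JupyterHub (Helm) deployment these options are usually set through chart values; the image name and resource sizes here are assumptions.

# jupyterhub_config.py (sketch; image and sizes are illustrative)
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Each user gets a dedicated single-user server pod.
c.KubeSpawner.image = "quay.io/jupyter/scipy-notebook:latest"  # hypothetical image
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_limit = "4G"

# One persistent volume per user, so work survives pod restarts.
c.KubeSpawner.storage_pvc_ensure = True
c.KubeSpawner.storage_capacity = "10Gi"
c.KubeSpawner.pvc_name_template = "claim-{username}"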
Registration & Login on JupyterHub
Using Native Authenticator
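A minimal jupyterhub_config.py sketch for the Native Authenticator flow above, assuming the jupyterhub-nativeauthenticator package is installed; the admin username is hypothetical.

# jupyterhub_config.py (sketch)
c.JupyterHub.authenticator_class = "nativeauthenticator.NativeAuthenticator"

# Users register themselves; an admin must authorize new accounts before login.
c.NativeAuthenticator.open_signup = False
c.NativeAuthenticator.minimum_password_length = 8
c.Authenticator.admin_users = {"admin"}  # hypothetical admin account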
Profile List
Spawn JupyterLab
Pulling image
Jupyter Notebook
Jupyter Notebooks are scheduled to the appropriate nodes, and each user gets a persistent volume.
Spawn JupyterLab on GPU Node
Jupyter Notebook with PyTorch and CUDA 12
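The CPU and GPU options offered in the profile list can be expressed as a KubeSpawner profile_list, which also decides which nodes each notebook lands on. A rough sketch, with hypothetical images and node labels:

# jupyterhub_config.py (sketch; images and node labels are hypothetical)
c.KubeSpawner.profile_list = [
    {
        "display_name": "CPU notebook (2 vCPU, 4 GiB)",
        "default": True,
        "kubespawner_override": {
            "cpu_limit": 2,
            "mem_limit": "4G",
            "node_selector": {"workload-type": "cpu"},
        },
    },
    {
        "display_name": "GPU notebook (PyTorch, CUDA 12)",
        "kubespawner_override": {
            "image": "my-registry/pytorch-cuda12-notebook:latest",  # hypothetical
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
            "node_selector": {"workload-type": "gpu"},
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
        },
    },
]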
Administrator Page
Ray
-Ray is an open-source framework for distributed computing and machine learning workloads.
-It enables ML workloads to run in parallel across multiple nodes (CPU/GPU) within a “RayCluster”.
-ML libraries: Ray Train, Ray Tune, and Ray Serve for training, tuning, and serving.
Ray Core Components
-Ray Head -> The main node that acts as the scheduler and control plane; it coordinates tasks across worker nodes and stores cluster metadata.
-Ray Worker -> Execution nodes that run tasks, actors, or ML jobs in parallel; they can be either CPU or GPU nodes.
-Ray Client -> Interface that allows external connections (e.g., from a JupyterHub notebook) to the Ray Head via the ray:// protocol.
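A minimal Ray Core example of this model: functions decorated with @ray.remote become tasks that the head schedules onto available workers and runs in parallel. Run as-is it starts a local Ray instance; on the platform the same code runs against the RayCluster.

import ray

ray.init()  # local Ray here; on the platform this would target the RayCluster

@ray.remote
def square(x):
    # Each call becomes a task that the head schedules onto an available worker.
    return x * x

futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]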
Connect to Ray Cluster from Ray Client
The Ray versions on the client and the cluster must be compatible; a version mismatch can break the ray:// connection (see the sketch below).
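A sketch of connecting from a notebook to the Ray head over the Ray Client, assuming KubeRay's default client port (10001); the head service name is hypothetical. Keeping the Ray and Python versions in the notebook image aligned with the cluster avoids the compatibility issues noted above.

import ray

# ray:// connects through the Ray Client server on the head pod (default port 10001).
ray.init("ray://raycluster-ml-head-svc.ray.svc.cluster.local:10001")  # hypothetical service name

print(ray.cluster_resources())  # total CPUs/GPUs visible in the RayCluster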
Ray ML Libraries
-Ray Core -> Runs parallel tasks and actors across multiple nodes.
-Ray Train -> Distributed training for deep learning with PyTorch & TensorFlow.
-Ray Tune -> Distributed hyperparameter optimization.
-Ray Serve -> Model deployment and serving.
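As a small taste of these libraries, the sketch below runs a toy hyperparameter sweep with Ray Tune; the objective function and search space are made up for illustration.

from ray import tune

def objective(config):
    # Toy objective: pretend the best learning rate is 0.1.
    return {"score": (config["lr"] - 0.1) ** 2}

tuner = tune.Tuner(
    objective,
    param_space={"lr": tune.grid_search([0.01, 0.05, 0.1, 0.2])},
)
results = tuner.fit()
print(results.get_best_result(metric="score", mode="min").config)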
Karpenter
-Karpenter is a dynamic node autoscaler for EKS.
-It analyzes pod specs (CPU, GPU, memory, taints/tolerations) and optimizes cost by choosing the best instance type (Spot/On-Demand).
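Karpenter itself is configured through NodePool/EC2NodeClass manifests rather than Python; as an illustration of what it reacts to, the sketch below submits a pod whose GPU request, node selector, and toleration cannot be satisfied by the current nodes, which is the pending-pod signal Karpenter inspects before provisioning a matching (e.g. GPU Spot) instance. All names, labels, and sizes are hypothetical.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# A pending, unschedulable pod like this is what Karpenter analyzes: its resource
# requests, node selector, and tolerations drive the choice of instance type.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-job"),  # hypothetical
    spec=client.V1PodSpec(
        restart_policy="Never",
        node_selector={"workload-type": "gpu"},
        tolerations=[
            client.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
        ],
        containers=[
            client.V1Container(
                name="train",
                image="my-registry/pytorch-cuda12:latest",  # hypothetical
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)
core.create_namespaced_pod(namespace="ml-user-alice", body=pod)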
Terraform
-Terraform is an IaC tool used to automate the provisioning and management of the entire platform infrastructure, including:
-EKS Cluster & Node Groups (CPU/GPU)
-Networking (VPC, Subnets, Security Groups)
-IAM & IRSA integration for service account permissions
-Helm add-ons: JupyterHub, Ray, NVIDIA Plugin, Karpenter
-Automated setup of IAM roles, IRSA, and resource quotas
Conclusion
This platform brings together multiple open-source and AWS-native components:
-EKS as the foundational layer,
-Karpenter for auto-scaling,
-JupyterHub for multi-tenant access,
-Ray for distributed machine learning workloads,
-and Terraform for automated infrastructure provisioning and management.
References
-Fine-tuning Foundation Models on Amazon EKS for AI/ML Workloads: https://github.com/aws-samples/gen-ai-on-eks
-Ray cluster with examples running on Kubernetes (k3d): https://github.com/tekumara/ray-demo
-Ray on Kubernetes (KubeRay): https://docs.ray.io/en/latest/cluster/kubernetes/index.html
-Start Amazon EKS Cluster with GPUs for KubeRay: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.html
-Launching Ray Clusters on AWS: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/aws.html