Daniel Pepuho_AWS Community Day 2025.pdf

DanielPepuho1 · 44 slides · Oct 29, 2025

About This Presentation

Presentation from AWS Community Day Indonesia.


Slide Content

Jakarta, 25 October 2025
Daniel Pepuho

BUILDING A MULTI-TENANT
MACHINE LEARNING
PLATFORM ON AWS EKS
WITH RAY AND JUPYTERHUB

About Me
Focus areas:
- Distributed systems
- Cloud-native
- Spends free time contributing to open-source projects
Connect:
Daniel Pepuho
danielcristho

Docs

DISCLAIMER

This session focuses on the infrastructure and architectural aspects of the machine learning platform, rather than on the development or fine-tuning of AI or LLM models.

Background on Multi-Tenant ML Infrastructure
Challenges:
- ML and AI workloads are growing rapidly
- GPU investment and maintenance are costly
- Many GPUs remain idle or underutilized
Focus:
- Implementing this concept on AWS EKS
- Integrating JupyterHub to provide a seamless user experience
- Leveraging Ray for efficient distributed workload orchestration

Multi-Tenant ML Platform Concept
1. A single infrastructure serves multiple users (researchers, teams, or students).
2. Each user has their own isolated workspace.
3. Resource pooling (CPU/GPU) is managed dynamically and fairly.
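Fair resource pooling per tenant is commonly enforced with Kubernetes namespace quotas; a minimal sketch, assuming one namespace per team (the namespace name and limits below are illustrative, not the deck's actual configuration):

```yaml
# Hypothetical per-tenant quota: caps the total CPU, memory, and GPU
# that pods in the "team-a" namespace may request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 64Gi
    requests.nvidia.com/gpu: "2"
```

With a quota like this in place, the scheduler rejects pods that would push a team past its share, while unused capacity stays available to other tenants.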

https://bit.ly/dcloudx

Reference:
https://github.com/aws-samples/gen-ai-on-eks

Component | Role | Technology
Infrastructure | Running container workloads | Amazon EKS
Add-ons | Functionality (GPU, Ingress) | Ray, NVIDIA Operator, Nginx Ingress
User Interface | Interactive workspace | JupyterHub
Scheduling | Workload distribution | Ray
Storage | Storing datasets & artifacts | Amazon S3
Infrastructure as Code | Infrastructure provisioning | Terraform

JupyterHub
-Web-based platform providing interactive notebooks for
multiple users (multi-tenant).
-Integrates with Ray, Dask, Spark, or Airflow.
-Auto-scales through the Kubernetes scheduler
(CPU/GPU-aware).

JupyterHub Core Components
-Hub -> Central component for authentication and user
session management.
-Proxy -> Routes user traffic to the corresponding
notebook server.
-Single-User Notebook Server -> A dedicated container
per user, deployed as a Kubernetes Pod.
-Spawner (KubeSpawner) -> Manages the lifecycle of
user pods and integrates with the EKS scheduler.
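As a sketch of how per-user environments can be offered, KubeSpawner's profile_list presents selectable server options on the spawn page. The image names, resource sizes, and GPU toleration below are illustrative assumptions, not the deck's exact configuration; in a real deployment these dicts would be assigned in jupyterhub_config.py.

```python
# Hypothetical KubeSpawner profiles as plain dicts; in a real
# jupyterhub_config.py they would be wired up via
#   c.KubeSpawner.profile_list = profile_list
cpu_profile = {
    "display_name": "CPU notebook (2 vCPU / 8 GiB)",
    "kubespawner_override": {
        "image": "jupyter/scipy-notebook:latest",
        "cpu_limit": 2,
        "mem_limit": "8G",
    },
}
gpu_profile = {
    "display_name": "GPU notebook (PyTorch + CUDA)",
    "kubespawner_override": {
        # Hypothetical image with PyTorch and CUDA preinstalled.
        "image": "registry.example.com/pytorch-cuda-notebook:latest",
        "extra_resource_limits": {"nvidia.com/gpu": "1"},
        # Tolerate the taint on dedicated GPU nodes.
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
    },
}
profile_list = [cpu_profile, gpu_profile]
print([p["display_name"] for p in profile_list])
```

Each entry becomes one option on the spawn page; the override block is what steers the resulting pod onto CPU or GPU nodes via the EKS scheduler.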


Registration & Login on JupyterHub
Using the Native Authenticator

Profile List

Spawn JupyterLab
Pulling the Jupyter Notebook image. Notebooks are scheduled to the appropriate nodes; each user has a persistent volume.

Spawn JupyterLab on a GPU Node
Jupyter Notebook with PyTorch and CUDA 12

Administrator Page

Ray
-Ray is an open-source framework for distributed
computing and machine learning workloads.
-It enables ML workloads to run in parallel across
multiple nodes (CPU/GPU) within a “RayCluster”.
- ML libraries: Ray Train, Ray Tune, and Ray Serve for training, tuning, and serving.

Ray Core Components
- Ray Head -> The main node that acts as the scheduler and control plane. It coordinates tasks across worker nodes and stores cluster metadata.
- Ray Worker -> Execution nodes that run tasks, actors, or ML jobs in parallel. They can be either CPU or GPU nodes.
- Ray Client -> Interface that allows external connections (e.g., from a JupyterHub notebook) to the Ray Head via the ray:// protocol.

Connect to the Ray Cluster from the Ray Client

There are version differences between the Ray
Server and the Ray Client that may affect
compatibility.

Ray ML Libraries
- Ray Core -> Runs parallel tasks and actors across multiple nodes.
- Ray Train -> Distributed deep learning for PyTorch & TensorFlow (distributed training).
- Ray Tune -> Distributed hyperparameter optimization.
- Ray Serve -> Model deployment and serving.

Karpenter
- Karpenter is a dynamic node autoscaler for EKS.
- It analyzes pod specs (CPU, GPU, memory, taints/tolerations) and optimizes cost by choosing the best instance type (Spot/On-Demand).
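A hedged sketch of what a GPU node pool might look like, assuming Karpenter's NodePool API; the pool name, limits, and EC2NodeClass reference are illustrative, not the deck's actual manifest:

```yaml
# Hypothetical NodePool: lets Karpenter launch Spot or On-Demand
# GPU instances on demand, tainted so only GPU workloads land there.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    nvidia.com/gpu: "8"
```

Pods that request nvidia.com/gpu (and tolerate the taint) trigger Karpenter to provision a matching instance; the limits block caps total GPU spend for the pool.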

Terraform
- Terraform is an IaC tool used to automate the provisioning and management of the entire platform infrastructure, including:
- EKS cluster & node groups (CPU/GPU)
- Networking (VPC, subnets, security groups)
- IAM & IRSA integration for service-account permissions
- Helm add-ons: JupyterHub, Ray, NVIDIA plugin, Karpenter
- Automated setup of IAM roles, IRSA, and resource quotas
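As an illustrative fragment (not the deck's actual code), EKS provisioning is often expressed with the community terraform-aws-modules/eks module; the cluster name, Kubernetes version, and instance types below are assumptions:

```hcl
# Hypothetical EKS cluster with one CPU and one tainted GPU node group.
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name    = "ml-platform" # illustrative name
  cluster_version = "1.31"

  eks_managed_node_groups = {
    cpu = {
      instance_types = ["m5.xlarge"]
      min_size       = 1
      max_size       = 5
    }
    gpu = {
      instance_types = ["g5.xlarge"]
      min_size       = 0
      max_size       = 2
      taints = {
        gpu = {
          key    = "nvidia.com/gpu"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }
    }
  }
}
```

Helm releases for JupyterHub, Ray, the NVIDIA plugin, and Karpenter can then be layered on with the Terraform Helm provider against this cluster.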

Node Groups

Add-ons Release

Infrastructure Provisioning
$ terraform apply
$ terraform destroy

New cluster on EKS

Nodes

Conclusion
This platform brings together multiple open-source
and AWS-native components:
-EKS as the foundational layer,
-Karpenter for auto-scaling,
-JupyterHub for multi-tenant access,
-Ray for distributed machine learning
workloads,
-and Terraform for automated infrastructure
provisioning and management.

References
- Fine-tuning Foundation Models on Amazon EKS for AI/ML Workloads: https://github.com/aws-samples/gen-ai-on-eks
- Ray cluster with examples running on Kubernetes (k3d): https://github.com/tekumara/ray-demo
- Ray on Kubernetes (KubeRay): https://docs.ray.io/en/latest/cluster/kubernetes/index.html
- Start Amazon EKS Cluster with GPUs for KubeRay: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.html
- Launching Ray Clusters on AWS: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/aws.html

Connect with Me:
Daniel Pepuho

danielcristho

https://bit.ly/dc_devto

https://bit.ly/dc_medium

Thank you!