BUILDING A MULTI-TENANT
MACHINE LEARNING
PLATFORM ON AWS EKS
WITH RAY AND JUPYTERHUB
About Me
Focus areas:
-Distributed Systems
-Cloud-Native
-Contributing to open-source projects in free time
Connect:
-Daniel Pepuho (danielcristho)
DISCLAIMER
In this session, the discussion will focus on the infrastructure and architectural aspects of the Machine Learning platform, rather than on the development or fine-tuning of AI or LLM models.
Background on Multi-Tenant ML Infrastructure
Challenges:
-ML and AI workloads are growing rapidly
-High cost of GPU investment and maintenance
-Many GPUs remain idle or underutilized
Focus:
-Implementing this concept on AWS EKS
-Integrating it with JupyterHub to provide a seamless user experience
-Leveraging Ray for efficient distributed workload orchestration
Multi-Tenant ML Platform Concept
1. Single infrastructure serving multiple users (researchers, teams, or students).
2. Each user has their own isolated workspace.
3. Resource pooling (CPU/GPU) is managed dynamically and fairly (see the sketch below).
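To make per-user isolation and fair pooling concrete, here is a minimal sketch using the official kubernetes Python client: it creates a dedicated namespace and a ResourceQuota for one user. The username, namespace pattern, and quota values are hypothetical; the real platform would set these per team or per profile.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
core = client.CoreV1Api()

username = "alice"                 # hypothetical user
namespace = f"ml-user-{username}"  # hypothetical naming scheme

# A dedicated namespace gives the user an isolated workspace.
core.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=namespace)))

# A ResourceQuota caps the user's share of the pooled CPU/GPU resources.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="user-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "4",
            "requests.memory": "16Gi",
            "requests.nvidia.com/gpu": "1",  # hypothetical GPU share
        }
    ),
)
core.create_namespaced_resource_quota(namespace=namespace, body=quota)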
Component | Role | Technology
Infrastructure | Running container workloads | Amazon EKS
Add-ons | Functionality (GPU, Ingress) | Ray, NVIDIA Operator, Nginx Ingress
User Interface | Interactive workspace | JupyterHub
Scheduling | Workload distribution | Ray
Storage | Storing datasets & artifacts | Amazon S3
Infrastructure as Code | Infrastructure provisioning | Terraform
JupyterHub
-Web-based platform providing interactive notebooks for multiple users (multi-tenant).
-Integrates with Ray, Dask, Spark, or Airflow.
-User servers are scheduled by Kubernetes with CPU/GPU awareness and scale automatically with the cluster.
JupyterHub Core Components
-Hub -> Central component for authentication and user session management.
-Proxy -> Routes user traffic to the corresponding notebook server.
-Single-User Notebook Server -> A dedicated container per user, deployed as a Kubernetes Pod.
-Spawner (KubeSpawner) -> Manages the lifecycle of user pods and integrates with the EKS scheduler.
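To show how the Spawner ties these pieces to Kubernetes, below is a rough jupyterhub_config.py sketch using KubeSpawner. In a Zero to JupyterHub (Helm) deployment these options are usually set through chart values; the image name and resource sizes here are assumptions.

# jupyterhub_config.py (sketch; image and sizes are illustrative)
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Each user gets a dedicated single-user server pod.
c.KubeSpawner.image = "quay.io/jupyter/scipy-notebook:latest"  # hypothetical image
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_limit = "4G"

# One persistent volume per user, so work survives pod restarts.
c.KubeSpawner.storage_pvc_ensure = True
c.KubeSpawner.storage_capacity = "10Gi"
c.KubeSpawner.pvc_name_template = "claim-{username}"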
Registration & Login on JupyterHub
Using Native Authenticator
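A minimal jupyterhub_config.py sketch for the Native Authenticator flow above, assuming the jupyterhub-nativeauthenticator package is installed; the admin username is hypothetical.

# jupyterhub_config.py (sketch)
c.JupyterHub.authenticator_class = "nativeauthenticator.NativeAuthenticator"

# Users register themselves; an admin must authorize new accounts before login.
c.NativeAuthenticator.open_signup = False
c.NativeAuthenticator.minimum_password_length = 8
c.Authenticator.admin_users = {"admin"}  # hypothetical admin account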
Profile List
Spawn JupyterLab
Pulling image
Jupyter Notebook
Jupyter Notebooks are scheduled to the appropriate nodes, and each user gets a persistent volume.
Spawn JupyterLab on GPU Node
Jupyter Notebook with PyTorch and CUDA 12
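The CPU and GPU options offered in the profile list can be expressed as a KubeSpawner profile_list, which also decides which nodes each notebook lands on. A rough sketch, with hypothetical images and node labels:

# jupyterhub_config.py (sketch; images and node labels are hypothetical)
c.KubeSpawner.profile_list = [
    {
        "display_name": "CPU notebook (2 vCPU, 4 GiB)",
        "default": True,
        "kubespawner_override": {
            "cpu_limit": 2,
            "mem_limit": "4G",
            "node_selector": {"workload-type": "cpu"},
        },
    },
    {
        "display_name": "GPU notebook (PyTorch, CUDA 12)",
        "kubespawner_override": {
            "image": "my-registry/pytorch-cuda12-notebook:latest",  # hypothetical
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
            "node_selector": {"workload-type": "gpu"},
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
        },
    },
]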
Administrator Page
Ray
-Ray is an open-source framework for distributed computing and machine learning workloads.
-It enables ML workloads to run in parallel across multiple nodes (CPU/GPU) within a “RayCluster”.
-ML libraries: Ray Train, Ray Tune, and Ray Serve for training, tuning, and serving.
Ray Core Components
-Ray Head -> The main node that acts as the scheduler and control plane; it coordinates tasks across worker nodes and stores cluster metadata.
-Ray Worker -> Execution nodes that run tasks, actors, or ML jobs in parallel; they can be either CPU or GPU nodes.
-Ray Client -> Interface that allows external connections (e.g., from a JupyterHub notebook) to the Ray Head via the ray:// protocol.
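A minimal Ray Core example of this model: functions decorated with @ray.remote become tasks that the head schedules onto available workers and runs in parallel. Run as-is it starts a local Ray instance; on the platform the same code runs against the RayCluster.

import ray

ray.init()  # local Ray here; on the platform this would target the RayCluster

@ray.remote
def square(x):
    # Each call becomes a task that the head schedules onto an available worker.
    return x * x

futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]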
Connect to Ray Cluster from Ray Client
The Ray versions on the client and the cluster must be compatible; a version mismatch can break the ray:// connection (see the sketch below).
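A sketch of connecting from a notebook to the Ray head over the Ray Client, assuming KubeRay's default client port (10001); the head service name is hypothetical. Keeping the Ray and Python versions in the notebook image aligned with the cluster avoids the compatibility issues noted above.

import ray

# ray:// connects through the Ray Client server on the head pod (default port 10001).
ray.init("ray://raycluster-ml-head-svc.ray.svc.cluster.local:10001")  # hypothetical service name

print(ray.cluster_resources())  # total CPUs/GPUs visible in the RayCluster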
Ray ML Libraries
-Ray Core -> Runs parallel tasks and actors across multiple nodes.
-Ray Train -> Distributed training for deep learning with PyTorch & TensorFlow.
-Ray Tune -> Distributed hyperparameter optimization.
-Ray Serve -> Model deployment and serving.
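As a small taste of these libraries, the sketch below runs a toy hyperparameter sweep with Ray Tune; the objective function and search space are made up for illustration.

from ray import tune

def objective(config):
    # Toy objective: pretend the best learning rate is 0.1.
    return {"score": (config["lr"] - 0.1) ** 2}

tuner = tune.Tuner(
    objective,
    param_space={"lr": tune.grid_search([0.01, 0.05, 0.1, 0.2])},
)
results = tuner.fit()
print(results.get_best_result(metric="score", mode="min").config)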
Karpenter
-Karpenter is a dynamic node autoscaler for EKS.
-It analyzes pod specs (CPU, GPU, memory, taints/tolerations) and optimizes cost by choosing the best instance type (Spot/On-Demand).
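Karpenter itself is configured through NodePool/EC2NodeClass manifests rather than Python; as an illustration of what it reacts to, the sketch below submits a pod whose GPU request, node selector, and toleration cannot be satisfied by the current nodes, which is the pending-pod signal Karpenter inspects before provisioning a matching (e.g. GPU Spot) instance. All names, labels, and sizes are hypothetical.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# A pending, unschedulable pod like this is what Karpenter analyzes: its resource
# requests, node selector, and tolerations drive the choice of instance type.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-job"),  # hypothetical
    spec=client.V1PodSpec(
        restart_policy="Never",
        node_selector={"workload-type": "gpu"},
        tolerations=[
            client.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
        ],
        containers=[
            client.V1Container(
                name="train",
                image="my-registry/pytorch-cuda12:latest",  # hypothetical
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)
core.create_namespaced_pod(namespace="ml-user-alice", body=pod)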
Terraform
-Terraform is an IaC tool used to automate the provisioning and management of the entire platform infrastructure, including:
-EKS Cluster & Node Groups (CPU/GPU)
-Networking (VPC, Subnets, Security Groups)
-IAM & IRSA integration for service account permissions
-Helm add-ons: JupyterHub, Ray, NVIDIA Plugin, Karpenter
-Automated setup of IAM roles, IRSA, and resource quotas
Conclusion
This platform brings together multiple open-source and AWS-native components:
-EKS as the foundational layer,
-Karpenter for auto-scaling,
-JupyterHub for multi-tenant access,
-Ray for distributed machine learning workloads,
-and Terraform for automated infrastructure provisioning and management.
References
-Fine-tuning Foundation Models on Amazon EKS for AI/ML Workloads: https://github.com/aws-samples/gen-ai-on-eks
-Ray cluster with examples running on Kubernetes (k3d): https://github.com/tekumara/ray-demo
-Ray on Kubernetes (KubeRay): https://docs.ray.io/en/latest/cluster/kubernetes/index.html
-Start Amazon EKS Cluster with GPUs for KubeRay: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.html
-Launching Ray Clusters on AWS: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/aws.html