AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training with GPUs Anywhere

Alluxio · 28 slides · Aug 30, 2024

About This Presentation

AI/ML Infra Meetup
Aug. 29, 2024
Organized by Alluxio

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Bin Fan (Founding Engineer, VP of Technology @ Alluxio)

In the rapidly evolving landscape of AI and machine learning, infra teams face critical challenges in managing large-scal...


Slide Content

Maximizing GPU Efficiency:
Optimizing Model Training
with GPUs Anywhere
Bin Fan
Founding Engineer, VP of Technology @ Alluxio
Aug 29 2024

About Me

Bin Fan (https://www.linkedin.com/in/bin-fan/)
○ Founding Engineer, VP of Technology @ Alluxio
○ Email: [email protected]
○ Previously worked on Technical Infra at Google; PhD in CS at Carnegie Mellon University

Common ML Platform Architecture

[Diagram: (1) a Data Lake holds the training dataset; (2) Training Infra reads the dataset and writes checkpoints; (3) the Serving platform loads model files.]

Explore Efficient, Scalable I/O for Model Training

Questions:
▪ Possible architectures
▪ How to design an efficient, scalable, distributed cache
  ○ Evolution of the Alluxio architecture
▪ Benchmarks and case studies
  ○ FIO benchmark
  ○ User success stories

Option 1: Connecting to Cloud Storage Directly

[Diagram: Training accesses the Data Lake directly.]

Pros:
● Easy to manage – single source of truth

Cons:
● Slow or inconsistent performance
  ○ “(Service: Amazon S3; Status Code: 503; Error Code: SlowDown …)”
● High cost of accessing cloud storage
  ○ https://arxiv.org/abs/2311.00156 – joint case study by Alluxio, CMU & Uber

Option 2: Adding a High-performance Storage

[Diagram: Data is migrated from the Data Lake into HPC Storage; Training gets fast access from HPC Storage.]

Pros:
● High and consistent I/O performance

Cons:
● Costly infrastructure
● Extra overhead in data migration and maintenance
● Not scalable to multi-region/cloud: infra cost & egress cost / bandwidth limits

Option 2 (cont.): Adding a High-performance Storage

[Diagram: The same approach extended to multiple regions – data is migrated from the Data Lake into separate HPC Storage clusters in us-west-1 and us-east-1, each serving its local Training; the pros and cons are the same as on the previous slide.]

Observation: A Classic Caching Problem

● It's always great to maintain a single source of truth in your data lake

● A data access/caching layer between the different compute engines and the data lake storage meets the IOPS demand, with possible data virtualization

● The cache can be shared across analytics and AI workloads

[Diagram: Training Compute → Access/Caching Layer → Data Lake]
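The caching-layer idea can be sketched as a toy read-through cache. This is a minimal illustration of the pattern, assuming a plain local directory stands in for the data lake; it is not Alluxio's implementation.

```python
import hashlib
import os
import shutil

class ReadThroughCache:
    """Toy read-through cache: serve a file from a local cache
    directory if present, otherwise fetch it once from the "data
    lake" (a plain directory standing in for object storage here)
    and keep a copy. Illustrates the caching-layer pattern only;
    this is not Alluxio's implementation."""

    def __init__(self, lake_dir, cache_dir):
        self.lake_dir = lake_dir
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def _cache_path(self, key):
        # Hash the key so any path-like key maps to a flat cache file.
        name = hashlib.sha256(key.encode()).hexdigest()
        return os.path.join(self.cache_dir, name)

    def read(self, key):
        cached = self._cache_path(key)
        if os.path.exists(cached):
            self.hits += 1                    # served from cache
        else:
            self.misses += 1                  # cache miss: fetch once
            shutil.copyfile(os.path.join(self.lake_dir, key), cached)
        with open(cached, "rb") as f:
            return f.read()
```

Repeated reads of hot training data hit the local copy; only the first read touches the data lake, which is what relieves the IOPS pressure on the single source of truth.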

Option 3: Adding a High-performance Cache

[Diagram: In each region (us-east-1, us-west-1), Training gets fast access with hot data cached in a Distributed Cache, which only retrieves data from the Data Lake on demand.]

Pros:
● High and consistent I/O performance
● Still keeps a single source of truth – no extra cost in data migration and maintenance
● Scalable to multi-region/cloud

Designing a High-performance, Scalable, Distributed Cache for Training Workloads

Alluxio Data Platform
Accelerate data-intensive AI training workloads

Alluxio Technology Journey
Open source, started from UC Berkeley AMPLab in 2014

● 2014 – Started from UC Berkeley AMPLab; explosion of data and the rise of big data & analytics
● 2019 – 1 billion files supported by Alluxio with the 2.0 release; largest deployment of 1000+ nodes by Baidu; 7/10 top Internet companies powered by Alluxio; cloud adoption moves from single to hybrid cloud, multi-cloud, and cross-region
● 2023 – 9/10 top Internet companies powered by Alluxio; 100% of Presto @ Meta fully on-boarded to Alluxio; 80% of AliPay model training and Zhihu LLM model training served by Alluxio; generative AI drives large-scale model training and deployment
● Community: 1000+ open source contributors; 1000+ attendees at the Data Orchestration Summit

Powered by Alluxio

[Logo wall: adopters across Internet, public cloud providers, e-commerce, technology, financial services, telco & media, and other sectors.]

When Alluxio (Tachyon) was born in Berkeley

Early Architecture: Modeled after HDFS

[Diagram: On a compute node, a MapReduce/Spark/Trino client (1) sends a request to the Alluxio cluster's Primary Master (backed by Standby Masters), (2) gets the data location, (3) requests the Worker, and (4) on a cache miss the Worker reads from under storage.]

When Serving ML Training: Different Requirements

● Programming interface: HDFS vs POSIX
● Deployment environment: YARN/bare metal vs K8s
● Data format: structured vs unstructured (audio, picture, video, text)
● Metadata performance: critical for CV/multimodal training (millions to billions of small files)
● I/O concurrency: much higher in training
● Training duration: hours vs days or weeks ⇒ reliability is key
● Fast write (checkpointing): essential

Time to revisit key design choices

New Architecture of Alluxio

[Diagram: A PyTorch client on a training node sends I/O requests; it selects a worker via consistent-hashing-based data partitioning, with a Service Registry for cluster membership; on a cache miss, the Worker reads from under storage.]

Under the hood

● Use consistent hashing to cache both data and metadata on workers
  ○ Reduced I/O RPC length – Performance ++
  ○ No more single point of failure – Reliability ++
  ○ No more performance bottleneck on masters – Performance ++
● Remove the master from the critical path: no more journal
● Many other resource/performance optimizations, e.g., applying zero-copy whenever possible

https://www.alluxio.io/blog/introducing-dora-the-next-generation-alluxio-architecture/
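The consistent-hashing placement described above can be sketched as follows. This is an illustrative ring with virtual nodes, assuming MD5 as the hash function and arbitrary worker names; it is not Alluxio's actual code.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: maps each key (a file path or
    metadata key) to a worker, so adding or removing a worker only
    remaps a small fraction of keys instead of reshuffling the whole
    cache. Illustrative sketch only."""

    def __init__(self, workers, vnodes=100):
        # Place `vnodes` virtual nodes per worker on the ring to
        # smooth out the load distribution.
        self._ring = []                      # sorted (hash, worker) pairs
        for w in workers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{w}#{i}"), w))
        self._ring.sort()

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def worker_for(self, key):
        # Walk clockwise to the first virtual node at or after the
        # key's hash; wrap around at the end of the ring.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
print(ring.worker_for("s3://bucket/train/part-00042.parquet"))
```

Because each key deterministically maps to one worker, any client can locate cached data (and its metadata) without consulting a master, which is what takes the master off the critical path.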

By the numbers

● High scalability
  ○ One worker node supports 50+ million small files
  ○ Scales linearly – easy to support 10 billion files
● High availability
  ○ 99.99% uptime
  ○ No single point of failure
● High performance
  ○ Faster data loading
● Cloud-native K8s Operator and CSI-FUSE for data access management

API Option 1: Alluxio FUSE

● Exposes the Alluxio file system as a local file system
● Access cloud storage just as you would access local storage
  ○ cat, ls
  ○ f = open("a.txt", "r")
● Very low impact on end users

API Option 2: Use the Python Client (alluxiofs)

[Side-by-side code screenshots: existing code vs the same code with alluxiofs.]

Can we further minimize the modification of existing code?

API Option 3 (Experimental): Use the alluxioio Package

● Import a Python package called alluxioio
● No need to modify existing code to use alluxiofs

Benchmark & Case Studies

FIO Benchmark: Sequential Read x Single Client

● Alluxio AI-3.2: achieved a bandwidth of 2081 MiB/s (1 thread) to 8183 MiB/s (32 threads) with a single client, significantly outperforming competitors.
● NAS (J***FS): recorded a bandwidth of 1886 MiB/s (1 thread) to 6207 MiB/s, 9.3% to 24.1% slower than Alluxio 3.2.
● HPC FS (FSx for Lustre): managed a bandwidth of 185 MiB/s (1 thread) to 3992 MiB/s, 91.1% to 51.2% slower than Alluxio 3.2.

Setup:
● Alluxio: 1 Alluxio worker (i3en.metal), 1 Alluxio FUSE client (c5n.metal)
● NAS (J***FS)
● HPC FS (AWS FSx for Lustre, 12TB capacity)

Note: the Alluxio FUSE client co-located with the training servers provides POSIX API access to the Alluxio workers, which actually cache the data.

Alluxio 3.2 shows better performance, particularly in handling concurrent sequential read operations.
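A sequential-read benchmark of this shape could be expressed as an FIO job file. The parameters below are illustrative assumptions (block size, file size, runtime, mount path), not the settings actually used in the benchmark above:

```ini
; seqread.fio -- illustrative sequential-read job (assumed settings)
[global]
rw=read                        ; sequential read
bs=1m                          ; 1 MiB block size
directory=/mnt/alluxio-fuse    ; hypothetical Alluxio FUSE mount point
size=10g
runtime=60
time_based
group_reporting

[seqread-32-threads]
numjobs=32                     ; matches the 32-thread data point
```

Varying numjobs from 1 to 32 sweeps the single-thread and concurrent data points reported on the slide.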

CASE STUDY: High Data Performance AI Platform for Model Training & Inference

TECH BENEFIT:
● Increased GPU utilization in LLM training from ~50% to 93%+
● Increased GPU utilization in Search/Recommendation/Ads training from ~20% to 40%+

BUSINESS BENEFIT:
● 10X faster time-to-production
  - Avoid data copy from the cloud object store to HDFS
  - Start a GPU cluster and Alluxio caching in any cloud with Kubernetes in 10 minutes

[Architecture diagram: training data & checkpoints flow between HDFS and model training both in the cloud and on-prem over 400 Gbps network connections; checkpoints feed model deployment and model inference on an online machine learning platform in the cloud, serving downstream applications.]

Introducing Rapid Alluxio Deployer (RAD) in AWS!

Get started with a fully deployed Alluxio AI cluster with just a few clicks in under 40 minutes!
● Explore the potential performance benefits of Alluxio by running FIO benchmarks
● Simplify the deployment process with preconfigured template clusters
● Maintain full control of your data with Alluxio deployed within your AWS account

Blog with sign-up link and tutorial

Takeaway

● When Compute Resource Scarcity Becomes the Norm, a Distributed Caching Layer Works Well to Enable I/O-Intensive Training
  ○ Having a Single Source of Truth Makes Life Much Easier

● Architectural Changes Are Required to Meet the Requirements of ML Workloads
  ○ Especially for Metadata Performance and Scalability of Data Capacity

Thank You
Any questions?

Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!