AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training with GPUs Anywhere
Alluxio
28 slides
Aug 30, 2024
About This Presentation
AI/ML Infra Meetup
Aug. 29, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (VP of Technology, Founding Engineer @Alluxio)
In the rapidly evolving landscape of AI and machine learning, infra teams face critical challenges in managing large-scale data for AI. Performance bottlenecks, cost inefficiencies, and management complexities weigh heavily on AI platform teams supporting large-scale model training and serving.
In this talk, Bin Fan will discuss the challenges of I/O stalls that lead to suboptimal GPU utilization during model training. He will present a reference architecture for running PyTorch jobs with Alluxio in cloud environments, demonstrating how this approach can significantly enhance GPU efficiency.
What you will learn:
- How to identify GPU utilization and I/O-related performance bottlenecks in model training
- How to leverage GPUs anywhere to maximize resource utilization
- Best practices for monitoring and optimizing GPU usage across training and serving pipelines
- Strategies for reducing cloud costs and simplifying management of AI infrastructure at scale
Size: 10.03 MB
Language: en
Added: Aug 30, 2024
Slides: 28 pages
Slide Content
Maximizing GPU Efficiency: Optimizing Model Training with GPUs Anywhere
Bin Fan
Founding Engineer, VP of Technology @ Alluxio
Aug 29, 2024
About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
○ Founding Engineer, VP of Technology @ Alluxio
○ Email: [email protected]
○ Previously worked on Technical Infrastructure at Google; PhD in CS at Carnegie Mellon University
Common ML Platform Architecture
[Diagram: a data lake stores training datasets, checkpoints, and model files; the training infra reads datasets from the data lake and writes checkpoints back, and the serving platform loads model files]
Explore Efficient, Scalable I/O for Model Training
Questions:
▪ Possible architectures
▪ How to design an efficient, scalable, distributed cache
○ Evolution of the Alluxio architecture
▪ Benchmarks and case studies
○ FIO benchmark
○ User success stories
Option 1: Connecting to Cloud Storage Directly
Pros:
● Easy to manage – single source of truth
Cons:
● Slow or inconsistent performance, e.g. “(Service: Amazon S3; Status Code: 503; Error Code: SlowDown …)”
● High cost of accessing cloud storage (https://arxiv.org/abs/2311.00156 – joint case study by Alluxio, CMU & Uber)
[Diagram: the training cluster has direct access to the data lake]
Option 2: Adding a High-performance Storage
Pros:
● High and consistent I/O performance
Cons:
● Costly infrastructure
● Extra overhead for data migration and maintenance
● Not scalable across multiple regions/clouds: infra cost, egress cost, bandwidth limits
[Diagram: data is migrated from the data lake into HPC storage, which the training cluster accesses at high speed]
[Slide 7 repeats the Option 2 pros and cons with a multi-region diagram: us-west-1 and us-east-1 each run their own HPC storage and training cluster, and each must migrate data from the data lake]
Observation: A Classic Caching Problem
● Itʼs always great to maintain a single source of truth in your data lake
● Add a data access/caching layer between the various compute frameworks and the data lake storage to meet the IOPS demand, with possible data virtualization (a minimal read-through sketch follows this list)
● Share the cache across analytics and AI workloads
[Diagram: an access/caching layer sits between the training compute and the data lake]
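To make the caching-layer pattern concrete, below is a minimal read-through cache sketch in Python. It is an illustration of the general idea, not Alluxio code: the bucket/key names, the local cache directory, and the use of boto3 for data-lake access are all assumptions made for the example.

```python
import hashlib
import os

import boto3  # assumed available; any data-lake client works the same way

CACHE_DIR = "/mnt/local-cache"  # hypothetical local NVMe cache directory
s3 = boto3.client("s3")         # the data lake stays the single source of truth


def _cache_path(bucket: str, key: str) -> str:
    # Hash the object identity into a flat cache filename.
    digest = hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest()
    return os.path.join(CACHE_DIR, digest)


def read_through(bucket: str, key: str) -> bytes:
    """Serve hot data from the local cache; fall back to the data lake on a miss."""
    path = _cache_path(bucket, key)
    if os.path.exists(path):    # cache hit: no cloud round trip, no egress cost
        with open(path, "rb") as f:
            return f.read()
    # Cache miss: retrieve on demand from the data lake, then populate the cache.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    return body


# Example: the first call pays the S3 round trip, repeated reads stay local.
# data = read_through("my-training-bucket", "datasets/shard-0001.tar")
```

A distributed cache generalizes this pattern: the cache spans many nodes, is shared across workloads, and handles eviction, consistency, and metadata.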
Option 3: Adding a High-performance Caching Layer
Pros:
● High and consistent I/O performance
● Still keeps a single source of truth – no extra cost for data migration and maintenance
● Scalable across multiple regions/clouds
[Diagram: a distributed cache in each region (us-east-1, us-west-1) sits between the training clusters and the data lake, serving fast access to hot cached data and retrieving data from the data lake only on demand]
Designing a High-performance, Scalable, Distributed Cache for Training Workloads
Alluxio Data Platform: accelerate data-intensive AI training workloads
Alluxio Technology Journey
Open source, started from the UC Berkeley AMPLab in 2014
[Timeline slide spanning 2014, 2019, and 2023, set against three industry shifts: the explosion of data (rise of big data & analytics), cloud adoption (single to hybrid cloud, multi-cloud, cross-region), and generative AI (large-scale model training and deployment)]
Milestones:
● 1000+ nodes: largest deployment, by Baidu
● 1 billion files supported by Alluxio with the 2.0 release
● 1000+ open-source contributors
● 1000+ attendees at the Data Orchestration Summit
● 7/10, later 9/10, of the top Internet companies powered by Alluxio
● 100% of Presto @ Meta fully on-boarded to Alluxio
● AliPay: 80% of model training; Zhihu: LLM model training served by Alluxio

Powered by Alluxio
[Logo wall of adopters across Internet, public cloud providers, technology, financial services, e-commerce, telco & media, general, and other sectors]
When Alluxio (Tachyon) was born in Berkeley
Early Architecture: Modeled after HDFS
[Diagram: a MapReduce/Spark/Trino client on a compute node (1) sends a request to the Alluxio cluster's primary master (backed by standby masters), (2) gets the block location, (3) requests the worker, and (4) on a cache miss the worker reads from the under storage]
When Serving ML Training: Different Requirements
● Programming interface: HDFS vs. POSIX
● Deployment environment: YARN/bare metal vs. K8s
● Data format: structured vs. unstructured (audio, images, video, text)
● Metadata performance: critical for CV/multimodal training (millions to billions of small files)
● I/O concurrency: much higher in training
● Training duration: hours vs. days or weeks ⇒ reliability is the key
● Fast writes (checkpointing): essential (see the sketch after this list)
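Since fast checkpoint writes are called out as essential, here is a minimal sketch of checkpointing through a POSIX mount. The mount path /mnt/alluxio/checkpoints is a hypothetical example of where such a file system might be exposed, and the model/optimizer are stand-ins.

```python
import os

import torch
import torch.nn as nn

# Hypothetical POSIX mount point exposed by the caching layer.
CKPT_DIR = "/mnt/alluxio/checkpoints"

model = nn.Linear(1024, 10)                              # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # stand-in optimizer


def save_checkpoint(step: int) -> str:
    """Write model + optimizer state through the mounted file system."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    path = os.path.join(CKPT_DIR, f"step_{step:08d}.pt")
    # torch.save only needs a writable path, so an ordinary POSIX mount suffices;
    # the layer behind the mount is responsible for persisting the bytes.
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )
    return path


# Example: checkpoint periodically inside the training loop.
# if step % 1000 == 0:
#     save_checkpoint(step)
```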
Time to revisit key design choices
New Architecture of Alluxio
[Diagram: a PyTorch job on a training node issues I/O requests and selects a worker via consistent-hashing-based data partitioning; workers are discovered through a service registry, and on a cache miss the worker reads from the under storage]
Under the hood
● Use consistent hashing to cache both data and metadata on workers (a minimal sketch follows this list)
○ Shorter I/O RPC path: performance ++
○ No more single point of failure: reliability ++
○ No more performance bottleneck on masters: performance ++
● Remove the master from the critical path: no more journal
● Many other resource/performance optimizations, e.g. applying zero-copy whenever possible
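As a rough illustration of the consistent-hashing point above, here is a minimal hash-ring sketch in Python; it is not Alluxio's implementation, and the worker names and virtual-node count are arbitrary choices for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 128-bit hash of a string key.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Map file paths to workers so each worker owns a stable slice of the namespace."""

    def __init__(self, workers, virtual_nodes: int = 128):
        # Place several virtual nodes per worker to even out the load.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]

    def worker_for(self, path: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._points, _hash(path)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
print(ring.worker_for("s3://bucket/train/shard-0001.tar"))
# Adding or removing a worker only remaps keys near its virtual nodes, so cached
# data and metadata on the remaining workers stay where they are.
```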
● High scalability
○ One worker node supports 50+ million small files
○ Scales linearly – easy to support 10 billion files
● High availability
○ 99.99% uptime
○ No single point of failure
● High performance
○ Faster data loading
● Cloud-native: K8s Operator and CSI-FUSE for data access management
API Option 1: Alluxio FUSE
● Exposes the Alluxio file system as a local file system
● Access cloud storage just as you would access local storage:
○ cat, ls
○ f = open(“a.txt”, “r”)
● Very low impact on end users (a short sketch follows this list)
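A short sketch of what “very low impact” looks like in user code: because the data appears as ordinary files under a mount point, a standard PyTorch Dataset needs no storage-specific SDK. The mount path /mnt/alluxio-fuse and the dataset layout are hypothetical.

```python
from pathlib import Path

from torch.utils.data import Dataset

# Hypothetical FUSE mount point; reads behind it are served from the cache and
# fall back to cloud storage on a miss.
DATA_ROOT = Path("/mnt/alluxio-fuse/datasets/captions/train")


class MountedTextDataset(Dataset):
    """Plain file I/O against the mount, exactly as with a local directory."""

    def __init__(self, root: Path):
        self.files = sorted(root.glob("*.txt"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        with open(self.files[idx], "r") as f:
            return f.read()


# ds = MountedTextDataset(DATA_ROOT)
# print(len(ds))
```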
API Option 2: Use the Python Client (alluxiofs)
[Slide compares existing code with the same code rewritten to use alluxiofs; a hedged sketch of the pattern follows]
Can we further minimize the modification of existing code?
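The code screenshots on this slide did not survive extraction, so the following is only a hedged sketch of the pattern the slide describes: alluxiofs plugs into fsspec, so existing fsspec-style code mostly changes where the filesystem is constructed. The constructor keyword arguments below (etcd_hosts, target_protocol) are assumptions that may differ by version; treat the alluxiofs documentation as authoritative.

```python
import fsspec

from alluxiofs import AlluxioFileSystem  # assumed installed via `pip install alluxiofs`

# Register the Alluxio-backed implementation under an fsspec protocol name.
fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)

# Existing code typically constructed e.g. fsspec.filesystem("s3"); with alluxiofs
# only this construction changes. The keyword arguments are illustrative
# assumptions (how workers are discovered, and which storage sits underneath).
fs = fsspec.filesystem(
    "alluxiofs",
    etcd_hosts="localhost",
    target_protocol="s3",
)

# The rest of the code stays fsspec-style: listing and reading look unchanged.
print(fs.ls("s3://my-training-bucket/datasets/"))
with fs.open("s3://my-training-bucket/datasets/shard-0001.tar", "rb") as f:
    header = f.read(1024)
```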
API Option 3 (Experimental): Use the alluxioio Package
● Import a Python package called alluxioio
● Unlike alluxiofs, no need to modify existing code (a conceptual sketch follows this list)
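alluxioio's actual API is not shown in the extracted slides and the package is experimental, so the following is only a conceptual sketch of the interception idea it implies (redirecting built-in file calls so unmodified code hits the cache), not the real alluxioio implementation; the prefix-rewriting rule and paths are invented for illustration.

```python
import builtins

_original_open = builtins.open

# Invented-for-illustration rule: rewrite object-store paths to a hypothetical
# FUSE mount so unmodified user code transparently reads through the cache.
REMOTE_PREFIX = "s3://my-training-bucket/"
MOUNT_PREFIX = "/mnt/alluxio-fuse/my-training-bucket/"


def _patched_open(file, *args, **kwargs):
    if isinstance(file, str) and file.startswith(REMOTE_PREFIX):
        file = MOUNT_PREFIX + file[len(REMOTE_PREFIX):]
    return _original_open(file, *args, **kwargs)


def install():
    """Swap the built-in open so existing code needs no changes."""
    builtins.open = _patched_open


# After install(), existing code keeps using plain open():
# install()
# with open("s3://my-training-bucket/datasets/a.txt", "r") as f:
#     print(f.read())
```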
Benchmark & Case Studies

FIO Benchmark: Sequential Read x Single Client
● Alluxio AI-3.2: achieved a bandwidth of 2081 MiB/s (1 thread) to 8183 MiB/s (32 threads) with a single client, significantly outperforming competitors.
● NAS (J***FS): recorded a bandwidth of 1886 MiB/s (1 thread) to 6207 MiB/s (32 threads), i.e. 9.3% to 24.1% slower than Alluxio 3.2.
● HPC FS (FSx for Lustre): managed a bandwidth of 185 MiB/s (1 thread) to 3992 MiB/s (32 threads), i.e. 91.1% to 51.2% slower than Alluxio 3.2.
Setup:
● Alluxio: 1 Alluxio worker (i3en.metal), 1 Alluxio FUSE client (c5n.metal)
● NAS (J***FS)
● HPC FS (AWS FSx for Lustre, 12 TB capacity)
Note: the Alluxio FUSE client, co-located with the training servers, is responsible for POSIX API access to the Alluxio workers, which actually cache the data.

Alluxio 3.2 shows better performance, particularly in handling concurrent sequential read operations.
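FIO produced the published numbers above; as a rough, unofficial stand-in, the snippet below measures single-client sequential read bandwidth from a mounted path so you can sanity-check a setup. The mount path and file name are hypothetical, and it does not reproduce FIO's methodology (direct I/O, multiple jobs, queue depths).

```python
import time

# Hypothetical large file sitting behind the FUSE mount.
PATH = "/mnt/alluxio-fuse/bench/10g.bin"
BLOCK_SIZE = 1 << 20  # 1 MiB sequential reads


def sequential_read_bandwidth(path: str) -> float:
    """Return MiB/s for one pass of sequential 1 MiB reads over the file."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return (total_bytes / (1 << 20)) / elapsed


# print(f"{sequential_read_bandwidth(PATH):.0f} MiB/s")
```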
CASE STUDY: High data performance AI platform for model training & inference
TECH BENEFIT:
● Increase GPU utilization in LLM training from ~50% to 93%+
● Increase GPU utilization in Search/Recommendation/Ads training from ~20% to 40%+
BUSINESS BENEFIT:
● 10X faster time-to-production
● Avoid data copies from the cloud object store to HDFS
● Start a GPU cluster and Alluxio caching in any cloud with Kubernetes in 10 minutes
[Diagram: model training runs both in the cloud and on-prem over 400 Gbps network connections; HDFS holds the training data and checkpoints; trained models flow to model deployment and model inference on an online machine learning platform in the cloud, serving downstream applications]
Introducing Rapid Alluxio Deployer (RAD) in AWS!
Get started with a fully deployed Alluxio AI cluster with just a few clicks in under 40 minutes!
● Explore the potential performance benefits of Alluxio by running FIO benchmarks
● Simplify the deployment process with preconfigured template clusters
● Maintain full control of your data with Alluxio deployed within your AWS account
Blog with sign-up link and tutorial
Takeaway
● When compute resource scarcity becomes the norm, a distributed caching layer works well to enable I/O-intensive training
○ Having a single source of truth makes life much easier
● Architectural changes are required to meet the requirements of ML workloads
○ Especially for metadata performance and scalability of data capacity
Thank You
Any questions?
Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!