AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & Serving
Alluxio
About This Presentation
AI/ML Infra Meetup
May 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Lu Qiu (Data & AI Platform Tech Lead, @Alluxio)
- Siyuan Sheng (Senior Software Engineer, @Alluxio)
Speed and efficiency are two requirements for the underlying infrastructure for machine learning model development. Data access can bottleneck end-to-end machine learning pipelines as training data volume grows and when large model files are more commonly used for serving. For instance, data loading can constitute nearly 80% of the total model training time, resulting in less than 30% GPU utilization. Also, loading large model files for deployment to production can be slow because of slow network or storage read operations. These challenges are prevalent when using popular frameworks like PyTorch, Ray, or HuggingFace, paired with cloud object storage solutions like S3 or GCS, or downloading models from the HuggingFace model hub.
In this presentation, Lu and Siyuan will offer comprehensive insights into improving speed and GPU utilization for model training and serving. You will learn:
- The data loading challenges hindering GPU utilization
- The reference architecture for running PyTorch and Ray jobs while reading data from S3, with benchmark results of training ResNet50 and BERT
- Real-world examples of boosting model performance and GPU utilization through optimized data access
Slide Content
Improve Speed and GPU Utilization for Model Training & Serving
Lu Qiu, Siyuan Sheng

Lu Qiu
AI Platform Tech Lead & Open Source PMC Maintainer @ Alluxio
linkedin.com/in/luqiu-ai

Siyuan Sheng
Senior Software Engineer @ Alluxio
www.linkedin.com/in/siyuan-sheng
Open Source Started From UC Berkeley AMPLab
● 1000+ contributors & growing
● 4000+ GitHub stars
● Apache 2.0 licensed
● Million+ downloads; among GitHub's Top 100 Most Valuable Repositories out of 96 million
● #9 among the most critical open source Java projects (Google OpenSSF)
● Join the conversation on Slack: slackin.alluxio.io
Alluxio Data Platform
High Performance data access, unified global view
Companies Using Alluxio
[Logo wall spanning internet, public cloud providers, e-commerce, technology, financial services, telco & media, and other sectors.]
Inefficiencies in AI infrastructure
AI is all about getting information from data, yet:
● Prolonged AI model lifecycle
● Underutilized GPUs: the GPU is waiting for data to be ready
Waiting for Data to be Ready for AI
[Timeline, t=0 to t=6: data loading and preprocessing run before training while the GPU sits idle; low GPU utilization rate + long model lifecycle.]
Ray/PyTorch: Streamlined Operations
[Timeline, t=0 to t=6: data loading, preprocessing, and training overlap, shrinking GPU idle time; increased GPU utilization + faster model lifecycle.]
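The overlap shown above is what Ray/PyTorch data pipelines provide out of the box: CPU workers load and preprocess upcoming batches while the GPU trains on the current one. A minimal PyTorch sketch of that pattern (the synthetic dataset and model are placeholders, not from the slides):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model; the point is the loader configuration.
dataset = TensorDataset(torch.randn(1_000, 3, 64, 64), torch.randint(0, 10, (1_000,)))
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 10)).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,        # CPU workers load/preprocess batches in parallel
    prefetch_factor=2,    # each worker keeps 2 batches ready ahead of the GPU
    pin_memory=True,      # enables faster, asynchronous host-to-GPU copies
)

for x, y in loader:
    # While this step runs on the GPU, the workers are already preparing the next batches.
    loss = torch.nn.functional.cross_entropy(model(x.cuda(non_blocking=True)), y.cuda())
    opt.zero_grad()
    loss.backward()
    opt.step()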
Data Loading Bottleneck
● Separation of compute and storage
● Large data volumes
● Crowded networks
● Slow data transfer
● Storage request rate limits or outages

Data Loading Becomes the Bottleneck
[Timeline, t=0 to t=6: data loading dominates while the GPU sits idle; low GPU utilization rate + long model lifecycle.]
While using Ray/PyTorch for training, the performance & cost implications are (a sketch of this uncached pattern follows the list):
● You might load the entire dataset again and again for each epoch
● You cannot automatically cache the hottest data across multiple training jobs
● You might suffer a cold start every time
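To make the first bullet concrete, here is a minimal sketch (not from the slides) of the uncached access pattern: a PyTorch dataset that reads every sample directly from S3 via s3fs, so each epoch repeats all of the previous epoch's network I/O and nothing is shared across jobs. Bucket and prefix names are hypothetical.

import io

import numpy as np
import s3fs                      # pip install s3fs
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset


class S3ImageDataset(Dataset):
    """Reads every sample straight from S3 on each access; no shared cache layer."""

    def __init__(self, prefix: str):
        self.fs = s3fs.S3FileSystem()
        self.keys = self.fs.ls(prefix)   # one object per image

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        # Each call is a fresh network read; epoch N repeats all of epoch N-1's I/O.
        with self.fs.open(self.keys[idx], "rb") as f:
            img = Image.open(io.BytesIO(f.read())).convert("RGB").resize((224, 224))
        return torch.from_numpy(np.array(img))


# Hypothetical prefix; with no cache, 3 epochs means 3 full passes over S3,
# and every new training job starts cold.
loader = DataLoader(S3ImageDataset("my-bucket/imagenet/train"), batch_size=64)
for epoch in range(3):
    for batch in loader:
        pass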
Data Loading Bottleneck for Each Epoch
[Timeline, t=0 to t=12, spanning EPOCH 0 and EPOCH 1: the GPU goes idle again for data loading at the start of each epoch.]
Speed up data loading & preprocessing for AI training
[Architecture diagram: Ray runs data preprocessing on CPU machines and training on GPU machines, with Alluxio as a distributed caching layer between Ray and storage. Benefits called out: reduce data transfer & storage cost (data), speed up data loading & preprocessing, and increase GPU utilization rate (compute).]
Ray + Alluxio: Speed up data loading
[Diagram: Ray reads from Alluxio for most accesses ("faster is better", often) and reaches back to storage only when necessary, since storage reads are time consuming.]
Ray in I/O bottlenecked workloads
[Timeline, t=0 to t=6: data loading and preprocessing leave the GPU idle; low GPU utilization rate + long model lifecycle.]
Ray + Alluxio: Speed up data loading
[Timeline, t=0 to t=6: data is loaded from Alluxio before preprocessing and training, shrinking GPU idle time; increased GPU utilization + faster model lifecycle.]
Alluxio Design
High Scalability
● Cache 10 billion+ objects with an architecture that scales out horizontally without single-node dependency
Performance
● Single-node storage with 50+ million objects per node
● Workload-specific optimizations for ML training & inference
  ○ Low latency: < 1 ms
  ○ High throughput: hundreds of GB/s per Alluxio cluster
Stability and Reliability
● Automatic fallback to data lake storage to mask failures due to capacity or other reasons (the pattern is sketched below)
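The last bullet describes fallback that happens transparently inside the Alluxio client; the sketch below only illustrates that pattern with two generic fsspec filesystems and hypothetical names, not Alluxio's actual code path.

import fsspec


def read_with_fallback(path: str, cache_fs, origin_fs) -> bytes:
    # Prefer the cache tier; fall back to the data lake if the cache fails
    # (capacity, outage, etc.). Alluxio does this transparently in its client.
    try:
        with cache_fs.open(path, "rb") as f:
            return f.read()
    except Exception:
        with origin_fs.open(path, "rb") as f:
            return f.read()


# Hypothetical wiring: a local filesystem stands in for the cache tier,
# S3 is the underlying data lake.
cache_fs = fsspec.filesystem("file")
origin_fs = fsspec.filesystem("s3")
data = read_with_fallback("my-bucket/train/part-00000.parquet", cache_fs, origin_fs)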
Using Alluxiofs instead of S3fs
# Ray reads data from Alluxio using the original S3 URL
ds = ray.data.read_images(
    "s3://ai-ref-arch/imagenet-full/train",
    filesystem=alluxio,
)
See more in: https://github.com/fsspec/alluxiofs
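The snippet above assumes an alluxio filesystem object already exists. A minimal sketch of how it might be constructed with the alluxiofs fsspec implementation follows; the constructor arguments (ETCD endpoint, underlying protocol) are assumptions to check against the fsspec/alluxiofs README.

import fsspec
import ray
from alluxiofs import AlluxioFileSystem   # pip install alluxiofs

# Register the Alluxio fsspec implementation under the "alluxiofs" protocol.
fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)

# Point the client at the Alluxio cluster's ETCD membership service and tell it
# the underlying store is S3 (argument names assumed from the alluxiofs docs).
alluxio = fsspec.filesystem(
    "alluxiofs",
    etcd_hosts="localhost",
    etcd_port=2379,
    target_protocol="s3",
)

# Same S3 URL as before, now served through the Alluxio cache.
ds = ray.data.read_images(
    "s3://ai-ref-arch/imagenet-full/train",
    filesystem=alluxio,
)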
Cloud Native Distributed Caching System
[Architecture diagram: the Alluxio FSSpec client (layered over s3fs / gcsfs / huggingfacefs, …) plus the Alluxio system cache. Reads and writes are routed to Alluxio Worker 0/1/2 via worker selection on a consistent hash ring, with an ETCD membership service tracking workers; each worker holds a local cache. Stateless, easy scaling, fault tolerant & highly available.]
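The worker-selection step in the diagram can be illustrated with a generic consistent hash ring: each worker is placed at many virtual points on the ring, and a file path maps to the first worker clockwise from its hash, so cache ownership shifts only minimally when workers join or leave. This is a sketch of the general technique, not Alluxio's implementation.

import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Generic consistent-hash worker selection (illustrative only)."""

    def __init__(self, workers, virtual_nodes=100):
        # Each worker gets many virtual points so load spreads evenly.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(virtual_nodes)
        )
        self._points = [p for p, _ in self._ring]

    def pick(self, path: str) -> str:
        # The first point clockwise from the path's hash owns the path.
        idx = bisect.bisect(self._points, _hash(path)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["alluxio-worker-0", "alluxio-worker-1", "alluxio-worker-2"])
print(ring.pick("s3://ai-ref-arch/imagenet-full/train/part-00001.parquet"))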
Ray + Alluxio + Parquet - Multi-node
● Comparison:
  ○ Ray + same-region S3
  ○ Ray + Alluxio + same-region S3
● Dataset: 200 MiB files adding up to 60 GiB, images in Parquet format
● Script: Ray nightly multi-node train benchmark, 28 train workers
[Chart comparing the runs with Alluxio and without Alluxio.]
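At the data-loading level, the two benchmark configurations might look like the sketch below: the same ray.data.read_parquet call, with and without the Alluxio fsspec filesystem in front of the same-region S3 bucket. The bucket path and filesystem construction are assumptions, not the exact benchmark script.

import fsspec
import ray
from alluxiofs import AlluxioFileSystem   # assumed setup, see the earlier sketch

DATASET = "s3://ai-ref-arch/imagenet-parquet/train"   # hypothetical bucket path

# Baseline: Ray reads the Parquet images straight from same-region S3.
baseline_ds = ray.data.read_parquet(DATASET)

# With Alluxio: the identical call, but reads are served from the distributed cache.
fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)
alluxio = fsspec.filesystem("alluxiofs", etcd_hosts="localhost", target_protocol="s3")
cached_ds = ray.data.read_parquet(DATASET, filesystem=alluxio)

# Both datasets then feed the same multi-node training loop (the Ray nightly
# train benchmark with 28 train workers); only the I/O path differs.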
Social Media
● Twitter: Twitter.com/alluxio
● LinkedIn: Linkedin.com/alluxio
● Website: www.alluxio.io
● Slack: https://alluxio.io/slack
● GitHub: https://github.com/Alluxio

Lu Qiu: www.linkedin.com/in/luqiu-ai
Siyuan Sheng: www.linkedin.com/in/siyuan-sheng
Scan the QR code to access resources.