AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & Serving


About This Presentation

AI/ML Infra Meetup
May 23, 2024
Organized by Alluxio

For more Alluxio Events: https://www.alluxio.io/events/

Speakers:
- Lu Qiu (Data & AI Platform Tech Lead, @Alluxio)
- Siyuan Sheng (Senior Software Engineer, @Alluxio)

Speed and efficiency are two requirements for the underlying infrastruc...


Slide Content

Improve Speed and GPU Utilization for Model Training & Serving
Lu Qiu, Siyuan Sheng

Lu Qiu
AI Platform Tech Lead & Open Source PMC Maintainer @ Alluxio
linkedin.com/in/luqiu-ai

Siyuan Sheng
Senior Software Engineer @ Alluxio
www.linkedin.com/in/siyuan-sheng

Open Source Started From UC Berkeley AMPLab
●1000+ contributors & growing
●4000+ GitHub stars
●Apache 2.0 licensed
●Million+ downloads
●Among GitHub's Top 100 Most Valuable Repositories out of 96 million
●#9 most critical open source Java project (Google OpenSSF)
●Join the conversation on Slack: slackin.alluxio.io
Alluxio Data Platform
High-performance data access with a unified global view

Companies Using Alluxio
[Logo wall: technology, internet, public cloud providers, e-commerce, financial services, telco & media, and other sectors.]

Inefficiencies in AI Infrastructure
●Prolonged AI model lifecycle
●Underutilized GPUs
The GPU is waiting for data to be ready.

AI is all about getting information from data.

Waiting for Data to be Ready for AI
[Timeline t=0 to t=6: data loading, preprocessing, and training run sequentially; the GPU sits idle while data is loaded and preprocessed.]
Low GPU utilization rate + long model lifecycle

Ray/PyTorch: Streamlined Operations
[Timeline t=0 to t=6: data loading, preprocessing, and training are pipelined, so GPU idle time shrinks.]
Increased GPU utilization + faster model lifecycle

Data Loading Bottleneck
●Separation of compute and storage
●Large data volumes
●Crowded networks
●Slow data transfer
●Storage request rate limits or outages

Data Loading Becomes the Bottleneck
[Timeline t=0 to t=6: even with streamlined operations, the GPU still idles while data loading runs.]
Low GPU utilization rate + long model lifecycle

Data Loading Bottleneck
While using Ray/PyTorch for training, the performance and cost implications are:
●You might load the entire dataset again and again for each epoch
●You cannot cache the hottest data across multiple training jobs automatically
●You might suffer from a cold start every time (see the sketch below)
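As a hedged illustration of those bullets (not code from the talk), the sketch below assumes a plain Ray Data pipeline reading straight from S3, with a hypothetical bucket path; each epoch re-executes the read, so nothing is cached across epochs or jobs and every pass is a cold start.

import ray

# Illustrative only: the dataset is re-read from S3 on every epoch,
# so each epoch pays the full cold-start data-loading cost and no
# cache is shared with other training jobs.
ray.init()
for epoch in range(3):
    ds = ray.data.read_images("s3://my-training-bucket/imagenet-full/train")  # hypothetical path
    for batch in ds.iter_batches(batch_size=256):
        pass  # preprocessing + GPU training step would go here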

Data Loading Bottleneck for Each Epoch
[Timeline t=0 to t=12: the load-preprocess-train sequence repeats for epoch 0 and epoch 1, with the GPU idle at the start of each epoch.]

Speed Up Data Loading & Preprocessing for AI Training
[Architecture: Ray runs data preprocessing on CPU machines and training on GPU machines; Alluxio sits between Ray and storage as a distributed caching layer.]

Ray + Alluxio: Speed Up Data Loading
●Compute (Ray): increase GPU utilization rate
●Data (Alluxio): speed up data loading & preprocessing
●Storage: reduce data transfer & storage cost


Ray in I/O Bottlenecked Workloads
[Timeline t=0 to t=6: data loading, preprocessing, and training; the GPU idles while I/O dominates.]
Low GPU utilization rate + long model lifecycle

Ray + Alluxio: Speed Up Data Loading
[Timeline t=0 to t=6: with data loaded from Alluxio, loading, preprocessing, and training overlap and GPU idle time nearly disappears.]
Increased GPU utilization + faster model lifecycle

Alluxio Design
High Scalability
●Cache 10 billion+ objects with an architecture that scales out horizontally, with no single-node dependency
Performance
●Single-node storage with 50+ million objects per node
●Workload-specific optimizations for ML training & inference
○Low latency: < 1 ms
○High throughput: hundreds of GB/s per Alluxio cluster
Stability and Reliability
●Automatic fallback to data lake storage to mask any failures due to capacity or other reasons (sketched below)
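A minimal sketch of what such a fallback could look like from the client side, assuming alluxiofs and s3fs are installed; this illustrates the idea only and is not Alluxio's actual implementation, and the etcd host is a placeholder.

import fsspec
from alluxiofs import AlluxioFileSystem

def read_with_fallback(path: str) -> bytes:
    """Try the Alluxio cache first; fall back to the underlying S3
    data lake if the cache read fails for any reason."""
    try:
        # Hypothetical configuration: etcd host used for worker membership.
        cache_fs = AlluxioFileSystem(etcd_host="etcd-host", target_protocol="s3")
        return cache_fs.cat_file(path)
    except Exception:
        # Fallback: read directly from S3 (requires s3fs).
        return fsspec.filesystem("s3").cat_file(path)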

Ray + Alluxio Fsspec: Easy Usage
# Import ray, fsspec & the alluxio fsspec implementation
import ray
import fsspec
import s3fs
from alluxiofs import AlluxioFileSystem

# Create the Alluxio filesystem in front of S3
# (host points at the etcd service used for worker membership)
alluxio = AlluxioFileSystem(etcd_host=host, target_protocol="s3")

# Ray reads data from Alluxio using the original S3 URL
ds = ray.data.read_images("s3://ai-ref-arch/imagenet-full/train",
                          filesystem=alluxio)

See more in: https://github.com/fsspec/alluxiofs

Using alluxiofs instead of s3fs; the original S3 URL stays unchanged (baseline sketched below).
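For comparison, a baseline sketch without Alluxio would hand Ray an s3fs filesystem (or no filesystem at all) for the same URL; this assumes Ray accepts the fsspec filesystem the same way it accepts the Alluxio one above.

import ray
import s3fs

# Baseline: read directly from S3 with s3fs; every read goes over the
# network to object storage, with no distributed cache in between.
s3 = s3fs.S3FileSystem()
ds = ray.data.read_images("s3://ai-ref-arch/imagenet-full/train",
                          filesystem=s3)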
Cloud Native Distributed Caching System
[Architecture: the Alluxio fsspec client selects a worker via a consistent hash ring (sketched below); ETCD provides the membership service tracking Alluxio Worker 0/1/2; reads and writes hit each worker's local cache, backed by s3fs / gcsfs / huggingfacefs and other under-storage filesystems.]
Alluxio FSSpec + Alluxio system cache: stateless, easy scaling, fault tolerant & highly available.
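A self-contained sketch of consistent-hash worker selection as described above; this is an illustration rather than Alluxio's implementation, and the worker names and virtual-node count are arbitrary.

import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps each path to a cache worker; adding or removing a worker
    only remaps a small fraction of paths."""
    def __init__(self, workers, virtual_nodes=100):
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(virtual_nodes)
        )
        self._keys = [h for h, _ in self._ring]

    def worker_for(self, path: str) -> str:
        idx = bisect.bisect(self._keys, _hash(path)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["alluxio-worker-0", "alluxio-worker-1", "alluxio-worker-2"])
print(ring.worker_for("s3://ai-ref-arch/imagenet-full/train/img_0001.jpg"))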

Ray + Alluxio + Parquet (Multi-node)
●Comparison
○Ray + same-region S3
○Ray + Alluxio + same-region S3
●Dataset
○200 MiB files, totaling 60 GiB
○Images in Parquet format
●Script
○Ray nightly multi-node training benchmark (loading sketched below)
○28 train workers
[Chart: training throughput with Alluxio vs. without Alluxio]
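A minimal sketch of how the Parquet dataset could be loaded through Alluxio in such a run, following the read_images example earlier; the bucket path and etcd host below are placeholders, not the benchmark's actual configuration.

import ray
from alluxiofs import AlluxioFileSystem

# Cache-backed filesystem in front of same-region S3
# ("etcd-host" is the address of the etcd membership service).
alluxio = AlluxioFileSystem(etcd_host="etcd-host", target_protocol="s3")

# Read the image dataset stored as Parquet files through the Alluxio cache.
ds = ray.data.read_parquet("s3://ai-ref-arch/imagenet-parquet/train",
                           filesystem=alluxio)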

Website: www.alluxio.io
Slack: https://alluxio.io/slack
GitHub: https://github.com/Alluxio
Social media: twitter.com/alluxio, linkedin.com/alluxio
Lu Qiu: www.linkedin.com/in/luqiu-ai
Siyuan Sheng: www.linkedin.com/in/siyuan-sheng