Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Workloads
Alluxio
537 views
19 slides
Sep 10, 2024
About This Presentation
Alluxio Webinar
Sept. 10, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jingwen Ouyang (Senior Program Manager, Alluxio)
As machine learning and deep learning models grow in complexity, AI platform engineers and ML engineers face significant challenges with slow data loading and GPU utilization, often leading to costly investments in high-performance computing (HPC) storage. However, this approach can result in overspending without addressing the core issues of data bottlenecks and infrastructure complexity.
A better approach is to add a data caching layer between compute and storage. Alluxio offers a cost-effective alternative through its innovative data caching strategy. In this webinar, Jingwen will explore how Alluxio's caching solutions optimize AI workloads for performance, user experience, and cost-effectiveness.
What you will learn:
- The I/O bottlenecks that slow down data loading in model training
- How Alluxio's data caching strategy optimizes I/O performance for training and GPU utilization, and significantly reduces cloud API costs
- The architecture and key capabilities of Alluxio
- Using Rapid Alluxio Deployer to install Alluxio and run benchmarks in AWS in just 30 minutes
Slide Content
Optimize, Don't Overspend:
Data Caching Strategy for AI Workloads
September 2024
Alluxio makes it easy to share and manage data from any storage to any compute engine in any environment, with high performance and low cost.
Open source, started from UC Berkeley AMPLab in 2014.
Join the conversation on Slack: ALLUXIO.IO/SLACK
●1,200+ contributors & growing
●10,000+ Slack community members
●Top 10 most critical Java-based open source project
●Top 100 most valuable repositories out of 96 million on GitHub
Case Studies: customers across telco & media, e-commerce, financial services, tech & internet, and other industries, including Zhihu.
Leverage GPUs Anywhere: run AI workloads wherever GPUs are available, without data locality concerns.
Alluxio AI Offering
Critical infrastructure barriers to effective AI/ML adoption: low performance and cost management.
Alluxio addresses these with high-performance caching for model training & distribution.
I/O Performance for AI Training and GPU Utilization
1. HPC Performance on Existing Data Lakes: achieve up to 8 GB/s throughput & 200K IOPS for a single client. Improvements compared to 2.x: 35% for hot sequential reads, 20x for hot random reads, 4x for cold reads.
2. GPU Saturation: fully saturate 8 A100 GPUs, showing over 97% GPU utilization in MLPerf Storage language-processing benchmarks. Customer production data show GPU utilization improving from 40% to 60% for search/recommendation models and from 50% to 95% for LLMs.
3. Checkpoint Optimization: new checkpoint read/write support optimizes training with write caching capabilities.
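To make the checkpointing point concrete, here is a minimal sketch of what write-cached checkpointing looks like from the training side: the job simply writes checkpoints to a POSIX path backed by Alluxio. The mount point and file names are hypothetical, and this is ordinary PyTorch code rather than an Alluxio-specific API.

```python
import torch
import torch.nn as nn

# Hypothetical Alluxio FUSE mount point exposing the checkpoint directory;
# the caching layer absorbs the write and persists it to the under storage.
CKPT_PATH = "/mnt/alluxio-fuse/checkpoints/step_001000.pt"

model = nn.Linear(1024, 1024)              # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters())

# Writing through the FUSE mount is a plain file write from PyTorch's point
# of view; the cache underneath is what makes it fast.
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    CKPT_PATH,
)

# Restoring later reads back through the same cached path.
state = torch.load(CKPT_PATH, map_location="cpu")
model.load_state_dict(state["model"])
```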
Comparison against other vendors | FIO - Sequential Read
Setup:
●Alluxio: 1 Alluxio worker (i3en.metal), 1 Alluxio FUSE client (c5n.metal)
●AWS FSx for Lustre (12TB capacity)
●JuiceFS (SaaS)
Results:
●Alluxio 3.2: achieved a bandwidth of 2081 MiB/s (1 thread) to 8183 MiB/s (32 threads) with a single client, significantly outperforming competitors.
●JuiceFS: recorded a bandwidth of 1886 MiB/s (1 thread) to 6207 MiB/s, 9.3% to 24.1% slower than Alluxio 3.2.
●FSx for Lustre: managed a bandwidth of 185 MiB/s (1 thread) to 3992 MiB/s, 91.1% to 51.2% slower than Alluxio 3.2.
●Observations: Alluxio 3.2 shows better performance, particularly in handling sequential read operations efficiently.
Note: the Alluxio FUSE client, co-located with the training servers, provides POSIX API access to the Alluxio workers, which actually cache the data.
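The FIO job above measures raw sequential-read bandwidth through the FUSE mount. As a rough illustration of that access pattern, here is a minimal Python sketch that reads a file front to back and reports throughput; the mount point and file path are hypothetical, and this is only a simplified stand-in for FIO, not the actual benchmark configuration.

```python
import os
import time

# Hypothetical Alluxio FUSE mount point and data file; adjust to your deployment.
MOUNT = "/mnt/alluxio-fuse"
PATH = os.path.join(MOUNT, "dataset/part-0000.bin")
BLOCK = 1 << 20  # 1 MiB reads, a typical large sequential-read block size

def sequential_read_throughput(path: str) -> float:
    """Read the whole file front to back and return MiB/s."""
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(BLOCK)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.monotonic() - start
    return (total / (1 << 20)) / elapsed

if __name__ == "__main__":
    print(f"sequential read: {sequential_read_throughput(PATH):.1f} MiB/s")
```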
Comparison against other vendors | MLPerf Storage
Setup:
●Alluxio: 1 FUSE client (c6in.metal), 2 workers (i3en.metal)
Note: DDN with 12 GPUs and Weka with 20 GPUs are the available data points published on the MLPerf website.
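MLPerf Storage measures how well a storage system keeps accelerators busy during training-style reads. From the training job's perspective, the cached data is just files on a POSIX path, so a standard data loader works unchanged. The sketch below assumes a hypothetical Alluxio FUSE mount and a made-up dataset layout; it is illustrative, not the MLPerf Storage harness itself.

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical path: training data exposed through an Alluxio FUSE mount,
# so the training job reads it with ordinary POSIX file I/O.
DATA_DIR = "/mnt/alluxio-fuse/train"

class CachedFileDataset(Dataset):
    """Loads raw samples from the FUSE mount; cache workers serve hot reads."""

    def __init__(self, data_dir: str):
        self.files = sorted(
            os.path.join(data_dir, f) for f in os.listdir(data_dir)
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        with open(self.files[idx], "rb") as f:
            raw = f.read()
        # Decode/transform as needed; here we just return the byte count as a stub.
        return torch.tensor(len(raw))

loader = DataLoader(
    CachedFileDataset(DATA_DIR),
    batch_size=32,
    num_workers=8,      # parallel readers help keep the GPUs fed
    pin_memory=True,
)

for batch in loader:
    pass  # the training step would go here
```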
New Architecture
Scalability: in the old architecture the master was the bottleneck; the new architecture offers unlimited scalability, supporting tens of billions of small files with a single Alluxio cluster.
Architecture diagram: AI/analytics applications on the training node exchange task info and results with the Alluxio client. The client (1) gets cluster info from the service registry, (2)-(3) finds the right worker(s) via consistent hashing over the task info using an affinity block location policy, (4) executes the task on the Alluxio worker(s) in the Alluxio cluster, and (5) on a cache miss the worker runs an under storage task against the under storage.
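To illustrate the worker-selection step, here is a minimal consistent-hashing sketch. It is not Alluxio's implementation or API; the class, worker addresses, and key format are hypothetical, and it only shows how a client can map a file page to a worker deterministically without consulting a central master.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash of an arbitrary string key onto the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, workers, virtual_nodes=100):
        # Each worker gets many virtual nodes for a more even distribution.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(virtual_nodes)
        )
        self._keys = [h for h, _ in self._ring]

    def worker_for(self, file_id: str, page_index: int) -> str:
        # Find the first virtual node at or after the key's hash (wrapping around).
        h = _hash(f"{file_id}:{page_index}")
        idx = bisect.bisect(self._keys, h) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["worker-a:29999", "worker-b:29999", "worker-c:29999"])
print(ring.worker_for("s3://bucket/train/part-0001.parquet", page_index=0))
```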
Read Optimization
●High Concurrent Position Read: solves up to 150X read amplification issues; improves unstructured file parallel reads by up to 9X; improves structured file position reads by 2-15X.
●Zero-copy Data Transmission: improves memory efficiency; improves large-file sequential streaming read performance by 30%-50%.
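A position read fetches a byte range at an explicit offset instead of streaming the whole file, which is the access pattern used when reading structured formats such as Parquet footers and column chunks. The sketch below shows what that looks like at the POSIX layer through a hypothetical FUSE mount; the path, offsets, and lengths are made up for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical file on an Alluxio FUSE mount; offsets/lengths are illustrative.
PATH = "/mnt/alluxio-fuse/tables/events.parquet"

def position_read(fd: int, offset: int, length: int) -> bytes:
    # os.pread reads at the given offset without moving the file cursor, so
    # many threads can read disjoint ranges of one file descriptor
    # concurrently -- the pattern a position-read optimization targets.
    return os.pread(fd, length, offset)

fd = os.open(PATH, os.O_RDONLY)
ranges = [(0, 4 << 10), (1 << 20, 64 << 10), (8 << 20, 64 << 10)]
with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
    chunks = list(pool.map(lambda r: position_read(fd, r[0], r[1]), ranges))
os.close(fd)
print([len(c) for c in chunks])
```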
Example Use Case
Zhihu CASE STUDY: High Performance AI Platform for LLM
TECH BENEFIT: increased GPU utilization from 50% to 93%.
BUSINESS BENEFIT: 2-4X faster time-to-market.
Architecture diagram: training data and models in HDFS are shared across training clouds, an offline cloud, and an online cloud, covering model training, model deployment, model inference, downstream applications, and model updates.
Try Alluxio For Free in 30 min!
Introducing Rapid Alluxio Deployer (RAD) in AWS! Try the fully deployed Alluxio AI cluster for FREE!
●Explore the potential performance benefits of Alluxio by running FIO benchmarks
●Simplify the deployment process with preconfigured template clusters
●Maintain full control of your data with Alluxio deployed within your AWS account
●User-friendly web UI: up and running with just a few clicks in under 40 minutes
Blog with sign-up link and tutorial
Example
Thank you!
Join the conversation on Slack
alluxio.io/slack
Sign up for RAD at https://signup.alluxio-rad.io/ and send us a screenshot of the cluster you created for a chance to win a $50 Amazon gift card!