Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Workloads

Alluxio | 19 slides | Sep 10, 2024

About This Presentation

Alluxio Webinar
Sept. 10, 2024

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Jingwen Ouyang (Senior Program Manager, Alluxio)

As machine learning and deep learning models grow in complexity, AI platform engineers and ML engineers face significant challenges with slow data lo...


Slide Content

Optimize, Don't Overspend:
Data Caching Strategy for AI Workloads
Sep, 2024

Alluxio makes it easy to share and manage data from any storage to any compute engine in any environment, with high performance and low cost.

Open Source, Started From UC Berkeley AMPLab in 2014
JOIN THE CONVERSATION ON SLACK: ALLUXIO.IO/SLACK
- 1,200+ contributors & growing
- 10,000+ Slack community members
- Top 10 most critical Java-based open source projects
- Top 100 most valuable repositories out of 96 million on GitHub

Case Studies
Zhihu
Industries represented: Telco & Media, E-commerce, Financial Services, Tech & Internet, Others

Leverage GPUs Anywhere
Run AI workloads wherever GPUs are available without data locality concerns

Alluxio AI Offering

Critical infrastructure barriers to effective AI/ML adoption:
- Low performance
- Cost management
- GPU scarcity

How Alluxio addresses them:
- High performance caching for model training & distribution
- Multi-region/cloud data serving capability
- Shorten time-to-production
- Higher GPU utilization
- Avoid copying across data lakes
- Utilize NVMe directly on the GPU cluster

I/O Performance for AI Training and GPU Utilization
1. HPC Performance on Existing Data Lakes
Achieve up to 8 GB/s throughput & 200K IOPS for a single client.
Improvements compared to 2.x: 35% for hot sequential reads, 20x for hot random reads, 4x for cold reads.
2. GPU Saturation
Fully saturate 8 A100 GPUs, showing over 97% GPU utilization in MLPerf Storage language processing benchmarks.
Customer production data show GPU utilization improvement from 40% to 60% for search/recommendation models & 50% to 95% for LLMs.
3. Checkpoint Optimization
New checkpoint read/write support optimizes training with write caching capabilities.
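The checkpoint write-caching idea above can be sketched in plain Python. This is a minimal illustration, not Alluxio's implementation: the temp directory stands in for a cached FUSE mount (a real deployment would write to the Alluxio mount path, which is a hypothetical detail here), and the atomic-rename pattern is a common way to keep partially written checkpoints invisible to readers.

```python
# Sketch: writing training checkpoints through a write-cache directory.
# The cache directory is a stand-in for a cached mount point; the real
# persistence to the under store is handled by the caching layer.
import os
import pickle
import tempfile

def save_checkpoint(state: dict, cache_dir: str, step: int) -> str:
    """Serialize model state into the cache layer; in a real deployment
    the cache asynchronously persists it to the under store."""
    path = os.path.join(cache_dir, f"ckpt-{step:06d}.pkl")
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename: readers never see partial files
    return path

def load_checkpoint(path: str) -> dict:
    with open(path, "rb") as f:
        return pickle.load(f)

cache_dir = tempfile.mkdtemp()  # stand-in for the cached mount
state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
path = save_checkpoint(state, cache_dir, step=100)
restored = load_checkpoint(path)
print(restored["step"])  # → 100
```

The atomic-rename step matters for training resumption: a job restarted mid-write will only ever see complete checkpoint files.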

Comparison against other vendors | FIO - Sequential Read
- Alluxio 3.2: achieved a bandwidth of 2081 MiB/s (1 thread) to 8183 MiB/s (32 threads) with a single client, significantly outperforming competitors.
- JuiceFS: recorded a bandwidth of 1886 MiB/s (1 thread) to 6207 MiB/s, 9.3% to 24.1% slower than Alluxio 3.2.
- FSx for Lustre: managed a bandwidth of 185 MiB/s (1 thread) to 3992 MiB/s, 91.1% to 51.2% slower than Alluxio 3.2.
- Observations: Alluxio 3.2 shows better performance, particularly in handling sequential read operations efficiently.

Setup:
- Alluxio: 1 Alluxio worker (i3en.metal), 1 Alluxio FUSE client (c5n.metal)
- AWS FSx for Lustre (12 TB capacity)
- JuiceFS (SaaS)

Note: the Alluxio FUSE client co-located with the training servers provides POSIX API access to the Alluxio workers, which actually cache the data.
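For reference, the sequential-read bandwidth that FIO reports can be illustrated with a minimal Python loop. This is only a sketch of what the metric measures, not a replacement for fio; the file size, block size, and temp-file location are arbitrary choices for the example (a real run would point at the cached mount).

```python
# Sketch: a minimal FIO-style sequential read bandwidth measurement.
# Reads a test file front-to-back in fixed-size blocks and reports MiB/s.
import os
import tempfile
import time

BLOCK_SIZE = 1 << 20          # 1 MiB reads, like fio's bs=1M
FILE_SIZE = 64 * BLOCK_SIZE   # 64 MiB test file

# Create a test file (in production this would live on the cached mount).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(os.urandom(FILE_SIZE))

# Sequentially read it back and time the loop.
start = time.perf_counter()
total = 0
with open(path, "rb") as f:
    while chunk := f.read(BLOCK_SIZE):
        total += len(chunk)
elapsed = time.perf_counter() - start

mib_per_s = (total / (1 << 20)) / elapsed
print(f"read {total >> 20} MiB sequentially at {mib_per_s:.0f} MiB/s")
os.remove(path)
```

The measured number is dominated by the page cache on a local temp file; the slide's numbers come from fio against remote-backed mounts, where caching is exactly what is being compared.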

Alluxio Proprietary and Confidential

Comparison against other vendors | MLPerf Storage
Setup:
- Alluxio: 1 FUSE client (c6in.metal), 2 workers (i3en.metal)

Note: DDN with 12 GPUs and Weka with 20 GPUs are the data points available on the MLPerf website.

New Architecture

Decentralized Object Repository Architecture (DORA)

Motivation & Benefits
- Scalability: removes the master as the bottleneck; unlimited scalability; supports tens of billions of small files with a single Alluxio cluster
- Reliability: fault tolerance; automatic fallback to the under file system; more friendly to Kubernetes and the cloud
- Performance: zero-copy network transmission with Netty; high concurrent read
- Data Governance: multi-tenant & quota management; pluggable security management

Architecture

[Architecture diagram, reconstructed as a data flow: (1) the Alluxio client on the training node gets cluster info from the service registry; (2) the client's affinity block location policy applies a consistent hash to the task info to find the worker(s); (3) the client sends the task to the chosen Alluxio worker, which executes it and returns the result; (4) on a cache miss, the worker reads from the under storage.]
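The consistent-hash worker lookup in the flow above can be sketched as follows. This is an illustrative implementation of the general consistent-hashing technique, with hypothetical worker names and virtual-node count; Alluxio's actual hashing policy and parameters may differ.

```python
# Sketch: consistent hashing to map a file path to a cache worker, the
# idea behind DORA's client-side worker lookup (no central master needed).
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, workers, vnodes=100):
        # Each worker gets many virtual points on the ring so load stays
        # balanced when workers join or leave.
        self._ring = sorted(
            (self._hash(f"{w}#{i}"), w)
            for w in workers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def worker_for(self, path: str) -> str:
        """Pick the first worker clockwise from the path's hash."""
        idx = bisect.bisect(self._keys, self._hash(path)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["worker-a", "worker-b", "worker-c"])
w = ring.worker_for("s3://bucket/train/part-00001.parquet")
# The same path always maps to the same worker, so its cached blocks are
# found deterministically without consulting a central metadata service.
assert w == ring.worker_for("s3://bucket/train/part-00001.parquet")
print(w)
```

Virtual nodes are the standard trick that keeps the key distribution roughly uniform even with a small number of physical workers.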

Read Optimization
- High Concurrent Position Read
  - Solve up to 150X read amplification issue
  - Improve unstructured file parallel read up to 9X
  - Improve structured file position read 2 - 15X
- Zero-copy Data Transmission
  - Improve memory efficiency
  - Improve large file sequential streaming read performance by 30% - 50%
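The "position read" and read-amplification numbers above come from fetching only the byte range a request needs instead of streaming the whole file. A minimal POSIX-style sketch using Python's os.pread; the 1 MiB file and 4 KiB "footer" are illustrative stand-ins (e.g. for a Parquet footer), not Alluxio specifics:

```python
# Sketch: a positioned read fetches exactly the needed byte range,
# avoiding the read amplification of streaming the whole file.
import os
import tempfile

# Create a 1 MiB stand-in data file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x" * (1 << 20))

FOOTER = 4096
size = os.path.getsize(path)

# Positioned read of just the last 4 KiB: os.pread(fd, length, offset).
rfd = os.open(path, os.O_RDONLY)
footer = os.pread(rfd, FOOTER, size - FOOTER)
os.close(rfd)

# Reading the full file to obtain the same 4 KiB would move 256x more
# bytes over the wire — that ratio is the read amplification.
amplification = size // FOOTER
print(len(footer), amplification)  # → 4096 256
os.remove(path)
```

Because pread takes an explicit offset, many threads can issue positioned reads on the same file descriptor concurrently, which is what "high concurrent position read" exploits.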

Example Use Case

Zhihu CASE STUDY: High Performance AI Platform for LLM

BUSINESS BENEFIT: 2 - 4X faster time-to-market
TECH BENEFIT: increase GPU utilization from 50% to 93%

[Architecture diagram: training data and models flow from HDFS through training clouds (model training) to an offline cloud (model deployment) and an online cloud (model inference for downstream applications), with model updates feeding back into training.]

Try Alluxio For Free in 30 min!
Introducing Rapid Alluxio Deployer (RAD) in AWS!

Try the fully deployed Alluxio AI cluster for FREE!
- Explore the potential performance benefits of Alluxio by running FIO benchmarks
- Simplify the deployment process with preconfigured template clusters
- Maintain full control of your data with Alluxio deployed within your AWS account
- User-friendly web UI with just a few clicks in under 40 minutes

Blog with sign-up link and tutorial

Thank you!

Join the conversation on Slack: alluxio.io/slack

Sign up for RAD at https://signup.alluxio-rad.io/ and send us a screenshot of the cluster you created for a chance to win a $50 Amazon gift card!