Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Workloads

Alluxio | 19 slides | Sep 10, 2024

About This Presentation

Alluxio Webinar
Sept. 10, 2024

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Jingwen Ouyang (Senior Program Manager, Alluxio)

As machine learning and deep learning models grow in complexity, AI platform engineers and ML engineers face significant challenges with slow data lo...


Slide Content

Optimize, Don't Overspend:
Data Caching Strategy for AI Workloads
Sep, 2024

Alluxio makes it easy to share and manage data from any storage to any compute engine in any environment, with high performance and low cost.

Open Source, Started From UC Berkeley AMPLab in 2014
JOIN THE CONVERSATION ON SLACK: ALLUXIO.IO/SLACK
- 1,200+ contributors & growing
- 10,000+ Slack community members
- Top 10 most critical Java-based open source projects
- Top 100 most valuable repositories out of 96 million on GitHub

Case Studies
Zhihu
Industries represented: Telco & Media, E-commerce, Financial Services, Tech & Internet, Others

Leverage GPUs Anywhere
Run AI workloads wherever GPUs are available without data locality concerns

Alluxio AI Offering

Critical infrastructure barriers to effective AI/ML adoption:
- Low performance
- Cost management
- GPU scarcity

How Alluxio addresses them:
- High performance caching for model training & distribution
- Multi-region/cloud data serving capability
- Shorten time-to-production
- Higher GPU utilization
- Avoid copying across data lakes
- Utilize NVMe directly on the GPU cluster

I/O Performance for AI Training and GPU Utilization
1. HPC Performance on Existing Data Lakes
Achieve up to 8 GB/s throughput & 200K IOPS for a single client.
Improvements compared to 2.x: 35% for hot sequential reads, 20x for hot random reads, 4x for cold reads.
2. GPU Saturation
Fully saturate 8 A100 GPUs, showing over 97% GPU utilization in MLPerf Storage language processing benchmarks.
Customer production data show GPU utilization improvement from 40% to 60% for search/recommendation models & 50% to 95% for LLMs.
3. Checkpoint Optimization
New checkpoint read/write support optimizes training with write caching capabilities.
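The checkpoint write-caching idea above can be sketched in plain Python. This is a minimal illustration, not Alluxio's implementation: the temp directory stands in for a cached FUSE mount (a real deployment would write to the Alluxio mount path, which is a hypothetical detail here), and the atomic-rename pattern is a common way to keep partially written checkpoints invisible to readers.

```python
# Sketch: writing training checkpoints through a write-cache directory.
# The cache directory is a stand-in for a cached mount point; the real
# persistence to the under store is handled by the caching layer.
import os
import pickle
import tempfile

def save_checkpoint(state: dict, cache_dir: str, step: int) -> str:
    """Serialize model state into the cache layer; in a real deployment
    the cache asynchronously persists it to the under store."""
    path = os.path.join(cache_dir, f"ckpt-{step:06d}.pkl")
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename: readers never see partial files
    return path

def load_checkpoint(path: str) -> dict:
    with open(path, "rb") as f:
        return pickle.load(f)

cache_dir = tempfile.mkdtemp()  # stand-in for the cached mount
state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
path = save_checkpoint(state, cache_dir, step=100)
restored = load_checkpoint(path)
print(restored["step"])  # → 100
```

The atomic-rename step matters for training resumption: a job restarted mid-write will only ever see complete checkpoint files.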

Comparison against other vendors | FIO - Sequential Read
- Alluxio 3.2: achieved a bandwidth of 2081 MiB/s (1 thread) to 8183 MiB/s (32 threads) with a single client, significantly outperforming competitors.
- JuiceFS: recorded a bandwidth of 1886 MiB/s (1 thread) to 6207 MiB/s, 9.3% to 24.1% slower than Alluxio 3.2.
- FSx for Lustre: managed a bandwidth of 185 MiB/s (1 thread) to 3992 MiB/s, 91.1% to 51.2% slower than Alluxio 3.2.
- Observations: Alluxio 3.2 shows better performance, particularly in handling sequential read operations efficiently.

Setup:
- Alluxio: 1 Alluxio worker (i3en.metal), 1 Alluxio FUSE client (c5n.metal)
- AWS FSx for Lustre (12 TB capacity)
- JuiceFS (SaaS)

Note: the Alluxio FUSE client co-located with the training servers provides POSIX API access to the Alluxio workers, which actually cache the data.
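For reference, the sequential-read bandwidth that FIO reports can be illustrated with a minimal Python loop. This is only a sketch of what the metric measures, not a replacement for fio; the file size, block size, and temp-file location are arbitrary choices for the example (a real run would point at the cached mount).

```python
# Sketch: a minimal FIO-style sequential read bandwidth measurement.
# Reads a test file front-to-back in fixed-size blocks and reports MiB/s.
import os
import tempfile
import time

BLOCK_SIZE = 1 << 20          # 1 MiB reads, like fio's bs=1M
FILE_SIZE = 64 * BLOCK_SIZE   # 64 MiB test file

# Create a test file (in production this would live on the cached mount).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(os.urandom(FILE_SIZE))

# Sequentially read it back and time the loop.
start = time.perf_counter()
total = 0
with open(path, "rb") as f:
    while chunk := f.read(BLOCK_SIZE):
        total += len(chunk)
elapsed = time.perf_counter() - start

mib_per_s = (total / (1 << 20)) / elapsed
print(f"read {total >> 20} MiB sequentially at {mib_per_s:.0f} MiB/s")
os.remove(path)
```

The measured number is dominated by the page cache on a local temp file; the slide's numbers come from fio against remote-backed mounts, where caching is exactly what is being compared.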

Alluxio Proprietary and Confidential

Comparison against other vendors | MLPerf Storage
Setup:
- Alluxio: 1 FUSE client (c6in.metal), 2 workers (i3en.metal)

Note: DDN with 12 GPUs and Weka with 20 GPUs are the data points available on the MLPerf website.

New Architecture

Decentralized Object Repository Architecture (DORA)

Motivation & Benefits
- Scalability: removes the master as the bottleneck; unlimited scalability; supports tens of billions of small files with a single Alluxio cluster
- Reliability: fault tolerance; automatic fallback to the under file system; more friendly to Kubernetes and the cloud
- Performance: zero-copy network transmission with Netty; high concurrent read
- Data Governance: multi-tenant & quota management; pluggable security management

Architecture

[Architecture diagram, reconstructed as a data flow: (1) the Alluxio client on the training node gets cluster info from the service registry; (2) the client's affinity block location policy applies a consistent hash to the task info to find the worker(s); (3) the client sends the task to the chosen Alluxio worker, which executes it and returns the result; (4) on a cache miss, the worker reads from the under storage.]
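The consistent-hash worker lookup in the flow above can be sketched as follows. This is an illustrative implementation of the general consistent-hashing technique, with hypothetical worker names and virtual-node count; Alluxio's actual hashing policy and parameters may differ.

```python
# Sketch: consistent hashing to map a file path to a cache worker, the
# idea behind DORA's client-side worker lookup (no central master needed).
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, workers, vnodes=100):
        # Each worker gets many virtual points on the ring so load stays
        # balanced when workers join or leave.
        self._ring = sorted(
            (self._hash(f"{w}#{i}"), w)
            for w in workers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def worker_for(self, path: str) -> str:
        """Pick the first worker clockwise from the path's hash."""
        idx = bisect.bisect(self._keys, self._hash(path)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["worker-a", "worker-b", "worker-c"])
w = ring.worker_for("s3://bucket/train/part-00001.parquet")
# The same path always maps to the same worker, so its cached blocks are
# found deterministically without consulting a central metadata service.
assert w == ring.worker_for("s3://bucket/train/part-00001.parquet")
print(w)
```

Virtual nodes are the standard trick that keeps the key distribution roughly uniform even with a small number of physical workers.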

Read Optimization
- High Concurrent Position Read
  - Solve up to 150X read amplification issue
  - Improve unstructured file parallel read up to 9X
  - Improve structured file position read 2 - 15X
- Zero-copy Data Transmission
  - Improve memory efficiency
  - Improve large file sequential streaming read performance by 30% - 50%
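The "position read" and read-amplification numbers above come from fetching only the byte range a request needs instead of streaming the whole file. A minimal POSIX-style sketch using Python's os.pread; the 1 MiB file and 4 KiB "footer" are illustrative stand-ins (e.g. for a Parquet footer), not Alluxio specifics:

```python
# Sketch: a positioned read fetches exactly the needed byte range,
# avoiding the read amplification of streaming the whole file.
import os
import tempfile

# Create a 1 MiB stand-in data file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x" * (1 << 20))

FOOTER = 4096
size = os.path.getsize(path)

# Positioned read of just the last 4 KiB: os.pread(fd, length, offset).
rfd = os.open(path, os.O_RDONLY)
footer = os.pread(rfd, FOOTER, size - FOOTER)
os.close(rfd)

# Reading the full file to obtain the same 4 KiB would move 256x more
# bytes over the wire — that ratio is the read amplification.
amplification = size // FOOTER
print(len(footer), amplification)  # → 4096 256
os.remove(path)
```

Because pread takes an explicit offset, many threads can issue positioned reads on the same file descriptor concurrently, which is what "high concurrent position read" exploits.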

Example Use Case

Zhihu CASE STUDY: High Performance AI Platform for LLM

BUSINESS BENEFIT: 2 - 4X faster time-to-market
TECH BENEFIT: increase GPU utilization from 50% to 93%

[Architecture diagram: training data and models flow from HDFS through training clouds (model training) to an offline cloud (model deployment) and an online cloud (model inference for downstream applications), with model updates feeding back into training.]

Try Alluxio For Free in 30 min!
Introducing Rapid Alluxio Deployer (RAD) in AWS!

Try the fully deployed Alluxio AI cluster for FREE!
- Explore the potential performance benefits of Alluxio by running FIO benchmarks
- Simplify the deployment process with preconfigured template clusters
- Maintain full control of your data with Alluxio deployed within your AWS account
- User-friendly web UI with just a few clicks in under 40 minutes

Blog with sign-up link and tutorial

Thank you!

Join the conversation on Slack: alluxio.io/slack

Sign up for RAD at https://signup.alluxio-rad.io/ and send us a screenshot of the cluster you created for a chance to win a $50 Amazon gift card!