Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywhere, Pythonic Filesystem API, Write Checkpointing and more

Alluxio 342 views 17 slides Jul 23, 2024

About This Presentation

Alluxio Webinar
July 23, 2024

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Shouwei Chen (core maintainer and product manager, Alluxio)

In today's AI-driven world, organizations face unprecedented demands for powerful AI infrastructure to fuel their model training and se...


Slide Content

Alluxio Enterprise AI 3.2
Connect your machine learning platform with data anywhere at high speed
Shouwei Chen, open source maintainer & product manager @ Alluxio

The Evolution of the Modern Data Stack: I/O Challenges
Compute-Storage Separation → Cloud Data Lake → Multi-Region/Hybrid/Multi-Cloud (Today)
●Data is remote from compute; locality is missing
●Object store is not fit for AI/ML training; performance is missing

I/O Challenges Slow Down Business
Time to production | Cost | Reliability
●GPU utilization inefficiency
●Additional hardware purchase
●Data migration to training platform
●Job failures
●Cloud storage request throttling
●Long cycle of hardware purchase
●Complex data pipeline to integrate multiple data sources into the training platform

Why is your machine learning training slow on top of a data lake?
[Figure: I/O, CPU, and GPU timelines over Epochs 1 to 3, without cache vs. with cache; with cache, only the first epoch pays the cold read I/O]

Overcome I/O stalls with caching
[Figure: I/O, CPU, and GPU timelines over Epochs 1 to 3, without cache vs. with cache; with cache, only the first epoch pays the cold read I/O; GPU idle and CPU idle periods are marked]
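
The with-cache and without-cache timelines can be approximated with a back-of-the-envelope model. The sketch below is illustrative only: the function name `epoch_times` and every duration are hypothetical, not measurements from the webinar. It assumes the I/O, CPU, and GPU stages of an epoch run back to back, and that with a cache only the first (cold) epoch pays the remote read cost.

```python
def epoch_times(epochs, remote_io, hot_io, cpu, gpu, use_cache):
    """Return per-epoch wall time (arbitrary units) for a serialized
    I/O -> CPU -> GPU loop.

    Assumption: stages do not overlap. Without a cache every epoch pays
    `remote_io`; with a cache only the first (cold) epoch does, and later
    epochs pay the cheaper `hot_io`.
    """
    times = []
    for epoch in range(epochs):
        cold = (epoch == 0) or not use_cache
        io = remote_io if cold else hot_io
        times.append(io + cpu + gpu)
    return times


# Hypothetical durations, chosen only to show the shape of the timeline.
without_cache = epoch_times(3, remote_io=6, hot_io=1, cpu=2, gpu=2, use_cache=False)
with_cache = epoch_times(3, remote_io=6, hot_io=1, cpu=2, gpu=2, use_cache=True)
print(without_cache)  # [10, 10, 10]
print(with_cache)     # [10, 5, 5]
```

In this toy model the GPU does the same amount of work per epoch in both cases; the saving comes entirely from the shorter I/O stage after the cold read.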

82% of the time spent by DataLoader

<10% of your data is hot data

Instead of purchasing additional storage, add a data caching layer between compute & data lake
Source: Alluxio
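
To illustrate why a small cache can be enough when only a small share of the data is hot, here is a minimal sketch of a least-recently-used (LRU) cache driven by a made-up workload. LRU is used purely as a familiar eviction policy for illustration; the webinar does not state which policy Alluxio applies, and the class name, dataset size, and access pattern are all hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that tracks keys only, to count hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, key):
        if key in self.keys:
            self.keys.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.keys[key] = True
            if len(self.keys) > self.capacity:
                self.keys.popitem(last=False)  # evict least recently used


# Hypothetical workload over a 100-file dataset: 8 hot files are read every
# round, plus one cold file that is different each round.
cache = LRUCache(capacity=10)  # room for 10% of the dataset
for round_no in range(20):
    for hot_id in range(8):
        cache.access(hot_id)
    cache.access(8 + round_no)  # cold file ids 8..27, never repeated

print(cache.hits, cache.misses)  # 152 28
```

After the first round every hot read is a hit, and the only misses are the one-off cold files, even though the cache holds a tenth of the dataset.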

Increase GPU utilization & Reduce Cloud Storage Cost
Without Cache: training frameworks read directly from AWS S3 (us-east-1, us-west-1)
Frequently Retrieving Data = GPU inefficiency & High GET/PUT Operations Costs & Data Transfer Costs
With Cache: training frameworks read through a Data Cache in front of AWS S3 (us-east-1, us-west-1)
Fast Access with Hot Data Cached
Only Retrieve Data When Necessary = Lower S3 Costs
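
The effect on request counts can be sketched with a read-through cache placed in front of a stand-in object store. Everything here is hypothetical and in-memory: `CountingStore` is not an S3 client, and the shard names and epoch count are made up. The point is only that repeated epochs over the same objects reach the backing store once per object instead of once per read.

```python
class CountingStore:
    """Stand-in for a remote object store; counts GET requests."""

    def __init__(self, objects):
        self.objects = objects
        self.get_requests = 0

    def get(self, key):
        self.get_requests += 1
        return self.objects[key]


class ReadThroughCache:
    """Serves repeated reads locally and calls the store only on a miss."""

    def __init__(self, store):
        self.store = store
        self.cached = {}

    def get(self, key):
        if key not in self.cached:
            self.cached[key] = self.store.get(key)
        return self.cached[key]


objects = {f"shard-{i}": b"data" for i in range(4)}

# Without cache: every read in every epoch is a GET against the store.
direct = CountingStore(objects)
for _epoch in range(3):
    for key in objects:
        direct.get(key)

# With cache: only the first read of each object reaches the store.
backing = CountingStore(objects)
cache = ReadThroughCache(backing)
for _epoch in range(3):
    for key in objects:
        cache.get(key)

print(direct.get_requests)   # 12
print(backing.get_requests)  # 4
```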

Improve Reliability
●Prevent Network Congestion
●Relieve Overloaded Storage
●Prevent Job Failures like “503 Service Unavailable” …

Accessing Data and Models in the Cloud/On Prem

Alluxio Enterprise AI Architecture
AI/ML frameworks
Alluxio Enterprise AI Data Platform: unified access, cloud native, distributed cache
Cloud | Onprem | Hybrid Cloud

Hybrid/Multi-Cloud ML Platforms: Separation of compute and storage
[Diagram: offline training platform (training cluster) and online ML platform (serving cluster) across DC/Cloud A and DC/Cloud B, with training data, models, and numbered steps 1, 2, 3]

Alluxio Enterprise AI Architecture - Cloud native
[Diagram labels: APP, Alluxio Fuse, Alluxio Worker, SSD; storage systems: S3, HDFS, NAS, TOS ···]
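
Because Alluxio Fuse presents data through a file system mount, an application can generally read it with ordinary file I/O. The sketch below uses only the Python standard library; the mount point `/mnt/alluxio-fuse`, the file name, and the helper names are hypothetical placeholders rather than paths or APIs taken from the webinar.

```python
import os


def read_in_chunks(path, chunk_size=4 * 1024 * 1024):
    """Yield successive byte chunks of a file using plain file reads."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def total_bytes(path, chunk_size=4 * 1024 * 1024):
    """Read the whole file and return the number of bytes seen."""
    return sum(len(chunk) for chunk in read_in_chunks(path, chunk_size))


# Hypothetical mount point and file; adjust to the actual deployment.
sample = os.path.join("/mnt/alluxio-fuse", "datasets", "train-00000.bin")
if os.path.exists(sample):
    print(total_bytes(sample))
```

Nothing in the reading code is specific to Alluxio, so training code that already reads local files would need only a path change under this assumption.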

FIO Performance across Alluxio Versions
●In a single-node environment, Alluxio Enterprise AI 3.2 achieves 2 GiB/s with Threads=1 and 8 GiB/s with Threads=32
[Charts: Sequential hot read performance and Random hot read performance; axis: Throughput (GiB/s)]
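
FIO is a dedicated benchmarking tool, and the sketch below does not replace it. It only illustrates what a sequential versus random read throughput figure in GiB/s measures. It is single-threaded, so it does not model the Threads setting from the slide, and the function name, block size, and seed are hypothetical choices.

```python
import os
import random
import time


def read_throughput_gib_s(path, block_size=1024 * 1024, sequential=True, seed=0):
    """Read every block of `path` once and return throughput in GiB/s."""
    size = os.path.getsize(path)
    offsets = list(range(0, size, block_size))
    if not sequential:
        random.Random(seed).shuffle(offsets)  # visit blocks in random order

    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            total += len(f.read(block_size))
    # Guard against a zero interval on very small files.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total / elapsed / 2**30
```

Running it twice on the same file, once with `sequential=True` and once with `sequential=False`, gives the two kinds of numbers plotted on the slide, although results from a Python loop will differ from FIO's.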

Any Questions?
Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!