Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywhere, Pythonic Filesystem API, Write Checkpointing and more

Alluxio 342 views 17 slides Jul 23, 2024

About This Presentation

Alluxio Webinar
July 23, 2024

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Shouwei Chen (core maintainer and product manager, Alluxio)

In today's AI-driven world, organizations face unprecedented demands for powerful AI infrastructure to fuel their model training and se...

Size: 1.97 MB

Language: en

Added: Jul 23, 2024

Slides: 17 pages

Slide Content

Alluxio Enterprise AI 3.2
Connect your machine learning platform with data anywhere at high speed
Shouwei Chen, open source maintainer & product manager @ Alluxio

I/O Challenges: The Evolution of the Modern Data Stack
Compute-Storage Separation, Cloud Data Lake, Multi-Region/Hybrid/Multi-Cloud
Today:
●Data is remote from compute; locality is missing
●Object store is not fit for AI/ML training; performance is missing

I/O challenges slow down business
Time to production
●Long cycle of hardware purchase
●Complex data pipeline to integrate multiple data sources into the training platform
Cost
●GPU utilization inefficiency
●Additional hardware purchase
●Data migration to training platform
Reliability
●Job failures
●Cloud storage request throttling

Why is your machine learning training slow on top of a data lake?
[Figure: timeline of the I/O, CPU, and GPU stages over Epochs 1-3, shown without cache and with cache; one I/O stage is labeled "cold read".]

Overcome I/O stall with caching
[Figure: timeline of the I/O, CPU, and GPU stages over Epochs 1-3, shown without cache and with cache; one I/O stage is labeled "cold read", and gaps are marked "GPU idle" and "CPU idle".]
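The effect that the timeline illustrates can be approximated with a small back-of-the-envelope model. The sketch below assumes the I/O, CPU, and GPU stages of an epoch run strictly one after another, which is a simplification, and every stage duration in it is a hypothetical number chosen for illustration, not a measurement from the webinar.

```python
from typing import List


def gpu_utilization(io_times: List[float], cpu_time: float, gpu_time: float) -> float:
    """Fraction of wall time the GPU is busy when stages run back to back."""
    total = sum(io + cpu_time + gpu_time for io in io_times)
    return gpu_time * len(io_times) / total


EPOCHS = 3
COLD_IO = 8.0  # hypothetical: seconds to read an epoch's data from the remote data lake
HOT_IO = 1.0   # hypothetical: seconds to read the same data from a local cache
CPU = 1.0      # hypothetical: preprocessing time per epoch
GPU = 1.0      # hypothetical: compute time per epoch

# Without a cache, every epoch pays the remote read.
without_cache = gpu_utilization([COLD_IO] * EPOCHS, CPU, GPU)

# With a cache, only the first epoch is a cold read.
with_cache = gpu_utilization([COLD_IO] + [HOT_IO] * (EPOCHS - 1), CPU, GPU)

print(f"GPU utilization without cache: {without_cache:.1%}")
print(f"GPU utilization with cache:    {with_cache:.1%}")
```

Under these assumed durations, the gain grows with the number of epochs, because the cold read is paid once while every later epoch reads from the cache.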

82% of the time spent by DataLoader

<10%
of your data is hot data

Instead of purchasing additional storage, add a Data Caching Layer between compute & data lake
Source: Alluxio

Increase GPU utilization & Reduce Cloud Storage Cost
Without Cache: training frameworks read directly from AWS S3 (us-east-1, us-west-1)
●Frequently Retrieving Data = GPU inefficiency & High GET/PUT Operations Costs & Data Transfer Costs
With Cache: training frameworks read through a Data Cache in front of AWS S3 (us-east-1, us-west-1)
●Fast Access with Hot Data Cached
●Only Retrieve Data When Necessary = Lower S3 Costs
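The "only retrieve data when necessary" idea can be illustrated with a generic read-through cache. This is a minimal sketch, not Alluxio's API: `CountingStore`, `ReadThroughCache`, `run_epochs`, and the shard names are all hypothetical, the store is an in-memory stand-in for an object store, and the cache is unbounded with no eviction.

```python
from typing import Callable, Dict, Iterable


class CountingStore:
    """Hypothetical stand-in for a remote object store that counts GET requests."""

    def __init__(self) -> None:
        self.gets = 0

    def get(self, key: str) -> bytes:
        self.gets += 1
        return f"data-for-{key}".encode()


class ReadThroughCache:
    """Serve cached objects locally; go to the backing store only on a miss."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._store: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        if key not in self._store:  # miss: retrieve from the backing store once
            self._store[key] = self._fetch(key)
        return self._store[key]


def run_epochs(read: Callable[[str], bytes], keys: Iterable[str], epochs: int) -> None:
    """Read every object once per epoch, as a training job would."""
    keys = list(keys)
    for _ in range(epochs):
        for key in keys:
            read(key)


KEYS = [f"shard-{i}" for i in range(4)]  # hypothetical dataset of 4 objects
EPOCHS = 3

direct = CountingStore()
run_epochs(direct.get, KEYS, EPOCHS)      # no cache: every read is a GET

backing = CountingStore()
cache = ReadThroughCache(backing.get)
run_epochs(cache.read, KEYS, EPOCHS)      # cache: one GET per object

print("GETs without cache:", direct.gets)
print("GETs with cache:   ", backing.gets)
```

In this toy setup the number of backing-store requests drops from one per read to one per distinct object, which is the mechanism behind the lower request and transfer costs the slide describes.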

Improve Reliability
●Prevent Network Congestion
●Relieve Overloaded Storage
●Prevent Job Failures like “503 Service Unavailable” …

Accessing Data and Models in the Cloud/On Prem

Alluxio Enterprise AI Architecture
[Diagram: AI/ML frameworks get unified access to data through the Alluxio Enterprise AI Data Platform, a cloud native distributed cache, across Cloud, On-prem, and Hybrid Cloud.]

Separation of compute and storage: Hybrid/Multi-Cloud ML Platforms
[Diagram: an offline training platform (training cluster) and an online ML platform (serving cluster) across DC/Cloud A and DC/Cloud B, with training data and models moving between them in three numbered steps.]

Alluxio Enterprise AI Architecture - Cloud native
[Diagram: APPs, Alluxio Fuse, and Alluxio Workers with SSDs, on top of underlying storage: S3, HDFS, NAS, TOS ···]

FIO Performance across Alluxio Versions
●In a single-node environment, Alluxio Enterprise AI 3.2 can achieve 2 GiB/s with Threads=1 and 8 GiB/s with Threads=32
[Charts: sequential hot read performance and random hot read performance; y-axis: Throughput (GiB/s)]
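To put the two quoted throughput figures in perspective, the short calculation below converts them into read times and a scaling factor. The 2 GiB/s and 8 GiB/s values come from the slide; the 100 GiB dataset size is a hypothetical example, and the calculation assumes the quoted throughput is sustained for the whole read.

```python
# Throughput figures quoted on the slide (single node, hot reads), keyed by thread count.
THROUGHPUT_GIB_S = {1: 2.0, 32: 8.0}

DATASET_GIB = 100.0  # hypothetical dataset size, for illustration only


def read_seconds(dataset_gib: float, threads: int) -> float:
    """Time to read the dataset once at the quoted throughput."""
    return dataset_gib / THROUGHPUT_GIB_S[threads]


speedup = THROUGHPUT_GIB_S[32] / THROUGHPUT_GIB_S[1]  # aggregate gain from 1 to 32 threads
per_thread_at_32 = THROUGHPUT_GIB_S[32] / 32          # average GiB/s contributed by each thread

for threads in (1, 32):
    print(f"Threads={threads}: {read_seconds(DATASET_GIB, threads):.1f} s for {DATASET_GIB:.0f} GiB")
print(f"Aggregate speedup: {speedup:.1f}x, per-thread throughput at 32 threads: {per_thread_at_32} GiB/s")
```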

Any Questions?
Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!
