Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywhere, Pythonic Filesystem API, Write Checkpointing and more
Alluxio
About This Presentation
Alluxio Webinar
July 23, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Shouwei Chen (core maintainer and product manager, Alluxio)
In today's AI-driven world, organizations face unprecedented demands for powerful AI infrastructure to fuel their model training and serving workloads. Performance bottlenecks, cost inefficiencies, and management complexities pose significant challenges for AI platform teams supporting large-scale model training and serving. On July 9, 2024, we introduced Alluxio Enterprise AI 3.2, a groundbreaking solution designed to address these critical issues in the ever-evolving AI landscape.
In this webinar, Shouwei Chen will introduce exciting new features of Alluxio Enterprise AI 3.2:
- Leveraging GPU resources anywhere by accessing remote data with the same performance as local storage
- Enhanced I/O performance with 97%+ GPU utilization for popular language model training benchmarks
- Achieving the same performance as HPC storage on an existing data lake, without additional HPC storage infrastructure
- New Python FileSystem API to seamlessly integrate with Python applications like Ray (see the sketch after this list)
- Other new features, including advanced cache management, rolling upgrades, and CSI failover
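The Python FileSystem API is meant to plug into Python data tooling such as Ray. The snippet below is a rough sketch only: it assumes an fsspec-style interface, and the package name (alluxiofs), class name (AlluxioFileSystem), and constructor options (etcd_hosts, target_protocol) are illustrative assumptions rather than the confirmed 3.2 API; check the release documentation for the exact interface.

```python
# Rough sketch of using an fsspec-style Alluxio filesystem from Python.
# NOTE: the package name (alluxiofs), class name (AlluxioFileSystem), and
# constructor arguments (etcd_hosts, target_protocol) are assumptions for
# illustration; consult the Alluxio Enterprise AI 3.2 docs for the exact API.
import fsspec
from alluxiofs import AlluxioFileSystem  # assumed import path

# Register the implementation so fsspec can resolve it by protocol name.
fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)

# Point the client at the Alluxio cluster; reads are served from the
# distributed cache and fall through to the underlying data lake (here S3).
fs = fsspec.filesystem(
    "alluxiofs",
    etcd_hosts="etcd-host:2379",   # assumed cluster-discovery option
    target_protocol="s3",          # under-storage that the cache fronts
)

# Ordinary filesystem-style calls, as any fsspec consumer would issue them.
print(fs.ls("s3://my-bucket/training-data/"))
with fs.open("s3://my-bucket/training-data/part-00000.parquet", "rb") as f:
    head = f.read(1024)
```

Because the handle behaves like any other fsspec filesystem, libraries that consume fsspec or Arrow filesystems (for example Ray Data, pandas, or PyArrow) can read through the cache with only the filesystem handle or URL changed.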
Size: 1.97 MB
Language: en
Added: Jul 23, 2024
Slides: 17 pages
Slide Content
Alluxio Enterprise AI 3.2
Connect your machine learning platform with data anywhere at high speed
Shouwei Chen, open source maintainer & product manager @ Alluxio
The Evolution of the Modern Data Stack
Compute-storage separation → cloud data lake → today: multi-region / hybrid / multi-cloud
I/O challenges:
●Data is remote from compute; locality is missing
●Object store is not a fit for AI/ML training; performance is missing
I/O challenges slow down the business
Impact spans time to production, cost, and reliability:
●GPU utilization inefficiency
●Additional hardware purchases
●Data migration to the training platform
●Job failures
●Cloud storage request throttling
●Long hardware purchase cycles
●Complex data pipelines to integrate multiple data sources into the training platform
Why is your machine learning training slow on top of a data lake?
[Diagram: per-epoch I/O, CPU, and GPU timelines for Epochs 1–3, comparing Without Cache and With Cache, including a cold-read I/O phase]
Overcome I/O stall with caching
[Diagram: the same per-epoch I/O, CPU, and GPU timelines for Epochs 1–3, Without Cache vs. With Cache, annotated with GPU idle and CPU idle time]
82% of the time is spent by the DataLoader
<10% of your data is hot data
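One way to see the stall described above is to time how much of each epoch goes to waiting on data versus compute. Below is a minimal, framework-free sketch; load_batches and train_step are hypothetical stand-ins for a real DataLoader and training step.

```python
# Minimal sketch: measure the fraction of an epoch spent waiting on I/O
# versus doing compute. load_batches() and train_step() are hypothetical
# stand-ins for a real DataLoader and a GPU training step.
import time

def profile_epoch(load_batches, train_step):
    io_time = compute_time = 0.0
    batches = iter(load_batches())
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(batches)      # blocks on (possibly remote) reads
        except StopIteration:
            break
        io_time += time.perf_counter() - t0

        t1 = time.perf_counter()
        train_step(batch)              # forward/backward pass
        compute_time += time.perf_counter() - t1

    total = io_time + compute_time
    print(f"I/O stall: {io_time / total:.0%} of epoch; "
          f"compute: {compute_time / total:.0%}")

# With a cold object store behind the loader, the I/O share dominates
# (the slide cites 82% of time spent in the DataLoader); once the data is
# cached close to the GPUs, later epochs become compute-bound.
```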
Instead of purchasing additional storage, add a Data Caching Layer between compute & data lake (Source: Alluxio)
Increase GPU Utilization & Reduce Cloud Storage Cost
[Diagram: training frameworks reading from AWS S3 (us-east-1, us-west-1), Without Cache vs. With Cache via a Data Cache]
Without cache: frequently retrieving data = GPU inefficiency & high GET/PUT operation costs & data transfer costs
With cache: fast access with hot data cached; only retrieve data when necessary = lower S3 costs
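The "only retrieve data when necessary" effect can be illustrated on a single node with fsspec's simplecache layer, which keeps a local copy of each object after the first GET so repeated epochs stop issuing S3 requests. This is a generic illustration of the caching idea, not Alluxio's implementation, and the bucket and key names are made up.

```python
# Generic illustration (not Alluxio's implementation): front S3 reads with
# fsspec's "simplecache" layer so each object is fetched from S3 only once
# and re-reads in later epochs are served from local disk.
import fsspec

URL = "simplecache::s3://my-bucket/train/shard-0001.tfrecord"  # made-up key

for epoch in range(3):
    # Epoch 0 issues the S3 GET and writes the object to cache_storage;
    # epochs 1-2 open the locally cached copy and issue no S3 requests.
    with fsspec.open(
        URL,
        mode="rb",
        s3={"anon": False},                              # normal s3fs credentials
        simplecache={"cache_storage": "/tmp/s3-cache"},  # local cache directory
    ) as f:
        data = f.read()
    print(f"epoch {epoch}: read {len(data)} bytes")
```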
Improve Reliability
●Prevent network congestion
●Relieve overloaded storage
●Prevent job failures like “503 Service Unavailable” …
Accessing Data and Models in the Cloud / On-Prem
Alluxio Enterprise AI Architecture
[Architecture diagram: the Alluxio Enterprise AI Data Platform provides unified access and a cloud-native distributed cache between AI/ML frameworks and data in cloud, on-prem, and hybrid-cloud storage; an offline training platform (training cluster, DC/Cloud A) and an online ML platform (serving cluster, DC/Cloud B) share training data and models through the cache]
Separation of compute and storage
Hybrid/Multi-Cloud ML Platforms
FIO Performance across Alluxio Versions
●In a single-node environment, Alluxio Enterprise AI 3.2 achieves 2 GiB/s with Threads=1 and 8 GiB/s with Threads=32
[Charts: sequential hot read performance and random hot read performance, throughput in GiB/s]
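As context for how numbers like these are produced, the sketch below is a rough Python stand-in for an FIO hot-read test: it re-reads already cached files from a mount with 1 and then 32 threads and reports aggregate throughput. The mount path is hypothetical, and a real benchmark would use fio itself with matching block sizes and I/O engine settings.

```python
# Rough stand-in for an FIO sequential hot-read test: re-read cached files
# from a mount with N threads and report aggregate GiB/s. The mount path is
# hypothetical; a real run would use fio with matching block size / engine.
import os
import time
from concurrent.futures import ThreadPoolExecutor

MOUNT = "/mnt/alluxio-fuse/dataset"   # hypothetical mount of cached data
BLOCK = 4 * 1024 * 1024               # 4 MiB sequential reads

def read_file(path: str) -> int:
    nbytes = 0
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK):
            nbytes += len(chunk)
    return nbytes

def hot_read_gibps(threads: int) -> float:
    files = sorted(
        os.path.join(MOUNT, name) for name in os.listdir(MOUNT)
    )[:threads]                        # one file per thread, for simplicity
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        total = sum(pool.map(read_file, files))
    return total / (time.perf_counter() - start) / 2**30

for threads in (1, 32):
    print(f"Threads={threads}: {hot_read_gibps(threads):.2f} GiB/s")
```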
Any Questions?
Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!