Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creation, New Cache Eviction Policies, Python SDK enhancements, and more

Alluxio 144 views 14 slides Feb 26, 2025

About This Presentation

Alluxio Webinar
Feb. 25, 2025

For more Alluxio Events: https://www.alluxio.io/events/

Speakers:
Bill Hodak (VP of Marketing and Product Marketing, Alluxio)
Tom Luckenbach (Solutions Engineering Manager, Alluxio)

Join us to learn about the latest release of Alluxio Enterprise AI. In this webinar, ...


Slide Content

Alluxio Confidential
Alluxio Enterprise AI 3.5




Accelerate AI.
February 25, 2025
Bill Hodak & Tom Luckenbach

Alluxio Enterprise AI Overview

Alluxio Accelerates AI
by solving speed, scale, & scarcity challenges
through high-performance, distributed caching
and unified access to heterogeneous data sources.

Alluxio Key Components
Alluxio Worker
●Individual node comprising CPU, memory, network, and NVMe storage resources
●Responsible for retrieving data from persistent storage, caching data on NVMe drives, and delivering
requested data to end users, applications, and APIs.
Alluxio Clients
●Alluxio FUSE-based POSIX
●Python SDK
●S3 API
Alluxio Cluster
●Pool of Alluxio Workers functioning together as a distributed caching layer
Alluxio Unified Namespace
●Unifies disparate storage systems, including (cloud) object storage, HDFS, NFS, etc., into a single
namespace

ALLUXIO Accelerates
AI Model Training, Distribution, & Inference
➔4X faster model training
➔Up to 14GB/s throughput per client
➔80+% GPU utilization

RedNote Accelerates AI with Alluxio
Accelerated Nightly Training Jobs by 41%
➔Eliminating storage bottlenecks and cloud
storage throttling
➔45% increase in CPU utilization
➔Reduced training time to 5.5 hours, a 41%
improvement, ensuring models are updated
within the 6-hour SLA
Accelerated Model Distribution & Lower Costs
➔10X faster model download speeds by index
servers
➔80% cost savings for model distribution
compared to Alibaba Cloud Disk
[Diagram: Ali Cloud deployment showing training nodes, the SLB (Ali Server Load Balancer), an Alluxio
Cluster of three Alluxio Workers each exposing the S3 API, and the Object Storage Service]

ALLUXIO FIO Benchmark
1 to 3 Alluxio Clients
➔Intel(R) Xeon(R) Platinum 8468
➔Memory: 512GB (64GB*8)
➔Network: 200Gbps
1 Alluxio Worker
➔Intel(R) Xeon(R) Gold 6430
➔6 x 7TB NVMe (PCIe 4.0, 6.5GB/s per NVMe)
➔Memory: 1TB (64GB*16)
➔Network: 400Gbps
[Charts: results for 1, 2, and 3 Alluxio Clients]

ALLUXIO Benchmarks
MLPerf Resnet50 on H100

New in Alluxio Enterprise AI 3.5

Alluxio CACHE_ONLY Write Mode
Why it is important:
➔AI training workloads periodically write checkpoint files to ‘save’ the state of the
partially trained model in the event the workload fails and needs to be restarted.
➔Checkpoint files are large and can take hours or longer to create.
➔Model training pauses while checkpoint files are created, slowing down end-to-end
model training time.

How it works:
➔Alluxio accelerates the performance of checkpoint file creation by writing to the
Alluxio cache and not to the underlying file system, avoiding network and storage
bottlenecks.
➔Faster checkpoint file creation accelerates end-to-end model training time.
Improves write performance by 2-3X
e.g., create checkpoint files in 20 minutes instead of 1 hour!
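
As a rough worked example of the figures on this slide, the sketch below computes the speedup implied by "20 minutes vs 1 hour" and the training time recovered over a run. It assumes, as the slide states, that training pauses for the full duration of each checkpoint. The count of 10 checkpoints is an illustrative assumption, not a number from the webinar.

```python
def checkpoint_speedup(baseline_minutes: float, accelerated_minutes: float) -> float:
    """Ratio of baseline to accelerated checkpoint creation time."""
    return baseline_minutes / accelerated_minutes


def training_time_saved(num_checkpoints: int,
                        baseline_minutes: float,
                        accelerated_minutes: float) -> float:
    """Minutes saved over a run, assuming training pauses for every checkpoint."""
    return num_checkpoints * (baseline_minutes - accelerated_minutes)


# 60 and 20 minutes come from the slide; 10 checkpoints is an assumed value.
speedup = checkpoint_speedup(60, 20)
saved = training_time_saved(10, 60, 20)
print(f"{speedup:.1f}x faster, {saved} minutes saved over 10 checkpoints")
```

With these inputs the speedup is 3X, the upper end of the 2-3X range quoted above.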

Increase Cache Hit Ratio for Faster Model Training
Introducing 2 New Cache Eviction Policies
Why New Cache Eviction Policies are Important:
➔Increases Cache Hit Ratio and improves cache efficiency by retaining critical data in
cache and reducing the overhead of reading data from the underlying file system.
➔Provides administrators with more granular control over which data is retained in the
Alluxio cache.

How it works:
➔TTL Cache Eviction Policies enforce time-to-live (TTL) limits on cached data. These
policies optimize cache efficiency by ensuring that less frequently accessed data is
automatically evicted based on the policy settings.
➔Priority-based Cache Eviction Policies ensure specific data stays in cache even if the
data would have otherwise been evicted based on the Least Recently Used (LRU)
cache eviction algorithm.
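
The two policies above can be illustrated with a toy, single-process cache. This is a minimal sketch of the general ideas (LRU eviction, per-entry TTL, and pinned "priority" entries that LRU skips). It is not Alluxio's implementation, and the class name, parameters, and keys are all hypothetical.

```python
import time
from collections import OrderedDict


class ToyCache:
    """Toy LRU cache with optional per-entry TTL and pinned (priority) keys."""

    def __init__(self, capacity, clock=time.monotonic):
        self.capacity = capacity
        self.clock = clock              # injectable so the demo can control time
        self._entries = OrderedDict()   # key -> (value, expires_at or None), LRU first
        self._pinned = set()            # keys that LRU eviction must skip

    def put(self, key, value, ttl=None, pinned=False):
        self._expire()
        expires_at = None if ttl is None else self.clock() + ttl
        self._entries[key] = (value, expires_at)
        self._entries.move_to_end(key)  # newest entry is most recently used
        if pinned:
            self._pinned.add(key)
        else:
            self._pinned.discard(key)
        self._evict_lru()

    def get(self, key):
        self._expire()
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def _expire(self):
        # TTL policy: drop every entry whose time-to-live has elapsed.
        now = self.clock()
        expired = [k for k, (_, exp) in self._entries.items()
                   if exp is not None and exp <= now]
        for key in expired:
            del self._entries[key]
            self._pinned.discard(key)

    def _evict_lru(self):
        # Priority policy: evict the least recently used entry that is not pinned.
        while len(self._entries) > self.capacity:
            victim = next((k for k in self._entries if k not in self._pinned), None)
            if victim is None:
                break  # everything is pinned; this toy simply overflows
            del self._entries[victim]


now = [0.0]                                    # fake clock for a deterministic demo
cache = ToyCache(capacity=2, clock=lambda: now[0])
cache.put("model.bin", "weights", pinned=True)
cache.put("scratch", "tmp", ttl=10)
now[0] = 11.0                                  # "scratch" is now past its TTL
cache.put("shard-1", "a")                      # expired "scratch" is dropped first
cache.put("shard-2", "b")                      # over capacity: pinned key is skipped
```

In this run the pinned entry is the least recently used one when the cache overflows, yet the unpinned `shard-1` is evicted instead, which is the behavior the priority-based policy describes.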

Alluxio S3 API: Lower I/O Latency by 40%
HTTP Persistent Connections

➔Reduces I/O Latency by 40%
➔Persisting the HTTP connection for multiple requests reduces I/O latency by eliminating the
overhead of opening/closing HTTP connections for each request.
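
To make the connection-reuse idea concrete, the sketch below uses only Python's standard library to count how many TCP connections a local test server accepts when a client opens a new connection per request versus reusing one. It does not talk to an Alluxio S3 endpoint and does not measure latency; the handler and helper names are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps connections open by default
    connections = 0                # TCP connections accepted so far
    _lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted connection, not once per request.
        with CountingHandler._lock:
            CountingHandler.connections += 1
        super().setup()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_with_new_connections(host, port, n):
    """Open and close a fresh connection for every request."""
    for _ in range(n):
        conn = http.client.HTTPConnection(host, port)
        conn.request("GET", "/")
        conn.getresponse().read()
        conn.close()


def fetch_with_persistent_connection(host, port, n):
    """Send every request over one kept-alive connection."""
    conn = http.client.HTTPConnection(host, port)
    for _ in range(n):
        conn.request("GET", "/")
        conn.getresponse().read()  # drain the body so the socket can be reused
    conn.close()


def count_connections(fetch, n=5):
    """Run `fetch` against a throwaway local server; return connections accepted."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        fetch("127.0.0.1", server.server_address[1], n)
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections
```

Calling `count_connections(fetch_with_new_connections)` and `count_connections(fetch_with_persistent_connection)` shows the per-request setup work that a persistent connection avoids.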

Multipart Uploads

➔Improves object upload performance (POST) for large objects by splitting the object into multiple,
smaller pieces and parallelizing the upload process.
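
The split-and-parallelize pattern described above can be sketched with an in-memory stand-in for the object store. This is an illustration of the technique only: `InMemoryStore`, its methods, and the part size are hypothetical and are not the S3 or Alluxio API.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(data: bytes, part_size: int):
    """Return (part_number, chunk) pairs; part numbers start at 1."""
    return [(offset // part_size + 1, data[offset:offset + part_size])
            for offset in range(0, len(data), part_size)]


class InMemoryStore:
    """Hypothetical stand-in for an object store that accepts parts."""

    def __init__(self):
        self.parts = {}    # (key, part_number) -> chunk
        self.objects = {}  # key -> assembled object

    def upload_part(self, key, part_number, chunk):
        self.parts[(key, part_number)] = chunk

    def complete(self, key, part_count):
        # Assemble parts in part-number order, regardless of arrival order.
        self.objects[key] = b"".join(
            self.parts.pop((key, n)) for n in range(1, part_count + 1))


def multipart_upload(store, key, data, part_size, max_workers=4):
    parts = split_into_parts(data, part_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(store.upload_part, key, n, chunk)
                   for n, chunk in parts]
        for future in futures:
            future.result()  # re-raise any upload error
    store.complete(key, len(parts))


store = InMemoryStore()
payload = bytes(range(256)) * 40  # 10,240 bytes of sample data
multipart_upload(store, "checkpoints/step-100.bin", payload, part_size=1000)
```

Because each part carries its number, parts can finish in any order and the object is still reassembled correctly.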

TLS Encryption
➔Improves data security by encrypting data accessed via the Alluxio S3 API Endpoint

Alluxio Data Now Accessible through Standard
Python FileSystem APIs
➔Alluxio’s Python SDK now integrates with the most popular AI frameworks,
including PyTorch, PyArrow, and Ray.
➔Python applications can now interact with various storage backends through
a unified Python filesystem interface, making it seamless to access both
local and remote storage systems.
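
The benefit of a unified filesystem interface is that application code is written once and runs against any backend. The sketch below shows that pattern with the standard library only. It is not the Alluxio Python SDK or any real filesystem library; `FileSystem`, `LocalFileSystem`, `InMemoryFileSystem`, and `save_and_reload` are hypothetical names.

```python
import os
from typing import Protocol


class FileSystem(Protocol):
    """Minimal interface every backend in this sketch implements."""

    def write_bytes(self, path: str, data: bytes) -> None: ...
    def read_bytes(self, path: str) -> bytes: ...


class LocalFileSystem:
    """Backend that stores files under a root directory on local disk."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _resolve(self, path: str) -> str:
        return os.path.join(self.root, path.lstrip("/"))

    def write_bytes(self, path: str, data: bytes) -> None:
        full = self._resolve(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(data)

    def read_bytes(self, path: str) -> bytes:
        with open(self._resolve(path), "rb") as f:
            return f.read()


class InMemoryFileSystem:
    """Backend standing in for a remote store in this sketch."""

    def __init__(self) -> None:
        self._files = {}

    def write_bytes(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def read_bytes(self, path: str) -> bytes:
        return self._files[path]


def save_and_reload(fs: FileSystem, path: str, payload: bytes) -> bytes:
    """Application code: identical for every backend."""
    fs.write_bytes(path, payload)
    return fs.read_bytes(path)
```

`save_and_reload` never needs to know which backend it was given, which is the property the slide attributes to the unified interface.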

More Improvements in Alluxio Enterprise AI 3.5
➔Alluxio’s new Index Service improves the performance of directory listing by 3-5X
➔Alluxio’s new UFS Rate Limiter enables administrators to optimize resource
utilization by controlling the bandwidth an individual worker node can read from the
underlying file system (UFS)
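
Bandwidth limits of this kind are commonly built on a token bucket. The sketch below shows that general technique and makes no claim about how Alluxio's UFS Rate Limiter is implemented. The class name, the rate, and the burst size are illustrative assumptions.

```python
import time


class TokenBucket:
    """Allow up to `rate` bytes per second on average, with bursts up to `burst`."""

    def __init__(self, rate_bytes_per_sec, burst_bytes, clock=time.monotonic):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.clock = clock            # injectable so tests can control time
        self.tokens = burst_bytes     # start with a full bucket
        self.last = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_acquire(self, nbytes):
        """Return True and spend tokens if `nbytes` may be read right now."""
        self._refill()
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A worker-side reader would call `try_acquire(len(chunk))` before each read from the underlying store and wait or retry when it returns `False`.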

Demo Time!