Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creation, New Cache Eviction Policies, Python SDK enhancements, and more

Alluxio 144 views 14 slides Feb 26, 2025


About This Presentation

Alluxio Webinar
Feb. 25, 2025

For more Alluxio Events: https://www.alluxio.io/events/

Speakers:
Bill Hodak (VP of Marketing and Product Marketing, Alluxio)
Tom Luckenbach (Solutions Engineering Manager, Alluxio)

Join us to learn about the latest release of Alluxio Enterprise AI. In this webinar, ...

Size: 626.48 KB

Language: en

Added: Feb 26, 2025

Slides: 14 pages

Slide Content

Alluxio Confidential
Alluxio Enterprise AI 3.5

Accelerate AI.
February 25, 2025
Bill Hodak & Tom Luckenbach

Alluxio Enterprise AI Overview

Alluxio Accelerates AI
by solving speed, scale, & scarcity challenges
through high-performance, distributed caching
and unified access to heterogeneous data sources.

Alluxio Key Components
Alluxio Worker
●Individual node comprising CPU, memory, network, and NVMe storage resources
●Responsible for retrieving data from persistent storage, caching data on NVMe drives, and delivering
requested data to end users, applications, and APIs.
Alluxio Clients
●Alluxio FUSE-based POSIX
●Python SDK
●S3 API
Alluxio Cluster
●Pool of Alluxio Workers functioning together as a distributed caching layer
Alluxio Unified Namespace
●Unifies disparate storage systems, including (cloud) object storage, HDFS, NFS, etc., into a single, unified
namespace

ALLUXIO Accelerates
AI Model Training, Distribution, & Inference
➔4X faster model training
➔Up to 14GB/s throughput per client
➔80+% GPU utilization

RedNote Accelerates AI with Alluxio
Accelerated Nightly Training Jobs by 41%
➔Eliminating storage bottlenecks and cloud
storage throttling
➔45% increase in CPU utilization
➔Reduced training time to 5.5 hours, a 41%
improvement, ensuring models are updated
within the 6-hour SLA
Accelerated Model Distribution & Lower Costs
➔10X faster model download speeds by index
servers
➔80% cost savings for model distribution
compared to Alibaba Cloud Disk
[Diagram (Ali Cloud): three training nodes; SLB (Ali Server Load Balancer); an Alluxio Cluster of three Alluxio Workers, each exposing the S3 API; Object Storage Service]

ALLUXIO FIO Benchmark
1 to 3 Alluxio Clients
➔Intel(R) Xeon(R) Platinum 8468
➔Memory: 512GB (64GB*8)
➔Network: 200Gbps
1 Alluxio Worker
➔Intel(R) Xeon(R) Gold 6430
➔6 x 7TB NVMe (PCIe 4.0, 6.5GB/s per NVMe)
➔Memory: 1TB (64GB*16)
➔Network: 400Gbps
[Chart: results for 1, 2, and 3 Alluxio Clients; values not recoverable from the extracted text]
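
The hardware figures on this slide imply rough bandwidth ceilings, which can be worked out as below. This is a back-of-the-envelope sketch only: it assumes the NVMe drives and the 400Gbps link belong to the Alluxio Worker, that 6.5GB/s is a per-drive rate, and it ignores protocol overhead. The variable names are illustrative.

```python
# Convert the slide's hardware figures into GB/s ceilings (decimal units).

def gbps_to_gb_per_s(gigabits_per_second: float) -> float:
    """Network line rate in gigabits/s -> gigabytes/s (8 bits per byte)."""
    return gigabits_per_second / 8


client_nic = gbps_to_gb_per_s(200)   # assumed per-client network ceiling
worker_nic = gbps_to_gb_per_s(400)   # assumed worker network ceiling
worker_nvme = 6 * 6.5                # six drives at 6.5 GB/s each (assumed per drive)

# The worker cannot serve faster than the slower of its NIC and its drives.
worker_ceiling = min(worker_nic, worker_nvme)

print(f"client NIC ceiling:  {client_nic:.1f} GB/s")
print(f"worker NIC ceiling:  {worker_nic:.1f} GB/s")
print(f"worker NVMe ceiling: {worker_nvme:.1f} GB/s")
print(f"worker ceiling:      {worker_ceiling:.1f} GB/s")
```

Under these assumptions a single client is limited by its 200Gbps link, and the worker by its aggregate NVMe bandwidth rather than its network link.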

ALLUXIO Benchmarks
MLPerf ResNet50 on H100

New in Alluxio Enterprise AI 3.5

Alluxio CACHE_ONLY Write Mode
Why it is important:
➔AI training workloads periodically write checkpoint files to ‘save’ the state of the
partially trained model in the event the workload fails and needs to be restarted.
➔Checkpoint files are large and can take hours or longer to create.
➔Model training pauses while checkpoint files are created, slowing down end-to-end
model training time.

How it works:
➔Alluxio accelerates checkpoint file creation by writing to the
Alluxio cache and not to the underlying file system, avoiding network and storage
bottlenecks.
➔Faster checkpoint file creation accelerates end-to-end model training time.
➔Improves write performance 2-3X,
e.g., checkpoint files created in 20 minutes vs. 1 hour.
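
To see how the 20-minutes-vs-1-hour example affects end-to-end training time, here is a small worked calculation. Only the two checkpoint durations come from the slide; the compute hours and checkpoint count are hypothetical values chosen for illustration, and the model assumes training pauses fully during each checkpoint write.

```python
def training_wall_clock_hours(compute_hours, num_checkpoints, minutes_per_checkpoint):
    """Total wall-clock time when training pauses for every checkpoint write."""
    return compute_hours + num_checkpoints * minutes_per_checkpoint / 60


COMPUTE_HOURS = 24     # hypothetical: pure training time
NUM_CHECKPOINTS = 6    # hypothetical: checkpoints written during the run

# Checkpoint durations from the slide: 1 hour before, 20 minutes with cache writes.
baseline_hours = training_wall_clock_hours(COMPUTE_HOURS, NUM_CHECKPOINTS, 60)
cached_hours = training_wall_clock_hours(COMPUTE_HOURS, NUM_CHECKPOINTS, 20)
write_speedup = 60 / 20

print(f"baseline:      {baseline_hours:.1f} h")
print(f"cached writes: {cached_hours:.1f} h")
print(f"write speedup: {write_speedup:.0f}X")
```

The more often a job checkpoints, the larger the share of wall-clock time that the write speedup recovers.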

Increase Cache Hit Ratio for Faster Model Training
Introducing 2 New Cache Eviction Policies
Why New Cache Eviction Policies are Important:
➔Increases Cache Hit Ratio and improves cache efficiency by retaining critical data in
cache and reducing the overhead of reading data from the underlying file system.
➔Provides administrators with more granular control over which data is retained in the
Alluxio cache.

How it works:
➔TTL Cache Eviction Policies enforce time-to-live (TTL) policies on cached data. These
policies optimize cache efficiency by ensuring that less frequently accessed data is
automatically evicted based on the policy settings.
➔Priority-based Cache Eviction Policies ensure specific data stays in cache even if the
data would have otherwise been evicted based on the Least Recently Used (LRU)
cache eviction algorithm.
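
The interaction between LRU, TTL, and priority can be illustrated with a toy in-memory cache. This is a minimal sketch of the general idea, not Alluxio's implementation or configuration: the `ToyCache` class, its `pinned` flag, and the injected `clock` are all hypothetical.

```python
from collections import OrderedDict


class ToyCache:
    """Illustrative LRU cache with optional TTL and pinning (hypothetical)."""

    def __init__(self, capacity, clock):
        self.capacity = capacity
        self.clock = clock              # callable returning the current time in seconds
        self.entries = OrderedDict()    # key -> (value, expires_at or None, pinned)

    def put(self, key, value, ttl=None, pinned=False):
        expires_at = None if ttl is None else self.clock() + ttl
        self.entries[key] = (value, expires_at, pinned)
        self.entries.move_to_end(key)   # most recently used entries sit at the end
        self._evict()

    def get(self, key):
        self._drop_expired()
        if key not in self.entries:
            return None                 # cache miss
        self.entries.move_to_end(key)
        return self.entries[key][0]

    def _drop_expired(self):
        now = self.clock()
        expired = [k for k, v in self.entries.items() if v[1] is not None and v[1] <= now]
        for key in expired:
            del self.entries[key]

    def _evict(self):
        self._drop_expired()
        while len(self.entries) > self.capacity:
            # Victim is the least recently used entry that is not pinned.
            victim = next((k for k, v in self.entries.items() if not v[2]), None)
            if victim is None:
                break                   # everything is pinned; this toy allows overflow
            del self.entries[victim]


now = [0.0]                             # fake clock so the example is deterministic
cache = ToyCache(capacity=2, clock=lambda: now[0])
cache.put("model.bin", "weights", pinned=True)
cache.put("tmp.parquet", "rows")
cache.put("a.parquet", "rows")          # over capacity: plain LRU would evict model.bin
survived_pin = cache.get("model.bin")   # still cached because it is pinned
cache.put("b.parquet", "rows", ttl=10)  # evicts a.parquet, the LRU unpinned entry
now[0] = 11.0                           # advance past b.parquet's TTL
after_ttl = cache.get("b.parquet")      # expired, so this is a miss
```

In this sketch the pinned entry outlives entries that were touched more recently, while the TTL entry disappears on a timer regardless of how recently it was used.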

Alluxio S3 API: Lower I/O Latency by 40%
HTTP Persistent Connections

➔Reduces I/O Latency by 40%
➔Persisting the HTTP connection for multiple requests reduces I/O latency by eliminating the
overhead of opening/closing HTTP connections for each request.
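
The connection-reuse effect can be observed with Python's standard library alone. The sketch below runs a throwaway local HTTP/1.1 server (a stand-in, not the Alluxio S3 API endpoint) and counts how many TCP connections it accepts for five requests with and without reuse. It does not measure latency or reproduce the 40% figure; all names are illustrative.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"       # HTTP/1.1 keeps connections open by default
    tcp_connections = 0
    _lock = threading.Lock()

    def setup(self):                    # runs once per accepted TCP connection
        super().setup()
        with CountingHandler._lock:
            CountingHandler.tcp_connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass                            # silence per-request logging


def connections_used(port, requests, reuse):
    """Issue `requests` GETs and return how many TCP connections the server saw."""
    before = CountingHandler.tcp_connections
    conn = None
    for _ in range(requests):
        if conn is None:
            conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        conn.getresponse().read()       # drain the body so the socket can be reused
        if not reuse:
            conn.close()
            conn = None
    if conn is not None:
        conn.close()
    return CountingHandler.tcp_connections - before


server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

per_request = connections_used(port, 5, reuse=False)
persistent = connections_used(port, 5, reuse=True)
server.shutdown()
server.server_close()
print(f"new connection per request: {per_request} TCP connections")
print(f"persistent connection:      {persistent} TCP connection")
```

Each avoided connection saves a TCP handshake (and a TLS handshake when encryption is on), which is the overhead the slide refers to.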

Multipart Uploads

➔Improves object upload performance (POST) for large objects by splitting the object into multiple,
smaller pieces and parallelizing the upload process.
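
The split-and-parallelize idea behind multipart uploads can be sketched as follows. This is an in-memory illustration only: `upload_part` is a hypothetical stand-in for a network call, and nothing here uses the Alluxio S3 API or any S3 client library.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(data: bytes, part_size: int):
    """Return (part_number, chunk) pairs; part numbers start at 1, as in S3."""
    return [(i // part_size + 1, data[i:i + part_size])
            for i in range(0, len(data), part_size)]


def upload_part(store: dict, part_number: int, chunk: bytes) -> int:
    """Hypothetical stand-in for uploading one part; it only records the bytes."""
    store[part_number] = chunk
    return part_number


def multipart_upload(data: bytes, part_size: int, workers: int = 4) -> bytes:
    parts = split_into_parts(data, part_size)
    store = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Parts may finish in any order; each one is keyed by its part number.
        list(pool.map(lambda part: upload_part(store, *part), parts))
    # "Complete" step: reassemble the object in part-number order.
    return b"".join(store[number] for number in sorted(store))


payload = bytes(range(256)) * 100       # 25,600 bytes of sample data
parts = split_into_parts(payload, 4096)
assembled = multipart_upload(payload, 4096)
print(len(parts), assembled == payload)
```

Because every part carries its own number, a failed part can be retried on its own without resending the whole object.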

TLS Encryption
➔Improves data security by encrypting data accessed via the Alluxio S3 API Endpoint

Alluxio Data Now Accessible through Standard
Python FileSystem APIs
➔Alluxio’s Python SDK now integrates with the most popular AI frameworks,
including PyTorch, PyArrow, and Ray.
➔Python applications can now interact with various storage
backends through a unified Python filesystem interface, making it seamless to
access both local and remote storage systems.
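
The value of a single filesystem interface over several backends can be sketched with the standard library alone. The classes below are hypothetical and are not the Alluxio Python SDK's API; `InMemoryFS` merely stands in for a remote backend so the example runs anywhere.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class FileSystem(Protocol):
    """Hypothetical minimal interface shared by every backend."""

    def write_bytes(self, path: str, data: bytes) -> None: ...
    def read_bytes(self, path: str) -> bytes: ...


class LocalFS:
    """Backend that stores files under a local root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def write_bytes(self, path, data):
        target = self.root / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    def read_bytes(self, path):
        return (self.root / path).read_bytes()


class InMemoryFS:
    """Stand-in for a remote backend such as an object store."""

    def __init__(self):
        self.objects = {}

    def write_bytes(self, path, data):
        self.objects[path] = data

    def read_bytes(self, path):
        return self.objects[path]


def copy(src: FileSystem, dst: FileSystem, path: str) -> int:
    """Works with any pair of backends because it only uses the shared interface."""
    data = src.read_bytes(path)
    dst.write_bytes(path, data)
    return len(data)


with tempfile.TemporaryDirectory() as root:
    remote, local = InMemoryFS(), LocalFS(root)
    remote.write_bytes("datasets/train.csv", b"x,y\n1,2\n")
    copied = copy(remote, local, "datasets/train.csv")
    round_trip = local.read_bytes("datasets/train.csv")
print(copied, round_trip)
```

Application code such as `copy` never needs to know which backend it is talking to, which is the property the slide describes.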

More Improvements in Alluxio Enterprise AI 3.5
➔Alluxio’s new Index Service improves the performance of directory listing 3-5X
➔Alluxio’s new UFS Rate Limiter enables administrators to optimize resource
utilization by controlling the bandwidth an individual worker node can read from the
underlying file system (UFS).
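
One common way to cap read bandwidth is a token bucket. The slides do not say how the UFS Rate Limiter is implemented, so the sketch below is a generic illustration of that technique under assumed parameters, not Alluxio's mechanism; the class, its arguments, and the fake clock are hypothetical.

```python
class TokenBucket:
    """Generic token-bucket limiter (hypothetical; one token = one byte)."""

    def __init__(self, rate_bytes_per_s, burst_bytes, clock):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = float(burst_bytes)   # start with a full bucket
        self.clock = clock                 # callable returning seconds
        self.last = clock()

    def _refill(self):
        now = self.clock()
        earned = (now - self.last) * self.rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = now

    def try_consume(self, nbytes):
        """Return True if a read of `nbytes` may proceed now."""
        self._refill()
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


now = [0.0]                                # fake clock for a deterministic example
bucket = TokenBucket(rate_bytes_per_s=100, burst_bytes=100, clock=lambda: now[0])

first = bucket.try_consume(80)             # allowed: 100 tokens available
second = bucket.try_consume(50)            # denied: only 20 tokens left
now[0] = 0.5                               # half a second later, 50 tokens refilled
third = bucket.try_consume(50)             # allowed: 70 tokens available
print(first, second, third)
```

In real use the clock would be something like `time.monotonic`, and a denied read would wait or be retried rather than dropped.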
