Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform

Alluxio 288 views 21 slides Jul 01, 2024

About This Presentation

Alluxio Webinar
June 18, 2024

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Jianjian Xie (Staff Software Engineer, Alluxio)

As Trino users increasingly rely on cloud object storage for retrieving data, speed and cloud cost have become major challenges. The separation of com...


Slide Content

10x Faster Trino Queries on Your Data Platform
Jianjian Xie
Staff Software Engineer @ Alluxio

Jianjian Xie
- Staff Engineer @ Alluxio
- Trino Contributor
- PrestoDB Contributor

RubiX is OUT, Alluxio is IN

- January 2020: Trino 332 introduced Hive connector storage caching by RubiX/Qubole.
  - RubiX is no longer maintained
  - Does not support Iceberg/Hudi/Delta formats
  - Dependent on Hadoop and Hive ecosystem
- June 2023: Cache previewed by Alluxio developers at Trino Fest 2023.
- February 2024: Trino 439 introduced Trino File System Cache using Alluxio.
- June 2024: Jonas@Dune shared the production results at Trino Fest 2024.
  - ⚡ 20-30% faster queries
  - 70% fewer S3 GET requests

Source: A cache refresh for Trino, Trino Fest 2023, Trino Fest 2024

Glossary - Let’s Talk Cache

Trino File System Cache
The latest built-in file system cache, introduced in the Trino 439 release. It is based on the Alluxio caching library and replaced the RubiX caching library in Trino.
Read Trino blog for details.

Alluxio or Alluxio Distributed Cache
Full-capability distributed system, deployed as a standalone cluster (both Open Source and Enterprise Edition available).
Read edition comparison for details.

Alluxio Edge
Similar to Trino File System Cache: a lite version of Alluxio that is purpose built for Trino, to be deployed as a sidecar to Trino.

Which Cache Fits Your Need?

Maintainers
- Trino File System Cache: Actively maintained by Alluxio and Trino community
- Alluxio or Alluxio Distributed Cache: Actively maintained by Alluxio community

Availability
- Trino File System Cache: Since Trino 439 and onwards
- Alluxio or Alluxio Distributed Cache: Available since 2015

Deployment
- Trino File System Cache: A library in Trino worker processes
- Alluxio or Alluxio Distributed Cache: A standalone service running on independent processes

Cache Capacity
- Trino File System Cache: Leverage local disk NVMe or memory, also bound to local disks capacity
- Alluxio or Alluxio Distributed Cache: Cache capacity scales horizontally

Cache Sharing
- Trino File System Cache: Only accessible to the local Trino worker process for cached data
- Alluxio or Alluxio Distributed Cache: Cached data shareable across Trino clusters, as well as Spark and other frameworks

APIs
- Trino File System Cache: TrinoFileSystem internal to Trino
- Alluxio or Alluxio Distributed Cache: HadoopFileSystem, S3, POSIX (GA), Python FSSpec (experimental)

Trino File System Cache

Four Values of Trino File System Cache
- Boost Performance
- Save Costs
- Prevent Network Congestion
- Offload Under Storage

Key Features of Trino File System Cache

Caching Data
- Local SSD
- Memory

Connector Support
- Iceberg
- Hudi
- Delta Lake
- Hive

How to Enable Trino File System Cache?
From the view of a Trino user, nothing really changes.
fs.cache.enabled=true
fs.cache.directories=/tmp/cache
fs.cache.max-sizes=10GB
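
The three properties above can also be generated rather than typed by hand, for example when workers have more than one cache disk. The following is a minimal sketch, not part of the talk: the helper name `render_fs_cache_properties` is hypothetical, and it assumes that `fs.cache.directories` and `fs.cache.max-sizes` take comma-separated lists of equal length, with sizes written in Trino's data size units (such as `10GB`).

```python
def render_fs_cache_properties(cache_dirs, max_sizes):
    """Render fs.cache.* lines for a Trino catalog properties file.

    Hypothetical helper for illustration only. Assumes one max size
    per cache directory, both given as lists of strings.
    """
    if not cache_dirs:
        raise ValueError("at least one cache directory is required")
    if len(cache_dirs) != len(max_sizes):
        raise ValueError("need exactly one max size per cache directory")
    return "\n".join([
        "fs.cache.enabled=true",
        "fs.cache.directories=" + ",".join(cache_dirs),
        "fs.cache.max-sizes=" + ",".join(max_sizes),
    ])


if __name__ == "__main__":
    # Single-directory case, matching the values on the slide.
    print(render_fs_cache_properties(["/tmp/cache"], ["10GB"]))
```

Validating the two lists together catches a mismatched directory/size pair before the file ever reaches a worker.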

A Deeper Dive - How Trino File Cache Works
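
The diagram for this slide did not survive in the text, so here is a rough sketch of the general idea behind a local read-through file cache. It is an illustration under assumptions, not the Trino or Alluxio implementation: the page-granular layout, the tiny `PAGE_SIZE`, the LRU eviction policy, and the names `PageCache` and `make_remote` are all hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4  # bytes; deliberately tiny so the example is easy to trace


class PageCache:
    """Read-through cache keyed by (path, page index) with LRU eviction."""

    def __init__(self, fetch_page, capacity_pages):
        self._fetch_page = fetch_page  # callable(path, page_index) -> bytes
        self._capacity = capacity_pages
        self._pages = OrderedDict()    # insertion order doubles as LRU order
        self.hits = 0
        self.misses = 0

    def _get_page(self, path, index):
        key = (path, index)
        if key in self._pages:
            self._pages.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._pages[key]
        self.misses += 1
        data = self._fetch_page(path, index)  # go to remote storage
        self._pages[key] = data
        if len(self._pages) > self._capacity:
            self._pages.popitem(last=False)   # evict least recently used
        return data

    def read(self, path, offset, length):
        """Return `length` bytes starting at `offset`, page by page."""
        if length <= 0:
            return b""
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        buf = b"".join(self._get_page(path, i) for i in range(first, last + 1))
        start = offset - first * PAGE_SIZE
        return buf[start:start + length]


def make_remote(files):
    """Stand-in for object storage: serves fixed-size pages from a dict."""
    def fetch_page(path, index):
        return files[path][index * PAGE_SIZE:(index + 1) * PAGE_SIZE]
    return fetch_page
```

In this sketch a repeated read of the same byte range is served entirely from the local pages, which is the effect the "fewer S3 GET requests" figures earlier in the deck point at. The bounded `capacity_pages` mirrors the point that a worker-local cache is limited by local disk or memory.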

File System Caching at Uber Scale
3 Clusters, 1500 Nodes
Source: https://www.uber.com/blog/speed-up-presto-with-alluxio-local-cache/

50% Input Read Performance
10% Data Read Traffic to HDFS

Alluxio Distributed Cache

Alluxio Distributed Cache Architecture
Compute Node

Unify Data Lake Across Multiple Geographic Regions at Expedia

PROBLEMS ENCOUNTERED
- Data silos caused by different brands/teams ingesting data dispersed across multiple regions in AWS
- Central analytics platform performing queries across data silos suffered from poor user experience and long time to insight
- Manual replication resulted in inefficiencies, operational overheads and expensive S3 egress cost

ALLUXIO'S SOLUTION
- Unify data silos without the need to copy or move data
[Diagram labels: MAIN REGION: CENTRAL ANALYTICS; MOUNTED; US-WEST-1, US-WEST-2, US-EAST-1, US-EAST-2; TEAM A, TEAM B, TEAM C]

RESULTS ACHIEVED
- Enhanced user experience with consistent & high performance analytics, reducing time to insights
- Reduced cost per query

50%

Multi-Level Cache

Multi-level Cache: Best of Both Worlds
[Diagram labels: Trino Worker, Trino, Trino File System Cache, Alluxio Distributed Cache]
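
One way to picture the multi-level setup is as a chain of read-through tiers: each Trino worker checks its own local cache first, falls through to a shared distributed cache, and only then reaches object storage. The sketch below illustrates that lookup order under assumptions. The class names `CacheTier` and `ObjectStore`, the tier names, and the key `part-0` are hypothetical, and real deployments add eviction, consistency handling, and networking that are omitted here.

```python
class ObjectStore:
    """Stand-in for under storage (for example S3); counts requests."""

    def __init__(self, objects):
        self.objects = objects
        self.requests = 0

    def get(self, key):
        self.requests += 1
        return self.objects[key]


class CacheTier:
    """One level in a read-through chain; misses fall through to `backing`."""

    def __init__(self, name, backing):
        self.name = name
        self.backing = backing  # next level: another CacheTier or ObjectStore
        self.store = {}
        self.hits = 0

    def get(self, key):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        value = self.backing.get(key)  # miss: ask the next level down
        self.store[key] = value        # populate this level on the way back
        return value


if __name__ == "__main__":
    s3 = ObjectStore({"part-0": b"rows"})
    distributed = CacheTier("alluxio-distributed", s3)
    worker_a = CacheTier("worker-a-local", distributed)
    worker_b = CacheTier("worker-b-local", distributed)

    worker_a.get("part-0")  # cold read: reaches object storage once
    worker_b.get("part-0")  # served by the shared distributed tier
    worker_a.get("part-0")  # served by worker A's local tier
    print(s3.requests, distributed.hits, worker_a.hits)
```

The local tier gives the shortest path for repeated reads on one worker, while the shared tier lets a second worker reuse data the first one already pulled, which is the "best of both worlds" combination named on the slide.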

Ongoing Work

Upcoming Trino Native Alluxio Distributed Cache
Avoid old & complex HDFS interface with native Trino interface implementation

Takeaways

Takeaways: Which Cache Fits Your Need?

Maintainers
- Trino File System Cache: Actively maintained by Alluxio and Trino community
- Alluxio or Alluxio Distributed Cache: Actively maintained by Alluxio community

Availability
- Trino File System Cache: Since Trino 439 and onwards
- Alluxio or Alluxio Distributed Cache: Available since 2015

Deployment
- Trino File System Cache: A library in Trino worker processes
- Alluxio or Alluxio Distributed Cache: A standalone service running on independent processes

Cache Capacity
- Trino File System Cache: Leverage local disk NVMe or memory, also bound to local disks capacity
- Alluxio or Alluxio Distributed Cache: Cache capacity scales horizontally

Cache Sharing
- Trino File System Cache: Only accessible to the local Trino worker process for cached data
- Alluxio or Alluxio Distributed Cache: Cached data shareable across Trino clusters, as well as Spark and other frameworks

APIs
- Trino File System Cache: TrinoFileSystem internal to Trino
- Alluxio or Alluxio Distributed Cache: HadoopFileSystem, S3, POSIX (GA), Python FSSpec (experimental)

Thank You
Any Questions?
Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!