Alluxio Webinar
June 18, 2024
For more Alluxio Events: https://www.alluxio.io/events/
As Trino users increasingly rely on cloud object storage for retrieving data, speed and cloud cost have become major challenges. The separation of compute and storage creates latency challenges when querying datasets; scanning data between storage and compute tiers becomes I/O bound. On the other hand, cloud API costs related to GET/LIST operations and cross-region data transfer add up quickly.
The newly introduced Trino file system cache by Alluxio aims to overcome these challenges. In this session, Jianjian will dive into Trino data caching strategies, share the latest test results, and discuss the multi-level caching architecture. This architecture makes Trino 10x faster for data lakes of any scale, from GB to EB.
What you will learn:
- Challenges relating to the speed and costs of running Trino in the cloud
- The new Trino file system cache feature overview, including the latest development status and test results
- A multi-level cache framework for maximized speed, including Trino file system cache and Alluxio distributed cache
- Real-world cases, including a large online payment firm and a top ridesharing company
- The future roadmap of Trino file system cache and Trino-Alluxio integration
Added: Jul 01, 2024
Slides: 21 pages
Slide Content
10x Faster Trino Queries on Your Data Platform
Jianjian Xie
Staff Software Engineer @ Alluxio
RubiX is OUT, Alluxio is IN
January 2020: Trino 332 introduced Hive connector storage caching by RubiX/Qubole
- RubiX is no longer maintained
- Does not support Iceberg/Hudi/Delta formats
- Dependent on the Hadoop and Hive ecosystem
June 2023: Alluxio developers previewed the cache at Trino Fest 2023
February 2024: Trino 439 introduced Trino File System Cache using Alluxio
June 2024: Jonas@Dune shared the production results at Trino Fest 2024
- 20~30% faster queries
- 70% fewer S3 GET requests
Source: A cache refresh for Trino, Trino Fest 2023, Trino Fest 2024
Glossary - Let’s Talk Cache
Trino File System Cache: The latest built-in file system cache, introduced in the Trino 439 release. It is based on the Alluxio caching library and replaced the RubiX caching library in Trino. Read the Trino blog for details.
Alluxio or Alluxio Distributed Cache: A full-capability distributed system, deployed as a standalone cluster (both Open Source and Enterprise Edition available). Read the edition comparison for details.
Alluxio Edge: Similar to Trino File System Cache, a lite version of Alluxio that is purpose-built for Trino and deployed as a sidecar to Trino.
Which Cache Fits Your Need?
Maintainers
- Trino File System Cache: Actively maintained by the Alluxio and Trino communities
- Alluxio or Alluxio Distributed Cache: Actively maintained by the Alluxio community
Availability
- Trino File System Cache: Trino 439 and onwards
- Alluxio or Alluxio Distributed Cache: Available since 2015
Deployment
- Trino File System Cache: A library in Trino worker processes
- Alluxio or Alluxio Distributed Cache: A standalone service running on independent processes
Cache Capacity
- Trino File System Cache: Leverages local disk (NVMe) or memory; bound to local disk capacity
- Alluxio or Alluxio Distributed Cache: Cache capacity scales horizontally
Cache Sharing
- Trino File System Cache: Cached data is accessible only to the local Trino worker process
- Alluxio or Alluxio Distributed Cache: Cached data is shareable across Trino clusters, as well as Spark and other frameworks
APIs
- Trino File System Cache: TrinoFileSystem, internal to Trino
- Alluxio or Alluxio Distributed Cache: HadoopFileSystem, S3, POSIX (GA), Python FSSpec (experimental)
Trino File System Cache
Four Values of Trino File System Cache
- Boost Performance
- Save Costs
- Prevent Network Congestion
- Offload Under Storage
Key Features of Trino File System Cache
Caching Data
- Local SSD
- Memory
Connector Support
- Iceberg
- Hudi
- Delta Lake
- Hive
How to Enable Trino File System Cache?
From a Trino user's point of view, nothing really changes. Set the following in the catalog properties file:
fs.cache.enabled=true
fs.cache.directories=/tmp/cache
fs.cache.max-sizes=10GB
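As a small illustration of how these three properties fit together, the sketch below renders them from a list of cache directories and a matching list of size limits. The helper name `render_cache_properties` is hypothetical, and the set of accepted size units (B, kB, MB, GB, TB, PB, as in `10GB`) is an assumption rather than something stated on the slide. The one rule it enforces is that each cache directory gets exactly one maximum size, in the same order.

```python
import re

# Units assumed to be valid for data sizes; treat this list as an assumption.
SIZE_PATTERN = re.compile(r"^\d+(\.\d+)?(B|kB|MB|GB|TB|PB)$")


def render_cache_properties(directories, max_sizes):
    """Return catalog property lines for the file system cache (hypothetical helper)."""
    if len(directories) != len(max_sizes):
        raise ValueError("need exactly one max size per cache directory")
    for size in max_sizes:
        if not SIZE_PATTERN.match(size):
            raise ValueError(f"unrecognized data size: {size!r}")
    return "\n".join([
        "fs.cache.enabled=true",
        "fs.cache.directories=" + ",".join(directories),
        "fs.cache.max-sizes=" + ",".join(max_sizes),
    ])


print(render_cache_properties(["/tmp/cache"], ["10GB"]))
```

With two disks, the same helper would be called with two directories and two sizes, which keeps the comma-separated lists aligned.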
A Deeper Dive - How Trino File Cache Works
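As a rough mental model only (not the actual Trino or Alluxio implementation), a worker-local file system cache can be pictured as a read-through cache keyed by file path and page index, with least-recently-used eviction once a size limit is reached. The class name, the tiny page size, and the `fetch` callback below are illustrative assumptions, and a plain dict stands in for object storage.

```python
from collections import OrderedDict


class ReadThroughPageCache:
    """Toy read-through page cache with LRU eviction (illustrative only)."""

    def __init__(self, fetch, page_size=4, max_pages=2):
        self.fetch = fetch          # fetch(path, offset, length) -> bytes from remote storage
        self.page_size = page_size
        self.max_pages = max_pages
        self.pages = OrderedDict()  # (path, page_index) -> bytes
        self.hits = 0
        self.misses = 0

    def read_page(self, path, page_index):
        key = (path, page_index)
        if key in self.pages:
            self.hits += 1
            self.pages.move_to_end(key)  # mark as most recently used
            return self.pages[key]
        self.misses += 1
        data = self.fetch(path, page_index * self.page_size, self.page_size)
        self.pages[key] = data
        if len(self.pages) > self.max_pages:
            self.pages.popitem(last=False)  # evict the least recently used page
        return data

    def read(self, path, offset, length):
        """Serve a byte range by stitching together whole pages."""
        if length <= 0:
            return b""
        first = offset // self.page_size
        last = (offset + length - 1) // self.page_size
        buf = b"".join(self.read_page(path, i) for i in range(first, last + 1))
        start = offset - first * self.page_size
        return buf[start:start + length]


# Stand-in for object storage: a dict of path -> bytes.
remote = {"s3://bucket/table/part-0": b"abcdefghijkl"}
remote_calls = []


def fetch(path, offset, length):
    remote_calls.append((offset, length))
    return remote[path][offset:offset + length]


cache = ReadThroughPageCache(fetch)
first = cache.read("s3://bucket/table/part-0", 2, 4)   # misses: loads pages 0 and 1
second = cache.read("s3://bucket/table/part-0", 2, 4)  # hits: no remote call
print(first, second, cache.hits, cache.misses, len(remote_calls))
```

The second read is answered entirely from cached pages, which is the effect behind fewer GET requests and lower read latency for repeated scans.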
File System Caching at Uber Scale
3 Clusters, 1500 Nodes
Source: https://www.uber.com/blog/speed-up-presto-with-alluxio-local-cache/
50%: Input Read Performance
10%: Data Read Traffic to HDFS
Unify Data Lake Across Multiple Geographic Regions at Expedia
PROBLEMS ENCOUNTERED
- Data silos caused by different brands/teams ingesting data dispersed across multiple regions in AWS
- Central analytics platform performing queries across data silos suffered from poor user experience and long time to insight
- Manual replication resulted in inefficiencies, operational overhead, and expensive S3 egress costs
ALLUXIO'S SOLUTION
[Diagram: main region (central analytics) with mounted regions; labels include US-WEST-2, US-WEST-1, US-EAST-1, US-EAST-2 and Teams A, B, C]
RESULTS ACHIEVED
- Unify data silos without the need to copy or move data
- Enhanced user experience with consistent and high-performance analytics, reducing time to insights
- Reduced cost per query
50%
Multi-Level Cache
Multi-level Cache: Best of Both Worlds
- Trino Worker: Trino with Trino File System Cache
- Alluxio Distributed Cache
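One way to picture the multi-level idea: a read first checks the worker-local cache, then the shared distributed cache, and only then goes to object storage, filling the levels above on the way back. The sketch below uses plain dicts for each tier, and every name in it is hypothetical; it is not how Trino and Alluxio are actually wired together.

```python
def make_tiered_reader(local_cache, distributed_cache, object_store, stats):
    """Build a read function that consults local, then shared, then remote storage."""
    def read(key):
        if key in local_cache:
            stats["local"] += 1
            return local_cache[key]
        if key in distributed_cache:
            stats["distributed"] += 1
            value = distributed_cache[key]
        else:
            stats["storage"] += 1
            value = object_store[key]
            distributed_cache[key] = value  # populate the shared tier
        local_cache[key] = value            # populate the worker-local tier
        return value
    return read


object_store = {"part-0": b"rows"}
shared = {}  # one distributed cache shared by both workers
stats_a = {"local": 0, "distributed": 0, "storage": 0}
stats_b = {"local": 0, "distributed": 0, "storage": 0}

worker_a = make_tiered_reader({}, shared, object_store, stats_a)
worker_b = make_tiered_reader({}, shared, object_store, stats_b)

worker_a("part-0")  # cold read: goes to object storage
worker_a("part-0")  # served from worker A's local cache
worker_b("part-0")  # served from the shared distributed cache
print(stats_a, stats_b)
```

Worker B never touches object storage here, which reflects the difference between a cache private to one worker and one shared across workers and clusters.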
Ongoing Work
Upcoming Trino Native Alluxio Distributed Cache
Avoids the old and complex HDFS interface by using a native Trino interface implementation
Takeaways
Takeaways: Which Cache Fits Your Need?
Recap of the Trino File System Cache vs. Alluxio Distributed Cache comparison from the "Which Cache Fits Your Need?" slide above.
Thank You
Any Questions?