Alluxio Product School Webinar - Boosting Trino Performance.
231 views
25 slides
Mar 23, 2023
Slide 1 of 25
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
About This Presentation
Alluxio Product School Webinar
Mar. 23, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Beinan Wang (Tech Lead, Alluxio)
In March’s Product School session, Beinan, an Alluxio tech lead, Presto committer, and Trino contributor, will share expert tips for tuning Trino perfor...
Alluxio Product School Webinar
Mar. 23, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Beinan Wang (Tech Lead, Alluxio)
In March’s Product School session, Beinan, an Alluxio tech lead, Presto committer, and Trino contributor, will share expert tips for tuning Trino performance. In addition, he will demonstrate how to integrate Trino with Alluxio as a caching layer using connectors for Hive, Iceberg, Hudi, or Delta Lake.
Size: 1.47 MB
Language: en
Added: Mar 23, 2023
Slides: 25 pages
Slide Content
Boosting Trino
Performance
Mar 23, 2023
Agenda
Overview &
Architecture
01
Trino, HMS and
Iceberg Overview
SQL Execution
02
SQL Execution
Introduction
Accelerating
SQL
03Best practice of
Accelerating SQL
Q&A
04
Questions & Answers
2
Overview & Architecture
01
3
Trino Overview
●Distributed SQL Query Engine
○ANSI SQL on Hive data warehouse, Hudi, Iceberg, Kafka, Druid and etc.
○Designed to be interactive
○Access to petabytes of data
Hive Tables Overview
●Using Hive metastore to serve the metadata
○Backed by a RDBMS(Mysql)
○Limited scalability
○
6
REF: https://tabular.io/blog/iceberg-metadata-indexing/
Trino Tuning Tips
02
7
Plan Generation
8 From Trino.io
Optimization
9
EXPLAIN vs EXPLAIN ANALYZE
10
EXPLAIN: plan structure + cost estimates
EXPLAIN ANALYZE: plan structure + cost estimates + actual execution statistics
Plan Generation & Optimization
11
Scheduling of Hive Tables
12
https://blog.bigdataboutique.com/2022/09/hive-tables-and-whats-next-for-modern-data-platf
orms-1xts1m
S3 getObject API
https://docs.aws.amazon.com/Ama
zonS3/latest/API/API_GetObject.ht
ml
Best Practices
03
16
○Set max memory per query
○Set max memory per node
■ If we could speed up queries, we would be able to reduce
max_concurrent and then increase the max memory per node and at the
same time keep the same throughput
○Memory allocation deadlock
■Try query.low-memory-killer.policy
Make Sure There Is Sufficient Memory
17
●Consider using column-based data format for your data files.
●ORC might have a better performance than parquet (especially when using
prestodb)
●Avoid using CSV or json format
●Compression (SNAPPY, LZ4, ZSTD, and GZIP)
●Use partitioning. You can create a partitioned version of a table with a CTAS
https://prestodb.io/docs/current/sql/create-table-as.html by adding the
partitioned_by clause to the CREATE TABLE.
●Use bucketing. Do this by adding the bucketed_by clause to your CREATE
TABLE statement. You will also need to specify bucket_count.
Optimize File Format & Table Layout
18
●Collect table statistics to ensure the most efficient query plan is produced,
which means queries run as fast as possible.
●Use the sql ANALYZE TABLE <tablename> command to do this. Repeat the
ANALYZE TABLE commands for all tables involved in queries on a regular basis,
typically when data has substantially changed (e.g. new data arrived / after an
ETL cycle has completed).
Collect Hive Table Stats
19
●Enable CBO (It’s default value already)
●“LARGE LEFT” (put the large table on the left side of the join).
●Let Trino do the job in case it’s using default settings.
○SET session join_distribution_type=’AUTOMATIC’;
○SET session join_reordering_strategy=’AUTOMATIC’;
Join Optimization
20
●You should enable Dynamic Filtering when 1 or more joins are in-play, especially
if there’s a smaller dimension table being used to probe a larger fact table for
example. Dynamic Filtering is pushed down to ORC and Parquet readers, and
can accelerate queries on partitioned as well as non-partitioned tables.
Dynamic Filtering is a join optimization intended to improve performance of
Hash JOINs. Enable this with:
○SET session enable_dynamic_filtering=TRUE;
Dynamic Filtering
21
Trino + Alluxio Multi-Level Cache
22
Configurations Recommendation (JVM)
●Avoid using large clusters
○<= 400 workers
●Using JDK 11 instead of JDK 8
○JDK 11 provides a much better performance for both runtime and GC
●Using G1GC for large heap (> 10G)
●DO NOT over tune JVM
○DO NOT touch MaxNewSize and NewSize
○DO NOT touch MaxTenuringThreshold
○DO NOT touch InitiatingHeapOccupancyPercent
●Larger Xmx remediate 95% of GC issues
23
Q&A
04
24
Complete the Community
Survey for a chance to win
an Amazon gift card!