Alluxio Product School Webinar - Boosting Trino Performance.

231 views 25 slides Mar 23, 2023
Slide 1
Slide 1 of 25
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25

About This Presentation

Alluxio Product School Webinar
Mar. 23, 2023

For more Alluxio Events: https://www.alluxio.io/events/

Speaker: Beinan Wang (Tech Lead, Alluxio)

In March’s Product School session, Beinan, an Alluxio tech lead, Presto committer, and Trino contributor, will share expert tips for tuning Trino perfor...


Slide Content

Boosting Trino
Performance

Mar 23, 2023

Agenda
Overview &
Architecture
01
Trino, HMS and
Iceberg Overview
SQL Execution
02
SQL Execution
Introduction
Accelerating
SQL
03Best practice of
Accelerating SQL

Q&A
04
Questions & Answers

2

Overview & Architecture
01
3

Trino Overview
●Distributed SQL Query Engine
○ANSI SQL on Hive data warehouse, Hudi, Iceberg, Kafka, Druid and etc.
○Designed to be interactive
○Access to petabytes of data

●Open-source
○github.com/trinodb

●Use Cases
○Ad-hoc
○BI tools
○Dashboard
○A/B testing
○ETL
4

Trino Architecture
5

Hive Tables Overview
●Using Hive metastore to serve the metadata
○Backed by a RDBMS(Mysql)
○Limited scalability

6
REF: https://tabular.io/blog/iceberg-metadata-indexing/

Trino Tuning Tips
02
7

Plan Generation
8 From Trino.io

Optimization
9

EXPLAIN vs EXPLAIN ANALYZE
10
EXPLAIN: plan structure + cost estimates
EXPLAIN ANALYZE: plan structure + cost estimates + actual execution statistics

Plan Generation & Optimization
11

Scheduling of Hive Tables
12
https://blog.bigdataboutique.com/2022/09/hive-tables-and-whats-next-for-modern-data-platf
orms-1xts1m

Scan Parquet Files
13 https://parquet.apache.org/docs/file-format/

Predicate Pushdown Resource Usage
14

S3 getObject API
https://docs.aws.amazon.com/Ama
zonS3/latest/API/API_GetObject.ht
ml

Best Practices
03
16

○Set max memory per query
○Set max memory per node
■ If we could speed up queries, we would be able to reduce
max_concurrent and then increase the max memory per node and at the
same time keep the same throughput
○Memory allocation deadlock
■Try query.low-memory-killer.policy



Make Sure There Is Sufficient Memory
17

●Consider using column-based data format for your data files.
●ORC might have a better performance than parquet (especially when using
prestodb)
●Avoid using CSV or json format
●Compression (SNAPPY, LZ4, ZSTD, and GZIP)
●Use partitioning. You can create a partitioned version of a table with a CTAS
https://prestodb.io/docs/current/sql/create-table-as.html by adding the
partitioned_by clause to the CREATE TABLE.
●Use bucketing. Do this by adding the bucketed_by clause to your CREATE
TABLE statement. You will also need to specify bucket_count.

Optimize File Format & Table Layout
18

●Collect table statistics to ensure the most efficient query plan is produced,
which means queries run as fast as possible.
●Use the sql ANALYZE TABLE <tablename> command to do this. Repeat the
ANALYZE TABLE commands for all tables involved in queries on a regular basis,
typically when data has substantially changed (e.g. new data arrived / after an
ETL cycle has completed).


Collect Hive Table Stats
19

●Enable CBO (It’s default value already)
●“LARGE LEFT” (put the large table on the left side of the join).

●Let Trino do the job in case it’s using default settings.
○SET session join_distribution_type=’AUTOMATIC’;
○SET session join_reordering_strategy=’AUTOMATIC’;



Join Optimization
20

●You should enable Dynamic Filtering when 1 or more joins are in-play, especially
if there’s a smaller dimension table being used to probe a larger fact table for
example. Dynamic Filtering is pushed down to ORC and Parquet readers, and
can accelerate queries on partitioned as well as non-partitioned tables.
Dynamic Filtering is a join optimization intended to improve performance of
Hash JOINs. Enable this with:
○SET session enable_dynamic_filtering=TRUE;

Dynamic Filtering
21

Trino + Alluxio Multi-Level Cache
22

Configurations Recommendation (JVM)
●Avoid using large clusters
○<= 400 workers

●Using JDK 11 instead of JDK 8
○JDK 11 provides a much better performance for both runtime and GC
●Using G1GC for large heap (> 10G)
●DO NOT over tune JVM
○DO NOT touch MaxNewSize and NewSize
○DO NOT touch MaxTenuringThreshold
○DO NOT touch InitiatingHeapOccupancyPercent
●Larger Xmx remediate 95% of GC issues



23

Q&A
04
24
Complete the Community
Survey for a chance to win
an Amazon gift card!

THANK YOU
25