Alluxio Product School Webinar - Boosting Trino Performance.

231 views 25 slides Mar 23, 2023

Slide 1 of 25

About This Presentation

Alluxio Product School Webinar
Mar. 23, 2023

For more Alluxio Events: https://www.alluxio.io/events/

Speaker: Beinan Wang (Tech Lead, Alluxio)

In March’s Product School session, Beinan, an Alluxio tech lead, Presto committer, and Trino contributor, will share expert tips for tuning Trino perfor...

Size: 1.47 MB

Language: en

Added: Mar 23, 2023

Slides: 25 pages

Slide Content

Boosting Trino
Performance

Mar 23, 2023

Agenda
Overview &
Architecture
01
Trino, HMS and
Iceberg Overview
SQL Execution
02
SQL Execution
Introduction
Accelerating
SQL
03Best practice of
Accelerating SQL

Q&A
04
Questions & Answers

2

Overview & Architecture
01
3

Trino Overview
●Distributed SQL Query Engine
○ANSI SQL on Hive data warehouse, Hudi, Iceberg, Kafka, Druid and etc.
○Designed to be interactive
○Access to petabytes of data

●Open-source
○github.com/trinodb

●Use Cases
○Ad-hoc
○BI tools
○Dashboard
○A/B testing
○ETL
4

Trino Architecture
5

Hive Tables Overview
●Using Hive metastore to serve the metadata
○Backed by a RDBMS(Mysql)
○Limited scalability
○
6
REF: https://tabular.io/blog/iceberg-metadata-indexing/

Trino Tuning Tips
02
7

Plan Generation
8 From Trino.io

Optimization
9

EXPLAIN vs EXPLAIN ANALYZE
10
EXPLAIN: plan structure + cost estimates
EXPLAIN ANALYZE: plan structure + cost estimates + actual execution statistics

Plan Generation & Optimization
11

Scheduling of Hive Tables
12
https://blog.bigdataboutique.com/2022/09/hive-tables-and-whats-next-for-modern-data-platf
orms-1xts1m

Scan Parquet Files
13 https://parquet.apache.org/docs/file-format/

Predicate Pushdown Resource Usage
14

S3 getObject API
https://docs.aws.amazon.com/Ama
zonS3/latest/API/API_GetObject.ht
ml

Best Practices
03
16

○Set max memory per query
○Set max memory per node
■ If we could speed up queries, we would be able to reduce
max_concurrent and then increase the max memory per node and at the
same time keep the same throughput
○Memory allocation deadlock
■Try query.low-memory-killer.policy

Make Sure There Is Sufﬁcient Memory
17

●Consider using column-based data format for your data ﬁles.
●ORC might have a better performance than parquet (especially when using
prestodb)
●Avoid using CSV or json format
●Compression (SNAPPY, LZ4, ZSTD, and GZIP)
●Use partitioning. You can create a partitioned version of a table with a CTAS
https://prestodb.io/docs/current/sql/create-table-as.html by adding the
partitioned_by clause to the CREATE TABLE.
●Use bucketing. Do this by adding the bucketed_by clause to your CREATE
TABLE statement. You will also need to specify bucket_count.

Optimize File Format & Table Layout
18

●Collect table statistics to ensure the most efﬁcient query plan is produced,
which means queries run as fast as possible.
●Use the sql ANALYZE TABLE <tablename> command to do this. Repeat the
ANALYZE TABLE commands for all tables involved in queries on a regular basis,
typically when data has substantially changed (e.g. new data arrived / after an
ETL cycle has completed).

Collect Hive Table Stats
19

●Enable CBO (It’s default value already)
●“LARGE LEFT” (put the large table on the left side of the join).

●Let Trino do the job in case it’s using default settings.
○SET session join_distribution_type=’AUTOMATIC’;
○SET session join_reordering_strategy=’AUTOMATIC’;

Join Optimization
20

●You should enable Dynamic Filtering when 1 or more joins are in-play, especially
if there’s a smaller dimension table being used to probe a larger fact table for
example. Dynamic Filtering is pushed down to ORC and Parquet readers, and
can accelerate queries on partitioned as well as non-partitioned tables.
Dynamic Filtering is a join optimization intended to improve performance of
Hash JOINs. Enable this with:
○SET session enable_dynamic_ﬁltering=TRUE;

Dynamic Filtering
21

Trino + Alluxio Multi-Level Cache
22

Conﬁgurations Recommendation (JVM)
●Avoid using large clusters
○<= 400 workers

●Using JDK 11 instead of JDK 8
○JDK 11 provides a much better performance for both runtime and GC
●Using G1GC for large heap (> 10G)
●DO NOT over tune JVM
○DO NOT touch MaxNewSize and NewSize
○DO NOT touch MaxTenuringThreshold
○DO NOT touch InitiatingHeapOccupancyPercent
●Larger Xmx remediate 95% of GC issues

23

Q&A
04
24
Complete the Community
Survey for a chance to win
an Amazon gift card!

Alluxio Product School Webinar - Boosting Trino Performance.

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Alluxio Product School Webinar - Boosting Trino Performance.

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Slide 25

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

MGV Residential Design projects for different clients, including a New Mexico Adobe project-1-.pdf

EUNITED_Advocacy and Public Engagement through Visual Media

DESIGN THINKINGGG PPT 2 TOPIC IDEATION.pptx

DESIGN THINKING CHAPTER 1 PPTT PPT 1.pptx

Hinduism and Its History - PowerPoint Slides.pptx

Service Attributes of Manufactured Parts.pptx