In the last decade, Apache Parquet has become the standard format to store tabular data on disk regardless of the technology stack used. This is due to its read/write performance, efficient compression, interoperability and especially its outstanding performance with the default settings.
While these default settings and access patterns already provide decent performance, by understanding the format in more detail and using recent developments, one can get much better performance, smaller files, and utilise Parquet's newer partial reading features to read even smaller subsets of a file for a given query.
This talk aims to provide insight into the Parquet format and the recent developments that are useful for end users' daily workflows. The only prior knowledge needed is what a DataFrame / tabular data is.
About me
•Uwe Korn
https://mastodon.social/@xhochy / @xhochy
•CTO at Data Science startup QuantCo
•Previously worked as a Data Engineer
•A lot of OSS, notably Apache {Arrow, Parquet} and conda-forge
•PyData Südwest Co-Organizer
Apache Parquet
1.Columnar, on-disk storage format
2.Started in 2012 by Cloudera and Twitter
3.Later, it became Apache Parquet
4.Fall 2016 brought full Python & C++ Support
5.State-of-the-art since the Hadoop era, still going strong
Clear benefits
1.Columnar makes vectorized operations fast
2.Efficient encodings and compression make it small
3.Predicate-pushdown brings computation to the I/O layer
4.Language-independent and widespread; common exchange format
Constructing Parquet Files
Parquet with pandas
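A minimal sketch of writing and reading Parquet from pandas via the pyarrow engine; the file name and columns are made up for illustration, and extra keyword arguments such as the compression codec are passed through to the engine.

import pandas as pd

df = pd.DataFrame({"vendor_id": [1, 2, 1], "fare": [7.5, 12.0, 3.2]})

# Write via the pyarrow engine; extra keywords are forwarded to it.
df.to_parquet("trips.parquet", engine="pyarrow", compression="zstd")

# Read back only the columns we need (column projection).
fares = pd.read_parquet("trips.parquet", columns=["fare"])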
Parquet with polars
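The same sketch with polars; write_parquet and scan_parquet are existing polars APIs, the data and column names are again illustrative.

import polars as pl

df = pl.DataFrame({"vendor_id": [1, 2, 1], "fare": [7.5, 12.0, 3.2]})

# Eager write with ZStandard compression.
df.write_parquet("trips.parquet", compression="zstd")

# Lazy scan: the filter is pushed down so only matching rows are materialised.
expensive = pl.scan_parquet("trips.parquet").filter(pl.col("fare") > 10).collect()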
Anatomy of a file
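One way to explore this anatomy yourself is through pyarrow's metadata objects; a small sketch, reusing the file written above:

import pyarrow.parquet as pq

pf = pq.ParquetFile("trips.parquet")

# File-level footer: schema, number of RowGroups, created-by, ...
print(pf.metadata)
print(pf.schema_arrow)

# Per-RowGroup, per-column-chunk details: encodings, compression, statistics.
print(pf.metadata.row_group(0).column(0))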
Tuning
Knobs to tune
1.Compression Algorithm
2.Compression Level
3.RowGroup size
4.Encodings
Data Types?
•Well, actually…
•…it doesn’t save much on disk.
•By choosing the optimal types (lossless cast to e.g. float32 or uint8) on a month of New York Taxi trips (see the sketch below):
Saves 963 bytes of 20.6 MiB
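A sketch of the kind of lossless downcast meant here, using pandas; the file name is illustrative and the column handling is deliberately simplified (float columns are only shrunk when the round-trip back to float64 is exact).

import os

import numpy as np
import pandas as pd

def downcast_lossless(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Shrink integer columns to the smallest unsigned type that still fits.
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="unsigned")
    # Use float32 only where no precision is lost.
    for col in out.select_dtypes(include=[np.floating]).columns:
        as32 = out[col].astype("float32")
        if as32.astype("float64").equals(out[col]):
            out[col] = as32
    return out

df = pd.read_parquet("yellow_tripdata_2021-01.parquet")
df.to_parquet("original.parquet")
downcast_lossless(df).to_parquet("downcast.parquet")
print(os.path.getsize("original.parquet"), os.path.getsize("downcast.parquet"))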
Compression
Compression Algorithm
•Datasets:
•New York Yellow Taxi Trips 2021-01
•New York Yellow Taxi Trips 2021-01 with a custom prediction
•gov.uk (House) Price Paid dataset
•COVID-19 Epidemiology
•Time measurements: Pick the median of five runs
Compression Level
1.For Brotli, ZStandard and GZIP, we can tune the level (see the sketch below)
2.Snappy and "none" have a fixed compression level.
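As a sketch, both the algorithm and (where supported) the level can be set when writing with pyarrow; pandas forwards the same keywords to the engine. The tiny table is a stand-in for a real dataset.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"fare": [7.5, 12.0, 3.2]})

# Algorithm and level in one call; Snappy and "none" ignore the level.
pq.write_table(table, "zstd3.parquet", compression="zstd", compression_level=3)
pq.write_table(table, "brotli9.parquet", compression="brotli", compression_level=9)
pq.write_table(table, "snappy.parquet", compression="snappy")
pq.write_table(table, "uncompressed.parquet", compression="none")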
(Charts: compression-level results for GZIP, Brotli and ZStandard, plus ZStandard & Brotli compared)
Compression
1.Let’s stick with ZStandard for now, as it seems a good tradeoff between speed and size.
2.In some cases (e.g. slow network drives), it might be worth also considering Brotli…
•…but Brotli is relatively slow to decompress.
RowGroup size
1.If you plan to partially access the data, RowGroups are the common place to filter (see the sketch below).
2.If you want to read the whole data, fewer RowGroups are better.
3.Compression & encoding also work better with larger RowGroups.
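A sketch of setting the RowGroup size and of the RowGroup-level filtering it enables; the table, path and column name are illustrative.

import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in table; in practice e.g. pa.Table.from_pandas(df).
table = pa.table({"passenger_count": [1, 2, 5, 1], "fare": [7.5, 12.0, 30.0, 3.2]})

# Rows per RowGroup: larger groups compress and encode better.
pq.write_table(table, "trips.parquet", row_group_size=1_000_000, compression="zstd")

# RowGroup statistics let the reader skip groups that cannot match the filter.
subset = pq.read_table("trips.parquet", filters=[("passenger_count", ">", 4)])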
Single RowGroup
Encodings
1.https://parquet.apache.org/docs/file-format/data-pages/encodings/
2.We have been using RLE_DICTIONARY for all columns
3.DELTA_* encodings not implemented in pyarrow
4.Byte Stream Split is a recent addition (see the sketch below)
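A sketch of requesting Byte Stream Split for float columns with pyarrow; dictionary encoding is disabled here so the requested encoding actually takes effect, and the column names are illustrative.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"fare": [7.5, 12.0, 3.2], "trip_distance": [1.1, 4.3, 0.8]})

pq.write_table(
    table,
    "bss.parquet",
    compression="zstd",
    use_dictionary=False,  # otherwise RLE_DICTIONARY takes precedence
    use_byte_stream_split=["fare", "trip_distance"],
)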
Dictionary Encoding
RLE Encoding
Byte Stream Split Encoding
Encodings
1.Byte Stream Split is sometimes faster than dictionary encoding, but not significantly
2.For high entropy columns, BSS shines
Hand-Crafted Delta
1.Let’s take the timestamps in the NYC Taxi Trips data
2.Sort by pickup date
3.Compute a delta column for both dates (see the sketch below)
4.17.5% saving on the whole file.
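A sketch of this hand-crafted delta; the column names follow the public NYC taxi schema, but treat the exact names, path and delta representation (seconds as floats) as illustrative.

import pandas as pd

df = pd.read_parquet("yellow_tripdata_2021-01.parquet")
df = df.sort_values("tpep_pickup_datetime")

# Keep the first pickup timestamp so absolute times can be reconstructed.
first_pickup = df["tpep_pickup_datetime"].iloc[0]

# Deltas in seconds: pickup-to-pickup and pickup-to-dropoff.
df["pickup_delta"] = df["tpep_pickup_datetime"].diff().dt.total_seconds().fillna(0)
df["dropoff_delta"] = (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]).dt.total_seconds()

df = df.drop(columns=["tpep_pickup_datetime", "tpep_dropoff_datetime"])
df.to_parquet("delta.parquet", compression="zstd")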
Order your data
1.With our hand-crafted delta, it was worth sorting the data
2.This can help, but in our tests it only worked for the Price Paid dataset, where it saved 25%; all other datasets actually got larger
Summary
1.Adjusting your data types is helpful in-memory, but has no significant effect on-disk
2.Store high-entropy floats as Byte Stream Split encoded columns
3.Check whether sorting has an effect
4.Delta encoding in Parquet would be useful; use a hand-crafted one for now
5.Zstd on level 3/4 seems like a good default compression setting
Cost Function for compression
What do we get?
1.Run once with the default settings
2.Test all compression settings (as sketched below), but also…
1.… use hand-crafted delta.
2.… use Byte Stream Split on predictions.
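A sketch of such a cost function: write the same table with each candidate setting and record the resulting size and wall-clock time. The candidate list and paths are illustrative, and the table is a stand-in for a real dataset.

import os
import time

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"fare": [7.5, 12.0, 3.2] * 100_000})

def measure(table: pa.Table, path: str, **write_kwargs):
    """Return (write seconds, file size in bytes) for one parameter set."""
    start = time.perf_counter()
    pq.write_table(table, path, **write_kwargs)
    return time.perf_counter() - start, os.path.getsize(path)

candidates = [
    {"compression": "snappy"},
    {"compression": "zstd", "compression_level": 3},
    {"compression": "zstd", "compression_level": 9},
    {"compression": "brotli", "compression_level": 5},
]
for i, kwargs in enumerate(candidates):
    seconds, size = measure(table, f"candidate-{i}.parquet", **kwargs)
    print(kwargs, f"{seconds:.2f}s", f"{size / 2**20:.2f} MiB")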
Code example available at
https://github.com/xhochy/pyconde24-parquet