In the last decade, Apache Parquet has become the standard format to store tabular data on disk regardless of the technology stack used. This is due to its read/write performance, efficient compression, interoperability and especially its outstanding performance with the default settings.
While these default settings and access patterns already provide decent performance, by understanding the format in more detail and using recent developments, one can get much better performance, smaller files, and utilise Parquet's newer partial reading features to read even smaller subsets of a file for a given query.
This talk aims to provide insight into the Parquet format and the recent developments that are useful for end users' daily workflows. The only prior knowledge needed is what a DataFrame / tabular data is.
About me
•Uwe Korn
https://mastodon.social/@xhochy / @xhochy
•CTO at Data Science startup QuantCo
•Previously worked as a Data Engineer
•A lot of OSS, notably Apache {Arrow, Parquet} and conda-forge
•PyData Südwest Co-Organizer
Apache Parquet
1.Columnar, on-disk storage format
2.Started in 2012 by Cloudera and Twitter
3.Later, it became Apache Parquet
4.Fall 2016 brought full Python & C++ Support
5.State-of-the-art since the Hadoop era, still going strong
Clear benefits
1.Columnar makes vectorized operations fast
2.Efficient encodings and compression make it small
3.Predicate-pushdown brings computation to the I/O layer
4.Language-independent and widespread; common exchange format
Constructing Parquet Files
Parquet with pandas
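A minimal sketch of writing and reading Parquet from pandas via the pyarrow engine; the file name and columns are made up for illustration, and extra keyword arguments such as the compression codec are passed through to the engine.

import pandas as pd

df = pd.DataFrame({"vendor_id": [1, 2, 1], "fare": [7.5, 12.0, 3.2]})

# Write via the pyarrow engine; extra keywords are forwarded to it.
df.to_parquet("trips.parquet", engine="pyarrow", compression="zstd")

# Read back only the columns we need (column projection).
fares = pd.read_parquet("trips.parquet", columns=["fare"])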
Parquet with polars
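The same sketch with polars; write_parquet and scan_parquet are existing polars APIs, the data and column names are again illustrative.

import polars as pl

df = pl.DataFrame({"vendor_id": [1, 2, 1], "fare": [7.5, 12.0, 3.2]})

# Eager write with ZStandard compression.
df.write_parquet("trips.parquet", compression="zstd")

# Lazy scan: the filter is pushed down so only matching rows are materialised.
expensive = pl.scan_parquet("trips.parquet").filter(pl.col("fare") > 10).collect()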
Anatomy of a file
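One way to explore this anatomy yourself is through pyarrow's metadata objects; a small sketch, reusing the file written above:

import pyarrow.parquet as pq

pf = pq.ParquetFile("trips.parquet")

# File-level footer: schema, number of RowGroups, created-by, ...
print(pf.metadata)
print(pf.schema_arrow)

# Per-RowGroup, per-column-chunk details: encodings, compression, statistics.
print(pf.metadata.row_group(0).column(0))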
Tuning
Knobs to tune
1.Compression Algorithm
2.Compression Level
3.RowGroup size
4.Encodings
Data Types?
•Well, actually…
•…it doesn’t save much on disk.
•By choosing the optimal types (lossless cast to e.g. float32 or uint8) on a month of New York Taxi trips (see the sketch below):
Saves 963 bytes of 20.6 MiB
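A sketch of the kind of lossless downcast meant here, using pandas; the file name is illustrative and the column handling is deliberately simplified (float columns are only shrunk when the round-trip back to float64 is exact).

import os

import numpy as np
import pandas as pd

def downcast_lossless(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Shrink integer columns to the smallest unsigned type that still fits.
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="unsigned")
    # Use float32 only where no precision is lost.
    for col in out.select_dtypes(include=[np.floating]).columns:
        as32 = out[col].astype("float32")
        if as32.astype("float64").equals(out[col]):
            out[col] = as32
    return out

df = pd.read_parquet("yellow_tripdata_2021-01.parquet")
df.to_parquet("original.parquet")
downcast_lossless(df).to_parquet("downcast.parquet")
print(os.path.getsize("original.parquet"), os.path.getsize("downcast.parquet"))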
Compression
Compression Algorithm
•Datasets:
•New York Yellow Taxi Trips 2021-01
•New York Yellow Taxi Trips 2021-01 with a custom prediction
•gov.uk (House) Price Paid dataset
•COVID-19 Epidemiology
•Time measurements: Pick the median of five runs
Compression Level
1.For Brotli, ZStandard and GZIP, we can tune the level (see the sketch below)
2.Snappy and "none" have a fixed compression level.
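As a sketch, both the algorithm and (where supported) the level can be set when writing with pyarrow; pandas forwards the same keywords to the engine. The tiny table is a stand-in for a real dataset.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"fare": [7.5, 12.0, 3.2]})

# Algorithm and level in one call; Snappy and "none" ignore the level.
pq.write_table(table, "zstd3.parquet", compression="zstd", compression_level=3)
pq.write_table(table, "brotli9.parquet", compression="brotli", compression_level=9)
pq.write_table(table, "snappy.parquet", compression="snappy")
pq.write_table(table, "uncompressed.parquet", compression="none")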
(Charts: compression-level results for GZIP, Brotli and ZStandard, plus ZStandard & Brotli compared)
Compression
1.Let’s stick with ZStandard for now, as it seems a good tradeoff between speed and size.
2.In some cases (e.g. slow network drives), it might be worth also considering Brotli…
•…but Brotli is relatively slow to decompress.
RowGroup size
1.If you plan to partially access the data, RowGroups are the common place to filter (see the sketch below).
2.If you want to read the whole data, fewer RowGroups are better.
3.Compression & encoding also work better with larger RowGroups.
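A sketch of setting the RowGroup size and of the RowGroup-level filtering it enables; the table, path and column name are illustrative.

import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in table; in practice e.g. pa.Table.from_pandas(df).
table = pa.table({"passenger_count": [1, 2, 5, 1], "fare": [7.5, 12.0, 30.0, 3.2]})

# Rows per RowGroup: larger groups compress and encode better.
pq.write_table(table, "trips.parquet", row_group_size=1_000_000, compression="zstd")

# RowGroup statistics let the reader skip groups that cannot match the filter.
subset = pq.read_table("trips.parquet", filters=[("passenger_count", ">", 4)])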
Single RowGroup
Encodings
1.https://parquet.apache.org/docs/file-format/data-pages/encodings/
2.We have been using RLE_DICTIONARY for all columns
3.DELTA_* encodings not implemented in pyarrow
4.Byte Stream Split is a recent addition (see the sketch below)
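A sketch of requesting Byte Stream Split for float columns with pyarrow; dictionary encoding is disabled here so the requested encoding actually takes effect, and the column names are illustrative.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"fare": [7.5, 12.0, 3.2], "trip_distance": [1.1, 4.3, 0.8]})

pq.write_table(
    table,
    "bss.parquet",
    compression="zstd",
    use_dictionary=False,  # otherwise RLE_DICTIONARY takes precedence
    use_byte_stream_split=["fare", "trip_distance"],
)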
Dictionary Encoding
RLE Encoding
Byte Stream Split Encoding
Encodings
1.Byte Stream Split is sometimes faster than dictionary encoding, but not significantly
2.For high entropy columns, BSS shines
Hand-Crafted Delta
1.Let’s take the timestamps in the NYC Taxi Trips data
2.Sort by pickup date
3.Compute a delta column for both dates (see the sketch below)
4.17.5% saving on the whole file.
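A sketch of this hand-crafted delta; the column names follow the public NYC taxi schema, but treat the exact names, path and delta representation (seconds as floats) as illustrative.

import pandas as pd

df = pd.read_parquet("yellow_tripdata_2021-01.parquet")
df = df.sort_values("tpep_pickup_datetime")

# Keep the first pickup timestamp so absolute times can be reconstructed.
first_pickup = df["tpep_pickup_datetime"].iloc[0]

# Deltas in seconds: pickup-to-pickup and pickup-to-dropoff.
df["pickup_delta"] = df["tpep_pickup_datetime"].diff().dt.total_seconds().fillna(0)
df["dropoff_delta"] = (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]).dt.total_seconds()

df = df.drop(columns=["tpep_pickup_datetime", "tpep_dropoff_datetime"])
df.to_parquet("delta.parquet", compression="zstd")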
Order your data
1.With our hand-crafted delta, it was worth sorting the data
2.This can help, but in our tests it only worked for the Price Paid dataset, where it saved 25%; all other datasets actually got larger
Summary
1.Adjusting your data types is helpful in-memory, but has no significant effect on-disk
2.Store high-entropy floats as Byte Stream Split encoded columns
3.Check whether sorting has an effect
4.Delta encoding in Parquet would be useful; use a hand-crafted one for now
5.Zstd on level 3/4 seems like a good default compression setting
Cost Function for compression
What do we get?
1.Run once with the default settings
2.Test all compression settings (as sketched below), but also…
1.… use hand-crafted delta.
2.… use Byte Stream Split on predictions.
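A sketch of such a cost function: write the same table with each candidate setting and record the resulting size and wall-clock time. The candidate list and paths are illustrative, and the table is a stand-in for a real dataset.

import os
import time

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"fare": [7.5, 12.0, 3.2] * 100_000})

def measure(table: pa.Table, path: str, **write_kwargs):
    """Return (write seconds, file size in bytes) for one parameter set."""
    start = time.perf_counter()
    pq.write_table(table, path, **write_kwargs)
    return time.perf_counter() - start, os.path.getsize(path)

candidates = [
    {"compression": "snappy"},
    {"compression": "zstd", "compression_level": 3},
    {"compression": "zstd", "compression_level": 9},
    {"compression": "brotli", "compression_level": 5},
]
for i, kwargs in enumerate(candidates):
    seconds, size = measure(table, f"candidate-{i}.parquet", **kwargs)
    print(kwargs, f"{seconds:.2f}s", f"{size / 2**20:.2f} MiB")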
Code example available at
https://github.com/xhochy/pyconde24-parquet