Apache Parquet

megrhihaikel 560 views 11 slides Jun 03, 2017

About This Presentation

https://www.youtube.com/watch?v=1Kur7SitXRA&index=5&list=PLAV_dWz2GNAiDnEVI9ynfVr3WYt3moS3c


Slide Content

Apache Parquet https://parquet.apache.org/

Why Parquet: columnar storage format; can store nested data; efficient in file size and query performance; a nested field can be read independently of other fields; many data processing frameworks understand the Parquet format (Hive, Spark, Pig, MapReduce, etc.).
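
To make the columnar idea concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Parquet's actual on-disk format; the sample records and the helper name `to_columns` are invented for illustration.

```python
# Toy illustration of row-oriented vs column-oriented layout.
# This is NOT Parquet's real encoding; the data and names are made up.
rows = [
    {"name": "alice", "age": 30, "city": "Paris"},
    {"name": "bob", "age": 25, "city": "Tunis"},
    {"name": "carol", "age": 41, "city": "Paris"},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# A query that only needs "age" reads one contiguous list
# instead of visiting every field of every row.
average_age = sum(columns["age"]) / len(columns["age"])
print(average_age)
```

Storing the values of one field together is also what tends to make compression effective, since neighbouring values share a type and often repeat.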

Data Model. Primitive types: boolean, int32, int64, int96, float, double, binary, fixed_len_byte_array. Logical types: UTF8, ENUM, DECIMAL(precision, scale), DATE, LIST, MAP.

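Record (Parquet): inside a message, each field is declared as (required | optional | repeated) type nom; for example: message Nom { required int32 age; required binary nom (UTF8); }. Record (Avro): a field is declared in JSON, for example: {"name": "right", "type": "string", "order": "descending"}.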
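
The three repetition levels can be read as simple rules: a required field appears exactly once, an optional field at most once, and a repeated field zero or more times. The sketch below models those rules in plain Python. It is a hypothetical, simplified checker and not a Parquet library API; the `email` and `phones` fields are added only to exercise `optional` and `repeated`.

```python
# Hypothetical, simplified model of the schema on the slide.
# Each entry is (repetition, Python type, field name).
SCHEMA = [
    ("required", int, "age"),
    ("required", str, "nom"),
    ("optional", str, "email"),   # illustrative field, not in the slide
    ("repeated", str, "phones"),  # illustrative field, not in the slide
]


def validate(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    for repetition, py_type, name in schema:
        value = record.get(name)
        if repetition == "required":
            # Must be present and of the declared type.
            if not isinstance(value, py_type):
                problems.append(f"{name}: required {py_type.__name__} missing or wrong type")
        elif repetition == "optional":
            # May be absent, but must have the declared type when present.
            if value is not None and not isinstance(value, py_type):
                problems.append(f"{name}: expected {py_type.__name__}")
        else:
            # repeated: zero or more values of the declared type.
            values = [] if value is None else value
            if not isinstance(values, list) or not all(isinstance(v, py_type) for v in values):
                problems.append(f"{name}: expected a list of {py_type.__name__}")
    return problems


ok = validate({"age": 30, "nom": "right"})
bad = validate({"nom": "right", "phones": "123"})
print(ok)
print(bad)
```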

Parquet File Format: Header (magic number); Block ... Block; Footer (magic number, schema, encoding method, block positions, ...). Each block holds columns, and each column is split into pages. Block size: 128 MB.
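
As a rough sketch of how a reader can locate the footer: a Parquet file begins with the 4-byte magic number `PAR1` and ends with the footer metadata, a 4-byte little-endian footer length, and `PAR1` again. The code below checks that layout on a fake in-memory file. The function name `read_footer_bounds` and the dummy payload bytes are assumptions for illustration, and the footer itself is not decoded.

```python
import io
import struct

MAGIC = b"PAR1"


def read_footer_bounds(stream):
    """Return (offset, length) of the footer metadata in a Parquet-style stream."""
    file_size = stream.seek(0, io.SEEK_END)
    if file_size < 12:  # two magic numbers plus the 4-byte footer length
        raise ValueError("stream too small to be a Parquet file")

    stream.seek(0)
    if stream.read(4) != MAGIC:
        raise ValueError("missing leading magic number")

    # The last 8 bytes are: footer length (uint32, little-endian) + magic.
    stream.seek(-8, io.SEEK_END)
    tail = stream.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing magic number")

    (footer_length,) = struct.unpack("<I", tail[:4])
    return file_size - 8 - footer_length, footer_length


# Build a fake file: the block and footer contents are dummy bytes.
blocks = b"B" * 20
footer = b"F" * 7
fake_file = io.BytesIO(MAGIC + blocks + footer + struct.pack("<I", len(footer)) + MAGIC)

offset, length = read_footer_bounds(fake_file)
print(offset, length)
```

Keeping the schema and block positions at the end lets a writer stream blocks out first and record their positions once they are known.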

Encoding & Compression. Run-length encoding: True, True, True, False is stored as (3 true, 1 false). Dictionary encoding (indexing). Plain encoding. Compression algorithms: Snappy, gzip, ...
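
The two encodings named above can be sketched in a few lines of plain Python. This is a conceptual illustration only: Parquet's actual run-length encoding is a hybrid with bit-packing, and the function names and the city values here are invented for the example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


def dictionary_encode(values):
    """Replace each value by its index in a dictionary of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


# The example from the slide: three True values followed by one False.
runs = run_length_encode([True, True, True, False])
print(runs)  # [(True, 3), (False, 1)]

dictionary, indices = dictionary_encode(["Paris", "Tunis", "Paris", "Paris"])
print(dictionary, indices)  # ['Paris', 'Tunis'] [0, 1, 0, 0]
```

Dictionary encoding pays off when a column has few distinct values, and the resulting small integer indices are themselves good candidates for run-length encoding.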

Configuration (property: type): parquet.block.size: int; parquet.page.size: int; parquet.dictionary.page.size: int; parquet.enable.dictionary: boolean (true); parquet.compression: String (SNAPPY, UNCOMPRESSED, LZO, ...).
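
One way to keep these properties consistent is to check their types before handing them to a job. The sketch below does that in plain Python and renders each setting as a Hadoop-style `-D property=value` option. The helper `to_hadoop_options` and the chosen values are assumptions for illustration (128 MB blocks as on the file format slide, 1 MB pages as a plausible value); check the sizes against the Parquet version in use.

```python
# Property names and types are taken from the slide; the values are examples.
EXPECTED_TYPES = {
    "parquet.block.size": int,
    "parquet.page.size": int,
    "parquet.dictionary.page.size": int,
    "parquet.enable.dictionary": bool,
    "parquet.compression": str,
}

settings = {
    "parquet.block.size": 128 * 1024 * 1024,    # 128 MB, as on the file format slide
    "parquet.page.size": 1024 * 1024,           # assumed example value
    "parquet.dictionary.page.size": 1024 * 1024,  # assumed example value
    "parquet.enable.dictionary": True,
    "parquet.compression": "SNAPPY",
}


def to_hadoop_options(settings):
    """Type-check each setting and render it as a '-D property=value' string."""
    options = []
    for key, value in settings.items():
        expected = EXPECTED_TYPES[key]
        # bool is a subclass of int in Python, so reject it for int properties.
        wrong_bool = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or wrong_bool:
            raise TypeError(f"{key} expects {expected.__name__}, got {value!r}")
        # Java-style booleans are lowercase.
        text = str(value).lower() if expected is bool else str(value)
        options.append(f"-D {key}={text}")
    return options


options = to_hadoop_options(settings)
for option in options:
    print(option)
```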

Writing and reading Parquet files

Parquet MapReduce
