Optimizing Data for Fast Querying


About This Presentation

Presents a possible solution to the "Small Files Problem": queries over data composed of many small files (around 5MB each) are very slow.


Slide Content

Optimizing Data for Fast Querying
Andrei Ionescu – Adobe Romania, Data Platform

Agenda
- Data Flow & Architecture
- Some Initial Numbers
- The Problem
- Compaction Service
  - On-Demand
  - Real-Time
- New Numbers
- What's Next
- Useful Links

Data Flow & Architecture
[Architecture diagram: Solutions ingest data into HDFS; Data Science and the Query Service query it.]

Some Initial Numbers (courtesy of Paul Mackles)
Each dataset per tenant has:
- Files per day – 10K of ~4MB each
- Rows per day – 500M
- Byte size per day – 7GB
- Query time – times out (> 10 min)
- File scan – tens of minutes
For a month, across 10 datasets, that is 3M files of ~4MB each.

The Problem – Client Issue
Interactivity when querying the data is mandatory: any query taking over 10 minutes is dropped, even though the underlying process would eventually finish. Because of the way the data is ingested and the files are written, queries hit the 10-minute timeout every time, even simple queries over a month's worth of data.

SELECT _Y AS Year, _M AS Month, count(*) AS ECount
FROM midvalues_1
WHERE _Y = 2018 OR _Y = 2019
GROUP BY 1, 2
ORDER BY 1 DESC, 2 DESC;

The Problem – Technical Issue
The problem is known as the "Small Files Problem". Every file, directory, and block in HDFS is represented as an object in the namenode's memory, each taking about 150 bytes; 10M files, each with its own block, therefore consume roughly 3GB of namenode memory. HDFS is not geared up for efficient access to small files: it is primarily designed for streaming access to large files. Reading through small files causes lots of seeks and lots of hopping from datanode to datanode to retrieve each file, which is an inefficient data access pattern. In MapReduce, each small file is handled by its own map() task, so thousands of 2–3MB files need thousands of mappers, which is very inefficient.

Compaction Service
Compaction is a service whose purpose is to optimize the ingested data into proper file sizes relative to the HDFS block size (128MB, 256MB, 512MB). The Compaction Service takes small files and compacts them into bigger files of a specific target size. It has two functions that map onto its two components:
- On-demand compaction – Compaction Job
- Real-time compaction (any new file or group of files arriving is taken into account and compacted according to defined rules) – Compaction Tracker

On-Demand – Compaction Job
- One-time job
- A Spark job that reads the files and compacts them into bigger files (for example, 7000 small files into 50 files)
- The Compaction Job is the core of the service
Processing steps:
- Given path patterns (e.g. /mystore/mydataset/_Y=2019/_M=1/_D=7/batch=*/)
- Scan all files and get their file sizes
- Knowing the total size, target size, schema size, etc., compute n = totalSize / targetSize
Two modes of running:
- Load the data and apply repartition(n) to get the proper number of files (one DataSet with all data)
- Group files into n buckets and repartition(1) each bucket (multiple DataSets)
A sketch of the first mode is shown below.
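The following is a minimal Scala/Spark sketch of how such a job could look in its first mode (one DataSet, repartition(n)), assuming Parquet data and the path layout from the slide. The object name, the 256MB target size, and the output path are illustrative assumptions, not the actual implementation.

// Hypothetical sketch of the on-demand compaction job, mode 1: one DataSet,
// repartition(n). Paths, target size, and Parquet format are assumptions.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object CompactionJobSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compaction-job").getOrCreate()

    val inputPattern = "/mystore/mydataset/_Y=2019/_M=1/_D=7/batch=*"   // assumed layout
    val outputPath   = "/mystore/mydataset/_Y=2019/_M=1/_D=7/compacted" // assumed target path
    val targetSize   = 256L * 1024 * 1024                               // assumed ~256MB target files

    // Scan the matching files and sum their sizes.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val totalSize = fs.globStatus(new Path(inputPattern + "/*"))
      .filter(_.isFile)
      .map(_.getLen)
      .sum

    // n = totalSize / targetSize, with at least one output file.
    val n = math.max(1, (totalSize / targetSize).toInt)

    // Mode 1: load everything into one DataFrame and repartition(n) before writing.
    spark.read.parquet(inputPattern)
      .repartition(n)
      .write
      .mode("overwrite")
      .parquet(outputPath)

    spark.stop()
  }
}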

On-Demand – Compaction Job – Running Modes
[Diagram: one DataSet built over all input files (/myDs/_Y=2019/_M=1/_D=1/batch=*/) written with repartition(n), versus n DataSets written with repartition(1) into temporary sub-folders (__p1, __p2, ...) whose files are then moved back to the parent folder.]

On-Demand – Compaction Job – Running Modes
1 DataSet with repartition(n):
[-] File bloating due to random shuffling after repartition(n)
[-] Twice the file size
n DataSets with repartition(1):
[+] File size stays similar to the source size
[-] Each DataSet can write only into its own folder, so one more step is needed to move the files back to the parent path and clean up
A sketch of the second mode follows below.
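To illustrate the second mode, here is a hypothetical Scala/Spark sketch that packs the small files into buckets of roughly the target size, writes each bucket with repartition(1) into a temporary sub-folder (following the __p1/__p2 naming from the diagram), and then moves the resulting files back to the parent path. The greedy packing strategy, output file naming, and the 256MB target are assumptions.

// Hypothetical sketch of mode 2: n DataSets with repartition(1).
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object BucketedCompactionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compaction-buckets").getOrCreate()
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    val parent     = new Path("/myDs/_Y=2019/_M=1/_D=1")   // assumed partition path
    val targetSize = 256L * 1024 * 1024                     // assumed target bucket size

    // List the small files and greedily pack them into buckets close to targetSize.
    val files = fs.globStatus(new Path(parent, "batch=*/*")).filter(_.isFile)
    val buckets = files.foldLeft(List(List.empty[(Path, Long)])) { case (acc, f) =>
      val current = acc.head
      if (current.nonEmpty && current.map(_._2).sum + f.getLen > targetSize)
        List((f.getPath, f.getLen)) :: acc              // start a new bucket
      else
        ((f.getPath, f.getLen) :: current) :: acc.tail  // add to the current bucket
    }.reverse

    // Write each bucket as a single file into its own temporary folder (__p1, __p2, ...).
    buckets.zipWithIndex.foreach { case (bucket, i) =>
      val tmpDir = new Path(parent, s"__p${i + 1}")
      spark.read.parquet(bucket.map(_._1.toString): _*)
        .repartition(1)
        .write
        .mode("overwrite")
        .parquet(tmpDir.toString)

      // Move the compacted part files back into the parent folder and clean up.
      fs.listStatus(tmpDir).filter(_.getPath.getName.startsWith("part-")).foreach { p =>
        fs.rename(p.getPath, new Path(parent, s"compacted-${i + 1}-${p.getPath.getName}"))
      }
      fs.delete(tmpDir, true)
    }

    spark.stop()
  }
}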

Real-Time – Compaction Tracker
Any new batch of files arriving in the storage is taken into account and compacted according to specific rules:
- It should belong to the same tenant
- It should belong to the same partition
- It should be grouped with similar files so that the group targets the HDFS block size
The tracker triggers a Compaction Job (on-demand) for each group that is ready to be compacted (see the streaming sketch after the diagram below).

Real-Time – Compaction Tracker
[Diagram: batches of files arriving in HDFS (/myDS/_Y=2019/_M=1/_D=1/batch=12/ ... batch=1000/) feed a stateful streaming job that keeps state per group (number of batches, file size, partition, etc.) and, when a group of files is ready, triggers a one-time on-demand Compaction Job.]
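Below is a minimal sketch of the tracker idea using Spark Structured Streaming's flatMapGroupsWithState: file-arrival events are grouped per tenant and partition, their sizes are accumulated in state, and once a group reaches the target block size a compaction trigger is emitted. The event schema (FileEvent), source path, block-size threshold, and console sink are assumptions for illustration; the real tracker's state and trigger mechanism may differ.

// Hypothetical sketch of the Compaction Tracker as a stateful streaming job.
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class FileEvent(tenant: String, partition: String, path: String, sizeBytes: Long)
case class TrackerState(batches: Int, totalBytes: Long, paths: Seq[String])
case class CompactionTrigger(tenant: String, partition: String, paths: Seq[String])

object CompactionTrackerSketch {
  val blockSize: Long = 256L * 1024 * 1024   // assumed target block size

  // Accumulate file sizes per (tenant, partition); emit a trigger once the group is big enough.
  def track(key: (String, String),
            events: Iterator[FileEvent],
            state: GroupState[TrackerState]): Iterator[CompactionTrigger] = {
    val old = state.getOption.getOrElse(TrackerState(0, 0L, Seq.empty))
    val evs = events.toSeq
    val updated = TrackerState(
      old.batches + evs.size,
      old.totalBytes + evs.map(_.sizeBytes).sum,
      old.paths ++ evs.map(_.path))

    if (updated.totalBytes >= blockSize) {
      state.remove()                          // group is ready: reset state
      Iterator(CompactionTrigger(key._1, key._2, updated.paths))
    } else {
      state.update(updated)                   // keep accumulating
      Iterator.empty
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compaction-tracker").getOrCreate()
    import spark.implicits._

    // Assumed source: JSON file-arrival events dropped into a landing folder.
    val events = spark.readStream
      .schema(Encoders.product[FileEvent].schema)
      .json("/landing/file-events")
      .as[FileEvent]

    val triggers = events
      .groupByKey(e => (e.tenant, e.partition))
      .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(track)

    // Each emitted trigger would launch an on-demand Compaction Job; here we just log it.
    triggers.writeStream.format("console").outputMode("append").start().awaitTermination()
  }
}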

Architecture with Compaction Service
[Architecture diagram: Solutions ingest data into HDFS, with the Compaction Service attached to HDFS; the Data Science Workspace and the Query Service query the data.]

New Numbers
Now, each dataset per tenant has:
- Files per day – 50
- Rows per day – 500M
- Byte size per day – 7GB
- Query time – < 10 min
- File scan – under a minute

What's Next
- Use of a meta-store (Netflix Iceberg)
- Better scaling of resources for Compaction Jobs based on the data
- Use of ML to fit the Compaction Job's needs

Useful Links
- http://dataottam.com/2016/09/09/3-solutions-for-big-datas-small-files-problem/
- https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
- https://community.hortonworks.com/questions/167615/what-is-small-file-problem-in-hdfs.html
- https://github.com/Netflix/iceberg
- https://parquet.apache.org/

Keep in Touch
LinkedIn: https://www.linkedin.com/in/andreiionescu
Email: [email protected]