Ingestion file copy using apex

ApacheApex 486 views 26 slides Apr 21, 2016
Slide 1
Slide 1 of 26
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26

About This Presentation

Ingestion file copy using apex


Slide Content

Apache Apex Meetup
Big Data File Ingestion
using Apex

Sandeep Deshmukh, PhD
[email protected]

Apache Apex Meetup
Contents
●What is Big Data Ingestion
●Challenges in File copy @ scale
●Ingestion using Apex
○Input
○Output
○Key features
●Demo
●Summary

Apache Apex Meetup
Directed Acyclic Graph (DAG)
•A Stream is a sequence of data tuples
•An Operator takes one or more input streams, performs computations & emits one or more output streams
•Each Operator is YOUR custom business logic in java, or built-in operator from our open source library
•Operator has many instances that run in parallel and each instance in single-threaded
•Directed Acyclic Graph (DAG) is made up of operators and streams

Apex: Application Programming Model
Output Stream
Tuple Tuple
Filtered
Stream
Enriched
Stream
Enriched
Stream
er
Operator
er
Operator
er
Operator
er
Operator
Filtered
Stream

Apache Apex Meetup
What is Ingestion

Data ingestion
●process of obtaining, importing, and processing data for later use or storage
in a database

Big Data Ingestion
●discovering the data sources
●importing the data
●processing data to produce intermediate data
●Send data out to durable data stores

Apache Apex Meetup
Challenges in File copy @ scale


●Failure Recovery
●Copying big files in parallel
●Copying large number of small files
●Processing
○Encryption
○Compression
○Compaction

Apache Apex Meetup
DAG - Components
Read Data Write Data
Process

Apache Apex Meetup
DAG - Read Data : Requirements
●Independent of input file type
○HDFS
○S3
○FTP
○NFS

●Scale to large data
○Large files
○Large number of small files

●Configurable Bandwidth usage

Apache Apex Meetup
DAG - Read Data

Break the whole task
into smaller sub-tasks
Connect to input and
scan for available data

Assign smaller tasks for
downstream operators
Steps Purpose Name

Work on the sub-tasks
given by Operator 1, one
at a time
Connect to source and
read data as smaller
tasks one-by-one

Pass on the read data to
downstream operator

Write File
Save the data read by
Operator 2
File
Splitter
Block
Reader
File
Writer

Apache Apex Meetup
DAG - Simple Design
File
Splitter
Block
Reader
File
WriterBlockMetaData Data
Challenges
●Reading files in parallel is not possible
○Can have multiple Block Readers and File Writers reading multiple files in
parallel but single file can’t be read by two Block Readers

●Failure recovery is hard

Apache Apex Meetup
DAG - Read Data

Break the whole task
into smaller sub-tasks
Connect to input and
scan for available data

Assign smaller tasks
for downstream
operators
Steps Purpose Name

Work on the sub-tasks
given by Operator 1,
one at a time
Connect to source and
read data as smaller
tasks one-by-one

Pass on the read data to
downstream operator

Write File
Save the data read
by Operator 2
File
Splitter
Block
Reader
File
Writer

Check for completeness
Make sure all smaller
tasks for a file are
completed by upstream
operators & send file
merger trigger
Synchronizer

Apache Apex Meetup
DAG - Input
File
Splitter
Block
Reader
Block
Writer
BlockMetaData
Data
Block
Reader
Block
Writer
Synchronizer
BlockMetaData
FileMetaData
BlockMetaData
BlockMetaData
Data

Apache Apex Meetup
Input DAG - FileSplitter
Scan input files/ directories
Create smaller sub-tasks
FileMetaData
BlockMetaData
File
Splitter
●Parameters
○input files/directories to copy data from
○recursive - Yes / No
○polling - Yes / No
○bandwidth - MB / sec

Apache Apex Meetup
Input DAG - FileSplitter
●For each file in the directory:
■[output] FileMetaData - file information
●Name
●Size
●Relative path
●Block IDs into which the file is virtually split

■[output] BlockMetaData - block information
●BlockID
●Start position
●End position
●File URL
InputFile.txt
1073741824 (1GB)
input/data/InputFile.txt
[0,1,2,3,4,5,6,7,8]


1
134217728
268435456 (128MB)
hdfs://node18:8020/user/sandeep/input

Apache Apex Meetup
Input DAG - BlockReader
Block
Reader
Read block from remote
location and emit Data
Data
BlockMetaData
●Parameters
Input URL: E.g.: hdfs://node18:8020/user/hduser/input
BlockMetaData

Apache Apex Meetup
Input DAG - BlockWriter
Block
Writer
Write block data on local
HDFS
BlockMetaData
BlockMetaData
Data
Saves data in apps directory

Apache Apex Meetup
Input DAG - Synchronizer
Track blocks for each file
and send trigger once all
the block for that file
are available
FileMetaDataSynchronizer
FileMetaData
BlockMetaData

Apache Apex Meetup
DAG - Input
File
Splitter
Block
Reader
Block
Writer
Synchronizer
BlockMetaData
Data
FileMetaData
BlockMetaData
BlockMetaData
FileMetaData

Apache Apex Meetup
Output DAG - FileMerger
Merge blocks to recreate
original file
FileMerger
●Parameters
○Output directory to copy data to
○Overwrite - Yes/No
FileMetaData

Apache Apex Meetup
Output DAG - FileMerger - FastMerge Magic
Different
Blocks:
File :
B1
DataNode 1
DataNode 2
DataNode 3
DataNode 4
B2
B1
B1
B2
B2
Bn
Bn
Bn
BnB1B2
1
2
1
1
2
2
n
n
n
12 n

Apache Apex Meetup
●Same replication factor
●On same HDFS cluster
●Same block size for all files
●Size of all files (except last) : multiple of block size
Output DAG - FileMerger - FastMerge Magic

Apache Apex Meetup
DAG - Complete
File
Splitter
Block
Reader
Block
Writer
Synchronizer
BlockMetaData BlockMetaData
Data
BlockMetaData
FileMetaData
FileMerger
FileMetaData

Apache Apex Meetup
Other features: Optional processing
●Compression
○Gzip and lzo
●Encryption
○PKI & AES
●Compaction
○Size based
●Dedup
●Dimension Computation & Aggregation

Apache Apex Meetup

Apache Apex Meetup
Summary
●Easy to use
○Configure and run
●Unified for batch and continuous ingestion
●Handles
○Large files
○Large number of small files

25

Resources
Apache Apex Meetup
•Apache Apex website - http://apex.incubator.apache.org/
•Subscribe - http://apex.incubator.apache.org/community.html
•Download - http://apex.incubator.apache.org/downloads.html
•Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
•Facebook - https://www.facebook.com/ApacheApex/
•Meetup - http://www.meetup.com/topics/apache-apex
•Startup Program – Free Enterprise License for startups, Universities, Non-Profits