Apache Apex Meetup
Contents
●What is Big Data Ingestion
●Challenges in File copy @ scale
●Ingestion using Apex
○Input
○Output
○Key features
●Demo
●Summary
Directed Acyclic Graph (DAG)
•A Stream is a sequence of data tuples
•An Operator takes one or more input streams, performs computations & emits one or more output streams
•Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
•An Operator has many instances that run in parallel, and each instance is single-threaded
•Directed Acyclic Graph (DAG) is made up of operators and streams
Apex: Application Programming Model
[Diagram: tuples flow through a chain of operators; each operator emits new streams (e.g. a filtered stream, an enriched stream) that feed downstream operators and the final output stream]
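A minimal sketch of how such a DAG is put together with the Apex Java API; the three operator classes (LineReader, UpperCaser, ConsoleWriter) are hypothetical placeholders, not operators from the library:

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;

// LineReader, UpperCaser and ConsoleWriter are hypothetical operators used only
// to show the shape of the API; each would expose input/output ports.
@ApplicationAnnotation(name = "MinimalDag")
public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Operators: your business logic, or operators from the library.
    LineReader reader = dag.addOperator("Reader", new LineReader());
    UpperCaser upper = dag.addOperator("Upper", new UpperCaser());
    ConsoleWriter writer = dag.addOperator("Writer", new ConsoleWriter());

    // Streams: sequences of tuples flowing from an output port to input ports.
    dag.addStream("lines", reader.output, upper.input);
    dag.addStream("upperCased", upper.output, writer.input);
  }
}
```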
What is Ingestion
Data ingestion
●the process of obtaining, importing, and processing data for later use or storage in a database
Big Data Ingestion
●discovering the data sources
●importing the data
●processing data to produce intermediate data
●sending the data out to durable data stores
Challenges in File copy @ scale
●Failure Recovery
●Copying big files in parallel
●Copying large number of small files
●Processing
○Encryption
○Compression
○Compaction
DAG - Components
[Diagram: Read Data → Process → Write Data]
DAG - Read Data : Requirements
●Independent of the input file system
○HDFS
○S3
○FTP
○NFS
●Scale to large data
○Large files
○Large number of small files
●Configurable Bandwidth usage
DAG - Read Data
●FileSplitter - break the whole task into smaller sub-tasks
○Connect to the input and scan for available data
○Assign the smaller tasks to downstream operators
●BlockReader - work on the sub-tasks handed out by the FileSplitter, one at a time
○Connect to the source and read the data as smaller tasks, one by one
○Pass the read data on to the downstream operator
●FileWriter - write the file
○Save the data read by the BlockReader
DAG - Simple Design
[Diagram: FileSplitter --BlockMetaData--> BlockReader --Data--> FileWriter]
Challenges
●Reading a single file in parallel is not possible
○Multiple BlockReaders and FileWriters can read multiple files in parallel, but a single file can't be read by two BlockReaders
●Failure recovery is hard
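In code, this simple design is just three operators wired together in populateDAG(); the operator classes and port names below are illustrative placeholders (building on the Application sketch earlier), not the exact library types:

```java
@Override
public void populateDAG(DAG dag, Configuration conf)
{
  FileSplitter splitter = dag.addOperator("FileSplitter", new FileSplitter());
  BlockReader reader = dag.addOperator("BlockReader", new BlockReader());
  FileWriter writer = dag.addOperator("FileWriter", new FileWriter());

  // FileSplitter tells the BlockReader which block of which file to read next.
  dag.addStream("BlockMetaData", splitter.blocksOut, reader.blocksIn);
  // BlockReader hands the bytes it read to the FileWriter.
  dag.addStream("Data", reader.dataOut, writer.dataIn);
}
```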
DAG - Read Data
●FileSplitter - break the whole task into smaller sub-tasks: scan the input for available data and assign the sub-tasks to downstream operators
●BlockReader - work on the sub-tasks handed out by the FileSplitter, one at a time: read the data block-by-block and pass it to the downstream operator
●FileWriter - write the file: save the data read by the BlockReader
●Synchronizer - check for completeness: make sure all the smaller tasks for a file have been completed by the upstream operators, then send the file-merger trigger
DAG - Input
[Diagram: FileSplitter fans BlockMetaData out to multiple BlockReaders in parallel; each BlockReader sends Data to a BlockWriter; the BlockWriters report BlockMetaData to the Synchronizer, which also receives FileMetaData from the FileSplitter]
Input DAG - FileSplitter
FileSplitter: scans the input files/directories, creates the smaller sub-tasks, and emits FileMetaData and BlockMetaData.
●Parameters
○input files/directories to copy data from
○recursive - Yes / No
○polling - Yes / No
○bandwidth - MB / sec
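As a sketch of how these knobs might be wired up in application code; every setter name below is a hypothetical placeholder for the operator's real properties, which in practice are usually set through the Apex configuration file:

```java
// Hypothetical configuration of the FileSplitter inside populateDAG();
// the setter names are placeholders, not the operator's actual API.
FileSplitter splitter = dag.addOperator("FileSplitter", new FileSplitter());
splitter.setFiles("hdfs://source-nn:8020/data/input");  // files/directories to copy data from
splitter.setRecursive(true);                            // descend into sub-directories
splitter.setPolling(true);                              // keep scanning for newly arrived files
splitter.setBandwidthMBPerSec(100);                     // cap on bandwidth usage (MB/sec)
```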
Input DAG - FileSplitter
●For each file in the directory:
○[output] FileMetaData - file information
■Name, e.g. InputFile.txt
■Size, e.g. 1073741824 (1 GB)
■Relative path, e.g. input/data/InputFile.txt
■Block IDs into which the file is virtually split, e.g. [0,1,2,3,4,5,6,7,8]
○[output] BlockMetaData - block information
■BlockID
■Start position
■End position
■File URL
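A simplified sketch of what these two tuple types carry, using the example values from this slide; field names are illustrative, and the real classes carry additional bookkeeping:

```java
// Illustrative view of the FileSplitter's output tuples.
class FileMetaData
{
  String fileName;      // e.g. "InputFile.txt"
  long fileLength;      // e.g. 1073741824 (1 GB)
  String relativePath;  // e.g. "input/data/InputFile.txt"
  long[] blockIds;      // e.g. [0, 1, 2, 3, 4, 5, 6, 7, 8]
}

class BlockMetaData
{
  long blockId;     // identifier of this block
  long offset;      // start position of the block within the file
  long endOffset;   // end position of the block within the file
  String filePath;  // URL of the source file this block belongs to
}
```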
Input DAG - BlockReader
BlockReader: takes BlockMetaData, reads the corresponding block from the remote location, and emits the Data (along with its BlockMetaData) downstream.
●Parameters
○Input URL, e.g. hdfs://node18:8020/user/hduser/input
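A minimal sketch of what reading one block boils down to with the Hadoop FileSystem API; operator scaffolding, retries and buffering are omitted:

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class BlockReadSketch
{
  // Read the byte range [offset, endOffset) of one block from the source file.
  static byte[] readBlock(String fileUrl, long offset, long endOffset, Configuration conf)
      throws IOException
  {
    // e.g. fileUrl = "hdfs://node18:8020/user/hduser/input/InputFile.txt"
    FileSystem fs = FileSystem.get(URI.create(fileUrl), conf);
    byte[] buffer = new byte[(int) (endOffset - offset)];
    try (FSDataInputStream in = fs.open(new Path(fileUrl))) {
      in.readFully(offset, buffer, 0, buffer.length); // positional read of just this block
    }
    return buffer;
  }
}
```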
Input DAG - BlockWriter
BlockWriter: takes BlockMetaData and Data, writes the block data to the local HDFS, and passes the BlockMetaData on downstream.
●Saves the data in the application's directory
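And a correspondingly small sketch of the write side: persist the bytes of one block under the application's directory on the destination HDFS (the directory layout is illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class BlockWriteSketch
{
  // Write one block's data to <appDir>/blocks/<blockId> on the local (destination) HDFS.
  static void writeBlock(long blockId, byte[] data, Path appDir, Configuration conf)
      throws IOException
  {
    FileSystem fs = FileSystem.get(conf);
    Path blockFile = new Path(appDir, "blocks/" + blockId);
    try (FSDataOutputStream out = fs.create(blockFile, true)) { // overwrite partial blocks on retry
      out.write(data);
    }
  }
}
```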
Input DAG - Synchronizer
Synchronizer: takes FileMetaData and BlockMetaData, tracks the blocks for each file, and sends a trigger once all the blocks for that file are available.
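The bookkeeping amounts to counting blocks per file; a sketch of that idea (class, field and method names are illustrative, and FileMetaData refers to the sketch in the FileSplitter section):

```java
import java.util.HashMap;
import java.util.Map;

class SynchronizerSketch
{
  private final Map<String, Integer> pendingBlocks = new HashMap<>(); // fileId -> blocks still missing
  private final Map<String, FileMetaData> files = new HashMap<>();    // fileId -> metadata awaiting merge

  // FileSplitter announced a file and the blocks it was split into.
  void onFileMetaData(FileMetaData file)
  {
    files.put(file.relativePath, file);
    pendingBlocks.merge(file.relativePath, file.blockIds.length, Integer::sum);
    maybeTrigger(file.relativePath);
  }

  // BlockWriter reported one block of this file as persisted.
  void onBlockWritten(String fileId)
  {
    pendingBlocks.merge(fileId, -1, Integer::sum);
    maybeTrigger(fileId);
  }

  private void maybeTrigger(String fileId)
  {
    Integer remaining = pendingBlocks.get(fileId);
    if (remaining != null && remaining <= 0 && files.containsKey(fileId)) {
      pendingBlocks.remove(fileId);
      emitMergeTrigger(files.remove(fileId)); // all blocks are in; tell the FileMerger
    }
  }

  void emitMergeTrigger(FileMetaData file)
  {
    // In the real operator this would be emitted on an output port to the FileMerger.
  }
}
```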
DAG - Input
[Diagram: the input DAG again - FileSplitter --BlockMetaData--> BlockReader --Data--> BlockWriter --BlockMetaData--> Synchronizer, with FileMetaData flowing from the FileSplitter to the Synchronizer and out of the Synchronizer as the merge trigger]
Output DAG - FileMerger
FileMerger: takes FileMetaData (the merge trigger) and merges the blocks to recreate the original file.
●Parameters
○Output directory to copy data to
○Overwrite - Yes / No
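A sketch of the plain (non-FastMerge) merge path: copy each block file, in order, into the recreated output file, honoring the overwrite flag. FileMetaData refers to the earlier sketch; the paths are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

class FileMergerSketch
{
  // Recreate the original file by appending its blocks in order.
  static void merge(FileMetaData file, Path blocksDir, Path outputDir,
                    boolean overwrite, Configuration conf) throws IOException
  {
    FileSystem fs = FileSystem.get(conf);
    Path target = new Path(outputDir, file.relativePath);
    if (fs.exists(target) && !overwrite) {
      return; // skip existing files unless overwrite is enabled
    }
    try (FSDataOutputStream out = fs.create(target, true)) {
      for (long blockId : file.blockIds) {
        Path blockFile = new Path(blocksDir, String.valueOf(blockId));
        try (FSDataInputStream in = fs.open(blockFile)) {
          IOUtils.copyBytes(in, out, conf, false); // append this block, keep the output open
        }
      }
    }
  }
}
```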
Output DAG - FileMerger - FastMerge Magic
[Diagram: a file's blocks B1, B2, ..., Bn are distributed across DataNode 1-4; FastMerge reassembles them into the original file]
Output DAG - FileMerger - FastMerge Magic
FastMerge applies when:
●the replication factor is the same
●the files are on the same HDFS cluster
●all files share the same block size
●the size of every file (except the last) is a multiple of the block size
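Under those conditions, blocks can be stitched together as an HDFS metadata operation instead of copying bytes, which is what DistributedFileSystem.concat() offers; a hedged sketch of that idea (the product's FileMerger may implement it differently):

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

class FastMergeSketch
{
  // Merge block files into the target by moving HDFS block pointers, not bytes.
  // Only valid when the preconditions above hold (same cluster, same block size,
  // every part except the last a multiple of the block size).
  static void fastMerge(Path target, Path[] blockFiles, URI clusterUri, Configuration conf)
      throws IOException
  {
    FileSystem fs = FileSystem.get(clusterUri, conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IOException("FastMerge requires HDFS");
    }
    // concat() appends the source files to the target and removes the sources.
    ((DistributedFileSystem) fs).concat(target, blockFiles);
  }
}
```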
DAG - Complete
[Diagram: FileSplitter --BlockMetaData--> BlockReader --Data--> BlockWriter --BlockMetaData--> Synchronizer --FileMetaData--> FileMerger, with FileMetaData also flowing from the FileSplitter to the Synchronizer]
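Tying it together, a sketch of the complete DAG in populateDAG(); operator classes and port names are illustrative stand-ins for the actual library operators:

```java
@Override
public void populateDAG(DAG dag, Configuration conf)
{
  FileSplitter splitter = dag.addOperator("FileSplitter", new FileSplitter());
  BlockReader reader = dag.addOperator("BlockReader", new BlockReader());
  BlockWriter writer = dag.addOperator("BlockWriter", new BlockWriter());
  Synchronizer sync = dag.addOperator("Synchronizer", new Synchronizer());
  FileMerger merger = dag.addOperator("FileMerger", new FileMerger());

  dag.addStream("BlockMetaData", splitter.blocksOut, reader.blocksIn);  // which block to read
  dag.addStream("FileMetaData", splitter.filesOut, sync.filesIn);       // how many blocks each file has
  dag.addStream("Data", reader.dataOut, writer.dataIn);                 // block contents
  dag.addStream("CompletedBlocks", writer.blocksOut, sync.blocksIn);    // blocks persisted so far
  dag.addStream("MergeTrigger", sync.triggerOut, merger.filesIn);       // file is complete, merge it
}
```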
Other features: Optional processing
●Compression
○Gzip and LZO
●Encryption
○PKI & AES
●Compaction
○Size based
●Dedup
●Dimension Computation & Aggregation
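As an illustration of where compression and encryption can slot into the write path, the output stream can simply be wrapped before the block data is written; a minimal sketch with the JDK's GZIP and AES primitives (PKI-based key exchange and LZO are not shown):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;

class StreamWrappers
{
  // Compress everything written to 'raw' with GZIP.
  static OutputStream gzip(OutputStream raw) throws IOException
  {
    return new GZIPOutputStream(raw);
  }

  // Encrypt everything written to 'raw' with AES (key distribution via PKI not shown).
  static OutputStream aes(OutputStream raw, SecretKey key) throws Exception
  {
    Cipher cipher = Cipher.getInstance("AES");
    cipher.init(Cipher.ENCRYPT_MODE, key);
    return new CipherOutputStream(raw, cipher);
  }
}
```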
Summary
●Easy to use
○Configure and run
●Unified for batch and continuous ingestion
●Handles
○Large files
○Large number of small files