Apache Apex Meetup
Contents
●What is Big Data Ingestion
●Challenges in File copy @ scale
●Ingestion using Apex
○Input
○Output
○Key features
●Demo
●Summary
Directed Acyclic Graph (DAG)
•A Stream is a sequence of data tuples
•An Operator takes one or more input streams, performs computations & emits one or more output streams
•Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
•An Operator has many instances that run in parallel, and each instance is single-threaded
•Directed Acyclic Graph (DAG) is made up of operators and streams
Apex: Application Programming Model
[Diagram: tuples flow through a chain of operators; each operator emits new streams (e.g. a filtered stream, an enriched stream) that feed downstream operators and the final output stream]
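A minimal sketch of how such a DAG is put together with the Apex Java API; the three operator classes (LineReader, UpperCaser, ConsoleWriter) are hypothetical placeholders, not operators from the library:

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;

// LineReader, UpperCaser and ConsoleWriter are hypothetical operators used only
// to show the shape of the API; each would expose input/output ports.
@ApplicationAnnotation(name = "MinimalDag")
public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Operators: your business logic, or operators from the library.
    LineReader reader = dag.addOperator("Reader", new LineReader());
    UpperCaser upper = dag.addOperator("Upper", new UpperCaser());
    ConsoleWriter writer = dag.addOperator("Writer", new ConsoleWriter());

    // Streams: sequences of tuples flowing from an output port to input ports.
    dag.addStream("lines", reader.output, upper.input);
    dag.addStream("upperCased", upper.output, writer.input);
  }
}
```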
What is Ingestion
Data ingestion
●the process of obtaining, importing, and processing data for later use or storage in a database
Big Data Ingestion
●discovering the data sources
●importing the data
●processing data to produce intermediate data
●sending the data out to durable data stores
Challenges in File copy @ scale
●Failure Recovery
●Copying big files in parallel
●Copying large number of small files
●Processing
○Encryption
○Compression
○Compaction
DAG - Components
[Diagram: Read Data → Process → Write Data]
DAG - Read Data : Requirements
●Independent of the input file system
○HDFS
○S3
○FTP
○NFS
●Scale to large data
○Large files
○Large number of small files
●Configurable Bandwidth usage
DAG - Read Data
●FileSplitter - break the whole task into smaller sub-tasks
○Connect to the input and scan for available data
○Assign the smaller tasks to downstream operators
●BlockReader - work on the sub-tasks handed out by the FileSplitter, one at a time
○Connect to the source and read the data as smaller tasks, one by one
○Pass the read data on to the downstream operator
●FileWriter - write the file
○Save the data read by the BlockReader
DAG - Simple Design
[Diagram: FileSplitter --BlockMetaData--> BlockReader --Data--> FileWriter]
Challenges
●Reading a single file in parallel is not possible
○Multiple BlockReaders and FileWriters can read multiple files in parallel, but a single file can't be read by two BlockReaders
●Failure recovery is hard
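In code, this simple design is just three operators wired together in populateDAG(); the operator classes and port names below are illustrative placeholders (building on the Application sketch earlier), not the exact library types:

```java
@Override
public void populateDAG(DAG dag, Configuration conf)
{
  FileSplitter splitter = dag.addOperator("FileSplitter", new FileSplitter());
  BlockReader reader = dag.addOperator("BlockReader", new BlockReader());
  FileWriter writer = dag.addOperator("FileWriter", new FileWriter());

  // FileSplitter tells the BlockReader which block of which file to read next.
  dag.addStream("BlockMetaData", splitter.blocksOut, reader.blocksIn);
  // BlockReader hands the bytes it read to the FileWriter.
  dag.addStream("Data", reader.dataOut, writer.dataIn);
}
```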
DAG - Read Data
●FileSplitter - break the whole task into smaller sub-tasks: scan the input for available data and assign the sub-tasks to downstream operators
●BlockReader - work on the sub-tasks handed out by the FileSplitter, one at a time: read the data block-by-block and pass it to the downstream operator
●FileWriter - write the file: save the data read by the BlockReader
●Synchronizer - check for completeness: make sure all the smaller tasks for a file have been completed by the upstream operators, then send the file-merger trigger
DAG - Input
[Diagram: FileSplitter fans BlockMetaData out to multiple BlockReaders in parallel; each BlockReader sends Data to a BlockWriter; the BlockWriters report BlockMetaData to the Synchronizer, which also receives FileMetaData from the FileSplitter]
Input DAG - FileSplitter
FileSplitter: scans the input files/directories, creates the smaller sub-tasks, and emits FileMetaData and BlockMetaData.
●Parameters
○input files/directories to copy data from
○recursive - Yes / No
○polling - Yes / No
○bandwidth - MB / sec
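As a sketch of how these knobs might be wired up in application code; every setter name below is a hypothetical placeholder for the operator's real properties, which in practice are usually set through the Apex configuration file:

```java
// Hypothetical configuration of the FileSplitter inside populateDAG();
// the setter names are placeholders, not the operator's actual API.
FileSplitter splitter = dag.addOperator("FileSplitter", new FileSplitter());
splitter.setFiles("hdfs://source-nn:8020/data/input");  // files/directories to copy data from
splitter.setRecursive(true);                            // descend into sub-directories
splitter.setPolling(true);                              // keep scanning for newly arrived files
splitter.setBandwidthMBPerSec(100);                     // cap on bandwidth usage (MB/sec)
```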
Input DAG - FileSplitter
●For each file in the directory:
○[output] FileMetaData - file information
■Name, e.g. InputFile.txt
■Size, e.g. 1073741824 (1 GB)
■Relative path, e.g. input/data/InputFile.txt
■Block IDs into which the file is virtually split, e.g. [0,1,2,3,4,5,6,7,8]
○[output] BlockMetaData - block information
■BlockID
■Start position
■End position
■File URL
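A simplified sketch of what these two tuple types carry, using the example values from this slide; field names are illustrative, and the real classes carry additional bookkeeping:

```java
// Illustrative view of the FileSplitter's output tuples.
class FileMetaData
{
  String fileName;      // e.g. "InputFile.txt"
  long fileLength;      // e.g. 1073741824 (1 GB)
  String relativePath;  // e.g. "input/data/InputFile.txt"
  long[] blockIds;      // e.g. [0, 1, 2, 3, 4, 5, 6, 7, 8]
}

class BlockMetaData
{
  long blockId;     // identifier of this block
  long offset;      // start position of the block within the file
  long endOffset;   // end position of the block within the file
  String filePath;  // URL of the source file this block belongs to
}
```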
Input DAG - BlockReader
BlockReader: takes BlockMetaData, reads the corresponding block from the remote location, and emits the Data (along with its BlockMetaData) downstream.
●Parameters
○Input URL, e.g. hdfs://node18:8020/user/hduser/input
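A minimal sketch of what reading one block boils down to with the Hadoop FileSystem API; operator scaffolding, retries and buffering are omitted:

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class BlockReadSketch
{
  // Read the byte range [offset, endOffset) of one block from the source file.
  static byte[] readBlock(String fileUrl, long offset, long endOffset, Configuration conf)
      throws IOException
  {
    // e.g. fileUrl = "hdfs://node18:8020/user/hduser/input/InputFile.txt"
    FileSystem fs = FileSystem.get(URI.create(fileUrl), conf);
    byte[] buffer = new byte[(int) (endOffset - offset)];
    try (FSDataInputStream in = fs.open(new Path(fileUrl))) {
      in.readFully(offset, buffer, 0, buffer.length); // positional read of just this block
    }
    return buffer;
  }
}
```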
Input DAG - BlockWriter
BlockWriter: takes BlockMetaData and Data, writes the block data to the local HDFS, and passes the BlockMetaData on downstream.
●Saves the data in the application's directory
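And a correspondingly small sketch of the write side: persist the bytes of one block under the application's directory on the destination HDFS (the directory layout is illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class BlockWriteSketch
{
  // Write one block's data to <appDir>/blocks/<blockId> on the local (destination) HDFS.
  static void writeBlock(long blockId, byte[] data, Path appDir, Configuration conf)
      throws IOException
  {
    FileSystem fs = FileSystem.get(conf);
    Path blockFile = new Path(appDir, "blocks/" + blockId);
    try (FSDataOutputStream out = fs.create(blockFile, true)) { // overwrite partial blocks on retry
      out.write(data);
    }
  }
}
```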
Input DAG - Synchronizer
Synchronizer: takes FileMetaData and BlockMetaData, tracks the blocks for each file, and sends a trigger once all the blocks for that file are available.
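The bookkeeping amounts to counting blocks per file; a sketch of that idea (class, field and method names are illustrative, and FileMetaData refers to the sketch in the FileSplitter section):

```java
import java.util.HashMap;
import java.util.Map;

class SynchronizerSketch
{
  private final Map<String, Integer> pendingBlocks = new HashMap<>(); // fileId -> blocks still missing
  private final Map<String, FileMetaData> files = new HashMap<>();    // fileId -> metadata awaiting merge

  // FileSplitter announced a file and the blocks it was split into.
  void onFileMetaData(FileMetaData file)
  {
    files.put(file.relativePath, file);
    pendingBlocks.merge(file.relativePath, file.blockIds.length, Integer::sum);
    maybeTrigger(file.relativePath);
  }

  // BlockWriter reported one block of this file as persisted.
  void onBlockWritten(String fileId)
  {
    pendingBlocks.merge(fileId, -1, Integer::sum);
    maybeTrigger(fileId);
  }

  private void maybeTrigger(String fileId)
  {
    Integer remaining = pendingBlocks.get(fileId);
    if (remaining != null && remaining <= 0 && files.containsKey(fileId)) {
      pendingBlocks.remove(fileId);
      emitMergeTrigger(files.remove(fileId)); // all blocks are in; tell the FileMerger
    }
  }

  void emitMergeTrigger(FileMetaData file)
  {
    // In the real operator this would be emitted on an output port to the FileMerger.
  }
}
```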
DAG - Input
[Diagram: the input DAG again - FileSplitter --BlockMetaData--> BlockReader --Data--> BlockWriter --BlockMetaData--> Synchronizer, with FileMetaData flowing from the FileSplitter to the Synchronizer and out of the Synchronizer as the merge trigger]
Output DAG - FileMerger
FileMerger: takes FileMetaData (the merge trigger) and merges the blocks to recreate the original file.
●Parameters
○Output directory to copy data to
○Overwrite - Yes / No
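A sketch of the plain (non-FastMerge) merge path: copy each block file, in order, into the recreated output file, honoring the overwrite flag. FileMetaData refers to the earlier sketch; the paths are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

class FileMergerSketch
{
  // Recreate the original file by appending its blocks in order.
  static void merge(FileMetaData file, Path blocksDir, Path outputDir,
                    boolean overwrite, Configuration conf) throws IOException
  {
    FileSystem fs = FileSystem.get(conf);
    Path target = new Path(outputDir, file.relativePath);
    if (fs.exists(target) && !overwrite) {
      return; // skip existing files unless overwrite is enabled
    }
    try (FSDataOutputStream out = fs.create(target, true)) {
      for (long blockId : file.blockIds) {
        Path blockFile = new Path(blocksDir, String.valueOf(blockId));
        try (FSDataInputStream in = fs.open(blockFile)) {
          IOUtils.copyBytes(in, out, conf, false); // append this block, keep the output open
        }
      }
    }
  }
}
```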
Output DAG - FileMerger - FastMerge Magic
[Diagram: a file's blocks B1, B2, ..., Bn are distributed across DataNode 1-4; FastMerge reassembles them into the original file]
Output DAG - FileMerger - FastMerge Magic
FastMerge applies when:
●the replication factor is the same
●the files are on the same HDFS cluster
●all files share the same block size
●the size of every file (except the last) is a multiple of the block size
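Under those conditions, blocks can be stitched together as an HDFS metadata operation instead of copying bytes, which is what DistributedFileSystem.concat() offers; a hedged sketch of that idea (the product's FileMerger may implement it differently):

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

class FastMergeSketch
{
  // Merge block files into the target by moving HDFS block pointers, not bytes.
  // Only valid when the preconditions above hold (same cluster, same block size,
  // every part except the last a multiple of the block size).
  static void fastMerge(Path target, Path[] blockFiles, URI clusterUri, Configuration conf)
      throws IOException
  {
    FileSystem fs = FileSystem.get(clusterUri, conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IOException("FastMerge requires HDFS");
    }
    // concat() appends the source files to the target and removes the sources.
    ((DistributedFileSystem) fs).concat(target, blockFiles);
  }
}
```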
DAG - Complete
[Diagram: FileSplitter --BlockMetaData--> BlockReader --Data--> BlockWriter --BlockMetaData--> Synchronizer --FileMetaData--> FileMerger, with FileMetaData also flowing from the FileSplitter to the Synchronizer]
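Tying it together, a sketch of the complete DAG in populateDAG(); operator classes and port names are illustrative stand-ins for the actual library operators:

```java
@Override
public void populateDAG(DAG dag, Configuration conf)
{
  FileSplitter splitter = dag.addOperator("FileSplitter", new FileSplitter());
  BlockReader reader = dag.addOperator("BlockReader", new BlockReader());
  BlockWriter writer = dag.addOperator("BlockWriter", new BlockWriter());
  Synchronizer sync = dag.addOperator("Synchronizer", new Synchronizer());
  FileMerger merger = dag.addOperator("FileMerger", new FileMerger());

  dag.addStream("BlockMetaData", splitter.blocksOut, reader.blocksIn);  // which block to read
  dag.addStream("FileMetaData", splitter.filesOut, sync.filesIn);       // how many blocks each file has
  dag.addStream("Data", reader.dataOut, writer.dataIn);                 // block contents
  dag.addStream("CompletedBlocks", writer.blocksOut, sync.blocksIn);    // blocks persisted so far
  dag.addStream("MergeTrigger", sync.triggerOut, merger.filesIn);       // file is complete, merge it
}
```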
Other features: Optional processing
●Compression
○Gzip and LZO
●Encryption
○PKI & AES
●Compaction
○Size based
●Dedup
●Dimension Computation & Aggregation
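As an illustration of where compression and encryption can slot into the write path, the output stream can simply be wrapped before the block data is written; a minimal sketch with the JDK's GZIP and AES primitives (PKI-based key exchange and LZO are not shown):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;

class StreamWrappers
{
  // Compress everything written to 'raw' with GZIP.
  static OutputStream gzip(OutputStream raw) throws IOException
  {
    return new GZIPOutputStream(raw);
  }

  // Encrypt everything written to 'raw' with AES (key distribution via PKI not shown).
  static OutputStream aes(OutputStream raw, SecretKey key) throws Exception
  {
    Cipher cipher = Cipher.getInstance("AES");
    cipher.init(Cipher.ENCRYPT_MODE, key);
    return new CipherOutputStream(raw, cipher);
  }
}
```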
Summary
●Easy to use
○Configure and run
●Unified for batch and continuous ingestion
●Handles
○Large files
○Large number of small files