Big Data and Hadoop Ecosystem

rajkrrsingh, 27 slides, Nov 14, 2013

About This Presentation

A brief introduction to the components of the Hadoop ecosystem


Slide Content

Big Data and Hadoop
Presenter
Rajkumar Singh
http://rajkrrsingh.blogspot.com/
http://in.linkedin.com/in/rajkrrsingh

Big Data and Hadoop Introduction
Volume
Facebook
Google Plus
Twitter
LinkedIn
Stock Exchange
Healthcare
Telecom
Variety: structured, semi-structured, unstructured
Velocity
Facebook
Stock Exchange
Healthcare
Telecom
Mobile Devices
GPS
Security Infrastructure

The Problem
e.g. Stock Market

The Solution (Hadoop Evolution)
Traditional Approach

GB->TB->PB--ZB
so the processing with RDBMS is Impossible

Challenges In Big data
• Storage: petabytes (PB)
• Processing: in a timely manner
• Variety of data: structured, semi-structured, unstructured
• Cost

To Overcome Big Data Challenges
Hadoop evolved
• Cost effective: runs on commodity hardware
• Big cluster (1000 nodes): provides both storage and processing
• Parallel processing: MapReduce
• Big storage: disk capacity per node * number of nodes / replication factor (RF)
• Failover mechanism: automatic failover
• Data distribution
• MapReduce framework
• Moving code to the data
• Heterogeneous hardware (IBM, HP, AIX, or Oracle machines of any memory and CPU configuration)
• Scalable
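The "Big Storage" bullet above can be worked through as simple arithmetic. The following is a minimal sketch, assuming the per-node figure is disk capacity and ignoring space reserved for the operating system or temporary data; the function name and the 4 TB per node value are illustrative assumptions, not figures from the slides.

```python
def usable_capacity_tb(disk_per_node_tb: float, nodes: int,
                       replication_factor: int = 3) -> float:
    """Approximate usable cluster storage in TB.

    Raw capacity is per-node disk times node count; every block is stored
    replication_factor times, so usable capacity is raw capacity / RF.
    """
    if nodes <= 0 or replication_factor <= 0:
        raise ValueError("nodes and replication_factor must be positive")
    raw_tb = disk_per_node_tb * nodes
    return raw_tb / replication_factor


if __name__ == "__main__":
    # Hypothetical cluster: 1000 nodes with 4 TB of disk each, RF = 3.
    print(round(usable_capacity_tb(4, 1000, 3), 1), "TB usable")
```

Raising the replication factor improves fault tolerance but lowers usable capacity proportionally.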

Typical Hadoop Infrastructure

What is Hadoop
• Java framework to process enormous amounts of data
Hadoop Core
• HDFS
• Programming Construct (Map Reduce)

HDFS

Processing Framework (Mapreduce)
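The MapReduce model can be illustrated with a word count, the usual introductory example. This is a single-process sketch in plain Python, not Hadoop's Java API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and in a real cluster the map and reduce steps would run in parallel on many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```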

Hadoop Ecosystem

Hadoop Sub-Projects
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to
application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute
clusters.
Other Hadoop-related projects at Apache include:
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• ZooKeeper™: A high-performance coordination service for distributed applications. 

HDFS
1 TB file, split into 4 pieces of 250 GB each and stored across the distributed file system (DFS)
Based on GFS (Google File System)

HDFS : Use Cases
Well suited for:
• Very large files.
• Streaming data access: read data in large volumes; write once, read many times.
• Commodity hardware; expensive hardware is not required.
Not well suited for:
• Low-latency access.
• Lots of small files.
• Parallel writers or arbitrary (random) reads.

HDFS Building Blocks
1 GB file = 1024 MB / 128 MB = 8 blocks
Default block size: 64 MB (Hadoop 1.x) or 128 MB (Hadoop 2.x)
Small files: a 100 MB file is smaller than the block size (128 MB), so it occupies 1 HDFS block of only 100 MB; storage is not padded to the full block size
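The block arithmetic above can be checked with a short sketch. The function name split_into_blocks is a hypothetical helper for illustration; it only models sizes in MB and says nothing about where blocks are stored.

```python
from typing import List


def split_into_blocks(file_size_mb: int, block_size_mb: int = 128) -> List[int]:
    """Return the sizes (in MB) of the blocks a file would be split into.

    Every block is full-sized except possibly the last one, which holds
    only the remaining data rather than being padded to block_size_mb.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks


if __name__ == "__main__":
    print(len(split_into_blocks(1024)))  # 1 GB file with 128 MB blocks
    print(split_into_blocks(100))        # file smaller than one block
```

With a 64 MB block size the same 1 GB file would need twice as many blocks, which is one reason larger block sizes reduce metadata overhead for very large files.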

HDFS Daemon Services
• NameNode
• Secondary NameNode
• DataNode
Like GFS, HDFS uses a master/slave architecture

HDFS Write
128 MB block, RF = 3
DataNodes: D1, D2, D3, D4
File 1 replicas: D1, D2, D4
File 2 replicas: D1, D2, D3
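To make the replication idea concrete, here is a minimal sketch that assigns each block to RF distinct DataNodes. It uses a simple round-robin rule chosen only for illustration; this is an assumption, not HDFS's actual placement policy, which is rack-aware and decided by the NameNode. The function name place_replicas is hypothetical.

```python
from typing import List


def place_replicas(block_index: int, datanodes: List[str],
                   replication_factor: int = 3) -> List[str]:
    """Pick replication_factor distinct DataNodes for one block.

    Round-robin starting at block_index; a simplification for illustration.
    """
    if replication_factor > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    start = block_index % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)]
            for i in range(replication_factor)]


if __name__ == "__main__":
    nodes = ["D1", "D2", "D3", "D4"]
    for block in range(4):
        print(f"block {block}: {place_replicas(block, nodes)}")
```

Because every block lives on three different nodes, losing any single DataNode still leaves two copies of each block available.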


HDFS File System Commands
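The original slide content for this section is not available, so the following is only a sketch of a few commonly used `hadoop fs` subcommands (-mkdir, -put, -ls, -cat, -get), built as argument lists in Python. The helper name hdfs_cmd and the file name local.txt are hypothetical; the directory path is borrowed from the distcp example in this deck. The commands are printed rather than executed, since running them requires a configured Hadoop client.

```python
from typing import List


def hdfs_cmd(*args: str) -> List[str]:
    """Build the argument list for a `hadoop fs` invocation."""
    return ["hadoop", "fs", *args]


# Illustrative commands; paths and file names are assumptions.
EXAMPLES = [
    hdfs_cmd("-mkdir", "/user/rajkrrsingh/input"),               # create a directory
    hdfs_cmd("-put", "local.txt", "/user/rajkrrsingh/input/"),   # upload a local file
    hdfs_cmd("-ls", "/user/rajkrrsingh/input"),                  # list a directory
    hdfs_cmd("-cat", "/user/rajkrrsingh/input/local.txt"),       # print file contents
    hdfs_cmd("-get", "/user/rajkrrsingh/input/local.txt", "."),  # download to local disk
]

if __name__ == "__main__":
    for cmd in EXAMPLES:
        print(" ".join(cmd))
    # To actually run one on a machine with Hadoop installed:
    #   import subprocess; subprocess.run(EXAMPLES[2], check=True)
```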

HDFS Federation

High Availability

Copying Data from One Cluster to Another
UAT cluster -> Prod cluster
Parallel copying using distcp:
hadoop distcp hdfs://uat:54311/user/rajkrrsingh/input hdfs://prod:54311/user/rajkrrsingh/input