A brief introduction to the components of the Hadoop ecosystem
Big Data and Hadoop
Presenter
Rajkumar Singh
http://rajkrrsingh.blogspot.com/
http://in.linkedin.com/in/rajkrrsingh
Big Data and Hadoop Introduction
Volume: Facebook, Google Plus, Twitter, LinkedIn, stock exchanges, healthcare, telecom
Variety: structured, semi-structured, and unstructured data
Velocity: Facebook, stock exchanges, healthcare, telecom, mobile devices, GPS, security infrastructure
The Problem
e.g. Stock Market
The Solution (Hadoop Evolution)
Traditional Approach
GB -> TB -> PB -> ZB
As data grows through these scales, processing it with a traditional RDBMS becomes impractical.
Challenges in Big Data
•Storage: petabyte scale
•Processing: results needed in a timely manner
•Variety of data: structured, semi-structured, unstructured
•Cost
To Overcome Big Data Challenges, Hadoop Evolved
•Cost effective: built on commodity hardware
•Big clusters (1000+ nodes) providing both storage and processing
•Parallel processing: MapReduce
•Big storage: storage per node × number of nodes ÷ replication factor (see the sizing example after this list)
•Automatic failover mechanism
•Data distribution
•MapReduce framework
•Moving code to the data
•Heterogeneous hardware (IBM, HP, AIX, Oracle machines of any memory and CPU configuration)
•Scalable
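As a sizing example (the node count and disk size here are assumptions; 3 is HDFS's default replication factor): a 1000-node cluster with 10 TB of disk per node yields roughly 1000 × 10 TB ÷ 3 ≈ 3.3 PB of usable storage.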
What is Hadoop?
•A Java framework for processing enormous amounts of data
Hadoop Core:
• HDFS
• Programming construct: MapReduce
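A minimal sketch of both halves in action, assuming a Hadoop 1.x installation; the examples jar name and the HDFS paths below are illustrative, not from the original slides:

# Stage input in HDFS, run the bundled word-count MapReduce job, read the result
hadoop fs -mkdir /user/rajkrrsingh/wc-input
hadoop fs -put local.txt /user/rajkrrsingh/wc-input
hadoop jar hadoop-examples-1.2.1.jar wordcount /user/rajkrrsingh/wc-input /user/rajkrrsingh/wc-output
hadoop fs -cat /user/rajkrrsingh/wc-output/part-*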
Hadoop Sub-Projects
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to
application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute
clusters.
Other Hadoop-related projects at Apache include:
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• ZooKeeper™: A high-performance coordination service for distributed applications.
HDFS
A distributed file system based on Google's GFS. For example, a 1 TB file is split into four 250 GB chunks and distributed across the cluster.
HDFS: Use Cases
Well suited for:
• Very large files
• Streaming data access: data read in large volumes, written once and read frequently
Not suited for:
• Expensive specialized hardware (HDFS targets commodity machines)
• Low-latency access
• Lots of small files
• Parallel writes / arbitrary reads
HDFS Building Blocks
Files are stored as fixed-size blocks; the default block size is 64 MB (128 MB in newer configurations).
Example: a 1 GB file = 1024 MB / 128 MB = 8 blocks.
Small files do not waste a full block: a 100 MB file with a 128 MB block size occupies one HDFS block of 100 MB, not 128 MB.
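To see how a specific file was actually split, the standard fsck tool reports per-file block details (the path here is illustrative):

# List the blocks and their DataNode locations for one file
hadoop fsck /user/rajkrrsingh/bigfile.txt -files -blocks -locations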
http://rajkrrsingh.blogspot.com
HDFS File System Commands
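A few commonly used commands, with illustrative paths:

hadoop fs -ls /user/rajkrrsingh                    # list a directory
hadoop fs -mkdir /user/rajkrrsingh/data            # create a directory
hadoop fs -put local.txt /user/rajkrrsingh/data    # copy a local file into HDFS
hadoop fs -get /user/rajkrrsingh/data/local.txt .  # copy an HDFS file to local disk
hadoop fs -cat /user/rajkrrsingh/data/local.txt    # print a file's contents
hadoop fs -rm /user/rajkrrsingh/data/local.txt     # delete a file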
HDFS Federation
Multiple independent NameNodes, each managing a portion of the filesystem namespace, share a common pool of DataNodes.
High Availability
An active NameNode is paired with a standby that takes over automatically if the active one fails.
Copying Data from One Cluster to Another
UAT Cluster -> Prod Cluster
Parallel copying using distcp:
hadoop distcp hdfs://uat:54311/user/rajkrrsingh/input hdfs://prod:54311/user/rajkrrsingh/input
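distcp runs as a MapReduce job, so the copy is parallelized across the cluster. For repeated syncs, the standard -update option copies only files that are missing or changed at the destination (cluster addresses as in the example above):

hadoop distcp -update hdfs://uat:54311/user/rajkrrsingh/input hdfs://prod:54311/user/rajkrrsingh/input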