A brief introduction to the components of the Hadoop ecosystem
Big Data and Hadoop
Presenter
Rajkumar Singh
http://rajkrrsingh.blogspot.com/
http://in.linkedin.com/in/rajkrrsingh
Big Data and Hadoop Introduction
Volume: Facebook, Google Plus, Twitter, LinkedIn, stock exchanges, healthcare, telecom
Variety: structured, semi-structured, and unstructured data
Velocity: Facebook, stock exchanges, healthcare, telecom, mobile devices, GPS, security infrastructure
The Problem
e.g. Stock Market
The Solution (Hadoop Evolution)
Traditional Approach
GB -> TB -> PB -> ZB
As data grows through these scales, processing it with a traditional RDBMS becomes impractical.
Challenges in Big Data
•Storage: petabyte scale
•Processing: results needed in a timely manner
•Variety of data: structured, semi-structured, unstructured
•Cost
To Overcome Big Data Challenges, Hadoop Evolved
•Cost effective: built on commodity hardware
•Big clusters (1000+ nodes) providing both storage and processing
•Parallel processing: MapReduce
•Big storage: storage per node × number of nodes ÷ replication factor (see the sizing example after this list)
•Automatic failover mechanism
•Data distribution
•MapReduce framework
•Moving code to the data
•Heterogeneous hardware (IBM, HP, AIX, Oracle machines of any memory and CPU configuration)
•Scalable
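As a sizing example (the node count and disk size here are assumptions; 3 is HDFS's default replication factor): a 1000-node cluster with 10 TB of disk per node yields roughly 1000 × 10 TB ÷ 3 ≈ 3.3 PB of usable storage.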
What is Hadoop?
•A Java framework for processing enormous amounts of data
Hadoop Core:
• HDFS
• Programming construct: MapReduce
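A minimal sketch of both halves in action, assuming a Hadoop 1.x installation; the examples jar name and the HDFS paths below are illustrative, not from the original slides:

# Stage input in HDFS, run the bundled word-count MapReduce job, read the result
hadoop fs -mkdir /user/rajkrrsingh/wc-input
hadoop fs -put local.txt /user/rajkrrsingh/wc-input
hadoop jar hadoop-examples-1.2.1.jar wordcount /user/rajkrrsingh/wc-input /user/rajkrrsingh/wc-output
hadoop fs -cat /user/rajkrrsingh/wc-output/part-*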
Hadoop Sub-Projects
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to
application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute
clusters.
Other Hadoop-related projects at Apache include:
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• ZooKeeper™: A high-performance coordination service for distributed applications.
HDFS
A distributed file system based on Google's GFS. For example, a 1 TB file is split into four 250 GB chunks and distributed across the cluster.
HDFS: Use Cases
Well suited for:
• Very large files
• Streaming data access: data read in large volumes, written once and read frequently
Not suited for:
• Expensive specialized hardware (HDFS targets commodity machines)
• Low-latency access
• Lots of small files
• Parallel writes / arbitrary reads
HDFS Building Blocks
Files are stored as fixed-size blocks; the default block size is 64 MB (128 MB in newer configurations).
Example: a 1 GB file = 1024 MB / 128 MB = 8 blocks.
Small files do not waste a full block: a 100 MB file with a 128 MB block size occupies one HDFS block of 100 MB, not 128 MB.
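To see how a specific file was actually split, the standard fsck tool reports per-file block details (the path here is illustrative):

# List the blocks and their DataNode locations for one file
hadoop fsck /user/rajkrrsingh/bigfile.txt -files -blocks -locations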
http://rajkrrsingh.blogspot.com
HDFS File System Commands
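A few commonly used commands, with illustrative paths:

hadoop fs -ls /user/rajkrrsingh                    # list a directory
hadoop fs -mkdir /user/rajkrrsingh/data            # create a directory
hadoop fs -put local.txt /user/rajkrrsingh/data    # copy a local file into HDFS
hadoop fs -get /user/rajkrrsingh/data/local.txt .  # copy an HDFS file to local disk
hadoop fs -cat /user/rajkrrsingh/data/local.txt    # print a file's contents
hadoop fs -rm /user/rajkrrsingh/data/local.txt     # delete a file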
HDFS Federation
Multiple independent NameNodes, each managing a portion of the filesystem namespace, share a common pool of DataNodes.
High Availability
An active NameNode is paired with a standby that takes over automatically if the active one fails.
Copying Data from One Cluster to Another
UAT Cluster -> Prod Cluster
Parallel copying using distcp:
hadoop distcp hdfs://uat:54311/user/rajkrrsingh/input hdfs://prod:54311/user/rajkrrsingh/input
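distcp runs as a MapReduce job, so the copy is parallelized across the cluster. For repeated syncs, the standard -update option copies only files that are missing or changed at the destination (cluster addresses as in the example above):

hadoop distcp -update hdfs://uat:54311/user/rajkrrsingh/input hdfs://prod:54311/user/rajkrrsingh/input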