Hadoop Distributed File System

VaibhavJain117 726 views 19 slides Feb 22, 2015


Slide Content

Hadoop Distributed File System (HDFS)

Big Data Concepts
- Volume: no longer gigabytes of data, but terabytes, petabytes, exabytes, zettabytes
- Velocity: high-frequency data, as in stock trading
- Variety: structured and unstructured data

Challenges in Big Data
- Complexity: no proper understanding of the underlying data
- Storage: how to accommodate a large amount of data on a single physical machine
- Performance: how to process a large amount of data efficiently and effectively

Challenges in Traditional Applications
- Network: limited bandwidth
- Data: growth of data cannot be controlled
- Efficiency and performance: how fast data can be read; the machine's processing capacity (processor and RAM) is a bottleneck

Statistics

Application Size (MB)   Data Size           Total Round-Trip Time (sec)
10                      10 MB               1 + 1 = 2
10                      100 MB              10 + 10 = 20
10                      1000 MB = 1 GB      100 + 100 = 200 (~3.3 min)
10                      1000 GB = 1 TB      100000 + 100000 ≈ 55 hours

Assumptions:
- Calculation is done under ideal conditions; no processing time is taken into consideration
- Network bandwidth is assumed to be 10 MB/s

How is data read?
- Line-by-line reading; depends on seek rate and disk latency
- Average data transfer rate: 75 MB/s
- Total time to read 100 GB: ~22 min
- Total time to read 1 TB: ~3 hours

How much time would you take to sort 1 TB of data? Enough time to watch a movie while the data is being read.
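The slide's timing arithmetic can be sketched in a few lines. This is a minimal model, assuming the slide's figures of a 10 MB/s network link and a 75 MB/s local disk transfer rate; the function names are illustrative, not part of Hadoop.

```python
def round_trip_seconds(data_mb, bandwidth_mb_s=10):
    """Time to move data to the application and results back (the slide's model)."""
    one_way = data_mb / bandwidth_mb_s
    return one_way * 2  # data travels over the network in both directions

def local_read_seconds(data_mb, transfer_mb_s=75):
    """Time to read data sequentially from a local disk."""
    return data_mb / transfer_mb_s

print(round_trip_seconds(1000))                # 1 GB over the network: 200.0 s
print(round_trip_seconds(1_000_000) / 3600)    # 1 TB: ~55.6 hours
print(local_read_seconds(102_400) / 60)        # 100 GB locally: ~22.8 min
```

The contrast between the two functions is the slide's whole point: moving 1 TB over a slow network costs hours, while a local disk reads it in a fraction of that time, which motivates moving the computation to the data.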

Statistics (contd.)

Observations:
- A large amount of data takes a long time to read
- Data is moved back and forth over a low-bandwidth network to where the application is running
- 90% of the time is consumed in data transfer
- Application size is constant

Conclusion: achieve data localization, either by moving the application close to the data or by moving the data close to the application.

Summary
- Storage is a problem: a single machine cannot store a large amount of data, and upgrading the hard disk will not solve the problem (hardware limitation)
- Performance degrades: upgrading RAM will not solve the problem (hardware limitation)
- Reading larger data requires more time

Solution Approach

A distributed framework that:
- Stores data across several machines
- Performs computation in parallel across several machines

It should support:
- Partial failures and recoverability
- Data availability and consistency
- Data reliability
- Upgrades

Introducing Hadoop

A distributed framework that provides scaling in:
- Storage
- Performance
- I/O bandwidth

What makes Hadoop special?
- No high-end or expensive systems are required
- Runs on Linux, Mac OS X, Windows, and Solaris
- Fault-tolerant: execution of a job continues even as nodes fail
- Highly reliable and efficient storage system
- Built-in intelligence to speed up applications (speculative execution)
- Fits many applications: web log processing, page indexing and page ranking, complex event processing

Features of Hadoop
- Partitions, replicates, and distributes the data (data availability and consistency)
- Performs computation close to the data (data localization)
- Performs computation across several hosts (MapReduce framework)

Hadoop Components

Hadoop is bundled with two independent components:
- HDFS (Hadoop Distributed File System): designed for scaling in terms of storage and I/O bandwidth
- MapReduce (MR) framework: designed for scaling in terms of performance

Understanding File Structure
- A 1 GB file is split into blocks
- Each block is typically 64 MB
- Each block is stored as two files: one holding the data, the other holding metadata (checksum)
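The splitting described above can be sketched as follows. This is an illustrative model, not Hadoop's actual code; it only shows how a file's byte range maps onto fixed-size blocks, with the last block allowed to be smaller.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size cited on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the file; the last block may be smaller."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

blocks = split_into_blocks(1024 * 1024 * 1024)  # a 1 GB file
print(len(blocks))  # 16 blocks of 64 MB each
```

Note that a block only occupies as much physical space as it actually holds, which matches the DataNode slide's point that a half-full block needs only half the space of a full one.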

Hadoop Processes

Processes running on Hadoop:
- NameNode
- DataNode
- Secondary NameNode
- JobTracker
- TaskTracker

NameNode
- Single point of contact; the HDFS master
- Holds meta-information: the list of files and directories, and the location of each block
- One NameNode per cluster; a cluster can have thousands of DataNodes and tens of thousands of HDFS clients
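The two kinds of meta-information the slide lists can be pictured as two lookup tables. This is a hypothetical sketch of the idea, not Hadoop's internal data structures; the paths, block IDs, and storage IDs below are invented for illustration.

```python
# File namespace: path -> ordered list of block IDs (hypothetical values)
namespace = {
    "/logs/access.log": ["blk_001", "blk_002"],
}

# Block map: block ID -> DataNode storage IDs holding a replica
block_locations = {
    "blk_001": ["XYZ001", "XYZ002", "XYZ003"],
    "blk_002": ["XYZ001", "XYZ003", "XYZ002"],
}

def locate(path):
    """Answer a client's question: which DataNodes hold each block of this file?"""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

print(locate("/logs/access.log"))
```

In this picture, the NameNode never serves file data itself: it only tells the client where the blocks live, and the client then reads from the DataNodes directly.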

DataNode
- Can execute multiple tasks concurrently
- Holds the actual data blocks, checksums, and generation stamps
- A half-full block needs only half the space of a full block
- At start-up, connects to the NameNode and performs a handshake
- Not bound to an IP address or port; identified by a Storage ID (e.g. XYZ001)
- Sends heartbeats to the NameNode

Communication

Every 3 seconds, each DataNode sends a heartbeat ("I AM ALIVE") to the NameNode, reporting:
- Total storage capacity
- Fraction of storage in use
- Number of data transfers currently in progress

In its reply, the NameNode can instruct the DataNode to:
- Replicate a block to another node
- Remove a local block replica
- Send an immediate block report
- Shut down the node

A DataNode that sends no heartbeat for 10 minutes is considered dead.
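The heartbeat bookkeeping above can be sketched as follows, using the slide's figures (a 3-second heartbeat interval, a node declared dead after 10 silent minutes). The class and method names are hypothetical, not Hadoop internals.

```python
HEARTBEAT_INTERVAL = 3   # seconds between "I AM ALIVE" messages
DEAD_TIMEOUT = 10 * 60   # seconds of silence before a node is declared dead

class HeartbeatMonitor:
    """Minimal sketch of the NameNode's view of DataNode liveness."""

    def __init__(self):
        self.last_seen = {}  # storage ID -> timestamp of last heartbeat

    def heartbeat(self, storage_id, now):
        """Record a heartbeat (in Hadoop it also carries capacity/usage stats)."""
        self.last_seen[storage_id] = now

    def dead_nodes(self, now):
        """Storage IDs that have been silent longer than the timeout."""
        return [sid for sid, t in self.last_seen.items()
                if now - t > DEAD_TIMEOUT]
```

Identifying nodes by Storage ID rather than address is what lets a DataNode reconnect with a new IP or port after a restart and still be recognized as the same node.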

Overview of HDFS

HDFS Client