BDA Unit II: Introduction to Hadoop



Learning Objectives
- To study the features of Hadoop.
- To learn the basic concepts of HDFS and MapReduce programming.
- To study the HDFS architecture.
- To understand reads and writes in HDFS.
- To understand the Hadoop ecosystem.
- To be able to perform HDFS operations.

Agenda
- Introduction to Hadoop
- Need for Hadoop
- Limitations of RDBMS
- RDBMS versus Hadoop
- Distributed Computing Challenges

Big Data?
- Extremely large datasets that are hard to deal with using relational databases: storage/cost, search/performance, analytics and visualization.
- Need for parallel processing on hundreds of machines.
- ETL cannot complete within a reasonable time; beyond 24 hours, the system never catches up.

Big Data Everywhere
Lots of data is being collected and warehoused:
- Web data, e-commerce
- Purchases at department and grocery stores
- Bank and credit card transactions
- Social networks

How Much Data?
- Every day, the NYSE (New York Stock Exchange) generates 1.5 billion shares and trade data.
- Every day, Facebook stores 2.7 billion comments and likes.
- Every day, Google processes about 24 petabytes of data.

How Much Data?
- Every minute, YouTube users upload 72 hours of new video content.
- Every minute, email users send over 200 million messages.
- Every minute, Amazon generates over $80,000 in online sales.
- Every second, banking applications process more than 10,000 credit card transactions.

Why Hadoop?
- Storage and computing power: Hadoop stores and processes huge amounts of any kind of data, quickly. Hadoop's distributed computing model processes big data fast: the more computing nodes you use, the more processing power you have.

Why Hadoop?
- Flexibility: there is no need to preprocess data before storing it. You can store as much data as you want and decide how to use it later.
- Low cost: the open-source framework is free and uses commodity hardware to store large quantities of data.

Why Hadoop?
- Scalability: you can easily grow the system to handle more data simply by adding nodes. Little administration is required.

Why Not RDBMS?
- RDBMS is not suitable for storing and processing large files, images, and videos.
- RDBMS is not a good choice for advanced analytics involving machine learning.

RDBMS versus Hadoop
- System: RDBMS is a relational database management system; Hadoop is a node-based flat structure.
- Data: RDBMS is suitable for structured data; Hadoop is suitable for structured and unstructured data and supports a variety of formats (XML, JSON).
- Processing: RDBMS targets OLTP; Hadoop targets analytical, big data processing.
- Choice: RDBMS when the data needs consistent relationships; Hadoop for big data processing that does not require consistent relationships between data.
- Processor: RDBMS needs expensive hardware or high-end processors to store huge volumes of data; a Hadoop cluster runs on nodes of low-cost commodity hardware.
- Cost: RDBMS costs around $10,000 to $14,000 per terabyte of storage; Hadoop costs around $4,000 per terabyte of storage.

Distributed Computing Challenges
- Resource sharing: access any data and utilize CPU resources across the system.
- Heterogeneity: different operating systems and hardware.
- Concurrency: allow concurrent access and update of shared resources.
- Hardware failure.
- How to process a gigantic store of data.

Topics Discussed
- Introduction to Hadoop
- Need for Hadoop
- Limitations of RDBMS
- RDBMS vs Hadoop
- Distributed Computing Challenges

Assignment
1. What is the role of Hadoop?
2. What are the limitations of RDBMS?
3. List various application domains that use Hadoop.
4. Differentiate RDBMS and Hadoop.
5. What are the major challenges in distributed computing?

Agenda
- History of Hadoop
- Hadoop Overview

History of Hadoop
- An open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware.
- Created by Doug Cutting and Mike Cafarella in 2005.
- Cutting named the project after his son's toy elephant.

Google Origins
Hadoop's design draws on Google's published work: the Google File System paper (2003) and the MapReduce paper (2004).

Hadoop's Developers
- 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. The project was funded by Yahoo.
- 2006: Yahoo gave the project to the Apache Software Foundation.

Real-World Application of Nutch: NASA's Planetary Data System
- NASA's archive for all planetary science data collected by missions over the past 30 years.
- Collected 20 TB over the past 30 years, increasing to over 200 TB in the next 3 years.
- Built up a catalog of all data collected.


Some Hadoop Milestones
- 2008: Hadoop wins the terabyte sort benchmark (sorted 1 terabyte of data in 209 seconds, compared with the previous record of 297 seconds).
- 2010: Hadoop's HBase, Hive, and Pig subprojects completed, adding more computational power to the Hadoop framework.
- 2011: ZooKeeper (centralized coordination service) completed.
- 2013: Hadoop 1.1.2 and Hadoop 2.0.3 alpha released; Ambari (Hadoop cluster management), Cassandra, and Mahout added.

Hadoop history

Hadoop Overview
Hadoop is an open-source software framework to store and process massive amounts of data in a distributed fashion on clusters of commodity hardware. Basically, Hadoop accomplishes two tasks:
- Massive storage
- Faster data processing

Key Aspects of Hadoop
- Open-source software: free to download, use, and contribute to.
- Framework: everything you need to develop and execute applications is provided (programs, tools, etc.).
- Distributed: divides and stores data across multiple computers; computation/processing is done in parallel across multiple connected nodes.
- Massive storage: stores colossal amounts of data across nodes of low-cost commodity hardware.
- Faster processing: processes large amounts of data in parallel, yielding quick responses.

Hadoop Components

Hadoop Core Components
HDFS:
- Storage component
- Distributes data across several nodes
- Natively redundant
MapReduce:
- Computational framework
- Splits a task across multiple nodes
- Processes data in parallel

Hadoop Core Components
HDFS (Hadoop Distributed File System): HDFS is the basic storage system of Hadoop. The large data files running on a cluster of commodity hardware are stored in HDFS. It can store data reliably even when hardware fails. The key aspects of HDFS are:
- Storage component
- Distributes data across several nodes
- Natively redundant

Hadoop Core Components (contd.)
MapReduce: MapReduce is the Hadoop layer responsible for data processing. You write a MapReduce application to process unstructured and structured data stored in HDFS. The processing is done in two phases: Map and Reduce. The key aspects of MapReduce are:
- Computational framework
- Splits a task across multiple nodes
- Processes data in parallel

Hadoop High-Level architecture

Hadoop distributors

Topics Discussed
- History of Hadoop
- Hadoop Overview
- Key Aspects of Hadoop
- Hadoop Core Components
- Hadoop Distributors

Assignment
1. Explain the evolution of Hadoop.
2. What are the core components of Hadoop?
3. What are the issues in Hadoop distributions?
4. What is HDFS?

BDA Lecture Series: Introduction to Hadoop

Agenda
- Use Case of Hadoop
- Hadoop Distributors
- HDFS (Hadoop Distributed File System)

Use Cases of Hadoop: ClickStream Data Analysis
- ClickStream data (mouse clicks) helps to understand the purchasing behavior of a customer.
- ClickStream analysis helps online marketers optimize their product web pages, promotional content, etc., to improve their business.
Key benefits of ClickStream data analysis using Hadoop:
- Joins clickstream data with CRM and sales data.
- Stores years of data without much incremental cost.
- Hive or Pig scripts can be used to analyze the data.

Hadoop is a distributed master-slave architecture.
- Master HDFS: its main responsibility is partitioning the data storage across the slave nodes. It also keeps track of the locations of data on the DataNodes.
- Master MapReduce: it decides and schedules computation tasks on the slave nodes.

Hadoop High-Level architecture

Hadoop Distributed File System
Some key points of HDFS are as follows:
- Storage component of Hadoop
- Distributed file system
- Modeled after the Google File System
- Optimized for high throughput
- A file is replicated a configured number of times, which makes HDFS tolerant of both software and hardware failure.
- Data blocks are automatically re-replicated when the nodes holding them fail.

Features of HDFS

HDFS Key Points
- Block-structured file system
- Default replication factor: 3
- Default block size: 64 MB / 128 MB
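A quick way to check what a given cluster actually uses is to query the defaults through the HDFS Java API. This is a minimal illustrative sketch, assuming the cluster's core-site.xml/hdfs-site.xml are on the classpath; the class name is made up for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDefaults {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/");                     // any path; only used to resolve the defaults
    System.out.println("Default replication: " + fs.getDefaultReplication(p));
    System.out.println("Default block size (bytes): " + fs.getDefaultBlockSize(p));
    fs.close();
  }
}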

HDFS Daemons: NameNode
- Blocks: HDFS breaks large files into smaller pieces called blocks. The NameNode keeps track of the blocks of a file.
- rackID: the NameNode uses a rackID to identify the DataNodes in a rack (a rack is a collection of DataNodes within the cluster).
- File system namespace: a collection of files in the cluster. The NameNode stores the HDFS namespace.

HDFS Daemons: NameNode (contd.)
- FsImage: the file system namespace, including the mapping of blocks to files and the file properties, is stored in a file called FsImage.
- EditLog: the NameNode uses an EditLog (transaction log) to record every transaction that happens to the file system metadata.

HDFS Daemons: DataNode
- There are multiple DataNodes per cluster.
- During pipelined reads and writes, DataNodes communicate with each other.
- A DataNode also continuously sends "heartbeat" messages to the NameNode to ensure connectivity between the NameNode and the DataNode.

HDFS Daemons: Secondary NameNode
- Takes snapshots of the HDFS metadata at intervals specified in the Hadoop configuration.
- Needs as much memory as the NameNode, so it runs on a different machine.
- In case of NameNode failure, the Secondary NameNode can be configured manually to bring up the cluster.

Anatomy of file read

The steps involved in a file read are as follows:
1. The client opens the file it wishes to read by calling open() on the DistributedFileSystem (DFS).
2. The DFS communicates with the NameNode to get the locations of the data blocks; the NameNode returns the addresses of the DataNodes.
3. The client then calls read() on the stream (DFSInputStream), which holds the DataNode addresses for the first few blocks of the file.

4. The client calls read() repeatedly to stream the data from the DataNode.
5. When the end of a block is reached, DFSInputStream closes the connection with that DataNode.
6. When the client has finished reading the file, it calls close() on the stream to close the connection.
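From the client's point of view, all of these steps are hidden behind a single open()/read()/close() sequence on the FileSystem API. A minimal sketch, assuming the HDFS file /user/demo/input.txt exists and the cluster configuration is on the classpath (both are assumptions for this illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);                              // DistributedFileSystem when fs.defaultFS is hdfs://
    FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"));  // steps 1-3: open() asks the NameNode for block locations
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);                  // steps 4-5: read() streams block data from DataNodes
    } finally {
      IOUtils.closeStream(in);                                         // step 6: close the stream
    }
    fs.close();
  }
}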

Anatomy of file write

Anatomy of a File Write
1. The client calls create() on the DistributedFileSystem to create a file.
2. The DFS makes an RPC call to the NameNode to create the new file.
3. As the client writes data, DFSOutputStream splits the data into packets and writes them to an internal queue called the data queue. The DataStreamer consumes the data queue.

4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline.
5. In addition to the data queue, DFSOutputStream also manages an "ack queue" of packets that are waiting to be acknowledged by the DataNodes.
6. When the client finishes writing the file, it calls close() on the stream.
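Again, the write pipeline is managed for you by the FileSystem API; the client only calls create(), write(), and close(). A minimal sketch under the same assumptions as the read example (the output path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"));  // steps 1-2: create() via an RPC to the NameNode
    try {
      out.writeBytes("hello hdfs\n");  // steps 3-5: data is packetized, queued, and pipelined to DataNodes
    } finally {
      out.close();                     // step 6: flush remaining packets and wait for acknowledgements
    }
    fs.close();
  }
}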

Replica Placement Strategy
- The first replica is placed on the same node as the client.
- The second replica is placed on a node on a different rack.
- The third replica is placed on the same rack as the second, but on a different node in the rack.
Special Features of HDFS
- Data replication: there is absolutely no need for a client application to track all the blocks. HDFS directs the client to the nearest replica to ensure high performance.
- Data pipeline: a client application writes a block to the first DataNode in the pipeline.
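If you want to see where the replicas of a file's blocks actually ended up, the FileSystem API can report the block locations. This is an illustrative sketch only; the path is assumed to exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));
    // One BlockLocation per block; each one lists the DataNodes that hold a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("block at offset " + block.getOffset() + " -> " + String.join(", ", block.getHosts()));
    }
    fs.close();
  }
}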

What are the four modules that make up the Apache Hadoop framework?
- Hadoop Common, which contains the common utilities and libraries needed by Hadoop's other modules.
- Hadoop YARN, the framework's platform for resource management.
- HDFS, which stores data on commodity machines.
- Hadoop MapReduce, a programming model used to process large-scale sets of data.

Different modes in which Hadoop runs:
- Standalone (local) mode: one of the least commonly used environments, intended only for running MapReduce programs.
- Pseudo-distributed mode: runs all daemons on a single machine. It is most commonly used in development environments.
- Fully distributed mode: most commonly used in production environments. This mode runs all daemons on a cluster of machines rather than a single one.

Processing Data with Hadoop
- MapReduce programming helps process massive amounts of data in parallel.
- The input data set is split into independent chunks.
- Map tasks generate intermediate (key, value) results, which are automatically shuffled and sorted by the framework.
- Reduce tasks provide the reduced output by combining the output of the various mappers.
- Data locality: MapReduce tasks are performed where the data is stored.

JobTracker
The JobTracker is a master daemon responsible for executing a MapReduce job. It provides connectivity between Hadoop and the application. Whenever code is submitted to the cluster, the JobTracker creates the execution plan by deciding which task to assign to which node. It also monitors all running tasks. When a task fails, it automatically re-schedules the task to a different node after a predefined number of retries. There can be only one JobTracker process running on a single Hadoop cluster, and it runs in its own Java Virtual Machine process.

TaskTracker
This daemon is responsible for executing the individual tasks assigned by the JobTracker. The TaskTracker continuously sends heartbeat messages to the JobTracker. When the JobTracker fails to receive a heartbeat message from a TaskTracker, it assumes that the TaskTracker has failed and resubmits the task to another available node in the cluster.

MapReduce Framework
Phases:
- Map: converts input into key-value pairs.
- Reduce: combines the output of the mappers and produces a reduced result set.
Daemons:
- JobTracker: master; schedules tasks.
- TaskTracker: slave; executes tasks.

Job Tracker and Task Tracker Interaction

How does MapReduce work?

How does MapReduce work?
In this example there are two mappers and one reducer. Each mapper works on the partial data set stored on its node, and the reducer combines the output from the mappers to produce the reduced result set.
Working model:
- First, the input dataset is split into multiple pieces of data.
- Next, the framework creates a master and several worker processes and executes the worker processes remotely.

MapReduce Programming Architecture

MapReduce example

MapReduce example
The workflow of MapReduce consists of 5 steps:
1. Splitting: the splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping: as explained above, each split is converted into (key, value) pairs.
3. Intermediate splitting (shuffling): the entire process runs in parallel on different nodes; in order to group the data in the reduce phase, records with the same key must end up on the same node.
4. Reduce: essentially a group-by phase.
5. Combining: the last phase, where all the data (the individual result sets from each node) is combined to form the final result.
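A small worked trace of these steps for a word count (the input lines are invented for this illustration):

Input:      "car bus car"  and  "bus train bus"
Splitting:  split 1 = "car bus car",  split 2 = "bus train bus"
Mapping:    split 1 -> (car,1) (bus,1) (car,1);  split 2 -> (bus,1) (train,1) (bus,1)
Shuffling:  bus -> [1,1,1],  car -> [1,1],  train -> [1]
Reducing:   (bus,3) (car,2) (train,1)
Result:     bus 3, car 2, train 1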

MapReduce example: workflow of the program

Word count MapReduce program using Java
1. Driver class: specifies the job configuration details.
2. Mapper class: the map function.
3. Reducer class: the reduce function.

Word count MapReduce program using Java: steps
Step 1. Open Eclipse > File > New > Java Project (name it MRProgramsDemo) > Finish.
Step 2. Right-click > New > Package (name it PackageDemo) > Finish.
Step 3. Right-click on the package > New > Class (name it WordCount).
Step 4. Add the following reference libraries: right-click on the project > Build Path > Add External Archives:
- /usr/lib/hadoop-0.20/hadoop-core.jar
- /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar

Word count MapReduce program using Java: package and imports

package PackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

Word count MapReduce program using Java: driver (the main method of the WordCount class)

public static void main(String[] args) throws Exception {
  Configuration c = new Configuration();
  String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
  Path input = new Path(files[0]);
  Path output = new Path(files[1]);
  Job j = new Job(c, "wordcount");
  j.setJarByClass(WordCount.class);
  j.setMapperClass(MapForWordCount.class);
  j.setReducerClass(ReduceForWordCount.class);
  j.setOutputKeyClass(Text.class);
  j.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(j, input);
  FileOutputFormat.setOutputPath(j, output);
  System.exit(j.waitForCompletion(true) ? 0 : 1);
}

Word count MapReduce program using Java: mapper and reducer (together with the main method above, these make up the WordCount class)

public class WordCount {

  public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
      String line = value.toString();
      String[] words = line.split(",");
      for (String word : words) {
        Text outputKey = new Text(word.toUpperCase().trim());
        IntWritable outputValue = new IntWritable(1);
        con.write(outputKey, outputValue);
      }
    }
  }

  public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      con.write(word, new IntWritable(sum));
    }
  }
}
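To run the job, the project is typically exported as a jar and submitted through the hadoop launcher. The jar name and HDFS paths below are placeholders, not part of the original slides:

hadoop jar MRProgramsDemo.jar PackageDemo.WordCount /user/demo/input /user/demo/output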


YARN (Yet Another Resource Negotiator)
Apache Hadoop YARN is a subproject of Hadoop 2.x, and Hadoop 2.x is a YARN-based architecture. YARN is a general processing platform and is not constrained to MapReduce only. One can run multiple applications in Hadoop 2.x, with all applications sharing common resource management. Hadoop can now be used for various types of processing such as batch, interactive, online, streaming, graph, and others.

YARN: limitations of Hadoop 1.0
In Hadoop 1.0, HDFS and MapReduce are the core components, while other components are built around the core.
- Single namespace.
- Restricted processing model.
- Not suited to interactive analysis.
- Not suitable for machine learning algorithms, graphs, and other memory-intensive algorithms.
- MapReduce is responsible for both cluster resource management and data processing.

Hadoop 2 YARN: Taking Hadoop beyond batch

Hadoop 2 YARN Architecture

Hadoop 2 YARN: Taking Hadoop beyond batch
- ResourceManager: arbitrates resources among all the applications in the system.
- NodeManager: the per-machine slave, responsible for launching the applications' containers and monitoring their resource usage.
- ApplicationMaster: negotiates appropriate resource containers from the scheduler, tracking their status and monitoring progress.
- Container: the unit of allocation, incorporating resource elements such as memory, CPU, disk, and network, used to execute a specific task of the application.

YARN Architecture
1. The client program submits an application, including the specification needed to launch the ApplicationMaster.
2. The ResourceManager launches the ApplicationMaster by assigning a container.
3. The ApplicationMaster registers with the ResourceManager.
4. On successful container allocation, the ApplicationMaster launches the container by providing the container launch specification to the NodeManager.

5. The NodeManager executes the application code.
6. During application execution, the client that submitted the job communicates directly with the ApplicationMaster to get status and progress updates.
7. Once the application has been processed completely, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.
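On the client side, application submission and status polling (steps 1 and 6) go through the YarnClient API. A minimal sketch that only lists the applications known to the ResourceManager, assuming yarn-site.xml is on the classpath (the class name is made up for this example):

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());   // picks up yarn-site.xml
    yarnClient.start();
    // Ask the ResourceManager for the applications it currently knows about.
    for (ApplicationReport report : yarnClient.getApplications()) {
      System.out.println(report.getApplicationId() + "  "
          + report.getName() + "  "
          + report.getYarnApplicationState());
    }
    yarnClient.stop();
  }
}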

The following are the components of the Hadoop ecosystem:
- HDFS: Hadoop Distributed File System. It simply stores data files as close to the original form as possible.
- HBase: Hadoop's database; it compares well with an RDBMS and supports structured data storage for large tables.
- Hive: enables analysis of large data sets using a language very similar to SQL, so one can access data stored in a Hadoop cluster by using Hive.
- Pig: an easy-to-understand data flow language that helps with the analysis of the large data sets typical of Hadoop.

- ZooKeeper: a coordination service for distributed applications.
- Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
- Mahout: a scalable machine learning and data mining library.
- Chukwa: a data collection system for managing large distributed systems.
- Sqoop: used to transfer bulk data between Hadoop and structured data stores such as relational databases.
- Ambari: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.

Agenda
8. Use Case of Hadoop
9. Hadoop Distributors
10. HDFS (Hadoop Distributed File System)
11. Processing Data with Hadoop
12. Managing Resources and Applications with Hadoop YARN (Yet Another Resource Negotiator)
13. Interacting with Hadoop Ecosystem