Lec 2 & 3 _Unit 1_Hadoop _MapReduce1.pptx

Uploaded by ashima967262 · 16 slides · May 20, 2025



Slide Content

Hadoop HDFS and MapReduce (BTDS603-20), Module 1

What is Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop consists of:
- Hadoop Common, a package that provides file-system and operating-system-level abstractions
- the MapReduce engine
- the Hadoop Distributed File System (HDFS), a distributed, scalable, and portable file system written in Java for the Hadoop framework

Hadoop Architecture

HDFS
HDFS has five services:
- Name Node
- Secondary Name Node
- Job Tracker
- Data Node
- Task Tracker

Nodes
Name Node (also known as the master): The master node tracks files, manages the file system, and holds the metadata of all of the stored data. The Name Node records the number of blocks, the locations of the Data Nodes on which the data is stored, where the replicas are kept, and other details. The Name Node is in direct contact with the client.
Data Node (also known as the slave): A Data Node stores data as blocks. It holds the actual data in HDFS and serves client read and write requests. Every Data Node sends a heartbeat message to the Name Node every 3 seconds to convey that it is alive. If a Data Node stops sending heartbeats, the Name Node declares it dead and starts replicating its blocks on other Data Nodes.

Nodes (continued)
Secondary Name Node: Takes checkpoints of the file system metadata held by the Name Node; it is therefore also known as the Checkpoint Node.
Job Tracker: Receives MapReduce execution requests from the client, then consults the Name Node to learn the locations of the data that will be used in processing.
Task Tracker: The slave node of the Job Tracker; it takes tasks from the Job Tracker and applies the supplied code to the data. The process of applying that code to the file is known as the Mapper.

Commands on HDFS
There are two types of commands:
- Admin commands, e.g. get status, generate a report
- Shell-like file system commands, e.g. put a file into DFS, create a directory in DFS, show the contents of a file

Hadoop admin command list
hadoop classpath -- prints the class path needed to get the Hadoop jar and the required libraries
hadoop conftest -- checks configuration files; validates configuration XML files
hadoop version -- prints the Hadoop version
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo -- DistCp (distributed copy) is a tool used for large inter- and intra-cluster copying
hadoop envvars -- displays computed Hadoop environment variables
mapred historyserver -- starts the JobHistoryServer
hdfs dfs -ls -d /user/hadoop (or hadoop fs -ls -d /user/hadoop) -- lists the directory entry itself
hadoop fs -expunge -- removes files from the trash
hadoop dfsadmin -help -- prints help for the admin commands
hadoop namenode -format -- formats the Name Node, initializing a new file system

Hadoop File system commands


MapReduce
MapReduce is a processing technique and a programming paradigm for distributed computing. The paradigm consists of two parts: Map and Reduce.
Map stage: The map or mapper's job is to process the input data. Map takes a set of data and converts it into another set of intermediate data, where individual elements are broken down into tuples (key/value pairs).
Reduce stage: The Reduce task takes the output of the Map task and reduces it to a more compact output. After processing, it produces a new set of output, which is stored in HDFS.
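The two stages can be sketched in plain Python. This is a single-machine simulation of the paradigm, not the Hadoop API; the function names (map_phase, reduce_phase) are illustrative:

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Map stage: apply the mapper to every input record;
    # each call may emit any number of (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def reduce_phase(pairs, reducer):
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce stage: collapse each group to a compact output value.
    return {key: reducer(key, values) for key, values in groups.items()}

# Example: word count over two tiny "documents".
docs = ["map reduce map", "reduce"]
mapper = lambda doc: [(word, 1) for word in doc.split()]
reducer = lambda key, values: sum(values)
print(reduce_phase(map_phase(docs, mapper), reducer))
# → {'map': 2, 'reduce': 2}
```

In real Hadoop, the shuffle step in the middle is performed by the framework between the Map and Reduce tasks; the programmer supplies only the mapper and the reducer.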

MapReduce: The Map Step
[Diagram] Input key-value pairs, e.g. (doc-id, doc-content), are each fed to a map function, which emits intermediate key-value pairs, e.g. (word, wordcount-in-a-doc).

MapReduce: The Reduce Step
[Diagram] Intermediate key-value pairs, e.g. (word, wordcount-in-a-doc), are first grouped by key into key-value groups, e.g. (word, list-of-wordcounts) — analogous to SQL GROUP BY. Each group is then passed to a reduce function that produces the output key-value pairs, e.g. (word, final-count) — analogous to SQL aggregation.
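The group and reduce steps from the diagram can be shown in isolation. A small Python sketch with made-up data, using a dictionary of lists for the groups:

```python
from collections import defaultdict

# Intermediate key-value pairs as produced by the map step.
intermediate = [("apple", 1), ("pear", 1), ("apple", 1), ("apple", 1)]

# Group: collect all values that share a key (~ SQL GROUP BY).
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)
# groups == {"apple": [1, 1, 1], "pear": [1]}

# Reduce: aggregate each value list to one output value (~ SQL aggregation).
final = {key: sum(values) for key, values in groups.items()}
print(final)
# → {'apple': 3, 'pear': 1}
```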

Case of word count using MapReduce
MAP
Input (set of data): Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Output (converted into another set of data as (key, value) pairs): (Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
REDUCE
Input (the output of the Map function): the set of tuples above
Output (converted into a smaller set of tuples, counting words case-insensitively): (BUS,7), (CAR,7), (TRAIN,4)
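The walk-through above can be reproduced end to end. A Python simulation of the two steps, not Hadoop code; note that the reducer folds keys to upper case, which is how the slide's final counts treat Bus, buS, and BUS as the same word:

```python
from collections import Counter

raw = ("Bus, Car, bus, car, train, car, bus, car, train, bus, "
       "TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN")

# MAP: emit one (word, 1) tuple per input word.
pairs = [(word.strip(), 1) for word in raw.split(",")]

# SHUFFLE + REDUCE: group case-insensitively and sum the 1s.
counts = Counter()
for word, one in pairs:
    counts[word.upper()] += one

print(sorted(counts.items()))
# → [('BUS', 7), ('CAR', 7), ('TRAIN', 4)]
```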

MAPREDUCE