MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab

89 slides, May 14, 2018

About This Presentation

Big Data with Hadoop & Spark Training: http://bit.ly/2skCodH

This CloudxLab Understanding MapReduce tutorial helps you to understand MapReduce in detail. Below are the topics covered in this tutorial:

1) Thinking in Map / Reduce
2) Understanding Unix Pipeline
3) Examples to understand MapReduc...


Slide Content

Welcome to MapReduce Session

MapReduce
TODAY’S CLASS
●Thinking in MapReduce
○Word Frequency Problem
■Solution 1 - Coding
■Solution 2 - SQL
■Solution 3 - Unix Pipes
■Solution 4 - External Sort
●Map/Reduce Overview
●Visualisation
●Analogies to groupby
●Assignments

Understanding Sorting

MapReduce
BIG DATA PROBLEM - PROCESSING
Q: How fast can a 1 GHz processor sort 1 TB of data? This
data is made up of 10 billion 100-byte strings.
A: Around 6-10 hours
What's wrong with 6-10 hours?
We need:
1. Faster sorting
2. Sorting of bigger data
3. Sorting more often

MapReduce
BIG DATA PROBLEM - PROCESSING
Google, 8 Sept, 2011:
Sorting 10PB took 6.5 hrs on 8000 computers

MapReduce
Why is Sorting such a big deal?
1. Every SQL query is impacted by sorting:
○ Where clause - index (sorting)
○ Group By - involves sorting
○ Joins - immensely enhanced by sorting
○ Distinct
○ Order By
2. Most algorithms depend on sorting

MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
•A programming paradigm
•Helps solve Big Data problems
•Specifically sorting-intensive or disk-read-intensive jobs
•You have to code two functions:
•Mapper - converts input into "key-value" pairs
•Reducer - aggregates all the values for a key

MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
•Also supported by many other systems, such as:
•MongoDB / CouchDB / Cassandra
•Apache Spark
•Mappers & reducers in Hadoop
•can be written in Java, Shell, Python or any binary

MapReduce
MAP REDUCE
[Diagram: three HDFS blocks feed three input splits; each split goes
through Map and Sort; the sorted map outputs are copied and merged
into a single Reduce, which writes Part 0 back to HDFS.]

MapReduce
MAP REDUCE
CutIntoPieces()

MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file containing 100s of text books (~500 MB),
how would you find the frequencies of words?

MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of words?
Approach 1 (Programmatic):
•Create a frequency hash table / dictionary
•For each word in the files
•Increase its frequency in the hash table
•When no more words left in file, print the hash table
Problems?

MapReduce
THINKING IN MAP / REDUCE
Problems?
[Flowchart: Start → initialize a dictionary/hashtable (word, count) →
read the next word from the file → if a word is left, look it up in
the dictionary: add it with count 0 if it is missing, then increase
its count by 1 → when no words are left, print the words and counts
→ End]
1. wordcount = {}
2. for word in file.read().split():
3.     if word not in wordcount:
4.         wordcount[word] = 0
5.     wordcount[word] += 1
6. for k, v in wordcount.items():
7.     print(k, v)

MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of words?
Approach 1 (Programmatic):
•Create a frequency hash table / dictionary
•For each word in the file
•Increase its frequency in the hash table
•When no more words are left in the file, print the hash table
Problems?
Cannot process data beyond the RAM size.

MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of words?
Approach 2 (SQL):
•Break the books into one word per line
•Insert one word per row in a database table
•Execute: select word, count(*) from table group by word
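A minimal sketch of Approach 2 using Python's built-in sqlite3 module (the table name, column name and sample text are illustrative, not from the deck):

```python
import sqlite3

# An in-memory database stands in for a real warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")

text = "sa re re sa ga"
# One word per row, as the slide suggests.
conn.executemany("INSERT INTO words VALUES (?)",
                 [(w,) for w in text.split()])

rows = conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word").fetchall()
print(dict(rows))   # → {'ga': 1, 're': 2, 'sa': 2}
```

The database engine does the sorting/grouping internally, which is exactly the work MapReduce later distributes.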

Understanding Unix Pipeline

MapReduce
Understanding Unix Pipeline
A program can take input from you.

MapReduce
Understanding Unix Pipeline
A program may also print some output

MapReduce
Understanding Unix Pipeline
command1 | command2
Command1 Command2Pipe

MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of words?
Approach 3 (Unix):
•Replace each space with a newline
•Order the lines with the sort command
•Then find frequencies using uniq -c
•Scans from top to bottom
•prints the count whenever the line value changes
cat myfile | sed -E 's/[\t ]+/\n/g' | sort -S 1g | uniq -c
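The pipeline can be tried on a tiny file (the filename is illustrative; this assumes GNU sed, which treats \n in the replacement as a newline):

```shell
# Create a tiny input file.
printf 'sa re re\nsa ga\n' > myfile

# The slide's pipeline: split into words, sort, count adjacent runs.
cat myfile | sed -E 's/[\t ]+/\n/g' | sort | uniq -c
```

uniq -c only counts adjacent identical lines, which is why the sort step is mandatory.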

MapReduce
THINKING IN MAP / REDUCE
Problems in Approach 2 (SQL) & Approach 3 (Unix)?

MapReduce
THINKING IN MAP / REDUCE
Problems in Approach 2 (SQL) & Approach 3 (Unix)?
The moment the data grows beyond the RAM size, the time taken
starts increasing. The following become bottlenecks:
•CPU
•Disk Speed
•Disk Space

MapReduce
THINKING IN MAP / REDUCE
Then?
Approach 4: Use an external sort.
•Split the file into pieces that fit in RAM
•Use the previous approaches (2 & 3) to find the frequencies of each piece
•Merge (sort -m) and sum up the frequencies
[Diagram: a launcher splits "sa re re" and "ga ga re" across two
machines. Machine 1 counts its piece into re:2, sa:1; Machine 2
counts ga:2, re:1; merging the sorted counts gives ga:2, re:3, sa:1.]
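The two phases of Approach 4 can be simulated in Python: count each chunk independently, then merge the sorted per-chunk counts (like sort -m) and sum duplicates. Chunking by lines is illustrative; a real external sort would spill chunks to disk.

```python
import heapq
from collections import Counter
from itertools import groupby

lines = ["sa re re", "ga ga re"]   # one "chunk" per machine

# Phase 1: each chunk is counted on its own, output sorted by word.
chunk_counts = [sorted(Counter(line.split()).items()) for line in lines]

# Phase 2: merge the sorted streams and sum per word.
merged = heapq.merge(*chunk_counts)
totals = {word: sum(c for _, c in grp)
          for word, grp in groupby(merged, key=lambda kv: kv[0])}
print(totals)   # → {'ga': 2, 're': 3, 'sa': 1}
```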

MapReduce
•Takes O(n) time to merge sorted data
•Or the time is proportional to the number of
elements to be merged
THINKING IN MAP / REDUCE
Merging

MapReduce
Merging
Merge the two sorted queues to form another sorted queue:
•Compare the heads of the two queues
•Pick the shorter one (pick both if equal)
•Repeat: compare the heads again, pick the shorter
•Since no one is left on the second queue, put the remaining
items from the first
This merges the two queues into one.
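The walkthrough above is the classic two-pointer merge; a minimal Python sketch:

```python
def merge(a, b):
    """Merge two sorted lists by repeatedly comparing the heads."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:        # pick the smaller head (ties: take from a)
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    # One queue is exhausted: append whatever remains from the other.
    out.extend(a[i:]); out.extend(b[j:])
    return out

print(merge([1, 4, 6], [3, 5, 7]))   # → [1, 3, 4, 5, 6, 7]
```

Every element is looked at once, which is where the O(n) merge cost comes from.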

MapReduce
THINKING IN MAP / REDUCE
Merging
•For more than two lists
○Use a min-heap
[Diagram: six sorted lists (1 4 6 | 9 10 12 | 6 7 8 | 8 9 9 | 3 5 7 |
5 10 17); the head of each list sits in a min-heap, and the smallest
head (1) is emitted to the output first.]
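For the k-way case, Python's heapq.merge maintains exactly this min-heap of list heads (the six lists are the ones from the diagram):

```python
import heapq

lists = [[1, 4, 6], [9, 10, 12], [6, 7, 8],
         [8, 9, 9], [3, 5, 7], [5, 10, 17]]

# heapq.merge keeps one head per list in a min-heap and always
# emits the smallest, giving a single sorted stream.
merged = list(heapq.merge(*lists))
print(merged[:5])   # → [1, 3, 4, 5, 5]
```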

MapReduce
•For more than two lists
○Or merge two at a time
THINKING IN MAP / REDUCE
Merging

MapReduce
THINKING IN MAP / REDUCE
Problems with Approach 4?
[Same diagram as before: two machines count their pieces and the
launcher merges the sorted counts into ga:2, re:3, sa:1.]

MapReduce
THINKING IN MAP / REDUCE
Problems with external sort?
Time is consumed in the transport of data.
+
For each requirement we would need to write a
special-purpose, network-oriented program.
+
It would require a lot of engineering.
Solution?
Use Map/Reduce


MapReduce
EXAMPLE OF A MAP-ONLY JOB
Function Mapper(image):
    convert the image to 100x100 pixels
[Diagram: a directory of profile pictures in HDFS; Machine 1,
Machine 2 and Machine 3 each run the mapper on their share of
images; the output is an HDFS directory of 100x100 px profile
pictures.]

MapReduce
[Diagram: on a datanode, the InputFormat turns an HDFS block into an
InputSplit of records (Record1, Record2, Record3). The mapper calls
Map() once per record; a call may emit a key-value pair ((key1,
value1), (key2, value2), (key3, value3)) or nothing at all.]

MapReduce
With both mapper() & reducer() code
[Diagram: input read from HDFS, processed by map and reduce, and the
output written back to HDFS]

MapReduce
MAP / REDUCE
Mapper for the word frequency problem.
function map(line):
    foreach(word in line):
        print(word, 1)

Input in HDFS:    Mapper output:
sa re re          sa 1, re 1, re 1
sa ga             sa 1, ga 1

MapReduce
MAP / REDUCE
Mapper/Reducer for the word frequency problem.
function map(line):
    foreach(word in line):
        print(word, 1)

function reduce(word, freqArray):
    return Array.sum(freqArray)

Input in HDFS:   Mapper output:      After shuffle/sort:   Reducer output:
sa re re         sa 1, re 1, re 1    ga [1]                ga 1
sa ga            sa 1, ga 1          re [1, 1]             re 2
                                     sa [1, 1]             sa 2
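The whole flow can be simulated in a few lines of Python, as a toy in-process stand-in for map, shuffle/sort and reduce:

```python
from collections import defaultdict

lines = ["sa re re", "sa ga"]

# Map: emit (word, 1) for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group all values by key.
groups = defaultdict(list)
for word, one in sorted(mapped):
    groups[word].append(one)

# Reduce: sum the values for each key.
result = {word: sum(vals) for word, vals in groups.items()}
print(result)   # → {'ga': 1, 're': 2, 'sa': 2}
```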

MapReduce
Mapper/Reducer for computing the max temperature
def mapp(line):
    (t, c, date) = line.split(",")
    print(c, t)

def reduce(key, values):
    return max(values)

Input (Temp, City, Date):     Mapper output:
20, NYC, 2014-01-01           NYC 20
20, NYC, 2015-01-01           NYC 20
21, NYC, 2014-01-02           NYC 21
23, BLR, 2012-01-01           BLR 23
25, Seattle, 2016-01-01       SEATTLE 25
21, CHICAGO, 2013-01-05       CHICAGO 21
24, NYC, 2016-05-05           NYC 24

After shuffle/sort:           Reducer output:
BLR [23]                      BLR 23
CHICAGO [21]                  CHICAGO 21
NYC [20, 20, 21, 24]          NYC 24
SEATTLE [25]                  SEATTLE 25
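The same max-per-city job, simulated in-process (the grouping dictionary plays the role of the shuffle):

```python
from collections import defaultdict

lines = ["20,NYC,2014-01-01", "20,NYC,2015-01-01", "21,NYC,2014-01-02",
         "23,BLR,2012-01-01", "25,SEATTLE,2016-01-01",
         "21,CHICAGO,2013-01-05", "24,NYC,2016-05-05"]

groups = defaultdict(list)
for line in lines:
    t, c, date = line.split(",")   # map: emit (city, temp)
    groups[c].append(int(t))

# Reduce: max temperature per city.
maxima = {city: max(temps) for city, temps in groups.items()}
print(maxima)
```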

MapReduce
Mapper/Reducer for computing the max temperature (with its date)
def mapp(line):
    (t, c, date) = line.split(",")
    print(c, (t, date))

def reduce(key, values):
    maxt = -19191919
    maxdate = ''
    for (t, d) in values:
        if t > maxt:
            maxt, maxdate = t, d
    return (maxt, maxdate)

Input (Temp, City, Date):     Mapper output:
20, NYC, 2014-01-01           NYC (20, 2014-01-01)
20, NYC, 2015-01-01           NYC (20, 2015-01-01)
21, NYC, 2014-01-02           NYC (21, 2014-01-02)
23, BLR, 2012-01-01           BLR (23, 2012-01-01)
25, Seattle, 2016-01-01       SEATTLE (25, 2016-01-01)
21, CHICAGO, 2013-01-05      CHICAGO (21, 2013-01-05)
24, NYC, 2016-05-05           NYC (24, 2016-05-05)

Reducer output:
BLR (23, 2012-01-01)
CHICAGO (21, 2013-01-05)
NYC (24, 2016-05-05)
SEATTLE (25, 2016-01-01)

MapReduce
MAP / REDUCE
Analogous to Group By
function map(line):
    (temp, city, time) = line.split(",")
    print(city, temp)

function reduce(city, arr_temps):
    return max(arr_temps)

select city, max(temp)
from table
group by city

MapReduce
MAP / REDUCE
Analogous to Group By
function map(line):
    foreach(word in line):
        print(word, 1)

function reduce(word, freqArray):
    return Array.sum(freqArray)

select word, count(*)
from table
group by word

MapReduce
MAP REDUCE - Multiple Reducers
[Diagram: three HDFS blocks → three input splits, each going through
Map and Sort; the sorted outputs are copied and merged into two
reducers, which write Part 0 and Part 1 back to HDFS. Keys such as
Apple, Banana, Apricot and Carrots are partitioned between the two
reducers.]

MapReduce
MAP REDUCE - Partitioning
Key k will go to this reducer: hashcode(k) % total_reducers
With 4 reducers, keys 0..3203 are spread as follows:
Reducer 0: 0, 4, 8, ..., 3200
Reducer 1: 1, 5, 9, ..., 3201
Reducer 2: 2, 6, 10, ..., 3202
Reducer 3: 3, 7, 11, ..., 3203
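The rule can be sketched directly (Python's built-in hash stands in for Hadoop's hashcode; for small non-negative integers the two agree):

```python
def partition(key, total_reducers):
    # Same rule as the slide: hashcode(k) % total_reducers.
    return hash(key) % total_reducers

# With 4 reducers, integer keys round-robin across the reducers.
buckets = {r: [] for r in range(4)}
for k in range(12):
    buckets[partition(k, 4)].append(k)
print(buckets)   # → {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6, 10], 3: [3, 7, 11]}
```

The modulo guarantees every key lands on exactly one reducer, and the same key always lands on the same one.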

Thank you

MapReduce
Thank you.
Hadoop & Spark
[email protected]
+1 419 665 3276 (US)
+91 803 959 1464 (IN)
Subscribe to our Youtube channel for latest videos -
https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA

MapReduce
MAP / REDUCE - RECAP

MapReduce
MAP / REDUCE
The data generated by the mapper is given to
reducer and then it is sorted / shuffled [Yes/No]?

MapReduce
MAP / REDUCE
The data generated by the mapper is given to
reducer and then it is sorted / shuffled [Yes/No]?
No. The output of mapper is first
shuffled/sorted and then given to reducers.

MapReduce
MAP / REDUCE
The mapper can only generate a single key value
pair for an input value [True/False]?

MapReduce
MAP / REDUCE
The mapper can only generate a single key value
pair for an input value [True/False]?
False. A mapper can generate as many key-value pairs
as it wants for an input.

MapReduce
MAP / REDUCE
A mapper always has to generate at least one
key-value pair [Correct/Wrong]?

MapReduce
MAP / REDUCE
A mapper always has to generate at least one
key-value pair [Correct/Wrong]?
Wrong.

MapReduce
MAP / REDUCE
By default there is only one reducer in the case of
a streaming job [Yes/No]?

MapReduce
MAP / REDUCE
By default there is only one reducer in the case of
a streaming job [Yes/No]?
Yes. By default there is a single reducer, but the number
of reducers can be changed with the mapred.reduce.tasks
option.

MapReduce
MAP / REDUCE
In Hadoop 1.0, what is the role of the job tracker?
A: Executing the Map/Reduce logic
B: Delegating the Map/Reduce logic to the task
tracker

MapReduce
MAP / REDUCE
In Hadoop 1.0, what is the role of the job tracker?
A: Executing the Map/Reduce logic
B: Delegating the Map/Reduce logic to the task
tracker
B.

MapReduce
MAP / REDUCE
Q: The Map logic is executed preferably on the
nodes that have the required data [Yes/No]?

MapReduce
MAP / REDUCE
Q: The Map logic is executed preferably on the
nodes that have the required data [Yes/No]?
Yes.

MapReduce
MAP / REDUCE
Q: The Map logic is always executed on the nodes
that have the required data [Correct/Wrong]?

MapReduce
MAP / REDUCE
Wrong
Q: The Map logic is always executed on the nodes
that have the required data [Correct/Wrong]?

MapReduce
MAP / REDUCE
Where does Hadoop store the result of the reducer?
In HDFS or the local file system?

MapReduce
MAP / REDUCE
In HDFS.
Where does Hadoop store the result of the reducer?
In HDFS or the local file system?

MapReduce
MAP / REDUCE
Where does Hadoop store the intermediate data,
such as the output of map tasks?
In HDFS, the local file system, or memory?

MapReduce
MAP / REDUCE
First in memory, then spilled to the
local file system.
The output of the mapper is saved to HDFS directly only
if there is no reduce phase.
Where does Hadoop store the intermediate data,
such as the output of map tasks?
In HDFS, the local file system, or memory?

MapReduce
MAP / REDUCE Assignment For Tomorrow
1. Frequencies of letters [a-z] - do you need Map/Reduce?
2. Find anagrams in a huge text. An anagram is basically a
different arrangement of the letters in a word. An anagram does
not need to have a meaning.
Input:
"the cat act in tic tac toe"
Output:
cat, tac, act
the
toe
in
tic
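One possible map/reduce-style sketch for assignment 2 (a hint, not the official solution): use the sorted letters of each word as the key, so all anagrams shuffle to the same group.

```python
from collections import defaultdict

text = "the cat act in tic tac toe"

groups = defaultdict(list)
for word in text.split():
    key = "".join(sorted(word))   # map: emit (sorted letters, word)
    groups[key].append(word)

# Reduce: each group now holds words that are anagrams of each other.
for words in groups.values():
    print(", ".join(words))
```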

MapReduce
MAP / REDUCE
3a. A file contains the DNA sequences of people. Find all the
people who have the same DNA.
Output:
User1, User4
User2
User3, User 5
User6
Input:
“User1 ACGT”
“User2 TGCA”
“User3 ACG”
“User4 ACGT”
“User5 ACG”
“User6 AGCT”
Assignment For Tomorrow

MapReduce
MAP / REDUCE Assignment For Tomorrow
3b. A file contains the DNA sequences of people. Find all the
people who have the same DNA or a mirror image of it.

Input:
“User1 ACGT”
“User2 TGCA”
“User3 ACG”
“User4 ACGT”
“User5 ACG”
“User6 ACCT”
Output:
User1, User2, User4
User3, User 5
User6

MapReduce
MAP / REDUCE Assignment For Tomorrow
4. In an unusual democracy, everyone is not equal. The vote count is a
function of the worth of the voter, though everyone may vote for anyone.
For example, if A with a worth of 5 and B with a worth of 1 both vote
for C, the vote count of C is 6.
You are given a list of people with the worth of their vote. You are
also given another list describing who voted for whom.
Find out the vote count of everyone.
List1 (Voter, Votee):
A  C
B  C
C  F
List2 (Person, Worth):
A  5
B  1
C  11
Result (Person, VoteCount):
A  0
B  0
C  6
F  11
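A toy join-then-aggregate sketch for assignment 4 (a hint only; in real map/reduce this would be a reduce-side join of the two lists on the voter's name):

```python
from collections import defaultdict

worth = {"A": 5, "B": 1, "C": 11}          # List2
votes = [("A", "C"), ("B", "C"), ("C", "F")]  # List1

counts = defaultdict(int)
for person in worth:
    counts[person] += 0            # everyone appears, even with 0 votes
for voter, votee in votes:
    counts[votee] += worth[voter]  # join: look up the voter's worth

print(dict(counts))   # → {'A': 0, 'B': 0, 'C': 6, 'F': 11}
```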

MapReduce
JOB TRACKER

MapReduce
JOB TRACKER (DETAILED)

MapReduce
JOB TRACKER (CONT.)

MapReduce
JOB TRACKER (CONT.)

MapReduce

MapReduce
QUICK - CLUSTER HANDS ON
MapReduce Command
Remove old output directory
hadoop fs -rm -r /user/student/wordcount/output
Execute the MapReduce command:
hadoop jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-examples.jar
wordcount /data/mr/wordcount/input mrout