Map Reduce

Vigen Sahakyan · 10 slides · May 15, 2016

About This Presentation

This presentation is a short introduction to how the Hadoop MapReduce framework works.


Slide Content

© Vigen Sahakyan 2016
Hadoop Tutorial
MapReduce

Agenda
●What is MapReduce?
●Anatomy of MapReduce
●Purposes & Weaknesses

What is MapReduce?
●Distributed data processing paradigm
●Designed especially for batch processing
●It was first introduced at Google
●Integral part of Hadoop ecosystem
●In Hadoop 2 it became an application on
top of YARN
●It splits big data into chunks and applies
mappers and reducers to those chunks,
which can be processed in parallel
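The split/map/reduce flow above can be sketched in plain Python. This is an illustrative simulation of the paradigm, not the Hadoop API; the word-count mapper and reducer, the line-based chunking, and the use of a thread pool are all assumptions for demonstration:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def mapper(chunk):
    # User-defined map: emit (word, 1) pairs for a word count
    return [(word, 1) for word in chunk.split()]

def reducer(key, values):
    # User-defined reduce: sum the counts for one key
    return key, sum(values)

def map_reduce(text, n_chunks=4):
    # Split the input into chunks that can be mapped in parallel
    lines = text.splitlines()
    size = max(1, len(lines) // n_chunks)
    chunks = ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(mapper, chunks))
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    # Reduce each key independently (also parallelizable)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))
```

For example, `map_reduce("to be or\nnot to be")` returns `{"be": 2, "not": 1, "or": 1, "to": 2}`.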

Anatomy of MapReduce
In Hadoop 2, MapReduce is a batch processing framework implemented on top of
YARN. To understand how MapReduce works in Hadoop, you have to know how a
MapReduce job runs.
You also have to understand the important parts of the MR framework and how they work:
●Mapper
●Shuffler
●Reducer

Anatomy of MapReduce
How does a MapReduce job run?
1.The MapReduce application submits the job to the Hadoop client
2.The client asks the ResourceManager for an application ID
3.The client copies the job resources to HDFS:
a.Checks the output specification of the job
b.Computes the input splits for the job
c.Copies the job JAR, throwing an error if needed
4.Submits the application
5.ResourceManager:
a.Allocates a container on some node
b.Runs the ApplicationMaster on that node
6.The ApplicationMaster initializes the job
7.Retrieves the input splits
8.Allocates resources and starts containers
9.Each container retrieves the resources for its map or reduce task
10.Each container runs its map or reduce task

Anatomy of MapReduce
MapReduce step by step
Mapper:
1.Performs the map-side operation (written by you)
2.Writes the output to an in-memory buffer (by framework)
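This interaction between your map function and the framework-managed buffer can be modeled with a toy sketch. Everything here is illustrative Python, not the Hadoop interface: the `MapTask` class, a buffer limit counted in records rather than megabytes, and spilling sorted runs to a plain list are all assumptions for demonstration:

```python
class MapTask:
    """Toy model: your map function emits records; the framework
    collects them in a bounded in-memory buffer and writes out a
    sorted run when the buffer fills (limit in records, not MB)."""

    def __init__(self, buffer_limit=100):
        self.buffer = []
        self.buffer_limit = buffer_limit
        self.spills = []  # sorted runs written to "disk"

    def emit(self, key, value):
        # Called by the user's map function for each output record
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_limit:
            # Framework flushes the buffer as one sorted run
            self.spills.append(sorted(self.buffer))
            self.buffer = []
```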

Shuffler:
1.The Partitioner figures out which map output key goes to which Reducer (by framework). It is possible to
have several unique keys in one partition.
a.You can specify a Partitioner class that implements the PartitionID extraction process.
2.Sorts first by PartitionID, then by key value within each partition. (by framework)
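Steps 1–2 can be sketched as follows (illustrative Python; Hadoop's default partitioner does the analogous thing in Java by hashing the key modulo the number of reducers, and the `partition`/`shuffle_sort` helpers here are assumed names):

```python
def partition(key, num_reducers):
    # The same key always lands in the same partition;
    # several distinct keys can share one partition.
    return hash(key) % num_reducers

def shuffle_sort(map_output, num_reducers):
    # Tag each (key, value) record with its PartitionID, then
    # sort by (PartitionID, key) as the framework does.
    tagged = [(partition(k, num_reducers), k, v) for k, v in map_output]
    return sorted(tagged)
```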

Anatomy of MapReduce
Shuffler:
3.Calls a single Combiner (if enabled by you) for each key of every partition. (by framework)
a.The Combiner implements the Reduce interface, hence you can specify your Reducer as the Combiner class, but
only if your operation is commutative and associative (e.g. sum, in the case of wordcount); otherwise you
have to provide a separate Combiner.
4.Spills to disk (also grouping by key and merging) if the buffer limit is exceeded
(by default the limit is 100 MB)
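The commutative/associative requirement can be checked with a toy example in plain Python: a sum recombines correctly from per-chunk partial sums, while a mean of partial means gives the wrong answer, which is why a sum Reducer may double as a Combiner but a mean Reducer may not.

```python
from statistics import mean

chunk1, chunk2 = [1, 2, 3], [10]

# sum is commutative and associative: combining partial sums per
# chunk equals summing everything at once, so it is Combiner-safe.
assert sum([sum(chunk1), sum(chunk2)]) == sum(chunk1 + chunk2)

# mean is not: the mean of partial means (6.0) differs from the
# true mean of all values (4.0), so it is not Combiner-safe.
assert mean([mean(chunk1), mean(chunk2)]) != mean(chunk1 + chunk2)
```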
Reducer:
1.Starts reading map outputs from disk and from memory.
2.Merges the outputs, sorting by PartitionID and then by key.
3.Groups by key.
4.Calls the Reduce operation defined by you for each unique key
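These four steps can be sketched as follows (illustrative Python, with `heapq.merge` standing in for Hadoop's on-disk merge of sorted runs; `reduce_side` is an assumed name, not a Hadoop API):

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_side(sorted_runs, reduce_fn):
    # Steps 1-2: merge already-sorted runs (from disk and memory)
    # into one stream, ordered by key.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    results = []
    # Step 3: group consecutive records that share a key.
    for key, group in groupby(merged, key=itemgetter(0)):
        values = [v for _, v in group]
        # Step 4: call the user-defined reduce once per unique key.
        results.append((key, reduce_fn(key, values)))
    return results
```

For example, merging two sorted runs `[("a", 1), ("b", 2)]` and `[("a", 3), ("c", 4)]` with a summing reduce function yields `[("a", 4), ("b", 2), ("c", 4)]`.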

Purposes & Weaknesses
Purposes:
●Batch processing
●Long-running applications
Weaknesses:
●Iterative algorithms (e.g. machine learning, graph processing, and so on)
●Ad-hoc queries
●Computations that depend on previously computed values
●Algorithms that depend on shared global state

References
1.Hadoop in Practice 2nd Edition by Alex Holmes
http://www.amazon.com/Hadoop-Practice-Alex-Holmes/dp/1617292222
2.Hadoop: The Definitive Guide by Tom White
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White-ebook/dp/B00V7B1IZC

Thanks!