MapReduce in Cloud Computing
Mohammad Mustaqeem
M.Tech 2nd Year
Reg No: 2011CS17
Computer Science and Engineering Department
Motilal Nehru National Institute of Technology Allahabad
November 8, 2012
Outline
1. Introduction
2. Motivation
3. Description of First Paper
   Issues
   Approach Used
   HDFS
   MapReduce Programming Model
   Example: Word Count
4. Description of Second Paper
   Issues
   Approach Used
   Architecture
   System Mechanism
   Example
5. Comparison
6. Conclusion
Introduction
MapReduce is a general-purpose programming model for data-intensive computing.
It was introduced by Google in 2004 to construct its web index.
It is also used at Yahoo!, Facebook, etc.
It uses a parallel computing model that distributes computational tasks to a large number of nodes (approximately 1,000 to 10,000).
It is fault-tolerant: it can keep working even when 1,600 out of 1,800 nodes fail.
Introduction cont..
In the MapReduce model, the user has to write only two functions: map and reduce.
A few examples that can be easily expressed as MapReduce computations:
Distributed Grep
Count of URL Access Frequency
Inverted Index
Mining
Motivation
Cloud Computing refers to services that are offered by clusters having 1,000 to 10,000 machines [6].
e.g., services offered by Yahoo!, Google, etc.
Cloud computing delivers computing resources as a service. It may be:
Infrastructure as a Service (IaaS).
Platform as a Service (PaaS).
Software as a Service (SaaS).
Storage as a Service (STaaS), etc.
Motivation cont..
A cloud service is different from a traditional hosting service in the following ways [6]:
It is sold on demand, typically by the minute or the hour.
It is elastic - a user can have as much or as little of a
service as they want at any given time.
It is fully managed by the provider (the consumer needs nothing but a personal computer and Internet access).
Amazon Web Services is the largest public cloud provider.
Motivation cont..
MapReduce is a programming model for large-scale computing [3].
It uses the distributed environment of the cloud to process large amounts of data in a reasonable amount of time.
It was inspired by the map and reduce functions of functional programming languages (like LISP, Scheme, Racket) [3].
Map and Reduce in Racket (Functional Programming Language) [4]:
Map:
(map f list1) → list2
e.g. (map square '(1 2 3 4 5)) → '(1 4 9 16 25)
Reduce:
(foldl f init list1) → any
e.g. (foldl + 0 '(1 2 3 4 5)) → 15
Motivation cont..
However, the map and reduce functions in the MapReduce model are not exactly the same as in functional programming.
Map and Reduce functions in the MapReduce model:
Map: It processes a (key, value) pair and returns a list of (intermediate key, value) pairs:
map(k1, v1) → list(k2, v2)
Reduce: It merges all intermediate values having the same intermediate key:
reduce(k2, list(v2)) → list(v3)
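As an illustration only, the two contracts above can be written as Java signatures; these generic interfaces are an assumption made for this slide, not Hadoop's actual API:

import java.util.List;
import java.util.Map;

// Illustrative sketch of the MapReduce contracts; not a real framework API.
interface MapFunction<K1, V1, K2, V2> {
    // map(k1, v1) → list(k2, v2)
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

interface ReduceFunction<K2, V2, V3> {
    // reduce(k2, list(v2)) → list(v3)
    List<V3> reduce(K2 key, List<V2> values);
}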
Issues
Gaizhen Yang, "The Application of MapReduce in the Cloud Computing"
It analyzes Hadoop.
Hadoop is an implementation of the MapReduce model.
It processes data in parallel in a distributed manner.
It divides the data into logical blocks, processes these blocks in parallel on different machines, and finally combines all the results to produce the final result [1].
It is fault-tolerant.
One attractive feature of Hadoop is that the user can write the map and reduce functions in any programming language.
Approach Used
Hadoop is an open-source Java framework for processing large amounts of data on clusters of machines [1].
Hadoop is an implementation of Google's MapReduce programming model.
Yahoo! is the biggest contributor to Hadoop [5].
Hadoop has mainly two components:
Hadoop Distributed File System (HDFS)
MapReduce
HDFS
HDFS provides support for distributed storage [1].
Like a traditional file system, files can be deleted, renamed, etc.
HDFS has two types of nodes:
Name Node
Data Node
Figure: HDFS Architecture
HDFS cont..
Name Node:
The Name Node provides the main metadata services of the file system.
It is a process that runs on a separate machine.
It stores only the metadata of the files and directories.
Programmers access files through it.
For reliability of the file system, multiple copies of the same file blocks are kept.
Data Node:
A Data Node is a process that runs on each individual machine of the cluster.
The file blocks are stored in the local file system of these nodes.
It periodically sends the metadata of the stored blocks to the Name Node.
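A minimal sketch of how a client might use HDFS's Java API for the file operations mentioned above; the paths below are hypothetical, and only the metadata requests go to the Name Node while block data is served by the Data Nodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);            // client handle; metadata calls go to the Name Node

        Path src = new Path("/user/demo/input.txt");     // hypothetical paths
        Path dst = new Path("/user/demo/renamed.txt");

        fs.rename(src, dst);                             // rename a file
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath());        // list directory contents
        }
        fs.delete(dst, false);                           // delete a file (non-recursive)
        fs.close();
    }
}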
MapReduce Programming Model
MapReduce is the key concept behind Hadoop.
It is a technique for dividing the work across a distributed system.
The user has to define only two functions:
Map: It processes a (key, value) pair and returns a list of (intermediate key, value) pairs:
map(k1, v1) → list(k2, v2)
Reduce: It merges all intermediate values having the same intermediate key:
reduce(k2, list(v2)) → list(v3)
MapReduce Programming Model cont..
Execution phases of a MapReduce application:
1. The MapReduce library splits the input files into M pieces and copies these pieces onto multiple machines.
2. The master picks idle workers and assigns each one a map task.
3. The map workers process the key-value pairs of the input data, pass each pair to the user-defined map function, and produce the intermediate key-value pairs.
4. The map worker buffers the output key-value pairs in local memory. It passes these memory locations to the master, and the master forwards them to the reducers.
5. After reading the intermediate key-value pairs, the reducer sorts these pairs by the intermediate key.
6. For each intermediate key, the user-defined reduce function is applied to the corresponding intermediate values.
MapReduce Programming Model cont..
7. When all map tasks and reduce tasks have been completed, the master gives the final output to the user.
Figure: Execution phases of a generic MapReduce application
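From the user's side, these phases are hidden behind job configuration and submission. Below is a minimal driver sketch assuming Hadoop's org.apache.hadoop.mapreduce API; the WordCountMapper and WordCountReducer classes are hypothetical here (a sketch of both follows the word count pseudocode below), and the input/output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");            // the library handles splitting, scheduling, shuffling
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);                 // hypothetical mapper (sketched below)
        job.setReducerClass(WordCountReducer.class);               // hypothetical reducer (sketched below)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));      // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);          // block until the master reports completion
    }
}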
Example: Word Count
The pseudocode of the map and reduce functions for the word count problem is:
Algorithm 3.1: MAPPER(filename, filecontents)
  for each word ∈ filecontents
    do EMIT(word, 1)

Algorithm 3.2: REDUCER(word, values)
  sum ← 0
  for each value ∈ values
    do sum ← sum + value
  EMIT(word, sum)
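A sketch of the same word count written against Hadoop's Java MapReduce API; the class names and the whitespace tokenization are illustrative choices, not mandated by the framework:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every word in the input line, as in Algorithm 3.1.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum all counts for the same word and emit (word, sum), as in Algorithm 3.2.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}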
Example: Word Count cont..
Figure: Word Count Execution
Issues
Fabrizio Marozzo, Domenico Talia, Paolo Trunfio, "P2P-MapReduce: Parallel data processing in dynamic Cloud environments"
The MapReduce model discussed so far is centralized.
It cannot deal with master failure.
Since nodes join and leave the cloud dynamically, we need a P2P-MapReduce model.
This paper describes an adaptive P2P-MapReduce system that can handle master failure.
Approach Used
P2P-MapReduce is a programming model in which nodes may join and leave the cluster dynamically.
A node acts as either a master or a slave at any given time.
Nodes switch between the master and slave roles dynamically so that the master/slave ratio remains constant.
To prevent loss of computation in case of master failure, there are backup masters for each master.
The master responsible for a job J is referred to as the primary master for J.
The primary master dynamically updates the job state on its backup nodes, which are referred to as the backup masters for J.
When a primary master fails, its place is taken by one of its
backup masters.
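A highly simplified sketch of this replicate-then-fail-over idea; the classes, fields, and election rule below are assumptions made for illustration and are not the paper's implementation:

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative model of a primary master replicating job state to its backups
// and a backup taking over when the primary fails.
class JobState {
    String jobId;
    int completedTasks;
}

class MasterNode {
    private final Deque<MasterNode> backups = new ArrayDeque<>();
    private JobState state = new JobState();

    // Primary master: update local state, then push a copy to every backup master.
    void updateJobState(JobState newState) {
        this.state = newState;
        for (MasterNode backup : backups) {
            backup.state = newState;              // stands in for a network replication message
        }
    }

    // On primary failure, one backup is elected and resumes from the replicated state.
    static MasterNode electNewPrimary(Deque<MasterNode> backups) {
        MasterNode newPrimary = backups.poll();   // simplistic election: take the first available backup
        // newPrimary.state already holds the last replicated job state,
        // so the job proceeds from where the failed primary stopped.
        return newPrimary;
    }
}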
Architecture
There are three types of nodes in the P2P-MapReduce architecture:
User
Master
Slave
The master and slave nodes form two logical peer-to-peer networks, M-net and S-net respectively.
The composition of M-net and S-net changes dynamically.
A user node submits the MapReduce job to one of the available master nodes. The master node is selected based on the current workload of the available master nodes.
Architecture cont..
Master nodes perform three types of operations [2]:
Management: A master node that is acting as primary master for one or more jobs executes the management operation.
Recovery: A master node that is acting as backup master for one or more jobs executes the recovery operation.
Coordination: The coordination operation changes slaves into masters and vice versa, so as to keep the desired master/slave ratio.
A slave executes the tasks assigned to it by one or more primary masters.
Architecture cont..
For each managed job, the primary master runs one Job Manager.
Backup masters run a Backup Job Manager.
For each assigned task, the slave runs one Task Manager.
The task manager keeps its job manager informed. The information includes the status of the slave (ACTIVE or DEAD) and how much computation has been done.
If a master does not get a signal from a task manager, it reschedules the assigned task on another idle slave.
In addition, if a slave works slowly, the master node also reschedules the assigned task on another idle slave, keeps whichever output comes first, and discards the other (see the sketch below).
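A minimal sketch of the reschedule-on-missed-signal logic; the data structures, timeout value, and method names are assumptions for illustration, not the paper's code:

import java.util.HashMap;
import java.util.Map;

// Illustrative job manager that tracks the last signal time per task
// and reschedules tasks whose task manager has gone silent.
class JobManagerSketch {
    private static final long TIMEOUT_MS = 30_000;                     // assumed heartbeat timeout
    private final Map<String, Long> lastSignal = new HashMap<>();      // taskId -> last signal time

    // Called whenever a task manager reports status and progress.
    void onSignal(String taskId) {
        lastSignal.put(taskId, System.currentTimeMillis());
    }

    // Called periodically; reschedules tasks whose task manager has not signalled in time.
    void checkTimeouts() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> entry : lastSignal.entrySet()) {
            if (now - entry.getValue() > TIMEOUT_MS) {
                rescheduleOnIdleSlave(entry.getKey());
            }
        }
    }

    private void rescheduleOnIdleSlave(String taskId) {
        // Placeholder: a real system would pick an idle slave from S-net and
        // keep only the first result that comes back for this task.
        System.out.println("Rescheduling task " + taskId + " on another idle slave");
    }
}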
System Mechanism
The mechanism of a generic node can be understood from the UML state diagram [2].
Figure: Behaviour of a generic node described by a UML state diagram
Example
Figure: P2P-MapReduce example
Example cont..
The following recovery procedure takes place when a primary master Node1 fails [2]:
Backup masters Node2 and Node3 detect the failure of Node1 and start a distributed procedure to elect the new primary master among themselves.
Assuming that Node3 is elected as the new primary master,
Node2 continues to play the backup function and, to keep
the desired number of backup masters active, another
backup node is chosen by Node3.
Node3 uses its local replica of the job state to proceed from where Node1 failed.
Comparison between two Papers
Issues
  First Paper: To perform data-intensive computation in a Cloud environment in a reasonable amount of time.
  Second Paper: To design a P2P-MapReduce system that can handle any node's failure, including the master node's failure.
Approaches Used
  First Paper: The simple MapReduce implementation (presented by Google) is used. The implemented version is known as Hadoop, which is based on the master-slave model.
  Second Paper: A peer-to-peer architecture is used to handle all the dynamic churn in a cluster.
Advantages
  First Paper: Hadoop is scalable, reliable, and distributed, and able to handle enormous amounts of data. It can process big data in real time.
  Second Paper: P2P-MapReduce can manage node churn, master failures, and job recovery in an effective way.

Table: Comparison between the two papers.
Conclusion
MapReduce is a scalable, reliable computing model that exploits the distributed environment of the cloud.
MapReduce optimizes system performance by rescheduling slow tasks on multiple slaves.
P2P-MapReduce has all the properties of simple MapReduce.
Since P2P-MapReduce provides fault tolerance against master failures, it is more reliable.
P2P-MapReduce prevents computation loss by keeping
job state at backup masters.
References
[1] Gaizhen Yang, "The Application of MapReduce in the Cloud Computing", International Symposium on Intelligence Information Processing and Trusted Computing (IPTC), October 2011, pp. 154-156, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6103560.
[2] Fabrizio Marozzo, Domenico Talia, Paolo Trunfio, "P2P-MapReduce: Parallel data processing in dynamic Cloud environments", Journal of Computer and System Sciences, vol. 78, Issue 5, September 2012, pp. 1382-1402, http://dl.acm.org/citation.cfm?id=2240494.
[3] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: simplified data processing on large clusters", OSDI'04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, vol. 6, 2004, pp. 10-10, www.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf and http://dl.acm.org/citation.cfm?id=1251254.1251264.
[4] The Racket Guide, http://docs.racket-lang.org/guide/.
[5] Hadoop Tutorial - YDN, http://developer.yahoo.com/hadoop/tutorial/module4.html.
[6] http://readwrite.com/2012/10/15/why-the-future-of-software-and-apps-is-serverless.
[7] F. Marozzo, D. Talia, P. Trunfio, "A Peer-to-Peer Framework for Supporting MapReduce Applications in Dynamic Cloud Environments", in: N. Antonopoulos, L. Gillam (eds.), Cloud Computing: Principles, Systems and Applications, Springer, Chapter 7, pp. 113-125, 2010.
[8] IBM developerWorks, Using MapReduce and load balancing on the cloud, http://www.ibm.com/developerworks/cloud/library/cl-mapreduce/.
THANK YOU