MapReduce (Big Data Computing, Fall 2014, Indranil Gupta)
About This Presentation
**MapReduce: A Comprehensive Overview**
MapReduce is a programming model and associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It was pioneered by Google and has become a cornerstone of big data processing due to its scalability and fault tolerance. This overview covers the fundamental principles, architecture, workflow, and applications of MapReduce.
### Introduction to MapReduce
MapReduce is designed to handle vast amounts of data across thousands of commodity servers in a distributed computing environment. It abstracts the complexity of distributed computing, making it easier to write programs that process large-scale data sets efficiently. The model divides the computation into two main phases: the **Map** phase and the **Reduce** phase.
### The Map Phase
In the Map phase, the input data is divided into smaller chunks (splits) that are processed independently by map tasks. Each map task applies a user-defined function, the **mapper**, to its subset of the input and generates intermediate key-value pairs, extracting and transforming the data into a format suitable for further processing in the Reduce phase.
### The Reduce Phase
Following the Map phase, the Reduce phase aggregates the intermediate key-value pairs produced by the mappers. The framework groups together all intermediate values associated with the same intermediate key and passes them to user-defined functions called **reducers**. Reducers process the grouped data and produce the final output, typically aggregating results or performing some form of computation across the dataset.
### Key Components of MapReduce
1. **Master Node (JobTracker, in Hadoop 1.x)**: Manages the assignment of map and reduce tasks to worker nodes, monitors their progress, and handles job scheduling and coordination.
2. **Worker Nodes (TaskTrackers)**: Execute the map and reduce tasks assigned by the JobTracker and report task status and data-locality information back to it.
3. **Input Data**: Divided into manageable splits, each processed independently by map tasks.
4. **Intermediate Data**: Key-value pairs generated by map tasks and shuffled across the network to the appropriate reducers.
5. **Output Data**: Final result produced by reducers, typically stored in a distributed file system like Hadoop Distributed File System (HDFS).
### Workflow of MapReduce
1. **Splitting**: Input data is divided into splits, each processed by a map task.
2. **Mapping**: Mappers process each split independently, generating intermediate key-value pairs.
3. **Shuffling**: Intermediate data is shuffled across the network, grouped by keys, and sent to reducers.
4. **Reducing**: Reducers aggregate the shuffled data, computing the final output.
5. **Output**: Final results are written to the distributed file system.
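The workflow above can be sketched end to end in plain Java (not the Hadoop API shown later in these slides); the class, method, and variable names below are illustrative only, and a real MapReduce run would spread the splits, the shuffle, and the reduces across many machines:

import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // Map: one input split -> intermediate (word, 1) pairs
    static Stream<Map.Entry<String, Integer>> map(String split) {
        return Arrays.stream(split.split("\\s+"))
                     .map(w -> Map.entry(w, 1));
    }

    // Reduce: one word and all of its 1s -> (word, total count)
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = counts.stream().mapToInt(Integer::intValue).sum();
        return Map.entry(word, sum);
    }

    public static void main(String[] args) {
        List<String> splits = List.of("Welcome Everyone", "Hello Everyone");  // input splits
        // Shuffle: group all intermediate pairs by key (the word)
        Map<String, List<Integer>> grouped = splits.stream()
            .flatMap(WordCountSketch::map)
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                     Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        // Reduce each group and write the final (word, count) output
        grouped.forEach((word, counts) -> System.out.println(reduce(word, counts)));
    }
}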
What is MapReduce?
• Terms are borrowed from Functional Language (e.g., Lisp)
Sum of squares:
• (map square '(1 2 3 4))
– Output: (1 4 9 16)
[processes each record sequentially and independently]
• (reduce + '(1 4 9 16))
– (+ 16 (+ 9 (+ 4 1)))
– Output: 30
[processes set of all records in batches]
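A rough Java analogue of the Lisp example above, using Java 8+ streams (illustrative only; the class name is made up):

import java.util.stream.IntStream;

public class SumOfSquares {
    public static void main(String[] args) {
        int result = IntStream.of(1, 2, 3, 4)
                              .map(x -> x * x)          // map square -> 1 4 9 16
                              .reduce(0, Integer::sum); // reduce +   -> 30
        System.out.println(result);                     // prints 30
    }
}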
• Let's consider a sample application: Wordcount
– You are given a huge dataset (e.g., Wikipedia dump or all of Shakespeare's works) and asked to list the count for each of the words in each of the documents therein
Map
• Process individual records to generate intermediate key/value pairs.
Example, input <filename, file text>:
  Welcome Everyone
  Hello Everyone
Intermediate key/value pairs:
  (Welcome, 1), (Everyone, 1), (Hello, 1), (Everyone, 1)
Map
• Parallelly process a large number of individual records to generate intermediate key/value pairs.
Example, input <filename, file text>, split across multiple map tasks:
  Welcome Everyone
  Hello Everyone
  Why are you here
  I am also here
  They are also here
  Yes, it's THEM!
  The same people we were thinking of
  ...
Intermediate key/value pairs produced by the map tasks:
  (Welcome, 1), (Everyone, 1), (Hello, 1), (Everyone, 1),
  (Why, 1), (Are, 1), (You, 1), (Here, 1), ...
Reduce
• Reduce processes and merges all intermediate values associated per key.
Example, intermediate key/value pairs in:
  (Welcome, 1), (Everyone, 1), (Hello, 1), (Everyone, 1)
Merged output:
  (Everyone, 2), (Hello, 1), (Welcome, 1)
Reduce
• Each key assigned to one Reduce
• Parallelly processes and merges all intermediate values by partitioning keys
• Popular: Hash partitioning, i.e., key is assigned to reduce # = hash(key) % number of reduce servers (see the sketch below)
Example, intermediate key/value pairs:
  (Welcome, 1), (Everyone, 1), (Hello, 1), (Everyone, 1)
are partitioned by key across REDUCE TASK 1 and REDUCE TASK 2, which together emit:
  (Everyone, 2), (Hello, 1), (Welcome, 1)
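A minimal sketch of the hash-partitioning rule in plain Java (illustrative only; in Hadoop the equivalent logic lives in its HashPartitioner class):

public class HashPartitionSketch {
    // reduce # = hash(key) % number of reduce servers; masking the sign bit
    // keeps the result non-negative even for negative hashCode values.
    static int reducerFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] keys = { "Welcome", "Everyone", "Hello" };
        for (String k : keys) {
            System.out.println(k + " -> reduce task " + reducerFor(k, 2));
        }
    }
}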
Hadoop Code - Map
public static class MapClass extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
} // Source: http://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount
Hadoop Code - Reduce
public static class ReduceClass extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
} // Source: http://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount
Hadoop Code - Driver
// Tells Hadoop how to run your Map-Reduce job
public void run(String inputPath, String outputPath)
    throws Exception {
  // The job. WordCount contains MapClass and ReduceClass.
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("mywordcount");
  // The keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // The values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(ReduceClass.class);
  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));
  JobClient.runJob(conf);
} // Source: http://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount
Some Applications of MapReduce
Distributed Grep:
– Input: large set of files
– Output: lines that match pattern
– Map: emits a line if it matches the supplied pattern
– Reduce: copies the intermediate data to output
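A minimal sketch of the grep mapper, written against the same (old) Hadoop API as the WordCount slides; the class name and the hard-coded pattern are illustrative, and the reducer can simply be the identity, copying matching lines to the output:

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class DistributedGrep {
  public static class GrepMapper extends MapReduceBase implements
      Mapper<LongWritable, Text, Text, NullWritable> {
    // The supplied pattern; hard-coded here only for the sketch.
    private static final Pattern PATTERN = Pattern.compile("error");

    public void map(LongWritable key, Text line,
        OutputCollector<Text, NullWritable> output, Reporter reporter)
        throws IOException {
      if (PATTERN.matcher(line.toString()).find()) {
        output.collect(line, NullWritable.get()); // emit the matching line
      }
    }
  }
}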
Some Applications of MapReduce (2)
Reverse Web-Link Graph
– Input: Web graph: tuples (a, b) where (page a → page b)
– Output: For each page, list of pages that link to it
– Map: process web log and for each input <source, target>, output <target, source>
– Reduce: emit <target, list(source)>
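A minimal sketch in the same old Hadoop API; it assumes each input line has the form "source<TAB>target", and the class names are illustrative, not from the slide:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class ReverseWebLinkGraph {
  public static class ReverseLinkMapper extends MapReduceBase implements
      Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      String[] edge = value.toString().split("\t");          // [source, target]
      output.collect(new Text(edge[1]), new Text(edge[0]));  // emit <target, source>
    }
  }

  public static class ReverseLinkReducer extends MapReduceBase implements
      Reducer<Text, Text, Text, Text> {
    public void reduce(Text target, Iterator<Text> sources,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      StringBuilder list = new StringBuilder();
      while (sources.hasNext()) {
        if (list.length() > 0) list.append(",");
        list.append(sources.next().toString());
      }
      output.collect(target, new Text(list.toString()));     // <target, list(source)>
    }
  }
}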
Some Applications of MapReduce (3)
Count of URL access frequency
– Input: Log of accessed URLs, e.g., from proxy server
– Output: For each URL, % of total accesses for that URL
– Map: process web log and output <URL, 1>
– Multiple Reducers: emit <URL, URL_count>
(So far, like Wordcount. But still need %)
– Chain another MapReduce job after the one above
– Map: processes <URL, URL_count> and outputs <1, (<URL, URL_count>)>
– 1 Reducer: sums up URL_counts to calculate overall_count.
Emits multiple <URL, URL_count/overall_count>
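A minimal sketch of the second job's single reducer in the same old Hadoop API; the class names and the "URL<TAB>count" value format are assumptions, and buffering all pairs in memory is a simplification that works because there is only one reducer:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class UrlFrequencyStage2 {
  public static class FrequencyReducer extends MapReduceBase implements
      Reducer<IntWritable, Text, Text, DoubleWritable> {
    public void reduce(IntWritable key, Iterator<Text> values,
        OutputCollector<Text, DoubleWritable> output, Reporter reporter)
        throws IOException {
      List<String[]> pairs = new ArrayList<String[]>();
      long overall = 0;
      // First pass: buffer every <URL, URL_count> and compute overall_count.
      while (values.hasNext()) {
        String[] p = values.next().toString().split("\t");  // "URL \t URL_count"
        overall += Long.parseLong(p[1]);
        pairs.add(p);
      }
      // Second pass: emit <URL, URL_count / overall_count> as a percentage.
      for (String[] p : pairs) {
        double percent = 100.0 * Long.parseLong(p[1]) / overall;
        output.collect(new Text(p[0]), new DoubleWritable(percent));
      }
    }
  }
}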
Some Applications of MapReduce (4)
Map task's output is sorted (e.g., quicksort)
Reduce task's input is sorted (e.g., mergesort)
Sort
– Input: Series of (key, value) pairs
– Output: Sorted <value>s
– Map: <key, value> → <value, _> (identity)
– Reducer: <key, value> → <key, value> (identity)
– Partitioning function: partition keys across reducers based on ranges (can't use hashing!); a sketch of such a partitioner follows below
• Take data distribution into account to balance reducer tasks
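A minimal sketch of a range-based partitioner in the old Hadoop API; the split points are hard-coded and purely illustrative, whereas a real sort job would sample the data distribution to pick balanced boundaries:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class RangePartitioner implements Partitioner<Text, Text> {
  // Keys < "h" go to partition 0, keys in ["h", "p") to partition 1, the rest to 2.
  private static final String[] BOUNDS = { "h", "p" };

  public void configure(JobConf job) { }

  public int getPartition(Text key, Text value, int numPartitions) {
    String k = key.toString();
    int part = BOUNDS.length;                  // default: last range
    for (int i = 0; i < BOUNDS.length; i++) {
      if (k.compareTo(BOUNDS[i]) < 0) { part = i; break; }
    }
    return part % numPartitions;               // guard if fewer reducers than ranges
  }
}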
Programming MapReduce
Externally: For user
1. Write a Map program (short), write a Reduce program (short)
2. Specify number of Maps and Reduces (parallelism level)
3. Submit job; wait for result
4. Need to know very little about parallel/distributed programming!
Internally: For the Paradigm and Scheduler
1. Parallelize Map
2. Transfer data from Map to Reduce
3. Parallelize Reduce
4. Implement Storage for Map input, Map output, Reduce input, and Reduce output
(Ensure that no Reduce starts before all Maps are finished. That is, ensure the barrier between the Map phase and the Reduce phase.)
Inside MapReduce
For the cloud:
1. Parallelize Map: easy! Each map task is independent of the others!
2. Transfer data from Map to Reduce:
  • All Map output records with the same key are assigned to the same Reduce task
  • Use partitioning function, e.g., hash(key) % number of reducers
3. Parallelize Reduce: easy! Each reduce task is independent of the others!
4. Implement Storage for Map input, Map output, Reduce input, and Reduce output
  • Map input: from distributed file system
  • Map output: to local disk (at Map node); uses local file system
  • Reduce input: from (multiple) remote disks; uses local file systems
  • Reduce output: to distributed file system
local file system = Linux FS, etc.
distributed file system = GFS (Google File System), HDFS (Hadoop Distributed File System)
[Figure: The Resource Manager assigns Map and Reduce tasks to servers. Input blocks 1–7 from the DFS are read by Map tasks I–III running on the servers; Map output is written to local disk and read remotely by Reduce tasks running on other servers (local write, remote read), and the Reduce tasks write output files A, B, C into the DFS.]
The YARN Scheduler
• Used in Hadoop 2.x+
• YARN = Yet Another Resource Negotiator
• Treats each server as a collection of containers
– Container = fixed CPU + fixed memory
• Has 3 main components
– Global Resource Manager (RM)
  • Scheduling
– Per-server Node Manager (NM)
  • Daemon and server-specific functions
– Per-application (job) Application Master (AM)
  • Container negotiation with RM and NMs
  • Detecting task failures of that job
YARN: How a job gets a container
[Figure: 2 servers (A, B) and 2 jobs (1, 2). The Resource Manager runs the Capacity Scheduler; each node runs a Node Manager, and each job has an Application Master (AM 1 on Node A, AM 2 on Node B, with a task of App 2 already running on Node B).
1. Need container: AM 1 asks the RM for a container.
2. Container Completed: Node Manager B reports a freed container to the RM.
3. Container on Node B: the RM grants AM 1 a container on Node B.
4. Start task, please!: AM 1 asks Node Manager B to launch its task.]
Fault Tolerance
• Server Failure
– NM heartbeats to RM
  • If server fails, RM lets all affected AMs know, and AMs take action
– NM keeps track of each task running at its server
  • If task fails while in-progress, mark the task as idle and restart it
– AM heartbeats to RM
  • On failure, RM restarts AM, which then syncs up with its running tasks
• RM Failure
– Use old checkpoints and bring up secondary RM
• Heartbeats also used to piggyback container requests
– Avoids extra messages
Slow Servers
Slow tasks are called Stragglers
• The slowest task slows the entire job down (why?)
• Due to bad disk, network bandwidth, CPU, or memory
• Keep track of "progress" of each task (% done)
• Perform proactive backup (replicated) execution of the straggler task: the task is considered done when the first replica completes. Called Speculative Execution.
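With the old JobConf API used on the earlier code slides, speculative execution can be toggled per job; a minimal sketch, reusing the WordCount driver's job configuration (Hadoop 2.x also exposes these switches as the mapreduce.map.speculative and mapreduce.reduce.speculative properties):

// Enable backup execution of slow map and reduce tasks for this job.
JobConf conf = new JobConf(WordCount.class);
conf.setMapSpeculativeExecution(true);
conf.setReduceSpeculativeExecution(true);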
Locality
• Locality
– Since cloud has hierarchical topology (e.g., racks)
– GFS/HDFS stores 3 replicas of each of the chunks (e.g., 64 MB in size)
  • Maybe on different racks, e.g., 2 on a rack, 1 on a different rack
– MapReduce attempts to schedule a map task on
  • a machine that contains a replica of the corresponding input data, or failing that,
  • on the same rack as a machine containing the input, or failing that,
  • anywhere
MapReduce: Summary
• MapReduce uses parallelization + aggregation to schedule applications across clusters
• Need to deal with failure
• Plenty of ongoing research work in scheduling and fault-tolerance for MapReduce and Hadoop
Announcements
• HW1 released today, due Sep 18th
• Start early!
• Please fill out Student Survey (course webpage).