A Hands-on Introduction to MapReduce (in Python)

dmassart 519 views 16 slides Feb 22, 2015

About This Presentation

This presentation explains how to build a simple MapReduce algorithm step by step.


Slide Content

A Hands-on Introduction to MapReduce in Python
David Massart, PhD

Who Am I?

Outline
  Set-up and requirements
  Counting words
  Limitations
  Map / Reduce
    Mapping
    Shuffling
    Reducing
  Hadoop

Environment Set-up
  Required
    Unix-like shell (Linux, Mac OS X, or Windows + Cygwin)
    Python (e.g., Anaconda)
  Good to have
    Java 8
    Hadoop 2.6

Moby Dick by Herman Melville
  Download Moby Dick: wget https://www.gutenberg.org/cache/epub/2701/pg2701.txt
  Rename it input.txt: mv pg2701.txt input.txt

cat input.txt

Counting Words

./counter.py < input.txt
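
The source of counter.py is not included in this text. A minimal sketch of what such a script might look like follows, assuming Python 3, a single in-memory dictionary, and a deliberately simple definition of a word (a lowercase run of letters and apostrophes). The names WORD and count_words are illustrative, not taken from the slides.

```python
#!/usr/bin/env python3
"""counter.py: count word occurrences in text read from standard input."""
import re
import sys
from collections import Counter

# Assumption: a "word" is a run of letters or apostrophes, compared in lowercase.
WORD = re.compile(r"[a-z']+")


def count_words(lines):
    """Return a Counter mapping each word to its number of occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(WORD.findall(line.lower()))
    return counts


if __name__ == "__main__":
    # Print one "word<TAB>count" line per distinct word, in alphabetical order.
    for word, n in sorted(count_words(sys.stdin).items()):
        print(f"{word}\t{n}")
```

The whole dictionary lives in the memory of one process, which is the design choice the next slide questions.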

Limitations
  Processing time is, at best, proportional to the size of the text
  Actually, performance decreases with the size of the dictionary
  Very large texts can require more than one disk

MapReduce, Part 1: Mapping

./mapper.py < input.txt
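
The source of mapper.py is not included in this text. One plausible sketch, assuming Python 3 and a tab-separated "word, 1" output format (a common convention, but an assumption here), is shown below. The names WORD and map_words are illustrative. Unlike the counter, the mapper keeps no dictionary: it emits one pair per word occurrence and leaves the adding up to a later stage.

```python
#!/usr/bin/env python3
"""mapper.py: emit a (word, 1) pair for every word read from standard input."""
import re
import sys

# Assumption: a "word" is a run of letters or apostrophes, compared in lowercase.
WORD = re.compile(r"[a-z']+")


def map_words(lines):
    """Yield (word, 1) for each word occurrence, in input order."""
    for line in lines:
        for word in WORD.findall(line.lower()):
            yield word, 1


if __name__ == "__main__":
    # One "word<TAB>1" line per occurrence; no state is kept between lines.
    for word, one in map_words(sys.stdin):
        print(f"{word}\t{one}")
```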

MapReduce, Part 2: Shuffling
  Redistribute data based on the output keys produced by the "mapper"
  So that all data belonging to one key are grouped together

./mapper.py < input.txt | sort
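
On a single machine, sort plays the role of the shuffle: after sorting, all lines with the same key are adjacent. The same idea can be sketched in Python, assuming the mapper output is available as (key, value) pairs. The function name shuffle is illustrative and is not part of the slides.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(pairs):
    """Group (key, value) pairs so that all values of one key end up together."""
    # Sorting by key makes equal keys adjacent, which is what groupby requires.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


if __name__ == "__main__":
    mapped = [("whale", 1), ("the", 1), ("whale", 1)]
    for key, values in shuffle(mapped):
        print(key, values)
```

In the shell pipeline the grouping stays implicit: the sorted lines are simply passed on, and the next stage detects where one key ends and the next begins.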

MapReduce, Part 3: Reducing

./mapper.py < input.txt | sort | ./reducer.py
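
The source of reducer.py is not included in this text. A minimal sketch follows, assuming Python 3 and input lines of the form "word<TAB>count" that arrive sorted by word, as they do after sort. The name reduce_counts is illustrative. Because equal keys are adjacent, the reducer only needs to remember the current word and its running total, not a full dictionary.

```python
#!/usr/bin/env python3
"""reducer.py: sum the counts of consecutive lines that share the same word."""
import sys


def reduce_counts(lines):
    """Yield (word, total) from "word<TAB>count" lines sorted by word."""
    current, total = None, 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        word, _, value = line.partition("\t")
        if word != current:
            # The key changed: the previous word is complete, so emit it.
            if current is not None:
                yield current, total
            current, total = word, 0
        total += int(value)
    if current is not None:
        yield current, total  # emit the last word


if __name__ == "__main__":
    for word, total in reduce_counts(sys.stdin):
        print(f"{word}\t{total}")
```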

Hadoop