Hadoop MapReduce Concepts


MapReduce Concepts

Hadoop MapReduce
•Hadoop MapReduce is a
–Software framework
–For easily writing applications
–Which process vast amounts of data (multi-terabyte data-sets)
–In parallel on large clusters (thousands of nodes) of commodity hardware
–In a reliable, fault-tolerant manner.
•A MapReduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner.
•The framework sorts the outputs of the maps, which are then input to the reduce tasks.
•Typically both the input and the output of the job are stored in a file system.
•The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

JobTracker and TaskTracker
•The MapReduce framework consists of a
–single master JobTracker and
–one slave TaskTracker per cluster node.
•The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks.
•The slaves execute the tasks as directed by the master.

JobTracker and TaskTracker
[Diagram of the JobTracker/TaskTracker master-slave layout.]

Job Specification
•Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes.
•These, and other job parameters, comprise the job configuration (a driver sketch follows below).
•The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
•Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.
–Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
–Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non JNI™ based).
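
As a minimal sketch of such a job configuration (assuming the Hadoop 2.x Java API; the WordCountMapper/WordCountReducer classes are sketched after the Input and Output slide below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        // supply the map and reduce functions
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        // output key/value types (Writable / WritableComparable)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // input/output locations, typically on HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // submit the job and wait; the framework handles distribution and monitoring
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}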

Input and Output
•The MapReduce framework operates exclusively on <key, value> pairs.
•That is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
•The key and value classes have to be serializable by the framework and hence need to implement the Writable interface (see the sketch below).
•Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
•Input and Output types of a MapReduce job:
–(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
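
To make the Writable/WritableComparable requirement concrete, here is a minimal word-count mapper and reducer sketch (the classes referenced by the driver above; a sketch, not the official tutorial code):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: <k1 = byte offset, v1 = line of text> -> <k2 = word, v2 = 1>
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emit <word, 1>
        }
    }
}

// reduce: <k2 = word, [v2,...]> -> <k3 = word, v3 = total count>
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // emit <word, sum>
    }
}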

Hadoop Map/Reduce Architecture
[Architecture diagram; its labels are summarized below.]
•Client
–UI for submitting jobs
–Polls status information
•JobTracker (Master)
–Accepts MR jobs
–Assigns tasks to slaves
–Monitors tasks
–Handles failures
•TaskTrackers (Slaves)
–Run Map and Reduce tasks
–Manage intermediate output
•Task
–Runs the Map and Reduce functions
–Reports progress

Hadoop HDFS + MR Cluster: Putting Them Together
[Diagram: a client submits jobs to the JobTracker and gets block locations from the Namenode; Datanodes (D) and TaskTrackers (T) are co-located on the same machines; an HTTP UI provides monitoring.]
•Locality optimizations
–With large data, bandwidth to data is a problem
–Map tasks are scheduled close to the inputs when possible
•Automatic re-execution of tasks on failure

Word Count Data Flow
[Diagram of the word-count map/shuffle/reduce data flow.]

Ways of using Hadoop
•Map-Reduce Java API
–http://hadoop.apache.org/mapreduce/
•Streaming
–http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
•Pipes
–http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html
•Pig
–http://hadoop.apache.org/pig/
•Hive
–http://hive.apache.org

Example: Compute TF-IDF using Map/Reduce
•TF-IDF (Term Frequency, Inverse Document Frequency) is a basic technique to compute the relevancy of a document with respect to a particular term.
•A "term" is a generalized idea of what a document contains.
•E.g. a term can be a word, a phrase, or a concept.

TF-IDF
•Intuitively, the relevancy of a document to a term can be calculated from the percentage of that term that shows up in the document,
•i.e. the count of the term in that document divided by the total number of terms in it.
•We call this the "term frequency".

TF-IDF
•On the other hand, if this is a very common term which appears in many other documents, then its relevancy should be reduced.
•The corresponding measure is the count of documents having this term divided by the total number of documents.
•We call this the "document frequency".

TF-IDF
•The overall relevancy of a document with respect to a term can be computed using both the term frequency and the document frequency:
–relevancy = tf-idf = term frequency * log(1 / document frequency)
•This is called tf-idf.
•A "document" can be considered as a multi-dimensional vector where each dimension represents a term, with the tf-idf as its value.
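
A minimal sketch of this arithmetic in plain Java (outside MapReduce; the method and variable names are illustrative):

public class TfIdf {
    // tf-idf = tf * log(1 / df) = tf * log(totalDocs / docsWithTerm)
    static double tfIdf(long termCountInDoc, long termsInDoc,
                        long docsWithTerm, long totalDocs) {
        double tf = (double) termCountInDoc / termsInDoc;          // term frequency
        double idf = Math.log((double) totalDocs / docsWithTerm);  // inverse document frequency
        return tf * idf;
    }

    public static void main(String[] args) {
        // e.g. a term appears 3 times in a 100-term document,
        // and occurs in 10 out of 1000 documents overall
        System.out.println(tfIdf(3, 100, 10, 1000)); // ~0.138
    }
}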

Example: Compute TF-IDF using Map/Reduce
[Diagram]

Basic MapReduce Patterns
1. Counting and Summing
2. Collating
3. Filtering ("Grepping"), Parsing, and Validation
4. Distributed Task Execution
5. Sorting
6. Iterative Message Passing (Graph Processing)
7. Distinct Values (Unique Items Counting)
8. Cross-Correlation
9. Selection
10. Projection
11. Union
12. Intersection
13. Difference
14. GroupBy and Aggregation
15. Joining
16. Time-series: moving average

Counting and Summing
•Problem Statement:
–There are a number of documents, where each document is a set of terms.
–It is required to calculate the total number of occurrences of each term in all documents.
–Alternatively, it can be an arbitrary function of the terms.
–For instance, there is a log file where each record contains a response time, and it is required to calculate an average response time.
•Applications: Log Analysis, Data Querying

Counting and Summing
Solution:
class Mapper
  method Map(docid id, doc d)
    for all term t in doc d do
      Emit(term t, count 1)

class Combiner
  method Combine(term t, counts [c1, c2,...])
    sum = 0
    for all count c in [c1, c2,...] do
      sum = sum + c
    Emit(term t, count sum)

class Reducer
  method Reduce(term t, counts [c1, c2,...])
    sum = 0
    for all count c in [c1, c2,...] do
      sum = sum + c
    Emit(term t, count sum)

Collating
•Problem Statement:
–There is a set of items and some function of one item.
–It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group.
–The most typical example is the building of inverted indexes.
•Applications: Inverted Indexes, ETL

Collating
Solution:
Mapper computes a given function for each item and emits the value of the function as a key and the item itself as a value.

class Mapper
  method Map(docid id, doc d)
    for all term t in doc d do
      Emit(function f(t), term t)

Reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the items are terms (words) and the function is the document ID where the term was found.

class Reducer
  method Reduce(function f(t), terms [t1, t2,...])
    // process or save the grouped items [t1, t2,...]

Filtering ("Grepping"), Parsing, and Validation
•Problem Statement:
–There is a set of records, and it is required to collect all records that meet some condition or to transform each record (independently of other records) into another representation.
–The latter case includes such tasks as text parsing and value extraction, and conversion from one format to another.
•Applications: Log Analysis, Data Querying, ETL, Data Validation

Filtering ("Grepping"), Parsing, and Validation
Solution:
Mapper takes records one by one and emits accepted items or their transformed versions; no Reducer is needed, so this can run as a map-only job (see the sketch below).
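
A minimal map-only "grepping" sketch, assuming the Hadoop Java API (the configuration key grep.pattern is an illustrative name):

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits only the input lines that match a configured regular expression.
public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private Pattern pattern;

    @Override
    protected void setup(Context context) {
        pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", ".*"));
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (pattern.matcher(line.toString()).find()) {
            context.write(line, NullWritable.get()); // accepted record passes through unchanged
        }
    }
}

Setting the number of reduce tasks to zero (job.setNumReduceTasks(0)) makes this a pure filter.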

Distributed Task Execution
•Problem Statement:
–There is a large computational problem that can be divided into multiple parts, and the results from all parts can be combined together to obtain a final result.
•Applications: Physical and Engineering Simulations, Numerical Analysis, Performance Testing

Distributed Task Execution
Solution:
The problem description is split into a set of specifications, and the specifications are stored as input data for Mappers.
Each Mapper takes a specification, performs the corresponding computations, and emits results.
The Reducer combines all emitted parts into the final result.

Case Study: Simulation of a Digital Communication System
There is a software simulator of a digital communication system, like WiMAX, that passes some volume of random data through the system model and computes the error probability of throughput.
Each Mapper runs the simulation for a specified amount of data, which is 1/Nth of the required sampling, and emits its error rate.
The Reducer computes the average error rate.

Sorting
•Problem Statement:
–There is a set of records, and it is required to sort these records by some rule or to process these records in a certain order.
•Applications: ETL, Data Analysis

Sorting
Solution:
Mappers just emit all items as values associated with the sorting keys, which are assembled as a function of the items.
Nevertheless, in practice sorting is often used in quite tricky ways; that's why it is said to be the heart of MapReduce (and Hadoop).
In particular, it is very common to use composite keys to achieve secondary sorting and grouping, as sketched below.
Sorting in MapReduce is originally intended for sorting the emitted key-value pairs by key, but there exist techniques that leverage Hadoop implementation specifics to achieve sorting by values.
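
As an illustration of the composite-key idea, here is a hedged sketch of a WritableComparable key (a full secondary sort also needs a custom Partitioner and grouping comparator on the natural key alone, omitted here; all names are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: partition/group by naturalKey, order by secondaryKey within a group.
public class CompositeKey implements WritableComparable<CompositeKey> {
    private String naturalKey;
    private long secondaryKey;

    public CompositeKey() {}  // no-arg constructor required by Hadoop reflection

    public CompositeKey(String naturalKey, long secondaryKey) {
        this.naturalKey = naturalKey;
        this.secondaryKey = secondaryKey;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(naturalKey);
        out.writeLong(secondaryKey);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        naturalKey = in.readUTF();
        secondaryKey = in.readLong();
    }

    @Override
    public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey); // primary sort order
        return cmp != 0 ? cmp : Long.compare(secondaryKey, other.secondaryKey); // secondary order
    }

    @Override
    public int hashCode() {
        return naturalKey.hashCode(); // so the default hash partitioner groups by natural key
    }
}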

Iterative Message Passing (Graph Processing)
•Problem Statement:
–There is a network of entities and relationships between them.
–It is required to calculate a state of each entity on the basis of the properties of the other entities in its neighborhood.
–This state can represent a distance to other nodes, an indication that there is a neighbor with certain properties, a characteristic of neighborhood density, and so on.
•Applications: Graph Analysis, Web Indexing

Iterative Message Passing (Graph Processing)
Solution:
A network is stored as a set of nodes, and each node contains a list of adjacent node IDs.
Conceptually, MapReduce jobs are performed in an iterative way, and at each iteration each node sends messages to its neighbors.
Each neighbor updates its state on the basis of the received messages.
Iterations are terminated by some condition, like a fixed maximal number of iterations (say, the network diameter) or negligible changes in states between two consecutive iterations.
From the technical point of view, the Mapper emits messages for each node using the ID of the adjacent node as a key.
As a result, all messages are grouped by the incoming node, and the Reducer is able to re-compute the state and rewrite the node with the new state.

Iterative Message Passing (Graph Processing)
Solution:
class Mapper
  method Map(id n, object N)
    Emit(id n, object N)
    for all id m in N.OutgoingRelations do
      Emit(id m, message getMessage(N))

class Reducer
  method Reduce(id m, [s1, s2,...])
    object M = null
    messages = []
    for all s in [s1, s2,...] do
      if IsObject(s) then
        M = s
      else // s is a message
        messages.add(s)
    M.State = calculateState(messages)
    Emit(id m, object M)

Case Study: Availability Propagation Through a Tree of Categories
•Problem Statement:
–This problem is inspired by a real-life eCommerce task.
–There is a tree of categories that branches out from large categories (like Men, Women, Kids) to smaller ones (like Men's Jeans or Women's Dresses), and eventually to small end-of-line categories (like Men's Blue Jeans).
–An end-of-line category is either available (contains products) or not.
–Some high-level category is available if there is at least one available end-of-line category in its subtree.
–The goal is to calculate availabilities for all categories if the availabilities of end-of-line categories are known.

Case Study: Availability Propagation Through a Tree of Categories
class N
  State in {True = 2, False = 1, null = 0} // initialized to 1 or 2 for end-of-line categories, 0 otherwise

method getMessage(object N)
  return N.State

method calculateState(state s, data [d1, d2,...])
  return max([d1, d2,...])

Case Study: Breadth-First Search
•Problem Statement: There is a graph, and it is required to calculate the distance (a number of hops) from one source node to all other nodes in the graph.

Case Study: Breadth-First Search
class N
  State is distance, initialized to 0 for the source node and INFINITY for all other nodes

method getMessage(N)
  return N.State + 1

method calculateState(state s, data [d1, d2,...])
  return min([d1, d2,...])

Distinct Values (Unique Items Counting)
•Problem Statement:
–There is a set of records that contain fields F and G. Count the total number of unique values of field F for each subset of records that have the same G (grouped by G).
–The problem can be a little bit generalized and formulated in terms of faceted search: There is a set of records. Each record has a field F and an arbitrary number of category labels G = {G1, G2, …}. Count the total number of unique values of field F for each subset of records for each value of any label.
•Example Applications: Log Analysis, Unique Users Counting

Example:
Record 1: F=1, G={a, b}
Record 2: F=2, G={a, d, e}
Record 3: F=1, G={b}
Record 4: F=3, G={a, b}

Result:
a -> 3 // F=1, F=2, F=3
b -> 2 // F=1, F=3
d -> 1 // F=2
e -> 1 // F=2

Distinct Values (Unique Items Counting)
Solution:
PHASE 1: emit each distinct (category, value) pair exactly once
class Mapper
  method Map(null, record [value f, categories [g1, g2,...]])
    for all category g in [g1, g2,...]
      Emit(record [g, f], count 1)

class Reducer
  method Reduce(record [g, f], counts [n1, n2,...])
    Emit(record [g, f], null)

PHASE 2: count the distinct values per category
class Mapper
  method Map(record [g, f], null)
    Emit(value g, count 1)

class Reducer
  method Reduce(value g, counts [n1, n2,...])
    Emit(value g, sum([n1, n2,...]))

Cross-Correlation
•Problem Statement:
–There is a set of tuples of items.
–For each possible pair of items, calculate the number of tuples where these items co-occur.
–If the total number of items is N, then N*N values should be reported.
–This problem appears in:
•text analysis (say, items are words and tuples are sentences),
•market analysis (customers who buy this tend to also buy that).
–If N*N is small, the matrix can fit in the memory of a single machine.

Market Basket Analysis
•Suppose a store sells N different products and has data on B customer purchases (each purchase is called a "market basket").
•An association rule is expressed as "If x is sold, then y is sold":
–x → y
•"Support" for x → y: the joint probability P(x ^ y)
–The probability of finding x and y together in a random basket
•"Confidence" in x → y: the conditional probability P(y | x)
–The probability of finding y in a basket if the basket already contains x
•The goal is finding rules with high support and high confidence
–How are they computed? (See the sketch below.)
–Cooccur[x, x] represents the number of baskets that contain x
–Cooccur[x, y] represents the number of baskets that contain x & y
–Cooccur[x, y] == Cooccur[y, x]
–Support: calculated as Cooccur[x, y] / B
–Confidence: calculated as Cooccur[x, y] / Cooccur[x, x]
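
A worked sketch of that arithmetic in plain Java (the numbers are made up for illustration):

public class AssociationRule {
    public static void main(String[] args) {
        long B = 1000;    // total number of baskets
        long coXX = 200;  // Cooccur[x, x]: baskets containing x
        long coXY = 50;   // Cooccur[x, y]: baskets containing both x and y

        double support = (double) coXY / B;        // P(x ^ y) = 0.05
        double confidence = (double) coXY / coXX;  // P(y | x) = 0.25

        System.out.printf("support=%.2f confidence=%.2f%n", support, confidence);
    }
}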

Cross-Correlation
Solution (the "pairs" approach):
class Mapper
  method Map(null, basket = [i1, i2,...])
    for all item i in basket
      for all item j in basket
        Emit(pair [i, j], count 1)

class Reducer
  method Reduce(pair [i, j], counts [c1, c2,...])
    s = sum([c1, c2,...])
    Emit(pair [i, j], count s)

Cross-Correlation
Solution (the "stripes" approach):
class Mapper
  method Map(null, basket = [i1, i2,...])
    for all item i in basket
      H = new AssociativeArray: item -> counter
      for all item j in basket
        H{j} = H{j} + 1
      Emit(item i, stripe H)

class Reducer
  method Reduce(item i, stripes [H1, H2,...])
    H = new AssociativeArray: item -> counter
    H = merge-sum([H1, H2,...])
    for all item j in H.keys()
      Emit(pair [i, j], H{j})

Selection
Solution:
class Mapper
  method Map(rowkey key, tuple t)
    if t satisfies the predicate
      Emit(tuple t, null)

Projection
Solution:
class Mapper
  method Map(rowkey key, tuple t)
    tuple g = project(t) // extract required fields into tuple g
    Emit(tuple g, null)

class Reducer
  method Reduce(tuple t, array n) // n is an array of nulls; duplicates collapse here
    Emit(tuple t, null)

Union
Solution:
class Mapper
  method Map(rowkey key, tuple t)
    Emit(tuple t, null)

class Reducer
  method Reduce(tuple t, array n) // n is an array of one or two nulls
    Emit(tuple t, null)

Intersection
Solution:
class Mapper
  method Map(rowkey key, tuple t)
    Emit(tuple t, null)

class Reducer
  method Reduce(tuple t, array n) // n is an array of one or more nulls
    if n.size() >= 2
      Emit(tuple t, null)

Difference
Solution:
// We want to compute the difference R - S.
// Mapper emits all tuples with a tag, which is the name of the set the record came from.
// Reducer emits only records that came from R but not from S.
class Mapper
  method Map(rowkey key, tuple t)
    Emit(tuple t, string t.SetName) // t.SetName is either 'R' or 'S'

class Reducer
  method Reduce(tuple t, array n) // array n can be ['R'], ['S'], ['R', 'S'], or ['S', 'R']
    if n.size() = 1 and n[1] = 'R'
      Emit(tuple t, null)

Join
•A JOIN is a means for combining fields from two tables by using values common to each.
•ANSI-standard SQL specifies four types of JOIN: INNER, OUTER, LEFT, and RIGHT.
•As a special case, a table (base table, view, or joined table) can JOIN to itself in a self-join.
•Join algorithms:
–Nested loop
–Sort-Merge
–Hash

SELECT *
FROM employee
INNER JOIN department ON employee.DepartmentID = department.DepartmentID;

SELECT *
FROM employee
JOIN department ON employee.DepartmentID = department.DepartmentID;

SELECT *
FROM employee
LEFT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID;

Nested Loop
for each tuple r in R do
  for each tuple s in S do
    if r and s satisfy the join condition then
      output the tuple <r, s>

Sort-Merge
// P and Q are sorted on the join attributes a and b; p scans P, g and q scan Q
p in P; q, g in Q
while more tuples in inputs do
  while p.a < g.b do
    advance p
  end while
  while p.a > g.b do
    advance g // a group might begin here
  end while
  while p.a == g.b do
    q = g // mark group beginning
    while p.a == q.b do
      add <p, q> to the result
      advance q
    end while
    advance p // move forward
  end while
  g = q // candidate to begin next group
end while

Sort-Merge: MapReduce
•This algorithm joins two sets R and L on some key k.
•Mapper goes through all tuples from R and L, extracts key k from each tuple, marks the tuple with a tag that indicates the set this tuple came from ('R' or 'L'), and emits the tagged tuple using k as a key.
•Reducer receives all tuples for a particular key k and puts them into two buckets: one for R and one for L.
•When the two buckets are filled, Reducer runs a nested loop over them and emits a cross-join of the buckets.
•Each emitted tuple is a concatenation of an R-tuple, an L-tuple, and the key k.
•This approach has the following disadvantages:
–Mapper emits absolutely all data, even for keys that occur only in one set and have no pair in the other.
–Reducer should hold all data for one key in memory. If the data doesn't fit in memory, it is the Reducer's responsibility to handle this by some kind of swap.

Sort-Merge: MapReduce Pseudocode
Solution:
class Mapper
  method Map(null, tuple [join_key k, value v1, value v2,...])
    Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2,...]])

class Reducer
  method Reduce(join_key k, tagged_tuples [t1, t2,...])
    H = new AssociativeArray: set_name -> values
    for all tagged_tuple t in [t1, t2,...] // separate values into two arrays
      H{t.tag}.add(t.values)
    for all values r in H{'R'} // produce a cross-join of the two arrays
      for all values l in H{'L'}
        Emit(null, [k, r, l])

Hash Join
•The Hash Join algorithm consists of a 'build' phase and a 'probe' phase.
•In its simplest variant, the smaller dataset is loaded into an in-memory hash table in the build phase.
•In the 'probe' phase, the larger dataset is scanned and joined with the relevant tuple(s) by looking into the hash table.

// build phase
for all p in P do
  load p into in-memory hash table H
end for
// probe phase
for all q in Q do
  if H contains p matching with q then
    add <p, q> to the result
  end if
end for

Hash Join: MapReduce Pseudocode
•Let's assume that we join two sets, R and L, where R is relatively small. If so, R can be distributed to all Mappers, and each Mapper can load it and index it by the join key (see the Java sketch below).

class Mapper
  method Initialize
    H = new AssociativeArray: join_key -> tuples from R
    R = loadR()
    for all [join_key k, tuple [r1, r2,...]] in R
      H{k} = H{k}.append([r1, r2,...])

  method Map(join_key k, tuple l)
    for all tuple r in H{k}
      Emit(null, tuple [k, r, l])
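
A hedged Java sketch of this replicated (map-side hash) join, assuming R has been shipped to every worker as a local tab-separated file whose path arrives through the configuration key rjoin.rpath (an illustrative name; in practice the distributed cache is typically used to ship R):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side hash join: build an in-memory index of the small set R,
// then probe it with each tuple of the large set L. No Reducer is needed.
public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, List<String>> rByKey = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // build phase: load R (assumed format: join_key <TAB> value per line)
        String rPath = context.getConfiguration().get("rjoin.rpath");
        try (BufferedReader in = new BufferedReader(new FileReader(rPath))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                rByKey.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text lTuple, Context context)
            throws IOException, InterruptedException {
        // probe phase: join each L tuple with every matching R tuple
        String[] kv = lTuple.toString().split("\t", 2);
        List<String> matches = rByKey.get(kv[0]);
        if (matches == null) return; // key occurs only in L; emit nothing
        for (String r : matches) {
            context.write(new Text(kv[0] + "\t" + r + "\t" + kv[1]), NullWritable.get());
        }
    }
}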

References
F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99-110, 2010.
R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD, pages 495-506, 2010.
H. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD Conference, pages 1029-1040, 2007.
A. Okcan and M. Riedewald. Processing theta-joins using MapReduce. In SIGMOD, 2011.
F. N. Afrati and J. D. Ullman. Optimizing Multiway Joins in a Map-Reduce Environment. IEEE Trans. Knowl. Data Eng. 23(9): 1282-1298, 2011.
A. Metwally and C. Faloutsos. V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors. PVLDB, vol. 5, 2012.

End of Session
Day 2: MapReduce Concepts