Machine Learning With MapReduce, K-Means, MLE



Slide Content

Machine Learning with MapReduce

K-Means Clustering

How to MapReduce K-Means?
•Given K, assign the first K random points to be the initial cluster centers
•Assign subsequent points to the closest cluster using the supplied distance measure
•Compute the centroid of each cluster and iterate the previous step until the cluster centers converge within delta
•Run a final pass over the points to cluster them for output

K-Means Map/Reduce Design
•Driver
–Runs multiple iteration jobs using mapper+combiner+reducer
–Runs final clustering job using only mapper
•Mapper
–Configure: Single file containing encoded Clusters
–Input: File split containing encoded Vectors
–Output: Vectors keyed by nearest cluster
•Combiner
–Input: Vectors keyed by nearest cluster
–Output: Cluster centroid vectors keyed by “cluster”
•Reducer (singleton)
–Input: Cluster centroid vectors
–Output: Single file containing Vectors keyed by cluster

Mapper: the mapper has the k centers in memory.
Input: key-value pair (each input data point x).
Find the index of the closest of the k centers (call it iClosest).
Emit: (key, value) = (iClosest, x)
Reducer(s): Input (key, value)
Key = index of center
Value = iterator over the input data points closest to the i-th center
At each key, run through the iterator and average all the corresponding input data points.
Emit: (index of center, new center)
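
The per-iteration logic above can be sketched in plain Java, independent of any particular Hadoop or Mahout API. This is a minimal sketch: the class and method names (KMeansStep, nearestCenter, iterate) and the toy data in main are illustrative assumptions, not taken from the slides, and the map/reduce "shuffle" is simulated with an in-memory HashMap.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal local sketch of one K-Means iteration as described above.
// Points and centers are double[] vectors.
public class KMeansStep {

    // Mapper logic: return the index of the closest center for point x.
    static int nearestCenter(double[] x, double[][] centers) {
        int iClosest = 0;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double d = 0.0;
            for (int j = 0; j < x.length; j++) {
                double diff = x[j] - centers[i][j];
                d += diff * diff;                       // squared Euclidean distance
            }
            if (d < best) { best = d; iClosest = i; }
        }
        return iClosest;
    }

    // Reducer logic: average all points keyed to the same center index.
    static double[] average(List<double[]> points) {
        double[] sum = new double[points.get(0).length];
        for (double[] p : points)
            for (int j = 0; j < sum.length; j++) sum[j] += p[j];
        for (int j = 0; j < sum.length; j++) sum[j] /= points.size();
        return sum;
    }

    // One iteration: map each point to (iClosest, x), group, then reduce to new centers.
    static double[][] iterate(double[][] points, double[][] centers) {
        Map<Integer, List<double[]>> grouped = new HashMap<>();
        for (double[] x : points)
            grouped.computeIfAbsent(nearestCenter(x, centers), k -> new ArrayList<>()).add(x);
        double[][] newCenters = centers.clone();        // clusters that lose all points keep their old center
        for (Map.Entry<Integer, List<double[]>> e : grouped.entrySet())
            newCenters[e.getKey()] = average(e.getValue());   // emit (index of center, new center)
        return newCenters;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}};   // toy data
        double[][] centers = {{1, 1}, {8, 8}};                    // first K points as initial centers
        for (int it = 0; it < 5; it++) centers = iterate(points, centers);
        for (double[] c : centers) System.out.println(c[0] + ", " + c[1]);
    }
}

In a real Hadoop job this work is split across the cluster: nearestCenter runs in the mappers and average in the reducer, as the slide describes.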

Improved version: calculate partial sums in the mappers
Mapper: the mapper has the k centers in memory. Run through one input data point at a time (call it x). Find the index of the closest of the k centers (call it iClosest). Accumulate the sum of the inputs, segregated into K groups depending on which center is closest.
Emit: ( , partial sum)
or
Emit: (index, partial sum)
Reducer: accumulate the partial sums and
emit with index or without.
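
The partial-sum variant can be sketched the same way. This is an illustration only: the Partial holder, the method names, and the reuse of KMeansStep.nearestCenter from the sketch above are assumptions, not Mahout or Hadoop classes. The point is that each mapper (or combiner) emits one (index, partial sum, count) record per center instead of one record per input point.

import java.util.List;

// Sketch of the improved version: accumulate K partial sums in the mapper.
public class KMeansPartialSums {

    static class Partial {
        final double[] sum;   // running vector sum of points assigned to this center
        long count;           // number of points included in the sum
        Partial(int dim) { sum = new double[dim]; }
    }

    // Mapper side: run through one split, segregating sums by closest center.
    static Partial[] mapSplit(List<double[]> split, double[][] centers) {
        Partial[] partials = new Partial[centers.length];
        for (int i = 0; i < centers.length; i++) partials[i] = new Partial(centers[i].length);
        for (double[] x : split) {
            int iClosest = KMeansStep.nearestCenter(x, centers);   // reuses the sketch above
            Partial p = partials[iClosest];
            for (int j = 0; j < x.length; j++) p.sum[j] += x[j];
            p.count++;
        }
        return partials;       // emit (index, partial sum, count) for each center
    }

    // Reducer side: add the partial sums for one center and divide by the total count.
    static double[] reduce(List<Partial> partialsForOneCenter, int dim) {
        double[] total = new double[dim];
        long n = 0;
        for (Partial p : partialsForOneCenter) {
            for (int j = 0; j < dim; j++) total[j] += p.sum[j];
            n += p.count;
        }
        for (int j = 0; j < dim; j++) total[j] /= Math.max(n, 1);
        return total;          // emit (index of center, new center)
    }
}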

EM-Algorithm

What is MLE?
•Given
–A sample X = {X_1, …, X_n}
–A vector of parameters θ
•We define
–Likelihood of the data: P(X | θ)
–Log-likelihood of the data: L(θ) = log P(X | θ)
•Given X, find
$$\theta_{ML} = \arg\max_{\theta} L(\theta)$$

MLE (cont)
•Often we assume that the X_i are independently and identically distributed (i.i.d.)
•Depending on the form of p(x|θ), solving the optimization problem can be easy or hard:
$$\begin{aligned}
\theta_{ML} &= \arg\max_{\theta} L(\theta)
             = \arg\max_{\theta} \log P(X \mid \theta)
             = \arg\max_{\theta} \log P(X_1, \ldots, X_n \mid \theta) \\
            &= \arg\max_{\theta} \log \prod_{i} P(X_i \mid \theta)
             = \arg\max_{\theta} \sum_{i} \log P(X_i \mid \theta)
\end{aligned}$$

An easy case
•Assuming
–A coin has a probability p of being heads, 1-p of
being tails.
–Observation: We toss a coin N times, and the
result is a set of Hs and Ts, and there are m Hs.
•What is the value of p based on MLE, given
the observation?

An easy case (cont)
$$L(\theta) = \log P(X \mid \theta) = \log\big(p^{m}(1-p)^{N-m}\big) = m\log p + (N-m)\log(1-p)$$
$$\frac{dL}{dp} = \frac{d\big(m\log p + (N-m)\log(1-p)\big)}{dp} = \frac{m}{p} - \frac{N-m}{1-p} = 0$$
$$p = m/N$$
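
As a quick sanity check of this result, here is a worked instance with illustrative numbers (N = 10 tosses, m = 7 heads), not taken from the slides:
$$L(p) = \log\big(p^{7}(1-p)^{3}\big) = 7\log p + 3\log(1-p), \qquad
\frac{dL}{dp} = \frac{7}{p} - \frac{3}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{7}{10} = 0.7$$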

Basic setting in EM
•X is a set of data points: observed data
•Θ is a parameter vector.
•EM is a method to find
$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$
where
•Calculating P(X | θ) directly is hard.
•Calculating P(X, Y | θ) is much simpler, where Y is “hidden” data (or “missing” data).

The basic EM strategy
•Z = (X, Y)
–Z: complete data (“augmented data”)
–X: observed data (“incomplete” data)
–Y: hidden data (“missing” data)

The log-likelihood function
•L is a function of θ, while holding X constant: L(θ | X) = L(θ) = P(X | θ)
$$\begin{aligned}
l(\theta) = \log L(\theta) &= \log P(X \mid \theta)
  = \log \prod_{i=1}^{n} P(x_i \mid \theta) \\
 &= \sum_{i=1}^{n} \log P(x_i \mid \theta)
  = \sum_{i=1}^{n} \log \sum_{y} P(x_i, y \mid \theta)
\end{aligned}$$

The iterative approach for MLE
$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} l(\theta)
             = \arg\max_{\theta} \sum_{i=1}^{n} \log \sum_{y} p(x_i, y \mid \theta)$$
In many cases, we cannot find the solution directly. An alternative is to find a sequence θ^0, θ^1, …, θ^t, … s.t.
$$l(\theta^{0}) \le l(\theta^{1}) \le \ldots \le l(\theta^{t}) \le \ldots$$

$$\begin{aligned}
l(\theta) - l(\theta^{t})
&= \log P(X \mid \theta) - \log P(X \mid \theta^{t}) \\
&= \sum_{i=1}^{n} \log P(x_i \mid \theta) - \sum_{i=1}^{n} \log P(x_i \mid \theta^{t}) \\
&= \sum_{i=1}^{n} \log \frac{\sum_{y} P(x_i, y \mid \theta)}{P(x_i \mid \theta^{t})} \\
&= \sum_{i=1}^{n} \log \sum_{y} P(y \mid x_i, \theta^{t})\,
   \frac{P(x_i, y \mid \theta)}{P(y \mid x_i, \theta^{t})\, P(x_i \mid \theta^{t})} \\
&= \sum_{i=1}^{n} \log \sum_{y} P(y \mid x_i, \theta^{t})\,
   \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^{t})} \\
&\ge \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t})\,
   \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^{t})}
   \qquad \text{(Jensen's inequality)} \\
&= \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}
   \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^{t})} \right]
\end{aligned}$$

Jensen's inequality
If f is convex, then f(E[g(x)]) ≤ E[f(g(x))].
If f is concave, then f(E[g(x)]) ≥ E[f(g(x))]; in particular, log(E[p(x)]) ≥ E[log(p(x))], since
log is a concave function.

Maximizing the lower bound
$$\begin{aligned}
\theta^{(t+1)}
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}
   \left[ \log \frac{p(x_i, y \mid \theta)}{p(x_i, y \mid \theta^{t})} \right] \\
&= \arg\max_{\theta} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t})\,
   \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^{t})} \\
&= \arg\max_{\theta} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t})\, \log P(x_i, y \mid \theta) \\
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})} \big[ \log P(x_i, y \mid \theta) \big]
\end{aligned}$$

The Q function

The Q-function
•Define the Q-function (a function of θ):
–Y is a random vector.
–X = (x_1, x_2, …, x_n) is a constant (vector).
–Θ^t is the current parameter estimate and is a constant (vector).
–Θ is the variable (vector) that we wish to adjust.
•The Q-function is the expected value of the complete-data log-likelihood log P(X,Y|θ) with respect to Y, given X and θ^t:
$$Q(\theta, \theta^{t}) = E_{P(Y \mid X, \theta^{t})}\big[ \log P(X, Y \mid \theta) \mid X, \theta^{t} \big]
  = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t})\, \log P(x_i, y \mid \theta)$$

The inner loop of the EM algorithm
•E-step: calculate
$$Q(\theta, \theta^{t}) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t})\, \log P(x_i, y \mid \theta)$$
•M-step: find
$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{t})$$

L(θ) is non-decreasing at each iteration
•The EM algorithm will produce a sequence θ^0, θ^1, …, θ^t, …
•It can be proved that
$$l(\theta^{0}) \le l(\theta^{1}) \le \ldots \le l(\theta^{t}) \le \ldots$$

The inner loop of the Generalized EM algorithm (GEM)
•E-step: calculate
$$Q(\theta, \theta^{t}) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t})\, \log P(x_i, y \mid \theta)$$
•M-step: find θ^{t+1} s.t.
$$Q(\theta^{t+1}, \theta^{t}) \ge Q(\theta^{t}, \theta^{t})$$

Recap of the EM algorithm

Idea #1: find θ that maximizes the likelihood of training data
$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$
Idea #2: find the θ^t sequence
No analytical solution → iterative approach: find a sequence θ^0, θ^1, …, θ^t, … s.t.
$$l(\theta^{0}) \le l(\theta^{1}) \le \ldots \le l(\theta^{t}) \le \ldots$$

Idea #3: find θ^{t+1} that maximizes a tight lower bound of l(θ) - l(θ^t)
The tight lower bound:
$$l(\theta) - l(\theta^{t}) \;\ge\; \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}
  \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^{t})} \right]$$

Idea #4: find θ^{t+1} that maximizes the Q function
$$\begin{aligned}
\theta^{(t+1)} &= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}
  \left[ \log \frac{p(x_i, y \mid \theta)}{p(x_i, y \mid \theta^{t})} \right]
  && \text{(the lower bound of } l(\theta) - l(\theta^{t})\text{)} \\
 &= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}
  \big[ \log P(x_i, y \mid \theta) \big]
  && \text{(the Q function)}
\end{aligned}$$

The EM algorithm
•Start with an initial estimate θ^0
•Repeat until convergence
–E-step: calculate
$$Q(\theta, \theta^{t}) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t})\, \log P(x_i, y \mid \theta)$$
–M-step: find
$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{t})$$

Important classes of EM problem
•Products of multinomial (PM) models
•Exponential families
•Gaussian mixture (see the sketch after this list)
•…
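
Since Gaussian mixtures are listed above as a standard class of EM problem, here is a minimal, self-contained Java sketch of EM for a 1-D mixture of two Gaussians. The class name, the toy data, the initial parameters, and the iteration count are illustrative assumptions, not from the slides; the E-step computes the responsibilities P(z | x_i, θ^t) and the M-step re-estimates the weights, means, and variances in closed form.

// Minimal sketch: EM for a 1-D mixture of two Gaussians.
public class GaussianMixtureEM {

    static double gaussian(double x, double mean, double var) {
        return Math.exp(-(x - mean) * (x - mean) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
    }

    public static void main(String[] args) {
        double[] x = {0.8, 1.1, 0.9, 1.3, 4.9, 5.2, 5.1, 4.8};   // toy data from two clumps
        double[] w = {0.5, 0.5};          // mixing weights P(z)
        double[] mu = {0.0, 6.0};         // initial means
        double[] var = {1.0, 1.0};        // initial variances
        double[][] r = new double[x.length][2];   // responsibilities P(z | x_i, theta^t)

        for (int iter = 0; iter < 50; iter++) {
            // E-step: r[i][k] = w_k N(x_i; mu_k, var_k) / sum over components
            for (int i = 0; i < x.length; i++) {
                double norm = 0.0;
                for (int k = 0; k < 2; k++) {
                    r[i][k] = w[k] * gaussian(x[i], mu[k], var[k]);
                    norm += r[i][k];
                }
                for (int k = 0; k < 2; k++) r[i][k] /= norm;
            }
            // M-step: maximize Q(theta, theta^t), which has a closed form here.
            for (int k = 0; k < 2; k++) {
                double nk = 0.0, sum = 0.0, sq = 0.0;
                for (int i = 0; i < x.length; i++) { nk += r[i][k]; sum += r[i][k] * x[i]; }
                mu[k] = sum / nk;
                for (int i = 0; i < x.length; i++) sq += r[i][k] * (x[i] - mu[k]) * (x[i] - mu[k]);
                var[k] = Math.max(sq / nk, 1e-6);   // floor the variance to avoid collapse
                w[k] = nk / x.length;
            }
        }
        System.out.printf("w=%.2f/%.2f  mu=%.2f/%.2f  var=%.2f/%.2f%n",
                w[0], w[1], mu[0], mu[1], var[0], var[1]);
    }
}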

Probabilistic Latent Semantic Analysis (PLSA)
•PLSA is a generative model for generating the co-occurrence of documents d ∈ D = {d_1, …, d_D} and terms w ∈ W = {w_1, …, w_W}, which associates a latent variable z ∈ Z = {z_1, …, z_Z}.
•The generative process is:
[Figure: graphical model with document nodes d_1 … d_D, latent nodes z_1 … z_Z, and term nodes w_1 … w_W, linked by P(d), P(z|d), and P(w|z)]

Model
•The generative process can be expressed by:
$$P(d, w) = P(d)\,P(w \mid d), \qquad \text{where } P(w \mid d) = \sum_{z \in Z} P(w \mid z)\,P(z \mid d)$$
Two independence assumptions:
1) Each pair (d, w) is assumed to be generated independently, corresponding to the ‘bag-of-words’ assumption.
2) Conditioned on z, words w are generated independently of the specific document d.

Model
•Following the likelihood principle, we determine P(z), P(d|z), and P(w|z) by maximization of the log-likelihood function
$$L(\theta \mid d, w, z) = \sum_{d \in D} \sum_{w \in W} n(d, w)\, \log P(d, w)$$
$$\text{where } P(d, w) = \sum_{z \in Z} P(w \mid z)\,P(z \mid d)\,P(d) = \sum_{z \in Z} P(w \mid z)\,P(d \mid z)\,P(z)$$
–n(d, w): the co-occurrence count of d and w (the observed data)
–z: the unobserved (hidden) data
–P(d), P(z|d), and P(w|z): the model parameters

Maximum-likelihood
•Definition
–We have a density function P(x|Θ) that is governed by the set of parameters Θ; e.g., P might be a set of Gaussians and Θ could be the means and covariances.
–We also have a data set X = {x_1, …, x_N}, supposedly drawn from this distribution P, and we assume these data vectors are i.i.d. with P.
–Then the likelihood function is:
$$P(X \mid \Theta) = \prod_{i=1}^{N} P(x_i \mid \Theta) = L(\Theta \mid X)$$
–The likelihood is thought of as a function of the parameters Θ, where the data X is fixed. Our goal is to find the Θ that maximizes L. That is,
$$\Theta^{*} = \arg\max_{\Theta} L(\Theta \mid X)$$

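For the Gaussian example mentioned in the definition above, the maximization has a well-known closed form, stated here for reference (a standard result, not spelled out on the slides):
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^{\top}$$
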
Jensen's inequality
$$\log \sum_{j} a_j\, g(j) \;\ge\; \sum_{j} a_j \log g(j),
\qquad \text{provided } a_j \ge 0,\; \sum_{j} a_j = 1,\; g(j) \ge 0$$

Estimation using EM
$$\max_{\theta} L(\theta \mid d, w, z) = \max_{\theta} \sum_{d \in D} \sum_{w \in W} n(d, w)\,
  \log \sum_{z \in Z} P(z)\,P(w \mid z)\,P(d \mid z)$$
Maximizing this directly is difficult!
Idea: start with a guess θ^t, compute an easily computed lower bound B(θ; θ^t) to the function log P(θ | U), and maximize the bound instead.
By Jensen's inequality:
$$\log \sum_{z \in Z} P(z)\,P(w \mid z)\,P(d \mid z)
  = \log \sum_{z \in Z} P(z \mid d, w)\,\frac{P(z)\,P(w \mid z)\,P(d \mid z)}{P(z \mid d, w)}
  \;\ge\; \sum_{z \in Z} P(z \mid d, w)\,\log \frac{P(z)\,P(w \mid z)\,P(d \mid z)}{P(z \mid d, w)}$$
so that
$$\max_{\theta} B(\theta; \theta^{t}) = \max_{\theta} \sum_{d \in D} \sum_{w \in W} n(d, w)
  \sum_{z \in Z} \big[ \log\big(P(z)\,P(w \mid z)\,P(d \mid z)\big) - \log P(z \mid d, w) \big]\, P(z \mid d, w)$$

(1) Solve P(w|z)
•We introduce a Lagrange multiplier λ with the constraint that Σ_w P(w|z) = 1, and solve the following equation:
$$\frac{\partial}{\partial P(w \mid z)}
  \Big[ \sum_{d \in D} \sum_{w \in W} n(d, w) \sum_{z}
        \big[ \log\big(P(z)\,P(w \mid z)\,P(d \mid z)\big) - \log P(z \mid w, d) \big] P(z \mid w, d)
        + \lambda \big( \textstyle\sum_{w} P(w \mid z) - 1 \big) \Big] = 0$$
$$\Rightarrow \frac{\sum_{d \in D} n(d, w)\,P(z \mid d, w)}{P(w \mid z)} + \lambda = 0
  \;\Rightarrow\; P(w \mid z) = -\frac{\sum_{d \in D} n(d, w)\,P(z \mid d, w)}{\lambda}$$
Using Σ_w P(w|z) = 1 gives λ = -Σ_{w∈W} Σ_{d∈D} n(d,w) P(z|d,w), so
$$P(w \mid z) = \frac{\sum_{d \in D} n(d, w)\,P(z \mid d, w)}
                     {\sum_{w \in W} \sum_{d \in D} n(d, w)\,P(z \mid d, w)}$$

(2) Solve P(d|z)
•We introduce a Lagrange multiplier λ with the constraint that Σ_d P(d|z) = 1, and get the following result:
$$P(d \mid z) = \frac{\sum_{w \in W} n(d, w)\,P(z \mid d, w)}
                     {\sum_{d \in D} \sum_{w \in W} n(d, w)\,P(z \mid d, w)}$$

(3) Solve P(z)
•We introduce a Lagrange multiplier λ with the constraint that Σ_z P(z) = 1, and solve the following equation:
$$\frac{\partial}{\partial P(z)}
  \Big[ \sum_{d \in D} \sum_{w \in W} n(d, w) \sum_{z}
        \big[ \log\big(P(z)\,P(w \mid z)\,P(d \mid z)\big) - \log P(z \mid w, d) \big] P(z \mid w, d)
        + \lambda \big( \textstyle\sum_{z} P(z) - 1 \big) \Big] = 0$$
$$\Rightarrow \frac{\sum_{d \in D} \sum_{w \in W} n(d, w)\,P(z \mid d, w)}{P(z)} + \lambda = 0
  \;\Rightarrow\; P(z) = -\frac{\sum_{d \in D} \sum_{w \in W} n(d, w)\,P(z \mid d, w)}{\lambda}$$
Using Σ_z P(z) = 1 gives λ = -Σ_{d∈D} Σ_{w∈W} n(d,w), so
$$P(z) = \frac{\sum_{d \in D} \sum_{w \in W} n(d, w)\,P(z \mid d, w)}
              {\sum_{d \in D} \sum_{w \in W} n(d, w)}$$

(4) Solve P(z|d,w)
•We introduce Lagrange multipliers λ_{d,w} with the constraint that Σ_z P(z|d,w) = 1, and solve the following equation:
$$\frac{\partial}{\partial P(z \mid d, w)}
  \Big[ \sum_{d \in D} \sum_{w \in W} n(d, w) \sum_{z}
        \big[ \log\big(P(z)\,P(w \mid z)\,P(d \mid z)\big) - \log P(z \mid d, w) \big] P(z \mid d, w)
        + \sum_{d \in D} \sum_{w \in W} \lambda_{d,w} \big( \textstyle\sum_{z} P(z \mid d, w) - 1 \big) \Big] = 0$$
$$\Rightarrow n(d, w)\big[ \log\big(P(z)\,P(w \mid z)\,P(d \mid z)\big) - \log P(z \mid d, w) - 1 \big] + \lambda_{d,w} = 0$$
$$\Rightarrow P(z \mid d, w) = P(z)\,P(w \mid z)\,P(d \mid z)\; e^{\,\lambda_{d,w}/n(d,w) - 1}$$
Using Σ_z P(z|d,w) = 1 to eliminate λ_{d,w} gives
$$P(z \mid d, w) = \frac{P(z)\,P(w \mid z)\,P(d \mid z)}
                        {\sum_{z} P(z)\,P(w \mid z)\,P(d \mid z)}$$
(4) Solve P(z|d,w) - 2
Equivalently, by Bayes' rule:
$$P(z \mid d, w) = \frac{P(d, w, z)}{P(d, w)}
                 = \frac{P(w, d \mid z)\,P(z)}{P(d, w)}
                 = \frac{P(w \mid z)\,P(d \mid z)\,P(z)}
                        {\sum_{z \in Z} P(w \mid z)\,P(d \mid z)\,P(z)}$$

The final update equations
•E-step:
$$P(z \mid d, w) = \frac{P(w \mid z)\,P(d \mid z)\,P(z)}
                        {\sum_{z \in Z} P(w \mid z)\,P(d \mid z)\,P(z)}$$
•M-step:
$$P(w \mid z) = \frac{\sum_{d \in D} n(d, w)\,P(z \mid d, w)}
                     {\sum_{w \in W} \sum_{d \in D} n(d, w)\,P(z \mid d, w)}$$
$$P(d \mid z) = \frac{\sum_{w \in W} n(d, w)\,P(z \mid d, w)}
                     {\sum_{d \in D} \sum_{w \in W} n(d, w)\,P(z \mid d, w)}$$
$$P(z) = \frac{\sum_{d \in D} \sum_{w \in W} n(d, w)\,P(z \mid d, w)}
              {\sum_{d \in D} \sum_{w \in W} n(d, w)}$$

Coding Design
•Variables:
•double[][] p_dz_n // p(d|z), |D|*|Z|
•double[][] p_wz_n // p(w|z), |W|*|Z|
•double[] p_z_n // p(z), |Z|
•Running Processing:
1.Read the dataset from file
ArrayList<DocWordPair> doc; // all the docs
DocWordPair – (word_id, word_frequency_in_doc)
2.Parameter initialization
Assign each element of p_dz_n, p_wz_n and p_z_n a random double value, satisfying ∑_d p_dz_n[d][z] = 1 for each z, ∑_w p_wz_n[w][z] = 1 for each z, and ∑_z p_z_n[z] = 1 (see the initialization sketch after this list)
3.Estimation (iterative processing)
1.Update p_dz_n, p_wz_n and p_z_n
2.Calculate the log-likelihood function and stop when |Log-likelihood – old_Log-likelihood| < threshold
4.Output p_dz_n, p_wz_n and p_z_n
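
A minimal Java sketch of step 2 (random initialization under the normalization constraints) is given below; the sizes D, W, Z, the seed, and the helper name normalizeColumns are illustrative assumptions, not from the slides.

import java.util.Random;

// Sketch of step 2: random initialization of p_dz_n, p_wz_n and p_z_n so that
// each column of p_dz_n and p_wz_n sums to 1 over its first index, and p_z_n sums to 1.
public class PlsaInit {

    static void normalizeColumns(double[][] m) {
        for (int z = 0; z < m[0].length; z++) {
            double colSum = 0.0;
            for (double[] row : m) colSum += row[z];
            for (double[] row : m) row[z] /= colSum;
        }
    }

    public static void main(String[] args) {
        int D = 100, W = 5000, Z = 10;            // illustrative sizes
        Random rnd = new Random(42);

        double[][] p_dz_n = new double[D][Z];     // p(d|z)
        double[][] p_wz_n = new double[W][Z];     // p(w|z)
        double[] p_z_n = new double[Z];           // p(z)

        for (double[] row : p_dz_n) for (int z = 0; z < Z; z++) row[z] = rnd.nextDouble();
        for (double[] row : p_wz_n) for (int z = 0; z < Z; z++) row[z] = rnd.nextDouble();
        for (int z = 0; z < Z; z++) p_z_n[z] = rnd.nextDouble();

        normalizeColumns(p_dz_n);                 // sum over d of p(d|z) = 1 for each z
        normalizeColumns(p_wz_n);                 // sum over w of p(w|z) = 1 for each z
        double sum = 0.0;
        for (double v : p_z_n) sum += v;
        for (int z = 0; z < Z; z++) p_z_n[z] /= sum;   // sum over z of p(z) = 1
    }
}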

Coding Design
•Update p_dz_n
For each doc d {
  For each word w included in d {
    denominator = 0;
    numerator = new double[Z];
    For each topic z {
      numerator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
      denominator += numerator[z];
    } // end for each topic z
    For each topic z {
      P_z_condition_d_w = numerator[z] / denominator;        // E-step: P(z|d,w)
      numerator_p_dz_n[d][z] += tf_wd * P_z_condition_d_w;   // tf_wd = n(d,w), frequency of w in d
      denominator_p_dz_n[z] += tf_wd * P_z_condition_d_w;
    } // end for each topic z
  } // end for each word w included in d
} // end for each doc d
For each doc d {
  For each topic z {
    p_dz_n_new[d][z] = numerator_p_dz_n[d][z] / denominator_p_dz_n[z];
  } // end for each topic z
} // end for each doc d

Coding Design
•Update p_wz_n
For each doc d {
  For each word w included in d {
    denominator = 0;
    numerator = new double[Z];
    For each topic z {
      numerator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
      denominator += numerator[z];
    } // end for each topic z
    For each topic z {
      P_z_condition_d_w = numerator[z] / denominator;        // E-step: P(z|d,w)
      numerator_p_wz_n[w][z] += tf_wd * P_z_condition_d_w;
      denominator_p_wz_n[z] += tf_wd * P_z_condition_d_w;
    } // end for each topic z
  } // end for each word w included in d
} // end for each doc d
For each word w {
  For each topic z {
    p_wz_n_new[w][z] = numerator_p_wz_n[w][z] / denominator_p_wz_n[z];
  } // end for each topic z
} // end for each word w

Coding Design
•Update p_z_n
For each doc d {
  For each word w included in d {
    denominator = 0;
    numerator = new double[Z];
    For each topic z {
      numerator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
      denominator += numerator[z];
    } // end for each topic z
    For each topic z {
      P_z_condition_d_w = numerator[z] / denominator;        // E-step: P(z|d,w)
      numerator_p_z_n[z] += tf_wd * P_z_condition_d_w;
    } // end for each topic z
    denominator_p_z_n += tf_wd;                              // total count, shared by all topics
  } // end for each word w included in d
} // end for each doc d
For each topic z {
  p_z_n_new[z] = numerator_p_z_n[z] / denominator_p_z_n;
} // end for each topic z

Apache Mahout
Industrial Strength Machine Learning
GraphLab

Current Situation
•Large volumes of data are now available
•Platforms now exist to run computations over
large datasets (Hadoop, HBase)
•Sophisticated analytics are needed to turn data
into information people can use
•Active research community and proprietary
implementations of “machine learning”
algorithms
•The world needs scalable implementations of ML under an open license - ASF

History of Mahout
•Summer 2007
–Developers needed scalable ML
–Mailing list formed
•Community formed
–Apache contributors
–Academia & industry
–Lots of initial interest
•Project formed under Apache Lucene
–January 25, 2008

Current Code Base
•Matrix & Vector library
–Memory resident sparse & dense implementations
•Clustering
–Canopy
–K-Means
–Mean Shift
•Collaborative Filtering
–Taste
•Utilities
–Distance Measures
–Parameters

Others
•Naïve Bayes
•Perceptron
•PLSI/EM
•Genetic Programming
•Dirichlet Process Clustering
•Clustering Examples
•Hama (Incubator) for very large arrays