expectation maximization and Gaussian Mixture.pdf



The Expectation Maximization
or EM algorithm
Carl Edward Rasmussen
November 15th, 2017
Carl Edward Rasmussen The EM algorithm November 15th, 2017 1 / 11

Contents
• notation, objective
• the lower bound functional, $F(q(z), \theta)$
• the EM algorithm
• example: Gaussian mixture model
• Appendix: KL divergence
Carl Edward Rasmussen The EM algorithm November 15th, 2017 2 / 11

Notation
Probabilistic models may have visible (or observed) variables $y$, latent variables (also called hidden or unobserved variables, or missing data) $z$, and parameters $\theta$.
Example: in a Gaussian mixture model, the visible variables are the observations, the latent variables are the assignments of data points to mixture components, and the parameters are the means, variances, and weights of the mixture components.
The likelihood, $p(y|\theta)$, is the probability of the visible variables given the parameters. The goal of the EM algorithm is to find parameters $\theta$ which maximize the likelihood. The EM algorithm is iterative and converges to a local maximum.
Throughout, $q(z)$ will be used to denote an arbitrary distribution over the latent variables $z$. The exposition assumes the latent variables are continuous; an analogous derivation for discrete $z$ is obtained by substituting sums for integrals.
Carl Edward Rasmussen The EM algorithm November 15th, 2017 3 / 11

The lower bound
Bayes' rule:

$$
p(z|y,\theta) = \frac{p(y|z,\theta)\, p(z|\theta)}{p(y|\theta)},
\qquad
p(y|\theta) = \frac{p(y|z,\theta)\, p(z|\theta)}{p(z|y,\theta)}.
$$

Multiply and divide by an arbitrary (non-zero) distribution $q(z)$:

$$
p(y|\theta) = \frac{p(y|z,\theta)\, p(z|\theta)}{q(z)} \, \frac{q(z)}{p(z|y,\theta)},
$$

take logarithms:

$$
\log p(y|\theta) = \log \frac{p(y|z,\theta)\, p(z|\theta)}{q(z)} + \log \frac{q(z)}{p(z|y,\theta)},
$$

and average both sides wrt $q(z)$:

$$
\log p(y|\theta) =
\underbrace{\int q(z) \log \frac{p(y|z,\theta)\, p(z|\theta)}{q(z)}\, dz}_{\text{lower bound functional } F(q(z),\theta)}
+
\underbrace{\int q(z) \log \frac{q(z)}{p(z|y,\theta)}\, dz}_{\text{non-negative } \mathrm{KL}(q(z)\,\|\,p(z|y,\theta))}.
$$
Carl Edward Rasmussen The EM algorithm November 15th, 2017 4 / 11
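
To make the decomposition concrete, here is a minimal numeric sketch (not from the slides) for a discrete latent variable with hypothetical toy probabilities, where the integrals become sums:

```python
import numpy as np

# Minimal sketch with a hypothetical discrete latent variable z (k states):
# numerically check log p(y|theta) = F(q(z), theta) + KL(q(z) || p(z|y, theta)),
# with sums in place of integrals.

rng = np.random.default_rng(0)
k = 3

p_z = rng.dirichlet(np.ones(k))            # prior p(z|theta)
p_y_given_z = rng.uniform(0.1, 1.0, k)     # likelihood p(y|z, theta) for one fixed y
q = rng.dirichlet(np.ones(k))              # arbitrary (non-zero) distribution q(z)

p_y = np.sum(p_y_given_z * p_z)            # marginal likelihood p(y|theta)
posterior = p_y_given_z * p_z / p_y        # p(z|y, theta) from Bayes' rule

F = np.sum(q * np.log(p_y_given_z * p_z / q))   # lower bound functional F(q, theta)
kl = np.sum(q * np.log(q / posterior))          # KL(q || posterior), non-negative

print(np.log(p_y), F + kl)    # the two values coincide
print(F <= np.log(p_y))       # F is indeed a lower bound
```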

The EM algorithm
From initial (random) parameters $\theta_{t=0}$, iterate for $t = 1, \ldots, T$ the two steps:

E step: for fixed $\theta_{t-1}$, maximize the lower bound $F(q(z), \theta_{t-1})$ wrt $q(z)$. Since the log likelihood $\log p(y|\theta)$ is independent of $q(z)$, maximizing the lower bound is equivalent to minimizing $\mathrm{KL}(q(z)\,\|\,p(z|y,\theta_{t-1}))$, so $q_t(z) = p(z|y,\theta_{t-1})$.

M step: for fixed $q_t(z)$, maximize the lower bound $F(q_t(z), \theta)$ wrt $\theta$. We have

$$
F(q(z), \theta) = \int q(z) \log\big( p(y|z,\theta)\, p(z|\theta) \big)\, dz - \int q(z) \log q(z)\, dz,
$$

whose second term is the entropy of $q(z)$, independent of $\theta$, so the M step is

$$
\theta_t = \operatorname*{argmax}_{\theta} \int q_t(z) \log\big( p(y|z,\theta)\, p(z|\theta) \big)\, dz.
$$

Although the steps work with the lower bound, each iteration cannot decrease the log likelihood, as

$$
\log p(y|\theta_{t-1})
\;\overset{\text{E step}}{=}\;
F(q_t(z), \theta_{t-1})
\;\overset{\text{M step}}{\leq}\;
F(q_t(z), \theta_t)
\;\overset{\text{lower bound}}{\leq}\;
\log p(y|\theta_t).
$$
Carl Edward Rasmussen The EM algorithm November 15th, 2017 5 / 11
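
As a minimal sketch of the iteration just described: the loop below simply alternates the two steps; `e_step` and `m_step` are hypothetical placeholders that a concrete model (such as the Gaussian mixture on the following slides) would supply.

```python
# Minimal sketch of the generic EM iteration. The callables e_step and m_step
# are hypothetical placeholders: e_step(y, theta) should return
# q_t(z) = p(z | y, theta_{t-1}), and m_step(y, q) should return the argmax
# over theta of the expected complete-data log joint.

def em(y, theta0, e_step, m_step, T=100):
    theta = theta0
    for t in range(1, T + 1):
        q = e_step(y, theta)     # E step: set q_t(z) to the posterior under theta_{t-1}
        theta = m_step(y, q)     # M step: maximize F(q_t(z), theta) wrt theta
    return theta
```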

EM as Coordinate Ascent in $F$
Carl Edward Rasmussen The EM algorithm November 15th, 2017 6 / 11

Example: Mixture of Gaussians
In a Gaussian mixture model, the parameters are $\theta = \{\mu_j, \sigma_j^2, \pi_j\}_{j=1\ldots k}$, the mixture means, variances and mixing proportions for each of the $k$ components.

There is one latent variable per data point, $z_i$, $i = 1 \ldots n$, taking on values $1 \ldots k$.

The probability of the observations given the latent variables and the parameters, and the prior on latent variables, are

$$
p(y_i \,|\, z_i = j, \theta) = \frac{\exp\!\big( -\tfrac{(y_i - \mu_j)^2}{2\sigma_j^2} \big)}{\sqrt{2\pi\sigma_j^2}},
\qquad
p(z_i = j \,|\, \theta) = \pi_j,
$$

so the E step becomes:

$$
q(z_i = j) \;\propto\; u_{ij} = \pi_j \exp\!\big( -\tfrac{(y_i - \mu_j)^2}{2\sigma_j^2} \big) \big/ \sqrt{2\pi\sigma_j^2}
\;\;\Rightarrow\;\;
q(z_i = j) = r_{ij} = \frac{u_{ij}}{u_i},
$$

where $u_i = \sum_{j=1}^{k} u_{ij}$. This shows that the posterior for each latent variable $z_i$ follows a discrete distribution with probability given by the product of the prior and likelihood, renormalized. Here, $r_{ij}$ is called the responsibility that component $j$ takes for data point $i$.
Carl Edward Rasmussen The EM algorithm November 15th, 2017 7 / 11
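
A minimal sketch of this E step for a one-dimensional mixture follows; the data `y` and the parameter values are hypothetical illustrations.

```python
import numpy as np

# Minimal sketch of the E step above for a 1-D mixture of Gaussians.
# The data y and the parameter values below are hypothetical illustrations.

def e_step(y, mu, var, pi):
    """Return the n x k matrix of responsibilities r_ij = q(z_i = j)."""
    d = y[:, None] - mu                                              # (n, k) differences y_i - mu_j
    u = pi * np.exp(-d ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)  # u_ij
    return u / u.sum(axis=1, keepdims=True)                          # r_ij = u_ij / u_i

y = np.array([-2.1, -1.9, 0.1, 2.0, 2.2])
mu = np.array([-2.0, 0.0, 2.0])        # means mu_j
var = np.array([0.5, 0.5, 0.5])        # variances sigma_j^2
pi = np.array([1 / 3, 1 / 3, 1 / 3])   # mixing proportions pi_j

r = e_step(y, mu, var, pi)
print(r.round(3))    # each row sums to 1: a discrete posterior per data point
```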

Example: Mixture of Gaussians continued
The lower bound functional is

$$
F(q(z), \theta) = \sum_{i=1}^{n} \sum_{j=1}^{k} q(z_i = j)
\Big[ \log \pi_j - \tfrac{1}{2}\frac{(y_i - \mu_j)^2}{\sigma_j^2} - \tfrac{1}{2}\log \sigma_j^2 \Big]
+ \text{const.}
$$

The M step, optimizing $F(q(z), \theta)$ wrt the parameters $\theta$:

$$
\frac{\partial F}{\partial \mu_j}
= \sum_{i=1}^{n} q(z_i = j)\, \frac{y_i - \mu_j}{\sigma_j^2} = 0
\;\Rightarrow\;
\mu_j = \frac{\sum_{i=1}^{n} q(z_i = j)\, y_i}{\sum_{i=1}^{n} q(z_i = j)},
$$

$$
\frac{\partial F}{\partial \sigma_j^2}
= \sum_{i=1}^{n} q(z_i = j) \Big[ \frac{(y_i - \mu_j)^2}{2\sigma_j^4} - \frac{1}{2\sigma_j^2} \Big] = 0
\;\Rightarrow\;
\sigma_j^2 = \frac{\sum_{i=1}^{n} q(z_i = j)\, (y_i - \mu_j)^2}{\sum_{i=1}^{n} q(z_i = j)},
$$

$$
\frac{\partial \big[ F + \lambda\big( 1 - \sum_{j=1}^{k} \pi_j \big) \big]}{\partial \pi_j} = 0
\;\Rightarrow\;
\pi_j = \frac{1}{n} \sum_{i=1}^{n} q(z_i = j),
$$

which have nice interpretations in terms of the responsibilities $r_{ij} = q(z_i = j)$.
Carl Edward Rasmussen The EM algorithm November 15th, 2017 8 / 11
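
Putting the E step from the previous slide together with these M-step updates gives a complete one-dimensional EM loop; this is a sketch with hypothetical synthetic data and initialisation, and the printed log likelihood should be non-decreasing, as the earlier inequality chain guarantees.

```python
import numpy as np

# Minimal sketch combining the E step (previous slide) and the M-step updates
# above for a 1-D mixture of Gaussians. The synthetic data and initialisation
# are hypothetical; the printed log p(y|theta) should never decrease.

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(2.0, 0.8, 200)])
n, k = len(y), 2

mu = rng.normal(0.0, 1.0, k)      # means mu_j
var = np.ones(k)                  # variances sigma_j^2
pi = np.full(k, 1.0 / k)          # mixing proportions pi_j

for t in range(20):
    # E step: u_ij = pi_j N(y_i; mu_j, sigma_j^2), responsibilities r_ij = u_ij / u_i
    u = pi * np.exp(-(y[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = u / u.sum(axis=1, keepdims=True)

    # M step: the responsibility-weighted updates derived on this slide
    nk = r.sum(axis=0)
    mu = (r * y[:, None]).sum(axis=0) / nk
    var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / n

    print(t, np.log(u.sum(axis=1)).sum())   # log p(y | theta_{t-1})
```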

Clustering with MoG
Carl Edward Rasmussen The EM algorithm November 15th, 2017 9 / 11

Clustering with MoG
Carl Edward Rasmussen The EM algorithm November 15th, 2017 10 / 11

Appendix: some properties of KL divergence
The (asymmetric) Kullback-Leibler divergence (or relative entropy) $\mathrm{KL}(q(x)\,\|\,p(x))$ is non-negative. To minimize, add a Lagrange multiplier enforcing proper normalization and take variational derivatives:

$$
\frac{\delta}{\delta q(x)}
\Big[ \int q(x) \log \frac{q(x)}{p(x)}\, dx + \lambda \Big( 1 - \int q(x)\, dx \Big) \Big]
= \log \frac{q(x)}{p(x)} + 1 - \lambda.
$$

Find the stationary point by setting the derivative to zero:

$$
q(x) = \exp(\lambda - 1)\, p(x),
\quad \text{normalization condition} \Rightarrow \lambda = 1,
\quad \text{so } q(x) = p(x),
$$

which corresponds to a minimum, since the second derivative is positive:

$$
\frac{\delta^2}{\delta q(x)\, \delta q(x)}\, \mathrm{KL}(q(x)\,\|\,p(x)) = \frac{1}{q(x)} > 0.
$$

The minimum value attained at $q(x) = p(x)$ is $\mathrm{KL}(p(x)\,\|\,p(x)) = 0$, showing that $\mathrm{KL}(q(x)\,\|\,p(x))$
• is non-negative
• attains its minimum 0 when $p(x)$ and $q(x)$ are equal (almost everywhere).
Carl Edward Rasmussen The EM algorithm November 15th, 2017 11 / 11
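
As a quick numeric illustration of these properties, here is a sketch with hypothetical discrete distributions, using sums in place of integrals:

```python
import numpy as np

# Minimal sketch (hypothetical discrete distributions): illustrate that
# KL(q || p) >= 0, that it is 0 when q = p, and that it is asymmetric.

def kl(q, p):
    """Kullback-Leibler divergence KL(q || p) for discrete distributions."""
    return np.sum(q * np.log(q / p))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

print(kl(q, p))   # strictly positive, since q differs from p
print(kl(p, p))   # 0.0 when the distributions are equal
print(kl(p, q))   # differs from kl(q, p): the divergence is asymmetric
```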