Statistics (1): Estimation. Chapter 3: Likelihood function and likelihood estimation
About This Presentation
This is the third chapter of my third-year undergraduate course at Paris-Dauphine.
Slide Content
Chapter 3: Likelihood function and inference
Likelihood function and inference
The likelihood
Information and curvature
Sufficiency and ancillarity
Maximum likelihood estimation
Non-regular models
EM algorithm
The likelihood
Given a (usually parametric) family of distributions
$$F \in \{F_\theta,\ \theta\in\Theta\}$$
with densities $f_\theta$ [wrt a fixed dominating measure $\nu$], the density of the iid sample $x_1,\ldots,x_n$ is
$$\prod_{i=1}^n f_\theta(x_i)$$
Note: in the special case where $\nu$ is a counting measure,
$$\prod_{i=1}^n f_\theta(x_i)$$
is the probability of observing the sample $x_1,\ldots,x_n$ among all possible realisations of $X_1,\ldots,X_n$
The likelihood
Definition (likelihood function)
The likelihood function associated with the sample $x_1,\ldots,x_n$ is the function
$$L : \Theta \to \mathbb{R}_+,\qquad \theta \mapsto \prod_{i=1}^n f_\theta(x_i)$$
[same formula as the joint density, but considered as a function of $\theta$]
Example: density function versus likelihood function
Take the case of a Poisson density [against the counting measure]
$$f(x;\lambda) = \frac{\lambda^x}{x!}\,e^{-\lambda}\,\mathbb{I}_{\mathbb{N}}(x)$$
which varies in $\mathbb{N}$ as a function of $x$, versus
$$L(\lambda;x) = \frac{\lambda^x}{x!}\,e^{-\lambda}$$
which varies in $\mathbb{R}_+$ as a function of $\lambda$
[figures: density for $\lambda = 3$ and likelihood for $x = 3$]
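A quick numerical illustration of this distinction (a minimal sketch in Python using scipy; the value x = 3 and the grid of lambda values are illustrative choices, not from the slides): the same expression lambda^x e^{-lambda}/x! is evaluated once as a function of x for a fixed lambda, and once as a function of lambda for a fixed observation x.

import numpy as np
from scipy.stats import poisson

lam = 3.0                                 # fixed parameter: density as a function of x
x_grid = np.arange(0, 11)
density = poisson.pmf(x_grid, lam)        # sums to (almost) 1 over N
print("density over x:", density.round(3), "sum =", density.sum().round(3))

x_obs = 3                                 # fixed observation: likelihood as a function of lambda
lam_grid = np.linspace(0.1, 10, 200)
likelihood = poisson.pmf(x_obs, lam_grid)  # same formula, the argument is now lambda
print("lambda maximising L:", lam_grid[np.argmax(likelihood)])   # close to x_obs = 3
print("integral over lambda:", np.trapz(likelihood, lam_grid))   # not 1: not a density in lambda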
Example: density function versus likelihood function
Take the case of a Normal $\mathcal{N}(0,\theta)$ density
$$f(x;\theta) = \frac{1}{\sqrt{2\pi\theta}}\,e^{-x^2/2\theta}\,\mathbb{I}_{\mathbb{R}}(x)$$
which varies in $\mathbb{R}$ as a function of $x$, versus
$$L(\theta;x) = \frac{1}{\sqrt{2\pi\theta}}\,e^{-x^2/2\theta}$$
which varies in $\mathbb{R}_+$ as a function of $\theta$
[figures: density for $\theta = 2$ and likelihood for $x = 2$]
Example: density function versus likelihood function
Take the case of a Normal $\mathcal{N}(0,1/\theta)$ density
$$f(x;\theta) = \frac{\sqrt{\theta}}{\sqrt{2\pi}}\,e^{-\theta x^2/2}\,\mathbb{I}_{\mathbb{R}}(x)$$
which varies in $\mathbb{R}$ as a function of $x$, versus
$$L(\theta;x) = \frac{\sqrt{\theta}}{\sqrt{2\pi}}\,e^{-\theta x^2/2}$$
which varies in $\mathbb{R}_+$ as a function of $\theta$
[figures: density for $\theta = 1/2$ and likelihood for $x = 1/2$]
Example: Hardy-Weinberg equilibrium
Population genetics:
genotypes of biallelic genes $AA$, $Aa$, and $aa$
sample frequencies $n_{AA}$, $n_{Aa}$ and $n_{aa}$
multinomial model $\mathcal{M}(n; p_{AA}, p_{Aa}, p_{aa})$
related to the population proportion of $A$ alleles, $p_A$:
$$p_{AA} = p_A^2,\qquad p_{Aa} = 2p_A(1-p_A),\qquad p_{aa} = (1-p_A)^2$$
likelihood
$$L(p_A\mid n_{AA}, n_{Aa}, n_{aa}) \propto p_A^{2n_{AA}}\,[2p_A(1-p_A)]^{n_{Aa}}\,(1-p_A)^{2n_{aa}}$$
[Boos & Stefanski, 2013]
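As a sketch (with hypothetical genotype counts, not taken from the slides), this likelihood can be maximised numerically and compared with the closed-form solution $\hat p_A = (2n_{AA}+n_{Aa})/(2n)$, which follows from setting the derivative of the log-likelihood to zero.

import numpy as np
from scipy.optimize import minimize_scalar

n_AA, n_Aa, n_aa = 50, 30, 20           # hypothetical genotype counts
n = n_AA + n_Aa + n_aa

def neg_log_lik(p):
    # log of p^(2 n_AA) [2 p (1-p)]^(n_Aa) (1-p)^(2 n_aa), constants dropped
    return -((2 * n_AA + n_Aa) * np.log(p) + (n_Aa + 2 * n_aa) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE :", res.x)
print("closed form   :", (2 * n_AA + n_Aa) / (2 * n))   # the two should agree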
mixed distributions and their likelihood
Special case when a random variable $X$ may take specific values $a_1,\ldots,a_k$ and a continuum of values $A$
Example: daily rainfall at a given spot is zero with positive probability $p_0$ [it did not rain!], takes any value between $0$ and $100$ [capacity of measurement container], or equals $100$ with positive probability $p_{100}$ [container full]
Example: $y^\star \sim \mathcal{N}(X^{\mathsf{T}}\beta,\sigma^2)$ but only $y = y^\star\,\mathbb{I}\{y^\star > 0\}$ is observed
Density of $X$ against the composition of two measures, counting and Lebesgue:
$$f_X(a) = \begin{cases} P_\theta(X = a) & \text{if } a \in \{a_1,\ldots,a_k\}\\ f(a\mid\theta) & \text{otherwise}\end{cases}$$
Results in the likelihood
$$L(\theta\mid x_1,\ldots,x_n) = \prod_{j=1}^k P_\theta(X = a_j)^{n_j}\ \times \prod_{x_i\notin\{a_1,\ldots,a_k\}} f(x_i\mid\theta)$$
where $n_j$ = # observations equal to $a_j$
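For the censored-regression type example above, the log-likelihood mixes a point mass at $0$ with a Lebesgue density on $(0,\infty)$. A minimal sketch (simulated data, arbitrary true values, a single scalar covariate, all my choices) of how such a mixed likelihood is evaluated:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, beta_true, sigma = 200, 1.0, 1.0
x = rng.normal(size=n)
y_star = beta_true * x + sigma * rng.normal(size=n)
y = np.where(y_star > 0, y_star, 0.0)          # only y = y* 1{y* > 0} is observed

def log_lik(beta, sigma=1.0):
    zero = (y == 0)
    # point mass at 0: P(y* <= 0) = Phi(-x beta / sigma); continuous part: normal density
    ll_zero = norm.logcdf(-x[zero] * beta / sigma).sum()
    ll_pos = norm.logpdf(y[~zero], loc=x[~zero] * beta, scale=sigma).sum()
    return ll_zero + ll_pos

grid = np.linspace(0.0, 2.0, 201)
print("beta maximising the mixed likelihood:",
      grid[np.argmax([log_lik(b) for b in grid])])   # should land near beta_true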
Enters Fisher, Ronald Fisher!
Fisher's intuition in the 1920s:
the likelihood function contains the relevant information about the parameter
the higher the likelihood, the more likely the parameter
the curvature of the likelihood determines the precision of the estimation
Concentration of likelihood mode around "true" parameter
[figures: likelihood functions for $x_1,\ldots,x_n\sim\mathcal{P}(3)$ as $n$ increases (up to $n = 240$), and for $x_1,\ldots,x_n\sim\mathcal{N}(0,1)$ as $n$ increases and as the sample varies]
why concentration takes place
Consider
$$x_1,\ldots,x_n \overset{\text{iid}}{\sim} F$$
Then
$$\log\prod_{i=1}^n f(x_i\mid\theta) = \sum_{i=1}^n \log f(x_i\mid\theta)$$
and by the LLN
$$\frac{1}{n}\sum_{i=1}^n \log f(x_i\mid\theta) \longrightarrow \int_{\mathcal{X}} \log f(x\mid\theta)\,\mathrm{d}F(x)$$
Lemma
Maximising the likelihood is asymptotically equivalent to minimising the Kullback-Leibler divergence
$$\int_{\mathcal{X}} \log\left\{f(x)/f(x\mid\theta)\right\}\,\mathrm{d}F(x)$$
Member of the family closest to the true distribution
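A Monte Carlo check of this argument (a sketch; the Poisson model and the true value lambda0 = 3 are my choices for convenience): the averaged log-likelihood $\frac1n\sum_i\log f(x_i\mid\lambda)$ stabilises, as $n$ grows, around $\int\log f(x\mid\lambda)\,\mathrm{d}F(x)$, whose maximiser in $\lambda$ is the true value, i.e. the member of the family closest in Kullback-Leibler divergence.

import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
lam0 = 3.0
lam_grid = np.linspace(1.0, 6.0, 101)

for n in (50, 500, 5000):
    x = rng.poisson(lam0, size=n)
    avg_ll = np.array([poisson.logpmf(x, lam).mean() for lam in lam_grid])
    print(f"n={n:5d}  argmax of averaged log-likelihood: {lam_grid[np.argmax(avg_ll)]:.2f}")
# as n grows, the argmax concentrates around lam0 = 3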
Score function
Score function defined by
$$\nabla\log L(\theta\mid x) = \frac{\left(\partial/\partial\theta_1\,L(\theta\mid x),\ \ldots,\ \partial/\partial\theta_p\,L(\theta\mid x)\right)}{L(\theta\mid x)}$$
Gradient (slope) of the likelihood function at point $\theta$
lemma
When $X\sim F_\theta$,
$$\mathbb{E}_\theta[\nabla\log L(\theta\mid X)] = 0$$
Reason:
$$\int_{\mathcal{X}} \nabla\log L(\theta\mid x)\,\mathrm{d}F_\theta(x) = \int_{\mathcal{X}} \nabla L(\theta\mid x)\,\mathrm{d}x = \nabla\int_{\mathcal{X}} \mathrm{d}F_\theta(x) = 0$$
Connected with the concentration theorem: the gradient is null on average for the true value of the parameter
Warning: the lemma does not hold when the support of $f(\cdot\mid\theta)$ depends on $\theta$
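The zero-mean property of the score can be checked by simulation (a sketch for the Poisson model, where the score is $x/\lambda - 1$; the sample size is arbitrary):

import numpy as np

rng = np.random.default_rng(2)
lam = 3.0
x = rng.poisson(lam, size=10**6)
score = x / lam - 1.0               # d/d lambda of log f(x | lambda) at the true lambda
print("Monte Carlo mean of the score:", score.mean())   # close to 0
# at a wrong parameter value the average score is no longer null:
lam_wrong = 4.0
print("mean score at lambda = 4    :", (x / lam_wrong - 1.0).mean())  # about 3/4 - 1 = -0.25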
Fisher's information matrix
Another notion attributed to Fisher
Information:
$$I(\theta) = \mathbb{E}_\theta\left[\nabla\log f(X\mid\theta)\,\{\nabla\log f(X\mid\theta)\}^{\mathsf{T}}\right]$$
Often called the Fisher information
Measures the curvature of the likelihood surface, which translates as information
Sometimes denoted $I_X$ to stress the dependence on the distribution of $X$
Fisher's information matrix
Second derivative of the log-likelihood as well
lemma
If $L(\theta\mid x)$ is twice differentiable [as a function of $\theta$]
$$I(\theta) = -\mathbb{E}_\theta\left[\nabla\nabla^{\mathsf{T}}\log f(X\mid\theta)\right]$$
Illustration: multinomial model $\mathcal{M}(n; p_1,\ldots,p_k)$, with $p_k = 1-p_1-\cdots-p_{k-1}$ and
$$f(x\mid p) \propto p_1^{x_1}\cdots p_k^{x_k}$$
$$\partial/\partial p_i\,\log f(x\mid p) = x_i/p_i - x_k/p_k$$
$$\partial^2/\partial p_i\partial p_j\,\log f(x\mid p) = -x_k/p_k^2$$
$$\partial^2/\partial p_i^2\,\log f(x\mid p) = -x_i/p_i^2 - x_k/p_k^2$$
Hence
$$I(p) = n\begin{pmatrix} 1/p_1 + 1/p_k & 1/p_k & \cdots & 1/p_k\\ 1/p_k & 1/p_2 + 1/p_k & \cdots & 1/p_k\\ \vdots & & \ddots & \vdots\\ 1/p_k & 1/p_k & \cdots & 1/p_{k-1} + 1/p_k \end{pmatrix}$$
and
$$I(p)^{-1} = \frac{1}{n}\begin{pmatrix} p_1(1-p_1) & -p_1p_2 & \cdots & -p_1p_{k-1}\\ -p_1p_2 & p_2(1-p_2) & \cdots & -p_2p_{k-1}\\ \vdots & & \ddots & \vdots\\ -p_1p_{k-1} & -p_2p_{k-1} & \cdots & p_{k-1}(1-p_{k-1}) \end{pmatrix}$$
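The matrix $I(p)$ above can be checked by Monte Carlo (a sketch for $k = 3$, treating the counts of $n$ trials as one observation; the chosen $p$ is illustrative): averaging the outer product of the score in $(p_1,p_2)$, with $p_3 = 1-p_1-p_2$, should reproduce the displayed matrix.

import numpy as np

rng = np.random.default_rng(3)
n, p = 50, np.array([0.2, 0.3, 0.5])            # k = 3, p_k = p[2]
counts = rng.multinomial(n, p, size=200_000)    # each row = one multinomial observation

# score components d/dp_i log f(x|p) = x_i/p_i - x_k/p_k, i = 1, ..., k-1
score = counts[:, :2] / p[:2] - counts[:, 2:] / p[2]
I_mc = score.T @ score / len(score)             # Monte Carlo E[score score^T]

I_formula = n * (np.diag(1 / p[:2]) + 1 / p[2])  # n [diag(1/p_i) + (1/p_k) 11^T]
print(np.round(I_mc, 1))
print(I_formula)
# the inverse matches (1/n)[p_i(1-p_i) on the diagonal, -p_i p_j off the diagonal]
print(np.round(np.linalg.inv(I_formula), 4))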
Illustrations
Normal $\mathcal{N}(\mu,\sigma^2)$ distribution
$$f(x\mid\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\{-(x-\mu)^2/2\sigma^2\}$$
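The Fisher information of this Normal model can also be evaluated by simulation. A minimal sketch under the parametrisation $\theta = (\mu,\sigma)$ (an assumption on the parametrisation; for it, the closed form is $\mathrm{diag}(1/\sigma^2,\,2/\sigma^2)$): the Monte Carlo average of the score outer product should approach that matrix.

import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=10**6)

# score of a single observation under theta = (mu, sigma)
s_mu = (x - mu) / sigma**2
s_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3
score = np.column_stack([s_mu, s_sigma])

I_mc = score.T @ score / len(x)
print(np.round(I_mc, 3))                       # Monte Carlo estimate
print(np.diag([1 / sigma**2, 2 / sigma**2]))   # closed form diag(1/sigma^2, 2/sigma^2)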
Properties
Additive features translating as accumulation of information:
if $X$ and $Y$ are independent, $I_X(\theta) + I_Y(\theta) = I_{(X,Y)}(\theta)$
$I_{X_1,\ldots,X_n}(\theta) = n\,I_{X_1}(\theta)$ [iid sample]
if $X = T(Y)$ and $Y = S(X)$, $I_X(\theta) = I_Y(\theta)$
if $X = T(Y)$, $I_X(\theta) \leq I_Y(\theta)$
If $\eta = \Psi(\theta)$ is a bijective transform, change of parameterisation:
$$I(\theta) = \left(\frac{\partial\eta}{\partial\theta}\right)^{\!\mathsf{T}} I(\eta)\,\frac{\partial\eta}{\partial\theta}$$
"In information geometry, this is seen as a change of coordinates on a Riemannian manifold, and the intrinsic properties of curvature are unchanged under different parametrization. In general, the Fisher information matrix provides a Riemannian metric (more precisely, the Fisher-Rao metric)." [Wikipedia]
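A sketch illustrating two of these properties on the exponential model $\mathcal{E}(\lambda)$, where $I(\lambda) = 1/\lambda^2$ (my choice of example): the information in an iid sample is $n$ times the unit information, and under the reparametrisation $\eta = \log\lambda$ the change-of-parameterisation formula gives $I(\eta) = (\mathrm{d}\lambda/\mathrm{d}\eta)^2\, I(\lambda) = 1$ for every $\eta$.

import numpy as np

rng = np.random.default_rng(5)
lam = 2.5
x = rng.exponential(1 / lam, size=(10**5, 20))     # 10^5 samples of size n = 20

# unit information: variance of the score 1/lam - X for a single observation
I1_mc = np.var(1 / lam - x[:, 0])
print("I_X1(lam) ~", I1_mc, " (formula 1/lam^2 =", 1 / lam**2, ")")

# additivity: the score of the whole sample is the sum of the unit scores
sample_score = (1 / lam - x).sum(axis=1)
print("I_{X1..Xn}(lam) ~", np.var(sample_score), " (formula n/lam^2 =", 20 / lam**2, ")")

# reparametrisation eta = log lam: the score becomes 1 - lam * X, information is 1
print("I(eta) ~", np.var(1 - lam * x[:, 0]), " (formula: 1)")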
Approximations
Back to the Kullback-Leibler divergence
$$D(\theta_0,\theta) = \int_{\mathcal{X}} f(x\mid\theta_0)\,\log\left\{f(x\mid\theta_0)/f(x\mid\theta)\right\}\,\mathrm{d}x$$
Using a second-order Taylor expansion
$$\log f(x\mid\theta) = \log f(x\mid\theta_0) + (\theta-\theta_0)^{\mathsf{T}}\,\nabla\log f(x\mid\theta_0) + \frac{1}{2}(\theta-\theta_0)^{\mathsf{T}}\,\nabla\nabla^{\mathsf{T}}\log f(x\mid\theta_0)\,(\theta-\theta_0) + o(\|\theta-\theta_0\|^2)$$
approximation of the divergence:
$$D(\theta_0,\theta) \approx \frac{1}{2}(\theta-\theta_0)^{\mathsf{T}}\, I(\theta_0)\,(\theta-\theta_0)$$
[Exercise: show this is exact in the normal case]
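A quick check of this approximation in a case where both sides are available in closed form (the Poisson family, my choice): $D(\lambda_0,\lambda) = \lambda_0\log(\lambda_0/\lambda) + \lambda - \lambda_0$ and $I(\lambda_0) = 1/\lambda_0$, so the right-hand side is $(\lambda-\lambda_0)^2/(2\lambda_0)$.

import numpy as np

lam0 = 3.0
for lam in (3.1, 3.5, 4.0, 5.0):
    kl = lam0 * np.log(lam0 / lam) + lam - lam0   # exact KL between Poisson(lam0) and Poisson(lam)
    quad = 0.5 * (lam - lam0) ** 2 / lam0         # (1/2)(lam - lam0)^2 I(lam0)
    print(f"lam={lam:.1f}   KL={kl:.4f}   quadratic approx={quad:.4f}")
# the two agree closely near lam0 and drift apart as lam moves away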
Approximations
Central limit law of the score vector
$$\frac{1}{\sqrt{n}}\,\nabla\log L(\theta\mid X_1,\ldots,X_n) \approx \mathcal{N}(0, I_{X_1}(\theta))$$
Notation: $I_1(\theta)$ stands for $I_{X_1}(\theta)$ and indicates the information associated with a single observation
Sufficiency
What if a transform of the sample
$$S(X_1,\ldots,X_n)$$
contains all the information about $\theta$, i.e.
$$I_{(X_1,\ldots,X_n)}(\theta) = I_{S(X_1,\ldots,X_n)}(\theta)$$
uniformly in $\theta$?
In this case $S(\cdot)$ is called a sufficient statistic [it is sufficient to know the value of $S(x_1,\ldots,x_n)$ to get complete information]
A sufficient statistic carries the whole information brought by the sample $X_1,\ldots,X_n$
Sufficiency (bis)
Alternative definition:
If $(X_1,\ldots,X_n)\sim f(x_1,\ldots,x_n\mid\theta)$ and if $T = S(X_1,\ldots,X_n)$ is such that the distribution of $(X_1,\ldots,X_n)$ conditional on $T$ does not depend on $\theta$, then $S(\cdot)$ is a sufficient statistic
Factorisation theorem
$S(\cdot)$ is a sufficient statistic if and only if
$$f(x_1,\ldots,x_n\mid\theta) = g(S(x_1,\ldots,x_n)\mid\theta)\,h(x_1,\ldots,x_n)$$
another notion due to Fisher
Illustrations
Uniform $\mathcal{U}(0,\theta)$ distribution
$$L(\theta\mid x_1,\ldots,x_n) = \theta^{-n}\prod_{i=1}^n \mathbb{I}_{(0,\theta)}(x_i) = \theta^{-n}\,\mathbb{I}_{\theta>\max_i x_i}$$
Hence
$$S(X_1,\ldots,X_n) = \max_i X_i = X_{(n)}$$
is sufficient
Illustrations
Bernoulli $\mathcal{B}(p)$ distribution
$$L(p\mid x_1,\ldots,x_n) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = \{p/(1-p)\}^{\sum_i x_i}\,(1-p)^n$$
Hence
$$S(X_1,\ldots,X_n) = \overline{X}_n$$
is sufficient
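The alternative (conditional) definition of sufficiency can be checked by simulation for this Bernoulli example (a sketch; the sample size and conditioning value are arbitrary): conditional on $\sum_i X_i = t$, the probability that $X_1 = 1$ is $t/n$ whatever the value of $p$, so the conditional law of the sample given the statistic carries no information about $p$.

import numpy as np

rng = np.random.default_rng(6)
n, t = 10, 4                                  # condition on sum(X) = t
for p in (0.2, 0.5, 0.8):
    x = rng.binomial(1, p, size=(500_000, n))
    keep = x.sum(axis=1) == t                 # restrict to samples with the same statistic value
    p_x1 = x[keep, 0].mean()                  # estimated P(X_1 = 1 | sum = t)
    print(f"p={p:.1f}   P(X1=1 | sum={t}) ~ {p_x1:.3f}   (t/n = {t/n})")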
Illustrations
Normal $\mathcal{N}(\mu,\sigma^2)$ distribution
$$L(\mu,\sigma\mid x_1,\ldots,x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp\{-(x_i-\mu)^2/2\sigma^2\}$$
$$= \frac{1}{\{2\pi\sigma^2\}^{n/2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\bar{x}_n+\bar{x}_n-\mu)^2\right\}$$
$$= \frac{1}{\{2\pi\sigma^2\}^{n/2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\bar{x}_n)^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (\bar{x}_n-\mu)^2\right\}$$
Hence
$$S(X_1,\ldots,X_n) = \left(\overline{X}_n,\ \sum_{i=1}^n (X_i-\overline{X}_n)^2\right)$$
is sufficient
Sufficiency and exponential families
Both previous examples belong to exponential families
$$f(x\mid\theta) = h(x)\exp\left\{T(\theta)^{\mathsf{T}} S(x) - \tau(\theta)\right\}$$
Generic property of exponential families:
$$f(x_1,\ldots,x_n\mid\theta) = \prod_{i=1}^n h(x_i)\,\exp\left\{T(\theta)^{\mathsf{T}}\sum_{i=1}^n S(x_i) - n\tau(\theta)\right\}$$
lemma
For an exponential family with summary statistic $S(\cdot)$, the statistic
$$S(X_1,\ldots,X_n) = \sum_{i=1}^n S(X_i)$$
is sufficient
Sufficiency as a rare feature
Nice property reducing the data to a low-dimensional transform but...
How frequent is it within the collection of probability distributions?
Very rare, as essentially restricted to exponential families
[Pitman-Koopman-Darmois theorem]
with the exception of parameter-dependent support families like $\mathcal{U}(0,\theta)$
Pitman-Koopman-Darmois characterisation
If $X_1,\ldots,X_n$ are iid random variables from a density $f(\cdot\mid\theta)$ whose support does not depend on $\theta$ and verifying the property that there exists an integer $n_0$ such that, for $n\geq n_0$, there is a sufficient statistic $S(X_1,\ldots,X_n)$ with fixed [in $n$] dimension, then $f(\cdot\mid\theta)$ belongs to an exponential family
[Factorisation theorem]
Note: the result was established by Koopman and Pitman in 1936, with Darmois (1935) often omitted from the theorem... Fisher proved it for one-D sufficient statistics in 1934
Minimal sufficiency
Multiplicity of sufficient statistics, e.g., $S'(x) = (S(x), U(x))$ remains sufficient when $S(\cdot)$ is sufficient
Search of a most concentrated summary:
Minimal sufficiency
A sufficient statistic $S(\cdot)$ is minimal sufficient if it is a function of any other sufficient statistic
Lemma
For a minimal exponential family representation
$$f(x\mid\theta) = h(x)\exp\left\{T(\theta)^{\mathsf{T}} S(x) - \tau(\theta)\right\}$$
$S(X_1)+\ldots+S(X_n)$ is minimal sufficient
Ancillarity
Opposite of sufficiency:
Ancillarity
When $X_1,\ldots,X_n$ are iid random variables from a density $f(\cdot\mid\theta)$, a statistic $A(\cdot)$ is ancillary if $A(X_1,\ldots,X_n)$ has a distribution that does not depend on $\theta$
Useless?! Not quite, since conditioning upon $A(X_1,\ldots,X_n)$ leads to more precision and efficiency:
Use of $F(x_1,\ldots,x_n\mid A(x_1,\ldots,x_n))$ instead of $F(x_1,\ldots,x_n)$
Notion of conditional inference
Point estimation, estimators and estimates
When given a parametric family $f(\cdot\mid\theta)$ and a sample supposedly drawn from this family
$$(X_1,\ldots,X_N) \overset{\text{iid}}{\sim} f(x\mid\theta)$$
an estimator of $\theta$ is a statistic $T(X_1,\ldots,X_N)$ or $\hat\theta_N$ providing a [reasonable] substitute for the unknown value of $\theta$.
an estimate is the value of the estimator for a given sample, $T(x_1,\ldots,x_n)$
Example: for a Normal $\mathcal{N}(\mu,\sigma^2)$ sample $X_1,\ldots,X_N$,
$$T(X_1,\ldots,X_N) = \hat\mu_N = \frac{1}{N}\sum_{i=1}^N X_i = \overline{X}_N$$
is an estimator of $\mu$ and $\hat\mu_N = 2.014$ is an estimate
Maximum likelihood principle
Given the concentration property of the likelihood function, reasonable choice of estimator as mode:
MLE
A maximum likelihood estimator (MLE) $\hat\theta_N$ satisfies
$$L(\hat\theta_N\mid X_1,\ldots,X_N) \geq L(\theta\mid X_1,\ldots,X_N)\quad\text{for all }\theta\in\Theta$$
Under regularity of $L(\cdot\mid X_1,\ldots,X_N)$, the MLE is also a solution of the likelihood equations
$$\nabla\log L(\theta\mid X_1,\ldots,X_N) = 0$$
Warning: $\hat\theta_N$ is not the most likely value of $\theta$ but makes the observation $(x_1,\ldots,x_N)$ most likely...
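In models without a closed-form maximiser, the likelihood equations are solved numerically. A minimal sketch (a Gamma(alpha, beta) sample with illustrative true values, optimised with scipy's Nelder-Mead; this example is not taken from the slides):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(7)
alpha_true, beta_true = 2.5, 1.5               # shape and rate
x = rng.gamma(alpha_true, 1 / beta_true, size=2000)

def neg_log_lik(theta):
    a, b = theta
    if a <= 0 or b <= 0:
        return np.inf
    return -gamma.logpdf(x, a, scale=1 / b).sum()

mle = minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
print("MLE (alpha, beta):", mle.x)             # close to (2.5, 1.5) for a sample this size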
Maximum likelihood invariance
Principle independent of parameterisation:
If $\xi = h(\theta)$ is a one-to-one transform of $\theta$, then
$$\hat\xi_n^{\text{MLE}} = h(\hat\theta_n^{\text{MLE}})$$
[estimator of transform = transform of estimator]
By extension, if $\xi = h(\theta)$ is any transform of $\theta$, then
$$\hat\xi_n^{\text{MLE}} = h(\hat\theta_n^{\text{MLE}})$$
Unicity of maximum likelihood estimate
Depending on the regularity of $L(\cdot\mid x_1,\ldots,x_N)$, there may be
1. an a.s. unique MLE $\hat\theta_n^{\text{MLE}}$
[case of $x_1,\ldots,x_n\sim\mathcal{N}(\theta,1)$]
2. several or an infinity of MLE's [or of solutions to the likelihood equations]
[case of $x_1,\ldots,x_n\sim\mathcal{N}(\theta_1+\theta_2,1)$, and mixtures of normals]
3. no MLE at all
[case of $x_1,\ldots,x_n\sim\mathcal{N}(\mu_i,\sigma^{-2})$]
Unicity of maximum likelihood estimate
Consequence of standard differential calculus results on $\ell(\theta) = L(\theta\mid x_1,\ldots,x_n)$:
lemma
If $\Theta$ is connected and open, and if $\ell(\cdot)$ is twice differentiable with
$$\lim_{\theta\to\partial\Theta} \ell(\theta) < +\infty$$
and if $H(\theta) = \nabla\nabla^{\mathsf{T}}\ell(\theta)$ is negative definite at all solutions of the likelihood equations, then $\ell(\cdot)$ has a unique global maximum
Limited appeal because excluding local maxima
Unicity of MLE for exponential families
lemma
If $f(\cdot\mid\theta)$ is a minimal exponential family
$$f(x\mid\theta) = h(x)\exp\left\{T(\theta)^{\mathsf{T}} S(x) - \tau(\theta)\right\}$$
with $T(\cdot)$ one-to-one and twice differentiable over $\Theta$, if $\Theta$ is open, and if there is at least one solution to the likelihood equations, then it is the unique MLE
Likelihood equation is equivalent to $S(x) = \mathbb{E}_\theta[S(X)]$
Illustrations
Uniform $\mathcal{U}(0,\theta)$ likelihood
$$L(\theta\mid x_1,\ldots,x_n) = \theta^{-n}\,\mathbb{I}_{\theta>\max_i x_i}$$
not differentiable at $X_{(n)}$ but
$$\hat\theta_n^{\text{MLE}} = X_{(n)}$$
Illustrations
Bernoulli $\mathcal{B}(p)$ likelihood
$$L(p\mid x_1,\ldots,x_n) = \{p/(1-p)\}^{\sum_i x_i}\,(1-p)^n$$
differentiable over $(0,1)$ and
$$\hat p_n^{\text{MLE}} = \overline{X}_n$$
The fundamental theorem of Statistics
fundamental theorem
Under appropriate conditions, if $(X_1,\ldots,X_n)\overset{\text{iid}}{\sim} f(x\mid\theta)$, and if $\hat\theta_n$ is a solution of $\nabla\log f(X_1,\ldots,X_n\mid\theta) = 0$, then
$$\sqrt{n}\,\{\hat\theta_n - \theta\} \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}_p(0, I(\theta)^{-1})$$
Equivalent of the CLT for estimation purposes
Assumptions
$\theta$ identifiable
support of $f(\cdot\mid\theta)$ constant in $\theta$
$\ell(\theta)$ thrice differentiable
[the killer] there exists $g(x)$ integrable against $f(\cdot\mid\theta)$ in a neighbourhood of the true parameter such that
$$\left|\frac{\partial^3}{\partial\theta_i\,\partial\theta_j\,\partial\theta_k}\log f(x\mid\theta)\right| \leq g(x)$$
the identity
$$I(\theta) = \mathbb{E}_\theta\left[\nabla\log f(X\mid\theta)\{\nabla\log f(X\mid\theta)\}^{\mathsf{T}}\right] = -\mathbb{E}_\theta\left[\nabla\nabla^{\mathsf{T}}\log f(X\mid\theta)\right]$$
stands
$\hat\theta_n$ converges in probability to $\theta$ [similarly superfluous]
[Boos & Stefanski, 2014, p.286; Lehmann & Casella, 1998]
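A simulation sketch of the theorem for the Bernoulli $\mathcal{B}(p)$ model, where the MLE is the sample mean and $I(p) = 1/\{p(1-p)\}$ (the sample size and $p$ are illustrative): the variance of $\sqrt{n}(\hat p_n - p)$ should approach $I(p)^{-1} = p(1-p)$, and the normalised errors are approximately Gaussian.

import numpy as np

rng = np.random.default_rng(8)
p, n, reps = 0.3, 500, 100_000
x = rng.binomial(1, p, size=(reps, n))
p_hat = x.mean(axis=1)                         # MLE for each replication
z = np.sqrt(n) * (p_hat - p)

print("Var of sqrt(n)(p_hat - p):", z.var())   # ~ p(1-p) = 0.21 = I(p)^{-1}
print("I(p)^{-1} =", p * (1 - p))
print("P(|Z| <= 1.96 sd):", (np.abs(z) <= 1.96 * np.sqrt(p * (1 - p))).mean())  # ~ 0.95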
Inefficient MLEs
Example of the MLE of $\eta = \|\theta\|^2$ when $x\sim\mathcal{N}_p(\theta, I_p)$:
$$\hat\eta^{\text{MLE}} = \|x\|^2$$
Then $\mathbb{E}[\|x\|^2] = \eta + p$ diverges away from $\eta$ with $p$
Note: an alternative is the MLE of $\eta$ based on $Z = \|X\|^2 \sim \chi^2_p(\eta)$
[Robert, 2001]
Inconsistent MLEs
Take $X_1,\ldots,X_n\overset{\text{iid}}{\sim} f_\theta(x)$ with
$$f_\theta(x) = (1-\theta)\,\frac{1}{\delta(\theta)}\,f_0\!\left(\frac{x-\theta}{\delta(\theta)}\right) + \theta\, f_1(x)$$
for $\theta\in[0,1]$,
$$f_1(x) = \tfrac{1}{2}\,\mathbb{I}_{[-1,1]}(x)\qquad f_0(x) = (1-|x|)\,\mathbb{I}_{[-1,1]}(x)$$
and
$$\delta(\theta) = (1-\theta)\exp\{-(1-\theta)^{-4}+1\}$$
Then for any $\theta$
$$\hat\theta_n^{\text{MLE}} \overset{\text{a.s.}}{\longrightarrow} 1$$
[Ferguson, 1982; John Wellner's slides, ca. 2005]
Inconsistent MLEs
Consider $X_{ij}$, $i=1,\ldots,n$, $j=1,2$, with $X_{ij}\sim\mathcal{N}(\mu_i,\sigma^2)$. Then
$$\hat\mu_i^{\text{MLE}} = \frac{X_{i1}+X_{i2}}{2}\qquad \hat\sigma^2_{\text{MLE}} = \frac{1}{4n}\sum_{i=1}^n (X_{i1}-X_{i2})^2$$
Therefore
$$\hat\sigma^2_{\text{MLE}} \overset{\text{a.s.}}{\longrightarrow} \frac{\sigma^2}{2}$$
[Neyman & Scott, 1948]
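The Neyman-Scott inconsistency is easy to reproduce by simulation (a sketch; sigma = 1 and the values of n are arbitrary choices): the MLE of $\sigma^2$ stabilises around $\sigma^2/2$ instead of $\sigma^2$ as the number of pairs grows.

import numpy as np

rng = np.random.default_rng(9)
sigma = 1.0
for n in (100, 10_000, 1_000_000):
    mu = rng.normal(size=n)                    # one nuisance mean per pair
    x = mu[:, None] + sigma * rng.normal(size=(n, 2))
    sigma2_mle = ((x[:, 0] - x[:, 1]) ** 2).sum() / (4 * n)
    print(f"n={n:8d}  sigma^2_MLE = {sigma2_mle:.4f}  (true = {sigma**2}, limit = {sigma**2 / 2})")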
Likelihood optimisation
Practical optimisation of the likelihood function
$$\theta^\star = \arg\max_\theta L(\theta\mid x) = \prod_{i=1}^n g(x_i\mid\theta),$$
assuming $X = (X_1,\ldots,X_n)\overset{\text{iid}}{\sim} g(x\mid\theta)$
analytical resolution feasible for exponential families, where the likelihood equations reduce to
$$\nabla T(\theta)\sum_{i=1}^n S(x_i) = n\,\nabla\tau(\theta)$$
use of standard numerical techniques like Newton-Raphson otherwise
EM algorithm
Cases where $g$ is too complex for the above to work
Special case when $g$ is a marginal
$$g(x\mid\theta) = \int_{\mathcal{Z}} f(x,z\mid\theta)\,\mathrm{d}z$$
$Z$ called the latent or missing variable
Illustrations
censored data
$$X = \min(X^\star, a)\qquad X^\star\sim\mathcal{N}(\mu,1)$$
mixture model
$$X \sim .3\,\mathcal{N}_1(\mu_0,1) + .7\,\mathcal{N}_1(\mu_1,1)$$
disequilibrium model
$$X = \min(X^\star, Y^\star)\qquad X^\star\sim f_1(x\mid\theta)\qquad Y^\star\sim f_2(x\mid\theta)$$
Completion
EM algorithm based on completing the data $x$ with $z$, such as
$$(X,Z)\sim f(x,z\mid\theta)$$
$Z$ missing data vector, $(X,Z)$ complete data vector
Conditional density of $Z$ given $x$:
$$k(z\mid\theta,x) = \frac{f(x,z\mid\theta)}{g(x\mid\theta)}$$
Likelihood decomposition
Likelihood associated with the complete data $(x,z)$
$$L^c(\theta\mid x,z) = f(x,z\mid\theta)$$
and likelihood for the observed data
$$L(\theta\mid x)$$
such that
$$\log L(\theta\mid x) = \mathbb{E}[\log L^c(\theta\mid x,Z)\mid\theta_0,x] - \mathbb{E}[\log k(Z\mid\theta,x)\mid\theta_0,x]\qquad(1)$$
for any $\theta_0$, with integration operated against the conditional distribution of $Z$ given the observables (and parameters), $k(z\mid\theta_0,x)$
A tale of two $\theta$'s
There are "two $\theta$'s"!: in (1), $\theta_0$ is a fixed (and arbitrary) value driving the integration, while $\theta$ is free (and variable)
Maximising the observed likelihood
$$L(\theta\mid x)$$
is equivalent to maximising the r.h.s. term in (1)
$$\mathbb{E}[\log L^c(\theta\mid x,Z)\mid\theta_0,x] - \mathbb{E}[\log k(Z\mid\theta,x)\mid\theta_0,x]$$
Intuition for EM
Instead of maximising the whole r.h.s. term in (1) with respect to $\theta$, maximise only
$$\mathbb{E}[\log L^c(\theta\mid x,Z)\mid\theta_0,x]$$
Maximisation of the complete log-likelihood is impossible since $z$ is unknown, hence it is replaced by the maximisation of the expected complete log-likelihood, with expectation depending on the term $\theta_0$
Expectation-Maximisation
Expectation of the complete log-likelihood denoted
$$Q(\theta\mid\theta_0,x) = \mathbb{E}[\log L^c(\theta\mid x,Z)\mid\theta_0,x]$$
to stress the dependence on $\theta_0$ and the sample $x$
Principle
EM derives a sequence of estimators $\hat\theta_{(j)}$, $j=1,2,\ldots$, through iteration of Expectation and Maximisation steps:
$$Q(\hat\theta_{(j)}\mid\hat\theta_{(j-1)},x) = \max_\theta Q(\theta\mid\hat\theta_{(j-1)},x).$$
EM Algorithm
Iterate (in $m$)
1. (step E) compute
$$Q(\theta\mid\hat\theta_{(m)},x) = \mathbb{E}[\log L^c(\theta\mid x,Z)\mid\hat\theta_{(m)},x],$$
2. (step M) maximise $Q(\theta\mid\hat\theta_{(m)},x)$ in $\theta$ and set
$$\hat\theta_{(m+1)} = \arg\max_\theta Q(\theta\mid\hat\theta_{(m)},x).$$
until a fixed point [of $Q$] is found
[Dempster, Laird, & Rubin, 1977]
Justification
The observed likelihood
$$L(\theta\mid x)$$
increases at every EM step
$$L(\hat\theta_{(m+1)}\mid x) \geq L(\hat\theta_{(m)}\mid x)$$
[Exercise: use Jensen and (1)]
Censored data
Normal $\mathcal{N}(\mu,1)$ sample right-censored at $a$
$$L(\mu\mid x) = \frac{1}{(2\pi)^{m/2}}\exp\left\{-\frac{1}{2}\sum_{i=1}^m (x_i-\mu)^2\right\}\,[1-\Phi(a-\mu)]^{n-m}$$
Associated complete log-likelihood:
$$\log L^c(\mu\mid x,z) \propto -\frac{1}{2}\sum_{i=1}^m (x_i-\mu)^2 - \frac{1}{2}\sum_{i=m+1}^n (z_i-\mu)^2,$$
where the $z_i$'s are the censored observations, with density
$$k(z\mid\mu,x) = \frac{\exp\{-\frac{1}{2}(z-\mu)^2\}}{\sqrt{2\pi}\,[1-\Phi(a-\mu)]} = \frac{\varphi(z-\mu)}{1-\Phi(a-\mu)},\qquad a < z.$$
Censored data (2)
At the $j$-th EM iteration
$$Q(\mu\mid\hat\mu_{(j)},x) \propto -\frac{1}{2}\sum_{i=1}^m (x_i-\mu)^2 - \frac{1}{2}\,\mathbb{E}\left[\sum_{i=m+1}^n (Z_i-\mu)^2\,\Big|\,\hat\mu_{(j)},x\right]$$
$$\propto -\frac{1}{2}\sum_{i=1}^m (x_i-\mu)^2 - \frac{1}{2}\sum_{i=m+1}^n \int_a^\infty (z_i-\mu)^2\,k(z_i\mid\hat\mu_{(j)},x)\,\mathrm{d}z_i$$
Censored data (3)
Differentiating in $\mu$,
$$n\,\hat\mu_{(j+1)} = m\,\bar{x} + (n-m)\,\mathbb{E}[Z\mid\hat\mu_{(j)}],$$
with
$$\mathbb{E}[Z\mid\hat\mu_{(j)}] = \int_a^\infty z\,k(z\mid\hat\mu_{(j)},x)\,\mathrm{d}z = \hat\mu_{(j)} + \frac{\varphi(a-\hat\mu_{(j)})}{1-\Phi(a-\hat\mu_{(j)})}.$$
Hence, the EM sequence is provided by
$$\hat\mu_{(j+1)} = \frac{m}{n}\,\bar{x} + \frac{n-m}{n}\left[\hat\mu_{(j)} + \frac{\varphi(a-\hat\mu_{(j)})}{1-\Phi(a-\hat\mu_{(j)})}\right],$$
which converges to the likelihood maximum $\hat\mu$
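The recursion above is straightforward to run (a sketch with simulated right-censored data; the values mu = 1, a = 1.2 and n = 200 are arbitrary choices): each iteration replaces the censored values by their conditional expectation and updates the mean.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)
mu_true, a, n = 1.0, 1.2, 200
y = rng.normal(mu_true, 1.0, size=n)
x = y[y <= a]                                   # observed values
m = len(x)                                      # n - m observations are censored at a

mu = x.mean()                                   # starting value
for _ in range(50):
    ez = mu + norm.pdf(a - mu) / (1.0 - norm.cdf(a - mu))   # E[Z | Z > a, mu]
    mu = (m / n) * x.mean() + ((n - m) / n) * ez            # EM update from the slide
print("EM estimate of mu:", round(mu, 4), "  (true value 1.0)")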
Mixtures
Mixture of two normal distributions with unknown means
$$.3\,\mathcal{N}_1(\mu_0,1) + .7\,\mathcal{N}_1(\mu_1,1),$$
sample $X_1,\ldots,X_n$ and parameter $\theta = (\mu_0,\mu_1)$
Missing data: $Z_i\in\{0,1\}$, indicator of the component associated with $X_i$,
$$X_i\mid z_i \sim \mathcal{N}(\mu_{z_i},1)\qquad Z_i\sim\mathcal{B}(.7)$$
Complete likelihood
$$\log L^c(\theta\mid x,z) \propto -\frac{1}{2}\sum_{i=1}^n z_i(x_i-\mu_1)^2 - \frac{1}{2}\sum_{i=1}^n (1-z_i)(x_i-\mu_0)^2$$
$$= -\frac{1}{2}\,n_1(\hat\mu_1-\mu_1)^2 - \frac{1}{2}\,(n-n_1)(\hat\mu_0-\mu_0)^2$$
with
$$n_1 = \sum_{i=1}^n z_i,\qquad n_1\hat\mu_1 = \sum_{i=1}^n z_i x_i,\qquad (n-n_1)\hat\mu_0 = \sum_{i=1}^n (1-z_i)x_i$$
Mixtures (2)
At the $j$-th EM iteration
$$Q(\theta\mid\hat\theta_{(j)},x) = -\frac{1}{2}\,\mathbb{E}\left[n_1(\hat\mu_1-\mu_1)^2 + (n-n_1)(\hat\mu_0-\mu_0)^2\,\Big|\,\hat\theta_{(j)},x\right]$$
Differentiating in $\theta$
$$\hat\theta_{(j+1)} = \begin{pmatrix} \mathbb{E}\big[n_1\hat\mu_1\mid\hat\theta_{(j)},x\big]\,\big/\,\mathbb{E}\big[n_1\mid\hat\theta_{(j)},x\big]\\[2mm] \mathbb{E}\big[(n-n_1)\hat\mu_0\mid\hat\theta_{(j)},x\big]\,\big/\,\mathbb{E}\big[(n-n_1)\mid\hat\theta_{(j)},x\big]\end{pmatrix}$$
Mixtures (3)
Hence $\hat\theta_{(j+1)}$ given by
$$\begin{pmatrix} \sum_{i=1}^n \mathbb{E}\big[Z_i\mid\hat\theta_{(j)},x_i\big]\,x_i\,\big/\,\sum_{i=1}^n \mathbb{E}\big[Z_i\mid\hat\theta_{(j)},x_i\big]\\[2mm] \sum_{i=1}^n \mathbb{E}\big[(1-Z_i)\mid\hat\theta_{(j)},x_i\big]\,x_i\,\big/\,\sum_{i=1}^n \mathbb{E}\big[(1-Z_i)\mid\hat\theta_{(j)},x_i\big]\end{pmatrix}$$
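A sketch of the resulting EM iteration for this two-component mixture (simulated data with true means (0, 1) as in the slides; the starting values are arbitrary): the E-step computes $\mathbb{E}[Z_i\mid\hat\theta_{(j)},x_i]$, i.e. the posterior probability that $x_i$ comes from the .7 component, and the M-step applies the weighted averages displayed above.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
n = 1000
z = rng.binomial(1, 0.7, size=n)                 # component indicators (unobserved in practice)
x = rng.normal(np.where(z == 1, 1.0, 0.0), 1.0)  # sample from .3 N(0,1) + .7 N(1,1)

mu0, mu1 = -1.0, 2.0                             # arbitrary starting values
for _ in range(100):
    # E-step: E[Z_i | theta, x_i], posterior probability of the .7 component
    w1 = 0.7 * norm.pdf(x, mu1, 1.0)
    w0 = 0.3 * norm.pdf(x, mu0, 1.0)
    t = w1 / (w1 + w0)
    # M-step: weighted means, as in the update displayed above
    mu1 = (t * x).sum() / t.sum()
    mu0 = ((1 - t) * x).sum() / (1 - t).sum()
print("EM estimates (mu0, mu1):", round(mu0, 3), round(mu1, 3))   # near (0, 1)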
Conclusion
Step (E) in EM replaces the missing data $Z_i$ with their conditional expectation, given $x$ (an expectation that depends on $\hat\theta_{(m)}$).
Mixtures (3)
[figure: EM iterations in the $(\mu_1,\mu_2)$ plane for several starting values]
Properties
EM algorithm such that
it converges to a local maximum or a saddle-point
it depends on the initial condition $\theta_{(0)}$
it requires several initial values when the likelihood is multimodal