Statistics (1): estimation Chapter 3: likelihood function and likelihood estimation


About This Presentation

This is the third chapter of my third-year undergraduate course at Paris-Dauphine.


Slide Content

Chapter 3:
Likelihood function and inference

Likelihood function and inference
The likelihood
Information and curvature
Sufficiency and ancillarity
Maximum likelihood estimation
Non-regular models
EM algorithm

The likelihood
Given a (usually parametric) family of distributions
$F \in \{F_\theta,\ \theta \in \Theta\}$
with densities $f_\theta$ [wrt a fixed measure $\nu$], the density of the iid sample $x_1, \dots, x_n$ is
$\prod_{i=1}^n f_\theta(x_i)$
Note: in the special case where $\nu$ is a counting measure,
$\prod_{i=1}^n f_\theta(x_i)$
is the probability of observing the sample $x_1, \dots, x_n$ among all possible realisations of $X_1, \dots, X_n$

The likelihood
Definition (likelihood function)
The likelihood function associated with the sample $x_1, \dots, x_n$ is the function
$L : \Theta \to \mathbb{R}_+$
$\theta \mapsto \prod_{i=1}^n f_\theta(x_i)$
[same expression as the joint density, now seen as a function of $\theta$ for the observed sample]

Example: density function versus likelihood function
Take the case of a Poisson density [against the counting measure]
$f(x; \lambda) = \dfrac{\lambda^x}{x!}\, e^{-\lambda}\, \mathbb{I}_{\mathbb{N}}(x)$
which varies in $\mathbb{N}$ as a function of $x$
versus
$L(\lambda; x) = \dfrac{\lambda^x}{x!}\, e^{-\lambda}$
which varies in $\mathbb{R}_+$ as a function of $\lambda$
[figures: density plotted for $\lambda = 3$, likelihood plotted for $x = 3$]
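As a quick illustration (mine, not part of the original slides), the same Poisson expression can be evaluated in R either as a density over x or as a likelihood over lambda:

## same formula, two readings: density in x (lambda fixed) vs likelihood in lambda (x fixed)
x <- 0:10
dpois(x, lambda = 3)               # density values over the support, for lambda = 3
lambda <- seq(0.1, 10, by = 0.1)
lik <- dpois(3, lambda)            # likelihood of lambda for the single observation x = 3
lambda[which.max(lik)]             # maximised near lambda = 3 = x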

Example: density function versus likelihood function
Take the case of a Normal $N(0, \sigma)$ density
$f(x; \sigma) = \dfrac{1}{\sqrt{2\pi\sigma}}\, e^{-x^2/2\sigma}\, \mathbb{I}_{\mathbb{R}}(x)$
which varies in $\mathbb{R}$ as a function of $x$
versus
$L(\sigma; x) = \dfrac{1}{\sqrt{2\pi\sigma}}\, e^{-x^2/2\sigma}$
which varies in $\mathbb{R}_+$ as a function of $\sigma$
[figures: density plotted for $\sigma = 2$, likelihood plotted for $x = 2$]

Example: density function versus likelihood function
Take the case of a Normal $N(0, 1/\theta)$ density
$f(x; \theta) = \dfrac{\sqrt{\theta}}{\sqrt{2\pi}}\, e^{-\theta x^2/2}\, \mathbb{I}_{\mathbb{R}}(x)$
which varies in $\mathbb{R}$ as a function of $x$
versus
$L(\theta; x) = \dfrac{\sqrt{\theta}}{\sqrt{2\pi}}\, e^{-\theta x^2/2}$
which varies in $\mathbb{R}_+$ as a function of $\theta$
[figures: density plotted for $\theta = 1/2$, likelihood plotted for $x = 1/2$]

Example: Hardy-Weinberg equilibrium
Population genetics:
Genotypes of biallelic genes AA, Aa, and aa
sample frequencies $n_{AA}$, $n_{Aa}$ and $n_{aa}$
multinomial model $\mathcal{M}(n; p_{AA}, p_{Aa}, p_{aa})$
related to the population proportion of A alleles, $p_A$:
$p_{AA} = p_A^2, \quad p_{Aa} = 2 p_A (1 - p_A), \quad p_{aa} = (1 - p_A)^2$
likelihood
$L(p_A \mid n_{AA}, n_{Aa}, n_{aa}) \propto p_A^{2 n_{AA}}\, [2 p_A (1 - p_A)]^{n_{Aa}}\, (1 - p_A)^{2 n_{aa}}$
[Boos & Stefanski, 2013]
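As a sketch (not in the original slides), this likelihood can be maximised numerically in R; the genotype counts below are hypothetical and only serve the illustration:

## hypothetical genotype counts, for illustration only
nAA <- 50; nAa <- 30; naa <- 20
loglik <- function(pA)
  2 * nAA * log(pA) + nAa * log(2 * pA * (1 - pA)) + 2 * naa * log(1 - pA)
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
(2 * nAA + nAa) / (2 * (nAA + nAa + naa))   # closed-form MLE of p_A, for comparison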

Mixed distributions and their likelihood
Special case when a random variable X may take specific values $a_1, \dots, a_k$ and a continuum of values in A
Example: rainfall measured at a given station, equal to zero with positive probability $p_0$ [it did not rain!], taking values between 0 and 100 [capacity of measurement container], and equal to 100 with positive probability $p_{100}$ [container full]

Mixed distributions and their likelihood
Special case when a random variable X may take specific values $a_1, \dots, a_k$ and a continuum of values in A
Example: $y \sim N(X^{\mathrm T}\beta, \sigma^2)$ but only $y^* = y\, \mathbb{I}\{y > 0\}$ is observed

Mixed distributions and their likelihood
Special case when a random variable X may take specific values $a_1, \dots, a_k$ and a continuum of values in A
Density of X against the composition of two measures, counting and Lebesgue:
$f_X(a) = \begin{cases} \mathbb{P}_\theta(X = a) & \text{if } a \in \{a_1, \dots, a_k\} \\ f(a \mid \theta) & \text{otherwise} \end{cases}$
Results in the likelihood
$L(\theta \mid x_1, \dots, x_n) = \prod_{j=1}^k \mathbb{P}_\theta(X = a_j)^{n_j} \times \prod_{x_i \notin \{a_1, \dots, a_k\}} f(x_i \mid \theta)$
where $n_j$ = # observations equal to $a_j$

Enters Fisher, Ronald Fisher!
Fisher's intuition in the 1920s:
the likelihood function contains the relevant information about the parameter $\theta$
the higher the likelihood, the more likely the parameter
the curvature of the likelihood determines the precision of the estimation

Concentration of likelihood mode around "true" parameter
Likelihood functions for $x_1, \dots, x_n \sim \mathcal{P}(3)$ as n increases [n = 40, ..., 240]
Likelihood functions for $x_1, \dots, x_n \sim N(0, 1)$ as n increases
Likelihood functions for $x_1, \dots, x_n \sim N(0, 1)$ as the sample varies
[figures only]
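A small R sketch (mine, not from the slides) of this concentration effect in the Poisson case:

## log-likelihood of lambda for growing Poisson(3) samples:
## the curve peaks more and more sharply around the true value 3
set.seed(1)
x <- rpois(240, lambda = 3)
lambda <- seq(1, 6, by = 0.01)
for (n in c(40, 120, 240)) {
  ll <- sapply(lambda, function(l) sum(dpois(x[1:n], l, log = TRUE)))
  cat("n =", n, " mode at lambda =", lambda[which.max(ll)], "\n")
}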

Why concentration takes place
Consider
$x_1, \dots, x_n \overset{\text{iid}}{\sim} F$
Then
$\log \prod_{i=1}^n f(x_i \mid \theta) = \sum_{i=1}^n \log f(x_i \mid \theta)$
and by the LLN
$\dfrac{1}{n} \sum_{i=1}^n \log f(x_i \mid \theta) \longrightarrow \int_{\mathcal{X}} \log f(x \mid \theta)\, \mathrm{d}F(x)$
Lemma
Maximising the likelihood is asymptotically equivalent to minimising the Kullback-Leibler divergence
$\int_{\mathcal{X}} \log \big[ f(x) \big/ f(x \mid \theta) \big]\, \mathrm{d}F(x)$
[i.e., to picking the member of the family closest to the true distribution]

Score function
Score function defined by
$\nabla \log L(\theta \mid x) = \Big( \partial/\partial\theta_1\, L(\theta \mid x), \dots, \partial/\partial\theta_p\, L(\theta \mid x) \Big) \big/ L(\theta \mid x)$
Gradient (slope) of the likelihood function at point $\theta$
Lemma
When $X \sim F_\theta$,
$\mathbb{E}_\theta[\nabla \log L(\theta \mid X)] = 0$
Reason:
$\int_{\mathcal{X}} \nabla \log L(\theta \mid x)\, \mathrm{d}F_\theta(x) = \int_{\mathcal{X}} \nabla L(\theta \mid x)\, \mathrm{d}x = \nabla \int_{\mathcal{X}} \mathrm{d}F_\theta(x) = 0$
Connected with the concentration theorem: the gradient is null on average at the true value of the parameter
Warning: this fails when the support depends on $\theta$
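A quick Monte Carlo check of the lemma (my own sketch) for the Poisson model, where the score is $x/\lambda - 1$:

## E_theta[ d/d lambda log f(X | lambda) ] should be 0 at the true lambda
set.seed(2)
lambda <- 3
x <- rpois(1e5, lambda)
mean(x / lambda - 1)   # close to 0 at the true parameter value
mean(x / 2.5 - 1)      # not 0 when evaluated at a wrong value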

Fisher's information matrix
Another notion attributed to Fisher
Information:
$I(\theta) = \mathbb{E}_\theta\big[ \nabla \log f(X \mid \theta)\, \{\nabla \log f(X \mid \theta)\}^{\mathrm T} \big]$
Often called the Fisher information
Measures the curvature of the likelihood surface, which translates as information
Sometimes denoted $I_X$ to stress the dependence on the distribution of X

Fisher's information matrix
Expressible via the second derivative of the log-likelihood as well
Lemma
If $L(\theta \mid x)$ is twice differentiable [as a function of $\theta$]
$I(\theta) = -\mathbb{E}_\theta\big[ \nabla^{\mathrm T} \nabla \log f(X \mid \theta) \big]$
Hence
$I_{ij}(\theta) = -\mathbb{E}_\theta\Big[ \dfrac{\partial^2}{\partial\theta_i \partial\theta_j} \log f(X \mid \theta) \Big]$

Illustrations
Binomial $\mathcal{B}(n, p)$ distribution
$f(x \mid p) = \binom{n}{x} p^x (1 - p)^{n - x}$
$\partial/\partial p\, \log f(x \mid p) = x/p - (n - x)/(1 - p)$
$\partial^2/\partial p^2\, \log f(x \mid p) = -x/p^2 - (n - x)/(1 - p)^2$
Hence
$I(p) = np/p^2 + (n - np)/(1 - p)^2 = \dfrac{n}{p(1 - p)}$
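A Monte Carlo sanity check (my own) that the variance of the score matches $n / p(1-p)$:

## Fisher information for Binomial(n, p) as the variance of the score
set.seed(3)
n <- 20; p <- 0.3
x <- rbinom(1e5, n, p)
score <- x / p - (n - x) / (1 - p)
var(score)           # Monte Carlo estimate
n / (p * (1 - p))    # theoretical value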

Illustrations
Multinomial $\mathcal{M}(n; p_1, \dots, p_k)$ distribution
$f(x \mid p) = \binom{n}{x_1 \cdots x_k} p_1^{x_1} \cdots p_k^{x_k}$
$\partial/\partial p_i\, \log f(x \mid p) = x_i/p_i - x_k/p_k$
$\partial^2/\partial p_i \partial p_j\, \log f(x \mid p) = -x_k/p_k^2$
$\partial^2/\partial p_i^2\, \log f(x \mid p) = -x_i/p_i^2 - x_k/p_k^2$
Hence
$I(p) = n \begin{pmatrix} 1/p_1 + 1/p_k & \cdots & 1/p_k \\ \vdots & \ddots & \vdots \\ 1/p_k & \cdots & 1/p_{k-1} + 1/p_k \end{pmatrix}$
and
$I(p)^{-1} = \dfrac{1}{n} \begin{pmatrix} p_1(1 - p_1) & -p_1 p_2 & \cdots & -p_1 p_{k-1} \\ -p_1 p_2 & p_2(1 - p_2) & \cdots & -p_2 p_{k-1} \\ \vdots & & \ddots & \vdots \\ -p_1 p_{k-1} & -p_2 p_{k-1} & \cdots & p_{k-1}(1 - p_{k-1}) \end{pmatrix}$
[with $(p_1, \dots, p_{k-1})$ as free parameters and $p_k = 1 - p_1 - \dots - p_{k-1}$]

Illustrations
Normal $N(\mu, \sigma^2)$ distribution
$f(x \mid \theta) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\big\{ -(x - \mu)^2/2\sigma^2 \big\}$
$\partial/\partial\mu\, \log f(x \mid \theta) = (x - \mu)/\sigma^2$
$\partial/\partial\sigma\, \log f(x \mid \theta) = -1/\sigma + (x - \mu)^2/\sigma^3$
$\partial^2/\partial\mu^2\, \log f(x \mid \theta) = -1/\sigma^2$
$\partial^2/\partial\mu\partial\sigma\, \log f(x \mid \theta) = -2(x - \mu)/\sigma^3$
$\partial^2/\partial\sigma^2\, \log f(x \mid \theta) = 1/\sigma^2 - 3(x - \mu)^2/\sigma^4$
Hence
$I(\theta) = \dfrac{1}{\sigma^2} \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$
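A Monte Carlo check (my own sketch) that the average outer product of the score vector recovers this matrix:

## Fisher information for N(mu, sigma^2), with theta = (mu, sigma)
set.seed(4)
mu <- 1; sigma <- 2
x <- rnorm(1e5, mu, sigma)
s1 <- (x - mu) / sigma^2                  # score wrt mu
s2 <- -1 / sigma + (x - mu)^2 / sigma^3   # score wrt sigma
crossprod(cbind(s1, s2)) / length(x)      # Monte Carlo estimate of I(theta)
matrix(c(1, 0, 0, 2), 2, 2) / sigma^2     # theoretical value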

Properties
Additive features translating as accumulation of information:
if X and Y are independent, $I_X(\theta) + I_Y(\theta) = I_{(X, Y)}(\theta)$
$I_{X_1, \dots, X_n}(\theta) = n\, I_{X_1}(\theta)$
if $X = T(Y)$ and $Y = S(X)$, $I_X(\theta) = I_Y(\theta)$
if $X = T(Y)$, $I_X(\theta) \leq I_Y(\theta)$
If $\eta = \Psi(\theta)$ is a bijective transform, change of parameterisation:
$I(\theta) = \Big( \dfrac{\partial\eta}{\partial\theta} \Big)^{\mathrm T} I(\eta) \Big( \dfrac{\partial\eta}{\partial\theta} \Big)$
"In information geometry, this is seen as a change of coordinates on a Riemannian manifold, and the intrinsic properties of curvature are unchanged under different parametrization. In general, the Fisher information matrix provides a Riemannian metric (more precisely, the Fisher-Rao metric)." [Wikipedia]

Approximations
Back to the Kullback-Leibler divergence
$D(\theta_0, \theta) = \int_{\mathcal{X}} f(x \mid \theta_0) \log \big[ f(x \mid \theta_0) \big/ f(x \mid \theta) \big]\, \mathrm{d}x$
Using a second-order Taylor expansion
$\log f(x \mid \theta) = \log f(x \mid \theta_0) + (\theta - \theta_0)^{\mathrm T} \nabla \log f(x \mid \theta_0) + \frac{1}{2} (\theta - \theta_0)^{\mathrm T} \nabla\nabla^{\mathrm T} \log f(x \mid \theta_0)\, (\theta - \theta_0) + o(\|\theta - \theta_0\|^2)$
approximation of the divergence:
$D(\theta_0, \theta) \approx \frac{1}{2} (\theta - \theta_0)^{\mathrm T} I(\theta_0) (\theta - \theta_0)$
[Exercise: show this is exact in the normal case]

Approximations
Central limit law of the score vector
$\dfrac{1}{\sqrt{n}}\, \nabla \log L(\theta \mid X_1, \dots, X_n) \approx N(0, I_{X_1}(\theta))$
Notation: $I^1(\theta)$ stands for $I_{X_1}(\theta)$ and indicates the information associated with a single observation

Sufficiency
What if a transform of the sample
$S(X_1, \dots, X_n)$
contains all the information about $\theta$, that is,
$I_{(X_1, \dots, X_n)}(\theta) = I_{S(X_1, \dots, X_n)}(\theta)$
uniformly in $\theta$?
In this case $S(\cdot)$ is called a sufficient statistic [it is sufficient to know the value of $S(x_1, \dots, x_n)$ to get complete information]
A statistic is an arbitrary transform of the data $X_1, \dots, X_n$

Sufficiency (bis)
Alternative definition:
If $(X_1, \dots, X_n) \sim f(x_1, \dots, x_n \mid \theta)$ and if $T = S(X_1, \dots, X_n)$ is such that the distribution of $(X_1, \dots, X_n)$ conditional on T does not depend on $\theta$, then $S(\cdot)$ is a sufficient statistic
Factorisation theorem
$S(\cdot)$ is a sufficient statistic if and only if
$f(x_1, \dots, x_n \mid \theta) = g(S(x_1, \dots, x_n) \mid \theta)\, h(x_1, \dots, x_n)$
another notion due to Fisher

Illustrations
Uniform $U(0, \theta)$ distribution
$L(\theta \mid x_1, \dots, x_n) = \theta^{-n} \prod_{i=1}^n \mathbb{I}_{(0, \theta)}(x_i) = \theta^{-n}\, \mathbb{I}_{\theta > \max_i x_i}$
Hence
$S(X_1, \dots, X_n) = \max_i X_i = X_{(n)}$
is sufficient

Illustrations
Bernoulli $\mathcal{B}(p)$ distribution
$L(p \mid x_1, \dots, x_n) = \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = \{p/(1 - p)\}^{\sum_i x_i} (1 - p)^n$
Hence
$S(X_1, \dots, X_n) = \bar{X}_n$
is sufficient

Illustrations
Normal $N(\mu, \sigma^2)$ distribution
$L(\mu, \sigma \mid x_1, \dots, x_n) = \prod_{i=1}^n \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\big\{ -(x_i - \mu)^2/2\sigma^2 \big\}$
$= \dfrac{1}{\{2\pi\sigma^2\}^{n/2}} \exp\Big\{ -\dfrac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \bar{x}_n + \bar{x}_n - \mu)^2 \Big\}$
$= \dfrac{1}{\{2\pi\sigma^2\}^{n/2}} \exp\Big\{ -\dfrac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \bar{x}_n)^2 - \dfrac{1}{2\sigma^2} \sum_{i=1}^n (\bar{x}_n - \mu)^2 \Big\}$
Hence
$S(X_1, \dots, X_n) = \Big( \bar{X}_n,\ \sum_{i=1}^n (X_i - \bar{X}_n)^2 \Big)$
is sufficient

Sufficiency and exponential families
Both previous examples belong to exponential families
$f(x \mid \theta) = h(x) \exp\big\{ T(\theta)^{\mathrm T} S(x) - \tau(\theta) \big\}$
Generic property of exponential families:
$f(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n h(x_i) \exp\Big\{ T(\theta)^{\mathrm T} \sum_{i=1}^n S(x_i) - n\tau(\theta) \Big\}$
Lemma
For an exponential family with summary statistic $S(\cdot)$, the statistic
$S(X_1, \dots, X_n) = \sum_{i=1}^n S(X_i)$
is sufficient

Sufficiency as a rare feature
Nice property reducing the data to a low-dimensional transform but...
How frequent is it within the collection of probability distributions?
Very rare, as essentially restricted to exponential families
[Pitman-Koopman-Darmois theorem]
with the exception of parameter-dependent families like $U(0, \theta)$

Pitman-Koopman-Darmois characterisation
If $X_1, \dots, X_n$ are iid random variables from a density $f(\cdot \mid \theta)$ whose support does not depend on $\theta$ and verifying the property that there exists an integer $n_0$ such that, for $n \geq n_0$, there is a sufficient statistic $S(X_1, \dots, X_n)$ of fixed [in n] dimension, then $f(\cdot \mid \theta)$ belongs to an exponential family
[Factorisation theorem]
Note: Koopman and Pitman published the result in 1936, while Fisher, omitted from the theorem's name, proved it for one-dimensional sufficient statistics in 1934

Minimal sufficiency
Multiplicity of sufficient statistics, e.g., $S'(x) = (S(x), U(x))$ remains sufficient when $S(\cdot)$ is sufficient
Search for a most concentrated summary:
Minimal sufficiency
A sufficient statistic $S(\cdot)$ is minimal sufficient if it is a function of any other sufficient statistic
Lemma
For a minimal exponential family representation
$f(x \mid \theta) = h(x) \exp\big\{ T(\theta)^{\mathrm T} S(x) - \tau(\theta) \big\}$
$S(X_1) + \dots + S(X_n)$ is minimal sufficient

Ancillarity
Opposite of sufficiency:
Ancillarity
When $X_1, \dots, X_n$ are iid random variables from a density $f(\cdot \mid \theta)$, a statistic $A(\cdot)$ is ancillary if $A(X_1, \dots, X_n)$ has a distribution that does not depend on $\theta$
Useless?! Not quite: conditioning on $A(X_1, \dots, X_n)$ leads to more precision and efficiency:
Use of $F(x_1, \dots, x_n \mid A(x_1, \dots, x_n))$ instead of $F(x_1, \dots, x_n)$
Notion of conditional inference

Illustrations
1. If $X_1, \dots, X_n \overset{\text{iid}}{\sim} U(0, \theta)$, $A(X_1, \dots, X_n) = (X_1, \dots, X_n)/X_{(n)}$ is ancillary
2. If $X_1, \dots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$,
$A(X_1, \dots, X_n) = \dfrac{(X_1 - \bar{X}_n, \dots, X_n - \bar{X}_n)}{\big\{ \sum_{i=1}^n (X_i - \bar{X}_n)^2 \big\}^{1/2}}$
is ancillary
3. If $X_1, \dots, X_n \overset{\text{iid}}{\sim} f(x \mid \theta)$, rank$(X_1, \dots, X_n)$ is ancillary
> x=rnorm(10)
> rank(x)
[1] 7 4 1 5 2 6 8 9 10 3
[see, e.g., rank tests]

Point estimation, estimators and estimates
When given a parametric family $f(\cdot \mid \theta)$ and a sample supposedly drawn from this family
$(X_1, \dots, X_n) \overset{\text{iid}}{\sim} f(x \mid \theta)$
an estimator of $\theta$ is a statistic $T(X_1, \dots, X_n)$ or $\hat{\theta}_n$ providing a [reasonable] substitute for $\theta$.
an estimate is the value of the estimator for a given sample, $T(x_1, \dots, x_n)$
Example: for a Normal $N(\mu, \sigma^2)$ sample $X_1, \dots, X_n$,
$T(X_1, \dots, X_n) = \hat{\mu}_n = \frac{1}{n} \sum_{i} X_i = \bar{X}_n$
is an estimator of $\mu$ and $\hat{\mu}_n = 2.014$ is an estimate

Maximum likelihood principle
Given the concentration property of the likelihood function, a reasonable choice of estimator is its mode:
MLE
A maximum likelihood estimator (MLE) $\hat{\theta}_n$ satisfies
$L(\hat{\theta}_n \mid X_1, \dots, X_n) \geq L(\theta \mid X_1, \dots, X_n)$ for all $\theta \in \Theta$
Under regularity of $L(\theta \mid X_1, \dots, X_n)$, the MLE is also a solution of the likelihood equations
$\nabla \log L(\hat{\theta}_n \mid X_1, \dots, X_n) = 0$
Warning: $\hat{\theta}_n$ is not the most likely value of $\theta$ but the value that makes the observation $(x_1, \dots, x_n)$ most likely...

Maximum likelihood invariance
Principle independent of parameterisation:
If $\xi = h(\theta)$ is a one-to-one transform of $\theta$, then
$\hat{\xi}^{\text{MLE}}_n = h(\hat{\theta}^{\text{MLE}}_n)$
[estimator of transform = transform of estimator]
By extension, if $\xi = h(\theta)$ is any transform of $\theta$, then
$\hat{\xi}^{\text{MLE}}_n = h(\hat{\theta}^{\text{MLE}}_n)$

Unicity of maximum likelihood estimate
Depending on the regularity of $L(\theta \mid x_1, \dots, x_n)$, there may be
1. an a.s. unique MLE $\hat{\theta}^{\text{MLE}}_n$
Case of $x_1, \dots, x_n \sim N(\mu, 1)$
2. several or an infinity of MLE's [or of solutions to the likelihood equations]
Case of $x_1, \dots, x_n \sim N(\mu_1 + \mu_2, 1)$ [and mixtures of normals]
3. no MLE at all
Case of $x_1, \dots, x_n \sim N(\mu_i, \sigma^{-2})$

Unicity of maximum likelihood estimate
Consequence of standard differential calculus results on $\ell(\theta) = L(\theta \mid x_1, \dots, x_n)$:
Lemma
If $\Theta$ is connected and open, and if $\ell(\theta)$ is twice-differentiable with
$\lim_{\theta \to \partial\Theta} \ell(\theta) < +\infty$
and if $H(\theta) = \nabla\nabla^{\mathrm T} \ell(\theta)$ is negative definite at all solutions of the likelihood equations, then $\ell(\theta)$ has a unique global maximum
Limited appeal because it excludes local maxima

Unicity of MLE for exponential families
Lemma
If $f(\cdot \mid \theta)$ is a minimal exponential family
$f(x \mid \theta) = h(x) \exp\big\{ T(\theta)^{\mathrm T} S(x) - \tau(\theta) \big\}$
with $T(\cdot)$ one-to-one and twice differentiable over $\Theta$, if $\Theta$ is open, and if there is at least one solution to the likelihood equations, then it is the unique MLE
The likelihood equation is equivalent to
$S(x) = \mathbb{E}_\theta[S(X)]$

Illustrations
Uniform $U(0, \theta)$ likelihood
$L(\theta \mid x_1, \dots, x_n) = \theta^{-n}\, \mathbb{I}_{\theta > \max_i x_i}$
not differentiable at $X_{(n)}$ but
$\hat{\theta}^{\text{MLE}}_n = X_{(n)}$

Illustrations
Bernoulli $\mathcal{B}(p)$ likelihood
$L(p \mid x_1, \dots, x_n) = \{p/(1 - p)\}^{\sum_i x_i} (1 - p)^n$
differentiable over (0, 1) and
$\hat{p}^{\text{MLE}}_n = \bar{X}_n$

Illustrations
Normal $N(\mu, \sigma^2)$ likelihood
$L(\mu, \sigma \mid x_1, \dots, x_n) \propto \sigma^{-n} \exp\Big\{ -\dfrac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \bar{x}_n)^2 - \dfrac{1}{2\sigma^2} \sum_{i=1}^n (\bar{x}_n - \mu)^2 \Big\}$
differentiable with
$(\hat{\mu}^{\text{MLE}}_n, \hat{\sigma}^{2\,\text{MLE}}_n) = \Big( \bar{X}_n,\ \dfrac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2 \Big)$
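A quick R cross-check (mine, not from the slides) of these closed forms against a generic numerical optimiser, parameterising the scale on the log scale to keep it positive:

## closed-form normal MLEs versus numerical maximisation of the log-likelihood
set.seed(5)
x <- rnorm(100, mean = 1, sd = 2)
negll <- function(par) -sum(dnorm(x, par[1], exp(par[2]), log = TRUE))
fit <- optim(c(0, 0), negll)               # numerical MLE of (mu, log sigma)
c(fit$par[1], exp(fit$par[2]))             # numerical (mu, sigma)
c(mean(x), sqrt(mean((x - mean(x))^2)))    # closed-form MLEs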

The fundamental theorem of Statistics
Fundamental theorem
Under appropriate conditions, if $(X_1, \dots, X_n) \overset{\text{iid}}{\sim} f(x \mid \theta)$ and if $\hat{\theta}_n$ is a solution of $\nabla \log f(X_1, \dots, X_n \mid \theta) = 0$, then
$\sqrt{n}\, \{\hat{\theta}_n - \theta\} \overset{\mathcal{L}}{\longrightarrow} N_p(0, I(\theta)^{-1})$
Equivalent of the CLT for estimation purposes
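A simulation sketch of my own for the exponential rate parameter, where the per-observation information is $1/\theta^2$, so $\sqrt{n}(\hat{\theta}_n - \theta)$ should be approximately $N(0, \theta^2)$:

## the MLE of the Exp(theta) rate is 1/mean(x); its rescaled error has variance ~ theta^2
set.seed(6)
theta <- 2; n <- 500
mle <- replicate(5000, { x <- rexp(n, rate = theta); 1 / mean(x) })
z <- sqrt(n) * (mle - theta)
c(mean(z), var(z), theta^2)   # variance close to 1 / I(theta) = theta^2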

Assumptions
$\theta$ is identifiable
support of $f(\cdot \mid \theta)$ constant in $\theta$
$\ell(\theta)$ thrice differentiable
[the killer] there exists $g(x)$ integrable against $f(\cdot \mid \theta)$ in a neighbourhood of the true parameter such that
$\Big| \dfrac{\partial^3}{\partial\theta_i \partial\theta_j \partial\theta_k} f(x \mid \theta) \Big| \leq g(x)$
the identity
$I(\theta) = \mathbb{E}_\theta\big[ \nabla \log f(X \mid \theta) \{\nabla \log f(X \mid \theta)\}^{\mathrm T} \big] = -\mathbb{E}_\theta\big[ \nabla^{\mathrm T} \nabla \log f(X \mid \theta) \big]$
stands
$\hat{\theta}_n$ converges in probability to $\theta$ [similarly superfluous]
[Boos & Stefanski, 2014, p.286; Lehmann & Casella, 1998]

Inefficient MLEs
Example of the MLE of $\eta = \|\theta\|^2$ when $x \sim N_p(\theta, I_p)$:
$\hat{\eta}^{\text{MLE}} = \|x\|^2$
Then $\mathbb{E}[\|x\|^2] = \eta + p$ diverges away from $\eta$ with p
Note: the MLE of $\eta$ based on
$Z = \|X\|^2 \sim \chi^2_p(\eta)$
behaves better
[Robert, 2001]

Inconsistent MLEs
Take $X_1, \dots, X_n \overset{\text{iid}}{\sim} f_\theta(x)$ with
$f_\theta(x) = (1 - \theta)\, \dfrac{1}{\delta(\theta)}\, f_0\big( (x - \theta)/\delta(\theta) \big) + \theta\, f_1(x)$
for $\theta \in [0, 1]$,
$f_1(x) = \tfrac{1}{2}\, \mathbb{I}_{[-1, 1]}(x), \quad f_0(x) = (1 - |x|)\, \mathbb{I}_{[-1, 1]}(x)$
and
$\delta(\theta) = (1 - \theta) \exp\{ -(1 - \theta)^{-4} + 1 \}$
Then for any $\theta$,
$\hat{\theta}^{\text{MLE}}_n \overset{\text{a.s.}}{\longrightarrow} 1$
[Ferguson, 1982; John Wellner's slides, ca. 2005]

Inconsistent MLEs
Consider $X_{ij}$, $i = 1, \dots, n$, $j = 1, 2$, with $X_{ij} \sim N(\mu_i, \sigma^2)$. Then
$\hat{\mu}^{\text{MLE}}_i = (X_{i1} + X_{i2})/2, \quad \hat{\sigma}^{2\,\text{MLE}} = \dfrac{1}{4n} \sum_{i=1}^n (X_{i1} - X_{i2})^2$
Therefore
$\hat{\sigma}^{2\,\text{MLE}} \overset{\text{a.s.}}{\longrightarrow} \sigma^2/2$
[Neyman & Scott, 1948]
Note: working instead with $X_{i1} - X_{i2} \sim N(0, 2\sigma^2)$ produces a consistent MLE
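A quick simulation of the Neyman-Scott bias (my own sketch; the per-pair means are arbitrary):

## MLE of sigma^2 converges to sigma^2/2, not sigma^2
set.seed(8)
n <- 1e5; sigma <- 1
mu <- rnorm(n)                                     # one mean per pair
x1 <- rnorm(n, mu, sigma); x2 <- rnorm(n, mu, sigma)
sum((x1 - x2)^2) / (4 * n)   # MLE of sigma^2, close to sigma^2/2 = 0.5
sum((x1 - x2)^2) / (2 * n)   # consistent estimate based on the differences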

Likelihood optimisation
Practical optimisation of the likelihood function
$\theta^\star = \arg\max_\theta L(\theta \mid x) = \prod_{i=1}^n g(x_i \mid \theta),$
assuming $X = (X_1, \dots, X_n) \overset{\text{iid}}{\sim} g(x \mid \theta)$
analytical resolution feasible for exponential families:
$\nabla T(\theta)^{\mathrm T} \sum_{i=1}^n S(x_i) = n\, \nabla\tau(\theta)$
use of standard numerical techniques like Newton-Raphson:
$\theta^{(t+1)} = \theta^{(t)} + I_{\text{obs}}(X, \theta^{(t)})^{-1}\, \nabla\ell(\theta^{(t)})$
with $\ell(\cdot)$ the log-likelihood and $I_{\text{obs}}$ the observed information matrix
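A minimal Newton-Raphson sketch (mine, not from the slides) for the shape alpha of a Gamma(alpha, 1) sample, where the score is sum(log x) - n psi(alpha) and the observed information reduces to n psi'(alpha):

## Newton-Raphson iterations for the MLE of a Gamma shape parameter
set.seed(7)
x <- rgamma(200, shape = 3, rate = 1)
alpha <- mean(x)                                      # starting value
for (t in 1:50) {
  score <- sum(log(x)) - length(x) * digamma(alpha)   # gradient of the log-likelihood
  iobs  <- length(x) * trigamma(alpha)                # observed information
  alpha.new <- alpha + score / iobs                   # Newton-Raphson update
  if (abs(alpha.new - alpha) < 1e-8) break
  alpha <- alpha.new
}
alpha   # MLE of the shape parameter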

EM algorithm
Cases where g is too complex for the above to work
Special case when g is a marginal
$g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta)\, \mathrm{d}z$
Z is called the latent or missing variable

Illustrations
censored data:
$X = \min(X^\star, a), \quad X^\star \sim N(\theta, 1)$
mixture model:
$X \sim .3\, N_1(\mu_0, 1) + .7\, N_1(\mu_1, 1)$
disequilibrium model:
$X = \min(X^\star, Y^\star), \quad X^\star \sim f_1(x \mid \theta), \quad Y^\star \sim f_2(x \mid \theta)$

Completion
EM algorithm based on completing the data x with z, such that
$(X, Z) \sim f(x, z \mid \theta)$
Z missing data vector, (X, Z) complete data vector
Conditional density of Z given x:
$k(z \mid \theta, x) = \dfrac{f(x, z \mid \theta)}{g(x \mid \theta)}$

Likelihood decomposition
Likelihood associated with the complete data (x, z)
$L^c(\theta \mid x, z) = f(x, z \mid \theta)$
and likelihood for the observed data
$L(\theta \mid x)$
such that
$\log L(\theta \mid x) = \mathbb{E}[\log L^c(\theta \mid x, Z) \mid \theta_0, x] - \mathbb{E}[\log k(Z \mid \theta, x) \mid \theta_0, x]$ (1)
for any $\theta_0$, with integration operated against the conditional distribution of Z given the observables (and parameters), $k(z \mid \theta_0, x)$

[A tale of] two θ's
There are "two θ's"!: in (1), $\theta_0$ is a fixed (and arbitrary) value driving the integration, while $\theta$ is both free (and variable)
Maximising the observed likelihood
$L(\theta \mid x)$
is equivalent to maximising the r.h.s. term in (1)
$\mathbb{E}[\log L^c(\theta \mid x, Z) \mid \theta_0, x] - \mathbb{E}[\log k(Z \mid \theta, x) \mid \theta_0, x]$

Intuition for EM
Instead of maximising the r.h.s. term in (1) wrt $\theta$, maximise only
$\mathbb{E}[\log L^c(\theta \mid x, Z) \mid \theta_0, x]$
Maximisation of the complete log-likelihood is impossible since z is unknown, hence it is replaced by maximisation of the expected complete log-likelihood, with expectation depending on the term $\theta_0$

Expectation-Maximisation
Expectation of the complete log-likelihood denoted
$Q(\theta \mid \theta_0, x) = \mathbb{E}[\log L^c(\theta \mid x, Z) \mid \theta_0, x]$
to stress the dependence on $\theta_0$ and on the sample x
Principle
EM derives a sequence of estimators $\hat{\theta}_{(j)}$, $j = 1, 2, \dots$, through iteration of Expectation and Maximisation steps:
$Q(\hat{\theta}_{(j)} \mid \hat{\theta}_{(j-1)}, x) = \max_\theta Q(\theta \mid \hat{\theta}_{(j-1)}, x).$

EM Algorithm
Iterate (in m)
1. (step E) compute
$Q(\theta \mid \hat{\theta}_{(m)}, x) = \mathbb{E}[\log L^c(\theta \mid x, Z) \mid \hat{\theta}_{(m)}, x],$
2. (step M) maximise $Q(\theta \mid \hat{\theta}_{(m)}, x)$ in $\theta$ and set
$\hat{\theta}_{(m+1)} = \arg\max_\theta Q(\theta \mid \hat{\theta}_{(m)}, x).$
until a fixed point [of Q] is found
[Dempster, Laird, & Rubin, 1977]

Justification
The observed likelihood
$L(\theta \mid x)$
increases at every EM step
$L(\hat{\theta}_{(m+1)} \mid x) \geq L(\hat{\theta}_{(m)} \mid x)$
[Exercise: use Jensen and (1)]

Censored data
Normal $N(\theta, 1)$ sample right-censored at a
$L(\theta \mid x) = \dfrac{1}{(2\pi)^{m/2}} \exp\Big\{ -\dfrac{1}{2} \sum_{i=1}^m (x_i - \theta)^2 \Big\}\, [1 - \Phi(a - \theta)]^{n - m}$
Associated complete log-likelihood:
$\log L^c(\theta \mid x, z) \propto -\dfrac{1}{2} \sum_{i=1}^m (x_i - \theta)^2 - \dfrac{1}{2} \sum_{i=m+1}^n (z_i - \theta)^2,$
where the $z_i$'s are the censored observations, with density
$k(z \mid \theta, x) = \dfrac{\exp\{ -\frac{1}{2}(z - \theta)^2 \}}{\sqrt{2\pi}\,[1 - \Phi(a - \theta)]} = \dfrac{\varphi(z - \theta)}{1 - \Phi(a - \theta)}, \quad a < z.$

Censored data (2)
At the j-th EM iteration
$Q(\theta \mid \hat{\theta}_{(j)}, x) \propto -\dfrac{1}{2} \sum_{i=1}^m (x_i - \theta)^2 - \dfrac{1}{2}\, \mathbb{E}\Big[ \sum_{i=m+1}^n (Z_i - \theta)^2 \,\Big|\, \hat{\theta}_{(j)}, x \Big]$
$\propto -\dfrac{1}{2} \sum_{i=1}^m (x_i - \theta)^2 - \dfrac{1}{2} \sum_{i=m+1}^n \int_a^\infty (z_i - \theta)^2\, k(z_i \mid \hat{\theta}_{(j)}, x)\, \mathrm{d}z_i$

Censored data (3)
Differentiating in $\theta$,
$n\, \hat{\theta}_{(j+1)} = m\, \bar{x} + (n - m)\, \mathbb{E}[Z \mid \hat{\theta}_{(j)}],$
with
$\mathbb{E}[Z \mid \hat{\theta}_{(j)}] = \int_a^\infty z\, k(z \mid \hat{\theta}_{(j)}, x)\, \mathrm{d}z = \hat{\theta}_{(j)} + \dfrac{\varphi(a - \hat{\theta}_{(j)})}{1 - \Phi(a - \hat{\theta}_{(j)})}.$
Hence, the EM sequence is provided by
$\hat{\theta}_{(j+1)} = \dfrac{m}{n}\, \bar{x} + \dfrac{n - m}{n} \Big[ \hat{\theta}_{(j)} + \dfrac{\varphi(a - \hat{\theta}_{(j)})}{1 - \Phi(a - \hat{\theta}_{(j)})} \Big],$
which converges to the likelihood maximum $\hat{\theta}$
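A minimal R sketch of this EM recursion for a N(theta, 1) sample right-censored at a (my own illustration; the true value, the censoring point and the starting value are arbitrary choices):

## EM for a right-censored normal mean
set.seed(42)
theta.true <- 1; a <- 1.5; n <- 100
x <- pmin(rnorm(n, theta.true, 1), a)    # observed, censored sample
m <- sum(x < a)                          # number of uncensored observations
xbar <- mean(x[x < a])                   # mean of the uncensored part
theta <- xbar                            # starting value
for (j in 1:200) {
  ez <- theta + dnorm(a - theta) / (1 - pnorm(a - theta))   # E[Z | theta, Z > a]
  theta.new <- (m * xbar + (n - m) * ez) / n                # EM update above
  if (abs(theta.new - theta) < 1e-10) break
  theta <- theta.new
}
theta   # EM estimate of theta (the MLE)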

Mixtures
Mixture of two normal distributions with unknown means
$.3\, N_1(\mu_0, 1) + .7\, N_1(\mu_1, 1),$
sample $X_1, \dots, X_n$ and parameter $\theta = (\mu_0, \mu_1)$
Missing data: $Z_i \in \{0, 1\}$, indicator of the component associated with $X_i$,
$X_i \mid z_i \sim N(\mu_{z_i}, 1), \quad Z_i \sim \mathcal{B}(.7)$
Complete likelihood
$\log L^c(\theta \mid x, z) \propto -\dfrac{1}{2} \sum_{i=1}^n z_i (x_i - \mu_1)^2 - \dfrac{1}{2} \sum_{i=1}^n (1 - z_i)(x_i - \mu_0)^2$
$= -\dfrac{1}{2}\, n_1 (\hat{\mu}_1 - \mu_1)^2 - \dfrac{1}{2}\, (n - n_1)(\hat{\mu}_0 - \mu_0)^2$
with
$n_1 = \sum_{i=1}^n z_i, \quad n_1 \hat{\mu}_1 = \sum_{i=1}^n z_i x_i, \quad (n - n_1)\, \hat{\mu}_0 = \sum_{i=1}^n (1 - z_i) x_i$

Mixtures (2)
At the j-th EM iteration
$Q(\theta \mid \hat{\theta}_{(j)}, x) = -\dfrac{1}{2}\, \mathbb{E}\Big[ n_1 (\hat{\mu}_1 - \mu_1)^2 + (n - n_1)(\hat{\mu}_0 - \mu_0)^2 \,\Big|\, \hat{\theta}_{(j)}, x \Big]$
Differentiating in $\theta$
$\hat{\theta}_{(j+1)} = \begin{pmatrix} \mathbb{E}\big[ n_1 \hat{\mu}_1 \mid \hat{\theta}_{(j)}, x \big] \big/ \mathbb{E}\big[ n_1 \mid \hat{\theta}_{(j)}, x \big] \\ \mathbb{E}\big[ (n - n_1)\, \hat{\mu}_0 \mid \hat{\theta}_{(j)}, x \big] \big/ \mathbb{E}\big[ (n - n_1) \mid \hat{\theta}_{(j)}, x \big] \end{pmatrix}$

Mixtures (3)
Hence $\hat{\theta}_{(j+1)}$ given by
$\begin{pmatrix} \sum_{i=1}^n \mathbb{E}\big[ Z_i \mid \hat{\theta}_{(j)}, x_i \big]\, x_i \Big/ \sum_{i=1}^n \mathbb{E}\big[ Z_i \mid \hat{\theta}_{(j)}, x_i \big] \\ \sum_{i=1}^n \mathbb{E}\big[ (1 - Z_i) \mid \hat{\theta}_{(j)}, x_i \big]\, x_i \Big/ \sum_{i=1}^n \mathbb{E}\big[ (1 - Z_i) \mid \hat{\theta}_{(j)}, x_i \big] \end{pmatrix}$
Conclusion
Step (E) in EM replaces the missing data $Z_i$ with their conditional expectation, given x (an expectation that depends on $\hat{\theta}_{(m)}$).
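A minimal R sketch of these EM updates for the mixture .3 N(mu0, 1) + .7 N(mu1, 1) with unknown means (my own illustration; the sample size, true means and deliberately poor starting values are mine):

## EM for a two-component normal mixture with known weights and unknown means
set.seed(9)
n <- 500
z <- rbinom(n, 1, .7)
x <- rnorm(n, mean = ifelse(z == 1, 1, 0), sd = 1)    # true mu0 = 0, mu1 = 1
mu0 <- -1; mu1 <- 2                                    # poor starting values
for (j in 1:500) {
  w <- .7 * dnorm(x, mu1) / (.3 * dnorm(x, mu0) + .7 * dnorm(x, mu1))  # E[Z_i | x_i]
  mu1.new <- sum(w * x) / sum(w)                 # M-step for mu1
  mu0.new <- sum((1 - w) * x) / sum(1 - w)       # M-step for mu0
  if (max(abs(c(mu0.new - mu0, mu1.new - mu1))) < 1e-8) break
  mu0 <- mu0.new; mu1 <- mu1.new
}
c(mu0, mu1)   # EM estimates of the two means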

Mixtures (3)
[figure: EM iterations for several starting values, in the $(\mu_1, \mu_2)$ plane]

Properties
The EM algorithm is such that
it converges to a local maximum or a saddle-point
it depends on the initial condition $\theta^{(0)}$
it requires several initial values when the likelihood is multimodal