Statistics (1): estimation Chapter 3: likelihood function and likelihood estimation


About This Presentation

This is the third chapter of my third-year undergraduate course at Paris-Dauphine.


Slide Content

Chapter 3:
Likelihood function and inference

Likelihood function and inference
The likelihood
Information and curvature
Sufficiency and ancillarity
Maximum likelihood estimation
Non-regular models
EM algorithm

The likelihood
Given a (usually parametric) family of distributions
$F \in \{F_\theta,\ \theta \in \Theta\}$
with densities $f_\theta$ [wrt a fixed measure $\nu$], the density of the iid sample $x_1, \dots, x_n$ is
$\prod_{i=1}^n f_\theta(x_i)$
Note: in the special case where $\nu$ is a counting measure,
$\prod_{i=1}^n f_\theta(x_i)$
is the probability of observing the sample $x_1, \dots, x_n$ among all possible realisations of $X_1, \dots, X_n$

The likelihood
Definition (likelihood function)
The likelihood function associated with the sample $x_1, \dots, x_n$ is the function
$L : \Theta \to \mathbb{R}_+$
$\theta \mapsto \prod_{i=1}^n f_\theta(x_i)$
[same expression as the joint density, now seen as a function of $\theta$ for the observed sample]

Example: density function versus likelihood function
Take the case of a Poisson density [against the counting measure]
$f(x; \lambda) = \dfrac{\lambda^x}{x!}\, e^{-\lambda}\, \mathbb{I}_{\mathbb{N}}(x)$
which varies in $\mathbb{N}$ as a function of $x$
versus
$L(\lambda; x) = \dfrac{\lambda^x}{x!}\, e^{-\lambda}$
which varies in $\mathbb{R}_+$ as a function of $\lambda$
[figures: density plotted for $\lambda = 3$, likelihood plotted for $x = 3$]
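As a quick illustration (mine, not part of the original slides), the same Poisson expression can be evaluated in R either as a density over x or as a likelihood over lambda:

## same formula, two readings: density in x (lambda fixed) vs likelihood in lambda (x fixed)
x <- 0:10
dpois(x, lambda = 3)               # density values over the support, for lambda = 3
lambda <- seq(0.1, 10, by = 0.1)
lik <- dpois(3, lambda)            # likelihood of lambda for the single observation x = 3
lambda[which.max(lik)]             # maximised near lambda = 3 = x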

Example: density function versus likelihood function
Take the case of a Normal $N(0, \sigma)$ density
$f(x; \sigma) = \dfrac{1}{\sqrt{2\pi\sigma}}\, e^{-x^2/2\sigma}\, \mathbb{I}_{\mathbb{R}}(x)$
which varies in $\mathbb{R}$ as a function of $x$
versus
$L(\sigma; x) = \dfrac{1}{\sqrt{2\pi\sigma}}\, e^{-x^2/2\sigma}$
which varies in $\mathbb{R}_+$ as a function of $\sigma$
[figures: density plotted for $\sigma = 2$, likelihood plotted for $x = 2$]

Example: density function versus likelihood function
Take the case of a Normal $N(0, 1/\theta)$ density
$f(x; \theta) = \dfrac{\sqrt{\theta}}{\sqrt{2\pi}}\, e^{-\theta x^2/2}\, \mathbb{I}_{\mathbb{R}}(x)$
which varies in $\mathbb{R}$ as a function of $x$
versus
$L(\theta; x) = \dfrac{\sqrt{\theta}}{\sqrt{2\pi}}\, e^{-\theta x^2/2}$
which varies in $\mathbb{R}_+$ as a function of $\theta$
[figures: density plotted for $\theta = 1/2$, likelihood plotted for $x = 1/2$]

Example: Hardy-Weinberg equilibrium
Population genetics:
Genotypes of biallelic genes AA, Aa, and aa
sample frequencies $n_{AA}$, $n_{Aa}$ and $n_{aa}$
multinomial model $\mathcal{M}(n; p_{AA}, p_{Aa}, p_{aa})$
related to the population proportion of A alleles, $p_A$:
$p_{AA} = p_A^2, \quad p_{Aa} = 2 p_A (1 - p_A), \quad p_{aa} = (1 - p_A)^2$
likelihood
$L(p_A \mid n_{AA}, n_{Aa}, n_{aa}) \propto p_A^{2 n_{AA}}\, [2 p_A (1 - p_A)]^{n_{Aa}}\, (1 - p_A)^{2 n_{aa}}$
[Boos & Stefanski, 2013]
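As a sketch (not in the original slides), this likelihood can be maximised numerically in R; the genotype counts below are hypothetical and only serve the illustration:

## hypothetical genotype counts, for illustration only
nAA <- 50; nAa <- 30; naa <- 20
loglik <- function(pA)
  2 * nAA * log(pA) + nAa * log(2 * pA * (1 - pA)) + 2 * naa * log(1 - pA)
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
(2 * nAA + nAa) / (2 * (nAA + nAa + naa))   # closed-form MLE of p_A, for comparison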

Mixed distributions and their likelihood
Special case when a random variable X may take specific values $a_1, \dots, a_k$ and a continuum of values in A
Example: rainfall measured at a given station, equal to zero with positive probability $p_0$ [it did not rain!], taking values between 0 and 100 [capacity of measurement container], and equal to 100 with positive probability $p_{100}$ [container full]

Mixed distributions and their likelihood
Special case when a random variable X may take specific values $a_1, \dots, a_k$ and a continuum of values in A
Example: $y \sim N(X^{\mathrm T}\beta, \sigma^2)$ but only $y^* = y\, \mathbb{I}\{y > 0\}$ is observed

Mixed distributions and their likelihood
Special case when a random variable X may take specific values $a_1, \dots, a_k$ and a continuum of values in A
Density of X against the composition of two measures, counting and Lebesgue:
$f_X(a) = \begin{cases} \mathbb{P}_\theta(X = a) & \text{if } a \in \{a_1, \dots, a_k\} \\ f(a \mid \theta) & \text{otherwise} \end{cases}$
Results in the likelihood
$L(\theta \mid x_1, \dots, x_n) = \prod_{j=1}^k \mathbb{P}_\theta(X = a_j)^{n_j} \times \prod_{x_i \notin \{a_1, \dots, a_k\}} f(x_i \mid \theta)$
where $n_j$ = # observations equal to $a_j$

Enters Fisher, Ronald Fisher!
Fisher's intuition in the 1920s:
the likelihood function contains the relevant information about the parameter $\theta$
the higher the likelihood, the more likely the parameter
the curvature of the likelihood determines the precision of the estimation

Concentration of likelihood mode around "true" parameter
Likelihood functions for $x_1, \dots, x_n \sim \mathcal{P}(3)$ as n increases [n = 40, ..., 240]
Likelihood functions for $x_1, \dots, x_n \sim N(0, 1)$ as n increases
Likelihood functions for $x_1, \dots, x_n \sim N(0, 1)$ as the sample varies
[figures only]
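A small R sketch (mine, not from the slides) of this concentration effect in the Poisson case:

## log-likelihood of lambda for growing Poisson(3) samples:
## the curve peaks more and more sharply around the true value 3
set.seed(1)
x <- rpois(240, lambda = 3)
lambda <- seq(1, 6, by = 0.01)
for (n in c(40, 120, 240)) {
  ll <- sapply(lambda, function(l) sum(dpois(x[1:n], l, log = TRUE)))
  cat("n =", n, " mode at lambda =", lambda[which.max(ll)], "\n")
}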

Why concentration takes place
Consider
$x_1, \dots, x_n \overset{\text{iid}}{\sim} F$
Then
$\log \prod_{i=1}^n f(x_i \mid \theta) = \sum_{i=1}^n \log f(x_i \mid \theta)$
and by the LLN
$\dfrac{1}{n} \sum_{i=1}^n \log f(x_i \mid \theta) \longrightarrow \int_{\mathcal{X}} \log f(x \mid \theta)\, \mathrm{d}F(x)$
Lemma
Maximising the likelihood is asymptotically equivalent to minimising the Kullback-Leibler divergence
$\int_{\mathcal{X}} \log \big[ f(x) \big/ f(x \mid \theta) \big]\, \mathrm{d}F(x)$
[i.e., to picking the member of the family closest to the true distribution]

Score function
Score function defined by
$\nabla \log L(\theta \mid x) = \Big( \partial/\partial\theta_1\, L(\theta \mid x), \dots, \partial/\partial\theta_p\, L(\theta \mid x) \Big) \big/ L(\theta \mid x)$
Gradient (slope) of the likelihood function at point $\theta$
Lemma
When $X \sim F_\theta$,
$\mathbb{E}_\theta[\nabla \log L(\theta \mid X)] = 0$
Reason:
$\int_{\mathcal{X}} \nabla \log L(\theta \mid x)\, \mathrm{d}F_\theta(x) = \int_{\mathcal{X}} \nabla L(\theta \mid x)\, \mathrm{d}x = \nabla \int_{\mathcal{X}} \mathrm{d}F_\theta(x) = 0$
Connected with the concentration theorem: the gradient is null on average at the true value of the parameter
Warning: this fails when the support depends on $\theta$
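A quick Monte Carlo check of the lemma (my own sketch) for the Poisson model, where the score is $x/\lambda - 1$:

## E_theta[ d/d lambda log f(X | lambda) ] should be 0 at the true lambda
set.seed(2)
lambda <- 3
x <- rpois(1e5, lambda)
mean(x / lambda - 1)   # close to 0 at the true parameter value
mean(x / 2.5 - 1)      # not 0 when evaluated at a wrong value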

Fisher's information matrix
Another notion attributed to Fisher
Information:
$I(\theta) = \mathbb{E}_\theta\big[ \nabla \log f(X \mid \theta)\, \{\nabla \log f(X \mid \theta)\}^{\mathrm T} \big]$
Often called the Fisher information
Measures the curvature of the likelihood surface, which translates as information
Sometimes denoted $I_X$ to stress the dependence on the distribution of X

Fisher's information matrix
Expressible via the second derivative of the log-likelihood as well
Lemma
If $L(\theta \mid x)$ is twice differentiable [as a function of $\theta$]
$I(\theta) = -\mathbb{E}_\theta\big[ \nabla^{\mathrm T} \nabla \log f(X \mid \theta) \big]$
Hence
$I_{ij}(\theta) = -\mathbb{E}_\theta\Big[ \dfrac{\partial^2}{\partial\theta_i \partial\theta_j} \log f(X \mid \theta) \Big]$

Illustrations
Binomial $\mathcal{B}(n, p)$ distribution
$f(x \mid p) = \binom{n}{x} p^x (1 - p)^{n - x}$
$\partial/\partial p\, \log f(x \mid p) = x/p - (n - x)/(1 - p)$
$\partial^2/\partial p^2\, \log f(x \mid p) = -x/p^2 - (n - x)/(1 - p)^2$
Hence
$I(p) = np/p^2 + (n - np)/(1 - p)^2 = \dfrac{n}{p(1 - p)}$
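A Monte Carlo sanity check (my own) that the variance of the score matches $n / p(1-p)$:

## Fisher information for Binomial(n, p) as the variance of the score
set.seed(3)
n <- 20; p <- 0.3
x <- rbinom(1e5, n, p)
score <- x / p - (n - x) / (1 - p)
var(score)           # Monte Carlo estimate
n / (p * (1 - p))    # theoretical value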

Illustrations
Multinomial $\mathcal{M}(n; p_1, \dots, p_k)$ distribution
$f(x \mid p) = \binom{n}{x_1 \cdots x_k} p_1^{x_1} \cdots p_k^{x_k}$
$\partial/\partial p_i\, \log f(x \mid p) = x_i/p_i - x_k/p_k$
$\partial^2/\partial p_i \partial p_j\, \log f(x \mid p) = -x_k/p_k^2$
$\partial^2/\partial p_i^2\, \log f(x \mid p) = -x_i/p_i^2 - x_k/p_k^2$
Hence
$I(p) = n \begin{pmatrix} 1/p_1 + 1/p_k & \cdots & 1/p_k \\ \vdots & \ddots & \vdots \\ 1/p_k & \cdots & 1/p_{k-1} + 1/p_k \end{pmatrix}$
and
$I(p)^{-1} = \dfrac{1}{n} \begin{pmatrix} p_1(1 - p_1) & -p_1 p_2 & \cdots & -p_1 p_{k-1} \\ -p_1 p_2 & p_2(1 - p_2) & \cdots & -p_2 p_{k-1} \\ \vdots & & \ddots & \vdots \\ -p_1 p_{k-1} & -p_2 p_{k-1} & \cdots & p_{k-1}(1 - p_{k-1}) \end{pmatrix}$
[with $(p_1, \dots, p_{k-1})$ as free parameters and $p_k = 1 - p_1 - \dots - p_{k-1}$]

Illustrations
Normal $N(\mu, \sigma^2)$ distribution
$f(x \mid \theta) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\big\{ -(x - \mu)^2/2\sigma^2 \big\}$
$\partial/\partial\mu\, \log f(x \mid \theta) = (x - \mu)/\sigma^2$
$\partial/\partial\sigma\, \log f(x \mid \theta) = -1/\sigma + (x - \mu)^2/\sigma^3$
$\partial^2/\partial\mu^2\, \log f(x \mid \theta) = -1/\sigma^2$
$\partial^2/\partial\mu\partial\sigma\, \log f(x \mid \theta) = -2(x - \mu)/\sigma^3$
$\partial^2/\partial\sigma^2\, \log f(x \mid \theta) = 1/\sigma^2 - 3(x - \mu)^2/\sigma^4$
Hence
$I(\theta) = \dfrac{1}{\sigma^2} \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$
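A Monte Carlo check (my own sketch) that the average outer product of the score vector recovers this matrix:

## Fisher information for N(mu, sigma^2), with theta = (mu, sigma)
set.seed(4)
mu <- 1; sigma <- 2
x <- rnorm(1e5, mu, sigma)
s1 <- (x - mu) / sigma^2                  # score wrt mu
s2 <- -1 / sigma + (x - mu)^2 / sigma^3   # score wrt sigma
crossprod(cbind(s1, s2)) / length(x)      # Monte Carlo estimate of I(theta)
matrix(c(1, 0, 0, 2), 2, 2) / sigma^2     # theoretical value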

Properties
Additive features translating as accumulation of information:
if X and Y are independent, $I_X(\theta) + I_Y(\theta) = I_{(X, Y)}(\theta)$
$I_{X_1, \dots, X_n}(\theta) = n\, I_{X_1}(\theta)$
if $X = T(Y)$ and $Y = S(X)$, $I_X(\theta) = I_Y(\theta)$
if $X = T(Y)$, $I_X(\theta) \leq I_Y(\theta)$
If $\eta = \Psi(\theta)$ is a bijective transform, change of parameterisation:
$I(\theta) = \Big( \dfrac{\partial\eta}{\partial\theta} \Big)^{\mathrm T} I(\eta) \Big( \dfrac{\partial\eta}{\partial\theta} \Big)$
"In information geometry, this is seen as a change of coordinates on a Riemannian manifold, and the intrinsic properties of curvature are unchanged under different parametrization. In general, the Fisher information matrix provides a Riemannian metric (more precisely, the Fisher-Rao metric)." [Wikipedia]

Approximations
Back to the Kullback-Leibler divergence
$D(\theta_0, \theta) = \int_{\mathcal{X}} f(x \mid \theta_0) \log \big[ f(x \mid \theta_0) \big/ f(x \mid \theta) \big]\, \mathrm{d}x$
Using a second-order Taylor expansion
$\log f(x \mid \theta) = \log f(x \mid \theta_0) + (\theta - \theta_0)^{\mathrm T} \nabla \log f(x \mid \theta_0) + \frac{1}{2} (\theta - \theta_0)^{\mathrm T} \nabla\nabla^{\mathrm T} \log f(x \mid \theta_0)\, (\theta - \theta_0) + o(\|\theta - \theta_0\|^2)$
approximation of the divergence:
$D(\theta_0, \theta) \approx \frac{1}{2} (\theta - \theta_0)^{\mathrm T} I(\theta_0) (\theta - \theta_0)$
[Exercise: show this is exact in the normal case]

Approximations
Central limit law of the score vector
$\dfrac{1}{\sqrt{n}}\, \nabla \log L(\theta \mid X_1, \dots, X_n) \approx N(0, I_{X_1}(\theta))$
Notation: $I^1(\theta)$ stands for $I_{X_1}(\theta)$ and indicates the information associated with a single observation

Sufficiency
What if a transform of the sample
$S(X_1, \dots, X_n)$
contains all the information about $\theta$, that is,
$I_{(X_1, \dots, X_n)}(\theta) = I_{S(X_1, \dots, X_n)}(\theta)$
uniformly in $\theta$?
In this case $S(\cdot)$ is called a sufficient statistic [it is sufficient to know the value of $S(x_1, \dots, x_n)$ to get complete information]
A statistic is an arbitrary transform of the data $X_1, \dots, X_n$

Sufficiency (bis)
Alternative definition:
If $(X_1, \dots, X_n) \sim f(x_1, \dots, x_n \mid \theta)$ and if $T = S(X_1, \dots, X_n)$ is such that the distribution of $(X_1, \dots, X_n)$ conditional on T does not depend on $\theta$, then $S(\cdot)$ is a sufficient statistic
Factorisation theorem
$S(\cdot)$ is a sufficient statistic if and only if
$f(x_1, \dots, x_n \mid \theta) = g(S(x_1, \dots, x_n) \mid \theta)\, h(x_1, \dots, x_n)$
another notion due to Fisher

Illustrations
Uniform $U(0, \theta)$ distribution
$L(\theta \mid x_1, \dots, x_n) = \theta^{-n} \prod_{i=1}^n \mathbb{I}_{(0, \theta)}(x_i) = \theta^{-n}\, \mathbb{I}_{\theta > \max_i x_i}$
Hence
$S(X_1, \dots, X_n) = \max_i X_i = X_{(n)}$
is sufficient

Illustrations
Bernoulli $\mathcal{B}(p)$ distribution
$L(p \mid x_1, \dots, x_n) = \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = \{p/(1 - p)\}^{\sum_i x_i} (1 - p)^n$
Hence
$S(X_1, \dots, X_n) = \bar{X}_n$
is sufficient

Illustrations
Normal $N(\mu, \sigma^2)$ distribution
$L(\mu, \sigma \mid x_1, \dots, x_n) = \prod_{i=1}^n \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\big\{ -(x_i - \mu)^2/2\sigma^2 \big\}$
$= \dfrac{1}{\{2\pi\sigma^2\}^{n/2}} \exp\Big\{ -\dfrac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \bar{x}_n + \bar{x}_n - \mu)^2 \Big\}$
$= \dfrac{1}{\{2\pi\sigma^2\}^{n/2}} \exp\Big\{ -\dfrac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \bar{x}_n)^2 - \dfrac{1}{2\sigma^2} \sum_{i=1}^n (\bar{x}_n - \mu)^2 \Big\}$
Hence
$S(X_1, \dots, X_n) = \Big( \bar{X}_n,\ \sum_{i=1}^n (X_i - \bar{X}_n)^2 \Big)$
is sufficient

Sufficiency and exponential families
Both previous examples belong to exponential families
$f(x \mid \theta) = h(x) \exp\big\{ T(\theta)^{\mathrm T} S(x) - \tau(\theta) \big\}$
Generic property of exponential families:
$f(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n h(x_i) \exp\Big\{ T(\theta)^{\mathrm T} \sum_{i=1}^n S(x_i) - n\tau(\theta) \Big\}$
Lemma
For an exponential family with summary statistic $S(\cdot)$, the statistic
$S(X_1, \dots, X_n) = \sum_{i=1}^n S(X_i)$
is sufficient

Sufficiency as a rare feature
Nice property reducing the data to a low-dimensional transform but...
How frequent is it within the collection of probability distributions?
Very rare, as essentially restricted to exponential families
[Pitman-Koopman-Darmois theorem]
with the exception of parameter-dependent families like $U(0, \theta)$

Pitman-Koopman-Darmois characterisation
If $X_1, \dots, X_n$ are iid random variables from a density $f(\cdot \mid \theta)$ whose support does not depend on $\theta$ and verifying the property that there exists an integer $n_0$ such that, for $n \geq n_0$, there is a sufficient statistic $S(X_1, \dots, X_n)$ of fixed [in n] dimension, then $f(\cdot \mid \theta)$ belongs to an exponential family
[Factorisation theorem]
Note: Koopman and Pitman published the result in 1936, while Fisher, omitted from the theorem's name, proved it for one-dimensional sufficient statistics in 1934

Minimal sufficiency
Multiplicity of sufficient statistics, e.g., $S'(x) = (S(x), U(x))$ remains sufficient when $S(\cdot)$ is sufficient
Search for a most concentrated summary:
Minimal sufficiency
A sufficient statistic $S(\cdot)$ is minimal sufficient if it is a function of any other sufficient statistic
Lemma
For a minimal exponential family representation
$f(x \mid \theta) = h(x) \exp\big\{ T(\theta)^{\mathrm T} S(x) - \tau(\theta) \big\}$
$S(X_1) + \dots + S(X_n)$ is minimal sufficient

Ancillarity
Opposite of sufficiency:
Ancillarity
When $X_1, \dots, X_n$ are iid random variables from a density $f(\cdot \mid \theta)$, a statistic $A(\cdot)$ is ancillary if $A(X_1, \dots, X_n)$ has a distribution that does not depend on $\theta$
Useless?! Not quite: conditioning on $A(X_1, \dots, X_n)$ leads to more precision and efficiency:
Use of $F(x_1, \dots, x_n \mid A(x_1, \dots, x_n))$ instead of $F(x_1, \dots, x_n)$
Notion of conditional inference

Illustrations
1. If $X_1, \dots, X_n \overset{\text{iid}}{\sim} U(0, \theta)$, $A(X_1, \dots, X_n) = (X_1, \dots, X_n)/X_{(n)}$ is ancillary
2. If $X_1, \dots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$,
$A(X_1, \dots, X_n) = \dfrac{(X_1 - \bar{X}_n, \dots, X_n - \bar{X}_n)}{\big\{ \sum_{i=1}^n (X_i - \bar{X}_n)^2 \big\}^{1/2}}$
is ancillary
3. If $X_1, \dots, X_n \overset{\text{iid}}{\sim} f(x \mid \theta)$, rank$(X_1, \dots, X_n)$ is ancillary
> x=rnorm(10)
> rank(x)
[1] 7 4 1 5 2 6 8 9 10 3
[see, e.g., rank tests]

Point estimation, estimators and estimates
When given a parametric family $f(\cdot \mid \theta)$ and a sample supposedly drawn from this family
$(X_1, \dots, X_n) \overset{\text{iid}}{\sim} f(x \mid \theta)$
an estimator of $\theta$ is a statistic $T(X_1, \dots, X_n)$ or $\hat{\theta}_n$ providing a [reasonable] substitute for $\theta$.
an estimate is the value of the estimator for a given sample, $T(x_1, \dots, x_n)$
Example: for a Normal $N(\mu, \sigma^2)$ sample $X_1, \dots, X_n$,
$T(X_1, \dots, X_n) = \hat{\mu}_n = \frac{1}{n} \sum_{i} X_i = \bar{X}_n$
is an estimator of $\mu$ and $\hat{\mu}_n = 2.014$ is an estimate

Maximum likelihood principle
Given the concentration property of the likelihood function, a reasonable choice of estimator is its mode:
MLE
A maximum likelihood estimator (MLE) $\hat{\theta}_n$ satisfies
$L(\hat{\theta}_n \mid X_1, \dots, X_n) \geq L(\theta \mid X_1, \dots, X_n)$ for all $\theta \in \Theta$
Under regularity of $L(\theta \mid X_1, \dots, X_n)$, the MLE is also a solution of the likelihood equations
$\nabla \log L(\hat{\theta}_n \mid X_1, \dots, X_n) = 0$
Warning: $\hat{\theta}_n$ is not the most likely value of $\theta$ but the value that makes the observation $(x_1, \dots, x_n)$ most likely...

Maximum likelihood invariance
Principle independent of parameterisation:
If $\xi = h(\theta)$ is a one-to-one transform of $\theta$, then
$\hat{\xi}^{\text{MLE}}_n = h(\hat{\theta}^{\text{MLE}}_n)$
[estimator of transform = transform of estimator]
By extension, if $\xi = h(\theta)$ is any transform of $\theta$, then
$\hat{\xi}^{\text{MLE}}_n = h(\hat{\theta}^{\text{MLE}}_n)$

Unicity of maximum likelihood estimate
Depending on the regularity of $L(\theta \mid x_1, \dots, x_n)$, there may be
1. an a.s. unique MLE $\hat{\theta}^{\text{MLE}}_n$
Case of $x_1, \dots, x_n \sim N(\mu, 1)$
2. several or an infinity of MLE's [or of solutions to the likelihood equations]
Case of $x_1, \dots, x_n \sim N(\mu_1 + \mu_2, 1)$ [and mixtures of normals]
3. no MLE at all
Case of $x_1, \dots, x_n \sim N(\mu_i, \sigma^{-2})$

Unicity of maximum likelihood estimate
Consequence of standard differential calculus results on $\ell(\theta) = L(\theta \mid x_1, \dots, x_n)$:
Lemma
If $\Theta$ is connected and open, and if $\ell(\theta)$ is twice-differentiable with
$\lim_{\theta \to \partial\Theta} \ell(\theta) < +\infty$
and if $H(\theta) = \nabla\nabla^{\mathrm T} \ell(\theta)$ is negative definite at all solutions of the likelihood equations, then $\ell(\theta)$ has a unique global maximum
Limited appeal because it excludes local maxima

Unicity of MLE for exponential families
Lemma
If $f(\cdot \mid \theta)$ is a minimal exponential family
$f(x \mid \theta) = h(x) \exp\big\{ T(\theta)^{\mathrm T} S(x) - \tau(\theta) \big\}$
with $T(\cdot)$ one-to-one and twice differentiable over $\Theta$, if $\Theta$ is open, and if there is at least one solution to the likelihood equations, then it is the unique MLE
The likelihood equation is equivalent to
$S(x) = \mathbb{E}_\theta[S(X)]$

Illustrations
Uniform $U(0, \theta)$ likelihood
$L(\theta \mid x_1, \dots, x_n) = \theta^{-n}\, \mathbb{I}_{\theta > \max_i x_i}$
not differentiable at $X_{(n)}$ but
$\hat{\theta}^{\text{MLE}}_n = X_{(n)}$

Illustrations
Bernoulli $\mathcal{B}(p)$ likelihood
$L(p \mid x_1, \dots, x_n) = \{p/(1 - p)\}^{\sum_i x_i} (1 - p)^n$
differentiable over (0, 1) and
$\hat{p}^{\text{MLE}}_n = \bar{X}_n$

Illustrations
Normal $N(\mu, \sigma^2)$ likelihood
$L(\mu, \sigma \mid x_1, \dots, x_n) \propto \sigma^{-n} \exp\Big\{ -\dfrac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \bar{x}_n)^2 - \dfrac{1}{2\sigma^2} \sum_{i=1}^n (\bar{x}_n - \mu)^2 \Big\}$
differentiable with
$(\hat{\mu}^{\text{MLE}}_n, \hat{\sigma}^{2\,\text{MLE}}_n) = \Big( \bar{X}_n,\ \dfrac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2 \Big)$
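A quick R cross-check (mine, not from the slides) of these closed forms against a generic numerical optimiser, parameterising the scale on the log scale to keep it positive:

## closed-form normal MLEs versus numerical maximisation of the log-likelihood
set.seed(5)
x <- rnorm(100, mean = 1, sd = 2)
negll <- function(par) -sum(dnorm(x, par[1], exp(par[2]), log = TRUE))
fit <- optim(c(0, 0), negll)               # numerical MLE of (mu, log sigma)
c(fit$par[1], exp(fit$par[2]))             # numerical (mu, sigma)
c(mean(x), sqrt(mean((x - mean(x))^2)))    # closed-form MLEs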

The fundamental theorem of Statistics
Fundamental theorem
Under appropriate conditions, if $(X_1, \dots, X_n) \overset{\text{iid}}{\sim} f(x \mid \theta)$ and if $\hat{\theta}_n$ is a solution of $\nabla \log f(X_1, \dots, X_n \mid \theta) = 0$, then
$\sqrt{n}\, \{\hat{\theta}_n - \theta\} \overset{\mathcal{L}}{\longrightarrow} N_p(0, I(\theta)^{-1})$
Equivalent of the CLT for estimation purposes
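A simulation sketch of my own for the exponential rate parameter, where the per-observation information is $1/\theta^2$, so $\sqrt{n}(\hat{\theta}_n - \theta)$ should be approximately $N(0, \theta^2)$:

## the MLE of the Exp(theta) rate is 1/mean(x); its rescaled error has variance ~ theta^2
set.seed(6)
theta <- 2; n <- 500
mle <- replicate(5000, { x <- rexp(n, rate = theta); 1 / mean(x) })
z <- sqrt(n) * (mle - theta)
c(mean(z), var(z), theta^2)   # variance close to 1 / I(theta) = theta^2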

Assumptions
$\theta$ is identifiable
support of $f(\cdot \mid \theta)$ constant in $\theta$
$\ell(\theta)$ thrice differentiable
[the killer] there exists $g(x)$ integrable against $f(\cdot \mid \theta)$ in a neighbourhood of the true parameter such that
$\Big| \dfrac{\partial^3}{\partial\theta_i \partial\theta_j \partial\theta_k} f(x \mid \theta) \Big| \leq g(x)$
the identity
$I(\theta) = \mathbb{E}_\theta\big[ \nabla \log f(X \mid \theta) \{\nabla \log f(X \mid \theta)\}^{\mathrm T} \big] = -\mathbb{E}_\theta\big[ \nabla^{\mathrm T} \nabla \log f(X \mid \theta) \big]$
stands
$\hat{\theta}_n$ converges in probability to $\theta$ [similarly superfluous]
[Boos & Stefanski, 2014, p.286; Lehmann & Casella, 1998]

Inefficient MLEs
Example of the MLE of $\eta = \|\theta\|^2$ when $x \sim N_p(\theta, I_p)$:
$\hat{\eta}^{\text{MLE}} = \|x\|^2$
Then $\mathbb{E}[\|x\|^2] = \eta + p$ diverges away from $\eta$ with p
Note: the MLE of $\eta$ based on
$Z = \|X\|^2 \sim \chi^2_p(\eta)$
behaves better
[Robert, 2001]

Inconsistent MLEs
Take $X_1, \dots, X_n \overset{\text{iid}}{\sim} f_\theta(x)$ with
$f_\theta(x) = (1 - \theta)\, \dfrac{1}{\delta(\theta)}\, f_0\big( (x - \theta)/\delta(\theta) \big) + \theta\, f_1(x)$
for $\theta \in [0, 1]$,
$f_1(x) = \tfrac{1}{2}\, \mathbb{I}_{[-1, 1]}(x), \quad f_0(x) = (1 - |x|)\, \mathbb{I}_{[-1, 1]}(x)$
and
$\delta(\theta) = (1 - \theta) \exp\{ -(1 - \theta)^{-4} + 1 \}$
Then for any $\theta$,
$\hat{\theta}^{\text{MLE}}_n \overset{\text{a.s.}}{\longrightarrow} 1$
[Ferguson, 1982; John Wellner's slides, ca. 2005]

Inconsistent MLEs
Consider $X_{ij}$, $i = 1, \dots, n$, $j = 1, 2$, with $X_{ij} \sim N(\mu_i, \sigma^2)$. Then
$\hat{\mu}^{\text{MLE}}_i = (X_{i1} + X_{i2})/2, \quad \hat{\sigma}^{2\,\text{MLE}} = \dfrac{1}{4n} \sum_{i=1}^n (X_{i1} - X_{i2})^2$
Therefore
$\hat{\sigma}^{2\,\text{MLE}} \overset{\text{a.s.}}{\longrightarrow} \sigma^2/2$
[Neyman & Scott, 1948]
Note: working instead with $X_{i1} - X_{i2} \sim N(0, 2\sigma^2)$ produces a consistent MLE
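A quick simulation of the Neyman-Scott bias (my own sketch; the per-pair means are arbitrary):

## MLE of sigma^2 converges to sigma^2/2, not sigma^2
set.seed(8)
n <- 1e5; sigma <- 1
mu <- rnorm(n)                                     # one mean per pair
x1 <- rnorm(n, mu, sigma); x2 <- rnorm(n, mu, sigma)
sum((x1 - x2)^2) / (4 * n)   # MLE of sigma^2, close to sigma^2/2 = 0.5
sum((x1 - x2)^2) / (2 * n)   # consistent estimate based on the differences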

Likelihood optimisation
Practical optimisation of the likelihood function
$\theta^\star = \arg\max_\theta L(\theta \mid x) = \prod_{i=1}^n g(x_i \mid \theta),$
assuming $X = (X_1, \dots, X_n) \overset{\text{iid}}{\sim} g(x \mid \theta)$
analytical resolution feasible for exponential families:
$\nabla T(\theta)^{\mathrm T} \sum_{i=1}^n S(x_i) = n\, \nabla\tau(\theta)$
use of standard numerical techniques like Newton-Raphson:
$\theta^{(t+1)} = \theta^{(t)} + I_{\text{obs}}(X, \theta^{(t)})^{-1}\, \nabla\ell(\theta^{(t)})$
with $\ell(\cdot)$ the log-likelihood and $I_{\text{obs}}$ the observed information matrix
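A minimal Newton-Raphson sketch (mine, not from the slides) for the shape alpha of a Gamma(alpha, 1) sample, where the score is sum(log x) - n psi(alpha) and the observed information reduces to n psi'(alpha):

## Newton-Raphson iterations for the MLE of a Gamma shape parameter
set.seed(7)
x <- rgamma(200, shape = 3, rate = 1)
alpha <- mean(x)                                      # starting value
for (t in 1:50) {
  score <- sum(log(x)) - length(x) * digamma(alpha)   # gradient of the log-likelihood
  iobs  <- length(x) * trigamma(alpha)                # observed information
  alpha.new <- alpha + score / iobs                   # Newton-Raphson update
  if (abs(alpha.new - alpha) < 1e-8) break
  alpha <- alpha.new
}
alpha   # MLE of the shape parameter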

EM algorithm
Cases where g is too complex for the above to work
Special case when g is a marginal
$g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta)\, \mathrm{d}z$
Z is called the latent or missing variable

Illustrations
censored data:
$X = \min(X^\star, a), \quad X^\star \sim N(\theta, 1)$
mixture model:
$X \sim .3\, N_1(\mu_0, 1) + .7\, N_1(\mu_1, 1)$
disequilibrium model:
$X = \min(X^\star, Y^\star), \quad X^\star \sim f_1(x \mid \theta), \quad Y^\star \sim f_2(x \mid \theta)$

Completion
EM algorithm based on completing the data x with z, such that
$(X, Z) \sim f(x, z \mid \theta)$
Z missing data vector, (X, Z) complete data vector
Conditional density of Z given x:
$k(z \mid \theta, x) = \dfrac{f(x, z \mid \theta)}{g(x \mid \theta)}$

Likelihood decomposition
Likelihood associated with the complete data (x, z)
$L^c(\theta \mid x, z) = f(x, z \mid \theta)$
and likelihood for the observed data
$L(\theta \mid x)$
such that
$\log L(\theta \mid x) = \mathbb{E}[\log L^c(\theta \mid x, Z) \mid \theta_0, x] - \mathbb{E}[\log k(Z \mid \theta, x) \mid \theta_0, x]$ (1)
for any $\theta_0$, with integration operated against the conditional distribution of Z given the observables (and parameters), $k(z \mid \theta_0, x)$

[A tale of] two θ's
There are "two θ's"!: in (1), $\theta_0$ is a fixed (and arbitrary) value driving the integration, while $\theta$ is both free (and variable)
Maximising the observed likelihood
$L(\theta \mid x)$
is equivalent to maximising the r.h.s. term in (1)
$\mathbb{E}[\log L^c(\theta \mid x, Z) \mid \theta_0, x] - \mathbb{E}[\log k(Z \mid \theta, x) \mid \theta_0, x]$

Intuition for EM
Instead of maximising the r.h.s. term in (1) wrt $\theta$, maximise only
$\mathbb{E}[\log L^c(\theta \mid x, Z) \mid \theta_0, x]$
Maximisation of the complete log-likelihood is impossible since z is unknown, hence it is replaced by maximisation of the expected complete log-likelihood, with expectation depending on the term $\theta_0$

Expectation-Maximisation
Expectation of the complete log-likelihood denoted
$Q(\theta \mid \theta_0, x) = \mathbb{E}[\log L^c(\theta \mid x, Z) \mid \theta_0, x]$
to stress the dependence on $\theta_0$ and on the sample x
Principle
EM derives a sequence of estimators $\hat{\theta}_{(j)}$, $j = 1, 2, \dots$, through iteration of Expectation and Maximisation steps:
$Q(\hat{\theta}_{(j)} \mid \hat{\theta}_{(j-1)}, x) = \max_\theta Q(\theta \mid \hat{\theta}_{(j-1)}, x).$

EM Algorithm
Iterate (in m)
1. (step E) compute
$Q(\theta \mid \hat{\theta}_{(m)}, x) = \mathbb{E}[\log L^c(\theta \mid x, Z) \mid \hat{\theta}_{(m)}, x],$
2. (step M) maximise $Q(\theta \mid \hat{\theta}_{(m)}, x)$ in $\theta$ and set
$\hat{\theta}_{(m+1)} = \arg\max_\theta Q(\theta \mid \hat{\theta}_{(m)}, x).$
until a fixed point [of Q] is found
[Dempster, Laird, & Rubin, 1977]

Justification
The observed likelihood
$L(\theta \mid x)$
increases at every EM step
$L(\hat{\theta}_{(m+1)} \mid x) \geq L(\hat{\theta}_{(m)} \mid x)$
[Exercise: use Jensen and (1)]

Censored data
Normal $N(\theta, 1)$ sample right-censored at a
$L(\theta \mid x) = \dfrac{1}{(2\pi)^{m/2}} \exp\Big\{ -\dfrac{1}{2} \sum_{i=1}^m (x_i - \theta)^2 \Big\}\, [1 - \Phi(a - \theta)]^{n - m}$
Associated complete log-likelihood:
$\log L^c(\theta \mid x, z) \propto -\dfrac{1}{2} \sum_{i=1}^m (x_i - \theta)^2 - \dfrac{1}{2} \sum_{i=m+1}^n (z_i - \theta)^2,$
where the $z_i$'s are the censored observations, with density
$k(z \mid \theta, x) = \dfrac{\exp\{ -\frac{1}{2}(z - \theta)^2 \}}{\sqrt{2\pi}\,[1 - \Phi(a - \theta)]} = \dfrac{\varphi(z - \theta)}{1 - \Phi(a - \theta)}, \quad a < z.$

Censored data (2)
At the j-th EM iteration
$Q(\theta \mid \hat{\theta}_{(j)}, x) \propto -\dfrac{1}{2} \sum_{i=1}^m (x_i - \theta)^2 - \dfrac{1}{2}\, \mathbb{E}\Big[ \sum_{i=m+1}^n (Z_i - \theta)^2 \,\Big|\, \hat{\theta}_{(j)}, x \Big]$
$\propto -\dfrac{1}{2} \sum_{i=1}^m (x_i - \theta)^2 - \dfrac{1}{2} \sum_{i=m+1}^n \int_a^\infty (z_i - \theta)^2\, k(z_i \mid \hat{\theta}_{(j)}, x)\, \mathrm{d}z_i$

Censored data (3)
Differentiating in $\theta$,
$n\, \hat{\theta}_{(j+1)} = m\, \bar{x} + (n - m)\, \mathbb{E}[Z \mid \hat{\theta}_{(j)}],$
with
$\mathbb{E}[Z \mid \hat{\theta}_{(j)}] = \int_a^\infty z\, k(z \mid \hat{\theta}_{(j)}, x)\, \mathrm{d}z = \hat{\theta}_{(j)} + \dfrac{\varphi(a - \hat{\theta}_{(j)})}{1 - \Phi(a - \hat{\theta}_{(j)})}.$
Hence, the EM sequence is provided by
$\hat{\theta}_{(j+1)} = \dfrac{m}{n}\, \bar{x} + \dfrac{n - m}{n} \Big[ \hat{\theta}_{(j)} + \dfrac{\varphi(a - \hat{\theta}_{(j)})}{1 - \Phi(a - \hat{\theta}_{(j)})} \Big],$
which converges to the likelihood maximum $\hat{\theta}$
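A minimal R sketch of this EM recursion for a N(theta, 1) sample right-censored at a (my own illustration; the true value, the censoring point and the starting value are arbitrary choices):

## EM for a right-censored normal mean
set.seed(42)
theta.true <- 1; a <- 1.5; n <- 100
x <- pmin(rnorm(n, theta.true, 1), a)    # observed, censored sample
m <- sum(x < a)                          # number of uncensored observations
xbar <- mean(x[x < a])                   # mean of the uncensored part
theta <- xbar                            # starting value
for (j in 1:200) {
  ez <- theta + dnorm(a - theta) / (1 - pnorm(a - theta))   # E[Z | theta, Z > a]
  theta.new <- (m * xbar + (n - m) * ez) / n                # EM update above
  if (abs(theta.new - theta) < 1e-10) break
  theta <- theta.new
}
theta   # EM estimate of theta (the MLE)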

Mixtures
Mixture of two normal distributions with unknown means
$.3\, N_1(\mu_0, 1) + .7\, N_1(\mu_1, 1),$
sample $X_1, \dots, X_n$ and parameter $\theta = (\mu_0, \mu_1)$
Missing data: $Z_i \in \{0, 1\}$, indicator of the component associated with $X_i$,
$X_i \mid z_i \sim N(\mu_{z_i}, 1), \quad Z_i \sim \mathcal{B}(.7)$
Complete likelihood
$\log L^c(\theta \mid x, z) \propto -\dfrac{1}{2} \sum_{i=1}^n z_i (x_i - \mu_1)^2 - \dfrac{1}{2} \sum_{i=1}^n (1 - z_i)(x_i - \mu_0)^2$
$= -\dfrac{1}{2}\, n_1 (\hat{\mu}_1 - \mu_1)^2 - \dfrac{1}{2}\, (n - n_1)(\hat{\mu}_0 - \mu_0)^2$
with
$n_1 = \sum_{i=1}^n z_i, \quad n_1 \hat{\mu}_1 = \sum_{i=1}^n z_i x_i, \quad (n - n_1)\, \hat{\mu}_0 = \sum_{i=1}^n (1 - z_i) x_i$

Mixtures (2)
At the j-th EM iteration
$Q(\theta \mid \hat{\theta}_{(j)}, x) = -\dfrac{1}{2}\, \mathbb{E}\Big[ n_1 (\hat{\mu}_1 - \mu_1)^2 + (n - n_1)(\hat{\mu}_0 - \mu_0)^2 \,\Big|\, \hat{\theta}_{(j)}, x \Big]$
Differentiating in $\theta$
$\hat{\theta}_{(j+1)} = \begin{pmatrix} \mathbb{E}\big[ n_1 \hat{\mu}_1 \mid \hat{\theta}_{(j)}, x \big] \big/ \mathbb{E}\big[ n_1 \mid \hat{\theta}_{(j)}, x \big] \\ \mathbb{E}\big[ (n - n_1)\, \hat{\mu}_0 \mid \hat{\theta}_{(j)}, x \big] \big/ \mathbb{E}\big[ (n - n_1) \mid \hat{\theta}_{(j)}, x \big] \end{pmatrix}$

Mixtures (3)
Hence $\hat{\theta}_{(j+1)}$ given by
$\begin{pmatrix} \sum_{i=1}^n \mathbb{E}\big[ Z_i \mid \hat{\theta}_{(j)}, x_i \big]\, x_i \Big/ \sum_{i=1}^n \mathbb{E}\big[ Z_i \mid \hat{\theta}_{(j)}, x_i \big] \\ \sum_{i=1}^n \mathbb{E}\big[ (1 - Z_i) \mid \hat{\theta}_{(j)}, x_i \big]\, x_i \Big/ \sum_{i=1}^n \mathbb{E}\big[ (1 - Z_i) \mid \hat{\theta}_{(j)}, x_i \big] \end{pmatrix}$
Conclusion
Step (E) in EM replaces the missing data $Z_i$ with their conditional expectation, given x (an expectation that depends on $\hat{\theta}_{(m)}$).
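A minimal R sketch of these EM updates for the mixture .3 N(mu0, 1) + .7 N(mu1, 1) with unknown means (my own illustration; the sample size, true means and deliberately poor starting values are mine):

## EM for a two-component normal mixture with known weights and unknown means
set.seed(9)
n <- 500
z <- rbinom(n, 1, .7)
x <- rnorm(n, mean = ifelse(z == 1, 1, 0), sd = 1)    # true mu0 = 0, mu1 = 1
mu0 <- -1; mu1 <- 2                                    # poor starting values
for (j in 1:500) {
  w <- .7 * dnorm(x, mu1) / (.3 * dnorm(x, mu0) + .7 * dnorm(x, mu1))  # E[Z_i | x_i]
  mu1.new <- sum(w * x) / sum(w)                 # M-step for mu1
  mu0.new <- sum((1 - w) * x) / sum(1 - w)       # M-step for mu0
  if (max(abs(c(mu0.new - mu0, mu1.new - mu1))) < 1e-8) break
  mu0 <- mu0.new; mu1 <- mu1.new
}
c(mu0, mu1)   # EM estimates of the two means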

Mixtures (3)
[figure: EM iterations for several starting values, in the $(\mu_1, \mu_2)$ plane]

Properties
The EM algorithm is such that
it converges to a local maximum or a saddle-point
it depends on the initial condition $\theta^{(0)}$
it requires several initial values when the likelihood is multimodal