Hidden Markov Models
STATS 305C: Applied Statistics
Scott Linderman
May 16, 2023
1 / 23
Where are we?
Model | Algorithm | Application
Multivariate Normal Models | Conjugate Inference | Bayesian Linear Regression
Hierarchical Models | MCMC (MH & Gibbs) | Modeling Polling Data
Probabilistic PCA & Factor Analysis | MCMC (HMC) | Image Reconstruction
Mixture Models | EM & Variational Inference | Image Segmentation
Mixed Membership Models | Coordinate Ascent VI | Topic Modeling
Variational Autoencoders | Gradient-based VI | Image Generation
State Space Models | Message Passing | Segmenting Video Data
Bayesian Nonparametrics | Fancy MCMC | Modeling Neural Spike Trains
2 / 23
Gaussian Mixture Models
Recall the basic Gaussian mixture model,
$z_t \overset{\text{iid}}{\sim} \mathrm{Cat}(\pi)$  (1)
$x_t \mid z_t \sim \mathcal{N}(\mu_{z_t}, \Sigma_{z_t})$  (2)
where
▶ $z_t \in \{1, \dots, K\}$ is a latent mixture assignment
▶ $x_t \in \mathbb{R}^D$ is an observed data point
▶ $\pi \in \Delta_K$, $\mu_k \in \mathbb{R}^D$, and $\Sigma_k \in \mathbb{R}^{D \times D}$, $\Sigma_k \succeq 0$, are parameters
(Here we've switched to indexing data points by $t$ rather than $n$.)
Let $\Theta$ denote the set of parameters. We can be Bayesian and put a prior on $\Theta$ and run Gibbs or VI, or we can point estimate $\Theta$ with EM, etc.
3 / 23
Gaussian Mixture Models II
Draw the graphical model.
4 / 23
Gaussian Mixture Models III
Recall the EM algorithm for mixture models,
▶ E step: Compute the posterior distribution
$q(z_{1:T}) = p(z_{1:T} \mid x_{1:T}; \Theta)$  (3)
$= \prod_{t=1}^{T} p(z_t \mid x_t; \Theta)$  (4)
$= \prod_{t=1}^{T} q_t(z_t)$  (5)
▶ M step: Maximize the ELBO wrt $\Theta$,
$\mathcal{L}(\Theta) = \mathbb{E}_{q(z_{1:T})}[\log p(x_{1:T}, z_{1:T}; \Theta) - \log q(z_{1:T})]$  (6)
$= \mathbb{E}_{q(z_{1:T})}[\log p(x_{1:T}, z_{1:T}; \Theta)] + c$.  (7)
For exponential family mixture models, the M-step only requires expected sufficient statistics.
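As a concrete (if minimal) sketch of this E step, Eqs. (3)-(5) amount to computing a normalized responsibility vector $q_t(z_t)$ for each data point. The function and argument names below are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_e_step(X, pi, mus, Sigmas):
    """Responsibilities q_t(z_t = k) = p(z_t = k | x_t; Theta).

    X: (T, D) data, pi: (K,) mixture weights,
    mus: (K, D) means, Sigmas: (K, D, D) covariances.
    """
    T, K = X.shape[0], len(pi)
    log_q = np.zeros((T, K))
    for k in range(K):
        # log pi_k + log N(x_t | mu_k, Sigma_k) for every t
        log_q[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
    # Normalize each row in a numerically stable way (subtract the max, then exponentiate).
    log_q -= log_q.max(axis=1, keepdims=True)
    q = np.exp(log_q)
    return q / q.sum(axis=1, keepdims=True)
```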
5 / 23
Hidden Markov Models
Hidden Markov Models (HMMs) are like mixture models with temporal dependencies between the
mixture assignments.
This graphical model says that the joint distribution factors as,
$p(z_{1:T}, x_{1:T}) = p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \prod_{t=1}^{T} p(x_t \mid z_t)$.  (8)
We call this an HMM because the hidden states follow a Markov chain, $p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1})$.
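As a sanity check on Eq. (8), here is a minimal NumPy sketch that evaluates the log joint of a discrete-emission HMM under this factorization. The emission matrix B is an illustrative assumption (a categorical observation model); pi0 and P anticipate the notation on the next slide.

```python
import numpy as np

def hmm_log_joint(z, x, pi0, P, B):
    """log p(z_{1:T}, x_{1:T}) for a discrete-emission HMM.

    z, x: integer arrays of length T (0-indexed states / symbols),
    pi0: (K,) initial distribution, P: (K, K) row-stochastic
    transition matrix, B: (K, V) emission probabilities.
    """
    lp = np.log(pi0[z[0]])                   # log p(z_1)
    lp += np.sum(np.log(P[z[:-1], z[1:]]))   # sum_{t>=2} log p(z_t | z_{t-1})
    lp += np.sum(np.log(B[z, x]))            # sum_t log p(x_t | z_t)
    return lp
```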
6 / 23
Hidden Markov Models II
An HMM consists of three components:
1. $z_1 \sim \mathrm{Cat}(\pi_0)$
2. $z_t \sim \mathrm{Cat}(P_{z_{t-1}})$, where $P \in [0,1]^{K \times K}$ is a row-stochastic transition matrix with rows $P_k$
3. $x_t \sim p(\cdot \mid \theta_{z_t})$
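A minimal sketch of ancestral sampling from this generative process, assuming a categorical emission model with an illustrative emission matrix B (B[k] is the emission distribution of state k); the names are not from the slides.

```python
import numpy as np

def sample_hmm(T, pi0, P, B, seed=0):
    """Ancestral sampling: z_1 ~ Cat(pi0), z_t ~ Cat(P[z_{t-1}]), x_t ~ Cat(B[z_t])."""
    rng = np.random.default_rng(seed)
    K, V = B.shape
    z = np.zeros(T, dtype=int)
    x = np.zeros(T, dtype=int)
    z[0] = rng.choice(K, p=pi0)
    x[0] = rng.choice(V, p=B[z[0]])
    for t in range(1, T):
        z[t] = rng.choice(K, p=P[z[t - 1]])   # row of the transition matrix
        x[t] = rng.choice(V, p=B[z[t]])       # emission given the current state
    return z, x
```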
7 / 23
Example: The occasionally dishonest casino
Figure: An occasionally dishonest casino that sometimes throws loaded dice.
From https://probml.github.io/dynamax/notebooks/hmm/casino_hmm_inference.html
8 / 23
Example: HMM for splice site recognition
Figure: A toy model for parsing a genome to find 5' splice sites. From Eddy [2004].
Question: Suppose the splice site always had a GT sequence. How would you change the model to detect such sites?
9 / 23
Example: Autoregressive HMM for video segmentation
Figure: Segmenting videos of freely moving mice [Wiltschko et al., 2015]. (Show video.)
10 / 23
Hidden Markov Models III
We are interested in questions like:
▶ What are the predictive distributions $p(z_{t+1} \mid x_{1:t})$?
▶ What is the posterior marginal distribution $p(z_t \mid x_{1:T})$?
▶ What is the posterior pairwise marginal distribution $p(z_t, z_{t+1} \mid x_{1:T})$?
▶ What is the posterior mode $z^\star_{1:T} = \arg\max p(z_{1:T} \mid x_{1:T})$?
▶ How can we sample the posterior $p(z_{1:T} \mid x_{1:T})$ of an HMM?
▶ What is the marginal likelihood $p(x_{1:T})$?
▶ How can we learn the parameters of an HMM?
Question: Why might these sound like hard problems?
11 / 23
Computing the predictive distributions
The predictive distributions give the probability of the latent state $z_{t+1}$ given observations up to but not including time $t+1$. Let,
$\alpha_{t+1}(z_{t+1}) \triangleq p(z_{t+1}, x_{1:t})$  (9)
$= \sum_{z_1=1}^{K} \cdots \sum_{z_t=1}^{K} p(z_1) \prod_{s=1}^{t} p(x_s \mid z_s)\, p(z_{s+1} \mid z_s)$  (10)
$= \sum_{z_t=1}^{K} \left( \sum_{z_1=1}^{K} \cdots \sum_{z_{t-1}=1}^{K} p(z_1) \prod_{s=1}^{t-1} p(x_s \mid z_s)\, p(z_{s+1} \mid z_s) \right) p(x_t \mid z_t)\, p(z_{t+1} \mid z_t)$  (11)
$= \sum_{z_t=1}^{K} \alpha_t(z_t)\, p(x_t \mid z_t)\, p(z_{t+1} \mid z_t)$.  (12)
We call $\alpha_t(z_t)$ the forward messages. We can compute them recursively! The base case is $p(z_1 \mid \emptyset) \triangleq p(z_1)$.
12 / 23
Computing the predictive distributions II
We can also write these recursions in a vectorized form. Let
$\alpha_t = \begin{bmatrix} \alpha_t(z_t = 1) \\ \vdots \\ \alpha_t(z_t = K) \end{bmatrix} = \begin{bmatrix} p(z_t = 1, x_{1:t-1}) \\ \vdots \\ p(z_t = K, x_{1:t-1}) \end{bmatrix}$ and $l_t = \begin{bmatrix} p(x_t \mid z_t = 1) \\ \vdots \\ p(x_t \mid z_t = K) \end{bmatrix}$  (13)
both be vectors in $\mathbb{R}_+^K$. Then,
$\alpha_{t+1} = P^\top (\alpha_t \odot l_t)$  (14)
where $\odot$ denotes the Hadamard (elementwise) product and $P$ is the transition matrix.
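A minimal NumPy sketch of the recursion in Eq. (14), assuming the per-step likelihoods have been precomputed into a matrix L whose rows are the $l_t$ vectors (an illustrative choice, not notation from the slides). As written it uses unnormalized messages, which will underflow on long sequences; see the stability fix a few slides ahead.

```python
import numpy as np

def forward_messages(pi0, P, L):
    """Unnormalized forward messages alpha_t(k) = p(z_t = k, x_{1:t-1}).

    pi0: (K,) initial distribution, P: (K, K) transition matrix,
    L: (T, K) likelihoods with L[t, k] = p(x_t | z_t = k).
    """
    T, K = L.shape
    alphas = np.zeros((T, K))
    alphas[0] = pi0                               # alpha_1 = p(z_1)
    for t in range(T - 1):
        alphas[t + 1] = P.T @ (alphas[t] * L[t])  # Eq. (14)
    return alphas
```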
13 / 23
Computing the predictive distributions III
Finally, to get the predictive distributions we just have to normalize,
$p(z_{t+1} \mid x_{1:t}) \propto p(z_{t+1}, x_{1:t}) = \alpha_{t+1}(z_{t+1})$.  (15)
Question: What does the normalizing constant tell us?
14 / 23
Computing the posterior marginal distributions
The posterior marginal distributions give the probability of the latent state $z_t$ given all the observations up to time $T$.
$p(z_t \mid x_{1:T}) \propto \sum_{z_1=1}^{K} \cdots \sum_{z_{t-1}=1}^{K} \sum_{z_{t+1}=1}^{K} \cdots \sum_{z_T=1}^{K} p(z_{1:T}, x_{1:T})$  (16)
$= \left( \sum_{z_1=1}^{K} \cdots \sum_{z_{t-1}=1}^{K} p(z_1) \prod_{s=1}^{t-1} p(x_s \mid z_s)\, p(z_{s+1} \mid z_s) \right) \times p(x_t \mid z_t) \times \left( \sum_{z_{t+1}=1}^{K} \cdots \sum_{z_T=1}^{K} \prod_{u=t+1}^{T} p(z_u \mid z_{u-1})\, p(x_u \mid z_u) \right)$  (17)
$= \alpha_t(z_t) \times p(x_t \mid z_t) \times \beta_t(z_t)$  (18)
where we have introduced the backward messages $\beta_t(z_t)$.
15 / 23
Computing the backward messages
The backward messages can be computed recursively too,
$\beta_t(z_t) \triangleq \sum_{z_{t+1}=1}^{K} \cdots \sum_{z_T=1}^{K} \prod_{u=t+1}^{T} p(z_u \mid z_{u-1})\, p(x_u \mid z_u)$  (19)
$= \sum_{z_{t+1}=1}^{K} p(z_{t+1} \mid z_t)\, p(x_{t+1} \mid z_{t+1}) \left( \sum_{z_{t+2}=1}^{K} \cdots \sum_{z_T=1}^{K} \prod_{u=t+2}^{T} p(z_u \mid z_{u-1})\, p(x_u \mid z_u) \right)$  (20)
$= \sum_{z_{t+1}=1}^{K} p(z_{t+1} \mid z_t)\, p(x_{t+1} \mid z_{t+1})\, \beta_{t+1}(z_{t+1})$.  (21)
For the base case, let $\beta_T(z_T) = 1$.
16 / 23
Computing the backward messages (vectorized)
Let
$\beta_t = \begin{bmatrix} \beta_t(z_t = 1) \\ \vdots \\ \beta_t(z_t = K) \end{bmatrix}$  (22)
be a vector in $\mathbb{R}_+^K$. Then,
$\beta_t = P(\beta_{t+1} \odot l_{t+1})$.  (23)
Let $\beta_T = \mathbf{1}_K$.
Now we have everything we need to compute the posterior marginal,
$p(z_t = k \mid x_{1:T}) = \dfrac{\alpha_{t,k}\, l_{t,k}\, \beta_{t,k}}{\sum_{j=1}^{K} \alpha_{t,j}\, l_{t,j}\, \beta_{t,j}}$.  (24)
We just derived the forward-backward algorithm for HMMs [Rabiner and Juang, 1986].
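Putting the two recursions and Eq. (24) together, here is a minimal, self-contained NumPy sketch of a forward-backward pass. It works with unnormalized messages, so it is only reliable for short sequences (a later slide addresses numerical stability); the argument names are illustrative, not from the slides.

```python
import numpy as np

def forward_backward(pi0, P, L):
    """Posterior marginals p(z_t = k | x_{1:T}) via Eqs. (14), (23), (24).

    pi0: (K,) initial distribution, P: (K, K) transition matrix,
    L: (T, K) likelihoods with L[t, k] = p(x_t | z_t = k).
    """
    T, K = L.shape
    alphas = np.zeros((T, K))
    betas = np.zeros((T, K))
    alphas[0] = pi0                                    # alpha_1 = p(z_1)
    for t in range(T - 1):
        alphas[t + 1] = P.T @ (alphas[t] * L[t])       # Eq. (14)
    betas[-1] = 1.0                                    # beta_T = 1_K
    for t in range(T - 2, -1, -1):
        betas[t] = P @ (betas[t + 1] * L[t + 1])       # Eq. (23)
    unnorm = alphas * L * betas                        # alpha_{t,k} l_{t,k} beta_{t,k}
    return unnorm / unnorm.sum(axis=1, keepdims=True)  # Eq. (24)
```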
17 / 23
What do the backward messages represent?
Question: If the forward messages represent the predictive probabilities $\alpha_{t+1}(z_{t+1}) = p(z_{t+1}, x_{1:t})$, what do the backward messages represent?
18 / 23
Computing the posterior pairwise marginals
Exercise: Use the forward and backward messages to compute the posterior pairwise marginals $p(z_t, z_{t+1} \mid x_{1:T})$.
19 / 23
Normalizing the messages for numerical stability
If you're working with long time series, especially in 32-bit floating point, you need to be careful.
The messages involve products of probabilities, which can quickly underflow.
There's a simple fix though: after each step, re-normalize the messages so that they sum to one. That is, replace
$\alpha_{t+1} = P^\top (\alpha_t \odot l_t)$  (25)
with
$\tilde{\alpha}_{t+1} = \frac{1}{A_t} P^\top (\tilde{\alpha}_t \odot l_t)$  (26)
$A_t = \sum_{k=1}^{K} \sum_{j=1}^{K} P_{jk}\, \tilde{\alpha}_{t,j}\, l_{t,j} \equiv \sum_{j=1}^{K} \tilde{\alpha}_{t,j}\, l_{t,j}$ (since $P$ is row-stochastic).  (27)
This leads to a nice interpretation: the normalized messages are predictive probabilities $\tilde{\alpha}_{t+1,k} = p(z_{t+1} = k \mid x_{1:t})$, and the normalizing constants are $A_t = p(x_t \mid x_{1:t-1})$.
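A minimal NumPy sketch of the normalized recursion in Eqs. (26)-(27), with the same illustrative arguments as the earlier sketches. Accumulating $\log A_t$ also yields the log marginal likelihood, since $p(x_{1:T}) = \prod_t p(x_t \mid x_{1:t-1})$.

```python
import numpy as np

def normalized_forward(pi0, P, L):
    """Normalized forward messages and log p(x_{1:T}).

    Returns alpha_tilde with alpha_tilde[t, k] = p(z_t = k | x_{1:t-1})
    and the accumulated log marginal likelihood.
    """
    T, K = L.shape
    alpha_tilde = np.zeros((T, K))
    alpha_tilde[0] = pi0
    log_marginal = 0.0
    for t in range(T - 1):
        unnorm = P.T @ (alpha_tilde[t] * L[t])
        A_t = unnorm.sum()                   # A_t = p(x_t | x_{1:t-1}), Eq. (27)
        alpha_tilde[t + 1] = unnorm / A_t    # Eq. (26)
        log_marginal += np.log(A_t)
    # Final factor: p(x_T | x_{1:T-1}) = sum_k alpha_tilde[T-1, k] * L[T-1, k].
    log_marginal += np.log(alpha_tilde[-1] @ L[-1])
    return alpha_tilde, log_marginal
```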
20 / 23
EM for Hidden Markov Models
Now we can put it all together. To perform EM in an HMM,
▶ E step: Compute the posterior distribution
$q(z_{1:T}) = p(z_{1:T} \mid x_{1:T}; \Theta)$.  (28)
(Really, run the forward-backward algorithm to get posterior marginals and pairwise marginals.)
▶ M step: Maximize the ELBO wrt $\Theta$,
$\mathcal{L}(\Theta) = \mathbb{E}_{q(z_{1:T})}[\log p(x_{1:T}, z_{1:T}; \Theta)] + c$  (29)
$= \mathbb{E}_{q(z_{1:T})}\!\left[ \sum_{k=1}^{K} \mathbb{I}[z_1 = k] \log \pi_{0,k} \right] + \mathbb{E}_{q(z_{1:T})}\!\left[ \sum_{t=1}^{T-1} \sum_{i=1}^{K} \sum_{j=1}^{K} \mathbb{I}[z_t = i, z_{t+1} = j] \log P_{i,j} \right] + \mathbb{E}_{q(z_{1:T})}\!\left[ \sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{I}[z_t = k] \log p(x_t; \theta_k) \right]$  (30)
For exponential family observations, the M-step only requires expected sufficient statistics.
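For instance, the first two terms of Eq. (30) have closed-form maximizers: the updates for $\pi_0$ and $P$ are just normalized expected counts. A rough sketch, assuming the E step has produced posterior marginals gammas[t, k] = q(z_t = k) and pairwise marginals xis[t, i, j] = q(z_t = i, z_{t+1} = j) (illustrative names); the emission update depends on the chosen observation family.

```python
import numpy as np

def m_step_pi0_P(gammas, xis, eps=1e-12):
    """Closed-form M-step updates for pi_0 and P from expected sufficient statistics.

    gammas: (T, K) posterior marginals, xis: (T-1, K, K) pairwise marginals.
    """
    pi0 = gammas[0] / gammas[0].sum()          # E[I[z_1 = k]], normalized
    expected_trans = xis.sum(axis=0)           # sum_t E[I[z_t = i, z_{t+1} = j]]
    P = expected_trans / (expected_trans.sum(axis=1, keepdims=True) + eps)
    return pi0, P
```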
21 / 23
What else?
▶How can we sample the posterior?
▶How can we find the posterior mode?
▶How can we choose the number of states?
▶What if my transition matrix is sparse?
22 / 23
References
Sean R. Eddy. What is a hidden Markov model? Nature Biotechnology, 22(10):1315–1316, 2004.
Alexander B. Wiltschko, Matthew J. Johnson, Giuliano Iurilli, Ralph E. Peterson, Jesse M. Katon, Stan L. Pashkovski, Victoria E. Abraira, Ryan P. Adams, and Sandeep Robert Datta. Mapping sub-second structure in mouse behavior. Neuron, 88(6):1121–1135, 2015.
Lawrence Rabiner and Biing-Hwang Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.
23 / 23