Lecture 13 HMMs and the derivations for perusal.pdf


About This Presentation

Hidden Markov models (HMMs) and how to derive the math for doing inference in them.


Slide Content

Hidden Markov Models
STATS 305C: Applied Statistics
Scott Linderman
May 16, 2023
1 / 23

Where are we?
Model                               | Algorithm                  | Application
Multivariate Normal Models          | Conjugate Inference        | Bayesian Linear Regression
Hierarchical Models                 | MCMC (MH & Gibbs)          | Modeling Polling Data
Probabilistic PCA & Factor Analysis | MCMC (HMC)                 | Image Reconstruction
Mixture Models                      | EM & Variational Inference | Image Segmentation
Mixed Membership Models             | Coordinate Ascent VI       | Topic Modeling
Variational Autoencoders            | Gradient-based VI          | Image Generation
State Space Models                  | Message Passing            | Segmenting Video Data
Bayesian Nonparametrics             | Fancy MCMC                 | Modeling Neural Spike Trains
2 / 23

Gaussian Mixture Models
Recall the basic Gaussian mixture model,
    z_t ∼ Cat(π),  i.i.d.                          (1)
    x_t | z_t ∼ N(µ_{z_t}, Σ_{z_t})                (2)
where
▶ z_t ∈ {1, ..., K} is a latent mixture assignment
▶ x_t ∈ R^D is an observed data point
▶ π ∈ ∆_K, µ_k ∈ R^D, and Σ_k ∈ R^{D×D}, Σ_k ⪰ 0 are parameters
(Here we’ve switched to indexing data points by t rather than n.)
Let Θ denote the set of parameters. We can be Bayesian and put a prior on Θ and run Gibbs or VI, or
we can point estimate Θ with EM, etc.
3 / 23
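Not from the slides, but as a concrete reference point, here is a minimal NumPy sketch of sampling from this generative model. The names (sample_gmm, pi, mus, Sigmas) and the toy parameter values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def sample_gmm(T, pi, mus, Sigmas, seed=0):
    """Draw T points: z_t ~ Cat(pi), x_t | z_t ~ N(mu_{z_t}, Sigma_{z_t})."""
    rng = np.random.default_rng(seed)
    K, D = mus.shape
    z = rng.choice(K, size=T, p=pi)  # latent mixture assignments
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return z, x

# Toy parameters Theta = (pi, mus, Sigmas); values are made up for illustration.
pi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
Sigmas = np.stack([np.eye(2)] * 3)
z, x = sample_gmm(T=500, pi=pi, mus=mus, Sigmas=Sigmas)
```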

Gaussian Mixture Models II
Draw the graphical model.
4 / 23

Gaussian Mixture Models III
Recall the EM algorithm for mixture models,
▶ E step: Compute the posterior distribution
    q(z_{1:T}) = p(z_{1:T} | x_{1:T}; Θ)                                        (3)
               = ∏_{t=1}^T p(z_t | x_t; Θ)                                      (4)
               = ∏_{t=1}^T q_t(z_t)                                             (5)
▶ M step: Maximize the ELBO wrt Θ,
    L(Θ) = E_{q(z_{1:T})}[log p(x_{1:T}, z_{1:T}; Θ) − log q(z_{1:T})]          (6)
         = E_{q(z_{1:T})}[log p(x_{1:T}, z_{1:T}; Θ)] + c.                      (7)
For exponential family mixture models, the M-step only requires expected sufficient statistics.
5 / 23
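A rough sketch of the E step above for a Gaussian mixture, computing the per-point responsibilities q_t(z_t = k). It assumes SciPy is available for the Gaussian log-density; the function and argument names are mine, not the course's.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_e_step(x, pi, mus, Sigmas):
    """Compute q_t(z_t = k) = p(z_t = k | x_t; Theta) for every data point."""
    T, K = x.shape[0], pi.shape[0]
    log_resp = np.zeros((T, K))
    for k in range(K):
        # log pi_k + log N(x_t | mu_k, Sigma_k)
        log_resp[:, k] = np.log(pi[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(x)
    # Normalize each row in a numerically stable way.
    log_resp -= log_resp.max(axis=1, keepdims=True)
    resp = np.exp(log_resp)
    return resp / resp.sum(axis=1, keepdims=True)
```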

Hidden Markov Models
Hidden Markov Models (HMMs) are like mixture models with temporal dependencies between the
mixture assignments.
This graphical model says that the joint distribution factors as,
    p(z_{1:T}, x_{1:T}) = p(z_1) ∏_{t=2}^T p(z_t | z_{t−1}) ∏_{t=1}^T p(x_t | z_t).    (8)
We call this an HMM because the hidden states follow a Markov chain, p(z_1) ∏_{t=2}^T p(z_t | z_{t−1}).
6 / 23
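To make the factorization in (8) concrete, here is a small sketch that evaluates the log joint for an HMM with categorical emissions; the emission matrix B and the argument names are assumptions made purely for illustration.

```python
import numpy as np

def hmm_log_joint(z, x, pi0, P, B):
    """log p(z_{1:T}, x_{1:T}) = log p(z_1) + sum_t log p(z_t | z_{t-1}) + sum_t log p(x_t | z_t).

    z, x : (T,) integer state and observation sequences
    pi0  : (K,) initial distribution;  P : (K, K) transitions;  B : (K, M) emissions
    """
    lp = np.log(pi0[z[0]])                  # initial state term
    lp += np.sum(np.log(P[z[:-1], z[1:]]))  # transition terms, t = 2..T
    lp += np.sum(np.log(B[z, x]))           # emission terms, t = 1..T
    return lp
```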

Hidden Markov Models II
An HMM consists of three components:
1. z_1 ∼ Cat(π_0)
2. z_t ∼ Cat(P_{z_{t−1}}) where P ∈ [0,1]^{K×K} is a row-stochastic transition matrix with rows P_k.
3. x_t ∼ p(· | θ_{z_t})
7 / 23
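A minimal ancestral-sampling sketch of these three components, again assuming categorical emissions purely for concreteness; the two-state parameter values below are made up (loosely anticipating the casino example on the next slide).

```python
import numpy as np

def sample_hmm(T, pi0, P, B, seed=0):
    """Sample z_1 ~ Cat(pi0), z_t ~ Cat(P[z_{t-1}]), x_t ~ Cat(B[z_t])."""
    rng = np.random.default_rng(seed)
    K, M = B.shape
    z = np.zeros(T, dtype=int)
    x = np.zeros(T, dtype=int)
    z[0] = rng.choice(K, p=pi0)
    x[0] = rng.choice(M, p=B[z[0]])
    for t in range(1, T):
        z[t] = rng.choice(K, p=P[z[t - 1]])  # transition: row z_{t-1} of P
        x[t] = rng.choice(M, p=B[z[t]])      # emission from state z_t
    return z, x

# Illustrative two-state example: state 0 = fair die, state 1 = loaded die.
pi0 = np.array([0.5, 0.5])
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])
B = np.array([[1/6] * 6,
              [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
z, x = sample_hmm(T=300, pi0=pi0, P=P, B=B)
```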

Example: The occasionally dishonest casino
Figure: An occasionally dishonest casino that sometimes throws loaded dice.
From https://probml.github.io/dynamax/notebooks/hmm/casino_hmm_inference.html
8 / 23

Example: HMM for splice site recognition
Figure: A toy model for parsing a genome to find 5’ splice sites. From Eddy [2004].
Question: Suppose the splice site always had a GT sequence. How would you change the model to
detect such sites?
9 / 23

Example: Autoregressive HMM for video segmentation
Figure: Segmenting videos of freely moving mice [Wiltschko et al., 2015]. (Show video.)
10 / 23

Hidden Markov Models III
We are interested in questions like:
▶ What are the predictive distributions p(z_{t+1} | x_{1:t})?
▶ What is the posterior marginal distribution p(z_t | x_{1:T})?
▶ What is the posterior pairwise marginal distribution p(z_t, z_{t+1} | x_{1:T})?
▶ What is the posterior mode z*_{1:T} = arg max p(z_{1:T} | x_{1:T})?
▶ How can we sample the posterior p(z_{1:T} | x_{1:T}) of an HMM?
▶ What is the marginal likelihood p(x_{1:T})?
▶ How can we learn the parameters of an HMM?
Question: Why might these sound like hard problems?
11 / 23

Computing the predictive distributions
The predictive distributions give the probability of the latent state z_{t+1} given observations up to but
not including time t+1. Let,
    α_{t+1}(z_{t+1}) ≜ p(z_{t+1}, x_{1:t})                                                        (9)
        = ∑_{z_1=1}^K ··· ∑_{z_t=1}^K p(z_1) ∏_{s=1}^t p(x_s | z_s) p(z_{s+1} | z_s)              (10)
        = ∑_{z_t=1}^K [ ( ∑_{z_1=1}^K ··· ∑_{z_{t−1}=1}^K p(z_1) ∏_{s=1}^{t−1} p(x_s | z_s) p(z_{s+1} | z_s) ) p(x_t | z_t) p(z_{t+1} | z_t) ]   (11)
        = ∑_{z_t=1}^K α_t(z_t) p(x_t | z_t) p(z_{t+1} | z_t).                                     (12)
We call α_t(z_t) the forward messages. We can compute them recursively! The base case is
p(z_1 | ∅) ≜ p(z_1).
12 / 23

Computing the predictive distributions II
We can also write these recursions in a vectorized form. Let
    α_t = [α_t(z_t=1), ..., α_t(z_t=K)]^⊤ = [p(z_t=1, x_{1:t−1}), ..., p(z_t=K, x_{1:t−1})]^⊤
and l_t = [p(x_t | z_t=1), ..., p(x_t | z_t=K)]^⊤                                                 (13)
both be vectors in R_+^K. Then,
    α_{t+1} = P^⊤ (α_t ⊙ l_t)                                                                     (14)
where ⊙ denotes the Hadamard (elementwise) product and P is the transition matrix.
13 / 23
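A sketch of the forward recursion from the last two slides, written directly in the vectorized form α_{t+1} = P^⊤(α_t ⊙ l_t). The interface (a T×K likelihood matrix lls with lls[t, k] = p(x_t | z_t = k)) is an assumption of this sketch, not something the slides prescribe; it returns the unnormalized messages, so see the numerical-stability slide later before using it on long sequences.

```python
import numpy as np

def forward_messages(pi0, P, lls):
    """Unnormalized forward messages alpha_t(k) = p(z_t = k, x_{1:t-1}).

    pi0 : (K,) initial distribution
    P   : (K, K) row-stochastic transition matrix
    lls : (T, K) likelihoods, lls[t, k] = p(x_t | z_t = k)
    """
    T, K = lls.shape
    alphas = np.zeros((T, K))
    alphas[0] = pi0                                 # base case: alpha_1(z_1) = p(z_1)
    for t in range(T - 1):
        alphas[t + 1] = P.T @ (alphas[t] * lls[t])  # alpha_{t+1} = P^T (alpha_t ⊙ l_t)
    return alphas
```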

Computing the predictive distributions III
Finally, to get the predictive distributions we just have to normalize,
    p(z_{t+1} | x_{1:t}) ∝ p(z_{t+1}, x_{1:t}) = α_{t+1}(z_{t+1}).                               (15)
Question: What does the normalizing constant tell us?
14 / 23

Computing the posterior marginal distributions
The posterior marginal distributions give the probability of the latent state z_t given all the observations
up to time T.
    p(z_t | x_{1:T}) ∝ ∑_{z_1=1}^K ··· ∑_{z_{t−1}=1}^K ∑_{z_{t+1}=1}^K ··· ∑_{z_T=1}^K p(z_{1:T}, x_{1:T})                        (16)
        = [ ∑_{z_1=1}^K ··· ∑_{z_{t−1}=1}^K p(z_1) ∏_{s=1}^{t−1} p(x_s | z_s) p(z_{s+1} | z_s) ]
          × p(x_t | z_t)
          × [ ∑_{z_{t+1}=1}^K ··· ∑_{z_T=1}^K ∏_{u=t+1}^T p(z_u | z_{u−1}) p(x_u | z_u) ]                                          (17)
        = α_t(z_t) × p(x_t | z_t) × β_t(z_t)                                                                                       (18)
where we have introduced the backward messages β_t(z_t).
15 / 23

Computing the backward messages
The backward messages can be computed recursively too,
    β_t(z_t) ≜ ∑_{z_{t+1}=1}^K ··· ∑_{z_T=1}^K ∏_{u=t+1}^T p(z_u | z_{u−1}) p(x_u | z_u)                                           (19)
        = ∑_{z_{t+1}=1}^K p(z_{t+1} | z_t) p(x_{t+1} | z_{t+1}) [ ∑_{z_{t+2}=1}^K ··· ∑_{z_T=1}^K ∏_{u=t+2}^T p(z_u | z_{u−1}) p(x_u | z_u) ]   (20)
        = ∑_{z_{t+1}=1}^K p(z_{t+1} | z_t) p(x_{t+1} | z_{t+1}) β_{t+1}(z_{t+1}).                                                  (21)
For the base case, let β_T(z_T) = 1.
16 / 23

Computing the backward messages (vectorized)
Let
    β_t = [β_t(z_t=1), ..., β_t(z_t=K)]^⊤                                                        (22)
be a vector in R_+^K. Then,
    β_t = P (β_{t+1} ⊙ l_{t+1}).                                                                 (23)
Let β_T = 1_K.
Now we have everything we need to compute the posterior marginal,
    p(z_t = k | x_{1:T}) = α_{t,k} l_{t,k} β_{t,k} / ( ∑_{j=1}^K α_{t,j} l_{t,j} β_{t,j} ).      (24)
We just derived the forward-backward algorithm for HMMs [Rabiner and Juang, 1986].
17 / 23
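Putting equations (14), (23), and (24) together, a hedged forward-backward sketch; the forward pass is repeated here so the function stands alone, and, as on the slides, the messages are left unnormalized, which is only safe for short sequences.

```python
import numpy as np

def forward_backward(pi0, P, lls):
    """Posterior marginals p(z_t = k | x_{1:T}) ∝ alpha_{t,k} l_{t,k} beta_{t,k}, as in eq. (24)."""
    T, K = lls.shape
    alphas = np.zeros((T, K))
    alphas[0] = pi0                                    # alpha_1 = p(z_1)
    for t in range(T - 1):
        alphas[t + 1] = P.T @ (alphas[t] * lls[t])     # forward pass, eq. (14)
    betas = np.ones((T, K))                            # base case beta_T = 1_K
    for t in range(T - 2, -1, -1):
        betas[t] = P @ (betas[t + 1] * lls[t + 1])     # backward pass, eq. (23)
    unnorm = alphas * lls * betas                      # alpha ⊙ l ⊙ beta at each time step
    return unnorm / unnorm.sum(axis=1, keepdims=True)  # normalize each row, eq. (24)
```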

What do the backward messages represent?
Question: If the forward messages represent the predictive probabilities α_{t+1}(z_{t+1}) = p(z_{t+1}, x_{1:t}),
what do the backward messages represent?
18 / 23

Computing the posterior pairwise marginals
Exercise: Use the forward and backward messages to compute the posterior pairwise marginals
p(z_t, z_{t+1} | x_{1:T}).
19 / 23

Normalizing the messages for numerical stability
If you’re working with long time series, especially if you’re working with 32-bit floating point, you need
to be careful.
The messages involve products of probabilities, which can quickly underflow.
There’s a simple fix though: after each step, re-normalize the messages so that they sum to one. I.e.,
replace
    α_{t+1} = P^⊤ (α_t ⊙ l_t)                                                                    (25)
with
    α̃_{t+1} = (1 / A_t) P^⊤ (α̃_t ⊙ l_t)                                                          (26)
    A_t = ∑_{k=1}^K ∑_{j=1}^K P_{jk} α̃_{t,j} l_{t,j} = ∑_{j=1}^K α̃_{t,j} l_{t,j}  (since P is row-stochastic).   (27)
This leads to a nice interpretation: the normalized messages are predictive likelihoods
α̃_{t+1,k} = p(z_{t+1} = k | x_{1:t}), and the normalizing constants are A_t = p(x_t | x_{1:t−1}).
20 / 23
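A sketch of the normalized recursion (26)–(27). Besides the filtered distributions, it accumulates ∑_t log A_t, which by the interpretation above equals the log marginal likelihood log p(x_{1:T}); the (T×K likelihood matrix) interface is the same assumed one as in the earlier sketches.

```python
import numpy as np

def normalized_forward(pi0, P, lls):
    """Normalized messages alpha_tilde_{t,k} = p(z_t = k | x_{1:t-1}) and log p(x_{1:T})."""
    T, K = lls.shape
    alphas = np.zeros((T, K))
    alphas[0] = pi0
    log_marginal = 0.0
    for t in range(T - 1):
        unnorm = P.T @ (alphas[t] * lls[t])
        A_t = unnorm.sum()                        # A_t = p(x_t | x_{1:t-1})
        alphas[t + 1] = unnorm / A_t
        log_marginal += np.log(A_t)
    log_marginal += np.log(alphas[-1] @ lls[-1])  # final factor p(x_T | x_{1:T-1})
    return alphas, log_marginal
```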

EM for Hidden Markov Models
Now we can put it all together. To perform EM in an HMM,
▶ E step: Compute the posterior distribution
    q(z_{1:T}) = p(z_{1:T} | x_{1:T}; Θ).                                                        (28)
(Really, run the forward-backward algorithm to get posterior marginals and pairwise marginals.)
▶ M step: Maximize the ELBO wrt Θ,
    L(Θ) = E_{q(z_{1:T})}[log p(x_{1:T}, z_{1:T}; Θ)] + c                                        (29)
         = E_{q(z_{1:T})}[ ∑_{k=1}^K I[z_1 = k] log π_{0,k} ]
           + E_{q(z_{1:T})}[ ∑_{t=1}^{T−1} ∑_{i=1}^K ∑_{j=1}^K I[z_t = i, z_{t+1} = j] log P_{i,j} ]
           + E_{q(z_{1:T})}[ ∑_{t=1}^T ∑_{k=1}^K I[z_t = k] log p(x_t; θ_k) ]                    (30)
For exponential family observations, the M-step only requires expected sufficient statistics.
21 / 23
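As an illustration of that last point, a sketch of the closed-form M step for categorical emissions, given expected sufficient statistics from the E step: posterior marginals gammas (T×K) and summed pairwise marginals xi_sum (K×K). These inputs and names are assumptions about how the statistics are stored, not part of the lecture.

```python
import numpy as np

def m_step_categorical(x, gammas, xi_sum, num_obs):
    """Closed-form M step from expected sufficient statistics.

    x       : (T,) integer observations
    gammas  : (T, K) posterior marginals E[I[z_t = k]]
    xi_sum  : (K, K) summed pairwise marginals sum_t E[I[z_t = i, z_{t+1} = j]]
    num_obs : number of observation symbols M
    """
    T, K = gammas.shape
    pi0 = gammas[0]                                  # maximizes the first term of (30)
    P = xi_sum / xi_sum.sum(axis=1, keepdims=True)   # row-normalize expected transition counts
    B = np.zeros((K, num_obs))
    for m in range(num_obs):
        B[:, m] = gammas[x == m].sum(axis=0)         # expected emission counts for symbol m
    B /= B.sum(axis=1, keepdims=True)
    return pi0, P, B
```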

What else?
▶How can we sample the posterior?
▶How can we find the posterior mode?
▶How can we choose the number of states?
▶What if my transition matrix is sparse?
22 / 23

References
Sean R Eddy. What is a hidden Markov model? Nature Biotechnology, 22(10):1315–1316, 2004.
Alexander B Wiltschko, Matthew J Johnson, Giuliano Iurilli, Ralph E Peterson, Jesse M Katon, Stan L
Pashkovski, Victoria E Abraira, Ryan P Adams, and Sandeep Robert Datta. Mapping sub-second
structure in mouse behavior. Neuron, 88(6):1121–1135, 2015.
Lawrence Rabiner and Biinghwang Juang. An introduction to hidden Markov models. IEEE ASSP
Magazine, 3(1):4–16, 1986.
23 / 23