A presentation on Markov Chains, HMMs, and Markov Random Fields, with the necessary algorithms and detailed explanations.
Size: 992.17 KB
Language: en
Added: Mar 29, 2011
Slides: 107
Slide Content
PATTERN RECOGNITION
Markov models
Vu PHAM [email protected]
Department of Computer Science
March 28th, 2011
Contents
• Introduction
  – Introduction
  – Motivation
• Markov Chain
• Hidden Markov Models
• Markov Random Field
Introduction
• Markov processes were first proposed by the Russian mathematician Andrei Markov.
  – He used these processes to investigate Pushkin's poem.
• Nowadays, the Markov property and HMMs are widely used in many domains:
  – Natural Language Processing
  – Speech Recognition
  – Bioinformatics
  – Image/video processing
  – ...
Motivation [0]
• As shown in his 1906 paper, Markov's original motivation was purely mathematical:
  – Applying the Weak Law of Large Numbers to dependent random variables.
• However, we shall not follow this motivation...
Motivation [1]
• From the viewpoint of classification:
  – Context-free classification: Bayes classifier

      p(ω_i | x) > p(ω_j | x),  ∀ j ≠ i

    • Classes are independent.
    • Feature vectors are independent.
  – However, there are some applications where the various classes are closely related:
    • POS tagging, tracking, gene boundary recovery...
Motivation [1]
• Context-dependent classification:
  – s_1, s_2, ..., s_m: a sequence of m feature vectors
  – ω_1, ω_2, ..., ω_m: the classes in which these vectors are classified, ω_i ∈ {1, ..., k}

      s_1 → s_2 → s_3 → ... → s_m

• To apply the Bayes classifier:
  – X = s_1 s_2 ... s_m: the extended feature vector
  – Ω_i = (ω_{i1}, ω_{i2}, ..., ω_{im}): one possible classification; there are M = k^m possible classifications

      p(Ω_i | X) > p(Ω_j | X),  ∀ j ≠ i
      p(X | Ω_i) p(Ω_i) > p(X | Ω_j) p(Ω_j),  ∀ j ≠ i
Motivation [2]
• From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables:

      Hôm nay mùng tám tháng ba
      Chị em phụ nữ đi ra đi vào...
      (a Vietnamese verse, roughly: "Today is the eighth of March / The women keep walking in and out...")

      q_1 = Hôm, q_2 = nay, q_3 = mùng, ..., q_m = vào

• What is p(Hôm nay ... vào) = p(q_1 = Hôm, q_2 = nay, ..., q_m = vào)?

      p(s_m | s_1 s_2 ... s_{m-1}) = p(s_1 s_2 ... s_{m-1} s_m) / p(s_1 s_2 ... s_{m-1})
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field
Markov Chain
• Has N states, called s_1, s_2, ..., s_N.
• There are discrete timesteps, t = 0, t = 1, ...
• On the t'th timestep the system is in exactly one of the available states. Call it the current state, q_t ∈ {s_1, s_2, ..., s_N}.
• Between each timestep, the next state is chosen randomly.
• The current state determines the probability distribution for the next state.
  – Often notated with arcs between states.
• Example with N = 3: at t = 0 the current state is q_t = q_0 = s_3; at t = 1 it is q_t = q_1 = s_2. From the current state q_1 = s_2:

      p(q_2 = s_1 | q_1 = s_2) = 1/2
      p(q_2 = s_2 | q_1 = s_2) = 1/2
      p(q_2 = s_3 | q_1 = s_2) = 0

  and similarly p(q_{t+1} = s_3 | q_t = s_1) = 1, p(q_{t+1} = s_1 | q_t = s_3) = 1/3, p(q_{t+1} = s_2 | q_t = s_3) = 2/3.
Markov Property
• q_{t+1} is conditionally independent of {q_{t-1}, q_{t-2}, ..., q_0} given q_t.
• In other words:

      p(q_{t+1} | q_t, q_{t-1}, ..., q_0) = p(q_{t+1} | q_t)

  The state at timestep t+1 depends only on the state at timestep t.
• A Markov chain of order m (m finite): the state at timestep t+1 depends on the past m states:

      p(q_{t+1} | q_t, q_{t-1}, ..., q_0) = p(q_{t+1} | q_t, q_{t-1}, ..., q_{t-m+1})

• How to represent the joint distribution of (q_0, q_1, q_2, ...) using graphical models?

      q_0 → q_1 → q_2 → q_3 → ...
Markov chain
• So, the chain of {q_t} is called a Markov chain:

      q_0 → q_1 → q_2 → q_3 → ...

• Each q_t takes a value from the countable state-space {s_1, s_2, s_3, ...}.
• Each q_t is observed at a discrete timestep t.
• {q_t} satisfies the Markov property:

      p(q_{t+1} | q_t, q_{t-1}, ..., q_0) = p(q_{t+1} | q_t)

• The transition from q_t to q_{t+1} is calculated from the transition probability matrix:

      Transition probabilities (row: from, column: to)
             s_1   s_2   s_3
      s_1    0     0     1
      s_2    1/2   1/2   0
      s_3    1/3   2/3   0
Markov Chain – Important property
• In a Markov chain, the joint distribution is

      p(q_0, q_1, ..., q_m) = p(q_0) ∏_{j=1}^{m} p(q_j | q_{j-1})

• Why?

      p(q_0, q_1, ..., q_m) = ∏_j p(q_j | previous states)
                            = p(q_0) ∏_{j=1}^{m} p(q_j | q_{j-1})

  due to the Markov property.
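To make this factorization concrete, here is a minimal Python sketch using the three-state chain above; it assumes, as in the example, that the walk starts deterministically in s_3 (the slides do not specify an initial distribution):

```python
# Joint probability of a state path under a Markov chain:
# p(q_0, q_1, ..., q_m) = p(q_0) * prod_j p(q_j | q_{j-1})

# Transition matrix of the three-state example (rows: current, columns: next).
trans = {
    "s1": {"s1": 0.0, "s2": 0.0, "s3": 1.0},
    "s2": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
    "s3": {"s1": 1/3, "s2": 2/3, "s3": 0.0},
}
init = {"s3": 1.0}  # assumed: the example starts in s_3 at t = 0

def path_probability(path):
    """p(q_0) times the product of transition probabilities along the path."""
    p = init.get(path[0], 0.0)
    for prev, cur in zip(path, path[1:]):
        p *= trans[prev][cur]
    return p

print(path_probability(["s3", "s2", "s1", "s3"]))  # 1 * 2/3 * 1/2 * 1 = 1/3
```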
Markov Chain: e.g.
• The state-space of weather: {rain, cloud, wind}

      Transition probabilities (row: today, column: tomorrow)
              Rain   Cloud   Wind
      Rain    1/2    0       1/2
      Cloud   1/3    0       2/3
      Wind    0      1       0

• Markov assumption: the weather on the (t+1)'th day depends only on the t'th day.
• We have observed the weather in a week:

      Day:   0      1      2      3      4
             rain   wind   cloud  rain   wind
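Under the Markov assumption, the probability of that observed week factorizes into a product of one-day transitions. A short Python check (conditioning on day 0 being rain, since the slides give no initial distribution):

```python
# p(week | day 0 = rain) = product of one-day transition probabilities.
trans = {
    "rain":  {"rain": 0.5, "cloud": 0.0, "wind": 0.5},
    "cloud": {"rain": 1/3, "cloud": 0.0, "wind": 2/3},
    "wind":  {"rain": 0.0, "cloud": 1.0, "wind": 0.0},
}

week = ["rain", "wind", "cloud", "rain", "wind"]
p = 1.0
for today, tomorrow in zip(week, week[1:]):
    p *= trans[today][tomorrow]
print(p)  # 0.5 * 1.0 * (1/3) * 0.5 = 1/12
```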
Modeling pairs of sequences
• In many applications, we have to model a pair of sequences.
• Examples:
  – POS tagging in Natural Language Processing (assign each word in a sentence to Noun, Adj, Verb...)
  – Speech recognition (map acoustic sequences to sequences of words)
  – Computational biology (recover gene boundaries in DNA sequences)
  – Video tracking (estimate the underlying model states from the observation sequences)
  – And many others...
Probabilistic models for sequence pairs
• We have two sequences of random variables: X_1, X_2, ..., X_m and S_1, S_2, ..., S_m.
• Intuitively, in a practical system, each X_i corresponds to an observation and each S_i corresponds to a state that generated the observation.
• Let each S_i be in {1, 2, ..., k} and each X_i be in {1, 2, ..., o}.
• How do we model the joint distribution

      p(X_1 = x_1, ..., X_m = x_m, S_1 = s_1, ..., S_m = s_m) ?
Hidden Markov Models (HMMs)
• In HMMs, we assume that

      p(X_1 = x_1, ..., X_m = x_m, S_1 = s_1, ..., S_m = s_m)
        = p(S_1 = s_1) ∏_{j=2}^{m} p(S_j = s_j | S_{j-1} = s_{j-1}) ∏_{j=1}^{m} p(X_j = x_j | S_j = s_j)

• This is often called the independence assumptions in HMMs.
• We are going to prove it in the next slides.
Independence Assumptions in HMMs [1]
• By the chain rule, the following equality is exact:

      p(X_1 = x_1, ..., X_m = x_m, S_1 = s_1, ..., S_m = s_m)
        = p(S_1 = s_1, ..., S_m = s_m) × p(X_1 = x_1, ..., X_m = x_m | S_1 = s_1, ..., S_m = s_m)

  (just as p(A, B, C) = p(A | B, C) p(B, C) = p(A | B, C) p(B | C) p(C).)
• Assumption 1: the state sequence forms a Markov chain:

      p(S_1 = s_1, ..., S_m = s_m) = p(S_1 = s_1) ∏_{j=2}^{m} p(S_j = s_j | S_{j-1} = s_{j-1})
Independence Assumptions in HMMs [2]
• By the chain rule, the following equality is exact:

      p(X_1 = x_1, ..., X_m = x_m | S_1 = s_1, ..., S_m = s_m)
        = ∏_{j=1}^{m} p(X_j = x_j | S_1 = s_1, ..., S_m = s_m, X_1 = x_1, ..., X_{j-1} = x_{j-1})

• Assumption 2: each observation depends only on the underlying state:

      p(X_j = x_j | S_1 = s_1, ..., S_m = s_m, X_1 = x_1, ..., X_{j-1} = x_{j-1}) = p(X_j = x_j | S_j = s_j)

• These two assumptions are often called the independence assumptions in HMMs.
The Model form for HMMs
• The model takes the following form:

      p(x_1, ..., x_m, s_1, ..., s_m; θ) = π(s_1) ∏_{j=2}^{m} t(s_j | s_{j-1}) ∏_{j=1}^{m} e(x_j | s_j)

• Parameters in the model:
  – Initial probabilities π(s) for s ∈ {1, 2, ..., k}
  – Transition probabilities t(s' | s) for s, s' ∈ {1, 2, ..., k}
  – Emission probabilities e(x | s) for s ∈ {1, 2, ..., k} and x ∈ {1, 2, ..., o}
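As a minimal sketch of this factorization (the dictionary-based parameterization of pi, t and e is an illustrative assumption, not a fixed API):

```python
# p(x_1..x_m, s_1..s_m; theta)
#   = pi(s_1) * prod_{j=2..m} t(s_j | s_{j-1}) * prod_{j=1..m} e(x_j | s_j)
def hmm_joint(pi, t, e, state_seq, obs_seq):
    """Joint probability of a state sequence and its observation sequence."""
    p = pi[state_seq[0]]
    for prev, cur in zip(state_seq, state_seq[1:]):
        p *= t[prev][cur]          # transition factor t(s_j | s_{j-1})
    for s, x in zip(state_seq, obs_seq):
        p *= e[s][x]               # emission factor e(x_j | s_j)
    return p
```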
6 components of HMMs
• Discrete timesteps: 1, 2, ...
• Finite state space: {s_i} (N states)
• Events {x_i} (M events)
• Vector of initial probabilities {π_i}: Π = {π_i} = {p(q_1 = s_i)}
• Matrix of transition probabilities: T = {T_ij} = {p(q_{t+1} = s_j | q_t = s_i)}
• Matrix of emission probabilities: E = {E_ij} = {p(o_t = x_j | q_t = s_i)}

The observations at successive timesteps form an observation sequence {o_1, o_2, ..., o_t}, where o_i ∈ {x_1, x_2, ..., x_M}.

      Constraints:  ∑_{i=1}^{N} π_i = 1,  ∑_{j=1}^{N} T_ij = 1,  ∑_{j=1}^{M} E_ij = 1
6 components of HMMs
• Given a specific HMM and an observation sequence, the corresponding sequence of states is generally not deterministic.
• Example: given the observation sequence {x_1, x_3, x_3, x_2}, the corresponding states can be any of the following sequences:
      {s_1, s_2, s_1, s_2}
      {s_1, s_2, s_3, s_2}
      {s_1, s_1, s_1, s_2}
      ...
Here’s an HMM
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models46
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models47
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1o
1
q
2o
2
q
3o
3
0.3 -0.3 -0.4
randomply choice
between S
1, S
2, S
3
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models48
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
q
2o
2
q
3o
3
0.2 -0.8
choice between X
1
and X
3
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models49
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
X
3
q
2o
2
q
3o
3
Go to S
2with
probability 0.8 or
S
1with prob. 0.2
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models50
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
X
3
q
2
S
1
o
2
q
3o
3
0.3 -0.7
choice between X
1
and X
3
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models51
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
X
3
q
2
S
1
o
2
X
1
q
3o
3
Go to S
2with
probability 0.5 or
S
1with prob. 0.5
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models52
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
X
3
q
2
S
1
o
2
X
1
q
3
S
1
o
3
0.3 -0.7
choice between X
1
and X
3
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
We got a sequence
28/03/2011Markov models53
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
X
3
q
2
S
1
o
2
X
1
q
3
S
1
o
3
X
3
We got a sequence
of states and
corresponding
observations!
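The generation procedure above is easy to code. A minimal Python sketch, with the example HMM's tables transcribed into dictionaries (later sketches reuse these pi, T, E definitions):

```python
import random

# The example HMM from the slides.
pi = {"s1": 0.3, "s2": 0.3, "s3": 0.4}
T = {"s1": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
     "s2": {"s1": 0.4, "s2": 0.0, "s3": 0.6},
     "s3": {"s1": 0.2, "s2": 0.8, "s3": 0.0}}
E = {"s1": {"x1": 0.3, "x2": 0.0, "x3": 0.7},
     "s2": {"x1": 0.0, "x2": 0.1, "x3": 0.9},
     "s3": {"x1": 0.2, "x2": 0.0, "x3": 0.8}}

def draw(dist):
    """Sample a key of `dist` with probability proportional to its value."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(m):
    """Generate m (state, observation) pairs from the HMM."""
    q, seq = draw(pi), []
    for _ in range(m):
        seq.append((q, draw(E[q])))   # emit an observation at the current state
        q = draw(T[q])                # then transition to the next state
    return seq

print(generate(3))   # e.g. [('s3', 'x3'), ('s1', 'x1'), ('s1', 'x3')]
```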
Three famous HMM tasks
• Given an HMM Φ = (T, E, π), the three famous HMM tasks are:
• Probability of an observation sequence (state estimation)
  – Given: Φ, observation O = {o_1, o_2, ..., o_t}
  – Goal: p(O | Φ), or equivalently p(s_t = S_i | O)
  – That is, calculating the probability of observing the sequence O over all possible state sequences.
• Most likely explanation (inference)
  – Given: Φ, the observation O = {o_1, o_2, ..., o_t}
  – Goal: Q* = argmax_Q p(Q | O)
  – That is, calculating the best corresponding state sequence, given an observation sequence.
• Learning the HMM
  – Given: an observation sequence O = {o_1, o_2, ..., o_t} (or a set of them) and the corresponding state sequence
  – Goal: estimate the parameters of the HMM Φ = (T, E, π), i.e. the transition matrix, emission matrix and initial probabilities.
Three famous HMM tasks

      Problem            Goal                                  Algorithm          Complexity
      State estimation   Calculating p(O | Φ)                  Forward            O(TN^2)
      Inference          Calculating Q* = argmax_Q p(Q | O)    Viterbi decoding   O(TN^2)
      Learning           Calculating Φ* = argmax_Φ p(O | Φ)    Baum-Welch (EM)    O(TN^2)

      T: number of timesteps; N: number of states
State estimation problem
• Given: Φ = (T, E, π), observation O = {o_1, o_2, ..., o_t}
• Goal: What is p(o_1 o_2 ... o_t)?
• We can do this in a slow, stupid way
  – As shown in the next slide...
Here’s a HMM
•What is p(O) = p(o
1o
2o
3)
= p(o
1=X
3∧o
2=X
1 ∧o
3=X
3)?
•Slow, stupid way:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.90.8
(
)
(
)
( ) ( )
paths of length 3 paths of length 3
|
Q Q
p O p OQ
p
Q
O Q p
∈ ∈
=
=∑ ∑
•How to compute p(Q) for an
arbitrary path Q?
•How to compute p(O|Q) for an
arbitrary path Q?
28/03/2011Markov models60
paths of length 3
Q
∈
Here’s a HMM
•What is p(O) = p(o
1o
2o
3)
= p(o
1=X
3∧o
2=X
1 ∧o
3=X
3)?
•Slow, stupid way:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.90.8
(
)
(
)
( ) ( )
paths of length 3 paths of length 3
|
Q Q
p O p OQ
p
Q
O Q p
∈ ∈
=
=∑ ∑
ππππ
s
1
s
2
s
3
•How to compute p(Q) for an
arbitrary path Q?
•How to compute p(O|Q) for an
arbitrary path Q?
28/03/2011Markov models61
paths of length 3
Q
∈
p(Q) = p(q
1q
2q
3)
= p(q
1)p(q
2|q
1)p(q
3|q
2,q
1) (chain)
= p(q
1)p(q
2|q
1)p(q
3|q
2) (why?)
Example in the case Q=S
3S
1S
1
P(Q) = 0.4 * 0.2 * 0.5 = 0.04
0.3 0.3 0.4
Here’s a HMM
•What is p(O) = p(o
1o
2o
3)
= p(o
1=X
3∧o
2=X
1 ∧o
3=X
3)?
•Slow, stupid way:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.90.8
(
)
(
)
( ) ( )
paths of length 3 paths of length 3
|
Q Q
p O p OQ
p
Q
O Q p
∈ ∈
=
=∑ ∑
ππππ
s
1
s
2
s
3
•How to compute p(Q) for an
arbitrary path Q?
•How to compute p(O|Q) for an
arbitrary path Q?
28/03/2011Markov models62
paths of length 3
Q
∈
p(O|Q) = p(o
1o
2o
3|q
1q
2q
3)
= p(o
1|q
1)p(o
2|q
1)p(o
3|q
3) (why?)
Example in the case Q=S
3S
1S
1
P(O|Q) = p(
X
3|S
3
)p(
X
1|S
1
) p(
X
3|S
1
)
=0.8 * 0.3 * 0.7 = 0.168
0.3 0.3 0.4
Here’s a HMM
•What is p(O) = p(o
1o
2o
3)
= p(o
1=X
3∧o
2=X
1 ∧o
3=X
3)?
•Slow, stupid way:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.90.8
(
)
(
)
( ) ( )
paths of length 3 paths of length 3
|
Q Q
p O p OQ
p
Q
O Q p
∈ ∈
=
=∑ ∑
ππππ
s
1
s
2
s
3
•How to compute p(Q) for an
arbitrary path Q?
•How to compute p(O|Q) for an
arbitrary path Q?
28/03/2011Markov models63
paths of length 3
Q
∈
p(O|Q) = p(o
1o
2o
3|q
1q
2q
3)
= p(o
1|q
1)p(o
2|q
1)p(o
3|q
3) (why?)
Example in the case Q=S
3S
1S
1
P(O|Q) = p(
X
3|S
3
)p(
X
1|S
1
) p(
X
3|S
1
)
=0.8 * 0.3 * 0.7 = 0.168
0.3 0.3 0.4
p(O) needs 27 p(Q)
computations and 27
p(O|Q) computations.
What if the sequence has
20 observations?
So let’s be smarter...
The Forward algorithm
• Given observations o_1 o_2 ... o_T
• Forward probabilities:

      α_t(i) = p(o_1 o_2 ... o_t ∧ q_t = s_i | Φ),  where 1 ≤ t ≤ T

• α_t(i) = probability that, in a random trial:
  – We'd have seen the first t observations
  – We'd have ended up in s_i as the t'th state visited.
• In our example, what is α_2(3)?
α_t(i): easy to define recursively

      α_t(i) = p(o_1 o_2 ... o_t ∧ q_t = s_i | Φ)

      α_1(i) = p(o_1 ∧ q_1 = s_i) = p(q_1 = s_i) p(o_1 | q_1 = s_i) = π_i E_i(o_1)

      α_{t+1}(i) = p(o_1 o_2 ... o_{t+1} ∧ q_{t+1} = s_i)
                 = ∑_{j=1}^{N} p(o_1 o_2 ... o_t ∧ q_t = s_j ∧ o_{t+1} ∧ q_{t+1} = s_i)
                 = ∑_j p(o_{t+1} | q_{t+1} = s_i) p(q_{t+1} = s_i | q_t = s_j) α_t(j)
                 = E_i(o_{t+1}) ∑_j T_ji α_t(j)

  where Π = {π_i} = {p(q_1 = s_i)}, T_ij = p(q_{t+1} = s_j | q_t = s_i), E_ij = p(o_t = x_j | q_t = s_i).
In our example

      α_t(i) = p(o_1 o_2 ... o_t ∧ q_t = s_i | Φ)
      α_1(i) = π_i E_i(o_1)
      α_{t+1}(i) = E_i(o_{t+1}) ∑_j T_ji α_t(j)

We observed: x_1 x_2

      α_1(1) = 0.3 * 0.3 = 0.09
      α_1(2) = 0
      α_1(3) = 0.4 * 0.2 = 0.08
      α_2(1) = 0 * (0.09*0.5 + 0*0.4 + 0.08*0.2) = 0
      α_2(2) = 0.1 * (0.09*0.5 + 0*0 + 0.08*0.8) = 0.0109
      α_2(3) = 0 * (0.09*0 + 0*0.6 + 0.08*0) = 0
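The recursion is only a few lines of Python; this sketch (reusing the pi, T, E dictionaries from the generation sketch) reproduces the numbers above:

```python
def forward(pi, T, E, obs):
    """alphas[t][i] = p(o_1 .. o_{t+1} and q_{t+1} = s_i), 0-indexed in t."""
    states = list(pi)
    alphas = [{i: pi[i] * E[i][obs[0]] for i in states}]         # base case
    for o in obs[1:]:
        prev = alphas[-1]
        alphas.append({i: E[i][o] * sum(T[j][i] * prev[j] for j in states)
                       for i in states})                         # recursion
    return alphas

for t, a in enumerate(forward(pi, T, E, ["x1", "x2"]), start=1):
    print(t, a)   # alpha_1: {s1: 0.09, s2: 0, s3: 0.08}; alpha_2[s2] = 0.0109
```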
Forward probabilities – Trellis
[Trellis diagram: states s_1, ..., s_N on the vertical axis, timesteps t = 1, ..., T on the horizontal axis; node (i, t) holds α_t(i), e.g. α_1(1), ..., α_1(4), α_2(3), ..., α_6(3). Each column is computed from the previous one.]

• Base case:       α_1(i) = π_i E_i(o_1)
• Recursive case:  α_{t+1}(i) = E_i(o_{t+1}) ∑_j T_ji α_t(j)
Forward probabilities
• So, we can cheaply compute:

      α_t(i) = p(o_1 o_2 ... o_t ∧ q_t = s_i)

• How can we cheaply compute p(o_1 o_2 ... o_t)?

      = ∑_i α_t(i)

• How can we cheaply compute p(q_t = s_i | o_1 o_2 ... o_t)?

      = α_t(i) / ∑_j α_t(j)

  Look back at the trellis...
State estimation problem
• State estimation is solved:

      p(O | Φ) = p(o_1 o_2 ... o_t) = ∑_{i=1}^{N} α_t(i)

• Can we utilize the elegant trellis to solve the inference problem?
  – Given an observation sequence O, find the best state sequence Q:

      Q* = argmax_Q p(Q | O)
Inference problem
• Given: Φ = (T, E, π), observation O = {o_1, o_2, ..., o_t}
• Goal: Find

      Q* = argmax_Q p(Q | O) = argmax_{q_1 q_2 ... q_t} p(q_1 q_2 ... q_t | o_1 o_2 ... o_t)

• Practical problems:
  – Speech recognition: given an utterance (sound), what is the best sentence (text) that matches the utterance?
  – Video tracking
  – POS tagging
Inference problem
• We can do this in a slow, stupid way:

      Q* = argmax_Q p(Q | O)
         = argmax_Q p(O | Q) p(Q) / p(O)
         = argmax_Q p(O | Q) p(Q)
         = argmax_Q p(o_1 o_2 ... o_t | Q) p(Q)

• But it's better if we can find another way to compute the most probable path (MPP)...
Efficient MPP computation
• We are going to compute the following variables:

      δ_t(i) = max_{q_1 q_2 ... q_{t-1}} p(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ o_1 o_2 ... o_t)

• δ_t(i) is the probability of the best path of length t-1 which ends up in s_i and emits o_1 ... o_t.
• Define: mpp_t(i) = that path, so δ_t(i) = p(mpp_t(i)).
Viterbi algorithm

      δ_t(i) = max_{q_1 q_2 ... q_{t-1}} p(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ o_1 o_2 ... o_t)
      mpp_t(i) = argmax_{q_1 q_2 ... q_{t-1}} p(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ o_1 o_2 ... o_t)

  Base case (one choice):

      δ_1(i) = p(q_1 = s_i ∧ o_1) = π_i E_i(o_1) = α_1(i)

[Trellis diagram as before, now holding δ_1(1), ..., δ_1(4), δ_2(3), ...]
Viterbi algorithm
• The most probable path with last two states s_i, s_j is the most probable path to s_i, followed by the transition s_i → s_j.
• The probability of that path will be:

      δ_t(i) × p(s_i → s_j ∧ o_{t+1}) = δ_t(i) T_ij E_j(o_{t+1})

• So, the previous state at time t is:

      i* = argmax_i δ_t(i) T_ij E_j(o_{t+1})
Viterbi algorithm
• Summary:

      δ_1(i) = π_i E_i(o_1) = α_1(i)
      δ_{t+1}(j) = max_i δ_t(i) T_ij E_j(o_{t+1})
      i*_t = argmax_i δ_t(i) T_ij E_j(o_{t+1})
      mpp_{t+1}(j) = mpp_t(i*_t) followed by s_j
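A compact Python sketch of Viterbi decoding (reusing the pi, T, E dictionaries from the generation sketch); it maximizes the joint p(Q ∧ O), which has the same argmax as p(Q | O):

```python
def viterbi(pi, T, E, obs):
    """Most probable state path for obs, with its joint probability."""
    states = list(pi)
    delta = {i: pi[i] * E[i][obs[0]] for i in states}        # delta_1
    paths = {i: [i] for i in states}                         # mpp_1
    for o in obs[1:]:
        new_delta, new_paths = {}, {}
        for j in states:
            best = max(states, key=lambda i: delta[i] * T[i][j])   # i*
            new_delta[j] = delta[best] * T[best][j] * E[j][o]
            new_paths[j] = paths[best] + [j]
        delta, paths = new_delta, new_paths
    last = max(states, key=lambda i: delta[i])
    return paths[last], delta[last]

print(viterbi(pi, T, E, ["x3", "x1", "x3"]))
```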
What’s Viterbi used for? •Speech Recognition
28/03/2011Markov models80
Chong, Jike and Yi, Youngmin and Faria, Arlo and Sa tish, Nadathur Rajagopalan and Keutzer, Kurt, “ Data-Parallel Large Vocabulary
Continuous Speech Recognition on Graphics Processor s”, EECS Department, University of California, Berke ley, 2008.
Training HMMs
• Given: a large sequence of observations o_1 o_2 ... o_T and the number of states N.
• Goal: estimation of the parameters Φ = ⟨T, E, π⟩.
• That is, how to design an HMM.
• We will infer the model from a large amount of data o_1 o_2 ... o_T with a big "T".
Training HMMs
• Remember, we have just computed p(o_1 o_2 ... o_T | Φ).
• Now, we have some observations and we want to infer Φ from them.
• So, we could use:
  – MAX LIKELIHOOD:  Φ* = argmax_Φ p(o_1 ... o_T | Φ)
  – BAYES: compute p(Φ | o_1 ... o_T), then take E[Φ] or max_Φ p(Φ | o_1 ... o_T)
Max likelihood for HMMs
• Forward probability: the probability of producing o_1 ... o_t while ending up in state s_i:

      α_t(i) = p(o_1 o_2 ... o_t ∧ q_t = s_i)
      α_1(i) = π_i E_i(o_1)
      α_{t+1}(i) = E_i(o_{t+1}) ∑_j T_ji α_t(j)

• Backward probability: the probability of producing o_{t+1} ... o_T given that at time t, we are at state s_i:

      β_t(i) = p(o_{t+1} o_{t+2} ... o_T | q_t = s_i)
Max likelihood for HMMs – Backward
• Backward probability: easy to define recursively

      β_t(i) = p(o_{t+1} o_{t+2} ... o_T | q_t = s_i)

      β_T(i) = 1

      β_t(i) = ∑_{j=1}^{N} p(o_{t+1} o_{t+2} ... o_T ∧ q_{t+1} = s_j | q_t = s_i)
             = ∑_{j=1}^{N} p(o_{t+1} | q_{t+1} = s_j ∧ q_t = s_i) p(o_{t+2} ... o_T | q_{t+1} = s_j) p(q_{t+1} = s_j | q_t = s_i)
             = ∑_{j=1}^{N} T_ij E_j(o_{t+1}) β_{t+1}(j)
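A Python sketch of the backward pass, in the same style as the forward() sketch above:

```python
def backward(pi, T, E, obs):
    """betas[t][i] = p(o_{t+2} .. o_T | q_{t+1} = s_i), 0-indexed in t;
    the last entry is the base case beta_T(i) = 1."""
    states = list(pi)
    betas = [{i: 1.0 for i in states}]                       # beta_T
    for o in reversed(obs[1:]):                              # o_T, ..., o_2
        nxt = betas[0]
        betas.insert(0, {i: sum(T[i][j] * E[j][o] * nxt[j] for j in states)
                         for i in states})                   # recursion
    return betas
```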
Max likelihood for HMMs
• The probability of traversing a certain arc at time t, given o_1 o_2 ... o_T:

      ε_t(i, j) = p(q_t = s_i ∧ q_{t+1} = s_j | o_1 o_2 ... o_T)
                = p(q_t = s_i ∧ q_{t+1} = s_j ∧ o_1 o_2 ... o_T) / p(o_1 o_2 ... o_T)
                = α_t(i) T_ij E_j(o_{t+1}) β_{t+1}(j) / ∑_{i=1}^{N} α_t(i) β_t(i)
Max likelihood for HMMs
• The probability of being at state s_i at time t, given o_1 o_2 ... o_T:

      γ_t(i) = p(q_t = s_i | o_1 o_2 ... o_T)
             = ∑_{j=1}^{N} p(q_t = s_i ∧ q_{t+1} = s_j | o_1 o_2 ... o_T)
             = ∑_{j=1}^{N} ε_t(i, j)
Max likelihood for HMMs
• Sum over the time index:
  – Expected # of transitions from state i to j in o_1 o_2 ... o_T:

      ∑_{t=1}^{T-1} ε_t(i, j)

  – Expected # of transitions from state i in o_1 o_2 ... o_T:

      ∑_{t=1}^{T-1} γ_t(i) = ∑_{t=1}^{T-1} ∑_{j=1}^{N} ε_t(i, j)
Update parameters

      π̂_i = expected frequency in state i at time t = 1 = γ_1(i)

      T̂_ij = expected # of transitions from state i to j / expected # of transitions from state i
           = ∑_{t=1}^{T-1} ε_t(i, j) / ∑_{t=1}^{T-1} γ_t(i)

      Ê_ik = expected # of transitions from state i with x_k observed / expected # of transitions from state i
           = ∑_{t=1}^{T-1} δ(o_t, x_k) γ_t(i) / ∑_{t=1}^{T-1} γ_t(i)

  where Π = {π_i} = {p(q_1 = s_i)}, T_ij = p(q_{t+1} = s_j | q_t = s_i), E_ij = p(o_t = x_j | q_t = s_i).
The inner loop of Forward-Backward
Given an input sequence:
1. Calculate forward probabilities:
   – Base case:      α_1(i) = π_i E_i(o_1)
   – Recursive case: α_{t+1}(i) = E_i(o_{t+1}) ∑_j T_ji α_t(j)
2. Calculate backward probabilities:
   – Base case:      β_T(i) = 1
   – Recursive case: β_t(i) = ∑_{j=1}^{N} T_ij E_j(o_{t+1}) β_{t+1}(j)
3. Calculate expected counts:

      ε_t(i, j) = α_t(i) T_ij E_j(o_{t+1}) β_{t+1}(j) / ∑_i α_t(i) β_t(i)

4. Update parameters:

      T̂_ij = ∑_{t=1}^{T-1} ε_t(i, j) / ∑_{t=1}^{T-1} ∑_{j=1}^{N} ε_t(i, j)
      Ê_ik = ∑_{t=1}^{T-1} δ(o_t, x_k) γ_t(i) / ∑_{t=1}^{T-1} γ_t(i)
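Putting the four steps together, here is a sketch of one Baum-Welch update for a single observation sequence. It reuses the forward() and backward() sketches above and assumes every state has a nonzero expected count (no smoothing):

```python
def baum_welch_step(pi, T, E, obs):
    """One EM update of (pi, T, E), in the slides' notation."""
    states = list(pi)
    a, b = forward(pi, T, E, obs), backward(pi, T, E, obs)
    p_obs = sum(a[-1][i] for i in states)            # p(O | current model)
    # eps[t][i][j]: probability of traversing arc i -> j at time t+1, given O
    eps = [{i: {j: a[t][i] * T[i][j] * E[j][obs[t + 1]] * b[t + 1][j] / p_obs
                for j in states} for i in states}
           for t in range(len(obs) - 1)]
    gamma = [{i: sum(e[i].values()) for i in states} for e in eps]
    from_i = {i: sum(g[i] for g in gamma) for i in states}
    new_pi = {i: gamma[0][i] for i in states}
    new_T = {i: {j: sum(e[i][j] for e in eps) / from_i[i] for j in states}
             for i in states}
    new_E = {i: {x: sum(g[i] for g, o in zip(gamma, obs) if o == x) / from_i[i]
                 for x in E[i]} for i in states}
    return new_pi, new_T, new_E
```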
Forward-Backward: EM for HMM
• If we knew Φ, we could estimate expectations of quantities such as
  – Expected number of times in state i
  – Expected number of transitions i → j
• If we knew quantities such as
  – Expected number of times in state i
  – Expected number of transitions i → j
  we could compute the max likelihood estimate of Φ = ⟨T, E, Π⟩.
• Also known (for the HMM case) as the Baum-Welch algorithm.
EM for HMM
• Each iteration provides values for all the parameters.
• The new model always improves the likelihood of the training data:

      p(o_1 o_2 ... o_T | Φ̂) ≥ p(o_1 o_2 ... o_T | Φ)

• The algorithm does not guarantee reaching the global maximum.
EM for HMM
• Bad news
  – There are lots of local minima.
• Good news
  – The local minima are usually adequate models of the data.
• Notice
  – EM does not estimate the number of states. That must be given (tradeoffs).
  – Often, HMMs are forced to have some links with zero probability. This is done by setting T_ij = 0 in the initial estimate Φ(0).
  – Easy extension of everything seen today: HMMs with real-valued outputs.
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field (from the viewpoint of classification)
Example: Image segmentation
• Observations: pixel values
• Hidden variable: the class of each pixel
• It's reasonable to think that there are some underlying relationships between neighbouring pixels... Can we use Markov models?
• Errr... the relationships are in 2D!
MRF as a 2D generalization of MC
• Array of observations:

      X = {x_ij},  0 ≤ i < N_x, 0 ≤ j < N_y

• Classes/States:

      S = {s_ij},  s_ij = 1...M

• Our objective is classification: given the array of observations, estimate the corresponding values of the state array S so that p(X | S) p(S) is maximum.
2D context-dependent classification
• Assumptions:
  – The values of elements in S are mutually dependent.
  – The range of this dependence is limited within a neighborhood.
• For each (i, j) element of S, a neighborhood N_ij is defined so that:
  – s_ij ∉ N_ij: the (i, j) element does not belong to its own set of neighbors.
  – s_ij ∈ N_kl ⇔ s_kl ∈ N_ij: if s_ij is a neighbor of s_kl, then s_kl is also a neighbor of s_ij.
2D context-dependent classification
• The Markov property for the 2D case:

      p(s_ij | S_ij) = p(s_ij | N_ij)

  where S_ij includes all the elements of S except the (i, j) one.
• The elegant dynamic programming is not applicable: the problem is much harder now!
• We are going to see an application of MRF for image segmentation and restoration.
MRF for Image Segmentation
• Cliques: sets of pixels which are neighbors of each other (w.r.t. the type of neighborhood).
MRF for Image Segmentation
• Dual lattice
• Line process
MRF for Image Segmentation
• Gibbs distribution:

      π(s) = (1/Z) exp(−U(s)/T)

  – Z: normalizing constant
  – T: parameter
• It turns out that the Gibbs distribution implies an MRF ([Geman 84]).
MRF for Image Segmentation
• A Gibbs conditional probability is of the form:

      p(s_ij | N_ij) = (1/Z) exp( −(1/T) ∑_k F_k(C_k(i, j)) )

  – C_k(i, j): the cliques of the pixel (i, j)
  – F_k: some functions, e.g. a weighted sum over the neighboring pairs:

      (1/T) (α_1 s_ij (s_{i−1,j} + s_{i+1,j}) + α_2 s_ij (s_{i,j−1} + s_{i,j+1}))
MRF for Image Segmentation
• Then, the joint probability for the Gibbs model is

      p(S) = (1/Z) exp( −(1/T) ∑_{(i,j)} ∑_k F_k(C_k(i, j)) )

  – The sum is calculated over all possible cliques associated with the neighborhood.
• We also need to work out p(X | S).
• Then p(X | S) p(S) can be maximized... [Geman 84]
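To make the Gibbs prior concrete, here is a minimal sketch (my own illustrative choice of clique potential, not the one in [Geman 84]): a Potts-style pair potential on horizontal and vertical cliques that penalizes neighboring pixels with different labels.

```python
import math

def gibbs_energy(S, alpha=1.0):
    """U(S): sum of pair potentials over 4-neighbor cliques of the label
    array S (a list of equal-length rows); alpha weights the penalty for
    a pair of neighbors with different labels."""
    rows, cols = len(S), len(S[0])
    U = 0.0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:
                U += alpha * (S[i][j] != S[i][j + 1])   # horizontal clique
            if i + 1 < rows:
                U += alpha * (S[i][j] != S[i + 1][j])   # vertical clique
    return U

def unnormalized_prior(S, T=1.0):
    """p(S) up to the constant Z: exp(-U(S)/T). Smoother labelings
    get higher prior probability."""
    return math.exp(-gibbs_energy(S) / T)
```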
More on Markov models...
• MRF does not stop there... Here are some related models:
  – Conditional random field (CRF)
  – Graphical models
  – ...
• Markov Chain and HMM do not stop there either...
  – Markov chain of order m
  – Continuous-time Markov chains
  – Real-valued observations
  – ...
What you should know
• Markov property, Markov Chain
• HMM:
  – Defining and computing α_t(i)
  – Viterbi algorithm
  – Outline of the EM algorithm for HMM
• Markov Random Field
  – And an application in image segmentation
  – See [Geman 84] for more information.
Q & A
References
• L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
• Andrew W. Moore, "Hidden Markov Models", http://www.autonlab.org/tutorials/
• Geman S., Geman D., "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6(6), pp. 721-741, 1984.