A presentation on Markov Chains, HMMs, and Markov Random Fields, with the necessary algorithms and detailed explanations.
Size: 992.17 KB
Language: en
Added: Mar 29, 2011
Slides: 107
Slide Content
PATTERN RECOGNITION
Markov models
Vu PHAM [email protected]
Department of Computer Science
March 28th, 2011
Contents
• Introduction
  – Introduction
  – Motivation
• Markov Chain
• Hidden Markov Models
• Markov Random Field
Introduction
• Markov processes were first proposed by the Russian mathematician Andrei Markov.
  – He used these processes to investigate Pushkin's poem.
• Nowadays, the Markov property and HMMs are widely used in many domains:
  – Natural Language Processing
  – Speech Recognition
  – Bioinformatics
  – Image/video processing
  – ...
Motivation [0]
• As shown in his 1906 paper, Markov's original motivation was purely mathematical:
  – Applying the Weak Law of Large Numbers to dependent random variables.
• However, we shall not follow this motivation...
Motivation [1]
• From the viewpoint of classification:
  – Context-free classification: Bayes classifier

      p(ω_i | x) > p(ω_j | x),  ∀ j ≠ i

    • Classes are independent.
    • Feature vectors are independent.
  – However, there are some applications where the various classes are closely related:
    • POS tagging, tracking, gene boundary recovery...
Motivation [1]
• Context-dependent classification:
  – s_1, s_2, ..., s_m: a sequence of m feature vectors
  – ω_1, ω_2, ..., ω_m: the classes in which these vectors are classified, ω_i ∈ {1, ..., k}

      s_1 → s_2 → s_3 → ... → s_m

• To apply the Bayes classifier:
  – X = s_1 s_2 ... s_m: the extended feature vector
  – Ω_i = (ω_{i1}, ω_{i2}, ..., ω_{im}): one possible classification; there are M = k^m possible classifications

      p(Ω_i | X) > p(Ω_j | X),  ∀ j ≠ i
      p(X | Ω_i) p(Ω_i) > p(X | Ω_j) p(Ω_j),  ∀ j ≠ i
Motivation [2]
• From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables:

      Hôm nay mùng tám tháng ba
      Chị em phụ nữ đi ra đi vào...
      (a Vietnamese verse, roughly: "Today is the eighth of March / The women keep walking in and out...")

      q_1 = Hôm, q_2 = nay, q_3 = mùng, ..., q_m = vào

• What is p(Hôm nay ... vào) = p(q_1 = Hôm, q_2 = nay, ..., q_m = vào)?

      p(s_m | s_1 s_2 ... s_{m-1}) = p(s_1 s_2 ... s_{m-1} s_m) / p(s_1 s_2 ... s_{m-1})
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field
Markov Chain
• Has N states, called s_1, s_2, ..., s_N.
• There are discrete timesteps, t = 0, t = 1, ...
• On the t'th timestep the system is in exactly one of the available states. Call it the current state, q_t ∈ {s_1, s_2, ..., s_N}.
• Between each timestep, the next state is chosen randomly.
• The current state determines the probability distribution for the next state.
  – Often notated with arcs between states.
• Example with N = 3: at t = 0 the current state is q_t = q_0 = s_3; at t = 1 it is q_t = q_1 = s_2. From the current state q_1 = s_2:

      p(q_2 = s_1 | q_1 = s_2) = 1/2
      p(q_2 = s_2 | q_1 = s_2) = 1/2
      p(q_2 = s_3 | q_1 = s_2) = 0

  and similarly p(q_{t+1} = s_3 | q_t = s_1) = 1, p(q_{t+1} = s_1 | q_t = s_3) = 1/3, p(q_{t+1} = s_2 | q_t = s_3) = 2/3.
Markov Property
• q_{t+1} is conditionally independent of {q_{t-1}, q_{t-2}, ..., q_0} given q_t.
• In other words:

      p(q_{t+1} | q_t, q_{t-1}, ..., q_0) = p(q_{t+1} | q_t)

  The state at timestep t+1 depends only on the state at timestep t.
• A Markov chain of order m (m finite): the state at timestep t+1 depends on the past m states:

      p(q_{t+1} | q_t, q_{t-1}, ..., q_0) = p(q_{t+1} | q_t, q_{t-1}, ..., q_{t-m+1})

• How to represent the joint distribution of (q_0, q_1, q_2, ...) using graphical models?

      q_0 → q_1 → q_2 → q_3 → ...
Markov chain
• So, the chain of {q_t} is called a Markov chain:

      q_0 → q_1 → q_2 → q_3 → ...

• Each q_t takes a value from the countable state-space {s_1, s_2, s_3, ...}.
• Each q_t is observed at a discrete timestep t.
• {q_t} satisfies the Markov property:

      p(q_{t+1} | q_t, q_{t-1}, ..., q_0) = p(q_{t+1} | q_t)

• The transition from q_t to q_{t+1} is calculated from the transition probability matrix:

      Transition probabilities (row: from, column: to)
             s_1   s_2   s_3
      s_1    0     0     1
      s_2    1/2   1/2   0
      s_3    1/3   2/3   0
Markov Chain – Important property
• In a Markov chain, the joint distribution is

      p(q_0, q_1, ..., q_m) = p(q_0) ∏_{j=1}^{m} p(q_j | q_{j-1})

• Why?

      p(q_0, q_1, ..., q_m) = ∏_j p(q_j | previous states)
                            = p(q_0) ∏_{j=1}^{m} p(q_j | q_{j-1})

  due to the Markov property.
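To make this factorization concrete, here is a minimal Python sketch using the three-state chain above; it assumes, as in the example, that the walk starts deterministically in s_3 (the slides do not specify an initial distribution):

```python
# Joint probability of a state path under a Markov chain:
# p(q_0, q_1, ..., q_m) = p(q_0) * prod_j p(q_j | q_{j-1})

# Transition matrix of the three-state example (rows: current, columns: next).
trans = {
    "s1": {"s1": 0.0, "s2": 0.0, "s3": 1.0},
    "s2": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
    "s3": {"s1": 1/3, "s2": 2/3, "s3": 0.0},
}
init = {"s3": 1.0}  # assumed: the example starts in s_3 at t = 0

def path_probability(path):
    """p(q_0) times the product of transition probabilities along the path."""
    p = init.get(path[0], 0.0)
    for prev, cur in zip(path, path[1:]):
        p *= trans[prev][cur]
    return p

print(path_probability(["s3", "s2", "s1", "s3"]))  # 1 * 2/3 * 1/2 * 1 = 1/3
```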
Markov Chain: e.g.
• The state-space of weather: {rain, cloud, wind}

      Transition probabilities (row: today, column: tomorrow)
              Rain   Cloud   Wind
      Rain    1/2    0       1/2
      Cloud   1/3    0       2/3
      Wind    0      1       0

• Markov assumption: the weather on the (t+1)'th day depends only on the t'th day.
• We have observed the weather in a week:

      Day:   0      1      2      3      4
             rain   wind   cloud  rain   wind
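Under the Markov assumption, the probability of that observed week factorizes into a product of one-day transitions. A short Python check (conditioning on day 0 being rain, since the slides give no initial distribution):

```python
# p(week | day 0 = rain) = product of one-day transition probabilities.
trans = {
    "rain":  {"rain": 0.5, "cloud": 0.0, "wind": 0.5},
    "cloud": {"rain": 1/3, "cloud": 0.0, "wind": 2/3},
    "wind":  {"rain": 0.0, "cloud": 1.0, "wind": 0.0},
}

week = ["rain", "wind", "cloud", "rain", "wind"]
p = 1.0
for today, tomorrow in zip(week, week[1:]):
    p *= trans[today][tomorrow]
print(p)  # 0.5 * 1.0 * (1/3) * 0.5 = 1/12
```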
Modeling pairs of sequences
• In many applications, we have to model a pair of sequences.
• Examples:
  – POS tagging in Natural Language Processing (assign each word in a sentence to Noun, Adj, Verb...)
  – Speech recognition (map acoustic sequences to sequences of words)
  – Computational biology (recover gene boundaries in DNA sequences)
  – Video tracking (estimate the underlying model states from the observation sequences)
  – And many others...
Probabilistic models for sequence pairs
• We have two sequences of random variables: X_1, X_2, ..., X_m and S_1, S_2, ..., S_m.
• Intuitively, in a practical system, each X_i corresponds to an observation and each S_i corresponds to a state that generated the observation.
• Let each S_i be in {1, 2, ..., k} and each X_i be in {1, 2, ..., o}.
• How do we model the joint distribution

      p(X_1 = x_1, ..., X_m = x_m, S_1 = s_1, ..., S_m = s_m) ?
Hidden Markov Models (HMMs)
• In HMMs, we assume that

      p(X_1 = x_1, ..., X_m = x_m, S_1 = s_1, ..., S_m = s_m)
        = p(S_1 = s_1) ∏_{j=2}^{m} p(S_j = s_j | S_{j-1} = s_{j-1}) ∏_{j=1}^{m} p(X_j = x_j | S_j = s_j)

• This is often called the independence assumptions in HMMs.
• We are going to prove it in the next slides.
Independence Assumptions in HMMs [1]
• By the chain rule, the following equality is exact:

      p(X_1 = x_1, ..., X_m = x_m, S_1 = s_1, ..., S_m = s_m)
        = p(S_1 = s_1, ..., S_m = s_m) × p(X_1 = x_1, ..., X_m = x_m | S_1 = s_1, ..., S_m = s_m)

  (just as p(A, B, C) = p(A | B, C) p(B, C) = p(A | B, C) p(B | C) p(C).)
• Assumption 1: the state sequence forms a Markov chain:

      p(S_1 = s_1, ..., S_m = s_m) = p(S_1 = s_1) ∏_{j=2}^{m} p(S_j = s_j | S_{j-1} = s_{j-1})
Independence Assumptions in HMMs [2]
• By the chain rule, the following equality is exact:

      p(X_1 = x_1, ..., X_m = x_m | S_1 = s_1, ..., S_m = s_m)
        = ∏_{j=1}^{m} p(X_j = x_j | S_1 = s_1, ..., S_m = s_m, X_1 = x_1, ..., X_{j-1} = x_{j-1})

• Assumption 2: each observation depends only on the underlying state:

      p(X_j = x_j | S_1 = s_1, ..., S_m = s_m, X_1 = x_1, ..., X_{j-1} = x_{j-1}) = p(X_j = x_j | S_j = s_j)

• These two assumptions are often called the independence assumptions in HMMs.
The Model form for HMMs
• The model takes the following form:

      p(x_1, ..., x_m, s_1, ..., s_m; θ) = π(s_1) ∏_{j=2}^{m} t(s_j | s_{j-1}) ∏_{j=1}^{m} e(x_j | s_j)

• Parameters in the model:
  – Initial probabilities π(s) for s ∈ {1, 2, ..., k}
  – Transition probabilities t(s' | s) for s, s' ∈ {1, 2, ..., k}
  – Emission probabilities e(x | s) for s ∈ {1, 2, ..., k} and x ∈ {1, 2, ..., o}
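As a minimal sketch of this factorization (the dictionary-based parameterization of pi, t and e is an illustrative assumption, not a fixed API):

```python
# p(x_1..x_m, s_1..s_m; theta)
#   = pi(s_1) * prod_{j=2..m} t(s_j | s_{j-1}) * prod_{j=1..m} e(x_j | s_j)
def hmm_joint(pi, t, e, state_seq, obs_seq):
    """Joint probability of a state sequence and its observation sequence."""
    p = pi[state_seq[0]]
    for prev, cur in zip(state_seq, state_seq[1:]):
        p *= t[prev][cur]          # transition factor t(s_j | s_{j-1})
    for s, x in zip(state_seq, obs_seq):
        p *= e[s][x]               # emission factor e(x_j | s_j)
    return p
```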
6 components of HMMs
• Discrete timesteps: 1, 2, ...
• Finite state space: {s_i} (N states)
• Events {x_i} (M events)
• Vector of initial probabilities {π_i}: Π = {π_i} = {p(q_1 = s_i)}
• Matrix of transition probabilities: T = {T_ij} = {p(q_{t+1} = s_j | q_t = s_i)}
• Matrix of emission probabilities: E = {E_ij} = {p(o_t = x_j | q_t = s_i)}

The observations at successive timesteps form an observation sequence {o_1, o_2, ..., o_t}, where o_i ∈ {x_1, x_2, ..., x_M}.

      Constraints:  ∑_{i=1}^{N} π_i = 1,  ∑_{j=1}^{N} T_ij = 1,  ∑_{j=1}^{M} E_ij = 1
6 components of HMMs
• Given a specific HMM and an observation sequence, the corresponding sequence of states is generally not deterministic.
• Example: given the observation sequence {x_1, x_3, x_3, x_2}, the corresponding states can be any of the following sequences:
      {s_1, s_2, s_1, s_2}
      {s_1, s_2, s_3, s_2}
      {s_1, s_1, s_1, s_2}
      ...
Here’s an HMM
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models46
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models47
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1o
1
q
2o
2
q
3o
3
0.3 -0.3 -0.4
randomply choice
between S
1, S
2, S
3
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models48
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
q
2o
2
q
3o
3
0.2 -0.8
choice between X
1
and X
3
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models49
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
X
3
q
2o
2
q
3o
3
Go to S
2with
probability 0.8 or
S
1with prob. 0.2
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models50
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
X
3
q
2
S
1
o
2
q
3o
3
0.3 -0.7
choice between X
1
and X
3
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models51
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
X
3
q
2
S
1
o
2
X
1
q
3o
3
Go to S
2with
probability 0.5 or
S
1with prob. 0.5
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
28/03/2011Markov models52
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
X
3
q
2
S
1
o
2
X
1
q
3
S
1
o
3
0.3 -0.7
choice between X
1
and X
3
Here’s a HMM
•Start randomly in state 1, 2
or 3.
•Choose a output at each
state in random.
•Let’s generate a sequence
of observations:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.9
0.8
We got a sequence
28/03/2011Markov models53
T s
1
s
2
s
3
s
1
0.5 0.5 0
s
2
0.4 0 0.6
s
3
0.2 0.8 0
E x
1
x
2
x
3
s
1
0.3 0 0.7
s
2
0 0.1 0.9
s
3
0.2 0 0.8
ππππ
s
1
s
2
s
3
0.3 0.3 0.4
q
1
S
3
o
1
X
3
q
2
S
1
o
2
X
1
q
3
S
1
o
3
X
3
We got a sequence
of states and
corresponding
observations!
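The generation procedure above is easy to code. A minimal Python sketch, with the example HMM's tables transcribed into dictionaries (later sketches reuse these pi, T, E definitions):

```python
import random

# The example HMM from the slides.
pi = {"s1": 0.3, "s2": 0.3, "s3": 0.4}
T = {"s1": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
     "s2": {"s1": 0.4, "s2": 0.0, "s3": 0.6},
     "s3": {"s1": 0.2, "s2": 0.8, "s3": 0.0}}
E = {"s1": {"x1": 0.3, "x2": 0.0, "x3": 0.7},
     "s2": {"x1": 0.0, "x2": 0.1, "x3": 0.9},
     "s3": {"x1": 0.2, "x2": 0.0, "x3": 0.8}}

def draw(dist):
    """Sample a key of `dist` with probability proportional to its value."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(m):
    """Generate m (state, observation) pairs from the HMM."""
    q, seq = draw(pi), []
    for _ in range(m):
        seq.append((q, draw(E[q])))   # emit an observation at the current state
        q = draw(T[q])                # then transition to the next state
    return seq

print(generate(3))   # e.g. [('s3', 'x3'), ('s1', 'x1'), ('s1', 'x3')]
```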
Three famous HMM tasks
• Given an HMM Φ = (T, E, π), the three famous HMM tasks are:
• Probability of an observation sequence (state estimation)
  – Given: Φ, observation O = {o_1, o_2, ..., o_t}
  – Goal: p(O | Φ), or equivalently p(s_t = S_i | O)
  – That is, calculating the probability of observing the sequence O over all possible state sequences.
• Most likely explanation (inference)
  – Given: Φ, the observation O = {o_1, o_2, ..., o_t}
  – Goal: Q* = argmax_Q p(Q | O)
  – That is, calculating the best corresponding state sequence, given an observation sequence.
• Learning the HMM
  – Given: an observation sequence O = {o_1, o_2, ..., o_t} (or a set of them) and the corresponding state sequence
  – Goal: estimate the parameters of the HMM Φ = (T, E, π), i.e. the transition matrix, emission matrix and initial probabilities.
Three famous HMM tasks

      Problem            Goal                                  Algorithm          Complexity
      State estimation   Calculating p(O | Φ)                  Forward            O(TN^2)
      Inference          Calculating Q* = argmax_Q p(Q | O)    Viterbi decoding   O(TN^2)
      Learning           Calculating Φ* = argmax_Φ p(O | Φ)    Baum-Welch (EM)    O(TN^2)

      T: number of timesteps; N: number of states
State estimation problem
• Given: Φ = (T, E, π), observation O = {o_1, o_2, ..., o_t}
• Goal: What is p(o_1 o_2 ... o_t)?
• We can do this in a slow, stupid way
  – As shown in the next slide...
Here’s a HMM
•What is p(O) = p(o
1o
2o
3)
= p(o
1=X
3∧o
2=X
1 ∧o
3=X
3)?
•Slow, stupid way:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.90.8
(
)
(
)
( ) ( )
paths of length 3 paths of length 3
|
Q Q
p O p OQ
p
Q
O Q p
∈ ∈
=
=∑ ∑
•How to compute p(Q) for an
arbitrary path Q?
•How to compute p(O|Q) for an
arbitrary path Q?
28/03/2011Markov models60
paths of length 3
Q
∈
Here’s a HMM
•What is p(O) = p(o
1o
2o
3)
= p(o
1=X
3∧o
2=X
1 ∧o
3=X
3)?
•Slow, stupid way:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.90.8
(
)
(
)
( ) ( )
paths of length 3 paths of length 3
|
Q Q
p O p OQ
p
Q
O Q p
∈ ∈
=
=∑ ∑
ππππ
s
1
s
2
s
3
•How to compute p(Q) for an
arbitrary path Q?
•How to compute p(O|Q) for an
arbitrary path Q?
28/03/2011Markov models61
paths of length 3
Q
∈
p(Q) = p(q
1q
2q
3)
= p(q
1)p(q
2|q
1)p(q
3|q
2,q
1) (chain)
= p(q
1)p(q
2|q
1)p(q
3|q
2) (why?)
Example in the case Q=S
3S
1S
1
P(Q) = 0.4 * 0.2 * 0.5 = 0.04
0.3 0.3 0.4
Here’s a HMM
•What is p(O) = p(o
1o
2o
3)
= p(o
1=X
3∧o
2=X
1 ∧o
3=X
3)?
•Slow, stupid way:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.90.8
(
)
(
)
( ) ( )
paths of length 3 paths of length 3
|
Q Q
p O p OQ
p
Q
O Q p
∈ ∈
=
=∑ ∑
ππππ
s
1
s
2
s
3
•How to compute p(Q) for an
arbitrary path Q?
•How to compute p(O|Q) for an
arbitrary path Q?
28/03/2011Markov models62
paths of length 3
Q
∈
p(O|Q) = p(o
1o
2o
3|q
1q
2q
3)
= p(o
1|q
1)p(o
2|q
1)p(o
3|q
3) (why?)
Example in the case Q=S
3S
1S
1
P(O|Q) = p(
X
3|S
3
)p(
X
1|S
1
) p(
X
3|S
1
)
=0.8 * 0.3 * 0.7 = 0.168
0.3 0.3 0.4
Here’s a HMM
•What is p(O) = p(o
1o
2o
3)
= p(o
1=X
3∧o
2=X
1 ∧o
3=X
3)?
•Slow, stupid way:
s
1
s
2
s
3
x
1
x
2
x
3
0.5
0.4
0.5
0.2
0.6
0.8
0.3
0.2
0.7
0.1
0.90.8
(
)
(
)
( ) ( )
paths of length 3 paths of length 3
|
Q Q
p O p OQ
p
Q
O Q p
∈ ∈
=
=∑ ∑
ππππ
s
1
s
2
s
3
•How to compute p(Q) for an
arbitrary path Q?
•How to compute p(O|Q) for an
arbitrary path Q?
28/03/2011Markov models63
paths of length 3
Q
∈
p(O|Q) = p(o
1o
2o
3|q
1q
2q
3)
= p(o
1|q
1)p(o
2|q
1)p(o
3|q
3) (why?)
Example in the case Q=S
3S
1S
1
P(O|Q) = p(
X
3|S
3
)p(
X
1|S
1
) p(
X
3|S
1
)
=0.8 * 0.3 * 0.7 = 0.168
0.3 0.3 0.4
p(O) needs 27 p(Q)
computations and 27
p(O|Q) computations.
What if the sequence has
20 observations?
So let’s be smarter...
The Forward algorithm
• Given observations o_1 o_2 ... o_T
• Forward probabilities:

      α_t(i) = p(o_1 o_2 ... o_t ∧ q_t = s_i | Φ),  where 1 ≤ t ≤ T

• α_t(i) = probability that, in a random trial:
  – We'd have seen the first t observations
  – We'd have ended up in s_i as the t'th state visited.
• In our example, what is α_2(3)?
α_t(i): easy to define recursively

      α_t(i) = p(o_1 o_2 ... o_t ∧ q_t = s_i | Φ)

      α_1(i) = p(o_1 ∧ q_1 = s_i) = p(q_1 = s_i) p(o_1 | q_1 = s_i) = π_i E_i(o_1)

      α_{t+1}(i) = p(o_1 o_2 ... o_{t+1} ∧ q_{t+1} = s_i)
                 = ∑_{j=1}^{N} p(o_1 o_2 ... o_t ∧ q_t = s_j ∧ o_{t+1} ∧ q_{t+1} = s_i)
                 = ∑_j p(o_{t+1} | q_{t+1} = s_i) p(q_{t+1} = s_i | q_t = s_j) α_t(j)
                 = E_i(o_{t+1}) ∑_j T_ji α_t(j)

  where Π = {π_i} = {p(q_1 = s_i)}, T_ij = p(q_{t+1} = s_j | q_t = s_i), E_ij = p(o_t = x_j | q_t = s_i).
In our example

      α_t(i) = p(o_1 o_2 ... o_t ∧ q_t = s_i | Φ)
      α_1(i) = π_i E_i(o_1)
      α_{t+1}(i) = E_i(o_{t+1}) ∑_j T_ji α_t(j)

We observed: x_1 x_2

      α_1(1) = 0.3 * 0.3 = 0.09
      α_1(2) = 0
      α_1(3) = 0.4 * 0.2 = 0.08
      α_2(1) = 0 * (0.09*0.5 + 0*0.4 + 0.08*0.2) = 0
      α_2(2) = 0.1 * (0.09*0.5 + 0*0 + 0.08*0.8) = 0.0109
      α_2(3) = 0 * (0.09*0 + 0*0.6 + 0.08*0) = 0
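The recursion is only a few lines of Python; this sketch (reusing the pi, T, E dictionaries from the generation sketch) reproduces the numbers above:

```python
def forward(pi, T, E, obs):
    """alphas[t][i] = p(o_1 .. o_{t+1} and q_{t+1} = s_i), 0-indexed in t."""
    states = list(pi)
    alphas = [{i: pi[i] * E[i][obs[0]] for i in states}]         # base case
    for o in obs[1:]:
        prev = alphas[-1]
        alphas.append({i: E[i][o] * sum(T[j][i] * prev[j] for j in states)
                       for i in states})                         # recursion
    return alphas

for t, a in enumerate(forward(pi, T, E, ["x1", "x2"]), start=1):
    print(t, a)   # alpha_1: {s1: 0.09, s2: 0, s3: 0.08}; alpha_2[s2] = 0.0109
```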
Forward probabilities – Trellis
[Trellis diagram: states s_1, ..., s_N on the vertical axis, timesteps t = 1, ..., T on the horizontal axis; node (i, t) holds α_t(i), e.g. α_1(1), ..., α_1(4), α_2(3), ..., α_6(3). Each column is computed from the previous one.]

• Base case:       α_1(i) = π_i E_i(o_1)
• Recursive case:  α_{t+1}(i) = E_i(o_{t+1}) ∑_j T_ji α_t(j)
Forward probabilities
• So, we can cheaply compute:

      α_t(i) = p(o_1 o_2 ... o_t ∧ q_t = s_i)

• How can we cheaply compute p(o_1 o_2 ... o_t)?

      = ∑_i α_t(i)

• How can we cheaply compute p(q_t = s_i | o_1 o_2 ... o_t)?

      = α_t(i) / ∑_j α_t(j)

  Look back at the trellis...
State estimation problem
• State estimation is solved:

      p(O | Φ) = p(o_1 o_2 ... o_t) = ∑_{i=1}^{N} α_t(i)

• Can we utilize the elegant trellis to solve the inference problem?
  – Given an observation sequence O, find the best state sequence Q:

      Q* = argmax_Q p(Q | O)
Inference problem
• Given: Φ = (T, E, π), observation O = {o_1, o_2, ..., o_t}
• Goal: Find

      Q* = argmax_Q p(Q | O) = argmax_{q_1 q_2 ... q_t} p(q_1 q_2 ... q_t | o_1 o_2 ... o_t)

• Practical problems:
  – Speech recognition: given an utterance (sound), what is the best sentence (text) that matches the utterance?
  – Video tracking
  – POS tagging
Inference problem
• We can do this in a slow, stupid way:

      Q* = argmax_Q p(Q | O)
         = argmax_Q p(O | Q) p(Q) / p(O)
         = argmax_Q p(O | Q) p(Q)
         = argmax_Q p(o_1 o_2 ... o_t | Q) p(Q)

• But it's better if we can find another way to compute the most probable path (MPP)...
Efficient MPP computation
• We are going to compute the following variables:

      δ_t(i) = max_{q_1 q_2 ... q_{t-1}} p(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ o_1 o_2 ... o_t)

• δ_t(i) is the probability of the best path of length t-1 which ends up in s_i and emits o_1 ... o_t.
• Define: mpp_t(i) = that path, so δ_t(i) = p(mpp_t(i)).
Viterbi algorithm

      δ_t(i) = max_{q_1 q_2 ... q_{t-1}} p(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ o_1 o_2 ... o_t)
      mpp_t(i) = argmax_{q_1 q_2 ... q_{t-1}} p(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ o_1 o_2 ... o_t)

  Base case (one choice):

      δ_1(i) = p(q_1 = s_i ∧ o_1) = π_i E_i(o_1) = α_1(i)

[Trellis diagram as before, now holding δ_1(1), ..., δ_1(4), δ_2(3), ...]
Viterbi algorithm
• The most probable path with last two states s_i, s_j is the most probable path to s_i, followed by the transition s_i → s_j.
• The probability of that path will be:

      δ_t(i) × p(s_i → s_j ∧ o_{t+1}) = δ_t(i) T_ij E_j(o_{t+1})

• So, the previous state at time t is:

      i* = argmax_i δ_t(i) T_ij E_j(o_{t+1})
Viterbi algorithm
• Summary:

      δ_1(i) = π_i E_i(o_1) = α_1(i)
      δ_{t+1}(j) = max_i δ_t(i) T_ij E_j(o_{t+1})
      i*_t = argmax_i δ_t(i) T_ij E_j(o_{t+1})
      mpp_{t+1}(j) = mpp_t(i*_t) followed by s_j
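A compact Python sketch of Viterbi decoding (reusing the pi, T, E dictionaries from the generation sketch); it maximizes the joint p(Q ∧ O), which has the same argmax as p(Q | O):

```python
def viterbi(pi, T, E, obs):
    """Most probable state path for obs, with its joint probability."""
    states = list(pi)
    delta = {i: pi[i] * E[i][obs[0]] for i in states}        # delta_1
    paths = {i: [i] for i in states}                         # mpp_1
    for o in obs[1:]:
        new_delta, new_paths = {}, {}
        for j in states:
            best = max(states, key=lambda i: delta[i] * T[i][j])   # i*
            new_delta[j] = delta[best] * T[best][j] * E[j][o]
            new_paths[j] = paths[best] + [j]
        delta, paths = new_delta, new_paths
    last = max(states, key=lambda i: delta[i])
    return paths[last], delta[last]

print(viterbi(pi, T, E, ["x3", "x1", "x3"]))
```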
What’s Viterbi used for? •Speech Recognition
28/03/2011Markov models80
Chong, Jike and Yi, Youngmin and Faria, Arlo and Sa tish, Nadathur Rajagopalan and Keutzer, Kurt, “ Data-Parallel Large Vocabulary
Continuous Speech Recognition on Graphics Processor s”, EECS Department, University of California, Berke ley, 2008.
Training HMMs
• Given: a large sequence of observations o_1 o_2 ... o_T and the number of states N.
• Goal: estimation of the parameters Φ = ⟨T, E, π⟩.
• That is, how to design an HMM.
• We will infer the model from a large amount of data o_1 o_2 ... o_T with a big "T".
Training HMMs
• Remember, we have just computed p(o_1 o_2 ... o_T | Φ).
• Now, we have some observations and we want to infer Φ from them.
• So, we could use:
  – MAX LIKELIHOOD:  Φ* = argmax_Φ p(o_1 ... o_T | Φ)
  – BAYES: compute p(Φ | o_1 ... o_T), then take E[Φ] or max_Φ p(Φ | o_1 ... o_T)
Max likelihood for HMMs
• Forward probability: the probability of producing o_1 ... o_t while ending up in state s_i:

      α_t(i) = p(o_1 o_2 ... o_t ∧ q_t = s_i)
      α_1(i) = π_i E_i(o_1)
      α_{t+1}(i) = E_i(o_{t+1}) ∑_j T_ji α_t(j)

• Backward probability: the probability of producing o_{t+1} ... o_T given that at time t, we are at state s_i:

      β_t(i) = p(o_{t+1} o_{t+2} ... o_T | q_t = s_i)
Max likelihood for HMMs – Backward
• Backward probability: easy to define recursively

      β_t(i) = p(o_{t+1} o_{t+2} ... o_T | q_t = s_i)

      β_T(i) = 1

      β_t(i) = ∑_{j=1}^{N} p(o_{t+1} o_{t+2} ... o_T ∧ q_{t+1} = s_j | q_t = s_i)
             = ∑_{j=1}^{N} p(o_{t+1} | q_{t+1} = s_j ∧ q_t = s_i) p(o_{t+2} ... o_T | q_{t+1} = s_j) p(q_{t+1} = s_j | q_t = s_i)
             = ∑_{j=1}^{N} T_ij E_j(o_{t+1}) β_{t+1}(j)
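A Python sketch of the backward pass, in the same style as the forward() sketch above:

```python
def backward(pi, T, E, obs):
    """betas[t][i] = p(o_{t+2} .. o_T | q_{t+1} = s_i), 0-indexed in t;
    the last entry is the base case beta_T(i) = 1."""
    states = list(pi)
    betas = [{i: 1.0 for i in states}]                       # beta_T
    for o in reversed(obs[1:]):                              # o_T, ..., o_2
        nxt = betas[0]
        betas.insert(0, {i: sum(T[i][j] * E[j][o] * nxt[j] for j in states)
                         for i in states})                   # recursion
    return betas
```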
Max likelihood for HMMs
• The probability of traversing a certain arc at time t, given o_1 o_2 ... o_T:

      ε_t(i, j) = p(q_t = s_i ∧ q_{t+1} = s_j | o_1 o_2 ... o_T)
                = p(q_t = s_i ∧ q_{t+1} = s_j ∧ o_1 o_2 ... o_T) / p(o_1 o_2 ... o_T)
                = α_t(i) T_ij E_j(o_{t+1}) β_{t+1}(j) / ∑_{i=1}^{N} α_t(i) β_t(i)
Max likelihood for HMMs
• The probability of being at state s_i at time t, given o_1 o_2 ... o_T:

      γ_t(i) = p(q_t = s_i | o_1 o_2 ... o_T)
             = ∑_{j=1}^{N} p(q_t = s_i ∧ q_{t+1} = s_j | o_1 o_2 ... o_T)
             = ∑_{j=1}^{N} ε_t(i, j)
Max likelihood for HMMs
• Sum over the time index:
  – Expected # of transitions from state i to j in o_1 o_2 ... o_T:

      ∑_{t=1}^{T-1} ε_t(i, j)

  – Expected # of transitions from state i in o_1 o_2 ... o_T:

      ∑_{t=1}^{T-1} γ_t(i) = ∑_{t=1}^{T-1} ∑_{j=1}^{N} ε_t(i, j)
Update parameters

      π̂_i = expected frequency in state i at time t = 1 = γ_1(i)

      T̂_ij = expected # of transitions from state i to j / expected # of transitions from state i
           = ∑_{t=1}^{T-1} ε_t(i, j) / ∑_{t=1}^{T-1} γ_t(i)

      Ê_ik = expected # of transitions from state i with x_k observed / expected # of transitions from state i
           = ∑_{t=1}^{T-1} δ(o_t, x_k) γ_t(i) / ∑_{t=1}^{T-1} γ_t(i)

  where Π = {π_i} = {p(q_1 = s_i)}, T_ij = p(q_{t+1} = s_j | q_t = s_i), E_ij = p(o_t = x_j | q_t = s_i).
The inner loop of Forward-Backward
Given an input sequence:
1. Calculate forward probabilities:
   – Base case:      α_1(i) = π_i E_i(o_1)
   – Recursive case: α_{t+1}(i) = E_i(o_{t+1}) ∑_j T_ji α_t(j)
2. Calculate backward probabilities:
   – Base case:      β_T(i) = 1
   – Recursive case: β_t(i) = ∑_{j=1}^{N} T_ij E_j(o_{t+1}) β_{t+1}(j)
3. Calculate expected counts:

      ε_t(i, j) = α_t(i) T_ij E_j(o_{t+1}) β_{t+1}(j) / ∑_i α_t(i) β_t(i)

4. Update parameters:

      T̂_ij = ∑_{t=1}^{T-1} ε_t(i, j) / ∑_{t=1}^{T-1} ∑_{j=1}^{N} ε_t(i, j)
      Ê_ik = ∑_{t=1}^{T-1} δ(o_t, x_k) γ_t(i) / ∑_{t=1}^{T-1} γ_t(i)
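Putting the four steps together, here is a sketch of one Baum-Welch update for a single observation sequence. It reuses the forward() and backward() sketches above and assumes every state has a nonzero expected count (no smoothing):

```python
def baum_welch_step(pi, T, E, obs):
    """One EM update of (pi, T, E), in the slides' notation."""
    states = list(pi)
    a, b = forward(pi, T, E, obs), backward(pi, T, E, obs)
    p_obs = sum(a[-1][i] for i in states)            # p(O | current model)
    # eps[t][i][j]: probability of traversing arc i -> j at time t+1, given O
    eps = [{i: {j: a[t][i] * T[i][j] * E[j][obs[t + 1]] * b[t + 1][j] / p_obs
                for j in states} for i in states}
           for t in range(len(obs) - 1)]
    gamma = [{i: sum(e[i].values()) for i in states} for e in eps]
    from_i = {i: sum(g[i] for g in gamma) for i in states}
    new_pi = {i: gamma[0][i] for i in states}
    new_T = {i: {j: sum(e[i][j] for e in eps) / from_i[i] for j in states}
             for i in states}
    new_E = {i: {x: sum(g[i] for g, o in zip(gamma, obs) if o == x) / from_i[i]
                 for x in E[i]} for i in states}
    return new_pi, new_T, new_E
```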
Forward-Backward: EM for HMM
• If we knew Φ, we could estimate expectations of quantities such as
  – Expected number of times in state i
  – Expected number of transitions i → j
• If we knew quantities such as
  – Expected number of times in state i
  – Expected number of transitions i → j
  we could compute the max likelihood estimate of Φ = ⟨T, E, Π⟩.
• Also known (for the HMM case) as the Baum-Welch algorithm.
EM for HMM
• Each iteration provides values for all the parameters.
• The new model always improves the likelihood of the training data:

      p(o_1 o_2 ... o_T | Φ̂) ≥ p(o_1 o_2 ... o_T | Φ)

• The algorithm does not guarantee reaching the global maximum.
EM for HMM
• Bad news
  – There are lots of local minima.
• Good news
  – The local minima are usually adequate models of the data.
• Notice
  – EM does not estimate the number of states. That must be given (tradeoffs).
  – Often, HMMs are forced to have some links with zero probability. This is done by setting T_ij = 0 in the initial estimate Φ(0).
  – Easy extension of everything seen today: HMMs with real-valued outputs.
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field (from the viewpoint of classification)
Example: Image segmentation
• Observations: pixel values
• Hidden variable: the class of each pixel
• It's reasonable to think that there are some underlying relationships between neighbouring pixels... Can we use Markov models?
• Errr... the relationships are in 2D!
MRF as a 2D generalization of MC
• Array of observations:

      X = {x_ij},  0 ≤ i < N_x, 0 ≤ j < N_y

• Classes/States:

      S = {s_ij},  s_ij = 1...M

• Our objective is classification: given the array of observations, estimate the corresponding values of the state array S so that p(X | S) p(S) is maximum.
2D context-dependent classification
• Assumptions:
  – The values of elements in S are mutually dependent.
  – The range of this dependence is limited within a neighborhood.
• For each (i, j) element of S, a neighborhood N_ij is defined so that:
  – s_ij ∉ N_ij: the (i, j) element does not belong to its own set of neighbors.
  – s_ij ∈ N_kl ⇔ s_kl ∈ N_ij: if s_ij is a neighbor of s_kl, then s_kl is also a neighbor of s_ij.
2D context-dependent classification
• The Markov property for the 2D case:

      p(s_ij | S_ij) = p(s_ij | N_ij)

  where S_ij includes all the elements of S except the (i, j) one.
• The elegant dynamic programming is not applicable: the problem is much harder now!
• We are going to see an application of MRF for image segmentation and restoration.
MRF for Image Segmentation
• Cliques: sets of pixels which are neighbors of each other (w.r.t. the type of neighborhood).
MRF for Image Segmentation
• Dual lattice
• Line process
MRF for Image Segmentation
• Gibbs distribution:

      π(s) = (1/Z) exp(−U(s)/T)

  – Z: normalizing constant
  – T: parameter
• It turns out that the Gibbs distribution implies an MRF ([Geman 84]).
MRF for Image Segmentation
• A Gibbs conditional probability is of the form:

      p(s_ij | N_ij) = (1/Z) exp( −(1/T) ∑_k F_k(C_k(i, j)) )

  – C_k(i, j): the cliques of the pixel (i, j)
  – F_k: some functions, e.g. a weighted sum over the neighboring pairs:

      (1/T) (α_1 s_ij (s_{i−1,j} + s_{i+1,j}) + α_2 s_ij (s_{i,j−1} + s_{i,j+1}))
MRF for Image Segmentation
• Then, the joint probability for the Gibbs model is

      p(S) = (1/Z) exp( −(1/T) ∑_{(i,j)} ∑_k F_k(C_k(i, j)) )

  – The sum is calculated over all possible cliques associated with the neighborhood.
• We also need to work out p(X | S).
• Then p(X | S) p(S) can be maximized... [Geman 84]
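To make the Gibbs prior concrete, here is a minimal sketch (my own illustrative choice of clique potential, not the one in [Geman 84]): a Potts-style pair potential on horizontal and vertical cliques that penalizes neighboring pixels with different labels.

```python
import math

def gibbs_energy(S, alpha=1.0):
    """U(S): sum of pair potentials over 4-neighbor cliques of the label
    array S (a list of equal-length rows); alpha weights the penalty for
    a pair of neighbors with different labels."""
    rows, cols = len(S), len(S[0])
    U = 0.0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:
                U += alpha * (S[i][j] != S[i][j + 1])   # horizontal clique
            if i + 1 < rows:
                U += alpha * (S[i][j] != S[i + 1][j])   # vertical clique
    return U

def unnormalized_prior(S, T=1.0):
    """p(S) up to the constant Z: exp(-U(S)/T). Smoother labelings
    get higher prior probability."""
    return math.exp(-gibbs_energy(S) / T)
```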
More on Markov models...
• MRF does not stop there... Here are some related models:
  – Conditional random field (CRF)
  – Graphical models
  – ...
• Markov Chain and HMM do not stop there either...
  – Markov chain of order m
  – Continuous-time Markov chains
  – Real-valued observations
  – ...
What you should know
• Markov property, Markov Chain
• HMM:
  – Defining and computing α_t(i)
  – Viterbi algorithm
  – Outline of the EM algorithm for HMM
• Markov Random Field
  – And an application in image segmentation
  – See [Geman 84] for more information.
Q & A
References
• L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
• Andrew W. Moore, "Hidden Markov Models", http://www.autonlab.org/tutorials/
• Geman S., Geman D., "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6(6), pp. 721-741, 1984.