Markov Models

About This Presentation

A presentation on Markov Chains, Hidden Markov Models, and Markov Random Fields, with the necessary algorithms and detailed explanations.


Slide Content

PATTERN RECOGNITION
Markov models
Vu PHAM
[email protected]
Department of Computer Science
March 28th, 2011

Contents
• Introduction
  – Introduction
  – Motivation
• Markov Chain
• Hidden Markov Models
• Markov Random Field

Introduction
• Markov processes were first proposed by the Russian mathematician Andrei Markov, who used these processes to investigate letter sequences in Pushkin's poetry.
• Nowadays, the Markov property and HMMs are widely used in many domains:
  – Natural Language Processing
  – Speech Recognition
  – Bioinformatics
  – Image/video processing
  – ...

Motivation [0]
• As shown in his 1906 paper, Markov's original motivation was purely mathematical: applying the Weak Law of Large Numbers to dependent random variables.
• However, we shall not follow this motivation...

Motivation [1]
• From the viewpoint of classification:
  – Context-free classification: the Bayes classifier

    $p(\omega_i \mid \mathbf{x}) > p(\omega_j \mid \mathbf{x}) \quad \forall j \neq i$

    • Classes are independent.
    • Feature vectors are independent.
  – However, there are some applications where the various classes are closely related:
    • POS tagging, tracking, gene boundary recovery...

    s_1  s_2  s_3  ...  s_m   (a sequence of feature vectors)

Motivation [1]
• Context-dependent classification:
  – s_1, s_2, ..., s_m: a sequence of m feature vectors
  – ω_1, ω_2, ..., ω_m: the classes into which these vectors are classified, each ω_i taking one of N possible values
• To apply the Bayes classifier:
  – X = s_1 s_2 ... s_m: the extended feature vector
  – Ω_i = ω_{i1}, ω_{i2}, ..., ω_{im}: a classification → N^m possible classifications

$p(\Omega_i \mid X) > p(\Omega_j \mid X) \quad \forall j \neq i$

$p(X \mid \Omega_i)\, p(\Omega_i) > p(X \mid \Omega_j)\, p(\Omega_j) \quad \forall j \neq i$

Motivation [2]
• From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables:

Hôm nay mùng tám tháng ba
Chị em phụ nữ đi ra đi vào...
(a Vietnamese folk couplet: "Today is the eighth of March / The women keep walking in and out...")

q_1 = Hôm, q_2 = nay, q_3 = mùng, ..., q_m = vào

• What is p(Hôm nay ... vào) = p(q_1 = Hôm, q_2 = nay, ..., q_m = vào)?

$p(s_m \mid s_1 s_2 \ldots s_{m-1}) = \dfrac{p(s_1 s_2 \ldots s_{m-1} s_m)}{p(s_1 s_2 \ldots s_{m-1})}$

Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field

Markov Chain
• Has N states, called s_1, s_2, ..., s_N
• There are discrete timesteps, t = 0, t = 1, ...
• On the t'th timestep the system is in exactly one of the available states; call it q_t ∈ {s_1, s_2, ..., s_N}.
• Between each timestep, the next state is chosen randomly.
• The current state determines the probability distribution for the next state.
  – Often notated with arcs between states.

Example (N = 3, t = 1, current state q_t = q_1 = s_2):

[State diagram: three states s_1, s_2, s_3 with arcs labeled 1 (s_1 → s_3), 1/2 (s_2 → s_1), 1/2 (s_2 → s_2), 1/3 (s_3 → s_1), 2/3 (s_3 → s_2)]

$p(q_{t+1}=s_1 \mid q_t=s_1) = 0, \quad p(q_{t+1}=s_2 \mid q_t=s_1) = 0, \quad p(q_{t+1}=s_3 \mid q_t=s_1) = 1$

$p(q_{t+1}=s_1 \mid q_t=s_2) = 1/2, \quad p(q_{t+1}=s_2 \mid q_t=s_2) = 1/2, \quad p(q_{t+1}=s_3 \mid q_t=s_2) = 0$

$p(q_{t+1}=s_1 \mid q_t=s_3) = 1/3, \quad p(q_{t+1}=s_2 \mid q_t=s_3) = 2/3, \quad p(q_{t+1}=s_3 \mid q_t=s_3) = 0$

Markov Property
• q_{t+1} is conditionally independent of {q_{t-1}, q_{t-2}, ..., q_0} given q_t.
• In other words:

$p(q_{t+1} \mid q_t, q_{t-1}, \ldots, q_0) = p(q_{t+1} \mid q_t)$

The state at timestep t+1 depends only on the state at timestep t.

• A Markov chain of order m (m finite): the state at timestep t+1 depends on the past m states:

$p(q_{t+1} \mid q_t, q_{t-1}, \ldots, q_0) = p(q_{t+1} \mid q_t, q_{t-1}, \ldots, q_{t-m+1})$

• How do we represent the joint distribution of (q_0, q_1, q_2, ...) using graphical models?

[Graphical model: a chain of nodes q_0 → q_1 → q_2 → q_3 → ...]

Markov chain
• So, the chain of {q_t} is called a Markov chain:

q_0 → q_1 → q_2 → q_3 → ...

• Each q_t takes a value from the countable state-space {s_1, s_2, s_3, ...}
• Each q_t is observed at a discrete timestep t
• {q_t} satisfies the Markov property:

$p(q_{t+1} \mid q_t, q_{t-1}, \ldots, q_0) = p(q_{t+1} \mid q_t)$

• The transition from q_t to q_{t+1} is governed by the transition probability matrix:

Transition probabilities:
        s_1   s_2   s_3
s_1      0     0     1
s_2     1/2   1/2    0
s_3     1/3   2/3    0

Markov Chain – Important property
• In a Markov chain, the joint distribution is

$p(q_0, q_1, \ldots, q_m) = p(q_0) \prod_{j=1}^{m} p(q_j \mid q_{j-1})$

• Why? By the chain rule,

$p(q_0, q_1, \ldots, q_m) = p(q_0) \prod_{j=1}^{m} p(q_j \mid q_{j-1}, \ldots, q_0) = p(q_0) \prod_{j=1}^{m} p(q_j \mid q_{j-1})$

where the second equality is due to the Markov property.

Markov Chain: e.g.
• The state-space of weather: rain, cloud, wind

[State diagram with arcs labeled 1/2, 1/2, 1/3, 2/3, 1]

        Rain  Cloud  Wind
Rain     1/2    0     1/2
Cloud    1/3    0     2/3
Wind      0     1      0

• Markov assumption: the weather on the (t+1)'th day depends only on the t'th day.
• We have observed the weather in a week:

Day:  0     1     2     3     4
      rain  wind  cloud rain  wind

This is a Markov chain.
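Under the Markov assumption, the probability of this observed week is just the initial probability of the first state times the transition probabilities along the chain. Below is a minimal Python sketch (an illustration, not part of the original slides); the uniform initial distribution is an assumption, since the slide does not give one:

```python
# Weather Markov chain from the slide. The initial distribution `pi` is an
# assumption (the slide does not specify one); the transition table is as given.
T = {
    "rain":  {"rain": 1/2, "cloud": 0.0, "wind": 1/2},
    "cloud": {"rain": 1/3, "cloud": 0.0, "wind": 2/3},
    "wind":  {"rain": 0.0, "cloud": 1.0, "wind": 0.0},
}
pi = {"rain": 1/3, "cloud": 1/3, "wind": 1/3}  # assumed uniform start

def chain_probability(states):
    """p(q0, ..., qm) = p(q0) * prod_t p(q_{t+1} | q_t) -- the Markov property."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= T[prev][cur]
    return p

week = ["rain", "wind", "cloud", "rain", "wind"]
print(chain_probability(week))  # (1/3) * 1/2 * 1 * 1/3 * 1/2 = 1/36
```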

Contents
• Introduction
• Markov Chain
• Hidden Markov Models
  – Independence assumptions
  – Formal definition
  – Forward algorithm
  – Viterbi algorithm
  – Baum-Welch algorithm
• Markov Random Field

Modeling pairs of sequences
• In many applications, we have to model pairs of sequences.
• Examples:
  – POS tagging in Natural Language Processing (assign each word in a sentence to Noun, Adj, Verb...)
  – Speech recognition (map acoustic sequences to sequences of words)
  – Computational biology (recover gene boundaries in DNA sequences)
  – Video tracking (estimate the underlying model states from the observation sequences)
  – And many others...

Probabilistic models for sequence pairs
• We have two sequences of random variables: X_1, X_2, ..., X_m and S_1, S_2, ..., S_m
• Intuitively, in a practical system, each X_i corresponds to an observation and each S_i corresponds to the state that generated the observation.
• Let each S_i be in {1, 2, ..., k} and each X_i be in {1, 2, ..., o}
• How do we model the joint distribution

$p(X_1 = x_1, \ldots, X_m = x_m, S_1 = s_1, \ldots, S_m = s_m)\,?$

Hidden Markov Models (HMMs)
• In HMMs, we assume that

$p(X_1 = x_1, \ldots, X_m = x_m, S_1 = s_1, \ldots, S_m = s_m) = p(S_1 = s_1) \prod_{j=2}^{m} p(S_j = s_j \mid S_{j-1} = s_{j-1}) \prod_{j=1}^{m} p(X_j = x_j \mid S_j = s_j)$

• This factorization follows from what are often called the independence assumptions of HMMs.
• We will derive it in the next slides.

Independence Assumptions in HMMs [1]
• By the chain rule, the following equality is exact:

$p(X_1 = x_1, \ldots, X_m = x_m, S_1 = s_1, \ldots, S_m = s_m) = p(S_1 = s_1, \ldots, S_m = s_m) \times p(X_1 = x_1, \ldots, X_m = x_m \mid S_1 = s_1, \ldots, S_m = s_m)$

(compare: $p(A, B, C) = p(A \mid B, C)\, p(B, C) = p(A \mid B, C)\, p(B \mid C)\, p(C)$)

• Assumption 1: the state sequence forms a Markov chain:

$p(S_1 = s_1, \ldots, S_m = s_m) = p(S_1 = s_1) \prod_{j=2}^{m} p(S_j = s_j \mid S_{j-1} = s_{j-1})$

Independence Assumptions in HMMs [2]
• By the chain rule, the following equality is exact:

$p(X_1 = x_1, \ldots, X_m = x_m \mid S_1 = s_1, \ldots, S_m = s_m) = \prod_{j=1}^{m} p(X_j = x_j \mid S_1 = s_1, \ldots, S_m = s_m, X_1 = x_1, \ldots, X_{j-1} = x_{j-1})$

• Assumption 2: each observation depends only on the underlying state:

$p(X_j = x_j \mid S_1 = s_1, \ldots, S_m = s_m, X_1 = x_1, \ldots, X_{j-1} = x_{j-1}) = p(X_j = x_j \mid S_j = s_j)$

• These two assumptions are often called the independence assumptions of HMMs.

The Model form for HMMs
• The model takes the following form:

$p(x_1, \ldots, x_m, s_1, \ldots, s_m; \theta) = \pi(s_1) \prod_{j=2}^{m} t(s_j \mid s_{j-1}) \prod_{j=1}^{m} e(x_j \mid s_j)$

• Parameters in the model:
  – Initial probabilities $\pi(s)$ for $s \in \{1, 2, \ldots, k\}$
  – Transition probabilities $t(s' \mid s)$ for $s, s' \in \{1, 2, \ldots, k\}$
  – Emission probabilities $e(x \mid s)$ for $s \in \{1, 2, \ldots, k\}$ and $x \in \{1, 2, \ldots, o\}$

6 components of HMMs
• Discrete timesteps: 1, 2, ...
• Finite state space: {s_i} (N states)
• Events {x_i} (M events)
• Vector of initial probabilities {π_i}: Π = {π_i} = {p(q_1 = s_i)}
• Matrix of transition probabilities: T = {T_ij} = {p(q_{t+1} = s_j | q_t = s_i)}
• Matrix of emission probabilities: E = {E_ij} = {p(o_t = x_j | q_t = s_i)}

[Diagram: states s_1, s_2, s_3 with transition arcs t_ij, emission arcs e_ij to events x_1, x_2, x_3, and initial arcs π_1, π_2, π_3 from a start node]

The observations at successive timesteps form an observation sequence {o_1, o_2, ..., o_t}, where o_i ∈ {x_1, x_2, ..., x_M}.

Constraints:
$\sum_{i=1}^{N} \pi_i = 1, \qquad \sum_{j=1}^{N} T_{ij} = 1, \qquad \sum_{j=1}^{M} E_{ij} = 1$

6 components of HMMs
• Given a specific HMM and an observation sequence, the corresponding sequence of states is generally not deterministic.
• Example: given the observation sequence {x_1, x_3, x_3, x_2}, the corresponding states can be any of the following sequences:
  {s_1, s_2, s_1, s_2}
  {s_1, s_2, s_3, s_2}
  {s_1, s_1, s_1, s_2}
  ...

Here's an HMM

[State diagram: states s_1, s_2, s_3 with transition arcs (0.5, 0.5, 0.4, 0.6, 0.2, 0.8) and emission arcs to events x_1, x_2, x_3 (0.3, 0.7, 0.1, 0.9, 0.2, 0.8)]

T      s_1   s_2   s_3
s_1    0.5   0.5   0
s_2    0.4   0     0.6
s_3    0.2   0.8   0

E      x_1   x_2   x_3
s_1    0.3   0     0.7
s_2    0     0.1   0.9
s_3    0.2   0     0.8

π      s_1   s_2   s_3
       0.3   0.3   0.4

Here's an HMM
• Start randomly in state 1, 2 or 3 (according to π).
• Choose an output at each state at random (according to E).
• Let's generate a sequence of observations:
  – q_1: random choice among S_1 (0.3), S_2 (0.3) and S_3 (0.4) → S_3
  – o_1: choice between X_1 (0.2) and X_3 (0.8) → X_3
  – q_2: go to S_2 with probability 0.8 or S_1 with probability 0.2 → S_1
  – o_2: choice between X_1 (0.3) and X_3 (0.7) → X_1
  – q_3: go to S_2 with probability 0.5 or S_1 with probability 0.5 → S_1
  – o_3: choice between X_1 (0.3) and X_3 (0.7) → X_3
• We got a sequence of states and corresponding observations!

  q_1 = S_3, q_2 = S_1, q_3 = S_1
  o_1 = X_3, o_2 = X_1, o_3 = X_3
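The walkthrough above is easy to reproduce in code. Below is a small sketch (an illustration, not code from the slides) that generates a state/observation sequence from this HMM: sample q_1 from π, then alternate emissions (E) and transitions (T):

```python
import random

# The HMM of the example: initial, transition and emission tables as on the slide.
pi = {"s1": 0.3, "s2": 0.3, "s3": 0.4}
T = {"s1": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
     "s2": {"s1": 0.4, "s2": 0.0, "s3": 0.6},
     "s3": {"s1": 0.2, "s2": 0.8, "s3": 0.0}}
E = {"s1": {"x1": 0.3, "x2": 0.0, "x3": 0.7},
     "s2": {"x1": 0.0, "x2": 0.1, "x3": 0.9},
     "s3": {"x1": 0.2, "x2": 0.0, "x3": 0.8}}

def sample(dist):
    # Draw one key of `dist` with probability proportional to its value.
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(length):
    states, obs = [], []
    q = sample(pi)                # start randomly in state 1, 2 or 3
    for _ in range(length):
        states.append(q)
        obs.append(sample(E[q]))  # choose an output at the current state
        q = sample(T[q])          # move to the next state
    return states, obs

print(generate(3))  # e.g. (['s3', 's1', 's1'], ['x3', 'x1', 'x3'])
```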

Three famous HMM tasks
Given an HMM Φ = (T, E, π), three famous HMM tasks are:
• Probability of an observation sequence (state estimation)
  – Given: Φ, observation O = {o_1, o_2, ..., o_t}
  – Goal: p(O|Φ), or equivalently p(s_t = S_i | O)
  – That is, calculating the probability of observing the sequence O over all possible state sequences.
• Most likely explanation (inference)
  – Given: Φ, the observation O = {o_1, o_2, ..., o_t}
  – Goal: Q* = argmax_Q p(Q|O)
  – That is, calculating the best corresponding state sequence, given an observation sequence.
• Learning the HMM
  – Given: observation O = {o_1, o_2, ..., o_t} and the corresponding state sequence
  – Goal: estimate the parameters of the HMM Φ = (T, E, π)
  – That is, given one (or a set of) observation sequence and corresponding state sequence, estimate the transition matrix, emission matrix and initial probabilities of the HMM.

Three famous HMM tasks

Problem                                   Algorithm          Complexity
State estimation: p(O|Φ)                  Forward            O(TN²)
Inference: Q* = argmax_Q p(Q|O)           Viterbi decoding   O(TN²)
Learning: Φ* = argmax_Φ p(O|Φ)            Baum-Welch (EM)    O(TN²)

T: number of timesteps; N: number of states

State estimation problem
• Given: Φ = (T, E, π), observation O = {o_1, o_2, ..., o_t}
• Goal: What is p(o_1 o_2 ... o_t)?
• We can do this in a slow, stupid way, as shown on the next slide...

Here's an HMM
• What is p(O) = p(o_1 o_2 o_3) = p(o_1 = X_3 ∧ o_2 = X_1 ∧ o_3 = X_3)?
• Slow, stupid way:

$p(O) = \sum_{Q \in \text{paths of length 3}} p(O \wedge Q) = \sum_{Q \in \text{paths of length 3}} p(O \mid Q)\, p(Q)$

• How to compute p(Q) for an arbitrary path Q?

p(Q) = p(q_1 q_2 q_3)
     = p(q_1) p(q_2|q_1) p(q_3|q_2, q_1)   (chain rule)
     = p(q_1) p(q_2|q_1) p(q_3|q_2)        (Markov property)

Example in the case Q = S_3 S_1 S_1:
p(Q) = 0.4 × 0.2 × 0.5 = 0.04

• How to compute p(O|Q) for an arbitrary path Q?

p(O|Q) = p(o_1 o_2 o_3 | q_1 q_2 q_3)
       = p(o_1|q_1) p(o_2|q_2) p(o_3|q_3)   (independence assumption)

Example in the case Q = S_3 S_1 S_1:
p(O|Q) = p(X_3|S_3) p(X_1|S_1) p(X_3|S_1) = 0.8 × 0.3 × 0.7 = 0.168

• p(O) needs 27 p(Q) computations and 27 p(O|Q) computations. What if the sequence had 20 observations?
• So let's be smarter...
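For a sanity check, the "slow, stupid way" fits in a few lines of Python. The sketch below (an illustration, not code from the slides) enumerates all 27 length-3 state paths and sums p(O|Q)p(Q) for O = X_3 X_1 X_3, using the π/T/E tables of the example HMM:

```python
from itertools import product

pi = {"s1": 0.3, "s2": 0.3, "s3": 0.4}
T = {"s1": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
     "s2": {"s1": 0.4, "s2": 0.0, "s3": 0.6},
     "s3": {"s1": 0.2, "s2": 0.8, "s3": 0.0}}
E = {"s1": {"x1": 0.3, "x2": 0.0, "x3": 0.7},
     "s2": {"x1": 0.0, "x2": 0.1, "x3": 0.9},
     "s3": {"x1": 0.2, "x2": 0.0, "x3": 0.8}}
O = ["x3", "x1", "x3"]

def p_Q(Q):               # p(Q) = p(q1) p(q2|q1) p(q3|q2)   (Markov property)
    p = pi[Q[0]]
    for a, b in zip(Q, Q[1:]):
        p *= T[a][b]
    return p

def p_O_given_Q(O, Q):    # p(O|Q) = prod_t p(o_t | q_t)     (independence)
    p = 1.0
    for o, q in zip(O, Q):
        p *= E[q][o]
    return p

# 27 terms; e.g. Q = (s3, s1, s1) contributes 0.168 * 0.04
print(sum(p_O_given_Q(O, Q) * p_Q(Q) for Q in product(pi, repeat=3)))
```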

The Forward algorithm
• Given observations o_1 o_2 ... o_T
• Forward probabilities:

$\alpha_t(i) = p(o_1 o_2 \ldots o_t \wedge q_t = s_i \mid \Phi), \quad 1 \le t \le T$

• α_t(i) = probability that, in a random trial:
  – We'd have seen the first t observations, and
  – We'd have ended up in s_i as the t'th state visited.
• In our example, what is α_2(3)?

α_t(i): easy to define recursively

Recall the parameters:
$\Pi = \{\pi_i\} = \{p(q_1 = s_i)\}, \quad T = \{T_{ij}\} = \{p(q_{t+1} = s_j \mid q_t = s_i)\}, \quad E = \{E_{ij}\} = \{p(o_t = x_j \mid q_t = s_i)\}$

Base case:
$\alpha_1(i) = p(o_1 \wedge q_1 = s_i) = p(q_1 = s_i)\, p(o_1 \mid q_1 = s_i) = \pi_i E_i(o_1)$

Recursive case:
$\alpha_{t+1}(i) = p(o_1 o_2 \ldots o_{t+1} \wedge q_{t+1} = s_i) = \sum_{j=1}^{N} p(o_1 \ldots o_t \wedge q_t = s_j \wedge o_{t+1} \wedge q_{t+1} = s_i)$
$= \sum_{j=1}^{N} p(o_{t+1} \wedge q_{t+1} = s_i \mid q_t = s_j)\, \alpha_t(j) = \sum_{j=1}^{N} p(o_{t+1} \mid q_{t+1} = s_i)\, p(q_{t+1} = s_i \mid q_t = s_j)\, \alpha_t(j)$
$= E_i(o_{t+1}) \sum_{j=1}^{N} T_{ji}\, \alpha_t(j)$

In our example

$\alpha_1(i) = \pi_i E_i(o_1), \qquad \alpha_{t+1}(i) = E_i(o_{t+1}) \sum_j T_{ji}\, \alpha_t(j)$

We observed: x_1 x_2

α_1(1) = 0.3 × 0.3 = 0.09
α_1(2) = 0
α_1(3) = 0.2 × 0.4 = 0.08
α_2(1) = 0 × (0.09×0.5 + 0×0.4 + 0.08×0.2) = 0
α_2(2) = 0.1 × (0.09×0.5 + 0×0 + 0.08×0.8) = 0.0109
α_2(3) = 0 × (0.09×0 + 0×0.6 + 0.08×0) = 0
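These numbers are easy to verify with a direct implementation of the recursion. A sketch (illustrative code, not from the slides; states and events are 0-indexed, so x_1 is index 0):

```python
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0],   # T[i][j] = p(q_{t+1} = s_{j+1} | q_t = s_{i+1})
     [0.4, 0.0, 0.6],
     [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7],   # E[i][k] = p(o_t = x_{k+1} | q_t = s_{i+1})
     [0.0, 0.1, 0.9],
     [0.2, 0.0, 0.8]]

def forward(obs):
    """alpha[t][i] = p(o_1 .. o_{t+1} and q_{t+1} = s_{i+1})."""
    N = len(pi)
    alpha = [[pi[i] * E[i][obs[0]] for i in range(N)]]            # base case
    for o in obs[1:]:                                             # recursion
        prev = alpha[-1]
        alpha.append([E[i][o] * sum(T[j][i] * prev[j] for j in range(N))
                      for i in range(N)])
    return alpha

alpha = forward([0, 1])  # we observed x1 x2
print(alpha[0])          # [0.09, 0.0, 0.08]
print(alpha[1])          # approx [0.0, 0.0109, 0.0]
```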

Forward probabilities – Trellis

[Trellis: states s_1, ..., s_N on the vertical axis, timesteps 1, 2, ..., T on the horizontal axis; each node (i, t) holds α_t(i), computed from the α values in the previous column]

Base case:
$\alpha_1(i) = \pi_i E_i(o_1)$

Recursive case:
$\alpha_{t+1}(i) = E_i(o_{t+1}) \sum_j T_{ji}\, \alpha_t(j)$

Forward probabilities
• So, we can cheaply compute:

$\alpha_t(i) = p(o_1 o_2 \ldots o_t \wedge q_t = s_i)$

• How can we cheaply compute $p(o_1 o_2 \ldots o_t)$?

$p(o_1 o_2 \ldots o_t) = \sum_i \alpha_t(i)$

• How can we cheaply compute $p(q_t = s_i \mid o_1 o_2 \ldots o_t)$?

$p(q_t = s_i \mid o_1 o_2 \ldots o_t) = \dfrac{\alpha_t(i)}{\sum_j \alpha_t(j)}$

Look back at the trellis...

State estimation problem
• State estimation is solved:

$p(O \mid \Phi) = p(o_1 o_2 \ldots o_t \mid \Phi) = \sum_{i=1}^{N} \alpha_t(i)$

• Can we utilize the elegant trellis to solve the inference problem?
  – Given an observation sequence O, find the best state sequence:

$Q^* = \operatorname{argmax}_Q\, p(Q \mid O)$

Inference problem
• Given: Φ = (T, E, π), observation O = {o_1, o_2, ..., o_t}
• Goal: Find

$Q^* = \operatorname{argmax}_Q\, p(Q \mid O) = \operatorname{argmax}_{q_1 q_2 \ldots q_t}\, p(q_1 q_2 \ldots q_t \mid o_1 o_2 \ldots o_t)$

• Practical problems:
  – Speech recognition: Given an utterance (sound), what is the best sentence (text) that matches the utterance?
  – Video tracking
  – POS tagging

Inference problem
• We can do this in a slow, stupid way:

$Q^* = \operatorname{argmax}_Q\, p(Q \mid O) = \operatorname{argmax}_Q\, \dfrac{p(O \mid Q)\, p(Q)}{p(O)} = \operatorname{argmax}_Q\, p(O \mid Q)\, p(Q) = \operatorname{argmax}_Q\, p(o_1 o_2 \ldots o_t \mid Q)\, p(Q)$

• But it's better if we can find another way to compute the most probable path (MPP)...

Efficient MPP computation
• We are going to compute the following variables:

$\delta_t(i) = \max_{q_1 q_2 \ldots q_{t-1}} p(q_1 q_2 \ldots q_{t-1} \wedge q_t = s_i \wedge o_1 o_2 \ldots o_t)$

• δ_t(i) is the probability of the best path of length t−1 which ends up in s_i and emits o_1 ... o_t.
• Define: mpp_t(i) = that path, so δ_t(i) = p(mpp_t(i)).

Viterbi algorithm

$\delta_t(i) = \max_{q_1 q_2 \ldots q_{t-1}} p(q_1 q_2 \ldots q_{t-1} \wedge q_t = s_i \wedge o_1 o_2 \ldots o_t)$
$mpp_t(i) = \operatorname{argmax}_{q_1 q_2 \ldots q_{t-1}} p(q_1 q_2 \ldots q_{t-1} \wedge q_t = s_i \wedge o_1 o_2 \ldots o_t)$

Base case (one choice):
$\delta_1(i) = p(q_1 = s_i \wedge o_1) = \pi_i E_i(o_1) = \alpha_1(i)$

[Trellis with δ_t(i) at each node]

Viterbi algorithm
• The most probable path with last two states s_i, s_j is the most probable path to s_i, followed by the transition s_i → s_j.
• The probability of that path will be:

$\delta_t(i) \times p(s_i \to s_j \wedge o_{t+1}) = \delta_t(i)\, T_{ij}\, E_j(o_{t+1})$

• So, the best previous state at time t is:

$i^* = \operatorname{argmax}_i\, \delta_t(i)\, T_{ij}\, E_j(o_{t+1})$

Viterbi algorithm
• Summary:

$\delta_1(i) = \pi_i E_i(o_1) = \alpha_1(i)$
$i^* = \operatorname{argmax}_i\, \delta_t(i)\, T_{ij}\, E_j(o_{t+1})$
$\delta_{t+1}(j) = \delta_t(i^*)\, T_{i^*j}\, E_j(o_{t+1})$
$mpp_{t+1}(j) = mpp_t(i^*)$ followed by $s_j$
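A compact implementation of these recurrences (a sketch, not code from the slides; it reuses the π/T/E lists from the forward sketch above). Note that E_j(o_{t+1}) does not depend on i, so it can be dropped from the argmax over i:

```python
def viterbi(obs):
    N = len(pi)
    delta = [pi[i] * E[i][obs[0]] for i in range(N)]   # delta_1(i) = alpha_1(i)
    paths = [[i] for i in range(N)]                    # mpp_1(i) = (s_i)
    for o in obs[1:]:
        new_delta, new_paths = [], []
        for j in range(N):
            # i* = argmax_i delta_t(i) T_ij  (E_j(o_{t+1}) is constant in i)
            i_star = max(range(N), key=lambda i: delta[i] * T[i][j])
            new_delta.append(delta[i_star] * T[i_star][j] * E[j][o])
            new_paths.append(paths[i_star] + [j])
        delta, paths = new_delta, new_paths
    best = max(range(N), key=lambda i: delta[i])
    return paths[best], delta[best]

# Most probable state path (0-indexed) for the observations x3 x1 x3:
print(viterbi([2, 0, 2]))
```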

What's Viterbi used for?
• Speech recognition

Chong, Jike and Yi, Youngmin and Faria, Arlo and Satish, Nadathur Rajagopalan and Keutzer, Kurt, "Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors", EECS Department, University of California, Berkeley, 2008.

Training HMMs
• Given: a large sequence of observations o_1 o_2 ... o_T and the number of states N.
• Goal: estimation of the parameters Φ = (T, E, π)
• That is, how to design an HMM.
• We will infer the model from a large amount of data o_1 o_2 ... o_T, with a big "T".

Training HMMs
• Remember, we have just computed p(o_1 o_2 ... o_T | Φ).
• Now, we have some observations and we want to infer Φ from them.
• So, we could use:
  – MAX LIKELIHOOD: $\Phi^* = \operatorname{argmax}_\Phi\, p(o_1 \ldots o_T \mid \Phi)$
  – BAYES: compute $p(\Phi \mid o_1 \ldots o_T)$, then take $E[\Phi]$ or $\max_\Phi p(\Phi \mid o_1 \ldots o_T)$

Max likelihood for HMMs
• Forward probability: the probability of producing o_1 ... o_t while ending up in state s_i:

$\alpha_t(i) = p(o_1 o_2 \ldots o_t \wedge q_t = s_i)$
$\alpha_1(i) = \pi_i E_i(o_1), \qquad \alpha_{t+1}(i) = E_i(o_{t+1}) \sum_j T_{ji}\, \alpha_t(j)$

• Backward probability: the probability of producing o_{t+1} ... o_T given that at time t we are at state s_i:

$\beta_t(i) = p(o_{t+1} o_{t+2} \ldots o_T \mid q_t = s_i)$

Max likelihood for HMMs – Backward
• Backward probability: easy to define recursively

$\beta_t(i) = p(o_{t+1} o_{t+2} \ldots o_T \mid q_t = s_i)$

Base case:
$\beta_T(i) = 1$

Recursive case:
$\beta_t(i) = \sum_{j=1}^{N} p(o_{t+1} \wedge q_{t+1} = s_j \mid q_t = s_i)\, p(o_{t+2} \ldots o_T \mid q_{t+1} = s_j) = \sum_{j=1}^{N} T_{ij}\, E_j(o_{t+1})\, \beta_{t+1}(j)$

Max likelihood for HMMs
• The probability of traversing a certain arc at time t, given o_1 o_2 ... o_T:

$\varepsilon_{ij}(t) = p(q_t = s_i \wedge q_{t+1} = s_j \mid o_1 o_2 \ldots o_T) = \dfrac{p(q_t = s_i \wedge q_{t+1} = s_j \wedge o_1 o_2 \ldots o_T)}{p(o_1 o_2 \ldots o_T)} = \dfrac{\alpha_t(i)\, T_{ij}\, E_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{k=1}^{N} \alpha_t(k)\, \beta_t(k)}$

Max likelihood for HMMs
• The probability of being at state s_i at time t, given o_1 o_2 ... o_T:

$\gamma_i(t) = p(q_t = s_i \mid o_1 o_2 \ldots o_T) = \sum_{j=1}^{N} p(q_t = s_i \wedge q_{t+1} = s_j \mid o_1 o_2 \ldots o_T) = \sum_{j=1}^{N} \varepsilon_{ij}(t)$

Max likelihood for HMMs
• Sum over the time index:
  – Expected # of transitions from state i to j in o_1 o_2 ... o_T:

$\sum_{t=1}^{T-1} \varepsilon_{ij}(t)$

  – Expected # of transitions from state i in o_1 o_2 ... o_T:

$\sum_{t=1}^{T-1} \gamma_i(t) = \sum_{t=1}^{T-1} \sum_{j=1}^{N} \varepsilon_{ij}(t)$

Update parameters

$\hat{\pi}_i = \text{expected frequency in state } i \text{ at time } t = 1 = \gamma_i(1)$

$\hat{T}_{ij} = \dfrac{\text{expected \# of transitions from state } i \text{ to } j}{\text{expected \# of transitions from state } i} = \dfrac{\sum_{t=1}^{T-1} \varepsilon_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}$

$\hat{E}_{ik} = \dfrac{\text{expected \# of times in state } i \text{ with } x_k \text{ observed}}{\text{expected \# of times in state } i} = \dfrac{\sum_{t} \delta(o_t, x_k)\, \gamma_i(t)}{\sum_{t} \gamma_i(t)}$

where δ(o_t, x_k) = 1 if o_t = x_k and 0 otherwise.

The inner loop of Forward-Backward
Given an input sequence:
1. Calculate forward probabilities:
   – Base case: $\alpha_1(i) = \pi_i E_i(o_1)$
   – Recursive case: $\alpha_{t+1}(i) = E_i(o_{t+1}) \sum_j T_{ji}\, \alpha_t(j)$
2. Calculate backward probabilities:
   – Base case: $\beta_T(i) = 1$
   – Recursive case: $\beta_t(i) = \sum_{j=1}^{N} T_{ij}\, E_j(o_{t+1})\, \beta_{t+1}(j)$
3. Calculate expected counts:

$\varepsilon_{ij}(t) = \dfrac{\alpha_t(i)\, T_{ij}\, E_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{k} \alpha_t(k)\, \beta_t(k)}$

4. Update parameters:

$T_{ij} = \dfrac{\sum_{t=1}^{T-1} \varepsilon_{ij}(t)}{\sum_{j'=1}^{N} \sum_{t=1}^{T-1} \varepsilon_{ij'}(t)}, \qquad E_{ik} = \dfrac{\sum_{t} \delta(o_t, x_k)\, \gamma_i(t)}{\sum_{t} \gamma_i(t)}$
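The four steps above translate directly into code. Below is a sketch of one Baum-Welch iteration (illustrative, not from the slides; it reuses the π/T/E lists and the forward() function from the earlier sketches, and assumes every state is visited so no denominator is zero):

```python
def backward(obs):
    N = len(pi)
    beta = [[1.0] * N]                                  # base case: beta_T(i) = 1
    for o in reversed(obs[1:]):                         # recursion, right to left
        nxt = beta[0]
        beta.insert(0, [sum(T[i][j] * E[j][o] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def baum_welch_step(obs):
    N, L, M = len(pi), len(obs), len(E[0])
    alpha, beta = forward(obs), backward(obs)
    pO = sum(alpha[-1][i] for i in range(N))            # p(O | Phi)
    # gamma[t][i] = p(q_t = s_i | O); eps[t][i][j] = p(q_t = s_i, q_{t+1} = s_j | O)
    gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(L)]
    eps = [[[alpha[t][i] * T[i][j] * E[j][obs[t + 1]] * beta[t + 1][j] / pO
             for j in range(N)] for i in range(N)] for t in range(L - 1)]
    # re-estimated parameters (step 4)
    new_pi = gamma[0][:]
    new_T = [[sum(eps[t][i][j] for t in range(L - 1)) /
              sum(gamma[t][i] for t in range(L - 1))
              for j in range(N)] for i in range(N)]
    new_E = [[sum(gamma[t][i] for t in range(L) if obs[t] == k) /
              sum(gamma[t][i] for t in range(L))
              for k in range(M)] for i in range(N)]
    return new_pi, new_T, new_E
```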

Forward-Backward: EM for HMM
• If we knew Φ, we could estimate expectations of quantities such as:
  – Expected number of times in state i
  – Expected number of transitions i → j
• If we knew quantities such as the above, we could compute the max likelihood estimate of Φ = (T, E, Π).
• Also known (for the HMM case) as the Baum-Welch algorithm.

EM for HMM
• Each iteration provides values for all the parameters.
• The new model always improves the likelihood of the training data:

$p(o_1 o_2 \ldots o_T \mid \hat{\Phi}) \geq p(o_1 o_2 \ldots o_T \mid \Phi)$

• The algorithm does not guarantee reaching the global maximum.

EM for HMM
• Bad news
  – There are lots of local optima.
• Good news
  – The local optima are usually adequate models of the data.
• Notice
  – EM does not estimate the number of states. That must be given (tradeoffs).
  – Often, HMMs are forced to have some links with zero probability. This is done by setting T_ij = 0 in the initial estimate Φ(0).
  – Easy extension of everything seen today: HMMs with real-valued outputs.

Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field (from the viewpoint of classification)

Example: Image segmentation
• Observations: pixel values
• Hidden variable: the class of each pixel
• It's reasonable to think that there are some underlying relationships between neighbouring pixels... Can we use Markov models?
• Errr... the relationships are in 2D!

MRF as a 2D generalization of MC
• Array of observations:

$X = \{x_{ij}\}, \quad 0 \le i < N_x,\; 0 \le j < N_y$

• Classes/states:

$S = \{s_{ij}\}, \quad s_{ij} \in \{1, \ldots, M\}$

• Our objective is classification: given the array of observations, estimate the corresponding values of the state array S so that

$p(X \mid S)\, p(S)$ is maximum.

2D context-dependent classification
• Assumptions:
  – The values of elements in S are mutually dependent.
  – The range of this dependence is limited within a neighborhood.
• For each (i, j) element of S, a neighborhood N_ij is defined so that:
  – s_ij ∉ N_ij: the (i, j) element does not belong to its own set of neighbors.
  – s_ij ∈ N_kl ⇔ s_kl ∈ N_ij: if s_ij is a neighbor of s_kl, then s_kl is also a neighbor of s_ij.

2D context-dependent classification
• The Markov property for the 2D case:

$p(s_{ij} \mid \bar{S}_{ij}) = p(s_{ij} \mid N_{ij})$

where $\bar{S}_{ij}$ includes all the elements of S except the (i, j) one.
• The elegant dynamic programming is not applicable: the problem is much harder now!
• We are going to see an application of MRF for image segmentation and restoration.

MRF for Image Segmentation
• Cliques: sets of pixels which are all neighbors of each other (w.r.t. the type of neighborhood)

MRF for Image Segmentation
• Dual lattice
• Line process
[Figures illustrating the dual lattice and the line process]

MRF for Image Segmentation
• Gibbs distribution:

$\pi(s) = \dfrac{1}{Z} \exp\left(-\dfrac{U(s)}{T}\right)$

  – Z: normalizing constant
  – T: parameter
• It turns out that the Gibbs distribution implies MRF ([Geman 84]).

MRF for Image Segmentation
• A Gibbs conditional probability is of the form:

$p(s_{ij} \mid N_{ij}) = \dfrac{1}{Z} \exp\left(-\dfrac{1}{T} \sum_k F_k\big(C_k(i,j)\big)\right)$

  – C_k(i, j): the cliques of the pixel (i, j)
  – F_k: some functions, e.g. a pairwise term such as

$\dfrac{1}{T}\Big(\alpha_1\, s_{ij}\,(s_{i-1,j} + s_{i+1,j}) + \alpha_2\, s_{ij}\,(s_{i,j-1} + s_{i,j+1})\Big)$

MRF for Image Segmentation
• Then, the joint probability for the Gibbs model is

$p(S) = \dfrac{1}{Z} \exp\left(-\dfrac{1}{T} \sum_{(i,j)} \sum_k F_k\big(C_k(i,j)\big)\right)$

• The sum is calculated over all possible cliques associated with the neighborhood.
• We also need to work out p(X|S).
• Then p(X|S) p(S) can be maximized... [Geman 84]
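As a concrete illustration of the idea (maximizing p(X|S)p(S) over label arrays), here is a toy sketch using Iterated Conditional Modes (ICM), a simple deterministic relative of the stochastic relaxation in [Geman 84]. Everything in it — binary ±1 labels, the Gaussian noise model, and the smoothness weight beta — is an illustrative assumption, not the method of the slides:

```python
import numpy as np

def icm(x, beta=1.0, sigma=0.5, iters=5):
    """Greedy MAP labeling of a noisy image x with binary labels {-1, +1}."""
    s = np.where(x > 0, 1, -1)            # initialize labels from the data
    for _ in range(iters):
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                # sum of labels in the 4-neighborhood (the pairwise cliques)
                nb = sum(s[a, b]
                         for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < x.shape[0] and 0 <= b < x.shape[1])
                # pick the label minimizing U = data term + smoothness term
                cost = {l: (x[i, j] - l) ** 2 / (2 * sigma ** 2) - beta * l * nb
                        for l in (-1, 1)}
                s[i, j] = min(cost, key=cost.get)
    return s

noisy = np.sign(np.random.randn(32, 32) * 0.4 + 1.0)  # mostly +1, some noise
print(icm(noisy).mean())                               # close to 1.0
```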

More on Markov models...
• MRF does not stop there... Here are some related models:
  – Conditional random fields (CRF)
  – Graphical models
  – ...
• Markov Chains and HMMs do not stop there either:
  – Markov chains of order m
  – Continuous-time Markov chains
  – Real-valued observations
  – ...

What you should know
• Markov property, Markov Chain
• HMM:
  – Defining and computing α_t(i)
  – Viterbi algorithm
  – Outline of the EM algorithm for HMM
• Markov Random Field
  – And an application in image segmentation
  – See [Geman 84] for more information.

Q & A

References
• L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
• Andrew W. Moore, "Hidden Markov Models", http://www.autonlab.org/tutorials/
• Geman S., Geman D., "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6(6), pp. 721-741, 1984.