MACHINE LEARNING

             Hidden Markov Models

                         VU H. Pham
                     phvu@fit.hcmus.edu.vn

                 Department of Computer Science

                      December 6th, 2010

08/12/2010             Hidden Markov Models       1
Contents
• Introduction

• Markov Chain

• Hidden Markov Models




Introduction
• Markov processes were first proposed by the
   Russian mathematician Andrei Markov
    – He used these processes to investigate
        the letter sequences in Pushkin's poetry.
• Nowadays, the Markov property and HMMs are
   widely used in many domains:
    – Natural Language Processing
    – Speech Recognition
    – Bioinformatics
    – Image/video processing
    – ...
Markov Chain
• Has N states, called s1, s2, ..., sN
• There are discrete timesteps, t = 0, t = 1, ...
• On the t'th timestep the system is in exactly one of the
  available states. Call it qt ∈ {s1, s2, ..., sN}
• Between each timestep, the next state is chosen randomly.
• The current state determines the probability distribution for
  the next state.
    – Often notated with arcs between states

[Diagram: three states s1, s2, s3 joined by transition arcs. In the
 running example N = 3, t = 1, and the current state is qt = q1 = s2.
 The arcs carry the transition probabilities:
    p(s1 | s1) = 0      p(s2 | s1) = 0      p(s3 | s1) = 1
    p(s1 | s2) = 1/2    p(s2 | s2) = 1/2    p(s3 | s2) = 0
    p(s1 | s3) = 1/3    p(s2 | s3) = 2/3    p(s3 | s3) = 0 ]
Markov Property
• qt+1 is conditionally independent of {qt-1, qt-2, ..., q0} given qt.
• In other words:

     p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt)

  The state at timestep t+1 depends only on the state at timestep t.
• How to represent the joint distribution of (q0, q1, q2, ...) using
  graphical models?

[Diagram: the same three-state transition diagram, unrolled as the
 chain q0 → q1 → q2 → q3]
Markov chain
• So, the chain {qt} is called a Markov chain

           q0 → q1 → q2 → q3

• Each qt takes a value from the finite state-space {s1, s2, s3}
• Each qt is observed at a discrete timestep t
• {qt} satisfies the Markov property: p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt)
• The transition from qt to qt+1 is drawn from the transition
  probability matrix:

  Transition probabilities
         s1      s2      s3
  s1     0       0       1
  s2     1/2     1/2     0
  s3     1/3     2/3     0
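The chain above can be simulated directly: at each timestep, draw the next state from the row of the transition matrix indexed by the current state. A minimal sketch (the function and variable names are illustrative, not from the slides):

```python
import random

# Transition matrix from the slides: rows are the current state,
# columns give the distribution over the next state.
STATES = ["s1", "s2", "s3"]
T = {
    "s1": {"s1": 0.0, "s2": 0.0, "s3": 1.0},
    "s2": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
    "s3": {"s1": 1/3, "s2": 2/3, "s3": 0.0},
}

def sample_chain(q0, steps, rng=random):
    """Sample a trajectory q0, q1, ..., q_steps from the chain."""
    chain = [q0]
    for _ in range(steps):
        current = chain[-1]
        # The next state depends only on the current one (Markov property).
        weights = [T[current][s] for s in STATES]
        chain.append(rng.choices(STATES, weights=weights)[0])
    return chain

print(sample_chain("s3", 5))
```

Every sampled trajectory only ever uses transitions with nonzero probability, e.g. s1 is always followed by s3 under this matrix.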
Markov Chain – Important property
• In a Markov chain, the joint distribution is

     p(q0, q1, ..., qm) = p(q0) ∏_{j=1}^{m} p(qj | qj-1)

• Why?

     p(q0, q1, ..., qm) = p(q0) ∏_{j=1}^{m} p(qj | qj-1, previous states)
                        = p(q0) ∏_{j=1}^{m} p(qj | qj-1)

  due to the Markov property.
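The product formula makes the joint probability of any trajectory a one-line loop. A minimal sketch using the three-state matrix from the slides; the uniform initial distribution p0 is an assumption for illustration (the slides do not specify one):

```python
# Joint probability p(q0, ..., qm) = p(q0) * prod_j p(q_j | q_{j-1}).
T = {
    "s1": {"s1": 0.0, "s2": 0.0, "s3": 1.0},
    "s2": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
    "s3": {"s1": 1/3, "s2": 2/3, "s3": 0.0},
}
p0 = {"s1": 1/3, "s2": 1/3, "s3": 1/3}  # assumed uniform start

def joint_probability(states):
    p = p0[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= T[prev][cur]  # each factor depends only on the previous state
    return p

# q0=s3, q1=s2, q2=s1: (1/3) * p(s2|s3) * p(s1|s2) = (1/3)*(2/3)*(1/2) = 1/9
print(joint_probability(["s3", "s2", "s1"]))
```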
Markov Chain: e.g.
• The state-space of weather: {rain, cloud, wind}, with transition
  matrix

           Rain    Cloud   Wind
  Rain     1/2     0       1/2
  Cloud    1/3     0       2/3
  Wind     0       1       0

• Markov assumption: the weather on the (t+1)'th day depends only
  on the t'th day.
• We have observed the weather in a week:

  Day:    0       1       2       3       4
          rain    wind    rain    rain    cloud

  This observed sequence forms a Markov chain.
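Under the Markov assumption, the probability of a run of weather (conditioned on the first day) is just the product of the matrix entries along it. A small sketch with the table above; the scored sequence is an arbitrary example, not the observed week:

```python
# Weather transition matrix from the slides.
WEATHER_T = {
    "rain":  {"rain": 0.5, "cloud": 0.0, "wind": 0.5},
    "cloud": {"rain": 1/3, "cloud": 0.0, "wind": 2/3},
    "wind":  {"rain": 0.0, "cloud": 1.0, "wind": 0.0},
}

def sequence_probability(days):
    """p(day1, ..., dayN | day0) under the Markov assumption."""
    p = 1.0
    for prev, cur in zip(days, days[1:]):
        p *= WEATHER_T[prev][cur]
    return p

# rain -> wind -> cloud -> wind: 0.5 * 1.0 * (2/3) = 1/3
print(sequence_probability(["rain", "wind", "cloud", "wind"]))
```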
Contents
• Introduction

• Markov Chain

• Hidden Markov Models




Modeling pairs of sequences
• In many applications, we have to model pairs of sequences
• Examples:
    – POS tagging in Natural Language Processing (assign each word in a
      sentence to Noun, Adj, Verb, ...)
    – Speech recognition (map acoustic sequences to sequences of words)
    – Computational biology (recover gene boundaries in DNA sequences)
    – Video tracking (estimate the underlying model states from the
      observation sequences)
    – And many others...
Probabilistic models for sequence pairs
• We have two sequences of random variables:
  X1, X2, ..., Xm and S1, S2, ..., Sm
• Intuitively, in a practical system, each Xi corresponds to an observation
  and each Si corresponds to a state that generated the observation.
• Let each Si be in {1, 2, ..., k} and each Xi be in {1, 2, ..., o}
• How do we model the joint distribution:

     p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
Hidden Markov Models (HMMs)
• In HMMs, we assume that

     p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
     = p(S1 = s1) ∏_{j=2}^{m} p(Sj = sj | Sj-1 = sj-1) ∏_{j=1}^{m} p(Xj = xj | Sj = sj)

• This is often called the independence assumption in HMMs
• We will prove it in the next slides
Independence Assumptions in HMMs [1]

     p(ABC) = p(A | BC) p(BC) = p(A | BC) p(B | C) p(C)

• By the chain rule, the following equality is exact:

     p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
     = p(S1 = s1, ..., Sm = sm) ×
       p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm)

• Assumption 1: the state sequence forms a Markov chain

     p(S1 = s1, ..., Sm = sm) = p(S1 = s1) ∏_{j=2}^{m} p(Sj = sj | Sj-1 = sj-1)
Independence Assumptions in HMMs [2]
• By the chain rule, the following equality is exact:

     p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm)
     = ∏_{j=1}^{m} p(Xj = xj | S1 = s1, ..., Sm = sm, X1 = x1, ..., Xj-1 = xj-1)

• Assumption 2: each observation depends only on the underlying state

     p(Xj = xj | S1 = s1, ..., Sm = sm, X1 = x1, ..., Xj-1 = xj-1)
     = p(Xj = xj | Sj = sj)

• These two assumptions are often called the independence assumptions
  in HMMs
The Model form for HMMs
• The model takes the following form:

     p(x1, ..., xm, s1, ..., sm; θ) = π(s1) ∏_{j=2}^{m} t(sj | sj-1) ∏_{j=1}^{m} e(xj | sj)

• Parameters in the model:
   – Initial probabilities π(s) for s ∈ {1, 2, ..., k}
   – Transition probabilities t(s | s') for s, s' ∈ {1, 2, ..., k}
   – Emission probabilities e(x | s) for s ∈ {1, 2, ..., k}
     and x ∈ {1, 2, ..., o}
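The model form translates line by line into code. A minimal sketch with π, t, e stored as dictionaries; the toy two-state numbers below are assumptions for illustration, not parameters from the slides:

```python
# p(x_1..m, s_1..m; theta) = pi(s_1) * prod_{j=2}^m t(s_j | s_{j-1})
#                                    * prod_{j=1}^m e(x_j | s_j)
def hmm_joint(xs, ss, pi, t, e):
    p = pi[ss[0]]
    for s_prev, s_cur in zip(ss, ss[1:]):
        p *= t[(s_cur, s_prev)]        # t(s | s'), keyed as (s, s')
    for x, s in zip(xs, ss):
        p *= e[(x, s)]                 # e(x | s), keyed as (x, s)
    return p

# Assumed toy parameters: 2 states, 2 observable events.
pi = {1: 0.6, 2: 0.4}
t = {(1, 1): 0.7, (2, 1): 0.3, (1, 2): 0.4, (2, 2): 0.6}
e = {("a", 1): 0.9, ("b", 1): 0.1, ("a", 2): 0.2, ("b", 2): 0.8}

# pi(1) * t(2|1) * e(a|1) * e(b|2) = 0.6 * 0.3 * 0.9 * 0.8 = 0.1296
print(hmm_joint(["a", "b"], [1, 2], pi, t, e))
```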
6 components of HMMs
• Discrete timesteps: 1, 2, ...
• Finite state space: {si}
• Events {xi}
• Vector of initial probabilities π = {πi}, where πi = p(q0 = si)
• Matrix of transition probabilities T = {tij} = { p(qt+1 = sj | qt = si) }
• Matrix of emission probabilities E = {eij} = { p(ot = xj | qt = si) }

The observations at consecutive timesteps form an observation sequence
{o1, o2, ..., ot}, where oi ∈ {x1, x2, ..., xo}.

[Diagram: a start node with initial arcs π1, π2, π3 into states s1, s2, s3;
 transition arcs tij between the states; emission arcs eij from the states
 to the observations x1, x2, x3]
6 components of HMMs
• Given a specific HMM and an observation sequence, the corresponding
  sequence of states is generally not deterministic
• Example: given the observation sequence {x1, x3, x3, x2}, the
  corresponding states can be any of the following sequences:
     {s1, s1, s2, s2}
     {s1, s2, s3, s2}
     {s1, s1, s1, s2}
     ...
Here’s an HMM

[Diagram: states s1, s2, s3 and observations x1, x2, x3, with the
 transition and emission probabilities tabulated below]

  T     s1     s2     s3        E     x1     x2     x3        π    s1     s2     s3
  s1    0.5    0.5    0         s1    0.3    0      0.7            0.3    0.3    0.4
  s2    0.4    0      0.6       s2    0      0.1    0.9
  s3    0.2    0.8    0         s3    0.2    0      0.8
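With these concrete tables, the earlier claim that the state sequence behind an observation sequence is not deterministic can be checked by brute force: enumerate every state sequence and keep those with nonzero joint probability for {x1, x3, x3, x2}. A sketch (the helper name is illustrative; note the feasible set under this particular T differs from the generic list on the previous slide, since here t(s2 | s2) = 0):

```python
from itertools import product

# pi / T / E tables from the slide above.
PI = {"s1": 0.3, "s2": 0.3, "s3": 0.4}
T = {"s1": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
     "s2": {"s1": 0.4, "s2": 0.0, "s3": 0.6},
     "s3": {"s1": 0.2, "s2": 0.8, "s3": 0.0}}
E = {"s1": {"x1": 0.3, "x2": 0.0, "x3": 0.7},
     "s2": {"x1": 0.0, "x2": 0.1, "x3": 0.9},
     "s3": {"x1": 0.2, "x2": 0.0, "x3": 0.8}}

def joint(obs, states):
    """p(observations, states) under the HMM model form."""
    p = PI[states[0]] * E[states[0]][obs[0]]
    for j in range(1, len(obs)):
        p *= T[states[j - 1]][states[j]] * E[states[j]][obs[j]]
    return p

obs = ["x1", "x3", "x3", "x2"]
feasible = [ss for ss in product(PI, repeat=len(obs)) if joint(obs, ss) > 0]
for ss in feasible:
    print(ss, joint(obs, ss))
```

Several distinct state sequences survive, which is exactly why decoding (finding the best one) is a nontrivial problem.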
Here’s a HMM
                                                  0.2
0.5                                                                  • Start randomly in state 1, 2
                    0.5                    0.6
       s1                       s2                      s3             or 3.
                    0.4                     0.8
                                                                     • Choose a output at each
     0.3            0.7                     0.9                        state in random.
              0.2                                       0.8
                                0.1                                  • Let’s generate a sequence
                                                                       of observations:
      x1                       x2                 x3
                                                                                 0.3 - 0.3 - 0.4
π      s1      s2         s3                                                   randomply choice
                                                                               between S1, S2, S3
       0.3     0.3        0.4

T      s1      s2         s3          E     x1     x2         x3
s1     0.5     0.5        0           s1    0.3    0          0.7         q1              o1
s2     0.4     0          0.6         s2    0      0.1        0.9         q2              o2
s3     0.2     0.8        0           s3    0.2    0          0.8         q3              o3

 08/12/2010                                       Hidden Markov Models                              33
Here’s a HMM
                                                  0.2
0.5                                                                  • Start randomly in state 1, 2
                    0.5                    0.6
       s1                       s2                      s3             or 3.
                    0.4                     0.8
                                                                     • Choose a output at each
     0.3            0.7                     0.9                        state in random.
              0.2                                       0.8
                                0.1                                  • Let’s generate a sequence
                                                                       of observations:
      x1                       x2                 x3
                                                                                    0.2 - 0.8
π      s1      s2         s3                                                   choice between X1
                                                                                     and X3
       0.3     0.3        0.4

T      s1      s2         s3          E     x1     x2         x3
s1     0.5     0.5        0           s1    0.3    0          0.7         q1      S3     o1
s2     0.4     0          0.6         s2    0      0.1        0.9         q2             o2
s3     0.2     0.8        0           s3    0.2    0          0.8         q3             o3

 08/12/2010                                       Hidden Markov Models                             34
Here’s a HMM
                                                  0.2
0.5                                                                  • Start randomly in state 1, 2
                    0.5                    0.6
       s1                       s2                      s3             or 3.
                    0.4                     0.8
                                                                     • Choose a output at each
     0.3            0.7                     0.9                        state in random.
              0.2                                       0.8
                                0.1                                  • Let’s generate a sequence
                                                                       of observations:
      x1                       x2                 x3
                                                                                 Go to S2 with
π      s1      s2         s3                                                   probability 0.8 or
                                                                               S1 with prob. 0.2
       0.3     0.3        0.4

T      s1      s2         s3          E     x1     x2         x3
s1     0.5     0.5        0           s1    0.3    0          0.7         q1      S3      o1        X3
s2     0.4     0          0.6         s2    0      0.1        0.9         q2              o2
s3     0.2     0.8        0           s3    0.2    0          0.8         q3              o3

 08/12/2010                                       Hidden Markov Models                                   35
Here’s a HMM
                                                  0.2
0.5                                                                  • Start randomly in state 1, 2
                    0.5                    0.6
       s1                       s2                      s3             or 3.
                    0.4                     0.8
                                                                     • Choose an output at each
     0.3            0.7                     0.9                        state at random.
              0.2                                       0.8
                                0.1                                  • Let’s generate a sequence
                                                                       of observations:
      x1                       x2                 x3
                                                                                    0.3 - 0.7
π      s1      s2         s3                                                   choice between X1
                                                                                     and X3
       0.3     0.3        0.4

T      s1      s2         s3          E     x1     x2         x3
s1     0.5     0.5        0           s1    0.3    0          0.7         q1      S3     o1        X3
s2     0.4     0          0.6         s2    0      0.1        0.9         q2      S1     o2
s3     0.2     0.8        0           s3    0.2    0          0.8         q3             o3

 08/12/2010                                       Hidden Markov Models                                  36
Here’s a HMM
                                                  0.2
0.5                                                                  • Start randomly in state 1, 2
                    0.5                    0.6
       s1                       s2                      s3             or 3.
                    0.4                     0.8
                                                                     • Choose an output at each
     0.3            0.7                     0.9                        state at random.
              0.2                                       0.8
                                0.1                                  • Let’s generate a sequence
                                                                       of observations:
      x1                       x2                 x3
                                                                                 Go to S2 with
π      s1      s2         s3                                                   probability 0.5 or
                                                                               S1 with prob. 0.5
       0.3     0.3        0.4

T      s1      s2         s3          E     x1     x2         x3
s1     0.5     0.5        0           s1    0.3    0          0.7         q1      S3      o1        X3
s2     0.4     0          0.6         s2    0      0.1        0.9         q2      S1      o2        X1
s3     0.2     0.8        0           s3    0.2    0          0.8         q3              o3

 08/12/2010                                       Hidden Markov Models                                   37
Here’s a HMM
                                                  0.2
0.5                                                                  • Start randomly in state 1, 2
                    0.5                    0.6
       s1                       s2                      s3             or 3.
                    0.4                     0.8
                                                                     • Choose an output at each
     0.3            0.7                     0.9                        state at random.
              0.2                                       0.8
                                0.1                                  • Let’s generate a sequence
                                                                       of observations:
      x1                       x2                 x3
                                                                                    0.3 - 0.7
π      s1      s2         s3                                                   choice between X1
                                                                                     and X3
       0.3     0.3        0.4

T      s1      s2         s3          E     x1     x2         x3
s1     0.5     0.5        0           s1    0.3    0          0.7         q1      S3     o1        X3
s2     0.4     0          0.6         s2    0      0.1        0.9         q2      S1     o2        X1
s3     0.2     0.8        0           s3    0.2    0          0.8         q3      S1     o3

 08/12/2010                                       Hidden Markov Models                                  38
Here’s a HMM
                                                  0.2
0.5                                                                  • Start randomly in state 1, 2
                    0.5                    0.6
       s1                       s2                      s3             or 3.
                    0.4                     0.8
                                                                     • Choose an output at each
     0.3            0.7                     0.9                        state at random.
              0.2                                       0.8
                                0.1                                  • Let’s generate a sequence
                                                                       of observations:
      x1                       x2                 x3
                                                                               We got a sequence
                                                                                 of states and
π      s1      s2         s3                                                    corresponding
       0.3     0.3        0.4                                                   observations!
T      s1      s2         s3          E     x1     x2         x3
s1     0.5     0.5        0           s1    0.3    0          0.7         q1      S3     o1    X3
s2     0.4     0          0.6         s2    0      0.1        0.9         q2      S1     o2    X1
s3     0.2     0.8        0           s3    0.2    0          0.8         q3      S1     o3    X3

 08/12/2010                                       Hidden Markov Models                              39
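The generation procedure the slides step through can be sketched in Python. This is illustrative code only (the function name `sample_sequence` and the 0-based index convention, 0 = S1/X1, are my assumptions); the π, T and E values are copied from the tables above:

```python
import random

# Parameters copied from the slides' tables (index 0 = S1/X1, 1 = S2/X2, 2 = S3/X3).
pi = [0.3, 0.3, 0.4]                  # initial state distribution
T = [[0.5, 0.5, 0.0],                 # T[i][j] = p(q_{t+1} = Sj | q_t = Si)
     [0.4, 0.0, 0.6],
     [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7],                 # E[i][k] = p(o_t = Xk | q_t = Si)
     [0.0, 0.1, 0.9],
     [0.2, 0.0, 0.8]]

def sample_sequence(length, rng=random):
    """Start randomly per pi, emit an output from E at each state,
    then move to the next state according to T."""
    states, observations = [], []
    state = rng.choices(range(3), weights=pi)[0]
    for _ in range(length):
        states.append(state)
        observations.append(rng.choices(range(3), weights=E[state])[0])
        state = rng.choices(range(3), weights=T[state])[0]
    return states, observations
```

One possible run of `sample_sequence(3)` is exactly the trace on the slides: states S3, S1, S1 with observations X3, X1, X3.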
Three famous HMM tasks
• Given a HMM Φ = (T, E, π), three famous HMM tasks are:
• Probability of an observation sequence (state estimation)
    – Given: Φ, observation O = {o1, o2,..., ot}
    – Goal: p(O|Φ), or the state posterior p(qt = Si|O)
• Most likely explanation (inference)
    – Given: Φ, the observation O = {o1, o2,..., ot}
    – Goal: Q* = argmaxQ p(Q|O)
• Learning the HMM
    – Given: observation O = {o1, o2,..., ot} and corresponding state sequence
    – Goal: estimate parameters of the HMM Φ = (T, E, π)


  08/12/2010                       Hidden Markov Models                          40
Three famous HMM tasks
• Given a HMM Φ = (T, E, π), three famous HMM tasks are:
• Probability of an observation sequence (state estimation)
    – Given: Φ, observation O = {o1, o2,..., ot}
    – Goal: p(O|Φ), or the state posterior p(qt = Si|O)     Calculating the probability of

• Most likely explanation (inference)                       observing the sequence O over
                                                            all possible state sequences.
    – Given: Φ, the observation O = {o1, o2,..., ot}
    – Goal: Q* = argmaxQ p(Q|O)
• Learning the HMM
    – Given: observation O = {o1, o2,..., ot} and corresponding state sequence
    – Goal: estimate parameters of the HMM Φ = (T, E, π)


  08/12/2010                       Hidden Markov Models                                41
Three famous HMM tasks
• Given a HMM Φ = (T, E, π), three famous HMM tasks are:
• Probability of an observation sequence (state estimation)
    – Given: Φ, observation O = {o1, o2,..., ot}
    – Goal: p(O|Φ), or the state posterior p(qt = Si|O)     Calculating the best

• Most likely explanation (inference)                       corresponding state sequence,
                                                          given an observation
    – Given: Φ, the observation O = {o1, o2,..., ot}
                                                          sequence.
    – Goal: Q* = argmaxQ p(Q|O)
• Learning the HMM
    – Given: observation O = {o1, o2,..., ot} and corresponding state sequence
    – Goal: estimate parameters of the HMM Φ = (T, E, π)


  08/12/2010                       Hidden Markov Models                             42
Three famous HMM tasks
• Given a HMM Φ = (T, E, π), three famous HMM tasks are:
• Probability of an observation sequence (state estimation)
    – Given: Φ, observation O = {o1, o2,..., ot}
                                                            Given an observation sequence
    – Goal: p(O|Φ), or the state posterior p(qt = Si|O)     (or a set of them) and the
• Most likely explanation (inference)                       corresponding state sequence,
    – Given: Φ, the observation O = {o1, o2,..., ot}      estimate the Transition matrix,

    – Goal: Q* = argmaxQ p(Q|O)                           Emission matrix and initial
                                                          probabilities of the HMM
• Learning the HMM
    – Given: observation O = {o1, o2,..., ot} and corresponding state sequence
    – Goal: estimate parameters of the HMM Φ = (T, E, π)


  08/12/2010                       Hidden Markov Models                                 43
Three famous HMM tasks
  Problem                             Algorithm           Complexity

  State estimation                    Forward-Backward    O(TN^2)
  Calculating: p(O|Φ)

  Inference                           Viterbi decoding    O(TN^2)
  Calculating: Q* = argmaxQ p(Q|O)

  Learning                            Baum-Welch (EM)     O(TN^2) per iteration
  Calculating: Φ* = argmaxΦ p(O|Φ)


   T: number of timesteps
   N: number of states

08/12/2010                         Hidden Markov Models                44
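The inference row of the table (Viterbi decoding) can be made concrete. The sketch below is my own illustrative code, not from the slides (the function name `viterbi` and the 0-based index convention 0 = S1 are assumptions); it uses the π, T, E tables of the running example and runs in O(TN^2):

```python
pi = [0.3, 0.3, 0.4]                  # tables from the running example
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]

def viterbi(obs):
    """Q* = argmax_Q p(Q|O) = argmax_Q p(O|Q)p(Q), in O(T N^2) time."""
    n = len(pi)
    delta = [pi[i] * E[i][obs[0]] for i in range(n)]   # best score of a path ending in i
    back = []                                          # back-pointers per timestep
    for o in obs[1:]:
        ptr = [max(range(n), key=lambda i: delta[i] * T[i][j]) for j in range(n)]
        delta = [delta[ptr[j]] * T[ptr[j]][j] * E[j][o] for j in range(n)]
        back.append(ptr)
    # trace the best final state back to the start
    q = max(range(n), key=lambda i: delta[i])
    path = [q]
    for ptr in reversed(back):
        q = ptr[q]
        path.append(q)
    return path[::-1]

print(viterbi((2, 0, 2)))             # O = X3 X1 X3 -> [1, 2, 1], i.e. S2 S3 S2
```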
The Forward-Backward Algorithm
• Given: Φ = (T, E, π), observation O = {o1, o2,..., ot}

• Goal: What is p(o1o2...ot)?

• We can do this in a slow, stupid way
   – As shown in the next slide...




 08/12/2010              Hidden Markov Models         45
Here’s a HMM
0.5                                     0.2
                    0.5          0.6                       • What is p(O) = p(o1o2o3)
       s1           0.4
                           s2     0.8
                                              s3             = p(o1=X3 ∧ o2=X1 ∧ o3=X3)?

  0.3               0.7           0.9                      • Slow, stupid way:
              0.2                             0.8
                           0.1
                                                                      p (O ) =          ∑              p (O ∧ Q )
      x1                  x2            x3                                      Q∈paths of length 3

                                                                           =           ∑              p (O | Q ) p (Q )
                                                                                Q∈paths of length 3



                                                           • How to compute p(Q) for an
                                                             arbitrary path Q?
                                                           • How to compute p(O|Q) for an
                                                             arbitrary path Q?



      08/12/2010                              Hidden Markov Models                                               46
Here’s a HMM
0.5                                         0.2
                    0.5              0.6                       • What is p(O) = p(o1o2o3)
       s1           0.4
                           s2         0.8
                                                  s3             = p(o1=X3 ∧ o2=X1 ∧ o3=X3)?

  0.3               0.7               0.9                      • Slow, stupid way:
              0.2                                 0.8
                               0.1
                                                                          p (O ) =          ∑              p (O ∧ Q )
      x1                  x2                x3                                      Q∈paths of length 3


  π         s1      s2    s3                                                   =           ∑              p (O | Q ) p (Q )
                                                                                    Q∈paths of length 3
            0.3     0.3   0.4

 p(Q) = p(q1q2q3)                                              • How to compute p(Q) for an
 = p(q1)p(q2|q1)p(q3|q2,q1) (chain)                              arbitrary path Q?
 = p(q1)p(q2|q1)p(q3|q2) (why?)                                • How to compute p(O|Q) for an
                                                                 arbitrary path Q?
 Example in the case Q=S3S1S1
 P(Q) = 0.4 * 0.2 * 0.5 = 0.04
      08/12/2010                                  Hidden Markov Models                                               47
Here’s a HMM
0.5                                         0.2
                    0.5              0.6                       • What is p(O) = p(o1o2o3)
       s1           0.4
                           s2         0.8
                                                  s3             = p(o1=X3 ∧ o2=X1 ∧ o3=X3)?

  0.3               0.7               0.9                      • Slow, stupid way:
              0.2                                 0.8
                               0.1
                                                                          p (O ) =          ∑              p (O ∧ Q )
      x1                  x2                x3                                      Q∈paths of length 3


  π         s1      s2    s3                                                   =           ∑              p (O | Q ) p (Q )
                                                                                    Q∈paths of length 3
            0.3     0.3   0.4

 p(O|Q) = p(o1o2o3|q1q2q3)                                     • How to compute p(Q) for an
 = p(o1|q1)p(o2|q2)p(o3|q3) (why?)                               arbitrary path Q?
                                                               • How to compute p(O|Q) for an
 Example in the case Q=S3S1S1                                    arbitrary path Q?
 P(O|Q) = p(X3|S3)p(X1|S1) p(X3|S1)
 =0.8 * 0.3 * 0.7 = 0.168
      08/12/2010                                  Hidden Markov Models                                               48
Here’s a HMM
0.5                                         0.2
                    0.5              0.6                       • What is p(O) = p(o1o2o3)
       s1           0.4
                           s2         0.8
                                                  s3             = p(o1=X3 ∧ o2=X1 ∧ o3=X3)?

  0.3               0.7               0.9                      • Slow, stupid way:
              0.2                                 0.8
                               0.1
                                                                          p (O ) =          ∑              p (O ∧ Q )
      x1                  x2                x3                                      Q∈paths of length 3


  π         s1      s2    s3                                                   =           ∑              p (O | Q ) p (Q )
                                                                                    Q∈paths of length 3
            0.3     0.3   0.4

 p(O|Q) = p(o1o2o3|q1q2q3)                                     • How to compute p(Q) for an
 = p(o1|q1)p(o2|q2)p(o3|q3) (why?)                               arbitrary path Q?
                                                               • How to compute p(O|Q) for an
 Example in the case Q=S3S1S1                                    arbitrary path Q?
 P(O|Q) = p(X3|S3)p(X1|S1)p(X3|S1)
 = 0.8 * 0.3 * 0.7 = 0.168

 p(O) needs 27 p(Q) computations and 27 p(O|Q) computations.
 What if the sequence has 20 observations?

                                                                 So let’s be smarter...
      08/12/2010                                  Hidden Markov Models                                               49
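The slow, stupid way can be written down directly: enumerate all 3^3 = 27 state paths and sum p(O|Q)p(Q). This is my own sketch (the function name `brute_force_p_obs` and the 0-based indices are assumptions), reusing the π, T, E tables:

```python
from itertools import product

# pi, T, E copied from the slides' tables (index 0 = S1/X1, etc.).
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]

def brute_force_p_obs(obs):
    """p(O) = sum over all N^T paths Q of p(O|Q) p(Q)."""
    total = 0.0
    for path in product(range(3), repeat=len(obs)):
        p_q = pi[path[0]]                      # p(q1)
        for a, b in zip(path, path[1:]):
            p_q *= T[a][b]                     # p(q_{t+1} | q_t)
        p_o_given_q = 1.0
        for q, o in zip(path, obs):
            p_o_given_q *= E[q][o]             # p(o_t | q_t)
        total += p_q * p_o_given_q
    return total

print(brute_force_p_obs((2, 0, 2)))   # O = X3 X1 X3, prints ~0.094344
```

The Q=S3S1S1 term of this sum is 0.04 × 0.168 = 0.00672, matching the slide; with 20 observations the loop would visit 3^20 ≈ 3.5 billion paths, which is why a smarter algorithm is needed.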
The Forward algorithm
• Given observation o1o2...oT

• Define:

  αt(i) = p(o1o2...ot ∧ qt = Si | Φ)               where 1 ≤ t ≤ T

  αt(i) = probability that, in a random trial:
   – We’d have seen the first t observations

   – We’d have ended up in Si as the t’th state visited.

• In our example, what is α2(3) ?

 08/12/2010                 Hidden Markov Models                     50
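The definition of αt(i) gives the recursion α_{t+1}(j) = E(Sj, o_{t+1}) Σi αt(i) T(Si, Sj). A minimal sketch of it (my own code; the function name `forward` and 0-based indices are assumptions), which also answers the question about α2(3):

```python
# pi, T, E copied from the slides' tables (index 0 = S1/X1, etc.).
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]

def forward(obs):
    """alpha[t-1][i-1] = alpha_t(i) = p(o1..ot and q_t = S_i); O(T N^2)."""
    alpha = [[pi[i] * E[i][obs[0]] for i in range(3)]]       # base case t = 1
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([E[j][o] * sum(prev[i] * T[i][j] for i in range(3))
                      for j in range(3)])                    # recursion step
    return alpha

alpha = forward((2, 0, 2))            # O = X3 X1 X3
print(alpha[1][2])                    # alpha_2(3), ~0.0324
print(sum(alpha[-1]))                 # p(O), ~0.094344, same as the 27-path sum
```

Note that summing the last column recovers p(O) with only O(TN^2) work instead of enumerating every path.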

How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 

Hidden Markov Models

  • 1. MACHINE LEARNING Hidden Markov Models VU H. Pham phvu@fit.hcmus.edu.vn Department of Computer Science December 6th, 2010 08/12/2010 Hidden Markov Models 1
  • 2. Contents • Introduction • Markov Chain • Hidden Markov Models
  • 3. Introduction • Markov processes were first proposed by the Russian mathematician Andrei Markov – He used these processes to analyze the letter sequences in Pushkin's poem Eugene Onegin • Nowadays, the Markov property and HMMs are widely used in many domains: – Natural Language Processing – Speech Recognition – Bioinformatics – Image/video processing – ...
  • 4. Markov Chain • Has N states, called s1, s2, ..., sN • There are discrete timesteps, t = 0, t = 1, ... • On the t'th timestep the system is in exactly one of the available states; call it qt ∈ {s1, s2, ..., sN} [Diagram: three states s1, s2, s3; the current state is marked] N = 3, t = 0, qt = q0 = s3
  • 5. Markov Chain • Has N states, called s1, s2, ..., sN • There are discrete timesteps, t = 0, t = 1, ... • On the t'th timestep the system is in exactly one of the available states; call it qt ∈ {s1, s2, ..., sN} • Between each timestep, the next state is chosen randomly. [Diagram: the current state is now s2] N = 3, t = 1, qt = q1 = s2
  • 6. Markov Chain • Has N states, called s1, s2, ..., sN • There are discrete timesteps, t = 0, t = 1, ... • On the t'th timestep the system is in exactly one of the available states; call it qt ∈ {s1, s2, ..., sN} • Between each timestep, the next state is chosen randomly. • The current state determines the probability distribution for the next state. Transition probabilities: p(qt+1 = s1 | qt = s1) = 0, p(s2 | s1) = 0, p(s3 | s1) = 1; p(s1 | s2) = 1/2, p(s2 | s2) = 1/2, p(s3 | s2) = 0; p(s1 | s3) = 1/3, p(s2 | s3) = 2/3, p(s3 | s3) = 0. N = 3, t = 1, qt = q1 = s2
  • 7. Markov Chain • Has N states, called s1, s2, ..., sN • There are discrete timesteps, t = 0, t = 1, ... • On the t'th timestep the system is in exactly one of the available states; call it qt ∈ {s1, s2, ..., sN} • Between each timestep, the next state is chosen randomly. • The current state determines the probability distribution for the next state. – Often notated with arcs between states [Diagram: arcs s1→s3 (1), s2→s1 (1/2), s2→s2 (1/2), s3→s1 (1/3), s3→s2 (2/3)] N = 3, t = 1, qt = q1 = s2
  • 8. Markov Property • qt+1 is conditionally independent of {qt−1, qt−2, ..., q0} given qt. • In other words: p(qt+1 | qt, qt−1, ..., q0) = p(qt+1 | qt) [Diagram and transition probabilities as on the previous slide] N = 3, t = 1, qt = q1 = s2
  • 9. Markov Property • qt+1 is conditionally independent of {qt−1, qt−2, ..., q0} given qt. • In other words: p(qt+1 | qt, qt−1, ..., q0) = p(qt+1 | qt). The state at timestep t+1 depends only on the state at timestep t. [Diagram and transition probabilities as before] N = 3, t = 1, qt = q1 = s2
  • 10. Markov Property • qt+1 is conditionally independent of {qt−1, qt−2, ..., q0} given qt. • In other words: p(qt+1 | qt, qt−1, ..., q0) = p(qt+1 | qt). The state at timestep t+1 depends only on the state at timestep t. • How to represent the joint distribution of (q0, q1, q2, ...) using graphical models? N = 3, t = 1, qt = q1 = s2
  • 11. Markov Property • qt+1 is conditionally independent of {qt−1, qt−2, ..., q0} given qt. • In other words: p(qt+1 | qt, qt−1, ..., q0) = p(qt+1 | qt). The state at timestep t+1 depends only on the state at timestep t. • How to represent the joint distribution of (q0, q1, q2, ...) using graphical models? [Graphical model: q0 → q1 → q2 → q3] N = 3, t = 1, qt = q1 = s2
  • 12. Markov chain • So, the chain {qt} is called a Markov chain: q0 → q1 → q2 → q3
  • 13. Markov chain • So, the chain {qt} is called a Markov chain: q0 → q1 → q2 → q3 • Each qt takes a value from the finite state-space {s1, s2, s3} • Each qt is observed at a discrete timestep t • {qt} satisfies the Markov property: p(qt+1 | qt, qt−1, ..., q0) = p(qt+1 | qt)
  • 14. Markov chain • So, the chain {qt} is called a Markov chain: q0 → q1 → q2 → q3 • Each qt takes a value from the finite state-space {s1, s2, s3} • Each qt is observed at a discrete timestep t • {qt} satisfies the Markov property: p(qt+1 | qt, qt−1, ..., q0) = p(qt+1 | qt) • The transition from qt to qt+1 is drawn from the transition probability matrix (rows = current state, columns = next state s1, s2, s3): s1: (0, 0, 1); s2: (1/2, 1/2, 0); s3: (1/3, 2/3, 0)
  • 16. Markov Chain – Important property • In a Markov chain, the joint distribution is p(q0, q1, ..., qm) = p(q0) ∏_{j=1..m} p(qj | qj−1)
  • 17. Markov Chain – Important property • In a Markov chain, the joint distribution is p(q0, q1, ..., qm) = p(q0) ∏_{j=1..m} p(qj | qj−1) • Why? p(q0, q1, ..., qm) = p(q0) ∏_{j=1..m} p(qj | qj−1, previous states) = p(q0) ∏_{j=1..m} p(qj | qj−1), due to the Markov property
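The factorization above is easy to sketch in code. A minimal sketch (the slides contain no code, so names here are illustrative); the slides do not give an initial distribution for the chain, so the helper optionally conditions on a known start state:

```python
# Joint probability of a state sequence under the Markov factorization
#   p(q0, ..., qm) = p(q0) * prod_j p(qj | q(j-1)).
# Transition matrix of the 3-state chain from the slides:
# rows = current state, columns = next state (s1, s2, s3).
T = [
    [0.0, 0.0, 1.0],   # from s1
    [0.5, 0.5, 0.0],   # from s2
    [1/3, 2/3, 0.0],   # from s3
]

def chain_probability(states, T, p0=None):
    """p(q0, q1, ..., qm) for a list of 0-based state indices.

    If no initial distribution p0 is given, the result is conditioned
    on the (known) start state q0.
    """
    p = 1.0 if p0 is None else p0[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= T[prev][cur]
    return p

# s3 -> s2 -> s1 -> s3: (2/3) * (1/2) * 1 = 1/3
print(chain_probability([2, 1, 0, 2], T))
```

Each factor is a single lookup in T, which is exactly what the Markov property buys: no factor ever needs more than the previous state.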
  • 18. Markov Chain: e.g. • The state-space of weather: rain, wind, cloud
  • 19. Markov Chain: e.g. • The state-space of weather: rain, wind, cloud • Transition matrix (rows Rain, Cloud, Wind; columns Rain, Cloud, Wind): Rain: (1/2, 0, 1/2); Cloud: (1/3, 0, 2/3); Wind: (0, 1, 0)
  • 20. Markov Chain: e.g. • The state-space of weather: rain, wind, cloud • Transition matrix (rows Rain, Cloud, Wind; columns Rain, Cloud, Wind): Rain: (1/2, 0, 1/2); Cloud: (1/3, 0, 2/3); Wind: (0, 1, 0) • Markov assumption: the weather on the (t+1)'th day depends only on the t'th day.
  • 21. Markov Chain: e.g. • The state-space of weather: rain, wind, cloud • Transition matrix (rows Rain, Cloud, Wind; columns Rain, Cloud, Wind): Rain: (1/2, 0, 1/2); Cloud: (1/3, 0, 2/3); Wind: (0, 1, 0) • Markov assumption: the weather on the (t+1)'th day depends only on the t'th day. • We have observed the weather over a week: Day 0: rain; Day 1: wind; Day 2: rain; Day 3: rain; Day 4: cloud
  • 22. Markov Chain: e.g. • The state-space of weather: rain, wind, cloud • Transition matrix (rows Rain, Cloud, Wind; columns Rain, Cloud, Wind): Rain: (1/2, 0, 1/2); Cloud: (1/3, 0, 2/3); Wind: (0, 1, 0) • Markov assumption: the weather on the (t+1)'th day depends only on the t'th day. • We have observed the weather over a week: Day 0: rain; Day 1: wind; Day 2: rain; Day 3: rain; Day 4: cloud — this observed sequence of days forms a Markov chain
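A weather chain like this can also be simulated directly. A minimal sketch, assuming the matrix rows index today's weather and the columns tomorrow's (that reading matches the arc labels on the slide's diagram):

```python
import random

# Simulate the slide's weather Markov chain for a week,
# assuming rows = today's weather, columns = tomorrow's.
STATES = ["rain", "cloud", "wind"]
T = [
    [0.5, 0.0, 0.5],   # rain  -> (rain, cloud, wind)
    [1/3, 0.0, 2/3],   # cloud -> (rain, cloud, wind)
    [0.0, 1.0, 0.0],   # wind  -> (rain, cloud, wind)
]

def simulate(start, n_days, rng=random):
    """Sample a sequence of n_days weather states, starting from `start`."""
    seq = [start]
    for _ in range(n_days - 1):
        row = T[seq[-1]]
        seq.append(rng.choices(range(3), weights=row)[0])
    return seq

week = simulate(start=0, n_days=7)   # start on a rainy day
print([STATES[s] for s in week])
```

Because each next day is drawn only from the current day's row, the sampler never needs any earlier history, which is the Markov assumption in action.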
  • 23. Contents • Introduction • Markov Chain • Hidden Markov Models
  • 24. Modeling pairs of sequences • In many applications, we have to model pairs of sequences • Examples: – POS tagging in Natural Language Processing (assign each word in a sentence to Noun, Adj, Verb, ...) – Speech recognition (map acoustic sequences to sequences of words) – Computational biology (recover gene boundaries in DNA sequences) – Video tracking (estimate the underlying model states from the observation sequences) – And many others...
  • 25. Probabilistic models for sequence pairs • We have two sequences of random variables: X1, X2, ..., Xm and S1, S2, ..., Sm • Intuitively, in a practical system, each Xi corresponds to an observation and each Si corresponds to a state that generated the observation. • Let each Si be in {1, 2, ..., k} and each Xi be in {1, 2, ..., o} • How do we model the joint distribution p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)?
  • 26. Hidden Markov Models (HMMs) • In HMMs, we assume that p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm) = p(S1 = s1) ∏_{j=2..m} p(Sj = sj | Sj−1 = sj−1) ∏_{j=1..m} p(Xj = xj | Sj = sj) • This factorization follows from what are often called the independence assumptions of HMMs • We will justify it in the next slides
  • 27. Independence Assumptions in HMMs [1] • Recall the chain rule: p(ABC) = p(A | BC) p(BC) = p(A | BC) p(B | C) p(C) • By the chain rule, the following equality is exact: p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm) = p(S1 = s1, ..., Sm = sm) × p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm) • Assumption 1: the state sequence forms a Markov chain: p(S1 = s1, ..., Sm = sm) = p(S1 = s1) ∏_{j=2..m} p(Sj = sj | Sj−1 = sj−1)
  • 28. Independence Assumptions in HMMs [2] • By the chain rule, the following equality is exact: p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm) = ∏_{j=1..m} p(Xj = xj | S1 = s1, ..., Sm = sm, X1 = x1, ..., Xj−1 = xj−1) • Assumption 2: each observation depends only on the underlying state: p(Xj = xj | S1 = s1, ..., Sm = sm, X1 = x1, ..., Xj−1 = xj−1) = p(Xj = xj | Sj = sj) • These two assumptions are often called the independence assumptions of HMMs
  • 29. The Model form for HMMs • The model takes the following form: p(x1, ..., xm, s1, ..., sm; θ) = π(s1) ∏_{j=2..m} t(sj | sj−1) ∏_{j=1..m} e(xj | sj) • Parameters in the model: – Initial probabilities π(s) for s ∈ {1, 2, ..., k} – Transition probabilities t(s | s′) for s, s′ ∈ {1, 2, ..., k} – Emission probabilities e(x | s) for s ∈ {1, 2, ..., k} and x ∈ {1, 2, ..., o}
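This model form maps directly to code. A minimal sketch of the joint probability p(x1..xm, s1..sm; θ), using the example HMM parameters that appear a few slides later (the function name is illustrative, not from the slides):

```python
# HMM joint probability under the factorization
#   p(x, s; theta) = pi(s1) * prod_{j=2..m} t(sj | s(j-1)) * prod_{j=1..m} e(xj | sj)
PI = [0.3, 0.3, 0.4]           # initial probabilities pi(s), s = s1, s2, s3
T = [[0.5, 0.5, 0.0],          # transition t(s' | s), rows = s, columns = s'
     [0.4, 0.0, 0.6],
     [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7],          # emission e(x | s), rows = s, columns = x1, x2, x3
     [0.0, 0.1, 0.9],
     [0.2, 0.0, 0.8]]

def joint_probability(states, obs):
    """p(x, s) for 0-based state and observation index sequences."""
    p = PI[states[0]] * E[states[0]][obs[0]]
    for j in range(1, len(states)):
        p *= T[states[j - 1]][states[j]] * E[states[j]][obs[j]]
    return p

# Q = (S3, S1, S1), O = (X3, X1, X3):
# (0.4 * 0.8) * (0.2 * 0.3) * (0.5 * 0.7) = 0.00672
print(joint_probability([2, 0, 0], [2, 0, 2]))
```

Note how the three parameter groups π, t, e are the only quantities the model ever consults: one initial lookup, one transition per step, one emission per step.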
  • 30. 6 components of HMMs • Discrete timesteps: 1, 2, ... • Finite state space: {si} • Events {xi} • Vector of initial probabilities {πi}, where πi = p(q0 = si) • Matrix of transition probabilities T = {tij} = {p(qt+1 = sj | qt = si)} • Matrix of emission probabilities E = {eij} = {p(ot = xj | qt = si)} • The observations at consecutive timesteps form an observation sequence {o1, o2, ..., ot}, where oi ∈ {x1, x2, ..., xo} [Diagram: states s1, s2, s3 with initial probabilities π1, π2, π3, transition arcs tij, and emission arcs eij to events x1, x2, x3]
  • 31. 6 components of HMMs • Given a specific HMM and an observation sequence, the corresponding sequence of states is generally not deterministic • Example: given the observation sequence {x1, x3, x3, x2}, the corresponding state sequence can be any of the following: {s1, s1, s2, s2}, {s1, s2, s3, s2}, {s1, s1, s1, s2}, ...
  • 32. Here's an HMM • Initial probabilities π over (s1, s2, s3): (0.3, 0.3, 0.4) • Transition matrix T (rows = from, columns = to s1, s2, s3): s1: (0.5, 0.5, 0); s2: (0.4, 0, 0.6); s3: (0.2, 0.8, 0) • Emission matrix E (rows = state, columns = x1, x2, x3): s1: (0.3, 0, 0.7); s2: (0, 0.1, 0.9); s3: (0.2, 0, 0.8)
  • 33. Here's an HMM • Start randomly in state 1, 2 or 3. • Choose an output at each state at random. • Let's generate a sequence of observations: first choose q1 among S1, S2, S3 with probabilities 0.3 – 0.3 – 0.4
  • 34. Here's an HMM • We got q1 = S3; now choose o1 between X1 (prob. 0.2) and X3 (prob. 0.8)
  • 35. Here's an HMM • We got o1 = X3; now go to S2 with probability 0.8 or to S1 with probability 0.2
  • 36. Here's an HMM • We got q2 = S1; now choose o2 between X1 (prob. 0.3) and X3 (prob. 0.7)
  • 37. Here's an HMM • We got o2 = X1; now go to S1 with probability 0.5 or to S2 with probability 0.5
  • 38. Here's an HMM • We got q3 = S1; now choose o3 between X1 (prob. 0.3) and X3 (prob. 0.7)
  • 39. Here's an HMM • We got a sequence of states and corresponding observations! q1 = S3, o1 = X3; q2 = S1, o2 = X1; q3 = S1, o3 = X3
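The generation walkthrough above can be sketched as a sampler. A minimal sketch with the slide's example HMM: pick q1 from π, emit o1 from the emission row of q1, move with a transition row, and repeat (the function name is illustrative):

```python
import random

# Generate (state, observation) pairs from the slides' example HMM.
PI = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0],
     [0.4, 0.0, 0.6],
     [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7],
     [0.0, 0.1, 0.9],
     [0.2, 0.0, 0.8]]

def generate(n_steps, rng=random):
    """Sample n_steps (state, observation) pairs, 0-based indices."""
    states, obs = [], []
    q = rng.choices(range(3), weights=PI)[0]       # q1 ~ pi
    for _ in range(n_steps):
        states.append(q)
        obs.append(rng.choices(range(3), weights=E[q])[0])  # o_t ~ E[q_t]
        q = rng.choices(range(3), weights=T[q])[0]          # q_{t+1} ~ T[q_t]
    return states, obs

states, obs = generate(3)
print(states, obs)   # the slides' particular run was states [2, 0, 0], obs [2, 0, 2]
```

Every run gives a different pair of sequences, which is exactly the non-determinism noted on slide 31.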
  • 40. Three famous HMM tasks • Given an HMM Φ = (T, E, π), three famous HMM tasks are: • Probability of an observation sequence (state estimation) – Given: Φ, observation O = {o1, o2, ..., ot} – Goal: p(O | Φ), or equivalently p(st = Si | O) • Most likely explanation (inference) – Given: Φ, the observation O = {o1, o2, ..., ot} – Goal: Q* = argmaxQ p(Q | O) • Learning the HMM – Given: observation O = {o1, o2, ..., ot} and the corresponding state sequence – Goal: estimate the parameters of the HMM Φ = (T, E, π)
  • 41. Three famous HMM tasks • State estimation: calculating the probability of observing the sequence O over all possible state sequences – Given: Φ, observation O = {o1, o2, ..., ot} – Goal: p(O | Φ), or equivalently p(st = Si | O)
  • 42. Three famous HMM tasks • Most likely explanation (inference): calculating the best corresponding state sequence, given an observation sequence – Given: Φ, the observation O = {o1, o2, ..., ot} – Goal: Q* = argmaxQ p(Q | O)
  • 43. Three famous HMM tasks • Learning the HMM: given an observation sequence (or a set of them) and the corresponding state sequence(s), estimate the transition matrix, emission matrix and initial probabilities – Given: observation O = {o1, o2, ..., ot} and the corresponding state sequence – Goal: estimate the parameters of the HMM Φ = (T, E, π)
  • 44. Three famous HMM tasks • Problem / Algorithm / Complexity: – State estimation (calculating p(O | Φ)): Forward-Backward, O(TN²) – Inference (calculating Q* = argmaxQ p(Q | O)): Viterbi decoding, O(TN²) – Learning (calculating Φ* = argmaxΦ p(O | Φ)): Baum-Welch (EM), O(TN²) • T: number of timesteps; N: number of states
  • 45. The Forward-Backward Algorithm • Given: Φ = (T, E, π), observation O = {o1, o2, ..., ot} • Goal: what is p(o1o2...ot)? • We can do this in a slow, stupid way – As shown in the next slide...
  • 46. Here's an HMM • What is p(O) = p(o1o2o3) = p(o1 = X3 ∧ o2 = X1 ∧ o3 = X3)? • Slow, stupid way: p(O) = Σ_{Q ∈ paths of length 3} p(O ∧ Q) = Σ_{Q ∈ paths of length 3} p(O | Q) p(Q) • How to compute p(Q) for an arbitrary path Q? • How to compute p(O | Q) for an arbitrary path Q?
  • 47. Here's an HMM • How to compute p(Q) for an arbitrary path Q? p(Q) = p(q1q2q3) = p(q1) p(q2 | q1) p(q3 | q2, q1) (chain rule) = p(q1) p(q2 | q1) p(q3 | q2) (why?) • Example, in the case Q = S3S1S1: p(Q) = 0.4 × 0.2 × 0.5 = 0.04
  • 48. Here's an HMM • How to compute p(O | Q) for an arbitrary path Q? p(O | Q) = p(o1o2o3 | q1q2q3) = p(o1 | q1) p(o2 | q2) p(o3 | q3) (why?) • Example, in the case Q = S3S1S1: p(O | Q) = p(X3 | S3) p(X1 | S1) p(X3 | S1) = 0.8 × 0.3 × 0.7 = 0.168
  • 49. Here's an HMM • Computing p(O) this way needs 27 p(Q) computations and 27 p(O | Q) computations. What if the sequence has 20 observations? So let's be smarter...
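The slow, stupid way is still worth writing down once, since it defines what the forward algorithm must agree with. A minimal sketch enumerating all 3³ = 27 state paths for the slide's example HMM and O = (X3, X1, X3) (helper names are illustrative):

```python
from itertools import product

# Brute-force p(O) = sum over all paths Q of p(O | Q) * p(Q),
# with the slides' example HMM.
PI = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0],
     [0.4, 0.0, 0.6],
     [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7],
     [0.0, 0.1, 0.9],
     [0.2, 0.0, 0.8]]

def p_Q(Q):
    """p(Q) = p(q1) p(q2|q1) p(q3|q2): the Markov chain factorization."""
    p = PI[Q[0]]
    for a, b in zip(Q, Q[1:]):
        p *= T[a][b]
    return p

def p_O_given_Q(O, Q):
    """p(O|Q) = p(o1|q1) p(o2|q2) p(o3|q3): one emission per timestep."""
    p = 1.0
    for o, q in zip(O, Q):
        p *= E[q][o]
    return p

O = [2, 0, 2]   # X3, X1, X3
p_O = sum(p_O_given_Q(O, Q) * p_Q(Q) for Q in product(range(3), repeat=3))
print(p_O)
```

With 20 observations this sum would have 3^20 ≈ 3.5 billion terms, which is exactly why the next slide introduces the forward algorithm.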
  • 50. The Forward algorithm • Given observation o1o2...oT • Define: αt(i) = p(o1o2...ot ∧ qt = Si | Φ), where 1 ≤ t ≤ T • αt(i) = probability that, in a random trial: – We'd have seen the first t observations – We'd have ended up in Si as the t'th state visited • In our example, what is α2(3)?
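The α's can be computed recursively: α1(i) = πi e_i(o1), and αt+1(j) = (Σi αt(i) tij) e_j(ot+1). A minimal sketch with the slide's example HMM and O = (X3, X1, X3):

```python
# The forward recursion: alpha_t(i) = p(o1..ot AND qt = S_i | model).
#   alpha_1(i)     = pi_i * e_i(o1)
#   alpha_{t+1}(j) = (sum_i alpha_t(i) * t_ij) * e_j(o_{t+1})
PI = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0],
     [0.4, 0.0, 0.6],
     [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7],
     [0.0, 0.1, 0.9],
     [0.2, 0.0, 0.8]]

def forward(O):
    """Return the list of alpha vectors, one per timestep."""
    alpha = [[PI[i] * E[i][O[0]] for i in range(3)]]
    for o in O[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * T[i][j] for i in range(3)) * E[j][o]
                      for j in range(3)])
    return alpha

alpha = forward([2, 0, 2])   # O = X3, X1, X3
print(alpha[1][2])           # alpha_2(3): first 2 observations seen, ended in S3
print(sum(alpha[-1]))        # p(O) = sum_i alpha_T(i)
```

This answers the slide's question: α2(3) = (π3 e3(X3)) t32 e... no path enters S3 except from S2, so α2(3) = α1(2) t23 e3(X1) = 0.27 × 0.6 × 0.2 = 0.0324. Each timestep costs O(N²), giving the O(TN²) total from the complexity table instead of the 3^T terms of the brute-force sum.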