1) Markov models and hidden Markov models describe systems that transition between states based on probabilities, where the next state depends only on the current state.
2) Markov models assume each state corresponds to a directly observable event, while hidden Markov models allow states to be hidden and observations to depend probabilistically on the current state.
3) In a Markov model, the transition probabilities and initial state probabilities can be collected into a transition matrix and an initial distribution, from which the probability of any state sequence can be calculated.
2. Time-based Models
• Simple parametric distributions are typically based on what is called the “independence assumption”: each data point is independent of the others, and there is no time-sequencing or ordering.
• What if the data has correlations based on its order, like a time-series?
3. States
• An atomic event is an assignment to every random variable in the domain.
• States are atomic events that can transition from one to another.
• Suppose a model has n states.
• A state-transition diagram describes how the model behaves.
4. State-transition
Following assumptions:
- Transition probabilities are stationary
- The event space does not change over time
- Probability distribution over next states depends only on the current state
5. State-transition
The last assumption, that the probability distribution over next states depends only on the current state, is the Markov Assumption.
6. Markov random processes
• A random sequence has the Markov property if its distribution is determined solely by its current state.
• Any random process having this property is called a Markov random process.
• A system with states that obey the Markov assumption is called a Markov Model.
• A sequence of states resulting from such a model is called a Markov Chain.
7. Chain Rule & Markov Property
Chain rule (repeated application of the product rule):
P(q_t, q_{t-1}, …, q_1) = P(q_t | q_{t-1}, …, q_1) P(q_{t-1}, …, q_1)
                        = P(q_t | q_{t-1}, …, q_1) P(q_{t-1} | q_{t-2}, …, q_1) P(q_{t-2}, …, q_1)
                        = P(q_1) ∏_{i=2}^{t} P(q_i | q_{i-1}, …, q_1)
Markov property:
P(q_i | q_{i-1}, …, q_1) = P(q_i | q_{i-1}) for i > 1
so that
P(q_t, q_{t-1}, …, q_1) = P(q_1) ∏_{i=2}^{t} P(q_i | q_{i-1}) = P(q_1) P(q_2 | q_1) … P(q_t | q_{t-1})
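The factored form above lends itself to direct computation. A minimal numeric sketch; the two-state chain and its probabilities are invented for illustration:

```python
# Markov factorization: P(q_t, ..., q_1) = P(q_1) * prod_i P(q_i | q_{i-1}).
# States are 0 and 1; the numbers below are illustrative assumptions.
P_init = [0.6, 0.4]            # P(q_1 = s)
P_trans = [[0.7, 0.3],         # P_trans[prev][cur] = P(q_i = cur | q_{i-1} = prev)
           [0.2, 0.8]]

def sequence_probability(states):
    """Probability of a state sequence under the Markov factorization."""
    p = P_init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= P_trans[prev][cur]
    return p

# P(0, 0, 1) = 0.6 * 0.7 * 0.3 = 0.126
print(sequence_probability([0, 0, 1]))
```

As a sanity check, the probabilities of all sequences of a fixed length sum to 1, exactly as the chain rule requires.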
8. Markov Assumption
• The Markov assumption states that the probability of the occurrence of word w_i at time t depends only on the occurrence of word w_{i-1} at time t-1
– Chain rule:
P(w_1, …, w_n) = P(w_1) ∏_{i=2}^{n} P(w_i | w_1, …, w_{i-1})
– Markov assumption:
P(w_1, …, w_n) ≈ P(w_1) ∏_{i=2}^{n} P(w_i | w_{i-1})
9. Andrei Andreyevich Markov
Born: 14 June 1856 in Ryazan, Russia
Died: 20 July 1922 in Petrograd (now St Petersburg), Russia
Markov is particularly remembered for his study of Markov chains: sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.
10. A Markov System
Has N states, called s1, s2 .. sN
There are discrete timesteps, t=0, t=1, …
[Diagram: three states s1, s2, s3; N=3, t=0]
11. A Markov System
Has N states, called s1, s2 .. sN
There are discrete timesteps, t=0, t=1, …
On the t’th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}
[Diagram: current state highlighted; N=3, t=0, qt = q0 = s3]
12. A Markov System
Has N states, called s1, s2 .. sN
There are discrete timesteps, t=0, t=1, …
On the t’th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}
Between each timestep, the next state is chosen randomly.
[Diagram: current state highlighted; N=3, t=1, qt = q1 = s2]
13. A Markov System (transition probabilities)
Has N states, called s1, s2 .. sN
There are discrete timesteps, t=0, t=1, …
On the t’th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}
Between each timestep, the next state is chosen randomly. The current state determines the probability distribution for the next state.
P(qt+1=s1|qt=s1) = 0     P(qt+1=s1|qt=s2) = 1/2   P(qt+1=s1|qt=s3) = 1/3
P(qt+1=s2|qt=s1) = 0     P(qt+1=s2|qt=s2) = 1/2   P(qt+1=s2|qt=s3) = 2/3
P(qt+1=s3|qt=s1) = 1     P(qt+1=s3|qt=s2) = 0     P(qt+1=s3|qt=s3) = 0
[Diagram: N=3, t=1, qt = q1 = s2]
14. A Markov System (transitions as arcs)
The same transition probabilities are often notated with arcs between states:
[Diagram: arcs labeled s1→s3: 1, s2→s1: 1/2, s2→s2: 1/2, s3→s1: 1/3, s3→s2: 2/3; N=3, t=1, qt = q1 = s2]
15. Markov Property
qt+1 is conditionally independent of {qt-1, qt-2, … q1, q0} given qt. In other words:
P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history)
Notation:
aij = P(qt+1 = sj | qt = si)
πi = P(q1 = si)
[Diagram as on the previous slide: transition probabilities on the arcs; N=3, t=1, qt = q1 = s2]
16. Markov Property
The same slide, with the notation labeled:
aij = P(qt+1 = sj | qt = si)   (the transition probability)
πi = P(q1 = si)   (the initial probability)
17. Example: A Simple Markov Model For Weather Prediction
• Any given day, the weather can be described as being in one of three states:
– State 1: precipitation (rain, snow, hail, etc.)
– State 2: cloudy
– State 3: sunny
Transitions between states are described by the transition matrix. This model can then be described by the following directed graph.
[Transition matrix and directed graph shown as a figure]
18. Basic Calculations
• Example: What is the probability that the weather for eight consecutive days is “sun-sun-sun-rain-rain-sun-cloudy-sun”?
• Solution:
O = sun sun sun rain rain sun cloudy sun
  =  3   3   3   1    1    3   2      3
so P(O | model) = π3 · a33 · a33 · a31 · a11 · a13 · a32 · a23
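With concrete numbers the product is easy to evaluate. The slide's transition matrix was in the lost figure; the values below are the ones traditionally used for this weather example in Rabiner's tutorial, so treat them as an assumption here:

```python
# Assumed transition matrix (Rabiner's classic weather example);
# states: 1 = rain, 2 = cloudy, 3 = sunny. A[i-1][j-1] = P(j | i).
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

# O = sun sun sun rain rain sun cloudy sun
O = [3, 3, 3, 1, 1, 3, 2, 3]

p = 1.0  # pi_3 taken as 1: day 1 is observed to be sunny
for prev, cur in zip(O, O[1:]):
    p *= A[prev - 1][cur - 1]

print(p)  # 0.8 * 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 = 1.536e-4
```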
19. From Markov To Hidden Markov
• The previous model assumes that each state can be uniquely associated with an observable event
– Once an observation is made, the state of the system is then trivially retrieved
– This model, however, is too restrictive to be of practical use for most realistic problems
• To make the model more flexible, we will assume that the outcomes or observations of the model are a probabilistic function of each state
– Each state can produce a number of outputs according to a unique probability distribution, and each distinct output can potentially be generated at any state
– These are known as Hidden Markov Models (HMM), because the state sequence is not directly observable; it can only be approximated from the sequence of observations produced by the system
20. The coin-toss problem
• To illustrate the concept of an HMM, consider the following scenario:
– Assume that you are placed in a room with a curtain
– Behind the curtain there is a person performing a coin-toss experiment
– This person selects one of several coins, and tosses it: heads (H) or tails (T)
– The person tells you the outcome (H, T), but not which coin was used each time
• Your goal is to build a probabilistic model that best explains a sequence of observations O = {o1, o2, o3, o4, …} = {H, T, T, H, …}
– The coins represent the states; these are hidden because you do not know which coin was tossed each time
– The outcome of each toss represents an observation
– A “likely” sequence of coins may be inferred from the observations, but this state sequence will not be unique
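The curtained experiment can be sketched as a small generative simulation. The two coins, their biases, and the switching probability below are all made-up illustrative numbers, not part of the slide:

```python
import random

# Hypothetical setup: two biased coins (hidden states); the spectator
# sees only the H/T outcomes, never which coin was tossed.
coins = {'fair': 0.5, 'biased': 0.8}   # P(heads) per coin (assumed)
switch = 0.3                           # P(changing coins between tosses) (assumed)

def toss_sequence(n, rng):
    """Generate n tosses: hidden coin choices and observed outcomes."""
    states, obs = [], []
    coin = rng.choice(list(coins))
    for _ in range(n):
        states.append(coin)
        obs.append('H' if rng.random() < coins[coin] else 'T')
        if rng.random() < switch:
            coin = 'biased' if coin == 'fair' else 'fair'
    return states, obs

states, obs = toss_sequence(10, random.Random(42))
print(obs)     # the spectator sees only this
print(states)  # hidden behind the curtain
```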
21. The Coin Toss Example – 1 coin
• With a single coin, the Markov model is observable
• In fact, we may describe the system with a deterministic model where the states are the actual observations (see figure)
• The model parameter P(H) may be found from the ratio of heads and tails
• O = H H H T T H…
• S = 1 1 1 2 2 1…
24. From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins
25. Hidden model
• As spectators, we cannot tell which coin is being used; all we can observe is the output (head/tail)
• We assume the outputs are based on coin tendencies (output probabilities)
26. Coin Toss Example
[Diagram: hidden state variables (coins) C1, C2, …, Ci, …, CL-1, CL, each emitting observed data (“output” = heads/tails) P1, P2, …, Pi, …, PL-1, PL]
27. Hidden Markov Models
• Used when states cannot be observed directly; good for noisy data
• Requirements:
– A finite number of states, each with an output probability distribution
– State transition probabilities
– An observed phenomenon, which can be randomly generated given state-associated probabilities
28. HMM Notation (from Rabiner’s Survey*)
*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
The states are labeled S1 S2 .. SN
For a particular trial…
Let T be the number of observations; T is also the number of states passed through
O = O1 O2 .. OT is the sequence of observations
Q = q1 q2 .. qT is the notation for a path of states
λ = 〈N, M, {πi}, {aij}, {bi(j)}〉 is the specification of an HMM
29. HMM Formal Definition
An HMM, λ, is a 5-tuple consisting of:
• N, the number of states
• M, the number of possible observations
• {π1, π2, .. πN}, the starting state probabilities: P(q0 = Si) = πi
• The state transition probabilities, arranged as the N×N matrix
  a11 a12 … a1N
  a21 a22 … a2N
  :   :      :
  aN1 aN2 … aNN
  where P(qt+1 = Sj | qt = Si) = aij
• The observation probabilities, arranged as the N×M matrix
  b1(1) b1(2) … b1(M)
  b2(1) b2(2) … b2(M)
  :     :        :
  bN(1) bN(2) … bN(M)
  where P(Ot = k | qt = Si) = bi(k)
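The 5-tuple can be transcribed directly into code. A sketch; the concrete numbers are illustrative assumptions, not from the slide:

```python
from dataclasses import dataclass

# A direct transcription of the 5-tuple lambda = (N, M, pi, A, B).
@dataclass
class HMM:
    N: int      # number of states
    M: int      # number of possible observations
    pi: list    # pi[i] = P(q_1 = S_i)
    A: list     # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
    B: list     # B[i][k] = P(O_t = k | q_t = S_i)

# Illustrative two-state, two-symbol model.
model = HMM(
    N=2, M=2,
    pi=[0.6, 0.4],
    A=[[0.7, 0.3], [0.4, 0.6]],
    B=[[0.9, 0.1], [0.2, 0.8]],
)

# Sanity checks: pi and each row of A and B must sum to 1.
assert abs(sum(model.pi) - 1) < 1e-9
assert all(abs(sum(row) - 1) < 1e-9 for row in model.A + model.B)
```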
30. Assumptions
• Markov assumption
– States depend only on the previous state
• Stationary assumption
– Transition probabilities are independent of time (“memoryless”)
• Output independence
– Observations are independent of previous observations
31. The three main questions on HMMs
• Evaluation
– What is the probability that the observations were generated by a given model?
• Decoding
– Given a model and a sequence of observations, what is the most likely state sequence?
• Learning
– Given a model and a sequence of observations, how should we modify the model parameters to maximize P(observations | model)?
32. The three main questions on HMMs
1. Evaluation
GIVEN an HMM M and a sequence x,
FIND Prob[ x | M ]
2. Decoding
GIVEN an HMM M and a sequence x,
FIND the sequence π of states that maximizes P[ x, π | M ]
3. Learning
GIVEN an HMM M with unspecified transition/emission probs. and a sequence x,
FIND parameters θ = (bi(.), aij) that maximize P[ x | θ ]
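The Evaluation problem is classically solved with the forward algorithm, which computes Prob[ x | M ] by dynamic programming instead of enumerating all state paths. A compact sketch; the two-state model numbers are invented for illustration:

```python
def forward(pi, A, B, obs):
    """Forward algorithm: alpha[i] = P(o_1..o_t, q_t = S_i).

    pi[i] = P(q_1 = S_i), A[i][j] = P(S_j | S_i),
    B[i][k] = P(observation k | S_i). Returns P(O | lambda).
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Illustrative two-state, two-symbol model (numbers are assumptions).
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]

print(forward(pi, A, B, [0, 1, 0]))  # 0.10893 for this model
```

The cost is O(N²T) rather than the O(Nᵀ) of naive path enumeration, which is why evaluation is tractable even with long observation sequences.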
33. Let’s not be confused by notation
P[ x | M ]: the probability that sequence x was generated by the model
The model is: architecture (#states, etc.) + parameters θ = aij, bi(.)
So P[ x | θ ] and P[ x ] are the same, when the architecture and the entire model, respectively, are implied
Similarly, P[ x, π | M ] and P[ x, π ] are the same
In the LEARNING problem we always write P[ x | θ ] to emphasize that we are seeking the θ that maximizes P[ x | θ ]
36. Description
Specification of an HMM:
• N - the number of states
– Q = {q1, q2, …, qN} - the set of states
• M - the number of symbols (observables)
– O = {o1, o2, …, oM} - the set of symbols
37. Description
Specification of an HMM (continued):
• A - the state transition probability matrix
– aij = P(qt+1 = j | qt = i)
• B - the observation probability distribution
– bj(k) = P(ot = k | qt = j), 1 ≤ k ≤ M
• π - the initial state distribution
38. HMM Formal Definition
An HMM, λ, is a 5-tuple consisting of:
• N, the number of states
• M, the number of possible observations
• {π1, π2, …, πN}, the starting state probabilities:
P(q0 = Si) = πi
• A = [aij], the N×N matrix of state transition probabilities:
P(qt+1 = Sj | qt = Si) = aij
• B = [bi(k)], the N×M matrix of observation probabilities:
P(Ot = k | qt = Si) = bi(k)
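The 5-tuple above maps directly onto plain data structures. A minimal sketch in Python; the two-state model below is invented for illustration, not taken from the slides:

```python
# A hidden Markov model λ = (N, M, π, A, B) as plain data structures.
# The two-state, two-symbol model below is a made-up toy example.
N = 2                      # number of hidden states S1, S2
M = 2                      # number of observation symbols k ∈ {0, 1}
pi = [0.6, 0.4]            # pi[i] = P(q0 = Si)
A = [[0.7, 0.3],           # A[i][j] = P(q_{t+1} = Sj | q_t = Si)
     [0.4, 0.6]]
B = [[0.9, 0.1],           # B[i][k] = b_i(k) = P(O_t = k | q_t = Si)
     [0.2, 0.8]]

# Each row of A and B, and pi itself, is a probability distribution.
assert abs(sum(pi) - 1.0) < 1e-9
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A + B)
```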
41. Central problems in HMM modelling
• Problem 1 - Evaluation:
– Probability of occurrence of a particular
observation sequence, O = {o1,…,ok}, given the
model: P(O|λ)
– Complicated, because the states are hidden
– Useful in sequence classification
42. Central problems in HMM modelling
• Problem 2 - Decoding:
– Find the optimal state sequence to produce the given
observations, O = {o1,…,ok}, given the model
– Requires an optimality criterion
– Useful in recognition problems
43. Central problems in HMM modelling
• Problem 3 - Learning:
– Determine the optimum model, given a training set of
observations
– Find λ such that P(O|λ) is maximal
44. Task: Part-Of-Speech Tagging
• Goal: Assign the correct part-of-speech to
each word (and punctuation) in a text.
• Example:
Two old men bet on the game .
CRD ADJ NN VBD Prep Det NN SYM
• Learn a local model of POS dependencies,
usually from pretagged data
45. Hidden Markov Models
• Assume: POS tags are generated as a random process,
and each POS randomly generates a word
[Figure: a small HMM over the POS states Det, ADJ, NN, NNS, with
transition probabilities between the states and emission probabilities
for the words “a”, “the”, “cats”, “men”, “cat”, “bet”]
46. HMMs For Tagging
• First-order (bigram) Markov assumptions:
– Limited Horizon: Tag depends only on previous tag
P(ti+1 = tk | t1=tj1,…,ti=tji) = P(ti+1 = tk | ti = tj)
– Time invariance: No change over time
P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj tk)
• Output probabilities:
– Probability of getting word wk for tag tj: P(wk | tj)
– Assumption:
Not dependent on other tags or words!
47. Combining Probabilities
• Probability of a tag sequence:
P(t1t2…tn) = P(t1)P(t1t2)P(t2t3)…P(tn-1tn)
Assume t0 – starting tag:
= P(t0t1)P(t1t2)P(t2t3)…P(tn-1tn)
• Prob. of word sequence and tag sequence:
P(W,T) = Πi P(ti-1ti) P(wi | ti)
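The product P(W,T) = Πi P(ti-1ti) P(wi | ti) takes only a few lines to compute. A sketch; the tiny transition table `a` and emission table `b` are invented for illustration:

```python
# Joint probability of a word sequence W and tag sequence T:
#   P(W, T) = Π_i a(t_{i-1} t_i) * b(w_i | t_i),  with t_0 = START.
# These toy tables are hypothetical, not taken from the slides.
a = {("START", "Det"): 0.5, ("Det", "NN"): 0.9}   # transition probs
b = {("the", "Det"): 0.4, ("cat", "NN"): 0.1}     # emission probs

def joint_prob(words, tags):
    p = 1.0
    prev = "START"
    for w, t in zip(words, tags):
        p *= a[(prev, t)] * b[(w, t)]
        prev = t
    return p

print(joint_prob(["the", "cat"], ["Det", "NN"]))  # 0.5*0.4*0.9*0.1 ≈ 0.018
```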
48. Training from labeled training data
• Labeled training data = each word has a POS tag
• Thus:
π(tj) = PMLE(tj) = C(tj) / N
a(tjtk) = PMLE(tk | tj) = C(tj, tk) / C(tj)
b(wk | tj) = PMLE(wk | tj) = C(tj, wk) / C(tj)
• Smoothing applies as usual
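These MLE estimates are collected in one pass over the corpus. A sketch with an invented two-sentence toy corpus (no smoothing):

```python
from collections import Counter

# MLE training from a tagged corpus (no smoothing).
# `corpus` is a list of sentences of (word, tag) pairs; this toy
# corpus is invented for illustration.
corpus = [[("the", "Det"), ("cat", "NN")],
          [("the", "Det"), ("men", "NN"), ("bet", "VBD")]]

tag_count, bigram_count, emit_count = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "START"
    for word, tag in sent:
        tag_count[tag] += 1
        bigram_count[(prev, tag)] += 1
        emit_count[(tag, word)] += 1
        prev = tag
tag_count["START"] = len(corpus)   # START "occurs" once per sentence

def a(tj, tk):   # transition MLE: C(tj, tk) / C(tj)
    return bigram_count[(tj, tk)] / tag_count[tj]

def b(wk, tj):   # emission MLE: C(tj, wk) / C(tj)
    return emit_count[(tj, wk)] / tag_count[tj]

print(a("Det", "NN"))   # Det→NN in 2 of 2 Det occurrences → 1.0
print(b("the", "Det"))  # Det emits "the" 2 of 2 times → 1.0
```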
49. Three Basic Problems
• Compute the probability of a text:
Pλ(W1,N)
• Compute maximum probability tag
sequence:
arg maxT1,N Pλ(T1,N | W1,N)
• Compute maximum likelihood model
arg max λ Pλ(W1,N)
50. Problem 1: Naïve solution
• State sequence Q = (q1, …, qT)
• Assume independent observations:
P(O | q, λ) = ∏t=1..T P(ot | qt, λ) = bq1(o1) bq2(o2) … bqT(oT)
NB: Observations are mutually independent, given the
hidden states. (The joint distribution of independent
variables factorises into the marginal distributions of the
independent variables.)
51. Problem 1: Naïve solution (continued)
• Observe that:
P(q | λ) = πq1 aq1q2 aq2q3 … aqT−1qT
• And that:
P(O | λ) = Σq P(O | q, λ) P(q | λ)
52. Problem 1: Naïve solution (continued)
• Finally:
P(O | λ) = Σq P(O | q, λ) P(q | λ)
NB:
– The above sum is over all state paths
– There are N^T state paths, each ‘costing’ O(T)
calculations, leading to O(T·N^T) time complexity.
53. Problem 1: Efficient solution
Forward algorithm:
• Define the auxiliary forward variable α:
αt(i) = P(o1, …, ot, qt = i | λ)
αt(i) is the probability of observing the partial sequence of
observables o1,…,ot AND being in state i at time t
54. Problem 1: Efficient solution (continued)
• Recursive algorithm:
– Initialise:
α1(i) = πi bi(o1)
(partial obs seq to t = 1 AND state i at t = 1)
– Calculate:
αt+1(j) = [ Σi=1..N αt(i) aij ] bj(ot+1)
(α incorporates the partial obs seq to t) × (transition to j at t+1)
× (sensor model); the sum is needed because state j can be
reached from any preceding state
– Obtain:
P(O | λ) = Σi=1..N αT(i)
(sum over the different ways of getting the obs seq)
• Complexity is O(N²T)
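The initialise/calculate/obtain recursion above can be written in a few lines. A sketch of the forward algorithm; the toy parameters are invented for illustration:

```python
# Forward algorithm: O(N^2 T) computation of P(O | λ).
def forward(pi, A, B, obs):
    N = len(pi)
    # Initialise: α_1(i) = π_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Induction: α_{t+1}(j) = [Σ_i α_t(i) a_ij] * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Terminate: P(O|λ) = Σ_i α_T(i)
    return sum(alpha)

# Invented two-state toy model; symbols are 0 and 1.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]   # B[i][k] = b_i(k)
print(forward(pi, A, B, [0, 1, 0]))  # ≈ 0.10893
```

Brute-force summation over all N^T state paths gives the same number, just exponentially slower.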
55. Forward Algorithm
Define αk(i) = P(w1,k, tk = ti)
1. For i = 1 To N:
α1(i) = a(t0ti) b(w1 | ti)
2. For k = 2 To T; For j = 1 To N:
αk(j) = [ Σi αk−1(i) a(titj) ] b(wk | tj)
3. Then:
Pλ(W1,T) = Σi αT(i)
Complexity = O(N²T)
56. Forward Algorithm
[Figure: trellis for computing Pλ(W1,3) over the words w1 w2 w3 -
columns of forward values αk(i) for tags t1…t5, connected by the
transition probabilities a(titj), starting from a(t0ti)]
57. Problem 1: Alternative solution
Backward algorithm:
• Define the auxiliary backward variable β:
βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ)
βt(i) is the probability of observing the sequence of
observables ot+1,…,oT given state qt = i at time t, and λ
58. Problem 1: Alternative solution (continued)
• Recursive algorithm:
– Initialise:
βT(i) = 1
– Calculate:
βt(i) = Σj=1..N aij bj(ot+1) βt+1(j),   for t = T−1, …, 1
– Terminate:
P(O | λ) = Σi=1..N πi bi(o1) β1(i)
• Complexity is O(N²T)
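The backward recursion can be sketched the same way; with identical parameters it must return the same P(O | λ) as the forward algorithm, which is a useful sanity check. The toy parameters are invented for illustration:

```python
# Backward algorithm: same O(N^2 T) cost, run right-to-left.
def backward(pi, A, B, obs):
    N, T = len(pi), len(obs)
    beta = [1.0] * N                        # Initialise: β_T(i) = 1
    for t in range(T - 2, -1, -1):          # t = T-1, ..., 1
        # β_t(i) = Σ_j a_ij * b_j(o_{t+1}) * β_{t+1}(j)
        beta = [sum(A[i][j] * B[j][obs[t + 1]] * beta[j] for j in range(N))
                for i in range(N)]
    # Terminate: P(O|λ) = Σ_i π_i * b_i(o_1) * β_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in range(N))

# Same invented two-state toy model as above.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(backward(pi, A, B, [0, 1, 0]))  # ≈ 0.10893, same as the forward pass
```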
59. Backward Algorithm
Define βk(i) = P(wk+1,N | tk = ti) -- note the difference!
1. For i = 1 To N:
βT(i) = 1
2. For k = T−1 To 1; For j = 1 To N:
βk(j) = [ Σi a(tjti) b(wk+1 | ti) βk+1(i) ]
3. Then:
Pλ(W1,T) = Σi a(t0ti) b(w1 | ti) β1(i)
Complexity = O(N²T)
61. Viterbi Algorithm (Decoding)
• Most probable tag sequence given text:
T* = arg maxT Pλ(T | W)
= arg maxT Pλ(W | T) Pλ(T) / Pλ(W)
(Bayes’ Theorem)
= arg maxT Pλ(W | T) Pλ(T)
(W is constant for all T)
= arg maxT Πi[a(ti-1ti) b(wi | ti) ]
= arg maxT Σi log[a(ti-1ti) b(wi | ti) ]
62. [Figure: Viterbi trellis over the words w1 w2 w3, with tags t1, t2, t3
at each position and starting tag t0]
Transition probabilities A(·,·):
      t1     t2      t3
t0    0.005  0.02    0.1
t1    0.02   0.1     0.005
t2    0.5    0.0005  0.0005
t3    0.05   0.05    0.005
Emission probabilities B(·,·):
      w1     w2     w3
t1    0.2    0.005  0.005
t2    0.02   0.2    0.0005
t3    0.02   0.02   0.05
63. The same trellis with −log10 probabilities; each node shows the best
path score reaching that tag after that word:
      w1    w2    w3
t1    -3.0  -6.0  -7.3
t2    -3.4  -4.7  -10.3
t3    -2.7  -6.7  -9.3
-log A:
      t1    t2    t3
t0    2.3   1.7   1
t1    1.7   1     2.3
t2    0.3   3.3   3.3
t3    1.3   1.3   2.3
-log B:
      w1    w2    w3
t1    0.7   2.3   2.3
t2    1.7   0.7   3.3
t3    1.7   1.7   1.3
64. Viterbi Algorithm
• D(0, START) = 0
• for each tag t ≠ START do: D(0, t) = −∞
• for i = 1 to N do:
for each tag tj do:
D(i, tj) = maxk [ D(i−1, tk) + log b(wi | tj) + log a(tktj) ]
• then log P(W, T) = maxj D(N, tj)
D(i, t) is the maximum log-probability of any tag sequence for
w1 … wi that ends in tag t; working in log space turns products
into sums and avoids numerical underflow
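The log-space recursion above can be sketched as follows. The transition table `a` and emission table `b` are invented for illustration; missing entries are floored at a tiny probability rather than −∞:

```python
import math

# Viterbi decoding in log space. D maps each tag to the best log
# probability of any tag path (for the words seen so far) ending
# in that tag; `back` stores back-pointers for path recovery.
def viterbi(words, tags, a, b):
    def la(tk, tj): return math.log(a.get((tk, tj), 1e-12))  # tiny floor
    def lb(w, tj):  return math.log(b.get((w, tj), 1e-12))
    D = {"START": 0.0}                    # D(0, START) = 0
    back = []
    for w in words:
        D_new, bp = {}, {}
        for tj in tags:
            tk = max(D, key=lambda t: D[t] + la(t, tj))
            D_new[tj] = D[tk] + la(tk, tj) + lb(w, tj)
            bp[tj] = tk
        back.append(bp)
        D = D_new
    path = [max(D, key=D.get)]            # best final tag
    for bp in reversed(back[1:]):         # follow back-pointers
        path.append(bp[path[-1]])
    return list(reversed(path))

# Invented toy tables, not the ones from slide 62.
a = {("START", "Det"): 0.5, ("Det", "NN"): 0.9, ("NN", "VBD"): 0.4}
b = {("the", "Det"): 0.4, ("cat", "NN"): 0.1, ("bet", "VBD"): 0.2}
print(viterbi(["the", "cat", "bet"], ["Det", "NN", "VBD"], a, b))
# ['Det', 'NN', 'VBD']
```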
66. Question: Suppose the sequence of our game is:
HHHTHHHTTHHTH?
[Figure: a two-state HMM with hidden states “fair” and “loaded”, each
emitting Heads or Tails; each state starts with probability 0.5, stays
in place with probability 0.9, and switches with probability 0.1]
What is the probability of the sequence given the model?
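The question is answered by the forward algorithm. The start and transition probabilities below follow the slide (0.5/0.5 start, stay 0.9, switch 0.1), but the emission probabilities are not given there, so the loaded coin's P(Heads) = 0.9 is an assumption made purely for illustration:

```python
# P(HHHTHHHTTHHTH | λ) via the forward algorithm for the fair/loaded
# coin model. ASSUMPTION: fair emits H/T with 0.5/0.5 and the loaded
# coin emits Heads with 0.9 - these emission values are not on the slide.
pi = [0.5, 0.5]                 # start in [fair, loaded]
A = [[0.9, 0.1], [0.1, 0.9]]    # stay with 0.9, switch with 0.1
B = [[0.5, 0.5], [0.9, 0.1]]    # rows: state; columns: [Heads, Tails]

obs = [0 if c == "H" else 1 for c in "HHHTHHHTTHHTH"]
alpha = [pi[i] * B[i][obs[0]] for i in range(2)]      # α_1
for o in obs[1:]:                                     # α_{t+1}
    alpha = [sum(alpha[i] * A[i][j] for i in range(2)) * B[j][o]
             for j in range(2)]
print(sum(alpha))               # P(sequence | model)
```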
67. Decoding
• Suppose we have a text written by Shakespeare
and a monkey. Can we tell who wrote what?
• Text: Shakespeare or Monkey?
• Case 1:
– Fehwufhweuromeojulietpoisonjigjreijge
• Case 2:
– mmmmbananammmmmmmbananammm