This document discusses hidden Markov models (HMMs) and their parameters. It contains the following key points:
1. HMMs are statistical models used to model time series data. They have hidden states that generate observed outputs.
2. HMMs have three main parameters: initial state probabilities, transition probabilities between states, and emission probabilities of observations given states.
3. The Expectation-Maximization (EM) algorithm is commonly used to estimate HMM parameters from observed data by iteratively computing expected state probabilities.
2. HMM Topics
1. What is an HMM?
2. State-space Representation
3. HMM Parameters
4. Generative View of HMM
5. Determining HMM Parameters Using EM
6. Forward-Backward or α−β algorithm
7. HMM Implementation Issues:
a) Length of Sequence
b) Predictive Distribution
c) Sum-Product Algorithm
d) Scaling Factors
e) Viterbi Algorithm
3. 1. What is an HMM?
• Ubiquitous tool for modeling time series data
• Used in
• Almost all speech recognition systems
• Computational molecular biology
• Group amino acid sequences into proteins
• Handwritten word recognition
• It is a tool for representing probability distributions over
sequences of observations
• HMM gets its name from two defining properties:
• Observation xt at time t was generated by some process whose state
zt is hidden from the observer
• Assumes that the state zt depends only on the state zt-1 and is
independent of all prior states (first order)
• Example: z are phoneme sequences
x are acoustic observations
4. Graphical Model of HMM
• Has the graphical model shown below and
latent variables are discrete
• Joint distribution has the form:
  p(x_1,..,x_N, z_1,..,z_N) = p(z_1) [ Π_{n=2}^{N} p(z_n | z_{n-1}) ] Π_{n=1}^{N} p(x_n | z_n)
5. HMM Viewed as Mixture
• A single time slice corresponds to a mixture distribution
with component densities p(x|z)
• Each state of discrete variable z represents a different component
• An extension of mixture model
• Choice of mixture component depends on the choice of component
for the previous observation
• Latent variables are multinomial variables zn
• That describe component responsible for generating xn
• Can use one-of-K coding scheme
6. 2. State-Space Representation
• Probability distribution of zn depends on the state of the previous
latent variable zn-1 through the probability distribution p(zn|zn-1)
• One-of-K coding of zn and zn-1, e.g.
    State k:    1  2  ..  K        State j:      1  2  ..  K
    z_{nk}:     0  1  ..  0        z_{n-1,j}:    1  0  ..  0
• Since latent variables are K-dimensional binary vectors, the
transition probabilities form a K x K matrix A with elements
    A_{jk} = p(z_{nk} = 1 | z_{n-1,j} = 1),   where Σ_k A_{jk} = 1
• These are known as Transition Probabilities
• There are K(K-1) independent parameters
[Figure: the K x K matrix A, rows indexed by the state of zn-1, columns by the state of zn]
7. Transition Probabilities
Example with a 3-state latent variable (K = 3). In one-of-K coding,
zn-1 = j means z_{n-1,j} = 1 with all other components 0, and zn = k means z_{nk} = 1.

                z_n = 1   z_n = 2   z_n = 3
  z_{n-1} = 1     A11       A12       A13        A11 + A12 + A13 = 1
  z_{n-1} = 2     A21       A22       A23
  z_{n-1} = 3     A31       A32       A33

[Figure: state transition diagram for the three states]
• The state transition diagram is not a graphical model, since its nodes
are not separate variables but states of a single variable
8. Conditional Probabilities
• Transition probabilities Ajk represent state-to-state
probabilities for each variable
• Conditional probabilities are variable-to-variable
probabilities
• can be written in terms of transition probabilities as
  p(z_n | z_{n-1}, A) = Π_{k=1}^{K} Π_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{nk}}
• Note that exponent zn-1,j zn,k is a product that evaluates to 0 or 1
• Hence the overall product will evaluate to a single Ajk for each
setting of values of zn and zn-1
• E.g., zn-1=2 and zn=3 will result in only zn-1,2=1 and zn,3=1. Thus
p(zn=3|zn-1=2)=A23
• A is a global HMM parameter
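As a hedged illustration (not part of the original slides), a minimal NumPy sketch of how p(z_n | z_{n-1}, A) collapses to a single entry A_jk under one-of-K coding; the 3-state matrix A below is made up for the example.

    import numpy as np

    # Hypothetical 3-state transition matrix; each row sums to 1
    A = np.array([[0.90, 0.05, 0.05],
                  [0.10, 0.80, 0.10],
                  [0.05, 0.15, 0.80]])

    def one_of_k(state, K):
        """Return a K-dimensional binary vector with a 1 at position `state`."""
        z = np.zeros(K)
        z[state] = 1.0
        return z

    def transition_prob(z_prev, z_curr, A):
        """p(z_n | z_{n-1}, A) = prod_j prod_k A_jk ** (z_{n-1,j} * z_{n,k})."""
        return float(np.prod(A ** np.outer(z_prev, z_curr)))

    # z_{n-1} = state 2, z_n = state 3 (0-indexed: 1 and 2) picks out A[1, 2]
    print(transition_prob(one_of_k(1, 3), one_of_k(2, 3), A), A[1, 2])  # both 0.10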
9. Initial Variable Probabilities
• Initial latent node z1 is a special case without
parent node
• Represented by vector of probabilities π with
elements πk=p(z1k=1) so that
  p(z_1 | π) = Π_{k=1}^{K} π_k^{z_{1k}},   where Σ_k π_k = 1
• Note that π is an HMM parameter
• representing probabilities of each state for the
first variable
10. Lattice or Trellis Diagram
• State transition diagram unfolded over time
• Representation of latent variable states
• Each column corresponds to one of latent
variables zn
11. Emission Probabilities p(xn|zn)
• We have so far only specified p(zn|zn-1) by
means of transition probabilities
• Need to specify probabilities p(xn|zn,φ) to
complete the model, where φ are parameters
• These can be continuous or discrete
• Because xn is observed and zn is discrete
p(xn|zn,φ) consists of a table of K numbers
corresponding to K states of zn
• Analogous to class-conditional probabilities
• Can be represented as
  p(x_n | z_n, φ) = Π_{k=1}^{K} p(x_n | φ_k)^{z_{nk}}
12. 3. HMM Parameters
• We have defined three types of HMM parameters: θ =
(π, A,φ)
1. Initial Probabilities of first latent variable:
π is a vector of K probabilities of the states for latent variable z1
2. Transition Probabilities (State-to-state for any latent variable):
A is a K x K matrix of transition probabilities Aij
3. Emission Probabilities (Observations conditioned on latent):
φ are parameters of the emission distribution p(xn|zn)
• A and π parameters are often initialized uniformly
• Initialization of φ depends on form of distribution
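A minimal sketch (an assumed container layout, not from the slides) of holding θ = (π, A, φ) in code with the uniform initialization of π and A mentioned above; the Gaussian emission parameters here are only placeholders.

    import numpy as np

    def init_hmm_params(K, D, rng=np.random.default_rng(0)):
        """Uniformly initialized (pi, A) plus placeholder Gaussian phi = (means, covs)."""
        pi = np.full(K, 1.0 / K)              # initial state probabilities, sum to 1
        A = np.full((K, K), 1.0 / K)          # transition matrix, each row sums to 1
        means = rng.normal(size=(K, D))       # emission means (K-means is often used in practice)
        covs = np.stack([np.eye(D)] * K)      # emission covariances
        return pi, A, (means, covs)

    pi, A, (means, covs) = init_hmm_params(K=3, D=2)
    print(pi, A.sum(axis=1))                  # rows of A sum to 1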
13. Joint Distribution over Latent and
Observed variables
• Joint can be expressed in terms of parameters:
  p(X, Z | θ) = p(z_1 | π) [ Π_{n=2}^{N} p(z_n | z_{n-1}, A) ] Π_{m=1}^{N} p(x_m | z_m, φ)
  where X = {x_1,..,x_N}, Z = {z_1,..,z_N}, θ = {π, A, φ}
• Most of the discussion of HMMs is independent of the particular form
of the emission probabilities
• Tractable choices include discrete tables, Gaussians, and Gaussian mixture models (GMMs)
14. 4. Generative View of HMM
• Sampling from an HMM
• HMM with 3-state latent variable z
• Gaussian emission model p(x|z)
• Contours of constant density of emission
distribution shown for each state
• Two-dimensional x
• 50 data points generated from HMM
• Lines show successive observations
• Transition probabilities fixed so that there is
• 5% probability of making a transition to each of the other states
• 90% probability of remaining in the same state
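A hedged sketch of the generative (ancestral sampling) view described above, with made-up 3-state parameters and 2-D Gaussian emissions; it is not the exact model used to produce the figure on this slide.

    import numpy as np

    rng = np.random.default_rng(1)
    K, D, N = 3, 2, 50
    pi = np.array([1/3, 1/3, 1/3])
    A = np.full((K, K), 0.05) + 0.85 * np.eye(K)   # 90% self-transition, 5% to each other state
    means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

    z = rng.choice(K, p=pi)                        # sample z_1 from pi
    samples = []
    for n in range(N):
        x = rng.multivariate_normal(means[z], np.eye(D))   # sample x_n from p(x | z_n)
        samples.append((z, x))
        z = rng.choice(K, p=A[z])                  # sample z_{n+1} from row z of A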
15. Left-to-Right HMM
• Setting elements of Ajk=0 if k < j
• Corresponding lattice diagram
16. Left-to-Right Applied Generatively to Digits
• Examples of on-line handwritten 2’s
• x is a sequence of pen coordinates
• There are 16 states for z, i.e. K=16
• Each state can generate a line segment of fixed
length in one of 16 angles
• Emission probabilities: 16 x 16 table
• Transition probabilities set to zero except for those
that keep state index k the same or increment by one
• Parameters optimized by 25 EM iterations
• Trained on 45 digits
• Generative model is quite poor
• Since generated digits don't look much like the training examples
• If classification is the goal, can do better by using a
discriminative HMM
[Figures: training set and generated set of handwritten 2's]
17. 5. Determining HMM Parameters
• Given data set X = {x_1,..,x_N} we can determine HMM
parameters θ = {π, A, φ} using maximum likelihood
• Likelihood function is obtained from the joint distribution by
marginalizing over the latent variables Z = {z_1,..,z_N}:
  p(X | θ) = Σ_Z p(X, Z | θ)
18. Computational issues for Parameters
p(X|θ) = ΣZ p(X,Z|θ)
• Computational difficulties
• Joint distribution p(X,Z|θ) does not factorize over n,
in contrast to mixture model
• The sum over Z has a number of terms that grows exponentially with the chain length (one per path through the trellis)
• Solution
• Use conditional independence properties to reorder
summations
• Use EM instead to maximize log-likelihood function of joint
ln p(X,Z|θ)
Efficient framework for maximizing the likelihood function in HMMs
19. EM for MLE in HMM
1. Start with initial selection for model parameters θold
2. In E step take these parameter values and find
posterior distribution of latent variables p(Z|X,θ old)
Use this posterior distribution to evaluate
expectation of the logarithm of the complete-data
likelihood function ln p(X,Z|θ)
Which can be written as
  Q(θ, θ old) = Σ_Z p(Z | X, θ old) ln p(X, Z | θ)
The factor p(Z | X, θ old), which is independent of the new θ, is the quantity evaluated in the E step
3. In M-Step maximize Q w.r.t. θ
20. Expansion of Q
• Introduce notation
γ (zn) = p(zn|X,θold): Marginal posterior distribution of
latent variable zn
ξ(zn-1,zn) = p(zn-1,zn|X,θ old): Joint posterior of two
successive latent variables
• We will be re-expressing Q in terms of γ and ξ
  Q(θ, θ old) = Σ_Z p(Z | X, θ old) ln p(X, Z | θ)
21. Detail of γ and ξ
For each value of n we can store
γ(zn) using K non-negative numbers that sum to unity
ξ(zn-1,zn) using a K x K matrix whose elements also sum
to unity
• Using notation
γ (znk) denotes conditional probability of znk=1
Similar notation for ξ(zn-1,j,znk)
• Because the expectation of a binary random
variable is the probability that it takes value 1
  γ(z_{nk}) = E[z_{nk}] = Σ_{z_n} γ(z_n) z_{nk}
  ξ(z_{n-1,j}, z_{nk}) = E[z_{n-1,j} z_{nk}] = Σ_{z_{n-1}, z_n} ξ(z_{n-1}, z_n) z_{n-1,j} z_{nk}
22. Expansion of Q
• We begin with
Q(θ , θ old ) = ∑ p ( Z | X , θ old ) ln p ( X , Z | θ )
Z
• Substitute
  p(X, Z | θ) = p(z_1 | π) [ Π_{n=2}^{N} p(z_n | z_{n-1}, A) ] Π_{m=1}^{N} p(x_m | z_m, φ)
• And use definitions of γ and ξ to get:
  Q(θ, θ old) = Σ_{k=1}^{K} γ(z_{1k}) ln π_k + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(z_{n-1,j}, z_{nk}) ln A_{jk}
              + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)
23. E-Step
  Q(θ, θ old) = Σ_{k=1}^{K} γ(z_{1k}) ln π_k + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(z_{n-1,j}, z_{nk}) ln A_{jk}
              + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)
• Goal of E step is to evaluate γ(zn) and ξ(zn-1,zn)
efficiently (Forward-Backward Algorithm)
24. M-Step
• Maximize Q(θ,θ old) with respect to parameters
θ ={π,A,φ}
• Treat γ (zn) and ξ(zn-1,zn) as constant
• Maximization w.r.t. π and A
• easily achieved (using Lagrange multipliers):

  π_k = γ(z_{1k}) / Σ_{j=1}^{K} γ(z_{1j})

  A_{jk} = Σ_{n=2}^{N} ξ(z_{n-1,j}, z_{nk}) / Σ_{l=1}^{K} Σ_{n=2}^{N} ξ(z_{n-1,j}, z_{nl})
• Maximization w.r.t. φ_k
• Only the last term of Q depends on φ_k:
  Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)
• Same form as in a mixture distribution for i.i.d. data
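A minimal sketch of the closed-form updates above, assuming the E step has produced gamma with shape (N, K) and xi with shape (N-1, K, K); the array layout is an assumption, not something fixed by the slides.

    import numpy as np

    def m_step_pi_A(gamma, xi):
        """gamma[n, k] = gamma(z_nk); xi[n-1, j, k] = xi(z_{n-1,j}, z_nk)."""
        pi = gamma[0] / gamma[0].sum()            # pi_k = gamma(z_1k) / sum_j gamma(z_1j)
        A = xi.sum(axis=0)                        # numerator: sum over n = 2..N of xi
        A = A / A.sum(axis=1, keepdims=True)      # denominator: sum over l of the numerator
        return pi, A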
25. M-step for Gaussian emission
• Maximization of Q(θ, θ old) w.r.t. φ_k
• Gaussian emission densities: p(x | φ_k) = N(x | μ_k, Σ_k)
• Solution:

  μ_k = Σ_{n=1}^{N} γ(z_{nk}) x_n / Σ_{n=1}^{N} γ(z_{nk})

  Σ_k = Σ_{n=1}^{N} γ(z_{nk}) (x_n − μ_k)(x_n − μ_k)^T / Σ_{n=1}^{N} γ(z_{nk})
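A sketch of the Gaussian-emission M-step above, assuming observations X of shape (N, D) and gamma of shape (N, K); the small ridge term added to each covariance is an implementation choice, not part of the slides.

    import numpy as np

    def m_step_gaussian(X, gamma, eps=1e-6):
        """Weighted means (K, D) and covariances (K, D, D) using gamma as responsibilities."""
        N, D = X.shape
        Nk = gamma.sum(axis=0)                          # effective counts per state
        means = (gamma.T @ X) / Nk[:, None]             # mu_k = sum_n gamma_nk x_n / sum_n gamma_nk
        covs = np.empty((gamma.shape[1], D, D))
        for k in range(gamma.shape[1]):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + eps * np.eye(D)
        return means, covs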
26. M-Step for Multinomial Observed
• Conditional distribution of observations has the form

  p(x | z) = Π_{i=1}^{D} Π_{k=1}^{K} μ_{ik}^{x_i z_k}

• M-step equations:

  μ_{ik} = Σ_{n=1}^{N} γ(z_{nk}) x_{ni} / Σ_{n=1}^{N} γ(z_{nk})
• Analogous result holds for Bernoulli
observed variables
27. 6. Forward-Backward Algorithm
• E step: efficient procedure to evaluate
γ(zn) and ξ(zn-1,zn)
• Graph of HMM is a tree
• Implies that posterior distribution of latent
variables can be obtained efficiently using
message passing algorithm
• In HMM it is called forward-backward
algorithm or Baum-Welch Algorithm
• Several variants lead to exact marginals
• Method called alpha-beta discussed here
28. Derivation of Forward-Backward
• Several conditional-independences (A-H) hold
A. p(X|zn) = p(x1,..xn|zn) p(xn+1,..xN|zn)
• Proved using d-separation:
Every path from any of x1,..,xn-1 to xn passes through zn, which is observed.
All such paths are head-to-tail. Thus (x1,..,xn-1) _||_ xn | zn
Similarly (x1,..,xn-1,xn) _||_ (xn+1,..,xN) | zn
29. Conditional independence B
• Since (x1,..xn-1) _||_ xn | zn we have
B. p(x1,..xn-1|xn,zn)= p(x1,..xn-1|zn)
30. Conditional independence C
• Since (x1,..xn-1) _||_ zn | zn-1
C. p(x1,..xn-1|zn-1,zn)= p ( x1,..xn-1|zn-1 )
31. Conditional independence D
• Since (xn+1,..xN) _||_ zn | zn+1
D. p(xn+1,..xN|zn,zn+1) = p(xn+1,..xN|zn+1)
32. Conditional independence E
• Since (xn+2,..xN) _||_ xn+1 | zn+1
E. p(xn+2,..xN|zn+1,xn+1) = p(xn+2,..xN|zn+1)
33. Conditional independence F
F. p(X|zn-1,zn) = p(x1,..xn-1|zn-1) p(xn|zn) p(xn+1,..xN|zn)
34. Conditional independence G
Since (x1,..xN) _||_ xN+1 | zN+1
G. p(xN+1|X,zN+1)=p(xN+1|zN+1 )
35. Conditional independence H
H. p(zN+1|zN , X)= p ( zN+1|zN )
36. Evaluation of γ(zn)
• Recall that this is to efficiently compute
the E step of estimating parameters of
HMM
γ (zn) = p(zn|X,θold): Marginal posterior
distribution of latent variable zn
• We are interested in finding posterior
distribution p(zn|x1,..xN)
• This is a vector of length K whose entries
correspond to expected values of znk
37. Introducing α and β
• Using Bayes theorem
  γ(z_n) = p(z_n | X) = p(X | z_n) p(z_n) / p(X)
• Using conditional independence A
  γ(z_n) = p(x_1,..,x_n | z_n) p(x_{n+1},..,x_N | z_n) p(z_n) / p(X)
         = p(x_1,..,x_n, z_n) p(x_{n+1},..,x_N | z_n) / p(X)
         = α(z_n) β(z_n) / p(X)
• where α(z_n) = p(x_1,..,x_n, z_n)
  which is the joint probability of observing all of the given data up to time n and the value of z_n
• and β(z_n) = p(x_{n+1},..,x_N | z_n)
  which is the conditional probability of all future data from time n+1 up to N given the value of z_n
38. Recursion Relation for α
α(z_n) = p(x_1,.., x_n, z_n)
       = p(x_1,.., x_n | z_n) p(z_n)                                                        by Bayes rule
       = p(x_n | z_n) p(x_1,.., x_{n-1} | z_n) p(z_n)                                       by conditional independence B
       = p(x_n | z_n) p(x_1,.., x_{n-1}, z_n)                                               by Bayes rule
       = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,.., x_{n-1}, z_{n-1}, z_n)                          by sum rule
       = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,.., x_{n-1}, z_n | z_{n-1}) p(z_{n-1})              by Bayes rule
       = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,.., x_{n-1} | z_{n-1}) p(z_n | z_{n-1}) p(z_{n-1})  by cond. ind. C
       = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,.., x_{n-1}, z_{n-1}) p(z_n | z_{n-1})              by Bayes rule
       = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})                               by definition of α
39. Forward Recursion for α Εvaluation
• Recursion Relation is
  α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
• There are K terms in the
summation
• Has to be evaluated for each of K
values of zn
• Each step of recursion is O(K^2)
• Initial condition is
  α(z_1) = p(x_1, z_1) = p(z_1) p(x_1 | z_1) = Π_{k=1}^{K} {π_k p(x_1 | φ_k)}^{z_{1k}}
• Overall cost for the chain is O(K^2 N)
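A sketch of the unscaled forward recursion above, assuming pi of shape (K,), A of shape (K, K) and a precomputed emission table B of shape (N, K) with B[n, k] = p(x_n | phi_k); for long chains the rescaled version of slide 51 should be preferred.

    import numpy as np

    def forward(pi, A, B):
        """alpha[n, k] = p(x_1..x_n, z_n = k); each recursion step is O(K^2)."""
        N, K = B.shape
        alpha = np.zeros((N, K))
        alpha[0] = pi * B[0]                       # alpha(z_1) = p(z_1) p(x_1 | z_1)
        for n in range(1, N):
            alpha[n] = B[n] * (alpha[n - 1] @ A)   # p(x_n|z_n) * sum_{z_{n-1}} alpha(z_{n-1}) p(z_n|z_{n-1})
        return alpha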
40. Recursion Relation for β
β(z_n) = p(x_{n+1},.., x_N | z_n)
       = Σ_{z_{n+1}} p(x_{n+1},.., x_N, z_{n+1} | z_n)                                     by sum rule
       = Σ_{z_{n+1}} p(x_{n+1},.., x_N | z_n, z_{n+1}) p(z_{n+1} | z_n)                    by Bayes rule
       = Σ_{z_{n+1}} p(x_{n+1},.., x_N | z_{n+1}) p(z_{n+1} | z_n)                         by cond. ind. D
       = Σ_{z_{n+1}} p(x_{n+2},.., x_N | z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)    by cond. ind. E
       = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)                      by definition of β
41. Backward Recursion for β
• Backward message passing
  β(z_n) = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
• Evaluates β(z_n) in terms of β(z_{n+1})
• Starting condition for recursion comes from
  p(z_N | X) = p(X, z_N) β(z_N) / p(X)
• which is correct provided we set β(z_N) = 1 for all settings of z_N
• This is the initial condition for the backward computation
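A matching sketch of the backward recursion, under the same assumptions about A and the emission table B as in the forward sketch; β(z_N) is initialized to 1.

    import numpy as np

    def backward(A, B):
        """beta[n, k] = p(x_{n+1}..x_N | z_n = k), with beta[N-1] = 1."""
        N, K = B.shape
        beta = np.ones((N, K))                     # initial condition beta(z_N) = 1
        for n in range(N - 2, -1, -1):
            # sum_{z_{n+1}} beta(z_{n+1}) p(x_{n+1}|z_{n+1}) p(z_{n+1}|z_n)
            beta[n] = A @ (beta[n + 1] * B[n + 1])
        return beta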
42. M step Equations
• In the M-step equations p(X) will cancel out, for example

  μ_k = Σ_{n=1}^{N} γ(z_{nk}) x_n / Σ_{n=1}^{N} γ(z_{nk})

• The likelihood itself can be evaluated as
  p(X) = Σ_{z_n} α(z_n) β(z_n)
43. Evaluation of Quantities ξ(zn-1,zn)
• They correspond to the values of the conditional
probabilities p(zn-1,zn|X) for each of the K x K settings for
(zn-1,zn)
ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X)                                                                        by definition
               = p(X | z_{n-1}, z_n) p(z_{n-1}, z_n) / p(X)                                                  by Bayes rule
               = p(x_1,..,x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1},..,x_N | z_n) p(z_n | z_{n-1}) p(z_{n-1}) / p(X)   by cond. ind. F
               = α(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β(z_n) / p(X)

• Thus we can calculate ξ(z_{n-1}, z_n) directly by using the results of the α and β recursions
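Combining the two recursions, a sketch of how γ(z_n) and ξ(z_{n-1}, z_n) can be assembled; p(X) is taken as Σ_{z_N} α(z_N) (valid because β(z_N) = 1), and the array shapes follow the earlier sketches.

    import numpy as np

    def posteriors(alpha, beta, A, B):
        """Return gamma (N, K) and xi (N-1, K, K) from forward/backward results."""
        pX = alpha[-1].sum()                                 # p(X) = sum_{z_N} alpha(z_N)
        gamma = alpha * beta / pX                            # gamma(z_n) = alpha(z_n) beta(z_n) / p(X)
        # xi(z_{n-1}, z_n) = alpha(z_{n-1}) p(x_n|z_n) p(z_n|z_{n-1}) beta(z_n) / p(X)
        xi = alpha[:-1, :, None] * A[None, :, :] * (B[1:] * beta[1:])[:, None, :] / pX
        return gamma, xi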
44. Summary of EM to train HMM
Step 1: Initialization
• Make an initial selection of parameters θ old
where θ = (π, A,φ)
1. π is a vector of K probabilities of the states for latent
variable z1
2. A is a K x K matrix of transition probabilities Aij
3. φ are parameters of the emission distribution p(xn|zn)
• A and π parameters are often initialized uniformly
• Initialization of φ depends on form of distribution
• For Gaussian:
• parameters μk initialized by applying K-means to the data, Σk
corresponds to covariance matrix of cluster
45. Summary of EM to train HMM
Step 2: E Step
• Run both forward α recursion and
backward β recursion
• Use results to evaluate γ(zn) and ξ(zn-1,zn)
and the likelihood function
Step 3: M Step
• Use results of E step to find revised set of
parameters θnew using M-step equations
Alternate between E and M
until convergence of likelihood function
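Putting the steps together, a hedged skeleton of the EM loop for a Gaussian-emission HMM. The helpers named here (init_hmm_params, forward, backward, posteriors, m_step_pi_A, m_step_gaussian) refer to the sketches given earlier in this document, and scipy is used only to evaluate the Gaussian densities; this is an illustration, not a production implementation.

    import numpy as np
    from scipy.stats import multivariate_normal

    def emission_table(X, means, covs):
        """B[n, k] = N(x_n | mu_k, Sigma_k)."""
        return np.column_stack([multivariate_normal.pdf(X, m, c) for m, c in zip(means, covs)])

    def fit_hmm(X, K, n_iter=25):
        pi, A, (means, covs) = init_hmm_params(K, X.shape[1])    # Step 1: initialization
        for _ in range(n_iter):
            B = emission_table(X, means, covs)
            alpha, beta = forward(pi, A, B), backward(A, B)      # Step 2: E step (forward-backward)
            gamma, xi = posteriors(alpha, beta, A, B)
            pi, A = m_step_pi_A(gamma, xi)                       # Step 3: M step
            means, covs = m_step_gaussian(X, gamma)
        return pi, A, means, covs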
46. Values for p(xn|zn)
• In recursion relations, observations enter
through conditional distributions p(xn|zn)
• Recursions are independent of
• Dimensionality of observed variables
• Form of conditional distribution
• So long as it can be computed for each of K
possible states of zn
• Since observed variables {xn} are fixed
they can be pre-computed at the start of
the EM algorithm
47. 7(a) Sequence Length: Using Multiple Short Sequences
• HMM can be trained effectively if length of
sequence is sufficiently long
• True of all maximum likelihood approaches
• Alternatively we can use multiple short
sequences
• Requires straightforward modification of HMM-EM
algorithm
• Particularly important in left-to-right models
• In a given observation sequence, a state transition corresponding to a
non-diagonal element of A occurs only once
48. 7(b). Predictive Distribution
• Observed data is X={ x1,..,xN}
• Wish to predict xN+1
• Application in financial forecasting
  p(x_{N+1} | X) = Σ_{z_{N+1}} p(x_{N+1}, z_{N+1} | X)
                 = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) p(z_{N+1} | X)                              by product rule and cond. ind. G
                 = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1}, z_N | X)                 by sum rule
                 = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1} | z_N) p(z_N | X)         by conditional independence H
                 = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1} | z_N) p(z_N, X) / p(X)   by Bayes rule
                 = (1 / p(X)) Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1} | z_N) α(z_N)  by definition of α
• Can be evaluated by first running forward α recursion and
summing over zN and zN+1
• Can be extended to subsequent predictions of xN+2, after xN+1 is
observed, using a fixed amount of storage
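A sketch of the final expression above, assuming alpha from the unscaled forward recursion and a user-supplied function emission_pdf(x, k) for p(x_{N+1} | z_{N+1} = k); both names are assumptions made for the illustration.

    import numpy as np

    def predictive_density(x_next, alpha, A, emission_pdf):
        """p(x_{N+1} | X) = (1/p(X)) sum_{z_{N+1}} p(x_{N+1}|z_{N+1}) sum_{z_N} p(z_{N+1}|z_N) alpha(z_N)."""
        pX = alpha[-1].sum()                          # p(X), since beta(z_N) = 1
        w = alpha[-1] @ A                             # w[k] = sum_{z_N} alpha(z_N) p(z_{N+1}=k | z_N)
        e = np.array([emission_pdf(x_next, k) for k in range(A.shape[0])])
        return float(e @ w) / pX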
49. 7(c). Sum-Product and HMM
• HMM graph is a tree and hence the sum-product algorithm can be used to
find local marginals for the hidden variables
• Equivalent to the forward-backward algorithm
• Sum-product provides a simple way to derive the alpha-beta recursion formulae
• Joint distribution:
  p(x_1,..,x_N, z_1,..,z_N) = p(z_1) [ Π_{n=2}^{N} p(z_n | z_{n-1}) ] Π_{n=1}^{N} p(x_n | z_n)
• Transform the directed graph to a factor graph
• Each variable has a node, small squares represent factors, undirected links
connect factor nodes to the variables they use
[Figures: HMM graph and a fragment of the corresponding factor graph]
50. Deriving alpha-beta from Sum-Product
• Begin with a simplified form of the factor graph
• Factors are given by
  h(z_1) = p(z_1) p(x_1 | z_1)
  f_n(z_{n-1}, z_n) = p(z_n | z_{n-1}) p(x_n | z_n)
  (simplified by absorbing the emission probabilities into the transition probability factors)
• Messages propagated are
  μ_{z_{n-1} → f_n}(z_{n-1}) = μ_{f_{n-1} → z_{n-1}}(z_{n-1})
  μ_{f_n → z_n}(z_n) = Σ_{z_{n-1}} f_n(z_{n-1}, z_n) μ_{z_{n-1} → f_n}(z_{n-1})
• Can show that the α recursion is recovered:
  α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
• Similarly, starting with the root node, the β recursion is recovered:
  β(z_n) = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
• So also γ and ξ are derived:
  γ(z_n) = α(z_n) β(z_n) / p(X)
  ξ(z_{n-1}, z_n) = α(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β(z_n) / p(X)
[Figure: fragment of the simplified factor graph]
51. 7(d). Scaling Factors
• Implementation issue for small probabilities
• At each step of recursion
  α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
• To obtain new value of α(zn) from previous value α(zn-1)
we multiply p(zn|zn-1) and p(xn|zn)
• These probabilities are small and products will underflow
• Logs don’t help since we have sums of products
• Solution is to work with rescaled versions of α(zn) and β(zn), whose values remain close to unity
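A sketch of the rescaling idea: normalize α at each step so the stored values stay close to unity and keep the per-step scaling factors c_n, from which ln p(X) = Σ_n ln c_n; the variable names are assumptions, not from the slides.

    import numpy as np

    def forward_scaled(pi, A, B):
        """Return alpha_hat[n, k] = p(z_n = k | x_1..x_n), scaling factors c, and ln p(X)."""
        N, K = B.shape
        alpha_hat, c = np.zeros((N, K)), np.zeros(N)
        a = pi * B[0]
        c[0] = a.sum()
        alpha_hat[0] = a / c[0]
        for n in range(1, N):
            a = B[n] * (alpha_hat[n - 1] @ A)          # unnormalized update
            c[n] = a.sum()                             # c_n = p(x_n | x_1..x_{n-1})
            alpha_hat[n] = a / c[n]
        return alpha_hat, c, np.log(c).sum()           # ln p(X) = sum_n ln c_n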
52. 7(e). The Viterbi Algorithm
• Finding most probable sequence of hidden
states for a given sequence of observables
• In speech recognition: finding most probable
phoneme sequence for a given series of
acoustic observations
• Since graphical model of HMM is a tree, can be
solved exactly using max-sum algorithm
• Known as Viterbi algorithm in the context of HMM
• Since max-sum works with log probabilities, there is no need
to work with re-scaled variables as in the forward-backward
algorithm
53. Viterbi Algorithm for HMM
• Fragment of HMM lattice
showing two paths
• Number of possible paths
grows exponentially with
length of chain
• Viterbi searches space of
paths efficiently
• Finds most probable path
with computational cost
linear with length of chain
54. Deriving Viterbi from Max-Sum
• Start with simplified factor graph
• Treat variable zN as root node, passing
messages to root from leaf nodes
• Messages passed are
  μ_{z_n → f_{n+1}}(z_n) = μ_{f_n → z_n}(z_n)
  μ_{f_{n+1} → z_{n+1}}(z_{n+1}) = max_{z_n} { ln f_{n+1}(z_n, z_{n+1}) + μ_{z_n → f_{n+1}}(z_n) }
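A sketch of the Viterbi algorithm in log space, again assuming pi, A and an emission table B of shape (N, K); it returns the most probable state path with cost linear in the chain length.

    import numpy as np

    def viterbi(pi, A, B):
        """Most probable hidden state sequence via max-sum (Viterbi) in log space."""
        N, K = B.shape
        logA = np.log(A)
        omega = np.log(pi) + np.log(B[0])                 # omega(z_1)
        backptr = np.zeros((N, K), dtype=int)
        for n in range(1, N):
            scores = omega[:, None] + logA                # scores[j, k] = omega_j + ln A_jk
            backptr[n] = scores.argmax(axis=0)
            omega = scores.max(axis=0) + np.log(B[n])     # omega(z_n) = ln p(x_n|z_n) + max_j(...)
        path = np.zeros(N, dtype=int)
        path[-1] = int(omega.argmax())
        for n in range(N - 1, 0, -1):                     # backtrack through stored pointers
            path[n - 1] = backptr[n, path[n]]
        return path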
55. Other Topics on Sequential Data
• Sequential Data and Markov Models:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-
MarkovModels.pdf
• Extensions of HMMs:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-
HMMExtensions.pdf
• Linear Dynamical Systems:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-
LinearDynamicalSystems.pdf
• Conditional Random Fields:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.5-
ConditionalRandomFields.pdf