This document discusses hidden Markov models (HMMs) and their parameters. It contains the following key points:
1. HMMs are statistical models used to model time series data. They have hidden states that generate observed outputs.
2. HMMs have three main parameters: initial state probabilities, transition probabilities between states, and emission probabilities of observations given states.
3. The Expectation-Maximization (EM) algorithm is commonly used to estimate HMM parameters from observed data by iteratively computing expected state probabilities.
2. HMM Topics
1. What is an HMM?
2. State-space Representation
3. HMM Parameters
4. Generative View of HMM
5. Determining HMM Parameters Using EM
6. Forward-Backward or α−β algorithm
7. HMM Implementation Issues:
a) Length of Sequence
b) Predictive Distribution
c) Sum-Product Algorithm
d) Scaling Factors
e) Viterbi Algorithm
3. 1. What is an HMM?
• Ubiquitous tool for modeling time series data
• Used in
• Almost all speech recognition systems
• Computational molecular biology
• Group amino acid sequences into proteins
• Handwritten word recognition
• It is a tool for representing probability distributions over
sequences of observations
• HMM gets its name from two defining properties:
• Observation xt at time t was generated by some process whose state
zt is hidden from the observer
• Assumes that the state zt depends only on the state zt-1 and is
independent of all prior states (first order)
• Example: z are phoneme sequences
x are acoustic observations
4. Graphical Model of HMM
• Has the graphical model shown below and
latent variables are discrete
• Joint distribution has the form:
  p(x_1,..,x_N, z_1,..,z_N) = p(z_1) [ Π_{n=2}^{N} p(z_n | z_{n-1}) ] Π_{n=1}^{N} p(x_n | z_n)
5. HMM Viewed as Mixture
• A single time slice corresponds to a mixture distribution
with component densities p(x|z)
• Each state of discrete variable z represents a different component
• An extension of mixture model
• Choice of mixture component depends on the choice of component
for the previous observation
• Latent variables are multinomial variables zn
• That describe component responsible for generating xn
• Can use one-of-K coding scheme
6. 2. State-Space Representation
• Probability distribution of zn depends on the state of the previous
latent variable zn-1 through the probability distribution p(zn|zn-1)
• One-of-K coding of zn and zn-1, e.g.
    State k:    1  2  ..  K        State j:      1  2  ..  K
    z_{nk}:     0  1  ..  0        z_{n-1,j}:    1  0  ..  0
• Since latent variables are K-dimensional binary vectors, the
transition probabilities form a K x K matrix A with elements
    A_{jk} = p(z_{nk} = 1 | z_{n-1,j} = 1),   where Σ_k A_{jk} = 1
• These are known as Transition Probabilities
• There are K(K-1) independent parameters
[Figure: the K x K matrix A, rows indexed by the state of zn-1, columns by the state of zn]
7. Transition Probabilities
Example with a 3-state latent variable (K = 3). In one-of-K coding,
zn-1 = j means z_{n-1,j} = 1 with all other components 0, and zn = k means z_{nk} = 1.

                z_n = 1   z_n = 2   z_n = 3
  z_{n-1} = 1     A11       A12       A13        A11 + A12 + A13 = 1
  z_{n-1} = 2     A21       A22       A23
  z_{n-1} = 3     A31       A32       A33

[Figure: state transition diagram for the three states]
• The state transition diagram is not a graphical model, since its nodes
are not separate variables but states of a single variable
8. Conditional Probabilities
• Transition probabilities Ajk represent state-to-state
probabilities for each variable
• Conditional probabilities are variable-to-variable
probabilities
• can be written in terms of transition probabilities as
  p(z_n | z_{n-1}, A) = Π_{k=1}^{K} Π_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{nk}}
• Note that exponent zn-1,j zn,k is a product that evaluates to 0 or 1
• Hence the overall product will evaluate to a single Ajk for each
setting of values of zn and zn-1
• E.g., zn-1=2 and zn=3 will result in only zn-1,2=1 and zn,3=1. Thus
p(zn=3|zn-1=2)=A23
• A is a global HMM parameter
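As a hedged illustration (not part of the original slides), a minimal NumPy sketch of how p(z_n | z_{n-1}, A) collapses to a single entry A_jk under one-of-K coding; the 3-state matrix A below is made up for the example.

    import numpy as np

    # Hypothetical 3-state transition matrix; each row sums to 1
    A = np.array([[0.90, 0.05, 0.05],
                  [0.10, 0.80, 0.10],
                  [0.05, 0.15, 0.80]])

    def one_of_k(state, K):
        """Return a K-dimensional binary vector with a 1 at position `state`."""
        z = np.zeros(K)
        z[state] = 1.0
        return z

    def transition_prob(z_prev, z_curr, A):
        """p(z_n | z_{n-1}, A) = prod_j prod_k A_jk ** (z_{n-1,j} * z_{n,k})."""
        return float(np.prod(A ** np.outer(z_prev, z_curr)))

    # z_{n-1} = state 2, z_n = state 3 (0-indexed: 1 and 2) picks out A[1, 2]
    print(transition_prob(one_of_k(1, 3), one_of_k(2, 3), A), A[1, 2])  # both 0.10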
9. Initial Variable Probabilities
• Initial latent node z1 is a special case without
parent node
• Represented by vector of probabilities π with
elements πk=p(z1k=1) so that
  p(z_1 | π) = Π_{k=1}^{K} π_k^{z_{1k}},   where Σ_k π_k = 1
• Note that π is an HMM parameter
• representing probabilities of each state for the
first variable
10. Lattice or Trellis Diagram
• State transition diagram unfolded over time
• Representation of latent variable states
• Each column corresponds to one of latent
variables zn
11. Emission Probabilities p(xn|zn)
• We have so far only specified p(zn|zn-1) by
means of transition probabilities
• Need to specify probabilities p(xn|zn,φ) to
complete the model, where φ are parameters
• These can be continuous or discrete
• Because xn is observed and zn is discrete
p(xn|zn,φ) consists of a table of K numbers
corresponding to K states of zn
• Analogous to class-conditional probabilities
• Can be represented as
  p(x_n | z_n, φ) = Π_{k=1}^{K} p(x_n | φ_k)^{z_{nk}}
12. 3. HMM Parameters
• We have defined three types of HMM parameters: θ =
(π, A,φ)
1. Initial Probabilities of first latent variable:
π is a vector of K probabilities of the states for latent variable z1
2. Transition Probabilities (State-to-state for any latent variable):
A is a K x K matrix of transition probabilities Aij
3. Emission Probabilities (Observations conditioned on latent):
φ are parameters of the emission distribution p(xn|zn)
• A and π parameters are often initialized uniformly
• Initialization of φ depends on form of distribution
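A minimal sketch (an assumed container layout, not from the slides) of holding θ = (π, A, φ) in code with the uniform initialization of π and A mentioned above; the Gaussian emission parameters here are only placeholders.

    import numpy as np

    def init_hmm_params(K, D, rng=np.random.default_rng(0)):
        """Uniformly initialized (pi, A) plus placeholder Gaussian phi = (means, covs)."""
        pi = np.full(K, 1.0 / K)              # initial state probabilities, sum to 1
        A = np.full((K, K), 1.0 / K)          # transition matrix, each row sums to 1
        means = rng.normal(size=(K, D))       # emission means (K-means is often used in practice)
        covs = np.stack([np.eye(D)] * K)      # emission covariances
        return pi, A, (means, covs)

    pi, A, (means, covs) = init_hmm_params(K=3, D=2)
    print(pi, A.sum(axis=1))                  # rows of A sum to 1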
13. Joint Distribution over Latent and
Observed variables
• Joint can be expressed in terms of parameters:
  p(X, Z | θ) = p(z_1 | π) [ Π_{n=2}^{N} p(z_n | z_{n-1}, A) ] Π_{m=1}^{N} p(x_m | z_m, φ)
  where X = {x_1,..,x_N}, Z = {z_1,..,z_N}, θ = {π, A, φ}
• Most of the discussion of HMMs is independent of the particular form
of the emission probabilities
• Tractable choices include discrete tables, Gaussians, and Gaussian mixture models (GMMs)
14. 4. Generative View of HMM
• Sampling from an HMM
• HMM with 3-state latent variable z
• Gaussian emission model p(x|z)
• Contours of constant density of emission
distribution shown for each state
• Two-dimensional x
• 50 data points generated from HMM
• Lines show successive observations
• Transition probabilities fixed so that there is
• 5% probability of making a transition to each of the other states
• 90% probability of remaining in the same state
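A hedged sketch of the generative (ancestral sampling) view described above, with made-up 3-state parameters and 2-D Gaussian emissions; it is not the exact model used to produce the figure on this slide.

    import numpy as np

    rng = np.random.default_rng(1)
    K, D, N = 3, 2, 50
    pi = np.array([1/3, 1/3, 1/3])
    A = np.full((K, K), 0.05) + 0.85 * np.eye(K)   # 90% self-transition, 5% to each other state
    means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

    z = rng.choice(K, p=pi)                        # sample z_1 from pi
    samples = []
    for n in range(N):
        x = rng.multivariate_normal(means[z], np.eye(D))   # sample x_n from p(x | z_n)
        samples.append((z, x))
        z = rng.choice(K, p=A[z])                  # sample z_{n+1} from row z of A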
15. Left-to-Right HMM
• Setting elements of Ajk=0 if k < j
• Corresponding lattice diagram
16. Left-to-Right Applied Generatively to Digits
• Examples of on-line handwritten 2’s
• x is a sequence of pen coordinates
• There are 16 states for z, i.e. K=16
• Each state can generate a line segment of fixed
length in one of 16 angles
• Emission probabilities: 16 x 16 table
• Transition probabilities set to zero except for those
that keep state index k the same or increment by one
• Parameters optimized by 25 EM iterations
• Trained on 45 digits
• Generative model is quite poor
• Since generated digits don't look much like the training examples
• If classification is the goal, can do better by using a
discriminative HMM
[Figures: training set and generated set of handwritten 2's]
17. 5. Determining HMM Parameters
• Given data set X = {x_1,..,x_N} we can determine HMM
parameters θ = {π, A, φ} using maximum likelihood
• Likelihood function is obtained from the joint distribution by
marginalizing over the latent variables Z = {z_1,..,z_N}:
  p(X | θ) = Σ_Z p(X, Z | θ)
18. Computational issues for Parameters
p(X|θ) = ΣZ p(X,Z|θ)
• Computational difficulties
• Joint distribution p(X,Z|θ) does not factorize over n,
in contrast to mixture model
• The sum over Z has a number of terms that grows exponentially with the chain length (one per path through the trellis)
• Solution
• Use conditional independence properties to reorder
summations
• Use EM instead to maximize log-likelihood function of joint
ln p(X,Z|θ)
Efficient framework for maximizing the likelihood function in HMMs
19. EM for MLE in HMM
1. Start with initial selection for model parameters θold
2. In E step take these parameter values and find
posterior distribution of latent variables p(Z|X,θ old)
Use this posterior distribution to evaluate
expectation of the logarithm of the complete-data
likelihood function ln p(X,Z|θ)
Which can be written as
  Q(θ, θ old) = Σ_Z p(Z | X, θ old) ln p(X, Z | θ)
The factor p(Z | X, θ old), which is independent of the new θ, is the quantity evaluated in the E step
3. In M-Step maximize Q w.r.t. θ
20. Expansion of Q
• Introduce notation
γ (zn) = p(zn|X,θold): Marginal posterior distribution of
latent variable zn
ξ(zn-1,zn) = p(zn-1,zn|X,θ old): Joint posterior of two
successive latent variables
• We will be re-expressing Q in terms of γ and ξ
  Q(θ, θ old) = Σ_Z p(Z | X, θ old) ln p(X, Z | θ)
21. Detail of γ and ξ
For each value of n we can store
γ(zn) using K non-negative numbers that sum to unity
ξ(zn-1,zn) using a K x K matrix whose elements also sum
to unity
• Using notation
γ (znk) denotes conditional probability of znk=1
Similar notation for ξ(zn-1,j,znk)
• Because the expectation of a binary random
variable is the probability that it takes value 1
  γ(z_{nk}) = E[z_{nk}] = Σ_{z_n} γ(z_n) z_{nk}
  ξ(z_{n-1,j}, z_{nk}) = E[z_{n-1,j} z_{nk}] = Σ_{z_{n-1}, z_n} ξ(z_{n-1}, z_n) z_{n-1,j} z_{nk}
22. Expansion of Q
• We begin with
Q(θ , θ old ) = ∑ p ( Z | X , θ old ) ln p ( X , Z | θ )
Z
• Substitute
  p(X, Z | θ) = p(z_1 | π) [ Π_{n=2}^{N} p(z_n | z_{n-1}, A) ] Π_{m=1}^{N} p(x_m | z_m, φ)
• And use definitions of γ and ξ to get:
  Q(θ, θ old) = Σ_{k=1}^{K} γ(z_{1k}) ln π_k + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(z_{n-1,j}, z_{nk}) ln A_{jk}
              + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)
23. E-Step
  Q(θ, θ old) = Σ_{k=1}^{K} γ(z_{1k}) ln π_k + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(z_{n-1,j}, z_{nk}) ln A_{jk}
              + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)
• Goal of E step is to evaluate γ(zn) and ξ(zn-1,zn)
efficiently (Forward-Backward Algorithm)
24. M-Step
• Maximize Q(θ,θ old) with respect to parameters
θ ={π,A,φ}
• Treat γ (zn) and ξ(zn-1,zn) as constant
• Maximization w.r.t. π and A
• easily achieved (using Lagrange multipliers):

  π_k = γ(z_{1k}) / Σ_{j=1}^{K} γ(z_{1j})

  A_{jk} = Σ_{n=2}^{N} ξ(z_{n-1,j}, z_{nk}) / Σ_{l=1}^{K} Σ_{n=2}^{N} ξ(z_{n-1,j}, z_{nl})
• Maximization w.r.t. φ_k
• Only the last term of Q depends on φ_k:
  Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)
• Same form as in a mixture distribution for i.i.d. data
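A minimal sketch of the closed-form updates above, assuming the E step has produced gamma with shape (N, K) and xi with shape (N-1, K, K); the array layout is an assumption, not something fixed by the slides.

    import numpy as np

    def m_step_pi_A(gamma, xi):
        """gamma[n, k] = gamma(z_nk); xi[n-1, j, k] = xi(z_{n-1,j}, z_nk)."""
        pi = gamma[0] / gamma[0].sum()            # pi_k = gamma(z_1k) / sum_j gamma(z_1j)
        A = xi.sum(axis=0)                        # numerator: sum over n = 2..N of xi
        A = A / A.sum(axis=1, keepdims=True)      # denominator: sum over l of the numerator
        return pi, A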
25. M-step for Gaussian emission
• Maximization of Q(θ, θ old) w.r.t. φ_k
• Gaussian emission densities: p(x | φ_k) = N(x | μ_k, Σ_k)
• Solution:

  μ_k = Σ_{n=1}^{N} γ(z_{nk}) x_n / Σ_{n=1}^{N} γ(z_{nk})

  Σ_k = Σ_{n=1}^{N} γ(z_{nk}) (x_n − μ_k)(x_n − μ_k)^T / Σ_{n=1}^{N} γ(z_{nk})
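A sketch of the Gaussian-emission M-step above, assuming observations X of shape (N, D) and gamma of shape (N, K); the small ridge term added to each covariance is an implementation choice, not part of the slides.

    import numpy as np

    def m_step_gaussian(X, gamma, eps=1e-6):
        """Weighted means (K, D) and covariances (K, D, D) using gamma as responsibilities."""
        N, D = X.shape
        Nk = gamma.sum(axis=0)                          # effective counts per state
        means = (gamma.T @ X) / Nk[:, None]             # mu_k = sum_n gamma_nk x_n / sum_n gamma_nk
        covs = np.empty((gamma.shape[1], D, D))
        for k in range(gamma.shape[1]):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + eps * np.eye(D)
        return means, covs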
26. M-Step for Multinomial Observed
• Conditional distribution of observations has the form

  p(x | z) = Π_{i=1}^{D} Π_{k=1}^{K} μ_{ik}^{x_i z_k}

• M-step equations:

  μ_{ik} = Σ_{n=1}^{N} γ(z_{nk}) x_{ni} / Σ_{n=1}^{N} γ(z_{nk})
• Analogous result holds for Bernoulli
observed variables
27. 6. Forward-Backward Algorithm
• E step: efficient procedure to evaluate
γ(zn) and ξ(zn-1,zn)
• Graph of HMM is a tree
• Implies that posterior distribution of latent
variables can be obtained efficiently using
message passing algorithm
• In HMM it is called forward-backward
algorithm or Baum-Welch Algorithm
• Several variants lead to exact marginals
• Method called alpha-beta discussed here
28. Derivation of Forward-Backward
• Several conditional-independences (A-H) hold
A. p(X|zn) = p(x1,..xn|zn) p(xn+1,..xN|zn)
• Proved using d-separation:
Every path from any of x1,..,xn-1 to xn passes through zn, which is observed.
All such paths are head-to-tail. Thus (x1,..,xn-1) _||_ xn | zn
Similarly (x1,..,xn-1,xn) _||_ (xn+1,..,xN) | zn
29. Conditional independence B
• Since (x1,..xn-1) _||_ xn | zn we have
B. p(x1,..xn-1|xn,zn)= p(x1,..xn-1|zn)
30. Conditional independence C
• Since (x1,..xn-1) _||_ zn | zn-1
C. p(x1,..xn-1|zn-1,zn)= p ( x1,..xn-1|zn-1 )
31. Conditional independence D
• Since (xn+1,..xN) _||_ zn | zn+1
D. p(xn+1,..xN|zn,zn+1) = p(xn+1,..xN|zn+1)
32. Conditional independence E
• Since (xn+2,..xN) _||_ xn+1 | zn+1
E. p(xn+2,..xN|zn+1,xn+1) = p(xn+2,..xN|zn+1)
33. Conditional independence F
F. p(X|zn-1,zn) = p(x1,..xn-1|zn-1) p(xn|zn) p(xn+1,..xN|zn)
34. Conditional independence G
Since (x1,..xN) _||_ xN+1 | zN+1
G. p(xN+1|X,zN+1)=p(xN+1|zN+1 )
35. Conditional independence H
H. p(zN+1|zN , X)= p ( zN+1|zN )
36. Evaluation of γ(zn)
• Recall that this is to efficiently compute
the E step of estimating parameters of
HMM
γ (zn) = p(zn|X,θold): Marginal posterior
distribution of latent variable zn
• We are interested in finding posterior
distribution p(zn|x1,..xN)
• This is a vector of length K whose entries
correspond to expected values of znk
37. Introducing α and β
• Using Bayes theorem
  γ(z_n) = p(z_n | X) = p(X | z_n) p(z_n) / p(X)
• Using conditional independence A
  γ(z_n) = p(x_1,..,x_n | z_n) p(x_{n+1},..,x_N | z_n) p(z_n) / p(X)
         = p(x_1,..,x_n, z_n) p(x_{n+1},..,x_N | z_n) / p(X)
         = α(z_n) β(z_n) / p(X)
• where α(z_n) = p(x_1,..,x_n, z_n)
  which is the joint probability of observing all of the given data up to time n and the value of z_n
• and β(z_n) = p(x_{n+1},..,x_N | z_n)
  which is the conditional probability of all future data from time n+1 up to N given the value of z_n
38. Recursion Relation for α
α(z_n) = p(x_1,.., x_n, z_n)
       = p(x_1,.., x_n | z_n) p(z_n)                                                        by Bayes rule
       = p(x_n | z_n) p(x_1,.., x_{n-1} | z_n) p(z_n)                                       by conditional independence B
       = p(x_n | z_n) p(x_1,.., x_{n-1}, z_n)                                               by Bayes rule
       = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,.., x_{n-1}, z_{n-1}, z_n)                          by sum rule
       = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,.., x_{n-1}, z_n | z_{n-1}) p(z_{n-1})              by Bayes rule
       = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,.., x_{n-1} | z_{n-1}) p(z_n | z_{n-1}) p(z_{n-1})  by cond. ind. C
       = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,.., x_{n-1}, z_{n-1}) p(z_n | z_{n-1})              by Bayes rule
       = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})                               by definition of α
39. Forward Recursion for α Εvaluation
• Recursion Relation is
  α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
• There are K terms in the
summation
• Has to be evaluated for each of K
values of zn
• Each step of recursion is O(K^2)
• Initial condition is
  α(z_1) = p(x_1, z_1) = p(z_1) p(x_1 | z_1) = Π_{k=1}^{K} {π_k p(x_1 | φ_k)}^{z_{1k}}
• Overall cost for the chain is O(K^2 N)
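A sketch of the unscaled forward recursion above, assuming pi of shape (K,), A of shape (K, K) and a precomputed emission table B of shape (N, K) with B[n, k] = p(x_n | phi_k); for long chains the rescaled version of slide 51 should be preferred.

    import numpy as np

    def forward(pi, A, B):
        """alpha[n, k] = p(x_1..x_n, z_n = k); each recursion step is O(K^2)."""
        N, K = B.shape
        alpha = np.zeros((N, K))
        alpha[0] = pi * B[0]                       # alpha(z_1) = p(z_1) p(x_1 | z_1)
        for n in range(1, N):
            alpha[n] = B[n] * (alpha[n - 1] @ A)   # p(x_n|z_n) * sum_{z_{n-1}} alpha(z_{n-1}) p(z_n|z_{n-1})
        return alpha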
40. Recursion Relation for β
β(z_n) = p(x_{n+1},.., x_N | z_n)
       = Σ_{z_{n+1}} p(x_{n+1},.., x_N, z_{n+1} | z_n)                                     by sum rule
       = Σ_{z_{n+1}} p(x_{n+1},.., x_N | z_n, z_{n+1}) p(z_{n+1} | z_n)                    by Bayes rule
       = Σ_{z_{n+1}} p(x_{n+1},.., x_N | z_{n+1}) p(z_{n+1} | z_n)                         by cond. ind. D
       = Σ_{z_{n+1}} p(x_{n+2},.., x_N | z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)    by cond. ind. E
       = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)                      by definition of β
41. Backward Recursion for β
• Backward message passing
  β(z_n) = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
• Evaluates β(z_n) in terms of β(z_{n+1})
• Starting condition for recursion comes from
  p(z_N | X) = p(X, z_N) β(z_N) / p(X)
• which is correct provided we set β(z_N) = 1 for all settings of z_N
• This is the initial condition for the backward computation
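A matching sketch of the backward recursion, under the same assumptions about A and the emission table B as in the forward sketch; β(z_N) is initialized to 1.

    import numpy as np

    def backward(A, B):
        """beta[n, k] = p(x_{n+1}..x_N | z_n = k), with beta[N-1] = 1."""
        N, K = B.shape
        beta = np.ones((N, K))                     # initial condition beta(z_N) = 1
        for n in range(N - 2, -1, -1):
            # sum_{z_{n+1}} beta(z_{n+1}) p(x_{n+1}|z_{n+1}) p(z_{n+1}|z_n)
            beta[n] = A @ (beta[n + 1] * B[n + 1])
        return beta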
42. M step Equations
• In the M-step equations p(X) will cancel out, for example

  μ_k = Σ_{n=1}^{N} γ(z_{nk}) x_n / Σ_{n=1}^{N} γ(z_{nk})

• The likelihood itself can be evaluated as
  p(X) = Σ_{z_n} α(z_n) β(z_n)
43. Evaluation of Quantities ξ(zn-1,zn)
• They correspond to the values of the conditional
probabilities p(zn-1,zn|X) for each of the K x K settings for
(zn-1,zn)
ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X)                                                                        by definition
               = p(X | z_{n-1}, z_n) p(z_{n-1}, z_n) / p(X)                                                  by Bayes rule
               = p(x_1,..,x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1},..,x_N | z_n) p(z_n | z_{n-1}) p(z_{n-1}) / p(X)   by cond. ind. F
               = α(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β(z_n) / p(X)

• Thus we can calculate ξ(z_{n-1}, z_n) directly by using the results of the α and β recursions
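Combining the two recursions, a sketch of how γ(z_n) and ξ(z_{n-1}, z_n) can be assembled; p(X) is taken as Σ_{z_N} α(z_N) (valid because β(z_N) = 1), and the array shapes follow the earlier sketches.

    import numpy as np

    def posteriors(alpha, beta, A, B):
        """Return gamma (N, K) and xi (N-1, K, K) from forward/backward results."""
        pX = alpha[-1].sum()                                 # p(X) = sum_{z_N} alpha(z_N)
        gamma = alpha * beta / pX                            # gamma(z_n) = alpha(z_n) beta(z_n) / p(X)
        # xi(z_{n-1}, z_n) = alpha(z_{n-1}) p(x_n|z_n) p(z_n|z_{n-1}) beta(z_n) / p(X)
        xi = alpha[:-1, :, None] * A[None, :, :] * (B[1:] * beta[1:])[:, None, :] / pX
        return gamma, xi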
44. Summary of EM to train HMM
Step 1: Initialization
• Make an initial selection of parameters θ old
where θ = (π, A,φ)
1. π is a vector of K probabilities of the states for latent
variable z1
2. A is a K x K matrix of transition probabilities Aij
3. φ are parameters of the emission distribution p(xn|zn)
• A and π parameters are often initialized uniformly
• Initialization of φ depends on form of distribution
• For Gaussian:
• parameters μk initialized by applying K-means to the data, Σk
corresponds to covariance matrix of cluster
45. Summary of EM to train HMM
Step 2: E Step
• Run both forward α recursion and
backward β recursion
• Use results to evaluate γ(zn) and ξ(zn-1,zn)
and the likelihood function
Step 3: M Step
• Use results of E step to find revised set of
parameters θnew using M-step equations
Alternate between E and M
until convergence of likelihood function
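Putting the steps together, a hedged skeleton of the EM loop for a Gaussian-emission HMM. The helpers named here (init_hmm_params, forward, backward, posteriors, m_step_pi_A, m_step_gaussian) refer to the sketches given earlier in this document, and scipy is used only to evaluate the Gaussian densities; this is an illustration, not a production implementation.

    import numpy as np
    from scipy.stats import multivariate_normal

    def emission_table(X, means, covs):
        """B[n, k] = N(x_n | mu_k, Sigma_k)."""
        return np.column_stack([multivariate_normal.pdf(X, m, c) for m, c in zip(means, covs)])

    def fit_hmm(X, K, n_iter=25):
        pi, A, (means, covs) = init_hmm_params(K, X.shape[1])    # Step 1: initialization
        for _ in range(n_iter):
            B = emission_table(X, means, covs)
            alpha, beta = forward(pi, A, B), backward(A, B)      # Step 2: E step (forward-backward)
            gamma, xi = posteriors(alpha, beta, A, B)
            pi, A = m_step_pi_A(gamma, xi)                       # Step 3: M step
            means, covs = m_step_gaussian(X, gamma)
        return pi, A, means, covs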
46. Values for p(xn|zn)
• In recursion relations, observations enter
through conditional distributions p(xn|zn)
• Recursions are independent of
• Dimensionality of observed variables
• Form of conditional distribution
• So long as it can be computed for each of K
possible states of zn
• Since observed variables {xn} are fixed
they can be pre-computed at the start of
the EM algorithm
47. 7(a) Sequence Length: Using Multiple Short Sequences
• HMM can be trained effectively if length of
sequence is sufficiently long
• True of all maximum likelihood approaches
• Alternatively we can use multiple short
sequences
• Requires straightforward modification of HMM-EM
algorithm
• Particularly important in left-to-right models
• In a given observation sequence, a state transition corresponding to a
non-diagonal element of A occurs only once
48. 7(b). Predictive Distribution
• Observed data is X={ x1,..,xN}
• Wish to predict xN+1
• Application in financial forecasting
  p(x_{N+1} | X) = Σ_{z_{N+1}} p(x_{N+1}, z_{N+1} | X)
                 = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) p(z_{N+1} | X)                              by product rule and cond. ind. G
                 = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1}, z_N | X)                 by sum rule
                 = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1} | z_N) p(z_N | X)         by conditional independence H
                 = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1} | z_N) p(z_N, X) / p(X)   by Bayes rule
                 = (1 / p(X)) Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1} | z_N) α(z_N)  by definition of α
• Can be evaluated by first running forward α recursion and
summing over zN and zN+1
• Can be extended to subsequent predictions of xN+2, after xN+1 is
observed, using a fixed amount of storage
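A sketch of the final expression above, assuming alpha from the unscaled forward recursion and a user-supplied function emission_pdf(x, k) for p(x_{N+1} | z_{N+1} = k); both names are assumptions made for the illustration.

    import numpy as np

    def predictive_density(x_next, alpha, A, emission_pdf):
        """p(x_{N+1} | X) = (1/p(X)) sum_{z_{N+1}} p(x_{N+1}|z_{N+1}) sum_{z_N} p(z_{N+1}|z_N) alpha(z_N)."""
        pX = alpha[-1].sum()                          # p(X), since beta(z_N) = 1
        w = alpha[-1] @ A                             # w[k] = sum_{z_N} alpha(z_N) p(z_{N+1}=k | z_N)
        e = np.array([emission_pdf(x_next, k) for k in range(A.shape[0])])
        return float(e @ w) / pX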
49. 7(c). Sum-Product and HMM
• HMM graph is a tree and hence the sum-product algorithm can be used to
find local marginals for the hidden variables
• Equivalent to the forward-backward algorithm
• Sum-product provides a simple way to derive the alpha-beta recursion formulae
• Joint distribution:
  p(x_1,..,x_N, z_1,..,z_N) = p(z_1) [ Π_{n=2}^{N} p(z_n | z_{n-1}) ] Π_{n=1}^{N} p(x_n | z_n)
• Transform the directed graph to a factor graph
• Each variable has a node, small squares represent factors, undirected links
connect factor nodes to the variables they use
[Figures: HMM graph and a fragment of the corresponding factor graph]
50. Deriving alpha-beta from Sum-Product
• Begin with a simplified form of the factor graph
• Factors are given by
  h(z_1) = p(z_1) p(x_1 | z_1)
  f_n(z_{n-1}, z_n) = p(z_n | z_{n-1}) p(x_n | z_n)
  (simplified by absorbing the emission probabilities into the transition probability factors)
• Messages propagated are
  μ_{z_{n-1} → f_n}(z_{n-1}) = μ_{f_{n-1} → z_{n-1}}(z_{n-1})
  μ_{f_n → z_n}(z_n) = Σ_{z_{n-1}} f_n(z_{n-1}, z_n) μ_{z_{n-1} → f_n}(z_{n-1})
• Can show that the α recursion is recovered:
  α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
• Similarly, starting with the root node, the β recursion is recovered:
  β(z_n) = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
• So also γ and ξ are derived:
  γ(z_n) = α(z_n) β(z_n) / p(X)
  ξ(z_{n-1}, z_n) = α(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β(z_n) / p(X)
[Figure: fragment of the simplified factor graph]
51. 7(d). Scaling Factors
• Implementation issue for small probabilities
• At each step of recursion
  α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
• To obtain new value of α(zn) from previous value α(zn-1)
we multiply p(zn|zn-1) and p(xn|zn)
• These probabilities are small and products will underflow
• Logs don’t help since we have sums of products
• Solution is to work with rescaled versions of α(zn) and β(zn), whose values remain close to unity
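A sketch of the rescaling idea: normalize α at each step so the stored values stay close to unity and keep the per-step scaling factors c_n, from which ln p(X) = Σ_n ln c_n; the variable names are assumptions, not from the slides.

    import numpy as np

    def forward_scaled(pi, A, B):
        """Return alpha_hat[n, k] = p(z_n = k | x_1..x_n), scaling factors c, and ln p(X)."""
        N, K = B.shape
        alpha_hat, c = np.zeros((N, K)), np.zeros(N)
        a = pi * B[0]
        c[0] = a.sum()
        alpha_hat[0] = a / c[0]
        for n in range(1, N):
            a = B[n] * (alpha_hat[n - 1] @ A)          # unnormalized update
            c[n] = a.sum()                             # c_n = p(x_n | x_1..x_{n-1})
            alpha_hat[n] = a / c[n]
        return alpha_hat, c, np.log(c).sum()           # ln p(X) = sum_n ln c_n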
52. 7(e). The Viterbi Algorithm
• Finding most probable sequence of hidden
states for a given sequence of observables
• In speech recognition: finding most probable
phoneme sequence for a given series of
acoustic observations
• Since graphical model of HMM is a tree, can be
solved exactly using max-sum algorithm
• Known as Viterbi algorithm in the context of HMM
• Since max-sum works with log probabilities, there is no need
to work with re-scaled variables as in the forward-backward
algorithm
53. Viterbi Algorithm for HMM
• Fragment of HMM lattice
showing two paths
• Number of possible paths
grows exponentially with
length of chain
• Viterbi searches space of
paths efficiently
• Finds most probable path
with computational cost
linear with length of chain
54. Deriving Viterbi from Max-Sum
• Start with simplified factor graph
• Treat variable zN as root node, passing
messages to root from leaf nodes
• Messages passed are
  μ_{z_n → f_{n+1}}(z_n) = μ_{f_n → z_n}(z_n)
  μ_{f_{n+1} → z_{n+1}}(z_{n+1}) = max_{z_n} { ln f_{n+1}(z_n, z_{n+1}) + μ_{z_n → f_{n+1}}(z_n) }
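A sketch of the Viterbi algorithm in log space, again assuming pi, A and an emission table B of shape (N, K); it returns the most probable state path with cost linear in the chain length.

    import numpy as np

    def viterbi(pi, A, B):
        """Most probable hidden state sequence via max-sum (Viterbi) in log space."""
        N, K = B.shape
        logA = np.log(A)
        omega = np.log(pi) + np.log(B[0])                 # omega(z_1)
        backptr = np.zeros((N, K), dtype=int)
        for n in range(1, N):
            scores = omega[:, None] + logA                # scores[j, k] = omega_j + ln A_jk
            backptr[n] = scores.argmax(axis=0)
            omega = scores.max(axis=0) + np.log(B[n])     # omega(z_n) = ln p(x_n|z_n) + max_j(...)
        path = np.zeros(N, dtype=int)
        path[-1] = int(omega.argmax())
        for n in range(N - 1, 0, -1):                     # backtrack through stored pointers
            path[n - 1] = backptr[n, path[n]]
        return path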
55. Other Topics on Sequential Data
• Sequential Data and Markov Models:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-
MarkovModels.pdf
• Extensions of HMMs:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-
HMMExtensions.pdf
• Linear Dynamical Systems:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-
LinearDynamicalSystems.pdf
• Conditional Random Fields:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.5-
ConditionalRandomFields.pdf