Hidden Markov Models

                            Sargur N. Srihari
                            srihari@cedar.buffalo.edu
             Machine Learning Course:
  http://www.cedar.buffalo.edu/~srihari/CSE574/index.html

HMM Topics
  1.     What is an HMM?
  2.     State-space Representation
  3.     HMM Parameters
  4.     Generative View of HMM
  5.     Determining HMM Parameters Using EM
  6.     Forward-Backward or α−β algorithm
  7.     HMM Implementation Issues:
        a)    Length of Sequence
        b)    Predictive Distribution
        c)    Sum-Product Algorithm
        d)    Scaling Factors
        e)    Viterbi Algorithm
1. What is an HMM?
•    Ubiquitous tool for modeling time series data
•    Used in
      •   Almost all speech recognition systems
      •   Computational molecular biology
            •   Group amino acid sequences into proteins
      •   Handwritten word recognition
•    It is a tool for representing probability distributions over
     sequences of observations
•    HMM gets its name from two defining properties:
      •   Observation xt at time t was generated by some process whose state
          zt is hidden from the observer
      •   Assumes that state zt is dependent only on state zt-1 and
          independent of all prior states (First order)
•    Example: z are phoneme sequences
              x are acoustic observations

Graphical Model of HMM
  • Has the graphical model shown below and
      latent variables are discrete




  • Joint distribution has the form:
       p(x1,..,xN, z1,..,zN) = p(z1) [ Π_{n=2}^{N} p(zn|zn-1) ] Π_{n=1}^{N} p(xn|zn)
HMM Viewed as Mixture



•   A single time slice corresponds to a mixture distribution
    with component densities p(x|z)
    •   Each state of discrete variable z represents a different component
•   An extension of mixture model
    •   Choice of mixture component depends on the component chosen
        at the previous time step
•   Latent variables are multinomial variables zn
    •   That describe component responsible for generating xn
•   Can use one-of-K coding scheme
2. State-Space Representation
• Probability distribution of zn depends on state of previous
     latent variable zn-1 through probability distribution p(zn|zn-1)
•    One-of-K coding, e.g.:

       State k of zn:      k        1   2   …   K
                           znk      0   1   …   0

       State j of zn-1:    j        1   2   …   K
                           zn-1,j   1   0   …   0

      • Since latent variables are K-dimensional binary vectors
            Ajk = p(znk=1 | zn-1,j=1),   with Σk Ajk = 1
• These are known as Transition Probabilities
•    K(K-1) independent parameters

     [Figure: K × K matrix A, rows indexed by state j of zn-1 and
      columns by state k of zn; entry (j,k) is Ajk]
Transition Probabilities
Example with 3-state latent variable

                 zn=1        zn=2        zn=3
                 (zn1=1)     (zn2=1)     (zn3=1)
   zn-1=1        A11         A12         A13         A11+A12+A13=1
   zn-1=2        A21         A22         A23
   zn-1=3        A31         A32         A33

   [Figure: state transition diagram for K=3]

• Not a graphical model since nodes are not separate
  variables but states of a single variable
• Here K=3
Conditional Probabilities
•    Transition probabilities Ajk represent state-to-state
     probabilities for each variable
•    Conditional probabilities are variable-to-variable
     probabilities
      •   can be written in terms of transition probabilities as
           p(zn | zn-1, A) = Π_{k=1}^{K} Π_{j=1}^{K} Ajk^{zn-1,j znk}
      •   Note that exponent zn-1,j zn,k is a product that evaluates to 0 or 1
      •   Hence the overall product will evaluate to a single Ajk for each
          setting of values of zn and zn-1
            •   E.g., zn-1=2 and zn=3 will result in only zn-1,2=1 and zn,3=1. Thus
                p(zn=3|zn-1=2)=A23
•    A is a global HMM parameter
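As an illustration, here is a minimal numpy sketch of the one-of-K product (the 3-state matrix and helper names below are invented for this example); the product formulation picks out exactly one entry of A:

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row must sum to 1
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

def one_of_K(k, K):
    """Return a K-dimensional binary vector with a 1 in position k."""
    z = np.zeros(K)
    z[k] = 1.0
    return z

z_prev = one_of_K(1, 3)   # z_{n-1} = state 2 (index 1)
z_curr = one_of_K(2, 3)   # z_n     = state 3 (index 2)

# p(zn | zn-1, A) = prod_j prod_k Ajk^{zn-1,j znk}
p = np.prod(A ** np.outer(z_prev, z_curr))
assert np.isclose(p, A[1, 2])   # the product picks out the single entry A23
```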
Initial Variable Probabilities
  • Initial latent node z1 is a special case without
      parent node
  •   Represented by vector of probabilities π with
      elements πk=p(z1k=1) so that
      p(z1 | π) = Π_{k=1}^{K} πk^{z1k},   where Σk πk = 1


  • Note that π is an HMM parameter
        • representing probabilities of each state for the first variable

Lattice or Trellis Diagram
• State transition diagram unfolded over time




• Representation of latent variable states
• Each column corresponds to one of latent
  variables zn
Emission Probabilities p(xn|zn)
  • We have so far only specified p(zn|zn-1)                  by
      means of transition probabilities
  •   Need to specify probabilities p(xn|zn,φ) to
      complete the model, where φ are parameters
  •   These can be continuous or discrete
  •   Because xn is observed and zn is discrete
      p(xn|zn,φ) consists of a table of K numbers
      corresponding to K states of zn
        • Analogous to class-conditional probabilities
  • Can be represented as
        p(xn | zn, φ) = Π_{k=1}^{K} p(xn|φk)^{znk}
3. HMM Parameters


 •   We have defined three types of HMM parameters: θ =
     (π, A,φ)
       1. Initial Probabilities of first latent variable:
          π is a vector of K probabilities of the states for latent variable z1
       2. Transition Probabilities (State-to-state for any latent variable):
          A is a K x K matrix of transition probabilities Aij
       3. Emission Probabilities (Observations conditioned on latent):
          φ are parameters of conditional distribution p(xn|zn)
 •    A and π parameters are often initialized uniformly
 •   Initialization of φ depends on form of distribution

Joint Distribution over Latent and
             Observed variables
• Joint can be expressed in terms of parameters:
   p(X, Z | θ) = p(z1|π) [ Π_{n=2}^{N} p(zn|zn-1, A) ] Π_{m=1}^{N} p(xm|zm, φ)

   where X = {x1,..,xN},  Z = {z1,..,zN},  θ = {π, A, φ}


• Most discussion of HMM is independent of
 emission probabilities
  • Tractable for discrete tables, Gaussian, GMMs
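As a sketch of how this factorization is used (all names are illustrative; a discrete emission table phi of shape K x V is assumed), the complete-data log-likelihood ln p(X,Z|θ) for a known state path is just a sum of logs of the three factor types:

```python
import numpy as np

def complete_data_log_lik(x, z, pi, A, phi):
    """ln p(X, Z | theta) for a known state path z (illustrative sketch).

    x   : (N,) integer observation symbols
    z   : (N,) integer hidden states
    pi  : (K,) initial state probabilities
    A   : (K, K) transition probabilities
    phi : (K, V) assumed discrete emission table, phi[k, v] = p(x=v | z=k)
    """
    ll = np.log(pi[z[0]]) + np.log(phi[z[0], x[0]])    # initial + first emission
    for n in range(1, len(x)):
        ll += np.log(A[z[n - 1], z[n]]) + np.log(phi[z[n], x[n]])
    return ll
```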
4. Generative View of HMM
•   Sampling from an HMM
•   HMM with 3-state latent variable z
    •   Gaussian emission model p(x|z)
    •   Contours of constant density of emission
        distribution shown for each state
    •   Two-dimensional x


•   50 data points generated from HMM
•   Lines show successive observations

•   Transition probabilities fixed so that
    •   5% probability of transitioning to each of the other states
    •   90% probability of remaining in the same state
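Sampling from an HMM like this is ancestral sampling: draw z1 from π, then alternately emit an observation and sample the next state. A minimal sketch with two-dimensional Gaussian emissions (the parameter values are invented, chosen only to mimic the 3-state, 90%-self-transition setup described above):

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([1/3, 1/3, 1/3])
A = np.full((3, 3), 0.05) + np.eye(3) * 0.85   # 90% self-transition, 5% each other state
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])  # illustrative 2-D emission means

def sample_hmm(N):
    states, obs = [], []
    z = rng.choice(3, p=pi)                     # z1 ~ pi
    for _ in range(N):
        states.append(z)
        obs.append(rng.normal(means[z], 1.0))   # xn ~ N(mu_z, I)
        z = rng.choice(3, p=A[z])               # z_{n+1} ~ row z of A
    return np.array(states), np.array(obs)

states, obs = sample_hmm(50)                    # e.g. 50 points as in the figure
```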

Left-to-Right HMM
  • Setting elements of Ajk=0       if k < j




  • Corresponding lattice diagram



Left-to-Right Applied Generatively to Digits
•    Examples of on-line handwritten 2’s
•    x is a sequence of pen coordinates
•    There are 16 states for z, or K=16
•    Each state can generate a line segment of fixed
     length in one of 16 angles
      •   Emission probabilities: 16 x 16 table
•    Transition probabilities set to zero except for those
     that keep state index k the same or increment it by one
•    Parameters optimized by 25 EM iterations
•    Trained on 45 digits
•    Generative model is quite poor
      •   Since generated samples don’t look like the training examples
      •   If classification is the goal, can do better by using a
          discriminative HMM

     [Figures: Training Set and Generated Set]
5. Determining HMM Parameters
• Given data set X={x1,..,xN} we can determine
    HMM parameters θ ={π,A,φ} using maximum
    likelihood
•   Likelihood function obtained from joint
    distribution by marginalizing over latent
    variables Z={z1,..,zN}
                    p(X|θ) = ΣZ p(X,Z|θ)


    [Figure: likelihood obtained by summing the joint p(X,Z) over Z]
Computational issues for Parameters
                     p(X|θ) = ΣZ p(X,Z|θ)
• Computational difficulties
    • Joint distribution p(X,Z|θ) does not factorize over n,
           in contrast to mixture model
    • Sum over Z has an exponential number of terms, one per path through the trellis
• Solution
    • Use conditional independence properties to reorder
        summations
    •   Use EM instead to maximize log-likelihood function of joint
        ln p(X,Z|θ)
          Efficient framework for maximizing the likelihood function in HMMs

EM for MLE in HMM
1. Start with initial selection for model parameters θold
2. In E step take these parameter values and find
   posterior distribution of latent variables p(Z|X,θ old)
   Use this posterior distribution to evaluate
   expectation of the logarithm of the complete-data
   likelihood function ln p(X,Z|θ)
          which can be written as
                Q(θ, θ old) = Σ_Z p(Z|X, θ old) ln p(X, Z|θ)

       where the posterior p(Z|X,θ old), which is independent of θ, is the quantity evaluated here
3. In M-Step maximize Q w.r.t. θ
Expansion of Q
  • Introduce                   notation
        γ (zn) = p(zn|X,θold): Marginal posterior distribution of
          latent variable zn
        ξ(zn-1,zn) = p(zn-1,zn|X,θ old): Joint posterior of two
          successive latent variables
  • We will be                  re-expressing Q in terms of γ and ξ
        Q(θ, θ old) = Σ_Z p(Z|X, θ old) ln p(X, Z|θ)
Detail of γ and ξ
  For each value of n we can store
        γ(zn) using K non-negative numbers that sum to unity
        ξ(zn-1,zn) using a K x K matrix whose elements also sum
          to unity
  • Using           notation
        γ (znk) denotes conditional probability of znk=1
        Similar notation for ξ(zn-1,j,znk)

  • Because the expectation of a binary random
      variable is the probability that it takes value 1
        γ(znk) = E[znk] = Σ_{zn} γ(zn) znk
        ξ(zn-1,j, znk) = E[zn-1,j znk] = Σ_{zn-1,zn} ξ(zn-1, zn) zn-1,j znk
Expansion of Q
  • We begin with
        Q(θ, θ old) = Σ_Z p(Z|X, θ old) ln p(X, Z|θ)
  • Substitute
        p(X, Z|θ) = p(z1|π) [ Π_{n=2}^{N} p(zn|zn-1, A) ] Π_{m=1}^{N} p(xm|zm, φ)
  • And use definitions of γ and ξ to get:
        Q(θ, θ old) = Σ_{k=1}^{K} γ(z1k) ln πk
                    + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(zn-1,j, znk) ln Ajk
                    + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(znk) ln p(xn|φk)
E-Step
        Q(θ, θ old) = Σ_{k=1}^{K} γ(z1k) ln πk
                    + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(zn-1,j, znk) ln Ajk
                    + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(znk) ln p(xn|φk)


• Goal of E step is to evaluate γ(zn) and ξ(zn-1,zn)
  efficiently (Forward-Backward Algorithm)
M-Step
  • Maximize Q(θ,θ old) with respect to parameters
      θ ={π,A,φ}
       • Treat γ(zn) and ξ(zn-1,zn) as constant
  • Maximization w.r.t. π and A
        • easily achieved (using Lagrange multipliers); see the sketch below:

              πk = γ(z1k) / Σ_{j=1}^{K} γ(z1j)

              Ajk = Σ_{n=2}^{N} ξ(zn-1,j, znk) / Σ_{l=1}^{K} Σ_{n=2}^{N} ξ(zn-1,j, znl)

  • Maximization w.r.t. φk
        • Only the last term of Q depends on φk:
              Σ_{n=1}^{N} Σ_{k=1}^{K} γ(znk) ln p(xn|φk)
        • Same form as in mixture distribution for i.i.d. data
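In code, these updates are normalized sums over the posterior statistics. A minimal sketch, assuming γ is stored as an (N, K) array and ξ as an (N-1, K, K) array matching the definitions above:

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """M-step updates for pi and A (sketch).

    gamma : (N, K) with gamma[n, k] = gamma(z_nk)
    xi    : (N-1, K, K) with xi[n, j, k] = xi(z_{n,j}, z_{n+1,k})
    """
    pi = gamma[0] / gamma[0].sum()              # normalized first-step posterior
    counts = xi.sum(axis=0)                     # expected transition counts, (K, K)
    A = counts / counts.sum(axis=1, keepdims=True)   # normalize each row j over k
    return pi, A
```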
M-step for Gaussian emission
  • Maximization of Q(θ,θ old) w.r.t. φk
  • Gaussian Emission Densities
      p(x|φk) ~ N(x|μk, Σk)
  •   Solution:

        μk = Σ_{n=1}^{N} γ(znk) xn / Σ_{n=1}^{N} γ(znk)

        Σk = Σ_{n=1}^{N} γ(znk)(xn − μk)(xn − μk)^T / Σ_{n=1}^{N} γ(znk)
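A sketch of the corresponding weighted-mean and weighted-covariance computations (array shapes as in the earlier M-step sketch; names are illustrative):

```python
import numpy as np

def m_step_gaussian(gamma, X):
    """Gaussian emission updates (sketch).

    gamma : (N, K) posterior responsibilities gamma(z_nk)
    X     : (N, D) observations
    """
    Nk = gamma.sum(axis=0)                     # (K,) effective counts
    mu = (gamma.T @ X) / Nk[:, None]           # (K, D) weighted means
    K, D = mu.shape
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
    return mu, Sigma
```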




M-Step for Multinomial Observed
  • Conditional Distribution of Observations
      have the form
        p(x|z) = Π_{i=1}^{D} Π_{k=1}^{K} μik^{xi zk}

  • M-Step equations:

        μik = Σ_{n=1}^{N} γ(znk) xni / Σ_{n=1}^{N} γ(znk)

  • Analogous result holds for Bernoulli
      observed variables
6. Forward-Backward Algorithm
  • E step: efficient procedure to evaluate
      γ(zn) and ξ(zn-1,zn)
  • Graph of HMM is a tree
        • Implies that posterior distribution of latent
           variables can be obtained efficiently using
           message passing algorithm
  • In HMM it is called forward-backward
      algorithm or Baum-Welch Algorithm
  •   Several variants lead to exact marginals
        • Method called alpha-beta discussed here
Derivation of Forward-Backward
• Several conditional-independences (A-H) hold
    A. p(X|zn) = p(x1,..xn|zn) p(xn+1,..xN|zn)
•   Proved using d-separation:
       Every path from x1,..,xn-1 to xn passes through zn, on which we condition.
       Path is head-to-tail. Thus (x1,..,xn-1) _||_ xn | zn
       Similarly (x1,..,xn) _||_ (xn+1,..,xN) | zn


Conditional independence B
  • Since                   (x1,..xn-1) _||_ xn | zn   we have

  B. p(x1,..xn-1|xn,zn)= p(x1,..xn-1|zn)




Conditional independence C
  • Since              (x1,..xn-1) _||_ zn | zn-1


      C. p(x1,..xn-1|zn-1,zn)= p ( x1,..xn-1|zn-1 )




Conditional independence D
  • Since              (xn+1,..xN) _||_ zn | zn+1


      D. p(xn+1,..,xN|zn,zn+1) = p(xn+1,..,xN|zn+1)




Conditional independence E
  • Since              (xn+2,..xN) _||_ zn | zn+1


      E. p(xn+2,..xN|zn+1,xn+1)= p ( xn+2,..xN|zn+1 )




Conditional independence F

  F. p(X|zn-1,zn) = p(x1,..,xn-1|zn-1) p(xn|zn) p(xn+1,..,xN|zn)




Conditional independence G
      Since            (x1,..xN) _||_ xN+1 | zN+1

  G.        p(xN+1|X,zN+1)=p(xN+1|zN+1 )





Conditional independence H

  H.         p(zN+1|zN , X)= p ( zN+1|zN )





Evaluation of γ(zn)
  • Recall that the goal is to efficiently compute
      the E step of estimating the parameters of the
      HMM
        γ (zn) = p(zn|X,θold): Marginal posterior
           distribution of latent variable zn
  • We are interested in finding posterior
      distribution p(zn|x1,..xN)
  •   This is a vector of length K whose entries
      correspond to expected values of znk

Introducing α and β
• Using Bayes theorem
        γ(zn) = p(zn|X) = p(X|zn) p(zn) / p(X)

• Using conditional independence A
        γ(zn) = p(x1,..,xn|zn) p(xn+1,..,xN|zn) p(zn) / p(X)
              = p(x1,..,xn, zn) p(xn+1,..,xN|zn) / p(X)
              = α(zn) β(zn) / p(X)

• where α(zn) = p(x1,..,xn, zn),
  the joint probability of observing all
  given data up to time n and the value of zn,
  and β(zn) = p(xn+1,..,xN|zn),
  the conditional probability of all
  future data from time n+1 up to N given
  the value of zn
Recursion Relation for α
α(zn) = p(x1,..,xn, zn)
      = p(x1,..,xn|zn) p(zn)                                         by product rule
      = p(xn|zn) p(x1,..,xn-1|zn) p(zn)                              by conditional independence B
      = p(xn|zn) p(x1,..,xn-1, zn)                                   by product rule
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1, zn-1, zn)                    by sum rule
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1, zn|zn-1) p(zn-1)             by product rule
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1|zn-1) p(zn|zn-1) p(zn-1)      by cond. ind. C
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1, zn-1) p(zn|zn-1)             by product rule
      = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)                         by definition of α
Forward Recursion for α Evaluation
 •    Recursion relation is
          α(zn) = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)

 •    There are K terms in the summation
       •   Has to be evaluated for each of K values of zn
       •   Each step of recursion is O(K²)
 •    Initial condition is
          α(z1) = p(x1, z1) = p(z1) p(x1|z1) = Π_{k=1}^{K} {πk p(x1|φk)}^{z1k}

 •    Overall cost for the chain is O(K²N) (see the sketch below)
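A direct implementation of the unscaled recursion, assuming the emission probabilities have been precomputed into a matrix B with B[n, k] = p(xn | zn = k) (slide 7(d) explains why rescaling is needed in practice):

```python
import numpy as np

def forward(pi, A, B):
    """Unscaled alpha recursion (sketch). B[n, k] = p(x_n | z_n = k).

    Returns alpha of shape (N, K) with alpha[n, k] = p(x_1..x_n, z_n = k).
    """
    N, K = B.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[0]                       # alpha(z_1) = pi_k p(x_1 | phi_k)
    for n in range(1, N):
        alpha[n] = B[n] * (alpha[n - 1] @ A)   # each step costs O(K^2)
    return alpha
```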
Recursion Relation for β

   β(zn) = p(xn+1,..,xN|zn)
         = Σ_{zn+1} p(xn+1,..,xN, zn+1|zn)                           by sum rule
         = Σ_{zn+1} p(xn+1,..,xN|zn, zn+1) p(zn+1|zn)                by product rule
         = Σ_{zn+1} p(xn+1,..,xN|zn+1) p(zn+1|zn)                    by cond. ind. D
         = Σ_{zn+1} p(xn+2,..,xN|zn+1) p(xn+1|zn+1) p(zn+1|zn)       by cond. ind. E
         = Σ_{zn+1} β(zn+1) p(xn+1|zn+1) p(zn+1|zn)                  by definition of β
Backward Recursion for β
•   Backward message passing
       β(zn) = Σ_{zn+1} β(zn+1) p(xn+1|zn+1) p(zn+1|zn)

•   Evaluates β(zn) in terms of β(zn+1) (see the sketch below)

•   Starting condition for recursion is
       p(zN|X) = p(X, zN) β(zN) / p(X)
•   This is correct provided we set β(zN) = 1
    for all settings of zN
    •   This is the initial condition for backward
        computation


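The corresponding backward pass, under the same assumed emission matrix B as in the forward sketch:

```python
import numpy as np

def backward(A, B):
    """Beta recursion (sketch). B[n, k] = p(x_n | z_n = k).

    Returns beta of shape (N, K) with beta[n, k] = p(x_{n+1}..x_N | z_n = k).
    """
    N, K = B.shape
    beta = np.ones((N, K))                     # beta(z_N) = 1 for all states
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[n + 1] * beta[n + 1])   # sum over z_{n+1}
    return beta
```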
M step Equations
  • In the M-step equations p(X) will cancel out, e.g.

        μk = Σ_{n=1}^{N} γ(znk) xn / Σ_{n=1}^{N} γ(znk)

  • p(X) itself is obtained from the α-β recursion as

        p(X) = Σ_{zn} α(zn) β(zn)



Evaluation of Quantities ξ(zn-1,zn)
•   They correspond to the values of the conditional
    probabilities p(zn-1,zn|X) for each of the K x K settings for
    (zn-1,zn)
ξ(zn-1, zn) = p(zn-1, zn|X)                                                            by definition
            = p(X|zn-1, zn) p(zn-1, zn) / p(X)                                         by Bayes rule
            = p(x1,..,xn-1|zn-1) p(xn|zn) p(xn+1,..,xN|zn) p(zn|zn-1) p(zn-1) / p(X)   by cond. ind. F
            = α(zn-1) p(xn|zn) p(zn|zn-1) β(zn) / p(X)
•   Thus we calculate ξ(zn-1,zn) directly by using results of the
    α and β recursions
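Combining the forward and backward sketches above, γ and ξ follow directly; p(X) can be read off as Σ_{zN} α(zN) (a sketch, reusing the array conventions from the earlier code):

```python
import numpy as np

def posteriors(alpha, beta, A, B):
    """gamma[n, k] = p(z_n = k | X);  xi[n, j, k] = p(z_n = j, z_{n+1} = k | X)."""
    pX = alpha[-1].sum()                       # p(X) = sum over z_N of alpha(z_N)
    gamma = alpha * beta / pX
    # xi(z_n, z_{n+1}) = alpha(z_n) p(x_{n+1}|z_{n+1}) p(z_{n+1}|z_n) beta(z_{n+1}) / p(X)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[1:] * beta[1:])[:, None, :]) / pX
    return gamma, xi
```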

Summary of EM to train HMM
Step 1: Initialization
•       Make an initial selection of parameters θ old
        where θ = (π, A,φ)
      1. π is a vector of K probabilities of the states for latent
         variable z1
      2. A is a K x K matrix of transition probabilities Aij
      3. φ are parameters of conditional distribution p(xn|zn)
•        A and π parameters are often initialized uniformly
•       Initialization of φ depends on form of distribution
      •      For Gaussian:
            •    parameters μk initialized by applying K-means to the data, Σk
                 corresponds to covariance matrix of cluster
Summary of EM to train HMM
Step 2: E Step
• Run both forward α recursion and
    backward β recursion
• Use results to evaluate γ(zn) and ξ(zn-1,zn)
    and the likelihood function
Step 3: M Step
• Use results of E step to find revised set of
    parameters θnew using M-step equations
Alternate between E and M
   until convergence of likelihood function
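Putting the pieces together, a schematic EM loop for discrete (multinomial) emissions, reusing the forward, backward, posteriors, and m_step_pi_A sketches from earlier slides (the tolerance and iteration count are arbitrary choices):

```python
import numpy as np

def fit_hmm_discrete(x, pi, A, phi, n_iter=25, tol=1e-6):
    """Schematic EM loop for an HMM with a discrete emission table (sketch).

    x : (N,) integer observations; phi : (K, V) emission table.
    """
    old_ll = -np.inf
    for _ in range(n_iter):
        B = phi[:, x].T                       # precompute B[n, k] = p(x_n | z_n = k)
        alpha = forward(pi, A, B)             # E step: alpha-beta recursions
        beta = backward(A, B)
        gamma, xi = posteriors(alpha, beta, A, B)
        pi, A = m_step_pi_A(gamma, xi)        # M step for pi and A
        for v in range(phi.shape[1]):         # multinomial emission update
            phi[:, v] = gamma[x == v].sum(axis=0)
        phi /= gamma.sum(axis=0)[:, None]
        ll = np.log(alpha[-1].sum())          # ln p(X | theta)
        if ll - old_ll < tol:                 # stop at likelihood convergence
            break
        old_ll = ll
    return pi, A, phi
```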
Values for p(xn|zn)
  • In recursion relations, observations enter
      through conditional distributions p(xn|zn)
  •   Recursions are independent of
        • Dimensionality of observed variables
        • Form of conditional distribution
             • So long as it can be computed for each of K
                possible states of zn
  • Since observed variables {xn} are fixed,
      the values p(xn|zn) can be pre-computed at
      the start of the EM algorithm
7(a) Sequence Length: Using Multiple Short Sequences

    • HMM can be trained effectively if length of
        sequence is sufficiently long
          • True of all maximum likelihood approaches
    • Alternatively we can use multiple short
        sequences
          • Requires straightforward modification of HMM-EM
             algorithm
    • Particularly important in left-to-right models
          • In a given observation sequence, a state transition
             corresponding to a non-diagonal element of A occurs
             only once

7(b). Predictive Distribution
  •   Observed data is X={ x1,..,xN}
  •   Wish to predict xN+1
  •   Application in financial forecasting
        p(xN+1|X) = Σ_{zN+1} p(xN+1, zN+1|X)
                  = Σ_{zN+1} p(xN+1|zN+1) p(zN+1|X)                          by Product Rule
                  = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1, zN|X)               by Sum Rule
                  = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) p(zN|X)          by conditional ind. H
                  = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) p(zN, X)/p(X)    by Bayes rule
                  = (1/p(X)) Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) α(zN)   by definition of α


  •   Can be evaluated by first running forward α recursion and
      summing over zN and zN+1
  •   Can be extended to subsequent predictions of xN+2, after xN+1 is
      observed, using a fixed amount of storage
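A sketch of this computation for a single candidate value of xN+1, reusing α from the forward recursion (B_next is an assumed vector of emission probabilities evaluated at the candidate observation):

```python
import numpy as np

def predictive(alpha_N, A, B_next):
    """p(x_{N+1} | X) for a candidate next observation (sketch).

    alpha_N : (K,) alpha(z_N) from the forward recursion
    B_next  : (K,) p(x_{N+1} | z_{N+1} = k) at the candidate x_{N+1}
    """
    pX = alpha_N.sum()                        # p(X)
    p_zN1 = (alpha_N @ A) / pX                # p(z_{N+1} | X): sum over z_N
    return B_next @ p_zN1                     # sum over z_{N+1}
```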
7(c). Sum-Product and HMM
•   HMM graph is a tree and hence sum-product algorithm can be
    used to find local marginals for hidden variables
    •   Equivalent to forward-backward algorithm
    •   Sum-product provides a simple way to derive alpha-beta
        recursion formulae
•   Transform directed graph to factor graph
    •   Each variable has a node, small squares represent factors,
        undirected links connect factor nodes to the variables they use
•   Joint distribution:
        p(x1,..,xN, z1,..,zN) = p(z1) [ Π_{n=2}^{N} p(zn|zn-1) ] Π_{n=1}^{N} p(xn|zn)

    [Figures: HMM graph and a fragment of the corresponding factor graph]
Deriving alpha-beta from Sum-Product
•   Begin with simplified form of factor graph, obtained by
    absorbing the emission probabilities into the transition
    probability factors
•   Factors are given by
        h(z1) = p(z1) p(x1|z1)
        fn(zn-1, zn) = p(zn|zn-1) p(xn|zn)
•   Messages propagated are
        μ_{zn-1→fn}(zn-1) = μ_{fn-1→zn-1}(zn-1)
        μ_{fn→zn}(zn) = Σ_{zn-1} fn(zn-1, zn) μ_{zn-1→fn}(zn-1)
•   Can show that this computes the α recursion
•   Similarly, starting with the root node, the β recursion is computed
•   So also γ and ξ are derived
•   Final results:
        α(zn) = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)
        β(zn) = Σ_{zn+1} β(zn+1) p(xn+1|zn+1) p(zn+1|zn)
        γ(zn) = α(zn) β(zn) / p(X)
        ξ(zn-1, zn) = α(zn-1) p(xn|zn) p(zn|zn-1) β(zn) / p(X)
7(d). Scaling Factors
• Implementation issue for small probabilities
• At each step of recursion
        α(zn) = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)
  • To obtain new value of α(zn) from previous value α(zn-1)
      we multiply by p(zn|zn-1) and p(xn|zn)
  •   These probabilities are small and products will underflow
  •   Logs don’t help since we have sums of products

• Solution is rescaling
  • of α(zn) and β(zn), whose values then remain close to unity
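A sketch of the rescaled forward pass: each α̂(zn) is normalized to sum to one, and the scaling factors cn multiply together to give p(X), so ln p(X) = Σn ln cn can be accumulated safely:

```python
import numpy as np

def forward_scaled(pi, A, B):
    """Rescaled alpha recursion (sketch). B[n, k] = p(x_n | z_n = k).

    alpha_hat[n] = p(z_n | x_1..x_n) stays close to unity; the scaling
    factors c[n] = p(x_n | x_1..x_{n-1}) give ln p(X) = sum_n ln c[n].
    """
    N, K = B.shape
    alpha_hat = np.zeros((N, K))
    c = np.zeros(N)
    for n in range(N):
        a = pi * B[0] if n == 0 else B[n] * (alpha_hat[n - 1] @ A)
        c[n] = a.sum()                     # normalizer: no underflow accumulates
        alpha_hat[n] = a / c[n]
    return alpha_hat, c, np.log(c).sum()   # third value is ln p(X)
```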
7(e). The Viterbi Algorithm
  • Finding most probable sequence of hidden
      states for a given sequence of observables
  •   In speech recognition: finding most probable
      phoneme sequence for a given series of
      acoustic observations
  •   Since graphical model of HMM is a tree, can be
      solved exactly using max-sum algorithm
        • Known as Viterbi algorithm in the context of HMM
        • Since max-sum works with log probabilities there is no
           need to work with re-scaled variables as with forward-
           backward
Viterbi Algorithm for HMM
  • Fragment of HMM lattice
      showing two paths
  •   Number of possible paths
      grows exponentially with
      length of chain
  •   Viterbi searches space of
      paths efficiently
        • Finds most probable path
           with computational cost
           linear with length of chain


Deriving Viterbi from Max-Sum
  • Start with simplified factor graph

  • Treat variable zN as root node, passing
      messages to root from leaf nodes
  •   Messages passed are
           μ_{zn→fn+1}(zn) = μ_{fn→zn}(zn)
           μ_{fn+1→zn+1}(zn+1) = max_{zn} { ln fn+1(zn, zn+1) + μ_{zn→fn+1}(zn) }
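A sketch of the resulting algorithm in log space (emission matrix B as in the earlier forward-backward sketches):

```python
import numpy as np

def viterbi(pi, A, B):
    """Most probable state path, computed with log probabilities (sketch).

    B[n, k] = p(x_n | z_n = k). Cost is O(K^2 N), linear in chain length.
    """
    N, K = B.shape
    logA = np.log(A)
    omega = np.log(pi) + np.log(B[0])          # omega(z_1)
    back = np.zeros((N, K), dtype=int)         # argmax pointers for backtracking
    for n in range(1, N):
        scores = omega[:, None] + logA         # scores[j, k]: come from j, go to k
        back[n] = scores.argmax(axis=0)
        omega = scores.max(axis=0) + np.log(B[n])
    path = np.zeros(N, dtype=int)
    path[-1] = omega.argmax()
    for n in range(N - 1, 0, -1):              # backtrack from the end of the chain
        path[n - 1] = back[n, path[n]]
    return path
```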




Other Topics on Sequential Data
• Sequential Data and Markov Models:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-
  MarkovModels.pdf
• Extensions of HMMs:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-
  HMMExtensions.pdf
• Linear Dynamical Systems:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-
  LinearDynamicalSystems.pdf
• Conditional Random Fields:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.5-
  ConditionalRandomFields.pdf


