This document discusses hidden Markov models (HMMs) and their parameters. It contains the following key points:
1. HMMs are statistical models used to model time series data. They have hidden states that generate observations.
2. HMMs have three main parameters: initial state probabilities, transition probabilities between states, and emission probabilities of observations given states.
3. The Expectation-Maximization (EM) algorithm is commonly used to estimate HMM parameters from observed data by iteratively computing expected state probabilities.
In this presentation we describe the formulation of the HMM as a model with hidden states that generate the observables. We introduce the three basic problems: evaluating the probability of a sequence of observations given the model, the decoding problem of finding the hidden states given the observations and the model, and the training problem of determining the model parameters that generate the given observations. We discuss the Forward, Backward, Viterbi and Forward-Backward algorithms.
2. HMM Topics
1. What is an HMM?
2. State-space Representation
3. HMM Parameters
4. Generative View of HMM
5. Determining HMM Parameters Using EM
6. Forward-Backward or α−β algorithm
7. HMM Implementation Issues:
a) Length of Sequence
b) Predictive Distribution
c) Sum-Product Algorithm
d) Scaling Factors
e) Viterbi Algorithm
3. 1. What is an HMM?
• Ubiquitous tool for modeling time series data
• Used in
• Almost all speech recognition systems
• Computational molecular biology
• Group amino acid sequences into proteins
• Handwritten word recognition
• It is a tool for representing probability distributions over
sequences of observations
• HMM gets its name from two defining properties:
• Observation xt at time t was generated by some process whose state
zt is hidden from the observer
• Assumes that state zt depends only on state zt-1 and is
independent of all earlier states (first order)
• Example: z are phoneme sequences
x are acoustic observations
4. Graphical Model of HMM
• Has the graphical model shown below and
latent variables are discrete
• Joint distribution has the form:
p(x_1,\ldots,x_N, z_1,\ldots,z_N) = p(z_1)\left[\prod_{n=2}^{N} p(z_n \mid z_{n-1})\right]\prod_{n=1}^{N} p(x_n \mid z_n)
5. HMM Viewed as Mixture
• A single time slice corresponds to a mixture distribution
with component densities p(x|z)
• Each state of discrete variable z represents a different component
• An extension of mixture model
• Choice of mixture component for each observation depends on the
choice of component for the previous observation
• Latent variables are multinomial variables zn
• That describe component responsible for generating xn
• Can use one-of-K coding scheme
6. 2. State-Space Representation
• Probability distribution of zn depends on the state of the previous
latent variable zn-1 through the probability distribution p(zn|zn-1)
• One-of-K coding: zn-1 in state j means zn-1,j=1 (all other components 0),
and zn in state k means znk=1 (all other components 0)
• Since latent variables are K-dimensional binary vectors, the conditional
distribution is a K x K matrix A of transition probabilities
Ajk = p(znk=1 | zn-1,j=1), with Σk Ajk = 1
• These are known as Transition Probabilities
• K(K-1) independent parameters
7. Transition Probabilities
Example with a 3-state latent variable (K=3), shown in the slide as a state
transition diagram. In one-of-K coding, zn=k means znk=1 and the other
components are 0 (e.g. zn-1=2 means zn-1,2=1 and zn-1,1=zn-1,3=0).
            zn=1   zn=2   zn=3
  zn-1=1    A11    A12    A13     (A11+A12+A13=1)
  zn-1=2    A21    A22    A23     (A21+A22+A23=1)
  zn-1=3    A31    A32    A33     (A31+A32+A33=1)
• The state transition diagram is not a graphical model, since its nodes are
not separate variables but states of a single variable
8. Conditional Probabilities
• Transition probabilities Ajk represent state-to-state
probabilities for each variable
• Conditional probabilities are variable-to-variable
probabilities
• can be written in terms of transition probabilities as
p(z_n \mid z_{n-1}, A) = \prod_{k=1}^{K}\prod_{j=1}^{K} A_{jk}^{\,z_{n-1,j}\, z_{nk}}
• Note that exponent zn-1,j zn,k is a product that evaluates to 0 or 1
• Hence the overall product will evaluate to a single Ajk for each
setting of values of zn and zn-1
• E.g., zn-1=2 and zn=3 will result in only zn-1,2=1 and zn,3=1. Thus
p(zn=3|zn-1=2)=A23
• A is a global HMM parameter
9. Initial Variable Probabilities
• Initial latent node z1 is a special case without
parent node
• Represented by vector of probabilities π with
elements πk=p(z1k=1) so that
p(z_1 \mid \pi) = \prod_{k=1}^{K} \pi_k^{\,z_{1k}}, \quad \text{where } \sum_k \pi_k = 1
• Note that π is an HMM parameter
• representing probabilities of each state for the
first variable
10. Lattice or Trellis Diagram
• State transition diagram unfolded over time
• Representation of latent variable states
• Each column corresponds to one of latent
variables zn
11. Emission Probabilities p(xn|zn)
• We have so far only specified p(zn|zn-1) by
means of transition probabilities
• Need to specify probabilities p(xn|zn,φ) to
complete the model, where φ are parameters
• These can be continuous or discrete
• Because xn is observed and zn is discrete
p(xn|zn,φ) consists of a table of K numbers
corresponding to K states of zn
• Analogous to class-conditional probabilities
• Can be represented as
p(x_n \mid z_n, \phi) = \prod_{k=1}^{K} p(x_n \mid \phi_k)^{\,z_{nk}}
12. 3. HMM Parameters
• We have defined three types of HMM parameters: θ =
(π, A,φ)
1. Initial Probabilities of first latent variable:
π is a vector of K probabilities of the states for latent variable z1
2. Transition Probabilities (State-to-state for any latent variable):
A is a K x K matrix of transition probabilities Aij
3. Emission Probabilities (Observations conditioned on latent):
φ are parameters of conditional distribution p(xk|zk)
• A and π parameters are often initialized uniformly
• Initialization of φ depends on form of distribution
13. Joint Distribution over Latent and
Observed variables
• Joint can be expressed in terms of parameters:
p(X, Z \mid \theta) = p(z_1 \mid \pi)\left[\prod_{n=2}^{N} p(z_n \mid z_{n-1}, A)\right]\prod_{m=1}^{N} p(x_m \mid z_m, \phi)
where X = \{x_1,\ldots,x_N\},\; Z = \{z_1,\ldots,z_N\},\; \theta = \{\pi, A, \phi\}
• Most discussion of HMM is independent of
emission probabilities
• Tractable for discrete tables, Gaussian, GMMs
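To make the three parameter blocks and the joint above concrete, here is a minimal sketch (ours, not from the slides) that evaluates ln p(X, Z | θ) for a discrete-emission HMM, assuming NumPy; the emission table B and the name log_joint are our own choices, and π, the rows of A, and the rows of B are assumed normalized.

```python
import numpy as np

def log_joint(x, z, pi, A, B):
    """Log of p(X, Z | theta) for a discrete-emission HMM.

    x  : observation indices, shape (N,)
    z  : latent state indices, shape (N,)
    pi : initial state probabilities, shape (K,)
    A  : transition matrix, A[j, k] = p(z_n = k | z_{n-1} = j), shape (K, K)
    B  : emission table, B[k, v] = p(x_n = v | z_n = k), shape (K, V)
    """
    logp = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])
    for n in range(1, len(x)):
        logp += np.log(A[z[n - 1], z[n]]) + np.log(B[z[n], x[n]])
    return logp
```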
14. 4. Generative View of HMM
• Sampling from an HMM
• HMM with 3-state latent variable z
• Gaussian emission model p(x|z)
• Contours of constant density of emission
distribution shown for each state
• Two-dimensional x
• 50 data points generated from HMM
• Lines show successive observations
• Transition probabilities fixed so that there is a
• 5% probability of making a transition to each of the other two states
• 90% probability of remaining in the same state
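Sampling from the generative model follows exactly this ancestral order: draw z1 from π, each subsequent zn from the row of A selected by zn-1, and each xn from the emission density of the current state. A minimal sketch (ours) assuming Gaussian emissions and NumPy:

```python
import numpy as np

def sample_hmm(pi, A, means, covs, N, rng=None):
    """Ancestral sampling from an HMM with Gaussian emissions.

    pi    : initial state probabilities, shape (K,)
    A     : transition matrix, shape (K, K)
    means : emission means, shape (K, D)
    covs  : emission covariances, shape (K, D, D)
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(pi)
    z = np.empty(N, dtype=int)
    x = np.empty((N, means.shape[1]))
    z[0] = rng.choice(K, p=pi)                        # z_1 ~ pi
    x[0] = rng.multivariate_normal(means[z[0]], covs[z[0]])
    for n in range(1, N):
        z[n] = rng.choice(K, p=A[z[n - 1]])           # z_n ~ p(z_n | z_{n-1})
        x[n] = rng.multivariate_normal(means[z[n]], covs[z[n]])
    return x, z
```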
15. Left-to-Right HMM
• Setting elements of Ajk=0 if k < j
• Corresponding lattice diagram
16. Left-to-Right Applied Generatively to Digits
• Examples of on-line handwritten 2’s
• x is a sequence of pen coordinates
• There are 16 states for z, or K=16
• Each state can generate a line segment of fixed
length in one of 16 angles
• Emission probabilities: 16 x 16 table
• Transition probabilities set to zero except for those
that keep state index k the same or increment by one
• Parameters optimized by 25 EM iterations
• Trained on 45 digits
• Generative model is quite poor
• Since the generated examples don't look like the training examples
• If classification is the goal, can do better by using a
discriminative HMM
(Figures: training set and generated set of handwritten 2's)
17. 5. Determining HMM Parameters
• Given data set X={x1,..,xN} we can determine
HMM parameters θ ={π,A,φ} using maximum
likelihood
• Likelihood function obtained from joint
distribution by marginalizing over latent
variables Z={z1,..,zN}
p(X|θ) = ΣZ p(X,Z|θ)
18. Computational issues for Parameters
p(X|θ) = ΣZ p(X,Z|θ)
• Computational difficulties
• Joint distribution p(X,Z|θ) does not factorize over n,
in contrast to mixture model
• Z has exponential number of terms corresponding to trellis
• Solution
• Use conditional independence properties to reorder
summations
• Use EM instead, which works with the expected value of the
complete-data log-likelihood ln p(X,Z|θ)
Efficient framework for maximizing the likelihood function in HMMs
19. EM for MLE in HMM
1. Start with initial selection for model parameters θold
2. In E step take these parameter values and find
posterior distribution of latent variables p(Z|X,θ old)
Use this posterior distribution to evaluate
expectation of the logarithm of the complete-data
likelihood function ln p(X,Z|θ)
Which can be written as
Q(\theta, \theta^{old}) = \sum_Z p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)
where the posterior factor p(Z | X, θ old), which does not depend on θ, is
evaluated in this step
3. In M-Step maximize Q w.r.t. θ
20. Expansion of Q
• Introduce notation
γ (zn) = p(zn|X,θold): Marginal posterior distribution of
latent variable zn
ξ(zn-1,zn) = p(zn-1,zn|X,θ old): Joint posterior of two
successive latent variables
• We will be re-expressing Q in terms of γ and ξ
Q(\theta, \theta^{old}) = \sum_Z p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)
21. Detail of γ and ξ
For each value of n we can store
γ(zn) using K non-negative numbers that sum to unity
ξ(zn-1,zn) using a K x K matrix whose elements also sum
to unity
• Using notation
γ (znk) denotes conditional probability of znk=1
Similar notation for ξ(zn-1,j,znk)
• Because the expectation of a binary random
variable is the probability that it takes value 1
\gamma(z_{nk}) = \mathbb{E}[z_{nk}] = \sum_{z_n} \gamma(z_n)\, z_{nk}
\xi(z_{n-1,j}, z_{nk}) = \mathbb{E}[z_{n-1,j}\, z_{nk}] = \sum_{z_{n-1}, z_n} \xi(z_{n-1}, z_n)\, z_{n-1,j}\, z_{nk}
22. Expansion of Q
• We begin with
Q(\theta, \theta^{old}) = \sum_Z p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)
• Substitute
p(X, Z \mid \theta) = p(z_1 \mid \pi)\left[\prod_{n=2}^{N} p(z_n \mid z_{n-1}, A)\right]\prod_{m=1}^{N} p(x_m \mid z_m, \phi)
• And use definitions of γ and ξ to get:
Q(\theta, \theta^{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
+ \sum_{n=2}^{N}\sum_{j=1}^{K}\sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
+ \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)
23. E-Step
Q(\theta, \theta^{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
+ \sum_{n=2}^{N}\sum_{j=1}^{K}\sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
+ \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)
• Goal of E step is to evaluate γ(zn) and ξ(zn-1,zn)
efficiently (Forward-Backward Algorithm)
24. M-Step
• Maximize Q(θ,θ old) with respect to parameters
θ ={π,A,φ}
• Treat γ (zn) and ξ(zn-1,zn) as constant
• Maximization w.r.t. π and A
• easily achieved (using Lagrangian multipliers)
\pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}
\qquad
A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K}\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}
• Maximization w.r.t. φk
• Only the last term of Q depends on φk:
\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)
• Same form as in a mixture distribution for i.i.d. data
25. M-step for Gaussian emission
• Maximization of Q(θ,θ old) wrt φ k
• Gaussian Emission Densities
p(x|φk) ~ N(x|μk,Σk)
• Solution:
\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}
\qquad
\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma(z_{nk})}
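The M-step updates above translate directly into array operations. A sketch (ours) assuming NumPy and that the responsibilities γ and ξ have already been computed in the E step; all names are our own:

```python
import numpy as np

def m_step_gaussian(X, gamma, xi):
    """One M step for a Gaussian-emission HMM given E-step responsibilities.

    X     : observations, shape (N, D)
    gamma : gamma[n, k] = p(z_n = k | X), shape (N, K)
    xi    : xi[n, j, k] = p(z_n = j, z_{n+1} = k | X) for n = 0..N-2, shape (N-1, K, K)
    """
    Nk = gamma.sum(axis=0)                               # effective counts per state
    pi = gamma[0] / gamma[0].sum()                       # initial-state probabilities
    A = xi.sum(axis=0)
    A /= A.sum(axis=1, keepdims=True)                    # normalize each row over k
    means = (gamma.T @ X) / Nk[:, None]                  # responsibility-weighted means
    covs = np.empty((gamma.shape[1], X.shape[1], X.shape[1]))
    for k in range(gamma.shape[1]):
        d = X - means[k]
        covs[k] = (gamma[:, k, None] * d).T @ d / Nk[k]  # weighted covariance
    return pi, A, means, covs
```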
26. M-Step for Multinomial Observed
• Conditional Distribution of Observations
have the form
p(x \mid z) = \prod_{i=1}^{D}\prod_{k=1}^{K} \mu_{ik}^{\,x_i z_k}
• M-Step equations:
\mu_{ik} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_{ni}}{\sum_{n=1}^{N} \gamma(z_{nk})}
• Analogous result holds for Bernoulli
observed variables
27. 6. Forward-Backward Algorithm
• E step: efficient procedure to evaluate
γ(zn) and ξ(zn-1,zn)
• Graph of HMM, a tree
• Implies that posterior distribution of latent
variables can be obtained efficiently using
message passing algorithm
• In HMM it is called forward-backward
algorithm or Baum-Welch Algorithm
• Several variants lead to exact marginals
• Method called alpha-beta discussed here
28. Derivation of Forward-Backward
• Several conditional-independences (A-H) hold
A. p(X|zn) = p(x1,..xn|zn) p(xn+1,..xN|zn)
• Proved using d-separation:
Every path from (x1,..,xn-1) to xn passes through zn, which is observed.
Such paths are head-to-tail at zn. Thus (x1,..xn-1) _||_ xn | zn
Similarly (x1,..xn-1,xn) _||_ xn+1,..xN |zn
29. Conditional independence B
• Since (x1,..xn-1) _||_ xn | zn we have
B. p(x1,..xn-1|xn,zn)= p(x1,..xn-1|zn)
30. Conditional independence C
• Since (x1,..xn-1) _||_ zn | zn-1
C. p(x1,..xn-1|zn-1,zn)= p ( x1,..xn-1|zn-1 )
31. Conditional independence D
• Since (xn+1,..xN) _||_ zn | zn+1
D. p(xn+1,..xN|zn,zn+1) = p(xn+1,..xN|zn+1)
32. Conditional independence E
• Since (xn+2,..xN) _||_ xn+1 | zn+1
E. p(xn+2,..xN|zn+1,xn+1)= p ( xn+2,..xN|zn+1 )
33. Conditional independence F
F. p(X|zn-1,zn)= p ( x1,..xn-1|zn-1 )p(xn|zn)
p(xn+1,..xN|zn)
34. Conditional independence G
Since (x1,..xN) _||_ xN+1 | zN+1
G. p(xN+1|X,zN+1)=p(xN+1|zN+1 )
35. Conditional independence H
H. p(zN+1|zN , X)= p ( zN+1|zN )
36. Evaluation of γ(zn)
• Recall that this is to efficiently compute
the E step of estimating parameters of
HMM
γ (zn) = p(zn|X,θold): Marginal posterior
distribution of latent variable zn
• We are interested in finding posterior
distribution p(zn|x1,..xN)
• This is a vector of length K whose entries
correspond to expected values of znk
37. Introducing α and β
• Using Bayes theorem
\gamma(z_n) = p(z_n \mid X) = \frac{p(X \mid z_n)\, p(z_n)}{p(X)}
• Using conditional independence A
\gamma(z_n) = \frac{p(x_1,..,x_n \mid z_n)\, p(x_{n+1},..,x_N \mid z_n)\, p(z_n)}{p(X)}
            = \frac{p(x_1,..,x_n, z_n)\, p(x_{n+1},..,x_N \mid z_n)}{p(X)}
            = \frac{\alpha(z_n)\, \beta(z_n)}{p(X)}
• where \alpha(z_n) = p(x_1,..,x_n, z_n), the joint probability of all the
observed data up to time n and the value of z_n, and
\beta(z_n) = p(x_{n+1},..,x_N \mid z_n), the conditional probability of all
future data from time n+1 up to N given the value of z_n
38. Recursion Relation for α
\alpha(z_n) = p(x_1,..,x_n, z_n)
= p(x_1,..,x_n \mid z_n)\, p(z_n)   [by Bayes rule]
= p(x_n \mid z_n)\, p(x_1,..,x_{n-1} \mid z_n)\, p(z_n)   [by conditional independence B]
= p(x_n \mid z_n)\, p(x_1,..,x_{n-1}, z_n)   [by Bayes rule]
= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1,..,x_{n-1}, z_{n-1}, z_n)   [by sum rule]
= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1,..,x_{n-1}, z_n \mid z_{n-1})\, p(z_{n-1})   [by Bayes rule]
= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1,..,x_{n-1} \mid z_{n-1})\, p(z_n \mid z_{n-1})\, p(z_{n-1})   [by cond. ind. C]
= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1,..,x_{n-1}, z_{n-1})\, p(z_n \mid z_{n-1})   [by Bayes rule]
= p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1})   [by definition of α]
39. Forward Recursion for α Εvaluation
• Recursion Relation is
\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1})
• There are K terms in the summation
• Has to be evaluated for each of K values of zn
• Each step of recursion is O(K^2)
• Initial condition is
\alpha(z_1) = p(x_1, z_1) = p(z_1)\, p(x_1 \mid z_1) = \prod_{k=1}^{K} \{\pi_k\, p(x_1 \mid \phi_k)\}^{z_{1k}}
• Overall cost for the chain is O(K^2 N)
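A minimal, unscaled sketch of the α recursion (ours), assuming NumPy and that the emission likelihoods p(xn|zn=k) are precomputed into a table emis; for long sequences the scaled version discussed under 7(d) should be used instead:

```python
import numpy as np

def forward(pi, A, emis):
    """Alpha recursion: alpha[n, k] = p(x_1..x_n, z_n = k).

    pi   : initial state probabilities, shape (K,)
    A    : transition matrix, shape (K, K)
    emis : emission likelihoods, emis[n, k] = p(x_n | z_n = k), shape (N, K)
    """
    N, K = emis.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * emis[0]                      # alpha(z_1) = pi_k p(x_1 | phi_k)
    for n in range(1, N):
        alpha[n] = emis[n] * (alpha[n - 1] @ A)  # p(x_n|z_n) * sum_{z_{n-1}} alpha(z_{n-1}) p(z_n|z_{n-1})
    return alpha
```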
40. Recursion Relation for β
\beta(z_n) = p(x_{n+1},..,x_N \mid z_n)
= \sum_{z_{n+1}} p(x_{n+1},..,x_N, z_{n+1} \mid z_n)   [by sum rule]
= \sum_{z_{n+1}} p(x_{n+1},..,x_N \mid z_n, z_{n+1})\, p(z_{n+1} \mid z_n)   [by Bayes rule]
= \sum_{z_{n+1}} p(x_{n+1},..,x_N \mid z_{n+1})\, p(z_{n+1} \mid z_n)   [by cond. ind. D]
= \sum_{z_{n+1}} p(x_{n+2},..,x_N \mid z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n)   [by cond. ind. E]
= \sum_{z_{n+1}} \beta(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n)   [by definition of β]
41. Backward Recursion for β
• Backward message passing
\beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n)
• Evaluates β(zn) in terms of β(zn+1)
• Starting condition for recursion is
p(z_N \mid X) = \frac{p(X, z_N)\, \beta(z_N)}{p(X)}
• This is correct provided we set β(zN) = 1
for all settings of zN
• This is the initial condition for the backward
computation
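The corresponding β recursion, in the same style and under the same assumptions as the α sketch above (names are ours):

```python
import numpy as np

def backward(A, emis):
    """Beta recursion: beta[n, k] = p(x_{n+1}..x_N | z_n = k).

    A    : transition matrix, shape (K, K)
    emis : emission likelihoods, emis[n, k] = p(x_n | z_n = k), shape (N, K)
    """
    N, K = emis.shape
    beta = np.zeros((N, K))
    beta[-1] = 1.0                                 # beta(z_N) = 1 starts the recursion
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (beta[n + 1] * emis[n + 1])  # sum_{z_{n+1}} beta(z_{n+1}) p(x_{n+1}|z_{n+1}) p(z_{n+1}|z_n)
    return beta
```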
42. M step Equations
• In the M-step equations p(x) will cancel out
\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}
\qquad
p(X) = \sum_{z_n} \alpha(z_n)\, \beta(z_n)
43. Evaluation of Quantities ξ(zn-1,zn)
• They correspond to the values of the conditional
probabilities p(zn-1,zn|X) for each of the K x K settings for
(zn-1,zn)
\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X)   [by definition]
= \frac{p(X \mid z_{n-1}, z_n)\, p(z_{n-1}, z_n)}{p(X)}   [by Bayes rule]
= \frac{p(x_1,..,x_{n-1} \mid z_{n-1})\, p(x_n \mid z_n)\, p(x_{n+1},..,x_N \mid z_n)\, p(z_n \mid z_{n-1})\, p(z_{n-1})}{p(X)}   [by cond. ind. F]
= \frac{\alpha(z_{n-1})\, p(x_n \mid z_n)\, p(z_n \mid z_{n-1})\, \beta(z_n)}{p(X)}
• Thus we calculate ξ(zn-1,zn) directly by using results of the
α and β recursions
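Putting the two recursions together yields γ and ξ. A sketch (ours) that reuses the forward and backward functions from the earlier sketches; note that it indexes ξ by (n, j, k) for the pair (zn, zn+1):

```python
def posteriors(pi, A, emis):
    """Compute gamma and xi from the alpha and beta recursions (sketch).

    Returns gamma[n, k] = p(z_n = k | X) and
    xi[n, j, k] = p(z_n = j, z_{n+1} = k | X) for n = 0..N-2.
    """
    alpha = forward(pi, A, emis)        # from the sketch after slide 39
    beta = backward(A, emis)            # from the sketch after slide 41
    pX = alpha[-1].sum()                # p(X) = sum_{z_N} alpha(z_N), since beta(z_N) = 1
    gamma = alpha * beta / pX
    # xi: alpha(z_n) p(x_{n+1}|z_{n+1}) p(z_{n+1}|z_n) beta(z_{n+1}) / p(X)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (emis[1:] * beta[1:])[:, None, :]) / pX
    return gamma, xi
```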
44. Summary of EM to train HMM
Step 1: Initialization
• Make an initial selection of parameters θ old
where θ = (π, A,φ)
1. π is a vector of K probabilities of the states for latent
variable z1
2. A is a K x K matrix of transition probabilities Aij
3. φ are parameters of conditional distribution p(xk|zk)
• A and π parameters are often initialized uniformly
• Initialization of φ depends on form of distribution
• For Gaussian:
• parameters μk initialized by applying K-means to the data, Σk
corresponds to covariance matrix of cluster
45. Summary of EM to train HMM
Step 2: E Step
• Run both forward α recursion and
backward β recursion
• Use results to evaluate γ(zn) and ξ(zn-1,zn)
and the likelihood function
Step 3: M Step
• Use results of E step to find revised set of
parameters θnew using M-step equations
Alternate between E and M
until convergence of likelihood function
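Tying the pieces together, a bare-bones EM loop (ours) built from the earlier sketches, assuming Gaussian emissions, SciPy for density evaluation, and a fixed number of iterations in place of a convergence check on the likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_hmm(X, pi, A, means, covs, n_iter=25):
    """EM for a Gaussian-emission HMM (sketch reusing earlier functions)."""
    for _ in range(n_iter):
        # emis[n, k] = N(x_n | mu_k, Sigma_k)
        emis = np.column_stack([multivariate_normal(means[k], covs[k]).pdf(X)
                                for k in range(len(pi))])
        gamma, xi = posteriors(pi, A, emis)                  # E step (alpha-beta)
        pi, A, means, covs = m_step_gaussian(X, gamma, xi)   # M step
    return pi, A, means, covs
```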
46. Values for p(xn|zn)
• In recursion relations, observations enter
through conditional distributions p(xn|zn)
• Recursions are independent of
• Dimensionality of observed variables
• Form of conditional distribution
• So long as it can be computed for each of K
possible states of zn
• Since observed variables {xn} are fixed
they can be pre-computed at the start of
the EM algorithm
47. 7(a) Sequence Length: Using Multiple Short Sequences
• HMM can be trained effectively if length of
sequence is sufficiently long
• True of all maximum likelihood approaches
• Alternatively we can use multiple short
sequences
• Requires straightforward modification of HMM-EM
algorithm
• Particularly important in left-to-right models
• In given observation sequence, a given state
transition for a non-diagonal element of A occurs only
once
48. 7(b). Predictive Distribution
• Observed data is X={ x1,..,xN}
• Wish to predict xN+1
• Application in financial forecasting
p(x_{N+1} \mid X) = \sum_{z_{N+1}} p(x_{N+1}, z_{N+1} \mid X)
= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1})\, p(z_{N+1} \mid X)   [by product rule]
= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1}, z_N \mid X)   [by sum rule]
= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1} \mid z_N)\, p(z_N \mid X)   [by conditional ind. H]
= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1} \mid z_N) \frac{p(z_N, X)}{p(X)}   [by Bayes rule]
= \frac{1}{p(X)} \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1} \mid z_N)\, \alpha(z_N)   [by definition of α]
• Can be evaluated by first running forward α recursion and
summing over zN and zN+1
• Can be extended to subsequent predictions of xN+2, after xN+1 is
observed, using a fixed amount of storage
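In code, the inner sum over zN reduces to a vector-matrix product with the last row of α. A small sketch (ours), consistent with the forward sketch above:

```python
def predictive_weights(alpha, A):
    """p(z_{N+1} = k | X): mixing weights for p(x_{N+1} | X) = sum_k w[k] p(x_{N+1} | phi_k)."""
    w = alpha[-1] @ A           # sum_{z_N} alpha(z_N) p(z_{N+1} | z_N)
    return w / alpha[-1].sum()  # divide by p(X)
```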
49. 7(c). Sum-Product and HMM
• HMM graph is a tree and hence the sum-product
algorithm can be used to find local marginals for
hidden variables
• Equivalent to the forward-backward algorithm
• Sum-product provides a simple way to derive the
alpha-beta recursion formulae
• Transform the directed graph to a factor graph
• Each variable has a node, small squares represent
factors, undirected links connect factor nodes to the
variables they use
(Figures: HMM graph, its joint distribution
p(x_1,..,x_N, z_1,..,z_N) = p(z_1)\left[\prod_{n=2}^{N} p(z_n \mid z_{n-1})\right]\prod_{n=1}^{N} p(x_n \mid z_n),
and a fragment of the factor graph)
50. Deriving alpha-beta from Sum-Product
• Begin with the simplified form of the factor graph
(fragment shown in the figure)
• Factors are given by
h(z_1) = p(z_1)\, p(x_1 \mid z_1)
f_n(z_{n-1}, z_n) = p(z_n \mid z_{n-1})\, p(x_n \mid z_n)
simplified by absorbing the emission probabilities into the
transition probability factors
• Messages propagated are
\mu_{z_{n-1} \to f_n}(z_{n-1}) = \mu_{f_{n-1} \to z_{n-1}}(z_{n-1})
\mu_{f_n \to z_n}(z_n) = \sum_{z_{n-1}} f_n(z_{n-1}, z_n)\, \mu_{z_{n-1} \to f_n}(z_{n-1})
• Can show that the α recursion is computed:
\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1})
• Similarly, starting with the root node, the β recursion is computed:
\beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n)
• So also γ and ξ are derived:
\gamma(z_n) = \frac{\alpha(z_n)\, \beta(z_n)}{p(X)}
\xi(z_{n-1}, z_n) = \frac{\alpha(z_{n-1})\, p(x_n \mid z_n)\, p(z_n \mid z_{n-1})\, \beta(z_n)}{p(X)}
51. 7(d). Scaling Factors
• Implementation issue for small probabilities
• At each step of the recursion
\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1})
• To obtain new value of α(zn) from previous value α(zn-1)
we multiply p(zn|zn-1) and p(xn|zn)
• These probabilities are small and products will underflow
• Logs don’t help since we have sums of products
• Solution is to rescale α(zn) and β(zn) so that their values remain
close to unity
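A sketch (ours) of the rescaled α recursion: each step normalizes α to a conditional distribution over states, and the scaling factors c_n multiply back to p(X), so the log-likelihood is the sum of their logs:

```python
import numpy as np

def forward_scaled(pi, A, emis):
    """Scaled alpha recursion: alpha_hat[n, k] = p(z_n = k | x_1..x_n).

    Returns the scaled alphas and the scaling factors c_n = p(x_n | x_1..x_{n-1});
    the log-likelihood ln p(X) is np.log(c).sum().
    """
    N, K = emis.shape
    alpha_hat = np.zeros((N, K))
    c = np.zeros(N)
    a = pi * emis[0]
    c[0] = a.sum()
    alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = emis[n] * (alpha_hat[n - 1] @ A)
        c[n] = a.sum()
        alpha_hat[n] = a / c[n]
    return alpha_hat, c
```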
52. 7(e). The Viterbi Algorithm
• Finding most probable sequence of hidden
states for a given sequence of observables
• In speech recognition: finding most probable
phoneme sequence for a given series of
acoustic observations
• Since graphical model of HMM is a tree, can be
solved exactly using max-sum algorithm
• Known as Viterbi algorithm in the context of HMM
• Since max-sum works with log probabilities, there is no need
to work with re-scaled variables as with forward-backward
53. Viterbi Algorithm for HMM
• Fragment of HMM lattice
showing two paths
• Number of possible paths
grows exponentially with
length of chain
• Viterbi searches space of
paths efficiently
• Finds most probable path
with computational cost
linear with length of chain
54. Deriving Viterbi from Max-Sum
• Start with simplified factor graph
• Treat variable zN as root node, passing
messages to root from leaf nodes
• Messages passed are
μz n → f n +1
(z n ) = μ fn →zn (z n )
μf n +1 → z n +1
{ }
(z n +1 ) = max ln f n +1 (z n , z n +1 ) + μzn → fn+1 (z n )
zn
Machine Learning: CSE 574 53
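A compact Viterbi sketch (ours) in log space, again assuming a precomputed emission table; omega stores the best log-probability of any path ending in each state and back stores back-pointers for recovering the path:

```python
import numpy as np

def viterbi(pi, A, emis):
    """Most probable state sequence via max-sum in log space (sketch).

    emis[n, k] = p(x_n | z_n = k); returns the MAP state path as indices.
    """
    N, K = emis.shape
    log_A = np.log(A)
    omega = np.zeros((N, K))            # omega[n, k]: best log prob of any path ending in state k
    back = np.zeros((N, K), dtype=int)  # back-pointers
    omega[0] = np.log(pi) + np.log(emis[0])
    for n in range(1, N):
        scores = omega[n - 1][:, None] + log_A   # scores[j, k]: come from j, go to k
        back[n] = scores.argmax(axis=0)
        omega[n] = scores.max(axis=0) + np.log(emis[n])
    path = np.zeros(N, dtype=int)
    path[-1] = omega[-1].argmax()
    for n in range(N - 2, -1, -1):      # backtrack
        path[n] = back[n + 1, path[n + 1]]
    return path
```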
55. Other Topics on Sequential Data
• Sequential Data and Markov Models:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-
MarkovModels.pdf
• Extensions of HMMs:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-
HMMExtensions.pdf
• Linear Dynamical Systems:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-
LinearDynamicalSystems.pdf
• Conditional Random Fields:
http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.5-
ConditionalRandomFields.pdf