# Hidden Markov Models


1. Hidden Markov Models. Sargur N. Srihari, srihari@cedar.buffalo.edu. Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/CSE574/index.html (Machine Learning: CSE 574)
2. HMM Topics
    1. What is an HMM?
    2. State-space representation
    3. HMM parameters
    4. Generative view of HMM
    5. Determining HMM parameters using EM
    6. Forward-backward or α-β algorithm
    7. HMM implementation issues: (a) length of sequence, (b) predictive distribution, (c) sum-product algorithm, (d) scaling factors, (e) Viterbi algorithm
3. What is an HMM?
    - A ubiquitous tool for modeling time-series data, used in:
      - Almost all speech-recognition systems
      - Computational molecular biology, e.g., grouping amino-acid sequences into proteins
      - Handwritten word recognition
    - A tool for representing probability distributions over sequences of observations
    - The HMM gets its name from two defining properties:
      - The observation $x_t$ at time $t$ was generated by some process whose state $z_t$ is hidden from the observer
      - The state $z_t$ depends only on the state $z_{t-1}$ and is independent of all prior states (first order)
    - Example: $z$ is a phoneme sequence, $x$ the acoustic observations
4. Graphical Model of HMM
    - The HMM has the graphical model shown below, and the latent variables are discrete
    - The joint distribution has the form
      $$p(x_1,\ldots,x_N, z_1,\ldots,z_N) = p(z_1)\left[\prod_{n=2}^{N} p(z_n \mid z_{n-1})\right]\prod_{n=1}^{N} p(x_n \mid z_n)$$
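As a concrete check of this factorization, here is a minimal pure-Python sketch: a hypothetical 2-state model with discrete observations (the numbers in `pi`, `A`, and `B` are invented for illustration, not taken from the slides), evaluating the joint probability of one state/observation sequence.

```python
# Joint probability of a state/observation sequence under the HMM
# factorization p(z1) * prod_n p(zn|zn-1) * prod_n p(xn|zn).
# Toy 2-state model; all numbers are illustrative assumptions.

pi = [0.6, 0.4]                      # p(z1)
A = [[0.7, 0.3],                     # A[j][k] = p(zn = k | zn-1 = j)
     [0.2, 0.8]]
B = [[0.9, 0.1],                     # B[k][x] = p(xn = x | zn = k)
     [0.3, 0.7]]

def joint(states, obs):
    # First factor: p(z1) p(x1|z1)
    p = pi[states[0]] * B[states[0]][obs[0]]
    # Remaining factors: p(zn|zn-1) p(xn|zn)
    for n in range(1, len(states)):
        p *= A[states[n - 1]][states[n]] * B[states[n]][obs[n]]
    return p

print(joint([0, 0, 1], [0, 0, 1]))
```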
5. HMM Viewed as a Mixture
    - A single time slice corresponds to a mixture distribution with component densities $p(x \mid z)$; each state of the discrete variable $z$ represents a different component
    - The HMM is thus an extension of the mixture model in which the choice of mixture component depends on the component chosen for the previous observation
    - The latent variables are multinomial variables $z_n$ that describe which component is responsible for generating $x_n$; a one-of-K coding scheme can be used
6. State-Space Representation
    - The probability distribution of $z_n$ depends on the state of the previous latent variable $z_{n-1}$ through the distribution $p(z_n \mid z_{n-1})$
    - With one-of-K coding, the latent variables are $K$-dimensional binary vectors, e.g., $z_{nk} = 1$ when $z_n$ is in state $k$
    - The conditional distribution is summarized by a $K \times K$ matrix $A$ with entries $A_{jk} = p(z_{nk} = 1 \mid z_{n-1,j} = 1)$ and $\sum_k A_{jk} = 1$
    - These are known as transition probabilities; there are $K(K-1)$ independent parameters
7. Transition Probabilities
    - Example with a 3-state latent variable ($K = 3$): the state $z_{n-1} = j$ and the state $z_n = k$ select the single entry $A_{jk}$; e.g., the row for $z_{n-1} = 1$ contains $A_{11}, A_{12}, A_{13}$ with $A_{11} + A_{12} + A_{13} = 1$
    - The state transition diagram is not a graphical model, since its nodes are not separate variables but states of a single variable
8. Conditional Probabilities
    - The transition probabilities $A_{jk}$ are state-to-state probabilities for each variable; the conditional probabilities are variable-to-variable probabilities and can be written in terms of the transition probabilities as
      $$p(z_n \mid z_{n-1}, A) = \prod_{k=1}^{K}\prod_{j=1}^{K} A_{jk}^{\,z_{n-1,j}\, z_{nk}}$$
    - The exponent $z_{n-1,j}\, z_{nk}$ is a product that evaluates to 0 or 1, so the overall product evaluates to a single $A_{jk}$ for each setting of the values of $z_n$ and $z_{n-1}$; e.g., $z_{n-1} = 2$ and $z_n = 3$ give $z_{n-1,2} = 1$ and $z_{n,3} = 1$, hence $p(z_n = 3 \mid z_{n-1} = 2) = A_{23}$
    - $A$ is a global HMM parameter
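The "exponent selects a single entry" argument can be demonstrated directly. This sketch (an illustrative 2-state matrix with invented numbers) evaluates the double product over one-hot exponents and shows that it collapses to one $A_{jk}$.

```python
# p(zn | zn-1, A) written exactly as the double product with one-hot
# exponents: every factor whose exponent is 0 contributes 1, leaving
# the single A[j][k] selected by the two one-hot vectors.
A = [[0.7, 0.3],
     [0.2, 0.8]]

def trans_prob(z_prev, z_cur):
    # z_prev, z_cur are one-of-K coded lists, e.g. [0, 1]
    p = 1.0
    for j in range(len(A)):
        for k in range(len(A)):
            p *= A[j][k] ** (z_prev[j] * z_cur[k])
    return p

# zn-1 in state 2, zn in state 1: the product reduces to A[1][0]
print(trans_prob([0, 1], [1, 0]))
```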
9. Initial Variable Probabilities
    - The initial latent node $z_1$ is a special case without a parent node
    - It is represented by a vector of probabilities $\pi$ with elements $\pi_k = p(z_{1k} = 1)$, so that
      $$p(z_1 \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_{1k}}, \qquad \sum_k \pi_k = 1$$
    - $\pi$ is an HMM parameter, representing the probabilities of each state for the first variable
10. Lattice or Trellis Diagram
    - The state transition diagram unfolded over time: a representation of the latent variable states in which each column corresponds to one of the latent variables $z_n$
11. Emission Probabilities $p(x_n \mid z_n)$
    - So far only $p(z_n \mid z_{n-1})$ has been specified, by means of the transition probabilities; to complete the model we need the probabilities $p(x_n \mid z_n, \phi)$, where $\phi$ are parameters
    - The observations can be continuous or discrete; because $x_n$ is observed and $z_n$ is discrete, $p(x_n \mid z_n, \phi)$ consists of a table of $K$ numbers corresponding to the $K$ states of $z_n$, analogous to class-conditional probabilities
    - It can be represented as
      $$p(x_n \mid z_n, \phi) = \prod_{k=1}^{K} p(x_n \mid \phi_k)^{z_{nk}}$$
12. HMM Parameters
    - Three types of HMM parameters have been defined: $\theta = (\pi, A, \phi)$
      1. Initial probabilities of the first latent variable: $\pi$ is a vector of the $K$ state probabilities for $z_1$
      2. Transition probabilities (state-to-state, for any latent variable): $A$ is a $K \times K$ matrix of transition probabilities $A_{jk}$
      3. Emission probabilities (observations conditioned on latents): $\phi$ are the parameters of the conditional distribution $p(x_n \mid z_n)$
    - The $A$ and $\pi$ parameters are often initialized uniformly; initialization of $\phi$ depends on the form of the distribution
13. Joint Distribution over Latent and Observed Variables
    - The joint can be expressed in terms of the parameters:
      $$p(X, Z \mid \theta) = p(z_1 \mid \pi)\left[\prod_{n=2}^{N} p(z_n \mid z_{n-1}, A)\right]\prod_{m=1}^{N} p(x_m \mid z_m, \phi)$$
      where $X = \{x_1,\ldots,x_N\}$, $Z = \{z_1,\ldots,z_N\}$, and $\theta = \{\pi, A, \phi\}$
    - Most discussion of HMMs is independent of the emission probabilities; the model is tractable for discrete tables, Gaussians, and GMMs
14. Generative View of HMM
    - Sampling from an HMM with a 3-state latent variable $z$ and a Gaussian emission model $p(x \mid z)$; contours of constant density of the emission distribution are shown for each state
    - $x$ is two-dimensional; 50 data points are generated from the HMM, with lines showing successive observations
    - Transition probabilities are fixed so that each transition to a different state has 5% probability, with 90% probability of remaining in the same state
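The generative view corresponds to ancestral sampling: draw $z_1$ from $\pi$, then repeatedly draw $z_n$ from the row $A[z_{n-1}]$ and $x_n$ from the emission distribution of $z_n$. A sketch with a discrete emission table rather than the slide's Gaussian, and invented numbers; only the 90% self-transition mirrors the slide's setting.

```python
import random

# Ancestral sampling from an HMM. Toy 2-state model with discrete
# emissions; pi, A, B are illustrative assumptions.
pi = [0.6, 0.4]
A = [[0.9, 0.1],      # sticky transitions: 90% chance of staying put
     [0.1, 0.9]]
B = [[0.9, 0.1],
     [0.2, 0.8]]

def draw(probs, rng):
    # Sample an index from a discrete distribution by inverse CDF.
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1

def sample(N, seed=0):
    rng = random.Random(seed)
    z = [draw(pi, rng)]                       # z1 ~ pi
    for _ in range(1, N):
        z.append(draw(A[z[-1]], rng))         # zn ~ A[zn-1]
    x = [draw(B[zn], rng) for zn in z]        # xn ~ B[zn]
    return z, x

z, x = sample(50)
print(len(z), len(x))
```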
15. Left-to-Right HMM
    - Obtained by setting the elements $A_{jk} = 0$ if $k < j$; the corresponding lattice diagram is shown
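The left-to-right constraint is easy to impose on a transition matrix: zero the entries below the diagonal and renormalize each row. An illustrative 3-state sketch (the uniform starting matrix is an assumption, not from the slide).

```python
# Build a left-to-right transition matrix: A[j][k] = 0 for k < j,
# then renormalize each row so it remains a probability distribution.
K = 3
A = [[1 / 3] * K for _ in range(K)]   # start from a uniform matrix (assumption)
for j in range(K):
    for k in range(K):
        if k < j:
            A[j][k] = 0.0             # no backward transitions
    s = sum(A[j])
    A[j] = [a / s for a in A[j]]

print(A)
```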
16. Left-to-Right HMM Applied Generatively to Digits
    - Examples of on-line handwritten 2's: $x$ is a sequence of pen coordinates
    - There are 16 states for $z$ ($K = 16$); each state can generate a line segment of fixed length in one of 16 angles, so the emission probabilities form a 16 x 16 table
    - Transition probabilities are set to zero except for those that keep the state index $k$ the same or increment it by one
    - Parameters were optimized by 25 EM iterations, trained on 45 digits
    - The generative model is quite poor, since the generated digits don't look like the training set; if classification is the goal, a discriminative HMM can do better
17. Determining HMM Parameters
    - Given a data set $X = \{x_1,\ldots,x_N\}$, the HMM parameters $\theta = \{\pi, A, \phi\}$ can be determined using maximum likelihood
    - The likelihood function is obtained from the joint distribution by marginalizing over the latent variables $Z = \{z_1,\ldots,z_N\}$:
      $$p(X \mid \theta) = \sum_Z p(X, Z \mid \theta)$$
18. Computational Issues for Parameters
    - Two computational difficulties with $p(X \mid \theta) = \sum_Z p(X, Z \mid \theta)$:
      - The joint distribution $p(X, Z \mid \theta)$ does not factorize over $n$, in contrast to the mixture model
      - The sum over $Z$ has an exponential number of terms, corresponding to the trellis
    - Solution: use conditional-independence properties to reorder the summations, and use EM to maximize the log-likelihood of the joint, $\ln p(X, Z \mid \theta)$, an efficient framework for maximizing the likelihood function in HMMs
19. EM for MLE in an HMM
    1. Start with an initial selection of model parameters $\theta^{old}$
    2. In the E step, take these parameter values and find the posterior distribution of the latent variables, $p(Z \mid X, \theta^{old})$; use this posterior to evaluate the expectation of the logarithm of the complete-data likelihood $\ln p(X, Z \mid \theta)$, which can be written as
       $$Q(\theta, \theta^{old}) = \sum_Z p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)$$
       (the factor $p(Z \mid X, \theta^{old})$, which is independent of $\theta$, is what gets evaluated)
    3. In the M step, maximize $Q$ with respect to $\theta$
20. Expansion of Q
    - Introduce the notation
      - $\gamma(z_n) = p(z_n \mid X, \theta^{old})$: marginal posterior distribution of the latent variable $z_n$
      - $\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X, \theta^{old})$: joint posterior of two successive latent variables
    - We will re-express $Q(\theta, \theta^{old}) = \sum_Z p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)$ in terms of $\gamma$ and $\xi$
21. Detail of γ and ξ
    - For each value of $n$, $\gamma(z_n)$ can be stored as $K$ non-negative numbers that sum to unity, and $\xi(z_{n-1}, z_n)$ as a $K \times K$ matrix whose elements also sum to unity
    - $\gamma(z_{nk})$ denotes the conditional probability that $z_{nk} = 1$, with similar notation for $\xi(z_{n-1,j}, z_{nk})$
    - Because the expectation of a binary random variable is the probability that it takes the value 1,
      $$\gamma(z_{nk}) = \mathbb{E}[z_{nk}] = \sum_{z} \gamma(z)\, z_{nk}, \qquad \xi(z_{n-1,j}, z_{nk}) = \mathbb{E}[z_{n-1,j}\, z_{nk}] = \sum_{z} \xi(z)\, z_{n-1,j}\, z_{nk}$$
22. Expansion of Q
    - Begin with $Q(\theta, \theta^{old}) = \sum_Z p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)$, substitute
      $$p(X, Z \mid \theta) = p(z_1 \mid \pi)\left[\prod_{n=2}^{N} p(z_n \mid z_{n-1}, A)\right]\prod_{m=1}^{N} p(x_m \mid z_m, \phi)$$
      and use the definitions of $\gamma$ and $\xi$ to get
      $$Q(\theta, \theta^{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k + \sum_{n=2}^{N}\sum_{j=1}^{K}\sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk} + \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)$$
23. E-Step
    - The goal of the E step is to evaluate the quantities $\gamma(z_n)$ and $\xi(z_{n-1}, z_n)$ appearing in $Q(\theta, \theta^{old})$ efficiently (the forward-backward algorithm)
24. M-Step
    - Maximize $Q(\theta, \theta^{old})$ with respect to the parameters $\theta = \{\pi, A, \phi\}$, treating $\gamma(z_n)$ and $\xi(z_{n-1}, z_n)$ as constants
    - Maximization w.r.t. $\pi$ and $A$ is easily achieved (using Lagrange multipliers):
      $$\pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}, \qquad A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K}\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}$$
    - For maximization w.r.t. $\phi_k$, only the last term of $Q$ depends on $\phi_k$: $\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)$, which has the same form as in a mixture distribution for i.i.d. data
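The π and A updates can be written down mechanically once γ and ξ are available. In this sketch, `gamma` and `xi` hold made-up numbers standing in for E-step output, purely to show the normalizations.

```python
# M-step updates for pi and A from the posterior statistics:
#   pi_k  = gamma[0][k] / sum_j gamma[0][j]
#   A_jk  = sum_n xi[n][j][k] / sum_l sum_n xi[n][j][l]
# gamma and xi below are illustrative placeholders for E-step output.

gamma = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]    # gamma[n][k]
xi = [[[0.5, 0.1], [0.1, 0.3]],                  # xi[n][j][k] for n = 1..N-1
      [[0.2, 0.4], [0.1, 0.3]]]

K = 2
pi = [gamma[0][k] / sum(gamma[0]) for k in range(K)]
A = [[sum(x[j][k] for x in xi) /
      sum(x[j][l] for x in xi for l in range(K))
      for k in range(K)] for j in range(K)]

print(pi, A)
```

Each row of the resulting `A` sums to one, as the Lagrange-multiplier derivation guarantees.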
25. M-Step for Gaussian Emission
    - Maximization of $Q(\theta, \theta^{old})$ w.r.t. $\phi_k$ with Gaussian emission densities $p(x \mid \phi_k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$ has the solution
      $$\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}, \qquad \Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})(x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma(z_{nk})}$$
26. M-Step for Multinomial Observations
    - The conditional distribution of the observations has the form
      $$p(x \mid z) = \prod_{i=1}^{D}\prod_{k=1}^{K} \mu_{ik}^{x_i z_k}$$
    - The M-step equations are
      $$\mu_{ik} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_{ni}}{\sum_{n=1}^{N} \gamma(z_{nk})}$$
    - An analogous result holds for Bernoulli observed variables
27. Forward-Backward Algorithm
    - The E step requires an efficient procedure to evaluate $\gamma(z_n)$ and $\xi(z_{n-1}, z_n)$
    - The graph of the HMM is a tree, which implies that the posterior distribution of the latent variables can be obtained efficiently using a message-passing algorithm
    - In the HMM context this is called the forward-backward or Baum-Welch algorithm; several variants lead to the exact marginals, and the method called alpha-beta is discussed here
28. Derivation of Forward-Backward
    - Several conditional independences (A-H) hold. A:
      $$p(X \mid z_n) = p(x_1,\ldots,x_n \mid z_n)\, p(x_{n+1},\ldots,x_N \mid z_n)$$
    - Proved using d-separation: every path from one of $x_1,\ldots,x_n$ to one of $x_{n+1},\ldots,x_N$ passes through $z_n$, which is observed, and is head-to-tail at $z_n$; thus $(x_1,\ldots,x_n) \perp (x_{n+1},\ldots,x_N) \mid z_n$
29. Conditional Independence B
    - Since $(x_1,\ldots,x_{n-1}) \perp x_n \mid z_n$, we have
      $$p(x_1,\ldots,x_{n-1} \mid x_n, z_n) = p(x_1,\ldots,x_{n-1} \mid z_n)$$
30. Conditional Independence C
    - Since $(x_1,\ldots,x_{n-1}) \perp z_n \mid z_{n-1}$, we have
      $$p(x_1,\ldots,x_{n-1} \mid z_{n-1}, z_n) = p(x_1,\ldots,x_{n-1} \mid z_{n-1})$$
31. Conditional Independence D
    - Since $(x_{n+1},\ldots,x_N) \perp z_n \mid z_{n+1}$, we have
      $$p(x_{n+1},\ldots,x_N \mid z_n, z_{n+1}) = p(x_{n+1},\ldots,x_N \mid z_{n+1})$$
32. Conditional Independence E
    - Since $(x_{n+2},\ldots,x_N) \perp x_{n+1} \mid z_{n+1}$, we have
      $$p(x_{n+2},\ldots,x_N \mid z_{n+1}, x_{n+1}) = p(x_{n+2},\ldots,x_N \mid z_{n+1})$$
33. Conditional Independence F
    $$p(X \mid z_{n-1}, z_n) = p(x_1,\ldots,x_{n-1} \mid z_{n-1})\, p(x_n \mid z_n)\, p(x_{n+1},\ldots,x_N \mid z_n)$$
34. Conditional Independence G
    - Since $(x_1,\ldots,x_N) \perp x_{N+1} \mid z_{N+1}$, we have
      $$p(x_{N+1} \mid X, z_{N+1}) = p(x_{N+1} \mid z_{N+1})$$
35. Conditional Independence H
    $$p(z_{N+1} \mid z_N, X) = p(z_{N+1} \mid z_N)$$
36. Evaluation of γ(z_n)
    - Recall that the goal is to compute the E step of HMM parameter estimation efficiently: $\gamma(z_n) = p(z_n \mid X, \theta^{old})$ is the marginal posterior distribution of the latent variable $z_n$
    - We are interested in the posterior distribution $p(z_n \mid x_1,\ldots,x_N)$, a vector of length $K$ whose entries correspond to the expected values of $z_{nk}$
37. Introducing α and β
    - Using Bayes' theorem, $\gamma(z_n) = p(z_n \mid X) = \dfrac{p(X \mid z_n)\, p(z_n)}{p(X)}$
    - Using conditional independence A:
      $$\gamma(z_n) = \frac{p(x_1,\ldots,x_n \mid z_n)\, p(x_{n+1},\ldots,x_N \mid z_n)\, p(z_n)}{p(X)} = \frac{p(x_1,\ldots,x_n, z_n)\, p(x_{n+1},\ldots,x_N \mid z_n)}{p(X)} = \frac{\alpha(z_n)\, \beta(z_n)}{p(X)}$$
    - where $\alpha(z_n) = p(x_1,\ldots,x_n, z_n)$ is the joint probability of observing all data up to time $n$ together with the value of $z_n$, and $\beta(z_n) = p(x_{n+1},\ldots,x_N \mid z_n)$ is the conditional probability of all future data from time $n+1$ up to $N$ given the value of $z_n$
38. Recursion Relation for α
    $$\begin{aligned}
    \alpha(z_n) &= p(x_1,\ldots,x_n, z_n)\\
    &= p(x_1,\ldots,x_n \mid z_n)\, p(z_n) && \text{by Bayes' rule}\\
    &= p(x_n \mid z_n)\, p(x_1,\ldots,x_{n-1} \mid z_n)\, p(z_n) && \text{by conditional independence B}\\
    &= p(x_n \mid z_n)\, p(x_1,\ldots,x_{n-1}, z_n) && \text{by Bayes' rule}\\
    &= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1,\ldots,x_{n-1}, z_{n-1}, z_n) && \text{by the sum rule}\\
    &= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1,\ldots,x_{n-1}, z_n \mid z_{n-1})\, p(z_{n-1}) && \text{by Bayes' rule}\\
    &= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1,\ldots,x_{n-1} \mid z_{n-1})\, p(z_n \mid z_{n-1})\, p(z_{n-1}) && \text{by conditional independence C}\\
    &= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1,\ldots,x_{n-1}, z_{n-1})\, p(z_n \mid z_{n-1}) && \text{by Bayes' rule}\\
    &= p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1}) && \text{by the definition of } \alpha
    \end{aligned}$$
39. Forward Recursion for α Evaluation
    - The recursion relation is
      $$\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1})$$
    - There are $K$ terms in the summation, and it has to be evaluated for each of the $K$ values of $z_n$, so each step of the recursion is $O(K^2)$
    - The initial condition is
      $$\alpha(z_1) = p(x_1, z_1) = p(z_1)\, p(x_1 \mid z_1) = \prod_{k=1}^{K} \{\pi_k\, p(x_1 \mid \phi_k)\}^{z_{1k}}$$
    - The overall cost for the chain is $O(K^2 N)$
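The forward recursion translates almost line-for-line into code. A minimal pure-Python sketch on a hypothetical 2-state discrete-emission model (`pi`, `A`, `B` are invented numbers); summing the final row of α gives $p(X)$.

```python
# Forward (alpha) recursion:
#   alpha[n][k] = p(xn | zn=k) * sum_j alpha[n-1][j] * A[j][k]
# initialized with alpha[0][k] = pi[k] * p(x1 | z1=k).
# Each step costs O(K^2); the chain costs O(K^2 N).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]

def forward(obs):
    K = len(pi)
    alpha = [[pi[k] * B[k][obs[0]] for k in range(K)]]
    for x in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[k][x] * sum(prev[j] * A[j][k] for j in range(K))
                      for k in range(K)])
    return alpha

alpha = forward([0, 0, 1])
print(sum(alpha[-1]))   # marginal likelihood p(X)
```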
40. Recursion Relation for β
    $$\begin{aligned}
    \beta(z_n) &= p(x_{n+1},\ldots,x_N \mid z_n)\\
    &= \sum_{z_{n+1}} p(x_{n+1},\ldots,x_N, z_{n+1} \mid z_n) && \text{by the sum rule}\\
    &= \sum_{z_{n+1}} p(x_{n+1},\ldots,x_N \mid z_n, z_{n+1})\, p(z_{n+1} \mid z_n) && \text{by Bayes' rule}\\
    &= \sum_{z_{n+1}} p(x_{n+1},\ldots,x_N \mid z_{n+1})\, p(z_{n+1} \mid z_n) && \text{by conditional independence D}\\
    &= \sum_{z_{n+1}} p(x_{n+2},\ldots,x_N \mid z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n) && \text{by conditional independence E}\\
    &= \sum_{z_{n+1}} \beta(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n) && \text{by the definition of } \beta
    \end{aligned}$$
41. Backward Recursion for β
    - Backward message passing:
      $$\beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n)$$
      evaluates $\beta(z_n)$ in terms of $\beta(z_{n+1})$
    - The starting condition for the recursion is
      $$p(z_N \mid X) = \frac{p(X, z_N)\, \beta(z_N)}{p(X)}$$
      which is correct provided we set $\beta(z_N) = 1$ for all settings of $z_N$; this is the initial condition for the backward computation
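The backward recursion is the mirror image, initialized with $\beta(z_N) = 1$. The same hypothetical 2-state model is assumed; all numbers are illustrative.

```python
# Backward (beta) recursion:
#   beta[n][j] = sum_k beta[n+1][k] * p(x_{n+1} | k) * A[j][k]
# initialized with beta[N-1][k] = 1 for all k.
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]

def backward(obs):
    K = len(A)
    beta = [[1.0] * K]                    # beta(z_N) = 1
    for x in reversed(obs[1:]):           # x runs over x_{n+1} for n = N-2..0
        nxt = beta[0]
        beta.insert(0, [sum(nxt[k] * B[k][x] * A[j][k] for k in range(K))
                        for j in range(K)])
    return beta

beta = backward([0, 0, 1])
print(beta[0])
```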
42. M-Step Equations
    - In the M-step equations, e.g.
      $$\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}$$
      the factor $p(X)$ cancels out; it can be evaluated as
      $$p(X) = \sum_{z_n} \alpha(z_n)\, \beta(z_n)$$
43. Evaluation of the Quantities ξ(z_{n-1}, z_n)
    - They correspond to the values of the conditional probabilities $p(z_{n-1}, z_n \mid X)$ for each of the $K \times K$ settings of $(z_{n-1}, z_n)$:
      $$\begin{aligned}
      \xi(z_{n-1}, z_n) &= p(z_{n-1}, z_n \mid X) && \text{by definition}\\
      &= \frac{p(X \mid z_{n-1}, z_n)\, p(z_{n-1}, z_n)}{p(X)} && \text{by Bayes' rule}\\
      &= \frac{p(x_1,\ldots,x_{n-1} \mid z_{n-1})\, p(x_n \mid z_n)\, p(x_{n+1},\ldots,x_N \mid z_n)\, p(z_n \mid z_{n-1})\, p(z_{n-1})}{p(X)} && \text{by conditional independence F}\\
      &= \frac{\alpha(z_{n-1})\, p(x_n \mid z_n)\, p(z_n \mid z_{n-1})\, \beta(z_n)}{p(X)}
      \end{aligned}$$
    - Thus $\xi(z_{n-1}, z_n)$ is calculated directly from the results of the α and β recursions
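Putting the two recursions together yields γ and ξ directly. This sketch (same invented 2-state model) runs both recursions and normalizes by $p(X) = \sum \alpha\beta$; each $\gamma(z_n)$ vector and each $\xi(z_{n-1}, z_n)$ table should sum to one.

```python
# gamma and xi from the alpha/beta recursions:
#   gamma[n][k]    = alpha[n][k] * beta[n][k] / p(X)
#   xi[n-1][j][k]  = alpha[n-1][j] * p(xn|k) * A[j][k] * beta[n][k] / p(X)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]

def posteriors(obs):
    K = len(pi)
    # forward pass
    alpha = [[pi[k] * B[k][obs[0]] for k in range(K)]]
    for x in obs[1:]:
        alpha.append([B[k][x] * sum(alpha[-1][j] * A[j][k] for j in range(K))
                      for k in range(K)])
    # backward pass
    beta = [[1.0] * K]
    for x in reversed(obs[1:]):
        beta.insert(0, [sum(beta[0][k] * B[k][x] * A[j][k] for k in range(K))
                        for j in range(K)])
    pX = sum(alpha[-1])
    gamma = [[a * b / pX for a, b in zip(an, bn)]
             for an, bn in zip(alpha, beta)]
    xi = [[[alpha[n - 1][j] * B[k][obs[n]] * A[j][k] * beta[n][k] / pX
            for k in range(K)] for j in range(K)]
          for n in range(1, len(obs))]
    return gamma, xi

gamma, xi = posteriors([0, 0, 1])
print(sum(gamma[0]))   # each gamma[n] sums to 1
```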
44. Summary of EM to Train an HMM, Step 1: Initialization
    - Make an initial selection of parameters $\theta^{old}$, where $\theta = (\pi, A, \phi)$:
      1. $\pi$ is a vector of the $K$ state probabilities for the latent variable $z_1$
      2. $A$ is a $K \times K$ matrix of transition probabilities $A_{jk}$
      3. $\phi$ are the parameters of the conditional distribution $p(x_n \mid z_n)$
    - The $A$ and $\pi$ parameters are often initialized uniformly; initialization of $\phi$ depends on the form of the distribution
    - For Gaussian emissions, the parameters $\mu_k$ can be initialized by applying K-means to the data, with $\Sigma_k$ set to the covariance matrix of the corresponding cluster
45. Summary of EM to Train an HMM, Steps 2 and 3
    - Step 2 (E step): run both the forward α recursion and the backward β recursion, and use the results to evaluate $\gamma(z_n)$, $\xi(z_{n-1}, z_n)$, and the likelihood function
    - Step 3 (M step): use the results of the E step to find a revised set of parameters $\theta^{new}$ using the M-step equations
    - Alternate between the E and M steps until the likelihood function converges
46. Values for p(x_n | z_n)
    - In the recursion relations, the observations enter through the conditional distributions $p(x_n \mid z_n)$
    - The recursions are independent of the dimensionality of the observed variables and of the form of the conditional distribution, so long as it can be computed for each of the $K$ possible states of $z_n$
    - Since the observed variables $\{x_n\}$ are fixed, these quantities can be pre-computed at the start of the EM algorithm
47. Sequence Length: Using Multiple Short Sequences
    - An HMM can be trained effectively only if the sequence is sufficiently long; this is true of all maximum-likelihood approaches
    - Alternatively, multiple short sequences can be used, which requires a straightforward modification of the HMM EM algorithm
    - This is particularly important in left-to-right models, since in a given observation sequence a state transition corresponding to a non-diagonal element of $A$ occurs only once
48. Predictive Distribution
    - The observed data are $X = \{x_1,\ldots,x_N\}$ and we wish to predict $x_{N+1}$ (an application in financial forecasting):
      $$\begin{aligned}
      p(x_{N+1} \mid X) &= \sum_{z_{N+1}} p(x_{N+1}, z_{N+1} \mid X)\\
      &= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1})\, p(z_{N+1} \mid X) && \text{by the product rule}\\
      &= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1}, z_N \mid X) && \text{by the sum rule}\\
      &= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1} \mid z_N)\, p(z_N \mid X) && \text{by conditional independence H}\\
      &= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1} \mid z_N)\, \frac{p(z_N, X)}{p(X)} && \text{by Bayes' rule}\\
      &= \frac{1}{p(X)} \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1} \mid z_N)\, \alpha(z_N) && \text{by the definition of } \alpha
      \end{aligned}$$
    - This can be evaluated by first running the forward α recursion and then summing over $z_N$ and $z_{N+1}$
    - It can be extended to subsequent predictions of $x_{N+2}$, after $x_{N+1}$ is observed, using a fixed amount of storage
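The predictive distribution requires only the final α vector. An illustrative sketch on the same kind of invented 2-state discrete model; `predict` returns $p(x_{N+1} \mid X)$ over an assumed two-symbol observation alphabet.

```python
# One-step predictive distribution:
#   p(x_{N+1}|X) = (1/p(X)) * sum_{k} p(x_{N+1}|k) * sum_{j} A[j][k] * alpha_N[j]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]

def forward(obs):
    K = len(pi)
    alpha = [[pi[k] * B[k][obs[0]] for k in range(K)]]
    for x in obs[1:]:
        alpha.append([B[k][x] * sum(alpha[-1][j] * A[j][k] for j in range(K))
                      for k in range(K)])
    return alpha

def predict(obs):
    K = len(pi)
    alpha_N = forward(obs)[-1]
    pX = sum(alpha_N)                    # p(X) from the forward recursion
    return [sum(B[k][x] * sum(A[j][k] * alpha_N[j] for j in range(K))
                for k in range(K)) / pX
            for x in range(2)]           # assumed 2-symbol alphabet

p = predict([0, 0, 1])
print(p)
```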
49. Sum-Product and the HMM
    - The HMM graph is a tree, so the sum-product algorithm can be used to find the local marginals for the hidden variables; this is equivalent to the forward-backward algorithm
    - Sum-product provides a simple way to derive the alpha-beta recursion formulae: transform the directed graph, with joint distribution
      $$p(x_1,\ldots,x_N, z_1,\ldots,z_N) = p(z_1)\left[\prod_{n=2}^{N} p(z_n \mid z_{n-1})\right]\prod_{n=1}^{N} p(x_n \mid z_n)$$
      into a factor graph
    - In the factor graph, each variable has a node, small squares represent factors, and undirected links connect factor nodes to the variables they use
50. Deriving Alpha-Beta from Sum-Product
    - Begin with a simplified form of the factor graph, absorbing the emission probabilities into the transition-probability factors:
      $$h(z_1) = p(z_1)\, p(x_1 \mid z_1), \qquad f_n(z_{n-1}, z_n) = p(z_n \mid z_{n-1})\, p(x_n \mid z_n)$$
    - The messages propagated are
      $$\mu_{z_{n-1} \to f_n}(z_{n-1}) = \mu_{f_{n-1} \to z_{n-1}}(z_{n-1}), \qquad \mu_{f_n \to z_n}(z_n) = \sum_{z_{n-1}} f_n(z_{n-1}, z_n)\, \mu_{z_{n-1} \to f_n}(z_{n-1})$$
    - One can show that this computes the α recursion; similarly, starting with the root node, the β recursion is computed, and γ and ξ follow. Final results:
      $$\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1}), \qquad \beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n)$$
      $$\gamma(z_n) = \frac{\alpha(z_n)\, \beta(z_n)}{p(X)}, \qquad \xi(z_{n-1}, z_n) = \frac{\alpha(z_{n-1})\, p(x_n \mid z_n)\, p(z_n \mid z_{n-1})\, \beta(z_n)}{p(X)}$$
51. Scaling Factors
    - An implementation issue arises with small probabilities: at each step of the recursion
      $$\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1})$$
      the new value of $\alpha(z_n)$ is obtained from the previous value $\alpha(z_{n-1})$ by multiplying by $p(z_n \mid z_{n-1})$ and $p(x_n \mid z_n)$
    - These probabilities are small and the products will underflow; logs don't help, since we have sums of products
    - The solution is rescaling of $\alpha(z_n)$ and $\beta(z_n)$ so that their values remain close to unity
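A common rescaling scheme normalizes α at every step; the per-step normalizers $c_n$ then recover $\ln p(X) = \sum_n \ln c_n$ without underflow. A sketch with invented 2-state numbers; this particular normalization convention is one of several possible choices, assumed here for illustration.

```python
import math

# Scaled forward recursion: alpha-hat is normalized to sum to 1 at
# every step, so its entries stay near unity; the scaling factors c
# satisfy p(X) = prod_n c[n], i.e. ln p(X) = sum_n ln c[n].
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]

def scaled_forward(obs):
    K = len(pi)
    a = [pi[k] * B[k][obs[0]] for k in range(K)]
    c = [sum(a)]
    alpha_hat = [[v / c[0] for v in a]]
    for x in obs[1:]:
        a = [B[k][x] * sum(alpha_hat[-1][j] * A[j][k] for j in range(K))
             for k in range(K)]
        c.append(sum(a))
        alpha_hat.append([v / c[-1] for v in a])
    return alpha_hat, c

alpha_hat, c = scaled_forward([0, 0, 1])
print(sum(math.log(cn) for cn in c))   # ln p(X), computed without underflow
```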
52. The Viterbi Algorithm
    - Finds the most probable sequence of hidden states for a given sequence of observations; in speech recognition, this means finding the most probable phoneme sequence for a given series of acoustic observations
    - Since the graphical model of the HMM is a tree, this can be solved exactly using the max-sum algorithm, known as the Viterbi algorithm in the context of HMMs
    - Since max-sum works with log probabilities, there is no need to work with rescaled variables as in forward-backward
53. Viterbi Algorithm for an HMM
    - A fragment of the HMM lattice showing two paths: the number of possible paths grows exponentially with the length of the chain
    - Viterbi searches this space of paths efficiently, finding the most probable path at a computational cost linear in the length of the chain
54. Deriving Viterbi from Max-Sum
    - Start with the simplified factor graph; treat the variable $z_N$ as the root node, passing messages to the root from the leaf nodes
    - The messages passed are
      $$\mu_{z_n \to f_{n+1}}(z_n) = \mu_{f_n \to z_n}(z_n), \qquad \mu_{f_{n+1} \to z_{n+1}}(z_{n+1}) = \max_{z_n}\left\{\ln f_{n+1}(z_n, z_{n+1}) + \mu_{z_n \to f_{n+1}}(z_n)\right\}$$
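Viterbi in log space can be sketched in a few lines: $\omega_n(k) = \ln p(x_n \mid k) + \max_j\{\omega_{n-1}(j) + \ln A_{jk}\}$, with back-pointers to recover the most probable path. An invented 2-state model is assumed, with all entries of $\pi$, $A$, $B$ strictly positive so the logarithms are defined.

```python
import math

# Viterbi in log space with back-pointer recovery. Cost is O(K^2 N),
# linear in the chain length. Toy 2-state model; numbers illustrative.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]

def viterbi(obs):
    K = len(pi)
    omega = [math.log(pi[k]) + math.log(B[k][obs[0]]) for k in range(K)]
    back = []
    for x in obs[1:]:
        new, ptr = [], []
        for k in range(K):
            best_j = max(range(K), key=lambda j: omega[j] + math.log(A[j][k]))
            ptr.append(best_j)
            new.append(omega[best_j] + math.log(A[best_j][k])
                       + math.log(B[k][x]))
        omega, back = new, back + [ptr]
    # backtrack from the best final state
    path = [max(range(K), key=lambda k: omega[k])]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path

print(viterbi([0, 0, 1, 1, 1]))
```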
55. Other Topics on Sequential Data
    - Sequential Data and Markov Models: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-MarkovModels.pdf
    - Extensions of HMMs: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-HMMExtensions.pdf
    - Linear Dynamical Systems: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-LinearDynamicalSystems.pdf
    - Conditional Random Fields: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.5-ConditionalRandomFields.pdf