Hidden Markov Models

                            Sargur N. Srihari
                            srihari@cedar.buffalo.edu
             Machine Learning Course:
  http://www.cedar.buffalo.edu/~srihari/CSE574/index.html

HMM Topics
  1.     What is an HMM?
  2.     State-space Representation
  3.     HMM Parameters
  4.     Generative View of HMM
  5.     Determining HMM Parameters Using EM
  6.     Forward-Backward or α−β algorithm
  7.     HMM Implementation Issues:
        a)    Length of Sequence
        b)    Predictive Distribution
        c)    Sum-Product Algorithm
        d)    Scaling Factors
        e)    Viterbi Algorithm
1. What is an HMM?
•    Ubiquitous tool for modeling time series data
•    Used in
      •   Almost all speech recognition systems
      •   Computational molecular biology
            •   Group amino acid sequences into proteins
      •   Handwritten word recognition
•    It is a tool for representing probability distributions over
     sequences of observations
•    HMM gets its name from two defining properties:
      •   Observation xt at time t was generated by some process whose state
          zt is hidden from the observer
      •   Assumes that state zt is dependent only on state zt-1 and
          independent of all prior states (First order)
•    Example: z are phoneme sequences
              x are acoustic observations

Graphical Model of HMM
  • Has the graphical model shown below and
      latent variables are discrete




  • Joint distribution has the form:
       p(x1,..,xN, z1,..,zN) = p(z1) [ Π_{n=2}^{N} p(zn|zn-1) ] Π_{n=1}^{N} p(xn|zn)
HMM Viewed as Mixture



•   A single time slice corresponds to a mixture distribution
    with component densities p(x|z)
    •   Each state of discrete variable z represents a different component
•   An extension of mixture model
    •   Choice of mixture component depends on the component chosen
        at the previous time step
•   Latent variables are multinomial variables zn
    •   That describe component responsible for generating xn
•   Can use one-of-K coding scheme
2. State-Space Representation
• Probability distribution of zn depends on state of previous
     latent variable zn-1 through probability distribution p(zn|zn-1)
•    One-of-K coding, e.g.:

       State k of zn:      k        1   2   …   K
                           znk      0   1   …   0

       State j of zn-1:    j        1   2   …   K
                           zn-1,j   1   0   …   0

      • Since latent variables are K-dimensional binary vectors
            Ajk = p(znk=1 | zn-1,j=1),   with Σk Ajk = 1
• These are known as Transition Probabilities
•    K(K-1) independent parameters

     [Figure: K × K matrix A, rows indexed by state j of zn-1 and
      columns by state k of zn; entry (j,k) is Ajk]
Transition Probabilities
Example with 3-state latent variable

                 zn=1        zn=2        zn=3
                 (zn1=1)     (zn2=1)     (zn3=1)
   zn-1=1        A11         A12         A13         A11+A12+A13=1
   zn-1=2        A21         A22         A23
   zn-1=3        A31         A32         A33

   [Figure: state transition diagram for K=3]

• Not a graphical model since nodes are not separate
  variables but states of a single variable
• Here K=3
Conditional Probabilities
•    Transition probabilities Ajk represent state-to-state
     probabilities for each variable
•    Conditional probabilities are variable-to-variable
     probabilities
      •   can be written in terms of transition probabilities as
           p(zn | zn-1, A) = Π_{k=1}^{K} Π_{j=1}^{K} Ajk^{zn-1,j znk}
      •   Note that exponent zn-1,j zn,k is a product that evaluates to 0 or 1
      •   Hence the overall product will evaluate to a single Ajk for each
          setting of values of zn and zn-1
            •   E.g., zn-1=2 and zn=3 will result in only zn-1,2=1 and zn,3=1. Thus
                p(zn=3|zn-1=2)=A23
•    A is a global HMM parameter
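As an illustration, here is a minimal numpy sketch of the one-of-K product (the 3-state matrix and helper names below are invented for this example); the product formulation picks out exactly one entry of A:

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row must sum to 1
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

def one_of_K(k, K):
    """Return a K-dimensional binary vector with a 1 in position k."""
    z = np.zeros(K)
    z[k] = 1.0
    return z

z_prev = one_of_K(1, 3)   # z_{n-1} = state 2 (index 1)
z_curr = one_of_K(2, 3)   # z_n     = state 3 (index 2)

# p(zn | zn-1, A) = prod_j prod_k Ajk^{zn-1,j znk}
p = np.prod(A ** np.outer(z_prev, z_curr))
assert np.isclose(p, A[1, 2])   # the product picks out the single entry A23
```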
Initial Variable Probabilities
  • Initial latent node z1 is a special case without
      parent node
  •   Represented by vector of probabilities π with
      elements πk=p(z1k=1) so that
      p(z1 | π) = Π_{k=1}^{K} πk^{z1k},   where Σk πk = 1


  • Note that π is an HMM parameter
        • representing probabilities of each state for the first variable

Lattice or Trellis Diagram
• State transition diagram unfolded over time




• Representation of latent variable states
• Each column corresponds to one of latent
  variables zn
Emission Probabilities p(xn|zn)
  • We have so far only specified p(zn|zn-1)                  by
      means of transition probabilities
  •   Need to specify probabilities p(xn|zn,φ) to
      complete the model, where φ are parameters
  •   These can be continuous or discrete
  •   Because xn is observed and zn is discrete
      p(xn|zn,φ) consists of a table of K numbers
      corresponding to K states of zn
        • Analogous to class-conditional probabilities
  • Can be represented as
        p(xn | zn, φ) = Π_{k=1}^{K} p(xn|φk)^{znk}
3. HMM Parameters


 •   We have defined three types of HMM parameters: θ =
     (π, A,φ)
       1. Initial Probabilities of first latent variable:
          π is a vector of K probabilities of the states for latent variable z1
       2. Transition Probabilities (State-to-state for any latent variable):
          A is a K x K matrix of transition probabilities Aij
       3. Emission Probabilities (Observations conditioned on latent):
          φ are parameters of conditional distribution p(xn|zn)
 •    A and π parameters are often initialized uniformly
 •   Initialization of φ depends on form of distribution

Joint Distribution over Latent and
             Observed variables
• Joint can be expressed in terms of parameters:
   p(X, Z | θ) = p(z1|π) [ Π_{n=2}^{N} p(zn|zn-1, A) ] Π_{m=1}^{N} p(xm|zm, φ)

   where X = {x1,..,xN},  Z = {z1,..,zN},  θ = {π, A, φ}


• Most discussion of HMM is independent of
 emission probabilities
  • Tractable for discrete tables, Gaussian, GMMs
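As a sketch of how this factorization is used (all names are illustrative; a discrete emission table phi of shape K x V is assumed), the complete-data log-likelihood ln p(X,Z|θ) for a known state path is just a sum of logs of the three factor types:

```python
import numpy as np

def complete_data_log_lik(x, z, pi, A, phi):
    """ln p(X, Z | theta) for a known state path z (illustrative sketch).

    x   : (N,) integer observation symbols
    z   : (N,) integer hidden states
    pi  : (K,) initial state probabilities
    A   : (K, K) transition probabilities
    phi : (K, V) assumed discrete emission table, phi[k, v] = p(x=v | z=k)
    """
    ll = np.log(pi[z[0]]) + np.log(phi[z[0], x[0]])    # initial + first emission
    for n in range(1, len(x)):
        ll += np.log(A[z[n - 1], z[n]]) + np.log(phi[z[n], x[n]])
    return ll
```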
4. Generative View of HMM
•   Sampling from an HMM
•   HMM with 3-state latent variable z
    •   Gaussian emission model p(x|z)
    •   Contours of constant density of emission
        distribution shown for each state
    •   Two-dimensional x


•   50 data points generated from HMM
•   Lines show successive observations

•   Transition probabilities fixed so that
    •   5% probability of transitioning to each of the other states
    •   90% probability of remaining in the same state
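Sampling from an HMM like this is ancestral sampling: draw z1 from π, then alternately emit an observation and sample the next state. A minimal sketch with two-dimensional Gaussian emissions (the parameter values are invented, chosen only to mimic the 3-state, 90%-self-transition setup described above):

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([1/3, 1/3, 1/3])
A = np.full((3, 3), 0.05) + np.eye(3) * 0.85   # 90% self-transition, 5% each other state
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])  # illustrative 2-D emission means

def sample_hmm(N):
    states, obs = [], []
    z = rng.choice(3, p=pi)                     # z1 ~ pi
    for _ in range(N):
        states.append(z)
        obs.append(rng.normal(means[z], 1.0))   # xn ~ N(mu_z, I)
        z = rng.choice(3, p=A[z])               # z_{n+1} ~ row z of A
    return np.array(states), np.array(obs)

states, obs = sample_hmm(50)                    # e.g. 50 points as in the figure
```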

Left-to-Right HMM
  • Setting elements of Ajk=0       if k < j




  • Corresponding lattice diagram



Left-to-Right Applied Generatively to Digits
•    Examples of on-line handwritten 2’s
•    x is a sequence of pen coordinates
•    There are 16 states for z, or K=16
•    Each state can generate a line segment of fixed
     length in one of 16 angles
      •   Emission probabilities: 16 x 16 table
•    Transition probabilities set to zero except for those
     that keep state index k the same or increment it by one
•    Parameters optimized by 25 EM iterations
•    Trained on 45 digits
•    Generative model is quite poor
      •   Since generated samples don’t look like the training examples
      •   If classification is the goal, can do better by using a
          discriminative HMM

     [Figures: Training Set and Generated Set]
5. Determining HMM Parameters
• Given data set X={x1,..,xN} we can determine
    HMM parameters θ ={π,A,φ} using maximum
    likelihood
•   Likelihood function obtained from joint
    distribution by marginalizing over latent
    variables Z={z1,..,zN}
                    p(X|θ) = ΣZ p(X,Z|θ)


    [Figure: likelihood obtained by summing the joint p(X,Z) over Z]
Computational issues for Parameters
                     p(X|θ) = ΣZ p(X,Z|θ)
• Computational difficulties
    • Joint distribution p(X,Z|θ) does not factorize over n,
           in contrast to mixture model
    • Sum over Z has an exponential number of terms, one per path through the trellis
• Solution
    • Use conditional independence properties to reorder
        summations
    •   Use EM instead to maximize log-likelihood function of joint
        ln p(X,Z|θ)
          Efficient framework for maximizing the likelihood function in HMMs

EM for MLE in HMM
1. Start with initial selection for model parameters θold
2. In E step take these parameter values and find
   posterior distribution of latent variables p(Z|X,θ old)
   Use this posterior distribution to evaluate
   expectation of the logarithm of the complete-data
   likelihood function ln p(X,Z|θ)
          which can be written as
                Q(θ, θ old) = Σ_Z p(Z|X, θ old) ln p(X, Z|θ)

       where the posterior p(Z|X,θ old), which is independent of θ, is the quantity evaluated here
3. In M-Step maximize Q w.r.t. θ
Expansion of Q
  • Introduce                   notation
        γ (zn) = p(zn|X,θold): Marginal posterior distribution of
          latent variable zn
        ξ(zn-1,zn) = p(zn-1,zn|X,θ old): Joint posterior of two
          successive latent variables
  • We will be                  re-expressing Q in terms of γ and ξ
        Q(θ, θ old) = Σ_Z p(Z|X, θ old) ln p(X, Z|θ)
Detail of γ and ξ
  For each value of n we can store
        γ(zn) using K non-negative numbers that sum to unity
        ξ(zn-1,zn) using a K x K matrix whose elements also sum
          to unity
  • Using           notation
        γ (znk) denotes conditional probability of znk=1
        Similar notation for ξ(zn-1,j,znk)

  • Because the expectation of a binary random
      variable is the probability that it takes value 1
        γ(znk) = E[znk] = Σ_{zn} γ(zn) znk
        ξ(zn-1,j, znk) = E[zn-1,j znk] = Σ_{zn-1,zn} ξ(zn-1, zn) zn-1,j znk
Expansion of Q
  • We begin with
        Q(θ, θ old) = Σ_Z p(Z|X, θ old) ln p(X, Z|θ)
  • Substitute
        p(X, Z|θ) = p(z1|π) [ Π_{n=2}^{N} p(zn|zn-1, A) ] Π_{m=1}^{N} p(xm|zm, φ)
  • And use definitions of γ and ξ to get:
        Q(θ, θ old) = Σ_{k=1}^{K} γ(z1k) ln πk
                    + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(zn-1,j, znk) ln Ajk
                    + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(znk) ln p(xn|φk)
E-Step
        Q(θ, θ old) = Σ_{k=1}^{K} γ(z1k) ln πk
                    + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(zn-1,j, znk) ln Ajk
                    + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(znk) ln p(xn|φk)


• Goal of E step is to evaluate γ(zn) and ξ(zn-1,zn)
  efficiently (Forward-Backward Algorithm)
M-Step
  • Maximize Q(θ,θ old) with respect to parameters
      θ ={π,A,φ}
       • Treat γ(zn) and ξ(zn-1,zn) as constant
  • Maximization w.r.t. π and A
        • easily achieved (using Lagrange multipliers); see the sketch below:

              πk = γ(z1k) / Σ_{j=1}^{K} γ(z1j)

              Ajk = Σ_{n=2}^{N} ξ(zn-1,j, znk) / Σ_{l=1}^{K} Σ_{n=2}^{N} ξ(zn-1,j, znl)

  • Maximization w.r.t. φk
        • Only the last term of Q depends on φk:
              Σ_{n=1}^{N} Σ_{k=1}^{K} γ(znk) ln p(xn|φk)
        • Same form as in mixture distribution for i.i.d. data
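In code, these updates are normalized sums over the posterior statistics. A minimal sketch, assuming γ is stored as an (N, K) array and ξ as an (N-1, K, K) array matching the definitions above:

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """M-step updates for pi and A (sketch).

    gamma : (N, K) with gamma[n, k] = gamma(z_nk)
    xi    : (N-1, K, K) with xi[n, j, k] = xi(z_{n,j}, z_{n+1,k})
    """
    pi = gamma[0] / gamma[0].sum()              # normalized first-step posterior
    counts = xi.sum(axis=0)                     # expected transition counts, (K, K)
    A = counts / counts.sum(axis=1, keepdims=True)   # normalize each row j over k
    return pi, A
```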
M-step for Gaussian emission
  • Maximization of Q(θ,θ old) w.r.t. φk
  • Gaussian Emission Densities
      p(x|φk) ~ N(x|μk, Σk)
  •   Solution:

        μk = Σ_{n=1}^{N} γ(znk) xn / Σ_{n=1}^{N} γ(znk)

        Σk = Σ_{n=1}^{N} γ(znk)(xn − μk)(xn − μk)^T / Σ_{n=1}^{N} γ(znk)
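A sketch of the corresponding weighted-mean and weighted-covariance computations (array shapes as in the earlier M-step sketch; names are illustrative):

```python
import numpy as np

def m_step_gaussian(gamma, X):
    """Gaussian emission updates (sketch).

    gamma : (N, K) posterior responsibilities gamma(z_nk)
    X     : (N, D) observations
    """
    Nk = gamma.sum(axis=0)                     # (K,) effective counts
    mu = (gamma.T @ X) / Nk[:, None]           # (K, D) weighted means
    K, D = mu.shape
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
    return mu, Sigma
```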




M-Step for Multinomial Observed
  • Conditional Distribution of Observations
      have the form
        p(x|z) = Π_{i=1}^{D} Π_{k=1}^{K} μik^{xi zk}

  • M-Step equations:

        μik = Σ_{n=1}^{N} γ(znk) xni / Σ_{n=1}^{N} γ(znk)

  • Analogous result holds for Bernoulli
      observed variables
6. Forward-Backward Algorithm
  • E step: efficient procedure to evaluate
      γ(zn) and ξ(zn-1,zn)
  • Graph of HMM is a tree
        • Implies that posterior distribution of latent
           variables can be obtained efficiently using
           message passing algorithm
  • In HMM it is called forward-backward
      algorithm or Baum-Welch Algorithm
  •   Several variants lead to exact marginals
        • Method called alpha-beta discussed here
Derivation of Forward-Backward
• Several conditional-independences (A-H) hold
    A. p(X|zn) = p(x1,..xn|zn) p(xn+1,..xN|zn)
•   Proved using d-separation:
       Every path from x1,..,xn-1 to xn passes through zn, on which we condition.
       Path is head-to-tail. Thus (x1,..,xn-1) _||_ xn | zn
       Similarly (x1,..,xn) _||_ (xn+1,..,xN) | zn


Conditional independence B
  • Since                   (x1,..xn-1) _||_ xn | zn   we have

  B. p(x1,..xn-1|xn,zn)= p(x1,..xn-1|zn)




Conditional independence C
  • Since              (x1,..xn-1) _||_ zn | zn-1


      C. p(x1,..xn-1|zn-1,zn)= p ( x1,..xn-1|zn-1 )




Conditional independence D
  • Since              (xn+1,..xN) _||_ zn | zn+1


      D. p(xn+1,..,xN|zn,zn+1) = p(xn+1,..,xN|zn+1)




Conditional independence E
  • Since              (xn+2,..xN) _||_ zn | zn+1


      E. p(xn+2,..xN|zn+1,xn+1)= p ( xn+2,..xN|zn+1 )




Conditional independence F

  F. p(X|zn-1,zn) = p(x1,..,xn-1|zn-1) p(xn|zn) p(xn+1,..,xN|zn)




Conditional independence G
      Since            (x1,..xN) _||_ xN+1 | zN+1

  G.        p(xN+1|X,zN+1)=p(xN+1|zN+1 )





Conditional independence H

  H.         p(zN+1|zN , X)= p ( zN+1|zN )





Evaluation of γ(zn)
  • Recall that the goal is to efficiently compute
      the E step of estimating the parameters of the
      HMM
        γ (zn) = p(zn|X,θold): Marginal posterior
           distribution of latent variable zn
  • We are interested in finding posterior
      distribution p(zn|x1,..xN)
  •   This is a vector of length K whose entries
      correspond to expected values of znk

Introducing α and β
• Using Bayes theorem
        γ(zn) = p(zn|X) = p(X|zn) p(zn) / p(X)

• Using conditional independence A
        γ(zn) = p(x1,..,xn|zn) p(xn+1,..,xN|zn) p(zn) / p(X)
              = p(x1,..,xn, zn) p(xn+1,..,xN|zn) / p(X)
              = α(zn) β(zn) / p(X)

• where α(zn) = p(x1,..,xn, zn),
  the joint probability of observing all
  given data up to time n and the value of zn,
  and β(zn) = p(xn+1,..,xN|zn),
  the conditional probability of all
  future data from time n+1 up to N given
  the value of zn
Recursion Relation for α
α(zn) = p(x1,..,xn, zn)
      = p(x1,..,xn|zn) p(zn)                                         by product rule
      = p(xn|zn) p(x1,..,xn-1|zn) p(zn)                              by conditional independence B
      = p(xn|zn) p(x1,..,xn-1, zn)                                   by product rule
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1, zn-1, zn)                    by sum rule
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1, zn|zn-1) p(zn-1)             by product rule
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1|zn-1) p(zn|zn-1) p(zn-1)      by cond. ind. C
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1, zn-1) p(zn|zn-1)             by product rule
      = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)                         by definition of α
Forward Recursion for α Evaluation
 •    Recursion relation is
          α(zn) = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)

 •    There are K terms in the summation
       •   Has to be evaluated for each of K values of zn
       •   Each step of recursion is O(K²)
 •    Initial condition is
          α(z1) = p(x1, z1) = p(z1) p(x1|z1) = Π_{k=1}^{K} {πk p(x1|φk)}^{z1k}

 •    Overall cost for the chain is O(K²N) (see the sketch below)
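A direct implementation of the unscaled recursion, assuming the emission probabilities have been precomputed into a matrix B with B[n, k] = p(xn | zn = k) (slide 7(d) explains why rescaling is needed in practice):

```python
import numpy as np

def forward(pi, A, B):
    """Unscaled alpha recursion (sketch). B[n, k] = p(x_n | z_n = k).

    Returns alpha of shape (N, K) with alpha[n, k] = p(x_1..x_n, z_n = k).
    """
    N, K = B.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[0]                       # alpha(z_1) = pi_k p(x_1 | phi_k)
    for n in range(1, N):
        alpha[n] = B[n] * (alpha[n - 1] @ A)   # each step costs O(K^2)
    return alpha
```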
Recursion Relation for β

   β(zn) = p(xn+1,..,xN|zn)
         = Σ_{zn+1} p(xn+1,..,xN, zn+1|zn)                           by sum rule
         = Σ_{zn+1} p(xn+1,..,xN|zn, zn+1) p(zn+1|zn)                by product rule
         = Σ_{zn+1} p(xn+1,..,xN|zn+1) p(zn+1|zn)                    by cond. ind. D
         = Σ_{zn+1} p(xn+2,..,xN|zn+1) p(xn+1|zn+1) p(zn+1|zn)       by cond. ind. E
         = Σ_{zn+1} β(zn+1) p(xn+1|zn+1) p(zn+1|zn)                  by definition of β
Backward Recursion for β
•   Backward message passing
       β(zn) = Σ_{zn+1} β(zn+1) p(xn+1|zn+1) p(zn+1|zn)

•   Evaluates β(zn) in terms of β(zn+1) (see the sketch below)

•   Starting condition for recursion is
       p(zN|X) = p(X, zN) β(zN) / p(X)
•   This is correct provided we set β(zN) = 1
    for all settings of zN
    •   This is the initial condition for backward
        computation


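The corresponding backward pass, under the same assumed emission matrix B as in the forward sketch:

```python
import numpy as np

def backward(A, B):
    """Beta recursion (sketch). B[n, k] = p(x_n | z_n = k).

    Returns beta of shape (N, K) with beta[n, k] = p(x_{n+1}..x_N | z_n = k).
    """
    N, K = B.shape
    beta = np.ones((N, K))                     # beta(z_N) = 1 for all states
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[n + 1] * beta[n + 1])   # sum over z_{n+1}
    return beta
```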
M step Equations
  • In the M-step equations p(X) will cancel out, e.g.

        μk = Σ_{n=1}^{N} γ(znk) xn / Σ_{n=1}^{N} γ(znk)

  • p(X) itself is obtained from the α-β recursion as

        p(X) = Σ_{zn} α(zn) β(zn)



Evaluation of Quantities ξ(zn-1,zn)
•   They correspond to the values of the conditional
    probabilities p(zn-1,zn|X) for each of the K x K settings for
    (zn-1,zn)
ξ(zn-1, zn) = p(zn-1, zn|X)                                                            by definition
            = p(X|zn-1, zn) p(zn-1, zn) / p(X)                                         by Bayes rule
            = p(x1,..,xn-1|zn-1) p(xn|zn) p(xn+1,..,xN|zn) p(zn|zn-1) p(zn-1) / p(X)   by cond. ind. F
            = α(zn-1) p(xn|zn) p(zn|zn-1) β(zn) / p(X)
•   Thus we calculate ξ(zn-1,zn) directly by using results of the
    α and β recursions
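Combining the forward and backward sketches above, γ and ξ follow directly; p(X) can be read off as Σ_{zN} α(zN) (a sketch, reusing the array conventions from the earlier code):

```python
import numpy as np

def posteriors(alpha, beta, A, B):
    """gamma[n, k] = p(z_n = k | X);  xi[n, j, k] = p(z_n = j, z_{n+1} = k | X)."""
    pX = alpha[-1].sum()                       # p(X) = sum over z_N of alpha(z_N)
    gamma = alpha * beta / pX
    # xi(z_n, z_{n+1}) = alpha(z_n) p(x_{n+1}|z_{n+1}) p(z_{n+1}|z_n) beta(z_{n+1}) / p(X)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[1:] * beta[1:])[:, None, :]) / pX
    return gamma, xi
```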

Summary of EM to train HMM
Step 1: Initialization
•       Make an initial selection of parameters θ old
        where θ = (π, A,φ)
      1. π is a vector of K probabilities of the states for latent
         variable z1
      2. A is a K x K matrix of transition probabilities Aij
      3. φ are parameters of conditional distribution p(xn|zn)
•        A and π parameters are often initialized uniformly
•       Initialization of φ depends on form of distribution
      •      For Gaussian:
            •    parameters μk initialized by applying K-means to the data, Σk
                 corresponds to covariance matrix of cluster
Summary of EM to train HMM
Step 2: E Step
• Run both forward α recursion and
    backward β recursion
• Use results to evaluate γ(zn) and ξ(zn-1,zn)
    and the likelihood function
Step 3: M Step
• Use results of E step to find revised set of
    parameters θnew using M-step equations
Alternate between E and M
   until convergence of likelihood function
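Putting the pieces together, a schematic EM loop for discrete (multinomial) emissions, reusing the forward, backward, posteriors, and m_step_pi_A sketches from earlier slides (the tolerance and iteration count are arbitrary choices):

```python
import numpy as np

def fit_hmm_discrete(x, pi, A, phi, n_iter=25, tol=1e-6):
    """Schematic EM loop for an HMM with a discrete emission table (sketch).

    x : (N,) integer observations; phi : (K, V) emission table.
    """
    old_ll = -np.inf
    for _ in range(n_iter):
        B = phi[:, x].T                       # precompute B[n, k] = p(x_n | z_n = k)
        alpha = forward(pi, A, B)             # E step: alpha-beta recursions
        beta = backward(A, B)
        gamma, xi = posteriors(alpha, beta, A, B)
        pi, A = m_step_pi_A(gamma, xi)        # M step for pi and A
        for v in range(phi.shape[1]):         # multinomial emission update
            phi[:, v] = gamma[x == v].sum(axis=0)
        phi /= gamma.sum(axis=0)[:, None]
        ll = np.log(alpha[-1].sum())          # ln p(X | theta)
        if ll - old_ll < tol:                 # stop at likelihood convergence
            break
        old_ll = ll
    return pi, A, phi
```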
Values for p(xn|zn)
  • In recursion relations, observations enter
      through conditional distributions p(xn|zn)
  •   Recursions are independent of
        • Dimensionality of observed variables
        • Form of conditional distribution
             • So long as it can be computed for each of K
                possible states of zn
  • Since observed variables {xn} are fixed,
      the values p(xn|zn) can be pre-computed at
      the start of the EM algorithm
7(a) Sequence Length: Using Multiple Short Sequences

    • HMM can be trained effectively if length of
        sequence is sufficiently long
          • True of all maximum likelihood approaches
    • Alternatively we can use multiple short
        sequences
          • Requires straightforward modification of HMM-EM
             algorithm
    • Particularly important in left-to-right models
          • In a given observation sequence, a state transition
             corresponding to a non-diagonal element of A occurs
             only once

7(b). Predictive Distribution
  •   Observed data is X={ x1,..,xN}
  •   Wish to predict xN+1
  •   Application in financial forecasting
        p(xN+1|X) = Σ_{zN+1} p(xN+1, zN+1|X)
                  = Σ_{zN+1} p(xN+1|zN+1) p(zN+1|X)                          by Product Rule
                  = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1, zN|X)               by Sum Rule
                  = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) p(zN|X)          by conditional ind. H
                  = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) p(zN, X)/p(X)    by Bayes rule
                  = (1/p(X)) Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) α(zN)   by definition of α


  •   Can be evaluated by first running forward α recursion and
      summing over zN and zN+1
  •   Can be extended to subsequent predictions of xN+2, after xN+1 is
      observed, using a fixed amount of storage
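A sketch of this computation for a single candidate value of xN+1, reusing α from the forward recursion (B_next is an assumed vector of emission probabilities evaluated at the candidate observation):

```python
import numpy as np

def predictive(alpha_N, A, B_next):
    """p(x_{N+1} | X) for a candidate next observation (sketch).

    alpha_N : (K,) alpha(z_N) from the forward recursion
    B_next  : (K,) p(x_{N+1} | z_{N+1} = k) at the candidate x_{N+1}
    """
    pX = alpha_N.sum()                        # p(X)
    p_zN1 = (alpha_N @ A) / pX                # p(z_{N+1} | X): sum over z_N
    return B_next @ p_zN1                     # sum over z_{N+1}
```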
7(c). Sum-Product and HMM
•   HMM graph is a tree and hence sum-product algorithm can be
    used to find local marginals for hidden variables
    •   Equivalent to forward-backward algorithm
    •   Sum-product provides a simple way to derive alpha-beta
        recursion formulae
•   Transform directed graph to factor graph
    •   Each variable has a node, small squares represent factors,
        undirected links connect factor nodes to the variables they use
•   Joint distribution:
        p(x1,..,xN, z1,..,zN) = p(z1) [ Π_{n=2}^{N} p(zn|zn-1) ] Π_{n=1}^{N} p(xn|zn)

    [Figures: HMM graph and a fragment of the corresponding factor graph]
Deriving alpha-beta from Sum-Product
•   Begin with simplified form of factor graph, obtained by
    absorbing the emission probabilities into the transition
    probability factors
•   Factors are given by
        h(z1) = p(z1) p(x1|z1)
        fn(zn-1, zn) = p(zn|zn-1) p(xn|zn)
•   Messages propagated are
        μ_{zn-1→fn}(zn-1) = μ_{fn-1→zn-1}(zn-1)
        μ_{fn→zn}(zn) = Σ_{zn-1} fn(zn-1, zn) μ_{zn-1→fn}(zn-1)
•   Can show that this computes the α recursion
•   Similarly, starting with the root node, the β recursion is computed
•   So also γ and ξ are derived
•   Final results:
        α(zn) = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)
        β(zn) = Σ_{zn+1} β(zn+1) p(xn+1|zn+1) p(zn+1|zn)
        γ(zn) = α(zn) β(zn) / p(X)
        ξ(zn-1, zn) = α(zn-1) p(xn|zn) p(zn|zn-1) β(zn) / p(X)
7(d). Scaling Factors
• Implementation issue for small probabilities
• At each step of recursion
        α(zn) = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)
  • To obtain new value of α(zn) from previous value α(zn-1)
      we multiply by p(zn|zn-1) and p(xn|zn)
  •   These probabilities are small and products will underflow
  •   Logs don’t help since we have sums of products

• Solution is rescaling
  • of α(zn) and β(zn), whose values then remain close to unity
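A sketch of the rescaled forward pass: each α̂(zn) is normalized to sum to one, and the scaling factors cn multiply together to give p(X), so ln p(X) = Σn ln cn can be accumulated safely:

```python
import numpy as np

def forward_scaled(pi, A, B):
    """Rescaled alpha recursion (sketch). B[n, k] = p(x_n | z_n = k).

    alpha_hat[n] = p(z_n | x_1..x_n) stays close to unity; the scaling
    factors c[n] = p(x_n | x_1..x_{n-1}) give ln p(X) = sum_n ln c[n].
    """
    N, K = B.shape
    alpha_hat = np.zeros((N, K))
    c = np.zeros(N)
    for n in range(N):
        a = pi * B[0] if n == 0 else B[n] * (alpha_hat[n - 1] @ A)
        c[n] = a.sum()                     # normalizer: no underflow accumulates
        alpha_hat[n] = a / c[n]
    return alpha_hat, c, np.log(c).sum()   # third value is ln p(X)
```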
7(e). The Viterbi Algorithm
  • Finding most probable sequence of hidden
      states for a given sequence of observables
  •   In speech recognition: finding most probable
      phoneme sequence for a given series of
      acoustic observations
  •   Since graphical model of HMM is a tree, can be
      solved exactly using max-sum algorithm
        • Known as Viterbi algorithm in the context of HMM
        • Since max-sum works with log probabilities there is no
           need to work with re-scaled variables as with forward-
           backward
Viterbi Algorithm for HMM
  • Fragment of HMM lattice
      showing two paths
  •   Number of possible paths
      grows exponentially with
      length of chain
  •   Viterbi searches space of
      paths efficiently
        • Finds most probable path
           with computational cost
           linear with length of chain


Deriving Viterbi from Max-Sum
  • Start with simplified factor graph

  • Treat variable zN as root node, passing
      messages to root from leaf nodes
  •   Messages passed are
           μ_{zn→fn+1}(zn) = μ_{fn→zn}(zn)
           μ_{fn+1→zn+1}(zn+1) = max_{zn} { ln fn+1(zn, zn+1) + μ_{zn→fn+1}(zn) }
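A sketch of the resulting algorithm in log space (emission matrix B as in the earlier forward-backward sketches):

```python
import numpy as np

def viterbi(pi, A, B):
    """Most probable state path, computed with log probabilities (sketch).

    B[n, k] = p(x_n | z_n = k). Cost is O(K^2 N), linear in chain length.
    """
    N, K = B.shape
    logA = np.log(A)
    omega = np.log(pi) + np.log(B[0])          # omega(z_1)
    back = np.zeros((N, K), dtype=int)         # argmax pointers for backtracking
    for n in range(1, N):
        scores = omega[:, None] + logA         # scores[j, k]: come from j, go to k
        back[n] = scores.argmax(axis=0)
        omega = scores.max(axis=0) + np.log(B[n])
    path = np.zeros(N, dtype=int)
    path[-1] = omega.argmax()
    for n in range(N - 1, 0, -1):              # backtrack from the end of the chain
        path[n - 1] = back[n, path[n]]
    return path
```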




Other Topics on Sequential Data
• Sequential Data and Markov Models:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-
  MarkovModels.pdf
• Extensions of HMMs:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-
  HMMExtensions.pdf
• Linear Dynamical Systems:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-
  LinearDynamicalSystems.pdf
• Conditional Random Fields:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.5-
  ConditionalRandomFields.pdf


