Hidden Markov Models

                            Sargur N. Srihari
                            srihari@cedar.buffalo.edu
             Machine Learning Course:
  http://www.cedar.buffalo.edu/~srihari/CSE574/index.html

HMM Topics
  1.     What is an HMM?
  2.     State-space Representation
  3.     HMM Parameters
  4.     Generative View of HMM
  5.     Determining HMM Parameters Using EM
  6.     Forward-Backward or α−β algorithm
  7.     HMM Implementation Issues:
        a)    Length of Sequence
        b)    Predictive Distribution
        c)    Sum-Product Algorithm
        d)    Scaling Factors
        e)    Viterbi Algorithm
1. What is an HMM?
•    Ubiquitous tool for modeling time series data
•    Used in
      •   Almost all speech recognition systems
      •   Computational molecular biology
            •   Group amino acid sequences into proteins
      •   Handwritten word recognition
•    It is a tool for representing probability distributions over
     sequences of observations
•    HMM gets its name from two defining properties:
      •   Observation xt at time t was generated by some process whose state
          zt is hidden from the observer
      •   Assumes that the state zt depends only on the state zt-1 and is
          independent of all prior states (first order)
•    Example: z are phoneme sequences
              x are acoustic observations

Graphical Model of HMM
  • Has the graphical model shown below and
      latent variables are discrete




  • Joint distribution has the form:
        p(x_1,..,x_N, z_1,..,z_N) = p(z_1) [ ∏_{n=2}^{N} p(z_n | z_{n-1}) ] ∏_{n=1}^{N} p(x_n | z_n)
HMM Viewed as Mixture



•   A single time slice corresponds to a mixture distribution
    with component densities p(x|z)
    •   Each state of discrete variable z represents a different component
•   An extension of mixture model
    •   Choice of mixture component depends on the choice of mixture
        component for the previous observation
•   Latent variables are multinomial variables zn
    •   That describe component responsible for generating xn
•   Can use one-of-K coding scheme
2. State-Space Representation
• Probability distribution of zn depends on the state of the previous
  latent variable zn-1 through the probability distribution p(zn|zn-1)
• One-of-K coding
   • Latent variables are K-dimensional binary vectors,
     e.g., zn = (0, 1, .., 0) indicates that zn is in state 2
   • Ajk = p(znk=1 | zn-1,j=1),   with Σk Ajk = 1
• These are known as Transition Probabilities
   • Collected in a K x K matrix A with K(K-1) independent parameters
Transition Probabilities
Example with a 3-state latent variable (K=3), shown as a table of Ajk and as a
state transition diagram:

               zn=1    zn=2    zn=3
   zn-1=1      A11     A12     A13        A11+A12+A13=1
   zn-1=2      A21     A22     A23
   zn-1=3      A31     A32     A33

• The state transition diagram is not a graphical model, since its nodes are
  not separate variables but states of a single variable
Conditional Probabilities
•    Transition probabilities Ajk represent state-to-state
     probabilities for each variable
•    Conditional probabilities are variable-to-variable
     probabilities
      •   can be written in terms of transition probabilities as
           p(z_n | z_{n-1}, A) = ∏_{k=1}^{K} ∏_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{nk}}
      •   Note that exponent zn-1,j zn,k is a product that evaluates to 0 or 1
      •   Hence the overall product will evaluate to a single Ajk for each
          setting of values of zn and zn-1
            •   E.g., zn-1=2 and zn=3 will result in only zn-1,2=1 and zn,3=1. Thus
                p(zn=3|zn-1=2)=A23
•    A is a global HMM parameter
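As an illustration (not part of the original slides), the short NumPy sketch below encodes two successive states in 1-of-K form and verifies that the product over j and k collapses to the single entry Ajk selected by the two indicator vectors; the particular 3x3 matrix A is an arbitrary example.

```python
import numpy as np

# Example 3-state transition matrix; each row must sum to one.
A = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])
assert np.allclose(A.sum(axis=1), 1.0)

def one_of_K(state, K):
    """Return the 1-of-K binary coding of a (0-based) state index."""
    z = np.zeros(K)
    z[state] = 1.0
    return z

K = 3
z_prev = one_of_K(1, K)   # z_{n-1} in state 2 (index 1)
z_curr = one_of_K(2, K)   # z_n in state 3 (index 2)

# p(z_n | z_{n-1}, A) = prod_{j,k} A_{jk}^{z_{n-1,j} z_{nk}}
p = np.prod(A ** np.outer(z_prev, z_curr))
print(p, A[1, 2])         # both equal A_23 = 0.10
```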
Initial Variable Probabilities
  • Initial latent node z1 is a special case without
      parent node
  •   Represented by vector of probabilities π with
      elements πk=p(z1k=1) so that
       p(z_1 | π) = ∏_{k=1}^{K} π_k^{z_{1k}}      where Σ_k π_k = 1


  • Note that π is an HMM parameter
        • representing probabilities of each state for       the
           first variable

Lattice or Trellis Diagram
• State transition diagram unfolded over time




• Representation of latent variable states
• Each column corresponds to one of latent
  variables zn
Emission Probabilities p(xn|zn)
  • We have so far only specified p(zn|zn-1)                  by
      means of transition probabilities
  •   Need to specify probabilities p(xn|zn,φ) to
      complete the model, where φ are parameters
  •   These can be continuous or discrete
  •   Because xn is observed and zn is discrete
      p(xn|zn,φ) consists of a table of K numbers
      corresponding to K states of zn
        • Analogous to class-conditional probabilities
  • Can be represented as
        p(x_n | z_n, φ) = ∏_{k=1}^{K} p(x_n | φ_k)^{z_{nk}}
3. HMM Parameters


 •   We have defined three types of HMM parameters: θ =
     (π, A,φ)
       1. Initial Probabilities of first latent variable:
          π is a vector of K probabilities of the states for latent variable z1
       2. Transition Probabilities (State-to-state for any latent variable):
          A is a K x K matrix of transition probabilities Aij
       3. Emission Probabilities (Observations conditioned on latent):
           φ are parameters of the conditional distribution p(xn|zn)
 •    A and π parameters are often initialized uniformly
 •   Initialization of φ depends on form of distribution

Joint Distribution over Latent and
             Observed variables
• Joint can be expressed in terms of parameters:
     p(X, Z | θ) = p(z_1 | π) [ ∏_{n=2}^{N} p(z_n | z_{n-1}, A) ] ∏_{m=1}^{N} p(x_m | z_m, φ)

     where X = {x_1,..,x_N},  Z = {z_1,..,z_N},  θ = {π, A, φ}


• Most discussion of HMM is independent of
 emission probabilities
  • Tractable for discrete tables, Gaussian, GMMs
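To make the factorization concrete, here is a small NumPy sketch (mine, not the slides') that evaluates ln p(X, Z|θ) for an HMM with discrete observations; the parameter values, the emission table B, and the two sequences are arbitrary illustrative choices.

```python
import numpy as np

K, M = 3, 2                       # K latent states, M discrete observation symbols
pi = np.array([0.6, 0.3, 0.1])    # initial state probabilities
A = np.array([[0.8, 0.1, 0.1],    # transition probabilities A[j, k] = p(z_n=k | z_{n-1}=j)
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
B = np.array([[0.9, 0.1],         # emission table B[k, m] = p(x_n=m | z_n=k)
              [0.5, 0.5],
              [0.1, 0.9]])

Z = [0, 0, 1, 2]                  # a latent state sequence z_1..z_N (0-based indices)
X = [0, 1, 1, 1]                  # an observation sequence x_1..x_N

# ln p(X,Z|theta) = ln p(z_1|pi) + sum_n ln p(z_n|z_{n-1},A) + sum_n ln p(x_n|z_n,phi)
logp = np.log(pi[Z[0]])
logp += sum(np.log(A[Z[n - 1], Z[n]]) for n in range(1, len(Z)))
logp += sum(np.log(B[Z[n], X[n]]) for n in range(len(Z)))
print("ln p(X,Z|theta) =", logp)
```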
4. Generative View of HMM
•   Sampling from an HMM
•   HMM with 3-state latent variable z
    •   Gaussian emission model p(x|z)
    •   Contours of constant density of emission
        distribution shown for each state
    •   Two-dimensional x


•   50 data points generated from HMM
•   Lines show successive observations

•   Transition probabilities fixed so that
    •   5% probability of making a transition to each of the other states
    •   90% probability of remaining in the same state

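The sampling procedure itself takes only a few lines. The sketch below is an illustrative ancestral-sampling example with two-dimensional Gaussian emissions, roughly matching the description above (90% self-transition, 5% to each other state); the means, covariance, and seed are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 50

pi = np.full(K, 1.0 / K)                          # uniform initial distribution
A = np.full((K, K), 0.05) + 0.85 * np.eye(K)      # 90% self-transition, 5% to each other state
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])   # illustrative Gaussian emission means
cov = 0.5 * np.eye(2)                             # shared covariance, for simplicity

z = np.empty(N, dtype=int)
x = np.empty((N, 2))
z[0] = rng.choice(K, p=pi)                        # sample z_1 ~ p(z_1 | pi)
x[0] = rng.multivariate_normal(means[z[0]], cov)  # emit x_1 ~ p(x | z_1)
for n in range(1, N):
    z[n] = rng.choice(K, p=A[z[n - 1]])           # z_n ~ p(z_n | z_{n-1})
    x[n] = rng.multivariate_normal(means[z[n]], cov)

print(z[:10])
print(x[:3])
```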
Left-to-Right HMM
  • Setting elements of Ajk=0       if k < j




  • Corresponding lattice diagram



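One simple way to obtain such a structure, sketched below with arbitrary random values, is to zero out the entries with k < j and renormalize each row.

```python
import numpy as np

K = 4
rng = np.random.default_rng(1)
A = rng.random((K, K))
A = np.triu(A)                      # enforce A_jk = 0 for k < j (left-to-right)
A /= A.sum(axis=1, keepdims=True)   # renormalize each row to sum to one
print(np.round(A, 2))               # last state becomes absorbing
```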
Left-to-Right Applied Generatively to Digits
•    Examples of on-line handwritten 2’s
•    x is a sequence of pen coordinates
•    There are 16 states for z, or K=16
•    Each state can generate a line segment of fixed
     length in one of 16 angles
      •   Emission probabilities: 16 x 16 table
•    Transition probabilities set to zero except for those
     that keep state index k the same or increment by one
•    Parameters optimized by 25 EM iterations
•    Trained on 45 digits
•    Generative model is quite poor
      •   Since the generated samples don't look like the training examples
      •   If classification is the goal, can do better by using a
          discriminative HMM
     [Figures: training set and generated set of handwritten 2's]
5. Determining HMM Parameters
• Given data set X={x1,..,xN} we can determine
     HMM parameters θ ={π,A,φ} using maximum
     likelihood
•   Likelihood function obtained from joint
    distribution by marginalizing over latent
    variables Z={z1,..,zN}
                     p(X|θ) = ΣZ p(X,Z|θ)

Computational issues for Parameters
                     p(X|θ) = ΣZ p(X,Z|θ)
• Computational difficulties
    • Joint distribution p(X,Z|θ) does not factorize over n,
           in contrast to mixture model
    • The sum over Z has an exponential number of terms, corresponding to
        the paths through the trellis
• Solution
    • Use conditional independence properties to reorder
        summations
    •   Use EM instead to maximize log-likelihood function of joint
        ln p(X,Z|θ)
          Efficient framework for maximizing the likelihood function in HMMs

EM for MLE in HMM
1. Start with initial selection for model parameters θold
2. In E step take these parameter values and find
   posterior distribution of latent variables p(Z|X,θ old)
   Use this posterior distribution to evaluate
   expectation of the logarithm of the complete-data
   likelihood function ln p(X,Z|θ)
         Which can be written as
                Q(θ, θ old) = ∑_Z p(Z | X, θ old) ln p(X, Z | θ)

       the factor p(Z|X,θ old) is independent of θ and is evaluated using θ old
3. In M-Step maximize Q w.r.t. θ
Expansion of Q
  • Introduce notation
        γ(zn) = p(zn|X,θ old): Marginal posterior distribution of
          latent variable zn
        ξ(zn-1,zn) = p(zn-1,zn|X,θ old): Joint posterior of two
          successive latent variables
  • We will be re-expressing Q in terms of γ and ξ
        Q(θ, θ old) = ∑_Z p(Z | X, θ old) ln p(X, Z | θ)
Detail of γ and ξ
  For each value of n we can store
        γ(zn) using K non-negative numbers that sum to unity
        ξ(zn-1,zn) using a K x K matrix whose elements also sum
          to unity
  • Using           notation
        γ (znk) denotes conditional probability of znk=1
        Similar notation for ξ(zn-1,j,znk)

  • Because the expectation of a binary random
      variable is the probability that it takes value 1
        γ(z_{nk}) = E[z_{nk}] = ∑_{z_n} γ(z_n) z_{nk}
        ξ(z_{n-1,j}, z_{nk}) = E[z_{n-1,j} z_{nk}] = ∑_{z_{n-1}, z_n} ξ(z_{n-1}, z_n) z_{n-1,j} z_{nk}
Expansion of Q
  • We begin with
        Q(θ, θ old) = ∑_Z p(Z | X, θ old) ln p(X, Z | θ)
  • Substitute
        p(X, Z | θ) = p(z_1 | π) [ ∏_{n=2}^{N} p(z_n | z_{n-1}, A) ] ∏_{m=1}^{N} p(x_m | z_m, φ)
  • And use the definitions of γ and ξ to get:
        Q(θ, θ old) = ∑_{k=1}^{K} γ(z_{1k}) ln π_k
                      + ∑_{n=2}^{N} ∑_{j=1}^{K} ∑_{k=1}^{K} ξ(z_{n-1,j}, z_{nk}) ln A_{jk}
                      + ∑_{n=1}^{N} ∑_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)
E-Step
        Q(θ, θ old) = ∑_{k=1}^{K} γ(z_{1k}) ln π_k
                      + ∑_{n=2}^{N} ∑_{j=1}^{K} ∑_{k=1}^{K} ξ(z_{n-1,j}, z_{nk}) ln A_{jk}
                      + ∑_{n=1}^{N} ∑_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)

• Goal of E step is to evaluate γ(zn) and ξ(zn-1,zn)
  efficiently (Forward-Backward Algorithm)
M-Step
  • Maximize Q(θ,θ old) with respect to parameters
      θ ={π,A,φ}
       • Treat γ(zn) and ξ(zn-1,zn) as constant
  • Maximization w.r.t. π and A
        • easily achieved (using Lagrange multipliers)

              π_k = γ(z_{1k}) / ∑_{j=1}^{K} γ(z_{1j})

              A_{jk} = ∑_{n=2}^{N} ξ(z_{n-1,j}, z_{nk}) / ∑_{l=1}^{K} ∑_{n=2}^{N} ξ(z_{n-1,j}, z_{nl})

  • Maximization w.r.t. φ_k
        • Only the last term of Q depends on φ_k:  ∑_{n=1}^{N} ∑_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)
        • Same form as in a mixture distribution for i.i.d. data
M-step for Gaussian emission
  • Maximization of Q(θ,θ old) w.r.t. φ_k
  • Gaussian Emission Densities
      p(x|φ_k) ~ N(x | μ_k, Σ_k)
  • Solution:

        μ_k = ∑_{n=1}^{N} γ(z_{nk}) x_n / ∑_{n=1}^{N} γ(z_{nk})

        Σ_k = ∑_{n=1}^{N} γ(z_{nk}) (x_n − μ_k)(x_n − μ_k)^T / ∑_{n=1}^{N} γ(z_{nk})
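Assuming the E step has already produced arrays gamma (N x K) and xi (N-1 x K x K) for a single sequence, the update equations above translate directly into NumPy. This is a sketch with variable names of my own choosing, not code from the course.

```python
import numpy as np

def m_step_gaussian(X, gamma, xi):
    """M-step updates given responsibilities.
    X: (N, D) observations, gamma: (N, K) marginals gamma(z_nk),
    xi: (N-1, K, K) pairwise marginals xi(z_{n-1,j}, z_nk)."""
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                         # sum_n gamma(z_nk)

    pi = gamma[0] / gamma[0].sum()                 # pi_k = gamma(z_1k) / sum_j gamma(z_1j)
    A = xi.sum(axis=0)                             # sum_{n=2}^N xi(z_{n-1,j}, z_nk)
    A /= A.sum(axis=1, keepdims=True)              # normalize over k for each j

    mu = (gamma.T @ X) / Nk[:, None]               # mu_k = sum_n gamma(z_nk) x_n / Nk
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, A, mu, Sigma

# Tiny usage example with random (normalized) responsibilities:
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
gamma = rng.random((6, 3)); gamma /= gamma.sum(axis=1, keepdims=True)
xi = rng.random((5, 3, 3));  xi /= xi.sum(axis=(1, 2), keepdims=True)
print(m_step_gaussian(X, gamma, xi)[0])
```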
M-Step for Multinomial Observed
  • Conditional Distribution of Observations
      have the form
        p(x | z) = ∏_{i=1}^{D} ∏_{k=1}^{K} μ_{ik}^{x_i z_k}

  • M-Step equations:

        μ_{ik} = ∑_{n=1}^{N} γ(z_{nk}) x_{ni} / ∑_{n=1}^{N} γ(z_{nk})

  • Analogous result holds for Bernoulli
      observed variables
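A corresponding sketch for the multinomial case, assuming the observations are supplied in one-hot form and using names of my own choosing:

```python
import numpy as np

def m_step_multinomial(X_onehot, gamma):
    """mu_ik = sum_n gamma(z_nk) x_ni / sum_n gamma(z_nk).
    X_onehot: (N, D) one-hot observations, gamma: (N, K) marginals.
    Returns a (K, D) array whose entry [k, i] is mu_ik in the slide's notation."""
    mu = gamma.T @ X_onehot                     # numerator: sum_n gamma(z_nk) x_ni
    return mu / gamma.sum(axis=0)[:, None]      # divide by sum_n gamma(z_nk)
```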
6. Forward-Backward Algorithm
  • E step: efficient procedure to evaluate
      γ(zn) and ξ(zn-1,zn)
  • Graph of the HMM is a tree
        • Implies that posterior distribution of latent
           variables can be obtained efficiently using
           message passing algorithm
  • In HMM it is called forward-backward
      algorithm or Baum-Welch Algorithm
  •   Several variants lead to exact marginals
        • Method called alpha-beta discussed here
Derivation of Forward-Backward
• Several conditional-independences (A-H) hold
    A. p(X|zn) = p(x1,..xn|zn) p(xn+1,..xN|zn)
•   Proved using d-separation:
        Every path from x1,..,xn-1 to xn passes through zn, which is conditioned on.
        The path is head-to-tail. Thus (x1,..xn-1) _||_ xn | zn
       Similarly (x1,..xn-1,xn) _||_ xn+1,..xN |zn

Conditional independence B
  • Since                   (x1,..xn-1) _||_ xn | zn   we have

  B. p(x1,..xn-1|xn,zn)= p(x1,..xn-1|zn)




Conditional independence C
  • Since              (x1,..xn-1) _||_ zn | zn-1


      C. p(x1,..xn-1|zn-1,zn)= p ( x1,..xn-1|zn-1 )




Conditional independence D
  • Since              (xn+1,..xN) _||_ zn | zn+1


      D. p(xn+1,..xN|zn,zn+1) = p(xn+1,..xN|zn+1)


Conditional independence E
  • Since              (xn+2,..xN) _||_ xn+1 | zn+1


      E. p(xn+2,..xN|zn+1,xn+1)= p ( xn+2,..xN|zn+1 )


Conditional independence F

  F. p(X|zn-1,zn) = p(x1,..xn-1|zn-1) p(xn|zn) p(xn+1,..xN|zn)


Conditional independence G
      Since            (x1,..xN) _||_ xN+1 | zN+1

  G.        p(xN+1|X,zN+1)=p(xN+1|zN+1 )



Conditional independence H

  H.         p(zN+1|zN , X)= p ( zN+1|zN )



Evaluation of γ(zn)
  • Recall that this is to efficiently compute
      the E step of estimating parameters of
      HMM
        γ (zn) = p(zn|X,θold): Marginal posterior
           distribution of latent variable zn
  • We are interested in finding posterior
      distribution p(zn|x1,..xN)
  •   This is a vector of length K whose entries
      correspond to expected values of znk

Introducing α and β
• Using Bayes theorem
        γ(z_n) = p(z_n | X) = p(X | z_n) p(z_n) / p(X)

• Using conditional independence A
        γ(z_n) = p(x_1,..,x_n | z_n) p(x_{n+1},..,x_N | z_n) p(z_n) / p(X)
               = p(x_1,..,x_n, z_n) p(x_{n+1},..,x_N | z_n) / p(X)
               = α(z_n) β(z_n) / p(X)

• where   α(z_n) = p(x_1,..,x_n, z_n)
  which is the probability of observing all the
  given data up to time n and the value of z_n

          β(z_n) = p(x_{n+1},..,x_N | z_n)
  which is the conditional probability of all the
  future data from time n+1 up to N given
  the value of z_n
Recursion Relation for α
α(z_n) = p(x_1,..,x_n, z_n)
       = p(x_1,..,x_n | z_n) p(z_n)                                                         by Bayes rule
       = p(x_n | z_n) p(x_1,..,x_{n-1} | z_n) p(z_n)                                        by conditional independence B
       = p(x_n | z_n) p(x_1,..,x_{n-1}, z_n)                                                by Bayes rule
       = p(x_n | z_n) ∑_{z_{n-1}} p(x_1,..,x_{n-1}, z_{n-1}, z_n)                           by Sum Rule
       = p(x_n | z_n) ∑_{z_{n-1}} p(x_1,..,x_{n-1}, z_n | z_{n-1}) p(z_{n-1})               by Bayes rule
       = p(x_n | z_n) ∑_{z_{n-1}} p(x_1,..,x_{n-1} | z_{n-1}) p(z_n | z_{n-1}) p(z_{n-1})   by cond. ind. C
       = p(x_n | z_n) ∑_{z_{n-1}} p(x_1,..,x_{n-1}, z_{n-1}) p(z_n | z_{n-1})               by Bayes rule
       = p(x_n | z_n) ∑_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})                               by definition of α
Forward Recursion for α Evaluation
 •    Recursion Relation is
        α(z_n) = p(x_n | z_n) ∑_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})

 •    There are K terms in the summation
       •   Has to be evaluated for each of K values of z_n
       •   Each step of recursion is O(K^2)
 •    Initial condition is
        α(z_1) = p(x_1, z_1) = p(z_1) p(x_1 | z_1) = ∏_{k=1}^{K} {π_k p(x_1 | φ_k)}^{z_{1k}}

 •    Overall cost for the chain is O(K^2 N)
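A direct, unscaled implementation of the forward recursion for discrete emissions is sketched below; the emission table B (K x M) and the toy parameters are assumptions of the sketch, and in practice the scaled version of Section 7(d) would be used.

```python
import numpy as np

def forward(pi, A, B, X):
    """Unscaled forward recursion.
    pi: (K,), A: (K, K), B: (K, M) emission table, X: observation indices.
    alpha[n] corresponds to alpha(z_{n+1}) in the slides' 1-based notation."""
    N, K = len(X), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[:, X[0]]                        # alpha(z_1) = p(z_1) p(x_1 | z_1)
    for n in range(1, N):
        # alpha(z_n) = p(x_n|z_n) * sum_{z_{n-1}} alpha(z_{n-1}) p(z_n|z_{n-1})
        alpha[n] = B[:, X[n]] * (alpha[n - 1] @ A)    # each step is O(K^2)
    return alpha

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
X = [0, 1, 1, 0]
alpha = forward(pi, A, B, X)
print("p(X) =", alpha[-1].sum())                      # likelihood = sum_{z_N} alpha(z_N)
```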
Recursion Relation for β
β(z_n) = p(x_{n+1},..,x_N | z_n)
       = ∑_{z_{n+1}} p(x_{n+1},..,x_N, z_{n+1} | z_n)                                    by Sum Rule
       = ∑_{z_{n+1}} p(x_{n+1},..,x_N | z_n, z_{n+1}) p(z_{n+1} | z_n)                   by Bayes rule
       = ∑_{z_{n+1}} p(x_{n+1},..,x_N | z_{n+1}) p(z_{n+1} | z_n)                        by cond. ind. D
       = ∑_{z_{n+1}} p(x_{n+2},..,x_N | z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)   by cond. ind. E
       = ∑_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)                    by definition of β
Backward Recursion for β
•   Backward message passing
        β(z_n) = ∑_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)

•   Evaluates β(z_n) in terms of β(z_{n+1})

•   Starting condition for recursion is
        p(z_N | X) = p(X, z_N) β(z_N) / p(X)
•   Is correct provided we set β(z_N) = 1
    for all settings of z_N
    •   This is the initial condition for backward
        computation
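The corresponding backward pass, again as an unscaled sketch under the same assumed discrete-emission setup as the forward sketch above:

```python
import numpy as np

def backward(A, B, X):
    """Unscaled backward recursion; beta[n] corresponds to beta(z_{n+1})
    in the slides' 1-based notation. A: (K, K), B: (K, M), X: observation indices."""
    N, K = len(X), A.shape[0]
    beta = np.ones((N, K))                            # beta(z_N) = 1 for all states
    for n in range(N - 2, -1, -1):
        # beta(z_n) = sum_{z_{n+1}} beta(z_{n+1}) p(x_{n+1}|z_{n+1}) p(z_{n+1}|z_n)
        beta[n] = A @ (B[:, X[n + 1]] * beta[n + 1])
    return beta

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
X = [0, 1, 1, 0]
beta = backward(A, B, X)
print("p(X) =", (pi * B[:, X[0]] * beta[0]).sum())    # agrees with the forward result
```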
M step Equations
  • In the M-step equations p(X) will cancel out, e.g.,

        μ_k = ∑_{n=1}^{N} γ(z_{nk}) x_n / ∑_{n=1}^{N} γ(z_{nk})

  • where p(X) can be evaluated as

        p(X) = ∑_{z_n} α(z_n) β(z_n)
Evaluation of Quantities ξ(zn-1,zn)
•   They correspond to the values of the conditional
    probabilities p(zn-1,zn|X) for each of the K x K settings for
    (zn-1,zn)
ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X)                                                                    by definition
                = p(X | z_{n-1}, z_n) p(z_{n-1}, z_n) / p(X)                                             by Bayes Rule
                = p(x_1,..,x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1},..,x_N | z_n) p(z_n | z_{n-1}) p(z_{n-1}) / p(X)   by cond. ind. F
                = α(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β(z_n) / p(X)

•   Thus we calculate ξ(zn-1,zn) directly by using the results of the
    α and β recursions
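Given alpha and beta arrays from recursions like the two sketches above, γ and ξ follow directly; the 0-based array layout and the discrete-emission table B are assumptions of this sketch.

```python
import numpy as np

def posteriors(alpha, beta, A, B, X):
    """gamma[n, k] = p(z_{n+1}=k | X);  xi[n, j, k] = p(z_{n+1}=j, z_{n+2}=k | X)."""
    pX = alpha[-1].sum()                               # p(X) = sum_{z_N} alpha(z_N)
    gamma = alpha * beta / pX                          # gamma(z_n) = alpha(z_n) beta(z_n) / p(X)
    N, K = alpha.shape
    xi = np.zeros((N - 1, K, K))
    for n in range(N - 1):
        # xi(z_n, z_{n+1}) = alpha(z_n) p(x_{n+1}|z_{n+1}) p(z_{n+1}|z_n) beta(z_{n+1}) / p(X)
        xi[n] = alpha[n][:, None] * A * (B[:, X[n + 1]] * beta[n + 1])[None, :] / pX
    return gamma, xi
```

This function is intended to be combined with the forward and backward sketches above; each gamma row and each xi slice sums to one, which is a useful sanity check.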
Summary of EM to train HMM
Step 1: Initialization
•       Make an initial selection of parameters θ old
        where θ = (π, A,φ)
      1. π is a vector of K probabilities of the states for latent
         variable z1
      2. A is a K x K matrix of transition probabilities Aij
      3. φ are parameters of the conditional distribution p(xn|zn)
•        A and π parameters are often initialized uniformly
•       Initialization of φ depends on form of distribution
      •      For Gaussian:
            •    parameters μk initialized by applying K-means to the data, Σk
                 corresponds to covariance matrix of cluster
Summary of EM to train HMM
Step 2: E Step
• Run both forward α recursion and
    backward β recursion
• Use results to evaluate γ(zn) and ξ(zn-1,zn)
    and the likelihood function
Step 3: M Step
• Use results of E step to find revised set of
    parameters θnew using M-step equations
Alternate between E and M
   until convergence of likelihood function
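Putting the steps together, the following compact, self-contained sketch runs unscaled EM for an HMM with discrete (categorical) emissions. It only mirrors the E/M alternation summarized above; the toy data and all names are my own, and a practical implementation would add scaling (Section 7(d)), a convergence check on the likelihood, and multiple sequences (Section 7(a)).

```python
import numpy as np

def hmm_em(X, K, M, iters=25, seed=0):
    """EM for a discrete-emission HMM. X: observation indices in {0..M-1}."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    N = len(X)
    pi = np.full(K, 1.0 / K)                                     # uniform initialization
    A = np.full((K, K), 1.0 / K)
    B = rng.random((K, M)); B /= B.sum(axis=1, keepdims=True)    # random emission init

    for _ in range(iters):
        # E step: unscaled forward-backward, then gamma and xi
        alpha = np.zeros((N, K)); beta = np.ones((N, K))
        alpha[0] = pi * B[:, X[0]]
        for n in range(1, N):
            alpha[n] = B[:, X[n]] * (alpha[n - 1] @ A)
        for n in range(N - 2, -1, -1):
            beta[n] = A @ (B[:, X[n + 1]] * beta[n + 1])
        pX = alpha[-1].sum()
        gamma = alpha * beta / pX
        xi = np.array([alpha[n][:, None] * A * (B[:, X[n + 1]] * beta[n + 1])[None, :] / pX
                       for n in range(N - 1)])

        # M step: closed-form updates from gamma and xi
        pi = gamma[0] / gamma[0].sum()
        A = xi.sum(axis=0); A /= A.sum(axis=1, keepdims=True)
        for m in range(M):
            B[:, m] = gamma[X == m].sum(axis=0)
        B /= B.sum(axis=1, keepdims=True)
    return pi, A, B, np.log(pX)                                  # log-likelihood from last E step

pi, A, B, ll = hmm_em([0, 1, 1, 0, 0, 1, 1, 1, 0, 0], K=2, M=2)
print(np.round(A, 2), ll)
```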
Values for p(xn|zn)
  • In recursion relations, observations enter
      through conditional distributions p(xn|zn)
  •   Recursions are independent of
        • Dimensionality of observed variables
        • Form of conditional distribution
             • So long as it can be computed for each of K
                possible states of zn
  • Since observed variables {xn} are fixed
      they can be pre-computed at the start of
      the EM algorithm
7(a) Sequence Length: Using Multiple Short Sequences

    • HMM can be trained effectively if length of
        sequence is sufficiently long
          • True of all maximum likelihood approaches
    • Alternatively we can use multiple short
        sequences
          • Requires straightforward modification of HMM-EM
             algorithm
    • Particularly important in left-to-right models
          • In a given observation sequence, a state transition
             corresponding to a non-diagonal element of A occurs
             at most once

7(b). Predictive Distribution
  •   Observed data is X={ x1,..,xN}
  •   Wish to predict xN+1
  •   Application in financial forecasting
        p(x_{N+1} | X) = ∑_{z_{N+1}} p(x_{N+1}, z_{N+1} | X)
                       = ∑_{z_{N+1}} p(x_{N+1} | z_{N+1}) p(z_{N+1} | X)                                by Product Rule
                       = ∑_{z_{N+1}} p(x_{N+1} | z_{N+1}) ∑_{z_N} p(z_{N+1}, z_N | X)                   by Sum Rule
                       = ∑_{z_{N+1}} p(x_{N+1} | z_{N+1}) ∑_{z_N} p(z_{N+1} | z_N) p(z_N | X)           by conditional ind. H
                       = ∑_{z_{N+1}} p(x_{N+1} | z_{N+1}) ∑_{z_N} p(z_{N+1} | z_N) p(z_N, X) / p(X)     by Bayes rule
                       = (1 / p(X)) ∑_{z_{N+1}} p(x_{N+1} | z_{N+1}) ∑_{z_N} p(z_{N+1} | z_N) α(z_N)    by definition of α

  •   Can be evaluated by first running the forward α recursion and then
      summing over z_N and z_{N+1}
  •   Can be extended to subsequent predictions of x_{N+2}, after x_{N+1} is
      observed, using a fixed amount of storage
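The last line of the derivation needs only α(z_N); below is a sketch for the discrete-emission case (same assumed table layout and toy parameters as in the earlier sketches).

```python
import numpy as np

def predictive(pi, A, B, X):
    """p(x_{N+1} = m | X) for each symbol m, via the forward recursion."""
    alpha = pi * B[:, X[0]]
    for n in range(1, len(X)):
        alpha = B[:, X[n]] * (alpha @ A)          # alpha(z_n), unscaled
    pX = alpha.sum()
    p_zN1 = (alpha @ A) / pX                      # p(z_{N+1} | X) = sum_{z_N} p(z_{N+1}|z_N) alpha(z_N) / p(X)
    return p_zN1 @ B                              # p(x_{N+1}|X) = sum_{z_{N+1}} p(x_{N+1}|z_{N+1}) p(z_{N+1}|X)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(predictive(pi, A, B, [0, 0, 1]))            # distribution over the next symbol; sums to one
```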
7(c). Sum-Product and HMM
•   HMM graph is a tree and hence the sum-product algorithm can be
    used to find local marginals for hidden variables
    •   Equivalent to the forward-backward algorithm
    •   Sum-product provides a simple way to derive the alpha-beta
        recursion formulae
•   Joint distribution:
        p(x_1,..,x_N, z_1,..,z_N) = p(z_1) [ ∏_{n=2}^{N} p(z_n | z_{n-1}) ] ∏_{n=1}^{N} p(x_n | z_n)
•   Transform the directed graph to a factor graph
    •   Each variable has a node, small squares represent factors,
        undirected links connect factor nodes to the variables they use
Deriving alpha-beta from Sum-Product
•   Begin with a simplified form of the factor graph
    (emission probabilities absorbed into the transition probability factors)
•   Factors are given by
        h(z_1) = p(z_1) p(x_1 | z_1)
        f_n(z_{n-1}, z_n) = p(z_n | z_{n-1}) p(x_n | z_n)

•   Messages propagated are
        μ_{z_{n-1} → f_n}(z_{n-1}) = μ_{f_{n-1} → z_{n-1}}(z_{n-1})
        μ_{f_n → z_n}(z_n) = ∑_{z_{n-1}} f_n(z_{n-1}, z_n) μ_{z_{n-1} → f_n}(z_{n-1})

•   Can show that the α recursion is computed:
        α(z_n) = p(x_n | z_n) ∑_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
•   Similarly, starting with the root node, the β recursion is computed:
        β(z_n) = ∑_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
•   So also γ and ξ are derived:
        γ(z_n) = α(z_n) β(z_n) / p(X)
        ξ(z_{n-1}, z_n) = α(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β(z_n) / p(X)
7(d). Scaling Factors
• Implementation issue for small probabilities
• At each step of recursion
                                α(z_n) = p(x_n | z_n) ∑_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
  • To obtain new value of α(zn) from previous value α(zn-1)
      we multiply p(zn|zn-1) and p(xn|zn)
  •   These probabilities are small and products will underflow
  •   Logs don’t help since we have sums of products



• Solution is rescaling
  • of α(zn) and β(zn) whose values remain close to unity
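One standard rescaling works with the normalized quantities α̂(z_n) = p(z_n | x_1,..,x_n) and per-step scaling factors c_n = p(x_n | x_1,..,x_{n-1}), so that ln p(X) = ∑_n ln c_n; the slide does not spell out the scheme, so this particular choice (and the discrete-emission setup) is an assumption of the sketch below.

```python
import numpy as np

def forward_scaled(pi, A, B, X):
    """Scaled forward recursion: alpha_hat[n] = p(z_{n+1} | x_1..x_{n+1}),
    c[n] = p(x_{n+1} | x_1..x_n). Returns alpha_hat, c and ln p(X) = sum_n ln c[n]."""
    N, K = len(X), len(pi)
    alpha_hat = np.zeros((N, K))
    c = np.zeros(N)
    a = pi * B[:, X[0]]
    c[0] = a.sum(); alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = B[:, X[n]] * (alpha_hat[n - 1] @ A)   # unnormalized update stays O(1) in magnitude
        c[n] = a.sum(); alpha_hat[n] = a / c[n]
    return alpha_hat, c, np.log(c).sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
_, _, loglik = forward_scaled(pi, A, B, [0, 1, 1, 0] * 100)   # 400 observations: no underflow
print(loglik)
```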
7(e). The Viterbi Algorithm
  • Finding most probable sequence of hidden
      states for a given sequence of observables
  •   In speech recognition: finding most probable
      phoneme sequence for a given series of
      acoustic observations
  •   Since graphical model of HMM is a tree, can be
      solved exactly using max-sum algorithm
        • Known as Viterbi algorithm in the context of HMM
        • Since max-sum works with log probabilities there is no need
           to work with re-scaled variables as in the forward-
           backward algorithm
Viterbi Algorithm for HMM
  • Fragment of HMM lattice
      showing two paths
  •   Number of possible paths
      grows exponentially with
      length of chain
  •   Viterbi searches space of
      paths efficiently
        • Finds most probable path
           with computational cost
           linear with length of chain


Deriving Viterbi from Max-Sum
  • Start with simplified factor graph

  • Treat variable zN as root node, passing
      messages to root from leaf nodes
  •   Messages passed are
            μ_{z_n → f_{n+1}}(z_n) = μ_{f_n → z_n}(z_n)
            μ_{f_{n+1} → z_{n+1}}(z_{n+1}) = max_{z_n} { ln f_{n+1}(z_n, z_{n+1}) + μ_{z_n → f_{n+1}}(z_n) }
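A log-space implementation with backpointers recovers the most probable path; the discrete-emission table, the toy parameters, and all variable names are assumptions of this sketch.

```python
import numpy as np

def viterbi(pi, A, B, X):
    """Most probable latent state sequence (0-based indices) by max-sum in log space."""
    N, K = len(X), len(pi)
    logA, logB = np.log(A), np.log(B)
    omega = np.zeros((N, K))
    back = np.zeros((N, K), dtype=int)
    omega[0] = np.log(pi) + logB[:, X[0]]
    for n in range(1, N):
        # omega(z_n) = ln p(x_n|z_n) + max_{z_{n-1}} [ ln p(z_n|z_{n-1}) + omega(z_{n-1}) ]
        scores = omega[n - 1][:, None] + logA            # scores[j, k]
        back[n] = scores.argmax(axis=0)                  # best predecessor j for each state k
        omega[n] = logB[:, X[n]] + scores.max(axis=0)
    # Backtrack from the best final state
    path = np.zeros(N, dtype=int)
    path[-1] = omega[-1].argmax()
    for n in range(N - 2, -1, -1):
        path[n] = back[n + 1, path[n + 1]]
    return path, omega[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 0, 1, 1, 1, 0]))             # most probable path and its log-probability
```

The cost is linear in the chain length and O(K^2) per step, matching the discussion above.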
Other Topics on Sequential Data
• Sequential Data and Markov Models:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-
  MarkovModels.pdf
• Extensions of HMMs:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-
  HMMExtensions.pdf
• Linear Dynamical Systems:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-
  LinearDynamicalSystems.pdf
• Conditional Random Fields:
  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.5-
  ConditionalRandomFields.pdf


Machine Learning: CSE 574                                       54

More Related Content

What's hot

Refresher probabilities-statistics
Refresher probabilities-statisticsRefresher probabilities-statistics
Refresher probabilities-statisticsSteve Nouri
 
R exam (B) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013
R exam (B) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013R exam (B) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013
R exam (B) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013Christian Robert
 
Dynamic programming - fundamentals review
Dynamic programming - fundamentals reviewDynamic programming - fundamentals review
Dynamic programming - fundamentals reviewElifTech
 
Cheatsheet unsupervised-learning
Cheatsheet unsupervised-learningCheatsheet unsupervised-learning
Cheatsheet unsupervised-learningSteve Nouri
 
Undecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWER
Undecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWERUndecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWER
Undecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWERmuthukrishnavinayaga
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi DivergenceMasahiro Suzuki
 
Graph representation of DFA’s Da
Graph representation of DFA’s DaGraph representation of DFA’s Da
Graph representation of DFA’s Daparmeet834
 
Probability Density Function (PDF)
Probability Density Function (PDF)Probability Density Function (PDF)
Probability Density Function (PDF)AakankshaR
 
Convex optimization methods
Convex optimization methodsConvex optimization methods
Convex optimization methodsDong Guo
 
Generating Chebychev Chaotic Sequence
Generating Chebychev Chaotic SequenceGenerating Chebychev Chaotic Sequence
Generating Chebychev Chaotic SequenceCheng-An Yang
 
Refresher algebra-calculus
Refresher algebra-calculusRefresher algebra-calculus
Refresher algebra-calculusSteve Nouri
 
Sampling and Markov Chain Monte Carlo Techniques
Sampling and Markov Chain Monte Carlo TechniquesSampling and Markov Chain Monte Carlo Techniques
Sampling and Markov Chain Monte Carlo TechniquesTomasz Kusmierczyk
 
Stochastic Differentiation
Stochastic DifferentiationStochastic Differentiation
Stochastic DifferentiationSSA KPI
 

What's hot (20)

Refresher probabilities-statistics
Refresher probabilities-statisticsRefresher probabilities-statistics
Refresher probabilities-statistics
 
Mcqmc talk
Mcqmc talkMcqmc talk
Mcqmc talk
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
R exam (B) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013
R exam (B) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013R exam (B) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013
R exam (B) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Jere Koskela slides
Jere Koskela slidesJere Koskela slides
Jere Koskela slides
 
Dynamic programming - fundamentals review
Dynamic programming - fundamentals reviewDynamic programming - fundamentals review
Dynamic programming - fundamentals review
 
Cheatsheet unsupervised-learning
Cheatsheet unsupervised-learningCheatsheet unsupervised-learning
Cheatsheet unsupervised-learning
 
Undecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWER
Undecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWERUndecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWER
Undecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWER
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence
 
Graph representation of DFA’s Da
Graph representation of DFA’s DaGraph representation of DFA’s Da
Graph representation of DFA’s Da
 
Review
ReviewReview
Review
 
Probability Density Function (PDF)
Probability Density Function (PDF)Probability Density Function (PDF)
Probability Density Function (PDF)
 
Convex optimization methods
Convex optimization methodsConvex optimization methods
Convex optimization methods
 
Generating Chebychev Chaotic Sequence
Generating Chebychev Chaotic SequenceGenerating Chebychev Chaotic Sequence
Generating Chebychev Chaotic Sequence
 
Probability And Random Variable Lecture(5)
Probability And Random Variable Lecture(5)Probability And Random Variable Lecture(5)
Probability And Random Variable Lecture(5)
 
Refresher algebra-calculus
Refresher algebra-calculusRefresher algebra-calculus
Refresher algebra-calculus
 
Sampling and Markov Chain Monte Carlo Techniques
Sampling and Markov Chain Monte Carlo TechniquesSampling and Markov Chain Monte Carlo Techniques
Sampling and Markov Chain Monte Carlo Techniques
 
Richard Everitt's slides
Richard Everitt's slidesRichard Everitt's slides
Richard Everitt's slides
 
Stochastic Differentiation
Stochastic DifferentiationStochastic Differentiation
Stochastic Differentiation
 

Viewers also liked

Fantastic Airplane Images
Fantastic Airplane ImagesFantastic Airplane Images
Fantastic Airplane ImagesMarian Pascu
 
Vecer Na Temu25marec2010
Vecer Na Temu25marec2010Vecer Na Temu25marec2010
Vecer Na Temu25marec2010guestc27e91
 
Certificate IV Training & Assessment
Certificate IV Training & AssessmentCertificate IV Training & Assessment
Certificate IV Training & AssessmentRamon Mairata
 

Viewers also liked (6)

Fantastic Airplane Images
Fantastic Airplane ImagesFantastic Airplane Images
Fantastic Airplane Images
 
Vecer Na Temu25marec2010
Vecer Na Temu25marec2010Vecer Na Temu25marec2010
Vecer Na Temu25marec2010
 
Teresita Carvajal Resume
Teresita Carvajal ResumeTeresita Carvajal Resume
Teresita Carvajal Resume
 
Slides 1
Slides 1Slides 1
Slides 1
 
Certificate IV Training & Assessment
Certificate IV Training & AssessmentCertificate IV Training & Assessment
Certificate IV Training & Assessment
 
VSS_Analyst
VSS_AnalystVSS_Analyst
VSS_Analyst
 

Similar to PowerPoint Presentation

Phase-Type Distributions for Finite Interacting Particle Systems
Phase-Type Distributions for Finite Interacting Particle SystemsPhase-Type Distributions for Finite Interacting Particle Systems
Phase-Type Distributions for Finite Interacting Particle SystemsStefan Eng
 
cs621-lect18-feedforward-network-contd-2009-9-24.ppt
cs621-lect18-feedforward-network-contd-2009-9-24.pptcs621-lect18-feedforward-network-contd-2009-9-24.ppt
cs621-lect18-feedforward-network-contd-2009-9-24.pptGayathriRHICETCSESTA
 
cs621-lect18-feedforward-network-contd-2009-9-24.ppt
cs621-lect18-feedforward-network-contd-2009-9-24.pptcs621-lect18-feedforward-network-contd-2009-9-24.ppt
cs621-lect18-feedforward-network-contd-2009-9-24.pptGayathriRHICETCSESTA
 
IGARSS2011 FR3.T08.3 BenDavid.pdf
IGARSS2011 FR3.T08.3 BenDavid.pdfIGARSS2011 FR3.T08.3 BenDavid.pdf
IGARSS2011 FR3.T08.3 BenDavid.pdfgrssieee
 
Deep Learning for Cyber Security
Deep Learning for Cyber SecurityDeep Learning for Cyber Security
Deep Learning for Cyber SecurityAltoros
 
ガウス過程入門
ガウス過程入門ガウス過程入門
ガウス過程入門ShoShimoyama
 
Lecture 3: Stochastic Hydrology
Lecture 3: Stochastic HydrologyLecture 3: Stochastic Hydrology
Lecture 3: Stochastic HydrologyAmro Elfeki
 
Clustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modelClustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modeljins0618
 
Cheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networksCheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networksSteve Nouri
 
Statistics lecture 6 (ch5)
Statistics lecture 6 (ch5)Statistics lecture 6 (ch5)
Statistics lecture 6 (ch5)jillmitchell8778
 
Randomness conductors
Randomness conductorsRandomness conductors
Randomness conductorswtyru1989
 
Dsp U Lec04 Discrete Time Signals & Systems
Dsp U   Lec04 Discrete Time Signals & SystemsDsp U   Lec04 Discrete Time Signals & Systems
Dsp U Lec04 Discrete Time Signals & Systemstaha25
 
Dissertation Defense: The Physics of DNA, RNA, and RNA-like polymers
Dissertation Defense: The Physics of DNA, RNA, and RNA-like polymersDissertation Defense: The Physics of DNA, RNA, and RNA-like polymers
Dissertation Defense: The Physics of DNA, RNA, and RNA-like polymersLi Tai Fang
 
Formulas statistics
Formulas statisticsFormulas statistics
Formulas statisticsPrashi_Jain
 
Dsp U Lec06 The Z Transform And Its Application
Dsp U   Lec06 The Z Transform And Its ApplicationDsp U   Lec06 The Z Transform And Its Application
Dsp U Lec06 The Z Transform And Its Applicationtaha25
 
An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)ananth
 
Recurrent Neuron Network-from point of dynamic system & state machine
Recurrent Neuron Network-from point of dynamic system & state machineRecurrent Neuron Network-from point of dynamic system & state machine
Recurrent Neuron Network-from point of dynamic system & state machineGAYO3
 

Similar to PowerPoint Presentation (20)

Advances in Directed Spanners
Advances in Directed SpannersAdvances in Directed Spanners
Advances in Directed Spanners
 
Phase-Type Distributions for Finite Interacting Particle Systems
Phase-Type Distributions for Finite Interacting Particle SystemsPhase-Type Distributions for Finite Interacting Particle Systems
Phase-Type Distributions for Finite Interacting Particle Systems
 
feedforward-network-
feedforward-network-feedforward-network-
feedforward-network-
 
cs621-lect18-feedforward-network-contd-2009-9-24.ppt
cs621-lect18-feedforward-network-contd-2009-9-24.pptcs621-lect18-feedforward-network-contd-2009-9-24.ppt
cs621-lect18-feedforward-network-contd-2009-9-24.ppt
 
cs621-lect18-feedforward-network-contd-2009-9-24.ppt
cs621-lect18-feedforward-network-contd-2009-9-24.pptcs621-lect18-feedforward-network-contd-2009-9-24.ppt
cs621-lect18-feedforward-network-contd-2009-9-24.ppt
 
IGARSS2011 FR3.T08.3 BenDavid.pdf
IGARSS2011 FR3.T08.3 BenDavid.pdfIGARSS2011 FR3.T08.3 BenDavid.pdf
IGARSS2011 FR3.T08.3 BenDavid.pdf
 
Deep Learning for Cyber Security
Deep Learning for Cyber SecurityDeep Learning for Cyber Security
Deep Learning for Cyber Security
 
ガウス過程入門
ガウス過程入門ガウス過程入門
ガウス過程入門
 
Lecture 3: Stochastic Hydrology
Lecture 3: Stochastic HydrologyLecture 3: Stochastic Hydrology
Lecture 3: Stochastic Hydrology
 
Z transform
Z transformZ transform
Z transform
 
Clustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modelClustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture model
 
Cheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networksCheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networks
 
Statistics lecture 6 (ch5)
Statistics lecture 6 (ch5)Statistics lecture 6 (ch5)
Statistics lecture 6 (ch5)
 
Randomness conductors
Randomness conductorsRandomness conductors
Randomness conductors
 
Dsp U Lec04 Discrete Time Signals & Systems
Dsp U   Lec04 Discrete Time Signals & SystemsDsp U   Lec04 Discrete Time Signals & Systems
Dsp U Lec04 Discrete Time Signals & Systems
 
Dissertation Defense: The Physics of DNA, RNA, and RNA-like polymers
Dissertation Defense: The Physics of DNA, RNA, and RNA-like polymersDissertation Defense: The Physics of DNA, RNA, and RNA-like polymers
Dissertation Defense: The Physics of DNA, RNA, and RNA-like polymers
 
Formulas statistics
Formulas statisticsFormulas statistics
Formulas statistics
 
Dsp U Lec06 The Z Transform And Its Application
Dsp U   Lec06 The Z Transform And Its ApplicationDsp U   Lec06 The Z Transform And Its Application
Dsp U Lec06 The Z Transform And Its Application
 
An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)
 
Recurrent Neuron Network-from point of dynamic system & state machine
Recurrent Neuron Network-from point of dynamic system & state machineRecurrent Neuron Network-from point of dynamic system & state machine
Recurrent Neuron Network-from point of dynamic system & state machine
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

PowerPoint Presentation

  • 1. Hidden Markov Models Sargur N. Srihari srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/CSE574/index.html Machine Learning: CSE 574 0
  • 2. HMM Topics 1. What is an HMM? 2. State-space Representation 3. HMM Parameters 4. Generative View of HMM 5. Determining HMM Parameters Using EM 6. Forward-Backward or α−β algorithm 7. HMM Implementation Issues: a) Length of Sequence b) Predictive Distribution c) Sum-Product Algorithm d) Scaling Factors e) Viterbi Algorithm Machine Learning: CSE 574 1
  • 3. 1. What is an HMM? • Ubiquitous tool for modeling time series data • Used in • Almost all speech recognition systems • Computational molecular biology • Group amino acid sequences into proteins • Handwritten word recognition • It is a tool for representing probability distributions over sequences of observations • HMM gets its name from two defining properties: • Observation xt at time t was generated by some process whose state zt is hidden from the observer • Assumes that state at zt is dependent only on state zt-1 and independent of all prior states (First order) • Example: z are phoneme sequences x are acoustic observations Machine Learning: CSE 574 2
  • 4. Graphical Model of HMM • Has the graphical model shown below and latent variables are discrete • Joint distribution has the form: ⎡ N ⎤ N p ( x1 ,..x N , z1 ,..z n ) = p ( z1 ) ⎢∏ p ( z n | z n −1 )⎥∏ p (xn | z n ) ⎣ n=2 ⎦ n =1 Machine Learning: CSE 574 3
  • 5. HMM Viewed as Mixture • A single time slice corresponds to a mixture distribution with component densities p(x|z) • Each state of discrete variable z represents a different component • An extension of mixture model • Choice of mixture component depends on choice of mixture component for previous distribution • Latent variables are multinomial variables zn • That describe component responsible for generating xn • Can use one-of-K coding scheme Machine Learning: CSE 574 4
  • 6. 2. State-Space Representation • Probability distribution of zn State k 1 2 . K depends on state of previous of zn znk 0 1 . 0 latent variable zn-1 through probability distribution p(zn|zn-1) • One-of K coding State j 1 2 . K of zn-1 zn-1,j 1 0 . 0 • Since latent variables are K- dimensional binary vectors A zn Ajk=p(znk=1|zn-1,j=1) matrix 1 2 …. K ΣkAjk=1 1 • These are known as Transition Probabilities zn-1 2 • K(K-1) independent parameters … Ajk Machine Learning: CSE 574 K 5
  • 7. Transition Probabilities Example with 3-state latent variable zn=1 zn=2 zn=3 State Transition Diagram zn1=1 zn1=0 zn1=0 zn2=0 zn2=1 zn2=0 zn3=0 zn3=0 zn3=1 zn-1=1 A11 A12 A13 zn-1,1=1 A11+A12+A13=1 zn-1,2=0 zn-1,3=0 zn-1=2 A21 A22 A23 zn-1,1=0 zn-1,2=1 zn-1,3=0 • Not a graphical model zn-3= 3 A31 A32 A33 since nodes are not zn-1,1=0 separate variables but zn-1,2=0 states of a single zn-1,3=1 variable Machine Learning: CSE 574 6 • Here K=3
• 8. Conditional Probabilities
  • Transition probabilities A_jk represent state-to-state probabilities for each variable
  • Conditional probabilities are variable-to-variable probabilities and can be written in terms of the transition probabilities as
    p(z_n | z_{n-1}, A) = ∏_{k=1}^{K} ∏_{j=1}^{K} A_jk^(z_{n-1,j} z_nk)
  • Note that the exponent z_{n-1,j} z_nk is a product that evaluates to 0 or 1
  • Hence the overall product evaluates to a single A_jk for each setting of the values of z_n and z_{n-1}
  • E.g., z_{n-1} = 2 and z_n = 3 result in only z_{n-1,2} = 1 and z_{n,3} = 1, so p(z_n = 3 | z_{n-1} = 2) = A_23
  • A is a global HMM parameter
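To see concretely how the one-of-K exponents pick out a single matrix entry, here is a minimal Python sketch (not from the slides), using a hypothetical 3-state transition matrix A:

```python
import numpy as np

# Hypothetical 3-state transition matrix: rows index the state of z_{n-1}, columns the state of z_n
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])

def one_hot(state, K=3):
    v = np.zeros(K)
    v[state] = 1.0
    return v

z_prev = one_hot(1)   # z_{n-1} in state 2 (0-indexed)
z_curr = one_hot(2)   # z_n in state 3 (0-indexed)

# p(z_n | z_{n-1}, A) = prod_j prod_k A_jk^(z_{n-1,j} z_nk); the exponent is 1 only at (j, k) = (2, 3)
p = np.prod(A ** np.outer(z_prev, z_curr))
print(p, A[1, 2])     # both print 0.1 = A_23
```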
• 9. Initial Variable Probabilities
  • The initial latent node z_1 is a special case without a parent node
  • It is represented by a vector of probabilities π with elements π_k = p(z_1k = 1), so that
    p(z_1 | π) = ∏_{k=1}^{K} π_k^(z_1k),  where Σ_k π_k = 1
  • Note that π is an HMM parameter representing the probabilities of each state for the first variable
• 10. Lattice or Trellis Diagram
  • The state transition diagram unfolded over time
  • A representation of latent variable states
  • Each column corresponds to one of the latent variables z_n
• 11. Emission Probabilities p(x_n | z_n)
  • We have so far only specified p(z_n | z_{n-1}) by means of the transition probabilities
  • We need to specify the probabilities p(x_n | z_n, φ) to complete the model, where φ are parameters
  • These can be continuous or discrete
  • Because x_n is observed and z_n is discrete, p(x_n | z_n, φ) consists of a table of K numbers corresponding to the K states of z_n
  • Analogous to class-conditional probabilities
  • Can be represented as
    p(x_n | z_n, φ) = ∏_{k=1}^{K} p(x_n | φ_k)^(z_nk)
• 12. 3. HMM Parameters
  • We have defined three types of HMM parameters: θ = (π, A, φ)
    1. Initial probabilities of the first latent variable: π is a vector of K probabilities of the states for latent variable z_1
    2. Transition probabilities (state-to-state for any latent variable): A is a K x K matrix of transition probabilities A_jk
    3. Emission probabilities (observations conditioned on latent): φ are the parameters of the conditional distribution p(x_n | z_n)
  • The A and π parameters are often initialized uniformly
  • Initialization of φ depends on the form of the distribution
• 13. Joint Distribution over Latent and Observed Variables
  • The joint can be expressed in terms of the parameters:
    p(X, Z | θ) = p(z_1 | π) [ ∏_{n=2}^{N} p(z_n | z_{n-1}, A) ] ∏_{m=1}^{N} p(x_m | z_m, φ)
    where X = {x_1,...,x_N}, Z = {z_1,...,z_N}, θ = {π, A, φ}
  • Most discussion of the HMM is independent of the emission probabilities
  • Tractable for discrete tables, Gaussians, GMMs
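Because the joint factorizes into initial, transition, and emission terms, the complete-data log-likelihood ln p(X, Z | θ) is a simple sum. The following is a minimal sketch assuming Gaussian emissions and hypothetical array shapes; it is illustrative only:

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_log_likelihood(X, Z, pi, A, means, covs):
    """ln p(X, Z | theta) for a Gaussian-emission HMM.

    X: (N, D) observations; Z: (N,) integer state sequence;
    pi: (K,) initial state probabilities; A: (K, K) transition matrix;
    means: (K, D) emission means; covs: (K, D, D) emission covariances.
    """
    ll = np.log(pi[Z[0]])                                   # initial term ln p(z_1 | pi)
    for n in range(1, len(Z)):
        ll += np.log(A[Z[n - 1], Z[n]])                     # transition terms ln p(z_n | z_{n-1}, A)
    for n in range(len(Z)):
        ll += multivariate_normal.logpdf(X[n], means[Z[n]], covs[Z[n]])  # emission terms
    return ll
```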
• 14. 4. Generative View of HMM
  • Sampling from an HMM
  • An HMM with a 3-state latent variable z
  • Gaussian emission model p(x | z); contours of constant density of the emission distribution are shown for each state
  • Two-dimensional x
  • 50 data points generated from the HMM; lines connect successive observations
  • Transition probabilities fixed so that there is a 5% probability of making a transition to each of the other states and a 90% probability of remaining in the same state
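The generative (ancestral sampling) view can be mimicked in a few lines. This is a sketch under assumed settings: three states, two-dimensional Gaussian emissions with made-up means and unit covariance, a uniform initial distribution, and the 90%/5% transition structure described above:

```python
import numpy as np

rng = np.random.default_rng(0)

K, N = 3, 50
pi = np.full(K, 1.0 / K)                                # uniform initial distribution (assumption)
A = np.full((K, K), 0.05) + 0.85 * np.eye(K)            # 90% stay, 5% to each other state
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])  # hypothetical emission means
cov = np.eye(2)                                         # shared unit covariance (assumption)

z = np.empty(N, dtype=int)
x = np.empty((N, 2))
z[0] = rng.choice(K, p=pi)                              # sample initial state
x[0] = rng.multivariate_normal(means[z[0]], cov)        # emit first observation
for n in range(1, N):
    z[n] = rng.choice(K, p=A[z[n - 1]])                 # sample next state from row of A
    x[n] = rng.multivariate_normal(means[z[n]], cov)    # emit observation given state
```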
• 15. Left-to-Right HMM
  • Obtained by setting the elements A_jk = 0 if k < j
  • Corresponding lattice diagram
• 16. Left-to-Right HMM Applied Generatively to Digits
  • Examples of on-line handwritten 2's
  • x is a sequence of pen coordinates
  • There are 16 states for z, i.e., K = 16
  • Each state can generate a line segment of fixed length in one of 16 angles
  • Emission probabilities: a 16 x 16 table
  • Transition probabilities are set to zero except for those that keep the state index k the same or increment it by one
  • Parameters optimized by 25 EM iterations; trained on 45 digits
  • The generative model is quite poor, since the generated digits don't look like the training digits
  • If classification is the goal, one can do better by using a discriminative HMM
  (Figures: training set and generated set)
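A left-to-right structure of this kind can be encoded directly in the transition matrix. The sketch below builds a hypothetical matrix in which each state either stays put or increments by one (p_stay is an arbitrary illustrative value, not taken from the slides):

```python
import numpy as np

def left_to_right_A(K, p_stay=0.5):
    """Left-to-right transition matrix: A[j, k] = 0 for k < j.

    Here only self-transitions and j -> j+1 are allowed; the last state absorbs.
    """
    A = np.zeros((K, K))
    for j in range(K - 1):
        A[j, j] = p_stay            # remain in the same state
        A[j, j + 1] = 1.0 - p_stay  # move to the next state
    A[K - 1, K - 1] = 1.0           # final state is absorbing
    return A

print(left_to_right_A(4))
```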
• 17. 5. Determining HMM Parameters
  • Given a data set X = {x_1,...,x_N} we can determine the HMM parameters θ = {π, A, φ} using maximum likelihood
  • The likelihood function is obtained from the joint distribution by marginalizing over the latent variables Z = {z_1,...,z_N}:
    p(X | θ) = Σ_Z p(X, Z | θ)
  (Figure: the joint p(X, Z) marginalized over Z to give the likelihood of X)
• 18. Computational Issues for Parameters
    p(X | θ) = Σ_Z p(X, Z | θ)
  • Computational difficulties
    • The joint distribution p(X, Z | θ) does not factorize over n, in contrast to the mixture model
    • Z has an exponential number of terms, corresponding to the trellis
  • Solution
    • Use conditional independence properties to reorder summations
    • Use EM instead to maximize the log-likelihood function of the joint, ln p(X, Z | θ); this gives an efficient framework for maximizing the likelihood function in HMMs
• 19. EM for MLE in HMM
  1. Start with an initial selection of model parameters θ^old
  2. In the E step, take these parameter values and find the posterior distribution of the latent variables p(Z | X, θ^old). Use this posterior distribution to evaluate the expectation of the logarithm of the complete-data likelihood function ln p(X, Z | θ), which can be written as
     Q(θ, θ^old) = Σ_Z p(Z | X, θ^old) ln p(X, Z | θ)
     (the posterior p(Z | X, θ^old) does not depend on θ and is the quantity evaluated in the E step)
  3. In the M step, maximize Q w.r.t. θ
• 20. Expansion of Q
  • Introduce the notation
    γ(z_n) = p(z_n | X, θ^old): marginal posterior distribution of latent variable z_n
    ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X, θ^old): joint posterior of two successive latent variables
  • We will re-express
    Q(θ, θ^old) = Σ_Z p(Z | X, θ^old) ln p(X, Z | θ)
    in terms of γ and ξ
• 21. Detail of γ and ξ
  • For each value of n we can store
    γ(z_n) using K non-negative numbers that sum to unity
    ξ(z_{n-1}, z_n) using a K x K matrix whose elements also sum to unity
  • In this notation, γ(z_nk) denotes the conditional probability that z_nk = 1, with similar notation for ξ(z_{n-1,j}, z_nk)
  • Because the expectation of a binary random variable is the probability that it takes value 1,
    γ(z_nk) = E[z_nk] = Σ_{z_n} γ(z_n) z_nk
    ξ(z_{n-1,j}, z_nk) = E[z_{n-1,j} z_nk] = Σ_{z_{n-1}, z_n} ξ(z_{n-1}, z_n) z_{n-1,j} z_nk
• 22. Expansion of Q
  • We begin with
    Q(θ, θ^old) = Σ_Z p(Z | X, θ^old) ln p(X, Z | θ)
  • Substitute
    p(X, Z | θ) = p(z_1 | π) [ ∏_{n=2}^{N} p(z_n | z_{n-1}, A) ] ∏_{m=1}^{N} p(x_m | z_m, φ)
  • And use the definitions of γ and ξ to get:
    Q(θ, θ^old) = Σ_{k=1}^{K} γ(z_1k) ln π_k + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(z_{n-1,j}, z_nk) ln A_jk + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) ln p(x_n | φ_k)
• 23. E-Step
    Q(θ, θ^old) = Σ_{k=1}^{K} γ(z_1k) ln π_k + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(z_{n-1,j}, z_nk) ln A_jk + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) ln p(x_n | φ_k)
  • The goal of the E step is to evaluate γ(z_n) and ξ(z_{n-1}, z_n) efficiently (the Forward-Backward algorithm)
• 24. M-Step
  • Maximize Q(θ, θ^old) with respect to the parameters θ = {π, A, φ}, treating γ(z_n) and ξ(z_{n-1}, z_n) as constant
  • Maximization w.r.t. π and A is easily achieved (using Lagrange multipliers):
    π_k = γ(z_1k) / Σ_{j=1}^{K} γ(z_1j)
    A_jk = Σ_{n=2}^{N} ξ(z_{n-1,j}, z_nk) / Σ_{l=1}^{K} Σ_{n=2}^{N} ξ(z_{n-1,j}, z_nl)
  • Maximization w.r.t. φ_k: only the last term of Q depends on φ_k,
    Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) ln p(x_n | φ_k)
    which has the same form as in a mixture distribution for i.i.d. data
• 25. M-Step for Gaussian Emission
  • Maximization of Q(θ, θ^old) w.r.t. φ_k
  • Gaussian emission densities: p(x | φ_k) ~ N(x | μ_k, Σ_k)
  • Solution:
    μ_k = Σ_{n=1}^{N} γ(z_nk) x_n / Σ_{n=1}^{N} γ(z_nk)
    Σ_k = Σ_{n=1}^{N} γ(z_nk) (x_n − μ_k)(x_n − μ_k)^T / Σ_{n=1}^{N} γ(z_nk)
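Assuming the E-step quantities γ (an N x K array) and ξ (an (N-1) x K x K array) are available, the M-step updates above translate almost line-for-line into code. A minimal sketch for Gaussian emissions (the array shapes are assumptions, not from the slides):

```python
import numpy as np

def m_step_gaussian(X, gamma, xi):
    """M-step updates for pi, A and Gaussian emission parameters.

    X: (N, D) observations; gamma: (N, K) marginal posteriors gamma(z_nk);
    xi: (N-1, K, K) pairwise posteriors xi(z_{n-1,j}, z_nk).
    """
    N, D = X.shape
    K = gamma.shape[1]
    pi = gamma[0] / gamma[0].sum()          # pi_k = gamma(z_1k) / sum_j gamma(z_1j)
    A = xi.sum(axis=0)
    A /= A.sum(axis=1, keepdims=True)       # normalize each row j over k
    Nk = gamma.sum(axis=0)                  # effective counts per state
    means = (gamma.T @ X) / Nk[:, None]     # mu_k
    covs = np.empty((K, D, D))
    for k in range(K):
        diff = X - means[k]
        covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # Sigma_k
    return pi, A, means, covs
```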
• 26. M-Step for Multinomial Observed Variables
  • The conditional distribution of the observations has the form
    p(x | z) = ∏_{i=1}^{D} ∏_{k=1}^{K} μ_ik^(x_i z_k)
  • M-step equations:
    μ_ik = Σ_{n=1}^{N} γ(z_nk) x_ni / Σ_{n=1}^{N} γ(z_nk)
  • An analogous result holds for Bernoulli observed variables
• 27. 6. Forward-Backward Algorithm
  • E step: an efficient procedure to evaluate γ(z_n) and ξ(z_{n-1}, z_n)
  • The graph of the HMM is a tree
  • This implies that the posterior distribution of the latent variables can be obtained efficiently using a message passing algorithm
  • In the HMM context it is called the forward-backward algorithm or Baum-Welch algorithm
  • Several variants lead to exact marginals; the method called alpha-beta is discussed here
• 28. Derivation of Forward-Backward
  • Several conditional independences (A-H) hold
  A. p(X | z_n) = p(x_1,...,x_n | z_n) p(x_{n+1},...,x_N | z_n)
  • Proved using d-separation: every path from x_1,...,x_{n-1} to x_n passes through z_n, which is conditioned on, and the path is head-to-tail. Thus (x_1,...,x_{n-1}) _||_ x_n | z_n. Similarly (x_1,...,x_{n-1}, x_n) _||_ (x_{n+1},...,x_N) | z_n
• 29. Conditional Independence B
  • Since (x_1,...,x_{n-1}) _||_ x_n | z_n, we have
  B. p(x_1,...,x_{n-1} | x_n, z_n) = p(x_1,...,x_{n-1} | z_n)
• 30. Conditional Independence C
  • Since (x_1,...,x_{n-1}) _||_ z_n | z_{n-1},
  C. p(x_1,...,x_{n-1} | z_{n-1}, z_n) = p(x_1,...,x_{n-1} | z_{n-1})
• 31. Conditional Independence D
  • Since (x_{n+1},...,x_N) _||_ z_n | z_{n+1},
  D. p(x_{n+1},...,x_N | z_n, z_{n+1}) = p(x_{n+1},...,x_N | z_{n+1})
• 32. Conditional Independence E
  • Since (x_{n+2},...,x_N) _||_ x_{n+1} | z_{n+1},
  E. p(x_{n+2},...,x_N | z_{n+1}, x_{n+1}) = p(x_{n+2},...,x_N | z_{n+1})
• 33. Conditional Independence F
  F. p(X | z_{n-1}, z_n) = p(x_1,...,x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1},...,x_N | z_n)
• 34. Conditional Independence G
  • Since (x_1,...,x_N) _||_ x_{N+1} | z_{N+1},
  G. p(x_{N+1} | X, z_{N+1}) = p(x_{N+1} | z_{N+1})
• 35. Conditional Independence H
  H. p(z_{N+1} | z_N, X) = p(z_{N+1} | z_N)
• 36. Evaluation of γ(z_n)
  • Recall that this is needed to efficiently compute the E step when estimating the parameters of the HMM:
    γ(z_n) = p(z_n | X, θ^old), the marginal posterior distribution of latent variable z_n
  • We are interested in finding the posterior distribution p(z_n | x_1,...,x_N)
  • This is a vector of length K whose entries correspond to the expected values of z_nk
• 37. Introducing α and β
  • Using Bayes' theorem,
    γ(z_n) = p(z_n | X) = p(X | z_n) p(z_n) / p(X)
  • Using conditional independence A,
    γ(z_n) = p(x_1,...,x_n | z_n) p(x_{n+1},...,x_N | z_n) p(z_n) / p(X)
           = p(x_1,...,x_n, z_n) p(x_{n+1},...,x_N | z_n) / p(X)
           = α(z_n) β(z_n) / p(X)
  • where
    α(z_n) = p(x_1,...,x_n, z_n), the joint probability of observing all of the given data up to time n and the value of z_n
    β(z_n) = p(x_{n+1},...,x_N | z_n), the conditional probability of all future data from time n+1 up to N given the value of z_n
• 38. Recursion Relation for α
    α(z_n) = p(x_1,...,x_n, z_n)
           = p(x_1,...,x_n | z_n) p(z_n)                                                   [by Bayes rule]
           = p(x_n | z_n) p(x_1,...,x_{n-1} | z_n) p(z_n)                                  [by conditional independence B]
           = p(x_n | z_n) p(x_1,...,x_{n-1}, z_n)                                          [by Bayes rule]
           = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,...,x_{n-1}, z_{n-1}, z_n)                     [by the sum rule]
           = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,...,x_{n-1}, z_n | z_{n-1}) p(z_{n-1})         [by Bayes rule]
           = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,...,x_{n-1} | z_{n-1}) p(z_n | z_{n-1}) p(z_{n-1})   [by conditional independence C]
           = p(x_n | z_n) Σ_{z_{n-1}} p(x_1,...,x_{n-1}, z_{n-1}) p(z_n | z_{n-1})         [by Bayes rule]
           = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})                          [by definition of α]
• 39. Forward Recursion for α Evaluation
  • The recursion relation is
    α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
  • There are K terms in the summation, and it has to be evaluated for each of the K values of z_n, so each step of the recursion is O(K^2)
  • The initial condition is
    α(z_1) = p(x_1, z_1) = p(z_1) p(x_1 | z_1) = ∏_{k=1}^{K} {π_k p(x_1 | φ_k)}^(z_1k)
  • The overall cost for the chain is O(K^2 N)
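The forward recursion is easy to express once the emission probabilities are pre-computed into a table B. A minimal unscaled sketch (see slide 51 for why scaling is needed in practice); the array layout is an assumption:

```python
import numpy as np

def forward(pi, A, B):
    """Unscaled forward recursion.

    pi: (K,) initial probabilities; A: (K, K) transition matrix;
    B: (N, K) pre-computed emission probabilities, B[n, k] = p(x_n | z_n = k).
    Returns alpha with alpha[n, k] = p(x_1, ..., x_n, z_n = k).
    """
    N, K = B.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[0]                       # alpha(z_1) = pi_k p(x_1 | phi_k)
    for n in range(1, N):
        alpha[n] = B[n] * (alpha[n - 1] @ A)   # each step sums over z_{n-1}: O(K^2)
    return alpha
```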
• 40. Recursion Relation for β
    β(z_n) = p(x_{n+1},...,x_N | z_n)
           = Σ_{z_{n+1}} p(x_{n+1},...,x_N, z_{n+1} | z_n)                                 [by the sum rule]
           = Σ_{z_{n+1}} p(x_{n+1},...,x_N | z_n, z_{n+1}) p(z_{n+1} | z_n)                [by Bayes rule]
           = Σ_{z_{n+1}} p(x_{n+1},...,x_N | z_{n+1}) p(z_{n+1} | z_n)                     [by conditional independence D]
           = Σ_{z_{n+1}} p(x_{n+2},...,x_N | z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)   [by conditional independence E]
           = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)                  [by definition of β]
• 41. Backward Recursion for β
  • Backward message passing:
    β(z_n) = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
  • This evaluates β(z_n) in terms of β(z_{n+1})
  • The starting condition for the recursion,
    p(z_N | X) = p(X, z_N) β(z_N) / p(X),
    is correct provided we set β(z_N) = 1 for all settings of z_N
  • This is the initial condition for the backward computation
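The backward recursion mirrors the forward one, run in reverse with β(z_N) = 1. A minimal unscaled sketch using the same assumed array layout as the forward sketch:

```python
import numpy as np

def backward(A, B):
    """Unscaled backward recursion.

    A: (K, K) transition matrix; B: (N, K) emission probabilities.
    Returns beta with beta[n, k] = p(x_{n+1}, ..., x_N | z_n = k).
    """
    N, K = B.shape
    beta = np.zeros((N, K))
    beta[-1] = 1.0                                  # initial condition beta(z_N) = 1
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[n + 1] * beta[n + 1])      # sum over z_{n+1}
    return beta
```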
• 42. M-Step Equations
  • In the M-step equations, p(X) will cancel out, e.g., in
    μ_k = Σ_{n=1}^{N} γ(z_nk) x_n / Σ_{n=1}^{N} γ(z_nk)
  • p(X) itself is obtained as
    p(X) = Σ_{z_n} α(z_n) β(z_n)
• 43. Evaluation of the Quantities ξ(z_{n-1}, z_n)
  • They correspond to the values of the conditional probabilities p(z_{n-1}, z_n | X) for each of the K x K settings of (z_{n-1}, z_n)
    ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X)                                                  [by definition]
                    = p(X | z_{n-1}, z_n) p(z_{n-1}, z_n) / p(X)                           [by Bayes rule]
                    = p(x_1,...,x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1},...,x_N | z_n) p(z_n | z_{n-1}) p(z_{n-1}) / p(X)   [by conditional independence F]
                    = α(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β(z_n) / p(X)
  • Thus we calculate ξ(z_{n-1}, z_n) directly by using the results of the α and β recursions
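Given α and β from the two recursions, γ and ξ follow directly from the formulas above. A minimal unscaled sketch, again assuming the (N, K) array layout used in the earlier sketches:

```python
import numpy as np

def posteriors(alpha, beta, A, B):
    """gamma(z_n) and xi(z_{n-1}, z_n) from the alpha and beta recursions.

    Returns gamma: (N, K) and xi: (N-1, K, K).
    """
    N, K = alpha.shape
    pX = alpha[-1].sum()                              # p(X) = sum_{z_N} alpha(z_N)
    gamma = alpha * beta / pX                         # gamma = alpha * beta / p(X)
    xi = np.zeros((N - 1, K, K))
    for n in range(1, N):
        # xi[n-1, j, k] = alpha[n-1, j] * A[j, k] * p(x_n | z_n = k) * beta[n, k] / p(X)
        xi[n - 1] = alpha[n - 1][:, None] * A * (B[n] * beta[n])[None, :] / pX
    return gamma, xi
```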
• 44. Summary of EM to Train an HMM
  Step 1: Initialization
  • Make an initial selection of the parameters θ^old, where θ = (π, A, φ):
    1. π is a vector of K probabilities of the states for latent variable z_1
    2. A is a K x K matrix of transition probabilities A_jk
    3. φ are the parameters of the conditional distribution p(x_n | z_n)
  • The A and π parameters are often initialized uniformly
  • Initialization of φ depends on the form of the distribution
  • For a Gaussian: the parameters μ_k are initialized by applying K-means to the data, and Σ_k corresponds to the covariance matrix of the corresponding cluster
• 45. Summary of EM to Train an HMM
  Step 2: E Step
  • Run both the forward α recursion and the backward β recursion
  • Use the results to evaluate γ(z_n) and ξ(z_{n-1}, z_n) and the likelihood function
  Step 3: M Step
  • Use the results of the E step to find a revised set of parameters θ^new using the M-step equations
  • Alternate between the E and M steps until convergence of the likelihood function
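Tying the pieces together, an EM loop simply alternates the two steps. The driver below is a sketch only; it assumes the forward, backward, posteriors and m_step_gaussian functions sketched earlier, Gaussian emissions, and a crude random initialization in place of k-means:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_hmm(X, K, n_iter=25, seed=0):
    """Minimal EM driver for a Gaussian-emission HMM (unscaled recursions)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                          # uniform initialization of pi
    A = np.full((K, K), 1.0 / K)                      # uniform initialization of A
    means = X[rng.choice(N, size=K, replace=False)]   # crude stand-in for k-means init
    covs = np.stack([np.cov(X.T) + 1e-3 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # Pre-compute emission probabilities B[n, k] = p(x_n | z_n = k)
        B = np.column_stack([multivariate_normal.pdf(X, means[k], covs[k]) for k in range(K)])
        alpha, beta = forward(pi, A, B), backward(A, B)        # E step
        gamma, xi = posteriors(alpha, beta, A, B)
        pi, A, means, covs = m_step_gaussian(X, gamma, xi)     # M step
    return pi, A, means, covs
```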
• 46. Values for p(x_n | z_n)
  • In the recursion relations, the observations enter through the conditional distributions p(x_n | z_n)
  • The recursions are independent of
    • the dimensionality of the observed variables
    • the form of the conditional distribution, so long as it can be computed for each of the K possible states of z_n
  • Since the observed variables {x_n} are fixed, these quantities can be pre-computed at the start of the EM algorithm
• 47. 7(a) Sequence Length: Using Multiple Short Sequences
  • An HMM can be trained effectively if the length of the sequence is sufficiently long
  • This is true of all maximum likelihood approaches
  • Alternatively we can use multiple short sequences, which requires a straightforward modification of the HMM-EM algorithm
  • This is particularly important in left-to-right models, since in a given observation sequence a given state transition corresponding to a non-diagonal element of A occurs only once
• 48. 7(b). Predictive Distribution
  • The observed data is X = {x_1,...,x_N}; we wish to predict x_{N+1}
  • Application in financial forecasting
    p(x_{N+1} | X) = Σ_{z_{N+1}} p(x_{N+1}, z_{N+1} | X)
                   = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) p(z_{N+1} | X)                        [by the product rule and conditional independence G]
                   = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1}, z_N | X)           [by the sum rule]
                   = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1} | z_N) p(z_N | X)   [by conditional independence H]
                   = Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1} | z_N) p(z_N, X) / p(X)   [by Bayes rule]
                   = (1 / p(X)) Σ_{z_{N+1}} p(x_{N+1} | z_{N+1}) Σ_{z_N} p(z_{N+1} | z_N) α(z_N)  [by definition of α]
  • This can be evaluated by first running the forward α recursion and then summing over z_N and z_{N+1}
  • It can be extended to subsequent predictions of x_{N+2}, after x_{N+1} is observed, using a fixed amount of storage
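The inner part of this computation, p(z_{N+1} | X), comes straight from the last column of α. A minimal sketch under the same unscaled-array assumptions as before:

```python
import numpy as np

def predictive_state_distribution(alpha, A):
    """p(z_{N+1} = k | X), computed from the forward messages as in the derivation above.

    alpha: (N, K) unscaled forward messages; A: (K, K) transition matrix.
    """
    pX = alpha[-1].sum()                      # p(X)
    return (alpha[-1] @ A) / pX               # sum_{z_N} alpha(z_N) p(z_{N+1} | z_N) / p(X)

# p(x_{N+1} | X) is then sum_k p(x_{N+1} | z_{N+1} = k) * predictive_state_distribution(alpha, A)[k]
```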
• 49. 7(c). Sum-Product and HMM
  • The HMM graph is a tree and hence the sum-product algorithm can be used to find local marginals for the hidden variables
  • This is equivalent to the forward-backward algorithm
  • Sum-product provides a simple way to derive the alpha-beta recursion formulae
  • Transform the directed graph into a factor graph: each variable has a node, small squares represent factors, and undirected links connect factor nodes to the variables they use
  • Joint distribution:
    p(x_1,...,x_N, z_1,...,z_N) = p(z_1) [ ∏_{n=2}^{N} p(z_n | z_{n-1}) ] ∏_{n=1}^{N} p(x_n | z_n)
  (Figures: HMM graph and a fragment of the corresponding factor graph)
• 50. Deriving Alpha-Beta from Sum-Product
  • Begin with the simplified form of the factor graph
  • The factors are given by
    h(z_1) = p(z_1) p(x_1 | z_1)
    f_n(z_{n-1}, z_n) = p(z_n | z_{n-1}) p(x_n | z_n)
    (simplified by absorbing the emission probabilities into the transition probability factors)
  • The messages propagated are
    μ_{z_{n-1} → f_n}(z_{n-1}) = μ_{f_{n-1} → z_{n-1}}(z_{n-1})
    μ_{f_n → z_n}(z_n) = Σ_{z_{n-1}} f_n(z_{n-1}, z_n) μ_{z_{n-1} → f_n}(z_{n-1})
  • Final results: one can show that the α recursion is recovered,
    α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
  • Similarly, starting with the root node, the β recursion is recovered,
    β(z_n) = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
  • γ and ξ are also derived:
    γ(z_n) = α(z_n) β(z_n) / p(X)
    ξ(z_{n-1}, z_n) = α(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β(z_n) / p(X)
• 51. 7(d). Scaling Factors
  • An implementation issue for small probabilities
  • At each step of the recursion
    α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
    we obtain the new value of α(z_n) from the previous value α(z_{n-1}) by multiplying by p(z_n | z_{n-1}) and p(x_n | z_n)
  • These probabilities are small and the products will underflow
  • Logs don't help, since we have sums of products
  • The solution is rescaling of α(z_n) and β(z_n) so that their values remain close to unity
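A common form of this rescaling normalizes α(z_n) at every step and keeps the scaling factors, from which the log-likelihood can be recovered. This is a sketch of that standard construction, not code from the slides:

```python
import numpy as np

def forward_scaled(pi, A, B):
    """Scaled forward recursion.

    alpha_hat[n, k] = p(z_n = k | x_1, ..., x_n); c[n] = p(x_n | x_1, ..., x_{n-1}).
    ln p(X) is recovered as sum_n ln c[n], avoiding underflow.
    """
    N, K = B.shape
    alpha_hat = np.zeros((N, K))
    c = np.zeros(N)
    a = pi * B[0]
    c[0] = a.sum()
    alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = B[n] * (alpha_hat[n - 1] @ A)
        c[n] = a.sum()                       # scaling factor
        alpha_hat[n] = a / c[n]              # stays close to unity
    return alpha_hat, c, np.log(c).sum()     # last value is ln p(X)
```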
• 52. 7(e). The Viterbi Algorithm
  • Finds the most probable sequence of hidden states for a given sequence of observables
  • In speech recognition: finding the most probable phoneme sequence for a given series of acoustic observations
  • Since the graphical model of the HMM is a tree, this can be solved exactly using the max-sum algorithm
  • Known as the Viterbi algorithm in the context of HMMs
  • Since max-sum works with log probabilities, there is no need to work with rescaled variables as in forward-backward
• 53. Viterbi Algorithm for HMM
  • A fragment of the HMM lattice showing two paths
  • The number of possible paths grows exponentially with the length of the chain
  • Viterbi searches this space of paths efficiently, finding the most probable path with computational cost linear in the length of the chain
• 54. Deriving Viterbi from Max-Sum
  • Start with the simplified factor graph
  • Treat the variable z_N as the root node, passing messages to the root from the leaf nodes
  • The messages passed are
    μ_{z_n → f_{n+1}}(z_n) = μ_{f_n → z_n}(z_n)
    μ_{f_{n+1} → z_{n+1}}(z_{n+1}) = max_{z_n} { ln f_{n+1}(z_n, z_{n+1}) + μ_{z_n → f_{n+1}}(z_n) }
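In code, the max-sum recursion becomes a dynamic program over log probabilities with backtracking. A minimal sketch, assuming log emission probabilities are supplied as an (N, K) array (an assumption about the interface, not from the slides):

```python
import numpy as np

def viterbi(pi, A, logB):
    """Most probable state sequence via max-sum (Viterbi) in the log domain.

    pi: (K,) initial probabilities; A: (K, K) transition matrix;
    logB: (N, K) log emission probabilities, logB[n, k] = ln p(x_n | z_n = k).
    """
    N, K = logB.shape
    logA = np.log(A)
    omega = np.log(pi) + logB[0]              # best log-prob of paths ending in each state
    back = np.zeros((N, K), dtype=int)        # backpointers
    for n in range(1, N):
        scores = omega[:, None] + logA        # scores[j, k]: extend best path ending at j with j -> k
        back[n] = scores.argmax(axis=0)
        omega = scores.max(axis=0) + logB[n]
    path = np.zeros(N, dtype=int)             # backtrack the most probable path
    path[-1] = omega.argmax()
    for n in range(N - 1, 0, -1):
        path[n - 1] = back[n, path[n]]
    return path
```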
• 55. Other Topics on Sequential Data
  • Sequential Data and Markov Models: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-MarkovModels.pdf
  • Extensions of HMMs: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-HMMExtensions.pdf
  • Linear Dynamical Systems: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-LinearDynamicalSystems.pdf
  • Conditional Random Fields: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.5-ConditionalRandomFields.pdf