Hidden Markov Models
Phil Blunsom                                                                   
                                         August 19, 2004

   The Hidden Markov Model (HMM) is a popular statistical tool for modelling a wide
range of time series data. In the context of natural language processing (NLP), HMMs have
been applied with great success to problems such as part-of-speech tagging and noun-phrase

1     Introduction
The Hidden Markov Model (HMM) is a powerful statistical tool for modelling generative
sequences that can be characterised by an underlying process generating an observable sequence.
HMMs have found application in many areas of signal processing, in particular speech
processing, and have also been applied successfully to low-level NLP tasks such as
part-of-speech tagging, phrase chunking, and extracting target information from documents.
Andrei Markov gave his name to the mathematical theory of Markov processes in the early
twentieth century [3], but it was Baum and his colleagues who developed the theory of HMMs
in the 1960s [2].

Markov Processes. Figure 1 depicts an example of a Markov process: a simple model of a
stock market index. The model has three states, Bull, Bear and Even, and three index
observations, up, down and unchanged. It is a finite state automaton with probabilistic
transitions between states. Because each state emits a single observation symbol, given a
sequence of observations, for example up-down-down, we can easily verify that the state
sequence that produced those observations was Bull-Bear-Bear, and the probability of the
sequence is simply the product of the transitions, in this case 0.2 × 0.3 × 0.3.
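
As a minimal sketch of this calculation, the Python snippet below multiplies the probabilities along a known state path. The start distribution and transition table are hypothetical values chosen for illustration (they are not the arc weights of Figure 1), and the names `START`, `TRANSITIONS` and `path_probability` are assumptions of this sketch.

```python
# Hypothetical probabilities for illustration only; NOT the values of Figure 1.
START = {"Bull": 0.5, "Bear": 0.2, "Even": 0.3}
TRANSITIONS = {
    "Bull": {"Bull": 0.6, "Bear": 0.2, "Even": 0.2},
    "Bear": {"Bull": 0.3, "Bear": 0.5, "Even": 0.2},
    "Even": {"Bull": 0.4, "Bear": 0.3, "Even": 0.3},
}


def path_probability(path, start, transitions):
    """Probability of a fully observed state path in a Markov process."""
    prob = start[path[0]]                      # entering the first state
    for prev, cur in zip(path, path[1:]):
        prob *= transitions[prev][cur]         # one factor per transition
    return prob


# Start in Bull (0.5), Bull -> Bear (0.2), Bear -> Bear (0.5): about 0.05.
p = path_probability(["Bull", "Bear", "Bear"], START, TRANSITIONS)
```

With these made-up numbers the path probability is the product of three factors, mirroring the three-factor product in the text.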

                          Figure 1: Markov process example [1]

                        Figure 2: Hidden Markov model example [1]

Hidden Markov Models. Figure 2 shows an example of how the previous model can
be extended into a HMM. The new model allows all observation symbols to be emitted
from each state with a finite probability. This change makes the model much more expressive
and better able to represent our intuition, in this case that a bull market would have both
good days and bad days, but more good ones. The key difference is that
if we now have the observation sequence up-down-down, we cannot say exactly which
state sequence produced these observations, and thus the state sequence is ‘hidden’. We can,
however, calculate the probability that the model produced the sequence, as well as which
state sequence was most likely to have produced the observations. The next three sections
describe the common calculations that we would like to be able to perform on a HMM.
    The formal definition of a HMM is as follows:

$$\lambda = (A, B, \pi) \tag{1}$$

   $S$ is our state alphabet set, and $V$ is the observation alphabet set:

$$S = (s_1, s_2, \cdots, s_N) \tag{2}$$

$$V = (v_1, v_2, \cdots, v_M) \tag{3}$$

   We define $Q$ to be a fixed state sequence of length $T$, and $O$ the corresponding observations:

$$Q = q_1, q_2, \cdots, q_T \tag{4}$$

$$O = o_1, o_2, \cdots, o_T \tag{5}$$
   $A$ is a transition array, storing the probability of state $j$ following state $i$. Note that the
state transition probabilities are independent of time:

$$A = [a_{ij}], \quad a_{ij} = P(q_t = s_j \mid q_{t-1} = s_i). \tag{6}$$

   $B$ is the observation array, storing the probability of observation $k$ being produced from
state $i$, independent of $t$:

$$B = [b_i(k)], \quad b_i(k) = P(o_t = v_k \mid q_t = s_i). \tag{7}$$

   $\pi$ is the initial probability array:

$$\pi = [\pi_i], \quad \pi_i = P(q_1 = s_i). \tag{8}$$
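
One possible in-memory representation of $\lambda = (A, B, \pi)$ is sketched below. The class name `HMM`, its field names and the `validate` helper are assumptions of this sketch rather than part of the formal definition, and the two-state parameter values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMM:
    states: List[str]      # S, the state alphabet (length N)
    symbols: List[str]     # V, the observation alphabet (length M)
    A: List[List[float]]   # A[i][j] = P(q_t = s_j | q_{t-1} = s_i)
    B: List[List[float]]   # B[i][k] = P(o_t = v_k | q_t = s_i)
    pi: List[float]        # pi[i] = P(q_1 = s_i)

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError unless pi and every row of A and B is a distribution."""
        n, m = len(self.states), len(self.symbols)
        if len(self.pi) != n or len(self.A) != n or len(self.B) != n:
            raise ValueError("pi, A and B need one entry per state")
        rows = [self.pi] + self.A + self.B
        widths = [n] * (n + 1) + [m] * n
        for row, width in zip(rows, widths):
            if len(row) != width:
                raise ValueError("row has the wrong length")
            if min(row) < 0 or abs(sum(row) - 1.0) > tol:
                raise ValueError("row is not a probability distribution")


# Hypothetical two-state model; the numbers are illustrative only.
model = HMM(
    states=["Bull", "Bear"],
    symbols=["up", "down", "unchanged"],
    A=[[0.6, 0.4], [0.3, 0.7]],
    B=[[0.7, 0.1, 0.2], [0.1, 0.6, 0.3]],
    pi=[0.5, 0.5],
)
model.validate()
```

Checking that every row sums to one catches the most common mistake when parameters are typed in by hand.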

   Two assumptions are made by the model. The first, called the Markov assumption, states
that the current state is dependent only on the previous state; this represents the memory of
the model:

$$P(q_t \mid q_1^{t-1}) = P(q_t \mid q_{t-1}) \tag{9}$$

where $q_1^{t-1}$ denotes the state sequence $q_1, \cdots, q_{t-1}$.
   The independence assumption states that the output observation at time $t$ is dependent
only on the current state; it is independent of previous observations and states:

$$P(o_t \mid o_1^{t-1}, q_1^{t}) = P(o_t \mid q_t) \tag{10}$$
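
The two assumptions can be read as a recipe for generating data, which the following sketch makes explicit. It is an illustration only: the function name `sample_hmm`, the integer encoding of states and symbols, and the parameter values are assumptions introduced here.

```python
import random


def sample_hmm(pi, A, B, T, rng):
    """Draw a state sequence and an observation sequence of length T.

    States and observation symbols are returned as integer indices.
    """
    states, observations = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]      # q_1 drawn from pi
    for _ in range(T):
        states.append(state)
        # Independence assumption: o_t depends only on the current state q_t.
        symbol = rng.choices(range(len(B[state])), weights=B[state])[0]
        observations.append(symbol)
        # Markov assumption: the next state depends only on the current state.
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return states, observations


# Hypothetical parameters (states: 0 = Bull, 1 = Bear; symbols: 0 = up,
# 1 = down, 2 = unchanged). The numbers are illustrative only.
pi = [0.5, 0.5]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3]]

states, observations = sample_hmm(pi, A, B, T=5, rng=random.Random(0))
```

Only `observations` would be visible to an outside observer; `states` is the hidden part of the process.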

                                        Figure 3: A trellis algorithm

2      Evaluation
Given a HMM and a sequence of observations, we’d like to be able to compute $P(O \mid \lambda)$, the
probability of the observation sequence given a model. This problem can be viewed as one
of evaluating how well a model predicts a given observation sequence, which allows us to
choose the most appropriate model from a set.
    The probability of the observations O for a specific state sequence Q is:

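$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1) \times b_{q_2}(o_2) \cdots b_{q_T}(o_T) \tag{11}$$

     and the probability of the state sequence is:

$$P(Q \mid \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T} \tag{12}$$

     so we can calculate the probability of the observations given the model as:

$$P(O \mid \lambda) = \sum_{Q} P(O \mid Q, \lambda) P(Q \mid \lambda) = \sum_{q_1 \cdots q_T} \pi_{q_1} b_{q_1}(o_1) a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) \tag{13}$$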

This result allows the evaluation of the probability of O, but to evaluate it directly would be
exponential in T.
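
To make the cost of direct evaluation concrete, the sketch below evaluates equation 13 literally by enumerating every one of the $N^T$ state sequences. The function name `brute_force_likelihood`, the integer encoding of symbols and the parameter values are assumptions of this example; it is only practical for very short sequences.

```python
from itertools import product


def brute_force_likelihood(pi, A, B, obs):
    """P(O | lambda) by summing over every possible state sequence Q."""
    n = len(pi)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):   # N**T candidate paths
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


# Hypothetical parameters (states: 0 = Bull, 1 = Bear; symbols: 0 = up,
# 1 = down, 2 = unchanged). The numbers are illustrative only.
pi = [0.5, 0.5]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3]]

likelihood = brute_force_likelihood(pi, A, B, [0, 1, 1])   # up-down-down, about 0.054
```

For this toy model there are only $2^3 = 8$ paths, but the count multiplies by $N$ with every extra observation.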
    A better approach is to recognise that many redundant calculations would be made by
directly evaluating equation 13, and that caching calculations can therefore reduce the
complexity. We implement the cache as a trellis of states at each time step, calculating the cached
value (called $\alpha$) for each state as a sum over all states at the previous time step. $\alpha$ is the
probability of the partial observation sequence $o_1, o_2 \cdots o_t$ and state $s_i$ at time $t$. This can be
visualised as in figure 3. We define the forward probability variable:

$$\alpha_t(i) = P(o_1 o_2 \cdots o_t, q_t = s_i \mid \lambda) \tag{14}$$

so if we work through the trellis filling in the values of $\alpha$, the sum of the final column of the
trellis will equal the probability of the observation sequence. The algorithm for this process
is called the forward algorithm and is as follows:
    1. Initialisation:

$$\alpha_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N. \tag{15}$$

    2. Induction:

$$\alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \Big] b_j(o_{t+1}), \quad 1 \le t \le T - 1, \ 1 \le j \le N. \tag{16}$$

    3. Termination:

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i). \tag{17}$$

                    Figure 4: The induction step of the forward algorithm

    The induction step is the key to the forward algorithm and is depicted in figure 4. For each
state $s_j$, $\alpha_t(j)$ stores the probability of arriving in that state having observed the observation
sequence up until time $t$.
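
The three steps of the forward algorithm (equations 15 to 17) can be sketched in a few lines of Python. This is an unscaled illustration under stated assumptions: the function name `forward`, the list-of-lists trellis and the parameter values are choices made here, not part of the original text.

```python
def forward(pi, A, B, obs):
    """Return the alpha trellis and P(O | lambda) for an observation sequence."""
    n = len(pi)
    # 1. Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    # 2. Induction: sum over predecessor states, then emit the next symbol
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            for j in range(n)
        ])
    # 3. Termination: sum the final column of the trellis
    return alpha, sum(alpha[-1])


# Hypothetical parameters (states: 0 = Bull, 1 = Bear; symbols: 0 = up,
# 1 = down, 2 = unchanged). The numbers are illustrative only.
pi = [0.5, 0.5]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3]]

alpha, likelihood = forward(pi, A, B, [0, 1, 1])   # up-down-down, about 0.054
```

Each column of `alpha` is computed from the previous column only, which is where the saving over direct enumeration comes from.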
    It is apparent that by caching $\alpha$ values the forward algorithm reduces the complexity of
the calculations involved to $N^2 T$ rather than $2T N^T$. We can also define an analogous backward
algorithm, which is the exact reverse of the forward algorithm, with the backward variable:

$$\beta_t(i) = P(o_{t+1} o_{t+2} \cdots o_T \mid q_t = s_i, \lambda) \tag{18}$$

as the probability of the partial observation sequence from $t + 1$ to $T$, starting in state $s_i$.
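
The text does not spell out the backward recursion, so the sketch below uses the standard form described in [5], namely $\beta_T(i) = 1$ and $\beta_t(i) = \sum_j a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)$; treat that recursion, the function name `backward` and the parameter values as assumptions of this example.

```python
def backward(pi, A, B, obs):
    """Return the beta trellis and P(O | lambda) computed from the first column."""
    n, T = len(pi), len(obs)
    beta = [[1.0] * n for _ in range(T)]        # beta_T(i) = 1 for every state
    for t in range(T - 2, -1, -1):              # fill the trellis right to left
        for i in range(n):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n)
            )
    likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(n))
    return beta, likelihood


# Hypothetical parameters (states: 0 = Bull, 1 = Bear; symbols: 0 = up,
# 1 = down, 2 = unchanged). The numbers are illustrative only.
pi = [0.5, 0.5]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3]]

beta, likelihood = backward(pi, A, B, [0, 1, 1])   # up-down-down, about 0.054
```

For the same model and observations the backward pass should give the same likelihood as the forward pass, which is a convenient consistency check.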

3      Decoding
The aim of decoding is to discover the hidden state sequence that was most likely to have
produced a given observation sequence. One solution to this problem is to use the Viterbi
algorithm to find the single best state sequence for an observation sequence. The Viterbi
algorithm is another trellis algorithm which is very similar to the forward algorithm, except
that the transition probabilities are maximised at each step, instead of summed. First we
define:

$$\delta_t(i) = \max_{q_1, q_2, \cdots, q_{t-1}} P(q_1 q_2 \cdots q_t = s_i, o_1, o_2 \cdots o_t \mid \lambda) \tag{19}$$

as the probability of the most probable state path for the partial observation sequence.
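    The Viterbi algorithm is as follows:
    1. Initialisation:

$$\delta_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N, \quad \psi_1(i) = 0. \tag{20}$$

    2. Recursion:

$$\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i) a_{ij}] b_j(o_t), \quad 2 \le t \le T, \ 1 \le j \le N, \tag{21}$$

$$\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i) a_{ij}], \quad 2 \le t \le T, \ 1 \le j \le N. \tag{22}$$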


                  Figure 5: The recursion step of the Viterbi algorithm

                Figure 6: The backtracing step of the Viterbi algorithm

  3. Termination:

$$P^* = \max_{1 \le i \le N} [\delta_T(i)] \tag{23}$$

$$q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)]. \tag{24}$$

  4. Optimal state sequence backtracking:

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T - 1, T - 2, \cdots, 1. \tag{25}$$

    The recursion step is illustrated in figure 5. The main difference from the forward algorithm
in the recursion step is that we are maximising, rather than summing, and storing the state
that was chosen as the maximum for use as a backpointer. The backtracking step is shown in
figure 6. The backtracking allows the best state sequence to be found from the backpointers stored
in the recursion step, but it should be noted that there is no easy way to find the second best
state sequence.
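
The four steps of the Viterbi algorithm (equations 20 to 25) can be sketched as follows. The function name `viterbi`, the zero-based indexing of time and states, and the parameter values are assumptions of this illustration; it works with raw probabilities and so is only suitable for short sequences.

```python
def viterbi(pi, A, B, obs):
    """Return the most probable state path and its probability P*."""
    n, T = len(pi), len(obs)
    # 1. Initialisation
    delta = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    psi = [[0] * n]
    # 2. Recursion: keep the best predecessor of each state as a backpointer
    for t in range(1, T):
        delta_row, psi_row = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[t - 1][i] * A[i][j])
            delta_row.append(delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]])
            psi_row.append(best_i)
        delta.append(delta_row)
        psi.append(psi_row)
    # 3. Termination
    last = max(range(n), key=lambda i: delta[-1][i])
    # 4. Backtracking through the stored backpointers
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[-1][last]


# Hypothetical parameters (states: 0 = Bull, 1 = Bear; symbols: 0 = up,
# 1 = down, 2 = unchanged). The numbers are illustrative only.
pi = [0.5, 0.5]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3]]

path, p_star = viterbi(pi, A, B, [0, 1, 1])   # path [0, 1, 1], P* about 0.03528
```

For up-down-down this toy model decodes Bull-Bear-Bear; ties between equally good predecessors are broken in favour of the lowest state index, which is a choice of this sketch.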

4      Learning
Given a set of examples from a process, we would like to be able to estimate the model
parameters $\lambda = (A, B, \pi)$ that best describe that process. There are two standard approaches to
this task, dependent on the form of the examples, which will be referred to here as supervised
and unsupervised training. If the training examples contain both the inputs and outputs of a
process, we can perform supervised training by equating inputs to observations, and outputs
to states, but if only the inputs are provided in the training data then we must use
unsupervised training to guess a model that may have produced those observations. In this section we
will discuss the supervised approach to training; for a discussion of the Baum-Welch algorithm
for unsupervised training see [5].
    The easiest solution for creating a model $\lambda$ is to have a large corpus of training examples,
each annotated with the correct classification. The classic example for this approach is PoS
tagging. We define two sets:
    • $t_1 \cdots t_N$ is the set of tags, which we equate to the HMM state set $s_1 \cdots s_N$
    • $w_1 \cdots w_M$ is the set of words, which we equate to the HMM observation set $v_1 \cdots v_M$
so with this model we frame part-of-speech tagging as decoding the most probable hidden
state sequence of PoS tags given an observation sequence of words. To determine the model
parameters $\lambda$, we can use maximum likelihood estimates (MLE) from a corpus containing
sentences tagged with their correct PoS tags. For the transition matrix we use:

$$a_{ij} = P(t_j \mid t_i) = \frac{\mathrm{Count}(t_i, t_j)}{\mathrm{Count}(t_i)} \tag{26}$$

where $\mathrm{Count}(t_i, t_j)$ is the number of times $t_j$ followed $t_i$ in the training data. For the
observation matrix:

$$b_j(k) = P(w_k \mid t_j) = \frac{\mathrm{Count}(w_k, t_j)}{\mathrm{Count}(t_j)} \tag{27}$$

where $\mathrm{Count}(w_k, t_j)$ is the number of times $w_k$ was tagged $t_j$ in the training data. And lastly
the initial probability distribution:

$$\pi_i = P(q_1 = t_i) = \frac{\mathrm{Count}(q_1 = t_i)}{\mathrm{Count}(q_1)} \tag{28}$$

In practice when estimating a HMM from counts it is normally necessary to apply smoothing
in order to avoid zero counts and improve the performance of the model on data not appearing
in the training set.
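
The count-based estimates of equations 26 to 28 can be sketched as below. Several things here are assumptions of the example rather than the text's method: the function names, the tiny hand-made corpus, the use of add-k smoothing (the text does not say which smoothing scheme to use), and the choice to normalise each transition row by its own total so that rows sum to one even for tags that end a sentence.

```python
from collections import Counter


def normalise(counts, rows, cols, k):
    """Turn counts[(row, col)] into one add-k smoothed distribution per row."""
    table = []
    for r in rows:
        total = sum(counts[(r, c)] for c in cols) + k * len(cols)
        table.append([(counts[(r, c)] + k) / total if total else 0.0 for c in cols])
    return table


def estimate_hmm(tagged_sentences, tags, words, k=0.0):
    """Estimate (A, B, pi) from sentences given as lists of (word, tag) pairs."""
    transitions, emissions, starts = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        if not sentence:
            continue
        starts[("q1", sentence[0][1])] += 1             # Count(q_1 = t_i)
        for word, tag in sentence:
            emissions[(tag, word)] += 1                 # Count(w_k, t_j)
        for (_, prev_tag), (_, tag) in zip(sentence, sentence[1:]):
            transitions[(prev_tag, tag)] += 1           # Count(t_i, t_j)
    A = normalise(transitions, tags, tags, k)
    B = normalise(emissions, tags, words, k)
    pi = normalise(starts, ["q1"], tags, k)[0]
    return A, B, pi


# A hypothetical three-sentence corpus, for illustration only.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]
tags = ["DET", "NOUN", "VERB"]
words = ["the", "dog", "cat", "dogs", "barks", "sleeps", "bark"]

A, B, pi = estimate_hmm(corpus, tags, words)              # unsmoothed MLE
A_s, B_s, pi_s = estimate_hmm(corpus, tags, words, k=1)   # add-one smoothing
```

Without smoothing the VERB row of `A` is all zeros because no tag ever follows a verb in this corpus; with `k=1` every row becomes a proper distribution with no zero entries.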

5      Multi-Dimensional Feature Space
A limitation of the model described is that observations are assumed to be single dimensional
features, but many tasks are most naturally modelled using a multi-dimensional feature space.
One solution to this problem is to use a multinomial model that assumes the features of the
observations are independent [4]:

$$v_k = (f_1, \cdots, f_N) \tag{29}$$

$$P(v_k \mid s_j) = \prod_{i=1}^{N} P(f_i \mid s_j) \tag{30}$$

   This model is easy to implement and computationally simple, but obviously many features
one might want to use are not independent. For many NLP systems it has been found that
flawed Bayesian independence assumptions can still be very effective.
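
As a small illustration of the factorised observation probability, the sketch below multiplies per-feature probabilities for one state. The feature set (capitalisation and a suffix class), the probability values and the names `FEATURE_PROBS` and `emission_probability` are all hypothetical.

```python
import math

# Hypothetical per-state feature distributions for a PoS-style example:
# FEATURE_PROBS[state][i][value] = P(f_i = value | state).
FEATURE_PROBS = {
    "NOUN": [
        {True: 0.3, False: 0.7},          # f_1: word is capitalised
        {"-ing": 0.1, "other": 0.9},      # f_2: suffix class
    ],
    "VERB": [
        {True: 0.05, False: 0.95},
        {"-ing": 0.4, "other": 0.6},
    ],
}


def emission_probability(features, state, feature_probs=FEATURE_PROBS):
    """P(v_k | s_j) as the product of independent per-feature probabilities."""
    return math.prod(
        feature_probs[state][i][value] for i, value in enumerate(features)
    )


p_noun = emission_probability((True, "other"), "NOUN")   # 0.3 * 0.9
p_verb = emission_probability((True, "other"), "VERB")   # 0.05 * 0.6
```

A function like this could stand in for the lookup $b_j(k)$ in the forward and Viterbi recursions, with the feature tuple playing the role of the observation symbol.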

6     Implementing HMMs
When implementing a HMM, floating-point underflow is a significant problem. It is apparent
that when applying the Viterbi or forward algorithms to long sequences the extremely small
probability values that would result could underflow on most machines. We solve this problem
differently for each algorithm:
 Viterbi underflow As the Viterbi algorithm only multiplies probabilities, a simple solution
    to underflow is to log all the probability values and then add values instead of multiplying.
    In fact, if all the values in the model matrices $(A, B, \pi)$ are stored logged, then at runtime
    only addition operations are needed.
 Forward algorithm underflow The forward algorithm sums probability values, so it is
     not a viable solution to log the values in order to avoid underflow. The most common
     solution to this problem is to use scaling coefficients that keep the probability values in
     the dynamic range of the machine, and that are dependent only on $t$. The coefficient $c_t$
     is defined as:

$$c_t = \frac{1}{\sum_{i=1}^{N} \alpha_t(i)}$$

     and thus the new scaled value for $\alpha$ becomes:

$$\hat{\alpha}_t(i) = c_t \times \alpha_t(i) = \frac{\alpha_t(i)}{\sum_{j=1}^{N} \alpha_t(j)}$$

     A similar coefficient can be computed for $\beta_t(i)$.
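
Both remedies can be sketched together. The code below is an illustration under stated assumptions: the function names `viterbi_log` and `forward_scaled` and the parameter values are invented here, each forward column is computed from the already scaled previous column, and the log-likelihood is recovered as $\log P(O \mid \lambda) = -\sum_t \log c_t$, a standard result (see [5]) that the text above does not state.

```python
import math


def _log(p):
    """Natural log that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi_log(pi, A, B, obs):
    """Viterbi in log space: products of probabilities become sums of logs."""
    n = len(pi)
    delta = [_log(pi[i]) + _log(B[i][obs[0]]) for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        best = [max(range(n), key=lambda i: delta[i] + _log(A[i][j]))
                for j in range(n)]
        delta = [delta[best[j]] + _log(A[best[j]][j]) + _log(B[j][o])
                 for j in range(n)]
        backpointers.append(best)
    state = max(range(n), key=lambda i: delta[i])
    log_prob, path = delta[state], [state]
    for best in reversed(backpointers):      # follow backpointers right to left
        state = best[state]
        path.append(state)
    return path[::-1], log_prob


def forward_scaled(pi, A, B, obs):
    """Forward algorithm with per-step scaling; returns log P(O | lambda)."""
    n = len(pi)
    log_likelihood = 0.0
    alpha_hat = None
    for t, o in enumerate(obs):
        if t == 0:
            alpha = [pi[i] * B[i][o] for i in range(n)]
        else:
            alpha = [sum(alpha_hat[i] * A[i][j] for i in range(n)) * B[j][o]
                     for j in range(n)]
        c = 1.0 / sum(alpha)     # c_t; assumes the prefix has non-zero probability
        alpha_hat = [c * a for a in alpha]
        log_likelihood -= math.log(c)
    return alpha_hat, log_likelihood


# Hypothetical parameters (states: 0 = Bull, 1 = Bear; symbols: 0 = up,
# 1 = down, 2 = unchanged). The numbers are illustrative only.
pi = [0.5, 0.5]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3]]

path, log_p_star = viterbi_log(pi, A, B, [0, 1, 1])
_, log_likelihood = forward_scaled(pi, A, B, [0, 1, 1])

long_obs = [0, 1, 1] * 1000                  # 3000 observations
_, long_log_p_star = viterbi_log(pi, A, B, long_obs)
_, long_log_likelihood = forward_scaled(pi, A, B, long_obs)
```

On the 3000-symbol sequence the raw probabilities would underflow to zero in double precision, whereas both log-domain results stay finite.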

[1] Huang et al. Spoken Language Processing. Prentice Hall PTR.
[2] L. Baum et al. A maximization technique occurring in the statistical analysis of
    probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41:164–171, 1970.
[3] A. Markov. An example of statistical investigation in the text of Eugene Onyegin,
    illustrating coupling of tests in chains. Proceedings of the Academy of Sciences of St. Petersburg,
[4] A. McCallum and K. Nigam. A comparison of event models for naive Bayes classification.
    In AAAI-98 Workshop on Learning for Text Categorization, 1998.
[5] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech
    recognition. Proceedings of the IEEE, 1989.


June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Artificial Intelligence and Electronic Warfare
Artificial Intelligence and Electronic WarfareArtificial Intelligence and Electronic Warfare
Artificial Intelligence and Electronic Warfare

Hmm Tutorial

  • 1. Hidden Markov Models
Phil Blunsom
August 19, 2004

Abstract

The Hidden Markov Model (HMM) is a popular statistical tool for modelling a wide range of time series data. In the context of natural language processing (NLP), HMMs have been applied with great success to problems such as part-of-speech tagging and noun-phrase chunking.

1 Introduction

The Hidden Markov Model (HMM) is a powerful statistical tool for modelling generative sequences that can be characterised by an underlying process generating an observable sequence. HMMs have found application in many areas of signal processing, in particular speech processing, but have also been applied with success to low-level NLP tasks such as part-of-speech tagging, phrase chunking, and extracting target information from documents. Andrei Markov gave his name to the mathematical theory of Markov processes in the early twentieth century [3], but it was Baum and his colleagues who developed the theory of HMMs in the 1960s [2].

Markov Processes. Figure 1 depicts an example of a Markov process. The model describes a stock market index. It has three states, Bull, Bear and Even, and three index observations: up, down and unchanged. The model is a finite state automaton with probabilistic transitions between states. Given a sequence of observations, for example up-down-down, we can easily verify that the state sequence that produced those observations was Bull-Bear-Bear, and the probability of the sequence is simply the product of the transitions, in this case 0.2 × 0.3 × 0.3.

[Figure 1: Markov process example [1]. States: Bull, Bear, Even; observations: up, down, unchanged.]

  • 2. Hidden Markov Models. Figure 2 shows an example of how the previous model can be extended into a HMM. The new model now allows all observation symbols to be emitted from each state with a finite probability. This change makes the model much more expressive and better able to represent our intuition, in this case that a bull market would have both good days and bad days, but there would be more good ones. The key difference is that now, if we have the observation sequence up-down-down, we cannot say exactly what state sequence produced these observations, and thus the state sequence is 'hidden'. We can, however, calculate the probability that the model produced the sequence, as well as which state sequence was most likely to have produced the observations. The next three sections describe the common calculations that we would like to be able to perform on a HMM.

[Figure 2: Hidden Markov model example [1]. States: Bull, Bear, Even; each state emits up, down and unchanged with its own probabilities.]

The formal definition of a HMM is as follows:

$$\lambda = (A, B, \pi) \tag{1}$$

$S$ is our state alphabet set, and $V$ is the observation alphabet set:

$$S = (s_1, s_2, \cdots, s_N) \tag{2}$$
$$V = (v_1, v_2, \cdots, v_M) \tag{3}$$

We define $Q$ to be a fixed state sequence of length $T$, and $O$ to be the corresponding observations:

$$Q = q_1, q_2, \cdots, q_T \tag{4}$$
$$O = o_1, o_2, \cdots, o_T \tag{5}$$

$A$ is a transition array, storing the probability of state $j$ following state $i$. Note that the state transition probabilities are independent of time:

$$A = [a_{ij}], \quad a_{ij} = P(q_t = s_j \mid q_{t-1} = s_i). \tag{6}$$

$B$ is the observation array, storing the probability of observation $k$ being produced from state $i$, independent of $t$:

$$B = [b_i(k)], \quad b_i(k) = P(o_t = v_k \mid q_t = s_i). \tag{7}$$

$\pi$ is the initial probability array:

$$\pi = [\pi_i], \quad \pi_i = P(q_1 = s_i). \tag{8}$$

Two assumptions are made by the model.
The first, called the Markov assumption, states that the current state depends only on the previous state; this represents the memory of the model:

$$P(q_t \mid q_1^{t-1}) = P(q_t \mid q_{t-1}) \tag{9}$$

The independence assumption states that the output observation at time $t$ depends only on the current state; it is independent of previous observations and states:

$$P(o_t \mid o_1^{t-1}, q_1^{t}) = P(o_t \mid q_t) \tag{10}$$
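The definitions above map onto a small data structure, and the two assumptions describe how a sequence is generated from it. The following is a minimal sketch, not the author's code: the numeric values of `A`, `B` and `PI` are illustrative assumptions chosen only so that each row sums to one (they are not taken from Figure 2), and the names `is_stochastic` and `sample` are hypothetical helpers.

```python
import random

# Illustrative parameters only: these numbers are assumptions chosen so that
# every row sums to 1; they are not read from Figure 2.
STATES = ["Bull", "Bear", "Even"]          # S = (s_1, ..., s_N)
SYMBOLS = ["up", "down", "unchanged"]      # V = (v_1, ..., v_M)

A = [[0.6, 0.2, 0.2],   # A[i][j] = P(q_t = s_j | q_{t-1} = s_i)
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2],   # B[i][k] = P(o_t = v_k | q_t = s_i)
     [0.1, 0.6, 0.3],
     [0.3, 0.3, 0.4]]
PI = [0.5, 0.2, 0.3]    # PI[i] = P(q_1 = s_i)


def is_stochastic(rows, tol=1e-9):
    """Check that every row is a probability distribution."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in rows
    )


def sample(A, B, pi, T, rng):
    """Draw a state sequence Q and an observation sequence O of length T."""
    n_states, n_symbols = len(A), len(B[0])
    q = rng.choices(range(n_states), weights=pi)[0]
    states, observations = [], []
    for _ in range(T):
        states.append(q)
        # Independence assumption: o_t depends only on q_t.
        observations.append(rng.choices(range(n_symbols), weights=B[q])[0])
        # Markov assumption: q_{t+1} depends only on q_t.
        q = rng.choices(range(n_states), weights=A[q])[0]
    return states, observations


if __name__ == "__main__":
    assert is_stochastic(A) and is_stochastic(B) and is_stochastic([PI])
    Q, O = sample(A, B, PI, 5, random.Random(0))
    print([STATES[q] for q in Q])
    print([SYMBOLS[o] for o in O])
```

Only the observation list would be visible to an observer of the process; the state list is the hidden part that the later sections try to recover.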
  • 3. [Figure 3: A trellis algorithm. States 1 to 4 at time steps t = 1, 2, 3, 4.]

2 Evaluation

Given a HMM and a sequence of observations, we would like to be able to compute $P(O \mid \lambda)$, the probability of the observation sequence given a model. This problem can be viewed as one of evaluating how well a model predicts a given observation sequence, and thus allows us to choose the most appropriate model from a set.

The probability of the observations $O$ for a specific state sequence $Q$ is:

$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1) \times b_{q_2}(o_2) \cdots b_{q_T}(o_T) \tag{11}$$

and the probability of the state sequence is:

$$P(Q \mid \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T} \tag{12}$$

so we can calculate the probability of the observations given the model as:

$$P(O \mid \lambda) = \sum_{Q} P(O \mid Q, \lambda) P(Q \mid \lambda) = \sum_{q_1 \cdots q_T} \pi_{q_1} b_{q_1}(o_1) a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) \tag{13}$$

This result allows the evaluation of the probability of $O$, but to evaluate it directly would be exponential in $T$.

A better approach is to recognise that many redundant calculations would be made by directly evaluating equation 13, and that caching calculations can therefore reduce the complexity. We implement the cache as a trellis of states at each time step, calculating the cached value (called $\alpha$) for each state as a sum over all states at the previous time step. $\alpha$ is the probability of the partial observation sequence $o_1, o_2 \cdots o_t$ and state $s_i$ at time $t$. This can be visualised as in figure 3. We define the forward probability variable:

$$\alpha_t(i) = P(o_1 o_2 \cdots o_t, q_t = s_i \mid \lambda) \tag{14}$$

so if we work through the trellis filling in the values of $\alpha$, the sum of the final column of the trellis will equal the probability of the observation sequence. The algorithm for this process is called the forward algorithm and is as follows:

1. Initialisation:

$$\alpha_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N. \tag{15}$$

2. Induction:

$$\alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \Big] b_j(o_{t+1}), \quad 1 \le t \le T - 1, \; 1 \le j \le N. \tag{16}$$
  • 4. 3. Termination:

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i). \tag{17}$$

[Figure 4: The induction step of the forward algorithm. The values $\alpha_t(1), \ldots, \alpha_t(N)$ at time $t$ are combined through $a_{1j}, \ldots, a_{Nj}$ into $\alpha_{t+1}(j)$.]

The induction step is the key to the forward algorithm and is depicted in figure 4. For each state $s_j$, $\alpha_t(j)$ stores the probability of arriving in that state having observed the observation sequence up until time $t$.

It is apparent that by caching $\alpha$ values the forward algorithm reduces the complexity of the calculations involved to $N^2 T$ rather than $2 T N^T$. We can also define an analogous backwards algorithm, which is the exact reverse of the forwards algorithm, with the backwards variable:

$$\beta_t(i) = P(o_{t+1} o_{t+2} \cdots o_T \mid q_t = s_i, \lambda) \tag{18}$$

as the probability of the partial observation sequence from $t + 1$ to $T$, starting in state $s_i$.

3 Decoding

The aim of decoding is to discover the hidden state sequence that was most likely to have produced a given observation sequence. One solution to this problem is to use the Viterbi algorithm to find the single best state sequence for an observation sequence. The Viterbi algorithm is another trellis algorithm which is very similar to the forward algorithm, except that the transition probabilities are maximised at each step, instead of summed. First we define:

$$\delta_t(i) = \max_{q_1, q_2, \cdots, q_{t-1}} P(q_1 q_2 \cdots q_t = s_i, o_1, o_2 \cdots o_t \mid \lambda) \tag{19}$$

as the probability of the most probable state path for the partial observation sequence. The Viterbi algorithm is as follows:

1. Initialisation:

$$\delta_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N, \quad \psi_1(i) = 0. \tag{20}$$

2. Recursion:

$$\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i) a_{ij}] \, b_j(o_t), \quad 2 \le t \le T, \; 1 \le j \le N, \tag{21}$$

$$\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i) a_{ij}], \quad 2 \le t \le T, \; 1 \le j \le N. \tag{22}$$
  • 5. 3. Termination:

$$P^* = \max_{1 \le i \le N} [\delta_T(i)] \tag{23}$$

$$q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)]. \tag{24}$$

4. Optimal state sequence backtracking:

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T - 1, T - 2, \cdots, 1. \tag{25}$$

[Figure 5: The recursion step of the Viterbi algorithm]

[Figure 6: The backtracking step of the Viterbi algorithm]

The recursion step is illustrated in figure 5. The main difference from the forward algorithm in the recursion step is that we are maximising, rather than summing, and storing the state that was chosen as the maximum for use as a backpointer. The backtracking step is shown in figure 6. The backtracking allows the best state sequence to be found from the backpointers stored in the recursion step, but it should be noted that there is no easy way to find the second best state sequence.
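The forward recursion of Section 2 and the Viterbi recursion above differ only in replacing a sum with a maximum and keeping backpointers, which a side-by-side sketch makes concrete. This is a minimal illustration in plain Python under stated assumptions: states and observations are 0-based integer indices, `A`, `B` and `pi` are nested lists, and the two-state model in the usage lines is invented for the example. The hypothetical helper `brute_force` evaluates equation 13 directly over all $N^T$ state sequences and is only there as a cross-check.

```python
import itertools


def forward(A, B, pi, obs):
    """Return P(O | lambda) by filling the alpha trellis (equations 15-17)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]        # initialisation
    for o in obs[1:]:                                        # induction
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                        # termination


def viterbi(A, B, pi, obs):
    """Return (P*, best state sequence) following equations 20-25."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]        # initialisation
    backpointers = []                                        # psi_t, t = 2..T
    for o in obs[1:]:                                        # recursion
        psi, new_delta = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            psi.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        backpointers.append(psi)
        delta = new_delta
    last = max(range(N), key=lambda i: delta[i])            # termination
    path = [last]
    for psi in reversed(backpointers):                       # backtracking
        path.append(psi[path[-1]])
    path.reverse()
    return delta[last], path


def joint(A, B, pi, states, obs):
    """P(O, Q | lambda) for one state sequence: a single term of equation 13."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p


def brute_force(A, B, pi, obs):
    """Sum and maximum over all N^T state sequences (exponential in T)."""
    terms = [joint(A, B, pi, q, obs)
             for q in itertools.product(range(len(pi)), repeat=len(obs))]
    return sum(terms), max(terms)


if __name__ == "__main__":
    # Invented two-state, two-symbol model, used only for this example.
    A = [[0.7, 0.3], [0.4, 0.6]]
    B = [[0.9, 0.1], [0.2, 0.8]]
    pi = [0.6, 0.4]
    obs = [0, 1, 1]
    print(forward(A, B, pi, obs), viterbi(A, B, pi, obs))
    print(brute_force(A, B, pi, obs))
```

On short sequences the two trellis results should agree with the brute-force sum and maximum up to floating-point rounding; on long sequences the raw probabilities become very small, so this sketch is not suitable as written for long inputs.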
  • 6. 4 Learning

Given a set of examples from a process, we would like to be able to estimate the model parameters $\lambda = (A, B, \pi)$ that best describe that process. There are two standard approaches to this task, dependent on the form of the examples, which will be referred to here as supervised and unsupervised training. If the training examples contain both the inputs and outputs of a process, we can perform supervised training by equating inputs to observations, and outputs to states, but if only the inputs are provided in the training data then we must use unsupervised training to guess a model that may have produced those observations. In this section we discuss the supervised approach to training; for a discussion of the Baum-Welch algorithm for unsupervised training see [5].

The easiest solution for creating a model $\lambda$ is to have a large corpus of training examples, each annotated with the correct classification. The classic example for this approach is PoS tagging. We define two sets:

      • $t_1 \cdots t_N$ is the set of tags, which we equate to the HMM state set $s_1 \cdots s_N$
      • $w_1 \cdots w_M$ is the set of words, which we equate to the HMM observation set $v_1 \cdots v_M$

so with this model we frame part-of-speech tagging as decoding the most probable hidden state sequence of PoS tags given an observation sequence of words. To determine the model parameters $\lambda$, we can use maximum likelihood estimates (MLE) from a corpus containing sentences tagged with their correct PoS tags. For the transition matrix we use:

$$a_{ij} = P(t_j \mid t_i) = \frac{Count(t_i, t_j)}{Count(t_i)} \tag{26}$$

where $Count(t_i, t_j)$ is the number of times $t_j$ followed $t_i$ in the training data. For the observation matrix:

$$b_j(k) = P(w_k \mid t_j) = \frac{Count(w_k, t_j)}{Count(t_j)} \tag{27}$$

where $Count(w_k, t_j)$ is the number of times $w_k$ was tagged $t_j$ in the training data.
And lastly the initial probability distribution:

$$\pi_i = P(q_1 = t_i) = \frac{Count(q_1 = t_i)}{Count(q_1)} \tag{28}$$

In practice, when estimating a HMM from counts it is normally necessary to apply smoothing in order to avoid zero counts and improve the performance of the model on data not appearing in the training set.

5 Multi-Dimensional Feature Space

A limitation of the model described is that observations are assumed to be single-dimensional features, but many tasks are most naturally modelled using a multi-dimensional feature space. One solution to this problem is to use a multinomial model that assumes the features of the observations are independent [4]:

$$v_k = (f_1, \cdots, f_N) \tag{29}$$

$$P(v_k \mid s_j) = \prod_{n=1}^{N} P(f_n \mid s_j) \tag{30}$$

This model is easy to implement and computationally simple, but obviously many features one might want to use are not independent. For many NLP systems it has been found that flawed Bayesian independence assumptions can still be very effective.
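The count-based estimates of Section 4 can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the function name `estimate_hmm`, the add-k smoothing constant `k`, and the toy corpus are all hypothetical; the transition denominator counts only occurrences of a tag that have a successor, so that each row of `A` sums to one; and a row with no observations and `k = 0` falls back to a uniform distribution, which is a design choice of this sketch. Every word in the corpus is assumed to appear in `words`.

```python
from collections import Counter


def estimate_hmm(tagged_sentences, tags, words, k=0.0):
    """Estimate (A, B, pi) from sentences given as lists of (word, tag) pairs.

    k is an add-k smoothing constant; k = 0 gives the unsmoothed estimates.
    """
    trans, emit, start = Counter(), Counter(), Counter()
    from_count, tag_count = Counter(), Counter()
    for sentence in tagged_sentences:
        if not sentence:
            continue
        start[sentence[0][1]] += 1                  # Count(q_1 = t_i)
        for word, tag in sentence:
            emit[(tag, word)] += 1                  # Count(w_k, t_j)
            tag_count[tag] += 1                     # Count(t_j)
        for (_, prev), (_, cur) in zip(sentence, sentence[1:]):
            trans[(prev, cur)] += 1                 # Count(t_i, t_j)
            from_count[prev] += 1                   # t_i with a successor

    def ratio(count, total, size):
        denom = total + k * size
        # Assumption of this sketch: an unseen row becomes uniform.
        return (count + k) / denom if denom > 0 else 1.0 / size

    A = [[ratio(trans[(ti, tj)], from_count[ti], len(tags)) for tj in tags]
         for ti in tags]
    B = [[ratio(emit[(t, w)], tag_count[t], len(words)) for w in words]
         for t in tags]
    n_sentences = sum(start.values())
    pi = [ratio(start[t], n_sentences, len(tags)) for t in tags]
    return A, B, pi


if __name__ == "__main__":
    # Toy corpus invented for the example.
    corpus = [
        [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
        [("the", "DET"), ("cat", "NOUN")],
    ]
    tags = ["DET", "NOUN", "VERB"]
    words = ["the", "dog", "cat", "barks"]
    A, B, pi = estimate_hmm(corpus, tags, words, k=1.0)
    for row in A:
        print(row)
```

With `k = 0` many entries are exactly zero, which is the situation the smoothing remark warns about; any positive `k` makes every transition and emission possible at the cost of biasing the estimates towards uniform.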
  • 7. 6 Implementing HMMs

When implementing a HMM, floating-point underflow is a significant problem. When the Viterbi or forward algorithm is applied to long sequences, the extremely small probability values that result can underflow on most machines. We solve this problem differently for each algorithm:

Viterbi underflow. As the Viterbi algorithm only multiplies probabilities, a simple solution to underflow is to take the logarithm of all the probability values and then add values instead of multiplying them. In fact, if all the values in the model matrices $(A, B, \pi)$ are stored as logarithms, then at runtime only addition operations are needed.

Forward algorithm underflow. The forward algorithm sums probability values, so it is not a viable solution to log the values in order to avoid underflow. The most common solution to this problem is to use scaling coefficients that keep the probability values in the dynamic range of the machine, and that are dependent only on $t$. The coefficient $c_t$ is defined as:

$$c_t = \frac{1}{\sum_{i=1}^{N} \alpha_t(i)} \tag{31}$$

and thus the new scaled value for $\alpha$ becomes:

$$\hat{\alpha}_t(i) = c_t \times \alpha_t(i) = \frac{\alpha_t(i)}{\sum_{i=1}^{N} \alpha_t(i)} \tag{32}$$

A similar coefficient can be computed for $\hat{\beta}_t(i)$.

References

[1] Huang et al. Spoken Language Processing. Prentice Hall PTR.
[2] L. Baum et al. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41:164–171, 1970.
[3] A. Markov. An example of statistical investigation in the text of Eugene Onyegin, illustrating coupling of tests in chains. Proceedings of the Academy of Sciences of St. Petersburg, 1913.
[4] A. McCallum and K. Nigam. A comparison of event models for naive Bayes classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
[5] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of IEEE, 1989.