Markov Models &
Hidden Markov Models
Time-based Models
• Simple parametric distributions are typically based
  on what is called the “independence assumption”:
  each data point is independent of the others, and
  there is no time-sequencing or ordering.
• What if the data has correlations based on its order,
  like a time-series?
States
• An atomic event is an assignment to every
  random variable in the domain.
• States are atomic events that can transfer
  from one to another
• Suppose a model has n states
• A state-transition diagram describes how the
  model behaves
State-transition
[Figure: state-transition diagram]
The following assumptions are made:
       - Transition probabilities are stationary
       - The event space does not change over time
       - Probability distribution over next states depends
         only on the current state (the Markov Assumption)
Markov random processes
• A random sequence has the Markov property if the
  distribution of its next state is determined solely by
  its current state.
• Any random process having this property is called a
  Markov random process.
• A system with states that obey the Markov
  assumption is called a Markov Model.
• A sequence of states resulting from such a model is
  called a Markov Chain.
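Chain Rule & Markov Property
    Chain rule (repeated use of the definition of conditional probability)

\[
\begin{aligned}
P(q_t, q_{t-1}, \ldots, q_1) &= P(q_t \mid q_{t-1}, \ldots, q_1)\, P(q_{t-1}, \ldots, q_1) \\
&= P(q_t \mid q_{t-1}, \ldots, q_1)\, P(q_{t-1} \mid q_{t-2}, \ldots, q_1)\, P(q_{t-2}, \ldots, q_1) \\
&= P(q_1) \prod_{i=2}^{t} P(q_i \mid q_{i-1}, \ldots, q_1)
\end{aligned}
\]

    Markov property

\[ P(q_i \mid q_{i-1}, \ldots, q_1) = P(q_i \mid q_{i-1}) \quad \text{for } i > 1 \]

\[ P(q_t, q_{t-1}, \ldots, q_1) = P(q_1) \prod_{i=2}^{t} P(q_i \mid q_{i-1}) = P(q_1)\, P(q_2 \mid q_1) \cdots P(q_t \mid q_{t-1}) \]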
Markov Assumption
• The Markov assumption states that the
  probability of word wi occurring at time t
  depends only on the word wi-1 that occurred
  at time t-1
  – Chain rule:

\[ P(w_1, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_1, \ldots, w_{i-1}) \]

  – Markov assumption:

\[ P(w_1, \ldots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1}) \]
Andrei Andreyevich Markov

Born: 14 June 1856 in Ryazan, Russia
Died: 20 July 1922 in Petrograd (now St Petersburg),
Russia
Markov is particularly remembered for his study of
Markov chains, sequences of random variables in
which the future variable is determined by the
present variable but is independent of the way in
which the present state arose from its predecessors.
This work launched the theory of stochastic
processes.
A Markov System
• Has N states, called s1, s2 .. sN (in the example, N=3)
• There are discrete timesteps, t=0, t=1, …
• On the t’th timestep the system is in exactly
  one of the available states. Call it qt
  Note: qt ∈ {s1, s2 .. sN}
• Between each timestep, the next state is
  chosen randomly.
• The current state determines the probability
  distribution for the next state.
• Example: at t=0 the current state is q0=s3; at t=1 it is q1=s2

[Figure: three states s1, s2, s3; transition probabilities are often notated with arcs between states]

Transition probabilities P(qt+1 = column | qt = row):

               to s1   to s2   to s3
    from s1     0       0       1
    from s2     1/2     1/2     0
    from s3     1/3     2/3     0

Markov Property
qt+1 is conditionally independent of { qt-1, qt-2, … q1, q0 } given qt.
In other words:
    P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history)
Notation:
    Transition probability:  aij = P(qt+1 = sj | qt = si)
    Initial probability:     πi = P(q1 = si)
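The three-state system above can be simulated in a few lines. This is a minimal Python sketch, not part of the original slides: the transition values are the ones listed for s1, s2 and s3, while the names `TRANSITIONS` and `simulate_chain` and the fixed random seed are illustrative choices.

```python
import random

# TRANSITIONS[current][next] = P(q_{t+1} = next | q_t = current),
# read off the three-state example.
TRANSITIONS = {
    "s1": {"s1": 0.0, "s2": 0.0, "s3": 1.0},
    "s2": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
    "s3": {"s1": 1 / 3, "s2": 2 / 3, "s3": 0.0},
}


def simulate_chain(start, steps, rng):
    """Return [q_0, ..., q_steps]; each next state depends only on the current one."""
    path = [start]
    for _ in range(steps):
        row = TRANSITIONS[path[-1]]
        states = list(row)
        weights = [row[s] for s in states]
        path.append(rng.choices(states, weights=weights, k=1)[0])
    return path


rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatable runs
print(simulate_chain("s3", 10, rng))
```

Because the sampler only ever looks at `path[-1]`, the Markov property is built into the loop: no earlier history is consulted.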
Example: A Simple Markov Model For
              Weather Prediction
• Any given day, the weather can be described as being in
  one of three states:
   – State 1: precipitation (rain, snow, hail, etc.)
   – State 2: cloudy
   – State 3: sunny
  Transitions between states are described by the
  transition matrix




    This model can then be described by the
    following directed graph
Basic Calculations
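• Example: What is the probability that the
  weather for eight consecutive days is
  “sun-sun-sun-rain-rain-sun-cloudy-sun”?
• Solution:
   – O = sun sun sun rain rain sun cloudy sun
   – States: 3 3 3 1 1 3 2 3
   – By the Markov property,
     P(O | model) = P(q1 = 3) · a33 · a33 · a31 · a11 · a13 · a32 · a23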
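The calculation can be sketched in Python as follows. The transition matrix from the slide did not survive in this transcript, so `A` below holds placeholder values chosen only for illustration, and `PI` assumes the first day is known to be sunny; neither comes from the slides.

```python
# Placeholder transition matrix (NOT the slide's own values).
# A[i][j] = P(tomorrow is state j+1 | today is state i+1), with
# states 1 = precipitation, 2 = cloudy, 3 = sunny.
A = [
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
]
PI = [0.0, 0.0, 1.0]  # assumption: day 1 is sunny with certainty


def sequence_probability(states, pi, a):
    """P(q_1..q_T) = pi[q_1] * product over t of a[q_{t-1}][q_t]; states are 1-based."""
    prob = pi[states[0] - 1]
    for prev, cur in zip(states, states[1:]):
        prob *= a[prev - 1][cur - 1]
    return prob


O = [3, 3, 3, 1, 1, 3, 2, 3]  # sun sun sun rain rain sun cloudy sun
print(sequence_probability(O, PI, A))
```

With these placeholder numbers the product is 0.8 · 0.8 · 0.1 · 0.4 · 0.3 · 0.1 · 0.2, about 1.536e-4; a different matrix gives a different value.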
From Markov To Hidden Markov
• The previous model assumes that each state can be uniquely
  associated with an observable event
   – Once an observation is made, the state of the system is then trivially
     retrieved
   – This model, however, is too restrictive to be of practical use for most
     realistic problems
• To make the model more flexible, we will assume that the
  outcomes or observations of the model are a probabilistic
  function of each state
   – Each state can produce a number of outputs according to a unique
     probability distribution, and each distinct output can potentially be
     generated at any state
   – These are known as Hidden Markov Models (HMM) because the state
     sequence is not directly observable; it can only be estimated from
     the sequence of observations produced by the system
The coin-toss problem
• To illustrate the concept of an HMM consider the following
  scenario
    – Assume that you are placed in a room with a curtain
    – Behind the curtain there is a person performing a coin-toss
      experiment
    – This person selects one of several coins, and tosses it: heads (H) or
      tails (T)
    – The person tells you the outcome (H,T), but not which coin was used
      each time
• Your goal is to build a probabilistic model that best explains
  a sequence of observations O={o1,o2,o3,o4,…}={H,T,T,H,…}
    – The coins represent the states; these are hidden because you do not
      know which coin was tossed each time
    – The outcome of each toss represents an observation
    – A “likely” sequence of coins may be inferred from the observations,
      but this state sequence will not be unique
The Coin Toss Example – 1 coin
• With a single coin, the Markov model is observable, since there is only one coin and nothing is hidden
• In fact, we may describe the system with a deterministic model where the states are
  the actual observations (see figure)
• The model parameter P(H) may be estimated as the fraction of tosses that come up heads
• O = H H H T T H…
• S = 1 1 1 2 2 1…
The Coin Toss Example – 2 coins
From Markov to Hidden Markov Model:
The Coin Toss Example – 3 coins
Hidden model
• As spectators, we cannot tell which coin is
  being used; all we can observe is the output
  (head/tail)
• We assume the outputs are governed by each coin's
  tendency, i.e. its output probabilities
Coin Toss Example
[Figure: a chain of L hidden state variables C1, C2, …, Ci, …, CL-1, CL (the coins), each producing one item of observed data P1, P2, …, Pi, …, PL-1, PL (the “output”: heads/tails)]
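The generative picture (a hidden coin is chosen, then that coin produces a visible toss) can be sketched in Python. All numbers below, along with the names `COINS`, `SWITCH`, `P_HEADS` and `toss_sequence`, are made up for illustration; the slides give no values.

```python
import random

# Illustrative, made-up parameters for a three-coin model.
COINS = ["coin1", "coin2", "coin3"]
INITIAL = {"coin1": 1 / 3, "coin2": 1 / 3, "coin3": 1 / 3}
SWITCH = {  # SWITCH[c][d] = P(next coin is d | current coin is c)
    "coin1": {"coin1": 0.6, "coin2": 0.2, "coin3": 0.2},
    "coin2": {"coin1": 0.3, "coin2": 0.5, "coin3": 0.2},
    "coin3": {"coin1": 0.1, "coin2": 0.1, "coin3": 0.8},
}
P_HEADS = {"coin1": 0.5, "coin2": 0.75, "coin3": 0.25}


def pick(dist, rng):
    """Draw one key of `dist` with probability equal to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def toss_sequence(length, rng):
    """Return (hidden coins, observed outcomes), both of the given length."""
    coins, outcomes = [], []
    coin = pick(INITIAL, rng)
    for _ in range(length):
        coins.append(coin)
        outcomes.append("H" if rng.random() < P_HEADS[coin] else "T")
        coin = pick(SWITCH[coin], rng)
    return coins, outcomes


hidden, observed = toss_sequence(8, random.Random(0))
print("".join(observed))  # what the spectator sees
print(hidden)             # what stays behind the curtain
```

A spectator only receives `observed`; the `hidden` list is exactly the state sequence an HMM has to infer.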
Hidden Markov Models
• Used when states cannot be observed directly; good
  for noisy data

• Requirements:
   – A finite number of states, each with an output probability
     distribution
   – State transition probabilities
   – Observed phenomenon, which can be randomly generated
     given state-associated probabilities.
HMM Notation (from Rabiner’s Survey*)

*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol.77, No.2, pp.257--286, 1989.

The states are labeled S1 S2 .. SN

For a particular trial….
  Let T be the number of observations
      T is also the number of states passed through
      O = O1 O2 .. OT is the sequence of observations
      Q = q1 q2 .. qT is the notation for a path of states

λ = 〈N,M,{πi},{aij},{bi(j)}〉 is the specification of an HMM
HMM Formal Definition
An HMM, λ, is a 5-tuple consisting of
• N, the number of states
• M, the number of possible observations
• {π1, π2, .. πN}, the starting state probabilities
         P(q1 = Si) = πi
• The state transition probabilities P(qt+1=Sj | qt=Si) = aij
    a11    a12    …    a1N
    a21    a22    …    a2N
     :      :           :
    aN1    aN2    …    aNN
• The observation probabilities P(Ot=k | qt=Si) = bi(k)
    b1(1)  b1(2)  …    b1(M)
    b2(1)  b2(2)  …    b2(M)
     :      :           :
    bN(1)  bN(2)  …    bN(M)
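One way to hold such a specification in code is sketched below. The `HMM` class and its `validate` method are hypothetical helpers, not something the slides define; indices are 0-based, so `a[i][j]` stands for the probability of moving from state S(i+1) to state S(j+1).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMM:
    pi: List[float]        # pi[i]   = probability of starting in state i
    a: List[List[float]]   # a[i][j] = P(next state is j | current state is i)
    b: List[List[float]]   # b[i][k] = P(observing symbol k | state is i)

    @property
    def n(self) -> int:    # N, the number of states
        return len(self.pi)

    @property
    def m(self) -> int:    # M, the number of possible observations
        return len(self.b[0])

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError unless shapes match and every row is a distribution."""
        if len(self.a) != self.n or any(len(r) != self.n for r in self.a):
            raise ValueError("a must be N x N")
        if len(self.b) != self.n or any(len(r) != self.m for r in self.b):
            raise ValueError("b must be N x M")
        for row in [self.pi] + self.a + self.b:
            if any(p < 0 for p in row) or abs(sum(row) - 1.0) > tol:
                raise ValueError(f"not a probability distribution: {row}")


toy = HMM(pi=[0.6, 0.4], a=[[0.7, 0.3], [0.4, 0.6]], b=[[0.9, 0.1], [0.2, 0.8]])
toy.validate()
print(toy.n, toy.m)
```

N and M are derived from the array shapes rather than stored separately, which avoids the two getting out of sync.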
Assumptions
• Markov assumption
  – Each state depends only on the previous state
    (“memoryless”)
• Stationary assumption
  – Transition probabilities are independent of time
• Output independence
  – Each observation depends only on the current state,
    not on previous observations
The three main questions on HMMs
• Evaluation
  – What is the probability that the observations were
    generated by a given model?
• Decoding
  – Given a model and a sequence of observations, what is the
    most likely state sequence?
• Learning:
  – Given a model and a sequence of observations, how
    should we modify the model parameters to maximize
    P(observations | model)?
The three main questions on HMMs

1. Evaluation

    GIVEN an HMM M, and a sequence x,
    FIND Prob[ x | M ]

2. Decoding

    GIVEN an HMM M, and a sequence x,
    FIND the sequence π of states that maximizes P[ x, π | M ]
    (here π denotes a path of states, not the initial state distribution)

3. Learning

    GIVEN an HMM M, with unspecified transition/emission probs.,
          and a sequence x,
    FIND parameters θ = (bi(.), aij) that maximize P[ x | θ ]
Let’s not be confused by notation
P[ x | M ]:      The probability that sequence x was generated by
                 the model

                 The model is: architecture (#states, etc)
                              + parameters θ = aij, bi(.)

So, P[ x | θ ], and P[ x ] are the same, when the architecture, and
   the entire model, respectively, are implied

Similarly, P[ x, π | M ] and P[ x, π ] are the same

In the LEARNING problem we always write P[ x | θ ] to emphasize
    that we are seeking the θ that maximizes P[ x | θ ]
HMMs
Description

        Specification of an HMM
• N - the number of states
  – Q = {q1, q2, …, qN} - set of states
• M - the number of symbols (observables)
  – O = {o1, o2, …, oM} - set of symbols
• A - the state transition probability matrix
   – aij = P(qt+1 = j | qt = i)
• B - the observation probability distribution
   – bj(k) = P(ot = k | qt = j), 1 ≤ k ≤ M
• π - the initial state distribution
Central problems in HMM modelling
• Problem 1
  Evaluation:
  – Probability of occurrence of a particular
    observation sequence, O = {o1,…,ok}, given the
    model
  – P(O|λ)
  – Complicated by the hidden states
  – Useful in sequence classification
• Problem 2
  Decoding:
  – Optimal state sequence to produce the given
    observations, O = {o1,…,ok}, given the model
  – Requires an optimality criterion
  – Useful in recognition problems
• Problem 3
  Learning:
  – Determine the optimum model, given a training set of
    observations
  – Find λ, such that P(O|λ) is maximal
Task: Part-Of-Speech Tagging
• Goal: Assign the correct part-of-speech to
  each word (and punctuation) in a text.
• Example:
  Two    old   men    bet    on     the   game    .
  CRD    ADJ   NN     VBD    Prep   Det   NN     SYM


• Learn a local model of POS dependencies,
  usually from pretagged data
Hidden Markov Models
  • Assume: POS generated as random process,
    and each POS randomly generates a word

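[Figure: state diagram over the tags Det, ADJ, NN and NNS, with transition probabilities on the arcs between tags and emitted words (“a”, “the”, “cat”, “bet”, “cats”, “men”) attached to the tags]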
HMMs For Tagging
• First-order (bigram) Markov assumptions:
  – Limited Horizon: Tag depends only on previous tag
    P(ti+1 = tk | t1=tj1,…,ti=tji) = P(ti+1 = tk | ti = tj)
  – Time invariance: No change over time
     P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj → tk)
• Output probabilities:
  – Probability of getting word wk for tag tj: P(wk | tj)
  – Assumption:
    Not dependent on other tags or words!
Combining Probabilities
• Probability of a tag sequence:
P(t1t2…tn) = P(t1)P(t1→t2)P(t2→t3)…P(tn-1→tn)
Assume t0 – starting tag:
        = P(t0→t1)P(t1→t2)P(t2→t3)…P(tn-1→tn)

• Prob. of word sequence and tag sequence:
   P(W,T) = Πi P(ti-1→ti) P(wi | ti)
Training from labeled training data
• Labeled training data = each word has a POS tag
• Thus:
  π(tj) = PMLE(tj) = C(tj) / N
  a(tj→tk) = PMLE(tj→tk) = C(tj, tk) / C(tj)
  b(wk | tj) = PMLE(wk | tj) = C(tj:wk) / C(tj)

• Smoothing applies as usual
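Counting-based estimation from tagged text can be sketched as below. This is an unsmoothed illustration under stated assumptions: `train_mle` and the `"<s>"` start tag are hypothetical names, the start tag plays the role of t0, and the emission estimate is taken as P(word | tag), the output probability defined on the “HMMs For Tagging” slide.

```python
from collections import Counter


def train_mle(tagged_sentences, start_tag="<s>"):
    """Relative-frequency estimates from sentences of (word, tag) pairs.

    Returns (a, b) where a[(tj, tk)] estimates P(tj -> tk) and
    b[(tj, wk)] estimates P(wk | tj). No smoothing is applied.
    """
    tag_count = Counter()   # C(tj), including the start tag
    pair_count = Counter()  # C(tj, tk): tag tj immediately followed by tk
    emit_count = Counter()  # C(tj:wk): word wk tagged as tj
    for sentence in tagged_sentences:
        prev = start_tag
        tag_count[start_tag] += 1
        for word, tag in sentence:
            pair_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    a = {(tj, tk): c / tag_count[tj] for (tj, tk), c in pair_count.items()}
    b = {(tj, wk): c / tag_count[tj] for (tj, wk), c in emit_count.items()}
    return a, b


corpus = [
    [("the", "Det"), ("cat", "NN")],
    [("the", "Det"), ("men", "NNS")],
]
a, b = train_mle(corpus)
print(a[("Det", "NN")], b[("Det", "the")])
```

Any (tag, tag) or (tag, word) pair missing from the training data is simply absent from the dictionaries, which is where smoothing would come in.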
Three Basic Problems
•   Compute the probability of a text:
          Pλ(W1,N)
•   Compute maximum probability tag
    sequence:
          arg maxT1,N Pλ(T1,N | W1,N)
•   Compute maximum likelihood model
         arg max λ Pλ(W1,N)
Problem 1: Naïve solution

 • State sequence Q = (q1,…,qT)
 • Assume independent observations:

\[ P(O \mid q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T) \]

NB Observations are mutually independent, given the
hidden states. (The joint distribution of independent
variables factorises into the marginal distributions of the
independent variables.)

 • Observe that:

\[ P(q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T} \]

 • And that:

\[ P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda) \]

 • Finally get:

\[ P(O \mid \lambda) = \sum_{q} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) \]

 NB:
 - The above sum is over all state paths
 - There are N^T state paths, each ‘costing’
   O(T) calculations, leading to O(T N^T)
   time complexity.
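The naïve sum over all state paths can be written directly, which makes the exponential cost visible. This is a sketch on a made-up two-state, two-symbol model; `likelihood_brute_force` and the toy parameters are illustrative assumptions, with states and symbols encoded as 0-based integers.

```python
from itertools import product


def likelihood_brute_force(obs, pi, a, b):
    """P(O | lambda) as the sum of P(O | q) P(q) over every state path q."""
    n = len(pi)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):  # N**T paths
        p = pi[path[0]] * b[path[0]][obs[0]]
        for t in range(1, len(obs)):                 # O(T) work per path
            p *= a[path[t - 1]][path[t]] * b[path[t]][obs[t]]
        total += p
    return total


# Made-up toy model.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]
print(likelihood_brute_force([0, 1], pi, a, b))
```

For T observations the outer loop runs N^T times, so this is only usable as a reference implementation for checking faster algorithms on short sequences.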
Problem 1: Efficient solution

Forward algorithm:
• Define auxiliary forward variable α:

\[ \alpha_t(i) = P(o_1, \ldots, o_t,\ q_t = i \mid \lambda) \]

  αt(i) is the probability of observing the partial sequence of
  observables o1,…,ot and being in state qt=i at time t
• Recursive algorithm:
   – Initialise:

\[ \alpha_1(i) = \pi_i\, b_i(o_1) \]

   – Calculate:

\[ \alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big]\, b_j(o_{t+1}) \]

     (partial obs seq to t AND state i at t) x (transition to j at t+1) x (sensor);
     a sum, as j can be reached from any preceding state
   – Obtain:

\[ P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \]

     α incorporates the partial obs seq to t, so this is the sum of the
     different ways of getting the obs seq
• Complexity is O(N²T)
Forward Algorithm
Define αk(i) = P(w1,k, tk=ti)

1. For i = 1 To N:
              α1(i) = a(t0→ti) b(w1 | ti)
2. For k = 2 To T; For j = 1 To N:
              αk(j) = [Σi αk-1(i) a(ti→tj)] b(wk | tj)
3. Then:
              Pλ(W1,T) = Σi αT(i)

Complexity = O(N² T)
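The three steps translate almost line for line into Python. The sketch below uses a made-up two-state model with 0-based integer states and symbols (`forward`, `pi`, `a` and `b` are illustrative names), and it applies no scaling, so very long sequences would underflow.

```python
def forward(obs, pi, a, b):
    """Return P(O | lambda) using the forward recursion in O(N^2 T) time."""
    n = len(pi)
    # Initialise: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * b[i][obs[0]] for i in range(n)]
    # Calculate: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * a[i][j] for i in range(n)) * b[j][o]
            for j in range(n)
        ]
    # Obtain: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)


# Made-up toy model.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]
print(forward([0, 1], pi, a, b))
```

Only the current column of α values is kept, since each step needs nothing older than the previous column.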
[Figure: forward trellis for Pλ(W1,3). Columns w1, w2, w3 each list the tags t1 to t5 with their values α1(i), α2(i), α3(i); arcs a(t0→ti) enter the first column, and arcs a(t1→t1), a(t2→t1), …, a(t5→t1) lead from each column into t1 of the next.]
Problem 1: Alternative solution
Backward algorithm:
• Define auxiliary backward variable β:

\[ \beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda) \]

   βt(i) – the probability of observing the sequence of
   observables ot+1,…,oT given state qt = i at time t, and λ
• Recursive algorithm:
   – Initialise:

\[ \beta_T(j) = 1 \]

   – Calculate:

\[ \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad t = T-1, \ldots, 1 \]

   – Terminate:

\[ P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) \]

• Complexity is O(N²T)
Backward Algorithm
Define βk(i) = P(wk+1,T | tk=ti) --note the difference!

1. For i = 1 To N:
              βT(i) = 1
2. For k = T-1 To 1; For j = 1 To N:
              βk(j) = Σi [a(tj→ti) b(wk+1 | ti) βk+1(i)]
3. Then:
              Pλ(W1,T) = Σi a(t0→ti) b(w1 | ti) β1(i)

Complexity = O(N² T)
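A matching Python sketch of the backward pass is given below, again on a made-up two-state model with 0-based integer states and symbols (`backward` and the toy parameters are illustrative assumptions). The final step weights β1(i) by the initial probability and the emission of the first observation, as in step 3 above.

```python
def backward(obs, pi, a, b):
    """Return P(O | lambda) using the backward recursion in O(N^2 T) time."""
    n = len(pi)
    # Initialise: beta_T(i) = 1
    beta = [1.0] * n
    # Calculate, for t = T-1 down to 1:
    #   beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for o in reversed(obs[1:]):
        beta = [
            sum(a[i][j] * b[j][o] * beta[j] for j in range(n))
            for i in range(n)
        ]
    # Terminate: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * b[i][obs[0]] * beta[i] for i in range(n))


# Made-up toy model.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]
print(backward([0, 1], pi, a, b))
```

On any model and sequence this should agree with the forward computation of the same likelihood, which makes the pair a convenient mutual check.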
[Figure: backward trellis for Pλ(W1,3). Columns w1, w2, w3 each list the tags t1 to t5 with their values β1(i), β2(i), β3(i); arcs a(t0→ti) enter the first column, and arcs a(t1→t1), a(t2→t1), …, a(t5→t1) connect consecutive columns.]
Viterbi Algorithm (Decoding)
• Most probable tag sequence given text:
  T* = arg maxT Pλ(T | W)
     = arg maxT Pλ(W | T) Pλ(T) / Pλ(W)
           (Bayes’ Theorem)
     = arg maxT Pλ(W | T) Pλ(T)
           (W is constant for all T)
     = arg maxT Πi [a(ti-1→ti) b(wi | ti)]
     = arg maxT Σi log[a(ti-1→ti) b(wi | ti)]
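[Figure: trellis with start state t0 and candidate tags t1, t2, t3 for each of the words w1, w2, w3]

Transition table A(row, column), from the row tag to the column tag, and output table B(tag, word):

     A(,)    t1      t2      t3          B(,)   w1     w2      w3
     t0      0.005   0.02    0.1         t1     0.2    0.005   0.005
     t1      0.02    0.1     0.005       t2     0.02   0.2     0.0005
     t2      0.5     0.0005  0.0005      t3     0.02   0.02    0.05
     t3      0.05    0.05    0.005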
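[Figure: the same trellis in log space (base-10 logs). Arcs carry log transition probabilities: -2.3, -1.7 and -1 from t0 into t1, t2, t3; -1.7, -0.3 and -1.3 into t1 from t1, t2, t3.]

Best cumulative log probability of reaching each tag:

             w1      w2      w3
     t1     -3      -6      -7.3
     t2     -3.4    -4.7    -10.3
     t3     -2.7    -6.7    -9.3

     -log A  t1    t2    t3          -log B  w1    w2    w3
     t0      2.3   1.7   1           t1      0.7   2.3   2.3
     t1      1.7   1     2.3         t2      1.7   0.7   3.3
     t2      0.3   3.3   3.3         t3      1.7   1.7   1.3
     t3      1.3   1.3   2.3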
Viterbi Algorithm
1.    D(0, START) = 0
2.    for each tag t != START do: D(0, t) = -∞
3.    for i ← 1 to N do:
    a. for each tag tj do:
        D(i, tj) ← maxk [ D(i-1, tk) + log b(wi|tj) + log a(tk→tj) ]
4.    log P(W,T) = maxj D(N, tj)

Step 3a is the log-space form of D(i, tj) ← maxk D(i-1, tk) b(wi|tj) a(tk→tj).
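The recursion can be tried on the three-word example, using the A and B tables given with the trellis. The sketch below is one possible Python rendering: `viterbi`, the dictionary layout and the use of base-10 logs (chosen to match the trellis values) are assumptions of this example, and back-pointers are added so the best tag path can be read off.

```python
import math

TAGS = ["t1", "t2", "t3"]
# A[prev][tag] and B[tag][word], copied from the example tables.
A = {
    "t0": {"t1": 0.005, "t2": 0.02, "t3": 0.1},
    "t1": {"t1": 0.02, "t2": 0.1, "t3": 0.005},
    "t2": {"t1": 0.5, "t2": 0.0005, "t3": 0.0005},
    "t3": {"t1": 0.05, "t2": 0.05, "t3": 0.005},
}
B = {
    "t1": {"w1": 0.2, "w2": 0.005, "w3": 0.005},
    "t2": {"w1": 0.02, "w2": 0.2, "w3": 0.0005},
    "t3": {"w1": 0.02, "w2": 0.02, "w3": 0.05},
}


def viterbi(words, start="t0"):
    """Return (best tag path, its log10 joint probability with the words)."""
    # d[tag] = best log10 probability of any tag path ending in `tag`
    d = {t: math.log10(A[start][t] * B[t][words[0]]) for t in TAGS}
    backpointers = []
    for w in words[1:]:
        new_d, ptr = {}, {}
        for tj in TAGS:
            best_prev = max(TAGS, key=lambda tk: d[tk] + math.log10(A[tk][tj]))
            new_d[tj] = (d[best_prev] + math.log10(A[best_prev][tj])
                         + math.log10(B[tj][w]))
            ptr[tj] = best_prev
        d = new_d
        backpointers.append(ptr)
    # Follow the back-pointers from the best final tag.
    last = max(TAGS, key=lambda t: d[t])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, d[last]


path, score = viterbi(["w1", "w2", "w3"])
print(path, round(score, 1))
```

The best final score is log10(5e-8), about -7.3, in agreement with the trellis. The first tag on the best path is a tie between t1 and t3 in this example (both give 1e-4 on reaching t2), so which one is reported depends on floating-point rounding.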
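Question: Suppose the sequence of our game is:
                  HHHTHHHTTHHTH?

[Figure: two hidden states, fair and loaded, each emitting Heads or Tails; start enters each state with probability 0.5; arcs labelled 0.9 at each state and 0.1 between the two states]

     What is the probability of the sequence given the model?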
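A forward-algorithm sketch for this question follows. Two things are assumptions here rather than facts from the slide: the arcs are read as "stay in the same state with 0.9, switch with 0.1", and the emission probabilities in `EMIT` are invented, since the figure's emission values are not recoverable. The resulting number therefore only illustrates the procedure.

```python
STATES = ["fair", "loaded"]
START = {"fair": 0.5, "loaded": 0.5}
# Assumed reading of the figure: 0.9 to stay, 0.1 to switch.
TRANS = {
    "fair": {"fair": 0.9, "loaded": 0.1},
    "loaded": {"fair": 0.1, "loaded": 0.9},
}
# Invented emission probabilities (not given in the source).
EMIT = {
    "fair": {"H": 0.5, "T": 0.5},
    "loaded": {"H": 0.75, "T": 0.25},
}


def sequence_likelihood(seq):
    """P(seq | model), summing over all hidden fair/loaded paths."""
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for o in seq[1:]:
        alpha = {
            j: sum(alpha[i] * TRANS[i][j] for i in STATES) * EMIT[j][o]
            for j in STATES
        }
    return sum(alpha.values())


print(sequence_likelihood("HHHTHHHTTHHTH"))
```

Swapping in the real emission probabilities, once known, only requires editing `EMIT`.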
Decoding
• Suppose we have a text written by Shakespeare
  and a monkey. Can we tell who wrote what?

• Text: Shakespeare or Monkey?

• Case 1:
  – Fehwufhweuromeojulietpoisonjigjreijge
• Case 2:
  – mmmmbananammmmmmmbananammm

More Related Content

What's hot

Hidden Markov Model
Hidden Markov Model Hidden Markov Model
Hidden Markov Model
Mahmoud El-tayeb
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
ananth
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
Arshad Farhad
 
Linear models for classification
Linear models for classificationLinear models for classification
Linear models for classification
Sung Yub Kim
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Hmm viterbi

  • 1. Markov Models & Hidden Markov Models
  • 2. Time-based Models
      • Simple parametric distributions are typically based on the “independence assumption”: each data point is independent of the others, and there is no time-sequencing or ordering.
      • What if the data has correlations based on its order, as in a time series?
  • 3. States
      • An atomic event is an assignment of a value to every random variable in the domain.
      • States are atomic events between which the system can move.
      • Suppose a model has n states.
      • A state-transition diagram describes how the model behaves.
  • 4. State-transition
      The following assumptions are made:
      – Transition probabilities are stationary
      – The event space does not change over time
      – The probability distribution over next states depends only on the current state
  • 5. State-transition
      The following assumptions are made:
      – Transition probabilities are stationary
      – The event space does not change over time
      – The probability distribution over next states depends only on the current state
      Markov Assumption
  • 6. Markov random processes
      • A random sequence has the Markov property if the distribution of its next state is determined solely by its current state.
      • Any random process having this property is called a Markov random process.
      • A system with states that obey the Markov assumption is called a Markov Model.
      • A sequence of states resulting from such a model is called a Markov Chain.
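  • 7. Chain Rule & Markov Property
      Chain rule (repeated application of the product rule):
      $$P(q_t, q_{t-1}, \ldots, q_1) = P(q_t \mid q_{t-1}, \ldots, q_1)\, P(q_{t-1}, \ldots, q_1)$$
      $$= P(q_t \mid q_{t-1}, \ldots, q_1)\, P(q_{t-1} \mid q_{t-2}, \ldots, q_1)\, P(q_{t-2}, \ldots, q_1)$$
      $$= P(q_1) \prod_{i=2}^{t} P(q_i \mid q_{i-1}, \ldots, q_1)$$
      Markov property:
      $$P(q_i \mid q_{i-1}, \ldots, q_1) = P(q_i \mid q_{i-1}) \quad \text{for } i > 1$$
      $$P(q_t, q_{t-1}, \ldots, q_1) = P(q_1) \prod_{i=2}^{t} P(q_i \mid q_{i-1}) = P(q_1)\, P(q_2 \mid q_1) \cdots P(q_t \mid q_{t-1})$$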
  • 8. Markov Assumption
      • The Markov assumption states that the probability of word $w_i$ occurring at step $i$ depends only on the word $w_{i-1}$ that occurred at step $i-1$
        – Chain rule: $$P(w_1, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$
        – Markov assumption: $$P(w_1, \ldots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$
  • 9. Andrei Andreyevich Markov
      Born: 14 June 1856 in Ryazan, Russia
      Died: 20 July 1922 in Petrograd (now St Petersburg), Russia
      Markov is particularly remembered for his study of Markov chains, sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.
  • 10. A Markov System
      Has N states, called s1, s2 .. sN.
      There are discrete timesteps, t=0, t=1, …
      [Figure: three states s1, s2, s3; N=3, t=0]
  • 11. A Markov System
      Has N states, called s1, s2 .. sN.
      There are discrete timesteps, t=0, t=1, …
      On the t'th timestep the system is in exactly one of the available states. Call it $q_t$. Note: $q_t \in \{s_1, s_2, \ldots, s_N\}$.
      [Figure: three states s1, s2, s3; N=3, t=0, current state $q_t = q_0 = s_3$]
  • 12. A Markov System
      Has N states, called s1, s2 .. sN.
      There are discrete timesteps, t=0, t=1, …
      On the t'th timestep the system is in exactly one of the available states. Call it $q_t$. Note: $q_t \in \{s_1, s_2, \ldots, s_N\}$.
      Between each timestep, the next state is chosen randomly.
      [Figure: three states s1, s2, s3; N=3, t=1, current state $q_t = q_1 = s_2$]
  • 13. A Markov System
      Has N states, called s1, s2 .. sN.
      There are discrete timesteps, t=0, t=1, …
      On the t'th timestep the system is in exactly one of the available states. Call it $q_t$. Note: $q_t \in \{s_1, s_2, \ldots, s_N\}$.
      Between each timestep, the next state is chosen randomly.
      The current state determines the probability distribution for the next state:
        $P(q_{t+1}=s_1 \mid q_t=s_1) = 0$,  $P(q_{t+1}=s_2 \mid q_t=s_1) = 0$,  $P(q_{t+1}=s_3 \mid q_t=s_1) = 1$
        $P(q_{t+1}=s_1 \mid q_t=s_2) = 1/2$,  $P(q_{t+1}=s_2 \mid q_t=s_2) = 1/2$,  $P(q_{t+1}=s_3 \mid q_t=s_2) = 0$
        $P(q_{t+1}=s_1 \mid q_t=s_3) = 1/3$,  $P(q_{t+1}=s_2 \mid q_t=s_3) = 2/3$,  $P(q_{t+1}=s_3 \mid q_t=s_3) = 0$
      [Figure: three states s1, s2, s3; N=3, t=1, current state $q_t = q_1 = s_2$]
  • 14. A Markov System
      Has N states, called s1, s2 .. sN.
      There are discrete timesteps, t=0, t=1, …
      On the t'th timestep the system is in exactly one of the available states. Call it $q_t$. Note: $q_t \in \{s_1, s_2, \ldots, s_N\}$.
      Between each timestep, the next state is chosen randomly.
      The current state determines the probability distribution for the next state:
        $P(q_{t+1}=s_1 \mid q_t=s_1) = 0$,  $P(q_{t+1}=s_2 \mid q_t=s_1) = 0$,  $P(q_{t+1}=s_3 \mid q_t=s_1) = 1$
        $P(q_{t+1}=s_1 \mid q_t=s_2) = 1/2$,  $P(q_{t+1}=s_2 \mid q_t=s_2) = 1/2$,  $P(q_{t+1}=s_3 \mid q_t=s_2) = 0$
        $P(q_{t+1}=s_1 \mid q_t=s_3) = 1/3$,  $P(q_{t+1}=s_2 \mid q_t=s_3) = 2/3$,  $P(q_{t+1}=s_3 \mid q_t=s_3) = 0$
      Often notated with arcs between states.
      [Figure: N=3, t=1, current state $q_t = q_1 = s_2$; arcs labelled s1→s3: 1; s2→s1: 1/2; s2→s2: 1/2; s3→s1: 1/3; s3→s2: 2/3]
  • 15. Markov Property
      $q_{t+1}$ is conditionally independent of $\{q_{t-1}, q_{t-2}, \ldots, q_1, q_0\}$ given $q_t$.
      In other words:
        $P(q_{t+1} = s_j \mid q_t = s_i) = P(q_{t+1} = s_j \mid q_t = s_i, \text{any earlier history})$
      Notation:
        $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$
        $\pi_i = P(q_0 = s_i)$
      [Figure: N=3, t=1, current state $q_t = q_1 = s_2$; arcs labelled s1→s3: 1; s2→s1: 1/2; s2→s2: 1/2; s3→s1: 1/3; s3→s2: 2/3]
  • 16. Markov Property
      $q_{t+1}$ is conditionally independent of $\{q_{t-1}, q_{t-2}, \ldots, q_1, q_0\}$ given $q_t$.
      In other words:
        $P(q_{t+1} = s_j \mid q_t = s_i) = P(q_{t+1} = s_j \mid q_t = s_i, \text{any earlier history})$
      Notation:
        Transition probability: $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$
        Initial probability: $\pi_i = P(q_0 = s_i)$
      [Figure: N=3, t=1, current state $q_t = q_1 = s_2$; arcs labelled s1→s3: 1; s2→s1: 1/2; s2→s2: 1/2; s3→s1: 1/3; s3→s2: 2/3]
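The three-state system on these slides can be sketched in a few lines of Python. This is only an illustration: the transition values are the ones printed on the slides, while the names `TRANSITION`, `path_probability`, and `simulate` are hypothetical helpers introduced here, and the first state of a path is treated as given (probability 1).

```python
from fractions import Fraction
import random

STATES = ["s1", "s2", "s3"]

# TRANSITION[i][j] = P(q_{t+1} = j | q_t = i), copied from the slides.
TRANSITION = {
    "s1": {"s1": Fraction(0), "s2": Fraction(0), "s3": Fraction(1)},
    "s2": {"s1": Fraction(1, 2), "s2": Fraction(1, 2), "s3": Fraction(0)},
    "s3": {"s1": Fraction(1, 3), "s2": Fraction(2, 3), "s3": Fraction(0)},
}


def path_probability(path):
    """Probability of the transitions along `path`, given its first state."""
    prob = Fraction(1)
    for current, nxt in zip(path, path[1:]):
        prob *= TRANSITION[current][nxt]
    return prob


def simulate(start, steps, rng):
    """Sample `steps` transitions starting from `start`."""
    path = [start]
    for _ in range(steps):
        row = TRANSITION[path[-1]]
        weights = [float(row[s]) for s in STATES]
        path.append(rng.choices(STATES, weights=weights)[0])
    return path


if __name__ == "__main__":
    # s3 -> s2 -> s1 -> s3 multiplies 2/3 * 1/2 * 1
    print(path_probability(["s3", "s2", "s1", "s3"]))
    print(simulate("s3", 5, random.Random(0)))
```

Because of the Markov property, the probability of a path is just the product of one-step transition probabilities; no earlier history enters the calculation.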
  • 17. Example: A Simple Markov Model For Weather Prediction
      • On any given day, the weather can be described as being in one of three states:
        – State 1: precipitation (rain, snow, hail, etc.)
        – State 2: cloudy
        – State 3: sunny
      • Transitions between states are described by the transition matrix [matrix not captured in the transcript]
      • This model can then be described by a directed graph [figure not captured in the transcript]
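  • 18. Basic Calculations
      • Example: What is the probability that the weather for eight consecutive days is “sun-sun-sun-rain-rain-sun-cloudy-sun”?
      • Solution:
        – O = sun sun sun rain rain sun cloudy sun, i.e. the state sequence 3 3 3 1 1 3 2 3
        – By the Markov property, $P(O \mid \text{model}) = P(q_1 = 3)\, a_{33}\, a_{33}\, a_{31}\, a_{11}\, a_{13}\, a_{32}\, a_{23}$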
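The transcript does not preserve the slide's transition matrix, so the sketch below plugs in an assumed one. The values in `A` are the ones commonly quoted from Rabiner's tutorial weather example, but they are an assumption here, as is the choice that the first day is known to be sunny (`initial_prob=1.0`); `sequence_probability` is a hypothetical helper name.

```python
import math

# ASSUMED transition matrix (not present in the transcript).
# A[i][j] = P(tomorrow = j | today = i); 1 = precipitation, 2 = cloudy, 3 = sunny.
A = {
    1: {1: 0.4, 2: 0.3, 3: 0.3},
    2: {1: 0.2, 2: 0.6, 3: 0.2},
    3: {1: 0.1, 2: 0.1, 3: 0.8},
}


def sequence_probability(states, initial_prob=1.0):
    """P(state sequence) = P(first state) * product of one-step transitions."""
    prob = initial_prob
    for today, tomorrow in zip(states, states[1:]):
        prob *= A[today][tomorrow]
    return prob


if __name__ == "__main__":
    # sun sun sun rain rain sun cloudy sun
    observed = [3, 3, 3, 1, 1, 3, 2, 3]
    print(sequence_probability(observed))
```

With these assumed values the product is 0.8 · 0.8 · 0.1 · 0.4 · 0.3 · 0.1 · 0.2 ≈ 1.536 × 10⁻⁴; a different matrix would of course give a different number.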
  • 19. From Markov To Hidden Markov
      • The previous model assumes that each state can be uniquely associated with an observable event
        – Once an observation is made, the state of the system is then trivially retrieved
        – This model, however, is too restrictive to be of practical use for most realistic problems
      • To make the model more flexible, we will assume that the outcomes or observations of the model are a probabilistic function of each state
        – Each state can produce a number of outputs according to its own probability distribution, and each distinct output can potentially be generated at any state
        – These are known as Hidden Markov Models (HMMs), because the state sequence is not directly observable; it can only be estimated from the sequence of observations produced by the system
  • 20. The coin-toss problem
      • To illustrate the concept of an HMM, consider the following scenario:
        – Assume that you are placed in a room with a curtain
        – Behind the curtain there is a person performing a coin-toss experiment
        – This person selects one of several coins and tosses it: heads (H) or tails (T)
        – The person tells you the outcome (H or T), but not which coin was used each time
      • Your goal is to build a probabilistic model that best explains a sequence of observations O = {o1, o2, o3, o4, …} = {H, T, T, H, …}
        – The coins represent the states; these are hidden because you do not know which coin was tossed each time
        – The outcome of each toss represents an observation
        – A “likely” sequence of coins may be inferred from the observations, but this state sequence will not be unique
  • 21. The Coin Toss Example – 1 coin
      • With a single coin, the Markov model is observable: nothing is hidden, since each outcome identifies the state
      • In fact, we may describe the system with a deterministic model where the states are the actual observations (see figure)
      • The model parameter P(H) may be estimated as the fraction of tosses that come up heads
      • O = H H H T T H …
      • S = 1 1 1 2 2 1 …
  • 23. The Coin Toss Example – 2 coins
  • 24. From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins
  • 25. Hidden model
      • As spectators, we cannot tell which coin is being used; all we can observe is the output (head/tail)
      • We assume the outputs are governed by each coin's tendencies, i.e. its output probabilities
  • 26. Coin Toss Example
      [Figure: a chain of hidden state variables (the coins) C1, C2, …, Ci, …, CL-1, CL, each emitting one item of observed data (“output” = heads/tails) P1, P2, …, Pi, …, PL-1, PL]
  • 27. Hidden Markov Models
      • Used when states cannot be directly observed; good for noisy data
      • Requirements:
        – A finite number of states, each with an output probability distribution
        – State transition probabilities
        – An observed phenomenon, which can be randomly generated given the state-associated probabilities
  • 28. HMM Notation (from Rabiner's Survey*)
      The states are labeled S1 S2 .. SN
      For a particular trial…
      Let T be the number of observations
      T is also the number of states passed through
      O = O1 O2 .. OT is the sequence of observations
      Q = q1 q2 .. qT is the notation for a path of states
      $\lambda = \langle N, M, \{\pi_i\}, \{a_{ij}\}, \{b_i(j)\} \rangle$ is the specification of an HMM
      *L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
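  • 29. HMM Formal Definition
      An HMM, λ, is a 5-tuple consisting of
      • N, the number of states
      • M, the number of possible observations
      • $\{\pi_1, \pi_2, \ldots, \pi_N\}$, the starting state probabilities: $P(q_0 = S_i) = \pi_i$
      • The state transition probabilities $P(q_{t+1} = S_j \mid q_t = S_i) = a_{ij}$:
        $$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{pmatrix}$$
      • The observation probabilities $P(O_t = k \mid q_t = S_i) = b_i(k)$:
        $$\begin{pmatrix} b_1(1) & b_1(2) & \cdots & b_1(M) \\ b_2(1) & b_2(2) & \cdots & b_2(M) \\ \vdots & \vdots & & \vdots \\ b_N(1) & b_N(2) & \cdots & b_N(M) \end{pmatrix}$$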
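One way to make the 5-tuple concrete is a small container that stores π, A, and B and can generate observations, as in the coin-toss scenario. This is a minimal sketch, not a reference implementation: the class name `HMM`, its `sample` method, and the two-coin parameter values are all hypothetical, and N and M are derived from the shapes of the arrays rather than stored separately.

```python
import random
from dataclasses import dataclass


@dataclass
class HMM:
    pi: list  # pi[i]   = P(q_0 = S_i)
    A: list   # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
    B: list   # B[i][k] = P(O_t = k | q_t = S_i)

    def __post_init__(self):
        # Every distribution (pi, each row of A, each row of B) must sum to 1.
        for row in [self.pi, *self.A, *self.B]:
            if abs(sum(row) - 1.0) > 1e-9:
                raise ValueError(f"row does not sum to 1: {row}")

    @property
    def n_states(self):
        return len(self.pi)

    @property
    def n_symbols(self):
        return len(self.B[0])

    def sample(self, length, rng):
        """Generate (hidden states, observations) of the given length."""
        states, observations = [], []
        state = rng.choices(range(self.n_states), weights=self.pi)[0]
        for _ in range(length):
            states.append(state)
            symbol = rng.choices(range(self.n_symbols), weights=self.B[state])[0]
            observations.append(symbol)
            state = rng.choices(range(self.n_states), weights=self.A[state])[0]
        return states, observations


if __name__ == "__main__":
    # Hypothetical two-coin model: coin 0 is fair, coin 1 favours heads.
    # Symbols: 0 = heads, 1 = tails.
    coins = HMM(
        pi=[0.5, 0.5],
        A=[[0.7, 0.3], [0.4, 0.6]],
        B=[[0.5, 0.5], [0.9, 0.1]],
    )
    hidden, visible = coins.sample(10, random.Random(0))
    print("hidden coins:", hidden)
    print("observations:", visible)
```

An observer behind the curtain would see only the second list; the first one is exactly what the decoding problem tries to recover.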
  • 30. Assumptions
      • Markov assumption
        – Each state depends only on the immediately preceding state (“memoryless”)
      • Stationary assumption
        – Transition probabilities are independent of time
      • Output independence
        – Each observation depends only on the current state; given that state, it is independent of previous observations
  • 31. The three main questions on HMMs
      • Evaluation
        – What is the probability that the observations were generated by a given model?
      • Decoding
        – Given a model and a sequence of observations, what is the most likely state sequence?
      • Learning
        – Given a model and a sequence of observations, how should we modify the model parameters to maximize P(observations | model)?
  • 32. The three main questions on HMMs
      1. Evaluation: GIVEN an HMM M and a sequence x, FIND Prob[ x | M ]
      2. Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P[ x, π | M ]
      3. Learning: GIVEN an HMM M with unspecified transition/emission probabilities, and a sequence x, FIND parameters θ = (b_i(·), a_ij) that maximize P[ x | θ ]
  • 33. Let's not be confused by notation
      P[ x | M ]: the probability that sequence x was generated by the model.
      The model is: architecture (#states, etc.) + parameters θ = a_ij, e_i(·)
      So P[ x | θ ] and P[ x ] are the same when the architecture and the entire model, respectively, are implied.
      Similarly, P[ x, π | M ] and P[ x, π ] are the same.
      In the LEARNING problem we always write P[ x | θ ] to emphasize that we are seeking the θ that maximizes P[ x | θ ].
  • 34. HMMs
  • 36. Description: Specification of an HMM
      • N, the number of states
        – Q = {q1, q2, …, qN}, the set of states
      • M, the number of symbols (observables)
        – O = {o1, o2, …, oM}, the set of symbols
  • 37. Description: Specification of an HMM
      • A, the state transition probability matrix
        – $a_{ij} = P(q_{t+1} = j \mid q_t = i)$
      • B, the observation probability distribution
        – $b_j(k) = P(o_t = k \mid q_t = j)$, $1 \le k \le M$
      • π, the initial state distribution
  • 41. Central problems in HMM modelling
      • Problem 1, Evaluation:
        – Probability of occurrence of a particular observation sequence, O = {o1,…,ok}, given the model: P(O|λ)
        – Complicated by the hidden states
        – Useful in sequence classification
  • 42. Central problems in HMM modelling
      • Problem 2, Decoding:
        – Find the optimal state sequence to produce the given observations, O = {o1,…,ok}, given the model
        – Requires choosing an optimality criterion
        – Useful in recognition problems
  • 43. Central problems in HMM modelling
      • Problem 3, Learning:
        – Determine the optimum model, given a training set of observations
        – Find λ such that P(O|λ) is maximal
  • 44. Task: Part-Of-Speech Tagging
      • Goal: Assign the correct part-of-speech to each word (and punctuation mark) in a text.
      • Example:
          Two   old   men   bet   on     the   game   .
          CRD   ADJ   NN    VBD   Prep   Det   NN     SYM
      • Learn a local model of POS dependencies, usually from pretagged data
  • 45. Hidden Markov Models • Assume: POS tags are generated by a random process, and each POS tag randomly generates a word [Figure: state-transition diagram over the tags Det, ADJ, NN and NNS, with emission arcs to the words “a”, “the”, “cat”, “cats”, “men” and “bet”]
  • 46. HMMs For Tagging • First-order (bigram) Markov assumptions: – Limited Horizon: Tag depends only on previous tag: P(ti+1 = tk | t1=tj1,…,ti=tji) = P(ti+1 = tk | ti = tj) – Time invariance: No change over time: P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj → tk) • Output probabilities: – Probability of getting word wk for tag tj: P(wk | tj) – Assumption: Not dependent on other tags or words!
  • 47. Combining Probabilities • Probability of a tag sequence: P(t1 t2 … tn) = P(t1) P(t1→t2) P(t2→t3) … P(tn-1→tn) Assume t0 – starting tag: = P(t0→t1) P(t1→t2) P(t2→t3) … P(tn-1→tn) • Prob. of word sequence and tag sequence: P(W,T) = Πi P(ti-1→ti) P(wi | ti)
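The product P(W,T) = Πi P(ti-1→ti) P(wi | ti) can be sketched in a few lines. This is an illustrative sketch only: the function name `joint_probability`, the dictionary layout keyed by (previous tag, tag) and (tag, word), and the probability values are assumptions made for the example.

```python
def joint_probability(words, tags, a, b, start="t0"):
    """P(W, T) = prod_i a(t_{i-1} -> t_i) * b(w_i | t_i), with t_0 the start tag."""
    p = 1.0
    prev = start
    for word, tag in zip(words, tags):
        # Unseen transitions or emissions get probability 0 (no smoothing here).
        p *= a.get((prev, tag), 0.0) * b.get((tag, word), 0.0)
        prev = tag
    return p


# Hypothetical probabilities, chosen only to make the arithmetic easy to follow.
a = {("t0", "Det"): 1.0, ("Det", "NN"): 0.5}
b = {("Det", "the"): 0.5, ("NN", "cat"): 0.25}
p = joint_probability(["the", "cat"], ["Det", "NN"], a, b)
```

With these numbers the product is (1.0 × 0.5) × (0.5 × 0.25).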
  • 48. Training from labeled data • Labeled data = each word has a POS tag • Thus: π(tj) = PMLE(tj) = C(tj) / N; a(tj→tk) = PMLE(tj→tk) = C(tj, tk) / C(tj); b(wk | tj) = PMLE(wk | tj) = C(tj:wk) / C(tj) • Smoothing applies as usual
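The relative-frequency estimates above can be sketched with simple counters. This is a minimal, unsmoothed sketch under stated assumptions: the function name `train_mle`, the `<s>` start tag (which folds π into a(t0→ti), as on slide 47) and the three-sentence corpus are all hypothetical. The transition estimate divides by the number of times tj occurs as a predecessor, so that each row sums to 1; this differs slightly from C(tj) when tj ends a sentence.

```python
from collections import Counter


def train_mle(tagged_sentences, start="<s>"):
    """Unsmoothed estimates of a(tj -> tk) and b(wk | tj) from (word, tag) pairs."""
    trans, emit = Counter(), Counter()
    as_prev, tag_count = Counter(), Counter()
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            trans[(prev, tag)] += 1   # C(tj, tk)
            as_prev[prev] += 1        # times tj is followed by some tag
            emit[(tag, word)] += 1    # C(tj:wk)
            tag_count[tag] += 1       # C(tj)
            prev = tag
    a = {(tj, tk): c / as_prev[tj] for (tj, tk), c in trans.items()}
    b = {(tj, wk): c / tag_count[tj] for (tj, wk), c in emit.items()}
    return a, b


# Tiny hypothetical corpus, for illustration only.
corpus = [
    [("the", "Det"), ("cat", "NN")],
    [("the", "Det"), ("men", "NNS")],
    [("a", "Det"), ("cat", "NN")],
]
a, b = train_mle(corpus)
```

Smoothing, which the slide mentions, is deliberately left out; any pair that was never observed is simply absent from the dictionaries.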
  • 49. Three Basic Problems • Compute the probability of a text: Pλ(W1,N) • Compute the maximum probability tag sequence: arg maxT1,N Pλ(T1,N | W1,N) • Compute the maximum likelihood model: arg maxλ Pλ(W1,N)
  • 50. Central problems Problem 1: Naïve solution • State sequence q = (q1,…,qT) • Assume independent observations: $P(O \mid q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T)$ NB Observations are mutually independent, given the hidden states. (Joint distribution of independent variables factorises into marginal distributions of the independent variables.)
  • 51. Central problems Problem 1: Naïve solution • Observe that: $P(q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$ • And that: $P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda)$
  • 52. Central problems Problem 1: Naïve solution • Finally get: $P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda)$ NB: - The above sum is over all state paths - There are $N^T$ state paths, each ‘costing’ O(T) calculations, leading to $O(T N^T)$ time complexity.
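To make the cost of the naïve sum concrete, the enumeration over all state paths can be sketched directly. The function name `naive_likelihood` and the two-state toy model are assumptions for illustration; states and observation symbols are zero-based indices.

```python
from itertools import product


def naive_likelihood(obs, pi, A, B):
    """P(O | lambda) by summing P(O | q, lambda) P(q | lambda) over every state path q."""
    N = len(pi)
    total = 0.0
    for q in product(range(N), repeat=len(obs)):      # N**T paths
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):                  # O(T) work per path
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total


# Hypothetical two-state, two-symbol model (numbers are illustrative only).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
likelihood = naive_likelihood([0, 1, 0], pi, A, B)
```

For N = 2 and T = 3 this visits only 8 paths, but the count grows as N to the power T, which is why the slides turn to the forward algorithm next.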
  • 53. Central problems Problem 1: Efficient solution Forward algorithm: • Define auxiliary forward variable α: $\alpha_t(i) = P(o_1, \ldots, o_t, q_t = i \mid \lambda)$ αt(i) is the probability of observing the partial sequence of observables o1,…,ot and being in state qt = i at time t
  • 54. Central problems Problem 1: Efficient solution • Recursive algorithm: – Initialise: $\alpha_1(i) = \pi_i\, b_i(o_1)$ – Calculate: $\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1})$, that is, (partial obs seq to t AND state i at t) × (transition to j at t+1) × (sensor); sum, as j can be reached from any preceding state; α incorporates the partial obs seq to t – Obtain: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$, the sum of the different ways of getting the obs seq • Complexity is $O(N^2 T)$
  • 55. Forward Algorithm Define αk(i) = P(w1,k, tk=ti) 1. For i = 1 To N: α1(i) = a(t0→ti) b(w1 | ti) 2. For k = 2 To T; For j = 1 To N: αk(j) = [Σi αk-1(i) a(ti→tj)] b(wk | tj) 3. Then: Pλ(W1,T) = Σi αT(i) Complexity = O(N²T)
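The three steps above (initialise, recurse, sum the last column) can be sketched as follows. The function name `forward` and the toy model are assumptions for illustration; the initial distribution `pi` plays the role of a(t0→ti), and indices are zero-based.

```python
def forward(obs, pi, A, B):
    """Return P(O | lambda) and alpha, where alpha[t][i] = P(o_1..o_{t+1}, state i | lambda)."""
    N = len(pi)
    # Step 1: initialise with the start distribution and the first emission.
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    # Step 2: each new column sums over all predecessors i, then emits the symbol.
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    # Step 3: sum the final column.
    return sum(alpha[-1]), alpha


# Hypothetical two-state, two-symbol model (numbers are illustrative only).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
likelihood, alpha = forward([0, 1], pi, A, B)
```

Each column costs N² multiplications and there are T columns, which gives the O(N²T) complexity stated on the slide. For long sequences the values underflow, so practical implementations usually rescale each column or work in log space; that refinement is not shown here.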
  • 56. Forward Algorithm [Figure: trellis for Pλ(W1,3) with columns w1, w2, w3 and rows t1 to t5; the node for tag tj under word wk holds αk(j); t0 connects to the first column through a(t0→ti), and the arcs a(t1→t1), a(t2→t1), a(t3→t1), a(t4→t1), a(t5→t1) lead into t1 in each later column]
  • 57. Central problems Problem 1: Alternative solution Backward algorithm: • Define auxiliary backward variable β: $\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda)$ βt(i) – the probability of observing the sequence of observables ot+1,…,oT given state qt = i at time t, and λ
  • 58. Central problems Problem 1: Alternative solution • Recursive algorithm: – Initialise: $\beta_T(j) = 1$ – Calculate: $\beta_t(i) = \sum_{j=1}^{N} \beta_{t+1}(j)\, a_{ij}\, b_j(o_{t+1})$ for $t = T-1, \ldots, 1$ – Terminate: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$ • Complexity is $O(N^2 T)$
  • 59. Backward Algorithm Define βk(i) = P(wk+1,T | tk=ti) --note the difference! 1. For i = 1 To N: βT(i) = 1 2. For k = T-1 To 1; For j = 1 To N: βk(j) = Σi a(tj→ti) b(wk+1 | ti) βk+1(i) 3. Then: Pλ(W1,T) = Σi a(t0→ti) b(w1 | ti) β1(i) Complexity = O(N²T)
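The backward pass can be sketched in the same style. The function name `backward` and the toy model are assumptions for illustration; `pi` stands in for a(t0→ti), and indices are zero-based.

```python
def backward(obs, pi, A, B):
    """Return P(O | lambda) and beta, where beta[t][i] = P(later observations | state i at t)."""
    N, T = len(pi), len(obs)
    # Step 1: the last column is all ones (nothing left to observe).
    beta = [[1.0] * N for _ in range(T)]
    # Step 2: fill columns right to left; each entry sums over all successors j.
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    # Step 3: combine the first column with the start distribution and first emission.
    likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))
    return likelihood, beta


# Hypothetical two-state, two-symbol model (numbers are illustrative only).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
likelihood, beta = backward([0, 1], pi, A, B)
```

On the same model and observations, the backward and forward passes return the same likelihood, which is a convenient sanity check for either implementation.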
  • 60. Backward Algorithm [Figure: trellis for Pλ(W1,3) with columns w1, w2, w3 and rows t1 to t5; the node for tag tj under word wk holds βk(j); t0 connects to the first column through a(t0→ti), and the arcs a(t1→t1), a(t2→t1), a(t3→t1), a(t4→t1), a(t5→t1) are drawn between adjacent columns]
  • 61. Viterbi Algorithm (Decoding) • Most probable tag sequence given text: T* = arg maxT Pλ(T | W) = arg maxT Pλ(W | T) Pλ(T) / Pλ(W) (Bayes’ Theorem) = arg maxT Pλ(W | T) Pλ(T) (Pλ(W) is constant for all T) = arg maxT Πi [a(ti-1→ti) b(wi | ti)] = arg maxT Σi log[a(ti-1→ti) b(wi | ti)]
  • 62. [Figure: trellis with start tag t0 and tags t1, t2, t3 under each of the words w1, w2, w3]
    A(,)    t1      t2      t3
    t0 →    0.005   0.02    0.1
    t1 →    0.02    0.1     0.005
    t2 →    0.5     0.0005  0.0005
    t3 →    0.05    0.05    0.005

    B(,)    w1      w2      w3
    t1      0.2     0.005   0.005
    t2      0.02    0.2     0.0005
    t3      0.02    0.02    0.05
  • 63. [Figure: the same trellis annotated with log scores (base 10); arcs carry the log transition scores and each node carries the best log score of any path ending there]
    Node    w1      w2      w3
    t1      -3      -6      -7.3
    t2      -3.4    -4.7    -10.3
    t3      -2.7    -6.7    -9.3

    -log A  t1      t2      t3
    t0 →    2.3     1.7     1
    t1 →    1.7     1       2.3
    t2 →    0.3     3.3     3.3
    t3 →    1.3     1.3     2.3

    -log B  w1      w2      w3
    t1      0.7     2.3     2.3
    t2      1.7     0.7     3.3
    t3      1.7     1.7     1.3
  • 64. Viterbi Algorithm • D(0, START) = 0 • for each tag t != START do: D(0, t) = -∞ • for i ← 1 to N do: a. for each tag tj do: D(i, tj) ← maxk D(i-1,tk) + logb(wi|tj) + loga(tk→tj), the log-domain form of D(i, tj) ← maxk D(i-1,tk) b(wi|tj) a(tk→tj) • log P(W,T) = maxj D(N, tj) where logb(wi|tj) = log b(wi|tj) and so forth
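The recurrence above can be checked against the worked example on slides 62 and 63. The sketch below transcribes the A and B tables from slide 62 and uses base-10 logs to match slide 63; the function name `viterbi`, the zero-based indexing (t1..t3 as 0..2, w1..w3 as 0..2) and the back-pointer bookkeeping are choices of this sketch rather than part of the slides.

```python
import math

A0 = [0.005, 0.02, 0.1]              # a(t0 -> tj), the start row of A
A = [[0.02, 0.1, 0.005],             # A[k][j] = a(t_{k+1} -> t_{j+1})
     [0.5, 0.0005, 0.0005],
     [0.05, 0.05, 0.005]]
B = [[0.2, 0.005, 0.005],            # B[j][i] = b(w_{i+1} | t_{j+1})
     [0.02, 0.2, 0.0005],
     [0.02, 0.02, 0.05]]


def viterbi(words, A0, A, B):
    """Return (best tag path, its log10 score, full score table D)."""
    n, lg = len(A0), math.log10
    D = [[lg(A0[j]) + lg(B[j][words[0]]) for j in range(n)]]
    back = []                        # back[i][j] = best predecessor of tag j at step i+1
    for w in words[1:]:
        row, ptr = [], []
        for j in range(n):
            k = max(range(n), key=lambda k: D[-1][k] + lg(A[k][j]))
            row.append(D[-1][k] + lg(A[k][j]) + lg(B[j][w]))
            ptr.append(k)
        D.append(row)
        back.append(ptr)
    last = max(range(n), key=lambda j: D[-1][j])
    path = [last]
    for ptr in reversed(back):       # follow back-pointers from the final tag
        path.append(ptr[path[-1]])
    path.reverse()
    return path, D[-1][last], D


path, score, D = viterbi([0, 1, 2], A0, A, B)
```

The first column of `D` reproduces the node scores -3, -3.4 and -2.7 from slide 63, and the best final score is log10(5e-8), about -7.3, ending in t1 after t2. The first tag on the best path is a tie in this example (t1 and t3 both reach t2 with probability 1e-4), so which of the two is reported depends on floating-point rounding.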
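  • 66. Question: Suppose the sequence of our game is: HHHTHHHTTHHTH. [Figure: two-state HMM with hidden states “fair” and “loaded”, each emitting Heads or Tails; the start state enters each hidden state with probability 0.5; the remaining arcs are labelled 0.1 and 0.9] What is the probability of the sequence given the model?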
  • 67. Decoding • Suppose we have a text written by Shakespeare and a monkey. Can we tell who wrote what? • Text: Shakespeare or Monkey? • Case 1: – Fehwufhweuromeojulietpoisonjigjreijge • Case 2: – mmmmbananammmmmmmbananammm