Markov Models
K.A.S.H. Kulathilake
B.Sc.(Sp.Hons.)IT, MCS, Mphil, SEDA(UK)
Introduction
• The hidden Markov model (HMM) is one of the most important machine learning models in speech and language processing.
• It is based on finite-state automata (FSA).
Sequence Models
• The HMM is a sequence model.
• A sequence model or sequence classifier is a
model whose job is to assign a label or class to
each unit in a sequence, thus mapping a
sequence of observations to a sequence of labels.
• An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences, whatever), it computes a probability distribution over possible sequences of labels and chooses the best label sequence.
Markov Chains
• To define the HMM properly, we need to first introduce the Markov chain, sometimes called the observed Markov model.
• Recall that a weighted finite automaton is defined by a set of states
and a set of transitions between states, with each arc associated
with a weight.
• A Markov chain is a special case of a weighted automaton in which
weights are probabilities (the probabilities on all arcs leaving a node
must sum to 1) and in which the input sequence uniquely
determines which states the automaton will go through.
• Because it can’t represent inherently ambiguous problems, a
Markov chain is only useful for assigning probabilities to
unambiguous sequences.
Markov Chains (Cont…)
• The following figure shows a Markov chain for assigning a probability to a sequence of weather events, for which the vocabulary consists of HOT, COLD, and WARM.
Markov Chains (Cont…)
• First, let’s be more formal and view a Markov
chain as a kind of probabilistic graphical
model: a way of representing probabilistic
assumptions in a graph.
• A Markov chain is specified by the following
components:
Markov Chains (Cont…)
• The Markov model depicted in the previous figure shows that we represent the states (including start and end states) as nodes in the graph, and the transitions as edges between nodes.
• A Markov chain embodies an important
assumption about these probabilities.
• In a first-order Markov chain, the probability of a
particular state depends only on the previous
state:
• Markov Assumption: P(qi|q1,…..,qi-1) = P(qi|qi-1)
Markov Chains (Cont…)
• Note that because each aij expresses the probability P(qj|qi), the laws of probability require that the values of the outgoing arcs from a given state must sum to 1:
∑_{j=1}^{n} aij = 1   ∀ i
• An alternative representation that is sometimes used for Markov
chains doesn’t rely on a start or end state, instead representing the
distribution over initial states and accepting states explicitly:
Markov Chains (Cont…)
• Thus, the probability of state 1 being the first
state can be represented either as a01 or as π1.
• Note that because each πi expresses the probability P(qi|START), all the π probabilities must sum to 1:
∑_{i=1}^{n} πi = 1
Markov Chains (Cont…)
• Compute the probability of each of the
following sequences:
– (a) hot hot hot hot
– (b) cold hot cold hot
Markov Chains (Cont…)
• Compute the probability of each of the following
sequences:
• We can use a bigram model to compute this (the current state is determined by the previous state).
(a) hot hot hot hot
P(HHHH) = P(H) * P(H|H) * P(H|H) * P(H|H)
P(HHHH) = 0.5*0.5*0.5*0.5 = 0.0625
(b) cold hot cold hot
P(CHCH) = P(C) * P(H|C) * P(C|H) * P(H|C)
P(CHCH) = 0.3*0.2*0.2*0.2 = 0.0024
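A minimal Python sketch of this bigram-style computation; the initial and transition probabilities below are simply the values used in the worked example above, and entries that the two sequences do not need (for example the WARM state) are omitted.

```python
# Markov chain for the weather example; probabilities are the ones used in the
# worked example above (WARM-state entries omitted since they are not needed).
initial = {"H": 0.5, "C": 0.3}              # P(q1)
trans = {                                   # P(qi | qi-1)
    "H": {"H": 0.5, "C": 0.2},
    "C": {"H": 0.2},
}

def sequence_probability(states):
    """P(q1..qT) = P(q1) * product of P(qi | qi-1) for a first-order Markov chain."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[prev][cur]
    return p

print(sequence_probability("HHHH"))  # 0.5 * 0.5 * 0.5 * 0.5 = 0.0625
print(sequence_probability("CHCH"))  # 0.3 * 0.2 * 0.2 * 0.2 = 0.0024
```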
Hidden Markov Model (HMM)
• A Markov chain is useful when we need to compute the probability of a sequence of events that we can directly observe in the world.
• In many cases, however, the events we are interested in may not be directly observable in the world.
– Ex 1: Part-of-Speech Tagging - In POS tagging we observe words in a
sentence and based on the word observation sequence we need to
infer the correct tag sequence.
– Ex 2: Speech Recognition - We observe acoustic events in the world
and based on the sequence of acoustic events we need to infer the
correct word sequence.
• A hidden Markov model (HMM) allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model.
Hidden Markov Model (HMM) (Cont…)
• Let’s begin with a formal definition of a hidden Markov
model, focusing on how it differs from a Markov chain.
• An HMM is specified by the following components:
Hidden Markov Model (HMM) (Cont…)
• As we noted for Markov chains, an alternative
representation that is sometimes used for
HMMs doesn’t rely on a start or end state,
instead representing the distribution over
initial and accepting states explicitly.
Hidden Markov Model (HMM) (Cont…)
• A first-order hidden Markov model instantiates two
simplifying assumptions.
– First, as with a first-order Markov chain, the probability of
a particular state depends only on the previous state:
• Markov Assumption:
P(qi|q1,…..,qi-1) = P(qi|qi-1)
– Second, the probability of an output observation oi
depends only on the state that produced the observation
qi and not on any other states or any other observations:
• Output Independence:
P(oi | q1, …, qi, …, qT, o1, …, oi, …, oT) = P(oi | qi)
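As a concrete sketch, the components of such an HMM for the ice-cream/weather example used in the next section can be written out directly in Python. The numeric values below are illustrative placeholders, not necessarily the ones in the slide's figure.

```python
# Components of a hidden Markov model (HMM) for the ice-cream / weather example.
# All numeric values are illustrative placeholders.
states = ["H", "C"]                     # hidden states: Hot, Cold
vocabulary = [1, 2, 3]                  # observations: ice creams eaten per day

pi = {"H": 0.8, "C": 0.2}               # initial probability distribution over states

A = {                                   # transition probability matrix, P(qt | qt-1)
    "H": {"H": 0.6, "C": 0.4},
    "C": {"H": 0.5, "C": 0.5},
}

B = {                                   # emission probabilities, P(ot | qt)
    "H": {1: 0.2, 2: 0.4, 3: 0.4},
    "C": {1: 0.5, 2: 0.4, 3: 0.1},
}
```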
Likelihood Computation
• Imagine that you are a climatologist in the year 2800 studying the history of global warming. You cannot find any records of the weather in Moratuwa for October 2017, but you do find Nilantha's diary, which lists how many ice creams Nilantha ate every day that October. Assume there are only two kinds of days: Cold (C) and Hot (H). Figure out the correct "hidden" sequence Q of weather states (H or C) which caused Nilantha to eat the ice cream.
• Determine the probability of an ice-cream observation sequence like 3 1 3.
Likelihood Computation (Cont…)
• Computing Likelihood: Given an HMM λ= (A,B) and an
observation sequence O, determine the likelihood P(O|
λ).
• For a Markov chain, where the surface observations
are the same as the hidden events, we could compute
the probability of 3 1 3 just by following the states
labeled 3 1 3 and multiplying the probabilities along
the arcs.
• For a hidden Markov model, things are not so simple.
We want to determine the probability of an ice-cream
observation sequence like 3 1 3, but we don’t know
what the hidden state sequence is!
Likelihood Computation (Cont…)
• Let’s start with a slightly simpler situation.
• Suppose we already knew the weather and wanted to predict how much ice cream Nilantha would eat.
• This is a useful part of many HMM tasks.
• For a given hidden state sequence (e.g., hot
hot cold), we can easily compute the output
likelihood of 3 1 3.
Likelihood Computation (Cont…)
• Let’s see how. First, recall that for hidden Markov models,
each hidden state produces only a single observation.
• Thus, the sequence of hidden states and the sequence of
observations have the same length.
• Given this one-to-one mapping and the Markov assumption (limited history), for a particular hidden state sequence Q = q0, q1, q2, …, qT and an observation sequence O = o1, o2, …, oT, the likelihood of the observation sequence is:
P(O|Q) = ∏_{i=1}^{T} P(oi|qi)
Likelihood Computation (Cont…)
• The computation of the output likelihood of our ice-cream observation 3 1 3 given one possible hidden state sequence hot hot cold is:
• P(3 1 3 | hot hot cold) = P(3|hot) P(1|hot) P(3|cold)
• The graphic representation of this computation is given
below:
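A short sketch of the same computation in Python (the slide's figure shows it graphically); the emission probabilities B below are illustrative placeholders.

```python
# Output likelihood P(3 1 3 | hot hot cold) = P(3|H) * P(1|H) * P(3|C).
# Emission probabilities B are illustrative placeholders.
B = {
    "H": {1: 0.2, 2: 0.4, 3: 0.4},
    "C": {1: 0.5, 2: 0.4, 3: 0.1},
}

def output_likelihood(obs, hidden):
    """P(O|Q) = product of P(oi | qi) when the state sequence Q is known."""
    p = 1.0
    for o, q in zip(obs, hidden):
        p *= B[q][o]
    return p

print(output_likelihood([3, 1, 3], ["H", "H", "C"]))  # 0.4 * 0.2 * 0.1 = 0.008
```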
Likelihood Computation (Cont…)
• But of course, we don’t actually know what the hidden
state (weather) sequence was.
• We’ll need to compute the probability of ice-cream events
3 1 3 instead by summing over all possible weather
sequences, weighted by their probability.
• First, let’s compute the joint probability of being in a
particular weather sequence Q and generating a particular
sequence O of ice-cream events.
• In general, this is
P(O, Q) = P(O|Q) × P(Q) = ∏_{i=1}^{T} P(oi|qi) × ∏_{i=1}^{T} P(qi|qi−1)
Likelihood Computation (Cont…)
• The computation of the joint probability of our ice-cream observation 3 1 3 and one possible hidden state sequence hot hot cold is shown below:
• P(3 1 3, hot hot cold) = P(hot|start) P(hot|hot) P(cold|hot) × P(3|hot) P(1|hot) P(3|cold)
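The same computation in code, now including the transition terms as well as the emissions; the parameter values remain illustrative placeholders.

```python
# Joint probability P(O, Q) = P(Q) * P(O|Q) under the first-order HMM assumptions.
# Parameter values are illustrative placeholders.
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint_probability(obs, hidden):
    """P(O, Q) = P(q1|start) P(o1|q1) * product over t of P(qt|qt-1) P(ot|qt)."""
    p = pi[hidden[0]] * B[hidden[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[hidden[t - 1]][hidden[t]] * B[hidden[t]][obs[t]]
    return p

print(joint_probability([3, 1, 3], ["H", "H", "C"]))
```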
Likelihood Computation (Cont…)
• Now that we know how to compute the joint probability of
the observations with a particular hidden state sequence,
we can compute the total probability of the observations
just by summing over all possible hidden state sequences:
P(O) = ∑_Q P(O, Q) = ∑_Q P(O|Q) × P(Q)
• For our particular case, we would sum over the eight 3-
event sequences:
• P(3 1 3, Cold, Cold, Cold) + P(3 1 3, Cold, Cold, Hot) + P(3 1 3, Cold, Hot, Cold) + P(3 1 3, Cold, Hot, Hot) + P(3 1 3, Hot, Cold, Cold) + P(3 1 3, Hot, Cold, Hot) + P(3 1 3, Hot, Hot, Cold) + P(3 1 3, Hot, Hot, Hot)
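A brute-force Python sketch of this sum over all 2³ = 8 weather sequences, using the same illustrative placeholder parameters as before.

```python
from itertools import product

# Illustrative placeholder parameters for the ice-cream HMM.
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint_probability(obs, hidden):
    p = pi[hidden[0]] * B[hidden[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[hidden[t - 1]][hidden[t]] * B[hidden[t]][obs[t]]
    return p

obs = [3, 1, 3]
# P(O) = sum over all N^T hidden state sequences Q of P(O, Q).
total = sum(joint_probability(obs, hidden) for hidden in product("HC", repeat=len(obs)))
print(total)
```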
Likelihood Computation (Cont…)
• For an HMM with N hidden states and an observation sequence of T observations, there are N^T possible hidden sequences.
• For real tasks, where N and T are both large, N^T is a very large number, so we cannot compute the total observation likelihood by computing a separate observation likelihood for each hidden state sequence and then summing them.
• The solution is the forward algorithm.
Forward Algorithm
• The forward algorithm is an efficient O(N²T) algorithm.
• The forward algorithm is a kind of dynamic programming algorithm, that is, an algorithm that uses a table to store intermediate values as it builds up the probability of the observation sequence.
• The forward algorithm computes the observation
probability by summing over the probabilities of all
possible hidden state paths that could generate the
observation sequence, but it does so efficiently by
implicitly folding each of these paths into a single
forward trellis.
Forward Algorithm (Cont…)
• The following diagram shows an example of the forward trellis for computing the likelihood of the observation sequence 3 1 3.
Forward Algorithm (Cont…)
• Each cell of the forward algorithm trellis, α_t(j), represents the probability of being in state j after seeing the first t observations, given the automaton λ.
• The value of each cell α_t(j) is computed by summing over the probabilities of every path that could lead us to this cell.
• Formally, each cell expresses the following probability:
α_t(j) = P(o1, o2, …, ot, qt = j | λ)
Forward Algorithm (Cont…)
• Here, qt = j means “the t-th state in the sequence of states is state j”.
• We compute this probability α_t(j) by summing over the extensions of all the paths that lead to the current cell.
• For a given state qj at time t, the value α_t(j) is computed as:
α_t(j) = ∑_{i=1}^{N} α_{t−1}(i) aij bj(ot)
• The three factors that are multiplied in the above equation in extending the previous paths to compute the forward probability at time t are:
Forward Algorithm (Cont…)
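A compact Python sketch of the forward algorithm following the recurrence above; the parameter values are illustrative placeholders.

```python
# Forward algorithm: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t).
# Parameter values are illustrative placeholders.
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
states = ["H", "C"]

def forward(obs):
    """Return P(O | lambda) by filling the forward trellis column by column."""
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    # Recursion: each new column sums over all paths into each state
    for t in range(1, len(obs)):
        alpha.append({
            j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
            for j in states
        })
    # Termination: sum the final column of the trellis
    return sum(alpha[-1][j] for j in states)

print(forward([3, 1, 3]))  # same value as the brute-force sum, in O(N^2 * T) time
```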
Decoding: The Viterbi Algorithm
• For any model, such as an HMM, that contains hidden variables, the task of determining which sequence of variables is the underlying source of some sequence of observations is called the decoding task.
– E.g., in the ice-cream domain, given a sequence of ice-cream observations 3 1 3 and an HMM, the task of the decoder is to find the best hidden weather sequence (H H H).
• More formally, the objective of decoding is:
– Given an observation sequence O and an HMM λ =
(A,B, π), discover the best hidden state sequence Q.
Decoding: The Viterbi Algorithm
(Cont…)
• We might propose to find the best sequence as
follows: For each possible hidden state sequence
(HHH, HHC, HCH, etc.), we could run the forward
algorithm and compute the likelihood of the
observation sequence given that hidden state
sequence.
• Then we could choose the hidden state sequence
with the maximum observation likelihood.
• It should be clear from the previous section that
we cannot do this because there are an
exponentially large number of state sequences.
Decoding: The Viterbi Algorithm
(Cont…)
• Instead, the most common decoding algorithm for HMMs is the Viterbi algorithm.
• Like the forward algorithm, Viterbi is a kind of dynamic programming algorithm that makes use of a dynamic programming trellis.
Decoding: The Viterbi Algorithm
(Cont…)
• The following diagram shows an example of the Viterbi trellis for computing the best hidden state sequence for the observation sequence 3 1 3.
Decoding: The Viterbi Algorithm
(Cont…)
• The idea is to process the observation sequence left to
right, filling out the trellis.
• Each cell of the trellis, v_t(j), represents the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence q0, q1, …, qt−1, given the automaton λ.
• The value of each cell v_t(j) is computed by recursively taking the most probable path that could lead us to this cell.
• Formally, each cell expresses the probability
v_t(j) = max_{q0,q1,…,qt−1} P(q0, q1, …, qt−1, o1, o2, …, ot, qt = j | λ)
Decoding: The Viterbi Algorithm
(Cont…)
• Note that we represent the most probable path
by taking the maximum over all possible previous
state sequences.
• Like other dynamic programming algorithms,
Viterbi fills each cell recursively.
• Given that we had already computed the
probability of being in every state at time t-1, we
compute the Viterbi probability by taking the
most probable of the extensions of the paths that
lead to the current cell.
Decoding: The Viterbi Algorithm
(Cont…)
• For a given state qj at time t, the value v_t(j) is computed as
v_t(j) = max_{i=1}^{N} v_{t−1}(i) aij bj(ot)
• The three factors that are multiplied in the above equation for extending the previous paths to compute the Viterbi probability at time t are:
Decoding: The Viterbi Algorithm
(Cont…)
Decoding: The Viterbi Algorithm
(Cont…)
• Note that the Viterbi algorithm is identical to the forward algorithm
except that it takes the max over the previous path probabilities
whereas the forward algorithm takes the sum.
• Note also that the Viterbi algorithm has one component that the
forward algorithm doesn’t have: backpointers.
• The reason is that while the forward algorithm needs to produce an
observation likelihood, the Viterbi algorithm must produce a
probability and also the most likely state sequence.
• We compute this best state sequence by keeping track of the path of hidden states that led to each state and then, at the end, backtracing the best path to the beginning (the Viterbi backtrace).
Decoding: The Viterbi Algorithm
(Cont…)
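A short Viterbi sketch with backpointers, following the recurrence and backtrace described above; the parameter values are illustrative placeholders.

```python
# Viterbi: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t), with backpointers so the
# best hidden state sequence can be recovered by backtracing at the end.
# Parameter values are illustrative placeholders.
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
states = ["H", "C"]

def viterbi(obs):
    """Return (probability of the best path, best hidden state sequence)."""
    v = [{j: pi[j] * B[j][obs[0]] for j in states}]   # initialization
    backptr = [{}]
    for t in range(1, len(obs)):
        v.append({})
        backptr.append({})
        for j in states:
            # Most probable previous state leading into state j at time t
            best_prev = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            v[t][j] = v[t - 1][best_prev] * A[best_prev][j] * B[j][obs[t]]
            backptr[t][j] = best_prev
    # Termination: pick the best final state, then follow backpointers
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return v[-1][last], list(reversed(path))

print(viterbi([3, 1, 3]))  # best weather sequence for the ice-cream counts 3 1 3
```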
HMM Training: The Forward-Backward
Algorithm
• We turn to the third problem for HMMs: learning the parameters of
an HMM, that is, the A and B matrices.
• Given an observation sequence O and the set of possible states in
the HMM, learn the HMM parameters A and B.
• The input to such a learning algorithm would be an unlabeled
sequence of observations O and a vocabulary of potential hidden
states Q.
• Thus, for the ice cream task, we would start with a sequence of observations O = {1, 3, 2, …} and the set of hidden states H and C.
• The standard algorithm for HMM training is the forward-backward, or Baum-Welch, algorithm (Baum, 1972), a special case of the Expectation-Maximization or EM algorithm (Dempster et al., 1977).