Viterbi Decoding for HMM, Parameter Learning
Pawan Goyal
CSE, IIT Kharagpur
Week 4, Lecture 1
Walking through the states: best path
Finding the best path: Viterbi Algorithm
Intuition
Optimal path for each state can be recorded. We need
Cheapest cost to state j at step s: δj(s)
Backtrace from that state to best predecessor ψj(s)
Computing these values
δj(s+1) = max_{1≤i≤N} δi(s) p(tj|ti) p(ws+1|tj)
ψj(s+1) = argmax_{1≤i≤N} δi(s) p(tj|ti) p(ws+1|tj)
Best final state is argmax_{1≤i≤N} δi(|S|); we can backtrack from there
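The recursion above translates directly into code. A minimal Python sketch (the dictionary-based tables pi, trans, and emit are illustrative interfaces assumed here, not part of the lecture):

```python
def viterbi(words, tags, pi, trans, emit):
    """pi[t]: initial prob of tag t; trans[(t_prev, t)]: p(t|t_prev); emit[(t, w)]: p(w|t)."""
    # delta[s][t]: best score of any tag path ending in tag t at position s
    delta = [{t: pi.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}]
    psi = [{}]  # psi[s][t]: best predecessor tag of t at position s
    for s in range(1, len(words)):
        delta.append({})
        psi.append({})
        for t in tags:
            best_prev, best_score = None, 0.0
            for tp in tags:
                score = delta[s - 1][tp] * trans.get((tp, t), 0.0)
                if score > best_score:
                    best_prev, best_score = tp, score
            delta[s][t] = best_score * emit.get((t, words[s]), 0.0)
            psi[s][t] = best_prev
    # Backtrack from the best final state
    last = max(delta[-1], key=delta[-1].get)
    path = [last]
    for s in range(len(words) - 1, 0, -1):
        path.append(psi[s][path[-1]])
    return list(reversed(path))
```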
Practice Question
Suppose you want to use a HMM tagger to tag the phrase, “the light
book”, where we have the following probabilities:
P(the|Det) = 0.3, P(the|Noun) = 0.1, P(light|Noun) = 0.003, P(light|Adj) =
0.002, P(light|Verb) = 0.06, P(book|Noun) = 0.003, P(book|Verb) = 0.01
P(Verb|Det) = 0.00001, P(Noun|Det) = 0.5, P(Adj|Det) = 0.3,
P(Noun|Noun) =0.2, P(Adj|Noun) = 0.002, P(Noun|Adj) = 0.2,
P(Noun|Verb) = 0.3, P(Verb|Noun) = 0.3, P(Verb|Adj) = 0.001,
P(Verb|Verb) = 0.1
Work out in detail the steps of the Viterbi algorithm. You can use a table
to show the steps. Assume all other conditional probabilities, not
mentioned above, to be zero. Also, assume that all tags have the same
probability of appearing at the beginning of a sentence.
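As a sanity check on the hand-worked table, the probabilities given in the question can be fed to the viterbi sketch above (taking 0.25 as the common initial probability for every tag):

```python
# Data from the practice question, plugged into the viterbi() sketch above.
tags = ["Det", "Noun", "Adj", "Verb"]
emit = {("Det", "the"): 0.3, ("Noun", "the"): 0.1,
        ("Noun", "light"): 0.003, ("Adj", "light"): 0.002, ("Verb", "light"): 0.06,
        ("Noun", "book"): 0.003, ("Verb", "book"): 0.01}
trans = {("Det", "Verb"): 0.00001, ("Det", "Noun"): 0.5, ("Det", "Adj"): 0.3,
         ("Noun", "Noun"): 0.2, ("Noun", "Adj"): 0.002, ("Adj", "Noun"): 0.2,
         ("Verb", "Noun"): 0.3, ("Noun", "Verb"): 0.3, ("Adj", "Verb"): 0.001,
         ("Verb", "Verb"): 0.1}
pi = {t: 0.25 for t in tags}  # all tags equally likely at the start of a sentence
print(viterbi(["the", "light", "book"], tags, pi, trans, emit))
```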
Learning the Parameters
Two Scenarios
A labeled dataset is available, with the POS category of individual words
in a corpus
Only the corpus is available, but not labeled with the POS categories
Methods for these scenarios
For the first scenario, the parameters can be estimated directly from the
labeled dataset using maximum likelihood estimation
For the second scenario, the Baum-Welch algorithm is used to estimate the
parameters of the hidden Markov model.
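For the first scenario, the estimation amounts to counting. A minimal sketch, assuming the labeled corpus is a list of sentences, each a list of (word, tag) pairs (plain relative frequencies, no smoothing):

```python
from collections import Counter

def mle_estimate(tagged_sentences):
    """Supervised MLE for HMM parameters from a tagged corpus."""
    trans_c, emit_c, tag_c, prev_c, init_c = Counter(), Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        init_c[sent[0][1]] += 1                  # tag of the first word
        for i, (word, tag) in enumerate(sent):
            tag_c[tag] += 1
            emit_c[(tag, word)] += 1
            if i > 0:
                prev = sent[i - 1][1]
                trans_c[(prev, tag)] += 1
                prev_c[prev] += 1
    pi = {t: c / len(tagged_sentences) for t, c in init_c.items()}
    trans = {(p, t): c / prev_c[p] for (p, t), c in trans_c.items()}
    emit = {(t, w): c / tag_c[t] for (t, w), c in emit_c.items()}
    return pi, trans, emit
```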
Baum-Welch Algorithm
Pawan Goyal
CSE, IIT Kharagpur
Week 4, Lecture 2
Baum-Welch Algorithm
Uses the well-known EM algorithm to find the maximum likelihood estimate of
the parameters of a hidden Markov model
Parameters of HMM
Let Xt be the random variable denoting the hidden state at time t, and Yt be the
observation variable at time t. The HMM parameters are given by θ = (A,B,π),
where
A = {aij} = P(Xt = j|Xt−1 = i) is the state transition matrix
π = {πi} = P(X1 = i) is the initial state distribution
B = {bj(yt)} = P(Yt = yt|Xt = j) is the emission matrix
Given observation sequences Y = (Y1 = y1, Y2 = y2, ..., YT = yT), the
algorithm tries to find the parameters θ that maximise the probability of the
observations.
The Algorithm
The basic idea is to start with some random initial conditions on the
parameters θ, estimate best values of state paths Xt using these, then
re-estimate the parameters θ using the just-computed values of Xt, iteratively.
Intuition
Choose some initial values for θ = (A,B,π).
Repeat the following steps until convergence:
Determine probable (state) paths ...Xt−1 = i, Xt = j...
Count the expected number of transitions aij, as well as the expected
number of times each emission bj(yt) is made
Re-estimate θ = (A,B,π) using these expected counts of aij and bj(yt).
A forward-backward algorithm is used for finding the probable paths.
Forward-Backward Algorithm
Forward Procedure
Let αi(t) = P(Y1 = y1,...,Yt = yt, Xt = i | θ) be the probability of seeing y1,...,yt
and being in state i at time t. It is found recursively using:
αi(1) = πi bi(y1)
αj(t+1) = bj(yt+1) ∑_{i=1}^{N} αi(t) aij
Backward Procedure
Let βi(t) = P(Yt+1 = yt+1,...,YT = yT | Xt = i, θ) be the probability of the ending partial
sequence yt+1,...,yT given starting state i at time t. βi(t) is computed
recursively as:
βi(T) = 1
βi(t) = ∑_{j=1}^{N} βj(t+1) aij bj(yt+1)
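The two recursions can be sketched as follows (states and observations are assumed to be integer indices; A, B, and pi are plain nested lists, an assumption made for the sketch):

```python
def forward(obs, A, B, pi, N):
    """alpha[t][i] = P(y_1..y_t, X_t = i | theta)."""
    T = len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(N))
    return alpha

def backward(obs, A, B, N):
    """beta[t][i] = P(y_{t+1}..y_T | X_t = i, theta)."""
    T = len(obs)
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):
        beta[T - 1][i] = 1.0
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(beta[t + 1][j] * A[i][j] * B[j][obs[t + 1]] for j in range(N))
    return beta
```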
Finding probabilities of paths
We compute the following variables:
Probability of being in state i at time t, given the observation Y and
parameters θ:
γi(t) = P(Xt = i | Y, θ) = αi(t) βi(t) / ∑_{j=1}^{N} αj(t) βj(t)
Probability of being in state i at time t and state j at time t+1, given
the observation Y and parameters θ:
ζij(t) = P(Xt = i, Xt+1 = j | Y, θ) = αi(t) aij βj(t+1) bj(yt+1) / ∑_{i=1}^{N} ∑_{j=1}^{N} αi(t) aij βj(t+1) bj(yt+1)
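A sketch of these two quantities, reusing the α and β tables from the forward and backward passes above:

```python
def gammas_xis(obs, A, B, alpha, beta, N):
    """gamma[t][i] = P(X_t = i | Y); xi[t][i][j] = P(X_t = i, X_{t+1} = j | Y)."""
    T = len(obs)
    gamma, xi = [], []
    for t in range(T):
        norm = sum(alpha[t][j] * beta[t][j] for j in range(N))
        gamma.append([alpha[t][i] * beta[t][i] / norm for i in range(N)])
    for t in range(T - 1):
        denom = sum(alpha[t][i] * A[i][j] * beta[t + 1][j] * B[j][obs[t + 1]]
                    for i in range(N) for j in range(N))
        xi.append([[alpha[t][i] * A[i][j] * beta[t + 1][j] * B[j][obs[t + 1]] / denom
                    for j in range(N)] for i in range(N)])
    return gamma, xi
```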
Updating the parameters
πi = γi(1), the expected number of times state i was seen at time 1
aij = ∑_{t=1}^{T−1} ζij(t) / ∑_{t=1}^{T−1} γi(t), the expected number of transitions from state i to state j,
compared to the expected total number of transitions away from state i
bi(vk) = ∑_{t=1}^{T} 1_{yt=vk} γi(t) / ∑_{t=1}^{T} γi(t), with 1_{yt=vk} being an indicator function: the
expected number of times the output observation is vk while being in
state i, compared to the expected total number of times in state i.
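These three updates, together with the forward-backward pass, make up one EM iteration. A minimal re-estimation sketch using the γ and ζ tables computed above (V is the number of distinct observation symbols; zero-count states are not guarded against here):

```python
def baum_welch_update(obs, gamma, xi, N, V):
    """One re-estimation step from gamma (T x N) and xi (T-1 x N x N)."""
    pi = gamma[0][:]                                   # expected counts of states at time 1
    A = [[sum(xi[t][i][j] for t in range(len(xi))) /
          sum(gamma[t][i] for t in range(len(xi)))     # transitions out of i
          for j in range(N)] for i in range(N)]
    B = [[sum(gamma[t][i] for t in range(len(obs)) if obs[t] == k) /
          sum(gamma[t][i] for t in range(len(obs)))    # emissions of symbol k in state i
          for k in range(V)] for i in range(N)]
    return pi, A, B
```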
Maximum Entropy Models
Pawan Goyal
CSE, IIT Kharagpur
Week 4, Lecture 3
Issues with Markov Model Tagging
Unknown Words
We do not have the required probabilities.
Possible solution: use morphological cues (capitalization, suffix) to make a more
educated guess
Limited Context
“is clearly marked” → verb, past participle
“he clearly marked” → verb, past tense
Possible solution: use a higher-order model, combining various n-gram models to
avoid the sparseness problem
Maximum Entropy Modeling: Discriminative Model
We may identify a heterogeneous set of features which contribute in some way
to the choice of POS tag of the current word:
Whether it is the first word in the article
Whether the next word is to
Whether one of the last 5 words is a preposition, etc.
MaxEnt combines these features in a probabilistic model
Maximum Entropy: The Model
pλ(y|x) = (1 / Zλ(x)) exp( ∑_i λi fi(x,y) )
where
Zλ(x) is a normalizing constant given by
Zλ(x) = ∑_y exp( ∑_i λi fi(x,y) )
λi is a weight given to a feature fi
x denotes an observed datum and y denotes a class
What is the form of the features?
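Before turning to the features themselves, the model equation can be sketched directly. A minimal Python version, assuming the features are supplied as binary functions f(x, y) with one weight per feature:

```python
import math

def maxent_prob(x, y, classes, features, lam):
    """p(y|x) = exp(sum_i lam_i * f_i(x, y)) / Z(x), with binary feature functions."""
    def score(xx, yy):
        return sum(l * f(xx, yy) for l, f in zip(lam, features))
    Z = sum(math.exp(score(x, yp)) for yp in classes)   # normalizing constant Z(x)
    return math.exp(score(x, y)) / Z
```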
Features in Maximum Entropy Models
Features encode elements of the context x for predicting tag y
Context x is taken around the word w, for which a tag y is to be predicted
Features are binary-valued functions, e.g.,
f(x,y) = 1 if isCapitalized(w) ∧ y = NNP, and 0 otherwise
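That example feature, written as a function usable with the maxent_prob sketch above (the dict layout of the context x is an assumption made for illustration):

```python
def f_cap_nnp(x, y):
    # x is assumed to carry the current word under the key "word"
    return 1 if x["word"][:1].isupper() and y == "NNP" else 0
```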
Example Features
Example: Named Entities
LOCATION (in Arcadia)
LOCATION (in Québec)
DRUG (taking Zantac)
PERSON (saw Sue)
Example Features
f1(x,y) = [y = LOCATION ∧ w−1 = “in” ∧ isCapitalized(w)]
f2(x,y) = [y = LOCATION ∧ hasAccentedLatinChar(w)]
f3(x,y) = [y = DRUG ∧ ends(w, “c”)]
Tagging with Maximum Entropy Model
W = w1 ... wn - words in the corpus (observed)
T = t1 ... tn - the corresponding tags (unknown)
Tag sequence candidate {t1,...,tn} has conditional probability:
P(t1,...,tn | w1,...,wn) = ∏_{i=1}^{n} p(ti|xi)
The context xi also includes previously assigned tags for a fixed history.
Beam search is used to find the most probable sequence.
Beam Inference
Beam Inference
At each position, keep the top k complete sequences
Extend each sequence in each local way
The extensions compete for the k slots at the next position
But what is a MaxEnt model?
Let’s go to the basics now!
Maximum Entropy Model
Intuitive Principle
Model all that is known and assume nothing about that which is unknown.
Given a collection of facts, choose a model which is consistent with all the
facts, but otherwise as uniform as possible.
Maximum Entropy: Overview
Suppose we wish to model an expert translator’s decisions concerning
the proper French rendering of the English word ‘in’.
Each French word or phrase f is assigned an estimate p(f), the probability
that the expert would choose f as a translation of ‘in’.
Collect a large sample of instances of the expert’s decisions
Goal: extract a set of facts about the decision-making process (first task)
that will aid in constructing a model of this process (second task)
Maximum Entropy Model: Overview
First clue: list of allowed translations
Suppose the translator always chooses among {dans, en, à, au cours de,
pendant}.
First constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1.
There are infinitely many models p for which this identity holds; which is the most
intuitive model?
Allocate the total probability evenly among the five possible phrases →
the most uniform model subject to our knowledge.
Is it the most uniform model overall? → No, that would grant an equal
probability to every possible French phrase.
Maximum Entropy Model: Overview
More clues from the expert’s decision
Second clue: Suppose the expert chose either ‘dans’ or ‘en’ 30% of the
time.
Third clue: In half of the cases, the expert chose either ‘dans’ or ‘à’.
How do we measure uniformity of a model?
As we add complexity to the model, we face two difficulties:
What exactly is meant by “uniform”?
How can one measure the uniformity of a model?
Maximum Entropy Modeling
Entropy: measures the uncertainty of a distribution.
Quantifying uncertainty (“surprise”)
Event x
Probability px
Surprise: log(1/px)
Entropy: expected surprise (over p)
H(p) = Ep[ log2(1/px) ] = − ∑_x px log2 px
Coin Tossing
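A two-line check of the coin-tossing illustration: entropy is maximal (1 bit) for a fair coin and shrinks as the outcome becomes more predictable:

```python
import math

def entropy(probs):
    """Entropy in bits: the expected surprise -sum(p * log2 p), with 0*log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit (maximum uncertainty)
print(entropy([0.9, 0.1]))   # biased coin: about 0.47 bits
```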
Maximum Entropy Modeling
Distribution required
Minimize commitment = maximize entropy
Resemble some reference distribution
Solution
Maximize entropy H, subject to feature-based constraints:
Ep[fi] = Ep̃[fi]
Adding constraints
Lowers maximum entropy
Brings the distribution further from uniform and closer to data
Maximum Entropy Principle
Given n feature functions fi, we would like p to lie in the subset C of P defined
by
C = {p ∈ P | p(fi) = p̃(fi), i ∈ {1,2,...,n}}
Empirical count (expectation) of a feature
p̃(fi) = ∑_{x,y} p̃(x,y) fi(x,y)
Model expectation of a feature
p(fi) = ∑_{x,y} p̃(x) p(y|x) fi(x,y)
Select the distribution which is most uniform (in terms of conditional entropy):
p∗ = argmax_{p∈C} H(p), where H(p) = H(Y|X) ≈ − ∑_{x,y} p̃(x) p(y|x) log p(y|x)
Maximum Entropy Principle
p∗ = argmax_{p∈C} H(p)
Constraint Optimization
Introduce a parameter λi for each feature fi. The Lagrangian is given by
Λ(p,λ) = H(p) + ∑_i λi (p(fi) − p̃(fi))
Solving, we get
pλ(y|x) = (1 / Zλ(x)) exp( ∑_i λi fi(x,y) )
where Zλ(x) is a normalizing constant given by
Zλ(x) = ∑_y exp( ∑_i λi fi(x,y) )
Maximum Entropy Models
Pawan Goyal
CSE, IIT Kharagpur
Week 4, Lecture 4
Practice Question
Consider the maximum entropy model for POS tagging, where you want to estimate P(tag|word). In a
hypothetical setting, assume that tag can take the values D, N and V (short forms for Determiner, Noun and
Verb). The variable word could be any member of a set V of possible words, where V contains the words a,
man, sleeps, as well as additional words. The distribution should give the following probabilities
P(D|a) = 0.9
P(N|man) = 0.9
P(V|sleeps) = 0.9
P(D|word) = 0.6 for any word other than a, man or sleeps
P(N|word) = 0.3 for any word other than a, man or sleeps
P(V|word) = 0.1 for any word other than a, man or sleeps
It is assumed that all other probabilities, not defined above, can take any values such that
∑_tag P(tag|word) = 1 is satisfied for any word in V.
Define the features of your maximum entropy model that can model this distribution. Mark your
features as f1, f2 and so on. Each feature should have the same format as explained in the class.
[Hint: 6 features should make the analysis easier]
For each feature fi, assume a weight λi. Now, write expressions for the following probabilities in terms
of your model parameters:
P(D|cat)
P(N|laughs)
P(D|man)
What values do the parameters in your model take to give the distribution described above (i.e.
P(D|a) = 0.9 and so on)? You may leave the final answer in terms of equations.
Features for POS Tagging (Ratnaparkhi, 1996)
The specific word and tag context available to a feature is
hi = {wi, wi+1, wi+2, wi−1, wi−2, ti−1, ti−2}
Example: fj(hi, ti) = 1 if suffix(wi) = “ing” ∧ ti = VBG
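The same suffix feature in code, in the style of the earlier feature sketch (the dictionary layout of the history hi is an assumption):

```python
def f_ing_vbg(h, t):
    # h is assumed to carry the current word under the key "w_i"
    return 1 if h["w_i"].endswith("ing") and t == "VBG" else 0
```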
Example Features
Search Algorithm
Conditional Probability
Given a sentence {w1,...,wn}, a tag sequence candidate {t1,...,tn} has
conditional probability:
P(t1,...,tn | w1,...,wn) = ∏_{i=1}^{n} p(ti|xi)
A Tag Dictionary is used, which, for each known word, lists the tags that it has
appeared with in the training set.
Search Algorithm
Let W = {w1,...,wn} be a test sentence, and let sij be the jth highest probability tag
sequence up to and including word wi.
Search description
Generate tags for w1, find the top N, and set s1j, 1 ≤ j ≤ N, accordingly.
Initialize i = 2
  Initialize j = 1
  Generate tags for wi, given s(i−1)j as previous tag context, and append
  each tag to s(i−1)j to make a new sequence
  j = j + 1; repeat if j ≤ N
Find the N highest probability sequences generated by the above loop, and set sij
accordingly
i = i + 1; repeat if i ≤ n
Return the highest probability sequence sn1
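A compact sketch of this N-best (beam) search; prob(tag, context) and candidate_tags(word) are assumed interfaces to the trained MaxEnt model and to the tag dictionary:

```python
def beam_search(words, candidate_tags, prob, N=2):
    """Left-to-right N-best search over tag sequences."""
    # Each beam entry: (sequence probability, list of tags so far)
    beams = [(prob(t, {"word": words[0], "prev_tag": None}), [t])
             for t in candidate_tags(words[0])]
    beams = sorted(beams, reverse=True)[:N]
    for i in range(1, len(words)):
        extended = []
        for score, tags in beams:
            ctx = {"word": words[i], "prev_tag": tags[-1]}
            for t in candidate_tags(words[i]):
                extended.append((score * prob(t, ctx), tags + [t]))
        # The extensions compete for the N slots at the next position
        beams = sorted(extended, reverse=True)[:N]
    return beams[0][1]  # highest probability sequence
```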
A Good Reference
Berger, Della Pietra, and Della Pietra (1996). A Maximum Entropy Approach to Natural Language
Processing. Computational Linguistics, Vol. 22, No. 1.
Conditional Random Fields
Pawan Goyal
CSE, IIT Kharagpur
Week 4, Lecture 5
Practice Question
Suppose you want to use a MaxEnt tagger to tag the sentence, “the light book”. We know that the top 2
POS tags for the words the, light and book are {Det,Noun}, {Verb,Adj} and {Verb,Noun}, respectively.
Assume that the MaxEnt model uses the following history hi (context) for a word wi:
hi = {wi,wi−1,wi+1,ti−1}
where wi−1 and wi+1 correspond to the previous and next words and ti−1 corresponds to the tag of the
previous word. Accordingly, the following features are being used by the MaxEnt model:
f1: ti−1 = Det and ti = Adj
f2: ti−1 = Noun and ti = Verb
f3: ti−1 = Adj and ti = Noun
f4: wi−1 = the and ti = Adj
f5: wi−1 = the and wi+1 = book and ti = Adj
f6: wi−1 = light and ti = Noun
f7: wi+1 = light and ti = Det
f8: wi−1 = NULL and ti = Noun
Assume that each feature has a uniform weight of 1.0.
Use the beam search algorithm with a beam size of 2 to identify the highest probability tag sequence for the
sentence.
Problem with Maximum Entropy Models
Per-state normalization
All the mass that arrives at a state must be distributed among the possible
successor states
This gives a ‘label bias’ problem
Let’s see the intuition (on paper)
Conditional Random Fields
CRFs are conditionally trained, undirected graphical models.
Let’s look at the linear chain structure
Conditional Random Fields: Feature Functions
Conditional Random Fields: Distribution
CRFs
Have the advantages of MEMM but avoid the label bias problem
CRFs are globally normalized, whereas MEMMs are locally normalized.
Widely used and applied. CRFs have been (close to) state-of-the-art in
many sequence labeling tasks.
Week-4.pdf

  • 1. Viterbi Decoding for HMM, Parameter Learning Pawan Goyal CSE, IIT Kharagpur Week 4, Lecture 1 Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 1 1 / 6 N P T E L
  • 2. Walking through the states: best path Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 1 2 / 6 N P T E L
  • 3. Walking through the states: best path Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 1 3 / 6 N P T E L
  • 4. Finding the best path: Viterbi Algorithm Intuition Optimal path for each state can be recorded. We need Cheapest cost to state j at step s: δj(s) Backtrace from that state to best predecessor ψj(s) Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 1 4 / 6 N P T E L
  • 5. Finding the best path: Viterbi Algorithm Intuition Optimal path for each state can be recorded. We need Cheapest cost to state j at step s: δj(s) Backtrace from that state to best predecessor ψj(s) Computing these values δj(s+1) = max1≤i≤Nδi(s)p(tj|ti)p(ws+1|tj) Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 1 4 / 6 N P T E L
  • 6. Finding the best path: Viterbi Algorithm Intuition Optimal path for each state can be recorded. We need Cheapest cost to state j at step s: δj(s) Backtrace from that state to best predecessor ψj(s) Computing these values δj(s+1) = max1≤i≤Nδi(s)p(tj|ti)p(ws+1|tj) ψj(s+1) = argmax1≤i≤Nδi(s)p(tj|ti)p(ws+1|tj) Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 1 4 / 6 N P T E L
  • 7. Finding the best path: Viterbi Algorithm Intuition Optimal path for each state can be recorded. We need Cheapest cost to state j at step s: δj(s) Backtrace from that state to best predecessor ψj(s) Computing these values δj(s+1) = max1≤i≤Nδi(s)p(tj|ti)p(ws+1|tj) ψj(s+1) = argmax1≤i≤Nδi(s)p(tj|ti)p(ws+1|tj) Best final state is argmax1≤i≤Nδi(|S|), we can backtrack from there Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 1 4 / 6 N P T E L
  • 8. Practice Question Suppose you want to use a HMM tagger to tag the phrase, “the light book”, where we have the following probabilities: P(the|Det) = 0.3, P(the|Noun) = 0.1, P(light|Noun) = 0.003, P(light|Adj) = 0.002, P(light|Verb) = 0.06, P(book|Noun) = 0.003, P(book|Verb) = 0.01 P(Verb|Det) = 0.00001, P(Noun|Det) = 0.5, P(Adj|Det) = 0.3, P(Noun|Noun) =0.2, P(Adj|Noun) = 0.002, P(Noun|Adj) = 0.2, P(Noun|Verb) = 0.3, P(Verb|Noun) = 0.3, P(Verb|Adj) = 0.001, P(Verb|Verb) = 0.1 Work out in details the steps of the Viterbi algorithm. You can use a Table to show the steps. Assume all other conditional probabilities, not mentioned to be zero. Also, assume that all tags have the same probabilities to appear in the beginning of a sentence. Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 1 5 / 6 N P T E L
  • 9. Learning the Parameters Two Scenarios A labeled dataset is available, with the POS category of individual words in a corpus Only the corpus is available, but not labeled with the POS categories Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 1 6 / 6 N P T E L
  • 10. Learning the Parameters Two Scenarios A labeled dataset is available, with the POS category of individual words in a corpus Only the corpus is available, but not labeled with the POS categories Methods for these scenarios For the first scenario, parameters can be directly estimated using maximum likelihood estimate from the labeled dataset For the second scenario, Baum-Welch Algorithm is used to estimate the parameters of the hidden markov model. Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 1 6 / 6 N P T E L
  • 11. Baum Welch Algorithm Pawan Goyal CSE, IIT Kharagpur Week 4, Lecture 2 Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 1 / 6 N P T E L
  • 12. Baum Welch Algorithm Uses the well-known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden markov model Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 2 / 6 N P T E L
  • 13. Baum Welch Algorithm Uses the well-known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden markov model Parameters of HMM Let Xt be the random variable denoting hidden state at time t, and Yt be the observation variable at time T. HMM parameters are given by θ = (A,B,π) where A = {aij} = P(Xt = j|Xt−1 = i) is the state transition matrix π = {πi} = P(X1 = i) is the initial state distribution B = {bj(yt)} = P(Yt = yt|Xt = j) is the emission matrix Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 2 / 6 N P T E L
  • 14. Baum Welch Algorithm Uses the well-known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden markov model Parameters of HMM Let Xt be the random variable denoting hidden state at time t, and Yt be the observation variable at time T. HMM parameters are given by θ = (A,B,π) where A = {aij} = P(Xt = j|Xt−1 = i) is the state transition matrix π = {πi} = P(X1 = i) is the initial state distribution B = {bj(yt)} = P(Yt = yt|Xt = j) is the emission matrix Given observation sequences Y = (Y1 = y1,Y2 = y2,...,YT = yT), the algorithm tries to find the parameters θ that maximise the probability of the observation. Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 2 / 6 N P T E L
  • 15. The Algorithm The basic idea is to start with some random initial conditions on the parameters θ, estimate best values of state paths Xt using these, then re-estimate the parameters θ using the just-computed values of Xt, iteratively. Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 3 / 6 N P T E L
  • 16. The Algorithm The basic idea is to start with some random initial conditions on the parameters θ, estimate best values of state paths Xt using these, then re-estimate the parameters θ using the just-computed values of Xt, iteratively. Intuition Choose some initial values for θ = (A,B,π). Repeat the following step until convergence: Determine probable (state) paths ...Xt−1 = i,Xt = j... Count the expected number of transitions aij as well as the expected number of times, various emissions bj(yt) are made Re-estimate θ = (A,B,π) using aij and bj(yt)s. A forward-backward algorithm is used for finding probable paths. Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 3 / 6 N P T E L
  • 17. Forward-Backward Algorithm Forward Procedure αi(t) = P(Y1 = y1,...,Yt = yt,Xt = i|θ) be the probability of seeing y1,...,yt and being in state i at time t. Found recursively using: αi(1) = πibi(y1) Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 4 / 6 N P T E L
  • 18. Forward-Backward Algorithm Forward Procedure αi(t) = P(Y1 = y1,...,Yt = yt,Xt = i|θ) be the probability of seeing y1,...,yt and being in state i at time t. Found recursively using: αi(1) = πibi(y1) αj(t +1) = bj(yt+1) N X i=1 αi(t)aij Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 4 / 6 N P T E L
  • 19. Forward-Backward Algorithm Forward Procedure αi(t) = P(Y1 = y1,...,Yt = yt,Xt = i|θ) be the probability of seeing y1,...,yt and being in state i at time t. Found recursively using: αi(1) = πibi(y1) αj(t +1) = bj(yt+1) N X i=1 αi(t)aij Backward Procedure βi(t) = P(Yt+1 = yt+1,...,YT = yT|Xt = i,θ) be the probability of ending partial sequence yt+1,...,yT given starting state i at time t. βi(t) is computed recursively as: βi(T) = 1 Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 4 / 6 N P T E L
  • 20. Forward-Backward Algorithm Forward Procedure αi(t) = P(Y1 = y1,...,Yt = yt,Xt = i|θ) be the probability of seeing y1,...,yt and being in state i at time t. Found recursively using: αi(1) = πibi(y1) αj(t +1) = bj(yt+1) N X i=1 αi(t)aij Backward Procedure βi(t) = P(Yt+1 = yt+1,...,YT = yT|Xt = i,θ) be the probability of ending partial sequence yt+1,...,yT given starting state i at time t. βi(t) is computed recursively as: βi(T) = 1 βi(t) = N X j=1 βj(t +1)aijbj(yt+1) Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 4 / 6 N P T E L
  • 21. Finding probabilities of paths We compute the following variables: Probability of being in state i at time t given the observation Y and parameters θ γi(t) = P(Xt = i|Y,θ) = αi(t)βi(t) PN j=1 αj(t)βj(t) Probability of being in state i and j at time t and t +1 respectively given the observation Y and parameters θ ζij(t) = P(Xt = i,Xt+1 = j|Y,θ) = αi(t)aijβj(t +1)bj(yt+1) PN i=1 PN j=1 αi(t)aijβj(t +1)bj(yt+1) Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 5 / 6 N P T E L
  • 22. Updating the parameters πi = γi(1), expected number of times state i was seen at time 1 aij = PT t=1 ζij(t) PT t=1 γi(t) , expected number of transitions from state i to state j, compared to the total number of transitions away from state i bi(vk) = PT t=1 1yt=vk γi(t) PT t=1 γi(t) with 1yt=vk being an indicator function, is the expected number of times the output observations are vk while being in state i compared to the expected total number of times in state i. Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 2 6 / 6 N P T E L
  • 23. Maximum Entropy Models Pawan Goyal CSE, IIT Kharagpur Week 4, Lecture 3 Pawan Goyal (IIT Kharagpur) POS Tagging Week 4, Lecture 3 1 / 16 N P T E L
• 24-29. Issues with Markov Model Tagging
Unknown Words
We do not have the required probabilities. Possible solutions:
- Use morphological cues (capitalization, suffix) to assign a more calculated guess
Limited Context
"is clearly marked" → verb, past participle
"he clearly marked" → verb, past tense
Possible solution:
- Use a higher-order model, combining various n-gram models to avoid the sparseness problem
• 30-34. Maximum Entropy Modeling: Discriminative Model
We may identify a heterogeneous set of features which contribute in some way to the choice of POS tag of the current word:
- Whether it is the first word in the article
- Whether the next word is "to"
- Whether one of the last 5 words is a preposition, etc.
MaxEnt combines these features in a probabilistic model.
• 35-38. Maximum Entropy: The Model
p_λ(y|x) = (1/Z_λ(x)) exp( Σ_i λ_i f_i(x,y) )
where Z_λ(x) is a normalizing constant given by
Z_λ(x) = Σ_y exp( Σ_i λ_i f_i(x,y) )
λ_i is a weight given to a feature f_i
x denotes an observed datum and y denotes a class
What is the form of the features?
• 39-40. Features in Maximum Entropy Models
Features encode elements of the context x for predicting the tag y.
The context x is taken around the word w, for which a tag y is to be predicted.
Features are binary-valued functions, e.g.,
f(x,y) = 1 if isCapitalized(w) and y = NNP, 0 otherwise
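As an illustration of how such binary features and their weights λ_i combine into p_λ(y|x): the feature set, weights, and tag list below are made up for the example, not the tagger's actual ones.

```python
import math

def is_capitalized(w):
    return w[:1].isupper()

# Illustrative binary features f_i(x, y); here x is simply the word being tagged.
features = [
    lambda x, y: 1.0 if is_capitalized(x) and y == "NNP" else 0.0,
    lambda x, y: 1.0 if x.endswith("ing") and y == "VBG" else 0.0,
    lambda x, y: 1.0 if x.lower() == "the" and y == "DT" else 0.0,
]
weights = [1.2, 0.8, 2.0]          # lambda_i, assumed values for illustration
TAGS = ["NNP", "VBG", "DT", "NN"]  # assumed candidate tag set

def p_lambda(y, x):
    """p_lambda(y|x) = exp(sum_i lambda_i f_i(x,y)) / Z_lambda(x)."""
    score = lambda t: math.exp(sum(l * f(x, t) for l, f in zip(weights, features)))
    return score(y) / sum(score(t) for t in TAGS)

print(p_lambda("NNP", "Kharagpur"))  # a capitalized word is pushed toward NNP
```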
• 41-42. Example Features
Example: Named Entities
LOCATION (in Arcadia)
LOCATION (in Québec)
DRUG (taking Zantac)
PERSON (saw Sue)
Example Features
f1(x,y) = [y = LOCATION ∧ w_{−1} = "in" ∧ isCapitalized(w)]
f2(x,y) = [y = LOCATION ∧ hasAccentedLatinChar(w)]
f3(x,y) = [y = DRUG ∧ ends(w, "c")]
• 43-45. Tagging with Maximum Entropy Model
W = w_1 ... w_n : the words in the corpus (observed)
T = t_1 ... t_n : the corresponding tags (unknown)
A tag sequence candidate {t_1, ..., t_n} has conditional probability:
P(t_1, ..., t_n | w_1, ..., w_n) = Π_{i=1}^{n} p(t_i | x_i)
The context x_i also includes previously assigned tags for a fixed history. Beam search is used to find the most probable sequence.
• 46-47. Beam Inference
- At each position, keep the top k complete sequences
- Extend each sequence in each local way
- The extensions compete for the k slots at the next position
But what is a MaxEnt model? Let's go to the basics now!
• 48-49. Maximum Entropy Model
Intuitive Principle
Model all that is known and assume nothing about that which is unknown.
Given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible.
• 50-53. Maximum Entropy: Overview
Suppose we wish to model an expert translator's decisions concerning the proper French rendering of the English word 'in'.
Each French word or phrase f is assigned an estimate p(f), the probability that the expert would choose f as a translation of 'in'.
Collect a large sample of instances of the expert's decisions.
Goal: extract a set of facts about the decision-making process (first task) that will aid in constructing a model of this process (second task).
• 54-59. Maximum Entropy Model: Overview
First clue: list of allowed translations
Suppose the translator always chooses among {dans, en, à, au cours de, pendant}.
First constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1.
An infinite number of models p satisfy this identity; which is the most intuitive model?
Allocate the total probability evenly among the five possible phrases → the most uniform model subject to our knowledge.
Is it the most uniform model overall? → No, that would grant an equal probability to every possible French phrase.
• 60-62. Maximum Entropy Model: Overview
More clues from the expert's decisions
Second clue: Suppose the expert chose either 'dans' or 'en' 30% of the time.
Third clue: In half of the cases, the expert chose either 'dans' or 'à'.
How do we measure the uniformity of a model? As we add complexity to the model, we face two difficulties:
- What exactly is meant by "uniform"?
- How can one measure the uniformity of a model?
• 63-65. Maximum Entropy Modeling
Entropy measures the uncertainty of a distribution.
Quantifying uncertainty ("surprise")
Event x, probability p_x, surprise: log(1/p_x)
Entropy: expected surprise (over p)
H(p) = E_p[ log_2 (1/p_x) ] = − Σ_x p_x log_2 p_x
Coin Tossing [illustrated graphically on the slide]
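The coin-tossing case is easy to check numerically; a quick sketch, assuming nothing beyond the definition above:

```python
import math

def entropy(p):
    """H(p) = -sum_x p_x log2 p_x (terms with p_x = 0 contribute 0)."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))   # ~0.469 bits: a biased coin is less surprising on average
print(entropy([1.0, 0.0]))   # 0.0 bits: no uncertainty at all
```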
• 66-68. Maximum Entropy Modeling
Distribution required
- Minimize commitment = maximize entropy
- Resemble some reference distribution
Solution
Maximize entropy H, subject to feature-based constraints: E_p[f_i] = E_p̃[f_i]
Adding constraints
- Lowers the maximum entropy
- Brings the distribution further from uniform and closer to the data
• 69-71. Maximum Entropy Principle
Given n feature functions f_i, we would like p to lie in the subset C of P defined by
C = {p ∈ P | p(f_i) = p̃(f_i), i ∈ {1, 2, ..., n}}
Empirical count (expectation) of a feature:
p̃(f_i) = Σ_{x,y} p̃(x,y) f_i(x,y)
Model expectation of a feature:
p(f_i) = Σ_{x,y} p̃(x) p(y|x) f_i(x,y)
Select the distribution which is most uniform, as measured by conditional entropy:
p* = argmax_{p ∈ C} H(p), where H(p) = H(Y|X) ≈ − Σ_{x,y} p̃(x) p(y|x) log p(y|x)
• 72-73. Maximum Entropy Principle
p* = argmax_{p ∈ C} H(p)
Constraint Optimization
Introduce a parameter λ_i for each feature f_i. The Lagrangian is given by
Λ(p, λ) = H(p) + Σ_i λ_i ( p(f_i) − p̃(f_i) )
Solving, we get
p_λ(y|x) = (1/Z_λ(x)) exp( Σ_i λ_i f_i(x,y) )
where Z_λ(x) is a normalizing constant given by
Z_λ(x) = Σ_y exp( Σ_i λ_i f_i(x,y) )
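To tie this back to the 'in' translation example, here is a toy sketch that fits the exponential-form model to the two clues. The features, targets, learning rate, and iteration count are all assumptions for illustration, and simple gradient ascent on the multipliers stands in for the iterative scaling methods usually presented.

```python
import math

PHRASES = ["dans", "en", "à", "au cours de", "pendant"]
# Two indicator features from the clues: f1 = {dans, en}, f2 = {dans, à}
features = [lambda f: f in ("dans", "en"), lambda f: f in ("dans", "à")]
targets = [0.3, 0.5]        # empirical expectations p~(f1), p~(f2)
lam = [0.0, 0.0]            # one multiplier lambda_i per constraint

def model(lam):
    """p(f) proportional to exp(sum_i lambda_i f_i(f)) -- the max-ent form."""
    scores = [math.exp(sum(l * fi(f) for l, fi in zip(lam, features))) for f in PHRASES]
    z = sum(scores)
    return [s / z for s in scores]

for _ in range(5000):
    p = model(lam)
    for i, (fi, tgt) in enumerate(zip(features, targets)):
        expectation = sum(pf for pf, f in zip(p, PHRASES) if fi(f))
        lam[i] += 0.1 * (tgt - expectation)   # nudge the model expectation toward its target

for f, pf in zip(PHRASES, model(lam)):
    print(f, round(pf, 3))
```

The fitted distribution should satisfy p(dans) + p(en) ≈ 0.3 and p(dans) + p(à) ≈ 0.5 while spreading the remaining probability mass as evenly as the constraints allow.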
• 74. Maximum Entropy Models
Pawan Goyal, CSE, IIT Kharagpur
Week 4, Lecture 4
• 75. Practice Question
Consider the maximum entropy model for POS tagging, where you want to estimate P(tag|word). In a hypothetical setting, assume that tag can take the values D, N and V (short forms for Determiner, Noun and Verb). The variable word could be any member of a set V of possible words, where V contains the words a, man, sleeps, as well as additional words. The distribution should give the following probabilities:
P(D|a) = 0.9, P(N|man) = 0.9, P(V|sleeps) = 0.9
P(D|word) = 0.6 for any word other than a, man or sleeps
P(N|word) = 0.3 for any word other than a, man or sleeps
P(V|word) = 0.1 for any word other than a, man or sleeps
It is assumed that all other probabilities, not defined above, can take any values such that Σ_tag P(tag|word) = 1 is satisfied for any word in V.
- Define the features of your maximum entropy model that can model this distribution. Mark your features as f1, f2 and so on. Each feature should have the same format as explained in the class. [Hint: 6 features should make the analysis easier]
- For each feature fi, assume a weight λi. Now write expressions for the following probabilities in terms of your model parameters: P(D|cat), P(N|laughs), P(D|man)
- What values do the parameters in your model take to give the distribution described above (i.e. P(D|a) = 0.9 and so on)? You may leave the final answer in terms of equations.
• 76-77. Features for POS Tagging (Ratnaparkhi, 1996)
The specific word and tag context available to a feature is
h_i = {w_i, w_{i+1}, w_{i+2}, w_{i−1}, w_{i−2}, t_{i−1}, t_{i−2}}
Example:
f_j(h_i, t_i) = 1 if suffix(w_i) = "ing" and t_i = VBG
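A small sketch of how such a history and feature might be realized; the padding convention, dictionary keys, and example sentence are assumptions for illustration, not Ratnaparkhi's implementation.

```python
def history(words, tags, i):
    """h_i = {w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}}, with None padding."""
    get = lambda seq, j: seq[j] if 0 <= j < len(seq) else None
    return {
        "w0": words[i], "w+1": get(words, i + 1), "w+2": get(words, i + 2),
        "w-1": get(words, i - 1), "w-2": get(words, i - 2),
        "t-1": get(tags, i - 1), "t-2": get(tags, i - 2),
    }

def f_ing_vbg(h, t):
    """f_j(h_i, t_i) = 1 if suffix(w_i) = 'ing' and t_i = VBG, else 0."""
    return 1.0 if h["w0"].endswith("ing") and t == "VBG" else 0.0

h = history(["he", "is", "running", "fast"], ["PRP", "VBZ"], 2)
print(f_ing_vbg(h, "VBG"))   # 1.0
```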
• 78-80. Example Features
[The feature tables on these slides are images and are not reproduced in this transcript.]
• 81-82. Search Algorithm
Conditional Probability
Given a sentence {w_1, ..., w_n}, a tag sequence candidate {t_1, ..., t_n} has conditional probability:
P(t_1, ..., t_n | w_1, ..., w_n) = Π_{i=1}^{n} p(t_i | x_i)
A Tag Dictionary is used, which, for each known word, lists the tags that it has appeared with in the training set.
• 83-88. Search Algorithm
Let W = {w_1, ..., w_n} be a test sentence, and let s_ij be the jth highest probability tag sequence up to and including word w_i.
Search description
- Generate tags for w_1, find the top N, and set s_1j, 1 ≤ j ≤ N, accordingly.
- Initialize i = 2
  - Initialize j = 1
  - Generate tags for w_i, given s_(i−1)j as the previous tag context, and append each tag to s_(i−1)j to make a new sequence
  - j = j + 1, repeat if j ≤ N
- Find the N highest probability sequences generated by the above loop, and set s_ij accordingly
- i = i + 1, repeat if i ≤ n
- Return the highest probability sequence s_n1
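A compact sketch of this N-best beam search is below. It assumes two placeholders that are not defined in the slides: tag_dict(word), returning the candidate tags for a word (the Tag Dictionary above), and p_tag(tag, words, prev_tags, i), returning p(t_i | x_i) from the trained MaxEnt model.

```python
def beam_search(words, tag_dict, p_tag, N=3):
    """Keep the N highest-probability tag sequences s_ij after each word."""
    # Sequences for w_1: every candidate tag on its own, scored by the model.
    beam = [((t,), p_tag(t, words, (), 0)) for t in tag_dict(words[0])]
    beam = sorted(beam, key=lambda s: s[1], reverse=True)[:N]

    for i in range(1, len(words)):
        candidates = []
        for prev_tags, prob in beam:                  # each surviving s_(i-1)j
            for t in tag_dict(words[i]):              # extend by every candidate tag for w_i
                candidates.append((prev_tags + (t,),
                                   prob * p_tag(t, words, prev_tags, i)))
        beam = sorted(candidates, key=lambda s: s[1], reverse=True)[:N]

    return beam[0]                                    # the highest-probability sequence s_n1
```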
• 89. A Good Reference
Berger et al. (1996), "A Maximum Entropy Approach to Natural Language Processing", Computational Linguistics, Vol. 22, No. 1.
• 90. Conditional Random Fields
Pawan Goyal, CSE, IIT Kharagpur
Week 4, Lecture 5
• 91. Practice Question
Suppose you want to use a MaxEnt tagger to tag the sentence, "the light book". We know that the top 2 POS tags for the words the, light and book are {Det, Noun}, {Verb, Adj} and {Verb, Noun}, respectively. Assume that the MaxEnt model uses the following history h_i (context) for a word w_i:
h_i = {w_i, w_{i−1}, w_{i+1}, t_{i−1}}
where w_{i−1} and w_{i+1} correspond to the previous and next words and t_{i−1} corresponds to the tag of the previous word. Accordingly, the following features are being used by the MaxEnt model:
f1: t_{i−1} = Det and t_i = Adj
f2: t_{i−1} = Noun and t_i = Verb
f3: t_{i−1} = Adj and t_i = Noun
f4: w_{i−1} = the and t_i = Adj
f5: w_{i−1} = the and w_{i+1} = book and t_i = Adj
f6: w_{i−1} = light and t_i = Noun
f7: w_{i+1} = light and t_i = Det
f8: w_{i−1} = NULL and t_i = Noun
Assume that each feature has a uniform weight of 1.0. Use the beam search algorithm with a beam size of 2 to identify the highest probability tag sequence for the sentence.
• 92-93. Problem with Maximum Entropy Models
Per-state normalization
All the mass that arrives at a state must be distributed among the possible successor states.
This gives a 'label bias' problem: intuitively, states with few outgoing transitions pass their probability mass on almost regardless of the observation, so the model is biased towards them. Let's see the intuition (on paper).
• 94. Conditional Random Fields
CRFs are conditionally trained, undirected graphical models. Let's look at the linear chain structure.
• 95-96. Conditional Random Fields: Feature Functions
[The feature-function examples on these slides are images and are not reproduced in this transcript.]
• 97. Conditional Random Fields: Distribution
[The distribution shown on this slide is an image and is not reproduced in this transcript.]
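For reference, since the slide content is only available as an image, the standard linear-chain CRF distribution (stated here from the usual formulation, not transcribed from the slide) is:
p(y|x) = (1/Z(x)) exp( Σ_{t=1}^{T} Σ_k λ_k f_k(y_{t−1}, y_t, x, t) )
Z(x) = Σ_{y'} exp( Σ_{t=1}^{T} Σ_k λ_k f_k(y'_{t−1}, y'_t, x, t) )
Note that Z(x) sums over entire tag sequences y', which is the global normalization referred to on the next slide.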
• 98. CRFs
- Have the advantages of MEMMs but avoid the label bias problem
- CRFs are globally normalized, whereas MEMMs are locally normalized
- Widely used and applied; CRFs have been (close to) state-of-the-art in many sequence labeling tasks