Word Level Analysis
POS (Continued)
L3 NLP (Elective)
TJS
Words and Word Classes
 Words are classified into categories called parts of speech (also known as word classes or lexical categories)
Part of Speech
NN noun student, chair, proof, mechanism
VB verb study, increase, produce
JJ adjective large, high, tall, few
RB adverb carefully, slowly, uniformly
IN preposition in, on, to, of
PRP pronoun I, me, they
DT determiner the, a, an, this, those
 open vs. closed word classes
Part of Speech tagging
 The process of assigning a part of speech (noun, verb, pronoun, preposition, adverb, adjective, etc.) to each word in a sentence
Fig. A POS tagger takes the words of a sentence plus a tag set as input and outputs a POS tag for each word.
Speech/NN sounds/NNS were/VBD sampled/VBN
by/IN a/DT microphone/NN.
Another possible tagging for the sentence is:
Speech/NN sounds/VBZ were/VBD sampled/VBN
by/IN a/DT microphone/NN.
Part of speech tagging methods
 Rule-based (linguistic)
 Stochastic (Data-driven) and
 TBL (Transformation Based Learning)
Rule-based (linguistic)
Steps:
1. Dictionary lookup → potential tags
2. Hand-coded Rules
The show must go on.
Step 1 → NN, VB
Step 2 → discard the incorrect tag
Rule: IF preceding word is determiner THEN
eliminate VB tag.
 Morphological information
IF word ends in –ing and preceding word is a verb
THEN label it a verb (VB).
 Capitalization information
Advantages (+):
 Speed
 Deterministic
Disadvantages (−):
 Requires manual work
 Usable for only one language
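A minimal sketch of the two rule-based steps (dictionary lookup, then a hand-coded rule). The five-word lexicon and the single determiner rule are illustrative assumptions, not a complete rule set.

```python
# Minimal rule-based tagging sketch: dictionary lookup, then hand-coded rules.
# The lexicon and the single rule below are illustrative assumptions only.

LEXICON = {           # step 1: potential tags per word (hypothetical mini-dictionary)
    "the":  ["DT"],
    "show": ["NN", "VB"],
    "must": ["MD"],
    "go":   ["VB"],
    "on":   ["IN"],
}

def tag(sentence):
    words = sentence.lower().strip(".").split()
    tags = []
    for i, word in enumerate(words):
        candidates = list(LEXICON.get(word, ["NN"]))   # unknown words default to NN
        # step 2: hand-coded rule -- IF the preceding word is a determiner,
        # THEN eliminate the VB tag from the candidate set.
        if i > 0 and tags[i - 1] == "DT" and len(candidates) > 1 and "VB" in candidates:
            candidates.remove("VB")
        tags.append(candidates[0])                     # keep the first remaining candidate
    return list(zip(words, tags))

print(tag("The show must go on."))
# [('the', 'DT'), ('show', 'NN'), ('must', 'MD'), ('go', 'VB'), ('on', 'IN')]
```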
Stochastic Tagger
 The standard stochastic tagger algorithm is the
Hidden Markov Model (HMM) tagger.
 A Markov model applies the simplifying assumption that the probability of a chain of symbols can be approximated in terms of its parts or n-grams.
 The simplest n-gram model is the unigram
model, which assigns the most likely tag (part-of-
speech) to each token.
 The unigram model requires tagged data from which to gather tag frequency statistics. The context used by the unigram tagger is the text of the word itself. For example, it will assign the tag JJ to each occurrence of fast if fast is used as an adjective more frequently than as a noun, verb, or adverb.
 (1) She had a fast
 (2) Muslim fast during Ramadan
 (3) Those who are injured need medical help fast.
 We would expect more accurate predictions if we
took more context into account when making a
tagging decision.
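A minimal sketch of a unigram tagger, assuming a toy tagged corpus stands in for real training data; the counts and the fallback tag are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Unigram tagger sketch: assign each word its most frequent tag in tagged data.
# The toy tagged corpus below is a hypothetical stand-in for real training data.
tagged_corpus = [
    ("she", "PRP"), ("had", "VBD"), ("a", "DT"), ("fast", "JJ"), ("car", "NN"),
    ("the", "DT"), ("fast", "JJ"), ("train", "NN"),
    ("run", "VB"), ("fast", "RB"),
]

counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

def unigram_tag(word, default="NN"):
    """Return the most likely tag for the word itself, ignoring context."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default

print(unigram_tag("fast"))   # 'JJ' -- the adjective use is most frequent in the toy data
```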
 A bi-gram tagger uses the current word and the tag of the previous word in the tagging process. As the tag sequence “DT NN” is more likely than the tag sequence “DT JJ”, a bi-gram model will assign the correct tag to the word fast in sentence (1).
 Similarly, it is more likely that an adverb (rather
than a noun or an adjective) follows a verb.
Hence, in sentence (3), the tag assigned to fast will be RB (adverb).
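A sketch of that bi-gram decision for fast: a (previous-tag → tag) transition probability is combined with a per-word probability. The transition and emission numbers are made up purely to illustrate the argument; they are not corpus estimates.

```python
# Bigram disambiguation sketch for "fast". All numbers are illustrative only.

transition = {            # P(tag | previous_tag), hypothetical values
    ("DT", "JJ"): 0.20, ("DT", "NN"): 0.50, ("DT", "RB"): 0.01,
    ("VB", "RB"): 0.20, ("VB", "JJ"): 0.05, ("VB", "NN"): 0.15,
}
emission = {              # P(word="fast" | tag), hypothetical values
    "JJ": 0.004, "NN": 0.003, "RB": 0.006,
}

def best_tag_for_fast(prev_tag):
    scores = {t: transition.get((prev_tag, t), 1e-6) * emission[t]
              for t in emission}
    return max(scores, key=scores.get)

print(best_tag_for_fast("DT"))  # 'NN' -- noun reading, as in sentence (1) "She had a fast"
print(best_tag_for_fast("VB"))  # 'RB' -- adverb reading after a verb, as in sentence (3)
```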
N-gram Model
 An n-gram model considers the current word and the tags of the previous n-1 words in assigning a tag to a word.
Fig. Context used by Tri-gram Model
HMM Tagger
 Given a sequence of words (sentence), the
objective is to find the most probable tag sequence
for the sentence.
 Let W be the sequence of words:
W = w1, w2, … , wn
 The task is to find the tag sequence
T = t1, t2, … , tn
which maximizes P(T|W), i.e.,
T’ = argmaxT P(T|W)
 Applying Bayes' rule, P(T|W) can be estimated using the expression:
P(T|W) = P(W|T) * P(T) / P(W)
 As the probability of the word sequence,
P(W), remains the same for each tag
sequence, we can drop it. The expression
for the most likely tag sequence becomes:
T’ = argmaxT P(W|T) * P(T)
 Using the Markov assumption, the probability of a tag sequence can be estimated as the product of the probabilities of its constituent n-grams, i.e.,
P(T) = P(t1) * P(t2|t1) * P(t3|t1t2) * … * P(tn|t1 … tn-1)
 P(W|T) is the probability of seeing a word sequence given a tag sequence.
 For example, what is the probability of seeing ‘The egg is rotten’ given the tag sequence ‘DT NN VBZ JJ’?
 We make the following two assumptions:
1. The words are independent of each other, and
2. The probability of a word is dependent only on its tag.
Using these assumptions, P(W|T) can be expressed as:
P(W|T) = P(w1|t1) * P(w2|t2) * … * P(wn|tn)
i.e., combining P(W|T) and P(T), the most likely tag sequence becomes:
T’ = argmaxT [P(w1|t1) * P(w2|t2) * … * P(wn|tn)] * [P(t1) * P(t2|t1) * … * P(tn|t1 … tn-1)]
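A compact sketch of decoding this expression with the Viterbi algorithm, using the usual bi-gram approximation P(ti|ti-1) for the tag transitions. The tiny tag set and probability tables are hypothetical placeholders; in practice they are estimated from a tagged training corpus.

```python
import math

# Viterbi decoding sketch for a bi-gram HMM tagger:
#   T' = argmax_T  prod_i P(w_i | t_i) * P(t_i | t_{i-1})
# Tag set and probability tables are tiny illustrative placeholders.

TAGS = ["DT", "NN", "VB"]

def viterbi(words, trans, emit, start):
    """trans[(prev, cur)] = P(cur|prev), emit[(tag, word)] = P(word|tag),
    start[tag] = P(tag at sentence start). Returns the best tag sequence."""
    V = [{t: (math.log(start.get(t, 1e-12)) +
              math.log(emit.get((t, words[0]), 1e-12)), None) for t in TAGS}]
    for i in range(1, len(words)):
        col = {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: V[i-1][p][0] +
                            math.log(trans.get((p, t), 1e-12)))
            score = (V[i-1][best_prev][0] +
                     math.log(trans.get((best_prev, t), 1e-12)) +
                     math.log(emit.get((t, words[i]), 1e-12)))
            col[t] = (score, best_prev)
        V.append(col)
    # backtrack from the best final tag
    tag = max(TAGS, key=lambda t: V[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = V[i][tag][1]
        path.append(tag)
    return list(reversed(path))

# Hypothetical probabilities, just to make the sketch runnable:
start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans = {("DT", "NN"): 0.7, ("NN", "VB"): 0.4, ("NN", "NN"): 0.3, ("VB", "DT"): 0.5}
emit  = {("DT", "the"): 0.5, ("NN", "show"): 0.01, ("VB", "show"): 0.005,
         ("VB", "goes"): 0.02, ("NN", "goes"): 0.0001}
print(viterbi(["the", "show", "goes"], trans, emit, start))  # ['DT', 'NN', 'VB']
```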
 Some of the possible tag sequences:
DT NNP NNP NNP
DT NNP MD VB or DT NNP MD NNP (Output → most likely)
Brill Tagger: Initial state
 Initial State: each word is assigned its most likely tag.
 Transformation:
The text is then passed through an ordered
list of transformations.
Each transformation is a pair of a rewrite rule and a contextual condition.
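As an illustration of such a pair, a minimal sketch applying the classic transformation "change NN to VB if the previous tag is TO"; this particular rule is a standard example from the TBL literature, used here only for illustration.

```python
# One Brill transformation = a rewrite rule ("change tag a to tag b")
# plus a contextual condition (here: "the previous tag is TO").
# The specific rule is a standard illustrative example.

def apply_transformation(tagged, from_tag, to_tag, prev_tag):
    """Rewrite from_tag -> to_tag wherever the preceding tag equals prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

initial = [("to", "TO"), ("race", "NN")]        # initial-state tagger's output
print(apply_transformation(initial, "NN", "VB", "TO"))
# [('to', 'TO'), ('race', 'VB')]
```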
Learning Rules
Rules are learned in the following manner:
1. Each rule, i.e. each possible transformation, is applied to each matching word–tag pair.
2. The number of tagging errors is measured against the correct sequences of the training corpus ("truth").
3. The transformation which yields the greatest error reduction is chosen.
4. Learning stops when no transformation can be found that, if applied, reduces errors beyond some given threshold (a sketch of this loop follows below).
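A greedy sketch of this learning loop under a simplifying assumption: candidate transformations come from a single template, "change tag a to tag b when the previous tag is c". The toy truth corpus and initial tagging are hypothetical.

```python
# Greedy TBL learning sketch (steps 1-4 above), simplified to one template:
# "change tag a to tag b when the previous tag is c".

def errors(predicted, truth):
    return sum(p[1] != t[1] for p, t in zip(predicted, truth))

def candidate_rules(tags):
    # All (from_tag, to_tag, prev_tag) triples over the observed tag set.
    return [(a, b, c) for a in tags for b in tags for c in tags if a != b]

def apply_rule(tagged, rule):
    a, b, c = rule
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == a and out[i - 1][1] == c:
            out[i] = (out[i][0], b)
    return out

def learn(current, truth, threshold=1):
    learned = []
    tagset = sorted({t for _, t in truth})
    while True:
        base = errors(current, truth)
        best_rule, best_err = None, base
        for rule in candidate_rules(tagset):                  # step 1: try each transformation
            err = errors(apply_rule(current, rule), truth)    # step 2: measure errors vs truth
            if err < best_err:
                best_rule, best_err = rule, err
        if best_rule is None or base - best_err < threshold:  # step 4: stop when no real gain
            return learned
        current = apply_rule(current, best_rule)              # step 3: keep the best rule
        learned.append(best_rule)

truth   = [("to", "TO"), ("race", "VB"), ("the", "DT"), ("race", "NN")]
initial = [("to", "TO"), ("race", "NN"), ("the", "DT"), ("race", "NN")]  # unigram guesses
print(learn(initial, truth))   # [('NN', 'VB', 'TO')]
```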
• The set of possible ‘transforms’ is infinite, e.g., “transform NN to VB if the previous word was MicrosoftWindoze & the word braindead occurs between 17 and 158 words before that”.
• To limit it: start with a small set of abstracted transforms, or templates.
Templates used: Change a to b when…
Rules learned by TBL tagger
Lexicalized transformations
Brill complements the rule schemes by so-called
lexicalized rules which refer to particular words in
the condition part of the transformation:
Change a to b if
1. the preceding (following, current) word is c
2. the preceding (following, current) word is c and the preceding (following) word is tagged d.
etc.
Unknown words
 In handling unknown words, a POS tagger can adopt the following strategies:
 assign all possible tags to the unknown word
 assign the most probable tag to the unknown
word
 use the distribution of ‘things seen once’ as an estimator for ‘things never seen’
 use word features, i.e. how words are spelled (prefixes, suffixes, word length, capitalization), to guess a (set of) word class(es) -- the most powerful strategy (see the sketch below).
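A sketch of the word-feature strategy: guessing a set of candidate tags for an unknown word from its spelling alone. The suffix-to-tag table is a small illustrative assumption, not the full set of derivational endings mentioned below.

```python
import re

# Guess candidate tags for an unknown word from its spelling alone.
# The suffix table is a small illustrative subset, not a full ending list.

SUFFIX_TAGS = {
    "tion": {"NN"}, "ness": {"NN"}, "ment": {"NN"},
    "able": {"JJ"}, "ous": {"JJ"}, "ful": {"JJ"},
    "ly": {"RB"},
    "ize": {"VB"}, "ing": {"VBG", "NN"}, "ed": {"VBD", "VBN"},
}

def guess_tags(word):
    tags = set()
    if word[0].isupper():                    # capitalization -> likely proper noun
        tags.add("NNP")
    if "-" in word:                          # hyphenation -> often adjectival
        tags.add("JJ")
    if re.search(r"\d", word):               # digits -> cardinal number
        tags.add("CD")
    for suffix, suffix_tags in SUFFIX_TAGS.items():
        if word.lower().endswith(suffix):
            tags |= suffix_tags
    return tags or {"NN"}                    # fall back to the open-class default

print(guess_tags("misconfiguration"))  # {'NN'}
print(guess_tags("Ramadan"))           # {'NNP'}
print(guess_tags("slowly"))            # {'RB'}
```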
Most powerful unknown word
detectors
 32 derivational endings (-ion, etc.)
 capitalization; hyphenation
 More generally: should use morphological analysis! (and some kind of machine learning approach)