Word Level Analysis
POS (Continued)
L3 NLP (Elective)
TJS
Words and Word Classes
 Words are classified into categories called parts of speech (also known as word classes or lexical categories)
Part of Speech
NN noun student, chair, proof, mechanism
VB verb study, increase, produce
JJ adjective large, high, tall, few
RB adverb carefully, slowly, uniformly
IN preposition in, on, to, of
PRP pronoun I, me, they
DT determiner the, a, an, this, those
 open vs. closed word classes
Part of Speech tagging
 The process of assigning a part of speech (noun, verb, pronoun, preposition, adverb, adjective, etc.) to each word in a sentence
Fig. A POS tagger takes the words of a sentence plus a tag set as input and outputs a POS tag for each word.
Speech/NN sounds/NNS were/VBD sampled/VBN
by/IN a/DT microphone/NN.
Another possible tagging for the sentence is:
Speech/NN sounds/VBZ were/VBD sampled/VBN
by/IN a/DT microphone/NN.
Part of speech tagging methods
 Rule-based (linguistic)
 Stochastic (Data-driven) and
 TBL (Transformation Based Learning)
Rule-based (linguistic)
Steps:
1. Dictionary lookup → potential tags
2. Hand-coded Rules
The show must go on.
Step 1 → NN, VB
Step 2 → discard the incorrect tag
Rule: IF preceding word is determiner THEN
eliminate VB tag.
 Morphological information
IF word ends in –ing and preceding word is a verb
THEN label it a verb (VB).
 Capitalization information
Advantages (+):
 Speed
 Deterministic
Disadvantages (−):
 Requires manual work
 Usable for only one language
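A minimal sketch of the two rule-based steps (dictionary lookup, then a hand-coded rule). The five-word lexicon and the single determiner rule are illustrative assumptions, not a complete rule set.

```python
# Minimal rule-based tagging sketch: dictionary lookup, then hand-coded rules.
# The lexicon and the single rule below are illustrative assumptions only.

LEXICON = {           # step 1: potential tags per word (hypothetical mini-dictionary)
    "the":  ["DT"],
    "show": ["NN", "VB"],
    "must": ["MD"],
    "go":   ["VB"],
    "on":   ["IN"],
}

def tag(sentence):
    words = sentence.lower().strip(".").split()
    tags = []
    for i, word in enumerate(words):
        candidates = list(LEXICON.get(word, ["NN"]))   # unknown words default to NN
        # step 2: hand-coded rule -- IF the preceding word is a determiner,
        # THEN eliminate the VB tag from the candidate set.
        if i > 0 and tags[i - 1] == "DT" and len(candidates) > 1 and "VB" in candidates:
            candidates.remove("VB")
        tags.append(candidates[0])                     # keep the first remaining candidate
    return list(zip(words, tags))

print(tag("The show must go on."))
# [('the', 'DT'), ('show', 'NN'), ('must', 'MD'), ('go', 'VB'), ('on', 'IN')]
```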
Stochastic Tagger
 The standard stochastic tagger algorithm is the
Hidden Markov Model (HMM) tagger.
 A Markov model applies the simplifying assumption that the probability of a chain of symbols can be approximated in terms of its parts or n-grams.
 The simplest n-gram model is the unigram
model, which assigns the most likely tag (part-of-
speech) to each token.
 The unigram model requires tagged data from which to gather tag frequency statistics. The context used by the unigram tagger is the text of the word itself. For example, it will assign the tag JJ to each occurrence of fast if fast is used as an adjective more frequently than as a noun, verb, or adverb.
 (1) She had a fast
 (2) Muslim fast during Ramadan
 (3) Those who are injured need medical help fast.
 We would expect more accurate predictions if we
took more context into account when making a
tagging decision.
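A minimal sketch of a unigram tagger, assuming a toy tagged corpus stands in for real training data; the counts and the fallback tag are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Unigram tagger sketch: assign each word its most frequent tag in tagged data.
# The toy tagged corpus below is a hypothetical stand-in for real training data.
tagged_corpus = [
    ("she", "PRP"), ("had", "VBD"), ("a", "DT"), ("fast", "JJ"), ("car", "NN"),
    ("the", "DT"), ("fast", "JJ"), ("train", "NN"),
    ("run", "VB"), ("fast", "RB"),
]

counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

def unigram_tag(word, default="NN"):
    """Return the most likely tag for the word itself, ignoring context."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default

print(unigram_tag("fast"))   # 'JJ' -- the adjective use is most frequent in the toy data
```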
 A bi-gram tagger uses the current word and the tag of the previous word in the tagging process. As the tag sequence “DT NN” is more likely than the tag sequence “DT JJ”, a bi-gram model will assign the correct tag to the word fast in sentence (1).
 Similarly, it is more likely that an adverb (rather
than a noun or an adjective) follows a verb.
Hence, in sentence (3), the tag assigned to fast will be RB (adverb).
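A sketch of that bi-gram decision for fast: a (previous-tag → tag) transition probability is combined with a per-word probability. The transition and emission numbers are made up purely to illustrate the argument; they are not corpus estimates.

```python
# Bigram disambiguation sketch for "fast". All numbers are illustrative only.

transition = {            # P(tag | previous_tag), hypothetical values
    ("DT", "JJ"): 0.20, ("DT", "NN"): 0.50, ("DT", "RB"): 0.01,
    ("VB", "RB"): 0.20, ("VB", "JJ"): 0.05, ("VB", "NN"): 0.15,
}
emission = {              # P(word="fast" | tag), hypothetical values
    "JJ": 0.004, "NN": 0.003, "RB": 0.006,
}

def best_tag_for_fast(prev_tag):
    scores = {t: transition.get((prev_tag, t), 1e-6) * emission[t]
              for t in emission}
    return max(scores, key=scores.get)

print(best_tag_for_fast("DT"))  # 'NN' -- noun reading, as in sentence (1) "She had a fast"
print(best_tag_for_fast("VB"))  # 'RB' -- adverb reading after a verb, as in sentence (3)
```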
N-gram Model
 An n-gram model considers the current word and the tags of the previous n-1 words in assigning a tag to a word.
Fig. Context used by Tri-gram Model
HMM Tagger
 Given a sequence of words (sentence), the
objective is to find the most probable tag sequence
for the sentence.
 Let W be the sequence of words:
W = w1, w2, … , wn
 The task is to find the tag sequence
T = t1, t2, … , tn
which maximizes P(T|W), i.e.,
T’ = argmaxT P(T|W)
 Applying Bayes' rule, P(T|W) can be estimated using the expression:
P(T|W) = P(W|T) * P(T) / P(W)
 As the probability of the word sequence,
P(W), remains the same for each tag
sequence, we can drop it. The expression
for the most likely tag sequence becomes:
T’ = argmaxT P(W|T) * P(T)
 Using the Markov assumption, the probability of a tag sequence can be estimated as the product of the probabilities of its constituent n-grams, i.e.,
P(T) = P(t1) * P(t2|t1) * P(t3|t1t2) * … * P(tn|t1 … tn-1)
 P(W|T) is the probability of seeing a word sequence given a tag sequence.
 For example, what is the probability of seeing ‘The egg is rotten’ given the tag sequence ‘DT NN VBZ JJ’?
 We make the following two assumptions:
1. The words are independent of each other, and
2. The probability of a word is dependent only on its tag.
Using these assumptions, P(W|T) can be expressed as:
P(W|T) = P(w1|t1) * P(w2|t2) * … * P(wn|tn)
i.e., combining P(W|T) and P(T), the most likely tag sequence becomes:
T’ = argmaxT [P(w1|t1) * P(w2|t2) * … * P(wn|tn)] * [P(t1) * P(t2|t1) * … * P(tn|t1 … tn-1)]
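A compact sketch of decoding this expression with the Viterbi algorithm, using the usual bi-gram approximation P(ti|ti-1) for the tag transitions. The tiny tag set and probability tables are hypothetical placeholders; in practice they are estimated from a tagged training corpus.

```python
import math

# Viterbi decoding sketch for a bi-gram HMM tagger:
#   T' = argmax_T  prod_i P(w_i | t_i) * P(t_i | t_{i-1})
# Tag set and probability tables are tiny illustrative placeholders.

TAGS = ["DT", "NN", "VB"]

def viterbi(words, trans, emit, start):
    """trans[(prev, cur)] = P(cur|prev), emit[(tag, word)] = P(word|tag),
    start[tag] = P(tag at sentence start). Returns the best tag sequence."""
    V = [{t: (math.log(start.get(t, 1e-12)) +
              math.log(emit.get((t, words[0]), 1e-12)), None) for t in TAGS}]
    for i in range(1, len(words)):
        col = {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: V[i-1][p][0] +
                            math.log(trans.get((p, t), 1e-12)))
            score = (V[i-1][best_prev][0] +
                     math.log(trans.get((best_prev, t), 1e-12)) +
                     math.log(emit.get((t, words[i]), 1e-12)))
            col[t] = (score, best_prev)
        V.append(col)
    # backtrack from the best final tag
    tag = max(TAGS, key=lambda t: V[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = V[i][tag][1]
        path.append(tag)
    return list(reversed(path))

# Hypothetical probabilities, just to make the sketch runnable:
start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans = {("DT", "NN"): 0.7, ("NN", "VB"): 0.4, ("NN", "NN"): 0.3, ("VB", "DT"): 0.5}
emit  = {("DT", "the"): 0.5, ("NN", "show"): 0.01, ("VB", "show"): 0.005,
         ("VB", "goes"): 0.02, ("NN", "goes"): 0.0001}
print(viterbi(["the", "show", "goes"], trans, emit, start))  # ['DT', 'NN', 'VB']
```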
 Some of the possible tag sequences:
DT NNP NNP NNP
DT NNP MD VB or DT NNP MD NNP (Output → most likely)
Brill Tagger: Initial state
 Initial State: each word is assigned its most likely tag.
 Transformation:
The text is then passed through an ordered
list of transformations.
Each transformation is a pair of a rewrite rule and a contextual condition.
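As an illustration of such a pair, a minimal sketch applying the classic transformation "change NN to VB if the previous tag is TO"; this particular rule is a standard example from the TBL literature, used here only for illustration.

```python
# One Brill transformation = a rewrite rule ("change tag a to tag b")
# plus a contextual condition (here: "the previous tag is TO").
# The specific rule is a standard illustrative example.

def apply_transformation(tagged, from_tag, to_tag, prev_tag):
    """Rewrite from_tag -> to_tag wherever the preceding tag equals prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

initial = [("to", "TO"), ("race", "NN")]        # initial-state tagger's output
print(apply_transformation(initial, "NN", "VB", "TO"))
# [('to', 'TO'), ('race', 'VB')]
```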
Learning Rules
Rules are learned in the following manner:
1. Each rule, i.e. each possible transformation, is applied to each matching word–tag pair.
2. The number of tagging errors is measured against the correct sequences of the training corpus ("truth").
3. The transformation which yields the greatest error reduction is chosen.
4. Learning stops when no transformation can be found that, if applied, reduces errors beyond some given threshold (a sketch of this loop follows below).
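A greedy sketch of this learning loop under a simplifying assumption: candidate transformations come from a single template, "change tag a to tag b when the previous tag is c". The toy truth corpus and initial tagging are hypothetical.

```python
# Greedy TBL learning sketch (steps 1-4 above), simplified to one template:
# "change tag a to tag b when the previous tag is c".

def errors(predicted, truth):
    return sum(p[1] != t[1] for p, t in zip(predicted, truth))

def candidate_rules(tags):
    # All (from_tag, to_tag, prev_tag) triples over the observed tag set.
    return [(a, b, c) for a in tags for b in tags for c in tags if a != b]

def apply_rule(tagged, rule):
    a, b, c = rule
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == a and out[i - 1][1] == c:
            out[i] = (out[i][0], b)
    return out

def learn(current, truth, threshold=1):
    learned = []
    tagset = sorted({t for _, t in truth})
    while True:
        base = errors(current, truth)
        best_rule, best_err = None, base
        for rule in candidate_rules(tagset):                  # step 1: try each transformation
            err = errors(apply_rule(current, rule), truth)    # step 2: measure errors vs truth
            if err < best_err:
                best_rule, best_err = rule, err
        if best_rule is None or base - best_err < threshold:  # step 4: stop when no real gain
            return learned
        current = apply_rule(current, best_rule)              # step 3: keep the best rule
        learned.append(best_rule)

truth   = [("to", "TO"), ("race", "VB"), ("the", "DT"), ("race", "NN")]
initial = [("to", "TO"), ("race", "NN"), ("the", "DT"), ("race", "NN")]  # unigram guesses
print(learn(initial, truth))   # [('NN', 'VB', 'TO')]
```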
• The set of possible ‘transforms’ is infinite, e.g., “transform NN to VB if the previous word was MicrosoftWindoze & the word braindead occurs between 17 and 158 words before that”.
• To limit it: start with a small set of abstracted transforms, or templates.
Templates used: Change a to b when…
Rules learned by TBL tagger
Lexicalized transformations
Brill complements the rule schemes by so-called
lexicalized rules which refer to particular words in
the condition part of the transformation:
Change a to b if
1. the preceding (following, current) word is c
2. the preceding (following, current) word is c and the preceding (following) word is tagged d.
etc.
Unknown words
 In handling unknown words, a POS tagger can adopt the following strategies:
 assign all possible tags to the unknown word
 assign the most probable tag to the unknown
word
 use the distribution of ‘things seen once’ as an estimator for ‘things never seen’
 use word features, i.e. how words are spelled (prefixes, suffixes, word length, capitalization), to guess a (set of) word class(es) -- the most powerful strategy (see the sketch below).
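A sketch of the word-feature strategy: guessing a set of candidate tags for an unknown word from its spelling alone. The suffix-to-tag table is a small illustrative assumption, not the full set of derivational endings mentioned below.

```python
import re

# Guess candidate tags for an unknown word from its spelling alone.
# The suffix table is a small illustrative subset, not a full ending list.

SUFFIX_TAGS = {
    "tion": {"NN"}, "ness": {"NN"}, "ment": {"NN"},
    "able": {"JJ"}, "ous": {"JJ"}, "ful": {"JJ"},
    "ly": {"RB"},
    "ize": {"VB"}, "ing": {"VBG", "NN"}, "ed": {"VBD", "VBN"},
}

def guess_tags(word):
    tags = set()
    if word[0].isupper():                    # capitalization -> likely proper noun
        tags.add("NNP")
    if "-" in word:                          # hyphenation -> often adjectival
        tags.add("JJ")
    if re.search(r"\d", word):               # digits -> cardinal number
        tags.add("CD")
    for suffix, suffix_tags in SUFFIX_TAGS.items():
        if word.lower().endswith(suffix):
            tags |= suffix_tags
    return tags or {"NN"}                    # fall back to the open-class default

print(guess_tags("misconfiguration"))  # {'NN'}
print(guess_tags("Ramadan"))           # {'NNP'}
print(guess_tags("slowly"))            # {'RB'}
```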
Most powerful unknown word
detectors
 32 derivational endings (-ion, etc.)
 capitalization; hyphenation
 More generally: should use morphological analysis! (and some kind of machine learning approach)