Part of Speech (POS) Tagging
K.A.S.H. Kulathilake
B.Sc.(Sp.Hons.)IT, MCS, MPhil, SEDA(UK)
Closed Class Vs. Open Class
• Parts-of-speech can be divided into two broad categories:
closed class types and open class types.
• Closed classes are those with relatively fixed membership,
such as prepositions—new prepositions are rarely coined.
• By contrast, nouns and verbs are open classes—new nouns
and verbs like iPhone or to fax are continually being created
or borrowed.
• Any given speaker or corpus may have different open class
words, but all speakers of a language, and sufficiently large
corpora, likely share the set of closed class words.
• Closed class words are generally function words like
of, it, and, or you, which tend to be very short, occur
frequently, and often have structuring uses in grammar.
Closed Class Vs. Open Class (Cont…)
• Open Classes
– Four major open classes occur in the languages of
the world:
• nouns,
• verbs,
• adjectives and
• adverbs.
Closed Class Vs. Open Class (Cont…)
• Nouns
– The syntactic class noun includes the words for most
people, places, or things, but others as well.
– Nouns include concrete terms like ship and chair,
abstractions like bandwidth and relationship, and
verb-like terms like pacing as in His pacing to and fro
became quite annoying.
– What defines a noun in English, then, are things like
its ability to occur with determiners (a goat, its
bandwidth, Plato’s Republic), to take possessives
(IBM’s annual revenue), and for most but not all
nouns to occur in the plural form (goats, abaci).
Closed Class Vs. Open Class (Cont…)
– Open class nouns fall into two classes.
– Proper nouns:
• like Regina, Colorado, and IBM, are names of specific persons or entities.
• In English, they generally aren’t preceded by articles (e.g., the book is upstairs,
but Regina is upstairs).
• In written English, proper nouns are usually capitalized.
– Common nouns:
• Common nouns are divided in many languages, including English, into count
nouns and mass nouns.
• Count nouns allow grammatical enumeration, occurring in both the singular
and plural (goat/goats, relationship/relationships) and they can be counted
(one goat, two goats).
• Mass nouns are used when something is conceptualized as a homogeneous
group.
• So words like snow, salt, and communism are not counted (i.e., *two snows or
*two communisms).
• Mass nouns can also appear without articles where singular count nouns
cannot (Snow is white but not *Goat is white).
Closed Class Vs. Open Class (Cont…)
• Verb
– The verb class includes most of the words
referring to actions and processes, including main
verbs like draw, provide, and go.
– English verbs have inflections (non-third-person-sg
(eat), third-person-sg (eats), progressive (eating),
past participle (eaten)).
Closed Class Vs. Open Class (Cont…)
• Adjective:
– A class that includes many terms for properties or
qualities.
– Most languages have adjectives for the concepts of
color (white, black), age (old, young), and value (good,
bad), but there are languages without adjectives.
– In Korean, for example, the words corresponding to
English adjectives act as a subclass of verbs, so what is
in English an adjective “beautiful” acts in Korean like a
verb meaning “to be beautiful”.
Closed Class Vs. Open Class (Cont…)
• Adverb:
– The final open class form, adverbs, is rather a
hodge-podge, both semantically and formally.
– In the following sentence from Schachter (1985)
all the italicized words are adverbs:
– Unfortunately, John walked home extremely
slowly yesterday.
Closed Class Vs. Open Class (Cont…)
– What coherence the class has semantically may be solely that
each of these words can be viewed as modifying something
(often verbs, hence the name “adverb”, but also other adverbs
and entire verb phrases).
– Directional adverbs or locative adverbs (home, here, downhill)
specify the direction or location of some action;
– degree adverbs (extremely, very, somewhat) specify the extent
of some action, process, or property;
– manner adverbs (slowly, slinkily, delicately) describe the manner
of some action or process;
– and temporal adverbs describe the time that some action or
event took place (yesterday, Monday).
– Because of the heterogeneous nature of this class, some
adverbs (e.g., temporal adverbs like Monday) are tagged in
some tagging schemes as nouns.
Closed Class Vs. Open Class (Cont…)
• Closed Class
– The closed classes differ more from language to
language than do the open classes.
– Some of the important closed classes in English
include:
• prepositions: on, under, over, near, by, at, from, to, with
• determiners: a, an, the
• pronouns: she, who, I, others
• conjunctions: and, but, or, as, if, when
• auxiliary verbs: can, may, should, are
• particles: up, down, on, off, in, out, at, by
• numerals: one, two, three, first, second, third
The Penn Treebank Part-of-Speech
Tagset
• While there are many lists of parts-of-speech, most
modern language processing on English uses the 45-tag
Penn Treebank tagset.
• Parts-of-speech are generally represented by placing
the tag after each word, delimited by a slash, as in the
following examples:
– The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
– There/EX are/VBP 70/CD children/NNS there/RB
– Preliminary/JJ findings/NNS were/VBD reported/VBN in/IN
today/NN ’s/POS New/NNP England/NNP Journal/NNP
of/IN Medicine/NNP ./.
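As a quick illustration (not from the original slides), here is a minimal Python sketch that parses this word/TAG slash notation back into (word, tag) pairs; splitting on the last slash handles tokens such as ./. correctly:

```python
def parse_tagged(text):
    """Split a slash-delimited tagged string into (word, tag) pairs.

    rsplit on the final '/' so punctuation tokens like './.' are
    handled correctly even though the word itself is a period.
    """
    return [tuple(tok.rsplit("/", 1)) for tok in text.split()]

pairs = parse_tagged("The/DT grand/JJ jury/NN commented/VBD on/IN "
                     "a/DT number/NN of/IN other/JJ topics/NNS ./.")
print(pairs)  # [('The', 'DT'), ('grand', 'JJ'), ..., ('.', '.')]
```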
The Penn Treebank Part-of-Speech
Tagset (Cont…)
[This slide shows the 45-tag Penn Treebank tagset table; the image was not extracted.]
The Penn Treebank Part-of-Speech
Tagset (Cont…)
• Corpora labeled with parts-of-speech like the Treebank
corpora are crucial training (and testing) sets for
statistical tagging algorithms.
• Three main tagged corpora are consistently used for
training and testing part-of-speech taggers for English:
– The Brown corpus is a million words of samples from 500
written texts from different genres published in the United
States in 1961.
– The WSJ corpus contains a million words published in the
Wall Street Journal in 1989.
– The Switchboard corpus consists of 2 million words of
telephone conversations collected in 1990-1991.
The Penn Treebank Part-of-Speech
Tagset (Cont…)
• The corpora were created by running an automatic
part-of-speech tagger on the texts and then human
annotators hand-corrected each tag.
• Tagging algorithms assume that words have been
tokenized before tagging.
• The Penn Treebank and the British National Corpus
split contractions and the ’s-genitive from their stems:
– would/MD n’t/RB
– children/NNS ’s/POS
• Indeed, the special Treebank tag POS is used only for
the morpheme ’s, which must be segmented off during
tokenization.
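As a concrete illustration, NLTK's TreebankWordTokenizer follows this convention; a minimal sketch, assuming NLTK is installed:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Contractions and the 's-genitive are split from their stems,
# matching the Penn Treebank convention described above.
print(tokenizer.tokenize("They wouldn't read the children's books."))
# expected (roughly): ['They', 'would', "n't", 'read', 'the',
#                      'children', "'s", 'books', '.']
```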
The Penn Treebank Part-of-Speech
Tagset (Cont…)
• Another tokenization issue concerns multipart
words.
• The Treebank tagset assumes that tokenization of
words like New York is done at whitespace.
• The phrase a New York City firm is tagged in
Treebank notation as five separate words: a/DT
New/NNP York/NNP City/NNP firm/NN.
• The C5 tagset for the British National Corpus, by
contrast, allows prepositions like “in terms of” to
be treated as a single word by adding numbers to
each tag, as in in/II31 terms/II32 of/II33.
POS Tagging
• Part-of-speech tagging (tagging for short) is the
process of assigning a part-of-speech marker to
each word in an input text.
• Because tags are generally also applied to
punctuation, tokenization is usually performed
before, or as part of, the tagging process:
separating commas, quotation marks, etc., from
words and disambiguating end-of-sentence
punctuation (period, question mark, etc.) from
part-of-word punctuation (such as in
abbreviations like e.g. and etc.).
POS Tagging (Cont…)
• Tagging is a disambiguation task: words are ambiguous
(they have more than one possible part-of-speech), and the goal
is to find the correct tag for the situation.
• For example, the word book can be a verb (book that flight)
or a noun (as in hand me that book).
• That can be a determiner (Does that flight serve dinner) or
a complementizer (I thought that your flight was earlier).
• The problem of POS-tagging is to resolve these ambiguities,
choosing the proper tag for the context.
• Part-of-speech tagging is thus one of the many
disambiguation tasks in language processing.
POS Tagging (Cont…)
• Here are some examples of the 6 different
parts-of-speech for the word back:
– earnings growth took a back/JJ seat
– a small building in the back/NN
– a clear majority of senators back/VBP the bill
– Dave began to back/VB toward the door
– enable the country to buy back/RP debt
– I was twenty-one back/RB then
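To see such ambiguity resolved in practice, here is a small sketch using NLTK's off-the-shelf tagger (an illustration only; it requires the punkt and averaged_perceptron_tagger resources, and the exact output depends on the model version):

```python
import nltk  # assumes nltk.download('punkt') and
             # nltk.download('averaged_perceptron_tagger') have been run

for sent in ["a small building in the back",
             "Dave began to back toward the door"]:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
# 'back' should receive different tags in the two contexts
# (roughly NN vs. VB), illustrating tagging as disambiguation.
```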
POS Tagging (Cont…)
• POS Techniques
– Rule-Based: human-crafted rules based on lexical and
other linguistic knowledge.
– Stochastic: trained on human-annotated corpora like
the Penn Treebank; statistical models include the Hidden
Markov Model (HMM), the Maximum Entropy Markov Model
(MEMM), and the Conditional Random Field (CRF).
– Transformation-Based Tagging: learns tagging rules
automatically from an annotated corpus.
– Generally, learning-based approaches have been found to
be more effective overall, taking into account the total
amount of human expertise and effort involved.
HMM POS Tagging
• When we apply HMM to part-of-speech tagging we
generally don’t use the Baum-Welch algorithm for
learning the HMM parameters.
• Instead HMMs for part-of-speech tagging are trained
on a fully labeled dataset—a set of sentences with
each word annotated with a part-of-speech tag—
setting parameters by maximum likelihood estimates
on this training data.
• Thus the only algorithm we will need is the Viterbi
algorithm for decoding, and we will also need to see
how to set the parameters from training data.
The Basic Equation of HMM Tagging
• Let’s begin with a quick reminder of the intuition of HMM decoding.
• The goal of HMM decoding is to choose the tag sequence that is
most probable given the observation sequence of n words $w_1^n$:

$${t'}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$$

• by using Bayes’ rule to instead compute:

$${t'}_1^n = \operatorname{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}$$

• Furthermore, we simplify the above equation by dropping the
denominator $P(w_1^n)$:

$${t'}_1^n = \operatorname{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)$$
The Basic Equation of HMM Tagging
(Cont…)
• HMM taggers make two further simplifying assumptions.
• The first is that the probability of a word appearing
depends only on its own tag and is independent of
neighboring words and tags:

$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$$

• The second assumption, the bigram assumption, is that the
probability of a tag is dependent only on the previous tag,
rather than the entire tag sequence:

$$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$
The Basic Equation of HMM Tagging
(Cont…)
• Using the previous three equations, we can derive the following
equation for the most probable tag sequence from a bigram tagger;
its two factors, as we will soon see, correspond to the emission
probability and transition probability of the HMM.

$${t'}_1^n = \operatorname{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n) \quad (1)$$

$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \quad (2)$$

$$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \quad (3)$$

Applying (2) and (3) to (1):

$${t'}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \approx \operatorname{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

where $P(w_i \mid t_i)$ is the emission probability and $P(t_i \mid t_{i-1})$ is the transition probability.
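To make this scoring concrete, here is a minimal Python sketch (an illustration, not code from the slides) that evaluates the product of emission and transition probabilities for one candidate tag sequence, working in log space to avoid underflow:

```python
import math

def score_tag_sequence(words, tags, trans, emit, start="<s>"):
    """Log-probability of one (words, tags) pairing under a bigram HMM.

    trans[(prev_tag, tag)] = P(tag | prev_tag)   (transition)
    emit[(word, tag)]      = P(word | tag)       (emission)
    """
    logp, prev = 0.0, start
    for w, t in zip(words, tags):
        logp += math.log(trans[(prev, t)]) + math.log(emit[(w, t)])
        prev = t
    return logp

# Toy, made-up probabilities purely for illustration:
trans = {("<s>", "NNP"): 0.3, ("NNP", "MD"): 0.1}
emit = {("Janet", "NNP"): 0.0003, ("will", "MD"): 0.3}
print(score_tag_sequence(["Janet", "will"], ["NNP", "MD"], trans, emit))
```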
Estimating Probabilities
• Let’s walk through an example, seeing how these
probabilities are estimated and used in a sample
tagging task, before we return to the Viterbi algorithm.
• In HMM tagging, rather than using the full power of
HMM EM learning, the probabilities are estimated just
by counting on a tagged training corpus.
• For this example we’ll use the tagged WSJ corpus.
• The tag transition probabilities $P(t_i \mid t_{i-1})$ represent
the probability of a tag given the previous tag.
Estimating Probabilities (Cont…)
• The maximum likelihood estimate of a transition probability is computed
by counting, out of the times we see the first tag in a labeled corpus, how
often that tag is followed by the second, as shown below.
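In symbols, this is the standard maximum likelihood estimate from corpus counts, where $C(\cdot)$ counts occurrences in the tagged training corpus:

$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$

For instance, $P(\text{VB} \mid \text{MD})$ is the number of times a modal (MD) is immediately followed by a base-form verb (VB), divided by the total count of MD tags.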
Example
• Let’s now work through an example of
computing the best sequence of tags that
corresponds to the following sequence of
words
• Janet will back the bill
• The correct series of tags is:
• Janet/NNP will/MD back/VB the/DT bill/NN
Example (Cont…)
[This slide shows a probability table; the image was not extracted.]
Example (Cont…)
[The word-likelihood table on this slide was not extracted; its caption follows.]
This table is (slightly simplified) from counts in the WSJ corpus.
So the word Janet only appears as an NNP, back has 4 possible parts of speech,
and the word the can appear as a determiner or as an NNP (in titles like “Somewhere
Over the Rainbow” all words are tagged as NNP).
Example (Cont…)
[This slide's image was not extracted.]
Example (Cont…)
• The following figure shows a schematic of the
possible tags for each word and the correct
final tag sequence.
[Figure not extracted.]
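To round out the example, here is a compact, self-contained Viterbi decoder in Python (a sketch with made-up probability tables, not the actual WSJ counts from the slides) that recovers the intended tag sequence for “Janet will back the bill”:

```python
import math

def viterbi(words, tags, trans, emit, start="<s>", floor=1e-12):
    """Bigram-HMM Viterbi decoding: argmax over tag sequences of
    prod_i P(w_i | t_i) * P(t_i | t_{i-1}), computed in log space.
    Unseen transitions/emissions fall back to a tiny floor probability."""
    lt = lambda p, t: math.log(trans.get((p, t), floor))
    le = lambda w, t: math.log(emit.get((w, t), floor))
    V = [{t: lt(start, t) + le(words[0], t) for t in tags}]  # best log-probs
    back = [{}]                                              # backpointers
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: V[i - 1][p] + lt(p, t))
            V[i][t] = V[i - 1][prev] + lt(prev, t) + le(words[i], t)
            back[i][t] = prev
    # Follow backpointers from the best final tag.
    t = max(tags, key=lambda s: V[-1][s])
    path = [t]
    for i in range(len(words) - 1, 0, -1):
        t = back[i][t]
        path.append(t)
    return list(reversed(path))

# Made-up probabilities, chosen only so the example decodes sensibly:
tags = ["NNP", "MD", "VB", "DT", "NN"]
trans = {("<s>", "NNP"): 0.4, ("NNP", "MD"): 0.4, ("MD", "VB"): 0.6,
         ("VB", "DT"): 0.5, ("DT", "NN"): 0.6}
emit = {("Janet", "NNP"): 0.3, ("will", "MD"): 0.4, ("back", "VB"): 0.05,
        ("the", "DT"): 0.5, ("bill", "NN"): 0.2}
print(viterbi("Janet will back the bill".split(), tags, trans, emit))
# -> ['NNP', 'MD', 'VB', 'DT', 'NN']
```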