1. Issues in POS tagging
Thennarasu Sakkan
Department of Linguistics
Central University of Kerala
2.
A word usually takes one part of speech in a given context.
Tagging resolves lexical ambiguity.
For example,
avaḷ/PR_PRP cantaiyil/N_NN katti/N_NN
viṟṟāḷ/V_VM_VF ./RD_PUNC
3.
For example,
paccai/JJ miḷakāyil/N_NN namakku/PR_PRP
teriyāta/V_VM_VNF pala/JJ uṭal/N_NN
nala/JJ/N_NN payaṉkaḷ/N_NN
aṭaṅkiyuḷḷatu/V_VM_VF ./RD_PUNC
inta/DM_DMD paccai/N_NN mikavum/RP_INTF
meṉmaiyāka/RB irukkiṟatu/V_VM_VF
./RD_PUNC
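The word paccai carries two different tags in the two sentences above. A minimal sketch of how such ambiguous words could be listed from tagged text follows. It assumes the tokens are written in word/TAG form, uses only short excerpts of the two sentences, and the helper names `tag_lexicon` and `ambiguous_words` are hypothetical.

```python
from collections import defaultdict

# Excerpts of the two example sentences, written as word/TAG tokens
# (the word/TAG format is an assumption made for this sketch).
SENTENCES = [
    "paccai/JJ miḷakāyil/N_NN",
    "inta/DM_DMD paccai/N_NN mikavum/RP_INTF",
]


def tag_lexicon(sentences):
    """Map every word to the set of tags it was seen with."""
    lexicon = defaultdict(set)
    for sentence in sentences:
        for token in sentence.split():
            # Split on the last slash so the word itself may contain one.
            word, tag = token.rsplit("/", 1)
            lexicon[word].add(tag)
    return lexicon


def ambiguous_words(lexicon):
    """Keep only the words that were seen with more than one tag."""
    return {w: sorted(tags) for w, tags in lexicon.items() if len(tags) > 1}


print(ambiguous_words(tag_lexicon(SENTENCES)))
# {'paccai': ['JJ', 'N_NN']}
```

A list like this only shows where ambiguity exists in the corpus; choosing the right tag for a given occurrence still needs context.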
4.
One of the main reasons for tagging a text
is to reduce ambiguity.
Fruit flies like a Banana.
Fruit/NNP flies/VBZ like/IN a/DT Banana/NNP ./.
Fruit/NNP Flies/NNP like/VBP a/DT Banana/NNP ./.
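The two analyses above differ only in the tags chosen for "flies" and "like". As a rough sketch, every candidate tag sequence can be enumerated from a small lexicon. The lexicon below is hypothetical and lists only the tags that appear in the two analyses on this slide; a real tagger would then need context to pick one sequence.

```python
from itertools import product

# Hypothetical mini-lexicon: only the tags used in the two analyses above.
LEXICON = {
    "Fruit": ["NNP"],
    "flies": ["VBZ", "NNP"],
    "like": ["IN", "VBP"],
    "a": ["DT"],
    "Banana": ["NNP"],
    ".": ["."],
}


def candidate_taggings(words, lexicon):
    """Return every combination of per-word tags as a list of (word, tag) pairs."""
    options = [lexicon[w] for w in words]
    return [list(zip(words, tags)) for tags in product(*options)]


words = ["Fruit", "flies", "like", "a", "Banana", "."]
for tagging in candidate_taggings(words, LEXICON):
    print(" ".join(f"{w}/{t}" for w, t in tagging))
```

With two options each for "flies" and "like", the sketch prints four candidate sequences, two of which are the readings shown on the slide.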
5. The corpus needs to be normalized, which makes
the tagging process very complex.
For example: malaiyaṭivārattil/N_NN
kōviloṉṟuṇṭu/V_VM_VF ./RD_PUNC
Issues in Tamil POS Tagging?
6. ‘paṭi’ (படி) in Tamil: A Corpus-based study shows:
a. அவன்/PR_PRP படியில்/N_NN ஏறினான்/V_VM_VF ./RD_PUNC (Noun)
b. அவள்/PR_PRP இன்று காலையில்/N_NN படித்தாள்/V_VM_VF ./RD_PUNC (Verb)
c. அவர்/PR_PRP படித்த/V_VM_VNF புத்தகம்/N_NN தான்/RD_PRD இது/PR_PRP ./RD_PUNC (Relative Participle or Adjective)
d. அவர்/PR_PRP வாய்க்குவந்தபடி/RB பேசினார்/V_VM_VF ./RD_PUNC (Adverb)
e. அம்மாவின்/N_NN கணக்குப்படி/N_NN எனக்கு/PR_PRP வயது/N_NN 30/QT_QTC . (Particle)
f. நான்/PR_PRP அந்த/DM_DMD நேரத்தில்/N_NN பத்தாம்/QT_QTO வகுப்பு/N_NN படித்து/V_VM_VNF வந்தேன்/V_VM_VF ./RD_PUNC (Verbal participle)
g. நான்/PR_PRP படிக்க/V_VM_VINF போறேன்/V_VM_VF ./RD_PUNC (Infinitive Verb)
Issues in Tamil POS Tagging?
7. a. avaṉ/PR_PRP paṭiyil/N_NN ēṟiṉāṉ/V_VM_VF ./RD_PUNC (Noun)
b. avaḷ/PR_PRP iṉṟu kālaiyil/N_NN paṭittāḷ/V_VM_VF ./RD_PUNC (Verb)
c. avar/PR_PRP paṭitta/V_VM_VNF puttakam/N_NN tāṉ/RD_PRD itu/PR_PRP ./RD_PUNC (Relative Participle or Adjective)
d. avar/PR_PRP vāykkuvantapaṭi/RB pēciṉār/V_VM_VF ./RD_PUNC (Adverb)
e. ammāviṉ/N_NN kaṇakkuppaṭi/N_NN eṉakku/PR_PRP vayatu/N_NN 30/QT_QTC . (Particle)
f. nāṉ/PR_PRP anta/DM_DMD nērattil/N_NN pattām/QT_QTO vakuppu/N_NN paṭittu/V_VM_VNF vantēṉ/V_VM_VF ./RD_PUNC (Verbal participle)
g. nāṉ/PR_PRP paṭikka/V_VM_VINF pōṟēṉ/V_VM_VF ./RD_PUNC (Infinitive Verb)
15. Another tokenization issue concerns compound words
and multiword expressions:
māṭṭu vaṇṭi/N_NN
kāliḵpiḷavar/N_NN
muṭṭaikkōs/N_NN
muṭṭaik/N_NN kōs/N_NN
vēḷāṇmai/N_NN uṟpattip/N_NN poruḷkaḷ/N_NN
Titles of books and names of movies, for example S. Ramakrishnan's
short stories:
naṭantu/V_VM_VNF cellum/V_VM_VNF nīrūṟṟu/N_NN
appōtum kaṭal/N_NN pārttukkoṇṭiruntatu/V_VM_VF
./RD_PUNC
mīṇṭum/RB varuvēṉ/V_VM_VF by intirā/N_NNP
ceḷantirrājaṉ/N_NNP
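One way to handle such cases before tagging is to merge known multiword expressions into a single token. The following is a minimal sketch, assuming a hand-built list of expressions; the function `merge_multiword` and the set `MWES` are hypothetical, and the token lists are taken from the slide rather than from running text.

```python
def merge_multiword(tokens, mwes, max_len=3):
    """Greedy longest-match merge of known multiword expressions."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest window first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in mwes:
                merged.append(" ".join(candidate))
                i += n
                break
        else:
            # No expression starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical list: a compound written as two words and a book title.
MWES = {
    ("māṭṭu", "vaṇṭi"),
    ("naṭantu", "cellum", "nīrūṟṟu"),
}

print(merge_multiword(["māṭṭu", "vaṇṭi", "muṭṭaikkōs"], MWES))
# ['māṭṭu vaṇṭi', 'muṭṭaikkōs']
print(merge_multiword(["naṭantu", "cellum", "nīrūṟṟu"], MWES))
# ['naṭantu cellum nīrūṟṟu']
```

Whether a title should receive one tag as a unit or one tag per word remains an annotation decision; the sketch only makes the first option possible.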
19. For a POS tagger, the major issues that need
to be dealt with are:
1. Fineness vs. coarseness in linguistic analysis
2. Syntactic function vs. lexical category
3. New tags vs. tags from a standard tagger
21. Content
Eby's argument on POS
Extracting rules from tagged data
Mubeena's query on VBZ
Multiple Choice Questions
23.
Extracting rules from tagged data
24. The algorithm for the lexical item ‘paṭi’ can be described as follows:
if {# word contains a plural and/or case marker: tag as noun
}
elsif {# word contains a tense marker + -a/-um: tag as relative participle
}
elsif {# word takes a tense marker: tag as verb
}
elsif {# paṭi comes after a noun: tag as particle
}
elsif {# paṭi occurs after a verb + past tense + -a and is followed by a finite verb:
tag as adverb
}
elsif {# paṭi comes after a verb + past tense + -u and is followed by a finite verb:
tag as verbal participle
}
else {# no rule matched: handle as the default case
}
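The rule cascade above can be sketched in Python. This is an illustration under several assumptions, not the author's implementation: each word is assumed to arrive already split into morpheme labels by some morphological analyser, the labels (PATI, N, V, PL, CASE, PST, PRS, FUT, A, UM, U, PNG) are hypothetical, and the verbal-participle rule is read as paṭi + past tense + -u, which matches paṭittu in the examples. The more specific checks are tested before the plain tense-marker check so that they are not shadowed by it.

```python
from typing import List, Optional

TENSE = {"PST", "PRS", "FUT"}  # hypothetical tense-marker labels


def tag_pati(morphemes: List[str], next_is_finite_verb: bool = False) -> Optional[str]:
    """Return a category for a word containing paṭi, or None if no rule applies."""
    if "PATI" not in morphemes:
        return None
    i = morphemes.index("PATI")
    before, after = morphemes[:i], morphemes[i + 1:]

    if not before:  # paṭi is the stem of the word
        if any(m in ("PL", "CASE") for m in after):
            return "Noun"
        if len(after) >= 2 and after[0] in TENSE and after[1] in ("A", "UM"):
            return "Relative participle"
        if after == ["PST", "U"] and next_is_finite_verb:
            return "Verbal participle"
        if after and after[0] in TENSE:
            return "Verb"
        return None

    # paṭi is attached to a preceding stem
    if before[-3:] == ["V", "PST", "A"] and next_is_finite_verb:
        return "Adverb"
    if before[-1] == "N":
        return "Particle"
    return None


print(tag_pati(["PATI", "CASE"]))                 # paṭiyil
print(tag_pati(["PATI", "PST", "PNG"]))           # paṭittāḷ
print(tag_pati(["PATI", "PST", "A"]))             # paṭitta
print(tag_pati(["V", "PST", "A", "PATI"], True))  # vāykkuvantapaṭi
print(tag_pati(["N", "PATI"]))                    # kaṇakkuppaṭi
print(tag_pati(["PATI", "PST", "U"], True))       # paṭittu
```

The infinitive paṭikka is not covered by the rules as listed, so an analysis such as `["PATI", "INF"]` falls through to `None` in this sketch.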
28. Multiple Choice Questions
1. Who defined computational linguistics as the study of
computer systems for understanding and generating
natural language?
2. ________ is a simple yet powerful programming
language with excellent functionality for processing
linguistic data.
3. What does NLTK stand for? _________
4. Name any one scripting language.
5. What does the command tr 'a-z' 'A-Z' <
inputfile > outputfile do in Linux?
29. 7. What does tr 'aiou' e < inputfile >
outputfile do?
8. What does tr -c 'A-Za-z' '\012' < inputfile >
outputfile do?
9. What are the salient features of a corpus?
10. _________ means to divide into parts and describe
the relations among the parts.
11. The word 'parser' is typically restricted to the _______
level analyzer.
30. 12. Shallow parsing is also known as __________ parsing.
13. Parsed corpora are sometimes known as ________.
14. The simplest n-gram model is called the ______
model.
15. What are the six requirements kept in mind while designing
NLTK?
16. Could you mention the Microsoft keys functions for
consonants in your language?
17. What is the relevance of an annotated corpus?
18. Who first argued for the relevance of shallow parsing?
19. Describe the chunking tags, with an example for each.