USE OF ANNOTATED CORPUS
Thennarasu Sakkan
Annotated text corpora are an important resource
for advances in NLP research and for developing
different language technologies.
The annotation of corpora is done using a set of
tags, which mark the linguistic properties of a word,
sentence or discourse.
Corpora annotated with various kinds of linguistic
information are a valuable resource for language
technologies, but building them takes a large
amount of effort and time.
Therefore, it is important to create corpora that,
once created, can be reused for various purposes.
Layered approach
It was proposed to follow a layered approach. Some of the
layers are:
Layer 1: Morphology
Layer 2: POS <morphosyntactic>
Layer 3: LWG
Layer 4: Chunks
Layer 5: Syntactic Analysis
Layer 6: Thematic roles/Predicate Argument structure
Layer 7: Semantic properties of the lexical items
Layers 8, 9, 10, 11: Word sense, pronoun referents (anaphora),
etc.
Example,
((My younger sister Suguna))_NP
((will be coming))_VP ((from Tamil
Nadu))_PP ((early this month))_NP.
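The bracket notation in the example above can be read back by a program. The following is a minimal sketch, assuming every chunk is written as ((tokens))_LABEL with an upper-case label; the names CHUNK_RE and parse_chunks are illustrative and not part of any annotation tool.

```python
import re

# One chunk written as ((token token ...))_LABEL, as in the example above.
CHUNK_RE = re.compile(r"\(\((.+?)\)\)_([A-Z]+)")

def parse_chunks(annotated):
    """Return a list of (label, tokens) pairs from a chunk-annotated string."""
    chunks = []
    for body, label in CHUNK_RE.findall(annotated):
        chunks.append((label, body.split()))
    return chunks

sentence = ("((My younger sister Suguna))_NP ((will be coming))_VP "
            "((from Tamil Nadu))_PP ((early this month))_NP.")

for label, tokens in parse_chunks(sentence):
    print(label, tokens)
```

Chunks whose brackets or labels are incomplete (such as the ones marked with "?") are simply skipped by this pattern, which makes it a quick way to spot items that still need an annotator's decision.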
((செவ்஬ா஦ில்_NNP))_NP ((ச஬ற்நிக஧஥ாக_RB))_RBP
((ர஧ா஬ர்_NNP ஬ிண்கனம்_NN))_NP ((஡ர஧஦ிநங்கி஦து_VF))_VP
!
(஢ாொ_NNP ஬ிஞ்ஞாணிகள்_NN))_NP ((ொ஡ரண_NN))_NP
!!_RD_SYM (See here exclamation marker.)
((஢ியூ஦ார்க்_NNP))_NP :_RD_PUNC ((செவ்஬ாய்_NNP
கி஧கத்ர஡_NN ஆய்வு_NN))_NP ((செய்஬஡ற்காக_RB))_RBP
((அச஥ரிக்கா_NNP))_NP ((அனுப்தி஦_VNF))_VGNF (ர஧ா஬ர்_NNP
஬ிண்கனம்_NN))_NP ((கிட்டத்஡ட்ட_RB))_RBP ((8_TC? ஥ா஡_NN
த஦஠த்஡ிற்கு_NN))_NP ((திநகு_NST))_? இன்று_NST))_?
(06.08.12) ((ச஬ற்நிக஧஥ாக_RB))_RBP
((஡ர஧஦ிநங்கி஦து_VF))_VP ((._PUNC))_?
((஬ிண்ச஬பி_NN ஆய்வு_NN ர஥஦த்஡ில்_NN))_NP
((இது_PRP))_?? ((ஒய௃_TC ஥ிகப்_INTF சதரி஦_JJ
ர஥ல்கல்னாக_RB??))_NP?? / RBP?? ((கய௃஡ப்தடுகிநது_VF))_VP
((._PUNC))_??
((பூ஥ி஦ில்_NN))_NP ((இய௃ந்து_N_NST))_NP?/N_ST?
((சு஥ார்_RB)) ((570_TC ஥ில்னி஦ன்_NN கி.஥ீ.,_NN
ச஡ாரன஬ில்_NN))_NP ((உள்பது_VF))_VGF
((செவ்஬ாய்_NNP கி஧கம்_NNP))_NP ._PUNC
((இந்஡_DMD கி஧கத்஡ில்_NN ஊ஦ிரிணங்கள்_NN))_NP
((஬ாழ்஬஡ற்காண_VNF))_VGNF ((஌ற்ந_JJ சூ஫ல்_NN))_NP
((இய௃க்கிந஡ா_VF))_VGF ((஋ன்தது_CCS))_??
((குநித்து_PSP))_?? ((ஆய்வு_NN))_NP
((செய்஦_VINF))_VGINF ((அச஥ரிக்கா஬ின்_NNP ஢ாொ_NNP
஬ிண்ச஬பி_NNP ஆ஧ாய்ச்ெி_NNP ர஥஦ம்_NNP))_NP
((தல்ர஬று_JJ))_JJP ((ஆய்வுகரப_NN))_NP
((ர஥ற்சகாண்டு_VNF))_VGNF ((஬ய௃கிநது_VF))_VGF.
((செவ்஬ாய்_NNP கி஧கம்_NN))_NP ((ச஡ாடர்தாண_JJ))_JJP
((தடங்கரபயும்_NN))_NP ((அவ்஬ப்ரதாது_RB))_RBP
((ச஬பி஦ிட்டு_VNF ஬ய௃கிநது_VM))_VGF ._SYM
How are corpora annotated?
• Automatic annotation
• Computer-assisted annotation
• Manual annotation
Sinclair (1992): the introduction of the human
element in corpus annotation reduces
consistency.
Corpus in NLP
NLP is unthinkable without involving corpora.
Corpora are essential ingredients of every aspect
of natural language processing.
a) Morph analysis – the morph features of a given
word are marked. If the word has multiple
morph feature sets, all are provided for it.
• Morphological level
–Prefixes
–Suffixes
–Stems - (morphological annotation)
Example: pens <root="pen" cat="n" gender="m"
number="pl" person="3">|<root="pen"
cat="v" gender="m" number="sing"
person="3" tense="present" aspect="hab">
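One way to hold "all feature sets for a word" in a program is a lexicon that maps each surface form to a list of analyses. This is only a sketch: MORPH_LEXICON, analyse and to_markup are hypothetical names, and the single entry copies the "pens" example above rather than coming from a real analyser.

```python
# Toy lexicon: each surface form maps to ALL of its feature sets.
# The entry mirrors the "pens" example above; a real analyser would
# derive analyses from rules or a much larger lexicon.
MORPH_LEXICON = {
    "pens": [
        {"root": "pen", "cat": "n", "gender": "m", "number": "pl", "person": "3"},
        {"root": "pen", "cat": "v", "gender": "m", "number": "sing",
         "person": "3", "tense": "present", "aspect": "hab"},
    ],
}

def analyse(word):
    """Return every known feature set for a word (empty list if unknown)."""
    return MORPH_LEXICON.get(word.lower(), [])

def to_markup(word):
    """Render the analyses as <attr="value" ...>|<...> after the word."""
    parts = []
    for features in analyse(word):
        attrs = " ".join(f'{key}="{value}"' for key, value in features.items())
        parts.append(f"<{attrs}>")
    return f"{word} " + "|".join(parts)

print(to_markup("pens"))
```

Keeping every analysis at this layer leaves the choice between noun and verb to a later layer such as POS tagging.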
Corpus Vs Morph
  %       Te      Ma      Ta     Hi
 10%      63      54      59      4
 20%     293     335     257     11
 30%     934    1196     728     26
 40%    2433    3439    1803     74
 50%    5707    8810    4091    186
 60%   13280   21663    8992    454
 70%   31941   53718   20191   1092
b) POS – a word is tagged for its POS category in a
given sentence.
Example: I need two <pos="NN">pens</pos="NN">
to finish this article. He
<pos="VBS">pens</pos="VBS"> his views
regularly.
c) Word sense – the appropriate sense of a word in a
given context is marked.
Example: I need two <word_sense="pen">pens</word_sense="pen">
to finish this article. He
<word_sense="write">pens</word_sense="write">
his views regularly.
POS Vs Corpus
About 11% of the word types in the Brown corpus are ambiguous.
What about our languages?
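The same question can be asked of any POS-tagged corpus. The sketch below counts the share of word types that occur with more than one tag; the function name ambiguity_rate and the tiny hand-made sample are assumptions for illustration, not real corpus data.

```python
from collections import defaultdict

def ambiguity_rate(tagged_tokens):
    """Fraction of word types that occur with more than one POS tag."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    if not tags_by_word:
        return 0.0
    ambiguous = sum(1 for tags in tags_by_word.values() if len(tags) > 1)
    return ambiguous / len(tags_by_word)

# Tiny hand-made sample (illustrative only, not real corpus data).
sample = [
    ("I", "PRP"), ("need", "VB"), ("two", "CD"), ("pens", "NN"),
    ("He", "PRP"), ("pens", "VBS"), ("his", "PRP$"), ("views", "NN"),
]
print(ambiguity_rate(sample))
```

With NLTK installed and the Brown corpus downloaded, the (word, tag) pairs from nltk.corpus.brown.tagged_words() could be passed in directly. The result depends on the tagset and on case folding, so it need not reproduce the figure quoted above.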
At the sentence level the information could be
a) Identification of chunks/MWEs/LWGs/phrases
Chunks are minimal constituent units.
The chunk analysis of a sentence provides a
shallow level of parsing. Thus, a corpus
annotated with POS and chunks can be useful for
building a shallow parser.
Example: I saw a man with a telescope.
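To make the idea of a shallow parser concrete, here is a rule-based sketch that groups POS-tagged words into flat chunks for a sentence like the example above. It assumes Penn-style tags, and the rules, the NOMINAL set and the function names are illustrative only; a practical chunker would normally be trained on a POS- and chunk-annotated corpus.

```python
# Tags that may appear inside a noun group (Penn-style tags, assumed here).
NOMINAL = {"DT", "JJ", "NN", "NNS", "NNP", "PRP", "CD"}

def is_verbal(tag):
    return tag.startswith("VB") or tag == "MD"

def chunk(tagged):
    """Group (word, tag) pairs into flat, non-overlapping chunks."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        tag = tagged[i][1]
        j = i + 1
        if tag == "IN" or tag in NOMINAL:
            # A preposition opens a PP that absorbs the following noun group.
            label = "PP" if tag == "IN" else "NP"
            while j < n and tagged[j][1] in NOMINAL:
                j += 1
        elif is_verbal(tag):
            label = "VP"
            while j < n and is_verbal(tagged[j][1]):
                j += 1
        else:
            label = "O"  # anything the rules do not cover
        chunks.append((label, [w for w, _ in tagged[i:j]]))
        i = j
    return chunks

def to_brackets(chunks):
    """Render chunks in the ((words))_LABEL notation."""
    return " ".join(f"(({' '.join(words)}))_{label}" for label, words in chunks)

tagged_sentence = [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("man", "NN"),
                   ("with", "IN"), ("a", "DT"), ("telescope", "NN")]
print(to_brackets(chunk(tagged_sentence)))
```

The output is a flat sequence of chunks. Whether the PP attaches to "saw" or to "a man" is left undecided, which is exactly what separates shallow parsing from full syntactic analysis.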
• Syntactic level
– parsing
– treebanking
– bracketing
• Discourse level
– Anaphoric relations (coreference annotation)
– Speech acts (pragmatic annotation)
– Stylistic features such as speech and thought
presentation (stylistic annotation).
Corpus Vs Machine translation
Among the uses of parallel and comparable corpora,
which include lexicography and terminology extraction to
build terminology databases and bilingual reference
tools, pride of place must be given to machine
translation (MT).
Parallel corpora have played a pivotal role in a
(partial) paradigm shift from rule-based approaches
to statistical and example-based approaches to MT.
Essentially, statistical MT (SMT) involves computing
the probability that a target-language (TL) string is the
translation of a source-language (SL) string, based on
the frequency of the co-occurrence of these strings in
the corpus, whereas example-based MT (EBMT) involves
searching for similar phrases in previous translations and
extracting the TL fragments corresponding to the SL
fragments.
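Both ideas can be illustrated in a few lines. The sketch below is deliberately simplified: it estimates p(TL | SL) by relative frequency over already aligned pairs (real SMT systems learn alignments with far more elaborate models), and it imitates the EBMT lookup with a string-similarity search over a small translation memory. The function names and the English–French sample pairs are made up for illustration.

```python
import difflib
from collections import Counter

def translation_probs(aligned_pairs):
    """Estimate p(tl | sl) by relative frequency over aligned (sl, tl) pairs."""
    pair_counts = Counter(aligned_pairs)
    sl_counts = Counter(sl for sl, _ in aligned_pairs)
    return {(sl, tl): count / sl_counts[sl]
            for (sl, tl), count in pair_counts.items()}

def ebmt_lookup(sl_phrase, memory, cutoff=0.6):
    """Return the stored TL fragment of the most similar SL phrase, if any."""
    match = difflib.get_close_matches(sl_phrase, list(memory), n=1, cutoff=cutoff)
    return memory[match[0]] if match else None

# Hypothetical aligned pairs: "pen" co-occurs with two different TL strings.
pairs = [("pen", "stylo"), ("pen", "stylo"), ("pen", "stylo"), ("pen", "enclos")]
probs = translation_probs(pairs)
print(probs[("pen", "stylo")], probs[("pen", "enclos")])

# Hypothetical translation memory of previously translated phrases.
memory = {"a red pen": "un stylo rouge", "a blue book": "un livre bleu"}
print(ebmt_lookup("the red pen", memory))
```

The first function depends only on co-occurrence counts, as described above; the second reuses an earlier translation whenever a sufficiently similar source phrase is found and returns nothing otherwise.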
Show demo on KWIC
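A KWIC (keyword in context) display lists every occurrence of a search word together with a few words on either side. A minimal sketch, assuming whitespace-tokenised text; the function name kwic and the window width are illustrative choices.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, keyword, right) context windows for each keyword hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

text = ("I need two pens to finish this article . "
        "He pens his views regularly .").split()

for left, kw, right in kwic(text, "pens"):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>20} | {kw} | {right}")
```

Lining the keyword up in a single column makes it easy to see that "pens" appears once after a numeral and once after a pronoun, the kind of contextual evidence that POS and word-sense annotation record explicitly.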
5a use of annotated corpus

More Related Content

What's hot

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Saurabh Kaushik
 
Prosodic Morphology
Prosodic Morphology Prosodic Morphology
Prosodic Morphology
Maroua Harrif
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
Basha Chand
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
Marcis Pinnis
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
Hemantha Kulathilake
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingMariana Soffer
 
Nlp
NlpNlp
Networks and Natural Language Processing
Networks and Natural Language ProcessingNetworks and Natural Language Processing
Networks and Natural Language Processing
Ahmed Magdy Ezzeldin, MSc.
 
A Review on a web based Punjabi t o English Machine Transliteration System
A Review on a web based Punjabi t o English Machine Transliteration SystemA Review on a web based Punjabi t o English Machine Transliteration System
A Review on a web based Punjabi t o English Machine Transliteration System
Editor IJCATR
 
Natural Language Processing glossary for Coders
Natural Language Processing glossary for CodersNatural Language Processing glossary for Coders
Natural Language Processing glossary for Coders
Aravind Mohanoor
 
Corpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of PersianCorpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of Persian
IDES Editor
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingYasir Khan
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ijnlc
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language Technology
Marina Santini
 
Hidden markov model based part of speech tagger for sinhala language
Hidden markov model based part of speech tagger for sinhala languageHidden markov model based part of speech tagger for sinhala language
Hidden markov model based part of speech tagger for sinhala language
ijnlc
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
Minh Pham
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Yuriy Guts
 

What's hot (20)

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Prosodic Morphology
Prosodic Morphology Prosodic Morphology
Prosodic Morphology
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
NLP
NLPNLP
NLP
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Nlp
NlpNlp
Nlp
 
Networks and Natural Language Processing
Networks and Natural Language ProcessingNetworks and Natural Language Processing
Networks and Natural Language Processing
 
A Review on a web based Punjabi t o English Machine Transliteration System
A Review on a web based Punjabi t o English Machine Transliteration SystemA Review on a web based Punjabi t o English Machine Transliteration System
A Review on a web based Punjabi t o English Machine Transliteration System
 
Natural Language Processing glossary for Coders
Natural Language Processing glossary for CodersNatural Language Processing glossary for Coders
Natural Language Processing glossary for Coders
 
Corpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of PersianCorpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of Persian
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language Technology
 
Hidden markov model based part of speech tagger for sinhala language
Hidden markov model based part of speech tagger for sinhala languageHidden markov model based part of speech tagger for sinhala language
Hidden markov model based part of speech tagger for sinhala language
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 

Similar to 5a use of annotated corpus

5 relevance of annotated corpus
5 relevance of annotated corpus5 relevance of annotated corpus
5 relevance of annotated corpus
ThennarasuSakkan
 
Pos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil TextsPos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil Texts
ijcnes
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
paperpublications3
 
Nlp (1)
Nlp (1)Nlp (1)
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
paperpublications3
 
An implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzerAn implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzer
ijnlc
 
Poster @ enetCollect CA MC meeting in Iasi, Romania
Poster @ enetCollect CA MC meeting in Iasi, Romania Poster @ enetCollect CA MC meeting in Iasi, Romania
Poster @ enetCollect CA MC meeting in Iasi, Romania
INESC-ID (Spoken Language Systems Laboratory - L2F)
 
REPORT.doc
REPORT.docREPORT.doc
Building of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert SystemBuilding of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert System
Waqas Tariq
 
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
INESC-ID (Spoken Language Systems Laboratory - L2F)
 
Sanskrit parser Project Report
Sanskrit parser Project ReportSanskrit parser Project Report
Sanskrit parser Project Report
Laxmi Kant Yadav
 
Identification of prosodic features of punjabi for enhancing the pronunciatio...
Identification of prosodic features of punjabi for enhancing the pronunciatio...Identification of prosodic features of punjabi for enhancing the pronunciatio...
Identification of prosodic features of punjabi for enhancing the pronunciatio...
ijnlc
 
Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)
Dhabal Sethi
 
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
iosrjce
 
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORMSTANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
ijnlc
 
NLP Deep Learning with Tensorflow
NLP Deep Learning with TensorflowNLP Deep Learning with Tensorflow
NLP Deep Learning with Tensorflow
seungwoo kim
 
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...
Guy De Pauw
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
DigiGurukul
 
Implementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar RulesImplementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar Rules
IJERA Editor
 

Similar to 5a use of annotated corpus (20)

5 relevance of annotated corpus
5 relevance of annotated corpus5 relevance of annotated corpus
5 relevance of annotated corpus
 
Pos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil TextsPos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil Texts
 
nlp (1).pptx
nlp (1).pptxnlp (1).pptx
nlp (1).pptx
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
 
Nlp (1)
Nlp (1)Nlp (1)
Nlp (1)
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
 
An implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzerAn implementation of apertium based assamese morphological analyzer
An implementation of apertium based assamese morphological analyzer
 
Poster @ enetCollect CA MC meeting in Iasi, Romania
Poster @ enetCollect CA MC meeting in Iasi, Romania Poster @ enetCollect CA MC meeting in Iasi, Romania
Poster @ enetCollect CA MC meeting in Iasi, Romania
 
REPORT.doc
REPORT.docREPORT.doc
REPORT.doc
 
Building of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert SystemBuilding of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert System
 
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
 
Sanskrit parser Project Report
Sanskrit parser Project ReportSanskrit parser Project Report
Sanskrit parser Project Report
 
Identification of prosodic features of punjabi for enhancing the pronunciatio...
Identification of prosodic features of punjabi for enhancing the pronunciatio...Identification of prosodic features of punjabi for enhancing the pronunciatio...
Identification of prosodic features of punjabi for enhancing the pronunciatio...
 
Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)
 
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
 
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORMSTANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORM
 
NLP Deep Learning with Tensorflow
NLP Deep Learning with TensorflowNLP Deep Learning with Tensorflow
NLP Deep Learning with Tensorflow
 
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
 
Implementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar RulesImplementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar Rules
 

More from ThennarasuSakkan

11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)
ThennarasuSakkan
 
11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)
ThennarasuSakkan
 
8 issues in pos tagging
8 issues in pos tagging8 issues in pos tagging
8 issues in pos tagging
ThennarasuSakkan
 
7 probability and statistics an introduction
7 probability and statistics an introduction7 probability and statistics an introduction
7 probability and statistics an introduction
ThennarasuSakkan
 
6 shallow parsing introduction
6 shallow parsing introduction6 shallow parsing introduction
6 shallow parsing introduction
ThennarasuSakkan
 
4 salient features of corpus
4 salient features of corpus4 salient features of corpus
4 salient features of corpus
ThennarasuSakkan
 
2 why python for nlp
2 why python for nlp2 why python for nlp
2 why python for nlp
ThennarasuSakkan
 
1 computational linguistics an introduction
1 computational linguistics   an introduction1 computational linguistics   an introduction
1 computational linguistics an introduction
ThennarasuSakkan
 

More from ThennarasuSakkan (8)

11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)
 
11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)
 
8 issues in pos tagging
8 issues in pos tagging8 issues in pos tagging
8 issues in pos tagging
 
7 probability and statistics an introduction
7 probability and statistics an introduction7 probability and statistics an introduction
7 probability and statistics an introduction
 
6 shallow parsing introduction
6 shallow parsing introduction6 shallow parsing introduction
6 shallow parsing introduction
 
4 salient features of corpus
4 salient features of corpus4 salient features of corpus
4 salient features of corpus
 
2 why python for nlp
2 why python for nlp2 why python for nlp
2 why python for nlp
 
1 computational linguistics an introduction
1 computational linguistics   an introduction1 computational linguistics   an introduction
1 computational linguistics an introduction
 

Recently uploaded

Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
kaushalkr1407
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
Excellence Foundation for South Sudan
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
Nguyen Thanh Tu Collection
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
AzmatAli747758
 

Recently uploaded (20)

Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
 

5a use of annotated corpus

  • 1. USE OF ANNOTATED CORPUS Thennarasu Sakkan
  • 2. Annotated Text Corpora is an important resource for advances in NLP research and for developing different language technologies. The annotation of corpora is done using a set of tags, which mark the linguistic properties of a word, sentence or discourse. The corpora annotated with various linguistic information not only forms a precious resource for language technologies but also involves large amount of effort and time.
  • 3. Therefore, it is important to create corpora which, once created, can be used for various purposes.
    Layered approach
    It was proposed to follow a layered approach. Some of the layers are:
    Layer 1: Morphology
    Layer 2: POS <morphosyntactic>
    Layer 3: LWG
    Layer 4: Chunks
    Layer 5: Syntactic Analysis
    Layer 6: Thematic roles/Predicate Argument structure
    Layer 7: Semantic properties of the lexical items
    Layers 8, 9, 10, 11: Word sense, pronoun referents (anaphora), etc.
  • 4. Example, ((My younger sister Suguna))_NP ((will be coming))_VP ((from Tamil Nadu))_PP ((early this month))_NP.
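The bracketed chunk notation in the English example can be read back into a data structure. The following is a minimal sketch, assuming every chunk is written as `((words))_TAG` with an upper-case tag; the function name `parse_chunks` is made up for illustration and is not part of any tool mentioned in the slides.

```python
import re

# Assumed pattern: a chunk is written as ((word word ...))_TAG.
CHUNK_RE = re.compile(r"\(\((.+?)\)\)_([A-Z]+)")

def parse_chunks(annotated):
    """Return a list of (chunk_tag, [words]) pairs from a bracketed string."""
    return [(tag, words.split()) for words, tag in CHUNK_RE.findall(annotated)]

example = ("((My younger sister Suguna))_NP ((will be coming))_VP "
           "((from Tamil Nadu))_PP ((early this month))_NP.")

for tag, words in parse_chunks(example):
    print(tag, words)
```

Chunks whose tag is missing or marked with `?`, as in the Tamil sample, would not match this pattern and would need separate handling.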
  • 5. ((செவ்஬ா஦ில்_NNP))_NP ((ச஬ற்நிக஧஥ாக_RB))_RBP ((ர஧ா஬ர்_NNP ஬ிண்கனம்_NN))_NP ((஡ர஧஦ிநங்கி஦து_VF))_VP ! (஢ாொ_NNP ஬ிஞ்ஞாணிகள்_NN))_NP ((ொ஡ரண_NN))_NP !!_RD_SYM (See here exclamation marker.) ((஢ியூ஦ார்க்_NNP))_NP :_RD_PUNC ((செவ்஬ாய்_NNP கி஧கத்ர஡_NN ஆய்வு_NN))_NP ((செய்஬஡ற்காக_RB))_RBP ((அச஥ரிக்கா_NNP))_NP ((அனுப்தி஦_VNF))_VGNF (ர஧ா஬ர்_NNP ஬ிண்கனம்_NN))_NP ((கிட்டத்஡ட்ட_RB))_RBP ((8_TC? ஥ா஡_NN த஦஠த்஡ிற்கு_NN))_NP ((திநகு_NST))_? இன்று_NST))_? (06.08.12) ((ச஬ற்நிக஧஥ாக_RB))_RBP ((஡ர஧஦ிநங்கி஦து_VF))_VP ((._PUNC))_? ((஬ிண்ச஬பி_NN ஆய்வு_NN ர஥஦த்஡ில்_NN))_NP ((இது_PRP))_?? ((ஒய௃_TC ஥ிகப்_INTF சதரி஦_JJ ர஥ல்கல்னாக_RB??))_NP?? / RBP?? ((கய௃஡ப்தடுகிநது_VF))_VP ((._PUNC))_??
  • 6. ((பூ஥ி஦ில்_NN))_NP ((இய௃ந்து_N_NST))_NP?/N_ST? ((சு஥ார்_RB)) ((570_TC ஥ில்னி஦ன்_NN கி.஥ீ.,_NN ச஡ாரன஬ில்_NN))_NP ((உள்பது_VF))_VGF ((செவ்஬ாய்_NNP கி஧கம்_NNP))_NP ._PUNC ((இந்஡_DMD கி஧கத்஡ில்_NN ஊ஦ிரிணங்கள்_NN))_NP ((஬ாழ்஬஡ற்காண_VNF))_VGNF ((஌ற்ந_JJ சூ஫ல்_NN))_NP ((இய௃க்கிந஡ா_VF))_VGF ((஋ன்தது_CCS))_?? ((குநித்து_PSP))_?? ((ஆய்வு_NN))_NP ((செய்஦_VINF))_VGINF ((அச஥ரிக்கா஬ின்_NNP ஢ாொ_NNP ஬ிண்ச஬பி_NNP ஆ஧ாய்ச்ெி_NNP ர஥஦ம்_NNP))_NP ((தல்ர஬று_JJ))_JJP ((ஆய்வுகரப_NN))_NP ((ர஥ற்சகாண்டு_VNF))_VGNF ((஬ய௃கிநது_VF))_VGF. ((செவ்஬ாய்_NNP கி஧கம்_NN))_NP ((ச஡ாடர்தாண_JJ))_JJP ((தடங்கரபயும்_NN))_NP ((அவ்஬ப்ரதாது_RB))_RBP ((ச஬பி஦ிட்டு_VNF ஬ய௃கிநது_VM))_VGF ._SYM
  • 7. How are corpora annotated?
    • Automatic annotation
    • Computer-assisted annotation
    • Manual annotation
    Sinclair (1992): the introduction of the human element in corpus annotation reduces consistency.
  • 8. Corpus in NLP: NLP is unthinkable without corpora. Corpora are essential ingredients of every aspect of natural language processing.
  • 9. a) Morph analysis: the morphological features of a given word are marked. If the word has multiple morph feature sets, all of them are provided.
    • Morphological level (morphological annotation)
      – Prefixes
      – Suffixes
      – Stems
    Example: pens
    <root="pen" cat="n" gender="m" number="pl" person="3">|<root="pen" cat="v" gender="m" number="sing" person="3" tense="present" aspect="hab">
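A word with several morph feature sets can be held as a list of feature dictionaries, one per analysis. This is a rough sketch, assuming the alternatives are separated by `|` and the attributes are written with plain straight quotes; `parse_morph` is a hypothetical helper name.

```python
import re

# Assumed format: alternative analyses separated by "|", each a run of
# attribute="value" pairs inside angle brackets, as in the "pens" example.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_morph(analysis):
    """Split a morph annotation into one feature dict per alternative analysis."""
    return [dict(ATTR_RE.findall(alt)) for alt in analysis.split("|")]

pens = ('<root="pen" cat="n" gender="m" number="pl" person="3">|'
        '<root="pen" cat="v" gender="m" number="sing" person="3" '
        'tense="present" aspect="hab">')

for features in parse_morph(pens):
    print(features["cat"], features)
```

Keeping every analysis leaves the choice between the noun and verb readings to later layers such as POS tagging.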
  • 10. Corpus Vs Morph
    Corpus      Te       Ma       Ta      Hi
    10%         63       54       59       4
    20%        293      335      257      11
    30%        934     1196      728      26
    40%       2433     3439     1803      74
    50%       5707     8810     4091     186
    60%      13280    21663     8992     454
    70%      31941    53718    20191    1092
  • 11. b) POS: a word is tagged for its POS category in a given sentence.
    Example:
    I need two <pos="NN"> pens </pos="NN"> to finish this article.
    He <pos="VBS"> pens </pos="VBS"> his views regularly.
    c) Word sense: the appropriate sense of a word in a given context is marked.
    Example:
    I need two <word_sense="pen"> pens </word_sense="pen"> to finish this article.
    He <word_sense="write"> pens </word_sense="write"> his views regularly.
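Inline tags of this kind can be pulled out of a sentence, or stripped to recover the raw text. The sketch below assumes the slide's own tag shape, with the attribute repeated in the closing tag and straight quotes; the function names are illustrative only.

```python
import re

# Assumed tag shape, taken from the examples: <layer="VALUE"> word </layer="VALUE">
TAG_RE = re.compile(r'<(\w+)="([^"]+)">\s*(.*?)\s*</\1="\2">')

def extract_annotations(sentence):
    """Return (layer, value, word) triples for every inline tag in the sentence."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG_RE.finditer(sentence)]

def strip_annotations(sentence):
    """Remove the inline tags and keep only the annotated words."""
    return TAG_RE.sub(lambda m: m.group(3), sentence)

pos_example = 'He <pos="VBS"> pens </pos="VBS"> his views regularly.'
sense_example = 'He <word_sense="write"> pens </word_sense="write"> his views regularly.'

print(extract_annotations(pos_example))
print(extract_annotations(sense_example))
print(strip_annotations(pos_example))
```

The same pattern handles both the POS layer and the word-sense layer, because the layer name is captured rather than hard-coded.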
  • 12. POS Vs Corpus: 11% of the words in the Brown corpus are ambiguous. What about our languages?
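The same question can be asked of any POS-tagged corpus. The sketch below counts a word as ambiguous when it occurs with more than one tag; measuring over distinct word forms rather than running tokens is an assumption, and the toy data is invented for illustration.

```python
from collections import defaultdict

def ambiguity_rate(tagged_tokens):
    """Return (share of distinct words seen with more than one tag, those words)."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    if not tags_by_word:
        return 0.0, []
    ambiguous = sorted(w for w, tags in tags_by_word.items() if len(tags) > 1)
    return len(ambiguous) / len(tags_by_word), ambiguous

# Toy tagged data built from the "pens" sentences; not a real corpus.
toy = [("I", "PRP"), ("need", "VB"), ("two", "CD"), ("pens", "NN"),
       ("He", "PRP"), ("pens", "VBS"), ("his", "PRP$"), ("views", "NN")]

rate, words = ambiguity_rate(toy)
print(f"{rate:.1%} of word forms are ambiguous: {words}")
```

Running this over a tagged corpus of an Indian language would give a figure that can be set beside the Brown corpus number.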
  • 13. At the sentence level the information could be: a) Identification of chunks/MWEs/LWGs/phrases. Chunks are minimal constituent units. The chunk analysis of a sentence provides a shallow level of parsing. Thus, a corpus annotated with POS and chunks can be useful for building a shallow parser. Example: I saw a man with a telescope.
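A very small rule-based chunker shows what a shallow parse of the example sentence (written here as "I saw a man with a telescope") might look like. The tag-to-chunk mapping is a hypothetical simplification for illustration; a real shallow parser would be trained on a corpus annotated with POS tags and chunks.

```python
from itertools import groupby

# Hypothetical mapping from POS tags to chunk labels, for illustration only.
CHUNK_OF = {"PRP": "NP", "DT": "NP", "JJ": "NP", "NN": "NP",
            "VBD": "VP", "IN": "PP"}

def chunk(tagged):
    """Group consecutive tokens whose POS tags map to the same chunk label."""
    chunks = []
    for label, group in groupby(tagged, key=lambda wt: CHUNK_OF.get(wt[1], "O")):
        chunks.append((label, [word for word, _ in group]))
    return chunks

sentence = [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("man", "NN"),
            ("with", "IN"), ("a", "DT"), ("telescope", "NN")]

for label, words in chunk(sentence):
    print(f"(({' '.join(words)}))_{label}")
```

The output is flat: it does not decide whether "with a telescope" attaches to "saw" or to "man", which is the kind of decision left to full syntactic analysis.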
  • 14. • Syntactic level
      – parsing
      – treebanking
      – bracketing
    • Discourse level
      – Anaphoric relations (coreference annotation)
      – Speech acts (pragmatic annotation)
      – Stylistic features such as speech and thought presentation (stylistic annotation)
  • 15. Corpus Vs Machine translation: among the applications of parallel and comparable corpora, which include lexicography, terminology extraction to build terminology databases, and bilingual reference tools, pride of place must be given to machine translation (MT). Parallel corpora have played a pivotal role in a (partial) paradigm shift from rule-based approaches to statistical and example-based approaches to MT.
  • 16. Essentially, statistical MT (SMT) involves computing the probability that a target-language (TL) string is the translation of a source-language (SL) string, based on the frequency of the co-occurrence of these strings in the corpus, whereas example-based MT (EBMT) involves searching for similar phrases in previous translations and extracting the TL fragments corresponding to the SL fragments.
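The co-occurrence idea behind SMT can be illustrated with a relative-frequency estimate: the probability of a TL string given an SL string is the number of times the two are aligned divided by the number of times the SL string occurs. This is only a sketch of the counting step, assuming the corpus is already aligned into (SL, TL) pairs; the pairs below are invented toy data, and real SMT systems add alignment models and a language model on top.

```python
from collections import Counter

def translation_probs(aligned_pairs):
    """Estimate P(TL string | SL string) by relative frequency of co-occurrence."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(sl for sl, _ in aligned_pairs)
    return {(sl, tl): count / source_counts[sl]
            for (sl, tl), count in pair_counts.items()}

# Toy aligned pairs (English -> French), invented for illustration.
pairs = [("house", "maison"), ("house", "maison"), ("house", "maison"),
         ("house", "domicile"), ("book", "livre")]

probs = translation_probs(pairs)
for (sl, tl), p in sorted(probs.items()):
    print(f"P({tl} | {sl}) = {p:.2f}")
```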
  • 17. Show demo on KWIC
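KWIC (keyword in context) lists every occurrence of a search word together with a few words on either side. A minimal concordancer might look like the sketch below; whitespace tokenisation, case-insensitive matching and the default window of three tokens are assumptions made for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

text = "I need two pens to finish this article . He pens his views regularly ."

for left, key, right in kwic(text.split(), "pens"):
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>20}  {key}  {right}")
```

Aligning the keyword in a single column makes it easy to see, for example, that "pens" follows a numeral in one line and a pronoun in the other.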