Papers We Love
#pwlnepal
22 NOVEMBER, 2015
A Morphosyntactic Categorization Scheme
for the Automated Analysis of Nepali
Andrew Hardie, Ram Raj Lohani, Bhim N. Regmi and Yogendra P. Yadava
INTRODUCTION TO NATURAL LANGUAGE PROCESSING
Ashmit Bhattarai
COMPUTER ENGINEERING devilcrackstheearth@gmail.com KATHMANDU UNIVERSITY
Topics to Discuss
● Morphology
● Tokenization
● Words to relate
1. INTRODUCTION
Morphosyntactic Tagging
Features of tagsets
➔ Precise and Distinct
➔ Optimal Distinctions
2. TOKENISATION
i. TOKENS
- Appropriately sized units for morphosyntactic analysis
- Each is assigned a grammatical category
ii. ORTHOGRAPHIC WORD
- A string bounded by whitespace or punctuation
NOTES
- Sentences are separated into tokens
- Orthographic word smaller than a token = multiword unit (not investigated yet)
- Graphical word with multiple elements => elements tokenized separately
- In written language, tokens are separated by spaces
iii. CLITICS
- A morpheme that has the syntactic characteristics of a word
- Must be tokenized
- Can be postpositions or affixes
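As a rough illustration of the orthographic-word definition above, the following Python sketch splits romanised text on whitespace and punctuation. The function name and the sample sentence are assumptions made for illustration only; Devanagari input would need a pattern that also covers combining vowel signs.

```python
import re


def orthographic_words(text):
    """Return the strings bounded by whitespace or punctuation.

    Punctuation marks are kept as separate items; whitespace is dropped.
    """
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


# Illustrative romanised sentence (an assumption, not taken from the slides).
print(orthographic_words("raamle kitaab padhyo."))
# ['raamle', 'kitaab', 'padhyo', '.']
```

Note that "raamle" is still one orthographic word here; separating the clitic "le" is a later step.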
2. TOKENISATION
Postpositions
USES
- Mark oblique cases
- Written as part of the same orthographic word as the noun, adjective or other word whose case they mark
- All suffixes except “haru” (plural or collective), “ko/kii/kaa” (genitive), “le” (ergative) and “laai” (accusative/dative) are postpositions
2. TOKENISATION
ISSUE
- Analyse as an inflectional element of the noun
- Add as separate tokens
- Treat suffixes on the one hand and postpositions on the other differently
METHODS
- For a singular ergative noun (with “le”), use NN1E
- For a plural accusative noun (with “harulaai”), use NN2A
Layer II Postpositions
PROBLEMS
- Hard to know when to treat postpositions as suffixes and when as clitics (assigned separate tokens, e.g. “ma/bata/sanga”)
- Suffixes can attach to nouns, pronouns, adjectives and adverbs too
Conclusion: Abandon NN1E / NN2A
2. TOKENISATION
SOLUTION
Postpositions
- The general category of postposition is tagged II
- Plural-collective marker “haru” is tagged IH
- Genitive postpositions “ko/kii/kaa” : IKM/IKF/IKO respectively
- Ergative-instrumental postposition “le” : IE
- Accusative/dative postposition “laai” : IA
- Possessive pronouns
  “mero” : PMXKM, “tero” : PTNKM, “aafno” : PRFKM
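The tagging decisions above can be sketched in Python as a simple right-edge clitic splitter. This is only a minimal sketch: the tags are the ones listed on the slide, but the function name, the default host tag, the romanised example words and the plain suffix-matching strategy are all assumptions, not the authors' method.

```python
# Tags as listed on the slide; "ma/bata/sanga" fall under the general II tag.
CLITIC_TAGS = {
    "haru": "IH",
    "ko": "IKM", "kii": "IKF", "kaa": "IKO",
    "le": "IE",
    "laai": "IA",
    "ma": "II", "bata": "II", "sanga": "II",
}

# Try longer forms first so that "laai" wins over any shorter match.
FORMS_LONGEST_FIRST = sorted(CLITIC_TAGS, key=len, reverse=True)


def split_clitics(word, host_tag="NN"):
    """Peel known clitics off the right edge of one orthographic word.

    host_tag is a hypothetical default; a real tagger would decide it separately.
    """
    clitics = []
    stem = word
    found = True
    while found:
        found = False
        for form in FORMS_LONGEST_FIRST:
            # Require a non-empty stem to remain after stripping.
            if stem.endswith(form) and len(stem) > len(form):
                clitics.insert(0, (form, CLITIC_TAGS[form]))
                stem = stem[: -len(form)]
                found = True
                break
    return [(stem, host_tag)] + clitics


print(split_clitics("ketaaharulaai"))
# [('ketaa', 'NN'), ('haru', 'IH'), ('laai', 'IA')]
```

Plain suffix matching will also split words that merely happen to end in one of these strings, so a practical version would need a lexicon check before accepting a split.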
3. GENDER ON NOUNS AND
ADJECTIVES
- Nepali has grammatically marked gender
- Masculine => suffix “o”
- Feminine => suffix “ii”
- Other => suffix “aa”
- The default for other nouns and suffixes is mostly masculine
3. GENDER ON NOUNS AND
ADJECTIVES
ISSUE
- Most adjectives, nouns and descriptive determiners such as “bibhinna” and “sampurna” are not gender marked
- Feminine nouns ending in “ii”, such as “aaimaaii”, do not have a corresponding masculine noun ending in “o”
- The gender-marked form “yetro” has the unmarked forms “yo/yi/eti”
METHODS
- Ignore gender inflection altogether
PROBLEMS
- Feminine-marked adjectives are difficult to extract because of false positives such as “dhani”, which ends in “ii”
- Including gender marking in the tagging system causes problems for unmarked words and complicates automated tagging
3. GENDER ON NOUNS AND
ADJECTIVES
SOLUTION
- Assign the tags JM, JF, JO and JX to adjectives with the suffix “o” (masculine), the suffix “ii” (feminine), the suffix “aa” (other) and to unmarked adjectives, respectively
- Ignore plural, public and honorificity for simplicity
- Ignore gender marking on nouns
  Example: “Sita” as NP and “aaimaaii” as NN
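The adjective tags above could be assigned by a suffix check along these lines. This is a minimal sketch under stated assumptions: the function name, the exception list and the romanised example adjectives are illustrative, and the exception list stands in for whatever lexicon a real tagger would use to catch false positives.

```python
# Hypothetical exception list: adjectives that end in a gender-like suffix
# but are not gender marked (the false-positive problem noted earlier).
UNMARKED_EXCEPTIONS = {"dhanii"}


def tag_adjective(adjective):
    """Return JM, JF, JO or JX for a romanised adjective."""
    if adjective in UNMARKED_EXCEPTIONS:
        return "JX"
    if adjective.endswith("ii"):
        return "JF"  # feminine
    if adjective.endswith("aa"):
        return "JO"  # other
    if adjective.endswith("o"):
        return "JM"  # masculine
    return "JX"      # unmarked


for form in ("raamro", "raamrii", "raamraa", "bibhinna"):
    print(form, tag_adjective(form))
```

Without the exception list, every word ending in “ii” would be tagged JF, which is exactly the difficulty the slides describe.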
4. MODELLING NEPALI VERB INFLECTION
ISSUE
- Multiplicity of inflected forms, e.g. “bhanidiyeko”
- Compound verbs = main verb + vector verb / light verb
- “garidiyo” = “gari” + “diyo”
- Tense-aspect-mood combinations are created by using auxiliary verbs to form compound forms: “hunu/ hunthyo/ huncha/ bhairahayo/ hunecha”
- Each compound verb can represent voice, tense, mood, aspect, person, gender, number, honorificity and vector verb. This leads to a very large tagset.
METHODS
- Possible solutions:
- Probabilistic approach (Markov model)
- Training data required grows as (tagset size)^2
- Impractical as the tagset grows
- Re-tokenization approach
- Difficult to trace the root word or root verb
- Example: “huncha” => “hun” + “cha”
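To make the growth argument concrete, the sketch below counts how many combined tags a few verb features produce and how many tag-to-tag transitions a bigram Markov model would then need to estimate. The feature inventories are hypothetical and chosen only to show the arithmetic; they are not Nepali's actual categories or the tagset's actual size.

```python
from math import prod

# Hypothetical feature inventories (illustration only).
FEATURES = {
    "tense": ["past", "nonpast"],
    "person": ["1", "2", "3"],
    "number": ["singular", "plural"],
    "gender": ["masculine", "feminine"],
}


def combined_tag_count(features):
    """Number of tags if every feature combination gets its own tag."""
    return prod(len(values) for values in features.values())


def bigram_transition_count(tagset_size):
    """Number of tag-to-tag transitions in a bigram model: size squared."""
    return tagset_size ** 2


tags = combined_tag_count(FEATURES)
print(tags, bigram_transition_count(tags))  # 24 576
```

Adding voice, mood, aspect, honorificity and vector verb multiplies the first number further, and the second grows with its square.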
4. MODELLING NEPALI VERB INFLECTION
SOLUTION
- Simplify the assumptions in the description of Nepali verbs underlying the tagset
- Tag according to the last element of the compound verb (person-number-gender inflection)
- A high honorific verb is tagged in isolation
- Each non-finite form receives a tag of its own in the tagset
- For finite forms, the tags are possible combinations of:
THANK YOU !!
#pwlnepal
