This document discusses topics related to developing a morphosyntactic tagging scheme for the automated analysis of Nepali text, including:
1. Issues with tokenization, such as how to handle clitics and postpositions.
2. Proposed solutions for tokenization, including assigning part-of-speech tags to suffixes, clitics, and postpositions.
3. Issues with modeling gender on nouns and adjectives in Nepali.
4. Challenges in modeling verb inflection in Nepali due to complex compound verb forms and proposed solutions such as simplifying assumptions and tagging the last element of compound verbs.
6. 2. TOKENISATION
i. TOKENS
- Appropriately sized units for morphosyntactic analysis
- Units to which grammatical categories are assigned
ii. ORTHOGRAPHIC WORD
- A string bounded by whitespace or punctuation
NOTES
- Sentences are separated into tokens
- Tokens larger than an orthographic word (multiword units) have not been investigated yet
- A graphical word with multiple elements => elements tokenised separately
- In written language, tokens are separated by spaces
iii. CLITICS
- A morpheme that has the syntactic characteristics of a word
- Must be tokenised
- Can be postpositions or affixes
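The definition of an orthographic word above (a string bounded by whitespace or punctuation) can be sketched in a few lines of Python. This is only an illustration: it assumes romanised input as used on these slides (Devanagari combining signs would need extra handling), and the function name `orthographic_words` and the sample phrase are hypothetical.

```python
import re

def orthographic_words(text):
    """Split romanised text into orthographic words.

    Assumption: a word is a run of word characters; each punctuation
    mark is returned as its own item so that no input is lost.
    """
    return re.findall(r"\w+|[^\w\s]", text)

# Illustrative input only, not taken from a corpus.
print(orthographic_words("mero ghar, tero ghar."))
```

Clitics are deliberately not handled here: they stay attached to their host word and need the separate tokenisation step described on the following slides.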
7. 2. TOKENISATION
Postpositions
USES
- Mark oblique cases
- Written as part of the same orthographic word as the noun, adjective or other word whose case they mark
- Suffixes are postpositions, except “haru” (plural or collective), “ko/kii/kaa” (genitive), “le” (ergative) and “laai” (accusative/dative)
8. 2. TOKENISATION
Layer II Postpositions
ISSUE
- Analyse the marker as an inflectional element of the noun?
- Or add it as a separate token?
- Or treat suffixes on the one hand and postpositions on the other differently?
METHODS
- For a singular ergative noun (with “le”), use NN1E
- For a plural accusative noun (with “harulaai”), use NN2A
PROBLEMS
- Hard to know when to treat a postposition as a suffix and when as a clitic (separate tokens are assigned to “ma/bata/sanga”)
- Suffixes can attach to pronouns, adjectives and adverbs as well as nouns
Conclusion: Abandon NN1E / NN2A
9. 2. TOKENISATION
Postpositions
SOLUTION
- The general category of postposition is tagged II
- Plural/collective marker “haru” is tagged IH
- Genitive postpositions “ko/kii/kaa”: IKM/IKF/IKO respectively
- Ergative-instrumental postposition “le”: IE
- Accusative/dative postposition “laai”: IA
- Possessive pronouns: “mero”: PMXKM, “tero”: PTNKM, “aafno”: PRFKM
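The tagging decisions above can be illustrated with a naive suffix-stripping sketch. The tags (II, IH, IKM/IKF/IKO, IE, IA) come from the slide; everything else is an assumption made for illustration: the romanised spellings, the default stem tag NN, the sample words, and the function name `split_postpositions`. A real tokeniser would need a lexicon, since a stem can simply happen to end in one of these strings.

```python
# Tags are from the slide; the spellings are romanised as on the slides.
MARKER_TAGS = {
    "haru": "IH",    # plural/collective marker
    "ko": "IKM",     # genitive, masculine
    "kii": "IKF",    # genitive, feminine
    "kaa": "IKO",    # genitive, other
    "le": "IE",      # ergative-instrumental
    "laai": "IA",    # accusative/dative
    "ma": "II",      # general postposition
    "bata": "II",
    "sanga": "II",
}

def split_postpositions(word, stem_tag="NN"):
    """Peel known markers off the end of a word, longest match first.

    Hypothetical helper: returns (token, tag) pairs in surface order.
    The stem tag defaults to NN purely for illustration.
    """
    peeled = []
    rest = word
    found = True
    while found:
        found = False
        for marker in sorted(MARKER_TAGS, key=len, reverse=True):
            # Always leave at least one character for the stem.
            if rest.endswith(marker) and len(rest) > len(marker):
                peeled.append((marker, MARKER_TAGS[marker]))
                rest = rest[: -len(marker)]
                found = True
                break
    return [(rest, stem_tag)] + peeled[::-1]

# Illustrative words only.
print(split_postpositions("gharharulaai"))
print(split_postpositions("gharma"))
```

Because each marker becomes its own token, a combination such as plural plus accusative needs no composite noun tag of the NN2A kind.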
10. 3. GENDER ON NOUNS AND ADJECTIVES
- Nepali has grammatically marked gender
- Masculine => suffix “o”
Feminine => suffix “ii”
Other => suffix “aa”
- The default form for other nouns and suffixes is mostly the masculine
11. 3. GENDER ON NOUNS AND ADJECTIVES
ISSUE
- Most adjectives, nouns and descriptive determiners such as “bibhinna, sampurna” are not gender marked
- Feminine nouns ending in “ii”, such as “aaimaaii”, do not have a corresponding masculine noun ending in “o”
- The gender-marked form “yetro” has the unmarked forms “yo/yi/eti”
METHODS
- Ignore gender inflection altogether
PROBLEMS
- Difficult to extract feminine-marked adjectives because of false positives such as “dhani”, which ends in “ii”
- Including gender marking in the tagging system causes problems for unmarked words and complicates automated tagging
12. 3. GENDER ON NOUNS AND ADJECTIVES
SOLUTION
- Assign the tags JM, JF, JO and JX to adjectives with the suffix “o” (masculine), the suffix “ii” (feminine), the suffix “aa” (other) and to unmarked adjectives, respectively
- Ignore plural, public and honorificity for simplicity
- Ignore gender marking on nouns
  Example: “Sita” as NP and “aaimaaii” as NN
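The adjective tags above can be illustrated with a small suffix-based sketch. The tags JM, JF, JO and JX are from the slide. The exception list, the romanised spelling “dhanii” (the slide writes “dhani” but describes it as ending in “ii”), the sample adjectives and the function name `tag_adjective` are assumptions made for illustration.

```python
# Hypothetical exception list: adjectives whose final "ii" (or "o"/"aa")
# is part of the stem rather than a gender suffix.
UNMARKED_EXCEPTIONS = {"dhanii"}

def tag_adjective(adjective):
    """Return JM, JF, JO or JX for a romanised adjective.

    Naive sketch: only the ending is inspected, so the exception list
    has to catch false positives such as "dhanii".
    """
    if adjective in UNMARKED_EXCEPTIONS:
        return "JX"
    if adjective.endswith("ii"):
        return "JF"
    if adjective.endswith("aa"):
        return "JO"
    if adjective.endswith("o"):
        return "JM"
    return "JX"

# Illustrative forms only.
for form in ["raamro", "raamrii", "raamraa", "bibhinna", "dhanii"]:
    print(form, tag_adjective(form))
```

The need for an exception list is exactly the false-positive problem noted on the previous slide, which is why a lexicon lookup would be more reliable than suffix matching alone.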
13. 4. MODELLING NEPALI VERB INFLECTION
ISSUE
- Multiplicity of inflected forms, e.g. “bhanidiyeko”
- Compound verbs = main verb + vector verb / light verb
  - “garidiyo” = “gari” + “diyo”
- Tense-aspect-mood combinations are created by using auxiliary verbs to form compound forms: “hunu / hunthyo / huncha / bhairahayo / hunecha”
- Each compound verb can encode voice, tense, mood, aspect, person, gender, number, honorificity and vector verb. This leads to a very large tagset.
METHODS
Possible solutions:
- Probabilistic approach (Markov model)
  - Training data required grows as (tagset size)^2
  - Impractical as the tagset grows
- Re-tokenisation approach
  - Difficult to trace the root word / root verb
  - Example: “huncha” => “hun” + “cha”
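The quadratic growth mentioned for the Markov-model approach can be made concrete with a short sketch. A first-order model needs a transition estimate for every ordered pair of tags, so the number of parameters is the square of the tagset size. The feature inventory below is purely hypothetical (the value counts are invented for illustration and are not a description of Nepali), as are the function names.

```python
from math import prod

def tagset_size(values_per_feature):
    """Number of composite tags if every feature combination gets a tag."""
    return prod(values_per_feature.values())

def bigram_transitions(n_tags):
    """Number of tag-to-tag transitions a first-order Markov model estimates."""
    return n_tags ** 2

# Hypothetical feature inventory, for illustration only.
features = {"tense": 3, "person": 3, "number": 2, "gender": 2}

n_tags = tagset_size(features)
print(n_tags, bigram_transitions(n_tags))
```

Adding even one more feature multiplies the tagset and squares the effect on the transition table, which is why the slide calls this approach impractical as the tagset grows.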
14. 4. MODELLING NEPALI VERB INFLECTION
SOLUTION
- Make simplifying assumptions in the description of Nepali verbs underlying the tagset.
- Tag according to the last element of the compound verb (person-number-gender inflection)
- High honorific verbs are tagged in isolation
- Each non-finite form receives a tag of its own in the tagset
- For finite forms, the tags are the possible combinations of: