Part-of-Speech Tagging of Northern Sotho: Disambiguating Polysemous Function Words
by Gertrud Faaß, Ulrich Heid, Elsabé Taljard and Danie Prinsloo

Published in: Technology, Business
  • Transcript

    • 1. Part-of-Speech tagging of Northern Sotho: Disambiguating polysemous function words. Gertrud Faaß, Ulrich Heid, Elsabé Taljard, DJ Prinsloo
    • 2. This Talk
      • Prologue
      • Challenges for tagging Sotho texts
      • Objectives
      • Descriptive state of the art for tagging of Sotho texts
        • Tools
        • Tagsets
      • The ambiguity problem
      • Methodology
      • Results
      • Conclusions & future work
    • 3. Nine Official Bantu Languages of SA
      • Sotho Group
        • Northern Sotho / Sepedi
        • Tswana
        • Southern Sotho
      • Nguni Group
        • Zulu
        • Swati
        • Xhosa
        • Ndebele
      • Venda and Tsonga
    • 4.  
    • 5. Noun class system
      • Cl.No | CP | Example
      • 1 / 2 | mo- / ba- | mosadi ‘woman’ / basadi ‘women’
      • 1a / 2b | Ø- / bo- | malome ‘uncle’ / bomalome ‘uncle & co’
      • 3 | mo- | monwana ‘finger’
      • 4 | me- | menwana ‘fingers’
      • 5 | le- | lebone ‘light’
      • 6 | ma- | mabone ‘lights’
      • 7 | se- | selepe ‘axe’
      • 8 | di- | dilepe ‘axes’
      • 9 | N- / Ø- | mpša ‘dog’ / hlogo ‘head’
      • 10 | diN- / di- | dimpša ‘dogs’ / dihlogo ‘heads’
      • 14 | bo- | bodulo ‘residence’
      • (6) | ma- | madulo ‘residences’
      • 15 | go- | go ruta ‘to learn’
      • 16 | fa- | fase ‘below’
      • 17 | go- | godimo ‘above’
      • 18 | mo- | morago ‘behind’
      • N- | N- / Ø- | ga-ntle ‘outside’ / pele ‘in front’
      • (24) | ga- | gare ‘middle’
    • 6. Concordial agreement – Northern Sotho
        • Taljard and Bosch (2005)
    • 7. Challenges for tagging
      • Ambiguity, for example:
        • function words:
        • -a- is 9-ways ambiguous, -go- up to 30-ways (11, 6, 5, …)
      • Unknown words (N+V)
        • noun derivation:
        • toropo (town) -> toropong (in/at/to town)
        • verb derivation: next slides
    • 8. Challenges: unknown words
      • Agglutinating languages:
      • extensive use of affixes
        • Example: rekišeditšwe ‘was / were sold for’ < rek- ‘buy’ (verb root) + -iš- (causative) + -el- (applied) + -il- (past tense) + -w- (passive) + -e (inflectional ending)
    • 9.
      • ROOTetšane , ROOTetšanwa , ROOTetšanwe , ROOTiša , ROOTišitše , ROOTišwa , ROOTišitšwe , ROOTišana , ROOTišane , ROOTišanwa , ROOTišanwe , ROOTišega , ROOTišegile, ROOTišetša , ROOTišeditše , ROOTišetšwa , ROOTišeditšwe, ROOTišetšana, ROOTišetšane , ROOTišetšanwa , ROOTišetšanwe , ROOTišiša , ROOTišišitše , ROOTišišwa , ROOTišišitšwe , ROOTišišana , ROOTišišane , ROOTišišanwa , ROOTišišanwe , ROOToga , ROOTogile , ROOTogwa , ROOTogilwe , ROOTogana , ROOTogane , ROOToganwa , ROOToganwe , ROOTogela , ROOTogetše , ROOTogelwa , ROOTogetšwe , ROOTola, ROOTotše , ROOTolwa , ROOTotšwe , ROOTolana , ROOTolane , ROOTolanwa , ROOTolanwe , ROOTolega , ROOTolegile , ROOTolela, ROOToletše , ROOTolelwa , ROOToletšwe , ROOTolelana , ROOTolelane , ROOTolelanwa , ROOTolelanwe , ROOTolla, ROOTolotše , ROOTollwa , ROOTolotšwe , ROOTollana , ROOTollane , ROOTollanwa , ROOTollanwe , ROOTollega , ROOTollegile , ROOTollela, ROOTolletše , ROOTollelwa , ROOTolletšwe , ROOTollelana , ROOTollelane , ROOTollelanwa , ROOTollelanwe , ROOTolliša , ROOTollišitše , ROOTollišwa , ROOTollišitšwe , ROOTollišana , ROOTollišane , ROOTollišanwa , ROOTollišanwe , ROOTologa , ROOTologile , ROOTologana , ROOTologane , ROOTologanwa , ROOTologanwe , ROOTološa , ROOTološitše , ROOTološwa , ROOTološitšwe , ROOTološana , ROOTološane , ROOTološanwa , ROOTološanwe , ROOTološetša , ROOTološeditše , ROOTološetšwa , ROOTološeditšwe , ROOTološetšana , ROOTološetšane , ROOTološetšanwa , ROOTološetšanwe , ROOToša , ROOTošitše , ROOTošwa , ROOTošitšwe , ROOTošetša , ROOTošeditše , ROOTošetšwa , ROOTošeditšwe , ROOTošetšana , ROOTošetšane , ROOTošetšanwa , ROOTošetšanwe
      Examples of suffixes and combinations for a single verb
    • 10. Solution for unknown verbs and nouns
      • Verb guesser: detection of
        • longest match suffix combinations
        • occurrences in corpora
      • Noun guesser: matching of
        • singular/plural-forms
        • nominal suffixes
        • occurrences in corpora
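
    The longest-match idea behind the verb guesser can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: the suffix inventory is a small hand-picked subset of the ROOT… list on slide 9, the names `VERB_SUFFIXES`, `guess_verb` and `min_root_len` are hypothetical, and the corpus-occurrence check mentioned on the slide is left out.

    ```python
    # Hypothetical suffix inventory: a few endings copied from the ROOT... list
    # on slide 9. A real guesser would need the complete list.
    VERB_SUFFIXES = [
        "išeditšwe", "išitšwe", "išetša", "išitše",
        "išwa", "iša", "ogile", "oga", "ola",
    ]


    def guess_verb(word, suffixes=VERB_SUFFIXES, min_root_len=2):
        """Return (root, suffix) for the longest matching suffix, or None.

        min_root_len is an assumed safeguard against analysing a word that
        consists of little more than the suffix itself.
        """
        # Try longer suffixes first so that the longest match wins.
        for suffix in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suffix):
                root = word[: len(word) - len(suffix)]
                if len(root) >= min_root_len:
                    return root, suffix
        return None


    print(guess_verb("rekišeditšwe"))
    print(guess_verb("toropo"))
    ```

    With this inventory, rekišeditšwe from slide 8 is split into the root rek- plus the suffix combination -išeditšwe, while toropo matches no verbal suffix and is left for the noun guesser.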
    • 11. Objectives
      • Tagging with a detailed tagset: class numbers
        • Nouns, adjectives, pronouns, concords, demonstratives
      • Disambiguation
      • Motivation: tagging used as preprocessing for:
        • Chunking, parsing
        • Lexicography (tag relatively large corpora, e.g. PSC)
        • Detailed linguistic research (e.g. grammar development)
        • Information extraction
    • 12. State of the art for tagging: Sotho languages
      • Comparison of tagsets and tools is hardly possible
        • Different applications of tagged material (linguistic description, lexicography, parsing, etc.)
        • Different number of tags
        • Differences in granularity
    • 13. Descriptive State of the Art: tagsets and tools
      • Authors | No. of tags | Noun class yes/no | Tool?
      • Van Rooy and Pretorius (2003) | 106 | no | no
      • De Schryver and De Pauw (2007) | 56 | no | yes
      • Kotzé (several, e.g. 2008) | partial | no | yes
      • Taljard et al. (2008) | 141/262 | yes | no
      • This paper | 25/141 | yes | yes
    • 14. Descriptive State of the Art for tagging: Sotho languages
      • Tools:
      • Full
        • De Schryver and de Pauw (2007) Northern Sotho tagger (statistical)
      • Partial
        • Kotzé (several publications, e.g. 2008) Verbal and nominal segment (finite state)
    • 15. Descriptive state of the art for tagging: Sotho languages
      • Applications of tagsets:
      • De Schryver and de Pauw (2007): used for lexicography
      • Van Rooy and Pretorius (2003): linguistic description of Setswana
      • Taljard et al. (2008): morphosyntactic and general linguistic description
    • 16. The ambiguity problem
      • -a-, -go-: see handout for possible readings
      • Local context may not identify noun class of subject concord:
        • (Masogana) … a nwa bjalwa
        • (young men) … CS06 drink beer
        • “They drink beer.”
    • 17. The ambiguity problem: possible solutions
        • Dependent on objectives
          • Flat tagset ignoring irrelevant details (cf. handout for -go-)
          • Layered tagset: granularity
    • 18. Tagset (cf. Handout)
      • Level 1
        • Noun = (N)
        • Subject concord (CS), Object concord (CO)
        • Pronouns (PRO)
      • Level 2
        • emphatic (only for pronouns) EMP
        • possessive (only for pronouns) POSS
      • Level 3
        • Classes -> N.01a, N.01, N.02, N.03, … , PERS, etc.
      • Examples:
        • noun of class 1 = N.01
        • possessive pronoun of class 6 = PRO.POSS.06
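
    A layered tag of this kind can be taken apart level by level. The following is a minimal sketch under stated assumptions: tags are dot-separated strings as in the examples on this slide, the level 2 labels are just EMP and POSS, and the names `LayeredTag`, `parse_tag` and `coarsen` are hypothetical helpers, not part of any tagger.

    ```python
    from typing import NamedTuple, Optional


    class LayeredTag(NamedTuple):
        level1: str            # e.g. N, CS, CO, PRO
        level2: Optional[str]  # e.g. EMP, POSS (pronouns only)
        level3: Optional[str]  # class information, e.g. 01a, 01, 06, PERS


    # Assumption: only the two level 2 labels named on the slide.
    LEVEL2_LABELS = {"EMP", "POSS"}


    def parse_tag(tag: str) -> LayeredTag:
        """Split a dot-separated layered tag into its three levels."""
        parts = tag.split(".")
        level1, rest = parts[0], parts[1:]
        level2 = rest.pop(0) if rest and rest[0] in LEVEL2_LABELS else None
        level3 = rest.pop(0) if rest else None
        if rest:
            raise ValueError(f"unexpected extra levels in tag: {tag!r}")
        return LayeredTag(level1, level2, level3)


    def coarsen(tag: str) -> str:
        """Drop everything below level 1, giving a flat, coarse tag."""
        return parse_tag(tag).level1


    print(parse_tag("N.01"))
    print(parse_tag("PRO.POSS.06"))
    print(coarsen("PRO.POSS.06"))
    ```

    Keeping the levels separate makes it possible to score or use the same annotation at different granularities, which is the point of a layered rather than a flat tagset.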
    • 19. RF tagger technology (cf. Schmid and Laws 2008)
      • Hidden Markov Model (HMM) Tagger
      • Additional external lexicon
      • Large, fine-grained tagsets
      • Several levels of description:
      • e.g. German articles: ART.Definiteness.Case.Number.Gender
      • Calculates joint (product) probabilities
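
    The "joint (product) probabilities" bullet can be read as follows: the probability of a layered tag is the product of the probabilities of its attributes, each conditioned on the attributes before it. The sketch below only illustrates that arithmetic. The numbers are invented toy values, `tag_probability` is a hypothetical helper, and the way a real tagger estimates the individual factors from context is not modelled.

    ```python
    import math


    def tag_probability(attribute_probs):
        """Multiply per-attribute conditional probabilities into one tag probability."""
        return math.prod(attribute_probs)


    # Invented toy values for the tag PRO.POSS.06, one factor per level:
    # P(PRO | context), P(POSS | PRO, context), P(06 | PRO, POSS, context)
    factors = [0.5, 0.4, 0.25]
    print(tag_probability(factors))
    ```

    Factoring the tag this way means each level can be estimated from more data than the full fine-grained tag would offer on its own.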
    • 20. Training corpus
      • 45,000 tokens
      • manually annotated word forms
      • from two text types
      • Not balanced (25,000 tokens from a novel, 2 × 10,000 tokens from dissertations)
    • 21. Comparing taggers on manually annotated data
      • Tree-Tagger (Schmid 1994)
      • TnT Tagger (Brants 2000)
      • MBT Tagger (Daelemans et al. 2007)
      • RF-Tagger (Schmid and Laws 2008)
    • 22. Effects of size of training corpus
      • No more adding of training data necessary
    • 23. Effects of highly polysemous function words
      • Distribution problem
          • Probability guesses for scarce labels become unreliable
            • a :
              • PART (45) vs. CS.01 (1,182)
              • 91% incorrect labeling of PART.
      • Detailed discussion:
      • Handout: for -a-, refer to pages 2 and 4
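
    The imbalance behind the distribution problem can be made concrete with the two counts given on this slide. This is only a sketch: the other readings of a are not listed here, so the shares below are relative to these two labels alone, and `relative_frequencies` is a hypothetical helper.

    ```python
    # Counts for the token "a" as given on the slide (other readings omitted).
    counts = {"PART": 45, "CS.01": 1182}


    def relative_frequencies(counts):
        """Turn raw label counts into shares of the total."""
        total = sum(counts.values())
        return {label: n / total for label, n in counts.items()}


    for label, share in relative_frequencies(counts).items():
        print(f"{label}: {share:.1%}")
    ```

    Among these two labels, PART accounts for under 4% of the occurrences, which is the kind of scarce label for which probability estimates become unreliable.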
    • 24. Alternative proposal: hybrid taggers (Spoustová et al. 2007)
      • Combine rule-based tagging with statistical tagging
      • For Northern Sotho:
        • Contextual disambiguation works fine with RF-tagger if unambiguous indicators are available
        • Disambiguating macros (using the same indicators) hence have little effect
        • Ambiguous contexts hard to account for either way: need for parsing?
    • 25. Results: 10-fold cross validation
      • Without guessers (to simulate similar conditions for TnT and MBT)
        • RF-tagger: 91.00%
        • TnT tagger: 91.01%
        • MBT: 87.68%
      • With guessers (several thousand nouns and verbs part of the lexicon)
        • Tree-tagger: 92.46%
        • RF-tagger: 94.16%
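
    A 10-fold cross validation of this kind can be sketched as below. This is a generic sketch, not the evaluation script behind the figures above: how the corpus was actually split (by token, sentence or text) is not stated on the slides, and `k_fold_splits`, `accuracy`, `cross_validate`, `train_fn` and `tag_fn` are hypothetical names.

    ```python
    def k_fold_splits(items, k=10):
        """Yield (train, test) pairs; fold i tests on every k-th item starting at i."""
        for i in range(k):
            test = items[i::k]
            train = [x for j, x in enumerate(items) if j % k != i]
            yield train, test


    def accuracy(gold, predicted):
        """Share of positions where the predicted tag equals the gold tag."""
        correct = sum(g == p for g, p in zip(gold, predicted))
        return correct / len(gold)


    def cross_validate(sentences, train_fn, tag_fn, k=10):
        """Average per-token accuracy over k folds.

        sentences: list of sentences, each a list of (word, tag) pairs.
        train_fn(train_sentences) -> model
        tag_fn(model, words) -> list of predicted tags
        """
        scores = []
        for train, test in k_fold_splits(sentences, k):
            model = train_fn(train)
            gold = [tag for sent in test for _, tag in sent]
            pred = [t for sent in test for t in tag_fn(model, [w for w, _ in sent])]
            scores.append(accuracy(gold, pred))
        return sum(scores) / len(scores)
    ```

    Any tagger can be plugged in through `train_fn` and `tag_fn`, so the same folds can be reused to compare several taggers on equal terms.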
    • 26. Conclusions
      • Different intended uses lead to different tagsets (granularity, number of tags)
      • Including noun class information is essential for general linguistic research, e.g. grammar development, applications of chunking/parsing
      • RF-Tagger performs well for our layered tagset with the existing amount of training data (45,000 tokens), over 94% correct
      • Ambiguous contexts and sparse data problem combined lead to a high error rate for statistical parsing - not likely to be solvable with macros
        • Chunking / Parsing might lead to a more adequate solution for this problem
    • 27. Future work
      • Apply RF-tagger to the PSC corpus
      • Evaluate results
      • Instead of preprocessing rules, a partial postprocessing may make sense (e.g. chunking, parsing)
