Part-of-Speech Tagging of Northern Sotho: Disambiguating Polysemous Function Words

by Gertrud Faaß, Ulrich Heid, Elsabé Taljard and Danie Prinsloo

Part-of-Speech Tagging of Northern Sotho: Disambiguating Polysemous Function Words Presentation Transcript

  • 1. Part-of-Speech tagging of Northern Sotho: Disambiguating polysemous function words
    • Gertrud Faaß, Ulrich Heid, Elsabé Taljard, DJ Prinsloo
  • 2. This Talk
    • Prologue
    • Challenges for tagging Sotho texts
    • Objectives
    • Descriptive state of the art for tagging of Sotho texts
      • Tools
      • Tagsets
    • The ambiguity problem
    • Methodology
    • Results
    • Conclusions & future work
  • 3. Nine Official Bantu Languages of SA
    • Sotho Group
      • Northern Sotho / Sepedi
      • Tswana
      • Southern Sotho
    • Nguni Group
      • Zulu
      • Swati
      • Xhosa
      • Ndebele
    • Venda and Tsonga
  • 4.  
  • 5. Noun class system
      Cl.No      CP            Example
      1          mo-           mosadi ‘woman’
      2          ba-           basadi ‘women’
      1a         Ø-            malome ‘uncle’
      2b         bo-           bomalome ‘uncle & co’
      3          mo-           monwana ‘finger’
      4          me-           menwana ‘fingers’
      5          le-           lebone ‘light’
      6          ma-           mabone ‘lights’
      7          se-           selepe ‘axe’
      8          di-           dilepe ‘axes’
      9          N- / Ø-       mpša ‘dog’ / hlogo ‘head’
      10         diN- / di-    dimpša ‘dogs’ / dihlogo ‘heads’
      14         bo-           bodulo ‘residence’
      (6)        ma-           madulo ‘residences’
      15         go-           go ruta ‘to learn’
      16         fa-           fase ‘below’
      17         go-           godimo ‘above’
      18         mo-           morago ‘behind’
      N-         N- / Ø-       ntle ‘outside’ / pele ‘in front’
      (24) ga-   ga-           gare ‘middle’
  • 6. Concordial agreement – Northern Sotho
      • Taljard and Bosch (2005)
  • 7. Challenges for tagging
    • Ambiguity, for example:
      • function words:
      • -a- is 9-ways ambiguous, -go- up to 30 (11, 6, 5, …)-ways
    • Unknown words (N+V)
      • noun derivation:
      • toropo (town) -> toropong (in/at/to town)
      • verb derivation: next slides
  • 8. Challenges: unknown words
    • Agglutinating languages:
    • extensive use of affixes
      • Example: rekišeditšwe ‘was / were sold for’ < rek- ‘buy’ (verb root) + -iš- (causative) + -el- (applied) + -il- (past tense) + -w- (passive) + -e (inflectional ending)
  • 9.
    • ROOTetšane , ROOTetšanwa , ROOTetšanwe , ROOTiša , ROOTišitše , ROOTišwa , ROOTišitšwe , ROOTišana , ROOTišane , ROOTišanwa , ROOTišanwe , ROOTišega , ROOTišegile, ROOTišetša , ROOTišeditše , ROOTišetšwa , ROOTišeditšwe, ROOTišetšana, ROOTišetšane , ROOTišetšanwa , ROOTišetšanwe , ROOTišiša , ROOTišišitše , ROOTišišwa , ROOTišišitšwe , ROOTišišana , ROOTišišane , ROOTišišanwa , ROOTišišanwe , ROOToga , ROOTogile , ROOTogwa , ROOTogilwe , ROOTogana , ROOTogane , ROOToganwa , ROOToganwe , ROOTogela , ROOTogetše , ROOTogelwa , ROOTogetšwe , ROOTola, ROOTotše , ROOTolwa , ROOTotšwe , ROOTolana , ROOTolane , ROOTolanwa , ROOTolanwe , ROOTolega , ROOTolegile , ROOTolela, ROOToletše , ROOTolelwa , ROOToletšwe , ROOTolelana , ROOTolelane , ROOTolelanwa , ROOTolelanwe , ROOTolla, ROOTolotše , ROOTollwa , ROOTolotšwe , ROOTollana , ROOTollane , ROOTollanwa , ROOTollanwe , ROOTollega , ROOTollegile , ROOTollela, ROOTolletše , ROOTollelwa , ROOTolletšwe , ROOTollelana , ROOTollelane , ROOTollelanwa , ROOTollelanwe , ROOTolliša , ROOTollišitše , ROOTollišwa , ROOTollišitšwe , ROOTollišana , ROOTollišane , ROOTollišanwa , ROOTollišanwe , ROOTologa , ROOTologile , ROOTologana , ROOTologane , ROOTologanwa , ROOTologanwe , ROOTološa , ROOTološitše , ROOTološwa , ROOTološitšwe , ROOTološana , ROOTološane , ROOTološanwa , ROOTološanwe , ROOTološetša , ROOTološeditše , ROOTološetšwa , ROOTološeditšwe , ROOTološetšana , ROOTološetšane , ROOTološetšanwa , ROOTološetšanwe , ROOToša , ROOTošitše , ROOTošwa , ROOTošitšwe , ROOTošetša , ROOTošeditše , ROOTošetšwa , ROOTošeditšwe , ROOTošetšana , ROOTošetšane , ROOTošetšanwa , ROOTošetšanwe
    Examples of suffixes and combinations for a single verb
  • 10. Solution for unknown verbs and nouns
    • Verb guesser: detection of
      • longest match suffix combinations
      • occurrences in corpora
    • Noun guesser: matching of
      • singular/plural-forms
      • nominal suffixes
      • occurrences in corpora
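
The longest-match idea behind the verb guesser can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `guess_verb`, the `min_root_len` threshold, and the short `SUFFIXES` list (a handful of the combinations from the previous slide) are assumptions, and the check against corpus occurrences is left out.

```python
# Hypothetical, heavily shortened suffix inventory taken from the list of
# suffix combinations on the previous slide; a real guesser would use all of them.
SUFFIXES = [
    "iša", "išitše", "išwa", "išitšwe",
    "išetša", "išeditše", "išetšwa", "išeditšwe",
    "oga", "ogile",
]


def guess_verb(word, suffixes=SUFFIXES, min_root_len=2):
    """Return (root, suffix) for the longest suffix that matches, else None.

    min_root_len is an assumed safeguard against treating a whole word
    as a suffix; it is not taken from the slides.
    """
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            root = word[: len(word) - len(suffix)]
            if len(root) >= min_root_len:
                return root, suffix
    return None


if __name__ == "__main__":
    print(guess_verb("rekišeditšwe"))
    print(guess_verb("toropo"))
```

A candidate found this way would still need to be confirmed, for example by looking for other forms of the same root in a corpus, as the slide suggests.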
  • 11. Objectives
    • Tagging with a detailed tagset: class numbers
      • Nouns, adjectives, pronouns, concords, demonstratives
    • Disambiguation
    • Motivation: tagging used as preprocessing for:
      • Chunking, parsing
      • Lexicography (tagging relatively large corpora, e.g. PSC)
      • Detailed linguistic research (e.g. grammar development)
      • Information extraction
  • 12. State of the art for tagging: Sotho languages
    • Comparison of tagsets and tools is hardly possible:
      • Different applications of tagged material (linguistic description, lexicography, parsing, etc.)
      • Different number of tags
      • Differences in granularity
  • 13. Descriptive State of the Art: tagsets and tools
      Authors                           No. of tags   Noun class   Tool?
      Van Rooy and Pretorius (2003)     106           no           no
      De Schryver and De Pauw (2007)    56            no           yes
      Kotzé (several, e.g. 2008)        partial       no           yes
      Taljard et al. (2008)             141/262       yes          no
      This paper                        25/141        yes          yes
  • 14. Descriptive State of the Art for tagging: Sotho languages
    • Tools:
    • Full
      • De Schryver and de Pauw (2007) Northern Sotho tagger (statistical)
    • Partial
      • Kotzé (several publications, e.g. 2008) Verbal and nominal segment (finite state)
  • 15. Descriptive state of the art for tagging: Sotho languages
    • Applications of tagsets:
    • De Schryver and de Pauw (2007): used for lexicography
    • Van Rooy and Pretorius (2003): linguistic description of Setswana
    • Taljard et al. (2008): morphosyntactic and general linguistic description
  • 16. The ambiguity problem
    • -a-, -go-: see handout for possible readings
    • Local context may not identify the noun class of the subject concord:
        (Masogana) …   A      nwa     bjalwa
                       CS06   drink   beer
        (Young men) … “They drink beer.”
  • 17. The ambiguity problem: possible solutions
      • Dependent on objectives
        • Flat tagset ignoring irrelevant details (cf. handout for -go-)
        • Layered tagset: granularity
  • 18. Tagset (cf. Handout)
    • Level 1
      • Noun = (N)
      • Subject concord (CS), Object concord (CO)
      • Pronouns (PRO)
    • Level 2
      • emphatic (only for pronouns): EMP
      • possessive (likewise only for pronouns): POSS
    • Level 3
      • Classes -> N.01a, N.01, N.02, N.03, … , PERS, etc.
    • Example: noun of class 1 = N.01; possessive pronoun of class 6 = PRO.POSS.06
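
One way to work with such layered tags is to treat them as dot-separated strings and strip the class level when a coarser view is wanted. The sketch below assumes that class labels are either two digits with an optional letter (as in 01, 01a, 06) or PERS; the helper names `make_tag` and `drop_class` are hypothetical.

```python
import re

# Assumed shape of a class label: two digits plus an optional lowercase
# letter (01, 01a, 06, ...) or the label PERS.
CLASS_LABEL = re.compile(r"^(\d{2}[a-z]?|PERS)$")


def make_tag(*levels):
    """Join the non-empty levels of a tag with dots, e.g. PRO + POSS + 06."""
    return ".".join(level for level in levels if level)


def drop_class(tag):
    """Remove a trailing class label, leaving the coarser tag."""
    parts = tag.split(".")
    if len(parts) > 1 and CLASS_LABEL.match(parts[-1]):
        parts = parts[:-1]
    return ".".join(parts)


if __name__ == "__main__":
    print(make_tag("PRO", "POSS", "06"))   # PRO.POSS.06
    print(drop_class("PRO.POSS.06"))       # PRO.POSS
    print(drop_class("N.01a"))             # N
```

Keeping the levels in a single string means the same annotated corpus can be read at several granularities without re-tagging.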
  • 19. RF tagger technology (cf. Schmid and Laws 2008)
    • Hidden Markov Model (HMM) Tagger
    • Additional external lexicon
    • Large, fine-grained tagsets
    • Several levels of description:
    • e.g. German articles: ART.Definiteness.Case.Number.Gender
    • Calculates joint (product) probabilities
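
The "joint (product) probabilities" point can be illustrated with a toy calculation: the probability of a layered tag is obtained by multiplying one probability per level. The numbers below are invented purely for illustration, and the function `tag_probability` is a hypothetical helper, not part of RF-Tagger.

```python
import math


def tag_probability(level_probs):
    """Multiply the per-level probabilities of one layered tag."""
    return math.prod(level_probs)


if __name__ == "__main__":
    # Invented values for the three levels of PRO.POSS.06:
    # P(PRO | context), P(POSS | PRO, context), P(06 | PRO, POSS, context)
    levels = [0.5, 0.4, 0.25]
    print(tag_probability(levels))
```

Splitting the estimate by level is what makes large, fine-grained tagsets manageable: each factor is estimated over far fewer alternatives than the full set of tags.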
  • 20. Training corpus
    • 45,000 tokens
    • manually annotated word forms
    • from two text types
    • Not balanced
    • (25,000 tokens out of a novel,
    • 2 times 10,000 tokens out of dissertations)
  • 21. Comparing taggers on manually annotated data
    • Tree-Tagger (Schmid 1994)
    • TnT Tagger (Brants 2000)
    • MBT Tagger (Daelemans et al. 2007)
    • RF-Tagger (Schmid and Laws 2008)
  • 22. Effects of size of training corpus
    • Adding more training data is not necessary
  • 23. Effects of highly polysemous function words
    • Distribution problem
        • Probability guesses for scarce labels become unreliable
          • a :
            • PART (45) vs. CS.01 (1,182)
            • 91% incorrect labeling of PART.
    • Detailed discussion: see handout, pages 2 and 4, for -a-
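
The imbalance behind the distribution problem is easy to quantify from the two counts given on this slide. The sketch below only compares PART with CS.01; the other readings of a are not listed here, so the shares are relative to these two labels alone.

```python
# Counts for two readings of "a" as given on the slide.
counts = {"PART": 45, "CS.01": 1182}

total = sum(counts.values())
shares = {tag: n / total for tag, n in counts.items()}

if __name__ == "__main__":
    for tag, share in shares.items():
        print(f"{tag}: {share:.1%}")
```

With PART making up under 4% of these occurrences, a statistical tagger has little evidence for it and tends to prefer the frequent reading.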
  • 24. Alternative proposal: hybrid taggers (Spoustová et al. 2007)
    • Combine rule-based tagging with statistical tagging
    • For Northern Sotho:
      • Contextual disambiguation works fine with RF-tagger if unambiguous indicators are available
      • Disambiguating macros (using the same indicators) hence have little effect
      • Ambiguous contexts are hard to account for either way: need for parsing?
  • 25. Results: 10-fold cross validation
    • Without guessers (to simulate similar conditions for TnT and MBT)
      • RF-tagger: 91.00%
      • TnT tagger: 91.01%
      • MBT: 87.68%
    • With guessers (several thousand nouns and verbs are part of the lexicon):
      • Tree-tagger: 92.46%
      • RF-tagger: 94.16%
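
The evaluation scheme itself, 10-fold cross validation, can be sketched independently of any particular tagger. The example below is an assumption-laden illustration: it uses a most-frequent-tag baseline in place of the taggers compared in the talk, splits on tokens rather than sentences, and all function names and the toy tag V are hypothetical.

```python
from collections import Counter, defaultdict


def k_folds(items, k=10):
    """Split items into k interleaved folds (needs at least k items)."""
    return [items[i::k] for i in range(k)]


def train_baseline(tagged):
    """Train a most-frequent-tag baseline; unknown words get the overall top tag."""
    by_word = defaultdict(Counter)
    all_tags = Counter()
    for word, tag in tagged:
        by_word[word][tag] += 1
        all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in by_word.items()}
    return lambda word: lexicon.get(word, default)


def cross_validate(tagged, k=10):
    """Average token accuracy over k train/test splits."""
    folds = k_folds(tagged, k)
    accuracies = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        tagger = train_baseline(train)
        correct = sum(tagger(word) == tag for word, tag in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


if __name__ == "__main__":
    # Toy data only; not drawn from the 45,000-token training corpus.
    toy = [("a", "CS.01"), ("nwa", "V")] * 10
    print(cross_validate(toy))
```

Each token is used for testing exactly once, so the averaged accuracy reflects the whole annotated corpus rather than a single held-out sample.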
  • 26. Conclusions
    • Different intended uses lead to different tagsets (granularity, number of tags)
    • Including noun class information is essential for general linguistic research, e.g. grammar development, applications of chunking/parsing
    • RF-Tagger performs well for our layered tagset with the existing amount of training data (45,000 tokens): over 94% correct
    • Ambiguous contexts and sparse data problem combined lead to a high error rate for statistical parsing - not likely to be solvable with macros
      • Chunking / Parsing might lead to a more adequate solution for this problem
  • 27. Future work
    • Apply RF-tagger to the PSC corpus
    • Evaluate results
    • Instead of preprocessing rules, a partial postprocessing may make sense (e.g. chunking, parsing)