Part-of-Speech Tagging of Northern Sotho: Disambiguating Polysemous Function Words

by Gertrud Faaß, Ulrich Heid, Elsabé Taljard and Danie Prinsloo

Published in: Technology, Business
  • Part-of-Speech Tagging of Northern Sotho: Disambiguating Polysemous Function Words

    1. 1. Part-of-Speech tagging of Northern Sotho: Disambiguating polysemous function words
          Gertrud Faaß, Ulrich Heid, Elsabé Taljard, DJ Prinsloo
    2. 2. This Talk
          • Prologue
          • Challenges for tagging Sotho texts
          • Objectives
          • Descriptive state of the art for tagging of Sotho texts
              • Tools
              • Tagsets
          • The ambiguity problem
          • Methodology
          • Results
          • Conclusions & future work
    3. 3. Nine Official Bantu Languages of SA
          • Sotho Group
              • Northern Sotho / Sepedi
              • Tswana
              • Southern Sotho
          • Nguni Group
              • Zulu
              • Swati
              • Xhosa
              • Ndebele
          • Venda and Tsonga
    4. 5. Noun class system

          Cl.No     CP            Example
          1 / 2     mo- / ba-     mosadi ‘woman’ / basadi ‘women’
          1a / 2b   Ø- / bo-      malome ‘uncle’ / bomalome ‘uncle & co’
          3         mo-           monwana ‘finger’
          4         me-           menwana ‘fingers’
          5         le-           lebone ‘light’
          6         ma-           mabone ‘lights’
          7         se-           selepe ‘axe’
          8         di-           dilepe ‘axes’
          9         N- / Ø-       mpša ‘dog’ / hlogo ‘head’
          10        diN- / di-    dimpša ‘dogs’ / dihlogo ‘heads’
          14        bo-           bodulo ‘residence’
          (6)       ma-           madulo ‘residences’
          15        go-           go ruta ‘to learn’
          16        fa-           fase ‘below’
          17        go-           godimo ‘above’
          18        mo-           morago ‘behind’
          N-        N- / Ø-       ga-ntle ‘outside’ / pele ‘in front’
          (24)      ga-           gare ‘middle’
    5. 6. Concordial agreement – Northern Sotho
              • Taljard and Bosch (2005)
    6. 7. Challenges for tagging
          • Ambiguity, for example:
              • function words: -a- is 9-way ambiguous, -go- up to 30-way (11, 6, 5, …)
          • Unknown words (N+V)
              • noun derivation: toropo (town) -> toropong (in/at/to town)
              • verb derivation: next slides
    7. 8. Challenges: unknown words
          • Agglutinating languages: extensive use of affixes
              • Example: rekišeditšwe ‘was / were sold for’ < rek- ‘buy’ (verb root) + -iš- (causative) + -el- (applied) + -il- (past tense) + -w- (passive) + -e (inflectional ending)
    8. 9. Examples of suffixes and combinations for a single verb
          • ROOTetšane, ROOTetšanwa, ROOTetšanwe, ROOTiša, ROOTišitše, ROOTišwa, ROOTišitšwe, ROOTišana, ROOTišane, ROOTišanwa, ROOTišanwe, ROOTišega, ROOTišegile, ROOTišetša, ROOTišeditše, ROOTišetšwa, ROOTišeditšwe, ROOTišetšana, ROOTišetšane, ROOTišetšanwa, ROOTišetšanwe, ROOTišiša, ROOTišišitše, ROOTišišwa, ROOTišišitšwe, ROOTišišana, ROOTišišane, ROOTišišanwa, ROOTišišanwe, ROOToga, ROOTogile, ROOTogwa, ROOTogilwe, ROOTogana, ROOTogane, ROOToganwa, ROOToganwe, ROOTogela, ROOTogetše, ROOTogelwa, ROOTogetšwe, ROOTola, ROOTotše, ROOTolwa, ROOTotšwe, ROOTolana, ROOTolane, ROOTolanwa, ROOTolanwe, ROOTolega, ROOTolegile, ROOTolela, ROOToletše, ROOTolelwa, ROOToletšwe, ROOTolelana, ROOTolelane, ROOTolelanwa, ROOTolelanwe, ROOTolla, ROOTolotše, ROOTollwa, ROOTolotšwe, ROOTollana, ROOTollane, ROOTollanwa, ROOTollanwe, ROOTollega, ROOTollegile, ROOTollela, ROOTolletše, ROOTollelwa, ROOTolletšwe, ROOTollelana, ROOTollelane, ROOTollelanwa, ROOTollelanwe, ROOTolliša, ROOTollišitše, ROOTollišwa, ROOTollišitšwe, ROOTollišana, ROOTollišane, ROOTollišanwa, ROOTollišanwe, ROOTologa, ROOTologile, ROOTologana, ROOTologane, ROOTologanwa, ROOTologanwe, ROOTološa, ROOTološitše, ROOTološwa, ROOTološitšwe, ROOTološana, ROOTološane, ROOTološanwa, ROOTološanwe, ROOTološetša, ROOTološeditše, ROOTološetšwa, ROOTološeditšwe, ROOTološetšana, ROOTološetšane, ROOTološetšanwa, ROOTološetšanwe, ROOToša, ROOTošitše, ROOTošwa, ROOTošitšwe, ROOTošetša, ROOTošeditše, ROOTošetšwa, ROOTošeditšwe, ROOTošetšana, ROOTošetšane, ROOTošetšanwa, ROOTošetšanwe
    9. 10. Solution for unknown verbs and nouns
          • Verb guesser: detection of
              • longest-match suffix combinations
              • occurrences in corpora
          • Noun guesser: matching of
              • singular/plural forms
              • nominal suffixes
              • occurrences in corpora
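The longest-match idea behind the verb guesser can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `guess_verb`, the `min_root_len` threshold, and the small suffix sample (taken from the suffix combinations listed on the previous slide) are assumptions, and the corpus-occurrence check mentioned on the slide is left out.

```python
# Hypothetical sketch of a longest-match suffix guesser for unknown verbs.
# KNOWN_SUFFIXES is only a small sample of the suffix combinations on slide 9.
KNOWN_SUFFIXES = ["iša", "išitše", "išwa", "išitšwe", "išeditšwe", "ogile", "olela"]


def guess_verb(token, suffixes=KNOWN_SUFFIXES, min_root_len=2):
    """Return (root, suffix) for the longest suffix that matches, else None."""
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        root = token[: len(token) - len(suffix)]
        # Require a minimal root length to avoid treating a bare suffix as a verb.
        if token.endswith(suffix) and len(root) >= min_root_len:
            return root, suffix
    return None


print(guess_verb("rekišeditšwe"))  # splits off the root 'rek'
print(guess_verb("toropo"))        # no verbal suffix matches
```

A fuller guesser would presumably also confirm the candidate root against corpus occurrences, as the slide lists, before accepting the analysis.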
    10. 11. Objectives
          • Tagging with a detailed tagset: class numbers
              • Nouns, adjectives, pronouns, concords, demonstratives
          • Disambiguation
          • Motivation: tagging used as preprocessing for:
              • Chunking, parsing
              • Lexicography (tag relatively large corpora, e.g. PSC)
              • Detailed linguistic research (e.g. grammar development)
              • Information extraction
    11. 12. State of the art for tagging: Sotho languages
          • Comparison of tagsets and tools is hardly possible
              • Different applications of tagged material (linguistic description, lexicography, parsing, etc.)
              • Different number of tags
              • Differences in granularity
    12. 13. Descriptive State of the Art: tagsets and tools

          Authors                          No. of tags   Noun class yes/no   Tool?
          Van Rooy and Pretorius (2003)    106           no                  no
          De Schryver and De Pauw (2007)   56            no                  yes
          Kotzé (several, e.g. 2008)       partial       no                  yes
          Taljard et al. (2008)            141/262       yes                 no
          This paper                       25/141        yes                 yes
    13. 14. Descriptive State of the Art for tagging: Sotho languages
          • Tools:
          • Full
              • De Schryver and De Pauw (2007): Northern Sotho tagger (statistical)
          • Partial
              • Kotzé (several publications, e.g. 2008): verbal and nominal segment (finite state)
    14. 15. Descriptive state of the art for tagging: Sotho languages
          • Applications of tagsets:
          • De Schryver and De Pauw (2007): used for lexicography
          • Van Rooy and Pretorius (2003): linguistic description of Setswana
          • Taljard et al. (2008): morphosyntactic and general linguistic description
    15. 16. The ambiguity problem
          • -a-, -go-: see handout for possible readings
          • Local context may not identify the noun class of a subject concord:
                (Masogana) …   A      nwa     bjalwa
                               CS06   drink   beer
                (Young men) … “They drink beer.”
    16. 17. The ambiguity problem: possible solutions
              • Dependent on objectives
                  • Flat tagset ignoring irrelevant details (cf. handout for -go-)
                  • Layered tagset: granularity
    17. 18. Tagset (cf. handout)
          • Level 1
              • Noun (N)
              • Subject concord (CS), object concord (CO)
              • Pronouns (PRO)
          • Level 2
              • emphatic (only for pronouns): EMP
              • possessive (likewise only for pronouns): POSS
          • Level 3
              • Classes -> N.01a, N.01, N.02, N.03, … , PERS, etc.
          • Example: noun of class 1 = N.01; possessive pronoun of class 6 = PRO.POSS.06
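One way to picture the layered tagset is as a dot-separated label that can be split into its levels or truncated to a coarser, flatter tag. The sketch below is only an illustration under stated assumptions: the names `LayeredTag`, `parse_tag` and `coarsen` are hypothetical, and the set of level-2 labels is limited to the two shown on the slide (EMP, POSS).

```python
from typing import NamedTuple, Optional


class LayeredTag(NamedTuple):
    pos: str                   # level 1, e.g. N, CS, CO, PRO
    subtype: Optional[str]     # level 2, e.g. EMP or POSS (pronouns only)
    noun_class: Optional[str]  # level 3, e.g. 01, 01a, 06, PERS


# Assumption: only the level-2 labels named on the slide are listed here.
LEVEL2_LABELS = {"EMP", "POSS"}


def parse_tag(tag: str) -> LayeredTag:
    """Split a label such as 'PRO.POSS.06' into its three levels."""
    parts = tag.split(".")
    pos, rest = parts[0], parts[1:]
    subtype = None
    if rest and rest[0] in LEVEL2_LABELS:
        subtype, rest = rest[0], rest[1:]
    noun_class = rest[0] if rest else None
    return LayeredTag(pos, subtype, noun_class)


def coarsen(tag: str, levels: int) -> str:
    """Keep only the first `levels` layers, giving a coarser tag."""
    parsed = parse_tag(tag)
    layers = [parsed.pos, parsed.subtype, parsed.noun_class][:levels]
    return ".".join(layer for layer in layers if layer is not None)


print(parse_tag("N.01"))
print(coarsen("PRO.POSS.06", 2))
```

Truncating to level 1 corresponds to ignoring class detail entirely, which is the trade-off between a flat and a layered tagset discussed on the preceding slide.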
    18. 19. RF tagger technology (cf. Schmid and Laws 2008)
          • Hidden Markov Model (HMM) tagger
          • Additional external lexicon
          • Large, fine-grained tagsets
          • Several levels of description:
              • e.g. German articles: ART.Definiteness.Case.Number.Gender
          • Calculates joint (product) probabilities
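The "joint (product) probabilities" point can be illustrated with a toy calculation: the probability of a fine-grained tag is obtained by multiplying the probabilities of its individual attributes. The sketch is a simplification; the function name `tag_probability` and all numbers are invented for illustration, and how the tagger actually estimates each attribute probability from context is not shown.

```python
import math


def tag_probability(attribute_probs):
    """Multiply the per-attribute probabilities of one layered tag."""
    return math.prod(attribute_probs.values())


# Invented example values for a German article tag ART.Def.Nom.Sg.Masc,
# one probability per level of description.
article = {"ART": 0.9, "Def": 0.8, "Nom": 0.5, "Sg": 0.9, "Masc": 0.5}
print(tag_probability(article))
```

Because the product is taken over attributes rather than over whole tags, a rare full tag can still receive a usable estimate as long as its individual attribute values are well attested.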
    19. 20. Training corpus
          • 45,000 tokens
          • manually annotated word forms
          • from two text types
          • Not balanced (25,000 tokens from a novel, two times 10,000 tokens from dissertations)
    20. 21. Comparing taggers on manually annotated data
          • Tree-Tagger (Schmid 1994)
          • TnT Tagger (Brants 2000)
          • MBT Tagger (Daelemans et al. 2007)
          • RF-Tagger (Schmid and Laws 2008)
    21. 22. Effects of size of training corpus
          • No further training data needs to be added
    22. 23. Effects of highly polysemous function words
          • Distribution problem
              • Probability guesses for scarce labels become unreliable
                  • a: PART (45) vs. CS.01 (1,182)
                  • 91% incorrect labeling of PART
          • Detailed discussion: see handout, -a-, pages 2 and 4
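The imbalance behind the distribution problem can be made concrete with the two counts given for a. The sketch below computes relative frequencies over just these two readings; the other readings of a are not listed on the slide and are therefore left out, and the function name `relative_frequencies` is a hypothetical helper.

```python
# Counts for the token 'a' as given on the slide; other readings are omitted.
counts = {"CS.01": 1182, "PART": 45}


def relative_frequencies(counts):
    """Turn raw label counts into relative frequencies."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


freqs = relative_frequencies(counts)
for tag, share in freqs.items():
    print(f"{tag}: {share:.1%}")
```

Between these two readings, PART accounts for under 4% of the occurrences, so a statistical tagger has very little evidence for it and tends to prefer the frequent label.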
    23. 24. Alternative proposal: hybrid taggers (Spoustová et al. 2007)
          • Combine rule-based tagging with statistical tagging
          • For Northern Sotho:
              • Contextual disambiguation works fine with RF-tagger if unambiguous indicators are available
              • Disambiguating macros (using the same indicators) hence have little effect
              • Ambiguous contexts are hard to account for either way: need for parsing?
    24. 25. Results: 10-fold cross validation
          • Without guessers (to simulate similar conditions for TnT and MBT)
              • RF-tagger: 91.00%
              • TnT tagger: 91.01%
              • MBT: 87.68%
          • With guessers (several thousand nouns and verbs part of the lexicon)
              • Tree-tagger: 92.46%
              • RF-tagger: 94.16%
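The evaluation scheme itself, 10-fold cross validation, can be sketched independently of any particular tagger. The code below is a generic illustration, not the authors' setup: the most-frequent-tag baseline, the toy sentences, the tag "V" and all function names are assumptions made for the example.

```python
from collections import Counter, defaultdict


def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; fold i holds out every k-th item from i."""
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def train_most_frequent(sentences):
    """Baseline model: remember the most frequent tag of each word form."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_most_frequent(model, words, default="N"):
    """Tag known words with their stored tag, unknown words with `default`."""
    return [model.get(word, default) for word in words]


def cross_validate(sentences, train_fn, tag_fn, k=10):
    """Return token accuracy averaged over the k folds."""
    accuracies = []
    for train, test in k_fold_splits(sentences, k):
        model = train_fn(train)
        correct = total = 0
        for sentence in test:
            words = [word for word, _ in sentence]
            gold = [tag for _, tag in sentence]
            predicted = tag_fn(model, words)
            correct += sum(p == g for p, g in zip(predicted, gold))
            total += len(gold)
        if total:  # skip empty folds instead of dividing by zero
            accuracies.append(correct / total)
    return sum(accuracies) / len(accuracies)


# Toy data: 20 copies of one hand-tagged sentence (tags are illustrative).
toy = [[("a", "CS.01"), ("nwa", "V"), ("bjalwa", "N.14")] for _ in range(20)]
print(cross_validate(toy, train_most_frequent, tag_most_frequent))
```

Any tagger that offers a train step and a tag step could be plugged in through `train_fn` and `tag_fn`, which is what makes per-fold accuracies comparable across tools.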
    25. 26. Conclusions
          • Different intended uses lead to different tagsets (granularity, number of tags)
          • Including noun class information is essential for general linguistic research, e.g. grammar development, applications of chunking/parsing
          • RF-Tagger performs well for our layered tagset with the existing amount of training data (45,000 tokens): over 94% correct
          • Ambiguous contexts and the sparse data problem combined lead to a high error rate for statistical parsing, which is not likely to be solvable with macros
              • Chunking / parsing might lead to a more adequate solution for this problem
    26. 27. Future work
          • Apply RF-tagger to the PSC corpus
          • Evaluate results
          • Instead of preprocessing rules, partial postprocessing may make sense (e.g. chunking, parsing)
