A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging
Guy De Pauw (guy@aflat.org)
Naomi Maajabu (naomi@aflat.org)
Peter Waiganjo Wagacha (waiganjo@aflat.org)

Outline
  • Resource-scarce language engineering
  • The case of Luo (Dholuo)
  • A trilingual parallel corpus English – Swahili – Luo
  • Machine Translation experiments
  • Projection of Annotation experiments
  • Conclusion & Future Work

Resource-Scarce Languages
  • Limited financial, political, … resources
  • Few digital resources: digital lexicons, corpora
  • Bottleneck of linguistic expertise (in LT)
  • Two approaches:
      • Rule-based approaches
          • Advantages: meticulous design, linguistically relevant
      • Corpus-based, data-driven approaches
          • Growing importance and availability of digital text material
          • Advantages: performance models, fast development, automatic quantitative evaluation

Dholuo
  • Western Nilotic language
  • Spoken by more than 3 million Luo people
  • Kenya, Uganda, Tanzania
  • No official dialect
  • Tonal, but tone is not marked in the orthography
  • Latin alphabet, no diacritics
  • Resource-scarce (not an official language)
  • Web-mined corpus of 200k words: De Pauw, Wagacha & Abade (2007) Unsupervised Induction of Dholuo Word Classes using Maximum Entropy Learning
  [Map; country labels: UG, KE, DRC, RW, BU, TZ]

Most famous Luo
  • Nilotic language
  • Spoken by more than 3 million people in Kenya, Tanzania and Uganda
  • Not an official language
  • Latin script
  • Most famous Luo:

SAWA Corpus
  • Reference: AfLaT 2009 / LRE (submitted)
  • 2 million word parallel corpus English – Swahili
  • Competitive machine translation results
  • Projection of annotation of part-of-speech tags from English into Swahili is viable
  • But what about truly resource-scarce languages?

Parallel Data for Luo
  • International Bible Society (2005) Luo New Testament. Available at http://www.biblica.com/bibles/luo
  • Use English and Swahili New Testament data of SAWA corpus to construct small trilingual parallel corpus
  • Preprocessing:
      • PDF-to-text conversion (Koolwire.com)
      • Tokenization
      • Sentence alignment: R.C. Moore (2002) Fast and accurate sentence alignment of bilingual corpora.
  • Sample of the Luo text:
      Luk 1:1 Jimang'enyosebedo ka chanoweche mane otimore e dierwa ,
      Luk 1:2 Mana kaka nochiwgi ne wan kodjogomotelo mane jonenowang'giwang kendo jotichwach .
      Luk 1:3 Kuommano , an bende kaka asenonotiendwechegimalong'onyaka a chakruok , en gimaberbendemondoandikni e yomochanoremaler , in mulourTheofilo .
      Luk 1:4 Mondoing'eadier mar gikmosepwonji .
      Luk 1:5 E ndalo ma Heroderuodh Judea , ne nitiejadolo ma nyingeZakaria , ma ne en achielkuomogandajodolomagAbija ; chiege Elizabeth bende ne nyardhoodHarun .
      Luk 1:6 Gidutojariyo ne gin jomakarenyimNyasaye , ne giritochikemadongogimatindomagRuothNyasaye , maongeketho .
      Luk 1:7 To ne giongeginyithindo , nikech Elizabeth ne migumba , kendo gin jariyogohikgi : nose niang'
      Luk 1:8 Chieng' morokaneogandagiZakaria ne ni e tich to notiyo kaka jadolo e nyimNyasaye ,
      Luk 1:9 Noyieregiombulu kaka chik mar jodolo , mondoodonjieihekalu mar Ruoth kendo owangubani .
      Luk 1:10 To ka sa mar wang'oubaniochopo , jolemodutonochokoreoko kendo negilamo .
      Luk 1:11 Ekamalaika mar RuothNyasayenofwenyorene , kochungo bath kendo mar ubanikorachwich .
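
The tokenization step listed above can be illustrated with a minimal sketch. The slides do not say which tokenizer was used, so the rule below is an assumption: it splits punctuation from words but keeps apostrophes inside tokens, because the sample verses contain forms such as "Chieng'". The function names `tokenize` and `split_verse` are hypothetical.

```python
import re

# Assumption: a word is a run of letters, digits or apostrophes; any other
# non-space character becomes its own token. This is illustrative only and
# is not the tokenizer used for the corpus described in the slides.
TOKEN_PATTERN = re.compile(r"[\w']+|[^\w\s]")


def tokenize(text):
    """Split a line of text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def split_verse(line):
    """Split a line such as 'Luk 1:4 ...' into (verse id, token list)."""
    book, reference, text = line.split(" ", 2)
    return f"{book} {reference}", tokenize(text)


if __name__ == "__main__":
    verse_id, tokens = split_verse("Luk 1:4 Mondoing'eadier mar gikmosepwonji .")
    print(verse_id, tokens)
```

Keeping the verse reference as an identifier makes it easy to look up the same verse in the English and Swahili texts, although the slides name Moore's (2002) aligner for the actual sentence alignment.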

Trilingual corpus
  • 80% training set
  • 10% validation set
  • 10% test set (partly annotated for POS tags)
  • Tiny, register-specific parallel corpus
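
A minimal sketch of an 80/10/10 split over aligned units follows. The slides give only the proportions; whether the data was shuffled and how rounding was handled are assumptions here, and `split_corpus` is a hypothetical helper name.

```python
import random


def split_corpus(units, seed=0):
    """Split aligned units into 80% train, 10% validation, 10% test.

    Assumption: units are shuffled with a fixed seed before splitting;
    the slides do not say how the original split was drawn.
    """
    units = list(units)
    random.Random(seed).shuffle(units)
    n_train = len(units) * 8 // 10   # integer arithmetic avoids float rounding
    n_valid = len(units) // 10
    train = units[:n_train]
    valid = units[n_train:n_train + n_valid]
    test = units[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = split_corpus(range(100))
    print(len(train), len(valid), len(test))
```

Each unit would typically be one aligned English, Swahili and Luo triple, so that the three languages stay in step across the splits.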

Word alignment
  • Goal: word-aligned corpus
  • Misconception: morphologically rich languages cannot be used in statistical machine translation, since word alignment is word-based
  • Example: English "I have turned him down" aligns with the single Swahili word "nimemkatalia"

Factored Data
  • Word alignment and language modeling can be enhanced by using factored data
  • General idea: use extra annotation layers (part-of-speech tagging, lemmatization) to aid discovery of possible translation pairs
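
As a rough illustration of the extra annotation layers, the sketch below joins each token with its lemma and part-of-speech tag using the `|` separator that Moses expects for factored input. The lemmas and tags in the example are invented for illustration and do not come from the corpus; `to_factored` is a hypothetical helper.

```python
def to_factored(tokens, lemmas, tags):
    """Join parallel annotation layers into one factored line (word|lemma|pos)."""
    if not (len(tokens) == len(lemmas) == len(tags)):
        raise ValueError("all annotation layers must have the same length")
    return " ".join("|".join(factors) for factors in zip(tokens, lemmas, tags))


if __name__ == "__main__":
    # Hypothetical annotation for the English side of the slide's example.
    tokens = ["I", "have", "turned", "him", "down"]
    lemmas = ["i", "have", "turn", "he", "down"]
    tags = ["PRP", "VBP", "VBN", "PRP", "RP"]
    print(to_factored(tokens, lemmas, tags))
```

With lemmas available as a separate factor, inflected forms of the same word can share alignment evidence, which is the intuition behind the improvement the slides report.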

Machine Translation Experiments
  • English → Luo and Swahili → Luo
  • Use standard SMT tool MOSES (Koehn et al. 2007)
      • Phrase-based machine translation
      • Can handle factored data
  • Uses SRILM language modeling tool (Stolcke 2002)
      • English: Gigaword corpus
      • Swahili: TshwaneDJe Kiswahili Internet Corpus
      • Luo: 200k Luo corpus + training/evaluation set of New Testament data

Results
  • OOV: percentage of out-of-vocabulary words (i.e. words unknown to the language model)
  • BLEU: Bilingual Evaluation Understudy (calculates n-gram overlap between reference and machine translation)
  • NIST: modification of BLEU, taking into account the information value of n-grams
  • BLEU & NIST attempt to optimize correlation with human evaluation
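
The two simplest quantities above can be sketched as follows. `oov_rate` computes the percentage of tokens missing from a vocabulary, and `modified_precision` computes the clipped n-gram overlap that BLEU builds on. This is only the overlap component: full BLEU also combines several n-gram orders and applies a brevity penalty, and NIST adds information weights. Both function names are hypothetical.

```python
from collections import Counter


def oov_rate(tokens, vocabulary):
    """Percentage of tokens that do not occur in the given vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return 100.0 * unknown / len(tokens)


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference, so repeated words cannot inflate the score.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


if __name__ == "__main__":
    print(oov_rate(["a", "b", "c", "d"], {"a", "b"}))
    print(modified_precision(["the", "the", "cat"], ["the", "cat", "sat"], 1))
```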

SMT Experiments
  • Translation dictionary did not significantly improve results, but factored data did

Examples

Projection of annotation
  • Use word alignment to bootstrap annotation in a resource-scarce language
  • Project part-of-speech tags from resource-rich(er) language
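
The projection idea can be sketched in a few lines. The slides do not describe how conflicting or missing alignment links were handled, so two assumptions are made here: when several source words align to one target word, the most frequent source tag is chosen, and unaligned target words receive a placeholder tag. The name `project_tags` and the `UNK` placeholder are hypothetical.

```python
from collections import Counter


def project_tags(source_tags, alignment, target_length, default="UNK"):
    """Project source-side POS tags onto target words via alignment links.

    alignment is a list of (source_index, target_index) pairs.
    Assumption: majority vote among the tags linked to each target word.
    """
    candidates = [[] for _ in range(target_length)]
    for src_idx, tgt_idx in alignment:
        candidates[tgt_idx].append(source_tags[src_idx])
    projected = []
    for tags in candidates:
        if tags:
            projected.append(Counter(tags).most_common(1)[0][0])
        else:
            projected.append(default)  # no alignment link reached this word
    return projected


if __name__ == "__main__":
    # Hypothetical three-word source sentence and three-word target sentence.
    source_tags = ["DT", "NN", "VBZ"]
    alignment = [(1, 0), (2, 1)]
    print(project_tags(source_tags, alignment, 3))
```

Tags projected this way are noisy, so they are usually treated as a bootstrap for further annotation rather than as gold data.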
