Knowledge-Light Luo Machine Translation and POS Tagging
1. A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging Guy De Pauw (guy@aflat.org), Naomi Maajabu (naomi@aflat.org), Peter Waiganjo Wagacha (waiganjo@aflat.org)
2. Outline Resource-scarce language engineering The case of Luo (Dholuo) A trilingual parallel corpus English – Swahili – Luo Machine Translation experiments Projection of Annotation experiments Conclusion & Future Work
3. Resource-Scarce Languages Limited financial, political, … resources Few digital resources: digital lexicons, corpora Bottleneck of linguistic expertise (in LT) Two approaches: Rule-based approaches Advantages: meticulous design, linguistically relevant Corpus-based, data-driven approaches Growing importance and availability of digital text material Advantages: performance models, fast development, automatic quantitative evaluation
4. Dholuo [map of East Africa: UG, KE, TZ, RW, BU, DRC] Western Nilotic language Spoken by +3M Luo people in Kenya, Uganda and Tanzania No official dialect Tonal, but not marked in orthography Latin alphabet, no diacritics Resource-scarce (not an official language) Web-mined corpus of 200k words De Pauw, Wagacha & Abade (2007) Unsupervised Induction of Dholuo Word Classes using Maximum Entropy Learning
5. Luo Nilotic language Spoken by +3M people in Kenya, Tanzania and Uganda Not an official language Latin script Most famous Luo: [photo]
6. AfLaT 2009 / LRE (submitted) SAWA CORPUS 2 million word parallel corpus English – Swahili Competitive machine translation results Projection of annotation of part-of-speech tags from English into Swahili is viable But what about true resource-scarce languages?
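The projection-of-annotation idea on this slide can be sketched as follows: POS tags assigned to English words are copied onto the target-language words they are word-aligned to. This is a minimal illustration, not the authors' actual pipeline; the function name, alignment format, and tagset are assumptions for the example.

```python
# Sketch of projection of annotation across a word-aligned sentence
# pair: copy each source token's POS tag to its aligned target token.
# Function name and data format are illustrative, not from the paper.

def project_tags(src_tags, alignment, tgt_len):
    """src_tags: one POS tag per source token.
    alignment: list of (src_index, tgt_index) word-alignment links.
    Returns one projected tag per target token (None if unaligned)."""
    projected = [None] * tgt_len
    for src_i, tgt_j in alignment:
        projected[tgt_j] = src_tags[src_i]
    return projected

# English "the child sees" tagged DET NOUN VERB, aligned to a
# hypothetical two-token target sentence (no article, verb inflected):
tags = ["DET", "NOUN", "VERB"]
alignment = [(1, 0), (2, 1)]  # child -> token 0, sees -> token 1
print(project_tags(tags, alignment, 2))  # ['NOUN', 'VERB']
```

Unaligned target tokens keep `None` and would need a fallback (e.g. a default tag or a smoothing step) before the projected tags can train a supervised tagger.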
9. Parallel Data for Luo International Bible Society (2005) Luo New Testament. Available at http://www.biblica.com/bibles/luo Use English and Swahili New Testament data of the SAWA corpus to construct a small trilingual parallel corpus Preprocessing: PDF-to-text conversion (Koolwire.com) Tokenization Sentence alignment Sample (Luk 1:1–1:11 shown; word boundaries partly lost in PDF extraction): Luk 1:1 "Jimang'enyosebedo ka chanoweche mane otimore e dierwa , …"
10. Parallel Data for Luo International Bible Society (2005) Luo New Testament. Available at http://www.biblica.com/bibles/luo Use English and Swahili New Testament data of the SAWA corpus to construct a small trilingual parallel corpus Preprocessing: PDF-to-text conversion (Koolwire.com) Tokenization Sentence alignment: R. C. Moore (2002) Fast and accurate sentence alignment of bilingual corpora.
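Moore's (2002) aligner combines a length-based first pass with an IBM Model 1 lexical stage. The length-based intuition can be sketched with a toy dynamic program over 1-1, 1-0 and 0-1 alignments that scores each candidate pair by how similar the sentence lengths are; this is a didactic simplification, not Moore's actual model, and the example strings and cost constants are illustrative.

```python
# Toy length-based sentence aligner in the spirit of the first pass of
# Moore (2002): DP over 1-1 (match), 1-0 and 0-1 (skip) moves, scoring
# a 1-1 pair by the difference in character length. Illustrative only.

def align(src, tgt, skip_cost=4.0):
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1: align src[i] with tgt[j]
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j])) / 10.0
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = c
                    back[i + 1][j + 1] = (i, j, "1-1")
            if i < n and cost[i][j] + skip_cost < cost[i + 1][j]:  # 1-0
                cost[i + 1][j] = cost[i][j] + skip_cost
                back[i + 1][j] = (i, j, "1-0")
            if j < m and cost[i][j] + skip_cost < cost[i][j + 1]:  # 0-1
                cost[i][j + 1] = cost[i][j] + skip_cost
                back[i][j + 1] = (i, j, "0-1")
    pairs, i, j = [], n, m          # backtrace the cheapest path
    while back[i][j]:
        pi, pj, kind = back[i][j]
        if kind == "1-1":
            pairs.append((pi, pj))
        i, j = pi, pj
    return list(reversed(pairs))

en = ["In the beginning", "God created the heavens"]
luo = ["Kar chakruok", "Nyasaye nochweyo polo"]  # illustrative strings
print(align(en, luo))  # [(0, 0), (1, 1)]
```

Moore's method then retrains on the high-confidence 1-1 pairs from this pass to build a word-translation model, which is what makes it accurate on noisy parallel text.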
21. Machine Translation Experiments English → Luo and Swahili → Luo Use standard SMT tool MOSES (Koehn et al. 2007) Phrase-based machine translation Can handle factored data Uses SRILM language modeling tool (Stolcke 2002) Language model data: English: Gigaword corpus Swahili: TshwaneDJe Kiswahili Internet Corpus Luo: 200k Luo corpus + training/evaluation set of New Testament data
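SRILM's role in this setup is to estimate an n-gram language model of the target language from monolingual text. A toy bigram model with add-one smoothing shows the idea; SRILM itself uses far more refined smoothing (e.g. modified Kneser-Ney), and the training sentences here are illustrative, not corpus data.

```python
from collections import Counter

# Toy bigram language model with add-one (Laplace) smoothing,
# illustrating what an n-gram LM tool contributes to SMT decoding.
# Training data and smoothing choice are illustrative assumptions.

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def prob(w_prev, w, unigrams, bigrams):
    """Add-one smoothed P(w | w_prev); V approximated by vocab size."""
    V = len(unigrams)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

uni, bi = train_bigram(["nyasaye ohero ji", "ji ohero nyasaye"])
# A bigram seen in training scores higher than an unseen one:
print(prob("ohero", "ji", uni, bi) > prob("ohero", "polo", uni, bi))  # True
```

During decoding, MOSES combines such language-model scores with the phrase-translation scores to prefer hypotheses that are both faithful and fluent.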
22. Results OOV: percentage of out-of-vocabulary words (i.e. words unknown to the language model) BLEU: Bilingual Evaluation Understudy (calculates n-gram overlap between reference and machine translation) NIST: modification of BLEU, taking into account information value of n-grams BLEU & NIST attempt to optimize correlation with human evaluation
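The BLEU metric described on this slide can be made concrete with a minimal single-reference implementation: clipped n-gram precision combined with a brevity penalty. Real evaluations aggregate counts over the whole test corpus and use n up to 4 (often with smoothing); this sketch uses sentence-level bigram BLEU for clarity, and the example strings are illustrative.

```python
import math
from collections import Counter

# Minimal single-reference BLEU: modified (clipped) n-gram precision
# with a brevity penalty. Sentence-level, bigram-only toy version;
# standard BLEU is corpus-level with n up to 4.

def ngrams(tokens, n):
    return Counter(zip(*(tokens[i:] for i in range(n))))

def bleu(reference, hypothesis, max_n=2):
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        ref_ngrams = ngrams(ref, n)
        # Clip each hypothesis n-gram count at its count in the reference
        clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / total) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)

ref = "wuoyi ohero nyasaye"
print(bleu(ref, ref))                  # 1.0 for a perfect match
print(bleu(ref, "wuoyi ohero") < 1.0)  # shorter hypothesis penalized: True
```

NIST differs mainly in weighting each matched n-gram by its information value, so rare, informative n-grams count for more than frequent function-word sequences.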
35. Conclusion First proof-of-principle experiments (machine translation, projection of annotation) for a Nilotic language “If you have a digital bible, you have an MT system and other NLP components” Small register-specific parallel corpus English – Swahili – Luo Modest but encouraging BLEU & NIST scores SMT for Luo is possible No alternatives (cf. other African languages?) Factored data can overcome limitations of pure word-based methods for word alignment Morphological generation on the target-language side is still a bottleneck
36. Future Work Make trilingual, annotated corpus available through Open-Content Text Corpus (SAWA corpus will be made available soon as well) Use translation model as seed for bilingual web mining Tweak & tune MOSES parameters to improve quality Better morphological analysis, generation for Dholuo Unsupervised morphology induction Use automatically induced annotation as training data for supervised data-driven taggers Repeat experiment for other resource-scarce languages