Knowledge-Light Luo Machine Translation and POS Tagging

A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging Guy De Pauw(guy@aflat.org) Naomi Maajabu (naomi@aflat.org) Peter WaiganjoWagacha (waiganjo@aflat.org)

Outline Resource-scarce language engineering The case of Luo (Dholuo) A trilingual parallel corpus English – Swahili – Luo Machine Translation experiments Projection of Annotation experiments Conclusion & Future Work

Resource-Scarce Languages Limited financial, political, … resources Few digital resources: digital lexicons, corpora Bottleneck of linguistic expertise (in LT) Two approaches: Rule-based approaches Advantages: meticulous design, linguistically relevant Corpus-based, data-driven approaches Growing importance and availability of digital text material Advantages: performance models, fast development, automatic quantitative evaluation

Dholuo UG KE DRC RW BU Western Nilotic language Spoken by +3M Luo people Kenya, Uganda, Tanzania No official dialect Tonal, but not marked in orthography Latin alphabet, no diacritics Resource-scarce (not official language) Web-mined corpus of 200k words De Pauw, Wagacha & Abade (2007) Unsupervised Induction of Dholuo Word Classes using Maximum Entropy Learning TZ

Most famous Luo Nilotic language Spoken by +3M people in Kenya, Tanzania and Uganda Not an official language Latin script Most famous Luo:

AfLaT 2009 / LRE (submitted) SAWA CORPUS 2 million word parallel corpus English – Swahili Competitive machine translation results Projection of annotation of part-of-speech tags from English into Swahili is viable But what about true resource-scarce languages?

Parallel Data for Luo International Bible Society (2005) Luo New Testament. Available at http://www.biblica.com/bibles/luo Use English and Swahili New Testament data of SAWA corpus to construct small trilingual parallel corpus Preprocessing: Pdftext conversion Tokenization Sentence alignment

Parallel Data for Luo International Bible Society (2005) Luo New Testament. Available at http://www.biblica.com/bibles/luo Use English and Swahili New Testament data of SAWA corpus to construct small trilingual parallel corpus Preprocessing: Pdftext conversion Koolwire.com Tokenization Sentence alignment

Parallel Data for Luo International Bible Society (2005) Luo New Testament. Available at http://www.biblica.com/bibles/luo Use English and Swahili New Testament data of SAWA corpus to construct small trilingual parallel corpus Preprocessing: Pdftext conversion Koolwire.com Tokenization Sentence alignment Luk 1:1 Jimang'enyosebedo ka chanoweche mane otimore e dierwa , Luk 1:2 Mana kaka nochiwgi ne wan kodjogomotelo mane jonenowang'giwang kendo jotichwach . Luk 1:3 Kuommano , an bende kaka asenonotiendwechegimalong'onyaka a chakruok , en gimaberbendemondoandikni e yomochanoremaler , in mulourTheofilo . Luk 1:4 Mondoing'eadier mar gikmosepwonji . Luk 1:5 E ndalo ma Heroderuodh Judea , ne nitiejadolo ma nyingeZakaria , ma ne en achielkuomogandajodolomagAbija ; chiege Elizabeth bende ne nyardhoodHarun . Luk 1:6 Gidutojariyo ne gin jomakarenyimNyasaye , ne giritochikemadongogimatindomagRuothNyasaye , maongeketho . Luk 1:7 To ne giongeginyithindo , nikech Elizabeth ne migumba , kendo gin jariyogohikgi : nose niang' Luk 1:8 Chieng' morokaneogandagiZakaria ne ni e tich to notiyo kaka jadolo e nyimNyasaye , Luk 1:9 Noyieregiombulu kaka chik mar jodolo , mondoodonjieihekalu mar Ruoth kendo owangubani . Luk 1:10 To ka sa mar wang'oubaniochopo , jolemodutonochokoreoko kendo negilamo . Luk 1:11 Ekamalaika mar RuothNyasayenofwenyorene , kochungo bath kendo mar ubanikorachwich .

Parallel Data for Luo International Bible Society (2005) Luo New Testament. Available at http://www.biblica.com/bibles/luo Use English and Swahili New Testament data of SAWA corpus to construct small trilingual parallel corpus Preprocessing: Pdftext conversion Koolwire.com Tokenization Sentence alignment R.C.Moore (2002) Fast and accurate sentence alignment of bilingual corpora.

Trilingual corpus ,[object Object]

10% test set (partly annotated for pos-tags)Tiny, register-specific parallel corpus

Word alingment ,[object Object]

Misconception:morphologically rich languages cannot be used in statistical machine translation, since word-alignment is word-basedhave turned him down I nimemkatalia

Factored Data ,[object Object]

Misconception: morphologically rich languages cannot be used in statistical machine translation, since it is word-based

Word alignment and language modeling can be enhanced by using factored data

General idea: use extra annotation layers (part-of-speech tagging, lemmatization) to aid discovery of possible translation pairs,[object Object]

Machine Translation Experiments English  Luo and Swahili  Luo Use standard SMT tool MOSES (Koehn et al 2007) Phrase-based machine translation Can handle factored data Uses SRILM language modeling tool (Stolcke 2002) English: Gigaword corpus Swahili: TshwaneDJe Kiswahili Internet Corpus Luo: 200k Luo corpus + Training/Evaluation Set of New Testament data

Results OOV: percentage of out-of-vocabulary words (i.e. words unknown to the language model) BLEU: Bilingual Evaluation Understudy (calculates n-gram overlap between reference and machine translation) NIST: modification of BLEU, taking into account information value of n-grams BLEU & NIST attempt to optimize correlation with human evaluation

SMT Experiments ,[object Object],[object Object]

Projection of annotation ,[object Object]

Project part-of-speech tags from resource-rich(er) language

Knowledge-Light Luo Machine Translation and POS Tagging

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Knowledge-Light Luo Machine Translation and POS Tagging

Similar to Knowledge-Light Luo Machine Translation and POS Tagging (20)

More from Guy De Pauw

More from Guy De Pauw (20)

Recently uploaded

Recently uploaded (20)

Knowledge-Light Luo Machine Translation and POS Tagging