A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging
Guy De Pauw (guy@aflat.org), Naomi Maajabu (naomi@aflat.org), Peter Waiganjo Wagacha (waiganjo@aflat.org)
Outline
  Resource-scarce language engineering
  The case of Luo (Dholuo)
  A trilingual parallel corpus: English – Swahili – Luo
  Machine Translation experiments
  Projection of Annotation experiments
  Conclusion & Future Work
Resource-Scarce Languages
  Limited financial, political, … resources
  Few digital resources: digital lexicons, corpora
  Bottleneck of linguistic expertise (in language technology)
  Two approaches:
    Rule-based approaches
      Advantages: meticulous design, linguistically relevant
    Corpus-based, data-driven approaches
      Growing importance and availability of digital text material
      Advantages: performance models, fast development, automatic quantitative evaluation
Dholuo
  Western Nilotic language
  Spoken by 3M+ Luo people in Kenya, Uganda, Tanzania
  No official dialect
  Tonal, but tone is not marked in the orthography
  Latin alphabet, no diacritics
  Resource-scarce (not an official language)
  Web-mined corpus of 200k words
  De Pauw, Wagacha & Abade (2007). Unsupervised Induction of Dholuo Word Classes using Maximum Entropy Learning
  [Map labels: UG, KE, DRC, RW, BU, TZ]
Most famous Luo
  Nilotic language spoken by 3M+ people in Kenya, Tanzania and Uganda
  Not an official language
  Latin script
  Most famous Luo: [image not preserved]
SAWA Corpus (AfLaT 2009 / LRE, submitted)
  2 million word parallel corpus, English – Swahili
  Competitive machine translation results
  Projection of part-of-speech tag annotation from English into Swahili is viable
  But what about truly resource-scarce languages?
Parallel Data for Luo
  International Bible Society (2005). Luo New Testament. Available at http://www.biblica.com/bibles/luo
  Use English and Swahili New Testament data of the SAWA corpus to construct a small trilingual parallel corpus
  Preprocessing:
    PDF-to-text conversion (Koolwire.com)
    Tokenization
    Sentence alignment: R. C. Moore (2002). Fast and accurate sentence alignment of bilingual corpora.
  Sample of the converted Luo text:
    Luk 1:1 Jimang'enyosebedo ka chanoweche mane otimore e dierwa ,
    Luk 1:2 Mana kaka nochiwgi ne wan kodjogomotelo mane jonenowang'giwang kendo jotichwach .
    Luk 1:3 Kuommano , an bende kaka asenonotiendwechegimalong'onyaka a chakruok , en gimaberbendemondoandikni e yomochanoremaler , in mulourTheofilo .
    Luk 1:4 Mondoing'eadier mar gikmosepwonji .
    Luk 1:5 E ndalo ma Heroderuodh Judea , ne nitiejadolo ma nyingeZakaria , ma ne en achielkuomogandajodolomagAbija ; chiege Elizabeth bende ne nyardhoodHarun .
    Luk 1:6 Gidutojariyo ne gin jomakarenyimNyasaye , ne giritochikemadongogimatindomagRuothNyasaye , maongeketho .
    Luk 1:7 To ne giongeginyithindo , nikech Elizabeth ne migumba , kendo gin jariyogohikgi : nose niang'
    Luk 1:8 Chieng' morokaneogandagiZakaria ne ni e tich to notiyo kaka jadolo e nyimNyasaye ,
    Luk 1:9 Noyieregiombulu kaka chik mar jodolo , mondoodonjieihekalu mar Ruoth kendo owangubani .
    Luk 1:10 To ka sa mar wang'oubaniochopo , jolemodutonochokoreoko kendo negilamo .
    Luk 1:11 Ekamalaika mar RuothNyasayenofwenyorene , kochungo bath kendo mar ubanikorachwich .
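Because every verse in the sample carries a reference such as `Luk 1:4`, one simple baseline is to key each tokenized verse by that reference and join the three languages on it. The Python sketch below only illustrates that idea; it is not the Moore (2002) aligner named on the slide. The function names, the reference pattern, and the English and Swahili placeholder strings are assumptions made for this example; only the Luo verses are taken from the sample.

```python
import re

# Assumed reference format: optional book number, short book code, chapter:verse (e.g. "Luk 1:4").
VERSE_REF = re.compile(r"\b([1-3]?[A-Za-z]{2,3} \d+:\d+)\s+")

def split_verses(text):
    """Split running text into {reference: verse text} using the verse markers."""
    parts = VERSE_REF.split(text)
    # parts looks like [preamble, ref1, body1, ref2, body2, ...]
    return {ref: body.strip() for ref, body in zip(parts[1::2], parts[2::2])}

def tokenize(verse):
    """Rough tokenizer: split off punctuation but keep apostrophes attached (Luo ng')."""
    return re.findall(r"\w+(?:'\w*)*|[^\w\s]", verse)

def align_by_reference(*corpora):
    """Keep only references found in every corpus; return aligned token lists."""
    shared = set(corpora[0]).intersection(*corpora[1:])
    return {ref: tuple(tokenize(c[ref]) for c in corpora) for ref in sorted(shared)}

# Luo verses copied from the sample; English and Swahili strings are placeholders.
luo = split_verses("Luk 1:1 Jimang'enyosebedo ka chanoweche mane otimore e dierwa , "
                   "Luk 1:4 Mondoing'eadier mar gikmosepwonji .")
eng = split_verses("Luk 1:1 EN placeholder one . Luk 1:4 EN placeholder four .")
swa = split_verses("Luk 1:4 SW placeholder four . Luk 1:5 SW placeholder five .")

aligned = align_by_reference(eng, swa, luo)
for ref, (en_toks, sw_toks, luo_toks) in aligned.items():
    print(ref, en_toks, sw_toks, luo_toks)
```

Only references present in all three texts survive the join, so a verse missing from one translation is dropped rather than misaligned. A length-based aligner is still needed wherever verse boundaries differ between editions.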
Trilingual corpus
  10% validation set
  10% test set (partly annotated for POS tags)
  Tiny, register-specific parallel corpus
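A partition of this kind might be produced as sketched below. The slide gives the 10% validation and 10% test portions; treating the remaining 80% as training data, shuffling before cutting, and the fixed seed are assumptions of this example, not details taken from the slides.

```python
import random

def split_corpus(pairs, val_frac=0.10, test_frac=0.10, seed=0):
    """Shuffle aligned sentence tuples and cut them into train/validation/test."""
    items = list(pairs)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_val = round(len(items) * val_frac)
    n_test = round(len(items) * test_frac)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]  # assumed: everything else is training data
    return train, val, test

# Integers stand in for aligned (English, Swahili, Luo) verse triples.
train, val, test = split_corpus(range(1000))
print(len(train), len(val), len(test))
```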
Word alignment
  Misconception: morphologically rich languages cannot be used in statistical machine translation, since word alignment is word-based
  Example: English "I have turned him down" ↔ Swahili "nimemkatalia"
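To make the word-alignment step concrete, here is a minimal sketch of IBM Model 1 trained with expectation-maximization, which learns translation probabilities t(target | source) from sentence pairs alone. It is a textbook model shown for illustration, not necessarily the aligner used in the experiments. The three Swahili–English toy pairs and the function names are invented for this example.

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """Estimate t[(target, source)] with EM; corpus is a list of (source_tokens, target_tokens)."""
    tgt_vocab = {w for _, tgt in corpus for w in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in corpus:
            for e in tgt:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: renormalize the expected counts per source word.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

def best_alignment(t, src, tgt):
    """Link every target word to its most probable source word."""
    return [(e, max(src, key=lambda f: t[(e, f)])) for e in tgt]

# Toy data (invented): Swahili source, English target.
corpus = [
    (["mtoto", "analala"], ["child", "sleeps"]),
    (["mtoto", "anakula"], ["child", "eats"]),
    (["mbwa", "analala"], ["dog", "sleeps"]),
]
t = ibm_model1(corpus)
links = best_alignment(t, ["mtoto", "anakula"], ["child", "eats"])
print(links)
```

Because each target word picks exactly one source word, several English words can link to a single Swahili token when Swahili is the source side. That is the configuration a case like "nimemkatalia" needs.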
Factored Data
  Misconception: morphologically rich languages cannot be used in statistical machine translation, since it is word-based
  Word alignment and language modeling can be enhanced by using factored data
  General idea: use extra annotation layers (part-of-speech tagging, lemmatization) to aid discovery of possible translation pairs
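As an illustration of what factored input can look like, the sketch below joins each surface token with its lemma and part-of-speech tag using the vertical-bar notation that Moses reads for factored models. The helper name `to_factored` is hypothetical, and the lemmas and tags in the example sentence are hand-assigned for illustration rather than taken from the corpus.

```python
def to_factored(tokens, *factor_layers):
    """Join a surface token with its extra factors, e.g. word|lemma|pos."""
    for layer in factor_layers:
        if len(layer) != len(tokens):
            raise ValueError("every factor layer needs one value per token")

    def escape(value):
        # A literal bar would be read as a factor separator, so replace it
        # with the entity that the Moses tokenizer uses for escaping.
        return value.replace("|", "&#124;")

    return " ".join(
        "|".join(escape(f) for f in factors)
        for factors in zip(tokens, *factor_layers)
    )

tokens = ["I", "have", "turned", "him", "down"]
lemmas = ["I", "have", "turn", "he", "down"]      # hand-assigned
tags = ["PRP", "VBP", "VBN", "PRP", "RP"]         # hand-assigned Penn-style tags
line = to_factored(tokens, lemmas, tags)
print(line)
```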
Machine Translation Experiments
  English – Luo and Swahili – Luo
  Use standard SMT tool Moses (Koehn et al. 2007)
    Phrase-based machine translation
    Can handle factored data
  Uses SRILM language modeling tool (Stolcke 2002)
    English: Gigaword corpus
    Swahili: TshwaneDJe Kiswahili Internet Corpus
    Luo: 200k Luo corpus + Training/Evaluation Set of New Testament data
Results
  OOV: percentage of out-of-vocabulary words (i.e. words unknown to the language model)
  BLEU: Bilingual Evaluation Understudy (calculates n-gram overlap between reference and machine translation)
  NIST: modification of BLEU, taking into account information value of n-grams
  BLEU & NIST attempt to optimize correlation with human evaluation
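The first two metrics can be sketched in a few lines of Python. This is a simplified illustration, assuming one reference per segment and no smoothing; it is not the scoring script used for the reported results. NIST's information weighting is left out, and the function names and toy sentences are this example's own.

```python
import math
from collections import Counter

def oov_rate(test_tokens, vocabulary):
    """Percentage of test tokens that do not occur in the vocabulary."""
    unknown = sum(1 for tok in test_tokens if tok not in vocabulary)
    return 100.0 * unknown / len(test_tokens)

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per segment and no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipping: an n-gram is credited at most as often as the reference contains it.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

hyp = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
score = bleu([hyp], [ref])
print(round(score, 3), oov_rate(hyp, set(ref)))
```

On a corpus as small as this one, unsmoothed BLEU falls to zero whenever no 4-gram matches. That is one reason scores on tiny test sets should be read with caution.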
SMT Experiments
Examples
Projection of annotation
  Project part-of-speech tags from resource-rich(er) language
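The projection idea can be sketched as follows: given tags on the resource-rich side and a word alignment, each target word inherits a tag from the source words linked to it. The tie-breaking rule (prefer content-word tags, otherwise take the most frequent candidate), the coarse tag names and the function name are assumptions of this example, not the procedure used in the experiments.

```python
from collections import Counter, defaultdict

def project_tags(src_tags, alignment, tgt_len, priority=("VERB", "NOUN", "ADJ", "ADV")):
    """Project source-side tags onto target tokens through (src_idx, tgt_idx) links."""
    candidates = defaultdict(list)
    for src_idx, tgt_idx in alignment:
        candidates[tgt_idx].append(src_tags[src_idx])
    projected = []
    for tgt_idx in range(tgt_len):
        tags = candidates.get(tgt_idx)
        if not tags:
            projected.append(None)  # unaligned target word: no tag can be projected
            continue
        # Assumed heuristic: content-word tags win over function-word tags.
        preferred = [p for p in priority if p in tags]
        projected.append(preferred[0] if preferred else Counter(tags).most_common(1)[0][0])
    return projected

# "I have turned him down" (hand-assigned coarse tags), all linked to one target word.
src_tags = ["PRON", "AUX", "VERB", "PRON", "ADP"]
alignment = [(i, 0) for i in range(len(src_tags))]
print(project_tags(src_tags, alignment, tgt_len=1))
```

When several source words collapse onto one morphologically complex target word, some selection rule of this kind is unavoidable. Words left unaligned get no tag and have to be handled by a later step.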

 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
 

Recently uploaded

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Recently uploaded (20)

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Knowledge-Light Luo Machine Translation and POS Tagging

  • 1. A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging. Guy De Pauw (guy@aflat.org), Naomi Maajabu (naomi@aflat.org), Peter Waiganjo Wagacha (waiganjo@aflat.org)
  • 2. Outline: Resource-scarce language engineering; The case of Luo (Dholuo); A trilingual parallel corpus English – Swahili – Luo; Machine Translation experiments; Projection of Annotation experiments; Conclusion & Future Work
  • 3. Resource-Scarce Languages: Limited financial, political, … resources. Few digital resources: digital lexicons, corpora. Bottleneck of linguistic expertise (in LT). Two approaches: (1) Rule-based approaches. Advantages: meticulous design, linguistically relevant. (2) Corpus-based, data-driven approaches. Growing importance and availability of digital text material. Advantages: performance models, fast development, automatic quantitative evaluation.
  • 4. Dholuo: Western Nilotic language. Spoken by +3M Luo people in Kenya, Uganda, Tanzania. No official dialect. Tonal, but tone is not marked in the orthography. Latin alphabet, no diacritics. Resource-scarce (not an official language). Web-mined corpus of 200k words: De Pauw, Wagacha & Abade (2007), Unsupervised Induction of Dholuo Word Classes using Maximum Entropy Learning.
  • 5. Most famous Luo. Nilotic language. Spoken by +3M people in Kenya, Tanzania and Uganda. Not an official language. Latin script. Most famous Luo:
  • 6. SAWA Corpus (AfLaT 2009 / LRE, submitted): 2 million word parallel corpus English – Swahili. Competitive machine translation results. Projection of annotation of part-of-speech tags from English into Swahili is viable. But what about true resource-scarce languages?
  • 7. Parallel Data for Luo: International Bible Society (2005), Luo New Testament, available at http://www.biblica.com/bibles/luo. Use the English and Swahili New Testament data of the SAWA corpus to construct a small trilingual parallel corpus. Preprocessing: PDF-to-text conversion, tokenization, sentence alignment.
  • 8. Parallel Data for Luo: International Bible Society (2005), Luo New Testament, available at http://www.biblica.com/bibles/luo. Use the English and Swahili New Testament data of the SAWA corpus to construct a small trilingual parallel corpus. Preprocessing: PDF-to-text conversion (Koolwire.com), tokenization, sentence alignment.
  • 9. Parallel Data for Luo: International Bible Society (2005), Luo New Testament, available at http://www.biblica.com/bibles/luo. Use the English and Swahili New Testament data of the SAWA corpus to construct a small trilingual parallel corpus. Preprocessing: PDF-to-text conversion (Koolwire.com), tokenization, sentence alignment. Luk 1:1 Jimang'enyosebedo ka chanoweche mane otimore e dierwa , Luk 1:2 Mana kaka nochiwgi ne wan kodjogomotelo mane jonenowang'giwang kendo jotichwach . Luk 1:3 Kuommano , an bende kaka asenonotiendwechegimalong'onyaka a chakruok , en gimaberbendemondoandikni e yomochanoremaler , in mulourTheofilo . Luk 1:4 Mondoing'eadier mar gikmosepwonji . Luk 1:5 E ndalo ma Heroderuodh Judea , ne nitiejadolo ma nyingeZakaria , ma ne en achielkuomogandajodolomagAbija ; chiege Elizabeth bende ne nyardhoodHarun . Luk 1:6 Gidutojariyo ne gin jomakarenyimNyasaye , ne giritochikemadongogimatindomagRuothNyasaye , maongeketho . Luk 1:7 To ne giongeginyithindo , nikech Elizabeth ne migumba , kendo gin jariyogohikgi : nose niang' Luk 1:8 Chieng' morokaneogandagiZakaria ne ni e tich to notiyo kaka jadolo e nyimNyasaye , Luk 1:9 Noyieregiombulu kaka chik mar jodolo , mondoodonjieihekalu mar Ruoth kendo owangubani . Luk 1:10 To ka sa mar wang'oubaniochopo , jolemodutonochokoreoko kendo negilamo . Luk 1:11 Ekamalaika mar RuothNyasayenofwenyorene , kochungo bath kendo mar ubanikorachwich .
  • 10. Parallel Data for Luo: International Bible Society (2005), Luo New Testament, available at http://www.biblica.com/bibles/luo. Use the English and Swahili New Testament data of the SAWA corpus to construct a small trilingual parallel corpus. Preprocessing: PDF-to-text conversion (Koolwire.com), tokenization, sentence alignment (R. C. Moore (2002), Fast and accurate sentence alignment of bilingual corpora).
  • 13. 10% test set (partly annotated for POS tags). Tiny, register-specific parallel corpus.
  • 15. Misconception: morphologically rich languages cannot be used in statistical machine translation, since word alignment is word-based. Example: "I have turned him down" / "nimemkatalia"
  • 17. Misconception: morphologically rich languages cannot be used in statistical machine translation, since it is word-based
  • 18. Word alignment and language modeling can be enhanced by using factored data
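Slide 18 mentions factored data. As a minimal sketch, assuming every token has already been analysed into a surface form, a lemma and a part-of-speech tag, the snippet below writes a sentence in the pipe-separated factored notation that Moses reads. The function name `to_factored` and the example analyses are illustrative assumptions, not taken from the slides.

```python
def to_factored(tokens):
    """Join (surface, lemma, pos) triples into one Moses-style factored line.

    Each token becomes "surface|lemma|pos"; tokens are separated by spaces.
    The choice of three factors is an assumption made for this sketch.
    """
    parts = []
    for factors in tokens:
        for factor in factors:
            # "|" and whitespace are separators in the factored format,
            # so they must not occur inside a factor.
            if "|" in factor or any(ch.isspace() for ch in factor):
                raise ValueError(f"illegal character in factor: {factor!r}")
        parts.append("|".join(factors))
    return " ".join(parts)


# Hypothetical English analyses, invented for illustration only.
sentence = [("the", "the", "DT"), ("houses", "house", "NNS")]
line = to_factored(sentence)
print(line)
```

With a representation like this, alignment or language modeling can in principle be run on the lemma or tag factor instead of the sparse surface forms.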
  • 21. Machine Translation Experiments: English – Luo and Swahili – Luo. Use the standard SMT tool MOSES (Koehn et al. 2007): phrase-based machine translation; can handle factored data; uses the SRILM language modeling tool (Stolcke 2002). English: Gigaword corpus. Swahili: TshwaneDJe Kiswahili Internet Corpus. Luo: 200k Luo corpus + Training/Evaluation Set of New Testament data.
  • 22. Results. OOV: percentage of out-of-vocabulary words (i.e. words unknown to the language model). BLEU: Bilingual Evaluation Understudy (calculates n-gram overlap between reference and machine translation). NIST: modification of BLEU, taking into account the information value of n-grams. BLEU & NIST attempt to optimize correlation with human evaluation.
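The metrics on slide 22 can be illustrated with a short sketch. It computes the OOV percentage against a known vocabulary and the clipped n-gram precision that BLEU is built on. This is only one component of BLEU: the full metric combines several n-gram orders and applies a brevity penalty, which are left out here. Function names and the toy sentences are assumptions made for illustration.

```python
from collections import Counter


def oov_rate(tokens, vocabulary):
    """Percentage of tokens that do not occur in the given vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocabulary)
    return 100.0 * unknown / len(tokens)


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams also found in the reference.

    Each n-gram is credited at most as often as it occurs in the reference
    (the "clipping" used in BLEU), so repeating a word does not help.
    """
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Toy data, invented for illustration only.
candidate = "the the the cat".split()
reference = "the cat sat".split()
print(clipped_precision(candidate, reference, 1))
print(oov_rate(candidate, {"the"}))
```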
  • 27. Project part-of-speech tags from resource-rich(er) language
  • 29. Tag-projection list + Tag priority list
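Slides 27 and 29 describe projecting part-of-speech tags across word alignments. The sketch below shows one possible reading of that idea: every target word collects the tags of the source words aligned to it, and a priority list decides which tag wins when several compete. The slides do not spell out the exact procedure, so the function `project_tags`, the tag names and the priority order are assumptions, not the authors' implementation.

```python
def project_tags(source_tags, alignment, target_len, priority):
    """Project source-side POS tags onto target words via word alignment.

    source_tags: list of tags, one per source word
    alignment:   list of (source_index, target_index) pairs
    target_len:  number of words in the target sentence
    priority:    tags ordered from most to least preferred
    Returns one tag per target word, or None for unaligned words.
    """
    rank = {tag: i for i, tag in enumerate(priority)}
    candidates = [[] for _ in range(target_len)]
    for src_idx, tgt_idx in alignment:
        candidates[tgt_idx].append(source_tags[src_idx])

    projected = []
    for tags in candidates:
        if not tags:
            projected.append(None)  # nothing aligned, nothing to project
        else:
            # Tags missing from the priority list rank last.
            projected.append(min(tags, key=lambda t: rank.get(t, len(rank))))
    return projected


# "I have turned him down" aligned to a single target word, as in the
# alignment example on slide 15; tags and priorities are hypothetical.
source_tags = ["PRON", "AUX", "VERB", "PRON", "ADV"]
alignment = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
priority = ["VERB", "NOUN", "PRON", "AUX", "ADV"]
print(project_tags(source_tags, alignment, 1, priority))
```

Putting open-class tags first in the priority list means a morphologically complex target word is labelled by its content-bearing part rather than by an aligned function word.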
  • 31. Errors made mainly on function words (closed class)
  • 35. Conclusion: First proof-of-principle experiments (machine translation, projection of annotation) for a Nilotic language. “If you have a digital bible, you have an MT system and other NLP components.” Small register-specific parallel corpus English – Swahili – Luo. Modest, but encouraging BLEU & NIST scores. SMT for Luo is possible. No alternatives (cf. other African languages?). Factored data can overcome limitations of pure word-based methods for word alignment. Morphological generation on the target language side is still a bottleneck.
  • 36. Future Work: Make the trilingual, annotated corpus available through the Open-Content Text Corpus (the SAWA corpus will be made available soon as well). Use the translation model as a seed for bilingual web mining. Tweak & tune MOSES parameters to improve quality. Better morphological analysis and generation for Dholuo. Unsupervised morphology induction. Use automatically induced annotation as training data for supervised data-driven taggers. Repeat the experiment for other resource-scarce languages.