The SAWA Corpus: A Parallel Corpus English - Swahili
  Guy De Pauw (guy.depauw@aflat.org)
  Peter Waiganjo Wagacha (waiganjo@aflat.org)
  Gilles-Maurice de Schryver (gillesmaurice.deschryver@aflat.org)
Resource-scarceness: Language technology vs the digital divide
  - Digital data is increasingly important for African languages (web, mobile phone, …)
  - But: most research on African languages is rooted in the knowledge-based paradigm (↔ LT for Indo-European languages):
    - Hand-crafted expert systems
    - Typically high accuracy within the domain
    - Limited portability to other languages and subdomains
    - Costly development phase
  - Limited resources (linguistic, expertise, financial, …)
  - Need for a cheaper and faster (language-independent) alternative for developing African language technology
Data-driven approaches
  - For Indo-European and Asian languages, the data-driven, corpus-based approach has been the dominant paradigm since the 1990s
  - Basic methodology: automatically extract linguistic knowledge from annotated text material (a corpus) and bootstrap the development of language technology components
  - Advantages:
    - Language independence: portability (!!!!)
    - The knowledge-acquisition bottleneck becomes a data-acquisition bottleneck
    - Robustness
  - AfLaT team: explore the application of the data-driven paradigm to African languages (Swahili, Gikuyu, Luo, Northern Sotho, …)
Machine Translation
  - 3 paradigms:
    - Rule-based MT
    - Statistical MT (data-driven)
    - Example-based MT (data-driven)
  - The data-driven paradigms learn translation from examples: !! Parallel corpus !!
Parallel Corpus
  - Collection of translated texts in two different languages, aligned on paragraph, sentence, phrase and/or word level
  - SAWA Corpus: parallel corpus English - Swahili
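A sentence-aligned parallel corpus can be pictured as a list of paired sentences. The sketch below is only an illustration: the `AlignedPair` class and the `read_parallel` helper are hypothetical names, not the SAWA file format, and the example pair is taken from the word alignment slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One sentence-level alignment unit (hypothetical structure)."""
    english: str
    swahili: str


def read_parallel(english_lines, swahili_lines):
    """Pair line i of the English side with line i of the Swahili side."""
    if len(english_lines) != len(swahili_lines):
        raise ValueError("sides differ in length; texts are not sentence aligned")
    return [AlignedPair(e.strip(), s.strip())
            for e, s in zip(english_lines, swahili_lines)]


english = ["You caught me skiving , I ' m afraid ."]
swahili = ["Samahani , umenidaka nikihepa ."]
corpus = read_parallel(english, swahili)

for pair in corpus:
    print(pair.english, "|||", pair.swahili)
```

Storing one aligned unit per line on each side is a common convention for MT toolkits, which is why the helper refuses sides of unequal length.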
Example
  English:
    Universal Declaration of Human Rights
    Preamble
    Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,
    Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,
  Kiswahili:
    Katika Disemba 10, 1948, Baraza kuu la Umoja wa Mataifa lilikubali na kutangaza Taarifa ya Ulimwengu juu ya Haki za Binadamu. Maelezo kamili ya Taarifa hiyo yamepigwa chapa katika kurasa zifuatazo. Baada ya kutangaza taarifa hii ya maana Baraza Kuu lilizisihi nchi zote zilizo Wanachama wa Umoja wa Mataifa zitangaze na "zifanye ienezwe ionyeshwe, isomwe na ielezwe mashuleni na katika vyuo vinginevyo bila kujali siasa ya nchi yo yote."
    UMOJA WA MATAIFA OFISI YA IDARA YA HABARI
    TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU
    UTANGULIZI
    Kwa kuwa kukiri heshima ya asili na haki sawa kwa binadamu wote ndio msingi wa uhuru, haki na amani duniani,
    Kwa kuwa kutojali na kudharau haki za binadamu kumeletea vitendo vya kishenzi ambavyo vimeharibu dhamiri ya binadamu na kwa sababu taarifa ya ulimwengu ambayo itawafanya binadamu wafurahie uhuru wao wa kusema, kusadiki na wa kutoogopa cho chote imekwisha kutangazwa kwamba ndio hamu kuu ya watu wote,
3 phases
  - Data collection: finding parallel texts
  - Data constitution: aligning the parallel texts on word level
  - Data exploitation:
    - Statistical Machine Translation
    - Bootstrapping linguistic annotation
Data Collection
  - Limited availability of parallel texts English – Kiswahili:
    - Smaller documents: investment reports, political texts, e.g. Universal Declaration of Human Rights
  - "There is no data like more data":
    - Bible, Quran, secular literature
    - New translations
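Data Collection
  - Even if the source data is digitally available beforehand, we often face tough alignment problems during data constitution.
  - e.g. paragraph alignment: in the Universal Declaration of Human Rights example, the Kiswahili text opens with an introductory paragraph and headers that have no counterpart in the English text.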
e.g. sentence alignment
  English (2 sentences):
    Article 12
    No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation.
    Everyone has the right to the protection of the law against such interference or attacks.
  Kiswahili (3 sentences):
    Kifungu cha 12
    Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, ya jamaa yake, ya nyumbani mwake au ya barua zake.
    Wala asivunjiwe heshima na sifa yake.
    Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi au mambo kama hayo.
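The Article 12 example shows why sentence alignment is not always one-to-one: the first English sentence corresponds to the first two Kiswahili sentences. A minimal sketch of how such a manual alignment could be recorded follows. The `beads` representation and the naive `split_sentences` helper are assumptions made for illustration, not the tooling used for the SAWA Corpus.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


english = (
    "No one shall be subjected to arbitrary interference with his privacy, "
    "family, home or correspondence, nor to attacks upon his honour and "
    "reputation. Everyone has the right to the protection of the law against "
    "such interference or attacks."
)
swahili = (
    "Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, ya jamaa "
    "yake, ya nyumbani mwake au ya barua zake. Wala asivunjiwe heshima na "
    "sifa yake. Kila mmoja ana haki ya kulindwa na sheria kutokana na "
    "pingamizi au mambo kama hayo."
)

en = split_sentences(english)
sw = split_sentences(swahili)

# Each bead lists the English and Kiswahili sentence indices that belong
# together: one 1-to-2 bead followed by one 1-to-1 bead.
beads = [([0], [0, 1]), ([1], [2])]

aligned = [
    (" ".join(en[i] for i in e_idx), " ".join(sw[j] for j in s_idx))
    for e_idx, s_idx in beads
]

for e_side, s_side in aligned:
    print(e_side)
    print("  <->", s_side)
```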
Available data in SAWA Corpus (all manually sentence aligned!)

                        English     Kiswahili   English   Kiswahili
                        Sentences   Sentences   Words     Words
  New Testament         16.4k       16.3k       189.2k    151.1k
  Quran                 14.3k       14.5k       165.5k    124.3k
  Declaration of HR     0.2k        0.2k        1.8k      1.8k
  Kamusi.org            5.6k        5.6k        35.5k     26.7k
  Movie Subtitles       9.0k        9.0k        72.2k     58.4k
  Investment Reports    3.2k        3.1k        52.9k     54.9k
  Local Translator      1.5k        1.6k        25.0k     25.7k
  Total                 50.2k       50.3k       542.1k    442.9k
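The totals in the table can be checked with a few lines of arithmetic. The sketch below assumes that the three rows listing a single sentence count (Declaration of HR, Kamusi.org, Movie Subtitles) have that same count on both sides, an assumption that is consistent with the reported totals. All figures are in thousands.

```python
# (English sentences, Kiswahili sentences, English words, Kiswahili words),
# all in thousands, copied from the table.
rows = {
    "New Testament":      (16.4, 16.3, 189.2, 151.1),
    "Quran":              (14.3, 14.5, 165.5, 124.3),
    "Declaration of HR":  (0.2, 0.2, 1.8, 1.8),    # single sentence count assumed shared
    "Kamusi.org":         (5.6, 5.6, 35.5, 26.7),  # single sentence count assumed shared
    "Movie Subtitles":    (9.0, 9.0, 72.2, 58.4),  # single sentence count assumed shared
    "Investment Reports": (3.2, 3.1, 52.9, 54.9),
    "Local Translator":   (1.5, 1.6, 25.0, 25.7),
}
reported_total = (50.2, 50.3, 542.1, 442.9)

# Sum each column and round to one decimal, as in the table.
computed_total = tuple(round(sum(col), 1) for col in zip(*rows.values()))
print(computed_total)

# Average sentence length (words per sentence) on each side.
en_avg = reported_total[2] / reported_total[0]
sw_avg = reported_total[3] / reported_total[1]
print(round(en_avg, 1), round(sw_avg, 1))
```

The Kiswahili side has fewer words per sentence than the English side, which is what one would expect from a morphologically rich language that packs several English words into one word form.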
Available data in SAWA Corpus: thanks to Mahmoud Shokrollahi-Far, University College of Nabiye Akram (Iran)
Available data in SAWA Corpus: thanks to Dr. James Omboga Zaja, University of Nairobi
Word alignment
  - Most difficult task: relate words between languages
  English:   No she ' s uh , , up north
  Kiswahili: La , , , yuko , aa juu kaskazini
Word alignment
  English:   You caught me skiving , I ' m afraid .
  Kiswahili: Samahani , umenidaka nikihepa .
Word alignment
  - Can be done automatically using established tools (GIZA++)
  - Provide a manual reference to evaluate automatic word alignment tools (5000 words)
Current results
  - Still a lot of room for improvement

  Precision   Recall   F (β=1)
  39.4%       44.5%    41.79%
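Precision, recall and F-score for word alignment are typically computed by comparing the set of automatically proposed links with the manual reference links. The sketch below shows that computation on a made-up toy example. The link sets and function names are illustrative assumptions, not the evaluation script behind the figures above.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F with beta=1 is F1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def evaluate_alignment(predicted, reference):
    """Score predicted (source_index, target_index) links against a reference."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy example: 4 reference links, 3 predicted links, 2 of them correct.
reference = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted = {(0, 0), (1, 1), (2, 3)}
p, r, f = evaluate_alignment(predicted, reference)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")

# Plugging in the reported precision and recall gives an F-score near 41.8%.
print(round(100 * f_measure(0.394, 0.445), 1))
```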
Word alignment
  - Some alignment patterns are easy
  English:   No she ' s uh , , up north
  Kiswahili: La , , , yuko , aa juu kaskazini
Alignment problems
  English:   I have turned him down
  Kiswahili: nimemkatalia
Morphological decomposition
  English:   I have turned him down
  Kiswahili: ni+ me+ m+ katalia
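The decomposition of nimemkatalia into ni+ me+ m+ katalia can be imitated with a toy prefix stripper. The sketch below is only an illustration: the prefix lists are a small hand-picked subset of Swahili subject, tense and object markers, the stem lexicon is hypothetical, and this is not the analyser used for the SAWA experiments.

```python
# Small, illustrative subsets of Swahili verbal prefixes (assumption).
SUBJECT = ["ni", "u", "a", "tu", "m", "wa"]
TENSE = ["me", "na", "li", "ta"]
OBJECT = ["ni", "ku", "m", "tu", "wa"]


def decompose(word, stems):
    """Split a verb into subject+ tense+ (object+) stem, or return None.

    Tries every subject/tense/object combination and accepts the first
    one whose remainder is a known stem.
    """
    for subj in SUBJECT:
        if not word.startswith(subj):
            continue
        after_subj = word[len(subj):]
        for tense in TENSE:
            if not after_subj.startswith(tense):
                continue
            after_tense = after_subj[len(tense):]
            if after_tense in stems:
                return [subj + "+", tense + "+", after_tense]
            for obj in OBJECT:
                if after_tense.startswith(obj) and after_tense[len(obj):] in stems:
                    return [subj + "+", tense + "+", obj + "+",
                            after_tense[len(obj):]]
    return None


stems = {"katalia", "daka"}  # hypothetical stem lexicon
print(decompose("nimemkatalia", stems))
print(decompose("umenidaka", stems))
```

After decomposition each English word has a morpheme to align with, which is the motivation for aligning words to morphemes instead of words to words.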
Current results: Morpheme/Word alignment
  - Better alignment, but more complicated decoding

  Precision   Recall   F (β=1)
  50.2%       64.5%    55.8%
Future work
  - Projection of annotation:
    - Refine GIZA++ alignment
    - Part-of-speech tagger
  - No data like more data: web-mining & comparable corpora
  - Example-based MT (omegaT)
  - Statistical MT (Moses)
Conclusion
  - Modest, but workable parallel corpus English – Swahili
  - Bi-directional Machine Translation is now in the cards
  - Modest, but encouraging word alignment scores
  - Data-driven approach is viable for African languages
