The SAWA Corpus - A parallel Corpus English - Swahili
 

by Guy De Pauw, Peter Waiganjo Wagacha and Gilles-Maurice de Schryver


    • The SAWA Corpus A Parallel Corpus English - Swahili Guy De Pauw (guy.depauw@aflat.org) Peter Waiganjo Wagacha (waiganjo@aflat.org) Gilles-Maurice de Schryver (gillesmaurice.deschryver@aflat.org)
    • Resource-scarceness
      • Language technology vs the digital divide
      • Digital data increasingly important for African languages (web, mobile phone, …)
      • But: most research on African languages is rooted in knowledge-based paradigm (↔ LT for Indo-European languages):
        • Hand-crafted expert systems
        • Typically high accuracy for domain
        • Limited portability to other languages and subdomains
        • Costly development phase
        • Limited resources (linguistic, expertise, financial, …)
      • Need for a cheaper and faster (language-independent) alternative for developing African language technology
    • Data-driven approaches
      • For Indo-European and Asian languages, the data-driven, corpus-based approach has been the dominant paradigm since the 1990s
      • Basic methodology: automatically extract linguistic knowledge from annotated text material (a corpus) and use it to bootstrap the development of language technology components
      • Advantages:
        • language independence: portability (!!!!)
        • Knowledge-acquisition bottleneck → data-acquisition bottleneck
        • Robustness
      • AfLaT-team: explore application of data-driven paradigm to African languages (Swahili, Gikuyu, Luo, Northern Sotho, …)
    • Machine Translation
      • 3 paradigms:
            • Rule-based MT
            • Statistical MT
            • Example-based MT
      • Statistical MT and Example-based MT are data-driven: they learn translation from examples, which requires a parallel corpus
    • Parallel Corpus
      • Collection of translated texts in two different languages, aligned on paragraph, sentence, phrase and/or word level
      • SAWA Corpus: a parallel corpus English - Swahili
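
One way to picture such a corpus in code is as a list of aligned units, each tagged with its alignment level. The sketch below is illustrative only: the class and field names (`AlignedPair`, `level`) and the helper `pairs_at_level` are assumptions made for this example, not part of any SAWA Corpus distribution. The two sample pairs are headings taken from the Universal Declaration of Human Rights text used in this presentation, and the level labels attached to them are likewise illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedPair:
    """One aligned unit of a parallel corpus (hypothetical structure)."""
    english: str
    swahili: str
    level: str  # "paragraph", "sentence", "phrase" or "word"


# Two tiny sample entries; the level labels are assumptions for illustration.
corpus: List[AlignedPair] = [
    AlignedPair("Preamble", "UTANGULIZI", "paragraph"),
    AlignedPair("Article 12", "Kifungu cha 12", "sentence"),
]


def pairs_at_level(pairs: List[AlignedPair], level: str) -> List[AlignedPair]:
    """Return only the pairs aligned at the requested level."""
    return [p for p in pairs if p.level == level]


sentence_pairs = pairs_at_level(corpus, "sentence")
```

Keeping the level on each pair lets one corpus hold paragraph, sentence and word alignments side by side.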
    • Example
      • Universal Declaration of Human Rights
      • Preamble
      • Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,
      • Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,
      • Katika Disemba 10, 1948, Baraza kuu la Umoja wa Mataifa lilikubali na kutangaza Taarifa ya Ulimwengu juu ya Haki za Binadamu. Maelezo kamili ya Taarifa hiyo yamepigwa chapa katika kurasa zifuatazo. Baada ya kutangaza taarifa hii ya maana Baraza Kuu lilizisihi nchi zote zilizo Wanachama wa Umoja wa Mataifa zitangaze na "zifanye ienezwe ionyeshwe, isomwe na ielezwe mashuleni na katika vyuo vinginevyo bila kujali siasa ya nchi yo yote."
      • UMOJA WA MATAIFA OFISI YA IDARA YA HABARI TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU
      • UTANGULIZI
      • Kwa kuwa kukiri heshima ya asili na haki sawa kwa binadamu wote ndio msingi wa uhuru, haki na amani duniani,
      • Kwa kuwa kutojali na kudharau haki za binadamu kumeletea vitendo vya kishenzi ambavyo vimeharibu dhamiri ya binadamu na kwa sababu taarifa ya ulimwengu ambayo itawafanya binadamu wafurahie uhuru wao wa kusema, kusadiki na wa kutoogopa cho chote imekwisha kutangazwa kwamba ndio hamu kuu ya watu wote,
    • 3 phases
      • Data-collection: finding parallel texts
      • Data-constitution: aligning the parallel texts on word level
      • Data-exploitation
        • Statistical Machine Translation
        • Bootstrapping linguistic annotation
    • Data Collection
      • Limited availability of parallel texts English – Kiswahili:
        • Smaller documents: investment reports, political texts, e.g. Universal Declaration of Human Rights
        • “There is no data like more data”
        • Bible, Quran, secular literature
        • New translations
    • Data Collection
      • Even if the source data is digitally available beforehand, we are often faced with tough alignment problems during data constitution.
      • e.g. paragraph alignment
      • Universal Declaration of Human Rights
      • Preamble
      • Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,
      • Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,
      • Katika Disemba 10, 1948, Baraza kuu la Umoja wa Mataifa lilikubali na kutangaza Taarifa ya Ulimwengu juu ya Haki za Binadamu. Maelezo kamili ya Taarifa hiyo yamepigwa chapa katika kurasa zifuatazo. Baada ya kutangaza taarifa hii ya maana Baraza Kuu lilizisihi nchi zote zilizo Wanachama wa Umoja wa Mataifa zitangaze na "zifanye ienezwe ionyeshwe, isomwe na ielezwe mashuleni na katika vyuo vinginevyo bila kujali siasa ya nchi yo yote."
      • UMOJA WA MATAIFA OFISI YA IDARA YA HABARI TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU
      • UTANGULIZI
      • Kwa kuwa kukiri heshima ya asili na haki sawa kwa binadamu wote ndio msingi wa uhuru, haki na amani duniani,
      • Kwa kuwa kutojali na kudharau haki za binadamu kumeletea vitendo vya kishenzi ambavyo vimeharibu dhamiri ya binadamu na kwa sababu taarifa ya ulimwengu ambayo itawafanya binadamu wafurahie uhuru wao wa kusema, kusadiki na wa kutoogopa cho chote imekwisha kutangazwa kwamba ndio hamu kuu ya watu wote,
      • e.g. sentence alignment
      • Article 12
      • No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation.
      • Everyone has the right to the protection of the law against such interference or attacks.
      • Kifungu cha 12
      • Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, ya jamaa yake, ya nyumbani mwake au ya barua zake.
      • Wala asivunjiwe heshima na sifa yake.
      • Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi au mambo kama hayo.
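
The Article 12 example can be read as two alignment groups: the first English sentence corresponds to the first two Kiswahili sentences, and the second English sentence to the third. A minimal sketch of that reading follows; the bead representation (tuples of sentence indices) and the helper name `render` are assumptions for illustration, and the 1:2 / 1:1 grouping is an interpretation of the example, not output of any alignment tool.

```python
english = [
    "No one shall be subjected to arbitrary interference with his privacy, "
    "family, home or correspondence, nor to attacks upon his honour and reputation.",
    "Everyone has the right to the protection of the law against such "
    "interference or attacks.",
]
swahili = [
    "Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, "
    "ya jamaa yake, ya nyumbani mwake au ya barua zake.",
    "Wala asivunjiwe heshima na sifa yake.",
    "Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi "
    "au mambo kama hayo.",
]

# Each bead pairs a group of English sentence indices with a group of
# Kiswahili sentence indices (assumed grouping: 1:2 followed by 1:1).
beads = [((0,), (0, 1)), ((1,), (2,))]


def render(beads, src, tgt):
    """Join the sentences in each bead into one (English, Kiswahili) text pair."""
    return [
        (" ".join(src[i] for i in e_idx), " ".join(tgt[j] for j in s_idx))
        for e_idx, s_idx in beads
    ]


aligned = render(beads, english, swahili)
```

Storing index groups rather than assuming one sentence per side is what lets a sentence-aligned corpus cope with translators splitting or merging sentences.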
    • Available data in SAWA Corpus (all manually sentence aligned!)

      Source               English     Kiswahili   English   Kiswahili
                           Sentences   Sentences   Words     Words
      New Testament        16.4k       16.3k       189.2k    151.1k
      Quran                14.3k       14.5k       165.5k    124.3k
      Declaration of HR     0.2k        0.2k         1.8k      1.8k
      Kamusi.org            5.6k        5.6k        35.5k     26.7k
      Movie Subtitles       9.0k        9.0k        72.2k     58.4k
      Investment Reports    3.2k        3.1k        52.9k     54.9k
      Local Translator      1.5k        1.6k        25.0k     25.7k
      Total                50.2k       50.3k       542.1k    442.9k

      • Thanks to Mahmoud Shokrollahi-Far, University College of Nabiye Akram (Iran)
      • Thanks to Dr. James Omboga Zaja, University of Nairobi
    • Word alignment
      • Most difficult task: relate words between languages
      English: No she 's uh , , up north
      Kiswahili: La , , , yuko , aa juu kaskazini
    • Word alignment
      English: You caught me skiving , I 'm afraid .
      Kiswahili: Samahani , umenidaka nikihepa .
    • Word alignment
      • Can be done automatically using established tools (GIZA++)
      • Provide manual reference to evaluate automatic word alignment tools (5000 words)
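
GIZA++ trains statistical word alignment models, including the IBM translation models. To give a feel for the idea, here is a toy IBM Model 1 trained with expectation-maximisation. It is a sketch, not GIZA++ itself: there is no NULL word, no smoothing, and the three-pair training set and the function names `train_ibm1` and `align` are made up for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t(f|e) with EM from (english_tokens, swahili_tokens) pairs."""
    t = defaultdict(lambda: 1.0)  # uniform start; the E-step normalises it
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        for eng, swa in pairs:
            for f in swa:
                norm = sum(t[(f, e)] for e in eng)
                for e in eng:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts per English word
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def align(eng, swa, t):
    """Link each Kiswahili token to its most probable English token."""
    return [(f, max(eng, key=lambda e: t[(f, e)])) for f in swa]


# Toy training data (assumed for this example only).
toy = [
    (["house"], ["nyumba"]),
    (["big", "house"], ["nyumba", "kubwa"]),
    (["book"], ["kitabu"]),
]
t = train_ibm1(toy)
links = align(["big", "house"], ["nyumba", "kubwa"], t)
```

Because "nyumba" also occurs alone with "house", the model learns to explain "kubwa" with "big" in the two-word pair, even though nothing told it so directly.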
    • Current results
      • Still a lot of room for improvement
      Precision 39.4%, Recall 44.5%, F (β=1) 41.79%
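
Scores of this kind are typically computed by comparing the automatically produced alignment links against the manual reference links. The sketch below assumes links are represented as `(english_index, swahili_index)` tuples; that representation, the function names and the small example link sets are assumptions for illustration. With β=1 the F-score is the harmonic mean of precision and recall, and plugging in the precision and recall reported above gives roughly 0.418.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def evaluate(predicted, reference):
    """Precision, recall and F(beta=1) of predicted links against a reference."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall, f_score(precision, recall)


# Made-up link sets: (english_index, swahili_index)
predicted_links = {(0, 0), (1, 1), (2, 3)}
reference_links = {(0, 0), (1, 1), (2, 2), (3, 3)}
p, r, f = evaluate(predicted_links, reference_links)

# F-score from the precision and recall reported on the slide
reported_f = f_score(0.394, 0.445)
```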
    • Word alignment
      • Some alignment patterns are easy
      English: No she 's uh , , up north
      Kiswahili: La , , , yuko , aa juu kaskazini
    • Alignment problems
      Kiswahili: nimemkatalia
      English: I have turned him down
    • Morphological decomposition
      Kiswahili: ni+ me+ m+ katalia
      English: I have turned him down
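
The decomposition shown here can be mimicked with a toy prefix stripper that peels off one morpheme per slot (subject, tense, object) and leaves the rest as the stem. This is only a sketch: the slot inventories below are a tiny hand-listed sample, the greedy strategy will mis-segment many real verbs (for example stems that happen to begin with a listed prefix), and the function name `segment` is made up. It is not the morphological analyser used for the SAWA Corpus.

```python
# Tiny, incomplete prefix inventories per verb slot (illustrative assumption).
SLOTS = [
    ["ni", "tu", "wa", "u", "a", "m"],   # subject markers
    ["me", "na", "li", "ta"],            # tense/aspect markers
    ["ni", "ku", "tu", "wa", "m"],       # object markers
]


def segment(word):
    """Greedily strip at most one prefix per slot; return morphemes + stem."""
    morphemes = []
    rest = word
    for inventory in SLOTS:
        # Try longer prefixes first so "ni" wins over a hypothetical "n"
        for prefix in sorted(inventory, key=len, reverse=True):
            if rest.startswith(prefix) and len(rest) > len(prefix):
                morphemes.append(prefix)
                rest = rest[len(prefix):]
                break
    return morphemes + [rest]


parts = segment("nimemkatalia")
pretty = " ".join(m + "+" for m in parts[:-1]) + " " + parts[-1]
```

Once the verb is split this way, each English word has its own Kiswahili unit to align with, which is what makes the one-to-many case tractable.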
    • Current results
      • Morpheme/Word alignment
      • Better alignment, but more complicated decoding
      Precision 50.2%, Recall 64.5%, F (β=1) 55.8%
    • Future work
      • Projection of Annotation
      • Refine GIZA++ alignment
      • Part-of-speech tagger
      • No data like more data: web-mining & comparable corpora
      • Example-based MT (omegaT)
      • Statistical MT (Moses)
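
Projection of annotation means copying labels from the annotated English side to the Kiswahili side through the word alignment links, which is one way to bootstrap a part-of-speech tagger. A minimal sketch, assuming links are `(english_index, swahili_index)` pairs; the tag names, the example phrase and the function name `project_tags` are assumptions for illustration.

```python
def project_tags(src_tags, links, tgt_len, default="UNK"):
    """Copy each source tag to the target position it is aligned with."""
    projected = [default] * tgt_len
    for src_idx, tgt_idx in links:
        projected[tgt_idx] = src_tags[src_idx]
    return projected


english = ["big", "house"]
english_tags = ["ADJ", "NOUN"]          # assumed to come from an English tagger
swahili = ["nyumba", "kubwa"]
links = [(0, 1), (1, 0)]                # big-kubwa, house-nyumba

swahili_tags = project_tags(english_tags, links, len(swahili))
tagged = list(zip(swahili, swahili_tags))
```

Unaligned Kiswahili tokens keep the placeholder tag, so noisy or missing alignment links show up directly as gaps in the projected annotation.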
    • Conclusion
      • Modest, but workable parallel corpus English – Swahili
      • Bi-directional Machine Translation is now in the cards
      • Modest, but encouraging word alignment scores
      • Data-driven approach is viable for African languages