Bilingual Data Mining for the English-Amharic Statistical Machine Translation (EASMT)
Mulu Gebreegziabher, IT Doctoral Program, Addis Ababa University, Addis Ababa, Ethiopia
Prof. Laurent Besacier, University Joseph Fourier, Grenoble, France
Dr. Girma Taye & Dr. Dereje Teferi, Addis Ababa University, Addis Ababa, Ethiopia
December 2, 2011
Presentation Outline
• Introduction
• Objectives
• Experiment on the English-Amharic bilingual corpus
• ENA English-Amharic parallel news corpus
• Parliamentary English-Amharic parallel proclamation corpus
• Sentence level aligned English-Amharic parallel corpora
• Way Forward
Introduction
• MT is the application of computers to translate text from one natural language to another.
• Machine Translation Systems
  – Machine Assisted Translation
    – Human Aided Translation
    – Machine Aided Translation
  – Fully Automated Translation
    – Rule-based Systems
    – Empirical Systems
      – Statistical Machine Translation
      – Example-based Translation
Introduction Contd…
• SMT systems are data driven: they rely on a bilingual, parallel, aligned corpus.
• The performance of an SMT system depends on the size of the available training corpus.
• In general, the larger the corpus, the better the performance of the SMT system.
• To develop EASMT, parallel data has to be collected as English-Amharic bilingual sentence pairs.
• The experiment is to be conducted on a corpus of at least 2M word pairs (40K sentence pairs).
Introduction Contd…
English-Amharic Statistical Machine Translation (EASMT)
• Translation between two disparate languages

                      Amharic         English
Language Family       Afro-Asiatic    Indo-European
Morphology            Complex         Less inflected
Syntactic Structure   SOV             SVO
Writing System        Geez Alphabet   Latin Letters
Introduction Contd…
Parallel Corpus
• A parallel corpus is a collection of texts paired with their translations into another language.
• The experiment is conducted on a training corpus in both languages, built from expressions found in parallel Amharic-English news, parliamentary and constitutional documents.
• The parallel ENA news contains sentences of day-to-day usage:
  – Direct translations of each other
  – Indirect translations written on the same topic in different languages, called comparable corpora.
Objectives
The objective of the research is to study and develop an English-Amharic Statistical Machine Translation (EASMT) system and to improve the translation quality by integrating linguistic knowledge into the system.
Experiment on the English-Amharic bilingual corpus
Mining the parallel corpus
• There are five steps to process a bilingual text corpus for use in an SMT system (Besacier et al., 2009):
  – Raw data collection: proclamation and parallel news corpora have been collected
  – Document alignment: manual and automatic
  – Tokenization: splitting and trimming
  – Sentence splitting: done using the punctuation marks [?!. ፡፡]
  – Sentence alignment: almost completed
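The tokenization and sentence-splitting steps can be sketched roughly as follows. This is a minimal illustration, not the code used in the experiment: the function names are hypothetical, the Amharic full stop is assumed to appear either as the single character "።" (U+1362) or as two word separators "፡፡" (U+1361 twice), and a period is always treated as a sentence end, so abbreviations and decimal numbers would be split incorrectly.

```python
import re

# Sentence-final marks from the slide: ?, !, . and the Amharic full stop.
# Assumption: the Amharic full stop is written either as "።" or as "፡፡".
SENTENCE_END = re.compile(r"(?:።|፡፡|[?!.])+")


def tokenize(sentence):
    """Trim the sentence and split it on whitespace."""
    return sentence.strip().split()


def split_sentences(text):
    """Cut the text after each run of sentence-final marks."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        piece = text[start:match.end()].strip()
        if piece:
            sentences.append(piece)
        start = match.end()
    # Keep any trailing text that has no final punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

A single "፡" is left alone because it separates words rather than sentences.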
ENA English-Amharic parallel news corpus
• News coverage: Aug 21, 2006 - January 06, 2008

Language          News Corpus          Counts     Total
Amharic           Domestic Language    10,116    23,771
                  Regional             13,655
English           Foreign Language     11,276    11,276
Amharic-English   Monitoring              494     3,610
                  Information           3,116

Table 1: ENA news corpus
ENA English-Amharic parallel news corpus
• Count Summary: ENA news corpus

Collected                             Amharic    English      Total
Counts of Raw       Documents          23,771     11,276     35,047
                    Sentences         322,673    212,050    534,723
                    Words           5,277,711  3,704,644  8,982,355
                    Vocabularies      270,786    130,803    401,589
Counts of Aligned   Documents           1,036      1,036      2,072
                    Sentences          26,112     25,834     51,946
                    Words             207,200    198,461    405,661
                    Vocabularies       36,519     17,987     54,506

Table 2: The status of English-Amharic parallel news corpus on May 25, 2011
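Counts of this kind (documents, sentences, words, vocabularies) can be produced with a few lines of code. The sketch below is an assumption about how such counts might be obtained, not the procedure used for the table: `corpus_counts` is a hypothetical helper, words are taken to be whitespace-separated tokens, and the vocabulary is taken to be the set of distinct surface forms.

```python
def corpus_counts(documents):
    """Count documents, sentences, words and vocabulary size.

    `documents` is a list of documents, each given as a list of
    sentence strings (an assumed input format).
    """
    vocabulary = set()
    n_sentences = 0
    n_words = 0
    for document in documents:
        for sentence in document:
            tokens = sentence.split()  # whitespace tokens (assumption)
            n_sentences += 1
            n_words += len(tokens)
            vocabulary.update(tokens)
    return {
        "documents": len(documents),
        "sentences": n_sentences,
        "words": n_words,
        "vocabulary": len(vocabulary),
    }
```

Running the same function on the Amharic and on the English side gives the two columns; the Total column is their sum.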
ENA English-Amharic parallel news corpus
• Manual alignment at document level: Challenges
  – Easy: preprocessing, including exporting from the SQL database to Word and converting to Unicode using the Zilla word to text converter
  – Time consuming: aligning at document level is difficult because the files are stored in different folders with no structure, and the dates, punctuation and heading information differ between the two languages (parallel/comparable corpus)
  – Document level alignment is done by looking at the headings and picking the news id from the folders
ENA English-Amharic parallel news corpus
• Automatically aligned English-Amharic sample ENA news corpora at document level
• The aligner takes the following into consideration when aligning the news items:
  – Start from the English corpus (which constitutes 32% of the corpus).
  – Match only news items whose story languages differ.
  – Limit the search to the neighboring Amharic corpus, looking at 80 files around the current file.
  – Use a scoring method that gives equal weight to all matching columns.
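The aligner described above can be sketched as follows. This is only an illustration under stated assumptions: the column names in `MATCH_COLUMNS` and the `position` field (the file's place in the folder order) are hypothetical, and "80 files around the current file" is read here as 80 positions on either side.

```python
WINDOW = 80  # from the slide; read here as +/-80 positions (assumption)
MATCH_COLUMNS = ("date", "category", "place")  # hypothetical column names


def score(english_item, amharic_item):
    """Equal-weight score: one point for every matching column."""
    return sum(
        1 for column in MATCH_COLUMNS
        if english_item[column] == amharic_item[column]
    )


def align_documents(english_items, amharic_items):
    """Map each English id to the best-scoring nearby Amharic id.

    Starting from the English side, only Amharic items inside the
    window are considered; None means no column matched at all.
    """
    alignment = {}
    for en in english_items:
        best_id, best_score = None, 0
        for am in amharic_items:
            if abs(am["position"] - en["position"]) > WINDOW:
                continue  # outside the neighborhood of the current file
            current = score(en, am)
            if current > best_score:
                best_id, best_score = am["id"], current
        alignment[en["id"]] = best_id
    return alignment
```

Because several English items may select the same Amharic item, a scheme like this can leave the number of unique Amharic documents below the number of English ones.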
ENA English-Amharic parallel news corpus
• The output of the automatic aligner

Aligned Corpus            Counts  Cumulative       %
1-1                          383         383    0.37
1-2                          155         538    0.52
1-3                          498       1,036    1.00
Total Exact Matches          880                0.85
Unique Amharic Corpus        968                0.93
Unique English Corpus      1,036                1.00

Table 4: Automatically Aligned English-Amharic Sample ENA news items
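The Cumulative and % columns follow from the counts by a running sum divided by the total. A short sketch of that arithmetic, with a hypothetical helper name:

```python
from itertools import accumulate


def cumulative_table(counts):
    """Return (cumulative counts, cumulative fraction of the total)."""
    cumulative = list(accumulate(counts))
    total = cumulative[-1]
    fractions = [round(value / total, 2) for value in cumulative]
    return cumulative, fractions


# Counts for the 1-1, 1-2 and 1-3 rows of the sample corpus.
cumulative, fractions = cumulative_table([383, 155, 498])
```

The values in the % column are therefore fractions of the 1,036 sample documents rather than percentages.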
ENA English-Amharic parallel news corpus
• Some of the sample English documents were better aligned with a document not seen during manual alignment, e.g.
  – 41827 41791 (manual 41827 41826)
• 85% of the automatic matches are identical to the manual alignment.
• The remaining 15% are new matches, which do not necessarily indicate errors.

Table ENA: Aligned Sample English/Amharic News corpus
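The 85% figure is the share of sample documents for which the automatic match equals the manual one (880 of 1,036). A comparison of that kind can be sketched as below; the function name and the dictionary format are assumptions, and reading the example pair as (English id, Amharic id) is also an assumption.

```python
def agreement(automatic, manual):
    """Fraction of manually aligned English ids whose automatic
    match is the same Amharic id."""
    shared = [en_id for en_id in manual if en_id in automatic]
    if not shared:
        return 0.0
    exact = sum(1 for en_id in shared if automatic[en_id] == manual[en_id])
    return exact / len(shared)


# The example from the slide: the aligner chose 41791, the annotator 41826.
automatic = {41827: 41791}
manual = {41827: 41826}
differs = automatic[41827] != manual[41827]

# Reported sample figure: 880 exact matches out of 1,036 documents.
sample_agreement = round(880 / 1036, 2)
```

A disagreement found this way only flags a pair for inspection; as noted above, it may be a valid alternative match rather than an error.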
ENA English-Amharic parallel news corpus
• Extended to automatically align the whole set of English-Amharic ENA news items

Aligned Corpus            Counts  Cumulative       %
1-1                        2,928       2,928    0.26
1-2                        1,535       4,463    0.40
1-3                        6,813      11,276    1.00
Unique Amharic Corpus     10,487                0.93
Unique English Corpus     11,276                1.00

Table 5: Automatically Aligned English-Amharic ENA news items
Parliamentary English-Amharic parallel proclamation corpus
• Proclamation coverage: Aug 21, 1995 - July 16, 2010

Collected                             Amharic    English      Total
Counts of Raw       Documents             632        632      1,264
Counts of Aligned   Documents             115        115        230
                    Sentences          19,115     25,730     44,845
                    Words             219,430    283,578    503,008
                    Vocabularies       32,299     17,908     50,207

Table 6: Aligned Parliamentary English-Amharic parallel proclamation corpus
Sentence level aligned English-Amharic parallel corpora
• The alignment process is similar for both the ENA news items and the proclamations.
• The alignment is done using a sentence aligner called Hunalign (similar to Gale and Church, 1993).
• Hunalign aligns bilingual text using sentence-length information, combined with a bilingual dictionary when one is supplied.
• An English-Amharic bilingual dictionary of 8,212 entries has been adopted and used (Armbruster, 2007).
• The aligner aligns English sentences to Amharic sentences in 0-1, 1-1 or 1-2 correspondences.
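The idea of length-based alignment with 0-1, 1-1 and 1-2 correspondences can be illustrated with a small dynamic program. This is a simplified sketch, not Hunalign's algorithm and not the probabilistic cost of Gale and Church: the penalties in `BEADS`, the `ratio` parameter and the cost function are all hypothetical choices, and the dictionary is ignored.

```python
import math

# Bead types (English sentences, Amharic sentences) named on the slide,
# each with a hypothetical penalty that favors 1-1 correspondences.
BEADS = {(1, 1): 0.0, (0, 1): 3.0, (1, 2): 1.0}


def length_cost(en_sents, am_sents, ratio=1.0):
    """Normalized mismatch of total character lengths.
    `ratio` is an assumed expected Amharic/English length ratio."""
    en_len = sum(len(s) for s in en_sents)
    am_len = sum(len(s) for s in am_sents)
    return abs(am_len - ratio * en_len) / max(en_len + am_len, 1)


def align_sentences(english, amharic):
    """Return the cheapest sequence of (English group, Amharic group)."""
    n, m = len(english), len(amharic)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if math.isinf(cost[i][j]):
                continue  # this cell cannot be reached
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = length_cost(english[i:ni], amharic[j:nj])
                candidate = cost[i][j] + penalty + step
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    if math.isinf(cost[n][m]):
        raise ValueError("no alignment possible with the allowed bead types")
    # Walk the back-pointers from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((english[pi:i], amharic[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads
```

With only these three bead types, a document pair can be aligned only if the Amharic side has at least as many sentences as the English side; otherwise the sketch raises an error.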
Sentence level aligned English-Amharic parallel corpora
• The result of the alignment at the sentence level for both the ENA news and the proclamations

Aligned Sentence Pairs     Counts
ENA Corpus                155,200
Proclamation Corpus        18,632
Total                     173,832

Table 7: Sentence aligned English-Amharic bilingual corpus
Way Forward
• To increase the size of the English-Amharic proclamation corpus as much as possible.
• To further analyze the experiments conducted so far.
• To increase the translation quality using linguistic knowledge at the morpho-syntactic level.