Machine Translation is an emerging field of Computer Science. Research has been done on Machine Translation systems for different language pairs using different approaches, including rule-based machine translation and Statistical Machine Translation (SMT). The goal of this project is to design a statistical machine translator for software language localization using the Moses decoder. The system is expected to automatically localize (translate) software content from English into Tamil using Statistical Machine Translation.
1. Statistical Machine Translation for Language Localisation
By Y. Achchuthan 2010/SP/007
Supervised by Mr. K. Sarveswaran
Department of Computer Science, University of Jaffna.
6. Introduction
• Localisation of software has become an inevitable part of software development.
• Two main approaches to Machine Translation: Rule-based Machine Translation and Statistical Machine Translation (SMT).
• Several frameworks have been implemented to carry out machine translation.
• SMT has a set of well-defined phases: corpus preparation, language modelling, training, testing, and evaluation.
10. Existing Efforts
• "Morphological Processing for English-Tamil Statistical Machine Translation"
• Applies suffix-separation rules to both languages and evaluates the impact of this pre-processing on the translation quality of phrase-based and hierarchical models, in terms of BLEU score and a small manual evaluation.
13. Step 1: Corpus Preparation [1/4]
• Data collection
• Data are collected from the language resource files of different open-source projects.
• An online Tamil corpus published by Loganathan Ramasamy and Ondrej Bojar is also used.

Source                       Sentences (no. of phrases)
Mozilla Firefox                                   4,568
Mozilla OS                                        3,465
Drupal                                            4,544
Moodle                                            4,355
Squirrel Mail                                     1,116
Tamil Glossary                                    2,567
Joomla                                            4,358
EnTam v2.0 (non-technical)                      169,871

Table 1: Collected parallel data from the Internet
14. Step 1: Corpus Preparation [2/4]
• Tokenization:
Spaces are inserted between words and punctuation.
Example (before):
smart search: manage search filters
smart search: search filters - new/edit
joomla update
private messages: inbox
private messages: read
private messages: write
Example (after):
smart search : manage search filters
smart search : search filters - new / edit
joomla update
private messages : inbox
private messages : read
private messages : write
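The tokenization step above can be sketched with a single substitution rule. This is a minimal illustration only, not the actual Moses tokenizer script; the punctuation set chosen here is an assumption based on the examples shown.

```python
import re

def tokenize(line):
    """Insert spaces around punctuation so each token stands alone."""
    line = re.sub(r'([:;,!?()\[\]/"-])', r' \1 ', line)
    return " ".join(line.split())  # collapse any doubled whitespace

print(tokenize("smart search: search filters - new/edit"))
# → smart search : search filters - new / edit
```

The real Moses tokenizer additionally handles abbreviations and language-specific rules, but the core idea is the same.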
15. Step 1: Corpus Preparation [3/4]
• True-casing:
Words in each sentence are converted to their most probable casing.
Example:
எந்த (40/40)
இதத (34/34)
சரியான (26/26)
அதைவடிவம் (1/1)
தட்டச்சியது (2/2)
பியூகெ-பூட்டியில் (1/1)
ந ாக்கும் (1/1)
ெட்டதைக்ெ (1/1)
தனித்த (4/4)
இதைப்பில் (1/1)
ொரைங்ெளால் (2/2)
கசாடுக்ெில் (2/2)
அறிக்தெதய (9/9)
அதைக்ெப்பட்ட (13/13)
preceding (2/2)
system (125/125)
project (20/20)
submit (2/3) / Submit (1/3)
electronic (1/1)
sector (2/2)
earlier (7/7)
threaded (2/2)
super (3/4) / Super (1/4)
registering (2/2)
wait (15/15)
p3p (8/8)
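The counts above, such as "submit (2/3) / Submit (1/3)", come from a truecasing model that records how often each word occurs in each casing and keeps the most frequent form. A minimal sketch of such a model (illustrative only, not Moses' truecasing script; the mini-corpus is hypothetical):

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    """Count how often each word occurs in each casing."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word in sent.split():
            counts[word.lower()][word] += 1
    # Keep the most frequent casing per lowercased word.
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def truecase(sentence, model):
    """Replace each word with its most probable casing, if known."""
    return " ".join(model.get(w.lower(), w) for w in sentence.split())

# Hypothetical mini-corpus: "submit" is lowercased 2 of 3 times.
corpus = ["Submit the form", "submit now", "submit again", "system error"]
model = train_truecaser(corpus)
print(truecase("SUBMIT System", model))  # → submit system
```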
16. Step 1: Corpus Preparation [4/4]
• Cleaning:
Long sentences and empty sentences are removed, as they can cause problems in the training pipeline; obviously misaligned sentence pairs are also removed.
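A minimal cleaning filter in the spirit of Moses' corpus-cleaning script might look like the following; the length cap of 80 tokens and length ratio of 9 are common defaults, and the helper name is our own.

```python
def clean_corpus(pairs, max_len=80, ratio=9.0):
    """Drop empty, overlong, and badly length-mismatched sentence pairs."""
    kept = []
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        if not s or not t:                            # empty side
            continue
        if len(s) > max_len or len(t) > max_len:      # too long to train on
            continue
        if len(s) > ratio * len(t) or len(t) > ratio * len(s):
            continue                                  # likely misaligned
        kept.append((src, tgt))
    return kept
```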
17. Step 2: Language Modeling
• A language model (LM) is used to improve the translation result
• Built on the target language
• An LM toolkit estimates n-gram probabilities from a given text corpus
• IRSTLM and KenLM are used to build the LM
Example:
ngram 1=13346
ngram 2=35419
ngram 3=11607
ngram 4=6390
1-grams:
-4.575466 ஏதுவான -0.10647591
-3.7375624 கபாத்தாதனக் -0.369015
-3.2596145 ொட்டுெிறது -1.0157927
-3.8978152 ெட்டுதரதயத் -0.27033526
-4.154526 நதர்ந்கதடுக்ெ -0.10647591
-3.8978152 தங்ெதள -0.12376224
-3.7375624 அனுைதிக்கும் -0.42978552
-4.154526 நைல்நதான்று -0.10647591
-5.135497 சாளரத்ததக் -0.10647591
-5.135497 படங்ெதளச் -0.10647591
2-grams:
-0.97480524 உருக்கள் எண்ணிக்கக -0.0629627
-1.1356568 ககோப்பகங்கள் எண்ணிக்கக -0.10245394
-1.6087823 பதிப்புகள் எண்ணிக்கக -0.10245394
-0.96094394 வகைபட எண்ணிக்கக -0.10245394
-1.2593822 வகைபடங்கள் எண்ணிக்கக -0.10245394
-0.96094394 நிைல்கள் எண்ணிக்கக -0.10245394
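The n-gram entries in the ARPA excerpt above are, at their core, log-scaled relative-frequency estimates. A toy bigram estimator shows the idea (unsmoothed, unlike IRSTLM and KenLM, which apply smoothing and back-off weights; the sample sentences are hypothetical):

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Unsmoothed maximum-likelihood bigram model with sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])            # history counts
        bigrams.update(zip(tokens, tokens[1:]))
    # log10 P(w2 | w1), the convention used in an ARPA file's n-gram sections
    return {bg: math.log10(n / unigrams[bg[0]]) for bg, n in bigrams.items()}

lm = train_bigram_lm(["the file", "the folder", "the file"])
print(lm[("the", "file")])  # log10(2/3), since "file" follows "the" 2 of 3 times
```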
18. Step 3: Word Alignment
• Phrase extraction and scoring
• Most current phrase-based SMT systems rely on the IBM Models (specifically Model 4) for word alignment; the most popular implementation is GIZA++
• The algorithm is run in both directions: source to target and target to source
Examples below: GIZA++ word-alignment output, followed by a phrase-table excerpt.
# Sentence pair (364) source length 2 target length 3 alignment score : 0.00613603
central control unit
NULL ({ }) தையக் ({ 1 }) ெட்டுப்பாட்டெம் ({ 2 3 })
# Sentence pair (445) source length 2 target length 2 alignment score : 0.295143
data declaration
NULL ({ }) தரவுப் ({ 1 }) பிரெடனம் ({ 2 })
# Sentence pair (474) source length 2 target length 2 alignment score : 0.151245
data import
NULL ({ }) தரவு ({ 1 }) இறக்குைதி ({ 2 })
cache controller ||| விதரநவெ ெட்டுப்பாட்டெம் ||| 1 0.1875 1 0.0582878 |||
0-0 1-0 1-1 ||| 1 1 1 |||
center ||| தையம் ||| 0.625 0.625 0.769231 0.555556 ||| 0-0 ||| 16 13 10 |||
central control unit ||| தையக் ெட்டுப்பாட்டெம் ||| 1 0.0390625 1 0.0136171 |||
0-0 0-1 1-1 2-1 ||| 1 1 1 |||
central control ||| தையக் ெட்டுப்பாட்டு ||| 1 0.75 1 0.0375 ||| 0-0 1-1 |||
1 1 1 |||
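GIZA++'s alignments are bootstrapped from IBM Model 1, whose EM training fits in a few lines. This is a toy, one-direction sketch only; real training proceeds through Models 2-4, runs both directions, and symmetrizes. The two-sentence corpus is hypothetical.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Toy one-direction IBM Model 1 EM training over (src, tgt) sentence pairs."""
    t = defaultdict(lambda: 1.0)              # uniform initial t(f|e)
    for _ in range(iterations):
        count = defaultdict(float)            # expected counts c(f, e)
        total = defaultdict(float)            # normalizers per source word e
        for src, tgt in pairs:
            for f in tgt.split():
                z = sum(t[(f, e)] for e in src.split())
                for e in src.split():
                    c = t[(f, e)] / z         # E-step: fractional alignment count
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-normalize expected counts into probabilities
        t = defaultdict(float, {fe: count[fe] / total[fe[1]] for fe in count})
    return t

# Hypothetical corpus: "A" co-occurs with "central" in both pairs, so it should win.
pairs = [("central control", "A B"), ("central unit", "A C")]
t = ibm_model1(pairs)
```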
19. Step 4: Decoding
• Find the translation of a sentence that has the maximum probability
• Probabilistic model for phrase-based translation:
e_best = argmax_e ∏_{i=1}^{I} φ(f_i | e_i) · d(start_i − end_{i−1} − 1) · p_LM(e)
• Components
• Phrase translation: pick phrase f_i to be translated as phrase e_i
• look up the score φ(f_i | e_i) in the phrase translation table
• Reordering: the previous phrase ended at end_{i−1}, the current phrase starts at start_i
• compute d(start_i − end_{i−1} − 1)
• Language model: for an n-gram model, keep track of the last n − 1 words
• compute the score p_LM(w_i | w_{i−(n−1)}, …, w_{i−1}) for each added word w_i
• The Moses toolkit is used to carry out the decoding process
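The product above can be evaluated for one candidate segmentation as follows. The phrase table, bigram language model, and distortion base alpha are toy illustration values, not Moses internals; a real decoder searches over all segmentations and orderings.

```python
def hypothesis_score(phrases, phrase_table, lm, alpha=0.6):
    """Score one segmentation: phrase probabilities x distortion x bigram LM.

    phrases: list of (src_phrase, tgt_phrase, src_start, src_end) in output order.
    """
    score, prev_end, out_words = 1.0, -1, []
    for src, tgt, start, end in phrases:
        score *= phrase_table[(src, tgt)]            # phi(f_i | e_i)
        score *= alpha ** abs(start - prev_end - 1)  # d(start_i - end_{i-1} - 1)
        prev_end = end
        out_words.extend(tgt.split())
    for w1, w2 in zip(["<s>"] + out_words, out_words + ["</s>"]):
        score *= lm.get((w1, w2), 1e-4)              # p_LM(e), bigram by bigram
    return score

# Toy tables: one phrase pair, translated monotonically (distortion penalty = 1).
phrase_table = {("central control", "X Y"): 0.5}
lm = {("<s>", "X"): 0.5, ("X", "Y"): 0.5, ("Y", "</s>"): 0.5}
print(hypothesis_score([("central control", "X Y", 0, 1)], phrase_table, lm))
# → 0.0625  (= 0.5 phrase score x 0.5^3 LM score)
```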
20. Step 5: Evaluation
• Automatic evaluation
BLEU (BiLingual Evaluation Understudy) is an algorithm for evaluating the quality of text
which has been machine-translated from one natural language to another.
BLEU = min(1, output_length / reference_length) · ( ∏_{i=1}^{4} precision_i )^{1/4}
• Human evaluation
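The BLEU formula above, a brevity penalty times the geometric mean of the 1- to 4-gram modified precisions, can be sketched for a single sentence pair. Note that real BLEU is computed over a whole test corpus, usually with multiple references; this toy version follows the slide's min(1, ·) brevity penalty.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: brevity penalty x geometric mean of n-gram precisions."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(zip(*[cand[i:] for i in range(n)]))
        ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
        # Modified precision: clip each n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if overlap == 0:
            return 0.0                       # one zero precision zeroes the product
        log_prec += math.log(overlap / sum(cand_ngrams.values())) / max_n
    bp = min(1.0, len(cand) / len(ref))      # brevity penalty, as in the formula above
    return bp * math.exp(log_prec)

print(bleu("a b c d", "a b c d e"))  # → 0.8 (perfect precisions, short candidate)
```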
22. Architecture Overview
• Parallel corpus → word alignment (GIZA++) → phrase extraction → phrase table
• Target-language corpus → language modeling (IRSTLM & KenLM) → language model
• Phrase table + language model → decoder (Moses toolkit)
• The decoder runs on an SMT server behind a web service: the web server submits a .po file and receives the translated .po file back
28. Conclusion
• Localisation can be done using SMT; however, it can be improved if more parallel data are collected.
• SMT output is better for a specific domain than for a generic domain.
• Compared to IRSTLM, KenLM performs better.
30. Deliverables
• Dissertation
• An online interface for Tamil language localization using SMT
• A web service for Tamil language localization
• A research article
34. Selected References
• Zdenek Žabokrtský, Loganathan Ramasamy, Ondrej Bojar. "Morphological Processing for English-Tamil Statistical Machine Translation." 24th International Conference on Computational Linguistics.
• Sripirakas, S.; Weerasinghe, A.R.; Herath, D.L. "Statistical machine translation of systems for Sinhala - Tamil." Advances in ICT for Emerging Regions (ICTer), 2010 International Conference on, pp. 62-68, Sept. 29 - Oct. 1, 2010.
• Germann, Ulrich. "Building a statistical machine translation system from scratch: how much bang for the buck can we expect?" Proceedings of the Workshop on Data-driven Methods in Machine Translation, Volume 14. Association for Computational Linguistics, 2001.
• Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst. "Moses: Open Source Toolkit for Statistical Machine Translation." Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.