Statistical Machine Translation for Language Localisation
By Y. Achchuthan 2010/SP/007
Supervised by Mr. K. Sarveswaran
Department of Computer Science, University of Jaffna.
Outline
• Motivation
• Introduction
• Problem Definition
• Methodology
• Architecture Overview & Experimental Setup
• Result
• Discussions
• Conclusion
• Deliverable
• References
• Demo
Motivation
Statistical Machine Translation (SMT)
Localisation
Introduction
• Localisation has become an inevitable part of software development.
• Machine Translation systems include Rule-based Machine Translation and Statistical Machine Translation (SMT).
• Several frameworks have been implemented to carry out Machine Translation.
• SMT has a set of defined phases: Corpus Preparation, Language Modelling, Training, Testing and Evaluation.
Problem Definition
Study whether Statistical Machine Translation can be used for
Language localisation of software.
Existing Efforts
• Morphological Processing for English-Tamil Statistical Machine Translation
• This work applies suffix-separation rules to both languages and evaluates the impact of this pre-processing on the translation quality of the phrase-based and hierarchical models, in terms of BLEU score and a small manual evaluation.
Methodology
Overview: Corpus Preparation -> Language Modelling -> Word Alignment -> Decoding -> Evaluation
Step 1: Corpus Preparation [1/4]
• Data Collection
• Data are collected from language
resource files of different open source
projects.
• An online Tamil corpus published by Loganathan Ramasamy and Ondrej Bojar is also used.
Source                          Sentences (no. of phrases)
Mozilla Firefox                 4,568
Mozilla OS                      3,465
Drupal                          4,544
Moodle                          4,355
Squirrel Mail                   1,116
Tamil Glossary                  2,567
Joomla                          4,358
EnTam v2.0 (non technical)      169,871
Table 1: Collected parallel data from the Internet
Step 1: Corpus Preparation [2/4]
• Tokenization:
This means that spaces have to be inserted between words and punctuation.
Example:
Before tokenization:
smart search: manage search filters
smart search: search filters - new/edit
joomla update
private messages: inbox
private messages: read
private messages: write
After tokenization:
smart search : manage search filters
smart search : search filters - new / edit
joomla update
private messages : inbox
private messages : read
private messages : write
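The tokenization step above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer used in the project: the punctuation set in `_PUNCT` and the function name `tokenize` are assumptions chosen to reproduce the example lines.

```python
import re

# Assumed punctuation set for illustration; a production tokenizer
# handles many more cases (abbreviations, URLs, numbers, ...).
_PUNCT = re.compile(r"([:/,;!?()\[\]])")


def tokenize(line: str) -> str:
    """Insert spaces around punctuation and normalise whitespace."""
    spaced = _PUNCT.sub(r" \1 ", line)
    return " ".join(spaced.split())


print(tokenize("smart search: search filters - new/edit"))
```

Collapsing repeated whitespace after the substitution keeps the output stable when the input already has spaces around some punctuation.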
Step 1: Corpus Preparation [3/4]
• True-casing:
Words in each sentence are converted to their most probable casing.
Example:
எந்த (40/40)
இதத (34/34)
சரியான (26/26)
அதைவடிவம் (1/1)
தட்டச்சியது (2/2)
பியூகெ-பூட்டியில் (1/1)
ந ாக்கும் (1/1)
ெட்டதைக்ெ (1/1)
தனித்த (4/4)
இதைப்பில் (1/1)
ொரைங்ெளால் (2/2)
கசாடுக்ெில் (2/2)
அறிக்தெதய (9/9)
அதைக்ெப்பட்ட (13/13)
preceding (2/2)
system (125/125)
project (20/20)
submit (2/3) / Submit (1/3)
electronic (1/1)
sector (2/2)
earlier (7/7)
threaded (2/2)
super (3/4) Super (1)
registering (2/2)
wait (15/15)
p3p (8/8)
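A simplified sketch of how such a true-casing model could be built and applied: count every surface form of each word and keep the most frequent one. The function names and the toy corpus are hypothetical, and real truecasers apply extra rules (for example for sentence-initial words) that are left out here.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Map each lower-cased word to its most frequent surface form."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def truecase(sentence, model):
    """Replace known words by their preferred casing; keep unknown words."""
    return " ".join(model.get(tok.lower(), tok) for tok in sentence.split())


# Toy corpus (made up): "submit" appears twice, "Submit" once.
corpus = ["submit the form", "Submit now", "please submit"]
model = train_truecaser(corpus)
print(truecase("Submit the project", model))
```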
Step 1: Corpus Preparation [4/4]
• Cleaning:
Long sentences and empty sentences are removed as they can cause
problems with the training pipeline, and obviously misaligned sentences are
removed.
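The cleaning rule can be sketched as a filter over sentence pairs. The thresholds `max_len=80` and `max_ratio=9.0` are illustrative assumptions, not values reported in the slides; the length ratio is used here as a rough proxy for "obviously misaligned".

```python
def clean_corpus(pairs, max_len=80, max_ratio=9.0):
    """Drop empty, overly long, and badly length-mismatched sentence pairs."""
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # empty side
        if n_src > max_len or n_tgt > max_len:
            continue  # too long for the training pipeline
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # length ratio suggests misalignment
        kept.append((src, tgt))
    return kept


# Placeholder target strings stand in for Tamil sentences.
pairs = [
    ("data import", "t1 t2"),
    ("", "t1"),
    ("a " * 100, "t1"),
    ("a", "t " * 20),
]
print(clean_corpus(pairs))
```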
Step 2: Language Modeling
• A Language Model (LM) is used to improve the translation result.
• It is built from text in the target language.
• A language model toolkit estimates n-gram probabilities from a given text corpus.
• IRSTLM and KenLM are used to build the LMs.
Example:
ngram 1= 13346
ngram 2= 35419
ngram 3= 11607
ngram 4= 6390
1-grams:
-4.575466 ஏதுவான -0.10647591
-3.7375624 கபாத்தாதனக் -0.369015
-3.2596145 ொட்டுெிறது -1.0157927
-3.8978152 ெட்டுதரதயத் -0.27033526
-4.154526 நதர்ந்கதடுக்ெ -0.10647591
-3.8978152 தங்ெதள -0.12376224
-3.7375624 அனுைதிக்கும் -0.42978552
-4.154526 நைல்நதான்று -0.10647591
-5.135497 சாளரத்ததக் -0.10647591
-5.135497 படங்ெதளச் -0.10647591
2-grams:
-0.97480524 உருக்கள் எண்ணிக்கக -0.0629627
-1.1356568 ககோப்பகங்கள் எண்ணிக்கக -0.10245394
-1.6087823 பதிப்புகள் எண்ணிக்கக -0.10245394
-0.96094394 வகைபட எண்ணிக்கக -0.10245394
-1.2593822 வகைபடங்கள் எண்ணிக்கக -0.10245394
-0.96094394 நிைல்கள் எண்ணிக்கக -0.10245394
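In the listing above, the first column of each entry is a log10 probability and the trailing column is a back-off weight, as in the ARPA format. The sketch below shows only the simplest part of the idea: unsmoothed maximum-likelihood bigram probabilities. Toolkits such as IRSTLM and KenLM additionally apply smoothing and compute back-off weights, which this illustration omits; the function name and toy sentences are assumptions.

```python
import math
from collections import Counter


def bigram_logprobs(sentences):
    """Return log10 P(word | previous word) by maximum likelihood."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return {
        (h, w): math.log10(count / histories[h])
        for (h, w), count in bigrams.items()
    }


lm = bigram_logprobs(["joomla update", "joomla update", "joomla search"])
print(lm[("joomla", "update")])  # log10(2/3)
```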
Step 3: Word Alignment
• Phrase extraction and scoring
• Most current phrase-based SMT systems rely on the IBM Models (specifically Model 4) for word alignment. The most popular implementation is GIZA++.
• The alignment algorithm is run in both directions, source to target and target to source.
Example: Word Alignment
# Sentence pair (364) source length 2 target length 3 alignment score : 0.00613603
central control unit
NULL ({ }) தையக் ({ 1 }) ெட்டுப்பாட்டெம் ({ 2 3 })
# Sentence pair (445) source length 2 target length 2 alignment score : 0.295143
data declaration
NULL ({ }) தரவுப் ({ 1 }) பிரெடனம் ({ 2 })
# Sentence pair (474) source length 2 target length 2 alignment score : 0.151245
data import
NULL ({ }) தரவு ({ 1 }) இறக்குைதி ({ 2 })
Example: Phrase table
cache controller ||| விதரநவெ ெட்டுப்பாட்டெம் ||| 1 0.1875 1 0.0582878 ||| 0-0 1-0 1-1 ||| 1 1 1 |||
center ||| தையம் ||| 0.625 0.625 0.769231 0.555556 ||| 0-0 ||| 16 13 10 |||
central control unit ||| தையக் ெட்டுப்பாட்டெம் ||| 1 0.0390625 1 0.0136171 ||| 0-0 0-1 1-1 2-1 ||| 1 1 1 |||
central control ||| தையக் ெட்டுப்பாட்டு ||| 1 0.75 1 0.0375 ||| 0-0 1-1 ||| 1 1 1 |||
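Running the aligner in both directions yields two sets of alignment points that have to be combined before phrases are extracted. The sketch below shows the two simplest combinations, intersection and union; heuristics such as grow-diag-final produce a result between these two. The function name and the example points are hypothetical.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Combine two directional alignments.

    src_to_tgt: iterable of (source_index, target_index) pairs
    tgt_to_src: iterable of (target_index, source_index) pairs
    Returns (intersection, union) as sets of (source_index, target_index).
    """
    forward = set(src_to_tgt)
    backward = {(s, t) for (t, s) in tgt_to_src}
    return forward & backward, forward | backward


# Made-up alignment points for a 3-word source and 2-word target phrase.
forward = [(0, 0), (1, 1), (2, 1)]
backward = [(0, 0), (1, 1)]
intersection, union = symmetrize(forward, backward)
print(sorted(intersection), sorted(union))
```

The intersection keeps only high-confidence points, while the union keeps every point proposed by either direction.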
Step 4: Decoding
• Find the translation of a sentence that has the maximum probability
• Probabilistic model for phrase-based translation:
$$e_{best} = \operatorname{argmax}_{e} \prod_{i=1}^{I} \phi(f_i \mid e_i)\, d(start_i - end_{i-1} - 1)\, p_{LM}(e)$$
• Components
  • Phrase translation: picking phrase $f_i$ to be translated as a phrase $e_i$
    • look up score $\phi(f_i \mid e_i)$ from the phrase translation table
  • Reordering: previous phrase ended in $end_{i-1}$, current phrase starts at $start_i$
    • compute $d(start_i - end_{i-1} - 1)$
  • Language model: for an n-gram model, need to keep track of the last $n - 1$ words
    • compute score $p_{LM}(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$ for added words $w_i$
• The Moses toolkit is used for decoding.
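To make the three components concrete, the sketch below scores a single, already segmented hypothesis in log space. It is not how the Moses decoder is implemented (Moses searches over many hypotheses); it only evaluates the product in the formula. The reordering cost is assumed to be the distance-based form d(x) = alpha^|x|, and `alpha`, the function name, and the example numbers are illustrative assumptions.

```python
import math


def hypothesis_log_score(phrases, lm_logprob, alpha=0.5):
    """Log of: product of phi * d(start_i - end_{i-1} - 1), times p_LM(e).

    phrases: list of (phi, start, end) in target order, where start and end
             are 1-based source positions covered by the phrase.
    lm_logprob: natural-log language model probability of the full output.
    alpha: assumed base of the distance-based reordering model.
    """
    score = lm_logprob
    prev_end = 0
    for phi, start, end in phrases:
        distance = start - prev_end - 1
        score += math.log(phi) + abs(distance) * math.log(alpha)
        prev_end = end
    return score


lm = math.log(0.01)
monotone = hypothesis_log_score([(0.75, 1, 2), (0.5, 3, 3)], lm)
reordered = hypothesis_log_score([(0.5, 3, 3), (0.75, 1, 2)], lm)
print(monotone, reordered)
```

With the same phrase and language model scores, the monotone order has zero reordering distance and therefore scores higher than the reordered hypothesis.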
Step 5: Evaluation
• Automatic evaluation
BLEU (BiLingual Evaluation Understudy) is an algorithm for evaluating the quality of text
which has been machine-translated from one natural language to another.
$$BLEU = \min\left(1, \frac{\text{output length}}{\text{reference length}}\right) \left(\prod_{i=1}^{4} precision_i\right)^{\frac{1}{4}}$$
• Human evaluation
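The BLEU formula above can be sketched for a single sentence pair as follows. This follows the simplified form on the slide (length ratio as brevity penalty, clipped n-gram precisions up to 4-grams, one reference); standard evaluation scripts compute BLEU over a whole test set, so the numbers from this illustration are not comparable to reported scores. Function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(output, reference, max_n=4):
    """Sentence-level BLEU with a single reference, as in the slide formula."""
    out, ref = output.split(), reference.split()
    if not out:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        out_counts, ref_counts = ngrams(out, n), ngrams(ref, n)
        total = sum(out_counts.values())
        # Clipped matches: an n-gram counts at most as often as in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        precisions.append(matched / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0
    brevity = min(1.0, len(out) / len(ref))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)


print(bleu("a b c d", "a b c d e"))
```

Because the score is a geometric mean, a single n-gram order with no matches drives the whole score to zero, which is one reason short localisation strings are hard to evaluate with BLEU.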
Architecture Overview
SMT Server: Parallel Corpus -> Word Alignment using GIZA++ -> Phrase Extraction -> Phrase Table
SMT Server: Parallel Corpus -> Language Modelling using IRSTLM & KenLM -> Language Model
SMT Server: Phrase Table + Language Model -> Decoder (using Moses toolkit)
Web Server: .po File -> Web Service -> Decoder -> Translated .po
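The .po round trip in the architecture can be sketched as below. The `translate` argument stands in for a call to the SMT web service and is purely hypothetical, as is the function name; the sketch handles only single-line `msgid`/`msgstr` entries, fills only empty `msgstr` values, and does not escape quotes in the translation.

```python
import re

_MSGID = re.compile(r'^msgid "(.*)"$')


def translate_po(po_text, translate):
    """Fill empty msgstr entries using translate(msgid)."""
    out, pending = [], None
    for line in po_text.splitlines():
        match = _MSGID.match(line)
        if match:
            pending = match.group(1)  # remember the source string
            out.append(line)
        elif line == 'msgstr ""' and pending:
            out.append('msgstr "%s"' % translate(pending))
            pending = None
        else:
            out.append(line)  # header entry, comments, existing translations
    return "\n".join(out)


po = 'msgid ""\nmsgstr ""\n\nmsgid "joomla update"\nmsgstr ""'
# str.upper is a stand-in for the real translation call.
print(translate_po(po, str.upper))
```

Leaving existing translations untouched means the service only has to handle strings that translators have not yet covered.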
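Result
[Figure: bar chart comparing language models built with IRSTLM and KenLM at 2-gram, 3-gram and 4-gram order; vertical axis from 0 to 70. Data labels in extraction order: 14.43, 25.87, 55.06, 12.74, 3.04, 5.56; 13.67, 27.72, 56.44, 13.42, 2.61, 6.08; 13.57, 28.28, 56.73, 13.64, 2.63, 6.11.]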
Discussion
• Unavailability of parallel data
• Variations in collected parallel data
• BLEU scoring is optimized for the generic domain
Conclusion
• Localisation can be done using SMT. However, the results can be improved if more parallel data can be collected.
• SMT output is better for a specific domain than for the generic domain.
• Compared to IRSTLM, KenLM performs better.
Deliverable
• Dissertation
• An online interface for Tamil language localization using SMT
• A web service for Tamil language localization
• A research article
Future Work
• Test Factored Translation Models
• Study the Evaluation method and word alignment algorithm
• Improve the SMT performance
Selected References
• Zdeněk Žabokrtský, Loganathan Ramasamy, Ondřej Bojar. "Morphological Processing for English-Tamil Statistical Machine Translation." 24th International Conference on Computational Linguistics.
• S. Sripirakas, A.R. Weerasinghe, D.L. Herath. "Statistical machine translation of systems for Sinhala - Tamil." 2010 International Conference on Advances in ICT for Emerging Regions (ICTer), pp. 62-68, Sept. 29 - Oct. 1, 2010.
• Ulrich Germann. "Building a statistical machine translation system from scratch: how much bang for the buck can we expect?" Proceedings of the Workshop on Data-driven Methods in Machine Translation, Volume 14. Association for Computational Linguistics, 2001.
• Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, Evan Herbst. "Moses: Open Source Toolkit for Statistical Machine Translation." Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.
DEMO
URL: 10.20.10.211/smt
: 10.20.10.125/smt


Editor's Notes

  1. You are not allowed to edit this post. You do not have the right to publish this post.