Moses

On the Utility of Moses for Sinhala Tamil Translation
Yashothara.S, Dr.R.T.Uthayasanker
National Language Processing center

Outline
• Background: Statistical Machine Translation (SMT)
• Introduction to Moses
• Training
• Decoder

Machine Translation
• Process of translating from one language into another language using a computer
• Types of machine translation
  • Rule based
  • Example based
  • Knowledge based
  • Statistical based
  • Hybrid model based
  • Neural network based
[Figure: Source → Computer → Target]

Statistical Machine Translation
Hmmm. Every time she sees “මගේ”, she either types “எனது” or “என்னுடைய” … but if she sees “මගේ නම” she always types “எனது”.
[Figure: source (S) and target (T) sentence pairs, captioned "Translate, translate …", forming the Parallel Corpus]

Statistical Machine Translation
s: Sinhala (source)
t: Tamil (target)
P(t|s) ∝ P(s|t) × P(t)
TM (translation model): P(s|t)
LM (language model): P(t)
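As a rough illustration of how the two models combine, the sketch below scores a few candidate Tamil outputs for “මගේ නම” by multiplying a translation-model score by a language-model score and keeping the best one. The candidates and all probabilities are invented for illustration; they are not taken from a trained system.

```python
# Toy noisy-channel scoring: choose the target t that maximizes P(s|t) * P(t).
# Every number below is a made-up illustrative value.

# (candidate Tamil output, assumed TM score P(s|t), assumed LM score P(t))
candidates = [
    ("எனது பெயர்", 0.50, 0.020),
    ("என்னுடைய பெயர்", 0.30, 0.005),
    ("பெயர் எனது", 0.50, 0.001),
]


def best_translation(options):
    """Return the (text, score) pair with the highest TM * LM product."""
    scored = [(text, tm * lm) for text, tm, lm in options]
    return max(scored, key=lambda pair: pair[1])


best_text, best_score = best_translation(candidates)
print(best_text, best_score)
```

The third candidate has the same TM score as the first but a much lower LM score, which is how the language model penalizes an unnatural word order.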

Statistical Machine Translation
[Figure: the Translation Model (TM) and the Language Model (LM) feed the Decoder]
Input (Sinhala): මගේ ගම යාපනය වේ.
Output (Tamil): எனது ஊர் யாழ்ப்பாணம்.

Moses
• Open source SMT framework
• Language independent
• Plug and play
Steps
1. Preprocessing
2. Translation Model Building
3. Language Model Building
4. Decoding
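The four steps correspond to command-line tools distributed with Moses and KenLM. The sketch below only assembles the command strings, in the style of the Moses baseline setup, and does not run them. The install path, file names, working directory, and the language codes `si` and `ta` are assumptions; the options should be checked against the installed version.

```python
# Assemble (but do not run) the commands for each pipeline step.
# MOSES, the file names, and the si/ta codes are hypothetical placeholders.
MOSES = "~/mosesdecoder"


def pipeline_commands(corpus="corpus", src="si", tgt="ta", work="work"):
    tok = f"{MOSES}/scripts/tokenizer/tokenizer.perl"
    clean = f"{MOSES}/scripts/training/clean-corpus-n.perl"
    train = f"{MOSES}/scripts/training/train-model.perl"
    return [
        # Step 1: tokenize both sides, then drop empty or overlong pairs.
        f"{tok} -l {src} < {corpus}.{src} > {corpus}.tok.{src}",
        f"{tok} -l {tgt} < {corpus}.{tgt} > {corpus}.tok.{tgt}",
        f"{clean} {corpus}.tok {src} {tgt} {corpus}.clean 1 80",
        # Step 3: trigram language model on the target side. It is built
        # before step 2 here because the training script refers to it.
        f"{MOSES}/bin/lmplz -o 3 < {corpus}.clean.{tgt} > {work}/lm.arpa",
        # Step 2: word alignment and phrase table (translation model).
        f"{train} -root-dir {work}/train -corpus {corpus}.clean "
        f"-f {src} -e {tgt} -alignment grow-diag-final-and "
        f"-lm 0:3:{work}/lm.arpa:8 -external-bin-dir {MOSES}/tools",
        # Step 4: decode new input with the trained configuration.
        f"{MOSES}/bin/moses -f {work}/train/model/moses.ini "
        f"< input.{src} > output.{tgt}",
    ]


for command in pipeline_commands():
    print(command)
```

In practice the language model path passed to the training script usually needs to be absolute; the relative path here is kept only for brevity.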

Step 1: Preprocessing
• Tokenization: splitting sentences into tokens
• The tokenizer.perl script can be used.
Example:
Before tokenizing
සේවක සංඛ්‍යා තොරතුරු ලබා දෙන්න.
ஆளணி தகவல்களை வழங்கவும்.
After tokenizing
සේවක සංඛ්‍ යා තොරතුරු ලබා දෙන න .
ஆளணி தகவல ் களை வழங ் கவும ் .
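In the "After tokenizing" example the words are broken apart at combining marks (the Tamil pulli, for instance, ends up as a separate token). One possible workaround, sketched here as an assumption and not as the authors' tokenizer, is to treat Unicode letters, combining marks, digits, and the zero-width joiners as word characters, and to split only at whitespace and punctuation.

```python
import unicodedata

# Zero-width joiner / non-joiner occur inside Sinhala conjuncts.
JOINERS = {"\u200d", "\u200c"}


def is_word_char(ch):
    """Letters (L*), combining marks (M*), digits (N*) and joiners stay in a word."""
    return unicodedata.category(ch)[0] in ("L", "M", "N") or ch in JOINERS


def tokenize(text):
    """Split on whitespace and emit each punctuation character as its own token."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
            continue
        if current:
            tokens.append("".join(current))
            current = []
        if not ch.isspace():
            tokens.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


sample = "எனது பெயர் கீதா."
print(tokenize(sample))
```

Keeping marks attached to their base letters means vowel signs and the pulli never become stand-alone tokens, while the sentence-final period is still separated.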

Step 1: Preprocessing
• Cleaning: removing low quality sentences
• clean-corpus-n.perl can be used.
Sinhala | Tamil
එම තැනැත්තාට පැමිණීමට නොහැකි තත්ත්වයක් උද්ගත වූවේ නම් ඒ බව සනාථ කිරීමෙන් අනතුරුව සුදුසු…. | அவருக்கு பாடசாலைக்குச் சமுகமளிக்கமுடியாத சந்தர்ப்பத்தில் அதுபற்றி உறுதிப்படுத்தியதன் பின்னர் பொருத்தமான ஒருவருக்கு…..
எனது பெயர் கீதா.
විශ්ව විද්‍යාලයීය අධ්‍යාපනය ඇතුළු උසස් අධ්‍යාපනයට ප්‍රවේශ වීමට ඉංග්‍රීසි දැනුම උපකාර වනු ඇත.
பேருந்து
මගේ මිතුරියගේ නම ශුබා ය. | எனது நண்பியின் பெயர் சுபா.
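The kind of filtering that a corpus-cleaning script performs can be sketched as a simple length check on each sentence pair: drop pairs where either side is empty, too short, or too long, or where one side is far longer than the other. The thresholds below (1 to 80 tokens, length ratio up to 9) are assumptions chosen for illustration and should be tuned for the corpus at hand.

```python
def keep_pair(source, target, min_len=1, max_len=80, max_ratio=9.0):
    """Return True if a sentence pair passes the length and ratio checks."""
    s_len, t_len = len(source.split()), len(target.split())
    if s_len == 0 or t_len == 0:
        return False  # one side is empty
    if s_len < min_len or t_len < min_len:
        return False  # too short
    if s_len > max_len or t_len > max_len:
        return False  # too long
    # Reject pairs whose lengths are badly mismatched.
    return max(s_len, t_len) / min(s_len, t_len) <= max_ratio


def clean_corpus(pairs, **limits):
    """Keep only the (source, target) pairs that pass keep_pair."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, **limits)]


# Placeholder tokens stand in for real Sinhala and Tamil sentences.
toy_pairs = [
    ("s1 s2 s3 s4", "t1 t2 t3"),       # balanced pair: kept
    ("", "t1 t2 t3"),                  # empty source: dropped
    (" ".join(["s"] * 12), "t1"),      # ratio 12 > 9: dropped
    (" ".join(["s"] * 100), " ".join(["t"] * 95)),  # too long: dropped
]
print(clean_corpus(toy_pairs))
```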

[Figure: pipeline diagram showing the Parallel corpus, Translation Model (TM), Language Model (LM), and Decoder. Example pair: මගේ ගම යාපනය වේ. / எனது ஊர் யாழ்ப்பாணம்.]

Step 2: Building Translation Model
• Assigns a probability P(s|t) to each pair of source and target words/phrases
Sinhala | Tamil | φ(s|t)
මගේ | எனது | 0.66
මගේ | என்னுடைய | 0.22
මගේ පොත | எனது புத்தகம் | 0.12
මගේ නම ගීතා | எனது பெயர் கீதா | 0.22
E.g.
මගේ නම ගීතා වේ. | எனது பெயர் கீதா.
මගේ පොත. | என்னுடைய புத்தகம்.
Word alignment tool: GIZA++
[Figure: S and T → word alignment tool → P(s|t)]
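A common way to obtain such phrase scores is relative frequency over phrase pairs extracted from the word-aligned corpus: φ(s|t) = count(s, t) / count(t). The sketch below applies that formula to a few hypothetical extracted pairs. The pairs and counts are invented and do not reproduce the scores shown on this slide.

```python
from collections import Counter


def phrase_probabilities(phrase_pairs):
    """Relative-frequency estimate: phi(s|t) = count(s, t) / count(t)."""
    pair_counts = Counter(phrase_pairs)
    target_counts = Counter(target for _, target in phrase_pairs)
    return {
        (source, target): count / target_counts[target]
        for (source, target), count in pair_counts.items()
    }


# Hypothetical (source phrase, target phrase) pairs from an aligned corpus.
extracted = [
    ("මගේ", "எனது"),
    ("මගේ", "எனது"),
    ("මගේ නම", "எனது"),       # a noisy extraction, kept to show the split
    ("මගේ", "என்னுடைய"),
]

phi = phrase_probabilities(extracted)
for (source, target), prob in phi.items():
    print(source, "|", target, "|", round(prob, 2))
```

For each target phrase the scores of all its source phrases sum to 1, which is what makes φ(s|t) a conditional probability.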

[Figure: pipeline diagram showing the Monolingual corpus, Language Model (LM), Phrase Table, and Decoder. Example pair: මගේ ගම යාපනය වේ. / எனது ஊர் யாழ்ப்பாணம்.]
Phrase table entry:
Si | Ta | φ(s|t)
මගේ නම | எனது பெயர் | 0.12

Step 3: Building Language Model
• Used to ensure fluent output.
• Estimates the probability of each word from n-gram counts; a trigram language model is the usual choice.
• Tools: KenLM, SRILM*, or IRSTLM
E.g. ராம் பந்தை அடித்தான்
ராம் பந்தை வீசினான்
P(அடித்தான் | ராம் பந்தை) = Count(ராம் பந்தை அடித்தான்) / Count(ராம் பந்தை)
w3 | w1 w2 | score
அடித்தான் | ராம் பந்தை | -1.855783
வீசினான் | ராம் பந்தை | -0.4191293
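The count ratio above can be computed directly from a tokenized monolingual corpus. The sketch below is an unsmoothed maximum-likelihood estimate on a three-sentence toy corpus (invented for illustration). Toolkits such as KenLM add smoothing and back-off on top of this idea, and the negative scores in the table look like log10 probabilities of the kind stored in ARPA-format model files.

```python
import math
from collections import Counter


def trigram_probability(sentences, w1, w2, w3):
    """Unsmoothed estimate: P(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)."""
    trigram_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        trigram_counts.update(zip(tokens, tokens[1:], tokens[2:]))
        bigram_counts.update(zip(tokens, tokens[1:]))
    context = bigram_counts[(w1, w2)]
    if context == 0:
        return 0.0  # unseen context; a real toolkit would back off instead
    return trigram_counts[(w1, w2, w3)] / context


# Toy corpus: the first sentence appears twice, the second once.
corpus = [
    "ராம் பந்தை அடித்தான்",
    "ராம் பந்தை அடித்தான்",
    "ராம் பந்தை வீசினான்",
]
w1, w2, w3 = corpus[0].split()
p = trigram_probability(corpus, w1, w2, w3)
print(p, math.log10(p))
```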

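Language Model Table:
w3 | w1 w2 | score
சாப்பிட | நான் கடைக்கு | -1.855783
போனேன் | கடைக்கு சாப்பிட | -0.4191293
Phrase Table:
Si | Ta | φ(s|t)
මගේ නම | எனது பெயர் | 0.12
[Figure: the Phrase Table and the Language Model Table feed the Decoder. Example pair: මගේ ගම යාපනය වේ. / எனது ஊர் யாழ்ப்பாணம்.]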

Decoding example: මගේ ගම යාපනය වේ.
Sinhala | Tamil | φ(s|t)
මගේ | என்னுடைய | 0.22
ගම | ஊர் | 0.34
මගේ ගම | எனது ஊர் | 0.23
යාපනය | யாழ்ப்பாணம் | 0.25
වේ | கீதா | 0.12
යාපනය වේ | யாழ்ப்பாணம் | 0.62
[Figure: translation options laid out over the input words: எனது, என்னுடைய, எனது ஊர், ஊர், யாழ்ப்பாணம், கீதா]
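The search over translation options can be illustrated with a much simplified decoder: cover the input left to right with phrases from the table and keep the segmentation whose phrase scores have the highest product. This is a sketch under strong assumptions (no language model score, no reordering, no pruning, input already tokenized with the final period dropped), so it is not how the Moses decoder is implemented. The phrase table uses the entries from this slide.

```python
import math

# Source words, named once so that table keys and input stay consistent.
MY, VILLAGE, JAFFNA, IS = "මගේ", "ගම", "යාපනය", "වේ"

# Source phrase (tuple of words) -> (target phrase, phi(s|t)) from the slide.
PHRASE_TABLE = {
    (MY,): ("என்னுடைய", 0.22),
    (VILLAGE,): ("ஊர்", 0.34),
    (MY, VILLAGE): ("எனது ஊர்", 0.23),
    (JAFFNA,): ("யாழ்ப்பாணம்", 0.25),
    (IS,): ("கீதா", 0.12),
    (JAFFNA, IS): ("யாழ்ப்பாணம்", 0.62),
}


def decode(sentence, table=PHRASE_TABLE, max_phrase_len=3):
    """Monotone phrase-based search by dynamic programming over prefixes."""
    words = sentence.split()
    # best[i] holds (log score, target phrases) for the first i source words.
    best = [None] * (len(words) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(words) + 1):
        for start in range(max(0, end - max_phrase_len), end):
            entry = table.get(tuple(words[start:end]))
            if best[start] is None or entry is None:
                continue
            target, prob = entry
            score = best[start][0] + math.log(prob)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [target])
    if best[-1] is None:
        return None  # some part of the input has no phrase table entry
    score, phrases = best[-1]
    return " ".join(phrases), math.exp(score)


SOURCE = " ".join([MY, VILLAGE, JAFFNA, IS])
translation, score = decode(SOURCE)
print(translation, round(score, 4))
```

With these entries the two-phrase segmentation (0.23 × 0.62) outscores the word-by-word alternatives, so longer phrase matches win whenever the table supports them.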

Using Moses for Si-Ta Translation
• Custom tokenization
• Morphologically rich languages
• Low resource languages
• Standards are not well established
