On the Utility of Moses for Sinhala Tamil Translation
Yashothara.S, Dr.R.T.Uthayasanker
National Language Processing center
Outline
• Background: Statistical Machine Translation (SMT)
• Introduction to Moses
• Training
• Decoder
Machine Translation
• The process of translating from one language into another using a computer
• Types of machine translation
  • Rule based
  • Example based
  • Knowledge based
  • Statistical based
  • Hybrid model based
  • Neural network based
[Figure: Source → Computer → Target]
Statistical Machine Translation
"Hmm. Every time she sees “මගේ”, she types either “எனது” or “என்னுடைய”, but if she sees “මගේ නම” she always types “எனது”."
[Figure: source (S) and target (T) sentence pairs from a parallel corpus; label: "Translate, translate …"]
Statistical Machine Translation
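s: Sinhala (source), t: Tamil (target)
$$P(t \mid s) \propto \underbrace{P(s \mid t)}_{\text{TM}} \; \underbrace{P(t)}_{\text{LM}}$$
TM: translation model, LM: language model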
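As a rough illustration of how the translation model P(s|t) and the language model P(t) are combined, the sketch below scores candidate translations by the sum of their log probabilities and keeps the best one. The candidate names and all probabilities are hypothetical values invented for this example; they do not come from a trained model.

```python
import math

def noisy_channel_score(p_s_given_t, p_t):
    """Return log(P(s|t) * P(t)): translation-model plus language-model score."""
    return math.log(p_s_given_t) + math.log(p_t)

def best_candidate(candidates):
    """Pick the target with the highest combined score.

    candidates maps a target string to a pair (P(s|t), P(t)).
    """
    return max(candidates, key=lambda t: noisy_channel_score(*candidates[t]))

# Hypothetical probabilities, invented for illustration only.
candidates = {
    "candidate_a": (0.50, 0.01),  # better TM score, poor LM score
    "candidate_b": (0.30, 0.05),  # weaker TM score, more fluent
}
print(best_candidate(candidates))
```

With these made-up numbers the more fluent candidate wins (0.30 × 0.05 = 0.015 against 0.50 × 0.01 = 0.005), which is the role the language model plays in the product.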
Statistical Machine Translation
[Figure: the decoder combines the translation model (TM) and the language model (LM) to translate “මගේ ගම නම යාපනය ව ග .” into “எனது ஊர் யாழ்ப்பாணம்.”]
Moses
• Open source SMT framework
• Language independent
• Plug and play
Steps
1. Preprocessing
2. Translation Model Building
3. Language Model Building
4. Decoding
Step 1: Preprocessing
• Tokenization: Splitting sentences into tokens
• tokenizer.perl script can be used.
Example:
Before tokenizing
ගසවක සංඛ්‍යා ග ොරතුරු ලබා ගෙන්න.
ஆளணி தகவல்கடள வழங்கவும்.
After tokenizing (combining marks, such as the Tamil pulli “்”, end up split from their base characters)
ගසවක සංඛ්‍ යා ග ොරතුරු ලබා ගෙන න .
ஆளணி தகவல ் கடள வழங ் கவும ் .
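The "after" lines above show combining marks being separated from the letters they belong to. A minimal sketch of a tokenizer that avoids this is given below, assuming that splitting on whitespace and on Unicode punctuation or symbol characters is enough. The function name `tokenize` is hypothetical, and this is not the Moses tokenizer or the authors' custom tokenizer.

```python
import unicodedata

def tokenize(sentence):
    """Split on whitespace and punctuation, keeping combining marks attached."""
    tokens, current = [], ""
    for ch in sentence:
        category = unicodedata.category(ch)
        if ch.isspace():
            # Whitespace ends the current token.
            if current:
                tokens.append(current)
                current = ""
        elif category[0] in "PS":
            # Punctuation (P*) and symbols (S*) become tokens of their own.
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            # Letters, digits, combining marks (M*) and joiners (Cf) stay together.
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(tokenize("ஆளணி தகவல்கடள வழங்கவும்."))
```

Treating mark (M*) and format (Cf) characters as part of the word is the design choice that keeps vowel signs, the pulli and zero-width joiners attached to their base characters.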
Step 1: Preprocessing
• Cleaning: Removing low-quality sentence pairs (for example pairs with an empty side, overly long sentences, or badly mismatched lengths)
• clean-corpus-n.perl can be used.
Sinhala | Tamil
එම ැනැත් ාට පැමිණීමට ගනොහැකි ත්ත්වයක් උද්ග වූගේ නම් ඒ බව සනාථ කිරීගමන් අනතුරුව සුදුසු…. | அவருக்கு பாைசாடலக்குச் சமுகமளிக்கமுடியாத சந்தர்ப்பத்தில் அதுபற்றி உறுதிப்படுத்தியதன் பின்னர் பபாருத்தமான ஒருவருக்கு…..
(empty) | எனது பபயர் கீதா.
විශව විෙයාලයීය අධ්‍යාපනය ඇතුළු උසස අධ්‍යාපනයට ප්‍රගේශ වීමට ඉංග්‍රීසි ෙැනුම උපකාර වනු ඇ . | பபருந்து
මගේ මිතුරියගේ නම ශුබා ය. | எனது நண்பியின் பபயர் சுபா.
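The kind of filtering applied in the cleaning step can be sketched as follows. This is a simplified illustration, not the logic of clean-corpus-n.perl itself: the function name `clean_corpus` and the thresholds `min_len`, `max_len` and `max_ratio` are assumptions chosen for the example.

```python
def clean_corpus(pairs, min_len=1, max_len=80, max_ratio=9.0):
    """Keep sentence pairs whose token counts are within bounds and balanced."""
    low = max(min_len, 1)  # at least one token per side, so the ratio is defined
    kept = []
    for source, target in pairs:
        n_src, n_tgt = len(source.split()), len(target.split())
        # Drop pairs with an empty, too short, or too long side.
        if not (low <= n_src <= max_len and low <= n_tgt <= max_len):
            continue
        # Drop pairs whose lengths differ too much to be likely translations.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        kept.append((source, target))
    return kept

pairs = [
    ("a b c", "x y z"),            # kept
    ("", "x y"),                   # dropped: empty source side
    ("a b c d e f g h i j", "x"),  # dropped: length ratio 10 exceeds 9
    ("a", "x"),                    # kept
]
print(clean_corpus(pairs))
```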
[Figure: the parallel corpus is used to build the translation model (TM); the decoder combines the TM and the language model (LM) to translate “මගේ ගම යාපනය ගේ.” into “எனது ஊர் யாழ்ப்பாணம்.”]
Step 2: Building the Translation Model
• Assigns a probability P(s|t) to each pair of source and target words/phrases
Sinhala | Tamil | φ(s|t)
මගේ | எனது | 0.66
මගේ | என்னுடைய | 0.22
මගේ ගපො | எனது புத்தகம் | 0.12
මගේ නම ගී ා | எனது பபயர் கீதா | 0.22
E.g.
මගේ නම ගී ා ගේ. | எனது பபயர் கீதா.
මගේ ගපො . | என்னுடைய புத்தகம்.
[Figure: S, T → word alignment tool (GIZA++) → P(s|t)]
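One common way to obtain a score such as φ(s|t) is relative frequency over extracted phrase pairs: count(s, t) divided by count(t). The sketch below assumes the phrase pairs have already been extracted from word alignments; the function name and the placeholder phrases `s1`, `s2`, `t1`, `t2` are hypothetical, and a real phrase table carries further scores beyond this one.

```python
from collections import Counter

def phrase_probabilities(phrase_pairs):
    """Estimate phi(s|t) = count(s, t) / count(t) by relative frequency."""
    pair_counts = Counter(phrase_pairs)
    target_counts = Counter(target for _, target in phrase_pairs)
    return {
        (source, target): count / target_counts[target]
        for (source, target), count in pair_counts.items()
    }

# Hypothetical extracted phrase pairs (source phrase, target phrase).
extracted = [("s1", "t1"), ("s1", "t1"), ("s2", "t1"), ("s1", "t2")]
for pair, phi in sorted(phrase_probabilities(extracted).items()):
    print(pair, round(phi, 3))
```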
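[Figure: a monolingual corpus is used to build the language model (LM); the decoder combines the phrase table and the LM to translate “මගේ ගම යාපනය ගේ.” into “எனது ஊர் யாழ்ப்பாணம்.”]
Phrase table excerpt:
Si | Ta | φ(s|t)
මගේ | எனது | 0.66
මගේ නම | எனது பபயர் | 0.12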
Step 3: Building the Language Model
• Used to ensure fluent output.
• Gives the probability of each word given the preceding words (n-grams); typically computed with a trigram language model.
• Tools: KenLM, SRILM*, or IRSTLM
E.g. ராம் பந்டத அடித்தான்
ராம் பந்டத வீசினான்
$$P(\text{அடித்தான்} \mid \text{ராம் பந்டத}) = \frac{\mathrm{Count}(\text{ராம் பந்டத அடித்தான்})}{\mathrm{Count}(\text{ராம் பந்டத})}$$
w3 | w1 w2 | score
அடித்தான் | ராம் பந்டத | -1.855783
வீசினான் | ராம் பந்டத | -0.4191293
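The count ratio above can be sketched in a few lines. This is an unsmoothed maximum-likelihood estimate over a toy corpus whose sentence counts are invented for the example; toolkits such as KenLM apply smoothing, so their scores will differ. Reporting base-10 logarithms assumes that the scores in the table are log probabilities.

```python
import math
from collections import Counter

def train_trigram_counts(sentences):
    """Count trigrams and the two-word histories that precede a third word."""
    trigrams, histories = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - 2):
            histories[tuple(words[i:i + 2])] += 1
            trigrams[tuple(words[i:i + 3])] += 1
    return trigrams, histories

def trigram_log10_prob(w1, w2, w3, trigrams, histories):
    """log10 of Count(w1 w2 w3) / Count(w1 w2); -inf for unseen trigrams."""
    count = trigrams[(w1, w2, w3)]
    if count == 0:
        return float("-inf")
    return math.log10(count / histories[(w1, w2)])

# Toy corpus: the two example sentences with invented frequencies.
corpus = ["ராம் பந்டத அடித்தான்"] + ["ராம் பந்டத வீசினான்"] * 3
trigrams, histories = train_trigram_counts(corpus)
for sentence in sorted(set(corpus)):
    w1, w2, w3 = sentence.split()
    print(w3, trigram_log10_prob(w1, w2, w3, trigrams, histories))
```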
w3 | w1 w2 | score
சாப்பிை | நான் கடைக்கு | -1.855783
பபாபனன் | கடைக்கு சாப்பிை | -0.4191293
[Figure: the decoder combines the phrase table and the language model table to translate “මගේ ගම යාපනය ගේ.” into “எனது ஊர் யாழ்ப்பாணம்.”]
Step 4: Decoding
[Figure: translation options for “මගේ ගම යාපනය ගේ.”; the decoder searches over combinations of phrase translations such as “எனது”, “என்னுடைய”, “ஊர்”, “எனது ஊர்”, “யாழ்ப்பாணம்” and “கீதா”.]
Sinhala | Tamil | φ(s|t)
මගේ | எனது | 0.66
මගේ | என்னுடைய | 0.22
ගම | ஊர் | 0.34
මගේ ගම | எனது ஊர் | 0.23
යාපනය | யாழ்ப்பாணம் | 0.25
ගේ | கீதா | 0.12
යාපනය ගේ | யாழ்ப்பாணம் | 0.62
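The search over translation options can be sketched with a small dynamic program. This is a deliberately simplified stand-in for the Moses decoder: it is monotone (no reordering), ignores the language model, and keeps only the single best hypothesis per source position, whereas Moses runs a beam search with several feature scores. The phrase table entries are the ones shown on this slide; the function name `decode_monotone` is hypothetical.

```python
import math

# Phrase table from the slide: source phrase -> [(target phrase, phi)].
PHRASE_TABLE = {
    ("මගේ",): [("எனது", 0.66), ("என்னுடைய", 0.22)],
    ("ගම",): [("ஊர்", 0.34)],
    ("මගේ", "ගම"): [("எனது ஊர்", 0.23)],
    ("යාපනය",): [("யாழ்ப்பாணம்", 0.25)],
    ("ගේ",): [("கீதா", 0.12)],
    ("යාපනය", "ගේ"): [("யாழ்ப்பாணம்", 0.62)],
}

def decode_monotone(tokens, table, max_phrase_len=3):
    """Return (log score, target phrases) of the best left-to-right segmentation."""
    n = len(tokens)
    # best[i] holds the best hypothesis covering the first i source tokens.
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_phrase_len), end):
            prev_score, prev_output = best[start]
            if prev_score == -math.inf:
                continue  # no hypothesis reaches this start position
            for target, prob in table.get(tuple(tokens[start:end]), []):
                score = prev_score + math.log(prob)
                if score > best[end][0]:
                    best[end] = (score, prev_output + [target])
    return best[n]

score, output = decode_monotone(["මගේ", "ගම", "යාපනය", "ගේ"], PHRASE_TABLE)
print(" ".join(output), round(math.exp(score), 4))
```

Under these numbers the two-phrase segmentation (0.23 × 0.62 ≈ 0.143) narrowly beats the three-phrase one (0.66 × 0.34 × 0.62 ≈ 0.139), which shows why longer phrase matches can be preferred over word-by-word translation.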
Using Moses for Si-Ta Translation
• Custom tokenization
• Morphologically rich languages
• Low-resource languages
• Standards are not well established
Thanks & Questions