Assamese to English Statistical Machine Translation



  1. 1. Assamese-English Statistical Machine Translation Using Moses. Presented by Kalyanee Kanchan Baruah and Pranjal Das
  3. 3. INTRODUCTION What is Natural Language Processing? • Natural Language Processing (NLP) is the ability of a computer program to understand human language as it is spoken or written. • NLP enables automatic communication between computers and humans in natural language.
  4. 4. WHAT IS MACHINE TRANSLATION • Machine translation (MT) is automated translation. It is the process by which computer software is used to translate a text from one natural language (such as Assamese) to another (such as English).
  5. 5. WHAT IS MACHINE TRANSLATION • The ideal aim of machine translation systems is to produce the best possible translation without human assistance. Basically every machine translation system requires programs for translation and automated dictionaries and grammars to support translation.
  6. 6. ADVANTAGES OF MACHINE TRANSLATION • Quick Translation • Low price • Confidentiality • Online translation and translation of web page content • Overcomes technological barriers
  7. 7. PROBLEMS IN MACHINE TRANSLATION • Translation is not straightforward • Word order • Word sense • Idioms
  8. 8. TYPES OF MACHINE TRANSLATION • BILINGUAL – MT systems that produce translations between one particular pair of languages. • MULTILINGUAL – MT systems that produce translations among more than two languages. – They are preferred to bilingual systems because they can translate from any supported language to any other supported language, and vice versa.
  9. 9. SOME EXISTING MT SYSTEMS • Google Translate • Systran • Bing Translator • Babel Fish • Apertium
  10. 10. SOME MAJOR MT PROJECTS IN INDIA • AnglaBharti (and AnuBharti) • Anusaaraka • MaTra • UCSG-based English-Kannada MT • Tamil-Hindi Anusaaraka and English-Tamil MT • Anuvadak English-Hindi software • Sampark
  12. 12. STATISTICAL MACHINE TRANSLATION • Enables us to build machine translation systems automatically, using statistical models trained on text data. • Assumes that every sentence in one language is a possible translation of a sentence in another language, each with an associated probability.
  14. 14. LANGUAGE MODEL • Gives the probability of a sentence • Uses an n-gram model • IRSTLM is used to build the language model. The probability of a sentence, P(S), is broken down into the probabilities of individual words by the chain rule: P(S) = P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1w2) P(w4|w1w2w3) ... P(wn|w1w2...wn-1)
  15. 15. LANGUAGE MODEL Suppose that from a large corpus we have the following bigram probabilities:
      Eat on .16 | Eat some .06 | Eat lunch .06 | Eat dinner .05 | Eat at .04 | Eat a .04 | Eat Indian .04 | Eat breakfast .03 | Eat Thai .03 | Eat today .03 | Eat in .02 | Eat Chinese .02 | Eat Mexican .02 | Eat tomorrow .01 | Eat dessert .007 | Eat British .001
  16. 16. LANGUAGE MODEL
      <start> I .25 | <start> I’d .06 | <start> Tell .04 | <start> I’m .02 | I want .32 | I would .29 | I don’t .08 | I have .04 | Want to .65 | Want a .05 | Want some .04 | Want Thai .01 | To eat .26 | To have .14 | To spend .09 | To be .02 | British food .60 | British restaurant .15 | British lunch .01 | British cuisine .01
  17. 17. LANGUAGE MODEL Then the probability of the sentence “I want to eat British food” is P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081
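The chain-rule computation above can be sketched in a few lines of Python; the bigram probabilities are the illustrative values from the preceding slides:

```python
# Toy bigram language model scoring, using the slide's example probabilities.
bigram_prob = {
    ("<start>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("British", "food"): 0.60,
}

def sentence_probability(words, probs):
    """P(sentence) as a product of bigram probabilities P(w_i | w_{i-1})."""
    p = 1.0
    prev = "<start>"
    for w in words:
        p *= probs.get((prev, w), 0.0)  # unseen bigrams get probability 0 here
        prev = w
    return p

p = sentence_probability("I want to eat British food".split(), bigram_prob)
print(p)  # ~8.1e-06
```

A real language model would also smooth unseen bigrams instead of assigning them zero probability.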
  18. 18. TRANSLATION MODEL • Computes the probability of the source sentence S given the target sentence T, i.e. P(S|T). • May be word-based or phrase-based. • The output of the TM is fed into the Moses decoder. • GIZA++, together with mkcls, is used to build the translation model.
  19. 19. TRANSLATION MODEL Example : জয়পুৰ ৰাজস্থানৰ এখন বিখযাত চহৰ Jaipur is a famous city of Rajasthan
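GIZA++ estimates word-level translation probabilities using the IBM alignment models. As a minimal illustration of the idea, the sketch below runs IBM Model 1 EM on an invented toy corpus (the romanised sentence pairs and all words are made up for this example; GIZA++ implements the full model family, not this simplification):

```python
from collections import defaultdict

# Toy parallel corpus: invented romanised source/target pairs.
corpus = [
    ("jaipur ek shahar".split(), "jaipur a city".split()),
    ("ek shahar".split(), "a city".split()),
    ("ek ghar".split(), "a house".split()),
]

# Uniform initialisation of t(e|f).
f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}
t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}

for _ in range(10):                       # EM iterations
    count = defaultdict(float)            # expected counts c(e, f)
    total = defaultdict(float)            # normaliser per source word f
    for fs, es in corpus:
        for e in es:
            z = sum(t[(e, f)] for f in fs)   # E-step: alignment posterior
            for f in fs:
                c = t[(e, f)] / z
                count[(e, f)] += c
                total[f] += c
    for (e, f) in t:                      # M-step: renormalise
        t[(e, f)] = count[(e, f)] / total[f]

print(round(t[("city", "shahar")], 3))    # converges toward the right pairing
```

After a few iterations the expected counts concentrate probability mass on the co-occurring word pairs, which is the core intuition behind word-based translation models.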
  20. 20. DECODER • Maximizes the probability of the translated text • Searches for the sentence T that maximizes P(S|T) P(T), i.e. T* = argmax over T of P(S|T) P(T). (Diagram: decoding algorithm combining the translation model and the language model.)
  23. 23. IMPLEMENTATION  Install Moses and all required packages • Install GIZA++ • Install IRSTLM  Training  Tuning  Generate output (decoding)
  24. 24. TRAINING THE MOSES DECODER Prepare data → Run GIZA++ → Align words → Get lexical translation table → Extract phrases → Build lexicalized reordering model → Build generation models → Create configuration file
  25. 25. PREPARING THE DATA  Tokenising - inserting spaces between words and punctuation.  Truecasing - converting the first word of each sentence to its most probable casing, as learned from the corpus.  Cleaning - removing empty lines, redundant spaces, and lines that are too short or too long.
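Moses ships its own scripts for these three steps (tokenizer.perl, truecase.perl, clean-corpus-n.perl); the Python sketch below is not a replacement for them, only an illustration of what each step does:

```python
import re

def tokenize(line):
    """Insert spaces between words and punctuation, then split."""
    return re.sub(r"([.,!?;:()\"])", r" \1 ", line).split()

def truecase(tokens, casing):
    """Set the first word of the sentence to its most frequent casing."""
    if tokens and tokens[0].lower() in casing:
        tokens = [casing[tokens[0].lower()]] + tokens[1:]
    return tokens

def clean(pairs, min_len=1, max_len=80):
    """Drop empty or overlong sentence pairs from the parallel corpus."""
    return [(s, t) for s, t in pairs
            if min_len <= len(s) <= max_len and min_len <= len(t) <= max_len]

# Casing table as it might be learned from corpus counts (invented here).
casing = {"the": "the", "jaipur": "Jaipur"}
toks = truecase(tokenize("Jaipur is a famous city."), casing)
print(toks)  # ['Jaipur', 'is', 'a', 'famous', 'city', '.']
```

Separating punctuation before training matters because otherwise “city.” and “city” would be treated as two different words.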
  26. 26. EXAMPLE PARALLEL DATA (ass-eng1.en)
      বিকাণেবি ভূ বিয়া আিু বিঠাই হৈণে বিকাণেিত বকবিি পিা অবত উত্তি সািগ্ৰীসিূৈি বভতিত বকেুিাি।
      The famous Bikaneri Bhujias and sweets are some of the best items to purchase in Bikaner.
      ভািতিৰ্ষি গ ালপীয়া চৈি িাণি খ্যাত িয়পুি, িািস্থাি িািযি িািধািী।
      Jaipur, popularly known as the Pink City, is the capital of Rajasthan state, India.
      অম্বি গপণলচটণটা হৈণে গিা ল আিু বৈন্দু স্থাপতয বিদ্যাি আদ্ৰ্ষ উদ্াৈিে।
      The Amber Palace is a classic example of Mughal and Hindu architecture.
      কিক িৃন্দািি হৈণে িয়পুিি এখ্ি িিবিয় িিণভাি স্থাি।
      Kanak Vrindavan is a popular picnic spot in Jaipur.
      িয়পুি িািষলি িূবত্তষ, িীলা কলৈ আিু িািস্থািী গিাতাি িাণিও বিখ্যাত।
      Jaipur is also famous for marble statues, blue pottery and the Rajasthani shoes.
  27. 27. SAMPLE OUTPUTS (Input Assamese sentence → Output English sentence)
      জয়পুৰ ৰাজস্থানৰ এখন বিখযাত চহৰ । → Jaipur is a famous city of Rajasthan .
      তাজমহল আগ্ৰাত অৱবস্থত । → the Taj Mahal , is located in the heart of the Agra .
      জামা মছবজদ শ্বাহজাহানন বনমমান কবৰবছল। → Jama Masjid built by Shahjahan .
      অন্ধ্ৰপ্ৰনদশ ভাৰতৰ এখন অনযতম ৰাজযৰ বভতৰত এক। → Andhra Pradesh is one of the state of one of India .
      গুৱাহাটী অসমৰ ৰাজধানী। → Guwahati is connected by the capital of the State .
      আগ্ৰা এখন বিখযাত চহৰ → Agra is the one of the famous city
      বদল্লী ভাৰতৰ ৰাজধানী। → Delhi is the capital of India .
  28. 28. PROBLEMS WITH PROPER NOUNS (Input Assamese sentence → Output English sentence)
      কানাদা এখন বিশাল দদশ । → কানাদা is a vast country .
      মুলতান চহৰখন ৰাজস্থানৰ পৰা ৯৯৯ বক.বম. দুৰত্বত অৱবস্থত। → মুলতান from the city is located at a distance of ৯৯৯ of Rajasthan .
      পানাবজ দ াৱাৰ ৰাজধানী । → the capital of Goa , পানাবজ|
  29. 29. TRANSLITERATION IN TRANSLATION  Transliteration – Transcription from one alphabet to another  Some proper nouns which are not in our corpus are not translated.  For example: Translating “কানাদা এখন বিশাল দদশ” gives “কানাদা is a vast country.”  Because ‘কানাদা’ is not in our corpus.
  30. 30. TRANSLITERATION IN TRANSLATION  Store each Assamese letter and its English transliteration in a Perl script. For example: ক -> k, খ -> kh, গ -> g  Run this Perl script with Moses using the following command: echo ‘কানাদা এখন বিশাল দদশ’ | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./  Output: kanada is a vast country .
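The Perl script itself is not shown in the slides; the Python sketch below illustrates the same character-mapping idea. The mapping table is deliberately tiny and covers only the characters needed for the examples; a real table would list the whole Assamese alphabet:

```python
# Character-level transliteration fallback for out-of-vocabulary words.
# Partial mapping for illustration only; extend to the full alphabet.
translit_map = {
    "ক": "k", "খ": "kh", "গ": "g", "ন": "n", "দ": "d", "া": "a",
    "০": "0", "১": "1", "৯": "9",
}

def transliterate(text):
    """Replace each Assamese character by its Latin equivalent, if known."""
    return "".join(translit_map.get(ch, ch) for ch in text)

print(transliterate("কানাদা"))  # kanada
print(transliterate("৯৯৯"))    # 999
```

Characters without an entry pass through unchanged, so already-translated English words in the decoder output are left untouched.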
  31. 31. IMPLEMENTING TRANSLITERATION (Input Assamese sentence → Output before transliteration → Output after transliteration)
      কানাদা এখন বিশাল দদশ → কানাদা is a vast country . → kanada is a vast country .
      মুলতান চহৰখন ৰাজস্থানৰ পৰা ৯৯৯ বক.বম. দুৰত্বত অৱবস্থত। → মুলতান from the city is located at a distance of ৯৯৯ of Rajasthan . → multan from the city is located at a distance of 999 of Rajasthan .
      পানাবজ দ াৱাৰ ৰাজধানী । → the capital of Goa , পানাবজ| → the capital of Goa , panaji .
  32. 32. EVALUATION OF BLEU SCORE Source/Target: Assamese-English | BLEU score: 7.02 | 1/2/3/4-gram precision: 30.5 / 8.5 / 4.1 / 2.3
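What the BLEU score in the table measures can be shown with a minimal single-reference implementation (real evaluations use tools such as Moses' multi-bleu.perl, which also handle multiple references and smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU: brevity penalty times the
    geometric mean of clipped 1..max_n-gram precisions."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # any zero precision kills the unsmoothed geometric mean
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(log_avg)

ref = "Jaipur is a famous city of Rajasthan .".split()
print(round(bleu(ref, ref), 2))  # identical sentences score 1.0
```

Because any zero n-gram precision sends the unsmoothed score to 0, short hypotheses with no matching 4-grams score 0 here; production tools smooth the precisions, which is one reason corpus-level scores like the 7.02 above are computed over the whole test set rather than per sentence.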
  33. 33. CONCLUSION AND FUTURE WORK • SMT is a corpus-based MT approach that requires a parallel corpus before translation can be undertaken. • A parallel corpus of about 2500 Assamese and English sentences was used to train the system. • The SMT system developed accepts Assamese sentences as input and generates the corresponding English translation. • The results show that significant improvements can be made by increasing the size of the parallel corpus.
  34. 34. CONCLUSION AND FUTURE WORK • In the future, we will try to integrate transliteration into our system. • We will try to increase the size of our corpus so as to obtain a better translation system. • We will also try to implement the translation process without using the Moses toolkit.
  35. 35. REFERENCES • “Machine Translation”, [Online]. Available: • “Statistical Machine Translation”, [Online]. Available: • “Problems in Machine Translation Systems”, [Online]. Available: • D. D. Rao, “Machine Translation: A Gentle Introduction”, Resonance, July 1998. • S. K. Dwivedi and P. P. Sukadeve, “Machine Translation System in Indian Perspectives”, Journal of Computer Science, vol. 6, no. 10, pp. 1082-1087, May 2010.
  36. 36. REFERENCES • P. F. Brown, S. A. Della Pietra, V. J. Della Pietra and R. L. Mercer, “The Mathematics of Statistical Machine Translation: Parameter Estimation”, Computational Linguistics, vol. 19, no. 2, June 1993. • “Natural Language Processing”, [Online]. Available:
  37. 37. THANK YOU