PBSMT presentation, Waleed Oransa, 29 April 2010

SMT and PBSMT Presentation

Published in: Education

  1. Statistical Machine Translation
      Waleed Oransa, M.Sc.
      College of Computing and Information Technology
      Arab Academy for Science and Technology
      Cairo, Egypt
  2. Agenda
      • MT Background
      • PBSMT Approach
      • MT Evaluation
      • Online MT services review
      • Conclusion and Future work
  3. Agenda
      • MT Background
      • PBSMT Approach
      • MT Evaluation
      • Online MT services review
      • Conclusion and Future work
  4. Why is Machine Translation so Hard?
      • Translation difficulty is caused by the differences between human languages
          • Systematic differences:
              • Morphology (one morpheme vs. many, morpheme boundaries)
              • Syntax (English SVO, Japanese SOV, Arabic VSO)
              • Argument structure and linking (e.g. pronoun drop in Arabic, head-marking vs. dependent-marking)
          • Phrase ordering and idiosyncratic differences (adjective placement; "to kick the bucket" means "to die")
          • Lexical differences (bank, watch, biweekly)
  5. MT Approaches
      • Linguistic approaches
          • Direct
          • Transfer
          • Interlingua
      • Statistical (corpus-based) approaches
          • Word-Based Statistical Machine Translation
          • Phrase-Based Statistical Machine Translation
          • Example-Based Machine Translation
      • Hybrid approaches
  6. [Diagram: translation pyramid from Source Language Text to Target Language Text. Source-side steps, bottom to top: Morphological Analysis, Parsing, Semantic Analysis, Conceptual Analysis. Target-side steps, top to bottom: Conceptual Generation, Semantic Generation, Syntactic Generation, Morphological Generation. Transfer levels, bottom to top: Direct (words), Syntactic Transfer (syntactic structure), Semantic Transfer (semantic structure), Interlingua. Moving up the pyramid: better quality and more difficulty.]
  7. Why Statistical Machine Translation (SMT)?
      • Advantages:
          • Has a way of dealing with lexical ambiguity
          • Can deal with idioms that occur in the training data
          • Requires minimal human effort and is easy to maintain
          • Can be created for any language pair that has enough training data
      • Disadvantages:
          • Does not explicitly deal with syntax
  8. Example of a parallel corpus
      • ايران تضع شروطا للعلاقات مع امريكا ولا تغلق الباب
      • من فريدريك دال
      • طهران ( رويترز ) - ردت ايران على عرض الرئيس الامريكي باراك اوباما تحسين العلاقات بمطالبة واشنطن بتغير سياساتها غير انها لم تغلق الباب امام امكانية تحسن العلاقات
      • وقال محمد ماراندي الاستاذ بجامعة طهران إن ايران تريد ان تظهر الولايات المتحدة تغيرا ملموسا في سلوكها بشأنها من خلال خطوات من بينها على سبيل المثال الافراج عن الاصول المجمدة ولكن طهران لا تتابع سياسة " العداء الازلي "
      • وقال ماراندي الذي يرأس قسم دراسات امريكا الشمالية في الجامعة " اعتقد انهم ( القيادة الايرانية ) مستعدون تماما لتحسين العلاقات اذا كان الامريكيون جادين
      • Iran sets terms for US ties
      • By Fredrik Dahl
      • TEHRAN (Reuters) - Iran has responded to US President Barack Obama's offer of better relations by demanding policy changes from Washington, but the Islamic state is not closing the door to a possible thaw in ties with its old foe
      • Iran wants the United States to show concrete change in its behavior toward it, for example by handing back frozen assets, but Tehran is not pursuing "eternal hostility," said Professor Mohammad Marandi at Tehran University
      • "I think they (the Iranian leadership) are quite willing to have better relations if the Americans are serious," said Marandi, who heads North American studies at the university
  9. Statistical Machine Translation (SMT)
      • How does it work?
          • Find the most probable target sentence given a source sentence
          • Based on two probabilistic models for:
              • Fluency (called the Language Model)
              • Faithfulness (called the Translation Model)
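The search described on this slide is commonly written with the noisy-channel decomposition. The following is a sketch in standard notation, which is an assumption here because the slide gives no formula: $S$ is the source sentence, $T$ a candidate target sentence, and $\hat{T}$ the selected translation.

```latex
\hat{T} = \arg\max_{T} P(T \mid S)
        = \arg\max_{T} \frac{P(S \mid T)\, P(T)}{P(S)}
        = \arg\max_{T} \underbrace{P(S \mid T)}_{\text{faithfulness (TM)}}
                       \; \underbrace{P(T)}_{\text{fluency (LM)}}
```

$P(S)$ is dropped because it does not depend on $T$. In this decomposition the translation model appears in the source-given-target direction; phrase-based systems such as Moses typically score both directions, so the P(target | source) form used on the Translation Model slides is also common.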
  10. How to Build an SMT System
      • The SMT approach consists of two phases:
          • Training phase
              • Input: parallel sentences in any language pair
              • Output: Language Model, Translation Model
          • Translation phase
              • Input: source sentence, Language Model, Translation Model
              • Output: target sentence
  11. Let's see the training phase
  12. SMT Training Phase
      [Diagram: PBSMT system training. Input: training corpus (Arabic/English bi-text), split into English sentences and Arabic sentences. Language model training (tool: SRILM toolkit) produces the Language Model (Arabic). Translation model training (tools: GIZA++ and the Moses toolkit) produces the Translation Model (English/Arabic). Output: Language Model and Translation Model.]
      What is the Language Model?
  13. Language Model (LM)
      • LM is P(sentence)
      • LM assigns a higher probability to fluent, grammatical sentences
              • Correct word order: P( ذهب علي للمنزل ) >> P( علي للمنزل ذهب )
              • Correct word choice: P( منى ذهبت للمنزل ) >> P( منى ذهب للمنزل )
          • LM is estimated using a monolingual corpus
          • LM is based on N-grams:
              • N-grams are token sequences of length N
              • الولايات → Unigram
              • رئيس الولايات → Bigram
              • صرح رئيس الولايات → Trigram
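To make the N-gram idea concrete, here is a minimal sketch of a bigram language model with add-one smoothing. The toy English corpus, the function names, and the choice of add-one smoothing are all assumptions made for illustration; toolkits such as SRILM use more refined smoothing methods.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # every token that starts a bigram
        bigrams.update(zip(tokens, tokens[1:]))   # adjacent token pairs
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams, vocab_size):
    """P(sentence) as a product of add-one smoothed bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob


# Hypothetical monolingual training data (not from the slides).
corpus = ["ali went home", "mona went home", "ali went to school"]
unigrams, bigrams = train_bigram_lm(corpus)
vocab = {w for s in corpus for w in s.split()} | {"</s>"}

fluent = sentence_probability("ali went home", unigrams, bigrams, len(vocab))
scrambled = sentence_probability("home went ali", unigrams, bigrams, len(vocab))
print(fluent > scrambled)  # the well-ordered sentence scores higher
```

The comparison mirrors the word-order example on the slide: a sentence whose word pairs were seen in training receives a much higher probability than the same words in an unseen order.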
  14. LM Role in SMT
      [Diagram: for the sentence "مشرف xxxx مع كبار المسؤولين المدنيين والعسكريين", the candidate verb forms يجتمعوا, تجتمع, يجتمع, يتقابل, يقابل, يجتمعا, تجتمعا, يجتمعن, يجتمعون, يتقابلان are passed to the Language Model, which selects يجتمع.]
  15. Back to our SMT training chart
  16. SMT Training Phase
      [Diagram: the SMT training chart from slide 12, repeated.]
      What is the Translation Model?
  17. Translation Model (TM)
      • TM learns translations of words and phrases from a parallel corpus
      • TM associates probabilities with translations empirically by counting co-occurrences in the data
      • TM gets more accurate as the size of the data increases
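The counting idea can be sketched as a relative-frequency estimate. This is an illustrative simplification, not the exact procedure used by GIZA++ or Moses; the function name and the observed counts below are hypothetical, although the Arabic words are taken from the slides' running example.

```python
from collections import Counter, defaultdict


def estimate_translation_probs(pairs):
    """Relative frequency: P(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    table = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        table[src][tgt] = n / source_counts[src]
    return dict(table)


# Hypothetical co-occurrences collected from an aligned corpus.
observed = [
    ("meets", "يجتمع"),
    ("meets", "يجتمع"),
    ("meets", "يقابل"),
    ("senior", "كبار"),
]
table = estimate_translation_probs(observed)
print(table["meets"])
```

With more observed pairs the estimates become less noisy, which is the sense in which the model improves as the corpus grows.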
  18. Translation Model (TM)
      • TM is P(target sentence | source sentence)
      • TM assigns higher probability to sentences that have corresponding meaning (based on the corpus)
      • TM is estimated using a parallel corpus
      • Word-based SMT uses word-based alignment, while phrase-based SMT uses phrase-based alignment
  19. Translation Model (TM)
      • Decompose the sentences into smaller chunks, as in language modeling
      • Introduce another variable a that represents alignments between the individual words in the sentence pair
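The role of the alignment variable can be written out as follows. This is a sketch under assumptions: the slide does not name a specific model, so the second line uses IBM Model 1 as one well-known instance. Here $S = S_1 \dots S_l$ is the source sentence, $T = T_1 \dots T_m$ the target sentence, $a_j \in \{0, \dots, l\}$ is the source position aligned to target word $j$ (0 denotes an empty word), and $\epsilon$ is a normalization constant.

```latex
P(T \mid S) = \sum_{a} P(T, a \mid S)

% IBM Model 1 as one possible word-level decomposition:
P(T, a \mid S) = \frac{\epsilon}{(l + 1)^{m}} \prod_{j=1}^{m} p\left(T_j \mid S_{a_j}\right)
```

The sum over alignments is what lets the sentence-level probability be built from word-level translation probabilities $p(T_j \mid S_{a_j})$.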
  20. SMT Translation Phase
      [Diagram: the source text enters the SMT system (decoder), which uses the Language Model (Arabic) and the Translation Model (English/Arabic) to produce the target text.]
      Source text: Musharraf Meets with Senior Civilian
      Hypotheses:
      • مشرف يجتمع مع كبار المسؤولين المدنيين
      • مشرف يقابل مع كبار الموظفين المدنيين
      • مشرف يجتمع مع أهم المسؤولين المدنيين
      • مشرف يتقابل مع كبار المسؤولين المدنيين
      • مشرف يجتمع مع كبار المسؤولين الحكوميين
      Initial N-best hypotheses, probabilities in the order shown on the slide: p=0.13, p=0.21, p=0.23, p=0.12, p=0.18
      Final N-best hypotheses, probabilities in the order shown on the slide: p=0.53, p=0.42, p=0.37, p=0.22, p=0.48
      Selected translation:
      • مشرف يجتمع مع كبار المسؤولين المدنيين
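  21. How Do the TM and LM Work Together?
      [Diagram: the source sentence "Musharraf Meets with Senior Civilian and Military Officials" is translated as "مشرف ***** مع كبار المسؤولين المدنيين والعسكريين". The Translation Model proposes the candidates يجتمعوا, تجتمع, يجتمع, يتقابل, يقابل, يجتمعا, تجتمعا, يجتمعن, يجتمعون, يتقابلان for the open slot, and the Language Model selects يجتمع.]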
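A minimal sketch of how the two models can be combined when choosing among candidates: each candidate gets a translation-model score and a language-model score, and the decoder keeps the one with the best weighted sum of log probabilities. The probabilities, weights, and function name below are hypothetical and chosen only to show the mechanism; a real decoder such as Moses searches over many more hypotheses and features.

```python
import math


def best_hypothesis(candidates, tm_weight=1.0, lm_weight=1.0):
    """Pick the candidate with the highest weighted sum of log TM and log LM scores."""
    def score(item):
        _, tm_prob, lm_prob = item
        return tm_weight * math.log(tm_prob) + lm_weight * math.log(lm_prob)
    return max(candidates, key=score)[0]


# (candidate verb, hypothetical TM probability, hypothetical LM probability)
candidates = [
    ("يجتمع", 0.40, 0.30),
    ("يقابل", 0.45, 0.10),
    ("يتقابل", 0.15, 0.20),
]

print(best_hypothesis(candidates))                 # TM and LM combined
print(best_hypothesis(candidates, lm_weight=0.0))  # TM alone
```

In this toy setup the translation model alone slightly prefers يقابل, but once fluency is taken into account the combined score selects يجتمع, which is the behaviour the diagram illustrates.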
  22. Agenda
      • MT Background
      • PBSMT Approach
      • MT Evaluation
      • Online MT services review
  23. PBSMT Approach
      • Overall phrase pair extraction algorithm
          • Run a sentence aligner on a parallel bilingual corpus
          • Run a word aligner (e.g., one based on IBM models) on each aligned sentence pair
          • From each aligned sentence pair, extract all phrase pairs with no external links (see next slide)
      Example sentence pair:
      • The prophet Mohamed was born in 570 A.D
      • ولد الرسول محمد في سنة 570 ميلادية
  24. PBSMT Training Phase
      [Diagram: PBSMT normal training phase. Input: training corpus (Arabic/English bi-text), split into English sentences and Arabic sentences. Language model training (tool: SRILM toolkit) produces the Language Model (Arabic). Translation model training (tools: GIZA++ and the Moses toolkit) produces the Translation Model phrase table (English/Arabic). Output: Language Model and Translation Model.]
  25. Phrase-Based Alignment
      • English to Arabic word alignment
      • Arabic to English word alignment
      • Intersection of both alignments
      • Extract all phrases:
          • (The prophet, الرسول)
          • (The prophet Mohamed, الرسول محمد)
          • (great man, رجل عظيم)
          • (Mohamed is a great man, محمد رجل عظيم)
          • (The prophet Mohamed is a great man, الرسول محمد رجل عظيم)
          • etc.
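The rule "extract all phrase pairs with no external links" can be sketched as follows. This is a simplified version written for illustration: it only pairs each source span with the smallest target span covering its links and does not extend phrases over unaligned words, as fuller implementations do. The function name, the `max_len` limit, and the hand-written alignment are assumptions.

```python
def extract_phrase_pairs(src_tokens, tgt_tokens, alignment, max_len=4):
    """Extract phrase pairs that are consistent with a word alignment.

    alignment is a set of (source_index, target_index) links. A pair of spans
    is kept when it contains at least one link and no link connects a word
    inside one span to a word outside the other.
    """
    pairs = []
    for s_start in range(len(src_tokens)):
        for s_end in range(s_start, min(s_start + max_len, len(src_tokens))):
            # Target positions linked to any word of the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Reject spans where a target word links outside the source span.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for (s, t) in alignment):
                continue
            pairs.append((" ".join(src_tokens[s_start:s_end + 1]),
                          " ".join(tgt_tokens[t_start:t_end + 1])))
    return pairs


# Part of the slides' example sentence, with a hypothetical word alignment.
english = ["the", "prophet", "mohamed", "was", "born"]
arabic = ["ولد", "الرسول", "محمد"]
links = {(0, 1), (1, 1), (2, 2), (3, 0), (4, 0)}

pairs = extract_phrase_pairs(english, arabic, links)
for pair in pairs:
    print(pair)
```

Under this alignment "the" alone is rejected because الرسول is also linked to "prophet", while "the prophet" and "the prophet mohamed" are both accepted, matching the first two phrase pairs listed on the slide.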
  26. PBSMT Drawbacks
      • PBSMT has two drawbacks:
          • It needs a huge corpus to give good translations
          • Phrases fragment the sentence, whereas in a language like Arabic the sentence should be treated as a homogeneous whole (i.e. there are strong dependencies between words and long-distance inflection)
  27. Agenda
      • MT Background
      • PBSMT Approach
      • MT Evaluation
      • Online MT services review
  28. MT Evaluation
      • MT evaluation is based on two scores:
          • Adequacy (measures how well the meaning is preserved)
          • Fluency (measures how grammatical and natural the translation is)
      • MT evaluation can be:
          • Human evaluation
              • The cost is high in terms of money and time
          • Automatic evaluation
              • Fast, lower cost, and correlated with human evaluation
  29. Human Evaluation
      • Source: Musharraf Meets with Senior Civilian and Military Officials to Examine Nuclear Anti-Proliferation Dossier
      • MT: مشرف يتقابل مع كبار المسؤولين المدنيين والعسكريين بخصوص ملف الحد من الانتشار النووي
      • Fluency (1-5): 4
      • Adequacy (1-5): 5
  30. Automatic Evaluation
      • It computes how close the MT output is to the human reference translations
      • An MT output is ranked as better if on average it is closer to the human translations
      • BLEU is the de facto standard
      • BLEU: ranks each MT output by a weighted average of the number of N-gram overlaps with the human translations
      Example (slide label: Higher BLEU Score):
      • Source: Musharraf Meets with Senior Civilian and Military Officials to Examine Nuclear Anti-Proliferation Dossier
      • MT1: مشرف يتقابل مع كبار المسؤولين المدنيين والعسكريين بخصوص ملف الحد من الانتشار النووي
      • MT2: مشرف يجتمع بالقادة المدنيين والعسكريين رفيعي المستوى لدراسة ملف وقف الانتشار النووي
      • Ref1: مشرف يجتمع بكبار المسؤولين المدنيين والعسكريين لدرس ملف الحد من الانتشار النووي
      • Ref2: مشرف يجتمع مع كبار المسؤولين المدنيين والعسكريين لدرس ملف الحد من الانتشار النووي
      • Ref3: مشرف يجتمع بكبار المسؤولين المدنيين والعسكريين لبحث ملف الحد من الانتشار النووي
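The N-gram overlap idea behind BLEU can be sketched in a few lines. This is a simplified sentence-level version for illustration (clipped N-gram precision up to 4-grams, geometric mean, brevity penalty, no smoothing), not a replacement for a standard BLEU implementation. The function names are hypothetical, and the stray spaces inside MT2 in the transcript are assumed to be extraction artifacts, so its tokens are joined here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """N-gram precision with counts clipped to the maximum seen in any reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU; returns 0.0 if any n-gram order has no match."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty against the reference length closest to the candidate length.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


refs = [
    "مشرف يجتمع بكبار المسؤولين المدنيين والعسكريين لدرس ملف الحد من الانتشار النووي".split(),
    "مشرف يجتمع مع كبار المسؤولين المدنيين والعسكريين لدرس ملف الحد من الانتشار النووي".split(),
    "مشرف يجتمع بكبار المسؤولين المدنيين والعسكريين لبحث ملف الحد من الانتشار النووي".split(),
]
mt1 = "مشرف يتقابل مع كبار المسؤولين المدنيين والعسكريين بخصوص ملف الحد من الانتشار النووي".split()
mt2 = "مشرف يجتمع بالقادة المدنيين والعسكريين رفيعي المستوى لدراسة ملف وقف الانتشار النووي".split()

print(bleu(mt1, refs) > bleu(mt2, refs))
```

Under this unsmoothed variant MT2 shares no trigram with any reference and therefore scores zero, while MT1 shares several long N-grams with Ref2; standard BLEU is computed over a whole test set, where such zero counts are much less of an issue.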
  31. Agenda
      • MT Background
      • PBSMT Approach
      • MT Evaluation
      • Online MT services review
  32. Online MT Services Review
      Each entry lists the sentence or translation service, followed by the Arabic translation.
      • A. Fifteen girls: خمس عشرة فتاة
          • 1. Google Translate (GO): خمسة عشر فتيات
          • 2. MS-Bing Translator (MS): خمسة عشر الفتيات
          • 3. Systran translator (SY): خمسة عشر بنات
          • 4. Sakhr Trjem (SK): خمسة عشر بنتًا
      • B. The two girls said "we are good": قالت الفتاتان " نحن جيدات "
          • 5. GO: الفتاتين وقال " نحن جاهزون "
          • 6. MS: ان فتاتين " نحن جيدة "
          • 7. SY: قال الاثنان بنات " نحن جيّد "
          • 8. SK: البنتان قالتا أنّنا جيّدين
  33. Thank you شكراً
