A short introduction to Statistical Machine Translation


A presentation of only 10 slides that quickly surveys the state-of-the-art techniques in automatic machine translation

Published in: Technology, Education


  1. A (10-slide) introduction to Statistical Machine Translation (SMT), by Cuong Huy TO
  2. Machine Translation. MT (first demonstrated by IBM in 1954) is:
     - Commercially interesting: the EU spends 1,000,000,000 €/year on translation
     - Academically interesting: it draws on many NLP technologies
     What makes MT hard?
     - Word order: "monde entier" vs. "whole world"
     - Word sense: "bank" can be "rivière" or "banque"
     - Idioms: "to kick the bucket" translates to "mourir" (to die)
     Various approaches: rule-based MT; example-based MT (1984), which reuses what is in translation memory; statistical MT (1993).
     Source: Wikipedia, John Hutchins, P. Koehn
  3. Statistical Machine Translation. Uses a parallel corpus:
     - Europarl (European Parliament)
     - Hansard (Canadian Parliament)
     - Verbmobil (German-English)
     - United Nations (used by Google)
     Learns:
     - Lexicon: "Le monde entier ne parle pas du problème"
     - Alignment: "The whole world is not talking about the problem"
     - Well-formedness of the target language
     Advantages (over rule-based and example-based MT): good performance, quick implementation, and the ability to deal with noisy text.
  4. Translation units and alignment. What should the translation unit be?
     - Unit = WORD: adapts better to variability, but requires more parameters
     - Unit = PHRASE: adapts less well to variability, but requires fewer parameters
  5. Two approaches to SMT. Translate a source sentence $s_1^J = s_1,\dots,s_j,\dots,s_J$ into a target sentence $t_1^I = t_1,\dots,t_i,\dots,t_I$ by finding
     $$\hat{t}_1^I = \arg\max_{t_1^I} \Pr(t_1^I \mid s_1^J)$$
     Source-channel translation:
     $$\hat{t}_1^I = \arg\max_{t_1^I} \left\{ \Pr(t_1^I) \cdot \Pr(s_1^J \mid t_1^I) \right\}$$
     Direct maximum-entropy translation:
     $$\Pr(t_1^I \mid s_1^J) = p_{\lambda_1^M}(t_1^I \mid s_1^J) = \frac{\exp\bigl[\sum_{m=1}^{M} \lambda_m h_m(t_1^I, s_1^J)\bigr]}{\sum_{t'^{I'}_1} \exp\bigl[\sum_{m=1}^{M} \lambda_m h_m(t'^{I'}_1, s_1^J)\bigr]}$$
     $$\hat{t}_1^I = \arg\max_{t_1^I} \Bigl\{ \sum_{m=1}^{M} \lambda_m h_m(t_1^I, s_1^J) \Bigr\}$$
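The maximum-entropy decision rule above can be sketched in a few lines: score each candidate translation by the weighted sum of its feature functions and take the argmax. The candidates, feature values, and weights below are illustrative assumptions, not a trained model.

```python
# Minimal sketch of the direct maximum-entropy decision rule:
# pick the candidate t maximizing sum_m lambda_m * h_m(t, s).

def me_score(features, weights):
    """Weighted feature sum: sum_m lambda_m * h_m(t, s)."""
    return sum(w * h for w, h in zip(weights, features))

def decode(candidates, weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: me_score(c["features"], weights))

# Hypothetical candidates with features (log LM prob, log TM prob, length)
candidates = [
    {"text": "the whole world", "features": [-2.1, -1.3, 3.0]},
    {"text": "world whole the", "features": [-6.5, -1.3, 3.0]},
]
weights = [1.0, 1.0, 0.1]  # lambda_1 .. lambda_M

print(decode(candidates, weights)["text"])  # the fluent candidate wins on the LM feature
```

Note that in the argmax the normalizing denominator of the softmax can be dropped, which is exactly why the decision rule reduces to maximizing the plain feature sum.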
  6. Source-channel vs. maximum entropy. [Diagram: both systems preprocess the source-language text and run a global search to produce target-language text. The source-channel search maximizes $\Pr(t_1^I) \cdot \Pr(s_1^J \mid t_1^I)$; the maximum-entropy search maximizes $\sum_{m=1}^{M} \lambda_m h_m(t_1^I, s_1^J)$.]
     The source-channel approach is a special case of the maximum-entropy one (take $h_1 = \log \Pr(t_1^I)$, $h_2 = \log \Pr(s_1^J \mid t_1^I)$ and $\lambda_1 = \lambda_2 = 1$). We will first train the source-channel models, then integrate them in the ME framework.
  7. Alignment in the source-channel approach.
     $$\Pr(s_1^J \mid t_1^I) = \sum_{a} \Pr(s_1^J, a \mid t_1^I)$$
     where $a$ is an alignment between translation units, and P(translation) = P(alignment) × P(lexicon).
     Word-to-word alignment:
     - $I^J$ possible alignments
     - Source-to-target mapping: Model-1 (can be efficiently trained)
     - Target-to-source fertility: Model-2 (cannot be efficiently trained)
     - Training (EM): train Model-1, and use its parameters to initialize those of Model-2
     Phrase-to-phrase alignment: take the Viterbi alignments from word-to-word training, then build consistent pairs of phrase-to-phrase alignments.
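The efficient EM training of Model-1 mentioned above can be sketched on a toy corpus. The two sentence pairs are an illustrative assumption; real systems train on millions of pairs (Europarl, Hansard, ...).

```python
from collections import defaultdict

# Toy sketch of IBM Model-1 EM training: estimate the word-translation
# lexicon t(f|e) from sentence pairs, summing over all alignments.

corpus = [
    (["le", "monde"], ["the", "world"]),  # French -> English
    (["le", "livre"], ["the", "book"]),
]

# Uniform initialization of t(f|e).
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):  # EM iterations
    count = defaultdict(float)  # expected counts c(f, e)
    total = defaultdict(float)  # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)  # P(f aligned anywhere in es)
            for e in es:
                delta = t[(f, e)] / norm       # posterior of the link f-e
                count[(f, e)] += delta
                total[e] += delta
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]               # M-step: renormalize

print(round(t[("le", "the")], 3))  # "le" becomes the most likely translation of "the"
```

Because "le" co-occurs with "the" in both pairs while "monde" and "livre" appear once each, EM concentrates probability mass on the correct link, which is exactly why Model-1 parameters make a good initialization for richer models.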
  8. The state-of-the-art SMT system.
     - Start with a source-channel system: train and find the word-to-word alignments, then build phrase-to-phrase alignments and a phrase lexicon
     - Include the phrase-to-phrase model in a maximum-entropy framework
     - Train the scaling factors lambda using GIS (Generalized Iterative Scaling)
     - Add more feature functions: several language models (trillions of words); both P(s|t) and P(t|s) (a more symmetrical translation model); P(I) (word penalty); and many more (a conventional dictionary, ...)
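The step of building phrase-to-phrase alignments from word alignments can be sketched with the standard consistency criterion: a phrase pair is kept only if no alignment link crosses the boundary of its (source span, target span) box. The sentence pair and links below are illustrative assumptions.

```python
# Toy sketch of consistent phrase-pair extraction from a word alignment.

def extract_phrases(src, tgt, alignment, max_len=3):
    """Extract all phrase pairs consistent with the alignment links."""
    pairs = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to the source span [s1, s2].
            ts = [t for (s, t) in alignment if s1 <= s <= s2]
            if not ts:
                continue
            t1, t2 = min(ts), max(ts)
            # Consistency: no link from inside the target span to outside
            # the source span may exist.
            if any(t1 <= t <= t2 and not (s1 <= s <= s2)
                   for (s, t) in alignment):
                continue
            pairs.add((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs

src = ["monde", "entier"]       # French
tgt = ["whole", "world"]        # English
alignment = {(0, 1), (1, 0)}    # monde-world, entier-whole (crossing links)

print(sorted(extract_phrases(src, tgt, alignment)))
```

On this example the crossing links still yield the pair ("monde entier", "whole world"), which is how phrase pairs capture local reordering that word-for-word translation misses.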
  9. Search in SMT is inexact.
     Problem: search is NP-hard even with Model-1, mainly because the target words must be re-ordered, so the search must be approximated.
     Solutions:
     - A* search / integer programming: not efficient for long sentences
     - Greedy search: severe search errors
     - Beam search with pruning and a heuristic function: decision = Q(n) + H(n), where Q scores the past (the partial hypothesis) and H estimates the future; a good heuristic function gives an efficient quality/speed trade-off
     Conclusion: search is still far from good.
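The beam search with the Q(n) + H(n) decision rule can be sketched as follows. This toy decoder translates left to right without reordering, and the per-position candidate lists with log-probabilities are illustrative assumptions.

```python
# Toy sketch of beam search with pruning: each partial hypothesis is
# ranked by Q (accumulated past score) + H (optimistic future estimate).

def beam_search(options, beam_size=2):
    """options[j] = list of (target_word, log_prob) for source position j."""
    beams = [([], 0.0)]  # (partial translation, Q score)
    for j, cands in enumerate(options):
        # H(n): optimistic estimate = best score at each remaining position.
        h = sum(max(p for _, p in options[k]) for k in range(j + 1, len(options)))
        expanded = [(hyp + [w], q + p) for hyp, q in beams for w, p in cands]
        # Prune to the beam_size best by Q + H. (Without reordering H is the
        # same for all hypotheses at a level; with reordering it would differ.)
        expanded.sort(key=lambda x: x[1] + h, reverse=True)
        beams = expanded[:beam_size]
    return beams[0]

options = [
    [("the", -0.1), ("a", -1.2)],
    [("whole", -0.3), ("entire", -0.9)],
    [("world", -0.2), ("globe", -2.0)],
]
best_hyp, score = beam_search(options)
print(best_hyp)  # ['the', 'whole', 'world']
```

Pruning to a fixed beam size is what makes the search tractable, and also what makes it inexact: the true best hypothesis can be discarded early if the heuristic H underestimates its future score.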
  10. Translation evaluation: many metrics.
     - Objective automatic scores: most count word matches against a set of reference translations (WER / mWER / PER / mPER / BLEU / NIST)
     - Subjective scores (judged by humans): SSER / IER / ISER (meaning, syntax); adequacy-fluency
     We need automatic scoring to speed up research, but no single metric is persuasive enough, so many metrics must be used at the same time.
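Among the automatic metrics listed above, WER is the simplest to sketch: the word-level edit distance between hypothesis and reference, normalized by the reference length. The sentences below are illustrative assumptions.

```python
# Minimal sketch of WER (word error rate) against a single reference.

def wer(hyp, ref):
    """Levenshtein distance over words, divided by the reference length."""
    h, r = hyp.split(), ref.split()
    # d[i][j] = edit distance between h[:i] and r[:j]
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(h)][len(r)] / len(r)

print(wer("the whole world talks", "the whole world is talking"))  # 0.4
```

Unlike BLEU, WER penalizes any word-order difference as errors, which is one reason multiple metrics are used side by side.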
  11. Issues in state-of-the-art SMT techniques.
     - Too much approximation in training and decoding
     - Implementing a decoder for a new model is expensive, since search is heavily dependent on the model
     - Phrase segmentation is still not powerful
     - Phrase reordering is still not powerful
     - Objective metrics are not highly correlated with adequacy and fluency
     - A real challenge for computation: 10^12 words for the language model, 10^8 words for the translation model, 10^6 feature functions