# Statistical machine translation in a few slides



1. **Statistical machine translation in a few slides.** Mikel L. Forcada, Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant (Spain), and Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain). April 14-16, 2009: Free/open-source MT tutorial at the CNGL.
2. **Contents.** 1. Translation as probability; 2. "Decoding"; 3. Training; 4. "Log-linear"; 5. Ain't got nothin' but the BLEUs?; 6. The SMT lifecycle.
3. **Translation as probability/1 (the "canonical" model).** Instead of saying that a source-language (SL) sentence s and a target-language (TL) sentence t found in an SL–TL bitext are or are not translations of each other, SMT says that they are a translation of each other with a probability p(s, t) = p(t, s) (a joint probability). We assume that such a probability model is available, or at least a reasonable estimate of it.
4. **Translation as probability/2.** By basic probability laws we can write p(s, t) = p(t, s) = p(s|t) p(t) = p(t|s) p(s) (1), where p(x|y) is the conditional probability of x given y. We are interested in translating from SL to TL, that is, in finding the most likely translation given the SL sentence s: t* = arg max_t p(t|s) (2).
5. **The "canonical" model.** We can rewrite eq. (1) as p(t|s) = p(s|t) p(t) / p(s) (3); since p(s) is constant for a given source sentence, combining (3) with (2) yields t* = arg max_t p(s|t) p(t) (4).
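The interplay of the two factors in eq. (4) can be sketched in a few lines of Python. The probability tables and the candidate list below are made-up toy numbers, hand-picked for illustration; a real decoder would generate and score candidates itself:

```python
import math

# Toy numbers standing in for a reverse translation model p(s|t) and a
# target-language model p(t) when translating s = "la casa".
p_s_given_t = {"the house": 0.6, "house the": 0.6, "a home": 0.2}
p_t         = {"the house": 0.05, "house the": 0.0001, "a home": 0.03}

def score(t):
    # log p(s|t) + log p(t); logs avoid numeric underflow on long sentences
    return math.log(p_s_given_t[t]) + math.log(p_t[t])

best = max(p_s_given_t, key=score)
print(best)  # "the house": adequate and fluent, so it wins the arg max
```

Note how "house the" is just as adequate as "the house" under p(s|t); the language model p(t) is what rules it out.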
6. **"Decoding"/1.** t* = arg max_t p(s|t) p(t): we have a product of two probability models. A reverse translation model p(s|t) tells us how likely it is that the SL sentence s is a translation of the candidate TL sentence t, and a target-language model p(t) tells us how likely the sentence t is on the TL side of bitexts. These relate (respectively) to the usual notions of [reverse] adequacy (how much of the meaning of t is conveyed by s) and fluency (how fluent the candidate TL sentence is). The arg max strikes a balance between the two.
7. **"Decoding"/2.** In SMT parlance, the process of finding t* is called decoding.¹ Obviously, it does not explore all possible translations t in the search space: there are infinitely many, so the search space is pruned. Therefore, one just gets a reasonable t instead of the ideal t*. Pruning and search strategies are a very active research topic. Free/open-source software: Moses. (¹ Reading SMT articles usually entails deciphering jargon which may be very obscure to outsiders or newcomers.)
8. **Training/1.** So where do these probabilities come from? p(t) may easily be estimated from a large monolingual TL corpus (free/open-source software: irstlm). The estimation of p(s|t) is more complex; it is usually made of a lexical model, describing the probability that the translation of a certain TL word or sequence of words ("phrase"²) is a certain SL word or sequence of words, and an alignment model, describing the reordering of words or "phrases". (² A very unfortunate choice in SMT jargon.)
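Estimating p(t) from monolingual text can be illustrated with a maximum-likelihood bigram model. The three-sentence corpus is a toy assumption, and real systems (e.g. those built with irstlm) use higher-order n-grams and smoothing:

```python
from collections import Counter

# Toy monolingual TL corpus with sentence-boundary markers.
corpus = ["<s> the house is small </s>",
          "<s> the house is big </s>",
          "<s> the home is small </s>"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p_bigram(w, prev):
    # maximum-likelihood estimate p(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

def p_sentence(toks):
    p = 1.0
    for prev, w in zip(toks, toks[1:]):
        p *= p_bigram(w, prev)
    return p

print(p_sentence("<s> the house is small </s>".split()))  # 4/9
```

Without smoothing, any sentence containing an unseen bigram gets probability zero, which is why real language models back off or interpolate.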
9. **Training/2.** The lexical model and the alignment model are estimated from a large sentence-aligned bilingual corpus through a complex iterative process. An initial set of lexical probabilities is obtained by assuming, for instance, that any word in a TL sentence aligns with any word in its SL counterpart. Then alignment probabilities are computed in accordance with the lexical probabilities, and lexical probabilities are re-estimated in accordance with the alignment probabilities. This process ("expectation maximization") is repeated a fixed number of times or until some convergence is observed (free/open-source software: Giza++).
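In its simplest form, this iterative estimation is the EM training of IBM Model 1. The following minimal sketch runs it on a hypothetical two-sentence bitext; Giza++ implements this and much richer alignment models at scale:

```python
from collections import defaultdict

# Toy sentence-aligned bitext (Spanish-English).
bitext = [("la casa".split(), "the house".split()),
          ("la".split(), "the".split())]

# Uniform initialisation: any TL word may translate any SL word.
sl_vocab = {w for s_sent, _ in bitext for w in s_sent}
t_prob = defaultdict(lambda: 1.0 / len(sl_vocab))  # t_prob[(s_w, t_w)] ~ p(s_w | t_w)

for _ in range(15):                      # fixed number of EM iterations
    count = defaultdict(float)           # expected (fractional) counts
    total = defaultdict(float)
    for s_sent, t_sent in bitext:
        for s_w in s_sent:
            norm = sum(t_prob[(s_w, t_w)] for t_w in t_sent)
            for t_w in t_sent:           # E-step: fractional alignment counts
                c = t_prob[(s_w, t_w)] / norm
                count[(s_w, t_w)] += c
                total[t_w] += c
    for (s_w, t_w), c in count.items():  # M-step: re-estimate the lexicon
        t_prob[(s_w, t_w)] = c / total[t_w]

# The one-word sentence pair disambiguates "la"; EM then drives
# p(la|the) and p(casa|house) towards 1.
print(t_prob[("la", "the")], t_prob[("casa", "house")])
```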
10. **Training/3.** In "phrase-based" SMT, alignments may be used to extract (SL-phrase, TL-phrase) pairs and their corresponding probabilities, for easier decoding and to avoid "word salad".
11. **"Log-linear"/1.** More SMT jargon! It is short for a linear combination of logarithms of probabilities (and, sometimes, even of features that are not logarithms or probabilities of any kind). OK, let's take a look at the maths.
12. **"Log-linear"/2.** One can write a more general formula: p(t|s) = exp(Σ_{k=1}^{n_F} λ_k f_k(t, s)) / Z (5), with n_F feature functions f_k(t, s) which can depend on s, t or both. Setting n_F = 2, f_1(t, s) = log p(s|t), f_2(t, s) = log p(t), and Z = p(s), one recovers the canonical formula (3). The best translation is then t* = arg max_t Σ_{k=1}^{n_F} λ_k f_k(t, s) (6). Most of the f_k(t, s) are logarithms, hence "log-linear".
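Equations (5) and (6) can be made concrete with just two features and unit weights, which recovers the canonical model; all probability values below are hypothetical toy numbers:

```python
import math

MODELS = {"rev": {"the house": 0.6, "house the": 0.6},      # p(s|t)
          "lm":  {"the house": 0.05, "house the": 0.0001}}  # p(t)

features = [lambda t, s: math.log(MODELS["rev"][t]),  # f1 = log p(s|t)
            lambda t, s: math.log(MODELS["lm"][t])]   # f2 = log p(t)
lambdas = [1.0, 1.0]  # unit weights: exactly the canonical model (4)

def loglinear_score(t, s):
    # eq. (6): sum over k of lambda_k * f_k(t, s)
    return sum(l * f(t, s) for l, f in zip(lambdas, features))

best = max(["the house", "house the"],
           key=lambda t: loglinear_score(t, "la casa"))
print(best)
```

Since Z does not depend on t, it can be dropped from the arg max, which is why only the weighted feature sum is computed.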
13. **"Log-linear"/3.** "Feature selection is a very open problem in SMT" (Lopez 2008). Other possible feature functions include length penalties (discouraging unreasonably short or long translations), "inverted" versions of p(s|t), etc. Where do we get the λ_k's from? They are usually tuned so as to optimize the results on a tuning set, according to a certain objective function which is taken to be an indicator that correlates with translation quality and which may be computed automatically from the output of the SMT system and the reference translations in the corpus. This is sometimes called MERT (minimum error rate training) (free/open-source software: the Moses suite).
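Actual MERT uses an efficient line-search procedure; as a purely illustrative stand-in, one can grid-search the λ_k on a tiny tuning set, scoring each weight vector by the fraction of tuning sentences translated exactly like their reference. All data below is hypothetical:

```python
import itertools

# Each tuning example: a list of (candidate, f1, f2) tuples plus a reference.
tuning_set = [
    ([("house the", -0.5, -9.2), ("the house", -0.5, -3.0)], "the house"),
    ([("dog big a", -0.7, -10.0), ("a big dog", -1.0, -4.0)], "a big dog"),
]

def decode(cands, lams):
    # pick the candidate maximising the weighted feature sum, eq. (6)
    return max(cands, key=lambda c: lams[0] * c[1] + lams[1] * c[2])[0]

def quality(lams):
    # toy objective: fraction of tuning sentences matching the reference
    return sum(decode(c, lams) == ref for c, ref in tuning_set) / len(tuning_set)

grid = [0.0, 0.5, 1.0, 1.5]
best_lams = max(itertools.product(grid, repeat=2), key=quality)
print(best_lams, quality(best_lams))
```

In practice the objective is an automatic metric such as BLEU (next slide) rather than exact-match accuracy, and the search is far more clever than a grid.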
14. **Ain't got nothin' but the BLEUs?** The most famous "quality indicator" is called BLEU, but there are many others. BLEU counts the fraction of the 1-word, 2-word, ..., n-word sequences in the output that match the reference translation. Its correlation with subjective assessments of quality is still an open question, and a lot of SMT research is currently BLEU-driven and makes little contact with real applications of MT.
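The n-gram matching idea can be sketched with a sentence-level score restricted to unigram and bigram precision; real BLEU uses up to 4-grams, corpus-level counts and usually smoothing, so this is a simplified illustration:

```python
import math
from collections import Counter

def ngrams(toks, n):
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def bleu2(candidate, reference):
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in (1, 2):
        cand, ref = ngrams(c, n), ngrams(r, n)
        # clipped counts: a candidate n-gram matches at most as many
        # times as it occurs in the reference
        clipped = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        precisions.append(clipped / max(1, sum(cand.values())))
    if not all(precisions):
        return 0.0
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / len(c))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2)

print(bleu2("the house is small", "the house is small"))  # 1.0
```

The brevity penalty is essential: without it, outputting only the safest few words of a translation would score a perfect precision.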
15. **The SMT lifecycle.** Development comprises training (monolingual and sentence-aligned bilingual corpora are used to estimate the probability models, i.e. the features) and tuning (a held-out portion of the sentence-aligned bilingual corpus is used to tune the coefficients λ_k). Decoding: sentences s are fed into the SMT system and "decoded" into their translations t. Evaluation: the system is evaluated against a reference corpus.
16. **License.** This work may be distributed under the terms of the Creative Commons Attribution–Share Alike license (http://creativecommons.org/licenses/by-sa/3.0/) or the GNU GPL v. 3.0 (http://www.gnu.org/licenses/gpl.html). Dual license! E-mail me to get the sources: mlf@ua.es.