Translation as probability
“Decoding”
Training
“Log-linear”
Ain’t got nothin’ but the BLEUs?
The SMT lifecycle
Statistical machine translation in a few slides
Mikel L. Forcada¹,²
¹Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant,
E-03071 Alacant (Spain)
²Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)
April 14-16, 2009: Free/open-source MT tutorial at the
CNGL

Contents
1 Translation as probability
2 “Decoding”
3 Training
4 “Log-linear”
5 Ain’t got nothin’ but the BLEUs?
6 The SMT lifecycle

The “canonical” model
Translation as probability/1
Instead of saying that
a source-language (SL) sentence s in a SL text
and a target-language (TL) sentence t
as found in a SL–TL bitext are or are not a translation of
each other,
in SMT one says that they are a translation of each other
with a probability p(s, t) = p(t, s) (a joint probability).
We’ll assume we have such a probability model available.
Or at least a reasonable estimate.

Translation as probability/2
According to basic probability laws, we can write:
p(s, t) = p(t, s) = p(s|t)p(t) = p(t|s)p(s) (1)
where p(x|y) is the conditional probability of x given y.
We are interested in translating from SL to TL. That is, we
want to find the most likely translation given the SL
sentence s:
\[ t^{*} = \arg\max_{t} \, p(t \mid s) \qquad (2) \]

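We can rewrite eq. (1) as
\[ p(t \mid s) = \frac{p(s \mid t)\, p(t)}{p(s)} \qquad (3) \]
and combine it with eq. (2). Since p(s) does not depend on t, it can be
dropped from the arg max:
\[ t^{*} = \arg\max_{t} \, p(s \mid t)\, p(t) \qquad (4) \]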

“Decoding”/1
\[ t^{*} = \arg\max_{t} \, p(s \mid t)\, p(t) \]
We have a product of two probability models:
A reverse translation model p(s|t) which tells us how likely
it is that the SL sentence s is a translation of the candidate TL
sentence t, and
a target-language model p(t) which tells us how likely the
sentence t is in the TL side of bitexts.
These may be related (respectively) to the usual notions of
[reverse] adequacy: how much of the meaning of t is
conveyed by s
fluency: how fluent is the candidate TL sentence.
The arg max strikes a balance between the two.
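The balance struck by the arg max can be illustrated with a minimal Python sketch. It assumes a single fixed SL sentence and an explicit list of three candidate translations; the names `LOG_P_S_GIVEN_T`, `LOG_P_T`, `noisy_channel_score` and `best_translation` are hypothetical, and all probabilities are invented for illustration.

```python
import math

# Toy log-probabilities for one fixed source sentence s (all numbers invented).
# log p(s|t): reverse translation model.
LOG_P_S_GIVEN_T = {
    "the house is small": math.log(0.20),
    "small the house is": math.log(0.25),
    "the home is little": math.log(0.10),
}
# log p(t): target-language model.
LOG_P_T = {
    "the house is small": math.log(0.010),
    "small the house is": math.log(0.0001),
    "the home is little": math.log(0.004),
}


def noisy_channel_score(t):
    """Return log p(s|t) + log p(t) for candidate t."""
    return LOG_P_S_GIVEN_T[t] + LOG_P_T[t]


def best_translation(candidates):
    """Arg max over an explicit, finite candidate list."""
    return max(candidates, key=noisy_channel_score)


best = best_translation(list(LOG_P_T))
```

In this toy setting the candidate with the highest p(s|t) is "small the house is", but its very low p(t) makes it lose to "the house is small" once both models are multiplied.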

“Decoding”/2
In SMT parlance, the process of finding t* is called
decoding.¹
Obviously, it does not explore all possible translations t in
the search space. There are infinitely many.
The search space is pruned.
Therefore, one just gets a reasonable t instead of the
ideal t*.
Pruning and search strategies are a very active research
topic.
Free/open-source software: Moses.
¹ Reading SMT articles usually entails deciphering jargon which may be
very obscure to outsiders or newcomers.
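One common way to prune the search space is beam search. The sketch below is a simplified illustration and does not describe how Moses works internally: it assumes monotone, word-by-word translation, a bigram language model, and invented probabilities. `TRANSLATIONS`, `BIGRAM_LM`, `FLOOR` and `beam_decode` are hypothetical names.

```python
import math

# Hypothetical toy models; every probability below is invented.
TRANSLATIONS = {  # SL word -> {TL word: p(SL word | TL word)}
    "la": {"the": 0.9, "it": 0.1},
    "casa": {"house": 0.7, "home": 0.3},
}
BIGRAM_LM = {  # p(word | previous word)
    ("<s>", "the"): 0.5, ("<s>", "it"): 0.2,
    ("the", "house"): 0.3, ("the", "home"): 0.1,
    ("it", "house"): 0.01, ("it", "home"): 0.02,
}
FLOOR = 1e-4  # probability assumed for bigrams missing from the table


def beam_decode(source_words, beam_size=2):
    """Monotone decoding that keeps only the beam_size best partial hypotheses."""
    beam = [(0.0, ["<s>"])]  # (log score, TL words so far)
    for s_word in source_words:
        expanded = []
        for score, words in beam:
            for t_word, p_trans in TRANSLATIONS[s_word].items():
                p_lm = BIGRAM_LM.get((words[-1], t_word), FLOOR)
                new_score = score + math.log(p_trans) + math.log(p_lm)
                expanded.append((new_score, words + [t_word]))
        # Pruning: discard every hypothesis that falls outside the beam.
        beam = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam_size]
    best_score, best_words = beam[0]
    return best_words[1:], best_score


translation, log_score = beam_decode(["la", "casa"])
```

A smaller `beam_size` makes decoding faster but increases the risk that the hypothesis leading to t* is discarded early, which is why the result is only a reasonable t.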

Training/1
So where do these probabilities come from?
p(t) may easily be estimated from a large monolingual TL
corpus (free/open-source software: IRSTLM).
The estimation of p(s|t) is more complex. It’s usually made
up of
a lexical model describing the probability that the
translation of a certain TL word or sequence of words
(“phrase”²) is a certain SL word or sequence of words.
an alignment model describing the reordering of words or
“phrases”.
² A very unfortunate choice in SMT jargon.
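As an illustration of how p(t) might be estimated from monolingual text, here is a minimal bigram language model with add-one smoothing. This is a sketch under simplifying assumptions (whitespace tokenisation, a three-sentence toy corpus) and is not meant to reflect how IRSTLM is implemented; `train_bigram_lm` and `log_prob` are hypothetical names.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count contexts and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])  # every token except </s> acts as a context
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return unigrams, bigrams, vocab


def log_prob(sentence, unigrams, bigrams, vocab):
    """Add-one smoothed log p(t) under the bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab)))
    return total


corpus = ["the house is small", "the house is big", "the garden is small"]
model = train_bigram_lm(corpus)
fluent = log_prob("the house is small", *model)
salad = log_prob("small is house the", *model)
```

With this toy corpus the scrambled sentence receives a lower log-probability than the fluent one, which is the behaviour expected from a fluency model.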

Training/2
The lexical model and the alignment model are estimated
using a large sentence-aligned bilingual corpus through a
complex iterative process.
An initial set of lexical probabilities is obtained by
assuming, for instance, that any word in the TL sentence
is equally likely to align with any word in its SL counterpart. And then:
Alignment probabilities in accordance with the lexical
probabilities are computed.
Lexical probabilities are obtained in accordance with the
alignment probabilities
This process (“expectation maximization”) is repeated a
fixed number of times or until some convergence is
observed (free/open-source software: GIZA++).
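The iterative process can be sketched with the simplest word-based model, usually known as IBM Model 1. Choosing Model 1 is an assumption made here for brevity; GIZA++ implements considerably richer models, and this sketch omits the NULL word. `train_model1` is a hypothetical name, and the two-sentence bitext is invented.

```python
from collections import defaultdict


def train_model1(bitext, iterations=10):
    """EM for lexical probabilities p(s_word | t_word), IBM-Model-1 style.

    bitext is a list of (source_words, target_words) pairs.
    """
    source_vocab = {s for src, _ in bitext for s in src}
    # Initialisation: every SL word is equally likely given any TL word.
    prob = defaultdict(lambda: 1.0 / len(source_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (s_word, t_word) links
        total = defaultdict(float)  # expected count of links involving t_word
        for src, tgt in bitext:
            for s in src:
                # E-step: alignment posteriors implied by the current lexical model.
                norm = sum(prob[(s, t)] for t in tgt)
                for t in tgt:
                    posterior = prob[(s, t)] / norm
                    count[(s, t)] += posterior
                    total[t] += posterior
        # M-step: re-estimate lexical probabilities from the expected counts.
        for (s, t), c in count.items():
            prob[(s, t)] = c / total[t]
    return prob


bitext = [
    ("la casa".split(), "the house".split()),
    ("la flor".split(), "the flower".split()),
]
lexical = train_model1(bitext)
```

Although "la" and "casa" both co-occur with "house", the second sentence pair lets EM attribute "la" to "the", so probability mass for "house" shifts towards "casa" over the iterations.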

Training/3
In “phrase-based” SMT, alignments may be used to extract
(SL-phrase, TL-phrase) pairs of phrases
and their corresponding probabilities
for easier decoding and to avoid “word salad”.
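A simplified sketch of how phrase pairs might be read off a word alignment follows. It keeps only pairs that are consistent with the alignment (no word inside the pair is linked to a word outside it) and, unlike full extraction heuristics, does not extend phrases over unaligned words. The function name `extract_phrase_pairs`, the example sentence pair and its alignment are all invented for illustration.

```python
from collections import Counter


def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    """Return (SL-phrase, TL-phrase) pairs consistent with a word alignment.

    alignment is a set of (source_index, target_index) links.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 >= max_len:
                continue
            # Consistency: no word in the target span links outside the source span.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


src = "la casa verde".split()
tgt = "the green house".split()
alignment = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrase_pairs(src, tgt, alignment)

# Relative-frequency estimate of p(SL-phrase | TL-phrase) over the extracted pairs.
pair_counts = Counter(pairs)
tgt_counts = Counter(t for _, t in pairs)
phrase_prob = {(s, t): c / tgt_counts[t] for (s, t), c in pair_counts.items()}
```

Here "casa verde" is paired with "green house" as a unit, so the reordering is stored inside the phrase pair rather than left to the decoder; "la casa" yields no pair because its translation would not be contiguous.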

“Log-linear”/1
More SMT jargon!
It’s short for linear combination of logarithms of
probabilities.
And, sometimes, even features that aren’t logarithms or
probabilities of any kind.
OK, let’s take a look at the maths.

“Log-linear”/2
One can write a more general formula:
\[ p(t \mid s) = \frac{\exp\left( \sum_{k=1}^{n_F} \lambda_k f_k(t, s) \right)}{Z} \qquad (5) \]
with nF feature functions fk (t, s) which can depend on s, t
or both.
Setting nF = 2, λ1 = λ2 = 1, f1(t, s) = log p(s|t), f2(t, s) = log p(t), and
Z = p(s) one recovers the canonical formula (3).
The best translation is then
\[ t^{*} = \arg\max_{t} \sum_{k=1}^{n_F} \lambda_k f_k(t, s) \qquad (6) \]
Most of the fk (t, s) are logarithms, hence “log-linear”.
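The following Python sketch evaluates eqs. (5) and (6) over a small candidate list. It assumes that Z can be computed by summing over an explicit, finite set of candidates, which is a simplification; the candidates, feature values and the names `log_linear_score` and `posterior` are hypothetical.

```python
import math


def log_linear_score(features, weights):
    """Weighted sum of feature values: sum over k of lambda_k * f_k(t, s)."""
    return sum(w * f for w, f in zip(weights, features))


def posterior(candidates, weights):
    """p(t|s) over a finite candidate list; Z is the sum of exp(score)."""
    scores = {t: log_linear_score(f, weights) for t, f in candidates.items()}
    z = sum(math.exp(v) for v in scores.values())
    return {t: math.exp(v) / z for t, v in scores.items()}


# Hypothetical candidates with features (log p(s|t), log p(t)); numbers invented.
candidates = {
    "the house is small": (math.log(0.20), math.log(0.010)),
    "small the house is": (math.log(0.25), math.log(0.0001)),
}
canonical = posterior(candidates, weights=(1.0, 1.0))
best = max(candidates, key=lambda t: log_linear_score(candidates[t], (1.0, 1.0)))
```

With both weights set to 1 the scores are exactly log p(s|t) + log p(t), so the ranking matches the canonical model; setting the second weight to 0 would ignore the language model and flip the ranking in this toy example.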

“Log-linear”/3
“Feature selection is a very open problem in SMT” (Lopez
2008)
Other possible functions include length penalties
(discouraging unreasonably short or long translations),
“inverted” versions of p(s|t), etc.
Where do we get the λk ’s from?
They are usually tuned so as to optimize the results on a
tuning set, according to a certain objective function that
is taken to be an indicator that correlates with translation
quality, and
may be automatically obtained from the output of the SMT
system and the reference translation in the corpus.
This is sometimes called MERT (minimum error rate training)
(free/open-source software: the Moses suite).
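The idea of tuning the λk’s on a tuning set can be sketched with an exhaustive grid search. This is a stand-in chosen for brevity and is not the search procedure used by MERT; the objective here is plain exact-match accuracy rather than an indicator such as BLEU. The tuning set, feature values and the names `decode` and `tune` are hypothetical.

```python
import itertools


def decode(candidates, weights):
    """Pick the candidate with the highest weighted feature sum."""
    return max(candidates, key=lambda t: sum(w * f for w, f in zip(weights, candidates[t])))


def tune(tuning_set, grid):
    """Try every weight pair on the grid; return the best pair and its accuracy."""
    best_weights, best_score = None, -1.0
    for weights in itertools.product(grid, repeat=2):
        hits = sum(decode(cands, weights) == ref for cands, ref in tuning_set)
        score = hits / len(tuning_set)
        if score > best_score:  # keep the first weight pair reaching a new best
            best_weights, best_score = weights, score
    return best_weights, best_score


# Hypothetical tuning set: (candidate -> (f1, f2) feature values, reference).
tuning_set = [
    ({"house the": (-0.5, -6.0), "the house": (-1.0, -2.0)}, "the house"),
    ({"flower a": (-1.0, -5.0), "a flower": (-2.0, -1.5)}, "a flower"),
    ({"home": (-4.0, -1.0), "the home": (-1.0, -2.5)}, "the home"),
]
weights, accuracy = tune(tuning_set, grid=[0.0, 0.5, 1.0])
```

In this toy set neither feature alone gets all three references right, so the search has to settle on a weight pair in which both features contribute.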

Ain’t got nothin’ but the BLEUs?
The most famous “quality indicator” is called BLEU, but
there are many others.
BLEU counts which fraction of the 1-word,
2-word, ..., n-word sequences in the output match the
reference translation.
Correlation with subjective assessments of quality is still an
open question.
A lot of SMT research is currently BLEU-driven and makes
little contact with real applications of MT.
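A minimal sketch of a BLEU-like score may make the counting concrete. It assumes a single reference, whitespace tokenisation, sentence-level scoring and no smoothing, so it is a simplification of the metric as normally computed over a whole corpus; `ngrams` and `bleu` are hypothetical names.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-word sequences in tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference, without smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it
        # occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: punish outputs shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


perfect = bleu("the house is small", "the house is small")
partial = bleu("the house is very small", "the house is small", max_n=2)
```

In the second call, 4 of 5 output words and 2 of 4 output word pairs match the reference, so the score is the geometric mean of 0.8 and 0.5.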

The SMT lifecycle
Development:
Training: monolingual and sentence-aligned
bilingual corpora are used to estimate
probability models (features)
Tuning: a held-out portion of the
sentence-aligned bilingual corpus is
used to tune the coefficients λk
Decoding: sentences s are fed into the SMT system and
“decoded” into their translations t.
Evaluation: the system is evaluated against a reference
corpus.

License
This work may be distributed under the terms of
the Creative Commons Attribution–Share Alike license:
http://creativecommons.org/licenses/by-sa/3.0/
the GNU GPL v. 3.0 License:
http://www.gnu.org/licenses/gpl.html
Dual license! E-mail me to get the sources: mlf@ua.es
