2. Machine Translation
MT (first demonstrated by IBM in 1954) is:
Commercially interesting (the EU spends about €1,000,000,000 per year on translation)
Academically interesting (NLP technologies)
What makes MT hard?
Word order: monde entier ↔ whole world (French places the adjective after the noun)
Word sense: bank → rivière (river-bank sense) / banque (financial institution)
Idioms: to kick the bucket → mourir (to die)
Various approaches:
Rule-based MT
Example-based MT (1984): translates by merging fragments already stored in memory
Statistical MT (1993)
Source: Wikipedia, John Hutchins, P. Koehn
3. Statistical Machine Translation
Uses a parallel corpus
Europarl (European parliament)
Hansard (Canadian parliament)
Verbmobil (German-English)
United Nations (used by Google)
Learns:
Lexicon and alignment, e.g. Le monde entier ne parle pas du problème ↔ The whole world is not talking about the problem
Well-formedness of the target language (a language model)
Advantages (over rule-based and example-based MT):
Good performance
Quick implementation
Can deal with noisy text
4. Translation units & alignment?
Unit = WORD
Seems to adapt more readily to variability
BUT more parameters
Unit = PHRASE
Seems to adapt less readily to variability
BUT fewer parameters
5. Two approaches to SMT
Source $s_1^J = s_1, \ldots, s_j, \ldots, s_J$ $\Rightarrow$ target $t_1^I = t_1, \ldots, t_i, \ldots, t_I$. Find
$$\hat{t}_1^I = \arg\max_{t_1^I} \Pr(t_1^I \mid s_1^J)$$
Source-channel translation:
$$\hat{t}_1^I = \arg\max_{t_1^I} \left\{ \Pr(t_1^I) \cdot \Pr(s_1^J \mid t_1^I) \right\}$$
Direct maximum-entropy translation:
$$\Pr(t_1^I \mid s_1^J) = p_{\lambda_1^M}(t_1^I \mid s_1^J) = \frac{\exp\left[\sum_{m=1}^{M} \lambda_m h_m(t_1^I, s_1^J)\right]}{\sum_{{t'}_1^I} \exp\left[\sum_{m=1}^{M} \lambda_m h_m({t'}_1^I, s_1^J)\right]}$$
$$\hat{t}_1^I = \arg\max_{t_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(t_1^I, s_1^J) \right\}$$
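To make the two decision rules concrete, here is a minimal Python sketch (the candidate set and all probabilities are invented for illustration). It also shows why the source-channel rule is the special case of the log-linear rule with exactly two features, $h_1 = \log \Pr(t_1^I)$ and $h_2 = \log \Pr(s_1^J \mid t_1^I)$, with $\lambda_1 = \lambda_2 = 1$:

```python
import math

# Toy candidate translations with hypothetical model scores:
# Pr_t   : language model probability Pr(t)   (assumed values)
# Pr_s_t : translation model probability Pr(s|t)  (assumed values)
candidates = {
    "the whole world": {"Pr_t": 0.020, "Pr_s_t": 0.30},
    "all the world":   {"Pr_t": 0.015, "Pr_s_t": 0.25},
}

# Source-channel rule: t* = argmax_t Pr(t) * Pr(s|t)
sc_best = max(candidates,
              key=lambda t: candidates[t]["Pr_t"] * candidates[t]["Pr_s_t"])

# Direct maximum-entropy (log-linear) rule:
# t* = argmax_t sum_m lambda_m * h_m(t, s).
# With h_1 = log Pr(t), h_2 = log Pr(s|t) and lambda_1 = lambda_2 = 1,
# this reduces exactly to the source-channel rule.
lambdas = [1.0, 1.0]

def loglinear_score(t):
    feats = [math.log(candidates[t]["Pr_t"]),
             math.log(candidates[t]["Pr_s_t"])]
    return sum(l * h for l, h in zip(lambdas, feats))

me_best = max(candidates, key=loglinear_score)
assert sc_best == me_best  # the special-case equivalence
print(sc_best)
```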
6. Source-channel vs. Maximum-Entropy
[Diagram: the two architectures share the same pipeline: Source language text → Preprocessing → Global search → Target language text. In the source-channel system the global search computes $\hat{t}_1^I = \arg\max_{t_1^I} \{ \Pr(t_1^I) \cdot \Pr(s_1^J \mid t_1^I) \}$, fed by the language model $\Pr(t_1^I)$ and the translation model $\Pr(s_1^J \mid t_1^I)$. In the maximum-entropy system it computes $\hat{t}_1^I = \arg\max_{t_1^I} \{ \sum_{m=1}^{M} \lambda_m h_m(t_1^I, s_1^J) \}$, fed by the feature functions $\lambda_1 h_1(t_1^I, s_1^J), \ldots, \lambda_M h_M(t_1^I, s_1^J)$.]
The SC approach is a special case of the ME approach
We will first train the SC system, then integrate it into the ME framework
7. Alignment in source-channel approach
$$\Pr(s_1^J \mid t_1^I) = \sum_{a} \Pr(s_1^J, a \mid t_1^I)$$
where $a$ is an alignment between translation units
P(translation) = P(alignment) × P(lexicon)
Word-to-word alignment: $I^J$ possible alignments
Source-to-target mapping: Model-1 (can be efficiently trained)
Target-to-source fertility: Model-2 (cannot be efficiently trained)
Training (EM): train Model-1 and use its parameters to initialize Model-2; a toy EM sketch follows below
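A minimal EM sketch for Model-1 lexicon training, under assumptions: a two-sentence toy corpus, Pr(s|t) initialized uniformly, and the NULL word omitted for brevity (none of this data comes from the slides):

```python
from collections import defaultdict

# Toy parallel corpus: (source, target) sentence pairs (illustrative only).
corpus = [
    ("le monde entier".split(), "the whole world".split()),
    ("le monde".split(), "the world".split()),
]

# Initialize the lexicon Pr(s|t) uniformly over the source vocabulary.
src_vocab = {s for ss, _ in corpus for s in ss}
prob = defaultdict(lambda: 1.0 / len(src_vocab))  # prob[(s, t)] = Pr(s|t)

for _ in range(10):  # EM iterations
    count = defaultdict(float)   # expected counts c(s, t)
    total = defaultdict(float)   # c(t) = sum_s c(s, t)
    # E-step: distribute each source word's count over the target words
    # of its sentence, in proportion to the current lexicon.
    for ss, ts in corpus:
        for s in ss:
            norm = sum(prob[(s, t)] for t in ts)
            for t in ts:
                c = prob[(s, t)] / norm
                count[(s, t)] += c
                total[t] += c
    # M-step: renormalize expected counts into the new lexicon.
    for (s, t), c in count.items():
        prob[(s, t)] = c / total[t]

print(prob[("monde", "world")])  # approaches 1 as EM converges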
Phrase-to-phrase alignment:
Get the Viterbi alignments from the word-to-word training
Build consistent phrase-to-phrase alignment pairs (see the extraction sketch below)
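A sketch of the consistency check used to extract phrase pairs from a Viterbi word alignment: a source span and a target span form a pair only if no alignment link crosses the box boundary. The alignment links below are a hypothetical Viterbi output, not one computed in this deck:

```python
# Viterbi word alignment as (source_pos, target_pos) links (hypothetical).
src = "le monde entier".split()
tgt = "the whole world".split()
links = {(0, 0), (1, 2), (2, 1)}  # le-the, monde-world, entier-whole

def consistent(i1, i2, j1, j2):
    """A phrase pair is consistent iff every link touching the source
    span [i1,i2] or the target span [j1,j2] lies inside the box."""
    if not any(i1 <= i <= i2 and j1 <= j <= j2 for (i, j) in links):
        return False  # require at least one link inside the box
    return all((i1 <= i <= i2) == (j1 <= j <= j2) for (i, j) in links)

pairs = [(" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1]))
         for i1 in range(len(src)) for i2 in range(i1, len(src))
         for j1 in range(len(tgt)) for j2 in range(j1, len(tgt))
         if consistent(i1, i2, j1, j2)]
print(pairs)  # includes ('monde entier', 'whole world'), etc.
```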
8. The state-of-the-art SMT system
Start with a source-channel system
Train and find the word-to-word alignments
Build phrase-to-phrase alignments and lexicon
Then integrate the phrase-to-phrase model into a
Maximum-Entropy framework
Train the scaling factors λ using GIS (Generalized Iterative Scaling); a toy update is sketched after this list
Add more feature functions
Many language models (trained on trillions of words)
Both P(s|t) and P(t|s) can be used (a more symmetrical translation)
P(I) (word penalty)
Many more features can be added (a conventional dictionary, …)
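A toy sketch of GIS updates for the scaling factors λ, assuming a fixed candidate set with precomputed feature values (real systems train on n-best lists; all numbers here are invented). Each iteration nudges every λ_m so the model's expected feature value moves toward the empirical one:

```python
import math

# Hypothetical setup: one source sentence, two candidate translations,
# two feature functions h_1, h_2 with made-up values.
feats = {
    "t_good": [2.0, 1.0],
    "t_bad":  [1.0, 2.0],
}
reference = "t_good"  # the observed correct translation
lambdas = [0.0, 0.0]
# GIS constant: bounds the total feature mass of any candidate.
C = max(sum(h) for h in feats.values())

def model_probs(lams):
    """Log-linear model p_lambda(t | s) over the candidate set."""
    scores = {t: math.exp(sum(l * h for l, h in zip(lams, hv)))
              for t, hv in feats.items()}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

for _ in range(100):  # GIS iterations
    p = model_probs(lambdas)
    for m in range(len(lambdas)):
        empirical = feats[reference][m]                   # E_data[h_m]
        expected = sum(p[t] * feats[t][m] for t in feats) # E_model[h_m]
        lambdas[m] += (1.0 / C) * math.log(empirical / expected)

print(model_probs(lambdas))  # the reference now gets most of the mass
```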
9. Search in SMT is inexact
Problem: search is NP-hard even with Model-1, mainly
because the target words must be re-ordered, so the
search has to be approximated
Solutions:
A* search / integer programming: not efficient for long sentences
Greedy search: severe search errors
Beam search with pruning and a heuristic function
Decision = Q(n) + H(n), where Q(n) scores the past (the partial hypothesis) and H(n) estimates the future (the remaining cost); see the sketch below
A good heuristic function yields an efficient quality/speed trade-off
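A minimal beam-search sketch with the Q(n) + H(n) ranking, assuming a monotone toy search space where each source position offers a few scored target words (all scores invented); H(n) is the optimistic best score over the uncovered positions:

```python
import heapq

# Toy search space: per source position, candidate target words with
# hypothetical log-probabilities.
options = [
    [("the", -0.1), ("a", -0.5)],
    [("whole", -0.2), ("entire", -0.4)],
    [("world", -0.1), ("globe", -0.9)],
]
BEAM = 2

# H(n): optimistic future score = best option at every remaining position.
best_future = [max(p for _, p in opts) for opts in options]
suffix = [0.0] * (len(options) + 1)
for j in range(len(options) - 1, -1, -1):
    suffix[j] = suffix[j + 1] + best_future[j]

beam = [(0.0, [])]  # (Q(n), partial translation)
for j, opts in enumerate(options):
    expanded = [(q + p, hyp + [w]) for q, hyp in beam for w, p in opts]
    # Prune by Q(n) + H(n), keeping the BEAM best hypotheses.
    beam = heapq.nlargest(BEAM, expanded,
                          key=lambda x: x[0] + suffix[j + 1])

score, translation = max(beam)
print(score, " ".join(translation))
```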
Conclusion: search is still far from good
10. Translation evaluation: many metrics
Objective automatic scores: most count word matches
against a set of references
WER / mWER / PER / mPER / BLEU / NIST (a WER sketch follows below)
Subjective scores (judged by humans):
SSER / IER / ISER: meaning, syntax
Adequacy-Fluency
We need automatic scoring to speed up research,
but no single metric is persuasive enough
Must use many metrics at the same time
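As one example of these metrics, WER is the word-level edit distance to a reference divided by the reference length, and mWER takes the minimum over several references; a minimal sketch:

```python
def wer(hyp, ref):
    """Word error rate: edit distance (substitutions, insertions,
    deletions) between hypothesis and reference, divided by |ref|."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(h)][len(r)] / len(r)

def mwer(hyp, refs):
    """Multi-reference WER: score against the closest reference."""
    return min(wer(hyp, ref) for ref in refs)

print(mwer("the whole world", ["the whole world", "all the world"]))  # 0.0
```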
11. Issues in state-of-the-art SMT techniques
Too much approximation in training and decoding
Implementing decoding for a new model is
expensive, since the search is heavily dependent on the
model
Phrase segmentation is still not powerful
Phrase reordering is still not powerful
Objective metrics are not highly correlated with
adequacy and fluency
Real challenge for computation:
10^12 words for the language model
10^8 words for the translation model
10^6 feature functions