Deep Learning
for Machine Translation
A dramatic turn of paradigm
Alberto Massidda
Who we are
● Founded in 2001;
● Branches in Milan, Rome and London;
● Market leader in enterprise-ready solutions based on Open Source tech;
● Expertise: Open Source, DevOps, Public and private cloud, Search, BigData
and many more...
Outline 1. Statistical Machine Translation
a. Language Model
b. Translation Model
c. Decoding
2. Neural Machine Translation
a. Recurrent Networks
b. Encoder - Decoder architecture
c. Attention Model
3. In the next episode
Statistical Machine Translation
1. Foreign language as a noisy channel
2. Language model and Translation model
3. Training (building the translation model)
4. Decoding (translating with the translation model)
Noisy channel model
Goal
Translate a sentence in a foreign language f into our language e:
e* = argmax_e p(e|f)
The abstract model
1. Transmit e over a noisy channel.
2. The channel garbles the sentence and f is received.
3. Try to recover e by reasoning about:
a. how likely it is that e was the message, p(e) (source model)
b. how e gets garbled into f, p(f|e) (channel model)
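For completeness, this is how the two factors combine, via Bayes' rule (the denominator p(f) does not depend on e, so it drops out of the argmax):
e* = argmax_e p(e|f)
   = argmax_e p(e) · p(f|e) / p(f)
   = argmax_e p(e) · p(f|e)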
Word choice and word reordering
P(f|e) cares about words, in any order.
● “It’s too late” → “Troppo tardi è” ✓
● “It’s too late” → “È troppo tardi” ✓
● “It’s too late” → “È troppa birra” ✗
P(e) cares about word order.
● “È troppo tardi” ✓
● “Troppo tardi è” ✗
P(e) and P(f|e)
Where do these numbers come from?
Language model
P(e) comes from a Language model, a machine that assigns scores to sentences, estimating their likelihood.
1. Record every sentence ever said in English (1 Billion?)
2. If the sentence “how’s it going?” appears 76413 times in that database, then we say:
P(“how’s it going?”) ≈ 76413 / 1,000,000,000
Language model N-grams
Problem
A lot of perfectly fine sentences (“My phone is poisonous”) may get zero probability simply because they never occur in that database.
Solution
Break sentences into components:
if the components are good, the whole sentence (probably) is too.
“My phone”, “phone is”, “is poisonous”.
Language model 3-grams
P(My phone is poisonous) ≃
b(My | <start> <start>) *
b(phone | <start> My) *
b(is | My phone) *
b(poisonous | phone is) *
b(<end> | is poisonous) *
b(<end> | poisonous <end>)
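A minimal count-based sketch of this idea in Python (a hypothetical toy corpus; here b(w | u v) is simply estimated as count(u v w) / count(u v), with no smoothing):

```python
from collections import Counter

START, END = "<start>", "<end>"

def train_trigram_lm(corpus):
    """Count trigrams and their 2-word contexts over a list of tokenized sentences."""
    trigrams, contexts = Counter(), Counter()
    for sentence in corpus:
        tokens = [START, START] + sentence + [END, END]
        for i in range(len(tokens) - 2):
            u, v, w = tokens[i], tokens[i + 1], tokens[i + 2]
            trigrams[(u, v, w)] += 1
            contexts[(u, v)] += 1
    return trigrams, contexts

def sentence_prob(sentence, trigrams, contexts):
    """P(sentence) ≈ product of b(w | u v) = count(u v w) / count(u v)."""
    tokens = [START, START] + sentence + [END, END]
    prob = 1.0
    for i in range(len(tokens) - 2):
        u, v, w = tokens[i], tokens[i + 1], tokens[i + 2]
        if contexts[(u, v)] == 0:
            return 0.0  # unseen context; a real LM would smooth or back off here
        prob *= trigrams[(u, v, w)] / contexts[(u, v)]
    return prob

# toy usage
corpus = [["My", "phone", "is", "poisonous"], ["My", "phone", "is", "ringing"]]
tri, ctx = train_trigram_lm(corpus)
print(sentence_prob(["My", "phone", "is", "poisonous"], tri, ctx))  # 0.5
```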
Translation model
Next we need to worry about P(f|e), the probability of a French string f given an
English string e.
This is called a translation model.
It boils down to computing alignments between source and target languages.
Computing alignments intuition
Pairs of English and Chinese words that occur together in a parallel sentence pair may be translations of each other.
Training Data
A parallel corpus is a collection of texts, each of which is translated into one or more languages other than the original.
EN | IT
Look at that! | Guarda lì!
I've never seen anything like that in my life! | Non ho mai visto nulla di simile in vita mia!
That's incredible! | È incredibile!
That's terrific. | È eccezionale.
Expectation Maximization algorithm
A simple 2-sentence toy corpus over the words b, c, x, y (alignment diagram not reproduced).
Expectation Maximization algorithm
This algorithm iterates over the data, progressively amplifying latent regularities of the system.
It converges to a local optimum without any user supervision.
In machine translation, we call it IBM Model 1.
In the translation industry, we call it GIZA.
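A compact sketch of the IBM Model 1 EM loop (a simplified illustration, not the GIZA implementation; the NULL word is omitted and all names are mine):

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate word-translation probabilities t(f|e) with EM over sentence pairs (e_words, f_words)."""
    f_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # start from a uniform guess
    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(e, f)
        total = defaultdict(float)                # expected counts c(e)
        # E step: collect expected alignment counts under the current t
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(e, f)] for e in es)
                for e in es:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[e] += delta
        # M step: re-estimate t(f|e) from the expected counts
        for (e, f), c in count.items():
            t[(e, f)] = c / total[e]
    return t

# toy usage: two sentence pairs
pairs = [(["the", "house"], ["la", "casa"]), (["the", "book"], ["il", "libro"])]
t = ibm_model1(pairs)
print(t[("house", "casa")], t[("the", "la")])
```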
Decoding
Word and phrase alignments are leveraged to build a “space” for a search
algorithm.
Translating is searching in a space of options.
Enter Moses
Translation options selection
Decoding in action
● Each level of the search corresponds to how much of the source sentence is covered
● Translation stops when all source words are translated
● The algorithm expands the most promising node first
● Options are tried highest-probability first
● Reordering adds a penalty
● The language model scores the output of each stage to influence the decoder’s judgement
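A highly simplified best-first decoder sketch in this spirit (a hypothetical toy phrase table and language model; no hypothesis recombination or future-cost estimation as a real decoder like Moses would have):

```python
import heapq, itertools
from math import log

# hypothetical toy phrase table: source phrase -> [(target word, log prob)]
phrase_table = {
    ("it", "is"): [("è", log(0.8))],
    ("too",): [("troppo", log(0.9))],
    ("late",): [("tardi", log(0.9))],
}

def lm_score(prev, word):
    """Stand-in bigram language model rewarding fluent Italian continuations."""
    good = {("<s>", "è"), ("è", "troppo"), ("troppo", "tardi")}
    return log(0.6) if (prev, word) in good else log(0.1)

def decode(source, distortion_penalty=log(0.5)):
    """Best-first search over partial hypotheses, expanding the most promising node first."""
    tiebreak = itertools.count()
    heap = [(0.0, next(tiebreak), frozenset(), "<s>", ())]
    while heap:
        neg_score, _, covered, prev, output = heapq.heappop(heap)
        if len(covered) == len(source):               # every source word translated: done
            return " ".join(output), -neg_score
        for i in range(len(source)):
            for j in range(i + 1, len(source) + 1):
                span = tuple(source[i:j])
                if span not in phrase_table or covered & set(range(i, j)):
                    continue                           # unknown phrase or words already covered
                for target, tm in phrase_table[span]:
                    score = -neg_score + tm + lm_score(prev, target)
                    expected_next = max(covered) + 1 if covered else 0
                    if i != expected_next:
                        score += distortion_penalty    # reordering adds a penalty
                    heapq.heappush(heap, (-score, next(tiebreak),
                                          covered | frozenset(range(i, j)),
                                          target, output + (target,)))
    return None, float("-inf")

print(decode(["it", "is", "too", "late"]))             # -> ('è troppo tardi', ...)
```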
Decoding in action
Neural machine translation
NMT is based on probability too, but with some differences:
● end-to-end training, no more separate Translation + Language Models;
● factorization by the chain rule over the full history, instead of the Bayesian noisy-channel decomposition with its naive independence assumptions;
If a sentence f of length n is a sequence of words f_1, …, f_n, then p(f) is:
p(f) = p(f_1) · p(f_2 | f_1) · … · p(f_n | f_1, …, f_(n-1))
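For example, with an illustrative sentence of my own choosing:
p(È troppo tardi </s>) = p(È) · p(troppo | È) · p(tardi | È troppo) · p(</s> | È troppo tardi)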
Recurrent network
Encoder - Decoder architecture
With a source sentence f and a target sentence e, we could model the pair with a single RNN reading their concatenation (one single sequence).
But the two languages are independent (vocabulary and domain), so we can split the model into 2 separate RNNs:
1. an encoder RNN that reads f and compresses it into a summary vector;
2. a decoder RNN that generates e word by word, conditioned on that summary vector.
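A minimal sketch of this split in PyTorch (hypothetical sizes and class names; a toy GRU encoder-decoder, not the exact architecture from the talk):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder: the encoder's last hidden state is the summary vector."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # 1. the encoder RNN reads the source and produces a summary vector
        _, summary = self.encoder(self.src_emb(src_ids))
        # 2. the decoder RNN starts from the summary and predicts each target word
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), summary)
        return self.out(dec_states)               # logits over the target vocabulary

# toy usage: batch of 1, source length 5, target length 6
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (1, 5))
tgt = torch.randint(0, 1000, (1, 6))
print(model(src, tgt).shape)                       # torch.Size([1, 6, 1000])
```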
Sequence-to-sequence (seq2seq) architecture
[Diagram: encoder hidden states h1…h5 read “THE WAITER TOOK THE PLATES”; decoder hidden states g1…g6 emit “IL CAMERIERE PRESE I PIATTI </s>”]
Summary vector as information bottleneck
A fixed-size representation degrades as sentence length increases.
Plus, the alignment learning operates on a many-to-many logic: the gradient flows towards everybody for any alignment mistake.
Let’s gate the gradient flow through a context vector, a weighted average of the source hidden states (also known as “soft search”).
The weights are computed by a feed-forward network with a softmax activation.
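A small sketch of that context-vector computation in PyTorch (additive, Bahdanau-style scoring; sizes and names are my own assumptions):

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Soft search: score each encoder state against the decoder state, softmax, weighted average."""
    def __init__(self, hidden=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, decoder_state, encoder_states):
        # decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden)
        src_len = encoder_states.size(1)
        expanded = decoder_state.unsqueeze(1).expand(-1, src_len, -1)
        scores = self.score(torch.cat([expanded, encoder_states], dim=-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)           # e.g. [0.7, 0.05, 0.1, 0.05, 0.1]
        context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
        return context, weights

# toy usage: 1 sentence, 5 source hidden states
attn = Attention()
context, weights = attn(torch.randn(1, 128), torch.randn(1, 5, 128))
print(weights.shape, context.shape)                        # torch.Size([1, 5]) torch.Size([1, 128])
```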
Attention model
[Diagram: the same encoder-decoder, with attention weights 0.7, 0.05, 0.1, 0.05, 0.1 over h1…h5 summed into a context vector for the current decoder step]
Attention model
[Diagram: the same network at another decoder step, with attention weights 0.1, 0.05, 0.1, 0.05, 0.7 over h1…h5]
Attention model
[Diagram: another decoder step, with attention weights 0.1, 0.05, 0.7, 0.05, 0.1 over h1…h5]
Beam decoding
We can do better than just picking the most probable word at each step.
We can keep a list of the best hypotheses at each step, which gives us a beam search.
A beam size of 10 offers the best trade-off.
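A sketch of beam search over a decoder step function (the `step` callable here is a hypothetical stand-in for one decoder step returning log-probabilities of next tokens):

```python
import math

def beam_search(step, start_token, end_token, beam_size=10, max_len=50):
    """Keep the `beam_size` best partial hypotheses (sequence, log-prob) at every step."""
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step(seq):               # log-probs of next tokens given the prefix
                candidates.append((seq + [token], score + logp))
        # prune: keep only the best `beam_size` expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# toy usage with a fake "decoder step" that prefers the sequence: a b </s>
def step(prefix):
    table = {"<s>": [("a", math.log(0.9)), ("b", math.log(0.1))],
             "a":   [("b", math.log(0.8)), ("</s>", math.log(0.2))],
             "b":   [("</s>", math.log(0.9)), ("a", math.log(0.1))]}
    return table[prefix[-1]]

print(beam_search(step, "<s>", "</s>", beam_size=3))
```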
Deep recurrent models
Modern models use stacked bidirectional RNN layers, with residual
connections to help gradient flow.
[Diagram: several stacked layers of hidden states h1…h5 over “THE WAITER TOOK THE PLATES”, illustrating a deep stacked recurrent encoder]
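A rough sketch of such a stack in PyTorch (hypothetical sizes; bidirectional GRU layers with residual connections between them, a simplification of what systems like GNMT do):

```python
import torch
import torch.nn as nn

class DeepBiRNNEncoder(nn.Module):
    """Stacked bidirectional GRU layers with residual connections between layers."""
    def __init__(self, emb=128, hidden=128, layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(layers):
            in_size = emb if i == 0 else hidden
            # each direction gets half the size so the layer output stays `hidden`-wide
            self.layers.append(nn.GRU(in_size, hidden // 2, batch_first=True, bidirectional=True))

    def forward(self, x):
        # x: (batch, src_len, emb)
        for i, rnn in enumerate(self.layers):
            out, _ = rnn(x)
            x = out if i == 0 else out + x        # residual connection helps gradient flow
        return x                                   # (batch, src_len, hidden)

# toy usage
enc = DeepBiRNNEncoder()
states = enc(torch.randn(1, 5, 128))
print(states.shape)                                # torch.Size([1, 5, 128])
```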
Links
https://www.tensorflow.org/tutorials/seq2seq
NMT (seq2seq) Tutorial
https://github.com/google/seq2seq
A general-purpose encoder-decoder framework for Tensorflow
https://github.com/awslabs/sockeye
seq2seq framework with a focus on NMT based on Apache MXNet
In the next episode
● Online adaptation of Machine Translation models.
● Translating between unseen language pairs (“zero-shot translation”).
Q&A
