Deep Learning
for Machine Translation
A dramatic turn of paradigm
Alberto Massidda
Who we are
● Founded in 2001;
● Branches in Milan, Rome and London;
● Market leader in enterprise-ready solutions based on Open Source tech;
● Expertise: Open Source, DevOps, Public and private cloud, Search, BigData
and many more...
Outline 1. Statistical Machine Translation
a. Language Model
b. Translation Model
c. Decoding
2. Neural Machine Translation
a. Recurrent Networks
b. Encoder - Decoder architecture
c. Attention Model
3. In the next episode
Statistical Machine Translation
1. Foreign language as a noisy channel
2. Language model and Translation model
3. Training (building the translation model)
4. Decoding (translating with the translation model)
Noisy channel model
Goal
Translate a sentence in a foreign language f into our language e:
e* = argmax_e p(e|f)
The abstract model
1. Transmit e over a noisy channel.
2. The channel garbles the sentence and f is received.
3. Try to recover e by reasoning about:
a. how likely it is that e was the message, p(e) (source model)
b. how e gets garbled into f, p(f|e) (channel model)
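For completeness, this is how the two factors combine, via Bayes' rule (the denominator p(f) does not depend on e, so it drops out of the argmax):
e* = argmax_e p(e|f)
   = argmax_e p(e) · p(f|e) / p(f)
   = argmax_e p(e) · p(f|e)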
Word choice and word reordering
P(f|e) cares about words, in any order.
● “It’s too late” → “Troppo tardi è” ✓
● “It’s too late” → “È troppo tardi” ✓
● “It’s too late” → “È troppa birra” ✗
P(e) cares about word order.
● “È troppo tardi” ✓
● “Troppo tardi è” ✗
P(e) and P(f|e)
Where do these numbers come from?
Language model
P(e) comes from a Language model, a machine that assigns scores to sentences, estimating their likelihood.
1. Record every sentence ever said in English (1 Billion?)
2. If the sentence “how’s it going?” appears 76413 times in that database, then we say:
P(“how’s it going?”) ≈ 76413 / 1,000,000,000
Language model N-grams
Problem
A lot of perfectly fine sentences (“My phone is poisonous”) may get zero probability simply because they never occur in that database.
Solution
Break sentences into components:
if the components are good, the whole sentence (probably) is too.
“My phone”, “phone is”, “is poisonous”.
Language model 3-grams
P(My phone is poisonous) ≃
b(My | <start> <start>) *
b(phone | <start> My) *
b(is | My phone) *
b(poisonous | phone is) *
b(<end> | is poisonous) *
b(<end> | poisonous <end>)
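A minimal count-based sketch of this idea in Python (a hypothetical toy corpus; here b(w | u v) is simply estimated as count(u v w) / count(u v), with no smoothing):

```python
from collections import Counter

START, END = "<start>", "<end>"

def train_trigram_lm(corpus):
    """Count trigrams and their 2-word contexts over a list of tokenized sentences."""
    trigrams, contexts = Counter(), Counter()
    for sentence in corpus:
        tokens = [START, START] + sentence + [END, END]
        for i in range(len(tokens) - 2):
            u, v, w = tokens[i], tokens[i + 1], tokens[i + 2]
            trigrams[(u, v, w)] += 1
            contexts[(u, v)] += 1
    return trigrams, contexts

def sentence_prob(sentence, trigrams, contexts):
    """P(sentence) ≈ product of b(w | u v) = count(u v w) / count(u v)."""
    tokens = [START, START] + sentence + [END, END]
    prob = 1.0
    for i in range(len(tokens) - 2):
        u, v, w = tokens[i], tokens[i + 1], tokens[i + 2]
        if contexts[(u, v)] == 0:
            return 0.0  # unseen context; a real LM would smooth or back off here
        prob *= trigrams[(u, v, w)] / contexts[(u, v)]
    return prob

# toy usage
corpus = [["My", "phone", "is", "poisonous"], ["My", "phone", "is", "ringing"]]
tri, ctx = train_trigram_lm(corpus)
print(sentence_prob(["My", "phone", "is", "poisonous"], tri, ctx))  # 0.5
```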
Translation model
Next we need to worry about P(f|e), the probability of a French string f given an
English string e.
This is called a translation model.
It boils down to computing alignments between source and target languages.
Computing alignments intuition
Pairs of English and Chinese words that occur together in a parallel sentence pair may be translations of each other.
Training Data
A parallel corpus is a collection of texts, each of which is translated into one or more languages other than the original.
EN | IT
Look at that! | Guarda lì!
I've never seen anything like that in my life! | Non ho mai visto nulla di simile in vita mia!
That's incredible! | È incredibile!
That's terrific. | È eccezionale.
Expectation Maximization algorithm
A simple 2-sentence toy corpus over the words b, c, x, y (alignment diagram not reproduced).
Expectation Maximization algorithm
This algorithm iterates over the data, progressively amplifying latent regularities of the system.
It converges to a local optimum without any user supervision.
In machine translation, we call it IBM Model 1.
In the translation industry, we call it GIZA.
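A compact sketch of the IBM Model 1 EM loop (a simplified illustration, not the GIZA implementation; the NULL word is omitted and all names are mine):

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate word-translation probabilities t(f|e) with EM over sentence pairs (e_words, f_words)."""
    f_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # start from a uniform guess
    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(e, f)
        total = defaultdict(float)                # expected counts c(e)
        # E step: collect expected alignment counts under the current t
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(e, f)] for e in es)
                for e in es:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[e] += delta
        # M step: re-estimate t(f|e) from the expected counts
        for (e, f), c in count.items():
            t[(e, f)] = c / total[e]
    return t

# toy usage: two sentence pairs
pairs = [(["the", "house"], ["la", "casa"]), (["the", "book"], ["il", "libro"])]
t = ibm_model1(pairs)
print(t[("house", "casa")], t[("the", "la")])
```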
Decoding
Word and phrase alignments are leveraged to build a “space” for a search
algorithm.
Translating is searching in a space of options.
Enter Moses
Translation options selection
Decoding in action
● Each level of the search corresponds to how much of the source sentence is covered
● Translation stops when all source words are translated
● The algorithm expands the most promising node first
● Options are tried highest-probability first
● Reordering adds a penalty
● The language model scores the output of each stage to influence the decoder’s judgement
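A highly simplified best-first decoder sketch in this spirit (a hypothetical toy phrase table and language model; no hypothesis recombination or future-cost estimation as a real decoder like Moses would have):

```python
import heapq, itertools
from math import log

# hypothetical toy phrase table: source phrase -> [(target word, log prob)]
phrase_table = {
    ("it", "is"): [("è", log(0.8))],
    ("too",): [("troppo", log(0.9))],
    ("late",): [("tardi", log(0.9))],
}

def lm_score(prev, word):
    """Stand-in bigram language model rewarding fluent Italian continuations."""
    good = {("<s>", "è"), ("è", "troppo"), ("troppo", "tardi")}
    return log(0.6) if (prev, word) in good else log(0.1)

def decode(source, distortion_penalty=log(0.5)):
    """Best-first search over partial hypotheses, expanding the most promising node first."""
    tiebreak = itertools.count()
    heap = [(0.0, next(tiebreak), frozenset(), "<s>", ())]
    while heap:
        neg_score, _, covered, prev, output = heapq.heappop(heap)
        if len(covered) == len(source):               # every source word translated: done
            return " ".join(output), -neg_score
        for i in range(len(source)):
            for j in range(i + 1, len(source) + 1):
                span = tuple(source[i:j])
                if span not in phrase_table or covered & set(range(i, j)):
                    continue                           # unknown phrase or words already covered
                for target, tm in phrase_table[span]:
                    score = -neg_score + tm + lm_score(prev, target)
                    expected_next = max(covered) + 1 if covered else 0
                    if i != expected_next:
                        score += distortion_penalty    # reordering adds a penalty
                    heapq.heappush(heap, (-score, next(tiebreak),
                                          covered | frozenset(range(i, j)),
                                          target, output + (target,)))
    return None, float("-inf")

print(decode(["it", "is", "too", "late"]))             # -> ('è troppo tardi', ...)
```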
Decoding in action
Neural machine translation
NMT is based on probability too, but with some differences:
● end-to-end training, no more separate Translation + Language Models;
● factorization by the chain rule over the full history, instead of the Bayesian noisy-channel decomposition with its naive independence assumptions;
If a sentence f of length n is a sequence of words f_1, …, f_n, then p(f) is:
p(f) = p(f_1) · p(f_2 | f_1) · … · p(f_n | f_1, …, f_(n-1))
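For example, with an illustrative sentence of my own choosing:
p(È troppo tardi </s>) = p(È) · p(troppo | È) · p(tardi | È troppo) · p(</s> | È troppo tardi)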
Recurrent network
Encoder - Decoder architecture
With a source sentence f and a target sentence e, we could model the pair with a single RNN reading their concatenation (one single sequence).
But the two languages are independent (vocabulary and domain), so we can split the model into 2 separate RNNs:
1. an encoder RNN that reads f and compresses it into a summary vector;
2. a decoder RNN that generates e word by word, conditioned on that summary vector.
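A minimal sketch of this split in PyTorch (hypothetical sizes and class names; a toy GRU encoder-decoder, not the exact architecture from the talk):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder: the encoder's last hidden state is the summary vector."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # 1. the encoder RNN reads the source and produces a summary vector
        _, summary = self.encoder(self.src_emb(src_ids))
        # 2. the decoder RNN starts from the summary and predicts each target word
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), summary)
        return self.out(dec_states)               # logits over the target vocabulary

# toy usage: batch of 1, source length 5, target length 6
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (1, 5))
tgt = torch.randint(0, 1000, (1, 6))
print(model(src, tgt).shape)                       # torch.Size([1, 6, 1000])
```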
Sequence-to-sequence (seq2seq) architecture
[Diagram: encoder hidden states h1…h5 read “THE WAITER TOOK THE PLATES”; decoder hidden states g1…g6 emit “IL CAMERIERE PRESE I PIATTI </s>”]
Summary vector as information bottleneck
A fixed-size representation degrades as sentence length increases.
Plus, the alignment learning operates on a many-to-many logic: the gradient flows towards everybody for any alignment mistake.
Let’s gate the gradient flow through a context vector, a weighted average of the source hidden states (also known as “soft search”).
The weights are computed by a feed-forward network with a softmax activation.
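A small sketch of that context-vector computation in PyTorch (additive, Bahdanau-style scoring; sizes and names are my own assumptions):

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Soft search: score each encoder state against the decoder state, softmax, weighted average."""
    def __init__(self, hidden=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, decoder_state, encoder_states):
        # decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden)
        src_len = encoder_states.size(1)
        expanded = decoder_state.unsqueeze(1).expand(-1, src_len, -1)
        scores = self.score(torch.cat([expanded, encoder_states], dim=-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)           # e.g. [0.7, 0.05, 0.1, 0.05, 0.1]
        context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
        return context, weights

# toy usage: 1 sentence, 5 source hidden states
attn = Attention()
context, weights = attn(torch.randn(1, 128), torch.randn(1, 5, 128))
print(weights.shape, context.shape)                        # torch.Size([1, 5]) torch.Size([1, 128])
```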
Attention model
[Diagram: the same encoder-decoder, with attention weights 0.7, 0.05, 0.1, 0.05, 0.1 over h1…h5 summed into a context vector for the current decoder step]
Attention model
[Diagram: the same network at another decoder step, with attention weights 0.1, 0.05, 0.1, 0.05, 0.7 over h1…h5]
Attention model
[Diagram: another decoder step, with attention weights 0.1, 0.05, 0.7, 0.05, 0.1 over h1…h5]
Beam decoding
We can do better than just picking the most probable word at each step.
We can keep a list of the best hypotheses at each step, which gives us a beam search.
A beam size of 10 offers the best trade-off.
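A sketch of beam search over a decoder step function (the `step` callable here is a hypothetical stand-in for one decoder step returning log-probabilities of next tokens):

```python
import math

def beam_search(step, start_token, end_token, beam_size=10, max_len=50):
    """Keep the `beam_size` best partial hypotheses (sequence, log-prob) at every step."""
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step(seq):               # log-probs of next tokens given the prefix
                candidates.append((seq + [token], score + logp))
        # prune: keep only the best `beam_size` expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# toy usage with a fake "decoder step" that prefers the sequence: a b </s>
def step(prefix):
    table = {"<s>": [("a", math.log(0.9)), ("b", math.log(0.1))],
             "a":   [("b", math.log(0.8)), ("</s>", math.log(0.2))],
             "b":   [("</s>", math.log(0.9)), ("a", math.log(0.1))]}
    return table[prefix[-1]]

print(beam_search(step, "<s>", "</s>", beam_size=3))
```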
Deep recurrent models
Modern models use stacked bidirectional RNN layers, with residual
connections to help gradient flow.
[Diagram: several stacked layers of hidden states h1…h5 over “THE WAITER TOOK THE PLATES”, illustrating a deep stacked recurrent encoder]
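A rough sketch of such a stack in PyTorch (hypothetical sizes; bidirectional GRU layers with residual connections between them, a simplification of what systems like GNMT do):

```python
import torch
import torch.nn as nn

class DeepBiRNNEncoder(nn.Module):
    """Stacked bidirectional GRU layers with residual connections between layers."""
    def __init__(self, emb=128, hidden=128, layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(layers):
            in_size = emb if i == 0 else hidden
            # each direction gets half the size so the layer output stays `hidden`-wide
            self.layers.append(nn.GRU(in_size, hidden // 2, batch_first=True, bidirectional=True))

    def forward(self, x):
        # x: (batch, src_len, emb)
        for i, rnn in enumerate(self.layers):
            out, _ = rnn(x)
            x = out if i == 0 else out + x        # residual connection helps gradient flow
        return x                                   # (batch, src_len, hidden)

# toy usage
enc = DeepBiRNNEncoder()
states = enc(torch.randn(1, 5, 128))
print(states.shape)                                # torch.Size([1, 5, 128])
```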
Links
https://www.tensorflow.org/tutorials/seq2seq
NMT (seq2seq) Tutorial
https://github.com/google/seq2seq
A general-purpose encoder-decoder framework for Tensorflow
https://github.com/awslabs/sockeye
seq2seq framework with a focus on NMT based on Apache MXNet
In the next episode
● Online adaptation of Machine Translation models.
● Translating between unseen language pairs (“zero-shot translation”).
Q&A
