Understanding Neural MT
Understanding Neural Machine Translation
Mikel L. Forcada¹,²
¹Departament de Llenguatges i Sistemes Informàtics,
Universitat d’Alacant, E-03071 Alacant
²Prompsit Language Engineering, S.L.,
Edifici Quorum III, Av. Universitat s/n, E-03202 Elx
Misr International University, Cairo
8 April 2019
Before we start. . .
I prepared a deck of slides about neural machine translation
for translators.
I’d like this session to be as useful as possible to you.
Please interrupt me anytime to ask questions!
We can set aside some time to talk about other matters
such as:
How we teach translation technologies at the Universitat d’Alacant.
A brief summary of our research.
... or any other aspect of interest to you.
1 Corpus-based machine translation
2 Neural machine translation
3 Training: details
4 Where to start
5 What to expect
6 Translation technologies at the U. Alacant
Corpus-based machine translation
Machine translation
Machine translation
The translation,
by means of a computer using suitable software,
of a text written in the source language (SL)
which produces another text in the target language (TL)
which may be called its raw translation.
SL text →
TL text
Machine translation
There are two main groups of machine translation technologies:
Rule-based MT, and
Corpus-based MT
Machine translation
Rule-based machine translation (RBMT) (Lucy Software,
ProMT, Apertium. . . ):
builds upwards from word-for-word translation,
hopefully to reach the sentence level,
Translation experts write translation dictionaries and
rules transforming SL structures into TL structures.
Translators’ intuitive, un-formalized knowledge about the
task has to be turned into rules and encoded in a
computable manner:
Additional crude simplifications and sacrifices needed!
If well chosen, some of them will often work fine.
Computer experts write engines that look up those
dictionaries and apply those rules to the input text
Rule-based machine translation
Rule-based machine translation:
In most of these systems, an additional simplification is
made: the “transfer” approximation (rules transform parse
trees or similar structures)
Output is consistent but mechanical, lacking fluency.
It has trouble resolving ambiguity at all levels:
lexical (“replace” → “put back”/“substitute”),
syntactic/structural (“I saw the girl with the telescope”).
Customization: experts edit dictionaries and rules
Corpus-based machine translation
Corpus-based MT learns to translate from a corpus containing
hundreds of thousands or millions of translated sentences.
Output: may be deceptively fluent (yet unfaithful).
Main approaches:
statistical machine translation (2005–2015)
Uses probabilistic models estimated by counting events in
the bilingual corpus used to train them.
neural machine translation (2015–).
Based on artificial neural networks inspired by how the
human brain learns and generalizes.
Such large corpora may not be available for less-translated
languages.
Neural machine translation
Neural MT: the new corpus-based MT
Neural machine translation or deep-learning-based machine
translation is a recent alternative to statistical MT:
It is corpus-based (usually needs more and cleaner data).
First ideas in the ’90s,¹ abandoned due to insufficient hardware.
Taken up again around 2013.
First commercial implementations in 2016 (e.g. Google Translate).
Competitive with statistical MT in many applications.
¹Castaño & Casacuberta, EuroSpeech 1997; Forcada & Ñeco, ICANN
Artificial neurons /1
Why is it called neural?
It is performed by software that simulates large networks of
artificial neurons.
Their activation (excitation) depends on the activation of
other neurons and the strength of their connections.
The sign and magnitude of weights determine the
behaviour of the network:
Neurons connected via a positive weight tend to excite or
inhibit simultaneously.
Neurons connected via a negative weight tend to be in
opposite states.
The effect of the interaction increases with the magnitude of
the weight.
Training fixes weights to the necessary values to ensure a
specific behaviour.
Artificial neurons/2
S₄ = F (w₁×S₁ + w₂×S₂ + w₃×S₃)
[Plot of the activation function F; horizontal axis from −4 to 4.]
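The formula above can be tried out with a few lines of Python. This is only a sketch: the slide does not say which function F is, so the S-shaped logistic function is assumed here, and the activations and weights are made-up values.

```python
import math

def logistic(x):
    """Assumed activation function F: squashes any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_activation(inputs, weights, F=logistic):
    """S = F(w1*S1 + w2*S2 + ...): weighted sum of the inputs, then F."""
    total = sum(w * s for w, s in zip(weights, inputs))
    return F(total)

# Hypothetical activations S1..S3 and weights w1..w3 (not from the slides).
S4 = neuron_activation([1.0, 0.0, 1.0], [2.0, -1.0, -2.0])
print(S4)  # the weighted sum is 0, so the logistic function gives 0.5
```

Here the positive weight w₁ and the negative weight w₃ cancel each other out, which is why the neuron ends up exactly halfway between its two extreme states.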
Neural networks
A neural net with 3 inputs, 3 neurons in a hidden layer, and two
output neurons.
One talks about deep learning when information is processed
using many hidden layers.
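A net of the size described above (3 inputs, 3 hidden neurons, 2 outputs) can be sketched as follows. The weights are invented for illustration and the logistic activation is an assumption; a trained net would have learned its weights from data.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weight_rows):
    """Each row of weights feeds one neuron of the layer."""
    return [logistic(sum(w * s for w, s in zip(row, inputs)))
            for row in weight_rows]

# Hypothetical weights: 3 hidden neurons x 3 inputs, 2 outputs x 3 hidden.
W_hidden = [[0.5, -1.0, 0.0], [1.0, 1.0, -1.0], [0.0, 0.5, 0.5]]
W_output = [[1.0, -1.0, 0.5], [-0.5, 0.5, 1.0]]

def forward(inputs):
    """Inputs -> hidden layer -> output layer."""
    hidden = layer(inputs, W_hidden)
    return layer(hidden, W_output)

outputs = forward([1.0, 0.0, 1.0])
print(outputs)  # two activation values, each between 0 and 1
```

A deep network simply chains more calls to `layer`, one per hidden layer.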
Representations /1
The activation values of specific groups of neurons (usually
those in a layer) form representations of the information
they are processing.
For example, one pattern of activation values
could be the representation of the word “study”, and a different pattern
that of the word “cat”.
Representations /2
Let us imagine lexical representations with just three neurons:
Words with similar meanings are found close to each other.
Representations /3
One can even perform semantic arithmetic with
representations (adding and subtracting activation values
neuron by neuron):
[king] − [man] + [woman] ≃ [queen]
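A toy illustration of this arithmetic, using three-neuron representations as on the previous slide. The vectors are hand-made for the example (real representations are learned and have hundreds of neurons), so the result is exact here only by construction.

```python
import math

# Hypothetical 3-neuron representations, invented for illustration.
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "cat":   [0.0, 0.5, 0.5],
}

def nearest(target, exclude=()):
    """Word whose representation is closest (Euclidean distance) to target."""
    return min((w for w in vectors if w not in exclude),
               key=lambda w: math.dist(vectors[w], target))

# [king] - [man] + [woman], computed neuron by neuron.
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(result, exclude=("king", "man", "woman")))  # queen
```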
Neural MT: the encoder–decoder architecture
A large part of neural MT systems use the encoder–decoder
architecture:
The encoder is a neural net that reads, one by one,
representations of words in the source sentence and
recursively builds a representation of the whole sentence; then,
The decoder is a neural net that predicts, one by one, the
target words:
Each output unit computes the probability of one possible
target word.
The most likely word is selected.
Works similarly to the predictive keyboard in our smartphones.
Other architectures such as the Transformer are now also very common.
Encoding of the source sentence “My flight is delayed .” from
the representations of its words.
[Figure, built up step by step: the encoder reads the word
representations e(“My”), e(“flight”), e(“is”), e(“delayed”), e(“.”)
one by one and produces E(“My”), E(“My flight”), E(“My flight is”),
E(“My flight is delayed”) and finally E(“My flight is delayed .”).]
Decoding of the translation of “My flight is delayed .”: “Mi vuelo
está retrasado .”.
[Figure, built up step by step: starting from the encoding
E(“My flight is delayed”), each decoder state D(“My flight is delayed”,
target words so far) yields the probability P(x) of each candidate
next target word x.]
Target words so far: candidate next words x with P(x)
(none): x=“mi” 0.125, x=“vuelo” 0.078, x=“su” 0.027, x=“avión” 0.011
“Mi”: x=“vuelo” 0.315, x=“avión” 0.088, x=“escala” 0.071, x=“está” 0.009
“Mi vuelo”: x=“está” 0.415, x=“es” 0.218, x=“tarde” 0.071, x=“hay” 0.009
“Mi vuelo está”: x=“listo” 0.048, x=“tardando” 0.112, x=“retrasado” 0.683, x=“cancelado” 0.092
“Mi vuelo está retrasado”: x=“.” 0.773, x=“porque” 0.038, x=“dos” 0.011, x=“hasta” 0.001
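The word-by-word selection shown in these slides can be sketched as a loop that always picks the most likely next word. The lookup table below is only a stand-in for a real decoder: it contains just the candidate words and probabilities quoted on the slides (reading the value printed as "0071" as 0.071, which is an assumption).

```python
# Toy stand-in for the decoder: maps the target words produced so far
# to the candidate probabilities quoted on the slides.
TOY_DECODER = {
    (): {"mi": 0.125, "vuelo": 0.078, "su": 0.027, "avión": 0.011},
    ("mi",): {"vuelo": 0.315, "avión": 0.088, "escala": 0.071, "está": 0.009},
    ("mi", "vuelo"): {"está": 0.415, "es": 0.218, "tarde": 0.071, "hay": 0.009},
    ("mi", "vuelo", "está"): {"listo": 0.048, "tardando": 0.112,
                              "retrasado": 0.683, "cancelado": 0.092},
    ("mi", "vuelo", "está", "retrasado"): {".": 0.773, "porque": 0.038,
                                           "dos": 0.011, "hasta": 0.001},
}

def greedy_decode(decoder, end_token=".", max_len=10):
    """Repeatedly append the most likely next word until the end token."""
    prefix = ()
    while len(prefix) < max_len and prefix in decoder:
        candidates = decoder[prefix]
        best = max(candidates, key=candidates.get)
        prefix += (best,)
        if best == end_token:
            break
    return list(prefix)

print(" ".join(greedy_decode(TOY_DECODER)))  # mi vuelo está retrasado .
```

Always taking the single most likely word is the simplest strategy; it is what the slides describe, and it is used here for that reason.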
An extension: attention
Encoder–decoders are sometimes augmented with attention.
The decoder learns to “pay (more or less) attention”. . .
not only to the last encoded representation E(“My flight is
delayed .”)...
but also to all of the intermediate representations created
during encoding,
E(“My”), E(“My flight”), E(“My flight is”), E(“My flight is
delayed”),
... using special attention connections.
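One common way to realize "paying more or less attention" is sketched below: each encoder representation gets a score, the scores are turned into weights that add up to 1, and the decoder receives the weighted average. The slides do not say how scores are computed; the dot-product scoring and the 2-neuron vectors here are assumptions for illustration.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Weight every encoder representation by its match with the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(size)]
    return weights, context

# Hypothetical 2-neuron versions of E("My"), E("My flight"), E("My flight is").
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([3.0, 0.0], states)
print(weights)  # the first and third representations get most of the attention
```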
Training: details
Where do I get corpora from?
It is unlikely that a freelance translator or a small
translation agency has 1,000,000 sentence pairs available
for their languages and text domains.
If one is lucky, repositories like OPUS may contain useful
corpora for one’s language pair and text domain.
Another possibility is to harvest them from the Internet
using specialized software such as Bitextor.
Very challenging!
Corpus preparation /1
When one wants to train a neural MT system one prepares
three disjoint corpora from the data available:
A large training set, ideally hundreds of thousands or millions of
sentence pairs, and ideally representative of the task (but
this may not be available!).
The examples in this set are shown to the neural net for
training.
A small development set of 1,000–3,000 sentence pairs,
representative of the task.
The examples in this held-out set are used to determine
when to stop training so that overfitting to the training set is
avoided.
A small test set of 1,000–3,000 sentence pairs,
representative of the task.
Examples in this held-out set give an idea of the
performance of the system.
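The three-way split can be sketched as follows. The function name, the default sizes (2,000 pairs each for development and test) and the removal of repeated pairs before splitting are choices made for this example, not a prescribed recipe.

```python
import random

def split_corpus(pairs, dev_size=2000, test_size=2000, seed=0):
    """Shuffle sentence pairs and cut them into disjoint train/dev/test sets."""
    pairs = list(dict.fromkeys(pairs))   # drop repeated pairs so sets stay disjoint
    random.Random(seed).shuffle(pairs)   # fixed seed: the split is reproducible
    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test
```

For example, a corpus of 10,000 distinct sentence pairs would yield 6,000 training pairs with the default sizes. Shuffling before cutting helps the small sets resemble the training set, which matters because all three should be representative of the task.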
Corpus preparation /2
Tokenization: One can improve the segmentation of text
beyond whitespace to improve processing and reduce
vocabulary size.
Some words are written together.
don’t → don 't
won’t → won 't
l’amour → l' amour
Punctuation is often written without intervening space, and
it needs to be separated, but not always:
I know, Dr. Jones (and you know too). →
I know , Dr. Jones ( and you know too ) .
How would this work for languages such as Arabic?
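A very small tokenizer along these lines is sketched below. It is an illustration only: the abbreviation list is hypothetical, the contraction rule follows the English examples above (the French l’ + amour split would need the opposite rule), and real tokenizers handle many more cases.

```python
import re

# Hypothetical list of abbreviations whose final period must stay attached.
NONBREAKING = {"Dr.", "Mr.", "Mrs.", "Prof."}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in NONBREAKING:
            tokens.append(chunk)
            continue
        # Separate punctuation marks from the word they are attached to.
        chunk = re.sub(r'([.,;:!?()"])', r" \1 ", chunk)
        # Split English-style contractions: don't -> don 't
        chunk = re.sub(r"(\w)(['’]\w)", r"\1 \2", chunk)
        tokens.extend(chunk.split())
    return " ".join(tokens)

print(tokenize("I know, Dr. Jones (and you know too)."))
# I know , Dr. Jones ( and you know too ) .
```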
Corpus preparation /3
Truecasing and de-truecasing:
Many languages write words in capitals in some contexts
(start of sentence, headlines). Words which are the same
appear different.
Truecasing trained on the training corpus tries to undo
capitalization where needed:
Mes amis et mes amies sont arrivés hier à
Paris . → mes amis et mes amies sont arrivés
hier à Paris .
De-truecasing applies simple rules to capitalize machine
translation output from a system trained on truecased text.
Is this necessary for languages such as Arabic?
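The idea can be sketched with a simple frequency-based truecaser. This is an assumed, minimal design (learn the most frequent casing of each word from positions other than the start of a sentence, then apply it to sentence-initial words); the tiny training corpus is invented.

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    """Learn the most frequent casing of each word, ignoring sentence starts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split()[1:]:   # skip the sentence-initial word
            counts[token.lower()][token] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def truecase(sentence, model):
    """Give the first word its usual casing; leave unknown words untouched."""
    tokens = sentence.split()
    if tokens:
        tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)

def detruecase(sentence):
    """Simple rule: capitalize the first character of the output."""
    return sentence[:1].upper() + sentence[1:]

# Hypothetical tokenized training sentences.
corpus = ["Je vois mes amis .", "Il arrive à Paris .", "Paris est grand ."]
model = train_truecaser(corpus)
print(truecase("Mes amis sont arrivés hier à Paris .", model))
# mes amis sont arrivés hier à Paris .
```

Note that "Paris" keeps its capital even at the start of a sentence, because it was seen capitalized in the middle of a training sentence.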
Corpus preparation /4
Subword units/1
Neural MT systems have
as many input units as source words in the vocabulary
as many output units as target words in the vocabulary
(one needs to fix the size of vocabularies).
Large vocabularies are not uncommon in
morphologically rich languages → infeasible!
Words outside the fixed vocabulary have to be encoded as unknown words.
Corpus preparation /4
Subword units/2
The solution: break words into smaller, repeating units!
Solution #1: Byte-pair encoding: a
language-independent approach.
Initial set of codes = set of characters
Iteratively group the most frequent pairs of consecutive
codes into single codes.
Until a certain number of codes (vocabulary) is reached.
Use the resulting character sequences as tokens:
institutionalisations →
institu@@ tion@@ alis@@ ations
Many words are stored complete.
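The iterative grouping described above can be sketched in a few lines. This is a simplified illustration (no end-of-word marker, ties broken by first occurrence) and the word counts are invented; production tools add further details.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every occurrence of the pair by a single merged code."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_counts, num_merges):
    """Start from characters; repeatedly merge the most frequent pair."""
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges

def segment(word, merges):
    """Apply the learned merges in order; mark non-final pieces with @@."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return "@@ ".join(symbols)

# Hypothetical word frequencies from a training corpus.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 2)
print(merges)                     # [('e', 's'), ('es', 't')]
print(segment("newest", merges))  # n@@ e@@ w@@ est
```

With only two merges most of the word is still split into characters; with thousands of merges, frequent words end up stored complete.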
Corpus preparation /5
Subword units/3
Solution #2: “SentencePiece”: a completely
language-independent, data-driven approach.
No need to tokenize: processes the whole sentence.
Learns a subword segmentation model directly from the raw training corpus.
How does one train neural MT /1
Training: adjusting all of the weights in the neural net.
NMT decoder output: word probabilities in context →
sentence probabilities:
P(I love you . | Je t’aime) = p(. | I love you, Je t’aime) ×
× p(you | I love, Je t’aime) ×
× p(love | I, Je t’aime) ×
× p(I | START, Je t’aime).
Objective: maximize the likelihood P(I love you . | Je t’aime)
of the reference translation “I love you .”.
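As a numerical illustration of this product, suppose the decoder assigned the four reference words the probabilities below (invented values, not taken from any real system). In practice one usually adds logarithms instead of multiplying, which gives the same ranking and avoids very small numbers.

```python
import math

# Hypothetical values of p(I | ...), p(love | ...), p(you | ...), p(. | ...).
step_probabilities = [0.4, 0.5, 0.6, 0.9]

# Probability of the whole reference sentence: product of the word probabilities.
sentence_probability = math.prod(step_probabilities)

# Equivalent quantity on a logarithmic scale: sum of the log-probabilities.
log_likelihood = sum(math.log(p) for p in step_probabilities)

print(sentence_probability)
print(log_likelihood)
```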
How does one train neural MT /2
The training algorithm computes a gradient of the
probabilities of reference sentences in the training set with
respect to each weight w connecting neurons:
gradient(w) = (P(with w + ∆w) − P(with w)) / ∆w
That is, how much the probability varies for a small change
∆w in each weight w.
Then, after showing a number of reference sentences,
each weight is updated in proportion to its effect on their
probability → gradient ascent
new w = w + (learning rate) × gradient(w)
This is done repeatedly.
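The two formulas can be tried on a toy case with a single weight. The function `probability` below is a hypothetical stand-in for "probability of the reference sentences as a function of w"; in a real system this comes from the whole network, and gradients are computed exactly by backpropagation rather than by trying small changes.

```python
import math

def probability(w):
    """Hypothetical stand-in for P(reference) as a function of one weight."""
    return 1.0 / (1.0 + math.exp(-w))

def gradient(w, delta=1e-6):
    """How much the probability varies for a small change delta in w."""
    return (probability(w + delta) - probability(w)) / delta

w = 0.0
learning_rate = 0.5
for _ in range(100):                      # "This is done repeatedly."
    w = w + learning_rate * gradient(w)   # new w = w + (learning rate) x gradient(w)

print(w, probability(w))  # the probability has risen above its initial 0.5
```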
How does one train neural MT /3
Examples (sentence pairs) are grouped in minibatches of
e.g. 128 examples.
Weights are updated after each minibatch.
An epoch completes each time the whole set of examples,
e.g. 1,000,000 examples, has been processed.
It is not uncommon for a training job to require tens or
hundreds of epochs.
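A quick way to see what these figures imply, using the example numbers from the slide (the function names are made up for this sketch):

```python
import math

def updates_per_epoch(num_examples, minibatch_size):
    """Number of weight updates needed to see every example once."""
    return math.ceil(num_examples / minibatch_size)

def minibatches(examples, minibatch_size):
    """Cut the list of examples into consecutive minibatches."""
    for start in range(0, len(examples), minibatch_size):
        yield examples[start:start + minibatch_size]

print(updates_per_epoch(1_000_000, 128))  # 7813
```

So a training job of, say, 50 epochs on this corpus would involve roughly 390,000 weight updates.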
How does one train neural MT /4
When does one stop?
Training for too long may lead to “memorization” of the
training set.
But we want the network to generalize.
This is what the development set is used for.
Every certain number of weight updates, the system
automatically evaluates the performance on the sentences
of the development set.
It compares MT output to the reference outputs and
computes a measure such as BLEU.
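One simple stopping rule is sketched below: keep the weights from the evaluation with the best development-set score, and stop once the score has failed to improve a given number of times in a row. The `patience` value and the list of scores are assumptions for illustration; toolkits offer several variants of this idea.

```python
def best_checkpoint(dev_scores, patience=3):
    """Return (index, score) of the evaluation to keep.

    Scanning stops after `patience` evaluations without improvement.
    """
    best_index, best_score = -1, float("-inf")
    for index, score in enumerate(dev_scores):
        if score > best_score:
            best_index, best_score = index, score
        elif index - best_index >= patience:
            break
    return best_index, best_score

# Hypothetical BLEU scores (%) on the development set, one per evaluation.
scores = [10.2, 15.8, 19.4, 21.0, 20.7, 20.9, 20.1, 22.5]
print(best_checkpoint(scores))  # (3, 21.0): training stops before the last score
```

The last score (22.5) is never reached because training has already stopped; a larger `patience` trades extra training time for the chance of such late improvements.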
How does one train neural MT /5
What is BLEU?
The most famous automatic evaluation measure is called
BLEU, but there are many others.
BLEU counts what fraction of the 1-word,
2-word, ..., n-word sequences in the output also appear in the
reference translation.
These fractions are combined into a single score.
The result is a number between 0 and 1, or between 0%
and 100%.
Correlation with measurements of translation usefulness is
still an open question.
A lot of MT research is still BLEU-driven and makes little
contact with real applications of MT.
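A simplified, sentence-level version of the computation is sketched below, assuming a single reference, sequences of 1 to 4 words, a geometric mean to combine the fractions, and a penalty for outputs shorter than the reference. Standard BLEU is computed over a whole test set, so scores from this sketch are not comparable with published figures.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-word sequences in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(output, reference, max_n=4):
    out, ref = output.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        out_counts, ref_counts = ngrams(out, n), ngrams(ref, n)
        # An output n-gram counts only as often as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        total = sum(out_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Outputs shorter than the reference are penalized.
    brevity = min(1.0, math.exp(1 - len(ref) / len(out)))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(bleu("mi vuelo está retrasado .", "mi vuelo está retrasado ."))  # 1.0
print(bleu("mi vuelo está retrasado .", "mi vuelo está retrasado hoy ."))
```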
How does one train neural MT /6
Neural MT training is computationally very expensive.
A very small encoder–decoder (2 layers of 128 units
each). . .
. . . with a small training set of 260,000 sentences. . .
. . . using a small vocabulary of 10,000 byte-pair encoding
operations . . .
. . . between French and Spanish (easy language pair). . .
. . . takes about 1 week on a 4-core, 3.2 GHz desktop . . .
. . . to reach a BLEU score of around 25% (barely starts to
be posteditable).
How does one train neural MT /7
We need stronger, specialized, expensive hardware:
Regular CPUs in our desktops and laptops are too slow.
Neural training implies a lot of vector, matrix, tensor
operations. . .
. . . which are nicely performed by GPUs (graphics
processing units).
Using GPUs one can speed up training by 100× or more.
One GPU costs ≃US$2,000
Where to start
Can I do it myself? /1
Clara is a 4th-year translation and interpreting student of mine.
She accepted the challenge to learn neural MT. She has learned:
To install TensorFlow and TensorFlow NMT on a Windows
computer and on a GNU/Linux computer (no GPUs on
To use the command-line interface on a Windows
computer and on a GNU/Linux computer.
To select and prepare training, development and test sets.
To check the output of the system and interpret how it is
She is writing a guide for fellow students as her bachelor’s thesis.
Can I do it myself? /2
I have helped her by:
Answering her questions and guiding her.
Writing up some small Python programs (scripts) to
process corpora and launch training jobs.
Letting her slow down my desktop computer with her jobs
And I have learned a lot by working with her.
If she can, why can’t you?
What to expect
New technology, new behaviour/1
Neural MT. . .
. . . requires specialized, powerful hardware
. . . needs large amounts of bilingual data
Not normally available to the average translator or
translation company
One can resort to third parties to train and run neural
MT for us:
This is actually a business model in the translation industry.
They can add our translation memories to the stock data in
the company to build a system for us.
New technology, new behaviour /2
Neural MT. . .
. . . works with representations of the whole sentence: it is
hard to know the source for each target word.
Lack of transparency.
. . . produces grammatically fluent texts.
. . . produces semantically motivated errors: if a word has
not been seen during training, it is replaced. . .
. . . by a similar word: palace → castle
. . . by a paraphrase: Michael Jordan → the Chicago Bulls
shooting guard;
with dangerous results sometimes: Tunisia → Norway.
. . . may invent words: engineerage, recruitation, etc. when
it works with sub-word units (done quite frequently).
Posteditors have to pay special attention (cognitive load).
Concluding remarks
Neural MT is a new kind of corpus-based MT.
It is currently displacing statistical MT in many applications.
It is based on a simplification of the nervous system.
Requires large corpora.
Requires powerful, specialized hardware.
May produce natural text with errors which are hard to spot
and correct.
Requires very special attention on the part of post-editors.
Translation technologies at the U. Alacant
Teaching translation technologies at U. Alacant/1
4-year bachelor’s degree in Translation and Interpreting
Subject: “Translation Technologies”
The only subject on technologies or computing in the
whole degree.
2nd year, 2nd semester
6 ECTS credits (60 h lab + classroom, 90 h at home)
1.5 credits in the classroom, 4.5 credits in the laboratory.
Languages: Spanish and Valencian (Catalan); no English
All material available under CC licenses.
Teaching translation technologies at U. Alacant/2
Classroom blocks:
Hardware and software: MB, GB, GHz, CPU. . .
Internet: se
Texts and formats: Unicode, XML, HTML; markup. . . .
Uses of machine translation
Ambiguity as an obstacle.
How does machine translation work?
Computer-aided translation with translation memories.
Teaching translation technologies at U. Alacant/3
Laboratory work. Part 1, practice (20 h):
HTML basics (for their HTML portfolio)
Advanced word processing: styles (LibreOffice).
XML: validation, etc.
Computer-aided translation of XML-formatted documents
with OmegaT.
Teaching translation technologies at U. Alacant/4
Part 2: collaborative projects (25 h in the lab.)
Groups of 4 students tackle a sizable translation project
(about 10,000 words)
We encourage them to find real projects (NGOs, etc.)
They select an MT system and evaluate their performance
with it.
They gather translation memories and evaluate their
performance with them.
They decide what is the best combination and translate
using the team functionality of OmegaT.
They learn to organize work and assess quality.
New translated segments are easily shared via a free account on a
GitHub or GitLab server.
Research in translation technologies at U. Alacant/1
The translation technologies group at Alacant:
Part of the Transducens research group at the Department of
Software and Computing Systems:
Mikel L. Forcada, full professor.
Juan Antonio Pérez-Ortiz, associate professor.
Felipe Sánchez-Martínez, post-doctoral lecturer.
Miquel Esplà-Gomis, post-doctoral researcher.
Víctor Sánchez-Cartagena, post-doctoral researcher.
Leopoldo Pla Sempere, research technician.
Francisco de Borja Valero Antón, research technician.
Current PhD students: John E. Ortega, Kenneth
Jordan-Núñez (with Universitat Pompeu Fabra)
Research in translation technologies at U. Alacant/2
Main research lines:
Quality estimation for machine translation
Effort-oriented training of MT systems
Low-resource neural machine translation
Parallel corpus harvesting and cleaning
Fuzzy-match repair using MT
Machine translation evaluation for gisting
Research in translation technologies at U. Alacant/3
Current projects /1:
GoURMET, Global Under-Resourced Media Translation,
Horizon 2020 project (2019–2021) with BBC, Deutsche
Welle, U. Amsterdam and Edinburgh U. (coordinator).
News translation to and from 12 less-resourced languages
of interest to media partners BBC and DW.
Research in translation technologies at U. Alacant/4
Current projects /2:
Paracrawl 2: Broader Web-Scale Provision of Parallel
Corpora for European Languages, Connecting Europe
Facility project (2018–2020) with Prompsit, Omniscien,
TAUS, and Edinburgh U. (coordinator).
Automatically harvesting bitexts from the Web to use as
translation memories or to train MT systems (such as
eTranslation at the European Commission).
Alacant develops Bitextor, free/open-source software to
perform the task.
Research in translation technologies at U. Alacant/5
Current projects /3:
EfforTune: Effort-driven optimization of statistical machine
translation, Spanish Government (2016–2019).
Smaller project
Design new automatic evaluation measures that correlate
better with post-editing effort to tune statistical (and neural)
MT systems
These slides are free
This work may be distributed under the terms of either
the Creative Commons Attribution–Share Alike licence or
the GNU GPL v. 3.0 licence.
Dual license! E-mail me to get the sources:

