Seminar given to instructors at Misr International University. Contains slides on "neural machine translation for translators", but also other information about teaching translation technologies at the Universitat d'Alacant.
Understanding Neural Machine Translation
Mikel L. Forcada1,2
1Departament de Llenguatges i Sistemes Informàtics,
Universitat d’Alacant, E-03071 Alacant
2Prompsit Language Engineering, S.L.,
Edifici Quorum III, Av. Universitat s/n, E-03202 Elx
Misr International University, Cairo
8 April 2019
Before we start...
I have prepared a deck of slides about neural machine translation for translators.
However, I’d like this session to be as useful as possible to you. Please interrupt me anytime to ask questions!
We can set aside some time to talk about other matters, such as:
How we teach translation technologies at the Universitat d’Alacant
A brief summary of our research
...or any other aspect of interest to you.
Outline
1 Corpus-based machine translation
2 Neural machine translation
3 Training: details
4 Where to start
5 What to expect
6 Translation technologies at the U. Alacant
Corpus-based machine translation
Machine translation
The translation, by means of a computer using suitable software, of a text written in the source language (SL), which produces another text in the target language (TL), which may be called its raw translation.

SL text → [machine translation system] → TL text (raw)
Machine translation
There are two main groups of machine translation technologies:
Rule-based MT, and
Corpus-based MT
Rule-based machine translation (RBMT) (Lucy Software, ProMT, Apertium...):
It builds upwards from word-for-word translation, hopefully reaching the sentence level.
Translation experts write translation dictionaries and rules that transform SL structures into TL structures.
Translators’ intuitive, unformalized knowledge about the task has to be turned into rules and encoded in a computable manner: additional crude simplifications and sacrifices are needed! If well chosen, some of them will often work fine.
Computer experts write engines that look up those dictionaries and apply those rules to the input text.
Rule-based machine translation
In most of these systems, an additional simplification is made: the “transfer” approximation (rules transform parse trees or similar structures).
Output is consistent but mechanical, lacking fluency.
It has trouble resolving ambiguity at all levels:
lexical (“replace” → “put back”/“substitute”),
syntactic/structural (“I saw the girl with the telescope”).
Customization: experts edit dictionaries and rules.
Corpus-based machine translation
Corpus-based MT learns to translate from a corpus containing hundreds of thousands or millions of translated sentences.
Output: may be deceptively fluent (but unfaithful).
Main approaches:
Statistical machine translation (2005–2015): uses probabilistic models estimated by counting events in the bilingual corpus used to train them.
Neural machine translation (2015–): based on artificial neural networks, inspired by how the human brain learns and generalizes.
Such large corpora may not be available for less-translated languages.
Neural machine translation
Neural MT: the new corpus-based MT
Neural machine translation, or deep-learning-based machine translation, is a recent alternative to statistical MT:
It is corpus-based (usually needs more, and cleaner, data).
First ideas in the ’90s,[1] abandoned due to insufficient hardware.
Taken up again around 2013.
First commercial implementations in 2016 (Google Translate).
Competitive with statistical MT in many applications.

[1] Castaño & Casacuberta, EuroSpeech 1997; Forcada & Ñeco, ICANN 1997.
Artificial neurons /1
Why is it called neural? It is performed by software that simulates large networks of artificial neurons.
Their activation (excitation) depends on the activation of other neurons and on the strength of their connections.
The sign and magnitude of the weights determine the behaviour of the network:
Neurons connected via a positive weight tend to excite or inhibit simultaneously.
Neurons connected via a negative weight tend to be in opposite states.
The effect of the interaction increases with the magnitude of the weight.
Training sets the weights to the values needed to ensure a specific behaviour.
Neural networks
A neural net with 3 inputs, 3 neurons in a hidden layer, and 2 output neurons (figure not shown).
One speaks of deep learning when information is processed using many hidden layers.
Representations /1
The activation values of specific groups of neurons (usually those in a layer) form representations of the information they are processing.
For example, (0.35, 0.28, −0.15, 0.76, ..., 0.88) could be the representation of the word “study”, and (0.93, −0.78, 0.22, 0.31, ..., −0.71) that of the word “cat”.
Representations /2
Let us imagine lexical representations with just three neurons: words with similar meanings are found close to each other (figure not shown).
Representations /3
One can even perform semantic arithmetic with representations (adding and subtracting activation values neuron by neuron):
[king] − [man] + [woman] ≃ [queen]
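This semantic arithmetic can be illustrated with a toy script. The three-neuron vectors below are invented for the example, not learned from any corpus:

```python
import math

# Invented 3-neuron word representations (real ones are learned from text
# and have hundreds of dimensions).
vectors = {
    "king":  [0.80, 0.60, 0.10],
    "man":   [0.70, 0.10, 0.05],
    "woman": [0.65, 0.15, 0.90],
    "queen": [0.75, 0.65, 0.95],
    "cat":   [0.10, 0.90, 0.20],
}

def cosine(u, v):
    """Similarity of two representations (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# [king] - [man] + [woman], neuron by neuron:
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# The nearest remaining word should be "queen".
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vectors[w], target))
print(best)  # → queen
```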
Neural MT: the encoder–decoder architecture
A large part of neural MT systems use the encoder–decoder architecture:[2]
The encoder is a neural net that reads, one by one, representations of the words in the source sentence and recursively builds a representation of it; then
The decoder is a neural net that predicts, one by one, the target words:
Each output unit computes the probability of each possible target word.
The most likely word is selected.
It works similarly to the predictive keyboard on our smartphones.

[2] Other architectures, such as the transformer, are now also very common.
Encoding
Encoding of the source sentence “My flight is delayed .” from the representations of its words. The encoder reads one word representation at a time and updates the sentence representation:

E(“”) + e(“my”) → encoder → E(“My”)
E(“My”) + e(“flight”) → encoder → E(“My flight”)
[...]
E(“My flight is delayed”) + e(“.”) → encoder → E(“My flight is delayed .”)
Decoding
Decoding “My flight is delayed .” → “Mi vuelo está retrasado .”. At each step, the decoder combines the encoding of the source sentence with the target prefix generated so far, computes the probability of each possible next word, and the most likely word is selected:

Step 1: p(x | “My flight is delayed .”, START): x=“mi” 0.125, x=“vuelo” 0.078, x=“su” 0.027, x=“avión” 0.011, ... → mi
Step 2: p(x | “My flight is delayed .”, “Mi”): x=“vuelo” 0.315, x=“avión” 0.088, x=“escala” 0.071, x=“está” 0.009, ... → vuelo
Step 3: p(x | “My flight is delayed .”, “Mi vuelo”): x=“está” 0.415, x=“es” 0.218, x=“tarde” 0.071, x=“hay” 0.009, ... → está
Step 4: p(x | “My flight is delayed .”, “Mi vuelo está”): x=“retrasado” 0.683, x=“tardando” 0.112, x=“cancelado” 0.092, x=“listo” 0.048, ... → retrasado
Step 5: p(x | “My flight is delayed .”, “Mi vuelo está retrasado”): x=“.” 0.773, x=“porque” 0.038, x=“dos” 0.011, x=“hasta” 0.001, ... → .
An extension: attention
Encoder–decoders are sometimes augmented with attention. The decoder learns to “pay (more or less) attention”...
not only to the last encoded representation E(“My flight is delayed .”)...
but also to all of the intermediate representations created during encoding: E(“My”), E(“My flight”), E(“My flight is”), E(“My flight is delayed”)...
...using special attention connections.
Training: details
Where do I get corpora from?
It is unlikely that a freelance translator or a small translation agency has 1,000,000 sentence pairs available for their languages and text domains.
If one is lucky, repositories like OPUS[3] may contain useful corpora for one’s language pair and text domain.
Another possibility is to harvest them from the Internet using specialized software such as Bitextor.[4] Very challenging!

[3] http://opus.nlpl.eu
[4] http://github.com/bitextor
Corpus preparation /1
To train a neural MT system, one prepares three disjoint corpora from the available data:
A large training set, ideally hundreds of thousands or millions of sentence pairs, and ideally representative of the task (but this may not be available!). The examples in this set are shown to the neural net during training.
A small development set of 1,000–3,000 sentence pairs, representative of the task. The examples in this held-out set are used to decide when to stop training, so that overfitting to the training set is avoided.
A small test set of 1,000–3,000 sentence pairs, representative of the task. The examples in this held-out set give an idea of the performance of the system.
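The three-way split can be sketched as follows; the corpus here is a placeholder, and only the dev/test sizes follow the ranges above:

```python
import random

def split_corpus(pairs, dev_size=2000, test_size=2000, seed=42):
    """Split sentence pairs into disjoint training, development and test
    sets (dev/test sizes in the 1,000-3,000 range suggested above)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)   # fixed seed: reproducible split
    test = pairs[:test_size]
    dev = pairs[test_size:test_size + dev_size]
    train = pairs[test_size + dev_size:]
    return train, dev, test

# Pretend corpus of 100,000 sentence pairs.
corpus = [(f"src {i}", f"tgt {i}") for i in range(100_000)]
train, dev, test = split_corpus(corpus)
print(len(train), len(dev), len(test))  # → 96000 2000 2000
```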
Corpus preparation /2
Tokenization: one can improve the segmentation of text beyond whitespace to improve processing and reduce vocabulary size.
Some words are written together:
don’t → don 't
won’t → won 't
l’amour → l' amour
Punctuation is often written without an intervening space, and it needs to be separated, but not always:
I know, Dr. Jones (and you know too). → I know , Dr. Jones ( and you know too ) .
How would this work for languages such as Arabic?
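A naive version of such a tokenizer might look like this. Real toolkits are language-dependent and handle many more cases; the abbreviation list here is a made-up stub:

```python
import re

# Hypothetical stub; real tokenizers ship long, language-specific lists.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "etc."}

def tokenize(text):
    tokens = []
    for tok in text.split():
        if tok in ABBREVIATIONS:                 # keep abbreviations intact
            tokens.append(tok)
            continue
        # Split clitics written together: don't -> don 't
        # (only handles the straight apostrophe, English-style split).
        tok = re.sub(r"(\w)'(\w)", r"\1 '\2", tok)
        # Separate surrounding punctuation: (word). -> ( word ) .
        tok = re.sub(r"([(),.!?;:])", r" \1 ", tok)
        tokens.extend(tok.split())
    return " ".join(tokens)

print(tokenize("I know, Dr. Jones (and you know too)."))
# → I know , Dr. Jones ( and you know too ) .
```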
Corpus preparation /3
Truecasing and de-truecasing:
Many languages capitalize words in some contexts (start of sentence, headlines). Words which are the same appear different.
Truecasing, trained on the training corpus, tries to undo capitalization where needed:
Mes amis et mes amies sont arrivés hier à Paris . → mes amis et mes amies sont arrivés hier à Paris .
De-truecasing applies simple rules to re-capitalize the machine translation output of a system trained on truecased text.
Is this necessary for languages such as Arabic?
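A much simplified truecaser can be sketched as follows: it learns each word's most frequent casing from non-sentence-initial positions, then lowercases sentence-initial words unless a cased form is known. The two-sentence "training corpus" is invented:

```python
from collections import Counter

def train_truecaser(corpus_sentences):
    """Learn each word's preferred casing from non-sentence-initial
    positions (a simplified truecaser; real ones handle much more)."""
    counts = Counter()
    for sent in corpus_sentences:
        for tok in sent.split()[1:]:     # skip sentence-initial tokens
            counts[tok] += 1
    best = {}
    for form, c in counts.items():
        key = form.lower()
        if key not in best or c > counts[best[key]]:
            best[key] = form             # keep the most frequent casing
    return best

def truecase(sentence, model):
    out = []
    for i, tok in enumerate(sentence.split()):
        if i == 0:                       # only sentence-initial tokens change
            tok = model.get(tok.lower(), tok.lower())
        out.append(tok)
    return " ".join(out)

model = train_truecaser([
    "mes amis sont arrivés hier à Paris .",
    "nous avons vu Paris hier .",
])
print(truecase("Mes amis et mes amies sont arrivés hier à Paris .", model))
# → mes amis et mes amies sont arrivés hier à Paris .
print(truecase("Paris est belle .", model))  # → Paris est belle .
```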
Corpus preparation /4
Subword units /1
Neural MT systems have as many input units as source words in the vocabulary, and as many output units as target words in the vocabulary (one needs to fix the size of the vocabularies).
Very large vocabularies are not uncommon in morphologically rich languages → unfeasible!
All other words have to be encoded as unknown words: <unk>.
Corpus preparation /4
Subword units /2
The solution: break words into smaller, recurring units!
Solution #1: byte-pair encoding, a language-independent approach:
The initial set of codes is the set of characters.
Iteratively merge the most frequent pair of consecutive codes into a single code,
until a certain number of codes (the vocabulary size) is reached.
Use the resulting character sequences as tokens:
institutionalisations → institu@@ tion@@ alis@@ ations
Many words are stored complete.
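The merge-learning loop fits in a few lines. This is a minimal sketch on an invented toy corpus, not a production BPE implementation:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Minimal byte-pair-encoding learner: start from characters and
    repeatedly merge the most frequent pair of adjacent symbols."""
    # Each word as a tuple of symbols, with its corpus frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():      # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, 4)
print(merges)  # → [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```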
Corpus preparation /5
Subword units /3
Solution #2: SentencePiece, a completely language-independent, data-driven approach:
No need to tokenize: it processes the whole sentence.
It learns a subword segmentation model directly from the training corpus.
How does one train neural MT? /1
Training: adjusting all of the weights in the neural net.
The NMT decoder outputs word probabilities in context, which combine into sentence probabilities:
P(“I love you .” | “Je t’aime”) = p(“.” | “I love you”, “Je t’aime”) × p(“you” | “I love”, “Je t’aime”) × p(“love” | “I”, “Je t’aime”) × p(“I” | START, “Je t’aime”)
Objective: maximize the likelihood P(“I love you .” | “Je t’aime”) of the reference translation “I love you .”.
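In practice such chain-rule products are computed as sums of logarithms. A sketch with made-up word probabilities for the example above:

```python
import math

# Hypothetical next-word probabilities for the target "I love you ."
# given the source "Je t'aime" (all values invented for illustration).
step_probs = {
    ("START",): {"I": 0.4},
    ("START", "I"): {"love": 0.5},
    ("START", "I", "love"): {"you": 0.6},
    ("START", "I", "love", "you"): {".": 0.7},
}

def sentence_log_prob(target_words):
    """Chain rule: log P(sentence) = sum of log p(word | prefix, source)."""
    prefix = ("START",)
    log_p = 0.0
    for w in target_words:
        log_p += math.log(step_probs[prefix][w])
        prefix = prefix + (w,)
    return log_p

lp = sentence_log_prob(["I", "love", "you", "."])
print(math.exp(lp))  # 0.4 × 0.5 × 0.6 × 0.7 = 0.084
```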
How does one train neural MT? /2
The training algorithm computes the gradient of the probabilities of the reference sentences in the training set with respect to each weight w connecting two neurons:
gradient(w) = (P(with w + ∆w) − P(with w)) / ∆w
That is, how much the probability varies for a small change ∆w in each weight w.
Then, after showing a number of reference sentences, weights are updated in proportion to their effect on that probability → gradient ascent:
new w = w + (learning rate) × gradient(w)
This is done repeatedly.
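The finite-difference gradient and the update rule above can be sketched as follows; the toy function `P` stands in for the likelihood of the reference translations:

```python
def numerical_gradient(P, weights, i, dw=1e-6):
    """Finite-difference estimate of the formula above:
    (P(with w + dw) - P(with w)) / dw, for weight i."""
    bumped = list(weights)
    bumped[i] += dw
    return (P(bumped) - P(weights)) / dw

def gradient_ascent_step(P, weights, learning_rate=0.1):
    """new w = w + learning_rate * gradient(w), for every weight."""
    grads = [numerical_gradient(P, weights, i) for i in range(len(weights))]
    return [w + learning_rate * g for w, g in zip(weights, grads)]

# Toy "likelihood", maximal at w = (1, 2); real training maximizes the
# probability of the reference translations instead.
def P(w):
    return -((w[0] - 1.0) ** 2 + (w[1] - 2.0) ** 2)

w = [0.0, 0.0]
for _ in range(100):       # "this is done repeatedly"
    w = gradient_ascent_step(P, w)
print([round(x, 2) for x in w])  # → [1.0, 2.0]
```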
How does one train neural MT? /3
Examples (sentence pairs) are grouped into minibatches of e.g. 128 examples.
Weights are updated after each minibatch.
An epoch is completed each time the whole set of examples, e.g. 1,000,000 of them, has been processed.
It is not uncommon for a training job to require tens or hundreds of epochs.
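The epoch/minibatch loop amounts to this sketch, where `update_weights` is a hypothetical stand-in for one gradient-ascent step:

```python
def train(examples, num_epochs, batch_size=128):
    """Sketch of the epoch/minibatch loop."""
    for epoch in range(num_epochs):
        for start in range(0, len(examples), batch_size):
            minibatch = examples[start:start + batch_size]
            update_weights(minibatch)    # stand-in for one gradient step

updates = 0
def update_weights(minibatch):           # hypothetical: just counts updates
    global updates
    updates += 1

examples = list(range(1000))             # pretend sentence pairs
train(examples, num_epochs=3)
print(updates)  # → 24 (3 epochs × 8 minibatches of ≤ 128 examples)
```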
How does one train neural MT? /4
When does one stop? Training “too deep” may lead to “memorization” of the training set, but we want the network to generalize. This is what the development set is used for:
Every certain number of weight updates, the system automatically evaluates its performance on the sentences of the development set.
It compares the MT output to the reference translations and computes a measure such as BLEU.
How does one train neural MT? /5
What is BLEU? The best-known automatic evaluation measure is called BLEU, but there are many others.
BLEU counts the fraction of the 1-word, 2-word, ... n-word sequences in the output that match the reference translation.
These fractions are combined into a single quantity.
The result is a number between 0 and 1, or between 0% and 100%.
Correlation with measurements of translation usefulness is still an open question.
A lot of MT research is still BLEU-driven and makes little contact with real applications of MT.
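A simplified sentence-level version of BLEU can be sketched as follows (real BLEU is computed over whole corpora and smooths the edge cases):

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Clipped n-gram precisions for n = 1..4, combined by a geometric
    mean and multiplied by a brevity penalty (simplified sketch)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        matches = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(cand_ngrams.values()), 1)
        if matches == 0:      # real BLEU smooths this case instead
            return 0.0
        log_precisions += math.log(matches / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_precisions)

print(bleu("mi vuelo está retrasado .", "mi vuelo está retrasado ."))  # → 1.0
print(bleu("mi avión está retrasado .", "mi vuelo está retrasado ."))  # → 0.0
```

The second candidate scores 0 because no 4-gram matches; this is exactly the kind of edge case corpus-level BLEU avoids by pooling counts over many sentences.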
How does one train neural MT? /6
Neural MT training is computationally very expensive. An example:
A very small encoder–decoder (2 layers of 128 units each)...
...with a small training set of 260,000 sentences...
...using a small vocabulary of 10,000 byte-pair encoding operations...
...between French and Spanish (an easy language pair)...
...takes about 1 week on a 4-core, 3.2 GHz desktop...
...to reach a BLEU score of around 25% (barely starting to be post-editable).
How does one train neural MT? /7
We need stronger, specialized, expensive hardware:
Regular CPUs in our desktops and laptops are too slow.
Neural training implies a lot of vector, matrix and tensor operations...
...which are efficiently performed by GPUs (graphics processing units).[5]
Using GPUs, one can speed up training by 100× or more.

[5] One GPU costs ≃ US$2,000.
Where to start
Can I do it myself? /1
Clara is a 4th-year translation and interpreting student of mine. She accepted the challenge of learning neural MT. She has learned:
To install TensorFlow and TensorFlow NMT on a Windows computer and on a GNU/Linux computer (no GPUs on either).
To use the command-line interface on a Windows computer and on a GNU/Linux computer.
To select and prepare training, development and test corpora.
To check the output of the system and interpret how it is learning.
She is writing a guide for fellow students as her bachelor’s thesis.
Can I do it myself? /2
I have helped her by:
Answering her questions and guiding her.
Writing some small Python programs (scripts) to process corpora and launch training jobs.
Letting her slow down my desktop computer with her jobs ;-)
And I have learned a lot by working with her. If she can, why can’t you?
What to expect
New technology, new behaviour /1
Neural MT...
...requires specialized, powerful hardware.
...needs large amounts of bilingual data.
Neither is normally available to the average translator or translation company.
One can resort to third parties to train and run neural MT for us:
This is actually a business model in the translation industry.
They can add our translation memories to their stock data to build a system for us.
New technology, new behaviour /2
Neural MT...
...works with representations of the whole sentence: it is hard to know the source for each target word. Lack of transparency.
...produces grammatically fluent texts.
...produces semantically motivated errors: if a word has not been seen during training, it may be replaced...
...by a similar word: palace → castle;
...by a paraphrase: Michael Jordan → the Chicago Bulls shooting guard;
sometimes with dangerous results: Tunisia → Norway.
...may invent words (engineerage, recruitation, etc.) when working with subword units (as is quite frequently done).
Post-editors have to pay special attention (cognitive load).
Concluding remarks
Neural MT is a new kind of corpus-based MT.
It is currently displacing statistical MT in many applications.
It is based on a simplification of the nervous system.
It requires large corpora.
It requires powerful, specialized hardware.
It may produce natural-sounding text with errors which are hard to spot and correct.
It requires very special attention on the part of post-editors.
Translation technologies at the U. Alacant
Teaching translation technologies at U. Alacant /1
4-year bachelor’s degree in Translation and Interpreting.
Subject: “Translation Technologies”, the only subject on technologies or computing in the whole degree.
2nd year, 2nd semester.
6 ECTS credits (60 h in the lab and classroom, 90 h at home); 1.5 credits in the classroom, 4.5 credits in the laboratory.
Languages: Spanish and Valencian (Catalan); no English yet.
All material available under CC licenses.
Teaching translation technologies at U. Alacant /2
Classroom blocks:
Hardware and software: MB, GB, GHz, CPU...
Internet: se
Texts and formats: Unicode, XML, HTML, markup...
Uses of machine translation.
Ambiguity as an obstacle.
How does machine translation work?
Computer-aided translation with translation memories.
Termbases.
Teaching translation technologies at U. Alacant/3
Laboratory work. Part 1, practice (20 h):
HTML basics (for their HTML portfolio)
Advanced word processing: styles (LibreOffice).
XML: validation, etc.
Computer-aided translation of XML-formatted documents with OmegaT.
Teaching translation technologies at U. Alacant /4
Part 2: collaborative projects (25 h in the lab):
Groups of 4 students tackle a sizeable translation project (about 10,000 words).
We encourage them to find real projects (NGOs, etc.).
They select an MT system and evaluate their performance with it.
They gather translation memories and evaluate their performance with them.
They decide on the best combination and translate using the team functionality of OmegaT.[6]
They learn to organize work and assess quality.

[6] Newly translated segments are easily shared via a free account on a GitHub or GitLab server.
Research in translation technologies at U. Alacant /1
The translation technologies group at Alacant is part of the Transducens research group[7] at the Department of Software and Computing Systems:[8]
Mikel L. Forcada, full professor.
Juan Antonio Pérez-Ortiz, associate professor.
Felipe Sánchez-Martínez, post-doctoral lecturer.
Miquel Esplà-Gomis, post-doctoral researcher.
Víctor Sánchez-Cartagena, post-doctoral researcher.
Leopoldo Pla Sempere, research technician.
Francisco de Borja Valero Antón, research technician.
Current PhD students: John E. Ortega, Kenneth Jordan-Núñez (with Universitat Pompeu Fabra).

[7] http://transducens.dlsi.ua.es
[8] http://www.dlsi.ua.es/index.cgi?id=eng
Research in translation technologies at U. Alacant/2
Main research lines:
Quality estimation for machine translation
Effort-oriented training of MT systems
Low-resource neural machine translation
Parallel corpus harvesting and cleaning
Fuzzy-match repair using MT
Machine translation evaluation for gisting
Research in translation technologies at U. Alacant /3
Current projects /1:
GoURMET (Global Under-Resourced Media Translation), a Horizon 2020 project (2019–2021) with the BBC, Deutsche Welle, U. Amsterdam and Edinburgh U. (coordinator).[9]
News translation to and from 12 less-resourced languages of interest to media partners BBC and DW.

[9] https://cordis.europa.eu/project/rcn/221392/factsheet/en
Research in translation technologies at U. Alacant /4
Current projects /2:
Paracrawl 2 (Broader Web-Scale Provision of Parallel Corpora for European Languages), a Connecting Europe Facility project (2018–2020) with Prompsit, Omniscien, TAUS and Edinburgh U. (coordinator).[10]
Automatically harvesting bitexts from the Web to use as translation memories or to train MT systems (such as eTranslation at the European Commission).
Alacant develops Bitextor,[11] free/open-source software to perform the task.

[10] https://paracrawl.eu/
[11] http://github.com/bitextor/
Research in translation technologies at U. Alacant /5
Current projects /3:
EfforTune (Effort-driven optimization of statistical machine translation), funded by the Spanish Government (2016–2019).
A smaller project.
Design of new automatic evaluation measures that correlate better with post-editing effort, used to tune statistical (and neural) MT systems.
These slides are free
This work may be distributed under the terms of either:
the Creative Commons Attribution–Share Alike licence: http://creativecommons.org/licenses/by-sa/4.0/
the GNU GPL v. 3.0 licence: http://www.gnu.org/licenses/gpl.html
Dual license! E-mail me to get the sources: mlf@ua.es