Building a Neural Machine Translation System from Scratch
Deep Learning World 2019, Munich
Natasha Latysheva, Welocalize
This talk
• 1. Introduction to machine translation
• 2. Data for machine translation
• 3. Representing words with embeddings
• 4. Deep learning architectures
• Recurrent neural networks
• Transformers
• 5. Some fun things about MT
• 6. Tech stack for machine translation
Intro
• Welocalize
• Language services
• 1500+ employees
• 8th largest globally, 4th largest in the US
• NLP engineering team
• 13 people
• Remote across US, Ireland, UK,
Germany, China
What is machine translation?
• Automated translation between
languages
• MT challenges:
• Language is very complex, flexible with
lots of exceptions
• Language pairs might be very different
• Lots of “non-standard” usage
• Not always a lot of data
• But if people can do it, a model
should be able to learn to do it
• Why bother?
• Huge industry and market demand
because communication is important
• Humans are expensive and slow
• Research side: understanding
language is probably key to
intelligence
Other sequence problems
Ilya Pestov blog post
Rule-based MT
• Very manual, laborious. Hand-crafted rules by expert linguists.
• Early focus on Russian. E.g. translate English “much” or “many”
into Russian:
Jurafsky and Martin, Speech and Language Processing, chapter 25
Data for machine
translation
• Parallel texts, bitexts, parallel corpora
• You need a lot of data
(millions of decent length
sentence pairs) to build
decent neural systems
• Increasing amount of freely available parallel data (curated, scraped, or both); see the sketch below
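A rough sketch of what working with such data looks like (filenames and contents here are hypothetical); bitexts are usually stored as two plain-text files, aligned line by line:

# Hypothetical files: line i of train.en is the translation of line i of train.de.
with open("train.en", encoding="utf-8") as f_en, open("train.de", encoding="utf-8") as f_de:
    pairs = list(zip((line.strip() for line in f_en), (line.strip() for line in f_de)))

print(len(pairs))   # ideally millions of sentence pairs
print(pairs[0])     # e.g. ("The cat sat on the mat.", "Die Katze saß auf der Matte.")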
Neural machine
translation
• Dominant architecture is an
encoder-decoder
• Based on recurrent neural
networks (RNNs)
• Or the Transformer
REPRESENTING WORDS
WITH EMBEDDINGS
Word representations
• ML models can’t process text
strings directly in any
meaningful way
• Need to find a way to
represent words as numbers
• And hopefully the numbers
are linguistically meaningful in
some way
Simplest way to encode words
• Your vocabulary is
all the possible
words
• Each word is
assigned an integer
index
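A minimal sketch of this index assignment (the toy sentences and the unknown-word token are assumptions for illustration):

sentences = ["the cat sat", "the dog sat"]
vocab = {"<unk>": 0}                      # reserve an index for unknown words
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

print(vocab)   # {'<unk>': 0, 'the': 1, 'cat': 2, 'sat': 3, 'dog': 4}
print([vocab.get(w, vocab["<unk>"]) for w in "the cat ran".split()])   # [1, 2, 0]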
One-hot vector encoding
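A minimal numpy sketch of one-hot encoding, continuing the toy vocabulary above:

import numpy as np

vocab_size = 5                 # size of the toy vocabulary
word_index = 2                 # e.g. the index assigned to "cat"

one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0
print(one_hot)                 # [0. 0. 1. 0. 0.] – all zeros except a single 1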
Richer word representations
Properties of word
embeddings
• Similar words cluster
together
• The values are coordinates
in a high-dimensional
semantic space
• Not easily interpretable
Properties of word
embeddings
• Semantically-meaningful
vector arithmetic holds
• Analogical reasoning
• King – Man + Woman = ?
Vector arithmetic with embeddings
Google ML blog post link
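A minimal sketch of this arithmetic using gensim and a pretrained embedding set (the specific model name is an assumption; any pretrained vectors would do):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads pretrained GloVe vectors

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is typically the top result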
Where do the
embeddings come from?
• Which embedding?
• FastText > GloVe >
word2vec
• This is (shallow) transfer
learning in NLP
Google ML blog post link
Calculating
embeddings
• word2vec skip-gram
example
• Train shallow net to predict
a surrounding word, given a
word
• Take hidden layer weight
matrix, treat as coordinates
• So the goal is actually just to learn this hidden-layer weight matrix; the output layer itself is discarded (see the sketch below)
Chris McCormick blog post
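A rough sketch of the skip-gram setup, under assumptions (toy corpus, tiny dimensions, no negative sampling); the training loop itself is omitted:

import numpy as np

corpus = ["the quick brown fox jumps over the lazy dog".split()]   # toy corpus
vocab = {w: i for i, w in enumerate(sorted({w for sent in corpus for w in sent}))}
window = 2

# (centre word, surrounding word) training pairs
pairs = []
for sent in corpus:
    for i, centre in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                pairs.append((vocab[centre], vocab[sent[j]]))

V, D = len(vocab), 8                    # vocabulary size, embedding dimension
W_in = np.random.randn(V, D) * 0.01     # hidden-layer weights: the embeddings we keep
W_out = np.random.randn(D, V) * 0.01    # output-layer weights: discarded after training
# After training (softmax + SGD over `pairs`), row i of W_in is the embedding
# for the word with index i; W_out is thrown away.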
Tokenisation and sub-word embeddings
• Maybe our embeddings should reflect that different forms of the same word are related
• “walking” and “walked”
• Translation works better if you
can incorporate some
morphological knowledge
• Can be learned or linguistic
knowledge baked in
Cotterell and Schütze, 2018, Joint Semantic Synthesis and Morphological Analysis of the Derived Word
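A rough sketch of one way to do this, FastText-style character n-grams (the n-gram range is an assumption; learned schemes such as BPE instead derive their subword units from data):

def char_ngrams(word, n_min=3, n_max=5):
    # Pad with boundary markers so prefixes and suffixes are distinguishable.
    padded = f"<{word}>"
    return [padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

print(char_ngrams("walking")[:6])   # ['<wa', 'wal', 'alk', 'lki', 'kin', 'ing']
# "walking" and "walked" share n-grams like '<wa', 'wal', 'alk',
# so their subword-based representations end up related.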
RECURRENT NEURAL NETWORKS
Recurrent Neural Networks (RNNs)
• Why not feed-forward networks
for translation?
• Words aren’t independent
• Third word really depends on first
and second
• Similar to how conv nets
capture interdependence of
neighbouring pixels
Pictures of RNNs
• Main idea behind RNNs is
that you’re allowed to reuse
information from previous
time steps to inform
predictions of the current
time step
Standard FF network to RNN
• At each time step, RNN
passes on its activations from
previous time step for next
time step to use
• Parameters governing the
connection are shared
between time steps
• The activation function is typically tanh or ReLU (see the sketch below)
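A minimal numpy sketch of a single recurrent time step (dimensions are illustrative assumptions):

import numpy as np

D_x, D_h = 64, 128                          # input and hidden dimensions
W_xh = np.random.randn(D_h, D_x) * 0.01     # input-to-hidden weights
W_hh = np.random.randn(D_h, D_h) * 0.01     # hidden-to-hidden weights
b_h = np.zeros(D_h)

def rnn_step(x_t, h_prev):
    # The new hidden state mixes the current input with the previous hidden state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(D_h)
for x_t in np.random.randn(10, D_x):        # a toy sequence of 10 input vectors
    h = rnn_step(x_t, h)                    # the same weights are reused at every time step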
More layers!
• Increase number
of layers to
increase capacity
for abstraction,
hierarchical
processing of
input
ENCODER-DECODER MODELS
Encoder-Decoder architectures
• Rather than
being forced to
immediately
output a French
word for every
English word we
read…
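A minimal Keras sketch of such an encoder-decoder (layer sizes, vocabulary sizes, and the LSTM choice are assumptions; the data pipeline and inference-time decoding are omitted):

from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, emb_dim, hidden = 8000, 8000, 256, 512

# Encoder: read the whole source sentence, keep only the final state.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, emb_dim)(enc_in)
_, enc_h, enc_c = layers.LSTM(hidden, return_state=True)(enc_emb)

# Decoder: generate the target sentence, initialised with the encoder state.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, emb_dim)(dec_in)
dec_out, _, _ = layers.LSTM(hidden, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[enc_h, enc_c])
probs = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")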
Almost there…
• Bidirectionality and
attention
• For the encoder,
bidirectional RNNs
(BRNNs) often used
• BRNNs read the input
text forwards and
backwards
Bidirectional
RNNs
• Encode input text in
both directions
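A minimal Keras sketch of a bidirectional encoder layer (sizes are assumptions; by default the forward and backward outputs are concatenated):

from tensorflow.keras import layers

enc_emb = layers.Input(shape=(None, 256))   # embedded source sentence
bi_out = layers.Bidirectional(
    layers.LSTM(512, return_sequences=True))(enc_emb)
# bi_out has shape (batch, src_len, 1024): forward and backward features side by side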
Trouble with memorising long passages
• For long sentences, we’re
asking encoder-decoder
to read the entire English
sentence, memorise it,
then write it back in
French
• Condense everything
down to small vector?!
• The issue is that the
decoder needs different
information at different
timesteps but all it gets is
this vector
• Not really how human
translators work
The problem with RNN encoder-decoders
• Serious information
bottleneck
• Condense all source
input down to a small
vector?!
• Long computation
paths
Some ways of handling
long sequences
• Long-range
dependencies
• LSTMs, GRUs
• Meant for long-range
memory… but it’s still
very difficult
Colah’s blog, “Understanding LSTMs”
Some ways of handling
long sequences
• Reverse source
sentence (feed it in
backwards)
• Kind of a hack… it works for English→French, but what about Japanese?
• Feed sentence in
twice
ATTENTION
Attention Idea
• Has been very influential in
deep learning
• Originally developed for MT
(Bahdanau, 2014)
• As you’re producing your output sequence, maybe not every part of your input is equally relevant
• Image captioning example
Lu et al. 2017. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
Attention intuition
• Attention allows the
network to refer back to
the input sequence,
instead of forcing it to
encode all information
into one fixed-length
vector
Attention in machine translation
Xiandong Qi, 2017, Link.
Attention intuition
• Encoder: Use a BRNN to compute a rich set of features about source words and their surrounding words
• Decoder: Use another
RNN to generate output
as before
• Decoder is asked to
choose which hidden
states to use and ignore
• A weighted sum of the encoder hidden states is used to predict the next word (see the sketch below)
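A minimal numpy sketch of one decoder step of attention, assuming simple dot-product scoring (Bahdanau’s original formulation uses a small learned network for the scores):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

src_len, hidden = 7, 512
encoder_states = np.random.randn(src_len, hidden)   # one vector per source word
decoder_state = np.random.randn(hidden)             # current decoder hidden state

scores = encoder_states @ decoder_state     # how relevant is each source position?
weights = softmax(scores)                   # attention weights, sum to 1
context = weights @ encoder_states          # weighted sum of encoder hidden states
# `context` is combined with the decoder state to predict the next target word.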
Attention intuition
• Decoder RNN uses
attention parameters to
decide how much to
pay attention to input
features
• Allows the model to
amplify the signal from
relevant parts of the
source sequence
• This improves
translation
Main differences
and benefits
• Encoder passes a lot more
data to the decoder
• Not just last hidden state
• Passes all hidden states at
every time step
• Computation paths from the relevant information are a lot shorter
TRANSFORMERS
Transformers
• Paradigm shift in sequence
processing
• People were convinced you needed recurrence or convolutions to learn these dependencies
• RNNs were the best way to
capture time-dependent
patterns, like in language
• Transformers use only
attention to do the same job
Transformer intuition
• Also have an encoder-decoder
structure
• In RNNs, hidden state
incorporates context
• In transformers, self-attention
incorporates context
Transformer intuition
• Self-attention
• Instead of processing input tokens one by one, attention takes in the whole set of input tokens
• Learns the dependencies between all of them using three learned weight matrices (query, key, value); see the sketch below
• Makes better use of GPU
resources
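A minimal numpy sketch of single-head scaled dot-product self-attention (dimensions are illustrative assumptions; real Transformers add multiple heads, feed-forward layers, and positional encodings):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 5, 64, 64
X = np.random.randn(seq_len, d_model)       # embeddings for all input tokens at once

W_q = np.random.randn(d_model, d_k) * 0.1   # learned query projection
W_k = np.random.randn(d_model, d_k) * 0.1   # learned key projection
W_v = np.random.randn(d_model, d_k) * 0.1   # learned value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)             # every token attends to every other token
attn = softmax(scores, axis=-1)
contextualised = attn @ V                   # each row now mixes in context from the whole sequence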
Transformers now often SOTA
• You can often get a
couple of points
improvement by
switching to using
Transformers for
your machine
translation system
TECH STACK
Standard Python Scientific Stack
Frameworks
• How low-level do you go?
• Implementing backprop and
gradient checking yourself in
numpy
• …
• Clicking ‘Train’ and ‘Deploy’ in a
GUI
• You probably want to be
somewhere in between
• Around a dozen open-source NMT implementations are available
• Nematus, OpenNMT, TensorFlow seq2seq, Marian, fairseq, Tensor2Tensor
Recommendations
• Python scientific stack
• OpenNMT-tf (TensorFlow version)
• TensorBoard monitoring is great
• Good checkpointing, automatic
evaluation during training
• Get some GPUs if you can
• 3x GeForce GTX Titan X take 2–3 days to train a decent Transformer model
• AWS or GCP are good but can be expensive
• Docker containers
