Building a Neural Machine Translation System from Scratch
Deep Learning World 2019, Munich
Natasha Latysheva, Welocalize
This talk
• 1. Introduction to machine translation
• 2. Data for machine translation
• 3. Representing words with embeddings
• 4. Deep learning architectures
• Recurrent neural networks
• Transformers
• 5. Some fun things about MT
• 6. Tech stack for machine translation
Intro
• Welocalize
• Language services
• 1500+ employees
• 8th largest globally, 4th largest in the US
• NLP engineering team
• 13 people
• Remote across US, Ireland, UK,
Germany, China
What is machine translation?
• Automated translation between
languages
• MT challenges:
• Language is very complex, flexible with
lots of exceptions
• Language pairs might be very different
• Lots of “non-standard” usage
• Not always a lot of data
• But if people can do it, a model
should be able to learn to do it
• Why bother?
• Huge industry and market demand
because communication is important
• Humans are expensive and slow
• Research side: understanding
language is probably key to
intelligence
Other sequence problems
Ilya Pestov blog post
Rule-based MT
• Very manual, laborious. Hand-crafted rules by expert linguists.
• Early focus on Russian. E.g. translate English “much” or “many”
into Russian:
Jurafsky and Martin, Speech and Language Processing, chapter 25
Data for machine
translation
• Parallel texts, bitexts, parallel corpora
• You need a lot of data
(millions of decent length
sentence pairs) to build
decent neural systems
• Increasing amount of freely available parallel data (curated, scraped, or both); see the sketch below
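A rough sketch of what working with such data looks like (filenames and contents here are hypothetical); bitexts are usually stored as two plain-text files, aligned line by line:

# Hypothetical files: line i of train.en is the translation of line i of train.de.
with open("train.en", encoding="utf-8") as f_en, open("train.de", encoding="utf-8") as f_de:
    pairs = list(zip((line.strip() for line in f_en), (line.strip() for line in f_de)))

print(len(pairs))   # ideally millions of sentence pairs
print(pairs[0])     # e.g. ("The cat sat on the mat.", "Die Katze saß auf der Matte.")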
Neural machine
translation
• Dominant architecture is an
encoder-decoder
• Based on recurrent neural
networks (RNNs)
• Or the Transformer
REPRESENTING WORDS
WITH EMBEDDINGS
Word representations
• ML models can’t process text
strings directly in any
meaningful way
• Need to find a way to
represent words as numbers
• And hopefully the numbers
are linguistically meaningful in
some way
Simplest way to encode words
• Your vocabulary is
all the possible
words
• Each word is
assigned an integer
index
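A minimal sketch of this index assignment (the toy sentences and the unknown-word token are assumptions for illustration):

sentences = ["the cat sat", "the dog sat"]
vocab = {"<unk>": 0}                      # reserve an index for unknown words
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

print(vocab)   # {'<unk>': 0, 'the': 1, 'cat': 2, 'sat': 3, 'dog': 4}
print([vocab.get(w, vocab["<unk>"]) for w in "the cat ran".split()])   # [1, 2, 0]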
One-hot vector encoding
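A minimal numpy sketch of one-hot encoding, continuing the toy vocabulary above:

import numpy as np

vocab_size = 5                 # size of the toy vocabulary
word_index = 2                 # e.g. the index assigned to "cat"

one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0
print(one_hot)                 # [0. 0. 1. 0. 0.] – all zeros except a single 1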
Richer word representations
Properties of word
embeddings
• Similar words cluster
together
• The values are coordinates
in a high-dimensional
semantic space
• Not easily interpretable
Properties of word
embeddings
• Semantically-meaningful
vector arithmetic holds
• Analogical reasoning
• King – Man + Woman = ?
Vector arithmetic with embeddings
Google ML blog post link
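A minimal sketch of this arithmetic using gensim and a pretrained embedding set (the specific model name is an assumption; any pretrained vectors would do):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads pretrained GloVe vectors

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is typically the top result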
Where do the
embeddings come from?
• Which embedding?
• FastText > GloVe >
word2vec
• This is (shallow) transfer
learning in NLP
Google ML blog post link
Calculating
embeddings
• word2vec skip-gram
example
• Train shallow net to predict
a surrounding word, given a
word
• Take hidden layer weight
matrix, treat as coordinates
• So the goal is actually just to learn this hidden-layer weight matrix; the output layer itself is discarded (see the sketch below)
Chris McCormick blog post
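A rough sketch of the skip-gram setup, under assumptions (toy corpus, tiny dimensions, no negative sampling); the training loop itself is omitted:

import numpy as np

corpus = ["the quick brown fox jumps over the lazy dog".split()]   # toy corpus
vocab = {w: i for i, w in enumerate(sorted({w for sent in corpus for w in sent}))}
window = 2

# (centre word, surrounding word) training pairs
pairs = []
for sent in corpus:
    for i, centre in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                pairs.append((vocab[centre], vocab[sent[j]]))

V, D = len(vocab), 8                    # vocabulary size, embedding dimension
W_in = np.random.randn(V, D) * 0.01     # hidden-layer weights: the embeddings we keep
W_out = np.random.randn(D, V) * 0.01    # output-layer weights: discarded after training
# After training (softmax + SGD over `pairs`), row i of W_in is the embedding
# for the word with index i; W_out is thrown away.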
Tokenisation and sub-word embeddings
• Maybe our embeddings should reflect that different forms of the same word are related
• “walking” and “walked”
• Translation works better if you
can incorporate some
morphological knowledge
• Can be learned or linguistic
knowledge baked in
Cotterell and Schütze, 2018, Joint Semantic Synthesis and Morphological Analysis of the Derived Word
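A rough sketch of one way to do this, FastText-style character n-grams (the n-gram range is an assumption; learned schemes such as BPE instead derive their subword units from data):

def char_ngrams(word, n_min=3, n_max=5):
    # Pad with boundary markers so prefixes and suffixes are distinguishable.
    padded = f"<{word}>"
    return [padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

print(char_ngrams("walking")[:6])   # ['<wa', 'wal', 'alk', 'lki', 'kin', 'ing']
# "walking" and "walked" share n-grams like '<wa', 'wal', 'alk',
# so their subword-based representations end up related.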
RECURRENT NEURAL NETWORKS
Recurrent Neural Networks (RNNs)
• Why not feed-forward networks
for translation?
• Words aren’t independent
• Third word really depends on first
and second
• Similar to how conv nets
capture interdependence of
neighbouring pixels
Pictures of RNNs
• Main idea behind RNNs is
that you’re allowed to reuse
information from previous
time steps to inform
predictions of the current
time step
Standard FF network to RNN
• At each time step, RNN
passes on its activations from
previous time step for next
time step to use
• Parameters governing the
connection are shared
between time steps
• The activation function is typically tanh or ReLU (see the sketch below)
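A minimal numpy sketch of a single recurrent time step (dimensions are illustrative assumptions):

import numpy as np

D_x, D_h = 64, 128                          # input and hidden dimensions
W_xh = np.random.randn(D_h, D_x) * 0.01     # input-to-hidden weights
W_hh = np.random.randn(D_h, D_h) * 0.01     # hidden-to-hidden weights
b_h = np.zeros(D_h)

def rnn_step(x_t, h_prev):
    # The new hidden state mixes the current input with the previous hidden state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(D_h)
for x_t in np.random.randn(10, D_x):        # a toy sequence of 10 input vectors
    h = rnn_step(x_t, h)                    # the same weights are reused at every time step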
More layers!
• Increase number
of layers to
increase capacity
for abstraction,
hierarchical
processing of
input
ENCODER-DECODER MODELS
Encoder-Decoder architectures
• Rather than
being forced to
immediately
output a French
word for every
English word we
read…
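A minimal Keras sketch of such an encoder-decoder (layer sizes, vocabulary sizes, and the LSTM choice are assumptions; the data pipeline and inference-time decoding are omitted):

from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, emb_dim, hidden = 8000, 8000, 256, 512

# Encoder: read the whole source sentence, keep only the final state.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, emb_dim)(enc_in)
_, enc_h, enc_c = layers.LSTM(hidden, return_state=True)(enc_emb)

# Decoder: generate the target sentence, initialised with the encoder state.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, emb_dim)(dec_in)
dec_out, _, _ = layers.LSTM(hidden, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[enc_h, enc_c])
probs = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")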
Almost there…
• Bidirectionality and
attention
• For the encoder,
bidirectional RNNs
(BRNNs) often used
• BRNNs read the input
text forwards and
backwards
Bidirectional
RNNs
• Encode input text in
both directions
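A minimal Keras sketch of a bidirectional encoder layer (sizes are assumptions; by default the forward and backward outputs are concatenated):

from tensorflow.keras import layers

enc_emb = layers.Input(shape=(None, 256))   # embedded source sentence
bi_out = layers.Bidirectional(
    layers.LSTM(512, return_sequences=True))(enc_emb)
# bi_out has shape (batch, src_len, 1024): forward and backward features side by side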
Trouble with memorising long passages
• For long sentences, we’re
asking encoder-decoder
to read the entire English
sentence, memorise it,
then write it back in
French
• Condense everything
down to small vector?!
• The issue is that the
decoder needs different
information at different
timesteps but all it gets is
this vector
• Not really how human
translators work
The problem with RNN encoder-decoders
• Serious information
bottleneck
• Condense all source
input down to a small
vector?!
• Long computation
paths
Some ways of handling
long sequences
• Long-range
dependencies
• LSTMs, GRUs
• Meant for long-range
memory… but it’s still
very difficult
Colah’s blog, “Understanding LSTMs”
Some ways of handling
long sequences
• Reverse source
sentence (feed it in
backwards)
• Kind of a hack… it works for English→French, but what about Japanese?
• Feed sentence in
twice
ATTENTION
Attention Idea
• Has been very influential in
deep learning
• Originally developed for MT
(Bahdanau, 2014)
• As you’re producing your output sequence, maybe not every part of your input is equally relevant
• Image captioning example
Lu et al. 2017. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
Attention intuition
• Attention allows the
network to refer back to
the input sequence,
instead of forcing it to
encode all information
into one fixed-length
vector
Attention in machine translation
Xiandong Qi, 2017, Link.
Attention intuition
• Encoder: Use a BRNN to compute a rich set of features about source words and their surrounding words
• Decoder: Use another
RNN to generate output
as before
• Decoder is asked to
choose which hidden
states to use and ignore
• A weighted sum of the encoder hidden states is used to predict the next word (see the sketch below)
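A minimal numpy sketch of one decoder step of attention, assuming simple dot-product scoring (Bahdanau’s original formulation uses a small learned network for the scores):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

src_len, hidden = 7, 512
encoder_states = np.random.randn(src_len, hidden)   # one vector per source word
decoder_state = np.random.randn(hidden)             # current decoder hidden state

scores = encoder_states @ decoder_state     # how relevant is each source position?
weights = softmax(scores)                   # attention weights, sum to 1
context = weights @ encoder_states          # weighted sum of encoder hidden states
# `context` is combined with the decoder state to predict the next target word.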
Attention intuition
• Decoder RNN uses
attention parameters to
decide how much to
pay attention to input
features
• Allows the model to
amplify the signal from
relevant parts of the
source sequence
• This improves
translation
Main differences
and benefits
• Encoder passes a lot more
data to the decoder
• Not just last hidden state
• Passes all hidden states at
every time step
• Computation paths from the relevant information are a lot shorter
TRANSFORMERS
Transformers
• Paradigm shift in sequence
processing
• People were convinced you needed recurrence or convolutions to learn these dependencies
• RNNs were the best way to
capture time-dependent
patterns, like in language
• Transformers use only
attention to do the same job
Transformer intuition
• Also have an encoder-decoder
structure
• In RNNs, hidden state
incorporates context
• In transformers, self-attention
incorporates context
Transformer intuition
• Self-attention
• Instead of processing input tokens one by one, attention takes in the whole set of input tokens
• Learns the dependencies between all of them using three learned weight matrices (query, key, value); see the sketch below
• Makes better use of GPU
resources
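A minimal numpy sketch of single-head scaled dot-product self-attention (dimensions are illustrative assumptions; real Transformers add multiple heads, feed-forward layers, and positional encodings):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 5, 64, 64
X = np.random.randn(seq_len, d_model)       # embeddings for all input tokens at once

W_q = np.random.randn(d_model, d_k) * 0.1   # learned query projection
W_k = np.random.randn(d_model, d_k) * 0.1   # learned key projection
W_v = np.random.randn(d_model, d_k) * 0.1   # learned value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)             # every token attends to every other token
attn = softmax(scores, axis=-1)
contextualised = attn @ V                   # each row now mixes in context from the whole sequence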
Transformers now often SOTA
• You can often get a
couple of points
improvement by
switching to using
Transformers for
your machine
translation system
TECH STACK
Standard Python Scientific Stack
Frameworks
• How low-level do you go?
• Implementing backprop and
gradient checking yourself in
numpy
• …
• Clicking ‘Train’ and ‘Deploy’ in a
GUI
• You probably want to be
somewhere in between
• Around a dozen open-source NMT implementations are available
• Nematus, OpenNMT, TensorFlow seq2seq, Marian, fairseq, Tensor2Tensor
Recommendations
• Python scientific stack
• OpenNMT-tf (TensorFlow version)
• TensorBoard monitoring is great
• Good checkpointing, automatic
evaluation during training
• Get some GPUs if you can
• 3x GeForce GTX Titan X take 2–3 days to train a decent Transformer model
• AWS or GCP are good but can be expensive
• Docker containers
