Learning to Translate with Joey NMT

Learning to Translate with
Joey NMT
PyData Meetup Montreal
Julia Kreutzer
Feb 25, 2021

Today
1. Neural Machine Translation 101
a. Translation as a ML problem
b. Transformer model
c. The role of data
2. Joey NMT
a. Features and purpose
b. Demo
c. Use cases
3. Q & A
Assuming basic ML knowledge, familiarity with neural networks.
What's the technology behind modern
machine translation?
How can you get started?
Why open-sourcing?
Why another toolkit?

[Optional] Demo Preparation
If you want to train your own translation model during this presentation:
1. Open joey_demo.ipynb on Colab.
2. Create a copy.
3. Select GPU runtime: Runtime -> Change runtime type -> Hardware accelerator: GPU
4. Run all cells: Runtime -> Run all
5. Come back to the talk :)
We'll inspect later what's happening there.

Neural Machine Translation 101

Translation as a ML Problem
Challenges
➢ Unlimited length
➢ Structural dependencies
➢ Unseen words
➢ Figurative language
Seq2Seq
➢ Modeling sentences (mostly)
➢ Connections between all words
➢ Sub-word modeling
➢ A lot of training data
Input: What is a poutine ?
Output: Qu'est-ce qu'une poutine ?

The
Transformer
"Attention is all you need"
Vaswani et al. 2017
Decoder
Specialties
Source: Vaswani et al. 2017

Training vs. Inference
Conditional language modeling: Predict the next token yt
:
● given source X and all previous tokens of the reference during training.
● given source X and previously predicted tokens during inference.
Training with MLE, inference with greedy or beam search.

Beam Search
Source: G. Neubig's course on MT and Seq2Seq
Keep the k most likely
prediction sequences in
each step.
➢ more expensive
than greedy
➢ more exact
Implementation on
mini-batches is tricky!
k=2

Words?
Pre-processing plays a huge role in NMT.
qu'est-ce qu'une poutine ? 4 tokens, 4 types
vs
qu ' est - ce qu ' une poutine ? 10 tokens, 8 types
➢ Sub-words instead of words: frequency-based automatic segmentation.
➢ Algorithms: BPE, unigram LM.
➢ Implementations: subword-nmt, SentencePiece.

The Role of Data
A "base"-sized Transformer has ~65M weights. How much data does it need?
➢ It depends!
➢ “As much as you can ﬁnd" heuristic
➢ Beyond parallel data
○ unsupervised NMT
○ data augmentation
○ dictionaries
○ pre-trained embeddings
○ multilingual modeling
How similar are source and target language?
What kind of quality are you expecting?
How complex is the text?

Evaluation
Reference: Qu'est-ce qu'une poutine ?
Outputs:
1. Est-ce qu'une poutine ?
2. Que-ce une poutine ?
3. Qu'une poutine ?
4. Qu'est-ce qu'un poutin ?
5. C'est qu'une poutine .
How should these outputs be ranked / scored?

Evaluation
Reference: Qu'est-ce qu'une poutine ?
Outputs:
1. Est-ce qu'une poutine ? 59.5 82.8
2. Que-ce une poutine ? 32.0 51.4
3. Qu'une poutine ? 39.4 58.3
4. Qu'est-ce qu'un poutin ? 19.0 74.4
5. C'est qu'une poutine . 32.0 60.8
BLEU: geometric average of
token n-gram precisions,
brevity penalty
ChrF: character
n-gram F-score

Joey NMT
Joint work with Jasmijn Bastings, Mayumi Ohta and Joey NMT contributors

Problem
+ A lot of code for NMT is online.
+ Free compute through Colab.
+ Data is freely available. Is it clean?
How long would I have to study it?
Are all features documented?
How can I run it on Colab?
How do I need to prepare data to use it?
Does that mean it's accessible?

Solution
Joey NMT: clean, minimalist, documented.
➢ Much smaller than other toolkits
➢ Covers core features
➢ User study on usability
➢ The core API changes very little.
➢ Examples, pre-trained models, tutorials, FAQ
➢ Based on PyTorch
Does not do
everything,
does not grow
much.

Features
You can:
● train a RNN/Transformer model
● on CPU, one or multiple GPUs
● monitor the training process
● conﬁgure hyperparameters
● store it, load it, test it
And more:
● follow training recipes
● modify the code easily
● get inspiration from other extensions
● share/load pre-trained models

It's cute, but can it compete?
Quality?
➢ Comparable to other toolkits.
Adoption?
➢ Not as popular.
Innovation?
➢ More and more research.

It's cute, but can it compete?
Quality?
➢ Comparable to other toolkits.
Adoption?
➢ Not as popular.
Innovation?
➢ More and more research.
It might not be the best choice for
➢ exact replication of another paper
-> use their code instead
➢ non-seq2seq applications
➢ performance-critical applications
(not optimized for it)
➢ loading BERT (not implemented)

Cool stuff feat. Joey NMT
Grassroots research communities
➢ Masakhane: NLP for African languages
➢ Turkic Interlingua: NLP for Turkic languages
Extensions
➢ Reinforcement learning
➢ Sign language translation
➢ Speech translation
➢ Image captioning
➢ Slack bot
More on this list.

Material
➢ Neural networks in NLP
○ Y. Goldberg: A Primer on Neural Network Models for Natural Language Processing
○ G. Neubig: CMU CS 11-747: Neural Networks for NLP
➢ Neural Machine translation
○ P. Koehn: Neural Machine Translation (Draft Chapter of the Statistical MT book)
○ G. Neubig: Tutorial on Neural Machine Translation
○ A. Rush: The Annotated Transformer
○ J. Bastings: The Annotated Encoder-Decoder
○ M. Müller: Seven Recommendations for MT Evaluation
➢ Joey NMT
○ Joey NMT paper
○ Joey NMT tutorial
○ Masakhane notebooks and YouTube tutorial
○ Turkic Interlingua YouTube tutorial

Thank you!
jkreutzer@google.com
Twitter: @KreutzerJulia
Q & A

Learning to Translate with Joey NMT

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Learning to Translate with Joey NMT

Similar to Learning to Translate with Joey NMT (20)

Recently uploaded

Recently uploaded (20)

Learning to Translate with Joey NMT