Building Streaming Pipelines for Neural Machine Translation
Suneel Marthi
Kellen Sunderland
April 19, 2018
DataWorks Summit, Berlin, Germany
1
$WhoAreWe
Kellen Sunderland
@KellenDB
Member of Apache Software Foundation
Contributor to Apache MXNet (incubating), and committer on Apache Joshua
(incubating)
Suneel Marthi
@suneelmarthi
Member of Apache Software Foundation
Committer and PMC on Apache Mahout, Apache OpenNLP, Apache Streams
2
Agenda
What is Machine Translation?
Why move from SMT to NMT?
NMT Samples
NMT Challenges
Streaming Pipelines for NMT
Demo
3
OSS Tools
Apache Flink - A distributed stream processing engine
written in Java and Scala.
Apache OpenNLP - A machine learning toolkit for
Natural Language Processing, written in Java.
Apache Thrift - A framework for cross-language
services development.
4
OSS Tools (contd)
Apache Joshua (incubating) - A statistical machine
translation decoder for phrase-based, hierarchical,
and syntax-based machine translation, written in
Java.
Apache MXNet (incubating) - A flexible and efficient
library for deep learning.
Sockeye - A sequence-to-sequence framework for
Neural Machine Translation, based on Apache MXNet
(incubating).
5
What is Machine Translation?
6
Statistical Machine Translation
Generate Translations from Statistical Models trained on Bilingual Corpora.
Translation happens according to a probability distribution p(e|f)
E = string in the target language (English)
F = string in the source language (Spanish)
e~ = argmax p(e|f) = argmax p(f|e) * p(e)
e~ = best translation, the one with highest probability
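A minimal Java sketch of the noisy-channel decision rule above, scoring a handful of candidate translations for a single source word. The p(f|e) and p(e) tables are invented toy numbers, not values from any trained model:

```java
import java.util.Map;

public class NoisyChannel {
    public static void main(String[] args) {
        // Toy models for translating the source word f = "Gebäude":
        // tm is p(f|e), lm is p(e). All numbers are invented for illustration.
        Map<String, Double> tm = Map.of("house", 0.20, "building", 0.35, "tower", 0.30);
        Map<String, Double> lm = Map.of("house", 0.05, "building", 0.04, "tower", 0.001);

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String e : tm.keySet()) {
            double score = tm.get(e) * lm.get(e);  // p(f|e) * p(e)
            if (score > bestScore) { bestScore = score; best = e; }
        }
        System.out.println("ebest = " + best);  // argmax over the candidates
    }
}
```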
7
Word-based Translation
8
How to translate a word → lookup in dictionary
Gebäude — building, house, tower.
Multiple translations
some more frequent than others
for instance: house and building most common
9
Look at a parallel corpus
(German text along with English translation)
Translation of Gebäude Count Probability
house 5.28 billion 0.51
building 4.16 billion 0.402
tower 9.28 million 0.09
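A hedged sketch of how lexical translation probabilities like those in the table could be estimated: a maximum-likelihood estimate from counts. The counts are taken from the slide; the slide's probabilities are illustrative rather than exact count/total ratios (rarer translations not listed also receive mass):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LexicalProbabilities {
    public static void main(String[] args) {
        // Counts of each English translation of "Gebäude" in a parallel corpus.
        Map<String, Double> counts = new LinkedHashMap<>();
        counts.put("house", 5.28e9);
        counts.put("building", 4.16e9);
        counts.put("tower", 9.28e6);

        // Maximum-likelihood estimate: p(e | Gebäude) = count(e) / total.
        double total = counts.values().stream().mapToDouble(Double::doubleValue).sum();
        counts.forEach((e, c) ->
                System.out.printf("p(%s | Gebäude) = %.3f%n", e, c / total));
    }
}
```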
10
Alignment
In a parallel text (or when we translate), we align
words in one language with the words in the other
Das Gebäude ist hoch
↓ ↓ ↓ ↓
the building is high
Word positions are numbered 1–4
11
Alignment Function
Define the Alignment with an Alignment Function
Mapping an English target word at position i to a
German source word at position j with a function a :
i → j
Example
a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
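The alignment function is just a mapping from target positions to source positions; a small Java illustration of the example above (the identity alignment for "Das Gebäude ist hoch"):

```java
import java.util.Map;

public class AlignmentFunction {
    public static void main(String[] args) {
        String[] source = {"Das", "Gebäude", "ist", "hoch"};  // German, positions 1..4
        String[] target = {"the", "building", "is", "high"};  // English, positions 1..4

        // a : i -> j maps each target position i to the source position j
        // it was translated from.
        Map<Integer, Integer> a = Map.of(1, 1, 2, 2, 3, 3, 4, 4);

        for (int i = 1; i <= target.length; i++) {
            int j = a.get(i);
            System.out.println(target[i - 1] + "  <-  " + source[j - 1]);
        }
    }
}
```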
12
One-to-Many Translation
A source word could translate into multiple target words
Das ist ein Hochhaus   
↓ ↓ ↓ ↙ ↓ ↘
This is a high    rise building
13
Phrase-based Translation
14
Phrase-Based Model
Berlin ist ein herausragendes Kunst- und Kulturzentrum .
↓ ↓ ↓ ↓ ↓ ↓
Berlin is an outstanding Art and cultural center .
Foreign input is segmented into phrases
Each phrase is translated into English
Phrases are reordered
15
Phrase-Based Model (contd)
Word-Based Models translate words as atomic units
Phrase-Based Models translate phrases as atomic
units
Advantages:
many-to-many translation can handle non-compositional phrases
use of local context in translation
the more data, the longer phrases can be learned
“Standard Model”, used by Google Translate until 2016 (switched to Neural MT)
16
Decoding
17
We have a mathematical model for translation p(e|f)
Task of decoding: find the translation ebest with the highest probability
Two types of error:
the most probable translation is bad → fix the model
search does not find the most probable translation → fix the search
ebest = argmax p(e|f)
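In practice the search is approximate: a decoder keeps only the best few partial hypotheses at each step (beam search), which is exactly where search errors come from when the beam is too small. A toy Java sketch (Java 16+ for the record), with an invented next-word distribution standing in for the real model:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ToyBeamSearch {
    record Hyp(List<String> words, double logProb) {}

    // Invented next-word log-probabilities; a real decoder queries the
    // translation model here, conditioned on the source and the prefix.
    static Map<String, Double> next(List<String> prefix) {
        return Map.of("high", Math.log(0.5), "tall", Math.log(0.3), "</s>", Math.log(0.2));
    }

    public static void main(String[] args) {
        int beamSize = 2;  // small beams are fast but cause search errors
        List<Hyp> beam = List.of(new Hyp(List.of("<s>"), 0.0));

        for (int step = 0; step < 3; step++) {
            List<Hyp> expanded = new ArrayList<>();
            for (Hyp h : beam) {
                next(h.words()).forEach((w, lp) -> {
                    List<String> ext = new ArrayList<>(h.words());
                    ext.add(w);
                    expanded.add(new Hyp(ext, h.logProb() + lp));
                });
            }
            // Keep only the beamSize best partial hypotheses.
            // (A real decoder would also set finished "</s>" hypotheses aside.)
            expanded.sort(Comparator.comparingDouble((Hyp x) -> -x.logProb()));
            beam = expanded.subList(0, Math.min(beamSize, expanded.size()));
        }
        System.out.println("ebest ~ " + beam.get(0).words());
    }
}
```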
18
Neural Machine Translation
19
Generate Translations from Neural Network models trained on Bilingual Corpora.
Translation happens according to a probability distribution, one word at a time (no phrases).
20
NMT is deep learning applied to machine translation.
"Attention Is All You Need" - Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Google Brain https://arxiv.org/abs/1706.03762
21
Why move from SMT to NMT?
Research results were too good to ignore.
The fluency of translations was a huge step forward
compared to statistical systems.
We knew that there would be exciting future work to
be done in this area.
22
Why move from SMT to NMT?
The University of Edinburgh’s Neural MT Systems for WMT17 – Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone and Philip Williams.
23
SMT versus NMT at Scale
Apache Joshua                      | Sockeye
Reasonable quality translations    | High quality translations
Java / C++                         | Python 3 / C++
Model size 60 GB-120 GB            | Model size 256 MB
Complicated training process       | Simple training process
Relatively complex implementation  | ~400 lines of code
Low translation costs              | High translation costs
24
NMT Samples
26
Jetzt LIVE: Abgeordnete debattieren über Zuspitzung des Syrien-Konflikts.
last but not least, Members are debating the escalation of the Syrian conflict.
27
Sie haben wenig Zeit, wollen aber Fett verbrennen und Muskeln aufbauen?
You have little time, but want to burn fat and build muscles?
28
NMT Challenges – Twitter Content
29
NMT Challenges – Input
The input into all neural network models is always a
vector.
Training data is always parallel text.
How do you represent a word from the text as a
vector?
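Conceptually, the answer is an embedding layer: a lookup table from vocabulary ids to learned vectors. A toy Java illustration (the vocabulary, dimensions, and weights below are all invented; real systems learn the weights and use tens of thousands of words and hundreds of dimensions):

```java
import java.util.Arrays;
import java.util.Map;

public class EmbeddingLookup {
    public static void main(String[] args) {
        // Toy vocabulary mapping words to row indices.
        Map<String, Integer> vocab = Map.of("das", 0, "gebäude", 1, "ist", 2, "hoch", 3);

        // Toy 4-dimensional embedding table, one row per vocabulary entry.
        float[][] embeddings = {
            {0.1f, -0.3f, 0.7f, 0.2f},
            {0.5f, 0.1f, -0.2f, 0.9f},
            {-0.4f, 0.8f, 0.3f, -0.1f},
            {0.2f, -0.6f, 0.4f, 0.5f},
        };

        // A word becomes a vector via a simple row lookup on its vocabulary id.
        int id = vocab.get("gebäude");
        System.out.println("gebäude -> id " + id + " -> " + Arrays.toString(embeddings[id]));
    }
}
```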
30
Embedding Layer
31
32
NMT Challenges – Rare Words
OK, we can now represent 30,000 words as vectors; what about the rest?
33
NMT Challenges – Byte Pair Encoding
Rico Sennrich, Barry Haddow and Alexandra Birch (2016): Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
34
Byte Pair Encoding
"positional addition contextual"
35
Byte Pair Encoding
"posiXonal addiXon contextual"
ti = X
36
Byte Pair Encoding
"posiXonY addiXon contextuY"
ti = X
al = Y
37
Byte Pair Encoding
"posiZnY addiZn contextuY"
ti = X
al = Y
Xo = Z
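The merge procedure walked through on these slides fits in a few lines: repeatedly count adjacent symbol pairs and merge the most frequent one. A hedged Java sketch; ties are broken arbitrarily, so the merge order may differ from the slides, and real BPE works on a word-frequency dictionary rather than raw strings:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BytePairEncoding {
    public static void main(String[] args) {
        // Start from single characters, as on the slides.
        List<List<String>> corpus = new ArrayList<>();
        for (String w : new String[]{"positional", "addition", "contextual"}) {
            corpus.add(new ArrayList<>(Arrays.asList(w.split(""))));
        }

        for (int merge = 0; merge < 3; merge++) {
            // Count adjacent symbol pairs across the corpus.
            Map<String, Integer> pairCounts = new HashMap<>();
            for (List<String> word : corpus) {
                for (int i = 0; i + 1 < word.size(); i++) {
                    pairCounts.merge(word.get(i) + "\u0000" + word.get(i + 1), 1, Integer::sum);
                }
            }
            String[] best = Collections.max(pairCounts.entrySet(),
                    Map.Entry.comparingByValue()).getKey().split("\u0000");

            // Replace each occurrence of the most frequent pair with one merged
            // symbol (the slides write these as X, Y, Z; here we concatenate).
            for (List<String> word : corpus) {
                for (int i = 0; i + 1 < word.size(); i++) {
                    if (word.get(i).equals(best[0]) && word.get(i + 1).equals(best[1])) {
                        word.set(i, best[0] + best[1]);
                        word.remove(i + 1);
                    }
                }
            }
            System.out.println("merged '" + best[0] + best[1] + "' -> " + corpus);
        }
    }
}
```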
38
Byte Pair Encoding
these
ing
other
s,
must
Member
39
NMT Challenges – Jagged Tensors
Input is not sorted by length.
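One common workaround, used by Sockeye among other toolkits, is bucketing: group sentences into length buckets and pad within a bucket so each batch is a rectangular tensor. A sketch with invented bucket sizes (Sockeye's defaults differ):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Bucketing {
    public static void main(String[] args) {
        // Sentences arrive with different lengths.
        List<String[]> sentences = List.of(
            "das Gebäude ist hoch".split(" "),
            "Berlin ist ein herausragendes Kunst- und Kulturzentrum .".split(" "),
            "das ist ein Hochhaus".split(" "));

        int[] buckets = {4, 8, 16};  // assumed bucket widths
        Map<Integer, List<String[]>> batches = new TreeMap<>();
        for (String[] s : sentences) {
            for (int b : buckets) {
                if (s.length <= b) {
                    String[] padded = Arrays.copyOf(s, b);        // pad to bucket width
                    Arrays.fill(padded, s.length, b, "<pad>");
                    batches.computeIfAbsent(b, k -> new ArrayList<>()).add(padded);
                    break;
                }
            }
        }
        batches.forEach((b, batch) ->
            System.out.println("bucket " + b + ": " + batch.size() + " sentence(s)"));
    }
}
```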
40
Jagged Tensors cont.
41
Jagged Tensors cont.
42
Jagged Tensors cont.
43
NMT Challenges – Cost
Step 1: Create great profiling tools and measurements.
Step 2: Get specialists to optimize bottlenecks.
Step 3: ???
Step 4: Profit.
New layer norm, top-k, batch-mul, transpose, smoothing op. 3.5x speedup so far. Working in branches:
https://github.com/MXNetEdge/sockeye/tree/dev_speed
https://github.com/MXNetEdge/incubator-mxnet/tree/dev_speed
44
Apache MXNet Profiling Tools
CPU Profiler (VTune) | GPU Profiler (nvprof)
45
TVM
TVM is a tensor intermediate representation (IR) stack for deep learning systems. It is designed to close the gap between productivity-focused deep learning frameworks and performance- and efficiency-focused hardware backends. TVM works with deep learning frameworks to provide end-to-end compilation to different backends.
https://github.com/dmlc/tvm
46
Alibaba TVM Optimization
http://tvmlang.org/2018/03/23/nmt-transformer-optimize.html
47
Alibaba TVM Optimization
48
Facebook - Tensor Comprehensions
https://research.fb.com/announcing-tensor-comprehensions/
49
Streaming Pipelines for NMT
50
NMT Inference Preprocessing
51
Language Detection (Flink + OpenNLP)
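A sketch of what this stage could look like: a Flink RichMapFunction that tags each record with the language OpenNLP predicts. The model file name is an assumption (the standard OpenNLP 1.8.x language-detection model):

```java
import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.io.File;

// Tags each incoming line with its predicted language.
public class LanguageDetectionFn extends RichMapFunction<String, String> {
    private transient LanguageDetectorME detector;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Load the pre-trained OpenNLP language-detection model once per task.
        LanguageDetectorModel model =
            new LanguageDetectorModel(new File("langdetect-183.bin"));
        detector = new LanguageDetectorME(model);
    }

    @Override
    public String map(String text) {
        Language best = detector.predictLanguage(text);
        return best.getLang() + "\t" + text;  // e.g. "deu<TAB>Das Gebäude ist hoch"
    }
}
```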
52
Sentence Detection (Flink + OpenNLP)
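A sketch of the sentence-detection stage: split each document into sentences with OpenNLP, emitting one record per sentence. "de-sent.bin" is the pre-trained OpenNLP German sentence model (an assumed path):

```java
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.io.File;

// Splits each incoming document into individual sentences.
public class SentenceDetectionFn extends RichFlatMapFunction<String, String> {
    private transient SentenceDetectorME detector;

    @Override
    public void open(Configuration parameters) throws Exception {
        detector = new SentenceDetectorME(new SentenceModel(new File("de-sent.bin")));
    }

    @Override
    public void flatMap(String document, Collector<String> out) {
        for (String sentence : detector.sentDetect(document)) {
            out.collect(sentence);
        }
    }
}
```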
53
Tokenization (Flink + OpenNLP)
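A sketch of the tokenization stage: run the OpenNLP tokenizer and rejoin with spaces so the downstream translator sees space-separated tokens. "de-token.bin" is the pre-trained OpenNLP German tokenizer model (an assumed path):

```java
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.io.File;

// Tokenizes each sentence into space-separated tokens.
public class TokenizerFn extends RichMapFunction<String, String> {
    private transient TokenizerME tokenizer;

    @Override
    public void open(Configuration parameters) throws Exception {
        tokenizer = new TokenizerME(new TokenizerModel(new File("de-token.bin")));
    }

    @Override
    public String map(String sentence) {
        return String.join(" ", tokenizer.tokenize(sentence));
    }
}
```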
54
SockeyeTranslate (Flink + Thrift)
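The deck does not show the Thrift interface itself, so the sketch below assumes a hypothetical IDL such as `service SockeyeService { string translate(1: string sentence) }`; the service name, method, host, and port are illustrative, not the actual interface from the talk:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

// Calls out to a Sockeye process over Thrift for each tokenized sentence.
public class SockeyeTranslateFn extends RichMapFunction<String, String> {
    private transient TTransport transport;
    private transient SockeyeService.Client client;  // hypothetical generated client

    @Override
    public void open(Configuration parameters) throws Exception {
        transport = new TSocket("localhost", 9090);  // assumed host and port
        transport.open();
        client = new SockeyeService.Client(new TBinaryProtocol(transport));
    }

    @Override
    public String map(String tokenizedSentence) throws Exception {
        return client.translate(tokenizedSentence);  // hypothetical RPC
    }

    @Override
    public void close() throws Exception {
        if (transport != null) transport.close();
    }
}
```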
55
Complete Pipeline (Flink)
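Wiring the stages above into one Flink job might look like the sketch below. The socket source, host/port, and the German-only filter are assumptions for the demo (the demo model is taken to translate German to English):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TranslationPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> translations = env
            .socketTextStream("localhost", 9999)      // raw text in
            .map(new LanguageDetectionFn())
            .filter(line -> line.startsWith("deu\t")) // keep German only
            .map(line -> line.substring(4))           // drop the language tag
            .flatMap(new SentenceDetectionFn())
            .map(new TokenizerFn())
            .map(new SockeyeTranslateFn());           // English out

        translations.print();
        env.execute("Streaming NMT Pipeline");
    }
}
```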
56
NMT Inference Pipeline
57
Credits
58
Apache OpenNLP Team
59
Apache Flink Team
60
Credits cont.
Asmus Hetzel (Amazon), Marek Kolodziej (NVIDIA),
Dick Carter (NVIDIA), Tianqi Chen (U of W), MKL-DNN
Team (Intel)
Sockeye: Felix Hieber (Amazon), Tobias Domhan
(Amazon), David Vilar (Amazon), Matt Post (Amazon)
Apache Joshua: Matt Post (Johns Hopkins), Tommaso
Teofili (Adobe), NASA JPL
University of Edinburgh, Google, Facebook, NYU,
Stanford
61
Links
Attention is All You Need, Annotated: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Sockeye training tutorial: https://github.com/awslabs/sockeye/tree/master/tutorials
Intro Deep Learning Tutorial: http://gluon.mxnet.io
Slides: https://smarthi.github.io/DSW-Berlin18-StreamNMT/
Code: https://github.com/smarthi/streamingnmt
62
Questions ???Questions ???
63
Sockeye Model Types
RNN Models
Convolutional Models
Transformer Models
64
