Building Streaming Pipelines for Neural Machine Translation
Suneel Marthi
Kellen Sunderland
April 19, 2018
DataWorks Summit, Berlin, Germany
1
$WhoAreWe
Kellen Sunderland
@KellenDB
Member of Apache Software Foundation
Contributor to Apache MXNet (incubating), and committer on Apache Joshua
(incubating)
Suneel Marthi
@suneelmarthi
Member of Apache Software Foundation
Committer and PMC on Apache Mahout, Apache OpenNLP, Apache Streams
2
Agenda
What is Machine Translation?
Why move from SMT to NMT?
NMT Samples
NMT Challenges
Streaming Pipelines for NMT
Demo
3
OSS Tools
Apache Flink - A distributed stream processing engine
written in Java and Scala.
Apache OpenNLP - A machine learning toolkit for
Natural Language Processing, written in Java.
Apache Thrift - A framework for cross-language
services development.
4
OSS Tools (contd)
Apache Joshua (incubating) - A statistical machine
translation decoder for phrase-based, hierarchical,
and syntax-based machine translation, written in
Java.
Apache MXNet (incubating) - A flexible and efficient
library for deep learning.
Sockeye - A sequence-to-sequence framework for
Neural Machine Translation, based on Apache MXNet
(incubating).
5
What is Machine Translation?
6
Statistical Machine Translation
Generate Translations from Statistical Models trained on Bilingual Corpora.
Translation happens according to a probability distribution p(e|f)
E = string in the target language (English)
F = string in the source language (Spanish)
e~ = argmax p(e|f) = argmax p(f|e) * p(e)
e~ = best translation, the one with highest probability
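A minimal Java sketch of the noisy-channel decision rule above, scoring a handful of candidate translations for a single source word. The p(f|e) and p(e) tables are invented toy numbers, not values from any trained model:

```java
import java.util.Map;

public class NoisyChannel {
    public static void main(String[] args) {
        // Toy models for translating the source word f = "Gebäude":
        // tm is p(f|e), lm is p(e). All numbers are invented for illustration.
        Map<String, Double> tm = Map.of("house", 0.20, "building", 0.35, "tower", 0.30);
        Map<String, Double> lm = Map.of("house", 0.05, "building", 0.04, "tower", 0.001);

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String e : tm.keySet()) {
            double score = tm.get(e) * lm.get(e);  // p(f|e) * p(e)
            if (score > bestScore) { bestScore = score; best = e; }
        }
        System.out.println("ebest = " + best);  // argmax over the candidates
    }
}
```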
7
Word-based Translation
8
How to translate a word → lookup in dictionary
Gebäude — building, house, tower.
Multiple translations
some more frequent than others
for instance: house and building most common
9
Look at a parallel corpus
(German text along with English translation)
Translation of Gebäude Count Probability
house 5.28 billion 0.51
building 4.16 billion 0.402
tower 9.28 million 0.09
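A hedged sketch of how lexical translation probabilities like those in the table could be estimated: a maximum-likelihood estimate from counts. The counts are taken from the slide; the slide's probabilities are illustrative rather than exact count/total ratios (rarer translations not listed also receive mass):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LexicalProbabilities {
    public static void main(String[] args) {
        // Counts of each English translation of "Gebäude" in a parallel corpus.
        Map<String, Double> counts = new LinkedHashMap<>();
        counts.put("house", 5.28e9);
        counts.put("building", 4.16e9);
        counts.put("tower", 9.28e6);

        // Maximum-likelihood estimate: p(e | Gebäude) = count(e) / total.
        double total = counts.values().stream().mapToDouble(Double::doubleValue).sum();
        counts.forEach((e, c) ->
                System.out.printf("p(%s | Gebäude) = %.3f%n", e, c / total));
    }
}
```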
10
Alignment
In a parallel text (or when we translate), we align
words in one language with the words in the other
Das Gebäude ist hoch
↓ ↓ ↓ ↓
the building is high
Word positions are numbered 1–4
11
Alignment Function
Define the Alignment with an Alignment Function
Mapping an English target word at position i to a
German source word at position j with a function a :
i → j
Example
a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
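The alignment function is just a mapping from target positions to source positions; a small Java illustration of the example above (the identity alignment for "Das Gebäude ist hoch"):

```java
import java.util.Map;

public class AlignmentFunction {
    public static void main(String[] args) {
        String[] source = {"Das", "Gebäude", "ist", "hoch"};  // German, positions 1..4
        String[] target = {"the", "building", "is", "high"};  // English, positions 1..4

        // a : i -> j maps each target position i to the source position j
        // it was translated from.
        Map<Integer, Integer> a = Map.of(1, 1, 2, 2, 3, 3, 4, 4);

        for (int i = 1; i <= target.length; i++) {
            int j = a.get(i);
            System.out.println(target[i - 1] + "  <-  " + source[j - 1]);
        }
    }
}
```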
12
One-to-Many Translation
A source word could translate into multiple target words
Das ist ein Hochhaus   
↓ ↓ ↓ ↙ ↓ ↘
This is a high    rise building
13
Phrase-based Translation
14
Phrase-Based Model
Berlin ist ein herausragendes Kunst- und Kulturzentrum .
↓ ↓ ↓ ↓ ↓ ↓
Berlin is an outstanding Art and cultural center .
Foreign input is segmented into phrases
Each phrase is translated into English
Phrases are reordered
15
Phrase-Based Model (contd)
Word-Based Models translate words as atomic units
Phrase-Based Models translate phrases as atomic
units
Advantages:
many-to-many translation can handle non-compositional phrases
use of local context in translation
the more data, the longer phrases can be learned
“Standard Model”, used by Google Translate until 2016 (switched to Neural MT)
16
Decoding
17
We have a mathematical model for translation p(e|f)
Task of decoding: find the translation ebest with the highest probability
Two types of error:
the most probable translation is bad → fix the model
search does not find the most probable translation → fix the search
ebest = argmax p(e|f)
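In practice the search is approximate: a decoder keeps only the best few partial hypotheses at each step (beam search), which is exactly where search errors come from when the beam is too small. A toy Java sketch (Java 16+ for the record), with an invented next-word distribution standing in for the real model:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ToyBeamSearch {
    record Hyp(List<String> words, double logProb) {}

    // Invented next-word log-probabilities; a real decoder queries the
    // translation model here, conditioned on the source and the prefix.
    static Map<String, Double> next(List<String> prefix) {
        return Map.of("high", Math.log(0.5), "tall", Math.log(0.3), "</s>", Math.log(0.2));
    }

    public static void main(String[] args) {
        int beamSize = 2;  // small beams are fast but cause search errors
        List<Hyp> beam = List.of(new Hyp(List.of("<s>"), 0.0));

        for (int step = 0; step < 3; step++) {
            List<Hyp> expanded = new ArrayList<>();
            for (Hyp h : beam) {
                next(h.words()).forEach((w, lp) -> {
                    List<String> ext = new ArrayList<>(h.words());
                    ext.add(w);
                    expanded.add(new Hyp(ext, h.logProb() + lp));
                });
            }
            // Keep only the beamSize best partial hypotheses.
            // (A real decoder would also set finished "</s>" hypotheses aside.)
            expanded.sort(Comparator.comparingDouble((Hyp x) -> -x.logProb()));
            beam = expanded.subList(0, Math.min(beamSize, expanded.size()));
        }
        System.out.println("ebest ~ " + beam.get(0).words());
    }
}
```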
18
Neural Machine Translation
19
Generate Translations from Neural Network models trained on Bilingual Corpora.
Translation happens according to a probability distribution, one word at a time (no phrases).
20
NMT is deep learning applied to machine translation.
"Attention Is All You Need" - Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Google Brain https://arxiv.org/abs/1706.03762
21
Why move from SMT to NMT?
Research results were too good to ignore.
The fluency of translations was a huge step forward
compared to statistical systems.
We knew that there would be exciting future work to
be done in this area.
22
Why move from SMT to NMT?
The University of Edinburgh’s Neural MT Systems for WMT17 – Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone and Philip Williams.
23
SMT versus NMT at Scale
Apache Joshua                      | Sockeye
Reasonable quality translations    | High quality translations
Java / C++                         | Python 3 / C++
Model size 60 GB-120 GB            | Model size 256 MB
Complicated training process       | Simple training process
Relatively complex implementation  | ~400 lines of code
Low translation costs              | High translation costs
24
NMT Samples
26
Jetzt LIVE: Abgeordnete debattieren über Zuspitzung des Syrien-Konflikts.
last but not least, Members are debating the escalation of the Syrian conflict.
27
Sie haben wenig Zeit, wollen aber Fett verbrennen und Muskeln aufbauen?
You have little time, but want to burn fat and build muscles?
28
NMT Challenges – Twitter Content
29
NMT Challenges – Input
The input into all neural network models is always a
vector.
Training data is always parallel text.
How do you represent a word from the text as a
vector?
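Conceptually, the answer is an embedding layer: a lookup table from vocabulary ids to learned vectors. A toy Java illustration (the vocabulary, dimensions, and weights below are all invented; real systems learn the weights and use tens of thousands of words and hundreds of dimensions):

```java
import java.util.Arrays;
import java.util.Map;

public class EmbeddingLookup {
    public static void main(String[] args) {
        // Toy vocabulary mapping words to row indices.
        Map<String, Integer> vocab = Map.of("das", 0, "gebäude", 1, "ist", 2, "hoch", 3);

        // Toy 4-dimensional embedding table, one row per vocabulary entry.
        float[][] embeddings = {
            {0.1f, -0.3f, 0.7f, 0.2f},
            {0.5f, 0.1f, -0.2f, 0.9f},
            {-0.4f, 0.8f, 0.3f, -0.1f},
            {0.2f, -0.6f, 0.4f, 0.5f},
        };

        // A word becomes a vector via a simple row lookup on its vocabulary id.
        int id = vocab.get("gebäude");
        System.out.println("gebäude -> id " + id + " -> " + Arrays.toString(embeddings[id]));
    }
}
```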
30
Embedding Layer
31
32
NMT Challenges – Rare Words
OK, we can now represent 30,000 words as vectors; what about the rest?
33
NMT Challenges – Byte Pair Encoding
Rico Sennrich, Barry Haddow and Alexandra Birch (2016): Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
34
Byte Pair Encoding
"positional addition contextual"
35
Byte Pair Encoding
"posiXonal addiXon contextual"
ti = X
36
Byte Pair Encoding
"posiXonY addiXon contextuY"
ti = X
al = Y
37
Byte Pair Encoding
"posiZnY addiZn contextuY"
ti = X
al = Y
Xo = Z
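The merge procedure walked through on these slides fits in a few lines: repeatedly count adjacent symbol pairs and merge the most frequent one. A hedged Java sketch; ties are broken arbitrarily, so the merge order may differ from the slides, and real BPE works on a word-frequency dictionary rather than raw strings:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BytePairEncoding {
    public static void main(String[] args) {
        // Start from single characters, as on the slides.
        List<List<String>> corpus = new ArrayList<>();
        for (String w : new String[]{"positional", "addition", "contextual"}) {
            corpus.add(new ArrayList<>(Arrays.asList(w.split(""))));
        }

        for (int merge = 0; merge < 3; merge++) {
            // Count adjacent symbol pairs across the corpus.
            Map<String, Integer> pairCounts = new HashMap<>();
            for (List<String> word : corpus) {
                for (int i = 0; i + 1 < word.size(); i++) {
                    pairCounts.merge(word.get(i) + "\u0000" + word.get(i + 1), 1, Integer::sum);
                }
            }
            String[] best = Collections.max(pairCounts.entrySet(),
                    Map.Entry.comparingByValue()).getKey().split("\u0000");

            // Replace each occurrence of the most frequent pair with one merged
            // symbol (the slides write these as X, Y, Z; here we concatenate).
            for (List<String> word : corpus) {
                for (int i = 0; i + 1 < word.size(); i++) {
                    if (word.get(i).equals(best[0]) && word.get(i + 1).equals(best[1])) {
                        word.set(i, best[0] + best[1]);
                        word.remove(i + 1);
                    }
                }
            }
            System.out.println("merged '" + best[0] + best[1] + "' -> " + corpus);
        }
    }
}
```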
38
Byte Pair Encoding
these
ing
other
s,
must
Member
39
NMT Challenges – Jagged Tensors
Input is not sorted by length.
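One common workaround, used by Sockeye among other toolkits, is bucketing: group sentences into length buckets and pad within a bucket so each batch is a rectangular tensor. A sketch with invented bucket sizes (Sockeye's defaults differ):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Bucketing {
    public static void main(String[] args) {
        // Sentences arrive with different lengths.
        List<String[]> sentences = List.of(
            "das Gebäude ist hoch".split(" "),
            "Berlin ist ein herausragendes Kunst- und Kulturzentrum .".split(" "),
            "das ist ein Hochhaus".split(" "));

        int[] buckets = {4, 8, 16};  // assumed bucket widths
        Map<Integer, List<String[]>> batches = new TreeMap<>();
        for (String[] s : sentences) {
            for (int b : buckets) {
                if (s.length <= b) {
                    String[] padded = Arrays.copyOf(s, b);        // pad to bucket width
                    Arrays.fill(padded, s.length, b, "<pad>");
                    batches.computeIfAbsent(b, k -> new ArrayList<>()).add(padded);
                    break;
                }
            }
        }
        batches.forEach((b, batch) ->
            System.out.println("bucket " + b + ": " + batch.size() + " sentence(s)"));
    }
}
```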
40
Jagged Tensors cont.
41
Jagged Tensors cont.
42
Jagged Tensors cont.
43
NMT Challenges – Cost
Step 1: Create great profiling tools and measurements.
Step 2: Get specialists to optimize bottlenecks.
Step 3: ???
Step 4: Profit.
New layer norm, top-k, batch-mul, transpose, smoothing op. 3.5x speedup so far. Working in branches:
https://github.com/MXNetEdge/sockeye/tree/dev_speed
https://github.com/MXNetEdge/incubator-mxnet/tree/dev_speed
44
Apache MXNet Profiling Tools
CPU Profiler (VTune) | GPU Profiler (nvprof)
45
TVM
TVM is a tensor intermediate representation (IR) stack for deep learning systems. It is designed to close the gap between productivity-focused deep learning frameworks and performance- and efficiency-focused hardware backends. TVM works with deep learning frameworks to provide end-to-end compilation to different backends.
https://github.com/dmlc/tvm
46
Alibaba TVM Optimization
http://tvmlang.org/2018/03/23/nmt-transformer-optimize.html
47
Alibaba TVM Optimization
48
Facebook - Tensor Comprehensions
https://research.fb.com/announcing-tensor-comprehensions/
49
Streaming Pipelines for NMT
50
NMT Inference Preprocessing
51
Language Detection (Flink + OpenNLP)
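A sketch of what this stage could look like: a Flink RichMapFunction that tags each record with the language OpenNLP predicts. The model file name is an assumption (the standard OpenNLP 1.8.x language-detection model):

```java
import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.io.File;

// Tags each incoming line with its predicted language.
public class LanguageDetectionFn extends RichMapFunction<String, String> {
    private transient LanguageDetectorME detector;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Load the pre-trained OpenNLP language-detection model once per task.
        LanguageDetectorModel model =
            new LanguageDetectorModel(new File("langdetect-183.bin"));
        detector = new LanguageDetectorME(model);
    }

    @Override
    public String map(String text) {
        Language best = detector.predictLanguage(text);
        return best.getLang() + "\t" + text;  // e.g. "deu<TAB>Das Gebäude ist hoch"
    }
}
```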
52
Sentence Detection (Flink + OpenNLP)
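A sketch of the sentence-detection stage: split each document into sentences with OpenNLP, emitting one record per sentence. "de-sent.bin" is the pre-trained OpenNLP German sentence model (an assumed path):

```java
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.io.File;

// Splits each incoming document into individual sentences.
public class SentenceDetectionFn extends RichFlatMapFunction<String, String> {
    private transient SentenceDetectorME detector;

    @Override
    public void open(Configuration parameters) throws Exception {
        detector = new SentenceDetectorME(new SentenceModel(new File("de-sent.bin")));
    }

    @Override
    public void flatMap(String document, Collector<String> out) {
        for (String sentence : detector.sentDetect(document)) {
            out.collect(sentence);
        }
    }
}
```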
53
Tokenization (Flink + OpenNLP)
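A sketch of the tokenization stage: run the OpenNLP tokenizer and rejoin with spaces so the downstream translator sees space-separated tokens. "de-token.bin" is the pre-trained OpenNLP German tokenizer model (an assumed path):

```java
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.io.File;

// Tokenizes each sentence into space-separated tokens.
public class TokenizerFn extends RichMapFunction<String, String> {
    private transient TokenizerME tokenizer;

    @Override
    public void open(Configuration parameters) throws Exception {
        tokenizer = new TokenizerME(new TokenizerModel(new File("de-token.bin")));
    }

    @Override
    public String map(String sentence) {
        return String.join(" ", tokenizer.tokenize(sentence));
    }
}
```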
54
SockeyeTranslate (Flink + Thrift)
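The deck does not show the Thrift interface itself, so the sketch below assumes a hypothetical IDL such as `service SockeyeService { string translate(1: string sentence) }`; the service name, method, host, and port are illustrative, not the actual interface from the talk:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

// Calls out to a Sockeye process over Thrift for each tokenized sentence.
public class SockeyeTranslateFn extends RichMapFunction<String, String> {
    private transient TTransport transport;
    private transient SockeyeService.Client client;  // hypothetical generated client

    @Override
    public void open(Configuration parameters) throws Exception {
        transport = new TSocket("localhost", 9090);  // assumed host and port
        transport.open();
        client = new SockeyeService.Client(new TBinaryProtocol(transport));
    }

    @Override
    public String map(String tokenizedSentence) throws Exception {
        return client.translate(tokenizedSentence);  // hypothetical RPC
    }

    @Override
    public void close() throws Exception {
        if (transport != null) transport.close();
    }
}
```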
55
Complete Pipeline (Flink)
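Wiring the stages above into one Flink job might look like the sketch below. The socket source, host/port, and the German-only filter are assumptions for the demo (the demo model is taken to translate German to English):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TranslationPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> translations = env
            .socketTextStream("localhost", 9999)      // raw text in
            .map(new LanguageDetectionFn())
            .filter(line -> line.startsWith("deu\t")) // keep German only
            .map(line -> line.substring(4))           // drop the language tag
            .flatMap(new SentenceDetectionFn())
            .map(new TokenizerFn())
            .map(new SockeyeTranslateFn());           // English out

        translations.print();
        env.execute("Streaming NMT Pipeline");
    }
}
```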
56
NMT Inference Pipeline
57
Credits
58
Apache OpenNLP Team
59
Apache Flink Team
60
Credits cont.
Asmus Hetzel (Amazon), Marek Kolodziej (NVIDIA),
Dick Carter (NVIDIA), Tianqi Chen (U of W), MKL-DNN
Team (Intel)
Sockeye: Felix Hieber (Amazon), Tobias Domhan
(Amazon), David Vilar (Amazon), Matt Post (Amazon)
Apache Joshua: Matt Post (Johns Hopkins), Tommaso
Teofili (Adobe), NASA JPL
University of Edinburgh, Google, Facebook, NYU,
Stanford
61
Links
Attention is All You Need, Annotated: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Sockeye training tutorial: https://github.com/awslabs/sockeye/tree/master/tutorials
Intro Deep Learning Tutorial: http://gluon.mxnet.io
Slides: https://smarthi.github.io/DSW-Berlin18-StreamNMT/
Code: https://github.com/smarthi/streamingnmt
62
Questions ???Questions ???
63
Sockeye Model Types
RNN Models
Convolutional Models
Transformer Models
64
