Convolutional Sequence to Sequence Learning
Denis Yarats
with Jonas Gehring, Michael Auli, David Grangier, Yann Dauphin
Paris
Montreal
New York City
Menlo Park
Facebook AI Research
Overview
1. What kinds of problems do we want to solve?
2. Sequence to Sequence Learning
3. Convolutional Sequence to Sequence Learning
4. Experiments and Results
Problems we want to solve
• Machine Translation
• Dialogue
• Text Summarization
• Text Generation
• etc.
Machine Translation
• The classic test of language understanding
• Language analysis and generation
• MT is important for humanity and commerce:
• Facebook: 2 billion translations in 45 languages daily
• Google: 100 billion words being translated a day
• eBay: uses MT to enable cross-border trade
Stats from: https://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf
Machine Translation
Dialogue
Text Summarization
Jenson Button was denied his 100th race for McLaren after an ERS prevented
him from making it to the startline. It capped a miserable weekend for the Briton; his
time in Bahrain plagued by reliability issues. Button spent much of the race on Twitter
delivering his verdict as the action unfolded. ’Kimi is the man to watch,’ and ’loving the
sparks’, were among his pearls of wisdom, but the tweet which courted the most
attention was a rather mischievous one: ’Ooh is Lewis backing his team mate into
Vettel?’ he quizzed after Rosberg accused Hamilton of pulling off such a manoeuvre in
China. Jenson Button waves to the crowd ahead of the Bahrain Grand Prix which he
failed to start Perhaps a career in the media beckons... Lewis Hamilton has out-qualified
and finished ahead of Nico Rosberg at every race this season. Indeed Rosberg has

Generated summary: Button was denied his 100th race for McLaren. The ERS prevented him from
making it to the start-line. Button was his team mate in the 11 races in Bahrain.
Paulus et al. 2017, A Deep Reinforced Model for Abstractive Summarization
Text Generation
Prompt: Aliens start abducting
It has been two weeks, and the last of my kind has gone. It is
only a matter of time until there will be nothing left. I’m not sure what
the hell is going on... I can’t think. I can hear a distant scream. I think
of a strange, alien sound. I try to speak, but am interrupted by
something, something that feels like a drum, I ca not tell. I mean I’m
just a little bit older than an average human. But I can, and I can feel
the vibrations . I hear the sound of a ship approaching. The ground
quakes at the force of the impact, and a loud explosion shatters the
Fan et al. 2018, Hierarchical Neural Story Generation
Sequence to Sequence
• What do all of these problems have in common?
Figures from: https://github.com/farizrahman4u/seq2seq
An input sequence [x0, x1, x2, ..., xN] is mapped to an output sequence [y0, y1, y2, ..., yM].
Recurrent Neural Network
• Like a feed-forward network, except it allows self-connections
• Self-connections are used to build an internal representation of past inputs
• They give the network memory
Figures from: http://colah.github.io/posts/2015-08-Understanding-LSTMs
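As a rough illustration of this recurrence (a minimal sketch in PyTorch; all names and sizes here are made up, this is not the talk's code):

```python
import torch

d_in, d_hid = 8, 16
W_xh = torch.randn(d_in, d_hid) * 0.1
W_hh = torch.randn(d_hid, d_hid) * 0.1   # the self-connection weights
b_h = torch.zeros(d_hid)

def rnn_step(x_t, h_prev):
    # The self-connection (h_prev @ W_hh) carries an internal
    # representation of all past inputs forward in time.
    return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

h = torch.zeros(d_hid)
for x_t in torch.randn(5, d_in):   # process the sequence left to right
    h = rnn_step(x_t, h)
```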
Long Short Term Memory (LSTM)
• A modification of the RNN to have longer memory
• An additional memory cell stores information
• The RNN overwrites its hidden state; the LSTM adds to it
• Hochreiter & Schmidhuber, 1997
Figures from: http://colah.github.io/posts/2015-08-Understanding-LSTMs
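A minimal sketch of one LSTM step, highlighting the additive cell update (illustrative weights and sizes, not the talk's code):

```python
import torch

d_in, d_hid = 8, 16
W = torch.randn(d_in + d_hid, 4 * d_hid) * 0.1   # maps [x_t; h_prev] to the 4 gates
b = torch.zeros(4 * d_hid)

def lstm_step(x_t, h_prev, c_prev):
    i, f, o, g = (torch.cat([x_t, h_prev]) @ W + b).chunk(4)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    # Additive memory: keep part of the old cell (f * c_prev) and add new
    # content (i * g), instead of overwriting the state like a vanilla RNN.
    c_t = f * c_prev + i * g
    h_t = o * torch.tanh(c_t)
    return h_t, c_t

h, c = torch.zeros(d_hid), torch.zeros(d_hid)
for x_t in torch.randn(5, d_in):
    h, c = lstm_step(x_t, h, c)
```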
Sequence to Sequence
• Make a neural network read one sequence and produce another
• Use two LSTMs
• Sutskever et al. 2014
Figures from: http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ilya_LSTMs_for_Translation.pdf
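A minimal two-LSTM encoder-decoder sketch in PyTorch (vocabulary sizes, dimensions and module names are illustrative assumptions, not the models benchmarked in the talk):

```python
import torch
import torch.nn as nn

V_src, V_tgt, d = 1000, 1000, 256
src_emb, tgt_emb = nn.Embedding(V_src, d), nn.Embedding(V_tgt, d)
encoder = nn.LSTM(d, d, batch_first=True)
decoder = nn.LSTM(d, d, batch_first=True)
proj = nn.Linear(d, V_tgt)

src = torch.randint(0, V_src, (2, 7))   # (batch, source length)
tgt = torch.randint(0, V_tgt, (2, 5))   # (batch, target length)

# The encoder LSTM compresses the whole source into its final (h, c) state...
_, state = encoder(src_emb(src))
# ...which initializes the decoder LSTM; with teacher forcing the decoder
# reads the target tokens and predicts the next token at every position.
dec_out, _ = decoder(tgt_emb(tgt), state)
logits = proj(dec_out)                   # (batch, target length, V_tgt)
```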
Attention
• Incorporate all hidden states of the encoder, rather than only the last one
• Assign a separate weight to each state and compute a weighted sum
• Bahdanau et al. 2014
Figures from: https://distill.pub/2016/augmented-rnns
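The weighted sum is a few lines of code. Below is a sketch with dot-product scoring for brevity (Bahdanau et al. use a small MLP scorer); the tensors are placeholders:

```python
import torch
import torch.nn.functional as F

enc_states = torch.randn(7, 256)     # one hidden state per source word
dec_state = torch.randn(256)         # current decoder state (the "query")

scores = enc_states @ dec_state      # one score per encoder state
weights = F.softmax(scores, dim=0)   # a separate weight for each state
context = weights @ enc_states       # weighted sum: the context vector
```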
Attention
• Example of attention for machine translation
Figures from: https://distill.pub/2016/augmented-rnns
LSTM based Seq2Seq: Problems
• Still challenging to model long-term dependencies
• RNNs are hard to parallelize because of their sequential, non-homogeneous computation
• Convolutional neural networks to the rescue?
Convolutional Neural Network
• Each output neuron is a linear combination of the input neurons in some spatial neighborhood
• Unlike a feed-forward network, where each output is connected to every input neuron
• Parameter sharing and better scalability
Figures from: http://cs231n.github.io/convolutional-networks
(Figure: fully connected layer vs. convolution)
Convolutional Neural Network
• Hierarchical processing: bottom-up vs. left-right
• Homogeneous: all elements processed in the same way
• Scalable computation: parallelizable, suited for GPUs
(Figure: hierarchical, bottom-up processing of "the tall building … ." vs. left-to-right processing; a code sketch with stacked convolutions follows below.)
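A rough sketch of this hierarchical processing with stacked 1-D convolutions (layer names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

d = 256
conv1 = nn.Conv1d(d, d, kernel_size=3, padding=1)
conv2 = nn.Conv1d(d, d, kernel_size=3, padding=1)

x = torch.randn(1, d, 10)   # (batch, channels, sequence length), e.g. 10 words
h1 = torch.relu(conv1(x))   # each position sees a window of 3 words
h2 = torch.relu(conv2(h1))  # after two layers, each position sees 5 words
# Every position in a layer is computed the same way and all at once,
# so the stack parallelizes well on GPUs, unlike a left-to-right recurrence.
```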
Contributions
• Show that non-RNN models can outperform very well-engineered RNNs on large translation benchmarks
• Multi-hop attention (a rough sketch follows after this list)
• Approach improves inference speed (>9x faster than RNN)
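A rough, single-query sketch of the multi-hop idea: every decoder layer computes its own attention over the encoder states instead of attending once. The tensors and the way the context is folded back in are simplifications; the real model also mixes in target and source embeddings.

```python
import torch
import torch.nn.functional as F

enc = torch.randn(7, 512)                 # encoder states (source length, d)
h = torch.randn(512)                      # decoder state at the current step

for layer in range(3):                    # one attention "hop" per decoder layer
    weights = F.softmax(enc @ h, dim=0)   # this layer's attention weights
    context = weights @ enc               # this layer's context vector
    h = h + context                       # feed the context into the next layer
```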
(Slides 20-27, animation: the encoder reads the source sentence "la maison de Léa . <end>"; at each output step the decoder attends over the encoder states and emits the next target token, building up "Léa 's house ." starting from "<start>".)
ConvS2S: Full Model

…Non-linearities allow the networks to exploit the full input field, or to focus on fewer elements if needed. Each convolution kernel is parameterized as W ∈ ℝ^(2d×kd), b_w ∈ ℝ^(2d) and takes as input X ∈ ℝ^(k×d), which is a concatenation of k input elements embedded in d dimensions, and maps them to a single output element Y ∈ ℝ^(2d) that has twice the dimensionality of the input elements; subsequent layers operate over the k output elements of the previous layer. We choose gated linear units (GLU; Dauphin et al., 2016) as the non-linearity, which implement a simple gating mechanism over the output of the convolution Y = [A B] ∈ ℝ^(2d):

v([A B]) = A ⊗ σ(B)

where A, B ∈ ℝ^d are the inputs to the non-linearity, ⊗ is the point-wise multiplication and the output v([A B]) ∈ ℝ^d is half the size of Y. The gates σ(B) control which inputs A of the current context are relevant. A similar non-linearity was introduced in Oord et al. (2016b), who apply tanh to A, but Dauphin et al. (2016) show that GLUs perform better in the context of language modelling.

To enable deep convolutional networks, we add residual connections from the input of each convolution to the output of the block (He et al., 2015a).
• High training efficiency due to parallel computation in the decoder
• Cross-entropy objective
• Very similar to ResNet models (Nesterov momentum, He et al. '15)
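The convolution-plus-GLU block described above, with its residual connection, can be sketched in a few lines of PyTorch. This is a minimal sketch under assumed sizes, not the fairseq implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, k = 512, 3
conv = nn.Conv1d(d, 2 * d, kernel_size=k, padding=k // 2)

def conv_glu_block(x):                 # x: (batch, d, sequence length)
    y = conv(x)                        # (batch, 2d, length): Y = [A B]
    h = F.glu(y, dim=1)                # A * sigmoid(B), back to d channels
    # (the real decoder pads only on the left so a position cannot see future tokens)
    return h + x                       # residual connection around the block

out = conv_glu_block(torch.randn(2, d, 10))
```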
ConvS2S: WMT'14 English-German

Model                                   | Vocabulary  | BLEU
CNN ByteNet (Kalchbrenner et al., 2016) | Characters  | 23.75
RNN GNMT (Wu et al., 2016)              | Word 80k    | 23.12
RNN GNMT (Wu et al., 2016)              | Word pieces | 24.61
ConvS2S                                 | BPE 40k     | 25.16
Transformer (Vaswani et al., 2017)      | Word pieces | 28.4

ConvS2S: 15 layers in encoder/decoder (10x512 units, 3x768 units, 2x2048)
Maximum context size: 27 words
More work on non-RNN models!
ConvS2S: WMT'14 English-French

Model                              | Vocabulary  | BLEU
RNN GNMT (Wu et al., 2016)         | Word 80k    | 37.90
RNN GNMT (Wu et al., 2016)         | Word pieces | 38.95
RNN GNMT + RL (Wu et al., 2016)    | Word pieces | 39.92
ConvS2S                            | BPE 40k     | 40.51
Transformer (Vaswani et al., 2017) | Word pieces | 41.0

ConvS2S: 15 layers in encoder/decoder (5x512 units, 4x768 units, 3x2048, 2x4096)
ConvS2S: Inference Speed
Model                      | Hardware         | BLEU  | Time (s)
RNN GNMT (Wu et al., 2016) | GPU (K80)        | 31.20 | 3028
RNN GNMT (Wu et al., 2016) | CPU (88 cores)   | 31.20 | 1322
RNN GNMT (Wu et al., 2016) | TPU              | 31.21 | 384
ConvS2S, beam=5            | GPU (K40)        | 34.10 | 587
ConvS2S, beam=1            | GPU (K40)        | 33.45 | 327
ConvS2S, beam=1            | GPU (GTX-1080ti) | 33.45 | 142
ConvS2S, beam=1            | CPU (48 cores)   | 33.45 | 142

Test set: ntst1213 (6003 sentences)
ConvS2S: Training
Dataset     | # Pairs (millions) | Hardware      | Days
WMT14 EN-DE | 4.5                | 1 x GPU (K40) | 18
WMT14 EN-FR | 36                 | 8 x GPU (K40) | 30
ConvS2S: Conclusion
• Alternative architecture for sequence to sequence learning
• Higher accuracy than models of similar size, despite a fixed-size context
• Faster generation (9x faster on lesser hardware)
Blogpost
Research Paper
Source Code
Denis Yarats, ITEM 2018