Convolutional Sequence to Sequence Learning
Denis Yarats
with Jonas Gehring, Michael Auli, David Grangier, Yann Dauphin
Paris
Montreal
New York City
Menlo Park
Facebook AI Research
Overview
1. What kinds of problems do we want to solve?
2. Sequence to Sequence Learning
3. Convolutional Sequence to Sequence Learning
4. Experiments and Results
Problems we want to solve
• Machine Translation
• Dialogue
• Text Summarization
• Text Generation
• etc.
Machine Translation
• The classic test of language understanding
• Language analysis and generation
• MT is important for humanity and commerce:
• Facebook: 2 billion translations in 45 languages daily
• Google: 100 billion words being translated a day
• eBay: uses MT to enable cross-border trade
Stats from: https://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf
Machine Translation
Dialogue
Text Summarization
Jenson Button was denied his 100th race for McLaren after an ERS prevented
him from making it to the startline. It capped a miserable weekend for the Briton; his
time in Bahrain plagued by reliability issues. Button spent much of the race on Twitter
delivering his verdict as the action unfolded. ’Kimi is the man to watch,’ and ’loving the
sparks’, were among his pearls of wisdom, but the tweet which courted the most
attention was a rather mischievous one: ’Ooh is Lewis backing his team mate into
Vettel?’ he quizzed after Rosberg accused Hamilton of pulling off such a manoeuvre in
China. Jenson Button waves to the crowd ahead of the Bahrain Grand Prix which he
failed to start Perhaps a career in the media beckons... Lewis Hamilton has out-qualified
and finished ahead of Nico Rosberg at every race this season. Indeed Rosberg has

Generated summary: Button was denied his 100th race for McLaren. The ERS prevented him from
making it to the start-line. Button was his team mate in the 11 races in Bahrain.
Paulus et al. 2017, A Deep Reinforced Model for Abstractive Summarization
Text Generation
Prompt: Aliens start abducting
It has been two weeks, and the last of my kind has gone. It is
only a matter of time until there will be nothing left. I’m not sure what
the hell is going on... I can’t think. I can hear a distant scream. I think
of a strange, alien sound. I try to speak, but am interrupted by
something, something that feels like a drum, I ca not tell. I mean I’m
just a little bit older than an average human. But I can, and I can feel
the vibrations . I hear the sound of a ship approaching. The ground
quakes at the force of the impact, and a loud explosion shatters the
Fan et al. 2018, Hierarchical Neural Story Generation
Sequence to Sequence
• What do all of these problems have in common?
Figures from: https://github.com/farizrahman4u/seq2seq
An input sequence [x0, x1, x2, ..., xN] is mapped to an output sequence [y0, y1, y2, ..., yM].
Recurrent Neural Network
• Like a feed-forward network, except it allows self-connections
• Self-connections are used to build an internal representation of past inputs
• They give the network memory
Figures from: http://colah.github.io/posts/2015-08-Understanding-LSTMs
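As a rough illustration of this recurrence (a minimal sketch in PyTorch; all names and sizes here are made up, this is not the talk's code):

```python
import torch

d_in, d_hid = 8, 16
W_xh = torch.randn(d_in, d_hid) * 0.1
W_hh = torch.randn(d_hid, d_hid) * 0.1   # the self-connection weights
b_h = torch.zeros(d_hid)

def rnn_step(x_t, h_prev):
    # The self-connection (h_prev @ W_hh) carries an internal
    # representation of all past inputs forward in time.
    return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

h = torch.zeros(d_hid)
for x_t in torch.randn(5, d_in):   # process the sequence left to right
    h = rnn_step(x_t, h)
```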
Long Short Term Memory (LSTM)
• A modification of the RNN to have longer memory
• An additional memory cell stores information
• The RNN overwrites its hidden state; the LSTM adds to it
• Hochreiter & Schmidhuber, 1997
Figures from: http://colah.github.io/posts/2015-08-Understanding-LSTMs
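A minimal sketch of one LSTM step, highlighting the additive cell update (illustrative weights and sizes, not the talk's code):

```python
import torch

d_in, d_hid = 8, 16
W = torch.randn(d_in + d_hid, 4 * d_hid) * 0.1   # maps [x_t; h_prev] to the 4 gates
b = torch.zeros(4 * d_hid)

def lstm_step(x_t, h_prev, c_prev):
    i, f, o, g = (torch.cat([x_t, h_prev]) @ W + b).chunk(4)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    # Additive memory: keep part of the old cell (f * c_prev) and add new
    # content (i * g), instead of overwriting the state like a vanilla RNN.
    c_t = f * c_prev + i * g
    h_t = o * torch.tanh(c_t)
    return h_t, c_t

h, c = torch.zeros(d_hid), torch.zeros(d_hid)
for x_t in torch.randn(5, d_in):
    h, c = lstm_step(x_t, h, c)
```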
Sequence to Sequence
• Make a neural network read one sequence and produce another
• Use two LSTMs
• Sutskever et al. 2014
Figures from: http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ilya_LSTMs_for_Translation.pdf
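A minimal two-LSTM encoder-decoder sketch in PyTorch (vocabulary sizes, dimensions and module names are illustrative assumptions, not the models benchmarked in the talk):

```python
import torch
import torch.nn as nn

V_src, V_tgt, d = 1000, 1000, 256
src_emb, tgt_emb = nn.Embedding(V_src, d), nn.Embedding(V_tgt, d)
encoder = nn.LSTM(d, d, batch_first=True)
decoder = nn.LSTM(d, d, batch_first=True)
proj = nn.Linear(d, V_tgt)

src = torch.randint(0, V_src, (2, 7))   # (batch, source length)
tgt = torch.randint(0, V_tgt, (2, 5))   # (batch, target length)

# The encoder LSTM compresses the whole source into its final (h, c) state...
_, state = encoder(src_emb(src))
# ...which initializes the decoder LSTM; with teacher forcing the decoder
# reads the target tokens and predicts the next token at every position.
dec_out, _ = decoder(tgt_emb(tgt), state)
logits = proj(dec_out)                   # (batch, target length, V_tgt)
```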
Attention
• Incorporate all hidden states of the encoder, rather than only the last one
• Assign a separate weight to each state and compute a weighted sum
• Bahdanau et al. 2014
Figures from: https://distill.pub/2016/augmented-rnns
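The weighted sum is a few lines of code. Below is a sketch with dot-product scoring for brevity (Bahdanau et al. use a small MLP scorer); the tensors are placeholders:

```python
import torch
import torch.nn.functional as F

enc_states = torch.randn(7, 256)     # one hidden state per source word
dec_state = torch.randn(256)         # current decoder state (the "query")

scores = enc_states @ dec_state      # one score per encoder state
weights = F.softmax(scores, dim=0)   # a separate weight for each state
context = weights @ enc_states       # weighted sum: the context vector
```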
Attention
• Example of attention for machine translation
Figures from: https://distill.pub/2016/augmented-rnns
LSTM based Seq2Seq: Problems
• Still challenging to model long-term dependencies
• RNNs are hard to parallelize because of their sequential, non-homogeneous computation
• Convolutional neural networks to the rescue?
Convolutional Neural Network
• Each output neuron is a linear combination of the input neurons in some spatial neighborhood
• Unlike a feed-forward network, where each output is connected to every input neuron
• Parameter sharing and better scalability
Figures from: http://cs231n.github.io/convolutional-networks
(Figure: fully connected layer vs. convolution)
Convolutional Neural Network
• Hierarchical processing: bottom-up vs. left-right
• Homogeneous: all elements processed in the same way
• Scalable computation: parallelizable, suited for GPUs
(Figure: hierarchical, bottom-up processing of "the tall building … ." vs. left-to-right processing; a code sketch with stacked convolutions follows below.)
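A rough sketch of this hierarchical processing with stacked 1-D convolutions (layer names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

d = 256
conv1 = nn.Conv1d(d, d, kernel_size=3, padding=1)
conv2 = nn.Conv1d(d, d, kernel_size=3, padding=1)

x = torch.randn(1, d, 10)   # (batch, channels, sequence length), e.g. 10 words
h1 = torch.relu(conv1(x))   # each position sees a window of 3 words
h2 = torch.relu(conv2(h1))  # after two layers, each position sees 5 words
# Every position in a layer is computed the same way and all at once,
# so the stack parallelizes well on GPUs, unlike a left-to-right recurrence.
```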
Contributions
• Show that non-RNN models can outperform very well-engineered RNNs on large translation benchmarks
• Multi-hop attention (a rough sketch follows after this list)
• Approach improves inference speed (>9x faster than RNN)
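A rough, single-query sketch of the multi-hop idea: every decoder layer computes its own attention over the encoder states instead of attending once. The tensors and the way the context is folded back in are simplifications; the real model also mixes in target and source embeddings.

```python
import torch
import torch.nn.functional as F

enc = torch.randn(7, 512)                 # encoder states (source length, d)
h = torch.randn(512)                      # decoder state at the current step

for layer in range(3):                    # one attention "hop" per decoder layer
    weights = F.softmax(enc @ h, dim=0)   # this layer's attention weights
    context = weights @ enc               # this layer's context vector
    h = h + context                       # feed the context into the next layer
```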
(Slides 20-27, animation: the encoder reads the source sentence "la maison de Léa . <end>"; at each output step the decoder attends over the encoder states and emits the next target token, building up "Léa 's house ." starting from "<start>".)
ConvS2S: Full Model

…Non-linearities allow the networks to exploit the full input field, or to focus on fewer elements if needed. Each convolution kernel is parameterized as W ∈ ℝ^(2d×kd), b_w ∈ ℝ^(2d) and takes as input X ∈ ℝ^(k×d), which is a concatenation of k input elements embedded in d dimensions, and maps them to a single output element Y ∈ ℝ^(2d) that has twice the dimensionality of the input elements; subsequent layers operate over the k output elements of the previous layer. We choose gated linear units (GLU; Dauphin et al., 2016) as the non-linearity, which implement a simple gating mechanism over the output of the convolution Y = [A B] ∈ ℝ^(2d):

v([A B]) = A ⊗ σ(B)

where A, B ∈ ℝ^d are the inputs to the non-linearity, ⊗ is the point-wise multiplication and the output v([A B]) ∈ ℝ^d is half the size of Y. The gates σ(B) control which inputs A of the current context are relevant. A similar non-linearity was introduced in Oord et al. (2016b), who apply tanh to A, but Dauphin et al. (2016) show that GLUs perform better in the context of language modelling.

To enable deep convolutional networks, we add residual connections from the input of each convolution to the output of the block (He et al., 2015a).
• High training efficiency due to parallel computation in the decoder
• Cross-entropy objective
• Very similar to ResNet models (Nesterov momentum, He et al. '15)
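The convolution-plus-GLU block described above, with its residual connection, can be sketched in a few lines of PyTorch. This is a minimal sketch under assumed sizes, not the fairseq implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, k = 512, 3
conv = nn.Conv1d(d, 2 * d, kernel_size=k, padding=k // 2)

def conv_glu_block(x):                 # x: (batch, d, sequence length)
    y = conv(x)                        # (batch, 2d, length): Y = [A B]
    h = F.glu(y, dim=1)                # A * sigmoid(B), back to d channels
    # (the real decoder pads only on the left so a position cannot see future tokens)
    return h + x                       # residual connection around the block

out = conv_glu_block(torch.randn(2, d, 10))
```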
ConvS2S: WMT'14 English-German

Model                                   | Vocabulary  | BLEU
CNN ByteNet (Kalchbrenner et al., 2016) | Characters  | 23.75
RNN GNMT (Wu et al., 2016)              | Word 80k    | 23.12
RNN GNMT (Wu et al., 2016)              | Word pieces | 24.61
ConvS2S                                 | BPE 40k     | 25.16
Transformer (Vaswani et al., 2017)      | Word pieces | 28.4

ConvS2S: 15 layers in encoder/decoder (10x512 units, 3x768 units, 2x2048)
Maximum context size: 27 words
More work on non-RNN models!
ConvS2S: WMT'14 English-French

Model                              | Vocabulary  | BLEU
RNN GNMT (Wu et al., 2016)         | Word 80k    | 37.90
RNN GNMT (Wu et al., 2016)         | Word pieces | 38.95
RNN GNMT + RL (Wu et al., 2016)    | Word pieces | 39.92
ConvS2S                            | BPE 40k     | 40.51
Transformer (Vaswani et al., 2017) | Word pieces | 41.0

ConvS2S: 15 layers in encoder/decoder (5x512 units, 4x768 units, 3x2048, 2x4096)
ConvS2S: Inference Speed
Model                      | Hardware         | BLEU  | Time (s)
RNN GNMT (Wu et al., 2016) | GPU (K80)        | 31.20 | 3028
RNN GNMT (Wu et al., 2016) | CPU (88 cores)   | 31.20 | 1322
RNN GNMT (Wu et al., 2016) | TPU              | 31.21 | 384
ConvS2S, beam=5            | GPU (K40)        | 34.10 | 587
ConvS2S, beam=1            | GPU (K40)        | 33.45 | 327
ConvS2S, beam=1            | GPU (GTX-1080ti) | 33.45 | 142
ConvS2S, beam=1            | CPU (48 cores)   | 33.45 | 142

Test set: ntst1213 (6003 sentences)
ConvS2S: Training
Dataset     | # Pairs (millions) | Hardware      | Days
WMT14 EN-DE | 4.5                | 1 x GPU (K40) | 18
WMT14 EN-FR | 36                 | 8 x GPU (K40) | 30
ConvS2S: Conclusion
• Alternative architecture for sequence to sequence learning
• Higher accuracy than models of similar size, despite a fixed-size context
• Faster generation (9x faster on lesser hardware)
Blogpost
Research Paper
Source Code
Denis Yarats, ITEM 2018