Slides on techniques for using Convolutional Neural Networks (CNNs) for sequence modeling tasks, including image captioning and neural machine translation (NMT). The slides cover the main building blocks from several papers. Used for a group paper reading at the University of Sydney.
CNN and RNN Models for Sequence Processing
1. From RNN to CNN:
Using CNNs in sequence processing
Dongang Wang
20 Jun 2018
2. Contents
Background in sequence processing
• Basic Seq2Seq model and Attention model
Important tricks
• Dilated convolution, Position Encoding, Multiplicative Attention, etc.
Example Networks
• ByteNet, ConvS2S, Transformer, etc.
Application in captioning
• Convolutional Image Captioning
• Transformer in Dense Video Captioning
3. Main references
• Convolutional Sequence to Sequence Learning
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann Dauphin
FAIR, published on arXiv, 2017
• Attention Is All You Need
Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Illia Polosukhin, et al.
Google Research & Google Brain, published in NIPS 2017
• An Empirical Evaluation of Generic Convolutional and Recurrent
Networks for Sequence Modeling
Shaojie Bai, J Zico Kolter, Vladlen Koltun
CMU & Intel Labs, published on arXiv, 2018 [Bai, 2018]
[Vaswani, 2017]
[Gehring, 2017]
4. Other references (1)
• Neural Machine Translation in Linear Time
Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, et al.
Google DeepMind, published on arXiv, 2016
• Convolutional Image Captioning
Jyoti Aneja, Aditya Deshpande, Alexander Schwing
UIUC, published in CVPR 2018
• End-to-End Dense Video Captioning with Masked Transformer
Luowei Zhou, Yingbo Zhou, Jason Corso
U of Michigan, published in CVPR 2018
[Aneja, 2018]
[Kalchbrenner, 2016]
[Zhou, 2018]
5. Other references (2)
• Sequence to Sequence Learning with Neural Networks
Ilya Sutskever, Oriol Vinyals, Quoc V. Le
Google Research, published in NIPS 2014
• Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio
Jacobs U & U of Montreal, published in ICLR 2015 (oral)
• Multi-Scale Context Aggregation by Dilated Convolutions
Fisher Yu, Vladlen Koltun
Princeton & Intel Labs, published in ICLR 2016 [Yu, 2016]
[Bahdanau, 2014]
[Sutskever, 2014]
6. Other references (3)
• End-to-End Memory Networks
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus
NYU & FAIR, published in NIPS 2015
• Layer Normalization
Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
U of Toronto, published in a NIPS 2016 workshop
• Weight Normalization: A Simple Reparameterization to Accelerate
Training of Deep Neural Networks
Tim Salimans, Diederik P. Kingma
OpenAI, published in NIPS 2016
[Sukhbaatar, 2015]
[Salimans, 2016]
[Ba, 2016]
7. Basic Seq2Seq in NMT
Model:
– Encoder: the sentence is encoded into a fixed-length vector
– Decoder: the encoded vector initializes the decoder, and the output of each
time step is fed as the input to the next time step.
Tricks:
– A deep LSTM with four layers.
– Reverse the order of the words in the input sentence (a code sketch follows below).
[Sutskever, 2014]
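A minimal sketch of this encoder–decoder idea in PyTorch (sizes, token ids, and names below are assumptions for illustration, not the paper's exact setup): the encoder compresses the reversed source sentence into a fixed-length state, which initializes the decoder; at each step the decoder consumes its own previous output.

```python
import torch
import torch.nn as nn

emb_dim, hid_dim, vocab = 32, 64, 100                                 # toy sizes (assumed)
enc_emb = nn.Embedding(vocab, emb_dim)
dec_emb = nn.Embedding(vocab, emb_dim)
encoder = nn.LSTM(emb_dim, hid_dim, num_layers=4, batch_first=True)   # deep LSTM, four layers
decoder = nn.LSTM(emb_dim, hid_dim, num_layers=4, batch_first=True)
out_proj = nn.Linear(hid_dim, vocab)

src = torch.randint(0, vocab, (1, 7))        # source token ids (batch of 1)
src = src.flip(dims=[1])                     # trick: reverse the input word order
_, state = encoder(enc_emb(src))             # fixed-length summary = final (h, c)

tok = torch.zeros(1, 1, dtype=torch.long)    # assumed <sos> token id 0
for _ in range(10):                          # greedy decoding, at most 10 steps
    dec_out, state = decoder(dec_emb(tok), state)
    tok = out_proj(dec_out).argmax(-1)       # previous output becomes next input
```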
8. Basic Attention in NMT
Model:
– Encoder: bidirectional LSTM
– Decoder: takes the label and hidden state from the previous time step, plus a
weighted combination of all encoder features (sketched below).
[Bahdanau, 2014]
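A minimal sketch of additive attention (dimensions and names assumed): an MLP scores each encoder feature against the previous decoder state, and the softmax weights give a context vector for the current decoding step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hid = 64
W_enc, W_dec = nn.Linear(2 * hid, hid), nn.Linear(hid, hid)   # 2*hid: bidirectional encoder
v = nn.Linear(hid, 1)

enc_feats = torch.randn(1, 7, 2 * hid)   # bidirectional LSTM outputs, one per source word
s_prev    = torch.randn(1, hid)          # decoder hidden state from the previous step

scores  = v(torch.tanh(W_enc(enc_feats) + W_dec(s_prev).unsqueeze(1)))  # (1, 7, 1)
weights = F.softmax(scores, dim=1)                                      # attention over source words
context = (weights * enc_feats).sum(dim=1)                              # (1, 2*hid) context vector
```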
9. Limitations
Running Time (main concern)
• RNNs cannot run in parallel because of their sequential structure
Long-term Dependency
• Gradients vanish or explode along long sequences
Structure almost untouched
• The LSTM remains the best-performing structure so far (in large-scale searches over
thousands of RNN variants), and LSTM variants do not improve on it significantly.
• Techniques like batch normalization do not work properly in LSTMs.
Input–output relationships are not modeled well in Seq2Seq
• For NMT, the path between a source token and its corresponding target token
should be short, but the original Seq2Seq model cannot keep this path short.
10. Tricks as Building blocks
Modified ConvNets for sequences (no pooling)
– Stacked CNN with multiple kernels, without padding
– Dilated Convolutional Network
Residual Connections
Normalization (Batch, Weight, Layer)
– To accelerate optimization
Position Encoding
– To remedy the loss of position information
Multiplicative Attention
– Another kind of attention method
11. Building block: Stacked CNN
For sequences:
– Multiple kernels (filters) of size k by d, where k is the context window of
interest and d is the dimension of the word embedding.
– Stacking several layers without padding gives the CNN a larger receptive field.
For example, with 5 convolutions of k=3, each output corresponds to an input of
11 words (11->9->7->5->3->1); see the sketch below.
For variable lengths:
– Pad all sentences to the same length
– Use a mask to control training
[Gehring, 2017]
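A small sketch of the receptive-field arithmetic (sizes assumed): five Conv1d layers with kernel size 3 and no padding shrink a length-11 input to a single position, so one output sees 11 words.

```python
import torch
import torch.nn as nn

d = 16                                                  # word-embedding dimension (assumed)
layers = nn.Sequential(*[nn.Conv1d(d, d, kernel_size=3) for _ in range(5)])

x = torch.randn(1, d, 11)                               # (batch, embedding dim, sentence length)
print(layers(x).shape)                                  # torch.Size([1, 16, 1]): 11->9->7->5->3->1
```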
12. Building block: Dilated Convolution
This is a kind of causal convolution, in which future information is not taken
into account.
Dilation was originally used in segmentation, where preserving the resolution of
the input image is essential. In sequence modeling, it is likewise important to
retain the information from the word embeddings.
13. Building block: Dilated Convolution
For sequences: 1D dilated convolution (a sketch follows below)
[Kalchbrenner, 2016]
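A sketch of a 1D causal dilated convolution (sizes and the module name are assumptions): padding is applied only on the left, so an output position never sees future words, and doubling the dilation per layer grows the receptive field exponentially.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left padding keeps causality
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))          # pad only on the past side

stack = nn.Sequential(*[CausalDilatedConv1d(16, dilation=2 ** i) for i in range(4)])
x = torch.randn(1, 16, 20)
print(stack(x).shape)                                      # torch.Size([1, 16, 20])
```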
14. Building block: Residual & Normalization
For residual connections:
– They proved very powerful in ResNet.
– Since we may need deep networks to model sequences, it is also useful to train
the layers to learn modifications (residuals) of their inputs.
For normalization:
– Intuition: make the gradients less dependent on the scale of the data, so that
optimization can be accelerated.
– Batch normalization: normalize with the mean/variance of the batch data
– Weight normalization: reparameterize each weight vector by its norm
– Layer normalization: normalize with the mean/variance over the layer's features
(a sketch of a residual block with layer normalization follows below)
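A sketch of a residual block with layer normalization (shapes and the feed-forward sub-layer are assumptions): the sub-layer only has to learn a modification of its input, and LayerNorm normalizes over the feature dimension per position, which suits variable-length sequences.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)                  # normalize over the feature dimension
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                nn.Linear(d_model, d_model))

    def forward(self, x):                                  # x: (batch, time, d_model)
        return x + self.ff(self.norm(x))                   # residual: learn a modification of x

print(ResidualBlock(16)(torch.randn(1, 5, 16)).shape)      # torch.Size([1, 5, 16])
```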
16. Building block: Position Encoding
If we process all the words in the sentence together, we will lose the
information of the sequence order. In that case, we can modify the
original word embedding vector by adding a position vector.
– Train another embedding feature parallel to the word embedding, using the
position index as a one-hot vector
– For the j-th word out of J words, the position feature has the same dimension d
as the word embedding. The k-th element of this d-dimensional vector is
l_kj = (1 - j/J) - (k/d)(1 - 2j/J)
– Using sine and cosine functions (a code sketch follows):
l_kj = sin(j / 10000^(k/d)),      if k is even
l_kj = cos(j / 10000^((k-1)/d)),  if k is odd
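A sketch of the sinusoidal encoding above in NumPy (sentence length and dimension assumed): each position vector is added to the corresponding word embedding.

```python
import numpy as np

def position_encoding(J, d):
    L = np.zeros((J, d))
    for j in range(J):                                   # word position
        for k in range(d):                               # embedding dimension
            if k % 2 == 0:
                L[j, k] = np.sin(j / 10000 ** (k / d))
            else:
                L[j, k] = np.cos(j / 10000 ** ((k - 1) / d))
    return L

word_emb = np.random.randn(10, 16)                       # 10 words, d = 16 (assumed)
word_emb = word_emb + position_encoding(10, 16)          # add position information
```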
17. Building block: Multiplicative Attention
Additive attention:
– Train an MLP whose inputs are the encoded features and the hidden state of the last step
– Use the resulting weights to take a weighted sum of the encoded features for decoding
Multiplicative attention in decoding:
– g is the word embedding of the previous step
– h is the hidden state of the previous step
– z is the encoded feature
Modified multiplicative attention (Scaled Dot-Product Attention):
– The dot product can become very large in some cases, which makes the attention
weights very biased. To counter this, the dot product is divided by sqrt(d_z)
(see the sketch below).
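A sketch of scaled dot-product attention (shapes assumed): the dot products are divided by the square root of d_z before the softmax.

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 5, 64)     # queries (batch, target length, d_z)
k = torch.randn(1, 7, 64)     # keys    (batch, source length, d_z)
v = torch.randn(1, 7, 64)     # values

scores  = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # scale by sqrt(d_z)
weights = F.softmax(scores, dim=-1)                       # (1, 5, 7) attention weights
output  = weights @ v                                     # (1, 5, 64) attended features
```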
18. Network: ByteNet
Blocks:
– Dilated Convolution
– Residual block with layer normalization
– Masked input
Specialty:
– Dynamic unfolding: in neural machine translation, the source and target sentence
lengths are roughly linearly related. The maximum length of the target sentence is
set as a linear function of the source length |s|, with a = 1.2 and b = 0:
t̂ = a|s| + b
20. Network: Transformer
Blocks:
– Position Encoding
– Scaled Dot-Product Attention
– Masked input
– Residual block with Layer Normalization
Specialty:
– Multi-Head Attention: they run 8 attention layers in parallel and concatenate
the outputs into one vector (see the sketch below).
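A sketch of multi-head self-attention (sizes assumed, single batch): eight scaled dot-product attentions run in parallel on learned projections, and the heads are concatenated and projected back to one vector per position.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, heads = 64, 8
d_head = d_model // heads
W_q, W_k, W_v, W_o = (nn.Linear(d_model, d_model) for _ in range(4))

x = torch.randn(1, 5, d_model)                               # (batch, time, d_model), self-attention

def split(t):                                                # (batch, time, d_model) -> (batch, heads, time, d_head)
    return t.view(1, -1, heads, d_head).transpose(1, 2)

q, k, v = split(W_q(x)), split(W_k(x)), split(W_v(x))
att = F.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)   # scaled dot-product per head
out = (att @ v).transpose(1, 2).reshape(1, -1, d_model)            # concatenate the 8 heads
out = W_o(out)                                                     # project back to d_model
```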
21. Application: Convolutional Image Captioning
Blocks:
– Gated linear units (see the sketch below)
– Additive attention
– Residual block with weight norm
– Fine-tune the image encoder
Performance:
– not as good as LSTM
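A sketch of a gated linear unit (GLU) as used in such convolutional decoders (sizes assumed): a convolution produces 2*d channels, half of which gate the other half through a sigmoid.

```python
import torch
import torch.nn as nn

d = 16
conv = nn.Conv1d(d, 2 * d, kernel_size=3, padding=1)    # outputs 2*d channels for value + gate
x = torch.randn(1, d, 10)                               # (batch, channels, time)
a, b = conv(x).chunk(2, dim=1)                          # split into value and gate halves
y = a * torch.sigmoid(b)                                # GLU (equivalent to torch.nn.functional.glu)
print(y.shape)                                          # torch.Size([1, 16, 10])
```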