From RNN to CNN:
Using CNN in sequence processing
Dongang Wang
20 Jun 2018
Contents
 Background in sequence processing
• Basic Seq2Seq model and Attention model
 Important tricks
• Dilated convolution, Position Encoding, Multiplicative Attention, etc.
 Example Networks
• ByteNet, ConvS2S, Transformer, etc.
 Application in captioning
• Convolutional Image Captioning
• Transformer in Dense Video Captioning
Main references
• Convolutional Sequence to Sequence Learning [Gehring, 2017]
  Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann Dauphin
  FAIR, published in arXiv 2017
• Attention Is All You Need [Vaswani, 2017]
  Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Illia Polosukhin, et al.
  Google Research & Google Brain, published in NIPS 2017
• An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling [Bai, 2018]
  Shaojie Bai, J. Zico Kolter, Vladlen Koltun
  CMU & Intel Labs, published in arXiv 2018
Other references (1)
• Neural Machine Translation in Linear Time [Kalchbrenner, 2016]
  Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, et al.
  Google DeepMind, published in arXiv 2016
• Convolutional Image Captioning [Aneja, 2018]
  Jyoti Aneja, Aditya Deshpande, Alexander Schwing
  UIUC, published in CVPR 2018
• End-to-End Dense Video Captioning with Masked Transformer [Zhou, 2018]
  Luowei Zhou, Yingbo Zhou, Jason Corso
  U of Michigan, published in CVPR 2018
Other references (2)
• Sequence to sequence learning with neural networks [Sutskever, 2014]
  Ilya Sutskever, Oriol Vinyals, Quoc V. Le
  Google Research, published in NIPS 2014
• Neural machine translation by jointly learning to align and translate [Bahdanau, 2014]
  Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio
  Jacobs U & U of Montreal, published in ICLR 2015 as oral
• Multi-scale context aggregation by dilated convolutions [Yu, 2016]
  Fisher Yu, Vladlen Koltun
  Princeton & Intel Lab, published in ICLR 2016
Other references (3)
• End-to-End Memory Networks [Sukhbaatar, 2015]
  Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus
  NYU & FAIR, published in NIPS 2015
• Layer Normalization [Ba, 2016]
  Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
  U of Toronto, published in NIPS 2016 in workshop
• Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks [Salimans, 2016]
  Tim Salimans, Diederik P. Kingma
  OpenAI, published in NIPS 2016
Basic Seq2Seq in NMT
 Model:
– Encoder: the sentence is encoded into a fixed-length vector
– Decoder: the encoded vector is passed to the decoder as its starting point, and
the output of each time step is fed as input to the next time step.
 Tricks:
– Deep LSTM using four layers.
– Reverse the order of the words of the input.
[Sutskever, 2014]
Basic Attention in NMT
 Model:
– Encoder: bidirectional RNN
– Decoder: takes the label (previous word) and the state from the
previous time step, together with a weighted combination of all
encoder features.
[Bahdanau, 2014]
Limitations
 Running Time (main concern)
• RNN cannot run in parallel because of the serial structure
 Long-term Dependency
• Gradients will vanish or explode along long sequences
 Structure almost untouched
• LSTM has been found empirically to be among the best structures at present
(in a search over about ten thousand RNNs), and LSTM variants do not improve on it significantly.
• Techniques like batch normalization do not work properly in LSTMs.
 Relationships are not proper in Seq2Seq
• For NMT, the path between an input token and its corresponding output token
should be short, but the original Seq2Seq model cannot provide such a short path.
Tricks as Building blocks
 Modified ConvNets for sequences (no pooling)
– Stacked CNN with multiple kernels, without padding
– Dilated Convolutional Network
 Residual Connections
 Normalization (Batch, Weight, Layer)
– To accelerate optimization
 Position Encoding
– To remedy the loss of position information
 Multiplicative Attention
– Another kind of attention method
Building block: Stacked CNN
 For sequences:
– Multiple kernels (filters): each kernel has size k by d, where k is the
window width (the number of adjacent words covered) and d is the dimension of the word embedding.
– Stack several layers without padding, so that the CNN has a larger
receptive field. For example, with 5 convolutions of k=3, each output
corresponds to an input of 11 words (11->9->7->5->3->1).
 For variant lengths:
– Use same length with padding
– Use mask to control training
[Gehring, 2017]
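The 11-word example above can be checked with a few lines of arithmetic. This is only a sketch for stride-1 convolutions; the function names are illustrative and not taken from any of the cited papers.

```python
def receptive_field(num_layers, kernel_size, dilation=1):
    """Input positions seen by one output of a stack of stride-1 convolutions."""
    # Each layer widens the field by (kernel_size - 1) * dilation positions.
    return 1 + num_layers * (kernel_size - 1) * dilation


def lengths_without_padding(input_len, num_layers, kernel_size):
    """Sequence length after each unpadded stride-1 convolution."""
    lengths = [input_len]
    for _ in range(num_layers):
        lengths.append(lengths[-1] - (kernel_size - 1))
    return lengths


print(receptive_field(num_layers=5, kernel_size=3))              # 11
print(lengths_without_padding(11, num_layers=5, kernel_size=3))  # [11, 9, 7, 5, 3, 1]
```

The receptive field grows only linearly with depth here, which is the motivation for the dilated variant on the next slide.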
Building block: Dilated Convolution
 This is a kind of causal convolution, in which the future information is not
taken into account; the dilation itself only spaces out the kernel taps to enlarge the receptive field.
 This method originally used in segmentation, where the resolution of the
input image is essential. In sequence modeling, it is equally important to
retain the information from the word embeddings.
Building block: Dilated Convolution
 For sequence: 1D dilated convolution
[Kalchbrenner, 2016]
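A minimal sketch of a causal 1D dilated convolution, assuming a single channel (one scalar per time step) and implicit zero-padding on the left. Real models such as ByteNet work on d-dimensional embeddings with many filters; the function names here are hypothetical.

```python
def causal_dilated_conv1d(x, w, dilation=1):
    """y[t] = sum_i w[i] * x[t - i * dilation], treating x[<0] as zero."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w_i in enumerate(w):
            src = t - i * dilation
            if src >= 0:  # implicit left zero-padding keeps the layer causal
                acc += w_i * x[src]
        y.append(acc)
    return y


def stack_receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 layers with the given dilations."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(causal_dilated_conv1d(x, w=[1.0, 1.0], dilation=2))
# [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(stack_receptive_field(kernel_size=2, dilations=[1, 2, 4, 8]))  # 16
```

Because each output only reads positions at or before its own index, changing a later input never affects an earlier output. Doubling the dilation per layer makes the receptive field grow exponentially with depth.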
Building block: Residual & Normalization
 For residual:
– Residual connections proved to be very powerful in ResNet.
– Since modeling a sequence may require a deep network, it is also useful here
to let each layer learn a modification (residual) of its input.
 For normalization:
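– Intuition: keep the scale of activations, and hence of gradients, largely
independent of the data, so that optimization is accelerated.
– Batch normalization: normalize each feature with the mean/variance over the batch
– Weight normalization: reparameterize each weight vector as w = g * v / ||v||, separating its norm from its direction
– Layer normalization: normalize each sample with the mean/variance over the units of the layer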
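The three schemes differ mainly in what is normalized. The sketch below is a simplified illustration on plain Python lists; it leaves out the learned gain and bias that batch and layer normalization normally carry, and all function names are illustrative.

```python
import math


def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch):
    """Normalize each feature (column) across the samples of the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [normalize(row) for row in batch]


def weight_norm(v, g):
    """Reparameterize a weight vector as w = g * v / ||v||."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


batch = [[1.0, 2.0, 3.0],
         [2.0, 4.0, 6.0]]
print(batch_norm(batch))   # statistics taken per column
print(layer_norm(batch))   # statistics taken per row
print(weight_norm([3.0, 4.0], g=2.0))  # approximately [1.2, 1.6]
```

Layer normalization uses no batch statistics, so it behaves the same for any batch size and at every time step of a sequence model.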
Building block: Residual & Normalization
 Batch Normalization & Layer Normalization
 Batch Normalization & Weight Normalization
Building block: Position Encoding
 If we process all the words in the sentence together, we will lose the
information of the sequence order. In that case, we can modify the
original word embedding vector by adding a position vector.
– Train another embedding feature parallel to the word embedding, using the
position input as one-hot vector
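– For the j-th word out of J words, the position feature has dimension d, the
same as the word embedding. Its k-th element is
$$ l_{kj} = \frac{4}{Jd}\left(k - \frac{d}{2}\right)\left(j - \frac{J}{2}\right) + 1 $$
– Using sine and cosine functions:
$$ l_{kj} = \begin{cases} \sin\left(j / 10000^{k/d}\right), & \text{if } k \text{ is even} \\ \cos\left(j / 10000^{(k-1)/d}\right), & \text{if } k \text{ is odd} \end{cases} $$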
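A small sketch of the two closed-form position encodings on this slide: the linear one, l_kj = 4/(Jd) * (k - d/2) * (j - J/2) + 1, and the sinusoidal one, sin(j / 10000^(k/d)) for even k and cos(j / 10000^((k-1)/d)) for odd k. The slide does not say whether j and k start at 0 or 1; zero-based indexing is an assumption made here, and the function names are illustrative.

```python
import math


def linear_position_encoding(j, k, J, d):
    """l_kj = 4 / (J * d) * (k - d / 2) * (j - J / 2) + 1."""
    return 4.0 / (J * d) * (k - d / 2.0) * (j - J / 2.0) + 1.0


def sinusoidal_position_encoding(j, k, d):
    """sin(j / 10000^(k/d)) for even k, cos(j / 10000^((k-1)/d)) for odd k."""
    if k % 2 == 0:
        return math.sin(j / 10000 ** (k / d))
    return math.cos(j / 10000 ** ((k - 1) / d))


def add_position(embedding, j):
    """Add the sinusoidal position vector for position j to a word embedding."""
    d = len(embedding)
    return [e + sinusoidal_position_encoding(j, k, d)
            for k, e in enumerate(embedding)]


print(add_position([0.0, 0.0, 0.0, 0.0], j=0))  # [0.0, 1.0, 0.0, 1.0]
```

Both encodings are fixed functions of the position, so unlike the trained position embedding they add no parameters.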
Building block: Multiplicative Attention
 Additive attention:
– Train an MLP whose inputs are the encoded features and the hidden state of the previous step
– Use the resulting weights to form a weighted sum of the encoded features for decoding
 Multiplicative attention in decoding:
– g is the word of previous step
– h is hidden state of previous step
– z is the encoded feature
 Modified multiplicative attention (Scaled Dot-Product Attention):
– The dot product can become very large in some cases, which makes the
attention very biased toward a few positions. To counter this, the dot product is divided by $\sqrt{d_z}$, where d_z is the dimension of z.
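A minimal sketch of dot-product attention with the optional scaling, written on plain lists. The query stands in for the decoder-side vector (built from h and g in the notation above) and the keys and values are the encoded features z; the exact way h and g are combined differs between papers and is not modeled here. Function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(query, keys, values, scaled=True):
    """Weighted sum of values; weights come from query-key dot products."""
    scale = math.sqrt(len(query)) if scaled else 1.0
    weights = softmax([dot(query, k) / scale for k in keys])
    d_out = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(d_out)]
    return context, weights


z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # encoded features, one per source word
query = [2.0, 0.0]                         # decoder-side vector
context, weights = attention(query, keys=z, values=z)
print(weights)
print(context)
```

Calling `attention(query, z, z, scaled=False)` on the same inputs gives a more peaked weight distribution, which is the effect the scaling is meant to dampen.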
Network: ByteNet
 Blocks:
– Dilated Convolution
– Residual block with layer normalization
– Masked input
 Specialty:
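– Dynamic unfolding: in neural machine
translation, the sentence lengths of
source and target are roughly linearly related.
The maximum length of the target sentence is
therefore set from the source length s as
$\hat{t} = a s + b$, with a=1.2 and b=0
(e.g. a 10-token source allows up to 12 target tokens).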
Network: ConvS2S
 Stacked CNN without pooling
 Position Encoding
 Multiplicative Attention
Network: Transformer
 Blocks:
– Position Encoding
– Scaled Dot-Product Attention
– Masked input
– Residual block with Layer Normalization
 Specialty:
– Multi-Head Attention: 8 attention layers (heads) are
run in parallel, and their outputs are concatenated
into one vector.
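The idea of running several attention heads in parallel and concatenating them can be sketched as follows. This is a simplification: in the Transformer each head applies its own learned linear projections to queries, keys and values, and a final projection follows the concatenation. Here the vectors are simply sliced into equal parts, and 2 heads are used instead of 8 to keep the example small. Function names are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale
                       for key in keys])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


def split_heads(vector, num_heads):
    """Cut a vector into num_heads equal slices (stand-in for projections)."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]


def multi_head_attention(query, keys, values, num_heads):
    """Run attention separately on each slice and concatenate the results."""
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        head_keys = [k[h] for k in k_heads]
        head_values = [v[h] for v in v_heads]
        output.extend(attend(q_heads[h], head_keys, head_values))
    return output


keys = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head_attention([1.0, 0.0, 0.0, 1.0], keys, keys, num_heads=2)
print(len(out))  # 4
```

Each head computes its own attention weights, so different heads can focus on different source positions before their outputs are joined.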
Application: Convolutional Image Captioning
 Block:
– Gated linear units
– Additive attention
– Residual block with weight norm
– Fine-tune image encoder
 Performance:
– not as good as LSTM
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

CNN and RNN Models for Sequence Processing

  • 1. From RNN to CNN: Using CNN in sequence processing. Dongang Wang, 20 Jun 2018
  • 2. Contents
      - Background in sequence processing: basic Seq2Seq model and attention model
      - Important tricks: dilated convolution, position encoding, multiplicative attention, etc.
      - Example networks: ByteNet, ConvS2S, Transformer, etc.
      - Application in captioning: convolutional image captioning; Transformer in dense video captioning
  • 3. Main references
      - [Gehring, 2017] Convolutional Sequence to Sequence Learning. Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann Dauphin. FAIR, arXiv 2017.
      - [Vaswani, 2017] Attention Is All You Need. Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Illia Polosukhin, et al. Google Research & Google Brain, NIPS 2017.
      - [Bai, 2018] An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. Shaojie Bai, J. Zico Kolter, Vladlen Koltun. CMU & Intel Labs, arXiv 2018.
  • 4. Other references (1)
      - [Kalchbrenner, 2016] Neural Machine Translation in Linear Time. Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, et al. Google DeepMind, arXiv 2016.
      - [Aneja, 2018] Convolutional Image Captioning. Jyoti Aneja, Aditya Deshpande, Alexander Schwing. UIUC, CVPR 2018.
      - [Zhou, 2018] End-to-End Dense Video Captioning with Masked Transformer. Luowei Zhou, Yingbo Zhou, Jason Corso. U of Michigan, CVPR 2018.
  • 5. Other references (2)
      - [Sutskever, 2014] Sequence to Sequence Learning with Neural Networks. Ilya Sutskever, Oriol Vinyals, Quoc V. Le. Google Research, NIPS 2014.
      - [Bahdanau, 2014] Neural Machine Translation by Jointly Learning to Align and Translate. Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio. Jacobs U & U of Montreal, ICLR 2015 (oral).
      - [Yu, 2016] Multi-Scale Context Aggregation by Dilated Convolutions. Fisher Yu, Vladlen Koltun. Princeton & Intel Lab, ICLR 2016.
  • 6. Other references (3)
      - [Sukhbaatar, 2015] End-to-End Memory Networks. Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus. NYU & FAIR, NIPS 2015.
      - [Ba, 2016] Layer Normalization. Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. U of Toronto, NIPS 2016 workshop.
      - [Salimans, 2016] Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. Tim Salimans, Diederik P. Kingma. OpenAI, NIPS 2016.
  • 7. Basic Seq2Seq in NMT [Sutskever, 2014]
      - Model:
          - Encoder: the sentence is encoded into a fixed-length vector.
          - Decoder: the encoded vector is the first input to the decoder, and the output of each time step is fed as the input to the next time step.
      - Tricks:
          - Deep LSTM with four layers.
          - Reverse the word order of the input sentence.
  • 8. Basic Attention in NMT [Bahdanau, 2014]
      - Model:
          - Encoder: bidirectional RNN
          - Decoder: at each step, takes the label and state from the previous time step together with a weighted combination of all encoder features.
  • 9. Limitations
      - Running time (main concern)
          - An RNN cannot run in parallel over time steps because of its serial structure.
      - Long-term dependency
          - Gradients vanish or explode along long sequences.
      - Structure almost untouched
          - LSTM has been found empirically to be the best structure so far (among about ten thousand RNN architectures), and LSTM variants do not improve on it significantly.
          - Techniques such as batch normalization do not work properly in LSTMs.
      - Relationships are not modeled well in Seq2Seq
          - For NMT, the path between an input token and its corresponding output token should be short, but the original Seq2Seq cannot model this relationship.
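The vanishing/exploding gradient point can be illustrated with a toy case. The sketch below assumes a scalar linear recurrence h_t = w * h_{t-1} (a simplification that is not on the slides; a real RNN has matrices and nonlinearities), for which the gradient of the last state with respect to the first is simply w raised to the number of steps. The function name is hypothetical.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Return d h_T / d h_0 for the toy recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return grad


if __name__ == "__main__":
    # |w| < 1 shrinks the gradient, |w| > 1 blows it up.
    for w in (0.9, 1.0, 1.1):
        value = gradient_through_time(w, 100)
        print(f"w={w}: gradient after 100 steps = {value:.3e}")
```

With 100 steps the gradient for w = 0.9 is many orders of magnitude smaller than for w = 1.1, which is the behaviour that makes long sequences hard to train on.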
  • 10. Tricks as building blocks
      - Modified ConvNets for sequences (no pooling)
          - Stacked CNN with multiple kernels, without padding
          - Dilated convolutional network
      - Residual connections
      - Normalization (batch, weight, layer)
          - To accelerate optimization
      - Position encoding
          - To remedy the loss of position information
      - Multiplicative attention
          - Another kind of attention method
  • 11. Building block: Stacked CNN [Gehring, 2017]
      - For sequences:
          - Use multiple kernels (filters). Each kernel has size k by d, where k is the number of words it covers and d is the dimension of the word embedding.
          - Stack several layers without padding, so that the CNN has a larger receptive field. For example, with 5 convolutions of k=3, one output corresponds to an input of 11 words (11->9->7->5->3->1).
      - For variable lengths:
          - Pad sequences to the same length
          - Use a mask to control training
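The 11->9->7->5->3->1 arithmetic can be checked with a small sketch. It assumes stride 1, no dilation, and, for brevity, scalar features instead of d-dimensional embeddings; the function names are hypothetical.

```python
def receptive_field(num_layers: int, kernel_size: int) -> int:
    """Input positions seen by one output of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv1d_valid(seq, kernel):
    """1D convolution without padding: the output has len(seq) - k + 1 items."""
    k = len(kernel)
    return [
        sum(kernel[i] * seq[t + i] for i in range(k))
        for t in range(len(seq) - k + 1)
    ]


if __name__ == "__main__":
    x = [float(i) for i in range(11)]   # a toy "sentence" of 11 scalar tokens
    kernel = [1 / 3, 1 / 3, 1 / 3]      # one averaging kernel with k=3
    lengths = [len(x)]
    for _ in range(5):                  # 5 stacked layers
        x = conv1d_valid(x, kernel)
        lengths.append(len(x))
    print(lengths)                      # sequence length after each layer
    print(receptive_field(5, 3))        # words covered by the final output
```

Each unpadded layer with k=3 shortens the sequence by 2, so five layers reduce 11 inputs to a single output that depends on all of them.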
  • 12. Building block: Dilated Convolution
      - It is used here as a causal convolution, in which future information is not taken into account.
      - The method was originally used in segmentation [Yu, 2016], where preserving the resolution of the input image is essential. In sequence modeling, it is likewise essential to retain the information from the word embedding.
  • 13. Building block: Dilated Convolution
      - For sequences: 1D dilated convolution [Kalchbrenner, 2016]
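A minimal sketch of a causal 1D dilated convolution follows. It assumes scalar features, left zero-padding, and a dilation rate that doubles per layer (1, 2, 4, 8); these choices and the function names are illustrative assumptions, not details taken from the slides.

```python
def causal_dilated_conv1d(seq, kernel, dilation=1):
    """Causal dilated convolution with implicit left zero-padding.

    Output t depends only on seq[t], seq[t - dilation], seq[t - 2*dilation], ...
    so no future position is ever read.
    """
    k = len(kernel)
    out = []
    for t in range(len(seq)):
        total = 0.0
        for i in range(k):
            src = t - (k - 1 - i) * dilation  # last kernel tap aligns with t
            if src >= 0:                      # positions before the start count as zero
                total += kernel[i] * seq[src]
        out.append(total)
    return out


def dilated_receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated layers with stride 1."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    print(causal_dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2))
    print(dilated_receptive_field(3, [1, 2, 4, 8]))
```

Because the output keeps the input length, the per-position information is retained while the receptive field grows quickly with depth.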
  • 14. Building block: Residual & Normalization
      - For residual:
          - It proved to be very powerful in ResNet.
          - Since a deep network may be needed to model the sequence, it is also useful to train the layers to learn modifications of their input.
      - For normalization:
          - Intuition: make the gradients less dependent on the scale of the data, so that optimization is accelerated.
          - Batch normalization: use the mean/variance over the batch
          - Weight normalization: rescale each weight vector by its norm
          - Layer normalization: use the mean/variance over the units of a layer
  • 15. Building block: Residual & Normalization (figures: batch normalization vs. layer normalization; batch normalization vs. weight normalization)
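The difference between the three normalization schemes is mostly a matter of which values the statistics are taken over. The sketch below assumes a batch stored as a list of feature rows and omits the learned gain and bias that the full methods include; all function names are hypothetical.

```python
import math


def _standardize(values, eps=1e-5):
    """Shift to zero mean and scale to (roughly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) using statistics over the batch."""
    columns = [_standardize(list(col), eps) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch, eps=1e-5):
    """Normalize each example (row) using statistics over its own features."""
    return [_standardize(row, eps) for row in batch]


def weight_norm(v, g):
    """Reparameterize a weight vector as w = g * v / ||v||."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


if __name__ == "__main__":
    batch = [[1.0, 10.0], [3.0, 30.0]]
    print(batch_norm(batch))
    print(layer_norm(batch))
    print(weight_norm([3.0, 4.0], g=2.0))
```

Layer normalization does not depend on the other examples in the batch, which is one reason it is easier to apply to sequence models than batch normalization.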
  • 16. Building block: Position Encoding
      - If we process all the words in the sentence together, we lose the information about the sequence order. To compensate, we can modify the original word embedding vector by adding a position vector.
          - Train another embedding parallel to the word embedding, using the position as a one-hot input vector
          - For the j-th word out of J words, the position feature has the same dimension d as the word embedding. The k-th element of the d-dimensional feature is
            $$l_{kj} = 1 + \frac{4\left(k - \frac{d}{2}\right)\left(j - \frac{J}{2}\right)}{Jd}$$
          - Using sine and cosine functions
            $$l_{kj} = \begin{cases} \sin\left(j / 10000^{k/d}\right), & \text{if } k \text{ is even} \\ \cos\left(j / 10000^{(k-1)/d}\right), & \text{if } k \text{ is odd} \end{cases}$$
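The sine/cosine variant can be sketched as follows. The sketch assumes 0-based indexing for both the position j and the dimension index k, and it adds the position vector elementwise to the word embedding; the function names are hypothetical.

```python
import math


def sinusoidal_position(j: int, d: int) -> list:
    """Position vector of dimension d for the j-th word (0-based)."""
    vec = []
    for k in range(d):
        if k % 2 == 0:
            vec.append(math.sin(j / 10000 ** (k / d)))
        else:
            # an odd k shares its frequency with the even index just below it
            vec.append(math.cos(j / 10000 ** ((k - 1) / d)))
    return vec


def add_position(embeddings):
    """Add a position vector to every word embedding in a sentence."""
    d = len(embeddings[0])
    return [
        [e + p for e, p in zip(word, sinusoidal_position(j, d))]
        for j, word in enumerate(embeddings)
    ]


if __name__ == "__main__":
    sentence = [[0.0] * 4 for _ in range(3)]  # three all-zero toy embeddings
    for row in add_position(sentence):
        print([round(x, 4) for x in row])
```

Identical word embeddings at different positions end up with different vectors, which is how the order information is restored without any recurrence.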
  • 17. Building block: Multiplicative Attention
      - Additive attention:
          - Train an MLP whose input is the encoded feature and the hidden state of the previous step
          - Use the resulting weights to form a weighted sum of the encoded features for decoding
      - Multiplicative attention in decoding:
          - g is the word of the previous step
          - h is the hidden state of the previous step
          - z is the encoded feature
      - Modified multiplicative attention (Scaled Dot-Product Attention):
          - The dot product can become very large in some cases, which makes the attention heavily biased toward a few positions. To counter this, the dot product is divided by $\sqrt{d_z}$.
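A minimal sketch of scaled dot-product attention is given below. It assumes a single query vector standing in for the decoder-side combination of h and g (how the two are combined is not recoverable from the slide text), uses the encoded features z as both keys and values by default in the demo, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scaled_dot_product_attention(query, keys, values):
    """Return (context, weights) for one query over all encoder positions."""
    scale = math.sqrt(len(query))  # divide by sqrt(d) to keep the scores moderate
    scores = [dot(query, key) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


if __name__ == "__main__":
    z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoded features
    context, weights = scaled_dot_product_attention([2.0, 0.0], z, z)
    print(weights)
    print(context)
```

Unlike additive attention, no MLP is trained for the scoring step; the score is just a (scaled) dot product, which is cheap and easy to compute for all positions at once.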
  • 18. Network: ByteNet
      - Blocks:
          - Dilated convolution
          - Residual block with layer normalization
          - Masked input
      - Specialty:
          - Dynamic unfolding: in neural machine translation, the sentence lengths of source and target are roughly linearly related. The maximum length of the target sentence is therefore set to $\hat{t} = a s + b$, where s is the source length, with a=1.2 and b=0.
  • 19. Network: ConvS2S
      - Stacked CNN without pooling
      - Position encoding
      - Multiplicative attention
  • 20. Network: Transformer
      - Blocks:
          - Position encoding
          - Scaled dot-product attention
          - Masked input
          - Residual block with layer normalization
      - Specialty:
          - Multi-Head Attention: 8 attention layers run in parallel, and their outputs are concatenated into one vector.
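The "run in parallel, then concatenate" idea of multi-head attention can be sketched as below. This is a simplification under stated assumptions: each head here works on a contiguous slice of the feature vector, whereas the Transformer uses learned linear projections per head and a final output projection, both of which are omitted. The function names are hypothetical.

```python
import math


def attention(query, keys, values):
    """Scaled dot-product attention for one query; returns the context vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def split_heads(vec, num_heads):
    """Cut a vector into num_heads equal contiguous slices."""
    size = len(vec) // num_heads
    return [vec[h * size:(h + 1) * size] for h in range(num_heads)]


def multi_head_attention(query, keys, values, num_heads=8):
    """Run one attention per head on its slice and concatenate the results."""
    if len(query) % num_heads or len(values[0]) % num_heads:
        raise ValueError("feature dimensions must be divisible by num_heads")
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        head_keys = [k[h] for k in k_heads]
        head_values = [v[h] for v in v_heads]
        output.extend(attention(q_heads[h], head_keys, head_values))
    return output


if __name__ == "__main__":
    keys = [[0.1 * i for i in range(16)], [1.0] * 16]
    values = [[1.0] * 16, [3.0] * 16]
    print(multi_head_attention([0.5] * 16, keys, values, num_heads=8))
```

Each head computes its own attention weights, so different heads can focus on different source positions before their outputs are joined.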
  • 21. Application: Convolutional Image Captioning
      - Blocks:
          - Gated linear units
          - Additive attention
          - Residual block with weight normalization
          - Fine-tuned image encoder
      - Performance:
          - Not as good as LSTM
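For reference, a gated linear unit can be sketched in a few lines. The sketch assumes the common formulation in which the feature vector is split into two halves A and B and the output is A multiplied elementwise by sigmoid(B); how the captioning model wires this into its convolution layers is not shown on the slide, and the function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_linear_unit(features):
    """Split features into halves A and B and return A * sigmoid(B)."""
    if len(features) % 2:
        raise ValueError("feature length must be even")
    half = len(features) // 2
    a, b = features[:half], features[half:]
    return [x * sigmoid(gate) for x, gate in zip(a, b)]


if __name__ == "__main__":
    # The second half acts as a gate deciding how much of the first half passes.
    print(gated_linear_unit([2.0, 4.0, 0.0, 0.0]))
    print(gated_linear_unit([2.0, 4.0, 10.0, -10.0]))
```

The output has half the input dimension, so a convolution feeding a GLU typically produces twice as many channels as the layer is meant to output.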