[course site]
Attention Models
Day 3 Lecture 6
#DLUPC
Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
Universitat Politècnica de Catalunya
Attention Models: Motivation
Input image: H x W x 3; predicted label: bird
The whole input volume is used to predict the output...
...despite the fact that not all pixels are equally important
Attention Models: Motivation
A bird flying over a body of water
Attend to different parts of the input to optimize a certain output
Case study: Image Captioning
Previously D3L5: Image Captioning
Multimodal Recurrent Neural Network: only takes image features into account in the first hidden state
Karpathy and Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
LSTM Decoder for Image Captioning
[Diagram: image -> CNN -> features (D-dimensional) -> chain of LSTM cells; output words: A, bird, flying, ..., <EOS>]
Vinyals et al. Show and tell: A neural image caption generator. CVPR 2015
Limitation: all output predictions are based on the final, static output of the encoder
Attention for Image Captioning
Image: H x W x 3 -> CNN -> Features f: L x D
Inputs to h0: c0 (first context vector, the average of the features) and y0 (first word, the <start> token)
Outputs of h0: a1 (attention weights, LxD) and y1 (predicted word)
Attention for Image Captioning
Image: H x W x 3 -> CNN -> features f
Visual features weighted with the attention weights a1 give the next context vector c1
Inputs to h1: c1 and y1 (the word predicted in the previous timestep)
Outputs of h1: a2 and y2
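The two ways of building a context vector described so far (a plain average for c0, an attention-weighted sum for c1) can be sketched in NumPy. This is only an illustration: the sizes, the random features and the random scores are made up, and it assumes one scalar weight per location (L in total) obtained with a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 4, 3                      # hypothetical sizes: 4 locations, 3-dim features
f = rng.normal(size=(L, D))      # stand-in for CNN features f, one row per location

# First context vector c0: plain average over the L locations.
c0 = f.mean(axis=0)

# Stand-in for the attention scores the decoder would emit at this step.
scores = rng.normal(size=L)
a1 = np.exp(scores - scores.max())
a1 /= a1.sum()                   # softmax: positive weights that sum to 1

# Next context vector c1: features weighted by the attention weights.
c1 = a1 @ f                      # shape (D,)

# Uniform weights recover the average used for c0.
uniform = np.full(L, 1.0 / L)
```

Seen this way, the average used for c0 is just the special case of attention with uniform weights.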
Attention for Image Captioning
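Image: H x W x 3 -> CNN -> features f
The same step repeats at every timestep:
h0 takes c0 and y0, and outputs a1 and y1
h1 takes c1 and y1, and outputs a2 and y2
h2 takes c2 and y2, and outputs a3 and y3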
Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
Some outputs can probably be predicted without looking at the image...
Can we focus on the image only when necessary?
Attention for Image Captioning
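“Regular” spatial attention: each LSTM state (h0, h1, h2) takes a context vector and the previous word (c0 y0, c1 y1, c2 y2) and outputs attention weights and a predicted word (a1 y1, a2 y2, a3 y3)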
Attention for Image Captioning
Attention with sentinel: the LSTM is modified to output a “non-visual” feature to attend to
The LSTM emits a sentinel alongside each hidden state (s0 h0, s1 h1, s2 h2); the inputs (c0 y0, c1 y1, c2 y2) and outputs (a1 y1, a2 y2, a3 y3) are laid out as in “regular” spatial attention
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
Attention for Image Captioning
Attention weights indicate when it’s more important to look at the image features, and when it’s better to rely on the current LSTM state
If sum(a[0:LxD]) > a[LxD]: image features are needed for the final decision
Else: the RNN state is enough to predict the next word
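The decision rule above can be written as a small check. This is a sketch under one assumption: the attention vector stores the weights on the visual entries first and a single sentinel weight as its last entry. The function name and the example vectors are hypothetical.

```python
import numpy as np

def needs_image(a):
    """Return True when the visual entries outweigh the sentinel.

    Assumed layout: a[:-1] are the weights on image features,
    a[-1] is the weight on the sentinel (the "non-visual" feature).
    """
    visual_mass = a[:-1].sum()
    sentinel_mass = a[-1]
    return bool(visual_mass > sentinel_mass)

# Made-up attention vectors, each summing to 1.
a_visual = np.array([0.30, 0.25, 0.20, 0.15, 0.10])   # mass mostly on the image
a_textual = np.array([0.05, 0.05, 0.05, 0.05, 0.80])  # mass mostly on the sentinel

look_for_visual_word = needs_image(a_visual)
look_for_textual_word = needs_image(a_textual)
```

With these made-up vectors the first case relies on the image and the second on the LSTM state, which matches the intuition that some words can be predicted without looking at the image.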
Soft Attention
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
Image: H x W x 3 -> CNN -> grid of features a, b, c, d (each D-dimensional)
From RNN: distribution over grid locations $p_a, p_b, p_c, p_d$, with $p_a + p_b + p_c + p_d = 1$
Soft attention: summarize ALL locations
$z = p_a\,a + p_b\,b + p_c\,c + p_d\,d$
Context vector z (D-dimensional)
Derivative dz/dp is nice! Train with gradient descent
Slide Credit: CS231n
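Why the derivative is "nice" can be spelled out in a short derivation. The first line follows directly from the weighted sum on the slide. The second and third lines additionally assume that the distribution p comes from a softmax over scores $e_i$ produced by the RNN; the symbols $v_i$ (the grid features a, b, c, d) and $e_i$ are introduced here only for illustration.

```latex
\begin{aligned}
z &= \sum_{i \in \{a,b,c,d\}} p_i\, v_i
  \quad\Rightarrow\quad \frac{\partial z}{\partial p_i} = v_i \\
p_i &= \frac{\exp(e_i)}{\sum_k \exp(e_k)}
  \quad\Rightarrow\quad \frac{\partial p_i}{\partial e_j} = p_i\,(\delta_{ij} - p_j) \\
\frac{\partial z}{\partial e_j} &= \sum_i v_i\, p_i\,(\delta_{ij} - p_j) = p_j\,(v_j - z)
\end{aligned}
```

Every term is smooth in the scores, so gradients flow from the context vector back to whatever produced the attention weights.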
Soft Attention
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
Differentiable function: train with gradient descent
● Still uses the whole input!
● Constrained to a fixed grid
Hard Attention
Hard attention: sample a subset of the input
Input image: H x W x 3
Box coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3
Not a differentiable function! Can’t train with backprop :(
Need other optimization strategies, e.g. reinforcement learning
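A minimal sketch of such a crop-and-rescale step shows where differentiability is lost. It assumes box coordinates in input pixel units and nearest-neighbour rescaling, which is the simplest choice and not necessarily what any cited paper uses; `hard_crop` is a hypothetical helper.

```python
import numpy as np

def hard_crop(image, xc, yc, w, h, out_h, out_w):
    """Crop the box (xc, yc, w, h) and rescale it to out_h x out_w
    with nearest-neighbour sampling (an assumed rescaling method)."""
    H, W = image.shape[:2]
    # Centres of the output pixels, mapped into the box (input pixel units).
    ys = yc - h / 2 + (np.arange(out_h) + 0.5) * h / out_h
    xs = xc - w / 2 + (np.arange(out_w) + 0.5) * w / out_w
    # Rounding to integer indices is the non-differentiable step.
    yi = np.clip(np.floor(ys).astype(int), 0, H - 1)
    xi = np.clip(np.floor(xs).astype(int), 0, W - 1)
    return image[yi[:, None], xi[None, :]]

image = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
crop = hard_crop(image, xc=4.0, yc=4.0, w=4.0, h=4.0, out_h=2, out_w=2)
# Nudging the box slightly leaves the integer indices, and so the output, unchanged.
crop_shifted = hard_crop(image, xc=4.1, yc=4.0, w=4.0, h=4.0, out_h=2, out_w=2)
```

Because the output is piecewise constant in the box coordinates, its gradient with respect to them is zero almost everywhere, which is why backprop gives no useful training signal here.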
Spatial Transformer Networks
Input image: H x W x 3
Box coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3 -> CNN -> bird
Not a differentiable function! Can’t train with backprop :(
Make it differentiable. Train with backprop :)
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Input image: H x W x 3
Cropped and rescaled image: X x Y x 3
Can we make this function differentiable?
Idea: function mapping pixel coordinates (xt, yt) of output to pixel coordinates (xs, ys) of input
Mapping given by box coordinates (translation + scale)
Repeat for all pixels in output
Network attends to input by predicting the mapping parameters
Slide Credit: CS231n
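The coordinate-mapping idea can be sketched with a translation + scale mapping followed by bilinear sampling. This is an illustrative NumPy version, not the implementation from the paper: it assumes box coordinates in input pixel units, target coordinates normalized to [-1, 1], and an image with at least 2 x 2 pixels; `transform_sample` is a hypothetical name.

```python
import numpy as np

def transform_sample(image, xc, yc, w, h, out_h, out_w):
    """Bilinearly sample the box (xc, yc, w, h) into an out_h x out_w grid."""
    H, W = image.shape[:2]
    # Target coordinates (xt, yt), normalized to [-1, 1].
    xt = np.linspace(-1.0, 1.0, out_w)
    yt = np.linspace(-1.0, 1.0, out_h)
    # Translation + scale: source coordinates (xs, ys) in the input image.
    xs = np.clip(xc + 0.5 * w * xt, 0, W - 1)
    ys = np.clip(yc + 0.5 * h * yt, 0, H - 1)
    # Top-left neighbour of each source point, kept inside the image.
    x0 = np.minimum(np.floor(xs).astype(int), W - 2)
    y0 = np.minimum(np.floor(ys).astype(int), H - 2)
    # Fractional offsets vary smoothly with the box parameters.
    fx = (xs - x0)[None, :, None]
    fy = (ys - y0)[:, None, None]
    Y0, X0 = y0[:, None], x0[None, :]
    top = (1 - fx) * image[Y0, X0] + fx * image[Y0, X0 + 1]
    bottom = (1 - fx) * image[Y0 + 1, X0] + fx * image[Y0 + 1, X0 + 1]
    return (1 - fy) * top + fy * bottom

# Ramp image whose value equals the x coordinate, shape (8, 8, 1).
ramp = np.tile(np.arange(8.0)[None, :, None], (8, 1, 1))
out = transform_sample(ramp, xc=3.5, yc=3.5, w=3.0, h=3.0, out_h=2, out_w=3)
# Nudging the box changes the output smoothly instead of leaving it unchanged.
out_shifted = transform_sample(ramp, xc=3.6, yc=3.5, w=3.0, h=3.0, out_h=2, out_w=3)
```

On the ramp image the sampled values are simply the source x coordinates, so a small shift of the box produces an equally small shift in the output. That smooth dependence is what lets gradients reach the box parameters.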
Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Differentiable module: easy to incorporate in any network, anywhere!
Insert spatial transformers into a classification network and it learns to attend and transform the input
Fine-grained classification
Also used as an alternative to RoI pooling in proposal-based detection & segmentation pipelines
Deformable Convolutions
Dai, Qi, Xiong, Li, Zhang et al. Deformable Convolutional Networks. arXiv Mar 2017
Dynamic & learnable receptive field
Resources
Seq2seq implementations with attention:
● TensorFlow
● PyTorch
Spatial Transformers:
● TensorFlow
● Coming soon to PyTorch (thread here)
Deformable Convolutions:
● MXNet (Original)
● TensorFlow / Keras (slow)
● [WIP] PyTorch
Questions?
Attention Mechanism
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The vector to be fed to the RNN at each timestep is a weighted sum of all the annotation vectors.
Attention Mechanism
An attention weight (scalar) is predicted at each time-step for each annotation vector hj with a simple fully connected neural network.
Inputs: annotation vector h1 and recurrent state zi. Output: attention weight a1.
Attention Mechanism
The same fully connected network is shared for all j: for example, annotation vector h2 and recurrent state zi give attention weight a2.
Attention Mechanism
Once a relevance score (weight) is estimated for each word, the scores are normalized with a softmax function so they sum up to 1.
Attention Mechanism
Finally, a context-aware representation $c_{i+1}$ for the output word at timestep i can be defined as the attention-weighted sum of the annotation vectors: $c_{i+1} = \sum_j a_j h_j$
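Read together, these slides describe three steps: score each annotation vector, normalize the scores, and take a weighted sum. A minimal NumPy sketch follows. It assumes a single tanh hidden layer for the "simple fully connected neural network"; the sizes and the parameter names W_h, W_z and v are hypothetical and randomly initialized rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, Z, K = 5, 4, 3, 8           # hypothetical sizes: T source words, hidden width K
h = rng.normal(size=(T, H))       # annotation vectors h_j, one row per source word
z = rng.normal(size=Z)            # recurrent (decoder) state z_i

# Parameters of the small scoring network, shared for all j.
W_h = rng.normal(size=(H, K))
W_z = rng.normal(size=(Z, K))
v = rng.normal(size=K)

# 1) One scalar relevance score per annotation vector.
e = np.tanh(h @ W_h + z @ W_z) @ v        # shape (T,)

# 2) Softmax so the weights are positive and sum to 1.
a = np.exp(e - e.max())
a /= a.sum()

# 3) Context vector: weighted sum of all the annotation vectors.
c = a @ h                                  # shape (H,)
```

Sharing the scoring parameters across j means the same network handles source sentences of any length.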
Attention Mechanism
The model automatically finds the correspondence structure between two languages (alignment).
(Edge thicknesses represent the attention weights found by the attention model)
Attention Models
Attend to different parts of the input to optimize a certain output
Attention Models
Chan et al. Listen, Attend and Spell. ICASSP 2016
Source: distill.pub
Input: Audio features; Output: Text
Attention for Image Captioning
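Side-note: attention can be computed with previous or current hidden state
Image: H x W x 3 -> CNN -> v (average of the features)
Here each LSTM state (h1, h2, h3) takes v and the previous word (v y0, v y1, v y2); the attention weights, context vector and predicted word of that step (a1 c1 y1, a2 c2 y2, a3 c3 y3) are computed from the current hidden state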
Attention for Image Captioning
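Attention with sentinel: the LSTM is modified to output a “non-visual” feature to attend to
Image: H x W x 3 -> CNN -> v (average of the features)
Each step takes v and the previous word (v y0, v y1, v y2); the LSTM outputs a sentinel alongside its hidden state (s1 h1, s2 h2, s3 h3), which yield the attention weights, context vector and predicted word (a1 c1 y1, a2 c2 y2, a3 c3 y3)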
Semantic Attention: Image Captioning
You et al. Image Captioning with Semantic Attention. CVPR 2016
Visual Attention: Saliency Detection
Kuen et al. Recurrent Attentional Networks for Saliency Detection. CVPR 2016
Visual Attention: Fixation Prediction
Cornia et al. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model.

Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 

Recently uploaded (20)

一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 

Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)

  • 1. [course site] Attention Models Day 3 Lecture 6 #DLUPC Amaia Salvador amaia.salvador@upc.edu PhD Candidate Universitat Politècnica de Catalunya
  • 2. Attention Models: Motivation. Image (H x W x 3) → "bird". The whole input volume is used to predict the output, despite the fact that not all pixels are equally important.
  • 3. Attention Models: Motivation. "A bird flying over a body of water". Attend to different parts of the input to optimize a certain output. Case study: Image Captioning.
  • 4. Previously, in D3L5 (Image Captioning): the Multimodal Recurrent Neural Network only takes image features into account in the first hidden state. Karpathy and Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
  • 5. LSTM Decoder for Image Captioning. Diagram: CNN → features (D) → chain of LSTMs → "A bird flying ... <EOS>". Limitation: all output predictions are based on the final and static output of the encoder. Vinyals et al. Show and tell: A neural image caption generator. CVPR 2015
  • 6. Attention for Image Captioning. Diagram: Image (H x W x 3) → CNN.
  • 7. Attention for Image Captioning. Diagram: Image (H x W x 3) → CNN → features f (L x D). Inputs to h0: c0 (the first context vector is the average) and y0 (the first word, the <start> token). Outputs of h0: a1 (attention weights, LxD) and y1 (predicted word).
  • 8. Attention for Image Captioning. Visual features weighted with attention give the next context vector, c1. Inputs to h1: c1 and y1 (the word predicted in the previous timestep). Outputs of h1: a2 and y2.
  • 9. Attention for Image Captioning. Diagram, unrolled over three steps: (c0, y0) → h0 → (a1, y1); (c1, y1) → h1 → (a2, y2); (c2, y2) → h2 → (a3, y3).
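
The loop on slides 7 to 9 can be sketched in plain Python. This is a toy illustration under stated assumptions, not the model from the lecture: `decoder_step` is a hypothetical stand-in for the LSTM, its dot-product scoring is invented so that the example runs, and the attention is assumed to hold one weight per location (L weights).

```python
import math

def softmax(scores):
    """Normalise raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, features):
    """Context vector: sum_i weights[i] * features[i] (each feature is D-dim)."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

def decoder_step(context, word, features):
    """Hypothetical placeholder for the recurrent step.

    Returns (attention weights, next word). Locations are scored by their
    dot product with the context; a real model would use learned layers.
    """
    scores = [sum(c * x for c, x in zip(context, f)) for f in features]
    attention = softmax(scores)
    next_word = word + 1  # dummy vocabulary index so the loop has an output
    return attention, next_word

def caption(features, num_steps, start_token=0):
    num_locations = len(features)
    # c0: the first context vector is the average of the features.
    context = weighted_sum([1.0 / num_locations] * num_locations, features)
    word = start_token  # y0: the <start> token
    words, attentions = [], []
    for _ in range(num_steps):
        attention, word = decoder_step(context, word, features)  # a_t, y_t
        context = weighted_sum(attention, features)              # c_t
        words.append(word)
        attentions.append(attention)
    return words, attentions

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # L = 3 locations, D = 2
words, attentions = caption(features, num_steps=3)
```

Each step consumes the context vector built from the previous step's attention, which is the dependency the diagram draws between (a1, y1) and (c1, y1).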
  • 10. Attention for Image Captioning. Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
  • 11. Attention for Image Captioning. Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
  • 12. Attention for Image Captioning. Some outputs can probably be predicted without looking at the image... Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
  • 13. Attention for Image Captioning. Some outputs can probably be predicted without looking at the image... Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
  • 14. Attention for Image Captioning. Can we focus on the image only when necessary?
  • 15. Attention for Image Captioning. “Regular” spatial attention. Diagram: (c0, y0) → h0 → (a1, y1); (c1, y1) → h1 → (a2, y2); (c2, y2) → h2 → (a3, y3).
  • 16. Attention for Image Captioning. Attention with sentinel: the LSTM is modified to output a “non-visual” feature to attend to. Diagram: each step now outputs a sentinel alongside its hidden state: (s0, h0), (s1, h1), (s2, h2). Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
  • 17. Attention for Image Captioning. Attention weights indicate when it’s more important to look at the image features, and when it’s better to rely on the current LSTM state. If sum(a[0:LxD]) > a[LxD], image features are needed for the final decision; else, the RNN state is enough to predict the next word. Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
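
The decision rule on slide 17 can be written as a few lines of Python. This is a minimal sketch, assuming the attention vector stores the visual weights first and a single sentinel weight last; the function name `relies_on_image` is hypothetical.

```python
def relies_on_image(attention):
    """Return True when the visual weights outweigh the sentinel weight.

    Assumed layout: attention[:-1] are the weights on image features,
    attention[-1] is the weight on the sentinel (non-visual) feature.
    """
    visual_mass = sum(attention[:-1])
    sentinel_weight = attention[-1]
    return visual_mass > sentinel_weight

print(relies_on_image([0.3, 0.3, 0.2, 0.2]))  # True: visual mass 0.8 > 0.2
print(relies_on_image([0.1, 0.1, 0.1, 0.7]))  # False: visual mass 0.3 < 0.7
```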
  • 18. Soft Attention. Image (H x W x 3) → CNN → grid of features a, b, c, d (each D-dimensional). From the RNN: a distribution over grid locations (pa, pb, pc, pd), with pa + pb + pc + pd = 1. Soft attention: summarize ALL locations, z = pa·a + pb·b + pc·c + pd·d, which gives the context vector z (D-dimensional). The derivative dz/dp is nice! Train with gradient descent. Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015. Slide credit: CS231n
  • 19. Soft Attention. Same setup: a distribution (pa, pb, pc, pd) over the grid of features a, b, c, d, with pa + pb + pc + pd = 1, and context vector z = pa·a + pb·b + pc·c + pd·d. This is a differentiable function, so it can be trained with gradient descent. Drawbacks: ● Still uses the whole input! ● Constrained to a fixed grid. Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015. Slide credit: CS231n
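
A small numeric example may help with the soft attention formula and the remark that dz/dp is well behaved. The sketch below uses made-up feature values and checks by finite differences that the partial derivative of z with respect to one probability is simply the corresponding feature vector; the check deliberately ignores the sum-to-one constraint.

```python
def soft_attention(probs, features):
    """Context vector z = sum_i probs[i] * features[i]."""
    dim = len(features[0])
    return [sum(p * f[d] for p, f in zip(probs, features)) for d in range(dim)]

# Four grid features a, b, c, d (D = 2) and a distribution over them.
# The numbers are illustrative only.
features = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0], [1.0, 1.0]]
probs = [0.125, 0.125, 0.25, 0.5]  # sums to 1
z = soft_attention(probs, features)

# Finite-difference estimate of dz/dpc (index 2). Because z is linear in
# the probabilities, the estimate is exact for any step size.
eps = 0.5
bumped = list(probs)
bumped[2] += eps
z_bumped = soft_attention(bumped, features)
dz_dpc = [(zb - z0) / eps for zb, z0 in zip(z_bumped, z)]

print(z)       # [1.375, 0.875]
print(dz_dpc)  # [3.0, 0.0], i.e. the feature vector c
```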
  • 20. Hard Attention. Hard attention: sample a subset of the input. Input image (H x W x 3) + box coordinates (xc, yc, w, h) → cropped and rescaled image (X x Y x 3). Not a differentiable function! Can’t train with backprop :( Need other optimization strategies, e.g. reinforcement learning.
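
To see why a hard crop gives no useful gradient, here is a minimal sketch that crops a box and rescales it with nearest-neighbour sampling. It assumes a single-channel image stored as nested lists (the slide uses H x W x 3), and `hard_crop` is a hypothetical helper, not code from the lecture.

```python
def hard_crop(image, xc, yc, w, h, out_w, out_h):
    """Crop the box centred at (xc, yc) with size (w, h), then rescale it
    to out_h x out_w by picking the nearest source pixel for each output."""
    rows, cols = len(image), len(image[0])
    x0, y0 = xc - w / 2.0, yc - h / 2.0
    output = []
    for j in range(out_h):
        # Integer indexing: this rounding step is what breaks differentiability.
        src_y = min(rows - 1, max(0, int(y0 + (j + 0.5) * h / out_h)))
        row = []
        for i in range(out_w):
            src_x = min(cols - 1, max(0, int(x0 + (i + 0.5) * w / out_w)))
            row.append(image[src_y][src_x])
        output.append(row)
    return output

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4 toy image

crop_a = hard_crop(image, 2.0, 2.0, 2, 2, 2, 2)  # [[5, 6], [9, 10]]
crop_b = hard_crop(image, 2.1, 2.0, 2, 2, 2, 2)  # same pixels as crop_a
crop_c = hard_crop(image, 2.6, 2.0, 2, 2, 2, 2)  # [[6, 7], [10, 11]]
```

A small change in `xc` leaves the output untouched, and a larger one makes it jump, so the output is piecewise constant in the box coordinates and backprop receives no informative gradient.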
  • 21. Spatial Transformer Networks. Input image (H x W x 3) + box coordinates (xc, yc, w, h) → cropped and rescaled image (X x Y x 3) → CNN → "bird". As a hard crop this is not a differentiable function, so it can’t be trained with backprop :( Make it differentiable and train with backprop :) Jaderberg et al. Spatial Transformer Networks. NIPS 2015
  • 22. Spatial Transformer Networks. Input image (H x W x 3) → cropped and rescaled image (X x Y x 3). Can we make this function differentiable? Idea: use a function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input, with the mapping given by the box coordinates (translation + scale). Repeat for all pixels in the output. The network attends to the input by predicting the parameters of this mapping. Jaderberg et al. Spatial Transformer Networks. NIPS 2015. Slide credit: CS231n
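
The coordinate-mapping idea can be sketched as follows. This is an illustration under assumptions, not the reference implementation: it uses a single-channel image as nested lists, normalised coordinates in [-1, 1], one shared scale plus a translation, and bilinear interpolation to read the input at real-valued positions. The names `bilinear`, `to_normalised` and `spatial_transform` are hypothetical.

```python
import math

def bilinear(image, x, y):
    """Sample a single-channel image at a real-valued pixel position (x, y)."""
    rows, cols = len(image), len(image[0])
    x = min(max(x, 0.0), cols - 1.0)
    y = min(max(y, 0.0), rows - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * image[y0][x0] + dx * image[y0][x1]
    bottom = (1 - dx) * image[y1][x0] + dx * image[y1][x1]
    return (1 - dy) * top + dy * bottom

def to_normalised(k, n):
    """Map index k in 0..n-1 to the range [-1, 1]."""
    return -1.0 + 2.0 * k / (n - 1) if n > 1 else 0.0

def spatial_transform(image, scale, shift_x, shift_y, out_h, out_w):
    """For every output pixel (xt, yt), read the input at (xs, ys) where
    xs = scale * xt + shift_x and ys = scale * yt + shift_y."""
    rows, cols = len(image), len(image[0])
    output = []
    for j in range(out_h):
        ys = scale * to_normalised(j, out_h) + shift_y
        py = (ys + 1.0) * (rows - 1) / 2.0  # back to pixel coordinates
        row = []
        for i in range(out_w):
            xs = scale * to_normalised(i, out_w) + shift_x
            px = (xs + 1.0) * (cols - 1) / 2.0
            row.append(bilinear(image, px, py))
        output.append(row)
    return output

image = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
identity = spatial_transform(image, 1.0, 0.0, 0.0, 3, 3)  # reproduces image
zoomed = spatial_transform(image, 0.5, 0.0, 0.0, 3, 3)    # central crop
shifted = spatial_transform(image, 0.5, 0.1, 0.0, 3, 3)   # crop moved right
```

Because the output values are interpolated, they change smoothly with `scale`, `shift_x` and `shift_y`, which is what lets gradients flow back to whatever predicts those parameters.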
  • 23. Spatial Transformer Networks. Differentiable module: easy to incorporate in any network, anywhere! Insert spatial transformers into a classification network and it learns to attend and transform the input. Jaderberg et al. Spatial Transformer Networks. NIPS 2015
  • 24. Spatial Transformer Networks. Fine-grained classification. Also used as an alternative to RoI pooling in proposal-based detection & segmentation pipelines. Jaderberg et al. Spatial Transformer Networks. NIPS 2015
  • 25. Deformable Convolutions. Dynamic & learnable receptive field. Dai, Qi, Xiong, Li, Zhang et al. Deformable Convolutional Networks. arXiv Mar 2017
  • 26. Resources. Seq2seq implementations with attention: ● TensorFlow ● PyTorch. Spatial Transformers: ● TensorFlow ● Coming soon to PyTorch (thread here). Deformable Convolutions: ● MXNet (Original) ● TensorFlow / Keras (slow) ● [WIP] PyTorch
  • 28. Attention Mechanism. The vector to be fed to the RNN at each timestep is a weighted sum of all the annotation vectors. Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
  • 29. Attention Mechanism. An attention weight (scalar) is predicted at each time-step for each annotation vector hj with a simple fully connected neural network. Example: annotation vector h1 and recurrent state zi → attention weight a1. Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
  • 30. Attention Mechanism. An attention weight (scalar) is predicted at each time-step for each annotation vector hj with a simple fully connected neural network. Example: annotation vector h2 and recurrent state zi → attention weight a2. The network is shared for all j. Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
  • 31. Attention Mechanism. Once a relevance score (weight) is estimated for each word, the scores are normalized with a softmax function so that they sum to 1. Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
  • 32. Attention Mechanism. Finally, a context-aware representation ci+1 for the output word at timestep i can be defined as the attention-weighted sum of the annotation vectors: c_{i+1} = Σ_j a_j h_j. Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
  • 33. Attention Mechanism. The model automatically finds the correspondence structure between two languages (alignment). Edge thicknesses represent the attention weights found by the attention model. Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
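
The three steps on slides 29 to 32 (score each annotation vector, normalise with a softmax, take the weighted sum) can be sketched in plain Python. The exact form of the scoring network is an assumption here: a single tanh hidden layer, score = v · tanh(W z + U h_j), with the same W, U and v shared for all j. The weight values and the names `matvec`, `score` and `attend` are illustrative only.

```python
import math

def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def score(state, annotation, W, U, v):
    """Relevance score from a small fully connected network (assumed form)."""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W, state), matvec(U, annotation))]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def attend(state, annotations, W, U, v):
    """Return (attention weights, context vector) for one decoder step."""
    scores = [score(state, h, W, U, v) for h in annotations]  # shared for all j
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]                        # softmax
    dim = len(annotations[0])
    context = [sum(w * h[d] for w, h in zip(weights, annotations))
               for d in range(dim)]                            # weighted sum
    return weights, context

# Illustrative values: recurrent state z_i and three annotation vectors h_j.
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
state = [1.0, 0.0]
annotations = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

weights, context = attend(state, annotations, W, U, v)
```

In a trained translation model the weights would be learned, and plotting `weights` for every output word gives the alignment picture described on slide 33.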
  • 34. Attention Models. Attend to different parts of the input to optimize a certain output.
  • 35. Attention Models. Attend to different parts of the input to optimize a certain output. Input: audio features; output: text. Chan et al. Listen, Attend and Spell. ICASSP 2016. Source: distill.pub
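  • 36. Attention for Image Captioning. Side-note: attention can be computed with the previous or the current hidden state. Diagram labels: image (H x W x 3), CNN, v (average); inputs (v, y0), (v, y1), (v, y2); hidden states h1, h2, h3; attention weights a1, a2, a3; outputs y1, y2, y3; context vectors c1, c2, c3.
  • 37. Attention for Image Captioning. Attention with sentinel: the LSTM is modified to output a “non-visual” feature to attend to. Diagram labels: image (H x W x 3), CNN, v (average); inputs (v, y0), (v, y1), (v, y2); sentinel and hidden state pairs (s1, h1), (s2, h2), (s3, h3); attention weights a1, a2, a3; outputs y1, y2, y3; context vectors c1, c2, c3.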
  • 38. Semantic Attention: Image Captioning. You et al. Image Captioning with Semantic Attention. CVPR 2016
  • 39. Visual Attention: Saliency Detection. Kuen et al. Recurrent Attentional Networks for Saliency Detection. CVPR 2016
  • 40. Visual Attention: Fixation Prediction. Cornia et al. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model.