Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)

[course site]
Attention Models
Day 3 Lecture 6
#DLUPC
Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
Universitat Politècnica de Catalunya

Attention Models: Motivation
Image:
H x W x 3
bird
The whole input volume is used to predict the output...
...despite the fact that not all pixels are equally important

Attention Models: Motivation
A bird flying over a body of water
Attend to different parts of the input to optimize a certain output
Case study: Image Captioning

Previously D3L5: Image Captioning
Multimodal Recurrent Neural Network: only takes into account image features in the first hidden state
Karpathy and Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015

LSTM Decoder for Image Captioning
Figure: a CNN encodes the image into features of dimension D, and a chain of LSTM cells decodes them word by word into the caption "A bird flying ... <EOS>".
Vinyals et al. Show and tell: A neural image caption generator. CVPR 2015
Limitation: All output predictions are based on the final and static output
of the encoder

Attention for Image Captioning
Image: H x W x 3
CNN features f: L x D
Initial hidden state: h0
First step: (c0, y0) → h0 → (a1, y1)
● c0: the first context vector is the average of the features
● y0: first word (<start> token)
● a1: attention weights (one per location)
● y1: predicted word
Next step: (c1, y1) → h1 → (a2, y2)
● c1: visual features weighted with the attention a1 give the next context vector
● y1: word predicted in the previous timestep
And so on: (c2, y2) → h2 → (a3, y3)

Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
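The loop on the previous slides can be sketched in NumPy. This is a minimal illustration, not the model from the paper: the weight matrices `W_h`, `W_c`, `W_a`, the hidden size and the plain tanh recurrence are hypothetical stand-ins for a trained LSTM, and the word inputs y are left out to keep the attention path visible.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_decode(features, steps, rng):
    """Toy decoder loop; features is the (L, D) grid of CNN features."""
    L, D = features.shape
    H = 8  # hypothetical hidden size
    # Hypothetical random parameters; a real model would learn these.
    W_h = rng.normal(scale=0.1, size=(H, H))
    W_c = rng.normal(scale=0.1, size=(H, D))
    W_a = rng.normal(scale=0.1, size=(D, H))
    h = np.zeros(H)                          # h0
    context = features.mean(axis=0)          # c0: plain average of the features
    contexts, weights = [context], []
    for _ in range(steps):
        h = np.tanh(W_h @ h + W_c @ context)  # update the state from the context
        scores = features @ (W_a @ h)         # one score per location, shape (L,)
        a = softmax(scores)                   # attention weights, sum to 1
        context = a @ features                # next context: weighted features, (D,)
        weights.append(a)
        contexts.append(context)
    return np.array(weights), np.array(contexts)

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))            # L = 6 locations, D = 4 channels
weights, contexts = attention_decode(features, steps=3, rng=rng)
```

Each row of `weights` is one distribution over the L locations, and each context vector after the first is the corresponding weighted sum of the features.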

Some outputs can probably be predicted without looking at the image...

Can we focus on the image only when necessary?

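“Regular” spatial attention: each hidden state takes a context vector and a word, and outputs the next attention weights and word
(c0, y0) → h0 → (a1, y1)
(c1, y1) → h1 → (a2, y2)
(c2, y2) → h2 → (a3, y3)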

Attention with sentinel: LSTM is modified to output a “non-visual” feature to attend to
(c0, y0) → (s0, h0) → (a1, y1)
(c1, y1) → (s1, h1) → (a2, y2)
(c2, y2) → (s2, h2) → (a3, y3)
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017

Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
Attention weights indicate when it’s more important to look at the image features, and when it’s better to rely on the current LSTM state.
With a[0:L] the weights over the L image locations and a[L] the weight of the sentinel:
If sum(a[0:L]) > a[L]: image features are needed for the final decision
Else: RNN state is enough to predict the next word
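The rule above can be sketched as a softmax over the location scores plus one extra sentinel score. The function name `adaptive_attention` and the example scores are hypothetical; how the scores are produced by the LSTM is not shown here.

```python
import numpy as np

def adaptive_attention(region_scores, sentinel_score):
    """Softmax over L region scores plus one sentinel score (last entry)."""
    scores = np.append(region_scores, sentinel_score)
    e = np.exp(scores - scores.max())
    a = e / e.sum()                      # shape (L + 1,), sums to 1
    L = len(region_scores)
    needs_image = a[:L].sum() > a[L]     # the rule on the slide
    return a, needs_image

# Hypothetical scores: strong visual evidence vs. a dominant sentinel.
a_visual, look_visual = adaptive_attention(np.array([2.0, 1.0, 0.5]), 0.0)
a_word, look_word = adaptive_attention(np.array([0.0, 0.0, 0.0]), 3.0)
```

Because the weights sum to 1, the test is equivalent to checking whether the sentinel weight is below 0.5.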

Soft Attention
Image: H x W x 3
CNN
Grid of features a, b, c, d (each D-dimensional)
From RNN: distribution over grid locations, $p_a + p_b + p_c + p_d = 1$
Soft attention: summarize ALL locations
$z = p_a a + p_b b + p_c c + p_d d$
Context vector z (D-dimensional)
Derivative dz/dp is nice!
Train with gradient descent
Slide Credit: CS231n

Soft Attention
Soft attention is a differentiable function: train with gradient descent
● Still uses the whole input!
● Constrained to a fixed grid
Slide Credit: CS231n
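As a small worked example of the weighted sum, assuming a hypothetical 2 x 2 grid with D = 3 and made-up probabilities:

```python
import numpy as np

# Hypothetical 2 x 2 grid: four D-dimensional feature vectors a, b, c, d (D = 3).
grid = np.array([
    [1.0, 0.0, 0.0],   # a
    [0.0, 1.0, 0.0],   # b
    [0.0, 0.0, 1.0],   # c
    [1.0, 1.0, 1.0],   # d
])
p = np.array([0.1, 0.2, 0.3, 0.4])   # distribution over locations, sums to 1

z = p @ grid            # context vector: z = pa*a + pb*b + pc*c + pd*d

# z is linear in p, so the Jacobian dz/dp is just the features themselves:
# column i of dz/dp is the feature vector at location i.
jacobian = grid.T       # shape (D, 4)
```

This is why the derivative is "nice": the gradient with respect to each probability is simply the feature vector at that location, so the whole operation can be trained with backprop.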

Hard Attention
Hard attention: sample a subset of the input
Input image: H x W x 3
Box coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3
Not a differentiable function!
Can’t train with backprop :(
Need other optimization strategies, e.g. reinforcement learning
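One common reinforcement-learning workaround is a score-function (REINFORCE) gradient estimate. The sketch below is an illustration under assumptions, not the method of any particular paper: it picks one of four hypothetical grid locations instead of a box, and the toy `reward_fn` and the learning rate are made up.

```python
import numpy as np

def hard_attention_step(logits, features, reward_fn, rng):
    """Sample one location and return a score-function gradient estimate."""
    e = np.exp(logits - logits.max())
    p = e / e.sum()
    idx = rng.choice(len(p), p=p)        # hard choice: sampling is not differentiable
    z = features[idx]                    # only the sampled location is used
    reward = reward_fn(z)
    one_hot = np.zeros_like(p)
    one_hot[idx] = 1.0
    # REINFORCE: reward * d log p[idx] / d logits = reward * (one_hot - p)
    grad_logits = reward * (one_hot - p)
    return idx, z, grad_logits

rng = np.random.default_rng(0)
features = np.eye(4)                     # four hypothetical locations
reward_fn = lambda z: float(z[2])        # toy reward: 1 only for location 2
logits = np.zeros(4)
for _ in range(200):
    _, _, g = hard_attention_step(logits, features, reward_fn, rng)
    logits += 0.5 * g                    # gradient ascent on the expected reward
best = int(np.argmax(logits))
```

The gradient never flows through the sampling itself; the reward only rescales the gradient of the log-probability of the choice that was made.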

Spatial Transformer Networks
Input image: H x W x 3
Box coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3
CNN → bird
Not a differentiable function! Can’t train with backprop :(
Make it differentiable: train with backprop :)
Jaderberg et al. Spatial Transformer Networks. NIPS 2015

Input image: H x W x 3
Cropped and rescaled image: X x Y x 3
Can we make this function differentiable?
Idea: function mapping pixel coordinates (xt, yt) of output to pixel coordinates (xs, ys) of input
Mapping given by box coordinates (translation + scale)
Repeat for all pixels in output
Network attends to input by predicting the parameters of this mapping
Slide Credit: CS231n
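A minimal NumPy sketch of this idea, assuming the box is given in input pixel units and using bilinear interpolation to read the source pixels. The function name `crop_with_box` is hypothetical, and NumPy computes no gradients itself; the point is that every output pixel is a piecewise-linear function of (xc, yc, w, h).

```python
import numpy as np

def crop_with_box(image, box, out_h, out_w):
    """image: (H, W, C); box = (xc, yc, w, h) in input pixel units."""
    xc, yc, w, h = box
    H, W = image.shape[:2]
    # Target coordinates, normalized to [-0.5, 0.5] over the output grid.
    yt, xt = np.meshgrid(np.linspace(-0.5, 0.5, out_h),
                         np.linspace(-0.5, 0.5, out_w), indexing="ij")
    # Translation + scale: map each output pixel to a source location.
    xs = np.clip(xc + w * xt, 0, W - 1)
    ys = np.clip(yc + h * yt, 0, H - 1)
    # Bilinear interpolation between the four neighbouring input pixels.
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (xs - x0)[..., None]
    wy = (ys - y0)[..., None]
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

image = np.arange(25, dtype=float).reshape(5, 5, 1)   # toy 5 x 5 x 1 image
patch = crop_with_box(image, box=(2.0, 2.0, 2.0, 2.0), out_h=3, out_w=3)
```

With an integer-aligned box the result is an ordinary crop; with a fractional centre the same code blends neighbouring pixels, which is what lets the box coordinates vary smoothly.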

Differentiable module: easy to incorporate in any network, anywhere!
Insert spatial transformers into a classification network and it learns to attend and transform the input

Fine-grained classification
Also used as an alternative to RoI pooling in proposal-based detection & segmentation pipelines

Deformable Convolutions
Dai, Qi, Xiong, Li, Zhang et al. Deformable Convolutional Networks. arXiv Mar 2017
Dynamic & learnable receptive field
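To make "dynamic receptive field" concrete, here is a 1-D simplification, assuming the offsets are handed in as an array. In the paper the operation is 2-D, uses bilinear interpolation, and the offsets are predicted from the input by an extra convolutional layer; the function name `deformable_conv1d` is hypothetical.

```python
import numpy as np

def deformable_conv1d(x, weights, offsets):
    """x: (N,), weights: (K,), offsets: (N, K) fractional shifts per tap."""
    N, K = len(x), len(weights)
    out = np.zeros(N)
    for n in range(N):
        for k in range(K):
            # Regular grid position plus a per-position, per-tap offset.
            pos = np.clip(n + (k - K // 2) + offsets[n, k], 0, N - 1)
            lo = int(np.floor(pos))
            hi = min(lo + 1, N - 1)
            frac = pos - lo
            sample = (1 - frac) * x[lo] + frac * x[hi]   # linear interpolation
            out[n] += weights[k] * sample
    return out

x = np.arange(5, dtype=float)
w = np.ones(3)
regular = deformable_conv1d(x, w, np.zeros((5, 3)))      # zero offsets: plain filter
shifted = deformable_conv1d(x, w, np.full((5, 3), 0.5))  # every tap moved by 0.5
```

With all offsets at zero this reduces to an ordinary edge-clamped filter; non-zero offsets move each sampling point, so the receptive field changes with the input position.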

Resources
Seq2seq implementations with attention:
● TensorFlow
● PyTorch
Spatial Transformers
● TensorFlow
● Coming soon to PyTorch (thread here)
Deformable Convolutions
● MXNet (Original)
● TensorFlow / Keras (slow)
● [WIP] PyTorch

Attention Mechanism
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The vector to be fed to the RNN at each timestep is a
weighted sum of all the annotation vectors.

Attention Mechanism
An attention weight (scalar) is predicted at each time-step for each annotation vector $h_j$ with a simple fully connected neural network.
Inputs: annotation vector $h_1$ and recurrent state $z_i$
Output: attention weight $a_1$

Attention Mechanism
The same fully connected network predicts the attention weight for every annotation vector (shared for all j).
Inputs: annotation vector $h_2$ and recurrent state $z_i$
Output: attention weight $a_2$

Attention Mechanism
Once a relevance score (weight) $a_j$ is estimated for each word, the scores are normalized with a softmax function so they sum up to 1: $\alpha_j = \exp(a_j) / \sum_k \exp(a_k)$

Attention Mechanism
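Finally, a context-aware representation $c_{i+1}$ for the output word at timestep i can be defined as the weighted sum of all the annotation vectors: $c_{i+1} = \sum_j \alpha_j h_j$, where $\alpha_j$ are the normalized attention weights.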
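The three steps of the mechanism (a shared scoring network, softmax normalization, weighted sum) can be sketched together in NumPy. The slides only say "a simple fully connected neural network", so the single tanh hidden layer, the parameter names `W_h`, `W_z`, `v` and all sizes below are assumptions for illustration.

```python
import numpy as np

def attend(annotations, state, W_h, W_z, v):
    """annotations: (T, D) vectors h_j; state: (S,) recurrent state z_i."""
    # The same small network scores every annotation vector (shared for all j).
    hidden = np.tanh(annotations @ W_h.T + state @ W_z.T)   # (T, A)
    scores = hidden @ v                                      # (T,), one scalar per h_j
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                                      # softmax: sums to 1
    context = alpha @ annotations                            # weighted sum, (D,)
    return alpha, context

rng = np.random.default_rng(0)
T, D, S, A = 5, 4, 3, 6                  # hypothetical sizes
annotations = rng.normal(size=(T, D))
state = rng.normal(size=S)
W_h = rng.normal(size=(A, D))
W_z = rng.normal(size=(A, S))
v = rng.normal(size=A)
alpha, context = attend(annotations, state, W_h, W_z, v)
```

Sharing the scoring parameters across j means the number of annotation vectors T can change from one input to the next without changing the model.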

Attention Mechanism
The model automatically finds the correspondence structure between two languages
(alignment).
(Edge thicknesses represent the attention weights found by the attention model)

Attention Models
Chan et al. Listen, Attend and Spell. ICASSP 2016
Source: distill.pub
Input: Audio features; Output: Text

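Side-note: attention can be computed with previous or current hidden state
Figure labels: image (H x W x 3) → CNN → v (average); inputs (v, y0), (v, y1), (v, y2); hidden states h1, h2, h3; attention weights a1, a2, a3; context vectors c1, c2, c3; predicted words y1, y2, y3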

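Attention with sentinel: LSTM is modified to output a “non-visual” feature to attend to
Figure labels: image (H x W x 3) → CNN → v (average); inputs (v, y0), (v, y1), (v, y2); sentinels and hidden states (s1, h1), (s2, h2), (s3, h3); attention weights a1, a2, a3; context vectors c1, c2, c3; predicted words y1, y2, y3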

Semantic Attention: Image Captioning
You et al. Image Captioning with Semantic Attention. CVPR 2016

Visual Attention: Saliency Detection
Kuen et al. Recurrent Attentional Networks for Saliency Detection. CVPR 2016

Visual Attention: Fixation Prediction
Cornia et al. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model.
