Paper Reviews on Visual Attention

2018.3.29
SNU DATAMINING CENTER
MINKI CHUNG

WHO AM I
▸ Chung Minki
▸ BS, KAIST, IE, 2016
▸ MS, SNU, IE, 2018..?!
▸ Vision Projects
▸ Working on Semantic Image Inpainting

WHAT IS VISUAL ATTENTION
▸ Attention is HOT nowadays
▸ http://openaccess.thecvf.com/CVPR2017_search.py
▸ http://search.iclr2018.smerity.com/search/?query=attention

▸ You may have heard of
▸ Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, 2015, ICLR. "Neural Machine Translation by Jointly Learning to Align and Translate"
▸ Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, 2015, ICML. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"

▸ More,
▸ Jimmy Lei Ba, Volodymyr Mnih, Koray Kavukcuoglu, 2015, ICLR. "Multiple Object Recognition with Visual Attention"
▸ Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu, 2015, NIPS. "Spatial Transformer Networks"
▸ Jianlong Fu, Heliang Zheng, Tao Mei, 2017, CVPR. "Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition"
▸ Siavash Gorji, James J. Clark, 2017, CVPR. "Attentional Push: A Deep Convolutional Network for Augmenting Image Salience with Shared Attention Modeling in Social Scenes"

▸ Visual Attention:
▸ Attend to certain parts of an image to solve a task more efficiently
▸ Deep learning is a black-box model → attention adds interpretability

TABLE OF CONTENTS
▸ Early Works
▸ Recurrent Attention Model (RAM)
▸ Spatial Transformer Network (STN)
▸ Recent work on visual attention
▸ in ICLR
▸ in CVPR

PREREQUISITES
▸ CNN, Transposed Convolution (or Deconvolution), Dilated Convolution
▸ RNN
▸ MLP
▸ GAN
https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d

RECURRENT ATTENTION MODEL
▸ Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu, 2014, NIPS.
"Recurrent Models of Visual Attention"
▸ Google DeepMind, 563 citations
▸ Motivation: confronted with a large image, humans process it sequentially, selecting where and what to look at
▸ Tackles a ConvNet limitation: poor scalability with increasing input image size

▸ Multiple Object Recognition with Visual Attention (DRAM), 2015, ICLR
▸ Refined version of the RAM architecture
▸ RNN structure fed with multi-resolution crops, called glimpses
▸ Architecture (figure; labels: where to see, what to see, provide initial state, locate glimpse, outputs the inputs for rnn(1), for multiple objects)
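The glimpse idea above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `extract_glimpse`, the patch size of 8, the three scales, zero padding at the border, and average pooling for downsampling are all assumptions made for the example.

```python
import numpy as np

def extract_glimpse(image, center, base_size=8, num_scales=3):
    """Crop num_scales square patches around `center` (row, col).

    Patch k covers base_size * 2**k pixels per side and is average-pooled
    back to base_size x base_size, so all scales share one resolution.
    All parameter values are illustrative assumptions.
    """
    cy, cx = center
    patches = []
    for k in range(num_scales):
        factor = 2 ** k
        size = base_size * factor
        half = size // 2
        # Zero-pad so that crops near the image border stay valid.
        padded = np.pad(image, half, mode="constant")
        # Pixel (cy, cx) moved to (cy + half, cx + half) after padding,
        # so the crop centred on it starts at (cy, cx) in padded coordinates.
        crop = padded[cy:cy + size, cx:cx + size]
        # Average-pool by `factor` to return to base_size x base_size.
        crop = crop.reshape(base_size, factor, base_size, factor).mean(axis=(1, 3))
        patches.append(crop)
    return np.stack(patches)  # shape: (num_scales, base_size, base_size)

image = np.arange(28 * 28, dtype=float).reshape(28, 28)
glimpse = extract_glimpse(image, center=(14, 14))
print(glimpse.shape)  # (3, 8, 8)
```

The finest scale keeps full detail around the fixation point, while coarser scales give low-resolution context, which is what lets the cost stay independent of the input image size.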

▸ Demo
▸ Single object classification
https://github.com/kevinzakka/recurrent-visual-attention

▸ Training:
▸ Maximize the lower bound F on the log-likelihood
▸ Multiple-object case
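The slide's equation did not survive extraction. A hedged reconstruction of the kind of bound meant here, obtained from Jensen's inequality by treating the glimpse locations l as latent variables, is given below. The symbols I (image), W (model parameters), and y (label) are notation assumed for this sketch.

```latex
\[
\log p(y \mid I, W)
  = \log \sum_{l} p(l \mid I, W)\, p(y \mid l, I, W)
  \;\ge\; \sum_{l} p(l \mid I, W)\, \log p(y \mid l, I, W)
  = \mathcal{F}
\]
```

For the multiple-object case, y would be a sequence of labels and the same argument applies to the sequence likelihood.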

▸ Training (cont'd): REINFORCE
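REINFORCE is needed because sampling a glimpse location is not differentiable. The sketch below illustrates the estimator on a toy problem, assuming a softmax policy over a handful of discrete locations and a fixed reward per location; both are simplifications chosen so the Monte Carlo estimate can be compared with the exact gradient. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def exact_gradient(theta, rewards):
    """Exact gradient of J(theta) = sum_k pi_k * r_k with pi = softmax(theta)."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

def reinforce_gradient(theta, rewards, num_samples, rng, baseline=0.0):
    """Score-function estimate: mean of (r - b) * grad_theta log pi(l)."""
    pi = softmax(theta)
    locs = rng.choice(len(theta), size=num_samples, p=pi)  # sampled locations
    one_hot = np.eye(len(theta))[locs]
    score = one_hot - pi  # gradient of log softmax w.r.t. theta
    return ((rewards[locs] - baseline)[:, None] * score).mean(axis=0)

theta = np.array([0.5, -1.0, 2.0, 0.0])
rewards = np.array([1.0, 0.0, 0.0, 1.0])  # e.g. 1 if the glimpse led to a correct label
rng = np.random.default_rng(0)
print(exact_gradient(theta, rewards))
print(reinforce_gradient(theta, rewards, 200_000, rng))
```

Subtracting a baseline leaves the expectation unchanged and typically reduces variance, which is why it appears as an optional argument.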

▸ Experiments & Results
▸ MNIST, SVHN

SPATIAL TRANSFORMER NETWORK
▸ Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu, 2015, NIPS. "Spatial Transformer Networks"
▸ Google DeepMind, 624 citations
▸ Motivation: humans process distorted objects by un-distorting them
▸ ConvNets are not actually invariant to large transformations (invariance is only realised over a deep hierarchy of max-pooling)
https://kevinzakka.github.io/2017/01/18/stn-part2/

▸ Architecture:
▸ Three parts: localisation net (regression of θ), sampling grid, sampler
▸ Assume T_θ is a 2D affine transformation A_θ
▸ Input feature map: H × W × C; output feature map: H' × W' × C
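The affine case mentioned above can be written out as follows. This is a hedged reconstruction in standard STN notation: (x^t_i, y^t_i) are regular grid coordinates in the output map, (x^s_i, y^s_i) are the input coordinates sampled from, and the six θ entries are what the localisation net regresses.

```latex
\[
\begin{pmatrix} x_i^{s} \\ y_i^{s} \end{pmatrix}
= \mathcal{T}_{\theta}(G_i)
= A_{\theta} \begin{pmatrix} x_i^{t} \\ y_i^{t} \\ 1 \end{pmatrix}
= \begin{bmatrix}
    \theta_{11} & \theta_{12} & \theta_{13} \\
    \theta_{21} & \theta_{22} & \theta_{23}
  \end{bmatrix}
  \begin{pmatrix} x_i^{t} \\ y_i^{t} \\ 1 \end{pmatrix}
\]
```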

▸ T_θ for attention: a constrained transform allowing cropping, translation, and isotropic scaling
▸ Sampler: bilinear sampling kernel
▸ Differentiable, modular
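A minimal NumPy sketch of the attention transform and bilinear sampler, under stated assumptions: the attention matrix is taken to be A_θ = [[s, 0, t_x], [0, s, t_y]], coordinates are normalised to [-1, 1], the feature map has a single channel, and samples outside the input are treated as zero. The function names are hypothetical, and this is not the TensorLayer implementation.

```python
import numpy as np

def attention_grid(s, tx, ty, out_h, out_w):
    """Source coordinates for A = [[s, 0, tx], [0, s, ty]] on a regular target grid."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    return s * xs + tx, s * ys + ty  # (x_s, y_s), each of shape (out_h, out_w)

def bilinear_sample(U, x_s, y_s):
    """Sample the 2D map U at normalised coordinates with a bilinear kernel."""
    H, W = U.shape
    # Map [-1, 1] to pixel indices [0, W - 1] and [0, H - 1].
    x = (x_s + 1) * (W - 1) / 2
    y = (y_s + 1) * (H - 1) / 2
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx1, wy1 = x - x0, y - y0      # weights of the right / bottom neighbours
    wx0, wy0 = 1 - wx1, 1 - wy1    # weights of the left / top neighbours

    def pix(yy, xx):
        # Out-of-range neighbours contribute zero (an assumption of this sketch).
        valid = (yy >= 0) & (yy < H) & (xx >= 0) & (xx < W)
        return np.where(valid, U[np.clip(yy, 0, H - 1), np.clip(xx, 0, W - 1)], 0.0)

    return (wy0 * wx0 * pix(y0, x0) + wy0 * wx1 * pix(y0, x1)
            + wy1 * wx0 * pix(y1, x0) + wy1 * wx1 * pix(y1, x1))

U = np.arange(16, dtype=float).reshape(4, 4)
x_s, y_s = attention_grid(s=0.5, tx=0.0, ty=0.0, out_h=2, out_w=2)
print(bilinear_sample(U, x_s, y_s))  # zoomed crop around the centre of U
```

With s = 1 and zero translation the grid reproduces the input, while s < 1 zooms in, which is the cropping behaviour the slide refers to. Because the output is a weighted sum of input pixels with weights that depend smoothly on (s, t_x, t_y), gradients can flow back to the localisation net.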

▸ Experiments and Results
▸ MNIST
▸ SVHN
▸ Fine-grained classification (CUB-200-2011 bird classification dataset)

▸ Already implemented in TensorLayer

RECURRENT ATTENTIONAL NETWORKS FOR SALIENCY DETECTION
▸ Jason Kuen, Zhenhua Wang, Gang Wang, 2016, CVPR. "Recurrent Attentional
Networks for Saliency Detection"
▸ RAM (glimpse system) + STN (differentiability) for saliency detection

▸ Recurrent Attentional Convolutional-Deconvolutional Network (RACDNN)
▸ Architecture

GENERATIVE IMAGE INPAINTING WITH CONTEXTUAL ATTENTION
▸ Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, Thomas S. Huang, 2018, CVPR.
"Generative Image Inpainting with Contextual Attention"
▸ Adobe Research

▸ Architecture
▸ Two-stage (coarse to fine)
▸ Global and local WGANs
▸ Spatially discounted reconstruction loss (ℓ1): pixel weight γ^l
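The spatially discounted weighting can be sketched as below, reading γ^l as a per-pixel weight where l is the distance from a hole pixel to the nearest known pixel. The Euclidean distance, the brute-force search, and the function name are assumptions of this example; the paper's released code may compute the distance differently.

```python
import numpy as np

def spatial_discount_weights(mask, gamma=0.99):
    """Per-pixel weights gamma**l for an l1 reconstruction loss.

    mask: 2D array, 1 for missing (hole) pixels and 0 for known pixels.
    l is the Euclidean distance to the nearest known pixel (assumed metric),
    so known pixels get weight 1 and weights decay towards the hole centre.
    Requires at least one known pixel; brute force, for small masks only.
    """
    known = np.argwhere(mask == 0)               # (K, 2) known pixel coordinates
    coords = np.argwhere(np.ones_like(mask))     # (H*W, 2) all pixels, row-major
    diffs = coords[:, None, :] - known[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
    return (gamma ** dist).reshape(mask.shape)

mask = np.zeros((5, 5))
mask[1:4, 1:4] = 1                               # 3x3 hole in the middle
weights = spatial_discount_weights(mask, gamma=0.5)
print(weights)

# Weighted l1 loss between a prediction and the ground truth (random stand-ins).
rng = np.random.default_rng(0)
pred, target = rng.random((5, 5)), rng.random((5, 5))
loss = (weights * np.abs(pred - target)).mean()
```

The intuition is that pixels deep inside the hole are more ambiguous than pixels near its boundary, so their reconstruction error is penalised less.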

▸ Attention
▸ Foreground patches f_{x,y}, background patches b_{x',y'}
▸ Calculate the cosine similarity between each foreground and background patch
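A minimal sketch of the matching step, assuming patches have already been extracted and flattened into row vectors. The scaled softmax over background patches, the scale value `lam`, and the reconstruction as an attention-weighted sum are stated here as assumptions for illustration; the function names are hypothetical.

```python
import numpy as np

def contextual_attention(fg, bg, lam=10.0, eps=1e-8):
    """Cosine similarity between foreground and background patches, then softmax.

    fg: (N_f, D) flattened foreground (hole) patches
    bg: (N_b, D) flattened background (known) patches
    Returns (similarity, attention), both of shape (N_f, N_b).
    """
    fg_n = fg / (np.linalg.norm(fg, axis=1, keepdims=True) + eps)
    bg_n = bg / (np.linalg.norm(bg, axis=1, keepdims=True) + eps)
    sim = fg_n @ bg_n.T                            # cosine similarity
    logits = lam * sim
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return sim, w / w.sum(axis=1, keepdims=True)   # softmax over background patches

def reconstruct(attention, bg):
    """Rebuild each foreground patch as an attention-weighted sum of background patches."""
    return attention @ bg

bg = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
fg = np.array([[2.0, 0.0]])
sim, attn = contextual_attention(fg, bg)
print(sim)                    # approximately [[1, 0, -1]]
print(reconstruct(attn, bg))  # dominated by the first background patch
```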

LEARN TO PAY ATTENTION
▸ Saumya Jetley, Nicholas A. Lord, Namhoon Lee, Philip H. S. Torr, 2018, ICLR. "Learn
to Pay Attention"
▸ Very simple

▸ Architecture (figure): attention computed from a compatibility function (dot product)
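The dot-product compatibility can be sketched as follows, assuming n local feature vectors from an intermediate layer and one global feature vector of the same dimension. The names `local_feats`, `g`, and `attend` are hypothetical, and the sketch covers only a single attention layer.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(local_feats, g):
    """Dot-product compatibility followed by softmax-weighted pooling.

    local_feats: (n, D) local feature vectors, one per spatial location
    g:           (D,)   global feature vector
    Returns the attention weights (n,) and the attended descriptor (D,).
    """
    c = local_feats @ g        # compatibility score per location
    a = softmax(c)             # normalise scores over locations
    g_a = a @ local_feats      # attention-weighted sum of local features
    return a, g_a

local_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
g = np.array([1.0, 1.0])
a, g_a = attend(local_feats, g)
print(a)    # the third location receives the largest weight
print(g_a)
```

The attended descriptor replaces the global feature as the classifier input, so locations that agree with the global representation dominate the decision, and the weights can be visualised as an attention map.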

▸ Image classification and fine-grained recognition

▸ Weakly supervised semantic segmentation

LOOK CLOSER TO SEE BETTER
▸ Jianlong Fu, Heliang Zheng, Tao Mei, 2017, CVPR. "Look Closer to See Better:
Recurrent Attention Convolutional Neural Network for Fine-grained Image
Recognition"
▸ Fine-grained image recognition:
▸ Discriminative region localization + fine-grained feature learning

▸ Recurrent Attention Convolutional Neural Network (RA-CNN)
▸ Multi-scale networks: classification sub-network, attention proposal sub-network (APN)
▸ Finer-scale network (coarse to fine)
▸ Intra-scale softmax loss for classification, inter-scale pairwise ranking loss for APN

▸ RA-CNN architecture (figure): the attended region is amplified with bilinear interpolation

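▸ Training:
▸ Multi-task loss: classification loss plus pairwise ranking loss, which forces the finer scale to be more confident than the coarser one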
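A minimal sketch of the multi-task loss, assuming each scale outputs a softmax probability vector and the ranking loss is a hinge on the true-class probability between adjacent scales. The margin value of 0.05 and all function names are assumptions of this example.

```python
import numpy as np

def cross_entropy(probs, label):
    """Intra-scale classification loss for one sample."""
    return -np.log(probs[label])

def ranking_loss(p_coarse, p_fine, margin=0.05):
    """Inter-scale loss: zero only if the finer scale beats the coarser one by `margin`."""
    return max(0.0, p_coarse - p_fine + margin)

def multi_task_loss(scale_probs, label, margin=0.05):
    """Sum of classification losses over scales plus ranking losses over adjacent scales."""
    cls = sum(cross_entropy(p, label) for p in scale_probs)
    rank = sum(ranking_loss(scale_probs[s][label], scale_probs[s + 1][label], margin)
               for s in range(len(scale_probs) - 1))
    return cls + rank

# Three scales, three classes, true class 0 (made-up probabilities).
scale_probs = [np.array([0.6, 0.3, 0.1]),
               np.array([0.7, 0.2, 0.1]),
               np.array([0.65, 0.25, 0.1])]
print(multi_task_loss(scale_probs, label=0))
```

In this toy input the second scale is more confident than the first by more than the margin, so that pair contributes no ranking loss, while the third scale is less confident than the second and is penalised.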

▸ CUB-200-2011 Bird Dataset

▸ Stanford Dogs, Stanford Cars

SUMMARY
▸ Attention for efficiency, better performance, interpretability
▸ Many types of Attention:
▸ RAM
▸ STN
▸ RAM+STN
▸ Others

REFERENCES
▸ Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, 2015, ICLR. "Neural Machine Translation by Jointly Learning to Align and Translate"
▸ Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, 2015, ICML. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"
▸ Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu, 2014, NIPS. "Recurrent Models of Visual Attention"
▸ Jimmy Lei Ba, Volodymyr Mnih, Koray Kavukcuoglu, 2015, ICLR. "Multiple Object Recognition with Visual Attention"
▸ Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu, 2015, NIPS. "Spatial Transformer Networks"
▸ Jason Kuen, Zhenhua Wang, Gang Wang, 2016, CVPR. "Recurrent Attentional Networks for Saliency Detection"
▸ Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, Thomas S. Huang, 2018, CVPR. "Generative Image Inpainting with Contextual Attention"
▸ Saumya Jetley, Nicholas A. Lord, Namhoon Lee, Philip H. S. Torr, 2018, ICLR. "Learn to Pay Attention"
▸ Jianlong Fu, Heliang Zheng, Tao Mei, 2017, CVPR. "Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition"
