1) This paper introduces an attribute-driven attention model for image captioning that uses a CNN-RNN framework with an attention mechanism to predict attributes and their dependencies.
2) The model uses region-based attention over CNN features to obtain context features for the attribute prediction module, which then models co-occurrence relationships between attributes.
3) The generated attributes are then used as semantic regularization for the RNN caption generation module to produce image descriptions word by word.
3. Introduction to Image Captioning
• Given an image, produce a sentence describing its contents
• Inputs: an image
• Outputs: multiple words (let's consider one sentence), e.g., "The dog is hiding"
7. Motivation
• [Figure: a CNN feature is fed to an RNN (hidden states h1, h2, h3) whose linear classifiers emit the caption words "The", "dog", ...]
• The CNN provides a static representation that is fed to the RNN irrespective of the word being generated, so much of it is redundant
8. Ideas of paper
• Objective
  • Explore more compact representations to gain better attention accuracy
• Solutions
  • A CNN-RNN framework with an attention mechanism to predict attributes
  • Model co-occurrence dependencies among attributes; for example, the object terms "woman" and "umbrella" can help to recognize the relational term "under"
10. Attribute-based methods
1) Image captioning with semantic attention (You et al. 2016)
2) Boosting image captioning with attributes (Yao et al. 2017)
3) Semantic regularization for recurrent image annotation (Liu et al. 2017)
• Limitation: attributes are mainly predicted by a CNN, which cannot model co-occurrence dependencies among attributes
11. Visual attention methods
Many visual attention schemes have been introduced recently:
1) Show, attend and tell (Xu et al. 2015) – soft + hard attention
2) SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning (Chen et al. 2017) – channel attention
3) Bottom-up and top-down attention for image captioning and VQA (Anderson et al. 2017) – bottom-up + top-down attention
14. Attention module
• Region-based: soft attention over CNN region features to obtain the context feature v*_t (see the sketch below)
• Attribute-based: attend to a series of context features to acquire the attribute-based feature c*_t
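A minimal sketch of the region-based soft attention, assuming PyTorch; the layer sizes, the additive scoring function, and all names here are illustrative assumptions, not the paper's exact equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftRegionAttention(nn.Module):
    """Soft attention over CNN region features (sketch, not the paper's exact formulation)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)      # project each region feature
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)  # project current RNN hidden state
        self.score = nn.Linear(attn_dim, 1)                 # scalar score per region

    def forward(self, regions, hidden):
        # regions: (batch, num_regions, feat_dim); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(self.proj_feat(regions) +
                                  self.proj_hidden(hidden).unsqueeze(1)))  # (batch, R, 1)
        alpha = F.softmax(e, dim=1)             # attention weights over regions
        v_star = (alpha * regions).sum(dim=1)   # context feature v*_t
        return v_star, alpha.squeeze(-1)

# Usage with made-up sizes: 49 regions of 2048-d features, 512-d hidden state
attn = SoftRegionAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
v_star, alpha = attn(torch.randn(2, 49, 2048), torch.randn(2, 512))
```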
15. Inference module
• Captures co-occurrence dependencies among attributes, where v*_t is the context feature from the attention module
• Binary vector I_s: 1 means the image has the corresponding attribute, 0 means it does not
• The inference layer generates the distribution of the next attribute to be predicted via a soft-max function (see the sketch below)
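A hedged sketch of how a recurrent inference layer could predict attributes one at a time from the context feature v*_t, so that earlier predictions condition later ones; the GRU cell, dimensions, and names are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeInference(nn.Module):
    """Sequential attribute prediction to capture co-occurrence dependencies (sketch)."""
    def __init__(self, num_attributes, ctx_dim, hidden_dim, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(num_attributes, embed_dim)   # previously predicted attribute
        self.cell = nn.GRUCell(ctx_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_attributes)        # scores over the attribute vocabulary

    def step(self, v_star, prev_attr, h=None):
        # v_star: (batch, ctx_dim) context feature from the attention module
        # prev_attr: (batch,) index of the attribute predicted at the previous step
        x = torch.cat([v_star, self.embed(prev_attr)], dim=-1)
        h = self.cell(x, h)
        probs = F.softmax(self.out(h), dim=-1)  # distribution over the next attribute
        return probs, h
```

The attributes predicted over the steps can then be accumulated into the binary indicator vector I_s used by the generation module.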
16. Generation module
• Generates the sentence word by word
• I_s acts as a semantic regularization, forcing the RNN to understand the image at the beginning (see the sketch below)
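A minimal sketch of how the binary attribute vector I_s could be injected as the first input to the caption decoder so that it acts as a semantic regularizer; the projection layer, the single-layer LSTM, and the dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """Word-by-word caption decoder that sees the attribute vector I_s at step 0 (sketch)."""
    def __init__(self, vocab_size, num_attributes, embed_dim, hidden_dim):
        super().__init__()
        self.attr_proj = nn.Linear(num_attributes, embed_dim)  # map I_s into the word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, i_s, word_ids):
        # i_s: (batch, num_attributes) binary attribute vector
        # word_ids: (batch, seq_len) previous caption words (training time)
        first = self.attr_proj(i_s).unsqueeze(1)            # I_s as the step-0 input
        inputs = torch.cat([first, self.embed(word_ids)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                             # logits for the next word at each step
```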
24. Take-home messages
• This paper introduces an attribute detection mechanism that combines visual attention with inference of attribute dependencies, allowing more accurate attention for image captioning. An RNN is effective in modeling these dependencies.
• Semantic regularization (forcing a binary vector) could be popularized in image captioning to gain better performance.
• The authors conduct extensive experiments to demonstrate the effectiveness of the proposed framework.
• Detailed analyses (ablation and qualitative analysis) are provided for further insights and research in this domain.
25. Quizzes
1) What kinds of attention mechanisms are used in this paper?
• Region-based attention over the feature map
• Attribute-based attention over the context features
2) What is the size of I_s?
• The size of I_s is determined by the number of attributes mined from the ground-truth captions. The vector depicts object/relational/descriptive terms (see the sketch below).
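To make the second answer concrete, here is a hedged sketch of building the multi-hot vector I_s from an attribute vocabulary mined from ground-truth captions; the tiny vocabulary and the whitespace tokenization are purely illustrative.

```python
import torch

# Illustrative attribute vocabulary (object / relational / descriptive terms);
# the real vocabulary mined from ground-truth captions is much larger.
attribute_vocab = ["dog", "woman", "umbrella", "under", "red", "running"]
attr_to_idx = {a: i for i, a in enumerate(attribute_vocab)}

def build_i_s(caption_tokens):
    """Multi-hot vector: 1 if the attribute appears in the caption, else 0."""
    i_s = torch.zeros(len(attribute_vocab))
    for tok in caption_tokens:
        if tok in attr_to_idx:
            i_s[attr_to_idx[tok]] = 1.0
    return i_s

print(build_i_s("a woman under a red umbrella".split()))
# tensor([0., 1., 1., 1., 1., 0.])
```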
26. Discussion
1) Attributes could be refined as in Neural Baby Talk (cat -> tabby, kitten, feline, …)
2) Besides beam search, teacher forcing could be applied to speed up training and may yield better performance (see the sketch below).
3) The metrics in this paper are weak: although the paper was written in 2018, BLEU, ROUGE, and CIDEr are mostly based on overlap between the prediction and the ground truth. More advanced metrics such as SPICE should also be evaluated to assess the feasibility of this approach.
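Since teacher forcing is suggested as a training speed-up, here is a hedged sketch of one such training step, reusing the hypothetical CaptionGenerator from the generation-module sketch above; padding index and shapes are assumptions.

```python
import torch
import torch.nn as nn

def teacher_forcing_step(decoder, i_s, captions, pad_idx=0):
    """One training step with teacher forcing: the decoder always sees the
    ground-truth previous words instead of its own samples (sketch)."""
    # captions: (batch, seq_len) ground-truth word indices
    logits = decoder(i_s, captions[:, :-1])      # inputs are the shifted ground-truth words
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        captions.reshape(-1),                    # step 0 (fed I_s) predicts word 0, step t predicts word t
        ignore_index=pad_idx,
    )
    return loss
```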