1) This paper introduces an attribute-driven attention model for image captioning that uses a CNN-RNN framework with an attention mechanism to predict attributes and their dependencies.
2) The model uses region-based attention over CNN features to obtain context features for the attribute prediction module, which then models co-occurrence relationships between attributes.
3) The generated attributes are then used as semantic regularization for the RNN caption generation module to produce image descriptions word by word.
3. Introduction to Image Captioning
• Given an image, produce a sentence describing its contents
• Inputs: an image
• Outputs: multiple words (let's consider one sentence), e.g., "The dog is hiding"
7. Motivation
• [Figure: a CNN feature is fed to an RNN (hidden states h1, h2, h3) whose linear classifiers emit the caption words "The", "dog", ...]
• The CNN provides a static representation that is fed to the RNN irrespective of the word being generated, so much of it is redundant
8. Ideas of paper
• Objective
  • Explore more compact representations to gain better attention accuracy
• Solutions
  • A CNN-RNN framework with an attention mechanism to predict attributes
  • Model co-occurrence dependencies among attributes; for example, the object terms "woman" and "umbrella" can help to recognize the relational term "under"
10. Attribute-based methods
1) Image captioning with semantic attention (You et al. 2016)
2) Boosting image captioning with attributes (Yao et al. 2017)
3) Semantic regularization for recurrent image annotation (Liu et al. 2017)
• Limitation: attributes are mainly predicted by a CNN, which cannot model co-occurrence dependencies among attributes
11. Visual attention methods
Many visual attention schemes have been introduced recently:
1) Show, attend and tell (Xu et al. 2015) – soft + hard attention
2) SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning (Chen et al. 2017) – channel attention
3) Bottom-up and top-down attention for image captioning and VQA (Anderson et al. 2017) – bottom-up + top-down attention
14. Attention module
• Region-based: soft attention over CNN region features to obtain the context feature v*_t (see the sketch below)
• Attribute-based: attend to a series of context features to acquire the attribute-based feature c*_t
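A minimal sketch of the region-based soft attention, assuming PyTorch; the layer sizes, the additive scoring function, and all names here are illustrative assumptions, not the paper's exact equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftRegionAttention(nn.Module):
    """Soft attention over CNN region features (sketch, not the paper's exact formulation)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)      # project each region feature
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)  # project current RNN hidden state
        self.score = nn.Linear(attn_dim, 1)                 # scalar score per region

    def forward(self, regions, hidden):
        # regions: (batch, num_regions, feat_dim); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(self.proj_feat(regions) +
                                  self.proj_hidden(hidden).unsqueeze(1)))  # (batch, R, 1)
        alpha = F.softmax(e, dim=1)             # attention weights over regions
        v_star = (alpha * regions).sum(dim=1)   # context feature v*_t
        return v_star, alpha.squeeze(-1)

# Usage with made-up sizes: 49 regions of 2048-d features, 512-d hidden state
attn = SoftRegionAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
v_star, alpha = attn(torch.randn(2, 49, 2048), torch.randn(2, 512))
```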
15. Inference module
• Captures co-occurrence dependencies among attributes, where v*_t is the context feature from the attention module
• Binary vector I_s: 1 means the image has the corresponding attribute, 0 means it does not
• The inference layer generates the distribution of the next attribute to be predicted via a soft-max function (see the sketch below)
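A hedged sketch of how a recurrent inference layer could predict attributes one at a time from the context feature v*_t, so that earlier predictions condition later ones; the GRU cell, dimensions, and names are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeInference(nn.Module):
    """Sequential attribute prediction to capture co-occurrence dependencies (sketch)."""
    def __init__(self, num_attributes, ctx_dim, hidden_dim, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(num_attributes, embed_dim)   # previously predicted attribute
        self.cell = nn.GRUCell(ctx_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_attributes)        # scores over the attribute vocabulary

    def step(self, v_star, prev_attr, h=None):
        # v_star: (batch, ctx_dim) context feature from the attention module
        # prev_attr: (batch,) index of the attribute predicted at the previous step
        x = torch.cat([v_star, self.embed(prev_attr)], dim=-1)
        h = self.cell(x, h)
        probs = F.softmax(self.out(h), dim=-1)  # distribution over the next attribute
        return probs, h
```

The attributes predicted over the steps can then be accumulated into the binary indicator vector I_s used by the generation module.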
16. Generation module
• Generates the sentence word by word
• I_s acts as a semantic regularization, forcing the RNN to understand the image at the beginning (see the sketch below)
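A minimal sketch of how the binary attribute vector I_s could be injected as the first input to the caption decoder so that it acts as a semantic regularizer; the projection layer, the single-layer LSTM, and the dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """Word-by-word caption decoder that sees the attribute vector I_s at step 0 (sketch)."""
    def __init__(self, vocab_size, num_attributes, embed_dim, hidden_dim):
        super().__init__()
        self.attr_proj = nn.Linear(num_attributes, embed_dim)  # map I_s into the word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, i_s, word_ids):
        # i_s: (batch, num_attributes) binary attribute vector
        # word_ids: (batch, seq_len) previous caption words (training time)
        first = self.attr_proj(i_s).unsqueeze(1)            # I_s as the step-0 input
        inputs = torch.cat([first, self.embed(word_ids)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                             # logits for the next word at each step
```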
24. Take-home messages
• This paper introduces an attribute detection mechanism that combines visual attention with inference of attribute dependencies, allowing more accurate attention for image captioning. An RNN is effective in modeling these dependencies.
• Semantic regularization (forcing a binary vector) could be popularized in image captioning to gain better performance.
• The authors conduct extensive experiments to demonstrate the effectiveness of the proposed framework.
• Detailed analyses (ablation and qualitative analysis) are provided for further insights and research in this domain.
25. Quizzes
1) What kinds of attention mechanisms are used in this paper?
• Region-based attention over the feature map
• Attribute-based attention over the context features
2) What is the size of I_s?
• The size of I_s is determined by the number of attributes mined from the ground-truth captions. The vector depicts object/relational/descriptive terms (see the sketch below).
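To make the second answer concrete, here is a hedged sketch of building the multi-hot vector I_s from an attribute vocabulary mined from ground-truth captions; the tiny vocabulary and the whitespace tokenization are purely illustrative.

```python
import torch

# Illustrative attribute vocabulary (object / relational / descriptive terms);
# the real vocabulary mined from ground-truth captions is much larger.
attribute_vocab = ["dog", "woman", "umbrella", "under", "red", "running"]
attr_to_idx = {a: i for i, a in enumerate(attribute_vocab)}

def build_i_s(caption_tokens):
    """Multi-hot vector: 1 if the attribute appears in the caption, else 0."""
    i_s = torch.zeros(len(attribute_vocab))
    for tok in caption_tokens:
        if tok in attr_to_idx:
            i_s[attr_to_idx[tok]] = 1.0
    return i_s

print(build_i_s("a woman under a red umbrella".split()))
# tensor([0., 1., 1., 1., 1., 0.])
```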
26. Discussion
1) Attributes could be refined as in Neural Baby Talk (cat -> tabby, kitten, feline, …)
2) Besides beam search, teacher forcing could be applied to speed up training and may yield better performance (see the sketch below).
3) The metrics in this paper are weak: although the paper was written in 2018, BLEU, ROUGE, and CIDEr are mostly based on overlap between the prediction and the ground truth. More advanced metrics such as SPICE should also be evaluated to assess the feasibility of this approach.
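Since teacher forcing is suggested as a training speed-up, here is a hedged sketch of one such training step, reusing the hypothetical CaptionGenerator from the generation-module sketch above; padding index and shapes are assumptions.

```python
import torch
import torch.nn as nn

def teacher_forcing_step(decoder, i_s, captions, pad_idx=0):
    """One training step with teacher forcing: the decoder always sees the
    ground-truth previous words instead of its own samples (sketch)."""
    # captions: (batch, seq_len) ground-truth word indices
    logits = decoder(i_s, captions[:, :-1])      # inputs are the shifted ground-truth words
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        captions.reshape(-1),                    # step 0 (fed I_s) predicts word 0, step t predicts word t
        ignore_index=pad_idx,
    )
    return loss
```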