Learning a Recurrent Visual Representation for Image Caption G.docx

Learning a Recurrent Visual Representation for Image Caption
Generation
Xinlei Chen
Carnegie Mellon University
[email protected]
C. Lawrence Zitnick
Microsoft Research, Redmond
[email protected]
Abstract
In this paper we explore the bi-directional mapping be-
tween images and their sentence-based descriptions. We
propose learning this mapping using a recurrent neural net-
work. Unlike previous approaches that map both sentences
and images to a common embedding, we enable the gener-
ation of novel sentences given an image. Using the same
model, we can also reconstruct the visual features associ-
ated with an image given its visual description. We use a
novel recurrent visual memory that automatically learns to
remember long-term visual concepts to aid in both sentence
generation and visual feature reconstruction. We evaluate
our approach on several tasks. These include sentence gen-
eration, sentence retrieval and image retrieval. State-of-
the-art results are shown for the task of generating novel im-
age descriptions. When compared to human generated cap-
tions, our automatically generated captions are preferred
by humans over 19.8% of the time. Results are better than
or comparable to state-of-the-art results on the image and
sentence retrieval tasks for methods using similar visual

features.
1. Introduction
A good image description is often said to “paint a picture
in your mind’s eye.” The creation of a mental image may
play a significant role in sentence comprehension in humans
[15]. In fact, it is often this mental image that is remem-
bered long after the exact sentence is forgotten [29, 21].
What role should visual memory play in computer vision al-
gorithms that comprehend and generate image descriptions?
Recently, several papers have explored learning joint fea-
ture spaces for images and their descriptions [13, 32, 16].
These approaches project image features and sentence fea-
tures into a common space, which may be used for image
search or for ranking image captions. Various approaches
were used to learn the projection, including Kernel Canon-
ical Correlation Analysis (KCCA) [13], recursive neural
networks [32], or deep neural networks [16]. While these
approaches project both semantics and visual features to a
common embedding, they are not able to perform the in-
verse projection. That is, they cannot generate novel sen-
tences or visual depictions from the embedding.
In this paper, we propose a bi-directional representation
capable of generating both novel descriptions from images
and visual representations from descriptions. Critical to
both of these tasks is a novel representation that dynami-
cally captures the visual aspects of the scene that have al-
ready been described. That is, as a word is generated or
read the visual representation is updated to reflect the new
information contained in the word. We accomplish this us-
ing Recurrent Neural Networks (RNNs) [6, 24, 27]. One
long-standing problem of RNNs is their weakness in re-

membering concepts after a few iterations of recurrence.
For instance RNN language models often find difficultly in
learning long distance relations [3, 24] without specialized
gating units [12]. During sentence generation, our novel dy-
namically updated visual representation acts as a long-term
memory of the concepts that have already been mentioned.
This allows the network to automatically pick salient con-
cepts to convey that have yet to be spoken. As we demon-
strate, the same representation may be used to create a vi-
sual representation of a written description.
We demonstrate our method on numerous datasets.
These include the PASCAL sentence dataset [31], Flickr
8K [31], Flickr 30K [31], and the Microsoft COCO dataset
[22]. When generating novel image descriptions, we
demonstrate state-of-the-art results as measured by both
BLEU [30] and METEOR [1] on PASCAL 1K. Surpris-
ingly, we achieve performance only slightly below humans
as measured by BLEU and METEOR on the MS COCO
dataset. Qualitative results are shown for the generation of
novel image captions. We also evaluate the bi-directional
ability of our algorithm on both the image and sentence re-
trieval tasks. Since this does not require the ability to gener-
ate novel sentences, numerous previous papers have evalu-
ated on this task. We show results that are better or compa-
rable to previous state-of-the-art results using similar visual
features.
1
ar
X
iv
:1

41
1.
56
54
v1
[
cs
.C
V
]
2
0
N
ov
2
01
4
Figure 1. Illustration of our model. (a) shows the full model
used for training. (b) and (c) show the parts of the model needed
for generating
sentences from visual features and generating visual features
from sentences respectively.
2. Related work

The task of building a visual memory lies at the heart
of two long-standing AI-hard problems: grounding natural
language symbols to the physical world and semantically
understanding the content of an image. Whereas learning
the mapping between image patches and single text labels
remains a popular topic in computer vision [18, 8, 9], there
is a growing interest in using entire sentence descriptions
together with pixels to learn joint embeddings [13, 32, 16,
10]. Viewing corresponding text and images as correlated,
KCCA [13] is a natural option to discover the shared fea-
tures spaces. However, given the highly non-linear mapping
between the two, finding a generic distance metric based on
shallow representations can be extremely difficult. Recent
papers seek better objective functions that directly optimize
the ranking [13], or directly adopts pre-trained representa-
tions [32] to simplify the learning, or a combination of the
two [16, 10].
With a good distance metric, it is possible to perform
tasks like bi-directional image-sentence retrieval. However,
in many scenarios it is also desired to generate novel image
descriptions and to hallucinate a scene given a sentence de-
scription. Numerous papers have explored the area of gen-
erating novel image descriptions [7, 36, 19, 37, 28, 11, 20,
17]. These papers use various approaches to generate text,
such as using pre-trained object detectors with template-
based sentence generation [36, 7, 19]. Retrieved sentences
may be combined to form novel descriptions [20]. Recently,
purely statistical models have been used to generate sen-
tences based on sampling [17] or recurrent neural networks
[23]. While [23] also uses a RNN, their model is signif-
icantly different from our model. Specifically their RNN
does not attempt to reconstruct the visual features, and is
more similar to the contextual RNN of [27]. For the synthe-
sizing of images from sentences, the recent paper by Zitnick

et al. [38] uses abstract clip art images to learn the visual
interpretation of sentences. Relation tuples are extracted
from the sentences and a conditional random field is used
to model the visual scene.
There are numerous papers using recurrent neural net-
works for language modeling [2, 24, 27, 17]. We build
most directly on top of [2, 24, 27] that use RNNs to
learn word context. Several models use other sources of
contextual information to help inform the language model
[27, 17]. Despite its success, RNNs still have difficulty cap-
turing long-range relationships in sequential modeling [3].
One solution is Long Short-Term Memory (LSTM) net-
works [12, 33, 17], which use “gates” to control gradient
back-propagation explicitly and allow for the learning of
long-term interactions. However, the main focus of this pa-
per is to show that the hidden layers learned by “translat-
ing” between multiple modalities can already discover rich
structures in the data and learn long distance relations in an
automatic, data-driven manner.
3. Approach
In this section we describe our approach using recurrent
neural networks. Our goals are twofold. First, we want to be
able to generate sentences given a set of visual observations
or features. Specifically, we want to compute the probabil-
ity of a word wt being generated at time t given the set of
previously generated words Wt−1 = w1, . . . ,wt−1 and the
observed visual features V . Second, we want to enable the
capability of computing the likelihood of the visual features
V given a set of spoken or read words Wt for generating
visual representations of the scene or for performing image
search. To accomplish both of these tasks we introduce a

set of latent variables Ut−1 that encodes the visual interpre-
tation of the previously generated or read words Wt−1. As
we demonstrate later, the latent variables U play the critical
role of acting as a long-term visual memory of the words
that have been previously generated or read.
Using U, our goal is to compute P(wt|V,Wt−1,Ut−1)
and P(V |Wt−1,Ut−1). Combining these two likelihoods
together our global objective is to maximize,
P(wt,V |Wt−1,Ut−1)
= P(wt|V,Wt−1,Ut−1)P(V |Wt−1,Ut−1). (1)
That is, we want to maximize the likelihood of the word wt
and the observed visual features V given the previous words
and their visual interpretation. Note that in previous papers
[27, 23] the objective was only to compute P(wt|V,Wt−1)
and not P(V |Wt−1).
3.1. Model structure
Our recurrent neural network model structure builds on
the prior models proposed by [24, 27]. Mikolov [24] pro-
posed a RNN language model shown by the green boxes in
Figure 1(a). The word at time t is represented by a vector
wt using a “one hot” representation. That is, wt is the same
size as the word vocabulary with each entry having a value
of 0 or 1 depending on whether the word was used. The
output w̃ t contains the likelihood of generating each word.
The recurrent hidden state s provides context based on the
previous words. However, s typically only models short-
range interactions due to the problem of vanishing gradi-
ents [3, 24]. This simple, yet effective language model was
shown to provide a useful continuous word embedding for
a variety of applications [25].

Following [24], Mikolov et al. [27] added an input layer
v to the RNN shown by the white box in Figure 1. This
layer may represent a variety of information, such as topic
models or parts of speech [27]. In our application, v rep-
resents the set of observed visual features. We assume the
visual features v are constant. These visual features help
inform the selection of words. For instance, if a cat was de-
tected, the word “cat” is more likely to be spoken. Note that
unlike [27], it is not necessary to directly connect v to w̃ ,
since v is static for our application. In [27] v represented
dynamic information such as parts of speech for which w̃
needed direct access. We also found that only connecting v
to half of the s units provided better results, since it allowed
different units to specialize on modeling either text or visual
features.
The main contribution of this paper is the addition of the
recurrent visual hidden layer u, blue boxes in Figure 1(a).
The recurrent layer u attempts to reconstruct the visual fea-
tures v from the previous words, i.e. ṽ ≈ v. The visual hid-
den layer is also used by w̃ t to help in predicting the next
word. That is, the network can compare its visual memory
Figure 2. Illustration of the hidden units s and u activations
through time (vertical axis). Notice that the visual hidden units
u
exhibit long-term memory through the temporal stability of
some
units, where the hidden units s change significantly each time
step.
of what it has already said u to what it currently observes v
to predict what to say next. At the beginning of the sentence,
u represents the prior probability of the visual features. As
more words are observed, the visual feature likelihoods are

updated to reflect the words’ visual interpretation. For in-
stance, if the word “sink” is generated, the visual feature
corresponding to sink will increase. Other features that cor-
respond to stove or refrigerator might increase as well, since
they are highly correlated with sink.
A critical property of the recurrent visual features u is
their ability to remember visual concepts over the long term.
The property arises from the model structure. Intuitively,
one may expect the visual features shouldn’t be estimated
until the sentence is finished. That is, u should not be used
to estimate v until wt generates the end of sentence token.
However, in our model we force u to estimate v at every
time step to help in remembering visual concepts. For in-
stance, if the word “cat” is generated, ut will increase the
likelihood of the visual feature corresponding to cat. As-
suming the “cat” visual feature in v is active, the network
will receive positive reinforcement to propagate u’s mem-
ory of “cat” from one time instance to the next. Figure
2 shows an illustrative example of the hidden units s and
u. As can be observed, some visual hidden units u exhibit
longer temporal stability.
Note that the same network structure can predict visual
features from sentences or generate sentences from visual
features. For generating sentences (Fig. 1(b)), v is known
and ṽ may be ignored. For predicting visual features from
sentences (Fig. 1(c)), w is known, and s and v may be
ignored. This property arises from the fact that the words
units w separate the model into two halves for predicting
words or visual features respectively. Alternatively, if the
hidden units s were connected directly to u, this property
would be lost and the network would act as a normal auto-
encoder [34].

3.2. Implementation details
In this section we describe the details of our language
model and how we learn our network.
3.3. Language Model
Our language model typically has between 3,000 and
20,000 words. While each word may be predicted indepen-
dently, this approach is computationally expensive. Instead,
we adopted the idea of word classing [24] and factorized the
distribution into a product of two terms:
P(wt|·) = P(ct|·)×P(wt|ct, ·). (2)
P(wt|·) is the probability of the word, P(ct|·) is the proba-
bility of the class. The class label of the word is computed
in an unsupervised manner, grouping words of similar fre-
quencies together. Generally, this approach greatly acceler-
ates the learning process, with little loss of perplexity. The
predicted word likelihoods are computed using the standard
soft-max function. After each epoch, the perplexity is eval-
uated on a separate validation set and the learning reduced
(cut in half in our experiments) if perplexity does not de-
crease.
In order to further reduce the perplexity, we combine
the RNN model’s output with the output from a Maximum
Entropy language model [26], simultaneously learned from
the training corpus. For all experiments we fix how many
words to look back when predicting the next word used by
the Maximum Entropy model to three.
For any natural language processing task, pre-processing
is crucial to the final performance. For all the sentences, we

did the following two steps before feeding them into the
RNN model. 1) Use Stanford CoreNLP Tool to tokenize
the sentences. 2) Lower case all the letters.
3.4. Learning
For learning we use the Backpropagation Through Time
(BPTT) algorithm [35]. Specifically, the network is un-
rolled for several words and standard backpropagation is
applied. Note that we reset the model after an End-of-
Sentence (EOS) is encountered, so that prediction does not
cross sentence boundaries. As shown to be beneficial in
[24], we use online learning for the weights from the recur-
rent units to the output words. The weights for the rest of the
network use a once per sentence batch update. The activa-
tions for all units are computed using the sigmoid function
σ(z) = 1/(1 + exp(−z)) with clipping, except the word
predictions that use soft-max. We found that Rectified Lin-
ear Units (ReLUs) [18] with unbounded activations were
numerically unstable and commonly “blew up” when used
in recurrent networks.
We used the open source RNN code of [24] and the
Caffe framework [14] to implement our model. A big ad-
vantage of combining the two is that we can jointly learn
the word and image representations: the error from predict-
ing the words can be directly backpropagated to the image-
level features. However, deep convolution neural networks
require large amounts of data to train on, but the largest
sentence-image dataset has only ∼80K images [22]. There-
fore, instead of training from scratch, we choose to fine-
tune from the pre-trained 1000-class ImageNet model [4] to
avoid potential over-fitting. In all experiments, we used the
4096D 7th full-connected layer output as the visual input v
to our model.

4. Results
In this section we evaluate the effectiveness of our bi-
directional RNN model on multiple tasks. We begin by de-
scribing the datasets used for training and testing, followed
by our baselines. Our first set of evaluations measure our
model’s ability to generate novel descriptions of images.
Since our model is bi-directional, we evaluate its perfor-
mance on both the sentence retrieval and image retrieval
tasks. For addition results please see the supplementary ma-
terial.
4.1. Datasets
For evaluation we perform experiments on several stan-
dard datasets that are used for sentence generation and the
sentence-image retrieval task:
PASCAL 1K [31] The dataset contains a subset of im-
ages from the PASCAL VOC challenge. For each of the 20
categories, it has a random sample of 50 images with 5 de-
scriptions provided by Amazon’s Mechanical Turk (AMT).
Flickr 8K and 30K [31] These datasets consists of 8,000
and 31,783 images collected from Flickr respectively. Most
of the images depict humans participating in various activ-
ities. Each image is also paired with 5 sentences. These
datasets have a standard training, validation, and testing
splits.
MS COCO [22] The Microsoft COCO dataset contains
82,783 training images and 40,504 validation images each
with ∼5 human generated descriptions. The images are col-
lected from Flickr by searching for common object cate-
gories, and typically contain multiple objects with signif-

icant contextual information. We downloaded the version
which contains ∼40K annotated training images and ∼10K
validation images for our experiments.
4.2. RNN Baselines
To gain insight into the various components of our
model, we compared our final model with three RNN base-
lines. For fair comparison, the random seed initialization
PASCAL Flickr 8K Flickr 30K MS COCO
PPL BLEU METR PPL BLEU METR PPL BLEU METR PPL
BLEU METR
Midge [28] - 2.89 8.80 -
Baby Talk [19] - 0.49 9.69 -
RNN 36.79 2.79 10.08 21.88 4.86 11.81 26.94 6.29 12.34 18.96
4.63 11.47
RNN+IF 30.04 10.16 16.43 20.43 12.04 17.10 23.74 10.59 15.56
15.39 16.60 19.24
RNN+IF+FT 29.43 10.18 16.45 - - - - - - 14.90 16.77 19.41
Our Approach 27.97 10.48 16.69 19.24 14.10 17.97 22.51 12.60
16.42 14.23 18.35 20.04
Our Approach + FT 26.95 10.77 16.87 - - - - - - 13.98 18.99
20.42
Human - 22.07 25.80 - 22.51 26.31 - 19.62 23.76 - 20.19 24.94
Table 1. Results for novel sentence generation for PASCAL 1K,
Flickr 8K, FLickr 30K and MS COCO. Results are measured
using
perplexity (PPL), BLEU (%) [30] and METEOR (METR, %) [1].

When available results for Midge [28] and BabyTalk [19] are
provided.
Human agreement scores are shown in the last row. See the text
for more details.
Sentence Retrieval Image Retrieval
[email protected][email protected][email protected] Mean r
[email protected][email protected][email protected] Mean r
Random Ranking 4.0 9.0 12.0 71.0 1.6 5.2 10.6 50.0
DeViSE [8] 17.0 57.0 68.0 11.9 21.6 54.6 72.4 9.5
SDT-RNN [32] 25.0 56.0 70.0 13.4 35.4 65.2 84.4 7.0
DeepFE [16] 39.0 68.0 79.0 10.5 23.6 65.2 79.8 7.6
RNN+IF 31.0 68.0 87.0 6.0 27.2 65.4 79.8 7.0
Our Approach (T) 25.0 71.0 86.0 5.4 28.0 65.4 82.2 6.8
Our Approach (T+I) 30.0 75.0 87.0 5.0 28.0 67.4 83.4 6.2
Table 2. PASCAL image and sentence retrieval experiments.
The protocol of [32] was followed. Results are shown for recall
at 1, 5, and
10 recalled items and the mean ranking (Mean r) of ground truth
items.
was fixed for all experiments. The the hidden layers s and
u sizes are fixed to 100. We tried increasing the number of
hidden units, but results did not improve. For small datasets,
more units can lead to overfitting.
RNN based Language Model (RNN) This is the basic
RNN language model developed by [24], which has no in-
put visual features.
RNN with Image Features (RNN+IF) This is an RNN
model with image features feeding into the hidden layer in-

spired by [27]. As described in Section 3 v is only con-
nected to s and not w̃ . For the visual features v we used the
4096D 7th Layer output of BVLC reference Net [14] after
ReLUs. This network is trained on the ImageNet 1000-way
classification task [4]. We experimented with other layers
(5th and 6th) but they do not perform as well.
RNN with Image Features Fine-Tuned (RNN+FT)
This model has the same architecture as RNN+IF, but the
error is back-propagated to the Convolution Neural Net-
work [9]. The CNN is initialized with the weights from
the BVLC reference net. The RNN is initialized with the
the pre-trained RNN language model. That is, the only ran-
domly initialized weights are the ones from visual features
v to hidden layers s. If the RNN is not pre-trained we found
the initial gradients to be too noisy for the CNN. If the
weights from v to hidden layers s are also pre-trained the
search space becomes too limited. Our current implemen-
tation takes ∼5 seconds to learn a mini-batch of size 128 on
a Tesla K40 GPU. It is also crucial to keep track of the val-
idation error and avoid overfitting. We observed this fine-
tuning strategy is particularly helpful for MS COCO, but
does not give much performance gain on Flickr Datasets be-
fore it overfits. The Flickr datasets may not provide enough
training data to avoid overfitting.
After fine-tuning, we fix the image features again and
retrain our model on top of it.
4.3. Sentence generation
Our first set of experiments evaluate our model’s ability
to generate novel sentence descriptions of images. We ex-
periment on all the image-sentence datasets described previ-
ously and compare to the RNN baselines and other previous

papers [28, 19]. Since PASCAL 1K has a limited amount of
training data, we report results trained on MS COCO and
tested on PASCAL 1K. We use the standard train-test splits
for the Flickr 8K and 30K datasets. For MS COCO we
train and validate on the training set (∼37K/∼3K), and test
on the validation set, since the testing set is not available.
To generate a sentence, we first sample a target sentence
length from the multinomial distribution of lengths learned
from the training data, then for this fixed length we sample
100 random sentences, and use the one with the lowest loss
(negative likelihood, and in case of our model, also recon-
struction error) as output.
[email protected][email protected][email protected] Med r
Random Ranking 0.1 0.6 1.1 631 0.1 0.5 1.0 500
[32] 4.5 18.0 28.6 32 6.1 18.5 29.0 29
DeViSE [8] 4.8 16.5 27.3 28 5.9 20.1 29.6 29
DeepFE [16] 12.6 32.9 44.0 14 9.7 29.6 42.5 15
DeepFE+DECAF [16] 5.9 19.2 27.3 34 5.2 17.6 26.5 32
RNN+IF 7.2 18.7 28.7 30.5 4.5 15.34 24.0 39
Our Approach (T) 7.6 21.1 31.8 27 5.0 17.6 27.4 33
Our Approach (T+I) 7.7 21.0 31.7 26.6 5.2 17.5 27.9 31
[13] 8.3 21.6 30.3 34 7.6 20.7 30.1 38
RNN+IF 5.5 17.0 27.2 28 5.0 15.0 23.9 39.5
Our Approach (T) 6.0 19.4 31.1 26 5.3 17.5 28.5 33
Our Approach (T+I) 6.2 19.3 32.1 24 5.7 18.1 28.4 31

M-RNN [23] 14.5 37.2 48.5 11 11.5 31.0 42.4 15
RNN+IF 10.4 30.9 44.2 14 10.2 28.0 40.6 16
Our Approach (T) 11.6 33.8 47.3 11.5 11.4 31.8 45.8 12.5
Our Approach (T+I) 11.7 34.8 48.6 11.2 11.4 32.0 46.2 11
Table 3. Flickr 8K Retrieval Experiments. The protocols of
[32], [13] and [23] are used respectively in each row. See text
for details.
Random Ranking 0.1 0.6 1.1 631 0.1 0.5 1.0 500
DeViSE [8] 4.5 18.1 29.2 26 6.7 21.9 32.7 25
DeepFE+FT [16] 16.4 40.2 54.7 8 10.3 31.4 44.5 13
RNN+IF 8.0 19.4 27.6 37 5.1 14.8 22.8 47
Our Approach (T) 9.3 23.8 24.0 28 6.0 17.7 27.0 35
Our Approach (T+I) 9.6 24.0 27.2 25 7.1 17.9 29.0 31
M-RNN [23] 18.4 40.2 50.9 10 12.6 31.2 41.5 16
RNN+IF 9.5 29.3 42.4 15 9.2 27.1 36.6 21
Our Approach (T) 11.9 25.0 47.7 12 12.8 32.9 44.5 13
Our Approach (T+I) 12.1 27.8 47.8 11 12.7 33.1 44.9 12.5
Table 4. Flickr 30K Retrieval Experiments. The protocols of [8]
and [23] are used respectively in each row. See text for details.
We choose three automatic metrics for evaluating the
quality of the generated sentences, perplexity, BLEU [30]
and METEOR [1]. Perplexity measures the likelihood of

generating the testing sentence based on the number of bits
it would take to encode it. The lower the value the bet-
ter. BLEU and METEOR were originally designed for au-
tomatic machine translation where they rate the quality of a
translated sentences given several references sentences. We
can treat the sentence generation task as the “translation”
of images to sentences. For BLEU, we took the geomet-
ric mean of the scores from 1-gram to 4-gram, and used
the ground truth length closest to the generated sentence to
penalize brevity. For METEOR, we used the latest version1
(v1.5). For both BLEU and METEOR higher scores are bet-
ter. For reference, we also report the consistency between
human annotators (using 1 sentence as query and the rest as
references)2.
1http://www.cs.cmu.edu/˜alavie/METEOR/
2We used 5 sentences as references for system evaluation, but
leave out
4 sentences for human consistency. It is a bit unfair but the
difference is
usually around 0.01∼0.02.
Results are shown in Table 1. Our approach signifi-
cantly improves over both Midge [28] and BabyTalk [19]
on the PASCAL 1K dataset as measured by BLEU and ME-
TEOR. Several qualitative results for the three algorithms
are shown in Figure 4. Our approach generally provides
more naturally descriptive sentences, such as mentioning
an image is black and white, or a bus is a “double decker”.
Midge’s descriptions are often shorter with less detail and
BabyTalk provides long, but often redundant descriptions.
Results on Flickr 8K and Flickr 30K are also provided.
On the MS COCO dataset that contains more images

of high complexity we provide perplexity, BLEU and ME-
TEOR scores. Surprisingly our BLEU and METEOR
scores (18.99 & 20.42) are just slightly lower than the hu-
man scores (20.19 & 24.94). The use of image features
(RNN + IF) significantly improves performance over using
just an RNN language model. Fine-tuning (FT) and our full
approach provide additional improvements for all datasets.
For future reference, our final model gives BLEU-1 to
BLEU-4 (with penalty) as 60.4%, 26.4%, 12.6% and 6.5%,
compared to human consistency 65.9%, 30.5%, 13.6% and
http://www.cs.cmu.edu/~alavie/METEOR/
Figure 3. Qualitative results for sentence generation on the MS
COCO dataset. Both a generated sentence (red) using (Our
Approach +
FT) and a human generated caption (black) are shown.
6.0%. Qualitative results for the MS COCO dataset are
shown in Figure 3. Note that since our model is trained
on MS COCO, the generated sentences are generally better
on MS COCO than PASCAL 1K.
It is known that automatic measures are only roughly
correlated with human judgment [5], so it is also important
to evaluate the generated sentences using human studies.
We evaluated 1000 generated sentences on MS COCO by
asking human subjects to judge whether it had better, worse
or same quality to a human generated ground truth caption.
5 subjects were asked to rate each image, and the major-
ity vote was recorded. In the case of a tie (2-2-1) the two
winners each got half of a vote. We find 12.6% and 19.8%
prefer our automatically generated captions to the human
captions without (Our Approach) and with fine-tuning (Our
Approach + FT) respectively. Less than 1% of the subjects

rated the captions as the same. This is an impressive re-
sult given we only used image-level visual features for the
complex images in MS COCO.
4.4. Bi-Directional Retrieval
Our RNN model is bi-directional. That is, it can gener-
ate image features from sentences and sentences from im-
age features. To evaluate its ability to do both, we measure
its performance on two retrieval tasks. We retrieve images
given a sentence description, and we retrieve a description
given an image. Since most previous methods are capable
of only the retrieval task, this also helps provide experimen-
tal comparison.
Following other methods, we adopted two protocols for
using multiple image descriptions. The first one is to treat
each of the ∼5 sentences individually. In this scenario, the
rank of the retrieved ground truth sentences are used for
evaluation. In the second case, we treat all the sentences
as a single annotation, and concatenate them together for
retrieval.
For each retrieval task we have two methods for ranking.
First, we may rank based on the likelihood of the sentence
Figure 4. Qualitative results for sentence generation on the
PASCAL 1K dataset. Generated sentences are shown for our
approach (red),
Midge [28] (green) and BabyTalk [19] (blue). For reference, a
human generated caption is shown in black.
given the image (T). Since shorter sentences naturally have
higher probability of being generated, we followed [23] and

normalized the probability by dividing it with the total prob-
ability summed over the entire retrieval set. Second, we
could rank based on the reconstruction error between the
image’s visual features v and their reconstructed visual fea-
tures ṽ (I). Due to better performance, we use the average
reconstruction error over all time steps rather than just the
error at the end of the sentence. In Tables 3, we report re-
trieval results on using the text likelihood term only (I) and
its combination with the visual feature reconstruction error
(T+I).
The same evaluation metrics were adopted from previous
papers for both the tasks of sentence retrieval and image re-
trieval. They used [email protected] (K = 1, 5, 10) as the
measurements,
which are the recall rates of the (first) ground truth sen-
tences (sentence retrieval task) or images (image retrieval
task). Higher [email protected] corresponds to better retrieval
perfor-
mance. We also report the median/mean rank of the (first)
retrieved ground truth sentences or images (Med/Mean r).
Lower Med/Mean r implies better performance. For Flickr
8K and 30K several different evaluation methodologies
have been proposed. We report three scores for Flickr 8K
corresponding to the methodologies proposed by [32], [13]
and [23] respectively, and for Flickr 30K [8] and [23].
Measured by Mean r, we achieve state-of-the-art re-
sults on PASCAL 1K image and sentence retrieval (Table
2). As shown in Tables 3 and 4, for Flickr 8K and 30K
our approach achieves comparable or better results than
all methods except for the recently proposed DeepFE [16].
However, DeepFE uses a different set of features based
on smaller image regions. If the same features are used
(DeepFE+DECAF) as our approach, we achieve better re-

sults. We believe these contributions are complementary,
and by using better features our approach may also show
further improvement. In general ranking based on text and
visual features (T + I) outperforms just using text (T). Please
see the supplementary material for retrieval results on MS
COCO.
5. Discussion
Image captions describe both the objects in the image
and their relationships. An area of future work is to examine
the sequential exploration of an image and how it relates to
image descriptions. Many words correspond to spatial rela-
tions that our current model has difficultly in detecting. As
demonstrated by the recent paper of [16] better feature lo-
calization in the image can greatly improve the performance
of retrieval tasks and similar improvement might be seen in
the description generation task.
In conclusion, we describe the first bi-directional model
capable of the generating both novel image descriptions and
visual features. Unlike many previous approaches using
RNNs, our model is capable of learning long-term inter-
actions. This arises from using a recurrent visual memory
that learns to reconstruct the visual features as new words
are read or generated. We demonstrate state-of-the-art re-
sults on the task of sentence generation, image retrieval and
sentence retrieval on numerous datasets.
6. Acknowledgements
We thank Hao Fang, Saurabh Gupta, Meg Mitchell, Xi-
aodong He, Geoff Zweig, John Platt and Piotr Dollar for

their thoughtful and insightful discussions in the creation of
this paper.
References
[1] S. Banerjee and A. Lavie. Meteor: An automatic metric for
mt evaluation with improved correlation with human judg-
ments. In Proceedings of the ACL Workshop on Intrinsic
and Extrinsic Evaluation Measures for Machine Translation
and/or Summarization, pages 65–72, 2005. 1, 5
[2] Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, and J.-L.
Gauvain. Neural probabilistic language models. In Innova-
tions in Machine Learning, pages 137–186. Springer, 2006.
2
[3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term
dependencies with gradient descent is difficult. Neural Net-
works, IEEE Transactions on, 5(2):157–166, 1994. 1, 2, 3
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-
Fei. Imagenet: A large-scale hierarchical image database.
In Computer Vision and Pattern Recognition, 2009. CVPR
2009. IEEE Conference on, pages 248–255. IEEE, 2009. 4,
5
[5] D. Elliott and F. Keller. Comparing automatic evaluation
measures for image description. 2014. 7
[6] J. L. Elman. Finding structure in time. Cognitive science,
14(2):179–211, 1990. 1
[7] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young,
C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every pic-
ture tells a story: Generating sentences from images. In
ECCV, pages 15–29. Springer, 2010. 2

[8] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean,
T. Mikolov, et al. Devise: A deep visual-semantic embed-
ding model. In Advances in Neural Information Processing
Systems, pages 2121–2129, 2013. 2, 5, 6, 8
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea-
ture hierarchies for accurate object detection and semantic
segmentation. 2014. 2, 5
[10] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and
S. Lazebnik. Improving image-sentence embeddings using
large weakly annotated photo collections. In ECCV, pages
529–545, 2014. 2
[11] A. Gupta, Y. Verma, and C. Jawahar. Choosing linguistics
over vision to describe images. In AAAI, 2012. 2
[12] S. Hochreiter and J. Schmidhuber. Long short-term
memory.
Neural computation, 9(8):1735–1780, 1997. 1, 2
[13] M. Hodosh, P. Young, and J. Hockenmaier. Framing image
description as a ranking task: Data, models and evaluation
metrics. J. Artif. Intell. Res.(JAIR), 47:853–899, 2013. 1, 2,
6, 8
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R.
Gir-
shick, S. Guadarrama, and T. Darrell. Caffe: Convolu-
tional architecture for fast feature embedding. arXiv preprint
arXiv:1408.5093, 2014. 4, 5
[15] M. A. Just, S. D. Newman, T. A. Keller, A. McEleney, and
P. A. Carpenter. Imagery in sentence comprehension: an fmri
study. Neuroimage, 21(1):112–124, 2004. 1

[16] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment em-
beddings for bidirectional image sentence mapping. arXiv
preprint arXiv:1406.5679, 2014. 1, 2, 5, 6, 8
[17] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neu-
ral language models. In ICML, 2014. 2
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks. In
Advances in neural information processing systems, pages
1097–1105, 2012. 2, 4
[19] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C.
Berg,
and T. L. Berg. Baby talk: Understanding and generat-
ing simple image descriptions. In CVPR, pages 1601–1608.
IEEE, 2011. 2, 5, 6, 8
[20] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and
Y. Choi. Collective generation of natural image descriptions.
2012. 2
[21] L. R. Lieberman and J. T. Culpepper. Words versus
objects:
Comparison of free verbal recall. Psychological Reports,
17(3):983–988, 1965. 1
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D.
Ra-
manan, P. Dollár, and C. L. Zitnick. Microsoft coco: Com-
mon objects in context. In ECCV, 2014. 1, 4
[23] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille.
Explain
images with multimodal recurrent neural networks. arXiv

preprint arXiv:1410.1090, 2014. 2, 3, 6, 7, 8
[24] T. Mikolov. Recurrent neural network based language
model.
1, 2, 3, 4, 5
[25] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient
esti-
mation of word representations in vector space. International
Conference on Learning Representations: Workshops Track,
2013. 3
[26] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J.
Cernocky.
Strategies for training large scale neural network language
models. In Automatic Speech Recognition and Understand-
ing (ASRU), 2011 IEEE Workshop on, pages 196–201. IEEE,
2011. 4
[27] T. Mikolov and G. Zweig. Context dependent recurrent
neu-
ral network language model. In SLT, pages 234–239, 2012.
1, 2, 3, 5
[28] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal,
A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and
H. Daumé III. Midge: Generating image descriptions from
computer vision detections. In EACL, pages 747–756. Asso-
ciation for Computational Linguistics, 2012. 2, 5, 6, 8
[29] A. Paivio, T. B. Rogers, and P. C. Smythe. Why are pic-
tures easier to recall than words? Psychonomic Science,
11(4):137–138, 1968. 1

[30] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a
method for automatic evaluation of machine translation. In
Proceedings of the 40th annual meeting on association for
computational linguistics, pages 311–318. Association for
Computational Linguistics, 2002. 1, 5
[31] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier.
Collecting image annotations using Amazon’s mechanical
turk. In NAACL HLT Workshop Creating Speech and Lan-
guage Data with Amazon’s Mechanical Turk, 2010. 1, 4
[32] R. Socher, Q. Le, C. Manning, and A. Ng. Grounded com-
positional semantics for finding and describing images with
sentences. In NIPS Deep Learning Workshop, 2013. 1, 2, 5,
6, 8
[33] I. Sutskever, J. Martens, and G. E. Hinton. Generating text
with recurrent neural networks. In Proceedings of the 28th
International Conference on Machine Learning (ICML-11),
pages 1017–1024, 2011. 2
[34] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol.
Extracting and composing robust features with denoising au-
toencoders. In Proceedings of the 25th international confer-
ence on Machine learning, pages 1096–1103. ACM, 2008.
3
[35] R. J. Williams and D. Zipser. Experimental analysis of the
real-time recurrent learning algorithm. Connection Science,
1(1):87–111, 1989. 4
[36] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos.
Corpus-guided sentence generation of natural images. In
EMNLP, 2011. 2

[37] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu.
I2T:
Image parsing to text description. Proceedings of the IEEE,
98(8):1485–1508, 2010. 2
[38] C. L. Zitnick and D. Parikh. Bringing semantics into fo-
cus using visual abstraction. In Computer Vision and Pat-
tern Recognition (CVPR), 2013 IEEE Conference on, pages
3009–3016. IEEE, 2013. 2
Show, Attend and Tell: Neural Image Caption
Generation with Visual Attention
Kelvin Xu [email protected]
Jimmy Lei Ba [email protected]
Ryan Kiros [email protected]
Kyunghyun Cho [email protected]
Aaron Courville [email protected]
Ruslan Salakhutdinov [email protected]
Richard S. Zemel [email protected]
Yoshua Bengio [email protected]
Abstract
Inspired by recent work in machine translation
and object detection, we introduce an attention
based model that automatically learns to describe
the content of images. We describe how we
can train this model in a deterministic manner
using standard backpropagation techniques and
stochastically by maximizing a variational lower
bound. We also show through visualization how
the model is able to automatically learn to fix its
gaze on salient objects while generating the cor-

responding words in the output sequence. We
validate the use of attention with state-of-the-
art performance on three benchmark datasets:
Flickr8k, Flickr30k and MS COCO.
1. Introduction
Automatically generating captions of an image is a task
very close to the heart of scene understanding — one of the
primary goals of computer vision. Not only must caption
generation models be powerful enough to solve the com-
puter vision challenges of determining which objects are in
an image, but they must also be capable of capturing and
expressing their relationships in a natural language. For
this reason, caption generation has long been viewed as
a difficult problem. It is a very important challenge for
machine learning algorithms, as it amounts to mimicking
the remarkable human ability to compress huge amounts of
salient visual infomation into descriptive language.
Despite the challenging nature of this task, there has been
a recent surge of research interest in attacking the image
caption generation problem. Aided by advances in training
neural networks (Krizhevsky et al., 2012) and large clas-
sification datasets (Russakovsky et al., 2014), recent work
Figure 1. Our model learns a words/image alignment. The
visual-
ized attentional maps (3) are explained in section 3.1 & 5.4
1. Input
Image
2. Convolutional
Feature Extraction
3. RNN with attention

LSTM
4. Word by
word
14x14 Feature Map
over the image
generation
A
bird
flying
over
a
body
of
water
has significantly improved the quality of caption genera-
tion using a combination of convolutional neural networks
(convnets) to obtain vectorial representation of images and
recurrent neural networks to decode those representations
into natural language sentences (see Sec. 2).
One of the most curious facets of the human visual sys-
tem is the presence of attention (Rensink, 2000; Corbetta &
Shulman, 2002). Rather than compress an entire image into
a static representation, attention allows for salient features
to dynamically come to the forefront as needed. This is
especially important when there is a lot of clutter in an im-
age. Using representations (such as those from the top layer
of a convnet) that distill information in image down to the
most salient objects is one effective solution that has been
widely adopted in previous work. Unfortunately, this has

one potential drawback of losing information which could
be useful for richer, more descriptive captions. Using more
low-level representation can help preserve this information.
However working with these features necessitates a power-
ful mechanism to steer the model to information important
to the task at hand.
In this paper, we describe approaches to caption genera-
tion that attempt to incorporate a form of attention with
ar
X
iv
:1
50
2.
03
04
4v
3
[
cs
.L
G
]
1
9
A

pr
2
01
6
Neural Image Caption Generation with Visual Attention
Figure 2. Attention over time. As the model generates each
word, its attention changes to reflect the relevant parts of the
image. “soft”
(top row) vs “hard” (bottom row) attention. (Note that both
models generated the same captions in this example.)
Figure 3. Examples of attending to the correct object (white
indicates the attended regions, underlines indicated the
corresponding word)
two variants: a “hard” attention mechanism and a “soft”
attention mechanism. We also show how one advantage of
including attention is the ability to visualize what the model
“sees”. Encouraged by recent advances in caption genera-
tion and inspired by recent success in employing attention
in machine translation (Bahdanau et al., 2014) and object
recognition (Ba et al., 2014; Mnih et al., 2014), we investi-
gate models that can attend to salient part of an image while
generating its caption.
The contributions of this paper are the following:
• We introduce two attention-based image caption gen-
erators under a common framework (Sec. 3.1): 1) a
“soft” deterministic attention mechanism trainable by

standard back-propagation methods and 2) a “hard”
stochastic attention mechanism trainable by maximiz-
ing an approximate variational lower bound or equiv-
alently by REINFORCE (Williams, 1992).
• We show how we can gain insight and interpret the
results of this framework by visualizing “where” and
“what” the attention focused on. (see Sec. 5.4)
• Finally, we quantitatively validate the usefulness of
attention in caption generation with state of the art
performance (Sec. 5.3) on three benchmark datasets:
Flickr8k (Hodosh et al., 2013) , Flickr30k (Young
et al., 2014) and the MS COCO dataset (Lin et al.,
2014).
2. Related Work
In this section we provide relevant background on previous
work on image caption generation and attention. Recently,
several methods have been proposed for generating image
descriptions. Many of these methods are based on recur-
rent neural networks and inspired by the successful use of
sequence to sequence training with neural networks for ma-
chine translation (Cho et al., 2014; Bahdanau et al., 2014;
Sutskever et al., 2014). One major reason image caption
generation is well suited to the encoder-decoder framework
(Cho et al., 2014) of machine translation is because it is
analogous to “translating” an image to a sentence.
The first approach to use neural networks for caption gener-
ation was Kiros et al. (2014a), who proposed a multimodal
log-bilinear model that was biased by features from the im-
age. This work was later followed by Kiros et al. (2014b)
whose method was designed to explicitly allow a natural
way of doing both ranking and generation. Mao et al.
(2014) took a similar approach to generation but replaced a

feed-forward neural language model with a recurrent one.
Both Vinyals et al. (2014) and Donahue et al. (2014) use
LSTM RNNs for their models. Unlike Kiros et al. (2014a)
and Mao et al. (2014) whose models see the image at each
time step of the output word sequence, Vinyals et al. (2014)
only show the image to the RNN at the beginning. Along
with images, Donahue et al. (2014) also apply LSTMs to
videos, allowing their model to generate video descriptions.
All of these works represent images as a single feature vec-
tor from the top layer of a pre-trained convolutional net-
work. Karpathy & Li (2014) instead proposed to learn a
joint embedding space for ranking and generation whose
model learns to score sentence and image similarity as a
function of R-CNN object detections with outputs of a bidi-
rectional RNN. Fang et al. (2014) proposed a three-step
pipeline for generation by incorporating object detections.
Their model first learn detectors for several visual concepts
based on a multi-instance learning framework. A language
model trained on captions was then applied to the detector
outputs, followed by rescoring from a joint image-text em-
bedding space. Unlike these models, our proposed atten-
tion framework does not explicitly use object detectors but
instead learns latent alignments from scratch. This allows
our model to go beyond “objectness” and learn to attend to
abstract concepts.
Prior to the use of neural networks for generating captions,
two main approaches were dominant. The first involved
generating caption templates which were filled in based
on the results of object detections and attribute discovery

(Kulkarni et al. (2013), Li et al. (2011), Yang et al. (2011),
Mitchell et al. (2012), Elliott & Keller (2013)). The second
approach was based on first retrieving similar captioned im-
ages from a large database then modifying these retrieved
captions to fit the query (Kuznetsova et al., 2012; 2014).
These approaches typically involved an intermediate “gen-
eralization” step to remove the specifics of a caption that
are only relevant to the retrieved image, such as the name
of a city. Both of these approaches have since fallen out of
favour to the now dominant neural network methods.
There has been a long line of previous work incorpo-
rating attention into neural networks for vision related
tasks. Some that share the same spirit as our work include
Larochelle & Hinton (2010); Denil et al. (2012); Tang et al.
(2014). In particular however, our work directly extends
the work of Bahdanau et al. (2014); Mnih et al. (2014); Ba
et al. (2014).
3. Image Caption Generation with Attention
Mechanism
3.1. Model Details
In this section, we describe the two variants of our
attention-based model by first describing their common
framework. The main difference is the definition of the
φ function which we describe in detail in Section 4. We
denote vectors with bolded font and matrices with capital
letters. In our description below, we suppress bias terms for
readability.
f
c

oi
ht
ht-1
zt
Eyt-1 ht-1
zt
Eyt-1
ht-1
zt
Eyt-1
ht-1
zt
Eyt-1
input gate output gate
memory cell
forget gate
input modulator
Figure 4. A LSTM cell, lines with bolded squares imply projec-
tions with a learnt weight vector. Each cell learns how to weigh
its input components (input gate), while learning how to
modulate
that contribution to the memory (input modulator). It also learns

weights which erase the memory cell (forget gate), and weights
which control how this memory should be emitted (output gate).
3.1.1. ENCODER: CONVOLUTIONAL FEATURES
Our model takes a single raw image and generates a caption
y encoded as a sequence of 1-of-K encoded words.
y = {y1, . . . ,yC} , yi ∈ RK
where K is the size of the vocabulary and C is the length
of the caption.
We use a convolutional neural network in order to extract a
set of feature vectors which we refer to as annotation vec-
tors. The extractor produces L vectors, each of which is
a D-dimensional representation corresponding to a part of
the image.
a = {a1, . . . ,aL} , ai ∈ RD
In order to obtain a correspondence between the feature
vectors and portions of the 2-D image, we extract features
from a lower convolutional layer unlike previous work
which instead used a fully connected layer. This allows the
decoder to selectively focus on certain parts of an image by
selecting a subset of all the feature vectors.
3.1.2. DECODER: LONG SHORT-TERM MEMORY
NETWORK
We use a long short-term memory (LSTM) net-
work (Hochreiter & Schmidhuber, 1997) that produces a
caption by generating one word at every time step condi-
tioned on a context vector, the previous hidden state and the
previously generated words. Our implementation of LSTM

closely follows the one used in Zaremba et al. (2014) (see
Fig. 4). Using Ts,t : Rs → Rt to denote a simple affine
transformation with parameters that are learned,
it
ft
ot
gt
σ
σ
σ
tanh
ẑt

ct = ft �ct−1 + it �gt (2)
ht = ot � tanh(ct). (3)
Here, it, ft, ct, ot, ht are the input, forget, memory, out-
put and hidden state of the LSTM, respectively. The vector
ẑ ∈ RD is the context vector, capturing the visual infor-
mation associated with a particular input location, as ex-
plained below. E ∈ Rm×K is an embedding matrix. Let
m and n denote the embedding and LSTM dimensionality
respectively and σ and � be the logistic sigmoid activation
and element-wise multiplication respectively.
In simple terms, the context vector ẑt (equations (1)–(3)) is
a dynamic representation of the relevant part of the image
input at time t. We define a mechanism φ that computes ẑt
from the annotation vectors ai, i = 1, . . . ,L corresponding
to the features extracted at different image locations. For
each location i, the mechanism generates a positive weight
αi which can be interpreted either as the probability that
location i is the right place to focus for producing the next
word (the “hard” but stochastic attention mechanism), or as
the relative importance to give to location i in blending the
ai’s together. The weight αi of each annotation vector ai
is computed by an attention model fatt for which we use
a multilayer perceptron conditioned on the previous hidden
state ht−1. The soft version of this attention mechanism
was introduced by Bahdanau et al. (2014). For emphasis,
we note that the hidden state varies as the output RNN ad-
vances in its output sequence: “where” the network looks
next depends on the sequence of words that has already
been generated.

eti =fatt(ai,ht−1) (4)
αti =
exp(eti)∑L
k=1 exp(etk)
. (5)
Once the weights (which sum to one) are computed, the
context vector ẑt is computed by
ẑt = φ({ai} ,{αi}) , (6)
where φ is a function that returns a single vector given the
set of annotation vectors and their corresponding weights.
The details of φ function are discussed in Sec. 4.
The initial memory state and hidden state of the LSTM
are predicted by an average of the annotation vectors fed
through two separate MLPs (init,c and init,h):
c0 = finit,c(
1
L
L∑
i
ai)
h0 = finit,h(
1
L

L∑
i
ai)
In this work, we use a deep output layer (Pascanu et al.,
2014) to compute the output word probability given the
LSTM state, the context vector and the previous word:
p(yt|a,yt−11 ) ∝ exp(Lo(Eyt−1 + Lhht + Lzẑt)) (7)
Where Lo ∈ RK×m, Lh ∈ Rm×n, Lz ∈ Rm×D, and E
are learned parameters initialized randomly.
4. Learning Stochastic “Hard” vs
Deterministic “Soft” Attention
In this section we discuss two alternative mechanisms for
the attention model fatt: stochastic attention and determin-
istic attention.
4.1. Stochastic “Hard” Attention
We represent the location variable st as where the model
decides to focus attention when generating the tth word.
st,i is an indicator one-hot variable which is set to 1 if the
i-th location (out of L) is the one used to extract visual
features. By treating the attention locations as intermedi-
ate latent variables, we can assign a multinoulli distribution
parametrized by {αi}, and view ẑt as a random variable:
p(st,i = 1 | sj<t,a) = αt,i (8)
ẑt =
∑

i
st,iai. (9)
We define a new objective function Ls that is a variational
lower bound on the marginal log-likelihood log p(y | a) of
observing the sequence of words y given image features a.
The learning algorithm for the parameters W of the models
can be derived by directly optimizing Ls:
Ls =
∑
s
p(s | a) log p(y | s,a)
≤ log
∑
s
p(s | a)p(y | s,a)
= log p(y | a) (10)
∂Ls
∂W
=
∑
s
p(s | a)
[
∂ log p(y | s,a)
∂W

+
log p(y | s,a)
∂ log p(s | a)
∂W
]
. (11)
Figure 5. Examples of mistakes where we can use attention to
gain intuition into what the model saw.
Equation 11 suggests a Monte Carlo based sampling ap-
proximation of the gradient with respect to the model pa-
rameters. This can be done by sampling the location st
from a multinouilli distribution defined by Equation 8.
s̃t ∼ MultinoulliL({αi})
∂Ls
∂W
≈
1
N
N∑
n=1
[

∂ log p(y | s̃n,a)
∂W
+
log p(y | s̃n,a)
∂ log p(s̃n | a)
∂W
]
(12)
A moving average baseline is used to reduce the vari-
ance in the Monte Carlo estimator of the gradient, follow-
ing Weaver & Tao (2001). Similar, but more complicated
variance reduction techniques have previously been used
by Mnih et al. (2014) and Ba et al. (2014). Upon seeing the
kth mini-batch, the moving average baseline is estimated
as an accumulated sum of the previous log likelihoods with
exponential decay:
bk = 0.9× bk−1 + 0.1× log p(y | s̃k,a)
To further reduce the estimator variance, an entropy term
on the multinouilli distribution H[s] is added. Also, with
probability 0.5 for a given image, we set the sampled at-
tention location s̃ to its expected value α. Both techniques
improve the robustness of the stochastic attention learning
algorithm. The final learning rule for the model is then the
following:
∂Ls
∂W

≈
1
N
N∑
n=1
[
∂ log p(y | s̃n,a)
∂W
+
λr(log p(y | s̃n,a)− b)
∂ log p(s̃n | a)
∂W
+ λe
∂H[s̃n]
∂W
]
where, λr and λe are two hyper-parameters set by cross-
validation. As pointed out and used in Ba et al. (2014)
and Mnih et al. (2014), this is formulation is equivalent to
the REINFORCE learning rule (Williams, 1992), where the
reward for the attention choosing a sequence of actions is
a real value proportional to the log likelihood of the target
sentence under the sampled attention trajectory.
In making a hard choice at every point, φ({ai} ,{αi})
from Equation 6 is a function that returns a sampled ai at
every point in time based upon a multinouilli distribution

parameterized by α.
4.2. Deterministic “Soft” Attention
Learning stochastic attention requires sampling the atten-
tion location st each time, instead we can take the expecta-
tion of the context vector ẑt directly,
Ep(st|a)[ẑt] =
L∑
i=1
αt,iai (13)
and formulate a deterministic attention model by com-
puting a soft attention weighted annotation vector
φ({ai} ,{αi}) =
∑L
i αiai as introduced by Bahdanau
et al. (2014). This corresponds to feeding in a soft α
weighted context into the system. The whole model is
smooth and differentiable under the deterministic attention,
so learning end-to-end is trivial by using standard back-
propagation.
Learning the deterministic attention can also be under-
stood as approximately optimizing the marginal likelihood
in Equation 10 under the attention location random vari-
able st from Sec. 4.1. The hidden activation of LSTM

ht is a linear projection of the stochastic context vector
ẑt followed by tanh non-linearity. To the first order Tay-
lor approximation, the expected value Ep(st|a)[ht] is equal
to computing ht using a single forward prop with the ex-
pected context vector Ep(st|a)[ẑt]. Considering Eq. 7, let
nt = Lo(Eyt−1+Lhht+Lzẑt), nt,i denotes nt computed
by setting the random variable ẑ value to ai. We define the
normalized weighted geometric mean for the softmax kth
word prediction:
NWGM[p(yt = k | a)] =
∏
i exp(nt,k,i)
p(st,i=1|a)∑
j
∏
i exp(nt,j,i)
p(st,i=1|a)
=
exp(Ep(st|a)[nt,k])∑
j exp(Ep(st|a)[nt,j])
The equation above shows the normalized weighted ge-
ometric mean of the caption prediction can be approxi-
mated well by using the expected context vector, where
E[nt] = Lo(Eyt−1 + LhE[ht] + LzE[ẑt]). It shows that
the NWGM of a softmax unit is obtained by applying soft-
max to the expectations of the underlying linear projec-
tions. Also, from the results in (Baldi & Sadowski, 2014),
NWGM[p(yt = k | a)] ≈ E[p(yt = k | a)] under
softmax activation. That means the expectation of the out-

puts over all possible attention locations induced by ran-
dom variable st is computed by simple feedforward propa-
gation with expected context vector E[ẑt]. In other words,
the deterministic attention model is an approximation to the
marginal likelihood over the attention locations.
4.2.1. DOUBLY STOCHASTIC ATTENTION
By construction,
∑
i αti = 1 as they are the output of a
softmax. In training the deterministic version of our model
we introduce a form of doubly stochastic regularization,
where we also encourage
∑
t αti ≈ 1. This can be in-
terpreted as encouraging the model to pay equal attention
to every part of the image over the course of generation. In
our experiments, we observed that this penalty was impor-
tant quantitatively to improving overall BLEU score and
that qualitatively this leads to more rich and descriptive
captions. In addition, the soft attention model predicts a
gating scalar β from previous hidden state ht−1 at each
time step t, such that, φ({ai} ,{αi}) = β
∑L
i αiai, where
βt = σ(fβ(ht−1)). We notice our attention weights put
more emphasis on the objects in the images by including
the scalar β.

Concretely, the model is trained end-to-end by minimizing
the following penalized negative log-likelihood:
Ld = − log(P(y|x)) + λ
L∑
i
(1−
C∑
t
αti)
2 (14)
4.3. Training Procedure
Both variants of our attention model were trained with
stochastic gradient descent using adaptive learning rate al-
gorithms. For the Flickr8k dataset, we found that RM-
SProp (Tieleman & Hinton, 2012) worked best, while for
Flickr30k/MS COCO dataset we used the recently pro-
posed Adam algorithm (Kingma & Ba, 2014) .
To create the annotations ai used by our decoder, we used
the Oxford VGGnet (Simonyan & Zisserman, 2014) pre-
trained on ImageNet without finetuning. In principle how-
ever, any encoding function could be used. In addition,
with enough data, we could also train the encoder from
scratch (or fine-tune) with the rest of the model. In our ex-
periments we use the 14×14×512 feature map of the fourth
convolutional layer before max pooling. This means our
decoder operates on the flattened 196 × 512 (i.e L × D)
encoding.
As our implementation requires time proportional to the
length of the longest sentence per update, we found train-

ing on a random group of captions to be computationally
wasteful. To mitigate this problem, in preprocessing we
build a dictionary mapping the length of a sentence to the
corresponding subset of captions. Then, during training we
randomly sample a length and retrieve a mini-batch of size
64 of that length. We found that this greatly improved con-
vergence speed with no noticeable diminishment in perfor-
mance. On our largest dataset (MS COCO), our soft atten-
tion model took less than 3 days to train on an NVIDIA
Titan Black GPU.
In addition to dropout (Srivastava et al., 2014), the only
other regularization strategy we used was early stopping
on BLEU score. We observed a breakdown in correla-
tion between the validation set log-likelihood and BLEU in
the later stages of training during our experiments. Since
BLEU is the most commonly reported metric, we used
BLEU on our validation set for model selection.
In our experiments with soft attention, we also used Whet-
lab1 (Snoek et al., 2012; 2014) in our Flickr8k experi-
ments. Some of the intuitions we gained from hyperparam-
eter regions it explored were especially important in our
Flickr30k and COCO experiments.
We make our code for these models based in Theano
1https://www.whetlab.com/
https://www.whetlab.com/
Table 1. BLEU-1,2,3,4/METEOR metrics compared to other
methods, † indicates a different split, (—) indicates an unknown
metric, ◦

indicates the authors kindly provided missing metrics by
personal communication, Σ indicates an ensemble, a indicates
using AlexNet
BLEU
Dataset Model BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR
Flickr8k
Google NIC(Vinyals et al., 2014)†Σ
Log Bilinear (Kiros et al., 2014a)◦
Soft-Attention
Hard-Attention
63
65.6
67
67
41
42.4
44.8
45.7
27
27.7
29.9
31.4
—
17.7
19.5
21.3

—
17.31
18.93
20.30
Flickr30k
Google NIC†◦Σ
Log Bilinear
Soft-Attention
Hard-Attention
66.3
60.0
66.7
66.9
42.3
38
43.4
43.9
27.7
25.4
28.8
29.6
18.3
17.1
19.1
19.9
—
16.88

18.49
18.46
COCO
CMU/MS Research (Chen & Zitnick, 2014)a
MS Research (Fang et al., 2014)†a
BRNN (Karpathy & Li, 2014)◦
Google NIC†◦Σ
Log Bilinear◦
Soft-Attention
Hard-Attention
—
—
64.2
66.6
70.8
70.7
71.8
—
—
45.1
46.1
48.9
49.2
50.4

—
—
30.4
32.9
34.4
34.4
35.7
—
—
20.3
24.6
24.3
24.3
25.0
20.41
20.71
—
—
20.03
23.90
23.04
(Bergstra et al., 2010) publicly available upon publication
to encourage future research in this area.
5. Experiments
We describe our experimental methodology and quantita-
tive results which validate the effectiveness of our model
for caption generation.

5.1. Data
We report results on the popular Flickr8k and Flickr30k
dataset which has 8,000 and 30,000 images respectively
as well as the more challenging Microsoft COCO dataset
which has 82,783 images. The Flickr8k/Flickr30k dataset
both come with 5 reference sentences per image, but for
the MS COCO dataset, some of the images have references
in excess of 5 which for consistency across our datasets
we discard. We applied only basic tokenization to MS
COCO so that it is consistent with the tokenization present
in Flickr8k and Flickr30k. For all our experiments, we used
a fixed vocabulary size of 10,000.
Results for our attention-based architecture are reported in
Table 4.2.1. We report results with the frequently used
BLEU metric2 which is the standard in the caption gen-
eration literature. We report BLEU from 1 to 4 with-
2We verified that our BLEU evaluation code matches the au-
thors of Vinyals et al. (2014), Karpathy & Li (2014) and Kiros
et al. (2014b). For fairness, we only compare against results for
which we have verified that our BLEU evaluation code is the
same. With the upcoming release of the COCO evaluation
server,
we will include comparison results with all other recent image
captioning models.
out a brevity penalty. There has been, however, criticism
of BLEU, so in addition we report another common met-
ric METEOR (Denkowski & Lavie, 2014), and compare
whenever possible.
5.2. Evaluation Procedures
A few challenges exist for comparison, which we explain

here. The first is a difference in choice of convolutional
feature extractor. For identical decoder architectures, us-
ing more recent architectures such as GoogLeNet or Ox-
ford VGG Szegedy et al. (2014), Simonyan & Zisserman
(2014) can give a boost in performance over using the
AlexNet (Krizhevsky et al., 2012). In our evaluation, we
compare directly only with results which use the compa-
rable GoogLeNet/Oxford VGG features, but for METEOR
comparison we note some results that use AlexNet.
The second challenge is a single model versus ensemble
comparison. While other methods have reported perfor-
mance boosts by using ensembling, in our results we report
a single model performance.
Finally, there is challenge due to differences between
dataset splits. In our reported results, we use the pre-
defined splits of Flickr8k. However, one challenge for the
Flickr30k and COCO datasets is the lack of standardized
splits. As a result, we report with the publicly available
splits3 used in previous work (Karpathy & Li, 2014). In our
experience, differences in splits do not make a substantial
difference in overall performance, but we note the differ-
3http://cs.stanford.edu/people/karpathy/
deepimagesent/
http://cs.stanford.edu/people/karpathy/deepimagesent/
http://cs.stanford.edu/people/karpathy/deepimagesent/
ences where they exist.
5.3. Quantitative Analysis

In Table 4.2.1, we provide a summary of the experi-
ment validating the quantitative effectiveness of attention.
We obtain state of the art performance on the Flickr8k,
Flickr30k and MS COCO. In addition, we note that in our
experiments we are able to significantly improve the state
of the art performance METEOR on MS COCO that we
speculate is connected to some of the regularization tech-
niques we used 4.2.1 and our lower level representation.
Finally, we also note that we are able to obtain this perfor-
mance using a single model without an ensemble.
5.4. Qualitative Analysis: Learning to attend
By visualizing the attention component learned by the
model, we are able to add an extra layer of interpretabil-
ity to the output of the model (see Fig. 1). Other systems
that have done this rely on object detection systems to pro-
duce candidate alignment targets (Karpathy & Li, 2014).
Our approach is much more flexible, since the model can
attend to “non object” salient regions.
The 19-layer OxfordNet uses stacks of 3x3 filters mean-
ing the only time the feature maps decrease in size are due
to the max pooling layers. The input image is resized so
that the shortest side is 256 dimensional with preserved as-
pect ratio. The input to the convolutional network is the
center cropped 224x224 image. Consequently, with 4 max
pooling layers we get an output dimension of the top con-
volutional layer of 14x14. Thus in order to visualize the
attention weights for the soft model, we simply upsample
the weights by a factor of 24 = 16 and apply a Gaussian
filter. We note that the receptive fields of each of the 14x14
units are highly overlapping.
As we can see in Figure 2 and 3, the model learns align-

ments that correspond very strongly with human intuition.
Especially in the examples of mistakes, we see that it is
possible to exploit such visualizations to get an intuition as
to why those mistakes were made. We provide a more ex-
tensive list of visualizations in Appendix A for the reader.
6. Conclusion
We propose an attention based approach that gives state
of the art performance on three benchmark datasets us-
ing the BLEU and METEOR metric. We also show how
the learned attention can be exploited to give more inter-
pretability into the models generation process, and demon-
strate that the learned alignments correspond very well to
human intuition. We hope that the results of this paper will
encourage future work in using visual attention. We also
expect that the modularity of the encoder-decoder approach
combined with attention to have useful applications in other
domains.
Acknowledgments
The authors would like to thank the developers of
Theano (Bergstra et al., 2010; Bastien et al., 2012). We
acknowledge the support of the following organizations
for research funding and computing support: the Nuance
Foundation, NSERC, Samsung, Calcul Québec, Compute
Canada, the Canada Research Chairs and CIFAR. The au-
thors would also like to thank Nitish Srivastava for assis-
tance with his ConvNet package as well as preparing the
Oxford convolutional network and Relu Patrascu for help-
ing with numerous infrastructure related problems.
References
Ba, Jimmy Lei, Mnih, Volodymyr, and Kavukcuoglu, Ko-
ray. Multiple object recognition with visual attention.

arXiv:1412.7755, December 2014.
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua.
Neu-
ral machine translation by jointly learning to align and trans-
late. arXiv:1409.0473, September 2014.
Baldi, Pierre and Sadowski, Peter. The dropout learning algo-
rithm. Artificial intelligence, 210:78–122, 2014.
Bastien, Frederic, Lamblin, Pascal, Pascanu, Razvan, Bergstra,
James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nico-
las, Warde-Farley, David, and Bengio, Yoshua. Theano:
new features and speed improvements. Submited to the
Deep Learning and Unsupervised Feature Learning NIPS 2012
Workshop, 2012.
Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lam-
blin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian,
Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a
CPU and GPU math expression compiler. In Proceedings of
the Python for Scientific Computing Conference (SciPy), 2010.
Chen, Xinlei and Zitnick, C Lawrence. Learning a recurrent
visual representation for image caption generation. arXiv
preprint arXiv:1411.5654, 2014.
Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar,
Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua.
Learning phrase representations using RNN encoder-decoder
for statistical machine translation. In EMNLP, October 2014.
Corbetta, Maurizio and Shulman, Gordon L. Control of goal-
directed and stimulus-driven attention in the brain. Nature re-
views neuroscience, 3(3):201–215, 2002.

Denil, Misha, Bazzani, Loris, Larochelle, Hugo, and de Freitas,
Nando. Learning where to attend with deep architectures for
image tracking. Neural Computation, 2012.
Denkowski, Michael and Lavie, Alon. Meteor universal: Lan-
guage specific translation evaluation for any target language.
In Proceedings of the EACL 2014 Workshop on Statistical Ma-
chine Translation, 2014.
Donahue, Jeff, Hendrikcs, Lisa Anne, Guadarrama, Se-
gio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko,
Kate, and Darrell, Trevor. Long-term recurrent convo-
lutional networks for visual recognition and description.
arXiv:1411.4389v2, November 2014.
Elliott, Desmond and Keller, Frank. Image description using vi-
sual dependency representations. In EMNLP, 2013.
Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava,
Rupesh,
Deng, Li, Dollár, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell,
Margaret, Platt, John, et al. From captions to visual concepts
and back. arXiv:1411.4952, November 2014.
Hochreiter, S. and Schmidhuber, J. Long short-term memory.
Neural Computation, 9(8):1735–1780, 1997.
Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing
image description as a ranking task: Data, models and evalu-
ation metrics. Journal of Artificial Intelligence Research, pp.
853–899, 2013.

Karpathy, Andrej and Li, Fei-Fei. Deep visual-semantic align-
ments for generating image descriptions. arXiv:1412.2306,
December 2014.
Kingma, Diederik P. and Ba, Jimmy. Adam: A Method
for Stochastic Optimization. arXiv:1412.6980, December
2014.
Kiros, Ryan, Salahutdinov, Ruslan, and Zemel, Richard. Multi-
modal neural language models. In International Conference on
Machine Learning, pp. 595–603, 2014a.
Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard. Uni-
fying visual-semantic embeddings with multimodal neural lan-
guage models. arXiv:1411.2539, November 2014b.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey. Ima-
geNet classification with deep convolutional neural networks.
In NIPS. 2012.
Kulkarni, Girish, Premraj, Visruth, Ordonez, Vicente, Dhar,
Sag-
nik, Li, Siming, Choi, Yejin, Berg, Alexander C, and Berg,
Tamara L. Babytalk: Understanding and generating simple im-
age descriptions. PAMI, IEEE Transactions on, 35(12):2891–
2903, 2013.
Kuznetsova, Polina, Ordonez, Vicente, Berg, Alexander C,
Berg,
Tamara L, and Choi, Yejin. Collective generation of natural
image descriptions. In Association for Computational Linguis-
tics. ACL, 2012.
Kuznetsova, Polina, Ordonez, Vicente, Berg, Tamara L, and
Choi,
Yejin. Treetalk: Composition and compression of trees for im-

age descriptions. TACL, 2(10):351–362, 2014.
Larochelle, Hugo and Hinton, Geoffrey E. Learning to com-
bine foveal glimpses with a third-order boltzmann machine. In
NIPS, pp. 1243–1251, 2010.
Li, Siming, Kulkarni, Girish, Berg, Tamara L, Berg, Alexander
C,
and Choi, Yejin. Composing simple image descriptions us-
ing web-scale n-grams. In Computational Natural Language
Learning. ACL, 2011.
Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James,
Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick,
C Lawrence. Microsoft coco: Common objects in context. In
ECCV, pp. 740–755. 2014.
Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille,
Alan.
Deep captioning with multimodal recurrent neural networks
(m-rnn). arXiv:1412.6632, December 2014.
Mitchell, Margaret, Han, Xufeng, Dodge, Jesse, Mensch,
Alyssa,
Goyal, Amit, Berg, Alex, Yamaguchi, Kota, Berg, Tamara,
Stratos, Karl, and Daumé III, Hal. Midge: Generating im-
age descriptions from computer vision detections. In European
Chapter of the Association for Computational Linguistics, pp.
747–756. ACL, 2012.
Mnih, Volodymyr, Hees, Nicolas, Graves, Alex, and
Kavukcuoglu, Koray. Recurrent models of visual atten-
tion. In NIPS, 2014.
Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, and Ben-
gio, Yoshua. How to construct deep recurrent neural networks.

In ICLR, 2014.
Rensink, Ronald A. The dynamic representation of scenes.
Visual
cognition, 7(1-3):17–42, 2000.
Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan,
Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, An-
drej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C.,
and Fei-Fei, Li. ImageNet Large Scale Visual Recognition
Challenge, 2014.
Simonyan, K. and Zisserman, A. Very deep convolu-
tional networks for large-scale image recognition. CoRR,
abs/1409.1556, 2014.
Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practi-
cal bayesian optimization of machine learning algorithms. In
NIPS, pp. 2951–2959, 2012.
Snoek, Jasper, Swersky, Kevin, Zemel, Richard S, and Adams,
Ryan P. Input warping for bayesian optimization of non-
stationary functions. arXiv preprint arXiv:1402.0929, 2014.
Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex,
Sutskever,
Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to
prevent neural networks from overfitting. JMLR, 15, 2014.
Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc VV. Sequence to
sequence learning with neural networks. In NIPS, pp. 3104–
3112, 2014.
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre,
Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Van-
houcke, Vincent, and Rabinovich, Andrew. Going deeper with

convolutions. arXiv preprint arXiv:1409.4842, 2014.
Tang, Yichuan, Srivastava, Nitish, and Salakhutdinov, Ruslan
R.
Learning generative models with visual attention. In NIPS, pp.
1808–1816, 2014.
Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 - rmsprop.
Technical report, 2012.
Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan,
Dumitru. Show and tell: A neural image caption generator.
arXiv:1411.4555, November 2014.
Weaver, Lex and Tao, Nigel. The optimal reward baseline for
gradient-based reinforcement learning. In Proc. UAI’2001, pp.
538–545, 2001.
Williams, Ronald J. Simple statistical gradient-following al-
gorithms for connectionist reinforcement learning. Machine
learning, 8(3-4):229–256, 1992.
Yang, Yezhou, Teo, Ching Lik, Daumé III, Hal, and Aloimonos,
Yiannis. Corpus-guided sentence generation of natural images.
In EMNLP, pp. 444–454. ACL, 2011.
Young, Peter, Lai, Alice, Hodosh, Micah, and Hockenmaier,
Julia.
From image descriptions to visual denotations: New similarity
metrics for semantic inference over event descriptions. TACL,
2:67–78, 2014.

Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Re-
current neural network regularization. arXiv preprint
arXiv:1409.2329, September 2014.
A. Appendix
Visualizations from our “hard” (a) and “soft” (b) attention
model. White indicates the regions where the model roughly
attends to (see section 5.4).
AA manman andand
aa womanwoman playingplaying frisbeefrisbee
inin aa fieldfield ..
(a) A man and a woman playing frisbee in a field.
(b) A woman is throwing a frisbee in a park.
Figure 6.
AA giraffegiraffe standingstanding
inin aa fieldfield withwith
treestrees ..
(a) A giraffe standing in the field with trees.

(b) A large white bird standing in a forest.
Figure 7.
AA dogdog isis
layinglaying onon aa bedbed
withwith aa bookbook ..
(a) A dog is laying on a bed with a book.
(b) A dog is standing on a hardwood floor.
Figure 8.
AA womanwoman isis
holdingholding aa donutdonut inin
hishis handhand ..
(a) A woman is holding a donut in his hand.
(b) A woman holding a clock in her hand.
Figure 9.

AA stopstop signsign
withwith aa stopstop signsign
onon itit ..
(a) A stop sign with a stop sign on it.
(b) A stop sign is on a road with a mountain in the background.
Figure 10.
AA manman inin
aa suitsuit andand aa
hathat holdingholding aa remoteremote
controlcontrol ..
(a) A man in a suit and a hat holding a remote control.
(b) A man wearing a hat and a hat on a skateboard.
Figure 11.

AA littlelittle girlgirl
sittingsitting onon aa couchcouch
withwith aa teddyteddy bearbear
..
(a) A little girl sitting on a couch with a teddy bear.
(b) A little girl sitting on a bed with a teddy bear.
AA manman isis
standingstanding onon aa beachbeach
withwith aa surfboardsurfboard ..
(a) A man is standing on a beach with a surfboard.
(b) A person is standing on a beach with a surfboard.
AA manman andand
aa womanwoman ridingriding aa

boatboat inin thethe waterwater
..
(a) A man and a woman riding a boat in the water.
(b) A group of people sitting on a boat in the water.
Figure 12.
AA manman isis
standingstanding inin aa marketmarket
withwith aa largelarge amountamount
ofof foodfood ..
(a) A man is standing in a market with a large amount of food.
(b) A woman is sitting at a table with a large pizza.
Figure 13.
AA giraffegiraffe standingstanding
inin aa fieldfield withwith

treestrees ..
(a) A giraffe standing in a field with trees.
(b) A giraffe standing in a forest with trees in the background.
Figure 14.
AA groupgroup ofof
peoplepeople standingstanding nextnext toto
eacheach otherother ..
(a) A group of people standing next to each other.
(b) A man is talking on his cell phone while another man
watches.
Figure 15.
University of Groningen
Automatic Description Generation from Images
Bernardi, Raffaella; Cakici, Ruket; Elliott, Desmond; Erdem,
Aykut; Erdem, Erkut; Ikizler-

Cinbis, Nazli; Keller, Frank; Muscat, Adrian; Plank, Barbara
Published in:
Journal of artificial intelligence research
DOI:
10.1613/jair.4900
IMPORTANT NOTE: You are advised to consult the publisher's
version (publisher's PDF) if you wish to cite from
it. Please check the document version below.
Document Version
Publisher's PDF, also known as Version of record
Publication date:
2016
Link to publication in University of Groningen/UMCG research
database
Citation for published version (APA):
Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E.,
Ikizler-Cinbis, N., ... Plank, B. (2016). Automatic
Description Generation from Images: A Survey of Models,
Datasets, and Evaluation Measures. Journal of
artificial intelligence research, 55, 409-442. [4900].
https://doi.org/10.1613/jair.4900
Copyright
Other than for strictly personal use, it is not permitted to
download or to forward/distribute the text or part of it without
the consent of the
author(s) and/or copyright holder(s), unless the work is under
an open content license (like Creative Commons).
Take-down policy

If you believe that this document breaches copyright please
contact us providing details, and we will remove access to the
work immediately
and investigate your claim.
Downloaded from the University of Groningen/UMCG research
database (Pure): http://www.rug.nl/research/portal. For
technical reasons the
number of authors shown on this cover page is limited to 10
maximum.
Download date: 09-04-2020
https://www.rug.nl/research/portal/en/publications/automatic-
description-generation-from-images(b2b505c4-212d-4e10-9c12-
1d33ed6429ce).html
1d33ed6429ce).html
1d33ed6429ce).html
https://www.rug.nl/research/portal/en/journals/journal-of-
artificial-intelligence-research(2d21fcbd-8d5e-4bc4-9dd5-
362c2a3fe629).html
https://www.rug.nl/research/portal/en/journals/journal-of-
artificial-intelligence-research(2d21fcbd-8d5e-4bc4-9dd5-
362c2a3fe629).html
Journal of Artificial Intelligence Research 55 (2016) 409-442
Submitted 07/15; published 02/16
Automatic Description Generation from Images: A Survey

of Models, Datasets, and Evaluation Measures
Raffaella Bernardi [email protected]
University of Trento, Italy
Ruket Cakici [email protected]
Middle East Technical University, Turkey
Desmond Elliott [email protected]
University of Amsterdam, Netherlands
Aykut Erdem [email protected]
Erkut Erdem [email protected]
Nazli Ikizler-Cinbis [email protected]
Hacettepe University, Turkey
Frank Keller [email protected]
University of Edinburgh, UK
Adrian Muscat [email protected]
University of Malta, Malta
Barbara Plank [email protected]
University of Copenhagen, Denmark
Abstract
Automatic description generation from natural images is a
challenging problem that
has recently received a large amount of interest from the
computer vision and natural lan-
guage processing communities. In this survey, we classify the
existing approaches based
on how they conceptualize this problem, viz., models that cast
description as either gen-
eration problem or as a retrieval problem over a visual or

multimodal representational
space. We provide a detailed review of existing models,
highlighting their advantages and
disadvantages. Moreover, we give an overview of the
benchmark image datasets and the
evaluation measures that have been developed to assess the
quality of machine-generated
image descriptions. Finally we extrapolate future directions in
the area of automatic image
description generation.
1. Introduction
Over the past two decades, the fields of natural language
processing (NLP) and computer
vision (CV) have seen great advances in their respective goals
of analyzing and generating
text, and of understanding images and videos. While both fields
share a similar set of meth-
ods rooted in artificial intelligence and machine learning, they
have historically developed
separately, and their scientific communities have typically
interacted very little.
Recent years, however, have seen an upsurge of interest in
problems that require a
combination of linguistic and visual information. A lot of
everyday tasks are of this nature,
e.g., interpreting a photo in the context of a newspaper article,
following instructions in
conjunction with a diagram or a map, understanding slides while
listening to a lecture. In
c©2016 AI Access Foundation. All rights reserved.

Bernardi et al.
addition to this, the web provides a vast amount of data that
combines linguistic and visual
information: tagged photographs, illustrations in newspaper
articles, videos with subtitles,
and multimodal feeds on social media. To tackle combined
language and vision tasks and to
exploit the large amounts of multimodal data, the CV and NLP
communities have moved
closer together, for example by organizing workshops on
language and vision that have been
held regularly at both CV and NLP conferences over the past
few years.
In this new language-vision community, automatic image
description has emerged as a
key task. This task involves taking an image, analyzing its
visual content, and generating
a textual description (typically a sentence) that verbalizes the
most salient aspects of the
image. This is challenging from a CV point of view, as the
description could in principle
talk about any visual aspect of the image: it can mention objects
and their attributes, it
can talk about features of the scene (e.g., indoor/outdoor), or
verbalize how the people
and objects in the scene interact. More challenging still, the
description could even refer to
objects that are not depicted (e.g., it can talk about people
waiting for a train, even when
the train is not visible because it has not arrived yet) and
provide background knowledge
that cannot be derived directly from the image (e.g., the person
depicted is the Mona Lisa).

In short, a good image description requires full image
understanding, and therefore the
description task is an excellent test bed for computer vision
systems, one that is much more
comprehensive than standard CV evaluations that typically test,
for instance, the accuracy
of object detectors or scene classifiers over a limited set of
classes.
Image understanding is necessary, but not sufficient for
producing a good description.
Imagine we apply an array of state-of-the-art detectors to the
image to localize objects
(e.g., Felzenszwalb, Girshick, McAllester, & Ramanan, 2010;
Girshick, Donahue, Darrell,
& Malik, 2014), determine attributes (e.g., Lampert, Nickisch,
& Harmeling, 2009; Berg,
Berg, & Shih, 2010; Parikh & Grauman, 2011), compute scene
properties (e.g., Oliva &
Torralba, 2001; Lazebnik, Schmid, & Ponce, 2006), and
recognize human-object interac-
tions (e.g., Prest, Schmid, & Ferrari, 2012; Yao & Fei-Fei,
2010). The result would be a
long, unstructured list of labels (detector outputs), which would
be unusable as an image
description. A good image description, in contrast, has to be
comprehensive but concise
(talk about all and only the important things in the image), and
has to be formally correct,
i.e., consists of grammatically well-formed sentences.
From an NLP point of view, generating such a description is a
natural language gener-
ation (NLG) problem. The task of NLG is to turn a non-
linguistic representation into
human-readable text. Classically, the non-linguistic

representation is a logical form, a
database query, or a set of numbers. In image description, the
input is an image rep-
resentation (e.g., the detector outputs listed in the previous
paragraph), which the NLG
model has to turn into sentences. Generating text involves a
series of steps, traditionally
referred to as the NLP pipeline (Reiter & Dale, 2006): we need
to decide which aspects
of the input to talk about (content selection), then we need to
organize the content (text
planning) and verbalize it (surface realization). Surface
realization in turn requires choos-
ing the right words (lexicalization), using pronouns if
appropriate (referential expression
generation), and grouping related information together
(aggregation).
In other words, automatic image description requires not only
full image understanding,
but also sophisticated natural language generation. This is what
makes it such an interesting
410
task that has been embraced by both the CV and the NLP
communities.1 Note that
the description task can become even more challenging when we
take into account that
good descriptions are often user-specific. For instance, an art
critic will require a different
description than a librarian or a journalist, even for the same

photograph. We will briefly
touch upon this issue when we talk about the difference between
descriptions and captions
in Section 3 and discuss future directions in Section 4.
Given that automatic image description is such an interesting
task, and it is driven by the
existence of mature CV and NLP methods and the availability of
relevant datasets, a large
image description literature has appeared over the last five
years. The aim of this survey
article is to give a comprehensive overview of this literature,
covering models, datasets, and
evaluation metrics.
We sort the existing literature into three categories based on the
image description
models used. The first group of models follows the classical
pipeline we outlined above: they
first detect or predict the image content in terms of objects,
attributes, scene types, and
actions, based on a set of visual features. Then, these models
use this content information
to drive a natural language generation system that outputs an
image description. We will
term these approaches direct generation models.
The second group of models cast the problem as a retrieval
problem. That is, to create a
description for a novel image, these models search for images in
a database that are similar to
the novel image. Then they build a description for the novel
image based on the descriptions
of the set of similar images that was retrieved. The novel image
is described by simply
reusing the description of the most similar retrieved image

(transfer), or by synthesizing a
novel description based on the description of a set of similar
images. Retrieval-based models
can be further subdivided based on what type of approach they
use to represent images
and compute similarity. The first subgroup of models uses a
visual space to retrieve images,
while the second subgroup uses a multimodal space that
represents images and text jointly.
For an overview of the models that will be reviewed in this
survey, and which category they
fall into, see Table 1.
Generating natural language descriptions from videos presents
unique challenges over
and above image-based description, as it additionally requires
analyzing the objects and
their attributes and actions in the temporal dimension. Models
that aim to solve descrip-
tion generation from videos have been proposed in the literature
(e.g., Khan, Zhang, &
Gotoh, 2011; Guadarrama, Krishnamoorthy, Malkarnenkar,
Venugopalan, Mooney, Darrell,
& Saenko, 2013; Krishnamoorthy, Malkarnenkar, Mooney,
Saenko, & Guadarrama, 2013;
Rohrbach, Qiu, Titov, Thater, Pinkal, & Schiele, 2013;
Thomason, Venugopalan, Guadar-
rama, Saenko, & Mooney, 2014; Rohrbach, Rohrback, Tandon,
& Schiele, 2015; Yao, Torabi,
Cho, Ballas, Pal, Larochelle, & Courville, 2015; Zhu, Kiros,
Zemel, Salakhutdinov, Urtasun,
Torralba, & Fidler, 2015). However, most existing work on
description generation has used
static images, and this is what we will focus on in this survey.2
In this survey article, we first group automatic image

description models into the three
categories outlined above and provide a comprehensive
overview of the models in each
1. Though some image description approaches circumvent the
NLG aspect by transferring human-authored
descriptions, see Sections 2.2 and 2.3.
2. An interesting intermediate approach involves the annotation
of image streams with sequences of sen-
tences, see the work of Park and Kim (2015).
411
Bernardi et al.
category in Section 2. We then examine the available
multimodal image datasets used for
training and testing description generation models in Section 3.
Furthermore, we review
evaluation measures that have been used to gauge the quality of
generated descriptions in
Section 3. Finally, in Section 4, we discuss future research
directions, including possible
new tasks related to image description, such as visual question
answering.
2. Image Description Models
Generating automatic descriptions from images requires an
understanding of how humans
describe images. An image description can be analyzed in
several different dimensions (Shat-
ford, 1986; Jaimes & Chang, 2000). We follow Hodosh, Young,

and Hockenmaier (2013)
and assume that the descriptions that are of interest for this
survey article are the ones
that verbalize visual and conceptual information depicted in the
image, i.e., descriptions
that refer to the depicted entities, their attributes and relations,
and the actions they are
involved in. Outside the scope of automatic image description
are non-visual descriptions,
which give background information or refer to objects not
depicted in the image (e.g., the
location at which the image was taken or who took the picture).
Also, not relevant for
standard approaches to image description are perceptual
descriptions, which capture the
global low-level visual characteristics of images (e.g., the
dominant color in the image or
the type of the media such as photograph, drawing, animation,
etc.).
In the following subsections, we give a comprehensive overview
of state-of-the-art ap-
proaches to description generation. Table 1 offers a high-level
summary of the field, using
the three categories of models outlined in the introduction:
direct generation models, re-
trieval models from visual space, and retrieval model from
multimodal space.
2.1 Description as Generation from Visual Input
The general approach of the studies in this group is to first
predict the most likely meaning
of a given image by analyzing its visual content, and then
generate a sentence reflecting
this meaning. All models in this category achieve this using the

following general pipeline
architecture:
1. Computer vision techniques are applied to classify the scene
type, to detect the ob-
jects present in the image, to predict their attributes and the
relationships that hold
between them, and to recognize the actions taking place.
2. This is followed by a generation phase that turns the detector
outputs into words or
phrases. These are then combined to produce a natural language
description of the
image, using techniques from natural language generation (e.g.,
templates, n-grams,
grammar rules).
The approaches reviewed in this section perform an explicit
mapping from images to
descriptions, which differentiates them from the studies
described in Section 2.2 and 2.3,
which incorporate implicit vision and language models. An
illustration of a sample model is
shown in Figure 1. An explicit pipeline architecture, while
tailored to the problem at hand,
constrains the generated descriptions, as it relies on a
predefined sets of semantic classes of
scenes, objects, attributes, and actions. Moreover, such an
architecture crucially assumes
412

Reference Generation Retrieval from
Visual Space Multimodal Space
Farhadi et al. (2010) X
Kulkarni et al. (2011) X
Li et al. (2011) X
Ordonez et al. (2011) X
Yang et al. (2011) X
Gupta et al. (2012) X
Kuznetsova et al. (2012) X
Mitchell et al. (2012) X
Elliott and Keller (2013) X
Hodosh et al. (2013) X
Gong et al. (2014) X
Karpathy et al. (2014) X
Kuznetsova et al. (2014) X
Mason and Charniak (2014) X
Patterson et al. (2014) X
Socher et al. (2014) X
Verma and Jawahar (2014) X
Yatskar et al. (2014) X
Chen and Zitnick (2015) X X
Donahue et al. (2015) X X
Devlin et al. (2015) X
Elliott and de Vries (2015) X
Fang et al. (2015) X
Jia et al. (2015) X X
Karpathy and Fei-Fei (2015) X X
Kiros et al. (2015) X X
Lebret et al. (2015) X X
Lin et al. (2015) X
Mao et al. (2015a) X X
Ortiz et al. (2015) X
Pinheiro et al. (2015) X
Ushiku et al. (2015) X
Vinyals et al. (2015) X X

Xu et al. (2015) X X
Yagcioglu et al. (2015) X
Table 1: An overview of existing approaches to automatic image
description. We have
categorized the literature into approaches that directly generate
a description of an image
(Section 2.1), approaches that retrieve images via visual
similarity and transfer their de-
scription to the new image (Section 2.2), and approaches that
frame the task as retrieving
descriptions and images from a multimodal space (Section 2.3).
413
Bernardi et al.
Figure 1: The automatic image description generation system
proposed by Kulkarni et al.
(2011).
the accuracy of the detectors for each semantic class, an
assumption that is not always met
in practice.
Approaches to description generation differ along two main
dimensions: (a) which image
representations they derive descriptions from, and (b) how they
address the sentence gen-
eration problem. In terms of the representations used, existing
models have conceptualized
images in a number of different ways, relying on spatial
relationships (Farhadi et al., 2010),
corpus-based relationships (Yang et al., 2011), or spatial and

Learning a Recurrent Visual Representation for Image Caption G.docx

Learning a Recurrent Visual Representation for Image Caption G.docx

Recommended

Recommended

More Related Content

Similar to Learning a Recurrent Visual Representation for Image Caption G.docx

Similar to Learning a Recurrent Visual Representation for Image Caption G.docx (20)

More from croysierkathey

More from croysierkathey (20)

Recently uploaded

Recently uploaded (20)

Learning a Recurrent Visual Representation for Image Caption G.docx