Learning a Recurrent Visual Representation for Image Caption Generation
Xinlei Chen
Carnegie Mellon University
[email protected]
C. Lawrence Zitnick
Microsoft Research, Redmond
[email protected]
Abstract
In this paper we explore the bi-directional mapping be-
tween images and their sentence-based descriptions. We
propose learning this mapping using a recurrent neural net-
work. Unlike previous approaches that map both sentences
and images to a common embedding, we enable the gener-
ation of novel sentences given an image. Using the same
model, we can also reconstruct the visual features associ-
ated with an image given its visual description. We use a
novel recurrent visual memory that automatically learns to
remember long-term visual concepts to aid in both sentence
generation and visual feature reconstruction. We evaluate
our approach on several tasks. These include sentence gen-
eration, sentence retrieval and image retrieval. State-of-
the-art results are shown for the task of generating novel im-
age descriptions. When compared to human generated cap-
tions, our automatically generated captions are preferred
by humans over 19.8% of the time. Results are better than
or comparable to state-of-the-art results on the image and
sentence retrieval tasks for methods using similar visual
features.
1. Introduction
A good image description is often said to “paint a picture
in your mind’s eye.” The creation of a mental image may
play a significant role in sentence comprehension in humans
[15]. In fact, it is often this mental image that is remem-
bered long after the exact sentence is forgotten [29, 21].
What role should visual memory play in computer vision al-
gorithms that comprehend and generate image descriptions?
Recently, several papers have explored learning joint fea-
ture spaces for images and their descriptions [13, 32, 16].
These approaches project image features and sentence fea-
tures into a common space, which may be used for image
search or for ranking image captions. Various approaches
were used to learn the projection, including Kernel Canon-
ical Correlation Analysis (KCCA) [13], recursive neural
networks [32], or deep neural networks [16]. While these
approaches project both semantics and visual features to a
common embedding, they are not able to perform the in-
verse projection. That is, they cannot generate novel sen-
tences or visual depictions from the embedding.
In this paper, we propose a bi-directional representation
capable of generating both novel descriptions from images
and visual representations from descriptions. Critical to
both of these tasks is a novel representation that dynami-
cally captures the visual aspects of the scene that have al-
ready been described. That is, as a word is generated or
read the visual representation is updated to reflect the new
information contained in the word. We accomplish this us-
ing Recurrent Neural Networks (RNNs) [6, 24, 27]. One
long-standing problem of RNNs is their weakness in re-
membering concepts after a few iterations of recurrence.
For instance, RNN language models often have difficulty learning long-distance relations [3, 24] without specialized
gating units [12]. During sentence generation, our novel dy-
namically updated visual representation acts as a long-term
memory of the concepts that have already been mentioned.
This allows the network to automatically pick salient con-
cepts to convey that have yet to be spoken. As we demon-
strate, the same representation may be used to create a vi-
sual representation of a written description.
We demonstrate our method on numerous datasets.
These include the PASCAL sentence dataset [31], Flickr
8K [31], Flickr 30K [31], and the Microsoft COCO dataset
[22]. When generating novel image descriptions, we
demonstrate state-of-the-art results as measured by both
BLEU [30] and METEOR [1] on PASCAL 1K. Surpris-
ingly, we achieve performance only slightly below humans
as measured by BLEU and METEOR on the MS COCO
dataset. Qualitative results are shown for the generation of
novel image captions. We also evaluate the bi-directional
ability of our algorithm on both the image and sentence re-
trieval tasks. Since this does not require the ability to gener-
ate novel sentences, numerous previous papers have evalu-
ated on this task. We show results that are better or compa-
rable to previous state-of-the-art results using similar visual
features.
Figure 1. Illustration of our model. (a) shows the full model used for training. (b) and (c) show the parts of the model needed for generating sentences from visual features and generating visual features from sentences, respectively.
2. Related work
The task of building a visual memory lies at the heart
of two long-standing AI-hard problems: grounding natural
language symbols to the physical world and semantically
understanding the content of an image. Whereas learning
the mapping between image patches and single text labels
remains a popular topic in computer vision [18, 8, 9], there
is a growing interest in using entire sentence descriptions
together with pixels to learn joint embeddings [13, 32, 16,
10]. Viewing corresponding text and images as correlated,
KCCA [13] is a natural option for discovering the shared feature space. However, given the highly non-linear mapping
between the two, finding a generic distance metric based on
shallow representations can be extremely difficult. Recent
papers seek better objective functions that directly optimize the ranking [13], directly adopt pre-trained representations [32] to simplify the learning, or use a combination of the two [16, 10].
With a good distance metric, it is possible to perform
tasks like bi-directional image-sentence retrieval. However,
in many scenarios it is also desired to generate novel image
descriptions and to hallucinate a scene given a sentence de-
scription. Numerous papers have explored the area of gen-
erating novel image descriptions [7, 36, 19, 37, 28, 11, 20,
17]. These papers use various approaches to generate text,
such as using pre-trained object detectors with template-
based sentence generation [36, 7, 19]. Retrieved sentences
may be combined to form novel descriptions [20]. Recently,
purely statistical models have been used to generate sen-
tences based on sampling [17] or recurrent neural networks
[23]. While [23] also uses an RNN, their model is significantly different from ours. Specifically, their RNN
does not attempt to reconstruct the visual features, and is
more similar to the contextual RNN of [27]. For the synthe-
sizing of images from sentences, the recent paper by Zitnick
et al. [38] uses abstract clip art images to learn the visual
interpretation of sentences. Relation tuples are extracted
from the sentences and a conditional random field is used
to model the visual scene.
There are numerous papers using recurrent neural net-
works for language modeling [2, 24, 27, 17]. We build
most directly on top of [2, 24, 27] that use RNNs to
learn word context. Several models use other sources of
contextual information to help inform the language model
[27, 17]. Despite their success, RNNs still have difficulty capturing long-range relationships in sequential modeling [3].
One solution is Long Short-Term Memory (LSTM) net-
works [12, 33, 17], which use “gates” to control gradient
back-propagation explicitly and allow for the learning of
long-term interactions. However, the main focus of this pa-
per is to show that the hidden layers learned by “translat-
ing” between multiple modalities can already discover rich
structures in the data and learn long distance relations in an
automatic, data-driven manner.
3. Approach
In this section we describe our approach using recurrent
neural networks. Our goals are twofold. First, we want to be
able to generate sentences given a set of visual observations
or features. Specifically, we want to compute the probabil-
ity of a word wt being generated at time t given the set of
previously generated words Wt−1 = w1, . . . ,wt−1 and the
observed visual features V . Second, we want to enable the
capability of computing the likelihood of the visual features
V given a set of spoken or read words Wt for generating
visual representations of the scene or for performing image
search. To accomplish both of these tasks we introduce a
set of latent variables Ut−1 that encodes the visual interpre-
tation of the previously generated or read words Wt−1. As
we demonstrate later, the latent variables U play the critical
role of acting as a long-term visual memory of the words
that have been previously generated or read.
Using U, our goal is to compute P(wt | V, Wt−1, Ut−1) and P(V | Wt−1, Ut−1). Combining these two likelihoods, our global objective is to maximize
P(wt, V | Wt−1, Ut−1) = P(wt | V, Wt−1, Ut−1) P(V | Wt−1, Ut−1). (1)
That is, we want to maximize the likelihood of the word wt
and the observed visual features V given the previous words
and their visual interpretation. Note that in previous papers
[27, 23] the objective was only to compute P(wt|V,Wt−1)
and not P(V |Wt−1).
3.1. Model structure
Our recurrent neural network model structure builds on
the prior models proposed by [24, 27]. Mikolov [24] pro-
posed a RNN language model shown by the green boxes in
Figure 1(a). The word at time t is represented by a vector
wt using a “one hot” representation. That is, wt is the same
size as the word vocabulary with each entry having a value
of 0 or 1 depending on whether the word was used. The
output w̃ t contains the likelihood of generating each word.
The recurrent hidden state s provides context based on the
previous words. However, s typically only models short-
range interactions due to the problem of vanishing gradi-
ents [3, 24]. This simple, yet effective language model was
shown to provide a useful continuous word embedding for
a variety of applications [25].
Following [24], Mikolov et al. [27] added an input layer
v to the RNN shown by the white box in Figure 1. This
layer may represent a variety of information, such as topic
models or parts of speech [27]. In our application, v rep-
resents the set of observed visual features. We assume the
visual features v are constant. These visual features help
inform the selection of words. For instance, if a cat was de-
tected, the word “cat” is more likely to be spoken. Note that
unlike [27], it is not necessary to directly connect v to w̃ ,
since v is static for our application. In [27] v represented
dynamic information such as parts of speech for which w̃
needed direct access. We also found that only connecting v
to half of the s units provided better results, since it allowed
different units to specialize on modeling either text or visual
features.
The main contribution of this paper is the addition of the
recurrent visual hidden layer u, blue boxes in Figure 1(a).
The recurrent layer u attempts to reconstruct the visual fea-
tures v from the previous words, i.e. ṽ ≈ v. The visual hid-
den layer is also used by w̃ t to help in predicting the next
word.
Figure 2. Illustration of the hidden units s and u activations through time (vertical axis). Notice that some visual hidden units u exhibit long-term memory through their temporal stability, whereas the hidden units s change significantly at each time step.
That is, the network can compare its visual memory of what it has already said (u) to what it currently observes (v) to predict what to say next. At the beginning of the sentence,
u represents the prior probability of the visual features. As
more words are observed, the visual feature likelihoods are
updated to reflect the words’ visual interpretation. For in-
stance, if the word “sink” is generated, the visual feature
corresponding to sink will increase. Other features that cor-
respond to stove or refrigerator might increase as well, since
they are highly correlated with sink.
A critical property of the recurrent visual features u is
their ability to remember visual concepts over the long term.
The property arises from the model structure. Intuitively,
one may expect the visual features shouldn’t be estimated
until the sentence is finished. That is, u should not be used
to estimate v until wt generates the end of sentence token.
However, in our model we force u to estimate v at every
time step to help in remembering visual concepts. For in-
stance, if the word “cat” is generated, ut will increase the
likelihood of the visual feature corresponding to cat. As-
suming the “cat” visual feature in v is active, the network
will receive positive reinforcement to propagate u’s mem-
ory of “cat” from one time instance to the next. Figure
2 shows an illustrative example of the hidden units s and
u. As can be observed, some visual hidden units u exhibit
longer temporal stability.
Note that the same network structure can predict visual
features from sentences or generate sentences from visual
features. For generating sentences (Fig. 1(b)), v is known
and ṽ may be ignored. For predicting visual features from
sentences (Fig. 1(c)), w is known, and s and v may be
ignored. This property arises from the fact that the word units w separate the model into two halves for predicting
words or visual features respectively. Alternatively, if the
hidden units s were connected directly to u, this property
would be lost and the network would act as a normal auto-
encoder [34].
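To make the wiring concrete, the following is a minimal NumPy sketch of one time step of the model in Figure 1(a). It is not the authors' code: the weight names, the exact placement of the nonlinearities, and the way v is fed into half of the s units are illustrative assumptions consistent with the description above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward_step(w, s_prev, u_prev, v, p):
    """One time step of the model in Figure 1(a).

    w: one-hot word vector; v: static visual features; s_prev, u_prev: previous
    hidden states.  p is a dict of weight matrices whose names are illustrative.
    """
    # Word hidden state s: current word, its own recurrence, and (for half of
    # its units only) the visual features v.
    s_pre = p["W_ws"] @ w + p["W_ss"] @ s_prev
    half = s_pre.shape[0] // 2
    s_pre[:half] += p["W_vs"] @ v
    s = sigmoid(s_pre)

    # Recurrent visual memory u: driven by the words and its own recurrence
    # (s is not connected to u, which keeps the two halves of the model separable).
    u = sigmoid(p["W_wu"] @ w + p["W_uu"] @ u_prev)

    # Reconstructed visual features v~ and next-word distribution w~.
    v_rec = p["W_uv"] @ u
    w_dist = softmax(p["W_sw"] @ s + p["W_uw"] @ u)
    return s, u, v_rec, w_dist
```

For sentence generation (Fig. 1(b)), v is given and v_rec can be ignored; for predicting visual features from sentences (Fig. 1(c)), only u and v_rec are needed.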
3.2. Implementation details
In this section we describe the details of our language
model and how we learn our network.
3.3. Language Model
Our language model typically has between 3,000 and
20,000 words. While each word may be predicted indepen-
dently, this approach is computationally expensive. Instead,
we adopted the idea of word classing [24] and factorized the
distribution into a product of two terms:
P(wt|·) = P(ct|·) × P(wt|ct, ·). (2)
Here P(ct|·) is the probability of the word's class and P(wt|ct, ·) is the probability of the word given that class. The class label of the word is computed
quencies together. Generally, this approach greatly acceler-
ates the learning process, with little loss of perplexity. The
predicted word likelihoods are computed using the standard
soft-max function. After each epoch, the perplexity is evaluated on a separate validation set, and the learning rate is reduced (cut in half in our experiments) if the perplexity does not decrease.
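As a rough sketch of the factorization in Eq. (2), the per-word probability can be computed with two small soft-maxes instead of one over the full vocabulary. The frequency-based class assignment and the parameter names below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_probability(h, word, word_class, class_words, Wc, Ww):
    """Class-factored soft-max, Eq. (2): P(w | .) = P(c | .) * P(w | c, .).

    h: hidden activation; word_class[w]: class index of word w (assigned by
    frequency binning); class_words[c]: list of word indices in class c.
    Wc (num_classes x H) and Ww (vocab_size x H) are illustrative output weights.
    """
    c = word_class[word]
    p_class = softmax(Wc @ h)[c]                       # P(c_t | .)
    members = class_words[c]
    p_in_class = softmax(Ww[members] @ h)              # soft-max only over class members
    return p_class * p_in_class[members.index(word)]   # P(w_t | .)
```

Only the logits for the words inside the predicted class are evaluated, which is where the speedup over a full 3,000-20,000-way soft-max comes from.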
In order to further reduce the perplexity, we combine
the RNN model’s output with the output from a Maximum
Entropy language model [26], simultaneously learned from
the training corpus. For all experiments, we fix the number of previous words the Maximum Entropy model looks back on when predicting the next word to three.
For any natural language processing task, pre-processing
is crucial to the final performance. For all the sentences, we
did the following two steps before feeding them into the RNN model: 1) use the Stanford CoreNLP tool to tokenize the sentences; 2) lowercase all the letters.
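These two steps amount to tokenization followed by lowercasing. A rough stand-in (a simple regex tokenizer is used here in place of Stanford CoreNLP, purely for illustration):

```python
import re

def preprocess(sentence):
    # Stand-in tokenizer (the paper uses Stanford CoreNLP): split words and punctuation.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    # Lowercase all the letters.
    return [t.lower() for t in tokens]

print(preprocess("A man rides his bike down the street."))
# ['a', 'man', 'rides', 'his', 'bike', 'down', 'the', 'street', '.']
```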
3.4. Learning
For learning we use the Backpropagation Through Time
(BPTT) algorithm [35]. Specifically, the network is un-
rolled for several words and standard backpropagation is
applied. Note that we reset the model after an End-of-
Sentence (EOS) is encountered, so that prediction does not
cross sentence boundaries. As shown to be beneficial in
[24], we use online learning for the weights from the recur-
rent units to the output words. The weights for the rest of the
network use a once per sentence batch update. The activa-
tions for all units are computed using the sigmoid function σ(z) = 1/(1 + exp(−z)) with clipping, except for the word predictions, which use a soft-max. We found that Rectified Lin-
ear Units (ReLUs) [18] with unbounded activations were
numerically unstable and commonly “blew up” when used
in recurrent networks.
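A sketch of the activation functions described above; the clipping bound is an assumption, since the paper does not state the value used.

```python
import numpy as np

def clipped_sigmoid(z, clip=50.0):
    # Clip the pre-activations before exp() so the recurrent units cannot blow up;
    # the clipping bound of 50 is illustrative, not taken from the paper.
    z = np.clip(z, -clip, clip)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Numerically stable soft-max used only for the word predictions.
    e = np.exp(z - z.max())
    return e / e.sum()
```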
We used the open source RNN code of [24] and the
Caffe framework [14] to implement our model. A big ad-
vantage of combining the two is that we can jointly learn
the word and image representations: the error from predict-
ing the words can be directly backpropagated to the image-
level features. However, deep convolutional neural networks
require large amounts of data to train on, but the largest
sentence-image dataset has only ∼80K images [22]. There-
fore, instead of training from scratch, we choose to fine-
tune from the pre-trained 1000-class ImageNet model [4] to
avoid potential over-fitting. In all experiments, we used the 4096D output of the 7th (fully-connected) layer as the visual input v to our model.
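Extracting the 4096-D fc7 visual input v with the Caffe Python interface might look like the sketch below. The file paths are placeholders, and the preprocessing settings follow the standard Caffe classification example rather than the authors' exact pipeline.

```python
import caffe
import numpy as np

# Placeholder paths for the BVLC reference network definition, weights and mean file.
net = caffe.Net("deploy.prototxt", "bvlc_reference_caffenet.caffemodel", caffe.TEST)
net.blobs["data"].reshape(1, 3, 227, 227)

# Standard CaffeNet preprocessing: channel-first layout, mean subtraction, BGR ordering.
transformer = caffe.io.Transformer({"data": net.blobs["data"].data.shape})
transformer.set_transpose("data", (2, 0, 1))
transformer.set_mean("data", np.load("ilsvrc_2012_mean.npy").mean(1).mean(1))
transformer.set_raw_scale("data", 255)
transformer.set_channel_swap("data", (2, 1, 0))

image = caffe.io.load_image("example.jpg")
net.blobs["data"].data[...] = transformer.preprocess("data", image)
net.forward()

# fc7 is 4096-D; its in-place ReLU means the blob is already rectified here.
v = net.blobs["fc7"].data[0].copy()
```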
4. Results
In this section we evaluate the effectiveness of our bi-
directional RNN model on multiple tasks. We begin by de-
scribing the datasets used for training and testing, followed
by our baselines. Our first set of evaluations measure our
model’s ability to generate novel descriptions of images.
Since our model is bi-directional, we evaluate its perfor-
mance on both the sentence retrieval and image retrieval
tasks. For additional results, please see the supplementary material.
4.1. Datasets
For evaluation we perform experiments on several stan-
dard datasets that are used for sentence generation and the
sentence-image retrieval task:
PASCAL 1K [31] The dataset contains a subset of im-
ages from the PASCAL VOC challenge. For each of the 20
categories, it has a random sample of 50 images with 5 de-
scriptions provided by Amazon’s Mechanical Turk (AMT).
Flickr 8K and 30K [31] These datasets consist of 8,000 and 31,783 images collected from Flickr, respectively. Most of the images depict humans participating in various activities. Each image is also paired with 5 sentences. These datasets have standard training, validation, and testing splits.
MS COCO [22] The Microsoft COCO dataset contains
82,783 training images and 40,504 validation images each
with ∼5 human generated descriptions. The images are col-
lected from Flickr by searching for common object cate-
gories, and typically contain multiple objects with signif-
icant contextual information. We downloaded the version
which contains ∼40K annotated training images and ∼10K
validation images for our experiments.
4.2. RNN Baselines
To gain insight into the various components of our
model, we compared our final model with three RNN base-
lines. For fair comparison, the random seed initialization
                    |      PASCAL       |     Flickr 8K     |    Flickr 30K     |      MS COCO
                    | PPL   BLEU  METR  | PPL   BLEU  METR  | PPL   BLEU  METR  | PPL   BLEU  METR
Midge [28]          | -     2.89  8.80  | -     -     -     | -     -     -     | -     -     -
Baby Talk [19]      | -     0.49  9.69  | -     -     -     | -     -     -     | -     -     -
RNN                 | 36.79 2.79  10.08 | 21.88 4.86  11.81 | 26.94 6.29  12.34 | 18.96 4.63  11.47
RNN+IF              | 30.04 10.16 16.43 | 20.43 12.04 17.10 | 23.74 10.59 15.56 | 15.39 16.60 19.24
RNN+IF+FT           | 29.43 10.18 16.45 | -     -     -     | -     -     -     | 14.90 16.77 19.41
Our Approach        | 27.97 10.48 16.69 | 19.24 14.10 17.97 | 22.51 12.60 16.42 | 14.23 18.35 20.04
Our Approach + FT   | 26.95 10.77 16.87 | -     -     -     | -     -     -     | 13.98 18.99 20.42
Human               | -     22.07 25.80 | -     22.51 26.31 | -     19.62 23.76 | -     20.19 24.94
Table 1. Results for novel sentence generation for PASCAL 1K, Flickr 8K, Flickr 30K and MS COCO. Results are measured using perplexity (PPL), BLEU (%) [30] and METEOR (METR, %) [1]. When available, results for Midge [28] and BabyTalk [19] are provided. Human agreement scores are shown in the last row. See the text for more details.
                    |     Sentence Retrieval     |       Image Retrieval
                    | R@1   R@5   R@10  Mean r   | R@1   R@5   R@10  Mean r
Random Ranking      | 4.0   9.0   12.0  71.0     | 1.6   5.2   10.6  50.0
DeViSE [8]          | 17.0  57.0  68.0  11.9     | 21.6  54.6  72.4  9.5
SDT-RNN [32]        | 25.0  56.0  70.0  13.4     | 35.4  65.2  84.4  7.0
DeepFE [16]         | 39.0  68.0  79.0  10.5     | 23.6  65.2  79.8  7.6
RNN+IF              | 31.0  68.0  87.0  6.0      | 27.2  65.4  79.8  7.0
Our Approach (T)    | 25.0  71.0  86.0  5.4      | 28.0  65.4  82.2  6.8
Our Approach (T+I)  | 30.0  75.0  87.0  5.0      | 28.0  67.4  83.4  6.2
Table 2. PASCAL image and sentence retrieval experiments. The protocol of [32] was followed. Results are shown for recall at 1, 5, and 10 recalled items (R@K) and the mean ranking (Mean r) of ground truth items.
was fixed for all experiments. The sizes of the hidden layers s and u are fixed to 100. We tried increasing the number of
hidden units, but results did not improve. For small datasets,
more units can lead to overfitting.
RNN based Language Model (RNN) This is the basic
RNN language model developed by [24], which has no in-
put visual features.
RNN with Image Features (RNN+IF) This is an RNN
model with image features feeding into the hidden layer in-
spired by [27]. As described in Section 3 v is only con-
nected to s and not w̃ . For the visual features v we used the
4096D 7th Layer output of BVLC reference Net [14] after
ReLUs. This network is trained on the ImageNet 1000-way
classification task [4]. We experimented with other layers
(5th and 6th) but they do not perform as well.
RNN with Image Features Fine-Tuned (RNN+IF+FT) This model has the same architecture as RNN+IF, but the error is back-propagated to the Convolutional Neural Network [9]. The CNN is initialized with the weights from
the BVLC reference net. The RNN is initialized with the pre-trained RNN language model. That is, the only ran-
domly initialized weights are the ones from visual features
v to hidden layers s. If the RNN is not pre-trained we found
the initial gradients to be too noisy for the CNN. If the
weights from v to hidden layers s are also pre-trained the
search space becomes too limited. Our current implemen-
tation takes ∼5 seconds to learn a mini-batch of size 128 on
a Tesla K40 GPU. It is also crucial to keep track of the val-
idation error and avoid overfitting. We observed this fine-
tuning strategy is particularly helpful for MS COCO, but
does not give much performance gain on Flickr Datasets be-
fore it overfits. The Flickr datasets may not provide enough
training data to avoid overfitting.
After fine-tuning, we fix the image features again and
retrain our model on top of it.
4.3. Sentence generation
Our first set of experiments evaluate our model’s ability
to generate novel sentence descriptions of images. We ex-
periment on all the image-sentence datasets described previ-
ously and compare to the RNN baselines and other previous
papers [28, 19]. Since PASCAL 1K has a limited amount of
training data, we report results trained on MS COCO and
tested on PASCAL 1K. We use the standard train-test splits
for the Flickr 8K and 30K datasets. For MS COCO we
train and validate on the training set (∼37K/∼3K), and test
on the validation set, since the testing set is not available.
To generate a sentence, we first sample a target sentence
length from the multinomial distribution of lengths learned
from the training data, then for this fixed length we sample
100 random sentences, and use the one with the lowest loss
(negative likelihood, and in case of our model, also recon-
struction error) as output.
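A sketch of this generation procedure is shown below; sample_sentence and sentence_loss are assumed helper methods standing in for sampling from the RNN and for its negative log-likelihood plus reconstruction error.

```python
import numpy as np

def generate_caption(model, v, length_counts, n_samples=100, rng=np.random):
    """model.sample_sentence and model.sentence_loss are assumed helpers."""
    # Sample a target length from the multinomial over training-set lengths.
    lengths = sorted(length_counts)
    probs = np.array([length_counts[l] for l in lengths], dtype=float)
    probs /= probs.sum()
    target_len = rng.choice(lengths, p=probs)

    # Draw candidate sentences of that length and keep the lowest-loss one
    # (negative log-likelihood plus, for our model, the reconstruction error).
    best, best_loss = None, np.inf
    for _ in range(n_samples):
        sent = model.sample_sentence(v, target_len)
        loss = model.sentence_loss(sent, v)
        if loss < best_loss:
            best, best_loss = sent, loss
    return best
```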
                    |     Sentence Retrieval     |       Image Retrieval
                    | R@1   R@5   R@10  Med r    | R@1   R@5   R@10  Med r
Random Ranking      | 0.1   0.6   1.1   631      | 0.1   0.5   1.0   500
[32]                | 4.5   18.0  28.6  32       | 6.1   18.5  29.0  29
DeViSE [8]          | 4.8   16.5  27.3  28       | 5.9   20.1  29.6  29
DeepFE [16]         | 12.6  32.9  44.0  14       | 9.7   29.6  42.5  15
DeepFE+DECAF [16]   | 5.9   19.2  27.3  34       | 5.2   17.6  26.5  32
RNN+IF              | 7.2   18.7  28.7  30.5     | 4.5   15.34 24.0  39
Our Approach (T)    | 7.6   21.1  31.8  27       | 5.0   17.6  27.4  33
Our Approach (T+I)  | 7.7   21.0  31.7  26.6     | 5.2   17.5  27.9  31
[13]                | 8.3   21.6  30.3  34       | 7.6   20.7  30.1  38
RNN+IF              | 5.5   17.0  27.2  28       | 5.0   15.0  23.9  39.5
Our Approach (T)    | 6.0   19.4  31.1  26       | 5.3   17.5  28.5  33
Our Approach (T+I)  | 6.2   19.3  32.1  24       | 5.7   18.1  28.4  31
M-RNN [23]          | 14.5  37.2  48.5  11       | 11.5  31.0  42.4  15
RNN+IF              | 10.4  30.9  44.2  14       | 10.2  28.0  40.6  16
Our Approach (T)    | 11.6  33.8  47.3  11.5     | 11.4  31.8  45.8  12.5
Our Approach (T+I)  | 11.7  34.8  48.6  11.2     | 11.4  32.0  46.2  11
Table 3. Flickr 8K retrieval experiments. The protocols of [32], [13] and [23] are used respectively in the first, second and third groups of rows. See text for details.
                    |     Sentence Retrieval     |       Image Retrieval
                    | R@1   R@5   R@10  Med r    | R@1   R@5   R@10  Med r
Random Ranking      | 0.1   0.6   1.1   631      | 0.1   0.5   1.0   500
DeViSE [8]          | 4.5   18.1  29.2  26       | 6.7   21.9  32.7  25
DeepFE+FT [16]      | 16.4  40.2  54.7  8        | 10.3  31.4  44.5  13
RNN+IF              | 8.0   19.4  27.6  37       | 5.1   14.8  22.8  47
Our Approach (T)    | 9.3   23.8  24.0  28       | 6.0   17.7  27.0  35
Our Approach (T+I)  | 9.6   24.0  27.2  25       | 7.1   17.9  29.0  31
M-RNN [23]          | 18.4  40.2  50.9  10       | 12.6  31.2  41.5  16
RNN+IF              | 9.5   29.3  42.4  15       | 9.2   27.1  36.6  21
Our Approach (T)    | 11.9  25.0  47.7  12       | 12.8  32.9  44.5  13
Our Approach (T+I)  | 12.1  27.8  47.8  11       | 12.7  33.1  44.9  12.5
Table 4. Flickr 30K retrieval experiments. The protocols of [8] and [23] are used respectively in the first and second groups of rows. See text for details.
We choose three automatic metrics for evaluating the quality of the generated sentences: perplexity, BLEU [30]
and METEOR [1]. Perplexity measures the likelihood of
generating the testing sentence based on the number of bits
it would take to encode it. The lower the value the bet-
ter. BLEU and METEOR were originally designed for au-
tomatic machine translation where they rate the quality of a
translated sentences given several references sentences. We
can treat the sentence generation task as the “translation”
of images to sentences. For BLEU, we took the geomet-
ric mean of the scores from 1-gram to 4-gram, and used
the ground truth length closest to the generated sentence to
penalize brevity. For METEOR, we used the latest version1
(v1.5). For both BLEU and METEOR higher scores are bet-
ter. For reference, we also report the consistency between
human annotators (using 1 sentence as query and the rest as
references)2.
1 http://www.cs.cmu.edu/~alavie/METEOR/
2 We used 5 sentences as references for system evaluation, but leave out 4 sentences for human consistency. It is a bit unfair, but the difference is usually around 0.01∼0.02.
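For reference, the BLEU variant described above (geometric mean of the 1- to 4-gram modified precisions with a brevity penalty against the closest reference length) can be computed roughly as follows; this is a simplified sketch, not the official scoring script.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    # Modified n-gram precisions, clipped by the maximum reference counts.
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            return 0.0
        max_ref = Counter()
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / sum(cand.values())))

    # Brevity penalty against the reference length closest to the candidate.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)
```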
Results are shown in Table 1. Our approach signifi-
cantly improves over both Midge [28] and BabyTalk [19]
on the PASCAL 1K dataset as measured by BLEU and ME-
TEOR. Several qualitative results for the three algorithms
are shown in Figure 4. Our approach generally provides
more naturally descriptive sentences, such as mentioning
an image is black and white, or a bus is a “double decker”.
Midge’s descriptions are often shorter with less detail and
BabyTalk provides long, but often redundant descriptions.
Results on Flickr 8K and Flickr 30K are also provided.
On the MS COCO dataset, which contains more images of high complexity, we provide perplexity, BLEU and METEOR scores. Surprisingly, our BLEU and METEOR
scores (18.99 & 20.42) are just slightly lower than the hu-
man scores (20.19 & 24.94). The use of image features
(RNN + IF) significantly improves performance over using
just an RNN language model. Fine-tuning (FT) and our full
approach provide additional improvements for all datasets.
For future reference, our final model gives BLEU-1 to
BLEU-4 (with penalty) as 60.4%, 26.4%, 12.6% and 6.5%,
compared to human consistency 65.9%, 30.5%, 13.6% and
Figure 3. Qualitative results for sentence generation on the MS COCO dataset. Both a generated sentence (red) using (Our Approach + FT) and a human generated caption (black) are shown.
6.0%. Qualitative results for the MS COCO dataset are
shown in Figure 3. Note that since our model is trained
on MS COCO, the generated sentences are generally better
on MS COCO than PASCAL 1K.
It is known that automatic measures are only roughly
correlated with human judgment [5], so it is also important
to evaluate the generated sentences using human studies.
We evaluated 1000 generated sentences on MS COCO by asking human subjects to judge whether each had better, worse or the same quality as a human generated ground truth caption.
5 subjects were asked to rate each image, and the major-
ity vote was recorded. In the case of a tie (2-2-1) the two
winners each got half of a vote. We find 12.6% and 19.8%
prefer our automatically generated captions to the human
captions without (Our Approach) and with fine-tuning (Our
Approach + FT) respectively. Less than 1% of the subjects
rated the captions as the same. This is an impressive re-
sult given we only used image-level visual features for the
complex images in MS COCO.
4.4. Bi-Directional Retrieval
Our RNN model is bi-directional. That is, it can gener-
ate image features from sentences and sentences from im-
age features. To evaluate its ability to do both, we measure
its performance on two retrieval tasks. We retrieve images
given a sentence description, and we retrieve a description
given an image. Since most previous methods are capable
of only the retrieval task, this also helps provide experimen-
tal comparison.
Following other methods, we adopted two protocols for
using multiple image descriptions. The first one is to treat
each of the ∼5 sentences individually. In this scenario, the rank of the retrieved ground truth sentences is used for evaluation. In the second case, we treat all the sentences
as a single annotation, and concatenate them together for
retrieval.
For each retrieval task we have two methods for ranking.
Figure 4. Qualitative results for sentence generation on the PASCAL 1K dataset. Generated sentences are shown for our approach (red), Midge [28] (green) and BabyTalk [19] (blue). For reference, a human generated caption is shown in black.
First, we may rank based on the likelihood of the sentence given the image (T). Since shorter sentences naturally have
higher probability of being generated, we followed [23] and
normalized the probability by dividing it with the total prob-
ability summed over the entire retrieval set. Second, we
could rank based on the reconstruction error between the
image’s visual features v and their reconstructed visual fea-
tures ṽ (I). Due to better performance, we use the average
reconstruction error over all time steps rather than just the
error at the end of the sentence. In Table 3, we report retrieval results using the text likelihood term only (T) and its combination with the visual feature reconstruction error (T+I).
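A sketch of the two ranking scores and their combination for the sentence retrieval task is given below; sentence_log_prob and reconstruction_error are assumed helpers, and the mixing weight of the two terms is an illustrative choice.

```python
import numpy as np

def rank_sentences(model, image_v, sentences, image_pool, w=1.0):
    """Rank candidate sentences for an image.

    model.sentence_log_prob(s, v) and model.reconstruction_error(s, v) are
    assumed helpers: the text log-likelihood and the average visual-feature
    reconstruction error over all time steps.  w is an illustrative mixing weight.
    """
    scores = []
    for s in sentences:
        # (T): log-likelihood of the sentence given the image, normalized by the
        # sentence's total probability over the retrieval set so that short,
        # generic sentences are not trivially favoured.
        logp = model.sentence_log_prob(s, image_v)
        t_score = logp - np.logaddexp.reduce(
            [model.sentence_log_prob(s, v) for v in image_pool])

        # (I): negative reconstruction error between the image features v and
        # their reconstruction from the sentence.
        i_score = -model.reconstruction_error(s, image_v)

        scores.append(t_score + w * i_score)  # (T+I)

    return np.argsort(scores)[::-1]  # indices of sentences, best match first
```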
The same evaluation metrics were adopted from previous
papers for both the tasks of sentence retrieval and image re-
trieval. They used R@K (K = 1, 5, 10) as the measurements, which are the recall rates of the (first) ground truth sentences (sentence retrieval task) or images (image retrieval task). Higher R@K corresponds to better retrieval performance. We also report the median/mean rank of the (first)
retrieved ground truth sentences or images (Med/Mean r).
Lower Med/Mean r implies better performance. For Flickr
8K and 30K several different evaluation methodologies
have been proposed. We report three scores for Flickr 8K
corresponding to the methodologies proposed by [32], [13]
and [23] respectively, and for Flickr 30K [8] and [23].
Measured by Mean r, we achieve state-of-the-art re-
sults on PASCAL 1K image and sentence retrieval (Table
2). As shown in Tables 3 and 4, for Flickr 8K and 30K
our approach achieves comparable or better results than
all methods except for the recently proposed DeepFE [16].
However, DeepFE uses a different set of features based
on smaller image regions. If the same features are used
(DeepFE+DECAF) as our approach, we achieve better re-
sults. We believe these contributions are complementary,
and by using better features our approach may also show
further improvement. In general ranking based on text and
visual features (T + I) outperforms just using text (T). Please
see the supplementary material for retrieval results on MS
COCO.
5. Discussion
Image captions describe both the objects in the image
and their relationships. An area of future work is to examine
the sequential exploration of an image and how it relates to
image descriptions. Many words correspond to spatial rela-
tions that our current model has difficultly in detecting. As
demonstrated by the recent paper of [16] better feature lo-
calization in the image can greatly improve the performance
of retrieval tasks and similar improvement might be seen in
the description generation task.
In conclusion, we describe the first bi-directional model
capable of generating both novel image descriptions and
visual features. Unlike many previous approaches using
RNNs, our model is capable of learning long-term inter-
actions. This arises from using a recurrent visual memory
that learns to reconstruct the visual features as new words
are read or generated. We demonstrate state-of-the-art re-
sults on the task of sentence generation, image retrieval and
sentence retrieval on numerous datasets.
6. Acknowledgements
We thank Hao Fang, Saurabh Gupta, Meg Mitchell, Xi-
aodong He, Geoff Zweig, John Platt and Piotr Dollar for
their thoughtful and insightful discussions in the creation of
this paper.
References
[1] S. Banerjee and A. Lavie. Meteor: An automatic metric for
mt evaluation with improved correlation with human judg-
ments. In Proceedings of the ACL Workshop on Intrinsic
and Extrinsic Evaluation Measures for Machine Translation
and/or Summarization, pages 65–72, 2005. 1, 5
[2] Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, and J.-L.
Gauvain. Neural probabilistic language models. In Innova-
tions in Machine Learning, pages 137–186. Springer, 2006.
2
[3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term
dependencies with gradient descent is difficult. Neural Net-
works, IEEE Transactions on, 5(2):157–166, 1994. 1, 2, 3
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-
Fei. Imagenet: A large-scale hierarchical image database.
In Computer Vision and Pattern Recognition, 2009. CVPR
2009. IEEE Conference on, pages 248–255. IEEE, 2009. 4,
5
[5] D. Elliott and F. Keller. Comparing automatic evaluation
measures for image description. 2014. 7
[6] J. L. Elman. Finding structure in time. Cognitive science,
14(2):179–211, 1990. 1
[7] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young,
C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every pic-
ture tells a story: Generating sentences from images. In
ECCV, pages 15–29. Springer, 2010. 2
[8] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean,
T. Mikolov, et al. Devise: A deep visual-semantic embed-
ding model. In Advances in Neural Information Processing
Systems, pages 2121–2129, 2013. 2, 5, 6, 8
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea-
ture hierarchies for accurate object detection and semantic
segmentation. 2014. 2, 5
[10] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and
S. Lazebnik. Improving image-sentence embeddings using
large weakly annotated photo collections. In ECCV, pages
529–545, 2014. 2
[11] A. Gupta, Y. Verma, and C. Jawahar. Choosing linguistics
over vision to describe images. In AAAI, 2012. 2
[12] S. Hochreiter and J. Schmidhuber. Long short-term
memory.
Neural computation, 9(8):1735–1780, 1997. 1, 2
[13] M. Hodosh, P. Young, and J. Hockenmaier. Framing image
description as a ranking task: Data, models and evaluation
metrics. J. Artif. Intell. Res.(JAIR), 47:853–899, 2013. 1, 2,
6, 8
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R.
Gir-
shick, S. Guadarrama, and T. Darrell. Caffe: Convolu-
tional architecture for fast feature embedding. arXiv preprint
arXiv:1408.5093, 2014. 4, 5
[15] M. A. Just, S. D. Newman, T. A. Keller, A. McEleney, and
P. A. Carpenter. Imagery in sentence comprehension: an fmri
study. Neuroimage, 21(1):112–124, 2004. 1
[16] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment em-
beddings for bidirectional image sentence mapping. arXiv
preprint arXiv:1406.5679, 2014. 1, 2, 5, 6, 8
[17] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neu-
ral language models. In ICML, 2014. 2
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks. In
Advances in neural information processing systems, pages
1097–1105, 2012. 2, 4
[19] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C.
Berg,
and T. L. Berg. Baby talk: Understanding and generat-
ing simple image descriptions. In CVPR, pages 1601–1608.
IEEE, 2011. 2, 5, 6, 8
[20] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and
Y. Choi. Collective generation of natural image descriptions.
2012. 2
[21] L. R. Lieberman and J. T. Culpepper. Words versus
objects:
Comparison of free verbal recall. Psychological Reports,
17(3):983–988, 1965. 1
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D.
Ra-
manan, P. Dollár, and C. L. Zitnick. Microsoft coco: Com-
mon objects in context. In ECCV, 2014. 1, 4
[23] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille.
Explain
images with multimodal recurrent neural networks. arXiv
preprint arXiv:1410.1090, 2014. 2, 3, 6, 7, 8
[24] T. Mikolov. Recurrent neural network based language
model.
1, 2, 3, 4, 5
[25] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient
esti-
mation of word representations in vector space. International
Conference on Learning Representations: Workshops Track,
2013. 3
[26] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J.
Cernocky.
Strategies for training large scale neural network language
models. In Automatic Speech Recognition and Understand-
ing (ASRU), 2011 IEEE Workshop on, pages 196–201. IEEE,
2011. 4
[27] T. Mikolov and G. Zweig. Context dependent recurrent
neu-
ral network language model. In SLT, pages 234–239, 2012.
1, 2, 3, 5
[28] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal,
A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and
H. Daumé III. Midge: Generating image descriptions from
computer vision detections. In EACL, pages 747–756. Asso-
ciation for Computational Linguistics, 2012. 2, 5, 6, 8
[29] A. Paivio, T. B. Rogers, and P. C. Smythe. Why are pic-
tures easier to recall than words? Psychonomic Science,
11(4):137–138, 1968. 1
[30] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a
method for automatic evaluation of machine translation. In
Proceedings of the 40th annual meeting on association for
computational linguistics, pages 311–318. Association for
Computational Linguistics, 2002. 1, 5
[31] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier.
Collecting image annotations using Amazon’s mechanical
turk. In NAACL HLT Workshop Creating Speech and Lan-
guage Data with Amazon’s Mechanical Turk, 2010. 1, 4
[32] R. Socher, Q. Le, C. Manning, and A. Ng. Grounded com-
positional semantics for finding and describing images with
sentences. In NIPS Deep Learning Workshop, 2013. 1, 2, 5,
6, 8
[33] I. Sutskever, J. Martens, and G. E. Hinton. Generating text
with recurrent neural networks. In Proceedings of the 28th
International Conference on Machine Learning (ICML-11),
pages 1017–1024, 2011. 2
[34] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol.
Extracting and composing robust features with denoising au-
toencoders. In Proceedings of the 25th international confer-
ence on Machine learning, pages 1096–1103. ACM, 2008.
3
[35] R. J. Williams and D. Zipser. Experimental analysis of the
real-time recurrent learning algorithm. Connection Science,
1(1):87–111, 1989. 4
[36] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos.
Corpus-guided sentence generation of natural images. In
EMNLP, 2011. 2
[37] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu.
I2T:
Image parsing to text description. Proceedings of the IEEE,
98(8):1485–1508, 2010. 2
[38] C. L. Zitnick and D. Parikh. Bringing semantics into fo-
cus using visual abstraction. In Computer Vision and Pat-
tern Recognition (CVPR), 2013 IEEE Conference on, pages
3009–3016. IEEE, 2013. 2
Show, Attend and Tell: Neural Image Caption
Generation with Visual Attention
Kelvin Xu [email protected]
Jimmy Lei Ba [email protected]
Ryan Kiros [email protected]
Kyunghyun Cho [email protected]
Aaron Courville [email protected]
Ruslan Salakhutdinov [email protected]
Richard S. Zemel [email protected]
Yoshua Bengio [email protected]
Abstract
Inspired by recent work in machine translation
and object detection, we introduce an attention
based model that automatically learns to describe
the content of images. We describe how we
can train this model in a deterministic manner
using standard backpropagation techniques and
stochastically by maximizing a variational lower
bound. We also show through visualization how
the model is able to automatically learn to fix its
gaze on salient objects while generating the cor-
responding words in the output sequence. We
validate the use of attention with state-of-the-
art performance on three benchmark datasets:
Flickr8k, Flickr30k and MS COCO.
1. Introduction
Automatically generating captions of an image is a task
very close to the heart of scene understanding — one of the
primary goals of computer vision. Not only must caption
generation models be powerful enough to solve the com-
puter vision challenges of determining which objects are in
an image, but they must also be capable of capturing and
expressing their relationships in a natural language. For
this reason, caption generation has long been viewed as
a difficult problem. It is a very important challenge for
machine learning algorithms, as it amounts to mimicking
the remarkable human ability to compress huge amounts of
salient visual information into descriptive language.
Despite the challenging nature of this task, there has been
a recent surge of research interest in attacking the image
caption generation problem. Aided by advances in training
neural networks (Krizhevsky et al., 2012) and large clas-
sification datasets (Russakovsky et al., 2014), recent work
Figure 1. Our model learns a words/image alignment. The visualized attentional maps (3) are explained in Sections 3.1 & 5.4. (Diagram panels: 1. Input Image; 2. Convolutional Feature Extraction (14x14 feature map over the image); 3. RNN with attention over the image (LSTM); 4. Word-by-word generation, e.g. “A bird flying over a body of water”.)
has significantly improved the quality of caption genera-
tion using a combination of convolutional neural networks
(convnets) to obtain vectorial representation of images and
recurrent neural networks to decode those representations
into natural language sentences (see Sec. 2).
One of the most curious facets of the human visual sys-
tem is the presence of attention (Rensink, 2000; Corbetta &
Shulman, 2002). Rather than compress an entire image into
a static representation, attention allows for salient features
to dynamically come to the forefront as needed. This is
especially important when there is a lot of clutter in an im-
age. Using representations (such as those from the top layer
of a convnet) that distill information in image down to the
most salient objects is one effective solution that has been
widely adopted in previous work. Unfortunately, this has
one potential drawback of losing information which could
be useful for richer, more descriptive captions. Using a lower-level representation can help preserve this information.
However working with these features necessitates a power-
ful mechanism to steer the model to information important
to the task at hand.
Figure 2. Attention over time. As the model generates each word, its attention changes to reflect the relevant parts of the image. “soft” (top row) vs “hard” (bottom row) attention. (Note that both models generated the same captions in this example.)
Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word).
In this paper, we describe approaches to caption generation that attempt to incorporate a form of attention with
two variants: a “hard” attention mechanism and a “soft”
attention mechanism. We also show how one advantage of
including attention is the ability to visualize what the model
“sees”. Encouraged by recent advances in caption genera-
tion and inspired by recent success in employing attention
in machine translation (Bahdanau et al., 2014) and object
recognition (Ba et al., 2014; Mnih et al., 2014), we investi-
gate models that can attend to salient part of an image while
generating its caption.
The contributions of this paper are the following:
• We introduce two attention-based image caption gen-
erators under a common framework (Sec. 3.1): 1) a
“soft” deterministic attention mechanism trainable by
standard back-propagation methods and 2) a “hard”
stochastic attention mechanism trainable by maximiz-
ing an approximate variational lower bound or equiv-
alently by REINFORCE (Williams, 1992).
• We show how we can gain insight and interpret the
results of this framework by visualizing “where” and
“what” the attention focused on. (see Sec. 5.4)
• Finally, we quantitatively validate the usefulness of
attention in caption generation with state of the art
performance (Sec. 5.3) on three benchmark datasets:
Flickr8k (Hodosh et al., 2013) , Flickr30k (Young
et al., 2014) and the MS COCO dataset (Lin et al.,
2014).
2. Related Work
In this section we provide relevant background on previous
work on image caption generation and attention. Recently,
several methods have been proposed for generating image
descriptions. Many of these methods are based on recur-
rent neural networks and inspired by the successful use of
sequence to sequence training with neural networks for ma-
chine translation (Cho et al., 2014; Bahdanau et al., 2014;
Sutskever et al., 2014). One major reason image caption
generation is well suited to the encoder-decoder framework
(Cho et al., 2014) of machine translation is because it is
analogous to “translating” an image to a sentence.
The first approach to use neural networks for caption gener-
ation was Kiros et al. (2014a), who proposed a multimodal
log-bilinear model that was biased by features from the im-
age. This work was later followed by Kiros et al. (2014b)
whose method was designed to explicitly allow a natural
way of doing both ranking and generation. Mao et al.
(2014) took a similar approach to generation but replaced a
feed-forward neural language model with a recurrent one.
Both Vinyals et al. (2014) and Donahue et al. (2014) use
LSTM RNNs for their models. Unlike Kiros et al. (2014a)
and Mao et al. (2014) whose models see the image at each
time step of the output word sequence, Vinyals et al. (2014)
only show the image to the RNN at the beginning. Along
with images, Donahue et al. (2014) also apply LSTMs to
videos, allowing their model to generate video descriptions.
All of these works represent images as a single feature vec-
tor from the top layer of a pre-trained convolutional net-
work. Karpathy & Li (2014) instead proposed to learn a
joint embedding space for ranking and generation whose
model learns to score sentence and image similarity as a
function of R-CNN object detections with outputs of a bidi-
rectional RNN. Fang et al. (2014) proposed a three-step
pipeline for generation by incorporating object detections.
Their model first learns detectors for several visual concepts
based on a multi-instance learning framework. A language
model trained on captions was then applied to the detector
outputs, followed by rescoring from a joint image-text em-
bedding space. Unlike these models, our proposed atten-
tion framework does not explicitly use object detectors but
instead learns latent alignments from scratch. This allows
our model to go beyond “objectness” and learn to attend to
abstract concepts.
Prior to the use of neural networks for generating captions,
two main approaches were dominant. The first involved
generating caption templates which were filled in based
on the results of object detections and attribute discovery
(Kulkarni et al. (2013), Li et al. (2011), Yang et al. (2011),
Mitchell et al. (2012), Elliott & Keller (2013)). The second
approach was based on first retrieving similar captioned im-
ages from a large database then modifying these retrieved
captions to fit the query (Kuznetsova et al., 2012; 2014).
These approaches typically involved an intermediate “gen-
eralization” step to remove the specifics of a caption that
are only relevant to the retrieved image, such as the name
of a city. Both of these approaches have since fallen out of
favour to the now dominant neural network methods.
There has been a long line of previous work incorpo-
rating attention into neural networks for vision related
tasks. Some that share the same spirit as our work include
Larochelle & Hinton (2010); Denil et al. (2012); Tang et al.
(2014). In particular however, our work directly extends
the work of Bahdanau et al. (2014); Mnih et al. (2014); Ba
et al. (2014).
3. Image Caption Generation with Attention
Mechanism
3.1. Model Details
In this section, we describe the two variants of our
attention-based model by first describing their common
framework. The main difference is the definition of the
φ function which we describe in detail in Section 4. We
denote vectors with bolded font and matrices with capital
letters. In our description below, we suppress bias terms for
readability.
Figure 4. A LSTM cell. Lines with bolded squares imply projections with a learnt weight vector. Each cell learns how to weigh its input components (input gate), while learning how to modulate that contribution to the memory (input modulator). It also learns weights which erase the memory cell (forget gate), and weights which control how this memory should be emitted (output gate).
3.1.1. ENCODER: CONVOLUTIONAL FEATURES
Our model takes a single raw image and generates a caption
y encoded as a sequence of 1-of-K encoded words.
y = {y1, . . . ,yC} , yi ∈ RK
where K is the size of the vocabulary and C is the length
of the caption.
We use a convolutional neural network in order to extract a
set of feature vectors which we refer to as annotation vec-
tors. The extractor produces L vectors, each of which is
a D-dimensional representation corresponding to a part of
the image.
a = {a1, . . . ,aL} , ai ∈ RD
In order to obtain a correspondence between the feature
vectors and portions of the 2-D image, we extract features
from a lower convolutional layer unlike previous work
which instead used a fully connected layer. This allows the
decoder to selectively focus on certain parts of an image by
selecting a subset of all the feature vectors.
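Concretely, the annotation vectors are obtained by flattening a convolutional feature map; a minimal sketch (the 14x14x512 shape is an assumption matching a VGG-style conv layer, and the feature extractor itself is not shown):

```python
import numpy as np

def annotation_vectors(conv_feature_map):
    """Flatten a conv feature map of shape (H, W, D) into L = H*W annotation
    vectors a_1 .. a_L, each of dimension D."""
    H, W, D = conv_feature_map.shape
    return conv_feature_map.reshape(H * W, D)

# e.g. a 14x14 map with 512 channels gives L = 196 vectors with D = 512
a = annotation_vectors(np.zeros((14, 14, 512)))
print(a.shape)  # (196, 512)
```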
3.1.2. DECODER: LONG SHORT-TERM MEMORY
NETWORK
We use a long short-term memory (LSTM) net-
work (Hochreiter & Schmidhuber, 1997) that produces a
caption by generating one word at every time step condi-
tioned on a context vector, the previous hidden state and the
previously generated words. Our implementation of LSTM
closely follows the one used in Zaremba et al. (2014) (see
Fig. 4). Using $T_{s,t} : \mathbb{R}^s \to \mathbb{R}^t$ to denote a simple affine transformation with parameters that are learned,
$$\begin{pmatrix} \mathbf{i}_t \\ \mathbf{f}_t \\ \mathbf{o}_t \\ \mathbf{g}_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\,n} \begin{pmatrix} \mathbf{E}\mathbf{y}_{t-1} \\ \mathbf{h}_{t-1} \\ \hat{\mathbf{z}}_t \end{pmatrix} \qquad (1)$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t \qquad (2)$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t). \qquad (3)$$
Here, it, ft, ct, ot, ht are the input, forget, memory, out-
put and hidden state of the LSTM, respectively. The vector
ẑ ∈ RD is the context vector, capturing the visual infor-
mation associated with a particular input location, as ex-
plained below. E ∈ Rm×K is an embedding matrix. Let
m and n denote the embedding and LSTM dimensionality
respectively, and σ and ⊙ be the logistic sigmoid activation and element-wise multiplication, respectively.
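A NumPy sketch of one decoder step implementing Eqs. (1)-(3) is shown below; the single stacked weight matrix T and bias b correspond to the affine transformation above, and all shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(Ey_prev, h_prev, z_hat, c_prev, T, b):
    """One decoder step implementing Eqs. (1)-(3).

    Ey_prev: (m,) embedded previous word; h_prev: (n,) previous hidden state;
    z_hat: (D,) context vector; c_prev: (n,) previous memory cell.
    T: (4n, m + n + D) stacked affine weights; b: (4n,) bias."""
    n = h_prev.shape[0]
    pre = T @ np.concatenate([Ey_prev, h_prev, z_hat]) + b   # Eq. (1) pre-activations
    i = sigmoid(pre[0:n])          # input gate
    f = sigmoid(pre[n:2 * n])      # forget gate
    o = sigmoid(pre[2 * n:3 * n])  # output gate
    g = np.tanh(pre[3 * n:4 * n])  # input modulator
    c = f * c_prev + i * g         # Eq. (2)
    h = o * np.tanh(c)             # Eq. (3)
    return h, c
```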
In simple terms, the context vector ẑt (equations (1)–(3)) is
a dynamic representation of the relevant part of the image
input at time t. We define a mechanism φ that computes ẑt
from the annotation vectors ai, i = 1, . . . ,L corresponding
to the features extracted at different image locations. For
each location i, the mechanism generates a positive weight
αi which can be interpreted either as the probability that
location i is the right place to focus for producing the next
word (the “hard” but stochastic attention mechanism), or as
the relative importance to give to location i in blending the
ai’s together. The weight αi of each annotation vector ai
is computed by an attention model fatt for which we use
a multilayer perceptron conditioned on the previous hidden
state ht−1. The soft version of this attention mechanism
was introduced by Bahdanau et al. (2014). For emphasis,
we note that the hidden state varies as the output RNN ad-
vances in its output sequence: “where” the network looks
next depends on the sequence of words that has already
been generated.
$$e_{ti} = f_{\text{att}}(\mathbf{a}_i, \mathbf{h}_{t-1}) \qquad (4)$$
$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}. \qquad (5)$$
Once the weights (which sum to one) are computed, the context vector $\hat{\mathbf{z}}_t$ is computed by
$$\hat{\mathbf{z}}_t = \phi(\{\mathbf{a}_i\}, \{\alpha_i\}), \qquad (6)$$
where φ is a function that returns a single vector given the
set of annotation vectors and their corresponding weights.
The details of the φ function are discussed in Sec. 4.
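A sketch of f_att and the soft version of φ (Eqs. (4)-(6)): a small MLP scores each annotation vector against the previous hidden state, a softmax turns the scores into weights, and the context vector is their weighted sum. The single-hidden-layer form and the parameter names are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(a, h_prev, Wa, Wh, wo):
    """a: (L, D) annotation vectors; h_prev: (n,) previous LSTM hidden state.
    Wa: (k, D), Wh: (k, n), wo: (k,) are assumed parameters of a one-hidden-layer
    MLP f_att.  Returns alpha (Eq. 5) and the soft context vector z_hat (Eq. 6)."""
    e = np.array([wo @ np.tanh(Wa @ a_i + Wh @ h_prev) for a_i in a])  # Eq. (4)
    alpha = softmax(e)                                                 # Eq. (5)
    z_hat = alpha @ a            # phi as a weighted sum of the a_i (soft version)
    return alpha, z_hat
```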
The initial memory state and hidden state of the LSTM
are predicted by an average of the annotation vectors fed
through two separate MLPs (init,c and init,h):
$$\mathbf{c}_0 = f_{\text{init,c}}\Big(\frac{1}{L}\sum_{i}^{L} \mathbf{a}_i\Big), \qquad \mathbf{h}_0 = f_{\text{init,h}}\Big(\frac{1}{L}\sum_{i}^{L} \mathbf{a}_i\Big)$$
In this work, we use a deep output layer (Pascanu et al.,
2014) to compute the output word probability given the
LSTM state, the context vector and the previous word:
$$p(\mathbf{y}_t \mid \mathbf{a}, \mathbf{y}_1^{t-1}) \propto \exp\!\big(\mathbf{L}_o(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbf{h}_t + \mathbf{L}_z \hat{\mathbf{z}}_t)\big) \qquad (7)$$
where $\mathbf{L}_o \in \mathbb{R}^{K \times m}$, $\mathbf{L}_h \in \mathbb{R}^{m \times n}$, $\mathbf{L}_z \in \mathbb{R}^{m \times D}$, and $\mathbf{E}$ are learned parameters initialized randomly.
4. Learning Stochastic “Hard” vs
Deterministic “Soft” Attention
In this section we discuss two alternative mechanisms for
the attention model fatt: stochastic attention and determin-
istic attention.
4.1. Stochastic “Hard” Attention
We represent the location variable st as where the model
decides to focus attention when generating the tth word.
st,i is an indicator one-hot variable which is set to 1 if the
i-th location (out of L) is the one used to extract visual
features. By treating the attention locations as intermedi-
ate latent variables, we can assign a multinoulli distribution
parametrized by {αi}, and view ẑt as a random variable:
$$p(s_{t,i} = 1 \mid s_{j<t}, \mathbf{a}) = \alpha_{t,i} \qquad (8)$$
$$\hat{\mathbf{z}}_t = \sum_{i} s_{t,i}\,\mathbf{a}_i. \qquad (9)$$
We define a new objective function Ls that is a variational
lower bound on the marginal log-likelihood log p(y | a) of
observing the sequence of words y given image features a.
The learning algorithm for the parameters W of the models
can be derived by directly optimizing Ls:
$$L_s = \sum_{s} p(s \mid \mathbf{a}) \log p(\mathbf{y} \mid s, \mathbf{a}) \leq \log \sum_{s} p(s \mid \mathbf{a})\, p(\mathbf{y} \mid s, \mathbf{a}) = \log p(\mathbf{y} \mid \mathbf{a}) \qquad (10)$$
$$\frac{\partial L_s}{\partial W} = \sum_{s} p(s \mid \mathbf{a}) \left[ \frac{\partial \log p(\mathbf{y} \mid s, \mathbf{a})}{\partial W} + \log p(\mathbf{y} \mid s, \mathbf{a}) \frac{\partial \log p(s \mid \mathbf{a})}{\partial W} \right]. \qquad (11)$$
Figure 5. Examples of mistakes where we can use attention to gain intuition into what the model saw.
Equation 11 suggests a Monte Carlo based sampling ap-
proximation of the gradient with respect to the model pa-
rameters. This can be done by sampling the location st
from a multinoulli distribution defined by Equation 8.
$$\tilde{s}_t \sim \text{Multinoulli}_L(\{\alpha_i\})$$
$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N}\sum_{n=1}^{N}\left[\frac{\partial \log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a})}{\partial W} + \log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a}) \frac{\partial \log p(\tilde{s}^n \mid \mathbf{a})}{\partial W}\right] \qquad (12)$$
A moving average baseline is used to reduce the vari-
ance in the Monte Carlo estimator of the gradient, follow-
ing Weaver & Tao (2001). Similar, but more complicated
variance reduction techniques have previously been used
by Mnih et al. (2014) and Ba et al. (2014). Upon seeing the
kth mini-batch, the moving average baseline is estimated
as an accumulated sum of the previous log likelihoods with
exponential decay:
$$b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(\mathbf{y} \mid \tilde{s}_k, \mathbf{a})$$
To further reduce the estimator variance, an entropy term on the multinoulli distribution H[s] is added. Also, with
probability 0.5 for a given image, we set the sampled at-
tention location s̃ to its expected value α. Both techniques
improve the robustness of the stochastic attention learning
algorithm. The final learning rule for the model is then the
following:
$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N}\sum_{n=1}^{N}\left[\frac{\partial \log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a})}{\partial W} + \lambda_r\big(\log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a}) - b\big)\frac{\partial \log p(\tilde{s}^n \mid \mathbf{a})}{\partial W} + \lambda_e \frac{\partial H[\tilde{s}^n]}{\partial W}\right]$$
where λr and λe are two hyper-parameters set by cross-validation. As pointed out and used in Ba et al. (2014) and Mnih et al. (2014), this formulation is equivalent to
the REINFORCE learning rule (Williams, 1992), where the
reward for the attention choosing a sequence of actions is
a real value proportional to the log likelihood of the target
sentence under the sampled attention trajectory.
In making a hard choice at every point, φ({ai} ,{αi})
from Equation 6 is a function that returns a sampled ai at
every point in time based upon a multinoulli distribution parameterized by α.
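In practice the final learning rule reduces to scaling the score-function term by a centred reward and adding the entropy gradient. A sketch of the per-mini-batch bookkeeping is given below; the gradient arrays are assumed to come from backpropagation through the decoder and the attention distribution, and the hyper-parameter values are placeholders.

```python
import numpy as np

def hard_attention_gradient(grads_logp_y, grads_logp_s, grads_entropy,
                            logp_y, baseline, lam_r=1.0, lam_e=0.01):
    """Combine N sampled gradients into one estimate of dLs/dW.

    grads_logp_y[n]  : d log p(y | s~_n, a) / dW  (from the decoder)
    grads_logp_s[n]  : d log p(s~_n | a) / dW     (from the attention distribution)
    grads_entropy[n] : d H[s~_n] / dW
    logp_y[n]        : log p(y | s~_n, a) as a scalar
    lam_r, lam_e     : cross-validated hyper-parameters (placeholder values here)
    """
    N = len(grads_logp_y)
    grad = sum(grads_logp_y[n]
               + lam_r * (logp_y[n] - baseline) * grads_logp_s[n]
               + lam_e * grads_entropy[n]
               for n in range(N)) / N

    # Exponentially decayed moving-average baseline (Weaver & Tao, 2001);
    # averaging over the mini-batch is an implementation assumption.
    new_baseline = 0.9 * baseline + 0.1 * float(np.mean(logp_y))
    return grad, new_baseline
```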
4.2. Deterministic “Soft” Attention
Learning stochastic attention requires sampling the attention location st each time; instead, we can take the expectation of the context vector ẑt directly,
$$\mathbb{E}_{p(s_t \mid \mathbf{a})}[\hat{\mathbf{z}}_t] = \sum_{i=1}^{L} \alpha_{t,i}\,\mathbf{a}_i \qquad (13)$$
and formulate a deterministic attention model by computing a soft attention weighted annotation vector $\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \sum_{i}^{L} \alpha_i \mathbf{a}_i$ as introduced by Bahdanau et al. (2014). This corresponds to feeding in a soft α
Neural Image Caption Generation with Visual Attention
weighted context into the system. The whole model is
smooth and differentiable under the deterministic attention,
so learning end-to-end is trivial by using standard back-
propagation.
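A minimal sketch of this expected context vector, assuming precomputed annotation vectors a and attention weights \alpha (the shapes 196 and 512 mirror the feature map used later in Sec. 4.3; the score values are hypothetical):

```python
import numpy as np

def soft_attention_context(a, alpha):
    """Deterministic 'soft' attention (Eq. 13): expected context vector
    z_hat = sum_i alpha_i * a_i, i.e. phi({a_i}, {alpha_i}).
    a: (L, D) annotation vectors, alpha: (L,) weights summing to 1."""
    return alpha @ a                              # shape (D,)

rng = np.random.default_rng(0)
a = rng.normal(size=(196, 512))                   # L = 196 locations, D = 512
scores = rng.normal(size=196)                     # stand-in for f_att(a_i, h_{t-1})
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                              # softmax over locations

z_hat = soft_attention_context(a, alpha)
print(z_hat.shape)                                # (512,)
```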
Learning the deterministic attention can also be understood as approximately optimizing the marginal likelihood in Equation 10 under the attention location random variable s_t from Sec. 4.1. The hidden activation h_t of the LSTM is a linear projection of the stochastic context vector \hat{z}_t followed by a tanh non-linearity. To a first-order Taylor approximation, the expected value \mathbb{E}_{p(s_t \mid a)}[h_t] is equal to computing h_t using a single forward prop with the expected context vector \mathbb{E}_{p(s_t \mid a)}[\hat{z}_t]. Considering Eq. 7, let n_t = L_o(E y_{t-1} + L_h h_t + L_z \hat{z}_t), and let n_{t,i} denote n_t computed by setting the random variable \hat{z} to a_i. We define the normalized weighted geometric mean for the softmax k-th word prediction:

\text{NWGM}[p(y_t = k \mid a)] = \frac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1 \mid a)}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1 \mid a)}} = \frac{\exp\left(\mathbb{E}_{p(s_t \mid a)}[n_{t,k}]\right)}{\sum_j \exp\left(\mathbb{E}_{p(s_t \mid a)}[n_{t,j}]\right)}
The equation above shows the normalized weighted ge-
ometric mean of the caption prediction can be approxi-
mated well by using the expected context vector, where
E[nt] = Lo(Eyt−1 + LhE[ht] + LzE[ẑt]). It shows that
the NWGM of a softmax unit is obtained by applying soft-
max to the expectations of the underlying linear projec-
tions. Also, from the results in (Baldi & Sadowski, 2014),
NWGM[p(yt = k | a)] ≈ E[p(yt = k | a)] under
softmax activation. That means the expectation of the out-
puts over all possible attention locations induced by ran-
dom variable st is computed by simple feedforward propa-
gation with expected context vector E[ẑt]. In other words,
the deterministic attention model is an approximation to the
marginal likelihood over the attention locations.
4.2.1. DOUBLY STOCHASTIC ATTENTION
By construction, \sum_i \alpha_{t,i} = 1 as they are the output of a softmax. In training the deterministic version of our model we introduce a form of doubly stochastic regularization, where we also encourage \sum_t \alpha_{t,i} \approx 1. This can be interpreted as encouraging the model to pay equal attention to every part of the image over the course of generation. In our experiments, we observed that this penalty was quantitatively important for improving the overall BLEU score, and that qualitatively it leads to richer and more descriptive captions. In addition, the soft attention model predicts a gating scalar \beta from the previous hidden state h_{t-1} at each time step t, such that \phi(\{a_i\}, \{\alpha_i\}) = \beta \sum_{i}^{L} \alpha_i a_i, where \beta_t = \sigma(f_\beta(h_{t-1})). We notice that our attention weights put more emphasis on the objects in the images when the scalar \beta is included.
Concretely, the model is trained end-to-end by minimizing
the following penalized negative log-likelihood:
L_d = -\log\left(p(y \mid x)\right) + \lambda \sum_{i}^{L} \left(1 - \sum_{t}^{C} \alpha_{t,i}\right)^2 \quad (14)
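The following sketch computes the doubly stochastic penalty of Eq. 14 and the \beta-gated context vector for hypothetical attention weights; the variable names and toy caption length are illustrative assumptions, not the released code.

```python
import numpy as np

def doubly_stochastic_penalty(alphas, lam=1.0):
    """Penalty from Eq. 14: lam * sum_i (1 - sum_t alpha_{t,i})^2.
    alphas: (C, L) attention weights, one row per generated word,
    each row summing to 1 over the L locations."""
    per_location_mass = alphas.sum(axis=0)        # sum_t alpha_{t,i}
    return lam * np.sum((1.0 - per_location_mass) ** 2)

def gated_context(a, alpha, beta):
    """Gated soft attention: phi({a_i}, {alpha_i}) = beta * sum_i alpha_i a_i,
    where beta = sigmoid(f_beta(h_{t-1})) is predicted at every step."""
    return beta * (alpha @ a)

rng = np.random.default_rng(0)
alphas = rng.dirichlet(np.ones(196), size=10)     # toy caption of C = 10 words
print(doubly_stochastic_penalty(alphas))
```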
4.3. Training Procedure
Both variants of our attention model were trained with
stochastic gradient descent using adaptive learning rate algorithms. For the Flickr8k dataset, we found that RMSProp (Tieleman & Hinton, 2012) worked best, while for the Flickr30k/MS COCO datasets we used the recently proposed Adam algorithm (Kingma & Ba, 2014).
To create the annotations ai used by our decoder, we used
the Oxford VGGnet (Simonyan & Zisserman, 2014) pre-
trained on ImageNet without finetuning. In principle how-
ever, any encoding function could be used. In addition,
with enough data, we could also train the encoder from
scratch (or fine-tune) with the rest of the model. In our ex-
periments we use the 14×14×512 feature map of the fourth convolutional layer before max pooling. This means our decoder operates on the flattened 196 × 512 (i.e., L × D) encoding.
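A minimal sketch of this flattening step, assuming a precomputed 14×14×512 feature map from the convolutional encoder (the values here are random placeholders):

```python
import numpy as np

# The 14x14x512 convolutional feature map becomes L = 196 annotation vectors
# a_i of dimension D = 512, one per spatial location (hypothetical values).
feature_map = np.random.default_rng(0).normal(size=(14, 14, 512))
annotations = feature_map.reshape(-1, 512)        # shape (196, 512), i.e. (L, D)
print(annotations.shape)
```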
As our implementation requires time proportional to the
length of the longest sentence per update, we found train-
ing on a random group of captions to be computationally
wasteful. To mitigate this problem, in preprocessing we
build a dictionary mapping the length of a sentence to the
corresponding subset of captions. Then, during training we
randomly sample a length and retrieve a mini-batch of size
64 of that length. We found that this greatly improved con-
vergence speed with no noticeable diminishment in perfor-
mance. On our largest dataset (MS COCO), our soft atten-
tion model took less than 3 days to train on an NVIDIA
Titan Black GPU.
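A small, illustrative sketch of this length-bucketed sampling; the function names and toy captions are hypothetical:

```python
import random
from collections import defaultdict

def build_length_buckets(captions):
    """Map caption length -> list of (image_id, tokens), as in the
    preprocessing step described above."""
    buckets = defaultdict(list)
    for image_id, tokens in captions:
        buckets[len(tokens)].append((image_id, tokens))
    return buckets

def sample_minibatch(buckets, batch_size=64):
    """Randomly pick a length, then draw a mini-batch of captions of that
    length so every sequence in the batch has the same number of steps."""
    length = random.choice(list(buckets))
    bucket = buckets[length]
    # sampling with replacement keeps the sketch simple; an epoch-based
    # sampler without replacement would work equally well
    return [random.choice(bucket) for _ in range(min(batch_size, len(bucket)))]

captions = [("img1", ["a", "dog", "runs"]),
            ("img2", ["a", "cat", "sleeps"]),
            ("img3", ["a", "man", "rides", "a", "horse"])]
print(sample_minibatch(build_length_buckets(captions), batch_size=2))
```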
In addition to dropout (Srivastava et al., 2014), the only
other regularization strategy we used was early stopping
on BLEU score. We observed a breakdown in correla-
tion between the validation set log-likelihood and BLEU in
the later stages of training during our experiments. Since
BLEU is the most commonly reported metric, we used
BLEU on our validation set for model selection.
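The model-selection loop can be sketched as early stopping on a validation metric; the callbacks below (train_epoch, validate_bleu) are hypothetical placeholders rather than the authors' training code:

```python
def train_with_bleu_early_stopping(train_epoch, validate_bleu,
                                   max_epochs=100, patience=10):
    """Keep the parameters with the best validation BLEU and stop once it
    has not improved for `patience` epochs; `train_epoch` and
    `validate_bleu` are hypothetical callbacks from the training loop."""
    best_bleu, best_state, since_best = float("-inf"), None, 0
    for _ in range(max_epochs):
        state = train_epoch()              # one pass over the training data
        bleu = validate_bleu(state)        # BLEU on the validation set
        if bleu > best_bleu:
            best_bleu, best_state, since_best = bleu, state, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_state, best_bleu
```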
In our experiments with soft attention, we also used Whet-
lab1 (Snoek et al., 2012; 2014) in our Flickr8k experi-
ments. Some of the intuitions we gained from hyperparam-
eter regions it explored were especially important in our
Flickr30k and COCO experiments.
We make our code for these models, based in Theano (Bergstra et al., 2010), publicly available upon publication to encourage future research in this area.

1 https://www.whetlab.com/
Table 1. BLEU-1,2,3,4/METEOR metrics compared to other methods. † indicates a different split, (—) indicates an unknown metric, ◦ indicates the authors kindly provided missing metrics by personal communication, Σ indicates an ensemble, a indicates using AlexNet.

Dataset    Model                                     BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR
Flickr8k   Google NIC (Vinyals et al., 2014)†Σ       63      41      27      —       —
           Log Bilinear (Kiros et al., 2014a)◦       65.6    42.4    27.7    17.7    17.31
           Soft-Attention                            67      44.8    29.9    19.5    18.93
           Hard-Attention                            67      45.7    31.4    21.3    20.30
Flickr30k  Google NIC†◦Σ                             66.3    42.3    27.7    18.3    —
           Log Bilinear                              60.0    38      25.4    17.1    16.88
           Soft-Attention                            66.7    43.4    28.8    19.1    18.49
           Hard-Attention                            66.9    43.9    29.6    19.9    18.46
COCO       CMU/MS Research (Chen & Zitnick, 2014)a   —       —       —       —       20.41
           MS Research (Fang et al., 2014)†a         —       —       —       —       20.71
           BRNN (Karpathy & Li, 2014)◦               64.2    45.1    30.4    20.3    —
           Google NIC†◦Σ                             66.6    46.1    32.9    24.6    —
           Log Bilinear◦                             70.8    48.9    34.4    24.3    20.03
           Soft-Attention                            70.7    49.2    34.4    24.3    23.90
           Hard-Attention                            71.8    50.4    35.7    25.0    23.04
5. Experiments
We describe our experimental methodology and quantita-
tive results which validate the effectiveness of our model
for caption generation.
5.1. Data
We report results on the popular Flickr8k and Flickr30k datasets, which have 8,000 and 30,000 images respectively, as well as the more challenging Microsoft COCO dataset, which has 82,783 images. The Flickr8k and Flickr30k datasets both come with 5 reference sentences per image; for the MS COCO dataset, some of the images have more than 5 references, which we discard for consistency across our datasets. We applied only basic tokenization to MS COCO so that it is consistent with the tokenization present in Flickr8k and Flickr30k. For all our experiments, we used a fixed vocabulary size of 10,000.
Results for our attention-based architecture are reported in Table 1. We report results with the frequently used BLEU metric,2 which is the standard in the caption generation literature. We report BLEU from 1 to 4 without a brevity penalty.
2We verified that our BLEU evaluation code matches the au-
thors of Vinyals et al. (2014), Karpathy & Li (2014) and Kiros
et al. (2014b). For fairness, we only compare against results for
which we have verified that our BLEU evaluation code is the
same. With the upcoming release of the COCO evaluation
server,
we will include comparison results with all other recent image
captioning models.
There has been, however, criticism of BLEU, so in addition we report another common metric, METEOR (Denkowski & Lavie, 2014), and compare whenever possible.
5.2. Evaluation Procedures
A few challenges exist for comparison, which we explain
here. The first is a difference in choice of convolutional
feature extractor. For identical decoder architectures, using more recent architectures such as GoogLeNet (Szegedy et al., 2014) or Oxford VGG (Simonyan & Zisserman, 2014) can give a boost in performance over using AlexNet (Krizhevsky et al., 2012). In our evaluation, we compare directly only with results which use the comparable GoogLeNet/Oxford VGG features, but for METEOR comparison we note some results that use AlexNet.
The second challenge is a single model versus ensemble
comparison. While other methods have reported perfor-
mance boosts by using ensembling, in our results we report
a single model performance.
Finally, there is a challenge due to differences between
dataset splits. In our reported results, we use the pre-
defined splits of Flickr8k. However, one challenge for the
Flickr30k and COCO datasets is the lack of standardized
splits. As a result, we report with the publicly available
splits3 used in previous work (Karpathy & Li, 2014). In our experience, differences in splits do not make a substantial difference in overall performance, but we note the differences where they exist.

3 http://cs.stanford.edu/people/karpathy/deepimagesent/
5.3. Quantitative Analysis
In Table 1, we provide a summary of the experiments validating the quantitative effectiveness of attention. We obtain state-of-the-art performance on Flickr8k, Flickr30k and MS COCO. In addition, we note that in our experiments we are able to significantly improve the state-of-the-art METEOR performance on MS COCO, which we speculate is connected to some of the regularization techniques we used (Sec. 4.2.1) and our lower-level representation.
Finally, we also note that we are able to obtain this perfor-
mance using a single model without an ensemble.
5.4. Qualitative Analysis: Learning to attend
By visualizing the attention component learned by the
model, we are able to add an extra layer of interpretabil-
ity to the output of the model (see Fig. 1). Other systems
that have done this rely on object detection systems to pro-
duce candidate alignment targets (Karpathy & Li, 2014).
Our approach is much more flexible, since the model can
attend to “non object” salient regions.
The 19-layer OxfordNet uses stacks of 3x3 filters, meaning the only time the feature maps decrease in size is due to the max pooling layers. The input image is resized so that its shortest side is 256 pixels, with the aspect ratio preserved. The input to the convolutional network is the center-cropped 224x224 image. Consequently, with 4 max pooling layers, the output dimension of the top convolutional layer is 14x14. Thus, in order to visualize the attention weights for the soft model, we simply upsample the weights by a factor of 2^4 = 16 and apply a Gaussian filter. We note that the receptive fields of the 14x14 units are highly overlapping.
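A sketch of this visualization step, assuming a 196-dimensional attention vector; the Gaussian width is an arbitrary choice for illustration:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_heatmap(alpha, factor=16, sigma=8.0):
    """Upsample a 14x14 soft-attention map by 2**4 = 16 (one factor of 2 per
    max-pooling layer) to the 224x224 input resolution and smooth it with a
    Gaussian filter; sigma is an arbitrary choice for this sketch."""
    grid = alpha.reshape(14, 14)
    upsampled = np.kron(grid, np.ones((factor, factor)))   # nearest-neighbour
    return gaussian_filter(upsampled, sigma=sigma)          # (224, 224)

alpha = np.random.default_rng(0).dirichlet(np.ones(196))    # toy weights
print(attention_heatmap(alpha).shape)
```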
As we can see in Figures 2 and 3, the model learns alignments that correspond very strongly with human intuition.
Especially in the examples of mistakes, we see that it is
possible to exploit such visualizations to get an intuition as
to why those mistakes were made. We provide a more ex-
tensive list of visualizations in Appendix A for the reader.
6. Conclusion
We propose an attention-based approach that gives state-of-the-art performance on three benchmark datasets using the BLEU and METEOR metrics. We also show how the learned attention can be exploited to give more interpretability into the model's generation process, and demonstrate that the learned alignments correspond very well to human intuition. We hope that the results of this paper will encourage future work in using visual attention. We also expect the modularity of the encoder-decoder approach combined with attention to have useful applications in other domains.
Acknowledgments
The authors would like to thank the developers of
Theano (Bergstra et al., 2010; Bastien et al., 2012). We
acknowledge the support of the following organizations
for research funding and computing support: the Nuance
Foundation, NSERC, Samsung, Calcul Québec, Compute
Canada, the Canada Research Chairs and CIFAR. The au-
thors would also like to thank Nitish Srivastava for assis-
tance with his ConvNet package as well as preparing the
Oxford convolutional network and Relu Patrascu for help-
ing with numerous infrastructure related problems.
References
Ba, Jimmy Lei, Mnih, Volodymyr, and Kavukcuoglu, Ko-
ray. Multiple object recognition with visual attention.
arXiv:1412.7755, December 2014.
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, September 2014.
Baldi, Pierre and Sadowski, Peter. The dropout learning algo-
rithm. Artificial intelligence, 210:78–122, 2014.
Bastien, Frederic, Lamblin, Pascal, Pascanu, Razvan, Bergstra,
James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nico-
las, Warde-Farley, David, and Bengio, Yoshua. Theano:
new features and speed improvements. Submited to the
Deep Learning and Unsupervised Feature Learning NIPS 2012
Workshop, 2012.
Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lam-
blin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian,
Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a
CPU and GPU math expression compiler. In Proceedings of
the Python for Scientific Computing Conference (SciPy), 2010.
Chen, Xinlei and Zitnick, C Lawrence. Learning a recurrent
visual representation for image caption generation. arXiv
preprint arXiv:1411.5654, 2014.
Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar,
Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua.
Learning phrase representations using RNN encoder-decoder
for statistical machine translation. In EMNLP, October 2014.
Corbetta, Maurizio and Shulman, Gordon L. Control of goal-
directed and stimulus-driven attention in the brain. Nature re-
views neuroscience, 3(3):201–215, 2002.
Denil, Misha, Bazzani, Loris, Larochelle, Hugo, and de Freitas,
Nando. Learning where to attend with deep architectures for
image tracking. Neural Computation, 2012.
Denkowski, Michael and Lavie, Alon. Meteor universal: Lan-
guage specific translation evaluation for any target language.
In Proceedings of the EACL 2014 Workshop on Statistical Ma-
chine Translation, 2014.
Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389v2, November 2014.
Elliott, Desmond and Keller, Frank. Image description using vi-
sual dependency representations. In EMNLP, 2013.
Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From captions to visual concepts and back. arXiv:1411.4952, November 2014.
Hochreiter, S. and Schmidhuber, J. Long short-term memory.
Neural Computation, 9(8):1735–1780, 1997.
Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing
image description as a ranking task: Data, models and evalu-
ation metrics. Journal of Artificial Intelligence Research, pp.
853–899, 2013.
Karpathy, Andrej and Li, Fei-Fei. Deep visual-semantic align-
ments for generating image descriptions. arXiv:1412.2306,
December 2014.
Kingma, Diederik P. and Ba, Jimmy. Adam: A Method
for Stochastic Optimization. arXiv:1412.6980, December
2014.
Kiros, Ryan, Salahutdinov, Ruslan, and Zemel, Richard. Multi-
modal neural language models. In International Conference on
Machine Learning, pp. 595–603, 2014a.
Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard. Uni-
fying visual-semantic embeddings with multimodal neural lan-
guage models. arXiv:1411.2539, November 2014b.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey. Ima-
geNet classification with deep convolutional neural networks.
In NIPS. 2012.
Kulkarni, Girish, Premraj, Visruth, Ordonez, Vicente, Dhar, Sagnik, Li, Siming, Choi, Yejin, Berg, Alexander C, and Berg, Tamara L. Babytalk: Understanding and generating simple image descriptions. PAMI, IEEE Transactions on, 35(12):2891–2903, 2013.
Kuznetsova, Polina, Ordonez, Vicente, Berg, Alexander C, Berg, Tamara L, and Choi, Yejin. Collective generation of natural image descriptions. In Association for Computational Linguistics. ACL, 2012.
Kuznetsova, Polina, Ordonez, Vicente, Berg, Tamara L, and Choi, Yejin. Treetalk: Composition and compression of trees for image descriptions. TACL, 2(10):351–362, 2014.
Larochelle, Hugo and Hinton, Geoffrey E. Learning to com-
bine foveal glimpses with a third-order boltzmann machine. In
NIPS, pp. 1243–1251, 2010.
Li, Siming, Kulkarni, Girish, Berg, Tamara L, Berg, Alexander C, and Choi, Yejin. Composing simple image descriptions using web-scale n-grams. In Computational Natural Language Learning. ACL, 2011.
Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James,
Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick,
C Lawrence. Microsoft coco: Common objects in context. In
ECCV, pp. 740–755. 2014.
Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille, Alan. Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv:1412.6632, December 2014.
Mitchell, Margaret, Han, Xufeng, Dodge, Jesse, Mensch, Alyssa, Goyal, Amit, Berg, Alex, Yamaguchi, Kota, Berg, Tamara, Stratos, Karl, and Daumé III, Hal. Midge: Generating image descriptions from computer vision detections. In European Chapter of the Association for Computational Linguistics, pp. 747–756. ACL, 2012.
Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, and Kavukcuoglu, Koray. Recurrent models of visual attention. In NIPS, 2014.
Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, and Ben-
gio, Yoshua. How to construct deep recurrent neural networks.
In ICLR, 2014.
Rensink, Ronald A. The dynamic representation of scenes. Visual Cognition, 7(1-3):17–42, 2000.
Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan,
Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, An-
drej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C.,
and Fei-Fei, Li. ImageNet Large Scale Visual Recognition
Challenge, 2014.
Simonyan, K. and Zisserman, A. Very deep convolu-
tional networks for large-scale image recognition. CoRR,
abs/1409.1556, 2014.
Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practi-
cal bayesian optimization of machine learning algorithms. In
NIPS, pp. 2951–2959, 2012.
Snoek, Jasper, Swersky, Kevin, Zemel, Richard S, and Adams,
Ryan P. Input warping for bayesian optimization of non-
stationary functions. arXiv preprint arXiv:1402.0929, 2014.
Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15, 2014.
Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc VV. Sequence to
sequence learning with neural networks. In NIPS, pp. 3104–
3112, 2014.
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre,
Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Van-
houcke, Vincent, and Rabinovich, Andrew. Going deeper with
convolutions. arXiv preprint arXiv:1409.4842, 2014.
Tang, Yichuan, Srivastava, Nitish, and Salakhutdinov, Ruslan R. Learning generative models with visual attention. In NIPS, pp. 1808–1816, 2014.
Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 - rmsprop.
Technical report, 2012.
Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan,
Dumitru. Show and tell: A neural image caption generator.
arXiv:1411.4555, November 2014.
Weaver, Lex and Tao, Nigel. The optimal reward baseline for
gradient-based reinforcement learning. In Proc. UAI’2001, pp.
538–545, 2001.
Williams, Ronald J. Simple statistical gradient-following al-
gorithms for connectionist reinforcement learning. Machine
learning, 8(3-4):229–256, 1992.
Yang, Yezhou, Teo, Ching Lik, Daumé III, Hal, and Aloimonos,
Yiannis. Corpus-guided sentence generation of natural images.
In EMNLP, pp. 444–454. ACL, 2011.
Young, Peter, Lai, Alice, Hodosh, Micah, and Hockenmaier, Julia. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78, 2014.
Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Re-
current neural network regularization. arXiv preprint
arXiv:1409.2329, September 2014.
A. Appendix
Visualizations from our "hard" (a) and "soft" (b) attention model. White indicates the regions to which the model roughly attends (see Section 5.4).
(a) A man and a woman playing frisbee in a field.
(b) A woman is throwing a frisbee in a park.
Figure 6.
(a) A giraffe standing in the field with trees.
(b) A large white bird standing in a forest.
Figure 7.
(a) A dog is laying on a bed with a book.
(b) A dog is standing on a hardwood floor.
Figure 8.
(a) A woman is holding a donut in his hand.
(b) A woman holding a clock in her hand.
Figure 9.
(a) A stop sign with a stop sign on it.
(b) A stop sign is on a road with a mountain in the background.
Figure 10.
(a) A man in a suit and a hat holding a remote control.
(b) A man wearing a hat and a hat on a skateboard.
Figure 11.
(a) A little girl sitting on a couch with a teddy bear.
(b) A little girl sitting on a bed with a teddy bear.
(a) A man is standing on a beach with a surfboard.
(b) A person is standing on a beach with a surfboard.
(a) A man and a woman riding a boat in the water.
(b) A group of people sitting on a boat in the water.
Figure 12.
(a) A man is standing in a market with a large amount of food.
(b) A woman is sitting at a table with a large pizza.
Figure 13.
(a) A giraffe standing in a field with trees.
(b) A giraffe standing in a forest with trees in the background.
Figure 14.
(a) A group of people standing next to each other.
(b) A man is talking on his cell phone while another man
watches.
Figure 15.
Journal of Artificial Intelligence Research 55 (2016) 409-442
Submitted 07/15; published 02/16
Automatic Description Generation from Images: A Survey
of Models, Datasets, and Evaluation Measures
Raffaella Bernardi [email protected]
University of Trento, Italy
Ruket Cakici [email protected]
Middle East Technical University, Turkey
Desmond Elliott [email protected]
University of Amsterdam, Netherlands
Aykut Erdem [email protected]
Erkut Erdem [email protected]
Nazli Ikizler-Cinbis [email protected]
Hacettepe University, Turkey
Frank Keller [email protected]
University of Edinburgh, UK
Adrian Muscat [email protected]
University of Malta, Malta
Barbara Plank [email protected]
University of Copenhagen, Denmark
Abstract
Automatic description generation from natural images is a
challenging problem that
has recently received a large amount of interest from the
computer vision and natural lan-
guage processing communities. In this survey, we classify the
existing approaches based
on how they conceptualize this problem, viz., models that cast
description as either a generation problem or as a retrieval problem over a visual or
multimodal representational
space. We provide a detailed review of existing models,
highlighting their advantages and
disadvantages. Moreover, we give an overview of the
benchmark image datasets and the
evaluation measures that have been developed to assess the
quality of machine-generated
image descriptions. Finally we extrapolate future directions in
the area of automatic image
description generation.
1. Introduction
Over the past two decades, the fields of natural language
processing (NLP) and computer
vision (CV) have seen great advances in their respective goals
of analyzing and generating
text, and of understanding images and videos. While both fields
share a similar set of meth-
ods rooted in artificial intelligence and machine learning, they
have historically developed
separately, and their scientific communities have typically
interacted very little.
Recent years, however, have seen an upsurge of interest in
problems that require a
combination of linguistic and visual information. A lot of
everyday tasks are of this nature,
e.g., interpreting a photo in the context of a newspaper article,
following instructions in
conjunction with a diagram or a map, understanding slides while
listening to a lecture. In
addition to this, the web provides a vast amount of data that
combines linguistic and visual
information: tagged photographs, illustrations in newspaper
articles, videos with subtitles,
and multimodal feeds on social media. To tackle combined
language and vision tasks and to
exploit the large amounts of multimodal data, the CV and NLP
communities have moved
closer together, for example by organizing workshops on
language and vision that have been
held regularly at both CV and NLP conferences over the past
few years.
In this new language-vision community, automatic image
description has emerged as a
key task. This task involves taking an image, analyzing its
visual content, and generating
a textual description (typically a sentence) that verbalizes the
most salient aspects of the
image. This is challenging from a CV point of view, as the
description could in principle
talk about any visual aspect of the image: it can mention objects
and their attributes, it
can talk about features of the scene (e.g., indoor/outdoor), or
verbalize how the people
and objects in the scene interact. More challenging still, the
description could even refer to
objects that are not depicted (e.g., it can talk about people
waiting for a train, even when
the train is not visible because it has not arrived yet) and
provide background knowledge
that cannot be derived directly from the image (e.g., the person
depicted is the Mona Lisa).
In short, a good image description requires full image
understanding, and therefore the
description task is an excellent test bed for computer vision
systems, one that is much more
comprehensive than standard CV evaluations that typically test,
for instance, the accuracy
of object detectors or scene classifiers over a limited set of
classes.
Image understanding is necessary, but not sufficient for
producing a good description.
Imagine we apply an array of state-of-the-art detectors to the
image to localize objects
(e.g., Felzenszwalb, Girshick, McAllester, & Ramanan, 2010;
Girshick, Donahue, Darrell,
& Malik, 2014), determine attributes (e.g., Lampert, Nickisch,
& Harmeling, 2009; Berg,
Berg, & Shih, 2010; Parikh & Grauman, 2011), compute scene
properties (e.g., Oliva &
Torralba, 2001; Lazebnik, Schmid, & Ponce, 2006), and
recognize human-object interac-
tions (e.g., Prest, Schmid, & Ferrari, 2012; Yao & Fei-Fei,
2010). The result would be a
long, unstructured list of labels (detector outputs), which would
be unusable as an image
description. A good image description, in contrast, has to be
comprehensive but concise
(talk about all and only the important things in the image), and
has to be formally correct,
i.e., consists of grammatically well-formed sentences.
From an NLP point of view, generating such a description is a
natural language gener-
ation (NLG) problem. The task of NLG is to turn a non-
linguistic representation into
human-readable text. Classically, the non-linguistic
representation is a logical form, a
database query, or a set of numbers. In image description, the
input is an image rep-
resentation (e.g., the detector outputs listed in the previous
paragraph), which the NLG
model has to turn into sentences. Generating text involves a
series of steps, traditionally
referred to as the NLP pipeline (Reiter & Dale, 2006): we need
to decide which aspects
of the input to talk about (content selection), then we need to
organize the content (text
planning) and verbalize it (surface realization). Surface
realization in turn requires choos-
ing the right words (lexicalization), using pronouns if
appropriate (referential expression
generation), and grouping related information together
(aggregation).
In other words, automatic image description requires not only
full image understanding,
but also sophisticated natural language generation. This is what
makes it such an interesting
task that has been embraced by both the CV and the NLP
communities.1 Note that
the description task can become even more challenging when we
take into account that
good descriptions are often user-specific. For instance, an art
critic will require a different
description than a librarian or a journalist, even for the same
photograph. We will briefly
touch upon this issue when we talk about the difference between
descriptions and captions
in Section 3 and discuss future directions in Section 4.
Given that automatic image description is such an interesting
task, and it is driven by the
existence of mature CV and NLP methods and the availability of
relevant datasets, a large
image description literature has appeared over the last five
years. The aim of this survey
article is to give a comprehensive overview of this literature,
covering models, datasets, and
evaluation metrics.
We sort the existing literature into three categories based on the
image description
models used. The first group of models follows the classical
pipeline we outlined above: they
first detect or predict the image content in terms of objects,
attributes, scene types, and
actions, based on a set of visual features. Then, these models
use this content information
to drive a natural language generation system that outputs an
image description. We will
term these approaches direct generation models.
The second group of models cast the problem as a retrieval
problem. That is, to create a
description for a novel image, these models search for images in
a database that are similar to
the novel image. Then they build a description for the novel
image based on the descriptions
of the set of similar images that was retrieved. The novel image
is described by simply
reusing the description of the most similar retrieved image
(transfer), or by synthesizing a
novel description based on the description of a set of similar
images. Retrieval-based models
can be further subdivided based on what type of approach they
use to represent images
and compute similarity. The first subgroup of models uses a
visual space to retrieve images,
while the second subgroup uses a multimodal space that
represents images and text jointly.
For an overview of the models that will be reviewed in this
survey, and which category they
fall into, see Table 1.
Generating natural language descriptions from videos presents
unique challenges over
and above image-based description, as it additionally requires
analyzing the objects and
their attributes and actions in the temporal dimension. Models
that aim to solve descrip-
tion generation from videos have been proposed in the literature
(e.g., Khan, Zhang, &
Gotoh, 2011; Guadarrama, Krishnamoorthy, Malkarnenkar,
Venugopalan, Mooney, Darrell,
& Saenko, 2013; Krishnamoorthy, Malkarnenkar, Mooney,
Saenko, & Guadarrama, 2013;
Rohrbach, Qiu, Titov, Thater, Pinkal, & Schiele, 2013;
Thomason, Venugopalan, Guadar-
rama, Saenko, & Mooney, 2014; Rohrbach, Rohrbach, Tandon,
& Schiele, 2015; Yao, Torabi,
Cho, Ballas, Pal, Larochelle, & Courville, 2015; Zhu, Kiros,
Zemel, Salakhutdinov, Urtasun,
Torralba, & Fidler, 2015). However, most existing work on
description generation has used
static images, and this is what we will focus on in this survey.2
In this survey article, we first group automatic image
description models into the three
categories outlined above and provide a comprehensive
overview of the models in each
1. Though some image description approaches circumvent the
NLG aspect by transferring human-authored
descriptions, see Sections 2.2 and 2.3.
2. An interesting intermediate approach involves the annotation
of image streams with sequences of sen-
tences, see the work of Park and Kim (2015).
category in Section 2. We then examine the available
multimodal image datasets used for
training and testing description generation models in Section 3.
Furthermore, we review
evaluation measures that have been used to gauge the quality of
generated descriptions in
Section 3. Finally, in Section 4, we discuss future research
directions, including possible
new tasks related to image description, such as visual question
answering.
2. Image Description Models
Generating automatic descriptions from images requires an
understanding of how humans
describe images. An image description can be analyzed in
several different dimensions (Shat-
ford, 1986; Jaimes & Chang, 2000). We follow Hodosh, Young,
and Hockenmaier (2013)
and assume that the descriptions that are of interest for this
survey article are the ones
that verbalize visual and conceptual information depicted in the
image, i.e., descriptions
that refer to the depicted entities, their attributes and relations,
and the actions they are
involved in. Outside the scope of automatic image description
are non-visual descriptions,
which give background information or refer to objects not
depicted in the image (e.g., the
location at which the image was taken or who took the picture).
Also, not relevant for
standard approaches to image description are perceptual
descriptions, which capture the
global low-level visual characteristics of images (e.g., the
dominant color in the image or
the type of the media such as photograph, drawing, animation,
etc.).
In the following subsections, we give a comprehensive overview
of state-of-the-art ap-
proaches to description generation. Table 1 offers a high-level
summary of the field, using
the three categories of models outlined in the introduction:
direct generation models, re-
trieval models from visual space, and retrieval model from
multimodal space.
2.1 Description as Generation from Visual Input
The general approach of the studies in this group is to first
predict the most likely meaning
of a given image by analyzing its visual content, and then
generate a sentence reflecting
this meaning. All models in this category achieve this using the
following general pipeline
architecture:
1. Computer vision techniques are applied to classify the scene
type, to detect the ob-
jects present in the image, to predict their attributes and the
relationships that hold
between them, and to recognize the actions taking place.
2. This is followed by a generation phase that turns the detector
outputs into words or
phrases. These are then combined to produce a natural language
description of the
image, using techniques from natural language generation (e.g.,
templates, n-grams,
grammar rules).
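As a purely illustrative example of this two-stage pipeline, the sketch below slots hypothetical detector outputs (object, attribute, action, and scene labels) into a single sentence template; actual systems in this family rely on far richer content selection and generation machinery (n-grams, grammar rules, etc.) than this:

```python
# Hypothetical detector outputs for one image (object, attribute, action, scene).
detections = {"object": "dog", "attribute": "brown",
              "action": "running", "scene": "park"}

def generate_description(d):
    """Surface realization with a single fixed template."""
    return f"A {d['attribute']} {d['object']} is {d['action']} in the {d['scene']}."

print(generate_description(detections))   # "A brown dog is running in the park."
```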
The approaches reviewed in this section perform an explicit
mapping from images to
descriptions, which differentiates them from the studies
described in Section 2.2 and 2.3,
which incorporate implicit vision and language models. An
illustration of a sample model is
shown in Figure 1. An explicit pipeline architecture, while
tailored to the problem at hand,
constrains the generated descriptions, as it relies on a
predefined sets of semantic classes of
scenes, objects, attributes, and actions. Moreover, such an
architecture crucially assumes
Reference | Generation | Retrieval from Visual Space | Retrieval from Multimodal Space
Farhadi et al. (2010) X
Kulkarni et al. (2011) X
Li et al. (2011) X
Ordonez et al. (2011) X
Yang et al. (2011) X
Gupta et al. (2012) X
Kuznetsova et al. (2012) X
Mitchell et al. (2012) X
Elliott and Keller (2013) X
Hodosh et al. (2013) X
Gong et al. (2014) X
Karpathy et al. (2014) X
Kuznetsova et al. (2014) X
Mason and Charniak (2014) X
Patterson et al. (2014) X
Socher et al. (2014) X
Verma and Jawahar (2014) X
Yatskar et al. (2014) X
Chen and Zitnick (2015) X X
Donahue et al. (2015) X X
Devlin et al. (2015) X
Elliott and de Vries (2015) X
Fang et al. (2015) X
Jia et al. (2015) X X
Karpathy and Fei-Fei (2015) X X
Kiros et al. (2015) X X
Lebret et al. (2015) X X
Lin et al. (2015) X
Mao et al. (2015a) X X
Ortiz et al. (2015) X
Pinheiro et al. (2015) X
Ushiku et al. (2015) X
Vinyals et al. (2015) X X
Xu et al. (2015) X X
Yagcioglu et al. (2015) X
Table 1: An overview of existing approaches to automatic image
description. We have
categorized the literature into approaches that directly generate
a description of an image
(Section 2.1), approaches that retrieve images via visual
similarity and transfer their de-
scription to the new image (Section 2.2), and approaches that
frame the task as retrieving
descriptions and images from a multimodal space (Section 2.3).
Figure 1: The automatic image description generation system
proposed by Kulkarni et al.
(2011).
the accuracy of the detectors for each semantic class, an
assumption that is not always met
in practice.
Approaches to description generation differ along two main
dimensions: (a) which image
representations they derive descriptions from, and (b) how they
address the sentence gen-
eration problem. In terms of the representations used, existing
models have conceptualized
images in a number of different ways, relying on spatial
relationships (Farhadi et al., 2010),
corpus-based relationships (Yang et al., 2011), or spatial and
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx
Learning a Recurrent Visual Representation for Image Caption G.docx

Learning a Recurrent Visual Representation for Image Caption G.docx

We demonstrate our method on numerous datasets. These include the PASCAL sentence dataset [31], Flickr 8K [31], Flickr 30K [31], and the Microsoft COCO dataset [22]. When generating novel image descriptions, we demonstrate state-of-the-art results as measured by both BLEU [30] and METEOR [1] on PASCAL 1K. Surprisingly, we achieve performance only slightly below humans as measured by BLEU and METEOR on the MS COCO dataset. Qualitative results are shown for the generation of novel image captions. We also evaluate the bi-directional ability of our algorithm on both the image and sentence retrieval tasks. Since this does not require the ability to generate novel sentences, numerous previous papers have evaluated on this task. We show results that are better than or comparable to previous state-of-the-art results using similar visual features.
arXiv:1411.5654v1 [cs.CV] 20 Nov 2014

Figure 1. Illustration of our model. (a) shows the full model used for training. (b) and (c) show the parts of the model needed for generating sentences from visual features and generating visual features from sentences, respectively.

2. Related work
The task of building a visual memory lies at the heart of two long-standing AI-hard problems: grounding natural language symbols to the physical world and semantically understanding the content of an image. Whereas learning the mapping between image patches and single text labels remains a popular topic in computer vision [18, 8, 9], there is a growing interest in using entire sentence descriptions together with pixels to learn joint embeddings [13, 32, 16, 10]. Viewing corresponding text and images as correlated, KCCA [13] is a natural option for discovering the shared feature spaces. However, given the highly non-linear mapping between the two, finding a generic distance metric based on shallow representations can be extremely difficult. Recent papers seek better objective functions that directly optimize the ranking [13], directly adopt pre-trained representations [32] to simplify the learning, or use a combination of the two [16, 10].

With a good distance metric, it is possible to perform tasks like bi-directional image-sentence retrieval. However, in many scenarios it is also desirable to generate novel image descriptions and to hallucinate a scene given a sentence description. Numerous papers have explored the area of generating novel image descriptions [7, 36, 19, 37, 28, 11, 20, 17]. These papers use various approaches to generate text, such as using pre-trained object detectors with template-based sentence generation [36, 7, 19]. Retrieved sentences may be combined to form novel descriptions [20]. Recently, purely statistical models have been used to generate sentences based on sampling [17] or recurrent neural networks [23]. While [23] also uses an RNN, their model is significantly different from ours. Specifically, their RNN does not attempt to reconstruct the visual features, and is more similar to the contextual RNN of [27]. For synthesizing images from sentences, the recent paper by Zitnick et al. [38] uses abstract clip art images to learn the visual interpretation of sentences.
Relation tuples are extracted from the sentences and a conditional random field is used to model the visual scene.

There are numerous papers using recurrent neural networks for language modeling [2, 24, 27, 17]. We build most directly on top of [2, 24, 27], which use RNNs to learn word context. Several models use other sources of contextual information to help inform the language model [27, 17]. Despite their success, RNNs still have difficulty capturing long-range relationships in sequential modeling [3]. One solution is Long Short-Term Memory (LSTM) networks [12, 33, 17], which use "gates" to control gradient back-propagation explicitly and allow for the learning of long-term interactions. However, the main focus of this paper is to show that the hidden layers learned by "translating" between multiple modalities can already discover rich structures in the data and learn long-distance relations in an automatic, data-driven manner.

3. Approach

In this section we describe our approach using recurrent neural networks. Our goals are twofold. First, we want to be able to generate sentences given a set of visual observations or features. Specifically, we want to compute the probability of a word w_t being generated at time t given the set of previously generated words W_{t−1} = w_1, ..., w_{t−1} and the observed visual features V. Second, we want to enable the capability of computing the likelihood of the visual features V given a set of spoken or read words W_t, for generating visual representations of the scene or for performing image search.
To accomplish both of these tasks we introduce a set of latent variables U_{t−1} that encodes the visual interpretation of the previously generated or read words W_{t−1}. As we demonstrate later, the latent variables U play the critical role of acting as a long-term visual memory of the words that have been previously generated or read.

Using U, our goal is to compute P(w_t | V, W_{t−1}, U_{t−1}) and P(V | W_{t−1}, U_{t−1}). Combining these two likelihoods together, our global objective is to maximize

P(w_t, V | W_{t−1}, U_{t−1}) = P(w_t | V, W_{t−1}, U_{t−1}) P(V | W_{t−1}, U_{t−1}).   (1)

That is, we want to maximize the likelihood of the word w_t and the observed visual features V given the previous words and their visual interpretation. Note that in previous papers [27, 23] the objective was only to compute P(w_t | V, W_{t−1}) and not P(V | W_{t−1}).
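The paper states the objective only at the level of Eq. (1). As a rough, non-authoritative illustration of how such an objective could be scored at a single time step, the Python/NumPy sketch below combines a cross-entropy term for the next word with a squared-error reconstruction term for the visual features; the reconstruction form, the weighting, and all variable names are assumptions made for this sketch, not taken from the paper.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def step_objective(word_logits, target_word, v_pred, v_obs, recon_weight=1.0):
        # -log P(w_t | V, W_{t-1}, U_{t-1}): cross-entropy for the next word.
        word_nll = -np.log(softmax(word_logits)[target_word] + 1e-12)
        # Stand-in for -log P(V | W_{t-1}, U_{t-1}): squared error between the
        # network's visual reconstruction (v tilde) and the observed features.
        recon = 0.5 * np.sum((np.asarray(v_pred) - np.asarray(v_obs)) ** 2)
        return word_nll + recon_weight * recon

    # Example with made-up numbers: score one time step.
    loss = step_objective(np.array([2.0, 0.5, -1.0]), target_word=0,
                          v_pred=np.array([0.2, 0.9]), v_obs=np.array([0.0, 1.0]))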
3.1. Model structure

Our recurrent neural network model structure builds on the prior models proposed by [24, 27]. Mikolov [24] proposed an RNN language model, shown by the green boxes in Figure 1(a). The word at time t is represented by a vector w_t using a "one hot" representation. That is, w_t is the same size as the word vocabulary, with each entry having a value of 0 or 1 depending on whether the word was used. The output w̃_t contains the likelihood of generating each word. The recurrent hidden state s provides context based on the previous words. However, s typically only models short-range interactions due to the problem of vanishing gradients [3, 24]. This simple yet effective language model was shown to provide a useful continuous word embedding for a variety of applications [25].

Following [24], Mikolov et al. [27] added an input layer v to the RNN, shown by the white box in Figure 1. This layer may represent a variety of information, such as topic models or parts of speech [27]. In our application, v represents the set of observed visual features. We assume the visual features v are constant. These visual features help inform the selection of words. For instance, if a cat was detected, the word "cat" is more likely to be spoken. Note that unlike [27], it is not necessary to directly connect v to w̃, since v is static for our application. In [27], v represented dynamic information such as parts of speech for which w̃ needed direct access. We also found that only connecting v to half of the s units provided better results, since it allowed different units to specialize in modeling either text or visual features.

The main contribution of this paper is the addition of the recurrent visual hidden layer u, shown by the blue boxes in Figure 1(a). The recurrent layer u attempts to reconstruct the visual features v from the previous words, i.e. ṽ ≈ v. The visual hidden layer is also used by w̃_t to help in predicting the next word. That is, the network can compare its visual memory of what it has already said, u, to what it currently observes, v, to predict what to say next.
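A minimal forward step consistent with this description is sketched below in Python/NumPy. All weight names are hypothetical, and the exact wiring (which term feeds which output, and how v reaches half of the s units) is assumed for illustration rather than taken from the paper.

    import numpy as np

    def rnn_step(w_t, v, s_prev, u_prev, p):
        # w_t: one-hot word vector; v: observed visual features (constant);
        # s_prev, u_prev: previous hidden state and visual memory;
        # p: dict of weight matrices with hypothetical names.
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

        # Hidden state s: driven by the current word and its own past; the
        # visual features are connected to only half of the s units.
        s_in = p["W_ws"] @ w_t + p["W_ss"] @ s_prev
        half = len(s_prev) // 2
        s_in[:half] += p["W_vs"] @ v
        s_t = sigmoid(s_in)

        # Recurrent visual memory u: updated from the current word and its own
        # past, and read out as a reconstruction of the visual features (v tilde).
        u_t = sigmoid(p["W_wu"] @ w_t + p["W_uu"] @ u_prev)
        v_tilde = p["W_uv"] @ u_t

        # Word prediction uses both the language state s and the visual memory u.
        logits = p["W_sw"] @ s_t + p["W_uw"] @ u_t
        w_tilde = np.exp(logits - logits.max())
        w_tilde /= w_tilde.sum()
        return s_t, u_t, v_tilde, w_tilde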
• 9. updated to reflect the words’ visual interpretation. For instance, if the word “sink” is generated, the visual feature corresponding to sink will increase. Other features that correspond to stove or refrigerator might increase as well, since they are highly correlated with sink.
A critical property of the recurrent visual features u is their ability to remember visual concepts over the long term. The property arises from the model structure. Intuitively, one may expect that the visual features shouldn’t be estimated until the sentence is finished. That is, u should not be used to estimate v until wt generates the end-of-sentence token. However, in our model we force u to estimate v at every time step to help in remembering visual concepts. For instance, if the word “cat” is generated, ut will increase the likelihood of the visual feature corresponding to cat. Assuming the “cat” visual feature in v is active, the network will receive positive reinforcement to propagate u’s memory of “cat” from one time instance to the next. Figure 2 shows an illustrative example of the hidden units s and u. As can be observed, some visual hidden units u exhibit longer temporal stability.
Note that the same network structure can predict visual features from sentences or generate sentences from visual features. For generating sentences (Fig. 1(b)), v is known and ṽ may be ignored. For predicting visual features from sentences (Fig. 1(c)), w is known, and s and v may be ignored. This property arises from the fact that the word units w separate the model into two halves for predicting words or visual features, respectively. Alternatively, if the hidden units s were connected directly to u, this property would be lost and the network would act as a normal auto-encoder [34].
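To make the wiring concrete, below is a schematic NumPy sketch of one time step of the bi-directional structure described above: the hidden layer s sees the word and the static visual features v, the recurrent visual memory u sees only the word history and produces a reconstruction ṽ, and the word output reads both s and u. The exact connectivity, the layer sizes, and the output non-linearity for ṽ are our assumptions for illustration, not the released implementation (for example, we connect v to all of s here, whereas the paper connects it to only half of the s units).

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def softmax(z): e = np.exp(z - z.max()); return e / e.sum()

V_WORDS, H, K = 3000, 100, 4096      # vocabulary, hidden units, visual feature size (assumed)
rng = np.random.default_rng(0)
p = lambda *shape: rng.normal(scale=0.01, size=shape)
W_ws, W_ss, W_vs = p(H, V_WORDS), p(H, H), p(H, K)   # word, recurrent, and visual input to s
W_wu, W_uu       = p(H, V_WORDS), p(H, H)            # word and recurrent input to u
W_uv             = p(K, H)                           # u -> reconstructed visual features
W_so, W_uo       = p(V_WORDS, H), p(V_WORDS, H)      # s and u -> next-word scores

def step(w_onehot, s_prev, u_prev, v):
    """One recurrence of the bi-directional model (illustrative wiring only)."""
    s = sigmoid(W_ws @ w_onehot + W_ss @ s_prev + W_vs @ v)  # short-term language context
    u = sigmoid(W_wu @ w_onehot + W_uu @ u_prev)             # long-term visual memory
    v_tilde = sigmoid(W_uv @ u)                              # reconstruction of v from words so far
    w_tilde = softmax(W_so @ s + W_uo @ u)                   # distribution over the next word
    return w_tilde, v_tilde, s, u
```

In the generation mode v is given and ṽ can be ignored; in the reconstruction mode the words are given and s (and v) can be dropped, mirroring the two inference paths described above.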
• 10. 3.2. Implementation details

In this section we describe the details of our language model and how we learn our network.

3.3. Language Model

Our language model’s vocabulary typically contains between 3,000 and 20,000 words. While each word may be predicted independently, this approach is computationally expensive. Instead, we adopted the idea of word classing [24] and factorized the distribution into a product of two terms:

P(wt | ·) = P(ct | ·) × P(wt | ct, ·).   (2)

P(wt | ·) is the probability of the word and P(ct | ·) is the probability of the class. The class label of the word is computed in an unsupervised manner, grouping words of similar frequencies together. Generally, this approach greatly accelerates the learning process, with little loss of perplexity. The predicted word likelihoods are computed using the standard soft-max function. After each epoch, the perplexity is evaluated on a separate validation set and the learning rate is reduced (cut in half in our experiments) if the perplexity does not decrease.
In order to further reduce the perplexity, we combine the RNN model’s output with the output from a Maximum Entropy language model [26], simultaneously learned from the training corpus. For all experiments, the Maximum Entropy model looks back three words when predicting the next word.
For any natural language processing task, pre-processing is crucial to the final performance. For all the sentences, we
• 11. did the following two steps before feeding them into the RNN model: 1) use the Stanford CoreNLP tool to tokenize the sentences, and 2) lower-case all the letters.

3.4. Learning

For learning we use the Backpropagation Through Time (BPTT) algorithm [35]. Specifically, the network is unrolled for several words and standard backpropagation is applied. Note that we reset the model after an End-of-Sentence (EOS) token is encountered, so that prediction does not cross sentence boundaries. As shown to be beneficial in [24], we use online learning for the weights from the recurrent units to the output words. The weights for the rest of the network use a once-per-sentence batch update. The activations for all units are computed using the sigmoid function σ(z) = 1/(1 + exp(−z)) with clipping, except the word predictions, which use soft-max. We found that Rectified Linear Units (ReLUs) [18] with unbounded activations were numerically unstable and commonly “blew up” when used in recurrent networks.
We used the open source RNN code of [24] and the Caffe framework [14] to implement our model. A big advantage of combining the two is that we can jointly learn the word and image representations: the error from predicting the words can be directly backpropagated to the image-level features. However, deep convolutional neural networks require large amounts of data to train on, but the largest sentence-image dataset has only ∼80K images [22]. Therefore, instead of training from scratch, we choose to fine-tune from the pre-trained 1000-class ImageNet model [4] to avoid potential over-fitting. In all experiments, we used the 4096-D 7th fully-connected layer output as the visual input v to our model.
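For readers who want comparable visual features, the sketch below uses torchvision’s AlexNet as a stand-in for the BVLC reference network and truncates its classifier to expose a 4096-D penultimate (“fc7”-style) activation. This is an approximation of the pipeline with a modern toolchain, not the original Caffe setup.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# AlexNet's classifier ends with a 1000-way layer; dropping it leaves the
# 4096-D activation after the second fully-connected layer, analogous to v.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])
model.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def visual_features(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        v = model(img)            # shape: (1, 4096)
    return v.squeeze(0)
```

Fine-tuning, as described above, would correspond to unfreezing this network and backpropagating the word-prediction error into it.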
• 12. 4. Results

In this section we evaluate the effectiveness of our bi-directional RNN model on multiple tasks. We begin by describing the datasets used for training and testing, followed by our baselines. Our first set of evaluations measures our model’s ability to generate novel descriptions of images. Since our model is bi-directional, we evaluate its performance on both the sentence retrieval and image retrieval tasks. For additional results, please see the supplementary material.

4.1. Datasets

For evaluation we perform experiments on several standard datasets that are used for sentence generation and the sentence-image retrieval task:
PASCAL 1K [31]: The dataset contains a subset of images from the PASCAL VOC challenge. For each of the 20 categories, it has a random sample of 50 images with 5 descriptions provided by Amazon’s Mechanical Turk (AMT).
Flickr 8K and 30K [31]: These datasets consist of 8,000 and 31,783 images collected from Flickr, respectively. Most of the images depict humans participating in various activities. Each image is also paired with 5 sentences. These datasets have standard training, validation, and testing splits.
MS COCO [22]: The Microsoft COCO dataset contains 82,783 training images and 40,504 validation images, each with ∼5 human-generated descriptions. The images are collected from Flickr by searching for common object categories, and typically contain multiple objects with significant
• 13. contextual information. We downloaded the version which contains ∼40K annotated training images and ∼10K validation images for our experiments.

4.2. RNN Baselines

To gain insight into the various components of our model, we compared our final model with three RNN baselines. For fair comparison, the random seed initialization was fixed for all experiments. The hidden layers s and u are fixed to 100 units each. We tried increasing the number of hidden units, but results did not improve. For small datasets, more units can lead to overfitting.

Table 1. Results for novel sentence generation for PASCAL 1K, Flickr 8K, Flickr 30K and MS COCO. Results are measured using perplexity (PPL), BLEU (%) [30] and METEOR (METR, %) [1]. When available, results for Midge [28] and BabyTalk [19] are provided. Human agreement scores are shown in the last row. See the text for more details.

| Method | PASCAL (PPL / BLEU / METR) | Flickr 8K (PPL / BLEU / METR) | Flickr 30K (PPL / BLEU / METR) | MS COCO (PPL / BLEU / METR) |
| Midge [28] | - / 2.89 / 8.80 | - | - | - |
| Baby Talk [19] | - / 0.49 / 9.69 | - | - | - |
| RNN | 36.79 / 2.79 / 10.08 | 21.88 / 4.86 / 11.81 | 26.94 / 6.29 / 12.34 | 18.96 / 4.63 / 11.47 |
| RNN+IF | 30.04 / 10.16 / 16.43 | 20.43 / 12.04 / 17.10 | 23.74 / 10.59 / 15.56 | 15.39 / 16.60 / 19.24 |
| RNN+IF+FT | 29.43 / 10.18 / 16.45 | - | - | 14.90 / 16.77 / 19.41 |
| Our Approach | 27.97 / 10.48 / 16.69 | 19.24 / 14.10 / 17.97 | 22.51 / 12.60 / 16.42 | 14.23 / 18.35 / 20.04 |
| Our Approach + FT | 26.95 / 10.77 / 16.87 | - | - | 13.98 / 18.99 / 20.42 |
| Human | - / 22.07 / 25.80 | - / 22.51 / 26.31 | - / 19.62 / 23.76 | - / 20.19 / 24.94 |

• 14. Table 2. PASCAL image and sentence retrieval experiments. The protocol of [32] was followed. Results are shown for recall at 1, 5, and 10 recalled items (R@1, R@5, R@10) and the mean ranking (Mean r) of ground truth items.

| Method | Sentence Retrieval (R@1 / R@5 / R@10 / Mean r) | Image Retrieval (R@1 / R@5 / R@10 / Mean r) |
| Random Ranking | 4.0 / 9.0 / 12.0 / 71.0 | 1.6 / 5.2 / 10.6 / 50.0 |
| DeViSE [8] | 17.0 / 57.0 / 68.0 / 11.9 | 21.6 / 54.6 / 72.4 / 9.5 |
| SDT-RNN [32] | 25.0 / 56.0 / 70.0 / 13.4 | 35.4 / 65.2 / 84.4 / 7.0 |
| DeepFE [16] | 39.0 / 68.0 / 79.0 / 10.5 | 23.6 / 65.2 / 79.8 / 7.6 |
| RNN+IF | 31.0 / 68.0 / 87.0 / 6.0 | 27.2 / 65.4 / 79.8 / 7.0 |
| Our Approach (T) | 25.0 / 71.0 / 86.0 / 5.4 | 28.0 / 65.4 / 82.2 / 6.8 |
| Our Approach (T+I) | 30.0 / 75.0 / 87.0 / 5.0 | 28.0 / 67.4 / 83.4 / 6.2 |

RNN based Language Model (RNN): This is the basic RNN language model developed by [24], which has no input visual features.
RNN with Image Features (RNN+IF): This is an RNN model with image features feeding into the hidden layer
• 15. inspired by [27]. As described in Section 3, v is only connected to s and not w̃. For the visual features v we used the 4096-D 7th layer output of the BVLC reference net [14] after ReLUs. This network is trained on the ImageNet 1000-way classification task [4]. We experimented with other layers (5th and 6th) but they do not perform as well.
RNN with Image Features Fine-Tuned (RNN+FT): This model has the same architecture as RNN+IF, but the error is back-propagated to the Convolutional Neural Network [9]. The CNN is initialized with the weights from the BVLC reference net. The RNN is initialized with the pre-trained RNN language model. That is, the only randomly initialized weights are the ones from the visual features v to the hidden layers s. If the RNN is not pre-trained, we found the initial gradients to be too noisy for the CNN. If the weights from v to the hidden layers s are also pre-trained, the search space becomes too limited. Our current implementation takes ∼5 seconds to learn a mini-batch of size 128 on a Tesla K40 GPU. It is also crucial to keep track of the validation error and avoid overfitting. We observed this fine-tuning strategy is particularly helpful for MS COCO, but does not give much performance gain on the Flickr datasets before it overfits. The Flickr datasets may not provide enough training data to avoid overfitting. After fine-tuning, we fix the image features again and retrain our model on top of them.

4.3. Sentence generation

Our first set of experiments evaluates our model’s ability to generate novel sentence descriptions of images. We experiment on all the image-sentence datasets described previously and compare to the RNN baselines and other previous
• 16. papers [28, 19]. Since PASCAL 1K has a limited amount of training data, we report results trained on MS COCO and tested on PASCAL 1K. We use the standard train-test splits for the Flickr 8K and 30K datasets. For MS COCO we train and validate on the training set (∼37K/∼3K) and test on the validation set, since the testing set is not available.
To generate a sentence, we first sample a target sentence length from the multinomial distribution of lengths learned from the training data; then, for this fixed length, we sample 100 random sentences and use the one with the lowest loss (negative likelihood and, in the case of our model, also reconstruction error) as output.

Table 3. Flickr 8K retrieval experiments. The protocols of [32], [13] and [23] are used in each group of rows, respectively. See text for details.

| Method | Sentence Retrieval (R@1 / R@5 / R@10 / Med r) | Image Retrieval (R@1 / R@5 / R@10 / Med r) |
| Random Ranking | 0.1 / 0.6 / 1.1 / 631 | 0.1 / 0.5 / 1.0 / 500 |
| [32] | 4.5 / 18.0 / 28.6 / 32 | 6.1 / 18.5 / 29.0 / 29 |
| DeViSE [8] | 4.8 / 16.5 / 27.3 / 28 | 5.9 / 20.1 / 29.6 / 29 |
| DeepFE [16] | 12.6 / 32.9 / 44.0 / 14 | 9.7 / 29.6 / 42.5 / 15 |
| DeepFE+DECAF [16] | 5.9 / 19.2 / 27.3 / 34 | 5.2 / 17.6 / 26.5 / 32 |
| RNN+IF | 7.2 / 18.7 / 28.7 / 30.5 | 4.5 / 15.34 / 24.0 / 39 |
| Our Approach (T) | 7.6 / 21.1 / 31.8 / 27 | 5.0 / 17.6 / 27.4 / 33 |
| Our Approach (T+I) | 7.7 / 21.0 / 31.7 / 26.6 | 5.2 / 17.5 / 27.9 / 31 |
| [13] | 8.3 / 21.6 / 30.3 / 34 | 7.6 / 20.7 / 30.1 / 38 |
| RNN+IF | 5.5 / 17.0 / 27.2 / 28 | 5.0 / 15.0 / 23.9 / 39.5 |
| Our Approach (T) | 6.0 / 19.4 / 31.1 / 26 | 5.3 / 17.5 / 28.5 / 33 |
| Our Approach (T+I) | 6.2 / 19.3 / 32.1 / 24 | 5.7 / 18.1 / 28.4 / 31 |
| M-RNN [23] | 14.5 / 37.2 / 48.5 / 11 | 11.5 / 31.0 / 42.4 / 15 |
| RNN+IF | 10.4 / 30.9 / 44.2 / 14 | 10.2 / 28.0 / 40.6 / 16 |
| Our Approach (T) | 11.6 / 33.8 / 47.3 / 11.5 | 11.4 / 31.8 / 45.8 / 12.5 |
| Our Approach (T+I) | 11.7 / 34.8 / 48.6 / 11.2 | 11.4 / 32.0 / 46.2 / 11 |

• 17. Table 4. Flickr 30K retrieval experiments. The protocols of [8] and [23] are used in each group of rows, respectively. See text for details.

| Method | Sentence Retrieval (R@1 / R@5 / R@10 / Med r) | Image Retrieval (R@1 / R@5 / R@10 / Med r) |
| Random Ranking | 0.1 / 0.6 / 1.1 / 631 | 0.1 / 0.5 / 1.0 / 500 |
| DeViSE [8] | 4.5 / 18.1 / 29.2 / 26 | 6.7 / 21.9 / 32.7 / 25 |
| DeepFE+FT [16] | 16.4 / 40.2 / 54.7 / 8 | 10.3 / 31.4 / 44.5 / 13 |
| RNN+IF | 8.0 / 19.4 / 27.6 / 37 | 5.1 / 14.8 / 22.8 / 47 |
| Our Approach (T) | 9.3 / 23.8 / 24.0 / 28 | 6.0 / 17.7 / 27.0 / 35 |
| Our Approach (T+I) | 9.6 / 24.0 / 27.2 / 25 | 7.1 / 17.9 / 29.0 / 31 |
| M-RNN [23] | 18.4 / 40.2 / 50.9 / 10 | 12.6 / 31.2 / 41.5 / 16 |
| RNN+IF | 9.5 / 29.3 / 42.4 / 15 | 9.2 / 27.1 / 36.6 / 21 |
| Our Approach (T) | 11.9 / 25.0 / 47.7 / 12 | 12.8 / 32.9 / 44.5 / 13 |
| Our Approach (T+I) | 12.1 / 27.8 / 47.8 / 11 | 12.7 / 33.1 / 44.9 / 12.5 |

We choose three automatic metrics for evaluating the quality of the generated sentences: perplexity, BLEU [30] and METEOR [1]. Perplexity measures the likelihood of
• 18. generating the testing sentence based on the number of bits it would take to encode it. The lower the value the better. BLEU and METEOR were originally designed for automatic machine translation, where they rate the quality of a translated sentence given several reference sentences. We can treat the sentence generation task as the “translation” of images to sentences. For BLEU, we took the geometric mean of the scores from 1-gram to 4-gram, and used the ground truth length closest to the generated sentence to penalize brevity. For METEOR, we used the latest version (v1.5), available at http://www.cs.cmu.edu/~alavie/METEOR/. For both BLEU and METEOR, higher scores are better. For reference, we also report the consistency between human annotators (using 1 sentence as query and the rest as references; we used 5 sentences as references for system evaluation, but leave out 4 sentences for human consistency, which is slightly unfair, though the difference is usually around 0.01–0.02).
Results are shown in Table 1. Our approach significantly improves over both Midge [28] and BabyTalk [19] on the PASCAL 1K dataset as measured by BLEU and METEOR. Several qualitative results for the three algorithms are shown in Figure 4. Our approach generally provides more naturally descriptive sentences, such as mentioning an image is black and white, or a bus is a “double decker”. Midge’s descriptions are often shorter with less detail and BabyTalk provides long, but often redundant descriptions. Results on Flickr 8K and Flickr 30K are also provided. On the MS COCO dataset, which contains more images
• 19. of high complexity, we provide perplexity, BLEU and METEOR scores. Surprisingly, our BLEU and METEOR scores (18.99 & 20.42) are just slightly lower than the human scores (20.19 & 24.94). The use of image features (RNN+IF) significantly improves performance over using just an RNN language model. Fine-tuning (FT) and our full approach provide additional improvements for all datasets. For future reference, our final model gives BLEU-1 to BLEU-4 (with penalty) as 60.4%, 26.4%, 12.6% and 6.5%, compared to human consistency of 65.9%, 30.5%, 13.6% and 6.0%.

Figure 3. Qualitative results for sentence generation on the MS COCO dataset. Both a generated sentence (red) using (Our Approach + FT) and a human generated caption (black) are shown.

Qualitative results for the MS COCO dataset are shown in Figure 3. Note that since our model is trained on MS COCO, the generated sentences are generally better on MS COCO than PASCAL 1K.
It is known that automatic measures are only roughly correlated with human judgment [5], so it is also important to evaluate the generated sentences using human studies. We evaluated 1000 generated sentences on MS COCO by asking human subjects to judge whether each had better, worse or the same quality as a human generated ground truth caption. Five subjects were asked to rate each image, and the majority vote was recorded. In the case of a tie (2-2-1) the two winners each got half of a vote. We find 12.6% and 19.8% prefer our automatically generated captions to the human captions without (Our Approach) and with fine-tuning (Our Approach + FT), respectively. Less than 1% of the subjects
• 20. rated the captions as the same. This is an impressive result given we only used image-level visual features for the complex images in MS COCO.

Figure 4. Qualitative results for sentence generation on the PASCAL 1K dataset. Generated sentences are shown for our approach (red), Midge [28] (green) and BabyTalk [19] (blue). For reference, a human generated caption is shown in black.

4.4. Bi-Directional Retrieval

Our RNN model is bi-directional. That is, it can generate image features from sentences and sentences from image features. To evaluate its ability to do both, we measure its performance on two retrieval tasks. We retrieve images given a sentence description, and we retrieve a description given an image. Since most previous methods are capable of only the retrieval task, this also helps provide experimental comparison.
Following other methods, we adopted two protocols for using multiple image descriptions. The first one is to treat each of the ∼5 sentences individually. In this scenario, the rank of the retrieved ground truth sentences is used for evaluation. In the second case, we treat all the sentences as a single annotation and concatenate them together for retrieval.
For each retrieval task we have two methods for ranking. First, we may rank based on the likelihood of the sentence given the image (T). Since shorter sentences naturally have higher probability of being generated, we followed [23] and
• 21. normalized the probability by dividing it by the total probability summed over the entire retrieval set. Second, we could rank based on the reconstruction error between the image’s visual features v and their reconstructed visual features ṽ (I). Due to better performance, we use the average reconstruction error over all time steps rather than just the error at the end of the sentence. In Tables 3 and 4, we report retrieval results using the text likelihood term only (T) and its combination with the visual feature reconstruction error (T+I).
The same evaluation metrics were adopted from previous papers for both the sentence retrieval and image retrieval tasks. They used R@K (K = 1, 5, 10) as the measurements, which are the recall rates of the (first) ground truth sentences (sentence retrieval task) or images (image retrieval task). Higher R@K corresponds to better retrieval performance. We also report the median/mean rank of the (first) retrieved ground truth sentences or images (Med/Mean r). Lower Med/Mean r implies better performance. For Flickr 8K and 30K several different evaluation methodologies have been proposed. We report three scores for Flickr 8K corresponding to the methodologies proposed by [32], [13] and [23] respectively, and for Flickr 30K [8] and [23].
Measured by Mean r, we achieve state-of-the-art results on PASCAL 1K image and sentence retrieval (Table 2). As shown in Tables 3 and 4, for Flickr 8K and 30K our approach achieves comparable or better results than all methods except for the recently proposed DeepFE [16]. However, DeepFE uses a different set of features based on smaller image regions. If the same features (DeepFE+DECAF) are used as in our approach, we achieve better
• 22. results. We believe these contributions are complementary, and by using better features our approach may also show further improvement. In general, ranking based on text and visual features (T+I) outperforms just using text (T). Please see the supplementary material for retrieval results on MS COCO.

5. Discussion

Image captions describe both the objects in the image and their relationships. An area of future work is to examine the sequential exploration of an image and how it relates to image descriptions. Many words correspond to spatial relations that our current model has difficulty in detecting. As demonstrated by the recent paper of [16], better feature localization in the image can greatly improve the performance of retrieval tasks, and similar improvement might be seen in the description generation task.
In conclusion, we describe the first bi-directional model capable of generating both novel image descriptions and visual features. Unlike many previous approaches using RNNs, our model is capable of learning long-term interactions. This arises from using a recurrent visual memory that learns to reconstruct the visual features as new words are read or generated. We demonstrate state-of-the-art results on the tasks of sentence generation, image retrieval and sentence retrieval on numerous datasets.

6. Acknowledgements

We thank Hao Fang, Saurabh Gupta, Meg Mitchell, Xiaodong He, Geoff Zweig, John Platt and Piotr Dollar for
  • 23. their thoughtful and insightful discussions in the creation of this paper. References [1] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judg- ments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005. 1, 5 [2] Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, and J.-L. Gauvain. Neural probabilistic language models. In Innova- tions in Machine Learning, pages 137–186. Springer, 2006. 2 [3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Net- works, IEEE Transactions on, 5(2):157–166, 1994. 1, 2, 3 [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. 4, 5 [5] D. Elliott and F. Keller. Comparing automatic evaluation measures for image description. 2014. 7 [6] J. L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990. 1 [7] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every pic- ture tells a story: Generating sentences from images. In ECCV, pages 15–29. Springer, 2010. 2
  • 24. [8] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visual-semantic embed- ding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013. 2, 5, 6, 8 [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea- ture hierarchies for accurate object detection and semantic segmentation. 2014. 2, 5 [10] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, pages 529–545, 2014. 2 [11] A. Gupta, Y. Verma, and C. Jawahar. Choosing linguistics over vision to describe images. In AAAI, 2012. 2 [12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. 1, 2 [13] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res.(JAIR), 47:853–899, 2013. 1, 2, 6, 8 [14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir- shick, S. Guadarrama, and T. Darrell. Caffe: Convolu- tional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 4, 5 [15] M. A. Just, S. D. Newman, T. A. Keller, A. McEleney, and P. A. Carpenter. Imagery in sentence comprehension: an fmri study. Neuroimage, 21(1):112–124, 2004. 1
  • 25. [16] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment em- beddings for bidirectional image sentence mapping. arXiv preprint arXiv:1406.5679, 2014. 1, 2, 5, 6, 8 [17] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neu- ral language models. In ICML, 2014. 2 [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 2, 4 [19] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generat- ing simple image descriptions. In CVPR, pages 1601–1608. IEEE, 2011. 2, 5, 6, 8 [20] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. 2012. 2 [21] L. R. Lieberman and J. T. Culpepper. Words versus objects: Comparison of free verbal recall. Psychological Reports, 17(3):983–988, 1965. 1 [22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra- manan, P. Dollár, and C. L. Zitnick. Microsoft coco: Com- mon objects in context. In ECCV, 2014. 1, 4 [23] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv
  • 26. preprint arXiv:1410.1090, 2014. 2, 3, 6, 7, 8 [24] T. Mikolov. Recurrent neural network based language model. 1, 2, 3, 4, 5 [25] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient esti- mation of word representations in vector space. International Conference on Learning Representations: Workshops Track, 2013. 3 [26] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Cernocky. Strategies for training large scale neural network language models. In Automatic Speech Recognition and Understand- ing (ASRU), 2011 IEEE Workshop on, pages 196–201. IEEE, 2011. 4 [27] T. Mikolov and G. Zweig. Context dependent recurrent neu- ral network language model. In SLT, pages 234–239, 2012. 1, 2, 3, 5 [28] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, pages 747–756. Asso- ciation for Computational Linguistics, 2012. 2, 5, 6, 8 [29] A. Paivio, T. B. Rogers, and P. C. Smythe. Why are pic- tures easier to recall than words? Psychonomic Science, 11(4):137–138, 1968. 1
  • 27. [30] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002. 1, 5 [31] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon’s mechanical turk. In NAACL HLT Workshop Creating Speech and Lan- guage Data with Amazon’s Mechanical Turk, 2010. 1, 4 [32] R. Socher, Q. Le, C. Manning, and A. Ng. Grounded com- positional semantics for finding and describing images with sentences. In NIPS Deep Learning Workshop, 2013. 1, 2, 5, 6, 8 [33] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011. 2 [34] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising au- toencoders. In Proceedings of the 25th international confer- ence on Machine learning, pages 1096–1103. ACM, 2008. 3 [35] R. J. Williams and D. Zipser. Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1(1):87–111, 1989. 4 [36] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, 2011. 2
  • 28. [37] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8):1485–1508, 2010. 2 [38] C. L. Zitnick and D. Parikh. Bringing semantics into fo- cus using visual abstraction. In Computer Vision and Pat- tern Recognition (CVPR), 2013 IEEE Conference on, pages 3009–3016. IEEE, 2013. 2 Show, Attend and Tell: Neural Image Caption Generation with Visual Attention Kelvin Xu [email protected] Jimmy Lei Ba [email protected] Ryan Kiros [email protected] Kyunghyun Cho [email protected] Aaron Courville [email protected] Ruslan Salakhutdinov [email protected] Richard S. Zemel [email protected] Yoshua Bengio [email protected] Abstract Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the cor-
• 29. responding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

1. Introduction

Automatically generating captions of an image is a task very close to the heart of scene understanding — one of the primary goals of computer vision. Not only must caption generation models be powerful enough to solve the computer vision challenges of determining which objects are in an image, but they must also be capable of capturing and expressing their relationships in a natural language. For this reason, caption generation has long been viewed as a difficult problem. It is a very important challenge for machine learning algorithms, as it amounts to mimicking the remarkable human ability to compress huge amounts of salient visual information into descriptive language.
Despite the challenging nature of this task, there has been a recent surge of research interest in attacking the image caption generation problem. Aided by advances in training neural networks (Krizhevsky et al., 2012) and large classification datasets (Russakovsky et al., 2014), recent work has significantly improved the quality of caption generation using a combination of convolutional neural networks (convnets) to obtain vectorial representations of images and recurrent neural networks to decode those representations into natural language sentences (see Sec. 2).

Figure 1. Our model learns a words/image alignment. The visualized attentional maps (3) are explained in sections 3.1 & 5.4. (Panels: 1. Input Image; 2. Convolutional Feature Extraction, 14×14 feature map; 3. RNN with attention over the image, LSTM; 4. Word-by-word generation, e.g. “A bird flying over a body of water”.)

• 30. One of the most curious facets of the human visual system is the presence of attention (Rensink, 2000; Corbetta & Shulman, 2002). Rather than compress an entire image into a static representation, attention allows for salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image. Using representations (such as those from the top layer of a convnet) that distill information in an image down to the most salient objects is one effective solution that has been widely adopted in previous work. Unfortunately, this has
• 31. one potential drawback of losing information which could be useful for richer, more descriptive captions. Using a more low-level representation can help preserve this information. However, working with these features necessitates a powerful mechanism to steer the model to information important to the task at hand. In this paper, we describe approaches to caption generation that attempt to incorporate a form of attention with two variants: a “hard” attention mechanism and a “soft” attention mechanism. We also show how one advantage of including attention is the ability to visualize what the model “sees”.

• 32. Figure 2. Attention over time. As the model generates each word, its attention changes to reflect the relevant parts of the image. “soft” (top row) vs “hard” (bottom row) attention. (Note that both models generated the same captions in this example.)

Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word).

Encouraged by recent advances in caption generation and inspired by recent success in employing attention in machine translation (Bahdanau et al., 2014) and object recognition (Ba et al., 2014; Mnih et al., 2014), we investigate models that can attend to salient parts of an image while generating its caption. The contributions of this paper are the following:
• We introduce two attention-based image caption generators under a common framework (Sec. 3.1): 1) a “soft” deterministic attention mechanism trainable by
  • 33. standard back-propagation methods and 2) a “hard” stochastic attention mechanism trainable by maximiz- ing an approximate variational lower bound or equiv- alently by REINFORCE (Williams, 1992). • We show how we can gain insight and interpret the results of this framework by visualizing “where” and “what” the attention focused on. (see Sec. 5.4) • Finally, we quantitatively validate the usefulness of attention in caption generation with state of the art performance (Sec. 5.3) on three benchmark datasets: Flickr8k (Hodosh et al., 2013) , Flickr30k (Young et al., 2014) and the MS COCO dataset (Lin et al., 2014). 2. Related Work In this section we provide relevant background on previous work on image caption generation and attention. Recently, several methods have been proposed for generating image descriptions. Many of these methods are based on recur- rent neural networks and inspired by the successful use of sequence to sequence training with neural networks for ma- chine translation (Cho et al., 2014; Bahdanau et al., 2014; Sutskever et al., 2014). One major reason image caption generation is well suited to the encoder-decoder framework (Cho et al., 2014) of machine translation is because it is analogous to “translating” an image to a sentence. The first approach to use neural networks for caption gener- ation was Kiros et al. (2014a), who proposed a multimodal log-bilinear model that was biased by features from the im- age. This work was later followed by Kiros et al. (2014b) whose method was designed to explicitly allow a natural way of doing both ranking and generation. Mao et al. (2014) took a similar approach to generation but replaced a
  • 34. feed-forward neural language model with a recurrent one. Both Vinyals et al. (2014) and Donahue et al. (2014) use LSTM RNNs for their models. Unlike Kiros et al. (2014a) and Mao et al. (2014) whose models see the image at each time step of the output word sequence, Vinyals et al. (2014) only show the image to the RNN at the beginning. Along Neural Image Caption Generation with Visual Attention with images, Donahue et al. (2014) also apply LSTMs to videos, allowing their model to generate video descriptions. All of these works represent images as a single feature vec- tor from the top layer of a pre-trained convolutional net- work. Karpathy & Li (2014) instead proposed to learn a joint embedding space for ranking and generation whose model learns to score sentence and image similarity as a function of R-CNN object detections with outputs of a bidi- rectional RNN. Fang et al. (2014) proposed a three-step pipeline for generation by incorporating object detections. Their model first learn detectors for several visual concepts based on a multi-instance learning framework. A language model trained on captions was then applied to the detector outputs, followed by rescoring from a joint image-text em- bedding space. Unlike these models, our proposed atten- tion framework does not explicitly use object detectors but instead learns latent alignments from scratch. This allows our model to go beyond “objectness” and learn to attend to abstract concepts. Prior to the use of neural networks for generating captions, two main approaches were dominant. The first involved generating caption templates which were filled in based on the results of object detections and attribute discovery
• 35. (Kulkarni et al. (2013), Li et al. (2011), Yang et al. (2011), Mitchell et al. (2012), Elliott & Keller (2013)). The second approach was based on first retrieving similar captioned images from a large database then modifying these retrieved captions to fit the query (Kuznetsova et al., 2012; 2014). These approaches typically involved an intermediate “generalization” step to remove the specifics of a caption that are only relevant to the retrieved image, such as the name of a city. Both of these approaches have since fallen out of favour to the now dominant neural network methods.
There has been a long line of previous work incorporating attention into neural networks for vision related tasks. Some that share the same spirit as our work include Larochelle & Hinton (2010); Denil et al. (2012); Tang et al. (2014). In particular, however, our work directly extends the work of Bahdanau et al. (2014); Mnih et al. (2014); Ba et al. (2014).

3. Image Caption Generation with Attention Mechanism

3.1. Model Details

In this section, we describe the two variants of our attention-based model by first describing their common framework. The main difference is the definition of the φ function, which we describe in detail in Section 4. We denote vectors with bolded font and matrices with capital letters. In our description below, we suppress bias terms for readability.

• 36. Figure 4. An LSTM cell; lines with bolded squares imply projections with a learnt weight vector. Each cell learns how to weigh its input components (input gate), while learning how to modulate that contribution to the memory (input modulator). It also learns weights which erase the memory cell (forget gate), and weights which control how this memory should be emitted (output gate).

• 37. 3.1.1. ENCODER: CONVOLUTIONAL FEATURES

Our model takes a single raw image and generates a caption y encoded as a sequence of 1-of-K encoded words,

y = {y1, . . . , yC}, yi ∈ RK

where K is the size of the vocabulary and C is the length of the caption.
We use a convolutional neural network in order to extract a set of feature vectors which we refer to as annotation vectors. The extractor produces L vectors, each of which is a D-dimensional representation corresponding to a part of the image,

a = {a1, . . . , aL}, ai ∈ RD

In order to obtain a correspondence between the feature vectors and portions of the 2-D image, we extract features from a lower convolutional layer, unlike previous work which instead used a fully connected layer. This allows the decoder to selectively focus on certain parts of an image by selecting a subset of all the feature vectors.

3.1.2. DECODER: LONG SHORT-TERM MEMORY NETWORK

We use a long short-term memory (LSTM) network (Hochreiter & Schmidhuber, 1997) that produces a caption by generating one word at every time step conditioned on a context vector, the previous hidden state and the previously generated words. Our implementation of LSTM
• 38. closely follows the one used in Zaremba et al. (2014) (see Fig. 4). Using Ts,t : Rs → Rt to denote a simple affine transformation with parameters that are learned,

(it, ft, ot, gt)ᵀ = (σ, σ, σ, tanh)ᵀ TD+m+n,n (Eyt−1, ht−1, ẑt)   (1)

• 39. ct = ft ⊙ ct−1 + it ⊙ gt   (2)

ht = ot ⊙ tanh(ct).   (3)

Here, it, ft, ct, ot, ht are the input, forget, memory, output and hidden state of the LSTM, respectively. The vector ẑ ∈ RD is the context vector, capturing the visual information associated with a particular input location, as explained below. E ∈ Rm×K is an embedding matrix. Let m and n denote the embedding and LSTM dimensionality respectively, and let σ and ⊙ be the logistic sigmoid activation and element-wise multiplication respectively.
In simple terms, the context vector ẑt (equations (1)–(3)) is a dynamic representation of the relevant part of the image input at time t. We define a mechanism φ that computes ẑt from the annotation vectors ai, i = 1, . . . , L corresponding to the features extracted at different image locations. For each location i, the mechanism generates a positive weight αi which can be interpreted either as the probability that location i is the right place to focus for producing the next word (the “hard” but stochastic attention mechanism), or as the relative importance to give to location i in blending the ai’s together. The weight αi of each annotation vector ai is computed by an attention model fatt for which we use a multilayer perceptron conditioned on the previous hidden state ht−1. The soft version of this attention mechanism was introduced by Bahdanau et al. (2014). For emphasis, we note that the hidden state varies as the output RNN advances in its output sequence: “where” the network looks next depends on the sequence of words that has already been generated.
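A compact NumPy rendering of the decoder step in Eqs. (1)–(3) above is given below; the single matrix T stands in for the learned affine map TD+m+n,n (bias omitted as in the text), and all dimensions are illustrative rather than the paper's settings.

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

m, n, D = 128, 256, 512          # embedding, LSTM, and annotation sizes (illustrative)
rng = np.random.default_rng(0)
T_mat = rng.normal(scale=0.01, size=(4 * n, m + n + D))   # affine map T, bias omitted

def lstm_step(Ey_prev, h_prev, c_prev, z_hat):
    """One decoder step, Eqs. (1)-(3): Ey_prev is the previous word embedding,
    z_hat the context vector produced by the attention mechanism."""
    pre = T_mat @ np.concatenate([Ey_prev, h_prev, z_hat])
    i, f, o, g = np.split(pre, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g                      # Eq. (2): memory update
    h = o * np.tanh(c)                          # Eq. (3): hidden state
    return h, c
```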
• 40. eti = fatt(ai, ht−1)   (4)

αti = exp(eti) / Σk=1..L exp(etk).   (5)

Once the weights (which sum to one) are computed, the context vector ẑt is computed by

ẑt = φ({ai}, {αi}),   (6)

where φ is a function that returns a single vector given the set of annotation vectors and their corresponding weights. The details of the φ function are discussed in Sec. 4.
The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors fed through two separate MLPs (init,c and init,h):

c0 = finit,c((1/L) Σi ai),   h0 = finit,h((1/L) Σi ai)

• 41. In this work, we use a deep output layer (Pascanu et al., 2014) to compute the output word probability given the LSTM state, the context vector and the previous word:

p(yt | a, y1..t−1) ∝ exp(Lo(Eyt−1 + Lhht + Lzẑt)),   (7)

where Lo ∈ RK×m, Lh ∈ Rm×n, Lz ∈ Rm×D, and E are learned parameters initialized randomly.

4. Learning Stochastic “Hard” vs Deterministic “Soft” Attention

In this section we discuss two alternative mechanisms for the attention model fatt: stochastic attention and deterministic attention.

4.1. Stochastic “Hard” Attention

We represent the location variable st as where the model decides to focus attention when generating the tth word. st,i is an indicator one-hot variable which is set to 1 if the i-th location (out of L) is the one used to extract visual features. By treating the attention locations as intermediate latent variables, we can assign a multinoulli distribution parametrized by {αi}, and view ẑt as a random variable:

p(st,i = 1 | sj<t, a) = αt,i   (8)

ẑt = Σi st,i ai.   (9)
• 42. We define a new objective function Ls that is a variational lower bound on the marginal log-likelihood log p(y | a) of observing the sequence of words y given image features a. The learning algorithm for the parameters W of the models can be derived by directly optimizing Ls:

Ls = Σs p(s | a) log p(y | s, a) ≤ log Σs p(s | a) p(y | s, a) = log p(y | a)   (10)

∂Ls/∂W = Σs p(s | a) [ ∂ log p(y | s, a)/∂W + log p(y | s, a) ∂ log p(s | a)/∂W ].   (11)

• 43. Figure 5. Examples of mistakes where we can use attention to gain intuition into what the model saw.

Equation 11 suggests a Monte Carlo based sampling approximation of the gradient with respect to the model parameters. This can be done by sampling the location st from a multinoulli distribution defined by Equation 8:

s̃t ∼ MultinoulliL({αi})

∂Ls/∂W ≈ (1/N) Σn=1..N [ ∂ log p(y | s̃n, a)/∂W + log p(y | s̃n, a) ∂ log p(s̃n | a)/∂W ]   (12)

• 44. A moving average baseline is used to reduce the variance in the Monte Carlo estimator of the gradient, following Weaver & Tao (2001). Similar, but more complicated variance reduction techniques have previously been used by Mnih et al. (2014) and Ba et al. (2014). Upon seeing the kth mini-batch, the moving average baseline is estimated as an accumulated sum of the previous log likelihoods with exponential decay:

bk = 0.9 × bk−1 + 0.1 × log p(y | s̃k, a)

To further reduce the estimator variance, an entropy term on the multinoulli distribution, H[s], is added. Also, with probability 0.5 for a given image, we set the sampled attention location s̃ to its expected value α. Both techniques improve the robustness of the stochastic attention learning algorithm. The final learning rule for the model is then the following:

• 45. ∂Ls/∂W ≈ (1/N) Σn=1..N [ ∂ log p(y | s̃n, a)/∂W + λr (log p(y | s̃n, a) − b) ∂ log p(s̃n | a)/∂W + λe ∂H[s̃n]/∂W ]

where λr and λe are two hyper-parameters set by cross-validation. As pointed out and used in Ba et al. (2014) and Mnih et al. (2014), this formulation is equivalent to the REINFORCE learning rule (Williams, 1992), where the reward for the attention choosing a sequence of actions is a real value proportional to the log likelihood of the target sentence under the sampled attention trajectory.
In making a hard choice at every point, φ({ai}, {αi}) from Equation 6 is a function that returns a sampled ai at every point in time based upon a multinoulli distribution
• 46. parameterized by α.

4.2. Deterministic “Soft” Attention

Learning stochastic attention requires sampling the attention location st each time; instead, we can take the expectation of the context vector ẑt directly,

Ep(st|a)[ẑt] = Σi=1..L αt,i ai   (13)

and formulate a deterministic attention model by computing a soft attention weighted annotation vector φ({ai}, {αi}) = Σi αi ai, as introduced by Bahdanau et al. (2014). This corresponds to feeding a soft α-weighted context into the system.
• 47. The whole model is smooth and differentiable under the deterministic attention, so learning end-to-end is trivial by using standard back-propagation.
Learning the deterministic attention can also be understood as approximately optimizing the marginal likelihood in Equation 10 under the attention location random variable st from Sec. 4.1. The hidden activation of the LSTM, ht, is a linear projection of the stochastic context vector ẑt followed by a tanh non-linearity. To the first-order Taylor approximation, the expected value Ep(st|a)[ht] is equal to computing ht using a single forward prop with the expected context vector Ep(st|a)[ẑt]. Considering Eq. 7, let nt = Lo(Eyt−1 + Lhht + Lzẑt), and let nt,i denote nt computed by setting the random variable ẑ value to ai. We define the normalized weighted geometric mean for the softmax kth word prediction:

NWGM[p(yt = k | a)] = Πi exp(nt,k,i)^p(st,i=1|a) / Σj Πi exp(nt,j,i)^p(st,i=1|a) = exp(Ep(st|a)[nt,k]) / Σj exp(Ep(st|a)[nt,j])

The equation above shows that the normalized weighted geometric mean of the caption prediction can be approximated well by using the expected context vector, where E[nt] = Lo(Eyt−1 + LhE[ht] + LzE[ẑt]). It shows that the NWGM of a softmax unit is obtained by applying softmax to the expectations of the underlying linear projections. Also, from the results in (Baldi & Sadowski, 2014), NWGM[p(yt = k | a)] ≈ E[p(yt = k | a)] under softmax activation. That means the expectation of the outputs
• 48. over all possible attention locations induced by the random variable st is computed by simple feedforward propagation with the expected context vector E[ẑt]. In other words, the deterministic attention model is an approximation to the marginal likelihood over the attention locations.

4.2.1. DOUBLY STOCHASTIC ATTENTION

By construction, Σi αti = 1 as they are the output of a softmax. In training the deterministic version of our model we introduce a form of doubly stochastic regularization, where we also encourage Σt αti ≈ 1. This can be interpreted as encouraging the model to pay equal attention to every part of the image over the course of generation. In our experiments, we observed that this penalty was important quantitatively to improving overall BLEU score and that qualitatively this leads to richer and more descriptive captions. In addition, the soft attention model predicts a gating scalar β from the previous hidden state ht−1 at each time step t, such that φ({ai}, {αi}) = β Σi=1..L αi ai, where βt = σ(fβ(ht−1)). We notice our attention weights put more emphasis on the objects in the images by including the scalar β.
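The deterministic attention path (Eqs. (4)–(6) and (13), together with the gating scalar β) reduces to a few lines; the sketch below uses an illustrative one-hidden-layer MLP for fatt and is not the released Theano implementation.

```python
import numpy as np

def softmax(z): e = np.exp(z - z.max()); return e / e.sum()
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

L, D, n, A = 196, 512, 256, 128          # locations, feature size, LSTM size, attention MLP size
rng = np.random.default_rng(0)
W_a = rng.normal(scale=0.01, size=(A, D))   # annotation -> attention space
W_h = rng.normal(scale=0.01, size=(A, n))   # previous hidden state -> attention space
w_e = rng.normal(scale=0.01, size=A)        # attention space -> scalar score e_ti
w_beta = rng.normal(scale=0.01, size=n)     # gating scalar beta_t = sigmoid(f_beta(h_{t-1}))

def soft_attention(a, h_prev):
    """a: (L, D) annotation vectors; returns weights alpha (L,) and context z_hat (D,)."""
    e = np.tanh(a @ W_a.T + W_h @ h_prev) @ w_e      # Eq. (4): a small MLP for f_att
    alpha = softmax(e)                               # Eq. (5): weights sum to one
    beta = sigmoid(w_beta @ h_prev)                  # gating scalar
    z_hat = beta * (alpha[:, None] * a).sum(axis=0)  # Eqs. (6)/(13), expected context, gated
    return alpha, z_hat

# Doubly stochastic penalty over a whole caption of C steps, with alphas of shape (C, L):
# penalty = lam * ((1.0 - alphas.sum(axis=0)) ** 2).sum()
```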
• 49. Concretely, the model is trained end-to-end by minimizing the following penalized negative log-likelihood:

Ld = −log(P(y | x)) + λ Σi=1..L (1 − Σt=1..C αti)²   (14)

4.3. Training Procedure

Both variants of our attention model were trained with stochastic gradient descent using adaptive learning rate algorithms. For the Flickr8k dataset, we found that RMSProp (Tieleman & Hinton, 2012) worked best, while for the Flickr30k/MS COCO datasets we used the recently proposed Adam algorithm (Kingma & Ba, 2014).
To create the annotations ai used by our decoder, we used the Oxford VGGnet (Simonyan & Zisserman, 2014) pre-trained on ImageNet without finetuning. In principle, however, any encoding function could be used. In addition, with enough data, we could also train the encoder from scratch (or fine-tune) with the rest of the model. In our experiments we use the 14×14×512 feature map of the fourth convolutional layer before max pooling. This means our decoder operates on the flattened 196 × 512 (i.e. L × D) encoding.
As our implementation requires time proportional to the length of the longest sentence per update, we found training
• 50. on a random group of captions to be computationally wasteful. To mitigate this problem, in preprocessing we build a dictionary mapping the length of a sentence to the corresponding subset of captions. Then, during training we randomly sample a length and retrieve a mini-batch of size 64 of that length. We found that this greatly improved convergence speed with no noticeable diminishment in performance. On our largest dataset (MS COCO), our soft attention model took less than 3 days to train on an NVIDIA Titan Black GPU.
In addition to dropout (Srivastava et al., 2014), the only other regularization strategy we used was early stopping on BLEU score. We observed a breakdown in correlation between the validation set log-likelihood and BLEU in the later stages of training during our experiments. Since BLEU is the most commonly reported metric, we used BLEU on our validation set for model selection.
In our experiments with soft attention, we also used Whetlab (https://www.whetlab.com/) (Snoek et al., 2012; 2014) in our Flickr8k experiments. Some of the intuitions we gained from hyperparameter regions it explored were especially important in our Flickr30k and COCO experiments.
We make our code for these models, which is based in Theano (Bergstra et al., 2010), publicly available upon publication to encourage future research in this area.

• 51. Table 1. BLEU-1,2,3,4/METEOR metrics compared to other methods. † indicates a different split, (—) indicates an unknown metric, ◦ indicates the authors kindly provided missing metrics by personal communication, Σ indicates an ensemble, a indicates using AlexNet.

| Dataset | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR |
| Flickr8k | Google NIC (Vinyals et al., 2014)†Σ | 63 | 41 | 27 | — | — |
| | Log Bilinear (Kiros et al., 2014a)◦ | 65.6 | 42.4 | 27.7 | 17.7 | 17.31 |
| | Soft-Attention | 67 | 44.8 | 29.9 | 19.5 | 18.93 |
| | Hard-Attention | 67 | 45.7 | 31.4 | 21.3 | 20.30 |
• 52.
| Flickr30k | Google NIC†◦Σ | 66.3 | 42.3 | 27.7 | 18.3 | — |
| | Log Bilinear | 60.0 | 38 | 25.4 | 17.1 | 16.88 |
| | Soft-Attention | 66.7 | 43.4 | 28.8 | 19.1 | 18.49 |
| | Hard-Attention | 66.9 | 43.9 | 29.6 | 19.9 | 18.46 |
• 53.
| COCO | CMU/MS Research (Chen & Zitnick, 2014)a | — | — | — | — | 20.41 |
| | MS Research (Fang et al., 2014)†a | — | — | — | — | 20.71 |
| | BRNN (Karpathy & Li, 2014)◦ | 64.2 | 45.1 | 30.4 | 20.3 | — |
| | Google NIC†◦Σ | 66.6 | 46.1 | 32.9 | 24.6 | — |
| | Log Bilinear◦ | 70.8 | 48.9 | 34.4 | 24.3 | 20.03 |
| | Soft-Attention | 70.7 | 49.2 | 34.4 | 24.3 | 23.90 |
| | Hard-Attention | 71.8 | 50.4 | 35.7 | 25.0 | 23.04 |

5. Experiments

We describe our experimental methodology and quantitative results which validate the effectiveness of our model for caption generation.
  • 55. 5.1. Data We report results on the popular Flickr8k and Flickr30k dataset which has 8,000 and 30,000 images respectively as well as the more challenging Microsoft COCO dataset which has 82,783 images. The Flickr8k/Flickr30k dataset both come with 5 reference sentences per image, but for the MS COCO dataset, some of the images have references in excess of 5 which for consistency across our datasets we discard. We applied only basic tokenization to MS COCO so that it is consistent with the tokenization present in Flickr8k and Flickr30k. For all our experiments, we used a fixed vocabulary size of 10,000. Results for our attention-based architecture are reported in Table 4.2.1. We report results with the frequently used BLEU metric2 which is the standard in the caption gen- eration literature. We report BLEU from 1 to 4 with- 2We verified that our BLEU evaluation code matches the au- thors of Vinyals et al. (2014), Karpathy & Li (2014) and Kiros et al. (2014b). For fairness, we only compare against results for which we have verified that our BLEU evaluation code is the same. With the upcoming release of the COCO evaluation server, we will include comparison results with all other recent image captioning models. out a brevity penalty. There has been, however, criticism of BLEU, so in addition we report another common met- ric METEOR (Denkowski & Lavie, 2014), and compare whenever possible. 5.2. Evaluation Procedures A few challenges exist for comparison, which we explain
  • 56. here. The first is a difference in choice of convolutional feature extractor. For identical decoder architectures, us- ing more recent architectures such as GoogLeNet or Ox- ford VGG Szegedy et al. (2014), Simonyan & Zisserman (2014) can give a boost in performance over using the AlexNet (Krizhevsky et al., 2012). In our evaluation, we compare directly only with results which use the compa- rable GoogLeNet/Oxford VGG features, but for METEOR comparison we note some results that use AlexNet. The second challenge is a single model versus ensemble comparison. While other methods have reported perfor- mance boosts by using ensembling, in our results we report a single model performance. Finally, there is challenge due to differences between dataset splits. In our reported results, we use the pre- defined splits of Flickr8k. However, one challenge for the Flickr30k and COCO datasets is the lack of standardized splits. As a result, we report with the publicly available splits3 used in previous work (Karpathy & Li, 2014). In our experience, differences in splits do not make a substantial difference in overall performance, but we note the differ- 3http://cs.stanford.edu/people/karpathy/ deepimagesent/ http://cs.stanford.edu/people/karpathy/deepimagesent/ http://cs.stanford.edu/people/karpathy/deepimagesent/ Neural Image Caption Generation with Visual Attention ences where they exist. 5.3. Quantitative Analysis
  • 57. In Table 4.2.1, we provide a summary of the experi- ment validating the quantitative effectiveness of attention. We obtain state of the art performance on the Flickr8k, Flickr30k and MS COCO. In addition, we note that in our experiments we are able to significantly improve the state of the art performance METEOR on MS COCO that we speculate is connected to some of the regularization tech- niques we used 4.2.1 and our lower level representation. Finally, we also note that we are able to obtain this perfor- mance using a single model without an ensemble. 5.4. Qualitative Analysis: Learning to attend By visualizing the attention component learned by the model, we are able to add an extra layer of interpretabil- ity to the output of the model (see Fig. 1). Other systems that have done this rely on object detection systems to pro- duce candidate alignment targets (Karpathy & Li, 2014). Our approach is much more flexible, since the model can attend to “non object” salient regions. The 19-layer OxfordNet uses stacks of 3x3 filters mean- ing the only time the feature maps decrease in size are due to the max pooling layers. The input image is resized so that the shortest side is 256 dimensional with preserved as- pect ratio. The input to the convolutional network is the center cropped 224x224 image. Consequently, with 4 max pooling layers we get an output dimension of the top con- volutional layer of 14x14. Thus in order to visualize the attention weights for the soft model, we simply upsample the weights by a factor of 24 = 16 and apply a Gaussian filter. We note that the receptive fields of each of the 14x14 units are highly overlapping. As we can see in Figure 2 and 3, the model learns align-
  • 58. ments that correspond very strongly with human intuition. Especially in the examples of mistakes, we see that it is possible to exploit such visualizations to get an intuition as to why those mistakes were made. We provide a more ex- tensive list of visualizations in Appendix A for the reader. 6. Conclusion We propose an attention based approach that gives state of the art performance on three benchmark datasets us- ing the BLEU and METEOR metric. We also show how the learned attention can be exploited to give more inter- pretability into the models generation process, and demon- strate that the learned alignments correspond very well to human intuition. We hope that the results of this paper will encourage future work in using visual attention. We also expect that the modularity of the encoder-decoder approach combined with attention to have useful applications in other domains. Acknowledgments The authors would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012). We acknowledge the support of the following organizations for research funding and computing support: the Nuance Foundation, NSERC, Samsung, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. The au- thors would also like to thank Nitish Srivastava for assis- tance with his ConvNet package as well as preparing the Oxford convolutional network and Relu Patrascu for help- ing with numerous infrastructure related problems. References Ba, Jimmy Lei, Mnih, Volodymyr, and Kavukcuoglu, Ko- ray. Multiple object recognition with visual attention.
  • 59. arXiv:1412.7755, December 2014. Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, September 2014. Baldi, Pierre and Sadowski, Peter. The dropout learning algorithm. Artificial Intelligence, 210:78–122, 2014. Bastien, Frederic, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, Warde-Farley, David, and Bengio, Yoshua. Theano: new features and speed improvements. Submitted to the Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012. Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010. Chen, Xinlei and Zitnick, C. Lawrence. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014. Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, October 2014. Corbetta, Maurizio and Shulman, Gordon L. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3):201–215, 2002.
  • 60. Denil, Misha, Bazzani, Loris, Larochelle, Hugo, and de Freitas, Nando. Learning where to attend with deep architectures for image tracking. Neural Computation, 2012. Denkowski, Michael and Lavie, Alon. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014. Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389v2, November 2014. Elliott, Desmond and Keller, Frank. Image description using visual dependency representations. In EMNLP, 2013. Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From captions to visual concepts and back. arXiv:1411.4952, November 2014. Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, pp. 853–899, 2013.
  • 61. Karpathy, Andrej and Li, Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arXiv:1412.2306, December 2014. Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv:1412.6980, December 2014. Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard. Multimodal neural language models. In International Conference on Machine Learning, pp. 595–603, 2014a. Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, November 2014b. Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. Kulkarni, Girish, Premraj, Visruth, Ordonez, Vicente, Dhar, Sagnik, Li, Siming, Choi, Yejin, Berg, Alexander C, and Berg, Tamara L. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on PAMI, 35(12):2891–2903, 2013. Kuznetsova, Polina, Ordonez, Vicente, Berg, Alexander C, Berg, Tamara L, and Choi, Yejin. Collective generation of natural image descriptions. In Association for Computational Linguistics. ACL, 2012. Kuznetsova, Polina, Ordonez, Vicente, Berg, Tamara L, and Choi, Yejin. Treetalk: Composition and compression of trees for
  • 62. image descriptions. TACL, 2(10):351–362, 2014. Larochelle, Hugo and Hinton, Geoffrey E. Learning to combine foveal glimpses with a third-order Boltzmann machine. In NIPS, pp. 1243–1251, 2010. Li, Siming, Kulkarni, Girish, Berg, Tamara L, Berg, Alexander C, and Choi, Yejin. Composing simple image descriptions using web-scale n-grams. In Computational Natural Language Learning. ACL, 2011. Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick, C Lawrence. Microsoft COCO: Common objects in context. In ECCV, pp. 740–755, 2014. Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille, Alan. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv:1412.6632, December 2014. Mitchell, Margaret, Han, Xufeng, Dodge, Jesse, Mensch, Alyssa, Goyal, Amit, Berg, Alex, Yamaguchi, Kota, Berg, Tamara, Stratos, Karl, and Daumé III, Hal. Midge: Generating image descriptions from computer vision detections. In European Chapter of the Association for Computational Linguistics, pp. 747–756. ACL, 2012. Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, and Kavukcuoglu, Koray. Recurrent models of visual attention. In NIPS, 2014. Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, and Bengio, Yoshua. How to construct deep recurrent neural networks.
  • 63. In ICLR, 2014. Rensink, Ronald A. The dynamic representation of scenes. Visual Cognition, 7(1-3):17–42, 2000. Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014. Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In NIPS, pp. 2951–2959, 2012. Snoek, Jasper, Swersky, Kevin, Zemel, Richard S, and Adams, Ryan P. Input warping for Bayesian optimization of non-stationary functions. arXiv preprint arXiv:1402.0929, 2014. Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15, 2014. Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112, 2014. Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with
  • 64. convolutions. arXiv preprint arXiv:1409.4842, 2014. Tang, Yichuan, Srivastava, Nitish, and Salakhutdinov, Ruslan R. Learning generative models with visual attention. In NIPS, pp. 1808–1816, 2014. Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 - rmsprop. Technical report, 2012. Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. arXiv:1411.4555, November 2014. Weaver, Lex and Tao, Nigel. The optimal reward baseline for gradient-based reinforcement learning. In Proc. UAI'2001, pp. 538–545, 2001. Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992. Yang, Yezhou, Teo, Ching Lik, Daumé III, Hal, and Aloimonos, Yiannis. Corpus-guided sentence generation of natural images. In EMNLP, pp. 444–454. ACL, 2011. Young, Peter, Lai, Alice, Hodosh, Micah, and Hockenmaier, Julia. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78, 2014.
  • 65. Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, September 2014. A. Appendix Visualizations from our “hard” (a) and “soft” (b) attention model. White indicates the regions where the model roughly attends to (see Section 5.4). (a) A man and a woman playing frisbee in a field. (b) A woman is throwing a frisbee in a park. Figure 6. (a) A giraffe standing in the field with trees.
  • 66. (b) A large white bird standing in a forest. Figure 7. (a) A dog is laying on a bed with a book. (b) A dog is standing on a hardwood floor. Figure 8. (a) A woman is holding a donut in his hand. (b) A woman holding a clock in her hand. Figure 9.
  • 67. (a) A stop sign with a stop sign on it. (b) A stop sign is on a road with a mountain in the background. Figure 10. (a) A man in a suit and a hat holding a remote control. (b) A man wearing a hat and a hat on a skateboard. Figure 11.
  • 68. (a) A little girl sitting on a couch with a teddy bear. (b) A little girl sitting on a bed with a teddy bear. (a) A man is standing on a beach with a surfboard. (b) A person is standing on a beach with a surfboard.
  • 69. (a) A man and a woman riding a boat in the water. (b) A group of people sitting on a boat in the water. Figure 12. (a) A man is standing in a market with a large amount of food. (b) A woman is sitting at a table with a large pizza. Figure 13.
  • 70. (a) A giraffe standing in a field with trees. (b) A giraffe standing in a forest with trees in the background. Figure 14. (a) A group of people standing next to each other. (b) A man is talking on his cell phone while another man watches. Figure 15. University of Groningen Automatic Description Generation from Images Bernardi, Raffaella; Cakici, Ruket; Elliott, Desmond; Erdem, Aykut; Erdem, Erkut; Ikizler-Cinbis, Nazli;
  • 71. Keller, Frank; Muscat, Adrian; Plank, Barbara Published in: Journal of Artificial Intelligence Research DOI: 10.1613/jair.4900 Publication date: 2016 Citation (APA): Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., ... Plank, B. (2016). Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures. Journal of Artificial Intelligence Research, 55, 409-442. https://doi.org/10.1613/jair.4900
  • 72. Journal of Artificial Intelligence Research 55 (2016) 409-442 Submitted 07/15; published 02/16 Automatic Description Generation from Images: A Survey
  • 73. of Models, Datasets, and Evaluation Measures Raffaella Bernardi [email protected] University of Trento, Italy Ruket Cakici [email protected] Middle East Technical University, Turkey Desmond Elliott [email protected] University of Amsterdam, Netherlands Aykut Erdem [email protected] Erkut Erdem [email protected] Nazli Ikizler-Cinbis [email protected] Hacettepe University, Turkey Frank Keller [email protected] University of Edinburgh, UK Adrian Muscat [email protected] University of Malta, Malta Barbara Plank [email protected] University of Copenhagen, Denmark Abstract Automatic description generation from natural images is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities. In this survey, we classify the existing approaches based on how they conceptualize this problem, viz., models that cast description either as a generation problem or as a retrieval problem over a visual or
  • 74. multimodal representational space. We provide a detailed review of existing models, highlighting their advantages and disadvantages. Moreover, we give an overview of the benchmark image datasets and the evaluation measures that have been developed to assess the quality of machine-generated image descriptions. Finally, we extrapolate future directions in the area of automatic image description generation. 1. Introduction Over the past two decades, the fields of natural language processing (NLP) and computer vision (CV) have seen great advances in their respective goals of analyzing and generating text, and of understanding images and videos. While both fields share a similar set of methods rooted in artificial intelligence and machine learning, they have historically developed separately, and their scientific communities have typically interacted very little. Recent years, however, have seen an upsurge of interest in problems that require a combination of linguistic and visual information. A lot of everyday tasks are of this nature, e.g., interpreting a photo in the context of a newspaper article, following instructions in conjunction with a diagram or a map, understanding slides while listening to a lecture. In
  • 75. addition to this, the web provides a vast amount of data that combines linguistic and visual information: tagged photographs, illustrations in newspaper articles, videos with subtitles, and multimodal feeds on social media. To tackle combined language and vision tasks and to exploit the large amounts of multimodal data, the CV and NLP communities have moved closer together, for example by organizing workshops on language and vision that have been held regularly at both CV and NLP conferences over the past few years. In this new language-vision community, automatic image description has emerged as a key task. This task involves taking an image, analyzing its visual content, and generating a textual description (typically a sentence) that verbalizes the most salient aspects of the image. This is challenging from a CV point of view, as the description could in principle talk about any visual aspect of the image: it can mention objects and their attributes, it can talk about features of the scene (e.g., indoor/outdoor), or verbalize how the people and objects in the scene interact. More challenging still, the description could even refer to objects that are not depicted (e.g., it can talk about people waiting for a train, even when the train is not visible because it has not arrived yet) and provide background knowledge that cannot be derived directly from the image (e.g., the person depicted is the Mona Lisa).
  • 76. In short, a good image description requires full image understanding, and therefore the description task is an excellent test bed for computer vision systems, one that is much more comprehensive than standard CV evaluations that typically test, for instance, the accuracy of object detectors or scene classifiers over a limited set of classes. Image understanding is necessary, but not sufficient for producing a good description. Imagine we apply an array of state-of-the-art detectors to the image to localize objects (e.g., Felzenszwalb, Girshick, McAllester, & Ramanan, 2010; Girshick, Donahue, Darrell, & Malik, 2014), determine attributes (e.g., Lampert, Nickisch, & Harmeling, 2009; Berg, Berg, & Shih, 2010; Parikh & Grauman, 2011), compute scene properties (e.g., Oliva & Torralba, 2001; Lazebnik, Schmid, & Ponce, 2006), and recognize human-object interactions (e.g., Prest, Schmid, & Ferrari, 2012; Yao & Fei-Fei, 2010). The result would be a long, unstructured list of labels (detector outputs), which would be unusable as an image description. A good image description, in contrast, has to be comprehensive but concise (talk about all and only the important things in the image), and has to be formally correct, i.e., consist of grammatically well-formed sentences. From an NLP point of view, generating such a description is a natural language generation (NLG) problem. The task of NLG is to turn a non-linguistic representation into human-readable text. Classically, the non-linguistic
  • 77. representation is a logical form, a database query, or a set of numbers. In image description, the input is an image representation (e.g., the detector outputs listed in the previous paragraph), which the NLG model has to turn into sentences. Generating text involves a series of steps, traditionally referred to as the NLP pipeline (Reiter & Dale, 2006): we need to decide which aspects of the input to talk about (content selection), then we need to organize the content (text planning) and verbalize it (surface realization). Surface realization in turn requires choosing the right words (lexicalization), using pronouns if appropriate (referential expression generation), and grouping related information together (aggregation). In other words, automatic image description requires not only full image understanding, but also sophisticated natural language generation. This is what makes it such an interesting task that has been embraced by both the CV and the NLP communities.1 Note that the description task can become even more challenging when we take into account that good descriptions are often user-specific. For instance, an art critic will require a different description than a librarian or a journalist, even for the same photograph. We will briefly touch upon this issue when we talk about the difference between descriptions and captions in Section 3 and discuss future directions in Section 4. Given that automatic image description is such an interesting task, and it is driven by the existence of mature CV and NLP methods and the availability of relevant datasets, a large image description literature has appeared over the last five years. The aim of this survey article is to give a comprehensive overview of this literature, covering models, datasets, and evaluation metrics. We sort the existing literature into three categories based on the image description models used. The first group of models follows the classical pipeline we outlined above: they first detect or predict the image content in terms of objects, attributes, scene types, and actions, based on a set of visual features. Then, these models use this content information to drive a natural language generation system that outputs an image description. We will term these approaches direct generation models. The second group of models casts the problem as a retrieval problem. That is, to create a description for a novel image, these models search for images in a database that are similar to the novel image. Then they build a description for the novel image based on the descriptions of the set of similar images that was retrieved. The novel image is described by simply reusing the description of the most similar retrieved image (transfer), or by synthesizing a novel description based on the descriptions of a set of similar images. Retrieval-based models can be further subdivided based on what type of approach they use to represent images and compute similarity. The first subgroup of models uses a visual space to retrieve images, while the second subgroup uses a multimodal space that represents images and text jointly. For an overview of the models that will be reviewed in this survey, and which category they fall into, see Table 1. Generating natural language descriptions from videos presents unique challenges over and above image-based description, as it additionally requires analyzing the objects and their attributes and actions in the temporal dimension. Models that aim to solve description generation from videos have been proposed in the literature (e.g., Khan, Zhang, & Gotoh, 2011; Guadarrama, Krishnamoorthy, Malkarnenkar, Venugopalan, Mooney, Darrell, & Saenko, 2013; Krishnamoorthy, Malkarnenkar, Mooney, Saenko, & Guadarrama, 2013; Rohrbach, Qiu, Titov, Thater, Pinkal, & Schiele, 2013; Thomason, Venugopalan, Guadarrama, Saenko, & Mooney, 2014; Rohrbach, Rohrbach, Tandon, & Schiele, 2015; Yao, Torabi, Cho, Ballas, Pal, Larochelle, & Courville, 2015; Zhu, Kiros, Zemel, Salakhutdinov, Urtasun, Torralba, & Fidler, 2015). However, most existing work on description generation has used static images, and this is what we will focus on in this survey.2 In this survey article, we first group automatic image description models into the three categories outlined above and provide a comprehensive overview of the models in each category in Section 2. (Footnotes: 1. Though some image description approaches circumvent the NLG aspect by transferring human-authored descriptions, see Sections 2.2 and 2.3. 2. An interesting intermediate approach involves the annotation of image streams with sequences of sentences; see the work of Park and Kim, 2015.) We then examine the available multimodal image datasets used for training and testing description generation models in Section 3. Furthermore, we review evaluation measures that have been used to gauge the quality of generated descriptions in Section 3. Finally, in Section 4, we discuss future research directions, including possible new tasks related to image description, such as visual question answering. 2. Image Description Models Generating automatic descriptions from images requires an understanding of how humans describe images. An image description can be analyzed in several different dimensions (Shatford, 1986; Jaimes & Chang, 2000). We follow Hodosh, Young, and Hockenmaier (2013) and assume that the descriptions that are of interest for this survey article are the ones that verbalize visual and conceptual information depicted in the image, i.e., descriptions that refer to the depicted entities, their attributes and relations, and the actions they are involved in. Outside the scope of automatic image description are non-visual descriptions, which give background information or refer to objects not depicted in the image (e.g., the location at which the image was taken or who took the picture). Also, not relevant for standard approaches to image description are perceptual descriptions, which capture the global low-level visual characteristics of images (e.g., the dominant color in the image or the type of the media such as photograph, drawing, animation, etc.). In the following subsections, we give a comprehensive overview of state-of-the-art approaches to description generation. Table 1 offers a high-level summary of the field, using the three categories of models outlined in the introduction: direct generation models, retrieval models from visual space, and retrieval models from multimodal space. 2.1 Description as Generation from Visual Input The general approach of the studies in this group is to first predict the most likely meaning of a given image by analyzing its visual content, and then generate a sentence reflecting this meaning. All models in this category achieve this using the following general pipeline architecture: 1. Computer vision techniques are applied to classify the scene type, to detect the objects present in the image, to predict their attributes and the relationships that hold between them, and to recognize the actions taking place. 2. This is followed by a generation phase that turns the detector outputs into words or phrases. These are then combined to produce a natural language description of the image, using techniques from natural language generation (e.g., templates, n-grams, grammar rules). The approaches reviewed in this section perform an explicit mapping from images to descriptions, which differentiates them from the studies described in Sections 2.2 and 2.3, which incorporate implicit vision and language models. An illustration of a sample model is shown in Figure 1. An explicit pipeline architecture, while tailored to the problem at hand, constrains the generated descriptions, as it relies on a predefined set of semantic classes of scenes, objects, attributes, and actions.
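To make the two-step pipeline above concrete, the sketch below shows one way a generation phase could turn detector outputs into a sentence with a fixed template, in the spirit of the template-based systems discussed in this section (e.g., Kulkarni et al., 2011). It is a toy illustration, not code from any surveyed system; the detection format, the content-selection rule, and the template are all assumptions.

```python
# Illustrative sketch of a template-based generation phase (not from any
# surveyed system). Detector outputs and the template are assumed formats.

def select_content(detections, top_k=2):
    """Content selection: keep the most confident object detections."""
    ranked = sorted(detections, key=lambda d: d["score"], reverse=True)
    return ranked[:top_k]

def realize(objects, scene):
    """Surface realization: fill a fixed template with the selected content."""
    def np(obj):  # noun phrase with an optional attribute
        attr = obj.get("attribute")
        return f"a {attr} {obj['label']}" if attr else f"a {obj['label']}"
    if len(objects) == 2:
        return f"{np(objects[0])} and {np(objects[1])} in the {scene}.".capitalize()
    return f"{np(objects[0])} in the {scene}.".capitalize()

detections = [
    {"label": "dog", "attribute": "brown", "score": 0.92},
    {"label": "frisbee", "score": 0.81},
    {"label": "bench", "score": 0.40},
]
print(realize(select_content(detections), scene="park"))
# -> "A brown dog and a frisbee in the park."
```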
  • 83. Reference | Generation | Retrieval from Visual Space | Multimodal Space
Farhadi et al. (2010) X
Kulkarni et al. (2011) X
Li et al. (2011) X
Ordonez et al. (2011) X
Yang et al. (2011) X
Gupta et al. (2012) X
Kuznetsova et al. (2012) X
Mitchell et al. (2012) X
Elliott and Keller (2013) X
Hodosh et al. (2013) X
Gong et al. (2014) X
Karpathy et al. (2014) X
Kuznetsova et al. (2014) X
Mason and Charniak (2014) X
Patterson et al. (2014) X
Socher et al. (2014) X
Verma and Jawahar (2014) X
Yatskar et al. (2014) X
Chen and Zitnick (2015) X X
Donahue et al. (2015) X X
Devlin et al. (2015) X
Elliott and de Vries (2015) X
Fang et al. (2015) X
Jia et al. (2015) X X
Karpathy and Fei-Fei (2015) X X
Kiros et al. (2015) X X
Lebret et al. (2015) X X
Lin et al. (2015) X
Mao et al. (2015a) X X
Ortiz et al. (2015) X
Pinheiro et al. (2015) X
Ushiku et al. (2015) X
Vinyals et al. (2015) X X
  • 84. Xu et al. (2015) X X
Yagcioglu et al. (2015) X
Table 1: An overview of existing approaches to automatic image description. We have categorized the literature into approaches that directly generate a description of an image (Section 2.1), approaches that retrieve images via visual similarity and transfer their description to the new image (Section 2.2), and approaches that frame the task as retrieving descriptions and images from a multimodal space (Section 2.3). Figure 1: The automatic image description generation system proposed by Kulkarni et al. (2011). Moreover, such an architecture crucially assumes the accuracy of the detectors for each semantic class, an assumption that is not always met in practice. Approaches to description generation differ along two main dimensions: (a) which image representations they derive descriptions from, and (b) how they address the sentence generation problem. In terms of the representations used, existing models have conceptualized images in a number of different ways, relying on spatial relationships (Farhadi et al., 2010), corpus-based relationships (Yang et al., 2011), or spatial and