CSCI 5922
Neural Networks and Deep Learning:
Image Captioning
Mike Mozer
Department of Computer Science and
Institute of Cognitive Science
University of Colorado at Boulder
Fads
• 2014
  • Seven papers, almost simultaneously, focused on processing images and textual descriptions
• 2015-2017
  • Tons of papers about visual question answering
Image-Sentence Tasks
• Sentence retrieval
  • finding best matching sentence to an image
• Sentence generation
  • given image, produce textual description
• Image retrieval
  • given textual description, find best matching image
• Image-sentence correspondence
  • match images to sentences
• Question answering
  • given image and question, produce answer
From Karpathy blog
Adding Captions to Images
• A cow is standing in the field.
• This is a close up of a brown cow.
• There is a brown cow with long hair and two horns.
• There are two trees and a cloud in the background of a field with a large cow in it.
From Devi Parikh
Three Papers I'll Talk About
• Deep Visual-Semantic Alignments for Generating Image Descriptions
  • Karpathy & Fei-Fei (Stanford), CVPR 2015
• Show and Tell: A Neural Image Caption Generator
  • Vinyals, Toshev, Bengio, Erhan (Google), CVPR 2015
• Deep Captioning with Multimodal Recurrent Nets
  • Mao, Xu, Yang, Wang, Yuille (UCLA, Baidu), ICLR 2015
Other Papers
• Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
• Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
• Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.
• Chen, Xinlei and Zitnick, C. Lawrence. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
Data Sets
• Flickr8k
  • 8,000 images, each annotated with 5 sentences via Amazon Mechanical Turk (AMT)
  • 1,000 images each for validation and testing
• Flickr30k
  • 30,000 images
  • 1,000 images each for validation and testing
• MSCOCO
  • 123,000 images
  • 5,000 images each for validation and testing
Approach (Karpathy & Fei-Fei, 2015)
• Sentences act as weak labels
  • contiguous sequences of words correspond to some particular (unknown) location in the image
• Task 1
  • find an alignment between snippets of text and image locations
• Task 2
  • build an RNN that maps image patches to snippets of text
Representing Images
• Process regions of the image with a "Region Convolutional Neural Net" (R-CNN)
  • pretrained on ImageNet and further fine-tuned
  • map activations from the fully connected layer before the classification layer to an 𝒉-dimensional embedding, 𝒗
  • 𝒉 ranges from 1000 to 1600
• Pick the 19 locations with the strongest classification response, plus the whole image
• The image is represented by a set of 20 𝒉-dimensional embedding vectors (sketch below)
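A minimal sketch (assuming PyTorch; not the authors' code) of the image side: each of the 20 region activations is linearly projected into the shared 𝒉-dimensional space. The 4096-dimensional CNN feature size and all names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RegionEmbedder(nn.Module):
    """Map CNN features of 20 regions (19 strongest detections + the whole
    image) into the h-dimensional multimodal embedding space:
    v = W_m * CNN(I_b) + b_m.  The 4096-d input size is an assumption
    (a typical fc-layer size), not a number taken from the slides."""
    def __init__(self, cnn_dim=4096, h=1000):
        super().__init__()
        self.project = nn.Linear(cnn_dim, h)    # learned W_m, b_m

    def forward(self, cnn_features):            # (20, cnn_dim) region activations
        return self.project(cnn_features)       # (20, h): one v_i per region

# usage: stand-in activations for the 19 detections plus the whole image
features = torch.randn(20, 4096)
v = RegionEmbedder()(features)                  # 20 embedding vectors v_i
```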
Representing Sentences
• Bidirectional RNN (BRNN; sketch below)
  • input representation is based on word2vec
• Produces an embedding 𝒔𝒕 at each position 𝒕 of the sentence
  • the embedding also has dimensionality 𝒉 [don't confuse this with the hidden activity 𝒉!]
• Key claim
  • 𝒔𝒕 represents the concept of the words near position 𝒕
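A sketch of the sentence side under the same assumptions: a bidirectional RNN reads word2vec-style vectors and emits one 𝒉-dimensional embedding 𝒔𝒕 per word position. The 300-d word vectors and the plain (tanh) RNN cell are assumptions; the paper's exact parameterization differs in detail.

```python
import torch
import torch.nn as nn

class SentenceEmbedder(nn.Module):
    """Bidirectional RNN over word vectors; one embedding s_t per position."""
    def __init__(self, word_dim=300, h=1000):
        super().__init__()
        self.brnn = nn.RNN(word_dim, h, bidirectional=True, batch_first=True)
        self.combine = nn.Linear(2 * h, h)      # merge forward/backward states

    def forward(self, word_vectors):            # (batch, T, word_dim) word2vec inputs
        states, _ = self.brnn(word_vectors)     # (batch, T, 2h)
        return self.combine(states)             # (batch, T, h): s_t for each word t

s = SentenceEmbedder()(torch.randn(1, 12, 300))  # s_t for a 12-word sentence
```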
Alignment
• Image fragment 𝒊 is mapped to an 𝒉-dimensional embedding, 𝒗𝒊
• Sentence fragment 𝒕 is mapped to an 𝒉-dimensional embedding, 𝒔𝒕
• Define similarity between image and sentence fragments
  • dot product 𝒗𝒊ᵀ𝒔𝒕
• Match score between image 𝒌 and sentence 𝒍 is computed from these fragment similarities (formula below), where 𝒈𝒌 and 𝒈𝒍 are the sets of image and sentence fragments
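The score formula on the original slide did not survive the export; as a reference point, the simplified image-sentence score in Karpathy & Fei-Fei (2015) takes roughly this form, scoring each word against its best-matching image fragment:

```latex
S_{kl} \;=\; \sum_{t \in g_l} \; \max_{i \in g_k} \; \mathbf{v}_i^{\mathsf{T}} \mathbf{s}_t
```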
Example Alignments
Alignment II
• Train the model to minimize a margin-based ranking loss (see below)
  • aligned image-sentence pairs should score higher than misaligned pairs, by a margin
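For reference, the ranking objective in the paper has approximately this form (margin of 1): the score S_kk of a true image-sentence pair should beat the scores of that image with other sentences and of that sentence with other images:

```latex
\mathcal{C}(\theta) \;=\; \sum_{k} \Big[ \sum_{l} \max\big(0,\, S_{kl} - S_{kk} + 1\big)
\;+\; \sum_{l} \max\big(0,\, S_{lk} - S_{kk} + 1\big) \Big]
```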
Alignment III
• We have a measure of alignment between words and image fragments: 𝒗𝒊ᵀ𝒔𝒕
• We want an assignment of phrases (multiple words) to image fragments
• Constraint satisfaction approach (see the sketch below)
  • search through a space of assignments of words to image fragments
  • encourage adjacent words to have the same assignment
  • larger 𝒗𝒊ᵀ𝒔𝒕 → higher probability that word 𝒕 will be associated with image fragment 𝒊
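The summary slide later calls this an MRF over word-to-fragment assignments. Below is a rough dynamic-programming (Viterbi-style) sketch of decoding such a chain model; the `beta` bonus for keeping adjacent words on the same fragment is an illustrative stand-in for the paper's pairwise potential, not its exact form.

```python
import numpy as np

def align_words_to_fragments(scores, beta=1.0):
    """scores[i, t] = v_i . s_t (word/fragment affinity); beta rewards giving
    adjacent words the same fragment, which groups words into phrases.
    Returns one image-fragment index per word (the best-scoring chain)."""
    M, T = scores.shape                          # M fragments, T words
    best = scores[:, 0].copy()                   # best score of a chain ending in fragment i
    back = np.zeros((M, T), dtype=int)
    for t in range(1, T):
        # trans[i, j]: extend a chain that used fragment j at word t-1 with fragment i at word t
        trans = best[None, :] + beta * np.eye(M)
        back[:, t] = trans.argmax(axis=1)
        best = trans.max(axis=1) + scores[:, t]
    assignment = np.zeros(T, dtype=int)
    assignment[-1] = int(best.argmax())
    for t in range(T - 1, 0, -1):                # trace back the best path
        assignment[t - 1] = back[assignment[t], t]
    return assignment

a = align_words_to_fragments(np.random.randn(20, 12))   # 20 fragments, 12 words
```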
Training a Generative Text Model
• For each image fragment associated with a sentence fragment, train an RNN to synthesize the sentence fragment (sketch below)
• The RNN can be used for fragments or for whole images
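A sketch of the generative text model, with the image (or image-fragment) embedding conditioning the recurrent net and a softmax read-out over the vocabulary. Vocabulary size, dimensions, and the exact conditioning scheme are illustrative assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """Recurrent generator: the image embedding initializes the hidden state,
    and each step produces softmax scores over the vocabulary (including
    START/STOP tokens).  Trained with cross-entropy on next-word prediction."""
    def __init__(self, vocab=10000, word_dim=300, h=512, image_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, word_dim)
        self.init_h = nn.Linear(image_dim, h)        # image conditions the RNN
        self.rnn = nn.RNN(word_dim, h, batch_first=True)
        self.readout = nn.Linear(h, vocab)           # logits -> softmax over words

    def forward(self, image_vec, word_ids):          # (B, image_dim), (B, T)
        h0 = torch.tanh(self.init_h(image_vec)).unsqueeze(0)   # (1, B, h)
        states, _ = self.rnn(self.embed(word_ids), h0)
        return self.readout(states)                  # (B, T, vocab) next-word logits
```

The same module can be conditioned on a single fragment embedding or on the whole-image embedding, matching the bullet above.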
Region Predictions from Generative Model
Summary
(Karpathy & Fei-Fei, 2015)
• Obtain image embeddings from the Region CNN
• Obtain sentence embeddings from a forward-backward (bidirectional) RNN
• Train the embeddings to achieve a good alignment of image patches to words in the corresponding sentence
• Use an MRF to parse the 𝑁 words of a sentence into phrases that correspond to the 𝑀 selected image fragments
• An RNN sequence generator maps image fragments to phrases
Example Sentences
demo
Formal Evaluation
R@K: recall rate of the ground-truth sentence among the top K retrieved candidates
Med r: median rank of the first retrieved ground-truth sentence
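A small sketch of how these two retrieval metrics can be computed from a score matrix; for simplicity it assumes one ground-truth sentence per image (the benchmark sets actually have five).

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i, j] = score of sentence j for image i; sentence i is assumed to
    be the ground-truth caption of image i.  Returns R@K and Med r."""
    order = np.argsort(-sim, axis=1)                             # best-scoring sentence first
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1     # 1-based rank of ground truth
                      for i in range(sim.shape[0])])
    recall_at_k = {k: float(np.mean(ranks <= k)) for k in ks}
    median_rank = float(np.median(ranks))
    return recall_at_k, median_rank

r_at_k, med_r = retrieval_metrics(np.random.randn(100, 100))
```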
Image Search
demo
Examples in Mao et al. (2015)
Failures (Mao et al., 2015)
Common Themes Among Models
• Joint embedding space for images and words
• Use of ImageNet-trained CNNs to produce image embeddings
• Generic components
  • RNNs
  • softmax for word selection on output
  • start and stop words (decoding sketch below)
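To make the last two items concrete, here is a greedy read-out loop using the hypothetical CaptionGenerator sketched earlier: start from the START token, pick the softmax argmax at each step, and halt at STOP.

```python
import torch

def greedy_decode(model, image_vec, start_id, stop_id, max_len=20):
    """Greedy word selection with start/stop tokens (illustrative only)."""
    words = [start_id]
    for _ in range(max_len):
        logits = model(image_vec, torch.tensor([words]))[0, -1]   # next-word scores
        next_id = int(logits.argmax())
        if next_id == stop_id:                                    # STOP token ends the caption
            break
        words.append(next_id)
    return words[1:]                                              # drop the START token
```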
Differences Among Models
• Image processing
  • decomposition of the image into regions (Karpathy) versus processing of the whole image (all)
• What input does the image provide?
  • initial input to the recurrent net (Vinyals)
  • input at every time step, either in the recurrent layer (Karpathy) or after the recurrent layer (Mao)
• How much is built in?
  • semantic representations?
  • localization of objects in images?
• Type of recurrence
  • fully connected ReLU (Karpathy, Mao) vs. LSTM (Vinyals)
• Read out
  • beam search (Vinyals) vs. not (Mao, Karpathy); a beam-search sketch follows
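And a sketch of the beam-search read-out, again built on the hypothetical CaptionGenerator above: keep the `beam` highest log-probability partial captions at each step instead of committing to a single greedy choice.

```python
import torch

def beam_search(model, image_vec, start_id, stop_id, beam=5, max_len=20):
    """Keep the `beam` best partial captions by total log probability."""
    beams = [([start_id], 0.0)]                       # (word ids, summed log prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            logits = model(image_vec, torch.tensor([words]))[0, -1]
            logp = torch.log_softmax(logits, dim=-1)
            top = torch.topk(logp, beam)
            for lp, idx in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((words + [idx], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam]:        # best continuations
            (finished if words[-1] == stop_id else beams).append((words, score))
        if not beams:                                  # every surviving beam has stopped
            break
    words = max(finished + beams, key=lambda c: c[1])[0][1:]      # drop START
    return words[:-1] if words and words[-1] == stop_id else words
```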


Editor's Notes

  • #5 What's a good caption?
  • #8 These data sets seem really small for what has to be learned, but people tend to photograph certain objects (e.g., cats).
  • #10 𝒗 is the image embedding; CNN_θc(I_b) is the output of the CNN for image patch I_b; W_m and b_m are learned weights.
  • #26 RNN-base: no image representation; note that it doesn't do too badly on some measures.
  • #32 Local-to-distributed word embeddings: one layer (Vinyals, Karpathy) vs. two layers (Mao).