CSCI 5922
Neural Networks and Deep Learning:
Image Captioning
Mike Mozer
Department of Computer Science and
Institute of Cognitive Science
University of Colorado at Boulder
Fads
• 2014
  • Seven papers, almost simultaneously, focused on processing images and textual descriptions
• 2015-2017
  • Tons of papers about visual question answering
Image-Sentence Tasks
• Sentence retrieval
  • finding best matching sentence to an image
• Sentence generation
  • given image, produce textual description
• Image retrieval
  • given textual description, find best matching image
• Image-sentence correspondence
  • match images to sentences
• Question answering
  • given image and question, produce answer
From Karpathy blog
Adding Captions to Images
• A cow is standing in the field.
• This is a close up of a brown cow.
• There is a brown cow with long hair and two horns.
• There are two trees and a cloud in the background of a field with a large cow in it.
From Devi Parikh
Three Papers I'll Talk About
• Deep Visual-Semantic Alignments for Generating Image Descriptions
  • Karpathy & Fei-Fei (Stanford), CVPR 2015
• Show and Tell: A Neural Image Caption Generator
  • Vinyals, Toshev, Bengio, Erhan (Google), CVPR 2015
• Deep Captioning with Multimodal Recurrent Nets
  • Mao, Xu, Yang, Wang, Yuille (UCLA, Baidu), ICLR 2015
Other Papers
• Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
• Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
• Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.
• Chen, Xinlei and Zitnick, C. Lawrence. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
Data Sets
• Flickr8k
  • 8,000 images, each annotated with 5 sentences via Amazon Mechanical Turk (AMT)
  • 1,000 images each for validation and testing
• Flickr30k
  • 30,000 images
  • 1,000 images each for validation and testing
• MSCOCO
  • 123,000 images
  • 5,000 images each for validation and testing
Approach (Karpathy & Fei-Fei, 2015)
• Sentences act as weak labels
  • contiguous sequences of words correspond to some particular (unknown) location in the image
• Task 1
  • find an alignment between snippets of text and image locations
• Task 2
  • build an RNN that maps image patches to snippets of text
Representing Images
• Process regions of the image with a "Region Convolutional Neural Net" (R-CNN)
  • pretrained on ImageNet and further fine-tuned
  • map activations from the fully connected layer before the classification layer to an 𝒉-dimensional embedding, 𝒗
  • 𝒉 ranges from 1000 to 1600
• Pick the 19 locations with the strongest classification response, plus the whole image
• The image is represented by a set of 20 𝒉-dimensional embedding vectors (sketch below)
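A minimal sketch (assuming PyTorch; not the authors' code) of the image side: each of the 20 region activations is linearly projected into the shared 𝒉-dimensional space. The 4096-dimensional CNN feature size and all names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RegionEmbedder(nn.Module):
    """Map CNN features of 20 regions (19 strongest detections + the whole
    image) into the h-dimensional multimodal embedding space:
    v = W_m * CNN(I_b) + b_m.  The 4096-d input size is an assumption
    (a typical fc-layer size), not a number taken from the slides."""
    def __init__(self, cnn_dim=4096, h=1000):
        super().__init__()
        self.project = nn.Linear(cnn_dim, h)    # learned W_m, b_m

    def forward(self, cnn_features):            # (20, cnn_dim) region activations
        return self.project(cnn_features)       # (20, h): one v_i per region

# usage: stand-in activations for the 19 detections plus the whole image
features = torch.randn(20, 4096)
v = RegionEmbedder()(features)                  # 20 embedding vectors v_i
```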
Representing Sentences
• Bidirectional RNN (BRNN; sketch below)
  • input representation is based on word2vec
• Produces an embedding 𝒔𝒕 at each position 𝒕 of the sentence
  • the embedding also has dimensionality 𝒉 [don't confuse this with the hidden activity 𝒉!]
• Key claim
  • 𝒔𝒕 represents the concept of the words near position 𝒕
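A sketch of the sentence side under the same assumptions: a bidirectional RNN reads word2vec-style vectors and emits one 𝒉-dimensional embedding 𝒔𝒕 per word position. The 300-d word vectors and the plain (tanh) RNN cell are assumptions; the paper's exact parameterization differs in detail.

```python
import torch
import torch.nn as nn

class SentenceEmbedder(nn.Module):
    """Bidirectional RNN over word vectors; one embedding s_t per position."""
    def __init__(self, word_dim=300, h=1000):
        super().__init__()
        self.brnn = nn.RNN(word_dim, h, bidirectional=True, batch_first=True)
        self.combine = nn.Linear(2 * h, h)      # merge forward/backward states

    def forward(self, word_vectors):            # (batch, T, word_dim) word2vec inputs
        states, _ = self.brnn(word_vectors)     # (batch, T, 2h)
        return self.combine(states)             # (batch, T, h): s_t for each word t

s = SentenceEmbedder()(torch.randn(1, 12, 300))  # s_t for a 12-word sentence
```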
Alignment
• Image fragment 𝒊 is mapped to an 𝒉-dimensional embedding, 𝒗𝒊
• Sentence fragment 𝒕 is mapped to an 𝒉-dimensional embedding, 𝒔𝒕
• Define similarity between image and sentence fragments
  • dot product 𝒗𝒊ᵀ𝒔𝒕
• Match score between image 𝒌 and sentence 𝒍 is computed from these fragment similarities (formula below), where 𝒈𝒌 and 𝒈𝒍 are the sets of image and sentence fragments
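The score formula on the original slide did not survive the export; as a reference point, the simplified image-sentence score in Karpathy & Fei-Fei (2015) takes roughly this form, scoring each word against its best-matching image fragment:

```latex
S_{kl} \;=\; \sum_{t \in g_l} \; \max_{i \in g_k} \; \mathbf{v}_i^{\mathsf{T}} \mathbf{s}_t
```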
Example Alignments
Alignment II
• Train the model to minimize a margin-based ranking loss (see below)
  • aligned image-sentence pairs should score higher than misaligned pairs, by a margin
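For reference, the ranking objective in the paper has approximately this form (margin of 1): the score S_kk of a true image-sentence pair should beat the scores of that image with other sentences and of that sentence with other images:

```latex
\mathcal{C}(\theta) \;=\; \sum_{k} \Big[ \sum_{l} \max\big(0,\, S_{kl} - S_{kk} + 1\big)
\;+\; \sum_{l} \max\big(0,\, S_{lk} - S_{kk} + 1\big) \Big]
```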
Alignment III
• We have a measure of alignment between words and image fragments: 𝒗𝒊ᵀ𝒔𝒕
• We want an assignment of phrases (multiple words) to image fragments
• Constraint satisfaction approach (see the sketch below)
  • search through a space of assignments of words to image fragments
  • encourage adjacent words to have the same assignment
  • larger 𝒗𝒊ᵀ𝒔𝒕 → higher probability that word 𝒕 will be associated with image fragment 𝒊
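The summary slide later calls this an MRF over word-to-fragment assignments. Below is a rough dynamic-programming (Viterbi-style) sketch of decoding such a chain model; the `beta` bonus for keeping adjacent words on the same fragment is an illustrative stand-in for the paper's pairwise potential, not its exact form.

```python
import numpy as np

def align_words_to_fragments(scores, beta=1.0):
    """scores[i, t] = v_i . s_t (word/fragment affinity); beta rewards giving
    adjacent words the same fragment, which groups words into phrases.
    Returns one image-fragment index per word (the best-scoring chain)."""
    M, T = scores.shape                          # M fragments, T words
    best = scores[:, 0].copy()                   # best score of a chain ending in fragment i
    back = np.zeros((M, T), dtype=int)
    for t in range(1, T):
        # trans[i, j]: extend a chain that used fragment j at word t-1 with fragment i at word t
        trans = best[None, :] + beta * np.eye(M)
        back[:, t] = trans.argmax(axis=1)
        best = trans.max(axis=1) + scores[:, t]
    assignment = np.zeros(T, dtype=int)
    assignment[-1] = int(best.argmax())
    for t in range(T - 1, 0, -1):                # trace back the best path
        assignment[t - 1] = back[assignment[t], t]
    return assignment

a = align_words_to_fragments(np.random.randn(20, 12))   # 20 fragments, 12 words
```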
Training a Generative Text Model
• For each image fragment associated with a sentence fragment, train an RNN to synthesize the sentence fragment (sketch below)
• The RNN can be used for fragments or for whole images
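A sketch of the generative text model, with the image (or image-fragment) embedding conditioning the recurrent net and a softmax read-out over the vocabulary. Vocabulary size, dimensions, and the exact conditioning scheme are illustrative assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """Recurrent generator: the image embedding initializes the hidden state,
    and each step produces softmax scores over the vocabulary (including
    START/STOP tokens).  Trained with cross-entropy on next-word prediction."""
    def __init__(self, vocab=10000, word_dim=300, h=512, image_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, word_dim)
        self.init_h = nn.Linear(image_dim, h)        # image conditions the RNN
        self.rnn = nn.RNN(word_dim, h, batch_first=True)
        self.readout = nn.Linear(h, vocab)           # logits -> softmax over words

    def forward(self, image_vec, word_ids):          # (B, image_dim), (B, T)
        h0 = torch.tanh(self.init_h(image_vec)).unsqueeze(0)   # (1, B, h)
        states, _ = self.rnn(self.embed(word_ids), h0)
        return self.readout(states)                  # (B, T, vocab) next-word logits
```

The same module can be conditioned on a single fragment embedding or on the whole-image embedding, matching the bullet above.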
Region Predictions from Generative Model
Summary
(Karpathy & Fei-Fei, 2015)
• Obtain image embeddings from the Region CNN
• Obtain sentence embeddings from a forward-backward (bidirectional) RNN
• Train the embeddings to achieve a good alignment of image patches to words in the corresponding sentence
• Use an MRF to parse the 𝑁 words of a sentence into phrases that correspond to the 𝑀 selected image fragments
• An RNN sequence generator maps image fragments to phrases
Example Sentences
demo
Formal Evaluation
R@K: recall rate of the ground-truth sentence among the top K retrieved candidates
Med r: median rank of the first retrieved ground-truth sentence
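A small sketch of how these two retrieval metrics can be computed from a score matrix; for simplicity it assumes one ground-truth sentence per image (the benchmark sets actually have five).

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i, j] = score of sentence j for image i; sentence i is assumed to
    be the ground-truth caption of image i.  Returns R@K and Med r."""
    order = np.argsort(-sim, axis=1)                             # best-scoring sentence first
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1     # 1-based rank of ground truth
                      for i in range(sim.shape[0])])
    recall_at_k = {k: float(np.mean(ranks <= k)) for k in ks}
    median_rank = float(np.median(ranks))
    return recall_at_k, median_rank

r_at_k, med_r = retrieval_metrics(np.random.randn(100, 100))
```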
Image Search
demo
Examples in Mao et al. (2015)
Failures (Mao et al., 2015)
Common Themes Among Models
• Joint embedding space for images and words
• Use of ImageNet-trained CNNs to produce image embeddings
• Generic components
  • RNNs
  • softmax for word selection on output
  • start and stop words (decoding sketch below)
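To make the last two items concrete, here is a greedy read-out loop using the hypothetical CaptionGenerator sketched earlier: start from the START token, pick the softmax argmax at each step, and halt at STOP.

```python
import torch

def greedy_decode(model, image_vec, start_id, stop_id, max_len=20):
    """Greedy word selection with start/stop tokens (illustrative only)."""
    words = [start_id]
    for _ in range(max_len):
        logits = model(image_vec, torch.tensor([words]))[0, -1]   # next-word scores
        next_id = int(logits.argmax())
        if next_id == stop_id:                                    # STOP token ends the caption
            break
        words.append(next_id)
    return words[1:]                                              # drop the START token
```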
Differences Among Models
• Image processing
  • decomposition of the image into regions (Karpathy) versus processing of the whole image (all)
• What input does the image provide?
  • initial input to the recurrent net (Vinyals)
  • input at every time step, either in the recurrent layer (Karpathy) or after the recurrent layer (Mao)
• How much is built in?
  • semantic representations?
  • localization of objects in images?
• Type of recurrence
  • fully connected ReLU (Karpathy, Mao) vs. LSTM (Vinyals)
• Read out
  • beam search (Vinyals) vs. not (Mao, Karpathy); a beam-search sketch follows
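And a sketch of the beam-search read-out, again built on the hypothetical CaptionGenerator above: keep the `beam` highest log-probability partial captions at each step instead of committing to a single greedy choice.

```python
import torch

def beam_search(model, image_vec, start_id, stop_id, beam=5, max_len=20):
    """Keep the `beam` best partial captions by total log probability."""
    beams = [([start_id], 0.0)]                       # (word ids, summed log prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            logits = model(image_vec, torch.tensor([words]))[0, -1]
            logp = torch.log_softmax(logits, dim=-1)
            top = torch.topk(logp, beam)
            for lp, idx in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((words + [idx], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam]:        # best continuations
            (finished if words[-1] == stop_id else beams).append((words, score))
        if not beams:                                  # every surviving beam has stopped
            break
    words = max(finished + beams, key=lambda c: c[1])[0][1:]      # drop START
    return words[:-1] if words and words[-1] == stop_id else words
```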


Editor's Notes

  • #5 What's a good caption?
  • #8 These data sets seem really small for what has to be learned, but people tend to photograph certain objects (e.g., cats).
  • #10 𝒗 is the image embedding; CNN_θc(I_b) is the output of the CNN for image patch I_b; W_m and b_m are learned weights.
  • #26 RNN-base: no image representation; note that it doesn't do too badly on some measures.
  • #32 Local-to-distributed word embeddings: one layer (Vinyals, Karpathy) vs. two layers (Mao).