1. CSCI 5922
Neural Networks and Deep Learning:
Image Captioning
Mike Mozer
Department of Computer Science and
Institute of Cognitive Science
University of Colorado at Boulder
2. Fads
2014
Seven papers, almost simultaneously, focused on processing images and
textual descriptions
2015-2017
Tons of papers about visual question answering
3. Image-Sentence Tasks
Sentence retrieval
finding best matching sentence to an image
Sentence generation
given image, produce textual description
Image retrieval
given textual description, find best matching image
Image-sentence correspondence
match images to sentences
Question answering
given image and question, produce answer
From Karpathy blog
4. Adding Captions to Images
A cow is standing in the field
This is a close up of a brown cow.
There is a brown cow with long hair
and two horns.
There are two trees and a cloud in the
background of a field with a large cow
in it.
From Devi Parikh
5. Three Papers I’ll Talk About
Deep Visual-Semantic Alignments for Generating Image Descriptions
Karpathy & Fei-Fei (Stanford), CVPR 2015
Show and Tell: A Neural Image Caption Generator
Vinyals, Toshev, Bengio, Erhan (Google), CVPR 2015
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
Mao, Xu, Yang, Wang, Yuille (UCLA, Baidu), ICLR 2015
6. Other Papers
Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic
embeddings with multimodal neural language models. arXiv preprint
arXiv:1411.2539, 2014.
Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus,
Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent
convolutional networks for visual recognition and description. arXiv preprint
arXiv:1411.4389, 2014.
Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár,
Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From
captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.
Chen, Xinlei and Zitnick, C Lawrence. Learning a recurrent visual representation for
image caption generation. arXiv preprint arXiv:1411.5654, 2014.
7. Data Sets
Flickr8k
8000 images, each annotated with 5 sentences via AMT
1000 images each for validation and testing
Flickr30k
30,000 images
1000 for validation, 1000 for testing
MSCOCO
123,000 images
5000 each for validation and testing
8. Approach (Karpathy & Fei-Fei, 2015)
Sentences act as weak labels
contiguous sequences of words correspond to some particular (unknown) location
in the image
Task 1
Find alignment between snippets of text and image locations
Task 2
Build RNN that maps image patches to snippets of text
9. Representing Images
Process regions of the image with a “Region Convolutional Neural Network” (RCNN)
pretrained on ImageNet and further fine-tuned
map activations from the fully connected layer before the classification
layer to an 𝒉-dimensional embedding, 𝒗
𝒉 ranges from 1000 to 1600
Pick the 19 locations with the strongest classification response, plus the whole
image
The image is represented by a set of 20 𝒉-dimensional embedding vectors
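A minimal sketch of this step, assuming 4096-d fc activations and 𝒉 = 1000; the projection 𝒗 = W_m · CNN_θc(I_b) + b_m follows the editor's notes, with random stand-ins for the learned weights:

```python
import numpy as np

# Sketch only: 20 regions (19 strongest RCNN detections + the whole image),
# each a 4096-d fc-layer activation, projected to an h-dim embedding.
h = 1000                                     # embedding size (1000-1600 in the paper)
rng = np.random.default_rng(0)
cnn_fc = rng.standard_normal((20, 4096))     # stand-in for CNN activations
W_m = 0.01 * rng.standard_normal((h, 4096))  # learned projection (random here)
b_m = np.zeros(h)

V = cnn_fc @ W_m.T + b_m                     # v_i = W_m CNN(I_b) + b_m; shape (20, h)
```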
10. Representing Sentences
Bidirectional RNN
input representation is based on word2vec
Produces an embedding at each position
of the sentence (𝒔𝒕)
Embedding also has dimensionality 𝒉
[don’t confuse this with hidden activity 𝒉!]
Key claim
𝒔𝒕 represents the concept of the words near position 𝒕
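A minimal sketch of this encoder, with assumed sizes and a plain bidirectional RNN standing in for the paper's exact BRNN equations (nn.Embedding is a stand-in for word2vec vectors):

```python
import torch
import torch.nn as nn

# Each word position t gets one h-dimensional embedding s_t that mixes
# forward and backward context around t.
vocab_size, word_dim, h = 10_000, 300, 1000     # assumed sizes
embed = nn.Embedding(vocab_size, word_dim)      # stand-in for word2vec
brnn = nn.RNN(word_dim, h, bidirectional=True, batch_first=True)
proj = nn.Linear(2 * h, h)                      # fuse the two directions

words = torch.randint(0, vocab_size, (1, 12))   # one 12-word sentence
states, _ = brnn(embed(words))                  # (1, 12, 2h)
s = proj(states)                                # (1, 12, h): one s_t per word
```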
11. Alignment
Image fragment 𝒊 mapped to 𝒉 dimensional
embedding, 𝒗𝒊
Sentence fragment 𝒕 mapped to 𝒉 dimensional
embedding, 𝒔𝒕
Define similarity between image and sentence
fragments
dot product 𝒗𝒊ᵀ𝒔𝒕
Match between image 𝒌 and sentence 𝒍
where 𝒈𝒌 and 𝒈𝒍 are sets of image and sentence
fragments
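The score itself appears as an image on the slide; as a hedged reconstruction from the paper, the simplified Karpathy & Fei-Fei match lets every word in sentence 𝒍 align with its single best fragment in image 𝒌:

\[
S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^{\top} s_t
\]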
13. Alignment II
Train model to minimize margin-based loss
aligned image-sentence pairs should have a higher score than misaligned
pairs, by a margin
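The objective is also an image in the deck; reconstructed from the paper (hedged), it is a max-margin ranking loss with margin 1 that pushes each true pair's score S_kk above its row and column competitors:

\[
C(\theta) = \sum_k \Big[ \sum_l \max(0,\, S_{kl} - S_{kk} + 1) \;+\; \sum_l \max(0,\, S_{lk} - S_{kk} + 1) \Big]
\]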
14. Alignment III
We have a measure of alignment between words and
image fragments: 𝒗𝒊ᵀ𝒔𝒕
We want an assignment of phrases (multiple words) to
image fragments
Constraint satisfaction approach
Search through a space of assignments of words to
image fragments
encourage adjacent words to have same assignment
larger 𝒗𝒊ᵀ𝒔𝒕 → higher probability that word 𝒕 will be
associated with image fragment 𝒊
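A hedged sketch of one way to decode such a chain model, using a Viterbi-style dynamic program; the smoothing weight beta and the exact scoring are illustrative assumptions, not the paper's precise MRF:

```python
import numpy as np

def align_words(S, beta=1.0):
    """Assign each of N words to one of M image fragments.
    S[t, i] = v_i . s_t; beta rewards adjacent words that keep the
    same fragment, so words group into contiguous phrases."""
    N, M = S.shape
    score = S[0].copy()                    # best total ending at word 0 in fragment m
    back = np.zeros((N, M), dtype=int)
    for t in range(1, N):
        switch_from = int(score.argmax())  # best previous fragment if we switch
        new = np.empty(M)
        for m in range(M):
            stay, switch = score[m] + beta, score[switch_from]
            back[t, m] = m if stay >= switch else switch_from
            new[m] = max(stay, switch) + S[t, m]
        score = new
    a = [int(score.argmax())]              # trace back the best assignment
    for t in range(N - 1, 0, -1):
        a.append(int(back[t, a[-1]]))
    return a[::-1]                         # one fragment index per word

# Toy usage: 12 words, 20 fragments; contiguous runs of equal indices
# in the result are the phrases handed to the generator.
print(align_words(np.random.default_rng(0).standard_normal((12, 20))))
```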
15. Training a Generative Text Model
For each image fragment associated with a sentence fragment, train an
RNN to synthesize the sentence fragment:
RNN can be used for fragments or whole images
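A minimal sketch of such a generator, with assumed sizes; it conditions the hidden state on the image (or fragment) embedding at the first step, in the spirit of the paper rather than its exact equations:

```python
import torch
import torch.nn as nn

class CaptionRNN(nn.Module):
    """Sketch: plain RNN conditioned on an image/fragment embedding v."""
    def __init__(self, vocab=10_000, word_dim=300, h=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, word_dim)
        self.img_in = nn.Linear(h, h)           # injects the image embedding at t = 1
        self.cell = nn.RNNCell(word_dim, h)
        self.out = nn.Linear(h, vocab)          # pre-softmax word scores

    def forward(self, words, v):                # words: (T,) ints, v: (h,)
        state = torch.tanh(self.img_in(v)).unsqueeze(0)
        logits = []
        for w in words:                         # teacher forcing over the caption
            state = self.cell(self.embed(w.view(1)), state)
            logits.append(self.out(state))
        return torch.stack(logits)              # (T, 1, vocab)

# Toy usage with made-up tokens and a random image embedding.
model = CaptionRNN()
scores = model(torch.randint(0, 10_000, (8,)), torch.randn(512))
```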
17. Summary
(Karpathy & Fei-Fei, 2015)
Obtain image embedding from Region CNN
Obtain sentence embedding from forward-
backward RNN
Train embeddings to achieve a good alignment of
image patches to words in the corresponding sentence
Use MRF to parse 𝑁 words of sentence into
phrases that correspond to the 𝑀 selected image
fragments
RNN sequence generator maps image fragments
to phrases
25. Common Themes Among Models
Joint embedding space for images and words
Use of ImageNet to produce image embeddings
Generic components
RNNs
Softmax for word selection on output
Start and stop words
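To make the last two items concrete, a toy read-out loop (vocabulary and scores are invented): sample each word from a softmax over the output layer and stop when the stop token appears:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["<START>", "a", "cow", "field", "in", "standing", "<STOP>"]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

caption, word = [], "<START>"
while word != "<STOP>" and len(caption) < 10:
    logits = rng.standard_normal(len(vocab))   # stand-in for RNN output scores
    word = vocab[rng.choice(len(vocab), p=softmax(logits))]
    if word not in ("<START>", "<STOP>"):
        caption.append(word)
print(" ".join(caption))
```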
26. Differences Among Models
Image processing
decomposition of the image (Karpathy) versus processing of the whole image (all)
What input does the image provide?
initial input to recurrent net (Vinyals)
input at every time step – in the recurrent layer (Karpathy) or after the recurrent layer (Mao)
How much is built in?
semantic representations?
localization of objects in images?
Type of recurrence
fully connected ReLU (Karpathy, Mao), LSTM (Vinyals)
Read-out
beam search (Vinyals) vs. not (Mao, Karpathy); a beam-search sketch follows below
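A hedged sketch of beam-search read-out (the general technique, not Vinyals et al.'s exact decoder); step_logprobs is a hypothetical stand-in for the trained model's next-word log-probabilities:

```python
def beam_search(step_logprobs, start, stop, B=3, max_len=20):
    """Keep the B highest-scoring partial captions, extend each by every
    candidate word, and re-prune; finished captions end with `stop`."""
    beams = [([start], 0.0)]                   # (word sequence, log probability)
    done = []
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            for word, wlp in step_logprobs(seq).items():
                candidates.append((seq + [word], lp + wlp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, lp in candidates[:B]:
            (done if seq[-1] == stop else beams).append((seq, lp))
        if not beams:                          # all survivors finished
            break
    return max(done + beams, key=lambda c: c[1])[0]
```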
Editor's Notes
What’s a good caption?
These datasets seem really small for what the models have to learn… but folks take pics of certain objects (e.g., cats)
𝒗 is the image embedding; CNN_θc(I_b) is the output of the CNN for image patch I_b; W_m and b_m are learned weights
RNN-based baseline: no image representation. Note it doesn't do too badly on some measures
Local-to-distributed word embeddings
one layer (Vinyals, Karpathy) vs. two layers (Mao)
Difference between VQA and image captioning.
In this paper, we propose a novel mechanism that jointly reasons about visual attention and question attention. We build a hierarchical architecture that co-attends to the image and question at three levels: at the word level, we embed the words into a vector space through an embedding matrix; at the phrase level, one-dimensional convolutional neural networks capture the information contained in unigrams, bigrams, and trigrams; at the question level, we use recurrent neural networks to encode the entire question. At each level, we construct co-attention maps, which are then combined recursively to predict the answers.
In this talk, I'll go over the major recent VQA method papers and my thoughts about those methods.