SlideShare a Scribd company logo
1 of 26
CSCI 5922
Neural Networks and Deep Learning:
Image Captioning
Mike Mozer
Department of Computer Science and
Institute of Cognitive Science
University of Colorado at Boulder
Fads
 2014
 Seven papers, almost simultaneously, focused on processing images and
textual descriptions
 2015-2017
 Tons of papers about visual question answering
Image-Sentence Tasks
 Sentence retrieval
 finding best matching sentence to an image
 Sentence generation
 given image, produce textual description
 Image retrieval
 given textual description, find best matching image
 Image-sentence correspondence
 match images to sentences
 Question answering
 given image and question, produce answer
From Karpathy blog
Adding Captions to Images
 A cow is standing in the field
 This is a close up of a brown cow.
 There is a brown cow with long hair
and two horns.
 There are two trees and a cloud in the
background of a field with a large cow
in it.
From Devi Parikh
Three Papers I’ll Talk About
 Deep Visual-Semantic Alignments for Generating Image Descriptions
 Karpathy & Fei-Fei (Stanford), CVPR 2015
 Show and Tell: A Neural Image Caption Generator
 Vinyals, Toshev, Bengio, Erhan (Google), CVPR 2015
 Deep Captioning with Multimodal Recurrent Nets
 Mao, Xu, Yang, Wang, Yuille (UCLA, Baidu), ICLR 2015
Other Papers
 Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic
embeddings with multimodal neural language models. arXiv preprint
arXiv:1411.2539, 2014a.
 Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus,
Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent
convolutional networks for visual recognition and description. arXiv preprint
arXiv:1411.4389, 2014.
 Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár,
Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From
captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.
 Chen, Xinlei and Zitnick, C Lawrence. Learning a recurrent visual representation for
image caption generation. arXiv preprint arXiv:1411.5654, 2014.
Data Sets
 Flickr8k
 8000 images, each annotated with 5 sentences via AMT
 1000 for validation, testing
 Flickr 30k
 30k images
 1000 validation, 1000 testing
 MSCOCO
 123,000 images
 5000 for validation, testing
Approach (Karpathy & Fei Fei, 2015)
 Sentences act as weak labels
 contiguous sequences of words correspond to some particular (unknown) location
in image
 Task 1
 Find alignment between snippets of text and image locations
 Task 2
 Build RNN that maps image patches to snippets of text
Representing Images
 Process regions of image with a “Region Convolutional Neural Net”
 pretrained on ImageNet and further fine tuned
 map activations from the fully connected layer before the classification
layer to an 𝒉 dimensional embedding, 𝒗
 𝒉 ranges from 1000 to 1600
 Pick 19 locations that have strongest classification response plus whole
image
 Image is represented by a set of 20 𝒉 dimensional embedding vectors
Representing Sentences
 Bidirectional RNN
 input representation is based on word2vec
 Produces an embedding at each position
of the sentence (𝒔𝒕)
 Embedding also has dimensionality 𝒉
[don’t confuse this with hidden activity 𝒉!]
 Key claim
 𝒔𝒕 represents the concept of the words near position 𝒕
Alignment
 Image fragment 𝒊 mapped to 𝒉 dimensional
embedding, 𝒗𝒊
 Sentence fragment 𝒕 mapped to 𝒉 dimensional
embedding, 𝒔𝒕
 Define similarity between image and sentence
fragments
 dot product 𝒗𝒊
𝐓
𝒔𝒕
 Match between image 𝒌 and sentence 𝒍

 where 𝒈𝒌 and 𝒈𝒍 are sets of image and sentence
fragments
Example Alignments
Alignment II
 Train model to minimize margin-based loss
 aligned image-sentence pairs should have a higher score than misaligned
pairs, by a margin
Alignment III
 We have a measure of alignment between words and
image fragments: 𝒗𝒊
𝐓
𝒔𝒕
 We want an assignment of phrases (multiple words) to
image fragments
 Constraint satisfaction approach
 Search through a space of assignments of words to
image fragments
 encourage adjacent words to have same assignment
 larger 𝒗𝒊
𝐓
𝒔𝒕 -> higher probability that word 𝒕 will be
associated with image fragment 𝒊
Training a Generative Text Model
 For each image fragment associated with a sentence fragment, train an
RNN to synthesize sentence fragment:
 RNN can be used for fragments or whole images
Region Predictions from Generative Model
Summary
(Karpathy & Fei-Fei, 2015)
 Obtain image embedding from Region CNN
 Obtain sentence embedding from forward-
backward RNN
 Train embeddings to achieve a good alignment of
image pates to words in corresponding sentence
 Use MRF to parse 𝑁 words of sentence into
phrases that correspond to the 𝑀 selected image
fragments
 RNN sequence generator maps image fragments
to phrases
Example Sentences
demo
Formal Evaluation
R@K: recall rate of ground truth sentence given top K candidates
Med r: median rank of the first retrieved ground truth sentence
Image Search
demo
Examples in Mao et al. (2015)
Failures (Mao et al., 2015)
Common Themes Among Models
 Joint embedding space for images and words
 Use of ImageNet to produce image embeddings
 Generic components
RNNs
Softmax for word selection on output
Start and stop words
Differences Among Models
 Image processing
 decomposition of image (Karpathy) versus processing of whole image (all)
 What input does image provide
 initial input to recurrent net (Vinyals)
 input every time step – in recurrent layer (Karpathy) or after recurrent layer (Mao)
 How much is built in?
 semantic representations?
 localization of objects in images?
 Type of recurrence
 fully connected ReLU (Karpathy, Mao), LSTM (Vinyals)
 Read out
 beam search (Vinyals) vs. not (Mao, Karpathy)

More Related Content

Similar to Image captions.pptx

Deep Learning for NLP Applications
Deep Learning for NLP ApplicationsDeep Learning for NLP Applications
Deep Learning for NLP ApplicationsSamiur Rahman
 
Low-Rank Neighbor Embedding for Single Image Super-Resolution
Low-Rank Neighbor Embedding for Single Image Super-ResolutionLow-Rank Neighbor Embedding for Single Image Super-Resolution
Low-Rank Neighbor Embedding for Single Image Super-Resolutionjohn236zaq
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalBhaskar Mitra
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Text Mining for Lexicography
Text Mining for LexicographyText Mining for Lexicography
Text Mining for LexicographyLeiden University
 
TOP 5 Most View Article From Academia in 2019
TOP 5 Most View Article From Academia in 2019TOP 5 Most View Article From Academia in 2019
TOP 5 Most View Article From Academia in 2019sipij
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation ijcax
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image AnnotationA Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotationijcax
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation ijcax
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation ijcax
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image AnnotationA Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotationijcax
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation ijcax
 
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...ijaia
 
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosCodiax
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer modelsDing Li
 
RCOMM 2011 - Sentiment Classification
RCOMM 2011 - Sentiment ClassificationRCOMM 2011 - Sentiment Classification
RCOMM 2011 - Sentiment Classificationbohanairl
 

Similar to Image captions.pptx (20)

evonlp_slides.pdf
evonlp_slides.pdfevonlp_slides.pdf
evonlp_slides.pdf
 
Analyse de sentiment et classification par approche neuronale en Python et Weka
Analyse de sentiment et classification par approche neuronale en Python et WekaAnalyse de sentiment et classification par approche neuronale en Python et Weka
Analyse de sentiment et classification par approche neuronale en Python et Weka
 
Deep Learning for NLP Applications
Deep Learning for NLP ApplicationsDeep Learning for NLP Applications
Deep Learning for NLP Applications
 
The Duet model
The Duet modelThe Duet model
The Duet model
 
Sentence generation
Sentence generationSentence generation
Sentence generation
 
Low-Rank Neighbor Embedding for Single Image Super-Resolution
Low-Rank Neighbor Embedding for Single Image Super-ResolutionLow-Rank Neighbor Embedding for Single Image Super-Resolution
Low-Rank Neighbor Embedding for Single Image Super-Resolution
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Text Mining for Lexicography
Text Mining for LexicographyText Mining for Lexicography
Text Mining for Lexicography
 
TOP 5 Most View Article From Academia in 2019
TOP 5 Most View Article From Academia in 2019TOP 5 Most View Article From Academia in 2019
TOP 5 Most View Article From Academia in 2019
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image AnnotationA Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image AnnotationA Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
 
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
A Multi Criteria Decision Making Based Approach for Semantic Image Annotation
 
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
 
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
RCOMM 2011 - Sentiment Classification
RCOMM 2011 - Sentiment ClassificationRCOMM 2011 - Sentiment Classification
RCOMM 2011 - Sentiment Classification
 

More from RohanBorgalli

Genetic Algorithms.ppt
Genetic Algorithms.pptGenetic Algorithms.ppt
Genetic Algorithms.pptRohanBorgalli
 
SHARP4_cNLP_Jun11.ppt
SHARP4_cNLP_Jun11.pptSHARP4_cNLP_Jun11.ppt
SHARP4_cNLP_Jun11.pptRohanBorgalli
 
Using Artificial Intelligence in the field of Diagnostics_Case Studies.ppt
Using Artificial Intelligence in the field of Diagnostics_Case Studies.pptUsing Artificial Intelligence in the field of Diagnostics_Case Studies.ppt
Using Artificial Intelligence in the field of Diagnostics_Case Studies.pptRohanBorgalli
 
02_Architectures_In_Context.ppt
02_Architectures_In_Context.ppt02_Architectures_In_Context.ppt
02_Architectures_In_Context.pptRohanBorgalli
 
Automobile-pathway.pptx
Automobile-pathway.pptxAutomobile-pathway.pptx
Automobile-pathway.pptxRohanBorgalli
 
Autoregressive Model.pptx
Autoregressive Model.pptxAutoregressive Model.pptx
Autoregressive Model.pptxRohanBorgalli
 
Dimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxDimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxRohanBorgalli
 
Time Series Analysis_slides.pdf
Time Series Analysis_slides.pdfTime Series Analysis_slides.pdf
Time Series Analysis_slides.pdfRohanBorgalli
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdfRohanBorgalli
 
prace_days_ml_2019.pptx
prace_days_ml_2019.pptxprace_days_ml_2019.pptx
prace_days_ml_2019.pptxRohanBorgalli
 

More from RohanBorgalli (14)

Genetic Algorithms.ppt
Genetic Algorithms.pptGenetic Algorithms.ppt
Genetic Algorithms.ppt
 
SHARP4_cNLP_Jun11.ppt
SHARP4_cNLP_Jun11.pptSHARP4_cNLP_Jun11.ppt
SHARP4_cNLP_Jun11.ppt
 
Using Artificial Intelligence in the field of Diagnostics_Case Studies.ppt
Using Artificial Intelligence in the field of Diagnostics_Case Studies.pptUsing Artificial Intelligence in the field of Diagnostics_Case Studies.ppt
Using Artificial Intelligence in the field of Diagnostics_Case Studies.ppt
 
02_Architectures_In_Context.ppt
02_Architectures_In_Context.ppt02_Architectures_In_Context.ppt
02_Architectures_In_Context.ppt
 
Automobile-pathway.pptx
Automobile-pathway.pptxAutomobile-pathway.pptx
Automobile-pathway.pptx
 
Autoregressive Model.pptx
Autoregressive Model.pptxAutoregressive Model.pptx
Autoregressive Model.pptx
 
IntrotoArduino.ppt
IntrotoArduino.pptIntrotoArduino.ppt
IntrotoArduino.ppt
 
Dimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxDimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptx
 
Telecom1.ppt
Telecom1.pptTelecom1.ppt
Telecom1.ppt
 
FactorAnalysis.ppt
FactorAnalysis.pptFactorAnalysis.ppt
FactorAnalysis.ppt
 
Time Series Analysis_slides.pdf
Time Series Analysis_slides.pdfTime Series Analysis_slides.pdf
Time Series Analysis_slides.pdf
 
NNAF_DRK.pdf
NNAF_DRK.pdfNNAF_DRK.pdf
NNAF_DRK.pdf
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdf
 
prace_days_ml_2019.pptx
prace_days_ml_2019.pptxprace_days_ml_2019.pptx
prace_days_ml_2019.pptx
 

Recently uploaded

Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfrs7054576148
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 

Recently uploaded (20)

Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdf
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 

Image captions.pptx

  • 1. CSCI 5922 Neural Networks and Deep Learning: Image Captioning Mike Mozer Department of Computer Science and Institute of Cognitive Science University of Colorado at Boulder
  • 2. Fads  2014  Seven papers, almost simultaneously, focused on processing images and textual descriptions  2015-2017  Tons of papers about visual question answering
  • 3. Image-Sentence Tasks  Sentence retrieval  finding best matching sentence to an image  Sentence generation  given image, produce textual description  Image retrieval  given textual description, find best matching image  Image-sentence correspondence  match images to sentences  Question answering  given image and question, produce answer From Karpathy blog
  • 4. Adding Captions to Images  A cow is standing in the field  This is a close up of a brown cow.  There is a brown cow with long hair and two horns.  There are two trees and a cloud in the background of a field with a large cow in it. From Devi Parikh
  • 5. Three Papers I’ll Talk About  Deep Visual-Semantic Alignments for Generating Image Descriptions  Karpathy & Fei-Fei (Stanford), CVPR 2015  Show and Tell: A Neural Image Caption Generator  Vinyals, Toshev, Bengio, Erhan (Google), CVPR 2015  Deep Captioning with Multimodal Recurrent Nets  Mao, Xu, Yang, Wang, Yuille (UCLA, Baidu), ICLR 2015
  • 6. Other Papers  Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014a.  Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.  Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.  Chen, Xinlei and Zitnick, C Lawrence. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
  • 7. Data Sets  Flickr8k  8000 images, each annotated with 5 sentences via AMT  1000 for validation, testing  Flickr 30k  30k images  1000 validation, 1000 testing  MSCOCO  123,000 images  5000 for validation, testing
  • 8. Approach (Karpathy & Fei Fei, 2015)  Sentences act as weak labels  contiguous sequences of words correspond to some particular (unknown) location in image  Task 1  Find alignment between snippets of text and image locations  Task 2  Build RNN that maps image patches to snippets of text
  • 9. Representing Images  Process regions of image with a “Region Convolutional Neural Net”  pretrained on ImageNet and further fine tuned  map activations from the fully connected layer before the classification layer to an 𝒉 dimensional embedding, 𝒗  𝒉 ranges from 1000 to 1600  Pick 19 locations that have strongest classification response plus whole image  Image is represented by a set of 20 𝒉 dimensional embedding vectors
  • 10. Representing Sentences  Bidirectional RNN  input representation is based on word2vec  Produces an embedding at each position of the sentence (𝒔𝒕)  Embedding also has dimensionality 𝒉 [don’t confuse this with hidden activity 𝒉!]  Key claim  𝒔𝒕 represents the concept of the words near position 𝒕
  • 11. Alignment  Image fragment 𝒊 mapped to 𝒉 dimensional embedding, 𝒗𝒊  Sentence fragment 𝒕 mapped to 𝒉 dimensional embedding, 𝒔𝒕  Define similarity between image and sentence fragments  dot product 𝒗𝒊 𝐓 𝒔𝒕  Match between image 𝒌 and sentence 𝒍   where 𝒈𝒌 and 𝒈𝒍 are sets of image and sentence fragments
  • 13. Alignment II  Train model to minimize margin-based loss  aligned image-sentence pairs should have a higher score than misaligned pairs, by a margin
  • 14. Alignment III  We have a measure of alignment between words and image fragments: 𝒗𝒊 𝐓 𝒔𝒕  We want an assignment of phrases (multiple words) to image fragments  Constraint satisfaction approach  Search through a space of assignments of words to image fragments  encourage adjacent words to have same assignment  larger 𝒗𝒊 𝐓 𝒔𝒕 -> higher probability that word 𝒕 will be associated with image fragment 𝒊
  • 15. Training a Generative Text Model  For each image fragment associated with a sentence fragment, train an RNN to synthesize sentence fragment:  RNN can be used for fragments or whole images
  • 16. Region Predictions from Generative Model
  • 17. Summary (Karpathy & Fei-Fei, 2015)  Obtain image embedding from Region CNN  Obtain sentence embedding from forward- backward RNN  Train embeddings to achieve a good alignment of image pates to words in corresponding sentence  Use MRF to parse 𝑁 words of sentence into phrases that correspond to the 𝑀 selected image fragments  RNN sequence generator maps image fragments to phrases
  • 19. Formal Evaluation R@K: recall rate of ground truth sentence given top K candidates Med r: median rank of the first retrieved ground truth sentence
  • 21. Examples in Mao et al. (2015)
  • 22.
  • 23.
  • 24. Failures (Mao et al., 2015)
  • 25. Common Themes Among Models  Joint embedding space for images and words  Use of ImageNet to produce image embeddings  Generic components RNNs Softmax for word selection on output Start and stop words
  • 26. Differences Among Models  Image processing  decomposition of image (Karpathy) versus processing of whole image (all)  What input does image provide  initial input to recurrent net (Vinyals)  input every time step – in recurrent layer (Karpathy) or after recurrent layer (Mao)  How much is built in?  semantic representations?  localization of objects in images?  Type of recurrence  fully connected ReLU (Karpathy, Mao), LSTM (Vinyals)  Read out  beam search (Vinyals) vs. not (Mao, Karpathy)

Editor's Notes

  1. What’s a good caption?
  2. These seem really small for what they have to learn…but… folks take pics of certain objects (e.g. cats)
  3. v is image embedding CNN_\theta_c(Ib) is output of CNN for image patch Ib Wm and bm are learned weights
  4. RNN-base : no image representation. note it doesn’t do too bad on some measures
  5. Local-to-distributed word embeddings one layer (Vinyals, Karpathy) vs. two layers (Mao)
  6. VQA and image captioning Difference with Image captioning.
  7. VQA and image captioning Difference with Image captioning.
  8. VQA and image captioning Difference with Image captioning.
  9. VQA and image captioning Difference with Image captioning.
  10. In this paper, we propose a novel mechanism that jointly reasons about visual attention and question attention. We build a hierarchical architecture that co-attends to the image and question at three levels: At word level, we embed the words to a vector space through a embedding matrix. At phrase, level, 1-dimensional convolution neural networks are used to capture the information contained in unigrams, bigrams and trigrams. At question level, we use recurrent neural networks to encode the entire question. At each level, we construct the co-attention maps, which are them combined recursively to predict the answers.
  11. In this talk, I’ll go over the major recent VQA method papers, and my thought about those methods.
  12. In this talk, I’ll go over the major recent VQA method papers, and my thought about those methods.
  13. In this talk, I’ll go over the major recent VQA method papers, and my thought about those methods.
  14. In this talk, I’ll go over the major recent VQA method papers, and my thought about those methods.
  15. In this talk, I’ll go over the major recent VQA method papers, and my thought about those methods.