CSCI 5922
Neural Networks and Deep Learning:
Image Captioning
Mike Mozer
Department of Computer Science and
Institute of Cognitive Science
University of Colorado at Boulder
Fads
 2014
 Seven papers, almost simultaneously, focused on processing images and
textual descriptions
 2015-2017
 Tons of papers about visual question answering
Image-Sentence Tasks
 Sentence retrieval
 finding best matching sentence to an image
 Sentence generation
 given image, produce textual description
 Image retrieval
 given textual description, find best matching image
 Image-sentence correspondence
 match images to sentences
 Question answering
 given image and question, produce answer
From Karpathy blog
Adding Captions to Images
 A cow is standing in the field
 This is a close up of a brown cow.
 There is a brown cow with long hair
and two horns.
 There are two trees and a cloud in the
background of a field with a large cow
in it.
From Devi Parikh
Three Papers I’ll Talk About
 Deep Visual-Semantic Alignments for Generating Image Descriptions
 Karpathy & Fei-Fei (Stanford), CVPR 2015
 Show and Tell: A Neural Image Caption Generator
 Vinyals, Toshev, Bengio, Erhan (Google), CVPR 2015
 Deep Captioning with Multimodal Recurrent Nets
 Mao, Xu, Yang, Wang, Yuille (UCLA, Baidu), ICLR 2015
Other Papers
 Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic
embeddings with multimodal neural language models. arXiv preprint
arXiv:1411.2539, 2014.
 Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus,
Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent
convolutional networks for visual recognition and description. arXiv preprint
arXiv:1411.4389, 2014.
 Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár,
Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From
captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.
 Chen, Xinlei and Zitnick, C Lawrence. Learning a recurrent visual representation for
image caption generation. arXiv preprint arXiv:1411.5654, 2014.
Data Sets
 Flickr8k
 8000 images, each annotated with 5 sentences via AMT
 1000 each for validation and testing
 Flickr30k
 30k images
 1000 validation, 1000 testing
 MSCOCO
 123,000 images
 5000 each for validation and testing
Approach (Karpathy & Fei-Fei, 2015)
 Sentences act as weak labels
 contiguous sequences of words correspond to some particular (unknown) location
in image
 Task 1
 Find alignment between snippets of text and image locations
 Task 2
 Build RNN that maps image patches to snippets of text
Representing Images
 Process regions of the image with a “Region Convolutional Neural Net”
 pretrained on ImageNet and further fine-tuned
 map activations from the fully connected layer before the classification layer to an 𝒉-dimensional embedding, 𝒗
 𝒉 ranges from 1000 to 1600
 Pick the 19 locations with the strongest classification response, plus the whole image
 Image is represented by a set of 20 𝒉-dimensional embedding vectors
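The mapping from CNN activations to region embeddings can be sketched as an affine projection, v = W_m x + b_m, where x stands for the CNN output for one image patch and W_m, b_m are learned weights. The function name, the toy dimensions, and the random features below are assumptions for illustration only; a real system would use the activations of a pretrained network and 𝒉 in the 1000 to 1600 range.

```python
import random

def embed_regions(cnn_features, W_m, b_m):
    """Project each region's CNN feature vector x to an embedding v = W_m x + b_m."""
    embeddings = []
    for x in cnn_features:
        v = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_m, b_m)]
        embeddings.append(v)
    return embeddings

random.seed(0)
feat_dim, h, n_regions = 8, 4, 20  # toy sizes (hypothetical); 19 regions + whole image
cnn_features = [[random.gauss(0, 1) for _ in range(feat_dim)] for _ in range(n_regions)]
W_m = [[random.gauss(0, 0.1) for _ in range(feat_dim)] for _ in range(h)]
b_m = [0.0] * h

V = embed_regions(cnn_features, W_m, b_m)
print(len(V), len(V[0]))  # 20 4
```

Each image ends up as a set of 20 vectors, one per selected region plus one for the whole image.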
Representing Sentences
 Bidirectional RNN
 input representation is based on word2vec
 Produces an embedding at each position
of the sentence (𝒔𝒕)
 Embedding also has dimensionality 𝒉
[don’t confuse this with hidden activity 𝒉!]
 Key claim
 𝒔𝒕 represents the concept of the words near position 𝒕
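A minimal sketch of a bidirectional RNN that produces one embedding per word position. It assumes ReLU units, omits biases, and uses random vectors in place of word2vec inputs; the function and weight names are hypothetical and this is not the authors' implementation.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(*vs):
    return [sum(xs) for xs in zip(*vs)]

def brnn_embed(word_vecs, W_e, W_f, W_b, W_d):
    """Return one embedding s_t per word, combining left and right context."""
    T, h = len(word_vecs), len(W_f)
    e = [relu(matvec(W_e, x)) for x in word_vecs]   # per-word input projection
    fwd, bwd = [None] * T, [None] * T
    prev = [0.0] * h
    for t in range(T):                               # left-to-right pass
        prev = fwd[t] = relu(add(e[t], matvec(W_f, prev)))
    nxt = [0.0] * h
    for t in reversed(range(T)):                     # right-to-left pass
        nxt = bwd[t] = relu(add(e[t], matvec(W_b, nxt)))
    # s_t mixes both directions, so it reflects the words around position t
    return [relu(matvec(W_d, add(fwd[t], bwd[t]))) for t in range(T)]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.3) for _ in range(cols)] for _ in range(rows)]

random.seed(1)
d, h, T = 5, 4, 6  # toy word-vector size, embedding size, sentence length
words = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
S = brnn_embed(words, rand_matrix(h, d), rand_matrix(h, h), rand_matrix(h, h), rand_matrix(h, h))
print(len(S), len(S[0]))  # 6 4
```

Because the forward and backward states are summed before the output projection, each 𝒔𝒕 depends on the whole sentence but is dominated by nearby words.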
Alignment
 Image fragment 𝒊 mapped to 𝒉-dimensional embedding, 𝒗𝒊
 Sentence fragment 𝒕 mapped to 𝒉-dimensional embedding, 𝒔𝒕
 Define similarity between image and sentence fragments
 dot product 𝒗𝒊ᵀ𝒔𝒕
 Match between image 𝒌 and sentence 𝒍
 𝑺𝒌𝒍 = Σ_{𝒕 ∈ 𝒈𝒍} max_{𝒊 ∈ 𝒈𝒌} 𝒗𝒊ᵀ𝒔𝒕 (the paper's simplified form)
 where 𝒈𝒌 and 𝒈𝒍 are sets of image and sentence fragments
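A minimal sketch of the image-sentence match score, assuming the simplified form from the Karpathy & Fei-Fei paper: each word is matched to its best-scoring region, and those dot products are summed over the sentence. The function names and toy vectors are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def image_sentence_score(region_embs, word_embs):
    """Sum, over words, of the dot product with the best-matching region."""
    return sum(max(dot(v, s) for v in region_embs) for s in word_embs)

regions = [[1.0, 0.0], [0.0, 1.0]]               # toy region embeddings v_i
words = [[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]]   # toy word embeddings s_t
print(image_sentence_score(regions, words))      # 2 + 3 + (-1) = 4.0
```

Taking the max over regions means every word must be explained by some region, but a region need not be explained by any word.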
Example Alignments
Alignment II
 Train model to minimize margin-based loss
 aligned image-sentence pairs should have a higher score than misaligned
pairs, by a margin
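The margin-based objective can be sketched as a two-way ranking loss over a matrix of image-sentence scores. The sketch assumes a margin of 1 and that S[k][k] holds the score of the k-th true pair; the function name and toy numbers are hypothetical.

```python
def ranking_loss(S, margin=1.0):
    """S[k][l] = score of image k with sentence l; pair (k, k) is the true match."""
    n = len(S)
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue
            # wrong sentence l should score below the true sentence for image k
            loss += max(0.0, S[k][l] - S[k][k] + margin)
            # wrong image l should score below the true image for sentence k
            loss += max(0.0, S[l][k] - S[k][k] + margin)
    return loss

S = [[3.0, 1.0],
     [2.5, 2.0]]
print(ranking_loss(S))  # 2.0
```

The loss is zero only when every true pair beats every mismatched pair in its row and column by at least the margin.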
Alignment III
 We have a measure of alignment between words and image fragments: 𝒗𝒊ᵀ𝒔𝒕
 We want an assignment of phrases (multiple words) to image fragments
 Constraint satisfaction approach
 Search through a space of assignments of words to image fragments
 encourage adjacent words to have same assignment
 larger 𝒗𝒊ᵀ𝒔𝒕 -> higher probability that word 𝒕 will be associated with image fragment 𝒊
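One way to carry out this search is dynamic programming over a chain: each word picks a region, the word-region similarity is rewarded, and adjacent words sharing a region earn a bonus. The sketch below assumes that scoring scheme; `beta` (the smoothness bonus) and the toy similarities are hypothetical, and the paper's exact energy function may differ.

```python
def align_words(sim, beta):
    """sim[t][i] = similarity of word t and region i. Returns best region per word."""
    T, M = len(sim), len(sim[0])
    best = list(sim[0])      # best[i]: top score of a prefix whose last word uses region i
    back = []
    for t in range(1, T):
        new_best, ptr = [], []
        for i in range(M):
            # keeping the same region as the previous word earns the bonus beta
            cands = [best[j] + (beta if j == i else 0.0) for j in range(M)]
            j_star = max(range(M), key=cands.__getitem__)
            new_best.append(cands[j_star] + sim[t][i])
            ptr.append(j_star)
        best = new_best
        back.append(ptr)
    # trace back from the best final region
    a = [max(range(M), key=best.__getitem__)]
    for ptr in reversed(back):
        a.append(ptr[a[-1]])
    return a[::-1]

sim = [[2.0, 0.0], [0.9, 1.0], [2.0, 0.0]]
print(align_words(sim, beta=0.0))  # [0, 1, 0]
print(align_words(sim, beta=1.0))  # [0, 0, 0]
```

With no bonus each word goes to its individually best region; with the bonus, the middle word is pulled into the same region as its neighbors, which turns single words into contiguous phrases.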
Training a Generative Text Model
 For each image fragment associated with a sentence fragment, train an RNN to synthesize the sentence fragment
 RNN can be used for fragments or whole images
Region Predictions from Generative Model
Summary
(Karpathy & Fei-Fei, 2015)
 Obtain image embedding from Region CNN
 Obtain sentence embedding from forward-backward RNN
 Train embeddings to achieve a good alignment of image patches to words in corresponding sentence
 Use MRF to parse 𝑁 words of sentence into phrases that correspond to the 𝑀 selected image fragments
 RNN sequence generator maps image fragments to phrases
Example Sentences
demo
Formal Evaluation
R@K: recall rate of ground truth sentence given top K candidates
Med r: median rank of the first retrieved ground truth sentence
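Both metrics can be computed from the rank at which the first ground-truth item appears for each query. The ranks below are made-up numbers for illustration, not results from any of the papers.

```python
from statistics import median

def recall_at_k(ranks, k):
    """ranks[q] = 1-based rank of the first ground-truth item for query q."""
    return sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    return median(ranks)

ranks = [1, 3, 7, 2, 12]      # hypothetical ranks for five queries
print(recall_at_k(ranks, 1))  # 0.2
print(recall_at_k(ranks, 5))  # 0.6
print(median_rank(ranks))     # 3
```

Higher R@K is better, while a lower median rank is better.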
Image Search
demo
Examples in Mao et al. (2015)
Failures (Mao et al., 2015)
Common Themes Among Models
 Joint embedding space for images and words
 Use of ImageNet to produce image embeddings
 Generic components
 RNNs
 Softmax for word selection on output
 Start and stop words
Differences Among Models
 Image processing
 decomposition of image (Karpathy) versus processing of whole image (all)
 What input does the image provide?
 initial input to recurrent net (Vinyals)
 input every time step – in recurrent layer (Karpathy) or after recurrent layer (Mao)
 How much is built in?
 semantic representations?
 localization of objects in images?
 Type of recurrence
 fully connected ReLU (Karpathy, Mao), LSTM (Vinyals)
 Read out
 beam search (Vinyals) vs. not (Mao, Karpathy)
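To illustrate the read-out difference, here is a minimal beam search sketch over a toy next-word model with start and stop tokens. The table of probabilities and all names are invented for the example; a real captioner would take its next-word distribution from the RNN's softmax. A beam width of 1 reduces to greedy decoding.

```python
import math

def beam_search(next_probs, beam_width, max_len, start="<s>", stop="</s>"):
    """next_probs(prefix) -> {token: probability}. Returns the best (sequence, log-prob)."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((seq + [tok], lp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, lp in candidates[:beam_width]:
            # sequences that emit the stop token leave the beam
            (finished if seq[-1] == stop else beams).append((seq, lp))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

def toy_model(prefix):
    # hypothetical next-word distribution that depends only on the last word
    table = {
        "<s>": {"a": 0.6, "the": 0.4},
        "a": {"cow": 0.55, "dog": 0.45},
        "the": {"cow": 0.9, "dog": 0.1},
        "cow": {"</s>": 1.0},
        "dog": {"</s>": 1.0},
    }
    return table[prefix[-1]]

greedy_seq, _ = beam_search(toy_model, beam_width=1, max_len=5)
beam_seq, _ = beam_search(toy_model, beam_width=2, max_len=5)
print(greedy_seq)  # ['<s>', 'a', 'cow', '</s>']
print(beam_seq)    # ['<s>', 'the', 'cow', '</s>']
```

Greedy decoding commits to "a" because it is the likeliest first word, while the wider beam keeps "the" alive and finds the more probable full sentence (0.36 versus 0.33).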

Editor's Notes

  1. What’s a good caption?
  2. These seem really small for what they have to learn…but… folks take pics of certain objects (e.g. cats)
  3. v is the image embedding; CNN_θc(I_b) is the output of the CNN for image patch I_b; W_m and b_m are learned weights.
  4. RNN-base: no image representation. Note it doesn’t do too badly on some measures.
  5. Local-to-distributed word embeddings: one layer (Vinyals, Karpathy) vs. two layers (Mao)
  6. VQA and image captioning: differences from image captioning.
  7. In this paper, we propose a novel mechanism that jointly reasons about visual attention and question attention. We build a hierarchical architecture that co-attends to the image and question at three levels. At the word level, we embed the words into a vector space through an embedding matrix. At the phrase level, 1-dimensional convolutional neural networks are used to capture the information contained in unigrams, bigrams and trigrams. At the question level, we use recurrent neural networks to encode the entire question. At each level, we construct the co-attention maps, which are then combined recursively to predict the answers.
  8. In this talk, I’ll go over the major recent VQA method papers, and my thoughts about those methods.