SlideShare a Scribd company logo
1 of 20
Describing Images using Inferred
Visual Dependency Representations
BY DESMOND ELLIOTT & ARJEN P. DE VRIES
PRESENTED BY
P.R.P.L.S KUMARA – 158224B
1
Overview
 What is VDR
 Objectives of the Paper
 Related Works & Their Limitations
 Methodology
 Linguistic Processing
 Visual Processing
 Building a Language Model
 Generating Descriptions
 Evaluation
 Models
 Evaluation Measures
 Data Sets
 Results
 Conclusion
2
What is VDR
 Structured representation of an image that explicitly models the spatial
relationships between objects
 Spatial relationship between a pair of objects is encoded with one of the
following eight options:
 Above
 Below
 Beside
 Opposite
 On
 Surrounds,
 In front
 Behind
3
Objectives of the Paper
 Training a VDR Parsing Model without the extensive human supervision
 Approach is to find the objects mentioned in a given description using a
state-of-the-art object detector, and to use successful detections to
produce training data.
4
Objectives of the Paper
 Generating descriptions for unseen images from VDR
 The description of an unseen image is produced by first predicting its VDR over
automatically detected objects, and then generating the text with a template-based
generation model using the predicted VDR
5
Related Works & Their Limitations
 Approaches
 Models Rely on Spatial Relationship
 Corpus-based relationships
 Spatial and visual attributes
 n-gram phrase fusion from Web-scale corpora
 Recurrent neural networks
 Limitation
 VDR reliance on gold-standard training annotations & which require training
annotators
6
Methodology
 Inferred VDR constructed by searching the subject and object referred in the image
description using an object detector.
 VDR is created by attaching the detected subject to the detected object
 The spatial relationships applied between subjects and objects are defined as follows
7
Methodology
 Linguistic Processing
 Description of an image is processed to extract candidates for the mentioned objects
 Candidates from the nsubj and dobj tokens in the dependency parsed description
 Example is discarded, If the parsed description does not contain both a subject and an object
8
Methodology
 Visual Processing
 Attempt to find candidate objects in the image using the Regions with Convolutional Neural Network
features object detector
 Output of the object detector is a bounding box with real-valued confidence scores
 Increase the chance of finding objects by lemmatizing the token, and transforming the token
into its WordNet hypernym parent
9
Methodology
 Collection of automatically inferred VDR
10
Methodology
 Building a Language Model
 Extract the top-N objects from an image using an object detector
 Predict the spatial relationships between the objects using a VDR Parser
 Descriptions are generated for all parent–child subtrees in the VDR
 Final text has the highest combined corpus and visual confidence
11
Methodology
 Generating Descriptions
 Using a template-based language generation model
 Descriptions are generated using the following template:
DT head is V DT child
 Above labels of the objects that appear in the head and child positions of a specific VDR subtree
 Model captures statistics about
 Nouns that appear as subjects and objects
 The verbs between them
 Spatial relationships observed in the inferred training VDRs
12
Evaluation
 Generate a natural language description of an image, which is evaluated directly against
multiple reference descriptions
 Models
 Compare against following image description models
 MIDGE (text based on tree-substitution grammar and relies on discrete object detections for visual input)
 BRNN (multimodal deep neural network that generates descriptions directly from vector representations of the
image and the description)
13
Evaluation
 Evaluation Measures
 Evaluate the generated descriptions using sentence-level Meteor and BLEU4
 They adopt a jack-knifing evaluation methodology, which enables to report human–human results
 Data Set
 Pascal1K
 Contains 1,000 images sampled from the PASCAL Object Detection Challenge data set each image is paired with
five reference descriptions collected from Mechanical Turk. It contains a wide variety of subject matter
 VLT2K
 contains 2,424 images each image is paired with three reference descriptions, also collected from Mechanical Turk
 Split the images into 80% training, 10% validation, and 10% test
14
Evaluation
 Results
15
Evaluation 16
Evaluation 17
Evaluation
18
 Optimizing the number of detected objects against generated description Meteor scores
 Improvements are seen until eight objects
 Good descriptions do not always need the most confident detections
Conclusion
 Show visual Dependency Representations of images without expensive human
supervision
 Quality of the generated text largely depended on the data set
 Quality of the descriptions depends on whether the images depict an action
 Encoding the spatial relationships between objects is a useful way of learning how to
describe actions
 Future improvements with broader coverage object detectors
 Relax the strict mirroring of human annotator behavior when searching for subjects
and objects in an image
 n-gram based language model constrained by the structured predicted in VDR
19
Thank You…!!!
20

More Related Content

What's hot

Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Editor IJMTER
 

What's hot (20)

An Empirical Study for Defect Prediction using Clustering
An Empirical Study for Defect Prediction using ClusteringAn Empirical Study for Defect Prediction using Clustering
An Empirical Study for Defect Prediction using Clustering
 
Designing Hybrid Cryptosystem for Secure Transmission of Image Data using Bio...
Designing Hybrid Cryptosystem for Secure Transmission of Image Data using Bio...Designing Hybrid Cryptosystem for Secure Transmission of Image Data using Bio...
Designing Hybrid Cryptosystem for Secure Transmission of Image Data using Bio...
 
A Novel Classification via Clustering Method for Anomaly Based Network Intrus...
A Novel Classification via Clustering Method for Anomaly Based Network Intrus...A Novel Classification via Clustering Method for Anomaly Based Network Intrus...
A Novel Classification via Clustering Method for Anomaly Based Network Intrus...
 
JOSA TechTalks - Machine Learning in Practice
JOSA TechTalks - Machine Learning in PracticeJOSA TechTalks - Machine Learning in Practice
JOSA TechTalks - Machine Learning in Practice
 
Assessment of Programming Language Reliability Utilizing Soft-Computing
Assessment of Programming Language Reliability Utilizing Soft-ComputingAssessment of Programming Language Reliability Utilizing Soft-Computing
Assessment of Programming Language Reliability Utilizing Soft-Computing
 
Terminology Machine Learning
Terminology Machine LearningTerminology Machine Learning
Terminology Machine Learning
 
A Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware DetectionA Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware Detection
 
Deep learning ppt
Deep learning pptDeep learning ppt
Deep learning ppt
 
AN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODEL
AN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODELAN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODEL
AN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODEL
 
Show observe and tell giang nguyen
Show observe and tell   giang nguyenShow observe and tell   giang nguyen
Show observe and tell giang nguyen
 
A SECURE STEGANOGRAPHY APPROACH FOR CLOUD DATA USING ANN ALONG WITH PRIVATE K...
A SECURE STEGANOGRAPHY APPROACH FOR CLOUD DATA USING ANN ALONG WITH PRIVATE K...A SECURE STEGANOGRAPHY APPROACH FOR CLOUD DATA USING ANN ALONG WITH PRIVATE K...
A SECURE STEGANOGRAPHY APPROACH FOR CLOUD DATA USING ANN ALONG WITH PRIVATE K...
 
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSCONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
 
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
 
Handwritten Digit Recognition
Handwritten Digit RecognitionHandwritten Digit Recognition
Handwritten Digit Recognition
 
Jz3118501853
Jz3118501853Jz3118501853
Jz3118501853
 
Applications of Machine Learning
Applications of Machine LearningApplications of Machine Learning
Applications of Machine Learning
 
Linear Regression
Linear RegressionLinear Regression
Linear Regression
 
Identification of Relevant Sections in Web Pages Using a Machine Learning App...
Identification of Relevant Sections in Web Pages Using a Machine Learning App...Identification of Relevant Sections in Web Pages Using a Machine Learning App...
Identification of Relevant Sections in Web Pages Using a Machine Learning App...
 
G44093135
G44093135G44093135
G44093135
 
Cat and dog classification
Cat and dog classificationCat and dog classification
Cat and dog classification
 

Viewers also liked (6)

Visual Representation
Visual RepresentationVisual Representation
Visual Representation
 
Sketch ins- a tel design technique
Sketch ins- a tel design techniqueSketch ins- a tel design technique
Sketch ins- a tel design technique
 
Notes on visual representation
Notes on visual representationNotes on visual representation
Notes on visual representation
 
Educational Technology Graphic/Audio Visual Materials
Educational Technology Graphic/Audio Visual MaterialsEducational Technology Graphic/Audio Visual Materials
Educational Technology Graphic/Audio Visual Materials
 
Representation
RepresentationRepresentation
Representation
 
Graphical Representation of data
Graphical Representation of dataGraphical Representation of data
Graphical Representation of data
 

Similar to Describing Images using Visual Dependency Representation

imageclassification-160206090009.pdf
imageclassification-160206090009.pdfimageclassification-160206090009.pdf
imageclassification-160206090009.pdf
KammetaJoshna
 
Ijarcet vol-2-issue-4-1374-1382
Ijarcet vol-2-issue-4-1374-1382Ijarcet vol-2-issue-4-1374-1382
Ijarcet vol-2-issue-4-1374-1382
Editor IJARCET
 
Deep Learning for X ray Image to Text Generation
Deep Learning for X ray Image to Text GenerationDeep Learning for X ray Image to Text Generation
Deep Learning for X ray Image to Text Generation
ijtsrd
 
Hierarchical deep learning architecture for 10 k objects classification
Hierarchical deep learning architecture for 10 k objects classificationHierarchical deep learning architecture for 10 k objects classification
Hierarchical deep learning architecture for 10 k objects classification
csandit
 
HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION
HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATIONHIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION
HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION
cscpconf
 

Similar to Describing Images using Visual Dependency Representation (20)

Image classification with Deep Neural Networks
Image classification with Deep Neural NetworksImage classification with Deep Neural Networks
Image classification with Deep Neural Networks
 
最近の研究情勢についていくために - Deep Learningを中心に -
最近の研究情勢についていくために - Deep Learningを中心に - 最近の研究情勢についていくために - Deep Learningを中心に -
最近の研究情勢についていくために - Deep Learningを中心に -
 
imageclassification-160206090009.pdf
imageclassification-160206090009.pdfimageclassification-160206090009.pdf
imageclassification-160206090009.pdf
 
Ijarcet vol-2-issue-4-1374-1382
Ijarcet vol-2-issue-4-1374-1382Ijarcet vol-2-issue-4-1374-1382
Ijarcet vol-2-issue-4-1374-1382
 
DEEP LEARNING BASED IMAGE CAPTIONING IN REGIONAL LANGUAGE USING CNN AND LSTM
DEEP LEARNING BASED IMAGE CAPTIONING IN REGIONAL LANGUAGE USING CNN AND LSTMDEEP LEARNING BASED IMAGE CAPTIONING IN REGIONAL LANGUAGE USING CNN AND LSTM
DEEP LEARNING BASED IMAGE CAPTIONING IN REGIONAL LANGUAGE USING CNN AND LSTM
 
Object Detection on Dental X-ray Images using R-CNN
Object Detection on Dental X-ray Images using R-CNNObject Detection on Dental X-ray Images using R-CNN
Object Detection on Dental X-ray Images using R-CNN
 
Modelling Framework of a Neural Object Recognition
Modelling Framework of a Neural Object RecognitionModelling Framework of a Neural Object Recognition
Modelling Framework of a Neural Object Recognition
 
Deep Learning for X ray Image to Text Generation
Deep Learning for X ray Image to Text GenerationDeep Learning for X ray Image to Text Generation
Deep Learning for X ray Image to Text Generation
 
MultiModal Retrieval Image
MultiModal Retrieval ImageMultiModal Retrieval Image
MultiModal Retrieval Image
 
Scene recognition using Convolutional Neural Network
Scene recognition using Convolutional Neural NetworkScene recognition using Convolutional Neural Network
Scene recognition using Convolutional Neural Network
 
Image Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine LearningImage Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine Learning
 
IRJET - Visual Question Answering – Implementation using Keras
IRJET -  	  Visual Question Answering – Implementation using KerasIRJET -  	  Visual Question Answering – Implementation using Keras
IRJET - Visual Question Answering – Implementation using Keras
 
Object Recogniton Based on Undecimated Wavelet Transform
Object Recogniton Based on Undecimated Wavelet TransformObject Recogniton Based on Undecimated Wavelet Transform
Object Recogniton Based on Undecimated Wavelet Transform
 
Visual Translation Embedding Network for Visual Relation Detection (UPC Readi...
Visual Translation Embedding Network for Visual Relation Detection (UPC Readi...Visual Translation Embedding Network for Visual Relation Detection (UPC Readi...
Visual Translation Embedding Network for Visual Relation Detection (UPC Readi...
 
ATTENTION BASED IMAGE CAPTIONING USING DEEP LEARNING
ATTENTION BASED IMAGE CAPTIONING USING DEEP LEARNINGATTENTION BASED IMAGE CAPTIONING USING DEEP LEARNING
ATTENTION BASED IMAGE CAPTIONING USING DEEP LEARNING
 
OBJECT DETECTION AND RECOGNITION: A SURVEY
OBJECT DETECTION AND RECOGNITION: A SURVEYOBJECT DETECTION AND RECOGNITION: A SURVEY
OBJECT DETECTION AND RECOGNITION: A SURVEY
 
Hierarchical deep learning architecture for 10 k objects classification
Hierarchical deep learning architecture for 10 k objects classificationHierarchical deep learning architecture for 10 k objects classification
Hierarchical deep learning architecture for 10 k objects classification
 
HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION
HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATIONHIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION
HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION
 
fuzzy LBP for face recognition ppt
fuzzy LBP for face recognition pptfuzzy LBP for face recognition ppt
fuzzy LBP for face recognition ppt
 
Stanford DeepDive Framework
Stanford DeepDive FrameworkStanford DeepDive Framework
Stanford DeepDive Framework
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Describing Images using Visual Dependency Representation

  • 1. Describing Images using Inferred Visual Dependency Representations BY DESMOND ELLIOTT & ARJEN P. DE VRIES PRESENTED BY P.R.P.L.S KUMARA – 158224B 1
  • 2. Overview  What is VDR  Objectives of the Paper  Related Works & Their Limitations  Methodology  Linguistic Processing  Visual Processing  Building a Language Model  Generating Descriptions  Evaluation  Models  Evaluation Measures  Data Sets  Results  Conclusion 2
  • 3. What is VDR  Structured representation of an image that explicitly models the spatial relationships between objects  Spatial relationship between a pair of objects is encoded with one of the following eight options:  Above  Below  Beside  Opposite  On  Surrounds,  In front  Behind 3
  • 4. Objectives of the Paper  Training a VDR Parsing Model without the extensive human supervision  Approach is to find the objects mentioned in a given description using a state-of-the-art object detector, and to use successful detections to produce training data. 4
  • 5. Objectives of the Paper  Generating descriptions for unseen images from VDR  The description of an unseen image is produced by first predicting its VDR over automatically detected objects, and then generating the text with a template-based generation model using the predicted VDR 5
  • 6. Related Works & Their Limitations  Approaches  Models Rely on Spatial Relationship  Corpus-based relationships  Spatial and visual attributes  n-gram phrase fusion from Web-scale corpora  Recurrent neural networks  Limitation  VDR reliance on gold-standard training annotations & which require training annotators 6
  • 7. Methodology  Inferred VDR constructed by searching the subject and object referred in the image description using an object detector.  VDR is created by attaching the detected subject to the detected object  The spatial relationships applied between subjects and objects are defined as follows 7
  • 8. Methodology  Linguistic Processing  Description of an image is processed to extract candidates for the mentioned objects  Candidates from the nsubj and dobj tokens in the dependency parsed description  Example is discarded, If the parsed description does not contain both a subject and an object 8
  • 9. Methodology  Visual Processing  Attempt to find candidate objects in the image using the Regions with Convolutional Neural Network features object detector  Output of the object detector is a bounding box with real-valued confidence scores  Increase the chance of finding objects by lemmatizing the token, and transforming the token into its WordNet hypernym parent 9
  • 10. Methodology  Collection of automatically inferred VDR 10
  • 11. Methodology  Building a Language Model  Extract the top-N objects from an image using an object detector  Predict the spatial relationships between the objects using a VDR Parser  Descriptions are generated for all parent–child subtrees in the VDR  Final text has the highest combined corpus and visual confidence 11
  • 12. Methodology  Generating Descriptions  Using a template-based language generation model  Descriptions are generated using the following template: DT head is V DT child  Above labels of the objects that appear in the head and child positions of a specific VDR subtree  Model captures statistics about  Nouns that appear as subjects and objects  The verbs between them  Spatial relationships observed in the inferred training VDRs 12
  • 13. Evaluation  Generate a natural language description of an image, which is evaluated directly against multiple reference descriptions  Models  Compare against following image description models  MIDGE (text based on tree-substitution grammar and relies on discrete object detections for visual input)  BRNN (multimodal deep neural network that generates descriptions directly from vector representations of the image and the description) 13
  • 14. Evaluation  Evaluation Measures  Evaluate the generated descriptions using sentence-level Meteor and BLEU4  They adopt a jack-knifing evaluation methodology, which enables to report human–human results  Data Set  Pascal1K  Contains 1,000 images sampled from the PASCAL Object Detection Challenge data set each image is paired with five reference descriptions collected from Mechanical Turk. It contains a wide variety of subject matter  VLT2K  contains 2,424 images each image is paired with three reference descriptions, also collected from Mechanical Turk  Split the images into 80% training, 10% validation, and 10% test 14
  • 18. Evaluation 18  Optimizing the number of detected objects against generated description Meteor scores  Improvements are seen until eight objects  Good descriptions do not always need the most confident detections
  • 19. Conclusion  Show visual Dependency Representations of images without expensive human supervision  Quality of the generated text largely depended on the data set  Quality of the descriptions depends on whether the images depict an action  Encoding the spatial relationships between objects is a useful way of learning how to describe actions  Future improvements with broader coverage object detectors  Relax the strict mirroring of human annotator behavior when searching for subjects and objects in an image  n-gram based language model constrained by the structured predicted in VDR 19