1. Every Picture Tells a Story: Generating Sentences from Images
   Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth
   Proceedings of ECCV 2010
2. Motivation
   Demonstrate how well automatic methods can attach a description to a given image, or find images that illustrate a given sentence.
   Auto-annotation
3. Motivation
   Demonstrate how well automatic methods can attach a description to a given image, or find images that illustrate a given sentence.
   Auto-illustration
4. Contributions
   Proposes a system that computes a score linking an image to a sentence, and vice versa.
   Evaluates the method on a novel dataset of human-annotated images (the PASCAL Sentence Dataset).
   Provides a quantitative evaluation of prediction quality.
5. Overview
6. The Approach
   Mapping Image to Meaning
   (Figure: the meaning space, a triplet of <object, action, scene>; 16, 23, and 29 are the label-set sizes shown.)
   Predicting the meaning triplet of an image involves solving a small multi-label Markov random field.
7. The Approach
   Node potentials: computed as a linear combination of scores from several detectors and classifiers (the feature functions).
   Edge potentials: estimated from the frequencies of the node labels.
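The inference in slides 6-7 is small enough to run by brute force. Below is a minimal sketch, not the authors' code: the tiny label sets, the feature dictionaries, and the chain structure (object-action and action-scene edges) are all illustrative assumptions, and the edge estimates anticipate the four frequency features listed on slide 10.

```python
import itertools
import numpy as np

# Hypothetical tiny label sets; the paper's object/action/scene vocabularies are larger.
OBJECTS = ["person", "bike", "dog"]
ACTIONS = ["ride", "run", "sit"]
SCENES = ["street", "park", "room"]

def node_potential(label, feats, w):
    # Node potential: linear combination of detector/classifier scores for one label.
    return float(np.dot(w, feats[label]))

def edge_potential(a, b, freq, joint, w):
    # Edge potential: linear combination of the four frequency-based
    # estimates on slide 10: f(A), f(B), f(A,B), and f(A,B) / (f(A) f(B)).
    fa, fb = freq[a], freq[b]
    fab = joint.get((a, b), 1e-9)  # smoothed joint frequency
    return float(np.dot(w, [fa, fb, fab, fab / (fa * fb)]))

def map_triplet(feats, freq, joint, w_node, w_edge):
    # Brute-force MAP over the three-node MRF; tractable because the
    # triplet space is tiny (e.g. 16 x 23 x 29 is only ~10k states).
    best, best_score = None, float("-inf")
    for o, a, s in itertools.product(OBJECTS, ACTIONS, SCENES):
        score = (node_potential(o, feats, w_node)
                 + node_potential(a, feats, w_node)
                 + node_potential(s, feats, w_node)
                 + edge_potential(o, a, freq, joint, w_edge)
                 + edge_potential(a, s, freq, joint, w_edge))
        if score > best_score:
            best, best_score = (o, a, s), score
    return best
```

Exhaustive enumeration is reasonable here because the triplet space stays in the low thousands of states; larger label sets would call for proper MRF inference.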
8. The Approach
   Image Space
   Feature functions: node features and similarity features.
   To provide information about the nodes of the MRF, we first construct image features.
   Node features:
   - Felzenszwalb et al. detector responses
   - Hoiem et al. classification responses
   - Gist-based scene classification responses

9. The Approach
   Image Space
   Similarity features:
   - Average of the node features over the K nearest training images to the test image, with neighbors found by matching image features.
   - Average of the node features over the K nearest training images, with neighbors found by matching the node features derived from classifiers and detectors.
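A minimal sketch of how these similarity features might be computed, assuming descriptor and node-feature matrices; the names and the value of K are hypothetical:

```python
import numpy as np

def similarity_features(test_desc, train_descs, train_node_feats, k=5):
    # Euclidean distances from the test descriptor to every training descriptor.
    dists = np.linalg.norm(train_descs - test_desc, axis=1)
    knn = np.argsort(dists)[:k]                 # indices of the K nearest neighbors
    return train_node_feats[knn].mean(axis=0)   # average their node features
```

Calling this once with raw image descriptors and once with the classifier/detector node features as the matching space yields the two variants listed on this slide.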
10. The Approach
    Edge Potentials
    Linear combination of multiple estimates of the edge potentials.
    Four estimates for an edge from node A to node B:
    - the normalized frequency of the word A in the corpus, f(A)
    - the normalized frequency of the word B in the corpus, f(B)
    - the normalized frequency of A and B occurring together, f(A, B)
    - the ratio f(A, B) / (f(A) f(B))

11. The Approach
    Sentence Space
    Extract triplets from sentences.
    Use Lin similarity to measure the semantic distance between two words.
    Determine which actions commonly co-occur.
    Compute sentence node potentials from these measures.
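For the Lin-similarity step, NLTK's WordNet interface is one readily available implementation. A sketch, under the assumption that taking each word's first noun synset is an acceptable word-level proxy:

```python
# Requires: nltk.download("wordnet"); nltk.download("wordnet_ic")
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")  # information content from the Brown corpus

def lin_similarity(word_a, word_b, pos=wn.NOUN):
    # Lin similarity between two words, using their first WordNet synsets.
    syns_a, syns_b = wn.synsets(word_a, pos), wn.synsets(word_b, pos)
    if not syns_a or not syns_b:
        return 0.0
    return syns_a[0].lin_similarity(syns_b[0], brown_ic)

# e.g. map an out-of-vocabulary sentence word onto the closest meaning-space label
labels = ["person", "bicycle", "dog"]
closest = max(labels, key=lambda lab: lin_similarity("puppy", lab))
```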
12. Learning and Inference
    Learning to predict triplets for images is done discriminatively, using a dataset of images labeled with their meaning triplets.
    The potentials are computed as linear combinations of feature functions.
    Learning therefore reduces to finding the set of weights on that linear combination such that the ground-truth triplet scores higher than any other triplet.
    Inference finds argmax_y wᵀφ(x, y), where φ is the potential (feature) function, y is a triplet label, and w is the learned weight vector.
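The slide fixes the objective (the ground truth must outscore every other triplet) but not the optimizer; a structured-perceptron update is one standard way to search for such weights. A hypothetical sketch, with phi(x, y) standing in for the joint feature function φ from the slide:

```python
import numpy as np

def perceptron_epoch(data, phi, all_triplets, w, lr=1.0):
    # One pass of a structured-perceptron-style update: whenever the
    # ground-truth triplet is not the argmax, move the weights toward its
    # features and away from the current best guess.
    for x, y_true in data:  # x: image, y_true: labeled meaning triplet
        y_pred = max(all_triplets, key=lambda y: np.dot(w, phi(x, y)))
        if y_pred != y_true:
            w = w + lr * (phi(x, y_true) - phi(x, y_pred))
    return w
```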
13. Evaluation
    Dataset
    PASCAL Sentence Dataset: built on the PASCAL 2008 development kit; 50 images from each of 20 categories.
    Five captions per image were collected via Amazon Mechanical Turk.
    Experimental settings
    600 training images and 400 test images.
    The 50 closest triplets are used for matching.
14. Evaluation
    A match between an image and a sentence is scored by ranking triplets in the opposite space and summing over the shared triplets, weighted by their inverse ranks.
    Use of distributional semantics:
    Text information and a word-similarity measure handle out-of-vocabulary words that occur in sentences but are not covered by any trained detector or classifier.
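The inverse-rank scoring is described only loosely here; one plausible reading, sketched with hypothetical inputs (a ranked triplet list for each side, and k = 50 as in slide 13):

```python
def match_score(image_triplets, sentence_triplets, k=50):
    # Score an image-sentence pair by summing inverse ranks of the
    # triplets the two top-k ranked lists share (ranks are 1-based).
    img_rank = {t: r for r, t in enumerate(image_triplets[:k], start=1)}
    score = 0.0
    for r, t in enumerate(sentence_triplets[:k], start=1):
        if t in img_rank:
            score += 1.0 / img_rank[t] + 1.0 / r
    return score
```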
15. Evaluation
    Quantitative Measures
    Tree-F1 measure: reflects two important, interacting components, accuracy and specificity. Each element of a triplet is represented as a path in a taxonomy tree.
    Precision: the number of edges on the predicted path that also lie on the ground-truth path, divided by the total number of edges on the predicted path.
    Recall: the number of edges on the ground-truth path that appear on the predicted path, divided by the total number of edges on the ground-truth path.
    BLEU measure: checks whether a generated triplet is logically valid; e.g., (bottle, walk, street) is not. A triplet counts as valid if it ever appeared in the corpus.
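Treating each path as a set of taxonomy-tree edges, the precision and recall above reduce to set overlap. A small sketch (the taxonomy construction itself is the paper's and is not reproduced here):

```python
def tree_f1(pred_path, true_path):
    # Tree-F1 over taxonomy paths, each given as a list of (parent, child)
    # edges from the root down to the predicted / ground-truth label.
    pred, true = set(pred_path), set(true_path)
    if not pred or not true:
        return 0.0
    overlap = len(pred & true)
    precision = overlap / len(pred)
    recall = overlap / len(true)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```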
16. Results
    Auto-Annotation

17. Results
    Auto-Illustration

18. Results
    Examples of Failures
19. Discussion
    - Sentences are not truly generated from the image; they are retrieved from a pool of human-written descriptions.
    - The intermediate meaning space helps the model tackle the two-way problem (image-to-sentence and sentence-to-image) and also benefits from distributional semantics.
    - The way the system outputs a score and quantitatively evaluates the correlation between descriptions and images is interesting.