Sentence generation
Transcript

  • 1. Every Picture Tells a Story: Generating Sentences from Images
    Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth
    Proceedings of ECCV-2010
  • 2. Motivation
    Demonstrating how well automatic methods can link a description to a given image, or find images that illustrate a given sentence.
    Auto-annotation
  • 3. Motivation
    Demonstrating how well automatic methods can link a description to a given image, or find images that illustrate a given sentence.
    Auto-illustration
  • 4. Contributions
    Proposes a system that computes a score linking an image to a sentence, and vice versa.
    Evaluates the methodology on a novel dataset of human-annotated images (the PASCAL Sentence Dataset).
    Provides a quantitative evaluation of the quality of the predictions.
  • 5. Overview
  • 6. The Approach
    Mapping Image to Meaning
    [Figure: mapping an image to the meaning triplet ⟨object, action, scene⟩; the numbers 16, 23, and 29 give the sizes of the three label vocabularies]
    Predicting the triplet of an image involves solving a small multi-label Markov random field.
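Because the meaning space is small, triplet prediction can be illustrated by exhaustive scoring. The sketch below is a minimal illustration, assuming a fully connected three-node MRF whose score is the sum of node and edge potentials; the dictionaries and label lists are hypothetical stand-ins for the paper's detector outputs and corpus statistics.

```python
import itertools

def best_triplet(node_pot, edge_pot, objects, actions, scenes):
    """Score every <object, action, scene> triplet and keep the best.

    node_pot[label]  -> unary potential of a label (from detectors/classifiers)
    edge_pot[(a, b)] -> pairwise potential of a label pair (from corpus stats)
    With vocabularies of roughly 16-29 labels there are only ~10k triplets,
    so exhaustive search stands in for general MRF inference here.
    """
    best, best_score = None, float('-inf')
    for o, a, s in itertools.product(objects, actions, scenes):
        score = (node_pot[o] + node_pot[a] + node_pot[s]
                 + edge_pot[(o, a)] + edge_pot[(a, s)] + edge_pot[(o, s)])
        if score > best_score:
            best, best_score = (o, a, s), score
    return best, best_score
```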
  • 7. The Approach
    Node potentials: computed as a linear combination of scores from several detectors and classifiers (the feature functions).
    Edge potentials: estimated from the co-occurrence frequencies of the node labels.
  • 8. The Approach
    Image Space
    Feature Functions: node features, similarity features
    To provide information about the nodes of the MRF, we first need to construct image features.
    Node Features:
    • Felzenszwalb et al. detector responses
    • Hoiem et al. classification responses
    • GIST-based scene classification responses
  • The Approach
    Image Space
    Similarity Features
    • Average of the node features over the K nearest neighbors of the test image in the training set, where neighbors are matched using raw image features.
    • Average of the node features over the K nearest neighbors of the test image, where neighbors are matched using the node features derived from the classifiers and detectors (a sketch follows below).
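A minimal sketch of the similarity features, assuming plain Euclidean nearest-neighbor matching; the array names are illustrative, not from the authors' code. The same function covers both variants, depending on which features are passed in for matching.

```python
import numpy as np

def similarity_features(test_feat, train_match_feats, train_node_feats, k=5):
    """Average node features over the K nearest training images.

    test_feat         -- matching features of the test image, shape (D,)
    train_match_feats -- matching features of the training set, shape (N, D)
                         (raw image features for one variant, detector/
                         classifier node features for the other)
    train_node_feats  -- node features to be averaged, shape (N, F)
    """
    dists = np.linalg.norm(train_match_feats - test_feat, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the K nearest images
    return train_node_feats[nearest].mean(axis=0)
```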
  • The Approach
    Edge Potentials
    Linear combination of multiple estimates for the edge potentials:
    Four estimates for edges from node A to node B (a sketch follows this list):
    • The normalized frequency of word A in our corpus, f(A).
    • The normalized frequency of word B in our corpus, f(B).
    • The normalized frequency of A and B occurring together, f(A, B).
    • The ratio f(A, B) / (f(A) f(B)).
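The four estimates can be computed directly from co-occurrence counts. A minimal sketch, assuming the corpus has already been reduced to a list of (object, action, scene) tuples; the final ratio is an unlogged pointwise-mutual-information-style score.

```python
from collections import Counter

def edge_estimates(corpus_triplets, a, b):
    """Return f(A), f(B), f(A, B), and f(A, B) / (f(A) f(B)).

    corpus_triplets is assumed to be a list of (object, action, scene)
    tuples extracted from the sentence corpus.
    """
    n = len(corpus_triplets)
    word_counts = Counter(w for t in corpus_triplets for w in t)
    pair_count = sum(1 for t in corpus_triplets if a in t and b in t)

    f_a = word_counts[a] / n               # normalized frequency of A
    f_b = word_counts[b] / n               # normalized frequency of B
    f_ab = pair_count / n                  # joint frequency of A and B
    ratio = f_ab / (f_a * f_b) if f_a and f_b else 0.0
    return f_a, f_b, f_ab, ratio
```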
  • The Approach
    Sentence Space
    Extract triplets from sentences.
    Use Lin similarity to measure the semantic distance between two words (see the sketch below).
    Determine which actions commonly co-occur.
    Compute sentence node potentials from these measures.
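Lin similarity is available off the shelf in NLTK's WordNet interface; the sketch below is one way to realize the word-similarity step, not necessarily the authors' exact setup (it requires the NLTK 'wordnet' and 'wordnet_ic' data packages).

```python
from nltk.corpus import wordnet as wn, wordnet_ic

# Information-content statistics from the Brown corpus, required by Lin similarity.
brown_ic = wordnet_ic.ic('ic-brown.dat')

def lin_sim(word1, word2, pos=wn.NOUN):
    """Maximum Lin similarity over all WordNet synset pairs of two words."""
    best = 0.0
    for s1 in wn.synsets(word1, pos):
        for s2 in wn.synsets(word2, pos):
            best = max(best, s1.lin_similarity(s2, brown_ic))
    return best

print(lin_sim('dog', 'cat'))     # semantically close -> high score
print(lin_sim('dog', 'street'))  # semantically distant -> low score
```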
  • 15. Learning and Inference
    Learning to predict triplets for images is done discriminatively using a dataset of images labeled with their meaning triplets.
    The potentials are computed as linear combinations of feature functions.
    This casts learning as a search for the set of weights on the linear combination of feature functions such that the ground-truth triplet scores higher than any other triplet.
    Inference involves finding argmax_y wᵀφ(x, y), where φ is the potential (feature) function, y is a triplet label, and w is the vector of learned weights.
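One simple way to realize the stated objective (the ground truth outscoring every other triplet) is a structured-perceptron update, shown here as an illustration rather than the authors' exact optimizer; phi is a hypothetical stand-in for the paper's feature functions.

```python
import numpy as np

def perceptron_step(w, phi, x, y_true, candidate_triplets, lr=1.0):
    """If any triplet outscores the ground truth, nudge the weights toward
    the ground-truth features and away from the rival's features."""
    y_hat = max(candidate_triplets, key=lambda y: w @ phi(x, y))
    if y_hat != y_true:
        w = w + lr * (phi(x, y_true) - phi(x, y_hat))
    return w
```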
  • 16. Evaluation
    Dataset
    PASCAL Sentence Dataset: built from the PASCAL 2008 development kit; 50 images from each of the 20 categories (1,000 images in total).
    Amazon Mechanical Turk workers generated 5 captions for each image.
    Experimental Settings
    600 training images and 400 testing images.
    The 50 closest triplets are used for matching.
  • 17. Evaluation
    Scoring a match between an image and a sentence is done by ranking triplets in the opposite space and summing their scores, weighted by the inverse rank of the triplets (a sketch follows below).
    Distributional Semantics Usage:
    Text information and a similarity measure are used to handle out-of-vocabulary words that occur in sentences but are not covered by any learned detector/classifier.
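A sketch of the inverse-rank weighting, assuming the ranked triplets from the opposite space and their scores are already available; the names are illustrative, not the paper's exact formula.

```python
def match_score(ranked_triplets, triplet_scores):
    """Sum triplet scores weighted by inverse rank: the rank-1 triplet
    contributes its full score, rank 2 half of it, and so on."""
    return sum(triplet_scores[t] / rank
               for rank, t in enumerate(ranked_triplets, start=1))
```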
  • 18. Evaluation
    Quantitative Measures
    Tree-F1 measure: a measure that reflects two important interacting components, accuracy and specificity, computed over paths in the taxonomy tree.
    Precision is the number of edges on the predicted path that match edges on the ground-truth path, divided by the total number of edges on the predicted path.
    Recall is the number of edges on the predicted path that lie on the ground-truth path, divided by the total number of edges on the ground-truth path.
    BLEU measure: checks whether a generated triplet is logically valid. For example, (bottle, walk, street) is not valid; we check whether the triplet ever appeared in our corpus.
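The Tree-F1 computation follows directly from the edge definitions above. A minimal sketch, assuming each path is given as a list of taxonomy nodes from the root down to the predicted or ground-truth label.

```python
def tree_f1(pred_path, true_path):
    """F1 over the edges of two root-to-label taxonomy paths.

    Edges are consecutive node pairs along each path.
    """
    pred_edges = set(zip(pred_path, pred_path[1:]))
    true_edges = set(zip(true_path, true_path[1:]))
    matched = len(pred_edges & true_edges)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_edges)
    recall = matched / len(true_edges)
    return 2 * precision * recall / (precision + recall)
```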
  • 19. Results
    Auto-Annotation
  • 20. Results
    Auto-Illustration
  • 21. Results
    Examples of Failures
  • 22. Discussion
    • Sentences are not really generated from the image; they are retrieved from a pool of human-annotated descriptions.
    • The intermediate meaning space helps the model approach the two-way problem, and it also benefits from the distributional semantics.
    • The way a score is produced to quantitatively evaluate the correlation between descriptions and images is interesting.