Every Picture Tells a Story: Generating Sentences from Images. Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth. Proceedings of ECCV 2010
Motivation Demonstrate how well automatic methods can attach a descriptive sentence to a given image. Auto-annotation
Motivation Demonstrate how well automatic methods can find images that illustrate a given sentence. Auto-illustration
Contributions Proposes a system that computes a score linking an image to a sentence, and vice versa. Evaluates the method on a novel dataset of human-annotated images (the PASCAL Sentence Dataset). Provides a quantitative evaluation of the quality of the predictions.
The Approach Mapping Image to Meaning. Meaning is represented as a triplet of ⟨object, action, scene⟩ (vocabularies of 23 objects, 16 actions, and 29 scenes). Predicting the triplet of an image involves solving a small multi-label Markov random field.
The Approach Node potentials: computed as a linear combination of scores from several detectors and classifiers (feature functions). Edge potentials: estimated from the co-occurrence frequencies of the node labels.
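The triplet prediction above can be sketched in a few lines: because the three label spaces are tiny, inference in the chain-structured MRF can simply enumerate all triplets and pick the highest-scoring one. The vocabularies and potential values below are made up for illustration, not the paper's learned values.

```python
import itertools

# Illustrative (not the paper's actual) label vocabularies.
objects = ["person", "horse", "bottle"]
actions = ["ride", "walk"]
scenes = ["field", "street"]

# Node potentials: one score per label (made-up classifier outputs).
node = {
    "person": 1.2, "horse": 0.8, "bottle": -0.5,
    "ride": 0.9, "walk": 0.3,
    "field": 0.7, "street": 0.1,
}

# Edge potentials: co-occurrence scores for label pairs (made-up).
edge = {
    ("person", "ride"): 0.6, ("ride", "field"): 0.4,
    ("horse", "ride"): 0.9, ("person", "walk"): 0.5,
    ("walk", "street"): 0.6,
}

def score(o, a, s):
    """Sum the three node potentials and the two edge potentials
    on the object-action-scene chain."""
    return (node[o] + node[a] + node[s]
            + edge.get((o, a), 0.0) + edge.get((a, s), 0.0))

# Exhaustive inference is feasible because the label spaces are tiny.
best = max(itertools.product(objects, actions, scenes),
           key=lambda t: score(*t))
print(best)  # → ('person', 'ride', 'field')
```

With realistic vocabulary sizes (tens of labels per node) this brute-force enumeration is still only a few thousand triplets, which is why a small MRF suffices.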
The Approach Image Space Feature Functions: Node Features and Similarity Features. To provide information about the nodes of the MRF, we first construct image features of these two kinds.
The Approach Sentence Space: Extract triplets from sentences. Use Lin similarity to measure the semantic distance between two words. Determine which actions commonly co-occur. Compute sentence node potentials from these measures.
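Lin similarity compares two concepts via the information content (IC) of their least common subsumer in a taxonomy: sim(c1, c2) = 2·IC(lcs) / (IC(c1) + IC(c2)). A minimal sketch, with hypothetical IC values (real systems derive IC from corpus frequencies over WordNet, e.g. via NLTK's `wordnet_ic` corpora):

```python
# Hypothetical information-content values for three WordNet-style concepts.
ic = {"horse": 7.2, "pony": 9.1, "equine": 5.8}
lcs = "equine"  # assumed least common subsumer of "horse" and "pony"

def lin_similarity(c1, c2, lcs):
    """Lin (1998): sim = 2 * IC(lcs) / (IC(c1) + IC(c2)).
    Ranges over (0, 1]; identical concepts score 1."""
    return 2.0 * ic[lcs] / (ic[c1] + ic[c2])

print(round(lin_similarity("horse", "pony", lcs), 3))  # → 0.712
```

The closer the subsumer is to the two concepts (i.e. the more specific it is), the higher its IC and the higher the similarity.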
Learning and Inference Learning to predict triplets for images is done discriminatively, using a dataset of images labeled with their meaning triplets. The potentials are computed as linear combinations of feature functions, so learning reduces to searching for the set of weights on that linear combination under which the ground-truth triplet scores higher than any other triplet. Inference involves finding argmax_y wᵀφ(x, y), where φ is the feature (potential) function, y is the triplet label, and w is the learned weight vector.
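One simple instantiation of "make the ground truth score higher than any other triplet" is a structured-perceptron update; the paper's actual optimizer may differ, so treat this as a hedged sketch of the learning objective, with a toy indicator feature function:

```python
def dot(w, f):
    """Sparse dot product between weight dict w and feature dict f."""
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def perceptron_step(w, phi, x, y_true, label_space, lr=1.0):
    """One structured-perceptron update: if the highest-scoring triplet
    is not the ground truth, move w toward the truth's features and
    away from the mistaken prediction's features."""
    y_hat = max(label_space, key=lambda y: dot(w, phi(x, y)))
    if y_hat != y_true:
        for k, v in phi(x, y_true).items():
            w[k] = w.get(k, 0.0) + lr * v
        for k, v in phi(x, y_hat).items():
            w[k] = w.get(k, 0.0) - lr * v
    return w

# Toy feature function: indicator features per triplet component,
# conditioned on the "image" x (here just a string id). All hypothetical.
def phi(x, y):
    o, a, s = y
    return {(x, "obj", o): 1.0, (x, "act", a): 1.0, (x, "scn", s): 1.0}

label_space = [("horse", "walk", "street"), ("person", "ride", "field")]
w = {}
x, y_true = "img1", ("person", "ride", "field")

for _ in range(3):  # a few passes suffice on this toy problem
    w = perceptron_step(w, phi, x, y_true, label_space)

# Inference is the same argmax used during training.
y_pred = max(label_space, key=lambda y: dot(w, phi(x, y)))
print(y_pred)  # → ('person', 'ride', 'field')
```

The update rule is exactly the "ground truth should outscore every other triplet" condition from the slide, applied one violated example at a time.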
Evaluation Dataset PASCAL Sentence Dataset: built from the PASCAL 2008 development kit; 50 images from each of 20 categories (1,000 images total); Amazon's Mechanical Turk workers generate 5 captions for each image. Experimental settings: 600 training images and 400 testing images; the 50 closest triplets are used for matching.
Evaluation A match between an image and a sentence is scored by ranking triplets in the opposite space and summing over them, weighted by the inverse rank of the triplets. Distributional Semantics Usage: text information and a similarity measure are used to handle out-of-vocabulary words that occur in sentences but have no trained detector/classifier.
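The inverse-rank matching idea can be sketched as follows; this is an illustrative scoring function under assumed inputs (ranked triplet lists for each space), not the paper's exact formula:

```python
def match_score(image_triplets, sentence_triplets):
    """Score an (image, sentence) pair from their top-k triplet lists,
    both ranked best-first. Shared triplets contribute the sum of their
    inverse ranks in the two spaces, so agreement near the top of both
    lists dominates the score."""
    sent_rank = {t: r for r, t in enumerate(sentence_triplets, start=1)}
    score = 0.0
    for r, t in enumerate(image_triplets, start=1):
        if t in sent_rank:
            score += 1.0 / r + 1.0 / sent_rank[t]
    return score

img = [("person", "ride", "horse_scene"), ("horse", "walk", "field")]
sent = [("horse", "walk", "field"), ("person", "ride", "horse_scene")]
print(round(match_score(img, sent), 3))  # → 3.0
```

A triplet ranked first in both lists contributes 2.0; one ranked 50th in both contributes only 0.04, so the top of each ranking carries nearly all the weight.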
Evaluation Quantitative Measures Tree-F1 measure: a measure that reflects two important interacting components, accuracy and specificity. Precision is the number of edges on the predicted path that match edges on the ground-truth path, divided by the total number of edges on the predicted path. Recall is the number of edges on the predicted path that appear on the ground-truth path, divided by the total number of edges on the ground-truth path. BLEU measure: checks whether a generated triplet is logically valid or not; e.g., (bottle, walk, street) is not valid. For this, we check whether the triplet ever appeared in our corpus.
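The edge-level Tree-F1 computation can be made concrete; the taxonomy paths below are hypothetical, and each path is represented as a set of parent-child edges:

```python
def tree_f1(pred_edges, gt_edges):
    """F1 over taxonomy-tree edges: precision is the fraction of
    predicted-path edges that lie on the ground-truth path; recall is
    the fraction of ground-truth-path edges that were predicted."""
    matched = pred_edges & gt_edges
    precision = len(matched) / len(pred_edges)
    recall = len(matched) / len(gt_edges)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical paths: predicting "horse" when the truth is "pony"
# still gets partial credit for the shared root->animal edge.
pred = {("root", "animal"), ("animal", "horse")}
gt = {("root", "animal"), ("animal", "pony")}
print(round(tree_f1(pred, gt), 2))  # → 0.5
```

This partial credit is the point of the measure: a prediction that is wrong but close in the taxonomy scores higher than one that is wrong at the root.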