Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers (A Presentation)

Slide notes:
  • The task is to determine the correspondence between image regions and semantic object classes. The problem: there are significant ambiguities in the correspondence between visual features and object classes.
  • Instead of using only the co-occurrence of nouns and image features over large databases of images to determine the correspondence, additional language constructs are considered, namely prepositions and comparative adjectives. The paper simultaneously learns the visual features defining nouns and the differential visual features defining binary relationships using an EM approach.
  • Prior models are not applicable to binary relationships if models for the nouns are not given. Later work [Fei-Fei Li et al., CVPR 09] used spatial relationships between image patches for scene recognition: a feature-mining approach extracts discriminative image patches, and the relationships between them are interpreted as adjectives or prepositions. The authors also mined relationships among more than two image patches, and trained an SVM with different types of adjectives and prepositions encoded. The encoding is based on an image representation of multi-scale local patches and a spatial pyramid; SIFT descriptors represent each appearance patch. First the visual code words are recognized in an image, and then relationships are extracted using the Apriori mining algorithm. [Forsyth et al., ICCV 09] introduces an approach to jointly learn detectors for object classes and attributes (color and texture) based on a co-training algorithm. Object-to-attribute is a one-way association here, i.e. a red table or a metallic table, but not both. Here too the image is divided into a number of windows, and joint multiple-instance learning forces the learners for the object class and the attribute class to cooperate on labeling windows that must contain both the object and the attribute; salient, homogeneous windows are selected as candidates. In most cases the object-detection average precision is better than with the separate-learning approach, and moreover a "visual attribute + object" combination not in the training set can be detected by combining visual-attribute and object detectors learned from the other categories.
  • Visual features are based on appearance and shape. Initialization uses random assignments.
  • Word-sense disambiguation is not addressed.
  • Aij refers to the subset of the set of all possible assignments for an image in which noun i is assigned to region j.
  • For a Gaussian classifier we estimate the mean and variance. Initialization is random; the authors bootstrap with the result of Barnard's translation-based model, though any image-annotation approach with localization would work. After learning the maximum-likelihood parameters, we use the relationship classifier and the assignment to find possible relationships between all pairs of words. Using these generated relationship annotations we form a co-occurrence table, which is used to compute P(r | ns, np) (see the sketch after these notes).
  • For each region, we have two nodes corresponding to the noun and the image features from that region. For all possible pairs of regions, we have another two nodes representing a relationship word and the differential features from that pair of regions. An example of a Bayesian network with 3 regions: the rjk represent the possible words for the relationship between regions (j, k). Due to the non-symmetric nature of relationships we consider both (j, k) and (k, j) pairs (in the figure only one is shown). The magenta blocks in the image represent differential features (Ijk).
  • The relationship model is based on differential features. The parameter-learning M-step therefore also involves feature selection for the relationship classifiers.
  • The first measure counts the number of words that are labeled properly by the algorithm; each word has equal importance regardless of the frequency with which it occurs. In the second measure, a word which occurs more frequently is given higher importance. Using the first measure, both algorithms have similar performance because each can correctly label one word. However, using the second measure the latter algorithm is better, as sky is more common and hence the number of correctly identified regions is higher for the latter algorithm. A co-occurrence-based translation model (IBM Model 1) and the translation-based model with mixing probabilities (Duygulu et al.) form the baseline algorithms.
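One of the notes above describes forming a co-occurrence table over the generated relationship annotations to estimate the priors P(r | ns, np). A minimal sketch of that counting step, assuming the generated annotations arrive as (ns, np, r) triples; the triple format and the smoothing constant are illustrative assumptions, not details from the paper:

```python
from collections import defaultdict

def relationship_priors(triples, relationships, smoothing=1e-6):
    """Estimate P(r | ns, np) from generated (ns, np, r) relationship
    annotations by counting co-occurrences and normalizing per noun pair."""
    counts = defaultdict(lambda: defaultdict(float))
    for ns, n_p, r in triples:
        counts[(ns, n_p)][r] += 1.0

    priors = {}
    for pair, r_counts in counts.items():
        total = sum(r_counts.values()) + smoothing * len(relationships)
        priors[pair] = {r: (r_counts.get(r, 0.0) + smoothing) / total
                        for r in relationships}
    return priors

# Example triples, as might be produced by the learned relationship classifier.
triples = [("sun", "sky", "in"), ("sun", "sea", "above"), ("sun", "sky", "in")]
priors = relationship_priors(triples, ["in", "above", "below"])
print(priors[("sun", "sky")])  # P(r | ns="sun", np="sky") for each r
```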

    1. Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers
       Abhinav Gupta and Larry S. Davis, University of Maryland, College Park
       Proceedings of ECCV 2008
       Presented by: Debaleena Chattopadhyay
    2. Presentation Outline
       - The Problem Definition
       - The Novelty
       - The Problem Solution
       - The Results
    3. The Problem Definition
       To learn visual classifiers for object recognition from weakly labeled data.
       Input labels: city, mountain, sky, sun
       Expected output: image regions labeled as sun, sky, mountain, city
    4. Novelty
       To learn visual classifiers for object recognition from weakly labeled data, utilizing additional language constructs.
       Input labels:
       - Nouns: city, mountain, sky, sun
       - Relations: below(mountain, sky), below(mountain, sun), above(sky, city), above(sun, city), brighter(sun, mountain), brighter(sun, city), behind(mountain, city), convex(sun, city), in(sun, sky), smaller(sun, sky)
       Expected output: image regions labeled as sun, sky, mountain, city
    5. Related Work
       Some previous works:
       - Learn classifiers for visual attributes from a training dataset of positive and negative images using a generative model [Ferrari et al.]
       - Learn adjectives and nouns in two steps (adjectives in the first step, nouns in the second) using a latent model [Barnard et al.]
       Some later works:
       - Mining Discriminative Adjectives and Prepositions for Natural Scene Recognition [Fei-Fei Li et al., CVPR 09]
       - Joint Learning of Visual Attributes, Object Classes and Visual Saliency [Forsyth et al., ICCV 2009]
    6. Overview
       Nouns: SEA, SKY, SUN
       Pairs of nouns: (SEA, SUN), (SEA, SKY), (SKY, SEA), (SKY, SUN), (SUN, SKY), (SUN, SEA)
       Relationships: in, above, below
    7. Proposed Algorithm
       - Dataset: a training set annotated with nouns and binary relationships (prepositions and comparative adjectives)
       - Algorithm (a toy sketch of the EM loop follows slide 8):
         - Each image is represented as a set of image regions.
         - Each image region is represented by a set of features.
         - Classifiers for nouns are based on these features (CA).
         - Classifiers for relationships are based on differential features extracted from pairs of regions (CR).
         - An EM approach is used to learn the noun and relationship models simultaneously:
           - E-step: update the assignments of nouns to image regions, given CA and CR.
           - M-step: update the model parameters (CA and CR), given the updated assignments.
    8. The Generative Model
       [Figure: graphical model for image annotation with nodes Ij, Ik (region features), Ijk (differential features), ns, np (nouns), r (relationship word), and classifier parameters CA, CR.]
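A toy sketch of the EM loop described in slide 7. The dict-based image format, the diagonal Gaussians, and the hard (argmax) assignments are illustrative assumptions; the paper's E-step additionally scores assignments with the relationship classifiers CR over differential features of region pairs, which is omitted here for brevity:

```python
import numpy as np

def hard_em(images, nouns, num_iters=10, seed=0):
    """Toy hard-EM for the correspondence problem. Each image is a dict
    with "regions" (a list of feature vectors) and "nouns" (its label bag).
    Alternate between assigning each noun to a region (E-step) and
    refitting a diagonal Gaussian per noun (M-step)."""
    rng = np.random.default_rng(seed)
    for img in images:  # random initial assignments, as in the paper
        img["assign"] = {n: int(rng.integers(len(img["regions"])))
                         for n in img["nouns"]}
    models = {}
    for _ in range(num_iters):
        # M-step: maximum-likelihood mean/variance of each noun's regions.
        feats = {n: [] for n in nouns}
        for img in images:
            for n, j in img["assign"].items():
                feats[n].append(img["regions"][j])
        models = {n: (np.mean(v, axis=0), np.var(v, axis=0) + 1e-6)
                  for n, v in feats.items() if v}
        # E-step: reassign each noun to its most likely region.
        for img in images:
            for n in img["nouns"]:
                mu, var = models[n]
                ll = [-0.5 * np.sum((f - mu) ** 2 / var + np.log(var))
                      for f in img["regions"]]
                img["assign"][n] = int(np.argmax(ll))
    return models
```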
    9. Learning the Model
       EM approach: simultaneously solve the correspondence problem and learn the parameters of the noun and relationship classifiers.
       E-step: compute the noun assignment using the parameters from the previous iteration (a hedged sketch of this normalization follows this slide).
       P(noun i assigned to region j) = [equation and symbol definitions shown as an image in the original slides]
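The E-step formula on slide 9 was an image in the original deck, so only its general shape can be reconstructed here: the posterior that noun i is assigned to region j is proportional to the noun likelihood of that region's features (with any relationship likelihood terms folded in), normalized over regions. A minimal sketch of that normalization; the flat log-likelihood matrix input is an assumption, and this simplifies the paper's sum over consistent assignments:

```python
import numpy as np

def soft_assignments(logliks):
    """logliks[i, j] = log P(I_j | noun_i, CA), with any relationship
    log-likelihood terms already folded in. Returns a matrix whose row i
    is P(noun i assigned to region j), normalized over regions j."""
    shifted = logliks - logliks.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(shifted)
    return p / p.sum(axis=1, keepdims=True)

# 2 nouns x 3 regions of hypothetical log-likelihoods.
print(soft_assignments(np.array([[-1.0, -3.0, -2.0], [-2.5, -0.5, -4.0]])))
```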
    10. Learning the Model
        [Slide content shown as an image in the original deck.]
    11. Learning the Model
        EM approach: simultaneously solve the correspondence problem and learn the classifier parameters.
        M-step: update the model parameters given the updated assignments from the E-step. The maximum-likelihood parameters depend on the classifier used.
        To utilize contextual information for labeling test images, priors on relationships, P(r | ns, np), are also learned from a co-occurrence table after the relationship annotations are generated.
    12. Inference - Labeling (a brute-force sketch of the constrained labeling follows this slide)
        - Test images are divided into regions. Region j is associated with features Ij and noun nj.
        - We know Ij and we have to estimate nj.
        - The labeling problem is constrained by priors on relationships between pairs of nouns.
        - A Bayesian network is used to represent the labeling problem, and belief propagation is used for inference.
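A minimal sketch of the constrained labeling idea from slide 12: score every joint labeling of the regions by combining per-region noun likelihoods with the learned relationship priors, and keep the best. The paper uses belief propagation on the Bayesian network for inference; the exhaustive enumeration below is only to keep the sketch short and is feasible for just a handful of regions. The likelihood/prior input formats are illustrative assumptions:

```python
from itertools import product

def label_regions(noun_loglik, pair_logprior, nouns, num_regions):
    """noun_loglik[j][n]: log P(I_j | n); pair_logprior[(ns, np)]: log-prior
    for the ordered noun pair (ns, np), maximized over relationship words.
    Returns the joint labeling maximizing likelihood times relationship priors."""
    best, best_score = None, float("-inf")
    for labeling in product(nouns, repeat=num_regions):
        score = sum(noun_loglik[j][n] for j, n in enumerate(labeling))
        score += sum(pair_logprior.get((labeling[j], labeling[k]), 0.0)
                     for j in range(num_regions)
                     for k in range(num_regions) if j != k)
        if score > best_score:
            best, best_score = labeling, score
    return best

# Hypothetical inputs for two regions and the nouns {"sun", "sky"}.
noun_loglik = [{"sun": -0.2, "sky": -3.0}, {"sun": -3.0, "sky": -0.2}]
pair_logprior = {("sun", "sky"): -0.1, ("sky", "sun"): -2.0}  # e.g. in(sun, sky)
print(label_regions(noun_loglik, pair_logprior, ["sun", "sky"], 2))
```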
    13. Experimental Results
        Dataset:
        - Subset of the Corel5k training and test dataset.
        - For training, 850 images with nouns and hand-labeled relationships between a subset of pairs of nouns.
        - Nearest-neighbor and Gaussian-classifier-based likelihood models for nouns are used.
        - A decision-stump-based likelihood model for relationships is used (a stump-fitting sketch follows this slide).
        - 173 nouns.
        - 19 relationships: above, behind, below, beside, more textured, brighter, in, greener, larger, left, near, far from, on top of, more blue, right, similar, smaller, taller, shorter.
        - Image features used (30): area, x, y, boundary/area, convexity, moment of inertia, RGB (3), RGB stdev (3), L*a*b (3), L*a*b stdev (3), mean oriented energy in 30-degree increments (12).
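Slide 13 mentions a decision-stump likelihood model for relationships over differential features. A minimal sketch of fitting one stump, i.e. a threshold on a single differential feature, by picking the feature/threshold/polarity triple that best separates positive from negative examples of a relationship; the training interface is an illustrative assumption:

```python
import numpy as np

def fit_stump(X, y):
    """X: (n, d) differential features for region pairs; y: 1 if the
    relationship holds for the pair, else 0. Returns (feature index,
    threshold, polarity) of the most accurate single-feature split."""
    best, best_acc = (0, 0.0, 1), -1.0
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for polarity in (1, -1):
                pred = (polarity * (X[:, f] - t) > 0).astype(int)
                acc = float(np.mean(pred == y))
                if acc > best_acc:
                    best, best_acc = (f, float(t), polarity), acc
    return best

# Example: "above" tends to correspond to a negative vertical-offset feature.
X = np.array([[-0.4, 0.1], [-0.2, 0.3], [0.5, 0.2], [0.3, -0.1]])
y = np.array([1, 1, 0, 0])
print(fit_stump(X, y))  # expect a split on feature 0
```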
    14. Experimental Results
        Resolution of Correspondence Ambiguities:
        - Evaluated on 150 images randomly sampled from the training dataset.
        - Compared with human labeling.
        - Performance measures:
          - Range of semantics identified: both algorithms give similar performance (left).
          - Frequency correct: the latter algorithm performs better in the number of times a noun is identified (right).
        - Legend: Nouns only; Nouns & Relationships (human); Nouns & Relationships (learned); proposed EM algorithm bootstrapped by IBM Model 1; proposed EM algorithm bootstrapped by Duygulu et al.
    15. Experimental Results
        Reducing Correspondence Ambiguity:
        [Figure: example labelings, Duygulu et al. vs. Beyond Nouns.]
    16. Experimental Results
        Labeling New Images:
        - Dataset: subset of 500 images provided in the Corel5k dataset (images were selected randomly from those annotated with words present in the learned vocabulary).
        - Performance measures:
          - Missed labels (left): computed from St and Sg, where St is the set of annotations provided by the Corel dataset and Sg is the set of annotations generated by the algorithm. Using the proposed Bayesian model, missed labels decrease by 24% (IBM Model 1) and 17% (Duygulu et al.).
          - False labels (right): compared with human observers.
    17. Experimental Results
        Image Labeling: Constrained Bayesian Model
        [Figure: example labelings, Duygulu et al. vs. Beyond Nouns.]
    18. Experimental Results
        Precision-Recall (a small evaluation sketch follows this slide):
        - Precision ratio: the ratio of the number of images correctly annotated with a word to the number of images the algorithm annotated with that word (with respect to human observers).
        - Recall ratio: the ratio of the number of images correctly annotated with a word by the algorithm to the number of images that should have been annotated with that word (with respect to the Corel annotations).
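A small sketch of the measures described in slides 16 and 18, assuming per-image annotation sets are available. The set-based formulas are a straightforward reading of the slide text, not code from the paper; in particular, "missed labels" is interpreted here as the fraction of ground-truth labels the algorithm fails to generate:

```python
def missed_label_ratio(corel_sets, generated_sets):
    """Fraction of ground-truth Corel annotations (St) missing from the
    algorithm's generated annotations (Sg), averaged over images."""
    ratios = [len(st - sg) / len(st)
              for st, sg in zip(corel_sets, generated_sets) if st]
    return sum(ratios) / len(ratios)

def precision_recall(word, predicted_sets, truth_sets):
    """Precision: correctly annotated / all images the algorithm annotated
    with `word`. Recall: correctly annotated / all images that should carry
    `word` according to the reference annotations."""
    predicted = [i for i, s in enumerate(predicted_sets) if word in s]
    relevant = [i for i, s in enumerate(truth_sets) if word in s]
    correct = [i for i in predicted if word in truth_sets[i]]
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall
```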
    19. Conclusion
        - Most approaches to learning visual classifiers from weakly labeled data use a "bag of nouns" model and try to find correspondences using the co-occurrence of image features and nouns. However, correspondence ambiguity remains.
        - This paper proposes an EM-based method to simultaneously learn visual classifiers for nouns, prepositions, and comparative adjectives.
        - Experimental results show that using relationship words helps reduce correspondence ambiguity, and that using a constrained model leads to better labeling performance.
    20. Thank You
    21. Inference - Labeling
        - Test images are divided into regions. Region j is associated with features Ij and noun nj.
        - We know Ij and we have to estimate nj.
        - The labeling problem is constrained by priors on relationships between pairs of nouns.
        - A Bayesian network is used to represent the labeling problem, and belief propagation is used for inference.
        - The word likelihood in an image is given as: [equation shown as an image in the original slides]
