NIPS2009: Understand Visual Scenes - Part 2

  1. 1. A car out of context … 
  2. 2. Modeling object co‐occurrences 
  3. 3. What are the hidden objects? 
  4. 4. What are the hidden objects?  Chance ~ 1/30000
  5. 5. p(O | I) ∝ p(I | O) p(O), where I is the image and O the objects. p(I | O) is the object model; p(O) is the context model. 
  6. 6. p(O | I) ∝ p(I | O) p(O). The context model p(O) can be the full joint, a scene model, or an approximate joint. 
  7. 7. p(O | I) ∝ p(I | O) p(O). Full joint vs. scene model vs. approximate joint. 
  8. 8. p(O | I) ∝ p(I | O) p(O). Scene model: p(O) = Σ_s Π_i p(O_i | S = s) p(S = s), with S ranging over scene categories (e.g., office, street). 
  9. 9. p(O | I) ∝ p(I | O) p(O). Full joint vs. scene model vs. approximate joint. 
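Slide 8's scene-mixture prior is easy to make concrete. A minimal sketch with made-up scene and object probabilities (none of the numbers come from the tutorial):

```python
import numpy as np

# Hypothetical toy setup: 2 scenes (office, street), 4 object classes
# (screen, desk, car, road). All values are invented for illustration.
p_scene = np.array([0.6, 0.4])                            # p(S = s)
p_obj_given_scene = np.array([[0.90, 0.80, 0.10, 0.05],   # office
                              [0.05, 0.10, 0.70, 0.90]])  # street

def context_prior(o):
    """p(O) = sum_s [ prod_i p(O_i | S=s) ] p(S=s) for a binary presence vector o."""
    per_scene = np.prod(np.where(o == 1, p_obj_given_scene,
                                 1.0 - p_obj_given_scene), axis=1)
    return float(per_scene @ p_scene)

print(context_prior(np.array([1, 1, 0, 0])))  # screen + desk: plausible office
print(context_prior(np.array([1, 0, 0, 1])))  # screen + road: implausible under both scenes
```

Conditioning on the scene makes the objects independent, so the joint over thirty-odd object classes collapses to a handful of per-scene products.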
  10. 10. Pixel labeling using MRFs: enforce consistency between neighboring labels, and between labels and pixels. Carbonetto, de Freitas & Barnard, ECCV’04 
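To illustrate the consistency idea (a generic sketch in the spirit of such models, not Carbonetto et al.'s exact formulation): iterated conditional modes on a grid MRF with a Potts smoothness term.

```python
import numpy as np

def icm_labels(unary, beta=1.0, n_iters=5):
    """Greedy MRF labeling: minimize per-pixel label costs plus a Potts
    penalty for neighboring pixels that disagree.

    unary: (H, W, L) array of label costs (e.g. negative log-likelihoods).
    """
    H, W, L = unary.shape
    labels = unary.argmin(axis=2)          # start from independent per-pixel decisions
    for _ in range(n_iters):
        for y in range(H):
            for x in range(W):
                cost = unary[y, x].copy()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < H and 0 <= nx < W:
                        # Potts term: pay beta for each neighbor we disagree with
                        cost += beta * (np.arange(L) != labels[ny, nx])
                labels[y, x] = cost.argmin()
    return labels
```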
  11. 11. Object-Object Relationships. Use latent variables to induce long-distance correlations between labels in a Conditional Random Field (CRF). He, Zemel & Carreira-Perpinan (04) 
  12. 12. Object-Object Relationships. [Kumar & Hebert, 2005] 
  13. 13. Object-Object Relationships. Fink & Perona (NIPS 03): use the output of boosting from other objects at previous iterations as input into boosting for this iteration. 
  14. 14. Object-Object Relationships. Most consistent labeling according to object co-occurrences and local label probabilities (candidate segment labels such as building/boat/person, road, building/boat/motorbike, and water/sky resolve to building, road, boat, water). A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora and S. Belongie. Objects in Context. ICCV 2007 
  15. 15. Objects in Context: contextual refinement. Contextual model based on co-occurrences: find the most consistent labeling with high posterior probability and high mean pairwise interaction, using a CRF. The pairwise interaction Φ(i,j) is essentially the observed label co-occurrences in the training set. 
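A sketch of that rescoring idea under my own simplifications (exhaustive search instead of CRF inference; log_phi stands in for the learned co-occurrence potential Φ):

```python
import numpy as np
from itertools import product

def most_consistent_labeling(log_post, log_phi):
    """Maximize per-segment log-posteriors plus the mean pairwise interaction.

    log_post: (n_segments, n_labels) independent segment classifications.
    log_phi:  (n_labels, n_labels) log co-occurrence affinities from training.
    Brute force over labelings; only practical for a handful of segments.
    """
    n_seg, n_lab = log_post.shape
    pairs = [(i, j) for i in range(n_seg) for j in range(i + 1, n_seg)]
    best, best_score = None, -np.inf
    for labeling in product(range(n_lab), repeat=n_seg):
        score = sum(log_post[i, l] for i, l in enumerate(labeling))
        if pairs:  # mean interaction over all label pairs
            score += np.mean([log_phi[labeling[i], labeling[j]] for i, j in pairs])
        if score > best_score:
            best, best_score = labeling, score
    return best
```

With this scoring, a segment locally classified as "boat" can survive even when its own posterior is weak, as long as "boat" co-occurs well with "water" and "building" in the training set.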
  16. 16. Using stuff to find things. Heitz and Koller, ECCV 2008. In this work there is no labeling of stuff; instead, they look for clusters of textures and model how each cluster correlates with the target object. 
  17. 17. What, where and who? Classifying events by scene and object recognition. Slide by Fei-Fei. L.-J. Li & L. Fei-Fei, ICCV 2007 
  18. 18. What, who, where. Slide by Fei-Fei. L.-J. Li & L. Fei-Fei, ICCV 2007 
  19. 19. Grammars: Guzman (SEE), 1968; Noton and Stark, 1971; Hansen & Riseman (VISIONS), 1978; Barrow & Tenenbaum, 1978; Brooks (ACRONYM), 1979; [Ohta & Kanade, 1978]; Marr, 1982; Yakimovsky & Feldman, 1973 
  20. 20. Grammars for objects and scenes S.C. Zhu and D. Mumford. A Stochastic Grammar of Images. Foundations and Trends in Computer Graphics and Vision, 2006.
  21. 21. 3D scenes
  22. 22. We are wired for 3D: our two eyes sit ~6 cm apart. 
  23. 23. We cannot shut down 3D perception. (c) 2006 Walt Anthony 
  24. 24. 3D drives perception of important object attributes. "Turning the Tables" by Roger Shepard: depth processing is automatic, and we cannot shut it down. 
  25. 25. Manhattan World. Coughlan & Yuille, 2003. Slide by James Coughlan 
  26. 26. Coughlan & Yuille, 2003. Slide by James Coughlan 
  27. 27. Coughlan & Yuille, 2003. Slide by James Coughlan 
  28. 28. Single view metrology. Criminisi et al., 1999. Need to recover: ground plane, reference height, horizon line, and where objects contact the ground. 
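Once those quantities are known, the height of a ground-supported object follows from similar triangles. A hedged sketch in my own notation (not Criminisi's formulation):

```python
def object_height(y_top, y_bottom, y_horizon, camera_height):
    """Height of an object standing on the ground plane, from one view.

    Image y grows downward. At the object's depth, the camera height
    projects to the image span (y_bottom - y_horizon), so world heights
    scale with image distances measured from the ground contact point.
    """
    return camera_height * (y_bottom - y_top) / (y_bottom - y_horizon)

# Feet 100 px below the horizon, head 10 px above it, camera at 1.70 m:
print(object_height(y_top=290, y_bottom=400, y_horizon=300,
                    camera_height=1.70))   # ~1.87 m
```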
  29. 29. 3D scene context: from image to world. Hoiem, Efros, Hebert, ICCV 2005 
  30. 30. 3D scene context (pedestrian and car positions, in meters). Hoiem, Efros, Hebert, ICCV 2005 
  31. 31. Qualitative results. Car and pedestrian detections (TP/FP): initial, 2 TP / 3 FP; final, 7 TP / 4 FP. Slide by Derek Hoiem. Local detector from [Murphy, Torralba & Freeman, 2003] 
  32. 32. 3D City Modeling using Cognitive Loops N. Cornelis, B. Leibe, K. Cornelis, L. Van Gool. CVPR06
  33. 33. 3D from pixel values. D. Hoiem, A. A. Efros, and M. Hebert, "Automatic Photo Pop-up", SIGGRAPH 2005. A. Saxena, M. Sun, A. Y. Ng, "Learning 3-D Scene Structure from a Single Still Image", ICCV Workshop on 3D Representation for Recognition (3dRR-07), 2007. 
  34. 34. Surface estimation: label image regions as support, vertical (left / center / right / porous / solid), or sky. [Hoiem, Efros, Hebert, ICCV 2005]. Slide by Derek Hoiem 
  35. 35. Object Support Slide by Derek Hoiem
  36. 36. Qualitative 3D relationships. Gupta & Davis, ECCV 2008 
  37. 37. Large databases: algorithms that rely on millions of images. 
  38. 38. Data. Human vision: many input modalities; active; supervised, unsupervised, and semi-supervised learning (it can look for supervision). Robot vision: many poor input modalities; active, but it does not go far. Internet vision: many input modalities; it can reach everywhere; tons of data. 
  39. 39. The two extremes of learning. Few training samples (~1-10): the extrapolation problem — generalization, diagnostic features, transfer learning, priors. Many samples (10^2-10^6 and beyond): the interpolation problem — correspondence, finding the differences, classifiers, label transfer. (Axis: number of training samples, from 1 to ∞.) 
  40. 40. Nearest neighbors: given an input image, retrieve similar scenes and transfer their annotations (labels, motion, depth, ...). Hays, Efros, SIGGRAPH 2007; Russell, Liu, Torralba, Fergus, Freeman, NIPS 2007; Divvala, Efros, Hebert, 2008; Malisiewicz, Efros, 2008; Torralba, Fergus, Freeman, PAMI 2008; Liu, Yuen, Torralba, CVPR 2009 
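The shared recipe behind these papers is scene-level retrieval followed by transfer of whatever the neighbors carry. A minimal sketch, assuming global descriptors (e.g. gist) are already computed:

```python
import numpy as np

def transfer_annotations(query_desc, db_descs, db_annotations, k=16):
    """Fetch the annotations of the k nearest scenes in descriptor space.

    query_desc:     (D,) global descriptor of the input image.
    db_descs:       (N, D) descriptors of a large annotated collection.
    db_annotations: length-N list of annotations (labels, motion, depth, ...).
    """
    dists = np.linalg.norm(db_descs - query_desc, axis=1)
    nearest = np.argsort(dists)[:k]
    return [db_annotations[i] for i in nearest]
```

With a large enough collection, the neighbors are close enough to the query that their labels, motion fields, or depth maps transfer almost directly.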
  41. 41. The power of large collections. Google Street View (controlled image capture); Photo Tourism / Photosynth [Snavely et al., 2006] (registers images based on multi-view geometry). 
  42. 42. Image completion: instead, generate proposals using millions of images. Input → 16 nearest neighbors (gist + color matching) → output. Hays, Efros, 2007 
  43. 43. im2gps. Instead of object labels, the web provides other kinds of metadata associated with large image collections: 20 million geotagged and geographic text-labeled images. Hays & Efros, CVPR 2008 
  44. 44. im2gps: input image → nearest neighbors → geographic location of the nearest neighbors. Hays & Efros, CVPR 2008 
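A toy stand-in for the geolocation step (the paper estimates modes over the neighbors' coordinates; this crude density vote only captures the spirit):

```python
import numpy as np

def estimate_location(neighbor_latlons, radius_deg=5.0):
    """Return the neighbor coordinate with the most other neighbors nearby.

    neighbor_latlons: (k, 2) lat/lon of the retrieved scene matches.
    A single tight cluster of neighbors means a confident guess; neighbors
    scattered across continents mean the scene is geographically ambiguous.
    """
    pts = np.asarray(neighbor_latlons, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    support = (dists < radius_deg).sum(axis=1)
    return pts[support.argmax()]
```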
  45. 45. Predicting events C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008
  46. 46. Predicting events C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008
  47. 47. Query C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008
  48. 48. Query Retrieved video C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008
  49. 49. Query Retrieved video Synthesized video C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008
  50. 50. Query Retrieved video Synthesized video C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008
  51. 51. Query Synthesized video C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008
  52. 52. Query Retrieved video Synthesized video C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008
  53. 53. Databases and the powers of 10
  54. 54. Datasets and powers of 10 
  55. 55. 10^0 images 
  56. 56. 10^0 images: 1972 
  57. 57. 10^1 images 
  58. 58. 10^1 images: Marr, 1976 
  59. 59. 10^2-10^4 images 
  60. 60. 10^2-10^4 images: the faces-and-cars scale. In 1996, DARPA released 14,000 images from over 1,000 individuals. 
  61. 61. The PASCAL Visual Object Classes. In 2007, the twenty selected object classes were — Person: person. Animal: bird, cat, cow, dog, horse, sheep. Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train. Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor. M. Everingham, L. Van Gool, C. Williams, J. Winn, A. Zisserman, 2007 
  62. 62. 10^2-10^4 images 
  63. 63. 10^5 images 
  64. 64. 10^5 images: Caltech 101 and 256. Fei-Fei, Fergus, Perona, 2004; Griffin, Holub, Perona, 2007 
  65. 65. Lotus Hill Research Institute image corpus. Z. Y. Yao, X. Yang, and S. C. Zhu, 2007 
  66. 66. 10^5 images: LabelMe. Tool went online July 1st, 2005; 530,000 object annotations collected. Labelme.csail.mit.edu. B. C. Russell, A. Torralba, K. P. Murphy, W. T. Freeman, IJCV 2008 
  67. 67. Extreme labeling
  68. 68. The other extreme of extreme labeling … things do not always look good…
  69. 69. Creative testing
  70. 70. 10^5 images 
  71. 71. 10^6-10^7 images: things start getting out of hand. 
  72. 72. 10^6-10^7 images: collecting big datasets. ESP game (CMU): Luis von Ahn and Laura Dabbish, 2004. LabelMe (MIT): Russell, Torralba, Freeman, 2005. StreetScenes (CBCL-MIT): Bileschi, Poggio, 2006. WhatWhere (Caltech): Perona et al., 2007. PASCAL challenge: 2006, 2007. Lotus Hill Institute: Song-Chun Zhu et al., 2007. 80 million images: Torralba, Fergus, Freeman, 2007. 
  73. 73. 10^6-10^7 images: 80,000,000 images. 75,000 non-abstract nouns from WordNet were fed to online image search engines (Google among them); after one year of downloading: 80 million images. A. Torralba, R. Fergus, W. T. Freeman, PAMI 2008 
  74. 74. 10^6-10^7 images: ImageNet. A WordNet-structured hierarchy (animal → shepherd dog, sheep dog → collie, German shepherd); ~10^5+ nodes, ~10^8+ images. Deng, Dong, Socher, Li & Fei-Fei, CVPR 2009 
  75. 75. Labeling for money Alexander Sorokin, David Forsyth, "Utility data annotation with Amazon Mechanical Turk", First IEEE Workshop on Internet Vision at CVPR 08.
  76. 76. 1 cent Task: Label one object in this image
  77. 77. 1 cent Task: Label one object in this image
  78. 78. Why do people do this? From: John Smith <…@yahoo.co.in>. Date: August 22, 2009, 10:18:23 AM EDT. To: Bryan Russell. Subject: Re: Regarding Amazon Mechanical Turk HIT RX5WVKGA9W. "Dear Mr. Bryan, I am awaiting for your HITs. Please help us with more. Thanks & Regards." 
  79. 79. 10^6-10^7 images 
  80. 80. 10^8-10^11 images 
  81. 81. 10^8-10^11 images 
  82. 82. 10^8-10^11 images 
  83. 83. Canonical perspective. In a recognition task, reaction time correlated with canonical-view ratings: canonical views are recognized faster at the entry level. From Vision Science, Palmer 
  84. 84. 3D object categorization. Although we can categorize all three pictures as views of a horse, they do not look equally typical, and they do not seem equally easy to recognize. Photos by Greg Robbins 
  85. 85. 10^8-10^11 images: canonical viewpoint. Interesting biases: datasets are not a uniform sampling over viewpoints (some artificial datasets may contain non-natural statistics). 
  86. 86. 10^8-10^11 images: canonical viewpoint. Interesting biases: clocks are preferred purely frontal. 
  87. 87. >11 10 images ?? ? ?
