NIPS2009: Understand Visual Scenes - Part 1

723 views
632 views

Published on

Published in: Education
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
723
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
27
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

NIPS2009: Understand Visual Scenes - Part 1

  1. 1. Understanding Visual Scenes Antonio Torralba Computer Science and Artificial Intelligence Laboratory (CSAIL) Department of Electrical Engineering and Computer Science
  2. 2. SKY TREE PERSON BENCH PERSON PATH LAKE PERSON DUCK PERSON DUCKSIGN DUCK GRASS
  3. 3. A VIEW OF A PARK ON A NICE SPRING DAY
  4. 4. PEOPLE WALKING IN THE PARKDo not PERSON FEEDING feed DUCKS IN THE PARKthe ducks DUCKS LOOKING FOR FOOD sign
  5. 5. PEOPLE UNDER THESHADOW OF THE TREES DUCKS ON TOP OF THE GRASS
  6. 6. Why do we care about recognition?Perception of function: We can perceive the 3D shape, texture, material properties, without knowing about objects.But, the concept of category encapsulates also information about what can we do with those objects.“We therefore include the perception of function as a proper –indeed, crucial- subject for vision science”, from Vision Science, chapter 9, Palmer.
  7. 7. The perception of function•  Direct perception (affordances): Gibson Flat surface Horizontal Sittable Knee-high upon …•  Mediated perception (Categorization) Flat surface Horizontal Chair Sittable Knee-high upon … Chair Chair Chair?
  8. 8. Direct perceptionSome aspects of an object function can be perceived directly•  Functional form: Some forms clearly indicate to a function (“sittable-upon”, container, cutting device, …) Sittable-upon Sittable-upon It does not seem easy to sit-upon this… Sittable-upon
  9. 9. Scenes, as objects, also have affordances
  10. 10. The function of the scene
  11. 11. Direct perceptionSome aspects of an object function can be perceived directly•  Observer relativity: Function is observer dependent From http://lastchancerescueflint.org
  12. 12. Limitations of Direct PerceptionObjects of similar structure might have very different functions Not all functions seem to be available from direct visual information only.
  13. 13. Limitations of Direct Perception
  14. 14. Object detection and recognitionShort overview of current approaches
  15. 15. Object recognition Is it really so hard? Find the chair in this image Output of normalized correlationThis is a chair
  16. 16. Object recognition Is it really so hard?Find the chair in this image Pretty much garbage Simple template matching is not going to make it
  17. 17. So, let’s make the problem simpler: Block world Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006
  18. 18. Binford and generalized Recognition by cylinders components Irving Biederman Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, 1987. Object Recognition in the Geometric Era: Introduced in computer vision by A. Pentland, 1986. a Retrospective. Joseph L. Mundy. 2006
  19. 19. Families of recognition algorithms Voting models Shape matching Bag of words models Deformable models Viola and Jones, ICCV 2001 Berg, Berg, Malik, 2005Csurka, Dance, Fan, Willamowski, and Heisele, Poggio, et. al., NIPS 01 Cootes, Edwards, Taylor, 2001Bray 2004 Schneiderman, Kanade 2004Sivic, Russell, Freeman, Zisserman, Vidal-Naquet, Ullman 2003ICCV 2005 Rigid template models Constellation models Fischler and Elschlager, 1973 Sirovich and Kirby 1987 Turk, Pentland, 1991 Burl, Leung, and Perona, 1995 Weber, Welling, and Perona, 2000 Dalal & Triggs, 2006 Fergus, Perona, & Zisserman, CVPR 2003
  20. 20. The face age Feret dataset, 1996 DARPA•  The representation and matching of pictorial structures Fischler,Elschlager (1973).•  Face recognition using eigenfaces M. Turk and A. Pentland (1991).•  Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995)•  Graded Learning for Object Detection - Fleuret, Geman (1999)•  Robust Real-time Object Detection - Viola, Jones (2001)•  Feature Reduction and Hierarchy of Classifiers for Fast Object Detectionin Video Images - Heisele, Serre, Mukherjee, Poggio (2001)• ….
  21. 21. Face detection
  22. 22. Haar-like filters and cascadesViola and Jones, ICCV 2001 The average intensity in the block is computed with four sums independently of the block size.Also Fleuret and Geman, 2001
  23. 23. Generic objects: Edge based descriptors Gavrila, Philomin, ICCV 1999 Papageorgiou & Poggio (2000)J. Shotton, A. Blake, R. Cipolla. PAMI 2008. Opelt, Pinz, Zisserman, ECCV 2006
  24. 24. Histograms of oriented gradients Shape contextSIFT, D. Lowe, ICCV 1999 Belongie, Malik, Puzicha, NIPS 2000
  25. 25. Histograms of oriented gradients Dalal & Trigs, 2006 x Not a person x person
  26. 26. Adding partsFelzenszwalb, McAllester, Ramanan. 2008.
  27. 27. Adding parts Felzenszwalb, McAllester, Ramanan. 2008.
  28. 28. Felzenszwalb, McAllester, Ramanan. 2008.
  29. 29. Evaluation of performance Before plotting and ROC or precision-recall curves…The detector challenge: by looking at the output of a detector on a random setof images, can you guess which object is it trying to detect?
  30. 30. What object is detector trying to detect?The detector challenge: by looking at the output of a detector on a random setof images, can you guess which object is it trying to detect?
  31. 31. What object is detector trying to detect? Table detectorThe detector challenge: by looking at the output of a detector on a random setof images, can you guess which object is it trying to detect?
  32. 32. 7 5 6 4 3 1 21. chair, 2. table, 3. road, 4. road, 5. table, 6. car, 7. keyboard.
  33. 33. Some symptoms of standard approaches
  34. 34. Scenes rule over objects3D percept is driven by the scene, which imposes its ruling to the objects
  35. 35. Scene recognitionThe gist of the scene
  36. 36. Mary Potter (1976)Mary Potter (1975, 1976) demonstrated that during a rapid sequentialvisual presentation (100 msec per image), a novel picture is instantlyunderstood and observers seem to comprehend a lot of visualinformation
  37. 37. Demo : Rapid image understandingBy Aude Oliva Instructions: 9 photographs will be shown for half a second each. Your task is to memorize these pictures
  38. 38. Memory TestWhich of the following pictures have you seen ? If you have seen the image clap your hands once If you have not seen the image do nothing
  39. 39. Have you seen this picture ?
  40. 40. NO
  41. 41. Have you seen this picture ?
  42. 42. NO
  43. 43. Have you seen this picture ?
  44. 44. NO
  45. 45. Have you seen this picture ?
  46. 46. NO
  47. 47. Have you seen this picture ?
  48. 48. Yes
  49. 49. Have you seen this picture ?
  50. 50. NO
  51. 51. You have seen these picturesYou were tested with these pictures
  52. 52. The gist of the sceneIn a glance, we remember the meaning of an image and its global layout but some objects and details are forgotten
  53. 53. From objects to scenes SceneType 2 {street, office, …} SObject localization O1 O1 O1 O1 O2 O2 O2 O2 Local features L L L L Riesenhuber & Poggio (99); Vidal -Naquet & Ullman (03); Serre & Poggio, (05); Agarwal & Roth, (02), Moghaddam, Pentland (97), Turk, Pentland (91),Vidal-Naquet, Ullman, I (03) Heisele, et al, (01), Agarwal & Image Roth, (02), Kremp, Geman, Amit (02), Dorko, Schmid, (03) Fergus, Perona, Zisserman (03), Fei Fei, Fergus, Perona, (03), Schneiderman, Kanade (00), Lowe (99)
  54. 54. What makes scenes different? Ceiling Ceiling Lamp wall Light painting Door Door Painting mirror Door mirrorWall Door Wall Door wall wall Lamp phone Fireplace Bed alarm armchair armchair Floor Side-table Coffee table carpet Different objects, different spatial layout
  55. 55. What makes scenes different? ceiling lamp lampcabinet glasses mirror Window mirror mirror shelves wall beer dishes faucet faucet towel counter sink counter sink soap sink counter cabinet cabinet cabinet cabinet stool Different objects, similar spatial layout
  56. 56. What makes scenes different? ceiling ceiling window table wall window table table chairwindow wall table table table window chair table chair table chair table chair chair table table chair chair table bottle chair chair chair table chair table floor table chair chair floor chair chair Similar objects, different spatial layout
  57. 57. What makes scenes different? ceiling ceiling window table wall window table table chairwindow wall table table table window chair table chair table chair table chair chair table table chair chair table bottle chair chair chair table chair table floor table chair chair floor chair chair Similar objects, different spatial layout
  58. 58. What makes scenes different? cabinets ceiling cabinets ceiling ceiling cabinets cabinets wall screenwindow window window column seat seat window seat seat window seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat seat Similar objects, and similar spatial layout Different lighting, different materials, very specific object categories
  59. 59. What can be an alternative to objects?
  60. 60. Scene emergent features“Recognition via features that are not those of individual objects but “emerge” as objects are brought into relation to each other to form a scene.” – Biederman 81 From “on the semantics of a glance at a scene”, Biederman, 1981
  61. 61. Examples of scene emergent features Suggestive edges and junctions Simple geometric forms Blobs Textures
  62. 62. Ensemble statisticsAriely, 2001, Seeing sets: Representation by statistical properties Chong, Treisman, 2003, Representation of statistical properties Alvarez, Oliva, 2008, 2009, Spatial ensemble statistics Conclusion: observers had more accurate representation of the mean than of the individual members of the set.
  63. 63. From scenes to objects SceneType 2 {street, office, …} SObject localization O1 O1 O1 O1 O2 O2 O2 O2 G •  Scene emergent Local features L L L L •  Ensemble statistics •  Global features I Image
  64. 64. How far can we go without objects? SceneType 2 {street, office, …} S G •  Scene emergent Local features L L L L •  Ensemble statistics •  Global features I Image
  65. 65. Global image descriptors
  66. 66. Global image descriptorsBag of words Spatially organized textures M. Gorkani, R. Picard, ICPR 1994 A. Oliva, A. Torralba, IJCV 2001Sivic et. al., ICCV 2005Fei-Fei and Perona, CVPR 2005Non localized textonsWalker, Malik. Vision Research 2004 … S. Lazebnik, et al, CVPR 2006 …R. Datta, D. Joshi, J. Li, and J. Z. Wang, Image Retrieval: Ideas, Influences, and Trends of the New Age,ACM Computing Surveys, vol. 40, no. 2, pp. 5:1-60, 2008.
  67. 67. Gist descriptor Oliva and Torralba, 2001 •  Apply oriented Gabor filters over different scales •  Average filter energy in each bin 8 orientations 4 scales x 16 bins 512 dimensions Similar to SIFT (Lowe 1999) applied to the entire imageM. Gorkani, R. Picard, ICPR 1994; Walker, Malik. Vision Research 2004; Vogel et al. 2004;Fei-Fei and Perona, CVPR 2005; S. Lazebnik, et al, CVPR 2006; …
  68. 68. TextonsFilter bank K-means (100 clusters) Malik, Belongie, Shi, Leung, 1999 Walker, Malik, 2004
  69. 69. Bag of words & spatial pyramid matchingSivic, Zisserman, 2003. Visual words = Kmeans of SIFT descriptors S. Lazebnik, et al, CVPR 2006
  70. 70. The 15-scenes benchmark Oliva & Torralba, 2001 Fei Fei & Perona, 2005 Lazebnik, et al 2006 OfficeSkyscrapers Suburb Building facade Coast Forest Bedroom Living room Industrial Street Highway Mountain Open country Kitchen Store
  71. 71. Scene recognition100 training samples per classSVM classifier in both cases Human performance
  72. 72. Large Scale Scene Recognition Urban  Nature  Indoor ~1,000 categories >130,000 images >12,000 fully annotated images  Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
  73. 73. Performance with 400 categories Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
  74. 74. Training images AbbeyAirplane cabinAirport terminal AlleyAmphitheater Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
  75. 75. Training images Correct classifications AbbeyAirplane cabinAirport terminal AlleyAmphitheater Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
  76. 76. Training images Correct classifications Miss-classifications Monastery Cathedral Castle Abbey Toy shop Van DiscothequeAirplane cabin Subway Stage RestaurantAirport terminal Restaurant Courtyard Canal patio Alley Harbor Coast Athletic fieldAmphitheater Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
  77. 77. Example of three different scenes RIVER BEACH  VILLAGE  Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
  78. 78. But they are all part of the same picture Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
  79. 79. But they are all part of the same picture Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
  80. 80. Scene detection Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
  81. 81. Categories or a continuous space? Check poster by Malisiewicz, Efros 
  82. 82. Categories or a continuous space? From the city to the mountains in 10 steps
  83. 83. Spatial envelope: a continuous space of scenes Highway Street City centre Tall BuildingDegree of Expansion Degree of Openness Oliva & Torralba, 2001
  84. 84. Spatial envelope: a continuous space of scenes Coast Countryside Forest MountainDegree of Ruggedness Degree of Openness Oliva & Torralba, 2001
  85. 85. Context for object recognition
  86. 86. Who needs context anyway? We can recognize objects even out of context  Banksy
  87. 87. Look‐Alikes by Joan Steiner Even in high resolution, we can not shut down contextual processing and it is hard to recognize the true identities of the elements that compose this scene.
  88. 88. Why is context important? •  Changes the interpretation of an object (or its function)•  Context defines what an unexpected event is
  89. 89. Objects and Scenes Biederman’s violaPons (1981): 
  90. 90. Global precedence Forest Before Trees: The Precedence of Global Features in VisualPerceptionNavon (1977)
  91. 91. Scene recognition without object recognition Scene S g Scene features Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
  92. 92. An integrated model of Scenes, Objects, and Parts Scene S Ncar P(Ncar | S = street) 01 5 N g P(Ncar | S = park) Scene gist features 01 5 N Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
  93. 93. Object retrieval: scene features vs. detector Results using the keyboard detector aloneResults using both the detectorand the global scene features Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
  94. 94. The layered structure of scenes  Assuming a human observer standing on the ground p(x) p(x2|x1)In a display with multiple targets present, the location of one target constraints the ‘y’coordinate of the remaining targets, but not the ‘x’ coordinate. Torralba, Oliva, Castelhano, Henderson. 2006
  95. 95. Context driven object detection Scene S Ncar Zcar P(Ncar | S = street) 01 5 N g Scene gist features Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
  96. 96. An integrated model of Scenes, Objects, and Parts We train a multiview car detector. car Fi p(d | F=1) = N(d | µ1, σ1) p(d | F=0) = N(d | µ0, σ0) dcari xcari N=4 Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
  97. 97. An integrated model of Scenes, Objects, and Parts Scene S Ncar Zcar car Fi g Scene dcari xcari gist features M=4 Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
  98. 98. Two tasks 

×