A car out of context … 
Modeling object co‐occurrences 
What are the hidden objects? 






         Chance ~ 1/30000
p(O | I) ∝ p(I | O) p(O),      where O = objects, I = image

Object model: p(I | O)          Context model: p(O)



p(O | I) ∝ p(I | O) p(O)

Object model: p(I | O)          Context model: p(O)

Full joint approximated with a scene model:
p(O) = Σ_s p(S = s) Π_i p(O_i | S = s),      e.g. s ∈ {office, street}
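As a concrete illustration of the scene-model approximation p(O) = Σ_s p(S = s) Π_i p(O_i | S = s), here is a toy computation; the scene classes mirror the office/street example above, but every probability value is invented for illustration:

```python
# Toy illustration (made-up numbers) of the scene-mixture context prior
#   p(O) = sum_s p(S = s) * prod_i p(O_i | S = s)

p_scene = {"office": 0.5, "street": 0.5}          # p(S = s)

# p(O_i = present | S = s) for three hypothetical object classes
p_obj_given_s = {
    "keyboard": {"office": 0.90, "street": 0.05},
    "car":      {"office": 0.10, "street": 0.80},
    "chair":    {"office": 0.70, "street": 0.30},
}

def prior_all_present(objects):
    """Marginalize over the scene: sum_s p(s) * prod_i p(O_i | s)."""
    total = 0.0
    for s, ps in p_scene.items():
        prod = 1.0
        for o in objects:
            prod *= p_obj_given_s[o][s]
        total += ps * prod
    return total

print(round(prior_all_present(["keyboard", "car", "chair"]), 4))  # 0.0375
```

Conditioning on the scene makes the object variables independent, which is what turns the intractable full joint into a sum of products.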
Pixel labeling using MRFs 
Enforce consistency between neighboring labels, 
  and between labels and pixels 







                 Carbonetto, de Freitas & Barnard, ECCV’04
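A minimal sketch of the kind of energy such an MRF minimizes: unary terms tie labels to pixel evidence, pairwise Potts terms penalize disagreeing neighbors. This is a generic illustration, not the Carbonetto et al. formulation:

```python
# Generic grid MRF energy for pixel labeling (illustrative sketch):
# unary terms enforce label-pixel consistency, a Potts penalty beta
# enforces consistency between neighboring labels.

def mrf_energy(labels, unary, beta=1.0):
    """labels: 2D grid of ints; unary[r][c][k] = cost of label k at (r, c)."""
    rows, cols = len(labels), len(labels[0])
    e = 0.0
    for r in range(rows):
        for c in range(cols):
            e += unary[r][c][labels[r][c]]              # label-pixel consistency
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                e += beta                               # vertical neighbor disagreement
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                e += beta                               # horizontal neighbor disagreement
    return e

# With uninformative unaries, a smooth labeling has lower energy than a noisy one:
flat_unary = [[[0.5, 0.5] for _ in range(3)] for _ in range(3)]
smooth = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
noisy  = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(mrf_energy(smooth, flat_unary) < mrf_energy(noisy, flat_unary))  # True
```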
Object-Object Relationships
Use latent variables to induce long-distance correlations
  between labels in a Conditional Random Field (CRF)




                                      He, Zemel & Carreira-Perpinan (04)
Object-Object Relationships




       [Kumar Hebert 2005]
Object-Object Relationships
•  Fink & Perona (NIPS 03)
Use the output of boosting from other objects at previous
  iterations as input into boosting for this iteration
Object-Object Relationships

[Figure: candidate segment labels such as "building, boat, person", "building, boat, motorbike", "road", "water, sky"; the most consistent labeling according to object co-occurrences and local label probabilities selects building, road, boat, water.]

A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora and S. Belongie. Objects in Context. ICCV 2007
Objects in Context: Contextual Refinement

Contextual model based on co-occurrences: find the most consistent labeling with high posterior probability and high mean pairwise interaction, using a CRF. Φ(i, j) is essentially the observed label co-occurrence in the training set; the CRF combines the independent segment classifications with the mean interaction of all label pairs.

[Figure: segments labeled building, road, boat, water.]
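The refinement step can be sketched as rescoring candidate labelings by local posteriors plus the mean pairwise co-occurrence. The Φ values and classifier scores below are invented for illustration, not taken from the paper:

```python
# Illustrative sketch of contextual refinement: combine per-segment
# posteriors with the mean pairwise co-occurrence phi(i, j) from training.
import itertools
import math

# phi[(a, b)]: toy co-occurrence strengths (symmetric, made-up values)
phi = {
    ("building", "road"): 0.9, ("building", "boat"): 0.6, ("building", "water"): 0.5,
    ("boat", "water"): 0.95,   ("boat", "road"): 0.3,     ("road", "water"): 0.4,
    ("building", "cow"): 0.05, ("boat", "cow"): 0.01,
    ("road", "cow"): 0.2,      ("water", "cow"): 0.1,
}

def co(a, b):
    return phi.get((a, b), phi.get((b, a), 0.0))

def score(labeling, local_probs, alpha=1.0):
    """Log local posterior + alpha * mean pairwise co-occurrence."""
    s = sum(math.log(p) for p in local_probs)
    pairs = list(itertools.combinations(labeling, 2))
    return s + alpha * sum(co(a, b) for a, b in pairs) / len(pairs)

# The independent classifier slightly prefers "cow" for the middle segment,
# but context flips it to "boat", which co-occurs with building and water:
option_a = (["building", "cow", "water"],  [0.8, 0.5, 0.9])
option_b = (["building", "boat", "water"], [0.8, 0.4, 0.9])
print(score(*option_b) > score(*option_a))  # True
```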
Using stuff to find things
                         Heitz and Koller, ECCV 2008

In this work, there is no labeling for stuff. Instead, they look for clusters of
 textures and model how each cluster correlates with the target object.
What, where and who? Classifying events by scene and object recognition

Slide by Fei-Fei                     L.-J. Li & L. Fei-Fei, ICCV 2007

what     who          where




Slide by Fei-Fei         L.-J. Li & L. Fei-Fei, ICCV 2007
Grammars




                              Guzman (SEE), 1968
                              Noton and Stark 1971
                              Hansen & Riseman (VISIONS), 1978
                              Barrow & Tenenbaum 1978
                              Brooks (ACRONYM), 1979
[Ohta & Kanade 1978]          Marr, 1982
                              Yakimovsky & Feldman, 1973
Grammars for objects and scenes




         S.C. Zhu and D. Mumford. A Stochastic Grammar of Images.
         Foundations and Trends in Computer Graphics and Vision, 2006.
3D scenes
We are wired for 3D
       ~6cm
We cannot shut down 3D perception




                   (c) 2006 Walt Anthony
3D drives perception of important
        object attributes




                         by Roger Shepard (”Turning the Tables”)


     Depth processing is automatic, and we cannot shut it down…
Manhattan World




Slide by James Coughlan                     Coughlan, Yuille. 2003
Single view metrology
     Criminisi, et al. 1999




                   Need to recover:
                  •  Ground plane
                  •  Reference height
                  •  Horizon line
                  •  Where objects contact the
                    ground
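Under the simplifying assumption of a level camera, the recovered quantities (horizon line, ground contact, reference height) combine into a one-line height estimate. This is a sketch in the spirit of single-view metrology, with hypothetical numbers:

```python
# Back-of-the-envelope single-view measurement (toy version):
# for an object standing on the ground plane, seen by a roughly level
# camera of known height, with image rows measured from the top,
#   H_obj = H_cam * (v_bottom - v_top) / (v_bottom - v_horizon)

def object_height(v_top, v_bottom, v_horizon, camera_height):
    """Height of a ground-plane object from image rows and the horizon."""
    return camera_height * (v_bottom - v_top) / (v_bottom - v_horizon)

# Hypothetical example: horizon at row 200, camera 1.6 m above the ground;
# a pedestrian spanning rows 195..295 comes out at about 1.68 m tall.
h = object_height(v_top=195, v_bottom=295, v_horizon=200, camera_height=1.6)
print(round(h, 2))  # 1.68
```

The same ratio is what lets a single detected object (or the camera height) calibrate the scale of every other object resting on the ground.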
3D Scene Context




Image            World



                  Hoiem, Efros, Hebert ICCV 2005
3D scene context




[Figure: detections (Ped, Car) placed in world coordinates, both axes in meters.]

                  Hoiem, Efros, Hebert ICCV 2005
Qualitative Results
                                      Car: TP / FP Ped: TP / FP




               Initial: 2 TP / 3 FP                               Final: 7 TP / 4 FP



Slide by Derek Hoiem                           Local Detector from [Murphy-Torralba-Freeman 2003]
3D City Modeling using Cognitive Loops




                   N. Cornelis, B. Leibe, K. Cornelis, L. Van Gool. CVPR'06
3D from pixel values
D. Hoiem, A.A. Efros, and M. Hebert, "Automatic Photo Pop-up”. SIGGRAPH 2005.




A. Saxena, M. Sun, A. Y. Ng. "Learning 3-D Scene Structure from a Single Still Image"
In ICCV workshop on 3D Representation for Recognition (3dRR-07), 2007.
Surface Estimation
Image     Support              Vertical           Sky




 V-Left   V-Center             V-Right          V-Porous         V-Solid


                     Object
                    Surface?
                                          [Hoiem, Efros, Hebert ICCV 2005]
                    Support?
                                                           Slide by Derek Hoiem
Object Support




                 Slide by Derek Hoiem
Qualitative 3D relationships




Gupta & Davis, ECCV 2008
Large databases
Algorithms that rely on millions of images
Data
Human vision
• Many input modalities
• Active
• Supervised, unsupervised, and semi-supervised
learning; it can look for supervision.




Robot vision
• Many poor input modalities
• Active, but it does not go far


Internet vision
• Many input modalities
• It can reach everywhere
• Tons of data
The two extremes of learning
Extrapolation problem                     Interpolation problem
    Generalization                           Correspondence
  Diagnostic features                     Finding the differences




     1    10    10²    10³    10⁴    10⁵    10⁶    …    ∞      Number of training samples




  Transfer learning
     Classifiers
                                                Label transfer
       Priors
Nearest neighbors
Input image → nearest neighbors, each annotated with:
•  Labels
•  Motion
•  Depth
•  …


  Hays, Efros, Siggraph 2007 
  Russell, Liu, Torralba, Fergus, Freeman. NIPS 2007 
  Divvala, Efros, Hebert, 2008 
  Malisiewicz, Efros 2008 
  Torralba, Fergus, Freeman, PAMI 2008 
  Liu, Yuen, Torralba, CVPR 2009 
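The common recipe behind these papers can be sketched as follows: describe each image with a global feature vector, retrieve the closest annotated images, and transfer their annotations to the query. The "features" here are tiny made-up vectors standing in for gist descriptors:

```python
# Schematic nearest-neighbor label transfer (toy features, not real gist).
import math
from collections import Counter

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical annotated database: (feature vector, scene label)
database = [
    ([0.9, 0.1, 0.2], "street"),
    ([0.8, 0.2, 0.1], "street"),
    ([0.1, 0.9, 0.8], "beach"),
    ([0.2, 0.8, 0.9], "beach"),
    ([0.85, 0.15, 0.25], "street"),
]

def transfer_label(query, k=3):
    """Majority vote over the k nearest annotated images."""
    neighbors = sorted(database, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(transfer_label([0.88, 0.12, 0.2]))  # street
```

With millions of images the neighbors become dense enough that labels, motion, or depth can all be transferred the same way.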
The power of large collections




   Google Street View        PhotoTourism/PhotoSynth
(controlled image capture)          [Snavely et al., 2006]
                                 (registers images based on
                                    multi-view geometry)
Image completion




Instead, generate proposals using millions of images




         Input                  16 nearest neighbors    output
                                (gist+color matching)   Hays, Efros, 2007
im2gps
Instead of using object labels, the web provides other kinds of metadata associated
 with large collections of images




          20 million geotagged and geographic text-labeled images




                                                           Hays & Efros. CVPR 2008
im2gps
                            Hays & Efros. CVPR 2008




Input image
              Nearest neighbors    Geographic location of the nearest neighbors
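A toy version of the im2gps idea, with all features and coordinates invented: estimate a photo's location by matching it against geotagged images and returning its neighbors' positions.

```python
# Toy im2gps sketch: nearest-neighbor geolocation in feature space.
import math

# Hypothetical geotagged database: (feature vector, (lat, lon))
geodb = [
    ([0.9, 0.1], (48.86, 2.35)),      # Paris-like scene
    ([0.85, 0.2], (48.85, 2.34)),     # another Paris-like scene
    ([0.1, 0.9], (36.06, -112.14)),   # canyon-like scene
]

def locate(query, k=2):
    """Mean lat/lon of the k nearest neighbors in feature space."""
    nn = sorted(geodb, key=lambda item: math.dist(query, item[0]))[:k]
    lats = [lat for _, (lat, _) in nn]
    lons = [lon for _, (_, lon) in nn]
    return sum(lats) / k, sum(lons) / k

lat, lon = locate([0.88, 0.15])
print(round(lat, 3), round(lon, 3))  # 48.855 2.345
```

The real system keeps the full distribution of neighbor locations rather than a single mean, which exposes when a query is geographically ambiguous.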
Predicting events




      C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008
Query                             Retrieved video




        Synthesized video
          C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008
Datasets and the Powers of 10
10⁰ images                                           1972
10¹ images                                           Marr, 1976
The faces and cars scale                             10²–10⁴ images

In 1996 DARPA released 14,000 images
from over 1,000 individuals.
The PASCAL Visual Object Classes  
In 2007, the twenty selected object classes were:

Person: person
Animal: bird, cat, cow, dog, horse, sheep
Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor




              M. Everingham, Luc van Gool , C. Williams, J. Winn, A. Zisserman 2007
Caltech 101 and 256                                  10⁵ images

Griffin, Holub, Perona, 2007
Fei-Fei, Fergus, Perona, 2004
Lotus Hill Research Institute image corpus




                       Z.Y. Yao, X. Yang, and S.C. Zhu, 2007
                   LabelMe                                           10⁵ images




Tool went online July 1st, 2005
530,000 object annotations collected
Labelme.csail.mit.edu         B.C. Russell, A. Torralba, K.P. Murphy, W.T. Freeman, IJCV 2008
Extreme labeling
The other extreme of extreme labeling


        … things do not always look good…
Creative testing
10⁶–10⁷ images

Things start getting out of hand
                     Collecting big datasets         10⁶–10⁷ images


•  ESP game (CMU)
Luis Von Ahn and Laura Dabbish 2004

•  LabelMe (MIT)
Russell, Torralba, Freeman, 2005


•  StreetScenes (CBCL-MIT)
Bileschi, Poggio, 2006


•  WhatWhere (Caltech)
Perona et al, 2007

•  PASCAL challenge
2006, 2007

•  Lotus Hill Institute
Song-Chun Zhu et al, 2007


•  80 million images
Torralba, Fergus, Freeman, 2007
80,000,000 images                                    10⁶–10⁷ images
75,000 non-abstract nouns from WordNet
7 online image search engines

                     And after 1 year downloading images:
       Google: 80 million images




                                                 A. Torralba, R. Fergus, W.T. Freeman. PAMI 2008
                                                     10⁶–10⁷ images

[Figure: WordNet hierarchy, e.g. animal → shepherd dog, sheep dog → collie / German shepherd]

~10⁵+ nodes
~10⁸+ images


                        Deng, Dong, Socher, Li & Fei-Fei, CVPR 2009
Labeling for money




    Alexander Sorokin, David Forsyth, "Utility data annotation with Amazon
    Mechanical Turk", First IEEE Workshop on Internet Vision at CVPR 08.
1 cent 
Task: Label one object in this image
Why do people do this?

From: John Smith <…@yahoo.co.in>
Date: August 22, 2009 10:18:23 AM EDT
To: Bryan Russell
Subject: Re: Regarding Amazon Mechanical Turk HIT RX5WVKGA9W

Dear Mr. Bryan,
I am awaiting for your HITS. Please help us with more.
Thanks & Regards
10⁸–10¹¹ images
Canonical Perspective
Examples of canonical perspective:

In a recognition task, reaction time
 correlated with the ratings.
Canonical views are recognized faster
at the entry level.




                                        From Vision Science, Palmer
3D object categorization
Although we can categorize all three
pictures as views of a horse, they do
not look like equally typical views of
horses, and they do not seem to be
recognizable with the same ease.


                                      by Greg Robbins
                     Canonical Viewpoint                          10⁸–10¹¹ images
                                  Interesting biases…

Viewpoints are not sampled uniformly
(some artificial datasets might contain non-natural statistics)
                  Canonical Viewpoint                10⁸–10¹¹ images
                               Interesting biases…

Clocks are preferred as purely frontal
>10¹¹ images

    ?       ?       ?       ?
NIPS 2009: Understanding Visual Scenes - Part 2
