This document discusses how large image datasets supply the context needed to understand scenes and objects. It proposes mining millions of internet images to generate proposals for image completion and labeling from a query's nearest visual neighbors. Location metadata from geotagged images can provide geographic context even when no object labels are available. Event prediction and video synthesis are demonstrated by retrieving relevant images from large collections and assembling them into new videos from a text query. Overall, it argues that internet-scale image collections provide rich context that computer vision tasks can leverage through data-driven retrieval rather than explicit modeling.
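To make the retrieval step concrete, here is a minimal sketch of nearest-visual-neighbor lookup over an image collection. The directory name `image_collection/`, the query file `query.jpg`, and the simplistic "tiny image" descriptor are all illustrative assumptions, not details from the document; systems operating at internet scale would use richer scene descriptors and approximate indexing rather than the exact brute-force search shown here.

```python
# A minimal sketch of data-driven nearest-neighbor retrieval.
# Hypothetical inputs: a directory "image_collection/" of JPEGs and a
# query image "query.jpg"; the descriptor is a stand-in, not the
# document's actual feature representation.
from pathlib import Path

import numpy as np
from PIL import Image
from sklearn.neighbors import NearestNeighbors


def tiny_image_descriptor(path, size=(16, 16)):
    """Downsample an image to a small grayscale patch and flatten it."""
    img = Image.open(path).convert("L").resize(size)
    vec = np.asarray(img, dtype=np.float32).ravel()
    # Normalize so retrieval is insensitive to global brightness/contrast.
    vec -= vec.mean()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


# Index the collection: one descriptor per image.
paths = sorted(Path("image_collection").glob("*.jpg"))
descriptors = np.stack([tiny_image_descriptor(p) for p in paths])
index = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(descriptors)

# Retrieve the visually closest images for a query. Their pixels (or
# attached metadata such as geotags) then serve as proposals for
# completion, labeling, or localization.
query = tiny_image_descriptor("query.jpg")
distances, neighbor_ids = index.kneighbors(query[None, :])
for d, i in zip(distances[0], neighbor_ids[0]):
    print(f"{paths[i]}  (distance {d:.3f})")
```

The same retrieval pattern covers each task in the summary: for completion, neighbor pixels fill the missing region; for labeling or geolocation, the neighbors' annotations or geotags are transferred to the query.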