Progress review1

Compositional Hierarchy for 3D
Object Recognition
Maria Isabel Restrepo

October 26, 2009

Goal

[Figure panels: Geometry, Expected Appearance]

Renderings obtained by Dan Crispell


Goal: Recognition in a 3D World


Compelling Characteristics
POWERFUL GEOMETRIC AND PHOTOMETRIC REPRESENTATION* OF SCENES

✤ It is a 3D, geometric representation that supports discovery of spatial relations

✤ Its appearance is modeled by a Mixture of Gaussians (MOG) to handle illumination variations

✤ Appearance and geometry are automatically learned from multiple images with
calibrated cameras

✤ It is faithful to the scenes: There are no prior assumptions about the model

THESE CHARACTERISTICS ARE IDEAL FOR OBJECT RECOGNITION

* [Pollard and Mundy, CVPR 2007] [Crispell]


Outline

✤ Volumetric appearance model - The Voxel World

✤ Insights on classical recognition methods

✤ Compositional hierarchies
✤ Bienenstock, Geman, Potter, 97; Geman, Chi, 2002; Geman, Jin, CVPR 2006
✤ Fidler & Leonardis, CVPR’07; Fidler, Boben & Leonardis, CVPR 2008
✤ Mundy & Ozcanli, SPIE ’09

✤ Experimental work: Proof of concept

✤ Future work

The Voxel World

Probabilistic representation of 3D scenes based on volumetric units (voxels).

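[Figure: appearance density p(intensity) plotted against intensity]

Surface probability is given by incremental learning:

$$P^{N+1}(X \in S) = P^{N}(X \in S)\,\frac{p^{N}(I_x^{N+1} \mid X \in S)}{p^{N}(I_x^{N+1})}$$

Appearance is modeled as a Mixture of Gaussians:

$$p(I) = \sum_{k=1}^{3} \frac{w_k}{W}\,\frac{1}{\sqrt{2\pi\sigma_k^2}}\,e^{-\frac{(I-\mu_k)^2}{2\sigma_k^2}}$$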
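The two voxel formulas can be sketched in a few lines of Python. This is only an illustration of the update rule and the mixture density as written on the slide; the function names and the numeric values in the usage lines are hypothetical and do not come from the original system.

```python
import math

def mog_density(intensity, weights, means, sigmas):
    """Mixture-of-Gaussians density p(I); weights are normalized by their sum W."""
    total_weight = sum(weights)
    density = 0.0
    for w, mu, sigma in zip(weights, means, sigmas):
        norm = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
        exponent = -((intensity - mu) ** 2) / (2.0 * sigma ** 2)
        density += (w / total_weight) * norm * math.exp(exponent)
    return density

def update_surface_probability(prior, likelihood_given_surface, marginal_likelihood):
    """One incremental step: P^{N+1}(X in S) = P^N(X in S) * p(I | X in S) / p(I)."""
    return prior * likelihood_given_surface / marginal_likelihood

# Hypothetical values: a single-component mixture evaluated at its mean,
# and one update where the observation favors the surface hypothesis.
peak = mog_density(0.5, weights=[1.0], means=[0.5], sigmas=[1.0])
posterior = update_surface_probability(0.2, 1.5, 1.0)
```

An observation that is more likely under the surface hypothesis than on average (ratio above 1) raises the voxel's surface probability; a ratio below 1 lowers it.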


Classical Recognition: Bag of Features
Pipeline: Feature descriptor (e.g. SIFT [Lowe], HOG [Dalal]) → Feature space → Codebook of codewords → Classify (e.g. SVM, Naive Bayes, NN)

Drawbacks:
✤ Disregards spatial information
✤ Large number of features are needed

Many have proposed more complex representations of spatial object structure:
✤ Constellation Models [Weber and Welling et al, Fergus et al] - Complex, few parts
✤ Probabilistic voting [Leibe, Schiele] - Large codebook, complex matching
✤ Hierarchical representations
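A toy sketch of the bag-of-features idea makes the first drawback concrete: the histogram is built from descriptor values only, so the order or location of the features has no effect on it. The codebook and descriptors below are made-up 2D values, not SIFT or HOG output, and the function names are hypothetical.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword closest to the descriptor (squared Euclidean distance)."""
    distances = [sum((d - c) ** 2 for d, c in zip(descriptor, word)) for word in codebook]
    return distances.index(min(distances))

def bag_of_features(descriptors, codebook):
    """Normalized histogram of codeword counts; feature locations are never used."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts]

codebook = [(0.0, 0.0), (1.0, 1.0)]                    # two toy codewords
descriptors = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8)]     # three toy descriptors
histogram = bag_of_features(descriptors, codebook)
shuffled_histogram = bag_of_features(list(reversed(descriptors)), codebook)
```

Both histograms are identical, which is exactly the loss of spatial information that the more structured models listed above try to recover.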


Hierarchical Representations

✤ Address the need for a representation that incorporates geometric coherence
✤ Allow for a more efficient representation
✤ Consistent with biological systems

[Figure: excerpts from prior work on hierarchical models]
Sudderth, Torralba, Freeman & Willsky, "Learning Hierarchical Models of Scenes, Objects, and Parts" (MIT, 2006)
Mikolajczyk, Leibe & Schiele (UK, Switzerland, Germany, 2006)
Jin & Geman (Brown University, CVPR 2006)
Fidler, Boben & Leonardis (U. of Ljubljana, Slovenia, CVPR 2007, 2008)

Prior work by Geman: Efficient Discrimination
[Bienenstock, Geman, Potter, 97], [Geman, Chi, 2002], [Geman, Jin, CVPR 2006]

A COMPOSITIONAL MACHINE:
✤ Probabilistic framework
✤ Hierarchy and reusability
✤ It does not exclude the sharing of subparts
✤ Parts are everywhere, compositions are rare
✤ Need to model relative geometry of parts

[Figure: license plate hierarchy with levels labeled license plates; license numbers; plate boundary; generic letter, generic number; characters, plate sides; parts of characters and plate sides]

Markovian distribution vs. compositional distribution
Efficient discrimination: Markov versus Content-Sensitive dist.

Basic structures
Composition vs. Coincidence
Sampling
Detection

Test set: 385 images, mostly from Logan Airport

[Figure panels: Original image; Zoomed license region; Top object under Markov distribution; Top object under content-sensitive distribution]

[Paper excerpt, Figure 3: Samples from Markov backbone (upper panel, ‘4850’) and compositional distribution (lower panel, ‘8502’)]

Prior Work by Fidler and Leonardis
[Fidler, Berginc, Leonardis CVPR 2006], [Fidler, Leonardis, CVPR 2007], [Fidler, Boben, Leonardis CVPR 2008]

Compositionality and bottom-up learning
✤ Computational efficiency - Scalable
✤ Bottom-up learning: All classes in early layers, then class specific
✤ Models are general and discriminative
✤ Sharing of parts

Have learned complete objects from simple edges

Example of learned whole-object shape models.
S. Fidler, M. Boben, A. Leonardis. Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation. Submitted to a journal.
Images from Fidler webpage



Work by Mundy and Ozcanli
[Mundy, Ozcanli, SPIE 2009]

Composition of Parts
✤ Combine Geman’s and Leonardis’ work into a unified Bayesian framework
✤ Classification of foreground objects: Vehicles
✤ Domain: Low resolution, satellite images

Figure 6. An example of vehicle extrema operator responses (1, 0.5, 90°, dark). The spatial resolution is around 0.7 meters, with about 25 pixels on a vehicle. The operator response is indicated by the cyan dot. The operator kernel extent is indicated in blue. The original grey scale intensity is in the red channel.

Figure 7. The composition of extrema operators. The anisotropic dark operator is composed with one of a bright peak operator. The composition is characterized by distance d and relative orientation θ.

Figure 8. Three primitive extrema operators compose in a Layer 1 node. The central part is (2, 1, -45°, bright), and the second primitive part is (2, 1, -45°, dark). The peak responses of the operators are indicated by cyan pixels. The operator kernel is indicated in blue. The vehicle intensity is in the red channel.

Probabilistic Score:

$$p(c^{i}_{\alpha\alpha} \mid d_{\alpha\alpha}, \theta_{\alpha\alpha}) = \frac{p(d_{\alpha\alpha}, \theta_{\alpha\alpha} \mid c^{i}_{\alpha\alpha})\,P(c^{i}_{\alpha\alpha})}{p(d_{\alpha\alpha}, \theta_{\alpha\alpha})}$$

$$p(d_{\alpha\alpha}, \theta_{\alpha\alpha}) = p(d_{\alpha\alpha}, \theta_{\alpha\alpha} \mid \bar{c}_{\alpha\alpha})\,P(\bar{c}_{\alpha\alpha}) + \sum_{j=0}^{k-1} p(d_{\alpha\alpha}, \theta_{\alpha\alpha} \mid c^{j}_{\alpha\alpha})\,P(c^{j}_{\alpha\alpha})$$

Hierarchical Composition for 3D Objects

Buildings, streets, trees, rivers...

Windows, street lines, roofs, leaves...

Junctions, curves...

Simple primitives, e.g. edges

Learn bottom-up


Outline


✤ Jin & Geman
✤ Fidler & Leonardis, CVPR’07; Fidler, Boben & Leonardis, CVPR 2008
✤ Mundy & Ozcanli, SPIE ’09

✤ Proof of concept: Construction of a simple hierarchy to find windows in the voxel world

✤ Future Work

Data and Algorithm

$$f(x) = \sum_{k=1}^{K} w_k f_k(x)$$

$$\tilde{f}(x) \sim \mathcal{N}(\tilde{\mu}_f, \tilde{\sigma}_f^2)$$

$$\min D_{KL}\big(f(x) \,\|\, \tilde{f}(x)\big) \quad \text{or} \quad f_1(x)$$

[Figure: Top: Mean appearance near wall surface. Bottom: occupancy]

Algorithm Steps
1. For each orientation
✤ Apply corner kernel on appearance and occupancy grids
✤ Perform non-maxima suppression on kernel-specific region
2. Build a hierarchy to find windows

The Primitives: Corner Kernel

Corner kernel in 2D: Every pixel has a label/weight
Corner kernel in 3D: Every voxel has a label/weight

[Figure: corner kernel with WIDTH, HEIGHT, and DEPTH axes. PLUS (+) REGION: white voxels. MINUS (-) REGION: black voxels.]


Rotate kernel to create layer of primitives

[Figure: Coordinate system of a corner kernel, with axes x, y, z and angles φ, θ, ψ]

Layer 1: Primitives
3D Corners


Applying the Kernel

“Convolve” kernel with appearance grid

[Figure: corresponding voxels]


Operator Response and Simplifications
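$I_{x_i}$: Intensity at voxel $x_i$
$K$: Kernel response

$$K = \sum_{i:\,x_i \in R^+} I_{x_i} - \sum_{j:\,x_j \in R^-} I_{x_j}$$

Distribution of the response:

$$K \sim \mathcal{N}(\mu_k, \sigma_k^2)$$

$$\mu_k = \sum_{i:\,x_i \in R^+} \mu_{x_i} - \sum_{j:\,x_j \in R^-} \mu_{x_j} \qquad \sigma_k^2 = \sum_{i:\,x_i \in R^+} \sigma_{x_i}^2 + \sum_{j:\,x_j \in R^-} \sigma_{x_j}^2$$

This may be the first feature detector based on the spatial arrangement of appearance distributions

$$\text{kernel response} = r_\alpha = \begin{cases} \mu_k, & \dfrac{1}{|R^+|} \sum_{i:\,x_i \in R^+} P(x_i \in S) > t \ \text{and}\ \mu_k > 0 \\ 0, & \text{otherwise} \end{cases}$$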
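The response rule can be sketched for a single kernel placement as follows. The sketch assumes each voxel stores a collapsed Gaussian appearance (mean and variance) and a surface probability, held here in plain dictionaries keyed by voxel index; this data layout, the function name, and the sample values are illustrative assumptions.

```python
def kernel_response(means, variances, surface_probs, plus_voxels, minus_voxels, threshold):
    """Response of one kernel placement.

    Returns (r_alpha, mu_k, var_k): the thresholded response and the parameters
    of the Gaussian distribution of K = sum(I over R+) - sum(I over R-).
    """
    mu_k = sum(means[v] for v in plus_voxels) - sum(means[v] for v in minus_voxels)
    # Variances add for both regions because K is a signed sum of independent terms.
    var_k = sum(variances[v] for v in plus_voxels) + sum(variances[v] for v in minus_voxels)
    mean_surface_prob = sum(surface_probs[v] for v in plus_voxels) / len(plus_voxels)
    passes = mean_surface_prob > threshold and mu_k > 0
    return (mu_k if passes else 0.0), mu_k, var_k

# Hypothetical three-voxel placement: two PLUS voxels and one MINUS voxel.
means = {(0, 0, 0): 0.8, (1, 0, 0): 0.9, (0, 1, 0): 0.2}
variances = {(0, 0, 0): 0.01, (1, 0, 0): 0.01, (0, 1, 0): 0.01}
surface_probs = {(0, 0, 0): 0.9, (1, 0, 0): 0.7, (0, 1, 0): 0.1}
r_alpha, mu_k, var_k = kernel_response(
    means, variances, surface_probs,
    plus_voxels=[(0, 0, 0), (1, 0, 0)],
    minus_voxels=[(0, 1, 0)],
    threshold=0.5,
)
```

Raising the threshold above the mean surface probability of the PLUS region (0.8 in this example) suppresses the response to zero even though the appearance contrast is unchanged.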


Experiment Setup:

1. Demonstrate Hierarchy on a small region
2. Show some results on the full grid

Experimental hierarchy

Object Layer: Window
Layer 3: Triplets of corners
Layer 2: Pairs of corners
Layer 1: Corner primitives


Layer 1: Simple Features

Algorithm Steps
1. For each orientation
✤ Run a corner kernel
✤ Perform non-maxima suppression on kernel-specific region
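Non-maxima suppression over a grid of responses can be sketched as below. The slides do not define the shape of the kernel-specific region, so this sketch assumes a cubic neighborhood of a given radius around each voxel; the sparse dictionary layout and the function name are also assumptions.

```python
def non_maxima_suppression(responses, radius=1):
    """Keep a voxel only if no voxel in its cubic neighborhood has a larger response.

    responses: dict mapping (x, y, z) to a response value; missing voxels count as 0.
    """
    offsets = range(-radius, radius + 1)
    kept = {}
    for (x, y, z), value in responses.items():
        if value <= 0:
            continue
        is_local_max = all(
            responses.get((x + dx, y + dy, z + dz), 0.0) <= value
            for dx in offsets for dy in offsets for dz in offsets
        )
        if is_local_max:
            kept[(x, y, z)] = value
    return kept

# Hypothetical responses: two adjacent voxels and one isolated voxel.
responses = {(0, 0, 0): 1.0, (1, 0, 0): 2.0, (5, 5, 5): 0.5}
peaks = non_maxima_suppression(responses, radius=1)
```

Of the two adjacent voxels only the stronger one survives, while the isolated voxel is kept because nothing in its neighborhood exceeds it.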


Layer 2
Algorithm Steps
2. Build a hierarchy
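2.1. Pair corners (90°) → Pairs

$$p(c^{i}_{\alpha_i,\alpha_j} \mid d_{\alpha_i,\alpha_j}, \theta_{\alpha_i,\alpha_j}) = \begin{cases} \dfrac{1}{|\{r_{\alpha_i}, r_{\alpha_j} > 0\}|}, & \text{for } r_{\alpha_i}, r_{\alpha_j} > 0 \\ 0, & \text{otherwise} \end{cases}$$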


Layer 3

Algorithm Steps
1. ...
2.1. Pair coplanar corners (90°) → Pairs
2.2. Pair corner pairs → Triplets


Object Layer : Windows

Algorithm Steps
1. ...
2.1. Pair corners (90°) → Pair
2.2. Pair corner pairs → L-shape
2.3. Pair Triplets → Window
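The three composition steps can be illustrated with a deliberately simplified 2D toy. It assumes corners lie in one plane and are given as (x, y, orientation in degrees), that two corners pair when their orientations differ by 90°, they share an x or y coordinate, and they are close enough, and that larger parts are formed by merging parts that overlap. These pairing rules, the names, and the sample coordinates are assumptions made for illustration, not the criteria used in the experiments.

```python
from itertools import combinations

def pair_corners(corners, max_dist):
    """Layer 2: pair corners that are 90 degrees apart, axis-aligned, and close."""
    pairs = []
    for a, b in combinations(corners, 2):
        diff = abs(a[2] - b[2]) % 360
        at_right_angle = min(diff, 360 - diff) == 90
        aligned = a[0] == b[0] or a[1] == b[1]
        close = abs(a[0] - b[0]) + abs(a[1] - b[1]) <= max_dist
        if at_right_angle and aligned and close:
            pairs.append(frozenset((a, b)))
    return pairs

def compose(parts, size):
    """Merge two parts whenever their union contains exactly `size` corners."""
    merged = {p | q for p, q in combinations(parts, 2) if len(p | q) == size}
    return list(merged)

# Hypothetical detections: four corners of a 2 x 3 window plus one stray corner.
corners = [(0, 0, 0), (2, 0, 90), (2, 3, 180), (0, 3, 270), (10, 10, 0)]
pairs = pair_corners(corners, max_dist=5)     # Layer 2: pairs of corners
triplets = compose(pairs, size=3)             # Layer 3: two pairs sharing one corner
windows = compose(triplets, size=4)           # Object layer: two overlapping triplets
```

In this toy example the stray corner never joins a pair, so it cannot propagate upward; only the four consistent corners reach the object layer.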


Full Grid: Occupancy Probabilities

Summary

✤ Appealing characteristics of The Voxel World and Compositional Hierarchies

✤ Introduced volumetric feature detectors that operate on distribution functions of
appearance

✤ Demonstrated the efficiency of such a representation using a very simple instance of a compositional hierarchy

✤ Localized a large number of windows


Future Work

✤ Include other extrema operators in the hierarchy (e.g. edges)

✤ Use occupancy information

✤ Learn prior distributions to fully explain probability density of compositions

✤ Optimize source code: Search and storage of parts (e.g. octree)

✤ Learn parts automatically

✤ Learn whole-object hierarchies


The Principle of Compositionality
The meaning of a complex expression is determined by
its structure and the meanings of its constituents.
Stanford Encyclopedia of Philosophy

Questions?

