In this talk, I will discuss the extensions we have made to our approach to semantic image segmentation. I will show how the results of object detectors and spatial priors can be naturally integrated into our hierarchical conditional random field (HCRF) approach based on the harmony potential. The addition of these extra cues, as well as class-specific normalization of classifier outputs, significantly improves segmentation quality.
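The class-specific normalization mentioned above can be sketched as follows. This is an illustrative example, not the authors' code: the scores, statistics, and class names are hypothetical, and the assumption is simply that raw scores from independently trained per-class classifiers are rescaled using validation-set statistics before entering the CRF potentials.

```python
# Illustrative sketch (hypothetical values): class-specific normalization
# of raw classifier scores so that scores become comparable across classes
# before they are used as unary potentials.

def normalize_per_class(scores, class_stats):
    """Rescale raw per-class scores to zero mean / unit variance per class.

    scores      -- dict mapping class name -> raw classifier score
    class_stats -- dict mapping class name -> (mean, std) estimated on
                   held-out validation data (made-up values here)
    """
    normalized = {}
    for cls, s in scores.items():
        mean, std = class_stats[cls]
        normalized[cls] = (s - mean) / std if std > 0 else 0.0
    return normalized

# Hypothetical example: the 'cow' classifier emits systematically higher
# raw scores than 'sheep'; after normalization 'sheep' correctly wins.
raw = {"cow": 2.1, "sheep": 0.9}
stats = {"cow": (2.0, 0.5), "sheep": (0.5, 0.4)}
print(normalize_per_class(raw, stats))
```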
Open source communities and business eco system strategy - OW2 Consortium fro... (SpagoWorld)
The presentation supported the speech by Gabriele Ruffatti, Engineering Group's Architectures and Consulting Director and OW2 Board member, at Slovenia Business Linux Conference 2010 (27th-28th Sept 2010 - Portoroz, Slovenia).
The PASCAL organization provides e-book access to over 250,000 titles through four e-book platforms to support students, faculty and staff at academic institutions across South Carolina. Usage of e-books is increasing on college campuses. PASCAL offers training on the e-book platforms and promotes discovery of and access to e-books through the PASCALCat catalog and other methods. Top subjects used include social science, history, business/economics and political science. PASCAL will continue assessing usage and expanding e-book access.
This document provides information about lasers and their use in ophthalmology. It begins with definitions of laser and its acronym. It then discusses the history and development of lasers from 1917 to present. The key properties and mechanisms of laser light production are described. Common types of ophthalmic lasers and their applications are outlined, including Nd:YAG, excimer, and diode lasers used for conditions like glaucoma, refractive error correction, and retinal diseases. The laser-tissue interaction mechanisms of thermal, photochemical and ionizing effects are summarized. The document concludes with sections on laser instrumentation and delivery systems and specific laser procedures in ophthalmology.
The document describes the ORUSSI project which aims to develop an optimized platform for real-time road monitoring using a network of roadside sensors like cameras. It seeks to efficiently deploy and add sensors to surveillance systems. The project will develop a novel platform combining research in semantic transcoding, scalable video coding, wireless communication and roadside equipment. It demonstrates several computer vision algorithms running directly on cameras including vehicle counting, speed estimation, feature detection and anomaly detection. It also shows selective video transcoding to preserve features for detection while efficiently encoding videos. A large dataset was collected of vehicle videos under different conditions for the project activities.
IM3I is a flexible system for managing and publishing multimedia content. It provides a service-oriented architecture allowing multiple views of media stored in repositories. This improves reuse, repurposing, and sharing of rich media. Automatic annotation of audio and video is performed through customizable processing pipelines. Services provide syntactic and semantic annotations. Visual annotation uses Bag-of-Words with MSER, SURF, and SIFT features. An ontology-based search and browsing engine is accessible as a service or through rich interfaces. Other services and interfaces allow tagging and content-based image retrieval. Publishing functions are provided through additional services and interfaces.
The system provides a service-oriented architecture that allows for multiple viewpoints of multimedia data inside repositories. The analysis layer is responsible for extracting low-level features and semantic annotations from media files through various processing pipelines. Annotation of visual and audio content is performed using bag-of-visual words, MSER, SURF, SIFT features, and SVM classifiers. The system was evaluated for usability, allowing users to search, annotate, and interact with videos and interfaces.
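The bag-of-visual-words step mentioned above can be sketched in a few lines. This is a minimal illustration, not the system's implementation: the toy 2-D descriptors and three-word codebook are hypothetical, and in practice the descriptors would come from MSER/SURF/SIFT detectors and the histograms would feed SVM classifiers.

```python
# Minimal sketch of bag-of-visual-words quantization (illustrative only):
# each local descriptor is assigned to its nearest codebook centroid, and
# the image is represented as a histogram of visual-word counts.

def bow_histogram(descriptors, codebook):
    """descriptors: list of feature vectors; codebook: list of centroids."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    hist = [0] * len(codebook)
    for d in descriptors:
        # assign descriptor to its nearest visual word
        nearest = min(range(len(codebook)), key=lambda k: sq_dist(d, codebook[k]))
        hist[nearest] += 1
    return hist

# Hypothetical 2-D toy data: a 3-word codebook and four descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.1), (4.8, 5.2), (5.1, 4.9)]
print(bow_histogram(descriptors, codebook))  # → [1, 1, 2]
```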
The document summarizes an interactive video search and browsing system called Orione. It uses an ontology created from a lexicon and WordNet to provide automatic video annotations using visual features. The system has a web-based interface for simple or advanced searches and browsing video archives by concepts. It also allows manual annotations. The multitouch interface lets users browse the ontology to select concepts and search/organize results with gestures. Usability tests were conducted on the system.
This document describes two interactive video search and browsing systems - a web application using the Rich Internet Application (RIA) paradigm, and a multi-touch collaborative application. Both systems use the same ontology-based video search engine, which allows semantic searching and browsing of video collections. The web application provides query expansion and interactive search interfaces, while the multi-touch application enables collaborative browsing and organization of video search results. The systems aim to provide responsive and intuitive interfaces for searching and exploring video archives.
DanThe is an online service, developed and implemented by Tuscany Region and MICC – Media Integration and Communication Center – University of Florence, to promote the resources related to digital cultural heritage of Tuscany. DanThe provides a direct access to collections, databases, regional museums, libraries and catalogues of cultural heritage.
The document provides a walkthrough of the IM3I multimedia information management platform. It describes several key features of the platform including multi-user video annotation, simple and advanced search, ontology browsing, and an authoring environment. The walkthrough explains how to use each feature through screenshots and step-by-step instructions. It aims to demonstrate the flexibility and customizability of the IM3I platform for managing and exploiting large multimedia archives.
IM3I provides a single point of access to manage and publish all types of digital content, including audio, video, and text files stored locally or online. It offers tools for processing, analyzing, indexing, tagging, searching, and publishing multimedia content in an integrated service-oriented environment. IM3I allows users to design flexible interfaces to publish their media in a way that meets their needs, with all changes to content and metadata instantly updating across publication interfaces.
IM3I is an immersive multimedia management and publishing platform that provides a framework for searching, summarizing, and visualizing large multimedia archives. It is based on a service-oriented architecture and can integrate a variety of content processing, analysis, indexing, tagging, annotation, search, and publishing services. The platform has been applied in media production workflows, educational content workflows, and for publishing archived content. It allows flexible composition of services into pipelines and custom interfaces to support various content use cases and user roles.
The IM3I project addresses the needs of media and communication industries facing advancing technologies and changing media consumption by developing highly customizable interfaces to search, summarize, and visualize large multimedia archives. Funded by the EU, IM3I provides a service-oriented architecture allowing multiple views of media data within repositories for more flexible interaction and sharing of rich media, opening new opportunities for content owners.
Semantic image segmentation is the process of assigning semantically relevant labels to all pixels in an image. Hierarchical Conditional Random Fields (HCRFs) are a popular and successful approach to this problem. One reason for their popularity is their ability to incorporate contextual information at different scales. However, existing HCRF models do not allow multiple labels to be assigned to individual nodes. At higher scales in the image, this results in an oversimplified model, since multiple classes can reasonably be expected to appear within a single region. This simplified model especially limits the impact that observations at larger scales may have on the CRF model. Furthermore, neglecting the information at larger scales is undesirable since class-label estimates based on these scales are more reliable than at smaller, noisier scales.
MediaPick is a tangible semantic media retrieval system that allows users to browse concepts from an ontology structure to search for and retrieve video results from large media libraries. It uses a multi-touch overlay screen to detect gestures for interacting with the results. The system exploits a search engine and semantic reasoning to retrieve and organize related multimedia content based on the user's queries and gestures.
This document discusses interactive visual representations of complex information structures. It presents several existing systems that visualize semantic data and search results. It then describes a new visual interactive framework that can extract and merge results from diverse knowledge repositories. The framework uses a main data source along with related multimedia and social media sources. It generates a semantic XML structure and two interactive visual interfaces - a geometric paradigm and an urban paradigm - to explore the information. An experimental analysis evaluated the quality of the visual paradigms and usability of the system.
This document describes a method for accurately evaluating HER-2 amplification in fluorescence in situ hybridization (FISH) images. The method involves extracting nuclei from FISH images, assigning each nucleus a reliability score based on shape and size compliance with a template model, and computing the ratio of HER-2 to CEP-17 markers only using the most reliable nuclei. The method was tested on a dataset of 40 FISH images classified by experts into categories of HER-2 amplification. Using a training set to determine an optimal reliability score threshold maximized classification accuracy evaluated on a test set.
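The scoring step described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's code: the field names, counts, and threshold value are hypothetical; the key idea is that the HER-2/CEP-17 ratio is computed only over nuclei whose reliability score exceeds the learned threshold.

```python
# Illustrative sketch (hypothetical data): compute the HER-2/CEP-17 ratio
# using only nuclei whose reliability score passes a learned threshold.

def her2_ratio(nuclei, reliability_threshold):
    """nuclei: list of dicts with 'reliability', 'her2', 'cep17' counts."""
    reliable = [n for n in nuclei if n["reliability"] >= reliability_threshold]
    if not reliable:
        raise ValueError("no nucleus passed the reliability threshold")
    her2 = sum(n["her2"] for n in reliable)
    cep17 = sum(n["cep17"] for n in reliable)
    return her2 / cep17

# Made-up example: the low-reliability nucleus is excluded from the ratio.
nuclei = [
    {"reliability": 0.9, "her2": 8, "cep17": 2},
    {"reliability": 0.8, "her2": 6, "cep17": 2},
    {"reliability": 0.3, "her2": 1, "cep17": 4},  # ignored: unreliable shape/size
]
print(her2_ratio(nuclei, reliability_threshold=0.5))  # → 3.5
```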
GraphRAG for Life Science to increase LLM accuracy (Tomaz Bratanic)
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
Trusted Execution Environment for Decentralized Process Mining (LucaBarbaro3)
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, a free SAP software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Fueling AI with Great Data with Airbyte Webinar (Zilliz)
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to production.
5th LF Energy Power Grid Model Meet-up Slides (DanBrown980551)
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
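The steady-state power-flow calculation described above can be illustrated with a small worked example. This is not the Power Grid Model API: it is a minimal DC power-flow sketch on a hypothetical 3-bus network with made-up reactances and loads, showing the kind of question such an engine answers (how power flows split across lines for a given loading).

```python
# Minimal DC power-flow sketch (illustrative only, hypothetical network):
# build the reduced susceptance matrix, solve B * theta = P for bus angles,
# then recover line flows from angle differences.

def dc_power_flow(n_bus, lines, injections, slack=0):
    """lines: (from, to, reactance); injections: net power per bus (p.u.)."""
    idx = [b for b in range(n_bus) if b != slack]   # non-slack buses
    pos = {b: i for i, b in enumerate(idx)}
    n = len(idx)
    # Assemble the reduced susceptance matrix (slack row/column removed).
    B = [[0.0] * n for _ in range(n)]
    for f, t, x in lines:
        b = 1.0 / x
        if f != slack and t != slack:
            B[pos[f]][pos[t]] -= b
            B[pos[t]][pos[f]] -= b
        if f != slack:
            B[pos[f]][pos[f]] += b
        if t != slack:
            B[pos[t]][pos[t]] += b
    p = [injections[b] for b in idx]
    # Solve B * theta = p by Gaussian elimination with back-substitution.
    for i in range(n):
        for j in range(i + 1, n):
            r = B[j][i] / B[i][i]
            for k in range(i, n):
                B[j][k] -= r * B[i][k]
            p[j] -= r * p[i]
    theta_red = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(B[i][k] * theta_red[k] for k in range(i + 1, n))
        theta_red[i] = (p[i] - s) / B[i][i]
    theta = [0.0] * n_bus                # slack angle stays at 0
    for b in idx:
        theta[b] = theta_red[pos[b]]
    # Line flow is proportional to the angle difference across the line.
    return {(f, t): (theta[f] - theta[t]) / x for f, t, x in lines}

# Hypothetical example: slack at bus 0, loads of 1.0 and 0.5 p.u. at buses 1, 2.
flows = dc_power_flow(
    3,
    lines=[(0, 1, 0.1), (1, 2, 0.1), (0, 2, 0.2)],
    injections=[0.0, -1.0, -0.5],
)
print(flows)  # bus 0 supplies 1.0 p.u. toward bus 1 and 0.5 toward bus 2
```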
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
- Insightful presentations covering two practical applications of the Power Grid Model.
- An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
- An interactive brainstorming session to discuss and propose new feature requests.
- An opportunity to connect with fellow Power Grid Model enthusiasts and users.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
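As a purely hypothetical illustration of the kind of enrichment discussed here (the element names and structure are invented for the example, not taken from the presentation), a model prompted with the plain sentence "Take 200 mg twice daily" might be expected to produce markup such as:

```xml
<!-- Hypothetical output: plain text enriched with XML markup -->
<dosage>
  <dose unit="mg">200</dose>
  <frequency>twice daily</frequency>
</dosage>
```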
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Tatiana Kojar
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfflufftailshop
When it comes to unit testing in the .NET ecosystem, developers have a wide range of options available. Among the most popular choices are NUnit, XUnit, and MSTest. These unit testing frameworks provide essential tools and features to help ensure the quality and reliability of code. However, understanding the differences between these frameworks is crucial for selecting the most suitable one for your projects.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Digital Marketing Trends in 2024 | Guide for Staying Ahead
1. PASCAL VOC 2010: semantic object segmentation and action recognition in still images
Andrew D. Bagdanov
bagdanov@cvc.uab.es
Departamento de Ciencias de la Computación
Universidad Autónoma de Barcelona
Xavier, Pep, Nataliya, Wenjuan, Fahad
The CVC PASCAL VOC Team CVC PASCAL VOC 2010
2. Overview
On 03/05/2010 the PASCAL VOC competition was announced and the training and validation sets were published.
The 20 semantic categories for the competition remain the same: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and tv/monitor.
3. Old competitions, new competitions
There are two (+ 1/2) main challenges in PASCAL.
Image classification is the prediction of the presence/absence of an instance of a class in a test image.
Object detection is the prediction of the bounding box and label of each object from the twenty target classes in a test image.
Semantic image segmentation is the assignment of one of the twenty class labels to every pixel in a test image.
Image segmentation is becoming a mainstream competition.
Action recognition in still images was included as a new “taster challenge” this year.
Taster competitions are used to measure interest in new problems.
4. Our contributions to PASCAL VOC 2010
Last year we participated in the Detection, Classification and Segmentation challenges.
This year we decided to concentrate on Classification and Segmentation. Our segmentation technique relies heavily on classification.
We also fielded a team in Action Recognition this year to see what that’s all about.
As always, success in PASCAL VOC challenges is approximately 85% engineering, 10% inspiration and 5% luck (if you’re lucky).
5. Outline
1 Introduction
Overview of the challenges
Our contribution and main ideas
2 The harmony potential 2.0: fusing across scale
Building on last year’s submission
Fusing across scales and learning
3 Action recognition
A torrent of features
Exploiting the size of the problem
4 Discussion
6. Giving semantics to pixels
[Figure panels: Image, Object, Class]
Semantic image segmentation is not object segmentation.
Only for simple cases are they the same.
7. Turning a hard problem into a harder one
[Figure panels: Image, Object, Class]
The objective is to assign semantic labels to every pixel.
Fine distinctions must be made.
8. Make that a very hard one
[Figure panels: Image, Object, Class]
The objective is to assign semantic labels to every pixel.
Fine distinctions must be made.
Occlusions, varying viewpoint and size complicate things.
9. Action recognition in still images
New competition this year: human action recognition in still images.
Individual images sampled from the Flickr dataset.
Bounding boxes of the human in each image are provided.
Very important: we don’t have to solve the detection problem.
Action recognition is offered as a “taster challenge” in order to gauge interest in the general problem.
It was difficult to hypothesize about what would succeed and what would not in this challenge.
10. Action classes
11. Segmentation: the role of context
Context provides very important cues for making fine discriminations at the (super-)pixel scale.
We can exploit three levels of scale: local, mid-level and global [Zhu, NIPS2008].
Existing techniques apply overly-simplified models of context that do not generalize upward from local to global scales.
12. Segmentation: global constraints on label combinations
Our principal idea is to use global classification to enhance segmentation results.
Global image classification results tend to be less noisy than local ones.
We will use them to constrain the combinations of semantic labels we are likely to encounter during segmentation.
We showed last year how a tractable inference technique can be devised for this labeling problem (our PASCAL 2009 entry).
This year we also show how mid-level context can be incorporated in the form of object detections.
We also show how position priors can be similarly incorporated into the framework to provide class-specific location information.
Finally, we devised a stochastic steepest ascent technique for optimizing the many parameters in a class-specific way.
13. Action recognition: driven by data limitations
Initial experiments confirmed our intuition about the limitations of the data.
Structural learning: sampling of pose space not dense enough.
Latent SVM: object interactions under-sampled as well.
Multiple kernel learning: converges to simple selection.
From a very early stage, we decided to treat action recognition as an image classification problem.
We exploit the small dataset size by performing extensive cross validation.
Features are one of our strong points, and we had to get the feature pipeline running for Classification in any case.
14. HCRFs for labeling problems
We represent our segmentation problem as a graph G = (V, E).
V is used for indexing random variables, and E is the set of undirected edges representing compatibility relationships between random variables.
X = {Xi} denotes the set of random variables, or nodes, for i ∈ V.
An energy function will be defined over graphical configurations of random variables.
By the Hammersley-Clifford theorem, the probability of a configuration x = {xi} can be written as the negative exponential of an energy function E(x) = ∑c∈C ϕc(xc), where ϕc is the potential function of clique c ∈ C.
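The clique decomposition above can be made concrete with a small Python sketch. The graph, label set, and potential values below are invented purely for illustration; the sketch just evaluates E(x) as a sum over clique potentials and the corresponding unnormalized Hammersley-Clifford probability exp(−E(x)).

```python
import math

def energy(x, potentials):
    """E(x) = sum over cliques c of phi_c(x_c): `potentials` maps a clique
    (a tuple of node indices) to its potential function."""
    return sum(phi(tuple(x[i] for i in c)) for c, phi in potentials.items())

def unnormalized_prob(x, potentials):
    """Hammersley-Clifford: P(x) is proportional to exp(-E(x))."""
    return math.exp(-energy(x, potentials))

# Toy model: two binary nodes, two unary cliques, one Potts-style pairwise clique.
potentials = {
    (0,): lambda xc: 0.25 if xc[0] == 1 else 0.0,
    (1,): lambda xc: 0.5 if xc[0] == 0 else 0.0,
    (0, 1): lambda xc: 0.0 if xc[0] == xc[1] else 1.0,  # penalize disagreement
}
print(energy([1, 1], potentials))  # 0.25
print(energy([1, 0], potentials))  # 0.25 + 0.5 + 1.0 = 1.75
```

Low-energy configurations correspond to high-probability labelings, which is why the MAP labeling is found by minimizing E.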
15. Consistency potentials for labeling problems
The energy function of G can be written as:
E(x) = ∑i∈V φ(xi) + ∑(i,j)∈EL ψL(xi, xj) + ∑(i,g)∈EG ψG(xi, xg).
The unary term φ(xi) depends on a single probability P(Xi = xi | Oi), where Oi is the observation that affects Xi in the model.
The smoothness potential ψL(xi, xj) determines the pairwise relationship between two local nodes.
The consistency potential ψG(xi, xg) expresses the dependency between local nodes and a global node.
The maximum a posteriori (MAP) estimate of the optimal labeling is:
x∗ = arg minx E(x).
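A toy numerical version of this three-term energy (all probabilities and penalty weights below are invented for illustration) shows how the MAP labeling x∗ = argmin E(x) trades off unary evidence, smoothness, and consistency with the global node; here it is recovered by exhaustive search over the tiny configuration space.

```python
import math
from itertools import product

LABELS = [0, 1]                       # toy two-label problem
P_LOCAL = [[0.8, 0.2], [0.4, 0.6]]    # invented P(Xi = xi | Oi) for two nodes

def phi(i, xi):                       # unary term: negative log-likelihood
    return -math.log(P_LOCAL[i][xi])

def psi_L(xi, xj):                    # smoothness: Potts penalty on a local edge
    return 0.0 if xi == xj else 0.7

def psi_G(xi, xg):                    # consistency with the global node
    return 0.0 if xi == xg else 1.0

def E(x, xg):
    return (sum(phi(i, xi) for i, xi in enumerate(x))
            + psi_L(x[0], x[1])                   # the single local edge (0, 1)
            + sum(psi_G(xi, xg) for xi in x))     # edges to the global node

# MAP estimate by brute force: x* = argmin E(x)
x_star = min(((x, g) for x in product(LABELS, repeat=2) for g in LABELS),
             key=lambda cfg: E(*cfg))
print(x_star)  # ((0, 0), 0): smoothness and consistency override node 1's weak preference
```

Brute force is only viable for toy graphs; the talk's later slides use α-expansion graph cuts for real images.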
16. HCRF models of image segmentation
[Figure: HCRF models with smoothness-free, Potts, and Robust P^N potentials — (Shotton et al, CVPR2008), (Plath et al, ICML2009), (Ladicky et al, ICCV2009)]
Colored nodes represent (hidden) semantic labels.
Dark nodes represent image measurements.
Red edges represent penalties imposed by the potential.
17. Different features for discrimination
The previously mentioned approaches all try to make global distinctions using local information:
either by voting of local observations (Potts),
or by penalizing rampantly discordant local label assignments (Robust P^N).
None of these techniques try to exploit truly global information to constrain local labels.
And none incorporate the notion of encoding combinations of primitive node labels at the global level.
18. The harmony potential: selective subsets
Only labels that do not agree with the subset are penalized.
This can represent more diverse combinations.
19. The harmony potential: overview
20. Ranked subsampling of P(L)
We can do this using the following posterior:
P(ℓ ⊆ x∗g | O) ∝ P(ℓ ⊆ x∗g) P(O | ℓ ⊆ x∗g).
This allows us to effectively rank possible global node labels, and thus to prioritize candidates in the search for the optimal label x∗g.
P(ℓ ⊆ x∗g | O) establishes an order on subsets of the (unknown) optimal labeling of the global node x∗g that guides the consideration of global labels.
We may not be able to exhaustively consider all labels in P(L), but at least we consider the most likely candidates for x∗g.
And image classification can give us an estimate of this posterior.
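One minimal way to realize such a ranking, assuming (purely for illustration, not as the actual model) that per-class classifier outputs can be treated as independent probabilities, is to score every candidate label subset and keep only the top K. With 20 classes the 2^20 subsets cannot all be used during inference, which is exactly why only the highest-ranked candidates are retained.

```python
from itertools import combinations

def rank_label_subsets(class_probs, top_k):
    """Score each non-empty label subset S by an (assumed independent) posterior:
    score(S) = prod_{c in S} p_c * prod_{c not in S} (1 - p_c), then keep the top K."""
    classes = list(class_probs)
    scored = []
    for r in range(1, len(classes) + 1):
        for S in combinations(classes, r):
            score = 1.0
            for c in classes:
                score *= class_probs[c] if c in S else 1.0 - class_probs[c]
            scored.append((S, score))
    scored.sort(key=lambda t: -t[1])
    return scored[:top_k]

# Invented image-classifier posteriors for four VOC classes.
probs = {'person': 0.9, 'horse': 0.7, 'sofa': 0.1, 'tvmonitor': 0.05}
for S, s in rank_label_subsets(probs, 3):
    print(S, s)
```

The top-ranked subset here is ('person', 'horse'), matching the intuition that the global node should consider co-occurring high-confidence classes first.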
21. PASCAL 2010: pushing the limit
The previous slides describe the approach used for our PASCAL 2009 submission.
The discriminative model was based only on SVMs trained to discriminate object classes from their own backgrounds.
Starting with the harmony potential approach, this year we concentrated on adding cues derived from different levels of mid-level context.
We found the HCRF model with harmony potential to be very useful for performing this fusion.
Our hypothesis at the end of the 2009 competition was that detection would be essential for pushing forward the state-of-the-art.
22. PASCAL 2010: fusing across scales
1 FG/BG: 20 SVMs trained to discriminate classes from their own background. The same discriminative model used last year, essential for localizing object boundaries.
2 CLASS: 20 SVMs trained to discriminate each object class from the other object classes. Essential for distinguishing objects with similar backgrounds (e.g. cows from sheep, birds from planes). Incorporated directly into the unary potential.
3 LOC: 20 class-specific location priors, computed from ground truth segmentations by simple spatial averaging. A form of top-down mid-level context.
4 OBJ: 20 class-specific object detectors [Felzenszwalb 2010], converted to superpixel scores by selecting the highest scoring detection intersecting each pixel of the superpixel. A type of bottom-up mid-level context.
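The OBJ cue in item 4 could be sketched roughly as follows. The data layout (a superpixel as a list of pixel coordinates, a detection as a box, class index, and score) is hypothetical, not the actual implementation; the point is just the max-over-intersecting-detections rule.

```python
def detection_cue(superpixel_pixels, detections, num_classes):
    """For each class, keep the score of the highest-scoring detection whose
    bounding box contains at least one pixel of the superpixel."""
    scores = [float('-inf')] * num_classes
    for (x0, y0, x1, y1), cls, score in detections:
        hits = any(x0 <= x <= x1 and y0 <= y <= y1 for (x, y) in superpixel_pixels)
        if hits and score > scores[cls]:
            scores[cls] = score
    return scores

sp = [(3, 3), (4, 3), (4, 4)]                 # pixels of one superpixel
dets = [((0, 0, 5, 5), 0, 0.8),               # class 0 box covering the superpixel
        ((0, 0, 5, 5), 0, 0.3),               # weaker detection, same class
        ((10, 10, 20, 20), 1, 0.9)]           # class 1 box that misses it
print(detection_cue(sp, dets, 2))             # [0.8, -inf]
```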
23. PASCAL 2010: learning unary potentials
We compute the unary potential by weighting the classification scores {si(k, xi)}k∈F through a sigmoid function. The unary potential becomes:
φL(xi) = −µL Ki ∑k∈F log [1 / (1 + exp(fi(k, xi)))]
fi(k, xi) = a(k, xi) si(k, xi) + b(k, xi)
µL is the weighting factor of the local unary potential, and Ki normalizes over the number of pixels inside the superpixel.
We have two sigmoid parameters for each class/cue pair: a(k, xi) and b(k, xi).
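A direct transcription of these two formulas, with invented cue scores and sigmoid parameters; in the real system a(k, xi) and b(k, xi) are learned per class/cue pair by cross validation, not fixed as below.

```python
import math

def unary_potential(scores, a, b, mu_L, K_i):
    """phi_L(x_i) = -mu_L * K_i * sum_k log(1 / (1 + exp(f_k)))
    with f_k = a[k] * s[k] + b[k]  (per-cue sigmoid calibration)."""
    total = 0.0
    for k, s in scores.items():
        f = a[k] * s + b[k]
        total += math.log(1.0 / (1.0 + math.exp(f)))
    return -mu_L * K_i * total

# Invented raw scores for the four cues on one superpixel/label pair.
cues = {'FG/BG': 1.2, 'CLASS': -0.4, 'LOC': 0.1, 'OBJ': 0.7}
a = {k: -2.0 for k in cues}   # hypothetical learned slope per cue
b = {k: 0.0 for k in cues}    # hypothetical learned offset per cue
val = unary_potential(cues, a, b, mu_L=1.0, K_i=0.01)
print(val)
```

With a negative slope, high raw scores map to low energy, so the labeling the cues agree on becomes cheap for the MAP inference to choose.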
24. Datasets
We have evaluated the harmony potential approach on two standard, publicly available datasets.
The PASCAL VOC 2010 Segmentation Challenge dataset contains 2250 color images of 20 different semantic classes.
This set is split into 750 images for training, 750 images for testing, and 750 for validation.
The Microsoft MSRC-21 dataset contains 591 color images of 21 object classes.
We do our own splits for cross-validation on MSRC-21.
25. Unsupervised segmentation
Images are first over-segmented with quick-shift to derive superpixels [Fulkerson, ICCV 2009].
This preserves object boundaries while simplifying the representation.
Working at the superpixel level reduces the number of nodes in the CRF by 10² to 10⁵ per image.
26. Local classification scores: P(Xi = xi | Oi)
We extract patches with 50% overlap on a regular grid at several resolutions (12, 24, 36 and 48 pixels in diameter).
Patches are described with SIFT, color and, for MSRC-21, location features.
A vocabulary is constructed using k-means to quantize to 1000 SIFT words and 400 color words.
An SVM classifier using an intersection kernel is built for each semantic category.
A similar number of positive and negative examples are used: around 8,000 superpixel samples in total for MSRC-21, and 20,000 for VOC 2010 for each class.
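The vocabulary-quantization step of this pipeline can be sketched as follows; the toy 2-D "descriptors" and three-word vocabulary below stand in for the real 128-D SIFT descriptors and the 1000-word k-means vocabulary.

```python
def quantize(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word (k-means centroid)
    and return a normalized bag-of-words histogram."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(d, vocabulary[j])))
        hist[nearest] += 1
    total = sum(hist)
    return [h / total for h in hist]

vocab = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]     # toy 3-word vocabulary
descs = [(0.1, 0.1), (0.9, 1.0), (0.2, 0.0), (0.1, 0.9)]
print(quantize(descs, vocab))   # [0.5, 0.25, 0.25]
```

The resulting histograms are what the per-class intersection-kernel SVMs are trained on.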
27. Global potential and general approach
For the PASCAL 2010 dataset we use our entry to the 2010 VOC Classification Challenge [Khan, IJCV2010 (submitted)].
It uses a bag-of-words representation based on SIFT and color SIFT, plus spatial pyramids and color attention [Khan, ICCV 2009].
An SVM classifier with a χ² kernel is trained for each semantic category in the dataset.
The FG/BG and CLASS cues are computed by training a discriminative model using an SVM with histogram intersection kernel.
Except for the additional cues and optimization strategy, the architecture is the same as our approach described at CVPR [Gonfaus, CVPR2010].
28. Learning the HCRF parameters
We found it essential to train the per-class sigmoid parameters through cross validation.
Classification scores are learned independently, are unbalanced, and are effectively incomparable in many cases.
The sigmoid functions weight the importance of each cue for each class.
In addition to these (180) sigmoid parameters, we also must learn the weighting factors for each potential.
We use a stochastic steepest ascent technique to optimize these parameters on a validation set.
In each step we randomly generate new instances of the parameters.
New parameter instances are generated using a Gibbs-like sampling strategy.
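In the spirit of the optimizer described here (the objective below is a stand-in for validation mAP, and the step size and schedule are invented), a Gibbs-like random search perturbs one randomly chosen parameter at a time and keeps the change only when the validation objective improves:

```python
import random

def stochastic_ascent(params, objective, steps=200, sigma=0.1, seed=0):
    """Hill-climbing sketch: resample one coordinate at a time (Gibbs-like)
    and accept the new parameter instance if the objective improves."""
    rng = random.Random(seed)
    best = dict(params)
    best_score = objective(best)
    for _ in range(steps):
        cand = dict(best)
        key = rng.choice(list(cand))          # pick one parameter to perturb
        cand[key] += rng.gauss(0.0, sigma)
        score = objective(cand)
        if score > best_score:                # keep only improving moves
            best, best_score = cand, score
    return best, best_score

# Toy objective with its maximum (0) at a=1, b=-2, standing in for validation mAP.
obj = lambda p: -((p['a'] - 1.0) ** 2 + (p['b'] + 2.0) ** 2)
best, score = stochastic_ascent({'a': 0.0, 'b': 0.0}, obj, steps=500)
print(best, score)
```

Because each evaluation is a full validation run in the real system, this kind of derivative-free, one-coordinate-at-a-time search is a pragmatic fit for the 180+ sigmoid and weighting parameters.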
30. Qualitative results: MSRC-21
31. Quantitative results: MSRC-21
MSRC-21 contains more multi-class images than PASCAL.
Our performance demonstrates the benefits of incorporating global scale when making local decisions.
32. Qualitative results: PASCAL 2010
33. Quantitative results: PASCAL 2010
FG/BG shows the performance of our baseline (PASCAL 2009) approach.
At the top, performance on the validation set (i.e. how well we thought we were doing).
Image tags indicate how well the technique can perform with perfect global information.
34. The cost of segmentation
The optimal MAP label configuration x∗ is inferred using α-expansion graph cuts [Kolmogorov, PAMI2004].
The global node uses the 100 most probable label subsets obtained from ranked subsampling.
[Plot: mAP on MSRC-21 and on PASCAL VOC 2010 as a function of the number of labels selected (1–200)]
35. Qualitative results: PASCAL 2010 failures
Context is sometimes weighted too much.
When the global classifier fails, little can be done.
36. Every little bit helps
37. A photo finish
mAP on PASCAL VOC 2010 per cue combination:
FG-BG: 33.9
CLASS: 23.4
LOC: 20.1
OBJ: 26.2
FG-BG + CLASS: 36.6
All: 40.4
[Plot: mAP on PASCAL VOC 2010 vs. number of optimization iterations (0–3000)]
The final results were tough to call between BONN and CVC.
In the end, fusion over many scales and per-class, per-feature parameter optimization won.
38. The action recognition taster
Images were collected from Flickr using action queries. A set of nine actions was chosen in the end.
They are disjoint from the main challenge dataset.
Only a subset of people are annotated (bounding box + action).
This subset is labelled with exactly one action class.
Important point: we don’t have to solve the detection problem.
Most action classes in the challenge contain either large variation in scale or large variations in pose (or both).
40. Grouplets and poselets
Two state-of-the-art approaches to action recognition in still images: the grouplets of Fei-Fei Li [Yao et al, CVPR2010], and the latent poses of Greg Mori [Yang et al, CVPR2010].
41. Treat it like image classification
Initial experiments confirmed our intuition about the limitations of the data.
Structural learning: sampling of pose space not dense enough.
Latent SVM: complexity of object interactions problematic.
Multiple kernel learning: converges to simple selection.
State-of-the-art techniques rely on learning complex structural models of pose variations over many training examples.
From a very early stage, we decided to treat action recognition as an image classification problem.
We exploit the small dataset size by performing extensive cross validation.
42. The classification pipeline
43. Action recognition: features
SIFT, color SIFT (normalized R/G and opponent), self-similarity, SURF, PHOG (good for capturing pose), and color attention (focuses on interesting color features).
Sparse and dense variations of most of these.
Plus a range of pyramid configurations (1, 2 × 2, 3 × 3, 4 × 4).
Object detectors are also incorporated using a simple occurrence histogram [Felzenszwalb 2010].
The goal was to incorporate all of this into a BoVW classifier and push the limits of what is possible using classical BoW on actions.
44. Introduction The data
Harmony potential 2.0: fusing across scale State-of-the-art
Action recognition Our approach
Discussion Results
Action recognition: contextual pyramids
Context was also important for most object classes.
We used a type of foreground/background pyramid decomposition
that split features into object or background.
The was done using a type of spatial soft-assign based on the
distance to the boundary of the object.
For some classes, we also assigned contextual object regions that
model the appearance of objects associated with them (the “horsy
box”).
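A minimal sketch of what such a distance-based spatial soft-assignment could look like (the signed-distance formulation and sigmoid weighting are assumptions for illustration, not the team's exact scheme):

```python
import numpy as np

def fg_bg_soft_weights(points, box, sigma=0.1):
    """Soft foreground/background split of local features around an object box.

    points: (N, 2) feature coordinates (x, y), normalized to [0, 1].
    box:    (x0, y0, x1, y1) object bounding box in the same coordinates.
    sigma:  softness of the transition at the box boundary.

    Returns (w_fg, w_bg): per-feature weights summing to 1, so each feature
    contributes softly to both the foreground and background histograms.
    """
    x0, y0, x1, y1 = box
    # Signed distance to the box boundary: positive inside, negative outside.
    dx = np.minimum(points[:, 0] - x0, x1 - points[:, 0])
    dy = np.minimum(points[:, 1] - y0, y1 - points[:, 1])
    signed_dist = np.minimum(dx, dy)
    # Sigmoid soft-assignment: deep inside -> ~1 (object), far outside -> ~0.
    w_fg = 1.0 / (1.0 + np.exp(-signed_dist / sigma))
    return w_fg, 1.0 - w_fg
```

Features near the boundary then split their mass between the object and background histograms rather than being assigned hard to one or the other.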
Action recognition: learning in the design space
In the end, after all of the combinatorics introduced by pyramids
and other variations, we had about 100 feature configurations in a
big pool.
Most attempts to automatically learn the parameters of these
features were total failures.
Except one: initial experiments with multiple kernel learning (MKL)
showed that it quickly converges towards class-specific feature
selection rather than mixing.
With such a small dataset, and a little heuristic trimming, we were
able to exhaustively explore a part of the design space.
This resulted in the best per-class feature combinations.
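The exhaustive per-class search could be sketched as below; the scoring callable stands in for per-class cross-validated average precision, and the names and trimming bound are illustrative assumptions:

```python
from itertools import combinations


def best_feature_combination(candidates, cv_score, max_size=3):
    """Exhaustively search small feature combinations, scored by cross-validation.

    candidates: list of feature-configuration names from the trimmed pool.
    cv_score:   callable mapping a tuple of feature names to a CV score
                (a stand-in here for per-class cross-validated AP).
    max_size:   heuristic cap on combination size to keep the search feasible.
    """
    best, best_score = None, float("-inf")
    for k in range(1, max_size + 1):
        for combo in combinations(candidates, k):
            s = cv_score(combo)
            if s > best_score:
                best, best_score = combo, s
    return best, best_score
```

With a pool of ~100 configurations this is only tractable after heuristic trimming and with small combination sizes, which matches the point that the small dataset made exhaustive exploration possible at all.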
Action recognition: classification
We experimented with a number of kernels (histogram
intersection, χ², bin-ratio distance).
There wasn’t a huge difference among these kernels.
In the end, we chose histogram intersection for our submission as
it appeared to generalize better.
In addition to over-fitting less, there are no parameters to tune and
it is very fast.
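The chosen kernel is simple enough to state in a few lines. A sketch (the vectorized broadcasting implementation is one possible way to compute it, and assumes modest matrix sizes):

```python
import numpy as np

def histogram_intersection_kernel(X, Y):
    """Histogram intersection kernel: K[i, j] = sum_d min(X[i, d], Y[j, d]).

    X: (n, d), Y: (m, d) -- rows are (typically L1-normalized) histograms.
    Parameter-free, and a valid Mercer kernel for non-negative features.
    """
    # Broadcast to (n, m, d) and reduce over the histogram dimension;
    # fine for modest n and m, though memory-heavy for large Gram matrices.
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)
```

The resulting Gram matrix can be fed to any SVM implementation that accepts precomputed kernels, which makes it easy to swap against χ² or bin-ratio alternatives during validation.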
Overall results: average precision
Per-class AP
Per technique median average precision
Qualitative results
When the horsy box and detectors fail, context dominates.
The classifier is still surprisingly robust.
Qualitative results
Some fine discriminations very difficult to make.
Probably difficult even for humans.
Qualitative results
People taking photos should be banned.
Classes with large pose variations were the most difficult.
Discussion: semantic image segmentation
The harmony potential works well for fusing global information into
local segmentations.
This year we showed that the harmony potential framework is also
appropriate for incorporating different types of mid-level cues.
Ranked sub-sampling, driven by the same posterior as used to
define the global potential function, renders the optimization
problem tractable.
Most useful when multiple semantic classes co-occur frequently.
Per-class learning of parameters essential (about +5% in final
results).
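The ranked sub-sampling idea could be sketched as follows; the independence assumption (scoring a label subset by the product of per-class posteriors) and all names here are illustrative, not the exact potential used:

```python
from itertools import combinations
import numpy as np

def ranked_label_subsets(posteriors, labels, k=10, max_labels=3):
    """Rank candidate label subsets by a global classifier posterior.

    Instead of optimizing over all 2^L label combinations, score each
    small subset by the product of its per-class posteriors (a simplifying
    independence assumption) and keep only the top-k as candidate global
    states, rendering the optimization tractable.
    """
    scored = []
    for size in range(1, max_labels + 1):
        for subset in combinations(range(len(labels)), size):
            score = float(np.prod([posteriors[i] for i in subset]))
            scored.append((score, [labels[i] for i in subset]))
    # Highest-posterior subsets first; only these enter the optimization.
    scored.sort(key=lambda t: -t[0])
    return scored[:k]
```

This is most useful exactly when multiple semantic classes co-occur frequently, since the top-ranked subsets then capture the plausible joint labelings.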
Discussion: action recognition
This year’s taster challenge on action recognition was little more
than a toy.
However, we have demonstrated what is possible using proven
techniques from image classification.
We feel that object context, in particular object interaction context,
is the way forward.
The PASCAL dataset is the right direction to go (more general),
but we need more samples.
The future: segmentation
Semantic image segmentation has come a long way, but still has a
long way to go.
It is becoming a mainstream event in PASCAL.
This year segmentation arrived at a sort of three-way détente
between CVC (winner 2010), BONN (winner 2009), and OXFORD (best
paper award, ECCV 2010).
Each has its own approach, with its own advantages and
disadvantages.
Engineering can probably maximize results.
It is becoming mature, and we can begin thinking about what new
applications are enabled by such technologies.
The future: action recognition
It seems that action recognition in still images is a popular
challenge.
The PASCAL organizers are keen to promote it for the future.
The focus will remain on still images, but perhaps with more
emphasis on incorporating user interaction as well.
It seems that the community is becoming more interested in the
“alternative” PASCAL challenges.
The multimedia community probably has an important role to play
here.