        Harmony potential 2.0: fusing across scale
                                Action recognition

                              PASCAL VOC 2010
Semantic object segmentation and action recognition in still images

                                Andrew D. Bagdanov

                       Departamento de Ciencias de la Computación
                           Universidad Autónoma de Barcelona

          Xavier              Pep            Nataliya      Wenjuan          Fahad

                    The CVC PASCAL VOC Team           CVC PASCAL VOC 2010
Introduction   PASCAL VOC 2010
           Harmony potential 2.0: fusing across scale    Semantic image segmentation
                                   Action recognition    Action recognition
                                           Discussion    Our main ideas

     On 03/05/2010 the PASCAL VOC competition was announced
     and the training and validation sets published.
     20 semantic categories for the competition remain the same:
aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable,
dog, horse, motorbike, person, potted plant, sheep, sofa, train, and tv/monitor.


Old competitions, new competitions

   There are two (+ 1/2) main challenges in PASCAL.
   Image classification is the prediction of the presence/absence of
   an instance of a class in a test image.
   Object detection is the prediction of the bounding box and label
   of each object from the twenty target classes in a test image.
   Semantic image segmentation is the assignment of one of the
   twenty class labels to every pixel in a test image.
   Image segmentation is becoming a mainstream competition.
   Action recognition in still images was included as a new “taster
   challenge” this year.
   Taster competitions are used to measure interest in new problems.


Our contributions to PASCAL VOC 2010

  Last year we participated in the Detection, Classification and
  Segmentation challenges.
  This year we decided to concentrate on Classification and
  Segmentation. Our segmentation technique relies heavily on
  Classification.
  We also fielded a team in Action Recognition this year to see
  what that’s all about.
  As always, success in PASCAL VOC challenges is approximately
  85% engineering, 10% inspiration and 5% luck (if you’re lucky).



 1   Introduction
         Overview of the challenges
         Our contribution and main ideas
 2   The harmony potential 2.0: fusing across scale
         Building on last year’s submission
         Fusing across scales and learning
 3   Action recognition
         A torrent of features
         Exploiting the size of the problem
 4   Discussion


Giving semantics to pixels

       Image                                    Object                              Class
   Semantic image segmentation is not object segmentation
   Only for simple cases are they the same.

Turning a hard problem into a harder one

       Image                                    Object                              Class
   The objective is to assign semantic labels to every pixel
   Fine distinctions must be made

Make that a very hard one

       Image                                    Object                              Class
   The objective is to assign semantic labels to every pixel
   Fine distinctions must be made
   Occlusions, varying viewpoint and size complicate things


Action recognition in still images

   New competition this year: human action recognition in still
   images.
   Individual images sampled from the Flickr dataset.
   The bounding box of the human in each image is provided.
   Very important: we don’t have to solve the detection problem.
   Action recognition is offered as a “taster challenge” in order to
   gauge interest in the general problem.
   It was difficult to hypothesize about what would succeed and what
   would not in this challenge.


Action classes


Segmentation: the role of context

   Context provides very important cues for making fine
   discriminations at the (super-) pixel scale.
   We can exploit three levels of scale: local, mid-level and global
   [Zhu, NIPS2008].
   Existing techniques apply overly-simplified models of context that
   do not generalize upward from local to global scales.

Segmentation: global constraints on label
   Our principal idea is to use global Classification to enhance
   segmentation results.
   Global image classification results tend to be less noisy than local ones.
   We will use them to constrain the combinations of semantic labels
   we are likely to encounter during segmentation.
   We showed last year how a tractable inference technique can be
   devised for this labeling problem (our PASCAL 2009 entry).
   This year we also show how mid-level context can be incorporated
   in the form of object detections.
   We also show how position priors can be similarly incorporated
   into the framework to provide class specific location information.
   Finally, we devised a stochastic steepest ascent technique for
   optimizing the many parameters in a class-specific way.

Action recognition: driven by data limitations

   Initial experiments confirmed our intuition about the limitations of
   the data.
       Structural learning: sampling of pose space not dense enough.
       Latent SVM: object interactions under-sampled as well.
       Multiple kernel learning: converges to simple selection.
   From a very early stage, we decided to treat action recognition as
   an image classification problem.
   We exploit the small size of the dataset by performing extensive
   cross-validation.
   Features are one of our strong points, and we had to get the
   feature pipeline running for Classification in any case.

                                                     Our point of departure
       Harmony potential 2.0: fusing across scale
                                                     Datasets and implementation
                               Action recognition
                                                     Experimental results

HCRFs for labeling problem

  We represent our segmentation problem as a graph: G = (V, E)
  V is used for indexing random variables, and E is the set of
  undirected edges representing compatibility relationships between
  random variables.
  X = {Xi } denotes the set of random variables or nodes, for i ∈ V.
  An energy function will be defined over graphical configurations of
  random variables.
  By the Hammersley-Clifford theorem, the probability of a configuration
  $x = \{x_i\}$ is proportional to the exponential of the negative of an
  energy function $E(x) = \sum_{c \in C} \phi_c(x_c)$, where $\phi_c$ is the
  potential function of clique $c \in C$.
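
  For reference, the Gibbs form implied by the Hammersley-Clifford theorem
  can be written out as below. The partition function $Z$ is not named on
  the slide; it is introduced here only as the usual normalizing constant.

```latex
P(X = x) = \frac{1}{Z} \exp\bigl(-E(x)\bigr), \qquad
E(x) = \sum_{c \in C} \phi_c(x_c), \qquad
Z = \sum_{x'} \exp\bigl(-E(x')\bigr)
```

  Minimizing $E(x)$ is therefore the same as maximizing $P(X = x)$, since $Z$
  does not depend on $x$.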


Consistency potentials for labeling problems
   The energy function of G can be written as:
       $E(x) = \sum_{i \in V} \phi(x_i) + \sum_{(i,j) \in E_L} \psi_L(x_i, x_j) + \sum_{(i,g) \in E_G} \psi_G(x_i, x_g).$

   The unary term $\phi(x_i)$ depends on a single probability
   $P(X_i = x_i \mid O_i)$, where $O_i$ is the observation that affects $X_i$ in the
   model.
   The smoothness potential ψL (xi , xj ) determines the pairwise
   relationship between two local nodes.
   The consistency potential ψG (xi , xg ) expresses the dependency
   between local nodes and a global node.
   And the Maximum a Posteriori (MAP) estimate of the optimal
   labeling is:
                                     $x^* = \arg\min_x E(x).$
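
   A minimal sketch of this energy and its minimizer on a toy problem. The
   function names, the Potts form of the smoothness term and the brute-force
   search are all assumptions made for illustration; they are not the
   inference method used in our system.

```python
from itertools import product


def energy(labels, global_label, unary, local_edges, smooth_weight, consistency):
    """E(x) = unary terms + local pairwise terms + local-global terms."""
    # Unary term: unary[i][label] is the cost of giving node i that label.
    e = sum(unary[i][x_i] for i, x_i in enumerate(labels))
    # Smoothness term: a Potts penalty (assumed here) for neighbours that disagree.
    e += sum(smooth_weight for i, j in local_edges if labels[i] != labels[j])
    # Consistency term: every local node is connected to the single global node.
    e += sum(consistency(x_i, global_label) for x_i in labels)
    return e


def brute_force_map(n_nodes, n_labels, global_candidates, unary, local_edges,
                    smooth_weight, consistency):
    """Return (energy, labels, global_label) minimizing E by exhaustive search."""
    best = None
    for labels in product(range(n_labels), repeat=n_nodes):
        for g in global_candidates:
            e = energy(labels, g, unary, local_edges, smooth_weight, consistency)
            if best is None or e < best[0]:
                best = (e, labels, g)
    return best
```

   Exhaustive search is only feasible for a handful of nodes; it is shown
   here purely to make the arg min concrete.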

HCRF models of image segmentation

   Smoothness (Shotton et al, CVPR2008)
   Potts (Plath et al, ICML2009)
   Robust $P^N$ (Ladicky et al, ICCV2009)

  Colored nodes represent (hidden) semantic labels.
  Dark nodes represent image measurements.
  Red edges represent penalties imposed by the potential.


Different features for discriminations

   The previously mentioned approaches all try to make global
   distinctions using local information.
   Either by voting of local observations (Potts).
   Or by penalizing rampantly discordant local label assignments
   (Robust $P^N$).
   None of these techniques try to exploit truly global information to
   constrain local labels.
   And none incorporate the notion of encoding combinations of
   primitive node labels at the global level.


The harmony potential: selective subsets

   Only labels that do not agree with the subset are penalized.
   Can represent more diverse combinations.
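
   The contrast with a single-label global node can be sketched as follows.
   The weight `gamma` and the superpixel `size` factor are hypothetical
   parameters chosen for illustration, not the exact form of our potential.

```python
def potts_global_potential(local_label, global_label, gamma=1.0, size=1):
    """Global node holds ONE label; any local label that differs is penalised."""
    return 0.0 if local_label == global_label else gamma * size


def harmony_potential(local_label, global_subset, gamma=1.0, size=1):
    """Global node holds a SUBSET of labels; only labels outside it are penalised."""
    return 0.0 if local_label in global_subset else gamma * size
```

   With a global subset of {"cow", "grass"}, superpixels labelled "cow" and
   "grass" both go unpenalised, whereas a single-label global node would
   have to penalise one of them.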


The harmony potential: overview


Ranked subsampling of P(L)

  We can do this using the following posterior:

      $P(\ell \subseteq x_g^* \mid O) \propto P(\ell \subseteq x_g^*) \, P(O \mid \ell \subseteq x_g^*).$

  This allows us to effectively rank possible global node labels, and
  thus to prioritize candidates in the search for the optimal label $x_g^*$.
  $P(\ell \subseteq x_g^* \mid O)$ establishes an order on subsets $\ell$ of the (unknown)
  optimal labeling of the global node $x_g^*$ that guides the
  consideration of global labels.
  We may not be able to exhaustively consider all labels in $\mathcal{P}(L)$, but
  at least we consider the most likely candidates for $x_g^*$.
  And image classification can give us an estimate of this posterior.
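
  One way to picture the ranking is the sketch below. It assumes that the
  image classifier yields an independent presence probability per class and
  scores each subset as if it were exactly the set of classes present. That
  independence model, the function name and the `max_size` cut-off are
  simplifications introduced here; they stand in for the posterior above
  and are not our actual estimator.

```python
from itertools import combinations


def rank_label_subsets(presence_prob, top_m=100, max_size=3):
    """Return the top_m label subsets, most probable first.

    presence_prob maps class name -> probability that the class is in the image.
    """
    classes = sorted(presence_prob)
    scored = []
    for size in range(1, max_size + 1):
        for subset in combinations(classes, size):
            score = 1.0
            for c in classes:
                p = presence_prob[c]
                # Classes in the subset count as present, all others as absent.
                score *= p if c in subset else (1.0 - p)
            scored.append((score, frozenset(subset)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [subset for _, subset in scored[:top_m]]
```

  Only the first `top_m` subsets are handed to inference, so the power set
  is never enumerated in full.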


PASCAL 2010: pushing the limit

  The previous slides describe our approach used for the PASCAL
  2009 submission.
  The discriminative model was based only on SVMs trained to
  discriminate object classes from their own backgrounds.
  Starting with the harmony potential approach, this year we
  concentrated on adding cues derived from different levels of
  mid-level context.
  We found the HCRF model with harmony potential to be very
  useful for performing this fusion.
  Our hypothesis at the end of the 2009 competition was that
  detection would be essential for pushing forward the


PASCAL 2010: fusing across scales
 1   FG/BG: 20 SVMs trained to discriminate classes from their own
     background. The same discriminative model used last year,
     essential for localizing object boundaries.
 2   CLASS: 20 SVMs trained to discriminate each object class from
     the other object classes. Essential for distinguishing objects with
     similar backgrounds (e.g. cows from sheep, birds from planes).
     Incorporated directly into the unary potential.
 3   LOC: 20 class-specific location priors. Computed from ground
     truth segmentations by simple, spatial averaging. A form of
     top-down mid-level context.
 4   OBJ: 20 class-specific object detectors [Felzenszwalb 2010] are
     converted to superpixel scores by selecting the highest scoring
     detection intersecting each pixel of the superpixel. A type of
     bottom-up mid-level context.
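
     The OBJ conversion in item 4 can be sketched as follows. The input
     layout, the `floor` score for pixels that no detection covers, and the
     averaging of pixel scores within a superpixel are assumptions made for
     this example; the slide only specifies the per-pixel maximum.

```python
def detection_scores_for_superpixels(superpixel_map, detections, floor=-1.0):
    """Convert box detections of one class into one score per superpixel.

    superpixel_map: 2-D list of superpixel ids, indexed [y][x].
    detections: list of (x0, y0, x1, y1, score) with inclusive pixel bounds.
    """
    per_superpixel = {}
    for y, row in enumerate(superpixel_map):
        for x, sp in enumerate(row):
            # Highest-scoring detection whose box covers this pixel.
            covering = [s for (x0, y0, x1, y1, s) in detections
                        if x0 <= x <= x1 and y0 <= y <= y1]
            best = max(covering) if covering else floor
            per_superpixel.setdefault(sp, []).append(best)
    # Assumption: pixel scores are averaged over each superpixel.
    return {sp: sum(v) / len(v) for sp, v in per_superpixel.items()}
```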

PASCAL 2010: learning unary potentials

  We compute the unary potential by weighting the classification
  scores $\{s_i(k, x_i)\}_{k \in F}$ through a sigmoid function. The unary
  potential becomes:

      $\phi_L(x_i) = -\mu_L K_i \log \prod_{k \in F} \frac{1}{1 + \exp(f_i(k, x_i))}$
      $f_i(k, x_i) = a(k, x_i) \, s_i(k, x_i) + b(k, x_i)$

  $\mu_L$ is the weighting factor of the local unary potential, and
  $K_i$ normalizes over the number of pixels inside the superpixel.
  We have two sigmoid parameters for each class/cue pair: $a(k, x_i)$
  and $b(k, x_i)$.
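
  A minimal sketch of this computation for one superpixel and one candidate
  label. It assumes the per-cue sigmoids are combined as a product inside
  the logarithm; the function and argument names are illustrative only.

```python
import math


def _softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def unary_potential(scores, a, b, mu_local, k_i):
    """scores, a, b: dicts keyed by cue name (e.g. "FG/BG", "CLASS").

    Returns -mu_local * k_i * sum over cues of log(1 / (1 + exp(a*s + b))).
    """
    log_prob = 0.0
    for cue, s in scores.items():
        f = a[cue] * s + b[cue]
        log_prob += -_softplus(f)  # log of the sigmoid 1 / (1 + exp(f))
    return -mu_local * k_i * log_prob
```

  With a negative slope `a`, a higher classifier score gives a lower cost,
  which is the behaviour wanted from an energy term.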



  We have evaluated the harmony potential approach on two
  standard, publicly available datasets.
  The PASCAL VOC 2010 Segmentation Challenge dataset contains
  2250 color images of 20 different semantic classes.
  This set is split into 750 images for training, 750 images for
  testing, and 750 for validation.
  The Microsoft MSRC-21 dataset contains 591 color images of 21
  object classes.
  We do our own splits for cross-validation on MSRC-21.


Unsupervised segmentation
  Images are first over-segmented with quick-shift to derive
  super-pixels [Fulkerson, ICCV 2009].
  This preserves object boundaries while simplifying the
  Working at the super-pixel level reduces the number of nodes in
  the CRF by $10^2$ to $10^5$ per image.


Local classification scores: P(Xi = xi |Oi )

   We extract patches with 50% overlap on a regular grid at several
   resolutions (12, 24, 36 and 48 pixels in diameter).
   Patches are described with SIFT, color and, for MSRC-21, location.
   A vocabulary is constructed using k-means to quantize to 1000
   SIFT words and 400 color words.
   An SVM classifier using an intersection kernel is built for each
   semantic category.
   A similar number of positive and negative examples are used:
   a total of around 8,000 superpixel samples for MSRC-21, and
   20,000 for VOC 2010 for each class.
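
   For readers unfamiliar with it, the intersection kernel between two
   bag-of-words histograms is the sum of their bin-wise minima. The helper
   names below are illustrative; this is a sketch of the kernel alone, not
   of our training pipeline.

```python
def intersection_kernel(h1, h2):
    """Histogram intersection: sum over bins of min(h1[b], h2[b])."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(u, v) for u, v in zip(h1, h2))


def gram_matrix(histograms):
    """Pairwise kernel values, usable by an SVM that accepts a precomputed kernel."""
    return [[intersection_kernel(p, q) for q in histograms] for p in histograms]
```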


Global potential and general approach
   For the PASCAL 2010 dataset we use our entry to the 2010 VOC
   Classification Challenge:
   [Khan, IJCV2010 (submitted)].
   It uses a bag-of-words representation based on SIFT and color
   SIFT, plus spatial pyramids and color attention
   [Khan, ICCV 2009].
   An SVM classifier with a χ2 kernel is trained for each semantic
   category in the dataset.
   The FG/BG and CLASS cues are computed by training a
   discriminative model using an SVM with a histogram intersection
   kernel.
   Except for the additional cues and optimization strategy, the
   architecture is the same as our approach described at CVPR
   [Gonfaus, CVPR2010].

Learning the HCRF parameters
  We found it to be essential to train the per-class sigmoid
  parameters through cross validation.
  Classification scores are learned independently, are unbalanced
  and are effectively incomparable in many cases.
  The sigmoid functions weight the importance of each cue for each
  class.
  In addition to these (180) sigmoid parameters, we also must learn
  the weighting factors for each potential.
  We use a stochastic, steepest ascent technique to optimize these
  parameters on a validation set.
  In each step we randomly generate new instances of parameters.
  New parameter instances are generated using a Gibbs-like
  sampling strategy.
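
  The optimization loop can be sketched as below. The Gaussian proposal,
  the round-robin choice of which parameter to resample and all default
  values are assumptions made for illustration; `score_fn` stands for
  whatever validation score is being maximized.

```python
import random


def stochastic_ascent(score_fn, params, n_steps=200, n_candidates=8,
                      step=0.5, seed=0):
    """Keep the best of several random proposals at each step, if it improves."""
    rng = random.Random(seed)
    best = dict(params)
    best_score = score_fn(best)
    names = sorted(best)
    for t in range(n_steps):
        # Gibbs-like: resample one parameter while all others stay fixed.
        name = names[t % len(names)]
        candidates = []
        for _ in range(n_candidates):
            cand = dict(best)
            cand[name] = best[name] + rng.gauss(0.0, step)
            candidates.append((score_fn(cand), cand))
        top_score, top = max(candidates, key=lambda c: c[0])
        # Steepest ascent: move only if the best proposal beats the current score.
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score
```

  Because a move is accepted only when the score rises, the validation
  score never decreases from one step to the next.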

History: PASCAL VOC 2009








                      Bkgnd     Aero  Bicycle     Bird     Boat   Bottle      Bus      Car      Cat    Chair
BONN                   83.9     64.3     21.8     21.7     32.0     40.2     57.3     49.4     38.8      5.2
BROOKES                79.6     48.3      6.7     19.1     10.0     16.6     32.7     38.1     25.3      5.5
Harmony potential      80.5     62.3     24.1     28.3     30.5     32.7     42.2     48.1     22.8      9.1

                        Cow   Dining      Dog    Horse    Motor-  Person   Potted    Sheep     Sofa    Train  TV/mon.     Mean
                               Table                      bike              Plant
BONN                   28.5     22.0     19.6     33.6     45.5     33.6     27.3     40.4     18.1     33.6     46.1     36.3
BROOKES                 9.4     25.1     13.3     12.3     35.5     20.7     13.4     17.1     18.4     37.5     36.4     24.8
Harmony potential      30.1      7.9     21.5     41.9     49.6     31.5     26.1     37.0     20.1     39.4     31.1     34.1


Qualitative results: MSRC-21


Quantitative results: MSRC-21

   MSRC-21 contains more multi-class images than PASCAL.
   Our performance demonstrates the benefits of incorporating
   global scale when making local decisions.


Qualitative results: PASCAL 2010


Quantitative results: PASCAL 2010

   FG/BG shows the performance of our baseline (PASCAL 2009).
   At the top, performance on the validation set (i.e. how well we
   thought we were doing).
   Image tags indicate how well the technique can perform with
   perfect global information.

The cost of segmentation
   The optimal MAP label configuration x∗ is inferred using
   α-expansion graph cuts [Kolmogorov, PAMI2004].
   The global node uses the 100 most probable label subsets
   obtained from ranked subsampling.
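
   How a short list of candidate global labels enters inference can be
   sketched as follows. To keep the example self-contained, the smoothness
   term is dropped so that each superpixel can be labelled independently;
   this stands in for α-expansion and is not the inference we ran. The
   per-candidate cost and `gamma` are hypothetical.

```python
def infer_with_global_candidates(unary, candidates, gamma=1.0):
    """Pick the candidate global subset and local labels with the lowest energy.

    unary: one dict per superpixel, mapping label -> unary cost.
    candidates: list of (label_subset, global_cost) pairs, e.g. the top-ranked
    subsets together with a cost for assigning each one to the global node.
    """
    best = None
    for subset, global_cost in candidates:
        labels, total = [], global_cost
        for costs in unary:
            # Harmony-style penalty: gamma for any label outside the subset.
            def cost(label):
                return costs[label] + (0.0 if label in subset else gamma)
            choice = min(costs, key=cost)
            labels.append(choice)
            total += cost(choice)
        if best is None or total < best[0]:
            best = (total, labels, subset)
    return best
```

   The running time grows linearly with the number of candidates, which is
   why the size of the candidate list matters.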
   [Plot: mAP on MSRC-21 (left axis, 50 to 85) and mAP on PASCAL VOC 2010
   (right axis, 30 to 50) against the number of labels selected (1 to 200).]

Qualitative results: PASCAL 2010 failures

   Context is sometimes weighted too much.
   When the global classifier fails, little can be done.


Every little bit helps


A photo finish
      [Bar chart, scale 15 to 40]
      Cues                 Score
      FG-BG                33.9
      CLASS                23.4
      LOC                  20.1
      OBJ                  26.2
      FG-BG + CLASS        36.6
      All                  40.4

      [Plot: mAP on PASCAL VOC 2010 against a horizontal axis running from 0 to 3000.]

      The final results are tough to call between BONN and CVC.
      In the end, fusion over many scales and per-class, per-feature
      parameter optimization won.

Introduction   The data
        Harmony potential 2.0: fusing across scale    State-of-the-art
                                Action recognition    Our approach
                                        Discussion    Results

The action recognition taster

   Images were collected from Flickr using action queries. A set of nine
   actions was chosen in the end.
   They are disjoint from the main challenge dataset.
   Only a subset of the people is annotated (bounding box + action).
   Each person in this subset is labelled with exactly one action class.
   Important point: we don’t have to solve the detection problem.
   Most action classes in the challenge contain either large variation
   in scale or large variations in pose (or both).


Dataset breakdown

                            train         val       trainval       test
                          img  obj     img  obj     img  obj     img  obj
     Phoning               25   25      25   26      50   51       -    -
     Playinginstrument     27   38      27   38      54   76       -    -
     Reading               25   26      26   27      51   53       -    -
     Ridingbike            25   33      25   33      50   66       -    -
     Ridinghorse           27   35      26   36      53   71       -    -
     Running               26   47      25   47      51   94       -    -
     Takingphoto           25   27      26   28      51   55       -    -
     Usingcomputer         26   29      26   30      52   59       -    -
     Walking               25   41      26   42      51   83       -    -
     Total                226  301     228  307     454  608       -    -


Grouplets and poselets
   Two state-of-the-art techniques for action recognition in still
   images. The grouplets of Fei-Fei Li [Yao et al, CVPR2010]:

   And the latent poses of Greg Mori [Yang et al, CVPR2010]:


Treat it like image classification

   Initial experiments confirmed our intuition about the limitations of
   the data.
       Structural learning: sampling of pose space not dense enough.
       Latent SVM: complexity of object interactions problematic.
       Multiple kernel learning: converges to simple selection.
   State-of-the-art techniques rely on learning complex structural
   models of pose-variations over many
   From a very early stage, we decided to treat action recognition as
   an image classification problem.
   We exploit the small size of the dataset by performing extensive cross
   validation.


The classification pipeline


Action recognition: features

   SIFT, color SIFT (normalized R/G and opponent), self-similarity,
   SURF, PHOG (good for capturing pose), and color attention
   (focuses on interesting color features).
   Sparse and dense variations of most of these.
   Plus a range of pyramid configurations (1, 2 × 2, 3 × 3, 4 × 4).
   Object detectors also incorporated using a simple occurrence
   histogram [Felzenszwalb 2010].
   The goal was to incorporate all of this into a BoVW classifier and
   push the limits of what is possible using classical BoW on actions.
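
   The pyramid configurations listed above can be illustrated with a
   minimal sketch of a spatial-pyramid bag-of-visual-words descriptor. It
   assumes local features have already been quantized to visual-word
   indices. The function name `pyramid_histogram`, its parameters and the
   per-cell L1 normalization are illustrative assumptions, not the
   pipeline used for the submission.

```python
def pyramid_histogram(points, words, vocab_size, width, height, grids=(1, 2, 3, 4)):
    """Concatenate per-cell visual-word histograms over g x g grids.

    points: (x, y) feature locations inside a width x height region
    words:  visual-word index of each feature (0 <= w < vocab_size)
    grids:  pyramid levels; (1, 2, 3, 4) mirrors 1, 2x2, 3x3, 4x4
    """
    feature = []
    for g in grids:
        cells = [[0.0] * vocab_size for _ in range(g * g)]
        for (x, y), w in zip(points, words):
            # Clamp so that points on the right/bottom edge fall in the last cell.
            col = min(int(x * g / width), g - 1)
            row = min(int(y * g / height), g - 1)
            cells[row * g + col][w] += 1.0
        for hist in cells:
            total = sum(hist)
            # L1-normalize each cell (an assumption); empty cells stay all-zero.
            feature.extend(v / total if total > 0 else 0.0 for v in hist)
    return feature
```

   With a vocabulary of K words, the four levels give a descriptor of
   length K * (1 + 4 + 9 + 16), which is one reason the pool of feature
   configurations grows so quickly.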


Action recognition: contextual pyramids

   Context was also important for most object classes.
   We used a type of foreground/background pyramid decomposition
   that splits features into object or background.
   This was done using a type of spatial soft-assign based on the
   distance to the boundary of the object.
   For some classes, we also assigned contextual object regions that
   model the appearance of objects associated with them (the “horsey
   box”).
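
   One way to picture the spatial soft-assign is the sketch below. The
   slides do not give the weighting function, so the sigmoid of the signed
   distance to the bounding-box boundary, the `sigma` parameter and both
   function names are assumptions made purely for illustration.

```python
import math

def signed_distance_to_box(x, y, box):
    """Distance to the box boundary: positive inside, negative outside."""
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    if dx > 0.0 or dy > 0.0:
        return -math.hypot(dx, dy)
    return min(x - x0, x1 - x, y - y0, y1 - y)

def soft_fg_bg_histograms(points, words, vocab_size, box, sigma=10.0):
    """Split each feature's vote between a foreground and a background histogram."""
    fg = [0.0] * vocab_size
    bg = [0.0] * vocab_size
    for (x, y), w in zip(points, words):
        d = signed_distance_to_box(x, y, box)
        # Sigmoid written with tanh so that large distances cannot overflow.
        weight = 0.5 * (1.0 + math.tanh(d / (2.0 * sigma)))
        fg[w] += weight
        bg[w] += 1.0 - weight
    return fg, bg
```

   Features deep inside the box vote almost entirely for the foreground
   histogram, features far outside vote for the background, and features
   near the boundary are shared between the two.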


Action recognition: learning in the design space

   In the end, after all of the combinatorics introduced by pyramids
   and other variations, we had about 100 feature configurations in a
   big pool.
   Most attempts to automatically learn the parameters of these
   features were total failures.
   Except one. Initial experiments with multiple kernel learning
   showed that MKL starts converging quickly towards class-specific
   feature selection rather than mixing.
   With such a small dataset, and a little heuristic trimming, we were
   able to exhaustively explore a part of the design space.
   This resulted in the best per-class feature combinations.
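
   The exhaustive exploration can be sketched as a per-class search over
   small feature subsets. The callable `cv_score` (a cross-validated score
   for one class and one feature combination) and the `max_size` cut-off
   are hypothetical stand-ins for the real evaluation and the heuristic
   trimming, which the slides do not detail.

```python
from itertools import combinations

def best_combination_per_class(classes, features, cv_score, max_size=2):
    """Return, for every class, the feature subset with the highest score.

    cv_score(cls, combo) is assumed to return a cross-validated score
    (e.g. average precision) for class `cls` using the tuple `combo`.
    """
    best = {}
    for cls in classes:
        candidates = (
            combo
            for k in range(1, max_size + 1)
            for combo in combinations(features, k)
        )
        best[cls] = max(candidates, key=lambda combo: cv_score(cls, combo))
    return best
```

   The number of candidates grows combinatorially with `max_size`, so a
   search like this is only practical because the dataset is small and
   the pool has been trimmed first.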


Action recognition: classification

   We experimented with a number of kernels (histogram
   intersection, χ², bin-ratio distance).
   There wasn’t a huge difference among these kernels.
   In the end, we chose histogram intersection for our submission as
   it appeared to generalize better.
   In addition to over-fitting less, there are no parameters to tune and
   it is very fast.
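
   For reference, the histogram intersection kernel is simple enough to
   write in a few lines. This is a plain-Python sketch with illustrative
   function names, not the code used for the submission.

```python
def histogram_intersection(h1, h2):
    """K(h1, h2) = sum_k min(h1[k], h2[k]); there are no parameters to tune."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def gram_matrix(rows, cols):
    """Kernel matrix between two lists of histograms."""
    return [[histogram_intersection(r, c) for c in cols] for r in rows]
```

   A matrix built this way can be handed to any SVM implementation that
   accepts a precomputed kernel, for example scikit-learn's
   `SVC(kernel="precomputed")`.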


Overall results: average precision


Per-class AP


Per technique median average precision


Qualitative results

   When the horsey box and detectors fail, context dominates.
   Classifier still surprisingly robust.


Qualitative results

   Some fine discriminations very difficult to make.
   Probably difficult even for humans.


Qualitative results

   People taking photos should be banned.
   Classes with large pose variations were the most difficult.


Discussion: semantic image segmentation

  The harmony potential works well for fusing global information into
  local segmentations.
  This year we showed that the harmony potential framework is
  also appropriate for incorporating different types of mid-level cues.
  Ranked sub-sampling, driven by the same posterior as used to
  define the global potential function, renders the optimization
  problem tractable.
  Most useful when multiple semantic classes co-occur frequently.
  Per-class learning of parameters essential (about +5% in final


Discussion: action recognition

   This year’s taster challenge on action recognition was little more
   than a toy.
   However, we have demonstrated what is possible using proven
   techniques from image classification.
   We feel that object context, in particular object interaction context,
   is the way forward.
   The PASCAL data set is the right direction to go (more general),
   but we need more samples.


The future: segmentation

   Semantic image segmentation has come a long way, but still has a
   long way to go.
   It is becoming a mainstream event in PASCAL.
   This year we arrived at a sort of three-way detente between the
   CVC (winner 2010), BONN (winner 2009) and OXFORD (best
   paper award ECCV 2010) in segmentation.
   Each has its own approach, and each has its advantages and
   disadvantages.
   Engineering can probably maximize results.
   It is becoming mature, and we can begin thinking about what new
   applications are enabled by such technologies.


The future: action recognition

   It seems that action recognition in still images is a popular
   The PASCAL organizers are keen to promote it for the future.
   The focus will remain on still images, perhaps with more
   emphasis on incorporating user interaction as well.
   It seems that the community is becoming more interested in the
   “alternative” PASCAL challenges.
   The multimedia community probably has an important role to play


PASCAL VOC 2010: semantic object segmentation and action recognition in still images

  • 1. PASCAL VOC 2010: Semantic object segmentation and action recognition in still images. Andrew D. Bagdanov, Departamento de Ciencias de la Computación, Universidad Autónoma de Barcelona. Xavier, Pep, Nataliya, Wenjuan, Fahad. The CVC PASCAL VOC Team.
  • 2. Overview. On 03/05/2010 the PASCAL VOC competition was announced and the training and validation sets published. The 20 semantic categories for the competition remain the same: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and tv/monitor.
  • 3. Old competitions, new competitions. There are two (+ 1/2) main challenges in PASCAL. Image classification is the prediction of the presence/absence of an instance of a class in a test image. Object detection is the prediction of the bounding box and label of each object from the twenty target classes in a test image. Semantic image segmentation is the assignment of one of the twenty class labels to every pixel in a test image. Image segmentation is becoming a mainstream competition. Action recognition in still images was included as a new “taster challenge” this year. Taster competitions are used to measure interest in new problems.
  • 4. Our contributions to PASCAL VOC 2010. Last year we participated in the Detection, Classification and Segmentation challenges. This year we decided to concentrate on Classification and Segmentation. Our segmentation technique relies heavily on classification. We also fielded a team in Action Recognition this year to see what that’s all about. As always, success in PASCAL VOC challenges is approximately 85% engineering, 10% inspiration and 5% luck (if you’re lucky).
  • 5. Outline. 1 Introduction: Overview of the challenges; Our contribution and main ideas. 2 The harmony potential 2.0: fusing across scale: Building on last year’s submission; Fusing across scales and learning. 3 Action recognition: A torrent of features; Exploiting the size of the problem. 4 Discussion.
  • 6. Giving semantics to pixels. [Figure panels: Image, Object, Class.] Semantic image segmentation is not object segmentation. Only for simple cases are they the same.
  • 7. Turning a hard problem into a harder one. [Figure panels: Image, Object, Class.] The objective is to assign semantic labels to every pixel. Fine distinctions must be made.
  • 8. Make that a very hard one. [Figure panels: Image, Object, Class.] The objective is to assign semantic labels to every pixel. Fine distinctions must be made. Occlusions, varying viewpoint and size complicate things.
  • 9. Action recognition in still images. New competition this year: human action recognition in still images. Individual images sampled from the Flickr dataset. Bounding boxes of the humans in each image are provided. Very important: we don’t have to solve the detection problem. Action recognition is offered as a “taster challenge” in order to gauge interest in the general problem. It was difficult to hypothesize about what would succeed and what would not in this challenge.
  • 10. Action classes. [Figure.]
  • 11. Segmentation: the role of context. Context provides very important cues for making fine discriminations at the (super-)pixel scale. We can exploit three levels of scale: local, mid-level and global [Zhu, NIPS2008]. Existing techniques apply overly simplified models of context that do not generalize upward from local to global scales.
  • 12. Segmentation: global constraints on label combinations. Our principal idea is to use global Classification to enhance segmentation results. Global image classification results tend to be less noisy than local ones. We will use them to constrain the combinations of semantic labels we are likely to encounter during segmentation. We showed last year how a tractable inference technique can be devised for this labeling problem (our PASCAL 2009 entry). This year we also show how mid-level context can be incorporated in the form of object detections. We also show how position priors can be similarly incorporated into the framework to provide class-specific location information. Finally, we devised a stochastic steepest ascent technique for optimizing the many parameters in a class-specific way.
  • 13. Action recognition: driven by data limitations. Initial experiments confirmed our intuition about the limitations of the data. Structural learning: sampling of pose space not dense enough. Latent SVM: object interactions under-sampled as well. Multiple kernel learning: converges to simple selection. From a very early stage, we decided to treat action recognition as an image classification problem. We exploit the small size of the dataset by performing extensive cross validation. Features are one of our strong points, and we had to get the feature pipeline running for Classification in any case.
  • 14. HCRFs for labeling problem. We represent our segmentation problem as a graph $G = (V, E)$. $V$ is used for indexing random variables, and $E$ is the set of undirected edges representing compatibility relationships between random variables. $X = \{X_i\}$ denotes the set of random variables or nodes, for $i \in V$. An energy function will be defined over graphical configurations of random variables. By the Hammersley-Clifford theorem, the probability of a configuration $x = \{x_i\}$ can be written as proportional to the exponential of the negative of an energy function, $P(x) \propto \exp(-E(x))$ with $E(x) = \sum_{c \in C} \varphi_c(x_c)$, where $\varphi_c$ is the potential function of clique $c \in C$.
  • 15. Consistency potentials for labeling problems. The energy function of $G$ can be written as: $E(x) = \sum_{i \in V} \phi(x_i) + \sum_{(i,j) \in E_L} \psi_L(x_i, x_j) + \sum_{(i,g) \in E_G} \psi_G(x_i, x_g)$. The unary term $\phi(x_i)$ depends on a single probability $P(X_i = x_i \mid O_i)$, where $O_i$ is the observation that affects $X_i$ in the model. The smoothness potential $\psi_L(x_i, x_j)$ determines the pairwise relationship between two local nodes. The consistency potential $\psi_G(x_i, x_g)$ expresses the dependency between local nodes and a global node. The Maximum a Posteriori (MAP) estimate of the optimal labeling is: $x^* = \arg\min_x E(x)$.
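To make the three-term energy on slide 15 concrete, here is a minimal sketch that evaluates it for a tiny graph and finds the MAP labeling by brute force. The Potts-style forms chosen for both the smoothness and the consistency terms, the weights `lam_smooth` and `lam_global`, and the function names are all assumptions made for illustration; they are not the potentials used in the actual system, and real inference would not enumerate labelings.

```python
from itertools import product

def energy(x, x_g, unary, edges, lam_smooth=1.0, lam_global=1.0):
    """E(x) = sum of unary + smoothness + consistency terms (illustrative forms)."""
    # Unary term: unary[i][label] is the cost of giving node i that label.
    e = sum(unary[i][xi] for i, xi in enumerate(x))
    # Smoothness term (Potts, assumed): pay lam_smooth when neighbours disagree.
    e += sum(lam_smooth for i, j in edges if x[i] != x[j])
    # Consistency term (Potts-style, assumed): pay lam_global for each local
    # node whose label differs from the single label x_g of the global node.
    e += sum(lam_global for xi in x if xi != x_g)
    return e

def map_labeling(unary, edges, n_labels):
    """Brute-force arg min over all local labelings and global labels."""
    n = len(unary)
    candidates = (
        (x, g)
        for x in product(range(n_labels), repeat=n)
        for g in range(n_labels)
    )
    return min(candidates, key=lambda c: energy(c[0], c[1], unary, edges))

# Three nodes in a chain; the third weakly prefers label 1 on its own.
unary = [[0, 3], [0, 3], [1, 0]]
edges = [(0, 1), (1, 2)]
best_x, best_g = map_labeling(unary, edges, n_labels=2)
```

In this toy case the smoothness and consistency terms outvote the weak local preference of the third node, which is the kind of behaviour the global node is meant to provide.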
  • 16. HCRF models of image segmentation. [Figure panels: Smoothness (Shotton et al., CVPR 2008); Potts (Plath et al., ICML 2009); Robust $P^N$ (Ladicky et al., ICCV 2009); Free.] Colored nodes represent (hidden) semantic labels. Dark nodes represent image measurements. Red edges represent penalties imposed by the potential.
  • 17. Different features for discriminations. The previously mentioned approaches all try to make global distinctions using local information, either by voting of local observations (Potts) or by penalizing rampantly discordant local label assignments (Robust $P^N$). None of these techniques try to exploit truly global information to constrain local labels. And none incorporate the notion of encoding combinations of primitive node labels at the global level.
  • 18. The harmony potential: selective subsets. Only labels that do not agree with the subset are penalized. It can represent more diverse label combinations.
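The difference between a single-label consistency term and the subset-based harmony potential can be sketched as follows. This is only an illustration of the idea stated on slide 18: the constant penalty `gamma` and the function names are assumptions, and the actual potential may scale the penalty differently.

```python
def potts_consistency(x_i, x_g, gamma=1.0):
    """Global node holds ONE label; every other local label is penalized."""
    return 0.0 if x_i == x_g else gamma

def harmony_potential(x_i, x_g_subset, gamma=1.0):
    """Global node holds a SUBSET of labels; only labels outside it are penalized."""
    return 0.0 if x_i in x_g_subset else gamma

def total_consistency(local_labels, x_g_subset, gamma=1.0):
    """Sum of harmony penalties over all local nodes (e.g. superpixels)."""
    return sum(harmony_potential(x, x_g_subset, gamma) for x in local_labels)

# A cow standing on grass: both labels are compatible with the global subset,
# so only the stray "sheep" superpixel is penalized.
superpixels = ["cow", "cow", "grass", "sheep"]
penalty = total_consistency(superpixels, frozenset({"cow", "grass"}))
```

With a Potts-style global node labelled "cow", the "grass" superpixel would be penalized as well, which is why the subset formulation can represent multi-class images.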
  • 19. The harmony potential: overview
  • 20. Ranked subsampling of $\mathcal{P}(L)$. We can do this using the following posterior over label subsets $\ell$: $P(\ell \subseteq x_g^* \mid O) \propto P(\ell \subseteq x_g^*)\, P(O \mid \ell \subseteq x_g^*)$. This allows us to effectively rank possible global node labels, and thus to prioritize candidates in the search for the optimal label $x_g^*$. $P(\ell \subseteq x_g^* \mid O)$ establishes an order on subsets of the (unknown) optimal labeling of the global node $x_g^*$ that guides the consideration of global labels. We may not be able to exhaustively consider all labels in $\mathcal{P}(L)$, but at least we consider the most likely candidates for $x_g^*$. And image classification can give us an estimate of this posterior.
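One way to picture ranked subsampling is the sketch below, which scores small label subsets and keeps the top-ranked ones as candidates for the global node. It makes strong simplifying assumptions that are not taken from the slides: per-class image classification probabilities stand in for the likelihood term and are treated as independent, the prior is a user-supplied function, and subsets are limited to `max_size` labels. All names are hypothetical.

```python
from itertools import combinations
from math import prod

def rank_label_subsets(class_probs, prior, max_size=3, top_m=100):
    """Return the top_m label subsets ranked by prior(subset) * likelihood."""
    labels = sorted(class_probs)
    scored = []
    for size in range(1, max_size + 1):
        for subset in combinations(labels, size):
            # Assumed likelihood: all labels in the subset are present,
            # with independent per-class classifier probabilities.
            likelihood = prod(class_probs[k] for k in subset)
            scored.append((prior(subset) * likelihood, subset))
    # Highest posterior score first; only these candidates reach inference.
    scored.sort(key=lambda t: t[0], reverse=True)
    return [subset for _, subset in scored[:top_m]]

probs = {"cow": 0.9, "grass": 0.8, "boat": 0.1}
candidates = rank_label_subsets(probs, prior=lambda s: 1.0, top_m=3)
```

With a flat prior the ranking is driven entirely by the classifier scores; a co-occurrence prior estimated from training annotations could be plugged in through the `prior` argument.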
  • 21. PASCAL 2010: pushing the limit. The previous slides describe the approach we used for the PASCAL 2009 submission. The discriminative model was based only on SVMs trained to discriminate object classes from their own backgrounds. Starting with the harmony potential approach, this year we concentrated on adding cues derived from different levels of mid-level context. We found the HCRF model with harmony potential to be very useful for performing this fusion. Our hypothesis at the end of the 2009 competition was that detection would be essential for pushing forward the state of the art.
  • 22. PASCAL 2010: fusing across scales.
      1. FG/BG: 20 SVMs trained to discriminate classes from their own background. The same discriminative model used last year, essential for localizing object boundaries.
      2. CLASS: 20 SVMs trained to discriminate each object class from the other object classes. Essential for distinguishing objects with similar backgrounds (e.g. cows from sheep, birds from planes). Incorporated directly into the unary potential.
      3. LOC: 20 class-specific location priors. Computed from ground truth segmentations by simple spatial averaging. A form of top-down mid-level context.
      4. OBJ: 20 class-specific object detectors [Felzenszwalb 2010] are converted to superpixel scores by selecting the highest scoring detection intersecting each pixel of the superpixel. A type of bottom-up mid-level context.
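The OBJ cue on slide 22 turns bounding-box detections into per-superpixel scores. A minimal sketch of one way this could be done follows: each pixel takes the highest score among the detections covering it, and the pixel scores are then averaged within each superpixel. The averaging step, the `floor` score for uncovered pixels, the half-open box convention, and the function name are assumptions for illustration.

```python
import numpy as np

def superpixel_detection_scores(segments, detections, floor=0.0):
    """segments: (H, W) int array of superpixel ids 0..n-1.
    detections: iterable of (x0, y0, x1, y1, score), with x1 and y1 exclusive.
    Returns one score per superpixel."""
    pixel_score = np.full(segments.shape, floor, dtype=float)
    for x0, y0, x1, y1, score in detections:
        region = pixel_score[y0:y1, x0:x1]       # view into pixel_score
        np.maximum(region, score, out=region)    # keep the best detection per pixel
    n = int(segments.max()) + 1
    sums = np.bincount(segments.ravel(), weights=pixel_score.ravel(), minlength=n)
    counts = np.bincount(segments.ravel(), minlength=n)
    # Average pixel scores inside each superpixel (assumed aggregation).
    return sums / np.maximum(counts, 1)

segments = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1]])
detections = [(0, 0, 2, 2, 0.5), (1, 0, 3, 2, 0.9)]
scores = superpixel_detection_scores(segments, detections)
```

Here the two overlapping boxes compete on the second pixel column, and the higher-scoring detection wins there before the superpixel average is taken.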
  • 23. PASCAL 2010: learning unary potentials. We compute the unary potential by weighting the classification scores $\{s_i(k, x_i)\}_{k \in F}$ through a sigmoid function. The unary potential becomes: $\phi_L(x_i) = -\mu_L K_i \log \prod_{k \in F} \frac{1}{1 + \exp(f_i(k, x_i))}$, with $f_i(k, x_i) = a(k, x_i)\, s_i(k, x_i) + b(k, x_i)$. $\mu_L$ is the weighting factor of the local unary potential, and $K_i$ normalizes over the number of pixels inside the superpixel. We have two sigmoid parameters for each class/cue pair: $a(k, x_i)$ and $b(k, x_i)$.
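A minimal numerical sketch of the sigmoid-weighted unary potential on slide 23, for one superpixel and one candidate label. It assumes the per-cue sigmoid terms are combined as a product inside the logarithm (equivalently, a sum of log-sigmoids); the function name and the example values are hypothetical.

```python
import numpy as np

def unary_potential(scores, a, b, mu_l, k_i):
    """scores, a, b: arrays with one entry per cue k in F for a fixed label x_i.
    mu_l: weight of the local unary term; k_i: superpixel size normalizer."""
    f = a * scores + b
    # log(1 / (1 + exp(f))) = -log(1 + exp(f)) = -logaddexp(0, f), numerically stable.
    log_prob = -np.logaddexp(0.0, f).sum()
    return -mu_l * k_i * log_prob

# Four cues (e.g. FG/BG, CLASS, LOC, OBJ) with neutral scores.
neutral = unary_potential(np.zeros(4), a=-np.ones(4), b=np.zeros(4), mu_l=1.0, k_i=2.0)
# Strong positive scores with a negative slope give a lower (better) energy.
confident = unary_potential(np.full(4, 3.0), a=-np.ones(4), b=np.zeros(4), mu_l=1.0, k_i=2.0)
```

The per-class, per-cue slopes `a` and offsets `b` are what make otherwise incomparable classifier scores commensurable before they enter the energy.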
  • 24. Datasets. We have evaluated the harmony potential approach on two standard, publicly available datasets. The PASCAL VOC 2010 Segmentation Challenge dataset contains 2250 color images of 20 different semantic classes. This set is split into 750 images for training, 750 images for testing, and 750 for validation. The Microsoft MSRC-21 dataset contains 591 color images of 21 object classes. We do our own splits for cross-validation on MSRC-21.
  • 25. Unsupervised segmentation. Images are first over-segmented with quick-shift to derive superpixels [Fulkerson, ICCV 2009]. This preserves object boundaries while simplifying the representation. Working at the superpixel level reduces the number of nodes in the CRF by $10^2$ to $10^5$ per image.
  • 26. Local classification scores: $P(X_i = x_i \mid O_i)$. We extract patches with 50% overlap on a regular grid at several resolutions (12, 24, 36 and 48 pixels in diameter). Patches are described with SIFT, color and, for MSRC-21, location features. A vocabulary is constructed using k-means to quantize to 1000 SIFT words and 400 color words. An SVM classifier using an intersection kernel is built for each semantic category. A similar number of positive and negative examples are used: a total of around 8,000 superpixel samples for MSRC-21, and 20,000 for VOC 2010, for each class.
  • 27. Global potential and general approach. For the PASCAL 2010 dataset we use our entry to the 2010 VOC Classification Challenge [Khan, IJCV 2010 (submitted)]. It uses a bag-of-words representation based on SIFT and color SIFT, plus spatial pyramids and color attention [Khan, ICCV 2009]. An SVM classifier with a $\chi^2$ kernel is trained for each semantic category in the dataset. The FG/BG and CLASS cues are computed by training a discriminative model using an SVM with histogram intersection kernel. Except for the additional cues and optimization strategy, the architecture is the same as in our approach described at CVPR [Gonfaus, CVPR 2010].
  • 28. Learning the HCRF parameters. We found it to be essential to train the per-class sigmoid parameters through cross validation. Classification scores are learned independently, are unbalanced and are effectively incomparable in many cases. The sigmoid functions weight the importance of each cue for each class. In addition to these (180) sigmoid parameters, we also must learn the weighting factors for each potential. We use a stochastic, steepest ascent technique to optimize these parameters on a validation set. In each step we randomly generate new instances of parameters. New parameter instances are generated using a Gibbs-like sampling strategy.
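The parameter search on slide 28 can be pictured as the following sketch: at each step one parameter is resampled while the others stay fixed (the Gibbs-like part), and a candidate is kept only if it improves the validation score (the ascent part). The Gaussian proposal, the number of candidates per step, and the toy objective are assumptions for illustration; in the real system `score_fn` would run segmentation on the validation set and return its accuracy.

```python
import random

def stochastic_ascent(score_fn, params, n_iters=1000, n_candidates=5, step=0.5, seed=0):
    """Greedy stochastic coordinate search that maximizes score_fn(params)."""
    rng = random.Random(seed)
    best = list(params)
    best_score = score_fn(best)
    for _ in range(n_iters):
        idx = rng.randrange(len(best))           # resample one coordinate at a time
        for _ in range(n_candidates):
            cand = list(best)
            cand[idx] += rng.gauss(0.0, step)    # assumed Gaussian proposal
            s = score_fn(cand)
            if s > best_score:                   # keep only improvements
                best, best_score = cand, s
    return best, best_score

# Toy objective with its maximum (0.0) at params == [1, 1, 1].
def toy(p):
    return -sum((v - 1.0) ** 2 for v in p)

start = [0.0, 0.0, 0.0]
params, score = stochastic_ascent(toy, start)
```

Because only improvements are accepted, the validation score never decreases over iterations, which matches the shape of a typical score-versus-iterations curve for this kind of search.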
  • 29. History: PASCAL VOC 2009.
      Class           BONN   BROOKES   Harmony potential
      Background      83.9   79.6      80.5
      Aeroplane       64.3   48.3      62.3
      Bicycle         21.8    6.7      24.1
      Bird            21.7   19.1      28.3
      Boat            32.0   10.0      30.5
      Bottle          40.2   16.6      32.7
      Bus             57.3   32.7      42.2
      Car             49.4   38.1      48.1
      Cat             38.8   25.3      22.8
      Chair            5.2    5.5       9.1
      Cow             28.5    9.4      30.1
      Dining table    22.0   25.1       7.9
      Dog             19.6   13.3      21.5
      Horse           33.6   12.3      41.9
      Motorbike       45.5   35.5      49.6
      Person          33.6   20.7      31.5
      Potted plant    27.3   13.4      26.1
      Sheep           40.4   17.1      37.0
      Sofa            18.1   18.4      20.1
      Train           33.6   37.5      39.4
      TV/Monitor      46.1   36.4      31.1
      Average         36.3   24.8      34.1
  • 30. Qualitative results: MSRC-21
  • 31. Quantitative results: MSRC-21. MSRC-21 contains more multi-class images than PASCAL. Our performance demonstrates the benefits of incorporating global scale when making local decisions.
  • 32. Qualitative results: PASCAL 2010
  • 33. Quantitative results: PASCAL 2010. FG/BG shows the performance of our baseline (PASCAL 2009) approach. At the top, performance on the validation set (i.e. how well we thought we were doing). Image tags indicates how well the technique can perform with perfect global information.
  • 34. The cost of segmentation. The optimal MAP label configuration $x^*$ is inferred using α-expansion graph cuts [Kolmogorov, PAMI 2004]. The global node uses the 100 most probable label subsets obtained from ranked subsampling. [Figure: mAP on MSRC-21 and mAP on PASCAL VOC 2010 versus the number of labels selected (1 to 200).]
  • 35. Qualitative results: PASCAL 2010 failures. Context is sometimes weighted too much. When the global classifier fails, little can be done.
  • 36. Every little bit helps
  • 37. A photo finish. [Figure: mAP on PASCAL VOC 2010 per cue: FG-BG 33.9, CLASS 23.4, LOC 20.1, OBJ 26.2, FG-BG + CLASS 36.6, All 40.4.] [Figure: mAP on PASCAL VOC 2010 versus number of iterations (0 to 3000).] The final results are tough to call between BONN and CVC. In the end, fusion over many scales and per-class, per-feature parameter optimization won.
  • 38. The action recognition taster. Images were collected from Flickr using action queries. A set of nine actions was chosen in the end. They are disjoint from the main challenge dataset. Only a subset of people is annotated (bounding box + action). Each person in this subset is labelled with exactly one action class. Important point: we don’t have to solve the detection problem. Most action classes in the challenge contain either large variation in scale or large variations in pose (or both).
  • 39. Dataset breakdown.
                            train        val       trainval      test
      Action               img  obj    img  obj    img  obj    img  obj
      Phoning               25   25     25   26     50   51      -    -
      Playinginstrument     27   38     27   38     54   76      -    -
      Reading               25   26     26   27     51   53      -    -
      Ridingbike            25   33     25   33     50   66      -    -
      Ridinghorse           27   35     26   36     53   71      -    -
      Running               26   47     25   47     51   94      -    -
      Takingphoto           25   27     26   28     51   55      -    -
      Usingcomputer         26   29     26   30     52   59      -    -
      Walking               25   41     26   42     51   83      -    -
      Total                226  301    228  307    454  608      -    -
  • 40. Grouplets and poselets. Two state-of-the-art techniques for action recognition in still images: the grouplets of Fei-Fei Li [Yao et al., CVPR 2010] and the latent poses of Greg Mori [Yang et al., CVPR 2010].
  • 41. Treat it like image classification. Initial experiments confirmed our intuition about the limitations of the data. Structural learning: the sampling of pose space is not dense enough. Latent SVM: the complexity of object interactions is problematic. Multiple kernel learning: converges to simple selection. State-of-the-art techniques rely on learning complex structural models of pose variations over many ... From a very early stage, we decided to treat action recognition as an image classification problem. We exploit the small size of the dataset by performing extensive cross validation.
  • 42. The classification pipeline
  • 43. Action recognition: features. SIFT, color SIFT (normalized RG and opponent), self-similarity, SURF, PHOG (good for capturing pose), and color attention (focuses on interesting color features). Sparse and dense variations of most of these. Plus a range of pyramid configurations (1, 2 × 2, 3 × 3, 4 × 4). Object detectors are also incorporated using a simple occurrence histogram [Felzenszwalb 2010]. The goal was to incorporate all of this into a BoVW classifier and push the limits of what is possible using classical BoW on actions.
  • 44. Action recognition: contextual pyramids. Context was also important for most object classes. We used a type of foreground/background pyramid decomposition that split features into object or background. This was done using a type of spatial soft-assign based on the distance to the boundary of the object. For some classes, we also assigned contextual object regions that model the appearance of objects associated with them (the “horsy box”).
  • 45. Action recognition: learning in the design space. In the end, after all of the combinatorics introduced by pyramids and other variations, we had about 100 feature configurations in a big pool. Most attempts to automatically learn the parameters of these features were total failures. Except one. Initial experiments with multiple kernel learning showed that MKL starts converging quickly towards class-specific feature selection rather than mixing. With such a small dataset, and a little heuristic trimming, we were able to exhaustively explore a part of the design space. This resulted in the best per-class feature combinations.
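The exhaustive per-class exploration described on slide 45 can be sketched as a search over small feature subsets, scored by cross-validation. Everything concrete here is hypothetical: the feature names, the cap `max_size` on combination size (standing in for the "heuristic trimming"), and the toy `cv_score` function, which in practice would train and evaluate a classifier for one action class.

```python
from itertools import combinations

def best_feature_combination(features, cv_score, max_size=3):
    """Try every combination of up to max_size features; return the best one."""
    best_combo, best_score = None, float("-inf")
    for size in range(1, max_size + 1):
        for combo in combinations(features, size):
            score = cv_score(combo)          # e.g. cross-validated AP for one class
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score

# Toy stand-in for cross-validated AP: additive gains, one feature hurts.
gains = {"sift": 0.4, "phog": 0.3, "surf": -0.1}

def toy_cv(combo):
    return sum(gains[f] for f in combo)

combo, score = best_feature_combination(["sift", "phog", "surf"], toy_cv)
```

Running the search once per action class yields one feature combination per class, which is only feasible because the dataset and the trimmed pool are small.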
  • 46. Action recognition: classification. We experimented with a number of kernels (histogram intersection, $\chi^2$, bin-ratio distance). There wasn’t a huge difference among these kernels. In the end, we chose histogram intersection for our submission as it appeared to generalize better. In addition to over-fitting less, there are no parameters to tune and it is very fast.
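For reference, the histogram intersection kernel between two histograms is the sum of their bin-wise minima, which is why it has no parameters to tune. A minimal sketch that computes the full kernel matrix between two sets of bag-of-words histograms (the function name and example data are illustrative):

```python
import numpy as np

def histogram_intersection_kernel(A, B):
    """A: (n, d) histograms, B: (m, d) histograms.
    Returns the (n, m) matrix K[i, j] = sum_k min(A[i, k], B[j, k])."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

# Two L1-normalized toy histograms over a 2-word vocabulary.
H = np.array([[0.5, 0.5],
              [1.0, 0.0]])
K = histogram_intersection_kernel(H, H)
```

A matrix like `K` can be handed to any SVM implementation that accepts a precomputed kernel. The broadcasting used here materializes an (n, m, d) array, so large vocabularies or sample counts would need a chunked loop instead.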
  • 47. Overall results: average precision
  • 48. Per-class AP
  • 49. Per technique median average precision
  • 50. Qualitative results. When the horsy box and detectors fail, context dominates. The classifier is still surprisingly robust.
  • 51. Qualitative results. Some fine discriminations are very difficult to make, probably even for humans.
  • 52. Qualitative results. People taking photos should be banned. Classes with large pose variations were the most difficult.
  • 53. Discussion: semantic image segmentation. The harmony potential works well for fusing global information into local segmentations. This year we showed that the harmony potential framework is also appropriate for incorporating different types of mid-level cues. Ranked sub-sampling, driven by the same posterior as used to define the global potential function, renders the optimization problem tractable. It is most useful when multiple semantic classes co-occur frequently. Per-class learning of parameters is essential (about +5% in final results).
  • 54. Discussion: action recognition. This year’s taster challenge on action recognition was little more than a toy. However, we have demonstrated what is possible using proven techniques from image classification. We feel that object context, in particular object interaction context, is the way forward. The PASCAL data set is the right direction to go (more general), but we need more samples.
  • 55. The future: segmentation. Semantic image segmentation has come a long way, but still has a long way to go. It is becoming a mainstream event in PASCAL. This year we arrived at a sort of three-way detente between the CVC (winner 2010), BONN (winner 2009) and OXFORD (best paper award ECCV 2010) in segmentation. Each has its own approach, with its own advantages and disadvantages. Engineering can probably maximize results. It is becoming mature, and we can begin thinking about what new applications are enabled by such technologies.
  • 56. The future: action recognition. It seems that action recognition in still images is a popular challenge. The PASCAL organizers are keen to promote it for the future. The focus will remain on still images, perhaps with more emphasis on incorporating user interaction as well. It seems that the community is becoming more interested in the “alternative” PASCAL challenges. The multimedia community probably has an important role to play here.