                    ICVSS 2011: Selected Presentations

                                Angel Cruz and Andrea Rueda

                 BioIngenium Research Group, Universidad Nacional de Colombia


                                           August 25, 2011





    Outline


    1 ICVSS 2011


    2 A Trillion Photos - Steven Seitz

    3 Efficient Novel Class Recognition and Search - Lorenzo
       Torresani

    4 The Life of Structured Learned Dictionaries - Guillermo Sapiro


    5 Image Rearrangement & Video Synopsis - Shmuel Peleg





    ICVSS 2011
    International Computer Vision Summer School




         15 speakers, from the USA, France, the UK, Italy, the Czech Republic (Prague), and Israel


A Trillion Photos

         Steve Seitz
  University of Washington
           Google

Sicily Computer Vision Summer School
            July 11, 2011
Facebook: >3 billion photos uploaded each month
~1 trillion photos taken each year

What do you do with a trillion photos?
The "digital shoebox" (hard drives, iPhoto, Facebook, ...)?
Comparing images

Detect features using SIFT [Lowe, IJCV 2004].

Extraordinarily robust image matching:
  - across viewpoint (~60-degree out-of-plane rotations)
  - under varying illumination
  - with real-time implementations
Scale Invariant Feature Transform

[Figure: SIFT descriptor built from local gradient-orientation (angle) histograms over 0 to 2π. Adapted from a slide by David Lowe.]
NASA Mars Rover images, with SIFT feature matches (figure by Noah Snavely).
[Figure: graph of matched Internet photos of Rome landmarks: Colosseum (inside and outside), St. Peter's (inside and outside), Il Vittoriano, Trevi Fountain, Forum.]
Structure from motion

Matched photos → 3D structure.

Structure from motion, aka "bundle adjustment" (texts: Zisserman; Faugeras).
[Figure: points p1, ..., p7 observed by Camera 1 (R1,t1), Camera 2 (R2,t2), Camera 3 (R3,t3); jointly minimize f(R, T, P) over camera poses and 3D points.]
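Written out, the bundle-adjustment cost is the total reprojection error (a standard formulation, not spelled out on the slide; π(R_i, t_i, P_j) denotes the projection of 3D point P_j through camera i, and w_ij = 1 iff point j is visible in image i):

    f(R, T, P) = \sum_{i} \sum_{j} w_{ij} \, \bigl\| p_{ij} - \pi(R_i, t_i, P_j) \bigr\|^2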
Reconstructing Rome in a day...
From ~1M images, using ~1,000 cores.
Sameer Agarwal, Noah Snavely, Rick Szeliski, Steve Seitz
http://grail.cs.washington.edu/rome
Results: Rome 150K (Colosseum), Rome (St. Peter's), Venice (250K images; Canal), Dubrovnik.
From sparse to dense

Sparse output from the SfM system → dense reconstruction [Furukawa, Curless, Seitz, Szeliski, CVPR 2010].
Most of our photos don't look like this: recognition + alignment.

Your Life in 30 Seconds (path optimization).

Picasa integration: shipped as the "Face Movies" feature in v3.8 (Rahul Garg, Ira Kemelmacher).
Conclusion

trillions of photos + computer vision breakthroughs = new ways to see the world
Efficient Novel-Class Recognition and Search
Lorenzo Torresani

Problem statement: novel object-class search
• Given: an image database (e.g., 1 million photos) and a few user-provided images of an object class.
• Want: the database images of this class.
• Constraints: no text/tags are available, and the query images may represent a novel class.
Application: Web-powered visual search in unlabeled personal photos
Goal: find "soccer camp" pictures on my computer.
1. Search the Web for images of "soccer camp".
2. Find images of this visual class on my computer.
Application: product search
• Search for aesthetic products.
Relation to other tasks: image retrieval
[Figures: query/retrieval examples from Nister and Stewenius '07, Philbin et al. '07, and Torralba et al. '08 (RBM-based retrieval with predicted labels); the comparison is summarized on the next slide.]
Relation to other tasks: novel-class search vs. image retrieval vs. object classification

Image retrieval
  analogies: large databases; efficient indexing; compact representations
  differences: simple notions of visual relevancy (e.g., near-duplicate, same object instance, same spatial layout)

Object classification
  analogies: recognition of object classes from a few examples
  differences: classes to recognize are defined a priori; training and recognition time is unimportant; storage of features is not an issue
Technical requirements of novel-class search
• The object classifier must be learned on the fly from few examples.
• Recognition in the database must have low computational cost.
• Image descriptors must be compact enough to allow storage in memory.
State of the art in object classification
Winning recipe: many features + non-linear classifiers (e.g., [Gehler and Nowozin, CVPR'09]).
[Figure: multiple features feed a non-linear decision boundary.]
Model evaluation on Caltech256

[Plots: accuracy (%) vs. number of training examples (0-30) on Caltech256.
 1. Linear model + individual features (gist, phog, phog2pi, ssim, bow5000).
 2. A linear model + feature combination outperforms every individual feature.
 3. A non-linear model + feature combination, a.k.a. Multiple Kernel Learning [Gehler & Nowozin '09], performs best.]
Multiple kernel combiners
Classification output is obtained by combining many features via non-linear kernels:

    h(x) = \sum_{f=1}^{F} \beta_f \Big( \sum_{n=1}^{N} k_f(x, x_n) \, \alpha_n \Big) + b

where the outer sum runs over features, the inner sum over training examples, k_f is the kernel for feature f, and α ∈ R^N, b ∈ R are SVM parameters.
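Read as code, evaluating h(x) from precomputed per-feature kernel values is just two dot products. A minimal NumPy sketch (the arrays kf_x, beta, alpha and the scalar b are assumed to come from an already-trained combiner):

    # Kernel-combiner output h(x) = sum_f beta_f * sum_n k_f(x, x_n) * alpha_n + b.
    # kf_x[f, n] = k_f(x, x_n) for F features and N training examples.
    import numpy as np

    def h(kf_x: np.ndarray, beta: np.ndarray, alpha: np.ndarray, b: float) -> float:
        per_feature = kf_x @ alpha            # shape (F,): sum over training examples
        return float(beta @ per_feature + b)  # weighted sum over features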
Multiple kernel learning (MKL)
[Bach et al., 2004; Sonnenburg et al., 2006; Varma and Ray, 2007]

Learning a non-linear SVM by jointly optimizing over a linear combination of kernels,

    k^*(x, x') = \sum_{f=1}^{F} \beta_f k_f(x, x'),

and the SVM parameters α ∈ R^N and b ∈ R:

    \min_{\alpha, \beta, b} \; \frac{1}{2} \sum_{f=1}^{F} \beta_f \alpha^T K_f \alpha
        + C \sum_{n=1}^{N} L\Big(y_n, \; b + \sum_{f=1}^{F} \beta_f K_f(x_n)^T \alpha\Big)

    subject to \sum_{f=1}^{F} \beta_f = 1, \quad \beta_f \ge 0, \; f = 1, \dots, F,

where L(y, t) = max(0, 1 - yt) and K_f(x) = [k_f(x, x_1), k_f(x, x_2), ..., k_f(x, x_N)]^T.
LP-β: a two-stage approach to MKL
[Gehler and Nowozin, 2009]

• Classification output of traditional MKL:

    h_{MKL}(x) = \sum_{f=1}^{F} \beta_f \Big( \sum_{n=1}^{N} k_f(x, x_n) \, \alpha_n \Big) + b

• Classification function of LP-β:

    h(x) = \sum_{f=1}^{F} \beta_f \underbrace{\Big( \sum_{n=1}^{N} k_f(x, x_n) \, \alpha_{fn} + b_f \Big)}_{h_f(x)}

Two-stage training procedure:
1. train each h_f(x) independently → traditional SVM learning
2. optimize over β → a simple linear program
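A sketch of the two-stage procedure under simplifying assumptions: precomputed kernel matrices, scikit-learn SVMs for stage 1, and a hinge-loss linear program over β (solved with scipy) for stage 2 on a held-out set. This illustrates the idea, not the authors' implementation:

    # Illustrative two-stage LP-beta training. Labels are assumed to be +/-1.
    # Stage 1: one kernel SVM per feature. Stage 2: a linear program over beta
    # minimizing hinge loss on held-out outputs, with beta >= 0, sum(beta) = 1.
    import numpy as np
    from scipy.optimize import linprog
    from sklearn.svm import SVC

    def train_lp_beta(K_train, K_val, y_train, y_val):
        # K_train: list of (N, N) kernel matrices; K_val: list of (M, N) matrices.
        svms = [SVC(kernel="precomputed").fit(K, y_train) for K in K_train]
        H = np.column_stack([clf.decision_function(Kv)        # H[m, f] = h_f(x_m)
                             for clf, Kv in zip(svms, K_val)])
        M, F = H.shape
        # Variables z = [beta_1..beta_F, xi_1..xi_M]; minimize sum(xi) subject to
        # xi_m >= 1 - y_m * sum_f beta_f H[m, f], xi >= 0, beta >= 0, sum(beta) = 1.
        c = np.concatenate([np.zeros(F), np.ones(M)])
        A_ub = np.hstack([-(y_val[:, None] * H), -np.eye(M)]) # -y*(H@beta) - xi <= -1
        b_ub = -np.ones(M)
        A_eq = np.concatenate([np.ones(F), np.zeros(M)])[None, :]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * (F + M))
        return svms, res.x[:F]                                # per-feature SVMs, beta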
LP-β for novel-class search?
The LP-β classifier

    h(x) = \sum_{f=1}^{F} \beta_f \Big( \sum_{n=1}^{N} k_f(x, x_n) \, \alpha_{fn} + b_f \Big)

(sum over features outside, sum over training examples inside) is unsuitable for our needs due to:
• large storage requirements (typically over 20K bytes/image)
• costly evaluation (requires query-time kernel-distance computation for each test image)
• costly training (1+ minute for O(10) training examples)
Classemes: a compact descriptor for efficient recognition
[Torresani et al., 2010]

Key idea: represent each image x in terms of its "closeness" to a set of basis classes ("classemes"):

    \Phi(x) = [\phi_1(x), \dots, \phi_C(x)]^T

    \phi_c(x) = h_{classeme_c}(x) = \sum_{f=1}^{F} \beta_f^c \Big( \sum_{n=1}^{N} k_f(x, x_n^c) \, \alpha_n^c \Big) + b^c

i.e., the output of a pre-learned LP-β classifier for the c-th basis class.

Query-time learning: given the classeme vectors Φ(x_1), ..., Φ(x_N) of the training examples of the novel class, train a linear classifier on Φ(x):

    g_{duck}(\Phi(x); w^{duck}) = \Phi(x)^T w^{duck}
        = \sum_{c=1}^{C} w_c^{duck} \underbrace{\Big( \sum_{f=1}^{F} \beta_f^c \sum_{n=1}^{N} k_f(x, x_n^c) \, \alpha_n^c + b^c \Big)}_{\text{LP-β trained before the creation of the database}}

with w^{duck} trained at query time.
How this works...
Highly weighted classemes: the five classemes with the highest LP-β weights for the retrieval experiment, for a selection of Caltech 256 categories (Table 1 in [Torresani et al., 2010]). Some appear to make semantic sense, but semantic labels are not required: the goal is simply a useful feature vector, not semantic labeling, and classeme detectors may just pick up specific patterns of texture, color, shape, etc. The somewhat peculiar classeme labels reflect the ontology used as a source of base categories.

Large-scale recognition benefits from a compact descriptor for each image, for example allowing databases to be stored in memory rather than on disk.
Related work
• Attribute-based recognition [Lampert et al., CVPR'09; Farhadi et al., CVPR'09]: describe classes by high-level attributes (e.g., otter: black yes, white no, brown yes, stripes no, water yes, eats fish yes; polar bear: black no, white yes, brown no, stripes no, water yes, eats fish yes; zebra: black yes, white yes, brown no, stripes yes, water no, eats fish no), allowing detection of object classes that have no training examples.
• Limitations: requires hand-specified attribute-class associations, and the attribute classifiers must be trained with human-labeled examples.
Method overview
1. Classeme learning: train one detector per basis class, e.g. φ_"body of water"(x), ..., φ_"walking"(x).
2. Using the classemes for recognition and retrieval: from the classeme vectors Φ(x_1), ..., Φ(x_N) of the training examples of the novel class, learn

    g_{duck}(\Phi(x)) = \sum_{c=1}^{C} w_c^{duck} \, \phi_c(x)
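A minimal sketch of step 2, assuming the classeme vectors Φ have already been computed (random placeholders stand in for them here); the query-time learner is just a linear SVM on Φ:

    # Query-time learning on classeme vectors: train a linear SVM on the few
    # novel-class examples, then score the whole database with one dot product
    # per image. All data below is a toy placeholder.
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    C = 2659                                     # number of classeme basis classes
    Phi_db = rng.random((10000, C)) > 0.5        # binarized classemes (toy data)
    Phi_query = rng.random((20, C)) > 0.5        # few examples of the novel class
    y_query = np.array([1] * 10 + [-1] * 10)     # positives and negatives

    clf = LinearSVC().fit(Phi_query, y_query)    # learned on the fly
    scores = clf.decision_function(Phi_db)       # g(Phi(x)) = w^T Phi(x) + b
    top10 = np.argsort(-scores)[:10]             # best-matching database images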
Classeme learning: choosing the basis classes
• Classeme label desiderata:
  - must be visual concepts
  - should span the entire space of visual classes
• Our selection: concepts defined in the Large-Scale Concept Ontology for Multimedia [LSCOM] to be "useful, observable and feasible for automatic detection": 2659 classeme labels, after manual elimination of plurals, near-duplicates, and inappropriate concepts.
Classeme learning: gathering the training data
• We downloaded the top 150 images returned by Bing Images for each classeme label.
• For each of the 2659 classemes, a one-versus-the-rest training set was formed to learn a binary classifier, e.g. φ_"walking"(x): yes vs. no.
Classeme learning: training the classifiers
• Each classeme classifier is an LP-β kernel combiner [Gehler and Nowozin, 2009], a linear combination of feature-specific SVMs:

    \phi(x) = \sum_{f=1}^{F} \beta_f \Big( \sum_{n=1}^{N} k_f(x, x_n) \, \alpha_{f,n} + b_f \Big)

• We use 13 kernels based on spatial-pyramid histograms computed from the following features:
  - color GIST [Oliva and Torralba, 2001]
  - oriented gradients [Dalal and Triggs, 2005]
  - self-similarity descriptors [Shechtman and Irani, 2007]
  - SIFT [Lowe, 2004]
A dimensionality-reduction view of classemes

Φ maps the raw feature vector x = [GIST; self-similarity descriptors; oriented gradients; SIFT] to the classeme vector [φ_1(x), ..., φ_2659(x)]^T.

Raw features x: non-linear kernels are needed for good classification; 23K bytes/image.
Classemes Φ(x): near state-of-the-art accuracy with linear classifiers; can be quantized down to 200 bytes/image with almost no recognition loss.
Experiment 1: multiclass recognition on Caltech256
[Plot: accuracy (%) vs. number of training examples (0-50) for:
 - LPbeta: LP-β of [Gehler & Nowozin, 2009] using 39 kernels
 - LPbeta13: LP-β using our 13 kernels on the raw features x
 - MKL
 - Csvm (our approach): linear SVM on classemes Φ(x)
 - Cq1svm: linear SVM on binarized classemes, i.e. (Φ(x) > 0)
 - Xsvm: linear SVM on the raw features x]
Computational cost comparison
[Bar charts: training time (minutes): LPbeta takes 23 hours vs. 9 minutes for Csvm; testing time (ms) for LPbeta vs. Csvm.]
Accuracy vs. compactness
[Plot: compactness (images per MB, log scale) vs. accuracy (%):
 - LPbeta13: 23K bytes/image
 - Csvm: 2.5K bytes/image
 - Cq1svm: 188 bytes/image
 - nbnn [Boiman et al., 2008]: 128K bytes/image
 - emk [Bo and Sminchisescu, 2008]
 - Xsvm
Lines link performance at 15 and 30 training examples.]
Experiment 2: object class retrieval
[Plot (Fig. 4 in [Torresani et al., 2010]): precision (%) @ 25 vs. number of training images; percentage of the top 25 in a 6400-document set that match the query class, for Csvm, Cq1Rocchio (β=1, γ=0), Cq1Rocchio (β=0.75, γ=0.15), Bowsvm, BowRocchio (β=1, γ=0), and BowRocchio (β=0.75, γ=0.15).]
• Random performance is 0.4%.
• Training Csvm takes 0.6 sec with 5×256 training examples.
Analogies with text retrieval
• Classeme representation of an image: presence/absence of visual attributes.
• Bag-of-words representation of a text document: presence/absence of words.
Related work
• Prior work (e.g., [Sivic & Zisserman, 2003; Nister & Stewenius, 2006; Philbin et al., 2007]) has exploited a similar analogy for object-instance retrieval by representing images as bags of visual words: detect interest patches, compute SIFT descriptors [Lowe, 2004], quantize the descriptors against a codebook, and represent each image as a sparse histogram of visual-word frequencies (a minimal sketch follows this list).
• To extend this methodology to object-class retrieval we need:
  - a representation more suited to object-class recognition (e.g., classemes as opposed to bags of visual words)
  - to train the ranking/retrieval function for every new query class
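For reference, the bag-of-visual-words pipeline in the first bullet can be sketched in a few lines (random arrays stand in for SIFT descriptors; the 1000-word codebook size is arbitrary):

    # Bag-of-visual-words sketch: learn a codebook over pooled local
    # descriptors, then represent each image as a histogram of codeword hits.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.default_rng(0)
    train_desc = rng.random((5000, 128))            # descriptors pooled over many images
    codebook = MiniBatchKMeans(n_clusters=1000, n_init=3).fit(train_desc)

    def bow_histogram(image_desc: np.ndarray) -> np.ndarray:
        words = codebook.predict(image_desc)        # nearest codeword per descriptor
        return np.bincount(words, minlength=codebook.n_clusters)

    hist = bow_histogram(rng.random((300, 128)))    # one image's sparse histogram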
Data structures for efficient retrieval

Incidence matrix (documents × features):

         f0 f1 f2 f3 f4 f5 f6 f7
    I0:   1  0  1  0  0  1  0  0
    I1:   0  0  1  0  1  0  0  0
    I2:   1  1  0  1  0  0  0  0
    I3:   1  0  1  1  0  0  0  0
    I4:   1  0  0  0  1  0  1  0
    I5:   0  0  0  0  1  0  1  0
    I6:   1  0  0  0  0  1  0  1
    I7:   0  1  0  0  1  0  0  0
    I8:   1  1  0  0  0  1  0  0
    I9:   0  0  0  1  1  1  0  1

Inverted index (one list per feature):

    f0: I0 I2 I3 I4 I6 I8
    f1: I2 I7 I8
    f2: I0 I1 I3
    f3: I2 I3 I9
    f4: I1 I4 I5 I7 I9
    f5: I0 I6 I8 I9
    f6: I4 I5
    f7: I6 I9

• The incidence matrix is very compact: only one bit per feature entry.
• The inverted index enables efficient calculation of w^T Φ for all Φ, as \sum_{i \,:\, \Phi_i \neq 0} w_i \Phi_i.
Efficient retrieval via inverted index

Goal: compute the score w^T Φ for all binary vectors Φ in the database.

Example: w = [1.5  -2  0  -5  0  3  -2  0]. Walk the inverted lists of the non-zero weights (f0, f1, f3, f5, f6), adding w_i to the running score of each image on list i; images never touched by any list keep score 0.
The cost of scoring is linear in the sum of the lengths of the inverted lists associated with non-zero weights.
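A minimal sketch of building the inverted index from the incidence matrix and scoring every image against a sparse w (toy data copied from the example above):

    # Build an inverted index from a binary incidence matrix, then score all
    # documents against w while touching only the lists of non-zero weights.
    import numpy as np

    incidence = np.array([                 # rows = images I0..I9, cols = f0..f7
        [1,0,1,0,0,1,0,0], [0,0,1,0,1,0,0,0], [1,1,0,1,0,0,0,0],
        [1,0,1,1,0,0,0,0], [1,0,0,0,1,0,1,0], [0,0,0,0,1,0,1,0],
        [1,0,0,0,0,1,0,1], [0,1,0,0,1,0,0,0], [1,1,0,0,0,1,0,0],
        [0,0,0,1,1,1,0,1]])
    inverted = [np.flatnonzero(incidence[:, f]) for f in range(incidence.shape[1])]

    w = np.array([1.5, -2, 0, -5, 0, 3, -2, 0])
    scores = np.zeros(incidence.shape[0])
    for f in np.flatnonzero(w):            # skip zero weights entirely
        scores[inverted[f]] += w[f]        # one add per (feature, image) posting
    # scores[n] now equals w @ incidence[n] for every image n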
Improve efficiency via sparse weight vectors

Key idea: force w to contain as many zeros as possible.

Learning objective (Φ_n is the classeme vector of example n, y_n its label):

    E(w) = R(w) + \frac{C}{N} \sum_{n=1}^{N} L(w; \Phi_n, y_n)

with regularizer R and loss function L:

• L2-SVM: R(w) = w^T w,  L(w; Φ_n, y_n) = max(0, 1 - y_n (w^T Φ_n))
• L1-LR: R(w) = \sum_i |w_i|,  L(w; Φ_n, y_n) = log(1 + exp(-y_n w^T Φ_n))
• FGM (Feature Generating Machine) [Tan et al., 2010]:
    R(w) = w^T w,  L(w; Φ_n, y_n) = max(0, 1 - y_n (w ⊙ d)^T Φ_n)
    subject to 1^T d ≤ B, d ∈ {0,1}^D, where ⊙ is the elementwise product.

Why ℓ1 yields sparsity: since |w_i| > w_i^2 for small w_i and |w_i| < w_i^2 for large w_i, choosing R(w) = Σ_i |w_i| tends to produce a small number of larger weights and more zero weights (compare the ℓ2-ball w_1^2 + w_2^2 = constant with the ℓ1-ball |w_1| + |w_2| = constant, whose corners lie on the axes).
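A quick illustration of the sparsity effect, comparing L2- and L1-regularized logistic regression on toy binarized features with scikit-learn (the data and constants are arbitrary):

    # L1 vs. L2 regularization: the L1 solution concentrates weight on a few
    # features and zeroes the rest, shortening the inverted lists to scan.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = (rng.random((500, 200)) > 0.5).astype(float)   # toy binarized classemes
    y = np.sign(X[:, 0] + X[:, 1] - X[:, 2] - 0.5)     # depends on 3 features only

    l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
    l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
    print("non-zero weights: L2 =", np.count_nonzero(l2.coef_),
          " L1 =", np.count_nonzero(l1.coef_))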
Performance evaluation on ImageNet (10M images)
[Plot from [Rastegari et al., 2011]: precision @ 10 (%) vs. search time per query (seconds), comparing full inner-product evaluation against inverted-index evaluation, each with an L2-SVM and with L1-LR. Each curve is obtained by varying sparsity through C in the training objective E(w) = R(w) + (C/N) Σ_n L(w; Φ_n, y_n).]
• Performance averaged over 400 object classes used as queries.
• 10 training examples per query class.
• The database includes 450 images of the query class and 9.7M images of other classes.
• Prec@10 of a random classifier is 0.005%.
Top-k ranking
• Do we need to rank the entire database? Users only care about the top-ranked images.
• Key idea:
  - for each image, iteratively update an upper bound and a lower bound on its score
  - gradually prune images that cannot rank in the top k
Top-k pruning
[Rastegari et al., 2011]

Example: w = [3  -2  0  -6  0  3  -2  0] over the incidence matrix above.

• Highest possible score: attained by the binary vector Φ^U with Φ^U_i = 1 iff w_i > 0, giving the initial upper bound u* = w^T Φ^U (6 in this case).
• Lowest possible score: attained by the binary vector Φ^L with Φ^L_i = 1 iff w_i < 0, giving the initial lower bound l* = w^T Φ^L (-10 in this case).
Initialization: every image starts with the bounds [l*, u*].
[Bar chart: upper and lower bounds for images I0 ... I9.]
Load feature i. Since w_i = +3 (> 0), for each image n:
• subtract 3 from the upper bound if φ_{n,i} = 0
• add 3 to the lower bound if φ_{n,i} = 1
Load the next feature i. Since w_i = -2 (< 0), for each image n:
• decrement the upper bound by 2 if φ_{n,i} = 1
• increment the lower bound by 2 if φ_{n,i} = 0
 •  Load the next feature i. Since wi = -6 (< 0), for each image n:
    - decrement the upper bound by 6 if φn,i = 1
    - increment the lower bound by 6 if φn,i = 0
 •  Suppose k = 4: we can prune I2 and I9, since their upper bounds have
    fallen below the 4th-largest lower bound, so they cannot rank in the
    top-k.
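
The walkthrough above maps directly to code. Below is a minimal Python sketch of the bound-maintenance idea (our illustration, not the implementation of [Rastegari et al., 2011]); the function name and the choice to visit features in descending order of |wi| are assumptions:

```python
import numpy as np

def topk_prune(w, Phi, k):
    """Rank images by w . phi, pruning with running upper/lower bounds.

    Phi is a binary incidence matrix (images x features). Each image keeps
    an upper and a lower bound on its final score; it is discarded as soon
    as its upper bound falls below the k-th largest lower bound.
    """
    n_images = Phi.shape[0]
    upper = np.full(n_images, w[w > 0].sum(), dtype=float)  # u*: all +w present
    lower = np.full(n_images, w[w < 0].sum(), dtype=float)  # l*: all -w present
    alive = np.ones(n_images, dtype=bool)

    for i in np.argsort(-np.abs(w)):   # assumed visit order: largest |wi| first
        if w[i] == 0:
            break                      # remaining weights are all zero
        present = Phi[:, i].astype(bool)
        if w[i] > 0:
            upper[~present] -= w[i]    # feature absent: best case drops
            lower[present] += w[i]     # feature present: worst case rises
        else:
            upper[present] += w[i]     # feature present: best case drops
            lower[~present] -= w[i]    # feature absent: worst case rises
        if alive.sum() > k:            # prune images that cannot reach the top k
            kth_lower = np.partition(lower[alive], -k)[-k]
            alive &= upper >= kth_lower

    survivors = np.flatnonzero(alive)  # exact scores only for the survivors
    return survivors[np.argsort(-(Phi[survivors] @ w))][:k]

# With the slide's w, the 10x8 incidence matrix, and k = 4, low-scoring
# images such as I2 are pruned before their exact scores are needed.
```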
Distribution of weights and pruning rate

 [Figure 2: panel (a) plots the normalized absolute weight values against
 the features considered in descending order of |wi|, for L1-LR, L2-SVM and
 FGM; panel (b) plots the % of images pruned against the number of
 iterations (d), for TkP with each classifier and k = 10, 3000.]

 Figure 2. (a) Distribution of weight absolute values for different
 classifiers (after sorting the weight magnitudes). TkP runs faster with
 sparse, highly skewed weight values. (b) Pruning rate of TkP for various
 classification models and different values of k (k = 10, 3000).

 •  A smaller value of k allows the method to eliminate more images from
    consideration at a very early stage.
Performance evaluation on ImageNet (10M images)
                                           ! [Rastegari et al., 2011]

 [Plot: Precision @ 10 (%) vs. search time per query (seconds), comparing
 TkP L1-LR, TkP L2-SVM, Inverted index L1-LR, and Inverted index L2-SVM.]

 •  k = 10
 •  Performance averaged over 400 object classes used as queries
 •  10 training examples per query class
 •  Database includes 450 images of the query class and 9.7M images of
    other classes
 •  Prec@10 of a random classifier is 0.005%

 Each curve is obtained by varying sparsity through C in the training
 objective

        E(w) = R(w) + (C/N) · Σ_{n=1}^{N} L(w; Φn, yn)

 where R(w) is the regularizer and L(w; Φn, yn) is the loss function.
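
As a concrete (hedged) illustration: with R(w) = ||w||_1 and the logistic loss, E(w) is L1-regularized logistic regression, and sweeping C traces the accuracy/sparsity trade-off. A toy sketch with scikit-learn (our choice of library; random data stands in for real classeme vectors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
Phi_train = rng.integers(0, 2, size=(200, 2659))  # toy binary "classeme" vectors
y_train = rng.integers(0, 2, size=200)            # toy labels

for C in (0.01, 0.1, 1.0):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(Phi_train, y_train)
    # Smaller C => stronger regularization => sparser w => faster search.
    print(C, int((clf.coef_ != 0).sum()), "nonzero weights")
```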
Alternative search strategy:
        approximate ranking
•   Key idea: approximate the score function with a measure that can be
    computed (more) efficiently (related to approximate NN search:
    [Shakhnarovich et al., 2006; Grauman and Darrell, 2007; Chum et al.,
    2008])
•   Approximate ranking via vector quantization:
        wT Φ ≈ wT q(Φ)
    where q(.) is a quantizer returning the cluster centroid nearest to Φ

•   Problem:
    - to approximate well the score we need a fine quantization
    - the dimensionality of our space is D=2659:
      too large to enable a fine quantization using k-means clustering
Product quantization
                                           ! [Jegou et al., 2011]

 Product quantization for nearest neighbor search:

 •  Split the feature vector Φ into v subvectors:  Φ → [ Φ1 | Φ2 | ... | Φv ]
 •  Subvectors are quantized separately by quantizers
        q(Φ) = [ q1(Φ1) | q2(Φ2) | ... | qv(Φv) ]
    where each qi(.) is learned by k-means, in a space of dimensionality
    D/v, with a limited number of centroids
 •  Example from [Jegou et al., 2011]: Φ is a 128-dimensional vector split
    into 8 subvectors of dimension 16; each subvector is quantized with
    2^8 = 256 centroids (8 bits), yielding a 64-bit quantization index.
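
A minimal sketch of this split-and-quantize scheme (assumed interface and names, not the code of [Jegou et al., 2011]), training the v sub-quantizers independently with k-means:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_product_quantizer(X, v, r):
    """Split D-dim rows of X into v sub-blocks; k-means each one (r centroids)."""
    d_sub = X.shape[1] // v
    return [KMeans(n_clusters=r, n_init=4, random_state=0)
            .fit(X[:, j * d_sub:(j + 1) * d_sub]) for j in range(v)]

def pq_encode(X, quantizers):
    """Return the (n, v) matrix of centroid indices: the PQ codes."""
    d_sub = X.shape[1] // len(quantizers)
    return np.stack([q.predict(X[:, j * d_sub:(j + 1) * d_sub])
                     for j, q in enumerate(quantizers)], axis=1)

# The example setting above: D = 128, v = 8 sub-blocks, r = 256 centroids
# (8 bits each), so every vector compresses to a 64-bit code.
X = np.random.default_rng(0).standard_normal((1000, 128)).astype(np.float32)
quantizers = train_product_quantizer(X, v=8, r=256)
codes = pq_encode(X, quantizers)   # shape (1000, 8)
```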
Efficient approximate scoring

        wT Φ ≈ wT q(Φ) = Σ_{j=1}^{v} wjT qj(Φj)

 1. Filling the look-up table: the inner products between each sub-block wj
    of w and each of the r centroids of that sub-block's quantizer can be
    precomputed and stored in a v × r look-up table:

        s11  s12  s13  ...  s1r
        s21  s22  s23  ...  s2r
        ...  ...  ...  ...  ...
        sv1  sv2  sv3  ...  svr

        (rows: sub-blocks; columns: centroids, r per sub-block)

 2. Score each quantized vector q(Φ) in the database using the look-up
    table:

        wT q(Φ) = w1T q1(Φ1) + w2T q2(Φ2) + ... + wvT qv(Φv)

    Only v additions per image!
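
The two steps above in a short Python sketch, continuing the previous one (our illustration; the table layout s_{jm} = wjT · c_{jm} follows the slide, everything else is an assumption):

```python
import numpy as np

def build_lookup_table(w, quantizers):
    """Step 1: s[j, m] = w_j . c_{j,m} for sub-block j, centroid m (v x r)."""
    v = len(quantizers)
    d_sub = w.shape[0] // v
    return np.stack([q.cluster_centers_ @ w[j * d_sub:(j + 1) * d_sub]
                     for j, q in enumerate(quantizers)])

def approx_scores(codes, table):
    """Step 2: score every database item with v table look-ups and additions."""
    return sum(table[j, codes[:, j]] for j in range(table.shape[0]))

# Reusing `quantizers` and `codes` from the previous sketch:
w = np.random.default_rng(1).standard_normal(128).astype(np.float32)
table = build_lookup_table(w, quantizers)   # v x r = 8 x 256
scores = approx_scores(codes, table)        # one approximate w . q(phi) per item
```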
Choice of parameters
                                           ! [Rastegari et al., 2011]

 •  Dimensionality is first reduced with PCA from D = 2659 to D' ≪ D
 •  How do we choose D', v (number of sub-blocks), and r (number of
    centroids per sub-block)?
 •  Effect of parameter choices on a database of 150K images:

    [Plot: Precision @ 10 (%) vs. search time per query (seconds) for
    settings (v, r) with v ∈ {16, 32, 64, 128, 256} and r ∈ {2^6, 2^8},
    shown for reduced dimensionalities D' = 128, 256, 512.]
Performance evaluation on 150K images

 [Figure 1: class-retrieval precision versus search time on the ILSVRC2010
 data set; x-axis: search time per query (seconds); y-axis: Precision @ 10
 (%), the percentage of true positives ranked in the top 10; curves for
 AR L2-SVM, TkP L1-LR, TkP L2-SVM, and TkP FGM. Times measured on a
 single-core computer with 16GB of RAM and a Core i7-930 CPU @ 2.80GHz;
 the TkP curves use k = 10, and the AR curve is generated by varying the
 quantization parameter choices.]

 •  Performance averaged over 1000 object classes used as queries
 •  50 training examples per query class
 •  Database: the ILSVRC2010 test set, which includes 150 images of the
    query class and 150K images of other classes
 •  Prec@10 of a random classifier is 0.1%
Memory requirements for 10M images

 [Bar chart of memory usage:
     Inverted index:                  9   Gbytes
     Incidence matrix (used by TkP):  3   Gbytes
     Product quantization index:      1.8 Gbytes]
Conclusions and open questions
Classemes:

•   Compact descriptor enabling efficient novel-class recognition
    (less than 200 bytes/image yet it produces performance similar to MKL
    at a tiny fraction of the cost)

•   Questions currently under investigation:
    - can we learn better classemes from fully-labeled data?
    - can we decouple the descriptor size from the number of
      classeme classes?
    - can we encode spatial information ([Li et al. NIPS10])?
•   Software for classeme extraction available at:
    http://vlg.cs.dartmouth.edu/projects/classemes_extractor/

Information retrieval approaches to large-scale object-class search:
•   sparse representations and retrieval models
•   top-k ranking
•   approximate scoring
ICVSS 2011 Steven Seitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg

    Outline


    1 ICVSS 2011


    2 A Trillion Photos - Steven Seitz

    3 Efficient Novel Class Recognition and Search - Lorenzo
       Torresani

    4 The Life of Structured Learned Dictionaries - Guillermo Sapiro


    5 Image Rearrangement & Video Synopsis - Shmuel Peleg




                        Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
The Life of Structured
                        Learned Dictionaries

                                              Guillermo Sapiro
                                           University of Minnesota


                  G.Yu and S. Mallat (Inverse problems via GMM)
                  G.Yu and F. Leger (Matrix completion)
                  G.Yu (Statistical compressed sensing)

                  A. Castrodad (activity recognition in video)
                  M. Zhou, D. Dunson, and L. Carin (video layers separation)



Inverse Problems
        y = Uf + w,    w ∼ N(0, σ²Id)

    Examples:
        •  Inpainting:  U is a masking operator
        •  Zooming:     U is a subsampling operator
        •  Deblurring:  U is a convolution operator
Learned Overcomplete Dictionaries

    •  Dictionary learning:

           min_{D, {ai}_{1≤i≤I}}  Σ_{1≤i≤I}  ( ||fi − D ai||² + λ||ai||₁ )

    •  Better performance than pre-fixed dictionaries.
    •  Huge numbers of parameters to estimate.
    •  Non-convex.
    •  High computational complexity.
    •  Behavior not well understood (results starting to appear).
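
A toy sketch of this objective (an illustration only, not the solvers used in practice such as K-SVD or online dictionary learning): alternate ISTA steps for the sparse codes with a least-squares dictionary update.

```python
import numpy as np

def dictionary_learning(F, n_atoms, lam, n_outer=20, n_ista=50, seed=0):
    """Alternating minimization for min_{D,{a_i}} sum_i ||f_i - D a_i||^2 + lam ||a_i||_1.

    F holds the training signals as columns (dim x n_signals). Sparse codes
    are updated with batched ISTA, the dictionary with least squares plus
    column renormalization.
    """
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((F.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((n_atoms, F.shape[1]))
    for _ in range(n_outer):
        # Sparse coding: ISTA iterations on all signals at once.
        step = 1.0 / np.linalg.norm(D.T @ D, 2)
        for _ in range(n_ista):
            A = A - step * (D.T @ (D @ A - F))          # gradient step
            A = np.sign(A) * np.maximum(np.abs(A) - step * lam / 2, 0.0)  # shrink
        # Dictionary update: least squares, then re-normalize the atoms.
        D = F @ np.linalg.pinv(A)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, A
```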
Sparse Inverse Problem Estimation

        y = Uf + w,   where w ∼ N(0, σ²Id)

    •  Sparse prior
       D = {φm}m∈Γ provides a sparse representation for f:
           f = D a + εΛ,  with |Λ| ≪ |Γ|,  Λ = support(a),  ||εΛ||² ≪ ||f||²

    •  Observation
       UD = {Uφm}m∈Γ provides a sparse representation for y:
           y = UD a + ε′Λ,  with |Λ| ≪ |Γ|,  Λ = support(a),  ||ε′Λ||² ≪ ||y||²

    •  Sparse inverse problem estimation
       Sparse estimation of a from y:
           ã = arg min_a ||UD a − y||² + λ||a||₁
       Inverse problem estimation:
           f̃ = D ã
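
A short sketch of this two-step estimate (assumed solver, not the talk's implementation): solve the lasso with scikit-learn, then reconstruct f̃ = Dã. Note scikit-learn's Lasso minimizes (1/2n)||Xa − y||² + α||a||₁, hence the rescaling of λ.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_inverse_estimate(y, U, D, lam):
    """Estimate f from y = U f + w via a~ = argmin ||U D a - y||^2 + lam ||a||_1."""
    UD = U @ D
    n = y.shape[0]
    lasso = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=10000)
    lasso.fit(UD, y)
    return D @ lasso.coef_                      # f~ = D a~

# Toy inpainting example: U masks roughly half the coordinates of f.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))              # overcomplete dictionary
a_true = np.zeros(256); a_true[rng.choice(256, 5)] = 1.0
f = D @ a_true                                  # sparsely representable signal
keep = rng.random(64) < 0.5
U = np.eye(64)[keep]                            # masking operator
y = U @ f + 0.01 * rng.standard_normal(keep.sum())
f_hat = sparse_inverse_estimate(y, U, D, lam=0.1)
```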
Structured Representation and Estimation

    [Diagram: an overcomplete dictionary D vs. a structured overcomplete
    dictionary, a union of blocks B1 B2 B3 B4 B5]

    •  Dictionary: union of PCAs
         •  Union of orthogonal bases D = {Bk}_{1≤k≤K}
         •  In each basis, the atoms are ordered: λ1^k ≥ λ2^k ≥ ··· ≥ λN^k

    •  Piecewise linear estimation (PLE)
         •  A linear estimator per basis
         •  Non-linear basis selection: the best linear estimator is selected

    •  Small degree of freedom, fast computation, state-of-the-art
       performance
Gaussian Mixture Models

        yi = Ui fi + wi,   where wi ∼ N(0, σ²Id)

    •  Estimate {(µk, Σk)}_{1≤k≤K} from {yi}_{1≤i≤I}
    •  Identify the Gaussian ki that generates fi, ∀i
    •  Estimate f̃i from N(µ_{ki}, Σ_{ki}), ∀i
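
A compact sketch of these three steps as a MAP-EM loop (an illustration under simplifying assumptions, not the talk's implementation; in particular, component responsibilities are hard-assigned):

```python
import numpy as np

def map_em(Y, Us, mus, Sigmas, sigma2, n_iter=5):
    """MAP-EM sketch for y_i = U_i f_i + w_i under a GMM prior on f_i."""
    K, d = len(mus), mus[0].shape[0]
    for _ in range(n_iter):
        F, labels = [], []
        for y, U in zip(Y, Us):                 # E-step, per observation
            best = None
            for k in range(K):
                # MAP (Wiener) estimate under Gaussian k:
                # f = mu + Sigma U^T (U Sigma U^T + sigma2 I)^-1 (y - U mu)
                S = U @ Sigmas[k] @ U.T + sigma2 * np.eye(U.shape[0])
                f = mus[k] + Sigmas[k] @ U.T @ np.linalg.solve(S, y - U @ mus[k])
                # Model-selection energy: data term + prior term + log-det.
                r = f - mus[k]
                e = (np.sum((U @ f - y) ** 2) / sigma2
                     + r @ np.linalg.solve(Sigmas[k], r)
                     + np.linalg.slogdet(Sigmas[k])[1])
                if best is None or e < best[0]:
                    best = (e, k, f)
            labels.append(best[1]); F.append(best[2])
        F, labels = np.array(F), np.array(labels)
        for k in range(K):                      # M-step: refit each Gaussian
            idx = labels == k
            if idx.sum() > d:                   # enough samples to fit
                mus[k] = F[idx].mean(axis=0)
                Sigmas[k] = np.cov(F[idx].T) + 1e-6 * np.eye(d)
    return F, labels
```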
Structured Sparsity

    •  PCA (Principal Component Analysis)
           Σk = Bk Sk BkT
         •  Bk = {φm^k}_{1≤m≤N} PCA basis, orthogonal.
         •  Sk = diag(λ1^k, ..., λN^k),  λ1^k ≥ λ2^k ≥ ··· ≥ λN^k eigenvalues.

    •  PCA transform
           f̃i^k = Bk ãi^k

    •  MAP with PCA
           f̃i^k = arg min_{fi} ||Ui fi − yi||² + σ² fiT Σk⁻¹ fi
       ⇔
           ãi^k = arg min_{ai} ||Ui Bk ai − yi||² + σ² Σ_{m=1}^{N} |ai[m]|² / λm^k
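
The coefficient-domain problem on the last line is a weighted ridge regression with a closed form; a short sketch (illustration only, assuming zero-mean signals):

```python
import numpy as np

def ple_estimate_in_basis(y, U, B, lambdas, sigma2):
    """Solve a~ = argmin ||U B a - y||^2 + sigma2 * sum_m |a[m]|^2 / lambda_m.

    Setting the gradient to zero gives
        (B^T U^T U B + sigma2 * diag(1/lambda)) a~ = B^T U^T y,
    a per-basis linear (Wiener-type) estimator; the signal is f~ = B a~.
    """
    UB = U @ B
    H = UB.T @ UB + sigma2 * np.diag(1.0 / lambdas)
    a = np.linalg.solve(H, UB.T @ y)
    return B @ a, a
```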
Structured Sparsity

    Sparse estimate:
        ãi = arg min_{ai} ||U D ai − yi||² + λ Σ_{m=1}^{|Γ|} |ai[m]|
        •  Full degree of freedom in atom selection: |Λ| out of |Γ|.

    vs. piecewise linear estimate:
        ãi^k = arg min_{ai} ||Ui Bk ai − yi||² + σ² Σ_{m=1}^{N} |ai[m]|² / λm^k
        •  Linear collaborative filtering in each basis.
        •  Nonlinear basis selection, degree of freedom K.

    [Diagram: overcomplete dictionary D vs. structured dictionary B1 ... B5]
Initial Experiments: Evolution

    [Images: clustering result after the 1st iteration and after the
    2nd iteration]
Experiments: Inpainting




               Original            20% available                 MCA 24.18 dB                  ASR 21.84 dB
                                                          [Elad, Starck, Querre, Donoho, 05]    [Guleryuz, 06]




           KR 21.55 dB               FOE 21.92 dB                    BP 25.54 dB               PLE 27.65 dB
  [Takeda, Farsiu, Milanfar, 06]   [Roth and Black, 09]         [Zhou, Sapiro, Carin, 10]
Experiments: Zooming
                                                                                        Low-resolution




          Original     Bicubic 28.47 dB         SAI 30.32 dB          SR 23.85 dB          PLE 30.64 dB

                        SR [Yang, Wright, Huang, Ma, 09]       SAI [Zhang and Wu, 08]

Experiments: Zooming Deblurring




                       f             Uf               y = SUf




                   Iy 29.40 dB   PLE 30.49 dB         SR 28.93 dB
                                                [Yang, Wright, Huang, Ma, 09]
Experiments: Denoising




                         Original         Noisy 22.10 dB      NLmeans 28.42 dB
                                                                [Buades et al, 06]




                       FOE 25.62 dB      BM3D 30.97 dB          PLE 31.00 dB
                  [Roth and Black, 09]    [Dabov et al, 07]
Summary of this part

                 •     Gaussian mixture models and MAP-EM work well for
                       image inverse problems.

                 •     Piecewise linear estimation, connection to structured
                       sparsity.

                       •   Collaborative linear filtering.

                       •   Nonlinear best basis selection, small degree of freedom.

                 •     Faster computation than sparse estimation.
                 •     Results in the same ballpark as the state of the art.

                 •     Beyond images: recommender systems and audio (Sprechmann & Cancela)

                 •     Statistical compressed sensing
Modeling and Learning Human Activity

    Alexey Castrodad (1,2) and Guillermo Sapiro (2)

    (1) NGA Basic and Applied Research
    (2) University of Minnesota, ECE Department

    castr103@umn.edu, guille@umn.edu
Motivation

•  Problem: given volumes of video feed, detect activities of interest
   §  Mostly done manually!

•  Solving this will:
   §  Aid the operator: surveillance/security, gaming, psychological
      research
   §  Sift through large amounts of data

•  Solution: fully/semi-automatic activity detection with minimum human
   interaction
   §  Invariance to spatial transformations
   §  Robust to occlusions, low resolution, noise
   §  Fast and accurate
   §  Simple, generic
Sparse modeling: Dictionary learning from data

    [Illustration: dictionary learning from data]
Sparse modeling for action classification: Phase 1

Training:
   •  Input videos (Class 1, Class 2, Class 3)
   •  Spatial-temporal features
   •  Sparse modeling → per-class dictionaries D1, D2, D3, stacked into D

Classification:
   •  New video → feature extraction → sparse coding against D, giving
      per-class activations A1, A2, A3
   •  l1 pooling → classifier output (see the sketch after this list)
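
The classification path above can be sketched in a few lines of Python (an illustration with assumed details: the dictionary blocks, the Lasso solver, and its alpha scaling are our choices, not the authors'):

```python
import numpy as np
from sklearn.linear_model import Lasso

def classify_by_l1_pooling(x, D_blocks, lam=0.1):
    """Classify a feature vector by sparse coding + l1 pooling.

    Code x against the stacked per-class dictionaries [D1 | D2 | ...],
    pool the l1 energy of each class's block of coefficients, and pick
    the class with the largest pooled energy.
    """
    D = np.hstack(D_blocks)
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    lasso.fit(D, x)                # sparse code of x in the stacked dictionary
    a = lasso.coef_
    energies, start = [], 0
    for Dk in D_blocks:
        stop = start + Dk.shape[1]
        energies.append(np.abs(a[start:stop]).sum())  # l1 pooling per class
        start = stop
    return int(np.argmax(energies))
```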
Sparse modeling for action classification: Phase 2

Training:
   •  Sparse modeling: per-class dictionaries D1, D2, D3 (D)
   •  Inter-class modeling: training videos → feature extraction →
      sparse coding → inter-class models E1, E2, E3

Classification:
   •  New video → feature extraction → sparse coding from Phase 1,
      giving A1, A2, A3
   •  Second sparse coding step against E1, E2, E3, then l1 pooling →
      classifier output
Results

•  YouTube Action Dataset
   §  variable spatial resolution videos, 3-8 seconds each
   §  11 types of actions from YouTube videos

Scene              Actions                           Camera                      Resolution       Frame Rate
indoors/outdoors   basketball shooting, cycling,     jitter, scale variations,   variable,        25 fps
                   diving, golf swinging, horse      camera motion, variable     resampled to
                   back riding, soccer juggling,     illumination conditions,    320 x 240
                   swinging, tennis swinging,        high background clutter
                   trampoline jumping, volleyball
                   spiking, walking with a dog
  
Results: YouTube Action Dataset

§  Best/recent reported: 75.8% (Q.V. Le et al., 2011); 84.2% (Wang et al., 2011)
§  Recognition rate: 80.29% (phase 1) and 91.9% (phase 2)
  
Conclusion

•  Main contribution:
   §  Robust activity recognition framework based on sparse modeling
   §  Generic: works on multiple data sources
   §  State-of-the-art results on all of them, with the same parameters
•  Key advantages:
   §  Simplicity, state-of-the-art results
   §  Fast and accurate: 7.5 fps
   §  Only 7 frames needed for detection
•  Future directions:
   §  Exploit human interactions
   §  Infer the actions
   §  Foreground extraction/video analysis for activity clustering
  
ICVSS 2011 Steven Seitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg

    Outline


    1 ICVSS 2011


    2 A Trillion Photos - Steven Seitz

    3 Efficient Novel Class Recognition and Search - Lorenzo
       Torresani

    4 The Life of Structured Learned Dictionaries - Guillermo Sapiro


    5 Image Rearrangement & Video Synopsis - Shmuel Peleg




                        Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
Shift-Map Image Editing
              Yael Pritch
          Eitam Kav-Venaki
            Shmuel Peleg

  The Hebrew University of Jerusalem
Geometrical Image Editing:
                     Retargeting
Retargeting          (Avidan and Shamir SIGGRAPH’07, Wolf et al., ICCV’07, Wang et al., SIGGRAPH Asia’08,
Rubinstein et al., SIGGRAPH’08, Rubinstein et al., SIGGRAPH’09)
                            Input




                                                                       Shift-Map
                                                                        Output
Geometrical Image Editing:
               Inpainting
Inpainting (Criminisi et al. CVPR’03, Wexler et al. CVPR’04, Sun et al. SIGGRAPH’05,
Komodakis et al. CVPR’06, Hays and Efros, SIGGRAPH’07)

                           Mask                            Input
Geometrical Image Editing:
               Inpainting
Inpainting (Criminisi et al. CVPR’03, Wexler et al. CVPR’04, Sun et al. SIGGRAPH’05,
Komodakis et al. CVPR’06, Hays and Efros, SIGGRAPH’07)

                           Mask                            Output
Shift-Map Composition




       A          B       C           D
  User
Constraints
Shift-Map Composition




       A           B      C           D
  User
Constraints




               A
Shift-Map Composition




       A           B      C           D
  User
Constraints




               A         B
Shift-Map Composition




       A           B      C           D
  User
Constraints




               A       C B
Shift-Map Composition




         A           B      C           D
    User
  Constraints



No accurate
 segmentation
required

                 A       C B     D
Shift-Map Composition




         A          B       C           D
    User
  Constraints



No accurate
 segmentation
required
Our Approach : Shift-Map
• Shift-Maps represent a mapping for each pixel in the output
  image into the input image
          Output : R(u,v)              Input : I(x,y)




• The color of each output pixel is copied from the corresponding input pixel
Our Approach : Shift-Map

         Output : R(u,v)            Input : I(x,y)

             (u,v)               (u,v)


                                     (x,y)




• We use relative mapping coordinates (as in optical flow)
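A small NumPy sketch of what applying a shift-map means (illustrative only; the arrays tx, ty holding the per-pixel shifts are assumed given):

import numpy as np

def apply_shift_map(I, tx, ty):
    # R(u,v) = I(u + tx(u,v), v + ty(u,v)): relative mapping, as in optical flow.
    h, w = tx.shape
    u, v = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    x = np.clip(u + tx, 0, I.shape[0] - 1)
    y = np.clip(v + ty, 0, I.shape[1] - 1)
    return I[x, y]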
Our Approach : Shift-Map
                        Output                         Input

 Shift-Map
Output Image
            Horizontal Shifts    Vertical Shifts
            (example shift values: Tx = 0, 50, 400; Ty = 10)

• Minimal distortion
• Adaptive boundaries
• Fast optimization
Geometric Editing as an
          Energy Minimization
• We look for the optimal mapping, which can be
  described as an energy minimization problem



                    Data term :              Smoothness term :
           External Editing Requirement   Avoid Stitching Artifacts
             Compute For Each Pixel       Compute For Each Pair
                                           of Neighboring pixels

• Unified representation for geometric editing applications
• Solved using a graph labeling algorithm
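In symbols, following the Shift-Map paper, with M the shift-map and α the smoothness weight:

    E(M) = \sum_{p \in R} E_d(M(p)) + \alpha \sum_{(p,q)\,\text{neighbors}} E_s(M(p), M(q))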
The Smoothness Term
        R - Output Image                 I - Input Image

A discontinuity in the shift-map maps neighboring output pixels p and q
to distant input pixels p' and q'. The term compares, for each of p and q,
the colors and gradients of the copied input neighborhoods, so stitches
fall where the two sources agree (Kwatra et al. 03, Agarwala et al. 04).
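A plausible reconstruction of the pairwise term sketched above, per the Shift-Map paper, with β weighting the gradient part:

    E_s(M(p), M(q)) = \| I(q + M(p)) - I(q + M(q)) \|^2 + \beta \, \| \nabla I(q + M(p)) - \nabla I(q + M(q)) \|^2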
The Data Term: Inpainting
• The data term varies between different applications
• Inpainting data term uses data mask D(x,y) over the
  input image
   – D(x,y)= ∞ for pixels to be removed
   – D(x,y)=0 elsewhere






• Specific input pixels can be forced not to be included in the
  output image by setting D(x,y)=∞
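A toy version of this data term, assuming a boolean removal mask over the input (all names hypothetical):

import numpy as np

def inpainting_data_cost(remove, u, v, tx, ty):
    # Infinite cost if output pixel (u,v) under shift (tx,ty) would copy
    # a removed input pixel (or fall outside the input); zero otherwise.
    x, y = u + tx, v + ty
    inside = 0 <= x < remove.shape[0] and 0 <= y < remove.shape[1]
    return np.inf if (not inside or remove[x, y]) else 0.0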
The Data Term: Rearrangement

• Input pixels can be forced
  to appear in a new location
    • Appropriate shift gets    (u,v)
      infinitely low energy             (x,y)
    • Other shifts get
       infinitely high energy
The Data Term: Retargeting
• Use picture borders
• Can incorporate importance mask
  – Order constraint on mapping is applied to prevent
    duplications of important areas
Shift-Map as Graph Labeling
  • Minimal energy mapping can be represented as graph
    labeling where the Shift-Map value is the selected label
    for each output pixel
  • Labels: relative shifts, i.e. shift-map values (tx,ty)

         Output image pixels                Input image


Nodes:
pixels

                               Shift Map:
                               assign
                               a label to
                               each pixel
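The paper solves this labeling with graph cuts; as a runnable stand-in that makes the costs concrete, here is a tiny ICM (iterated conditional modes) loop over a small candidate label set. This is only a sketch, not the paper's optimizer:

import numpy as np

def icm_labeling(unary, pairwise_w=1.0, n_iters=5):
    # unary: (H, W, L) data cost per pixel and candidate shift label.
    # Potts smoothness: pay pairwise_w when neighbors take different labels.
    H, W, L = unary.shape
    labels = unary.argmin(axis=2)
    for _ in range(n_iters):
        for i in range(H):
            for j in range(W):
                cost = unary[i, j].astype(float)
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        cost += pairwise_w * (np.arange(L) != labels[ni, nj])
                labels[i, j] = cost.argmin()
    return labels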
Hierarchical Solution
Gaussian pyramid                  Output
    on input




             Shift-Map




             Shift-Map
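A coarse-to-fine sketch matching this slide; `solver` is a placeholder for the single-scale graph-cut optimization, which is not implemented here:

import numpy as np

def hierarchical_shift_map(I, solver, n_levels=3):
    # Build a (crude) pyramid, solve at the coarsest level, then upsample
    # the shifts, doubling their magnitude, to initialize the next level.
    pyramid = [I]
    for _ in range(n_levels - 1):
        pyramid.append(pyramid[-1][::2, ::2])
    init = None
    for level in reversed(pyramid):
        if init is not None:
            tx, ty = (np.kron(m, np.ones((2, 2), int)) * 2 for m in init)
            h, w = level.shape[:2]
            init = (tx[:h, :w], ty[:h, :w])
        init = solver(level, init)   # returns (tx, ty) at this scale
    return init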
Results and Comparison
            Image completion with structure propagation [Sun et al. SIGGRAPH’05]




                                                 Mask                              Shift-Map

   Shift-Map handles, without additional user interaction, some cases
   that other algorithms suggested could only be handled with
   additional user guidance.

J. Sun, L. Yuan, J. Jia, and H. Shum. Image completion with structure propagation. In SIGGRAPH’05
Application: Retargeting
     Input           Output
Results and
     Comparison

Non-Homogeneous           Improved Seam Carving                  PatchMatch
 [Wolf et al., ICCV’07]   [Rubinstein et al, SIGGRAPH’08]   [Barnes et al, SIGGRAPH‘09]
                                                                                          Shift-Maps
Summary

• New representation of geometrical editing
  applications as an optimal graph labeling

• Unified approach

• Solved efficiently using hierarchical
  approximations

• Minimal user interaction is required for various
  editing tasks
Similarity Guided Composition
• Build an Output image R from pixels
  taken from Source image I such that R is
  most similar to Target image T
 Source Image
                Target Image      Output
Similarity Guided Composition

• Data term reflects a similarity between
  the output image R and a target image T
• Similarity uses both colors and gradients
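The slide does not spell out the data term; one plausible per-pixel form combining colors and gradients, with λ an assumed weight, would be:

    E_d(p) = \| R(p) - T(p) \|^2 + \lambda \, \| \nabla R(p) - \nabla T(p) \|^2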
Similarity Guided Composition
• Data term indicates the similarity of the
  output image to the target image
• Weight between similarity and smoothness
  has the following effect
Source Image        Resulting
                    Output                      Target Image




               Previous Work: Efros and Freeman 2001, Hertzman et al. 2001
Edge Preserving Magnification
Using the original image as the source, similarity
  guided composition can magnify




  Source          Result           Target (bilinear
                                   magnification)

 Does not work for gradual color changes
Edge Preserving Magnification
Original image can be the source for edge areas.
  Otherwise the magnified image is the source.




 Original    Magnified Target    Edge Map
 Source 1      Source 2
Edge Preserving Magnification




  Bicubic            Shift Map
The Bidirectional Similarity
     [Simakov, Caspi, Shechtman, Irani – CVPR’2008]

              Completeness            All source patches
                                      (at multiple scales)
          source     ⊆       target   should be in the target


                               ?


                     ⊇                All target patches
                                      (at multiple scales)
                                      should be in the source
               Coherence

Easy to compose (recover) source from target
Easy to compose (recover) target from source
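In the notation of Simakov et al. (CVPR 2008), with patches P ⊂ S and Q ⊂ T taken at multiple scales:

    d_{BDS}(S,T) = \underbrace{\frac{1}{N_S} \sum_{P \subset S} \min_{Q \subset T} D(P,Q)}_{\text{completeness}} + \underbrace{\frac{1}{N_T} \sum_{Q \subset T} \min_{P \subset S} D(Q,P)}_{\text{coherence}}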
Shift-Map Retargeting with Feedback
 • Shift-Map retargeting maximizes the
   coherence




 • It will be hard to reconstruct the fish
Shift-Map Retargeting with Feedback

• Increase the Appearance Data Term of
  input regions with a high Composition
  Score EA|B and recompute the output B.

• Pixels with the higher Appearance Term
  will now appear in the output and
  increase the completeness.
Original              Appearance Term EA|B




      Retargeted   Reconstruction of Original
Shift-Map Retargeting with Feedback




       Original Shift-Map   Feedback
Video Synopsis and Indexing
 Making a Long Video Short




 •   11 million cameras in 2008
 •   Expected 30 million in 2013
 •   Recording 24 hours a day, every day
Video Synopsis
                 Shift Objects in Time




Synopsis Video
S(x,y,t)


                                    Input Video
                                    I(x,y,t)
                              t
Steps in Video Synopsis

• Detect and track objects, store in database.
• Select relevant objects from database
• Display selected objects in a very short
  “Video Synopsis”
• In “Video Synopsis”, objects from different
  times can appear simultaneously
• Index from selected objects into original video
• Cluster similar objects
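A toy stand-in for the temporal-shift step above (the real system minimizes an energy that also penalizes collisions between objects; everything here is hypothetical):

def greedy_synopsis(tubes, out_len, stride=10):
    # tubes: (start, end) frame intervals of tracked objects in the input.
    # Returns one temporal shift per tube mapping it into [0, out_len),
    # so objects from different times appear simultaneously.
    shifts = []
    for i, (start, end) in enumerate(tubes):
        length = min(end - start, out_len)
        new_start = (i * stride) % max(1, out_len - length + 1)
        shifts.append(new_start - start)
    return shifts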
Two Clusters
                            Cars
Camera in St. Petersburg




                           People


• Detect specific events
• Discover activity patterns
ICVSS 2011 Presentations



  168.176.61.22/comp/buzones/PROCEEDINGS/ICVSS2011


   Jiri Matas - Tracking, Learning, Detection, Modeling
   Ivan Laptev - Human Action Recognition
   Josef Sivic - Large Scale Visual Search
   Andrew Fitzgibbon - Computer Vision: Truth and Beauty
   (Kinect)




           Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
The end...



Thanks !




                          Angel Cruz-Roa aacruzr@unal.edu.co
                    Andrea Rueda-Olarte adruedao@unal.edu.co

         Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations

ICVSS2011 Selected Presentations

  • 1.
    ICVSS 2011 StevenSeitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg ICVSS 2011: Selected Presentations Angel Cruz and Andrea Rueda BioIngenium Research Group, Universidad Nacional de Colombia August 25, 2011 Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
  • 2.
    ICVSS 2011 StevenSeitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg Outline 1 ICVSS 2011 2 A Trillion Photos - Steven Seitz 3 Efficient Novel Class Recognition and Search - Lorenzo Torresani 4 The Life of Structured Learned Dictionaries - Guillermo Sapiro 5 Image Rearrangement & Video Synopsis - Shmuel Peleg Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
  • 3.
    ICVSS 2011 StevenSeitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg Outline 1 ICVSS 2011 2 A Trillion Photos - Steven Seitz 3 Efficient Novel Class Recognition and Search - Lorenzo Torresani 4 The Life of Structured Learned Dictionaries - Guillermo Sapiro 5 Image Rearrangement & Video Synopsis - Shmuel Peleg Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
  • 4.
    ICVSS 2011 StevenSeitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg ICVSS 2011 International Computer Vision Summer School 15 speakers, from USA, France, UK, Italy, Prague and Israel Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
  • 5.
    ICVSS 2011 StevenSeitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg ICVSS 2011 International Computer Vision Summer School Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
  • 6.
    ICVSS 2011 StevenSeitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg ICVSS 2011 International Computer Vision Summer School Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
  • 7.
    ICVSS 2011 StevenSeitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg Outline 1 ICVSS 2011 2 A Trillion Photos - Steven Seitz 3 Efficient Novel Class Recognition and Search - Lorenzo Torresani 4 The Life of Structured Learned Dictionaries - Guillermo Sapiro 5 Image Rearrangement & Video Synopsis - Shmuel Peleg Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
  • 8.
    A Trillion Photos Steve Seitz University of Washington Google Sicily Computer Vision Summer School July 11, 2011
  • 9.
    Facebook >3 billion uploaded each month ~ trillion photos taken each year
  • 10.
    What do youdo with a trillion photos? Digital Shoebox (hard drives, iphoto, facebook...)
  • 17.
  • 18.
    Comparing images Detect features using SIFT [Lowe, IJCV 2004]
  • 19.
    Comparing images Extraordinarily robustimage matching – Across viewpoint (~60 degree out-of-plane rotations) – Varying illumination – Real-time implementations
  • 20.
  • 21.
    Scale Invariant FeatureTransform 0 2π angle histogram Adapted from slide by David Lowe
  • 22.
  • 23.
    NASA Mars Roverimages with SIFT feature matches Figure by Noah Snavely
  • 25.
    Coliseum (outside) St. Peters (inside) Coliseum St. Peters (outside) (inside) Il Vittoriano Trevi Fountain Forum
  • 26.
    Structure from motion Matched photos 3D structure
  • 27.
    Structure from motion aka“bundle adjustment” (texts: Zisserman; Faugeras) p4 p1 p3 minimize p2 f (R, T, P) p5 p7 p6 Camera 1 Camera 3 R1,t1 Camera 2 R3,t3 R2,t2
  • 28.
  • 29.
    Reconstructing Rome In aday... From ~1M images Using ~1000 cores Sameer Agarwal, Noah Snavely, Rick Szeliski, Steve Seitz http://grail.cs.washington.edu/rome
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
    From Sparse toDense Sparse output from the SfM system
  • 36.
    From Sparse toDense Furukawa, Curless, Seitz, Szeliski, CVPR 2010
  • 41.
    Most of ourphotos don’t look like this
  • 43.
  • 44.
    Your Life in30 Seconds path optimization
  • 45.
    Picasa Integration • As“Face Movies” feature in v3.8 – Rahul Garg, Ira Kemelmacher
  • 46.
    Conclusion trillions of photos + computer vision breakthroughs = new ways to see the world
  • 47.
    ICVSS 2011 StevenSeitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg Outline 1 ICVSS 2011 2 A Trillion Photos - Steven Seitz 3 Efficient Novel Class Recognition and Search - Lorenzo Torresani 4 The Life of Structured Learned Dictionaries - Guillermo Sapiro 5 Image Rearrangement & Video Synopsis - Shmuel Peleg Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
  • 48.
  • 49.
    Problem statement: novel object-class search • Given: image database user-provided images (e.g., 1 million photos) of an object class + • Want: database • no text/tags available images • query images may of this class represent a novel class
  • 50.
    Application: Web-powered visualsearch in unlabeled personal photos Goal: Find “soccer camp” pictures on my computer 1 1 Search the Web for images of “soccer camp” 2 Find images of this visual class on my computer 2
  • 51.
    Application: product search • Search of aesthetic products
  • 52.
    RBM predictedpredicted labels(47%) RBM labels (47%) Relation to other tasks sky sky building building tree bed tree bed car car novel class road road Input search Ground truth neighbors image image Input Ground truth neighbors 32−RBM 32−RBM 16384-gist 1 query retrieved image retrieval object categorizationshowingitperce Figure 6. 6. Curves showing per Figure Curves query images that make it int query images that make into ofof the query for 1400 image the query for a a 1400 imag to 5% of the database size. upup to 5% of the database siz analogies: RBM predictedpredicted labels (56%) RBM labels (56%) crucial for scalable retrieval th crucial for scalable retrieval - large databases tree from [Nister and Stewenius, ’07] tree sky sky database make it it to the very database make to the very to is is feasible only for a tiny f feasible only for a tiny fra - efficient indexing database grows large. Hence, w database grows large. Hence, building building the curves meet the y-axis. T the curves meet the y-axis. - compact representation (a) car car given in in Table 1 for larger n given Table 1 for a a larger sidewalk sidewalkcrosswalkcrosswalk conclusions can bebe drawn from conclusions can drawn from road road improves retrieval performance improves retrieval performan differences: from neighbors et al., ’07] performance than vocabularies.1 performance than 2 -norm. En L L2 -norm. Input image imageGround truth [Philbinneighbors 32−RBM 32−RBM vocabularies. O Input least for smaller 16384-gist - simple notions of visual Ground truth least for smaller gives much better performance th gives much better performance (b) relevancy is is setting T. setting T. (e.g., near-duplicate, same object instance, settings used by [17]. settings used by [17]. The performance with vav The performance with same spatial layout) (c) RBM predictedpredicted labels (63%) [Torralba et al., ’08] RBM labels (63%) from on the full 6376 image databa on the full 6376 image data the scores decrease with inc the scores decrease with in ceiling ceiling are more images toto confus are more images confuse Figure Thewall retrieval performance is is evaluated using a large wall performance evaluated using a large Figure 5. 5. The retrieval ofof the vocabulary tree is sh the vocabulary tree is show ground truth database (6376 images) with groups ofof four images ground truth database (6376 images) with groups four images door door defining the vocabulary tree defining the vocabulary tre poster poster
  • 53.
    Relation to othertasks novel class search image retrieval object classification analogies: analogies: - large databases - recognition of object - efficient indexing classes from a few examples - compact representation differences: differences: - classes to recognize are - simple notions of visual defined a priori relevancy - training and recognition (e.g., near-duplicate, time is unimportant same object instance, - storage of features is not an same spatial layout) issue
  • 54.
    Technical requirements of novel class-search • The object classifier must be learned on the fly from few examples • Recognition in the database must have low computational cost • Image descriptors must be compact to allow storage in memory
  • 55.
    State-of-the-art in object classification Winning recipe: many features + non-linear classifiers (e.g. [Gehler and Nowozin, CVPR’09]) non-linear !"#$% decision boundary !"#$%&#'()* +&,-)&.&#(#/* ... 01#-2"#* &'()*+),%% -'.,()*+/% #"0$%
  • 56.
    Model evaluation onCaltech256 45 40 gist 35 phog phog2pi 30 accuracy (%) ssim 25 bow5000 20 !"#$%&'()*$+' 15 , 10 '"#*"-"*.%+'/$%0.&$1 5 0 0 5 10 15 20 25 30 number of training examples
  • 57.
    Model evaluation onCaltech256 45 40 gist phog 35 phog2pi 30 ssim accuracy (%) bow5000 !"#$%&'()*$+', 25 linear combination /$%0.&$'2)(3"#%4)# 20 !"#$%&'()*$+' 15 , 10 '"#*"-"*.%+'/$%0.&$1 5 0 0 5 10 15 20 25 30 number of training examples
  • 58.
    Model evaluation onCaltech256 5)#6+"#$%&'()*$+', 45 /$%0.&$'2)(3"#%4)#' 40 7%898%8':.+4;+$'<$&#$+' gist !$%&#"#=>' 35 phog ?@$A+$&'B'5)C)D"#E'FGH phog2pi 30 accuracy (%) ssim 25 bow5000 !"#$%&'()*$+', linear combination /$%0.&$'2)(3"#%4)# 20 nonlinear combination !"#$%&'()*$+' 15 , 10 '"#*"-"*.%+'/$%0.&$1 5 0 0 5 10 15 20 25 30 number of training examples
  • 59.
    Multiple kernel combiners Classificationoutput is obtained by combining many features via non-linear kernels: F N h(x) = βf kf (x, xn )αn + b f =1 n=1 sum over features sum over training examples !#$% ... where '()*+),%% -'.,()*+/% #0$%
  • 60.
    m=1 s. Fora kernel function k between a SVM. he short-hand notation Training Same as for averaging. = k(fm (x), fm (x )), Multiple con- 4. Methods: Multiple Kernel Learning kernel learning (MKL) nel km : X × X → R only espect to image feature fal., 2004; Sonnenburg etapproach toVarma and Ray, 2007] is to [Bach et m . If the Another al., 2006; perform kernel selection to a certain aspect, say, it only con- a kernel combination during the training phase of th gorithm. jointly optimizing over Learning a non-linear SVM by One prominent instance of this class is MKL on, then the kernel measures simi- F a linear combinati to this aspect. The subscript m of nderstood as a linear combinationobjective ∗ (x, x ) k=(x, x ) =β over(x,fx ) x ) the par 1. indexing into the set of kernels k is to optimize jointly of kernels: ∗ F β k (x, km f and m m=1 f =1 2. the SVM parameters: α ∈ RN and b ∈ R of an SVM. ters notational convenience, we will de- MKL was originally introduced in [1]. For efficiency e of the m’th feature for a given   F in order N obtain sparse, F to interpretable coefficients, F raining samples xi , i = 1, 1 . . . , N min βf αT Kf α stricts βm ≥ 0 and ,imposes thefconstraintT α βm + C L yn b + β Kf (xn ) m=1 α,β,b 2 Since the scope of this paper is to access the applicab f =1 n=1 f =1 of MKL to feature combination rather than its optimiz ), km (x, x2 ), . . . , km (x, xN )]T . F part we opted to present the MKL formulations in a wa aining sample, i.e. x = xi , then = 1,lowing for easier 1, . . . , F subject to βf βf ≥ 0, f = comparison with the other methods h column of the m’th kernel matrix.f =1 write its objective function as F ernel selection In this papert) = max(0, 1 − yt) 1 where L(y, we min βm αT Km α classifiers that aim to combine sev- 2 m=1 Kf (x) = [kf (x, x1 ), kf (x, x2 ), . . . , kf (x, xN )]T α,β,b e model. Since we associate image N F ctions, kernel combination/selection +C L(yi , b + βm Km (x)T α)
  • 61.
    LP-β: a two-stageapproach to MKL ! [Gehler and Nowozin, 2009] • Classification output of traditional MKL: F N hM KL (x) = βf kf (x, xn )αn + b f =1 n=1 • Classification function of LP-β: F N h(x) = βf kf (x, xn )αf n + bf f =1 n=1 hf (x) Two-stage training procedure: 1. train each hf (x) independently → traditional SVM learning 2. optimize over β → a simple linear program
  • 62.
    LP-β for novel-classsearch? The LP-β classifier: F N h(x) = βf kf (x, xn )αf n + bf f =1 n=1 sum over features sum over training examples Unsuitable for our needs due to: • large storage requirements (typically over 20K bytes/image) • costly evaluation (requires query-time kernel distance computation for each test image) • costly training (1+ minute for O(10) training examples)
  • 63.
    Classemes: a compactdescriptor for efficient recognition [Torresani et al., 2010] ! Key-idea: represent each image x in terms of its “closeness” to a set of basis classes (“classemes”) x Φ(x) = [φ1 (x), . . . , φC (x)]T F N φc (x) = hclassemec (x) = c βf kf (x, xc )αn + bc n c f =1 n=1 output of a pre-learned LP-β for the c-th basis class Φ(x1 ) ... Φ(xN ) Query-time learning: training examples of train a linear classifier on Φ(x) novel  class  C F N g duck (Φ(x); wduck ) = Φ(x)T wduck = wc  duck c βf kf (x, xc )αn + bc  n c c=1 f =1 n=1 LP-β trained before the trained at query-time creation of the database
  • 64.
    How this works... Efficient Object Category Recognition Using Classemes 777 • Accurate weighted classemes. Five classemes with the highest LP-β weights Table 1. Highly semantic labels are not required... to •make semantic sense, but it should bejust used that detectors may create for the retrieval experiment, for a selection of Caltech 256 categories. Somefor appear Classeme classifiers are emphasized as our goal is simply to specific patterns of texture, color, shape, etc. a useful feature vector, not to assign semantic labels. The somewhat peculiar classeme labels reflect the ontology used as a source of base categories. !#$%'()*+$ ,-(./+$#-(.'0$%/1121$ %)#3)+4.'$ !#$% '()*%'+%*,-. -,.+(,/ -)##-%01# $2330/+(,/ 05%6$ 1)$1*+(#,/ 1)45+)3+6,%* '60$$* 6,#.0/7 '%*,07!% 12##+$,#+!*4+ /6$ 3072*+'.,%* -,%%# 7*,8'0% 4,4+1)45 ,/0$,# 7*-13$ 6,%*-*,3%+'2*3,- '-'0+-,1# ,#,*$+-#)-. !0/42 '*80/7+%*,5 6'%*/+!$0'(!*+ '*-/)3-'4898$ -)/89+%!0/7 $0/4+,*, -4(#,5* *),'%0/7+(,/ (*')/ %,.0/7+-,*+)3+ -)/%,0/*+(*''2*+ #./3**)#$ 1,77,7+()*%* -,/)(5+-#)'2*+)(/ *)60/7+'!## ')$%!0/7 1,**0* Large-scale recognition benefits from a compact descriptor for each image, for example allowing databases to be stored in memory rather than on disk. The
  • 65.
    bject Classes byBetween-Class Attribute Transfer Hannes Nickisch Stefan Harmeling Related work or Biological Cybernetics, T¨ bingen, Germany u me.lastname}@tuebingen.mpg.de • otter when train- Attribute-based recognition: black: white: yes no brown: yes examples of stripes: no hardly been water: yes [Lampert et al., CVPR’09] [Farhadi et al., CVPR’09] eats fish: yes rule rather ens of thou- polar bear black: no very few of white: yes d annotated brown: no stripes: no water: yes introducing eats fish: yes ct detection zebra ption of the black: yes description white: yes requires hand-specified attribute-class associations brown: no hape, color s. On the left h properties stripes: water: yes no ribute be hey can predic- eats fish: no to displayed. attribute classifiers must be trained with arethe cur- Figure 1. A description object categories: after learningthe transfer by high-level attributes allows ected based of knowledge between the visual ed for a new cat- human-labeled examples ve across appearance of attributes from any classes with training examples, and to “engine”,can detect also object classes that do not have any training ike facil- we based on which attribute description a test image fits best. randomly selected positively pre new large- images, Figure 5: This figure shows election helps 30,000 an- tributes for 12 typical images from 12 categories in Yahoo set. nd “rein” that of well-labeled training imageslearnedtechniques rson’s clas- lions and is likely out of classifiers are numerous on Pascal train set and tested on Yahoo se reach for years to come. Therefore, emantic at- one class outreducing the number of necessary training imagesattributes from the list of 64 attributes a for domly select 5 predicted have
  • 66.
    Method overview 1. Classemelearning φ”body of water” (x) → ... φ”walking” (x) → 2. Using the classemes for recognition and retrieval training examples of novel class C g duck (Φ(x)) = wc φc (x) duck c=1 Φ(x1 ) ... Φ(xN )
  • 67.
    Classeme learning: choosing the basis classes • Classeme labels desiderata: - must be visual concepts - should span the entire space of visual classes • Our selection: concepts defined in the Large Scale Ontology for Multimedia [LSCOM] to be “useful, observable and feasible for automatic detection”. 2659 classeme labels, after manual elimination of plurals, near-duplicates, and inappropriate concepts
  • 68.
    Classeme learning: gathering the training data • We downloaded the top 150 images returned by Bing Images for each classeme label • For each of the 2659 classemes, a one-versus-the-rest training set was formed to learn a binary classifier φ”walking” (x) yes no
  • 69.
    Classeme learning: training the classifiers • Each classeme classifier is an LP-β kernel combiner [Gehler and Nowozin, 2009]: F N φ(x) = βf kf (x, xn )αf,n + bf f =1 n=1 linear combination of feature-specific SVMs • We use 13 kernels based on spatial pyramid histograms computed from the following features: - color GIST [Oliva and Torralba, 2001] - oriented gradients [Dalal and Triggs, 2009] - self-similarity descriptors [Schechtman and Irani, 2007] - SIFT [Lowe, 2004]
  • 70.
    A dimensionality reduction   view of classemes     GIST            self-similarity  descriptor Φ  φ1 (x) ...  x=      φ2659 (x)   oriented     gradients     • near state-of-the-art accuracy SIFT with linear classifiers • can be quantized down to • non-linear kernels are needed 200 bytes/image with almost for good classification no recognition loss • 23K bytes/image
  • 71.
    Experiment 1: multiclass recognition on Caltech256 60 LP-β in [Gehler LPbeta Nowozin, 2009] LPbeta13 using 39 kernels 50 MKL Csvm LP-β with our x Cq1svm 40 Xsvm our approach: linear SVM with accuracy (%) classemes Φ(x) 30 linear SVM with binarized classemes, 20 i.e. (Φ(x) 0) linear SVM with x 10 0 0 10 20 30 40 50 number of training examples
  • 72.
    Computational cost comparison Training time Testing time 1500 40 23 hours 30 time (minutes) 1000 time (ms) 20 500 9 minutes 10 0 0 LPbeta Csvm LPbeta Csvm
  • 73.
    Accuracy vs. compactness 4 10 188 bytes/image compactness (images per MB) 3 10 2.5K bytes/image 2 10 LPbeta13 23K bytes/image 1 Csvm 10 Cq1svm nbnn [Boiman et al., 2008] 128K bytes/image emk [Bo and Sminchisescu, 2008] Xsvm 0 10 10 15 20 25 30 35 40 45 accuracy (%) Lines link performance at 15 and 30 training examples
  • 74.
    Experiment 2: object class retrieval Efficient Object Category Recognition Using Classemes 787 30 Csvm Cq1Rocchio (β=1, γ=0) 25 Cq1Rocchio (β=0.75, γ=0.15) Precision @ 25 25 Bowsvm Precision (%) @ 20 BowRocchio (β=1, γ=0) BowRocchio (β=0.75, γ=0.15) 15 • Random performance is 0.4% 10 • training Csvm takes 0.6 sec with 5*256 training examples 5 0 0 10 20 30 40 50 Number of training images Fig. 4. Retrieval. Percentage of the top 25 in a 6400-document set which match the query class. Random performance is 0.4%.
  • 75.
    Analogies with textretrieval • Classeme representation of an image: presence/absence of visual attributes • Bag-of-word representation of a text-document: presence/absence of words
  • 76.
    Related work • Prior work (e.g., [Sivic Zisserman, 2003; Nister Stewenius, 2006; Philbin et al., 2007]) has exploited a similar analogy for object-instance retrieval by representing images as bag of visual words Detect interest patches Compute SIFT descriptors [Lowe, 2004] … … Quantize Represent image as a sparse descriptors histogram of visual words frequency ….. codewords • To extend this methodology to object-class retrieval we need: - to use a representation more suited to object class recognition (e.g. classemes as opposed to bag of visual words) - to train the ranking/retrieval function for every new query-class
  • 77.
    Data structures for efficient retrieval Incidence matrix: Inverted index: features f0 f1 f2 f3 f4 f5 f6 f7 f0 f1 f2 f3 f4 f5 f6 f7 I0: 1 0 1 0 0 1 0 0 I1: 0 0 1 0 1 0 0 0 I0 I2 I0 I2 I1 I0 I4 I6 documents I2: 1 1 0 1 0 0 0 0 I2 I7 I1 I3 I4 I6 I5 I9 I3: 1 0 1 1 0 0 0 0 I4: 1 0 0 0 1 0 1 0 I3 I8 I3 I9 I5 I8 I5: 0 0 0 0 1 0 1 0 I4 I7 I9 I6: 1 0 0 0 0 1 0 1 I6 I9 I7: 0 1 0 0 1 0 0 0 I8 I8: 1 1 0 0 0 1 0 0 I9: 0 0 0 1 1 1 0 1 • enables efficient calculation of w Φ, as: T ∀Φ • very compact: only one bit per feature entry wi Φi i s.t. Φi =0
  • 78.
    Efficient retrieval via inverted index Inverted index: w: [1.5 -2 0 -5 0 3 -2 0 ] f0 f1 f2 f3 f4 f5 f6 f7 I0 I2 I0 I2 I1 I0 I4 I6 I2 I7 I1 I3 I4 I6 I5 I9 I3 I8 I3 I9 I5 I8 I4 I7 I9 I6 I9 I8 Goal: compute score w T Φ, for all binary vectors Φ in the database ∀Φ
  • 79.
    Efficient retrieval via inverted index Inverted index: w: [1.5 -2 0 -5 0 3 -2 0 ] f0 f1 f2 f3 f4 f5 f6 f7 I0 I2 I0 I2 I1 I0 I4 I6 I2 I7 I1 I3 I4 I6 I5 I9 I3 I8 I3 I9 I5 I8 I4 I7 I9 I6 I9 I8 Scoring: I0 I1 I2 I3 I4 I5 I6 I7 I8 I9
  • 80.
    Efficient retrieval via inverted index Inverted index: w: [1.5 -2 0 -5 0 3 -2 0 ] f0 f1 f2 f3 f4 f5 f6 f7 I0 I2 I0 I2 I1 I0 I4 I6 I2 I7 I1 I3 I4 I6 I5 I9 I3 I8 I3 I9 I5 I8 I4 I7 I9 I6 I9 I8 Scoring: I0 I1 I2 I3 I4 I5 I6 I7 I8 I9
  • 81.
    Efficient retrieval via inverted index Inverted index: w: [1.5 -2 0 -5 0 3 -2 0 ] f0 f1 f2 f3 f4 f5 f6 f7 I0 I2 I0 I2 I1 I0 I4 I6 I2 I7 I1 I3 I4 I6 I5 I9 I3 I8 I3 I9 I5 I8 I4 I7 I9 I6 I9 I8 Scoring: I0 I1 I2 I3 I4 I5 I6 I7 I8 I9
  • 82.
    Efficient retrieval via inverted index Inverted index: w: [1.5 -2 0 -5 0 3 -2 0 ] f0 f1 f2 f3 f4 f5 f6 f7 I0 I2 I0 I2 I1 I0 I4 I6 I2 I7 I1 I3 I4 I6 I5 I9 I3 I8 I3 I9 I5 I8 I4 I7 I9 I6 I9 I8 Scoring: I0 I1 I2 I3 I4 I5 I6 I7 I8 I9
  • 83.
    Efficient retrieval via inverted index Inverted index: w: [1.5 -2 0 -5 0 3 -2 0 ] f0 f1 f2 f3 f4 f5 f6 f7 I0 I2 I0 I2 I1 I0 I4 I6 I2 I7 I1 I3 I4 I6 I5 I9 I3 I8 I3 I9 I5 I8 I4 I7 I9 I6 I9 I8 Scoring: I0 I1 I2 I3 I4 I5 I6 I7 I8 I9
  • 84.
    Efficient retrieval via inverted index Inverted index: w: [1.5 -2 0 -5 0 3 -2 0 ] f0 f1 f2 f3 f4 f5 f6 f7 I0 I2 I0 I2 I1 I0 I4 I6 I2 I7 I1 I3 I4 I6 I5 I9 I3 I8 I3 I9 I5 I8 I4 I7 I9 I6 I9 I8 Cost of scoring is linear in the sum of the lengths of inverted lists associated to non-zero weights
  • 85.
    Improve efficiency via sparse weight vectors Key-idea: force w to contain as many zeros as possible classeme vector label of Learning objective of example n Tomographic inversion with example n 1 wavelet penalization 3 N E(w) = R(w) + C N n=1 L(w; Φn , yn ) w2 regularizer loss function w with d = AWT w and smallest 1 -norm • T L2-SVM: R(w) d =wT w w and smallestn ,2yn ) = max(0, 1 − yn (wT Φn )) w with = AW , L(w; Φ -norm d = AWT w • 2 Since |wi | wi for small wi w 2 w 2i |wi | and |wi | wi for large wi , w1 2 choosing R(w) = i |wi | will tend to |w| produce a small number of larger wi weights and 2 -ball: wzero2 weights more 1 + w2 = constant 2 w 1 -ball: |w1 | + |w2 | = constant
  • 86.
    Improve efficiency via sparse weight vectors Key-idea: force w to contain as many zeros as possible classeme vector label of Learning objective of example n example n N E(w) = R(w) + C N n=1 L(w; Φn , yn ) regularizer loss function • L2-SVM: R(w) = wT w , L(w; Φn , yn ) = max(0, 1 − yn (wT Φn )) • L1-LR: R(w) = i |wi | , L(w; Φn , yn ) = log(1 + exp(−yn wT Φn )) • FGM (Feature Generating Machine) [Tan et al., 2010]: R(w) = wT w , L(w; Φn , yn ) = max(0, 1 − yn (w ⊙ d)T Φn ) s.t. 1T d ≤ B d ∈ {0, 1}D elementwise product
  • 87.
    Performance evaluation on ImageNet (10M images) 35 ! [Rastegari et al., 2011] 35 Full inner product evaluation L2 SVM 30 Full inner product evaluation L1 LR 30 Inverted index L2 SVM Precision @ 10 (%) 25 Inverted index L1 LR Precision @ 10 (%) 25 20 20 • Performance averaged over 400 object 15 classes used as queries 15 • 10 training examples per query class 10 10 • Database includes 450 images of the query class and 9.7M images of other classes 5 5 • Prec@10 of a random classifiers is 0.005% 0 20 40 60 80 100 120 140 Search time per query (seconds) 0 20 40 60 80 100 120 140 Each curve is obtained by varying sparsity through C in training objective Search time per query (seconds) N E(w) = R(w) + C N n=1 L(w; Φn , yn ) regularizer loss function
  • 88.
    Top-k ranking • Dowe need to rank the entire database? - users only care about the top-ranked images • Key idea: - for each image iteratively update an upper-bound and a lower-bound on the score - gradually prune images that cannot rank in the top-k
  • 89.
    Top-k pruning ! [Rastegari et al., 2011] w: [ 3 -2 0 -6 0 3 -2 0 ] • Highest possible score: for binary vector ΦU s.t. f0 I0: 1 f1 0 f2 1 f3 0 f4 0 f5 1 f6 0 f7 0 ΦU = 1 iff wi 0 i I1: 0 0 1 0 1 0 0 0 I2: 1 1 0 1 0 0 0 0 → initial upper bound I3: 1 0 1 1 0 0 0 0 I4: 1 0 0 0 1 0 1 0 u∗ = wT · ΦU (6 in this case) I5: 0 0 0 0 1 0 1 0 I6: 1 0 0 0 0 1 0 1 I7: 0 I8: 1 1 1 0 0 0 0 1 0 0 1 0 0 0 0 • Lowest possible score: I9: 0 0 0 1 1 1 0 1 for binary vector ΦL s.t. ΦL = 1 iff wi 0 i → initial lower bound l∗ = wT · ΦL (-10 in this case)
  • 90.
    Top-k pruning ! [Rastegari et al., 2011] w: [ 3 -2 0 -6 0 3 -2 0 ] • Initialization: u∗ , l∗ for all images upper bound f0 f1 f2 f3 f4 f5 f6 f7 I0: 1 0 1 0 0 1 0 0 I1: 0 0 1 0 1 0 0 0 I2: 1 1 0 1 0 0 0 0 I3: 1 0 1 1 0 0 0 0 I4: 1 0 0 0 1 0 1 0 I5: 0 0 0 0 1 0 1 0 0 I6: 1 0 0 0 0 1 0 1 I7: 0 1 0 0 1 0 0 0 I8: 1 1 0 0 0 1 0 0 I9: 0 0 0 1 1 1 0 1 I0 I1 I2 I3 I4 I5 I6 I7 I8 I9 lower bound
  • 91.
    Top-k pruning ! [Rastegari et al., 2011] w: [ 3 -2 0 -6 0 3 -2 0 ] f0 f1 f2 f3 f4 f5 f6 f7 I0: 1 0 1 0 0 1 0 0 I1: 0 0 1 0 1 0 0 0 I2: 1 1 0 1 0 0 0 0 0 I3: 1 0 1 1 0 0 0 0 I4: 1 0 0 0 1 0 1 0 I5: 0 0 0 0 1 0 1 0 I6: 1 0 0 0 0 1 0 1 I7: 0 1 0 0 1 0 0 0 I8: 1 1 0 0 0 1 0 0 I9: 0 0 0 1 1 1 0 1 I0 I1 I2 I3 I4 I5 I6 I7 I8 I9 • Load feature i • Since wi = +3 (0), for each image n: - subtract +3 from the upper bound if φn,i = 0 - add +3 to the lower bound if φn,i = 1
  • 92.
    Top-k pruning ! [Rastegari et al., 2011] w: [ 3 -2 0 -6 0 3 -2 0 ] f0 f1 f2 f3 f4 f5 f6 f7 I0: 1 0 1 0 0 1 0 0 I1: 0 0 1 0 1 0 0 0 I2: 1 1 0 1 0 0 0 0 0 I3: 1 0 1 1 0 0 0 0 I4: 1 0 0 0 1 0 1 0 I5: 0 0 0 0 1 0 1 0 I6: 1 0 0 0 0 1 0 1 I7: 0 1 0 0 1 0 0 0 I8: 1 1 0 0 0 1 0 0 I9: 0 0 0 1 1 1 0 1 I0 I1 I2 I3 I4 I5 I6 I7 I8 I9 • Load feature i • Since wi = -2 (0), for each image n: - decrement by 2 the upper bound if φn,i = 1 - increment by 2 the lower bound if φn,i = 0
  • 93.
    Top-k pruning ! [Rastegari et al., 2011] w: [ 3 -2 0 -6 0 3 -2 0 ] f0 f1 f2 f3 f4 f5 f6 f7 I0: 1 0 1 0 0 1 0 0 I1: 0 0 1 0 1 0 0 0 I2: 1 1 0 1 0 0 0 0 0 I3: 1 0 1 1 0 0 0 0 I4: 1 0 0 0 1 0 1 0 I5: 0 0 0 0 1 0 1 0 I6: 1 0 0 0 0 1 0 1 I7: 0 1 0 0 1 0 0 0 I8: 1 1 0 0 0 1 0 0 I9: 0 0 0 1 1 1 0 1 I0 I1 I2 I3 I4 I5 I6 I7 I8 I9 • Load feature i • Since wi = -6 (0), for each image n: - decrement by 6 the upper bound if φn,i = 1 - increment by 6 the lower bound if φn,i = 0
  • 94.
    Top-k pruning ! [Rastegari et al., 2011] w: [ 3 -2 0 -6 0 3 -2 0 ] f0 f1 f2 f3 f4 f5 f6 f7 I0: 1 0 1 0 0 1 0 0 I1: 0 0 1 0 1 0 0 0 I2: 1 1 0 1 0 0 0 0 0 I3: 1 0 1 1 0 0 0 0 I4: 1 0 0 0 1 0 1 0 I5: 0 0 0 0 1 0 1 0 I6: 1 0 0 0 0 1 0 1 I7: 0 1 0 0 1 0 0 0 I8: 1 1 0 0 0 1 0 0 I9: 0 0 0 1 1 1 0 1 I0 I1 I2 I3 I4 I5 I6 I7 I8 I9 • Suppose k = 4: we can prune I2,I9 since they cannot rank in the top-k
  • 95.
    Distribution of weightsand pruning rate CCV CV IC 1745 745 # #1 ICCV 2011 Submission #1745. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE. ICCV 2011 Submission #1745. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE. 540 40 11 100 100 L1−LR L1−LR Distribution absolute weight values Distribution of absolute weight values 41 541 normalized of absolute weight values 42 542 L2−SVM L2−SVM 43 543 0.8 0.8 FGM FGM 80 80 % of images pruned % of images pruned 44 544 TkP L1−LR, k=10 TkP L1−LR, k=10 45 545 TkP L1−LR, k=3000 TkP L1−LR, k=3000 0.6 0.6 60 60 46 546 TkP L2−SVM, k=10 TkP L2−SVM, k=10 47 547 TkP L2−SVM, k=3000 TkP L2−SVM, k=3000 48 548 0.4 0.4 40 40 TkP FGM, k=10 TkP FGM, k=10 49 549 TkP FGM, k=3000 TkP FGM, k=3000 50 550 0.2 0.2 20 20 51 551 52 552 53 553 00 00 54 554 aa 00 500 500 1000 1000 1500 1500 Dimension 2000 2000 2500 2500 bb 00 500 500 1000 1000 1500 1500 2000 2000 Number ofof iterations (d) iterations (d) 2500 2500 Dimension Number 55 555 56 556 Figure 2. (a) Distribution of weight absolute values for different classifiers (after sorting the weight magnitudes). TkP runs faster with Figure 2. (a) Distribution of weight absolute values for different classifiers (after sorting the weight magnitudes). TkP runs faster with 57 557 Features considered in descending order of |wi | sparse, highly skewed weight values. (b) Pruning rate of TkP for various classification model and different values ofof k (k = 10, 3000). sparse, highly skewed weight values. (b) Pruning rate of TkP for various classification model and different values k (k = 10, 3000). 58 558 59 559 60 560 aa smaller value of kk allows the method to eliminate more smaller value of allows the method to eliminate more 61 images from consideration at aavery early stage. 20 20 v=128 561 images from consideration at very early stage. v=128 8 v=256 v=256 62 w=2 8 v=256 v=256 w=28 8 562 w=2 6 v=64 v=64 w=2 6 w=2 w=2 63
  • 96.
    Performance evaluation on 35 ImageNet (10M images) 30 35 ! [Rastegari et al., 2011] Precision @ 10 (%) 25 30 TkP L1−LR 20 TkP L2−SVM Inverted index L1−LR Precision @ 10 (%) 25 15 Inverted index L2−SVM 20 10 • k = 10 15 • Performance averaged over 400 object 5 classes used as queries 10 • 10 training examples per query class 0 0 50 • 100 150 Database includes 450 images of the query 5 Search time per query (seconds) and 9.7M images of other classes class • Prec@10 of a random classifiers is 0.005% 0 0 50 100 150 Search time per query (seconds) Each curve is obtained by varying sparsity through C in training objective N E(w) = R(w) + C N n=1 L(w; Φn , yn ) regularizer loss function
  • 97.
    Alternative search strategy: approximate ranking • Key-idea: approximate the score function with a measure that can computed (more) efficiently (related to approximate NN search: [Shakhnarovich et al., 2006; Grauman and Darrell, 2007; Chum et al., 2008]) • Approximate ranking via vector quantization: wT Φ ≈ wT q(Φ) ! q(!) where q(.) is a quantizer returning the cluster centroid nearest to Φ • Problem: - to approximate well the score we need a fine quantization - the dimensionality of our space is D=2659: too large to enable a fine quantization using k-means clustering
  • 98.
    Product quantization ! Product quantization for nearest neighbor search [Jegou et al., 2011] • Split feature vector ! into v subvectors: ! [ !1 | !2 | ... | !v ] Vector split into m subvectors: • Subvectors are quantized separately by quantizers Subvectors are quantized separately by quantizers q(!) = [ q1(!1) | q2(!2) | ... | qv(!v) ] where each qi(.) is learned in a space of dimensionality D/v where each is learned by k-means with a limited number of centroids • Example from [Jegou vector split in 8 subvectors of dimension 16 Example: y = 128-dim et al., 2011]: ! is a 128-dimensional vector split into 8 subvectors of dimension 16 16 components 16 components y1 y2 y3 y4 y5 y6 y7 y8 !1 !2 !3 !4 !5 !6 !7 !8 xedni noitazitnauq tib-46 stib 8 256 ) 1 y( 1 q q ) 2 y( 2 q q2 ) 3 y( 3 q q3 )4y(4q q4 )5y(5q q5 )6y(6q q6 )7y(7q )8y(8q q7 q8 28 = 256 centroids 1 centroids q1 q2 1 q3 1 q4 1 q5 q6 q7 q8 sdiortnec 1q 2q 3q 4q 5q 6q 7q 8q 652 q1(y1) q2(y2) q3(y3) q4(y4) q5(y5) q6(y6) q7(y7) q8(y8) q1(!1) q2(!2) q3(!3) q4(!4) 1 1y 1 1 1 1 2y 1 3y 4y 5y q5(!5) q6(!6) q7(!7) q8(!8) 6y 7y 8y 8 bits stnenopmoc 61 64-bit quantization index 8 bits 64-bit quantization index 61 noisnemid fo srotcevbus 8 ni tilps rotcev mid-821 = y :elpmaxE hcae erehw sdiortnec fo rebmun detimil a htiw snaem-k yb denrael si
  • 99.
    obhgien tseraen rofnoitazitnauq tcudorP :srotcevbus m otni tilps rotceV wv  . .   .  tnauq yb yletarapes dezitnauq era srotcevbuS   w2   sub-blocks w1   htiw snaem-k yb denrael si centroids (r per sub-block) hcae erehw 1.Filling the look-up table: tcevbus 8 ni tilps rotcev mid-821 = y :elpmaxE look-up table can be precomputed and stored in a stnenopmoc 61 j=1 5y 4y 3y 2y T 1y wj qj (Φj ) wT Φ ≈ wT q(Φ) = v 652 5q 4q 3q Efficient approximate scoring 2q 1q sdiortnec y(5q )4y(4q )3y(3q ) 2 y( 2 q ) 1 y( 1 q stib 8 xedni noitazitnauq tib-46
  • 100.
    obhgien tseraen rofnoitazitnauq tcudorP :srotcevbus m otni tilps rotceV wv  . .   .  tnauq yb yletarapes dezitnauq era srotcevbuS   w2   sub-blocks s11 w1 in  ner product  quantization for sub-block 1: htiw snaem-k yb denrael si centroids (r per sub-block) hcae erehw 1.Filling the look-up table: tcevbus 8 ni tilps rotcev mid-821 = y :elpmaxE look-up table can be precomputed and stored in a stnenopmoc 61 j=1 5y 4y 3y 2y T 1y wj qj (Φj ) wT Φ ≈ wT q(Φ) = v 652 5q 4q 3q Efficient approximate scoring 2q 1q sdiortnec y(5q )4y(4q )3y(3q ) 2 y( 2 q ) 1 y( 1 q stib 8 xedni noitazitnauq tib-46
  • 101.
    obhgien tseraen rofnoitazitnauq tcudorP :srotcevbus m otni tilps rotceV wv  . .   .  tnauq yb yletarapes dezitnauq era srotcevbuS   w2   sub-blocks uct s11 s12 prod w1 inner   quantization for sub-block 1: htiw snaem-k yb denrael si centroids (r per sub-block) hcae erehw 1.Filling the look-up table: tcevbus 8 ni tilps rotcev mid-821 = y :elpmaxE look-up table can be precomputed and stored in a stnenopmoc 61 j=1 5y 4y 3y 2y T 1y wj qj (Φj ) wT Φ ≈ wT q(Φ) = v 652 5q 4q 3q Efficient approximate scoring 2q 1q sdiortnec y(5q )4y(4q )3y(3q ) 2 y( 2 q ) 1 y( 1 q stib 8 xedni noitazitnauq tib-46
  • 102.
    obhgien tseraen rofnoitazitnauq tcudorP :srotcevbus m otni tilps rotceV wv  . .   .  tnauq yb yletarapes dezitnauq era srotcevbuS   w2   sub-blocks duct s11 s12 s13 ... ... ... ... ... ... s1r r pro i w1 nne  quantization for sub-block 1: htiw snaem-k yb denrael si centroids (r per sub-block) hcae erehw 1.Filling the look-up table: tcevbus 8 ni tilps rotcev mid-821 = y :elpmaxE look-up table can be precomputed and stored in a stnenopmoc 61 j=1 5y 4y 3y 2y T 1y wj qj (Φj ) wT Φ ≈ wT q(Φ) = v 652 5q 4q 3q Efficient approximate scoring 2q 1q sdiortnec y(5q )4y(4q )3y(3q ) 2 y( 2 q ) 1 y( 1 q stib 8 xedni noitazitnauq tib-46
  • 103.
    obhgien tseraen rofnoitazitnauq tcudorP :srotcevbus m otni tilps rotceV wv  . .   .  tnauq yb yletarapes dezitnauq era srotcevbuS   w2   s21 in sub-blocks ner prod uct w1 s11 s12 s13 ... ... ... ... ... ... s1r   quantization for sub-block 2: htiw snaem-k yb denrael si centroids (r per sub-block) hcae erehw 1.Filling the look-up table: tcevbus 8 ni tilps rotcev mid-821 = y :elpmaxE look-up table can be precomputed and stored in a stnenopmoc 61 j=1 5y 4y 3y 2y T 1y wj qj (Φj ) wT Φ ≈ wT q(Φ) = v 652 5q 4q 3q Efficient approximate scoring 2q 1q sdiortnec y(5q )4y(4q )3y(3q ) 2 y( 2 q ) 1 y( 1 q stib 8 xedni noitazitnauq tib-46
  • 104.
    xedni noitazitnauq tib-46 stib 8 ) 1 y( 1 q ) 2 y( 2 q )3y(3q )4y(4q y(5q Efficient approximate scoringsdiortnec 652 1q 2q 3q 4q 5q v wT Φ ≈ wT q(Φ) = wj qj (Φj ) T 1y 2y 3y 4y 5y j=1 stnenopmoc 61 can be precomputed and stored in a look-up table tcevbus 8 ni tilps rotcev mid-821 = y :elpmaxE 2.Score each quantized vector q(Φ) in the database using the look-up hcae erehw centroids (r per sub-block) htiw snaem-k yb denrael si table: s1r s11 s12 s13 ... ... ... ... ... ... sub-blocks s21 s22 s23 ... ... ... ... ... ... s2r w q(Φ) = w1 q1 (Φ1 ) + w2 q2 (Φ2 ) + . . . + wv qv... ... ) ... T T T T (Φv tnauq yb yletarapes dezitnauq era srotcevbuS... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... T q(Φ) = w1 q1 (Φ1 ) + w2 q2 (Φ2 ) + . . . + wv qv (Φv ) T T T ... ... ... :srotcevbus m otni tilps rotceV ... ... ... ... ... ... ... sv1 sv2 sv3 ... ... ... ... ... ... svr Only v additions per image! obhgien tseraen rof noitazitnauq tcudorP
  • 105.
    Choice of parameters ! [Rastegari et al., 2011] • Dimensionality is first reduced with PCA from D=2659 to D’ D • How do we choose D’, v (number of sub-blocks), r (number of centroids per sub-block)? • Effect of parameter choices on a database of 150K images: (v,r) 20 8 8 (128,2 ) (256,2 ) 6 (256,2 ) 6 (64,2 ) 15 Precision @ 10 (%) 6 8 (64,2 ) (32,2 ) (128,28) D’=512 10 8 (16,2 ) D’=256 8 6 (32,2 ) (64,2 ) D’=128 5 (32,28) 8 (16,2 ) 8 (16,2 ) 0 0 0.05 0.1 0.15 0.2 0.25 0.3 Search time per query (seconds)
  • 106.
    Performance evaluation on 150K images ICCV #1745 ICCV 2011 Submission #1745. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE. 432 25 • Performance averaged over 1000 the largely the other classes. To cope with 433 object classes and negative examples (n− of positive used as queries 434 • 50malize the loss term for each example in training examples per query 435 20 class its class. We evaluate the learned retrie of 436 • ILSVRC2010 test set, which includes 150 Database includes 150 images of Precision @ 10 (%) 437 the query class and 150K images Thus, the d 150 examples per category. 438 15 of n+ = 150 true positives and n− = other classes 439 • Prec@10 of a random Figure 1 shows precis test tors for each query. classifiers test 440 is 0.1% 10 time for AR and TkP in combination wit 441 AR L2−SVM fication models. Since AR does not use s 442 TkP L1−LR efficiency, we only paired it with the L2-S 443 5 approximate ranking retrieval time per qu x-axis shows average 444 TkP L2−SVM TkP FGM a single-core computer with 16GB of R 445 0 Core i7-930 CPU @ 2.80GHz. The y-axis 446 0 0.5 1 1.5 2 2.5 at 10 which measures the proportion of tru 447 Search time per query (seconds) top 10. The times reported for TkP wer 448 Figure 1. Class-retrieval precision versus search time for the k = 10. The curve for AR was generate 449 ILSVRC2010 data set: x-axis is search time; y-axis shows per- parameter choices for v and w, as discusse 450 centage of true positives ranked in the top 10 using a database − later. The performance curves for “TkP L 451 of 150,000 images (with n = 149, 850 distractors and
  • 107.
    Memory requirements for 10M images 9 Gbytes 8 6 memory usage 4 3 Gbytes 1.8 Gbytes 2 0 1 Inverted index 2 Incidence matrix 3 Product (used by TkP) quantization index
  • 108.
    Conclusions and openquestions Classemes: • Compact descriptor enabling efficient novel-class recognition (less than 200 bytes/image yet it produces performance similar to MKL at a tiny fraction of the cost) • Questions currently under investigation: - can we learn better classemes from fully-labeled data? - can we decouple the descriptor size from the number of classeme classes? - can we encode spatial information ([Li et al. NIPS10])? • Software for classeme extraction available at: http://vlg.cs.dartmouth.edu/projects/classemes_extractor/ Information retrieval approaches to large-scale object-class search: • sparse representations and retrieval models • top-k ranking • approximate scoring
  • 109.
    ICVSS 2011 StevenSeitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg Outline 1 ICVSS 2011 2 A Trillion Photos - Steven Seitz 3 Efficient Novel Class Recognition and Search - Lorenzo Torresani 4 The Life of Structured Learned Dictionaries - Guillermo Sapiro 5 Image Rearrangement Video Synopsis - Shmuel Peleg Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
  • 110.
    The Life ofStructured Learned Dictionaries Guillermo Sapiro University of Minnesota G.Yu and S. Mallat (Inverse problems via GMM) G.Yu and F. Leger (Matrix completion) G.Yu (Statistical compressed sensing) A. Castrodad (activity recognition in video) M. Zhou, D. Dunson, and L. Carin (video layers separation) 1 Friday, July 8, 2011
  • 111.
    Inverse Problems y = Uf + w w ∼ N (0, σ 2 Id) Inpainting Examples f U : masking Deblurring Zooming U : subsampling U: convolution 2 3 w ∼ N (0, σ Id) Friday, July 8, 2011
  • 112.
    Learned Overcomplete Dictionaries Dictionary learning • Dictionary learning 2 min fi − Dai + λai 1 D,{ai }1≤i≤I 1≤i≤I • Better performance than pre-fixed dictionaries. • Huge numbers of parameters to estimate. • Non-convex. • High computational complexity. • Behavior not well understood (results starting to appear). 11 Friday, July 8, 2011
  • 113.
    Sparse Inverse ProblemEstimation y = Uf + w where 2 w ∼ N (0, σ Id) • Sparse prior D = {φm }m∈Γ provides a sparse representation for f . f = Da + Λ with |Λ| |Γ| , Λ = support(a) and Λ 2 f 2 • Observation UD = {Uφm }m∈Γ provides a sparse representation for y . y = UDa + with |Λ| |Γ| , Λ = support(a) and 2 y2 Λ Λ • Sparse inverse problem estimation Sparse estimation of a from y 2 a = arg min UDa − y + λ a1 ˜ a Inverse problem estimation ˜ = Da f ˜ 12 Friday, July 8, 2011
  • 114.
    Structured Representation andEstimation D B1 B2 B3 B4 B5 Overcomplete dictionary Structured overcomplete dictionary • Dictionary: union of PCAs • Union of orthogonal bases D = {Bk }1≤k≤K • In each basis, the atoms are ordered: λk ≥ λk ≥ · · · ≥ λk 1 2 N • Piecewise linear estimation (PLE) • A linear estimator per basis • Non-linear basis selection: a best linear estimator is selected • Small degree of freedom, fast computation, state-of-the-art performance 16 Friday, July 8, 2011
  • 115.
    Gaussian Mixture Models y i = U i fi + w i where wi ∼ N (0, σ 2 Id) • Estimate {(µk , Σk )}1≤k≤K from {yi }1≤i≤I • Identify the Gaussian ki that generates fi , ∀i • Estimate ˜i from N (µki , Σki ) , ∀i f 18 Friday, July 8, 2011
  • 116.
    Structured Sparsity • PCA (Principal Component Analysis) T Σk = Bk S k Bk • Bk = {φk }1≤m≤N PCA basis, orthogonal. m • Sk = diag(λk , . . . , λk ) , λk ≥ λk ≥ · · · ≥ λk eigenvalues. 1 N 1 2 N • PCA transform ˜k = Bk ak fi ˜i • MAP with PCA ˜k = arg min Ui fi − yi 2 + σ 2 f T Σ−1 fi fi ˜ i k fi ⇔ N |ai [m]|2 ak = arg min Ui Bk ai − yi 2 + σ 2 ˜i ai m=1 λkm 22 Friday, July 8, 2011
  • 117.
    Structured Sparsity Sparse estimate v.s. Piecewise linear estimate |Γ| N |ai [m]|2 2 ai = arg min UDai − yi + λ ˜ |ai [m]| ak ˜i 2 = arg min Ui Bk ai − yi + σ 2 ai ai λkm m=1 m=1 D B1 B2 B3 B4 B5 • Linear collaborative filtering Full degree of freedom in each basis. in atom selection |Λ| |Γ| • Nonlinear basis selection, degree of freedom K. 23 Friday, July 8, 2011
  • 118.
    Initial Experiments: Evolution Clustering Clustering 1st iteration 2nd iteration 24 Friday, July 8, 2011
  • 119.
    Experiments: Inpainting Original 20% available MCA 24.18 dB ASR 21.84 dB [Elad, Starck, Querre, Donoho, 05] [Guleryuz, 06] KR 21.55 dB FOE 21.92 dB BP 25.54 dB PLE 27.65 dB [Takeda, Farsiu. Milanfar, 06] [Roth and Black, 09] [Zhou, Sapiro, Carin, 10] 26 Friday, July 8, 2011
  • 120.
    Experiments: Zooming (from a low-resolution input)
    • Bicubic 28.47 dB
    • SR 23.85 dB [Yang, Wright, Huang, Ma, 09]
    • SAI 30.32 dB [Zhang and Wu, 08]
    • PLE 30.64 dB
    29 Friday, July 8, 2011
  • 121.
    Experiments: Zooming + Deblurring (observation y = SUf: blur U followed by subsampling S)
    • Interpolated y: 29.40 dB
    • SR 28.93 dB [Yang, Wright, Huang, Ma, 09]
    • PLE 30.49 dB
    32 Friday, July 8, 2011
  • 122.
    Experiments: Denoising
    • Noisy input 22.10 dB
    • NLmeans 28.42 dB [Buades et al., 06]
    • FOE 25.62 dB [Roth and Black, 09]
    • BM3D 30.97 dB [Dabov et al., 07]
    • PLE 31.00 dB
    34 Friday, July 8, 2011
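    For reference, the dB figures in these comparisons are PSNR values; a standard definition (assuming 8-bit images with a peak value of 255) is:

        import numpy as np

        def psnr(reference, estimate, peak=255.0):
            """Peak signal-to-noise ratio in dB between two images."""
            err = np.asarray(reference, float) - np.asarray(estimate, float)
            return 10.0 * np.log10(peak**2 / np.mean(err**2))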
  • 123.
    Summary of this part
    • Gaussian mixture models and MAP-EM work well for image inverse problems.
    • Piecewise linear estimation, with a connection to structured sparsity.
    • Collaborative linear filtering.
    • Nonlinear best basis selection, small degree of freedom.
    • Faster computation than sparse estimation.
    • Results in the same ballpark as the state of the art.
    • Beyond images: recommender systems and audio (Sprechmann, Cancela)
    • Statistical compressed sensing
    38 Friday, July 8, 2011
  • 124.
    Modeling and Learning Human Activity. Alexey Castrodad(1,2) and Guillermo Sapiro(2). (1) NGA Basic and Applied Research; (2) University of Minnesota, ECE Department. castr103@umn.edu, guille@umn.edu
  • 125.
    Motivation
    • Problem: given volumes of video feed, detect activities of interest
      – Mostly done manually!
    • Solving this will:
      – Aid the operator: surveillance/security, gaming, psychological research
      – Sift through large amounts of data
    • Solution: fully/semi-automatic activity detection with minimum human interaction
      – Invariance to spatial transformations
      – Robust to occlusions, low resolution, noise
      – Fast and accurate
      – Simple, generic
    4
  • 126.
    Sparse modeling: Dictionary learning from data 7
  • 127.
    Sparse modeling for action classification: Phase 1
    • Training (input videos from Class 1, Class 2, Class 3): spatio-temporal features, then sparse modeling, yielding per-class dictionaries D1, D2, D3, concatenated into D
    • Classification: new video, feature extraction, sparse coding (coefficients A1, A2, A3), l1 pooling, classifier output
    9
  • 128.
    Sparse modeling for action classification: Phase 2
    • Training: the phase-1 sparse modeling gives D1, D2, D3 (D); training videos go through feature extraction and sparse coding, and inter-class modeling yields dictionaries E1, E2, E3
    • Classification: a new video goes through feature extraction, sparse coding from phase 1 (A1, A2, A3), sparse coding against the E dictionaries, l1 pooling, classifier output
    10
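    As a hedged, toy version of the phase-1 decision rule (feature extraction and the learned dictionaries are stubbed with random placeholders, and scikit-learn's Lasso stands in for the sparse coder), descriptors are coded against the concatenated class dictionaries and the per-class coefficient blocks are l1-pooled:

        import numpy as np
        from sklearn.linear_model import Lasso

        rng = np.random.default_rng(2)
        d, atoms_per_class, K = 100, 30, 3
        D_blocks = [rng.standard_normal((d, atoms_per_class)) for _ in range(K)]
        D = np.hstack(D_blocks)                  # D = [D1 | D2 | D3]
        features = rng.standard_normal((d, 50))  # 50 descriptors from a new video

        pooled = np.zeros(K)
        coder = Lasso(alpha=0.1, max_iter=2000)
        for j in range(features.shape[1]):
            coder.fit(D, features[:, j])         # sparse-code one descriptor
            a = coder.coef_
            for k in range(K):                   # l1 pooling within each class block
                block = a[k * atoms_per_class:(k + 1) * atoms_per_class]
                pooled[k] += np.abs(block).sum()
        print("predicted class:", int(np.argmax(pooled)))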
  • 129.
    Results
    • YouTube Action Dataset
      – variable spatial resolution videos, 3-8 seconds each
      – 11 types of actions from YouTube videos
      Scene: indoors/outdoors
      Actions: basketball shooting, cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, walking with a dog
      Camera: jitter, scale variations, camera motion, variable illumination conditions, high background clutter
      Resolution: variable, resampled to 320 x 240
      Frame Rate: 25 fps
    18
  • 130.
    Results: YouTube Action Dataset
    • Best/recent reported: 75.8% (Q. V. Le et al., 2011); 84.2% (Wang et al., 2011)
    • Recognition rate: 80.29% (phase 1) and 91.9% (phase 2)
    20
  • 131.
    Conclusion
    • Main contribution:
      – Robust activity recognition framework based on sparse modeling
      – Generic: works on multiple data sources
      – State-of-the-art results in all of them, same parameters
    • Key advantages:
      – Simplicity, state-of-the-art results
      – Fast and accurate: 7.5 fps
      – 7 frames needed for detection
    • Future directions:
      – Exploit human interactions
      – Infer the actions
      – Foreground extraction/video analysis for activity clustering
    21
  • 132.
    ICVSS 2011 Steven Seitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg Outline 1 ICVSS 2011 2 A Trillion Photos - Steven Seitz 3 Efficient Novel Class Recognition and Search - Lorenzo Torresani 4 The Life of Structured Learned Dictionaries - Guillermo Sapiro 5 Image Rearrangement & Video Synopsis - Shmuel Peleg Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
  • 133.
    Shift-Map Image Editing Yael Pritch Eitam Kav-Venaki Shmuel Peleg The Hebrew University of Jerusalem
  • 135.
    Geometrical Image Editing: Retargeting
    Retargeting (Avidan and Shamir, SIGGRAPH'07; Wolf et al., ICCV'07; Wang et al., SIGASIA'08; Rubinstein et al., SIGGRAPH'08; Rubinstein et al., SIGGRAPH'09)
    Input / Shift-Map Output
  • 136.
    Geometrical Image Editing: Inpainting
    Inpainting (Criminisi et al., CVPR'03; Wexler et al., CVPR'04; Sun et al., SIGGRAPH'05; Komodakis et al., CVPR'06; Hays and Efros, SIGGRAPH'07)
    Mask / Input
  • 137.
    Geometrical Image Editing: Inpainting
    Inpainting (Criminisi et al., CVPR'03; Wexler et al., CVPR'04; Sun et al., SIGGRAPH'05; Komodakis et al., CVPR'06; Hays and Efros, SIGGRAPH'07)
    Mask / Output
  • 138.
    Shift-Map Composition A B C D User Constraints
  • 139.
    Shift-Map Composition A B C D User Constraints A
  • 140.
    Shift-Map Composition A B C D User Constraints A B
  • 141.
    Shift-Map Composition A B C D User Constraints A C B
  • 142.
    Shift-Map Composition A B C D User Constraints No accurate segmentation required A C B D
  • 143.
    Shift-Map Composition A B C D User Constraints No accurate segmentation required
  • 144.
    Our Approach: Shift-Map
    • Shift-maps represent a mapping of each pixel in the output image R(u,v) into the input image I(x,y)
    • The color of each output pixel is copied from the corresponding input pixel
  • 145.
    Our Approach: Shift-Map
    • Output R(u,v), input I(x,y): output pixel (u,v) takes its color from input pixel (x,y) = (u,v) + shift
    • We use relative mapping coordinates (as in optical flow)
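    A minimal sketch of reading the output image out of such a shift-map (the (row, column) coordinate convention is our own assumption):

        import numpy as np

        def apply_shift_map(I, ty, tx):
            """Compose R(u,v) = I(u + ty(u,v), v + tx(u,v)), clamped to the input."""
            h, w = ty.shape
            R = np.zeros((h, w) + I.shape[2:], dtype=I.dtype)
            for u in range(h):
                for v in range(w):
                    R[u, v] = I[np.clip(u + ty[u, v], 0, I.shape[0] - 1),
                                np.clip(v + tx[u, v], 0, I.shape[1] - 1)]
            return R

        # Example: a uniform horizontal shift of 100 pixels pans a 320-wide
        # input into a 200-wide output.
        I = np.random.rand(240, 320, 3)
        R = apply_shift_map(I, np.zeros((240, 200), int), np.full((240, 200), 100))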
  • 146.
    Our Approach: Shift-Map
    (illustration: input and output images with the horizontal and vertical shift maps; example shifts Tx = 0, Tx = 50, Tx = 400, Ty = 10)
    • Minimal distortion
    • Adaptive boundaries
    • Fast optimization
  • 147.
    Geometric Editing as an Energy Minimization
    • We look for the optimal mapping; this can be described as an energy minimization problem
      – Data term: external editing requirement, computed for each pixel
      – Smoothness term: avoid stitching artifacts, computed for each pair of neighboring pixels
    • Unified representation for geometric editing applications
    • Solved using a graph labeling algorithm
  • 148.
    The Smoothness Term
    • R is the output image, I the input image; a discontinuity in the shift-map between neighboring output pixels p and q creates a seam
    • For p and for q, the term compares color and gradient across the seam (Kwatra et al. 03, Agarwala et al. 04)
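    A hedged sketch of this seam cost (colors only; the gradient comparison is omitted for brevity, and in-bounds shifts are assumed):

        import numpy as np

        def smoothness_term(I, p, q, shift_p, shift_q):
            """Cost of neighboring output pixels p, q taking shifts shift_p, shift_q."""
            if shift_p == shift_q:
                return 0.0                       # identical shifts: no seam, no cost
            def src(pt, sh):                     # input pixel feeding output pixel pt
                return (pt[0] + sh[0], pt[1] + sh[1])
            # Disagreement of each location's color when read through the two shifts.
            c = np.sum((I[src(q, shift_p)].astype(float) - I[src(q, shift_q)].astype(float)) ** 2)
            c += np.sum((I[src(p, shift_p)].astype(float) - I[src(p, shift_q)].astype(float)) ** 2)
            return float(c)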
  • 149.
    The Data Term: Inpainting
    • The data term varies between different applications
    • The inpainting data term uses a mask D(x,y) over the input image:
      – D(x,y) = ∞ for pixels to be removed
      – D(x,y) = 0 elsewhere
    • Specific input pixels can be forced not to be included in the output image by setting D(x,y) = ∞
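    In code, this data term is a simple lookup (a minimal sketch, with a large finite constant standing in for the infinite cost):

        import numpy as np

        INF = 1e9                                # stand-in for infinite cost

        def data_term_inpainting(remove_mask, p, shift):
            """Cost of output pixel p taking `shift`: forbid sources marked for removal."""
            x, y = p[0] + shift[0], p[1] + shift[1]
            h, w = remove_mask.shape
            if not (0 <= x < h and 0 <= y < w):
                return INF                       # shift points outside the input image
            return INF if remove_mask[x, y] else 0.0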
  • 150.
    The Data Term: Rearrangement
    • Input pixels can be forced to appear in a new location
    • The appropriate shift gets infinitely low energy; other shifts get infinitely high energy
  • 151.
    The Data Term: Retargeting
    • Use the picture borders
    • Can incorporate an importance mask
      – An order constraint on the mapping is applied to prevent duplication of important areas
  • 152.
    Shift-Map as Graph Labeling
    • The minimal-energy mapping can be represented as a graph labeling, where the shift-map value (tx, ty) is the selected label for each output pixel
    • Nodes: output image pixels; labels: relative shifts into the input image; the shift-map assigns a label to each pixel
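    The paper solves this labeling with graph cuts; as a self-contained stand-in (a much weaker approximate minimizer, our own substitution), iterated conditional modes over grids of unary and pairwise costs looks like this:

        import numpy as np

        def icm(unary, pairwise, iters=5):
            """unary: (h, w, L) per-pixel label costs; pairwise: (L, L) seam costs."""
            h, w, L = unary.shape
            labels = unary.argmin(axis=2)        # independent initialization
            for _ in range(iters):
                for i in range(h):
                    for j in range(w):
                        cost = unary[i, j].copy()
                        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                            ni, nj = i + di, j + dj
                            if 0 <= ni < h and 0 <= nj < w:
                                cost += pairwise[:, labels[ni, nj]]
                        labels[i, j] = cost.argmin()   # greedy local update
            return labels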
  • 153.
    Hierarchical Solution
    • Gaussian pyramid: the shift-map is computed at a coarse level and refined at finer levels (illustration: output on input, shift-maps across pyramid levels)
  • 154.
    Results and Comparison
    Image completion with structure propagation [Sun et al., SIGGRAPH'05]
    • Shift-Map handles, without additional user interaction, some cases that other algorithms suggested can only be handled with additional user guidance
    J. Sun, L. Yuan, J. Jia, and H. Shum. Image completion with structure propagation. In SIGGRAPH'05
  • 155.
  • 156.
    Results and Comparison
    Non-Homogeneous [Wolf et al., ICCV'07], Improved Seam Carving [Rubinstein et al., SIGGRAPH'08], PatchMatch [Barnes et al., SIGGRAPH'09], Shift-Maps
  • 157.
    Summary
    • New representation of geometrical editing applications as an optimal graph labeling
    • Unified approach
    • Solved efficiently using hierarchical approximations
    • Minimal user interaction required for various editing tasks
  • 158.
    Similarity Guided Composition
    • Build an output image R from pixels taken from a source image I such that R is most similar to a target image T (illustration: source image, target image, output)
  • 159.
    Similarity Guided Composition
    • The data term reflects the similarity between the output image R and a target image T
    • The similarity uses both colors and gradients
  • 160.
    Similarity Guided Composition
    • The data term indicates the similarity of the output image to the target image
    • The weight between similarity and smoothness has the following effect (illustration: source image, resulting output, target image)
    Previous work: Efros and Freeman 2001, Hertzmann et al. 2001
  • 161.
    Edge Preserving Magnification
    • Using the original image as the source, similarity guided composition can magnify (illustration: source, result, target from bilinear magnification)
    • Does not work for gradual color changes
  • 162.
    Edge Preserving Magnification
    • The original image can be the source for edge areas; otherwise the magnified image is the source (illustration: original, magnified target, edge map, source 1, source 2)
  • 163.
  • 164.
    The Bidirectional Similarity [Simakov, Caspi, Shechtman, Irani, CVPR 2008]
    • Completeness (source ⊆ target): all source patches (at multiple scales) should be in the target; easy to compose (recover) the source from the target
    • Coherence (source ⊇ target): all target patches (at multiple scales) should be in the source; easy to compose (recover) the target from the source
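    A hedged single-scale sketch of the measure (the original uses multiple scales and efficient nearest-neighbor search; brute force suffices for tiny grayscale images):

        import numpy as np

        def patches(img, k=5):
            h, w = img.shape
            return np.array([img[i:i + k, j:j + k].ravel()
                             for i in range(h - k + 1) for j in range(w - k + 1)])

        def bidirectional_similarity(source, target, k=5):
            P, Q = patches(source, k), patches(target, k)
            d = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)  # all patch distances
            completeness = d.min(axis=1).mean()  # every source patch found in target
            coherence = d.min(axis=0).mean()     # every target patch found in source
            return completeness + coherence

        print(bidirectional_similarity(np.random.rand(12, 12), np.random.rand(12, 12)))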
  • 165.
    Shift-Map Retargeting with Feedback
    • Shift-Map retargeting maximizes the coherence
    • It will be hard to reconstruct back the fish
  • 166.
    Shift-Map Retargeting with Feedback
    • Increase the appearance data term E_A|B of input regions with a high composition score and recompute the output B
    • Pixels with the higher appearance term will now appear in the output and increase the completeness
  • 167.
    (Illustration: original image, appearance term E_A|B, retargeted result, and reconstruction of the original)
  • 168.
    Shift-Map Retargeting with Feedback (illustration: original, Shift-Map, feedback)
  • 169.
    Video Synopsis and Indexing: Making a Long Video Short
    • 11 million cameras in 2008
    • Expected 30 million in 2013
    • Recording 24 hours a day, every day
  • 170.
    Video Synopsis: Shift Objects in Time
    The input video I(x,y,t) is condensed into a synopsis video S(x,y,t) by shifting objects along the time axis t
  • 171.
    Steps in Video Synopsis
    • Detect and track objects, store them in a database
    • Select relevant objects from the database
    • Display the selected objects in a very short “Video Synopsis”
    • In the “Video Synopsis”, objects from different times can appear simultaneously (a toy sketch of this temporal-shift step follows below)
    • Index from the selected objects into the original video
    • Cluster similar objects
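    The temporal-shift step can be pictured with a toy greedy scheduler (our own stand-in for the collision-cost minimization used by the actual system): object tracks ("tubes") are packed into a short timeline while keeping per-frame crowding low.

        import numpy as np

        def greedy_synopsis(durations, synopsis_len):
            """Greedily pick a start frame for each object tube, longest first.

            Returns (duration, start) pairs in placement order.
            """
            timeline = np.zeros(synopsis_len)            # crowding per output frame
            placed = []
            for dur in sorted(durations, reverse=True):  # place long tubes first
                costs = [timeline[s:s + dur].sum()
                         for s in range(synopsis_len - dur + 1)]
                s = int(np.argmin(costs))                # least-crowded slot
                timeline[s:s + dur] += 1
                placed.append((dur, s))
            return placed

        # Four objects from a long video packed into a 60-frame synopsis:
        print(greedy_synopsis([40, 25, 25, 10], synopsis_len=60))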
  • 172.
    Two Clusters (camera in St. Petersburg): cars and people
    • Detect specific events
    • Discover activity patterns
  • 173.
    ICVSS 2011 Presentations
    168.176.61.22/comp/buzones/PROCEEDINGS/ICVSS2011
    Jiri Matas - Tracking, Learning, Detection, Modeling
    Ivan Laptev - Human Action Recognition
    Josef Sivic - Large Scale Visual Search
    Andrew Fitzgibbon - Computer Vision: Truth and Beauty (Kinect)
    Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
  • 174.
    The end... Thanks ! Angel Cruz-Roa aacruzr@unal.edu.co Andrea Rueda-Olarte adruedao@unal.edu.co Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations