ICVSS2011 Selected Presentations

This presentation shares experiences and selected presentations from the International Computer Vision Summer School (ICVSS 2011), attended by Angel Cruz and Andrea Rueda from the Bioingenium Research Group of Universidad Nacional de Colombia.

  1. 1. ICVSS 2011: Selected Presentations. Angel Cruz and Andrea Rueda, BioIngenium Research Group, Universidad Nacional de Colombia. August 25, 2011
  2. 2. Outline: 1 ICVSS 2011; 2 A Trillion Photos - Steven Seitz; 3 Efficient Novel Class Recognition and Search - Lorenzo Torresani; 4 The Life of Structured Learned Dictionaries - Guillermo Sapiro; 5 Image Rearrangement & Video Synopsis - Shmuel Peleg
  4. 4. ICVSS 2011: International Computer Vision Summer School. 15 speakers, from USA, France, UK, Italy, Prague and Israel
  5. 5. ICVSS 2011: International Computer Vision Summer School. [Photo slides 5–6.]
  8. 8. A Trillion Photos. Steve Seitz, University of Washington / Google. Sicily Computer Vision Summer School, July 11, 2011
  9. 9. Facebook: >3 billion photos uploaded each month; ~1 trillion photos taken each year
  10. 10. What do you do with a trillion photos? Digital Shoebox (hard drives, iPhoto, Facebook...)
  11. 11. ?
  12. 12. Comparing images Detect features using SIFT [Lowe, IJCV 2004]
  13. 13. Comparing images. Extraordinarily robust image matching: across viewpoint (~60 degree out-of-plane rotations); varying illumination; real-time implementations
  14. 14. Edges
  15. 15. Scale Invariant Feature Transform. [Figure: angle histogram over 0 to 2π.] Adapted from slide by David Lowe
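The angle histogram on this slide is the core of SIFT's orientation assignment. A minimal sketch of that one step (not from the slides; numpy only, with the 36-bin resolution and magnitude weighting as assumptions):

```python
import numpy as np

def orientation_histogram(patch, n_bins=36):
    """Magnitude-weighted histogram of gradient angles over [0, 2*pi),
    as used in SIFT orientation assignment. `patch` is a 2D gray image."""
    gy, gx = np.gradient(patch.astype(float))
    angle = np.arctan2(gy, gx) % (2 * np.pi)   # gradient direction in [0, 2*pi)
    magnitude = np.hypot(gx, gy)
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, 2 * np.pi),
                           weights=magnitude)
    return hist   # the dominant bin gives the keypoint's canonical orientation
```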
  16. 16. NASA Mars Rover images
  17. 17. NASA Mars Rover images with SIFT feature matches. Figure by Noah Snavely
  18. 18. [Photos of Rome landmarks: Coliseum (outside), St. Peters (inside), Il Vittoriano, Trevi Fountain, Forum]
  19. 19. Structure from motion: matched photos → 3D structure
  20. 20. Structure from motion, aka "bundle adjustment" (texts: Zisserman; Faugeras): minimize $f(R, T, P)$ over camera rotations $R$, translations $T$, and 3D points $P$. [Figure: three cameras with poses $(R_i, t_i)$ observing scene points $p_1, \ldots, p_7$.]
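A minimal sketch (not from the talk) of the objective $f(R, T, P)$ being minimized: the sum of squared reprojection errors over all observations. The pinhole model, a shared known intrinsic matrix K, and all names are assumptions; in practice the residuals would be fed to a sparse non-linear least-squares solver such as scipy.optimize.least_squares.

```python
import numpy as np

def reprojection_residuals(rotations, translations, points3d, observations, K):
    """Residuals whose squared sum is f(R, T, P).

    rotations[c], translations[c]: pose (3x3 R, 3-vector t) of camera c
    points3d[p]: 3-vector of scene point p
    observations: list of (c, p, uv) with uv the observed 2D pixel location
    K: 3x3 camera intrinsic matrix (assumed shared and known)
    """
    residuals = []
    for c, p, uv in observations:
        x_cam = rotations[c] @ points3d[p] + translations[c]     # world -> camera
        u, v, depth = K @ x_cam
        residuals.append(np.array([u / depth, v / depth]) - uv)  # projected - observed
    return np.concatenate(residuals)

# bundle adjustment = least-squares minimization of these residuals,
# jointly over all rotations, translations, and 3D points
```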
  21. 21. ?
  22. 22. Reconstructing Rome in a day... from ~1M images, using ~1000 cores. Sameer Agarwal, Noah Snavely, Rick Szeliski, Steve Seitz. http://grail.cs.washington.edu/rome
  23. 23. Rome 150K: Colosseum
  24. 24. Rome: St. Peters
  25. 25. Venice (250K images)
  26. 26. Venice: Canal
  27. 27. Dubrovnik
  28. 28. From Sparse to Dense Sparse output from the SfM system
  29. 29. From Sparse to Dense Furukawa, Curless, Seitz, Szeliski, CVPR 2010
  30. 30. Most of our photos don’t look like this
  31. 31. recognition + alignment
  32. 32. Your Life in 30 Seconds path optimization
  33. 33. Picasa Integration: as "Face Movies" feature in v3.8 – Rahul Garg, Ira Kemelmacher
  34. 34. Conclusion: trillions of photos + computer vision breakthroughs = new ways to see the world
  36. 36. Efficient Novel-Class Recognition and Search. Lorenzo Torresani
  37. 37. Problem statement: novel object-class search. Given: an image database (e.g., 1 million photos) with no text/tags available, plus user-provided images of an object class (the query images may represent a novel class). Want: the database images of this class.
  38. 38. Application: Web-powered visual search in unlabeled personal photos. Goal: find "soccer camp" pictures on my computer. (1) Search the Web for images of "soccer camp"; (2) find images of this visual class on my computer.
  39. 39. Application: product search• Search of aesthetic products
  40. 40. Relation to other tasks: novel class search vs. image retrieval vs. object categorization. Analogies with image retrieval: large databases, efficient indexing, compact representation. Differences: image retrieval uses simple notions of visual relevancy (e.g., near-duplicate, same object instance, same spatial layout). [Figure residue: example retrievals from Nister and Stewenius '07, Philbin et al. '07, and Torralba et al. '08 (compact RBM codes with predicted scene labels).]
  41. 41. Relation to other tasks: novel class search vs. image retrieval and object classification. Analogies with image retrieval: large databases, efficient indexing, compact representation; difference: retrieval uses simple notions of visual relevancy (e.g., near-duplicate, same object instance, same spatial layout). Analogy with object classification: recognition of object classes from a few examples; differences: in classification the classes to recognize are defined a priori, training and recognition time is unimportant, and storage of features is not an issue.
  42. 42. Technical requirements of novel-class search: the object classifier must be learned on the fly from few examples; recognition in the database must have low computational cost; image descriptors must be compact to allow storage in memory.
  43. 43. State-of-the-art in object classification. Winning recipe: many features + non-linear classifiers (e.g. [Gehler and Nowozin, CVPR'09]). [Figure: non-linear decision boundary produced by a multiple-kernel combination of features.]
  44. 44. Model evaluation on Caltech256. [Plot: accuracy (%) vs. number of training examples for linear models on individual features: gist, phog, phog2pi, ssim, bow5000.]
  45. 45. Model evaluation on Caltech256. [Same plot, adding a linear combination of the features, which outperforms each individual feature.]
  46. 46. Model evaluation on Caltech256. [Same plot, adding a non-linear combination of features (the LP-β kernel combiner of Gehler and Nowozin), the best performer.]
  47. 47. Multiple kernel combiners. Classification output is obtained by combining many features via non-linear kernels: $h(x) = \sum_{f=1}^{F} \beta_f \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_n + b$ (outer sum over features, inner sum over training examples).
  48. 48. Methods: multiple kernel learning (MKL) [Bach et al., 2004; Sonnenburg et al., 2006; Varma and Ray, 2007]. MKL performs kernel selection during the training phase by learning a non-linear SVM with a linear combination of kernels, $k^*(x, x') = \sum_{f=1}^{F} \beta_f k_f(x, x')$, jointly optimizing the combination weights $\beta$ (with $\beta_f \ge 0$ and $\sum_f \beta_f = 1$, which yields sparse, interpretable coefficients) and the SVM parameters $\alpha \in \mathbb{R}^N$, $b \in \mathbb{R}$.
  49. 49. LP-β: a two-stage approach to MKL [Gehler and Nowozin, 2009]. Classification output of traditional MKL: $h_{MKL}(x) = \sum_{f=1}^{F} \beta_f \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_n + b$. Classification function of LP-β: $h(x) = \sum_{f=1}^{F} \beta_f h_f(x)$ with $h_f(x) = \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_{fn} + b_f$. Two-stage training procedure: 1. train each $h_f(x)$ independently (traditional SVM learning); 2. optimize over $\beta$ (a simple linear program). A code sketch follows below.
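A minimal sketch of the two-stage procedure (assuming scikit-learn and scipy; the exact linear program of Gehler and Nowozin is replaced here by a simplified hinge-loss LP over β ≥ 0, Σβ = 1, so this is an illustration, not the paper's formulation):

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import SVC

def lp_beta_train(kernels, y):
    """Stage 1: one kernel SVM per feature. Stage 2: mixing weights beta by LP.
    kernels: list of F precomputed N x N kernel matrices; y: labels in {-1,+1}."""
    y = np.asarray(y, dtype=float)
    svms, H = [], []
    for K in kernels:                                 # stage 1: independent SVMs
        clf = SVC(kernel="precomputed").fit(K, y)
        svms.append(clf)
        H.append(clf.decision_function(K))            # h_f(x_n) on training data
    H = np.stack(H, axis=1)                           # N x F matrix of scores
    N, F = H.shape
    # stage 2 (simplified LP): min sum(xi)
    #   s.t. y_n * sum_f beta_f h_f(x_n) >= 1 - xi_n,  beta >= 0,  sum(beta) = 1
    c = np.concatenate([np.zeros(F), np.ones(N)])     # variables: [beta, xi]
    A_ub = np.hstack([-y[:, None] * H, -np.eye(N)])
    res = linprog(c, A_ub=A_ub, b_ub=-np.ones(N),
                  A_eq=np.concatenate([np.ones(F), np.zeros(N)])[None, :],
                  b_eq=[1.0], bounds=[(0, None)] * (F + N))
    return svms, res.x[:F]                            # per-feature SVMs and beta
```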
  50. 50. LP-β for novel-class search? The LP-β classifier $h(x) = \sum_{f=1}^{F} \beta_f \left( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_{fn} + b_f \right)$ (sum over features and over training examples) is unsuitable for our needs due to: large storage requirements (typically over 20K bytes/image); costly evaluation (requires query-time kernel distance computation for each test image); costly training (1+ minute for O(10) training examples).
  51. 51. Classemes: a compact descriptor for efficient recognition [Torresani et al., 2010]. Key idea: represent each image $x$ in terms of its "closeness" to a set of $C$ basis classes ("classemes"): $\Phi(x) = [\phi_1(x), \ldots, \phi_C(x)]^T$, where $\phi_c(x) = h_{classeme_c}(x) = \sum_{f=1}^{F} \beta_f^c \sum_{n=1}^{N} k_f(x, x_n^c)\,\alpha_n^c + b^c$ is the output of a pre-learned LP-β classifier for the c-th basis class. Query-time learning: train a linear classifier on $\Phi(x)$ from the training examples $\Phi(x_1), \ldots, \Phi(x_N)$ of the novel class, e.g. $g_{duck}(\Phi(x); w^{duck}) = \Phi(x)^T w^{duck} = \sum_{c=1}^{C} w_c^{duck}\,\phi_c(x)$. The LP-β classemes are trained before the creation of the database; only the linear $w$ is trained at query time (sketched below).
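Since the database stores Φ(x) precomputed, query-time learning reduces to fitting one linear model. A minimal sketch (scikit-learn assumed; names and the choice of negatives are mine):

```python
import numpy as np
from sklearn.svm import LinearSVC

def learn_novel_class(pos_classemes, neg_classemes):
    """Query-time step: fit w for a novel class from a few classeme vectors.
    pos_classemes: K x C matrix of Phi(x) for the user-provided query examples
    neg_classemes: M x C matrix for generic negatives (any non-class images)"""
    X = np.vstack([pos_classemes, neg_classemes])
    y = np.r_[np.ones(len(pos_classemes)), -np.ones(len(neg_classemes))]
    clf = LinearSVC(C=1.0).fit(X, y)
    return clf.coef_.ravel(), clf.intercept_[0]

# ranking the whole database is then one matrix-vector product:
# scores = database_classemes @ w + b;  top_hits = np.argsort(-scores)[:25]
```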
  52. 52. How this works... [Table from "Efficient Object Category Recognition Using Classemes": for a selection of Caltech 256 categories, the five classemes with the highest LP-β weights in the retrieval experiment.] Highly semantic labels are not required: classeme classifiers are used just to create a useful feature vector, not to assign semantic labels, and the detectors may respond to specific patterns of texture, color, shape, etc. The somewhat peculiar classeme labels reflect the ontology used as a source of base categories. Large-scale recognition benefits from a compact descriptor for each image, for example allowing databases to be stored in memory rather than on disk.
  53. 53. Related work: attribute-based recognition [Lampert et al., CVPR'09, "Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer" (with Nickisch and Harmeling); Farhadi et al., CVPR'09]. Knowledge transfer via high-level attributes (e.g., for "polar bear": black: no, white: yes, brown: no, stripes: no, water: yes, eats fish: yes) allows detecting object classes without any training examples, based on which attribute description a test image fits best. However, the description requires hand-specified attribute-class associations, and the attribute classifiers must be trained with human-labeled examples.
  54. 54. Method overview. 1. Classeme learning: learn classifiers $\phi_{\text{body of water}}(x)$, ..., $\phi_{\text{walking}}(x)$. 2. Using the classemes for recognition and retrieval: from training examples $\Phi(x_1), \ldots, \Phi(x_N)$ of a novel class, learn $g_{duck}(\Phi(x)) = \sum_{c=1}^{C} w_c^{duck}\,\phi_c(x)$.
  55. 55. Classeme learning: choosing the basis classes. Classeme labels desiderata: must be visual concepts; should span the entire space of visual classes. Our selection: concepts defined in the Large Scale Ontology for Multimedia [LSCOM] to be "useful, observable and feasible for automatic detection": 2659 classeme labels, after manual elimination of plurals, near-duplicates, and inappropriate concepts.
  56. 56. Classeme learning: gathering the training data. We downloaded the top 150 images returned by Bing Images for each classeme label. For each of the 2659 classemes, a one-versus-the-rest training set was formed to learn a binary classifier $\phi_{\text{walking}}(x) \rightarrow$ yes/no.
  57. 57. Classeme learning: training the classifiers. Each classeme classifier is an LP-β kernel combiner [Gehler and Nowozin, 2009]: $\phi(x) = \sum_{f=1}^{F} \beta_f \left( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_{f,n} + b_f \right)$, a linear combination of feature-specific SVMs. We use 13 kernels based on spatial pyramid histograms computed from the following features: color GIST [Oliva and Torralba, 2001]; oriented gradients [Dalal and Triggs, 2005]; self-similarity descriptors [Shechtman and Irani, 2007]; SIFT [Lowe, 2004].
  58. 58. A dimensionality reduction view of classemes: $\Phi$ maps an image $x$, described by GIST, self-similarity, oriented-gradient, and SIFT features, to $[\phi_1(x), \ldots, \phi_{2659}(x)]^T$. Raw features: non-linear kernels are needed for good classification; 23K bytes/image. Classemes: near state-of-the-art accuracy with linear classifiers; can be quantized down to 200 bytes/image with almost no recognition loss.
  59. 59. Experiment 1: multiclass recognition on Caltech256. [Plot: accuracy (%) vs. number of training examples.] Compared: LP-β as in [Gehler and Nowozin, 2009] using 39 kernels; LP-β with our 13 kernels on raw features x; our approach, a linear SVM on the classemes $\Phi(x)$; a linear SVM on binarized classemes, i.e. $(\Phi(x) > 0)$; and a linear SVM on raw x.
  60. 60. Computational cost comparison. [Bar charts:] training time 23 hours (LP-β) vs. 9 minutes (linear SVM on classemes); testing time (ms) is likewise far lower for the classeme SVM.
  61. 61. Accuracy vs. compactness. [Plot: compactness (images per MB, log scale) vs. accuracy (%), with lines linking performance at 15 and 30 training examples. Methods: LPbeta13, Csvm, Cq1svm, Xsvm, nbnn [Boiman et al., 2008], emk [Bo and Sminchisescu, 2008], with descriptor sizes annotated at 188 bytes/image, 2.5K bytes/image, 23K bytes/image, and 128K bytes/image.]
  62. 62. Experiment 2: object class retrieval. [Plot from "Efficient Object Category Recognition Using Classemes", Fig. 4: precision (%) at 25 vs. number of training images, for Csvm, Cq1Rocchio (β=1, γ=0), Cq1Rocchio (β=0.75, γ=0.15), Bowsvm, BowRocchio (β=1, γ=0), BowRocchio (β=0.75, γ=0.15).] Percentage of the top 25 in a 6400-document set which match the query class; random performance is 0.4%. Training Csvm takes 0.6 sec with 5*256 training examples.
  63. 63. Analogies with text retrieval. Classeme representation of an image: presence/absence of visual attributes. Bag-of-words representation of a text document: presence/absence of words.
  64. 64. Related work. Prior work (e.g., [Sivic & Zisserman, 2003; Nister & Stewenius, 2006; Philbin et al., 2007]) has exploited a similar analogy for object-instance retrieval by representing images as bags of visual words: detect interest patches, compute SIFT descriptors [Lowe, 2004], quantize the descriptors into codewords, and represent each image as a sparse histogram of visual-word frequencies. To extend this methodology to object-class retrieval we need: a representation more suited to object-class recognition (e.g. classemes as opposed to bags of visual words); and to train the ranking/retrieval function for every new query class.
  65. 65. Data structures for efficient retrieval. [Figure: an incidence matrix of 10 documents I0–I9 over 8 binary features f0–f7, and the corresponding inverted index listing, for each feature, the documents that contain it.] The inverted index enables efficient calculation of $w^T \Phi$ for all $\Phi$, as $\sum_{i\,:\,\Phi_i \neq 0} w_i \Phi_i$, and is very compact: only one bit per feature entry.
  66. 66. Efficient retrieval via inverted index (slides 66–72 step through an example with w = [1.5, -2, 0, -5, 0, 3, -2, 0] over features f0–f7). Goal: compute the score $w^T \Phi$ for all binary vectors $\Phi$ in the database. For each feature $f_i$ with non-zero weight, walk its inverted list and add $w_i$ to the accumulated score of every image on the list. The cost of scoring is linear in the sum of the lengths of the inverted lists associated with non-zero weights (a code sketch follows).
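A minimal sketch of the structure and the scoring pass just described (plain numpy; names are mine):

```python
import numpy as np
from collections import defaultdict

def build_inverted_index(binary_vectors):
    """For each feature i, the list of image ids whose bit phi_i is set.
    binary_vectors: N x D array of 0/1 entries (one bit per feature entry)."""
    index = defaultdict(list)
    for n, phi in enumerate(binary_vectors):
        for i in np.flatnonzero(phi):
            index[i].append(n)
    return index

def score_all(index, w, n_images):
    """w . Phi for every image, touching only the inverted lists of non-zero
    weights: cost is the sum of those list lengths."""
    scores = np.zeros(n_images)
    for i in np.flatnonzero(w):
        scores[index[i]] += w[i]       # add w_i to every image on list i
    return scores
```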
  73. 73. Improve efficiency via sparse weight vectors. Key idea: force w to contain as many zeros as possible. Learning objective: $E(w) = R(w) + \frac{C}{N} \sum_{n=1}^{N} L(w; \Phi_n, y_n)$, a regularizer plus a loss over classeme vectors $\Phi_n$ with labels $y_n$. L2-SVM: $R(w) = w^T w$, $L(w; \Phi_n, y_n) = \max(0, 1 - y_n(w^T \Phi_n))$. Since $|w_i| \gg w_i^2$ for small $w_i$ and $|w_i| \ll w_i^2$ for large $w_i$, choosing $R(w) = \sum_i |w_i|$ will tend to produce a small number of larger weights and more zero weights. [Figure: the $\ell_2$-ball $w_1^2 + w_2^2 = \text{const}$ vs. the $\ell_1$-ball $|w_1| + |w_2| = \text{const}$.]
  74. 74. Improve efficiency via sparse weight vectors (continued). Same objective $E(w) = R(w) + \frac{C}{N} \sum_{n=1}^{N} L(w; \Phi_n, y_n)$. L2-SVM: $R(w) = w^T w$, $L = \max(0, 1 - y_n(w^T \Phi_n))$. L1-LR: $R(w) = \sum_i |w_i|$, $L = \log(1 + \exp(-y_n w^T \Phi_n))$. FGM (Feature Generating Machine) [Tan et al., 2010]: $R(w) = w^T w$, $L = \max(0, 1 - y_n (w \odot d)^T \Phi_n)$ s.t. $\mathbf{1}^T d \le B$, $d \in \{0,1\}^D$, where $\odot$ is the elementwise product. (A sketch of the L1-LR option follows.)
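A minimal sketch of the L1-LR option (scikit-learn assumed): the ℓ1 penalty zeroes out most weights, directly shortening the inverted lists that have to be scanned; C plays the role of the sparsity knob varied in the curves on the next slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_sparse_w(Phi, y, C=0.1):
    """L1-regularized logistic regression: R(w) = sum_i |w_i|, logistic loss.
    Phi: N x D classeme vectors; y: labels in {0, 1}. Smaller C -> sparser w."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(Phi, y)
    w = clf.coef_.ravel()
    print(f"non-zero weights: {np.count_nonzero(w)} / {w.size}")
    return w
```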
  75. 75. Performance evaluation on ImageNet (10M images) [Rastegari et al., 2011]. [Plot: precision @ 10 (%) vs. search time per query (seconds), for full inner-product evaluation vs. inverted index, each with L2-SVM and L1-LR.] Performance averaged over 400 object classes used as queries; 10 training examples per query class; the database includes 450 images of the query class and 9.7M images of other classes; Prec@10 of a random classifier is 0.005%. Each curve is obtained by varying sparsity through C in the training objective $E(w) = R(w) + \frac{C}{N} \sum_n L(w; \Phi_n, y_n)$.
  76. 76. Top-k ranking. Do we need to rank the entire database? Users only care about the top-ranked images. Key idea: for each image, iteratively update an upper bound and a lower bound on the score, and gradually prune images that cannot rank in the top-k.
  77. 77. Top-k pruning [Rastegari et al., 2011] (slides 77–82 step through an example with w = [3, -2, 0, -6, 0, 3, -2, 0]). Highest possible score: for the binary vector $\Phi^U$ with $\Phi_i^U = 1$ iff $w_i > 0$, giving the initial upper bound $u^* = w^T \Phi^U$ (6 in this case). Lowest possible score: for $\Phi^L$ with $\Phi_i^L = 1$ iff $w_i < 0$, giving the initial lower bound $l^* = w^T \Phi^L$ (-10 in this case). Every image starts with these bounds; then load one feature $i$ at a time. If $w_i > 0$ (e.g. +3): subtract $w_i$ from the upper bound of each image n with $\phi_{n,i} = 0$, and add $w_i$ to the lower bound of each image with $\phi_{n,i} = 1$. If $w_i < 0$ (e.g. -2 or -6): decrease the upper bound by $|w_i|$ if $\phi_{n,i} = 1$, and increase the lower bound by $|w_i|$ if $\phi_{n,i} = 0$. With k = 4 in the example, images I2 and I9 can be pruned early since they cannot rank in the top-k (sketched in code below).
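A minimal sketch of the whole bound-update loop (numpy; a dense version for clarity, whereas the method streams features from the inverted index; names and the tie handling are mine):

```python
import numpy as np

def topk_candidates(Phi, w, k):
    """Images that can still rank in the top-k under w . Phi scoring.
    Phi: N x D binary matrix; w: weight vector."""
    N = Phi.shape[0]
    upper = np.full(N, w[w > 0].sum())     # best case: all positive bits set
    lower = np.full(N, w[w < 0].sum())     # worst case: all negative bits set
    alive = np.ones(N, dtype=bool)
    for i in np.argsort(-np.abs(w)):       # features in descending |w_i|
        if w[i] == 0:
            break                          # remaining weights are all zero
        bits = Phi[:, i].astype(bool)
        if w[i] > 0:
            upper[~bits] -= w[i]           # bit absent: best case shrinks
            lower[bits] += w[i]            # bit present: worst case grows
        else:
            upper[bits] += w[i]            # bit present: best case shrinks
            lower[~bits] -= w[i]           # bit absent: worst case grows
        if alive.sum() > k:                # prune every image whose upper bound
            thresh = np.sort(lower[alive])[-k]   # cannot reach the k-th best
            alive &= upper >= thresh             # lower bound seen so far
    return np.flatnonzero(alive)
```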
  83. 83. Distribution of weights and pruning rate. [Figure 2 of the ICCV 2011 submission: (a) distribution of absolute weight values for the different classifiers, after sorting the weight magnitudes; TkP runs faster with sparse, highly skewed weight values. (b) Pruning rate of TkP (% of images pruned vs. number of iterations) for the various classification models and different values of k (k = 10, 3000); features are considered in descending order of $|w_i|$.] A smaller value of k allows the method to eliminate more images from consideration at a very early stage.
  84. 84. Performance evaluation on ImageNet (10M images) [Rastegari et al., 2011]. [Plot: precision @ 10 (%) vs. search time per query (seconds), for TkP L1-LR, TkP L2-SVM, inverted index L1-LR, inverted index L2-SVM; k = 10.] Performance averaged over 400 object classes used as queries; 10 training examples per query class; the database includes 450 images of the query class and 9.7M images of other classes; Prec@10 of a random classifier is 0.005%. Each curve is obtained by varying sparsity through C in the training objective.
  85. 85. Alternative search strategy: approximate ranking. Key idea: approximate the score function with a measure that can be computed (more) efficiently (related to approximate NN search: [Shakhnarovich et al., 2006; Grauman and Darrell, 2007; Chum et al., 2008]). Approximate ranking via vector quantization: $w^T \Phi \approx w^T q(\Phi)$, where $q(\cdot)$ is a quantizer returning the cluster centroid nearest to $\Phi$. Problem: to approximate the score well we need a fine quantization, but the dimensionality of our space is D = 2659: too large to enable a fine quantization using k-means clustering.
  86. 86. Product quantization for nearest neighbor search [Jegou et al., 2011]. Split the feature vector $\Phi$ into v subvectors: $\Phi = [\Phi_1 | \Phi_2 | \ldots | \Phi_v]$. The subvectors are quantized separately: $q(\Phi) = [q_1(\Phi_1) | q_2(\Phi_2) | \ldots | q_v(\Phi_v)]$, where each $q_i(\cdot)$ is learned by k-means, with a limited number of centroids, in a space of dimensionality D/v. Example from [Jegou et al., 2011]: a 128-dimensional vector split into 8 subvectors of dimension 16, each quantized with $2^8 = 256$ centroids, yielding a 64-bit quantization index (8 bits per subvector).
  87. 87. Efficient approximate scoring (slides 87–92 step through the construction). $w^T \Phi \approx w^T q(\Phi) = \sum_{j=1}^{v} w_j^T q_j(\Phi_j)$, where $w_j$ is the sub-block of w corresponding to subvector j. 1. Fill a look-up table: for each of the v sub-blocks, precompute and store the inner products $s_{jk} = w_j^T c_{jk}$ between $w_j$ and each of the r centroids $c_{jk}$ of that sub-block's quantizer. 2. Score each quantized vector $q(\Phi)$ in the database using the table: $w^T q(\Phi) = w_1^T q_1(\Phi_1) + w_2^T q_2(\Phi_2) + \ldots + w_v^T q_v(\Phi_v)$, i.e., only v additions per image (sketched in code below).
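A minimal sketch of the full pipeline (scikit-learn's KMeans assumed; D must be divisible by v; all names are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(X, v, r):
    """One k-means codebook of r centroids per sub-block of dimension D/v."""
    return [KMeans(n_clusters=r, n_init=4).fit(block)
            for block in np.split(X, v, axis=1)]

def encode(X, codebooks):
    """Each image becomes v small codes: the nearest centroid id per sub-block."""
    blocks = np.split(X, len(codebooks), axis=1)
    return np.stack([cb.predict(b) for cb, b in zip(codebooks, blocks)], axis=1)

def score(codes, w, codebooks):
    """Approximate w . Phi via the look-up table: v additions per image.
    table[j, k] = w_j . c_jk, the precomputed sub-block inner products."""
    w_blocks = np.split(w, len(codebooks))
    table = np.stack([cb.cluster_centers_ @ wj
                      for cb, wj in zip(codebooks, w_blocks)])
    v = len(codebooks)
    return table[np.arange(v), codes].sum(axis=1)   # codes: N x v
```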
  93. 93. Choice of parameters [Rastegari et al., 2011]. Dimensionality is first reduced with PCA from D = 2659 to D' ≪ D. How do we choose D', v (number of sub-blocks), and r (number of centroids per sub-block)? [Plot: effect of parameter choices on a database of 150K images; precision @ 10 (%) vs. search time per query (seconds), for (v, r) combinations ranging from (16, 2^8) to (256, 2^8) and D' ∈ {128, 256, 512}.]
  94. 94. Performance evaluation on 150K images. [Plot.]