Sparse Kernel Learning for Image Annotation

Sean Moran and Victor Lavrenko
Institute of Language, Cognition and Computation
School of Informatics
University of Edinburgh
ICMR’14 Glasgow, April 2014

Sparse Kernel Learning for Image Annotation
Overview
SKL-CRM
Evaluation
Conclusion

Assigning words to pictures
[Figure: annotation pipeline. Training dataset images carry keyword labels ("Tiger, Grass, Whiskers"; "City, Castle, Smoke"; "Tiger, Tree, Leaves"; "Eagle, Sky"). Feature extraction (GIST, SIFT, LAB, HAAR) gives multiple feature vectors per image, with entries X1 to X6, which feed the annotation model.]
For a testing image, the annotation model produces a ranked list of words:
P(Tiger | image) = 0.15
P(Grass | image) = 0.12
P(Whiskers | image) = 0.12
P(Leaves | image) = 0.10
P(Tree | image) = 0.10
P(Sky | image) = 0.08
P(Waterfall | image) = 0.05
P(City | image) = 0.03
P(Castle | image) = 0.03
P(Eagle | image) = 0.02
P(Smoke | image) = 0.01
Top 5 words as annotation: Tiger, Grass, Tree, Leaves, Whiskers
This talk: how best to combine features?
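The last step of the pipeline, turning a ranked list of word probabilities into a fixed-length annotation, can be sketched as below. The probabilities are the ones shown on the slide; the function name `annotate` is a placeholder introduced only for this illustration.

```python
# Word probabilities copied from the slide for the example testing image.
word_probs = {
    "Tiger": 0.15, "Grass": 0.12, "Whiskers": 0.12, "Leaves": 0.10,
    "Tree": 0.10, "Sky": 0.08, "Waterfall": 0.05, "City": 0.03,
    "Castle": 0.03, "Eagle": 0.02, "Smoke": 0.01,
}


def annotate(probs, length=5):
    """Return the `length` most probable words, highest probability first."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:length]


annotation = annotate(word_probs)
print(annotation)
```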

Previous work
Topic models: latent Dirichlet allocation (LDA) [Barnard et al. ’03], Machine Translation [Duygulu et al. ’02]
Mixture models: Continuous Relevance Model (CRM) [Lavrenko et al. ’03], Multiple Bernoulli Relevance Model (MBRM) [Feng ’04]
Discriminative models: Support Vector Machine (SVM) [Verma and Jawahar ’13], Passive Aggressive Classifier [Grangier ’08]
Local learning models: Joint Equal Contribution (JEC) [Makadia ’08], Tag Propagation (Tagprop) [Guillaumin et al. ’09], Two-pass KNN (2PKNN) [Verma et al. ’12]

Combining different feature types
Previous work: linear combination of feature distances in a weighted summation with “default” kernels
[Figure: kernels GG(x; p) plotted against x for p = 1 (Laplacian), p = 2 (Gaussian) and p = 15 (Uniform)]
Standard kernel assignment: Gaussian for GIST, Laplacian for colour features, χ2 for SIFT

Data-adaptive visual kernels
Our contribution: permit the visual kernels themselves to adapt to the data
[Figure: kernels GG(x; p) plotted against x for p = 1, p = 15 and p = 2, shown for the Corel 5K dataset]
Hypothesis: the optimal kernels for GIST, SIFT, etc. depend on the image dataset itself

[Figure: the same slide build with the kernels GG(x; p), p = 1, p = 15 and p = 2, shown for the IAPR TC12 dataset]

Sparse Kernel Continuous Relevance Model (SKL-CRM)

Continuous Relevance Model (CRM)
CRM estimates the joint distribution of image features (f) and words (w) [Lavrenko et al. 2003]:
\[ P(w, f) = \sum_{J \in T} P(J) \prod_{j=1}^{N} P(w_j \mid J) \prod_{i=1}^{M} P(f_i \mid J) \]
P(J): uniform prior for training image J
P(f_i | J): Gaussian non-parametric kernel density estimate
P(w_j | J): smoothed multinomial over words
Estimate the marginal probability distribution over individual tags:
\[ P(w \mid f) = \frac{P(w, f)}{\sum_{w'} P(w', f)} \]
The top words (e.g. 5) with the highest P(w | f) are used as the annotation
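The CRM computation on this slide can be sketched for a single query word as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each image is a list of region feature vectors, the Gaussian kernel is isotropic, and P(w | J) uses a simple linear interpolation with collection frequencies (the weight `lam` and all function names are placeholders).

```python
import math


def gaussian_kernel(x, y, beta):
    """Isotropic Gaussian density of vector x centred on y, bandwidth beta."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    norm = (2.0 * math.pi * beta ** 2) ** (len(x) / 2.0)
    return math.exp(-sq / (2.0 * beta ** 2)) / norm


def p_feature_given_image(f, train_regions, beta):
    """P(f_i | J): kernel density estimate over the regions of training image J."""
    return sum(gaussian_kernel(f, g, beta) for g in train_regions) / len(train_regions)


def p_word_given_image(w, train_words, vocab_counts, total, lam):
    """P(w | J): word frequency in J, smoothed with the collection frequency (assumed scheme)."""
    local = train_words.count(w) / len(train_words)
    background = vocab_counts[w] / total
    return lam * local + (1.0 - lam) * background


def annotate_crm(test_regions, training_set, beta=1.0, lam=0.9):
    """Return P(w | f) for every vocabulary word given the test image regions."""
    vocab_counts = {}
    for _, words in training_set:
        for w in words:
            vocab_counts[w] = vocab_counts.get(w, 0) + 1
    total = sum(vocab_counts.values())
    prior = 1.0 / len(training_set)  # uniform P(J)
    joint = {}
    for w in vocab_counts:
        score = 0.0
        for regions, words in training_set:
            p_f = 1.0
            for f in test_regions:
                p_f *= p_feature_given_image(f, regions, beta)
            score += prior * p_word_given_image(w, words, vocab_counts, total, lam) * p_f
        joint[w] = score
    z = sum(joint.values())  # normalise the joint into P(w | f)
    return {w: s / z for w, s in joint.items()}


# Toy training set: (region feature vectors, keywords) per image.
training_set = [
    ([[0.0, 0.1], [0.2, 0.0]], ["tiger", "grass"]),
    ([[5.0, 5.1], [4.8, 5.0]], ["city", "castle"]),
]
posterior = annotate_crm([[0.1, 0.1]], training_set)
```

With real images the product over regions underflows quickly, so a practical version would work in log space.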

Sparse Kernel Learning CRM (SKL-CRM)
Introduce a binary kernel-feature alignment matrix Ψ_{u,v}:
\[ P(I \mid J) = \prod_{i=1}^{M} \sum_{j=1}^{R} \exp\left\{ -\frac{1}{\beta} \sum_{u,v} \Psi_{u,v}\, k^{v}(f_i^{u}, f_j^{u}) \right\} \]
k^v(f_i^u, f_j^u): v-th kernel function on the u-th feature type
β: kernel bandwidth parameter
Goal: learn Ψ_{u,v} by directly maximising annotation F1 score on a held-out validation dataset
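The alignment-weighted similarity P(I | J) can be sketched as below, assuming each kernel term k^v acts as a distance inside the exponent. The two-entry kernel bank, the dictionary layout of Ψ and the feature names are placeholders chosen for the illustration.

```python
import math


def l1(a, b):
    """Absolute-difference distance (Laplacian-style exponent term)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_squared(a, b):
    """Squared Euclidean distance (Gaussian-style exponent term)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical kernel bank: kernel name -> distance function.
KERNELS = {"laplacian": l1, "gaussian": l2_squared}


def p_image_given_image(test_regions, train_regions, psi, beta=1.0):
    """Product over test regions of the sum over training regions of
    exp(-(1/beta) * sum of the kernel distances switched on in psi).

    Each region is a dict feature_name -> vector; psi maps
    (feature_name, kernel_name) -> 0 or 1.
    """
    prob = 1.0
    for fi in test_regions:
        total = 0.0
        for fj in train_regions:
            dist = sum(KERNELS[v](fi[u], fj[u]) for (u, v), on in psi.items() if on)
            total += math.exp(-dist / beta)
        prob *= total
    return prob


psi = {("gist", "gaussian"): 1, ("gist", "laplacian"): 0, ("lab", "laplacian"): 1}
image = [{"gist": [0.1, 0.2], "lab": [1.0, 0.0]}]
far_image = [{"gist": [3.0, 3.0], "lab": [0.0, 1.0]}]
same = p_image_given_image(image, image, psi)
different = p_image_given_image(image, far_image, psi)
```

Entries of Ψ set to 0 drop out of the exponent, so only the selected kernel-feature pairs influence the similarity.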

Generalised Gaussian Kernel
Shape factor p: traces out an infinite family of kernels
\[ P(f_i \mid f_j) = \frac{p^{1 - 1/p}}{2 \beta\, \Gamma(1/p)} \exp\left\{ -\frac{1}{p} \frac{|f_i - f_j|^{p}}{\beta^{p}} \right\} \]
Γ: Gamma function
β: kernel bandwidth parameter
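A minimal sketch of the generalised Gaussian kernel for scalar inputs, written directly from the formula on this slide. The function name is a placeholder. Setting p = 2 recovers the Gaussian density with standard deviation β, and p = 1 recovers the Laplacian density with scale β.

```python
import math


def generalised_gaussian(fi, fj, p, beta=1.0):
    """Generalised Gaussian density of fi centred on fj with shape p and bandwidth beta."""
    coeff = p ** (1.0 - 1.0 / p) / (2.0 * beta * math.gamma(1.0 / p))
    return coeff * math.exp(-(abs(fi - fj) ** p) / (p * beta ** p))


x = 0.5
gauss = generalised_gaussian(x, 0.0, p=2)      # Gaussian special case
laplace = generalised_gaussian(x, 0.0, p=1)    # Laplacian special case
flat_ratio = generalised_gaussian(x, 0.0, p=15) / generalised_gaussian(0.0, 0.0, p=15)
```

For large p the density is almost constant within the bandwidth, which is why p = 15 is drawn as a near-uniform kernel.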

[Figure: three builds repeating the kernel equation beside a plot of GG(x; p) against x, for p = 2, p = 1 and p = 15 in turn]

Multinomial Kernel
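Multinomial kernel optimised for count-based features:
\[ P(f_i \mid f_j) = \frac{\left(\sum_{d} f_{i,d}\right)!}{\prod_{d} \left(f_{i,d}!\right)} \prod_{d} \left(p_{j,d}\right)^{f_{i,d}} \]
f_{i,d}: count for bin d in the unlabelled image i
f_{j,d}: count for bin d in the training image j
Jelinek-Mercer smoothing used to estimate p_{j,d}:
\[ p_{j,d} = \lambda \frac{f_{j,d}}{\sum_{d} f_{j,d}} + (1 - \lambda) \frac{\sum_{j} f_{j,d}}{\sum_{j,d} f_{j,d}} \]
We also consider standard χ2 and Hellinger kernels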
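The multinomial kernel and its Jelinek-Mercer estimate can be sketched as follows, assuming histograms are plain lists of non-negative integer counts. Function names and the toy counts are placeholders; the likelihood is accumulated in log space with `math.lgamma` to avoid overflowing factorials.

```python
import math


def jelinek_mercer(train_counts, j, lam):
    """p_{j,d}: bin distribution of training image j, interpolated with the
    bin distribution of the whole training collection."""
    n_bins = len(train_counts[j])
    image_total = sum(train_counts[j])
    collection = [sum(img[d] for img in train_counts) for d in range(n_bins)]
    collection_total = sum(collection)
    return [
        lam * train_counts[j][d] / image_total
        + (1.0 - lam) * collection[d] / collection_total
        for d in range(n_bins)
    ]


def multinomial_kernel(fi, pj):
    """Multinomial likelihood of the count vector fi under bin probabilities pj."""
    log_p = math.lgamma(sum(fi) + 1)  # log of (sum_d f_{i,d})!
    for count, prob in zip(fi, pj):
        log_p -= math.lgamma(count + 1)  # log of f_{i,d}!
        if count > 0:
            log_p += count * math.log(prob)
    return math.exp(log_p)


train_counts = [[3, 1, 0], [0, 2, 2]]  # toy histograms, one per training image
p0 = jelinek_mercer(train_counts, j=0, lam=0.99)
likelihood = multinomial_kernel([2, 1, 0], p0)
```

Smoothing keeps every bin probability positive as long as the bin occurs somewhere in the training collection, so a test image is not assigned zero likelihood for a bin that is empty in one training image.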

Greedy kernel-feature alignment
[Figure: five builds showing a testing image and a training image, each described by four feature types (GIST, SIFT, LAB, HAAR) with entries X1 to X6, and a bank of three kernels GG(x; p): Laplacian (p = 1), Gaussian (p = 2) and Uniform (p = 15). Each iteration sets one entry of the alignment matrix Ψvu to 1.]
Iteration 0: Ψ all zeros, F1 0.0
Iteration 1: Gaussian aligned to GIST, F1 0.25
Iteration 2: Uniform aligned to HAAR, F1 0.34
Iteration 3: Gaussian aligned to SIFT, F1 0.38
Iteration 4: Laplacian aligned to LAB, F1 0.42
Ψvu after iteration 4:
            GIST  SIFT  LAB  HAAR
Laplacian    0     0     1    0
Gaussian     1     1     0    0
Uniform      0     0     0    1
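The greedy build-up of the alignment matrix shown in these slides can be sketched as a forward-selection loop. This is an illustration under assumptions: `evaluate_f1` stands in for training and scoring the model on the validation set, each feature type receives at most one kernel, and the search stops when no addition improves F1. The gains in `TOY_GAINS` are made-up numbers chosen only so that the toy run retraces the F1 values of the slide builds.

```python
def greedy_alignment(features, kernels, evaluate_f1):
    """Greedily add (feature, kernel) pairs while the validation F1 improves.

    evaluate_f1 is a caller-supplied function mapping a set of
    (feature, kernel) pairs to an F1 score.
    """
    selected = set()
    best_f1 = 0.0
    while True:
        used = {u for u, _ in selected}
        candidates = [(u, v) for u in features if u not in used for v in kernels]
        if not candidates:
            break  # every feature type already has a kernel
        f1, choice = max((evaluate_f1(selected | {c}), c) for c in candidates)
        if f1 <= best_f1:
            break  # no candidate improves the score
        selected.add(choice)
        best_f1 = f1
    return selected, best_f1


# Toy stand-in for validation F1: additive gains, small penalty for other pairs.
TOY_GAINS = {
    ("GIST", "Gaussian"): 0.25,
    ("HAAR", "Uniform"): 0.09,
    ("SIFT", "Gaussian"): 0.04,
    ("LAB", "Laplacian"): 0.04,
}


def toy_f1(pairs):
    return sum(TOY_GAINS.get(pair, -0.01) for pair in pairs)


alignment, score = greedy_alignment(
    ["GIST", "SIFT", "LAB", "HAAR"], ["Laplacian", "Gaussian", "Uniform"], toy_f1
)
```

Each iteration costs one evaluation per remaining (feature, kernel) pair, so the total cost is at most quadratic in the number of feature types for a fixed kernel bank.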

Evaluation

Datasets/Features
Standard evaluation datasets:
Corel 5K: 5,000 images (landscapes, cities), 260 keywords
IAPR TC12: 19,627 images (tourism, sports), 291 keywords
ESP Game: 20,768 images (drawings, graphs), 268 keywords
Standard “Tagprop” feature set [Guillaumin et al. ’09]:
Bag-of-words histograms: SIFT [Lowe ’04] and Hue [van de Weijer & Schmid ’06]
Global colour histograms: RGB, HSV, LAB
Global GIST descriptor [Oliva & Torralba ’01]
Descriptors, except GIST, also computed in a 3x1 spatial arrangement [Lazebnik et al. ’06]

Evaluation Metrics
Standard evaluation metrics [Guillaumin et al. ’09]:
Mean per word Recall (R)
Mean per word Precision (P)
F1 Measure
Number of words with recall > 0 (N+)
Fixed annotation length of 5 keywords
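The four metrics can be computed as in the sketch below, assuming predictions and ground truth are given as one set of words per test image. The conventions for undefined ratios (a word that is never predicted gets precision 0, a word with no ground-truth image gets recall 0) are assumptions of this sketch, as are the function name and the toy data.

```python
def per_word_metrics(predicted, truth, vocabulary):
    """Return (mean precision, mean recall, F1, N+) over the vocabulary.

    predicted, truth: lists of sets of words, one set per test image.
    """
    precisions, recalls, n_plus = [], [], 0
    for w in vocabulary:
        tp = sum(1 for p, t in zip(predicted, truth) if w in p and w in t)
        n_pred = sum(1 for p in predicted if w in p)
        n_true = sum(1 for t in truth if w in t)
        precisions.append(tp / n_pred if n_pred else 0.0)
        recalls.append(tp / n_true if n_true else 0.0)
        if tp > 0:
            n_plus += 1  # word recalled at least once
    mean_p = sum(precisions) / len(vocabulary)
    mean_r = sum(recalls) / len(vocabulary)
    f1 = 2.0 * mean_p * mean_r / (mean_p + mean_r) if mean_p + mean_r else 0.0
    return mean_p, mean_r, f1, n_plus


predicted = [{"tiger", "grass"}, {"sky"}]
truth = [{"tiger"}, {"sky", "eagle"}]
vocabulary = ["tiger", "grass", "sky", "eagle"]
P, R, F1, N_plus = per_word_metrics(predicted, truth, vocabulary)
```

Here F1 is the harmonic mean of the mean per-word precision and recall, rather than an average of per-word F1 scores.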

F1 score of CRM model variants
[Figure: bar chart of F1 (axis 0.00 to 0.45) on Corel 5K, IAPR TC12 and ESP Game for three models, revealed over four builds]
CRM: original CRM, Duygulu et al. features
CRM 15: original CRM, 15 Tagprop features (+71%)
SKL-CRM: 15 Tagprop features (+45%)

F1 score of SKL-CRM on Corel 5K
[Figure: F1 (axis 0.31 to 0.45) against feature type, with three series: SKL-CRM (Valid F1), SKL-CRM (Test F1) and Tagprop (Test F1). Feature types along the x-axis: HSV_V3H1, DS, HS_V3H1, HSV, HS, HH_V3H1, GIST, LAB_V3H1, RGB_V3H1, RGB, DH_V3H1, DH, HH, LAB, DS_V3H1]

Optimal kernel-feature alignments on Corel 5K
Optimal alignments [1]:
HSV: Multinomial (λ = 0.99)
HSV_V3H1: Generalised Gaussian (p = 0.9)
Harris Hue (HH_V3H1): Generalised Gaussian (p = 0.1) ≈ Dirac spike!
Harris SIFT (HS): Gaussian
HS_V3H1: Generalised Gaussian (p = 0.7)
DenseSift (DS): Laplacian
Our data-driven kernels are more effective than standard kernels
No alignment agrees with the default assignment in the literature, i.e. Gaussian for GIST, Laplacian for colour histograms, χ2 for SIFT
[1] V3H1 denotes descriptors computed in a spatial arrangement

SKL-CRM Results vs. Literature (Precision & Recall)
[Figure: bar chart of recall (R) and precision (P) (axis 0.20 to 0.50) on Corel 5K and IAPR TC12 for MBRM, JEC, Tagprop, GS and SKL-CRM]

SKL-CRM Results vs. Literature (N+)
[Figure: bar chart of N+ (axis 0 to 300) on Corel 5K and IAPR TC12 for MBRM, JEC, Tagprop, GS and SKL-CRM]

Conclusion

Conclusions and Future Work
Proposed a sparse kernel model for image annotation
Key experimental findings:
Default kernel-feature alignment suboptimal
Data-adaptive kernels are superior to standard kernels
Sparse set of features just as effective as much larger set
Greedy forward selection as effective as gradient ascent
Future work: superposition of kernels per feature type

Thank you for your attention
Sean Moran
sean.moran@ed.ac.uk
www.seanjmoran.com
