- 1. Sparse Kernel Learning for Image Annotation Sean Moran and Victor Lavrenko Institute of Language, Cognition and Computation School of Informatics University of Edinburgh ICMR’14 Glasgow, April 2014
- 2. Sparse Kernel Learning for Image Annotation Overview SKL-CRM Evaluation Conclusion
- 4. Assigning words to pictures (pipeline figure). A training dataset of annotated images (e.g. "Tiger, Grass, Whiskers"; "City, Castle, Smoke"; "Tiger, Tree, Leaves"; "Eagle, Sky") passes through feature extraction (GIST, SIFT, LAB, HAAR) into the annotation model. For a testing image the model produces a ranked list of word probabilities, e.g. P(Tiger | image) = 0.15, P(Grass | image) = 0.12, P(Whiskers | image) = 0.12, P(Leaves | image) = 0.10, P(Tree | image) = 0.10, ..., and the top 5 words form the annotation: "Tiger, Grass, Tree, Leaves, Whiskers". This talk: how best to combine multiple features?
- 5. Previous work. Topic models: latent Dirichlet allocation (LDA) [Barnard et al. '03], Machine Translation [Duygulu et al. '02]. Mixture models: Continuous Relevance Model (CRM) [Lavrenko et al. '03], Multiple Bernoulli Relevance Model (MBRM) [Feng et al. '04]. Discriminative models: Support Vector Machine (SVM) [Verma and Jawahar '13], Passive Aggressive Classifier [Grangier '08]. Local learning models: Joint Equal Contribution (JEC) [Makadia et al. '08], Tag Propagation (Tagprop) [Guillaumin et al. '09], Two-pass KNN (2PKNN) [Verma et al. '12].
- 6. Combining different feature types. Previous work: a linear combination of feature distances in a weighted summation with "default" kernels (figure: generalised Gaussian GG(x; p) for p = 1 (Laplacian), p = 2 (Gaussian) and p = 15 (≈ uniform)). Standard kernel assignment: Gaussian for GIST, Laplacian for colour features, χ² for SIFT.
- 7. Data-adaptive visual kernels. Our contribution: permit the visual kernels themselves to adapt to the data (figure: kernel shapes learnt on Corel 5K). Hypothesis: the optimal kernels for GIST, SIFT etc. depend on the image dataset itself.
- 8. Data-adaptive visual kernels, continued (figure: kernel shapes learnt on IAPR TC12), illustrating the same hypothesis on a second dataset.
- 9. Sparse Kernel Continuous Relevance Model (SKL-CRM) Overview SKL-CRM Evaluation Conclusion
- 10. Continuous Relevance Model (CRM). CRM estimates the joint distribution of image features (f) and words (w) [Lavrenko et al. '03]:
    P(w, f) = \sum_{J \in T} P(J) \prod_{j=1}^{N} P(w_j | J) \prod_{i=1}^{M} P(f_i | J)
  P(J): uniform prior over training images J
  P(f_i | J): Gaussian non-parametric kernel density estimate
  P(w_j | J): multinomial with word smoothing
  Estimate the marginal probability distribution over individual tags:
    P(w | f) = P(w, f) / \sum_{w} P(w, f)
  The top (e.g. 5) words with the highest P(w | f) are used as the annotation.
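The CRM scoring above can be sketched in a few lines. The array layout (per-image region features, per-image word multinomials) and the fixed-bandwidth Gaussian density estimate are simplifying assumptions, not the paper's exact implementation:

```python
import numpy as np

def crm_annotate(test_feats, train_feats, train_word_probs, beta=1.0):
    """Minimal CRM sketch: marginal P(w|f) for one test image.

    test_feats:       (M, D) region feature vectors of the test image
    train_feats:      list of (R, D) arrays, one per training image J
    train_word_probs: (|T|, V) rows are smoothed multinomials P(w|J)
    """
    joint = np.zeros(train_word_probs.shape[1])
    for J, fJ in enumerate(train_feats):
        # P(f_i|J): Gaussian kernel density estimate over J's region features
        diff = test_feats[:, None, :] - fJ[None, :, :]                 # (M, R, D)
        dens = np.exp(-np.sum(diff ** 2, axis=2) / beta).mean(axis=1)  # (M,)
        # the uniform prior P(J) is constant and cancels in the normalisation
        joint += np.prod(dens) * train_word_probs[J]
    return joint / joint.sum()  # P(w|f), sums to 1 over the vocabulary
```

Taking the 5 largest entries of the returned vector gives the fixed-length annotation described on the earlier slides.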
- 11. Sparse Kernel Learning CRM (SKL-CRM). Introduce a binary kernel-feature alignment matrix \Psi_{u,v}:
    P(I | J) = \prod_{i=1}^{M} \sum_{j=1}^{R} \exp\left( -\frac{1}{\beta} \sum_{u,v} \Psi_{u,v} \, k_v(f_i^u, f_j^u) \right)
  k_v(f_i^u, f_j^u): v-th kernel function on the u-th feature type
  β: kernel bandwidth parameter
  Goal: learn \Psi_{u,v} by directly maximising annotation F1 score on a held-out validation dataset.
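A direct transcription of this likelihood, with the binary matrix Ψ represented as the set of its nonzero (feature, kernel) pairs; the data layout and the particular base distances are illustrative, not the paper's exact kernel set:

```python
import numpy as np

# Illustrative base distances k_v; the paper's set also includes the
# generalised Gaussian and multinomial kernels from the following slides.
KERNELS = {
    "laplacian": lambda a, b: np.abs(a - b).sum(),
    "gaussian":  lambda a, b: ((a - b) ** 2).sum(),
    "chi2":      lambda a, b: ((a - b) ** 2 / (a + b + 1e-12)).sum(),
}

def skl_image_likelihood(regions_I, regions_J, psi, beta=1.0):
    """Sketch of P(I|J) under SKL-CRM (data layout hypothetical).

    regions_I, regions_J: lists of {feature_type: np.array} dicts, one per region
    psi: set of selected (feature_type, kernel_name) pairs, i.e. the
         nonzero entries of the binary alignment matrix Psi_{u,v}
    """
    p = 1.0
    for fi in regions_I:                 # product over test-image regions i
        s = 0.0
        for fj in regions_J:             # sum over training-image regions j
            dist = sum(KERNELS[v](fi[u], fj[u]) for (u, v) in psi)
            s += np.exp(-dist / beta)
        p *= s / len(regions_J)
    return p
```

Swapping this P(I|J) into the CRM joint distribution leaves the rest of the model unchanged; only the feature-kernel pairing is learnt.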
- 12. Generalised Gaussian Kernel. The shape factor p traces out an infinite family of kernels:
    P(f_i | f_j) = \frac{p^{1 - 1/p}}{2 \beta \Gamma(1/p)} \exp\left( -\frac{1}{p} \frac{|f_i - f_j|^p}{\beta^p} \right)
  Γ: Gamma function; β: kernel bandwidth parameter
- 13. Generalised Gaussian Kernel (figure: GG(x; p) with p = 2, the Gaussian)
- 14. Generalised Gaussian Kernel (figure: GG(x; p) with p = 1, the Laplacian)
- 15. Generalised Gaussian Kernel (figure: GG(x; p) with p = 15, near-uniform)
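The generalised Gaussian density can be written directly from the formula above; at β = 1 it reduces to the standard normal density for p = 2 and the Laplace density for p = 1:

```python
import math

def gg_kernel(fi, fj, p=2.0, beta=1.0):
    """Generalised Gaussian GG(x; p) evaluated at x = fi - fj.

    p = 2 gives the Gaussian kernel, p = 1 the Laplacian; p -> 0
    approaches a Dirac spike and large p a near-uniform kernel.
    """
    norm = p ** (1.0 - 1.0 / p) / (2.0 * beta * math.gamma(1.0 / p))
    return norm * math.exp(-abs(fi - fj) ** p / (p * beta ** p))
```

Sweeping p continuously is what lets the model discover, per feature type, where on the Gaussian-Laplacian-uniform spectrum the best kernel lies.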
- 16. Multinomial Kernel. A multinomial kernel optimised for count-based features:
    P(f_i | f_j) = \frac{(\sum_d f_{i,d})!}{\prod_d (f_{i,d}!)} \prod_d (p_{j,d})^{f_{i,d}}
  f_{i,d}: count for bin d in the unlabelled image i; f_{j,d}: count for the training image j
  Jelinek-Mercer smoothing is used to estimate p_{j,d}:
    p_{j,d} = \lambda \frac{f_{j,d}}{\sum_d f_{j,d}} + (1 - \lambda) \frac{\sum_j f_{j,d}}{\sum_{j,d} f_{j,d}}
  We also consider the standard χ² and Hellinger kernels.
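A minimal sketch of the multinomial kernel with Jelinek-Mercer smoothing, computed in log space for numerical stability (the function names are my own):

```python
import math

def jm_smooth(counts_j, collection_counts, lam=0.5):
    """Jelinek-Mercer: mix image J's bin distribution with the corpus one."""
    tj = sum(counts_j)
    tc = sum(collection_counts)
    return [lam * c / tj + (1 - lam) * g / tc
            for c, g in zip(counts_j, collection_counts)]

def multinomial_kernel(counts_i, counts_j, collection_counts, lam=0.5):
    """P(f_i|f_j): multinomial likelihood of the test counts under p_{j,d}."""
    p = jm_smooth(counts_j, collection_counts, lam)
    n = sum(counts_i)
    log_p = math.lgamma(n + 1)  # log of (sum_d f_{i,d})!
    for c, pd in zip(counts_i, p):
        log_p += c * math.log(pd) - math.lgamma(c + 1)
    return math.exp(log_p)
```

The smoothing term matters: without it a single test-image bin that is empty in the training image would zero out the whole likelihood.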
- 17. Greedy kernel-feature alignment (figure: features GIST, SIFT, LAB, HAAR against kernels Laplacian, Gaussian, Uniform). Iteration 0: Ψ all zeros, F1 = 0.00.
- 18. Iteration 1: Gaussian ↔ GIST selected; F1 = 0.25.
- 19. Iteration 2: Uniform ↔ HAAR added; F1 = 0.34.
- 20. Iteration 3: Gaussian ↔ SIFT added; F1 = 0.38.
- 21. Iteration 4: Laplacian ↔ LAB added; F1 = 0.42.
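The iterations above can be sketched as greedy forward selection over (feature, kernel) pairs, restricted to one kernel per feature type as in the binary matrix Ψ; here `f1_of` is a hypothetical callback standing in for a full train-and-validate run on the held-out set:

```python
def greedy_alignment(feature_types, kernel_names, f1_of):
    """Greedy forward selection of (feature, kernel) pairs.

    At each iteration, add the single pair that most improves validation
    F1; stop as soon as no candidate improves it.
    f1_of(pairs) -> validation F1 for that alignment (assumed callback).
    """
    selected, best = set(), f1_of(set())
    while True:
        # candidate pairs: one kernel per not-yet-aligned feature type
        candidates = [(u, v) for u in feature_types for v in kernel_names
                      if not any(p[0] == u for p in selected)]
        if not candidates:
            return selected, best
        score, c = max((f1_of(selected | {c}), c) for c in candidates)
        if score <= best:
            return selected, best
        selected, best = selected | {c}, score
```

Because the search is over a small grid of feature-kernel pairs, this forward pass is cheap relative to the F1 evaluations themselves.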
- 23. Datasets/Features Standard evaluation datasets: Corel 5K: 5,000 images (landscapes, cities), 260 keywords IAPR TC12: 19,627 images (tourism, sports), 291 keywords ESP Game: 20,768 images (drawings, graphs), 268 keywords Standard “Tagprop” feature set [Guillaumin et al. ’09]: Bag-of-words histograms: SIFT [Lowe ’04] and Hue [van de Weijer & Schmid ’06] Global colour histograms: RGB, HSV, LAB Global GIST descriptor [Oliva & Torralba ’01] Descriptors, except GIST, also computed in a 3x1 spatial arrangement [Lazebnik et al. ’06]
- 24. Evaluation Metrics Standard evaluation metrics [Guillaumin et al. ’09]: Mean per word Recall (R) Mean per word Precision (P) F1 Measure Number of words with recall > 0 (N+) Fixed annotation length of 5 keywords
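Under these definitions, the four metrics can be computed as follows (a sketch; the image-to-keyword dictionaries are a hypothetical layout, with each predicted set being the fixed 5-word annotation):

```python
def per_word_metrics(gold, pred):
    """Mean per-word recall/precision, F1, and N+ (words with recall > 0).

    gold, pred: dicts mapping image id -> set of keywords.
    """
    vocab = set().union(*gold.values())
    recalls, precisions = [], []
    for w in vocab:
        rel = {i for i in gold if w in gold[i]}   # images truly tagged w
        ret = {i for i in pred if w in pred[i]}   # images predicted w
        hit = len(rel & ret)
        recalls.append(hit / len(rel))
        precisions.append(hit / len(ret) if ret else 0.0)
    R = sum(recalls) / len(vocab)
    P = sum(precisions) / len(vocab)
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    Nplus = sum(r > 0 for r in recalls)
    return R, P, F1, Nplus
```

Averaging per word rather than per image is what makes rare keywords count: a model that only ever predicts frequent tags scores poorly on N+ and mean per-word recall.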
- 25. F1 score of CRM model variants (bar chart: F1 of CRM, CRM 15 and SKL-CRM on Corel 5K, IAPR TC12 and ESP Game; y-axis 0.00–0.45).
- 26. The original CRM uses the Duygulu et al. features.
- 27. CRM 15: the original CRM on the 15 Tagprop features, +71% F1.
- 28. SKL-CRM on the 15 Tagprop features: a further +45% F1.
- 29. F1 score of SKL-CRM on Corel 5K as features are added greedily (chart: SKL-CRM validation F1, SKL-CRM test F1, and the Tagprop test F1 baseline, by feature type: HSV_V3H1, DS, HS_V3H1, HSV, HS, HH_V3H1, GIST, LAB_V3H1, RGB_V3H1, RGB, DH_V3H1, DH, HH, LAB, DS_V3H1; F1 range 0.31–0.45).
- 34. Optimal kernel-feature alignments on Corel 5K. Optimal alignments¹: HSV: Multinomial (λ = 0.99); HSV V3H1: Generalised Gaussian (p = 0.9); Harris Hue (HH V3H1): Generalised Gaussian (p = 0.1) ≈ Dirac spike!; Harris SIFT (HS): Gaussian; HS V3H1: Generalised Gaussian (p = 0.7); Dense SIFT (DS): Laplacian. Our data-driven kernels are more effective than the standard kernels. No alignment agrees with the literature's default assignment, i.e. Gaussian for GIST, Laplacian for colour histograms, χ² for SIFT. ¹ V3H1 denotes descriptors computed in a spatial arrangement.
- 35. SKL-CRM results vs. literature (bar chart: mean per-word recall R and precision P of MBRM, JEC, Tagprop, GS and SKL-CRM on Corel 5K and IAPR TC12; y-axis 0.20–0.50).
- 36. SKL-CRM results vs. literature (bar chart: N+ of MBRM, JEC, Tagprop, GS and SKL-CRM on Corel 5K and IAPR TC12; y-axis 0–300).
- 38. Conclusions and Future Work. Proposed a sparse kernel model for image annotation. Key experimental findings: the default kernel-feature alignment is suboptimal; data-adaptive kernels are superior to standard kernels; a sparse set of features is just as effective as a much larger set; greedy forward selection is as effective as gradient ascent. Future work: a superposition of kernels per feature type.
- 39. Thank you for your attention Sean Moran sean.moran@ed.ac.uk www.seanjmoran.com