Seminar


  1. Features and Learning Methods for Large-scale Image Annotation and Categorization
     Hideki Nakayama, The University of Tokyo, Department of Creative Informatics, 2013/1/15
  2. My research interest: generic image (object) recognition
     • Whole-image level recognition
     • Weakly supervised training samples
     • Image annotation: assigning multiple words to a whole image (without region correspondence)
  3. The era of big data: we can now use gigantic weakly-labeled web data!
     • Example Flickr tags (http://www.flickr.com/): Nikon D200 DSLR Nikkor 60mmf28dmicro Nature Landscape Lake Idaho Ice Sunset Sun Mountain Sky Frozen AnAwesomeShot ImpressedBeauty isawyoufirst ABigFave Ljomi ljspot4 ColorPhotoAward
     • Flickr: 6 billion images (2011)
     • Facebook: 3 billion images every year
     • YouTube: 8 years of video every day
  4. More data helps recognition
     • Simple k-NN annotation using Flickr images & tags: for the same query, the nearest neighbors and transferred tags (e.g., football, soccer) become steadily more relevant as the dataset grows from 100K to 1.6M to 12M images (figure of nearest-neighbor results omitted)
  5. Growth of datasets (spanning 10^2 to 10^9 images)
     • Search engine-based: TinyImage, ARISTA; crowd-sourced: ImageNet, SUN397
     • Corel5K (2002): 5K; Caltech101 (2004): 9K; Caltech256 (2007): 30K; Pascal VOC: 20K; TinyImage (2008): 80M; ARISTA (2008): 2B; NUS-WIDE (2009): 200K; SUN397 (2010): 100K; ILSVRC (2010): 1.4M; ImageNet (2011): 14M
  6. Challenge: scaling to large training data
     • Traditional methods are not scalable in training: bag-of-visual-words + kernel SVM (chi-square, etc.) requires O(N^2)-O(N^3) training time and O(N^2) memory (cf. [Yang et al., CVPR'09])
     • Recent methods exploit linear models: with carefully designed image features, whose dot product approximates the similarity between instances, training takes O(N) time and O(1) memory
  7. Linear distance metric learning for image annotation
  8. Example-based image annotation: the standard approach to the annotation problem
     • Search for training images similar to the query, then transfer their labels (tiger, grass, water, street, city, sea, people, sky, ...) by k-NN or kernel density estimation
     • MBRM [Feng et al., 2004], JEC [Makadia et al., 2008], TagProp [Guillaumin et al., 2009]
     • Problem: how do we define similarity between images?
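The label-transfer idea above can be sketched as follows. This is a minimal illustrative implementation (not the code of any cited paper); the feature vectors, labels, and the plain Euclidean distance are toy assumptions.

```python
# Toy sketch of example-based annotation by k-NN label transfer.
import numpy as np

def knn_annotate(query, train_feats, train_labels, k=3):
    """Score each dictionary word by its frequency among the k nearest
    training images (Euclidean distance in feature space)."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nn = np.argsort(dists)[:k]              # indices of the k nearest images
    scores = train_labels[nn].sum(axis=0)   # vote per dictionary word
    return scores / k                       # relative word frequencies

# Toy data: 4 training images, 3-dim features, dictionary of 4 words.
feats = np.array([[0.0, 0, 0], [0.1, 0, 0], [5, 5, 5], [5, 5, 4]])
labels = np.array([[1, 1, 0, 0],   # e.g. {tiger, grass}
                   [1, 0, 1, 0],   # {tiger, water}
                   [0, 0, 0, 1],   # {city}
                   [0, 0, 1, 1]])  # {water, city}
scores = knn_annotate(np.array([0.05, 0, 0]), feats, labels, k=2)
# The query is close to the first two images, so "tiger" scores highest.
```

The whole quality of such a system hinges on the distance used in `knn_annotate`, which is exactly the problem the following slides address.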
  9. Fundamental problem: the semantic gap
     • Visually similar ≠ semantically similar (e.g., the "I look like my dog" contest: http://www.hemmy.net/2006/06/25/i-look-like-my-dog-contest/)
     • Solution: distance metric learning
  10. Canonical Contextual Distance [Nakayama+, BMVC'10]
     • Canonical Correlation Analysis (CCA): for image features x (e.g., BoVW) and a binary label vector y, find linear transformations s = A^T (x - mu_x), t = B^T (y - mu_y) that maximize the correlation between s and t
     • Solved as a generalized eigenproblem: Sigma_xx^{-1} Sigma_xy Sigma_yy^{-1} Sigma_yx A = A Lambda^2 with A^T Sigma_xx A = I, and Sigma_yy^{-1} Sigma_yx Sigma_xx^{-1} Sigma_xy B = B Lambda^2 with B^T Sigma_yy B = I, where the Sigmas are covariance matrices and the diagonal of Lambda holds the canonical correlations
     • The similarity measure lives in the latent subspace, via the probabilistic interpretation of CCA [Bach and Jordan, 2005]: a latent variable z ~ N(0, I_d), d <= min{p, q}, generates both views: x | z ~ N(W_x z + mu_x, Psi_x) with W_x in R^{p x d}, and y | z ~ N(W_y z + mu_y, Psi_y) with W_y in R^{q x d}
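A minimal CCA can be sketched in a few lines: whiten each view's covariance, then take the SVD of the whitened cross-covariance. This is an assumed textbook implementation for illustration, not the paper's code; the regularization `eps` and the toy data are my additions.

```python
# Minimal CCA sketch: projections A, B and canonical correlations via SVD.
import numpy as np

def cca(X, Y, eps=1e-8):
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Sxx = X.T @ X / n + eps * np.eye(X.shape[1])  # regularized covariances
    Syy = Y.T @ Y / n + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):  # inverse square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, corrs, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A = Wx @ U        # satisfies A^T Sxx A = I
    B = Wy @ Vt.T     # satisfies B^T Syy B = I
    return A, B, corrs

# Toy two-view data driven by a shared latent variable z.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ rng.normal(size=(1, 3)) + 0.1 * rng.normal(size=(500, 3))
Y = z @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(500, 4))
A, B, corrs = cca(X, Y)
# With a strongly shared latent variable, corrs[0] should be close to 1.
```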
  11. CCD for image auto-annotation
     • Under probabilistic CCA, the posteriors p(z | x) and p(z | x_i, y_i) are Gaussian with closed forms, e.g., E[z | x] = M_x^T A^T (x - mu_x) and var(z | x) = I - M_x M_x^T, where M_x is determined by the canonical correlations Lambda (similarly for M_y and for the joint posterior given both x_i and y_i)
     • Annotation: P(w | x_s) = (1/N) * sum_i P(w | l_i) P(l_i | x_s), where l_i is the annotation of the i-th training sample
     • P(l_i | x) = integral p(z | x_i, y_i) p(z | x) dz / sum_j integral p(z | x_j, y_j) p(z | x) dz
     • P(w | l_i) is proportional to [w in l_i] * IDF(w)
  12. 12. Features Image features  BoVW, GIST, etc… (off-the-shelf ones)  Needs to be encoded in a Euclidean space Labels features  Binary occurrence vector cf. [Guillaumin et al., CVPR’10] When the dictionary contains 「plane, sea, sky, clouds, mountain」 Ij y j = (1, 0, 1, 1, 0) plane sky clouds y j , yk 2 Ik yk 0, 0, 1, 1, 1 Dot product counts the number of common labels. sky clouds mountain
  13. 13. Evaluation Benchmark datasets Corel5K IAPR-TC12 ESP Game # of words 260 291 268 # of training images 4,500 17,665 18,689 # of testing images 499 1,962 2,081 # of words per 3.4/5 5.7/23 4.7/15 image (avg./max)
  14. Evaluation: comparable performance to the state of the art on Corel5K, IAPR-TC12, and ESP Game (bar charts omitted)
  15. Image features for linear classifiers
  16. Basic pipeline
     • 1. Local feature extraction: 1-1. feature detector (operator, grid); 1-2. descriptor (SIFT, SURF, ...)
     • 2. Coding into an image-level feature vector
     • Key question: how do we encode similarity between distributions of local features?
  17. Bag-of-Visual-Words (traditional) [Csurka et al., 2004]
     • Vector quantization of local features against a vocabulary of visual words learned from training images, then a histogram of word occurrences
     • Pros: computationally efficient
     • Cons: large reconstruction error; non-linear property (must be used with a non-linear kernel)
     • Credit: K. Yanai
  18. New BoVW (1): sparse coding + max pooling
     • Reduce reconstruction error by using multiple bases (words) per descriptor
     • Max pooling (taking the max response for each visual word) leads to linearly separable image signatures (cf. [Boureau et al., ICML'10])
     • [Yang+, CVPR'09] [Wang+, CVPR'10]
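The pooling step is simple to state in code. The response matrix below is a made-up stand-in for the per-descriptor sparse codes; the point is only the contrast between averaging (BoVW-like counting) and taking the per-word maximum.

```python
# Pooling sketch: rows = local features, columns = visual words.
import numpy as np

responses = np.array([[0.9, 0.1, 0.0],
                      [0.2, 0.8, 0.0],
                      [0.7, 0.0, 0.1]])
avg_pooled = responses.mean(axis=0)  # soft histogram, like BoVW counting
max_pooled = responses.max(axis=0)   # strongest response per visual word
```

Max pooling keeps the single strongest activation per word, which is what tends to make the resulting signatures separable by a linear classifier.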
  19. New BoVW (2): encode higher-level statistics
     Method                                Statistics         Dim. of image signature
     BoVW                                  count (ratio)      N
     VLAD [Jegou+, CVPR'10]                mean               Nd
     Super vector [Zhou+, ECCV'10]         ratio + mean       N(d+1)
     Fisher vector [Perronnin+, ECCV'10]   mean + variance    2Nd
     Global Gaussian [Nakayama+, CVPR'10]  mean + covariance  d(d+1)/2 (N=1)
     VLAT [Picard+, ICIP'11]               mean + covariance  Nd(d+1)/2
     (N: # of visual words, 10^3-10^4; d: dimension of descriptor, 10-100)
     • Each is encoded as a feature vector so that the dot product approximates the distance between distributions
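As one concrete instance of these "higher-level statistics" encodings, here is a minimal VLAD-style sketch: per visual word, sum the residuals of the assigned descriptors and concatenate. This follows [Jegou+, CVPR'10] in spirit only; the L2 normalization choice and the toy data are assumptions.

```python
# Minimal VLAD sketch: concatenated per-word residual sums -> N*d dims.
import numpy as np

def vlad(descriptors, codebook):
    n_words, d = codebook.shape
    assign = ((descriptors[:, None] - codebook[None]) ** 2).sum(-1).argmin(1)
    v = np.zeros((n_words, d))
    for k in range(n_words):
        if np.any(assign == k):
            # sum of residuals of descriptors assigned to word k
            v[k] = (descriptors[assign == k] - codebook[k]).sum(axis=0)
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v   # global L2 normalization

codebook = np.array([[0.0, 0], [10.0, 10]])          # N=2 words, d=2
descs = np.array([[1.0, 0], [0.0, 1], [11.0, 10]])
sig = vlad(descs, codebook)                          # 4-dim signature
```

Because the signature lives in a plain Euclidean space, its dot product can feed a linear classifier directly, which is the whole point of this family of encodings.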
  20. Global Gaussian Coding [Nakayama+, CVPR'10]
     • Exploit the Riemannian manifold of Gaussians using the information geometry framework: local descriptors x are modeled by p(x; theta) = (2 pi)^{-d/2} |Sigma|^{-1/2} exp(-(1/2) (x - mu)^T Sigma^{-1} (x - mu))
     • Affine (expectation) coordinates: eta = (mu_1, ..., mu_d, sigma_11 + mu_1^2, sigma_12 + mu_1 mu_2, ..., sigma_1d + mu_1 mu_d, sigma_22 + mu_2^2, ..., sigma_dd + mu_d^2)
     • Inner product <Delta eta_i, Delta eta_j> = Delta eta_i^T G(eta_bar) Delta eta_j, where G is the inverse of the Fisher information matrix; we use G(eta_bar), the metric at the center of the samples, for the entire space
     • This somewhat approximates the KL divergence
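A much-simplified sketch of the idea: summarize all local descriptors of an image by a single Gaussian and flatten its mean plus (upper-triangular) covariance into a d + d(d+1)/2-dimensional vector. This deliberately omits the Fisher-information metric G(eta_bar) that the actual method applies; it only illustrates the signature's shape.

```python
# Simplified Global Gaussian signature: mean + upper-triangular covariance.
import numpy as np

def global_gaussian_signature(descriptors):
    mu = descriptors.mean(axis=0)
    cov = np.cov(descriptors, rowvar=False)
    iu = np.triu_indices(cov.shape[0])     # upper triangle incl. diagonal
    return np.concatenate([mu, cov[iu]])   # d + d(d+1)/2 dimensions

descs = np.random.default_rng(0).normal(size=(50, 4))  # toy local descriptors
sig = global_gaussian_signature(descs)                 # 4 + 10 = 14 dims
```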
  21. Competition: Large-Scale Visual Recognition Challenge 2010
     • 1000-class categorization; 1.2M training images, 150K testing images; evaluated by top-5 classification accuracy
     • Part of the ImageNet dataset [Deng et al.]: labeled with Amazon Mechanical Turk; 14M images, 22K categories (as of 2011); semantic structure according to WordNet
     • Credit: Fei-Fei Li
  22. 22. Result (2010) 11 teams participated  1. NEC+UIUC (72%) 80,000~260,000 dim ×6  2. Xerox Research (64%) 260,000 dim ×2  3. ISI(55%) 12,000 dim  4. UC Irvine (53%)  5. MIT (46%) Examples  http://www.isi.imi.i.u-tokyo.ac.jp/pattern/ilsvrc/index.html
  23. 2010 winner: NEC-UIUC
     • LCC + super vector coding; ensemble of six classifiers using different features
     • Parallelized feature extraction (Hadoop); linear SVM trained with averaging SGD
     • Feature extraction: LCC 2 days, super vector 7 days (on an 8-core machine)
  24. 2011 winner: XRCE
     • Fisher vector: 520K dim x2 (SIFT, color); 2 days on a 16-core machine
     • Linear SVM (SGD): 1.5 days on a 16-core machine
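The SGD-trained linear SVMs used by these winning systems can be sketched as below. This is a generic Pegasos-style hinge-loss solver with iterate averaging, written from the standard description; the step-size schedule, toy data, and all names are my assumptions, not the teams' code.

```python
# Hinge-loss linear SVM trained by averaged SGD (Pegasos-style sketch).
import numpy as np

def sgd_svm(X, y, lam=0.01, epochs=20, seed=0):
    """y in {-1, +1}. Returns the averaged weight vector."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size
            margin = y[i] * (w @ X[i])
            # subgradient of lam/2 ||w||^2 + max(0, 1 - y w.x)
            grad = lam * w - (y[i] * X[i] if margin < 1 else 0)
            w -= eta * grad
            w_avg += (w - w_avg) / t              # running average of iterates
    return w_avg

# Toy linearly separable data: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
w = sgd_svm(X, y)
acc = np.mean(np.sign(X @ w) == y)
```

Each update touches one sample and O(d) memory, which is why this scales to millions of high-dimensional Fisher/super vectors where kernel SVMs cannot.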
  25. 2012 winner: Univ. of Toronto
     • Deep learning: a huge convolutional neural network trained from raw images
     • Two GPUs, one week of training; won by a roughly 10% margin
  26. Summary
     • Large-scale image recognition is now a hot topic: millions of training images, tens of thousands of categories
     • Scalability is the key issue: linear training methods + compatibly designed features
     • If we can somehow approximate the sample similarity with a dot product, we can simply apply linear methods: explicit embedding, Fisher kernel, KPCA + Nystrom method
     • Personal interest: can we do this with graph kernels?
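The "approximate the similarity with a dot product" idea from the summary can be illustrated with a Nystrom embedding: build an explicit feature map phi from a few landmark points so that phi(x) . phi(x') approximates k(x, x'), after which any linear method applies. This is a generic textbook sketch (RBF kernel, random landmarks), not a specific system from the talk.

```python
# Nystrom sketch: explicit embedding whose dot product approximates a kernel.
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None] - B[None]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystroem_embed(X, landmarks, gamma=0.5):
    W = rbf(landmarks, landmarks, gamma)           # m x m landmark kernel
    w, V = np.linalg.eigh(W)
    w = np.maximum(w, 1e-12)                       # guard tiny eigenvalues
    W_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return rbf(X, landmarks, gamma) @ W_inv_sqrt   # n x m explicit features

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
phi = nystroem_embed(X, X[:20])                    # 20 landmark points
K_approx = phi @ phi.T                             # dot products of phi
K_exact = rbf(X, X)
err = np.abs(K_approx - K_exact).mean()
```

On the landmark points themselves the approximation is exact; elsewhere it improves as more landmarks are used, trading embedding dimension for kernel fidelity.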
