This document discusses methods for large-scale image annotation and categorization using weakly supervised training data. It describes how traditional methods do not scale well to large datasets. Recent methods exploit linear models and distance metric learning to better scale. Specifically, Canonical Contextual Distance learning finds linear transformations to maximize correlation between image and label features in a latent subspace, providing a probabilistic similarity measure. This allows image auto-annotation on large datasets.
1. Features and Learning Methods for Large-scale Image Annotation and Categorization
Hideki Nakayama
The University of Tokyo
Department of Creative Informatics
2013/1/15
2. My research interest
Generic image (object) recognition
Whole-image level recognition
Weakly supervised training samples
Image annotation: assigning multiple words to a whole image
(without region correspondence)
3. The era of big data
We can use gigantic weakly-labeled web data now!
Tags: Nikon, D200, DSLR, Nikkor, 60mmf28dmicro, Nature, Landscape, Lake, Idaho, Ice, Sunset, Sun, Mountain, Sky, Frozen, AnAwesomeShot, ImpressedBeauty, isawyoufirst, ABigFave, Ljomi, ljspot4, ColorPhotoAward
http://www.flickr.com/
Flickr: 6 billion images (2011)
Facebook: 3 billion images every year
YouTube: 8 years of video uploaded every day
4. More data helps recognition
Simple k-NN using Flickr images & tags
(Figure: a query image, its nearest neighbours, and annotation results with 100K / 1.6M / 12M training sets; among the predicted tags, e.g. football, soccer, travel, church, marchingband, the relevant ones win out as the dataset grows)
6. Challenge: scaling to large training data
Traditional methods are not scalable in training
Bag-of-visual words + kernel SVM (chi-square, etc)
complexity: O(N^2) ~ O(N^3), memory: O(N^2) ☹
cf. [Yang et al., CVPR’09]
Recent methods exploit linear models
with carefully designed image features, whose dot product (linear kernel)
approximates the similarity between instances
☺
complexity: O(N), memory: O(1)
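The O(N) / O(1) scaling of linear training can be sketched with a Pegasos-style SGD linear SVM. This is my illustration, not the talk's exact system; data and hyperparameters are made up. Each epoch touches every sample once (O(N) time) and only the weight vector is stored (O(1) extra memory), unlike the N×N kernel matrix a kernel SVM needs.

```python
import numpy as np

def train_linear_svm_sgd(X, y, lam=0.01, epochs=20, seed=0):
    """Pegasos-style SGD on the hinge loss; y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)          # only O(1) state beyond the data itself
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):   # one pass over samples: O(N)
            t += 1
            eta = 1.0 / (lam * t)      # standard Pegasos step size
            if y[i] * (X[i] @ w) < 1:  # hinge loss is active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w
    return w

# Toy linearly separable data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)
w = train_linear_svm_sgd(X, y)
acc = np.mean(np.sign(X @ w) == y)
```

With features designed so the dot product approximates instance similarity, this simple loop replaces the O(N²)–O(N³) kernel machinery.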
8. Example-based image annotation
Standard approach to the image annotation problem:
search for similar training images (K-NN or kernel density estimation)
and transfer their labels to the query
(e.g. tiger, forest, grass, water from tiger images; plane, sky, jet from aircraft images)
MBRM [Feng et al., 2004], JEC [Makadia et al., 2008], TagProp [Guillaumin et al., 2009]
Problem: how to define similarity between image and label data (training samples)?
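A minimal sketch of the tag-transfer idea, in the spirit of the K-NN methods above. Euclidean distance on a toy feature is assumed here, which is exactly the "how to define similarity?" question the slide raises; features, tags, and the voting rule are illustrative only.

```python
import numpy as np

def knn_annotate(query_feat, train_feats, train_tags, k=3, n_out=2):
    """Transfer the most frequent tags of the K nearest training images."""
    d = np.linalg.norm(train_feats - query_feat, axis=1)   # placeholder metric
    nn = np.argsort(d)[:k]                                 # K nearest neighbours
    votes = {}
    for i in nn:
        for t in train_tags[i]:
            votes[t] = votes.get(t, 0) + 1
    ranked = sorted(votes, key=lambda t: (-votes[t], t))   # by vote count
    return ranked[:n_out]

train_feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
train_tags = [["tiger", "grass"], ["tiger", "water"], ["plane", "sky"], ["jet", "sky"]]
tags = knn_annotate(np.array([0.05, 0.02]), train_feats, train_tags, k=2)
```

Everything downstream of the distance function is trivial; the hard part, as the slide says, is the metric itself.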
9. Fundamental problem: Semantic gap
Visually similar ≠ semantically similar
“I look like my dog” contest:
http://www.hemmy.net/2006/06/25/i-look-like-my-dog-contest/
Solution: Distance metric learning
10. Canonical Contextual Distance [Nakayama+, BMVC’10]
Canonical Correlation Analysis (CCA)
x: image features (e.g. BoVW), y: binary label vector
CCA finds linear transformations
  s = Aᵀ(x − μx), t = Bᵀ(y − μy)
that maximize the correlation between s and t, via the eigenproblems
  Σxy Σyy⁻¹ Σyx A = Σxx A Λ², with AᵀΣxx A = I
  Σyx Σxx⁻¹ Σxy B = Σyy B Λ², with BᵀΣyy B = I
Σ: covariance matrices, Λ: canonical correlations
(image feature → canonical space ← label feature)
Similarity measure in the latent subspace using the probabilistic structure:
  latent variable z ~ N(0, I_d), min{p, q} ≥ d ≥ 1
  x|z ~ N(Wx z + μx, Ψx), Wx ∈ R^(p×d)
  y|z ~ N(Wy z + μy, Ψy), Wy ∈ R^(q×d)
Probabilistic interpretation of CCA [Bach and Jordan, 2005]
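The CCA eigenproblem for A can be sketched in a few lines of NumPy. The small ridge term and the toy data are my additions (for numerical stability and demonstration); this is not the talk's implementation.

```python
import numpy as np

def cca(X, Y, d, reg=1e-6):
    """Solve Sxx^-1 Sxy Syy^-1 Syx A = A Lambda^2, with A^T Sxx A = I."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    vals, vecs = np.linalg.eig(M)             # eigvals = squared correlations
    order = np.argsort(-vals.real)[:d]
    A = vecs[:, order].real
    A /= np.sqrt(np.sum(A * (Sxx @ A), axis=0))   # enforce A^T Sxx A = I
    corr = np.sqrt(np.clip(vals.real[order], 0, 1))
    return A, corr

# Toy data: one shared latent factor plus noise dimensions
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])
Y = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])
A, corr = cca(X, Y, d=1)
```

The recovered top canonical correlation is close to 1 here because the first coordinates of X and Y share the latent factor z, mirroring the probabilistic model on the slide.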
11. CCD for image auto-annotation
Posterior statistics of the latent variable (from the probabilistic CCA model):
  E[z|x] = Mxᵀ Aᵀ(x − μx)
  E[z|xi, yi] = [Mxᵀ Myᵀ] [[I, Λ], [Λ, I]]⁻¹ [Aᵀ(xi − μx); Bᵀ(yi − μy)]
  var(z|x) = I − Mx Mxᵀ
  var(z|xi, yi) = I − [Mxᵀ Myᵀ] [[I, Λ], [Λ, I]]⁻¹ [Mx; My]
Annotation score for word w given a query image xs:
  P(w|xs) = Σi P(w|li) P(li|xs)
  P(li|xs) = ∫ p(z|xi, yi) p(z|xs) dz / Σj ∫ p(z|xj, yj) p(z|xs) dz
  P(w|li) ∝ [w ∈ li] · IDF(w)
(w, li: annotations of the training samples)
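A hedged sketch of this annotation rule: the posterior-density integral is replaced here by a placeholder similarity score sims[i], so only the mixture-with-IDF structure of the slide is shown, not the actual CCD computation.

```python
import numpy as np

def annotate(sims, labels, vocab):
    """Score words as a similarity-weighted, IDF-weighted vote over images."""
    n_docs = len(labels)
    # IDF(w): down-weight words that appear in many training images
    idf = {w: np.log(n_docs / sum(w in l for l in labels)) for w in vocab}
    p_li = sims / sims.sum()                 # stands in for P(l_i | x_s)
    scores = {w: sum(p_li[i] * idf[w] for i in range(n_docs) if w in labels[i])
              for w in vocab}
    return max(scores, key=scores.get)       # best-scoring word

labels = [{"tiger", "grass"}, {"grass", "water"}, {"plane", "sky"}]
vocab = ["tiger", "grass", "water", "plane", "sky"]
sims = np.array([0.7, 0.2, 0.1])   # query is closest to the first image
best = annotate(sims, labels, vocab)
```

Note how the IDF term favours "tiger" (which occurs in only one training image) over the more common "grass", even though both appear in the nearest image.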
12. Features
Image features
BoVW, GIST, etc… (off-the-shelf ones)
Needs to be encoded in a Euclidean space
Label features
Binary occurrence vector cf. [Guillaumin et al., CVPR’10]
When the dictionary contains {plane, sea, sky, clouds, mountain}:
  Image Ij labeled {plane, sky, clouds} → yj = (1, 0, 1, 1, 0)
  Image Ik labeled {sky, clouds, mountain} → yk = (0, 0, 1, 1, 1)
  ⟨yj, yk⟩ = 2: the dot product counts the number of common labels.
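The dot-product-counts-common-labels example above, verified in code (the dictionary and tag sets are exactly the slide's):

```python
import numpy as np

vocab = ["plane", "sea", "sky", "clouds", "mountain"]

def to_vec(tags):
    """Binary occurrence vector over the fixed dictionary."""
    return np.array([1 if w in tags else 0 for w in vocab])

y_j = to_vec({"plane", "sky", "clouds"})      # (1, 0, 1, 1, 0)
y_k = to_vec({"sky", "clouds", "mountain"})   # (0, 0, 1, 1, 1)
common = int(y_j @ y_k)                       # counts shared labels: sky, clouds
```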
13. Evaluation
Benchmark datasets
                       Corel5K   IAPR-TC12   ESP Game
# of words                260         291        268
# of training images    4,500      17,665     18,689
# of testing images       499       1,962      2,081
words per image
(avg./max)              3.4/5      5.7/23     4.7/15
16. Basic pipeline
1. Local feature extraction
   1-1. feature detector (operator, grid)
   1-2. descriptor (SIFT, SURF, …)
2. Coding an image-level feature vector (e.g. (0.5, 1.2, 0.1, …))
How to encode similarity between distributions of local features?
17. Bag-of-Visual-Words (traditional) [Csurka et al. 2004]
Vector quantization → histogram
○ computationally efficient
× large reconstruction error
× non-linear property (must be used with non-linear kernel)
(Figure: local features from training images are vector-quantized into visual words, and a query image is described by its histogram over those words. Credit: K. Yanai)
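A toy sketch of the BoVW pipeline: k-means visual words (simplified Lloyd iterations with a naive evenly-spaced initialization, fine for this demo), then a normalised histogram of word assignments as the image signature. Real systems use far larger vocabularies and proper k-means.

```python
import numpy as np

def kmeans(X, k, iters=10):
    """Simple Lloyd iterations; init from evenly spaced samples (toy demo)."""
    centers = X[:: max(1, len(X) // k)][:k].copy()
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(0)
    return centers

def bovw_histogram(descriptors, centers):
    """Vector quantization -> normalised word-count histogram."""
    assign = np.argmin(((descriptors[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    hist = np.bincount(assign, minlength=len(centers)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(1)
# Training descriptors from three well-separated clusters -> three visual words
train_desc = np.vstack([rng.normal(m, 0.1, (50, 2)) for m in (0.0, 1.0, 2.0)])
words = kmeans(train_desc, k=3)
# Descriptors of one "image", all near the middle cluster
h = bovw_histogram(rng.normal(1.0, 0.1, (30, 2)), words)
```

Because all the query descriptors fall near one visual word, the histogram concentrates on a single bin, which is exactly the quantization behaviour (and the source of the large reconstruction error) that the slide criticises.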
18. New BoVW① sparse coding + max pooling
Reduce reconstruction error by using multiple bases (visual words)
Max pooling leads to linearly-separable image signatures
(taking max response for each visual word) cf. [Boureau et al., ICML’10]
[Yang+, CVPR’09] [Wang+, CVPR’10]
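The pooling step can be illustrated as follows; the thresholded random codes merely stand in for real sparse codes (no dictionary learning is performed here).

```python
import numpy as np

rng = np.random.default_rng(0)
codes = rng.random((100, 8))     # 100 local descriptors, 8 visual words
codes[codes < 0.8] = 0.0         # crude sparsification (placeholder for sparse coding)

max_pooled = codes.max(axis=0)   # max response per word -> image signature
avg_pooled = codes.mean(axis=0)  # BoVW-style average pooling, for comparison
```

Max pooling keeps the strongest response per visual word regardless of how many descriptors fired, which is the property [Boureau et al., ICML’10] connect to linearly separable signatures.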
19. New BoVW② encode higher-level statistics
N: # of visual words (10^3 ~ 10^4), d: dimension of descriptor (10 ~ 100)

Method                                 Statistics         Dim. of image signature
BoVW                                   count (ratio)      N
VLAD [Jegou+, CVPR’10]                 mean               Nd
Super vector [Zhou+, ECCV’10]          ratio + mean       N(d+1)
Fisher vector [Perronnin+, ECCV’10]    mean + variance    2Nd
Global Gaussian [Nakayama+, CVPR’10]   mean + covariance  d(d+1)/2 (N=1)
VLAT [Picard+, ICIP’11]                mean + covariance  Nd(d+1)/2

Each is encoded in a feature vector so that the dot product approximates
the distance between distributions of local features.
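A sketch of one row of the table, VLAD [Jegou+, CVPR’10]: for each visual word, accumulate the residuals of the descriptors assigned to it, giving an N·d signature. Hard assignment and L2 normalisation are assumed here (a common but not universal choice).

```python
import numpy as np

def vlad(descriptors, centers):
    """Sum of residuals per visual word, flattened and L2-normalised."""
    N, d = centers.shape
    assign = np.argmin(((descriptors[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    v = np.zeros((N, d))
    for j in range(N):
        if np.any(assign == j):
            v[j] = (descriptors[assign == j] - centers[j]).sum(0)
    v = v.flatten()                            # length N * d
    return v / (np.linalg.norm(v) + 1e-12)     # L2 normalisation

centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # N = 2 visual words, d = 2
desc = np.array([[0.1, 0.0], [0.9, 1.1], [1.1, 0.9]])
sig = vlad(desc, centers)
```

Unlike the BoVW count, the residual sums retain first-order information about where descriptors sit relative to each word, which is why the signature grows to Nd dimensions.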
20. Global Gaussian Coding [Nakayama+, CVPR’10]
Exploit Riemannian manifold of Gaussian
using information geometry framework
Gaussian model of the local descriptors x:
  p(x; θ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))
Affine coordinates (expectation parameters):
  η = (μ1, …, μd, σ11 + μ1², σ12 + μ1μ2, …, σ1d + μ1μd, σ22 + μ2², …, σdd + μd²)ᵀ
Inner product from the inverse of the Fisher information matrix G(η):
  ⟨ηi, ηj⟩ = ηiᵀ G(η̄) ηj
We use G(η̄), the metric at the center of the samples, for the entire space;
this somewhat approximates the KL-divergence.
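The coordinate part of Global Gaussian coding can be sketched as below. The metric G(η̄) from the slide is omitted (identity metric), so this shows only the affine coordinates η built from the mean μ and the second moment E[xxᵀ] = Σ + μμᵀ; the unique upper-triangle entries give the d(d+1)/2 covariance terms from the table.

```python
import numpy as np

def global_gaussian(descriptors):
    """Flatten (mu, upper triangle of E[x x^T]) into one signature vector."""
    mu = descriptors.mean(0)
    second = (descriptors.T @ descriptors) / len(descriptors)  # E[x x^T]
    d = len(mu)
    iu = np.triu_indices(d)               # upper triangle: d(d+1)/2 entries
    return np.concatenate([mu, second[iu]])

rng = np.random.default_rng(0)
desc = rng.normal(0.0, 1.0, (200, 3))     # 200 local descriptors, d = 3
sig = global_gaussian(desc)               # length d + d(d+1)/2 = 3 + 6 = 9
```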
21. Competition
Large-scale visual recognition challenge 2010
1000-class categorization
1.2M training images, 150K testing images
Evaluated by top-5 classification accuracy
Part of ImageNet dataset [Deng et al.]
Labeled with Amazon Mechanical Turk
14M images, 22K categories (as of 2011)
Semantic structure according to WordNet
Credit: Fei-Fei Li
22. Result (2010)
11 teams participated
1. NEC+UIUC (72%): 80,000~260,000 dim ×6
2. Xerox Research (64%): 260,000 dim ×2
3. ISI (55%): 12,000 dim
4. UC Irvine (53%)
5. MIT (46%)
Examples
http://www.isi.imi.i.u-tokyo.ac.jp/pattern/ilsvrc/index.html
23. 2010 Winner: NEC-UIUC
LCC + super vector coding
Ensemble of six classifiers using different features
Parallelized feature extraction (Hadoop)
Linear SVM (Averaging SGD)
LCC → 2 days, super vector → 7 days (with an 8-core machine)
24. 2011 Winner: XRCE
Fisher vector
520K dim ×2 (SIFT, color)
2 days with a 16-core machine
Linear SVM (SGD)
1.5 days with a 16-core machine
25. 2012 Winner: Univ. Toronto
Deep learning
Huge convolutional neural network from raw images
Two GPUs, one week
About 10% more accurate than the runner-up
26. Summary
Large-scale image recognition is now a hot topic
Millions of training images, tens of thousands of categories
Scalability is the key issue
Linear training methods + compatibly designed features
If we can approximate the sample similarity with a dot product (linear
kernel), we can simply apply linear methods!
Explicit embedding
Fisher kernel
KPCA + Nystrom method
Personal interest: Can we do this with graph kernels?
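The Nyström idea from the list above can be sketched as follows: build an explicit embedding φ from a few landmark points so that φ(x)·φ(y) ≈ k(x, y), after which plain linear methods apply. The RBF kernel, landmark choice, and sizes here are all illustrative assumptions.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """Gaussian RBF kernel matrix between row sets A and B."""
    d2 = ((A[:, None] - B[None]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_embed(X, landmarks, gamma=0.5):
    """Explicit map phi with phi(x) . phi(y) ~ k(x, y) (Nystrom)."""
    W = rbf(landmarks, landmarks, gamma)      # m x m kernel on landmarks
    vals, vecs = np.linalg.eigh(W)
    vals = np.clip(vals, 1e-12, None)         # guard tiny eigenvalues
    M = vecs / np.sqrt(vals)                  # V * Lambda^{-1/2}
    return rbf(X, landmarks, gamma) @ M       # K_nm W^{-1/2}-style embedding

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
phi = nystrom_embed(X, X[:10])                # first 10 samples as landmarks
K_approx = phi @ phi.T                        # = K_nm W^{-1} K_mn
K_exact = rbf(X, X)
err = np.abs(K_approx - K_exact).mean()
```

On the landmark points themselves the approximation is exact, and elsewhere it degrades gracefully; the payoff is that φ(X) can be fed to an O(N) linear trainer instead of an O(N²) kernel machine.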