Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Indexing

ADBIS 2013 Conference
Genoa, Italy, Sep 1-4, 2013
Compact and Distinctive Visual Vocabularies for
Efficient Multimedia Data Indexing
Dimitris Kastrinakis1
, Symeon Papadopoulos2
, Athena Vakali1
1
Aristotle University of Thessaloniki, Department of Informatics (AUTH)
2
Centre for Research and Technology Hellas, Information Technologies Institute (CERTH-ITI)

#2
Overview
• Problem formulation
• Related work
• Proposed method
• Evaluation
• Conclusions

#3
Motivation
• Multimedia collections are ever-growing
– Personal photo collections can easily reach several
thousands of photos
– Professional photo archives are typically in the range of
hundreds of thousands to many millions of photos
– Online photos are many billions
– It is estimated that a total of 3.5 trillion photos have been
captured by people so far
• Need for effective and efficient search!
– Prevalent paradigm: Content-based image search
http://blog.1000memories.com/94-number-of-photos-ever-taken-digital-and-analog-in-shoebox

#4
Problem Formulation (1)
Content-based image search (CBIR)
• Also known as “example-based search”, “similarity-
based search”
• Given an indexed collection of images and a query
image (typically not part of the collection) fetch N
images of the collection that are most similar to the
query image
– similarity is typically computed on the basis of Euclidean
distance between feature vectors extracted from the
visual content of images

#5
Problem Formulation (2)
• Recent CBIR systems make use of local descriptors (e.g. SIFT,
SURF) extracted from images, according to the BoW retrieval
paradigm.
– Each image is represented as a set (“bag”) of visual words
• Visual words are the result of a learning process (clustering +
quantization) on a large set of images
– An inverted index structure is used to speed-up retrieval
• Such systems achieve good search accuracy, but:
– For this to happen, the number of visual words in the vocabulary need
to be very large (~105
-106
)
– In case of very large vocabularies, we face two computational
problems: (a) creating the vocabularies (offline), and (b) quantizing
the local descriptors of a new image to the vocabulary words (online)

#6
Overview of BoW indexing
Feature extraction
(e.g. SIFT, SURF)
Feature clustering
+ quantization
Visual
vocabulary
VOCABULARY LEARNING
Feature extraction
(e.g. SIFT, SURF)
Feature-to-
vocabulary
mapping
TRAINING COLLECTION
COLLECTION TO INDEX
QUERY IMAGE
Feature extraction
(e.g. SIFT, SURF)
Feature-to-
vocabulary
mapping
Collection
index
INDEXING
RETRIEVAL

#7
Our contribution
• Propose the concept of Composite Visual Words
(CVW) that reduces the need for many visual words
– CVW are permutations of plain visual words
– Using CVW instead of plain visual words makes it possible
to achieve similar search accuracy levels having much
fewer visual words in the vocabulary
• Experimentally validate our approach in two
standard datasets (Oxford and Paris buildings)
– With a vocabulary of 200 visual words and the use of CVW
we manage to match the retrieval performance of
approaches that use two-three orders of magnitude more
visual words.

#8
Our contribution
Feature extraction
(e.g. SIFT, SURF)
Feature clustering
+ quantization
Visual
vocabulary
VOCABULARY LEARNING
Feature extraction
(e.g. SIFT, SURF)
Feature-to-
vocabulary
mapping
TRAINING COLLECTION
COLLECTION TO INDEX
QUERY IMAGE
Feature extraction
(e.g. SIFT, SURF)
Feature-to-
vocabulary
mapping
Collection
index
INDEXING
RETRIEVAL
CVW

#10
Related Work
• First use of BoW approach for image similarity search (Sivic &
Zisserman 2003)
• Approaches to speed up original method:
– Hierarchical vocabulary tree (Nister & Stewenius, 2007)
– Approximate k-means clustering (Philbin et al., 2007)
• More expressive (and expensive!) representations:
– Soft assignment of descriptors to words (Philbin et al., 2008) 
requires more index space and time
– Visual phrases (Yuan et al., 2007)  does not take into account order
of words, expensive mining phase
– Words + phrases (Zhang et al., 2009)  high accuracy but
computationally intensive learning and still large vocabulary
– Vocabulary that maintains spatial layout of words (Zhang et al., 2010),
bundled features (Wu et al., 2009)  complex vocabulary generation
process + not high coverage of feature space

#11
Proposed Approach (1)
• Instead of indexing using plain visual words, we index
using permutations of visual words based on their
relative distance from the image to be indexed
PLAIN VISUAL WORDS COMPOSITE VISUAL WORDS

#12
• Important distinction:
– Original visual vocabulary (V): Number of visual words
that are the result of the clustering and quantization
process on the training collection.
– Effective visual vocabulary (V’): Number of composite
visual words that can be used for indexing.
Maximum theoretical size:
– For instance, if |V|=100 and B = 3 (the maximum length of
a permutation), then |V|’max = 970,200
This is the main way to increase the distinctive capability
of the vocabulary without increasing complexity.

#13
• Caveat: If the maximum effective vocabulary size increases a
lot (e.g. several millions), the retrieval performance might be
harmed since several CVWs will appear very sparsely.
• For this, we employ a thresholding strategy to make sure that
the resulting CVWs are high-quality.
where d(u,f) is the Euclidean distance between local feature f
and word u, while max d(u’,f) is the maximum distance
between feature f and any word of V. Parameter α controls
the “strictness” of the threshold (larger α means stricter
thresholding).

#14
Algorithmic description of approach

#15
Evaluation
• Datasets: Oxford and Paris buildings
• Implementation:
– Local descriptors: SIFT, 2000 features/image (Oxford),
1000 features/image (Paris)
– Inverted index: Apache Solr
• Evaluation Measure:
mean Average Precision (mAP)

#17
Results (2)
EXAMPLE RESULTS (|V|=200)
Comparison to SoA

#18
Conclusions
SUMMARY
• Approach to increase distinctive capability of visual
vocabulary without harming efficiency.
• Validation in standard datasets demonstrates
significant reduction in vocabulary size while
maintaining state-of-the-art retrieval accuracy.
FUTURE WORK
• Tests in larger datasets (e.g. using millions of images
as distractors)
• Alternative thresholding or CVW filtering strategies.

#19
References (1)
• Classic BoW indexing
Sivic, J., Zisserman, A.: Video Google: A text retrieval
approach to object matching in videos. Ninth IEEE
International Conference on Computer Vision, ICCV (2003)
• Efficient BoW indexing
Nister, D., Stewenius, H.: Scalable recognition with a
vocabulary tree. IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, Vol. 2. (2006)
Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object
retrieval with large vocabularies and fast spatial matching.
IEEE Conference on Computer Vision and Pattern
Recognition, CVPR’07 (2007)

#20
References (2)
• Richer BoW-based representations
Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in
quantization: Improving particular object retrieval in large scale image
databases. IEEE Conference on Computer Vision and Pattern
Recognition, CVPR (2008)
Wu, Z., Ke, Q., Isard, M., Sun, J.: Bundling features for large scale partial-
duplicate web image search. IEEE Conference on Computer Vision and
Pattern Recognition, CVPR (2009)
Yuan, J., Wu, Y., Yang, M.: Discovery of collocation patterns: from visual
words to visual phrases. In IEEE Conference on Computer Vision and
Pattern Recognition, CVPR’07, 1–8, IEEE (2007)
Zhang, S., Tian, Q., Hua, G., Huang, Q., Li, S.: Descriptive visual words and
visual phrases for image applications. In Proceedings of the 17th ACM
international conference on Multimedia, 75–84, ACM (2009)
Zhang, S., Huang, Q., Hua, G., Jiang, S., Gao, W., Tian, Q: Building
contextual visual vocabulary for large-scale image applications. In
Proceedings of the international conference on Multimedia, 501–510,
ACM (2010)

#21
Questions
Further contact:
papadop@iti.gr
Acknowledgement:
www.socialsensor.eu

Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Indexing

Recommended

Recommended

More Related Content

Similar to Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Indexing

Similar to Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Indexing (20)

More from Symeon Papadopoulos

More from Symeon Papadopoulos (20)

Recently uploaded

Recently uploaded (20)

Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Indexing