Mobile Visual Search	


     Oge Marques	

 Florida Atlantic University	

   Boca Raton, FL - USA	


TEWI Kolloquium – 24 Jan 2012
  
Take-home message	



Mobile Visual Search (MVS) is a fascinating research
field with many open challenges and opportunities
which have the potential to impact the way we
organize, annotate, and retrieve visual data (images
and videos) using mobile devices.	





Oge Marques
Outline	

•  This talk is structured in four parts:	


   1.  Opportunities	


   2.  Basic concepts	


   3.  Technical details	


   4.  Examples and applications	


Part I	


Opportunities
Mobile visual search: driving factors	

  •  Age of mobile computing	





http://60secondmarketer.com/blog/2011/10/18/more-mobile-phones-than-toothbrushes/
  
Mobile visual search: driving factors	

  •  Smartphone market	





http://www.idc.com/getdoc.jsp?containerId=prUS23123911
  
Mobile visual search: driving factors	

  •  Smartphone market	





http://www.cellular-news.com/story/48647.php?s=h
  
Mobile visual search: driving factors	

  •  Why do I need a camera? I have a smartphone… 	





http://www.cellular-news.com/story/52382.php
  
Mobile visual search: driving factors	

  •  Powerful devices	





1 GHz ARM Cortex-A9 processor, PowerVR SGX543MP2, Apple A5 chipset

http://www.apple.com/iphone/specs.html
http://www.gsmarena.com/apple_iphone_4s-4212.php
  
Mobile visual search: driving factors	

  •  Social networks and mobile devices (May 2011)

http://jess3.com/geosocial-universe-2/
  
Mobile visual search: driving factors	

  •  Social networks and mobile devices	

           –  Motivated users: image taking and image sharing are
              huge!	





           	



http://www.onlinemarketing-trends.com/2011/03/facebook-photo-statistics-and-insights.html
  
Mobile visual search: driving factors	

                                  •  Instagram: 	

                                          –  13 million registered (although not
                                             necessarily active) users (in 13 months)	

                                          –  7 employees	

                                          –  Several apps based on it!	





                                  	


http://venturebeat.com/2011/11/18/instagram-13-million-users/
  
Mobile visual search: driving factors	

  •  Food photo
     sharing!	





  	





http://mashable.com/2011/05/09/foodtography-infographic/
  
Mobile visual search: driving factors	

  •  Legitimate (or not quite…) needs and use cases	





http://www.slideshare.net/dtunkelang/search-by-sight-google-goggles
https://twitter.com/#!/courtanee/status/14704916575
  
Mobile visual search: driving factors

  •  A natural use case for CBIR with QBE (at last!)

     –  The example is right in front of the user!

Today, the most successful algorithms for content-based image retrieval use an approach referred to as bag of features (BoF) or bag of words (BoW). The BoW idea is borrowed from text retrieval: to find a particular text document, such as a Web page, it is sufficient to use a few well-chosen words.

[FIG1] A snapshot of an outdoor mobile visual search system being used. The system augments the viewfinder with information about the objects it recognizes in the image taken with a camera phone.

[FIG2] A pipeline for image retrieval: features are extracted from the query image, feature matching finds database images that share many features with the query, and the GV step rejects matches whose feature locations are not plausible under a change in viewing position.

Girod et al. IEEE Multimedia 2011
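The BoW analogy with text retrieval can be made concrete with a small sketch. This is a toy illustration, not the deck's implementation: the visual-word IDs below are hypothetical (a real system quantizes local descriptors such as SIFT into words first), and the cosine comparison is one common, simple choice of similarity.

```python
import numpy as np

def bow_histogram(words, vocab_size):
    """Bag-of-words histogram: an image is represented by the counts
    of its quantized local features ('visual words'), just as a text
    document can be represented by word counts."""
    h = np.bincount(words, minlength=vocab_size).astype(float)
    return h / max(h.sum(), 1.0)

def cosine(a, b):
    """Cosine similarity between two normalized word histograms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy vocabulary of 8 visual words; three 'images' as word-ID lists
img_a = np.array([0, 0, 1, 3, 3, 3, 5])
img_b = np.array([0, 1, 3, 3, 5, 5])   # shares most words with A
img_c = np.array([2, 4, 6, 6, 7, 7])   # disjoint vocabulary
ha, hb, hc = (bow_histogram(w, 8) for w in (img_a, img_b, img_c))
```

Images sharing many visual words (A and B) score high; images with disjoint vocabularies (A and C) score near zero, which is exactly how a few well-chosen words identify a document.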
  
Part II	


Basic concepts
MVS: technical challenges	

•  How to ensure low latency (and interactive
   queries) under constraints such as:	

  –  Network bandwidth	

  –  Computational power 	

  –  Battery consumption	

•  How to achieve robust visual recognition in spite
   of low-resolution cameras, varying lighting
   conditions, etc.	

•  How to handle broad and narrow domains	


MVS: Pipeline for image retrieval	





Girod et al. IEEE Multimedia 2011
  
3 scenarios

•  Where the work runs: the client sends the query image and the server performs retrieval; the client extracts and transmits features; or (part of) the database is cached on the device and matching runs locally.





Girod et al. IEEE Multimedia 2011
  
Part III	


Technical details
Part III - Outline	

•  The MVS pipeline in greater detail	


•  Datasets for MVS research	


•  MPEG Compact Descriptors for Visual Search
   (CDVS)	





MVS: descriptor extraction	

    •  Interest point detection	

    •  Feature descriptor computation	





Girod et al. IEEE Multimedia 2011
  
Interest point detection	

   •  Numerous interest-point detectors have been
      proposed in the literature:	

              –  Harris Corners (Harris and Stephens 1988)	

              –  Scale-Invariant Feature Transform (SIFT) Difference-of-
                 Gaussian (DoG) (Lowe 2004)	

              –  Maximally Stable Extremal Regions (MSERs) (Matas et al.
                 2002)	

              –  Hessian affine (Mikolajczyk et al. 2005)	

              –  Features from Accelerated Segment Test (FAST) (Rosten
                 and Drummond 2006)	

              –  Hessian blobs (Bay, Tuytelaars and Van Gool 2006) 	

              –  etc.	


Girod et al. IEEE Signal Processing Magazine 2011
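To make the first entry in the list above concrete, here is a minimal NumPy sketch of the Harris corner response (Harris and Stephens 1988). The box smoothing window, window radius, and k are illustrative simplifications; practical implementations use Gaussian weighting and non-maximum suppression.

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2, where M is
    the (smoothed) second-moment matrix of image gradients. Corners
    are local maxima of R; edges give negative R; flat regions ~0."""
    # Image gradients via central differences
    Ix = np.gradient(img, axis=1)
    Iy = np.gradient(img, axis=0)

    def box(a, r=2):
        # Box-window sum over a (2r+1)x(2r+1) neighborhood
        out = np.zeros_like(a)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out += np.roll(np.roll(a, dy, 0), dx, 1)
        return out

    Sxx, Syy, Sxy = box(Ix * Ix), box(Iy * Iy), box(Ix * Iy)
    return (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2

# Toy example: a white square on black background. The response is
# positive at the square's corners, negative along its edges, and
# zero in the flat interior.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
R = harris_response(img)
```

The single response formula explains the detector's behavior: at a corner both eigenvalues of M are large (det dominates, R > 0), at an edge only one is large (R < 0), and in flat regions both vanish.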
  
Interest point detection	

   •  Different tradeoffs in repeatability and complexity:	

               –  SIFT DoG and other affine interest-point detectors are slow to
                  compute but highly repeatable.

               –  The SURF interest-point detector provides a significant speedup
                  over DoG interest-point detectors by using box filters and
                  integral images for fast computation.

                          •  However, the box-filter approximation causes significant anisotropy, i.e.,
                             the matching performance varies with the relative orientation of query
                             and database images.

               –  The FAST corner detector is extremely fast but offers very low
                  repeatability.


   •  See (Mikolajczyk and Schmid 2005) for a comparative
      performance evaluation of local descriptors in a common
      framework. 	


Girod et al. IEEE Signal Processing Magazine 2011
  
Feature descriptor computation	

   •  After interest-point detection, we compute a
      visual word descriptor on a normalized patch. 	


   •  Ideally, descriptors should be:	

              –  robust to small distortions in scale, orientation, and
                 lighting conditions;	

              –  discriminative, i.e., characteristic of an image or a small
                 set of images;	

              –  compact, due to typical mobile computing constraints.	



Girod et al. IEEE Signal Processing Magazine 2011
  
Feature descriptor computation	

   •  Examples of feature descriptors in the literature:	

              –  SIFT (Lowe 1999)	

               –  Speeded-Up Robust Features (SURF) (Bay et al. 2008)

              –  Gradient Location and Orientation Histogram (GLOH)
                 (Mikolajczyk and Schmid 2005)	

              –  Compressed Histogram of Gradients (CHoG)
                 (Chandrasekhar et al. 2009, 2010)	

    •  See (Winder and Brown CVPR 2007), (Winder, Hua, and Brown
       CVPR 2009), and (Mikolajczyk and Schmid PAMI 2005) for
       comparative performance evaluations of different descriptors.

Girod et al. IEEE Signal Processing Magazine 2011
  
Feature descriptor computation	

   •  What about compactness?	

               –  Several attempts in the literature to compress off-the-shelf
                  descriptors did not lead to the best rate-constrained
                  image-retrieval performance.

              –  Alternative: design a descriptor with compression in
                 mind. 	





Girod et al. IEEE Signal Processing Magazine 2011
  
Feature descriptor computation	

   •  CHoG (Compressed Histogram of Gradients) 
      (Chandrasekhar et al. 2009, 2010)	


              –  Based on the distribution of gradients within a patch of pixels 	


              –  Histogram of gradient (HoG)-based descriptors [e.g. (Lowe
                 2004), (Bay et al. 2008), (Dalal and Triggs 2005), (Freeman and
                 Roth 1994), and (Winder et al. 2009)] have been shown to be
                 highly discriminative at low bit rates.	





Girod et al. IEEE Signal Processing Magazine 2011
  
CHoG: Compressed Histogram of Gradients

[Figure] CHoG descriptor pipeline: a patch around the interest point is divided into spatial bins; gradient distributions (dx, dy) are computed for each bin; the resulting per-bin gradient histograms are compressed to yield the CHoG descriptor bitstream. (Source: Bernd Girod, Mobile Visual Search)

Chandrasekhar et al. CVPR 09, 10
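The patch → spatial binning → gradient histogram → compression pipeline can be sketched with a toy gradient-histogram descriptor. This is in the spirit of CHoG only: the 2x2 spatial binning, 8 orientation bins, and 2-bit rounding below are illustrative choices, not the published CHoG parameters (which use learned bin configurations and entropy coding).

```python
import numpy as np

def hog_like_descriptor(patch, n_orient=8):
    """Toy CHoG-style descriptor: split the patch into 2x2 spatial
    bins, build a magnitude-weighted gradient-orientation histogram
    per bin, then coarsely quantize each normalized histogram so the
    descriptor is compact (each count rounded to a 2-bit level)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)          # gradient orientation in [-pi, pi]
    H, W = patch.shape
    desc = []
    for by in range(2):
        for bx in range(2):
            sl = (slice(by * H // 2, (by + 1) * H // 2),
                  slice(bx * W // 2, (bx + 1) * W // 2))
            hist, _ = np.histogram(ang[sl], bins=n_orient,
                                   range=(-np.pi, np.pi),
                                   weights=mag[sl])
            if hist.sum() > 0:
                hist = hist / hist.sum()
            desc.append(np.round(hist * 3).astype(np.uint8))
    return np.concatenate(desc)       # 4 x 8 values, 2 bits each

# A vertical intensity ramp: all gradient energy falls into a single
# orientation bin in every spatial bin
patch = np.outer(np.linspace(0, 1, 16), np.ones(16))
d = hog_like_descriptor(patch)
```

Quantizing the normalized histograms (rather than raw floats) is what makes the descriptor compact: here each patch costs 64 bits instead of 32 floats.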
  
Encoding descriptor's location information

  •  Location Histogram Coding (LHC)

     –  Rationale: Interest-point locations in images tend to cluster
        spatially.

     –  Encoding the locations of a set of N features as a histogram
        reduces the bit rate by log(N!) compared to encoding each
        feature location in sequence, because the features can be sent
        in any order: there are N! orderings that represent the same
        feature set.

[FIG S3] Interest-point locations in images tend to cluster spatially.

Girod et al. IEEE Signal Processing Magazine 2011
  
Encoding descriptor's location information

  •  Location Histogram Coding (LHC)

  •  Method:

     1.  Generate a 2D histogram from the locations of the descriptors.

         •  Divide the image into spatial bins and count the number of
            features within each spatial bin.

     2.  Compress the binary map, indicating which spatial bins contain
         features, and a sequence of feature counts, representing the
         number of features in occupied bins.

     3.  Encode the binary map using a trained context-based arithmetic
         coder, with the neighboring bins being used as the context for
         each spatial bin.

  •  In experiments, quantizing the (x, y) location to four-pixel blocks
     is sufficient for GV. With a simple fixed-length code, the rate is
     log2(640/4) + log2(640/4), roughly 14 b/feature for a VGA-size
     image; using LHC, the same location data can be transmitted with
     about 5 b/descriptor: roughly a 12.5x reduction in data compared to
     a 64-b floating-point representation and a 2.8x rate reduction
     compared to fixed-length coding.

[FIG S4] We represent the location of the descriptors using a location histogram. The image is first divided into evenly spaced blocks. We enumerate the features within each spatial block by generating a location histogram.

Girod et al. IEEE Signal Processing Magazine 2011
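Steps 1 and 2 of the method above can be sketched directly under the slide's assumptions (VGA image, four-pixel blocks). The helper names `location_histogram` and `ordering_gain_bits` are hypothetical, and the trained context-based arithmetic coding of step 3 is omitted.

```python
import numpy as np
from math import lgamma, log2

def location_histogram(points, img_w=640, img_h=480, block=4):
    """Quantize (x, y) feature locations to block-pixel cells and
    return the binary occupancy map plus the feature counts of the
    occupied cells (steps 1-2; the arithmetic coder is omitted)."""
    hist = np.zeros((img_h // block, img_w // block), dtype=int)
    for x, y in points:
        hist[int(y) // block, int(x) // block] += 1
    binary_map = (hist > 0).astype(np.uint8)
    counts = hist[hist > 0]   # row-major order of occupied cells
    return binary_map, counts

def ordering_gain_bits(n):
    """log2(N!) bits saved because the N locations form an unordered
    set: any of the N! transmission orders encodes the same set."""
    return lgamma(n + 1) / np.log(2)

# Fixed-length baseline from the slide's formula (about 14 b/feature)
fixed_rate = 2 * log2(640 / 4)

# Clustered interest points: two pairs land in the same 4-pixel cell,
# so the occupancy map is much sparser than the point list
pts = [(10, 12), (11, 13), (300, 200), (302, 201)]
bmap, counts = location_histogram(pts)
```

The clustering is exactly what the context-based coder of step 3 exploits: occupied cells predict occupied neighbors, so the binary map compresses well.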
  
MVS: feature indexing and matching	

    •  Goal: produce a data structure that can quickly return a short
       list of the database candidates most likely to match the query
       image. 	

               –  The short list may contain false positives as long as the correct match
                  is included. 	

               –  Slower pairwise comparisons can be subsequently performed on just
                  the short list of candidates rather than the entire database.	





Girod et al. IEEE Multimedia 2011
  
MVS: feature indexing and matching

  •  Vocabulary Tree (VT)-Based Retrieval

     –  Hierarchical k-means clustering is applied to the training
        descriptors assigned to each cluster, generating k smaller
        clusters; this recursive division of the descriptor space is
        repeated until there are enough bins to ensure good
        classification performance.

     –  Figure B1 shows a VT with only two levels, branching factor
        k = 3, and 3^2 = 9 leaf nodes. In practice, the VT can be much
        larger, for example, with height 6, branching factor k = 10,
        and containing 10^6 = 1 million nodes.

     –  The associated inverted index structure maintains two lists for
        each VT leaf node, as shown in Figure B2.

     –  During a query, the VT is traversed for each feature in the
        query image, finishing at one of the leaf nodes. The
        corresponding lists of images and frequency counts are
        subsequently used to compute similarity scores between these
        images and the query image. By pulling images from all these
        lists and sorting them according to the scores, we arrive at a
        subset of database images that is likely to contain a true
        match to the query image.

Figure B. (1) Vocabulary tree and (2) inverted index structures.

Girod et al. IEEE Multimedia 2011
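The two-level VT example (k = 3, 9 leaves) can be sketched with a tiny recursive k-means plus an inverted index. This is a toy: the helper names `kmeans`, `build_vt`, and `quantize` are hypothetical, the 2-D "descriptors" stand in for 128-D SIFT-like vectors, and real trees have on the order of a million leaves as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Minimal Lloyd's k-means used to split the descriptor space."""
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        a = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(a == j):
                C[j] = X[a == j].mean(0)
    return C

def build_vt(X, k=3, depth=2):
    """Vocabulary tree: recursively cluster the training descriptors
    assigned to each node until the desired depth (k**depth leaves)."""
    node = {"centroids": kmeans(X, k), "children": None}
    if depth > 1:
        a = np.argmin(((X[:, None] - node["centroids"][None]) ** 2).sum(-1), 1)
        node["children"] = [build_vt(X[a == j], k, depth - 1)
                            for j in range(k)]
    return node

def quantize(node, x, path=0, k=3):
    """Traverse the VT for one descriptor; return its leaf index."""
    j = int(np.argmin(((node["centroids"] - x) ** 2).sum(-1)))
    if node["children"] is None:
        return path * k + j
    return quantize(node["children"][j], x, path * k + j, k)

# Train on toy descriptors, then build an inverted index mapping each
# leaf ('visual word') to the ids of images whose features landed there
X = rng.normal(size=(300, 2))
tree = build_vt(X, k=3, depth=2)
inverted = {}
for img_id, x in enumerate(X[:10]):
    inverted.setdefault(quantize(tree, x), []).append(img_id)
```

At query time, only the short image lists hanging off the visited leaves are touched, which is what makes VT lookup fast compared to scanning the whole database.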
  
MVS: geometric verification	

    •  Goal: use the location information of features in query and
       database images to confirm that the feature matches are
       consistent with a change in viewpoint between the two images.





Girod et al. IEEE Multimedia 2011
  
MVS: geometric verification

  •  Method: perform pairwise matching of feature descriptors and
     evaluate the geometric consistency of the correspondences.

[FIG4] In the GV step, we match feature descriptors pairwise and find feature correspondences that are consistent with a geometric model. True feature matches are shown in red. False feature matches are shown in green.

Girod et al. IEEE Multimedia 2011
  
MVS: geometric verification	

    •  Techniques: 	

               –  The geometric transform between the query and database
                  image is usually estimated using robust regression
                  techniques such as:	

                          •  Random sample consensus (RANSAC) (Fischler and Bolles 1981)	

                          •  Hough transform (Lowe 2004)	

               –  The transformation is often represented by an affine
                  mapping or a homography. 	


    •  GV is computationally expensive, which is why it’s
       only used for a subset of images selected during the
       feature-matching stage. 	


Girod et al. IEEE Multimedia 2011
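The RANSAC-based estimation described above can be sketched end to end on synthetic correspondences. The affine model matches the slide; the sample size, iteration count, and inlier threshold are illustrative choices, and `ransac_affine` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)

def ransac_affine(src, dst, iters=200, tol=2.0):
    """RANSAC sketch for GV: estimate a 2D affine map dst ~ A @ [x,y,1]
    from putative correspondences and keep the model with the most
    inliers. Returns the best 2x3 affine and the inlier mask."""
    best, best_inl = None, np.zeros(len(src), bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)  # minimal sample
        # Solve the 6 affine parameters exactly from 3 point pairs
        M = np.zeros((6, 6))
        b = np.zeros(6)
        for r, i in enumerate(idx):
            x, y = src[i]
            M[2 * r] = [x, y, 1, 0, 0, 0]
            M[2 * r + 1] = [0, 0, 0, x, y, 1]
            b[2 * r], b[2 * r + 1] = dst[i]
        try:
            A = np.linalg.solve(M, b).reshape(2, 3)
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) sample
        pred = src @ A[:, :2].T + A[:, 2]
        inl = np.linalg.norm(pred - dst, axis=1) < tol
        if inl.sum() > best_inl.sum():
            best, best_inl = A, inl
    return best, best_inl

# Synthetic check: 20 correspondences under a known affine map, plus
# 5 gross outliers standing in for false descriptor matches
src = rng.uniform(0, 100, (25, 2))
A_true = np.array([[0.9, -0.2, 5.0], [0.2, 0.9, -3.0]])
dst = src @ A_true[:, :2].T + A_true[:, 2]
dst[20:] += rng.uniform(30, 60, (5, 2))  # corrupt the last 5 matches
A_est, inliers = ransac_affine(src, dst)
```

The inlier mask is the GV verdict: a candidate image whose best model explains many correspondences is accepted, and the green (false) matches of FIG4 are exactly the rejected outliers.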
  
MVS: geometric reranking

  •  Speed-up step between Vocabulary Tree-based retrieval and
     Geometric Verification.

[FIG5] An image retrieval pipeline can be greatly sped up by incorporating a geometric reranking stage: Query Data -> VT -> Geometric Reranking -> GV -> Identify Information.

Girod et al. IEEE Signal Processing Magazine 2011
  
Fast geometric reranking

•  The location geometric score is computed as follows:
   a)  features of two images are matched based on VT quantization;
   b)  distances between pairs of features within an image are calculated;
   c)  log-distance ratios of the corresponding pairs (denoted by color) are calculated;
   d)  histogram of log-distance ratios is computed.
•  The maximum value of the histogram is the geometric similarity score.
   –  A peak in the histogram indicates a similarity transform between the query and database image.
•  The time required to calculate a geometric similarity score is one to two orders of magnitude less than using RANSAC. Typically, fast geometric reranking is performed on the top 500 images and RANSAC on the top 50 ranked images.

[FIG S7] The location geometric score: (a) features of two images matched via VT quantization; (b) distances between pairs of features within each image; (c) log-distance ratios of the corresponding pairs; (d) histogram of log-distance ratios.

Girod et al. IEEE Multimedia 2011
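The log-distance-ratio scoring above can be sketched in a few lines of Python. This is a toy illustration: the bin count and histogram range are arbitrary choices, not values from the paper.

```python
import numpy as np

def geometric_score(query_pts, db_pts, bins=20):
    """Fast geometric reranking score from log-distance ratios.

    query_pts, db_pts: (N, 2) arrays of coordinates of VT-matched
    feature pairs (query_pts[i] corresponds to db_pts[i]).
    """
    q = np.asarray(query_pts, dtype=float)
    d = np.asarray(db_pts, dtype=float)
    n = len(q)
    if n < 2:
        return 0
    # Distances between all pairs of matched features within each image.
    iu = np.triu_indices(n, k=1)
    dq = np.linalg.norm(q[:, None, :] - q[None, :, :], axis=-1)[iu]
    dd = np.linalg.norm(d[:, None, :] - d[None, :, :], axis=-1)[iu]
    ok = (dq > 0) & (dd > 0)
    # Under a similarity transform all ratios equal the log of the scale
    # factor, so true matches pile up in a single histogram bin.
    ratios = np.log(dq[ok] / dd[ok])
    hist, _ = np.histogram(ratios, bins=bins, range=(-3, 3))
    return int(hist.max())
```

For a database image that is a scaled and translated copy of the query, every feature pair contributes the same log ratio, so the score equals the number of pairs; unrelated images spread their ratios across many bins and score low.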
  
Datasets for MVS research	

   •  Stanford Mobile Visual Search Data Set 
           (http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/)	

             –  Key characteristics:	

                       •  rigid objects	

                       •  widely varying lighting conditions	

                       •  perspective distortion	

                       •  foreground and background clutter	

                       •  realistic ground-truth reference data	

                       •  query data collected from heterogeneous low and high-end
                          camera phones. 	




Chandrasekhar et al. ACM MMSys 2011
  Marques	
  
Stanford Mobile Visual Search (SMVS) Data Set

•  Limitations of popular datasets

     Data Set    Database (#)   Query (#)   Classes (#)
     ZuBuD           1005          115          200
     Oxford          5062           55           17
     INRIA           1491          500          500
     UKY            10200         2550         2550
     ImageNet         11M          15K          15K
     SMVS            1200         3300         1200

Table 1: Comparison of different data sets. “Classes” refers to the number of distinct objects in the data set. “Rigid” refers to whether or not the objects in the database are rigid. “Lighting” refers to whether or not the query images capture widely varying lighting conditions. “Clutter” refers to whether or not the query images contain foreground/background clutter. “Perspective” refers to whether the data set contains typical perspective distortions. “Camera-phone” refers to whether the images were captured with mobile devices. SMVS is a good data set for mobile visual search applications.

[Figure 2] Limitations of popular data sets in computer vision. The left-most image in each row is the database image; the other three images are query images. For example, some categories like ZuBuD, INRIA, and UKY consist of images taken at the same time and location; ImageNet is not suitable for image retrieval applications; and the Oxford data set has different façades of the same building labelled with the same name.

For categories like CDs, DVDs, books, text documents, and business cards, the images were captured indoors under widely varying lighting conditions over several days, with foreground and background clutter that would typically be present in the application (e.g., a picture of a CD might have other CDs in the background). For landmarks, images of buildings in San Francisco were captured, and query images were collected several months after the reference data. For video clips, the query images were taken from laptop, computer, and TV screens to include typical specular distortions. Finally, the paintings were captured at the Cantor Arts Center at Stanford University under controlled lighting conditions typical of museums. The resolution of the query images varies for each camera phone.

Results are reported for three state-of-the-art schemes: (1) Difference-of-Gaussian (DoG) interest point detector and SIFT descriptor, (2) Hessian-affine interest point detector and SIFT descriptor, and (3) Fast Hessian blob interest point detector sped up with integral images, combined with the recently proposed Compressed Histogram of Gradients (CHoG) descriptor, using affine models with a minimum threshold of 10 matches post-RANSAC for declaring a pair of images a valid match. For each category, the percentage of images that match, the average number of features, and the average number of features that match post-RANSAC are reported. Indoor categories are easier than outdoor categories; for example, categories like CDs, DVDs, and book covers achieve over 95% accuracy.

Chandrasekhar et al. ACM MMSys 2011
SMVS Data Set: categories and examples	


   •  Number of query and database images per
      category	





Chandrasekhar et al. ACM MMSys 2011
  Marques	
  
SMVS Data Set: categories and examples	


  •  DVD covers	





http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/dvd_covers.html
  
SMVS Data Set: categories and examples	


  •  CD covers	





http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/cd_covers.html
  
SMVS Data Set: categories and examples	


  •  Museum paintings	





http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/museum_paintings.html
  
Other MVS data sets	





ISO/IEC JTC1/SC29/WG11/N12202 - July 2011, Torino, IT
  
Other MVS data sets	





ISO/IEC JTC1/SC29/WG11/N12202 - July 2011, Torino, IT
  
Other MVS data sets	

   •  Distractor set	

           –  1 million images of various resolutions and content
              collected from Flickr.





ISO/IEC JTC1/SC29/WG11/N12202 - July 2011, Torino, IT
  
MPEG Compact Descriptors for Visual Search (CDVS)	


   •  Objectives	

              –  Define a standard that:	

                        •  enables design of visual search applications	

                        •  minimizes lengths of query requests	

                        •  ensures high matching performance (in terms of reliability and
                           complexity)	

                        •  enables interoperability between search applications and visual databases	

                        •  enables efficient implementation of visual search functionality on mobile
                           devices	

   •  Scope	

              –  It is envisioned that (as a minimum) the standard will specify:	

                        •  bitstream of descriptors	

                        •  parts of descriptor extraction process (e.g. key-point detection) needed
                           to ensure interoperability	



Bober, Cordara, and Reznik (2010)
  
MPEG CDVS	

   •  Requirements	

              –  Robustness	

                        •  High matching accuracy shall be achieved at least for images of textured
                           rigid objects, landmarks, and printed documents. The matching accuracy
                           shall be robust to changes in vantage point, camera parameters, lighting
                           conditions, as well as in the presence of partial occlusions.	

              –  Sufficiency	

                        •  Descriptors shall be self-contained, in the sense that no other data are
                           necessary for matching.	

              –  Compactness	

                        •  Shall minimize lengths/size of image descriptors	

              –  Scalability	

                        •  Shall allow adaptation of descriptor lengths to support the required
                           performance level and database size.	

                        •  Shall enable design of web-scale visual search applications and databases.	


Bober, Cordara, and Reznik (2010)
  
MPEG CDVS	

   •  Requirements (cont’d)	

              –  Image format independence	

                        •  Descriptors shall be independent of the image format	

              –  Extraction complexity	

                        •  Shall allow descriptor extraction with low complexity (in terms of
                           memory and computation) to facilitate video rate implementations	

              –  Matching complexity	

                        •  Shall allow matching of descriptors with low complexity (in terms of
                           memory and computation).	

                        •  If decoding of descriptors is required for matching, such decoding shall
                           also be possible with low complexity.	

              –  Localization

                        •  Shall support visual search algorithms that identify and localize matching
                           regions of the query image and the database image	

                        •  Shall support visual search algorithms that provide an estimate of a
                           geometric transformation between matching regions of the query image
                           and the database image	


Bober, Cordara, and Reznik (2010)
  
MPEG CDVS	





  •  Summarized timeline	

Table 1. Timeline for development of MPEG standard for visual search.

     When            Milestone                            Comments
     March 2011      Call for Proposals is published      Registration deadline: 11 July 2011;
                                                          proposals due: 21 November 2011
     December 2011   Evaluation of proposals              None
     February 2012   1st Working Draft                    First specification and test software model that
                                                          can be used for subsequent improvements.
     July 2012       Committee Draft                      Essentially complete and stabilized specification.
     January 2013    Draft International Standard         Complete specification. Only minor editorial
                                                          changes are allowed after DIS.
     July 2013       Final Draft International Standard   Finalized specification, submitted for approval
                                                          and publication as International Standard.

Among several component technologies for image retrieval, such a standard should focus primarily on defining the format of descriptors and the parts of their extraction process (such as interest point detectors) needed to ensure interoperability.

Girod et al. IEEE Multimedia 2011
MPEG CDVS	

   •  CDVS: evaluation framework	

             –  Experimental setup	

                       •  Retrieval experiment: intended to assess performance of
                          proposals in the context of an image retrieval system	





ISO/IEC JTC1/SC29/WG11/N12202 - July 2011, Torino, IT
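A retrieval experiment of this kind is typically scored with mean average precision over the query set, run against the database plus distractor images. A minimal, self-contained sketch follows; the function names and ID conventions are placeholders for illustration, not part of the CDVS evaluation framework specification.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: ranked_ids is the retrieval order over the
    whole database (matching images plus distractors)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)

def mean_average_precision(results, ground_truth):
    """results: {query_id: ranked id list}; ground_truth: {query_id: relevant ids}."""
    aps = [average_precision(results[q], ground_truth[q]) for q in results]
    return sum(aps) / len(aps)
```

Distractors never count as relevant, so adding them can only push relevant images down the ranking and lower the score, which is exactly what makes a large distractor set a meaningful stress test.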
  
MPEG CDVS

   •  CDVS: evaluation framework

             –  Experimental setup

                       •  Pair-wise matching experiments: intended for assessing
                          performance of proposals in the context of an application
                          that uses descriptors for the purpose of image matching.

[Diagram] Image A → Extract descriptor; Image B → Extract descriptors; both descriptors → Match → check accuracy of search results against annotations → Report.

ISO/IEC JTC1/SC29/WG11/N12202 - July 2011, Torino, IT
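The pair-wise matching step can be sketched as nearest-neighbor descriptor matching with a ratio test, declaring a match when enough correspondences survive. This is a generic illustration, not the CDVS test model: the functions, the 0.8 ratio, and the minimum-match threshold are assumptions (the threshold echoes the 10-matches-post-RANSAC criterion mentioned for the SMVS evaluation).

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbor matching with a ratio test.
    desc_a, desc_b: (N, D) arrays of local descriptors for two images.
    Returns accepted (index_a, index_b) correspondences."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        # Accept only if the best match is clearly better than the runner-up.
        if len(order) >= 2 and dists[order[0]] < ratio * dists[order[1]]:
            matches.append((i, int(order[0])))
    return matches

def images_match(desc_a, desc_b, min_matches=10):
    """Pair-wise matching decision: the pair is declared a match when
    enough descriptor correspondences survive the ratio test."""
    return len(match_descriptors(desc_a, desc_b)) >= min_matches
```

Comparing this boolean decision against the ground-truth annotations over many matching and non-matching pairs yields the true-positive and false-positive rates used to judge a proposal.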
  
MPEG CDVS	

•  For more info: 	

                	

   –  https://mailhost.tnt.uni-hannover.de/mailman/listinfo/cdvs 	

   –  http://mpeg.chiariglione.org/meetings/geneva11-1/geneva_ahg.htm 
      (Ad hoc groups)	





  
Part IV	


Examples and applications
Examples	

•  Academic	

   –  Stanford Product Search System	

•  Commercial 	

   –  Google Goggles	

   –  Kooaba: Déjà Vu and Paperboy	

   –  SnapTell 	

   –  oMoby (and the IQ Engines API)	

   –  pixlinQ	

   –  Moodstocks	


  
Stanford Product Search (SPS) System

          •  Local feature based visual search system

          •  Client-server architecture

[FIG6] Stanford Product Search system: on the client, Image → Feature Extraction → Feature Compression; the query data is sent over the network to the server, which performs VT Matching and GV and returns identification data for display. Because of the large database, the image-recognition server is placed at a remote location. In most systems [1], [3], [7], the query image is sent to the server and feature extraction is performed. In our system, we show that by performing feature extraction on the phone we can significantly reduce the transmission delay and provide an interactive experience.

A list of important references for each module in the matching pipeline is given in Table 2.

[TABLE 2] Summary of references for modules in a matching pipeline.

  Module                          References
  Feature extraction              Harris and Stephens [17], Lowe [15], [23], Matas et al. [18], Mikolajczyk et al. [16], [22], Dalal and Triggs [41], Rosten and Drummond [19], Bay et al. [20], Winder et al. [27], [28], Chandrasekhar et al. [25], [26], Philbin et al. [40]
  Feature indexing and matching   Schmid and Mohr [13], Lowe [15], [23], Sivic and Zisserman [9], Nistér and Stewénius [10], Chum et al. [50], [52], [53], Yeh et al. [51], Philbin et al. [12], Jegou et al. [11], [59], [60], Zhang et al. [54], Chen et al. [58], Perronnin [61], Mikulik et al. [55], Turcot and Lowe [56], Li et al. [57]
  GV                              Fischler and Bolles [66], Schaffalitzky and Zisserman [74], Lowe [15], [23], Chum et al. [53], [70], [71], Ferrari et al. [68], Jegou et al. [11], Wu et al. [69], Tsai et al. [73]

IEEE Signal Processing Magazine, July 2011

Girod et al. IEEE Multimedia 2011
Tsai et al. ACM MM 2010
Stanford Product Search (SPS) System	

    •  Key contributions:	

               –  Optimized feature extraction implementation	

               –  CHoG: a low bit-rate compact descriptor (provides up
                  to 20× bit-rate saving over SIFT with comparable
                  image retrieval performance)	

               –  Inverted index compression to enable large-scale
                  image retrieval on the server	

               –  Fast geometric re-ranking 	




Girod et al. IEEE Multimedia 2011
Tsai et al. ACM MM 2010
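To see why the compact descriptor matters, a back-of-the-envelope query-size comparison can help. The feature count and the per-descriptor bit rates below are illustrative assumptions, not measurements from the paper; only the ~20× saving figure comes from the slide.

```python
# Illustrative query-size comparison between sending uncompressed SIFT
# descriptors and a compact descriptor at ~20x lower bit rate.
NUM_FEATURES = 500               # assumed features per query image
SIFT_BITS = 128 * 8              # 128 dimensions at 1 byte each
COMPACT_BITS = SIFT_BITS // 20   # ~20x bit-rate saving, as claimed for CHoG

sift_kb = NUM_FEATURES * SIFT_BITS / 8 / 1024
compact_kb = NUM_FEATURES * COMPACT_BITS / 8 / 1024
print(f"SIFT query:    {sift_kb:.1f} KiB")
print(f"Compact query: {compact_kb:.1f} KiB")
```

Over a slow 3G uplink, shrinking the query from tens of kilobytes to a few kilobytes is what makes the "send features" architecture interactive.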
  
Stanford Product Search (SPS) System

    •  Two modes:

               –  Send Image mode

               –  Send Feature mode

Mobile image-based retrieval technologies

…including different distances, viewing angles, and lighting conditions, or in the presence of partial occlusions or motion blur. Most successful algorithms for image-based retrieval today use an approach that is referred to as bag of features (BoF) or bag of words (BoW). The BoW idea is borrowed from text document retrieval. To find a particular text document, such as a webpage, it's sufficient to use a few well-chosen words. In the database, the document itself can likewise be represented by a bag of salient words, regardless of where these words appear in the document. For images, robust local features that are characteristic of a particular image take the role of visual words. As with text retrieval, BoF image retrieval does not consider where in the image the features occur, at least in the…

[Figure 1] A snapshot of a mobile visual-search system: the system augments the viewfinder with information about the objects it recognizes in the image taken with a phone camera.

[Figure 2] Mobile visual-search architectures. (a) Image → Image encoding (JPEG) → wireless network → Image decoding → Descriptor extraction → Descriptor matching (Database) → Search results → Process and display results: the mobile phone transmits the compressed query image, and analysis and matching are done entirely on the remote server. (b) Image → Descriptor extraction → Descriptor encoding → wireless network → Descriptor decoding → Descriptor matching (Database) → Search results → Process and display results: local image features (descriptors) are extracted on the phone, then encoded and transmitted over the wireless network, and the server decodes the descriptors and performs the matching. (c) The mobile phone maintains a…
                                                                                                                                                                  descriptors a
                                                                                                                                                                  the databas
                                                                                                                                                                  used by the
                                                                                           Wireless network                    Descriptor
                                                                                                                               matching            Database       search reque
                                                                                                                                                                  perform the
                                                      (b)
                                                                                                                                                                  remote serve
                                                                                                                                                                  (c) The mob
                                                                 Process and                                      Search
                                                                display results                                                                                   object of int
                                                                                                                                                                  maintains a
                                                                   Mobile phone                                   results   Visual search server
                                                                                                                                                                  found in thi
                                                                                                                                                                  the databas
                                                                                                                                                                  further requ
                                                                                                                                                                  search redu
Girod	
  et	
  al.	
  IEEE	
  MulVmedia	
  2011	
     (b)
                                                        Image
                                                                  Descriptor Descriptor                          Descriptor
                                                                                                                                                                  amountserve
                                                                                                                                                                  remote of d
                                                                  extraction  encoding                           decoding
Tsai	
  et	
  al.	
  ACM	
  MM	
  2010	
                                                                                                             Oge	
  Marques	
  
                                                                                                                                                                  over the netw
                                                                                                                                                                  object of int
                                                                  Mobile phone      No                                     Visual search server
                                                                  Descriptor    Found       Wireless network                Descriptor                            found in th
                                                                  matching                                                  matching            Database          further redu
                                                                  Descriptor Descriptor                          Descriptor
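The trade-off between the transmission modes (send the image, send compressed features, or match locally and fall back to the server) can be sketched as a toy client pipeline. This is a minimal illustration with a stub feature extractor standing in for the real CHoG/SIFT descriptors; all function names and sizes here are hypothetical, not part of the systems described above.

```python
import json
import zlib

def extract_descriptors(image_bytes):
    """Stub feature extractor (stand-in for real CHoG/SIFT descriptors)."""
    # Hypothetical: one 8-dimensional "descriptor" per 64-byte window.
    return [list(image_bytes[i:i + 8]) for i in range(0, len(image_bytes) - 8, 64)]

def send_image(image_bytes):
    """(a) Send Image mode: ship the whole image; the server does the rest."""
    return image_bytes

def send_features(image_bytes):
    """(b) Send Feature mode: extract and compress descriptors on the phone."""
    descriptors = extract_descriptors(image_bytes)
    return zlib.compress(json.dumps(descriptors).encode())

def local_match_then_query(image_bytes, local_cache):
    """(c) Local matching mode: contact the server only on a cache miss."""
    descriptors = extract_descriptors(image_bytes)
    key = zlib.crc32(json.dumps(descriptors).encode())  # toy "matching"
    if key in local_cache:
        return local_cache[key], b""            # matched on the phone: no traffic
    return None, send_features(image_bytes)     # fall back to the server

image = bytes(range(256)) * 40                  # fake ~10-KB "image"
print(len(send_image(image)), len(send_features(image)))
```

Even with this crude stand-in, mode (b) transmits far less data than mode (a), which is the point of the architecture comparison.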
Stanford Product Search System	

   •  Performance evaluation	

             –  Dataset: 1 million CD, DVD, and book cover images +
                1,000 query images (500×500) with challenging
                photometric and geometric distortions	





                  [FIG7] Example image pairs from the data set. (a) A
                  clean database picture is matched against (b) a
                  real-world picture with various distortions.

Girod et al., IEEE Multimedia 2011
                                               Oge Marques


Stanford Product Search System

  •  Performance evaluation

       –  Recall vs. bit rate

       –  Client: Nokia 5800 mobile phone with a 300-MHz processor

Figure 7. Comparison of different schemes with regard to classification
accuracy and query size. CHoG descriptor data is an order of magnitude
smaller compared to JPEG images or uncompressed SIFT descriptors.
[Plot omitted: classification accuracy (%) vs. query size (Kbytes) for
send feature (CHoG), send image (JPEG), and send feature (SIFT).]

  •  The server matches features as they arrive and terminates the
     search as soon as a result with a sufficient matching score is
     found, immediately sending the results back; this optimization
     reduces system latency by another factor of two.

Girod et al., IEEE Multimedia 2011
                                               Oge Marques
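The order-of-magnitude gap reported in Figure 7 can be sanity-checked with rough arithmetic. The feature count and per-descriptor sizes below are assumptions chosen for illustration (CHoG is typically quoted at under ~100 bits per descriptor; SIFT is 128 bytes uncompressed), not values read off the plot.

```python
# Back-of-the-envelope query payload sizes (all numbers are rough
# assumptions for illustration, not measurements from the figure).
N_FEATURES = 500        # assumed number of local features per query image
CHOG_BITS = 80          # compressed CHoG: on the order of tens of bits/descriptor
SIFT_BYTES = 128        # uncompressed SIFT: 128 bytes per descriptor
JPEG_KB = 40.0          # a plausible size for a JPEG query image

chog_kb = N_FEATURES * CHOG_BITS / 8 / 1024
sift_kb = N_FEATURES * SIFT_BYTES / 1024
print(f"CHoG: {chog_kb:.1f} KB  SIFT: {sift_kb:.1f} KB  JPEG: {JPEG_KB:.1f} KB")
```

Under these assumptions the CHoG query lands around 5 KB, roughly an order of magnitude below the JPEG and uncompressed-SIFT payloads, consistent with the figure's claim.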
Stanford Product Search System

  •  Performance evaluation

       –  Processing times

  •  The system achieves <1 s server processing latency while
     maintaining high recall.

  •  Transmission delay depends on the type of network used: data
     transmission time is insignificant for a WLAN network because of
     its high bandwidth.

Table 3. Processing times.

  Client-side operations                            Time (s)
  Image capture                                     1–2
  Feature extraction and compression                1–1.5
  (for Send Feature mode)

  Server-side operations                            Time (ms)
  Feature extraction (for Send Image mode)          100
  VT matching                                       100
  Fast geometric reranking (per image)              0.46
  GV (per image)                                    30

[Fig. 9 omitted: measured communication time-out percentage over a 3G
network at indoor and outdoor test locations.]

Girod et al., IEEE Multimedia 2011
Tsai et al., ACM MM 2010
                                               Oge Marques
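A quick budget built from the Table 3 server-side numbers shows why sub-second server latency is plausible. The candidate-list sizes (how many images are reranked and how many are geometrically verified) are hypothetical values chosen for illustration, not figures from the slide.

```python
# Server-side times from Table 3 (milliseconds)
FEATURE_EXTRACTION_MS = 100     # Send Image mode only
VT_MATCHING_MS = 100
RERANK_MS_PER_IMAGE = 0.46      # fast geometric reranking
GV_MS_PER_IMAGE = 30            # geometric verification

# Hypothetical candidate-list sizes (not from the slide)
rerank_candidates = 500
gv_candidates = 10

server_ms = (FEATURE_EXTRACTION_MS + VT_MATCHING_MS
             + rerank_candidates * RERANK_MS_PER_IMAGE
             + gv_candidates * GV_MS_PER_IMAGE)
print(f"server-side budget: {server_ms:.0f} ms")  # well under 1 s
```

Note how the per-image costs dominate: cheap reranking lets the system screen hundreds of candidates, while expensive geometric verification is reserved for a short list.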
  
Stanford Product Search System

  •  Performance evaluation

       –  End-to-end latency

Figure 8. End-to-end latency for different schemes. Compared to a
system transmitting a JPEG query image, a scheme employing progressive
transmission of CHoG features achieves approximately four times
reduction in system latency over a 3G network.
[Plot omitted: response time (seconds), split into feature extraction,
network transmission, and retrieval, for JPEG (3G), feature (3G),
feature progressive (3G), JPEG (WLAN), and feature (WLAN).]

  •  Exchanging compressed visual descriptors rather than images raises
     the question of interoperability across a broad range of devices.
     This question, discussed during the Workshop on Mobile Visual
     Search held at Stanford University, led the US delegation to MPEG
     to propose that a standard for such applications be explored, and
     an exploratory activity in MPEG followed.

Girod et al., IEEE Multimedia 2011
                                               Oge Marques
Examples of commercial MVS apps

  •  Google Goggles

       –  Android and iPhone

       –  Narrow-domain search and retrieval

http://www.google.com/mobile/goggles
                                               Oge Marques
  
SnapTell

  •  One of the earliest (ca. 2008) MVS apps for iPhone

       –  Eventually acquired by Amazon (A9)

  •  Proprietary technique (“highly accurate and robust algorithm for
     image matching: Accumulated Signed Gradient (ASG)”).

http://www.snaptell.com/technology/index.htm
                                               Oge Marques
  
oMoby (and the IQ Engines API)

       –  iPhone app

http://omoby.com/pages/screenshots.php
                                               Oge Marques
oMoby (and the IQ Engines API)

  •  The IQ Engines API: “vision as a service”

http://www.iqengines.com/applications.php
                                               Oge Marques
  
The IQEngines API demo app	

•  Screenshots	





                                       Oge Marques
  
The IQEngines API demo app	

•  XML-formatted response	





                                      Oge Marques
  
Kooaba: Déjà Vu and Paperboy

  •  “Image recognition in the cloud” platform

http://www.kooaba.com/en/home/developers
                                               Oge Marques
Kooaba: Déjà Vu and Paperboy

  •  Déjà Vu

       –  Enhanced digital memories / notes / journal

  •  Paperboy

       –  News sharing from printed media

http://www.kooaba.com/en/products/dejavu
http://www.kooaba.com/en/products/paperboy
                                               Oge Marques
  
pixlinQ

  •  A “mobile visual search solution that enables you to link users
     to digital content whenever they take a mobile picture of your
     printed materials.”

       –  Powered by image recognition from LTU Technologies

http://www.pixlinq.com/home
                                               Oge Marques
pixlinQ

  •  Example app (La Redoute)

http://www.youtube.com/watch?v=qUZCFtc42Q4
                                               Oge Marques
  
Moodstocks

  •  Real-time mobile image recognition that works offline (!)

  •  API and SDK available

http://www.youtube.com/watch?v=tsxe23b12eU
                                               Oge Marques
Moodstocks

  •  Many successful apps for different platforms

http://www.moodstocks.com/gallery/
                                               Oge Marques
  
Concluding thoughts
Concluding thoughts	

•  Mobile Visual Search (MVS) is coming of age.	


•  This is not a fad and it can only grow.	


•  Still a good research topic	

   –  Many relevant technical challenges	

   –  MPEG efforts have just started	



•  Infinite creative commercial possibilities	

                                                     Oge Marques
  
Side note	

•  The power of Twitter…	





                                   Oge Marques
  
Thanks!	

•  Questions?	





•  For additional information: omarques@fau.edu	

                                                 Oge Marques
  

 
IRJET- A Survey on the Enhancement of Video Action Recognition using Semi-Sup...
IRJET- A Survey on the Enhancement of Video Action Recognition using Semi-Sup...IRJET- A Survey on the Enhancement of Video Action Recognition using Semi-Sup...
IRJET- A Survey on the Enhancement of Video Action Recognition using Semi-Sup...IRJET Journal
 
Chapter 1_Introduction.docx
Chapter 1_Introduction.docxChapter 1_Introduction.docx
Chapter 1_Introduction.docxKISHWARYA2
 
IRJET- Criminal Recognization in CCTV Surveillance Video
IRJET-  	  Criminal Recognization in CCTV Surveillance VideoIRJET-  	  Criminal Recognization in CCTV Surveillance Video
IRJET- Criminal Recognization in CCTV Surveillance VideoIRJET Journal
 

Similar to Mobile Visual Search (20)

Volume 2-issue-6-1960-1964
Volume 2-issue-6-1960-1964Volume 2-issue-6-1960-1964
Volume 2-issue-6-1960-1964
 
Volume 2-issue-6-1960-1964
Volume 2-issue-6-1960-1964Volume 2-issue-6-1960-1964
Volume 2-issue-6-1960-1964
 
Image processing project list for java and dotnet
Image processing project list for java and dotnetImage processing project list for java and dotnet
Image processing project list for java and dotnet
 
Presentation1
Presentation1Presentation1
Presentation1
 
FACE EXPRESSION IDENTIFICATION USING IMAGE FEATURE CLUSTRING AND QUERY SCHEME...
FACE EXPRESSION IDENTIFICATION USING IMAGE FEATURE CLUSTRING AND QUERY SCHEME...FACE EXPRESSION IDENTIFICATION USING IMAGE FEATURE CLUSTRING AND QUERY SCHEME...
FACE EXPRESSION IDENTIFICATION USING IMAGE FEATURE CLUSTRING AND QUERY SCHEME...
 
Presentation1
Presentation1Presentation1
Presentation1
 
Presentation1
Presentation1Presentation1
Presentation1
 
IRJET- Real-Time Object Detection System using Caffe Model
IRJET- Real-Time Object Detection System using Caffe ModelIRJET- Real-Time Object Detection System using Caffe Model
IRJET- Real-Time Object Detection System using Caffe Model
 
IRJET - Visual E-Commerce Application using Deep Learning
IRJET - Visual E-Commerce Application using Deep LearningIRJET - Visual E-Commerce Application using Deep Learning
IRJET - Visual E-Commerce Application using Deep Learning
 
J018136669
J018136669J018136669
J018136669
 
pedersen
pedersenpedersen
pedersen
 
Mobile Web Browsing Based On Content Preserving With Reduced Cost
Mobile Web Browsing Based On Content Preserving With Reduced CostMobile Web Browsing Based On Content Preserving With Reduced Cost
Mobile Web Browsing Based On Content Preserving With Reduced Cost
 
Web crawler with email extractor and image extractor
Web crawler with email extractor and image extractorWeb crawler with email extractor and image extractor
Web crawler with email extractor and image extractor
 
A Intensified Approach on Deep Neural Networks for Human Activity Recognition...
A Intensified Approach on Deep Neural Networks for Human Activity Recognition...A Intensified Approach on Deep Neural Networks for Human Activity Recognition...
A Intensified Approach on Deep Neural Networks for Human Activity Recognition...
 
Location based reminder
Location based reminderLocation based reminder
Location based reminder
 
IRJET- Recognition of OPS using Google Street View Images
IRJET-  	  Recognition of OPS using Google Street View ImagesIRJET-  	  Recognition of OPS using Google Street View Images
IRJET- Recognition of OPS using Google Street View Images
 
IRJET- A Survey on the Enhancement of Video Action Recognition using Semi-Sup...
IRJET- A Survey on the Enhancement of Video Action Recognition using Semi-Sup...IRJET- A Survey on the Enhancement of Video Action Recognition using Semi-Sup...
IRJET- A Survey on the Enhancement of Video Action Recognition using Semi-Sup...
 
H0314450
H0314450H0314450
H0314450
 
Chapter 1_Introduction.docx
Chapter 1_Introduction.docxChapter 1_Introduction.docx
Chapter 1_Introduction.docx
 
IRJET- Criminal Recognization in CCTV Surveillance Video
IRJET-  	  Criminal Recognization in CCTV Surveillance VideoIRJET-  	  Criminal Recognization in CCTV Surveillance Video
IRJET- Criminal Recognization in CCTV Surveillance Video
 

More from Förderverein Technische Fakultät

The Digital Transformation of Education: A Hyper-Disruptive Era through Block...
The Digital Transformation of Education: A Hyper-Disruptive Era through Block...The Digital Transformation of Education: A Hyper-Disruptive Era through Block...
The Digital Transformation of Education: A Hyper-Disruptive Era through Block...Förderverein Technische Fakultät
 
Engineering Serverless Workflow Applications in Federated FaaS.pdf
Engineering Serverless Workflow Applications in Federated FaaS.pdfEngineering Serverless Workflow Applications in Federated FaaS.pdf
Engineering Serverless Workflow Applications in Federated FaaS.pdfFörderverein Technische Fakultät
 
The Role of Machine Learning in Fluid Network Control and Data Planes.pdf
The Role of Machine Learning in Fluid Network Control and Data Planes.pdfThe Role of Machine Learning in Fluid Network Control and Data Planes.pdf
The Role of Machine Learning in Fluid Network Control and Data Planes.pdfFörderverein Technische Fakultät
 
Nonequilibrium Network Dynamics_Inference, Fluctuation-Respones & Tipping Poi...
Nonequilibrium Network Dynamics_Inference, Fluctuation-Respones & Tipping Poi...Nonequilibrium Network Dynamics_Inference, Fluctuation-Respones & Tipping Poi...
Nonequilibrium Network Dynamics_Inference, Fluctuation-Respones & Tipping Poi...Förderverein Technische Fakultät
 
East-west oriented photovoltaic power systems: model, benefits and technical ...
East-west oriented photovoltaic power systems: model, benefits and technical ...East-west oriented photovoltaic power systems: model, benefits and technical ...
East-west oriented photovoltaic power systems: model, benefits and technical ...Förderverein Technische Fakultät
 
Advances in Visual Quality Restoration with Generative Adversarial Networks
Advances in Visual Quality Restoration with Generative Adversarial NetworksAdvances in Visual Quality Restoration with Generative Adversarial Networks
Advances in Visual Quality Restoration with Generative Adversarial NetworksFörderverein Technische Fakultät
 
Industriepraktikum_ Unterstützung bei Projekten in der Automatisierung.pdf
Industriepraktikum_ Unterstützung bei Projekten in der Automatisierung.pdfIndustriepraktikum_ Unterstützung bei Projekten in der Automatisierung.pdf
Industriepraktikum_ Unterstützung bei Projekten in der Automatisierung.pdfFörderverein Technische Fakultät
 

More from Förderverein Technische Fakultät (20)

Supervisory control of business processes
Supervisory control of business processesSupervisory control of business processes
Supervisory control of business processes
 
The Digital Transformation of Education: A Hyper-Disruptive Era through Block...
The Digital Transformation of Education: A Hyper-Disruptive Era through Block...The Digital Transformation of Education: A Hyper-Disruptive Era through Block...
The Digital Transformation of Education: A Hyper-Disruptive Era through Block...
 
A Game of Chess is Like a Swordfight.pdf
A Game of Chess is Like a Swordfight.pdfA Game of Chess is Like a Swordfight.pdf
A Game of Chess is Like a Swordfight.pdf
 
From Mind to Meta.pdf
From Mind to Meta.pdfFrom Mind to Meta.pdf
From Mind to Meta.pdf
 
Miniatures Design for Tabletop Games.pdf
Miniatures Design for Tabletop Games.pdfMiniatures Design for Tabletop Games.pdf
Miniatures Design for Tabletop Games.pdf
 
Distributed Systems in the Post-Moore Era.pptx
Distributed Systems in the Post-Moore Era.pptxDistributed Systems in the Post-Moore Era.pptx
Distributed Systems in the Post-Moore Era.pptx
 
Don't Treat the Symptom, Find the Cause!.pptx
Don't Treat the Symptom, Find the Cause!.pptxDon't Treat the Symptom, Find the Cause!.pptx
Don't Treat the Symptom, Find the Cause!.pptx
 
Engineering Serverless Workflow Applications in Federated FaaS.pdf
Engineering Serverless Workflow Applications in Federated FaaS.pdfEngineering Serverless Workflow Applications in Federated FaaS.pdf
Engineering Serverless Workflow Applications in Federated FaaS.pdf
 
The Role of Machine Learning in Fluid Network Control and Data Planes.pdf
The Role of Machine Learning in Fluid Network Control and Data Planes.pdfThe Role of Machine Learning in Fluid Network Control and Data Planes.pdf
The Role of Machine Learning in Fluid Network Control and Data Planes.pdf
 
Nonequilibrium Network Dynamics_Inference, Fluctuation-Respones & Tipping Poi...
Nonequilibrium Network Dynamics_Inference, Fluctuation-Respones & Tipping Poi...Nonequilibrium Network Dynamics_Inference, Fluctuation-Respones & Tipping Poi...
Nonequilibrium Network Dynamics_Inference, Fluctuation-Respones & Tipping Poi...
 
Towards a data driven identification of teaching patterns.pdf
Towards a data driven identification of teaching patterns.pdfTowards a data driven identification of teaching patterns.pdf
Towards a data driven identification of teaching patterns.pdf
 
Förderverein Technische Fakultät.pptx
Förderverein Technische Fakultät.pptxFörderverein Technische Fakultät.pptx
Förderverein Technische Fakultät.pptx
 
The Computing Continuum.pdf
The Computing Continuum.pdfThe Computing Continuum.pdf
The Computing Continuum.pdf
 
East-west oriented photovoltaic power systems: model, benefits and technical ...
East-west oriented photovoltaic power systems: model, benefits and technical ...East-west oriented photovoltaic power systems: model, benefits and technical ...
East-west oriented photovoltaic power systems: model, benefits and technical ...
 
Machine Learning in Finance via Randomization
Machine Learning in Finance via RandomizationMachine Learning in Finance via Randomization
Machine Learning in Finance via Randomization
 
IT does not stop
IT does not stopIT does not stop
IT does not stop
 
Advances in Visual Quality Restoration with Generative Adversarial Networks
Advances in Visual Quality Restoration with Generative Adversarial NetworksAdvances in Visual Quality Restoration with Generative Adversarial Networks
Advances in Visual Quality Restoration with Generative Adversarial Networks
 
Recent Trends in Personalization at Netflix
Recent Trends in Personalization at NetflixRecent Trends in Personalization at Netflix
Recent Trends in Personalization at Netflix
 
Industriepraktikum_ Unterstützung bei Projekten in der Automatisierung.pdf
Industriepraktikum_ Unterstützung bei Projekten in der Automatisierung.pdfIndustriepraktikum_ Unterstützung bei Projekten in der Automatisierung.pdf
Industriepraktikum_ Unterstützung bei Projekten in der Automatisierung.pdf
 
Introduction to 5G from radio perspective
Introduction to 5G from radio perspectiveIntroduction to 5G from radio perspective
Introduction to 5G from radio perspective
 

Recently uploaded

Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoTAnalytics
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastUXDXConf
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyUXDXConf
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka DoktorováCzechDreamin
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomCzechDreamin
 
Server-Driven User Interface (SDUI) at Priceline
Server-Driven User Interface (SDUI) at PricelineServer-Driven User Interface (SDUI) at Priceline
Server-Driven User Interface (SDUI) at PricelineUXDXConf
 
The architecture of Generative AI for enterprises.pdf
The architecture of Generative AI for enterprises.pdfThe architecture of Generative AI for enterprises.pdf
The architecture of Generative AI for enterprises.pdfalexjohnson7307
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1DianaGray10
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationZilliz
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀DianaGray10
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupCatarinaPereira64715
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
 
Transforming The New York Times: Empowering Evolution through UX
Transforming The New York Times: Empowering Evolution through UXTransforming The New York Times: Empowering Evolution through UX
Transforming The New York Times: Empowering Evolution through UXUXDXConf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
 
Enterprise Security Monitoring, And Log Management.
Enterprise Security Monitoring, And Log Management.Enterprise Security Monitoring, And Log Management.
Enterprise Security Monitoring, And Log Management.Boni Yeamin
 
Intelligent Gimbal FINAL PAPER Engineering.pdf
Intelligent Gimbal FINAL PAPER Engineering.pdfIntelligent Gimbal FINAL PAPER Engineering.pdf
Intelligent Gimbal FINAL PAPER Engineering.pdfAnthony Lucente
 

Recently uploaded (20)

Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Server-Driven User Interface (SDUI) at Priceline
Server-Driven User Interface (SDUI) at PricelineServer-Driven User Interface (SDUI) at Priceline
Server-Driven User Interface (SDUI) at Priceline
 
The architecture of Generative AI for enterprises.pdf
The architecture of Generative AI for enterprises.pdfThe architecture of Generative AI for enterprises.pdf
The architecture of Generative AI for enterprises.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Transforming The New York Times: Empowering Evolution through UX
Transforming The New York Times: Empowering Evolution through UXTransforming The New York Times: Empowering Evolution through UX
Transforming The New York Times: Empowering Evolution through UX
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Enterprise Security Monitoring, And Log Management.
Enterprise Security Monitoring, And Log Management.Enterprise Security Monitoring, And Log Management.
Enterprise Security Monitoring, And Log Management.
 
Intelligent Gimbal FINAL PAPER Engineering.pdf
Intelligent Gimbal FINAL PAPER Engineering.pdfIntelligent Gimbal FINAL PAPER Engineering.pdf
Intelligent Gimbal FINAL PAPER Engineering.pdf
 

Mobile Visual Search

  • 1. Mobile Visual Search. Oge Marques, Florida Atlantic University, Boca Raton, FL - USA. TEWI Kolloquium, 24 Jan 2012
  • 2. Take-home message: Mobile Visual Search (MVS) is a fascinating research field with many open challenges and opportunities, which have the potential to impact the way we organize, annotate, and retrieve visual data (images and videos) using mobile devices. Oge Marques
  • 3. Outline • This talk is structured in four parts: 1. Opportunities 2. Basic concepts 3. Technical details 4. Examples and applications
  • 5. Mobile visual search: driving factors • Age of mobile computing. http://60secondmarketer.com/blog/2011/10/18/more-mobile-phones-than-toothbrushes/
  • 6. Mobile visual search: driving factors • Smartphone market. http://www.idc.com/getdoc.jsp?containerId=prUS23123911
  • 7. Mobile visual search: driving factors • Smartphone market. http://www.cellular-news.com/story/48647.php?s=h
  • 8. Mobile visual search: driving factors • Why do I need a camera? I have a smartphone… http://www.cellular-news.com/story/52382.php
  • 9. Mobile visual search: driving factors • Why do I need a camera? I have a smartphone… http://www.cellular-news.com/story/52382.php
  • 10. Mobile visual search: driving factors • Powerful devices: 1 GHz ARM Cortex-A9 processor, PowerVR SGX543MP2, Apple A5 chipset. http://www.apple.com/iphone/specs.html http://www.gsmarena.com/apple_iphone_4s-4212.php
  • 11. Mobile visual search: driving factors • Social networks and mobile devices (May 2011). http://jess3.com/geosocial-universe-2/
  • 12. Mobile visual search: driving factors • Social networks and mobile devices – Motivated users: image taking and image sharing are huge! http://www.onlinemarketing-trends.com/2011/03/facebook-photo-statistics-and-insights.html
  • 13. Mobile visual search: driving factors • Instagram: – 13 million registered (although not necessarily active) users (in 13 months) – 7 employees – Several apps based on it! http://venturebeat.com/2011/11/18/instagram-13-million-users/
  • 14. Mobile visual search: driving factors • Food photo sharing! http://mashable.com/2011/05/09/foodtography-infographic/
  • 15. Mobile visual search: driving factors • Legitimate (or not quite…) needs and use cases. http://www.slideshare.net/dtunkelang/search-by-sight-google-goggles https://twitter.com/#!/courtanee/status/14704916575
  • 16. Mobile visual search: driving factors • A natural use case for CBIR with QBE (at last!) – The example is right in front of the user! [Slide background, excerpted from Girod et al.: today's most successful content-based image retrieval algorithms use a bag-of-features (BoF) or bag-of-words (BoW) approach borrowed from text retrieval; an inverted file index selects a short list of potentially similar database images, and a geometric verification (GV) step checks the spatial pattern of matching features between the query and each candidate to ensure a plausible change in viewing position. Fig. 1: snapshot of an outdoor mobile visual search system that augments the camera-phone viewfinder with information about the objects it recognizes. Fig. 2: a pipeline for image retrieval. For mobile visual search, deployed systems typically transmit the query to a server with a low-latency interactive requirement; for large databases, the inverted file index causes memory swapping that slows the matching stage, and the GV step further increases response time.] Girod et al. IEEE Multimedia 2011
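The bag-of-words retrieval idea mentioned on this slide can be sketched in a few lines: quantize each local descriptor to its nearest "visual word", represent an image as a histogram of word counts, and rank database images by histogram similarity. A minimal pure-Python illustration; the function names and the toy 2-D "descriptors" are invented for the example (real systems use large vocabularies, tf-idf weighting, and inverted files):

```python
# Toy bag-of-visual-words (BoW) retrieval sketch.
import math
from collections import Counter

def quantize(descriptor, vocabulary):
    """Map one local descriptor to the index of its nearest visual word."""
    return min(range(len(vocabulary)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(descriptor, vocabulary[i])))

def bow_histogram(descriptors, vocabulary):
    """Represent an image as a histogram of visual-word counts."""
    return Counter(quantize(d, vocabulary) for d in descriptors)

def cosine(h1, h2):
    """Cosine similarity between two word-count histograms."""
    dot = sum(h1[w] * h2[w] for w in h1)
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Tiny 2-D "descriptors" and a 3-word vocabulary
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
query = bow_histogram([(0.1, 0.0), (0.9, 0.1)], vocabulary)
database = {
    "img_a": bow_histogram([(0.0, 0.1), (1.0, 0.0)], vocabulary),  # same words as query
    "img_b": bow_histogram([(0.0, 0.9), (0.1, 1.0)], vocabulary),  # different words
}
best = max(database, key=lambda k: cosine(query, database[k]))
print(best)  # img_a
```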
  • 18. MVS: technical challenges • How to ensure low latency (and interactive queries) under constraints such as: – Network bandwidth – Computational power – Battery consumption • How to achieve robust visual recognition in spite of low-resolution cameras, varying lighting conditions, etc. • How to handle broad and narrow domains
  • 19. MVS: Pipeline for image retrieval. Girod et al. IEEE Multimedia 2011
  • 20. 3 scenarios. Girod et al. IEEE Multimedia 2011
  • 22. Part III - Outline • The MVS pipeline in greater detail • Datasets for MVS research • MPEG Compact Descriptors for Visual Search (CDVS)
  • 23. MVS: descriptor extraction • Interest point detection • Feature descriptor computation. Girod et al. IEEE Multimedia 2011
  • 24. Interest point detection • Numerous interest-point detectors have been proposed in the literature: – Harris Corners (Harris and Stephens 1988) – Scale-Invariant Feature Transform (SIFT) Difference-of-Gaussian (DoG) (Lowe 2004) – Maximally Stable Extremal Regions (MSERs) (Matas et al. 2002) – Hessian affine (Mikolajczyk et al. 2005) – Features from Accelerated Segment Test (FAST) (Rosten and Drummond 2006) – Hessian blobs (Bay, Tuytelaars and Van Gool 2006) – etc. Girod et al. IEEE Signal Processing Magazine 2011
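As a toy illustration of the first detector on this list, a bare-bones Harris corner response can be computed from image gradients and the local structure tensor: R = det(M) - k * trace(M)^2, which is large at corners, near zero in flat regions, and negative on edges. A pure-Python sketch on a synthetic image (illustrative only; real detectors add Gaussian weighting, non-maximum suppression, and, for SIFT/SURF, scale selection):

```python
# Minimal Harris corner response on a tiny synthetic image.
def harris_response(img, k=0.04):
    h, w = len(img), len(img[0])
    # Central-difference gradients, clamped at the borders
    Ix = [[(img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]) / 2.0
           for x in range(w)] for y in range(h)]
    Iy = [[(img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]) / 2.0
           for x in range(w)] for y in range(h)]
    R = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Structure tensor summed over a 3x3 window
            a = b = c = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ix, iy = Ix[y + dy][x + dx], Iy[y + dy][x + dx]
                    a += ix * ix
                    b += ix * iy
                    c += iy * iy
            R[y][x] = (a * c - b * b) - k * (a + c) ** 2
    return R

# 8x8 image with a bright square: the square's corner should outscore a flat region
img = [[255 if (y >= 4 and x >= 4) else 0 for x in range(8)] for y in range(8)]
R = harris_response(img)
print(R[4][4] > R[1][1])  # True: corner of the square vs. flat background
```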
  • 25. Interest point detection • Different tradeoffs in repeatability and complexity: – SIFT DoG and other affine interest-point detectors are slow to compute but are highly repeatable. – The SURF interest-point detector provides a significant speed-up over DoG interest-point detectors by using box filters and integral images for fast computation. • However, the box filter approximation causes significant anisotropy, i.e., the matching performance varies with the relative orientation of query and database images. – The FAST corner detector is an extremely fast interest-point detector that offers very low repeatability. • See (Mikolajczyk and Schmid 2005) for a comparative performance evaluation of local descriptors in a common framework. Girod et al. IEEE Signal Processing Magazine 2011
  • 26. Feature descriptor computation • After interest-point detection, we compute a visual word descriptor on a normalized patch. • Ideally, descriptors should be: – robust to small distortions in scale, orientation, and lighting conditions; – discriminative, i.e., characteristic of an image or a small set of images; – compact, due to typical mobile computing constraints. Girod et al. IEEE Signal Processing Magazine 2011
  • 27. Feature descriptor computation • Examples of feature descriptors in the literature: – SIFT (Lowe 1999) – Speeded-Up Robust Features (SURF) (Bay et al. 2008) – Gradient Location and Orientation Histogram (GLOH) (Mikolajczyk and Schmid 2005) – Compressed Histogram of Gradients (CHoG) (Chandrasekhar et al. 2009, 2010) • See (Winder, Hua, and Brown CVPR 2007, 2009) and (Mikolajczyk and Schmid PAMI 2005) for comparative performance evaluations of different descriptors. Girod et al. IEEE Signal Processing Magazine 2011
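A common way such descriptors are used is nearest-neighbor matching with Lowe's ratio test: accept a match only when the closest database descriptor is much closer than the second closest, which filters out ambiguous matches. A toy sketch with invented 2-D vectors (real SIFT descriptors are 128-D, SURF typically 64-D):

```python
# Lowe's ratio test for descriptor matching (toy sketch).
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def ratio_test_match(query_desc, db_descs, ratio=0.8):
    """Return the index of the best match, or None if the match is ambiguous."""
    order = sorted(range(len(db_descs)),
                   key=lambda i: euclidean(query_desc, db_descs[i]))
    best, second = order[0], order[1]
    if euclidean(query_desc, db_descs[best]) < ratio * euclidean(query_desc, db_descs[second]):
        return best
    return None

db = [(0.0, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(ratio_test_match((0.1, 0.0), db))   # 0: a clear winner
print(ratio_test_match((5.05, 5.0), db))  # None: two near-identical candidates
```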
  • 28. Feature descriptor computation • What about compactness? – Several attempts in the literature to compress off-the-shelf descriptors did not lead to the best rate-constrained image-retrieval performance. – Alternative: design a descriptor with compression in mind. Girod et al. IEEE Signal Processing Magazine 2011
  • 29. Feature descriptor computation • CHoG (Compressed Histogram of Gradients) (Chandrasekhar et al. 2009, 2010) – Based on the distribution of gradients within a patch of pixels – Histogram-of-gradients (HoG)-based descriptors [e.g., (Lowe 2004), (Bay et al. 2008), (Dalal and Triggs 2005), (Freeman and Roth 1994), and (Winder et al. 2009)] have been shown to be highly discriminative at low bit rates. Girod et al. IEEE Signal Processing Magazine 2011
  • 30. CHoG: Compressed Histogram of Gradients [Figure: pipeline from patch to descriptor: compute gradients (dx, dy) within the patch, apply spatial binning, form a gradient distribution for each spatial bin, then apply histogram compression to obtain the CHoG descriptor.] Bernd Girod: Mobile Visual Search. Chandrasekhar et al. CVPR 09, 10
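The spatial-binning stage of a HoG-style descriptor can be mimicked on a toy scale: split a patch into spatial bins and histogram a coarsely quantized gradient in each bin. This is only the shape of the computation, not CHoG itself, which quantizes joint (dx, dy) distributions and entropy-codes the histograms; the function name and the 3-level gradient quantization are invented for the example:

```python
# Toy spatially binned gradient-histogram descriptor (HoG-style sketch).
def toy_hog(patch):
    n = len(patch)
    half = n // 2
    # 2x2 spatial bins, each holding a 3-bin histogram of the dx sign (-1, 0, +1)
    descriptor = [[0, 0, 0] for _ in range(4)]
    for y in range(n):
        for x in range(n - 1):
            dx = patch[y][x + 1] - patch[y][x]          # horizontal gradient
            sign = 0 if dx == 0 else (1 if dx > 0 else -1)
            bin_idx = (y >= half) * 2 + (x >= half)     # which spatial quadrant
            descriptor[bin_idx][sign + 1] += 1
    return descriptor

patch = [[0, 0, 10, 10]] * 4   # a vertical rising edge in the middle of the patch
d = toy_hog(patch)
print(d)  # the left-column bins record the positive gradient of the edge
```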
  • 31. Encoding descriptor's location information • Location Histogram Coding (LHC) – Rationale: interest-point locations in images tend to cluster spatially [Fig. S3]. Encoding the (x, y) locations of a set of N features as a histogram reduces the bit rate by log(N!) compared to encoding each feature location in sequence, because the features can be sent in any order: there are N! codes that represent the same feature set, so fixing the ordering yields a log(N!) saving. Girod et al. IEEE Signal Processing Magazine 2011
  • 32. Encoding descriptor's location information •  Location Histogram Coding (LHC) – method: 1.  Generate a 2D histogram from the locations of the descriptors: divide the image into evenly spaced spatial bins and count the number of features within each bin. 2.  Compress the binary map indicating which spatial bins contain features, together with a sequence of feature counts representing the number of features in the occupied bins. 3.  Encode the binary map using a trained context-based arithmetic coder, with the neighboring bins used as the context for each spatial bin. •  In experiments, quantizing the (x, y) location to four-pixel blocks is sufficient for GV. With simple fixed-length coding, the rate would be log2(640/4) + log2(480/4) ≈ 14 b/feature for a VGA-size image; LHC transmits the same location data with ≈ 5 b/descriptor — a ≈ 12.5× reduction compared to a 64-b floating-point representation and a ≈ 2.8× rate reduction compared to fixed-length coding [48]. •  Coding gains differ per detector: Hessian-Laplace has the highest gain, followed by the SIFT and SURF interest-point detectors. Even if the feature points are uniformly scattered in the image, LHC still exploits the ordering gain of log(N!) bits. [FIG S4] The locations of the descriptors are represented using a location histogram: the image is divided into evenly spaced blocks and the features within each spatial block are enumerated. Girod  et  al.  IEEE  Signal  Processing  Magazine  2011   Oge  Marques  
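The two ingredients of LHC — the quantized 2D location histogram and the log(N!) ordering gain — can be illustrated with a small sketch (the 640×480 image size and 4-pixel blocks follow the slide; the helper names are my own):

```python
import numpy as np
from math import lgamma

def location_histogram(points, img_w=640, img_h=480, block=4):
    """Quantize (x, y) interest-point locations into `block`-pixel bins
    and build the 2D count histogram used by LHC (illustrative sketch)."""
    hist = np.zeros((img_h // block, img_w // block), dtype=int)
    for x, y in points:
        hist[int(y) // block, int(x) // block] += 1
    return hist

def ordering_gain_bits(n):
    """log2(N!) bits saved because the features can be sent in any order."""
    return lgamma(n + 1) / np.log(2)  # lgamma(n+1) = ln(n!)

pts = [(10, 20), (11, 21), (300, 200), (12, 22)]
h = location_histogram(pts)
print(h.sum())                           # 4 features counted
print(round(ordering_gain_bits(4), 2))   # log2(4!) ≈ 4.58 bits
```

For realistic feature counts the gain is substantial: with N = 500 features, log2(500!) is already several thousand bits.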
  • 33. MVS: feature indexing and matching •  Goal: produce a data structure that can quickly return a short list of the database candidates most likely to match the query image. –  The short list may contain false positives as long as the correct match is included. –  Slower pairwise comparisons can be subsequently performed on just the short list of candidates rather than the entire database. Girod  et  al.  IEEE  Multimedia  2011   Oge  Marques  
  • 34. MVS: feature indexing and matching •  Vocabulary Tree (VT)-Based Retrieval –  Hierarchical k-means clustering is applied to the training descriptors: each cluster is recursively divided into k smaller clusters, until there are enough bins to ensure good classification performance. –  Figure B1 shows a VT with only two levels, branching factor k = 3, and 3² = 9 leaf nodes. In practice, a VT can be much larger, e.g., height 6 and branching factor k = 10, containing 10⁶ = 1 million nodes. –  The associated inverted index maintains two lists for each VT leaf node: the images whose descriptors fall in that leaf, and the corresponding frequency counts. –  During a query, the VT is traversed for each feature in the query image, finishing at one of the leaf nodes. The corresponding lists of images and frequency counts are used to compute similarity scores between these images and the query image. Pulling images from all these lists and sorting them by score yields a subset of database images likely to contain a true match. [Figure B: (1) vocabulary tree and (2) inverted index structures.] Girod  et  al.  IEEE  Multimedia  2011   Oge  Marques  
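A minimal sketch of the VT idea: a two-level tree of centroids quantizes each descriptor to a leaf, and an inverted index of leaf → {image: count} lists scores candidate images. For brevity, "training" here just samples centroids from the training descriptors instead of running full hierarchical k-means, and the scoring is plain count summation (real systems use TF-IDF weighting):

```python
import numpy as np
from collections import defaultdict

class VocabTree:
    """Two-level vocabulary tree sketch with branching factor k and an
    inverted index mapping leaf id -> {image_id: feature count}."""
    def __init__(self, train_desc, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        self.root = train_desc[rng.choice(len(train_desc), k, replace=False)]
        self.leaves = [train_desc[rng.choice(len(train_desc), k, replace=False)]
                       for _ in range(k)]
        self.inverted = defaultdict(lambda: defaultdict(int))

    def quantize(self, d):
        # Greedy descent: nearest centroid at each level.
        i = int(np.argmin(np.linalg.norm(self.root - d, axis=1)))
        j = int(np.argmin(np.linalg.norm(self.leaves[i] - d, axis=1)))
        return i * self.k + j              # leaf id in [0, k*k)

    def add_image(self, image_id, descs):
        for d in descs:
            self.inverted[self.quantize(d)][image_id] += 1

    def query(self, descs):
        # Score each database image by summed frequency counts over the
        # leaves visited by the query descriptors.
        scores = defaultdict(int)
        for d in descs:
            for image_id, c in self.inverted[self.quantize(d)].items():
                scores[image_id] += c
        return dict(scores)

rng = np.random.default_rng(1)
train = rng.normal(size=(100, 8))          # toy 8-dim "descriptors"
vt = VocabTree(train, k=3)
img_a = rng.normal(size=(10, 8))
vt.add_image("A", img_a)
vt.add_image("B", rng.normal(size=(10, 8)) + 8.0)
print(vt.query(img_a))                     # image "A" scores highest
```

Traversing a tree of height h with branching factor k costs only h·k distance computations per descriptor, versus k^h for a flat codebook of the same size — which is what makes million-word vocabularies practical.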
  • 35. MVS: geometric verification •  Goal: use the location information of features in the query and database images to confirm that the feature matches are consistent with a change in viewpoint between the two images. Girod  et  al.  IEEE  Multimedia  2011   Oge  Marques  
  • 36. MVS: geometric verification •  Method: perform pairwise matching of feature descriptors and evaluate the geometric consistency of the correspondences. [FIG4] In the GV step, feature descriptors are matched pairwise and feature correspondences consistent with a geometric model are found. True feature matches are shown in red; false feature matches are shown in green. –  (Related server-side result: index compression reduces memory usage from nearly 10 GB to 2 GB; this five-times reduction avoids swapping between main and virtual memory and yields a substantial speedup in server-side processing.) Girod  et  al.  IEEE  Signal  Processing  Magazine  2011   Oge  Marques  
  • 37. MVS: geometric verification •  Techniques: –  The geometric transform between the query and database image is usually estimated using robust regression techniques such as: •  Random sample consensus (RANSAC) (Fischler and Bolles 1981) •  Hough transform (Lowe 2004) –  The transformation is often represented by an affine mapping or a homography. •  GV is computationally expensive, which is why it is only applied to a subset of images selected during the feature-matching stage. Girod  et  al.  IEEE  Multimedia  2011   Oge  Marques  
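A minimal RANSAC sketch for geometric verification: repeatedly fit a 2D affine map from three random putative correspondences and keep the model with the most geometrically consistent inliers. This is illustrative only — production systems would use a library routine such as OpenCV's `estimateAffine2D` or `findHomography`:

```python
import numpy as np

def ransac_affine(src, dst, n_iter=200, tol=3.0, seed=0):
    """Return a boolean inlier mask over putative correspondences
    (src[i] <-> dst[i]) under the best affine model found."""
    rng = np.random.default_rng(seed)
    n = len(src)
    src_h = np.hstack([src, np.ones((n, 1))])   # homogeneous [x y 1]
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(n, 3, replace=False)   # minimal sample
        try:
            A = np.linalg.solve(src_h[idx], dst[idx])  # 3x2 affine params
        except np.linalg.LinAlgError:
            continue                            # degenerate (collinear) sample
        err = np.linalg.norm(src_h @ A - dst, axis=1)
        inliers = err < tol                     # reprojection test
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# Synthetic data: 25 true matches related by one affine map + 5 outliers.
rng = np.random.default_rng(1)
src = rng.uniform(0, 640, size=(30, 2))
A_true = np.array([[0.9, 0.1], [-0.1, 0.9], [20.0, 10.0]])
dst = np.hstack([src, np.ones((30, 1))]) @ A_true
dst[25:] = rng.uniform(0, 640, size=(5, 2))    # false matches (clutter)
inl = ransac_affine(src, dst)
print(inl.sum())  # 25 (the planted true matches)
```

The cost of this loop over hundreds of candidates is exactly why the pipeline first prunes with the VT short list, and why fast reranking (next slides) is used before full RANSAC.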
  • 38. MVS: geometric reranking •  A speed-up step between Vocabulary Tree matching and full Geometric Verification. –  Weak geometric consistency checks, using the x, y, and scale information of the features, are used to rerank a larger list of candidates before a full GV is performed on the top entries. [FIG5] An image retrieval pipeline can be greatly sped up by incorporating a geometric reranking stage: Query → VT → Geometric Reranking → GV → Identify Data. Girod  et  al.  IEEE  Signal  Processing  Magazine  2011   Oge  Marques  
  • 39. Fast geometric reranking •  The location geometric score is computed as follows: a)  features of two images are matched based on VT quantization; b)  distances between pairs of features within an image are calculated; c)  log-distance ratios of the corresponding pairs (denoted by color) are calculated; d)  a histogram of the log-distance ratios is computed. [FIG S7, panels (a)-(d)] •  The maximum value of the histogram is the geometric similarity score. –  A peak in the histogram indicates a similarity transform between the query and database image. •  The time required to calculate a geometric similarity score this way is one to two orders of magnitude less than using RANSAC. Typically, fast geometric reranking is performed on the top 500 images and RANSAC on the top 50 ranked images. Girod  et  al.  IEEE  Multimedia  2011   Oge  Marques  
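The steps above can be sketched directly: if the query is related to the database image by a similarity transform, every pairwise log-distance ratio equals the log of the scale factor, so the histogram has a single sharp peak. Bin count and range below are illustrative choices:

```python
import numpy as np

def fast_geometric_score(q_pts, d_pts, n_bins=21):
    """Fast geometric reranking sketch: for matched feature pairs,
    histogram the log-ratios of within-image pairwise distances and
    return the histogram maximum as the similarity score."""
    q = np.asarray(q_pts, float)
    d = np.asarray(d_pts, float)
    n = len(q)
    ratios = []
    for i in range(n):
        for j in range(i + 1, n):
            dq = np.linalg.norm(q[i] - q[j])   # distance in query image
            dd = np.linalg.norm(d[i] - d[j])   # distance in database image
            if dq > 0 and dd > 0:
                ratios.append(np.log(dq / dd))
    hist, _ = np.histogram(ratios, bins=n_bins, range=(-3, 3))
    return int(hist.max())

# Query related to the database image by a 2x scale + translation.
rng = np.random.default_rng(0)
db = rng.uniform(0, 100, size=(20, 2))
query = 2.0 * db + 10.0
good = fast_geometric_score(query, db)   # all 190 pairs share one ratio
bad = fast_geometric_score(rng.uniform(0, 100, size=(20, 2)), db)
print(good, bad)  # good == 190; bad is much smaller
```

No model fitting or sampling is needed, which is the source of the one-to-two-orders-of-magnitude speedup over RANSAC.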
  • 40. Datasets for MVS research •  Stanford Mobile Visual Search Data Set (http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/) –  Key characteristics: •  rigid objects •  widely varying lighting conditions •  perspective distortion •  foreground and background clutter •  realistic ground-truth reference data •  query data collected from heterogeneous low- and high-end camera phones. Chandrasekhar  et  al.  ACM  MMSys  2011   Oge  Marques  
  • 41. Stanford Mobile Visual Search (SMVS) Data Set •  Limitations of popular datasets –  Table 1: comparison of data sets. "Classes" = number of distinct objects; "Rigid" = whether the objects in the database are rigid; "Lighting" = whether the query images capture widely varying lighting conditions; "Clutter" = whether the query images contain foreground/background clutter; "Perspective" = whether the data set contains typical perspective distortions; "Camera-phone" = whether the images were captured with mobile devices.

Data Set  | Database (#) | Query (#) | Classes (#)
ZuBuD     | 1005         | 115       | 200
Oxford    | 5062         | 55        | 17
INRIA     | 1491         | 500       | 500
UKY       | 10200        | 2550      | 2550
ImageNet  | 11M          | 15K       | 15K
SMVS      | 1200         | 3300      | 1200

–  SMVS satisfies all of the above criteria, making it a good data set for mobile visual search applications. [Figure 2: limitations of popular data sets — e.g., ZuBuD, INRIA, and UKY consist of images taken at the same time and location; ImageNet is not suitable for image retrieval applications; the Oxford data set has different façades of the same building labelled with the same name. The left-most image in each row is a database image; the other three are query images.]
–  Data collection: for product categories (CDs, DVDs, books, text documents, business cards), images were captured indoors under widely varying lighting conditions over several days, with foreground and background clutter typical of the application (e.g., a picture of a CD might have other CDs in the background). For landmarks, images of buildings in San Francisco were captured, with query images collected several months after the reference data. For video clips, query images were taken from laptop, computer, and TV screens to include typical specular distortions. Paintings were captured at the Cantor Arts Center at Stanford University under controlled lighting conditions typical of museums. The resolution of the query images varies per camera phone; the original high-quality JPEG compressed images are provided.
–  Baseline results (Fig. 3) for three state-of-the-art schemes: (1) SIFT Difference-of-Gaussian (DoG) interest-point detector and SIFT descriptor [27]; (2) Hessian-affine interest-point detector and SIFT descriptor [17]; (3) Fast Hessian blob interest-point detector [2] sped up with integral images, with the Compressed Histogram of Gradients (CHoG) descriptor [4]. Pairwise matching uses affine models with a minimum threshold of 10 post-RANSAC matches to declare a valid match; reported are the percentage of images that match, the average number of features, and the average number of post-RANSAC feature matches per category. Indoor categories are easier than outdoor ones; some categories, like CDs, DVDs, and books, achieve over 95% accuracy.
Chandrasekhar  et  al.  ACM  MMSys  2011   Oge  Marques  
  • 42. SMVS Data Set: categories and examples •  Number of query and database images per category Chandrasekhar  et  al.  ACM  MMSys  2011   Oge  Marques  
  • 43. SMVS Data Set: categories and examples •  DVD covers http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/dvd_covers.html   Oge  Marques  
  • 44. SMVS Data Set: categories and examples •  CD covers http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/cd_covers.html   Oge  Marques  
  • 45. SMVS Data Set: categories and examples •  Museum paintings http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/museum_paintings.html   Oge  Marques  
  • 46. Other MVS data sets ISO/IEC JTC1/SC29/WG11/N12202 – July 2011, Torino, IT   Oge  Marques  
  • 47. Other MVS data sets ISO/IEC JTC1/SC29/WG11/N12202 – July 2011, Torino, IT   Oge  Marques  
  • 48. Other MVS data sets •  Distractor set –  1 million images of various resolution and content collected from Flickr. ISO/IEC JTC1/SC29/WG11/N12202 – July 2011, Torino, IT   Oge  Marques  
  • 49. MPEG Compact Descriptors for Visual Search (CDVS) •  Objectives –  Define a standard that: •  enables design of visual search applications •  minimizes lengths of query requests •  ensures high matching performance (in terms of reliability and complexity) •  enables interoperability between search applications and visual databases •  enables efficient implementation of visual search functionality on mobile devices •  Scope –  It is envisioned that (as a minimum) the standard will specify: •  bitstream of descriptors •  parts of descriptor extraction process (e.g. key-point detection) needed to ensure interoperability Bober,  Cordara,  and  Reznik  (2010)   Oge  Marques  
  • 50. MPEG CDVS •  Requirements –  Robustness •  High matching accuracy shall be achieved at least for images of textured rigid objects, landmarks, and printed documents. The matching accuracy shall be robust to changes in vantage point, camera parameters, lighting conditions, as well as in the presence of partial occlusions. –  Sufficiency •  Descriptors shall be self-contained, in the sense that no other data are necessary for matching. –  Compactness •  Shall minimize lengths/size of image descriptors –  Scalability •  Shall allow adaptation of descriptor lengths to support the required performance level and database size. •  Shall enable design of web-scale visual search applications and databases. Bober,  Cordara,  and  Reznik  (2010)   Oge  Marques  
  • 51. MPEG CDVS •  Requirements (cont’d) –  Image format independence •  Descriptors shall be independent of the image format –  Extraction complexity •  Shall allow descriptor extraction with low complexity (in terms of memory and computation) to facilitate video rate implementations –  Matching complexity •  Shall allow matching of descriptors with low complexity (in terms of memory and computation). •  If decoding of descriptors is required for matching, such decoding shall also be possible with low complexity. –  Localization: •  Shall support visual search algorithms that identify and localize matching regions of the query image and the database image •  Shall support visual search algorithms that provide an estimate of a geometric transformation between matching regions of the query image and the database image Bober,  Cordara,  and  Reznik  (2010)   Oge  Marques  
  • 52. MPEG CDVS •  Summarized timeline – Table 1: timeline for development of the MPEG standard for visual search.

When          | Milestone                          | Comments
March 2011    | Call for Proposals is published    | Registration deadline: 11 July 2011; proposals due: 21 November 2011
December 2011 | Evaluation of proposals            | None
February 2012 | 1st Working Draft                  | First specification and test software model that can be used for subsequent improvements
July 2012     | Committee Draft                    | Essentially complete and stabilized specification
January 2013  | Draft International Standard       | Complete specification; only minor editorial changes are allowed after DIS
July 2013     | Final Draft International Standard | Finalized specification, submitted for approval and publication as International Standard

•  Among the several component technologies for image retrieval, the standard should focus primarily on defining the format of descriptors and the parts of their extraction process (such as interest point detectors) needed to ensure interoperability, building on existing standards such as MPEG Query Format, HTTP, XML, JPEG, and JPSearch. Girod  et  al.  IEEE  Multimedia  2011   Oge  Marques  
  • 53. MPEG CDVS •  CDVS: evaluation framework –  Experimental setup •  Retrieval experiment: intended to assess the performance of proposals in the context of an image retrieval system. ISO/IEC JTC1/SC29/WG11/N12202 – July 2011, Torino, IT   Oge  Marques  
  • 54. MPEG CDVS •  CDVS: evaluation framework –  Experimental setup •  Pair-wise matching experiments: intended for assessing the performance of proposals in the context of an application that uses descriptors for the purpose of image matching. [Diagram: Image A → extract descriptor; Image B → extract descriptors; match; check accuracy of search results against annotations; report.] ISO/IEC JTC1/SC29/WG11/N12202 – July 2011, Torino, IT   Oge  Marques  
  • 55. MPEG CDVS •  For more info: –  https://mailhost.tnt.uni-hannover.de/mailman/listinfo/cdvs –  http://mpeg.chiariglione.org/meetings/geneva11-1/geneva_ahg.htm (Ad hoc groups) Oge  Marques  
  • 56. Part IV Examples and applications
  • 57. Examples •  Academic –  Stanford Product Search System •  Commercial –  Google Goggles –  Kooaba: Déjà Vu and Paperboy –  SnapTell –  oMoby (and the IQ Engines API) –  pixlinQ –  Moodstocks Oge  Marques  
  • 58. Stanford Product Search (SPS) System •  Local feature-based visual search system •  Client-server architecture –  Client (mobile phone): query image → feature extraction → feature compression → wireless network –  Server: VT matching → GV → identification data → back to the client for display [TABLE 2] Summary of references for modules in a matching pipeline: Feature extraction — Harris and Stephens [17]; Lowe [15], [23]; Matas et al. [18]; Mikolajczyk et al. [16], [22]; Dalal and Triggs [41]; Rosten and Drummond [19]; Bay et al. [20]; Winder et al. [27], [28]; Chandrasekhar et al. [25], [26]; Philbin et al. [40]. Feature indexing and matching — Schmid and Mohr [13]; Lowe [15], [23]; Sivic and Zisserman [9]; Nistér and Stewénius [10]; Chum et al. [50], [52], [53]; Yeh et al. [51]; Philbin et al. [12]; Jegou et al. [11], [59], [60]; Zhang et al. [54]; Chen et al. [58]; Perronnin [61]; Mikulik et al. [55]; Turcot and Lowe [56]; Li et al. [57]. GV — Fischler and Bolles [66]; Schaffalitzky and Zisserman [74]; Lowe [15], [23]; Chum et al. [53], [70], [71]; Ferrari et al. [68]; Jegou et al. [11]; Wu et al. [69]; Tsai et al. [73]. [FIG6] Stanford Product Search system. Because of the large database, the image-recognition server is placed at a remote location. In most systems, the query image is sent to the server and feature extraction is performed there; by performing feature extraction on the phone, the SPS system significantly reduces the transmission delay and provides an interactive experience. Girod  et  al.  IEEE  Signal  Processing  Magazine  2011   Tsai  et  al.  ACM  MM  2010   Oge  Marques  
  • 59. Stanford Product Search (SPS) System •  Key contributions: –  Optimized feature extraction implementation –  CHoG: a low bit-rate compact descriptor (provides up to 20× bit-rate savings over SIFT with comparable image retrieval performance) –  Inverted index compression to enable large-scale image retrieval on the server –  Fast geometric reranking Girod  et  al.  IEEE  Multimedia  2011   Tsai  et  al.  ACM  MM  2010   Oge  Marques  
  • 60. Stanford Product Search (SPS) System •  Mobile image-based retrieval: most successful algorithms for image-based retrieval today use an approach referred to as bag of features (BoF) or bag of words (BoW). –  The BoW idea is borrowed from text document retrieval: to find a particular text document, such as a webpage, it's sufficient to use a few well-chosen words. In the database, the document itself can likewise be represented by a bag of salient words, regardless of where those words appear in the document. For images, robust local features that are characteristic of a particular image take the role of visual words. As with text retrieval, BoF image retrieval does not consider where in the image the features occur, at least in the initial stages of retrieval. [Figure 1: an outdoor visual-search system recognizes objects in the image taken through the viewfinder of a phone camera.] •  Two modes: –  Send Image mode: the mobile phone compresses the query image (JPEG) and transmits it over the wireless network; image decoding, descriptor extraction, descriptor matching against the database, and search are done entirely on the remote server, which sends results back to the phone for display. –  Send Feature mode: local image features (descriptors) are extracted on the phone, encoded, and transmitted over the network; the server decodes the descriptors and uses them to perform the database matching. –  (A third variant keeps a cache of the database on the phone, so a search request is sent to the remote server only if the object of interest is not found in the cache, further reducing the amount of data sent over the network.) [Figure 2: mobile visual search architectures (a)-(c).] Girod  et  al.  IEEE  Multimedia  2011   Tsai  et  al.  ACM  MM  2010   Oge  Marques  
  • 61. Stanford Product Search System •  Performance evaluation –  Dataset: 1 million CD, DVD, and book cover images + 1,000 query images (500×500 pixel resolution) with challenging photometric and geometric distortions. [FIG7] Example image pairs from the data set: (a) a clean database picture is matched against (b) a real-world picture with various distortions. –  Client: a Nokia 5800 mobile phone with a 300-MHz CPU. Girod  et  al.  IEEE  Signal  Processing  Magazine  2011   Oge  Marques  
  • 62. Stanford Product Search System •  Performance evaluation –  Recall vs. bit rate [Figure 7] Comparison of three schemes — Send Feature (CHoG), Send Image (JPEG), and Send Feature (SIFT) — with regard to classification accuracy and query size: CHoG descriptor data is an order of magnitude smaller than JPEG images or uncompressed SIFT descriptors, at comparable accuracy. –  The server processes features as they arrive; once it finds a result with a sufficient matching score, it terminates the search and immediately sends the results back. This optimization reduces system latency by another factor of two. –  Overall, the SPS system demonstrates that mobile visual-search systems can achieve high recognition accuracy, scale to realistic databases, and deliver search results in acceptable time. Girod  et  al.  IEEE  Multimedia  2011   Oge  Marques  
  • 63. Stanford Product Search System •  Performance evaluation –  The system achieves <1 s server processing latency while maintaining high recall. –  Transmission delay depends on the type of network used; data transmission time is insignificant for a WLAN network because of the high bandwidth. [TABLE 3] Processing times:

Client-side operations                                  | Time (s)
Image capture                                           | 1–2
Feature extraction and compression (Send Feature mode)  | 1–1.5

Server-side operations                                  | Time (ms)
Feature extraction (Send Image mode)                    | 100
VT matching                                             | 100
Fast geometric reranking (per image)                    | 0.46
GV (per image)                                          | 30

Girod  et  al.  IEEE  Signal  Processing  Magazine  2011   Tsai  et  al.  ACM  MM  2010   Oge  Marques  
  • 64. Stanford Product Search System •  Performance evaluation –  End-to-end latency [Figure 8] End-to-end latency for different schemes (components: feature extraction, network transmission, retrieval; schemes: JPEG (3G), Feature (3G), Feature progressive (3G), JPEG (WLAN), Feature (WLAN)). Compared to a system transmitting a JPEG query image, a scheme employing progressive transmission of CHoG features achieves approximately a four-times reduction in system latency over a 3G network. Girod  et  al.  IEEE  Multimedia  2011   Oge  Marques  
  • 65. Examples of commercial MVS apps •  Google Goggles –  Android and iPhone –  Narrow-domain search and retrieval http://www.google.com/mobile/goggles   Oge  Marques  
  • 66. SnapTell •  One of the earliest (ca. 2008) MVS apps for iPhone –  Eventually acquired by Amazon (A9) •  Proprietary technique ("highly accurate and robust algorithm for image matching: Accumulated Signed Gradient (ASG)") http://www.snaptell.com/technology/index.htm   Oge  Marques  
  • 67. oMoby (and the IQ Engines API) –  iPhone app http://omoby.com/pages/screenshots.php   Oge  Marques  
  • 68. oMoby (and the IQ Engines API) •  The IQ Engines API: "vision as a service" http://www.iqengines.com/applications.php   Oge  Marques  
  • 69. The IQEngines API demo app •  Screenshots Oge  Marques  
  • 70. The IQEngines API demo app •  XML-formatted response Oge  Marques  
  • 71. Kooaba: Déjà Vu and Paperboy •  "Image recognition in the cloud" platform http://www.kooaba.com/en/home/developers   Oge  Marques  
  • 72. Kooaba: Déjà Vu and Paperboy •  Déjà Vu –  Enhanced digital memories / notes / journal •  Paperboy –  News sharing from printed media http://www.kooaba.com/en/products/dejavu   http://www.kooaba.com/en/products/paperboy   Oge  Marques  
  • 73. pixlinQ •  A "mobile visual search solution that enables you to link users to digital content whenever they take a mobile picture of your printed materials." –  Powered by image recognition from LTU technologies http://www.pixlinq.com/home   Oge  Marques  
  • 74. pixlinQ •  Example app (La Redoute) http://www.youtube.com/watch?v=qUZCFtc42Q4   Oge  Marques  
  • 75. Moodstocks •  Real-time mobile image recognition that works offline (!) •  API and SDK available http://www.youtube.com/watch?v=tsxe23b12eU   Oge  Marques  
  • 76. Moodstocks •  Many successful apps for different platforms http://www.moodstocks.com/gallery/   Oge  Marques  
  • 78. Concluding thoughts •  Mobile Visual Search (MVS) is coming of age. •  This is not a fad and it can only grow. •  Still a good research topic –  Many relevant technical challenges –  MPEG efforts have just started •  Infinite creative commercial possibilities Oge  Marques  
  • 79. Side note •  The power of Twitter… Oge  Marques  
  • 80. Thanks! •  Questions? •  For additional information: omarques@fau.edu Oge  Marques