INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO
ISO/IEC JTC1/SC29/WG11
MPEG2011/m19188
January 2011, Daegu, Korea
Source Peking University, Harbin Institute of Technology, China
Status Input Contribution
Title Peking University Landmarks: A Context Aware Visual Search Benchmark Database
Author lingyu@pku.edu.cn, Lingyu Duan
jirongrong@gmail.com, Rongrong Ji
cjie@pku.edu.cn, Jie Chen
syang@pku.edu.cn, Shuang Yang
tjhuang@pku.edu.cn, Tiejun Huang
hongxun.yao@gmail.com, Hongxun Yao
wgao@pku.edu.cn, Wen Gao
1 Introduction
The 93rd MPEG meeting output the draft requirements documents (w11529, w11530 and w11531) of
Compact Descriptors for Visual Search. To advance this work, this contribution presents our
work on establishing a context aware visual search benchmark database for mobile landmark
search. In the input contribution m18542 at the 94th MPEG Meeting [9], Peking University
proposed a compact descriptor for visual search, which combines location cues to learn a
discriminative and compact visual descriptor that is well suited to mobile landmark search.
We believe our practice as well as the benchmark dataset would enhance the use cases and
help identify requirements for Compact Descriptors for Visual Search.
While there has been ever-growing focus on mobile visual search in recent years, a
comprehensive benchmark database for fair evaluation among different strategies is still missing.
In particular, the rich contextual cues on mobile devices, such as GPS information and camera
parameters, are left unexploited in the current visual search benchmarks. This contribution
introduces the Peking University Landmarks benchmark for the quantitative evaluation of mobile
visual search performance with the support of GPS information. It contains 13,179 images
organized into 198 distinct landmark locations within the Peking University campus, built by
20 volunteers during November and December 2010. Each location is captured with multiple shot
sizes and viewing angles, using both digital cameras and phone cameras, and each photo is
tagged with the rich contextual information available in mobile scenarios. Moreover, this
benchmark studies typical quality degeneration scenarios in mobile photographing, including
variable resolutions, blurring, lighting changes, occlusions, as well as various viewing angles.
Together with this benchmark, we provide bag-of-visual-words search baselines that use either
visual or contextual information in the returned image ranking. Finally, distractor images are
further introduced into the database to evaluate the robustness of visual search methods.
2 Motivation
Coming with the explosive growth of phone cameras, mobile visual search has received
increasing interest in computer vision, multimedia analysis, and information retrieval
communities. However, state-of-the-art works are rarely compared with each other over a well-
established benchmark database, which should be designed to target real-world mobile visual
search scenarios that involve lots of photographing variances using phone cameras. In addition,
the rich contextual cues, such as GPS, time stamp, and base station information, are extremely
beneficial to refine solely visual ranking. However, the effectiveness and robustness of such cues
are left unexploited in the existing visual search benchmarks.
We believe a real-world, context rich benchmark with sufficient coverage of users'
photographing variances is important to push forward mobile visual search research and
applications. In this contribution, we introduce the Peking University Landmarks benchmark to
evaluate GPS context assisted mobile visual search performance. The dataset is collected from
198 landmark locations within the Peking University campus. Our benchmark provides
sufficient real-world photographing variances, typically for mobile phone cameras. We put more
focus on the availability of contextual cues to improve the visual search performance, with a
systematic methodology to evaluate the robustness of the contextual cue by adding context
distractors.
3 Benchmark Database Statistics
Scale and Constitution: The Peking University Landmarks benchmark (PKUBench)
contains 13,179 scene photos, organized into 198 landmark locations and captured via both
digital and phone cameras. There are in total 6,193 photos captured from digital cameras (SONY
DSC-W290, Samsung Techwin <Digimax S830 / Kenox S830>, Canon DIGITAL IXUS 100 IS,
NIKON COOLPIX L12 and Canon IXUS 210, with resolution 2592×1944) and 6,986 photos
from mobile phone cameras (Nokia E72-1, HTC Desire, Nokia 5235, Apple iPhone, Apple
iPhone 3G and LG Electronics KP500, with resolutions 640×480, 1600×1200 and 2048×1536)
respectively. We recruited over 20 volunteers for data acquisition; each landmark is captured by a
pair of volunteers, one using a digital camera and the other using a mobile phone, with a portable
GPS device (HOLUX M-1200E) with them. The averaged viewing angle variation between the
digital and phone camera photographers is within 10 degrees for both volunteers. Note that both
blurring and shaking are more frequent happenings in mobile phone capturing. In such cases,
the volunteers compensate their bad photo with a new one, which thus produces more mobile
phone photos than digital camera ones. All the images in the entire database were collected
during November and December, 2010.
Fig. 1. Two typical scenarios of capturing landmark photos in different shot sizes and angles.
As illustrated in Figure 1, we capture the photos in three different shot sizes, namely long shot,
medium shot and close-up. For each shot size, there are at most 8 directions in photographing,
which attempt to cover 360 degrees around the frontal view of the landmark, captured every 45
degrees respectively. The capturing of both digital camera and mobile phone photos undergoes
different weathers (sunny, cloudy, etc.) during November and December. The photo distributions
with respect to landmark locations are given in Figure 2. Different colors denote the sample
images of different landmarks. The percentages of mobile and camera photos are given in Figure 3.
Fig. 2. The landmark photo distribution, obtained by overlaying the location point of each collected photo on the Google Map of the Peking University campus.
Fig. 3. The percentages of both phone camera and digital camera photos in PKUBench.
Contextual Cues: Comparing with generalized visual search, the mobile visual search
scenario is closely related to the rich contextual information on the mobile phone. For instance, a
mobile user's geographical location can be leveraged to pre-filter most of the unrelated scenes
without visual ranking. Over PKUBench, we pay particular attention to the use of such contextual
cues in facilitating visual search, including: (1) GPS tag (both latitude and longitude); (2) landmark
name label; (3) shot size (long, medium, and close-up) and viewpoint (frontal, side, and others)
of each photo; (4) camera type (digital camera or mobile phone camera); (5) capture time
stamp. We also provide EXIF information: camera settings (focal length, resolution).
In addition, we will show the performance improvement of using contextual information by
providing baselines that leverage GPS to refine visual ranking. Furthermore, the effects of less
precise contextual information are also investigated by adding distractor images and by imposing
random GPS distortions on the original GPS location of an image.
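The distortion step above can be sketched as follows. The offset model and the maximum noise radius are our own assumptions for illustration, since the contribution only states that random GPS distortion is imposed; the function name is ours as well.

```python
import math
import random

def distort_gps(lat, lon, max_offset_m=200.0, rng=random.Random(0)):
    """Impose a random offset (up to max_offset_m meters) on a GPS tag,
    simulating the less precise location of a distractor image."""
    bearing = rng.uniform(0.0, 2.0 * math.pi)          # random direction
    dist = rng.uniform(0.0, max_offset_m)              # random magnitude
    # Convert the metric offset to degrees (approximation near the given latitude).
    dlat = (dist * math.cos(bearing)) / 111320.0
    dlon = (dist * math.sin(bearing)) / (111320.0 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon
```

A seeded generator is used so that the same distractor set can be regenerated reproducibly.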
Scene Diversity: We provide as diverse landmark appearances as possible to simulate the
real-world difficulty in visual search. Hence, the volunteers are encouraged to capture both
queries and the ground truth photos (for both digital and phone cameras) without any particular
intent to avoid intruding foreground objects, e.g. cars, human faces, and tree occlusions.
4 Comparing with Related Benchmarks
ZuBuD Database [2] is widely adopted to evaluate vision-based geographical location
recognition. It contains 1,005 color images of 201 buildings or scenes (5 images per
building or scene) in Zurich, Switzerland.
Oxford Buildings Database [3] contains 5,062 images collected from Flickr by
searching for particular Oxford landmarks, with manually annotated ground truth for 11 different
landmarks, each represented by 5 possible queries.
SCity Database [4] contains 20,000 street-side photos used for mobile visual search
validation in the Microsoft Photo2Search system [4]. It was captured automatically along the
main streets of Seattle by a car equipped with six surrounding cameras and a GPS device. The
location of each captured photo is obtained by aligning the time stamps of the photos with the
GPS record.
UKBench Database [5] contains 10,000 images of 2,500 objects, including indoor
objects like CD covers, book sets, etc. There are four images per object to offer sufficient
variances in viewpoint, rotation, lighting condition, scale, occlusion, and affine transform.
Stanford Mobile Visual Search Data Set [6] contains camera-phone images of
products, CDs, books, outdoor landmarks, business cards, text documents, museum paintings and
video clips. It provides several unique characteristics, e.g. varying lighting conditions,
perspective distortion, and mobile phone queries.
Table 1. Brief comparison of related benchmarking databases.

Database                              PKUBench  ZuBuD  Oxford  SCity   UKBench  Stanford
Data Scale                            13,179    1,005  5,062   20,000  10,000   -
Images per Landmark/Object Category   66        5      92      6       4        -
Mobile Capture                        √         ×      ×       ×       ×        √
Categorized Shot Size, View Angle,
Landmark/Object Scale                 √         ×      ×       ×       Indoor   ×
Blurring Query                        √         ×      ×       ×       ×        ×
Context                               √         ×      ×       ×       ×        ×
PKUBench Database: Our database provides rich query scenarios in the following
aspects: (1) rich contextual information to simulate what we can get from mobile phones; (2)
low quality cellphone queries with comparison to the corresponding queries of the digital
camera, to quantize the performance degeneration of cellphone queries; (3) occlusions in both
queries and database caused by cars, people, trees, and nearby buildings, as well as blurring and
shaking. Table 1 presents the brief comparison of related benchmarking databases in the state-of-the-art.
5 Exemplar Mobile Query Scenarios
Five groups of exemplar mobile query scenarios (in total 168 queries) are demonstrated to
evaluate the real-world visual search performance in challenging situations (see Figure 4):
Occlusive Query Set contains 20 mobile queries and 20 corresponding digital camera
queries, occluded by foreground cars, people, and buildings.
Background Clutter Query Set contains 20 mobile queries and 20 digital camera
queries. These are often captured far away from a landmark, where GPS based search would
yield worse results due to the bias of other nearby buildings.
Night Query Set contains 9 mobile phone queries and 9 digital camera queries. The
photo quality heavily depends on the lighting conditions.
Blurring and Shaking Query Set contains 20 mobile queries with blurring or shaking
and 20 corresponding mobile queries without any blurring or shaking.
Adding Distractors into Database is to evaluate the effects of applying less precise
contextual information to visual search. We collect a distractor set of 6,630 photos from the
Summer Palace (note: the landmark buildings in the Summer Palace are visually similar to those
in Peking University) and 2,012 photos from PKU, then randomly assign them the GPS tags of
the original database. We then select 10 locations (30 queries) from PKU to evaluate the mAP
degeneration.
Fig. 4. Examples of query scenarios (Digital Camera versus Mobile Phone) (From Top to Bottom:
Occlusive, Background Clutter, Blurring/Shaking, and Night).
Landmark Scale: We try to categorize the landmark scale by measuring the range of the
walking distances of the photographers around each assigned landmark location. We come up
with three scales: small, medium, and large. The typical distance for the small scale is 0-12 m, the
medium scale is 12-30 m, and the large scale is over 30 m. As shown in Figure 5, we have 63
small ones, 75 medium ones, and 60 large ones.

Table 2. Typical landmark types of the three different landmark scales.

Small Scale (0-12 m):   Sculpture, stone, pavilions, gates and others.
Medium Scale (12-30 m): Courtyard, and small or medium sized buildings, such as office buildings,
                        historic buildings (smaller floor area).
Large Scale (> 30 m):   Large buildings, such as a library or a complex building, or a long shot of a very
                        large object (e.g. BoYa Tower).

Fig. 5. The photo volumes of the three different landmark scales (Examples of Different Scales, From Top to Down: Small, Medium, and Large).
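The scale assignment above is a simple thresholding rule on the photographers' walking range; a minimal sketch (the function name is ours):

```python
def landmark_scale(walking_range_m: float) -> str:
    """Categorize a landmark by the photographers' walking range around it,
    following the 0-12 m / 12-30 m / >30 m thresholds of Table 2."""
    if walking_range_m <= 12:
        return "small"
    elif walking_range_m <= 30:
        return "medium"
    return "large"
```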
Finally, we provide more photograph details in Figure 6 and Figure 7.
Fig. 6. Photo volume distribution by different shot sizes.
Fig. 7. Photo volume distribution by different viewing angles.
6 Mobile Visual Search Baselines
We provide several visual search baselines, including purely visual search as well as context
assisted visual search:
(1) BoW: We extract SIFT [7] features from each photo, the ensemble of which is used to
build a Scalable Vocabulary Tree [5] to generate the initial vocabulary V. The SVT generates a
bag-of-words signature Vi for each database photo Ii. We denote the hierarchical level as H and
the branching factor as B. In a typical settlement, we have H = 6 and B = 10, producing
approximately 100,000 codewords. We use mean Average Precision at N (mAP@N) to evaluate
search performance, which reveals the position-sensitive ranking precision at the top N returning.
(2) GPS + BoW: We further leverage the location context to refine the visual ranking
function by multiplying the GPS distance with the BoW distance to the query example, based
on the weighting function:

Dis(A, Q) = GeoDis(A, Q) × BoWDis(A, Q)    (1)

where Dis(A, Q) is the overall distance between query Q and database image A; GeoDis(A, Q)
and BoWDis(A, Q) stand for the geographical distance (measured by GPS distance) and the
BoW based visual distance between query Q and database image A, respectively. Our ranking
is based on the similarity measurement in Equation (1).
It is worth mentioning that we have discovered that RANSAC based spatial re-ranking,
while it typically gives promising results in traditional visual experiments, does not produce
satisfactory performance in this database. There are two possible reasons: (1) PKUBench usually
contains lots of trees that are irregular for spatial re-ranking; (2) There are lots of similar
buildings (such as ancient Chinese buildings) that have very similar local features, which cannot
be well distinguished by RANSAC.
mAP Performance with respect to different challenging scenarios: We discuss the
performance of each query scenario respectively as follows:
Fig. 8. The performance of occlusive queries with respect to different methods using digital
camera and mobile phone camera (Y axis: mAP@N performance; X axis: top N returning results).
Note that most of the occlusive queries come from large scale landmarks, as occlusion often
happens in the long or medium shot of a large scale landmark. In such cases, the GPS position tends
to favor other nearby landmarks around the query location, which would lead to performance
degeneration using solely GPS information. This may even degenerate the visual search
performance when combining visual search with GPS information.
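The combined ranking of Equation (1) can be sketched as follows. The haversine formula for GeoDis and the L1 distance between BoW signatures are our own assumptions for illustration, since the contribution does not specify how GeoDis and BoWDis are computed; the function names and record fields are ours.

```python
import math
import numpy as np

def geo_dis(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in meters -- one plausible GeoDis."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def bow_dis(v_q, v_a):
    """L1 distance between two (normalized) bag-of-words signatures."""
    return float(np.abs(v_q - v_a).sum())

def dis(query, image):
    """Equation (1): Dis(A, Q) = GeoDis(A, Q) * BoWDis(A, Q)."""
    g = geo_dis(query["lat"], query["lon"], image["lat"], image["lon"])
    return g * bow_dis(query["bow"], image["bow"])

def rank(query, database):
    """Sort database images by the combined distance, smallest first."""
    return sorted(database, key=lambda img: dis(query, img))
```

Because the two factors are multiplied, an image only ranks high when it is close in both the geographical and the visual sense, which is the behavior Equation (1) aims at.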
Fig. 9. The performance of background clutter queries with respect to different methods.
In practice, the background clutter typically happens in capturing small scale landmarks. In
such cases, the purely visual search performs worse, due to the fact that in most queries the
major part of a query photo is actually occupied by backgrounds.
Fig. 10. The performance of Night queries with respect to different methods.
The Night query is an interesting case, where GPS (contextual information) dominates the
location recognition performance. Extracting distinguishing local features is very difficult,
which is quite different from the day time. Hence, we can observe that using solely GPS is
almost already enough at night. It is worth mentioning that, due to better image capturing
quality, using a digital camera can achieve better visual search performance than a mobile phone.
Fig. 11. The performance of blurring and shaking queries of phone camera photos.
From Fig. 11, we find that introducing blurring and shaking would definitely degenerate the
visual search performance. However, by incorporating GPS into similarity ranking, the results
become much more acceptable compared with the pure visual query results.
Overall Performance Comparison: We further show the overall performance (168 queries of
exemplar scenarios) with respect to using either digital cameras or mobile phone cameras in
Figure 12, which gives an intuitive finding about the mAP difference between digital camera and
phone camera.
Note that using solely visual search, the performance of camera photos is better than using
mobile phone photos; but with the combination of GPS, the performances of using either camera
or mobile phone are almost identical.
Fig. 12. Overall performance comparison between using camera and mobile phones.
Figure 13 further compares the performance over the whole database (one image as the query,
the rest as the searched dataset) among different landmark scales. It is worth mentioning that the
visual search performance of large scale landmarks is much better than the medium and small
scales, due to less background clutter and more distinguishing interest points. The GPS based
search performance of the small scale is better than the large scale, as the GPS signal may be
distorted around a larger scale landmark. Moreover, as the GPS plays a relatively important role,
the small-scale landmarks yield better results when fusing visual search and GPS information.
Fig. 13. Performance comparison among different scales of landmarks.
Finally, we investigate the overall performance of in total 570 queries (including the above
168 queries), as shown in Figure 14. Undoubtedly, the best results come from fusing both GPS
and visual search together. Although distractor images typically degenerate the performance of
pure visual search, this degeneration effect is alleviated by integrating GPS with visual cues.
Fig. 14. Overall performance of 570 queries in Peking University Landmarks.
mAP Performance with respect to Different Search Baselines: Furthermore, from Figures
15-16, over those 168 queries, it is quite obvious that by adding contextual information, the
performance can be more or less improved, while different mobile query scenarios present
diverse performances. Generally speaking, the worst performances originate from the queries of
both blurring/Night and adding distractor images. The former indicates that the use of visual
interest points would be challenged by mobile blurring queries. By comparing the results of
adding distractors in Fig. 15 and 16, we can see that the use of contextual information should be
taken seriously, while the simple combination is not robust enough for dealing with distractors.
Fig. 15. Solely BoW performance comparisons in five typical query scenarios.
Fig. 16. GPS+BoW performance comparisons in five typical query scenarios.
7 Application Scenarios
We briefly describe possible application scenarios of our Peking University Landmarks database as follows:
A benchmark dataset for mobile visual search: We hope the Peking University Landmarks
could become a useful resource to validate mobile visual search systems. It emphasizes two
important factors in mobile visual search: query quality and contextual cues. To the best of our
knowledge, both are beyond the state-of-the-art benchmark databases. In addition, it offers a
dataset to evaluate the effectiveness and robustness of contextual information.
A benchmark dataset for location recognition: This dataset can be used to evaluate
traditional location recognition systems, since a GPS location is bound to each image instance.
A training resource for scene modeling: This dataset may facilitate scene analysis and
modeling, since our photographing is well designed to cover multi-shot, multi-view appearances
of multi-scale landmarks. To this end, we will provide the camera calibration information in
our future work.
A training resource to learn better photographing manners: Our landmark photo collection
can be further exploited to learn the (recommended) mobile photographing manners (proper
angle and shot size for different types of landmarks) towards better visual search results.
8 References
[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A Large-Scale
Hierarchical Image Database. CVPR, 2009.
[2] H. Shao, T. Svoboda, and L. Van Gool. ZuBuD - Zurich Buildings Database for Image Based
Recognition. Technical Report, Computer Vision Lab, Swiss Federal Institute of Technology,
2006.
[3] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object Retrieval with Large
Vocabularies and Fast Spatial Matching. CVPR, 2007.
[4] R. Ji, X. Xie, H. Yao, and W.-Y. Ma. Hierarchical Optimization of Visual Vocabulary for
Effective and Transferable Retrieval. CVPR, 2009.
[5] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. CVPR, 2006.
[6] S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, J. Singh, and B. Girod. Location Coding for
Mobile Image Retrieval. Proc. 5th International Mobile Multimedia Communications
Conference (MobiMedia), 2009.
[7] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
[8] M. Fischler and R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with
Applications to Image Analysis and Automated Cartography. Comm. of the ACM, 24: 381-395,
1981.
[9] R. Ji, L. Duan, T. Huang, H. Yao, and W. Gao. Compact Descriptors for Visual Search -
Location Discriminative Mobile Landmark Search. CDVS Ad Hoc Group, Input Contribution
m18542, 94th MPEG Meeting, Oct. 2010.