Dynamic Two-Stage Image Retrieval from Large Multimodal Databases

Dynamic Two-Stage Image Retrieval from
Large Multimodal Databases

Avi Arampatzis
Konstantinos Zagoris
Savvas Chatzichristoﬁs

Department of Electrical and Computer Engineering
Democritus University of Thrace
Xanthi 67100, Greece
ECIR 2011

Dynamic Two-Stage Image Retrieval from Large Multimodal Databases Avi Arampatzis, Konstantinos Zagoris, Savvas Chatzichristoﬁs 2

Outline

1. Content-based Image Retrieval (CBIR): global vs local features, scalability

2. A problem of global features: The Red Tomato / Red Pie-Chart Problem

3. Multimodal Information Collections

4. Two-stage Image Retrieval from Multimedia Databases
related work
our main contributions

5. Experiments on the ImageCLEF 2010 Wikipedia Test Collection


Content-based Image Retrieval (CBIR)

In CBIR, images are represented by either
local features:
¦ computed at multiple image points; capable of recognizing objects
¦ computationally expensive, e.g. due to high dimensionality
global features:
¦ generalize images with single vectors capturing color, texture, shape, etc.
¦ computationally cheap (relatively)
¥ thus, more popular

CBIR with either global or local features
does not scale up well to large databases eﬃciency-wise


CBIR with Global Features

notoriously noisy for image queries of low generality
(generality = the fraction of relevant images in a collection)
In text retrieval: documents matching no query keywords are not retrieved
In image retrieval: CBIR will rank the whole collection
¦ The Red Tomato / Red Pie-Chart Problem [next]


The Red Tomato / Red Pie-Chart Problem

Image query: Collection items:

If the query has a low generality
early ranks may be dominated by spurious results (such as the pie-chart)
possibly ranked even before red tomatoes on non-white backgrounds


Contemporary Information Collections

Large

Multimodal (= provide diﬀerent manners of retrieval )

A good example is Wikipedia:

Large:
3,605,426 articles (English version alone), “millions of images”, etc.

Multimodal:
Topics are covered in several languages.
Topics may include non-textual media (image, sound, video, etc.)
annotated in a variety of metadata ﬁelds (caption, description, etc.)
Thus, there are several ways to get to the same info/topic.


Image Retrieval from Multimodal Databases
In image retrieval systems, users are assumed to target visual similarity. Thus,

Image is the primary medium.
image modalities are most important than others

Modalities from other media are secondary.
but they can still help with the problems of CBIR:
¦ low generality (Red Tomato/Pie-Chart) problem (esp. for global feats.)
¦ speed

Traditionally, fusion has been used, but:

weighing of media is not trivial; usually requires training data

not theoretically sound: inﬂuence of textual scores may worsen the visual quality


Two-stage Image Retrieval from Multimedia Databases
Key idea for improving CBIR eﬀectiveness:
Before CBIR, raise query generality by reducing collection size via ﬁltering.
Assumptions:

Query is expressed in the primary medium i.e. image,

accompanied by a query in a secondary medium, e.g. text.

Two stage retrieval:

Rank the collection by the secondary medium/query, e.g. text.

Draw a rank threshold K.

Re-rank only the top-K items with CBIR.

Using a ‘cheaper’ secondary medium, we also improve speed.


Previous Related Work

Best-results re-ranking by visual content has been seen before, but

for other purposes, e.g. result clustering, diversity, . . .

using external info (e.g. sets of images), or training data, . . .

They all used
global features
a static predeﬁned threshold for all queries

Eﬀectiveness results have been mixed,
while some didn’t provide a comparative evaluation.


Main Contributions

In view of the related literature, our main contributions are:

dynamic thresholds per query
no external information
no training data

extensive evaluation of thresholding types and levels


The ImageCLEF 2010 Wikipedia Test Collection

237,434 items
Primary medium: image
¦ Heterogeneous: color natural images, graphics, greyscale images, etc.
¦ variety of image sizes
¦ It’s a large benchmark image database for today’s standards.
+ noisy and incomplete user-supplied annotations
+ wikipedia articles containing the images
Annotations articles come in 3 languages: En, Fr, De.

70 test topics
Visual part:
¦ 1 or more example images
Textual part:
¦ 3 titles ﬁelds, one per language (En, Fr, De)
Topics are assessed by visual similarity.


Indexing and Retrieval

text: only English annotations + English query
Lemur Toolkit V4.11 / Indri V2.11
¦ tf.idf retrieval model (works better with our thresholding methods)
¦ default settings + Krovetz stemmer

image:
Joint Composite Descriptor (JCD)
¦ captures color and texture information
¦ developed for color natural images
Spatial Color Distribution (SpCD)
¦ captures color and its spatial distribution
¦ more suitable for colored graphics (fewer colors, less texture)
JCD SpCD are found to be more eﬀective than MPEG-7 descriptors.


Thresholding and Re-ranking

Static Thresholding: a ﬁxed pre-selected rank threshold K for all topics
K 25, 50, 100, 250, 500, 1000.

Dynamic Thresholding: a variable rank threshold per topic
Score-Distributional Threshold Optimization (SDTO)
[Arampatzis et.al., ”Where to Stop Reading a Ranked-list?”, SIGIR 2009]
Two types:
¦ Threshold on Precision: g 0.990, 0.950, 0.800, 0.500, 0.330, 0.100
¦ Threshold on prel: θ 0.990, 0.950, 0.800, 0.500, 0.330, 0.100


Initial Experiments: Setting a Baseline
item scoring by MAP P@10 P@20 bpref
JCD1 .0058 .0486 .0479 .0352
maxi JCDi .0072 .0614 .0614 .0387
maxi JCDi + maxi SpCDi .0112 .0871 .0886 .0415
tf.idf (text-only) .1293 .3614 .3314 .1806

Image-only runs provide very weak baselines.

We chose the text-only run as a baseline for the statistical significance testing.

This makes sense also from an efficiency point of view:

if using a secondary text modality for image retrieval is more effective than
current CBIR methods, then there is no reason at all for using
computationally costly CBIR methods.


Results

threshold K€ MAP P@10
JCD1
P@20 bpref MAP
maxi JCDi +
P@10
maxi SpCDi
P@20 bpref
text-only — .1293 .3614 .3314 .1806 .1293 .3614 .3314 .1806
25 25 .1162™ .3957 - .3457˜ .1641™
.1168 - .3943 - .3436 ˜
.1659™
50 50 .1144™ .3829 - .3579˜ .1608™ .1154 - .3986 - .3557 - .1648™
100 100 .1138 - .3786 - .3471 - .1609™ .1133 - .3900 - .3486 - .1623 -
K
250 250 .1081™ .3414 - .3164 - .1644 - .1092™ .3771 - .3564 - .1664 -
500 500 .0968™
.3200 - .3007 - .1575™ .0999™
.3557 - .3250 - .1590™
1000 1000 .0865™
.2871™ .2729™ .1493™ .0909™
.3329 - .3064 - .1511™
.9900 49 .1364 - .4214˜ .3550 - .1902˜ .1385˜ .4371˜œ
.3743˜ .1921˜
.9500 68 .1352 - .4171˜ .3586 - .1912˜ .1386˜ .4500˜ œ
.3836˜ .1932˜
.8000 95 .1318 - .4000 - .3536 - .1892 - .1365 - .4443˜œ
.3871˜ .1924 -
g
.5000 151 .1196 - .3814 - .3393 - .1808 - .1226 - .4043 - .3550 - .1813 -
.3333 237 .1085™
.3500 - .3000 - .1707 - .1121™ .3857 - .3364 - .1734 -
.1000 711 .0864™
.2871™ .2621™
.1461™
.0909™
.3357 - .2964 - .1487™

.9900 42 .1342 - .4043 - .3414 - .1865 - .1375˜ .4371˜œ
.3700˜ .1897˜
.9500 51 .1371 - .4214˜ .3586 - .1903˜ .1417˜œ
.4500˜œ
.3864˜œ
.1924˜œ

.8000 81 .1384˜ .4229˜ .3614 - .1921˜ .1427˜ œ
.4629˜ œ
.3871˜œ
.1961˜ œ
θ
.5000 91 .1367 - .4057 - .3571 - .1919˜ .1397˜ .4400˜œ
.3829˜ .1937˜
.3333 109 .1375 - .4129 - .3636 - .1933˜ .1404˜ .4500˜œ
.3907˜œ
.1949˜œ

.1000 130 .1314 - .4100 - .3629 - .1866 - .1370 - .4371˜ .3843˜ .1922˜
image-only — .0058™
.0486™
.0479™
.0352™
.0112™
.0871™
.0886™
.0415™


Results

Static thresholding improves initial Precision at the cost of MAP and bpref.

Dynamic thresholding on Precision or prel does not have this drawback.

The level of static and precision thresholds influences greatly the effectiveness;
unsuitable choices (e.g. too loose) lead to a degraded performance.

Prel thresholds are much more robust in this respect.

Better CBIR at the second stage leads to overall improvements, but the
thresholding type seems more important: While the two CBIR methods vary
greatly in performance (the best has almost double the effectiveness of the
other), static thresholding is not influenced much by this choice. Dynamic
methods benefit more from improved CBIR.


Results


Conclusions

Two-stage retrieval with dynamic thresholding:
more effective robust than static thresholding
practically insensitive to a wide range of choices for the optimization measure
beats significantly the text-only several image-only baselines

Two-stage retrieval, irrespective of thresholding type:
efficiency benefit:
¦ it cuts down greatly on expensive image operations
¦ on average, only 2 to 5 in 10,000 images had to be scored


Further Research

Compare two-stage retrieval to fusion. [Did this; check ECIR11 poster]
Both methods work better than text-only image-only;
two-stage is slightly better than fusion, but the difference is non-significant.
Both methods are robust in different ways:
¦ Fusion provides less variability across topics but it is sensitive to the
weighing parameter of the contributing media.
¦ Two-stage provides a much lower sensitivity to its thresholding parameter
but has a higher variability across topics.

Try two-stage retrieval with other type of image descriptors,
e.g. the visual codebook approach (TOP-SURF).

Generalization to multi-stage retrieval, where rankings for the media are
successively being thresholded and re-ranked with respect to a media hierarchy.


avi@ee.duth.gr

Thank you !

Dynamic Two-Stage Image Retrieval from Large Multimodal Databases

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (18)

Similar to Dynamic Two-Stage Image Retrieval from Large Multimodal Databases

Similar to Dynamic Two-Stage Image Retrieval from Large Multimodal Databases (20)

Recently uploaded

Recently uploaded (20)

Dynamic Two-Stage Image Retrieval from Large Multimodal Databases