Visual Attention for Implicit Relevance Feedback in a Content Based Image
A. Faro, D. Giordano, C. Pino, C. Spampinato∗
Department of Informatics and Telecommunication Engineering
University of Catania, Catania, 95125, Italy

Abstract

In this paper we propose an implicit relevance feedback method that aims to improve the performance of known Content Based Image Retrieval (CBIR) systems by re-ranking the retrieved images according to users’ eye gaze data. This represents a new mechanism for implicit relevance feedback, since the sources usually taken into account for image retrieval are based on the natural behavior of the user in his/her environment, estimated by analyzing mouse and keyboard interactions. In detail, after the images are retrieved by querying a CBIR with a keyword, our system computes the most salient regions (where users look with greater interest) of the retrieved images by gathering data from an unobtrusive eye tracker, such as the Tobii T60. According to the features of these relevant regions, in terms of color and texture, our system is able to re-rank the images initially retrieved by the CBIR. A performance evaluation, carried out on a set of 30 users by using Google Images and the keyword “pyramid”, shows that about 87% of the users are more satisfied with the output images when the re-ranking is applied.

Keywords: Relevance Feedback, Content Based Image Retrieval, Visual Attention, Eye Tracker

1     Introduction

During the last ten years, with the growth of the Internet and the advances in digital camera research, huge collections of images have been created and shared on the web. In the past, the only way to search digital images was by keyword indexing or simply by browsing. The need to quickly find images in large digital image databases has led researchers in image processing to develop Content Based Image Retrieval (CBIR), i.e., systems for image retrieval based on the concept of similar visual content.

Moreover, recent research in information retrieval considers the user’s personal environment in order to better understand the user’s needs. Indeed, in CBIR systems the user does not always get results fully related to the image query, especially in web-based image retrieval, such as Google Images or Yahoo!’s Picture Gallery. This is mainly due to the fact that metadata often cannot explain well the content of an image, and even when the description is exhaustive the attention of the user may be only on some portions of the image, which often correspond to its most salient areas. In order to take these user needs into account, a relevance feedback mechanism must be integrated in CBIRs.

Relevance feedback is a key feature in image retrieval systems, whose main idea is to take into account the initially retrieved outputs and to use the user’s feedback on their relevance to the initial query in order to perform a new query. In the literature two types of feedback are defined: explicit feedback and implicit feedback. Since the former requires a higher effort on the user’s side, because it may be difficult to get explicit relevance assessments from searchers [Xu et al. 2008], more attention has been given to implicit feedback methods, where feedback data are obtained by observing the user’s actions in his/her natural environment. Until today the most explored and implemented sources for implicit relevance feedback have been the interactions of users with the mouse and the keyboard [Kelly and Teevan 2003]. A new evidence source for implicit feedback, explored in the last few years, e.g., in [Moe et al. 2007], [Miller and Agne 2005], [Granka et al. 2004], is the one related to the user’s visual attention (provided by the eye movements), which introduces a potentially very valuable new dimension of contextual information [Buscher 2007].

Indeed, in a CBIR the knowledge of the human visual attention would allow us to select the most salient parts of an image, which can be used both for image retrieval, as in [Marques et al. 2006], and for the implementation of relevance feedback mechanisms. Moreover, the detection of the salient regions observed by a user is crucial information for finding image similarity.

In this paper we propose an implicit relevance feedback mechanism based on visual attention, implemented with a Tobii T60 eye tracker, to be integrated in a web-based content based image retrieval system. It aims at re-ranking the output images provided by a web-based CBIR by using the most salient regions extracted by the eye tracker. The proposed system represents a novel approach to eye tracking for image retrieval, since other approaches in the literature, e.g., [Oyekoya and Stentiford 2006], [Oyekoya and Stentiford 2007], are based on rough retrieval engines that use high-level features. It allows both users with disabilities to give feedback on the obtained results and generic users to tune the CBIR with their cognitive perception of the images (e.g., “I unconsciously prefer images with red-like colors”, or “When I look at Egyptian images I prefer to see pyramids and the Sphinx”). The remainder of the paper is organized as follows: Section 2 discusses the architecture of the proposed system. Section 3 presents an experimental evaluation on the Google Images CBIR and shows the experimental results on a set of 30 users. Finally, the last section presents conclusions and future work.

∗ e-mail: {afaro,dgiordan,cpino, cspampin}

Copyright © 2010 by the Association for Computing Machinery, Inc.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail
ETRA 2010, Austin, TX, March 22 – 24, 2010.
© 2010 ACM 978-1-60558-994-7/10/0003 $10.00

2    The Proposed System

The flow diagram of the proposed system is shown in fig. 1: 1) the user inserts a keyword for image searching, 2) the web-based CBIR retrieves the most relevant images whose metadata contain the inserted word, 3) the user looks at the output images and the system retrieves the most relevant regions, by using Tobii facilities, and their features (e.g., color, texture, etc.), 4) the system re-ranks the output images according to the extracted information.

Figure 1: Implicit Relevance Feedback for the new Ranking Method in web-based CBIR.

Our system uses the Tobii eye tracker to capture an implicit relevance feedback and to classify the images in a different order of relevance with respect to the initial classification, in order to improve the ranking provided by the search on a CBIR environment. The aim is to capture the user’s gaze fixations in order to identify the characteristics of the images s/he declares to be of interest. This will allow the tool to automatically retrieve further relevant images. The tool may also be able to discover, in an unsupervised way, the characteristics of the images of potential user interest. Indeed, it is able to derive these characteristics by considering the images that mainly captured the user’s attention, e.g., by taking into account the user’s visual activity over the analyzed images. In the former case the tool learns how to select further relevant images, whereas in the latter case it could also reclassify the images already examined by the user, suggesting that s/he reconsider more deeply some potentially relevant images.

Although the proposed system has only been tested on Google Images to improve the precision of the retrieval, it may be applied to improve the precision of the retrieval of any document on the basis of the images featured in the documents.

Figure 2: System Architecture.

Figure 2 shows the general architecture of the proposed implicit relevance feedback, where we point out the system’s ability to rearrange the images initially retrieved from a web-based CBIR (e.g., Google Images) without any user supervision, i.e., only on the basis of the user’s gaze fixations. A fine tuning of the characteristics to be possessed by the images may be carried out by the system on the basis of the user’s agreement, for a better rearrangement of the images or for extracting relevant images from other datasets. In detail, the re-ranking mechanism is composed of the following steps:

   • First Image Retrieval. The user enters some keywords in the CBIR in use and observes the results. During this phase, the eye tracker stores the user’s eye movements and the gaze fixations on the thumbnails of the retrieved images that most captured the user’s attention;

   • Features Extraction. One of the crucial points in CBIR is the choice of the low-level features used to compare the image under test with the queried image. The combination of features determines the effectiveness of the search. The extracted features can be related to the entire image (global features) or to a portion of it (local features). Local feature extraction is more complex, because it requires a first step for the detection of the important regions of the image, such as clustering algorithms and object recognition, but it permits a considerable reduction of the computational complexity of the search algorithms.
     In our case the detection is simplified by the eye tracker, which allows us to identify the regions of major interest. The local features considered for describing image content are the Contrast C, Correlation Cr, Energy E, Homogeneity H, Gabor filter maps G-Maps (24 maps: 6 scales and 4 orientations) and two novel features that describe:

         – Brightness, computed as $r_{bright} = \mu + \frac{\mu_3 \cdot 255^2}{10}$;
         – Smoothness, computed as $r_{smooth}$ from the second moment $\mu_2$, the energy $E$ and the homogeneity $H$.

     The above features are based on the moments of the histogram H of the gray levels. The nth moment of the histogram of gray levels is represented by

         $$\mu_n(x) = \sum_{i=0}^{L-1} (x_i - \mu)^n \cdot p(x_i)$$

     where $p(x_i)$ is the probability of finding a pixel of the image with gray level $x_i$ (given by the histogram H), $L$ is the number of gray levels and $\mu$ the average value. Therefore, in the proposed system, the images returned by the CBIR and the file containing the data taken by the eye tracker are processed in order to identify the most relevant images and their features. Each image is then represented by a feature vector $F = [C, Cr, E, H, G\text{-}Maps, r_{bright}, r_{smooth}]$.

   • Re-Ranking. The values of the extracted features, which should be possessed by the images to best fit the user’s interest, are then processed to produce a ranking of the images initially retrieved. In detail, we compute a similarity score (which represents a sort of implicit relevance feedback) between the most relevant images, detected at the previous step, and the images retrieved at the first step (see fig. 3). The metric used to evaluate the similarity is based on the concept of distance, measured between the feature vector $F_{rel}$ (normalized between 0 and 1) of the most salient images (extracted at the previous step) and the feature vector $F_{ret}$ (normalized between 0 and 1) of the images initially retrieved (at step 1). The
     images are re-ranked by using this similarity score, computed as:

         $$f(I_{Rel}, I_{Ret}) = \sum_{i=1}^{N} w_i \cdot \Omega_i(f^i_{rel}, f^i_{ret}), \qquad w_1 + w_2 + \dots + w_N = 1 \qquad (1)$$

     where $I_{Rel}$, $I_{Ret}$, $f^i_{rel}$, $f^i_{ret}$ are respectively the relevant image detected at the previous step, the image initially retrieved, and the $i$th of the $N$ features of the vector $F$ of the images $I_{Rel}$ and $I_{Ret}$. $\Omega_i$ is the fitness function related to the features $f^i_{rel}$, $f^i_{ret}$ and is computed as:

         $$\Omega_i = e^{-\frac{1}{2} \cdot (f^i_{ret} - f^i_{rel})} \qquad (2)$$

     Finally, the retrieved images are ordered, hence re-ranked, according to the decreasing values of the similarity score $f$.
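
The scoring and ordering of the Re-Ranking step can be illustrated with a minimal Python sketch. It is not the authors' Matlab implementation: the function names, the dictionary of retrieved images and the `power` parameter are assumptions introduced here for illustration. With `power=1` the fitness follows Eq. (2) as printed, while `power=2` is an assumed squared variant that makes the fitness symmetric around $f^i_{rel}$.

```python
import math


def fitness(f_rel, f_ret, power=1):
    """Fitness of one feature pair.

    power=1 reproduces Eq. (2) as printed: exp(-0.5 * (f_ret - f_rel)).
    power=2 is an assumed squared (Gaussian-like) variant, not from the source.
    """
    return math.exp(-0.5 * (f_ret - f_rel) ** power)


def similarity(F_rel, F_ret, weights, power=1):
    """Weighted sum of per-feature fitness values, as in Eq. (1)."""
    if not (len(F_rel) == len(F_ret) == len(weights)):
        raise ValueError("feature vectors and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * fitness(a, b, power) for w, a, b in zip(weights, F_rel, F_ret))


def rerank(F_rel, retrieved, weights, power=1):
    """Return image names ordered by decreasing similarity score.

    retrieved: hypothetical mapping of image name -> normalized feature vector.
    """
    scores = {name: similarity(F_rel, F, weights, power) for name, F in retrieved.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy example with two features normalized to [0, 1].
F_rel = [0.5, 0.5]
retrieved = {"a": [0.5, 0.5], "b": [0.9, 0.1], "c": [0.6, 0.5]}
ranking = rerank(F_rel, retrieved, weights=[0.5, 0.5], power=2)
```

In this toy example the squared variant places the image identical to the relevant one first. With `power=1` the exponent changes sign with the direction of the difference, so the ordering can differ.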
The relevance feedback detected by the eye tracker could be improved by taking into account the ranking carried out by other methods, e.g., those that model the user’s behavior during the image analysis phase from how the user operates the mouse and keyboard.

Figure 3: The eye tracker, with the implicit relevance feedback, produces an image input for the CBIR system.
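
The histogram moments on which the features of the Features Extraction step are based can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the flat list of integer gray levels are assumptions made here, and it covers only the moment definition, not the brightness or smoothness features built on top of it.

```python
from collections import Counter


def gray_level_probabilities(pixels, levels=256):
    """Normalized histogram p(x_i) for gray levels 0 .. levels-1.

    pixels: flat sequence of integer gray levels (an assumed input format).
    """
    pixels = list(pixels)
    counts = Counter(pixels)
    total = len(pixels)
    return [counts.get(x, 0) / total for x in range(levels)]


def central_moment(p, n):
    """n-th moment about the mean: sum over i of (x_i - mean)^n * p(x_i)."""
    mean = sum(x * px for x, px in enumerate(p))
    return sum((x - mean) ** n * px for x, px in enumerate(p))


# Toy region with four pixels and four gray levels.
p = gray_level_probabilities([0, 0, 2, 2], levels=4)
variance = central_moment(p, 2)   # second moment
skewness_term = central_moment(p, 3)  # third moment
```

The second moment is the variance of the gray levels, and the third moment measures the asymmetry of the histogram. In the proposed system these quantities would be computed on the regions selected through the gaze fixations.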
3   User Interface and Experimental Results

The system has been implemented by integrating the functionality of Tobii Studio with Matlab 7.5, which is responsible for processing the output provided by the eye tracker. Tobii Studio makes it possible to record a web browsing session, setting appropriate parameters such as the URL and the initial size of the web browser window. By default the web browsing is set to as homepage, whereas the window size and resolution are set to the entire screen and to the maximum resolution allowed by the monitor. After a proper training phase of the instrument, the user is authorized to start regular recording sessions, which terminate by pressing the F10 key on the keyboard. At the end of the session the user should confirm the export in textual form of the two files, related to fixations and events, needed for the computation of the relevance feedback. Thus, the information representing the gaze fixations and the one related to the images, which are merged in the same picture, are actually separated into two files.

To evaluate the effectiveness of the proposed system in increasing the precision of the information retrieval carried out by Google Images, we show below how the system significantly rearranges the collection of images proposed by Google in response to the word “pyramid”, and we evaluate the performance increase as perceived by a set of 30 users. Indeed, such a collection is proposed without any knowledge of the user’s interest, by merging images of “pyramid” where the subject is either a monument or a geometric solid (see fig. 4).

With the eye tracker we may gain insight into the user’s interests by discovering, for example, that s/he is more interested in the pyramids as monuments, since the most fixated images are related to the Egyptian pyramids (as shown by the heatmap in fig. 5). With this information at hand it is relatively easy for the system to discover, after the recording session, the images relevant for the user following the

processing procedure pointed out in the previous section.

Figure 4: Google Ranking for “Pyramid” Keyword.

Figure 5: Gaze Fixations on the Images retrieved by Google using the “Pyramid” keyword.

Fig. 6 shows the collection of the images as re-proposed by our system. The new ranking correctly suggests a sequence that favors the pyramids more similar to those observed, and thus actually requested, by the user. The user’s intent was caught with an implicit relevance feedback by taking into account that s/he was particularly attracted by a picture with the Sphinx in the foreground and the pyramid in the background.

Figure 6: New Images Pyramid Ranking according to the Eye Tracker feedback given by the user.

The proposed system was then able to discover meaningful information from how the perception process was carried out by the user. Indeed, in the new re-proposed ranking, the top two places are taken by images with the pyramid and the Sphinx.

Finally, we tested the performance of the proposed system on a set of 30 users. In detail, after the re-ranking the user was asked to say whether the first five retrieved images were more or less relevant to the inserted word with respect to the ones obtained by Google Images. The results are reported in table 1, where we can see that 86.6% of the users were more satisfied after the re-ranking, 6.7% of the users were indifferent and 6.7% were less satisfied.

             Less Satisfied    Indifferent    More Satisfied
    Users          2                2               26
    %             6.7              6.7             86.6

Table 1: Qualitative Performance Evaluation on a set of 30 Users

4    Conclusions and Future Work

The proposed model shows that the use of an eye tracker to detect an implicit feedback may greatly improve the performance of a search in a CBIR system. Future developments will concern the possibility of considering not only the first image but also the next ones in order of importance, to obtain a more refined ranking. Moreover, we are currently working on the possibility of using visual attention for image indexing, thus taking into account the real contents of images. A comparison with other web-based CBIRs and tests on a wider set of users, in order to provide quantitative results, will be carried out in future work.

References

BUSCHER, G. 2007. Attention-based information retrieval. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, New York, NY, USA, 918–918.

GRANKA, L. A., JOACHIMS, T., AND GAY, G. 2004. Eye-tracking analysis of user behavior in www search. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, New York, NY, USA, 478–479.

KELLY, D., AND TEEVAN, J. 2003. Implicit feedback for inferring user preference: a bibliography. SIGIR Forum 37, 2, 18–28.

MARQUES, O., MAYRON, L. M., BORBA, G. B., AND GAMBA, H. R. 2006. Using visual attention to extract regions of interest in the context of image retrieval. In ACM-SE 44: Proceedings of the 44th annual Southeast regional conference, ACM, New York, NY, USA, 638–643.

MILLER, T., AND AGNE, S. 2005. Attention-based information retrieval using eye tracker data. In K-CAP ’05: Proceedings of the 3rd international conference on Knowledge capture, ACM, New York, NY, USA, 209–210.

MOE, K. K., JENSEN, J. M., AND LARSEN, B. 2007. A qualitative look at eye-tracking for implicit relevance feedback. In CIR.

OYEKOYA, O. K., AND STENTIFORD, F. W. 2006. Eye tracking – a new interface for visual exploration. BT Technology Journal 24, 3, 57–66.

OYEKOYA, O., AND STENTIFORD, F. 2007. Perceptual image retrieval using eye movements. Int. J. Comput. Math. 84, 9, 1379–1391.

XU, S., ZHU, Y., JIANG, H., AND LAU, F. C. M. 2008. A user-oriented webpage ranking algorithm based on user attention time. In AAAI’08: Proceedings of the 23rd national conference on Artificial intelligence, AAAI Press, 1255–1260.


