Eye Movement as an Interaction Mechanism for Relevance Feedback in a Content-Based Image Retrieval System

Yun Zhang 1,2 (tvsunny@gmail.com), Hong Fu 2 (enhongfu@inet.polyu.edu.hk), Zhen Liang 2 (zhenliang@eie.polyu.edu.hk), Zheru Chi 2 (enzheru@inet.polyu.edu.hk), Dagan Feng 2,3 (feng@it.usyd.edu.au)

1 School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, China
2 Centre for Multimedia Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China
3 School of Information Technologies, The University of Sydney, Sydney, Australia

ETRA 2010, Austin, TX, March 22-24, 2010

Abstract

Relevance feedback (RF) mechanisms are widely adopted in Content-Based Image Retrieval (CBIR) systems to improve image retrieval performance. However, two intrinsic problems remain: (1) the semantic gap between high-level concepts and low-level features, and (2) the subjectivity of human perception of visual content. The primary focus of this paper is to evaluate the possibility of inferring the relevance of images from eye movement data. In total, 882 images from 101 categories are viewed by 10 subjects to test the usefulness of implicit RF, where the relevance of each image is known beforehand. A set of fixation-based measures, including fixation duration, fixation count, and the number of revisits, is thoroughly evaluated. Finally, the paper proposes a decision tree to predict the user's input during the image searching tasks. The prediction precision of the decision tree is over 87%, which sheds light on a promising integration of natural eye movement into CBIR systems in the future.

CR Categories: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Relevance feedback, Search Process; H.5.2 [Information Interfaces and Representation]: User Interfaces

Keywords: Eye Tracking, Relevance Feedback (RF), Content-Based Image Retrieval (CBIR), Visual Perception

1 Introduction

Numerous digital images are produced every day by digital cameras, medical devices, security monitors, and other image capturing apparatus. It has become more and more difficult to retrieve a desired picture even from a photo album on a home computer because of the exponential increase in the number of images. Traditional methods of image retrieval based on metadata, such as textual annotations or user-specified tags, have become the industry standard for retrieval from large image collections. However, manual image annotation is time-consuming, laborious and expensive. Moreover, the subjective nature of human annotation adds another dimension of difficulty to managing image databases.

CBIR is an alternative solution for retrieving images. However, after years of rapid growth since the 1990s [Flickner et al. 1995], the gap between low-level features and the semantic content of images has held back progress, and the field has entered a plateau phase. This gap can be concretely outlined in three aspects: (1) image representation, (2) similarity measure, and (3) user interaction. Most image representations are based on the intuition of researchers and mathematical convenience rather than on human eye behavior. Do the extracted features reflect a human's understanding of the image's content? There is no clear answer to this question. The similarity measure is highly dependent on the features and structures used in image representation, and developing better distance descriptors and refining similarity measures are also very challenging. User interaction can be a feasible approach to answering this question and to improving image retrieval performance. In the Relevance Feedback (RF) process, the user is asked to refine the search by providing explicit RF, such as selecting Areas-of-Interest (AOIs) in the query image or ticking positive and negative samples among the retrieved results. In the past few years, many articles have reported that RF can help to establish the association between the low-level features and the semantics of images and to improve retrieval performance [Liu et al. 2006; Tao et al. 2008]. However, explicit feedback is laborious for the user and limited in complexity.

In this paper, we propose eye movement based implicit feedback as a rich and natural source to replace the time-consuming and expensive explicit feedback. As far as we know, there are only a few preliminary studies on applying general eye movement features to image retrieval. One is Oyekoya and Stentiford's work [Oyekoya and Stentiford 2004; Oyekoya and Stentiford 2006]: they investigated fixation duration and found that it differs between images with and without a clear AOI. The other was reported by Klami et al. [Klami et al. 2008], who derived nine-dimensional feature vectors from different forms of fixations and saccades and used a classifier to predict one relevant image out of four candidates.

Different from the previous work, the study reported in this paper attempts to simulate a more realistic and complex image retrieval situation and to quantitatively analyze the correlation between users' eye behavior and target images (positive images). In our experiments, the images come from a wide variety of web sources, and in each task the query image and the number of positive images vary from time to time. We evaluate the significance of fixation durations, fixation counts, and the number of revisits to provide a systematic interpretation of the user's attention and effort allocation in eye movements, laying a concrete and substantial foundation for involving natural eye movement as a robust RF source [Zhou and Huang 2003].
The rest of the paper is organized as follows. Section 2 introduces the experimental design and setup for the relevance feedback tasks and the corresponding eye movement data collection. In Section 3, we report a thorough investigation of fixation duration, fixation count and the number of revisits for the prediction of relevant images; ANOVA tests are performed on these factors to reveal their significance and interconnections. Section 4 proposes a decision tree model to predict the user's input during the image searching tasks. Finally, we conclude the results and propose future work.

2 Design of Experiments

2.1 Task Setup

We study an image searching task which reflects the kinds of activities occurring in a complete CBIR system. In total, 882 images are randomly selected from 101 object categories; the image set was obtained by collecting images through the Google image search engine [Li 2005]. The design and an example of the searching task interface are shown in Fig. 1. On the top left is the query image. Twenty candidate images are arranged in a 4x5 grid. All of the images come from 101 categories such as landscapes, animals, buildings, human faces, and home appliances. The red blocks in Fig. 1(a) denote the locations of the positive images in Fig. 1(b) (Class No. 22: Pyramid). The others are negative images, and their image classes differ from each other; that is to say, apart from the query image's category, no two images in the grid are from the same category. The candidate images in one searching stimulus are randomly arranged.

Figure 1. Image searching stimulus. (a) the layout of the searching stimulus with 5 positive images; (b) an example.

Such a simulated relevance feedback task asks each participant to locate the positive images on each stimulus with the eyes alone. On locating a positive image, the participant selects the target by fixating on it for a short period of time. A task set is composed of 21 such stimuli, whose number of positive images varies from 0 to 20. Thus, a task set contains 21 x 21 = 441 images, and the total numbers of negative and positive images are equal (210 images each).

2.2 Apparatus and Procedure

Eye tracking data are collected with a Tobii X120 eye tracker, whose accuracy is α = 0.5° with a drift of β = 0.3°. Each candidate image has a resolution of 300 x 300 pixels, so an image stimulus has 1800 x 1200 pixels. Each stimulus is displayed on a screen with a resolution of 1920 x 1280 pixels and a pixel pitch of h = 0.264 mm, at a viewing distance of D = 600 mm. Hence the output uncertainty is just R = tan(α + β) · D / h ≈ 30 pixels, which keeps the error of the gaze data no larger than 1% of the area of each candidate image.

Ten participants took part in the study, four females and six males, aged from 20 to 32, all with an academic background. All of them are proficient computer users, and half of them had prior experience with an eye tracking system. Their vision is either normal or corrected-to-normal. The participants were asked to complete two sets of the above-mentioned image searching tasks, and the gaze data were recorded at a 60 Hz sampling rate. Afterwards, the participants were asked to indicate which images they had chosen as positive, to ensure the accuracy of the subsequent analysis of their eye movement data. The eye tracker is non-intrusive and allows a 300 x 220 x 300 mm free head movement space. Different candidate images and positive image locations are ensured within and between the task sets; in other words, no two images are the same and no two stimuli have the same positive image locations. This reduces memory effects and simulates a natural relevance feedback situation.
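As a quick sanity check of the output uncertainty computed in Section 2.2, the following minimal Python sketch reproduces the calculation; the function and variable names are our own, and the 1% figure corresponds to comparing a 30 x 30 pixel region against one 300 x 300 pixel candidate image.

```python
import math

def gaze_uncertainty_px(accuracy_deg=0.5, drift_deg=0.3,
                        distance_mm=600.0, pixel_pitch_mm=0.264):
    """On-screen gaze uncertainty in pixels: R = tan(alpha + beta) * D / h."""
    return math.tan(math.radians(accuracy_deg + drift_deg)) \
           * distance_mm / pixel_pitch_mm

R = gaze_uncertainty_px()                 # ~31.7 px, i.e. roughly 30 px
area_ratio = (30 * 30) / (300 * 300)      # 30x30 px region vs one candidate image
print(f"R = {R:.1f} px; 30x30 px region = {area_ratio:.0%} of a candidate image")
```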
3 Analysis of Gaze Data in Image Searching

Raw gaze data are preprocessed by finding fixations with the built-in filter provided by Tobii Technology. The filter maps a series of raw coordinates to a single fixation if the coordinates stay sufficiently long within a sphere of a given radius. We used an interval threshold of 150 ms and a radius of 1° of visual angle.

3.1 Fixation Duration and Fixation Count

The main features used in eye tracking related information retrieval are fixations and saccades [Jacob and Karn 2003]. Two groups of metrics derived from fixations, fixation duration and fixation count, are thoroughly studied to support the possibility of inferring the relevance of images from eye movements [Goldberg et al. 2002; Gołofit 2008].

Suppose that FDP(m) and FDN(m) are the fixation durations on the positive and the negative images observed by subject m, respectively, and that FCP(m) and FCN(m) are the corresponding fixation counts. In our searching task, FDP(m) and FDN(m) are defined as

\[
\mathrm{FDP}(m) = \frac{\sum_{i,j,k} \mathrm{FD}^{m}_{i,j,k}\,\mathrm{sgn}(s^{m}_{i,j,k})}{\sum_{i,j,k} \mathrm{sgn}(s^{m}_{i,j,k})}, \qquad
\mathrm{FDN}(m) = \frac{\sum_{i,j,k} \mathrm{FD}^{m}_{i,j,k}\,\bigl(1 - \mathrm{sgn}(s^{m}_{i,j,k})\bigr)}{\sum_{i,j,k} \bigl(1 - \mathrm{sgn}(s^{m}_{i,j,k})\bigr)}, \tag{1}
\]

where i = 0, 1, …, 20 denotes the image candidate in each searching stimulus interface; j = 1, 2, …, 21 denotes the stimulus in each searching task (it also reflects the number of positive images in the current stimulus); k = 1, 2 denotes the task set; m = 1, 2, …, 10 represents the subject; and sgn(x) is the signum function. Consequently, FD^m_{i,j,k} is the fixation duration on the i-th image candidate of the j-th stimulus of the k-th task from subject m, and

\[
s^{m}_{i,j,k} = \begin{cases} 1 & \text{if subject } m \text{ regards candidate image } i \text{ as positive,} \\ 0 & \text{if subject } m \text{ regards candidate image } i \text{ as negative.} \end{cases}
\]

In a similar manner, FCP(m) and FCN(m) are defined as

\[
\mathrm{FCP}(m) = \frac{\sum_{i,j,k} \mathrm{FC}^{m}_{i,j,k}\,\mathrm{sgn}(s^{m}_{i,j,k})}{\sum_{i,j,k} \mathrm{sgn}(s^{m}_{i,j,k})}, \qquad
\mathrm{FCN}(m) = \frac{\sum_{i,j,k} \mathrm{FC}^{m}_{i,j,k}\,\bigl(1 - \mathrm{sgn}(s^{m}_{i,j,k})\bigr)}{\sum_{i,j,k} \bigl(1 - \mathrm{sgn}(s^{m}_{i,j,k})\bigr)}, \tag{2}
\]

where FC^m_{i,j,k} is the fixation count on the i-th image candidate of the j-th stimulus of the k-th task from subject m. The two pairs of fixation-related variables were monitored and recorded during the experiment.
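Equations (1) and (2) amount to grouped averages over per-image fixation totals. A minimal Python sketch follows, assuming fixations are logged as (subject, task set, stimulus, candidate, duration) records and that the indicator s is available as a lookup table; the record layout and names are our assumptions, not the authors' data format.

```python
from collections import defaultdict

def fixation_stats(fixations, s):
    """Per-subject FDP/FDN/FCP/FCN as in Eq. (1) and (2).

    fixations: iterable of (m, k, j, i, duration) records, one per fixation
               on candidate i of stimulus j in task set k by subject m.
    s:         dict mapping (m, k, j, i) -> 1 if subject m regarded candidate
               i as positive, else 0 (the indicator s in Eq. (1)).
    """
    dur = defaultdict(lambda: [0.0, 0.0])   # summed durations [negative, positive]
    cnt = defaultdict(lambda: [0, 0])       # fixation counts  [negative, positive]
    for m, k, j, i, d in fixations:
        label = s[(m, k, j, i)]
        dur[m][label] += d
        cnt[m][label] += 1
    n = defaultdict(lambda: [0, 0])         # image counts, incl. unfixated images
    for (m, k, j, i), label in s.items():   # denominators come from the labels
        n[m][label] += 1
    return {m: {'FDP': dur[m][1] / n[m][1], 'FDN': dur[m][0] / n[m][0],
                'FCP': cnt[m][1] / n[m][1], 'FCN': cnt[m][0] / n[m][0]}
            for m in n}
```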
The average values and standard deviations over the ten participants are summarized in Table 1.

Table 1. Statistics on fixation duration (in seconds) and fixation count on positive and negative images.

Sub.   FDP(m)         FDN(m)         FCP(m)    FCN(m)
1      1.410±1.081    0.415±0.481    2.5±1.9   1.3±1.3
2      1.332±0.394    0.283±0.247    2.7±1.4   1.2±0.9
3      2.582±1.277    0.418±0.430    5.6±3.3   1.7±1.5
4      0.805±0.414    0.356±0.328    2.4±1.2   1.5±1.2
5      1.154±0.484    0.388±0.284    2.6±1.4   1.5±1.0
6      1.880±0.926    0.402±0.338    3.0±1.9   1.4±1.0
7      0.987±0.397    0.166±0.283    1.7±0.8   0.6±0.7
8      0.704±0.377    0.358±0.254    2.2±1.1   1.3±0.9
9      1.125±0.674    0.329±0.403    3.0±2.0   1.4±1.5
10     1.101±0.444    0.392±0.235    2.7±1.3   1.5±0.8
AVG.   1.308±0.891    0.351±0.345    2.8±2.0   1.3±1.1

Analysis of variance (ANOVA) tests are performed to find out whether there are discriminating visual behaviors between the observation of positive and negative images. Given the individual differences in eye movements, we designed two groups of two-way ANOVA over three factors: test subject, fixation duration and fixation count. The results are shown in Table 2.

Table 2. ANOVA test results for the three factors: test subject, fixation duration and fixation count.

GROUP I
  Factor                   Levels                      Test result
  (A) Test Subjects        10 levels (10 subjects)     F(9,9) = 1.26, p < 0.37
  (B) Fixation Duration    2 levels (FDP & FDN)        F(1,9) = 32.84, p < 0.0003

GROUP II
  Factor                   Levels                      Test result
  (A) Test Subjects        10 levels (10 subjects)     F(9,9) = 2.03, p < 0.15
  (B) Fixation Count       2 levels (FCP & FCN)        F(1,9) = 28.28, p < 0.0005

As illustrated in Table 2, both fixation duration and fixation count reveal significant effects between positive and negative images during the simulated relevance feedback tasks. Concretely speaking, the fixation durations on positive images, averaged over all subjects (1.30 seconds), are longer than those on negative images (0.35 seconds). Correspondingly, the analysis of fixation count produces a similar result: subjects fixate a positive image more often (2.8 times) than a negative one (1.3 times). On the other hand, the variation among subjects has no significant effect in either group (in GROUP I, 0.37 > α = 0.05; in GROUP II, 0.15 > α = 0.05).

3.2 Number of Revisits

A revisit is defined as a re-fixation on an AOI previously fixated. Much human-computer interaction and usability research shows that re-fixating or revisiting a target may indicate special interest in that target. Therefore, the analysis of revisits during the relevance feedback process may reveal the correlation between the eye movement pattern and positive image candidates.

Figure 2 shows the overall visit frequency (no. of revisits = no. of visits - 1) throughout the whole image searching task. We can see that (1) some of the candidate images are never visited, which indicates the use of pre-attentive vision at the very beginning of the visual search [Salojärvi et al. 2004]; during the pre-attentive process, all the candidate images are examined to decide the successive fixation locations; and (2) in our experiments, revisits happen on both positive and negative images. The majority of images are visited just once, while some are revisited during the image search.

Figure 2. The total revisit histogram. The X-axis denotes the number of re-fixations and the Y-axis the corresponding count.

To compare with Oyekoya and Stentiford's work [2006], we investigate whether the revisit count has a different effect on positive and negative image candidates over all the participants (Table 3).

Table 3. Overall revisits on positive and negative images.

Number of revisits on an image candidate (A1)      1     2     3     4     5     6     >7
Revisit counts on positive images (A2)             549   196   88    55    34    13    27
Revisit counts on negative images (A3)             329   110   31    10    3     2     1
Total number of revisits (A4)                      878   306   119   65    37    15    28
Percentage of revisits on positive images (A5)     63%   64%   74%   85%   92%   87%   100%

When the revisit count is ≥ 3, the result of a one-way ANOVA is significant with F(1,8) = 5.73, p < 0.044. That is to say, the probability that a revisit lands on a positive image increases with the revisit count; for example, when an image is revisited three or more times, it has a very high probability (74% or higher) of being a positive image candidate. As a result, the number of revisits is also a feasible implicit relevance feedback signal for driving an image retrieval engine.
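Operationally, the revisit statistic can be computed directly from the per-stimulus fixation sequence: a visit is a maximal run of consecutive fixations on the same candidate, and the number of revisits is the number of visits minus one. A minimal sketch, where the list-of-candidate-indices encoding is our assumption:

```python
from collections import Counter
from itertools import groupby

def count_revisits(fixation_sequence):
    """Revisits per candidate image for one stimulus.

    fixation_sequence: ordered candidate indices, one per fixation.
    A visit is a maximal run of consecutive fixations on one candidate;
    revisits = visits - 1, so a single uninterrupted visit counts as 0.
    """
    visits = Counter(idx for idx, _ in groupby(fixation_sequence))
    return {idx: n - 1 for idx, n in visits.items()}

# Candidates 3 and 7 are each visited twice (one revisit each), 12 once (none).
print(count_revisits([3, 3, 7, 3, 12, 12, 7]))   # {3: 1, 7: 1, 12: 0}
```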
4 Feature Extraction and Results

The primary focus of this paper is on evaluating the possibility of inferring the relevance of images from eye movement data. Features such as the fixation duration, the fixation count and the number of revisits have shown discriminating power between positive and negative images. Consequently, we compose a simple set of 11 features, an eye movement feature vector, to predict the positive images within each returned 4x5 image candidate set in the simulated relevance feedback task, where j = 1, 2, …, 20 denotes the number of positive images in the current stimulus and m = 1, 2, …, 10 represents the subject. The per-candidate quantities are listed in Table 4, where i = 1, …, 20 and FL_i = FD_i / FC_i.

Table 4. Features used in relevance feedback to predict positive images.

  FD_i   Fixation duration on the i-th image inside the 4x5 image candidate set interface
  FC_i   Fixation count on the i-th image inside the 4x5 image candidate set interface
  FL_i   Fixation length on the i-th image inside the 4x5 image candidate set interface, FL_i = FD_i / FC_i
  R_i    Number of revisits on the i-th image inside the 4x5 image candidate set interface
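To illustrate how the per-candidate quantities in Table 4 fit together, here is a small sketch; the dictionary layout is our assumption, and it covers only the four per-image quantities named in Table 4, not the paper's full 11-feature composition.

```python
def candidate_features(durations, counts, revisits, n_candidates=20):
    """One feature row (FD_i, FC_i, FL_i, R_i) per candidate in the 4x5 grid.

    durations: {i: total fixation duration on candidate i, in seconds}
    counts:    {i: fixation count on candidate i}
    revisits:  {i: number of revisits on candidate i}
    Unfixated candidates get zero rows (FL_i is undefined there, set to 0).
    """
    rows = {}
    for i in range(1, n_candidates + 1):
        fd = durations.get(i, 0.0)
        fc = counts.get(i, 0)
        fl = fd / fc if fc else 0.0          # FL_i = FD_i / FC_i
        rows[i] = (fd, fc, fl, revisits.get(i, 0))
    return rows
```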
Different from Klami et al.'s work [Klami et al. 2008], we use a decision tree (DT) as a classifier to automatically learn the prediction rules. The data set described in Section 2 is divided into a training set and a testing set to evaluate the prediction accuracy. Two different methods are used to train the DT, as illustrated in Table 5 (prediction precisions of 87.3% and 93.5%, respectively), and an example of predicted positive images from a 4x5 candidate set is shown in Figure 3.

Table 5. Training methods and testing results of the decision trees.

  Method I    Training data set: {1, 2, …, 5}      Testing data set: {5, 6, …, 10}      Prediction precision: 87.3%
  Method II   Training data set: {1, 3, 5, …, 19}  Testing data set: {2, 4, 6, …, 20}   Prediction precision: 93.5%

Figure 3. An example of predicted positive images from a 4x5 candidate set in the simulated relevance feedback task. The query image is "hedgehog", and the DT model returned 8 predicted positive images (in red frames) based on the 11-feature vector with 100% accuracy.
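The paper does not spell out the DT implementation, so the following is only an illustrative sketch using a scikit-learn decision tree on per-candidate feature rows with binary relevance labels; the random stand-in data and the split are placeholders, not the authors' training protocol.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score

# X: one row per candidate image, e.g. (FD_i, FC_i, FL_i, R_i, ...);
# y: 1 if the candidate is a positive image, else 0.
# Random data stands in here for the real eye movement features.
rng = np.random.default_rng(0)
X = rng.random((441, 4))
y = rng.integers(0, 2, 441)

split = len(X) // 2                          # placeholder train/test split
clf = DecisionTreeClassifier(random_state=0).fit(X[:split], y[:split])
y_pred = clf.predict(X[split:])
print(f"prediction precision: {precision_score(y[split:], y_pred):.3f}")
```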
5 Conclusion and Further Work

An eye tracking system can potentially be integrated into a CBIR system as a more efficient input mechanism for implementing the user's relevance feedback process. In this paper, we have concentrated on a group of fixation-related measurements that capture static eye movement patterns. In fact, dynamic characteristics such as saccades and scan paths can also manifest human organizational behavior and decision processes, revealing the pre-attention and cognition processes of a human being while viewing an image. In our further work, we will develop a more comprehensive study that includes both the static and the dynamic features of eye movements. Eye movement is, at root, a unity of a human's conscious and unconscious visual cognition behavior, and it can be used not only for relevance feedback but also as a new source for image representation: human image viewing automatically bridges low-level features, such as color, texture, shape, and spatial information, to human attention, such as AOIs. As a result, eye tracking data can be a rich new source for improving image representation [Wu et al. 2009]. Our future work is to develop an eye tracking based CBIR system in which humans' natural eye movements are effectively exploited in the modules of image representation, similarity measurement and relevance feedback.

Acknowledgments

The work reported in this paper is substantially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project code: PolyU 5141/07E) and a PolyU Grant (Project code: 1-BBZ9).

References

FLICKNER, M., SAWHNEY, H., NIBLACK, W., ASHLEY, J., HUANG, Q., DOM, B., GORKANI, M., HAFNER, J., LEE, D., PETKOVIC, D., STEELE, D. AND YANKER, P. 1995. Query by Image and Video Content: The QBIC System. Computer 28, 23-32.

GOLDBERG, J.H., STIMSON, M.J., LEWENSTEIN, M., SCOTT, N. AND WICHANSKY, A.M. 2002. Eye tracking in web search tasks: design implications. In ETRA '02: Proceedings of the 2002 Symposium on Eye Tracking Research & Applications, New Orleans, Louisiana. ACM, New York, NY, USA, 51-58.

GOŁOFIT, K. 2008. Click Passwords Under Investigation. In Computer Security - ESORICS 2007, 343-358.

JACOB, R. AND KARN, K. 2003. Eye Tracking in Human-Computer Interaction and Usability Research: Ready to Deliver the Promises. In The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Research, HYÖNÄ, RADACH AND DEUBEL, Eds. Elsevier Science, Oxford, England.

KLAMI, A., SAUNDERS, C., DE CAMPOS, T.E. AND KASKI, S. 2008. Can relevance of images be inferred from eye movements? In MIR '08: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, Vancouver, British Columbia, Canada. ACM, New York, NY, USA, 134-140.

LI, F. 2005. Visual Recognition: Computational Models and Human Psychophysics. PhD thesis, California Institute of Technology.

LIU, D., HUA, K., VU, K. AND YU, N. 2006. Fast Query Point Movement Techniques with Relevance Feedback for Content-Based Image Retrieval. In Advances in Database Technology - EDBT 2006, 700-717.

OYEKOYA, O. AND STENTIFORD, F. 2004. Exploring Human Eye Behaviour Using a Model of Visual Attention. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), Volume 4. IEEE Computer Society, Washington, DC, USA, 945-948.

OYEKOYA, O. AND STENTIFORD, F. 2006. Perceptual Image Retrieval Using Eye Movements. In Advances in Machine Vision, Image Processing, and Pattern Analysis, 281-289.

SALOJÄRVI, J., PUOLAMÄKI, K. AND KASKI, S. 2004. Relevance feedback from eye movements for proactive information retrieval. In Workshop on Processing Sensory Information for Proactive Systems (PSIPS 2004), 14-15.

TAO, D., TANG, X. AND LI, X. 2008. Which Components are Important for Interactive Image Searching? IEEE Transactions on Circuits and Systems for Video Technology 18, 3-11.

WU, L., HU, Y., LI, M., YU, N. AND HUA, X.-S. 2009. Scale-Invariant Visual Language Modeling for Object Categorization. IEEE Transactions on Multimedia 11, 286-294.

ZHOU, X.S. AND HUANG, T.S. 2003. Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems 8, 536-544.
