Content-based Image Retrieval Using a Combination of Visual Features and Eye Tracking Data

Zhen Liang1*, Hong Fu1, Yun Zhang1, Zheru Chi1, Dagan Feng1,2
1 Centre for Multimedia Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China
2 School of Information Technologies, The University of Sydney, Sydney, Australia
Email: zhenliang@eie.polyu.edu.hk, enhongfu@inet.polyu.edu.hk, tvsunny@gmail.com, enzheru@inet.polyu.edu.hk, enfeng@polyu.edu.hk

Abstract

Image retrieval technology has been developed for more than twenty years. However, current image retrieval techniques cannot achieve a satisfactory recall and precision. To improve the effectiveness and efficiency of an image retrieval system, a novel content-based image retrieval method combining image segmentation and eye tracking data is proposed in this paper. In the method, eye tracking data is collected by a non-intrusive table-mounted eye tracker at a sampling rate of 120 Hz, and the corresponding fixation data is used to locate the human's Regions of Interest (hROIs) on the segmentation result from the JSEG algorithm. The hROIs are treated as important informative segments/objects and used in image matching. In addition, the relative gaze duration of each hROI is used to weigh the similarity measure for image retrieval. The similarity measure proposed in this paper is based on a retrieval strategy emphasizing the most important regions. Experiments on 7,346 Hemera color images annotated manually show that the retrieval results from our proposed approach compare favorably with conventional content-based image retrieval methods, especially when the important regions are difficult to locate based on visual features.

CR Categories: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Relevance feedback, Search process; H.5.2 [Information Interfaces and Representation]: User interfaces

Keywords: eye tracking, content-based image retrieval (CBIR), visual perception, similarity measure, fixation

1 Introduction

Due to the exponential growth of digital images on a daily basis, content-based image retrieval (CBIR) has been a very active research topic since the early 1990s. In CBIR, image contents are characterized for searching images similar to the query image. Usually, low-level features such as colors, shapes and textures are used to form a feature vector representing an image, and the similarity between images is computed based on the distance between the corresponding feature vectors. However, low-level features are quite often not sufficient to describe an image, and the gap between low-level features and high-level semantic concepts has become a major difficulty hindering the further development of CBIR systems [Smeulders et al. 2002]. As an attempt to reduce the semantic gap and improve retrieval performance, Region-Based Image Retrieval (RBIR) techniques have been proposed.

In an RBIR system, local information is extracted from the whole image and used for image retrieval. The basic rationale is that a user who searches for similar images is normally interested in visual objects/segments, and features extracted from the whole image may not properly represent the characteristics of those objects. The standard process of an RBIR system includes three steps: (1) segmenting an image into a set of regions; (2) extracting features from the segmented regions, known as "local features"; and (3) measuring the similarity between the query image and every candidate image in terms of the local features. Many recent algorithms focus on improving the efficiency and effectiveness of image segmentation, feature extraction and similarity measurement in RBIR systems [Tsai et al. 2003; Lv et al. 2004; Marques et al. 2006; Wang et al. 2006].
On the other hand, the segmentation process may fail to produce objects if they are not salient in terms of their visual features, even though these objects carry important semantic information. One influential RBIR approach addresses this problem by integrating into the system a process of manually selecting important regions and indicating feature importance [Carson et al. 1999]. However, such schemes place a heavy burden on users and are not convenient at all.

In 2003, Parkhurst and Niebur pointed out that eye movements under natural viewing conditions are determined by a Selective Visual Attention Model (SVAM) [Parkhurst and Niebur 2003]. The SVAM is composed of two stages: a bottom-up procedure driven by low-level features and a top-down process guided by a high-level understanding of the image. For integrating top-down processing into an RBIR system, eye tracking can provide a more natural, convenient and imperceptible way to understand the user's intention than asking him/her to manually select the ROIs. It has been found that longer and more frequent fixations land on the objects in a scene [De Graef et al. 1990; Henderson and Hollingworth 1998]. Therefore, the relative gaze duration can be utilized to improve retrieval performance by weighing the corresponding hROI.

[ETRA 2010, Austin, TX, March 22-24, 2010. © 2010 ACM 978-1-60558-994-7/10/0003]
In this paper, a model using a combination of visual features and eye tracking data is proposed to reduce the semantic gap and improve retrieval performance. The flowchart of our proposed model is shown in Figure 1. After eye tracking data are collected by a non-intrusive table-mounted eye tracker and the image is segmented by the JSEG algorithm, the fixation data is extracted and used to locate the hROIs on the segmented image. The relative gaze duration in each hROI is also computed to weigh its importance. The selected hROIs are treated as important informative segments/objects and used in image retrieval.

The rest of this paper is organized as follows. Eye tracking data acquisition is described in Section 2. In Section 3, the JSEG algorithm is explained. In Section 4, we discuss how to construct an image retrieval model with eye tracking data in terms of region selection, feature extraction, weight calculation and similarity measurement. In Section 5, experimental results are reported, with a comparison of our new approach against conventional image retrieval methods. Finally, a conclusion is drawn and future work is outlined in Section 6.

Figure 1: The flowchart of our proposed model.

Figure 2: Representative images (a) with the corresponding segmentation results (b) and eye tracking data acquisition (c).

2 Eye Tracking Data Acquisition

A non-intrusive table-mounted eye tracker, the Tobii X120, was used to collect eye movement data in a user-friendly environment with a high accuracy (0.5 degree) at a 120 Hz sampling rate. The freedom of head movement is 30 x 22 x 30 cm. Before each data collection session, a calibration is conducted on a grid of nine calibration points to minimize errors in eye tracking. Fixations (location and duration) were extracted from the raw eye tracking data using a criterion of fixation radius (35 pixels) and minimum fixation duration (100 ms). Each image in the 7,346-image Hemera color image database is viewed by a participant with normal vision for up to 5 seconds. The viewing distance is approximately 68 cm from the display, whose resolution is 1920 x 1200 pixels; the corresponding subtended visual angle is about 41.5º x 26.8º.
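The paper states the two fixation criteria but not the detection algorithm itself. The following is a minimal dispersion-style sketch consistent with those criteria: a fixation is a run of gaze samples staying within the 35-pixel radius for at least 100 ms. The function name `extract_fixations`, the sample format, and the centroid-based clustering rule are assumptions for illustration, not the Tobii X120's built-in fixation filter.

```python
# Minimal sketch of fixation extraction under the paper's two criteria
# (35-pixel radius, 100 ms minimum duration); not the Tobii algorithm.
import math

SAMPLE_RATE_HZ = 120          # Tobii X120 sampling rate
FIXATION_RADIUS_PX = 35       # criterion from Section 2
MIN_DURATION_MS = 100         # criterion from Section 2

def extract_fixations(samples):
    """samples: list of (x, y) gaze points at 120 Hz.
    Returns a list of (cx, cy, duration_ms) fixations."""
    ms_per_sample = 1000.0 / SAMPLE_RATE_HZ
    fixations, run = [], []

    def close_run(run):
        # Keep the run as a fixation only if it lasted long enough.
        if len(run) * ms_per_sample >= MIN_DURATION_MS:
            cx = sum(p[0] for p in run) / len(run)
            cy = sum(p[1] for p in run) / len(run)
            fixations.append((cx, cy, len(run) * ms_per_sample))

    for x, y in samples:
        if run:
            cx = sum(p[0] for p in run) / len(run)
            cy = sum(p[1] for p in run) / len(run)
            if math.hypot(x - cx, y - cy) > FIXATION_RADIUS_PX:
                # Gaze left the radius: close the current run.
                close_run(run)
                run = []
        run.append((x, y))
    close_run(run)
    return fixations
```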
3 Image Segmentation

A state-of-the-art segmentation algorithm, JSEG [Deng and Manjunath 2001], is used in this paper to segment images into regions based on low-level features (color and texture). The image segmentation step is similar to human bottom-up processing in that it helps locate objects and boundaries in an image retrieval system. In our experiment, images are downsized to a maximum width/length of 100 pixels with a fixed aspect ratio before segmentation, which reduces the computational complexity and increases retrieval efficiency. Figure 2(b) gives some segmentation results.

4 Image Retrieval Model

A novel image retrieval model using a combination of image segmentation results and eye tracking data is proposed in this section. The aim of the image segmentation step is to simulate the bottom-up processing that coarsely interprets and parses images into coherent segments based on low-level features [Spekreijse 2000]. Rutishauser et al. [2004] demonstrated that bottom-up attention is partially useful for object recognition, but it is not sufficient. In the second stage of human visual attention, top-down processing, one or a few objects are selected from the whole image for more thorough analysis [Spekreijse 2000]. This selection procedure is guided not only by elementary features but also by human understanding. Thus, if an image retrieval strategy could incorporate both bottom-up and top-down mechanisms, its accuracy and efficiency would be largely improved. Eye tracking provides an important signal for understanding which region(s) the user is concerned with, or which object(s) in the image is the target the user wants to search for. Fixation is one of the most important features in eye tracking data: it tells us where the subject's attention points are and how long the subject fixates on each attention point. Thus, in our proposed model, the fixation data is used to define the hROIs on the segmented image, and the relative gaze duration in each hROI is treated as the corresponding region significance.

4.1 Selection Process of Human's ROI

Here, we use eye tracking data to locate the observer's regions of interest in an image, and an importance value for each segmented region is defined as the relative gaze duration on the region. Some example images with their eye tracking data are shown in Figure 2(c). Suppose that an image I is composed of N segmented regions (Eq. (1)), and the relative gaze duration on the image I is D:

    I = \{S_1, \ldots, S_i, \ldots, S_N\},    (1)

where S_i is the i-th segmented region.

A concept of importance value is introduced to quantify the degree of the observer's interest in a region. Assume that the relative gaze duration on the segmented region S_i is d_i; the corresponding importance value w_i is calculated by Eq. (2), and is 0 if there is no fixation on the region:

    w_i = d_i / D, \qquad \sum_{i=1}^{N} w_i = 1,    (2)

where D = \sum_{i=1}^{N} d_i.

As shown in Figure 3(a), the popping-out process of human ROIs consists of the following steps. Step 1: Scale the eye tracking data to the segmented map size. Step 2: Determine whether a segmented region is an hROI or not. Step 3: If all segmented regions in the image have been processed, terminate the procedure; otherwise, go to Step 2. Figure 3(b) gives some examples of the selection results in terms of weighting maps: the higher the importance value, the redder the region appears in the map.

Figure 3: Selection process of hROIs (a) and weighting maps (b) (the value in the weighting map is the corresponding importance value of the region).
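As a concrete reading of Eq. (2) and Steps 1-2, the sketch below assigns fixations (already scaled to the segmentation map size) to regions through a per-pixel label map and normalizes the per-region gaze durations into importance values. The data structures and the function name `region_importance` are illustrative assumptions, not from the paper.

```python
# Sketch of the hROI weighting of Section 4.1: each region's importance
# value w_i is its share of the total gaze duration D (Eq. (2)).
from collections import defaultdict

def region_importance(fixations, label_map):
    """fixations: list of (x, y, duration_ms); label_map[y][x] = region id.
    Returns {region_id: w_i}, with the weights summing to 1 (Eq. (2))."""
    gaze = defaultdict(float)            # d_i: gaze duration per region
    for x, y, dur in fixations:
        region = label_map[int(y)][int(x)]
        gaze[region] += dur
    total = sum(gaze.values())           # D = sum of all d_i
    if total == 0:
        return {}                        # no fixations: no hROIs selected
    return {region: d / total for region, d in gaze.items()}
```

Regions that receive no fixations get weight 0 and are excluded, so only the fixated segments survive as hROIs.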
In the next image retrieval step, the selected hROIs are treated as important informative segments/objects, and the corresponding importance values are used as the region significance to weigh the similarity measure.

4.2 Feature Extraction

Color and texture properties are extracted from the selected hROIs for the similarity measure. For the color property, the HSV color space is used because it approximates human perception [Paschos 2001]. For the texture property, the Sobel operator is used to produce an edge map from the gray-scale image. A feature set comprising an 11 x 11 x 11 HSV color histogram and a 1 x 41 texture histogram of the edge map is used to characterize each region.
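For illustration, a sketch of this region descriptor using OpenCV and NumPy. The paper does not specify the edge-magnitude range behind the 41 texture bins or any histogram normalization, so those choices (and the function name `region_features`) are assumptions; a non-empty region mask is assumed.

```python
# Sketch of the Section 4.2 descriptor: an 11x11x11 HSV color histogram
# plus a 41-bin histogram of Sobel edge magnitudes, restricted to one hROI.
import cv2
import numpy as np

def region_features(bgr_image, mask):
    """bgr_image: HxWx3 uint8; mask: HxW bool array selecting one hROI."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    pix = hsv[mask].astype(np.float32)
    # 11x11x11 HSV color histogram (OpenCV ranges: H < 180, S, V < 256).
    color_hist, _ = np.histogramdd(
        pix, bins=(11, 11, 11), range=((0, 180), (0, 256), (0, 256)))
    # Sobel edge map of the gray-scale image, then a 41-bin histogram
    # of the magnitudes inside the region (bin range is an assumption).
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    edge = cv2.magnitude(gx, gy)[mask]
    texture_hist, _ = np.histogram(edge, bins=41, range=(0, edge.max() + 1e-6))
    # L1-normalize (our choice) so regions of different sizes compare.
    feat = np.concatenate([color_hist.ravel(), texture_hist]).astype(np.float32)
    return feat / max(feat.sum(), 1e-12)
```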
4.3 Similarity Measure

An image is represented by several regions in the image retrieval system. Suppose that there are m hROIs from the query image, Q = \{q_1, \ldots, q_m\}, and n hROIs from a candidate image, C = \{c_1, \ldots, c_n\}. As discussed in Section 4.1, the corresponding region weight vectors of the query image and the candidate image are W^Q = \{w^Q_1, \ldots, w^Q_m\} and W^C = \{w^C_1, \ldots, w^C_n\}, respectively. The similarity matrix among the regions of the two images is defined as

    S = [s_{ij}] = [d(q_i, c_j)], \quad i = 1, \ldots, m;\ j = 1, \ldots, n,    (3)

where d(q_i, c_j) is the Euclidean distance between the feature vectors of regions q_i and c_j. The weight matrix, which indicates the importance of the corresponding entry of the similarity matrix, is defined as

    W = [w_{ij}] = [w^Q_i w^C_j], \quad i = 1, \ldots, m;\ j = 1, \ldots, n.    (4)

To find the most similar region in the candidate image, a matching matrix is defined as

    M = [m_{ij}], \quad i = 1, \ldots, m;\ j = 1, \ldots, n,    (5)

where

    m_{ij} = \begin{cases} 1 & \text{if } j = j^\ast \text{ and } j^\ast = \arg\min_{j} s_{ij}, \\ 0 & \text{otherwise,} \end{cases} \quad i = 1, \ldots, m.    (6)

In the matching matrix, exactly one element in each row is 1 and the others are 0; the 1 marks the entry of the similarity matrix that is the minimum of its row. Thus, the distance between two images in the proposed image retrieval model is defined as

    \mathrm{Dist}(Q, C) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} m_{ij} s_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} m_{ij}}.    (7)

When the query image and a candidate image are identical, the distance in Eq. (7) is zero. Thus, for a query image, a smaller distance indicates more closely matched regions in the candidate image; in other words, the candidate is more relevant to the query.
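The matching scheme of Eqs. (3)-(7) transcribes directly into NumPy. The sketch below follows the equations as given above; the function name and array layout are illustrative.

```python
# Sketch of the region-matching distance of Section 4.3: each query hROI
# is matched to its nearest candidate hROI by Euclidean feature distance,
# and the matched distances are averaged with weights w_ij = w_i^Q * w_j^C.
import numpy as np

def image_distance(q_feats, q_weights, c_feats, c_weights):
    """q_feats: (m, F) hROI feature vectors of the query; q_weights: (m,).
    c_feats: (n, F) hROI features of a candidate; c_weights: (n,)."""
    # Eq. (3): similarity (distance) matrix s_ij = ||q_i - c_j||.
    s = np.linalg.norm(q_feats[:, None, :] - c_feats[None, :, :], axis=2)
    # Eq. (4): weight matrix w_ij = w_i^Q * w_j^C.
    w = np.outer(q_weights, c_weights)
    # Eqs. (5)-(6): one 1 per row, at the row-wise minimum of s.
    m = np.zeros_like(s)
    m[np.arange(s.shape[0]), s.argmin(axis=1)] = 1.0
    # Eq. (7): weighted average of the matched region distances.
    return (w * m * s).sum() / (w * m).sum()
```

Ranking a database then amounts to computing this distance for every candidate image and sorting in ascending order.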
5 Experimental Results and Discussion

5.1 Image Database and Evaluation Criterion

The retrieval experiments are conducted on 7,346 Hemera color images manually annotated with keywords. Figure 4 shows example images from a few categories. The evaluation criterion for image retrieval performance applied here does not simply label returned images as "relevant" or "irrelevant", but is based on the ratio of matched keywords between the query image and a returned database image. Suppose that the query image and the retrieved image have M and N keywords, respectively, with P matched keywords; then the semantic similarity is defined as

    \mathrm{Sim}(\text{query image}, \text{retrieved image}) = \frac{P}{(M + N)/2}.    (8)
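Eq. (8) is straightforward to compute: the number of shared keywords divided by the mean keyword count of the two images. A small sketch, with illustrative keyword sets:

```python
# Sketch of the evaluation measure of Eq. (8).
def semantic_similarity(query_keywords, retrieved_keywords):
    q, r = set(query_keywords), set(retrieved_keywords)
    matched = len(q & r)                       # P
    return matched / ((len(q) + len(r)) / 2)   # P / ((M + N) / 2)

# Example: 2 matches out of (3 + 4)/2 = 3.5 keywords -> 0.571...
print(semantic_similarity({"cow", "grass", "field"},
                          {"cow", "grass", "sky", "tree"}))
```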
Figure 4: Example images in the image database. (a) Asian architecture; (b) Close-up; (c) People; (d) Landmarks.

Figure 5: Average semantic similarity vs. the number of images returned for the four themes of images shown in Figure 4.

5.2 Performance Evaluation of Image Retrieval

The performance of our proposed image retrieval model on different types of query images (Figure 4) is shown in Figure 5, compared with the following three methods: 1) Global based: retrieval based on the Euclidean distance between the global color and texture histograms of two images; 2) Attention based: the attention-driven image retrieval strategy proposed in [Fu et al. 2006]; 3) Attention object1 based: retrieval using only the first popped-out object in the attention-driven image retrieval strategy [Fu et al. 2006]. The fixation-based image retrieval system is the one proposed in this paper.

Figure 5 shows the image retrieval results of the four methods above for the different image classes. Our fixation-based method is significantly better than the other methods on the "Asian architecture" and "People" image classes in terms of average semantic similarity. For the other two image classes, "Close-up" and "Landmarks", our method is better when the number of returned images is small (20 or fewer), suggesting that our method is more effective.

5.3 Discussion

Our proposed model achieves better retrieval performance than the other three image retrieval methods when the objects are hidden in the background (i.e., the low-level features of the objects are not conspicuous) or when there are multiple objects in the image. For example, for the image shown in Figure 6 (left), "man working out in the gym", Fu et al.'s model places a higher importance value on the white floor and treats the other parts as background, whereas in our model the man and the woman in the corner are considered the two most important hROIs. A comparison of the hROI selections of the fixation-based and attention-based approaches on this example image is shown in Figure 6 (right). In global-based image retrieval, on the other hand, all the information is mixed together, so objects of different significance cannot be distinguished, especially when the objects are hidden in the background. In our method, the important information in the image can be extracted and well ranked based on the human visual attention process. For example, for the image of a cow against a grass background (Figure 2), the grass occupies a much larger area than the cow. As a result, global-based image retrieval prefers to retrieve images that also contain green objects and/or a green background. In contrast, our method identifies the cow as the most important object in the image, and the retrieval performance is accordingly much improved.

Figure 6: Fixation-based vs. attention-based selection, where the value below the left image is the corresponding importance value.

6 Conclusion

In this paper, we report our study on imitating the human visual attention process for CBIR by combining image segmentation and eye tracking techniques. The JSEG algorithm is used to parse the image into homogeneous sub-regions, and eye tracking data is utilized to locate the hROIs on the segmented image. In the similarity measurement step, each hROI is weighed by its fixation duration as the importance value to emphasize the most important regions. Retrieval results on 7,346 Hemera color images show that our proposed approach compares favorably with conventional CBIR methods, especially when the important regions are difficult to locate based on the visual features of an image. Future work includes collecting eye tracking data during the relevance feedback process and refining both the feature extraction and the weight computation.

Acknowledgements

This work is supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project code: PolyU 5141/07E) and the PolyU Grant (Project code: 1-BBZ9).

References

CARSON, C., THOMAS, M., BELONGIE, S., HELLERSTEIN, J. M., AND MALIK, J. 1999. Blobworld: A system for region-based image indexing and retrieval. In Proceedings of Visual Information Systems, 509-516.

DENG, Y., AND MANJUNATH, B. 2001. Unsupervised segmentation of color-texture regions in images and videos. IEEE Trans. Pattern Anal. Mach. Intell. 23(8), 800-810.

FU, H., CHI, Z., AND FENG, D. 2006. Attention-driven image interpretation with application to image retrieval. Pattern Recognition 39(9), 1604-1621.

DE GRAEF, P., CHRISTIAENS, D., AND D'YDEWALLE, G. 1990. Perceptual effects of scene context on object recognition. Psychological Research 52, 317-329.

HENDERSON, J. M., AND HOLLINGWORTH, A. 1998. Eye movements during scene viewing: An overview. In Eye Guidance While Reading and While Watching Dynamic Scenes, G. Underwood, Ed. Elsevier Science, Amsterdam, 269-293.

LV, Q., CHARIKAR, M., AND LI, K. 2004. Image similarity search with compact data structures. In Proceedings of the ACM International Conference on Information and Knowledge Management, 208-217.

MARQUES, O., MAYRON, L., BORBA, G., AND GAMBA, H. 2006. Using visual attention to extract regions of interest in the context of image retrieval. In Proceedings of the ACM Annual Southeast Regional Conference, 638-643.

PARKHURST, D. J., AND NIEBUR, E. 2003. Scene content selected by active vision. Spatial Vision 16(2), 125-154.

PASCHOS, G. 2001. Perceptually uniform color spaces for color texture analysis: An empirical evaluation. IEEE Trans. Image Process. 10(6), 932-937.

RUTISHAUSER, U., WALTHER, D., KOCH, C., AND PERONA, P. 2004. Is bottom-up attention useful for object recognition? In Proceedings of CVPR 2004, Vol. 2, 37-44.

SMEULDERS, A. W. M., WORRING, M., AND SANTINI, S. 2002. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1349-1380.

SPEKREIJSE, H. 2000. Pre-attentive and attentive mechanisms in vision. Perceptual organization and dysfunction. Vision Research 40, 1179-1638.

TSAI, C. F., MCGARRY, K., AND TAIT, J. 2003. Image classification using hybrid neural networks. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 431-432.

WANG, X. Y., HU, F. L., AND YANG, H. Y. 2006. A novel regions-of-interest based image retrieval using multiple features. In Proceedings of the Multi-Media Modeling International Conference, Vol. 1, 377-380.
