Inferring Object Relevance from Gaze in Dynamic Scenes

Melih Kandemir, Helsinki University of Technology, Department of Information and Computer Science (melihk@cis.hut.fi)
Veli-Matti Saarinen, Helsinki University of Technology, Low Temperature Laboratory (veli-m@neuro.hut.fi)
Samuel Kaski, Helsinki University of Technology, Department of Information and Computer Science (samuel.kaski@tkk.fi)

Copyright © 2010 ACM. ETRA 2010, Austin, TX, March 22-24, 2010. 978-1-60558-994-7/10/0003.

Abstract

As prototypes of data glasses having both data augmentation and gaze tracking capabilities are becoming available, it is now possible to develop proactive gaze-controlled user interfaces that display information about objects, people, and other entities in real-world setups. In order to decide which objects the augmented information should be about, and how saliently to augment, the system needs an estimate of the importance or relevance of the objects in the scene for the user at a given time. The estimates will be used to minimize distraction of the user and to provide efficient spatial management of the augmented items. This work is a feasibility study on inferring the relevance of objects in dynamic scenes from gaze. We collected gaze data from subjects watching a video for a pre-defined task. The results show that a simple ordinal logistic regression model gives relevance rankings of scene objects with promising accuracy.

CR Categories: H.5.2 [Information Interfaces and Presentation (HCI)]: User interfaces—User interface management systems

Keywords: augmented reality, gaze tracking, information retrieval, intelligent user interfaces, machine learning, ordinal logistic regression

1 Introduction

In this paper, we develop a method needed for doing information retrieval in dynamic real-world scenes where the queries are formulated implicitly by gaze. In our setup the user wears a ubiquitous information access device, "data glasses", with eye-tracking and information augmentation capabilities. The device is assumed to be capable of recognising and tracking certain types of objects from the first-person video data of the user. Figure 1 illustrates the idea: some objects, three faces and the whiteboard in this image, are augmented with attached boxes that include textual information obtained from other sources. In such a setup, each visible object in a scene can be considered a channel through which additional relevant information can be obtained as an augmentation on the screen. As in traditional information retrieval setups such as text search engines, the potential abundance of available information brings up the need for a mechanism to rank the channels with respect to their relevance. This is particularly important in proactive mobile setups where the augmented items are also potential distractors.
Figure 1: A screenshot from the eyesight of hypothetical data glasses with augmented reality capability during a short presentation in a meeting room (Scene 1).

Our goal is to infer the degree of interest of the user for the objects in the scene. This problem has a connection to the modelling of visual attention [Henderson 2003; Itti et al. 1998; Zhang et al. 2008]; whereas visual attention models typically try to predict the gaze pattern given the scene, our target is the inverse problem of inferring the user's state (interests) given the scene and the gaze trajectory. A good solution to the former problem would obviously help in our task too, but current visual attention models mainly consider physical pointwise saliency, which does not yet capture the mainly top-down nature of the effects of the user's interest on the gaze pattern. Although there exist some initial attempts towards two-way saliency modelling [Torralba et al. 2006], these are evaluated only for rather trivial visual tasks such as counting a certain type of object in static images. Unlike top-down models, where the model is optimised given a well-defined search task, the cognitive task of the subject in our setup is hidden and may even be unclear to the subject herself. Hence, we start with data-driven statistical machine learning techniques for the inverse modelling task.

Gaze data has been used in user interfaces in three ways. Our goal is the furthest from the most frequent approach, off-line analysis, for instance studying the effectiveness of advertisements in attracting people's attention, or analysis of social interaction. In the second approach the user selects actions by explicitly looking at the choices, for instance eye typing [Hyrskykari et al. 2000; Ward and MacKay 2002]. Although such explicit selection mechanisms are easy to implement, they require full user attention and are strenuous because of the Midas touch effect: each glance activates an action whether it is intended or not. The third way of using gaze data in user interfaces is implicit feedback. The user uses her gaze normally, and information needed by the interface is inferred from the gaze data. An emerging example is proactive information retrieval, where statistical machine learning methods are used for inferring relevance from gaze patterns. The inferred relevance judgements are then used as implicit relevance feedback for information retrieval. This has been done for text retrieval by generating implicit queries from gaze patterns [Hardoon et al. 2007].
The same principle has been used for image retrieval as well [Klami et al. 2008], recently also coupled dynamically to a retrieval engine in an interactive zooming interface [Kozma et al. 2009]. Gaze has additionally been used as a means of proactive interaction, but not information retrieval, in a desktop application by assigning a relevance function to the entities on a synthetic 2D map [Qvarfordt and Zhai 2005].

To test the feasibility of the idea of relevance ranking from gaze in dynamic real-world setups, we prepared a stimulus video and collected gaze data from subjects watching that video. True relevance rankings were then asked from the subjects for several frames. We trained an ordinal logistic regression model and measured its accuracy in the relevance prediction task on the left-out data.

2 Measurement Setup

We shot a video from the first-person view of a subject visiting three indoor scenes. We then postprocessed this video by augmenting some of the objects with additional textual information in an attached box. This video was shown to 4 subjects and gaze data was collected. Right after the viewing session the subjects ranked the scene objects in relevance order for a subset of the video frames. The ranking was considered as the ground truth for learning the models and evaluating them. The modelling task is to predict the user-given ranking for an object given the gaze-tracking data from a window immediately preceding the ranked frame.

3 Model for Inferring Relevance

Let us index the stimulus slices preceding each relevance judgement from 1 to N. We extract a feature vector (details in the Experiments section) for each scene object i at time slice t to obtain a single unlabelled data point f_i^{(t)} = \{f_{i1}^{(t)}, f_{i2}^{(t)}, \dots, f_{id}^{(t)}\}, where d is the number of features. If we also attach the ground-truth relevance ranking r_i^{(t)}, we get a labelled data point (f_i^{(t)}, r_i^{(t)}). Let us denote the set of data points, one for each object, related to time slice t as a data subset \Lambda^{(t)} = \{(f_1^{(t)}, r_1^{(t)}), \dots, (f_{m_t}^{(t)}, r_{m_t}^{(t)})\}, where m_t is the number of visible objects at time slice t. Let us denote the data subset without labels by \bar{\Lambda}^{(t)}, and the maximum number of visible objects by L = \max(\{m_1, \dots, m_N\}). For notational convenience, we define the most relevant object to have rank L, and the rank decreases as relevance decreases. The whole labelled data set consists of the union of all data subsets, \Delta = \{\Lambda^{(1)}, \Lambda^{(2)}, \dots, \Lambda^{(N)}\}.

We search for a mapping from the feature space to the space of relevances, which is conventionally [0, 1]. Such a mapping can be achieved directly using ordinal logistic regression [McCullagh and Nelder 1989] if we assume that the relevance of an object depends only on its own features and is independent of the relevance of the other visible objects. We use the standard approach as described briefly below. Let us denote the probability of the object rank being k as P(r_i^{(t)} = k \mid f_i^{(t)}) = \phi_k(f_i^{(t)}). Then we can define the log odds such that the problem reduces to a batch of L - 1 binary regression problems, one for each k = 1, 2, \dots, L - 1:

M_k = \log \frac{P(r_i^{(t)} \le k \mid f_i^{(t)})}{1 - P(r_i^{(t)} \le k \mid f_i^{(t)})}
    = \log \frac{\phi_0(f_i^{(t)}) + \phi_1(f_i^{(t)}) + \cdots + \phi_k(f_i^{(t)})}{\phi_{k+1}(f_i^{(t)}) + \phi_{k+2}(f_i^{(t)}) + \cdots + \phi_L(f_i^{(t)})}
    = w_0^{(k)} + w f_i^{(t)},

where a linear model is assumed. By taking the exponent of both sides we get the CDF of the rank distribution for object i at time t:

P(r_i^{(t)} \le k \mid f_i^{(t)}) = \frac{\exp(w_0^{(k)} + w f_i^{(t)})}{1 + \exp(w_0^{(k)} + w f_i^{(t)})}.

Notice that we adopt the standard approach and use common slope coefficients w = [w_1, \dots, w_d] for all logit models but different intercepts w_0^{(k)}. In the training phase, we calculate the maximum likelihood estimates of the parameters \theta = \{w_0^{(1)}, \dots, w_0^{(L-1)}, w_1, \dots, w_d\} of this model using the Newton-Raphson technique. Given an unlabelled data subset \bar{\Lambda}^{(t)} at time t, the object with relevance rank k is predicted to be the one that has the highest probability for that rank, \arg\max_i \phi_k(f_i^{(t)}).
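For illustration, the following is a minimal Python/NumPy sketch of fitting and using such a cumulative-logit (proportional-odds) model. It is not the authors' implementation: the function names, the ordered-intercept parameterisation, and the use of BFGS instead of Newton-Raphson are assumptions made for the example, and ranks are assumed to run from 1 (least relevant) to L (most relevant).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def _intercepts(raw):
    # Ordered intercepts w0^(1) <= ... <= w0^(L-1), encoded as a free first
    # value followed by log-increments so the ordering always holds.
    return np.cumsum(np.concatenate(([raw[0]], np.exp(raw[1:]))))

def rank_probs(params, F, L):
    """phi_k(f) = P(r = k | f) for k = 1..L, for every row of the feature matrix F."""
    d = F.shape[1]
    w, w0 = params[:d], _intercepts(params[d:])
    cdf = expit(w0[None, :] + (F @ w)[:, None])            # P(r <= k), shape (n, L-1)
    cdf = np.hstack([np.zeros((len(F), 1)), cdf, np.ones((len(F), 1))])
    return np.diff(cdf, axis=1)                            # shape (n, L)

def neg_log_lik(params, F, r, L):
    probs = rank_probs(params, F, L)
    return -np.sum(np.log(probs[np.arange(len(r)), r - 1] + 1e-12))

def fit_ordinal_logit(F, r, L):
    """Maximum-likelihood fit of the proportional-odds model
    (BFGS for brevity; the paper uses Newton-Raphson)."""
    r = np.asarray(r, dtype=int)
    x0 = np.zeros(F.shape[1] + L - 1)
    return minimize(neg_log_lik, x0, args=(F, r, L), method="BFGS").x

def predict_most_relevant(params, F_objects, L):
    """Among the objects visible in one time slice, return the index of the one
    with the highest probability of the top rank L (the most relevant object)."""
    return int(np.argmax(rank_probs(params, F_objects, L)[:, L - 1]))
```

Here F would be an n x d matrix of per-object feature vectors pooled over all labelled time slices and r the corresponding integer ranks; at prediction time, F_objects holds the feature vectors of the objects visible in a single slice.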
4 Experiments

4.1 Stimulus Preparation

We shot a video clip 4 minutes and 17 seconds long from the first-person view of a subject, using a see-through head-mounted display device. In the scenario of the clip, a visitor coming to our laboratory is informed about our research project. The scenario consists of three consecutive scenes:

1. A short presentation in a meeting room: A researcher introduces the project with a block diagram drawn on the whiteboard (Figure 1) in a meeting room. The people present ask questions. The visitor follows the presentation.

2. A walk in the lab corridor: The visitor walks through the laboratory, taking a look at posters on the wall, and zooms into some of the name tags on office doors.

3. Demo of data collection devices: The host explains how eye tracking experiments are made. He demonstrates a monitor with eye tracking capabilities and the head-mounted display device.

Next, we augmented the video by attaching information boxes to objects such as faces, the whiteboard, name tags, posters, and devices related to the project. These were considered to be the objects potentially most interesting to the visitor. Short snippets of textual information relevant to the objects were displayed inside the boxes. At most one information box was attached to any one object at a time, and we displayed boxes for all visible objects. There were from 0 to 4 objects in the scene at a time; the average number of scene objects was 2.017 with a standard deviation of 1.36. The frame rate of the postprocessed video was 12 fps.

4.2 Data Collection

We collected gaze data from 4 subjects while they watched the stimulus video with the task of getting as much information as they could about the research project. After the viewing session, the subjects were shown 154 screenshots from the video in temporal order, each of which represents a 1.66-second slot (20 frames). The subjects were asked to select the objects that were relevant to them at that moment, and also to rank the selected subset of objects according to their relevance. We defined relevance as the interest in seeing augmented information about an object in the scene at that particular time. All subjects assured us, after ranking, that they were able to remember the correct ranks for almost all the frames. The subjects were graduate and postgraduate researchers not working on the project related to the study we present in this paper.
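To make the resulting data layout concrete, the sketch below shows one possible way to organise the labelled subsets \Lambda^{(t)} of Section 3 from these per-slice rankings. The class and field names are illustrative, not from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LabelledSlice:
    """One data subset Lambda^(t): features and ground-truth ranks of the
    objects visible in the window preceding the ranked screenshot."""
    slice_index: int        # t = 1..N, one per ranked screenshot
    object_ids: list        # identifiers of the m_t visible objects
    features: np.ndarray    # shape (m_t, d), the features of Section 4.4
    ranks: np.ndarray       # shape (m_t,), rank L = most relevant

def build_dataset(slices):
    """Pool all slices into the flat training arrays (F, r) used by the
    ordinal regression sketch above."""
    F = np.vstack([s.features for s in slices])
    r = np.concatenate([s.ranks for s in slices])
    return F, r
```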
4.3 The Eye Tracker

We collected the gaze data with a Tobii 1750 eye tracker at a 50 Hz sampling rate. The tracker has an infra-red stereo camera integrated into a standard flat-screen monitor. The device performs tracking by detecting the pupil centers and measuring the reflection from the cornea. Successive gaze points located within an area of 30 pixels were considered a single fixation. This corresponds to approximately 0.6 degrees of deflection at a normal viewing distance to a 17-inch monitor with 1280 x 1024 pixel resolution. Test subjects were sitting 60 cm away from the monitor.

4.4 Feature Extraction

We extracted from the gaze and video data a set of features corresponding to each visible object. This was done at every time slice for which the labelled object ranks were available (i.e., for one frame in every 20 consecutive frames). Each of these features summarises a particular aspect of the temporal context (the recent past). We define the context at time t to be a slot from time point t - W to t - 1, where W is a predetermined window size. We used the following 11 features; a sketch of how some of them can be computed is given at the end of this subsection.

1. mean area of the bounding box of the object
2. mean area of the information box attached to the object
3. mean distance between the centers of the object bounding box and the attached information box
4. total duration of fixations inside the bounding box of the object
5. total duration of fixations inside the information box attached to the object
6. mean duration of fixations inside the bounding box of the object
7. mean duration of fixations inside the information box attached to the object
8. mean distance of all fixations to the center of the object bounding box
9. mean distance of all fixations to the center of the information box
10. mean length of saccades that ended with fixations inside the bounding box of the object
11. mean length of saccades that ended with fixations inside the information box attached to the object

We marked the bounding boxes of the objects manually, frame by frame.
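The following minimal sketch computes features 4, 6, and 8 for one object within a context window. The fixation and bounding-box representations are assumptions; the paper does not specify its data structures, and per-frame boxes would in practice be looked up by fixation timestamp.

```python
import numpy as np

def box_center(box):
    """box = (x_min, y_min, x_max, y_max) in pixels."""
    x0, y0, x1, y1 = box
    return np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])

def inside(point, box):
    x0, y0, x1, y1 = box
    return x0 <= point[0] <= x1 and y0 <= point[1] <= y1

def object_gaze_features(fixations, box):
    """Features 4, 6 and 8 of Section 4.4 for one object in one context window.

    fixations: list of (x, y, duration_ms) tuples within the window [t - W, t - 1].
    box: the object's bounding box, assumed fixed over the window for brevity.
    """
    durations_inside = [d for (x, y, d) in fixations if inside((x, y), box)]
    total_fix_dur = float(np.sum(durations_inside))                               # feature 4
    mean_fix_dur = float(np.mean(durations_inside)) if durations_inside else 0.0  # feature 6
    center = box_center(box)
    dists = [np.linalg.norm(np.array([x, y]) - center) for (x, y, _) in fixations]
    mean_fix_dist = float(np.mean(dists)) if dists else 0.0                       # feature 8
    return total_fix_dur, mean_fix_dur, mean_fix_dist
```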
4.5 Evaluation

We evaluated the accuracy of the model as the proportion of times the most relevant object was predicted correctly. We compared the model performance with five baseline methods. The first is random guessing, in which at each time slice the scene objects are ranked uniformly at random. The second is an attention-based method that assigns a relevance proportional to the total fixation duration on the object and on the augmented content; this estimate of object relevance is referred to as gaze intensity [Qvarfordt and Zhai 2005]. It is used to reveal the effect of intricate gaze patterns, beyond mere visual attention as measured by gaze intensity, on relevance prediction. In the third baseline model we used the ordinal logistic regression model with only the features that are not related to gaze (the first three features), to isolate the effect of the gaze-based features on prediction accuracy. We defined two more baseline models based on Itti et al.'s bottom-up visual attention model [Itti et al. 1998] in order to observe how useful such plain attention modelling is in our problem setup, and to test whether our model provides better accuracy. We computed the Itti-Koch saliency map of the labelled frames, and then calculated the relevance of an object as the maximum saliency inside its bounding box for one baseline model, and as the average saliency inside the bounding box for the other.

We trained separate models for the user-specific and user-independent cases. In the user-specific case, we trained and tested the model on the data of the same subject. We split the dataset into training and validation sets by random selection without replacement: we randomly selected 2/3 of the dataset for training and left the remainder for testing. We repeated this process 50 times and measured the mean prediction accuracy. We computed the accuracy for several window sizes, starting from 50 frames and increasing up to 750 frames in 25-frame steps. Our model outperformed all the other baseline methods for all subjects and all window sizes (Figure 2). The significance of the difference was tested for each subject separately using the Wilcoxon signed-rank test with α = 0.05. We made the test between our model and the three best-performing baselines: the logit model without gaze features and the two saliency-based models. We selected the window sizes for our model and for the logit model without gaze features with respect to average prediction accuracy on the training data.

Figure 2: User-specific model accuracy for one user. Sub-images show the accuracy (proportion of correct predictions) as a function of the context window size (in frames, x-axis). Red diamonds: our proposed model; blue circles: baseline model using only the video features (not gaze); green reversed triangles: attention-only model; cyan squares: random guessing; black triangles: maximum saliency inside object; pink crosses: average saliency inside object.

In the user-independent case, we left out one user and trained the model on the whole datasets of the other users. We then evaluated the accuracy on the data of the left-out user. This procedure was repeated for all users. The results gave the same conclusions as in the user-specific case, although with some decrease in accuracy for all the metrics, and the outperformance was not significant for some test subjects. This is probably due to the increase in the degree of uncertainty originating from the subjectivity of top-down cognitive processes; a single common model may then be inadequate to handle the variability of gaze patterns across the subjects. This issue needs to be investigated further.

The box plot in Figure 3 (a) shows the learned regressor weights for a subject in the user-specific case. The small variance of the weights indicates that the model is stable across different splits. Both the magnitude and the ordering of the weights in the user-independent case were very similar to the user-specific case.
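A minimal sketch of this user-specific evaluation protocol (repeated random 2/3 - 1/3 splits, accuracy as the fraction of slices whose most relevant object is predicted correctly, and a Wilcoxon signed-rank comparison against a baseline) might look as follows. The helpers and per-slice data layout reuse the assumptions of the earlier sketches.

```python
import numpy as np
from scipy.stats import wilcoxon

def split_accuracy(slices, L, n_repeats=50, train_frac=2 / 3, seed=0):
    """Mean proportion of time slices whose most relevant object is predicted
    correctly, over repeated random train/test splits of the labelled slices."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(slices))
        n_train = int(train_frac * len(slices))
        train = [slices[i] for i in idx[:n_train]]
        test = [slices[i] for i in idx[n_train:]]
        F, r = build_dataset(train)              # from the data-layout sketch
        params = fit_ordinal_logit(F, r, L)      # from the model sketch
        hits = [predict_most_relevant(params, s.features, L) == int(np.argmax(s.ranks))
                for s in test]
        accuracies.append(np.mean(hits))
    return np.array(accuracies)

# Paired significance test of our model against one baseline over the same
# 50 splits, e.g. the logit model without gaze features (alpha = 0.05):
# stat, p_value = wilcoxon(acc_model, acc_baseline)
```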
The best accuracy is achieved at longish window sizes (525 frames in the user-specific case and 300 frames in the user-independent case for test subject 1). This supports the claim that the context does contain information related to object relevances. The decrease in accuracy as the window size further increases is not very large, and in particular the proposed model seems to be insensitive to the window size. The feature with the highest positive influence on relevance is the mean distance between the object center and the fixations within the context (w8); intuitively, the relevance of an object increases as the fixations within the context get closer to the center of that object. The feature with the highest negative influence is the mean distance between the object and the box, which means that an information box placed closer to its object attracts more interest. Some of the weights are harder to interpret and we will study them further in our subsequent research.

Figure 3: Variance of the regressor weights for each of the features among different bootstrap trials in the user-specific model. The features are numbered in Section 4.4.

5 Discussion

In this work, we assessed the feasibility of a gaze-based object relevance predictor in real-world scenes where the scene objects were augmented with additional information. For this, we applied a rather simple ordinal logistic regression model over a set of gaze-pattern and visual-content features. The prominent increase in accuracy when the gaze-pattern features are added to the feature set reveals that gaze statistics and visual features make mutually complementary contributions to relevance inference. The optimal way of combining these two sources of information should be studied further. The outperformance of our model over the bottom-up attention model in predicting the most relevant object can be attributed to the fact that bottom-up models are incapable of reflecting the task-dependent control of attention.

Better performance can probably be achieved by enriching the feature set and using a more complex model that better fits the data. Generalisation of the model to other real-world scenes also needs to be investigated further. This can be done by plugging the model into a wearable information access device and assessing its performance during online use. Such an assessment of our model is currently in progress.

6 Acknowledgements

Melih Kandemir and Samuel Kaski belong to the Finnish Center of Excellence in Adaptive Informatics and the Helsinki Institute for Information Technology (HIIT). Samuel Kaski also belongs to the PASCAL2 EU network of excellence. This study is funded by the TKK MIDE project UI-ART.

References

HARDOON, D., SHAWE-TAYLOR, J., AJANKI, A., PUOLAMÄKI, K., AND KASKI, S. 2007. Information retrieval by inferring implicit queries from eye movements. In International Conference on Artificial Intelligence and Statistics (AISTATS '07).

HENDERSON, J. M. 2003. Human gaze control during real-world scene perception. Trends in Cognitive Sciences 7, 11, 498–504.

HYRSKYKARI, A., MAJARANTA, P., AALTONEN, A., AND RÄIHÄ, K.-J. 2000. Design issues of 'iDict': A gaze-assisted translation aid. In Proceedings of ETRA 2000, Eye Tracking Research and Applications Symposium, ACM Press, 9–14.

ITTI, L., KOCH, C., AND NIEBUR, E. 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 11, 1254–1259.
KANDEMIR, M., SAARINEN, V.-M., AND KASKI, S. 2010. Inferring object relevance from gaze in dynamic scenes. In Short Paper Proceedings of ETRA 2010, Eye Tracking Research and Applications Symposium.

KLAMI, A., SAUNDERS, C., DE CAMPOS, T. E., AND KASKI, S. 2008. Can relevance of images be inferred from eye movements? In MIR '08: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, ACM, New York, NY, USA, 134–140.

KOZMA, L., KLAMI, A., AND KASKI, S. 2009. GaZIR: Gaze-based zooming interface for image retrieval. In Proc. ICMI-MLMI 2009, The Eleventh International Conference on Multimodal Interfaces and The Sixth Workshop on Machine Learning for Multimodal Interaction, ACM, New York, NY, USA, 305–312.

MCCULLAGH, P., AND NELDER, J. 1989. Generalized Linear Models. Chapman & Hall/CRC.

QVARFORDT, P., AND ZHAI, S. 2005. Conversing with the user based on eye-gaze patterns. In CHI '05: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 221–230.

TORRALBA, A., OLIVA, A., CASTELHANO, M. S., AND HENDERSON, J. M. 2006. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review 113, 4, 766–786.

WARD, D. J., AND MACKAY, D. J. C. 2002. Fast hands-free writing by gaze direction. Nature 418, 6900, 838.

ZHANG, L., TONG, M. H., MARKS, T. K., SHAN, H., AND COTTRELL, G. W. 2008. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision 8, 7 (12), 1–20.
