Kandemir Inferring Object Relevance From Gaze In Dynamic Scenes

Copyright © 2010 by the Association for Computing Machinery, Inc.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for commercial advantage and that copies bear this notice and the full citation on the
first page. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on
servers, or to redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail
permissions@acm.org.
ETRA 2010, Austin, TX, March 22 – 24, 2010.
© 2010 ACM 978-1-60558-994-7/10/0003 $10.00
Inferring Object Relevance from Gaze in Dynamic Scenes
Melih Kandemir∗
Helsinki University of Technology
Department of Information
and Computer Science
Veli-Matti Saarinen†
Low Temperature Laboratory
Samuel Kaski‡
Department of Information
and Computer Science
Abstract
As prototypes of data glasses having both data augmentation and
gaze tracking capabilities are becoming available, it is now possi-
ble to develop proactive gaze-controlled user interfaces to display
information about objects, people, and other entities in real-world
setups. In order to decide which objects the augmented information
should be about, and how saliently to augment, the system needs
an estimate of the importance or relevance of the objects of the
scene for the user at a given time. The estimates will be used to
minimize distraction of the user, and for providing efficient spa-
tial management of the augmented items. This work is a feasibility
study on inferring the relevance of objects in dynamic scenes from
gaze. We collected gaze data from subjects watching a video for
a pre-defined task. The results show that a simple ordinal logistic
regression model gives relevance rankings of scene objects with a
promising accuracy.
CR Categories: H.5.r [Information Interfaces and Representation
(HCI)]: User interfaces—User interface management systems
Keywords: augmented reality, gaze tracking, information re-
trieval, intelligent user interfaces, machine learning, ordinal logistic
regression
1 Introduction
In this paper, we develop a method needed for doing information
retrieval in dynamic real-world scenes where the queries are for-
mulated implicitly by gaze. In our setup the user wears a ubiqui-
tous information access device, “data glasses” having eye-tracking
and information augmentation capabilities. The device is assumed
to be capable of recognising and tracking certain types of objects
from the first-person video data of the user. Figure 1 illustrates the
idea. Some objects, three faces and the whiteboard in this image,
are augmented with attached boxes that include textual information
obtained from other sources. In such a setup, each visible object
in a scene can be considered as a channel through which additional
relevant information can be obtained as augmented on the screen.
As in traditional information retrieval setups such as text search en-
gines, potential abundance of available information brings up the
need for a mechanism to rank the channels with respect to their rel-
evance. This is particularly important in proactive mobile setups
where the augmented items are also potential distractors.
∗e-mail: melihk@cis.hut.fi
†e-mail: veli-m@neuro.hut.fi
‡e-mail: samuel.kaski@tkk.fi
Figure 1: A screenshot from the eyesight of hypothetical data
glasses with augmented reality capability during a short presentation
in a meeting room (Scene 1)
Our goal is to infer the degree of interest of the user for the objects
in the scene. This problem has a connection to modelling of visual
attention [Henderson 2003; Itti et al. 1998; Zhang et al. 2008];
whereas visual attention models typically try to predict the gaze pat-
tern given the scene, our target is the inverse problem of inferring
the user’s state (interests) given the scene and the gaze trajectory.
A good solution for the former problem would obviously help in
our task too, but current visual attention models mainly consider
only physical pointwise saliency which does not yet capture the
mainly top-down nature of effects of user’s interest on the gaze pattern.
Although there exist some initial attempts towards two-way
saliency modeling [Torralba et al. 2006], these are evaluated only
for rather trivial visual tasks such as counting a certain type of ob-
jects in static images. Unlike top-down models where the model
is optimised given a well-defined search task, the cognitive task of
the subject in our setup is hidden and may even be unclear to the
subject herself. Hence, we start by data-driven statistical machine
learning techniques for the inverse modeling task.
Gaze data has been used in user interfaces in three ways. Our goal is
the furthest from the most frequent approach, off-line analysis, for
instance studying effectiveness of advertisements in attracting peo-
ple’s attention, or analysis of social interaction. In the second ap-
proach the user selects actions by explicitly looking at the choices,
for instance eye typing [Hyrskykari et al. 2000; Ward and MacKay
2002]. Although such explicit selection mechanisms are easy to im-
plement, they require full user attention and are strenuous because
of the Midas touch effect: each glance activates an action whether
it is intended or not. The third way of using gaze data in user in-
terfaces is implicit feedback. The user uses her gaze normally, and
information needed by the interface is inferred from the gaze data.
An emerging example is proactive information retrieval where sta-
tistical machine learning methods are used for inferring relevance
from gaze patterns. The inferred relevance judgements are then
used as implicit relevance feedback for information retrieval. This
has been done for text retrieval by generating implicit queries from
gaze patterns [Hardoon et al. 2007]. The same principle has been
used for image retrieval as well [Klami et al. 2008], recently also
coupled dynamically to a retrieval engine in an interactive zooming
interface [Kozma et al. 2009]. Gaze has additionally been used as
a means of proactive interaction, but not information retrieval, in a
desktop application by assigning a relevance function to the entities
on a synthetic 2D map [Qvarfordt and Zhai 2005].
To test the feasibility of the idea of relevance ranking from gaze in
dynamical real-world setups, we prepared a stimulus video and collected
gaze data from subjects watching that video. The subjects were then
asked for their true relevance rankings in several frames. We
trained an ordinal logistic regression model and measured its accu-
racy in the relevance prediction task on the left-out data.
2 Measurement Setup
We shot a video from the first-person view of a subject visiting three
indoor scenes. Then we postprocessed this video by augmenting
some of the objects with additional textual information in an at-
tached box. This video was shown to 4 subjects and gaze data was
collected. Right after the viewing session the subjects ranked the
scene objects in relevance order for a subset of the video frames.
The ranking was considered as the ground truth for learning the
models and evaluating them. The modelling task is to predict the
user-given ranking for an object given the gaze-tracking data from
a window immediately preceding the ranked frame.
3 Model for Inferring Relevance
Let us index the stimulus slices preceding each relevance judgement
from 1 to N. We extract a feature vector (details in the Experiments
section) for each scene object i at time slice t to obtain a single
unlabelled data point: $\mathbf{f}_i^{(t)} = \{f_{i1}^{(t)}, f_{i2}^{(t)}, \cdots, f_{id}^{(t)}\}$, where d is the
number of features. If we also attach the ground truth relevance
ranking $r_i^{(t)}$, we get a labelled data point $(\mathbf{f}_i^{(t)}, r_i^{(t)})$. Let us
denote the set of data points, one for each object, related to time slice
t as a data subset $\Lambda^{(t)} = \{(\mathbf{f}_1^{(t)}, r_1^{(t)}), \cdots, (\mathbf{f}_{m_t}^{(t)}, r_{m_t}^{(t)})\}$,
where $m_t$ is the number of visible objects at time slice t. Let
us denote the data subset without labels by $\tilde{\Lambda}^{(t)}$, and the maximum
number of visible objects by $L = \max(\{m_1, \cdots, m_N\})$.
For notational convenience, we define the most relevant object to
have rank L, and the rank decreases as relevance decreases. The
whole labelled data set consists of the union of all data subsets
$\Delta = \{\Lambda^{(1)}, \Lambda^{(2)}, \cdots, \Lambda^{(N)}\}$.
We search for a mapping from the feature space to the space of
relevances, which is conventionally [0, 1]. Such a mapping can di-
rectly be achieved using ordinal logistic regression [McCullagh and
Nelder 1989] if we assume that the relevance of an object depends
only on its features, and it is independent of the relevance of the
other visible objects. We use the standard approach as described
briefly below.
Let us denote the probability of the object rank to be k as
$P(r_i^{(t)} = k \mid \mathbf{f}_i^{(t)}) = \phi_k(\mathbf{f}_i^{(t)})$. Then we can define the log odds such that the
problem reduces to a batch of L − 1 binary regression problems,
one for each k = 1, 2, · · · , L − 1:

$$M_k = \log \frac{P(r_i^{(t)} \le k \mid \mathbf{f}_i^{(t)})}{1 - P(r_i^{(t)} \le k \mid \mathbf{f}_i^{(t)})} = \log \frac{\phi_0(\mathbf{f}_i^{(t)}) + \phi_1(\mathbf{f}_i^{(t)}) + \cdots + \phi_k(\mathbf{f}_i^{(t)})}{\phi_{k+1}(\mathbf{f}_i^{(t)}) + \phi_{k+2}(\mathbf{f}_i^{(t)}) + \cdots + \phi_L(\mathbf{f}_i^{(t)})} = w_0^{(k)} + \mathbf{w}^{\top}\mathbf{f}_i^{(t)},$$

where a linear model is assumed. By taking the exponent of both
sides we get the CDF of the rank distribution for object i at time t:

$$P(r_i^{(t)} \le k \mid \mathbf{f}_i^{(t)}) = \frac{\exp(w_0^{(k)} + \mathbf{w}^{\top}\mathbf{f}_i^{(t)})}{1 + \exp(w_0^{(k)} + \mathbf{w}^{\top}\mathbf{f}_i^{(t)})}.$$

Notice that we adopted the standard approach and used common
slope coefficients $\mathbf{w} = [w_1, \cdots, w_d]$ for all logit models but different
intercepts $w_0^{(k)}$. In the training phase, we calculate the maximum
likelihood estimates for the parameters θ of this model
($\theta = \{w_0^{(1)}, \cdots, w_0^{(L-1)}, w_1, \cdots, w_d\}$) using the Newton-Raphson
technique. Given an unlabelled data subset $\tilde{\Lambda}^{(t)}$ at time t, the object
with relevance rank k is predicted to be the one that has the highest
probability for that rank: $\arg\max_i \phi_k(\mathbf{f}_i^{(t)})$.
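The prediction step of the model above can be illustrated with a minimal Python sketch. It assumes the slope coefficients and the increasing intercepts have already been estimated; the function names, the example weights and the zero-based `rank_index` (position of a rank in increasing order of relevance) are illustrative assumptions, not part of the original study.

```python
import math

def cumulative_probs(features, slopes, intercepts):
    """P(rank <= k | features) for each threshold k, with shared slopes."""
    score = sum(w * f for w, f in zip(slopes, features))
    # exp(x) / (1 + exp(x)) written in the equivalent form 1 / (1 + exp(-x))
    return [1.0 / (1.0 + math.exp(-(b + score))) for b in intercepts]

def rank_probs(features, slopes, intercepts):
    """Per-rank probabilities obtained by differencing the cumulative ones."""
    cdf = cumulative_probs(features, slopes, intercepts) + [1.0]
    return [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]

def predict_object_for_rank(objects, slopes, intercepts, rank_index):
    """Index of the object with the highest probability for the given rank."""
    probs = [rank_probs(f, slopes, intercepts)[rank_index] for f in objects]
    return max(range(len(objects)), key=lambda i: probs[i])

# Hypothetical example: one feature, three rank levels (two intercepts).
objects = [[0.2], [3.0]]
slopes = [-1.0]            # assumed value, not an estimate from the paper
intercepts = [-1.0, 1.0]   # must be increasing for valid probabilities
top_object = predict_object_for_rank(objects, slopes, intercepts, rank_index=2)
```

Because the logit models the probability of a rank being at most k, a larger linear score shifts probability mass towards the lower ranks in this parameterisation.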
4 Experiments
4.1 Stimulus Preparation
We shot a video clip 4 minutes and 17 seconds long from the first-person
view of a subject, using a see-through head mounted display
device. In the scenario of the clip, a visitor coming to our laboratory
is informed of our research project. The scenario consists of three
consecutive scenes:
1. A short presentation in a meeting room: A researcher in-
troduces the project with a block diagram drawn on the white-
board (Figure 1) in a meeting room. People present are asking
questions. The visitor follows the presentation.
2. A walk in the lab corridor: The visitor walks through the
laboratory taking a look at posters on the wall, and zooms
into some of the name tags on office doors.
3. Demo of data collection devices: The host introduces how
eye tracking experiments are made. He demonstrates a mon-
itor with eye tracking capabilities and the head-mounted dis-
play device.
Next, we augmented the video by attaching information boxes to
objects such as faces, the whiteboard, name tags, posters, and devices
related to the project. These were considered to be the objects
potentially the most interesting to the visitor. Short snippets of textual
information relevant to the objects were displayed inside the
boxes. At most one information box was attached to any one object
at a time. We displayed boxes for all visible objects. There were
from 0 to 4 objects in the scene at a time; average number of scene
objects was 2.017 with 1.36 standard deviation. The frame rate of
the postprocessed video was 12 fps.
4.2 Data Collection
We collected gaze data from 4 subjects while they were watching
the stimulus video with the task of getting as much information as
they could about the research project. After the viewing session, the
subjects were shown 154 screenshots from the video in temporal order,
each of which represents a 1.66-second slot (20 frames). The subjects
were asked to select the objects that were relevant to them at that
moment, and also to rank the selected subset of objects according to
their relevance. We defined relevance as the interest in seeing augmented
information about an object in the scene at that particular
time. After ranking, all subjects confirmed that they were able to
remember the correct ranks for almost all the frames. The subjects
were graduate and postgraduate researchers not working on
the project related to the study we present in this paper.
4.3 The Eye Tracker
We collected the gaze data with a Tobii 1750 eye tracker with a 50 Hz
sample rate. The tracker has an infra-red stereo camera on a standard
flat-screen monitor. The device performs tracking by detecting
the pupil centers and measuring the reflection from the cornea.
Successive gaze samples located within an area of 30 pixels were
considered as a single fixation. This corresponds to approximately
0.6 degrees of deflection at a normal viewing distance to a 17-inch
monitor with 1280 × 1024 pixel resolution. Test subjects
were sitting 60 cm away from the monitor.
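The fixation rule described above can be sketched as follows. This is only one possible reading: a sample is assumed to join the current fixation when it lies within 30 pixels of the running centroid, and each sample is assumed to last 20 ms (from the 50 Hz rate). The function names are hypothetical and the tracker's own fixation filter may differ.

```python
import math

SAMPLE_PERIOD_MS = 20  # 50 Hz sample rate, as stated in the text
MAX_SPREAD_PX = 30     # spatial threshold from the text

def _summarise(points):
    """Centroid and duration of one group of gaze samples."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    return (cx, cy, n * SAMPLE_PERIOD_MS)

def group_fixations(samples, max_spread=MAX_SPREAD_PX):
    """Group successive (x, y) gaze samples into fixations.

    Assumed rule: a sample extends the current fixation if it is within
    max_spread pixels of the fixation's running centroid; otherwise it
    starts a new fixation. Returns (centroid_x, centroid_y, duration_ms).
    """
    fixations = []
    current = []
    for x, y in samples:
        if current:
            cx = sum(p[0] for p in current) / len(current)
            cy = sum(p[1] for p in current) / len(current)
            if math.hypot(x - cx, y - cy) > max_spread:
                fixations.append(_summarise(current))
                current = []
        current.append((x, y))
    if current:
        fixations.append(_summarise(current))
    return fixations

# Hypothetical gaze samples in screen pixels.
example = group_fixations([(100, 100), (102, 101), (101, 99), (400, 300), (402, 301)])
```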
4.4 Feature Extraction
We extracted from the gaze and video data a set of features cor-
responding to each visible object. This was done at every time
slice for which the labelled object ranks were available (i.e., for
one frame in every 20 consecutive frames). Each of these features
summarises a particular aspect in the temporal context (recent past).
We define the context at time t to be a slot from time point t − W
to t − 1 where W is a predetermined window size. We used the
following 11 features:
1. mean area of the bounding box of the object
2. mean area of the information box attached to the object
3. mean distance between the centers of the object bounding box and the attached
information box
4. total duration of fixations inside the bounding box of the object
5. total duration of fixations inside the information box attached to the object
6. mean duration of fixations inside the bounding box of the object
7. mean duration of fixations inside the information box attached to the object
8. mean distance of all fixations to the center of the object bounding box
9. mean distance of all fixations to the center of the information box
10. mean length of saccades that ended up with fixations inside the bounding box
of the object
11. mean length of saccades that ended up with fixations inside the information box
attached to the object
We marked the bounding boxes of the objects manually frame by
frame.
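As an illustration of how some of the listed features could be computed, the sketch below derives features 4, 6 and 8 for one object from the fixations in a context window. It simplifies the setup by assuming a single static bounding box over the window, whereas the boxes in the study were marked frame by frame; all names are hypothetical.

```python
import math

def inside(box, x, y):
    """True if (x, y) lies in box = (left, top, right, bottom), in pixels."""
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def gaze_features(fixations, box):
    """Features 4, 6 and 8 of the list above for one object.

    fixations: list of (x, y, duration_ms) within the context window.
    box: bounding box of the object, assumed static for brevity.
    """
    durations = [d for x, y, d in fixations if inside(box, x, y)]
    total = sum(durations)
    mean = total / len(durations) if durations else 0.0
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    dists = [math.hypot(x - cx, y - cy) for x, y, _ in fixations]
    mean_dist = sum(dists) / len(dists) if dists else 0.0
    return {
        "total_fix_duration": total,    # feature 4
        "mean_fix_duration": mean,      # feature 6
        "mean_fix_distance": mean_dist  # feature 8
    }

# Hypothetical window with three fixations, two of them on the object.
features = gaze_features([(50, 50, 200), (60, 40, 100), (300, 50, 80)], (0, 0, 100, 100))
```

The corresponding information-box features (5, 7 and 9) would follow the same pattern with the box of the augmented content in place of the object bounding box.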
4.5 Evaluation
We evaluated the accuracy of the model with respect to the propor-
tion of times the most relevant object was predicted correctly. We
compared the model performance with five baseline methods. The
first one is random guessing, in which at each time slice, scene
objects are ranked uniformly at random. The second one is an
attention-based method that assigns a relevance proportional to the
total fixation duration on the object and on the augmented content.
This estimate of object relevance is referred to as gaze intensity
[Qvarfordt and Zhai 2005]. This baseline reveals how much intricate
gaze patterns contribute to relevance prediction beyond the mere
visual attention measured by gaze intensity. In the third baseline model
we used the ordinal logistic regression model with only the features that
are not related to gaze, namely the first three features. This lets us
isolate the effect of the gaze-based features on prediction accuracy. We
defined two more baseline models that depend on Itti et al.’s bottom-
up visual attention model [Itti et al. 1998] in order to observe how
useful such plain attention modelling is in our problem setup, and
to test if our model provides better accuracy. We computed the Itti-
Koch saliency map of the labelled frames. Then we calculated the
relevance of an object as the maximum saliency inside its bounding
box for one baseline model, and as the average saliency inside the
bounding box for the other one.
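The gaze intensity and saliency baselines described above can be sketched in a few lines. The sketch assumes the saliency map has already been computed by some external implementation and is given as a 2D list; it does not implement the Itti-Koch model itself, and the function names are hypothetical.

```python
def gaze_intensity(fix_on_object_ms, fix_on_box_ms):
    """Relevance score: total fixation time on the object and its info box."""
    return fix_on_object_ms + fix_on_box_ms

def saliency_scores(saliency_map, box):
    """Maximum and average saliency inside a bounding box.

    saliency_map: 2D list indexed as [row][column], assumed precomputed.
    box: (left, top, right, bottom) in pixel indices, inclusive.
    """
    left, top, right, bottom = box
    values = [saliency_map[r][c]
              for r in range(top, bottom + 1)
              for c in range(left, right + 1)]
    return max(values), sum(values) / len(values)

def most_relevant(scores):
    """Index of the object with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical scene with three objects and a tiny 3 x 3 saliency map.
intensities = [gaze_intensity(300, 100), gaze_intensity(500, 0), gaze_intensity(0, 50)]
winner = most_relevant(intensities)
toy_map = [[0.1, 0.2, 0.3], [0.4, 0.9, 0.6], [0.7, 0.8, 0.5]]
max_sal, mean_sal = saliency_scores(toy_map, (1, 0, 2, 1))
```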
We trained separate models for user specific and user independent
cases. In the user-specific case, we trained and tested the model on
the data of the same subject. We splitted the dataset into training
and validation sets by random selection without replacement. We
randomly selected 2/3 of the dataset for training and left out the
remainder for testing. We repeated this process 50 times and mea-
sured the mean prediction accuracy. We computed the accuracy for
several window sizes, starting from 50 frames and increasing to
750 frames in 25-frame steps. Our model outperformed all the
baseline methods for all subjects and all window sizes (Figure
2). The significance of the difference was tested for each subject
separately using the Wilcoxon signed-rank test with α = 0.05. We
compared our model against the three best-performing baselines:
the logit model without gaze features and the two saliency-based
models. We selected the window sizes for our model and the
logit model without gaze features with respect to average prediction
accuracy on the training data.
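The user-specific evaluation protocol can be summarised with the following sketch: repeated random 2/3 versus 1/3 splits, with accuracy measured as the proportion of time slices whose most relevant object is predicted correctly. The callables `fit`, `predict_top` and `true_top` are hypothetical placeholders for the model training and lookup steps, and the seed is an arbitrary assumption.

```python
import random

# Window sizes from 50 to 750 frames in 25-frame steps, as in the text.
WINDOW_SIZES = list(range(50, 751, 25))

def split_two_thirds(items, rng):
    """Random split without replacement: 2/3 for training, the rest for testing."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

def top1_accuracy(predicted, actual):
    """Proportion of slices where the most relevant object is predicted correctly."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)

def repeated_holdout(items, fit, predict_top, true_top, repeats=50, seed=0):
    """Mean top-1 accuracy over repeated random splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        train, test = split_two_thirds(items, rng)
        model = fit(train)
        predicted = [predict_top(model, s) for s in test]
        actual = [true_top(s) for s in test]
        accuracies.append(top1_accuracy(predicted, actual))
    return sum(accuracies) / len(accuracies)
```

In a full experiment this loop would be run once per window size and per subject, with the per-split accuracies kept for the significance test.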
Figure 2: User-specific model accuracy for one user. Sub-images
show the accuracy (proportion of correct predictions) as a func-
tion of the context window size (in frames, x-axis). Red diamond:
our proposed model, blue circles: baseline model using only the
video features (not gaze), green reversed triangles: attention-only
model, cyan squares: random guessing, black triangles: maximum
saliency inside object, pink crosses: average saliency inside object.
In the user-independent case, we left out one user and trained the
model with the whole datasets of the other users. Then we evalu-
ated the accuracy on the data of the left out user. This procedure
was repeated for all users. The results led to the same conclusions
as in the user-specific case, although with some decrease in accuracy
for all the metrics and with the improvement no longer significant for
some test subjects. This is probably due to the increased uncertainty
originating from the subjectivity of top-down cognitive processes: a
single common model may be inadequate to handle the variability of
gaze patterns across the subjects. This issue
needs to be investigated further.
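The leave-one-user-out procedure used in the user-independent case can be sketched as a small generator. The dictionary layout (subject identifier mapped to that subject's labelled time slices) is an assumption made for illustration.

```python
def leave_one_user_out(data_by_user):
    """Yield (held_out_user, training_slices, test_slices) for every user.

    data_by_user: dict mapping a subject id to a list of labelled time slices.
    """
    for held_out in data_by_user:
        train = [s
                 for user, slices in data_by_user.items()
                 if user != held_out
                 for s in slices]
        yield held_out, train, list(data_by_user[held_out])

# Hypothetical data for four subjects; integers stand in for time slices.
folds = list(leave_one_user_out({"s1": [1, 2], "s2": [3], "s3": [4, 5], "s4": [6]}))
```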
The box plot in Figure 3 (a) shows the learned regressor weights
for a subject in the user-specific case. Small variance of weights
indicates that the model is stable across different splits. Both the
magnitude and the ordering of weights in the user-independent case
were very similar to the user-specific case.
The best accuracy is achieved at fairly long window sizes (e.g.,
525 frames in the user-specific case and 300 frames in the
user-independent case for test subject 1). This supports the claim that
the context does contain information related to object relevances.
The decrease in accuracy as the window size further increases is
not very significant, and in particular the proposed model seems to
be insensitive to window size.
The feature that has the highest positive influence on relevance is
the mean distance between the object center and the fixations within
the context (w8). Intuitively, the relevance of an object increases
as the fixations within the context get closer to the center of that
object. The feature that has the highest negative influence is the
mean distance between the object and the box. This means that
as the information box is placed closer to the object, it attracts more
interest. Some of the weights are harder to interpret and we will
study them further in our subsequent research.
Figure 3: Variance of the regressor weights for each of the features
among different bootstrap trials in the user-specific model. The
features are numbered in Section 4.4.
5 Discussion
In this work, we assessed the feasibility of a gaze-based object
relevance predictor in real-world scenes where the scene objects
were augmented with additional information. For this, we applied
a rather simple ordinal logistic regression model over a set of gaze
pattern and visual content features. The prominent increase in ac-
curacy when the gaze pattern features are added to the feature set
reveals that gaze statistics and visual features make a mutually com-
plementary contribution to relevance inference. The optimal way of
combining these two sources of information should be further studied.
The advantage of our model over the bottom-up attention
model in predicting the most relevant object can be attributed to the
fact that bottom-up models are incapable of reflecting the task-dependent
control of attention.
Better performance can probably be achieved by enriching the
feature set and using a more complex model that fits the
data better. Generalisation of the model to other real-world scenes also
needs to be investigated further. This can be done by plugging the
model into a wearable information access device and assessing its
performance during online use. Such assessment of our model is
currently in progress.
6 Acknowledgements
Melih Kandemir and Samuel Kaski belong to the Finnish Center
of Excellence in Adaptive Informatics and Helsinki Institute for In-
formation Technology (HIIT). Samuel Kaski also belongs to PAS-
CAL2 EU network of excellence. This study is funded by TKK
MIDE project UI-ART.
References
HARDOON, D., SHAWE-TAYLOR, J., AJANKI, A., PUOLAMÄKI,
K., AND KASKI, S. 2007. Information retrieval by inferring im-
plicit queries from eye movements. In International Conference
on Artificial Intelligence and Statistics (AISTATS ’07).
HENDERSON, J. M. 2003. Human gaze control during real-world
scene perception. Trends in Cognitive Sciences 7, 11, 498 – 504.
HYRSKYKARI, A., MAJARANTA, P., AALTONEN, A., AND
RÄIHÄ, K.-J. 2000. Design issues of ’iDict’: A gaze-assisted
translation aid. In Proceedings of ETRA 2000, Eye Tracking Research
and Applications Symposium, ACM Press,
9–14.
ITTI, L., KOCH, C., AND NIEBUR, E. 1998. A model of saliency-
based visual attention for rapid scene analysis. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 20, 11,
1254–1259.
KANDEMIR, M., SAARINEN, V.-M., AND KASKI, S. 2010. Inferring
object relevance from gaze in dynamic scenes. To appear
in Short Paper Proceedings of ETRA 2010, Eye Tracking
Research and Applications Symposium.
KLAMI, A., SAUNDERS, C., DE CAMPOS, T. E., AND KASKI, S.
2008. Can relevance of images be inferred from eye movements?
In MIR ’08: Proceeding of the 1st ACM international confer-
ence on Multimedia information retrieval, ACM, New York, NY,
USA, 134–140.
KOZMA, L., KLAMI, A., AND KASKI, S. 2009. GaZIR: Gaze-
based zooming interface for image retrieval. In Proc. ICMI-
MLMI 2009, The Eleventh International Conference on Multi-
modal Interfaces and The Sixth Workshop on Machine Learning
for Multimodal Interaction, ACM, New York, NY, USA, 305–
312.
MCCULLAGH, P., AND NELDER, J. 1989. Generalized Linear
Models. Chapman & Hall/CRC.
QVARFORDT, P., AND ZHAI, S. 2005. Conversing with the user
based on eye-gaze patterns. In CHI ’05: Proceedings of the
SIGCHI conference on Human factors in computing systems,
ACM, New York, NY, USA, 221–230.
TORRALBA, A., OLIVA, A., CASTELHANO, M. S., AND HEN-
DERSON, J. M. 2006. Contextual guidance of eye movements
and attention in real-world scenes: the role of global features in
object search. Psychological Review 113, 4, 766–786.
WARD, D. J., AND MACKAY, D. J. C. 2002. Fast hands-free
writing by gaze direction. Nature 418, 6900, 838.
ZHANG, L., TONG, M. H., MARKS, T. K., SHAN, H., AND COTTRELL,
G. W. 2008. SUN: A Bayesian framework for saliency
using natural statistics. Journal of Vision 8, 7 (12), 1–20.