Hansen Homography Normalization For Robust Gaze Estimation In Uncalibrated Setups

Homography normalization is presented as a novel gaze estimation method for uncalibrated setups. The method applies when head movements are present but without any requirements to camera calibration or geometric calibration. The method is geometrically and empirically demonstrated to be robust to head pose changes and despite being less constrained than cross-ratio methods, it consistently
performs favorably by several degrees on both simulated data and data from physical setups. The physical setups include the use of off-the-shelf web cameras with infrared light (night vision) and standard cameras with and without infrared light. The benefits of homography normalization and uncalibrated setups in general are also demonstrated through obtaining gaze estimates (in the visible spectrum) using only the screen reflections on the cornea.


  • 1. Homography Normalization for Robust Gaze Estimation in Uncalibrated Setups

Dan Witzner Hansen (IT University, Copenhagen), Javier San Agustin (IT University, Copenhagen), Arantxa Villanueva (Public University of Navarra)

Abstract

Homography normalization is presented as a novel gaze estimation method for uncalibrated setups. The method applies when head movements are present but without any requirements to camera calibration or geometric calibration. The method is geometrically and empirically demonstrated to be robust to head pose changes and despite being less constrained than cross-ratio methods, it consistently performs favorably by several degrees on both simulated data and data from physical setups. The physical setups include the use of off-the-shelf web cameras with infrared light (night vision) and standard cameras with and without infrared light. The benefits of homography normalization and uncalibrated setups in general are also demonstrated through obtaining gaze estimates (in the visible spectrum) using only the screen reflections on the cornea.

Keywords: Eye tracking, Gaze estimation, Homography normalization, Gaussian process, Uncalibrated setup, HCI

1 Introduction

Eye and gaze tracking have a long history but only recently have gaze trackers become robust enough for use outside laboratories. The precision of current gaze trackers is sufficient for many types of applications, but are we really satisfied with their current capabilities?

Both research and commercial gaze trackers have been driven by the urge to obtain high accuracy gaze position data while simplifying user calibration, often by reducing the number of points necessary for calibrating an individual user to the system. Both high accuracy and few calibration points are desirable properties of a gaze tracker, but they are not necessarily the only parameters which should be optimized [Scott and Findlay 1993]. Price is obviously an issue, but may be partially resolved with technological developments. Today even cheap web cameras are of sufficient quality for reliable gaze tracking. In some situations, however, it would be convenient if light sources, cameras and monitors could be placed according to particular needs rather than being constrained by manufacturer specifications. Avoiding external light sources or allowing the user to change the zoom of the camera to suit their particular needs would be desirable. Gaze models that support flexible setups eliminate the need for rigid frames that keep individual components in place and allow for more compact, lightweight, adaptable and perhaps cheap eye trackers. If the models employed in the gaze trackers only required a few calibration targets and could maintain accuracy while avoiding the need for light sources, then eye tracking technology would take an important step towards being flexible, ubiquitous, and convenient for the general public. So far, it has not been possible to meet these constraints concurrently.

Many gaze models require a fully calibrated setup and detailed eye models (a strong prior model) to be able to minimize user calibration and maintain high accuracy. A major limitation of fully calibrated setups is that they require exact knowledge of the relative positions of the camera, light sources and monitor. Geometric calibration is usually tedious and time consuming to perform and automated techniques are sparse [Brolly and Mulligan 2004]. Slight unintentional movement of a system part or change in focal length may result in a significant drop in accuracy when relying on a calibrated setup. The accuracy is therefore difficult to maintain unless the hardware is placed in a rigid setup. Such requirements add to the cost of the system.

Gaze models may alternatively use multiple calibration points in order to be less dependent on prior assumptions (e.g. using polynomial approximations [Hansen and Ji 2010]). Models employing a weak prior model have not been able to demonstrate head pose invariance to date.

This paper will both geometrically and empirically demonstrate that it is possible to obtain robust gaze estimation in the presence of head movements when using a weak prior model of the geometric setup. The model relies on homography normalization and does not require any direct measurements of the relative position of the screen, camera and light source, nor does it need camera calibration. This means that it is possible to obtain a highly flexible eye tracker that can be made compact, mobile and suit individual needs. Besides, the method is very simple to implement. Homography normalization is shown to consistently provide higher accuracies than cross-ratio-based methods on both simulated data (section 4) and data recorded from a physical setup (section 5).

One reason for considering uncalibrated setups is to facilitate the general public with affordable and flexible gaze trackers that are robust with regard to head movements. In section 5.2 this is shown to be achievable through purely off-the-shelf components. It is additionally shown possible to use screen reflections on the cornea as an alternative to IR glints (section 5.3). Through this paper we intend to show that flexible, mobile and low cost gaze trackers are indeed feasible without sacrificing significant accuracy.

2 Related Work

The primary task of a gaze tracker is to determine gaze, where gaze may either be a gaze direction or the point of regard (PoR). Gaze modeling consequently focuses on the relations between the image data and gaze. A comprehensive review of eye and gaze models is provided in Hansen & Ji [2010].

All gaze estimation methods need to determine a set of parameters through calibration. Some parameters may be estimated for each session by letting the user look at a set of predefined targets on the screen, others need only be calculated once (e.g. human specific parameters) and yet other parameters are estimated prior to use (e.g. camera parameters, geometric and physical parameters such as angles and location between camera and monitor). A system where the camera parameters and the geometry are a priori known is termed fully calibrated [Hansen and Ji 2010].

This paper focuses primarily on feature-based methods but alternative methods based on appearance also exist [Hansen and Ji 2010].

Copyright © 2010 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481. ETRA 2010, Austin, TX, March 22 – 24, 2010. © 2010 ACM 978-1-60558-994-7/10/0003 $10.00
  • 2. Feature-based methods explore the characteristics of the human eye to identify a set of distinctive and informative features around the eyes that are less sensitive to variations in illumination and viewpoint. Ensuring head pose invariance is a common problem often solved through the use of external light sources and their reflections (glints) on the cornea. Besides the glints, the pupil is the most common feature to use, since it is easy to extract in IR spectrum images. The image measurements (e.g. the pupil), however, are influenced by refraction [Guestrin and Eizenman 2006]. The limbus is less influenced by refraction, but since its boundary may be partially occluded, it may be more difficult to obtain reliable measurements.

Two types of feature-based gaze estimation approaches exist: the interpolation-based (regression-based) and the model-based (geometric). Using a single camera, the 2D regression methods model the optical properties, geometry and the eye physiology indirectly and may, therefore, be considered as approximate models which may not strictly guarantee head pose invariance. They are, however, simple to implement, do not require camera or geometric calibration (a.k.a. weak prior model) and may still provide good results under conditions of small head movements. More recent 2D regression-based methods attempt to improve performance under larger head movements through compensation, or by adding additional cameras [Hansen and Ji 2010]. The 3D model-based methods, on the other hand, directly compute the gaze direction from the eye features based on a geometric model of the eye. Most 3D model-based (or geometric) approaches rely on metric information and thus require camera calibration and a global geometric model (external to the eye) of light sources, camera and monitor position and orientation. Gaze direction is modeled either as the optical axis or the visual axis. The optical axis is the line connecting the pupil center, cornea center and the eyeball center. The line connecting the fovea and the center of the cornea is the visual axis. The visual axis is presumably the true direction of gaze. The visual and optical axes intersect at the cornea center with subject dependent angular offsets. In a typical adult, the fovea is located about 4-5° horizontally and about 1.5° below the point of the optic axis and the retina and may vary up to 3° vertically between subjects. Much of the theory behind geometric models using fully calibrated setups has been formalized by Guestrin and Eizenman [2006]. Their model covers a variable number of light sources and cameras, human specific parameters, light source positions, refraction, and camera parameters but is limited by only applying to fully calibrated setups. Methods relying on fully calibrated setups are most common in commercial and research-based systems but are limited for public use unless placed in a rigid setup. Any change (e.g. placing the camera differently or changing the zoom of the camera) requires a tedious recalibration.

An alternative to the fully calibrated systems while allowing for head movements is to use projective invariants and multiple light sources [Yoo and Chung 2005; Coutinho and Morimoto 2006]. Contrary to the previous methods, Yoo et al. [2005] describe a method which is capable of determining the point of regard based solely on the availability of light source position information (e.g. no camera calibration or prior knowledge of rigid transformations between hardware units) by exploiting the cross-ratio of four points (light sources) in projective space. Yoo et al. [2005] use two cameras and four IR light sources placed around the screen to project these corners on the corneal surface, but only one camera is needed for gaze estimation. When looking at the screen the pupil center should ideally be within the four glint area. A fifth IR light emitter is placed on-axis to produce bright pupil images and to be able to account for non-linear displacements (modeled by four α_i parameters) of the glints. The method of Yoo et al. [2005] was shown to be prone to large person specific errors [Coutinho and Morimoto 2006] and can only use the light sources for calibration (e.g. not on other screen positions). Coutinho and Morimoto [2006] extend the model of Yoo et al. [2005], by using the offset between visual and optical axes as an argument to learn a constant on-screen offset. They additionally perform an elaborate evaluation of the consequences of changing the calibration of the virtual calibration parameter (α). Based on this, they argue that a simpler model can be made by learning a single α value rather than four different values as originally proposed. Where calibration in [Yoo and Chung 2005] can only be done by looking at the light sources in the screen corners, the method of [Coutinho and Morimoto 2006] may use multiple on-screen targets.

Since the cross-ratio is defined on projective planes and is invariant to any projective transformation, scale changes will not influence the cross-ratio. The method is therefore not directly applicable to depth translations. Coutinho and Morimoto [2006] show significant accuracy improvements compared to the original paper, provided the user does not change their distance to the camera and monitor. The advantage of the method, compared to methods based on calibrated setups, is that full hardware calibration is needless. The method only requires light source position data relative to the screen. One limitation is that the light sources should be placed right on the corners of the screen. In practice the method is highly sensitive to the individual eye and formal analysis of the method is presented by Kang et al. [2008]. They identified two main sources of errors: (1) the angular offset between visual and optical axes and (2) the offset between pupil and glint planes. Depending on the point configuration, the cross-ratio is also known for not being particularly robust to noise, since small changes in point positions can result in large variations in the cross-ratio.

3 Homography Normalization for Gaze Estimation

This section presents the fundamental model for a robust point of regard estimation method in uncalibrated setups (a priori unknown geometry and camera parameters). The components of the model are illustrated in figure 1.

Figure 1: Geometric model of the human eye, light sources, screen, camera and projections (dashed line). The pupil is depicted as an ellipse with center p_c and the cornea as a hemisphere with center C. The corneal-reflection plane, Π_c, and its projection in the image are shown by quadrilaterals. Both Π_c and the cornea focal point, f_c, are displaced relative to each other and to the pupil center for illustration purposes.

The cornea is approximately spherical and has a radius, R_c, about 7.8 mm. The cornea reflects light similarly to a convex mirror and has a focal point, f_c, located halfway between the corneal surface
  • 3. and the center of corneal curvature (f_c = R_c/2 ≈ 3.9 mm). Reflections on the cornea consequently appear further away than the corneal surface (a.k.a. virtual reflections).

Denote the screen plane Π_s and the four (virtual) reflections on the cornea (g_1^c ... g_4^c). The reflections may come from any point in 3D space, for example external light sources (L_i) or the corners of a screen reflected on the cornea. The issue of screen projections will be addressed in section 5.3. For the sake of simplicity and without loss of generality, the following description assumes (g_1^c ... g_4^c) come from point light sources. Provided the eye is stationary, any location of a light source, L_i, on l_i with the same direction produces the same point of reflection on the cornea. The light sources can therefore interchangeably be assumed located on e.g. the screen plane Π_s or at infinity as depicted in figure 1. Projected points at infinity lie in the focal plane of the convex mirror. With four light sources there will exist a plane Π_c (in fact a family of planes related by homographies), spanned by the lines l_i. This plane is denoted the corneal-reflection plane and is close to f_c when the L_i are at infinity. When considering the reflection laws (e.g. not a projection) the corneal reflections may only be approximately planar.

Without loss of generality suppose the light sources are located on Π_s. The quadrilateral of glints (g_1^c ... g_4^c) is consequently related to the corresponding quadrilateral (g_1^i ... g_4^i) in the image via a homography, H_c^i, from the cornea (Π_c) to the image (Π_i) [Hartley and Zisserman 2004]. Similarly, the mapping from the cornea to the screen is also given by a homography H_c^s. The homography from the image to the screen H_i^s = H_c^s ∘ H_i^c via Π_c will therefore exist regardless of the location of the cornea, provided the geometric setup does not change. These arguments also apply to cross-ratio-based methods [Coutinho and Morimoto 2006; Yoo and Chung 2005].

The pupil center is located about 4.2 mm from the cornea center but its location varies between subjects and over time for a particular subject [Guestrin and Eizenman 2006]. However, the pupil is located approximately 0.3 mm (|R_c/2 − 4.2|) from the corneal focal point, f_c, and thus also close to Π_c. In the following suppose that Π_c and the pupil coincide. The pupil may under these assumptions be mapped through H_i^s from the image to the screen via the corneal reflections.

Figure 2: (left) Reflection points (crosses) and the pupil (gray ellipse) are observed in the image and (middle) the pupil mapped to the normalized space using the four reflection points. (right) From the normalized space the pupil is mapped to the point of regard.

These basic observations are sufficient to describe the fundamental and simple algorithm for PoR estimation in an uncalibrated setting. The method is illustrated in figure 2 and is based on locating and tracking four reflections (g_1^i ... g_4^i) (e.g. glints) and the pupil in the image. The pupil center, p_c, will be used in the following description. However, the presented method may alternatively use the limbus center or the pupil/limbus ellipse contours directly in the mapping since homographies allow for mappings of points, lines and conics.

It is convenient, though not necessary, to define a virtual plane, Π_n (normalized plane), spanned by four points g_1^n ... g_4^n. Π_n represents the (unknown) corneal-reflection plane given up to a homography. Let g_j^n (j = 1..4) be the corners of the unit square and define H_i^n such that g_j^n = H_i^n g_j^i. Notice, using the screen corners to span the normalized space would be equally viable. The basic idea is that the pupil is mapped to the normalized space through H_i^n to normalize the effects of head pose prior to any calibration or gaze estimation procedure (F_n^s, in figure 2). The mapping of the reflections from the image Π_i to the screen Π_s via Π_n is therefore H_i^s = H_n^s ∘ H_i^n. That is, a homography H_n^s is a sufficient model for F_n^s when the pupil and Π_c coincide.

H_i^s can be found through a user calibration consisting of a minimum of 4 calibration targets, t_1 ... t_N, on the screen. Denote the general principle of normalizing eye data (pupil center, pupil or limbus contours) with respect to the reflections by homography normalization. The method of using F_n^s = H_n^s in connection with homography normalization is referred to as (Hom).

The cross-ratio methods do not model the visual axis well [Kang et al. 2008]. Homography normalization, on the other hand, does model the offset between the optical and visual axes to a much higher degree. Points in normalized space are based on the pupil center, i.e. a model of the optical axis without the interference of head movements. However, as offsets between the optical and visual axes correspond to translations in normalized space, the visual and optical axis offset is modeled implicitly through F_n^s = H_n^s.

3.1 Model Error from Planarity Assumption

The previous section describes a generalized approach for head pose invariant PoR estimation under the assumption that the pupil and Π_c coincide. If the pupil had been located on Π_c, it would be a head pose invariant gaze estimation method that models the visual and optical axis offset. Euclidean information is not available in uncalibrated settings. Using metric information (e.g. between the pupil and Π_c) does therefore not apply in this setting. This section provides an analysis of the model error and section 3.2 discusses an approach to accommodate the errors. Figure 3 illustrates two different gaze directions and the associated modeling error measured from the camera.

Figure 3: Projected differences between pupil and the corresponding point on Π_c for two gaze directions. Π_c is kept constant for clarity.

When the user looks away from the camera ('gaze direction 1') it is evident that the error in the image plane is related to the projection of the line segment e_1 (between the point on Π_c and the actual location of the pupil) onto the image plane. A gaze vector directed towards the camera ('gaze direction 2') yields a point and therefore no error. Hence equal angular offsets from the optical axis of the camera generate offset vectors ∆_c(i, j) with the same magnitude
  • 4. when viewed from the camera. The largest errors occur when the gaze direction is perpendicular to the optical axis of the camera. The magnitude field |∆_c(i, j)| in camera coordinates consequently consists of elliptic iso-contours, centered around the optical axis of the camera. However, it is the error, ∆_s, in screen coordinates, that is of interest. The true point of regard in screen coordinates, ρ_s^* = ρ̂_s + ∆_s, is a function of the estimated gaze ρ̂_s and the error ∆_s. That is, ρ_s^* = H_i^s (p_c + ∆_i) = H_i^s p_c + H_i^s ∆_i, hence errors on the screen ∆_s = H_i^s ∆_i are merely errors in the camera propagated to the screen through the homography. An example of the error vector field, ∆_s, using a simulator and the corresponding vector magnitudes is shown in figure 4.

Figure 4: (left) Error vector field and (right) corresponding magnitudes obtained from simulated data. Crosses indicate calibration targets and the circles the projection of the camera center. [Panels: vector field of PoR errors; magnitudes of PoR error vector field.]

To argue for the characteristics of ∆_s it is without loss of generality and for the sake of simplicity assumed that only four calibration points, (t_1 ... t_4), are used (crosses in figure 4). When estimating the homography, H_i^s, through user calibration, the errors in the calibration targets are minimized to zero, ∆_s(t_i) = 0, and there will therefore be 5 points (calibration targets and the camera optical axis) where ∆_s is zero.

One way of thinking of a homography is that it generates a linear vector field of displacements. ∆_s = H_i^s ∆_i is therefore a composition of two vector fields (∆_s = V_h + ∆_i), a linear vector field corresponding to the homography (V_h) and an ellipsoidal vector field ∆_i. Since ∆_s(t_i) = 0 then V_h(t_i) = −∆_i(t_i). V_h(t_i) is consequently defined through the negative error vectors of ∆_i(t_i). It is worth noting that as the camera location is unknown due to the uncalibrated setup assumption and the location of the maximum error depends on the location of the camera, it would be impossible to determine the extremal location without additional information. However, despite this, it is shown in the following sections that it is possible through homography normalization to obtain results quite similar to fully calibrated setups.

3.2 Modeling Error Vectors

This section discusses one approach of modeling the error caused by the non-coplanarity of Π_c and the pupil. Even though the location of the largest errors cannot be determined (a priori) due to the uncalibrated setup, it may be worthwhile to accommodate the errors to the extent possible, that is, to estimate a vector field similar to figure 4. When the camera is placed outside the screen area, the error due to the homography is zero in 5 points (e.g. the calibration targets and the camera projection center) and non-zero elsewhere. After estimating H_i^s it is possible to measure the error due to the homography for each additional calibration target. Since the error vector field is smooth, a simplified yet effective approach would be to model the error through polynomials in a similar way as previously seen for single or dual glint systems [Morimoto and Mimica 2005]. One of the limitations when using polynomials is that any increase of the order of the polynomial would require additional calibration targets in order to estimate the parameters of the polynomial. A cubic polynomial seems to be a good approximation for ∆_i [Cerrolaza et al. 2008], however it would require at least 10 calibration targets.

Different from the 'weight space' approach of polynomials is the function view approach of Gaussian processes (GP). A Gaussian process (GP) interpolation method is used to estimate ∆_i by using a squared exponential covariance function [Rasmussen and Williams 2006]:

cov(x_p, x_q) = k_1 exp(−(1/2) (|x_p − x_q| / k_2)²) + k_3 σ²

where x_p and x_q are data points and k_i are weights. GPs have several innate properties that make them highly suited for gaze estimation. Gaussian processes do not model weights directly and thus there are no requirements on the minimum number of calibration targets needed to infer model parameters. Each additional calibration target provides additional information that will be used to increase accuracy. Each estimate also comes with an error measurement which, via the covariance function, is related to the distance from the input data to the calibration data. This information can potentially be used to regularize output data. The exponential covariance function has been adopted since it is highly smooth (like ∆_i) and it makes it possible to account for noise directly in the covariance function through k_3 σ². In the following we denote with (GP) the method of F_n^s that uses (Hom) together with Gaussian process modeling of ∆_i.

4 Assessment on Simulated Data

Head pose, head position, the offset between visual and optical axes, refraction, measurement noise, relative position of hardware and camera parameters are the factors that most influence the accuracy of gaze estimation methods. We will in the following sections evaluate the homography normalization methods ((Hom) and (GP)) against the cross-ratio methods ((Yoo) [Yoo and Chung 2005] and (Cou) [Coutinho and Morimoto 2006]). These methods have been chosen since they operate under similar premises as homography normalization (e.g. uncalibrated/semi-calibrated setup). Simulated data is used in this section to be able to assess the effects of potential noise-factors separately. The simulator [Böhme et al. 2008] allows for detailed modeling of the different components of the setup and eye specific parameters. The evaluation is divided according to the presence of head movements and the number of calibration targets (N). Notice the methods, except (Yoo), allow for multiple on-screen calibration targets. The effects of eye specific parameters such as refraction and offset between the visual and optical axis as well as the effect of the number of calibration targets and errors associated with the model assumptions are evaluated when the head is fixed (section 4.2). The methods are examined with respect to head movements in section 4.3. In some experiments the (GP) method has been left out since it is a derivative of (Hom) and would not alter the inherent properties of using homography normalization; it only makes a difference to the accuracy when the number of calibration targets is larger than four (N > 4).

4.1 Setup

The camera is located slightly below and to the right of the center of the screen so as to simulate a realistic setup (e.g. users do not place the components in an exact position). All tests have been conducted with the same camera focal length. The cornea is modeled as a sphere with radius 7.98 mm. Four light sources are placed
  • 5. at the corners of a planar surface (screen) to be able to compare homography and cross-ratio methods. In the following, N denotes the number of calibration targets, and γ and β denote the angular offsets between the visual and optical axes in the horizontal and vertical directions, respectively.

4.2 Stationary Head

Basic Settings and Refraction In this section the methods are evaluated as if the head is kept still while gazing at a uniformly distributed set of 64 × 64 targets. Figure 5 shows the mean accuracy (degrees) with error bars (variance) for the hypothetical eye model E0 = {γ = β = 0}, where there is no offset between visual and optical axes, and for the more realistic eye model E1 = {γ = 4.5, β = 1.5}. Each sub-figure shows the cases with and without refraction. E0 is physically infeasible, since in a real eye the optical and visual axes differ, but the model avoids eye-specific biases. It is clear from figure 5 that the methods exhibit similar accuracies in E0, but the offset between visual and optical axes in E1 makes a notable difference between the methods. Refraction has only a minor effect on the methods.

Figure 5: Comparison of methods (with/without refraction) when the head is kept still using eye model (left) E0 = (γ = β = 0) and (right) eye model E1 = (γ = 4.5, β = 1.5) and N = 4 calibration targets.

Changing N The previous test is based on a minimum number of calibration targets. However, all methods except (Yoo) may improve in accuracy as the number N of uniformly distributed calibration targets increases. Figure 6 shows the accuracy of the methods as a function of N for both eye models. (GP) exhibits a rapid increase in accuracy when increasing N. Both (Hom) and (Cou) may be improved by increasing N, but a large N implies an accuracy decrease for (Cou). The accuracy for (Yoo) is as expected.

Figure 6: Changing the number of calibration targets, N, for E0 (left) and E1 (right).

Offset between Visual and Optical Axes There is a noticeable accuracy difference between E0 and E1 in the previous experiments. Figure 7 shows that the angular horizontal offset, γ (with β = 0), has a significant effect on the accuracies of the cross-ratio methods but not on homography normalization. The reason is that homography normalization models the offset between optical and visual axes to a much higher degree.

Figure 7: Accuracy as a function of the angular offset.

4.3 Head Movements

Gaze trackers should ideally be head pose invariant. This section evaluates the methods in scenarios where the eye location changes in space (±300 mm in both x and y directions from the camera center) but the target location remains fixed on the screen.

Influence of N and γ Figure 8 shows the accuracies obtained with a variable number of calibration targets and eye parameters in the presence of head movements. As in the head-still experiments, the offset between the optical and visual axes makes a significant difference to the cross-ratio methods, but not to the homography-based methods. The number of calibration targets has only a minor effect on accuracy. Non-linear modeling improves accuracy, and going from 4 to 9 calibration targets in particular makes a significant difference. Given the nuisance of calibration, it is task dependent whether the rather small increase in accuracy between 9 and 16 calibration targets is worthwhile.

Depth Translation The methods analyzed here all use properties of projective planes. Movements in depth are therefore not inherently handled by the methods. The influence of head movements is examined by evaluating head movements as translations parallel to the screen plane (or equivalently Πc), as depicted in figure 9, and as movements in depth (figure 10). A single depth is used for calibration. The results show that none of the methods is invariant to either depth or in-plane translations, but that the homography normalization-based methods perform better. For depth changes larger than 150 mm (see figure 10) the (GP) method does not perform as well as (Hom). The reason is that the learned offsets in (GP) are only valid for a single scale. The graphs in figure 10 show the accuracy as a function of depth changes (from the calibration depth) when using different eye parameters (E0 and E1) and a variable number of calibration targets, N.
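The two-stage mapping behind (Hom) can be illustrated with a short sketch. It assumes, as the abstract and the discussion suggest, that four corneal reflections are first mapped onto a fixed reference square and that the normalized pupil position is then mapped to the screen by a second homography estimated from calibration targets. All function names are hypothetical, the reference square is an arbitrary choice, only the minimum case N = 4 is covered, and the non-linear (GP) correction is left out.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_4(src, dst):
    """3x3 homography (with h33 fixed to 1) mapping four src points to dst."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_h(h, p):
    """Apply homography h to the 2D point p (homogeneous divide included)."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Assumed normalized space: the four reflections always map to these corners.
UNIT_SQUARE = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]


def normalize_pupil(glints, pupil):
    """Stage 1: express the pupil center relative to the four reflections."""
    return apply_h(homography_from_4(glints, UNIT_SQUARE), pupil)


def calibrate(normalized_pupils, screen_targets):
    """Stage 2: normalized space -> screen, from exactly four targets."""
    return homography_from_4(normalized_pupils, screen_targets)


def estimate_gaze(glints, pupil, h_screen):
    """Point of regard on the screen for one image measurement."""
    return apply_h(h_screen, normalize_pupil(glints, pupil))
```

Under the planar model, applying the same homography to the reflections and the pupil leaves the normalized pupil unchanged, which is one way to read the robustness to head pose reported above. With more than four calibration targets, a least-squares fit would replace the exact solve.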
  • 6. Figure 8: Influence of head movements while gazing at (top row) (0, 0) and (bottom row) (200, 200) on the screen using eye parameters in (left column) E0 and (right column) E1.

Figure 9: Influence of head movements parallel to the screen with colors indicating accuracy. (Top) Cross-ratio-based methods. (Bottom) Homography-based methods.

Figure 10: Influence of depth movements for (top row) N = 2 and (bottom row) N = 4 and eye models (left column) E0 and (right column) E1.

4.4 Discussion

The cross-ratio methods are mostly affected by two main factors: (1) the non-coplanarity of pupil and reflection planes (36% of the total error), and (2) the angular offset between optical and visual axes (58% of the total error). This result is in accordance with the observations made by [Kang et al. 2008]. Homography normalization is mostly affected by the non-coplanarity of pupil and Πc (98% of the total estimation error). Most of this error can be accounted for through non-linear modeling such as polynomials or Gaussian processes. It does not seem possible to account for all the errors with homography normalization unless the point of maximum error is sampled during calibration. The planarity assumption of the virtual reflections accounts only for negligible errors for both cross-ratio and homography normalization methods. However, since the offset between visual and optical axes is handled by homography normalization but not by the cross-ratio methods, homography normalization performs significantly better on realistic eye models. Homography normalization is not head pose invariant, but it is robust to head pose changes. Movements in depth are handled quite well: the results indicate that depth movements that can be captured by a standard camera do not seem to pose significant problems to the method.

5 Assessment on Data from a Physical Setup

This section provides an assessment similar to section 4, but it is based on data from a physical setup. In addition, sections 5.2 and 5.3 describe and evaluate two novel cases of uncalibrated setups, namely using low cost cameras with built-in light sources and unknown camera parameters, and using screen reflections as a replacement for glints.

5.1 Comparing the Methods

Setup The experimental setup consists of a 24” screen with a resolution of 1920 × 1200 pixels. To comply with the restrictions of the cross-ratio methods, four light sources (Sony IVL-150 IR lamps) are placed in the corners of the monitor and one light source is placed around the camera lens. Images are obtained from a PointGrey Flea2 camera with a resolution of 800 × 600 pixels placed under the monitor, slightly to the left of the center of the screen. The glints are found by thresholding and ellipse contour fitting on filtered blobs. The locations of the glints are used for determining the pupil blob. An active contour method (finding maximum gradients along contour normals), in combination with RANSAC for robust feature selection, is used to determine the outline of the pupil. Five participants (4 men and 1 woman) sitting about 50 cm from the monitor took part in the experiments. Each person was exposed to two calibration procedures (N = 4 and N = 25 targets) and two trial cases. In one trial case the user kept the head still (as much as possible without head fixtures), and in the other case the user was asked after calibration to stand up and sit down, and then to move the head as convenient while looking at the screen targets. The accuracies are measured as the average deviation in degrees over 25 uniformly distributed test locations on the screen.

Experiments Figure 11 (left) shows the results both when the head is kept still and under head movements (N = 25). Generally there is very little difference between the head movement and no head movement cases, but there are significant differences between the cross-ratio and homography normalization methods.
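The accuracy measure used in section 5.1 (average deviation in degrees over the test locations) can be computed along the following lines. This is only a sketch: the paper does not state how pixel errors were converted to degrees, so the conversion below, the eye position expressed in millimetres in a screen-aligned frame, and all function names are assumptions made for illustration.

```python
import math


def pixel_pitch_mm(diagonal_inch, res_x, res_y):
    """Physical size of one (square) pixel from the screen diagonal."""
    return diagonal_inch * 25.4 / math.hypot(res_x, res_y)


def angular_error_deg(target_px, estimate_px, eye_mm, pitch_mm):
    """Angle at the eye between the target and the estimated gaze point.

    Screen points are in pixels; eye_mm = (x, y, z) is the assumed eye
    position in millimetres, with z the distance to the screen plane.
    """
    def ray(p):
        return (p[0] * pitch_mm - eye_mm[0],
                p[1] * pitch_mm - eye_mm[1],
                -eye_mm[2])

    a, b = ray(target_px), ray(estimate_px)
    cross = (a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0])
    dot = sum(x * y for x, y in zip(a, b))
    # atan2 of |a x b| and a . b stays accurate for very small angles.
    return math.degrees(math.atan2(math.sqrt(sum(c * c for c in cross)), dot))


def mean_angular_error_deg(targets_px, estimates_px, eye_mm, pitch_mm):
    """Average deviation in degrees over all test locations."""
    errors = [angular_error_deg(t, e, eye_mm, pitch_mm)
              for t, e in zip(targets_px, estimates_px)]
    return sum(errors) / len(errors)
```

For the setup described above one might call `pixel_pitch_mm(24, 1920, 1200)` and place the eye about 500 mm in front of the screen center. These values only restate the reported screen size and viewing distance and are not a reconstruction of the authors' evaluation code.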
  • 7. Figure 11 (right) shows the effects of changing the number of calibration targets. It also shows a significant difference between the methods, but adding extra calibration targets has only a slight effect on accuracy. A surprisingly small improvement is obtained by modeling the offset between the pupil and Πc. This may be due to the relatively large ratio between the user-to-screen and pupil-to-cornea distances, and to measurement noise in the image being large compared to the actual improvement.

Figure 11: Data obtained from a camera. Comparing (left) the results in the presence of head movements and (right) changing the number of calibration targets.

The experiments conducted in this section were meant to compare the methods in a somewhat normal setup where light sources are located on the screen and the camera slightly below. The following sections are meant to illustrate some of the advantages of uncalibrated setups.

5.2 Using a Web Camera with Night Vision

When the primary goal is to lower the cost of eye trackers, it is common to use off-the-shelf components, such as video and web cameras. In this section we investigate the performance of homography normalization using a web camera with night vision (see figure 12 left). Web cameras generally use wide-angle lenses for full face capture, and the camera consequently needs to be placed close to the eye to obtain a sufficient resolution for reliable feature extraction. However, cheap (around $5) zoom lenses can be bought off-the-shelf to replace the standard wide-angle lens. Gaze trackers that rely on web cameras need to use gaze estimation methods that are robust to head movements, since even a minor head movement implies a rather large apparent movement in the image. Homography normalization seems to be a good choice for low cost gaze trackers for the general public: it does not make any constraining assumptions on the camera, so users can buy any suitable camera with a priori unknown camera parameters and unknown locations of light sources. The user does not need to make any measurements of the relative position of the screen, light sources and camera, and may place the screen and camera as most convenient. The pupil ellipse and the glints are found through blob localization and contour detection [Hansen and Pece 2005].

Figure 12 (right) shows the results using the web camera and altering the number of calibration targets using the (Hom) and (GP) methods. The 9 subjects were placed about 50 cm from the monitor and the camera was placed as most convenient to capture the eye and reflections. The error is below one degree, with only minor differences between the methods.

Figure 12: (left) Genius iSlim 321R web camera used for the experiments. The four built-in light sources are used for homography normalization. (right) The results of using the web camera for gaze estimation as a function of the number of calibration targets.

5.3 Screen Reflections

Most eye trackers use glints from IR light sources. This section shows that it is possible to avoid glints and use reflections from the 3D environment for gaze estimation. Homography normalization can in principle use any four corneal reflections of the world, e.g. the screen. The screen is used for displaying information but may simultaneously generate an easily recognizable reflection on the cornea. The reflections of the screen corners may be used for homography normalization. Figure 13 shows an example image of an eye with a screen reflection captured in the visible spectrum with a PointGrey Flea2 camera. The setup is similar to the setup described in section 5.1.

The screen corners were found using blob detection and robust contour detection [Hansen and Pece 2005]. Detecting the screen corner reflections was found to be less ambiguous than normal glint detection due to the size and shape of the screen region. When a screen corner disappears (the equivalent of a glint disappearing), the additional information from the screen boundary can be used to estimate where the corner would be located. The limbus center is used since the pupil may be difficult to detect reliably in the visible spectrum.

The results given in figure 13 were obtained from two sequences of two subjects gazing at 25 targets. The sequences were divided into a training and a test set through random sampling. The major source of gaze estimation error was directly related to the limbus center estimate. The main reason is that limbus detection is affected by the surrounding structures (eyelids and eyelashes), which dramatically affects gaze accuracy.

Figure 13: (left) Image of the eye with a screen reflection. (right) The results of the two sequences applying both (Hom) and (GP) methods on a variable number of calibration targets.

6 Discussion

This paper has described a novel yet simple method for robust gaze estimation in an uncalibrated setup using homography normalization. It has been shown geometrically and empirically that the method is robust with regard to head pose changes. Limitations of
  • 8. the method are also outlined. Some of these limitations can be overcome, while others cannot. Fortunately the limitations and errors which remain are minor (non-spherical cornea, refraction and the use of a projection of the glints on the cornea rather than the reflection laws), while the major errors (offset between reflection plane and pupil) can be well approximated through non-linear models. The method therefore combines geometry-based and interpolation-based methods and thus obtains the robustness of geometry-based models while keeping the simplicity and flexibility of interpolation-based methods. A Gaussian process non-linear model is suggested in preference to parameterized models, since it allows for flexibility in the calibration procedure. The main reasons for using GPs are (1) no requirements on the minimum number of calibration targets and (2) gaze accuracy increases locally for each additional (distinct) calibration target. Our experiments on simulated data show that N = 9 calibration targets provide significant improvements and N = 16 yields a vanishing error. The difference in the errors between 9 and 16 calibration targets is quite small, and one has to consider whether the additional 7 calibration targets are worth the additional effort. With an error of about one degree, the method has an accuracy comparable to methods using calibrated setups. The experiments on data from actual setups generally confirm the experiments on simulated data, but with less improvement from using a non-linear model.

Homography normalization has been compared to similar methods using the cross-ratio on both simulated data and data obtained from a physical setup. The experiments indicate that both cross-ratio and homography normalization methods are reasonably robust to head pose changes under normal working conditions, hence indicating that the planarity assumption of the cornea reflections is a fairly good approximation. Homography normalization is similar to cross-ratio-based methods in using measurements on planar structures. One noticeable difference is that cross-ratio-based methods use planar projective invariants, whereas homography normalization is based directly on the planar mapping function. This seemingly small difference makes a significant difference in terms of accuracy. Homography normalization and its derivative (GP) consistently perform better than cross-ratio-based methods. A key reason is that homography normalization implicitly models the offset between optical and visual axes. Furthermore, homography normalization allows for more flexible setups since only four reflections are required, whereas the cross-ratio methods need five, of which four need to come from the corners of the screen and one from the camera. In fact, the homography normalization methods may use any stationary reflections from the 3D world without knowing their individual locations.

Three different uncalibrated scenarios were described: (1) a standard setup where the light sources were located on the screen, (2) a low cost solution using a web camera with built-in light sources and (3) a method in the visible spectrum using the reflections of the screen on the cornea and limbus information. These examples show the generality of the method: it can be used directly with either the center of the limbus or pupil or their contours without altering the model. Furthermore, gaze estimation may be obtained without IR and still be robust to head movements as long as four stable reflections are detected. The use of screen reflections has made the detection of four points more robust since the screen region is larger and has well-defined edges. Even though a corner may disappear, the information from the rest of the screen reflection may yield good approximations that can be used for gaze estimation. The limbus can be detected quite reliably, but it is affected by the surrounding structures and therefore provides less robust results than the pupil. If both the pupil and screen reflections were obtained simultaneously, one could possibly obtain highly reliable results. The results of this paper show that relaxing the assumptions of the setup of gaze trackers may not cause a significant reduction in accuracy. There are physical constraints on the possible positions and orientations of the head that prevent four reflections from appearing concurrently on the cornea. However, these constraints have not been an issue in the experiments. The road towards lightweight, versatile and highly mobile gaze trackers now seems more viable.

Acknowledgements This work is supported by the COGAIN European network of excellence, funded under the FP6/IST program of the European Commission.

References

Böhme, M., Dorr, M., Graw, M., Martinetz, T., and Barth, E. 2008. A software framework for simulating eye trackers. In Proceedings of the Eye Tracking Research & Application Symposium, ETRA 2008, Savannah, Georgia, USA, March 26-28, 2008.

Brolly, X. L. C., and Mulligan, J. B. 2004. Implicit calibration of a remote gaze tracker. In Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW ’04), vol. 8, 134.

Cerrolaza, J. J., Villanueva, A., and Cabeza, R. 2008. Taxonomic study of polynomial regressions applied to the calibration of video-oculographic systems. In Proceedings of the 2008 symposium on Eye tracking research & applications, ACM, New York, NY, USA, 259–266.

Coutinho, F. L., and Morimoto, C. H. 2006. Free head motion eye gaze tracking using a single camera and multiple light sources. In Proceedings, IEEE Computer Society, M. M. d. Oliveira Neto and R. L. Carceroni, Eds.

Guestrin, E. D., and Eizenman, M. 2006. General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Transactions on Biomedical Engineering 53, 6, 1124–1133.

Hansen, D. W., and Ji, Q. 2010. In the eye of the beholder: A survey of models for eyes and gaze. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 3, 478–500.

Hansen, D. W., and Pece, A. E. 2005. Eye tracking in the wild. Computer Vision and Image Understanding 98, 1 (April), 182–210.

Hartley, R. I., and Zisserman, A. 2004. Multiple View Geometry in Computer Vision, second ed. Cambridge University Press, ISBN: 0521540518.

Kang, J. J., Guestrin, E. D., and Eizenman, E. 2008. Investigation of the cross-ratio method for point-of-gaze estimation. Transactions on Biomedical Engineering 55, 9, 2293–302.

Morimoto, C. H., and Mimica, M. 2005. Eye gaze tracking techniques for interactive applications. Computer Vision and Image Understanding 98, 1 (April), 4–24.

Rasmussen, C. E., and Williams, C. K. 2006. Gaussian processes for Machine Learning. The MIT Press.

Scott, D., and Findlay, J. 1993. Visual search, eye movements and display units. Human factors report.

Yoo, D. H., and Chung, M. J. 2005. A novel non-intrusive eye gaze estimation using cross-ratio under large head motion. Computer Vision and Image Understanding 98, 1 (April), 25–51.