We propose a person authentication system using eye movement signals. In security scenarios, eye-tracking has earlier been used for gaze-based password entry. A few authors have also used physical features of eye movement signals for authentication in a task-dependent scenario with matched training and test samples. We propose and implement a task-independent scenario in which the training and test samples can be arbitrary. We use short-term eye gaze direction to construct feature vectors which are modeled using Gaussian mixtures. The results suggest that there are person-specific features in the eye movements that can be modeled in a task-independent manner. The range of possible applications extends beyond security-type authentication to proactive and user-convenience systems.
1.1 Related Work

The idea of using eye movements for person authentication is not new. The currently adopted approaches consider eye gaze as an alternative way to input user-specific PIN numbers or passwords; see [Kumar et al. 2007] for a recent review of such methods. In these systems, the user may input the pass-phrase by, for instance, gaze-typing the password [Kumar et al. 2007], using graphical passwords such as human faces [Passfaces], or using gaze gestures similar to mouse gestures in web browsers [Luca et al. 2007]. In such systems, it is important to design the user interface (presenting the keypad or visual tokens; providing user feedback) to find an acceptable trade-off between security and usability: a too complicated cognitive task easily becomes distracting. Eye gaze input is useful in security applications where shoulder surfing is expected (e.g. watching over a person's shoulder while he/she is typing a PIN code at an ATM). However, it is still based on the "what-the-user-knows" type of authentication; it might also be called cognometric authentication [Passfaces].

In this paper, we focus on biometric authentication, that is, on who the person is rather than what he/she knows. Biometric authentication could be combined with cognometric techniques, or used as the only authentication method in low-security scenarios (e.g. customized user profiles in computers, adaptive user interfaces). We are aware of two prior studies that have utilized physical features of the eye movements to recognize persons [Kasprowski 2004; Bednarik et al. 2005]. In [Kasprowski 2004], the author used a custom-made head-mounted infrared oculography eye tracker ("OBER 2") with a sampling rate of 250 Hz to collect data from N = 47 subjects. The subjects followed a jumping stimulation point presented on a normal computer screen at twelve successive positions. Each subject had multiple recording sessions; one task consisted of less than 10 seconds of data and led to a fixed-dimensional (2048 data points) pattern. The stimulation reference signal, together with robust fixation detection, was used for normalizing and aligning the gaze coordinate data. From the normalized signals, several features were derived: average velocity direction, distance to stimulation, eye difference, discrete Fourier transform (DFT), and discrete wavelet transform (DWT). Five well-known pattern matching techniques, as well as their ensemble classifier, were then used for verifying the identity of a person. An average recognition error of the classifier ensemble of around 7% to 8% was reported.

In an independent study [Bednarik et al. 2005], the authors used a commercially available eye tracker (Tobii ET-1750), built into a TFT panel, to collect data from N = 12 subjects at a sampling rate of 50 Hz. The stimulus consisted of a single cross shown in the middle of the screen and displayed for 1 second. The eye-movement features consisted of pupil diameter and its time derivative, gaze velocity, and the time-varying distance between the eyes. Discrete Fourier transform (DFT), principal component analysis (PCA), and their combination were then applied to derive low-dimensional features, followed by k-nearest neighbor template matching. The time derivative of the pupil size was found to be the best dynamic feature, yielding an identification rate of 50% to 60% (the best feature, the distance between the eyes, was not considered interesting since it can be obtained without an eye tracker).

1.2 What is New in This Study: Task Independence

Both of these studies [Kasprowski 2004; Bednarik et al. 2005] are inherently task-dependent: they assume that the same stimulus appears in training and testing. This approach has the advantage that the reference template (training sample) and the test sample can be accurately aligned. However, the accuracy comes at a price in convenience: in a security scenario, the user is forced to perform a certain task which becomes both learned and easily distracting. The authentication stimulus must also be designed with great care to avoid a learning effect: if the user learns the task, such as the order of the moving object, his behavior changes, which may lead to increased recognition errors. From the security point of view, a repetitious task can easily be copied and an intrusion system built to imitate the expected input.

Here we take a step towards eye movement biometrics where we have minimal or no prior knowledge of the underlying task. We call this problem task-independent eye-movement biometrics, borrowing the terminology from other behavioral biometric tasks such as text-independent speaker [Kinnunen and Li 2010] and writer [Bulacu and Schomaker 2007] identification.

2 Data Preprocessing and Feature Extraction

The eye coordinates are denoted here by x_left(t), y_left(t), x_right(t), y_right(t) for each timestamp t. Some trackers, including the one used in this study, also provide information about the pupil diameter of both eyes; we focus on the gaze coordinate data only. The eye movement signal is inherently noisy and includes missing data due to, for instance, blinks, involuntary head movements, or corneal moisture (tear film) and irregularities. Typically, the eye-tracking device reports the missing data in a validity variable. We assume continuity of the data within the missing parts and linearly interpolate between the valid segments. Interpolation is applied independently to each of the four coordinate time series.

It is well known that the eye is never still but constantly moves around the fixation center. While most eye trackers cannot measure the microscopic movements of the eye (the tremor, drift, and microsaccades) due to limited accuracy and temporal resolution, we concentrate on movements at a coarser level of detail that can be effectively measured. Even during a single fixation, the eye constantly samples the region of interest. We hypothesize that the way the eye moves during fixations and smooth pursuit has a bearing on some individual property of the oculomotor plant.

We describe the movement as a histogram of all the angles the eye travels through during a certain period. A similar technique was used for task-dependent recognition in [Kasprowski 2004]. Figure 2 summarizes the idea. We consider a short-term data window (L samples) that extends over a temporal span of a few seconds. The local velocity direction of the gaze (from time step t to time step t + 1) is computed using trigonometric identities and transformed into a normalized histogram, or discrete probability mass function, z = (p(θ_1), p(θ_2), ..., p(θ_K)), where p(θ_k) ≥ 0 for all k and Σ_k p(θ_k) = 1. The K histogram bin midpoints are pre-located at the angles θ_k = k(2π/K), where k = 0, 1, ..., K − 1. The data window is then shifted forward by S samples (here we choose S = L/2). The method produces a time sequence {z_t}, t = 1, 2, ..., T, of K-dimensional feature vectors which are considered independent of each other and used in statistical modeling as explained in the next section.

Note that in the example of Fig. 2 most of the fixations have a "north-east" or "south-west" tendency, which shows up in the histogram as an "eigendirection" of the velocity. It is worth emphasizing that we deliberately avoid the use of any fixation detection, which is error prone, and instead use all the data within a given window. These data are likely to include several fixations and saccades. Since the proportion of time spent fixating is typically more than 90%, the relative contribution of the saccades to the histogram is small.
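The preprocessing and feature extraction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names and default values are ours, the bins use edges at k(2π/K) rather than being centered at the paper's midpoints, and L = 192 corresponds to the 1.6 s window of Fig. 2 at the 120 Hz sampling rate used in the experiments.

```python
import numpy as np

def interpolate_invalid(coords, valid):
    """Linearly interpolate one coordinate time series over invalid
    samples (blinks, lost tracking), assuming continuity in the gaps."""
    t = np.arange(len(coords))
    out = np.asarray(coords, dtype=float).copy()
    out[~valid] = np.interp(t[~valid], t[valid], out[valid])
    return out

def direction_histograms(x, y, L=192, K=27):
    """Sliding-window histograms of local gaze velocity directions.

    x, y: interpolated gaze coordinates of one eye.
    L:    window length in samples (192 samples = 1.6 s at 120 Hz);
          the window shift is S = L // 2.
    K:    number of direction bins over [0, 2*pi).
    Returns an array of shape (T, K); each row is a pmf (sums to 1).
    """
    # Local velocity direction from time step t to t + 1.
    angles = np.arctan2(np.diff(y), np.diff(x)) % (2 * np.pi)
    S = L // 2
    edges = np.linspace(0.0, 2 * np.pi, K + 1)  # edges at k * 2*pi / K
    feats = []
    for start in range(0, len(angles) - L + 1, S):
        counts, _ = np.histogram(angles[start:start + L], bins=edges)
        feats.append(counts / counts.sum())     # normalize to a pmf
    return np.array(feats)
```

How the per-eye histograms of the left and right eye are combined into one feature stream is not specified here; processing each eye separately is our assumption.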
Figure 2: Illustration of velocity direction features. [Panels: eye gaze data for an entire task; short-term gaze within a temporal window of 1.6 seconds; the short-term x and y coordinate data within the window; and a polar histogram of velocity directions, where the magnitude of each arc is proportional to the probability of that velocity direction within the window.]

Figure 3: Effect of feature parameters on accuracy. [Surface plot: equal error rate (EER %) as a function of window length L and the number of histogram bins K.]
3 Task-Independent Modeling of Features

We adopt some machinery from modern text-independent speaker recognition [Kinnunen and Li 2010]. In speaker recognition, the speech signal is likewise transformed into a sequence of short-term feature vectors that are considered independent. The matching problem then becomes quantifying the similarity of the given "feature vector clouds". To this end, each person is modeled using a Gaussian mixture model (GMM) [Bishop 2006]. An important advance in speaker recognition was normalizing the speaker models with respect to a so-called universal background model (UBM) [Reynolds et al. 2000]. This method (Fig. 1) is now a well-established baseline in that field. The UBM, which is just another Gaussian mixture model, is first trained from a large set of data. The user-dependent GMMs are then derived by adapting the parameters of the UBM to that speaker's training data. The adapted model is an interpolation between the UBM (the prior model) and the observed training data. This maximum a posteriori (MAP) training gives better accuracy than maximum likelihood (ML) training when the amount of data is limited. Normalization by the UBM likelihood in the test phase also emphasizes the difference of the given person from the general population of persons.

The GMM-UBM method [Reynolds et al. 2000] can be summarized as follows. Both the UBM and the user-dependent models are mixtures of multivariate Gaussians with mean vectors μ_m, diagonal covariance matrices Σ_m, and mixture weights w_m. The density function of the GMM is p(z|Λ) = Σ_{m=1}^{M} w_m N(z|μ_m, Σ_m), where N(·|·) denotes a multivariate Gaussian and Λ collectively denotes all the parameters. The mixture weights satisfy w_m ≥ 0, Σ_m w_m = 1. The UBM parameters are found via the expectation-maximization (EM) algorithm, and the user-dependent mean vectors are derived as

    μ̂_m = α_m z̄_m + (1 − α_m) μ_m,    (1)

where z̄_m is the posterior-weighted mean of the training sample. The adaptation coefficient is defined as α_m = n_m / (n_m + r), where n_m is the soft count of vectors assigned to the mth Gaussian and r > 0 is a fixed relevance factor. Variances and weights can also be adapted. For a small amount of data, the adapted vector is an interpolation between the data and the UBM, whereas for a large amount of data the effect of the UBM is reduced. Given a sequence of independent test vectors {z_1, ..., z_T}, the match score is computed as the difference of the target person's and the UBM's average log-likelihoods:

    score = (1/T) Σ_{t=1}^{T} { log p(z_t|Λ_target) − log p(z_t|Λ_UBM) }.    (2)

4 Experiments

4.1 Experimental Setup

For the experiments, we collected a database of N = 17 users. The initial number of participants was higher (23), but we left out recordings with low-quality data, e.g. due to frequent loss of eye tracking. Participants were naive to the actual purpose of the study. The recordings were conducted in a quiet usability laboratory with constant lighting. After calibration, the target stimuli consisted of two sub-tasks. First, the display showed an introductory text with instructions related to the content and duration of the task. Participants were also asked to keep their head as stable as possible. The following stimulus was a 25-minute video of a section of Absolutely Fabulous, a comedy series produced by the BBC. The video contained the original sound and subtitles in Finnish. The screen resolution of the video was 800 x 600 pixels and it was shown on a 20.1-inch LCD flat panel Viewsonic VP201b (true native resolution 1600 x 1200, anti-glare surface, contrast ratio 800:1, response time 16 ms, viewing distance approximately 1 m). A Tobii X120 eye tracker (sampling rate 120 Hz, accuracy 0.5 degree) was employed in the study, and the stimuli were presented using the Tobii Studio analysis software v. 1.5.2.

We use one training file and one test file per person and do cross-matching of all file pairs, which leads to 17 genuine access trials and 272 impostor access trials. We measure the accuracy by the equal error rate (EER), which corresponds to the operating point at which the false acceptance rate (accepting an impostor) and the false rejection rate (rejecting a legitimate user) are equal. The UBM training data, person-dependent training data, and test data are all disjoint. The UBM is trained by a deterministic splitting method followed by 7 k-means iterations and 2 EM iterations. In creating the user-dependent adapted Gaussian mixture models, we adapt both the mean and variance vectors with relevance factor r = 16.

Table 1: Effect of training and test data durations.

    Training data   Test data   Equal error rate (EER %)
    10 sec          10 sec      47.1
    1 min           10 sec      47.1
    3 min           10 sec      35.3
    6 min           10 sec      29.8
    9 min           10 sec      28.7
    1 min           1 min       41.9
    2 min           2 min       41.2
    4 min           4 min       29.4
    5 min           3 min       29.4
    6 min           2 min       29.4
    7 min           1 min       29.8

Figure 4: Effect of model order. [Plot: EER % versus the number of Gaussian components M, for window length L = 900 samples (7.5 sec), window shift S = 450 samples, and histogram size K = 27 bins.]

4.2 Results

We first optimize the feature extractor and classifier parameters. The UBM is trained from 140 minutes of data, whereas the training and test segments have approximate durations of five minutes and one minute, respectively. We fix the number of Gaussians to M = 12 and vary the length of the data window (L) and the number of histogram bins (K). The error rates presented in Fig. 3 suggest that increasing the window size improves accuracy, and that the number of bins does not have to be large (20 ≤ K ≤ 30). We fix K = 27 and L = 900 and further fine-tune the number of Gaussians in Fig. 4. The accuracy improves with an increasing number of Gaussians, reaching its optimum at approximately 6 ≤ M ≤ 18. In the following we set M = 16.

Finally, we display the error rates for varying lengths of training and test data in Table 1. Here we reduced the amount of UBM training data to 70 minutes so as to allow longer training and test segments for the target persons. Increasing the amount of data improves accuracy, saturating at around 30% EER. Although this accuracy is not sufficient for a realistic application, it is clearly below the chance level (50% EER), suggesting that there is individual information in the eye movements which can be modeled. The error rates are clearly higher than in [Kasprowski 2004; Bednarik et al. 2005]. However, those studies used fixed-dimensional templates that were carefully aligned, whereas our system does not use any explicit temporal alignment.

5 Conclusions

We approached the problem of task-independent eye-movement biometric authentication bottom-up, without any prior signal model of how the eye-movement signal and features are generated by the oculomotor plant. Instead, our intuition was that the way the ocular muscles operate the eye is individual and partly independent of the underlying task. Our results indicate that this is a feasible conception. The error rates are too high to be useful in a realistic security system, but significantly lower than the chance level (50%), which warrants further studies with a larger number of participants and a wider range of tasks and features. Several improvements could also be made on the data processing side: using complementary features from the pupil diameter, and using feature and score normalization and discriminative training [Kinnunen and Li 2010].

References

BEDNARIK, R., KINNUNEN, T., MIHAILA, A., AND FRÄNTI, P. 2005. Eye-movements as a biometric. In Proc. 14th Scandinavian Conference on Image Analysis (SCIA 2005), 780-789.

BISHOP, C. 2006. Pattern Recognition and Machine Learning. Springer Science+Business Media, LLC, New York.

BULACU, M., AND SCHOMAKER, L. 2007. Text-independent writer identification and verification using textual and allographic features. IEEE Trans. on Pattern Analysis and Machine Intelligence 29, 4 (April), 701-717.

COGAIN. Open source gaze tracking, freeware and low cost eye tracking. WWW page, http://www.cogain.org/eyetrackers/low-cost-eye-trackers.

KASPROWSKI, P. 2004. Human Identification Using Eye Movements. PhD thesis, Silesian University of Technology, Institute of Computer Science, Gliwice, Poland. http://www.kasprowski.pl/phd/PhD_Kasprowski.pdf.

KINNUNEN, T., AND LI, H. 2010. An overview of text-independent speaker recognition: from features to supervectors. Speech Communication 52, 1 (January), 12-40.

KUMAR, M., GARFINKEL, T., BONEH, D., AND WINOGRAD, T. 2007. Reducing shoulder-surfing by using gaze-based password entry. In Proc. of the 3rd Symposium on Usable Privacy and Security, ACM, 13-19.

LUCA, A. D., WEISS, R., AND DREWES, H. 2007. Evaluation of eye-gaze interaction methods for security enhanced PIN-entry. In Proc. of the 19th Australasian Conf. on Computer-Human Interaction: Entertaining User Interfaces, ACM, 199-202.

PASSFACES. Passfaces: Two factor authentication for the enterprise. WWW page, http://www.passfaces.com/.

REYNOLDS, D., QUATIERI, T., AND DUNN, R. 2000. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10, 1 (January), 19-41.

SALVUCCI, D., AND GOLDBERG, J. 2000. Identifying fixations and saccades in eye-tracking protocols. In ETRA '00: Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, ACM, New York, NY, USA, 71-78.
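As a concrete companion to the GMM-UBM recipe of Section 3, the following NumPy-only sketch implements MAP adaptation of the UBM means (Eq. 1) and the average log-likelihood-ratio score (Eq. 2). It is an illustrative sketch under stated assumptions, not the authors' code: the UBM (weights w, means mu, diagonal variances var) is assumed to be already trained with EM, only the means are adapted here (the experiments also adapt the variances), and all helper names are ours.

```python
import numpy as np

def log_gauss(Z, mu, var):
    """Per-component log-density of diagonal-covariance Gaussians.
    Z: (T, D); mu, var: (M, D).  Returns (T, M)."""
    d = Z[:, None, :] - mu[None, :, :]                  # (T, M, D)
    return -0.5 * (np.sum(np.log(2 * np.pi * var), axis=1)
                   + np.sum(d * d / var[None], axis=2))

def map_adapt_means(Z, w, mu, var, r=16.0):
    """MAP adaptation of UBM means to enrollment data Z (Eq. 1)."""
    logp = np.log(w) + log_gauss(Z, mu, var)            # (T, M)
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)             # responsibilities
    n = post.sum(axis=0)                                # soft counts n_m
    zbar = post.T @ Z / np.maximum(n, 1e-10)[:, None]   # posterior-weighted means
    alpha = n / (n + r)                                 # adaptation coefficients
    return alpha[:, None] * zbar + (1 - alpha[:, None]) * mu

def avg_loglik(Z, w, mu, var):
    """Average log-likelihood (1/T) sum_t log p(z_t | Lambda)."""
    logp = np.log(w) + log_gauss(Z, mu, var)
    m = logp.max(axis=1, keepdims=True)                 # stable log-sum-exp
    return float(np.mean(m.squeeze() + np.log(np.sum(np.exp(logp - m), axis=1))))

def gmm_ubm_score(Z_test, w, mu_target, mu_ubm, var):
    """Eq. 2: target minus UBM average log-likelihood."""
    return (avg_loglik(Z_test, w, mu_target, var)
            - avg_loglik(Z_test, w, mu_ubm, var))
```

Note how the relevance factor r controls the interpolation of Eq. 1: components with few assigned vectors (small n_m) stay close to the UBM prior, while well-populated components move toward the data.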