
SemantiCode: Using Content Similarity and Database-driven Matching to Code Wearable Eyetracker Gaze Data

Daniel F. Pontillo, Thomas B. Kinsman, & Jeff B. Pelz*
Multidisciplinary Vision Research Lab, Carlson Center for Imaging Science, Rochester Institute of Technology
{dfp7615, btk1526, *pelz}

Copyright © 2010 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail
ETRA 2010, Austin, TX, March 22 – 24, 2010. © 2010 ACM 978-1-60558-994-7/10/0003 $10.00

Abstract

Laboratory eyetrackers, constrained to a fixed display and static (or accurately tracked) observer, facilitate automated analysis of fixation data. Development of wearable eyetrackers has extended environments and tasks that can be studied at the expense of automated analysis.

Wearable eyetrackers provide 2D point-of-regard (POR) in scene-camera coordinates, but the researcher is typically interested in some high-level semantic property (e.g., object identity, region, or material) surrounding individual fixation points. The synthesis of POR into fixations and semantic information remains a labor-intensive manual task, limiting the application of wearable eyetracking.

We describe a system that segments POR videos into fixations and allows users to train a database-driven, object-recognition system. A correctly trained library results in a very accurate and semi-automated translation of raw POR data into a sequence of objects, regions or materials.

Keywords: semantic coding, eyetracking, gaze data analysis

1 Introduction

Eye tracking has a well-established history of revealing valuable information about visual perception and more broadly about cognitive processes [Buswell 1935; Yarbus 1967; Mackworth and Morandi 1967; Just and Carpenter 1976]. Within this field of research, the objective is often to examine how an observer visually engages with the content or layout of an environment. When the observer’s head is stationary (or accurately tracked) and the stimuli are static (or their motion over time is recorded), commercial systems exist that are capable of automatically extracting gaze behavior in scene coordinates. Outside the laboratory, where observers are free to move through dynamic environments, the lack of constraints precludes the use of most existing automatic methods.

A variety of solutions have been proposed and implemented in order to overcome this issue. One approach, ‘FixTag,’ [Munn and Pelz 2009] utilizes ray tracing to estimate fixation on 3D volumes of interest. In this scheme, a calibrated scene camera is used to track features across frames, allowing for the extraction of 3D camera movement. With this, points in a 2D image plane can be mapped onto the scene camera’s intrinsic 3D coordinate system. This allows for accurate ray tracing from a known origin relative to the scene camera. While this method has been shown to be accurate, it has limitations. Critically, it requires an accurate and complete a priori map of the environment to relate object identities with fixated volumes of interest. In addition, all data collection must be completed with a carefully calibrated scene camera, and the algorithm is computationally intensive.

Another proposed method is based on Simultaneous Localization and Mapping (SLAM) algorithms originally developed for mobile robotics applications [Thrun and Leonard 2008]. Like FixTag, current implementations of SLAM-based analyses require that the environment be completely mapped before analysis begins, and are brittle to scene layout changes, precluding their use in novel and/or dynamic environments.

Our initial impetus for this research was the need for a tool to aid the coding of gaze data from mobile shoppers interacting with products. Because the environment changes every time a product is purchased (or the shopper picks up a product to inspect it), neither FixTag nor SLAM-based solutions were viable. Another application of the tool is in a geoscience research project, in which multiple observers explore a large number of sites. While the background in each scene is static, it isn’t practical to survey each site horizon-to-horizon, and because the scenes include an active instructor and other observers, existing solutions were not suitable for this case.

Figure 1 shows sample frames from the geosciences-project gaze video recorded in an open, natural scene, which contains many irregular objects and other observers. Note that even if it were possible to extract volumes of interest and camera motions within this environment, there would be no mechanism for mapping fixations within volumes into their semantic identities because of the dynamic properties of the scene.

Figure 1 Gaze-video frames captured in an outdoor scene

2 Semantic-based Coding

Our goal in developing the SemantiCode tool was to replace the concept of location-based coding with a semantic-based tool. In 2D location-based coding, the identity of each fixation is defined
by the (X,Y) coordinate of the fixation in a scene plane. The 2D scene plane can be extended for dynamic cases such as web pages, provided that any scene motion (i.e., scrolling) is captured for analysis. In 3D location-based coding, fixations are defined by the (X,Y,Z) coordinate of the fixation in scene space, provided that the space is mapped and all objects of interest are placed within the map.

By contrast, in semantic-based coding, a fixation’s identity can be determined independent of its location in a 2D or 3D scene. Rather than basing identity on location, semantic-based coding uses the tools of object recognition to infer a semantic identity for each fixation. A wide range of spectral, spatial, and temporal features can be used in this recognition step. Note that while identity can be determined independent of location in semantic-based coding, location can be retained as a feature in identifying a fixation by integrating location data. Alternatively, a ‘relative location’ feature can be included by incorporating the semantic-based features of the region surrounding the fixated object.

Fundamental to the design of the SemantiCode tool is the concept of database training. Training occurs at two levels. The system is first trained by manually coding gaze videos: as each fixation is coded, the features at fixation are captured and stored along with the image region as an exemplar of the semantic identifier. Higher-level training can occur via relative weighting of multiple features, as described in Section 8.

3 Software Overview

SemantiCode was designed as a tool to optimize and optionally automate the process of coding without sacrificing adaptability, robustness, or an immediate mechanism for manual overrides. The software is reliant on user interaction and feedback, yet in most cases this interaction requires very little training. One major design consideration was a scalable operative complexity; this is crucial for research groups who employ undergraduates and other short-term researchers, as it obviates the need for an extended period of operator training. To this end, the graphical user interface (GUI) allows users manual control over every parameter and phase of video analysis and coding, while simultaneously offering default settings and semi-automation that should be applicable to most situations. Assuming previous users have trained a strong library of objects and exemplars, the coding task could be as simple as pressing one key to progress through fixations, resulting in a table of data that correlates each fixation to the semantic identity of the fixated region. The training process requires some straightforward manual configuration before this type of usage is possible, but depending on the variety of objects of interest, this can still be achieved in a much shorter period of time with significantly less effort than previous manual processes have required.

4 Graphical User Interface

When the user runs SemantiCode for the first time, the first step is to import a video that has been run through POR analysis software. (The examples here were done with Positive Science Yarbus 2.0 software []). Any video with an accompanying text file listing the POR for each video frame can be used. The POR location and time code of each frame are used to automatically segment the video into estimated fixations. Once this is finished, the first fixation frame appears, and coding can proceed.

An existing algorithm for automatic extraction of fixations [Munn 2009; Rothkopf and Pelz 2004] was modified and embedded within the SemantiCode system. Temporal and spatial constraints on the fixation extraction can be adjusted by the experimenter via the Fixation Definition Adjustment subgui seen in Figure 3. The user is also presented with statistics about the fixations as calculated from the currently selected video. The average fixation duration and the rate of fixations per second can be useful indicators of how well the automatic segmentation has worked for the current video [Salvucci and Goldberg 2000].

Figure 2 The SemantiCode GUI as it appears after the user has loaded a video and tagged a number of fixations. This usage example represents a scenario wherein a library has just been built. The area on the left side of the interface contains all of the fixation viewer components, while the area on the right is generally devoted to coding, library management, and the display of the top matches from the active library.

Figure 3 Fixation Definition Adjustment subgui allows the user to shift the constraints on what may be considered a fixation in order to produce more or fewer fixations.

5 Fixation Analysis

A single frame, extracted from the temporal center of the active fixation in the gaze video, is displayed on the left of the main GUI. Within the frame, a blue box indicates the pixel region considered relevant in all further steps. Beneath the frame, that region’s semantic identifier is shown, if one exists, along with a text display of the progress that has been made in coding the currently selected video. A control panel lets the user switch between fixation frames, videos, and projects. Users can navigate fixations manually with a drop-down fixation selector, with the next/previous buttons, or with the right and left arrow keys on the keyboard.

6 Object Coding

The primary purpose of SemantiCode is the attachment of semantic identification information to a set of pixel locations in
an eye tracking video. Thus, the actual coding of fixations is a critical functionality in the software. The first time the software is used with video of a new environment, coding begins manually. Users add new objects to the active library by typing in an identifier for the fixated region in the active frame, which can be selected as either 64x64 or 128x128 pixels surrounding the point of regard. With each added object, the image and its histogram are stored in the active library under the specified name. Once a sufficient number of objects have been added to describe the elements of interest in the environment, the user can continue coding by selecting the most appropriate member of the object list. As each frame is tagged with a name, the frame number, video name, and semantic identifier are stored and displayed as coded frames.

After coding each fixation (either manually or by accepting SemantiCode’s match), data about the fixation and the video from which it was extracted are written to an output data file. With this, statistical analyses can easily be run on the newly added semantic information for each coded fixation.

7 Building a Library

The data structure that underlies SemantiCode is referred to as a library. A library is simply a collection of semantic identifiers, each containing one or more images or exemplars, that has been constructed through the act of coding. When a user runs the software for the first time, an unpopulated default library is automatically created. Users can immediately start adding to this library, which is a persistent data structure that is automatically imported for every subsequent session of the software. The user can create a new blank library, copy an existing library into a new one, merge two or more libraries into one, and delete unwanted libraries.

Alternatively, users can import a pre-existing library, or merge several libraries into one before ever coding a single object. This portability is a major feature, as it means that, in principle, manual object coding need only be done once for a given environment. All subsequent coding tasks, regardless of the user or the location of the software, can be based on a pre-built library of exemplars and object images.

8 Semantic Recognition

Computer vision usually attempts either to find the location of a known object (“where?”) or to identify an object at a known location (“what?”). In the case of eyetracking the fixation location is given, so the primary question is, “What is the fixated object, region, or material?” To answer this, the region surrounding the calculated POR is characterized by one or more features. Those features are then compared to the features stored in a library to answer the question posed above.

As our initial method, we used the color-histogram intersection method introduced by Swain and Ballard [1990], in which the number of pixels in each bin of image I’s histogram is compared to the number of pixels in the same bin of model M’s histogram:

$$H(I, M) = \frac{\sum_{j=1}^{n} \min(I_j, M_j)}{\sum_{j=1}^{n} M_j}$$   (Eq. 1)

where I_j represents the jth bin of the histogram at fixation and M_j is the jth bin of a model’s histogram from the library of exemplars to test against. The denominator is the sum of each model’s histogram, a normalization constant computed once. H(I,M) represents the fractional match value, in the range [0, 1], between the fixated region and a model in the library. This has the desirable quality that background colors, which are not present in the object, are ignored. The intersection only increases if the same colors are present, and the amount of those colors does not exceed the amount expected. This approach is robust to changes in orientation and scale because it relies only on the color intensity values within the two images being compared. It is also computationally efficient, requiring only n comparisons, n additions, and one division.

The representation of 3D content by 2D views is handled by the design of the library. Each semantic identifier can contain an arbitrary number of exemplars from any view or scale. Consequently, multiple perspectives are added to the library as they are required. The library is thus adaptively modified to meet the needs of the coding task.

Future work will involve extended feature generation and selection, including alternative and adaptive histogram techniques, and the use of machine-learning algorithms for enhanced object discrimination.

Figure 4 The Examine subgui for a region called “Distant Terrain.” The GUI displays the exemplar and image for each fixation tagged with this name.

Since the current algorithm is not affected by shape or spatial configuration, it is not necessary to segment the region of interest from its background. As a result, irregular environments and observer movement do not degrade performance. Even more compelling is the capacity for this algorithm to accurately match materials and other mass nouns that may not take the form of discrete objects. The ability to automatically identify materials along with objects helps to address a larger issue in the machine-vision field about the salience of uniform material regions.

These factors make the Swain and Ballard [1990] color-histogram method an attractive choice for a highly adaptable and robust form of assisted semantic coding. Testing with just RGB histogram intersections shows great promise. In its current implementation, each time a new fixation frame is shown, SemantiCode matches its histogram against every object in the currently active library, ranks them, and displays the top ten objects on the right panel. The highest-ranking object shows the top three exemplars.

Table 1 shows the results of preliminary tests in a challenging outdoor environment similar to that depicted in Figure 1. For analysis, five regions were identified: Midground terrain, Lighter terrain, Distant terrain, Horizon, and Lake. After initializing the library by coding the first nine fixations within each region, the color-histogram match scores for the tenth fixation in each region were calculated. Recall that SemantiCode performs an
exhaustive search of all histograms. Table 1 contains the peak histogram match within each category. In the current implementation, SemantiCode presents the top ten matches to the experimenter. Hitting a single key accepts the top match; any of the next nine can be accepted instead by using the numeric keypad, as seen in Figure 2.

Table 1 Peak histogram match (see text)

                    Midground   Lighter   Distant   Horizon   Lake
                    terrain     terrain   terrain
Midground terrain   81%         52%       26%       38%       55%
Lighter terrain     34%         77%       72%       54%       65%
Distant terrain     45%         65%       82%       58%       71%
Horizon             14%         30%       39%       60%       55%
Lake                14%         61%       72%       65%       81%

The next version will allow the experimenter to implement automatic coding when the feature matches are unambiguous. For example, if the top match exceeds a predefined accept parameter (e.g., 80%), and no other matches are closer than a conflict parameter (e.g., 10%) of the top match, the fixation would be coded without experimenter intervention. If either constraint is not met, SemantiCode would revert to suggesting codes and waiting for verification. Table 1 shows that even in the challenging case of a low-contrast outdoor scene with similar spectral signatures, three of the five semantic categories would be coded correctly without user intervention, even with only nine exemplars per region. Note that in this case the semantic label ‘Horizon’ spans two distinct regions, making it a challenge to match. Still, the correct label is the second highest match.

To test SemantiCode’s ability to work in various environments, it was also evaluated in a consumer-shopping environment. Six regions were identified for analysis: four shampoos and two personal hygiene products. Histogram matches were calculated as described for the outdoor environment. The indoor environment was less challenging; after training, all six semantic categories could be coded correctly without user intervention, with top matches ranging from 74% to 85%.

In the near future, additional image-matching algorithms will be evaluated within the SemantiCode application for their effectiveness in different scene circumstances. Using the results from these evaluations, it will be possible to select the match-evaluation approaches best suited to a given scene.

Match scores can be computed as weighted combinations of outputs from a number of image-matching algorithms. Weights, dynamically adjusted by the reinforcement introduced by the experimenter’s manual coding, would allow a given library to be highly tuned to the detection of content that may otherwise be too indistinct for any individual matching technique.

9 Conclusion

SemantiCode offers a significant improvement over previous approaches to streamlining the coding of eyetracking data. The immediate benefit is seen in the dramatically increased efficiency for video coding, and further gains are anticipated with the semi-autonomous coding described.

With future improvements and extensibility, SemantiCode promises to become a valuable tool to support attaching semantic identifiers to image content. It will be possible to tune SemantiCode to virtually any environment. By combining the power of database-driven identification with unique matching techniques, it will only be limited by the degree to which it is appropriately trained. It is thus promising both as a tool for evaluating which algorithms are useful in different experimental scenarios, and as an improved practical coding system with which to analyze research data.

10 Acknowledgments

This work was made possible by the generous support of Procter & Gamble and NSF Grant 0909588.

11 Proprietary Information/Conflict of Interest

Invention disclosure and provisional patent protection for the described tools are in process.

References

BUSWELL, G.T. 1935. How People Look at Pictures: A Study of the Psychology of Perception in Art. The University of Chicago Press, Chicago.

JUST, M.A. AND CARPENTER, P.A. 1976. Eye fixations and cognitive processes. Cognitive Psychology, 8, 441-480.

MACKWORTH, N.H. AND MORANDI, A. 1967. The gaze selects informative details within pictures. Perception and Psychophysics, 2, 547–552.

MUNN, S.M. AND PELZ, J.B. 2009. FixTag: An algorithm for identifying and tagging fixations to simplify the analysis of data collected by portable eye trackers. Transactions on Applied Perception, Special Issue on APGV, In press.

ROTHKOPF, C.A. AND PELZ, J.B. 2004. Head movement estimation for wearable eye tracker. In Proceedings of the 2004 Symposium on Eye Tracking Research & Applications (San Antonio, Texas, March 22 - 24, 2004). ETRA '04. ACM, New York, NY, 123-130.

SALVUCCI, D.D. AND GOLDBERG, J.H. 2000. Identifying fixations and saccades in eye-tracking protocols. In Proceedings of the 2000 Symposium on Eye Tracking Research & Applications (Palm Beach Gardens, Florida, United States, November 06 - 08, 2000). ETRA '00. ACM, New York, NY, 71-78.

SWAIN, M.J. AND BALLARD, D.H. 1990. Indexing via color histograms. In Third International Conference on Computer Vision.

THRUN, S. AND LEONARD, J. 2008. Simultaneous localization and mapping. In SICILIANO, B. AND KHATIB, O., Springer Handbook of Robotics, Springer, Berlin.

YARBUS, A.L. 1967. Eye Movements and Vision. Plenum Press, New York.
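
Illustrative sketch (not part of the original paper). The histogram-intersection match of Eq. 1, the exhaustive ranking of library entries by peak exemplar score, and the proposed accept/conflict rule from Section 8 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the SemantiCode implementation: the function names, the 8-bins-per-channel RGB quantization, the dictionary layout of the library, and the reading of the conflict parameter as an absolute difference between the two best scores are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Histogram = List[int]


def rgb_histogram(pixels: Sequence[Tuple[int, int, int]],
                  bins_per_channel: int = 8) -> Histogram:
    """Count 8-bit RGB pixels into a flattened joint color histogram.

    The bin count is an assumption; the paper does not state its binning.
    bins_per_channel must divide 256 evenly.
    """
    step = 256 // bins_per_channel
    hist = [0] * bins_per_channel ** 3
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[idx] += 1
    return hist


def intersection(image_hist: Histogram, model_hist: Histogram) -> float:
    """Eq. 1: sum_j min(I_j, M_j) divided by sum_j M_j."""
    total = sum(model_hist)
    if total == 0:
        return 0.0  # an empty model cannot match anything
    return sum(min(i, m) for i, m in zip(image_hist, model_hist)) / total


def rank_library(image_hist: Histogram,
                 library: Dict[str, List[Histogram]]) -> List[Tuple[str, float]]:
    """Score every exemplar of every identifier; keep the peak per identifier.

    Returns (identifier, peak score) pairs sorted best-first.
    """
    ranked = [
        (name, max(intersection(image_hist, ex) for ex in exemplars))
        for name, exemplars in library.items() if exemplars
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


def auto_code(ranked: List[Tuple[str, float]],
              accept: float = 0.80,
              conflict: float = 0.10) -> Optional[str]:
    """Return the top identifier only when the match is unambiguous.

    Assumed rule: the top score must reach `accept`, and the runner-up
    must trail it by at least `conflict`; otherwise return None so the
    experimenter can verify the suggestion manually.
    """
    if not ranked:
        return None
    top_name, top_score = ranked[0]
    if top_score < accept:
        return None
    if len(ranked) > 1 and top_score - ranked[1][1] < conflict:
        return None
    return top_name
```

In use, a fixated 64x64 or 128x128 region would be passed through `rgb_histogram`, ranked against the active library with `rank_library`, and either coded automatically by `auto_code` or shown to the experimenter as a ranked list of suggestions. Keeping the peak score over all exemplars of an identifier mirrors the library design described in Section 8, where each identifier may hold many views of the same object or material.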