
Engelman.2011.exploring interaction modes for image retrieval



Exploring Interaction Modes for Image Retrieval

Corey Engelman (1), Rui Li (1), Jeff Pelz (2), Pengcheng Shi (1), Anne Haake (1)

(1) B. Thomas Golisano College of Computing and Information Sciences, Rochester Institute of Technology, 1 Lomb Memorial Drive, Rochester, NY 14623-5603. {cde7825, rxl5604, spcast, arhics}
(2) College of Imaging Arts and Sciences, Rochester Institute of Technology, 1 Lomb Memorial Drive, Rochester, NY 14623-5603. {jbppph}

ABSTRACT
The number of digital images in use is growing at an increasing rate across a wide array of application domains. As a result, there is an ever-growing need for innovative ways to help end users gain access to these images quickly and effectively. Moreover, it is becoming increasingly difficult to manually annotate these images, for example with text labels, to generate useful metadata. One method for helping users gain access to digital images is content-based image retrieval (CBIR). Practical use of CBIR systems has been limited by several "gaps", including the well-known semantic gap and usability gaps [1]. Innovative designs are needed to bring end users into the loop to bridge these gaps. Our human-centered approaches integrate human perception and multimodal interaction to facilitate more usable and effective image retrieval. Here we show that multi-touch interaction is more usable than gaze-based interaction for explicit image region selection.

Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User Interfaces - graphical user interfaces, input devices and strategies, prototyping, user-centered design, voice I/O, interaction styles.

General Terms
Measurement, Performance, Design, Experimentation, Human Factors

Keywords
Multimodal, eye tracking, image retrieval, human-centered computing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
NGCA '11, May 26-27, 2011, Karlskrona, Sweden.
Copyright 2011 ACM 978-1-4503-0680-5/11/05...$10.00.

1. INTRODUCTION
Research in CBIR has shown that image content is more expressive of users' perception than is textual annotation. A semantic gap occurs, however, when low-level image features, such as color or texture, are insufficient to completely represent an image in a way that reflects human perception. One possible way to bridge the semantic gap is to take a "human-centered" approach in system design. This is particularly important in knowledge-rich domains, such as biomedical applications, where information about the images can be extracted from experts and utilized. Major questions remain as to how best to bring users "into the loop" [2,3].

Multimodal user interfaces are promising as the interactive component of CBIR systems because different modes are best suited to expressing different kinds of information. Recent research efforts have focused on developing and studying usability for multimodal interaction [4,5,6]. Designing natural, usable interaction will require an understanding of which user interactions should be explicit and which implicit. Consider query by example (QBE), which requires users to select a representative image and often a region of that image. It is the usual paradigm in CBIR, but users have difficulty forming such queries. There is a need for innovative new methods to support QBE. Beyond QBE, more effective methods are needed for gaining input from the user for relevance feedback to refine the results of a search. For example, this could be done explicitly, by having the user directly specify which images were close to what they were looking for, or implicitly, by simply noting which images they looked at with interest (e.g. via gaze). Finally, better organization of the images returned from a query is as important as the underlying retrieval system itself, in that it allows the user to quickly scan the results and find what they are looking for.

Our approach to overcoming the interactivity challenges of CBIR is largely based on bringing the user into the process by combining traditional modes of input, such as the keyboard and mouse, with interaction styles that may be more natural, such as gaze input (eye tracking), voice recognition, and multi-touch interaction. A software framework was developed for such a system using existing graphical user interface (GUI) libraries, along with several subcomponents that allow for interaction via the new methods within a GUI. With this basic framework for multimodal interface design in place, it is now possible to quickly develop and test prototypes for different interface layouts, and even prototypes for different modes of interaction using one or more of the input modes (mouse, keyboard, gaze, voice, touch).

A series of studies will be performed to determine which of these prototypes are most efficient and usable across a range of image types and among varied end-user groups. The first of these, described here, involves modes of interaction for performing QBE through explicit region-of-interest selection. The main goal is to compare the efficiency of the different interaction methods, as well as user preference, ease-of-use, and ease-of-learning.

2. Methods
2.1 Design and Implementation
The best approach to developing a multimodal user interface such as the one described here is an evolutionary approach: breaking the overall goal of building a multimodal user interface into smaller obtainable goals, and designing, implementing, testing, and integrating these smaller portions. In this way, the developer can ensure that separate components are not dependent on one another, because one builds stand-alone subsystems and then integrates them.
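The stand-alone-subsystem idea described in Section 2.1 can be sketched as each input mode implementing one small interface, so modes are built and tested in isolation and the integrating framework never depends on a concrete mode. This is only an illustrative sketch under our own naming (InputMode, GazeMode, MultimodalShell); the authors' actual class design is not published.

```java
import java.util.ArrayList;
import java.util.List;

// One small contract per input mode; each mode is a stand-alone subsystem.
interface InputMode {
    String name();
    void start();  // e.g. begin streaming events for this mode
    void stop();   // e.g. stop listening for this mode's events
}

// A concrete mode; a real version would open the eye tracker's data stream.
class GazeMode implements InputMode {
    private boolean running = false;
    public String name() { return "gaze"; }
    public void start() { running = true; }
    public void stop()  { running = false; }
    public boolean isRunning() { return running; }
}

// The integrating shell depends only on the interface, so subsystems
// stay independent and can be integrated after being tested alone.
class MultimodalShell {
    private final List<InputMode> modes = new ArrayList<>();
    public void register(InputMode m) { modes.add(m); }
    public void startAll() { for (InputMode m : modes) m.start(); }
    public void stopAll()  { for (InputMode m : modes) m.stop(); }
}
```

A mouse, voice, or touch mode would be registered the same way, which is what makes the evolutionary build-test-integrate cycle cheap.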
2.1.1 Eye Tracking
The SensoMotoric Instruments (SMI) RED 250 Hz eye-tracking device was used to track the position of the user's gaze on the monitor. SMI's iViewX software was used to run the eye tracker during use, and SMI's Experiment Center was used to perform a calibration prior to use. Our custom software, written in Java, communicates with the device using the User Datagram Protocol (UDP) to send signals to the eye tracker to start and stop recording. Once the eye tracker receives the start signal, it begins streaming screen coordinates to the program. A separate program thread can then repeatedly get the new coordinates and update the variables corresponding to the user's gaze. Because the human eye is naturally jittery, it is necessary to implement an algorithm for smoothing/filtering the data coming from the eye tracker. Because the system is developed in an object-oriented programming (OOP) language, implementing such functionality is as simple as creating an abstract Filter class and then creating several instances of that abstract Filter. This allows multiple different filtering algorithms to be created easily. Even this functionality affords a vast array of possibilities for how the eye input data can be used for interaction. For example, eye tracking could be used to replace mouse/keyboard scrolling and panning [7].

2.1.2 Voice Recognition
Java defines the Java Speech Application Programming Interface (JSAPI), implemented by several open-source libraries. Any implementation of the JSAPI is a suitable choice, as they all perform the functionality specified by Java. For our system, we chose the Cloud Garden JSAPI. Beyond a suitable library that implements the JSAPI, a speech recognition engine is required on the computer running the multimodal system. For our system, we used Windows Speech Recognition, because it is included in the Windows operating system (Windows 7). A custom "grammar" can be written to specify which commands the system will accept. Then a simple controller can be implemented to receive commands, interpret them, and pass them on to the proper event handler. Voice recognition has the potential to greatly increase the efficiency of interaction between system and user. Furthermore, it is simple to include basic functions such as a speech lock, so that the user can easily turn voice recognition on and off.

2.1.3 Multi-Touch Interaction
For multi-touch, an open-source library called MT4J was used. This library allows the Windows 7 touch-screen commands to be used within a Java application. From here, it is possible to implement custom gesture processors or use a number of predefined processors. Touch interaction can be applied to QBE and a number of other interactions with the user. Beyond this, the library allows creation of custom multi-touch user interface components. Another benefit is that it is simple to create stand-alone multi-touch applications and then embed them in the system. This follows the previously mentioned evolutionary prototyping methodology, because it easily allows simple stand-alone prototypes to be developed and then integrated into the existing system. For our experiment, a Dell SX2210T touch-screen monitor was used.

2.1.4 Traditional GUI Components
Because the subcomponents of the multimodal user interface were developed in Java, the Swing GUI libraries can be used to create traditional visual components and handle input from the mouse and keyboard. This also makes developing the basic framework for the user interface (i.e. windowing and layout structure) very simple, because Java's Swing library includes classes for a UI window (JFrame) and the LayoutManager class for managing placement of components within the window. Furthermore, a system for rapid prototyping of UI layouts can be put in place to facilitate development. This involves creating an abstract class called PrototypeUI that inherits from Java's JFrame class. Any number of prototype UI layouts can then be created and tested without changing the code for the core functionality of the system or for the previously mentioned subcomponents that handle the different modes of input.

2.2 Experimental Design
To evaluate prototype interaction styles for QBE, we recruited 9 undergraduate and graduate students at Rochester Institute of Technology as study participants. Participants were given an explanation of the CBIR paradigm and of QBE, and then were given a brief tutorial on each prototype mode they would be using. For the study they were shown a set of ten images, four separate times, in randomized order. Each of the four times they were shown the ten images, their task was to perform QBE by explicit region-of-interest selection using one of the four prototype methods of interaction. Because we are not concerned in this study with regions of interest within objects, but rather with whether the user can effectively select an object, we instructed the user to select a specific object from each image (e.g. select the eight ball from an image of billiard balls on a pool table; see Figure 1C).

2.2.1 Image Selection
When choosing the images to use for the study, there were two main considerations. First, because we specified what to select, there was a requirement for obvious, discrete objects in the image to eliminate ambiguity. Second, we wanted to test our four prototypes across a variety of images, and so we defined categories of images. These categories (simple, intermediate, and complex) were based on the complexity of the object the user was to select. For the simple category, we photographed billiard balls in different configurations. This covers both criteria, because the shape is simply a circle, and it allows us to instruct the user to select the eight ball. For the intermediate category, we used dice. This allowed us to construct a number of intermediate-complexity shapes. We considered them intermediate because the edges were always straight, and in a 2D image the shapes formed by the dice are essentially polygons. Finally, for the complex images, we chose images of horses. This is obviously a more complex shape than the previous examples, and it still allows for easy instruction of what to select, because each of the images contained a brown pony and a larger whitish/greyish horse.

2.2.2 Prototype Interaction Methods The Anchor Method
The anchor method combines interaction styles of gaze, voice, and either the mouse or the touch screen. The user looks at the center of the object they want to select, then says the command "set anchor". This places a small selection circle on screen where the user was looking. Next to this selection circle is a slider object, which can slide left to decrease the radius of the selection circle or right to increase it. The slider can be adjusted using either mouse or touch, depending on the user's preference. Gaze Interaction
Unlike the anchor method, this method uses eye tracking almost exclusively. The user finds the object to select, then clicks a button using either mouse or touch screen to begin eye tracking. Once turned on, the program begins painting over the area to provide feedback as the user glances over the object.
When finished, the user presses the same button to stop the eye tracker. Alternatively, eye tracking can be started by saying the command "start eye tracking" and stopped by saying "stop eye tracking". While painting, saccades are not drawn; rather, fixations are visualized by placing translucent circles on the screen. The radius of the circle is determined by the fixation duration (i.e. a longer fixation duration means a larger radius).

2.2.2.3 Mouse Selection
For this method, the user finds the object of interest and then presses and holds the mouse button to begin drawing a selection window. The selection auto-completes by always drawing a straight line from the point of the initial click to where the mouse is currently located. When the user finishes their selection, they simply release the mouse button. Touch Selection
This method works similarly to mouse selection, except that rather than pointing and clicking with the mouse, the user traces the object with a finger to form the selection window. The window auto-completes in the same fashion as for mouse selection.

Figure 1. From left to right, images from the intermediate, complex, and simple categories. The first is a selection made using the touch screen. The second uses gaze interaction, and the third uses the anchor method.

2.2.3 Metrics
To evaluate the usability attributes of efficiency and usefulness for each style of interaction, we defined several metrics. Accuracy was measured by calculating the area of the object in the image (in pixels) prior to selection using the GNU Image Manipulation Program (GIMP), then calculating the area of the object in a given selection, to determine the percentage of the object the user missed. Precision was determined by calculating how much of the user's selection was outside of the object: the amount of excess selection (in pixels) was divided by the total selection (in pixels) to calculate a relative excess value of the user's selection. Efficiency of the different modes was determined by measuring the time (in seconds) to complete a selection. We also asked the users to rate each of the prototypes in three categories on a scale from one to five. The categories were ease-of-use, ease-of-learning, and how natural the method felt. Also, we counted the number of times the user had to use the undo function. These measurements reflect the usability of a prototype rather than its efficiency and accuracy.

3. Data Analysis
3.1 Data Collection
Camtasia Studio (TechSmith) was used to record the screen during the study. Data were extracted from the images captured from the video. These images showed the participants' selections for each of the ten images four separate times (one for each method). The data extracted included the area (in pixels) that they selected within the object and the area that was excess selection. Again, the values were measured using GIMP. Viewing of the data suggested that the best way to show the comparison of the four prototypes would be a measure of accuracy, displaying the percentage of the object that the participants missed; a measure of precision, showing excess selection as the percentage of the user's total selection that was not the object; and a measure of efficiency, showing the time to complete the image.

3.2 Efficiency of Interaction Methods
Descriptive statistical analysis of the data was performed to determine the efficiency of the different prototypes in terms of accuracy, precision, and time to complete. Box plots were constructed to show the comparison of the different prototypes.

Figures 2.a-2.c show box plots of the data collected from the nine participants on all four interaction methods for one of the images of horses: 2.a shows the percentage of the selection that was excess, 2.b the percentage of the object missed by the user, and 2.c the time taken to complete the selection.

In all three plots, the touch screen method has the most consistent results (the smallest boxes). The touch screen also has the lowest median value for percentage of the selection missed and for time taken to complete. For percentage of excess selection, the mouse has the lowest median, but the touch screen still had a more consistent set of values, with the bulk of the values lower than those from the mouse.
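As a concrete reading of the two area-based metrics defined in Section 2.2.3, the percentages can be computed from pixel areas as below. This is our own illustrative sketch (method and variable names are ours); in the study these values were derived from GIMP measurements, not from code like this.

```java
// Area-based usability metrics from Section 2.2.3, returned as fractions
// in [0, 1]. Inputs are pixel areas; names are illustrative only.
class SelectionMetrics {
    // Accuracy: fraction of the object the participant failed to select.
    static double missedFraction(long objectAreaPx, long selectedInsideObjectPx) {
        return (objectAreaPx - selectedInsideObjectPx) / (double) objectAreaPx;
    }

    // Precision: share of the participant's total selection that fell
    // outside the object (excess selection / total selection).
    static double excessFraction(long selectedInsideObjectPx, long selectedOutsideObjectPx) {
        long totalSelectionPx = selectedInsideObjectPx + selectedOutsideObjectPx;
        return selectedOutsideObjectPx / (double) totalSelectionPx;
    }
}
```

For example, a selection covering 90 of an object's 100 pixels plus 20 stray pixels misses 10% of the object, with roughly 18% of the total selection being excess.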
Table 1. Average values of excess selection, percentage of the object missed, and time taken for all four prototype methods.

             Anchor Method   Touch   Mouse   Gaze
Excess       48.4%           17.7%   17.1%   49.4%
Missed       9.0%            4.7%    9.8%    7.6%
Time (s)     17.6            13.9    16.3    20.8

3.3 User Preference

Table 2. Average values of user preference (scale of one to five) and average undo usage for all four prototypes.

                   Anchor   Touch   Mouse   Gaze
Ease-of-Use        2.9      4.5     4.7     3.3
Ease-of-Learning   3.5      4.8     4.4     3.8
Natural            2.6      4.7     4.0     2.4
Undo Usage         8        1       1       1

The table above clearly shows that the mouse and touch screen received higher ratings than the two methods using eye tracking. In general, the users were in agreement about the different prototypes, with the standard deviation on average being below one (SD ≈ .86). Undo usage was fairly low, with the average user pressing undo just once per 10 images when using touch, mouse, or gaze. However, the anchor method had a significantly higher undo usage, and the variance in undo usage for the anchor method is relatively high (SD ≈ 10.2). This variance is likely caused in part by the method's high learning curve: it requires the user to coordinate use of three input methods. Furthermore, the inaccuracy of the eye tracker, plus or minus two visual degrees, plays a more significant role here, because unlike the gaze method, where the user can see where they are painting and adjust their eyes, in this method, if the tracker is off, the user only sees this after the anchor is placed. Then the user must click undo.

4. Conclusions
4.1.1 Eye Tracking Interaction Methods
This study shows clearly that using eye tracking for explicit user interaction in a task that requires the user to be precise and accurate is not effective. This is not surprising, since people have difficulty with smooth pursuit, which might be required for drawing or tracing activities, when objects are stationary [8]. This, in combination with some inaccuracy of the eye tracker, does not allow enough accuracy with the interaction styles implemented for this study. It is more likely that implicit interaction, i.e. selection based on more natural gaze behavior as a user is browsing or examining an image, such as in [5,9], will be effective for QBE.

4.1.2 Touch Screen and Mouse Interaction Methods
For the user group studied here, touch screen and mouse show similar results for a task such as tracing/selecting. The general trend is that the touch screen is slightly more efficient than the mouse. When we consider the images from the category of complexly shaped images, however, it is apparent that this trend does not apply: there, the touch screen is clearly more efficient than the mouse. This is likely because the touch screen is more natural than the mouse, even for technically savvy, college-age participants, being closer to the human's natural interaction process. In contrast, the mouse somewhat mimics a natural interaction, but requires the user to coordinate between their hand and eye without the hand being in their field of vision. Furthermore, the average user prefers to use a mouse or touch screen for this type of task.

4.1.3 Individual Differences
Finally, our study metrics show that interaction with the mouse and touch screen is generally consistent across participants, whereas there is greater variability with eye tracking. This probably occurs because using one's eyes to select or trace something is not natural, and so while some people may learn the method very quickly, others will not.

4.1.4 Future Studies
Studies are ongoing to prototype and test additional interaction styles which may be useful for image retrieval. For example, a study of the efficiency of different modes in a search-related task, like scrolling, selection of an entire image from a set, or using gestures (see [10]), would be useful. This would be interesting to see, because it might be the case that in these types of tasks, mouse and touch screen are not the most efficient. We are also engaged in using gaze for implicit interaction, such as in [5,9], towards our long-term goals of creating adaptive, multimodal systems for image retrieval.

5. ACKNOWLEDGMENTS
This work is supported by NSF grant IIS-0941452. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

6. REFERENCES
[1] Deserno TM, Antani S, Long R. Ontology of gaps in content-based image retrieval. J Digit Imaging. 2009 Apr;22(2):202-15. Epub 2008 Feb 1.
[2] Lew M.S., Sebe N., Djeraba C., and Jain R. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications and Applications, 2(1):1-19, 2006.
[3] Müller H, Michoux N, Bandon D, Geissbuhler A. A review of content-based image retrieval systems in medical applications: clinical benefits and future directions. Int J Med Inform. 73(1):1-23, 2004.
[4] Qvarfordt P. and Zhai S. Conversing with the User Based on Eye-Gaze Patterns. Proc. CHI 2005, ACM, 221-230.
[5] Sadeghi M., Tien G., Hamarneh G., and Atkins M.S. Hands-free Interactive Image Segmentation Using Eyegaze. In SPIE Medical Imaging, 2009.
[6] Ren J., Zhao R., Feng D.D., and Siu W. Multimodal Interface Techniques in Content-Based Multimedia Retrieval. In Proceedings of ICMI, 2000, 634-641.
[7] Kumar M. and Winograd T. Gaze-enhanced Scrolling Techniques. UIST: Symposium on User Interface Software and Technology, Newport, RI, 2007.
[8] Krauzlis RJ. The control of voluntary eye movements: new perspectives. The Neuroscientist. 2005 Apr;11(2). PMID 15746381.
[9] Santella A., Agrawala M., DeCarlo D., Salesin D., Cohen M. Gaze-Based Interaction for Semi-Automatic Photo Cropping. CHI Proceedings: Collecting and Editing Photos, 2006.
[10] Heikkilä H., Räihä K-J. Speed and Accuracy of Gaze Gestures. Journal of Eye Movement Research, 2009.