
An Experimental Usability Test for Different Destination Recommender Systems

Andreas H. Zins a, Ulrike Bauernfeind a, Fabio Del Missier b, Adriano Venturini b and Hildegard Rumetshofer c

a Institute for Tourism and Leisure Studies, Vienna University of Economics and Business Administration, Austria
{zins, bauernfeind}

b ITC-irst, Electronic Commerce and Tourism Research Laboratory, Italy
{delmissier, venturi}

c Institute for Applied Knowledge Processing, University of Linz, Austria
hildegard.rumetshofer@faw.uni-

Abstract

The present paper outlines the experimental evaluation of travel recommendation systems. First, theoretical concepts concentrating on the influencing factors for human-computer interaction, system usage and satisfaction are reviewed. An introduction to various methods dealing with usability evaluation is given and an overview of different "standard" survey instruments is provided. Second, a case study, the assessment of a travel recommender system currently under development, is presented. The evaluation considers aspects such as design and layout, functionality and ease of use. The measures obtained by a user questionnaire are combined with user interaction logging data. Different variants of the travel recommendation system and a baseline system were used for the assessment. This promising approach complements subjective ratings with objective tracking data to obtain a more thorough picture of the system's weaknesses. Finally, findings are presented and an explanatory model for user/system satisfaction is proposed.
Keywords: travel recommendation system; usability testing; human-computer interaction (HCI); user interface questionnaire; subjective vs. objective rating.

1 Introduction

As the amount of information in the field of tourism becomes abundant, finding the desired information is increasingly difficult. Recommender systems have therefore become a significant tool in the travel industry; they offer users a convenient opportunity to find a travel bundle or a single travel item such as accommodation. The development of a recommendation system is a time- and cost-intensive undertaking, and many usability questions arise. Users will reject systems that do not meet their needs or that provide insufficient functionality. Before implementation, an assessment is a prerequisite for discovering strengths and weaknesses and for being able to provide the best system version possible. Thus, the primary goal of this contribution is to illustrate how the prototype of a recommendation system can be evaluated. Which concepts can be found in the literature explaining human-computer interaction, and which methods exist and are used to evaluate usability?

2 Theoretical considerations

An overview of theories which concentrate on human-computer interaction (HCI) and computer-mediated environments (CMEs) will be given. The aim is to illustrate the factors which most influence the usage of a system.

The Technology Acceptance Model (TAM; Davis 1989) relies on two factors explaining human behaviour: perceived usefulness and perceived ease of use. Perceived usefulness describes the user's perception of enhancing his or her performance by using the system. Perceived ease of use is the degree of effort the user believes he or she will need for using a particular system. There are numerous contributions extending the TAM by additional factors such as playfulness and attitude (Moon and Kim 2001), trust and risk (Pavlou 2003) or accessibility and attitude (Jeong and Lambert 2001).
Another approach differing from the TAM is the concept of flow (Novak, Hoffman and Yung 2000), which includes the factors skill and control, challenge and arousal, focused attention, telepresence, and time distortion. These factors contribute to flow, a state of mind in which the user is completely devoted to the use of a system and forgets everything else, including the passing of time. Thus, the aim is to create a compelling online experience that facilitates flow.

Another theoretical area relevant for this contribution is usability testing and system evaluation. According to ISO 9241-11 (1998), usability is "the extent to which a product can be used by specified users to achieve specified goals with effectiveness,
efficiency and satisfaction". Lindgaard (1994) described usability as the ease of learning and using computer systems from the experienced and inexperienced user's point of view. Classifications of usability evaluation methods differ from author to author. According to Riihiaho (2000), usability evaluation can be divided into two broad categories: user testing and usability inspection. User testing involves usability testing, pluralistic, informal and visual walkthroughs, and contextual inquiry. Usability inspection comprises heuristic evaluation, cognitive walkthrough and GOMS (goals, operators, methods, and selection rules). Harms and Schweibenz (2000) distinguish two methods: heuristic evaluation and usability testing. Different contributions nonetheless share a common definition of usability testing: persons performing a given task and evaluating the system. Empirical testing with potential users is the best way to find problems related to users' tasks and experiences (Riihiaho 2000). A common approach is to ask a set of participants to accomplish a realistic task while performance measures are collected (Galitz 2002). Usability can be gauged by objective and subjective variables. Objective measures include, for instance, the task completion time, the number of queries, or the error rate. Subjective measures, i.e. the user's feedback, are often collected by questionnaires. For this purpose, some standard questionnaires were created.
Several of these survey instruments were suggested by IBM (Lewis 1995): the Post-Study System Usability Questionnaire (PSSUQ), the Computer System Usability Questionnaire (CSUQ) and the After-Scenario Questionnaire (ASQ). There are other examples as well: the Questionnaire for User Interface Satisfaction (QUIS) developed by Chin, Diehl and Norman (1988), the System Usability Scale (SUS) or the Website Analysis and Measurement Inventory (WAMMI).

3 Research objectives and applied methodology

The approach of the experimental evaluation described here consists of building several variants of the recommendation prototype (named DieToRecs) and of testing some hypotheses about the performance of each variant on a set of dependent measures, involving a reference or baseline system (in this case the TISCover system). The variants to be tested are:

  o DTR-A: interactive query management only (i.e. an empty case base and no recommendation support via smart sorting or through other means);
  o DTR-B: single item recommendation with interactive query management and ranking based on a representative case base;
  o DTR-C: this variant allows a user to navigate among complete travel recommendations in a simple and effective way (starting from the link "Seeking for inspiration"). Six travel examples are shown on each page.
Then the user is requested to provide feedback on the presented examples in a simple form ("I like this" vs. "I do not like this"). Finally, the system updates the proposed examples by means of the feedback provided by the user, and the similarity-based retrieval in the case base is performed again.

The main hypotheses concern the users' search and choice behaviour and their satisfaction:

H1: The recommendation-enhanced system is able to deliver useful recommendations.
This hypothesis can be tested by analyzing the differences between the DieToRecs variants DTR-A and DTR-B on the relative position, within the result list, of the items which the user selected and added to the travel plan. Only if the recommendation is good will the user immediately find a suitable item. The positions for DTR-B should be nearer to the top of the visualized result list.

H2: The recommendation-enhanced system is able to facilitate the construction of good travel plans.
This hypothesis can be tested by analyzing the differences between the three systems (the DieToRecs variants and TISCover) on the users' ratings of the selected items. We should find a significant difference between the two DieToRecs variants (DTR-A will get a lower mean satisfaction rating than DTR-B). Nonetheless, DTR-A will not receive very low ratings, due to the availability of the interactive query management functions, which will help the plan construction. For different reasons, both TISCover and the recommendation-enhanced variant should support the construction of satisfying plans: TISCover can exploit its grounding in a rich item database and its degree of development and testing, while DTR-B should benefit from its effective recommendation functions.

H3: The recommendation-enhanced system allows a more efficient search.
The recommendation-enhanced system should enable the user to perform fewer queries, to examine fewer pages and to reduce the search and decision time.
The variant with the empty case base will be less efficient, due to the lack of smart sorting in the presentation of options. Therefore, the user will have to browse a greater number of result pages and will occasionally have to reformulate the query. The TISCover system should obtain an intermediate result (because of its lack of intelligent support, but its grounding in a rich item database).

H4: The recommendation-enhanced system heightens the user satisfaction.
Given that user satisfaction is related both to the perceived efficiency and to the perceived effectiveness of the system, we expect to find significant differences between DTR-B and DTR-A on the questionnaire measures associated with efficiency, effectiveness, and overall satisfaction. TISCover should get good ratings due to its degree of development and testing, which will prevent the user from being confronted with salient system failures or errors (which strongly affect user satisfaction). Furthermore, there will be some differences in the products accessed by TISCover and DieToRecs (some
features could be missing in the information accessed by the DieToRecs variants), and this should be appropriately taken into account in the interpretation and evaluation of the results.

Participants from a student population, randomly assigned to the experimental groups, were asked to use both one DieToRecs variant and a so-called baseline system (see Table 1). In our case, the baseline system is the TISCover.com on-line travel agency web site. The DieToRecs recommender system and its variants are fed by a substantial subset of the travel items represented in the TISCover system. An additional small number of participants was assigned to a full-functionality design (corresponding to the variant recommending complete travel arrangements, DTR-C), to obtain some exploratory indications on the user's interaction with a system resembling the final development of the DieToRecs project. The users were asked to perform some tasks in the general context of "planning a travel in Tyrol". A series of objective and subjective measures was recorded, both automatically during the interaction (by means of the logging component; DTR variants only) and by asking the user to fill in a questionnaire after each test session. To gain external validity it is necessary to design tasks that are representative of the typical usage of the system in the real world, so putting too many constraints on the participants should be avoided (users will typically be unconstrained). On the other hand, it was attempted to obtain a representative set of search and interaction behaviours while reducing the variability due to initial exploratory and erratic navigation behaviour. The choice in favour of one training task and one separate test task is motivated by the objective of balancing the representativeness concern against the need to limit the duration of the experimental session (in order to avoid fatigue effects and unwanted variations in attention and motivation).
The participants were requested to choose a different geographical area for the execution of the two test tasks, thus trying to avoid content-specific learning.

Table 1. Experimental Design

                 Group 1   Group 2   Group 3   Group 4   Group 5   Group 6
First System     TISCover  DTR-A     TISCover  DTR-B     TISCover  DTR-C
Second System    DTR-A     TISCover  DTR-B     TISCover  DTR-C     TISCover
N = 47           10        11        10        10        2         4

Besides some socio-demographic and internet usage characteristics, the questionnaire focused on the process and outcome evaluation of the trip planning task. After screening a list of potential standardized measurement instruments devised to capture aspects of usability criteria, the Post-Study System Usability Questionnaire (PSSUQ with 19 statements, Lewis 1995) was chosen, slightly adapted to a non-technical wording and extended by typical aspects relevant for recommendation systems (resulting in 23 statements in total).
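The counterbalanced design in Table 1 can be sketched in a few lines of Python. The group sizes and system sequences are taken from Table 1; the participant identifiers and the random seed are illustrative assumptions, not details from the study:

```python
import random

# Sequence groups from Table 1: (first system, second system).
GROUPS = {
    1: ("TISCover", "DTR-A"), 2: ("DTR-A", "TISCover"),
    3: ("TISCover", "DTR-B"), 4: ("DTR-B", "TISCover"),
    5: ("TISCover", "DTR-C"), 6: ("DTR-C", "TISCover"),
}
SIZES = {1: 10, 2: 11, 3: 10, 4: 10, 5: 2, 6: 4}  # N = 47 in total

def assign(participants, seed=0):
    """Shuffle participants and fill the six sequence groups,
    returning {participant: (first_system, second_system)}."""
    rng = random.Random(seed)  # seed is hypothetical, for reproducibility
    pool = list(participants)
    rng.shuffle(pool)
    plan, i = {}, 0
    for group, n in SIZES.items():
        for p in pool[i:i + n]:
            plan[p] = GROUPS[group]
        i += n
    return plan

plan = assign(range(47))
```

Half of the sample starts with the baseline system, so order (learning) effects can later be separated from the variant comparison.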
Table 2. Usability and User Satisfaction Questionnaire (adapted from PSSUQ)

Design / Layout
  I liked using the interface of the system. (o, x)
  The organization of information presented by the system was clear. (c, x)
  The interface of this system was pleasant to use. (c, x)
Functionality
  This system has all the functions and capabilities that I expect it to have. (o, x)
  The information retrieved by the system was effective in helping me to complete the tasks. (c, x)
  The products listed by the system as a reply to my request were suitable for my travel. (n, x)
  I found the "recommend (the whole) travel" function useful. (n)
Ease of Use
  It was simple to use this system. (o, x)
  It was easy to find the information I needed. (o, x)
  The information (such as online-help, on-screen messages, and other documentation) provided with this system was clear. (o, x)
  Overall, this system was easy to use. (c, x)
Learnability
  It was easy to learn to use the system. (o, x)
  There was too much information to read before I can use the system. (n)
  The information provided by the system was easy to understand. (c, x)
Satisfaction
  I felt comfortable using this system. (o, x)
  I enjoyed constructing my travel plans through this system. (n, x)
  Overall, I am satisfied with this system. (o, x)
Outcome / Future Use
  I was able to complete the task quickly using this system. (c, x)
  I could not complete the task in the preset time frame. (n, x)
  I believe I could become productive quickly using this system. (o, x)
  The system was able to convince me that the recommendations are of value. (n, x)
  From my current experience with using the system, I think I would use it regularly. (n, x)
Errors / System Reliability
  Whenever I made a mistake using the system, I could recover easily and quickly. (o, x)
  The system gave error messages that clearly told me how to fix problems. (o, x)

Note: "o": unchanged items, "c": changed wording, "n": new items added; "x": highly loading variables (in the original table each "x" is assigned to one of the factor columns Effectiveness, Satisfaction, Ease-of-use and Reliability; one variable without "x" was an outlier and did not load on any of the factors).
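For the later analysis, the item ratings in Table 2 are aggregated into one score per evaluation dimension. A minimal sketch of that aggregation, using invented 7-point ratings for a single respondent (the factor names follow the analysis; the item values are not from the study data):

```python
from statistics import mean

# Hypothetical per-respondent item ratings, grouped by factor.
responses = {
    "ease_of_use_learnability": [2, 3, 2, 3, 2],
    "effectiveness_outcome": [4, 5, 4, 4],
    "reliability": [3, 4],
}

def scale_scores(grouped_items):
    """Average the item ratings within each factor to obtain one
    score per evaluation dimension."""
    return {scale: mean(items) for scale, items in grouped_items.items()}

scores = scale_scores(responses)
```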
Though the psychometric properties have been documented by Lewis (1995), the structure (system usefulness, information quality, and interface quality) and content (e.g. satisfaction aspects mixed with functional qualities) of this instrument had to be treated with caution. Table 2 shows the final questionnaire used. Furthermore, the respective statements are classified according to the factor on which they loaded highly.

4 Results

The following analysis investigates hypotheses 1 to 4 step by step. It is based on a sample of 47 test persons with a share of 63% females. One quarter belongs to an age group older than 25; the majority is under 25 years. Usage of web and e-commerce services was measured by some questions of the 10th GVU User Survey, adapted to our context, and some new questions. General internet usage was rather high, with a share of 62% having used the Web for between 4 and 6 years. No participant had less than one year of experience. The student population was well captured by a 72% share of test persons using the internet daily; 20% indicated using the internet several times a week. Almost everybody (96%) used the internet for information retrieval. About 75% bought some product or service over the internet at least once a year. With regard to the travel domain the usage rates are comparable: 98% used this source for some information; almost 80% purchased some travel-specific product on the internet at least once a year. Only one third of the test persons reported being unfamiliar with Tyrol. Only 4% had never been to Tyrol; a share of 20% had visited Tyrol at least once.

In a first step the usefulness of the different recommendation functions implemented in the three DieToRecs variants had to be tested. The logging data (available only for the DieToRecs system) delivered the average position of each item in the presented result lists of queries. The items selected and put into the travel plan are taken here to compare the relative position (cf.
Table 3; DTR-C does not provide single item result lists, as in the initial step it recommends complete travel plans only). The differences between DTR-A and DTR-B are substantial and appear for all item categories. This can be interpreted as a sign of consistency, though the sample size does not suffice to deliver statistically significant results (→ H1 accepted without statistical proof).

Table 3. Average Position and Standard Deviation for Items in the Result List by DieToRecs Variants

                      DTR-A               DTR-B               t-test
                      Average   Std.Dev.  Average   Std.Dev.
Items in general      4.3       4.6       2.9       2.8       not sign.
Accommodation items   5.0       0.4       2.2       1.2       not sign.
Destination items     3.9       0.1       2.5       1.3       not sign.
Interest items        4.0       4.8       3.5       3.0       not sign.
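The t-tests in Table 3 can be reproduced approximately from the reported summary statistics alone. A minimal sketch using Welch's two-sample t statistic; the per-group sample sizes (n = 10 per cell) are an assumption for illustration, since the per-category counts are not reported:

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's two-sample t statistic and approximate degrees of
    freedom, computed from summary statistics (means, SDs, ns)."""
    se1, se2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# "Items in general" row of Table 3, with hypothetical n = 10 per group.
t, df = welch_t(4.3, 4.6, 10, 2.9, 2.8, 10)  # t ≈ 0.82, well below significance
```

With such small samples and large standard deviations the statistic stays far from conventional critical values, which is consistent with the "not sign." entries in the table.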
Next, an explanatory model for user satisfaction with a typical structure as outlined in Figure 1 was the starting point for the investigation of evaluative dimensions. The original three-dimensional configuration (PSSUQ, Lewis 1995) could not be identified with the empirical data of this study. Instead, the following three dimensions turned out to represent a very consistent way of how the respondents evaluated the baseline and experimental recommendation systems: ease-of-use combined with design aspects and learnability; outcome combined with functionality and effectiveness; and reliability, strongly related with error handling (Figure 1; Cronbach's alpha coefficients below, for loading indicators cf. Table 2).

Fig. 1. Explanatory Model for User/System Satisfaction
[Path diagram: ease-of-use/learnability (alpha = 0.94; path coefficients DTR: 0.30, TIS: 0.37), effectiveness/outcome (alpha = 0.83; DTR: 0.73, TIS: 0.61) and reliability (alpha = 0.78; DTR: n.s., TIS: n.s.) predicting user/system satisfaction (alpha = 0.95).]

Testing the criterion validity by applying linear regression analyses (separately for the two systems evaluated by each respondent) on the dependent satisfaction dimension, very similar structural effects were detected (cf. Figure 1). Both models explained a high proportion of the satisfaction variance (DTR R²: 0.94; TIS R²: 0.87). The standardized regression coefficients do not differ substantially. Finally, the reliability dimension does not contribute directly to the process and outcome evaluation in terms of user satisfaction ratings. From the point of view of content validity this configuration seems to converge towards the widely acknowledged Technology Acceptance Model (Davis 1989; Lederer et al. 2000), which proposes two factors for explaining system usage: perceived usefulness and perceived ease of use. Based on these validity checks, the detailed analysis of the interaction patterns, coupled with the experimental results, can follow to test the next hypotheses.
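The internal-consistency coefficients reported in Figure 1 follow the standard Cronbach's alpha formula. A minimal sketch on invented item scores (not the study's data), each column being one item rated by four respondents:

```python
from statistics import pvariance

def cronbach_alpha(item_columns):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances /
    variance of the respondents' total scores)."""
    k = len(item_columns)
    sum_item_var = sum(pvariance(col) for col in item_columns)
    totals = [sum(scores) for scores in zip(*item_columns)]
    return k / (k - 1) * (1 - sum_item_var / pvariance(totals))

# Two perfectly consistent items across four respondents.
alpha = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]])  # alpha = 1.0
```

Values such as the 0.94 reported for ease-of-use/learnability indicate that the items within a dimension vary together almost as one scale.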
An objective picture of the system effectiveness and efficiency, and of the user-recommender interaction quality, should be derived. Overall, the average evaluation scales show evidence of a solid superiority of the baseline system TISCover in each of the dimensions. This result was already expected and explained within hypotheses 2 and 4 (see above) and is obviously due to its mature developmental stage and the huge and detailed data available. Another indicator of this performance difference can be derived from the subjective declaration of whether the planning task could be accomplished successfully or not: DieToRecs achieved a 30% ratio; TISCover 64%.

In terms of differences in the item ratings between the DieToRecs variants, Table 4 exhibits a clear and confirming picture: the more intelligent the recommendation functions in operation, the better the satisfaction ratings. Overall, relatively more respondents managed to finish their plans successfully within the given time frame. For the destination recommendations the DTR-C variant holds a significantly better position compared to that of the modest DTR-A variant. The differences in the accommodation ratings are even more distinct: the DTR-C variant works better than both DTR-A and DTR-B. For the activities the result is even more precise, as all pairwise differences show the expected direction and significance level (→ H2 confirmed).

Table 4. Satisfaction Ratings for Travel Plan Elements by DieToRecs Variants

Travel Plan Element   Average   DTR-A   DTR-B   DTR-C   p-value
Finished plans        30%       10%     30%     100%    0.001
Ratings
  Destination         4.0       2.8     4.5     5.3     0.10   significant A-C 0.10
  Accommodation       4.1       4.1     3.6     5.9     0.15   significant B-C 0.01, A-C 0.05
  Activities          4.2       3.2     4.9     7.0     0.05   significant A-B 0.1, B-C 0.01, A-C 0.001

Note: "1": very dissatisfied, "7": very satisfied

As outlined in hypothesis 3, objective measures of system evaluation are necessary for further testing. They were derived from the user logging component and are exhibited in Table 5.
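Deriving such efficiency measures from interaction logs can be sketched as follows. The record format here is a hypothetical assumption (the actual schema of the DieToRecs logging component is not described in the paper); the sketch only illustrates how query counts, page visits and session time fall out of timestamped events:

```python
from datetime import datetime

# Hypothetical timestamped log records: (time, event kind, detail).
LOG = [
    ("2003-08-01 10:00:00", "query", "accommodation"),
    ("2003-08-01 10:01:30", "page_view", "results"),
    ("2003-08-01 10:20:00", "query", "destination"),
]

def efficiency_measures(log):
    """Derive Table-5 style measures from one user session:
    number of queries, pages visited, and session time in minutes."""
    times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t, _, _ in log]
    queries = sum(1 for _, kind, _ in log if kind == "query")
    pages = sum(1 for _, kind, _ in log if kind == "page_view")
    minutes = (max(times) - min(times)).total_seconds() / 60
    return {"queries": queries, "pages": pages, "session_minutes": minutes}

measures = efficiency_measures(LOG)
```

Averaging such per-session dictionaries over the participants of each variant yields the per-variant figures of Table 5.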
In general, there is a lack of power due to the small sample size. Considering the different success rates in terms of finished travel plans (see Table 4), the seemingly irrelevant differences in the number of queries and page visits turn into somewhat more encouraging findings (→ H3 confirmed). Session time has to be taken with caution because the experimental procedure strictly limited the time granted for the travel plan assembly. Nevertheless, the improved recommendation functions help to reduce the necessary planning time. From the number of query refinement options applied we can learn that in most cases the result lists were too short (and maybe more often empty) rather than too long. No apparent differences can be detected.

Table 5. Objective Efficiency Measures by DieToRecs Variants

                                      DTR-A   DTR-B   DTR-C   p-value
Total number of queries               12.9    13.3    9.5     n.s.
  Accommodation queries               5.5     6.5     4.0     n.s.
  Destination queries                 4.3     2.1     2.3     n.s.
  Interest queries                    3.1     4.6     1.8     n.s.
Number of pages visited               20.2    18.8    8.8     n.s.
Number of query relaxations applied   5.8     4.6     4.0     n.s.
Number of query tightenings applied   0.6     0.2     0       n.s.
Session time in minutes               25      20      23      > 0.1

Note: n.s. = not significant

In order to test the final hypothesis 4, the variation of the evaluation scores (see Table 6) was decomposed with respect to the within-subject (i.e. sequence order) and the between-subject (i.e. variant comparison) effects. In general, a significant order or sequence effect could be detected, which affected each dimension except reliability. As initially assumed, a learning effect appeared which favoured the ratings for the second trip planning task. On average, this learning effect was much more pronounced when the baseline system was used and evaluated in second place. The effect size was rather similar for the system satisfaction scale, whereas for the ease-of-use scale it was more than double and for the outcome scale even more than eight times as large. Table 6.
Average Ratings and Differences on the Evaluation Dimensions by System Variants

                        TISCover Ø   DieToRecs Ø   DTR-A -     DTR-B -     DTR-C -
                                                   TISCover    TISCover    TISCover
User Satisfaction       3.2          4.6           2.33        1.05 *)     -0.50 *)
Ease-of-use             2.8          3.6           1.34        0.45 *)     0.31 *)
Effectiveness/Outcome   3.4          4.6           1.71        1.01        -0.50 *)
Reliability             3.5          3.7           0.60 *)     0.05 *)     -0.22 *)

Note: "1": strongly agree, "7": strongly disagree; *) not significant

Considering the sequence effect simultaneously with the between-subject effect of comparing different system variants (only DTR-A and DTR-B, due to the small sample size), a considerable difference remains for each scale (ease-of-use: 0.43
[p = 0.39]; outcome: 0.32 [p = 0.47]; reliability: 0.60 [p = 0.26]; satisfaction: 0.70 [p = 0.2]). Comparing the ratings (without order effect) only for respondents testing the DTR-C variant, the differences even turn in the other, expected, direction: each scale except ease-of-use exhibits better average scores for the DieToRecs system (cf. Table 6). Hence, in principle hypothesis 4 cannot be corroborated entirely, though taking the small sample size into account the results show the expected direction.

5 Conclusions

An experimental evaluation of a travel recommendation system applying objective and subjective measures was accomplished. The travel recommendation system prototype DieToRecs and a reference or baseline system, TISCover, were tested and evaluated by users with the basic goal of discovering weaknesses and being able to remove them in the further development process. Although the assessment results for the baseline system were significantly better than for DieToRecs, the higher satisfaction ratings for the DieToRecs variants with more recommendation functions confirm the appropriate direction. A certain familiarisation effect for the TISCover system cannot be completely denied: the user sample employed for this assessment was very likely to know the system and might have used it before. For the purpose of testing DieToRecs, a subset of TISCover's travel items was fed into the database. Of course, TISCover as a fully functioning travel recommendation system disposes of a greater variety of travel items than the DieToRecs subset. Nevertheless, it can be assumed that these differences of scope are minor compared to a system comparison based on completely different databases. Hence, the outcome evaluation relies much more on process differences than on those of content. As far as the survey instrument (adapted from PSSUQ) and the explanatory model for user/system satisfaction are concerned, the three dimensions (i.e.
system usefulness, information quality and interface quality) explaining user/system satisfaction proposed by Lewis (1995) were not confirmed. Instead, a different three-factor solution for explaining overall system satisfaction could be ascertained. These factors were labelled ease-of-use/learnability, effectiveness/outcome and reliability. Finally, the approach used in this study to generate empirical data is a promising one, since the combination of objective and subjective measures enables the assessment from a twofold point of view: the satisfaction ratings delivered by the user and the interaction data showing the users' search and selection behaviour.

Acknowledgement

This work has been partially funded by the European Union's Fifth RTD Framework Programme (under contract DIETORECS IST-2000-29474). The authors would like to thank all other colleagues of the DieToRecs team for their valuable contribution to this study.
References

Chin, J.P., Diehl, V.A. & Norman, K. (1988). Development of an instrument measuring user satisfaction of the human-computer interface. Proceedings of CHI'88 Conference on Human Factors in Computing Systems, Washington, DC.

Davis, F.D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly 13(3): 319-340.

Galitz, W.O. (2002). The Essential Guide to User Interface Design. New York: Wiley Computer Publishing.

ISO (1998). ISO 9241-11. Usability Definitions - Guidance on Usability. Geneva, Switzerland: International Standards Organisation. [August 29, 2003].

Harms, I. & Schweibenz, W. (2000). Testing Web usability. Information Management & Consulting 15(3): 61-66.

Jeong, M. & Lambert, C.U. (2001). Adaptation of an information quality framework to measure customers' behavioral intentions to use lodging Web sites. Hospitality Management 20: 129-146.

Lederer, A.L., Maupin, D.J., Sena, M.P. & Zhuang, Y. (2000). The technology acceptance model and the World Wide Web. Decision Support Systems 29(3): 269-282.

Lewis, J.R. (1995). IBM computer usability satisfaction questionnaires: psychometric evaluation and instructions for use. International Journal of Human-Computer Interaction 7(1): 57-78.

Lindgaard, G. (1994). Usability Testing and System Evaluation. A Guide for Designing Useful Computer Systems. London: Chapman & Hall.

Moon, J.-W. & Kim, Y.-G. (2001). Extending the TAM for a World-Wide-Web context. Information & Management 38: 217-230.

Novak, T.P., Hoffman, D.L. & Yung, Y.-F. (2000). Measuring the customer experience in online environments: a structural modeling approach. Marketing Science 19(1): 22-42.

Pavlou, P.A. (2003). Consumer acceptance of electronic commerce: integrating trust and risk with the Technology Acceptance Model. International Journal of Electronic Commerce 7(3): 69-103.

Riihiaho, S. (2000). Experiences with Usability Evaluation Methods. Thesis, Helsinki University of Technology, Laboratory of Information Processing Science.