Tutorial on Conducting User Experiments in Recommender Systems
The slide handouts of my RecSys 2012 tutorial
    Presentation Transcript

    • User Experiments in Recommender Systems
    • Introduction: Welcome everyone!
    • Introduction: Bart Knijnenburg. Current: UC Irvine, Informatics, PhD candidate. Previously: TU Eindhoven, Human Technology Interaction, researcher & teacher, master student; Carnegie Mellon University, Human-Computer Interaction, master student.
    • Introduction: Bart Knijnenburg. First user-centric evaluation framework (UMUAI, 2012). Founder of UCERSTI, a workshop on user-centric evaluation of recommender systems and their interfaces. Statistics expert reviewer (conference + journal). Research methods teacher. SEM advisor.
    • Introduction
    • Introduction: “What is a user experiment?” “A user experiment is a scientific method to investigate how and why system aspects influence the users’ experience and behavior.”
    • Introduction: My goal: teach how to scientifically evaluate recommender systems using a user-centric approach. How? User experiments! My approach: I will provide a broad theoretical framework; I will cover every step in conducting a user experiment; I will teach the “statistics of the 21st century”.
    • Introduction: Welcome everyone! Evaluation framework: A theoretical foundation for user-centric evaluation. Hypotheses: What do I want to find out? Participants: Population and sampling. Testing A vs. B: Experimental manipulations. Measurement: Measuring subjective valuations. Analysis: Statistical evaluation of the results.
    • Evaluation framework: A theoretical foundation for user-centric evaluation
    • Framework: Offline evaluations may not give the same outcome as online evaluations (Cosley et al., 2002; McNee et al., 2002). Solution: test with real users.
    • Framework: (diagram) System [algorithm] -> Interaction [rating]
    • Framework: Higher accuracy does not always mean higher satisfaction (McNee et al., 2006). Solution: consider other behaviors.
    • Framework: (diagram) System [algorithm] -> Interaction [rating, consumption, retention]
    • Framework: The algorithm accounts for only 5% of the relevance of a recommender system (Francisco Martin, RecSys 2009 keynote). Solution: test those other aspects.
    • Framework: (diagram) System [algorithm, interaction, presentation] -> Interaction [rating, consumption, retention]
    • Framework: “Testing a recommender against a random videoclip system, the number of clicked clips and total viewing time went down!”
    • Framework: (path model, Knijnenburg et al.: “Receiving Recommendations and Providing Feedback”, EC-Web 2010) Personalized recommendations (OSA) -> perceived recommendation quality (SSA) -> perceived system effectiveness and choice satisfaction (EXP), alongside behavioral measures: total number of clips clicked, number of clips watched from beginning to end, and total viewing time.
    • Framework: Behavior is hard to interpret (the relationship between behavior and satisfaction is not always trivial). User experience is a better predictor of long-term retention (with behavior only, you will need to run for a long time). Questionnaire data is more robust (fewer participants needed).
    • Framework: Measure subjective valuations with questionnaires (perception and experience). Triangulate these data with behavior (ground subjective valuations in observable actions; explain observable actions with subjective valuations). Measure every step in your theory (create a chain of mediating variables).
    • Framework: (diagram) System [algorithm, interaction, presentation] -> Perception [usability, quality, appeal] -> Experience [system, process, outcome] -> Interaction [rating, consumption, retention]
    • Framework: Personal and situational characteristics may have an important impact (Adomavicius et al., 2005; Knijnenburg et al., 2012). Solution: measure those as well.
    • Framework: (full framework diagram) Situational Characteristics [routine, system trust, choice goal] and Personal Characteristics [gender, privacy, expertise] surround the chain System [algorithm, interaction, presentation] -> Perception [usability, quality, appeal] -> Experience [system, process, outcome] -> Interaction [rating, consumption, retention].
    • Framework: Objective System Aspects (OSA): these are the manipulations, e.g. visual / interaction design, the recommender algorithm, the presentation of recommendations, additional features.
    • Framework: User Experience (EXP): different aspects may influence different things, e.g. interface -> system evaluations; preference elicitation method -> choice process; algorithm -> quality of the final choice.
    • Framework: Subjective System Aspects (SSA): link OSA to EXP (mediation); increase the robustness of the effects of OSAs on EXP; explain how and why OSAs affect EXP.
    • Framework: Personal and Situational Characteristics (PC and SC): the effect of the specific user and task, beyond the influence of the system; here used for evaluation, not for augmenting algorithms.
    • Framework: Interaction (INT): observable behavior, e.g. browsing, viewing, log-ins; the final step of the evaluation; grounds EXP in “real” data.
    • Framework: (the complete framework diagram, repeated)
    • Framework: Has suggestions for measurement scales (e.g. how can we measure something like “satisfaction”?). Provides a good starting point for causal relations (e.g. how and why do certain system aspects influence the user experience?). Useful for integrating existing work (e.g. how do recommendation list length, diversity and presentation each influence the user experience?).
    • Framework: “Statisticians, like artists, have the bad habit of falling in love with their models.” (George Box)
    • Summary (Evaluation framework): test your recommender system with real users; measure behaviors other than ratings; test aspects other than the algorithm; measure subjective valuations with questionnaires; take situational and personal characteristics into account; use our framework as a starting point for user-centric research (it helps with measurement and causal relations).
    • Hypotheses: What do I want to find out?
    • Hypotheses: “Can you test if my recommender system is good?”
    • Hypotheses: What does “good” mean? Recommendation accuracy? Recommendation quality? System usability? System satisfaction? We need to define measures.
    • Hypotheses: “Can you test if the user interface of my recommender system scores high on this usability scale?”
    • Hypotheses: What does “high” mean? Is 3.6 out of 5 on a 5-point scale “high”? What are 1 and 5? What is the difference between 3.6 and 3.7? We need to compare the UI against something.
    • Hypotheses: “Can you test if the UI of my recommender system scores high on this usability scale compared to this other system?”
    • Hypotheses: (screenshots: my new travel recommender vs. Travelocity)
    • Hypotheses: Say we find that it scores higher on usability... why does it? A different date-picker method? A different layout? A different number of options available? Apply the concept of ceteris paribus to get rid of confounding variables: keep everything the same, except for the thing you want to test (the manipulation); any difference can then be attributed to the manipulation.
    • Hypotheses: (screenshots: my new travel recommender vs. the previous version, which had too many options)
    • Hypotheses: To learn something from the study, we need a theory behind the effect. For industry, this may suggest further improvements to the system; for research, this makes the work generalizable. Measure mediating variables: measure understandability (and a number of other concepts) as well, and find out how they mediate the effect on usability.
    • Hypotheses: An example: we compared three recommender systems with three different algorithms. Ceteris paribus!
    • Hypotheses: Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012. The mediating variables show the entire story.
    • Hypotheses: (path model, Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012) Matrix Factorization recommender with explicit feedback (MF-E) and with implicit feedback (MF-I), each versus a generally-most-popular baseline (GMP) (OSA) -> perceived recommendation variety and perceived recommendation quality (SSA) -> perceived system effectiveness (EXP).
    • Hypotheses: “An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.” (John Tukey)
    • Summary (Hypotheses: what do I want to find out?): define measures; compare system aspects against each other; get rid of confounding variables; apply the concept of ceteris paribus; look for a theory behind the found effects; measure mediating variables to explain the effects.
    • Participants: Population and sampling
    • Participants: “We are testing our recommender system on our colleagues/students.” -or- “We posted the study link on Facebook/Twitter.”
    • Participants: Are your connections, colleagues, or students typical users of your system? They may have more knowledge of the field of study; they may feel more excited about the system; they may know what the experiment is about; they probably want to please you. You should sample from your target population: an unbiased sample of users of your system.
    • Participants: “We only use data from frequent users.”
    • Participants: What are the consequences of limiting your scope? You run the risk of catering to that subset of users only, and you cannot make generalizable claims about users. For scientific experiments, the target population may be unrestricted, especially when your study is more about human nature than about a specific system.
    • Participants: “We tested our system with 10 users.”
    • Participants: Is this a decent sample size? Can you attain statistically significant results? Does it provide a wide enough inductive base? Make sure your sample is large enough; 40 is typically the bare minimum. Anticipated effect size vs. needed sample size: small, 385; medium, 54; large, 25.
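      A rough way to reproduce sample-size figures like those above is a power analysis. The sketch below (Python, statsmodels) uses the conventional Cohen's d benchmarks for small/medium/large effects and assumed settings of 80% power and alpha = .05, so its numbers will not match the slide's table exactly.

        # Sketch: per-group sample size for a two-sided independent-samples t-test.
        # Effect sizes are Cohen's d benchmarks; power/alpha settings are assumptions.
        from statsmodels.stats.power import TTestIndPower

        power_analysis = TTestIndPower()
        for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
            n = power_analysis.solve_power(effect_size=d, power=0.8, alpha=0.05,
                                           alternative="two-sided")
            print(f"{label} effect (d = {d}): about {n:.0f} participants per group")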
    • Summary (Participants: population and sampling): sample from your target population; the target population may be unrestricted; make sure your sample is large enough.
    • Testing A vs. B: Experimental manipulations
    • Manipulations: “Are our users more satisfied if our news recommender shows only recent items?”
    • Manipulations: Proposed system or treatment: filter out any items > 1 month old. What should be my baseline? Filter out items < 1 month old? Unfiltered recommendations? Filter out items > 3 months old? You should test against a reasonable alternative: “absence of evidence is not evidence of absence”.
    • Manipulations: “The first 40 participants will get the baseline, the next 40 will get the treatment.”
    • Manipulations: These two groups cannot be expected to be similar! Some news item may affect one group but not the other. Randomize the assignment of conditions to participants: randomization neutralizes (but doesn’t eliminate) participant variation.
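      A minimal sketch of randomized (rather than sequential) assignment; the condition labels and participant IDs below are hypothetical.

        # Sketch: assign each incoming participant to a condition at random,
        # instead of giving the first 40 the baseline and the next 40 the treatment.
        import random

        CONDITIONS = ["baseline", "treatment"]   # hypothetical condition labels

        def assign_condition(participant_id: str) -> str:
            # Seeding on the participant ID keeps the assignment stable across sessions.
            return random.Random(participant_id).choice(CONDITIONS)

        print(assign_condition("participant-001"))
        print(assign_condition("participant-002"))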
    • Manipulations: Between-subjects design (100 participants): randomly assign half the participants to A, half to B (50/50). Pros: realistic interaction; manipulation hidden from the user. Con: many participants needed.
    • Manipulations: Within-subjects design (50 participants): give participants A first, then B. Pro: removes subject variability. Cons: the participant may see the manipulation; spill-over effect.
    • Manipulations: Within-subjects design (50 participants): show participants A and B simultaneously. Pros: removes subject variability; participants can compare conditions. Con: not a realistic interaction.
    • Manipulations: Should I do within-subjects or between-subjects? Use between-subjects designs for user experience: closer to a real-world usage situation, no unwanted spill-over effects. Use within-subjects designs for psychological research: effects are typically smaller, and it is nice to control between-subjects variability.
    • Manipulations: You can test multiple manipulations in a factorial design, e.g. list length (5, 10, 20 items) crossed with diversity (low, high) gives six conditions. The more conditions, the more participants you will need!
    • Manipulations: Let’s test an algorithm against random recommendations. What should we tell the participant? Beware of the placebo effect! Remember: ceteris paribus! Other option: manipulate the message (factorial design).
    • Manipulations: “We were demonstrating our new recommender to a client. They were amazed by how well it predicted their preferences!” “Later we found out that we forgot to activate the algorithm: the system was giving completely random recommendations.” (anonymized)
    • Summary (Testing A vs. B: experimental manipulations): test against a reasonable alternative; randomize the assignment of conditions; use between-subjects for user experience; use within-subjects for psychological research; you can test more than two conditions; you can test multiple manipulations in a factorial design.
    • Measurement: Measuring subjective valuations
    • Measurement: “To measure satisfaction, we asked users whether they liked the system (on a 5-point rating scale).”
    • Measurement: Does the question mean the same to everyone? John likes the system because it is convenient; Mary likes the system because it is easy to use; Dave likes it because the recommendations are good. A single question is not enough to establish content validity: we need a multi-item measurement scale.
    • Measurement: Perceived system effectiveness: “Using the system is annoying”; “The system is useful”; “Using the system makes me happy”; “Overall, I am satisfied with the system”; “I would recommend the system to others”; “I would quickly abandon using this system”.
    • Measurement: Use both positively and negatively phrased items: they make the questionnaire less “leading”, they help filter out bad participants, and they explore the “flip-side” of the scale. The word “not” is easily overlooked. Bad: “The recommendations were not very novel”. Good: “The recommendations felt outdated”.
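      As an illustration of how such items are usually prepared for analysis, the sketch below reverse-codes a negatively phrased 5-point item and adds a Cronbach's alpha reliability check; the column names and the alpha check itself are assumptions, not something the slides prescribe.

        # Sketch: reverse-code negatively phrased items, then compute Cronbach's alpha.
        import pandas as pd

        df = pd.DataFrame({
            "annoying":  [2, 1, 4, 2, 1],   # negatively phrased item -> reverse-code
            "useful":    [4, 5, 2, 4, 5],
            "satisfied": [5, 4, 2, 5, 4],
        })
        df["annoying"] = 6 - df["annoying"]  # on a 5-point scale, 1<->5 and 2<->4

        def cronbach_alpha(items: pd.DataFrame) -> float:
            k = items.shape[1]
            item_variances = items.var(axis=0, ddof=1).sum()
            total_variance = items.sum(axis=1).var(ddof=1)
            return k / (k - 1) * (1 - item_variances / total_variance)

        print(round(cronbach_alpha(df), 2))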
    • Measurement: Choose simple over specialized words: participants may have no idea they are using a “recommender system”. Avoid double-barreled questions. Bad: “The recommendations were relevant and fun”.
    • Measurement: “We asked users ten 5-point scale questions and summed the answers.”
    • Measurement: Is the scale really measuring a single thing? Perhaps 5 items measure satisfaction and the other 5 convenience, or the items are not related enough to make a reliable scale. Are two scales really measuring different things? Perhaps they are so closely related that they actually measure the same thing. We need to establish convergent and discriminant validity: this makes sure the scales are unidimensional.
    • Measurement: Solution: factor analysis. Define latent factors and specify how items “load” on them; factor analysis will determine how well the items “fit” and give you suggestions for improvement. Benefits: it establishes convergent and discriminant validity, the outcome is a normally distributed measurement scale, and the scale captures the “shared essence” of the items.
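      The tutorial points to MPlus for this step; as a rough open-source stand-in, the sketch below specifies a small confirmatory factor model in the Python semopy package (an assumption, not the author's tooling), with hypothetical item and factor names modeled on the slides.

        # Sketch: confirmatory factor analysis in semopy (lavaan-style model syntax).
        import pandas as pd
        from semopy import Model

        model_desc = """
        quality      =~ qual1 + qual2 + qual3
        satisfaction =~ sat1 + sat2 + sat3
        """

        data = pd.read_csv("questionnaire.csv")  # hypothetical file, one column per item
        model = Model(model_desc)
        model.fit(data)
        print(model.inspect())                   # loadings, variances, factor covariance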
    • Measurement: (measurement model diagram) Latent factors (perceived recommendation variety: var1-var6; perceived recommendation quality: qual1-qual4; choice difficulty: diff1-diff5; movie expertise: exp1-exp3; choice satisfaction: sat1-sat7), each measured by its questionnaire items.
    • Measurement: (same diagram, annotated with problems) Some items are flagged: low communality, loading on several factors at once (e.g. on quality, variety, and satisfaction), and high residual correlations with other items.
    • Measurement: (same diagram, with validity statistics) Average variance extracted (AVE) per factor, e.g. 0.622, 0.756, 0.793, 0.655, and 0.435 (flagged as too low); discriminant validity requires sqrt(AVE) of each factor (e.g. 0.789) to exceed that factor's largest correlation with the other factors (e.g. 0.491).
    • Measurement: “Great! Can I learn how to do this myself?” Check the video tutorials at www.statmodel.com.
    • Summary (Measurement: measuring subjective valuations): establish content validity with multi-item scales; follow the general principles for good questionnaire items; establish convergent and discriminant validity; use factor analysis.
    • Analysis: Statistical evaluation of the results
    • Analysis: Manipulation -> perception: do these two algorithms lead to a different level of perceived quality? Method: t-test. (Bar chart: perceived quality for conditions A and B.)
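      A minimal sketch of that comparison, assuming the per-participant perceived-quality scores (e.g. factor scores) for conditions A and B are already available:

        # Sketch: independent-samples t-test on perceived quality for condition A vs. B.
        from scipy import stats

        quality_a = [0.4, 0.7, -0.1, 0.9, 0.3]    # hypothetical scores, condition A
        quality_b = [-0.2, 0.1, -0.5, 0.0, -0.3]  # hypothetical scores, condition B

        t, p = stats.ttest_ind(quality_a, quality_b)
        print(f"t = {t:.2f}, p = {p:.3f}")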
    • Analysis: Perception -> experience: does perceived quality influence system effectiveness? Method: linear regression. (Scatter plot: perceived system effectiveness vs. recommendation quality.)
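      And the corresponding regression of experience on perception, sketched with the statsmodels formula API; the data file and column names are hypothetical.

        # Sketch: does perceived recommendation quality predict perceived system effectiveness?
        import pandas as pd
        import statsmodels.formula.api as smf

        df = pd.read_csv("scores.csv")       # hypothetical: one row per participant
        fit = smf.ols("effectiveness ~ quality", data=df).fit()
        print(fit.summary())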
    • Analysis: Two manipulations -> perception: what is the combined effect of list diversity and list length on perceived recommendation quality? Method: factorial ANOVA. (Bar chart: perceived quality for low/high diversification at 5, 10, and 20 items; Willemsen et al.: “Not just more of the same”, submitted to TiiS.)
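      A sketch of the corresponding 3x2 factorial ANOVA (list length by diversification) on perceived recommendation quality, again with hypothetical file and column names:

        # Sketch: two-way (factorial) ANOVA with interaction.
        import pandas as pd
        import statsmodels.formula.api as smf
        from statsmodels.stats.anova import anova_lm

        df = pd.read_csv("experiment.csv")    # hypothetical: quality, length, diversity per row
        fit = smf.ols("quality ~ C(length) * C(diversity)", data=df).fit()
        print(anova_lm(fit, typ=2))           # main effects and the interaction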
    • Analysis: Only one method: structural equation modeling, the statistical method of the 21st century. It combines factor analysis and path models: turn items into factors, then test causal relations. (Figure: the measurement model and path model from the choice-overload study.)
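      The slides run the full structural equation model in MPlus; a heavily simplified sketch of the same idea in semopy (an assumed substitute, with made-up factor, item, and path names) combines a measurement model with structural paths.

        # Sketch: a simplified SEM, measurement model plus structural (path) model.
        import pandas as pd
        from semopy import Model

        model_desc = """
        variety      =~ var1 + var2 + var3
        quality      =~ qual1 + qual2 + qual3
        satisfaction =~ sat1 + sat2 + sat3
        variety      ~ condition
        quality      ~ variety
        satisfaction ~ quality
        """

        data = pd.read_csv("study.csv")   # hypothetical: item columns plus a 0/1 'condition' column
        model = Model(model_desc)
        model.fit(data)
        print(model.inspect())            # path coefficients and loadings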
    • Analysis: Very simple reporting: report the overall fit plus the effect of each causal relation, giving a path model that explains the effects. (Figure: the same path model with coefficients and p-values.)
    • Analysis: Example from Bollen et al.: “Choice Overload”. What is the effect of the number of recommendations? What about the composition of the recommendation list? Tested with 3 conditions: Top-5 (recommendations ranked 1-5), Top-20 (recommendations ranked 1-20), and Lin-20 (the top 5 plus items at ranks 99, 199, ..., 1499).
    • Analysis: Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010.
    • Analysis: (annotated path model, Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010) Top-20 vs. Top-5 and Lin-20 vs. Top-5 -> perceived recommendation variety (measured by var1-var6, not shown here) -> perceived recommendation quality -> choice difficulty and choice satisfaction; movie expertise has an additional effect on satisfaction. Annotations mark a simple regression, full mediation, inconsistent mediation, and a trade-off.
    • Analysis: (the same path model with coefficients, standard errors, and p-values for each path)
    • Analysis: “Great! Can I learn how to do this myself?” Check the video tutorials at www.statmodel.com.
    • Analysis: Homoscedasticity is a prerequisite for linear stats; this does not hold for count data, time, etc. Outcomes need to be unbounded and continuous; this does not hold for binary answers, counts, etc. (Scatter plot: interaction time in minutes vs. level of commitment.)
    • Analysis: Use the correct methods for non-normal data: binary data: logit/probit regression; 5- or 7-point scales: ordered logit/probit regression; count data: Poisson regression. Don’t use “distribution-free” stats: they were invented when calculations were done by hand. Do use structural equation modeling: MPlus can do all these things “automatically”.
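      A sketch of two of these alternatives using statsmodels (logit for a binary outcome, Poisson for a count outcome); the outcome and predictor names are hypothetical.

        # Sketch: regressions for non-normal outcomes.
        import pandas as pd
        import statsmodels.formula.api as smf

        df = pd.read_csv("behavior.csv")   # hypothetical data set

        # Binary outcome (e.g. clicked a recommendation or not): logit regression
        logit_fit = smf.logit("clicked ~ quality", data=df).fit()
        print(logit_fit.summary())

        # Count outcome (e.g. number of clips watched): Poisson regression
        poisson_fit = smf.poisson("n_watched ~ quality", data=df).fit()
        print(poisson_fit.summary())

        # For 5- or 7-point scales, statsmodels also offers an ordered-logit model (OrderedModel).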
    • Analysis: Standard regression requires uncorrelated errors. This does not hold for repeated measures, e.g. “rate these 5 items”: there will be a user bias (and maybe an item bias). Golden rule: data points should be independent. (Scatter plot: number of followed recommendations vs. level of commitment.)
    • Analysis: Two ways to account for repeated measures: define a random intercept for each user, or impose an error covariance structure. Again, in MPlus this is easy. (Chart: number of followed recommendations vs. level of commitment.)
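      A sketch of the random-intercept option with the statsmodels mixed linear model, where grouping by user absorbs the user bias in repeated-measures data; column names are assumptions.

        # Sketch: linear mixed model with a random intercept per user (repeated measures).
        import pandas as pd
        import statsmodels.formula.api as smf

        df = pd.read_csv("ratings.csv")    # hypothetical: one row per (user, item) observation
        fit = smf.mixedlm("followed ~ commitment", data=df, groups=df["user_id"]).fit()
        print(fit.summary())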
    • Analysis: A manipulation only causes things; for all other variables, base the causal direction on common sense, the psych literature, and evaluation frameworks. Example: privacy study (Knijnenburg & Kobsa: “Making Decisions about Privacy”). (Path model: disclosure help (R2 = .302), perceived privacy threat (R2 = .565), trust in the company (R2 = .662), satisfaction with the system (R2 = .674).)
    • Analysis: “All models are wrong, but some are useful.” (George Box)
    • Summary (Analysis: statistical evaluation of the results): use structural equation models; use correct methods for non-normal data; use correct methods for repeated measures; use manipulations and theory to make inferences about causality.
    • Introduction: user experiments are the user-centric evaluation of recommender systems. Evaluation framework: use our framework as a starting point for user experiments. Hypotheses: construct a measurable theory behind the expected effect. Participants: select a large enough sample from your target population. Testing A vs. B: assign users to different versions of a system aspect, ceteris paribus. Measurement: use factor analysis and follow the principles for good questionnaires. Analysis: use structural equation models to test causal models.
    • “It is the mark of a truly intelligent person to be moved by statistics.” George Bernard Shaw
    • Resources: User-centric evaluation: Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C.: Explaining the User Experience of Recommender Systems. UMUAI 2012. Knijnenburg, B.P., Willemsen, M.C., Kobsa, A.: A Pragmatic Procedure to Support the User-Centric Evaluation of Recommender Systems. RecSys 2011. Choice overload: Willemsen, M.C., Graus, M.P., Knijnenburg, B.P., Bollen, D.: Not Just More of the Same: Preventing Choice Overload in Recommender Systems by Offering Small Diversified Sets. Submitted to TiiS. Bollen, D.G.F.M., Knijnenburg, B.P., Willemsen, M.C., Graus, M.P.: Understanding Choice Overload in Recommender Systems. RecSys 2010.
    • Resources: Preference elicitation methods: Knijnenburg, B.P., Reijmer, N.J.M., Willemsen, M.C.: Each to His Own: How Different Users Call for Different Interaction Methods in Recommender Systems. RecSys 2011. Knijnenburg, B.P., Willemsen, M.C.: The Effect of Preference Elicitation Methods on the User Experience of a Recommender System. CHI 2010. Knijnenburg, B.P., Willemsen, M.C.: Understanding the Effect of Adaptive Preference Elicitation Methods on User Satisfaction of a Recommender System. RecSys 2009. Social recommenders: Knijnenburg, B.P., Bostandjiev, S., O'Donovan, J., Kobsa, A.: Inspectability and Control in Social Recommender Systems. RecSys 2012.
    • Resources: User feedback and privacy: Knijnenburg, B.P., Kobsa, A.: Making Decisions about Privacy: Information Disclosure in Context-Aware Recommender Systems. Submitted to TiiS. Knijnenburg, B.P., Willemsen, M.C., Hirtbach, S.: Getting Recommendations and Providing Feedback: The User-Experience of a Recommender System. EC-Web 2010. Statistics books: Agresti, A.: An Introduction to Categorical Data Analysis. 2nd ed., 2007. Fitzmaurice, G.M., Laird, N.M., Ware, J.H.: Applied Longitudinal Analysis. 2004. Kline, R.B.: Principles and Practice of Structural Equation Modeling. 3rd ed., 2010. Online tutorial videos at www.statmodel.com.
    • Resources: Questions? Suggestions? Collaboration proposals? Contact me! Contact info: E: bart.k@uci.edu, W: www.usabart.nl, T: @usabart