Tutorial on Conducting User Experiments in Recommender Systems

The slide handouts of my RecSys 2012 tutorial
Transcript of "Tutorial on Conducting User Experiments in Recommender Systems"

  1. 1. User Experiments in Recommender Systems
  2. 2. Introduction Welcome everyone!
  3. Introduction: Bart Knijnenburg. Current: UC Irvine, Informatics, PhD candidate. Previously: TU Eindhoven, Human Technology Interaction (Researcher & Teacher, Master Student); Carnegie Mellon University, Human-Computer Interaction (Master Student).
  4. Introduction: Bart Knijnenburg. First user-centric evaluation framework (UMUAI, 2012). Founder of UCERSTI, the workshop on user-centric evaluation of recommender systems and their interfaces. Statistics expert reviewer (conference + journal). Research methods teacher. SEM advisor.
  5. 5. Introduction INFORMATION AND COMPUTER SCIENCES
  6. Introduction: “What is a user experiment?” “A user experiment is a scientific method to investigate how and why system aspects influence the users’ experience and behavior.”
  7. Introduction: My goal: teach how to scientifically evaluate recommender systems using a user-centric approach. How? User experiments! My approach: - I will provide a broad theoretical framework - I will cover every step in conducting a user experiment - I will teach the “statistics of the 21st century”
  8. Introduction (outline): Introduction: Welcome everyone! Evaluation framework: A theoretical foundation for user-centric evaluation. Hypotheses: What do I want to find out? Participants: Population and sampling. Testing A vs. B: Experimental manipulations. Measurement: Measuring subjective valuations. Analysis: Statistical evaluation of the results.
  9. 9. Evaluation frameworkA theoretical foundation for user-centric evaluation
  10. Framework: Offline evaluations may not give the same outcome as online evaluations (Cosley et al., 2002; McNee et al., 2002). Solution: test with real users.
  11. Framework: [diagram: System (algorithm) → Interaction (rating)]
  12. Framework: Higher accuracy does not always mean higher satisfaction (McNee et al., 2006). Solution: consider other behaviors.
  13. Framework: [diagram: System (algorithm) → Interaction (rating, consumption, retention)]
  14. Framework: The algorithm counts for only 5% of the relevance of a recommender system (Francisco Martin, RecSys 2009 keynote). Solution: test those other aspects.
  15. Framework: [diagram: System (algorithm, interaction, presentation) → Interaction (rating, consumption, retention)]
  16. Framework: “Testing a recommender against a random videoclip system, the number of clicked clips and total viewing time went down!”
  17. Framework: [path model: personalized recommendations (OSA) → perceived recommendation quality (SSA) → perceived system effectiveness and choice satisfaction (EXP), with effects on behavior: total number of clips clicked, total viewing time, and number of clips watched from beginning to end] Knijnenburg et al.: “Receiving Recommendations and Providing Feedback”, EC-Web 2010.
  18. Framework: Behavior is hard to interpret; the relationship between behavior and satisfaction is not always trivial. User experience is a better predictor of long-term retention; with behavior only, you will need to run for a long time. Questionnaire data is more robust: fewer participants needed.
  19. Framework: Measure subjective valuations with questionnaires (perception and experience). Triangulate these data with behavior: ground subjective valuations in observable actions, and explain observable actions with subjective valuations. Measure every step in your theory: create a chain of mediating variables.
  20. Framework: [diagram: System (algorithm, interaction, presentation) → Perception (usability, quality, appeal) → Experience (system, process, outcome) → Interaction (rating, consumption, retention)]
  21. Framework: Personal and situational characteristics may have an important impact (Adomavicius et al., 2005; Knijnenburg et al., 2012). Solution: measure those as well.
  22. Framework: [full framework diagram: Situational Characteristics (routine, system trust, choice goal) and Personal Characteristics (gender, privacy, expertise) surround the chain System (algorithm, interaction, presentation) → Perception (usability, quality, appeal) → Experience (system, process, outcome) → Interaction (rating, consumption, retention)]
  23. Framework: Objective System Aspects (OSA). These are manipulations: visual / interaction design, recommender algorithm, presentation of recommendations, additional features.
  24. Framework: User Experience (EXP). Different aspects may influence different things: interface -> system evaluations; preference elicitation method -> choice process; algorithm -> quality of the final choice.
  25. Framework: Subjective System Aspects (SSA). They link OSA to EXP (mediation), increase the robustness of the effects of OSAs on EXP, and show how and why OSAs affect EXP.
  26. Framework: Personal and Situational Characteristics (PC and SC). The effect of the specific user and task, beyond the influence of the system. Here used for evaluation, not for augmenting algorithms.
  27. Framework: Interaction (INT). Observable behavior: browsing, viewing, log-ins. The final step of evaluation; grounds EXP in “real” data.
  28. Framework: [full framework diagram repeated]
  29. Framework: Has suggestions for measurement scales (e.g. how can we measure something like “satisfaction”?). Provides a good starting point for causal relations (e.g. how and why do certain system aspects influence the user experience?). Useful for integrating existing work (e.g. how do recommendation list length, diversity and presentation each have an influence on user experience?).
  30. Framework: “Statisticians, like artists, have the bad habit of falling in love with their models.” George Box
  31. Evaluation framework summary: test your recommender system with real users; measure behaviors other than ratings; test aspects other than the algorithm; measure subjective valuations with questionnaires; take situational and personal characteristics into account; the framework helps in measurement and causal relations; use our framework as a starting point for user-centric research.
  32. 32. HypothesesWhat do I want to find out?
  33. 33. Hypotheses“Can you test if my recommender system is good?” INFORMATION AND COMPUTER SCIENCES
  34. 34. HypothesesWhat does good mean?- Recommendation accuracy?- Recommendation quality?- System usability?- System satisfaction?We need to define measures INFORMATION AND COMPUTER SCIENCES
  35. 35. Hypotheses“Can you test if the user interface of myrecommender system scores high on this usability scale?” INFORMATION AND COMPUTER SCIENCES
  36. 36. HypothesesWhat does high mean? Is 3.6 out of 5 on a 5-point scale “high”? What are 1 and 5? What is the difference between 3.6 and 3.7?We need to compare the UI against something INFORMATION AND COMPUTER SCIENCES
  37. 37. Hypotheses“Can you test if the UI of my recommender system scores high on this usability scale compared to this other system?” INFORMATION AND COMPUTER SCIENCES
  38. 38. HypothesesMy new travel recommender Travelocity INFORMATION AND COMPUTER SCIENCES
  39. Hypotheses: Say we find that it scores higher on usability... why does it? - different date-picker method - different layout - different number of options available. Apply the concept of ceteris paribus to get rid of confounding variables: keep everything the same, except for the thing you want to test (the manipulation). Any difference can be attributed to the manipulation.
  40. 40. HypothesesMy new travel recommender Previous version (too many options) INFORMATION AND COMPUTER SCIENCES
  41. Hypotheses: To learn something from the study, we need a theory behind the effect. For industry, this may suggest further improvements to the system; for research, this makes the work generalizable. Measure mediating variables: measure understandability (and a number of other concepts) as well, and find out how they mediate the effect on usability.
  42. Hypotheses: An example: we compared three recommender systems. Three different algorithms, ceteris paribus!
  43. Hypotheses: Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012. The mediating variables show the entire story.
  44. Hypotheses: [path model: a Matrix Factorization recommender with explicit feedback (MF-E) and one with implicit feedback (MF-I), each versus a generally-most-popular baseline (GMP), as OSAs; these positively affect perceived recommendation variety and perceived recommendation quality (SSAs), which in turn increase perceived system effectiveness (EXP)] Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012.
  45. 45. Hypotheses“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.” John Tukey INFORMATION AND COMPUTER SCIENCES
  46. Hypotheses summary: define measures; compare system aspects against each other; get rid of confounding variables; apply the concept of ceteris paribus; look for a theory behind the found effects; measure mediating variables to explain the effects.
  47. 47. ParticipantsPopulation and sampling
  48. 48. Participants“We are testing our recommender system on our colleagues/students.” -or- “We posted the study link on Facebook/Twitter.” INFORMATION AND COMPUTER SCIENCES
  49. Participants: Are your connections, colleagues, or students typical users of your system? - They may have more knowledge of the field of study - They may feel more excited about the system - They may know what the experiment is about - They probably want to please you. You should sample from your target population: an unbiased sample of users of your system.
  50. 50. Participants“We only use data from frequent users.” INFORMATION AND COMPUTER SCIENCES
  51. Participants: What are the consequences of limiting your scope? You run the risk of catering to that subset of users only, and you cannot make generalizable claims about users. For scientific experiments, the target population may be unrestricted, especially when your study is more about human nature than about a specific system.
  52. 52. Participants“We tested our system with 10 users.” INFORMATION AND COMPUTER SCIENCES
  53. Participants: Is this a decent sample size? Can you attain statistically significant results? Does it provide a wide enough inductive base? Make sure your sample is large enough; 40 is typically the bare minimum. [table: anticipated effect size vs. needed sample size: small, 385; medium, 54; large, 25]
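To turn the sample-size table above into a calculation, here is a minimal sketch assuming Python and the statsmodels library (the tutorial does not prescribe a tool); the exact numbers depend on the chosen test and alternative, so they will not match the slide's table exactly.

    # A-priori power analysis for a two-group between-subjects comparison.
    # Effect sizes are Cohen's d; alpha = .05, power = .80.
    from statsmodels.stats.power import TTestIndPower

    power_analysis = TTestIndPower()
    for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
        n_per_group = power_analysis.solve_power(effect_size=d, alpha=0.05, power=0.8)
        print(label, round(n_per_group))   # roughly 394, 64, 26 participants per group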
  54. Participants summary: sample from your target population; the target population may be unrestricted; make sure your sample is large enough.
  55. 55. Testing A vs. B Experimental manipulations
  56. 56. Manipulations“Are our users more satisfied if our newsrecommender shows only recent items?” INFORMATION AND COMPUTER SCIENCES
  57. Manipulations: Proposed system or treatment: filter out any items > 1 month old. What should be my baseline? - Filter out items < 1 month old? - Unfiltered recommendations? - Filter out items > 3 months old? You should test against a reasonable alternative. “Absence of evidence is not evidence of absence.”
  58. 58. Manipulations“The first 40 participants will get the baseline, the next 40 will get the treatment.” INFORMATION AND COMPUTER SCIENCES
  59. Manipulations: These two groups cannot be expected to be similar! Some news item may affect one group but not the other. Randomize the assignment of conditions to participants: randomization neutralizes (but doesn’t eliminate) participant variation.
  60. Manipulations: Between-subjects design (100 participants): randomly assign half the participants to A, half to B (50 + 50). Realistic interaction; manipulation hidden from the user; many participants needed.
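A minimal sketch of the randomized between-subjects assignment described above, assuming a simple Python script (the helper name is illustrative, not from the tutorial).

    import random

    def assign_condition(conditions=("A", "B")):
        # independent randomization per participant; shuffle a balanced list
        # instead if you need exactly equal group sizes
        return random.choice(conditions)

    assignment = {f"participant_{i}": assign_condition() for i in range(100)}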
  61. Manipulations: Within-subjects design (50 participants): give participants A first, then B. - Removes subject variability - Participant may see the manipulation - Spill-over effect
  62. Manipulations: Within-subjects design (50 participants): show participants A and B simultaneously. - Removes subject variability - Participants can compare conditions - Not a realistic interaction
  63. 63. ManipulationsShould I do within-subjects or between-subjects?Use between-subjects designs for user experience Closer to a real-world usage situation No unwanted spill-over effectsUse within-subjects designs for psychological research Effects are typically smaller Nice to control between-subjects variability INFORMATION AND COMPUTER SCIENCES
  64. Manipulations: You can test multiple manipulations in a factorial design. The more conditions, the more participants you will need! [table: 3×2 factorial design crossing list length (5, 10, or 20 items) with diversity (low or high)]
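A minimal sketch of crossing the two manipulations above into the 3×2 factorial design, assuming Python (the variable names are illustrative).

    from itertools import product
    import random

    list_lengths = [5, 10, 20]          # first manipulation
    diversification = ["low", "high"]   # second manipulation
    cells = list(product(list_lengths, diversification))   # 6 conditions in total

    condition = random.choice(cells)    # e.g. (10, 'high') for the next participant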
  65. Manipulations: Let’s test an algorithm against random recommendations. What should we tell the participant? Beware of the placebo effect! Remember: ceteris paribus! Other option: manipulate the message (factorial design).
  66. Manipulations: “We were demonstrating our new recommender to a client. They were amazed by how well it predicted their preferences!” “Later we found out that we forgot to activate the algorithm: the system was giving completely random recommendations.” (anonymized)
  67. Testing A vs. B summary: test against a reasonable alternative; randomize the assignment of conditions; use between-subjects designs for user experience; use within-subjects designs for psychological research; you can test more than two conditions; you can test multiple manipulations in a factorial design.
  68. 68. MeasurementMeasuring subjective valuations
  69. 69. Measurement“To measure satisfaction, we asked users whether they liked the system (on a 5-point rating scale).” INFORMATION AND COMPUTER SCIENCES
  70. Measurement: Does the question mean the same to everyone? - John likes the system because it is convenient - Mary likes the system because it is easy to use - Dave likes it because the recommendations are good. A single question is not enough to establish content validity; we need a multi-item measurement scale.
  71. 71. MeasurementPerceived system effectiveness: - Using the system is annoying - The system is useful - Using the system makes me happy - Overall, I am satisfied with the system - I would recommend the system to others - I would quickly abandon using this system INFORMATION AND COMPUTER SCIENCES
  72. Measurement: Use both positively and negatively phrased items. - They make the questionnaire less “leading” - They help filter out bad participants - They explore the “flip-side” of the scale. The word “not” is easily overlooked. Bad: “The recommendations were not very novel”. Good: “The recommendations felt outdated”.
  73. Measurement: Choose simple over specialized words: participants may have no idea they are using a “recommender system”. Avoid double-barreled questions. Bad: “The recommendations were relevant and fun”.
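One practical consequence of mixing positively and negatively phrased items, as advised above, is that the negative items must be reverse-coded before the scale is analyzed. A minimal sketch, assuming Python/pandas and hypothetical item names (not the tutorial's actual questionnaire).

    import pandas as pd

    answers = pd.DataFrame({
        "eff_useful":   [4, 5, 3, 4],   # positively phrased item, 1-5 scale
        "eff_annoying": [2, 1, 3, 2],   # negatively phrased item, 1-5 scale
    })
    # reverse-code the negatively phrased item: 6 - x on a 1-5 scale
    answers["eff_annoying_r"] = 6 - answers["eff_annoying"]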
  74. 74. Measurement“We asked users ten 5-point scale questions and summed the answers.” INFORMATION AND COMPUTER SCIENCES
  75. Measurement: Is the scale really measuring a single thing? - 5 items measure satisfaction, the other 5 convenience - The items are not related enough to make a reliable scale. Are two scales really measuring different things? - They may be so closely related that they actually measure the same thing. We need to establish convergent and discriminant validity; this makes sure the scales are unidimensional.
  76. Measurement: Solution: factor analysis. - Define latent factors, specify how items “load” on them - Factor analysis will determine how well the items “fit” - It will give you suggestions for improvement. Benefits of factor analysis: - Establishes convergent and discriminant validity - Outcome is a normally distributed measurement scale - The scale captures the “shared essence” of the items.
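The tutorial itself runs confirmatory factor analysis in MPlus; as a rough Python stand-in, this sketch computes Cronbach's alpha for one scale and runs an exploratory factor analysis (assuming numpy and scikit-learn, with toy data; random toy data will of course give a low alpha).

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    def cronbach_alpha(items):
        # items: rows = participants, columns = the items of one scale
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_vars / total_var)

    rng = np.random.default_rng(0)
    items = rng.integers(1, 6, size=(40, 6)).astype(float)   # toy data: 40 participants, 6 five-point items
    print(cronbach_alpha(items))

    # exploratory check that the items group into the intended number of factors
    fa = FactorAnalysis(n_components=2).fit(items)
    loadings = fa.components_.T   # items x factors; check that items load where intended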
  77. Measurement: [CFA diagram: items var1–var6 load on perceived recommendation variety; qual1–qual4 on perceived recommendation quality; diff1–diff5 on choice difficulty; exp1–exp3 on movie expertise; sat1–sat7 on choice satisfaction]
  78. Measurement: [the same CFA diagram, annotated with problems: items with low communality, an item that loads on quality, variety, and satisfaction at once, and pairs of items with high residual correlations (e.g. with qual2, with var1)]
  79. Measurement: [convergent and discriminant validity check per factor: perceived recommendation variety AVE = 0.622, sqrt(AVE) = 0.789, largest correlation 0.491; perceived recommendation quality AVE = 0.756, sqrt(AVE) = 0.870, largest correlation 0.709; choice difficulty AVE = 0.435 (!), sqrt(AVE) = 0.659, largest correlation −0.438; movie expertise AVE = 0.793, sqrt(AVE) = 0.891, highest correlation 0.234; choice satisfaction AVE = 0.655, sqrt(AVE) = 0.809, highest correlation 0.709]
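A minimal sketch of the convergent/discriminant validity check above, in plain numpy: AVE as the mean squared standardized loading, and the Fornell-Larcker comparison of sqrt(AVE) against the factor's largest correlation with another factor. The loadings here are illustrative; only the 0.709 correlation is taken from the slide.

    import numpy as np

    def ave(std_loadings):
        # average variance extracted = mean of the squared standardized loadings
        loadings = np.asarray(std_loadings, dtype=float)
        return np.mean(loadings ** 2)

    quality_ave = ave([0.85, 0.90, 0.88, 0.84])       # illustrative loadings, not the study's
    # Fornell-Larcker: sqrt(AVE) should exceed the factor's largest correlation (0.709 on the slide)
    print(quality_ave, np.sqrt(quality_ave) > 0.709)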
  80. 80. Measurement“Great! Can I learn how to do this myself?” Check the video tutorials at www.statmodel.com INFORMATION AND COMPUTER SCIENCES
  81. Measurement summary: establish content validity with multi-item scales; follow the general principles for good questionnaire items; establish convergent and discriminant validity; use factor analysis.
  82. 82. AnalysisStatistical evaluation of the results
  83. Analysis: Manipulation -> perception: do these two algorithms lead to a different level of perceived quality? Method: t-test. [bar chart: perceived quality for conditions A and B]
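A minimal sketch of this manipulation-to-perception test, assuming Python/scipy and toy factor scores (the data here are simulated, not the study's).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    quality_a = rng.normal(0.0, 1.0, 50)   # perceived-quality factor scores, condition A (toy data)
    quality_b = rng.normal(0.4, 1.0, 50)   # perceived-quality factor scores, condition B (toy data)
    t, p = stats.ttest_ind(quality_a, quality_b, equal_var=False)   # Welch's t-test
    print(t, p)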
  84. Analysis: Perception -> experience: does perceived quality influence system effectiveness? Method: linear regression. [scatter plot: system effectiveness vs. recommendation quality]
  85. Analysis: Two manipulations -> perception: what is the combined effect of list diversity and list length on perceived recommendation quality? Method: factorial ANOVA. [line chart: perceived quality for low vs. high diversification at 5, 10, and 20 items] Willemsen et al.: “Not just more of the same”, submitted to TiiS.
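A minimal sketch of the factorial ANOVA above, assuming Python/statsmodels and toy data laid out in the slide's 3×2 design (the column names are illustrative).

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(2)
    data = pd.DataFrame({
        "length": np.repeat([5, 10, 20], 40),                     # 3 list lengths
        "diversity": np.tile(np.repeat(["low", "high"], 20), 3),  # 2 diversification levels
        "quality": rng.normal(0, 1, 120),                         # perceived quality (toy data)
    })
    model = smf.ols("quality ~ C(length) * C(diversity)", data=data).fit()
    print(anova_lm(model, typ=2))   # main effects and interaction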
  86. Analysis: Only one method: structural equation modeling, the statistical method of the 21st century. It combines factor analysis and path models: turn items into factors, then test causal relations between them. [diagram: the CFA measurement model combined with the structural path model of the choice-overload study]
  87. Analysis: Very simple reporting: report the overall fit and the effect of each causal relation; a path diagram explains the effects. [the same path model, annotated with coefficients, standard errors, and p-values]
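The tutorial runs its structural equation models in MPlus; as a rough stand-in, this sketch fits a small measurement-plus-path model with the third-party semopy package on simulated data (the factor and item names, the simulated data, and the use of semopy are all assumptions, not the tutorial's setup).

    import numpy as np
    import pandas as pd
    from semopy import Model

    rng = np.random.default_rng(3)
    n = 200
    condition = rng.integers(0, 2, n)                        # 0 = baseline, 1 = treatment
    latent_quality = 0.5 * condition + rng.normal(0, 1, n)   # simulated latent quality
    latent_satisfaction = 0.6 * latent_quality + rng.normal(0, 1, n)
    data = pd.DataFrame({"condition": condition})
    for i in (1, 2, 3):
        data[f"qual{i}"] = latent_quality + rng.normal(0, 0.5, n)
        data[f"sat{i}"] = latent_satisfaction + rng.normal(0, 0.5, n)

    description = """
    quality =~ qual1 + qual2 + qual3
    satisfaction =~ sat1 + sat2 + sat3
    quality ~ condition
    satisfaction ~ quality
    """
    model = Model(description)
    model.fit(data)
    print(model.inspect())   # loadings, regressions, and their p-values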
  88. Analysis: Example from Bollen et al.: “Choice Overload”. What is the effect of the number of recommendations? What about the composition of the recommendation list? Tested with 3 conditions: Top-5 (recommendations ranked 1–5), Top-20 (recommendations ranked 1–20), and Lin-20 (the top 5 plus 15 items taken from further down the ranking: 99, 199, ..., 1499).
  89. 89. AnalysisBollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010 INFORMATION AND COMPUTER SCIENCES
  90. Analysis: [conceptual path model: Top-20 vs Top-5 and Lin-20 vs Top-5 as manipulations; perceived recommendation variety (measured by var1–var6) → perceived recommendation quality → choice difficulty and choice satisfaction; annotations mark full mediation, inconsistent mediation, a trade-off, and an additional effect of movie expertise] Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010.
  91. Analysis: [the same path model with estimated coefficients, standard errors, and p-values] Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010.
  92. 92. Analysis“Great! Can I learn how to do this myself?” Check the video tutorials at www.statmodel.com INFORMATION AND COMPUTER SCIENCES
  93. Analysis: Homoscedasticity is a prerequisite for linear stats; this does not hold for count data, time, etc. Outcomes need to be unbounded and continuous; this does not hold for binary answers, counts, etc. [scatter plot: interaction time (min) vs. level of commitment]
  94. Analysis: Use the correct methods for non-normal data. - Binary data: logit/probit regression - 5- or 7-point scales: ordered logit/probit regression - Count data: Poisson regression. Don’t use “distribution-free” stats: they were invented when calculations were done by hand. Do use structural equation modeling: MPlus can do all these things “automatically”.
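A minimal sketch of two of the models named above, assuming Python/statsmodels and toy data: a logit model for a binary outcome and a Poisson model for a count outcome. (Recent statsmodels versions also provide OrderedModel for ordered logit/probit on 5- or 7-point items.)

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(4)
    data = pd.DataFrame({"commitment": rng.normal(0, 1, 200)})
    # toy binary outcome (clicked or not) and toy count outcome (followed recommendations)
    data["clicked"] = (rng.random(200) < 1 / (1 + np.exp(-data["commitment"]))).astype(int)
    data["n_followed"] = rng.poisson(np.exp(0.3 * data["commitment"]))

    logit_fit = smf.logit("clicked ~ commitment", data=data).fit()
    poisson_fit = smf.glm("n_followed ~ commitment", data=data,
                          family=sm.families.Poisson()).fit()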
  95. Analysis: Standard regression requires uncorrelated errors. This does not hold for repeated measures, e.g. “rate these 5 items”: there will be a user bias (and maybe an item bias). Golden rule: data points should be independent. [scatter plot: number of followed recommendations vs. level of commitment]
  96. Analysis: Two ways to account for repeated measures: define a random intercept for each user, or impose an error covariance structure. Again, in MPlus this is easy. [plot: number of followed recommendations vs. level of commitment]
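A minimal sketch of the first option above (a random intercept per user), assuming Python/statsmodels and toy repeated-measures data; the tutorial itself does this in MPlus.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    n_users, items_per_user = 40, 5
    data = pd.DataFrame({
        "user": np.repeat(np.arange(n_users), items_per_user),
        "commitment": rng.normal(0, 1, n_users * items_per_user),
    })
    user_bias = rng.normal(0, 1, n_users)   # the per-user bias the slide warns about
    data["followed"] = (2 + user_bias[data["user"]] + 0.5 * data["commitment"]
                        + rng.normal(0, 1, len(data)))

    # mixed-effects regression with a random intercept per user
    fit = smf.mixedlm("followed ~ commitment", data=data, groups=data["user"]).fit()
    print(fit.summary())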
  97. Analysis: A manipulation only causes things. For all other variables, rely on common sense, the psych literature, and evaluation frameworks. Example: privacy study. [path model from Knijnenburg & Kobsa: “Making Decisions about Privacy”: disclosure help (R² = .302), perceived privacy threat (R² = .565), trust in the company (R² = .662), satisfaction with the system (R² = .674)]
  98. 98. Analysis“All models are wrong, but some are useful.” George Box INFORMATION AND COMPUTER SCIENCES
  99. Analysis summary: use structural equation models; use correct methods for non-normal data; use correct methods for repeated measures; use manipulations and theory to make inferences about causality.
  100. Introduction: user experiments are the user-centric evaluation of recommender systems. Evaluation framework: use our framework as a starting point for user experiments. Hypotheses: construct a measurable theory behind the expected effect. Participants: select a large enough sample from your target population. Testing A vs. B: assign users to different versions of a system aspect, ceteris paribus. Measurement: use factor analysis and follow the principles for good questionnaires. Analysis: use structural equation models to test causal models.
  101. 101. “It is the mark of a truly intelligent person to be moved by statistics.” George Bernard Shaw
  102. Resources, user-centric evaluation: Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C.: Explaining the User Experience of Recommender Systems. UMUAI 2012. Knijnenburg, B.P., Willemsen, M.C., Kobsa, A.: A Pragmatic Procedure to Support the User-Centric Evaluation of Recommender Systems. RecSys 2011. Choice overload: Willemsen, M.C., Graus, M.P., Knijnenburg, B.P., Bollen, D.: Not Just More of the Same: Preventing Choice Overload in Recommender Systems by Offering Small Diversified Sets. Submitted to TiiS. Bollen, D.G.F.M., Knijnenburg, B.P., Willemsen, M.C., Graus, M.P.: Understanding Choice Overload in Recommender Systems. RecSys 2010.
  103. Resources, preference elicitation methods: Knijnenburg, B.P., Reijmer, N.J.M., Willemsen, M.C.: Each to His Own: How Different Users Call for Different Interaction Methods in Recommender Systems. RecSys 2011. Knijnenburg, B.P., Willemsen, M.C.: The Effect of Preference Elicitation Methods on the User Experience of a Recommender System. CHI 2010. Knijnenburg, B.P., Willemsen, M.C.: Understanding the Effect of Adaptive Preference Elicitation Methods on User Satisfaction of a Recommender System. RecSys 2009. Social recommenders: Knijnenburg, B.P., Bostandjiev, S., O’Donovan, J., Kobsa, A.: Inspectability and Control in Social Recommender Systems. RecSys 2012.
  104. Resources, user feedback and privacy: Knijnenburg, B.P., Kobsa, A.: Making Decisions about Privacy: Information Disclosure in Context-Aware Recommender Systems. Submitted to TiiS. Knijnenburg, B.P., Willemsen, M.C., Hirtbach, S.: Getting Recommendations and Providing Feedback: The User-Experience of a Recommender System. EC-Web 2010. Statistics books: Agresti, A.: An Introduction to Categorical Data Analysis. 2nd ed. 2007. Fitzmaurice, G.M., Laird, N.M., Ware, J.H.: Applied Longitudinal Analysis. 2004. Kline, R.B.: Principles and Practice of Structural Equation Modeling. 3rd ed. 2010. Online tutorial videos at www.statmodel.com.
  105. 105. ResourcesQuestions? Suggestions? Collaboration proposals? Contact me!Contact info E: bart.k@uci.edu W: www.usabart.nl T: @usabart INFORMATION AND COMPUTER SCIENCES