Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Recommender systems to help people move forward


Published on

My talk at the 11th RecSys NL meetup (Oct 17, 2018) in Utrecht about the research on user-centric recommender systems

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Recommender systems to help people move forward

  1. 1. Recommender systems to help people move forward RecSysNL meetup Oct 17, 2018 Martijn Willemsen PI of the recommender LAB @mcwillemsen Decision Making, Process tracing, Cognition, Recommender Systems, online behavior, e-coaching, Data Science
  2. 2. My lab has a strong user-centric RecSys focus…, why? • Because I failed as a (real) engineer ? 2 MSc in EE 2nd Bsc: Technology and Society PhD in Decision Making Recommender Systems MSc Technology and Society PostDoc in Process Tracing Electrical Engineering
  3. 3. My lab has a strong user-centric RecSys focus…, why? • Because it is easier to get papers into the RecSys conference? 3
  4. 4. Surely not…. Because just optimizing accuracy is not enough… REVEAL Workshop RecSys2018: (Joe Konstan) “…if the recommender systems we are building are trained to predict the very items that user founds by themselves without recommendation (yes, I’m looking at you Precision_at_k), then the usefulness of the recommender becomes very debatable. ” 4 2018-recommender-systems-that-care- 16389e43114c
  5. 5. And optimizing for engagement/behavior is tricky… Netflix tradeoffs popularity, diversity and accuracy AB tests to test ranking between and within rows Source: RecSys 2016, 18 Sept: Talk by Xavier Amatriain
  6. 6. We don’t need the user: Let’s do AB Testing! Netflix used 5-star rating scales to get input from users (apart from log data) Netflix reported an AB test of thumbs up/down versus rating: Yellin (Netflix VP of product): “The result was that thumbs got 200% more ratings than the traditional star-rating feature.” So is the 5-star rating wrong? or just different information? Should we only trust the behavior? 6 However, over time, Netflix realized that explicit star ratings were less relevant than other signals. Users would rate documentaries with 5 stars, and silly movies with just 3 stars, but still watch silly movies more often than those high-rated documentaries. ws/netflix-thumbs-vs-stars- 1202010492/
  7. 7. Behavior versus Experience Looking at behavior… • Testing a recommender against a random videoclip system, the number of clicked clips and total viewing time went down! Looking at user experience… • Users found what they liked faster with less ineffective clicks… Behaviorism is not enough! (Ekstrand & Willemsen, RecSys 2016) We need to measure user experience and relate it to user behavior… We need to understand user goals and develop Rec. Systems that help users attain these goals! 7
  8. 8. User-Centric Research can help us understand… • How they perceive recommender algorithms (Ekstrand et al. 2014) • WHY users are satisfied… even when we reduce accuracy by diversification (Willemsen et al. 2016) • What inaction (non-behavior) means… (Zhao et al. 2018) Or even help built systems to help people move forward • Energy saving recommendations (Starke et al. 2017) • Music Genre Explorer (Liang and Willemsen, 2018) 8
  9. 9. User-Centric Framework Computers Scientists (and marketing researchers) would study behavior…. (they hate asking the user or just cannot (AB tests))
  10. 10. User-Centric Framework Psychologists and HCI people are mostly interested in experience…
  11. 11. User-Centric Framework Though it helps to triangulate experience and behavior…
  12. 12. User-Centric Framework Our framework adds the intermediate construct of perception that explains why behavior and experiences changes due to our manipulations
  13. 13. User-Centric Framework • And adds personal and situational characteristics • Relations modeled using factor analysis and SEM Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C. (2012). Explaining the User Experience of Recommender Systems. User Modeling and User-Adapted Interaction (UMUAI), vol 22, p. 441-504
  14. 14. User Perceptions of Differences in Recommender Algorithms Joint work with grouplens Michael Ekstrand, Max Harper and Joseph Konstan Ekstrand, M.D., Harper, F.M., Willemsen, M.C.& Konstan, J.A. (2014). User Perception of Differences in Recommender Algorithms. In Proceedings of the 8th ACM conference on Recommender systems (pp. 161–168). New York, NY, USA: ACM
  15. 15. Going beyond accuracy… McNee et al. (2006): Accuracy is not enough “study recommenders from a user-centric perspective to make them not only accurate and helpful, but also a pleasure to use” But wait! we don’t even know how the standard algorithms are perceived… and what differences there are… Compare 3 classic algorithms (Item-Item, User-User and SVD) side by side (joint evaluation) in terms of preference and perceptions
  16. 16. The task provided to the user First impression Perceived Diversity & novelty and satisfaction Choice of algo
  17. 17. First look at the measurement model • only measurement model relating the concepts (no conditions) • All concepts are relative comparisons – e.g. if they think list A is more diverse than B, they are also more satisfied with list A than B SSA EXPSSA INT INT
  18. 18. What algorithms do users prefer? 528 users completed the questionnaire Joint evaluation, 3 pairs of comparing A with B User-User CF significantly looses from the other two Item-Item and SVD are on par Why? – User-user more novel than either SVD or item-item – User-user more diverse than SVD – Item-item slightly more diverse than SVD (but diversity didn't affect satisfaction) I-I I-I SVD U-U SVD U-U 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% I-I v. U-U I-I v. SVD SVD v. U-U
  19. 19. Objective measures No accuracy differences, but consistent with subjective data RQ2: User-user more novel, SVD somewhat less diverse
  20. 20. Aligning objective with subjective measures Objective and subjective metrics correlate consistently But their effects on choice are mediated by the subjective perceptions! (Objective) obscurity only influences satisfaction if it increases perceived novelty (i.e. if it is registered by the user)
  21. 21. Conclusions Novelty is not always good: complex, largely negative effect Diversity is important for satisfaction Diversity/accuracy tradeoff does not seem to hold… Subjective Perceptions and experience mediate the effect of objective measures on choice / preference for algorithm Brings the ‘WHY’: e.g. User-user is less satisfactory and less often chosen because of its obscure items (which are perceived as novel)
  22. 22. Choice difficulty and satisfaction in RecSys Applying latent feature diversification Willemsen, M.C., Graus, M.P, & Knijnenburg, B.P. (2016). Understanding the role of latent feature diversification on choice difficulty and satisfaction. User Modeling and User- Adapted Interaction (UMUAI), vol 26 (4), 347-389 doi:10.1007/s11257-016-9178-6
  23. 23. Seminal example of choice overload Satisfaction decreases with larger sets as increased attractiveness is counteracted by choice difficulty Can we reduce difficulty while controlling attractiveness? More attractive 3% sales Less attractive 30% sales Higher purchase satisfaction From Iyengar and Lepper (2000)
  24. 24. Koren, Y., Bell, R., and Volinsky, C. 2009. Matrix Factorization Techniques for Recommender Systems. IEEE Computer 42, 8, 30–37. Dimensionality reduction Users and items are represented as vectors on a set of latent features Rating is the dot product of these vectors (overall utility!) Gus will like Dumb and Dumber but hate Color Purple Use the properties of Matrix Factorization algorithms!
  25. 25. Latent feature diversification: high diversity/equal attractiveness
  26. 26. 26 Latent Feature Diversification Psychology- informed Diversity manipulation Increased perceived Diversity & attractiveness Reduced difficulty & increased satisfaction Less hovers More choice for lower ranked items Diversification Rank of chosen None (top 5) 3.6 Medium 14.5 High 77.6 -0.2 0 0.2 0.4 0.6 0.8 1 none med high standardizedscore diversification Choice Satisfaction Higher satisfaction for high diversification, despite choice for lower predicted/ranked items
  27. 27. Interpreting User Inaction in Recommender Systems Zhao, Q., Willemsen, M. C., Adomavicius, G., Maxwell Harper, F., & Konstan, J. A. (2018). Interpreting user inaction in recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems (blz. 40-48). New York: Association for Computing Machinery, Inc. DOI: 10.1145/3240323.3240366
  28. 28. 28 Action and Inaction in Add into a wishlist Not interested Rating Click to see details
  29. 29. 29 How to interpret user inaction? Randomly pick one and ask users.
  30. 30. 7 Categories of User Inaction 30 Did not notice it (38.6%) Noticed but watched it before (14.6%) Noticed and have not watched it yet (46.8%) Would not enjoy it (5.8%) Others are better (9.5%) Okay but not now (18.2%) Plan to explore it soon (6.9%) Have decided to watch (5.8%)
  31. 31. Should movielens keep recommending this item to you? most preferred least preferred 4% 1% 11% 9% 30% 42% 63% 65% 51% 25% 27% 23% 19% 5%
  32. 32. Based on this survey data we: Built an inaction classification model • Predictors: item attributes, position, user actions, predicted rating and action probabilities: (clicking, rating, adding to a wishlist) • Best accuracy: 48.5% (majority class is NotNoticed: 39.9%) Try to improve the recommender system: what to do with inaction items? • Utilize inferred 7-class probabilities • Estimate the (in)action or adjust the recommendation timing… • Hide an item or Delay showing an item ! 32
  33. 33. How recommenders can help users achieve their goals Research with Alain Starke (PhD student) RecSys 2017 33
  34. 34. Recommending for Behavioral change • Behavioral change is hard… – Exercising more, eat healthy, reduce alcohol consumption (reducing Binge watching on Netflix ) – Needs awareness, motivation and commitment Combi model: Klein, Mogles, Wissen Journal of Biomedical Informatics, 2014
  35. 35. What can recommenders do? • Persuasive Technology: focused on how to help people change their behavior: – personalize the message… • Recommenders systems can help with what to change and when to act – personalize what to do next… • This requires different models/algorithms – our past behavior/liking is not what we want to do now!
  36. 36. Central question Can we design a recommender interface which effectively supports a user’s energy-saving goals? 36
  37. 37. Regular RecSys approaches, e.g. collaborative filtering, are prone to reinforcing current behavior • If we want consumers and users to achieve (energy- saving) goals, we should not only focus on past behavior but ‘move forward’ (cf. Ekstrand & Willemsen, 2016) • We need a model which considers future goals 37
  38. 38. Energy-saving measures can be ordered as increasingly difficult behavioral steps towards attaining the goal of saving energy (Kaiser et al., 2010; Urban & Scasny, 2014) < < 38
  39. 39. These steps reflect willingness & capacity to save energy: a person’s energy-saving ability (Kaiser et al., 2010; Urban & Scasny, 2014) < < 39
  40. 40. We infer behavioral difficulties based on engagement frequencies 40 INPUT Persons indicate which measures they perform Difficult / Obscure Easy / Popular
  41. 41. In a similar vein, we infer energy-saving abilities 41 Low ability, Performs few Persons indicate which measures they perform INPUT High ability, Performs many
  42. 42. One’s energy-saving ability is a good starting point to look for appropriate measures A person has a 50% probability of performing a measure with a difficulty equal to his/her ability 42
  43. 43. 43
  44. 44. Study 2: How should advice be tailored to support energy-efficient choices? (And can fit scores help to persuade users to pick more challenging measures?) 44
  45. 45. 45 Web shop interface with three lists (tabs): ‘Base’, ‘Recommended’ and ‘Challenging’ • ‘Recommended’ contains 15 best-matching measures, with fit scores ranging 100% to 60% • ‘Base’ are easier, ‘challenging’ more difficult Matched on their ability: -1, 0 or +1 Logit (75%, 50% of 25% likelihood) Show a fit score (or not)
  46. 46. 3x2 Between-subject research design • 3 levels of difficulty, determining contents of the ‘recommended’ list: – Easy / below ability (~75% probability) – Ability-tailored (~50% probability) – Difficult / above ability (~25% probability) • 2 levels of fit score: they were either shown or not – The 100% score was consistent with the difficulty condition – E.g. in the easy condition, measures below a user’s ability (75%) had a 100% match score
  47. 47. Easy recommendations were perceived as feasible and, in turn, supportive & satisfactory SEM Statistics: χ²(140) = 198.693, p < 0.001, CFI = 0.992, TLI = 0.990 47 *** p < 0.001, ** p < 0.01, * p < 0.05. Perceived Feasibility −.469*** Rec difficulty -
  48. 48. Easy recommendations were perceived as feasible and, in turn, supportive & satisfactory SEM Statistics: χ²(140) = 198.693, p < 0.001, CFI = 0.992, TLI = 0.990 48 *** p < 0.001, ** p < 0.01, * p < 0.05. Perceived Feasibility −.469*** Perceived Support Choice Satisfaction.234*** Rec difficulty .506*** .221*** - + +
  49. 49. • Users who felt supported selected more measures • Satisfied users showed a higher % of follow-up 49 *** p < 0.001, ** p < 0.01, * p < 0.05. Perceived Feasibility −.469*** Perceived Support Choice Satisfaction.234*** No. of chosen items % Executed items Difficulty chosen items Rec difficulty .506*** .221*** .385*** −.113** - - ++ + + + +
  50. 50. Users chose slightly more measures when presented easier ones (Showing fit scores did not really matter) 50
  51. 51. Fit scores boosted satisfaction levels for easy measures, but backfired for difficult ones 51
  52. 52. Lessons learned • A satisfactory user interface can lead to the adoption of more energy-saving measures (within system + after 4 weeks) • Easy tailored measures seem to be attractive, as they were perceived as feasible and chosen more often • Fit scores were merely self-reinforcing, not persuasive to attain ‘more difficult goals’ 52
  53. 53. General conclusions • Recommender systems are all about good UX • Taking a psychological, user-oriented approach we can better account for how users perceive recommendations and reach their goals • Behaviorism is not enough: an integrated user- centric approach offers many insights/benefits! • Recommenders should take into account user goals and should adapt their algorithms to it… 53