What recommender systems can learn from decision psychology about preference elicitation and behavioral change

Slides of my talks at Boise State (Idaho) and Grouplens at University of Minnesota

  1. What recommender systems can learn from decision psychology about preference elicitation and behavioral change. Martijn Willemsen, Human-Technology Interaction. www.martijnwillemsen.nl
  2. What are recommender systems about? Algorithms. Accuracy: compare predictions with actual values. Recommendation: the best-predicted items. (Diagram: a dataset of user-item rating pairs; the user chooses (prefers?) and rates; ratings reflect experience; preferences reflect goals and desires!)
  3. User-Centric Framework. Computer scientists (and marketing researchers) would study behavior… (they hate asking the user, or simply cannot, as in A/B tests).
  4. User-Centric Framework. Psychologists and HCI people are mostly interested in experience…
  5. User-Centric Framework. Though it helps to triangulate experience and behavior…
  6. User-Centric Framework. Our framework adds the intermediate construct of perception, which explains why behavior and experience change due to our manipulations.
  7. User-Centric Framework. And adds personal and situational characteristics. Relations are modeled using factor analysis and SEM. Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C. (2012). Explaining the User Experience of Recommender Systems. User Modeling and User-Adapted Interaction (UMUAI), vol. 22, pp. 441-504. http://bit.ly/umuai
  8. Providing input to the recommender: Preference Elicitation. Memory, ratings & choices.
  9. Providing input to a recommender system: how do algorithms get their data? Preference Elicitation (PE). PE is a major topic in research on decision making; I even did my PhD thesis on it… ;-) What can psychology teach us about improving this aspect? The role of memory in ratings; rating support; rating versus choice-based elicitation.
  10. What does rating entail? Typical recommender scenario. First usage: show a set of (typical) items and ask people to rate the ones they know (cold start). Later usage: people go to the recommender when they have consumed an item to rate it, and typically also rate other aspects. Does it matter if the preference you provide (by rating) is based on recent experiences or mostly on your memory?
  11. Psychologist: the user knows best, ask her! In two user experiments, users rated a number of movies that were aired on Dutch TV in the previous month (~150 movies). Rate 15-20 movies from that set that you know, and indicate how long ago you saw them (last week, last month, last 6 months, last year, last 3 years, longer ago). Motivate your ratings for two movies (one seen recently and one seen more than a year ago).
  12. Results. 247 users, 4212 ratings. Rating distributions: most movies were seen long ago… only 28% were seen in the last year. The number of positive ratings decreases with time: in the first timeslot, 60% were 4/5-star ratings; in the last timeslot, only 36%.
  13. Modeling the ratings. Multilevel model with random intercepts for movies and users: intercept 2.95 (SE 0.15, t = 19.05); time 0.29 (0.13, 2.31); highrated 1.62 (0.22, 7.43); time² −0.09 (0.02, −3.55); time × highrated −0.73 (0.18, −4.10); time² × highrated 0.11 (0.03, 3.26). High-rated versus low-rated movies show a different pattern. Regression towards the mean?
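The fixed-effects part of this model can be applied directly to see how predicted ratings drift over time. A minimal sketch using the coefficients from the slide (random intercepts for movies and users omitted):

```python
# Fixed effects from the slide's multilevel model of movie ratings.
# time = how long ago the movie was seen (timeslot index),
# highrated = 1 for high-rated movies, 0 for low-rated ones.

def predicted_rating(time, highrated):
    """Predicted rating from the fitted fixed-effects coefficients."""
    return (2.95 + 0.29 * time + 1.62 * highrated
            - 0.09 * time**2 - 0.73 * time * highrated
            + 0.11 * time**2 * highrated)

print(predicted_rating(0, 1))  # high-rated, seen recently: 2.95 + 1.62 = 4.57
print(predicted_rating(5, 1))  # high-rated, seen long ago: noticeably lower
```

This makes the slide's point concrete: for high-rated movies the predicted rating drops as the time since viewing grows, while the low-rated curve behaves differently.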
  14. This is a problem… How can we train a recommender system if ratings depend on our memory this much? This is new to psychology as well: memory effects like this have not been studied. The problem lies partly in the type of judgment asked: a rating is a separate evaluation on an absolute scale and lacks a good reference/comparison. Two solutions we explored: rating support, and a different elicitation method: choice!
  15. Joint versus Separate Evaluation. Evaluations of two job candidates for a computer programmer position expecting the use of a special language called KY. Candidate A: B.Sc. Computer Science, GPA (0-5) of 4.8, 10 KY programs written. Candidate B: B.Sc. Computer Science, GPA (0-5) of 3.1, 70 KY programs written. Mean WTP (in thousands): separate evaluation, A $32.7k vs. B $26.8k; joint evaluation, A $31.2k vs. B $33.2k.
  16. Rating support interfaces, using MovieLens! Can we help users during rating to make their ratings more stable/accurate? We can support their memory for the movie using tags, and we can help ratings on the scale with previous ratings. MovieLens has a tag genome and a history of ratings, so we can give real-time, user-specific feedback! Nguyen, T.T., Kluver, D., Wang, T.-Y., Hui, P.-M., Ekstrand, M.D., Willemsen, M.C., & Riedl, J. (2013). Rating Support Interfaces to Improve User Experience and Recommender Accuracy. RecSys 2013 (pp. 149-156).
  17. Tag Interface. Provide 10 tags that are relevant for that user and that describe the movie well. Didn't really help…
  18. Exemplar Interface. Support rating on the scale by providing exemplars: similar movies rated before by that user at that level of the scale. This helps to anchor the values on the scale better: more consistent ratings.
  19. But what are preferences? Ratings are absolute statements of preference… but preference is a relative statement: I like The Grand Budapest Hotel more than The King's Speech. So why not ask users to choose? Which do you prefer?
  20. Others also tried different PE methods. Loepp, Hussein & Ziegler (CHI 2014): choose between sets of movies that differ a lot on a latent feature. Chang, Harper & Terveen (CSCW 2015): choose between groups of similar movies by assigning points per group (ranking!).
  21. Choice-based preference elicitation. Choices are relative statements that are easier to make, and they fit the final goal better: finding a good item rather than making a good prediction. In marketing, conjoint analysis uses the same idea to determine attribute weights and utilities based on a series of (adaptive) choices. Can we use a set of choices in the matrix factorization space to determine a user vector in a stepwise fashion? Graus, M.P. & Willemsen, M.C. (2015). Improving the user experience during cold start through choice-based preference elicitation. In Proceedings of the 9th ACM Conference on Recommender Systems (pp. 273-276).
  22. Dimensions in Matrix Factorization. Dimensionality reduction: users and items are represented as vectors on a set of latent features. Item vector: utility of attributes; user vector: weights of attributes. The rating is the dot product of these vectors (overall utility!). Gus will like Dumb and Dumber but hate The Color Purple. Koren, Y., Bell, R., and Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer 42(8), 30-37.
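The dot-product prediction above takes only a few lines. A minimal sketch; the 2-D latent space, the vectors, and the numbers are illustrative, not taken from an actual factorization:

```python
# Rating prediction in a matrix factorization model: the predicted rating
# is the dot product of a user vector and an item vector in latent space.

def predict_rating(user_vec, item_vec):
    """Dot product of user weights and item utilities on latent features."""
    return sum(u * v for u, v in zip(user_vec, item_vec))

# Hypothetical 2-D latent space (the real models use many more dimensions).
gus = [1.5, -0.5]             # user vector: weights on the two latent features
dumb_and_dumber = [2.0, 0.1]  # item vectors: utilities on the same features
color_purple = [-0.4, 1.8]

print(predict_rating(gus, dumb_and_dumber))  # high: Gus will like this one
print(predict_rating(gus, color_purple))     # negative: Gus will hate this one
```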
  23. How does this work? Step 1. Iteration 1a: a diversified choice set (red items) is calculated from a matrix factorization model. Iteration 1b: the user vector (blue arrow) is moved towards the chosen item (green item); items with the lowest predicted rating are discarded (greyed-out items). (Axes: latent feature 1 × latent feature 2.)
  24. How does this work? Step 2. Iteration 2: a new diversified choice set (blue items). End of iteration 2: with an updated vector and more items discarded based on the second choice (green item).
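The two iteration steps can be sketched as code. This is a simplified reading of the procedure, assuming the user vector moves a fixed fraction of the way towards the chosen item and that a fixed share of the lowest-predicted items is discarded; the actual step size and discarding criterion in the Graus & Willemsen implementation may differ:

```python
# Sketch of one iteration of choice-based PE in the latent space.

def update_user_vector(user_vec, chosen_item_vec, step=0.5):
    """Move the user vector part of the way towards the chosen item."""
    return [u + step * (c - u) for u, c in zip(user_vec, chosen_item_vec)]

def discard_lowest(items, user_vec, keep_fraction=0.8):
    """Keep only the items with the highest predicted rating (dot product)."""
    scored = sorted(items,
                    key=lambda v: sum(u * x for u, x in zip(user_vec, v)),
                    reverse=True)
    return scored[:max(1, int(len(scored) * keep_fraction))]

user = [0.0, 0.0]        # cold-start user: no information yet
chosen = [1.0, 2.0]      # item picked from the diversified choice set
user = update_user_vector(user, chosen)
print(user)  # [0.5, 1.0]: moved halfway towards the chosen item
```

Repeating this for each of the 10 choices walks the user vector into the region of the space the user prefers, while the candidate pool shrinks.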
  25. Evaluation of Preference Elicitation. Choice-based PE: choosing 10 times from 10 items. Rating-based PE: rating 15 items. After each PE method, users evaluated the interface on interaction usability in terms of ease of use (e.g., “It was easy to let the system know my preferences”) and effort (e.g., “Using the interface was effortful.”). Effort and usability are highly related (r = 0.62). Results: less perceived effort for choice-based PE; perceived effort goes down with completion time.
  26. Behavioral data of the PE tasks. Choice-based PE: most users find their perfect item around the 8th/9th item, and they inspect quite some unique items along the way. Rating-based PE: users inspect many lists (median = 13), suggesting high effort in the rating task.
  27. Perception of the Recommendation List. Participants evaluated each recommendation list separately on choice difficulty and satisfaction. (SEM path diagram relating the choice-based list, obscurity, intra-list similarity, difficulty, and satisfaction with the chosen item; coefficients include −2.407 (.381) p < .001, −.240 (.145) p < .1, −.479 (.111) p < .001, −.257 (.045) p < .001, and 14.00 (4.51) p < .01.)
  28. Conclusion. Participants experienced reduced effort and increased satisfaction for choice-based PE over rating-based PE. Relative (choice) rather than absolute (rating) PE could alleviate the cold-start problem for new users. Further research is needed: on the parameterization of the choice task; on the strong effect of choice on the popularity of the resulting list (using trailers helps to decrease popularity; → IntRS 2016); and novelty effects might have played a role: a fun way of interacting?
  29. Recommending for Behavioral Change: energy saving and hypertension management.
  30. Behavioral change. Behavioral change is hard… exercising more, eating healthily, reducing alcohol consumption (reducing binge watching on Netflix). It needs awareness, motivation and commitment. Combi model: Klein, Mogles, Wissen, Journal of Biomedical Informatics, 2014.
  31. What can recommenders do? Persuasive technology focuses on how to help people change their behavior: personalize the message. Recommender systems can help with what to change and when to act: personalize what to do next. This requires different models/algorithms: our past behavior/liking is not what we want to do now! Two illustrations of the new approach: energy saving and hypertension management.
  32. How can we help people to save energy?
  33. Our first (old) recommender system. Recommendations; selected measures; things you already do or don't want to do; attributes; set attribute weights. Show the items with the highest U(item, user), where U(item, user) = Σ_attribute V(item, attribute) · W(attribute, user).
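The attribute-weighted utility on this slide can be sketched in a few lines. The attribute names, item values, and user weights below are hypothetical, purely to illustrate the formula:

```python
# Weighted additive utility: U(item, user) = sum over attributes of
# V(item, attribute) * W(attribute, user).

def utility(item_values, user_weights):
    """Sum of item attribute values weighted by this user's weights."""
    return sum(item_values[a] * user_weights[a] for a in user_weights)

# Hypothetical attributes for energy-saving measures.
user_weights = {"savings": 0.7, "cost": -0.5, "comfort": 0.2}  # W(attribute, user)
insulation   = {"savings": 0.9, "cost": 0.8, "comfort": 0.6}   # V(item, attribute)
led_bulbs    = {"savings": 0.4, "cost": 0.1, "comfort": 0.3}

# Recommend the measures with the highest utility for this user.
ranked = sorted([("insulation", utility(insulation, user_weights)),
                 ("led_bulbs", utility(led_bulbs, user_weights))],
                key=lambda t: t[1], reverse=True)
print(ranked)
```

Adjusting the attribute-weight sliders in the interface changes `user_weights`, which reorders the recommendations.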
  34. Study 3 (AMCIS 2014). Online lab study: 147 paid participants (79M, 68F, mean age 40.0); selected participants interacted for at least 2.5 minutes. Three PE methods and two baselines: attribute-based PE; implicit PE; hybrid PE (attribute + implicit); Sort (baseline, not personalized); Top-N (baseline, not personalized). http://bit.ly/amcis14
  35. Study 3 — Results. Experts prefer attribute-based PE and hybrid PE; novices prefer Top-N and Sort (the baselines). System satisfaction mediates the effect on choice satisfaction and behavior! (Figure: system satisfaction as a function of domain knowledge.) http://bit.ly/amcis14
  36. Tailoring energy-saving advice using a unidimensional Rasch scale of conservation measures. Work with Alain Starke (Ph.D. student).
  37. Towards a better (psychometric) user model. Consumers differ in energy-saving capabilities, attitudes, goals, … Our prior work did not take that into account. Energy-saving interventions are more effective when personalized. But how? (Cf. Abrahamse et al., 2005)
  38. A single energy-saving dimension/attitude? Campbell's Paradigm (Kaiser et al., 2010): “One's attitude or ability becomes apparent through its behavior…”; “Attitude and behavior are two sides of the same coin…”. Different from standard psychological approaches that measure attitudes & intentions with Likert scales…
  39. Psychological assumptions. Three assumptions for our user model (based on Kaiser et al., 2010): 1. All energy-saving behaviors form a class serving a single goal: saving energy. 2. Less-performed behaviors yield higher behavioral costs (i.e., are more difficult). 3. Individuals who execute more energy-saving behaviors have a higher energy-saving ability (i.e., are more skilled).
  40. The Rasch model. The Rasch model equates behavioral difficulties and individual propensities in a probabilistic model. Log-odds of engagement levels (yes/no): ln(P_ni / (1 − P_ni)) = θ_n − δ_i, where θ_n is an individual's propensity/attitude, δ_i is the behavioral difficulty, and P_ni is the probability of individual n engaging in behavior i. Rasch also determines individual propensities and item difficulties & fits them onto a single scale. One scale may have lots of different difficulty levels.
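Solving the log-odds equation for P_ni gives a logistic function of ability minus difficulty, which is all the user model needs to predict engagement. A minimal sketch:

```python
import math

# Rasch model: ln(P / (1 - P)) = theta - delta, so
# P = 1 / (1 + exp(-(theta - delta))).

def rasch_probability(theta, delta):
    """Probability that a person with propensity theta performs a
    behavior of difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

print(rasch_probability(1.0, 1.0))  # ability equals difficulty: 0.5
print(rasch_probability(2.0, 0.0))  # easy behavior, able person: near 1
print(rasch_probability(0.0, 2.0))  # hard behavior, less able person: near 0
```

Because both persons and behaviors live on the same scale, comparing θ to δ immediately tells you which measures are within a user's reach.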
  41. The resulting Rasch scale. Pictures disclaimer: courtesy of my PhD student (the guy on the right).
  42. Item difficulty vs. person ability. The probability of a person executing a behavior depends on Ability − Costs. Example measures on the scale: LED lighting, unplugging chargers, installing PV panels.
  43. Using Rasch for tailored advice. Earlier research (Kaiser, Urban) found evidence for a unidimensional scale, but with few items & no advice. We set up a Rasch-based energy recommender system that: shows the measures in order of difficulty (either ascending or descending); provides tailored conservation advice to users (or not); and includes a more extensive set of measures.
  44. Energy-saving ‘Webshop’. An ordered set of measures with rich information and (in some conditions) recommendations. Order: ascending or descending in Rasch difficulty. Rasch recommendation: start at the user's ability (3 items highlighted) or just at the bottom (no highlights).
  45. Procedure. 1. Determining ability: each user indicated for 13 measures whether he/she executed them. 2. Show the webshop in one of 4 conditions (total N = 224) and ask users to select a set of measures (to execute in the next 4 weeks). 3. Survey to measure user experience. 4. After 4 weeks: report back on which measures they implemented (to some extent), N = 86.
  46. Results: Experience (SEM model). (SEM path model relating Rasch recommendations, ascending order, and low ability to perceived effort, perceived support, and choice satisfaction; coefficients include −0.40***, 0.59***, −0.56**, 0.74**, and −0.65*; * p < .05, ** p < .01, *** p < .001.)
  47. Results. Recommendations work (especially for the descending order): higher perceived support leads to higher choice satisfaction.
  48. What did they choose? Rank-ordered logit with the ability − difficulty/costs difference as predictor. Without recommendations, users select easy measures (below their ability); with recommendations, they choose measures around their ability.
  49. Measure follow-up after 4 weeks. Follow-up depends on relative costs: they are more likely to do the easier measures…
  50. Lifestyle recommendations for Hypertension Management. Joint work with Mustafa Radha (Ph.D. student). Radha, Willemsen, Boerhof & IJsselsteijn, UMAP 2016.
  51. Hypertension. Hypertension occurs in 30% of the population. Hypertension is without symptoms. Hypertension is a leading cause of death. Hypertension can be prevented or treated with lifestyle change, such as: salt intake reduction, physical activity, weight control, alcohol moderation.
  52. Adherence. Healthy habits have a strong benefit, but are not always feasible.
  53. Study 1: Model construction. Online survey: 300 participants between 40 and 60 years old, 50% hypertensive. Self-reported engagement in 63 health behaviors about diet, sodium intake and physical activity. Results: a reliable Rasch scale (for both persons and items); no strong differences between subgroups; unidimensional: the different categories mix nicely across the scale.
  54. The Rasch scale.
  55. Study 2: Coaching strategies. The engagement maximization strategy selects the easiest behaviors (that the user does not do yet). The motivation maximization strategy selects behaviors with difficulties that match the individual's ability (that the user does not do yet). The random control strategy selects behaviors (not done yet) at random, without regard for difficulty.
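The three coaching strategies can be sketched over a Rasch-scaled behavior set. The behavior names, the difficulty values, and the selection size below are hypothetical illustrations, not the study's actual scale:

```python
import random

# Hypothetical behaviors with Rasch difficulties (higher = harder).
behaviors = {
    "reduce salt": -1.0,
    "walk 30 min/day": 0.2,
    "limit alcohol": 0.8,
    "daily workout": 2.0,
}

def engagement_maximization(not_done, n=2):
    """Select the n easiest behaviors the user does not do yet."""
    return sorted(not_done, key=lambda b: behaviors[b])[:n]

def motivation_maximization(not_done, ability, n=2):
    """Select the n behaviors whose difficulty best matches the ability."""
    return sorted(not_done, key=lambda b: abs(behaviors[b] - ability))[:n]

def random_control(not_done, n=2):
    """Select n behaviors at random, without regard for difficulty."""
    return random.sample(sorted(not_done), n)

not_done = ["walk 30 min/day", "limit alcohol", "daily workout"]
print(engagement_maximization(not_done))       # easiest behaviors first
print(motivation_maximization(not_done, 0.9))  # matched to ability 0.9
```

All three strategies filter to behaviors not done yet; only how they rank the remainder differs.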
  56. Study 2: Design. 150 hypertensive users invited online. A questionnaire was used to measure user ability and to find behaviors that the user should be coached on. Pairwise comparison between the three intervention strategies through virtual coaches.
  57. Results. (Figure: pairwise comparisons of EM, MM and RC on health benefit, personalization and appeal, for the entire population and for medium-ability users.)
  58. Conclusions. Engagement maximization (easy behaviors not done yet) outperforms random most of the time: the Rasch order helps! Knowing the ability helps to tailor better recommendations for medium-ability individuals, but for other groups it does not matter much…
  59. Questions? Contact: Martijn Willemsen, @MCWillemsen, M.C.Willemsen@tue.nl, www.martijnwillemsen.nl. Thanks to my co-authors: Mark Graus, Alain Starke, Mustafa Radha, Bart Knijnenburg, Ron Broeders.
