A talk I gave at the Netflix offices on July 2nd, 2012.
Please do not use any of the slides or their contents without my explicit permission (bart@usabart.nl for inquiries).
3. Introduction
Bart Knijnenburg
- Current:
  - PhD candidate, Informatics, UC Irvine
  - Research intern, Samsung R&D
- Past:
  - Researcher & teacher, TU Eindhoven
  - MSc Human Technology Interaction, TU Eindhoven
  - M Human-Computer Interaction, Carnegie Mellon
4. Introduction
Bart Knijnenburg
- UMUAI paper "Explaining the User Experience of Recommender Systems" (first UX framework in RecSys)
- Founder of UCERSTI, the workshop on user-centric evaluation of recommender systems and their interfaces
  - This year: tutorial on user experiments
5. Introduction
My goal:
- Introduce my framework for user-centric evaluation of recommender systems
- Explain how the framework can help in conducting user experiments
My approach:
- Start with "beyond the five stars"
- Explain how to improve upon this
- Introduce a 21st-century approach to user experiments
6. Introduction
“What is a user experiment?”
“A user experiment is a scientific method to
investigate how and why system aspects
influence the users’ experience and behavior.”
7. Introduction
What I want to do today:
- Beyond the 5 stars: The Netflix approach
- Towards user experience: How to improve upon the Netflix approach
- Evaluation framework: A framework for user-centric evaluation
- Measurement: Measuring subjective valuations
- Analysis: Statistical evaluation of results
- Some examples: Findings based on the framework
9. Beyond the 5 stars
Offline evaluations may not give the same outcome as online evaluations
(Cosley et al., 2002; McNee et al., 2002)
Solution: Test with real users
10. Beyond the 5 stars
[Diagram: System (algorithm) → Interaction (rating)]
11. Beyond the 5 stars
Higher accuracy does not always mean higher satisfaction
(McNee et al., 2006)
Solution: Consider other behaviors: "Beyond the five stars"
12. Beyond the 5 stars
[Diagram: System (algorithm) → Interaction (rating, consumption, retention)]
13. Beyond the 5 stars
The algorithm accounts for only 5% of the relevance of a recommender system
(Francisco Martin, RecSys 2009 keynote)
Solution: Test those other aspects: "Beyond the five stars"
14. Beyond the 5 stars
[Diagram: System (algorithm, interaction, presentation) → Interaction (rating, consumption, retention)]
15. Beyond the 5 stars
Start with a hypothesis:
- Algorithm/feature/design X will increase member engagement and ultimately member retention
Design a test:
- Develop a solution or prototype
- Think about dependent & independent variables, control, significance…
Execute the test
16. Beyond the 5 stars
Let the data speak for itself:
- Many different metrics
- We ultimately trust member engagement (e.g., hours of play) and retention
Is this good enough? I think it is not.
18. User experience
"Testing a recommender against a static video clip system, the number of clicked clips and total viewing time went down!"
19. User experience
[Path model: personalized recommendations (OSA) have a positive effect on perceived recommendation quality (SSA), which increases perceived system effectiveness and, in turn, choice satisfaction (EXP); higher effectiveness goes with more clips watched from beginning to end, but with fewer clips clicked and less total viewing time]
Knijnenburg et al.: "Receiving Recommendations and Providing Feedback", EC-Web 2010
20. User experience
Behavior is hard to interpret:
- The relationship between behavior and satisfaction is not always trivial
User experience is a better predictor of long-term retention:
- With behavior only, you will need to run the test for a long time
Questionnaire data is more robust:
- Fewer participants are needed
21. User experience
Measure subjective valuations with questionnaires:
- Perception and experience
Triangulate these data with behavior:
- Find out how they mediate the effect on behavior
Measure every step in your theory:
- Create a chain of mediating variables
22. User experience
Perception: do users notice the manipulation?
- Understandability, perceived control
- Perceived recommendation diversity and quality
Experience: self-relevant evaluations of the system
- Process-related (e.g., choice difficulty)
- System-related (e.g., system effectiveness)
- Outcome-related (e.g., choice satisfaction)
23. User experience
[Framework diagram: System (algorithm, interaction, presentation) → Perception (usability, quality, appeal) → Experience (system, process, outcome) → Interaction (rating, consumption, retention)]
24. User experience
Personal and situational characteristics may have an important impact
(Adomavicius et al., 2005; Knijnenburg et al., 2012)
Solution: measure those as well
25. User experience
[Framework diagram, extended: Situational Characteristics (routine, system trust, choice goal) and Personal Characteristics (gender, privacy, expertise) influence the System → Perception → Experience → Interaction chain]
27. Framework
Framework for user-centric evaluation of recommenders
[Framework diagram: Situational Characteristics (routine, system trust, choice goal); System (algorithm, interaction, presentation) → Perception (usability, quality, appeal) → Experience (system, process, outcome) → Interaction (rating, consumption, retention); Personal Characteristics (gender, privacy, expertise)]
Knijnenburg et al.: "Explaining the user experience of recommender systems", UMUAI 2012
28. Framework
Objective System Aspects (OSA):
- visual / interaction design
- recommender algorithm
- presentation of recommendations
- additional features
[Framework diagram with the OSA portion highlighted]
29. Framework
User Experience (EXP):
Different aspects may influence different things:
- interface -> system evaluations
- preference elicitation method -> choice process
- algorithm -> quality of the final choice
[Framework diagram with the EXP portion highlighted]
30. Framework
Subjective System Aspects (SSA):
- Link OSA to EXP (mediation)
- Increase the robustness of the effects of OSAs on EXP
- How and why OSAs affect EXP
[Framework diagram with the SSA portion highlighted]
31. Framework
Personal Characteristics (PC):
- Users' general disposition
- Beyond the influence of the system
- Typically stable across tasks
[Framework diagram with the PC portion highlighted]
32. Framework
Situational Characteristics (SC):
- How the task influences the experience
- Also beyond the influence of the system
- Here used for evaluation, not for augmenting algorithms!
[Framework diagram with the SC portion highlighted]
33. Framework
Interaction (INT):
- Observable behavior (browsing, viewing, log-ins)
- Final step of the evaluation, but not its goal
- Triangulate with EXP
[Framework diagram with the INT portion highlighted]
34. Framework
[Full framework diagram: Situational Characteristics (routine, system trust, choice goal); System (algorithm, interaction, presentation) → Perception (usability, quality, appeal) → Experience (system, process, outcome) → Interaction (rating, consumption, retention); Personal Characteristics (gender, privacy, expertise)]
35. Framework
Has suggestions for measurement scales
e.g. How can we measure something like “satisfaction”?
Provides a good starting point for causal relations
e.g. How and why do certain system aspects influence the
user experience?
Useful for integrating existing work
e.g. How do recommendation list length, diversity and
presentation each have an influence on user experience?
39. Measurement
Does the question mean the same to everyone?
- John likes the system because it is convenient
- Mary likes the system because it is easy to use
- Dave likes it because the recommendations are good
A single question is not enough to establish content validity
We need a multi-item measurement scale
40. Measurement
Measure at least 5 items per concept:
- At least 2-3 should remain after removing bad items
It is most convenient to use statements to which the participant can agree/disagree on a 5- or 7-point scale:
"I am satisfied with this system."
(Completely disagree, somewhat disagree, neutral, somewhat agree, completely agree)
41. Measurement
Use both positively and negatively phrased items:
- They make the questionnaire less "leading"
- They help filter out bad participants
- They explore the "flip-side" of the scale
The word "not" is easily overlooked:
Bad: "The recommendations were not very novel"
Better: "The recommendations were NOT very novel" (emphasis makes "not" harder to miss)
Best: "The recommendations felt outdated"
42. Measurement
Choose simple over specialized words:
- Participants may have no idea they are using a "recommender system"
Use as few words as possible:
- But always use complete sentences
43. Measurement
Use design techniques that help to improve recall and provide appropriate time references:
- Bad: "I liked the signup process" (one week ago)
- Good: Provide a screenshot
- Best: Ask the question immediately afterwards
Avoid double-barreled items:
- Bad: "The recommendations were relevant and fun"
44. Measurement
“We asked users ten 5-point scale questions
and summed the answers.”
45. Measurement
Is the scale really measuring a single thing?
- 5 items measure satisfaction, the other 5 convenience
- The items are not related enough to make a reliable scale
Are two scales really measuring different things?
- They are so closely related that they actually measure the same thing
We need to establish convergent and discriminant validity
This makes sure the scales are unidimensional
46. Measurement
Solution: factor analysis
- Define latent factors, specify how items “load” on them
- Factor analysis will determine how well the items “fit”
- It will give you suggestions for improvement
Benefits of factor analysis:
- Establishes convergent and discriminant validity
- Outcome is a normally distributed measurement scale
- The scale captures the “shared essence” of the items
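To make this step concrete, below is a minimal sketch of a factor analysis in Python. The data file and item names (sat1..sat5, conv1..conv5) are hypothetical, and scikit-learn's exploratory factor analysis stands in for the confirmatory analysis a real study would report (like the models on the next two slides).

```python
# Minimal sketch: exploratory factor analysis on questionnaire items.
# Hypothetical data: columns sat1..sat5 and conv1..conv5 hold 7-point
# agreement ratings from participants ("questionnaire.csv" is made up).
import pandas as pd
from sklearn.decomposition import FactorAnalysis

df = pd.read_csv("questionnaire.csv")
items = [f"sat{i}" for i in range(1, 6)] + [f"conv{i}" for i in range(1, 6)]

fa = FactorAnalysis(n_components=2, rotation="varimax")
scores = fa.fit_transform(df[items])  # per-participant factor scores

loadings = pd.DataFrame(fa.components_.T, index=items,
                        columns=["satisfaction", "convenience"])
print(loadings.round(2))
# Convergent validity: each item should load strongly on its own factor.
# Discriminant validity: cross-loadings should stay low; items that
# violate either are candidates for removal.
```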
47. Measurement
[CFA diagram: items collct22-collct27 load on a "Collection concerns" factor, gipc16-gipc21 on a "General privacy concerns" factor, and ctrl28-ctrl32 on a "Control concerns" factor]
48. Measurement
[CFA diagram after removing bad items: collct22 and collct24-collct27 load on "Collection concerns", gipc17, gipc18 and gipc21 on "General privacy concerns", and ctrl28 and ctrl29 on "Control concerns"]
50. Analysis
Manipulation -> perception:
Do these two algorithms lead to a different level of perceived quality?
Method: t-test
[Bar chart: perceived quality for algorithms A and B]
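A minimal sketch of this test in Python, assuming perceived quality has already been computed as a per-participant factor score (the arrays below are made up):

```python
# Minimal sketch: does the algorithm manipulation change perceived quality?
import numpy as np
from scipy import stats

# Hypothetical factor scores for participants in conditions A and B.
quality_A = np.array([0.4, -0.1, 0.8, 0.3, 0.5])
quality_B = np.array([-0.6, -0.2, -0.9, 0.1, -0.4])

t, p = stats.ttest_ind(quality_A, quality_B)
print(f"t = {t:.2f}, p = {p:.3f}")
```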
51. Analysis
Perception -> experience:
Does perceived quality influence system effectiveness?
Method: linear regression
[Scatter plot: system effectiveness against recommendation quality]
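A minimal sketch, again with hypothetical factor scores:

```python
# Minimal sketch: does perceived quality predict system effectiveness?
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-participant factor scores.
df = pd.DataFrame({"quality":       [-1.2, -0.3, 0.1, 0.8, 1.5],
                   "effectiveness": [-0.9, -0.4, 0.2, 0.6, 1.1]})

model = smf.ols("effectiveness ~ quality", data=df).fit()
print(model.summary())  # slope, standard error, p-value, R-squared
```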
52. Analysis
Manipulation -> perception -> experience:
Does the algorithm influence effectiveness via perceived quality?
Method: path model
[Path diagram: personalized recommendations (OSA) → + perceived recommendation quality (SSA) → + perceived system effectiveness (EXP)]
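The slide's method is a path model; as a rough illustration of the same mediation logic, here is a two-regression sketch with made-up data (a proper path model or SEM, as on slide 55, estimates both paths at once):

```python
# Minimal sketch: algorithm -> perceived quality -> effectiveness.
# "algo" is a hypothetical 0/1 dummy for the manipulated condition.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"algo":          [0, 0, 0, 1, 1, 1],
                   "quality":       [-0.8, -0.2, -0.5, 0.4, 0.9, 0.6],
                   "effectiveness": [-0.6, -0.1, -0.4, 0.3, 0.8, 0.5]})

# Step 1: manipulation -> mediator.
print(smf.ols("quality ~ algo", data=df).fit().params)
# Step 2: mediator -> outcome, controlling for the manipulation. If the
# direct effect of "algo" shrinks while "quality" stays significant,
# the effect runs via perceived quality (mediation).
print(smf.ols("effectiveness ~ quality + algo", data=df).fit().params)
```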
53. Analysis
Two manipulations -> perception:
What is the combined effect of list diversity and list length on perceived recommendation quality?
Method: factorial ANOVA
[Bar chart: perceived quality for low vs. high diversification at list lengths of 5, 10, and 20 items]
Willemsen et al.: "Not just more of the same", submitted to TiiS
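A sketch of the corresponding test in Python, with made-up data:

```python
# Minimal sketch: combined effect of list diversity and list length on
# perceived recommendation quality (all data below is made up).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "diversity": ["low", "low", "low", "high", "high", "high"] * 2,
    "length":    [5, 10, 20, 5, 10, 20] * 2,
    "quality":   [0.20, 0.30, 0.35, 0.45, 0.50, 0.40,
                  0.25, 0.28, 0.30, 0.50, 0.48, 0.42],
})
model = smf.ols("quality ~ C(diversity) * C(length)", data=df).fit()
print(anova_lm(model, typ=2))  # main effects and the interaction
```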
54. Analysis
Manipulation x personal characteristic -> outcome:
Do experts and novices rate these two interaction methods differently in terms of usefulness?
Method: ANCOVA
[Line plot: system usefulness against domain knowledge for case-based vs. attribute-based preference elicitation]
Knijnenburg & Willemsen: "Understanding the Effect of Adaptive Preference Elicitation Methods", RecSys 2009
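A sketch of the ANCOVA-style test, with made-up data; the interaction term carries the expert/novice question:

```python
# Minimal sketch: does the effect of the preference elicitation method
# depend on domain knowledge? (All names and data are hypothetical.)
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "method":     ["case", "case", "case", "attr", "attr", "attr"],
    "knowledge":  [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0],
    "usefulness": [0.5, 0.2, -0.3, -0.6, 0.1, 0.9],
})
model = smf.ols("usefulness ~ C(method) * knowledge", data=df).fit()
print(model.params)  # interaction term: do experts and novices differ?
```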
55. Analysis
Use only one method: structural equation models
- A combination of Factor Analysis and Path Models
- The statistical method of the 21st century
- Statistical evaluation of causal effects
- Report as causal paths
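As an illustration, here is a minimal structural equation model in semopy, a Python package with lavaan-style model syntax. The model and all variable names are illustrative, not the actual models reported in the papers:

```python
# Minimal sketch of a structural equation model: latent factors are
# defined from questionnaire items (=~) and causal paths between them
# are regressions (~). All names here are illustrative.
import pandas as pd
import semopy

desc = """
# Measurement part (factor analysis)
quality       =~ qual1 + qual2 + qual3
effectiveness =~ eff1 + eff2 + eff3
# Structural part (path model); "algo" is the manipulated condition
quality       ~ algo
effectiveness ~ quality
"""

df = pd.read_csv("study_data.csv")  # hypothetical item-level data
model = semopy.Model(desc)
model.fit(df)
print(model.inspect())  # path estimates, standard errors, p-values
```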
56. Analysis
We compared three recommender systems (three different algorithms)
The MF-I and MF-E algorithms are more effective
Why?
57. Analysis
Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012
The mediating variables show the entire story
58. Analysis
[SEM results diagram: the two OSAs (a Matrix Factorization recommender with explicit feedback, MF-E, and one with implicit feedback, MF-I, each versus a generally-most-popular baseline, GMP) increase perceived recommendation variety, which increases perceived recommendation quality (SSAs), which in turn increases perceived system effectiveness (EXP). Personal characteristics (participant's age, number of items rated) also enter the model, and effectiveness is reflected in behavior: the number of items clicked in the player, the number of items rated higher than the predicted rating, and the rating of clicked items. All displayed paths are significant (p < .05 or better).]
Knijnenburg et al.: "Explaining the user experience of recommender systems", UMUAI 2012
59. Analysis
“All models are wrong, but some are useful.”
George Box
61. Some examples
Giving users many options:
- increases the chance they find something nice
- also increases the difficulty of making a decision
Result: choice overload
Solution:
- Provide short, diversified lists of recommendations
62. Some examples
[SEM results diagram: Top-20 and Lin-20 recommendation lists (each compared to Top-5; OSAs) increase perceived recommendation variety, which increases perceived recommendation quality (SSAs); quality increases choice satisfaction (EXP) and lowers choice difficulty, while the longer lists also increase choice difficulty, which in turn lowers choice satisfaction. Movie expertise (PC) contributes additional positive effects on the perceptions. All displayed paths are significant.]
65. Some examples
Experts and novices differ in their decision-making process:
- Experts want to leverage their domain knowledge
- Novices make decisions by choosing from examples
Result: they want different preference elicitation methods
Solution:
- Tailor the preference elicitation method to the user
- A needs-based approach seems to work well for everyone
69. Some examples
[Plot: perceived control against domain knowledge for the TopN, Sort, Explicit, Implicit, and Hybrid interaction methods; * marks a significant difference]
70. Some examples
[Plot: understandability against domain knowledge for the TopN, Sort, Explicit, Implicit, and Hybrid interaction methods; * marks a significant difference]
71. Some examples
[Plot: user interface satisfaction against domain knowledge for the TopN, Sort, Explicit, Implicit, and Hybrid interaction methods; ** and *** mark significant differences]
72. Some examples
[Plot: perceived system effectiveness against domain knowledge for the TopN, Sort, Explicit, Implicit, and Hybrid interaction methods; ** and *** mark significant differences]
73. Some examples
[Plot: choice satisfaction against domain knowledge for the TopN, Sort, Explicit, Implicit, and Hybrid interaction methods; * marks significant differences]
74. Some examples
Social recommenders limit the neighbors to the users' friends
Result:
- More control: friends can be rated as well
- Better inspectability: recommendations can be traced
We show that each of these effects exists
They both increase overall satisfaction
77. Some examples
[SEM results diagram (Knijnenburg et al., RecSys 2012): the control manipulation (item/friend vs. no control) and the inspectability manipulation (full graph vs. list only) are the OSAs; they increase perceived control and understandability (SSAs), which increase perceived recommendation quality and, ultimately, satisfaction with the system (EXP). Personal characteristics (familiarity with recommenders, music expertise, trusting propensity) also enter the model, and the interaction measures (inspection time, number of known recommendations, average rating) link it to observed behavior. All displayed paths are significant.]
78. Conclusion
I discussed how to do user experiments with my framework:
- Beyond the 5 stars: The Netflix approach is a step in the right direction
- Towards user experience: We need to measure subjective valuations as well
- Evaluation framework: With the framework you can put this all together
- Measurement: Measure subjective valuations with questionnaires and factor analysis
- Analysis: Use structural equation models to test comprehensive causal models
- Some examples: I demonstrated the findings of some studies that used this approach
79. See you in Dublin!
bart.k@uci.edu
www.usabart.nl
@usabart
80. Resources
User-centric evaluation
- Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C.: Explaining the User Experience of Recommender Systems. UMUAI 2012.
- Knijnenburg, B.P., Willemsen, M.C., Kobsa, A.: A Pragmatic Procedure to Support the User-Centric Evaluation of Recommender Systems. RecSys 2011.
Choice overload
- Willemsen, M.C., Graus, M.P., Knijnenburg, B.P., Bollen, D.: Not just more of the same: Preventing Choice Overload in Recommender Systems by Offering Small Diversified Sets. Submitted to TiiS.
- Bollen, D.G.F.M., Knijnenburg, B.P., Willemsen, M.C., Graus, M.P.: Understanding Choice Overload in Recommender Systems. RecSys 2010.
81. Resources
Preference elicitation methods
- Knijnenburg, B.P., Reijmer, N.J.M., Willemsen, M.C.: Each to His Own: How Different Users Call for Different Interaction Methods in Recommender Systems. RecSys 2011.
- Knijnenburg, B.P., Willemsen, M.C.: The Effect of Preference Elicitation Methods on the User Experience of a Recommender System. CHI 2010.
- Knijnenburg, B.P., Willemsen, M.C.: Understanding the effect of adaptive preference elicitation methods on user satisfaction of a recommender system. RecSys 2009.
Social recommenders
- Knijnenburg, B.P., Bostandjiev, S., O'Donovan, J., Kobsa, A.: Inspectability and Control in Social Recommender Systems. RecSys 2012.
82. Resources
User feedback and privacy
- Knijnenburg, B.P., Kobsa, A.: Making Decisions about Privacy: Information Disclosure in Context-Aware Recommender Systems. Submitted to TiiS.
- Knijnenburg, B.P., Willemsen, M.C., Hirtbach, S.: Receiving Recommendations and Providing Feedback: The User Experience of a Recommender System. EC-Web 2010.