Explaining the User Experience of Recommender Systems with User Experiments

The User Experience
of Recommender Systems

Introduction
What I want to do today

Introduction
Bart Knijnenburg
- Current PhD candidate,
-
Informatics
UC Irvine
- Research intern,
Samsung R&D

- Past
-Researcher & teacher,
TU Eindhoven
- MSc Human Technology
Interaction, TU Eindhoven
- M Human-Computer
Interaction, Carnegie Mellon

INFORMATION AND COMPUTER SCIENCES

Introduction
Bart Knijnenburg
- UMUAI paper Experience
-Explaining the User
of Recommender Systems
(ﬁrst UX framework in RecSys)

- Founder ofonUCERSTI
-workshop user-centric
evaluation of recommender
systems and their interfaces
- this year: tutorial on user
experiments


Introduction
My goal:
Introduce my framework for user-centric evaluation of
recommender systems
Explain how the framework can help in conducting user
experiments

My approach:
- Start with “beyond the ﬁve stars”
- Explain how to improve upon this
- Introduce a 21st century approach to user experiments

Introduction

“What is a user experiment?”

“A user experiment is a scientiﬁc method to
investigate how and why system aspects
inﬂuence the users’ experience and behavior.”


Introduction
What I want to do today

Beyond the 5 stars
The Netﬂix approach

Towards user experience
How to improve upon the Netﬂix approach

Evaluation framework
A framework for user-centric evaluation

Measurement
Measuring subjective valuations

Analysis
Statistical evaluation of results

Some examples
Findings based on the framework

Beyond the 5 stars
The Netﬂix approach

Beyond the 5 stars

Ofﬂine evaluations may not give the same outcome as
online evaluations
Cosley et al., 2002; McNee et al., 2002

Solution: Test with real users


Beyond the 5 stars

System Interaction
algorithm rating


Beyond the 5 stars

Higher accuracy does not always mean higher satisfaction
McNee et al., 2006

Solution: Consider other behaviors
“Beyond the ﬁve stars”


Beyond the 5 stars

System Interaction
algorithm rating
consumption
retention


Beyond the 5 stars

The algorithm counts for only 5% of the relevance of a
recommender system
Francisco Martin - RecSys 2009 keynote

Solution: test those other aspects
“Beyond the ﬁve stars”


Beyond the 5 stars

System Interaction
algorithm rating
interaction consumption
presentation retention


Beyond the 5 stars
Start with a hypothesis
Algorithm/feature/design X will increase member
engagement and ultimately member retention

Design a test
Develop a solution or prototype
Think about dependent & independent variables, control,
signiﬁcance…

Execute the test


Beyond the 5 stars

Let data speak for itself
Many different metrics
We ultimately trust member engagement (e.g. hours of
play) and retention

Is this good enough?
I think it is not


How to improve upon the Netﬂix approach

User experience

“Testing a recommender against a static
videoclip system, the number of clicked clips
and total viewing time went down!”


User experience
number of
clips watched
from beginning
to end total number of
+ viewing time clips clicked

personalized − −
recommendations
OSA

perceived system
effectiveness
+ EXP

+
perceived recommendation
quality
SSA
+

choice
satisfaction
EXP

Knijnenburg et al.: “Receiving Recommendations and Providing Feedback”, EC-Web 2010


User experience
Behavior is hard to interpret
Relationship between behavior and satisfaction is not
always trivial

User experience is a better predictor of long-term retention
With behavior only, you will need to run for a long time

Questionnaire data is more robust
Fewer participants needed


User experience
Measure subjective valuations with questionnaires
Perception and experience

Triangulate these data with behavior
Find out how they mediate the effect on behavior

Measure every step in your theory
Create a chain of mediating variables


User experience
Perception: do users notice the manipulation?
- Understandability, perceived control
- Perceived recommendation diversity and quality
Experience: self-relevant evaluations of the system
- Process-related (e.g. choice difﬁculty)
- System-related (e.g. system effectiveness)
- Outcome related (e.g. choice satisfaction)

User experience

System Perception Experience Interaction
algorithm usability system rating
interaction quality process consumption
presentation appeal outcome retention


User experience

Personal and situational characteristics may have an
important impact
Adomavicius et al., 2005; Knijnenburg et al., 2012

Solution: measure those as well


User experience
Situational Characteristics
routine system trust choice goal


Personal Characteristics
gender privacy expertise


A framework for user-centric evaluation

Framework
Framework for user-centric evaluation of recommenders


Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012


Framework

System Perception System Aspects (OSA)
Objective Experience Interaction
algorithm -
usability / interaction design
visual system rating
interaction - recommenderprocess
quality algorithm consumption
presentation - presentation of recommendations
appeal outcome retention
- additional features


Framework
routine
User Experience (EXP) system trust choice goal

Different aspects may
System different things
inﬂuence Perception Experience Interaction
- interface -> system
interaction
evaluations quality process consumption

-
presentation appeal
preference elicitation outcome retention
method -> choice process
- algorithm -> quality of Personal Characteristics
the
ﬁnal choice gender privacy expertise


Framework
routine system trust Subjectivegoal
Perception or choice
System Aspects (SSA)
System Perception Experienceto EXP
Link OSA Interaction
algorithm usability system
(mediation) rating
Increase the robustness
presentation appeal of the effects of OSAs on
outcome retention
EXP
How and why OSAs
affect EXP


Framework

Personal characteristics (PC)
Users’ general disposition
Beyond the inﬂuence of the system
Typically stable across tasks



Framework

Situational Characteristics (SC)
How the task inﬂuences the experience
Also beyond the inﬂuence of the system
Here used for evaluation, not for augmenting algorithms!



Framework
Interaction (INT)
Observable behavior
- browsing, viewing, log-ins
Final step of evaluation
But not the goal of the evaluation
Triangulate with EXP Personal Characteristics


Framework




Framework
Has suggestions for measurement scales
e.g. How can we measure something like “satisfaction”?

Provides a good starting point for causal relations
e.g. How and why do certain system aspects inﬂuence the
user experience?

Useful for integrating existing work
e.g. How do recommendation list length, diversity and
presentation each have an inﬂuence on user experience?


Framework

“Statisticians, like artists, have the bad habit
of falling in love with their models.”

George Box


Measurement
Measuring subjective valuations

Measurement

“To measure satisfaction, we asked users
whether they liked the system
(on a 5-point rating scale).”


Measurement

Does the question mean the same to everyone?
- John likes the system because it is convenient
- Mary likes the system because it is easy to use
- Dave likes it because the recommendations are good
A single question is not enough to establish content validity
We need a multi-item measurement scale


Measurement
Measure at least 5 items per concept
At least 2-3 should remain after removing bad items

It is most convenient to statements to which the participant
can agree/disagree on a 5- or 7-point scale:

“I am satisﬁed with this system.”
Completely disagree, somewhat disagree, neutral, somewhat agree, completely agree


Measurement
Use both positively and negatively phrased items
- They make the questionnaire less “leading”
- They help ﬁltering out bad participants
- They explore the “ﬂip-side” of the scale
The word “not” is easily overlooked
Bad: “The recommendations were not very novel”
Better: “The recommendations were not very novel”
Best: “The recommendations felt outdated”


Measurement

Choose simple over specialized words
Participants may have no idea they are using a
“recommender system”

Use as few words as possible
But always use complete sentences


Measurement
Use design techniques that help to improve recall and
provide appropriate time references
Bad: “I liked the signup process” (one week ago)
Good: Provide a screenshot
Best: Ask the question immediately afterwards

Avoid double-barreled items
Bad: “The recommendations were relevant and fun”


Measurement

“We asked users ten 5-point scale questions
and summed the answers.”


Measurement
Is the scale really measuring a single thing?
- 5 items measure satisfaction, the other 5 convenience
- The items are not related enough to make a reliable scale
Are two scales really measuring different things?
- They are so closely related that they actually measure the
same thing

We need to establish convergent and discriminant validity
This makes sure the scales are unidimensional


Measurement
Solution: factor analysis
- Define latent factors, specify how items “load” on them
- Factor analysis will determine how well the items “fit”
- It will give you suggestions for improvement
Benefits of factor analysis:
- Establishes convergent and discriminant validity
- Outcome is a normally distributed measurement scale
- The scale captures the “shared essence” of the items

Measurement

collct22 ctrl28

collct23 gipc16 gipc17 gipc18 gipc19 gipc21 ctrl29

collct24 ctrl30

collct25 1 1 ctrl31

General Control
Collection ctrl32
collct26 privacy 1 concerns
concerns
concerns
collct27


Measurement

collct22 gipc17 gipc18 gipc21

collct24

collct25 1 1
ctrl28
General Control
Collection
collct26 privacy 1 concerns
concerns ctrl29
concerns
collct27


Analysis
Statistical evaluation of results

Analysis

Perceived quality
Manipulation -> perception: 1
Do these two algorithms 0.5
lead to a different level of
0
perceived quality?
-0.5
T-test
-1
A B


Analysis
System effectiveness
2

Perception -> experience: 1
Does perceived quality
0
inﬂuence system
effectiveness? -1

Linear regression -2
-3 0 3
Recommendation quality


Analysis

Manipulation -> perception
-> experience personalized
recommendations
OSA

Does the algorithm
inﬂuence effectiveness +

via perceived quality? perceived recommendation
quality
perceived system
effectiveness
SSA + EXP

Path model


Analysis
Perceived quality
0.6
Two manipulations ->
low diversiﬁcation
perception: 0.5
high diversiﬁcation
What is the combined 0.4

effect of list diversity and 0.3
list length on perceived 0.2
recommendation quality?
0.1

Factorial ANOVA 0
5 items 10 items 20 items
Willemsen et al.: “Not just more of the same”, submitted to TiiS


Analysis
System usefulness
2
Manipulation x personal case-based PE attribute-based PE
characteristic -> outcome 1
Do experts and novices
0
rate these two interaction
methods differently in -1
terms of usefulness?
-2
-3 0 3
ANCOVA
Domain knowledge
Knijnenburg & WIllemsen: “Understanding the Effect of
Adaptive Preference Elicitation Methods“, RecSys2009


Analysis

Use only one method: structural equation models
- A combination of Factor Analysis and Path Models
- The statistical method of the 21st century
- Statistical evaluation of causal effects
- Report as causal paths


Analysis

We compared three
recommender systems
Three different algorithms

MF-I and MF-E algorithms
are more effective
Why?


Analysis


The mediating variables show the entire story


Analysis
participant's
age
Matrix Factorization recommender with Matrix Factorization recommender with PC
1.352 (.222)
explicit feedback (MF-E) implicit feedback (MF-I) p < .001
(versus generally most popular; GMP) (versus most popular; GMP)
OSA OSA
+
Number of
items rated
.356 (.175) .310 (.148)
+ p < .05 + p < .05
20.094 (7.600)
p < .01 +
1.250 (.183) .846 (.127)
p < .001 p < .001
perceived recommendation perceived recommendation perceived system
variety + quality + effectiveness
SSA SSA EXP

.856 (.132) + .314 (.111) 1.865 (.729)
+ p < .001 p < .01
+ p < .05

Number of items Number of items
Rating
clicked in the rated higher than
of clicked items
player the predicted rating



Analysis

“All models are wrong, but some are useful.”

George Box


Some examples
Findings based on the framework

Some examples
Giving users many options:
- increases the chance they find something nice
- also increases the difficulty of making a decision
Result: choice overload

Solution:
- Provide short, diversified lists of recommendations


Some examples
Top-20 Lin-20
vs Top-5 recommendations vs Top-5 recommendations
OSA OSA

.455 (.211) .612 (.220) .879 (.265)
-.804 (.230)
p < .05 p < .01 p < .001
p < .001 −
+ + +
perceived recommendation + perceived recommendation + choice
variety quality difﬁculty .894 (.287)
.503 (.090) .336 (.089) p < .005
SSA p < .001 SSA p < .001 EXP

+ + 1.151 (.161) -.417 (.125)
.181 (.075) .205 (.083) p < .001 p < .005
p < .05 p < .05 + −

movie choice
expertise satisfaction
PC EXP


Some examples


Some examples
Perceived quality Choice difficulty Choice satisfaction
0.6 1.5 2

0.5 1
1.5
0.4
0.5
0.3 1
0
0.2
0.5
0.1 -0.5

0 -1 0
5 items 10 items 20 items 5 items 10 items 20 items 5 items 10 items 20 items

low diversiﬁcation
high diversiﬁcation

Some examples
Experts and novices differ in their decision-making process
- Experts want to leverage their domain knowledge
- Novices make decisions by choosing from examples
Result: they want different preference elicitation methods

Solution:
Tailor the preference elicitation method to the user
A needs-based approach seems to work well for everyone


Some examples
System usefulness
2
case-based PE attribute-based PE
1

0

-1

-2
-3 0 3
Domain knowledge


Some examples
2! 2! 45!

*! 40!

User interface satisfaction!
35!
1! 1!
*! ***!

Understandability!
30!
TopN! TopN! TopN!
Control!

Sort! Sort! 25! Sort!
0! 0!
-2! -1! 0! 1! 2! Explicit! -2! -1! 0! 1! 2! Explicit! Explicit!
20!

**!
Implicit! Implicit! Implicit!
Hybrid! Hybrid! 15! Hybrid!
-1! -1!
10!

5!
-2! -1! 0! 1! 2!
-2! -2!
Domain Knowledge!
Domain Knowledge! Domain Knowledge!

2! 2!
***! *!
1!
Perceived system effectiveness!

1! 1! **! 1! *!
Choice satisfaction!

TopN! TopN!
Sort! Sort!
0! 0!
-2! -1! 0! 1! 2! Explicit! -2! -1! 0! 1! 2! Explicit!
Implicit! Implicit!
Hybrid! Hybrid!
-1! -1!

-2! -2!
Domain Knowledge! Domain Knowledge!


Some examples
2!

*! 1!

TopN!
Control!

Sort!
0!
-2! -1! 0! 1! 2! Explicit!
Implicit!
Hybrid!
-1!

-2!
Domain Knowledge!


Some examples
2!

1!
*!
Understandability!

TopN!
Sort!
0!
-2! -1! 0! 1! 2! Explicit!
Implicit!
Hybrid!
-1!

-2!
Domain Knowledge!


Some examples
45!

40!
User interface satisfaction!

35!

30!
***!
TopN!
25! Sort!
Explicit!
20!

**!
Implicit!
15! Hybrid!

10!

5!
-2! -1! 0! 1! 2!
Domain Knowledge!


Some examples
2!
Perceived system effectiveness!
***!
1! 1! **!
TopN!
Sort!
0!
-2! -1! 0! 1! 2! Explicit!
Implicit!
Hybrid!
-1!

-2!
Domain Knowledge!


Some examples
2!
*!
1!

1! *!
Choice satisfaction!

TopN!
Sort!
0!
-2! -1! 0! 1! 2! Explicit!
Implicit!
Hybrid!
-1!

-2!
Domain Knowledge!


Some examples
Social recommenders limit the neighbors to the users’
friends

Result:
- More control: friends can be rated as well
- Better inspectability: Recommendations can be traced
We show that each of these effects exist
They both increase overall satisfaction


Some examples
Rate likes Weigh friends


Some examples
Personal Characteristics (PC)

Familiarity with Music Trusting
recommenders expertise propensity

0.166 (0.077)* −0.332 (0.088)***
Objective System Aspects + Subjective System Aspects (SSA) − User Experience (EXP)
(OSA)
+ Perceived 0.375
!2(2) = 10.70** (0.094)***
+ Understandability control 0.205 0.257
item: 0.428 (0.207)* (R2 = .153) 0.377
friend: 0.668 (0.206)**
(R2 = .311) (0.100)* (0.124)*
(0.074)***
0.955
Control + + (0.148)***
item/friend vs. no control 0.459 + 0.770 + + + + +
(0.148)** (0.094)***
!2(2) = 10.81** 0.231 0.249 Perceived
Satisfaction
item: −0.181 (0.097)1 (0.114)* (0.049)*** recommendation
with the system
friend: −0.389 (0.125)** quality 0.410 + (R2 = .696)
(R2 = .512) (0.092)***
+ −
0.148
Inspectability (0.051)** −0.152 (0.063)*
full graph vs. list only − Interaction (INT)
0.288 Inspection 0.323
(0.091)** + time (min) (0.031)***
(R2 = .092) +
+ number of known +
Average rating
recommendations (R2 = .508)
0.695 (0.304)* (R2 = .044)
0.067
(0.022)**


Introduction
I discussed how to do user experiments with my framework

Beyond the 5 stars
The Netﬂix approach is a step in the right direction

We need to measure subjective valuations as well

With the framework you can put this all together

Measurement
Measure subjective valuations with questionnaires and factor analysis

Analysis
Use structural equation models to test comprehensive causal models

Some examples
I demonstrated the ﬁndings of some studies that used this approach

See you in Dublin!

bart.k@uci.edu

www.usabart.nl

@usabart

Resources
User-centric evaluation
Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C.:
Explaining the User Experience of Recommender Systems. UMUAI 2012.
Knijnenburg, B.P., Willemsen, M.C., Kobsa, A.: A Pragmatic Procedure to
Support the User-Centric Evaluation of Recommender Systems. RecSys 2011.

Choice overload
Willemsen, M.C., Graus, M.P., Knijnenburg, B.P., Bollen, D.: Not just more of the
same: Preventing Choice Overload in Recommender Systems by Offering
Small Diversiﬁed Sets. Submitted to TiiS.
Bollen, D.G.F.M., Knijnenburg, B.P., Willemsen, M.C., Graus, M.P.:
Understanding Choice Overload in Recommender Systems. RecSys 2010.


Resources
Preference elicitation methods
Knijnenburg, B.P., Reijmer, N.J.M., Willemsen, M.C.: Each to His Own: How
Different Users Call for Different Interaction Methods in Recommender
Systems. RecSys 2011.
Knijnenburg, B.P., Willemsen, M.C.: The Effect of Preference Elicitation
Methods on the User Experience of a Recommender System. CHI 2010.
Knijnenburg, B.P., Willemsen, M.C.: Understanding the effect of adaptive
preference elicitation methods on user satisfaction of a recommender system.
RecSys 2009.

Social recommenders
Knijnenburg, B.P., Bostandjiev, S., O'Donovan, J., Kobsa, A.: Inspectability and
Control in Social Recommender Systems. RecSys 2012.


Resources
User feedback and privacy
Knijnenburg, B.P., Kobsa, A: Making Decisions about Privacy: Information
Disclosure in Context-Aware Recommender Systems. Submitted to TiiS.
Knijnenburg, B.P., Willemsen, M.C., Hirtbach, S.: Getting Recommendations and
Providing Feedback: The User-Experience of a Recommender System. EC-
Web 2010.


Explaining the User Experience of Recommender Systems with User Experiments

Recommended

Recommended

More Related Content

Similar to Explaining the User Experience of Recommender Systems with User Experiments

Similar to Explaining the User Experience of Recommender Systems with User Experiments (20)

More from Bart Knijnenburg

More from Bart Knijnenburg (15)

Recently uploaded

Recently uploaded (20)

Explaining the User Experience of Recommender Systems with User Experiments