Improving user experience in recommender systems

Improving user experience in
recommender systems
How latent feature diversification
can decrease choice difficulty and
improve choice satisfaction
Martijn Willemsen
Talk at Institute for Software
technology, Nov 3, 2015
Graz University of Technology

Information overload
Recommendation

Choice Overload in Recommenders
Recommenders reduce information
overload…
But large personalized sets cause choice
overload!
Top-N of all highly ranked items
What should I choose?
These are all very attractive!

Choice Overload
Seminal example of choice overload
Satisfaction decreases with larger sets as increased
attractiveness is counteracted by choice difficulty
More attractive
3% sales
Less attractive
30% sales
Higher purchase
satisfaction
From Iyengar and Lepper (2000)

Choice Overload in Recommenders
(Bollen, Knijnenburg, Willemsen & Graus, RecSys 2010)
perceived recommendation
variety
perceived recommendation
quality
Top-20
vs Top-5 recommendations
movie
expertise
choice
satisfaction
choice
difﬁculty
+
+
+
+
-+
.401 (.189)
p < .05
.170 (.069)
p < .05
.449 (.072)
p < .001
.346 (.125)
p < .01
.445 (.102)
p < .001
-.217 (.070)
p < .005
Objective System Aspects (OSA)
Subjective System Aspects (SSA)
Experience (EXP)
Personal Characteristics (PC)
Interaction (INT)
Lin-20
vs Top-5 recommendations
+
+ - +
.172 (.068)
p < .05
.938 (.249)
p < .001
-.540 (.196)
p < .01
-.633 (.177)
p < .001
.496 (.152)
p < .005
-0.1
0
0.1
0.2
0.3
0.4
0.5
Top-5 Top-20 Lin-20
Choice satisfaction

A solution: diversification
Tradeoff between similarity and diversity (Smyth &
McClave, 2001) in finding relevant items
Diversification remedies high similarity of Top-N lists
But diversification reduces the overall quality (accuracy) of the list
Many studies only look at data not at real users
Simulated users interact more efficiently with a diverse set (Bridge &
Kelly 2006)
Some studies look at how actual users perceive and
evaluate diversity
Ziegler et al. (2005): diversification based on external ontology
Diversity reduced the accuracy of the recommendations
But coverage and satisfaction increased!

Goal of the current research
Extend existing work by:
Diversification based on psychological mechanisms
Test the algorithm with real users and measure their perceptions and
experiences with the algorithm
Outline of the talk
Our user-centric evaluation framework
Psychology behind choice overload
Latent Feature diversification algorithm
Two user studies to test our diversification

User-Centric Framework
Computers Scientists (and marketing researchers) like to study
behavior…. (they hate asking the user or just cannot (AB tests))

Psychologists and HCI people are mostly interested in experience…

Though it helps to triangulate experience with behavior…

Our framework adds the intermediate construct of perception that explains
why behavior and experiences changes due to our manipulations

And adds personal
and situational
characteristics
Relations modeled
using factor analysis
and SEM
Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C. (2012). Explaining
the User Experience of Recommender Systems. User Modeling and User-Adapted
Interaction (UMUAI), vol 22, p. 441-504 http://bit.ly/umuai

Latent feature diversification
from Psy to CS
Joint work with Mark Graus and
Bart Knijnenburg

Psychology behind choice overload
More options provide more benefits in terms of finding the
right option…
…but result in higher costs/effort
Objective effort:
More comparisons required
Cognitive effort:
Increased potential regret
Larger expectations for larger
sets
Many tradeoffs

Research on Choice overload
Choice overload is not omnipresent
Meta-analysis (Scheibehenne et al., JCR 2010)
suggests an overall effect size of zero
Choice overload stronger when:
No strong prior preferences
Little difference in attractiveness items
Prior studies did not control for
the diversity of the item set
Can we reduce choice difficulty and overload by using
personalized diversified item sets?
While controlling for attractiveness…

Diversification and attractiveness
Camera:
Suppose Peter thinks
resolution (MP) and Zoom
are equally important
user vector shows preference
direction
Equi-preference line:
Set of equally attractive options
(orthogonal on user vector)
Diversify on the equi-
preference line!

Choice Difficulty and Diversity
Larger sets are often more difficult because of the
increased uniformity of these sets (Fasolo et al., 2009;
Reutskaja et al., 2009)
Larger item sets have many
similar options
small inter-product distances
and small tradeoffs
High density!
Choice Difficulty related to
lack of justification 0
5
10
15
20
0 10 20Resolution(MP)
Zoom
High Density
small tradeoffs

Choice difficulty and trade-offs
As item sets become more diverse (less dense) tradeoff
size increases
Tradeoffs are effortful…
give up one aspect for another
But can be justified very easily!
0
5
10
15
20
0 10 20
Resolution(MP)
Zoom
High Density
small tradeoffs
0
5
10
15
20
0 10 20
Resolution(MP)
Zoom
Low Density
large tradeoffs

Double Mediation Model for difficulty
(Scholten and Sherman, JEP:G 2006)
U-shaped relation between diversity and difficulty:
Choosing from uniform set is
hard to justify but has no
difficult tradeoffs
Choosing from a diverse set
encompasses difficult tradeoffs
but is easy to justify
Does this also apply to personalized item sets?
How can we apply this to recommender system
algorithms? What features to diversify on?
Difficulty
diversity
uniform diverse

Features in Matrix Factorization
Latent Features as means
of diversification!
“Latent features are Preference
dimensions related to real world
concepts (e.g. Escapist/serious)”
(Koren, Bell and Volinsky,2009)
Users and items described as
vectors of latent features
Parallel to how choice sets are described in
MAUT(multi-attribute utility theory) in consumer
psychology

Explaining Matrix Factorization
Map users and items to a joint latent factor space of
dimensionality f
Each item is a vector qi
each user a vector pu
Predicted rating r of user u for item i:
How to find the dimensions?
SVD: singular value decomposition
u
T
iui pqr ˆ

Matrix Factorization algorithms
each user a vector pu Each item is a vector qi
Usual
Suspects
Titanic
DieHard
Godfather
Jack
Dylan
Olivia
Mark
?
?
?
? ? ?
?
pu
Dim1
Dim2
Jack 3 -1
Dylan 1.4 .2
Olivia -2.5 -.8
Mark -2 -1.5
qi
Usual
Suspects
Titanic
DieHard
Godfather
Dim 1 1.6 -1 5 0.2
Dim 2 1 1 .3 -.2

Two-dimensional Latent feature space and
diversification
Jack
Mark
Olivia Dylan

Greedy Diversity Algorithm
10-dimensional MF model
Take personalized top-200
Low: closest to centroid
Greedy algorithm
Select K items with
highest inter-item distance
(using city-block)
Medium:
select maximally diverse from
100 items closest to centroid
High: from all items in top-200

The algorithm
Measure of density: AFSR
Density/tradeoffs on the features: capture how items are distributed
in the feature space.
Average Factor Score Range (AFSR) based on the density metric
used by Fasolo et al. (2009)
X is set of items i
D is number of features
Captures the distribution of items in the feature space and their
tradeoffs better than standard similarity measures

Selection of initial set
Top-200 was selected as a balanced initial set:
Large differences in AFSR scores for the 3 levels of diversification
High average predicted rating: 4.48
Range in predicted ratings (0.546) lower than error of MF model:
MAE = .656 (RMSE = .854)
Final check: How does attractiveness vary within
and between the sets?
more diverse sets are by nature more likely to capture high-ranked
items…
Set Size Diversification Rating AFSR
5
Low 4.505 0.295
Medium 4.529 0.634
High 4.561 1.210
20
Low 4.537 0.586
Medium 4.558 1.005
High 4.604 1.615

System characteristics
MF recommender based on MyMedia project
10M MovieLens dataset: movies from 1994
5.6M ratings for 70k users and 5.4k movies
RMSE of 0.854, MAE of 0.656
Movies shown with title and predicted rating:
hovering the mouse over the title reveals additional information:
short synopsis, cast, director and image

Study 1a: Check diversification algorithm
Does our diversification affects the subjective experiences
with the recommendations?
Do people perceive the diversity?
Does it affects attractiveness and choice difficulty?
Each participant inspects 3 lists (low, mid and high
diversification), order counterbalanced
No choices made from the list

Study 1a: design/procedure
Pre-questionnaire
Personal characteristics
Rating task to train the system (10 ratings)
Assess three lists of recommendations
Within subjects: low / mid / high diversification
Between subjects: number of items (5,10,15,20,25)
After each list we measured:
Perceived Diversity & Attractiveness
Expected Trade-Off Difficulty & Choice Difficulty

Study 1a: Participants and Manipulation
checks
97 Participants from an online database
Paid for participation
Mean age: 29 years, 52 females and 45 males
Low, medium and high diversification
differed in the feature score range
Average predicted ratings of the sets were not different!
diversity
Average AFSR
(SE)
Avg. Predicted
rating
(SE)
Low 0.959 (0.015) 4.486 (0.042)
Medium 1.273 (0.016) 4.486 (0.041)
High 1.744 (0.024) 4.527 (0.039)

Study 1a: Structural Equation Model

Study 1a: Structural Equation Model
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
low mid high
standardizedscore
diversification
attractiveness diversity
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
low mid high
scaledifference
diversification
choice diff. tradeoff diff.

Study 1a: Conclusions & Discussion
Diversifying on latent features
Increases attractiveness/diversity
Reduces trade-off difficulty (high)
Reduces choice difficulty (linearly)
No evidence for U-shaped difficulty model
High diversity does not result in trade-off conflicts
(perhaps due to the nature of the domain/MF?)
No effect of number on items
Small sets benefit as much from diversification
Diversification on MF features seems
promising to increase attractiveness!
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
low mid high
scaledifference
diversification
choice diff. tradeoff diff.

Study 1b: No choice satisfaction
In Study 1a, no actual choice was made
Explains limited effect of number of items
We could not measure choice satisfaction or justification-based
processes
Diversification and list length as two factors in a new
experiment with choice (and choice satisfaction)
Item size: 5, 10 and 20
Low and high diversity (no medium)
We expect choice difficulty to be more prominent for low
diversity sets

Study 1b: design/procedure
Pre-questionnaire
Personal characteristics
Choose one item from a list of recommendations
Between subjects: 2 levels (low / high diversification)
X 3 lengths (5, 10 or 20 items)
Afterwards we measured:
Perceived Diversity & Attractiveness
Choice Difficulty and Choice satisfaction
Paid for participation Mean age: 29Y, 41F, 46M

Questionnaire-items
Perceived recommendation diversity
5 items, e.g. “The list of movies was varied”
Perceived recommendation attractiveness
5 items, e.g. “The list of recommendations was attractive”
Choice satisfaction
6 items, e.g. “I think I would enjoy watching the chosen movie”
Choice difficulty
5 items, e.g.: “It was easy to select a movie”

Study 1b: Structural Equation Model

Study 1b: Structural Equation Model
-0.5
0.0
0.5
1.0
1.5
2.0
5 10 20
standardizedscore
list length
Diversity
low diversity
high diversity
0.0
0.5
1.0
1.5
2.0
2.5
5 10 20
standardizedscore
list length
Satisfaction
low diversity high diversity

Study 1b: Results
Long list is more difficult (cognitive and objective effort
(hovers)) but also more satisfying in itself
Diversity influences choice satisfaction in important ways
Diversity increases attractiveness and reduces difficulty
These can increase satisfaction (but only for short lists)
Our diversification improved the 5-item list
5 diverse items are as satisfactory as
10 or 20 items and less difficult!
Less effort needed (hovers)
Using latent feature diversification
one does not need long item lists…
0.0
0.5
1.0
1.5
2.0
2.5
5 10 20
standardizedscore
list length
Satisfaction
low diversity high diversity

But…
Our studies show that low or high diversification from the
centroid of Top-200 works
However, these sets were optimized for diversity, not for prediction
accuracy
Most item sets thus not contain the best predicted items
So how does this compare to standard Top-N lists?
Slight modification of our algorithm: diversify starting from the best
predicted item in the top-N set (Top-1) rather than the centroid
Diversification and list length as two factors in a choice
overload experiment
list sizes: 5 and 20
Diversification: none (top 5/20), medium, high

Properties of the item sets
Set size diversification Avg. AFSR Avg. Rank
5
High (α = 1) 1.380 78.284
Medium (α = 0.3) 1.096 7.484
Top-N (α = 0) 0.774 3.000
20
High (α = 1) 1.793 89.380
Medium (α = 0.3) 1.486 17.849
Top-N (α = 0) 1.270 10.500
Algorithm balances between max diversity and highest rank
Every iteration: weigh (1-α) highest rank
against highest diversification (α)
α=0: top-N, α=1: max diverse.
(α=0.3 gives good medium diversity)

Design/procedure Study 2
Choose one item from a list of recommendations
Between subjects: 3 levels of diversification, 2 lengths
Afterwards we measured:
Perceptions: Perceived Diversity & Attractiveness
Experience: Choice Difficulty and Choice satisfaction
Behavior: total views / unique items considered

Perceived Diversity & attractiveness
Perceived Diversity increases with
Diversification
Similarly for 5 and 20 items
Perc. Diversity increases attractiveness
Perceived difficulty goes down with
diversification
Perceived attractiveness goes up
with diversification
Diverse 5 item set excels…
Just as satisfying as 20 items
Less difficult to choose from
Less cognitive load…!
-0.5
0
0.5
1
1.5
none med high
standardizedscore
diversification
Perc. Diversity
5 items
20 items
-0.2
0
0.2
0.4
0.6
0.8
1
none med high
standardizedscore
diversification
Choice Satisfaction
5 items
20 items

Choice Characteristics
Chosen option (mean and std. err)
Set
Diversity
List Position Rating Rank
5 items
None (top 5) 3.60 (0.27) 4.51 (0.07) 3.60 (0.27)
Medium 4.41 (0.59) 4.41 (0.07) 14.52 (5.37)
High 4.19 (0.27) 4.30 (0.07) 77.59 (12.76)
20 items
None (top 20) 10.15 (0.92) 4.45 (0.05) 10.15 (0.92)
Medium 10.33 (1.18) 4.40 (0.08) 17.7 (2.68)
high 9.93 (1.07) 4.16 (0.07) 72.22 (11.84)
With higher diversity, no difference in position of chosen option
Resulting in less ‘optimal’ choice in terms of predicted rating
Without a reduction in choice satisfaction!

Conclusions
Reducing Choice difficulty and overload
Diversity reduces choice difficulty
Less uniform sets are easier to choose from
Latent feature diversification easy to implement
Diversity can improve choice satisfaction
Even when the diversified list has movies with lower
predicted ratings than standard top-N lists
No need for larger item sets
Offering personalized diversified small items sets might be the key
to help decision makers cope with too much choice!
Psychological theory can inform how to improve the
output of Recommender algorithms

What you should take away…
Psychological theory can inform new ways of diversifying
algorithm output or eliciting preferences
But also: working with recommenders and algorithms we could
enhance psychological theory: personalized item sets gives control
User-centric evaluation helps to assess the effectiveness
Lot of work… and we need user studies…
But linking subjective to objective measures might help future
studies that cannot do user studies
User-centric framework allows us to understand WHY
particular approaches work or not
Concept of mediation: user perception helps understanding..

Questions?
Contact:
Martijn Willemsen
@MCWillemsen
M.C.Willemsen@tue.nl
www.martijnwillemsen.nl
Thanks to my co-authors:
Mark Graus
Bart Knijnenburg

Improving user experience in recommender systems

Recommended

Recommended

More Related Content

What's hot

What's hot (12)

Similar to Improving user experience in recommender systems

Similar to Improving user experience in recommender systems (20)

Recently uploaded

Recently uploaded (20)

Improving user experience in recommender systems

Editor's Notes