4. + 4
Large vs small datasets
Everything is significant!
Data from most/all of your customers
More than just an educated guess
This is what really happens!
Large datasets can improve business intelligence
5. + 5
The Netflix challenge
Recommendations seen as Netflix's strongest asset
$1M prize if 10% better than Netflix's Cinematch
2006-2009
Data: 18k movies, 500k users, 100M ratings
6. + 6
The Netflix challenge
Netflix’s rationale:
“Improve our ability to connect people to the movies they love”
Improve recommendations = improve satisfaction and retention
Small R&D team, slow progress
$1M will pay for itself
Based on Padhraic Smyth’s report at
http://www.ics.uci.edu/~smyth/courses/cs277/slides/netflix_overview.pdf
7. + 7
Matrix approximation
Distinguish noise from signal: variance and eigenvalues
Singular value decomposition
Ratings (m × n) = U (m × n) · Σ (n × n) · V (n × n)
Rank-k approximation
Ratings (m × n) ≈ U (m × k) · Σ (k × k) · V (k × n)
[Figure: the Ratings matrix (m users × n movies) factored into U (m users × k), Σ (k × k), and V (k × n movies)]
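The rank-k approximation on this slide can be sketched with NumPy. This is a minimal illustration on a made-up, fully observed 4 × 3 matrix; the function name `rank_k_approximation` and the toy numbers are assumptions, not part of the talk. Real rating data is mostly missing entries, so a plain SVD like this one does not apply directly to it.

```python
import numpy as np

def rank_k_approximation(ratings: np.ndarray, k: int) -> np.ndarray:
    """Return the rank-k approximation of a dense ratings matrix."""
    # full_matrices=False yields U (m x r), s (r,), Vt (r x n) with r = min(m, n)
    u, s, vt = np.linalg.svd(ratings, full_matrices=False)
    # Keep only the k largest singular values and their vectors
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Toy matrix: 4 users x 3 movies (hypothetical numbers, no missing ratings)
ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

approx = rank_k_approximation(ratings, k=2)
print(np.round(approx, 2))
```

The discarded singular values are treated as noise; the rows of the truncated V factor are what a "plot of V with k=2" would display, one point per movie.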
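8. + 8
Plot of V with k=2
[Figure: movies plotted on two factors. One axis runs from lowbrow comedies and horror for a male or adolescent audience to serious drama and comedy with a strong female lead. The other axis runs from independent, quirky, critically acclaimed films to mainstream, formulaic films.]
[Koren et al. 2009]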
10. + 10
Take-aways
Matrix decomposition
Meaningful movie categories!
For example: lowbrow, quirky, indie, strong female lead
Older movies are rated higher
So ...?
Should recommend older movies more often or less often?
Why are they rated higher?
11. +
The Perils
of Big Data
How overfitting and
a lack of domain knowledge
can lead to suboptimal solutions
12. + 12
What about random?
“We were demonstrating our new recommender to a client.
They were amazed by how well it predicted their preferences!”
“Later we found out that we forgot to activate the algorithm: the
system was giving completely random recommendations.”
14. + 14
Model complexity
“Our winning entries consist of more than 100 different
predictor sets” [Koren et al. 2009]
Only 10% better than Netflix
Why?
Intrinsic noise
Example: children watch cartoons, Mum is recommended cartoons
Should Netflix implement a “switch user” feature?
Domain knowledge!
15. + 15
More gotchas
Obvious truisms and correlation fallacies
Still present in large datasets
Domain knowledge!
Overfitting: simple models that make sense vs complex models
that fit the data
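The overfitting point can be illustrated with a small sketch. The data here is synthetic (a noisy straight line), and the degrees, sample sizes, and noise level are arbitrary assumptions chosen for illustration. A degree-9 polynomial always fits the training points at least as well as a straight line, but that says nothing about how it does on fresh data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic ground truth: y = 2x + 1 plus Gaussian noise
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, n)
    return x, y

def rmse(model, x, y):
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 9):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (rmse(model, x_train, y_train), rmse(model, x_test, y_test))
    print(f"degree {degree}: train RMSE {results[degree][0]:.3f}, "
          f"test RMSE {results[degree][1]:.3f}")
```

Comparing the two printed rows shows the gap between fitting the data and predicting new data, which is the distinction the slide draws.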
17. + 17
Offline evaluations
Calibration/Evaluation
Gather rating data
Remove 10% of the ratings of each user
Optimize the algorithm to predict those 10%
Execution
Predict the rating of unknown items
Recommend items with highest predicted rating
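The offline procedure above can be sketched as follows. This is a simplified illustration, not the Netflix setup: the toy ratings are made up, all function names are hypothetical, and the predictor is a plain item-mean baseline rather than an optimized algorithm.

```python
import math
import random

def split_per_user(ratings, holdout=0.1, seed=0):
    """Hold out a fraction of each user's ratings. ratings: {user: {item: rating}}."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in ratings.items():
        item_ids = sorted(items)
        rng.shuffle(item_ids)
        # Hold out at least one rating, unless the user has only one
        n_test = max(1, round(holdout * len(item_ids))) if len(item_ids) > 1 else 0
        test[user] = {i: items[i] for i in item_ids[:n_test]}
        train[user] = {i: items[i] for i in item_ids[n_test:]}
    return train, test

def fit_item_means(train):
    """Baseline predictor: an item's mean training rating, else the global mean."""
    totals, counts = {}, {}
    for items in train.values():
        for item, rating in items.items():
            totals[item] = totals.get(item, 0.0) + rating
            counts[item] = counts.get(item, 0) + 1
    global_mean = sum(totals.values()) / sum(counts.values())
    means = {item: totals[item] / counts[item] for item in totals}
    return lambda user, item: means.get(item, global_mean)

def rmse(predict, test):
    """Root mean square error over all held-out ratings."""
    errors = [(predict(u, i) - r) ** 2
              for u, items in test.items() for i, r in items.items()]
    return math.sqrt(sum(errors) / len(errors))

def recommend(user, train, all_items, predict, n=3):
    """Rank items the user has no training rating for by predicted rating."""
    unseen = [i for i in all_items if i not in train[user]]
    return sorted(unseen, key=lambda i: predict(user, i), reverse=True)[:n]

# Toy data: 3 users x 10 movies (hypothetical ratings on a 1-5 scale)
all_items = [f"m{m}" for m in range(10)]
ratings = {f"u{u}": {f"m{m}": 1 + (u + m) % 5 for m in range(10)} for u in range(3)}

train, test = split_per_user(ratings)
predict = fit_item_means(train)
print("held-out RMSE:", round(rmse(predict, test), 3))
print("recommendations for u0:", recommend("u0", train, all_items, predict))
```

In the calibration step, one would swap different predictors into `rmse` and keep the one with the lowest held-out error; the execution step is the `recommend` call.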
18. + 18
Offline evaluations
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
Problem: Offline evaluations may not give the same outcome as online evaluations (Cosley et al., 2002; McNee et al., 2002)
Solution: Test with real users (A/B testing)
Problem: Higher rating does not mean good recommendation (McNee et al., 2006)
Solution: Consider other behaviors (consumption, retention)
Problem: The algorithm counts for only 5% of the relevance of a recommender system (Francisco Martin, 2009)
Solution: A/B test other aspects (interaction, presentation)
19. + 19
Online evaluations
Testing a recommender against a random videoclip system (A/B test)
Expectation: Consumption will increase
Reality: The number of clicked clips and total viewing time went down!
Insight: Recommender is more effective
More clips watched from beginning to end
Users browse less, consume more
[Figure: path model relating personalized recommendations (OSA), perceived recommendation quality (SSA), perceived system effectiveness (EXP), and choice satisfaction (EXP) to three behaviors: number of clips watched from beginning to end, total viewing time, and number of clips clicked]
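One way to compare a behavioral measure between the two arms of an A/B test like this is a permutation test. The sketch below is an assumption about how such a comparison could be run, not the analysis used in the study; the per-user counts are invented, and `permutation_test` is a hypothetical helper.

```python
import numpy as np

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means (b minus a)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = b.mean() - a.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign users to the two arms and recompute the difference
        shuffled = rng.permutation(pooled)
        diff = shuffled[len(a):].mean() - shuffled[:len(a)].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the p-value strictly positive
    return observed, (extreme + 1) / (n_permutations + 1)

# Hypothetical data: clips watched from beginning to end, per user
random_arm = [1, 2, 1, 3, 2, 1, 2, 2, 1, 3]
recommender_arm = [3, 4, 2, 5, 4, 3, 4, 3, 5, 4]

difference, p_value = permutation_test(random_arm, recommender_arm)
print(f"mean difference: {difference:.2f}, p-value: {p_value:.4f}")
```

The same function can be applied to each behavior separately (clips clicked, viewing time, clips watched to the end), which is how opposite movements in different measures would show up.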
20. + 20
Behavior vs Questionnaires
Behavior is hard to interpret
Relationship between behavior and satisfaction is not always trivial
Questionnaires are a better predictor of long-term retention
With behavior only, you will need to run for a long time
Questionnaire data is more robust
Fewer participants needed
21. + 21
A guide to user experiments
http://bit.ly/recsys2011short http://bit.ly/recsystutorialhandout
“Is my system good?”
What does good mean?
We need to define measures
“Does my system score high on this satisfaction scale?”
What does high mean?
We need to compare it against something
“Does my system score higher than this other system?”
Say we find that it scores higher on satisfaction... why does it?
Apply the concept of ceteris paribus
22. + 22
An example…
We compared three recommender systems
Three different algorithms
System effectiveness scale:
The system has no real benefit for me.
I would recommend the system to others.
The system is useful.
I can save time using the system.
I can find better TV programs without the help of the system.
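A simple way to turn answers on a scale like this into one number is to average the items after reverse-coding the negatively worded ones. In this sketch the first and last items are treated as negatively worded; that choice, the 5-point response format, and the plain average are all assumptions, and the study itself may have scored the scale differently.

```python
# (item text, reverse-coded?) -- reverse flags are an assumption based on wording
ITEMS = [
    ("The system has no real benefit for me.", True),
    ("I would recommend the system to others.", False),
    ("The system is useful.", False),
    ("I can save time using the system.", False),
    ("I can find better TV programs without the help of the system.", True),
]

def scale_score(responses, points=5):
    """Average the item responses (1..points) after reverse-coding negative items."""
    if len(responses) != len(ITEMS):
        raise ValueError("expected one response per item")
    adjusted = [
        (points + 1 - answer) if reverse else answer
        for answer, (_, reverse) in zip(responses, ITEMS)
    ]
    return sum(adjusted) / len(adjusted)

# Hypothetical participant who disagrees with the negative items, agrees with the rest
print(scale_score([1, 5, 4, 5, 2]))
```

Scores computed this way for each system can then be compared between conditions, which addresses the "what does high mean?" question by always scoring relative to another system.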
23. + 23
An example…
The mediating variables tell the entire story
24. + 24
An example…
[Figure, two conditions: Matrix Factorization recommender with explicit feedback (MF-E) versus generally most popular (GMP), and Matrix Factorization recommender with implicit feedback (MF-I) versus generally most popular (GMP). Both are OSA manipulations with positive (+) effects. Path: perceived recommendation variety (SSA) → (+) perceived recommendation quality (SSA) → (+) perceived system effectiveness (EXP)]
25. +
A Note on Privacy
How to avoid
this looming danger
of our Big Data future
27. + 27
Privacy concerns
Second Netflix challenge
Anonymized dataset
Lawsuit from Californian closeted lesbian Mum
Netflix withdraws their second challenge
http://arstechnica.com/tech-policy/2012/07/class-action-lawsuit-settlement-forces-netflix-privacy-changes/
28. + 28
Privacy directive
Transparency
“companies should provide clear descriptions of [...] why they need the data, how they will use it”
Informed consent
Control
“companies should offer consumers clear and simple choices [...] about personal data collection, use, and disclosure”
User empowerment
30. + 30
Control Paradox
“bewildering tangle of options” (New York Times, 2010)
“labyrinthian controls” (U.S. Consumer Magazine, 2012)
Researchers asked: “what do your privacy settings mean?”
86% of Facebook users got it wrong!
31. + 31
Control Paradox
http://bit.ly/chi2013privacy
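Introducing an “extreme” sharing option
Nothing - City - Block
Add the option Exact
Expected:
Some will choose Exact instead of Block
Unexpected:
Sharing increases across the board!
[Figure: the options N (Nothing), C (City), B (Block), and E (Exact) plotted with benefits on one axis and privacy on the other]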
32. + 32
Bounded rationality
[Figure: four options, each marked with a question mark: A 25%, B 37%, C 53%, D 0%]
33. + 33
Idea: nudging
People do not always choose what is best for them
Idea: use defaults to “nudge” users in the right direction
34. + 34
What is the right direction?
“More information = better, e.g. for personalization”
Techniques to increase disclosure cause reactance in the more privacy-minded users
“Privacy is an absolute right”
More difficult for less privacy-minded users to enjoy the benefits that disclosure would provide
35. + 35
It depends on the user!
“What is best for consumers depends upon characteristics of the consumer
An outcome that maximizes consumer welfare may be suboptimal for some consumers in a context where there is heterogeneity in preferences”
(Smith, Goldstein & Johnson, 2009)
36. + 36
Privacy Adaptation Procedure
http://bit.ly/privdim
Idea:
Personalize users’ privacy settings!
Automatic defaults in line with “disclosure profile”
Using big data to improve big data privacy
Relieves some of the burden of the privacy decision:
The right privacy-related information
The right amount of control
“Realistic empowerment”
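The idea of defaults that follow a "disclosure profile" can be sketched as a nearest-profile lookup. Everything concrete here is hypothetical: the item names, the three profiles and their disclosure rates, the 0.5 threshold, and the function names are illustrative assumptions, not the procedure described at the link above.

```python
# Hypothetical items a service might ask a user to disclose
ITEMS = ["location", "contacts", "browsing_history"]

# Hypothetical profiles: fraction of users in each profile who disclose each item
PROFILES = {
    "low_discloser": {"location": 0.1, "contacts": 0.1, "browsing_history": 0.2},
    "location_only": {"location": 0.9, "contacts": 0.1, "browsing_history": 0.3},
    "high_discloser": {"location": 0.9, "contacts": 0.8, "browsing_history": 0.9},
}

def nearest_profile(known_choices):
    """Pick the profile closest to the user's known choices (1 = disclosed, 0 = withheld)."""
    def distance(profile):
        return sum((profile[item] - choice) ** 2 for item, choice in known_choices.items())
    return min(PROFILES, key=lambda name: distance(PROFILES[name]))

def default_settings(known_choices, threshold=0.5):
    """Suggest defaults for undecided items from the nearest profile's disclosure rates."""
    name = nearest_profile(known_choices)
    defaults = {
        item: PROFILES[name][item] >= threshold
        for item in ITEMS if item not in known_choices
    }
    return name, defaults

# A user who shared location but withheld contacts; browsing history is undecided
print(default_settings({"location": 1, "contacts": 0}))
```

The user can still override every default, so the sketch reduces the number of decisions without removing control.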
37. +
Conclusions
The wonders of Big Data
Big Data can be used to create powerful personalized e-commerce experiences
The Perils of Big Data
Big Data solutions will only work if the developers have an adequate amount of domain knowledge
User Experiments
Big Data solutions need to be tested on real users, with a focus on user experience
A Note on Privacy
Big Data can raise privacy concerns, but it can at the same time be used to alleviate these concerns
38. +
Questions?
Editor's Notes
The wonders of Big Data: How Big Data will put the personal back in e-commerce
The Perils of Big Data: How overfitting and a lack of domain knowledge can lead to suboptimal solutions
User Experiments: How user evaluations can be used to create meaningful experiences
A Note on Privacy: How to avoid this looming danger of our Big Data future
Improvement means reducing the error in predicting user ratings. Error = root mean square error between system rating and user rating.
Older movies have higher average rating.
ASK QUESTIONS?
Averages are understandable. Bayes and multinomial maybe. Leaders’ models not at all!
Nobody will use these hybrids in a real system
ASK QUESTIONS?
We have a “ground truth” problem. Easy to overfit models on some quirk in the data. We want to make sure we adapt to general human behavior, and ultimately, that we make our users happy.
Framework for user-centric evaluation, using the example of recommender systems.
If we just have more accurate algorithms, our recommendations will automatically be better!
Also link to Xavier’s blog posts about Netflix
Ask who knows A/B testing
But even that is not enough
ASK QUESTIONS?
Also add the Target horror story
I think transparency and control will not help because people are kind of broken.
Transparency should make people avoid bad privacy practices and endorse good privacy practices
Control is an illusion, because we can easily influence people’s decisions
People are boundedly rational. Here is another example:
This idea is interesting, because if people don’t choose what is best for them, then why don’t we just push them in the right direction?