Offline evaluation of recommender systems: all pain and no gain?



Keynote for the workshop on Reproducibility and Replication in Recommender Systems at ACM RecSys, Hong Kong, 12 October 2013.



  1. Offline Evaluation of Recommender Systems: All pain and no gain? (Mark Levy, Mendeley)
  2. About me
  3. About me
  4. Some things I built
  5. Something I'm building
  6. What is a good recommendation?
  7. What is a good recommendation?
     One that increases the usefulness of your product in the long run
     (WARNING: hard to measure directly)
  8. What is a good recommendation?
     ● One that increased your bottom line:
       – User bought item after it was recommended
       – User clicked ad after it was shown
       – User didn't skip track when it was played
       – User added document to library...
       – User connected with contact...
  9. Why was it good?
 10. Why was it good?
     ● Maybe it was
       – Relevant
       – Novel
       – Familiar
       – Serendipitous
       – Well explained
     ● Note: some of these are mutually incompatible
 11. What is a bad recommendation?
 12. What is a bad recommendation?
     (you know one when you see one)
 13. What is a bad recommendation?
 14. What is a bad recommendation?
 15. What is a bad recommendation?
 16. What is a bad recommendation?
     ● Maybe it was
       – Not relevant
       – Too obscure
       – Too familiar
       – I already have it
       – I already know that I don't like it
       – Badly explained
 17. What's the cost of getting it wrong?
     ● Depends on your product and your users
       – Lost revenue
       – Less engaged user
       – Angry user
       – Amused user
       – Confused user
       – User defects to a rival product
 18. Hypotheses
     Good offline metrics express product goals
     Most (really) bad recommendations can be caught by business logic
 19. Issues
     ● Real business goals concern long-term user behaviour, e.g. Netflix:
       “we have reformulated the recommendation problem to the question of
       optimizing the probability a member chooses to watch a title and
       enjoys it enough to come back to the service”
     ● Usually have to settle for a short-term surrogate
     ● Only some user behaviour is visible
     ● Same constraints when collecting training data
 20. Least bad solution?
     ● “Back to the future” aka historical log analysis
     ● Decide which logged event(s) indicate success
     ● Be honest about “success”
     ● Usually care most about precision @ small k
     ● Recall will discriminate once this plateaus
     ● Expect to have to do online testing too
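A minimal sketch of the precision@k and recall@k metrics this slide refers to, where "success" is whatever logged event you decided counts (function names are mine, not from the talk):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that were logged successes."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all successful items recovered in the top k."""
    if not relevant:
        return 0.0
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant)

# toy example: the relevant set comes from historical log analysis
recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "f"}
p5 = precision_at_k(recommended, relevant, 5)   # 2/5
r5 = recall_at_k(recommended, relevant, 5)      # 2/3
```

Note that precision@k saturates once every top-k slot is a hit, which is why the slide says recall only starts to discriminate between systems after that plateau.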
 21. Making metrics meaningful
     ● Building a test framework + data is hard
     ● Be sure to get best value from your work
     ● Don't use straw man baselines
     ● Be realistic: leave the ivory tower
     ● Make test setups and baselines reproducible
 22. Making metrics meaningful
     ● Old skool k-NN systems are better than you think
       – Input numbers from mining logs
       – Temporal “modelling” (e.g. fake users)
       – Data pruning (scalability, popularity bias, quality)
       – Preprocessing (tf-idf, log/sqrt, …)
       – Hand-crafted similarity metric
       – Hand-crafted aggregation formula
       – Postprocessing (popularity matching)
       – Diversification
       – Attention profile
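The core of such an "old skool" item-to-item k-NN system fits in a few lines. This toy sketch assumes binary implicit feedback and cosine similarity over sets of users, one of the many hand-crafted similarity choices the slide lists; a production system would add the pruning, preprocessing, and postprocessing steps above:

```python
import math
from collections import defaultdict

def item_cosine_knn(user_items, k=2):
    """Item-item k-NN over implicit logs: cosine similarity of the
    binary user vectors of each pair of items."""
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for it in items:
            item_users[it].add(user)
    neighbours = {}
    for a in item_users:
        sims = []
        for b in item_users:
            if a == b:
                continue
            inter = len(item_users[a] & item_users[b])
            if inter:
                sim = inter / math.sqrt(len(item_users[a]) * len(item_users[b]))
                sims.append((sim, b))
        sims.sort(reverse=True)
        neighbours[a] = [b for _, b in sims[:k]]
    return neighbours

# hypothetical mined logs: user -> items consumed
logs = {"u1": ["x", "y"], "u2": ["x", "y", "z"], "u3": ["y", "z"]}
nn = item_cosine_knn(logs, k=1)   # x's nearest neighbour is y
```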
 23. Making metrics meaningful
     ● Measure preference honestly
     ● Predicted items may not be “correct” just because they were consumed once
     ● Try to capture value
       – Earlier recommendation may be better
       – Don't need a recommender to suggest items by the same artist/author
     ● Don't neglect side data
       – At least use it for evaluation / sanity checking
 24. Making metrics meaningful
     ● Public data isn't enough for reproducibility or fair comparison
     ● Need to document preprocessing
     ● Better: release your preparation/evaluation code too
 25. What's the cost of poor evaluation?
 26. What's the cost of poor evaluation?
     Poor offline evaluation can lead to years of misdirected research
 27. Ex 1: Reduce playlist skips
     ● Reorder a playlist of tracks to reduce skips by avoiding “genre whiplash”
     ● Use audio similarity measure to compute transition distance, then travelling salesman
     ● Metric: sum of transition distances (lower is better)
     ● 6 months' work to develop solution
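The travelling-salesman formulation in this example is typically attacked with a heuristic. A sketch with a greedy nearest-neighbour ordering over a hypothetical transition-distance matrix, plus the "sum of transition distances" offline metric that turned out to be misleading:

```python
def greedy_order(dist, start=0):
    """Greedy nearest-neighbour TSP heuristic: from the current track,
    always jump to the closest unvisited track."""
    order = [start]
    unvisited = set(range(len(dist))) - {start}
    while unvisited:
        cur = order[-1]
        nxt = min(unvisited, key=lambda j: dist[cur][j])
        order.append(nxt)
        unvisited.remove(nxt)
    return order

def total_transition_distance(dist, order):
    """The offline metric from the talk: sum of adjacent transition
    distances (lower was assumed to be better)."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

# hypothetical symmetric audio-distance matrix for 4 tracks
d = [[0, 1, 9, 4],
     [1, 0, 2, 8],
     [9, 2, 0, 3],
     [4, 8, 3, 0]]
order = greedy_order(d)                       # [0, 1, 2, 3]
score = total_transition_distance(d, order)   # 1 + 2 + 3 = 6
```

The point of the example is that minimising this number optimises the wrong thing: smooth transitions keep playing music similar to a track the user just rejected.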
 28. Ex 1: Reduce playlist skips
     ● Result: users skipped more often
     ● Why?
 29. Ex 1: Reduce playlist skips
     ● Result: users skipped more often
     ● When a user skipped a track they didn't like, they were played something else just like it
     ● Better metric: average position of skipped tracks (based on logs, lower down is better)
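The better metric proposed here is easy to state in code. The session encoding below (one boolean per played track) is my own, not from the talk:

```python
def mean_skip_position(sessions):
    """Average playlist position at which skips occur; pushing skips
    lower down the playlist (a larger mean position) is better."""
    positions = [pos for session in sessions
                 for pos, skipped in enumerate(session, start=1) if skipped]
    return sum(positions) / len(positions) if positions else float("inf")

# each session is a list of booleans: True = the track was skipped
sessions = [[False, True, False], [False, False, True]]
m = mean_skip_position(sessions)   # (2 + 3) / 2 = 2.5
```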
 30. Ex 2: Recommend movies
     ● Use a corpus of star ratings to improve movie recommendations
     ● Learn to predict ratings for un-rated movies
     ● Metric: average RMSE of predictions for a hidden test set (lower is better)
     ● 2+ years' work to develop new algorithms
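For reference, the RMSE metric this task optimised:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error of rating predictions over a hidden
    test set (lower is better)."""
    errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(errors) / len(errors))

preds = [3.5, 4.0, 2.0]
truth = [4.0, 4.0, 1.0]
score = rmse(preds, truth)
```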
 31. Ex 2: Recommend movies
     ● Result: “best” solutions were never deployed
     ● Why?
 32. Ex 2: Recommend movies
     ● Result: “best” solutions were never deployed
     ● User behaviour correlates with rank, not RMSE
     ● Side datasets an order of magnitude more valuable than algorithm improvements
     ● Explicit ratings are the exception, not the rule
     ● RMSE still haunts research labs
 33. Can contests help?
     ● Good:
       – Great for consistent evaluation
     ● Not so good:
       – Privacy concerns mean obfuscated data
       – No guarantee that metrics are meaningful
       – No guarantee that the train/test framework is valid
       – Small datasets can become overexposed
 34. Ex 3: Yahoo! Music KDD Cup
     ● Largest music rating dataset ever released
     ● Realistic “loved songs” classification task
     ● Data fully obfuscated due to recent lawsuits
 35. Ex 3: Yahoo! Music KDD Cup
     ● Result: researchers hated it
     ● Why?
 36. Ex 3: Yahoo! Music KDD Cup
     ● Result: researchers hated it
     ● Research frontier focussed on audio content and metadata, not joinable to obfuscated ratings
 37. Ex 4: Million Song Challenge
     ● Large music dataset with rich metadata
     ● Anonymized listening histories
     ● Simple item recommendation task
     ● Reasonable MAP@500 metric
     ● Aimed to solve shortcomings of the KDD Cup
     ● Only obfuscation was removal of timestamps
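A sketch matching the shape of the contest's MAP@k metric (the helper names and toy data are mine; the challenge used k=500):

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k: precision at each rank where a hit occurs, averaged over
    min(|relevant|, k)."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(all_recs, all_relevant, k):
    """Mean of AP@k over all test users."""
    aps = [average_precision_at_k(recs, rel, k)
           for recs, rel in zip(all_recs, all_relevant)]
    return sum(aps) / len(aps)

recs = [["a", "b", "c"], ["x", "y", "z"]]
rels = [{"a", "c"}, {"z"}]
m = map_at_k(recs, rels, k=3)   # (5/6 + 1/3) / 2 = 7/12
```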
 38. Ex 4: Million Song Challenge
     ● Result: winning entry didn't use side data
     ● Why?
 39. Ex 4: Million Song Challenge
     ● Result: winning entry didn't use side data
     ● No timestamps, so test tracks chosen at random
     ● So “people who listen to A also listen to B”
     ● Traditional item similarity solves this well
     ● More honesty about “success” might have shown that the contest data was flawed
 40. Ex 5: Yelp RecSys Challenge
     ● Small business review dataset with side data
     ● Realistic mix of input data types
     ● Rating prediction task
     ● Informal procedure to create train/test sets
 41. Ex 5: Yelp RecSys Challenge
     ● Result: baseline algorithms high up the leaderboard
     ● Why?
 42. Ex 5: Yelp RecSys Challenge
     ● Result: baseline algorithms high up the leaderboard
     ● Train/test split was corrupt
     ● Competition organisers moved fast to fix this
     ● But that left only one week before the deadline
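A corrupt split of this kind is cheap to guard against. A minimal sanity check, assuming events are represented as (user, item) pairs: no pair may appear in both halves.

```python
def split_is_clean(train_pairs, test_pairs):
    """Leakage check: True iff no (user, item) event appears in both
    the train and the test set."""
    return not (set(train_pairs) & set(test_pairs))

train = [("u1", "a"), ("u2", "b")]
good_test = [("u1", "c")]
bad_test = [("u2", "b")]   # leaks a training event
ok = split_is_clean(train, good_test)    # True
bad = split_is_clean(train, bad_test)    # False
```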
 43. Ex 6: MIREX Audio Chord Estimation
     ● Small dataset of audio tracks
     ● Task: label tracks with predicted chord symbols
     ● Human-labelled data hard to come by
     ● Contest hosted by the premier forum in the field
     ● Evaluate frame-level prediction accuracy
     ● Historical glass ceiling around 80%
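Frame-level accuracy is the simplest metric in this list: a sketch assuming both the predicted and the human-labelled chord sequences are sampled on the same fixed frame grid:

```python
def frame_accuracy(predicted, reference):
    """Fraction of frames whose predicted chord label matches the
    human-annotated label."""
    assert len(predicted) == len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

pred = ["C", "C", "G", "Am", "F"]
ref  = ["C", "C", "G", "G",  "F"]
acc = frame_accuracy(pred, ref)   # 4/5 = 0.8, around the glass ceiling
```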
 44. Ex 6: MIREX Audio Chord Estimation
     ● Result: 2011 winner ftw
     ● Why?
 45. Ex 6: MIREX Audio Chord Estimation
     ● Result: 2011 winner ftw
     ● Spoof entry relying on the known test set
     ● Protest against inadequate test data
     ● Other research showed weak generalisation of winning algorithms from the same contest
     ● Next year results dropped significantly
 46. So why evaluate offline at all?
     ● Building a test framework ensures clear goals
     ● Avoid wishful thinking if your data is too thin
     ● Be efficient with precious online testing
       – Cut down huge parameter space
       – Don't alienate users
     ● Need to publish
     ● Pursuing science as well as profit
 47. Online evaluation is tricky too
     ● No off-the-shelf solution for services
     ● Many statistical gotchas
     ● Same mismatch between short-term and long-term success criteria
     ● Results open to interpretation by management
     ● Can make incremental improvements look good when radical innovation is needed
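One concrete example of a statistical gotcha: reading a significance test before enough traffic has accumulated (or repeatedly, which inflates false positives). The standard two-proportion z statistic for comparing A/B conversion rates, as a sketch:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z statistic for the difference between two conversion rates,
    using the pooled-proportion standard error."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# hypothetical A/B counts: |z| < 1.96 means not significant at 5%
z = two_proportion_z(120, 1000, 100, 1000)
```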
 48. Ex 7: Article Recommendations
     ● Recommender for related research articles
     ● Massive download logs available
     ● Framework developed based on co-downloads
     ● Aim to improve on existing search solution
     ● Management “keen for it to work”
     ● Several weeks of live A/B testing available
     ● No offline evaluation
 49. Ex 7: Article Recommendations
     ● Result: worse than similar-title search
     ● Why?
 50. Ex 7: Article Recommendations
     ● Result: worse than similar-title search
     ● Inadequate business rules, e.g. often suggesting other articles from the same publication
     ● Users identified only by organisational IP range, so value of “big data” very limited
     ● Establishing an offline evaluation protocol would have shown these problems in advance
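The missing business rule here can be a small post-filter over the candidate list. A sketch capping same-publication suggestions; the field names and the cap are hypothetical:

```python
def apply_business_rules(candidates, query_article, max_same_publication=1):
    """Post-filter recommendations: keep candidate order but allow at
    most `max_same_publication` articles from the query article's own
    publication."""
    kept, same_pub = [], 0
    for art in candidates:
        if art["publication"] == query_article["publication"]:
            if same_pub >= max_same_publication:
                continue
            same_pub += 1
        kept.append(art)
    return kept

query = {"id": 0, "publication": "J. Foo"}
cands = [{"id": 1, "publication": "J. Foo"},
         {"id": 2, "publication": "J. Foo"},
         {"id": 3, "publication": "J. Bar"}]
ids = [a["id"] for a in apply_business_rules(cands, query)]   # [1, 3]
```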
 51. Isn't there software for that?
     Rules of the game:
     – Model fit metrics (e.g. validation loss) don't count
     – Need a transparent “audit trail” of data to support genuine reproducibility
     – Just using public datasets doesn't ensure this
 52. Isn't there software for that?
     Wish list for reproducible evaluation:
     – Integrate with recommender implementations
     – Handle data formats and preprocessing
     – Handle splitting, cross-validation, side datasets
     – Save everything to file
     – Work from file inputs so not tied to one framework
     – Generate meaningful metrics
     – Well documented and easy to use
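The "save everything to file" and "handle splitting" items can be combined in a small sketch: a seeded split whose parameters and outputs are all written down, so rerunning with the same seed reproduces it exactly. All names and the JSON layout here are my own invention, not mrec's:

```python
import json
import os
import random
import tempfile

def make_reproducible_split(events, test_fraction=0.2, seed=42):
    """Seeded train/test split that records its own parameters and
    output to a JSON file, giving an audit trail for the evaluation."""
    rng = random.Random(seed)
    shuffled = events[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    split = {"seed": seed, "test_fraction": test_fraction,
             "train": shuffled[:cut], "test": shuffled[cut:]}
    path = os.path.join(tempfile.mkdtemp(), "split.json")
    with open(path, "w") as f:
        json.dump(split, f, indent=2)
    return path, split

events = [["u%d" % i, "item%d" % i] for i in range(10)]
path, split = make_reproducible_split(events)
# rerunning with the same seed reproduces the same split
_, split2 = make_reproducible_split(events)
```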
 53. Isn't there software for that?
     Current offerings:
     ● GraphChi/GraphLab
     ● Mahout
     ● LensKit
     ● MyMediaLite
 54. Isn't there software for that?
     Current offerings:
     ● GraphChi/GraphLab
       – Model validation loss: doesn't count
     ● Mahout
       – Only rating prediction accuracy: doesn't count
     ● LensKit
       – Too hard to understand: won't use
 55. Isn't there software for that?
     Current offerings:
     ● MyMediaLite
       – Reports meaningful metrics
       – Handles cross-validation
       – Data splitting not transparent
       – No support for pre-processing
       – No built-in support for standalone evaluation
       – API is capable but current utils don't meet the wish list
 56. Eating your own dog food
     ● Built a small framework around a new algorithm
     ● https://github.com/mendeley/mrec
       – Reports meaningful metrics
       – Handles cross-validation
       – Supports simple pre-processing
       – Writes everything to file for reproducibility
       – Provides API and utility scripts
       – Runs standalone evaluations
       – Readable Python code
 57. Eating your own dog food
     ● Some lessons learned
       – Usable frameworks are hard to write
       – Tradeoff between clarity and scalability
       – Should generate explicit validation sets
     ● Please contribute!
     ● Or use as inspiration to improve existing tools
 58. Where next?
     ● Shift evaluation online:
       – Contests based around online evaluation
       – Realistic but not reproducible
       – Could some run continuously?
     ● Recommender systems as a commodity:
       – Software and services reaching maturity now
       – Business users can tune/evaluate themselves
       – Is there a way to report results?
 59. Where next?
     ● Support alternative query paradigms:
       – More like this, less like that
       – Metrics for dynamic/online recommenders
     ● Support recommendation with side data:
       – LibFM, GenSGD, WARP research @google, …
       – Open datasets?
 60. Thanks for listening
     mark.levy@mendeley.com
     @gamboviol
     https://github.com/gamboviol
     https://github.com/mendeley/mrec