Video available here: http://www.youtube.com/watch?v=1jHxGCl8RXc
Recommender systems research is often based on comparisons of predictive accuracy: the better the evaluation scores, the better the recommender.
However, it is difficult to compare results from different recommender systems due to the many options in design and implementation of an evaluation strategy.
Additionally, algorithmic implementations can diverge from the standard formulation due to manual tuning and modifications that work better in some situations.
In this work we compare common recommendation algorithms as implemented in three popular recommendation frameworks.
To provide a fair comparison, we have complete control of the evaluation dimensions being benchmarked: dataset, data splitting, evaluation strategies, and metrics.
We also include results using the internal evaluation mechanisms of these frameworks.
Our analysis points to large differences in recommendation accuracy across frameworks and strategies, i.e., the same baselines may perform orders of magnitude better or worse across frameworks.
Our results show the necessity of clear guidelines when reporting evaluation of recommender systems to ensure reproducibility and comparison of results.
Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks
1. Comparative Recommender System Evaluation:
Benchmarking Recommendation Frameworks
Alan Said
@alansaid
TU Delft
Alejandro Bellogín
@abellogin
Universidad Autónoma de Madrid
ACM RecSys 2014
Foster City, CA, USA
2. A RecSys paper outline
– We have a new model – it’s great
– We used %DATASET% 100k to evaluate it
– It’s 10% better than our baseline
– It’s 12% better than [Authors, 2010]
5. What are the differences?
Some things just work differently
• Data splitting
• Algorithm design (implementation)
• Algorithm optimization
• Parameter values
• Evaluation
• Relevance/ranking
• Software architecture
• etc.
Different design choices!
How do these choices affect evaluation results?
6. Evaluate evaluation
• Comparison of frameworks
• Comparison of implementations
• Comparison of results
• Objective benchmarking
8. There’s more than algorithms though
There’s the data, evaluation, and more
Data splits
• 80-20 cross-validation
• Random cross-validation
• User-based cross-validation
• Per-user splits
• Per-item splits
• Etc.
Evaluation
• Metrics
• Relevance
• Strategies
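How the data is split already matters. As a rough illustration of the difference between a single random cut and a per-user ratio split, here is a minimal Java sketch; the class, record, and method names are hypothetical, and this is not code from the paper or from any of the frameworks.

```java
import java.util.*;

/** Hypothetical sketch of two split strategies; not RiVal/Mahout/LensKit code. */
public class SplitSketch {

    /** A single (user, item, rating) triple. */
    record Rating(long user, long item, double value) {}

    /** Global random 80-20 split: shuffle all ratings once and cut. */
    static List<List<Rating>> randomSplit(List<Rating> data, double trainRatio, Random rnd) {
        List<Rating> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, rnd);
        int cut = (int) (shuffled.size() * trainRatio);
        return List.of(shuffled.subList(0, cut), shuffled.subList(cut, shuffled.size()));
    }

    /** Per-user ratio split: every user keeps trainRatio of their own ratings in training. */
    static List<List<Rating>> perUserSplit(List<Rating> data, double trainRatio, Random rnd) {
        Map<Long, List<Rating>> byUser = new HashMap<>();
        for (Rating r : data) byUser.computeIfAbsent(r.user(), u -> new ArrayList<>()).add(r);
        List<Rating> train = new ArrayList<>(), test = new ArrayList<>();
        for (List<Rating> ratings : byUser.values()) {
            Collections.shuffle(ratings, rnd);
            int cut = (int) (ratings.size() * trainRatio);
            train.addAll(ratings.subList(0, cut));
            test.addAll(ratings.subList(cut, ratings.size()));
        }
        return List.of(train, test);
    }
}
```

With a single global cut some users may end up with no test ratings at all, while a per-user split keeps test data for every user with enough ratings – a difference that already affects user coverage and any metric averaged per user.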
9. Real world examples
Movielens 1M [Cremonesi et al., 2010]
Movielens 1M [Yin et al., 2012]
13. We need a fair and common evaluation protocol!
14. Reproducible evaluation - Benchmarking
Control all parts of the process:
- Data splitting strategy
- Recommendation (black box)
- Candidate item generation (what items to test)
- Evaluation

Pipeline: Split → Recommend → Candidate items → Evaluate

Split – select strategy
• By time
• Cross-validation
• Random
• Ratio

Recommend – select framework
• Apache Mahout
• LensKit
• MyMediaLite
and select algorithm
• Tune settings

Candidate items – define strategy
• What is the ground truth
• What users to evaluate
• What items to evaluate

Evaluate – select error metrics
• RMSE, MAE
and ranking metrics
• nDCG, Precision/Recall, MAP

http://rival.recommenders.net
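To make "control all parts of the process" concrete, the sketch below models the four stages as plain Java interfaces, with the recommender kept as a black box. This is a hypothetical skeleton under assumed names, not the RiVal API (see the URL above for the real toolkit); RMSE is shown as one example error metric.

```java
import java.util.*;

/** Illustrative skeleton of the controlled pipeline: Split -> Recommend -> Candidate items -> Evaluate. */
public class PipelineSketch {

    record Rating(long user, long item, double value) {}
    record Split(List<Rating> train, List<Rating> test) {}

    interface Splitter { Split split(List<Rating> data); }                   // by time, cross-validation, random, ratio
    interface Recommender { double predict(long user, long item); }          // black box: Mahout, LensKit, MyMediaLite, ...
    interface CandidateItems { Set<Long> candidates(long user, Split s); }   // which items to rank for each user
    interface Metric { double score(Recommender r, Split s, CandidateItems c); } // RMSE, MAE, nDCG, P/R, MAP, ...

    /** Example error metric: RMSE over the test ratings (assumes a non-empty test set). */
    static double rmse(Recommender rec, List<Rating> test) {
        double sumSq = 0;
        for (Rating r : test) {
            double err = rec.predict(r.user(), r.item()) - r.value();
            sumSq += err * err;
        }
        return Math.sqrt(sumSq / test.size());
    }
}
```

Fixing the splitter, the candidate item strategy, and the metric implementations outside the frameworks is what makes numbers obtained from different frameworks comparable.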
25. In conclusion
• Design choices matter!
– Some more than others
• Evaluation needs to be documented
• Cross-framework comparison is not easy
– You need to have control!
27. How did we do this?
RiVal – an evaluation toolkit for RecSys
• http://rival.recommenders.net
• http://github.com/recommenders/rival
• RiVal demo later today
• On Maven Central!
• RiVal was also used for this year’s RecSys Challenge
– www.recsyschallenge.com
Let’s start by talking about what benchmarking is.
Mainly, we want to focus on the difference between benchmarking and evaluation. When we evaluate, we evaluate one system in the specific context of our evaluation. But what about benchmarking?
Benchmarking is when we compare systems against each other.
But, we need to know what it is that we are comparing – or benchmarking – and we need to know that we’re doing it correctly.
Currently, there are several popular recommendation frameworks.
The three most commonly used at RecSys – Mahout, MyMediaLite and LensKit – offer very similar recommendation setups (and evaluation solutions).
In this paper, we ran experiments to see how comparable the frameworks are to each other.
What we did in this paper
How comparable are the frameworks and results
The only thing we know here is that they used ML1M and PureSVD. We don’t know the rest of the setup, and we don’t know the software – the actual implementation of the algorithm.
Because of that we compared the three most popular frameworks to each other, using three basic algorithms
Using each framework’s internal evaluators!!!
Common input: the same dataset, the same algorithms.
Mahout does not have an evaluator that can compute RMSE and nDCG at the same time.
MyMediaLite evaluates either ranking or rating prediction.
You control the data split, the evaluation, the metric
When you can control this, you can examine every factor, e.g. different splits in the same framework, different evaluation strategies, etc.
Zoom in on one set of results (first quadrant)
Candidate items: what am I going to rank
TI: rank all items for each user (except those in the training set for that user)
UT: only rank items in the user’s test set – this is what you do when you do rating prediction
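A minimal Java sketch of these two candidate item strategies (the TI/UT labels follow the notes above; the code itself is hypothetical helper code):

```java
import java.util.*;

/** Hypothetical sketch of the TI and UT candidate item strategies. */
public class CandidateItemsSketch {

    /** TI: rank every item in the catalog except those in the user's training set. */
    static Set<Long> trainItems(long user, Set<Long> allItems, Map<Long, Set<Long>> trainItemsByUser) {
        Set<Long> candidates = new HashSet<>(allItems);
        candidates.removeAll(trainItemsByUser.getOrDefault(user, Set.of()));
        return candidates;
    }

    /** UT: rank only the items that appear in the user's test set (the rating-prediction setting). */
    static Set<Long> userTest(long user, Map<Long, Set<Long>> testItemsByUser) {
        return testItemsByUser.getOrDefault(user, Set.of());
    }
}
```

Ranking metrics computed over these two candidate sets are not comparable: UT only asks the recommender to order the few items the user actually rated in the test set, while TI asks it to rank those items against the whole catalog.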
User coverage is the percentage of users in the dataset that we can recommend to.
Item coverage is the percentage of items in the dataset that we can recommend.
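A small Java sketch of these two coverage figures (hypothetical helpers, assuming recommendations are available as a per-user list of item ids):

```java
import java.util.*;

/** Hypothetical coverage helpers: share of users we can recommend to, share of items we can recommend. */
public class CoverageSketch {

    /** User coverage: fraction of all users that received at least one recommendation. */
    static double userCoverage(Map<Long, List<Long>> recommendations, Set<Long> allUsers) {
        long covered = allUsers.stream()
                .filter(u -> !recommendations.getOrDefault(u, List.of()).isEmpty())
                .count();
        return (double) covered / allUsers.size();
    }

    /** Item coverage: fraction of all items that appear in at least one recommendation list. */
    static double itemCoverage(Map<Long, List<Long>> recommendations, Set<Long> allItems) {
        Set<Long> recommended = new HashSet<>();
        recommendations.values().forEach(recommended::addAll);
        recommended.retainAll(allItems);
        return (double) recommended.size() / allItems.size();
    }
}
```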
Difficult to say.
It depends on the strategy!
You need to set the evaluation context beforehand.