Video available here: http://www.youtube.com/watch?v=1jHxGCl8RXc
Recommender systems research is often based on comparisons of predictive accuracy: the better the evaluation scores, the better the recommender.
However, it is difficult to compare results from different recommender systems due to the many options in design and implementation of an evaluation strategy.
Additionally, algorithmic implementations can diverge from the standard formulation due to manual tuning and modifications that work better in some situations.
In this work we compare common recommendation algorithms as implemented in three popular recommendation frameworks.
To provide a fair comparison, we have complete control of the evaluation dimensions being benchmarked: dataset, data splitting, evaluation strategies, and metrics.
We also include results using the internal evaluation mechanisms of these frameworks.
Our analysis points to large differences in recommendation accuracy across frameworks and strategies, i.e., the same baselines may perform orders of magnitude better or worse across frameworks.
Our results show the necessity of clear guidelines when reporting evaluation of recommender systems to ensure reproducibility and comparison of results.
Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks
1. Comparative Recommender System Evaluation:
Benchmarking Recommendation Frameworks
Alan Said
@alansaid
TU Delft
Alejandro Bellogín
@abellogin
Universidad Autónoma de Madrid
ACM RecSys 2014
Foster City, CA, USA
2. A RecSys paper outline
– We have a new model – it’s great
– We used %DATASET% 100k to evaluate it
– It’s 10% better than our baseline
– It’s 12% better than [Authors, 2010]
5. What are the differences?
Some things just work differently
• Data splitting
• Algorithm design (implementation)
• Algorithm optimization
• Parameter values
• Evaluation
• Relevance/ranking
• Software architecture
• etc.
Different design choices!
How do these choices affect evaluation results?
6. Evaluate evaluation
• Comparison of frameworks
• Comparison of implementations
• Comparison of results
• Objective benchmarking
8. There’s more than algorithms though
There’s the data, evaluation, and more
Data splits
• 80-20 cross-validation
• Random cross-validation
• User-based cross-validation
• Per-user splits
• Per-item splits
• Etc.
Evaluation
• Metrics
• Relevance
• Strategies
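How the data is split already matters. As a rough illustration of the difference between a single random cut and a per-user ratio split, here is a minimal Java sketch; the class, record, and method names are hypothetical, and this is not code from the paper or from any of the frameworks.

```java
import java.util.*;

/** Hypothetical sketch of two split strategies; not RiVal/Mahout/LensKit code. */
public class SplitSketch {

    /** A single (user, item, rating) triple. */
    record Rating(long user, long item, double value) {}

    /** Global random 80-20 split: shuffle all ratings once and cut. */
    static List<List<Rating>> randomSplit(List<Rating> data, double trainRatio, Random rnd) {
        List<Rating> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, rnd);
        int cut = (int) (shuffled.size() * trainRatio);
        return List.of(shuffled.subList(0, cut), shuffled.subList(cut, shuffled.size()));
    }

    /** Per-user ratio split: every user keeps trainRatio of their own ratings in training. */
    static List<List<Rating>> perUserSplit(List<Rating> data, double trainRatio, Random rnd) {
        Map<Long, List<Rating>> byUser = new HashMap<>();
        for (Rating r : data) byUser.computeIfAbsent(r.user(), u -> new ArrayList<>()).add(r);
        List<Rating> train = new ArrayList<>(), test = new ArrayList<>();
        for (List<Rating> ratings : byUser.values()) {
            Collections.shuffle(ratings, rnd);
            int cut = (int) (ratings.size() * trainRatio);
            train.addAll(ratings.subList(0, cut));
            test.addAll(ratings.subList(cut, ratings.size()));
        }
        return List.of(train, test);
    }
}
```

With a single global cut some users may end up with no test ratings at all, while a per-user split keeps test data for every user with enough ratings – a difference that already affects user coverage and any metric averaged per user.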
9. Real world examples
Movielens 1M [Cremonesi et al., 2010]
Movielens 1M [Yin et al., 2012]
13. We need a fair and common evaluation protocol!
14. Reproducible evaluation - Benchmarking
Control all parts of the process:
- Data splitting strategy
- Recommendation (black box)
- Candidate item generation (what items to test)
- Evaluation

Pipeline: Split → Recommend → Candidate items → Evaluate

Split – select strategy
• By time
• Cross-validation
• Random
• Ratio

Recommend – select framework
• Apache Mahout
• LensKit
• MyMediaLite
and select algorithm
• Tune settings

Candidate items – define strategy
• What is the ground truth
• What users to evaluate
• What items to evaluate

Evaluate – select error metrics
• RMSE, MAE
and ranking metrics
• nDCG, Precision/Recall, MAP

http://rival.recommenders.net
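To make "control all parts of the process" concrete, the sketch below models the four stages as plain Java interfaces, with the recommender kept as a black box. This is a hypothetical skeleton under assumed names, not the RiVal API (see the URL above for the real toolkit); RMSE is shown as one example error metric.

```java
import java.util.*;

/** Illustrative skeleton of the controlled pipeline: Split -> Recommend -> Candidate items -> Evaluate. */
public class PipelineSketch {

    record Rating(long user, long item, double value) {}
    record Split(List<Rating> train, List<Rating> test) {}

    interface Splitter { Split split(List<Rating> data); }                   // by time, cross-validation, random, ratio
    interface Recommender { double predict(long user, long item); }          // black box: Mahout, LensKit, MyMediaLite, ...
    interface CandidateItems { Set<Long> candidates(long user, Split s); }   // which items to rank for each user
    interface Metric { double score(Recommender r, Split s, CandidateItems c); } // RMSE, MAE, nDCG, P/R, MAP, ...

    /** Example error metric: RMSE over the test ratings (assumes a non-empty test set). */
    static double rmse(Recommender rec, List<Rating> test) {
        double sumSq = 0;
        for (Rating r : test) {
            double err = rec.predict(r.user(), r.item()) - r.value();
            sumSq += err * err;
        }
        return Math.sqrt(sumSq / test.size());
    }
}
```

Fixing the splitter, the candidate item strategy, and the metric implementations outside the frameworks is what makes numbers obtained from different frameworks comparable.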
25. In conclusion
• Design choices matter!
– Some more than others
• Evaluation needs to be documented
• Cross-framework comparison is not easy
– You need to have control!
27. How did we do this?
RiVal – an evaluation toolkit for RecSys
• http://rival.recommenders.net
• http://github.com/recommenders/rival
• RiVal demo later today
• On Maven Central!
• RiVal was also used for this year’s RecSys Challenge
– www.recsyschallenge.com
Let’s start by talking about what benchmarking is.
Mainly, we want to focus on the difference between benchmarking and evaluation. When we evaluate, we evaluate one system in the specific context of our evaluation. But what about benchmarking?
Benchmarking is when we compare systems against each other.
But, we need to know what it is that we are comparing – or benchmarking – and we need to know that we’re doing it correctly.
Currently, there are several popular recommendation frameworks.
The three most commonly used at RecSys – Mahout, MyMediaLite and LensKit – offer very similar recommendation setups (and evaluation solutions).
In this paper, we ran experiments to see how comparable the frameworks are to each other.
What we did in this paper
How comparable are the frameworks and results
The only thing we know here is that they used ML1M and PureSVD. We don’t know the rest of the setup, and we don’t know the software – the actual implementation of the algorithm.
Because of that we compared the three most popular frameworks to each other, using three basic algorithms
Using each framework’s internal evaluators!!!
Common input: the same dataset, the same algorithms.
Mahout does not have an evaluator that can compute RMSE and nDCG at the same time.
MyMediaLite evaluates either ranking or rating prediction.
You control the data split, the evaluation, the metric
When you can control this, you can examine every factor, e.g. different splits in the same framework, different evaluation strategies, etc.
Zoom in on one set of results (first quadrant)
Candidate items: what am I going to rank
TI: rank all items for each user (except those in the training set for that user)
UT: only rank items in the user’s test set – this is what you do when you do rating prediction
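A minimal Java sketch of these two candidate item strategies (the TI/UT labels follow the notes above; the code itself is hypothetical helper code):

```java
import java.util.*;

/** Hypothetical sketch of the TI and UT candidate item strategies. */
public class CandidateItemsSketch {

    /** TI: rank every item in the catalog except those in the user's training set. */
    static Set<Long> trainItems(long user, Set<Long> allItems, Map<Long, Set<Long>> trainItemsByUser) {
        Set<Long> candidates = new HashSet<>(allItems);
        candidates.removeAll(trainItemsByUser.getOrDefault(user, Set.of()));
        return candidates;
    }

    /** UT: rank only the items that appear in the user's test set (the rating-prediction setting). */
    static Set<Long> userTest(long user, Map<Long, Set<Long>> testItemsByUser) {
        return testItemsByUser.getOrDefault(user, Set.of());
    }
}
```

Ranking metrics computed over these two candidate sets are not comparable: UT only asks the recommender to order the few items the user actually rated in the test set, while TI asks it to rank those items against the whole catalog.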
User coverage is the percentage of users in the dataset that we can recommend to.
Item coverage is the percentage of items in the dataset that we can recommend.
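A small Java sketch of these two coverage figures (hypothetical helpers, assuming recommendations are available as a per-user list of item ids):

```java
import java.util.*;

/** Hypothetical coverage helpers: share of users we can recommend to, share of items we can recommend. */
public class CoverageSketch {

    /** User coverage: fraction of all users that received at least one recommendation. */
    static double userCoverage(Map<Long, List<Long>> recommendations, Set<Long> allUsers) {
        long covered = allUsers.stream()
                .filter(u -> !recommendations.getOrDefault(u, List.of()).isEmpty())
                .count();
        return (double) covered / allUsers.size();
    }

    /** Item coverage: fraction of all items that appear in at least one recommendation list. */
    static double itemCoverage(Map<Long, List<Long>> recommendations, Set<Long> allItems) {
        Set<Long> recommended = new HashSet<>();
        recommendations.values().forEach(recommended::addAll);
        recommended.retainAll(allItems);
        return (double) recommended.size() / allItems.size();
    }
}
```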
Difficult to say.
It depends on the strategy!
You need to set the evaluation context beforehand.