Kaggle is a platform for data prediction contests that allows organizations to post data problems. This document describes a Kaggle contest to predict which R packages users have installed, using package metadata as predictors. Three example logistic regression models of increasing complexity were tested. The best model, which added per-user intercepts and LDA topic assignments to the package metadata, achieved an AUC above 0.95. The document also discusses opportunities to improve the models through subjective ratings data and a larger, more representative sample of users.
Introduction to R Package Recommendation System Competition
1. R Recommendation System Contest
John Myles White
March 10, 2011
John Myles White R Recommendation System Contest
2. Kaggle
Kaggle is a platform for data prediction competitions
that allows organizations to post their data and have it
scrutinized by the world’s best data scientists.
3. Kaggle Features
Kaggle provides every contest with:
Centralized data downloads
Public and private leaderboards using RMSE, AUC and other metrics
Public discussion forums for participants to use
5. Recent Kaggle Contests
Tourism Forecasting
Chess Ratings: Elo versus the Rest of the World
INFORMS 2010: Short Term Stock Price Movements
6. Current and Upcoming Kaggle Contests
Arabic Writer Identification
Don’t Overfit: Dealing with Many Variables and Few Observations
Heritage Health Prize
7. Advice on Running Kaggle Contests
Stay involved: respond to forum posts quickly and make the contest seem alive
Don’t use a prediction task where near perfect accuracy can be achieved
8. Mistakes We Made
Netflix Prize: 0.8616 RMSE
R Recommendation Contest: 0.9882 AUC
9. The R Recommendation System Contest
Contestants must be able to predict whether a user U will
have a package P installed on their system
10. Full Data Set
Outcomes: List of all packages installed on 52 R users’ systems
Predictors: Metadata about 2485 CRAN packages
11. Metadata
Dependencies
Suggests
Imports
Views
Core
Recommended
Maintainer
Maintainer’s Package Count
12. Training Data / Test Data Split
Uniform random split over rows in full data set
Training Set: 99373 rows
Test Set: 33125 rows
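A uniform random split over rows can be sketched as follows. This is a minimal illustration on toy data, not the contest's actual split script: the `full.data` frame, its column names, and the seed are assumptions, and the 75/25 proportion is only what the row counts above imply.

```r
# Toy stand-in for the full data set: one row per (User, Package) pair.
# Column names and sizes are illustrative assumptions.
set.seed(1)
full.data <- expand.grid(User = 1:4,
                         Package = paste0("pkg", 1:10),
                         stringsAsFactors = FALSE)
full.data$Installed <- rbinom(nrow(full.data), size = 1, prob = 0.3)

# Uniform random split over rows: every row has the same chance of
# landing in the training set, regardless of user or package.
n <- nrow(full.data)
training.rows <- sample(n, size = floor(0.75 * n))

training.data <- full.data[training.rows, ]
test.data <- full.data[-training.rows, ]
```

Because the split is over rows rather than over users, the same user can appear in both sets, which is what lets a model with per-user terms be scored on the test rows.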
13. Additional Metadata
LDA topic assignments for CRAN packages
Used 25 topics
Used all documentation: manuals, vignettes, etc.
14. Example Models
1. Package Metadata
2. Package Metadata + Per User Intercepts
3. Package Metadata + Per User Intercepts + Package Topic Assignments
15. Example Model 1
library('ProjectTemplate')
try(load.project())

# Model 1: logistic regression on package metadata only
logit.fit <- glm(Installed ~ LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             LogViewsIncluding +
                             LogPackagesMaintaining +
                             CorePackage +
                             RecommendedPackage,
                 data = training.data,
                 family = binomial(link = 'logit'))
16. Example Model 2
# Model 2: package metadata plus one intercept per user
logit.fit <- glm(Installed ~ LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             LogViewsIncluding +
                             LogPackagesMaintaining +
                             CorePackage +
                             RecommendedPackage +
                             factor(User),
                 data = training.data,
                 family = binomial(link = 'logit'))
17. Example Model 3
# Model 3: package metadata, per-user intercepts, and the package's topic assignment
logit.fit <- glm(Installed ~ LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             LogViewsIncluding +
                             LogPackagesMaintaining +
                             CorePackage +
                             RecommendedPackage +
                             factor(User) +
                             Topic,
                 data = training.data,
                 family = binomial(link = 'logit'))
18. Model Performance
Model 1: ∼ 0.80 AUC
Model 2: ∼ 0.95 AUC
Model 3: > 0.95 AUC
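For readers who want to reproduce this kind of score, AUC can be computed in base R from the rank-sum identity. This is a minimal sketch, not the contest's scoring code; the `auc` helper and the small example vectors are illustrative assumptions.

```r
# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case is scored above a randomly chosen negative one.
auc <- function(labels, scores) {
  stopifnot(length(labels) == length(scores), all(labels %in% c(0, 1)))
  n.pos <- as.numeric(sum(labels == 1))
  n.neg <- as.numeric(sum(labels == 0))
  stopifnot(n.pos > 0, n.neg > 0)
  r <- rank(scores)  # ties receive average ranks
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Small hand-checkable example: three of the four positive/negative
# pairs are ordered correctly, so the AUC is 0.75.
auc(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))
```

Assuming a held-out `test.data` frame with the same columns as `training.data`, a fitted model could be scored with `auc(test.data$Installed, predict(logit.fit, newdata = test.data, type = 'response'))`.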
20. Future Work
What makes a package useful?
Need subjective ratings
Some packages are only installed because they’re dependencies for other popular packages
21. Future Work
Get a better data sample:
Contest only used data from 52 users
But we do have complete data for those users
But data was not a random sample of R users
22. Future Work
Do more with LDA to categorize R packages
Prediction task allows us to evaluate “quality” of topic count and topic assignments
23. Future Work
Build up various package-package similarity matrices for
conditional recommendations
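One possible form of such a matrix is Jaccard similarity over co-installations. The sketch below is only an illustration on toy data: the `installs` matrix, the package names, and the `recommend` helper are hypothetical, and Jaccard is one choice among several similarity measures.

```r
# Toy user-by-package install matrix (1 = installed); values are made up.
installs <- matrix(c(1, 1, 0,
                     1, 1, 0,
                     1, 0, 1,
                     0, 0, 1),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("user", 1:4),
                                   c("pkgA", "pkgB", "pkgC")))

# Co-installation counts: entry (i, j) is the number of users with both packages.
both <- crossprod(installs)

# Jaccard similarity: users with both packages / users with either package.
# (Packages with no installs would give 0/0 here; the toy data has none.)
counts <- diag(both)
either <- outer(counts, counts, "+") - both
jaccard <- both / either

# Conditional recommendation: the k packages most similar to one already installed.
recommend <- function(pkg, sim, k = 1) {
  s <- sim[pkg, colnames(sim) != pkg]
  names(sort(s, decreasing = TRUE))[seq_len(min(k, length(s)))]
}

recommend("pkgA", jaccard)
```

Other matrices could be built the same way from different evidence, for example shared dependencies or shared topic assignments instead of co-installation.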
24. Future Work
Can we understand the clustering in the network structure
graph?
25. Resources
For more information, see
The original Dataists’ contest announcement
GitHub project page