Introduction to R Package Recommendation System Competition

Kaggle is a platform for data prediction contests that lets organizations post their data problems. This document describes a Kaggle contest to predict which R packages users have installed, based on package metadata. Three example logistic regression models of increasing complexity were tested. The best model, which added per-user intercepts and LDA topic assignments to the package metadata, achieved an AUC above 0.95. The document also discusses future work, including collecting subjective ratings and drawing a larger, more representative sample of users.

R Recommendation System Contest

John Myles White

March 10, 2011

John Myles White R Recommendation System Contest

Kaggle

Kaggle is a platform for data prediction competitions
that allows organizations to post their data and have it
scrutinized by the world’s best data scientists.

Kaggle Features

Kaggle provides every contest with:
Centralized data downloads
Public and private leaderboards using RMSE, AUC and other metrics
Public discussion forums for participants to use
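
The two leaderboard metrics named above can be sketched in base R. This is a minimal illustration, not Kaggle's scoring code: the function names `rmse` and `auc` and the toy vectors are assumptions made for the example.

```r
# Root mean squared error between numeric predictions and observed values.
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case is scored above a randomly chosen negative one.
auc <- function(scores, labels) {
  r <- rank(scores)                 # average ranks handle tied scores
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy data, invented for illustration only.
actual <- c(1, 0, 1, 1, 0)
scores <- c(0.9, 0.2, 0.6, 0.4, 0.5)

rmse(scores, actual)
auc(scores, actual)
```

Lower RMSE is better, while higher AUC is better; an AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one.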

Recent Kaggle Contests

Tourism Forecasting
Chess Ratings: Elo versus the Rest of the World
INFORMS 2010: Short Term Stock Price Movements

Current and Upcoming Kaggle Contests

Arabic Writer Identification
Don’t Overfit: Dealing with Many Variables and Few Observations
Heritage Health Prize

Advice on Running Kaggle Contests

Stay involved: respond to forum posts quickly and make the contest seem alive
Don’t use a prediction task where near-perfect accuracy can be achieved

Mistakes We Made

Netflix Prize: 0.8616 RMSE
R Recommendation Contest: 0.9882 AUC

The R Recommendation System Contest

Contestants must be able to predict whether a user U will
have a package P installed on their system

Full Data Set

Outcomes: List of all packages installed on 52 R users’ systems
Predictors: Metadata about 2485 CRAN packages

Metadata

Dependencies
Suggests
Imports
Views
Core
Recommended
Maintainer
Maintainer’s Package Count

Training Data / Test Data Split

Uniform random split over rows in full data set
Training Set: 99373 rows
Test Set: 33125 rows
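
The row counts above work out to roughly a 75% / 25% split. A uniform random split over rows could be sketched as follows. The `full.data` data frame here is a small invented stand-in, the `Package` column name is hypothetical, and the 0.75 fraction is inferred from the row counts rather than stated in the slides.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the full data set: one row per (user, package) pair.
full.data <- data.frame(
  User      = rep(1:4, each = 25),
  Package   = rep(paste0("pkg", 1:25), times = 4),
  Installed = rbinom(100, size = 1, prob = 0.3)
)

# Sample 75% of the row indices uniformly at random, without replacement.
n <- nrow(full.data)
training.rows <- sample(n, size = round(0.75 * n))

training.data <- full.data[training.rows, ]
test.data     <- full.data[-training.rows, ]

c(training = nrow(training.data), test = nrow(test.data))
```

Because the split is over rows rather than over users, the same user will generally appear in both sets, which is what makes per-user terms in a model usable at prediction time.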

Additional Metadata

LDA topic assignments for CRAN packages
Used 25 topics
Used all documentation: manuals, vignettes, etc.

Example Models

1. Package Metadata
2. Package Metadata + Per User Intercepts
3. Package Metadata + Per User Intercepts + Package Topic Assignments

Example Model 1

library('ProjectTemplate')
try(load.project())

# Model 1: logistic regression on package metadata only
logit.fit <- glm(Installed ~ LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             LogViewsIncluding +
                             LogPackagesMaintaining +
                             CorePackage +
                             RecommendedPackage,
                 data = training.data,
                 family = binomial(link = 'logit'))

Example Model 2

# Model 2: package metadata plus one intercept per user
logit.fit <- glm(Installed ~ LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             LogViewsIncluding +
                             LogPackagesMaintaining +
                             CorePackage +
                             RecommendedPackage +
                             factor(User),
                 data = training.data,
                 family = binomial(link = 'logit'))

Example Model 3

# Model 3: package metadata, per-user intercepts and the package's topic assignment
logit.fit <- glm(Installed ~ LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             LogViewsIncluding +
                             LogPackagesMaintaining +
                             CorePackage +
                             RecommendedPackage +
                             factor(User) +
                             Topic,
                 data = training.data,
                 family = binomial(link = 'logit'))

Model Performance

Model 1: ∼ 0.80 AUC
Model 2: ∼ 0.95 AUC
Model 3: > 0.95 AUC
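
A comparison like the one above can be reproduced in outline on synthetic data. The sketch below is not the contest evaluation: the simulated data, effect sizes, seed and the `auc` helper are all assumptions, and only one metadata feature (`LogDependencyCount`) is used. It illustrates why adding `factor(User)` can raise held-out AUC when users differ in how many packages they install.

```r
set.seed(42)  # arbitrary seed

# Synthetic stand-in for the contest data (all values invented).
n.users <- 20
n.packages <- 100
sim <- expand.grid(User = 1:n.users, Package = 1:n.packages)
user.effect <- rnorm(n.users, sd = 1)            # some users install more than others
log.dep.count <- log1p(rpois(n.packages, 2))     # one metadata feature per package
sim$LogDependencyCount <- log.dep.count[sim$Package]
sim$Installed <- rbinom(nrow(sim), size = 1,
                        prob = plogis(-1 + user.effect[sim$User] +
                                      0.8 * sim$LogDependencyCount))

# Uniform random split over rows, so every user appears in the training set.
train.rows <- sample(nrow(sim), round(0.75 * nrow(sim)))
train <- sim[train.rows, ]
test  <- sim[-train.rows, ]

# Rank-based AUC (Mann-Whitney identity).
auc <- function(scores, labels) {
  r <- rank(scores)
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Metadata only, then metadata plus per-user intercepts.
fit1 <- glm(Installed ~ LogDependencyCount,
            data = train, family = binomial(link = 'logit'))
fit2 <- glm(Installed ~ LogDependencyCount + factor(User),
            data = train, family = binomial(link = 'logit'))

auc1 <- auc(predict(fit1, newdata = test, type = 'response'), test$Installed)
auc2 <- auc(predict(fit2, newdata = test, type = 'response'), test$Installed)
c(metadata.only = auc1, with.user.intercepts = auc2)
```

The AUC values printed here depend entirely on the simulated effect sizes and should not be read as estimates of the contest results.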

Unexploited Structure in Data

Future Work

What makes a package useful?
Need subjective ratings
Some packages are only installed because they’re dependencies for other popular packages

Future Work

Get a better data sample:
Contest only used data from 52 users
But we do have complete data for those users
But data was not a random sample of R users

Future Work

Do more with LDA to categorize R packages
Prediction task allows us to evaluate the “quality” of the topic count and topic assignments

Future Work

Build up various package-package similarity matrices for
conditional recommendations
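
One possible way to build such a matrix is cosine similarity over a user-by-package install matrix. The slides do not say which similarity measure was intended, so the choice of cosine similarity, the `recommend.for` helper and the toy `installs` matrix are all assumptions made for illustration.

```r
# Toy user-by-package install matrix (1 = installed); values invented.
installs <- matrix(
  c(1, 1, 0, 1,
    1, 1, 0, 0,
    0, 1, 1, 0,
    1, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4), paste0("pkg", c("A", "B", "C", "D")))
)

# Co-installation counts: entry (i, j) is the number of users with both packages.
co.counts <- crossprod(installs)

# Cosine similarity: scale each count by the two packages' install counts.
norms <- sqrt(diag(co.counts))
similarity <- co.counts / outer(norms, norms)

# Conditional recommendation: the packages most similar to one the user has.
recommend.for <- function(package, similarity, n = 2) {
  scores <- similarity[package, colnames(similarity) != package]
  head(sort(scores, decreasing = TRUE), n)
}

recommend.for("pkgA", similarity)
```

A package that no user has installed would have a zero norm and produce `NaN` entries, so such columns would need to be dropped or handled before normalising.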

Future Work

Can we understand the clustering in the network structure
graph?

Resources

For more information, see
The original Dataists’ contest announcement
GitHub project page
