An Approach to R Package
Recommendation Engine
                             Alex Lin
           alin@intelligentmining.com
                 Twitter: @alinatwork
Initial Thoughts
  The data set is expected to have very strong
   package-package relationships (dependencies and
   related package functionality).
  The data set (training + test) is not sparse.

  Most matrix factorization (MF) techniques in
   the recommender field optimize squared error on
   the predicted user ratings; they do not directly
   optimize for AUC.
Steps
1.    Modified k-Nearest Neighbor algorithm.
2.    User average & package average as prior bias.
3.    User-specific package Maintainer Affinity.
4.    Matrix factorization (MF) to post-process the
      residuals.
5.    Other rules.
Modified k-Nearest Neighbor algorithm
  Calculate the cosine similarity for each pkg-pkg pair.
  Scale the cosine similarity by the “squared user
   support”, i.e. cosine * (support / ttl_user_cnt)**2
  Unlike regular kNN, which is only additive, we use
   the same kNN rules to penalize a package if other
   related packages were not installed.
  For unknown records, we take the ZAN approach:
   we treat the unknown entries as negative.
  k=all
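
A minimal sketch of the scaled similarity and the additive/penalizing vote described above. Several details are assumptions, not taken from the slides: `support` is read as the number of users who installed both packages, the penalty is modeled as a -1 vote for every package the user did not install, and the function names are hypothetical.

```python
import numpy as np

def scaled_item_similarity(R):
    """R: (n_users, n_pkgs) 0/1 matrix; unknown entries already set to 0 (ZAN)."""
    R = np.asarray(R, dtype=float)
    n_users = R.shape[0]
    # co[i, j] = number of users who installed both i and j (assumed "support")
    co = R.T @ R
    norms = np.sqrt(np.diag(co))
    denom = np.outer(norms, norms)
    cosine = np.divide(co, denom, out=np.zeros_like(co), where=denom > 0)
    # cosine * (support / ttl_user_cnt)**2
    sim = cosine * (co / n_users) ** 2
    np.fill_diagonal(sim, 0.0)  # a package does not vote for itself
    return sim

def knn_scores(R, sim):
    """k = all: every other package votes for every (user, package) pair."""
    # +1 for installed neighbors (additive), -1 for the rest (assumed penalty)
    signed = 2.0 * np.asarray(R, dtype=float) - 1.0
    return signed @ sim
```

With this scaling, a pair of packages that co-occurs for only a handful of users contributes very little, even when its raw cosine similarity is high.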
User average and Package average as
prior bias
  User average =
   user installed pkg count / user observation count
  Package average =
   pkg installed by users count / pkg observation count
  Add both to the kNN result score.
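
The two averages can be sketched as follows. The input layout (an iterable of `(user, pkg, installed)` triples), the unweighted sum, and the fallback of 0.0 for unseen users or packages are assumptions made for illustration; the slides only say the averages are added to the kNN score.

```python
from collections import defaultdict

def prior_biases(observations):
    """observations: iterable of (user, pkg, installed), installed in {0, 1}."""
    u_inst, u_obs = defaultdict(int), defaultdict(int)
    p_inst, p_obs = defaultdict(int), defaultdict(int)
    for user, pkg, installed in observations:
        u_obs[user] += 1
        u_inst[user] += installed
        p_obs[pkg] += 1
        p_inst[pkg] += installed
    # installed count / observation count, per user and per package
    user_avg = {u: u_inst[u] / u_obs[u] for u in u_obs}
    pkg_avg = {p: p_inst[p] / p_obs[p] for p in p_obs}
    return user_avg, pkg_avg

def biased_score(knn_score, user, pkg, user_avg, pkg_avg):
    # Unweighted sum; unseen users/packages fall back to 0.0 (assumption)
    return knn_score + user_avg.get(user, 0.0) + pkg_avg.get(pkg, 0.0)
```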
User-specific Package Maintainer
Affinity
  This metric is measured as the percentage of a given
   maintainer's packages that a user has installed.
  We use the percentage to predict how likely the
   user is to install other packages from the same
   maintainer. Combine it with the kNN result score
   using a weight of 0.25.
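
A rough sketch of the affinity metric and its combination with the kNN score. The denominator (all packages of the maintainer listed in `pkg_maintainer`) and both function names are assumptions; the slides give only the 0.25 weight.

```python
from collections import Counter

def maintainer_affinity(user_installed, pkg_maintainer):
    """Fraction of each maintainer's packages that one user has installed.

    user_installed: set of package names installed by the user.
    pkg_maintainer: dict mapping package name -> maintainer (hypothetical input).
    """
    total = Counter(pkg_maintainer.values())
    installed = Counter(
        pkg_maintainer[p] for p in user_installed if p in pkg_maintainer
    )
    return {m: installed[m] / total[m] for m in total}

def combine(knn_score, affinity, weight=0.25):
    # Weighted addition to the kNN result score, weight from the slides
    return knn_score + weight * affinity
```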
So Far – baseline model
  Very heuristic
  Public AUC = 0.976x
Matrix Factorization
      Analyze the residuals only. The goal is to find
       structural errors in our baseline prediction.
      prediction := baseline_output + residual
      residual := pkg_bias + user_bias + pkgFactors . userFactors

      The residuals are related to the Wilcoxon-Mann-Whitney (WMW)
       statistic

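      $$\mathrm{AUC} = \frac{\sum_{i=0}^{M-1} \sum_{j=0}^{N-1} I(m_i, n_j)}{MN}, \qquad m \in \text{PositiveClass},\; n \in \text{NegativeClass}$$

      $$I(m_i, n_j) = \begin{cases} 1 & \text{if } m_i > n_j \\ 0 & \text{otherwise} \end{cases}$$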
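
For reference, the WMW form of AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score. A minimal sketch, with a hypothetical function name and with ties counted as 0:

```python
def auc_wmw(pos_scores, neg_scores):
    """AUC as the Wilcoxon-Mann-Whitney statistic over all pos/neg pairs."""
    wins = 0
    for m in pos_scores:        # scores of positive-class examples
        for n in neg_scores:    # scores of negative-class examples
            if m > n:           # pair is ranked correctly
                wins += 1
    return wins / (len(pos_scores) * len(neg_scores))
```

This brute-force version is quadratic in the number of examples, so it is only meant to make the definition concrete.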
Matrix Factorization – cont.
          Minimize the truncated squared error with batch gradient
           descent (BGD)
         $$x_{ui} = s_{ui} + b_i + b_u + q_i^T p_u$$

         $$\min_{q_*, p_*} \sum_{(u,i) \in K} \frac{1}{2} \left(L_{rank}(x_{ui})\right)^2 + \lambda \left(\|q_i\|^2 + \|p_u\|^2 + b_i^2 + b_u^2\right)$$

         Pairwise comparison:

         $$L_{rank} = \sum_{v=0}^{U-1} \sum_{j=0}^{I-1} I(x_{ui}, x_{vj}) (x_{vj} - x_{ui})$$

         $$I(x_{ui}, x_{vj}) = \begin{cases} 1 & \text{if } x_{ui} < x_{vj},\ x_{ui} \in P,\ x_{vj} \in N \\ 1 & \text{if } x_{ui} > x_{vj},\ x_{ui} \in N,\ x_{vj} \in P \\ 0 & \text{otherwise} \end{cases}$$
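
The objective can be sketched in NumPy as below; it evaluates the loss only and does not implement the gradient step. Assumptions not stated in the slides: `s_ui` is taken to be the baseline output, `K` is taken to be every (user, package) cell of a fully labeled 0/1 matrix `Y`, and all function and argument names are hypothetical.

```python
import numpy as np

def predict(S, b_i, b_u, Q, P):
    """x_ui = s_ui + b_i + b_u + q_i^T p_u for every (u, i).

    S: (U, I) baseline scores; b_i: (I,); b_u: (U,); Q: (I, k); P: (U, k).
    """
    return S + b_i[None, :] + b_u[:, None] + P @ Q.T

def l_rank(X, Y, u, i):
    """Sum of (x_vj - x_ui) over misranked pairs with the opposite class."""
    x = X[u, i]
    if Y[u, i] == 1:
        others = X[Y == 0]
        violations = others[others > x]   # negatives scored above this positive
    else:
        others = X[Y == 1]
        violations = others[others < x]   # positives scored below this negative
    return float(np.sum(violations - x))

def objective(S, Y, b_i, b_u, Q, P, lam):
    """Half squared ranking loss plus L2 regularization, summed over all cells."""
    X = predict(S, b_i, b_u, Q, P)
    total = 0.0
    for u, i in np.ndindex(Y.shape):
        total += 0.5 * l_rank(X, Y, u, i) ** 2
        total += lam * (Q[i] @ Q[i] + P[u] @ P[u] + b_i[i] ** 2 + b_u[u] ** 2)
    return total
```

Correctly ordered pairs contribute nothing to the loss, so a perfectly ranked prediction has zero ranking error regardless of its absolute score values.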
Other Rules
  For duplicate records that exist in both the testing
   and training sets, copy the answers from the training set.
  Assume that when a user installs a package P, the user also
   installs the packages that P depends on.
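
A sketch of how these two rules might override a model score. The data structures and the function name are hypothetical, and only direct dependencies are followed; the slides do not say whether transitive dependencies were used.

```python
def apply_other_rules(score, user, pkg, train_labels, user_installed, depends_on):
    """Return the final prediction for one (user, pkg) test record.

    train_labels:   dict (user, pkg) -> 0/1 from the training set.
    user_installed: dict user -> set of packages known to be installed.
    depends_on:     dict package -> set of packages it depends on.
    """
    # Rule 1: a duplicate record takes its answer from the training set
    if (user, pkg) in train_labels:
        return float(train_labels[(user, pkg)])
    # Rule 2: if the user installed some P that depends on pkg, predict installed
    for p in user_installed.get(user, ()):
        if pkg in depends_on.get(p, ()):
            return 1.0
    return score
```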
Final Result
  Public AUC = 0.984914
  Final AUC = 0.979565
