An Approach to R Package Recommendation Engine


Alex's approach to the R package recommender system Kaggle competition, where he placed 4th. Slides presented to NYCPA.

Published in: Technology

  • AUC = 0.85
  • AUC = 0.95x
  • Xui is the residual, where residual = pkg_bias + user_bias + (pkgFactors . userFactors)
  • An Approach to R Package Recommendation Engine

    1. An Approach to R Package Recommendation Engine
       Alex Lin
       Twitter: @alinatwork
    2. Initial Thoughts
       • The data set is expected to have very strong package-package relationships (dependencies and related package functionality).
       • The data set (training + test) is not sparse.
       • Most matrix factorization (MF) techniques in the recommender field minimize squared error on the predicted user ratings; they do not optimize AUC directly.
    3. Steps
       • Modified k-Nearest Neighbor algorithm.
       • User average and package average as prior bias.
       • User-specific package Maintainer Affinity.
       • Matrix factorization (MF) to post-process the residuals.
       • Other rules.
    4. Modified k-Nearest Neighbor algorithm
       • Calculate the cosine similarity for each pkg-pkg pair.
       • Scale the cosine similarity by the "square user support", i.e. cosine * (support / ttl_user_cnt)**2
       • Unlike regular kNN, which is only additive, we use the same kNN rules to penalize a package when a related package was not installed.
       • For unknown records, we take the ZAN approach: unknown entries are treated as negative.
       • k = all
    5. User average and Package average as prior bias
       • User average = user installed pkg count / user observation count
       • Package average = pkg installed by users count / pkg observation count
       • Add both to the kNN result score.
    6. User-specific Package Maintainer Affinity
       • This metric is the percentage of a given maintainer's packages that a user has installed.
       • We use the percentage to predict how likely the user is to install other packages from the same maintainer.
       • Combine it with the kNN result score, using a weight of 0.25.
    7. So Far – baseline model
       • Very heuristic
       • Public AUC = 0.976x
    8. Matrix Factorization
       • Analyze the residuals only. The goal is to find structural errors in our baseline prediction.
       • prediction := baseline_output + residual
       • residual := pkg_bias + user_bias + pkgFactors . userFactors
       • The residuals are related to the Wilcoxon-Mann-Whitney (WMW) statistic.
    9. Matrix Factorization – cont.
       • Minimize the truncated squared error with batch gradient descent (BGD).
       • Pairwise comparison
    10. Other Rules
       • For duplicate records that appear in both the test and training sets, copy the answers from the training set.
       • Assume that when a user installs a package P, the user also installs the packages that P depends on.
    11. Final Result
       • Public AUC = 0.984914
       • Final AUC = 0.979565
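
The scaled cosine similarity from the "Modified k-Nearest Neighbor algorithm" slide can be sketched roughly as follows. This is an illustration under assumptions, not the author's code: `support` is taken to be the number of users who installed both packages, the penalty is modeled by scoring a non-installed (or unknown) neighbor as -1, and the function names `scaled_cosine_similarity` and `knn_scores` are hypothetical.

```python
import numpy as np

def scaled_cosine_similarity(installed: np.ndarray) -> np.ndarray:
    """installed: (n_users, n_pkgs) 0/1 matrix; returns (n_pkgs, n_pkgs) similarities."""
    x = installed.astype(float)
    co_counts = x.T @ x                     # users who installed both packages
    norms = np.sqrt(np.diag(co_counts))     # sqrt of each package's install count
    denom = np.outer(norms, norms)
    cosine = np.divide(co_counts, denom,
                       out=np.zeros_like(co_counts), where=denom > 0)
    # Assumption: "support" is the co-install count of the pair.
    support_ratio = co_counts / x.shape[0]  # support / ttl_user_cnt
    sim = cosine * support_ratio ** 2
    np.fill_diagonal(sim, 0.0)              # a package is not its own neighbor
    return sim

def knn_scores(installed: np.ndarray, sim: np.ndarray) -> np.ndarray:
    """Score every (user, pkg) pair using all packages as neighbors (k = all)."""
    # Installed neighbors add to the score; non-installed or unknown
    # neighbors subtract from it (unknown treated as negative).
    signed = 2.0 * installed - 1.0
    return signed @ sim
```

With a 0/1 install matrix, `knn_scores(installed, scaled_cosine_similarity(installed))` returns one score per user-package pair; higher means more likely to be installed.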
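
The prior biases and the maintainer affinity from slides 5 and 6 are simple ratios, sketched below. Several details are assumptions: the slides do not say how the two averages are weighted when added to the kNN score (a weight of 1 is assumed here), the affinity denominator is assumed to be the maintainer's packages observed for that user, and all function names and the `maintainer_of` mapping are hypothetical. Only the 0.25 weight comes from the slides.

```python
from collections import defaultdict

def prior_averages(observations):
    """observations: iterable of (user, pkg, installed) with installed in {0, 1}."""
    user_hits, user_obs = defaultdict(int), defaultdict(int)
    pkg_hits, pkg_obs = defaultdict(int), defaultdict(int)
    for user, pkg, installed in observations:
        user_hits[user] += installed
        user_obs[user] += 1
        pkg_hits[pkg] += installed
        pkg_obs[pkg] += 1
    user_avg = {u: user_hits[u] / user_obs[u] for u in user_obs}
    pkg_avg = {p: pkg_hits[p] / pkg_obs[p] for p in pkg_obs}
    return user_avg, pkg_avg

def maintainer_affinity(observations, maintainer_of):
    """Fraction of each maintainer's observed packages that each user installed."""
    hits, seen = defaultdict(int), defaultdict(int)
    for user, pkg, installed in observations:
        key = (user, maintainer_of[pkg])
        hits[key] += installed
        seen[key] += 1
    return {key: hits[key] / seen[key] for key in seen}

def baseline_score(knn_score, user_avg, pkg_avg, affinity, affinity_weight=0.25):
    # Assumption: both averages are added unweighted; only 0.25 is from the slides.
    return knn_score + user_avg + pkg_avg + affinity_weight * affinity
```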
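
Slides 2 and 8 tie the objective to AUC and the Wilcoxon-Mann-Whitney statistic. AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, which is the normalized WMW statistic. The sketch below computes it by direct pairwise comparison; the function name `wmw_auc` is hypothetical, and the quadratic loop is meant for illustration rather than for large data sets.

```python
def wmw_auc(scores, labels):
    """AUC as the normalized Wilcoxon-Mann-Whitney statistic (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half a win
    return wins / (len(pos) * len(neg))
```

Because the measure depends only on the ordering of pairs, a model trained on squared error can lower its loss without improving AUC, which is the concern raised in the "Initial Thoughts" slide.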
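
The residual model from slide 8, residual := pkg_bias + user_bias + pkgFactors . userFactors, can be fitted with batch gradient descent as in the simplified sketch below. It minimizes plain squared error with a small L2 penalty; the truncation and the pairwise comparison mentioned on slide 9 are not reproduced because the slides do not specify them. The function name and all hyperparameters (`n_factors`, `lr`, `reg`, `n_epochs`) are assumptions.

```python
import numpy as np

def fit_residual_mf(residuals, n_factors=2, lr=0.05, reg=0.01, n_epochs=500, seed=0):
    """residuals: (n_users, n_pkgs) array of target minus baseline_output."""
    rng = np.random.default_rng(seed)
    n_users, n_pkgs = residuals.shape
    user_bias = np.zeros(n_users)
    pkg_bias = np.zeros(n_pkgs)
    user_factors = 0.1 * rng.standard_normal((n_users, n_factors))
    pkg_factors = 0.1 * rng.standard_normal((n_pkgs, n_factors))
    for _ in range(n_epochs):
        pred = user_bias[:, None] + pkg_bias[None, :] + user_factors @ pkg_factors.T
        err = pred - residuals
        # Batch step: every update uses the full error matrix,
        # averaged over the row (user) or column (package) it belongs to.
        grad_ub = err.mean(axis=1)
        grad_pb = err.mean(axis=0)
        grad_uf = err @ pkg_factors / n_pkgs + reg * user_factors
        grad_pf = err.T @ user_factors / n_users + reg * pkg_factors
        user_bias -= lr * grad_ub
        pkg_bias -= lr * grad_pb
        user_factors -= lr * grad_uf
        pkg_factors -= lr * grad_pf
    return user_bias, pkg_bias, user_factors, pkg_factors

def predict_residual(user_bias, pkg_bias, user_factors, pkg_factors):
    """Reassemble pkg_bias + user_bias + pkgFactors . userFactors for all pairs."""
    return user_bias[:, None] + pkg_bias[None, :] + user_factors @ pkg_factors.T
```

The final prediction would then be `baseline_output + predict_residual(...)`, matching prediction := baseline_output + residual.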
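
The two rules on the "Other Rules" slide amount to overriding the model score for specific (user, package) pairs. A minimal sketch, assuming predictions are stored in a dict keyed by (user, pkg), that only direct dependencies are followed, and that an implied install is scored as 1.0; the function name, the argument names, and the package names in the usage example are hypothetical.

```python
def apply_rules(predictions, train_answers, installed_by_user, depends_on):
    """Override scores for duplicates and for dependencies of installed packages.

    predictions:       dict[(user, pkg)] -> model score
    train_answers:     dict[(user, pkg)] -> 0 or 1, from the training set
    installed_by_user: dict[user] -> set of packages known to be installed
    depends_on:        dict[pkg] -> set of packages it depends on
    """
    out = dict(predictions)
    for user, pkg in predictions:
        if (user, pkg) in train_answers:
            # Duplicate record: copy the known answer from the training set.
            out[(user, pkg)] = float(train_answers[(user, pkg)])
        elif any(pkg in depends_on.get(p, ()) for p in installed_by_user.get(user, ())):
            # The user installed some P that depends on pkg, so assume pkg is installed.
            out[(user, pkg)] = 1.0
    return out
```

For example, if `u1` is known to have installed `pkgA` and `pkgA` depends on `pkgB`, then the score for `("u1", "pkgB")` is raised to 1.0, while scores for other users are left unchanged.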