
R package Recommendation Engine

2,129 views


Alex Lin's slides about his entry to the R Package Recommendation Engine Competition


  1. An Approach to R Package Recommendation Engine. Alex Lin, alin@intelligentmining.com, Twitter: @alinatwork
  2. Initial Thoughts
     • The data set is expected to have very strong package-package relationships (dependencies and related package functionality).
     • The data set (training + test) is not sparse.
     • Most matrix factorization (MF) techniques in the recommender field optimize squared error on the predicted user ratings; they do not directly optimize AUC.
  3. Steps
     1. Modified k-Nearest Neighbor algorithm.
     2. User average and package average as prior bias.
     3. User-specific package maintainer affinity.
     4. Matrix factorization (MF) to post-process the residuals.
     5. Other rules.
  4. Modified k-Nearest Neighbor algorithm
     • Calculate cosine similarity for each pkg-pkg pair.
     • Scale the cosine similarity by the "squared user support", i.e. cosine * (support / ttl_user_cnt)**2
     • Unlike regular kNN, which is only additive, we use the same kNN rules to penalize a package if other related packages were not installed.
     • For unknown records, we take the ZAN approach: we treat unknown entries as negative.
     • k = all
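The scaling and penalty rules on this slide can be sketched as follows. This is a minimal illustration, not the author's code: it assumes that "support" means the number of users who installed both packages, that the penalty is the neighbor's similarity with the opposite sign, and that unknown entries are already stored as 0. The names `scaled_cosine` and `knn_score` are hypothetical.

```python
import math

def scaled_cosine(col_a, col_b):
    """Cosine similarity of two 0/1 install vectors, scaled by squared user support."""
    ttl_user_cnt = len(col_a)
    # Assumption: "support" = number of users who installed both packages.
    support = sum(a * b for a, b in zip(col_a, col_b))
    norm_a = math.sqrt(sum(a * a for a in col_a))
    norm_b = math.sqrt(sum(b * b for b in col_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    cosine = support / (norm_a * norm_b)
    return cosine * (support / ttl_user_cnt) ** 2

def knn_score(user_row, target, sims):
    """Score `target` for one user using every other package as a neighbor (k = all).

    Installed neighbors add their similarity; neighbors that are not installed
    subtract it (assumed form of the penalty).
    """
    score = 0.0
    for pkg, installed in enumerate(user_row):
        if pkg == target:
            continue
        score += sims[target][pkg] if installed else -sims[target][pkg]
    return score

# Rows are users, columns are packages; unknown entries are stored as 0.
installs = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
columns = list(zip(*installs))
sims = [[scaled_cosine(a, b) for b in columns] for a in columns]
```

With this toy matrix, a user who has package 0 gets a positive score for package 1, while a user without package 0 gets a negative one.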
  5. User average and package average as prior bias
     • User average = user installed pkg count / user observation count
     • Package average = pkg installed by users count / pkg observation count
     • Add them to the kNN result score.
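The two averages can be computed in one pass over the observations. The sketch below assumes the data arrives as `(user, package, installed)` triples with `installed` equal to 0 or 1; that layout and the name `prior_biases` are illustrative assumptions.

```python
from collections import Counter

def prior_biases(observations):
    """Return (user_avg, pkg_avg) from (user, package, installed) triples."""
    user_installed, user_seen = Counter(), Counter()
    pkg_installed, pkg_seen = Counter(), Counter()
    for user, pkg, installed in observations:
        user_seen[user] += 1
        pkg_seen[pkg] += 1
        user_installed[user] += installed
        pkg_installed[pkg] += installed
    # installed count / observation count, per user and per package
    user_avg = {u: user_installed[u] / n for u, n in user_seen.items()}
    pkg_avg = {p: pkg_installed[p] / n for p, n in pkg_seen.items()}
    return user_avg, pkg_avg

# Toy data with made-up user and package identifiers.
observations = [("u1", "a", 1), ("u1", "b", 0), ("u2", "a", 1), ("u2", "b", 1)]
user_avg, pkg_avg = prior_biases(observations)
```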
  6. User-specific package maintainer affinity
     • This metric is the percentage of a given maintainer's packages that a user has installed.
     • We use the percentage to predict how likely the user is to install other packages from the same maintainer. It is combined with the kNN result score with a weight of 0.25.
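A rough sketch of the affinity metric follows. It assumes the denominator is the number of that maintainer's packages observed for the user, and that "combine" means a weighted sum; the slides do not spell out either detail. All names are hypothetical.

```python
def maintainer_affinity(user_obs, maintainer_of, maintainer):
    """Fraction of a maintainer's packages, among those observed for one user, that are installed.

    user_obs: dict mapping package -> 0/1 for a single user.
    maintainer_of: dict mapping package -> maintainer.
    """
    flags = [inst for pkg, inst in user_obs.items()
             if maintainer_of.get(pkg) == maintainer]
    return sum(flags) / len(flags) if flags else 0.0

def combined_score(knn_result, affinity, weight=0.25):
    """Assumed combination rule: kNN score plus weighted affinity."""
    return knn_result + weight * affinity

# Toy data with made-up package and maintainer identifiers.
user_obs = {"a": 1, "b": 0, "c": 1}
maintainer_of = {"a": "m1", "b": "m1", "c": "m2"}
affinity = maintainer_affinity(user_obs, maintainer_of, "m1")
```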
  7. So far: baseline model
     • Very heuristic
     • Public AUC = 0.976x
  8. Matrix Factorization
     • Analyze the residuals only. The goal is to find structural errors in our baseline prediction.
     • prediction := baseline_output + residual
     • residual := pkg_bias + user_bias + pkgFactors . userFactors
     • The residuals are related to the Wilcoxon-Mann-Whitney (WMW) statistic, for M positive examples and N negative examples:

       $$\mathrm{AUC} = \frac{\sum_{i=0}^{M-1} \sum_{j=0}^{N-1} I(m_i, n_j)}{MN}, \qquad m_i \in \text{PositiveClass},\; n_j \in \text{NegativeClass}$$

       $$I(m_i, n_j) = \begin{cases} 1 & \text{if } m_i > n_j \\ 0 & \text{otherwise} \end{cases}$$
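For reference, the WMW form of AUC can be computed by direct pair counting. The sketch below uses the standard convention that a pair counts when the positive example scores strictly higher than the negative one; ties count as zero. The function name is hypothetical, and the quadratic loop is only practical for small inputs.

```python
def wmw_auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    correct = sum(
        1
        for m in positive_scores
        for n in negative_scores
        if m > n  # indicator: positive scored above negative
    )
    return correct / (len(positive_scores) * len(negative_scores))

# Three of the four pairs are ordered correctly (0.4 vs 0.5 is not).
auc = wmw_auc([0.9, 0.4], [0.5, 0.1])
```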
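  9. Matrix Factorization (cont.)
     • Minimizes truncated squared error with batch gradient descent (BGD):

       $$x_{ui} = s_{ui} + b_i + b_u + q_i^T p_u$$

       $$\min_{q_*, p_*} \sum_{(u,i) \in K} \tfrac{1}{2} \left( L_{rank}(x_{ui}) \right)^2 + \lambda \left( \lVert q_i \rVert^2 + \lVert p_u \rVert^2 + b_i^2 + b_u^2 \right)$$

     • Pairwise comparison:

       $$L_{rank} = \sum_{v=0}^{U-1} \sum_{j=0}^{I-1} I(x_{ui}, x_{vj}) \, (x_{vj} - x_{ui})$$

       $$I(x_{ui}, x_{vj}) = \begin{cases} 1 & \text{if } x_{ui} < x_{vj} \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } x_{ui} \in P,\; x_{vj} \in N$$

       $$I(x_{ui}, x_{vj}) = \begin{cases} 1 & \text{if } x_{ui} > x_{vj} \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } x_{ui} \in N,\; x_{vj} \in P$$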
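The pairwise loss and the per-entry objective term can be sketched as below. This is an illustration under assumptions: `s_ui` is taken to be the baseline output from the earlier slides, the "truncated" part of the error and the gradient step are not shown, and all function names are hypothetical.

```python
def predict(s_ui, b_i, b_u, q_i, p_u):
    """x_ui = s_ui + b_i + b_u + q_i . p_u (s_ui assumed to be the baseline score)."""
    return s_ui + b_i + b_u + sum(q * p for q, p in zip(q_i, p_u))

def rank_loss(x_ui, is_positive, others):
    """Sum of (x_vj - x_ui) over pairs that are ranked in the wrong order.

    others: list of (x_vj, is_positive_vj) for the entries compared against.
    """
    loss = 0.0
    for x_vj, other_positive in others:
        if is_positive and not other_positive and x_ui < x_vj:
            loss += x_vj - x_ui  # positive scored below a negative
        elif not is_positive and other_positive and x_ui > x_vj:
            loss += x_vj - x_ui  # negative scored above a positive
    return loss

def objective_term(l_rank, q_i, p_u, b_i, b_u, lam):
    """0.5 * L_rank^2 plus the L2 penalty for a single (u, i) entry."""
    penalty = sum(q * q for q in q_i) + sum(p * p for p in p_u) + b_i ** 2 + b_u ** 2
    return 0.5 * l_rank ** 2 + lam * penalty

# A positive entry scored 0.4 is outranked by one negative scored 0.5.
others = [(0.5, False), (0.1, False), (0.9, True)]
loss_for_positive = rank_loss(0.4, True, others)
```

Note that the loss is signed: a misranked negative entry contributes a negative value, which the squared objective then treats symmetrically.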
  10. Other Rules
     • For duplicate records that exist in both the test and training sets, copy the answers from the training set.
     • Assume that when a user installs a package P, the user also installs the packages that P depends on.
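The dependency rule can be sketched as a simple graph walk. The version below applies the rule repeatedly, so dependencies of dependencies are also marked as installed; whether the original entry went beyond direct dependencies is not stated. The dependency map uses made-up package names.

```python
def with_dependencies(installed, depends_on):
    """Return the installed set expanded with everything those packages depend on."""
    result = set(installed)
    stack = list(installed)
    while stack:
        pkg = stack.pop()
        for dep in depends_on.get(pkg, ()):
            if dep not in result:
                result.add(dep)
                stack.append(dep)
    return result

# Hypothetical dependency map: A depends on B, B depends on C, D depends on A.
depends_on = {"A": ["B"], "B": ["C"], "D": ["A"]}
expanded = with_dependencies({"A"}, depends_on)
```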
  11. Final Result
     • Public AUC = 0.984914
     • Final AUC = 0.979565
