An Approach to R Package
Recommendation Engine
                             Alex Lin
           alin@intelligentmining.com
                 Twitter: @alinatwork
Initial Thoughts
  The data set is expected to have very strong
   package-package relationships (dependencies and
   related package functionality).
  The data set (training + test) is not sparse.

  Most matrix factorization (MF) techniques in
   the recommender field optimize squared error on
   the predicted user ratings; they do not directly
   optimize for AUC.
Steps
1.    Modified k-Nearest Neighbor algorithm.
2.    User average & package average as prior bias.
3.    User-specific package Maintainer Affinity.
4.    Matrix factorization (MF) to post-process the
      residuals.
5.    Other rules.
Modified k-Nearest Neighbor algorithm
  Calculate cosine similarity for each pkg-pkg pair.
  Scale the cosine similarity by the “squared user
   support”, i.e. cosine * (support / ttl_user_cnt)**2
  Unlike regular kNN, which is only additive, we use
   the same kNN rules to penalize a package if other
   related packages were not installed.
  For unknown records, we take the ZAN approach:
   we treat the unknown entries as negative.
  k=all
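
The scoring described above can be sketched roughly as follows. This is a minimal illustration, not the author's code. It assumes installs are held as a dict from user to the set of installed packages, that "support" means the number of users who installed both packages (the slide does not define it), and that the penalty simply subtracts the same scaled similarity. The names `scaled_similarities` and `knn_score` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def scaled_similarities(installs, total_users):
    """installs: dict user -> set of installed packages.

    Returns dict (pkg_a, pkg_b) -> cosine * (support / total_users) ** 2.
    """
    users_by_pkg = {}
    for user, pkgs in installs.items():
        for pkg in pkgs:
            users_by_pkg.setdefault(pkg, set()).add(user)

    sims = {}
    for a, b in combinations(sorted(users_by_pkg), 2):
        # Assumption: support = number of users who installed both packages.
        support = len(users_by_pkg[a] & users_by_pkg[b])
        if support == 0:
            continue
        # Cosine similarity of two binary install vectors.
        cosine = support / sqrt(len(users_by_pkg[a]) * len(users_by_pkg[b]))
        scaled = cosine * (support / total_users) ** 2
        sims[(a, b)] = sims[(b, a)] = scaled
    return sims


def knn_score(user_pkgs, target, sims, all_pkgs):
    """k = all: every other package votes on the target package.

    Installed neighbours add their similarity; neighbours that are not
    installed (unknown entries are treated as negative) subtract it.
    """
    score = 0.0
    for other in all_pkgs:
        if other == target:
            continue
        s = sims.get((target, other), 0.0)
        score += s if other in user_pkgs else -s
    return score
```

With `k=all` there is no neighbour cut-off, so weakly related packages still vote; the squared support factor is what keeps their contribution small.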
User average and Package average as
prior bias
  User  average =
   user installed pkg count / user observation count
  Package average =
   pkg installed by users count / pkg observation count
  Add them to the kNN result score.
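
As a rough sketch of the two averages, assuming the observations arrive as `(user, package, installed)` triples with `installed` equal to 0 or 1, and assuming the biases are added to the kNN score unweighted (the slide gives no weights). The function names are hypothetical.

```python
def prior_biases(observations):
    """observations: iterable of (user, package, installed), installed in {0, 1}.

    Returns (user_avg, pkg_avg): install count / observation count per key.
    """
    user_seen, user_inst, pkg_seen, pkg_inst = {}, {}, {}, {}
    for user, pkg, installed in observations:
        user_seen[user] = user_seen.get(user, 0) + 1
        user_inst[user] = user_inst.get(user, 0) + installed
        pkg_seen[pkg] = pkg_seen.get(pkg, 0) + 1
        pkg_inst[pkg] = pkg_inst.get(pkg, 0) + installed
    user_avg = {u: user_inst[u] / user_seen[u] for u in user_seen}
    pkg_avg = {p: pkg_inst[p] / pkg_seen[p] for p in pkg_seen}
    return user_avg, pkg_avg


def biased_score(knn, user, pkg, user_avg, pkg_avg):
    """Assumption: plain sum of the kNN score and both averages."""
    return knn + user_avg.get(user, 0.0) + pkg_avg.get(pkg, 0.0)
```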
User-specific Package Maintainer
Affinity
  This metric is measured as the percentage of a given
   maintainer's packages that a user has installed.
  We use the percentage to predict how likely the
   user is to install other packages from the same
   maintainer. Combine it with the kNN result score
   with a weight of 0.25.
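
One way to read this metric, as a minimal sketch: for each user, take the fraction of a maintainer's packages observed for that user that the user actually installed. Whether the denominator is the observed packages or all of the maintainer's packages is not stated on the slide; the observed count is an assumption here, and the function names are hypothetical. Only the 0.25 weight comes from the slide.

```python
def maintainer_affinity(user_observations, maintainer_of):
    """user_observations: dict package -> installed (0 or 1) for one user.
    maintainer_of: dict package -> maintainer name.

    Returns dict maintainer -> installed fraction of that maintainer's
    packages among those observed for this user.
    """
    seen, inst = {}, {}
    for pkg, installed in user_observations.items():
        maintainer = maintainer_of[pkg]
        seen[maintainer] = seen.get(maintainer, 0) + 1
        inst[maintainer] = inst.get(maintainer, 0) + installed
    return {m: inst[m] / seen[m] for m in seen}


def combine_with_knn(knn, affinity, weight=0.25):
    """Weighted combination; 0.25 is the weight quoted on the slide."""
    return knn + weight * affinity
```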
So Far – baseline model
  Very heuristic
  Public AUC = 0.976x
Matrix Factorization
      Analyze the residuals only. The goal is to find
       structural errors in our baseline prediction.
      prediction := baseline_output + residual
      residual := pkg_bias + user_bias + pkgFactors . userFactors

      The residuals are related to the Wilcoxon-Mann-Whitney (WMW)
       statistic, which equals the AUC:

      \[
        \mathrm{AUC} = \frac{\sum_{i=0}^{P-1} \sum_{j=0}^{N-1} I(m_i, n_j)}{P N},
        \qquad m \in \text{PositiveClass},\; n \in \text{NegativeClass}
      \]

      \[
        I(m_i, n_j) =
        \begin{cases}
          1 & \text{if } m_i > n_j \\
          0 & \text{otherwise}
        \end{cases}
      \]

      Here P and N are the numbers of positive and negative examples.
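
For reference, AUC can be computed directly as the normalized WMW statistic. The sketch below uses the standard convention that a (positive, negative) pair counts when the positive example is scored above the negative one; ties count as zero here, which is a simplification (a common variant gives them half credit). The function name is hypothetical.

```python
def auc_wmw(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked in the right order."""
    wins = sum(1 for m in pos_scores for n in neg_scores if m > n)
    return wins / (len(pos_scores) * len(neg_scores))
```

This brute-force form is quadratic in the number of examples; it is meant to show the definition, not to be fast.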
Matrix Factorization – cont.
          Minimizes truncated square error with batch gradient
           descent (BGD)

          \[ x_{ui} = s_{ui} + b_i + b_u + q_i^T p_u \]

          \[
            \min_{q_*, p_*} \sum_{(u,i) \in K} \frac{1}{2} \left( L_{rank}(x_{ui}) \right)^2
              + \lambda \left( \|q_i\|^2 + \|p_u\|^2 + b_i^2 + b_u^2 \right)
          \]

          Pairwise comparison:

          \[
            L_{rank} = \sum_{v=0}^{U-1} \sum_{j=0}^{I-1} I(x_{ui}, x_{vj}) \, (x_{vj} - x_{ui})
          \]

          \[
            I(x_{ui}, x_{vj}) =
            \begin{cases}
              1 & \text{if } x_{ui} \in P,\; x_{vj} \in N \text{ and } x_{ui} < x_{vj} \\
              1 & \text{if } x_{ui} \in N,\; x_{vj} \in P \text{ and } x_{ui} > x_{vj} \\
              0 & \text{otherwise}
            \end{cases}
          \]
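
A minimal sketch of the prediction and the pairwise ranking term, written out for a single entry. It only evaluates the two quantities; the gradient step, the regularization and the truncation are left out, and the function names are hypothetical. Here `s_ui` is taken to be the baseline score, which matches `prediction := baseline_output + residual` but is not spelled out on the slide.

```python
def predict(s_ui, b_i, b_u, q_i, p_u):
    """x_ui = s_ui + b_i + b_u + q_i . p_u (s_ui assumed to be the baseline)."""
    return s_ui + b_i + b_u + sum(q * p for q, p in zip(q_i, p_u))


def l_rank(x_ui, is_positive, others):
    """Sum of (x_vj - x_ui) over opposite-class entries that are misordered.

    others: iterable of (x_vj, is_positive_vj).
    A positive entry is misordered when it scores below a negative one;
    a negative entry is misordered when it scores above a positive one.
    """
    total = 0.0
    for x_vj, other_is_positive in others:
        if other_is_positive == is_positive:
            continue  # only positive/negative pairs are compared
        if is_positive and x_ui < x_vj:
            total += x_vj - x_ui
        elif not is_positive and x_ui > x_vj:
            total += x_vj - x_ui
    return total
```

The term is positive for an under-scored positive entry and negative for an over-scored negative entry, so its sign indicates which way the prediction should move; the objective squares it.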
Other Rules
  For duplicate records that exist in both the testing
   and training sets, copy the answers from the training set.
  Assume that when a user installs a package P, the user also
   installs the packages that P depends on.
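
The dependency rule can be sketched as below. It assumes a dict mapping each package to the packages it depends on, and it follows dependencies transitively, which is an assumption: the slide only mentions the packages P depends on. The function name and the package names in the usage note are hypothetical.

```python
def implied_installs(installed, depends_on):
    """Packages a user is assumed to have because an installed package needs them.

    installed: set of packages known to be installed by the user.
    depends_on: dict package -> iterable of packages it depends on.
    Returns the implied packages, excluding those already in `installed`.
    """
    implied = set()
    stack = list(installed)
    while stack:
        pkg = stack.pop()
        for dep in depends_on.get(pkg, ()):
            if dep not in implied and dep not in installed:
                implied.add(dep)
                stack.append(dep)  # assumption: follow dependencies transitively
    return implied
```

For example, with `depends_on = {"pkgA": ["pkgB"], "pkgB": ["pkgC"]}`, a user who installed `pkgA` is predicted to have `pkgB` and `pkgC` as well.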
Final Result
  Public AUC = 0.984914
  Final AUC = 0.979565
