An Approach to an R Package Recommendation Engine
Alex Lin email@example.com Twitter: @alinatwork
Initial Thoughts
The data set is expected to have very strong package-package relationships (dependencies and related package functionality). The data set (training + test) is not sparse. Most matrix factorization (MF) techniques in the recommender field optimize squared error on the predicted user ratings; they do not directly optimize for AUC.
Steps
1. Modified k-nearest neighbor (kNN) algorithm.
2. User average & package average as prior bias.
3. User-specific package maintainer affinity.
4. Matrix factorization (MF) to post-process the residuals.
5. Other rules.
Modified k-Nearest Neighbor Algorithm
Calculate the cosine similarity for each pkg-pkg pair. Scale the cosine similarity by the squared user support, i.e. cosine * (support / ttl_user_cnt)**2. Unlike regular kNN, which is only additive, we use the same kNN rules to penalize a package if a related package was not installed. For unknown records we take the ZAN approach: we treat the unknown entries as negative. k = all.
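A minimal sketch of this modified kNN, assuming a binary user x package matrix `R`; the function name and the choice of a simple subtractive penalty for not-installed related packages are illustrative assumptions, not the slides' exact formulation:

```python
import numpy as np

def knn_scores(R):
    """Score every (user, package) pair with a support-weighted
    item-item cosine kNN, treating unknown entries as negative (0).

    R: binary user x package matrix (1 = installed), numpy array.
    """
    n_users, _ = R.shape

    # Cosine similarity between package columns.
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0                 # avoid divide-by-zero
    co_counts = R.T @ R                     # co-install counts per pkg pair
    cos = co_counts / np.outer(norms, norms)

    # Scale each pair by squared user support: (support / ttl_user_cnt)**2.
    sim = cos * (co_counts / n_users) ** 2

    # k = all: add similarity to installed packages, and penalize a package
    # by its similarity to related packages the user did NOT install.
    pos = R @ sim                           # contribution of installed pkgs
    neg = (1 - R) @ sim                     # penalty from missing related pkgs
    return pos - neg
```

With k = all there is no neighbor truncation, so the score is just a similarity-weighted vote over every package the user did (or did not) install.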
User Average and Package Average as Prior Bias
User average = user's installed package count / user's observation count. Package average = count of users who installed the package / package observation count. Add both to the kNN result score.
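These two averages can be sketched as follows, assuming a binary install matrix `R` and an `observed` mask marking which (user, package) entries were seen in training; both names are hypothetical:

```python
import numpy as np

def prior_bias(R, observed):
    """User average + package average as a prior bias for every pair.

    R: binary user x package matrix (1 = installed).
    observed: binary mask of entries present in the training data.
    """
    # User average = installed count / observation count, per user.
    user_avg = (R * observed).sum(axis=1) / np.maximum(observed.sum(axis=1), 1)
    # Package average = installing-user count / observation count, per package.
    pkg_avg = (R * observed).sum(axis=0) / np.maximum(observed.sum(axis=0), 1)
    # Broadcast one bias per user plus one per package over all pairs.
    return user_avg[:, None] + pkg_avg[None, :]
```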
User-Specific Package Maintainer Affinity
This metric is measured as the percentage of a given maintainer's packages that a user has installed. We use this percentage to predict how likely the user is to install other packages from the same maintainer. Combine with the kNN result score with a weight of 0.25.
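A small sketch of the affinity metric and the weighted blend; the data layout (a package-to-maintainer dict) and function names are assumptions for illustration:

```python
from collections import defaultdict

def maintainer_affinity(installed, maintainer_of):
    """Per-maintainer fraction of their packages that the user installed.

    installed: set of package names the user has installed.
    maintainer_of: dict mapping package name -> maintainer name.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for pkg, maintainer in maintainer_of.items():
        totals[maintainer] += 1
        if pkg in installed:
            hits[maintainer] += 1
    return {m: hits[m] / totals[m] for m in totals}

def blended_score(knn_score, affinity, weight=0.25):
    """Combine the kNN score with the affinity using the slides' 0.25 weight."""
    return knn_score + weight * affinity
```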
So Far – Baseline Model
Very heuristic. Public AUC = 0.976x.
Matrix Factorization
Analyze the residuals only. The goal is to find structural errors in our baseline prediction.
prediction := baseline_output + residual
residual := pkg_bias + user_bias + pkgFactors . userFactors
The residuals are related to the Wilcoxon-Mann-Whitney (WMW) statistic:

AUC = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} I(m_i, n_j), \quad m \in \text{PositiveClass},\ n \in \text{NegativeClass}

I(m_i, n_j) = \begin{cases} 1 & \text{if } m_i > n_j \\ 0 & \text{otherwise} \end{cases}
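The WMW statistic can be computed directly: AUC is the fraction of (positive, negative) pairs where the positive example is scored higher (this sketch counts ties as half, a common convention the slides do not specify):

```python
def wmw_auc(pos_scores, neg_scores):
    """AUC via the Wilcoxon-Mann-Whitney statistic: the fraction of
    (positive, negative) score pairs that are correctly ranked."""
    M, N = len(pos_scores), len(neg_scores)
    correct = sum((m > n) + 0.5 * (m == n)
                  for m in pos_scores for n in neg_scores)
    return correct / (M * N)
```

Because AUC depends only on pairwise orderings, improving it means re-ranking misordered pairs, which motivates the pairwise loss on the next slide.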
Matrix Factorization – cont.
Minimize a truncated squared error with batch gradient descent (BGD):

x_{ui} = s_{ui} + b_i + b_u + q_i^T p_u

\min_{q_*, p_*} \sum_{(u,i) \in K} \frac{1}{2} \left( L_{rank}(x_{ui}) \right)^2 + \lambda \left( \|q_i\|^2 + \|p_u\|^2 + b_i^2 + b_u^2 \right)

Pairwise comparison:

L_{rank}(x_{ui}) = \sum_{v=0}^{U-1} \sum_{j=0}^{I-1} I(x_{ui}, x_{vj})(x_{vj} - x_{ui})

I(x_{ui}, x_{vj}) = \begin{cases} 1 & \text{if } x_{ui} < x_{vj},\ x_{ui} \in P,\ x_{vj} \in N \\ 1 & \text{if } x_{ui} > x_{vj},\ x_{ui} \in N,\ x_{vj} \in P \\ 0 & \text{otherwise} \end{cases}
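To make the pairwise idea concrete, here is a deliberately simplified gradient step: it minimizes a per-pair squared hinge, 1/2 * (x_n - x_p)^2 over misranked pairs, rather than the slides' square-of-the-sum L_rank, and it updates the scores directly instead of back-propagating into the biases and factors. All of that simplification is my assumption, not the authors' implementation:

```python
import numpy as np

def bgd_rank_step(x_pos, x_neg, lr=0.1):
    """One batch-gradient step pushing misranked pairs apart.

    x_pos, x_neg: 1-D arrays of model scores for positive / negative entries.
    Only pairs with x_pos < x_neg (misranked) contribute a gradient; the
    truncation to zero elsewhere is the "truncated squared error".
    """
    x_pos, x_neg = x_pos.copy(), x_neg.copy()
    # d[p, n] = x_neg[n] - x_pos[p], kept only where the pair is misranked.
    d = np.maximum(x_neg[None, :] - x_pos[:, None], 0.0)
    # Gradient of 1/2 d^2 is -d w.r.t. x_pos and +d w.r.t. x_neg.
    x_pos += lr * d.sum(axis=1)   # raise misranked positives
    x_neg -= lr * d.sum(axis=0)   # lower misranked negatives
    return x_pos, x_neg
```

Correctly ranked pairs produce zero gradient, so repeated steps only reorder the pairs that currently hurt AUC.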
Other Rules
For duplicate records that exist in both the testing and training sets, copy the answers from the training set. Assume that when a user installs a package P, the user also installs the packages that P depends on.
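The dependency rule amounts to taking the transitive closure of each user's install set over the dependency graph; a sketch, assuming a simple package -> dependency-list dict:

```python
def with_dependencies(installed, depends_on):
    """Expand an install set with transitive dependencies: if a user
    installs P, assume they also install everything P depends on.

    installed: set of package names.
    depends_on: dict mapping package name -> list of direct dependencies.
    """
    result = set(installed)
    stack = list(installed)
    while stack:
        pkg = stack.pop()
        for dep in depends_on.get(pkg, ()):
            if dep not in result:
                result.add(dep)
                stack.append(dep)      # follow dependencies of dependencies
    return result
```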
Final Result Public AUC = 0.984914 Final AUC = 0.979565