R package Recommendation Engine

An Approach to R Package Recommendation Engine
Alex Lin
alin@intelligentmining.com
Twitter: @alinatwork

Initial Thoughts
  The data set is expected to have very strong package-package relationships (dependencies and related package functionality).
  The data set (training + test) is not sparse.

  Most matrix factorization (MF) techniques in the recommender field minimize the squared error on the predicted user ratings; they do not directly optimize for AUC.

Steps
1.  Modified k-Nearest Neighbor algorithm.
2.  User average & package average as prior bias.
3.  User-specific package Maintainer Affinity.
4.  Matrix factorization (MF) to post-process the
residuals.
5.  Other rules.

Modified k-Nearest Neighbor algorithm
  Calculate the cosine similarity for each pkg-pkg pair.
  Scale the cosine similarity by the "squared user support", i.e. cosine * (support / ttl_user_cnt)**2
  Unlike regular kNN, which is only additive, we use the same kNN rules to penalize a package when other related packages were not installed.
  For unknown records, we take the ZAN approach: unknown entries are treated as negative.
  k = all (no neighbor cutoff)
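
The bullets above can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes `support` means the number of users who installed both packages, that "penalize" means a not-installed neighbor votes with weight -1, and that unknown entries have already been set to 0 in the input matrix. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def modified_knn_scores(installed: np.ndarray) -> np.ndarray:
    """Score every (user, pkg) pair from a 0/1 install matrix of shape (n_users, n_pkgs)."""
    x = installed.astype(float)
    n_users = x.shape[0]

    co = x.T @ x                          # co[p, q] = users who installed both p and q
    norms = np.sqrt(np.diag(co))          # sqrt of each package's install count
    denom = np.outer(norms, norms)
    cosine = np.divide(co, denom, out=np.zeros_like(co), where=denom > 0)

    # "Squared user support" scaling: cosine * (support / ttl_user_cnt)**2
    # (assumption: support is the co-install count).
    sim = cosine * (co / n_users) ** 2
    np.fill_diagonal(sim, 0.0)            # a package does not vote for itself

    # Installed neighbors vote +1; not-installed (or unknown) neighbors vote -1.
    votes = 2.0 * x - 1.0
    return votes @ sim                    # k = all: every package is a neighbor


# Toy data: 4 users x 3 packages.
installed = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])
scores = modified_knn_scores(installed)
```

In the toy data, the last user has installed only package 0, which co-occurs with package 1, so package 1 scores above package 2 for that user. The third user has installed neither package 0 nor package 1, so each is penalized through its related package.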

User average and Package average as
prior bias
  User average = (user's installed pkg count) / (user's observation count)
  Package average = (count of users who installed the pkg) / (pkg's observation count)
  Add both averages to the kNN result score.
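
A small sketch of the two averages, assuming the observations are available as (user, pkg, installed) triples and that the averages are added to the kNN score unweighted (the slide gives no weights). The function names and the toy records are hypothetical.

```python
from collections import defaultdict


def prior_biases(observations):
    """Return (user_avg, pkg_avg) dicts from (user, pkg, installed) triples."""
    user_inst, user_obs = defaultdict(int), defaultdict(int)
    pkg_inst, pkg_obs = defaultdict(int), defaultdict(int)
    for user, pkg, installed in observations:
        user_obs[user] += 1
        pkg_obs[pkg] += 1
        user_inst[user] += installed
        pkg_inst[pkg] += installed
    user_avg = {u: user_inst[u] / n for u, n in user_obs.items()}
    pkg_avg = {p: pkg_inst[p] / n for p, n in pkg_obs.items()}
    return user_avg, pkg_avg


def biased_score(knn_score, user, pkg, user_avg, pkg_avg):
    """Add both priors to the kNN score (plain sum is an assumption)."""
    return knn_score + user_avg.get(user, 0.0) + pkg_avg.get(pkg, 0.0)


observations = [
    ("u1", "pkg_a", 1),
    ("u1", "pkg_b", 0),
    ("u2", "pkg_a", 1),
    ("u2", "pkg_b", 1),
]
user_avg, pkg_avg = prior_biases(observations)
example = biased_score(0.2, "u1", "pkg_b", user_avg, pkg_avg)
```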

User-specific Package Maintainer
Affinity
  This metric is the percentage of a given maintainer's packages that a user has installed.
  We use the percentage to predict how likely the user is to install other packages from the same maintainer. It is combined with the kNN result score with a weight of 0.25.
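
One way to read this metric, sketched under assumptions: the affinity is the fraction of a maintainer's packages that the user has installed, computed over a package-to-maintainer map, and it enters the score as `knn_score + 0.25 * affinity`. Whether the target package itself should be excluded from the fraction is not stated on the slide; this sketch does not exclude it. All names and data below are hypothetical.

```python
from collections import defaultdict

AFFINITY_WEIGHT = 0.25  # weight stated on the slide


def maintainer_affinity(user_installed, pkg_maintainer):
    """Fraction of each maintainer's packages that the user has installed."""
    total, hits = defaultdict(int), defaultdict(int)
    for pkg, maintainer in pkg_maintainer.items():
        total[maintainer] += 1
        if pkg in user_installed:
            hits[maintainer] += 1
    return {m: hits[m] / total[m] for m in total}


def score_with_affinity(knn_score, pkg, affinity, pkg_maintainer):
    """Combine the kNN score with the user's affinity for the package's maintainer."""
    maintainer = pkg_maintainer.get(pkg)
    return knn_score + AFFINITY_WEIGHT * affinity.get(maintainer, 0.0)


pkg_maintainer = {
    "pkg_a": "maintainer_x",
    "pkg_b": "maintainer_x",
    "pkg_c": "maintainer_x",
    "pkg_d": "maintainer_y",
}
affinity = maintainer_affinity({"pkg_a", "pkg_b"}, pkg_maintainer)
combined = score_with_affinity(0.1, "pkg_c", affinity, pkg_maintainer)
```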

So Far – baseline model
  Very heuristic
  Public AUC = 0.976x

Matrix Factorization
  Analyze the residuals only. The goal is to find structural errors in our baseline prediction.
  prediction := baseline_output + residual
  residual := pkg_bias + user_bias + pkgFactors . userFactors

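  The residuals are related to the Wilcoxon-Mann-Whitney (WMW) statistic, which gives the AUC:

$$\mathrm{AUC} = \frac{\sum_{i=0}^{P-1} \sum_{j=0}^{N-1} I(m_i, n_j)}{P N}, \qquad m \in \text{PositiveClass},\ n \in \text{NegativeClass}$$

$$I(m_i, n_j) = \begin{cases} 1 & \text{if } m_i > n_j \\ 0 & \text{otherwise} \end{cases}$$

  P and N are the numbers of positive and negative examples; a pair counts when the positive example is scored above the negative one.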
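
The WMW form of AUC is a pair count: the fraction of (positive, negative) pairs in which the positive example gets the higher score. A minimal sketch follows. The function name and the toy scores are hypothetical, and ties count as 0, as in the indicator above.

```python
def wmw_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = sum(1 for m in pos_scores for n in neg_scores if m > n)
    return wins / (len(pos_scores) * len(neg_scores))


# 3 positives x 2 negatives = 6 pairs; only (0.4, 0.5) is mis-ordered.
auc = wmw_auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

This double loop is quadratic in the number of examples, which is fine for illustration but not for large test sets.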

Matrix Factorization – cont.
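  Minimizes the truncated squared error with batch gradient descent (BGD)

$$x_{ui} = s_{ui} + b_i + b_u + q_i^T p_u$$

$$\min_{q_*, p_*} \sum_{(u,i) \in K} \frac{1}{2} \left( L_{rank}(x_{ui}) \right)^2 + \lambda \left( \|q_i\|^2 + \|p_u\|^2 + b_i^2 + b_u^2 \right)$$

Pairwise comparison

$$L_{rank}(x_{ui}) = \sum_{v=0}^{U-1} \sum_{j=0}^{I-1} I(x_{ui}, x_{vj}) \, (x_{vj} - x_{ui})$$

$$I(x_{ui}, x_{vj}) = \begin{cases} 1 & \text{if } x_{ui} \in P,\ x_{vj} \in N \text{ and } x_{ui} < x_{vj} \\ 1 & \text{if } x_{ui} \in N,\ x_{vj} \in P \text{ and } x_{ui} > x_{vj} \\ 0 & \text{otherwise} \end{cases}$$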
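
The pairwise loss can be sketched with NumPy as below. This is an illustration under assumptions, not the original code: it reads s_ui as the baseline output, b_i and b_u as the package and user biases, and q_i, p_u as the package and user factors; it evaluates L_rank over a flat list of scored (u, i) entries and shows only the data term of the objective, with regularization and the gradient step omitted. All function names and toy values are hypothetical.

```python
import numpy as np


def predict(s_ui, b_i, b_u, q_i, p_u):
    """x_ui = s_ui + b_i + b_u + q_i^T p_u for one (user, pkg) entry."""
    return s_ui + b_i + b_u + float(np.dot(q_i, p_u))


def rank_loss(scores, labels):
    """L_rank for every entry: sum of (x_vj - x_ui) over mis-ordered opposite-class pairs."""
    x = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=bool)
    diff = x[None, :] - x[:, None]             # diff[a, b] = x_b - x_a
    opposite = y[:, None] != y[None, :]        # pairs with one positive, one negative
    # A positive entry is mis-ordered when the negative scores higher, and vice versa.
    misordered = np.where(y[:, None], diff > 0, diff < 0)
    return (diff * (opposite & misordered)).sum(axis=1)


def data_term(scores, labels):
    """Sum over entries of 0.5 * L_rank^2 (regularization not included)."""
    return 0.5 * float(np.sum(rank_loss(scores, labels) ** 2))


scores = [0.9, 0.2, 0.6, 0.4]
labels = [1, 1, 0, 0]
losses = rank_loss(scores, labels)
total = data_term(scores, labels)
```

Correctly ordered pairs contribute nothing, which is where the truncation comes from: the positive entry scored 0.9 has zero loss, while the positive scored 0.2 is charged for both negatives ranked above it.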

Other Rules
  For duplicate records that exist in both the test and training sets, copy the answers from the training set.
  Assume that when a user installs a package P, the user also installs the packages that P depends on.
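
Both rules can be applied as a post-processing pass over the model scores. The sketch below assumes training labels keyed by (user, pkg), a map from each package to its direct dependencies, and that a rule hit overrides the score with 0.0 or 1.0. Only direct dependencies are followed here; whether the original followed transitive ones is not stated. All names and data are hypothetical.

```python
from collections import defaultdict


def apply_rules(test_pairs, scores, train_labels, depends_on):
    """Override model scores where one of the two rules applies."""
    installed = defaultdict(set)
    for (user, pkg), label in train_labels.items():
        if label == 1:
            installed[user].add(pkg)

    adjusted = []
    for (user, pkg), score in zip(test_pairs, scores):
        if (user, pkg) in train_labels:
            # Rule 1: duplicate record, copy the training answer.
            score = float(train_labels[(user, pkg)])
        elif any(pkg in depends_on.get(p, ()) for p in installed[user]):
            # Rule 2: the user installed a package that depends on pkg.
            score = 1.0
        adjusted.append(score)
    return adjusted


train_labels = {("u1", "pkg_a"): 1, ("u1", "pkg_b"): 0}
depends_on = {"pkg_a": {"pkg_c"}}
test_pairs = [("u1", "pkg_b"), ("u1", "pkg_c"), ("u1", "pkg_d")]
adjusted = apply_rules(test_pairs, [0.7, 0.3, 0.4], train_labels, depends_on)
```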

Final Result
  Public AUC = 0.984914
  Final AUC = 0.979565
