Building a Recommendation Engine - An example of a product recommendation engine
R package Recommendation Engine
1. An Approach to R Package Recommendation Engine
Alex Lin
alin@intelligentmining.com
Twitter: @alinatwork
2. Initial Thoughts
The data set is expected to have very strong
package-package relationships (dependencies and
related package functionality).
The data set (training + test) is not sparse.
Most matrix factorization (MF) techniques in
the recommender field minimize squared error on
the predicted user ratings; they do not directly
optimize for AUC.
3. Steps
1. Modified k-Nearest Neighbor algorithm.
2. User average & package average as prior bias.
3. User-specific package Maintainer Affinity.
4. Matrix factorization (MF) to post-process the
residuals.
5. Other rules.
4. Modified k-Nearest Neighbor algorithm
Calculate cosine similarity for each pkg-pkg pair.
Scale the cosine similarity by the “squared user
support”, i.e. cosine * (support / ttl_user_cnt)**2
Unlike regular kNN, which is only additive, we use
the same kNN rules to penalize a package if other
related packages were not installed.
For unknown records, we take the ZAN approach:
we treat the unknown entries as negative.
k=all
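The scaling and the signed voting above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's code: "support" is taken to be the number of users who installed both packages, a missing or unknown entry votes -1, and the names `scaled_similarity`, `knn_score` and the toy packages are hypothetical.

```python
import math

def scaled_similarity(installs, n_users):
    """installs: dict pkg -> set of user ids who installed it (assumed layout)."""
    sims = {}
    pkgs = sorted(installs)
    for a in pkgs:
        for b in pkgs:
            if a == b:
                continue
            # Assumption: support = number of users who installed both packages.
            support = len(installs[a] & installs[b])
            denom = math.sqrt(len(installs[a]) * len(installs[b]))
            cosine = support / denom if denom else 0.0
            # cosine * (support / ttl_user_cnt)**2, as on the slide
            sims[(a, b)] = cosine * (support / n_users) ** 2
    return sims

def knn_score(user, target, installs, sims):
    """k = all: every other package votes, +1 if installed, -1 otherwise."""
    score = 0.0
    for (a, b), s in sims.items():
        if a != target:
            continue
        # Not installed and unknown are both treated as negative.
        vote = 1.0 if user in installs[b] else -1.0
        score += s * vote
    return score

# Toy data: hypothetical packages and user ids.
installs = {"pkgA": {1, 2, 3}, "pkgB": {1, 2}, "pkgC": {4}}
sims = scaled_similarity(installs, n_users=4)
score_user1 = knn_score(1, "pkgB", installs, sims)  # user 1 has pkgA
score_user4 = knn_score(4, "pkgB", installs, sims)  # user 4 lacks pkgA
```

In this toy data, user 1 gets a positive score for pkgB because the related pkgA is installed, while user 4 gets a negative one because it is not.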
5. User average and Package average as prior bias
User average =
user installed pkg count / user observation count
Package average =
pkg installed by users count / pkg observation count
Add them into the kNN result score.
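A minimal sketch of the two averages, assuming the observations arrive as (user, pkg, installed) triples with installed in {0, 1}. The slide does not say how the averages are added, so the plain unweighted sum in `biased_score` is an assumption, and all names here are hypothetical.

```python
def prior_bias(observations):
    """observations: iterable of (user, pkg, installed), installed in {0, 1}."""
    user_cnt, user_inst, pkg_cnt, pkg_inst = {}, {}, {}, {}
    for user, pkg, installed in observations:
        user_cnt[user] = user_cnt.get(user, 0) + 1
        user_inst[user] = user_inst.get(user, 0) + installed
        pkg_cnt[pkg] = pkg_cnt.get(pkg, 0) + 1
        pkg_inst[pkg] = pkg_inst.get(pkg, 0) + installed
    # installed count / observation count, per user and per package
    user_avg = {u: user_inst[u] / user_cnt[u] for u in user_cnt}
    pkg_avg = {p: pkg_inst[p] / pkg_cnt[p] for p in pkg_cnt}
    return user_avg, pkg_avg

def biased_score(knn_score, user, pkg, user_avg, pkg_avg):
    # Assumption: an unweighted sum; unseen users or packages contribute 0.
    return knn_score + user_avg.get(user, 0.0) + pkg_avg.get(pkg, 0.0)

obs = [(1, "pkgA", 1), (1, "pkgB", 0), (2, "pkgA", 1), (2, "pkgB", 1)]
user_avg, pkg_avg = prior_bias(obs)
```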
6. User-specific Package Maintainer Affinity
This metric is the percentage of a given
maintainer's packages that a user has installed.
We use the percentage to predict how likely the
user is to install other packages from the same
maintainer. Combine it with the kNN result score
with a weight of 0.25.
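As a rough sketch of this metric: the denominator below is every package of the maintainer, which is one possible reading of the slide, and the combination is assumed to be a simple weighted sum. The function names, maintainers and packages are hypothetical.

```python
def maintainer_affinity(user_pkgs, maintainer_of, maintainer):
    """Fraction of `maintainer`'s packages that the user has installed."""
    owned = [p for p, m in maintainer_of.items() if m == maintainer]
    if not owned:
        return 0.0
    return sum(1 for p in owned if p in user_pkgs) / len(owned)

def combine(knn_score, affinity, weight=0.25):
    # The weight 0.25 is from the slide; the additive form is an assumption.
    return knn_score + weight * affinity

# Hypothetical maintainers and packages.
maintainer_of = {"pkgA": "m1", "pkgB": "m1", "pkgC": "m1", "pkgD": "m2"}
user_pkgs = {"pkgA", "pkgD"}
affinity_m1 = maintainer_affinity(user_pkgs, maintainer_of, "m1")
```

When scoring a specific target package, it may make sense to leave that package out of the count so it does not contribute to its own prediction.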
7. So Far – baseline model
Very heuristic
Public AUC = 0.976x
8. Matrix Factorization
Analyze the residuals only. The goal is to find
structural errors in our baseline prediction.
prediction := baseline_output + residual
residual := pkg_bias + user_bias + pkgFactors . userFactors
The residuals are related to the Wilcoxon-Mann-Whitney (WMW)
statistic:
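\[
\mathrm{AUC} = \frac{\sum_{i=0}^{P-1} \sum_{j=0}^{N-1} I(m_i, n_j)}{P N},
\qquad m \in \text{PositiveClass},\; n \in \text{NegativeClass}
\]
\[
I(m_i, n_j) =
\begin{cases}
1 & \text{if } m_i > n_j \\
0 & \text{otherwise}
\end{cases}
\]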
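The pairwise-counting view of AUC can be checked directly on small score lists. The sketch below uses the usual WMW convention that a (positive, negative) pair counts when the positive is scored strictly higher; the function name `auc_wmw` and the toy scores are hypothetical.

```python
def auc_wmw(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    wins = sum(1 for m in pos_scores for n in neg_scores if m > n)
    return wins / (len(pos_scores) * len(neg_scores))

auc = auc_wmw([0.9, 0.8, 0.4], [0.5, 0.3])  # 5 of the 6 pairs are ordered correctly
```

Ties count as 0 here; many implementations count a tie as one half instead.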
9. Matrix Factorization – cont.
Minimize the truncated squared error with batch gradient
descent (BGD)
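\[
x_{ui} = s_{ui} + b_i + b_u + q_i^{T} p_u
\]
\[
\min_{q_*,\, p_*} \sum_{(u,i) \in K} \frac{1}{2} \bigl( L_{rank}(x_{ui}) \bigr)^2
  + \lambda \bigl( \|q_i\|^2 + \|p_u\|^2 + b_i^2 + b_u^2 \bigr)
\]
Pairwise comparison
\[
L_{rank} = \sum_{v=0}^{U-1} \sum_{j=0}^{I-1} I(x_{ui}, x_{vj}) \, (x_{vj} - x_{ui})
\]
\[
I(x_{ui}, x_{vj}) =
\begin{cases}
1 \text{ if } x_{ui} < x_{vj},\ 0 \text{ otherwise} & x_{ui} \in P,\ x_{vj} \in N \\
1 \text{ if } x_{ui} > x_{vj},\ 0 \text{ otherwise} & x_{ui} \in N,\ x_{vj} \in P
\end{cases}
\]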
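To make the pairwise term concrete, here is a small sketch that evaluates the prediction, the rank loss for one entry, and that entry's contribution to the objective. It does not implement the gradient step, and the function names, the list-of-pairs input and the toy numbers are hypothetical.

```python
def predict(s_ui, b_i, b_u, q_i, p_u):
    """Baseline score plus package bias, user bias and the factor dot product."""
    return s_ui + b_i + b_u + sum(q * p for q, p in zip(q_i, p_u))

def rank_loss(x_ui, is_positive, others):
    """Sum of (x_vj - x_ui) over misordered pairs from the opposite class.

    others: list of (x_vj, is_positive_vj) for the entries compared against.
    """
    total = 0.0
    for x_vj, pos_vj in others:
        if is_positive and not pos_vj and x_ui < x_vj:
            total += x_vj - x_ui  # positive ranked below a negative
        elif not is_positive and pos_vj and x_ui > x_vj:
            total += x_vj - x_ui  # negative ranked above a positive
    return total

def objective_term(l_rank, b_i, b_u, q_i, p_u, lam):
    """Half the squared rank loss plus the L2 penalty for one (u, i) entry."""
    reg = sum(q * q for q in q_i) + sum(p * p for p in p_u) + b_i ** 2 + b_u ** 2
    return 0.5 * l_rank ** 2 + lam * reg

x = predict(0.5, 0.1, -0.1, [1.0, 2.0], [0.5, 0.25])
loss_pos = rank_loss(0.4, True, [(0.7, False), (0.2, False), (0.9, True)])
loss_neg = rank_loss(0.6, False, [(0.5, True), (0.9, True)])
```

Correctly ordered pairs and same-class pairs contribute nothing, so only ranking mistakes drive the squared term.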
10. Other Rules
For duplicate records that exist in both the test
and training sets, copy the answers from the training set.
Assume that when a user installs a package P, the user also
installs the packages that P depends on.
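Both rules can be sketched as a post-processing step on the model score. Following dependencies transitively is an assumption (the slide only mentions the packages that P depends on), and the function names and data layout are hypothetical.

```python
def with_dependencies(installed, depends_on):
    """Installed packages plus everything they depend on (transitively, by assumption)."""
    result = set(installed)
    stack = list(installed)
    while stack:
        pkg = stack.pop()
        for dep in depends_on.get(pkg, ()):
            if dep not in result:
                result.add(dep)
                stack.append(dep)
    return result

def apply_rules(user, pkg, model_score, train_answers, user_installed, depends_on):
    # Rule 1: a record that also appears in the training set keeps its known answer.
    if (user, pkg) in train_answers:
        return float(train_answers[(user, pkg)])
    # Rule 2: a dependency of an installed package is assumed to be installed.
    if pkg in with_dependencies(user_installed, depends_on):
        return 1.0
    return model_score

depends_on = {"pkgA": ["pkgB"], "pkgB": ["pkgC"]}
closure = with_dependencies({"pkgA"}, depends_on)
```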