2. 2
Charu C. Aggarwal, Recommender Systems: The Textbook,
Springer Publishing Company, Incorporated, 2016
Recommendation
In the age of Machine Learning
3. 3
Goals of Recommendation Systems, Rating Types
Operational & Technical Goals
• Relevance (most important)
• Novelty
• Serendipity (surprising)
• Diversity
Business Goals
• Improve user satisfaction
• Improve user loyalty
• Increase sales
• Provide insights into users’ needs
• Help customize the user experience further
Rating Types
Explicit: like/dislike
Implicit: interaction or not (less expressive)
• Continuous
• Interval-based
• Ordinal
• Binary
• Unary
Prediction Types
• Rating value of a user-item combination
• Top-k items or top-k users
4. 4
Basic Models of Recommendation Systems
Collaborative Filtering Models
• Use the collaborative power of the ratings provided by multiple users to make recommendations
• Observed ratings are often highly correlated across various users and items
Content-Based Recommender Systems
• Descriptive attributes of items are used to make recommendations
• Ratings and buying behavior of users are combined with the content
information available in the items
Knowledge-Based Recommender Systems
• Allow the users to explicitly specify what they want
• Based on the similarities between customer requirements and item
descriptions, or the use of constraints specifying user requirements.
• Constraint-based vs Case-based
• Conversational vs Search-based vs Navigation-based
Demographic Recommender Systems
• Map specific demographics to ratings or buying propensities
• Combined with additional context to guide the recommendation
process
Context-Based Recommender Systems
• Time, Location, Social (structural recommendation)
Hybrid and Ensemble-Based Recommender Systems
Collaborative filtering can be viewed as a generalization of classification/regression modeling in which the prediction is performed entry-wise rather than row-wise.
5. 5
Neighborhood-Based Collaborative Filtering Models
User-based collaborative filtering
• Similar users display similar patterns of rating behavior
• Predict using the ratings of neighboring users
• Provide diverse recommendations
Item-based collaborative filtering
• Similar items receive similar ratings
• Predict using the user’s own ratings on neighboring items
• Provide relevant recommendations
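Mean rating of user u ($I_u$: items rated by u): $\mu_u = \frac{\sum_{k \in I_u} r_{uk}}{|I_u|}$
Mean-centered rating: $s_{uj} = r_{uj} - \mu_u$
Prediction function: $\hat{r}_{uj} = \mu_u + \frac{\sum_{v \in P_u(j)} \mathrm{Sim}(u, v) \, s_{vj}}{\sum_{v \in P_u(j)} |\mathrm{Sim}(u, v)|}$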
Example: select users (1, 2) as peer group to predict user 3’s ratings on item 1
and 3
Pu(j): closest users to target user u, who have specified ratings for item j.
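The user-based prediction can be sketched in a few lines. This is a minimal illustration, not the textbook's implementation: the ratings matrix `R`, the function names, and the choice of Pearson correlation as the similarity are assumptions made for the example, and `None` marks a missing rating.

```python
import math

# Hypothetical 3-user x 4-item ratings matrix; None marks a missing rating.
R = [
    [5, 4, 1, None],
    [4, 5, 2, 1],
    [1, 2, 5, 4],
]

def mean_rating(row):
    """Mean of the ratings a user has specified (mu_u over I_u)."""
    rated = [r for r in row if r is not None]
    return sum(rated) / len(rated)

def pearson(u, v):
    """Pearson correlation over the items rated by both users."""
    mu_u, mu_v = mean_rating(u), mean_rating(v)
    common = [(a - mu_u, b - mu_v)
              for a, b in zip(u, v) if a is not None and b is not None]
    num = sum(a * b for a, b in common)
    den = (math.sqrt(sum(a * a for a, _ in common))
           * math.sqrt(sum(b * b for _, b in common)))
    return num / den if den else 0.0

def predict(R, u, j, k=2):
    """Predict r_uj from the k most similar users who rated item j, i.e. P_u(j)."""
    mu_u = mean_rating(R[u])
    peers = [(pearson(R[u], R[v]), v)
             for v in range(len(R)) if v != u and R[v][j] is not None]
    peers = sorted(peers, reverse=True)[:k]
    num = sum(sim * (R[v][j] - mean_rating(R[v])) for sim, v in peers)
    den = sum(abs(sim) for sim, _ in peers)
    return mu_u + num / den if den else mu_u

pred = predict(R, 0, 3)  # user 0's missing rating on item 3
```

Mean-centering lets a generous rater and a strict rater still act as useful peers, because only deviations from each user's own mean are combined.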
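Adjusted cosine similarity between the items (columns) i and j, computed after each row of the ratings matrix is centered to a mean of zero ($s_{ui}$: mean-centered rating of user u on item i; $U_i$: users who rated item i):
$\mathrm{AdjustedCosine}(i, j) = \frac{\sum_{u \in U_i \cap U_j} s_{ui} \, s_{uj}}{\sqrt{\sum_{u \in U_i \cap U_j} s_{ui}^2} \, \sqrt{\sum_{u \in U_i \cap U_j} s_{uj}^2}}$
Qt(u): top-k matching items to item t, rated by user u
Example: items (2, 3) are similar to item 1; items (4, 5) are similar to item 6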
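A small sketch of the adjusted cosine computation follows. The matrix `R` and the function name are hypothetical; each user's mean is taken over all of that user's specified ratings, and only users who rated both items contribute.

```python
import math

# Hypothetical 3-user x 4-item ratings matrix; None marks a missing rating.
R = [
    [5, 4, 1, None],
    [4, 5, 2, 1],
    [1, 2, 5, 4],
]

def adjusted_cosine(R, i, j):
    """Adjusted cosine similarity between item columns i and j of R."""
    num = den_i = den_j = 0.0
    for row in R:
        if row[i] is None or row[j] is None:
            continue  # only users who rated both items contribute
        rated = [r for r in row if r is not None]
        mu = sum(rated) / len(rated)      # this user's mean rating
        s_i, s_j = row[i] - mu, row[j] - mu
        num += s_i * s_j
        den_i += s_i * s_i
        den_j += s_j * s_j
    den = math.sqrt(den_i) * math.sqrt(den_j)
    return num / den if den else 0.0

sim_01 = adjusted_cosine(R, 0, 1)  # items rated alike by the same users
sim_02 = adjusted_cosine(R, 0, 2)  # items rated in opposite directions
```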
6. 6
Neighborhood-Based Collaborative Filtering Optimization
Impact of the Long Tail
Some movies may be very popular, and they may repeatedly occur as
commonly rated items by different users. Such ratings can sometimes
worsen the quality of the recommendations because they tend to be less
discriminative across different users.
If mj is the number of ratings of item j, and m is the total number of users, then the weight wj of item j is
wj = log(m / mj)
Less popular items receive larger weights.
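As a rough illustration, assuming the inverse-user-frequency form w_j = log(m / m_j) for the weight, the per-item weights can be computed directly from the ratings matrix. The matrix and the function name below are made up for the example.

```python
import math

# Hypothetical 4-user x 3-item ratings matrix; None marks a missing rating.
R = [
    [5, None, 3],
    [4, None, None],
    [2, 1, 5],
    [3, None, 4],
]

def item_weights(R):
    """Return w_j = log(m / m_j) for every item j (0.0 if nobody rated it)."""
    m = len(R)
    weights = []
    for j in range(len(R[0])):
        m_j = sum(1 for row in R if row[j] is not None)
        weights.append(math.log(m / m_j) if m_j else 0.0)
    return weights

w = item_weights(R)
```

An item rated by every user gets weight zero, so it no longer influences the similarity between users.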
Clustering and Neighborhood-Based Methods
• The users or items are divided into k clusters first
• Top-k closest peers within the same cluster are used to perform prediction
• Computation is significantly more efficient
• The m x n matrix is incomplete, so each distance is computed using only the subset of dimensions that is specified
Dimensionality Reduction
• Principal Component Analysis (PCA)
Rating matrix: R (m x n) → R’ (m x d), d ≪ n
• Singular Value Decomposition (SVD)
Fill in the missing values in R (col/row average) → Rf
S = Rf^T Rf (n x n): similarity matrix between pairs of items
P (n x n): columns contain the orthonormal eigenvectors of S, so that S = P ∆ P^T
∆: diagonal matrix containing the non-negative eigenvalues of S along its diagonal
Pd (n x d): contains only the columns of P corresponding to the d largest eigenvalues
Rf (m x n) is represented by Rf Pd (m x d); each user is represented in a d-dimensional space
Filling missing values with the mean introduces errors. Remedies:
• Maximum likelihood estimation: the covariance between each pair of items is estimated from only the specified entries.
• The incomplete matrix R, rather than the filled matrix Rf, can be projected directly onto the reduced basis Pd.
• Direct Matrix Factorization of Incomplete Data
Q (m x m), whose columns contain the orthonormal eigenvectors of R R^T
P (n x n), whose columns contain the orthonormal eigenvectors of R^T R
Σ (m x n), only diagonal entries are nonzero and contain the square roots of the nonzero eigenvalues of R^T R (or equivalently R R^T)
The squared error of factorization can be optimized only over the
observed entries of the ratings matrix.
A Regression Modeling View of Neighborhood Methods
Use the observed ratings in the matrix to set up a least-squares optimization
problem over the unknown values of w in order to minimize the overall error
7. 7
Graph Models for Neighborhood-Based Methods
Defining Neighborhoods with Random Walks
In the case of the Pearson’s correlation coefficient, two users need to be connected
directly to a set of common items for the neighborhood to be defined meaningfully. In
sparse user-item graphs, such direct connectivity may not exist for many nodes. On the
other hand, a random-walk method also considers indirect connectivity, because a walk
from one node to another may use any number of steps.
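Katz measure: the weighted number of walks between a pair of nodes
$\mathrm{Katz}(i, j) = \sum_{t=1}^{\infty} \beta^t \, n_{ij}^{(t)}$, where $n_{ij}^{(t)}$ is the number of walks of length t between nodes i and j
β: discount factor (β < 1)
A: adjacency matrix
Katz measure is used to compute the affinity between pairs of users.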
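The Katz measure can be approximated by truncating the sum over walk lengths, since entry (i, j) of A^t counts the walks of length t from i to j. The sketch below is an illustration under that assumption; the function names, the three-node path graph, and the truncation length are invented for the example. When β is smaller than the reciprocal of the largest eigenvalue of A, the infinite sum has the closed form (I − βA)^(-1) − I.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def katz(A, beta, max_len):
    """Truncated Katz matrix: sum over t = 1..max_len of beta^t * A^t."""
    n = len(A)
    K = [[0.0] * n for _ in range(n)]
    walks = [[float(i == j) for j in range(n)] for i in range(n)]  # A^0
    for t in range(1, max_len + 1):
        walks = matmul(walks, A)  # A^t: entry (i, j) counts walks of length t
        for i in range(n):
            for j in range(n):
                K[i][j] += beta ** t * walks[i][j]
    return K

# Hypothetical undirected path graph 0 - 1 - 2.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
K = katz(A, beta=0.5, max_len=2)
```

Nodes 0 and 2 share no edge, yet they receive a nonzero affinity through the length-2 walk via node 1, which is the indirect connectivity that a direct-overlap measure misses.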
User-User Graphs
Horting: A user u is said to hort user v at
level (F,G), if either of the following is true:
Predictability: The user v predicts user u, if u
horts v and there exists a linear transformation
function f(·) such that the following is true
Let f1. . .fr represent the sequence of linear
transformations along the directed path
starting from node u to this user v.
Item-Item Graphs
The weights on edges correspond to random-walk probabilities.
$n_{ij}^{(t)}$: the number of walks of length t between nodes i and j
8. 8
Rule-Based & Naïve Bayes Collaborative Filtering
Support of an itemset X ⊆ I is the fraction of transactions in T of which X is a subset.
• If the support of an itemset is at least equal to a predefined threshold s, then the itemset is said to be frequent.
• This threshold is referred to as the minimum support.
{Bread, Butter, Milk} and {Fish, Beef, Ham} each have a support of 2/7.
Confidence of the rule X → Y: conditional probability that a transaction in T contains Y, given that it also contains X
{Bread, Milk} → {Butter}
Association Rules
A rule X → Y is said to be an association rule at a minimum support of s and minimum confidence of c, if the following two conditions are satisfied:
1. The support of X ∪ Y is at least s.
2. The confidence of X → Y is at least c.
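Support and confidence are simple counting operations, as the sketch below illustrates. The five transactions are hypothetical and are not the seven-transaction table from the slide; the function names are likewise chosen for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """Conditional probability that a transaction contains Y given it contains X."""
    sup_x = support(X, transactions)
    return support(set(X) | set(Y), transactions) / sup_x if sup_x else 0.0

def is_association_rule(X, Y, transactions, s, c):
    """True if X -> Y meets minimum support s and minimum confidence c."""
    return (support(set(X) | set(Y), transactions) >= s
            and confidence(X, Y, transactions) >= c)

# Hypothetical market-basket data.
T = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Milk"},
    {"Bread", "Butter", "Milk", "Eggs"},
    {"Eggs", "Milk"},
    {"Fish", "Beef"},
]

sup = support({"Bread", "Butter", "Milk"}, T)        # 2 of 5 transactions
conf = confidence({"Bread", "Milk"}, {"Butter"}, T)  # 2 of the 3 with Bread and Milk
```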
Rule-Based Collaborative Filtering
Naïve Bayes Collaborative Filtering
9. 9
Latent Factor Model and Matrix Factorization
Factorization is a general way of
approximating a matrix when it is prone
to dimensionality reduction because of
correlations between columns (or rows).
R (m x n) ≈ U V^T
U (m x k)
V (n x k)
k ≪ min{m, n}
Approximation Error: $\|R - UV^T\|^2$
The key usefulness of the approach arises when
the matrix R is not fully specified, but one can still
robustly estimate all entries of the latent factors U
and V, respectively.
10. 10
Unconstrained Matrix Factorization
Stochastic Gradient Descent
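In the context of a matrix with missing entries, only the observed entries contribute to the error.
Let $u_i$ be the ith row of U and $v_j$ be the jth row of V. For each observed entry $(i, j)$ with error $e_{ij} = r_{ij} - u_i \cdot v_j$ and step size $\alpha$:
$u_i \Leftarrow u_i + \alpha \, e_{ij} \, v_j$
$v_j \Leftarrow v_j + \alpha \, e_{ij} \, u_i$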
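A minimal sketch of stochastic gradient descent over the observed entries is given below. It is not the slide's or the textbook's code: the function names, the hyperparameters (`lr`, `lam`, `epochs`, `seed`), and the seven observed ratings are assumptions for illustration. Setting `lam` above zero adds L2 regularization to the same update.

```python
import random

def factorize(ratings, m, n, k=2, lr=0.01, lam=0.0, epochs=500, seed=0):
    """Fit R ~ U V^T by SGD; ratings is a list of observed (i, j, r_ij) triples."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(m)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
    observed = list(ratings)
    for _ in range(epochs):
        rng.shuffle(observed)
        for i, j, r in observed:
            e = r - sum(U[i][q] * V[j][q] for q in range(k))  # error e_ij
            for q in range(k):
                u_iq, v_jq = U[i][q], V[j][q]  # keep old values for both updates
                U[i][q] += lr * (e * v_jq - lam * u_iq)
                V[j][q] += lr * (e * u_iq - lam * v_jq)
    return U, V

def squared_error(ratings, U, V):
    """Sum of squared errors over the observed entries only."""
    return sum((r - sum(a * b for a, b in zip(U[i], V[j]))) ** 2
               for i, j, r in ratings)

# Hypothetical observed entries of a 3 x 3 ratings matrix.
obs = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 2, 1), (2, 1, 1), (2, 2, 5)]
U, V = factorize(obs, m=3, n=3)
```

Missing entries never appear in the loop, so no imputation is needed; a prediction for an unobserved pair (i, j) is the inner product of row i of `U` and row j of `V`.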
Regularization
Incorporating User and Item Biases
oi : general bias of user i
pj : general bias of item j
In fact, it has been shown that using only the bias variables (i.e., k = 0) can often
provide reasonably good rating predictions.
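Since the slide's equation is not reproduced here, the following is the usual way of writing the biased model, stated as an assumption: the prediction adds the two bias terms to the latent-factor inner product, and the regularized error is minimized over the set S of observed entries (λ is a regularization weight introduced for this sketch).

```latex
\hat{r}_{ij} = o_i + p_j + \sum_{s=1}^{k} u_{is} v_{js}
\qquad
J = \frac{1}{2} \sum_{(i,j) \in S} \left( r_{ij} - \hat{r}_{ij} \right)^2
  + \frac{\lambda}{2} \left( \|U\|^2 + \|V\|^2 + \sum_{i} o_i^2 + \sum_{j} p_j^2 \right)
```

With k = 0 the sum over latent factors vanishes and the prediction reduces to o_i + p_j, which is the bias-only model mentioned above.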
Incorporating Implicit Feedback
Even in cases in which users explicitly rate items, the identity of the items
they rate can be viewed as implicit feedback
Y (n x k) : implicit item-factor matrix
F (m x n) : provides the linear combination coefficients to create a user-factor matrix from it
11. 11
Other Matrix Factorization Methods
Singular Value Decomposition (SVD)
Columns of U and V are constrained to be mutually orthogonal.
Non-negative Matrix Factorization
Provide high-level interpretability
Probabilistic Latent Semantic Analysis (PLSA)
A probabilistic variant of non-negative matrix factorization.
Maximum Margin Factorization
• Borrows ideas from support vector machines to add a maximum margin regularizer to the objective function and some of its variants
• Particularly effective for discrete ratings
12. 12
Content-Based Recommender Systems
1. Preprocessing and feature extraction
Assume the features are keywords of each item
V (n x d) Feature matrix, n items, d keywords
Feature Representation (can be normalized)
• Unary
• Term Frequency tf(k): count of keyword k in a content item
• Weighted Term Frequency (Title counts more than body)
• TF-IDF (Term Frequency – Inverse Document Frequency)
• $idf(k) = \log(N / n_k)$, N: total number of content items, $n_k$: number of content items containing keyword k
• $\text{tf-idf}(k) = tf(k) \times \log(N / n_k)$
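The TF-IDF weighting can be sketched as follows, assuming raw keyword counts as tf(k) and idf(k) = log(N / n_k). The count matrix and the function name are hypothetical.

```python
import math

def tfidf(counts):
    """Turn an n x d keyword-count matrix into tf-idf weights."""
    N, d = len(counts), len(counts[0])
    # n_k: number of content items in which keyword k occurs at least once
    n_k = [sum(1 for row in counts if row[k] > 0) for k in range(d)]
    idf = [math.log(N / n_k[k]) if n_k[k] else 0.0 for k in range(d)]
    return [[row[k] * idf[k] for k in range(d)] for row in counts]

# Hypothetical counts: 3 content items, 2 keywords.
counts = [
    [2, 1],
    [1, 0],
    [0, 0],
]
W = tfidf(counts)
```

The second keyword occurs in only one item, so it receives the larger idf and contributes more per occurrence than the more common first keyword.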
2. Content-based learning of user profiles (can be normalized)
R (m x n)
V normalized first
U (m x d) = R x V User Profile Matrix, m users, d keywords
3. Filtering and recommendation
V’ (n’ x d) Feature matrix, n’ testing items, d keywords
R’ (m x n’) = U x V’^T Prediction matrix, m users, n’ testing items
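The three steps reduce to two matrix products once the feature matrices exist. The sketch below uses made-up unary keyword features and ratings, treats unrated items as 0, and normalizes each item's feature vector to unit length; all names and values are assumptions for illustration.

```python
import math

def l2_normalize(M):
    """Scale every row of M to unit Euclidean length (zero rows stay zero)."""
    out = []
    for row in M:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm if norm else 0.0 for x in row])
    return out

def matmul(X, Y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Hypothetical data: 3 training items, 3 keywords, 2 users, 2 test items.
V = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]   # item-keyword features (n x d)
R = [[5, 4, 0], [0, 0, 5]]              # ratings (m x n), 0 = unrated
V_test = [[1, 0, 0], [0, 0, 1]]         # features of unseen items (n' x d)

U = matmul(R, l2_normalize(V))                        # user profiles (m x d)
R_pred = matmul(U, transpose(l2_normalize(V_test)))   # scores (m x n')
```

User 0 rated items carrying the first keyword highly, so the first test item scores higher for that user; user 1 shows the opposite pattern.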
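Cosine Similarity
$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$
$A = (1, 1, 1, 0)$
$B = (0, 1, 0, 1)$
$\|A\| = \sqrt{1^2 + 1^2 + 1^2 + 0^2} = \sqrt{3}$
$\|B\| = \sqrt{0^2 + 1^2 + 0^2 + 1^2} = \sqrt{2}$
$A \cdot B = 1 \cdot 0 + 1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 = 1$
$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{1}{\sqrt{3} \cdot \sqrt{2}} \approx 0.41$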
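The worked example can be checked with a few lines of code; the function name is chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

A = [1, 1, 1, 0]
B = [0, 1, 0, 1]
sim = cosine_similarity(A, B)  # 1 / sqrt(6), about 0.41
```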
Content Similarity (n x n) = V x V^T
R (m x n): rating matrix, m users, n items
R’ (m x n’): each cell represents the similarity between a user and an item
14. 14
Neural Collaborative Filtering (NCF) He 2017 GitHub
• Generalized Matrix Factorization (GMF)
• Neural Collaborative Filtering (Multi-Layer Perceptron, MLP)
• Neural Matrix Factorization Model (MLP + GMF)
Learning from Implicit Data
Hit Ratio (HR): whether the test item is ranked in the top 10 among itself and 99 negative samples
NDCG: assigns higher scores to hits at top ranks
One negative sample per positive instance is insufficient; the optimal sampling ratio is around 3 to 6.
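The two metrics can be sketched for a single test case as follows. With one relevant item per user, the ideal DCG is 1, so NDCG reduces to the reciprocal log of the hit position. The function names and the synthetic scores are assumptions for illustration.

```python
import math

def rank_candidates(scores):
    """Return the candidate items ordered by descending predicted score."""
    return sorted(scores, key=scores.get, reverse=True)

def hit_ratio(ranked_items, test_item, k=10):
    """1.0 if the held-out test item appears in the top k, else 0.0."""
    return 1.0 if test_item in ranked_items[:k] else 0.0

def ndcg(ranked_items, test_item, k=10):
    """1 / log2(position + 2) for a hit at 0-based position, else 0.0."""
    for pos, item in enumerate(ranked_items[:k]):
        if item == test_item:
            return 1.0 / math.log2(pos + 2)
    return 0.0

# Hypothetical scores for 100 candidates: the test item plus 99 negatives.
scores = {item: 100.0 - item for item in range(100)}
ranked = rank_candidates(scores)

hr_hit, ndcg_hit = hit_ratio(ranked, 2), ndcg(ranked, 2)      # third place
hr_miss, ndcg_miss = hit_ratio(ranked, 50), ndcg(ranked, 50)  # outside top 10
```

Averaging both values over all users gives the reported HR@10 and NDCG@10.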
15. 15
Ying 2018
I (2 billion pins) ↔ C (1 billion boards)
Method | Hit-rate
Visual embeddings (4,096 dimensions, from CNN) | 17%
Annotation embeddings (256 dimensions, title & description → Word2Vec) | 14%
Combined embeddings (2-layer MLP on visual and annotation embeddings) | 27%
Pixie (random-walk-based, closeness only from graph structure) | -
PinSage (graph convolution with visual and annotation features) | 67%
Hit-rate: fraction of queries for which the positive sample is ranked in the top 500 among the 5M negative samples
Importance pooling: neighbor features are weighted by random-walk similarity during aggregation, leading to a 46% performance gain in offline evaluation metrics.
Curriculum training: the algorithm is fed harder-and-harder examples (chosen by PageRank score) during training, resulting in a 12% performance gain.
Pinterest: A/B tests show 30% to 100% improvements in user engagement across various settings after deploying PinSage.
16. 16
Neural Graph Collaborative Filtering (NGCF) Wang 2020 GitHub
High-order connectivity contains rich semantics carrying the collaborative signal.
Target
NGCF explicitly incorporates the collaborative signal into the embedding function of model-based CF by leveraging high-order connectivity in the user-item integration graph.
17. 17
LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation
He 2020 GitHub
NGCF
Feature transformation and nonlinear activation contribute
little to the performance of collaborative filtering
• NGCF-f, which removes the feature transformation matrices W1 and W2.
• NGCF-n, which removes the non-linear activation function σ.
• NGCF-fn, which removes both the feature transformation matrices and the non-linear activation function.
Light Graph Convolution (LGC): keeps neighborhood aggregation only
Layer Combination and Model Prediction
R (M x N): user-item interaction matrix
A ((M+N) x (M+N)): adjacency matrix of the user-item graph
D is an (M+N)×(M+N) diagonal matrix in which each entry Dii denotes the number of nonzero entries in the ith row vector of the adjacency matrix A (also called the degree matrix).
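A compact sketch of the propagation is shown below, assuming the adjacency matrix A = [[0, R], [R^T, 0]], the symmetric normalization D^(-1/2) A D^(-1/2), and uniform layer weights 1/(K+1) for the layer combination. The function names, the tiny interaction matrix, and the initial embeddings are made up; a real model would learn `E0` with the BPR loss.

```python
import math

def normalized_adjacency(R):
    """Build A = [[0, R], [R^T, 0]] and return D^(-1/2) A D^(-1/2)."""
    M, N = len(R), len(R[0])
    size = M + N
    A = [[0.0] * size for _ in range(size)]
    for u in range(M):
        for i in range(N):
            if R[u][i]:
                A[u][M + i] = A[M + i][u] = 1.0
    deg = [sum(row) for row in A]  # D_ii: nonzero entries in row i of A
    return [[A[p][q] / math.sqrt(deg[p] * deg[q]) if A[p][q] else 0.0
             for q in range(size)] for p in range(size)]

def lightgcn_embeddings(R, E0, num_layers=2):
    """Propagate E0 with no feature transform or activation, then average layers."""
    A_hat = normalized_adjacency(R)
    rows, dim = len(E0), len(E0[0])
    layers = [E0]
    for _ in range(num_layers):
        prev = layers[-1]
        layers.append([[sum(A_hat[p][q] * prev[q][c] for q in range(rows))
                        for c in range(dim)] for p in range(rows)])
    alpha = 1.0 / (num_layers + 1)  # uniform layer-combination weight
    return [[alpha * sum(layer[p][c] for layer in layers) for c in range(dim)]
            for p in range(rows)]

def score(E, M, u, i):
    """Predicted preference: inner product of final user and item embeddings."""
    return sum(a * b for a, b in zip(E[u], E[M + i]))

# Hypothetical 2-user x 2-item interactions and initial embeddings
# (rows 0-1 are users, rows 2-3 are items).
R = [[1, 0], [1, 1]]
E0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
A_hat = normalized_adjacency(R)
E1 = lightgcn_embeddings(R, E0, num_layers=1)
```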
Bayesian Personalized Ranking (BPR) loss
In all cases, LightGCN outperforms NGCF by a large margin
18. 18
Temporal Graph Networks (TGN) for Deep Learning on Dynamic Graphs Rossi 2020
Deep learning on static graphs
Dynamic Graphs
• A node-wise event is represented by vi(t)
• An interaction event between nodes i and j is
represented by a (directed) temporal edge eij(t)
For each time t, the embedding of the graph nodes Z(t) = (z1(t),…, zn(t)(t))
Message Function
Message Aggregator
Memory Updater
Embedding
Aggregation over graph
19. 19
“Do not quench your inspiration
and your imagination; do not
become the slave of your
model.”
– Vincent van Gogh