
Boston ML - Architecting Recommender Systems


June 2018 talk on Architecting Recommender Systems by James Kirk at Spotify


  1. 1. Boston Machine Learning Architecting Recommender Systems Algorithm design, user experience, and system architecture June 2018 James Kirk
  2. 2. Table of contents. Anatomy of Recommender Systems (slides 3-19): system components and terminology. Designing Recommender Systems (slides 20-31): design considerations and frameworks. Example Recommender Systems (slides 32-40): real-world recommender systems and their architectures. Tools for Recommender Systems (slides 41-53): tools for building systems quickly. Evaluating Recommender Systems (slides 54-58): what makes a good recommender system? What We Missed (slides 59-63): other subjects in recommender systems.
  3. 3. Anatomy of Recommender Systems
  4. 4. Recommendation vs Personalization. Recommendation: a recommendation system presents items to users in a relevant way; the definition of relevant is product/context-specific. Personalization: a personalization system presents recommendations in a way that is relevant to the individual user; the user expects their experience to change based on their interactions with the system. Relevance can still be product/context-specific.
  5. 5. Example: Recommendation
  6. 6. Example: Personalization
  7. 7. Users vs Items. Users: a user in a recommender system is the party that is receiving and acting on the recommendations; sometimes the user is the context, not an actual person. Items: an item in a recommender system is the passive party that is being recommended to the users. The line between these two can be blurry.
  8. 8. Example: Consultant Matchmaking (Hypothetical). Rec Sys #1: Users = Consultants*, Items = Projects; recommend projects for the consultant to bid on. Rec Sys #2: Users = Projects, Items = Consultants; recommend the right consultant for the project. Rec Sys #3: Users = Enterprises*, Items = Consultants; recommend consultants for relationship building. (*Personalized)
  9. 9. Interactions. Positive: hearts, stars, likes, listens, watches, follows, bids, purchases, hires, reads, views, upvotes… ❤ Negative: bans, skips, angry-face-reacts, 1-star reviews, rejections, unfollows, returns, downvotes… Explicit vs Implicit: explicit actions are those that a user expects or intends to impact their personalized experience; implicit actions are all other interactions between users and items.
  10. 10. Interactions. [Figure: an interactions matrix, one row per user (User 1-4) and one column per item (Item 1-6), holding the interaction values.]
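A minimal sketch of how such a matrix might be assembled with scipy; the events and dimensions here are hypothetical:

```python
import numpy as np
from scipy import sparse

# Hypothetical logged events: (user_id, item_id, value). Positive values for
# likes/purchases; negative values (if the system allows them) for skips/bans.
events = [(0, 2, 1.0), (0, 5, 1.0), (1, 2, 1.0), (3, 0, -1.0)]

n_users, n_items = 4, 6
rows, cols, vals = zip(*events)

# Interaction matrices are overwhelmingly sparse in practice, so a sparse
# format (CSR here) is the natural representation.
interactions = sparse.csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))
print(interactions.toarray())
```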
  11. 11. User/Item Features. Indicator Features: a feature that is unique to every user/item to allow for direct personalization. These features allow recommender systems to learn about every user individually without being diluted through metadata; often one-hot encoded user IDs, or just an identity matrix. Metadata Features: age, location, language, tags, labels, word counts, pre-learned embeddings… Everything that is known about a user/item before training can be a feature if properly structured. Should it be? Often called “side input” or “shared features.”
  12. 12. User/Item Features. [Figure: a feature matrix of shape [n_users x n_user_features] (or [n_items x n_item_features]), one row per user (User 1-6), combining indicator features with metadata features such as encoded labels/tags/etc.]
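One way to build a feature matrix like this: stack an identity matrix (the indicator features) with metadata columns. The metadata values below are hypothetical:

```python
import numpy as np
from scipy import sparse

n_users = 4

# Indicator features: one column per user, i.e. an identity matrix.
indicator = sparse.identity(n_users, format="csr")

# Metadata features: e.g. one-hot encoded location plus a numeric age column.
location_onehot = sparse.csr_matrix(
    np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float))
age = sparse.csr_matrix(np.array([[25], [31], [40], [19]], dtype=float))

# Final user-feature matrix: [n_users x n_user_features].
user_features = sparse.hstack([indicator, location_onehot, age], format="csr")
print(user_features.shape)  # (4, 4 + 2 + 1)
```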
  13. 13. Representation Functions. Representation: a (typically) low-dimensional vector that encodes the feature information about the user or item; often called “embedding,” “latent user/item,” or “latent representation.” Representation size, which is the dimension of the latent space, is often referred to as “components.” Representation Function: the process that converts user/item features into representations. Learning happens here. Common examples: 1. Matrix factorization 2. Linear kernels 3. Deep nets 4. Word2Vec 5. Autoencoders 6. None! (Pass-through)
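As an illustration of the simplest case, a linear representation function is a single weight matrix: representation = features @ W. With pure indicator features this reduces to classic matrix factorization, where each user row selects its own embedding. A sketch (the weights are shown random rather than learned):

```python
import numpy as np

n_users, n_user_features, n_components = 4, 7, 2

# W would be learned during training; random here for illustration only.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n_user_features, n_components))

user_features = np.eye(n_users, n_user_features)  # indicator features
user_repr = user_features @ W                     # [n_users x n_components]
print(user_repr.shape)
```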
  14. 14. Representation Functions. [Illustration. Image: Eric Nyquist]
  15. 15. Prediction Functions. Prediction: a prediction from a recommender system is an estimate of an item’s relevance to the user; predictions can be ranked for relevance. The predictions are an indirect approximation of the interactions. Prediction Function: the process that converts user/item representations into predictions. Common examples: 1. Dot product 2. Cosine similarity/distance 3. Euclidean similarity/distance 4. Manhattan similarity/distance* Some systems use deep nets for prediction, and this can be an assumption-breaker. *Actually, Manhattan is rare
  16. 16. Prediction Functions. [Figure: user and item vectors in a 2-component latent representation space (2-dimensional), separated by angle Θ and euclidean distance δ.] Common examples: 1. Dot product = User · Item 2. Cosine similarity = cos(Θ) 3. Euclidean similarity* = -1 · δ 4. Manhattan similarity = -1 · |User - Item| *There are many methods for expressing euclidean similarity
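All four prediction functions are one-liners in numpy; a sketch with hypothetical 2-component representations:

```python
import numpy as np

user = np.array([0.5, 1.2])   # 2-component user representation
item = np.array([1.0, 0.8])   # 2-component item representation

dot = user @ item                                              # relevance + magnitude
cosine = dot / (np.linalg.norm(user) * np.linalg.norm(item))   # angle only
euclidean = -np.linalg.norm(user - item)   # one way to express euclidean similarity
manhattan = -np.abs(user - item).sum()     # rarely used in practice

print(dot, cosine, euclidean, manhattan)
```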
  17. 17. Loss and Learning. Loss Function: the process that converts predictions and interactions into error for learning. Common examples: 1. Root-mean-square error (RMSE) 2. Kullback-Leibler divergence (KLD) 3. Alternating least squares* (ALS) 4. Bayesian personalized ranking* (BPR) 5. Weighted approximately ranked pairwise (WARP) 6. Weighted margin-rank batch (WMRB) *These are both a loss and a representation function. Learning-to-rank: some loss functions learn to approximate the values in the interactions matrix; other loss functions learn to uprank positive interactions and downrank negative interactions (and/or non-interacted items) for that user. This second category of loss functions is called learning-to-rank.
  18. 18. [Figure: system data flow. Input data: user features, item features, and interactions. The user/item representation functions convert user/item features into user/item representations; the prediction function converts those representations into predicted scores and predicted ranks (output data); in training, the loss function compares predictions against interactions to produce the training loss.]
  19. 19. The whole system, written as equations: Y = p(r(X_user), r(X_item)); Ɛ = s(Y, N). Legend: Y = prediction, p = prediction function, r = representation function, X = features, N = interactions, s = loss function, Ɛ = loss.
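Putting slides 18-19 together, here is a toy end-to-end pass, assuming linear representations, dot-product prediction, and RMSE loss (all data random for illustration):

```python
import numpy as np

def representation(features, weights):
    # Linear representation function r (could be swapped for a deep net).
    return features @ weights

def prediction(user_repr, item_repr):
    # Dot-product prediction function p: one score per (user, item) pair.
    return user_repr @ item_repr.T

def loss(predictions, interactions):
    # RMSE loss function s: error between predictions Y and interactions N.
    return np.sqrt(np.mean((predictions - interactions) ** 2))

rng = np.random.default_rng(0)
X_user, X_item = np.eye(3), np.eye(4)              # indicator features
W_user = rng.normal(size=(3, 2))                   # learned in training
W_item = rng.normal(size=(4, 2))
N = rng.integers(0, 2, size=(3, 4)).astype(float)  # interactions

# Y = p(r(X_user), r(X_item));  Ɛ = s(Y, N)
Y = prediction(representation(X_user, W_user), representation(X_item, W_item))
print(loss(Y, N))
```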
  20. 20. Designing Recommender Systems
  21. 21. Interactions: what are our interaction values? We must select interaction values based on what data is available, how meaningful that data is, and how it interacts with the rest of the system. Considerations: ❏ What user behaviors do our interactions represent? ❏ Explicit vs implicit? ❏ Do we allow for negative interactions? ❏ How dense are our interactions? ❏ Can our recommender handle these interactions? Features: what are our user/item features? We must select user/item features from the data available, ensure that the data is meaningful to the recommender system, and ensure that our use of this data is appropriate. Considerations: ❏ Do we use indicator features? ❏ What useful metadata is available? ❏ Does the metadata require feature engineering? ❏ Do users expect this metadata to impact their recommendations? Learning: how does our system learn? We must select representation functions that are appropriate for our features, as well as a prediction function and loss function that will learn effectively from this data. Considerations: ❏ What representation functions will best encode the user/item features? ❏ What prediction function will best estimate relevance? ❏ What loss function will learn from our data most effectively? ❏ Do these choices scale?
  22. 22. What are our interaction values? What user behaviors do our interactions represent? Interaction values should be an approximation of the intended effect of the recommender system on user behavior: if we want people to purchase, our interactions should be related to purchases; if we want people to binge episodes of shows for longer, our interactions should be related to the act of binging. Explicit vs Implicit: when the user gave you this signal, did they intend/expect it to alter their recommendations? Some explicit signals don’t work well as interactions. Negative explicit signals should be handled with simple product logic. “You might give five stars to Hotel Rwanda and two stars to Captain America, but you’re much more likely to watch Captain America.” -Todd Yellin, Netflix, You May Also Like
  23. 23. What are our interaction values? Explicit vs Implicit Does the user know we are using this signal for recommendation? Does the user care we are using this signal for recommendation? Is it ethical for us to use this signal for recommendation?
  24. 24. What are our interaction values? Do we allow negative interactions? Negative interactions can be valuable statements of what content to avoid. Negative interactions can be confusing when learning-to-rank. Not all loss functions accommodate negative interactions. Which ordering is better? Confusing? Three candidate rankings, from top (rank 1) to bottom (rank 9):

      Rank  Ordering A  Ordering B  Ordering C
      1     Positive    Positive    Positive
      2     Positive    Positive    Positive
      3     No-int      Negative    No-int
      4     No-int      Negative    Negative
      5     No-int      Negative    No-int
      6     No-int      No-int      Negative
      7     Negative    No-int      No-int
      8     Negative    No-int      Negative
      9     Negative    No-int      No-int
  25. 25. What are our user/item features? Do we use indicator features? Indicator features allow for powerful personalization but are as numerous as our users/items. Recommenders with user indicators cannot effectively make recommendations for new users* (the cold-start problem). Many users means many indicator features, which may not scale. *Vice-versa is true for new items. What useful metadata is available? What user/item metadata do we have that is relevant? Metadata that is useful but missing can be requested from users, crowd-sourced, or inferred with other ML systems.
  26. 26. Does the metadata require feature engineering? Pre-processing features can improve recommender learning. Some features may be useless/misleading without feature engineering. The choice of representation function impacts the usefulness of feature engineering. What are our user/item features? Do users expect this metadata to impact their recommendations? Is the use of this metadata ethical*? Users can be surprised when changing metadata impacts product experience. *There is a distinction between metadata used in training and metadata used in evaluation.
  27. 27. What representation functions will best encode the user/item features? Linear kernels are effective if all we have are indicator features or well-engineered features. (Matrix factorization) More complex relationships may lead us to neural nets. How does their architecture impact the recommender? (Use of the latent space) Can the representation be learned without interaction? (Auto-encoders, word2vec, etc) How does our system learn? What prediction function will best estimate relevance? Dot-product prediction accounts for representation relevance and magnitude. Cosine prediction optimizes for relevance but has no sense for magnitude. Euclidean prediction builds a map of items but also has no sense for magnitude. Should items be biased, given our choice?
  28. 28. What loss function will learn from our data most effectively? Do we want to estimate interactions, or perform learning-to-rank? Should the loss function accommodate negative interactions? (RMSE, KLD…) Should the loss function be sensitive to interaction magnitude? (RMSE, B-WMRB…) Tweaking the loss function can dramatically change how recommendations feel. How does our system learn? Sparse vs Dense vs Sampled Some implementations of loss functions only account for user/item pairs with interactions. These same loss functions can be written to compare every possible user/item pair. These predictions and losses are dense, and they can be expensive. Some of the most effective and efficient loss functions learn by comparing pairs with interactions against sampled pairs.* (WARP, WMRB) * There are many methods for sampling candidate pairs
  29. 29. How does our system learn? Example: WMRB. WMRB approximates a positive item’s rank against a random sample and upranks positive items through a hinge loss: rank(y | x) ≈ (|Y| / |Z|) · Σ_{y′ ∈ Z} max(0, 1 - p(x, y) + p(x, y′)). Legend: x = user, y = positive item, y′ = non-positive item, Y = all items, Z = random sample of non-positive items, p = prediction function. The max(0, ·) term is the hinge; Z provides the random sampling.
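A numpy sketch of the sampled-rank idea above for a single (user, positive item) pair. The log transform of the estimated rank is one common choice of rank penalty in WMRB-style losses, not necessarily the talk's exact formulation:

```python
import numpy as np

def wmrb_loss(p_positive, p_sampled, n_items):
    """Sampled WMRB-style loss for one (user, positive item) pair.

    p_positive: predicted score for the positive item.
    p_sampled:  predicted scores for a random sample of non-positive items.
    n_items:    |Y|, the total number of items.
    """
    # Hinge: sampled items within a margin of the positive item contribute loss.
    hinge = np.maximum(0.0, 1.0 - p_positive + p_sampled)
    # Scale the sample back up to estimate the positive item's rank over all items.
    est_rank = (n_items / len(p_sampled)) * hinge.sum()
    # Log transform: improving a badly-ranked item matters more than
    # perfecting an already well-ranked one.
    return np.log(1.0 + est_rank)

print(wmrb_loss(p_positive=0.9, p_sampled=np.array([0.2, 0.8, 1.1]), n_items=10_000))
```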
  30. 30. How does our system learn? Example: Balancing WMRB. If we notice an undue popularity bias, we can balance this by accounting for interaction magnitudes and popularity. [Equation figure: WMRB with a balancing factor.] Legend: x = user, y = positive item, X = all users, p = prediction function, n = interaction magnitude for the pair (user, item).
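One plausible shape for such a balancing factor, shown only as an illustrative assumption rather than the slide's exact formula: boost pairs with large interaction magnitude n(user, item) and discount items that are popular across all users:

```python
import numpy as np

def balancing_factor(n_user_item, item_popularity):
    # Hypothetical balancing: reward interaction magnitude, damp popularity.
    # This is an assumption for illustration, not the talk's published formula.
    return n_user_item / np.log(2.0 + item_popularity)

# The factor would multiply each pair's hinge term in the WMRB sketch above.
print(balancing_factor(n_user_item=3.0, item_popularity=500.0))
```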
  31. 31. We can think about a recommender system architecture as a set of top-level decisions. When designing recommender systems, we are evaluating the tradeoffs between these decisions and the relationships between these choices. A Framework for Recommender Systems Interactions ? User Features ? User Representation ? Item Features ? Item Representation ? Prediction ? Learning ?
  32. 32. Example Recommender Systems
  33. 33. A collaborative filter learns representations from interactions and uses these to make personalized recommendations, often through matrix factorization. Pure collaborative filters are metadata-naïve. Example: Collaborative Filter Interactions * (Positive only?) User Features Indicator User Representation Linear Item Features Indicator Item Representation Linear Prediction * (Dot-product for MF) Learning ALS, BPR, SVD, PCA, NMF...
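A pure collaborative filter fits in a few lines, here assuming truncated SVD as the factorizer and dot-product prediction, with hypothetical random interactions:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Hypothetical positive-only interactions matrix (users x items).
interactions = sparse.random(50, 200, density=0.05, random_state=0, format="csr")

# Truncated SVD factorizes the interactions directly: indicator features in,
# linear representations out. No metadata is involved.
U, s, Vt = svds(interactions, k=10)
user_repr = U * s    # [n_users x k]
item_repr = Vt.T     # [n_items x k]

# Dot-product prediction: score every item for user 0, rank descending.
scores = user_repr[0] @ item_repr.T
top_items = np.argsort(-scores)[:5]
print(top_items)
```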
  34. 34. A content-based recommender learns the item features for which a user has an affinity. Purely content-based systems do no transfer learning between users. This allows easy rec-splanation. This requires clean item metadata. Example: Content-based Recommender Interactions * User Features Indicator User Representation Linear Item Features Metadata Item Representation None (n_components = n_item_features) Prediction Dot-product Learning *
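A sketch of this setup under stated assumptions: item representations are the raw metadata (pass-through), and each user's representation is fit in item-feature space, here with ridge regression as one plausible learner:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_item_features = 20, 100, 8

item_features = rng.random((n_items, n_item_features))            # clean item metadata
interactions = (rng.random((n_users, n_items)) > 0.95).astype(float)

# Ridge regression per user: user_repr lives directly in item-feature space,
# so n_components = n_item_features and explanations are easy to read off.
F, reg = item_features, 1.0
user_repr = np.linalg.solve(F.T @ F + reg * np.eye(n_item_features),
                            F.T @ interactions.T).T   # [n_users x n_item_features]

# Dot-product against raw item metadata ("you like items tagged politics").
scores = user_repr @ item_features.T
print(scores.shape)
```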
  35. 35. A hybrid recommender system learns representations for both user and item metadata and indicators, if available. This opens a lot of options for us. Example: Hybrid Recommender System Interactions * User Features * User Representation * Item Features * Item Representation * Prediction * Learning *
  36. 36. We can build a hybrid recommender system to recommend personalized products based on past purchases. Example: Purchase Recommendations Interactions Purchases User Features Indicator User Representation Linear Item Features Indicator + Metadata Item Representation * Prediction Dot-product Learning *
  37. 37. We can use the pre-trained purchase recommender’s representations to provide recommendations in a new context. In this system, the “user” is the context item, not the person using our product. Example: “You May Also Like” (YMAL) Interactions X User Features Context Item Repr User Representation None Item Features All Item Reprs Item Representation None Prediction Dot-product, Cosine? Learning X
  38. 38. We can take the output of the YMAL recommender and re-rank the items based on the customer’s representation. This system does not learn. The learning’s already been done. Example: Personalized “You May Also Like” Interactions X User Features User Reprs User Representation None Item Features Similar Item Reprs Item Representation None Prediction Dot-product Learning X
  39. 39. Example: Personalized “You May Also Like” Purchase Recommender System “YMAL” Recommender System “YMAL” Personalization System Step 1: Learn to personalize purchasing recommendations Step 2: Use previous learning to calculate the most similar items Step 3: Personalize the similar items by re-ranking OR Contextualize purchase recommendations by limiting the item set
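A sketch of steps 2 and 3, assuming the representations from step 1 are already learned; all names are hypothetical:

```python
import numpy as np

def ymal_personalized(context_item_repr, user_repr, item_reprs, k=50, n=10):
    # Step 2: candidate generation, i.e. items most similar to the context item.
    candidate_scores = item_reprs @ context_item_repr
    candidates = np.argsort(-candidate_scores)[:k]

    # Step 3: personalization by re-ranking the candidates for this user.
    # No learning happens here; the learning was done in step 1.
    user_scores = item_reprs[candidates] @ user_repr
    return candidates[np.argsort(-user_scores)[:n]]

rng = np.random.default_rng(0)
item_reprs = rng.normal(size=(1000, 16))   # pre-learned in step 1
user_repr = rng.normal(size=16)
print(ymal_personalized(item_reprs[42], user_repr, item_reprs))
```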
  40. 40. Example: YouTube (Covington, Adams, Sargin) Interactions Watches + Searches User Features Geography, Age, Gender... User Representation Deep net Item Features Pre-learned embeddings, language, previous impressions... Item Representation Deep net Prediction Deep net Learning Sampled Cross-Entropy
  41. 41. Tools for Recommender Systems
  42. 42. Implicit Interactions * User Features Indicator User Representation Linear Item Features Indicator Item Representation Linear Prediction Dot-product Learning ALS, BPR Implicit is a Python collaborative filter toolkit that uses matrix factorization to learn representations. Includes factorization classes for ALS and BPR. Made by Ben Frederickson. MIT License
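Typical usage looks roughly like this; the fit/recommend signatures have shifted across implicit's versions, so treat the details as approximate:

```python
from scipy import sparse
from implicit.als import AlternatingLeastSquares

# Hypothetical positive-only interactions (users x items, CSR).
user_items = sparse.random(100, 500, density=0.02, format="csr")

model = AlternatingLeastSquares(factors=32, regularization=0.01, iterations=15)
model.fit(user_items)  # note: some older versions expect an items x users matrix

# Top-10 recommendations for user 0 (return format varies by version).
recommendations = model.recommend(0, user_items[0], N=10)
```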
  43. 43. Scikit-Learn Interactions * User Features Indicator User Representation Linear Item Features Indicator Item Representation Linear Prediction Dot-product Learning SVD, PCA, NMF... Scikit-learn is a Python machine learning toolkit with many tools for feature engineering and machine learning. The decomposition package contains some classes that can be used for matrix factorization recommender systems like SVD, PCA, NMF... Maintained by volunteers. BSD license
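For example, NMF from the decomposition package used as a recommender factorizer, with hypothetical interactions:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical binary interactions matrix (users x items).
rng = np.random.default_rng(0)
interactions = (rng.random((30, 80)) > 0.9).astype(float)

model = NMF(n_components=8, init="nndsvd", max_iter=500)
user_repr = model.fit_transform(interactions)   # [n_users x 8]
item_repr = model.components_.T                 # [n_items x 8]

# Dot-product predictions approximately reconstruct the interaction values.
scores = user_repr @ item_repr.T
```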
  44. 44. LightFM Interactions * User Features * User Representation Linear Item Features * Item Representation Linear Prediction Dot-product Learning Logistic, BPR, WARP LightFM is a Python hybrid recommender system that uses matrix factorization to learn representations. Made by Lyst - a fashion shopping website. Apache-2.0 license
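A minimal LightFM run; the user_features/item_features matrices that make it a hybrid are optional and omitted here for brevity:

```python
import numpy as np
from scipy import sparse
from lightfm import LightFM

# Hypothetical interactions (users x items, COO with positive entries).
interactions = sparse.random(100, 500, density=0.02, format="coo")

# WARP learning-to-rank with linear representations and dot-product prediction.
model = LightFM(no_components=32, loss="warp")
model.fit(interactions, epochs=10)  # pass user_features=/item_features= for hybrid use

# Score items 0..4 for user 3.
scores = model.predict(np.repeat(3, 5), np.arange(5))
```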
  45. 45. TensorRec is a Python hybrid recommender system framework for developing whole recommender systems quickly. Representation functions, prediction functions, and loss functions can be customized using TensorFlow or Keras. Made by James Kirk. Apache-2.0 license TensorRec Interactions * User Features * User Representation Linear, Deep nets, None... Item Features * Item Representation Linear, Deep nets, None... Prediction Dot-product, Cosine, Euclidean... Learning RMSE, KLD, WMRB... Hey, that’s me
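A minimal TensorRec run with indicator features, roughly following the project's README; the representation, prediction, and loss graphs are all pluggable, with linear representations and dot-product prediction as defaults:

```python
from scipy import sparse
from tensorrec import TensorRec

# Hypothetical interactions plus indicator features for users and items.
interactions = sparse.random(100, 500, density=0.02, format="csr")
user_features = sparse.identity(100)
item_features = sparse.identity(500)

model = TensorRec(n_components=16)
model.fit(interactions, user_features, item_features, epochs=5)

# Predicted ranks of every item for every user.
ranks = model.predict_rank(user_features=user_features, item_features=item_features)
```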
  46. 46. Annoy is a tool for fast similarity search written in C++ with Python bindings. Useful for building systems to serve recommendations from pre-learned representations. Made by Spotify. Apache-2.0 license ANNOY (Approximate Nearest Neighbors Oh Yeah) Interactions X User Features X User Representation X Item Features X Item Representation X Prediction Cosine, Euclidean, Manhattan, Hamming Learning X
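Basic usage, with hypothetical pre-learned representations:

```python
import numpy as np
from annoy import AnnoyIndex

rng = np.random.default_rng(0)
item_representations = rng.normal(size=(1000, 16))  # pre-learned elsewhere

index = AnnoyIndex(16, "angular")  # also "euclidean", "manhattan", "hamming"
for item_id, vector in enumerate(item_representations):
    index.add_item(item_id, vector)
index.build(10)  # 10 trees: more trees gives better recall, a bigger index

# Serve: nearest items to a user representation (or to another item's).
user_representation = rng.normal(size=16)
neighbor_ids = index.get_nns_by_vector(user_representation, 10)
```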
  47. 47. Faiss is a tool for fast similarity search written in C++ with Python bindings. Useful for building systems to serve recommendations from pre-learned representations. Allows item biases. Made by Facebook. BSD license FAISS (Facebook AI Similarity Search) Interactions X User Features X User Representation X Item Features X Item Representation X Prediction Dot-product, Euclidean Learning X
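The faiss equivalent, using an inner-product (dot-product) index, which is what lets item biases survive (append a bias dimension to items and a constant 1 to queries):

```python
import numpy as np
import faiss

item_reprs = np.random.default_rng(0).normal(size=(1000, 16)).astype("float32")

index = faiss.IndexFlatIP(16)  # exact inner-product search; ANN variants exist
index.add(item_reprs)

query = item_reprs[:1]                 # faiss expects a batch of query vectors
scores, ids = index.search(query, 10)  # top-10 most relevant items
```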
  48. 48. NMSLib is a tool for fast similarity search written in C++ with Python bindings. Useful for building systems to serve recommendations from pre-learned representations. Made by Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov, David Novak, Ben Frederickson. Apache-2.0 license, with some MIT and GNU components NMSLib (Non-Metric Space Library) Interactions X User Features X User Representation X Item Features X Item Representation X Prediction Cosine, Euclidean Learning X
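And the nmslib equivalent, here with an HNSW graph index over cosine similarity:

```python
import numpy as np
import nmslib

item_reprs = np.random.default_rng(0).normal(size=(1000, 16)).astype("float32")

index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(item_reprs)
index.createIndex()

# Top-10 nearest items to item 0's representation.
ids, distances = index.knnQuery(item_reprs[0], k=10)
```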
  49. 49. We can build a hybrid recommender system to recommend personalized news articles based on past reading. Requirements: 1. We have to learn the tastes of individual users. 2. We know users’ home location with low resolution (country/state). 3. Articles are ephemeral. All items are cold-start items. 4. We can vectorize article contents and tagged categories. (politics, sports…) 5. We have to serve production-scale user traffic. 6. We don’t have to do rec-splanation. Example: News Article Recommendation Interactions Clicks, page dwells... User Features Indicator + vectorized locations User Representation Linear Item Features TF-IDF of contents + vectorized categories Item Representation Deep net Prediction Cosine Learning Balanced WMRB
  50. 50. Example: News Article Recommendation Daily Model Training Scikit-learn Feature Transformation TensorRec Recommender System Annoy Ranking Step 1: Vectorize historical article contents and metadata Step 2: Use vectorized article features to learn user representations and train a deep net for article representation Step 3: Build Annoy indices
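A sketch of step 1's feature transformation with a hypothetical two-article corpus; the TensorRec and Annoy steps are noted in comments rather than run:

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical corpus: article texts and their tagged categories.
texts = ["Senate passes budget bill", "Local team wins championship"]
categories = [["politics"], ["sports"]]

# Step 1: TF-IDF of contents plus multi-hot vectorized categories, stacked.
tfidf = TfidfVectorizer(max_features=5000).fit_transform(texts)
cats = MultiLabelBinarizer().fit_transform(categories)
item_features = sparse.hstack([tfidf, sparse.csr_matrix(cats)], format="csr")

# Step 2 would fit the TensorRec model (deep-net item representation, cosine
# prediction, balanced WMRB) on these features plus click interactions;
# step 3 would load the learned item representations into Annoy indices.
```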
  51. 51. Scikit-learn Feature Transformation TensorRec Recommender System Annoy Ranking Step 1: Vectorize new article contents and metadata Step 2: Use trained deep net to calculate new article representation Step 3: Rebuild Annoy indices with the new article Example: News Article Recommendation Handling New Articles
  52. 52. Database Representation Storage Annoy Ranking Step 1: Retrieve the user representation from the database Step 2: Find most relevant articles for the user Example: News Article Recommendation Serving User Traffic
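A minimal serving sketch under these assumptions; all names are hypothetical, and repr_store could be as simple as a dict or as heavy as Redis or a SQL table in production:

```python
def recommend_articles(user_id, repr_store, annoy_index, n=10):
    # Step 1: retrieve the user's pre-computed representation from storage.
    user_repr = repr_store.get(user_id)
    if user_repr is None:
        return []  # cold-start user: fall back to a non-personalized list

    # Step 2: find the most relevant articles for the user via the Annoy index.
    return annoy_index.get_nns_by_vector(user_repr, n)
```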
  53. 53. Example: MovieLens with TensorRec Interactions Movie ratings User Features Indicator User Representation Linear Item Features Indicator + Movie Tags Item Representation Linear Prediction Dot-product Learning Balanced WMRB
  54. 54. Evaluating Recommender Systems
  55. 55. What makes a good recommender system? Offline Evaluation: many metrics are available for offline evaluation by comparing predictions against known interactions: Precision@K, Recall@K, NDCG@K… Others measure novelty, diversity, and coverage. Precision@K: “What percentage of the top K items were positively interacted?” Recall@K: “What percentage of users’ positively interacted items were in the top K results?” Offline Pitfalls: many offline metrics don’t represent fairness of performance between users or items. These metrics can be useful for hyperparameter optimization, but often fail to evaluate the “feel” of recommendations. It is hard to use offline metrics to state that one recommender system is better than another.
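Precision@K and Recall@K for a single user are a couple of lines; a sketch with hypothetical ids:

```python
def precision_recall_at_k(ranked_items, positives, k=5):
    """ranked_items: item ids in predicted order.
    positives: set of held-out positively-interacted item ids."""
    hits = sum(1 for item in ranked_items[:k] if item in positives)
    return hits / k, hits / len(positives)

p, r = precision_recall_at_k([3, 7, 1, 9, 4], positives={7, 9, 2}, k=5)
print(p, r)  # 0.4, ~0.667
```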
  56. 56. What makes a good recommender system? Example: Offline Pitfalls. Three recommendation results (top-5 lists) for two users. User 1 has 5 positive interactions; User 2 has 2 positive interactions.

      System 1: User 1 hits 4/5, User 2 hits 1/5 → P@5 = 0.5, R@5 = (0.8 + 0.5)/2 = 0.65
      System 2: User 1 hits 5/5, User 2 hits 0/5 → P@5 = 0.5, R@5 = (1.0 + 0.0)/2 = 0.5
      System 3: User 1 hits 3/5, User 2 hits 2/5 → P@5 = 0.5, R@5 = (0.6 + 1.0)/2 = 0.8

  The third recommendation system is the most broadly effective, and probably the “best.” Precision fails to identify that, but recall does. You can concoct similar pitfalls for recall or NDCG.
  57. 57. What makes a good recommender system? Online Evaluation: when rolling out a new recommender system, the truest test is an A/B test against an existing system. The most effective feedback comes from user interviewing and from monitoring the user behaviors the system is intended to drive. If there is no existing system, do phased roll-outs with quant/qual feedback.* User interviewing is the only way to evaluate the “feel” of recommendations. Feel? “I already own a crib, why would I need another?” Missing item filtering based on metadata? “These songs are excellent, but I already know these bands.” Maybe we should target discovery? “I’ve watched Captain America twenty times, but that doesn’t mean I only want to watch Marvel movies. What about the sitcoms I watch?” Maybe we’re oversimplifying the user’s representation? *Fellow employees make great, but biased, guinea pigs
  58. 58. Algorithmic Bias and Fairness. All Algorithms Are Biased: there are biases innate in the data we use, the way users interact with our products, and the way our algorithms learn. Controlling for this is not as simple as setting biased=False. When designing these systems, we have a responsibility to, at the least, understand the biases in our products. You wouldn’t ship a product without tests; you shouldn’t ship a RecSys without examining bias. Understanding Fairness: there are many definitions of fairness. Some cross-section recommender performance by user and item metadata. C-fairness: is recommendation recall significantly lower for customers in Massachusetts? P-fairness: are movies with female leads recommended less often than in the natural distribution of movie watching? Missing metadata? Crowdsource it, but be careful with sensitive metadata.
  59. 59. What We Missed
  60. 60. What We Missed. Sequence-based models: in what order do our users interact with our items? Mixture-of-tastes models: is one representation per user enough for users with diverse tastes? Rec-splanation: how do system design choices impact interpretability? Attention models: can we learn more nuance in user representations than just a vector? Graphical models: can we map relationships between users, items, and their attributes? Cold-start problems: how do we make recommendations for brand-new users?
  61. 61. Wait, is it “recommender systems” or “recommendation systems?”
  62. 62. Wait, is it “recommender systems” or “recommendation systems?” ¯\_(ツ)_/¯
  63. 63. Thank you! Questions? James Kirk @jiminy_kirket /jkirk12 @jameskirk1 /jfkirk
