18. Lessons Learned
1. More data means more personalization.
2. Recommendation engines are instruments of your business.
3. Custom models respect your heterogeneous and unique system.
19. Lessons Learned / Goals for Today
1. More data means more personalization.
We’ll use rec engines to drive personalization.
2. Recommendation engines are instruments of your business.
The latent factors they reveal enable new science!
3. Custom models respect your heterogeneous and unique system.
We’ll explore 8 different ways of building deep recommendation engines.
36. There are eight Jupyter notebooks in this tutorial.
We’ll go through them one by one.
Feel free to run them in advance -- they’ll start training models in the background.
… or start them up as I go along.
47. P: the user matrix, shape (n_users, k).
Q: the item matrix, shape (n_items, k).
Much smaller! (millions or 100ks of rows)
Compact & dense! No zeroes, efficient storage.
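In code, P and Q are nothing more than two embedding tables. A minimal PyTorch sketch; the sizes and names here are illustrative, not the notebooks’ exact code:

import torch
from torch import nn

# Illustrative sizes; in practice these come from your data.
n_users, n_items, k = 1_000_000, 100_000, 18

P = nn.Embedding(n_users, k)   # the user matrix, shape (n_users, k)
Q = nn.Embedding(n_items, k)   # the item matrix, shape (n_items, k)

# A predicted interaction is the dot product of one user row and one item row.
u = torch.tensor([42])
i = torch.tensor([7])
r_ui = (P(u) * Q(i)).sum(dim=1)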
53. Baseline model
- Only captures interactions.
- What if a user generally likes everything?
- What if an item is generally popular?
R_ui = p_u · q_i
54. Baseline model
- Only captures interactions.
- What if a user generally likes everything?
- What if an item is generally popular?
Let’s add biases.
R_ui = p_u · q_i
55. Model + global bias + user biases + item biases
Now we learn how much a user generally likes things, and how much an item is generally liked.
R_ui = b + ω_u + μ_i + p_u · q_i
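A minimal PyTorch sketch of this bias model; the module and variable names are mine, not necessarily the tutorial notebook’s:

import torch
from torch import nn

class MFWithBiases(nn.Module):
    """R_ui = b + omega_u + mu_i + p_u . q_i"""
    def __init__(self, n_users, n_items, k):
        super().__init__()
        self.global_bias = nn.Parameter(torch.zeros(1))  # b
        self.user_bias = nn.Embedding(n_users, 1)        # omega_u
        self.item_bias = nn.Embedding(n_items, 1)        # mu_i
        self.user_vecs = nn.Embedding(n_users, k)        # p_u
        self.item_vecs = nn.Embedding(n_items, k)        # q_i

    def forward(self, u, i):
        interaction = (self.user_vecs(u) * self.item_vecs(i)).sum(dim=1)
        return (self.global_bias
                + self.user_bias(u).squeeze(1)
                + self.item_bias(i).squeeze(1)
                + interaction)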
56. A single +1 or 0 rating
R_ui = b + ω_u + μ_i + p_u · q_i
b: global bias. ω_u: user bias (one number per user). μ_i: item bias (one number per item). p_u: user vector. q_i: item vector.
59. User & item vectors are not black boxes.
They are instruments into your space.
Stitch Fix vectors stand for clothing style.
MovieLens vectors yield movie genres.
At Spotify, they represent latent musical tastes.
At Amazon, they represent latent categories.
60. Let’s PCA the Stitch Fix vectors and take a look.
[Table of example item vectors, e.g. (+1.2, -0.3, +0.4, -4.0, +0.9)]
61. Let’s PCA the Stitch Fix vectors and take a look.
We’ll sort by the first eigenvector.
[Same table of example item vectors, now sorted along the first eigenvector]
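Roughly what that looks like in code; item_vectors here is a random stand-in for the learned (n_items, k) item matrix:

import numpy as np

# Stand-in for the learned item matrix Q, shape (n_items, k).
item_vectors = np.random.randn(10_000, 18)

# Center and take the top principal component via SVD.
centered = item_vectors - item_vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
first_pc = centered @ vt[0]      # projection onto the first eigenvector

# Sort items along that direction and inspect the two extremes.
order = np.argsort(first_pc)
print(order[:10], order[-10:])   # e.g. the "boho" end vs. the "preppy" end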
64. We did not define boho or preppy -- we discovered it!
This is the cornerstone of our ‘style space’ that segments our clients and our merch.
65. [Same table of example item vectors]
We don’t have just one dimension, however… we have many more.
What do each of those look like?
66. Internally, we use 18 dimensions.
The first three we call trendiness, femininity, and end use.
69. This is now a typical matrix-factorization recommender.
R_ui = b + ω_u + μ_i + p_u · q_i
70. This is now a typical matrix-factorization recommender.
What if we know more “side” features -- like the user’s occupation?
Great for when we want to “cold start” a new user.
R_ui = b + ω_u + μ_i + p_u · q_i
71. We have two choices for side features:
- Add them as a bias.
→ Choose this if you think occupation changes how much a user likes things overall, but not which movies.
→ e.g. “artists like movies more than other occupations”
- Add them as a user vector.
→ Choose this if you think occupation changes which items a user likes.
→ e.g. “realtors love real estate shows”
R_ui = b + d_o + ω_u + μ_i + (p_u + t_o) · q_i
*Note that t_o interacts with q_i.
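A sketch of how both choices can live in one PyTorch module, following the equation above; module and attribute names are hypothetical:

import torch
from torch import nn

class MFWithOccupation(nn.Module):
    """R_ui = b + d_o + omega_u + mu_i + (p_u + t_o) . q_i"""
    def __init__(self, n_users, n_items, n_occupations, k):
        super().__init__()
        self.global_bias = nn.Parameter(torch.zeros(1))
        self.occ_bias = nn.Embedding(n_occupations, 1)   # d_o: shifts the like rate only
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)
        self.user_vecs = nn.Embedding(n_users, k)
        self.item_vecs = nn.Embedding(n_items, k)
        self.occ_vecs = nn.Embedding(n_occupations, k)   # t_o: interacts with q_i

    def forward(self, u, i, o):
        p_u = self.user_vecs(u) + self.occ_vecs(o)       # a cold-start user leans on t_o
        interaction = (p_u * self.item_vecs(i)).sum(dim=1)
        return (self.global_bias
                + self.occ_bias(o).squeeze(1)
                + self.user_bias(u).squeeze(1)
                + self.item_bias(i).squeeze(1)
                + interaction)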
73. You’ve seen biases.
You’ve seen user-item interactions.
What about when features change in time?
R_ui = b + ω_u + μ_i + p_u · q_i
74. R_ui = b + ω_u + μ_i + p_u · q_i + m_u · n_t
What does a latent temporal vector encode?
Here’s a toy example.
[Plot: average user rating vs. time, t = 0…6]
75. R_ui = b + ω_u + μ_i + p_u · q_i + m_u · n_t
What does a latent temporal vector encode?
Here’s a toy example.
n_0 = (1.0, 0.0), n_1 = (0.9, 0.1), …, n_t = (1 - t, t)
[Plot: average user rating vs. time, t = 0…6]
76. R_ui = b + ω_u + μ_i + p_u · q_i + m_u · n_t
What does a latent temporal vector encode?
Here’s a toy example.
n_0 = (1.0, 0.0), n_1 = (0.9, 0.1), …, n_t = (1 - t, t)
m_1 = (0.0, 1.0)
[Plot: average user rating vs. time, t = 0…6]
77. R_ui = b + ω_u + μ_i + p_u · q_i + m_u · n_t
What does a latent temporal vector encode?
A user mixture of time-series.
n_0 = (1.0, 0.0), n_1 = (0.9, 0.1), …, n_t = (1 - t, t)
m_0 = (1.0, 0.0): user 0 will have a 100% mixture of the 1st time series.
m_1 = (0.0, 1.0): user 1 will have a 100% mixture of the 2nd time series.
78. R_ui = b + ω_u + μ_i + p_u · q_i + m_u · n_t
We can enforce that nearby times have similar components.
Enforce that |n_t - n_{t-1}| should be small.
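A sketch of the extra m_u · n_t term plus the smoothness penalty, assuming time is discretized into bins; sizes and names are illustrative:

import torch
from torch import nn

n_users, n_timebins, k_t = 10_000, 52, 2      # e.g. weekly time bins; illustrative
user_mix = nn.Embedding(n_users, k_t)          # m_u: each user's mixture weights
time_vecs = nn.Embedding(n_timebins, k_t)      # n_t: one vector per time bin

def temporal_term(u, t):
    # The extra m_u . n_t piece added to R_ui.
    return (user_mix(u) * time_vecs(t)).sum(dim=1)

def temporal_smoothness(weight=1.0):
    # Penalize |n_t - n_{t-1}| so adjacent time bins stay similar.
    diffs = time_vecs.weight[1:] - time_vecs.weight[:-1]
    return weight * diffs.pow(2.0).sum()

# loss = prediction_loss + temporal_smoothness()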
79. Word2Vec is actually a rec engine!
Notebooks we’ll be using:
03 Simple MF Biases is actually word2vec.ipynb
80. “ITEM_92 I think this fabric is wonderful
(rayon & spandex). like the lace/embroidery
accents”
95. Instead of (users, items) we have (token1, token2).
And instead of a rating we have a skipgram count.
But it’s still the same model!
R_ui = b + ω_u + μ_i + p_u · q_i
96. Mapping the model onto text:
User bias → token1 bias
Item bias → token2 bias
User vector → token1 vector
Item vector → token2 vector
A single +1 or 0 rating → log skipgram count
R_ui = ω_u + μ_i + p_u · q_i
97. The token biases capture how frequent each word is.
The dot product asks: do word1 & word2 have a special interaction?
R_ui = ω_u + μ_i + p_u · q_i
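The same bias model, re-labeled for text; a hedged sketch (not the notebook’s exact code), trained against log skipgram counts:

import torch
from torch import nn

class SkipgramMF(nn.Module):
    """log X_ij ~ omega_i + mu_j + p_i . q_j  (word2vec/GloVe-flavored MF)"""
    def __init__(self, n_tokens, k):
        super().__init__()
        self.bias1 = nn.Embedding(n_tokens, 1)   # how frequent is token1?
        self.bias2 = nn.Embedding(n_tokens, 1)   # how frequent is token2?
        self.vecs1 = nn.Embedding(n_tokens, k)   # token1 vector
        self.vecs2 = nn.Embedding(n_tokens, k)   # token2 vector

    def forward(self, t1, t2):
        interaction = (self.vecs1(t1) * self.vecs2(t2)).sum(dim=1)
        return self.bias1(t1).squeeze(1) + self.bias2(t2).squeeze(1) + interaction

# Train with e.g. MSE against log(skipgram_count) for observed (t1, t2) pairs.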
102. In recommendation engines, there are also directions in the latent space.
What are the directions in your space?
[Style-space plot with directions: MORE TRENDY, MORE PROFESSIONAL]
105. Practical reasons to go variational:
1. Alternative regularization
2. Measure what your model doesn’t know.
3. Help explain your data.
106. Practical reasons to go variational:
1. Alternative regularization
2. Measure what your model doesn’t know.
3. Help explain your data.
4. Short & fits in a tweet!
114. R_ui = b + ω_u + μ_i + p_u · q_i
Let’s make this variational:
1. Replace point estimates with samples from a distribution.
2. Replace regularizing that point with regularizing that distribution.
122. R_ui = b + ω_u + μ_i + p_u · q_i
Let’s make this variational:
1. Replace point estimates with samples from a distribution.
2. Replace regularizing that point with regularizing that distribution.
123. WITH VARIATIONAL
loss += gaussian_kldiv(vector_mu, vector_lv)
WITHOUT VARIATIONAL
loss += c_vector.pow(2.0).sum()
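The gaussian_kldiv helper above comes from the tutorial code; a plausible implementation, assuming a standard-normal prior on each vector component, is:

import torch

def gaussian_kldiv(mu, log_var):
    # KL( N(mu, exp(log_var)) || N(0, 1) ), summed over every component.
    return 0.5 * (log_var.exp() + mu.pow(2.0) - 1.0 - log_var).sum()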
126. More Things to Try
FMs (Factorization Machines)
● Useful when you have many kinds of interactions, not just user-item and user-time (see the sketch below).
Mixture-of-Tastes
● Like “attention” for recommendations; allows users to have multiple “tastes.”
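A sketch of the second-order factorization-machine term, in a generic formulation rather than any specific library’s API:

import torch
from torch import nn

class FMInteractions(nn.Module):
    """Second-order FM term over arbitrary feature vectors x."""
    def __init__(self, n_features, k):
        super().__init__()
        self.v = nn.Parameter(torch.randn(n_features, k) * 0.01)

    def forward(self, x):                       # x: (batch, n_features) floats
        xv = x @ self.v                         # (batch, k)
        xv_sq = (x * x) @ (self.v * self.v)     # (batch, k)
        # 0.5 * ((sum_i x_i v_i)^2 - sum_i (x_i v_i)^2) sums every pairwise interaction.
        return 0.5 * (xv.pow(2.0) - xv_sq).sum(dim=1)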
140. “Zelda” is a game set in 2D Euclidean space (roughly).
● Infinity can’t be seen
● Area is homogeneous
● Geodesics are straight lines
141. “Hyper Rogue” is a game set in hyperbolic space.
● Radius = 1 is infinitely far away, but visible
● Volume increases towards the boundary
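The metric behind those properties is the Poincaré-ball distance; a small sketch (the eps guard is my addition for numerical safety):

import torch

def poincare_distance(u, v, eps=1e-5):
    # Distance in the Poincare ball (||u||, ||v|| < 1); points near the
    # radius-1 boundary are effectively infinitely far from the origin.
    sq_diff = (u - v).pow(2.0).sum(dim=-1)
    denom = (1.0 - u.pow(2.0).sum(dim=-1)) * (1.0 - v.pow(2.0).sum(dim=-1))
    return torch.acosh(1.0 + 2.0 * sq_diff / (denom + eps))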
Editor's Notes
*Only* captures interactions.
That means it captures all kinds of effects like:
It powers:
What are the design characteristics best liked by a trendy professional?
We can share with our design team and vendors that black cloth, button-downs, and blazers would be good features to leverage.
How many trendy boho items should we buy? We could share with our planning team what fraction of our clientele live in this region of style space, and stock accordingly. Or how about deciding whether two items are duplicative and should not both be purchased? We can calculate the item pair’s cosine similarity and surface these scores to our buyers.
Try: Lassie vs. Reservoir Dogs vs. Star Trek
All deep learning text models run some sort of co-occurrence model.
It’s a simple BOW baseline model that still gives you vectors that aren’t immediately interpretable.
It embeds both words & items in the same space.
Let’s remove the fluff & simplify… (sketched below)
X is a 2D matrix where each element is just the count of how many times we saw that pair of words near each other.
These are called “skip-grams” because we’re skipping tokens between both words.
This is a minor modification of GloVe & word2vec-like algos.
LHS: take the log of that count (word frequencies are log-distributed).
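A rough sketch of how that count matrix X can be built from a token stream; the window size and names are illustrative:

from collections import Counter

def skipgram_counts(tokens, window=5):
    # X[(w1, w2)] = how many times w2 appeared within `window` tokens of w1.
    counts = Counter()
    for i, w1 in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(w1, tokens[j])] += 1
    return counts

X = skipgram_counts("i think this fabric is wonderful".split())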
You do get hints that these directions exist outside of t-SNE.
The difference between man & woman is the same as king & queen.
(The arrows are the same size & direction.)
Not just man & woman, king & queen…
This is linear regularity.
We have hundreds of dimensions in word vectors, hundreds of potential directions.
Most ML & deep learning is about optimizing a point estimate of your model parameters. Variational methods are about optimizing a posterior, which loosely speaking expresses a spectrum of model configurations that are consistent with my data.
1. Never seen a client? High uncertainty.
Having a model that knows when it doesn’t know something is key to variational methods.
3. Variational methods can reveal entailment, a natural way of grouping data, sometimes hierarchical.
Variational word2vec: entailment, a brand new property of word vectors not in vanilla word2vec.
Variational FMs: capture what they don’t know, prioritize discovery, e.g. new drugs in medicine or new items in rec engines.
Variational t-SNE: build trust in the model, visualize what’s going on.
Of course, we now have this extra log-variance matrix hanging around.
Every word (king or queen or man or woman) has a row in this matrix.
What does this mean?
We’re not just at a point any more.
Polysemous words appear in lots of different contexts.
man: 1) a human male, 2) human beings in general, 3) ‘sticking it to the man’
woman: has fewer senses
king: ruler or chess piece
Variance is not just a regularizer.
Variance is not just measuring ‘noise’.
The variance estimate tells me what my model claims is consistent with my data.
Variance is not just uncertainty; it actually gives me representations with novel properties.
word2gauss has entailment.
We’re now capturing not just similarities and linear regularities but entailment, the property that words may be subsumed entirely by other tokens.
We didn’t have this in the original word2vec.
It’s an example of learning hierarchical representations.
We get a different sample every time.
Backprop will affect both the mean & the variance.
Now replace the single vector matrix with two matrices (two row vectors per word): a mean matrix and a log-variance matrix.
Until recently (around 2013), the latter required formidable variational calculus and all manner of integral tricks and specialized math.
But the beautiful reparameterization trick unifies both approaches, and is at the core of Bayesian deep learning methods.
What’s more, it’s elegant and can be implemented in Python code that fits inside a single tweet.
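The tweet-sized piece is the reparameterized sample itself; a sketch (function name is mine):

import torch

def sample(mu, log_var):
    # Reparameterization trick: a differentiable draw from N(mu, exp(log_var)).
    # Backprop flows through both the mean and the (log-)variance.
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)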
2nd piece: regularizing that distribution.
Instead of a point, penalize the distribution.
We can use video games to tell the story of Euclidean vs. Poincaré space.
3 points to make.
Hyper Rogue is a game set in hyperbolic space.
Read the first two.
Next slide for geodesics.
Similar to a binary tree: the shortest path involves going up the tree and back down.