
Large Scale Recommendation: a view from the Trenches


Presentation to the MALIA session at the JdS 2019 (Journées françaises de statistiques), on June 6, 2019.


  1. Large scale recommendation: a view from the trenches. Anne-Marie Tousch, Senior Research Scientist. 51èmes Journées de Statistiques de la SFdS.
  2. Outline: 1. Context & problem setting. 2. One large-scale solution. 3. Open problems.
  3. Context. What Criteo does: online personalized advertising.
  4. Personalized advertising. We buy ad placements, we recommend products, and we sell clicks that lead to sales.
  5. Context. Daily: 300B bid requests, 4B displays. Worldwide: 3 billion shoppers, 1 billion products.
  6. Recommendation. A user = a timeline of browsed products. Task: find products she wants to buy.
  7. Recommendation. A user = a timeline of products browsed on catalog A. Task: find products she wants to buy, in catalog B.
  8. Large-scale, high-speed recommendation: 4B times a day, recommend products in less than 100 ms.
  9. Large-scale recommender systems: co-event counters with nearest neighbors are easy and a strong baseline; matrix factorization (MF) now scales; neural networks are state of the art, but how do they scale?
  10. Matrix factorization. The classical recommender-system setting: a product set $P$ with $m = |P|$, a product being $v_j$, $j \in [m]$; a user $u_i = \{v_{j_1}, \ldots, v_{j_i}\}$; an interaction matrix $A_{i,j} = \delta[v_j \in u_i]$ (or ratings, or counts), $A \in \mathbb{R}^{n \times m}$. Factorize $A$ with a truncated SVD, $A = U \Sigma V^*$, to obtain user and product embeddings of dimension $k \ll \min(m, n)$ (see the sketch below).
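
A minimal sketch of this setting in Python, using scipy's sparse truncated SVD; the sizes, density, and random interactions below are made-up illustrative values, not Criteo data:

```python
import scipy.sparse as sp
from scipy.sparse.linalg import svds

n, m, k = 1000, 500, 32   # users, products, embedding dimension (illustrative)

# Random binary interaction matrix: A[i, j] = 1 if user i browsed product j.
A = sp.random(n, m, density=0.01, format="csr", random_state=0)
A.data[:] = 1.0

# Truncated SVD: A ≈ U Σ V*, with k << min(m, n).
U, s, Vt = svds(A, k=k)

user_embeddings = U * s      # n × k, scaled by the singular values
product_embeddings = Vt.T    # m × k
```
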
  11. Large-scale MF. What if $m \approx n \approx 10^{7}$ to $10^{9}$? Idea: use sketching. Johnson-Lindenstrauss lemma (1984): let $\epsilon \in (0, 1)$ and let $A$ be a set of $n$ points in $\mathbb{R}^d$. Let $k$ be an integer with $k = O(\epsilon^{-2} \log n)$. Then there exists a mapping $f : \mathbb{R}^d \to \mathbb{R}^k$ such that for any $a, b \in A$: $(1 - \epsilon)\|a - b\|^2 \leq \|f(a) - f(b)\|^2 \leq (1 + \epsilon)\|a - b\|^2$ (see the sketch below).
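
A small numerical illustration of the lemma, assuming a plain Gaussian projection as the mapping $f$ (one standard JL construction; the constant 4 in the choice of $k$ is a common textbook value, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 1000, 10_000, 0.2
k = int(np.ceil(4 * np.log(n) / eps**2))      # k = O(eps^-2 log n)

X = rng.normal(size=(n, d))                   # n points in R^d
Omega = rng.normal(size=(d, k)) / np.sqrt(k)  # f(a) = a @ Omega
Y = X @ Omega

# Measure the distortion on 100 random distinct pairs of points.
pairs = rng.choice(n, size=(100, 2), replace=False)
i, j = pairs[:, 0], pairs[:, 1]
ratio = (np.linalg.norm(Y[i] - Y[j], axis=1) ** 2
         / np.linalg.norm(X[i] - X[j], axis=1) ** 2)
print(ratio.min(), ratio.max())               # within [1 - eps, 1 + eps] w.h.p.
```
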
  12. Randomized SVD (Halko, Martinsson, and Tropp 2011). Stage A: compute an approximate basis for the range of the input matrix $A$; in other words, find a matrix $Q$ with orthonormal columns such that $A \approx QQ^*A$. Stage B: use $Q$ to compute a standard factorization (QR, SVD, etc.) of $A$: form the matrix $B = Q^*A$, compute an SVD of the small matrix, $B = \tilde{U}\Sigma V^*$, and form the orthonormal matrix $U = Q\tilde{U}$.
  13. Randomized SVD, the algorithm. Draw an $n \times \ell$ standard Gaussian matrix $\Omega$. Form $Y_0 = A\Omega$ and compute its QR factorization $Y_0 = Q_0 R_0$. For $j = 1, 2, \ldots, q$: form $\tilde{Y}_j = A^* Q_{j-1}$ and compute its QR factorization $\tilde{Y}_j = \tilde{Q}_j \tilde{R}_j$; form $Y_j = A\tilde{Q}_j$ and compute its QR factorization $Y_j = Q_j R_j$. Set $Q = Q_q$. Apply stage B: $B := Q^T A$; $B^T = \tilde{Q}R = \tilde{Q}(\hat{V}S\hat{U}^T)$; $U := Q\hat{U}$ (a sketch follows below).
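
A compact numpy sketch of the algorithm above, stages A and B included; the function name, test rank, and iteration count are choices made here for illustration:

```python
import numpy as np

def randomized_svd(A, ell, q=2):
    # Stage A: capture the range of A with a Gaussian sketch plus q
    # rounds of subspace iteration.
    rng = np.random.default_rng(0)
    Omega = rng.normal(size=(A.shape[1], ell))   # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)               # Y0 = A Ω = Q0 R0
    for _ in range(q):
        Qt, _ = np.linalg.qr(A.T @ Q)            # Ỹj = A* Q_{j-1} = Q̃j R̃j
        Q, _ = np.linalg.qr(A @ Qt)              # Yj = A Q̃j = Qj Rj
    # Stage B: the QR of Bᵀ keeps the dense SVD at size ℓ × ℓ.
    B = Q.T @ A
    Qb, R = np.linalg.qr(B.T)                    # Bᵀ = Q̃ R
    Uh, s, Vh_t = np.linalg.svd(R.T)             # Rᵀ = Û S V̂ᵀ
    return Q @ Uh, s, Qb @ Vh_t.T                # U = Q Û, Σ = S, V = Q̃ V̂

A = np.random.default_rng(1).normal(size=(2000, 800))
U, s, V = randomized_svd(A, ell=50)
print(np.linalg.norm(A - U @ np.diag(s) @ V.T) / np.linalg.norm(A))
```
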
  14. Randomized decomposition, a variant. Draw an $n \times \ell$ standard Gaussian matrix $\Omega$. Form $Y_0 = A\Omega$ and compute its QR factorization $Y_0 = Q_0 R_0$. For $j = 1, 2, \ldots, q$: normalize the rows of $Q_{j-1}$; form $\tilde{Y}_j = A^* Q_{j-1}$ and compute its QR factorization $\tilde{Y}_j = \tilde{Q}_j \tilde{R}_j$; normalize the rows of $\tilde{Q}_j$; form $Y_j = A \tilde{Q}_j$ and compute its QR factorization $Y_j = Q_j R_j$. Set $Q = Q_q$ and skip stage B (sketch below).
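
The same skeleton with the two row normalizations added and stage B dropped; this is my reading of the slide, with the final $Q$ used directly as the embedding matrix:

```python
import numpy as np

def randomized_embedding(A, ell, q=2):
    rng = np.random.default_rng(0)
    Omega = rng.normal(size=(A.shape[1], ell))
    Q, _ = np.linalg.qr(A @ Omega)
    for _ in range(q):
        Q /= np.linalg.norm(Q, axis=1, keepdims=True) + 1e-12    # normalize rows of Q_{j-1}
        Qt, _ = np.linalg.qr(A.T @ Q)
        Qt /= np.linalg.norm(Qt, axis=1, keepdims=True) + 1e-12  # normalize rows of Q̃j
        Q, _ = np.linalg.qr(A @ Qt)
    return Q    # Q = Qq; stage B is skipped
```
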
  15. Matrix factorization vs. Word2Vec. "For a negative-sampling value of k = 1, the Skip-Gram objective is factorizing a word-context matrix in which the association between a word and its context is measured by $f(w, c) = \mathrm{PMI}(w, c)$" (Levy and Goldberg 2014). We approximate Skip-Gram by factorizing a PMI matrix with $P = A^* A \in \mathbb{R}^{m \times m}$ and $\mathrm{PMI}_{i,j} := \log \frac{P_{i,j} \sum_{i',j'} P_{i',j'}}{\sum_{j'} P_{i,j'} \sum_{i'} P_{j,i'}}$ (sketch below).
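
A sketch of this PMI construction, assuming a dense matrix for readability (at $m \approx 10^7$ the matrix would have to stay sparse and implicit); the `eps` smoothing against $\log 0$ is an addition of mine, not from the slides:

```python
import numpy as np

def pmi_matrix(A, eps=1e-12):
    P = A.T @ A                          # P = A* A, m × m co-occurrence counts
    total = P.sum()                      # Σ_{i',j'} P[i', j']
    row = P.sum(axis=1, keepdims=True)   # Σ_{j'} P[i, j']
    col = P.sum(axis=0, keepdims=True)   # Σ_{i'} P[i', j]
    return np.log((P * total + eps) / (row * col + eps))
```

The resulting matrix can then be fed to a factorization such as the randomized decomposition above.
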
  16. Approximate nearest neighbors. Project the user into the embedding space, then recommend the top-k products nearest to the user in the product space (see the sketch below). Problem: if the different catalogs are not aligned, the nearest neighbors are almost always the same.
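
A brute-force sketch of this retrieval step; at the stated scale the exact scan would be replaced by an approximate index, and the function name is invented here:

```python
import numpy as np

def recommend(user_vec, product_embeddings, k=10):
    # Cosine similarity between one user and every product, then top-k.
    p = product_embeddings / np.linalg.norm(product_embeddings, axis=1, keepdims=True)
    u = user_vec / np.linalg.norm(user_vec)
    return np.argsort(-(p @ u))[:k]      # indices of the k nearest products
```
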
  17. Open questions. Pb 1: popularity biases. E.g., recommending high-frequency items is a strong baseline strategy => fairness and diversity issues: high-frequency users, big vs. small advertisers, ...
  18. Open questions. Pb 2: the organic traffic bias. Metric: predict the next item?
  19. Open questions. Pb 2, continued: we actually want to predict incremental sales. What if we had not recommended this product: would the user still have bought it? Idea: learn embeddings that optimize individual treatment effects (Bonner and Vasile 2018).
  20. Open questions. Pb 2: a simulation environment, RecoGym (Rohde et al. 2018): https://github.com/criteo-research/reco-gym
  21. Open questions. Pb 3: the unbounded number of products. Large-scale neural networks, the variational auto-encoder example: "[Use a] function $f_\theta(\cdot) \in \mathbb{R}^I$ to produce a probability distribution over $I$ items $\pi(z_u)$" (Liang et al. 2018). What if $I = 10^7$, $10^9$? (See the back-of-the-envelope sketch below.)
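
A back-of-the-envelope check of why this breaks at scale: the dense softmax output layer alone needs $k \times I$ weights. The latent dimension $k = 200$ is an assumed, illustrative value:

```python
k = 200                                   # assumed latent dimension
for I in (10**5, 10**7, 10**9):
    params = k * I                        # weights in the output layer alone
    print(f"I = {I:.0e}: {params:.1e} weights, ~{4 * params / 1e9:.1f} GB in float32")
```
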
  22. Open questions. Pb 3: the unbounded number of products. Idea: use a group-testing scheme with a binary $p \times m$ matrix $H$, $h(y) = H \vee y$ => work as if with $p$ pseudo-items (sketch below). "Theorem: Suppose we wish to recover a $k$-sparse binary vector $y \in \mathbb{R}^m$. A random binary $\{0, 1\}$ matrix $A$ where each entry is 1 with probability $\rho = 1/k$ recovers a $1 - \varepsilon$ proportion of the support of $y$ correctly with high probability, for any $\varepsilon > 0$, with $p = O(k \log m)$. This matrix will also detect $e = \Omega(p)$ errors." (Ubaru and Mazumdar 2017). Question: can we do better knowing that the item frequency follows a power law?
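
A sketch of the reduction, reading $h(y) = H \vee y$ as a boolean OR of the columns of $H$ selected by $y$; the constant in $p = O(k \log m)$ is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 100_000, 10                        # items, sparsity of y
p = int(np.ceil(2 * k * np.log(m)))       # p = O(k log m)

H = rng.random((p, m)) < 1.0 / k          # binary matrix, entries 1 w.p. ρ = 1/k

y = np.zeros(m, dtype=bool)               # k-sparse label vector
y[rng.choice(m, size=k, replace=False)] = True

z = H[:, y].any(axis=1)                   # h(y) = H ∨ y
print(f"{m} items reduced to {p} pseudo-items")
```
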
  23. Thanks! Questions? Reach out to me at am.tousch@criteo.com or on Twitter @amy8492.
  24. References.
Bonner, Stephen, and Flavian Vasile. "Causal embeddings for recommendation." In: Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 2018, pp. 104-112.
Halko, Nathan, Per-Gunnar Martinsson, and Joel A. Tropp. "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions." SIAM Review 53.2 (2011), pp. 217-288.
Levy, Omer, and Yoav Goldberg. "Neural word embedding as implicit matrix factorization." In: Advances in Neural Information Processing Systems, 2014, pp. 2177-2185.
Liang, Dawen, et al. "Variational autoencoders for collaborative filtering." In: Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 2018, pp. 689-698.
Rohde, David, et al. "RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising." arXiv preprint arXiv:1808.00720 (2018).
Ubaru, Shashanka, and Arya Mazumdar. "Multilabel classification with group testing and codes." In: Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 3492-3501.
