
Large Scale Recommendation: a view from the Trenches


Presentation to the MALIA session at the JdS 2019 (Journées françaises de statistiques), on June 6, 2019.


  1. Large scale recommendation: a view from the trenches. Anne-Marie Tousch, Senior Research Scientist. 51èmes Journées de Statistiques de la SFdS.
  2. Outline: 1. Context & problem setting. 2. One large-scale solution. 3. Open problems.
  3. Context. What Criteo does: online personalized advertising.
  4. Personalized advertising. We buy ad placements, we recommend products, and we sell clicks that lead to sales.
  5. Context. Daily: 300B bid requests, 4B displays. Worldwide: 3 billion shoppers, 1 billion products.
  6. Recommendation. A user = a timeline of browsed products. Task: find products she wants to buy.
  7. Recommendation. A user = a timeline of products browsed on catalog A. Task: find products she wants to buy, in catalog B.
  8. Large-scale, high-speed recommendation: 4B times a day, recommend products in less than 100 ms.
  9. Large-scale recommender systems: co-event counters with nearest neighbors are easy and a strong baseline; matrix factorization (MF) now scales; neural networks are state of the art, but how do they scale?
  10. Matrix factorization. The classical recommender-system setting: a product set $P$ with $m = |P|$, a product being $v_j$, $j \in [m]$; a user $u_i = \{v_{j_1}, \ldots, v_{j_i}\}$; an interaction matrix $A_{i,j} = \delta[v_j \in u_i]$ (or ratings, or counts), $A \in \mathbb{R}^{n \times m}$. Factorize $A$ with a truncated SVD, $A = U \Sigma V^*$, to obtain user and product embeddings of dimension $k \ll \min(m, n)$ (see the sketch below).
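
A minimal sketch of this setting in Python, using scipy's sparse truncated SVD; the sizes, density, and random interactions below are made-up illustrative values, not Criteo data:

```python
import scipy.sparse as sp
from scipy.sparse.linalg import svds

n, m, k = 1000, 500, 32   # users, products, embedding dimension (illustrative)

# Random binary interaction matrix: A[i, j] = 1 if user i browsed product j.
A = sp.random(n, m, density=0.01, format="csr", random_state=0)
A.data[:] = 1.0

# Truncated SVD: A ≈ U Σ V*, with k << min(m, n).
U, s, Vt = svds(A, k=k)

user_embeddings = U * s      # n × k, scaled by the singular values
product_embeddings = Vt.T    # m × k
```
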
  11. Large-scale MF. What if $m \approx n \approx 10^{7}$ to $10^{9}$? Idea: use sketching. Johnson-Lindenstrauss lemma (1984): let $\epsilon \in (0, 1)$ and let $A$ be a set of $n$ points in $\mathbb{R}^d$. Let $k$ be an integer with $k = O(\epsilon^{-2} \log n)$. Then there exists a mapping $f : \mathbb{R}^d \to \mathbb{R}^k$ such that for any $a, b \in A$: $(1 - \epsilon)\|a - b\|^2 \leq \|f(a) - f(b)\|^2 \leq (1 + \epsilon)\|a - b\|^2$ (see the sketch below).
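
A small numerical illustration of the lemma, assuming a plain Gaussian projection as the mapping $f$ (one standard JL construction; the constant 4 in the choice of $k$ is a common textbook value, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 1000, 10_000, 0.2
k = int(np.ceil(4 * np.log(n) / eps**2))      # k = O(eps^-2 log n)

X = rng.normal(size=(n, d))                   # n points in R^d
Omega = rng.normal(size=(d, k)) / np.sqrt(k)  # f(a) = a @ Omega
Y = X @ Omega

# Measure the distortion on 100 random distinct pairs of points.
pairs = rng.choice(n, size=(100, 2), replace=False)
i, j = pairs[:, 0], pairs[:, 1]
ratio = (np.linalg.norm(Y[i] - Y[j], axis=1) ** 2
         / np.linalg.norm(X[i] - X[j], axis=1) ** 2)
print(ratio.min(), ratio.max())               # within [1 - eps, 1 + eps] w.h.p.
```
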
  12. Randomized SVD (Halko, Martinsson, and Tropp 2011). Stage A: compute an approximate basis for the range of the input matrix $A$; in other words, find a matrix $Q$ with orthonormal columns such that $A \approx QQ^*A$. Stage B: use $Q$ to compute a standard factorization (QR, SVD, etc.) of $A$: form the matrix $B = Q^*A$, compute an SVD of the small matrix, $B = \tilde{U}\Sigma V^*$, and form the orthonormal matrix $U = Q\tilde{U}$.
  13. Randomized SVD, the algorithm. Draw an $n \times \ell$ standard Gaussian matrix $\Omega$. Form $Y_0 = A\Omega$ and compute its QR factorization $Y_0 = Q_0 R_0$. For $j = 1, 2, \ldots, q$: form $\tilde{Y}_j = A^* Q_{j-1}$ and compute its QR factorization $\tilde{Y}_j = \tilde{Q}_j \tilde{R}_j$; form $Y_j = A\tilde{Q}_j$ and compute its QR factorization $Y_j = Q_j R_j$. Set $Q = Q_q$. Apply stage B: $B := Q^T A$; $B^T = \tilde{Q}R = \tilde{Q}(\hat{V}S\hat{U}^T)$; $U := Q\hat{U}$ (a sketch follows below).
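
A compact numpy sketch of the algorithm above, stages A and B included; the function name, test rank, and iteration count are choices made here for illustration:

```python
import numpy as np

def randomized_svd(A, ell, q=2):
    # Stage A: capture the range of A with a Gaussian sketch plus q
    # rounds of subspace iteration.
    rng = np.random.default_rng(0)
    Omega = rng.normal(size=(A.shape[1], ell))   # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)               # Y0 = A Ω = Q0 R0
    for _ in range(q):
        Qt, _ = np.linalg.qr(A.T @ Q)            # Ỹj = A* Q_{j-1} = Q̃j R̃j
        Q, _ = np.linalg.qr(A @ Qt)              # Yj = A Q̃j = Qj Rj
    # Stage B: the QR of Bᵀ keeps the dense SVD at size ℓ × ℓ.
    B = Q.T @ A
    Qb, R = np.linalg.qr(B.T)                    # Bᵀ = Q̃ R
    Uh, s, Vh_t = np.linalg.svd(R.T)             # Rᵀ = Û S V̂ᵀ
    return Q @ Uh, s, Qb @ Vh_t.T                # U = Q Û, Σ = S, V = Q̃ V̂

A = np.random.default_rng(1).normal(size=(2000, 800))
U, s, V = randomized_svd(A, ell=50)
print(np.linalg.norm(A - U @ np.diag(s) @ V.T) / np.linalg.norm(A))
```
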
  14. Randomized decomposition, a variant. Draw an $n \times \ell$ standard Gaussian matrix $\Omega$. Form $Y_0 = A\Omega$ and compute its QR factorization $Y_0 = Q_0 R_0$. For $j = 1, 2, \ldots, q$: normalize the rows of $Q_{j-1}$; form $\tilde{Y}_j = A^* Q_{j-1}$ and compute its QR factorization $\tilde{Y}_j = \tilde{Q}_j \tilde{R}_j$; normalize the rows of $\tilde{Q}_j$; form $Y_j = A \tilde{Q}_j$ and compute its QR factorization $Y_j = Q_j R_j$. Set $Q = Q_q$ and skip stage B (sketch below).
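
The same skeleton with the two row normalizations added and stage B dropped; this is my reading of the slide, with the final $Q$ used directly as the embedding matrix:

```python
import numpy as np

def randomized_embedding(A, ell, q=2):
    rng = np.random.default_rng(0)
    Omega = rng.normal(size=(A.shape[1], ell))
    Q, _ = np.linalg.qr(A @ Omega)
    for _ in range(q):
        Q /= np.linalg.norm(Q, axis=1, keepdims=True) + 1e-12    # normalize rows of Q_{j-1}
        Qt, _ = np.linalg.qr(A.T @ Q)
        Qt /= np.linalg.norm(Qt, axis=1, keepdims=True) + 1e-12  # normalize rows of Q̃j
        Q, _ = np.linalg.qr(A @ Qt)
    return Q    # Q = Qq; stage B is skipped
```
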
  15. Matrix factorization vs. Word2Vec. "For a negative-sampling value of k = 1, the Skip-Gram objective is factorizing a word-context matrix in which the association between a word and its context is measured by $f(w, c) = \mathrm{PMI}(w, c)$" (Levy and Goldberg 2014). We approximate Skip-Gram by factorizing a PMI matrix with $P = A^* A \in \mathbb{R}^{m \times m}$ and $\mathrm{PMI}_{i,j} := \log \frac{P_{i,j} \sum_{i',j'} P_{i',j'}}{\sum_{j'} P_{i,j'} \sum_{i'} P_{j,i'}}$ (sketch below).
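
A sketch of this PMI construction, assuming a dense matrix for readability (at $m \approx 10^7$ the matrix would have to stay sparse and implicit); the `eps` smoothing against $\log 0$ is an addition of mine, not from the slides:

```python
import numpy as np

def pmi_matrix(A, eps=1e-12):
    P = A.T @ A                          # P = A* A, m × m co-occurrence counts
    total = P.sum()                      # Σ_{i',j'} P[i', j']
    row = P.sum(axis=1, keepdims=True)   # Σ_{j'} P[i, j']
    col = P.sum(axis=0, keepdims=True)   # Σ_{i'} P[i', j]
    return np.log((P * total + eps) / (row * col + eps))
```

The resulting matrix can then be fed to a factorization such as the randomized decomposition above.
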
  16. Approximate nearest neighbors. Project the user into the embedding space, then recommend the top-k products nearest to the user in the product space (see the sketch below). Problem: if the different catalogs are not aligned, the nearest neighbors are almost always the same.
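
A brute-force sketch of this retrieval step; at the stated scale the exact scan would be replaced by an approximate index, and the function name is invented here:

```python
import numpy as np

def recommend(user_vec, product_embeddings, k=10):
    # Cosine similarity between one user and every product, then top-k.
    p = product_embeddings / np.linalg.norm(product_embeddings, axis=1, keepdims=True)
    u = user_vec / np.linalg.norm(user_vec)
    return np.argsort(-(p @ u))[:k]      # indices of the k nearest products
```
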
  17. Open questions. Pb 1: popularity biases. E.g., recommending high-frequency items is a strong baseline strategy => fairness and diversity issues: high-frequency users, big vs. small advertisers, ...
  18. Open questions. Pb 2: the organic traffic bias. Metric: predict the next item?
  19. Open questions. Pb 2, continued: we actually want to predict incremental sales. What if we had not recommended this product: would the user still have bought it? Idea: learn embeddings that optimize individual treatment effects (Bonner and Vasile 2018).
  20. Open questions. Pb 2: a simulation environment, RecoGym (Rohde et al. 2018): https://github.com/criteo-research/reco-gym
  21. Open questions. Pb 3: the unbounded number of products. Large-scale neural networks, the variational auto-encoder example: "[Use a] function $f_\theta(\cdot) \in \mathbb{R}^I$ to produce a probability distribution over $I$ items $\pi(z_u)$" (Liang et al. 2018). What if $I = 10^7$, $10^9$? (See the back-of-the-envelope sketch below.)
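
A back-of-the-envelope check of why this breaks at scale: the dense softmax output layer alone needs $k \times I$ weights. The latent dimension $k = 200$ is an assumed, illustrative value:

```python
k = 200                                   # assumed latent dimension
for I in (10**5, 10**7, 10**9):
    params = k * I                        # weights in the output layer alone
    print(f"I = {I:.0e}: {params:.1e} weights, ~{4 * params / 1e9:.1f} GB in float32")
```
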
  22. Open questions. Pb 3: the unbounded number of products. Idea: use a group-testing scheme with a binary $p \times m$ matrix $H$, $h(y) = H \vee y$ => work as if with $p$ pseudo-items (sketch below). "Theorem: Suppose we wish to recover a $k$-sparse binary vector $y \in \mathbb{R}^m$. A random binary $\{0, 1\}$ matrix $A$ where each entry is 1 with probability $\rho = 1/k$ recovers a $1 - \varepsilon$ proportion of the support of $y$ correctly with high probability, for any $\varepsilon > 0$, with $p = O(k \log m)$. This matrix will also detect $e = \Omega(p)$ errors." (Ubaru and Mazumdar 2017). Question: can we do better knowing that the item frequency follows a power law?
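
A sketch of the reduction, reading $h(y) = H \vee y$ as a boolean OR of the columns of $H$ selected by $y$; the constant in $p = O(k \log m)$ is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 100_000, 10                        # items, sparsity of y
p = int(np.ceil(2 * k * np.log(m)))       # p = O(k log m)

H = rng.random((p, m)) < 1.0 / k          # binary matrix, entries 1 w.p. ρ = 1/k

y = np.zeros(m, dtype=bool)               # k-sparse label vector
y[rng.choice(m, size=k, replace=False)] = True

z = H[:, y].any(axis=1)                   # h(y) = H ∨ y
print(f"{m} items reduced to {p} pseudo-items")
```
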
  23. Thanks! Questions? Reach out to me at am.tousch@criteo.com or on Twitter @amy8492.
  24. References.
Bonner, Stephen, and Flavian Vasile. "Causal embeddings for recommendation." In: Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 2018, pp. 104-112.
Halko, Nathan, Per-Gunnar Martinsson, and Joel A. Tropp. "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions." SIAM Review 53.2 (2011), pp. 217-288.
Levy, Omer, and Yoav Goldberg. "Neural word embedding as implicit matrix factorization." In: Advances in Neural Information Processing Systems, 2014, pp. 2177-2185.
Liang, Dawen, et al. "Variational autoencoders for collaborative filtering." In: Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 2018, pp. 689-698.
Rohde, David, et al. "RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising." arXiv preprint arXiv:1808.00720 (2018).
Ubaru, Shashanka, and Arya Mazumdar. "Multilabel classification with group testing and codes." In: Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 3492-3501.
