
Boston ML - Architecting Recommender Systems


June 2018 talk on Architecting Recommender Systems by James Kirk at Spotify


  1. 1. Boston Machine Learning Architecting Recommender Systems Algorithm design, user experience, and system architecture June 2018 James Kirk
  2. 2. Table of contents. Anatomy of Recommender Systems (slides 3-19): system components and terminology. Designing Recommender Systems (slides 20-31): design considerations and frameworks. Example Recommender Systems (slides 32-40): real-world recommender systems and their architectures. Tools for Recommender Systems (slides 41-53): tools for building systems quickly. Evaluating Recommender Systems (slides 54-58): what makes a good recommender system? What We Missed (slides 59-63): other subjects in recommender systems.
  3. 3. Anatomy of Recommender Systems
  4. 4. Recommendation vs Personalization. Recommendation: a recommendation system presents items to users in a relevant way; the definition of relevant is product/context-specific. Personalization: a personalization system presents recommendations in a way that is relevant to the individual user; the user expects their experience to change based on their interactions with the system. Relevance can still be product/context-specific.
  5. 5. Example: Recommendation
  6. 6. Example: Personalization
  7. 7. Users vs Items. Users: a user in a recommender system is the party that is receiving and acting on the recommendations; sometimes the user is the context, not an actual person. Items: an item in a recommender system is the passive party that is being recommended to the users. The line between these two can be blurry.
  8. 8. Example: Consultant Matchmaking (Hypothetical). Rec Sys #1: Users = Consultants*, Items = Projects; recommend projects for the consultant to bid on. Rec Sys #2: Users = Projects, Items = Consultants; recommend the right consultant for the project. Rec Sys #3: Users = Enterprises*, Items = Consultants; recommend consultants for relationship building. (*Personalized)
  9. 9. Interactions. Positive: hearts, stars, likes, listens, watches, follows, bids, purchases, hires, reads, views, upvotes… ❤ Negative: bans, skips, angry-face-reacts, 1-star reviews, rejections, unfollows, returns, downvotes… Explicit vs Implicit: explicit actions are those that a user expects or intends to impact their personalized experience; implicit actions are all other interactions between users and items.
  10. 10. Interactions. [Figure: an interactions matrix, one row per user (User 1-4) and one column per item (Item 1-6), holding the interaction values.]
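A minimal sketch of how such a matrix might be assembled with scipy; the events and dimensions here are hypothetical:

```python
import numpy as np
from scipy import sparse

# Hypothetical logged events: (user_id, item_id, value). Positive values for
# likes/purchases; negative values (if the system allows them) for skips/bans.
events = [(0, 2, 1.0), (0, 5, 1.0), (1, 2, 1.0), (3, 0, -1.0)]

n_users, n_items = 4, 6
rows, cols, vals = zip(*events)

# Interaction matrices are overwhelmingly sparse in practice, so a sparse
# format (CSR here) is the natural representation.
interactions = sparse.csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))
print(interactions.toarray())
```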
  11. 11. User/Item Features. Indicator Features: a feature that is unique to every user/item to allow for direct personalization. These features allow recommender systems to learn about every user individually without being diluted through metadata; often one-hot encoded user IDs, or just an identity matrix. Metadata Features: age, location, language, tags, labels, word counts, pre-learned embeddings… Everything that is known about a user/item before training can be a feature if properly structured. Should it be? Often called “side input” or “shared features.”
  12. 12. User/Item Features. [Figure: a feature matrix of shape [n_users x n_user_features] (or [n_items x n_item_features]), one row per user (User 1-6), combining indicator features with metadata features such as encoded labels/tags/etc.]
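One way to build a feature matrix like this: stack an identity matrix (the indicator features) with metadata columns. The metadata values below are hypothetical:

```python
import numpy as np
from scipy import sparse

n_users = 4

# Indicator features: one column per user, i.e. an identity matrix.
indicator = sparse.identity(n_users, format="csr")

# Metadata features: e.g. one-hot encoded location plus a numeric age column.
location_onehot = sparse.csr_matrix(
    np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float))
age = sparse.csr_matrix(np.array([[25], [31], [40], [19]], dtype=float))

# Final user-feature matrix: [n_users x n_user_features].
user_features = sparse.hstack([indicator, location_onehot, age], format="csr")
print(user_features.shape)  # (4, 4 + 2 + 1)
```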
  13. 13. Representation Functions. Representation: a (typically) low-dimensional vector that encodes the feature information about the user or item; often called “embedding,” “latent user/item,” or “latent representation.” Representation size, which is the dimension of the latent space, is often referred to as “components.” Representation Function: the process that converts user/item features into representations. Learning happens here. Common examples: 1. Matrix factorization 2. Linear kernels 3. Deep nets 4. Word2Vec 5. Autoencoders 6. None! (Pass-through)
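As an illustration of the simplest case, a linear representation function is a single weight matrix: representation = features @ W. With pure indicator features this reduces to classic matrix factorization, where each user row selects its own embedding. A sketch (the weights are shown random rather than learned):

```python
import numpy as np

n_users, n_user_features, n_components = 4, 7, 2

# W would be learned during training; random here for illustration only.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n_user_features, n_components))

user_features = np.eye(n_users, n_user_features)  # indicator features
user_repr = user_features @ W                     # [n_users x n_components]
print(user_repr.shape)
```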
  14. 14. Representation Functions. [Illustration. Image: Eric Nyquist]
  15. 15. Prediction Functions. Prediction: a prediction from a recommender system is an estimate of an item’s relevance to the user; predictions can be ranked for relevance. The predictions are an indirect approximation of the interactions. Prediction Function: the process that converts user/item representations into predictions. Common examples: 1. Dot product 2. Cosine similarity/distance 3. Euclidean similarity/distance 4. Manhattan similarity/distance* Some systems use deep nets for prediction, and this can be an assumption-breaker. *Actually, Manhattan is rare
  16. 16. Prediction Functions. [Figure: user and item vectors in a 2-component latent representation space (2-dimensional), separated by angle Θ and euclidean distance δ.] Common examples: 1. Dot product = User · Item 2. Cosine similarity = cos(Θ) 3. Euclidean similarity* = -1 · δ 4. Manhattan similarity = -1 · |User - Item| *There are many methods for expressing euclidean similarity
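All four prediction functions are one-liners in numpy; a sketch with hypothetical 2-component representations:

```python
import numpy as np

user = np.array([0.5, 1.2])   # 2-component user representation
item = np.array([1.0, 0.8])   # 2-component item representation

dot = user @ item                                              # relevance + magnitude
cosine = dot / (np.linalg.norm(user) * np.linalg.norm(item))   # angle only
euclidean = -np.linalg.norm(user - item)   # one way to express euclidean similarity
manhattan = -np.abs(user - item).sum()     # rarely used in practice

print(dot, cosine, euclidean, manhattan)
```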
  17. 17. Loss and Learning. Loss Function: the process that converts predictions and interactions into error for learning. Common examples: 1. Root-mean-square error (RMSE) 2. Kullback-Leibler divergence (KLD) 3. Alternating least squares* (ALS) 4. Bayesian personalized ranking* (BPR) 5. Weighted approximately ranked pairwise (WARP) 6. Weighted margin-rank batch (WMRB) *These are both a loss and a representation function. Learning-to-rank: some loss functions learn to approximate the values in the interactions matrix; other loss functions learn to uprank positive interactions and downrank negative interactions (and/or non-interacted items) for that user. This second category of loss functions is called learning-to-rank.
  18. 18. [Figure: system data flow. Input data: user features, item features, and interactions. The user/item representation functions convert user/item features into user/item representations; the prediction function converts those representations into predicted scores and predicted ranks (output data); in training, the loss function compares predictions against interactions to produce the training loss.]
  19. 19. The whole system, written as equations: Y = p(r(X_user), r(X_item)); Ɛ = s(Y, N). Legend: Y = prediction, p = prediction function, r = representation function, X = features, N = interactions, s = loss function, Ɛ = loss.
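Putting slides 18-19 together, here is a toy end-to-end pass, assuming linear representations, dot-product prediction, and RMSE loss (all data random for illustration):

```python
import numpy as np

def representation(features, weights):
    # Linear representation function r (could be swapped for a deep net).
    return features @ weights

def prediction(user_repr, item_repr):
    # Dot-product prediction function p: one score per (user, item) pair.
    return user_repr @ item_repr.T

def loss(predictions, interactions):
    # RMSE loss function s: error between predictions Y and interactions N.
    return np.sqrt(np.mean((predictions - interactions) ** 2))

rng = np.random.default_rng(0)
X_user, X_item = np.eye(3), np.eye(4)              # indicator features
W_user = rng.normal(size=(3, 2))                   # learned in training
W_item = rng.normal(size=(4, 2))
N = rng.integers(0, 2, size=(3, 4)).astype(float)  # interactions

# Y = p(r(X_user), r(X_item));  Ɛ = s(Y, N)
Y = prediction(representation(X_user, W_user), representation(X_item, W_item))
print(loss(Y, N))
```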
  20. 20. Designing Recommender Systems
  21. 21. Interactions: what are our interaction values? We must select interaction values based on what data is available, how meaningful that data is, and how it interacts with the rest of the system. Considerations: ❏ What user behaviors do our interactions represent? ❏ Explicit vs implicit? ❏ Do we allow for negative interactions? ❏ How dense are our interactions? ❏ Can our recommender handle these interactions? Features: what are our user/item features? We must select user/item features from the data available, ensure that the data is meaningful to the recommender system, and ensure that our use of this data is appropriate. Considerations: ❏ Do we use indicator features? ❏ What useful metadata is available? ❏ Does the metadata require feature engineering? ❏ Do users expect this metadata to impact their recommendations? Learning: how does our system learn? We must select representation functions that are appropriate for our features, as well as a prediction function and loss function that will learn effectively from this data. Considerations: ❏ What representation functions will best encode the user/item features? ❏ What prediction function will best estimate relevance? ❏ What loss function will learn from our data most effectively? ❏ Do these choices scale?
  22. 22. What are our interaction values? What user behaviors do our interactions represent? Interaction values should be an approximation of the intended effect of the recommender system on user behavior: if we want people to purchase, our interactions should be related to purchases; if we want people to binge episodes of shows for longer, our interactions should be related to the act of binging. Explicit vs Implicit: when the user gave you this signal, did they intend/expect it to alter their recommendations? Some explicit signals don’t work well as interactions. Negative explicit signals should be handled with simple product logic. “You might give five stars to Hotel Rwanda and two stars to Captain America, but you’re much more likely to watch Captain America.” -Todd Yellin, Netflix, You May Also Like
  23. 23. What are our interaction values? Explicit vs Implicit Does the user know we are using this signal for recommendation? Does the user care we are using this signal for recommendation? Is it ethical for us to use this signal for recommendation?
  24. 24. What are our interaction values? Do we allow negative interactions? Negative interactions can be valuable statements of what content to avoid. Negative interactions can be confusing when learning-to-rank. Not all loss functions accommodate negative interactions. Which ordering is better? Confusing? Three candidate rankings, from top (rank 1) to bottom (rank 9):

      Rank  Ordering A  Ordering B  Ordering C
      1     Positive    Positive    Positive
      2     Positive    Positive    Positive
      3     No-int      Negative    No-int
      4     No-int      Negative    Negative
      5     No-int      Negative    No-int
      6     No-int      No-int      Negative
      7     Negative    No-int      No-int
      8     Negative    No-int      Negative
      9     Negative    No-int      No-int
  25. 25. What are our user/item features? Do we use indicator features? Indicator features allow for powerful personalization but are as numerous as our users/items. Recommenders with user indicators cannot effectively make recommendations for new users* (the cold-start problem). Many users means many indicator features, which may not scale. *Vice-versa is true for new items. What useful metadata is available? What user/item metadata do we have that is relevant? Metadata that is useful but missing can be requested from users, crowd-sourced, or inferred with other ML systems.
  26. 26. Does the metadata require feature engineering? Pre-processing features can improve recommender learning. Some features may be useless/misleading without feature engineering. The choice of representation function impacts the usefulness of feature engineering. What are our user/item features? Do users expect this metadata to impact their recommendations? Is the use of this metadata ethical*? Users can be surprised when changing metadata impacts product experience. *There is a distinction between metadata used in training and metadata used in evaluation.
  27. 27. What representation functions will best encode the user/item features? Linear kernels are effective if all we have are indicator features or well-engineered features. (Matrix factorization) More complex relationships may lead us to neural nets. How does their architecture impact the recommender? (Use of the latent space) Can the representation be learned without interaction? (Auto-encoders, word2vec, etc) How does our system learn? What prediction function will best estimate relevance? Dot-product prediction accounts for representation relevance and magnitude. Cosine prediction optimizes for relevance but has no sense for magnitude. Euclidean prediction builds a map of items but also has no sense for magnitude. Should items be biased, given our choice?
  28. 28. What loss function will learn from our data most effectively? Do we want to estimate interactions, or perform learning-to-rank? Should the loss function accommodate negative interactions? (RMSE, KLD…) Should the loss function be sensitive to interaction magnitude? (RMSE, B-WMRB…) Tweaking the loss function can dramatically change how recommendations feel. How does our system learn? Sparse vs Dense vs Sampled Some implementations of loss functions only account for user/item pairs with interactions. These same loss functions can be written to compare every possible user/item pair. These predictions and losses are dense, and they can be expensive. Some of the most effective and efficient loss functions learn by comparing pairs with interactions against sampled pairs.* (WARP, WMRB) * There are many methods for sampling candidate pairs
  29. 29. How does our system learn? Example: WMRB. WMRB approximates a positive item’s rank against a random sample and upranks positive items through a hinge loss: rank(y | x) ≈ (|Y| / |Z|) · Σ_{y′ ∈ Z} max(0, 1 - p(x, y) + p(x, y′)). Legend: x = user, y = positive item, y′ = non-positive item, Y = all items, Z = random sample of non-positive items, p = prediction function. The max(0, ·) term is the hinge; Z provides the random sampling.
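A numpy sketch of the sampled-rank idea above for a single (user, positive item) pair. The log transform of the estimated rank is one common choice of rank penalty in WMRB-style losses, not necessarily the talk's exact formulation:

```python
import numpy as np

def wmrb_loss(p_positive, p_sampled, n_items):
    """Sampled WMRB-style loss for one (user, positive item) pair.

    p_positive: predicted score for the positive item.
    p_sampled:  predicted scores for a random sample of non-positive items.
    n_items:    |Y|, the total number of items.
    """
    # Hinge: sampled items within a margin of the positive item contribute loss.
    hinge = np.maximum(0.0, 1.0 - p_positive + p_sampled)
    # Scale the sample back up to estimate the positive item's rank over all items.
    est_rank = (n_items / len(p_sampled)) * hinge.sum()
    # Log transform: improving a badly-ranked item matters more than
    # perfecting an already well-ranked one.
    return np.log(1.0 + est_rank)

print(wmrb_loss(p_positive=0.9, p_sampled=np.array([0.2, 0.8, 1.1]), n_items=10_000))
```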
  30. 30. How does our system learn? Example: Balancing WMRB. If we notice an undue popularity bias, we can balance this by accounting for interaction magnitudes and popularity. [Equation figure: WMRB with a balancing factor.] Legend: x = user, y = positive item, X = all users, p = prediction function, n = interaction magnitude for the pair (user, item).
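One plausible shape for such a balancing factor, shown only as an illustrative assumption rather than the slide's exact formula: boost pairs with large interaction magnitude n(user, item) and discount items that are popular across all users:

```python
import numpy as np

def balancing_factor(n_user_item, item_popularity):
    # Hypothetical balancing: reward interaction magnitude, damp popularity.
    # This is an assumption for illustration, not the talk's published formula.
    return n_user_item / np.log(2.0 + item_popularity)

# The factor would multiply each pair's hinge term in the WMRB sketch above.
print(balancing_factor(n_user_item=3.0, item_popularity=500.0))
```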
  31. 31. We can think about a recommender system architecture as a set of top-level decisions. When designing recommender systems, we are evaluating the tradeoffs between these decisions and the relationships between these choices. A Framework for Recommender Systems Interactions ? User Features ? User Representation ? Item Features ? Item Representation ? Prediction ? Learning ?
  32. 32. Example Recommender Systems
  33. 33. A collaborative filter learns representations from interactions and uses these to make personalized recommendations, often through matrix factorization. Pure collaborative filters are metadata-naïve. Example: Collaborative Filter Interactions * (Positive only?) User Features Indicator User Representation Linear Item Features Indicator Item Representation Linear Prediction * (Dot-product for MF) Learning ALS, BPR, SVD, PCA, NMF...
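A pure collaborative filter fits in a few lines, here assuming truncated SVD as the factorizer and dot-product prediction, with hypothetical random interactions:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Hypothetical positive-only interactions matrix (users x items).
interactions = sparse.random(50, 200, density=0.05, random_state=0, format="csr")

# Truncated SVD factorizes the interactions directly: indicator features in,
# linear representations out. No metadata is involved.
U, s, Vt = svds(interactions, k=10)
user_repr = U * s    # [n_users x k]
item_repr = Vt.T     # [n_items x k]

# Dot-product prediction: score every item for user 0, rank descending.
scores = user_repr[0] @ item_repr.T
top_items = np.argsort(-scores)[:5]
print(top_items)
```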
  34. 34. A content-based recommender learns the item features for which a user has an affinity. Purely content-based systems do no transfer learning between users. This allows easy rec-splanation. This requires clean item metadata. Example: Content-based Recommender Interactions * User Features Indicator User Representation Linear Item Features Metadata Item Representation None (n_components = n_item_features) Prediction Dot-product Learning *
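A sketch of this setup under stated assumptions: item representations are the raw metadata (pass-through), and each user's representation is fit in item-feature space, here with ridge regression as one plausible learner:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_item_features = 20, 100, 8

item_features = rng.random((n_items, n_item_features))            # clean item metadata
interactions = (rng.random((n_users, n_items)) > 0.95).astype(float)

# Ridge regression per user: user_repr lives directly in item-feature space,
# so n_components = n_item_features and explanations are easy to read off.
F, reg = item_features, 1.0
user_repr = np.linalg.solve(F.T @ F + reg * np.eye(n_item_features),
                            F.T @ interactions.T).T   # [n_users x n_item_features]

# Dot-product against raw item metadata ("you like items tagged politics").
scores = user_repr @ item_features.T
print(scores.shape)
```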
  35. 35. A hybrid recommender system learns representations for both user and item metadata and indicators, if available. This opens a lot of options for us. Example: Hybrid Recommender System Interactions * User Features * User Representation * Item Features * Item Representation * Prediction * Learning *
  36. 36. We can build a hybrid recommender system to recommend personalized products based on past purchases. Example: Purchase Recommendations Interactions Purchases User Features Indicator User Representation Linear Item Features Indicator + Metadata Item Representation * Prediction Dot-product Learning *
  37. 37. We can use the pre-trained purchase recommender’s representations to provide recommendations in a new context. In this system, the “user” is the context item, not the person using our product. Example: “You May Also Like” (YMAL) Interactions X User Features Context Item Repr User Representation None Item Features All Item Reprs Item Representation None Prediction Dot-product, Cosine? Learning X
  38. 38. We can take the output of the YMAL recommender and re-rank the items based on the customer’s representation. This system does not learn. The learning’s already been done. Example: Personalized “You May Also Like” Interactions X User Features User Reprs User Representation None Item Features Similar Item Reprs Item Representation None Prediction Dot-product Learning X
  39. 39. Example: Personalized “You May Also Like” Purchase Recommender System “YMAL” Recommender System “YMAL” Personalization System Step 1: Learn to personalize purchasing recommendations Step 2: Use previous learning to calculate the most similar items Step 3: Personalize the similar items by re-ranking OR Contextualize purchase recommendations by limiting the item set
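A sketch of steps 2 and 3, assuming the representations from step 1 are already learned; all names are hypothetical:

```python
import numpy as np

def ymal_personalized(context_item_repr, user_repr, item_reprs, k=50, n=10):
    # Step 2: candidate generation, i.e. items most similar to the context item.
    candidate_scores = item_reprs @ context_item_repr
    candidates = np.argsort(-candidate_scores)[:k]

    # Step 3: personalization by re-ranking the candidates for this user.
    # No learning happens here; the learning was done in step 1.
    user_scores = item_reprs[candidates] @ user_repr
    return candidates[np.argsort(-user_scores)[:n]]

rng = np.random.default_rng(0)
item_reprs = rng.normal(size=(1000, 16))   # pre-learned in step 1
user_repr = rng.normal(size=16)
print(ymal_personalized(item_reprs[42], user_repr, item_reprs))
```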
  40. 40. Example: YouTube (Covington, Adams, Sargin) Interactions Watches + Searches User Features Geography, Age, Gender... User Representation Deep net Item Features Pre-learned embeddings, language, previous impressions... Item Representation Deep net Prediction Deep net Learning Sampled Cross-Entropy
  41. 41. Tools for Recommender Systems
  42. 42. Implicit Interactions * User Features Indicator User Representation Linear Item Features Indicator Item Representation Linear Prediction Dot-product Learning ALS, BPR Implicit is a Python collaborative filter toolkit that uses matrix factorization to learn representations. Includes factorization classes for ALS and BPR. Made by Ben Frederickson. MIT License
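Typical usage looks roughly like this; the fit/recommend signatures have shifted across implicit's versions, so treat the details as approximate:

```python
from scipy import sparse
from implicit.als import AlternatingLeastSquares

# Hypothetical positive-only interactions (users x items, CSR).
user_items = sparse.random(100, 500, density=0.02, format="csr")

model = AlternatingLeastSquares(factors=32, regularization=0.01, iterations=15)
model.fit(user_items)  # note: some older versions expect an items x users matrix

# Top-10 recommendations for user 0 (return format varies by version).
recommendations = model.recommend(0, user_items[0], N=10)
```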
  43. 43. Scikit-Learn Interactions * User Features Indicator User Representation Linear Item Features Indicator Item Representation Linear Prediction Dot-product Learning SVD, PCA, NMF... Scikit-learn is a Python machine learning toolkit with many tools for feature engineering and machine learning. The decomposition package contains some classes that can be used for matrix factorization recommender systems like SVD, PCA, NMF... Maintained by volunteers. BSD license
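For example, NMF from the decomposition package used as a recommender factorizer, with hypothetical interactions:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical binary interactions matrix (users x items).
rng = np.random.default_rng(0)
interactions = (rng.random((30, 80)) > 0.9).astype(float)

model = NMF(n_components=8, init="nndsvd", max_iter=500)
user_repr = model.fit_transform(interactions)   # [n_users x 8]
item_repr = model.components_.T                 # [n_items x 8]

# Dot-product predictions approximately reconstruct the interaction values.
scores = user_repr @ item_repr.T
```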
  44. 44. LightFM Interactions * User Features * User Representation Linear Item Features * Item Representation Linear Prediction Dot-product Learning Logistic, BPR, WARP LightFM is a Python hybrid recommender system that uses matrix factorization to learn representations. Made by Lyst - a fashion shopping website. Apache-2.0 license
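A minimal LightFM run; the user_features/item_features matrices that make it a hybrid are optional and omitted here for brevity:

```python
import numpy as np
from scipy import sparse
from lightfm import LightFM

# Hypothetical interactions (users x items, COO with positive entries).
interactions = sparse.random(100, 500, density=0.02, format="coo")

# WARP learning-to-rank with linear representations and dot-product prediction.
model = LightFM(no_components=32, loss="warp")
model.fit(interactions, epochs=10)  # pass user_features=/item_features= for hybrid use

# Score items 0..4 for user 3.
scores = model.predict(np.repeat(3, 5), np.arange(5))
```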
  45. 45. TensorRec is a Python hybrid recommender system framework for developing whole recommender systems quickly. Representation functions, prediction functions, and loss functions can be customized using TensorFlow or Keras. Made by James Kirk. Apache-2.0 license TensorRec Interactions * User Features * User Representation Linear, Deep nets, None... Item Features * Item Representation Linear, Deep nets, None... Prediction Dot-product, Cosine, Euclidean... Learning RMSE, KLD, WMRB... Hey, that’s me
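A minimal TensorRec run with indicator features, roughly following the project's README; the representation, prediction, and loss graphs are all pluggable, with linear representations and dot-product prediction as defaults:

```python
from scipy import sparse
from tensorrec import TensorRec

# Hypothetical interactions plus indicator features for users and items.
interactions = sparse.random(100, 500, density=0.02, format="csr")
user_features = sparse.identity(100)
item_features = sparse.identity(500)

model = TensorRec(n_components=16)
model.fit(interactions, user_features, item_features, epochs=5)

# Predicted ranks of every item for every user.
ranks = model.predict_rank(user_features=user_features, item_features=item_features)
```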
  46. 46. Annoy is a tool for fast similarity search written in C++ with Python bindings. Useful for building systems to serve recommendations from pre-learned representations. Made by Spotify. Apache-2.0 license ANNOY (Approximate Nearest Neighbors Oh Yeah) Interactions X User Features X User Representation X Item Features X Item Representation X Prediction Cosine, Euclidean, Manhattan, Hamming Learning X
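Basic usage, with hypothetical pre-learned representations:

```python
import numpy as np
from annoy import AnnoyIndex

rng = np.random.default_rng(0)
item_representations = rng.normal(size=(1000, 16))  # pre-learned elsewhere

index = AnnoyIndex(16, "angular")  # also "euclidean", "manhattan", "hamming"
for item_id, vector in enumerate(item_representations):
    index.add_item(item_id, vector)
index.build(10)  # 10 trees: more trees gives better recall, a bigger index

# Serve: nearest items to a user representation (or to another item's).
user_representation = rng.normal(size=16)
neighbor_ids = index.get_nns_by_vector(user_representation, 10)
```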
  47. 47. Faiss is a tool for fast similarity search written in C++ with Python bindings. Useful for building systems to serve recommendations from pre-learned representations. Allows item biases. Made by Facebook. BSD license FAISS (Facebook AI Similarity Search) Interactions X User Features X User Representation X Item Features X Item Representation X Prediction Dot-product, Euclidean Learning X
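The faiss equivalent, using an inner-product (dot-product) index, which is what lets item biases survive (append a bias dimension to items and a constant 1 to queries):

```python
import numpy as np
import faiss

item_reprs = np.random.default_rng(0).normal(size=(1000, 16)).astype("float32")

index = faiss.IndexFlatIP(16)  # exact inner-product search; ANN variants exist
index.add(item_reprs)

query = item_reprs[:1]                 # faiss expects a batch of query vectors
scores, ids = index.search(query, 10)  # top-10 most relevant items
```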
  48. 48. NMSLib is a tool for fast similarity search written in C++ with Python bindings. Useful for building systems to serve recommendations from pre-learned representations. Made by Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov, David Novak, Ben Frederickson. Apache-2.0 license, with some MIT and GNU components NMSLib (Non-Metric Space Library) Interactions X User Features X User Representation X Item Features X Item Representation X Prediction Cosine, Euclidean Learning X
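And the nmslib equivalent, here with an HNSW graph index over cosine similarity:

```python
import numpy as np
import nmslib

item_reprs = np.random.default_rng(0).normal(size=(1000, 16)).astype("float32")

index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(item_reprs)
index.createIndex()

# Top-10 nearest items to item 0's representation.
ids, distances = index.knnQuery(item_reprs[0], k=10)
```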
  49. 49. We can build a hybrid recommender system to recommend personalized news articles based on past reading. Requirements: 1. We have to learn the tastes of individual users. 2. We know users’ home location with low resolution (country/state). 3. Articles are ephemeral. All items are cold-start items. 4. We can vectorize article contents and tagged categories. (politics, sports…) 5. We have to serve production-scale user traffic. 6. We don’t have to do rec-splanation. Example: News Article Recommendation Interactions Clicks, page dwells... User Features Indicator + vectorized locations User Representation Linear Item Features TF-IDF of contents + vectorized categories Item Representation Deep net Prediction Cosine Learning Balanced WMRB
  50. 50. Example: News Article Recommendation Daily Model Training Scikit-learn Feature Transformation TensorRec Recommender System Annoy Ranking Step 1: Vectorize historical article contents and metadata Step 2: Use vectorized article features to learn user representations and train a deep net for article representation Step 3: Build Annoy indices
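A sketch of step 1's feature transformation with a hypothetical two-article corpus; the TensorRec and Annoy steps are noted in comments rather than run:

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical corpus: article texts and their tagged categories.
texts = ["Senate passes budget bill", "Local team wins championship"]
categories = [["politics"], ["sports"]]

# Step 1: TF-IDF of contents plus multi-hot vectorized categories, stacked.
tfidf = TfidfVectorizer(max_features=5000).fit_transform(texts)
cats = MultiLabelBinarizer().fit_transform(categories)
item_features = sparse.hstack([tfidf, sparse.csr_matrix(cats)], format="csr")

# Step 2 would fit the TensorRec model (deep-net item representation, cosine
# prediction, balanced WMRB) on these features plus click interactions;
# step 3 would load the learned item representations into Annoy indices.
```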
  51. 51. Scikit-learn Feature Transformation TensorRec Recommender System Annoy Ranking Step 1: Vectorize new article contents and metadata Step 2: Use trained deep net to calculate new article representation Step 3: Rebuild Annoy indices with the new article Example: News Article Recommendation Handling New Articles
  52. 52. Database Representation Storage Annoy Ranking Step 1: Retrieve the user representation from the database Step 2: Find most relevant articles for the user Example: News Article Recommendation Serving User Traffic
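A minimal serving sketch under these assumptions; all names are hypothetical, and repr_store could be as simple as a dict or as heavy as Redis or a SQL table in production:

```python
def recommend_articles(user_id, repr_store, annoy_index, n=10):
    # Step 1: retrieve the user's pre-computed representation from storage.
    user_repr = repr_store.get(user_id)
    if user_repr is None:
        return []  # cold-start user: fall back to a non-personalized list

    # Step 2: find the most relevant articles for the user via the Annoy index.
    return annoy_index.get_nns_by_vector(user_repr, n)
```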
  53. 53. Example: MovieLens with TensorRec Interactions Movie ratings User Features Indicator User Representation Linear Item Features Indicator + Movie Tags Item Representation Linear Prediction Dot-product Learning Balanced WMRB
  54. 54. Evaluating Recommender Systems
  55. 55. What makes a good recommender system? Offline Evaluation: many metrics are available for offline evaluation by comparing predictions against known interactions: Precision@K, Recall@K, NDCG@K… Others measure novelty, diversity, and coverage. Precision@K: “What percentage of the top K items were positively interacted?” Recall@K: “What percentage of users’ positively interacted items were in the top K results?” Offline Pitfalls: many offline metrics don’t represent fairness of performance between users or items. These metrics can be useful for hyperparameter optimization, but often fail to evaluate the “feel” of recommendations. It is hard to use offline metrics to state that one recommender system is better than another.
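Precision@K and Recall@K for a single user are a couple of lines; a sketch with hypothetical ids:

```python
def precision_recall_at_k(ranked_items, positives, k=5):
    """ranked_items: item ids in predicted order.
    positives: set of held-out positively-interacted item ids."""
    hits = sum(1 for item in ranked_items[:k] if item in positives)
    return hits / k, hits / len(positives)

p, r = precision_recall_at_k([3, 7, 1, 9, 4], positives={7, 9, 2}, k=5)
print(p, r)  # 0.4, ~0.667
```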
  56. 56. What makes a good recommender system? Example: Offline Pitfalls. Three recommendation results (top-5 lists) for two users. User 1 has 5 positive interactions; User 2 has 2 positive interactions.

      System 1: User 1 hits 4/5, User 2 hits 1/5 → P@5 = 0.5, R@5 = (0.8 + 0.5)/2 = 0.65
      System 2: User 1 hits 5/5, User 2 hits 0/5 → P@5 = 0.5, R@5 = (1.0 + 0.0)/2 = 0.5
      System 3: User 1 hits 3/5, User 2 hits 2/5 → P@5 = 0.5, R@5 = (0.6 + 1.0)/2 = 0.8

  The third recommendation system is the most broadly effective, and probably the “best.” Precision fails to identify that, but recall does. You can concoct similar pitfalls for recall or NDCG.
  57. 57. What makes a good recommender system? Online Evaluation: when rolling out a new recommender system, the truest test is an A/B test against an existing system. The most effective feedback comes from user interviewing and from monitoring the user behaviors the system is intended to drive. If there is no existing system, do phased roll-outs with quant/qual feedback.* User interviewing is the only way to evaluate the “feel” of recommendations. Feel? “I already own a crib, why would I need another?” Missing item filtering based on metadata? “These songs are excellent, but I already know these bands.” Maybe we should target discovery? “I’ve watched Captain America twenty times, but that doesn’t mean I only want to watch Marvel movies. What about the sitcoms I watch?” Maybe we’re oversimplifying the user’s representation? *Fellow employees make great, but biased, guinea pigs
  58. 58. Algorithmic Bias and Fairness. All Algorithms Are Biased: there are biases innate in the data we use, the way users interact with our products, and the way our algorithms learn. Controlling for this is not as simple as setting biased=False. When designing these systems, we have a responsibility to, at the least, understand the biases in our products. You wouldn’t ship a product without tests; you shouldn’t ship a RecSys without examining bias. Understanding Fairness: there are many definitions of fairness. Some cross-section recommender performance by user and item metadata. C-fairness: is recommendation recall significantly lower for customers in Massachusetts? P-fairness: are movies with female leads recommended less often than in the natural distribution of movie watching? Missing metadata? Crowdsource it, but be careful with sensitive metadata.
  59. 59. What We Missed
  60. 60. What We Missed. Sequence-based models: in what order do our users interact with our items? Mixture-of-tastes models: is one representation per user enough for users with diverse tastes? Rec-splanation: how do system design choices impact interpretability? Attention models: can we learn more nuance in user representations than just a vector? Graphical models: can we map relationships between users, items, and their attributes? Cold-start problems: how do we make recommendations for brand-new users?
  61. 61. Wait, is it “recommender systems” or “recommendation systems?”
  62. 62. Wait, is it “recommender systems” or “recommendation systems?” ¯\_(ツ)_/¯
  63. 63. Thank you! Questions? James Kirk @jiminy_kirket /jkirk12 @jameskirk1 /jfkirk
