This document discusses the key considerations and challenges in productionizing recommender systems. It outlines the full lifecycle, from scoping a recommender system project through deployment and continuous monitoring. The main points covered include: defining requirements and key metrics; preparing data through feature engineering, cleaning and transformation; selecting and evaluating recommendation models both offline and through A/B testing; making deployments robust and scalable while keeping technical debt in check; and continuously monitoring systems in production for data or algorithmic drift.
6. Classical recommendation model
Three types of entities: Users, Items and Contexts
1. Background knowledge:
• A set of ratings – preferences
• r: Users × Items × Contexts → {1, 2, 3, 4, 5}
• A set of “features” of the Users, Items and Contexts
2. A method for predicting the function r where it is unknown (sketched below):
• r*(u, i, c) = the average of the known ratings r(u’, i, c’) over users u’ similar to u and contexts c’ similar to c
3. A method for selecting the items to recommend (choice):
• In context c, recommend to u the item i* with the largest predicted rating r*(u, i, c)
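As a concrete illustration, here is a minimal sketch of steps 2 and 3 in Python. The toy ratings, the cosine similarity over co-rated items, and the choice of k are illustrative assumptions, and context is omitted for brevity:

import numpy as np

# Toy ratings: ratings[user][item] = value on the 1-5 scale.
ratings = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i2": 2, "i3": 5},
    "u3": {"i2": 1, "i3": 4},
}

def similarity(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    a = np.array([ratings[u][i] for i in common], dtype=float)
    b = np.array([ratings[v][i] for i in common], dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(u, i, k=2):
    """r*(u, i): average the ratings of the k users most similar to u."""
    neighbours = sorted(
        (v for v in ratings if v != u and i in ratings[v]),
        key=lambda v: similarity(u, v),
        reverse=True,
    )[:k]
    if not neighbours:
        return None
    return sum(ratings[v][i] for v in neighbours) / len(neighbours)

def recommend(u, candidates):
    """Choice step: pick the item with the largest predicted rating."""
    scored = [(i, predict(u, i)) for i in candidates]
    return max((x for x in scored if x[1] is not None), key=lambda x: x[1])

print(recommend("u1", ["i3"]))  # ('i3', 4.5)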
7. The goal is to find items that the user will happily choose
10. Requirements surrounding RecSys
Sculley, David, et al. "Hidden technical debt in machine learning systems." Advances in neural information processing systems 28 (2015).
12. Scoping
1. Decide the recommendation paradigm and the degree of personalization
• Resources and product
• The type of user we are catering to
2. KPI metrics:
• Online: CVR (conversion rate), CTR (click-through rate), ad-hoc metrics
• Offline: recall, nDCG
Ricci, Francesco, Lior Rokach, and Bracha Shapira. "Recommender systems: introduction and challenges." Recommender systems handbook (2015): 1-34.
13. The life cycle of a recommender
Scoping
• Define the project
• Define KPI metrics
Data
• Features
• Type of feedback
• Cleaning
• Transformation
Modelling
• Select model
• Offline test model
• Check fairness/explainability
• UAT
Deployment
• Deploy in production
• A/B testing
• Continuous monitoring
15. Data preparation
1. Features:
• Reusable
• Transformable
• Interpretable
• Reliable
2. Type of feedback
• Implicit data is (usually):
• Denser, and available for all users
• More representative of user behavior than of user reflection
• More closely related to the final objective function
• Better correlated with A/B test results
• E.g. watching (implicit) vs. rating (explicit)
3. Cleaning (a pandas sketch follows below)
• Dropping bots
• Missing-data strategy
• Standardisation
• Bucketing
Beliakov, Gleb, Tomasa Calvo, and Simon James. "Aggregation of preferences in recommender systems." Recommender systems handbook (2011): 705-734.
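As referenced above, a sketch of the cleaning steps with pandas. The column names, the bot threshold and the age bands are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "bot7", "u3"],
    "age": [23, 23, None, 41, 35],
    "events_per_min": [2, 2, 1, 500, 3],
    "watch_time_s": [120, 95, 30, 10000, 400],
})

# Dropping bots: filter out users with implausibly high activity.
df = df[df["events_per_min"] < 100]

# Missing-data strategy: here, impute age with the median.
df["age"] = df["age"].fillna(df["age"].median())

# Standardisation: zero mean, unit variance for watch time.
df["watch_time_z"] = (
    (df["watch_time_s"] - df["watch_time_s"].mean()) / df["watch_time_s"].std()
)

# Bucketing: discretise age into coarse bands.
df["age_bucket"] = pd.cut(df["age"], bins=[0, 18, 30, 50, 120],
                          labels=["<18", "18-30", "30-50", "50+"])
print(df)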
16. Modelling
Latency is key!
18. Collaborative Filtering
Coba, L., Rook, L., Zanker, M., & Symeonidis, P. (2019, March). Decision making strategies differ in the presence of collaborative explanations: two conjoint studies. IUI 2019.
19. Content-based
Dominguez, V., Messina, P., Donoso-Guzmán, I., & Parra, D. (2019, March). The effect of explanations and algorithmic accuracy on visual recommender systems of artistic images. IUI 2019.
21. Selecting the model
• There is no single winner
• Usually, models are ensembled
• Multi-stage architectures are common
• Live predictions or in batches
• DNNs are not always deployable (cost-benefit tradeoff and latency)
22. A non-personalized recommender
• We can use a hybrid recommender: collaborative filtering + content-based
• Collaborative filtering suffers from cold start, but it performs better
• If the catalogue doesn’t change often, we can pre-compute interactions (sketched below)
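A minimal sketch of the pre-computation idea from the last bullet: build an item-item cosine-similarity matrix offline once, so that serving “more like this” becomes a cheap lookup. The toy interaction matrix is illustrative:

import numpy as np

# Rows = users, columns = items; implicit interactions (1 = watched).
interactions = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

# Offline: cosine similarity between item columns, computed once.
norms = np.linalg.norm(interactions, axis=0, keepdims=True)
item_sim = (interactions.T @ interactions) / (norms.T @ norms)
np.fill_diagonal(item_sim, 0.0)

# Online: "more like item 0" is now just a sorted lookup.
print(np.argsort(item_sim[0])[::-1][:2])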
24. Non-personalized recommender
Request flow: edge device → API call → ANN search → recommendations
Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point https://github.com/spotify/annoy
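A short example of how Annoy could back the ANN search step above; the dimensionality and the random vectors are placeholders for real item embeddings:

from annoy import AnnoyIndex
import random

f = 32  # embedding dimensionality
index = AnnoyIndex(f, "angular")

# Offline: add one vector per catalogue item and build the trees.
for item_id in range(1000):
    index.add_item(item_id, [random.gauss(0, 1) for _ in range(f)])
index.build(10)  # 10 trees
index.save("items.ann")

# Online: the API call maps to a nearest-neighbour query.
index2 = AnnoyIndex(f, "angular")
index2.load("items.ann")  # memory-mapped, so loading is fast
print(index2.get_nns_by_item(0, 10))  # 10 items most similar to item 0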
25. Scalable personalized recommender
Covington, Paul, Jay Adams, and Emre Sargin. "Deep neural networks for youtube recommendations." Proceedings of the 10th ACM conference on recommender systems. 2016.
26. Scalable personalized recommender
Amatriain, Xavier. “Blueprints for recommender system architectures: 10th anniversary edition.” https://amatriain.net/blog/RecsysArchitectures
27. Modelling
• Latency is key!
• Choose the right architecture based on the situation
• Pick the right metrics (reference implementations below):
• recall for retrieval
• nDCG@K or MRR@K for ranking, or precision if ranking order is irrelevant
• Go beyond traditional metrics: measure diversity and novelty, and check whether the models are biased
• UAT
• Define data contracts
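Minimal reference implementations of two of the metrics above, assuming binary relevance; the item IDs and the cut-off are illustrative:

import numpy as np

def recall_at_k(ranked_items, relevant, k):
    """Fraction of the relevant items that appear in the top k (assumes
    at least one relevant item)."""
    hits = len(set(ranked_items[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(ranked_items, relevant, k):
    """Binary-relevance nDCG: DCG of this ranking over the ideal DCG."""
    gains = [1.0 if item in relevant else 0.0 for item in ranked_items[:k]]
    dcg = sum(g / np.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(rank + 2)
                for rank in range(min(len(relevant), k)))
    return dcg / ideal

ranked = ["i3", "i7", "i1", "i9"]
relevant = {"i1", "i3"}
print(recall_at_k(ranked, relevant, k=3))  # 1.0 — both hits in the top 3
print(ndcg_at_k(ranked, relevant, k=3))    # ~0.92 — a miss at rank 2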
33. Tech debt sources
• Glue code
• Pipeline Jungles
• Dead Experimental Codepaths
• Plain-Old-Data Type Smell
• Multiple-Language Smell
• Prototype Smell
• Reproducibility Debt
• Cultural Debt
37. Deploying
1. Test – high code coverage
2. Test – smart deployment strategy
3. Test – A/B testing, interleaving or MAB (a bandit sketch follows below)
4. Keep monitoring (a drift-check sketch follows below)
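For step 3, a sketch of the simplest multi-armed bandit, epsilon-greedy, splitting traffic between two model variants. The click-through rates below are synthetic stand-ins for live reward:

import random

true_ctr = {"model_a": 0.030, "model_b": 0.036}  # hidden from the bandit
clicks = {m: 0 for m in true_ctr}
shows = {m: 0 for m in true_ctr}
epsilon = 0.1

for _ in range(100_000):
    if random.random() < epsilon:  # explore: pick a variant at random
        m = random.choice(list(true_ctr))
    else:                          # exploit: pick the best observed CTR
        m = max(true_ctr,
                key=lambda v: clicks[v] / shows[v] if shows[v] else 0.0)
    shows[m] += 1
    clicks[m] += random.random() < true_ctr[m]  # simulated click

for m in true_ctr:
    print(m, shows[m], clicks[m] / shows[m])  # traffic shifts to model_b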
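For step 4, a sketch of an automated drift check: compare a live feature distribution against its training baseline with a two-sample Kolmogorov–Smirnov test. The feature, the distributions and the alert threshold are all illustrative:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_watch_time = rng.gamma(shape=2.0, scale=60.0, size=10_000)  # baseline
live_watch_time = rng.gamma(shape=2.0, scale=75.0, size=2_000)    # drifted

stat, p_value = ks_2samp(train_watch_time, live_watch_time)
if p_value < 0.01:  # the alert threshold is a judgment call
    print(f"possible data drift: KS={stat:.3f}, p={p_value:.2e}")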