RecSysOps: Best Practices for Operating a Large-Scale Recommender System
Ehsan (Mohammad) Saberian,
Justin Basilico
RecSys 2021
2021-09-27
@ehsan_saberian, @JustinBasilico, @NetflixResearch
Large-scale RecSys is a complex operation
RecSys environment is dynamic
It changes every second!
New members
New items
New member interests
New ML models
Library updates
New code
How to ensure that our RecSys is working correctly?
RecSysOps:
Lessons we learned while operating a large RecSys
Benefits:
Reduce firefighting time
Focus on innovation
Build trust with our stakeholders
RecSysOps Components
Detection
Prediction
Diagnosis
Resolution
Detection
Detect issues quickly
The most challenging part
There are endless potential issues
Some of them we don’t know yet!
Detection lesson 1:
Implement all the known best practices
Unit tests, integration tests
MLOps: data and metrics checks
CI/CD, regular retraining
Detection lesson 2:
Monitor end-to-end your own way
Don’t rely only on partner teams’ audits
What does correct data look like from your perspective?
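One way to make "correct data from your perspective" concrete is to write the expectations down as an executable check. The sketch below is only an illustration: the record layout (`member_id`, `ranked_item_ids`) and the three rules are assumptions, not the checks described in the talk.

```python
def check_impression(record):
    """Return a list of human-readable problems found in one logged record.

    The field names used here are hypothetical; adapt them to your own logs.
    """
    problems = []
    if not record.get("member_id"):
        problems.append("missing member_id")
    ranked = record.get("ranked_item_ids", [])
    if not ranked:
        problems.append("empty ranking")
    if len(ranked) != len(set(ranked)):
        problems.append("duplicate items in ranking")
    return problems


good = {"member_id": "m1", "ranked_item_ids": ["a", "b", "c"]}
bad = {"member_id": "m1", "ranked_item_ids": ["a", "a"]}
print(check_impression(good))
print(check_impression(bad))
```

Running such checks on your own side of the pipeline gives a signal that does not depend on upstream teams' audits.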
Detection lesson 3:
Understand your stakeholders’ concerns
Stakeholders: members and items
Detection lesson 3.1:
Every time a member plays something that is ranked low by the model, it is a potential issue
Monitor and analyze these plays
Get inspiration for future innovations
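A minimal sketch of this kind of monitor, assuming plays are available as `(member_id, item_id)` pairs and the model's ranking per member as a best-first list. The data layout and the `rank_threshold` default are hypothetical choices for illustration.

```python
def find_low_ranked_plays(plays, rankings, rank_threshold=100):
    """Flag plays whose item was ranked below rank_threshold, or not ranked at all.

    plays: iterable of (member_id, item_id) pairs.
    rankings: dict mapping member_id to a list of item_ids, best first.
    """
    flagged = []
    for member_id, item_id in plays:
        ranking = rankings.get(member_id, [])
        if item_id in ranking:
            rank = ranking.index(item_id) + 1  # 1-based position
        else:
            rank = None  # the model never surfaced this item
        if rank is None or rank > rank_threshold:
            flagged.append({"member_id": member_id, "item_id": item_id, "rank": rank})
    return flagged


rankings = {"m1": ["a", "b", "c"]}
plays = [("m1", "a"), ("m1", "c"), ("m1", "z")]
flagged = find_low_ranked_plays(plays, rankings, rank_threshold=2)
```

Each flagged play is a candidate for manual analysis: it may reveal a bug, or simply a member interest the model does not capture yet.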
Detection lesson 3.2:
Engage with teams responsible for items and understand their concerns
Is an item cold-started properly?
Is production bias hurting an item?
Build tools to detect their concerns and integrate them into your system
Prediction
Can you predict issues before they happen?
Netflix case:
Is it possible to predict whether an item is going to cold-start properly 7 days before its launch date?
Yes, we can train a model to predict the production model’s behaviour on the day of launch
Flag any item with an unexpected prediction and investigate
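The flagging step could look roughly like the sketch below. It assumes a separate proxy model has already produced a `predicted_score` for each upcoming item; that field, the `min_expected_score` threshold, and the item layout are all hypothetical, and the talk does not say how "unexpected" is defined.

```python
from datetime import date


def flag_upcoming_items(items, today, horizon_days=7, min_expected_score=0.2):
    """Return ids of items launching within horizon_days whose predicted
    launch-day score falls below min_expected_score."""
    flagged = []
    for item in items:
        days_to_launch = (item["launch_date"] - today).days
        launching_soon = 0 <= days_to_launch <= horizon_days
        if launching_soon and item["predicted_score"] < min_expected_score:
            flagged.append(item["item_id"])
    return flagged


today = date(2021, 9, 20)
items = [
    {"item_id": "a", "launch_date": date(2021, 9, 27), "predicted_score": 0.05},
    {"item_id": "b", "launch_date": date(2021, 9, 25), "predicted_score": 0.60},
    {"item_id": "c", "launch_date": date(2021, 10, 30), "predicted_score": 0.01},
]
to_investigate = flag_upcoming_items(items, today)
```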
Prediction Lesson:
Try to predict issues before they happen
Diagnosis
Step 1:
Reproduce the issue in isolation
Needs sufficiently advanced logging
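As a rough illustration of what "enough logging to reproduce" can mean, the sketch below stores the inputs, model version, and outputs of one scoring call, then replays the call offline. The event fields and `toy_model` are assumptions made for the example, not the logging format used in production.

```python
import json


def log_scoring_event(member_id, model_version, features, scores):
    """Serialize everything needed to re-run one scoring call later."""
    return json.dumps(
        {
            "member_id": member_id,
            "model_version": model_version,
            "features": features,
            "scores": scores,
        },
        sort_keys=True,
    )


def replay_scoring_event(logged_line, score_fn):
    """Re-run score_fn on the logged features and compare with the logged scores."""
    event = json.loads(logged_line)
    replayed = score_fn(event["features"])
    return replayed == event["scores"], replayed


def toy_model(features):
    """Toy stand-in for a production model: one score per item."""
    return {item_id: 2.0 * value for item_id, value in features.items()}


features = {"item_a": 0.25, "item_b": 0.5}
line = log_scoring_event("member_1", "v1", features, toy_model(features))
reproduced, _ = replay_scoring_event(line, toy_model)
```

If the replay matches the logged scores, the issue can be studied offline with the same inputs; if it does not, the mismatch itself narrows down where to look.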
Step 2:
Input data issue or model issue?
Step 2.1:
Input data issue?
Are input values right?
Trick: use similar items or members to estimate the range of typical values
Example: the language of an item was set up incorrectly
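The "similar items" trick can be sketched for a numeric feature as follows. The choice of median plus or minus `k` times the median absolute deviation is an assumption picked for robustness to outliers; the talk does not specify how the typical range is estimated, and a categorical feature such as language would need a different comparison.

```python
import statistics


def typical_range(similar_values, k=3.0):
    """Estimate a typical range as median +/- k * MAD of the similar items' values.

    Note: if all similar values are identical, the MAD is 0 and the range
    collapses to a single point.
    """
    med = statistics.median(similar_values)
    mad = statistics.median([abs(v - med) for v in similar_values])
    return med - k * mad, med + k * mad


def is_suspicious(value, similar_values, k=3.0):
    """True when value falls outside the range estimated from similar items."""
    low, high = typical_range(similar_values, k)
    return not (low <= value <= high)


# Hypothetical feature values (e.g. runtime in minutes) for similar items.
similar = [100, 105, 98, 110, 102]
print(typical_range(similar))
print(is_suspicious(104, similar), is_suspicious(5, similar))
```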
Step 2.2:
Model issue?
Need to inspect/interpret the model
There are many tools: SHAP, LIME, ...
Example: missing values were handled incorrectly
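To show the idea of model inspection without depending on a specific library, here is a simple leave-one-feature-out ablation. It is not SHAP or LIME, only a much cruder stand-in: each feature is reset to a baseline value and the change in the prediction is recorded. The `predict` function, feature names, and baseline are hypothetical.

```python
def ablation_attributions(predict, features, baseline):
    """For each feature, measure how much the prediction changes when that
    feature alone is replaced by its baseline value."""
    full_prediction = predict(features)
    attributions = {}
    for name in features:
        perturbed = dict(features)
        perturbed[name] = baseline[name]
        attributions[name] = full_prediction - predict(perturbed)
    return attributions


def toy_predict(f):
    """Toy linear model standing in for the real one."""
    return 2 * f["x"] + 3 * f["y"]


attributions = ablation_attributions(
    toy_predict, features={"x": 1, "y": 2}, baseline={"x": 0, "y": 0}
)
```

A feature with a surprisingly large or small attribution is a natural place to look for problems such as mishandled missing values.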
Diagnosis lessons:
Set up logging to reproduce issue
Develop tools to check validity of inputs
Develop tools to inspect models
Resolution
It’s like software engineering
Hotfixes and long-term solutions
Hotfixes in ML are challenging!
Models are highly optimized, and a hotfix modification will lead to suboptimality
Resolution lesson 1:
Have a collection of hotfixes ready
Understand their costs and trade-offs
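A collection of hotfixes could be as simple as a set of small, composable post-processing steps applied to the model's output. The two hotfixes below (blocking and demoting items) are hypothetical examples, not the ones used at Netflix; each trades some ranking quality for a fast, predictable fix.

```python
def block_items(blocked):
    """Hotfix: drop the given items from the ranking entirely."""
    def apply(ranking):
        return [item for item in ranking if item not in blocked]
    return apply


def demote_items(demoted):
    """Hotfix: move the given items to the end, keeping relative order."""
    def apply(ranking):
        kept = [item for item in ranking if item not in demoted]
        moved = [item for item in ranking if item in demoted]
        return kept + moved
    return apply


def apply_hotfixes(ranking, hotfixes):
    """Apply hotfixes in order to the ranking produced by the model."""
    for fix in hotfixes:
        ranking = fix(ranking)
    return ranking


fixed = apply_hotfixes(
    ["a", "b", "c", "d"], [block_items({"b"}), demote_items({"a"})]
)
```

Because these steps sit outside the model, they can be switched on or off without retraining, at the cost of overriding what the model was optimized for.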
Resolution lesson 2:
With every issue, make RecSysOps better
Final lesson:
Make RecSysOps frictionless
Run checks on a regular basis
If human judgment is needed, make all required information ready
Be able to deploy hotfixes with a couple of clicks
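One possible shape for "run checks regularly and have the information ready" is a small runner that executes every registered check and collects the findings in one report. The check names and the convention that a check returns a list of problems are assumptions for this sketch; scheduling is left to whatever job system is already in use.

```python
def run_checks(checks):
    """Run every check and gather results so a human can review them in one place.

    checks: dict mapping a check name to a zero-argument function that
    returns a list of problem descriptions (empty when all is well).
    """
    report = []
    for name, check in checks.items():
        try:
            problems = check()
        except Exception as exc:  # a crashing check is itself a finding
            problems = [f"check crashed: {exc!r}"]
        report.append({"check": name, "ok": not problems, "problems": problems})
    return report


checks = {
    "rankings_not_empty": lambda: [],
    "new_items_cold_started": lambda: ["item_42 has no impressions"],
}
report = run_checks(checks)
```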
Questions?
Ehsan Saberian
@ehsan_saberian
Justin Basilico
@JustinBasilico
@NetflixResearch
