This document describes Netflix's use of data stratification for machine learning experimentation and model training. It discusses how Netflix uses the Boson library and its stratification API to:
1) Downsample datasets while meeting a desired distribution of users based on attributes like country, tenure, number of plays, etc.
2) Allow ML researchers to rapidly iterate on ideas by experimenting on stratified subsets of data offline before testing models online.
3) Ensure models are trained on data that sufficiently represents important user demographics through constraints placed during stratified sampling.
The API provides flexible and declarative rules for stratifying user cohorts that inform the library how to sample data rather than specifying implementation details.
2. Outline
• Iterative ML in Netflix
• Boson Library
• Data Sampling (Stratification) Needs
• API Examples
3. Everything is a Recommendation
[Image: Netflix homepage, annotated with "Title Ranking" and "Row Selection & Ordering"]
• Recommendations are driven by machine learning algorithms
• Over 80% of what members watch comes from our recommendations
4. Running Experiments
• Try an idea offline using historical data to see if it would have made better recommendations
• If it would, deploy a live A/B test to see if it performs well in production
5. ML Research Pipeline
• When experimenting offline, iterate quickly
• When ready to move to production, the transition must be smooth
• Spark + Scala!
Pipeline: Hypothesize → Prototype → Analyze → A/B Test → Productize
6. Rapid Iterations
• Developing machine learning is iterative
• Need a shortened pipeline to rapidly try ideas
• Entire pipeline must be amenable to correct experimentation
• Down-sampled data should satisfy the desired distribution for model training/testing
7. Boson Overview
• A high-level Spark/Scala API for ML exploration
• Focuses on offline feature engineering/training for both
  – Ad-hoc exploration
  – Production
• Think "Subset of SKLearn" for the Scala/JVM ecosystem
• Spark's DataFrame is the core data abstraction
8. Running Experiments
[Diagram: the offline-to-online experimentation pipeline]
• Snapshots infrastructure (runs once a day): capture facts, stratify user cohorts
• Offline feature generation: stratify training set
• Offline ML experiment: design experiment → collect label dataset → model training → compute validation metrics → model testing
• Design a new experiment to test out different ideas; promising models move to the online system for online A/B testing
9. Running Experiments
[Same pipeline diagram as slide 8, with the "Stratify User Cohorts" and "Stratify Training Set" steps highlighted]
10. Data Stratification
• A mechanism to down-sample datasets while meeting a desired distribution
• Place constraints on snapshotted users while ensuring maximal training-data yield
• Place constraints on training data to make sure a model learns enough about important/newer demographics
• Several criteria: Small Countries, Emerging Markets, New Members ...
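Conceptually, down-sampling to a desired distribution can be sketched in a few lines of Scala. This is a hypothetical illustration, not Boson's implementation; `User`, `stratifiedSample`, and the country-only stratum key are assumptions made for the example.

```scala
// Hypothetical sketch (not Boson's actual code): down-sample a user list
// so that each stratum hits a target share of the output sample.
case class User(id: Long, country: String)

def stratifiedSample(users: Seq[User],
                     targetShare: Map[String, Double],
                     sampleSize: Int): Seq[User] = {
  val byCountry = users.groupBy(_.country)
  targetShare.toSeq.flatMap { case (country, share) =>
    val want = (sampleSize * share).toInt
    // deterministic take for illustration; a real sampler would pick randomly
    byCountry.getOrElse(country, Seq.empty).take(want)
  }
}
```

Even when one country dominates the raw data (80% US below), the sample comes out at the requested 50/50 split.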
12. Sampling the Netflix Userbase
• Spark + Scala DSL with strong type safety
• Expressive power to build arbitrary user sets on demand
• Declarative API that informs the library what to do, not how
• Large set of supported user attributes to stratify by
  – E.g. Country, Tenure, Plays, Searches, etc.
14. Building Blocks
[Venn diagram: US; US && M1; M1 || Plays(1, 10); !US && Tenure(10, 100)]
• Country.US represents US users.
• Country.US && Tenure.M1 comprises US users who are also in their first membership month.
• Tenure.M1 || Plays(1, 10) comprises users who are in their first membership month OR have 1 <= number-of-plays < 10.
• !Country.US && Tenure(10, 100) comprises international user profiles with tenure in the range of 10 to 100 days.
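Rule combinators like these can be mimicked with a small Scala sketch. Everything here (`Profile`, `Rule`, `matches`) is a hypothetical stand-in, not Boson's real API; it only shows how `&&`, `||`, and `!` compose into arbitrary user-set predicates with strong typing.

```scala
// Hypothetical sketch of a declarative cohort-rule DSL like the one above.
case class Profile(country: String, tenureDays: Int, plays: Int)

// A rule is a predicate over profiles; combinators build compound rules.
final case class Rule(matches: Profile => Boolean) {
  def &&(other: Rule): Rule = Rule(p => matches(p) && other.matches(p))
  def ||(other: Rule): Rule = Rule(p => matches(p) || other.matches(p))
  def unary_! : Rule = Rule(p => !matches(p))
}

object Country {
  def apply(c: String): Rule = Rule(_.country == c)
  val US: Rule = apply("US")
}
def Tenure(lo: Int, hi: Int): Rule = Rule(p => p.tenureDays >= lo && p.tenureDays < hi)
def Plays(lo: Int, hi: Int): Rule  = Rule(p => p.plays >= lo && p.plays < hi)

// !Country.US && Tenure(10, 100): international users with 10 <= tenure < 100 days
val intlNewish: Rule = !Country.US && Tenure(10, 100)
```

Because rules are ordinary values, they can be stored in the sampling-rule maps shown on the following slide.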
18. API Example: Error margin to increase yield
new UserSet(
  samplingRules = Map(
    Tenure.M1 -> TargetPercent(10.0),
    Country.EACH && Tenure.M2 -> MaxPercent(0.2),
    …
  ),
  allowedErrorMargin = 20.0
)
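One plausible reading of allowedErrorMargin (an assumption; the slide does not define its semantics) is a relative tolerance around each target, which widens the feasible region of the underlying optimization and so increases the achievable yield:

```scala
// Hypothetical sketch: assuming the margin is relative, a 20% margin around
// TargetPercent(10.0) relaxes the hard 10% target to the band [8.0, 12.0].
def feasibleBand(targetPercent: Double, errorMarginPercent: Double): (Double, Double) = {
  val delta = targetPercent * errorMarginPercent / 100.0
  (targetPercent - delta, targetPercent + delta)
}
```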
19. Algorithm Complexity
• When there are no overlapping regions in the Venn-diagram space, closed-form expressions for the per-region sampling ratio are possible
• When the regions overlap, we formulate the optimization as a Linear Programming problem
  – The sampling rules form the constraints
  – The objective is typically to maximize the sample size
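The overlapping-region case can be written out as a small LP. This is a sketch consistent with the slide, not necessarily Netflix's exact formulation: let $n_r$ be the population of Venn-diagram region $r$, $x_r \in [0,1]$ its sampling ratio, and let rule $k$ bound the share of the sample drawn from the regions $R_k$ it covers by $[\ell_k, u_k]$.

```latex
\begin{aligned}
\text{maximize}\quad & \textstyle\sum_r n_r x_r
  && \text{(total sample size)}\\
\text{subject to}\quad
  & \textstyle\sum_{r \in R_k} n_r x_r \;\ge\; \ell_k \sum_r n_r x_r
  && \text{for each sampling rule } k,\\
  & \textstyle\sum_{r \in R_k} n_r x_r \;\le\; u_k \sum_r n_r x_r
  && \text{for each sampling rule } k,\\
  & 0 \le x_r \le 1
  && \text{for each region } r.
\end{aligned}
```

Note that the percentage constraints are stated against the sample total rather than as fixed counts; multiplying through keeps both sides linear in $x_r$, so a standard LP solver applies.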