Building flexible machine learning libraries adapted to Netflix’s use cases is paramount in our continued efforts to better model our users’ behavior and provide great personalized video recommendations.
This talk introduces one such Spark-based stratification library developed at Netflix to aid “Training Set Stratification” in offline machine learning workflows. Originally created to implement user selection algorithms in our data snapshotting infrastructure, the library has evolved to cater to general-purpose stratification use cases in ML pipelines. We will show how, using the stratification library’s DSL (domain-specific language) and its underlying Spark-based implementation, one can easily express complex sampling rules and dynamically carve out matching portions of a Spark DataFrame.
For example, arbitrary rules governing the distributions of user attributes (and combinations thereof) such as origin country, video play frequency, and tenure can be easily enforced when constructing an ML training data set. The demo section of the talk will showcase example usages of the stratification library in a Jupyter notebook.
2. Outline
• Iterative ML at Netflix
• Boson Library
• Data Sampling (Stratification) Needs
• API Examples
3. Everything is a Recommendation
[Image: Netflix homepage — Title Ranking, Row Selection & Ordering]
• Recommendations are driven by machine learning algorithms
• Over 80% of what members watch comes from our recommendations
4. Running Experiments
• Try an idea offline using historical data to see if it would have made better recommendations
• If it would, deploy a live A/B test to see if it performs well in production
5. ML Research Pipeline
• When experimenting offline, iterate quickly
• When ready to move to production, the transition must be smooth
• Spark + Scala!
Hypothesize → Prototype → Analyze → A/B Test → Productize
6. Rapid Iterations
• Developing machine learning is iterative
• Need a shortened pipeline to rapidly try ideas
• Entire pipeline must be amenable to correct experimentation
• Down-sampled data should satisfy the desired distribution for model training/testing
7. Boson Overview
• A high-level Spark/Scala API for ML exploration
• Focuses on Offline Feature Engineering/Training for both
  • Ad-hoc exploration
  • Production
• Think “Subset of SKLearn” for the Scala/JVM ecosystem
• Spark’s DataFrame is the core data abstraction
8. Running Experiments
[Diagram: the offline ML experiment loop — Design Experiment → Collect Label Dataset → Model Training → Compute Validation Metrics → Model Testing → Design a New Experiment to Test Out Different Ideas — with promising models moving to the Online System for Online A/B Testing. The Snapshots Infrastructure (runs once a day) performs Stratify User Cohorts, Capture Facts, Offline Feature Generation, and Stratify Training Set.]
9. Running Experiments
[Same diagram as the previous slide, highlighting the two stratification steps: Stratify User Cohorts and Stratify Training Set.]
10. Data Stratification
• A mechanism to down-sample datasets while meeting a desired distribution
• Place constraints on snapshotted users while ensuring maximal training data yield
• Place constraints on training data to make sure a model learns enough about important/newer demographics
• Several criteria: Small Countries, Emerging Markets, New Members ...
12. Sampling Netflix Userbase
• Spark + Scala DSL with strong type safety
• Expressive power to build arbitrary user sets on demand
• Declarative API that tells the library what to do, not how
• Large set of supported user attributes to stratify by
  – E.g., Country, Tenure, Plays, Searches, etc.
14. Building Blocks
• Country.US represents US users.
• Country.US && Tenure.M1 comprises US users who are also in their first membership month.
• Tenure.M1 || Plays(1, 10) comprises users in their first membership month OR having 1 <= number-of-plays < 10.
• !Country.US && Tenure(10, 100) comprises international user profiles with tenure in the range 10–100 days.
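These building blocks compose like ordinary boolean expressions. As a rough illustration — not Boson's actual implementation, and `User`, `Rule`, and the helpers below are hypothetical names — such combinators can be modeled in plain Scala as predicates with `&&`, `||`, and `!` (the real DSL evaluates against Spark DataFrames, not in-memory objects):

```scala
// Hypothetical sketch of boolean user-selection combinators.
case class User(country: String, tenureDays: Int, plays: Int)

trait Rule { self =>
  def matches(u: User): Boolean
  def &&(that: Rule): Rule = new Rule { def matches(u: User) = self.matches(u) && that.matches(u) }
  def ||(that: Rule): Rule = new Rule { def matches(u: User) = self.matches(u) || that.matches(u) }
  def unary_! : Rule       = new Rule { def matches(u: User) = !self.matches(u) }
}

object Country {
  def apply(code: String): Rule = new Rule { def matches(u: User) = u.country == code }
  val US = apply("US")
}
// Half-open ranges: lo <= value < hi
case class Tenure(lo: Int, hi: Int) extends Rule {
  def matches(u: User) = lo <= u.tenureDays && u.tenureDays < hi
}
case class Plays(lo: Int, hi: Int) extends Rule {
  def matches(u: User) = lo <= u.plays && u.plays < hi
}

val rule = !Country.US && Tenure(10, 100)
rule.matches(User("FR", 42, plays = 3)) // true: non-US, tenure in [10, 100)
rule.matches(User("US", 42, plays = 3)) // false: excluded by !Country.US
```

Expressions built this way are just values, so arbitrary user sets can be assembled, combined, and reused on demand — the property the DSL slide emphasizes.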
18. API Example: Error margin to increase yield
new UserSet(
samplingRules = Map(
Tenure.M1 -> TargetPercent(10.0),
Country.EACH && Tenure.M2 -> MaxPercent(0.2),
…
…
),
allowedErrorMargin = 20.0,
)
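The error margin gives the solver slack around each target. Under one plausible reading — assumed here, not Boson's documented semantics — `allowedErrorMargin` is a relative percentage that widens an exact target into an acceptable range:

```scala
// Sketch of an assumed relative-tolerance semantics for the error margin.
def relaxedRange(targetPercent: Double, errorMarginPercent: Double): (Double, Double) = {
  val slack = targetPercent * errorMarginPercent / 100.0
  (targetPercent - slack, targetPercent + slack)
}

relaxedRange(10.0, 20.0) // (8.0, 12.0): any share from 8% to 12% satisfies the rule
```

Widening a hard equality into a range enlarges the feasible set of the underlying optimization, letting it keep more rows and thus increasing yield.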
19. Algorithm Complexity
• When there are no overlapping regions in the Venn-diagram space, closed-form expressions for the sampling ratio per region are possible
• When the regions overlap, we formulate the optimization as a Linear Programming problem
  • The sampling rules form the constraints
  • The objective is typically to maximize the sample size
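For the disjoint case, the closed form can be sketched as follows (illustrative code, not Boson's): the region that is scarcest relative to its target caps the total sample size, and every other region is down-sampled to fit.

```scala
// Illustrative closed-form plan for non-overlapping regions.
// Each region has a population size and a required fraction of the final sample.
case class Region(popSize: Long, targetFrac: Double)

def stratifiedPlan(regions: Seq[Region]): (Double, Seq[Double]) = {
  // Largest sample S such that targetFrac * S <= popSize for every region.
  val maxSample = regions.map(r => r.popSize / r.targetFrac).min
  // Per-region sampling ratio: fraction of that region's population to keep.
  val ratios = regions.map(r => r.targetFrac * maxSample / r.popSize)
  (maxSample, ratios)
}

// 1M US users must be 50% of the sample; 100k new members the other 50%.
// The scarce region caps the sample at 200k rows; US is kept at ratio 0.1.
stratifiedPlan(Seq(Region(1000000L, 0.5), Region(100000L, 0.5)))
// -> (200000.0, Seq(0.1, 1.0))
```

Once rules overlap (e.g., Country.US && Tenure.M1 intersects Tenure.M1), no such per-region closed form exists, hence the LP formulation: sampling rules become constraints and total sample size the objective.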