Shiva Chaitanya, Netflix
#DSSAIS11
Stratification Library For Machine
Learning Use Cases At Netflix
• Iterative ML in Netflix
• Boson Library
• Data Sampling (Stratification) Needs
• API Examples
Outline
Everything is a Recommendation
[Diagram: homepage composition, showing Title Ranking and Row Selection & Ordering]
• Recommendations are driven by machine learning algorithms
• Over 80% of what members watch comes from our recommendations
• Try an idea offline using historical data to see if it would have
made better recommendations
• If it would, deploy a live A/B test to see if it performs well in
production
Running Experiments
• When experimenting offline, iterate quickly
• When ready to move to production, transition must be
smooth
• Spark + Scala!
ML Research Pipeline
Hypothesize → Prototype → Analyze → A/B Test → Productize
Rapid Iterations
• Developing machine learning
is iterative
• Need a shortened pipeline to
rapidly try ideas
• Entire pipeline must be
amenable to correct
experimentation
• Down-sampled data should
satisfy desired distribution for
model training/testing
• A high-level Spark/Scala API for ML exploration
• Focuses on Offline Feature Engineering/Training
for both
• Ad-hoc exploration
• Production
• Think “Subset of SKLearn” for Scala/JVM ecosystem
• Spark’s Dataframe is the core data abstraction
Boson Overview
Running Experiments
[Diagram: the Offline ML Experiment loop: Design Experiment, Collect Label Dataset, Model Training, Compute Validation Metrics, Model Testing, then design a new experiment to test out different ideas. Results feed the Online System for online A/B testing. Snapshot infrastructure (runs once a day): Offline Feature Generation, Capture Facts, Stratify User Cohorts, Stratify Training Set.]
Running Experiments
[The same pipeline diagram, highlighting the two stratification steps: Stratify User Cohorts (on the daily snapshots) and Stratify Training Set (on the training data).]
Data Stratification
• A mechanism to down-sample datasets while meeting a desired
distribution
• Place constraints on snapshotted users while ensuring maximal
training data yield
• Place constraints on training data to make sure a model learns
enough about important/newer demographics
• Several criteria: Small Countries, Emerging Markets, New
Members ...
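As a rough illustration of that first need, capping each stratum of a dataset can be sketched in a few lines of Python. This is not the Boson API (which operates declaratively on Spark dataframes); `downsample_with_caps` is a hypothetical name, and as a simplification the cap is measured against the original total rather than re-balanced against the final yield:

```python
from collections import defaultdict

def downsample_with_caps(rows, key, max_percent):
    """Keep each stratum at no more than max_percent of the original
    total; strata already under the cap are kept in full.
    Illustrative sketch only, not the library's implementation."""
    cap = max(1, int(len(rows) * max_percent / 100.0))
    seen = defaultdict(int)
    kept = []
    for row in rows:
        k = key(row)
        if seen[k] < cap:       # drop rows once a stratum hits its cap
            seen[k] += 1
            kept.append(row)
    return kept

# 90 US rows + 10 CA rows; cap every country at 50% of the original 100
users = [("US", i) for i in range(90)] + [("CA", i) for i in range(10)]
sample = downsample_with_caps(users, key=lambda u: u[0], max_percent=50)
# 50 US rows (capped) + all 10 CA rows
```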
Flexible Dataframe Stratification API
• Extended stratification methods on Spark dataframes
• Think sampling rules specified with native Spark SQL-like expressions
dataframe.stratify(rules = Seq(
  $("country") == 'US' maxPercent 8.0,
  $("tenure") > 10 && $("plays") > 1 minPercent 0.5,
  …
))
Sampling Netflix Userbase
• Spark + Scala DSL with strong type safety
• Expressive power to build arbitrary user-sets on demand
• Declarative API that informs the library what to do, not how
• Large set of supported user attributes to stratify by
– E.g. Country, Tenure, Plays, Searches, etc.
User Stratification Client
[Diagram: sampling rules are specified to the User Stratification Client, which reads de-normalized data snapshots.]
Building Blocks
[Venn diagram of user sets: US; US && M1; M1 || Plays(1, 10); !US && Tenure(10, 100)]
• Country.US represents US users.
• Country.US && Tenure.M1 comprises US users who are also in their first membership month.
• Tenure.M1 || Plays(1, 10) comprises users in their first membership month OR having 1 <= number-of-plays < 10.
• !Country.US && Tenure(10, 100) comprises international user profiles with tenure in the range of 10 to 100 days.
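These combinators can be mimicked in Python purely to show how composable set expressions evaluate against a user record. The names (`Expr`, `Country`, `Tenure`, `Plays`) echo the Scala DSL but are hypothetical here, and Python's `&`, `|`, `~` stand in for the DSL's `&&`, `||`, `!`:

```python
class Expr:
    """A boolean user-set expression built from a predicate."""
    def __init__(self, pred):
        self.pred = pred
    def __and__(self, other):   # plays the role of && in the Scala DSL
        return Expr(lambda u: self.pred(u) and other.pred(u))
    def __or__(self, other):    # plays the role of ||
        return Expr(lambda u: self.pred(u) or other.pred(u))
    def __invert__(self):       # plays the role of !
        return Expr(lambda u: not self.pred(u))
    def __call__(self, user):
        return self.pred(user)

def Country(code):   return Expr(lambda u: u["country"] == code)
def Tenure(lo, hi):  return Expr(lambda u: lo <= u["tenure_days"] < hi)
def Plays(lo, hi):   return Expr(lambda u: lo <= u["plays"] < hi)

# "international user with tenure in [10, 100) days"
rule = ~Country("US") & Tenure(10, 100)
rule({"country": "FR", "tenure_days": 30, "plays": 5})  # True
```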
API Example: Specify rules on counts
new UserSet(
samplingRules = Map(
Country.US -> TargetCount(500),
Country.US && Tenure.M1 -> TargetCount(100),
Country.US && Tenure.M2 -> MinCount(200),
Country("GB") && Plays(1) -> MaxCount(200),
…
…
)
)
API Example: Specify rules on percentages
new UserSet(
samplingRules = Map(
Country.US -> TargetPercent(40.0),
Country.US && Tenure.M1 -> TargetPercent(20.0),
Country.US && Tenure.M2 -> MinPercent(10.0),
Country("GB") && Plays(1) -> MaxPercent(5.0),
…
…
)
)
API Example: EACH Expander
new UserSet(
samplingRules = Map(
Tenure.M1 -> TargetPercent(10.0),
Country.EACH && Tenure.M2 -> MaxPercent(0.2),
…
)
)
which expands to:
Country("AD") && Tenure.M2 -> MaxPercent(0.2),
Country("AE") && Tenure.M2 -> MaxPercent(0.2),
….
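The EACH expander can be pictured as template expansion: one concrete rule per country code. This Python sketch (`expand_each` is a hypothetical name; the real expansion happens inside the Scala library) only mirrors the idea:

```python
def expand_each(template, countries, constraint):
    """Expand a Country.EACH rule into one concrete rule per country,
    as in the Country("AD"), Country("AE"), ... expansion above.
    Illustrative sketch only."""
    return {template.format(code): constraint for code in countries}

rules = expand_each('Country("{}") && Tenure.M2', ["AD", "AE", "AF"], "MaxPercent(0.2)")
# one MaxPercent(0.2) rule per country code
```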
API Example: Error margin to increase yield
new UserSet(
samplingRules = Map(
Tenure.M1 -> TargetPercent(10.0),
Country.EACH && Tenure.M2 -> MaxPercent(0.2),
…
…
),
allowedErrorMargin = 20.0,
)
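Numerically, the error margin relaxes each target into an interval. This tiny helper (our name, not the library's) mirrors the toleranceMargin arithmetic that appears in the later code snippet, where the allowed slack is a fraction of the target itself:

```python
def with_margin(target_percent, margin_percent):
    """Relax a targetPercent into an allowed interval. The margin is
    relative to the target, so a 10% target with a 20% margin may land
    anywhere in [8%, 12%]. Hypothetical helper for illustration."""
    delta = target_percent * margin_percent / 100.0
    return (target_percent - delta, target_percent + delta)

with_margin(10.0, 20.0)  # (8.0, 12.0)
```

A looser margin gives the solver more room, which is why it increases the yield.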
Algorithm Complexity
• When there are no overlapping regions in the Venn diagram space
• Closed-form expressions can determine the sampling ratio per region
• When the regions overlap, we formulate the optimization problem as a Linear Programming (LP) problem
• The sampling rules form the constraints
• The objective is typically to maximize the sample size
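For the non-overlapping case, each rule constrains only its own region, so the per-region answer is immediate; for example, a maxPercent rule just scales that region's population. A hedged Python sketch with made-up numbers (and assuming maxPercent-only rules):

```python
def disjoint_region_samples(populations, max_percents):
    """With disjoint regions, maximize yield by sampling each region at
    its own maxPercent cap -- no LP needed. Illustrative sketch only;
    other rule types (min/target) would change the closed form."""
    return {r: int(populations[r] * max_percents[r] / 100.0) for r in populations}

disjoint_region_samples({"US": 1000, "GB": 500}, {"US": 10, "GB": 20})
# {'US': 100, 'GB': 100}
```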
LP Formulation
[Venn diagram: regions of sizes C1 and C2 with overlap C12; rule minPercent p1 on region 1, rule maxPercent p2 on region 2]
0 <= s1 <= C1
0 <= s2 <= C2
0 <= s12 <= C12
p1 * (C1 + C12) <= (s1 + s12) * 100   (minPercent p1)
p2 * (C2 + C12) >= (s2 + s12) * 100   (maxPercent p2)
Maximize (s1 + s2 + s12)
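For this particular two-region instance the optimum can even be written down greedily: take all of region 1 (which has only a lower-bound rule), then spend region 2's maxPercent budget, favoring the overlap since it also counts toward region 1's minimum. A Python check with made-up populations (the function name and numbers are ours; general overlapping rule sets need a real LP solver):

```python
def solve_two_region_lp(C1, C2, C12, p1, p2):
    """Maximize s1 + s2 + s12 for the specific instance above, where
    region 1 carries only a minPercent rule and region 2 only a
    maxPercent rule. Sketch only, not the library's solver."""
    s1 = C1                               # no upper rule on region 1
    budget2 = p2 * (C2 + C12) / 100.0     # cap on s2 + s12
    s12 = min(C12, budget2)               # overlap also serves region 1's minimum
    s2 = min(C2, budget2 - s12)
    # feasibility check: minPercent p1 on region 1 must hold
    assert p1 * (C1 + C12) <= (s1 + s12) * 100
    return s1, s2, s12

s1, s2, s12 = solve_two_region_lp(C1=1000, C2=2000, C12=500, p1=50.0, p2=10.0)
# total sample s1 + s2 + s12 is 1250
```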
Optimization problem
• Shout-out to vagmcs/Optimus, a third-party idiomatic Scala library for mathematical optimization!
https://github.com/vagmcs/Optimus
Code snippet
targetDistroMap.foreach {
case (samplingExpr, freq @ TargetPercent(percent)) =>
val relevantRegions: Seq[Region] =
existingDistroRegionMap.keys.filter(_.contains(samplingExpr)).toSeq
val lpExpression = relevantRegions.map(regionVariables).foldLeft[Expression](Zero)(_ + _)
val leftErrorMargin = MPFloatVar(samplingExpr.toString + "_leftErrorMargin", 0,
existingDistro.totalCount)
val rightErrorMargin = MPFloatVar(samplingExpr.toString + "_rightErrorMargin", 0,
existingDistro.totalCount)
val toleranceMargin = freq.toleranceMargin.getOrElse(targetToleranceMargin)
add(leftErrorMargin <:= overallSizeVariable * Const(0.01 * percent * toleranceMargin/100.0))
add(rightErrorMargin <:= overallSizeVariable * Const(0.01 * percent * toleranceMargin/100.0))
add(lpExpression := overallSizeVariable * Const(0.01 * percent) - leftErrorMargin +
rightErrorMargin)
leftErrorMarginsSum = leftErrorMarginsSum ++ Seq(leftErrorMargin)
rightErrorMarginsSum = rightErrorMarginsSum ++ Seq(rightErrorMargin)
}
Questions?

Apache Spark-Based Stratification Library for Machine Learning Use Cases at Netflix with Shiva Chaitanya
