This document describes Netflix's use of data stratification for machine learning experimentation and model training. It discusses how Netflix uses the Boson library and its stratification API to:
1) Downsample datasets while meeting a desired distribution of users based on attributes like country, tenure, number of plays, etc.
2) Allow ML researchers to rapidly iterate on ideas by experimenting on stratified subsets of data offline before testing models online.
3) Ensure models are trained on data that sufficiently represents important user demographics through constraints placed during stratified sampling.
The API provides flexible and declarative rules for stratifying user cohorts that inform the library how to sample data rather than specifying implementation details.
2. Outline
• Iterative ML in Netflix
• Boson Library
• Data Sampling (Stratification) Needs
• API Examples
3. Everything is a Recommendation
[Image: Netflix homepage, annotated with "Title Ranking" and "Row Selection & Ordering"]
• Recommendations are driven by machine learning algorithms
• Over 80% of what members watch comes from our recommendations
4. Running Experiments
• Try an idea offline using historical data to see if it would have made better recommendations
• If it would, deploy a live A/B test to see if it performs well in production
5. ML Research Pipeline
• When experimenting offline, iterate quickly
• When ready to move to production, the transition must be smooth
• Spark + Scala!
Pipeline: Hypothesize → Prototype → Analyze → A/B Test → Productize
6. Rapid Iterations
• Developing machine learning is iterative
• Need a shortened pipeline to rapidly try ideas
• Entire pipeline must be amenable to correct experimentation
• Down-sampled data should satisfy the desired distribution for model training/testing
7. Boson Overview
• A high-level Spark/Scala API for ML exploration
• Focuses on offline feature engineering/training for both
  – Ad-hoc exploration
  – Production
• Think "Subset of SKLearn" for the Scala/JVM ecosystem
• Spark's DataFrame is the core data abstraction
8. Running Experiments
[Diagram: the offline-to-online experimentation pipeline]
• Snapshots infrastructure (runs once a day): capture facts, stratify user cohorts
• Offline feature generation: stratify training set
• Offline ML experiment: design experiment → collect label dataset → model training → compute validation metrics → model testing
• Design a new experiment to test out different ideas; promising models move to the online system for online A/B testing
9. Running Experiments
[Same pipeline diagram as slide 8, with the "Stratify User Cohorts" and "Stratify Training Set" steps highlighted]
10. Data Stratification
• A mechanism to down-sample datasets while meeting a desired distribution
• Place constraints on snapshotted users while ensuring maximal training-data yield
• Place constraints on training data to make sure a model learns enough about important/newer demographics
• Several criteria: Small Countries, Emerging Markets, New Members ...
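Conceptually, down-sampling to a desired distribution can be sketched in a few lines of Scala. This is a hypothetical illustration, not Boson's implementation; `User`, `stratifiedSample`, and the country-only stratum key are assumptions made for the example.

```scala
// Hypothetical sketch (not Boson's actual code): down-sample a user list
// so that each stratum hits a target share of the output sample.
case class User(id: Long, country: String)

def stratifiedSample(users: Seq[User],
                     targetShare: Map[String, Double],
                     sampleSize: Int): Seq[User] = {
  val byCountry = users.groupBy(_.country)
  targetShare.toSeq.flatMap { case (country, share) =>
    val want = (sampleSize * share).toInt
    // deterministic take for illustration; a real sampler would pick randomly
    byCountry.getOrElse(country, Seq.empty).take(want)
  }
}
```

Even when one country dominates the raw data (80% US below), the sample comes out at the requested 50/50 split.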
12. Sampling the Netflix Userbase
• Spark + Scala DSL with strong type safety
• Expressive power to build arbitrary user sets on demand
• Declarative API that informs the library what to do, not how
• Large set of supported user attributes to stratify by
  – E.g. Country, Tenure, Plays, Searches, etc.
14. Building Blocks
[Venn diagram: US; US && M1; M1 || Plays(1, 10); !US && Tenure(10, 100)]
• Country.US represents US users.
• Country.US && Tenure.M1 comprises US users who are also in their first membership month.
• Tenure.M1 || Plays(1, 10) comprises users who are in their first membership month OR have 1 <= number-of-plays < 10.
• !Country.US && Tenure(10, 100) comprises international user profiles with tenure in the range of 10 to 100 days.
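Rule combinators like these can be mimicked with a small Scala sketch. Everything here (`Profile`, `Rule`, `matches`) is a hypothetical stand-in, not Boson's real API; it only shows how `&&`, `||`, and `!` compose into arbitrary user-set predicates with strong typing.

```scala
// Hypothetical sketch of a declarative cohort-rule DSL like the one above.
case class Profile(country: String, tenureDays: Int, plays: Int)

// A rule is a predicate over profiles; combinators build compound rules.
final case class Rule(matches: Profile => Boolean) {
  def &&(other: Rule): Rule = Rule(p => matches(p) && other.matches(p))
  def ||(other: Rule): Rule = Rule(p => matches(p) || other.matches(p))
  def unary_! : Rule = Rule(p => !matches(p))
}

object Country {
  def apply(c: String): Rule = Rule(_.country == c)
  val US: Rule = apply("US")
}
def Tenure(lo: Int, hi: Int): Rule = Rule(p => p.tenureDays >= lo && p.tenureDays < hi)
def Plays(lo: Int, hi: Int): Rule  = Rule(p => p.plays >= lo && p.plays < hi)

// !Country.US && Tenure(10, 100): international users with 10 <= tenure < 100 days
val intlNewish: Rule = !Country.US && Tenure(10, 100)
```

Because rules are ordinary values, they can be stored in the sampling-rule maps shown on the following slide.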
18. API Example: Error margin to increase yield
new UserSet(
  samplingRules = Map(
    Tenure.M1 -> TargetPercent(10.0),
    Country.EACH && Tenure.M2 -> MaxPercent(0.2),
    …
  ),
  allowedErrorMargin = 20.0
)
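One plausible reading of allowedErrorMargin (an assumption; the slide does not define its semantics) is a relative tolerance around each target, which widens the feasible region of the underlying optimization and so increases the achievable yield:

```scala
// Hypothetical sketch: assuming the margin is relative, a 20% margin around
// TargetPercent(10.0) relaxes the hard 10% target to the band [8.0, 12.0].
def feasibleBand(targetPercent: Double, errorMarginPercent: Double): (Double, Double) = {
  val delta = targetPercent * errorMarginPercent / 100.0
  (targetPercent - delta, targetPercent + delta)
}
```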
19. Algorithm Complexity
• When there are no overlapping regions in the Venn-diagram space, closed-form expressions for the per-region sampling ratio are possible
• When the regions overlap, we formulate the optimization as a Linear Programming problem
  – The sampling rules form the constraints
  – The objective is typically to maximize the sample size
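The overlapping-region case can be written out as a small LP. This is a sketch consistent with the slide, not necessarily Netflix's exact formulation: let $n_r$ be the population of Venn-diagram region $r$, $x_r \in [0,1]$ its sampling ratio, and let rule $k$ bound the share of the sample drawn from the regions $R_k$ it covers by $[\ell_k, u_k]$.

```latex
\begin{aligned}
\text{maximize}\quad & \textstyle\sum_r n_r x_r
  && \text{(total sample size)}\\
\text{subject to}\quad
  & \textstyle\sum_{r \in R_k} n_r x_r \;\ge\; \ell_k \sum_r n_r x_r
  && \text{for each sampling rule } k,\\
  & \textstyle\sum_{r \in R_k} n_r x_r \;\le\; u_k \sum_r n_r x_r
  && \text{for each sampling rule } k,\\
  & 0 \le x_r \le 1
  && \text{for each region } r.
\end{aligned}
```

Note that the percentage constraints are stated against the sample total rather than as fixed counts; multiplying through keeps both sides linear in $x_r$, so a standard LP solver applies.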