Training Large-scale Ad Ranking Models in Spark
PRESENTED BY Patrick Pletscher October 19, 2015
About Us
2
Michal Aharon, Oren Somekh, Yaacov Fernandess, Yair Koren,
Amit Kagian, Shahar Golan, Raz Nissim, Patrick Pletscher
Amir Ingber (Collaborator)
Haifa
What We Do
3
Research focused on ad ranking algorithms for Yahoo Gemini Native Ads
Ad Ranking Overview
4
• Advertisers run several campaigns each with several ads
• Each ad has a bid set by the advertiser; different ad price types
- pay per view
- pay per click
- various conversion price types
• Auction for each impression on a Gemini Native enabled property
- auction between all eligible ads (filter by targeting/budget)
- ad with the highest expected revenue is determined
• Need to know the (personalized!) probability of a click
- we mostly get money for clicks / conversions!
[Diagram: a user's impression is auctioned between Ad 1 (bid $1, predicted CTR 5%, expected revenue 5c) and Ad 2 (bid $2, predicted CTR 1%, expected revenue 2c)]
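The auction logic above boils down to ranking the eligible ads by bid times predicted CTR. A minimal sketch with the numbers from the diagram; the Ad type is hypothetical and the formula only covers pay-per-click ads (other price types weight the bid differently):

case class Ad(id: String, bid: Double, pClick: Double)      // bid in $, predicted CTR

// expected revenue per impression for a pay-per-click ad
def expectedRevenue(ad: Ad): Double = ad.bid * ad.pClick

val eligible = Seq(Ad("ad1", bid = 1.0, pClick = 0.05),     // 5c expected revenue
                   Ad("ad2", bid = 2.0, pClick = 0.01))     // 2c expected revenue
val winner = eligible.maxBy(expectedRevenue)                // ad1 wins the impression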
Click-Through Rate (CTR) Prediction
5
• Given a user and context, predict probability of a click for an ad.
• Probably the most “profitable” machine learning problem in industry
- simple binary problem; but want probabilities, not just the label
- very skewed label distribution: clicks << skips
- tons of data (every impression generates a training example)
- limitations at serving: need to predict quickly
• Basic setting quite well-studied; scale makes it challenging
- Google (McMahan et al. 2013)
- Facebook (He et al. 2014)
- Yahoo (Aharon et al. 2013)
- others (Chapelle et al. 2014)
• Some more involved research topics
- Exploration/Exploitation tradeoff
- Learning from logged feedback
Overview - CTR Prediction for Gemini Native Ads
6
• Collaborative Filtering approach (Aharon et al. 2013)
- Current production system
- Implemented in Hadoop MapReduce
- Used in Gemini Native ad ranking
• Large-scale Logistic Regression
- A research prototype
- Implemented in Spark
- The combination of Spark & Scala allows us to iterate quickly
- Takes several concepts from the CF approach
Large-scale Logistic Regression in Spark
Apache Spark
8
• “Apache Spark is a fast and general engine for large-scale data processing”
• Similar to Hadoop
• Advantages over Hadoop MapReduce
- Option to cache data in memory, great for iterative computations
- A lot of syntactic sugar
‣ filter, reduceByKey, distinct, sortByKey, join
‣ in general, Spark/Scala code is very concise (see the sketch at the end of this slide)
- Spark Shell, great for interactive/ETL* workflows
- Dataframes interesting for data scientists coming from R / Python
• Includes modules for
- machine learning
- streaming
- graph computations
- SQL / Dataframes
*ETL: Extract, transform, load
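As a taste of that conciseness, a small hypothetical sketch counting distinct users per ad from tab-separated (adId, userId) log lines; the path and format are made up for illustration:

val usersPerAd = sc.textFile("hdfs:///path/to/log")          // placeholder path
  .map(_.split("\t"))
  .filter(_.length == 2)                                     // drop malformed lines
  .map { case Array(adId, userId) => (adId, userId) }
  .distinct()                                                // unique (ad, user) pairs
  .map { case (adId, _) => (adId, 1L) }
  .reduceByKey(_ + _)                                        // count per ad
  .sortByKey()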
Spark at Yahoo
9
• Spark 1.5.1, the latest version of Spark
• Runs on top of Hadoop YARN 2.6
- integrates nicely with existing Hadoop tools and infrastructure at Yahoo
- data is generally stored in HDFS
• Clusters are centrally managed
• Large Hadoop deployment at Yahoo
- A few different clusters
- Each has at least a few thousand nodes
[Diagram: Spark, MapReduce, and Hive all run on top of YARN (resource management), which sits on HDFS (storage)]
Dataset for CTR Prediction
10
• Billions of ad impressions daily
- Need for Streaming / Batched Streaming
- Each impression has a unique id
• Need click information for every impression for learning
- Join impressions with a click stream every x minutes
- Need to wait for the click; introduces some delay
[Diagram: impressions and clicks stream in continuously; at each 15-minute boundary (18:30, 18:45, 19:00, 19:15) the impressions are joined with the clicks to produce labeled events; in Spark: union & reduceByKey]
Example - Joining Impression & Click RDDs
11
val keyAndImpressions = impressions
  .map(e => (e.joinKey, ("i", e)))
val keyAndClicks = clicks
  .map(e => (e.joinKey, ("c", e)))

keyAndImpressions.union(keyAndClicks)
  .reduceByKey(smartCombine)
  .flatMap { case (k, (t, event)) => t match {
    case "ci" => Some(LabeledEvent(event, clicked = 1))
    case "i"  => Some(LabeledEvent(event, clicked = 0))
    case "c"  => None
  }}
def smartCombine(event1: (String, Event), event2: (String, Event)): (String, Event) = {
  (event1._1, event2._1) match {
    case ("c", "c") => event1              // de-dupe
    case ("i", "i") => event1              // de-dupe
    case ("c", "i") => ("ci", event2._2)   // combine click and impression
    case ("i", "c") => ("ci", event1._2)   // combine click and impression
    case ("ci", _)  => event1              // de-dupe
    case (_, "ci")  => event2              // de-dupe
  }
}
Incremental Learning Architecture
12
[Diagram: every 15 minutes (18:30, 18:45, 19:00, 19:15), impressions and clicks are joined into labeled events; feature extraction turns them into learning examples, and the learning step incrementally updates the model produced in the previous interval]
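A hedged sketch of the loop behind this diagram; FtrlModel, FeatureExtractor and loadLabeledEvents are hypothetical names used for illustration, not our actual classes:

var model = FtrlModel.load(modelPath)                    // model from the previous interval
for (interval <- intervals) {                            // e.g. one interval every 15 minutes
  val labeled  = loadLabeledEvents(interval)             // impressions joined with clicks
  val examples = labeled.map(FeatureExtractor.extract)   // RDD[LearningExample]
  model = model.update(examples)                         // incremental FTRL update
  model.save(modelPath)                                  // picked up by the next interval / serving
}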
Large-scale Logistic Regression
13
• Industry standard for CTR prediction (McMahan et al. 2013, He et al. 2014)
• Models the probability of a click as p(click | x) = 1 / (1 + exp(-w·x))
- feature vector x
‣ high-dimensional vector but sparse (few non-zero values)
‣ model expressivity controlled by the features
‣ a lot of hand-tuning and playing around
- model parameters w
‣ need to be learned
‣ generally rather non-sparse
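As a concrete sketch, scoring one example with a sparse feature representation (a Map of index to value; the representation is an illustration, not our production code):

def predictCtr(features: Map[Int, Double], weights: Array[Double]): Double = {
  val margin = features.map { case (i, v) => weights(i) * v }.sum   // w . x
  1.0 / (1.0 + math.exp(-margin))                                   // logistic function
}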
Features for Logistic Regression
14
• Basic features
- age, gender
- browser, device
• Feature crosses
- E.g. age x gender x state (30 year old male from Boston)
- mostly indicator features
- Examples:
‣ gender^age m^30
‣ gender^device m^Windows_NT
‣ gender^section m^5417810
‣ gender^state m^2347579
‣ age^device 30^Windows_NT
• Feature hashing to get a vector of fixed length
- hash all the index tuples, e.g. (gender^age, m^30), to get a numeric index
- will introduce collisions! Choose dimensionality large enough
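A minimal sketch of such feature hashing; MurmurHash3 is one common choice, and the dimensionality here is just an example value:

import scala.util.hashing.MurmurHash3

val numFeatures = 1 << 22                               // choose large enough to keep collisions rare

def hashFeature(name: String, value: String): Int = {
  val h = MurmurHash3.stringHash(s"$name^$value")
  ((h % numFeatures) + numFeatures) % numFeatures       // map to a non-negative index
}

hashFeature("gender^age", "m^30")                       // some index in [0, numFeatures)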
Parameter Estimation
15
• Basic Problem: Regularized Maximum Likelihood
- Often: L1 regularization instead of L2
‣ promotes sparsity in the weight vector
‣ more efficient predictions in serving (also requires less memory!)
- Batch vs. streaming
‣ in our case: batched streaming, every x min perform an incremental model update
• Follow-the-regularized-leader (FTRL, McMahan et al. 2013)
- sequential online algorithm: only use a data point once
- similar to stochastic gradient descent
- per coordinate learning rates
- encourages sparseness
- FTRL stores weight and accumulated gradient per coordinate
- Objective: minimize sum_i logloss(w; x_i, y_i) (fit training data) + lambda * ||w||_1 (prevent overfitting)
Basic Parallelized FTRL in Spark
16
def train(examples: RDD[LearningExample]): Unit = {
  val delta = examples
    .repartition(numWorkers)
    .mapPartitions(xs => updatePartition(xs, weights, counts))
    .treeReduce { case (a, b) => (a._1 + b._1, a._2 + b._2) }

  weights += delta._1 / numWorkers.toDouble
  counts += delta._2 / numWorkers.toDouble
}

def updatePartition(examples: Iterator[LearningExample],
                    weights: DenseVector[Double],
                    counts: DenseVector[Double]):
    Iterator[(DenseVector[Double], DenseVector[Double])] = {
  // standard FTRL code for examples
  // hack: actually a single result, but Spark expects an iterator!
  Iterator((deltaWeights, deltaCounts))
}
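For completeness, a hedged sketch of what the elided "standard FTRL code for examples" might look like for a single example, following the per-coordinate FTRL-Proximal update of McMahan et al. 2013; the state vectors z and n play the role of the weights/counts above, and the sparse feature representation and hyperparameter names are assumptions:

import breeze.linalg.DenseVector

def ftrlStep(features: Seq[(Int, Double)], clicked: Int,
             z: DenseVector[Double], n: DenseVector[Double],
             alpha: Double, beta: Double,
             lambda1: Double, lambda2: Double): Unit = {
  // 1. materialize the weights lazily from (z, n) and compute the prediction
  val w = features.map { case (i, _) =>
    if (math.abs(z(i)) <= lambda1) i -> 0.0
    else i -> -(z(i) - math.signum(z(i)) * lambda1) /
              ((beta + math.sqrt(n(i))) / alpha + lambda2)
  }.toMap
  val p = 1.0 / (1.0 + math.exp(-features.map { case (i, v) => w(i) * v }.sum))

  // 2. per-coordinate state update with the log-loss gradient
  features.foreach { case (i, v) =>
    val g     = (p - clicked) * v
    val sigma = (math.sqrt(n(i) + g * g) - math.sqrt(n(i))) / alpha
    z(i) += g - sigma * w(i)
    n(i) += g * g
  }
}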
Summary: LR with Spark
17
• Efficient: Can learn on all the data
- before: somewhat aggressive subsampling of the skips
• Possible to do feature pre-processing
- in Hadoop MapReduce much harder: only one pass over data
- drop infrequent features, TF-IDF, …
• Spark-shell as a life-saver
- helps to debug problems as one can inspect intermediate results at scale
- have yet to try Zeppelin notebooks
• Easy to unit test complex workflows
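A hedged sketch of the kind of unit test we mean, using ScalaTest and a local SparkContext; the test framework choice and the single-argument Event constructor are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class JoinSuite extends FunSuite {
  test("an impression followed by a click yields a combined 'ci' event") {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
    try {
      val impressions = sc.parallelize(Seq(("id-1", ("i", Event("id-1")))))  // stand-in events
      val clicks      = sc.parallelize(Seq(("id-1", ("c", Event("id-1")))))
      val combined    = impressions.union(clicks).reduceByKey(smartCombine).collect()
      assert(combined.head._2._1 === "ci")
    } finally {
      sc.stop()
    }
  }
}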
Spark: Lessons Learned
Upgrade!
19
• Spark has a pretty regular 3-month release schedule
• Always run with the latest version
- Lots of bugs get fixed
- Difficult to keep up with new functionality (see DataFrame vs. RDD)
• Speed improvements over the past year
Configurations
20
• Our solution
- config directory containing
‣ Logging: log4j.properties
‣ Spark itself: spark-defaults.conf
‣ our code: application.conf
- two versions of configs: local & cluster
- in YARN: specify them using --files argument & SPARK_CONF_DIR variable
• Use Typesafe’s config library for all application related configs
- provide sensible defaults for everything
- overwrite using application.conf
• Do not hard-code any configurations in code
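A minimal sketch of the Typesafe config usage (the key names are hypothetical):

import com.typesafe.config.ConfigFactory

val config = ConfigFactory.load()      // application.conf layered over reference.conf defaults

val numWorkers   = config.getInt("ctr.training.num-workers")
val modelPath    = config.getString("ctr.training.model-path")
val learningRate = config.getDouble("ctr.training.learning-rate")

On the cluster the config directory is shipped with spark-submit --files and picked up via SPARK_CONF_DIR, as noted above.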
Accumulators
21
• Use accumulators for ensuring correctness!
• Example:
- parse data, ignore event if there is a problem with the data
- use accumulator to count these failed lines
class Parser(failedLinesAccumulator: Accumulator[Int]) extends Serializable {
  def parse(s: String): Option[Event] = {
    try {
      // parsing logic goes here
      Some(...)
    } catch {
      case e: Exception =>
        failedLinesAccumulator += 1
        None
    }
  }
}

val accumulator = sc.accumulator(0, "failed lines")
val parser = new Parser(accumulator)
val events = sc.textFile("hdfs:///myfile")
  .flatMap(s => parser.parse(s))
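One caveat worth remembering: the accumulator is only populated once an action forces the lazy transformations to run, so read its value on the driver afterwards, e.g.:

val parsedCount = events.count()       // action: forces parsing and accumulator updates
println(s"parsed: $parsedCount, failed lines: ${accumulator.value}")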
RDD vs. DataFrame in Spark
22
• Initially Spark advocated the Resilient Distributed Dataset (RDD) as its data set abstraction
- type-safe
- usually stores some Scala case class
- code relatively easy to understand
• More recently Spark has been pushing towards the DataFrame
- similar to R and Python’s Pandas data frames
- some advantages
‣ less rigid types: can append columns
‣ speed
- disadvantage: code readability suffers for non-basic types
‣ user defined types
‣ user defined functions
• Have not fully migrated to it yet
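For reference, a small sketch of going between the two in Spark 1.5; the case class and its fields are made up for illustration:

import org.apache.spark.sql.SQLContext

case class LabeledImpression(userId: String, adId: String, clicked: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = labeledImpressions.toDF()                      // labeledImpressions: RDD[LabeledImpression]
df.filter($"clicked" === 1).groupBy("adId").count()     // column-based, no case class in sight
val rows = df.rdd                                       // back to an RDD, but of untyped Rows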
Every Day I’m Shuffling…
23
• Careful with operations that send a lot of data over the network
- reduceByKey
- repartition / shuffle
• Careful with sending too much data to the driver
- collect
- reduce
• We found mapPartitions & treeReduce useful in some cases (see the FTRL example)
• Play with Spark configurations: frameSize, maxResultSize, timeouts, … (see the sketch below)
[Diagram: a job DAG: textFile → flatMap → map run within a single stage; reduceByKey introduces a shuffle boundary]
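The knobs mentioned above map to real Spark settings; the values here are examples to tune per job, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.akka.frameSize", "128")            // MB; size limit for control-plane messages
  .set("spark.driver.maxResultSize", "2g")       // cap on results collected back to the driver
  .set("spark.network.timeout", "300s")          // raise if shuffles time out under heavy load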
Machine Learning in Spark
24
• Relatively basic
- some algorithms don’t scale so well
- not customizable enough for experts:
‣ optimizers that assume a regularizer
‣ built our own DSL for feature extraction & combination
‣ a lot of the APIs are not exposed, i.e. private to Spark
- will hopefully get there eventually
• Nice: new Transformer / Estimator / Pipeline approach
- Inspired by scikit-learn, makes it easy to combine different algorithms
- Requires DataFrame
- Example (from Spark docs)
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
Thank you!
