Training Large-scale Ad Ranking Models in Spark
PRESENTED BY Patrick Pletscher October 19, 2015
About Us
2
Michal Aharon, Oren Somekh, Yaacov Fernandess, Yair Koren,
Amit Kagian, Shahar Golan, Raz Nissim, Patrick Pletscher
Amir Ingber (Collaborator)
Haifa
What We Do
3
Research focused on ad ranking algorithms for Yahoo Gemini Native Ads
Ad Ranking Overview
4
• Advertisers run several campaigns each with several ads
• Each ad has a bid set by the advertiser; different ad price types
- pay per view
- pay per click
- various conversion price types
• Auction for each impression on a Gemini Native enabled property
- auction between all eligible ads (filter by targeting/budget)
- ad with the highest expected revenue is determined
• Need to know the (personalized!) probability of a click
- we mostly get money for clicks / conversions!
[Diagram: a user's impression is auctioned between Ad 1 (bid $1, predicted CTR 5%, expected revenue 5c) and Ad 2 (bid $2, predicted CTR 1%, expected revenue 2c)]
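The auction logic above boils down to ranking the eligible ads by bid times predicted CTR. A minimal sketch with the numbers from the diagram; the Ad type is hypothetical and the formula only covers pay-per-click ads (other price types weight the bid differently):

case class Ad(id: String, bid: Double, pClick: Double)      // bid in $, predicted CTR

// expected revenue per impression for a pay-per-click ad
def expectedRevenue(ad: Ad): Double = ad.bid * ad.pClick

val eligible = Seq(Ad("ad1", bid = 1.0, pClick = 0.05),     // 5c expected revenue
                   Ad("ad2", bid = 2.0, pClick = 0.01))     // 2c expected revenue
val winner = eligible.maxBy(expectedRevenue)                // ad1 wins the impression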
Click-Through Rate (CTR) Prediction
5
• Given a user and context, predict probability of a click for an ad.
• Probably the most “profitable” machine learning problem in industry
- simple binary problem; but want probabilities, not just the label
- very skewed label distribution: clicks << skips
- tons of data (every impression generates a training example)
- limitations at serving: need to predict quickly
• Basic setting quite well-studied; scale makes it challenging
- Google (McMahan et al. 2013)
- Facebook (He et al. 2014)
- Yahoo (Aharon et al. 2013)
- others (Chapelle et al. 2014)
• Some more involved research topics
- Exploration/Exploitation tradeoff
- Learning from logged feedback
Overview - CTR Prediction for Gemini Native Ads
6
• Collaborative Filtering approach (Aharon et al. 2013)
- Current production system
- Implemented in Hadoop MapReduce
- Used in Gemini Native ad ranking
• Large-scale Logistic Regression
- A research prototype
- Implemented in Spark
- The combination of Spark & Scala allows us to iterate quickly
- Takes several concepts from the CF approach
Large-scale Logistic Regression in Spark
Apache Spark
8
• “Apache Spark is a fast and general engine for large-scale data processing”
• Similar to Hadoop
• Advantages over Hadoop MapReduce
- Option to cache data in memory, great for iterative computations
- A lot of syntactic sugar
‣ filter, reduceByKey, distinct, sortByKey, join
‣ in general, Spark/Scala code is very concise (see the sketch at the end of this slide)
- Spark Shell, great for interactive/ETL* workflows
- Dataframes interesting for data scientists coming from R / Python
• Includes modules for
- machine learning
- streaming
- graph computations
- SQL / Dataframes
*ETL: Extract, transform, load
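As a taste of that conciseness, a small hypothetical sketch counting distinct users per ad from tab-separated (adId, userId) log lines; the path and format are made up for illustration:

val usersPerAd = sc.textFile("hdfs:///path/to/log")          // placeholder path
  .map(_.split("\t"))
  .filter(_.length == 2)                                     // drop malformed lines
  .map { case Array(adId, userId) => (adId, userId) }
  .distinct()                                                // unique (ad, user) pairs
  .map { case (adId, _) => (adId, 1L) }
  .reduceByKey(_ + _)                                        // count per ad
  .sortByKey()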
Spark at Yahoo
9
• Spark 1.5.1, the latest version of Spark
• Runs on top of Hadoop YARN 2.6
- integrates nicely with existing Hadoop tools and infrastructure at Yahoo
- data is generally stored in HDFS
• Clusters are centrally managed
• Large Hadoop deployment at Yahoo
- A few different clusters
- Each has at least a few thousand nodes
[Diagram: Spark, MapReduce, and Hive all run on top of YARN (resource management), which sits on HDFS (storage)]
Dataset for CTR Prediction
10
• Billions of ad impressions daily
- Need for Streaming / Batched Streaming
- Each impression has a unique id
• Need click information for every impression for learning
- Join impressions with a click stream every x minutes
- Need to wait for the click; introduces some delay
[Diagram: impressions and clicks stream in continuously; at each 15-minute boundary (18:30, 18:45, 19:00, 19:15) the impressions are joined with the clicks to produce labeled events; in Spark: union & reduceByKey]
Example - Joining Impression & Click RDDs
11
val keyAndImpressions = impressions
  .map(e => (e.joinKey, ("i", e)))
val keyAndClicks = clicks
  .map(e => (e.joinKey, ("c", e)))

keyAndImpressions.union(keyAndClicks)
  .reduceByKey(smartCombine)
  .flatMap { case (k, (t, event)) => t match {
    case "ci" => Some(LabeledEvent(event, clicked = 1))
    case "i"  => Some(LabeledEvent(event, clicked = 0))
    case "c"  => None
  }}
def smartCombine(event1: (String, Event), event2: (String, Event)): (String, Event) = {
  (event1._1, event2._1) match {
    case ("c", "c") => event1              // de-dupe
    case ("i", "i") => event1              // de-dupe
    case ("c", "i") => ("ci", event2._2)   // combine click and impression
    case ("i", "c") => ("ci", event1._2)   // combine click and impression
    case ("ci", _)  => event1              // de-dupe
    case (_, "ci")  => event2              // de-dupe
  }
}
Incremental Learning Architecture
12
[Diagram: every 15 minutes (18:30, 18:45, 19:00, 19:15), impressions and clicks are joined into labeled events; feature extraction turns them into learning examples, and the learning step incrementally updates the model produced in the previous interval]
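A hedged sketch of the loop behind this diagram; FtrlModel, FeatureExtractor and loadLabeledEvents are hypothetical names used for illustration, not our actual classes:

var model = FtrlModel.load(modelPath)                    // model from the previous interval
for (interval <- intervals) {                            // e.g. one interval every 15 minutes
  val labeled  = loadLabeledEvents(interval)             // impressions joined with clicks
  val examples = labeled.map(FeatureExtractor.extract)   // RDD[LearningExample]
  model = model.update(examples)                         // incremental FTRL update
  model.save(modelPath)                                  // picked up by the next interval / serving
}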
Large-scale Logistic Regression
13
• Industry standard for CTR prediction (McMahan et al. 2013, He et al. 2014)
• Models the probability of a click as p(click | x) = 1 / (1 + exp(-w·x))
- feature vector x
‣ high-dimensional vector but sparse (few non-zero values)
‣ model expressivity controlled by the features
‣ a lot of hand-tuning and playing around
- model parameters w
‣ need to be learned
‣ generally rather non-sparse
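As a concrete sketch, scoring one example with a sparse feature representation (a Map of index to value; the representation is an illustration, not our production code):

def predictCtr(features: Map[Int, Double], weights: Array[Double]): Double = {
  val margin = features.map { case (i, v) => weights(i) * v }.sum   // w . x
  1.0 / (1.0 + math.exp(-margin))                                   // logistic function
}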
Features for Logistic Regression
14
• Basic features
- age, gender
- browser, device
• Feature crosses
- E.g. age x gender x state (30 year old male from Boston)
- mostly indicator features
- Examples:
‣ gender^age m^30
‣ gender^device m^Windows_NT
‣ gender^section m^5417810
‣ gender^state m^2347579
‣ age^device 30^Windows_NT
• Feature hashing to get a vector of fixed length
- hash all the index tuples, e.g. (gender^age, m^30), to get a numeric index
- will introduce collisions! Choose dimensionality large enough
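A minimal sketch of such feature hashing; MurmurHash3 is one common choice, and the dimensionality here is just an example value:

import scala.util.hashing.MurmurHash3

val numFeatures = 1 << 22                               // choose large enough to keep collisions rare

def hashFeature(name: String, value: String): Int = {
  val h = MurmurHash3.stringHash(s"$name^$value")
  ((h % numFeatures) + numFeatures) % numFeatures       // map to a non-negative index
}

hashFeature("gender^age", "m^30")                       // some index in [0, numFeatures)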
Parameter Estimation
15
• Basic Problem: Regularized Maximum Likelihood
- Often: L1 regularization instead of L2
‣ promotes sparsity in the weight vector
‣ more efficient predictions in serving (also requires less memory!)
- Batch vs. streaming
‣ in our case: batched streaming, every x min perform an incremental model update
• Follow-the-regularized-leader (FTRL, McMahan et al. 2013)
- sequential online algorithm: only use a data point once
- similar to stochastic gradient descent
- per coordinate learning rates
- encourages sparseness
- FTRL stores weight and accumulated gradient per coordinate
- Objective: minimize sum_i logloss(w; x_i, y_i) (fit training data) + lambda * ||w||_1 (prevent overfitting)
Basic Parallelized FTRL in Spark
16
def train(examples: RDD[LearningExample]): Unit = {
  val delta = examples
    .repartition(numWorkers)
    .mapPartitions(xs => updatePartition(xs, weights, counts))
    .treeReduce { case (a, b) => (a._1 + b._1, a._2 + b._2) }

  weights += delta._1 / numWorkers.toDouble
  counts += delta._2 / numWorkers.toDouble
}

def updatePartition(examples: Iterator[LearningExample],
                    weights: DenseVector[Double],
                    counts: DenseVector[Double]):
    Iterator[(DenseVector[Double], DenseVector[Double])] = {
  // standard FTRL code for examples
  // hack: actually a single result, but Spark expects an iterator!
  Iterator((deltaWeights, deltaCounts))
}
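For completeness, a hedged sketch of what the elided "standard FTRL code for examples" might look like for a single example, following the per-coordinate FTRL-Proximal update of McMahan et al. 2013; the state vectors z and n play the role of the weights/counts above, and the sparse feature representation and hyperparameter names are assumptions:

import breeze.linalg.DenseVector

def ftrlStep(features: Seq[(Int, Double)], clicked: Int,
             z: DenseVector[Double], n: DenseVector[Double],
             alpha: Double, beta: Double,
             lambda1: Double, lambda2: Double): Unit = {
  // 1. materialize the weights lazily from (z, n) and compute the prediction
  val w = features.map { case (i, _) =>
    if (math.abs(z(i)) <= lambda1) i -> 0.0
    else i -> -(z(i) - math.signum(z(i)) * lambda1) /
              ((beta + math.sqrt(n(i))) / alpha + lambda2)
  }.toMap
  val p = 1.0 / (1.0 + math.exp(-features.map { case (i, v) => w(i) * v }.sum))

  // 2. per-coordinate state update with the log-loss gradient
  features.foreach { case (i, v) =>
    val g     = (p - clicked) * v
    val sigma = (math.sqrt(n(i) + g * g) - math.sqrt(n(i))) / alpha
    z(i) += g - sigma * w(i)
    n(i) += g * g
  }
}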
Summary: LR with Spark
17
• Efficient: Can learn on all the data
- before: somewhat aggressive subsampling of the skips
• Possible to do feature pre-processing
- in Hadoop MapReduce much harder: only one pass over data
- drop infrequent features, TF-IDF, …
• Spark-shell as a life-saver
- helps to debug problems as one can inspect intermediate results at scale
- have yet to try Zeppelin notebooks
• Easy to unit test complex workflows
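A hedged sketch of the kind of unit test we mean, using ScalaTest and a local SparkContext; the test framework choice and the single-argument Event constructor are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class JoinSuite extends FunSuite {
  test("an impression followed by a click yields a combined 'ci' event") {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
    try {
      val impressions = sc.parallelize(Seq(("id-1", ("i", Event("id-1")))))  // stand-in events
      val clicks      = sc.parallelize(Seq(("id-1", ("c", Event("id-1")))))
      val combined    = impressions.union(clicks).reduceByKey(smartCombine).collect()
      assert(combined.head._2._1 === "ci")
    } finally {
      sc.stop()
    }
  }
}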
Spark: Lessons Learned
Upgrade!
19
• Spark has a pretty regular 3-month release schedule
• Always run with the latest version
- Lots of bugs get fixed
- Difficult to keep up with new functionality (see DataFrame vs. RDD)
• Speed improvements over the past year
Configurations
20
• Our solution
- config directory containing
‣ Logging: log4j.properties
‣ Spark itself: spark-defaults.conf
‣ our code: application.conf
- two versions of configs: local & cluster
- in YARN: specify them using --files argument & SPARK_CONF_DIR variable
• Use Typesafe’s config library for all application related configs
- provide sensible defaults for everything
- overwrite using application.conf
• Do not hard-code any configurations in code
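A minimal sketch of the Typesafe config usage (the key names are hypothetical):

import com.typesafe.config.ConfigFactory

val config = ConfigFactory.load()      // application.conf layered over reference.conf defaults

val numWorkers   = config.getInt("ctr.training.num-workers")
val modelPath    = config.getString("ctr.training.model-path")
val learningRate = config.getDouble("ctr.training.learning-rate")

On the cluster the config directory is shipped with spark-submit --files and picked up via SPARK_CONF_DIR, as noted above.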
Accumulators
21
• Use accumulators for ensuring correctness!
• Example:
- parse data, ignore event if there is a problem with the data
- use accumulator to count these failed lines
class Parser(failedLinesAccumulator: Accumulator[Int]) extends Serializable {
  def parse(s: String): Option[Event] = {
    try {
      // parsing logic goes here
      Some(...)
    } catch {
      case e: Exception =>
        failedLinesAccumulator += 1
        None
    }
  }
}

val accumulator = sc.accumulator(0, "failed lines")
val parser = new Parser(accumulator)
val events = sc.textFile("hdfs:///myfile")
  .flatMap(s => parser.parse(s))
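One caveat worth remembering: the accumulator is only populated once an action forces the lazy transformations to run, so read its value on the driver afterwards, e.g.:

val parsedCount = events.count()       // action: forces parsing and accumulator updates
println(s"parsed: $parsedCount, failed lines: ${accumulator.value}")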
RDD vs. DataFrame in Spark
22
• Initially Spark advocated the Resilient Distributed Dataset (RDD) as its data set abstraction
- type-safe
- usually stores some Scala case class
- code relatively easy to understand
• More recently Spark has been pushing towards the DataFrame
- similar to R and Python’s Pandas data frames
- some advantages
‣ less rigid types: can append columns
‣ speed
- disadvantage: code readability suffers for non-basic types
‣ user defined types
‣ user defined functions
• Have not fully migrated to it yet
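For reference, a small sketch of going between the two in Spark 1.5; the case class and its fields are made up for illustration:

import org.apache.spark.sql.SQLContext

case class LabeledImpression(userId: String, adId: String, clicked: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = labeledImpressions.toDF()                      // labeledImpressions: RDD[LabeledImpression]
df.filter($"clicked" === 1).groupBy("adId").count()     // column-based, no case class in sight
val rows = df.rdd                                       // back to an RDD, but of untyped Rows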
Every Day I’m Shuffling…
23
• Careful with operations that send a lot of data over the network
- reduceByKey
- repartition / shuffle
• Careful with sending too much data to the driver
- collect
- reduce
• We found mapPartitions & treeReduce useful in some cases (see the FTRL example)
• Play with Spark configurations: frameSize, maxResultSize, timeouts, … (see the sketch below)
[Diagram: a job DAG: textFile → flatMap → map run within a single stage; reduceByKey introduces a shuffle boundary]
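The knobs mentioned above map to real Spark settings; the values here are examples to tune per job, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.akka.frameSize", "128")            // MB; size limit for control-plane messages
  .set("spark.driver.maxResultSize", "2g")       // cap on results collected back to the driver
  .set("spark.network.timeout", "300s")          // raise if shuffles time out under heavy load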
Machine Learning in Spark
24
• Relatively basic
- some algorithms don’t scale so well
- not customizable enough for experts:
‣ optimizers that assume a regularizer
‣ built our own DSL for feature extraction & combination
‣ a lot of the APIs are not exposed, i.e. private to Spark
- will hopefully get there eventually
• Nice: new Transformer / Estimator / Pipeline approach
- Inspired by scikit-learn, makes it easy to combine different algorithms
- Requires DataFrame
- Example (from Spark docs)
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
Thank you!
