Ad Click Prediction: A View from the Trenches
Paper review by Arzam M. Kotriwala and Mazen Aly
Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013
Motivation
Motivation: Huge online ad industry
Predicting ad click-through rates is central
to the multi-billion dollar online ad industry.
Different types of ads rely heavily on learned models to predict ad click-through rates accurately, quickly, and reliably.
Search engines are paid only when users click on ads, so it is essential to show the most relevant ads.
Motivation: data-intensive problem
Predicting ad click-through rates is a massive-scale learning problem.
The goal is to:
● Learn from massive data
● Consume minimal resources
This entails handling billions of:
● Training examples
● Unique features
● Predictions per day
Contribution
Contribution
● Memory-saving techniques for the efficient execution of learning algorithms, which may also be applied to other large-scale problem areas.
● A frank account of the depth of challenges that arise when employing traditional machine learning methods in a real, complex, dynamic system.
● An enhancement of the traditional stochastic (online) gradient descent algorithm that handles sparsification of very high-dimensional data.
Solution
Solution: FTRL-Proximal learning algorithm
Sparsification is essential for minimizing memory usage at serving time.
Online Gradient Descent (OGD):
+ Yields excellent prediction accuracy
- Not very effective at producing sparse models
Regularized Dual Averaging (RDA):
+ Effective at producing sparse models
- Predictions are less accurate than OGD
Solution: the FTRL-Proximal learning algorithm
○ Combines:
  ■ The prediction accuracy of OGD
  ■ The sparsity provided by RDA
○ How? It uses elastic-net (L1 + L2) regularization in its per-coordinate update (a sketch follows).
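For concreteness, here is a minimal Python sketch of the per-coordinate FTRL-Proximal update with L1 and L2 regularization for logistic regression, following Algorithm 1 of the paper; the class/method names and hyperparameter defaults are illustrative, not the production implementation.

```python
import math

class FTRLProximal:
    """Minimal per-coordinate FTRL-Proximal sketch (logistic loss, binary
    features), following Algorithm 1 of McMahan et al., KDD 2013."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-coordinate state z_i
        self.n = {}  # per-coordinate sum of squared gradients n_i

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 thresholding gives exact zeros -> sparse model
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, features):
        """features: iterable of active (binary) feature ids."""
        score = sum(self._weight(i) for i in features)
        return 1.0 / (1.0 + math.exp(-max(min(score, 35.0), -35.0)))

    def update(self, features, y):
        """One online step on example (features, y) with y in {0, 1}."""
        p = self.predict(features)
        g = p - y  # gradient of log loss w.r.t. each active coordinate (x_i = 1)
        for i in features:
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g
        return p
```

Note how sparsity arises: a coordinate's weight is exactly zero whenever |z_i| stays below the L1 threshold, so rarely useful features never consume a stored coefficient at serving time.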
Solution: Per-Coordinate Learning Rates
● The standard theory for online gradient descent suggests a global learning rate schedule of 1/sqrt(t) that is common to all coordinates.
● Per-coordinate learning rates: the more frequently a feature occurs, the faster its learning rate decreases (see the formula below).
Huge accuracy improvement:
● Reduced AucLoss by 11.2% versus a global-learning-rate baseline.
● In the ad prediction setting, a 1% improvement is large.
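For reference, the per-coordinate learning rate for coordinate i at round t is (α and β are tuning parameters, g_{s,i} is the gradient on coordinate i at step s):

\eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{s=1}^{t} g_{s,i}^{2}}}

So coordinates that accumulate many (large) gradients see their rate shrink quickly, while rarely seen coordinates keep a relatively high learning rate.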
Solution: Memory Saving Techniques
Several tricks are used to save memory:
● Probabilistic feature inclusion
● Subsampling training data
● Encoding values with fewer bits
Solution: Probabilistic Feature Inclusion
● Typically, in high-dimensional data the vast majority of features are extremely rare.
● Poisson Inclusion
  ○ A new feature is inserted into the model with probability p.
● Bloom Filter Inclusion
  ○ Once a feature has occurred more than n times (according to the filter), it is added to the model.
A sketch of both schemes follows.
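The following Python sketch illustrates both inclusion schemes; the names and default parameters are assumptions for illustration, and a Counter stands in for the rolling counting Bloom filter described in the paper.

```python
import random
from collections import Counter

class PoissonInclusion:
    """Admit a not-yet-tracked feature with probability p (illustrative p)."""
    def __init__(self, p=0.1):
        self.p = p
        self.admitted = set()

    def admit(self, feature):
        if feature in self.admitted:
            return True
        if random.random() < self.p:
            self.admitted.add(feature)
            return True
        return False

class BloomInclusion:
    """Admit a feature once it has been seen more than n times.

    A Counter is used here for clarity; the paper uses a counting Bloom
    filter, which saves memory at the cost of occasional false-positive
    admissions."""
    def __init__(self, n=5):
        self.n = n
        self.counts = Counter()
        self.admitted = set()

    def admit(self, feature):
        if feature in self.admitted:
            return True
        self.counts[feature] += 1
        if self.counts[feature] > self.n:
            self.admitted.add(feature)
            del self.counts[feature]  # stop counting once admitted
            return True
        return False
```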
Solution: Subsampling Training Data
Train on:
● Any query for which at least one of the ads was clicked.
● A fraction r ∈ (0, 1] of the queries where none of the ads were clicked.
Fixing the sampling bias:
● Each sampled example t is given an importance weight ω_t = 1/s_t, where s_t is its sampling probability (1 for clicked queries, r otherwise), so that the expected contribution of a randomly chosen event t in the unsampled data to the sub-sampled objective function equals its contribution to the original objective (see below).
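Sketch of the bias-correction argument, with ℓ_t(w) the loss on event t: because ω_t = 1/s_t, the importance weight cancels the sampling probability in expectation,

\mathbb{E}\big[\omega_t\,\ell_t(w)\big] = s_t\,\omega_t\,\ell_t(w) + (1 - s_t)\cdot 0 = \ell_t(w),

so the weighted sub-sampled objective matches the original (unsampled) objective in expectation.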
Solution: Encoding Values with Fewer Bits
To store coefficient values, naive implementations of the Online Gradient Descent algorithm use 32- or 64-bit floating point encodings, which provide a large dynamic range and fine-grained precision.
For the paper's regularized logistic regression models, such encodings waste memory:
● Nearly all coefficient values lie in the range (-2, +2).
● Fine-grained precision is not needed.
Instead, use a fixed-point q2.13 encoding.
End result: no measurable loss in precision and 50-75% RAM savings (a sketch follows).
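A minimal Python sketch of a q2.13 encoder/decoder (16 bits per coefficient: 1 sign bit, 2 integer bits, 13 fractional bits). The randomized rounding mirrors the unbiased rounding strategy described in the paper; the constants and function names are illustrative.

```python
import math
import random

FRAC_BITS = 13
SCALE = 1 << FRAC_BITS      # 2**13 steps per unit -> q2.13 resolution
Q_MAX = (1 << 15) - 1       # 16-bit signed storage: [-32768, 32767]

def encode_q2_13(w, randomized=True):
    """Quantize a float coefficient to a 16-bit q2.13 integer.

    With randomized rounding the quantization error is zero in
    expectation, which keeps OGD-style training unbiased."""
    x = w * SCALE
    q = math.floor(x + random.random()) if randomized else round(x)
    return max(-Q_MAX - 1, min(Q_MAX, int(q)))  # clamp to storable range

def decode_q2_13(q):
    """Recover the approximate float coefficient."""
    return q / SCALE
```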
Evaluation
Evaluation
The authors evaluate model changes across several performance metrics such as AucLoss (1 − AUC), LogLoss, and SquaredError.
Progressive Validation
● Every training example is used to validate the model before the model is trained on it (a sketch follows).
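A short sketch of progressive validation, reusing the FTRLProximal sketch from the FTRL slide; the function name is assumed, and `stream` yields (features, label) pairs.

```python
import math

def progressive_logloss(model, stream):
    """Score each example with the *current* model before training on it,
    then accumulate log loss; every example serves for both validation
    and training."""
    total, count = 0.0, 0
    for features, y in stream:
        p = min(max(model.predict(features), 1e-15), 1.0 - 1e-15)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        model.update(features, y)  # train only after scoring
        count += 1
    return total / max(count, 1)
```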
Evaluation: GridViz
A high-dimensional, interactive visualization built by the authors for analyzing model performance across many slicings of the data.
Strengths
Strengths
● The paper also explains several techniques that did not work well for their
models, even though they showed promising results in other literature:
■ Aggressive feature hashing
■ Randomized feature dropout
■ Averaging models trained on different subsets of features
■ Feature vector normalization
● The FTRL algorithm:
○ Has excellent sparsity and convergence properties
○ Is about as easy to implement as gradient descent
● The memory saving techniques are presented with the same rigor
that is traditionally given to the problem of designing an effective
learning algorithm.
Weaknesses
Weaknesses
● No detailed results section in the paper; several claims are asserted without supporting numbers, e.g.:
  ○ “In practice, we observe no measurable loss using this memory saving technique”
● The SquaredError metric is used to evaluate logistic regression models.
● Important details are skipped.
  ○ e.g., how the magnitude of a feature vector is calculated during normalization.
Questions?
Backup slides
Solution: High-level system overview
