Ad Click Prediction: A View from the Trenches
Paper review by Arzam M. Kotriwala and Mazen Aly
Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013
Motivation
Motivation: Huge online ad industry
Predicting ad click-through rates is central
to the multi-billion dollar online ad industry.
Different types of ads rely heavily on learned models to predict ad click-through rates accurately, quickly, and reliably.
Search engines are paid only when users click on ads, so it is essential to show the most relevant ads.
Motivation: data-intensive problem
Predicting ad click-through rates is a massive-scale learning problem.
The goal is to:
● Learn from massive data
● Consume minimal resources
This entails handling billions of:
● Training examples
● Unique features
● Predictions per day
Contribution
Contribution
● Memory-saving techniques for the efficient execution of learning algorithms, which may also be applied to other large-scale problem areas.
● A frank account of the depth of challenges that arise when employing traditional machine learning methods in a real, complex, dynamic system.
● An enhancement of the traditional stochastic (online) gradient descent algorithm that handles sparsification of very high-dimensional data.
Solution
Solution: FTRL-Proximal learning algorithm
Sparsification is essential for minimizing memory usage at serving time.
Online Gradient Descent (OGD):
+ Yields excellent prediction accuracy
- Not very effective at producing sparse models
Regularized Dual Averaging (RDA):
+ Effective at producing sparse models
- Predictions are less accurate than OGD
Solution: the FTRL-Proximal learning algorithm
○ Combines:
  ■ The prediction accuracy of OGD
  ■ The sparsity provided by RDA
○ How? It uses elastic-net (L1 + L2) regularization in its per-coordinate update (a sketch follows).
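For concreteness, here is a minimal Python sketch of the per-coordinate FTRL-Proximal update with L1 and L2 regularization for logistic regression, following Algorithm 1 of the paper; the class/method names and hyperparameter defaults are illustrative, not the production implementation.

```python
import math

class FTRLProximal:
    """Minimal per-coordinate FTRL-Proximal sketch (logistic loss, binary
    features), following Algorithm 1 of McMahan et al., KDD 2013."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-coordinate state z_i
        self.n = {}  # per-coordinate sum of squared gradients n_i

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 thresholding gives exact zeros -> sparse model
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, features):
        """features: iterable of active (binary) feature ids."""
        score = sum(self._weight(i) for i in features)
        return 1.0 / (1.0 + math.exp(-max(min(score, 35.0), -35.0)))

    def update(self, features, y):
        """One online step on example (features, y) with y in {0, 1}."""
        p = self.predict(features)
        g = p - y  # gradient of log loss w.r.t. each active coordinate (x_i = 1)
        for i in features:
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g
        return p
```

Note how sparsity arises: a coordinate's weight is exactly zero whenever |z_i| stays below the L1 threshold, so rarely useful features never consume a stored coefficient at serving time.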
Solution: Per-Coordinate Learning Rates
● The standard theory for online gradient descent suggests a global learning rate schedule of 1/sqrt(t) that is common to all coordinates.
● Per-coordinate learning rates: the more frequently a feature occurs, the faster its learning rate decreases (see the formula below).
Huge accuracy improvement:
● Reduced AucLoss by 11.2% versus a global-learning-rate baseline.
● In the ad prediction setting, a 1% improvement is large.
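For reference, the per-coordinate learning rate for coordinate i at round t is (α and β are tuning parameters, g_{s,i} is the gradient on coordinate i at step s):

\eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{s=1}^{t} g_{s,i}^{2}}}

So coordinates that accumulate many (large) gradients see their rate shrink quickly, while rarely seen coordinates keep a relatively high learning rate.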
Solution: Memory Saving Techniques
Several tricks are used to save memory:
● Probabilistic feature inclusion
● Subsampling training data
● Encoding values with fewer bits
Solution: Probabilistic Feature Inclusion
● Typically, in high-dimensional data the vast majority of features are extremely rare.
● Poisson Inclusion
  ○ A new feature is inserted into the model with probability p.
● Bloom Filter Inclusion
  ○ Once a feature has occurred more than n times (according to the filter), it is added to the model.
A sketch of both schemes follows.
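The following Python sketch illustrates both inclusion schemes; the names and default parameters are assumptions for illustration, and a Counter stands in for the rolling counting Bloom filter described in the paper.

```python
import random
from collections import Counter

class PoissonInclusion:
    """Admit a not-yet-tracked feature with probability p (illustrative p)."""
    def __init__(self, p=0.1):
        self.p = p
        self.admitted = set()

    def admit(self, feature):
        if feature in self.admitted:
            return True
        if random.random() < self.p:
            self.admitted.add(feature)
            return True
        return False

class BloomInclusion:
    """Admit a feature once it has been seen more than n times.

    A Counter is used here for clarity; the paper uses a counting Bloom
    filter, which saves memory at the cost of occasional false-positive
    admissions."""
    def __init__(self, n=5):
        self.n = n
        self.counts = Counter()
        self.admitted = set()

    def admit(self, feature):
        if feature in self.admitted:
            return True
        self.counts[feature] += 1
        if self.counts[feature] > self.n:
            self.admitted.add(feature)
            del self.counts[feature]  # stop counting once admitted
            return True
        return False
```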
Solution: Subsampling Training Data
Train on:
● Any query for which at least one of the ads was clicked.
● A fraction r ∈ (0, 1] of the queries where none of the ads were clicked.
Fixing the sampling bias:
● Each sampled example t is given an importance weight ω_t = 1/s_t, where s_t is its sampling probability (1 for clicked queries, r otherwise), so that the expected contribution of a randomly chosen event t in the unsampled data to the sub-sampled objective function equals its contribution to the original objective (see below).
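Sketch of the bias-correction argument, with ℓ_t(w) the loss on event t: because ω_t = 1/s_t, the importance weight cancels the sampling probability in expectation,

\mathbb{E}\big[\omega_t\,\ell_t(w)\big] = s_t\,\omega_t\,\ell_t(w) + (1 - s_t)\cdot 0 = \ell_t(w),

so the weighted sub-sampled objective matches the original (unsampled) objective in expectation.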
Solution: Encoding Values with Fewer Bits
To store coefficient values, naive implementations of the Online Gradient Descent algorithm use 32- or 64-bit floating point encodings, which provide a large dynamic range and fine-grained precision.
For the paper's regularized logistic regression models, such encodings waste memory:
● Nearly all coefficient values lie in the range (-2, +2).
● Fine-grained precision is not needed.
Instead, use a fixed-point q2.13 encoding.
End result: no measurable loss in precision and 50-75% RAM savings (a sketch follows).
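A minimal Python sketch of a q2.13 encoder/decoder (16 bits per coefficient: 1 sign bit, 2 integer bits, 13 fractional bits). The randomized rounding mirrors the unbiased rounding strategy described in the paper; the constants and function names are illustrative.

```python
import math
import random

FRAC_BITS = 13
SCALE = 1 << FRAC_BITS      # 2**13 steps per unit -> q2.13 resolution
Q_MAX = (1 << 15) - 1       # 16-bit signed storage: [-32768, 32767]

def encode_q2_13(w, randomized=True):
    """Quantize a float coefficient to a 16-bit q2.13 integer.

    With randomized rounding the quantization error is zero in
    expectation, which keeps OGD-style training unbiased."""
    x = w * SCALE
    q = math.floor(x + random.random()) if randomized else round(x)
    return max(-Q_MAX - 1, min(Q_MAX, int(q)))  # clamp to storable range

def decode_q2_13(q):
    """Recover the approximate float coefficient."""
    return q / SCALE
```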
Evaluation
Evaluation
The authors evaluate model changes across several performance metrics such as AucLoss (1 − AUC), LogLoss, and SquaredError.
Progressive Validation
● Every training example is used to validate the model before the model is trained on it (a sketch follows).
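A short sketch of progressive validation, reusing the FTRLProximal sketch from the FTRL slide; the function name is assumed, and `stream` yields (features, label) pairs.

```python
import math

def progressive_logloss(model, stream):
    """Score each example with the *current* model before training on it,
    then accumulate log loss; every example serves for both validation
    and training."""
    total, count = 0.0, 0
    for features, y in stream:
        p = min(max(model.predict(features), 1e-15), 1.0 - 1e-15)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        model.update(features, y)  # train only after scoring
        count += 1
    return total / max(count, 1)
```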
Evaluation: GridViz
A high-dimensional, interactive visualization built by the authors for analyzing model performance across many slicings of the data.
Strengths
Strengths
● The paper also explains several techniques that did not work well for their
models, even though they showed promising results in other literature:
■ Aggressive feature hashing
■ Randomized feature dropout
■ Averaging models trained on different subsets of features
■ Feature vector normalization
● The FTRL algorithm:
○ Has excellent sparsity and convergence properties
○ Is about as easy to implement as gradient descent
● The memory saving techniques are presented with the same rigor
that is traditionally given to the problem of designing an effective
learning algorithm.
Weaknesses
Weaknesses
● No detailed results section in the paper; several claims are asserted without supporting numbers, e.g.:
  ○ “In practice, we observe no measurable loss using this memory saving technique”
● The SquaredError metric is used to evaluate logistic regression models.
● Important details are skipped.
  ○ e.g., how the magnitude of a feature vector is calculated during normalization.
Questions?
Backup slides
Solution: High-level system overview
