2019 Triangle Machine Learning Day - Stacking Audience Models -- Using an Ensemble Approach for Predictive Modeling - Susan Xia, September 20, 2019

© 2019 Valassis Digital | PUBLIC
VALASSISDIGITAL.COM
Stacking Audience Models
Using an Ensemble Approach for Predictive Modeling
Name: Susan Xia
Position: Data Scientist
Email: xias@valassis.com
Date: September 20th, 2019

Stacking Audience Models
OVERVIEW
How can we combine our best models to predict the rare event of an internet user making a purchase online?
• The Business Problem
• Ensemble Learning Methods
• Stacking in Ensemble Learning
• Stacker Optimization
• Implementation and Deployment

THE BUSINESS PROBLEM

Digital Advertising: How Does It Work?
INTRODUCTION

[Diagram, built up over five slides: the path from a page view to a conversion]
• User browses a website
• Ad Exchange and Valassis Digital: 120 billion requests daily
• Win an auction and serve an ad: response time < 10 ms for 99.9% of requests
• Drops a cookie
• User performs tracked activities, among 1 billion users
• User makes a purchase
• Conversion signal comes back via a pixel

The Problem
BUSINESS CONTEXT
• In digital advertising, we run campaigns for our clients (usually different brands and ad agencies).
• The client has certain KPIs to drive (e.g., increase the purchase rate of its products).
• We receive credit if we serve an ad to a user and the user later purchases our client’s product.
So we want to serve ads to people who are likely to make a purchase (or, more generally, convert) in the future. To achieve this, we need to:
Given the data we have about internet users, predict the likelihood of them converting.

Predict Target Outcome
BUSINESS PROBLEM
Models → Prediction: Will the user make a purchase?

Predict Target Outcome
BUSINESS PROBLEM
Additional requirements:
• Allow multiple contributors over time.
• Can generalize to other use cases.

ENSEMBLE LEARNING METHODS

Ensemble Learning
THE TOOL
• Combine multiple models to give a single prediction.
• Increase the diversity of the models or algorithms used.
• Has been shown to improve the predictive power of the model.
Strength of Many
Combine the predictive power of multiple learners to obtain better predictions than with one learner alone.

Ensemble Learning
ENSEMBLE EXAMPLE
Bagging
• Reduces variance of the base learners.
• Bootstraps the training data (samples with replacement) and trains learners in parallel.
• Each learner is often trained on a random subset of the training data.
• Learners vote on the outcome with weights.
Bagging Example: Random Forest
Image from: Isied, Anwar & Tamimi, Hashem. (2015). Using Random Forest (RF) as a transfer learning classifier for detecting Error-Related Potential (ErrP) within the context of P300-Speller.
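To make the bagging steps concrete, here is a minimal sketch in plain Python. It is an illustration only: the one-feature threshold "stump" learner, the toy data and the equal vote weights are all assumptions, not the models used in the talk.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Toy base learner: the threshold on x with the fewest training errors."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        errors = sum((x >= threshold) != bool(y) for x, y in rows)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]


def bagged_predict(thresholds, x):
    """Majority vote across the stumps (equal weights, an assumption)."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
data = [(x, int(x >= 5)) for x in range(10)]  # hypothetical: label is 1 when x >= 5
stumps = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagged_predict(stumps, 8), bagged_predict(stumps, 1))
```

Each stump sees a different bootstrap sample, so the individual thresholds vary; the vote averages that variation out.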

Ensemble Learning
ENSEMBLE EXAMPLE
Boosting
• Reduces bias of the base learners.
• Builds learners sequentially.
• Samples misclassified by the previous learner get weighted more in subsequent learners.
Image from https://blog.bigml.com/2017/03/14/introduction-to-boosted-trees/
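The reweighting idea can be sketched with one round of the standard AdaBoost update. The four-sample weights below are made up for illustration, and the sketch assumes the learner's weighted error is strictly between 0 and 1.

```python
import math


def reweight(weights, correct):
    """One AdaBoost-style round: upweight the misclassified samples."""
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)  # this learner's vote weight
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, correct)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]  # hypothetical: the last sample was misclassified
new_weights, alpha = reweight(weights, correct)
print(new_weights)
```

After the update the misclassified sample carries half of the total weight, so the next learner is pushed to get it right.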

Boosting Examples
BOOSTING
Adaptive Boosting and Gradient Boosting
• Both build learners sequentially.
• In gradient boosting, each model is fitted to predict the residual error of the previous model.
Images from Marsh, Brendan. (2016). Multivariate Analysis of the Vector Boson Fusion Higgs Boson, and
https://www.quora.com/How-would-you-explain-gradient-boosting-machine-learning-technique-in-no-more-than-300-words-to-non-science-major-college-students

Boosting Examples
BOOSTING
XGBoost
• A full-featured, efficient implementation of gradient boosted trees.
• Supports fast learning through distributed and randomized computing.
• Uses an approximation algorithm to evaluate and find tree splits.
• Supports regularization and tree pruning.
• Can intelligently and efficiently handle missing values.
In practice, it has been shown to be:
• good at predicting rare events.
• good at distinguishing signal from noise.

STACKING IN ENSEMBLE LEARNING

Stacking
STACKING
Stacking is another ensemble method, in which
• models are arranged in a layered structure.
• predictions from the models in one layer are used as inputs to the next layer.
• new models train on these inputs.
• the last layer produces the final result.
Image from http://supunsetunga.blogspot.com/

Benefits of Stacking
STACKING
• Stacking increases the diversity of the algorithms and models used.
• Stacking can decrease bias: rather than winner takes all, we combine datasets.
• Enables “parallel development”: each individual base model can be developed and tuned by different individuals.
• We can capture different “categories” of features with different base models.
• If we were to use a single big model, combining all features could lead to very high dimensionality.

Stacker Details
EXAMPLE
Base classifiers:
• XGBoost Classifier: predicts conversion based on browsing history → probability of conversion
• Random Forest Classifier: predicts conversion based on response to previous ads → probability of conversion
• Logistic Regression: predicts conversion based on user’s activity level → probability of conversion
Stacker:
• XGBoost Classifier: predicts conversion based on all base classifiers → probability of conversion

STACKER OPTIMIZATION

The Machine Learning Pipeline
MODEL IMPROVEMENT
• Train: train model
• Validate: select hyperparameters
• Test: evaluate model performance
How does the machine learning pipeline work in the case of a stacker?

Stacker Tuning
VALIDATION
Tune base models
• We use K-fold cross validation.
Tune stacker model
• Each base model fits on the train folds and predicts on the test fold.
• Predictions on each test fold become new folds for the stacker.
• Recreate train and test folds for the stacker using the new folds.
[Diagram: two rows of five folds, each laid out as Test, Train, Train, Train, Train]
• Each model is tuned using randomized search in the hyperparameter space and K-fold cross validation.
• The folds of the base models and the stacker are different.
• After tuning, the models are refitted on the entire dataset.
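The fold mechanics for the stacker can be sketched as follows. Every row is predicted by a base model that never saw that row's fold, and those predictions become the stacker's training inputs. The per-category rate learner, the feature values and the fold assignment are hypothetical stand-ins for the real base models and data.

```python
def k_fold_indices(n, k):
    """Assign row i to fold i % k; return one list of test indices per fold."""
    return [list(range(fold, n, k)) for fold in range(k)]


def out_of_fold_predictions(fit, features, labels, k=5):
    """Predict every row with a model fitted without that row's fold."""
    n = len(labels)
    preds = [None] * n
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in test_idx:
            preds[i] = model(features[i])
    return preds


def fit_rate_model(values, labels):
    """Toy base learner: conversion rate per category, overall rate as fallback."""
    overall = sum(labels) / len(labels)
    totals, hits = {}, {}
    for v, y in zip(values, labels):
        totals[v] = totals.get(v, 0) + 1
        hits[v] = hits.get(v, 0) + y
    return lambda v: hits[v] / totals[v] if v in totals else overall


# Hypothetical data: two feature groups per user, one base model per group.
browsing = ["sports", "news", "sports", "shop", "shop", "news", "shop", "sports", "news", "shop"]
activity = ["low", "low", "high", "high", "high", "low", "high", "low", "low", "high"]
converted = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]

# Each row of stacker_inputs is the feature vector the stacker would train on.
stacker_inputs = list(zip(
    out_of_fold_predictions(fit_rate_model, browsing, converted),
    out_of_fold_predictions(fit_rate_model, activity, converted),
))
print(stacker_inputs[0])
```

The stacker would then be tuned with its own K-fold split over `stacker_inputs`, which is why its folds differ from those of the base models.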

Stacker Pitfalls
PITFALLS
Stacking can lead to overfitting. We take precautions against it:
• Make sure there is enough data to support stacking.
• Use some form of regularization (cap complexity-related hyperparameters, use early stopping, etc.).
• Use “mutually orthogonal” base models: when we combine the predictions of different models using stacking, it is desirable that the predictions made by the base models have low correlation. This would suggest that the models are skillful but in different ways, allowing the stacker to figure out how to get the best from each model for an improved score.
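One simple way to check for low correlation is to compare the base models' predictions pairwise. The sketch below uses the Pearson correlation; the model names and prediction values are invented for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length prediction lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical predicted conversion probabilities for the same five users.
predictions = {
    "browsing_model": [0.9, 0.1, 0.8, 0.2, 0.7],
    "ads_model": [0.8, 0.2, 0.9, 0.1, 0.6],
    "activity_model": [0.2, 0.6, 0.7, 0.1, 0.5],
}
names = list(predictions)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        corr = pearson(predictions[first], predictions[second])
        print(first, second, round(corr, 2))
```

A pair with correlation close to 1 adds little new information to the stacker; in this toy data the browsing and ads models are nearly redundant, while the activity model is not.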

Evaluation Metrics
METRICS
There are many metrics for evaluating the performance and effectiveness of a model. The choice of evaluation metrics should be based on the problem the model is trying to solve.
Before we decide on the appropriate metrics to use, let us revisit the problem…

Audience Overview
THE GOALS
Each ad campaign has an associated ranked list of users that we want to serve ads to. We call this ranked list the “audience” of the campaign.
The probability of converting predicted by the stacking model provides a natural ranking of users.
For each audience, our goals are:
Goal 1: Put as many future converters as possible in the audience
• Serve ads to people who are likely to make a purchase.
Goal 2: Rank future converters higher than non-converters so that:
• People who are likely to make a purchase will be served first.
• People who are likely to make a purchase will be priced higher.
Based on the two goals, we chose our evaluation metrics to be: 1. Recall, 2. NDCG
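For Goal 1, recall can be sketched as the share of all converters that end up in the audience. The example assumes the audience is the top k users by predicted probability; the scores and the cutoff are hypothetical.

```python
def recall_at_k(scores, converted, k):
    """Share of all converters that land in the top-k users by score."""
    ranked = sorted(zip(scores, converted), key=lambda pair: pair[0], reverse=True)
    found = sum(label for _, label in ranked[:k])
    total = sum(converted)
    return found / total if total else 0.0


scores = [0.90, 0.75, 0.60, 0.40, 0.10]  # hypothetical stacker outputs
converted = [0, 1, 1, 0, 0]              # 1 = the user later converted
print(recall_at_k(scores, converted, k=3))
print(recall_at_k(scores, converted, k=2))
```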

What is NDCG?
NDCG
Normalized Discounted Cumulative Gain
Evaluates the ability of a ranked list to achieve the desired result: relevant items (future converters) at the top, irrelevant ones (non-converters) at the bottom.
• Cumulative Gain: we get points for putting each relevant item in the list.
• Discounted: we get fewer points for putting relevant items lower in the list.
• Normalized: divide by the discounted cumulative gain of a perfect ranking so that we can compare amongst lists of different lengths.
Due to normalization, NDCG ranges from 0 (no converters in the list) to 1 (all the converters at the top of the list).

What is NDCG?
THE MATH
Normalized Discounted Cumulative Gain
Note that $rel_i$ is always 1 or 0 in our case: either you are a converter (1) or not (0).
Cumulative Gain at position p
$$CG_p = \sum_{i=1}^{p} rel_i$$
Discounted Cumulative Gain at position p
$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}$$
Normalized Discounted Cumulative Gain at position p
$$NDCG_p = \frac{DCG_p}{IDCG_p}$$
This is quite simply DCG normalized by the best score that list could receive for DCG.
So for our audiences, IDCG is DCG with the converters all at the top of the list.
Retains advantage of DCG so that converters higher in the list are worth more.
Adds advantage that we can compare amongst lists due to normalization.

NDCG Calculation
EXAMPLE
Score | Rank | Converter | Addition to DCG | Ideal addition to DCG
      | 1    | 0         | 0               | 1
      | 2    | 1         | 0.63            | 0.63
      | 3    | 1         | 0.5             | 0
      | 4    | 0         | 0               | 0
      | 5    | 0         | 0               | 0
Sum   |      |           | 1.13            | 1.63
NDCG = DCG/IDCG = 1.13/1.63 = 0.69
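The table can be reproduced with a few lines of Python. This is a minimal sketch of the DCG and NDCG definitions for binary relevance; the function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a list already in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal ordering (converters first)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


ranked_converters = [0, 1, 1, 0, 0]  # the Converter column, ranks 1 to 5
print(round(dcg(ranked_converters), 2))   # 1.13
print(round(ndcg(ranked_converters), 2))  # 0.69
```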

IMPLEMENTATION AND DEPLOYMENT

Implementation and Deployment
SCALABILITY
Things to consider when implementing the model deployment pipeline:
• Scalable: need to be able to fit hundreds of models daily and, for each model, to predict on 1 billion users.
• User-friendly: allow data scientists to develop models in Python.
• Generalizable: the process should be generalizable to other model fitting and deployment use cases.

Our Solution
SCALABILITY
Hunch, an in-house library offering the functionality of Python and the speed of Scala
• Model Fitting (smaller scale): ~200k users to fit for each model. Serialize fitted models in Python.
• Model Scoring (larger scale): ~1 billion users to score for each model. De-serialize in Scala.
Hunch supports serialization of scikit-learn models and XGBoost models into Hunch representations, and de-serialization of Hunch representations into Scala functions.
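The general fit-then-serialize pattern can be illustrated with a small sketch. This is not Hunch's API or format, which the slides do not show: the JSON layout and function names are hypothetical, and both halves are written in Python here, whereas the talk de-serializes in Scala.

```python
import json
import math


def serialize_logistic(weights, intercept):
    """Write a fitted logistic regression as a plain JSON document (hypothetical format)."""
    return json.dumps({"type": "logistic", "weights": weights, "intercept": intercept})


def deserialize(payload):
    """Turn the JSON document back into a function: feature vector -> probability."""
    spec = json.loads(payload)
    if spec["type"] != "logistic":
        raise ValueError("unsupported model type: " + spec["type"])
    weights, intercept = spec["weights"], spec["intercept"]

    def score(vector):
        z = intercept + sum(w * x for w, x in zip(weights, vector))
        return 1.0 / (1.0 + math.exp(-z))

    return score


payload = serialize_logistic([0.8, -0.4], intercept=-1.0)  # made-up coefficients
score = deserialize(payload)
print(score([1.0, 2.0]))
```

Because the serialized form holds only numbers and a type tag, the scoring side needs no Python model library, which is the property that lets scoring run in a different runtime.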

Performance
SCALABILITY
The de-serializers in Scala translate the Hunch representations into a Scala function that takes a vector and emits class likelihoods. This provides huge improvements in speed over Python models.
Key Statistics
• We run 600 models daily to predict on our users; each user is a row, a feature vector of 7k+ entries.
• We make an estimated 200 billion row evaluations.
• The models are run in batch and take ~5.4 hours serially and ~18,700,000 virtual core seconds (5300 hours) in Spark.
