Why start using uplift models for more efficient marketing campaigns

Causal Inference + Estimating Heterogeneous Treatment Effects using ML

Agenda
● The Fundamental Problem in Causal Inference
● ATE vs CATE
● Why do we care about CATE?
● Propensity Score Matching
● Meta-Learners
○ S Learner
○ T Learner
○ X Learner
● Uplift Curves
● Interpreting/Explaining the Lift
● Validating the Model

The Fundamental Problem in Causal Inference
[Figure: two panels. "Impossible": Steve in our universe and a copy of Steve in an alternate universe, so that both outcomes can be observed for the same person. "Gold Standard": variants A and B are each given to a group of "people like Steve".]

It’s Much Harder in Observational Data
[Figure: in observational data, the "Clicked" and "Did Not Click" groups mix "people like Steve" with "people unlike Steve". In the gold standard, groups A and B both consist of "people like Steve".]

ATE vs CATE
Average Treatment Effect (ATE), population level:
ATE = E[Y | Treatment] - E[Y | Control]
Conditional Average Treatment Effect (CATE), user/segment level:
CATE = E[Y | Treatment, X] - E[Y | Control, X]

Why do we care about CATE?
ATE can mask subgroups with big CATEs
If the experimental population contains subgroups with positive CATEs and subgroups with negative CATEs, the ATE can be close to zero even though the CATEs within the subgroups are statistically significant.
Negative CATEs
Most of the time you want to target the users with the highest predicted uplift. You also want to avoid targeting users who might have a negative CATE. For example, some customers could be put off by CRM comms.
Target those with the highest uplift
If the treatment costs money, it makes sense to target only a subset of the total population. Most of the time, the best way to do this is to select the subgroup with the highest predicted treatment effect. Example: churn prevention with incentives.
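To make the masking effect concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: a single binary `segment` feature, a randomized treatment, and true effects of +1.0 and -1.0 in the two segments. It estimates the ATE and the per-segment CATE as simple differences in means.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical setup: one binary segment feature and a randomized treatment.
segment = rng.integers(0, 2, size=n)
treated = rng.integers(0, 2, size=n)

# Assumed true effects: +1.0 in segment 1, -1.0 in segment 0.
true_effect = np.where(segment == 1, 1.0, -1.0)
y = 5.0 + true_effect * treated + rng.normal(0.0, 1.0, size=n)

def diff_in_means(y, treated):
    """E[Y | Treatment] - E[Y | Control], estimated from sample means."""
    return y[treated == 1].mean() - y[treated == 0].mean()

# Population level: the opposite effects roughly cancel out.
ate = diff_in_means(y, treated)

# Segment level: condition on X (here, the segment) before comparing.
cate = {s: diff_in_means(y[segment == s], treated[segment == s]) for s in (0, 1)}

print(f"ATE: {ate:+.2f}")
for s, effect in cate.items():
    print(f"CATE for segment {s}: {effect:+.2f}")
```

In this setup the ATE comes out near zero while each segment shows a clear effect of opposite sign, which is the situation described above.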

Propensity Score Matching
Procedure
1. Build a propensity (binary classification) model for all users.
2. For every user uT in Treatment,
3. Calculate a vector of distances/similarities from uT to all Control users
4. Select the top k similar users (k=1 to achieve a balanced post-match set), filtered by a specified threshold/caliper
5. Repeat Steps 2-4, with or without replacement, until all users in the Treatment group have their corresponding matched Control user
[Figure: Treatment and Control groups, each a mix of "people like Steve" and "people unlike Steve".]
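The matching procedure can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the presenters' implementation: the propensity model is a plain logistic regression fitted with Newton's method in NumPy, the distance is the absolute difference in propensity score, matching is greedy 1-to-1 without replacement, and the caliper of 0.05 and the synthetic data are arbitrary choices.

```python
import numpy as np

def fit_propensity(x, t, n_iter=25):
    """Logistic regression via Newton's method; returns P(treated | x) per user."""
    xb = np.column_stack([np.ones(len(x)), x])  # add an intercept column
    w = np.zeros(xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-xb @ w))
        grad = xb.T @ (t - p)
        hess = (xb * (p * (1 - p))[:, None]).T @ xb
        w += np.linalg.solve(hess + 1e-8 * np.eye(len(w)), grad)
    return 1.0 / (1.0 + np.exp(-xb @ w))

def match_one_to_one(ps, t, caliper=0.05):
    """Greedy k=1 matching on propensity score, without replacement.

    Returns (treated_index, control_index) pairs; treated users with no
    control inside the caliper stay unmatched.
    """
    available = list(np.flatnonzero(t == 0))
    pairs = []
    for i in np.flatnonzero(t == 1):
        if not available:
            break
        dists = np.abs(ps[available] - ps[i])  # distances to all free controls
        j = int(np.argmin(dists))
        if dists[j] <= caliper:
            pairs.append((int(i), int(available.pop(j))))
    return pairs

# Synthetic observational data (assumed): users with larger x[:, 0]
# are more likely to end up in the Treatment group.
rng = np.random.default_rng(0)
n = 2_000
x = rng.normal(size=(n, 2))
t = (rng.random(n) < 1 / (1 + np.exp(-(x[:, 0] - 1.0)))).astype(int)

ps = fit_propensity(x, t)
pairs = match_one_to_one(ps, t)
ti, ci = map(np.array, zip(*pairs))

gap_before = x[t == 1, 0].mean() - x[t == 0, 0].mean()
gap_after = x[ti, 0].mean() - x[ci, 0].mean()
print(f"matched pairs: {len(pairs)}")
print(f"covariate gap before: {gap_before:.2f}, after: {gap_after:.2f}")
```

Comparing the covariate gap before and after matching is one simple balance check; the matched groups should look much more alike than the raw groups.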

[Figure: timeline with three 3-month windows (Pre-Treatment Period, Treatment Period, Post-Treatment Period) for the Treatment Group and the Control Group; stages labeled Feature Collection, Treatment Observation, and GB Observation.]
- Treatment Group: Riders who converted to Eaters within the treatment period
- Control Group: Riders who had not converted to Eaters by the end of the post-treatment period

[Figure: Treatment and Control groups compared Before Matching and After Matching.]

Family of Meta-Learners
[Figure: overview diagrams of the S Learner, T Learner, X Learner, and R Learner.]

S Learner
Procedure
1. Create a binary feature is_treatment, indicating whether a user is from the treatment group
2. Train a single (S) model on all users, with is_treatment as one of the input features
3. For all users, set is_treatment to 1 and calculate yhat(is_treatment=1)
4. For all users, set is_treatment to 0 and calculate yhat(is_treatment=0)
5. CATE = yhat(is_treatment=1) - yhat(is_treatment=0)
[Figure: S Learner. A single model is trained on Treatment (T) and Control (C) users together; CATE = T prediction - C prediction.]
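A minimal sketch of the S Learner procedure on synthetic data, where the true effect is known. The base model here is ordinary least squares with treatment-by-feature interaction terms, which is an assumption made to keep the example dependency-free; any regressor that takes is_treatment as a feature could be used instead. The data-generating process and the `design` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=(n, 2))
is_treatment = rng.integers(0, 2, size=n)   # step 1: binary treatment feature
true_cate = 1.0 + 2.0 * x[:, 0]             # assumed ground-truth effect
y = x[:, 1] + true_cate * is_treatment + rng.normal(0.0, 1.0, size=n)

def design(x, t):
    """Feature matrix: intercept, covariates, treatment flag, interactions."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    return np.column_stack([np.ones(len(x)), x, t, t * x])

# Step 2: train a single model on all users.
coef, *_ = np.linalg.lstsq(design(x, is_treatment), y, rcond=None)

# Steps 3-4: score every user twice, once as treated and once as control.
yhat_treated = design(x, np.ones(n)) @ coef
yhat_control = design(x, np.zeros(n)) @ coef

# Step 5: CATE is the treated prediction minus the control prediction.
cate = yhat_treated - yhat_control
mae = np.abs(cate - true_cate).mean()
print(f"mean abs error vs. true CATE: {mae:.3f}")
```

Without the interaction terms, a linear single model would assign the same effect to every user, which is one reason flexible base models are commonly used with the S Learner.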

T Learner
Procedure
1. Train two (T) separate models, one for the Treatment group and one for the Control group
2. For all users, predict the output based on the Treatment model, i.e. yhat_T_model
3. For all users, predict the output based on the Control model, i.e. yhat_C_model
4. CATE = yhat_T_model - yhat_C_model
[Figure: T Learner. ModelT is trained on Treatment (T) users and ModelC on Control (C) users; CATE = T prediction - C prediction.]
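The T Learner steps can be sketched the same way. As before, the least-squares base model (`fit_ols`, `predict_ols`) and the synthetic data are assumptions chosen for a self-contained example, not a prescribed implementation.

```python
import numpy as np

def fit_ols(x, y):
    """Least-squares fit with an intercept; stands in for any regressor."""
    xb = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(xb, y, rcond=None)
    return coef

def predict_ols(coef, x):
    """Predictions of a model fitted with fit_ols."""
    return np.column_stack([np.ones(len(x)), x]) @ coef

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)
true_cate = 1.0 + 2.0 * x[:, 0]             # assumed ground-truth effect
y = x[:, 1] + true_cate * t + rng.normal(0.0, 1.0, size=n)

# Step 1: one model per group.
model_t = fit_ols(x[t == 1], y[t == 1])
model_c = fit_ols(x[t == 0], y[t == 0])

# Steps 2-3: predict for all users with both models.
yhat_t_model = predict_ols(model_t, x)
yhat_c_model = predict_ols(model_c, x)

# Step 4: CATE is the difference of the two predictions.
cate = yhat_t_model - yhat_c_model
mae = np.abs(cate - true_cate).mean()
print(f"mean abs error vs. true CATE: {mae:.3f}")
```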

X Learner
Procedure
1. Train two separate models, as in the T Learner case
2. For Control users, predict yhat_T_model
3. For Treatment users, predict yhat_C_model
4. For Control users, compute
○ tau_C_users = yhat_T_model_C_users - y_C_users
○ Build a model to predict tau_C_users
5. For Treatment users, compute
○ tau_T_users = y_T_users - yhat_C_model_T_users
○ Build a model to predict tau_T_users
6. For all users, CATE = (1 - p) * tauhat_T + p * tauhat_C, where tauhat_T and tauhat_C are the predictions of the models built in steps 5 and 4, and p is the propensity score
[Figure: X Learner. ModelT and ModelC are trained on Treatment (T) and Control (C) users; each group's actual outcomes are compared with the other model's predictions to estimate per-group effects, and the CATE is their propensity-weighted average.]
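A rough sketch of the X Learner steps, again with assumptions labeled: least-squares base models, synthetic data with a known effect, an unbalanced design with about 20% of users treated, and a constant propensity score equal to the treated share (reasonable only because treatment is randomized in this simulation; with observational data a fitted propensity model would be needed).

```python
import numpy as np

def fit_ols(x, y):
    """Least-squares fit with an intercept; stands in for any regressor."""
    xb = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(xb, y, rcond=None)
    return coef

def predict_ols(coef, x):
    """Predictions of a model fitted with fit_ols."""
    return np.column_stack([np.ones(len(x)), x]) @ coef

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=(n, 2))
t = (rng.random(n) < 0.2).astype(int)       # unbalanced: about 20% treated
true_cate = 1.0 + 2.0 * x[:, 0]             # assumed ground-truth effect
y = x[:, 1] + true_cate * t + rng.normal(0.0, 1.0, size=n)
x_t, y_t = x[t == 1], y[t == 1]
x_c, y_c = x[t == 0], y[t == 0]

# Step 1: outcome models, as in the T Learner.
model_t = fit_ols(x_t, y_t)
model_c = fit_ols(x_c, y_c)

# Steps 2 and 4: imputed effects for Control users, then a model on them.
tau_c_users = predict_ols(model_t, x_c) - y_c
tau_model_c = fit_ols(x_c, tau_c_users)

# Steps 3 and 5: imputed effects for Treatment users, then a model on them.
tau_t_users = y_t - predict_ols(model_c, x_t)
tau_model_t = fit_ols(x_t, tau_t_users)

# Step 6: propensity-weighted average of the two effect models.
p = np.full(n, t.mean())                    # constant propensity (assumption)
cate = (1 - p) * predict_ols(tau_model_t, x) + p * predict_ols(tau_model_c, x)
mae = np.abs(cate - true_cate).mean()
print(f"mean abs error vs. true CATE: {mae:.3f}")
```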

Targeting Users with Highest Uplift
[Figure: uplift curve. x-axis: Population Targeted (%), from 0% to 100%; y-axis: Cumulative Uplift.]

[Figure: the same uplift curve, annotated at 10% of the population: "40% uplift achieved from targeting just 10% of users".]
*Note: x-axis not drawn to scale; the annotation serves as an interpretation example

[Figure: the same uplift curve, annotated with two regions: "Offer promos to these customers!" and "Stop spamming these customers!"]
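One common way to build such a curve is sketched below; the exact definition used for the slides is not stated, so treat this as an assumption. Users are sorted by predicted uplift, and for each top fraction the cumulative uplift is taken as the treated-minus-control difference in mean outcome within that fraction, scaled by the number of users targeted. The `score` array is a hypothetical stand-in for a meta-learner's CATE predictions.

```python
import numpy as np

def cumulative_uplift(y, t, score, n_bins=10):
    """Cumulative uplift when targeting the top-scored fraction of users.

    Assumes every cutoff contains both treated and control users.
    """
    order = np.argsort(-score)               # highest predicted uplift first
    y, t = y[order], t[order]
    cutoffs = np.arange(1, n_bins + 1) * len(y) // n_bins
    fractions, uplift = [0.0], [0.0]
    for k in cutoffs:
        y_k, t_k = y[:k], t[:k]
        diff = y_k[t_k == 1].mean() - y_k[t_k == 0].mean()
        fractions.append(k / len(y))
        uplift.append(diff * k)
    return np.array(fractions), np.array(uplift)

# Synthetic experiment (assumed): the true effect is positive for some
# users and negative for others.
rng = np.random.default_rng(0)
n = 10_000
x0 = rng.normal(size=n)
t = rng.integers(0, 2, size=n)
y = x0 * t + rng.normal(0.0, 1.0, size=n)
score = x0 + rng.normal(0.0, 0.5, size=n)    # noisy stand-in for predicted CATE

fractions, uplift = cumulative_uplift(y, t, score)
peak = int(np.argmax(uplift))
print(f"cumulative uplift peaks at {fractions[peak]:.0%} of the population")
```

In this simulation the curve rises while users with positive effects are added and falls once users with negative effects are included, which mirrors the "offer promos" and "stop spamming" regions.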

What’s actually “causing” the lift?

Validating the Estimated Treatment Effects
Synthetic Data
We can use different synthetic data generation processes to generate data where we know the true labels (treatment effects). This allows us to measure the accuracy of the CATE estimates, but the downside is that the result is highly dependent on the data generation process (and in reality the data you observe will most likely not follow the same process).
Consistency
As with any other machine learning problem, we should run all meta-learners and observe how different the results are in each case. If we observe a high level of inconsistency, it’s likely that the input data is too noisy, or that there isn’t enough data for the meta-learners to learn from.
Experimentation
Recall that the gold standard for measuring ATE is running a randomized controlled experiment (i.e. an A/B test). The same applies here! We can measure the ATE of the experiment to validate whether the ATE of our meta-learner is accurate. But this won’t necessarily prove that the CATE is accurate at the user level.

Sensitivity Analysis: measuring the robustness of meta-learners
Subset Validation
Remove a random subset of the data, then re-train the meta-learner.
Replace/Add Irrelevant Confounder
Add/replace a random variable to introduce noise to the system, then re-train the meta-learner.
Placebo Treatment
Replace the treatment with a random variable, then re-train the meta-learner.
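As an illustration of the placebo treatment check, the sketch below re-trains a learner after replacing the treatment with a random permutation of itself. The expectation, stated here as an assumption about how such a check is usually read, is that the estimated effect should collapse toward zero. The T Learner with least-squares base models and the synthetic data are hypothetical choices for the example.

```python
import numpy as np

def t_learner_mean_effect(x, t, y):
    """Mean CATE from a T Learner with least-squares base models."""
    xb = np.column_stack([np.ones(len(x)), x])
    coef_t, *_ = np.linalg.lstsq(xb[t == 1], y[t == 1], rcond=None)
    coef_c, *_ = np.linalg.lstsq(xb[t == 0], y[t == 0], rcond=None)
    return float((xb @ coef_t - xb @ coef_c).mean())

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)
# Assumed data-generating process with an average effect of about 1.0.
y = x[:, 1] + (1.0 + 2.0 * x[:, 0]) * t + rng.normal(0.0, 1.0, size=n)

estimated = t_learner_mean_effect(x, t, y)

# Placebo treatment: a random variable unrelated to the outcome.
placebo_t = rng.permutation(t)
placebo = t_learner_mean_effect(x, placebo_t, y)

print(f"mean effect with real treatment:    {estimated:+.2f}")
print(f"mean effect with placebo treatment: {placebo:+.2f}")
```

If the placebo estimate stayed far from zero, that would suggest the learner is picking up something other than the treatment.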

CausalML
https://github.com/uber/causalml
