Python and the Holy Grail of Causal Inference - Dennis Ramondt, Huib Keemink
PyData Amsterdam 2018

Causal Inference, AKA how effective is your new product, policy or feature? Inspired by A/B testing in tech, organizations have turned to randomized testing. However, randomization often fails, leaving us in a biased reality. Join us on our quest to dispel myths about randomized testing and build practical models for effect measurement in business situations, in this Eneco-Heineken joint talk.


  1. of causal inference
  2. [Diagram: Glass of wine a day, Health, Income]
  3. [Diagram: w, Y, X]
  4. Experiments you thought were good can still be invalid
     Experiments you thought were bad can still be valid
  5. Randomized testing: the set-up
     • Random subsample of population is chosen
     • Sample is randomly split into two groups: INTERVENTION and CONTROL
     • Outcome in both groups is measured (legend: no change, improved outcome)
     • AVERAGE TREATMENT EFFECT: the same for all participants
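The set-up on slide 5 can be sketched on synthetic data. Everything in this sketch (population size, outcome distribution, a constant true effect of 2.0) is an assumption made for illustration and does not come from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: one baseline outcome per unit (assumed values).
population = rng.normal(loc=20.0, scale=5.0, size=100_000)

# 1. Choose a random subsample of the population.
sample_idx = rng.choice(population.size, size=10_000, replace=False)
baseline = population[sample_idx]

# 2. Randomly split the sample into intervention (True) and control (False).
w = rng.permutation(baseline.size) < baseline.size // 2

# 3. Apply the same (assumed) treatment effect to every treated unit.
true_effect = 2.0
y = baseline + true_effect * w

# 4. Measure the outcome in both groups; the difference in means
#    estimates the average treatment effect.
ate_hat = y[w].mean() - y[~w].mean()
print(f"estimated ATE: {ate_hat:.2f} (true effect: {true_effect})")
```

Because assignment is random, the two groups have the same baseline distribution in expectation, so the plain difference in means is an unbiased estimate here. The use cases that follow break exactly this property.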
  6. USE CASE: heat pump savings @ Eneco
  7. Measurement data: daily gas usage ~ outside temperature
     [Plot: gas usage (m³) against average outside temperature (°C)]
  8. The experiment in the randomized test framework
     • Sample is based on “friendly users”: Eneco employees, early adopters and energy enthusiasts
     • Rental homes are excluded from the study
     • Participation is initiated by customer
     • Outcome: average yearly gas savings
     • Placements over many months
     • Changes made to intervention halfway through study
     [Diagram labels: INTERVENTION, CONTROL, AVERAGE TREATMENT EFFECT]
  9. Fixing group imbalance: match test and control
     Available covariates:
     • House size (m²)
     • Building type (terraced, apartment, detached, semi-detached)
     • Construction period (<1946, 1946-1965, …, >2010)
     • Number of inhabitants (1, 2, 3, 4, 5+)
     Number of possibilities: 10 × 4 × 6 × 5 = 1200
     Our sample population is only 2500, so exact matches are infeasible → partial matching: Propensity Score Matching
  10. Propensity score matching – concept
      • Calculate chance of receiving treatment given X (house type, etc.)
      • Match test subject to k control subjects on this probability
      • Calculate effect for test and (matched) control, e.g. −500 m³ (test) minus −20 m³ (average of matched controls) = −480 m³
      • Repeat for all participants → average effect over test group
      [Figure labels: test A; example propensities 38%, 83%, 39%, 41%, 12%, 22%]
      RUN AWAY!
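The four steps on slide 10 can be sketched as follows. This is a minimal illustration under stated assumptions, not the Eneco implementation: the data are synthetic, the single covariate stands in for house characteristics, the true effect is set to −480 to echo the slide's example, and logistic regression plus nearest-neighbour matching (with replacement) is one common way to do the estimation and matching.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 5_000

# Synthetic covariate (assumption): standardized house size.
X = rng.normal(size=(n, 1))
# Self-selection: larger houses are more likely to take the treatment.
p_treat = 1.0 / (1.0 + np.exp(-X[:, 0]))
w = rng.random(n) < p_treat
# Outcome depends on the covariate; the assumed true effect is -480.
y = 1500.0 + 300.0 * X[:, 0] - 480.0 * w + rng.normal(scale=50.0, size=n)

# Step 1: estimate the chance of receiving treatment given X.
propensity = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]

# Step 2: match every test unit to its k nearest controls on that probability.
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(propensity[~w].reshape(-1, 1))
_, idx = nn.kneighbors(propensity[w].reshape(-1, 1))

# Step 3: effect per test unit = own outcome minus mean of matched controls.
effects = y[w] - y[~w][idx].mean(axis=1)

# Step 4: average over the test group.
att_hat = effects.mean()
naive = y[w].mean() - y[~w].mean()
print(f"naive difference: {naive:.0f}, matched estimate: {att_hat:.0f}")
```

The naive difference is pulled away from −480 because treated houses are larger on average; matching on the propensity compares each test unit with controls that were about equally likely to be treated. The result is the average effect over the test group, not over the whole population.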
  11. Recap heat pump use case
      • Experiment fails (almost) all standard assumptions
      • Each of the “faults” can be corrected
        • Measure months, need year → extrapolate with model
        • Bias in test group → match with equally biased control using propensity
      • Outcome: average effect over test group, not whole population
      • We cannot say anything about rental households without making additional assumptions
  12. USE CASE: effect of cooler placement @ HEINEKEN
  13. USE CASE: effect of cooler placement @ HEINEKEN
      POPULATION
      • 13K off-trade* outlets
      • Selling HEINEKEN beer brands
      • May receive cooler
      * Small to medium shops, e.g. mom and pop shops, groceries and kiosks; not retail
      • Pool for “experiment” is all outlets, sample is the population
      • Observational approach: coolers are already placed
      • Gold outlets have a higher probability of getting a cooler than others
      • Need effect on individual outlets, to prioritize future placements
      • Outcome: yearly profit** uplift
      • Placements over many years, movements not tracked → sales before/after unknown
      ** Profit is measured as FGP/hl, a company-wide calculation of profit per hl sales
      [Diagram labels: INTERVENTION, CONTROL, AVERAGE TREATMENT EFFECT (the same for all participants)]
  14. Problem 1: test and control group are statistically different
      Distribution of relevant characteristics* is different between test and control
      * A relevant characteristic is one that influences the probability of being selected for treatment
      Fig. Histograms showing the distribution of total profit per outlet, when broken down by ranking and cooler setup
  15. Problem 1: test and control group are statistically different
      Distribution of relevant characteristics* is different between test and control
      * A relevant characteristic is one that influences the probability of being selected for treatment
      • Outlet ranking (gold, silver, bronze)
      • Outlet sub-channel (kiosk, grocery, convenience, etc.)
      • Outlet area type (city, urban, village)
      • Area (name of neighborhood)
      • Seasonality (is outlet only open in summer)
      • Sales rep visits per month
      • Volume of competitor vs HEINEKEN sales
      • Number of assortment deals with HEINEKEN
      • Amount of investment by HEINEKEN
      • Number of HEINEKEN branding materials
      • Census demographics in km² (population, age, gender)
      • Google Maps metrics in 500 m² (average venue rating, # venues with photo, # of unique venue types, average venue opening times)
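  16. Staged data set: gold and non-gold outlets with different cooler probabilities and uplifts

          import numpy as np
          import pandas as pd

          n = 10_000  # sample size per group; not defined on the slide, assumed here

          # Non-gold outlets: base profit 20, 1/3 get a cooler, a cooler adds 3.
          data_nongold = pd.DataFrame({
              'y_profit': 20 + 5*np.random.randn(n),
              'X_gold': 0,
              'w_cooler': np.random.choice([0, 1], size=(n,), p=[2./3, 1./3])
          }).assign(y_profit=lambda df: np.where(df.w_cooler, df.y_profit + 3, df.y_profit))

          # Gold outlets: base profit 25, 2/3 get a cooler, a cooler adds 5.
          data_gold = pd.DataFrame({
              'y_profit': 25 + 5*np.random.randn(n),
              'X_gold': 1,
              'w_cooler': np.random.choice([0, 1], size=(n,), p=[1./3, 2./3])
          }).assign(y_profit=lambda df: np.where(df.w_cooler, df.y_profit + 5, df.y_profit))

          # Stack both groups into one data set.
          data = pd.concat([data_nongold, data_gold])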
  17. The need for effect correction – staging an experiment
      Definition: conditional mean
      Mean of y for given values of X, i.e. average of one variable as a function of some other variables:
      $E[Y \mid X] = X\beta$
      Effect = mean treated – mean untreated:
      $E[Y \mid w = 1] - E[Y \mid w = 0] = 27.70 - 21.66 = 6.04$ ??
  18. The need for effect correction – staging an experiment
      Only gold: effect = mean treated – mean untreated
      $ATE_{ins} = E[Y \mid X = 1, w = 1] - E[Y \mid X = 1, w = 0] = 30.07 - 24.90 = 5.17$
      Only non-gold: effect = mean treated – mean untreated
      $ATE_{nonins} = E[Y \mid X = 0, w = 1] - E[Y \mid X = 0, w = 0] = 22.96 - 20.00 = 2.96$
  19. The need for effect correction – staging an experiment
      What would be the effect if all the imbalance in treatment caused by gold ranking is removed?
      50% of outlets are gold; if the probability of placement were equal for all of them, the effect would be:
      $ATE = E[Y \mid X, w = 1] - E[Y \mid X, w = 0] = 4.06$
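The three effects on slides 17 to 19 (naive, per ranking group, and reweighted) can be reproduced approximately with a few group-bys. This sketch regenerates the staged data with the same parameters as slide 16; the sample size, the seed and the helper `make_group` are assumptions, so the numbers will be close to, but not identical with, those on the slides.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50_000  # assumed sample size per ranking group


def make_group(base, gold, p_cooler, uplift):
    """Hypothetical helper: one ranking group of the staged experiment."""
    w = rng.binomial(1, p_cooler, size=n)
    return pd.DataFrame({
        "y_profit": base + 5 * rng.standard_normal(n) + uplift * w,
        "X_gold": gold,
        "w_cooler": w,
    })


data = pd.concat(
    [make_group(20, 0, 1 / 3, 3), make_group(25, 1, 2 / 3, 5)],
    ignore_index=True,
)

# Naive effect: mean treated minus mean untreated, ignoring X.
by_w = data.groupby("w_cooler")["y_profit"].mean()
naive = by_w[1] - by_w[0]

# Effect within each ranking group: E[Y | X=x, w=1] - E[Y | X=x, w=0].
cond = data.groupby(["X_gold", "w_cooler"])["y_profit"].mean().unstack()
per_group = cond[1] - cond[0]

# Weight each group's effect by its share of outlets (50% gold here).
share = data["X_gold"].value_counts(normalize=True)
ate = (per_group * share).sum()

print(f"naive: {naive:.2f}")
print(f"non-gold: {per_group[0]:.2f}, gold: {per_group[1]:.2f}")
print(f"reweighted ATE: {ate:.2f}")
```

The naive estimate overstates the effect because gold outlets both earn more and receive coolers more often; weighting the per-group effects by the group shares removes that imbalance.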
  20. The need for effect correction – staging an experiment
      Procedure
      With $\bar{\mathbf{X}}$ the sample mean of the covariates, fit the regression
      $Y \text{ on } 1,\ w,\ \mathbf{X},\ w(\mathbf{X} - \bar{\mathbf{X}})$
      and the coefficient on w will be the average treatment effect.
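  21. Regression estimate of the ATE on the staged data

          from sklearn.linear_model import LinearRegression

          # Interaction of treatment with the demeaned covariate: w * (X - mean(X)).
          data_reg = data.assign(
              demeaned_interaction=lambda df:
                  df.w_cooler * (df.X_gold - df.X_gold.mean())
          )
          lm_all = LinearRegression()
          lm_all.fit(
              data_reg[['X_gold', 'demeaned_interaction', 'w_cooler']],
              data.y_profit
          )
          lm_all.coef_[2]  # coefficient on w_cooler, i.e. the ATE; the slide shows 4.0637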
  22. Estimating the ATE with regression – assumptions
      Conditional mean independence
      Mean dependence between treatment assignment w and treatment-specific outcomes $Y_i$ can be removed by conditioning on some variables X, provided that they are observable (AKA weak ignorability):
      $E[Y_i \mid X, w] = E[Y_i \mid X] \quad \text{for } i \in \{0, 1\}$
  23. Individual treatment effect estimation – assumptions
      Many approaches exist, but most of your bias will be due to not observing enough confounders X!
      Conditional independence
      Any dependence between treatment assignment w and treatment-specific outcomes $Y_i$ can be removed by conditioning on some variables X, provided that they are observable (AKA strong ignorability):
      $(Y_0, Y_1) \perp\!\!\!\perp w \mid \mathbf{X}$
  24. Estimating ITE with Virtual Twins*
      [Tree diagram: Sales, split by Rating (Bronze/Silver vs. Gold) and Cooler (0 vs. 1); leaf values €2000 and €3000]
      Procedure
      • Fit a tree ensemble with target Y and features X, w, and interactions** between X and w
      • Predict all units with w=1, predict all units with w=0
      • Subtract to get $\tau_{ite,i} = m_1(\mathbf{X}_i) - m_0(\mathbf{X}_i)$
      • Early stopping and OOB predictions reduce overfitting; a quantile objective can help to trim outliers
      * Foster, J. C., Taylor, J. M., and Ruberg, S. J. (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880.
      ** Scaling like we did with the linear ATE estimator is generally not needed with tree-based estimators
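The Virtual Twins procedure can be sketched with a random forest on the staged gold/non-gold example. This is an illustration under assumptions, not the model used at HEINEKEN: the data are synthetic, `features` is a hypothetical helper, the hyperparameters are arbitrary, and for brevity the predictions are in-sample rather than out-of-bag as the slide recommends.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 4_000

# Synthetic outlets (assumption): gold outlets get coolers more often
# and benefit more from them (uplift 5 versus 3).
X_gold = rng.integers(0, 2, size=n)
w = rng.binomial(1, np.where(X_gold == 1, 2 / 3, 1 / 3))
uplift = np.where(X_gold == 1, 5.0, 3.0)
y = 20.0 + 5.0 * X_gold + uplift * w + rng.normal(scale=1.0, size=n)


def features(x, treatment):
    """Hypothetical helper: X, w and the X-by-w interaction as columns."""
    return np.column_stack([x, treatment, x * treatment])


# Fit one tree ensemble with target Y on X, w and their interaction.
model = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=20, random_state=0
)
model.fit(features(X_gold, w), y)

# Predict every unit twice: as if treated (m1) and as if untreated (m0).
m1 = model.predict(features(X_gold, np.ones(n, dtype=int)))
m0 = model.predict(features(X_gold, np.zeros(n, dtype=int)))

# Individual treatment effect: tau_i = m1(X_i) - m0(X_i).
tau_ite = m1 - m0

print(f"mean ITE, gold:     {tau_ite[X_gold == 1].mean():.2f}")
print(f"mean ITE, non-gold: {tau_ite[X_gold == 0].mean():.2f}")
```

Each unit's "twin" is its own covariate row with the treatment flag flipped, so the estimate is only as good as the model's ability to separate the effect of w from the effect of X.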
  25. USE CASE: effect of cooler placement @ HEINEKEN
      Overview
      Fig. Model predicted profit versus actual profit, by cooler type (all outlets)
  26. USE CASE: effect of cooler placement @ HEINEKEN
      Coolers to consider
      Fig. Model predicted profit versus actual profit, by cooler type (outlets within 90% confidence interval)
  27. USE CASE: effect of cooler placement @ HEINEKEN
      Coolers to upgrade
      Fig. Model predicted profit versus actual profit, by cooler type (outlets to upgrade / install)
  28. USE CASE: effect of cooler placement @ HEINEKEN
      Coolers to upgrade
      Fig. Model predicted profit versus actual profit, by cooler type (outlets to upgrade / install)
  29. USE CASE: effect of cooler placement @ HEINEKEN
      Coolers to upgrade
      Fig. Model predicted profit versus actual profit, by cooler type (outlets to upgrade / install)
  30. • Your perfect experiment is likely ruined by harsh reality
      • But you may be able to fix it:
        • Propensity score matching
        • Average and individual treatment effect estimation
      • Make sure you collect enough data:
        • When is the treatment done?
        • Measure Y before and after experiment
        • What covariates X influence both treatment w and outcome Y?
  31. Looking for:
      • Senior Data Scientist
      • Senior Data Engineer
      Contact: ciaran.jetten@heineken.com
  32. Estimating ITE with Honest RF*
      [Tree diagram: split on Rating (Bronze/Silver vs. Gold); within a leaf, Cooler 1/0: $E[Y \mid w = 1] - E[Y \mid w = 0]$ = €3000 − €2000 = €1000]
      Procedure
      • Fit a tree ensemble with target w and features X, with the constraint of a minimum of k units per class in each DT leaf
      • Per leaf K in each DT, calculate the mean difference in Y between treatment and control units to get
        $\tau_{ite,i} = N^{-1} \sum_{j=1}^{N} [Y_{j1} - Y_{j0}] \quad \text{for } i \in K \text{ and } j \in K$
      * Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353-7360.
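The leaf-wise procedure on slide 32 can be sketched with scikit-learn. This is a simplified illustration under loud assumptions: the data are synthetic; `min_samples_leaf` is used as a stand-in for the "minimum k units per class" constraint, which scikit-learn does not offer directly; and the same sample is used for building the trees and for estimating leaf effects, so the sample splitting that makes the Athey and Imbens estimator "honest" is not implemented here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 4_000

# Synthetic outlets (assumption): a gold flag plus one irrelevant covariate.
gold = rng.integers(0, 2, size=n)
X = np.column_stack([gold, rng.normal(size=n)])
w = rng.binomial(1, np.where(gold == 1, 2 / 3, 1 / 3))
y = 20.0 + 5.0 * gold + np.where(gold == 1, 5.0, 3.0) * w + rng.normal(size=n)

# Fit a tree ensemble with target w and features X. Large leaves make it
# likely that each leaf holds both treated and control units.
forest = RandomForestClassifier(
    n_estimators=50, min_samples_leaf=200, max_features=None, random_state=0
).fit(X, w)

# Leaf index of every unit in every tree, shape (n, n_trees).
leaves = forest.apply(X)

tau_per_tree = np.full(leaves.shape, np.nan)
for t in range(leaves.shape[1]):
    df = pd.DataFrame({"leaf": leaves[:, t], "w": w, "y": y})
    # Mean outcome per leaf and treatment status.
    means = df.groupby(["leaf", "w"])["y"].mean().unstack()
    # Treated minus control within each leaf (NaN if a class is missing).
    diff = means[1] - means[0]
    # Every unit inherits the effect of the leaf it falls into.
    tau_per_tree[:, t] = diff.reindex(leaves[:, t]).to_numpy()

# Average the leaf-level effects over all trees.
tau_ite = np.nanmean(tau_per_tree, axis=1)

print(f"mean ITE, gold:     {tau_ite[gold == 1].mean():.2f}")
print(f"mean ITE, non-gold: {tau_ite[gold == 0].mean():.2f}")
```

Because the trees predict treatment assignment, units in the same leaf have similar treatment probabilities, which is what makes the within-leaf difference in means a sensible local effect estimate.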
  33. Estimating ITE using Counterfactual Regression*
      Procedure
      • Learn a representation Φ of X → split samples according to w → regress Y0 and Y1 on the representation separately
      • Regularize Φ using IPM, which is the distance between the distribution of X in w=1 and of X in w=0
      • This gives a joint objective of minimizing predictive error and guaranteeing a balanced representation of X
      * Shalit, U., Johansson, F., & Sontag, D. (2016). Estimating individual treatment effect: generalization bounds and algorithms. arXiv preprint arXiv:1606.03976.