Supercharging your
A/B testing with
automated causal
inference
Community Workshop on Microsoft's Causal Tools
May 3, 2022
Egor Kraev
Head of AI, Wise Plc
egor.kraev@wise.com
What is causal inference anyway?
● Causal inference tries to estimate impacts of actions, ideally at a
single-observation level
● The fundamental problem when doing that is that you can never directly
verify such estimates at the individual level: you can’t both send and not
send the same email to the same customer to observe the impact!
● Randomization when gathering data is helpful for causal inference,
but not strictly necessary
Causal inference is a fascinating, deep domain
Here, we’ll apply it to the simplest case possible
Strengths and weaknesses of
classic A/B testing.
Traditional A/B testing
“A/B testing is the gold standard
for learning cause and effect”
● Randomly split your audience into test and control groups
● Subject the test group to a ‘treatment’, such as sending a
marketing email or enabling a new product feature
● Measure the average difference in some ‘target’ variable, such as
post-treatment revenue, between the two groups
○ Choose sample sizes large enough for the difference to be statistically significant
Image by Maxime Lorant - https://commons.wikimedia.org/wiki/File:A-B_testing_simple_example.png, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=52602931
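The measurement step above can be sketched in a few lines. This is a minimal illustration, not part of the original deck: the revenue numbers are simulated, the helper name `ab_test` is hypothetical, and the p-value uses a normal approximation to Welch's t-test, which is usually adequate at A/B-test sample sizes.

```python
import math
import random
import statistics

def ab_test(control, test):
    """Return (difference in means, standard error, two-sided p-value).

    Uses a normal approximation to Welch's t-test, which is reasonable
    for the large samples typical of A/B tests.
    """
    diff = statistics.fmean(test) - statistics.fmean(control)
    se = math.sqrt(statistics.variance(test) / len(test)
                   + statistics.variance(control) / len(control))
    z = diff / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return diff, se, p_value

rng = random.Random(0)
# Simulated (hypothetical) post-treatment revenue: treatment adds 0.5 on average.
control = [rng.gauss(10.0, 2.0) for _ in range(5000)]
test = [rng.gauss(10.5, 2.0) for _ in range(5000)]

diff, se, p_value = ab_test(control, test)
print(f"lift = {diff:.3f} +/- {1.96 * se:.3f}, p = {p_value:.2g}")
```

Note that the whole dataset is reduced to one number and its uncertainty, which is exactly the limitation discussed next.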
Downsides of traditional A/B testing
● Treats differences between customers as noise: if the treatment is harmful
to some customers but beneficial to others, we may just see a near-zero
average impact
● Wasteful: we collect and process hundreds of thousands of data points just
to learn a boolean, or at most a single number (the average impact)
● Can be a hard sell to product teams: if we take the effort to build
a feature, we kind of assume it'll be useful – and in any case it’s
already built, so what’s the value of the test?
How can causal inference help?
How can causal inference help in A/B testing?
● Causal inference models will estimate impact per customer as a function
of their features (also known as Conditional Average Treatment Effect, or
CATE)
○ Most models also supply confidence intervals for those estimates.
● This means you can take the same dataset you collected during A/B
testing, enriched with customer features, and get customer
segmentation by impact
● Thus, customer heterogeneity becomes a valuable information source,
rather than noise to be averaged over
○ This promises smaller sample sizes required for significance
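As a toy illustration of the idea, assuming a single categorical feature, the simplest possible "CATE model" is a difference in means per segment with a normal-approximation confidence interval. The data, segment names and effect sizes below are simulated and hypothetical; real estimators (for example those in EconML) generalize this to many features.

```python
import math
import random
import statistics
from collections import defaultdict

def effect_with_ci(treated, control, z=1.96):
    """Difference in means with a normal-approximation 95% interval."""
    diff = statistics.fmean(treated) - statistics.fmean(control)
    se = math.sqrt(statistics.variance(treated) / len(treated)
                   + statistics.variance(control) / len(control))
    return diff, diff - z * se, diff + z * se

rng = random.Random(1)
# Hypothetical data: the treatment helps "new" customers (+1) and hurts
# "existing" ones (-1), so the average effect is close to zero.
true_effect = {"new": 1.0, "existing": -1.0}
rows = []
for _ in range(20000):
    segment = rng.choice(["new", "existing"])
    treated = rng.random() < 0.5          # fair coin: a plain A/B test
    outcome = rng.gauss(5.0, 1.0) + (true_effect[segment] if treated else 0.0)
    rows.append((segment, treated, outcome))

# Overall average treatment effect (what a classic A/B test reports).
ate = effect_with_ci([y for _, t, y in rows if t],
                     [y for _, t, y in rows if not t])

# Conditional average treatment effect per segment.
by_segment = defaultdict(lambda: {True: [], False: []})
for segment, treated, outcome in rows:
    by_segment[segment][treated].append(outcome)
cate = {s: effect_with_ci(g[True], g[False]) for s, g in by_segment.items()}

print("ATE (estimate, low, high):", ate)
for s, est in sorted(cate.items()):
    print(f"CATE[{s}] (estimate, low, high):", est)
```

The same data that shows "no effect" on average reveals two segments with clear and opposite effects.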
Our scope: “No unobserved confounders”
● Most important example: we run an
A/B/N test and want to better
understand its results
● Second option: treatment assignment
is random but biased based on
customer features
● Final class of examples: we want to
draw causal conclusions about a
situation where we can’t run an
experiment (e.g. impact of suspending
users), and don’t have any
instrumental variables
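For the second case (random but feature-dependent assignment), a standard correction is inverse propensity weighting. The sketch below is an illustration under stated assumptions, not the deck's own method: the propensities are known exactly, the data is simulated, and the function name `ipw_ate` is hypothetical. It uses the self-normalized (Hajek) form of the estimator.

```python
import random

def ipw_ate(rows):
    """Self-normalized inverse-propensity-weighted ATE estimate.

    rows: iterable of (treated: bool, outcome: float, propensity: float),
    where propensity is the known probability of treatment for that customer.
    """
    num_t = den_t = num_c = den_c = 0.0
    for treated, outcome, propensity in rows:
        if treated:
            w = 1.0 / propensity
            num_t += w * outcome
            den_t += w
        else:
            w = 1.0 / (1.0 - propensity)
            num_c += w * outcome
            den_c += w
    return num_t / den_t - num_c / den_c

rng = random.Random(2)
rows = []
for _ in range(50000):
    high_value = rng.random() < 0.5
    propensity = 0.8 if high_value else 0.2    # assignment depends on a feature
    treated = rng.random() < propensity
    baseline = 10.0 if high_value else 2.0     # the same feature drives the outcome
    outcome = baseline + (1.0 if treated else 0.0) + rng.gauss(0.0, 1.0)
    rows.append((treated, outcome, propensity))

treated_y = [y for t, y, _ in rows if t]
control_y = [y for t, y, _ in rows if not t]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
ipw = ipw_ate(rows)

print(f"naive difference in means: {naive:.2f}")  # biased: mixes in the baseline gap
print(f"IPW estimate:              {ipw:.2f}")    # true simulated effect is 1.0
```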
Comparing causal models.
Which causal model?
● There is a wealth of causal inference packages
available, first and foremost DoWhy/EconML,
also CausalML, UpliftML and several others
○ Inconsistent APIs
○ Each model comes with its own set of quirks
○ Most models require choice of hyperparameters
and also of ‘regular’ ML regressors/classifiers as components
● Most importantly, how do you compare different models after
fitting?
○ How do you do out-of-sample testing for counterfactual estimates?
Out of sample scoring of causal models
To compare causal models, we need to score out of sample.
We consider two main method families:
1. Qini/AUC: cumulative curves for outcomes sorted by
estimated impact, similar to ROC for classifiers. Theoretically
nicer, but harder to interpret
2. Policy value, aka ERUPT (Estimated Response Under
Proposed Treatment): unbiased estimator of the outcome if
we treated every customer for whom a model predicts (say) a
positive CATE: more interpretable
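A minimal sketch of the first family, assuming one common formulation of the Qini curve: sort customers by predicted CATE, and after each customer record the treated outcomes minus the control outcomes rescaled to the treated count. The function names and the simulated data are hypothetical, and library implementations may normalize differently.

```python
import random

def qini_curve(cate_pred, treated, outcome):
    """Return the Qini curve as a list of cumulative incremental gains.

    Customers are sorted by predicted CATE, highest first. After the first
    k customers, the gain is (sum of treated outcomes) minus (sum of control
    outcomes rescaled to the number of treated customers seen so far).
    """
    order = sorted(range(len(cate_pred)), key=lambda i: -cate_pred[i])
    n_t = n_c = 0
    y_t = y_c = 0.0
    curve = []
    for i in order:
        if treated[i]:
            n_t += 1
            y_t += outcome[i]
        else:
            n_c += 1
            y_c += outcome[i]
        curve.append(y_t - y_c * n_t / n_c if n_c else 0.0)
    return curve

def qini_score(cate_pred, treated, outcome):
    """Average gap between the Qini curve and the random-targeting line."""
    curve = qini_curve(cate_pred, treated, outcome)
    n, total = len(curve), curve[-1]
    return sum(c - total * (k + 1) / n for k, c in enumerate(curve)) / n

rng = random.Random(3)
n = 20000
x = [rng.random() for _ in range(n)]
treated = [rng.random() < 0.5 for _ in range(n)]
# Simulated outcome: the true effect of treatment is 2 * x.
outcome = [rng.gauss(0.0, 1.0) + (2 * xi if t else 0.0)
           for xi, t in zip(x, treated)]

good = qini_score(x, treated, outcome)   # ranks customers by the true driver
random_model = qini_score([rng.random() for _ in range(n)], treated, outcome)
print(f"informative model: {good:.1f}, random ranking: {random_model:.1f}")
```

A model that ranks customers well bows the curve above the diagonal; a random ranking stays close to it.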
We also support the r-scorer, but don’t really like its complexity ;)
Image source: N. Radcliffe (2007), formula: Hitsch and Misra (2018) https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3111957
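The policy value can be sketched with a plain inverse-propensity estimator: only customers whose actual assignment agrees with the policy contribute, each weighted by one over the probability of that assignment. This is a sketch under assumptions (known propensities, simulated data, a hypothetical function name `erupt`), not necessarily the exact form used in the package.

```python
import random

def erupt(policy, treated, outcome, propensity):
    """Inverse-propensity estimate of the mean outcome under `policy`.

    policy[i]     : True if the policy would treat customer i
    treated[i]    : True if customer i was actually treated
    propensity[i] : probability that customer i was assigned to treatment
    """
    total = 0.0
    for pi, w, y, p in zip(policy, treated, outcome, propensity):
        if pi == w:  # actual assignment matches the policy's recommendation
            total += y / (p if w else 1.0 - p)
    return total / len(outcome)

rng = random.Random(4)
n = 50000
x = [rng.random() for _ in range(n)]
propensity = [0.5] * n
treated = [rng.random() < p for p in propensity]
# Simulated outcome: baseline 1.0, true effect 2x - 1 (negative for x < 0.5).
outcome = [1.0 + rng.gauss(0.0, 1.0) + ((2 * xi - 1) if t else 0.0)
           for xi, t in zip(x, treated)]

policies = {
    "treat if CATE > 0": [2 * xi - 1 > 0 for xi in x],  # an oracle model
    "treat everybody": [True] * n,
    "treat nobody": [False] * n,
}
scores = {name: erupt(pol, treated, outcome, propensity)
          for name, pol in policies.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In this simulation the selective policy should score about 0.25 above both blanket policies, since it collects the positive effects and avoids the negative ones.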
What policy to choose for ERUPT?
● If the treatment effect is variable, but positive for every customer, the naïve
“treat if CATE > 0” policy will involve treating everybody, which makes it hard to
differentiate between models
● To correct for this, we offer a normalized ERUPT score, by adding a treatment
cost equal to the average treatment effect - thus forcing the models to compete
on predicting impact variability
● In real-world problems, you should use a policy that’s as realistic as possible. For
example, “send promotion to customer if predicted impact of a promotion on
revenue minus promotion cost is positive”
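A small worked example of why the cost matters, with made-up predictions from two hypothetical models. It covers only the policy side; the scoring of each policy would then proceed as for any other ERUPT policy.

```python
def naive_policy(cate):
    """Treat every customer with a positive predicted effect."""
    return [c > 0 for c in cate]

def cost_aware_policy(cate, cost):
    """Treat only if the predicted effect exceeds the per-customer cost."""
    return [c - cost > 0 for c in cate]

# Hypothetical predicted effects from two models on the same five customers.
model_a = [1, 2, 3, 4, 5]
model_b = [5, 4, 3, 2, 1]   # opposite ranking, same average

# Both models predict a positive effect everywhere, so the naive policy
# treats everybody and cannot tell the models apart.
print(naive_policy(model_a) == naive_policy(model_b))  # True

# Using the average treatment effect as the cost forces the models to
# compete on who they would treat.
ate = sum(model_a) / len(model_a)                      # 3.0 for both models
print(cost_aware_policy(model_a, ate))  # [False, False, False, True, True]
print(cost_aware_policy(model_b, ate))  # [True, True, False, False, False]
```

The same `cost_aware_policy` shape covers the realistic case above, with the promotion cost in place of the average treatment effect.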
Automated model selection.
Building blocks for automated model selection
● CATE models:
● Most EconML CATE estimators
● Transformed Outcome
● Dummy model: average treatment effect + randomness
● Easy to add others – let us know if you have favorites!
● DoWhy for a consistent high-level interface to individual estimators
● FLAML regression for regression component models
● DummyClassifier or FLAML (user choice) for propensity to treat
● FLAML, again, for estimator and hyperparameter search
● First run all chosen estimators with default settings, then sample
DoWhy
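For readers unfamiliar with the "Transformed Outcome" item in the list above, here is a sketch of one standard formulation: with treatment indicator W and known propensity p, the variable Y* = Y (W - p) / (p (1 - p)) has conditional mean equal to the CATE, so any regressor fitted to Y* estimates it. Whether the package implements it exactly this way is an assumption; the data is simulated and the "regression" here is just a mean per segment.

```python
import random
import statistics

def transformed_outcome(outcome, treated, propensity):
    """Y* = Y * (W - p) / (p * (1 - p)); its conditional mean is the CATE."""
    w = 1.0 if treated else 0.0
    return outcome * (w - propensity) / (propensity * (1.0 - propensity))

rng = random.Random(5)
p = 0.5                      # randomized assignment with known propensity
low, high = [], []           # transformed outcomes for x < 0.5 and x >= 0.5
for _ in range(100000):
    x = rng.random()
    treated = rng.random() < p
    effect = 2.0 if x >= 0.5 else 0.0      # hypothetical true CATE
    y = rng.gauss(1.0, 1.0) + (effect if treated else 0.0)
    (high if x >= 0.5 else low).append(transformed_outcome(y, treated, p))

cate_low = statistics.fmean(low)     # should be near the true value 0.0
cate_high = statistics.fmean(high)   # should be near the true value 2.0
print(f"x < 0.5: {cate_low:.2f}, x >= 0.5: {cate_high:.2f}")
```

The transformed outcome is very noisy for any single customer, which is one reason it is usually treated as a baseline rather than a first choice.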
It’s yours to use right now!
https://github.com/transferwise/auto-causality
Current state, tested and used:
● Extensively tested for binary treatment, random assignment
● Example notebook with the full fitting and analysis cycle
Coming soon (almost works, bar testing and bugfixes):
● Multi-valued discrete treatment
● More testing for FLAML classifier option for propensity to treat
● Component models’ time budget as part of the search space
● Reuse of component models across fits, for efficiency
Next steps
Next big milestone (planned for Summer 2022):
● Extension to EconML instrumental variable models
Could use some help from the EconML team:
● Review of hyperparameter search spaces for the EconML models
● OrthoForest inference on 500K+ points doesn’t finish after running for days,
although training only takes tens of minutes – is that expected?
● Early stopping option for tree-based models
● What are good ways to score instrumental variable models out of sample?
Causal inference in Wise.
A lot of analytics questions are questions about causality
In progress:
● Analysis of CRM campaign impacts
● Improved targeting of referral reward programs
Planned:
● Improved analysis of A/B test results for product features
● Impact of customer support turnaround time on retention and future volumes
Extra referral rewards impact analysis
● 470K customers in the sample, some kind of extra reward for referring new customers was
offered to 360K of them (all grouped into one ‘treatment’ here for simplicity)
● Simulations run on an r5a.4xlarge (128GB RAM)
● Component model time budget was set to 10min, total time budget to 3h
● Features include reward base currency, customer ‘age’, host’s recent transaction volume
Quite variable out-of-sample model performance depending on model and target:
Results continued
● Different models’ ATE estimates are all over the place
● DML and Sparse flavors are the most likely to occasionally produce wildly
off ATE estimates (excluded from graphs for readability)
● ForestDRLearner tends to overfit the training set (need early stopping?)
For POST_BOOST_REFERRALS, best model’s ERUPT and ATE estimates are inconsistent
ERUPT vs ATE for training, validation, and test sets (denoted by marker size from smallest to largest)
Out-of-sample segmentation
Let’s compare the estimated treatment effects in the test
set split by the best model’s recommendations:
On the other hand, the simple tree policy derived from
the model’s recommendations is not very useful
Final approach
● The number of converted referrals post-boost was the best-modeled
target, so we chose it to train the causal inference model
● Augment it with a conditional model of how much volume a new
customer would do with us if converted, given the boost program the host
was offered and the host’s other features, to quantify a reward program’s
payoff.
No causal inference is needed for this stage, just regular regression on the
much smaller dataset of hosts who had at least one post-boost conversion
● Re-run optimization with policy based on the conditional model
Lessons learned
● Consider different target variables as well as different models
● Check consistency between ERUPT and ATE estimates, and between
validation and test performance
● Choose your metric wisely
● Need quite large instances to run the fits
○ Now looking at using Ray to parallelize
● Beware of causality leakage!
If treatment is correlated with features, but you
use a naïve propensity-to-treat model, you can get
GREAT out-of-sample scores that aren’t real
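One quick sanity check for this kind of leakage, sketched under assumptions (a single categorical feature, simulated data, a hypothetical helper name and an arbitrary 0.05 threshold): compare the treatment rate within each feature segment to the overall rate that a naive constant propensity model would predict.

```python
import random
from collections import defaultdict

def treatment_rate_by_segment(segments, treated):
    """Share of treated customers in each segment."""
    counts = defaultdict(lambda: [0, 0])      # segment -> [treated, total]
    for s, t in zip(segments, treated):
        counts[s][0] += int(t)
        counts[s][1] += 1
    return {s: k / n for s, (k, n) in counts.items()}

rng = random.Random(6)
segments = [rng.choice(["EUR", "GBP", "USD"]) for _ in range(30000)]
# Hypothetical biased assignment: GBP customers are treated far more often.
true_propensity = {"EUR": 0.3, "GBP": 0.9, "USD": 0.3}
treated = [rng.random() < true_propensity[s] for s in segments]

overall = sum(treated) / len(treated)     # what a naive constant model predicts
rates = treatment_rate_by_segment(segments, treated)
for s, r in sorted(rates.items()):
    flag = "  <- naive propensity is wrong here" if abs(r - overall) > 0.05 else ""
    print(f"{s}: {r:.2f} vs overall {overall:.2f}{flag}")
```

If the per-segment rates differ materially from the overall rate, scores computed with the constant propensity should not be trusted.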
Conclusion.
So how can causal inference supercharge your
A/B testing?
● Use same A/B testing data, enriched with customer features
● Estimate causal impact as function of features, allowing you to segment
customers by impact
● Learn fine-grained impact also from biased random sampling, allowing
you to optimize and keep learning at the same time
○ For example, could do a kind of Thompson sampling on customer segments
● Thanks to the magic of DoWhy, EconML, and FLAML, combined in
auto-causality, you can do all this with no prior expertise in causal inference
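The Thompson sampling idea mentioned above can be sketched as follows, assuming a binary outcome (conversion) and a Beta-Bernoulli model per segment and arm. The segments, conversion rates and function name are hypothetical, and a real setup would draw from the fitted causal model's uncertainty rather than from simple per-segment counts.

```python
import random

def thompson_choose(stats, rng):
    """Pick the arm with the highest draw from its Beta posterior.

    stats: dict arm -> [successes, failures], with a Beta(1, 1) prior.
    """
    draws = {arm: rng.betavariate(s + 1, f + 1) for arm, (s, f) in stats.items()}
    return max(draws, key=draws.get)

rng = random.Random(7)
# Hypothetical conversion rates per (segment, arm); unknown to the algorithm.
true_rate = {
    "new":      {"control": 0.10, "treatment": 0.20},
    "existing": {"control": 0.10, "treatment": 0.05},
}
stats = {seg: {arm: [0, 0] for arm in arms} for seg, arms in true_rate.items()}
pulls = {seg: {arm: 0 for arm in arms} for seg, arms in true_rate.items()}

for _ in range(20000):
    seg = rng.choice(list(true_rate))
    arm = thompson_choose(stats[seg], rng)     # biased, but still random
    converted = rng.random() < true_rate[seg][arm]
    stats[seg][arm][0 if converted else 1] += 1
    pulls[seg][arm] += 1

for seg in pulls:
    print(seg, pulls[seg])
```

Assignment stays random within each segment, so the data remains usable for causal estimation, while most traffic drifts towards the arm that works better for that segment.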
Many thanks to
Pere Planell Morell
Mark Harley
Timo Flesh
Guy Durant
Wen Kho
Ziyang Zhang
For helping to bring the package this far!
Questions?
egor.kraev@wise.com
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 

Supercharge your AB testing with automated causal inference - Community Workshop on Microsoft's Causal Tools.pdf

  • 1. Supercharging your A/B testing with automated causal inference Community Workshop on Microsoft's Causal Tools May 3, 2022 Egor Kraev Head of AI, Wise Plc egor.kraev@wise.com
  • 2. What is causal inference anyway? ● Causal inference tries to estimate impacts of actions, ideally at a single-observation level ● The fundamental problem when doing that is you can never directly verify such estimates at individual level – you can’t send and not send the same email to the same customer, to observe the impact! ● Randomization when gathering data is helpful for causal inference, but not strictly necessary
  • 3. Causal inference is a fascinating, deep domain Here, we’ll apply it to the simplest case possible
  • 4. Strengths and weaknesses of classic A/B testing.
  • 5. Traditional A/B testing “A/B testing is the gold standard for learning cause and effect” ● Randomly split your audience into test and control groups ● Subject the test group to a ‘treatment’, such as sending a marketing email or enabling a new product feature ● Measure the average difference in some ‘target’ variable, such as post-treatment revenue, between the two groups ○ Choose sample sizes large enough for the difference to be statistically significant Image by Maxime Lorant - https://commons.wikimedia.org/wiki/File:A-B_testing_simple_example.png, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=52602931
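The measurement step on this slide can be sketched in a few lines. This is a minimal illustration on simulated data, not code from the talk: the function name `ab_test`, the simulated revenue numbers, and the use of a large-sample normal approximation for the p-value are all assumptions made for the example.

```python
import math
import random
import statistics


def ab_test(control, test):
    """Return (difference in means, z statistic, two-sided p-value)."""
    diff = statistics.fmean(test) - statistics.fmean(control)
    # Standard error of the difference, allowing unequal group variances
    se = math.sqrt(
        statistics.variance(test) / len(test)
        + statistics.variance(control) / len(control)
    )
    z = diff / se
    # Two-sided p-value under a normal approximation (reasonable for large samples)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p_value


# Simulated post-treatment revenue; the true lift of 0.3 is an assumption
rng = random.Random(0)
control = [rng.gauss(10.0, 2.0) for _ in range(5000)]
test = [rng.gauss(10.3, 2.0) for _ in range(5000)]

diff, z, p_value = ab_test(control, test)
print(f"lift={diff:.3f}, z={z:.2f}, p={p_value:.4g}")
```

Note that the whole experiment is reduced to one number (the lift) and one yes/no answer (significance), which is the limitation the next slide discusses.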
  • 6. Downsides of traditional A/B testing ● Treats differences between customers as noise: if it's harmful to some customers but beneficial to others, we'll just see the zero average impact ● Wasteful: collect and process 100Ks of data points just to learn a boolean, or at most a single number (average impact) ● Can be a hard sell to product teams: if we take the effort to build a feature, we kind of assume it'll be useful – and in any case it’s already built, so what’s the value of the test?
  • 7. How can causal inference help?
  • 8. How can causal inference help in A/B testing? ● Causal inference models will estimate impact per customer as a function of their features (also known as Conditional Average Treatment Effect, or CATE) ○ Most models also supply confidence intervals for those estimates. ● This means you can take the same dataset you collected during A/B testing, enriched with customer features, and get customer segmentation by impact ● Thus, customer heterogeneity becomes a valuable information source, rather than noise to be averaged over ○ This promises smaller sample sizes required for significance
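To make the CATE idea concrete, here is a toy sketch in which the only customer feature is a hypothetical `segment` label, so the CATE can be estimated as a per-segment difference in means. Real CATE estimators (such as those in EconML) handle many continuous features; the data, the segment names, and the effect sizes below are simulated assumptions.

```python
import random
import statistics
from collections import defaultdict

rng = random.Random(1)

# Each row is (segment, treated, outcome). The treatment helps "new"
# customers and hurts "loyal" ones, so the average effect is close to zero.
rows = []
for _ in range(20000):
    segment = rng.choice(["new", "loyal"])
    treated = rng.random() < 0.5  # randomized 50/50 assignment
    true_effect = 1.0 if segment == "new" else -1.0
    outcome = 5.0 + (true_effect if treated else 0.0) + rng.gauss(0.0, 1.0)
    rows.append((segment, treated, outcome))


def mean_diff(data):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for _, t, y in data if t]
    control = [y for _, t, y in data if not t]
    return statistics.fmean(treated) - statistics.fmean(control)


# Classic A/B readout: a single average treatment effect
ate = mean_diff(rows)

# CATE readout: the same estimate, conditional on the customer feature
by_segment = defaultdict(list)
for row in rows:
    by_segment[row[0]].append(row)
cate = {segment: mean_diff(data) for segment, data in by_segment.items()}

print(f"ATE={ate:.3f}")
for segment, effect in sorted(cate.items()):
    print(f"CATE[{segment}]={effect:.3f}")
```

The same dataset that shows an average effect near zero reveals two segments with opposite responses once it is split by the feature.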
  • 9. Our scope: “No unobserved confounders” ● Most important example: we run an A/B/N test and want to better understand its results ● Second option: treatment assignment is random but biased based on customer features ● Final class of examples: we want to draw causal conclusions about a situation where we can’t run an experiment (e.g. impact of suspending users), and don’t have any instrumental variables
  • 11. Which causal model? ● There is a wealth of causal inference packages available, first and foremost DoWhy/EconML, also CausalML, UpliftML and several others ○ Inconsistent APIs ○ Each model comes with its own set of quirks ○ Most models require choice of hyperparameters and also of ‘regular’ ML regressors/classifiers as components ● Most importantly, how do you compare different models after fitting? ○ How do you do out-of-sample testing for counterfactual estimates?
  • 12. Out of sample scoring of causal models To compare causal models, we need to score out of sample. We consider two main method families: 1. Qini/AUC: cumulative curves for outcomes sorted by estimated impact, similar to ROC for classifiers. Theoretically nicer, but harder to interpret 2. Policy value, aka ERUPT (Estimated Response Under Proposed Treatment): unbiased estimator of the outcome if we treated every customer for whom a model predicts (say) a positive CATE: more interpretable We also support the r-scorer, but don’t really like its complexity ;) Image source: N. Radcliffe (2007), formula: Hitsch and Misra (2018) https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3111957
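The policy-value idea can be sketched as follows. This is an inverse-propensity-weighted estimator written for illustration under the assumption of randomized assignment with a known, constant propensity; it is not taken from auto-causality, and the function name `erupt`, the row layout, and the simulated data are assumptions.

```python
import random


def erupt(rows, policy, propensity):
    """Estimate the mean outcome if treatment followed `policy`.

    rows: iterable of (features, treated, outcome)
    policy: function mapping features to True (treat) or False (do not treat)
    propensity: probability of treatment in the experiment (assumed constant)
    """
    total = 0.0
    for features, treated, outcome in rows:
        # Only rows where the experiment happened to do what the policy
        # recommends carry information about the policy's outcome.
        if policy(features) == treated:
            p = propensity if treated else 1.0 - propensity
            total += outcome / p  # reweight to stand in for the skipped rows
    return total / len(rows)


rng = random.Random(2)
rows = []
for _ in range(40000):
    segment = rng.choice(["new", "loyal"])
    treated = rng.random() < 0.5
    effect = 1.0 if segment == "new" else -1.0  # simulated ground truth
    outcome = 5.0 + (effect if treated else 0.0) + rng.gauss(0.0, 1.0)
    rows.append((segment, treated, outcome))

treat_all = erupt(rows, lambda segment: True, 0.5)
treat_none = erupt(rows, lambda segment: False, 0.5)
targeted = erupt(rows, lambda segment: segment == "new", 0.5)

print(f"treat all:  {treat_all:.3f}")
print(f"treat none: {treat_none:.3f}")
print(f"targeted:   {targeted:.3f}")
```

In this simulation the targeted policy should score about 5.5 against about 5.0 for the two blanket policies, which is what makes the score easy to explain: it is the outcome per customer you would expect under the proposed treatment rule.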
  • 13. What policy to choose for ERUPT? ● If the treatment effect is variable, but positive for every customer, the naïve “treat if CATE > 0” policy will involve treating everybody, which makes it hard to differentiate between models ● To correct for this, we offer a normalized ERUPT score, by adding a treatment cost equal to the average treatment effect - thus forcing the models to compete on predicting impact variability ● In real-world problems, you should use a policy that’s as realistic as possible. For example, “send promotion to customer if predicted impact of a promotion on revenue minus promotion cost is positive”
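The normalization described on this slide can be illustrated as below. This is one possible reading, sketched under assumptions: the treatment cost is set to the estimated ATE, subtracted from the outcomes of treated customers, and the policy becomes "treat if predicted CATE exceeds the cost". The two "models" are hypothetical lookup tables of CATE predictions, and the exact computation in auto-causality may differ.

```python
import random
import statistics


def erupt(rows, policy, propensity):
    """Inverse-propensity estimate of the mean outcome under `policy`."""
    total = 0.0
    for features, treated, outcome in rows:
        if policy(features) == treated:
            total += outcome / (propensity if treated else 1.0 - propensity)
    return total / len(rows)


rng = random.Random(3)
rows = []
for _ in range(40000):
    segment = rng.choice(["new", "loyal"])
    treated = rng.random() < 0.5
    effect = 2.0 if segment == "new" else 0.5  # positive for every customer
    outcome = 5.0 + (effect if treated else 0.0) + rng.gauss(0.0, 1.0)
    rows.append((segment, treated, outcome))

# Hypothetical CATE predictions: one model ranks segments correctly, one does not
good_model = {"new": 2.0, "loyal": 0.5}
bad_model = {"new": 1.0, "loyal": 1.5}

# Naive policy "treat if CATE > 0": both models treat everybody
naive_good = erupt(rows, lambda s: good_model[s] > 0, 0.5)
naive_bad = erupt(rows, lambda s: bad_model[s] > 0, 0.5)

# Normalized variant: charge a treatment cost equal to the estimated ATE
cost = statistics.fmean(y for _, t, y in rows if t) - statistics.fmean(
    y for _, t, y in rows if not t
)
net_rows = [(s, t, y - cost if t else y) for s, t, y in rows]
norm_good = erupt(net_rows, lambda s: good_model[s] > cost, 0.5)
norm_bad = erupt(net_rows, lambda s: bad_model[s] > cost, 0.5)

print(f"naive:      good={naive_good:.3f} bad={naive_bad:.3f}")
print(f"normalized: good={norm_good:.3f} bad={norm_bad:.3f}")
```

Under the naive policy the two models are indistinguishable because both treat everyone; once treatment carries a cost equal to the average effect, only the model that ranks customers correctly comes out ahead.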
  • 15. Building blocks for automated model selection ● CATE models: ○ Most EconML CATE estimators ○ Transformed Outcome ○ Dummy model: average treatment effect + randomness ○ Easy to add others – let us know if you have favorites! ● DoWhy for a consistent high-level interface to individual estimators ● FLAML regression for regression component models ● DummyClassifier or FLAML (user choice) for propensity to treat ● FLAML, again, for estimator and hyperparameter search ● First run all chosen estimators with default settings, then sample DoWhy
  • 16. It’s yours to use right now! https://github.com/transferwise/auto-causality Current state, tested and used: ● Extensively tested for binary treatment, random assignment ● Example notebook with the full fitting and analysis cycle Coming soon (almost works, bar testing and bugfixes): ● Multi-valued discrete treatment ● More testing for FLAML classifier option for propensity to treat ● Component models’ time budget as part of the search space ● Reuse of component models across fits, for efficiency
  • 17. Next steps Next big milestone (planned for Summer 2022): ● Extension to EconML instrumental variable models Could use some help from the EconML team: ● Review of hyperparameter search spaces for the EconML models ● OrthoForest inference on 500K+ points doesn’t finish after running for days, although training only takes tens of minutes – is that expected? ● Early stopping option for tree-based models ● What are good ways to score instrumental variable models out of sample?
  • 19. A lot of analytics questions are questions about causality In progress: ● Analysis of CRM campaign impacts ● Improved targeting of referral reward programs Planned: ● Improved analysis of A/B test results for product features ● Impact of customer support turnaround time on retention and future volumes
  • 20. Extra referral rewards impact analysis ● 470K customers in the sample, some kind of extra reward for referring new customers was offered to 360K of them (all grouped into one ‘treatment’ here for simplicity) ● Simulations run on an r5a.4xlarge (128GB RAM) ● Component model time budget was set to 10min, total time budget to 3h ● Features include reward base currency, customer ‘age’, host’s recent transaction volume Quite variable out-of-sample model performance depending on model and target:
  • 21. Results continued ● Different models’ ATE estimates are all over the place ● DML and Sparse flavors most likely to generate wildly off ATE estimates occasionally (excluded from graphs for readability) ● ForestDRLearner tends to overfit the training set (need early stopping?) For POST_BOOST_REFERRALS, best model’s ERUPT and ATE estimates are inconsistent ERUPT vs ATE for training, validation, and test sets (denoted by marker size from smallest to largest)
  • 22. Out-of-sample segmentation Let’s compare the estimated treatment effects in the test set split by the best model’s recommendations: On the other hand, the simple tree policy derived from the model’s recommendations is not very useful
  • 23. Final approach ● The number of converted referrals post-boost was the best-modeled target, so we chose it to train a causal inference model ● Augment it with a conditional model of how much volume a new customer would do with us if converted, given the boost program the host was offered and the host’s other features, to quantify a reward program’s payoff. No causal inference needed for this stage, just regular regression on the much smaller dataset of hosts who had at least one post-boost conversion ● Re-run optimization with policy based on the conditional model
  • 24. Lessons learned ● Consider different target variables as well as different models ● Check consistency between ERUPT and ATE estimates, and between validation and test performance ● Choose your metric wisely ● Need quite large instances to run the fits ○ Now looking at using Ray to parallelize ● Beware of causality leakage! If treatment is correlated with features, but you use a naïve propensity-to-treat model, can get GREAT out-of-sample scores that aren’t real
  • 26. So how can causal inference supercharge your A/B testing? ● Use same A/B testing data, enriched with customer features ● Estimate causal impact as function of features, allowing you to segment customers by impact ● Learn fine-grained impact also from biased random sampling, allowing you to optimize and keep learning at the same time ○ For example, could do a kind of Thompson sampling on customer segments ● Thanks to the magic of DoWhy, EconML, and FLAML, combined in auto-causality, you can do all this with no prior expertise in causal inference
  • 27. Many thanks to Pere Planell Morell, Mark Harley, Timo Flesh, Guy Durant, Wen Kho, and Ziyang Zhang for helping to bring the package this far!