Faster & Cheaper, Smart A/B test
Marsan Ma
[2021/2/2]
Outlines
1. (Problem)
○ Sparse signals make A/B tests hard / impossible
2. (Preliminary)
○ Statistical significance / sample size recap
○ How to reduce sample size required
3. (Algorithms)
○ Stratification
○ CUPED
○ Covariate adjustment with ML
1. Problem:
Sparse signals make A/B tests hard / impossible
● An A/B test needs enough samples to achieve statistical significance.
● Using more samples = slower, more expensive, harms revenue, or costs lives (ex: delayed vaccine/medicine).
● We have monthly _____________ bid-optimized ads for the US in 2020. A test needing more than that = not achievable = cannot deploy.
How many samples do we have for a test?
(chart: monthly counts of BBBB ads / AAAA ads)
● Ex: how many samples do we need for the CPAS experiment (CPAS-356)?
○ For 10% (a huge effect size) we need 20K ads, but a mature product can't easily achieve a 10%
improvement.
○ For 3% we need 200K ads (50% of total ads); the test is super expensive.
○ For 1% we need 2.2M ads => impossible to achieve.
Samples we need for post-apply(start) world
Other power analysis examples:
1. ASBID-652 ApplyStart per job
○ effect size = 3% needs 770K ads / effect size = 5% needs 285K ads for micro metrics
2. ASBID-629 hIPAS (hundred impressions per ApplyStart)
○ effect size = 3% needs 200K ads for the micro metric / 1M ads for the macro metric
3. ASBID-642 AS/impression
○ effect size = 3% needs 70K / effect size = 1% needs 600K for micro metrics
(“micro” = average over ads; we care about micro when doing an A/B test.
“macro” = average over the market; we care about macro when doing an economic review.)
Samples we need for post-apply(start) world
We want to reduce N (sample size)
by reducing the noise, so the signal stands out.
TL;DR
2. Preliminary:
Statistical Significance &
Sample Size
An illustration from the booking.com blog explaining the rough idea:
Next, let's formalize it and explain it more “scientifically”.
Rough idea: defining the “difference” between ctrl/test
From a really good interactive explanation:
Statistical significance in layman's words
Does our new algorithm work? =>
In an A/B test, we usually have:
● H0 (null hypothesis): ctrl/test
make no difference
=> new algorithm doesn't work :(
● Ha (alternative hypothesis):
test significantly better than ctrl
=> new algorithm works :D !
(figure: overlapping ctrl and test distributions, with the Zcrit threshold and shaded α / β / power regions)
● Power (blue region): probability
of getting a statistically
significant result
● Effect size (Cohen's d): how much
difference we observe between
ctrl/test.
=> ex: a “1%” improvement in CTR
● α : false positive (type-I err),
H0 true but statistic > Zcrit
● 𝛃 : false negative (type-II err),
Ha true but statistic < Zcrit
From a really good interactive explanation:
Statistical significance in layman's words
From a really good interactive explanation:
Statistical significance in layman's words
● What we focus on now:
If the metric under test is volatile
=> high std_err
=> larger type-II error (black)
=> smaller power (blue)
=> need more samples
(large N) to have enough
statistical power (so we can
tell whether our new algorithm in
test works).
Q: Are you cheating? Why is the distribution normal, and only shifted rather than deformed, after treatment?
=> Yes, we assume the std_err is the same in ctrl/test. (And that could be wrong.)
=> This is also why we can estimate the sample size before we actually run the A/B test: we only care about std_err,
which can be found from historical data.
You might question … the assumption?
Now we formulate “how many samples do we need” as a formula
(from @HenryStokeley’s “The Power Talk”)
Sample size formula
From @HenryStokeley's “The Power Talk”, the sample size for enough power:

$$N = \frac{2\,(Z_{1-\alpha/2} + Z_{1-\beta})^2\,\sigma^2}{\mu^2}$$

where N = sample size (per group), μ = effect size, σ = standard error, and Z₁₋α/₂, Z₁₋β are the type-I and type-II error quantiles.
Sample size formula
Regardless of the intimidating formula,
we only care about these 3 quantities:
● Sample size N
● Effect size μ
● Standard error 𝛔
And 2 scenarios:
1. Volatile metric
=> larger 𝛔 (standard error)
=> need more samples
2. Effective test
=> larger μ (effect size)
=> need fewer samples
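To make the formula concrete, here is a minimal sketch that computes the required N per group directly from it (a sketch under the assumptions above: two-sample test, equal variance in ctrl/test; the numbers are purely illustrative):

```python
# Minimal sketch of the sample-size formula above; illustrative numbers only.
from scipy.stats import norm

def required_n(mu, sigma, alpha=0.05, power=0.8):
    """Samples per group to detect effect size mu, given per-unit std dev sigma."""
    z_alpha = norm.ppf(1 - alpha / 2)  # type-I error quantile
    z_beta = norm.ppf(power)           # type-II error quantile
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mu ** 2

# Halving sigma cuts the required N by 4x -- the whole point of this deck:
print(required_n(mu=0.01, sigma=0.10))  # ~1570 per group
print(required_n(mu=0.01, sigma=0.05))  # ~392 per group
```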
● Za / Zb are thresholds by definition; we don't want to change them.
● Effect size: sure, we'll try our best to maximize how much we improve.
● Standard error: there are tricks that might reduce it!
1. Choose units less volatile on the metric under test as test samples (Stratification)
2. Leverage historical data of the metric under test (CUPED / Variance Reduction)
3. Leverage other data correlated with the metric under test (Covariate Adjustment)
How to reduce required sample size

$$N = \frac{2\,(Z_{1-\alpha/2} + Z_{1-\beta})^2\,\sigma^2}{\mu^2}$$

(N = sample size, μ = effect size, σ = standard error; Z are the type-I/II error quantiles)
What does this mean for Indeed products?
● JobSeeker: lots of new users every day, but low retention.
=> easy to increase sample size by extending the experiment period.
=> most users have no historical data, so CUPED might not be very effective.
=> stratification / covariate adjustment are still possible.
● Sponsored Ads: few new ads each day, but ads usually live for months.
=> sample size doesn't change much even if we lengthen the experiment period.
=> lots of historical data can be leveraged, so CUPED is useful!
=> ads natively have categories, so stratification by budget/industry is promising.
How to reduce required sample size
We want to reduce N (sample size)
by reducing 𝛔 (noise), so μ (signal) stands out.
TL;DR
3. Algorithms
1. To speed up the test, we want to reduce N
=> Z and μ are pre-defined (thresholds / target effect size).
=> The only thing we can work on is reducing σ.
Recap the main idea
$$N = \frac{2\,(Z_{1-\alpha/2} + Z_{1-\beta})^2\,\sigma^2}{\mu^2}$$

(Z = type-I/II error quantiles, N = sample size, μ = effect size, σ = standard error)
2. We want to know whether ctrl/test are distinguishable.
It's actually a t-test (whether their mean values are different).
=> We're gonna transform 𝚫 into something else that is unbiased and has a lower σ.
Recap the main idea
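As a concrete anchor for the rest of the section, a minimal sketch of that t-test (synthetic data standing in for the per-unit metric Y, or later Y_cv, in ctrl/test):

```python
# Minimal sketch: the significance check is a two-sample t-test on the
# per-unit metric (or its adjusted version); the data here are synthetic.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
ctrl = rng.normal(0.100, 0.05, 2_000)  # e.g. per-ad CTR in ctrl
test = rng.normal(0.101, 0.05, 2_000)  # a 1% relative lift in test
print(ttest_ind(test, ctrl, equal_var=False))  # small lift drowns in the noise
```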
3. Algorithms - 1:
Stratification
Eliminate variance among groups

$$\mathrm{var}_{\text{overall}} = \underbrace{\sum_k w_k \sigma_k^2}_{\text{variance within strata}} + \underbrace{\sum_k w_k (\mu_k - \mu)^2}_{\text{variance among strata}}$$

The adjusted (stratified) variance keeps only the within-strata term.
Reference: Improving the Sensitivity of Online Controlled Experiments (Netflix 2016)
Main idea: remove variance unrelated to treatment
● Variance among different strata (age/browser/country, etc.) is unrelated to our treatment.
● We can remove it to reduce variance.
Main idea: remove variance unrelated to treatment
● In practice, we don’t always know the appropriate weights wk to use. In the context of
online experimentation, these can usually be computed from users not in the experiment.
As we will see in Section 3.3, when we formulate the same problem in the form of control
variates (Section 3.2), we no longer need to estimate the weights.
Eliminate variance among groups
𝚫: unbiased estimator for the shift in the means
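A minimal post-stratification sketch of that estimator (pandas assumed; here the weights w_k are estimated from the sample itself, while the paper notes they can be computed from users not in the experiment):

```python
# Minimal post-stratified mean/variance: weight each stratum's sample mean
# by its share w_k, so among-strata variance drops out of the estimator.
import numpy as np
import pandas as pd

def post_stratified(df, metric, stratum):
    g = df.groupby(stratum)[metric]
    w = df[stratum].value_counts(normalize=True)    # w_k, assumed ~ population
    mean = (w * g.mean()).sum()                     # sum_k w_k * mean_k
    var = (w**2 * g.var(ddof=1) / g.size()).sum()   # sum_k w_k^2 * s_k^2 / n_k
    return mean, var

rng = np.random.default_rng(0)
df = pd.DataFrame({"ctr": rng.normal(0.10, 0.05, 1_000),
                   "country": rng.choice(["US", "JP"], 1_000)})
print(post_stratified(df, "ctr", "country"))
```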
● Quote from Netflix paper:
1. “We have learned through years of research that many factors not related to the product correlate with our
business metrics. For example, the signup country of users correlates with retention. “
2. “The most impactful factors are leveraged as covariates in stratified sampling to help reduce the sampling variance of
business metrics. More covariates are leveraged for existing members since we know more about them.”
Netflix’s stratify implementation
● Online stratification goal: assign each user to the “Kth stratum”, and also decide the “ctrl or test” group (a sketch follows after this slide).
1. A certain “trigger” makes a user included in the experiment (ex: user opens the “kids channel”)
2. Assign the “stratum” first, according to predefined rules or similarity to the strata.
3. In each stratum, use a shuffled 1-100 sequence as in (b) to assign the user to ctrl (1) or test (2) as in (c)
(Ex: want a 3% test => assign cells with b < 3 as test.)
● Ensuring balance is crucial to variance reduction performance (avoid introducing extra variation).
● However, Netflix's conclusion is:
“online stratification is NOT suggested.
Post-stratification is easier and less limited by
online constraints (ex: distributed across multiple
machines)”
Netflix's stratify implementation
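Even though Netflix ends up recommending post-stratification, a minimal sketch of the online assignment described above (a hypothetical shape, not Netflix's code; one shuffled 1-100 cycle per stratum keeps the ctrl/test split exactly balanced within each stratum):

```python
# Minimal sketch of online stratified assignment: per-stratum shuffled
# 1-100 cells; cells <= test_pct go to test, the rest to ctrl.
import itertools
import random
from collections import defaultdict

def make_assigner(test_pct=3, seed=0):
    rng = random.Random(seed)
    cycles = defaultdict(lambda: itertools.cycle(rng.sample(range(1, 101), 100)))
    def assign(stratum):
        b = next(cycles[stratum])  # next shuffled cell in 1..100
        return "test" if b <= test_pct else "ctrl"
    return assign

assign = make_assigner(test_pct=3)
print([assign("kids_channel") for _ in range(10)])
```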
[Stratification]
Removing variance among strata,
focusing only on variance within strata.
TL;DR
3. Algorithms - 2:
CUPED
● Reference: Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data
(Microsoft 2013)
● Main idea: the pre-experiment variance is not an effect of the experiment, so it can be removed.
● Impact: in Bing search, variance reduced by ~50%, achieving the same statistical power with half of the
users, or half the duration.
● Ex: an experiment slowing down
page load time by 200ms:
○ the plain t-test needs 14 days to reach
p-val < 0.05
○ CUPED reaches p-val < 0.05
from the 1st day.
CUPED
(figure: p-value over time, normal t-test vs CUPED)
Math under the hood (feel free to skip)
● The math behind CUPED, with comments:
○ Ŷcv = Ȳ − θ(X̄ − E[X]) is an unbiased estimator of E(Y); CV = control variate.
Suppose we have a magical X highly correlated with Y, somehow E(X) is known, and θ is just “some constant” here.
○ Since E(X) is known, var(E(X)) = 0, and expanding the adjusted variance gives
var(Ŷcv) = var(Ȳ) + θ²·var(X̄) − 2θ·cov(Ȳ, X̄).
○ Minimizing over θ: d/dθ var(Ycv) = 2θ·var(X) − 2·cov(Y,X) = 0,
thus the minimum happens at θ = cov(Y,X) / var(X); so now we have θ.
○ Plugging θ back in, and since corr(Y,X) = cov(Y,X) / √(var(X)·var(Y)),
using this adjusted Ycv instead of Y reduces the variance by ρ²: var(Ycv) = (1 − ρ²)·var(Y).
● Now the difficulty of applying it boils down to finding a control variate X that
(a) is highly correlated with Y, and (b) has a known E(X).
● A simple yet effective way is to use the same variable from the pre-experiment period as the covariate X:
(a) it highly correlates with Y since X is (pre-experiment) Y, and (b) E(X) is known, since it's E(Y).
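A minimal numeric sketch of CUPED under the assumptions above (x = the same metric from the pre-experiment period, aligned per unit; centering X by its sample mean stands in for the known E(X)):

```python
# Minimal CUPED sketch: theta = cov(Y,X)/var(X), Y_cv = Y - theta*(X - mean(X));
# subtracting mean(X) keeps E[Y_cv] = E[Y].
import numpy as np

def cuped_adjust(y, x):
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(0)
x = rng.normal(10, 2, 10_000)     # pre-experiment metric
y = x + rng.normal(0, 1, 10_000)  # in-experiment metric, rho^2 ~ 0.8
print(np.var(y), np.var(cuped_adjust(y, x)))  # ~5.0 -> ~1.0
```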
● We could choose X as something else; the effect depends on corr(Y,X).
● The author tried an X (EntryDay) not quite related to Y (Queries):
○ the result is bad (blue curve), compared to:
○ Green: using X = pre-experiment Y
○ Red: leveraging both, a bit better than green
PS-1: what if X is not the pre-experiment Y?
PS-2: pre/post experiment data contribution
● The more pre-experiment data, the better.
(increased pre-experiment user coverage leads to a larger variance reduction)
● More post-experiment data is NOT necessarily better.
(pre-experiment user coverage decreases, since more new users have no history)
● This is just this particular scenario, as a reference; it is not applicable to all experiments.
[CUPED]
We're gonna use Ycv = Y − 𝚹X instead of Y,
where X = historical Y and 𝚹 = cov(Y,X) / var(X),
which reduces variance by ρ², where ρ = corr(Y, X).
So the better ctrl aligns before/after treatment, the fewer samples we need.
(Ex: a user who never clicks before treatment and sits in the ctrl group gives corr(Y, X) = 1;
a click in the test group after treatment then stands out => effective with even only 1 sample!)
TL;DR
3. Algorithms - 3:
Covariate Adjustment with ML
● Reference: Unbiased variance reduction in randomized experiments (Google 2019)
● Usually in an A/B test, the effect size is 𝛕 = E[Y*] − E[Y] (test minus ctrl).
● From CUPED we have Ycv = Y − 𝚹X, and an unbiased estimator of 𝛕 as 𝚫cv,
where
○ X = auxiliary information independent of the treatment
○ 𝚹 = some constant depending on how we define this X
● And we want to know whether 𝚫cv is statistically significant.
Preliminary recap (from CUPED result)
● Still start from Ycv = Y − 𝚹X, but now we replace X with ML model predictions
(rather than X = pre-experiment Y as in CUPED).
● Leverage covariates (plural, which is hard without ML) by compiling the
covariates into a “model expectation” through the model.
○ If the result is clearly outside the model expectation, the treatment is effective!
○ If even ctrl doesn't align with the model prediction, the adjustment won't work well.
Covariate adjustment with ML
(figure: metric over time; ctrl tracks the model prediction, and test diverges after the treatment starts)
● The model needs to be “asymptotically unbiased”.
○ There are some sufficient & necessary conditions, but basically all sane ML models
fit the criteria. (We skip the paper's detailed math proofs here.)
● Simulation results check whether the adjusted metrics are unbiased.
(well, they seem unbiased ...)
That's cheap? What's the cost?
● We're also interested in the final variance, and how much it is reduced.
● Again skipping the long mathematical proofs, we only wrap up the most important conclusions here:
○ Note that features used by the model CANNOT be affected by the treatment!
○ g is a real-valued differentiable function (ex: log for ratio metrics)
○ assume all the pairwise correlations are zero, except ρ = corr(Y,H) and ρ* = corr(Y*,H*)
○ T = effect / * = treatment / Y = target metric / H = model prediction
○ final variance upper bound: (1 − min{ρ², ρ*²}) · var(T₀)
How much do we gain? Variance reduced
[Covariate Adjustment with ML]
Final variance upper bound = (1 − min{ρ², ρ*²}) · var(T₀),
where ρ = corr(Y, H) and ρ* = corr(Y*, H*).
The better the model prediction (H/H*) aligns with the truth (Y/Y*),
the more variance is reduced.
(Ex: if the model were perfect so Y = H, we could close the test with 1 sample.)
TL;DR
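A minimal end-to-end sketch of the idea (all names and data are illustrative, not from the paper; the model is fit on units outside the experiment so its prediction H stays independent of the treatment):

```python
# Minimal covariate-adjustment sketch: same form as CUPED, with X replaced
# by an ML prediction H built only from pre-treatment covariates.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feats = rng.normal(size=(20_000, 5))  # pre-treatment covariates only!
y = feats @ np.array([1.0, 0.5, 0.3, 0.2, 0.1]) + rng.normal(0, 1, 20_000)

# Fit on units outside the experiment (first half) so H is independent
# of the treatment assignment for the experiment units (second half).
model = LinearRegression().fit(feats[:10_000], y[:10_000])
h = model.predict(feats[10_000:])
y_exp = y[10_000:]

theta = np.cov(y_exp, h, ddof=1)[0, 1] / np.var(h, ddof=1)
y_cv = y_exp - theta * (h - h.mean())
print(np.var(y_exp), np.var(y_cv))  # reduced by ~corr(Y, H)^2
```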
3. Algorithms Summary
Recap all algorithms
1. Stratification
○ Remove variance among strata; focus only on variance within strata.
2. CUPED
○ The better ctrl aligns before/after treatment, the fewer samples we need.
3. Covariate Adjustment with ML
○ The better the model prediction (H/H*) aligns with the truth (Y/Y*), the more variance
is reduced.
4. Evaluation
on sponsored ads metrics
1. (Microsoft) CUPED: Improving the Sensitivity of Online Controlled Experiments by Utilizing
Pre-Experiment Data / 3rd-party slides from Tokyo University
2. (Google) Variance reduction: https://arxiv.org/abs/1904.03817
3. (Netflix) Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix
4. (Booking.com) How Booking.com increases the power of online experiments with CUPED
5. An overview of variance reduction techniques for improving the power of your A/B tests
6. (Unlearn.ai) Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic
score
References