2. Outlines
1. (Problem)
○ Sparse signals make A/B tests hard or impossible
2. (Preliminary)
○ Statistical significance / sample size recap
○ How to reduce sample size required
3. (Algorithms)
○ Stratification
○ CUPED
○ Covariate adjustment with ML
4. ● A/B tests need enough samples to achieve statistical significance.
● Using more samples = slower, more expensive, harmed revenue, or even lost lives (ex: a delayed vaccine/medicine).
● We have monthly _____________ bid-optimized ads for the US in 2020.
A test needing more than that = not achievable = cannot deploy.
How many samples do we have for a test?
BBBB ads
AAAA ads
5. ● Ex: how many samples do we need for a CPAS experiment (CPAS-356)?
○ For 10% (a huge effect size) we need 20K ads, but a matured product can't easily get a 10%
improvement.
○ For 3% we need 200K ads (50% of total ads); the test is super expensive.
○ For 1% we need 2.2M ads => impossible to achieve.
Samples we need for post-apply(start) world
6. Other power analysis examples:
1. ASBID-652 ApplyStart per job
○ effect size = 3% needs 770K ads / effect size = 5% needs 285K ads for micro metrics
2. ASBID-629 hIPAS (hundred impressions per ApplyStart)
○ effect size = 3% needs 200K ads for the micro metric / 1M ads for the macro metric
3. ASBID-642 AS/impression
○ effect size = 3% needs 70K / effect size = 1% needs 600K for micro metrics
(“micro” = average over ads; we care about micro when doing an A/B test.
“macro” = average over the market; we care about macro when doing an economic review.)
Samples we need for post-apply(start) world
7. We want to reduce N (sample size)
by reducing the noise, so that the signal stands out.
TL;DR
9. An illustration from the booking.com blog explaining the rough idea:
Next, let’s formulate it and explain it more “scientifically”.
Rough idea: define the “difference” between ctrl/test
10. From a really good interactive explanation:
Statistical significance in layman’s words
Does our new algorithm work? =>
In an A/B test, we usually have:
● H0 (null hypothesis): ctrl/test
makes no difference
=> new algorithm doesn’t work :(
● Ha (alternative hypothesis):
test is significantly better than ctrl
=> new algorithm works :D !
11. ● Power (blue region): the probability of getting a statistically significant result
● Effect size (Cohen’s d): how different ctrl/test are observed to be
=> ex: a “1%” improvement in CTR
● α : false positive (type-I error): H0 holds but the statistic is > Zcrit
● 𝛃 : false negative (type-II error): Ha holds but the statistic is < Zcrit
From a really good interactive explanation:
Statistical significance in layman’s words
12. From a really good interactive explanation:
Statistical significance in layman’s words
● What we focus on now:
If the metric under test is volatile =>
high std_err
=> larger type-II error (black)
=> smaller power (blue)
=> we need more samples
(large N) to have enough
statistical power (so we can
tell whether our new algorithm in
test works)
13. Q: Are you cheating? Why is the distribution normal and only shifted, not deformed, after treatment?
=> Yes, we assume the std_err is the same in ctrl/test. (And it could be wrong.)
=> That’s also why we can estimate the sample size before we actually run the A/B test: we only care about
std_err, which can be found from historical data.
You might question … the assumption?
14. Now we formulate how many samples we need with this formula
(from @HenryStokeley’s “The Power Talk”)
Sample size formula
15. From @HenryStokeley’s “The Power Talk”
Sample size for enough power
N ∝ (Za + Zb)² · σ² / μ²
(N = sample size, μ = effect size, σ = standard error, Za / Zb = type-I / type-II error quantiles)
Sample size formula
Regardless of the intimidating formula,
we only care about these 3 quantities:
● Sample size N
● Effect size μ
● Standard error 𝛔
And 2 scenarios:
1. Volatile metric
=> larger 𝛔 (standard error)
=> need more samples
2. Effective test
=> larger μ (effect size)
=> need fewer samples
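The formula can be evaluated directly with the standard library; a minimal sketch, assuming a two-sample, two-sided z-test (hence the per-group factor of 2) and illustrative numbers, not actual product data:

```python
from statistics import NormalDist

def required_sample_size(effect_size, std_err, alpha=0.05, power=0.80):
    """Per-group N for a two-sample, two-sided z-test:
    N = 2 * (Z_{1-alpha/2} + Z_{power})^2 * sigma^2 / mu^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # type-I error quantile
    z_beta = z.inv_cdf(power)           # type-II error quantile
    return 2 * (z_alpha + z_beta) ** 2 * std_err ** 2 / effect_size ** 2

# Shrinking the detectable effect from 3% to 1% multiplies N by 9 (quadratic in 1/mu);
# halving sigma divides N by 4 (quadratic in sigma).
n_3pct = required_sample_size(0.03, 1.0)
n_1pct = required_sample_size(0.01, 1.0)
```

The quadratic dependence on σ is the key point of the following slides: halving the standard error cuts the required sample size to a quarter.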
16. ● Za / Zb are thresholds by definition; we don’t want to change them.
● Effect size: sure, we’ll try our best to maximize how much we improve.
● Standard error: there are tricks that might reduce it!
1. Choose units less volatile on the metric under test as test samples (Stratification)
2. Leverage historical data of the metric under test (CUPED / Variance Reduction)
3. Leverage other data correlated with the metric under test (Covariate Adjustment)
How to reduce required sample size
17. What does this mean for Indeed products?
● JobSeeker: lots of new users every day, but low retention.
=> easy to increase the sample size by extending the experiment period.
=> most users don’t have historical data, so CUPED might not be very effective.
=> stratification / covariate adjustment are still possible.
● Sponsored Ads: few new ads each day, but ads usually live for months.
=> sample size doesn’t change much even if we lengthen the experiment period.
=> lots of historical data can be leveraged; CUPED is useful!
=> ads natively have categories, so stratification by budget/industry is promising.
How to reduce required sample size
18. We want to reduce N (sample size)
by reducing 𝛔 (noise), so that μ (signal) stands out.
TL;DR
20. 1. To speed up the test, we want to reduce N
=> Z and μ are pre-defined thresholds.
=> The only thing we can work on is reducing σ.
Recap the main idea
21. 2. We want to know whether ctrl/test are distinguishable
It’s actually a t-test (whether their mean values are different)
=> We’re going to transform 𝚫 into something else that is unbiased and has lower σ
Recap the main idea
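The 𝚫 in question is just the difference of means with its standard error; a minimal two-sample z-test sketch on simulated data (all numbers are invented for illustration):

```python
import random
from statistics import NormalDist, mean, variance

random.seed(3)
# Simulated metric: test is 5% better than ctrl, but the metric is noisy (sigma = 1).
ctrl = [random.gauss(1.00, 1.0) for _ in range(2000)]
test = [random.gauss(1.05, 1.0) for _ in range(2000)]

delta = mean(test) - mean(ctrl)                                        # observed effect
se = (variance(ctrl) / len(ctrl) + variance(test) / len(test)) ** 0.5  # std err of delta
z_score = delta / se
p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))                     # two-sided p-value

# With sigma = 1 and N = 2000 per group, se is ~0.03: a 0.05 lift is barely
# distinguishable from noise, which is why we want to shrink sigma.
```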
23. Eliminate variance among groups
overall variance = variance among strata + variance within strata:
σ² = Σk wk (μk − μ)² + Σk wk σk²
the adjusted variance keeps only the within-strata term: σ²adj = Σk wk σk²
Reference: Improving the Sensitivity of Online Controlled Experiments (Netflix2016)
Main idea: remove variance unrelated to treatment
● Variance among different strata (age/browser/country … etc.) is unrelated to our treatment.
● We can remove them to reduce variance.
24. Main idea: remove variance unrelated to treatment
● In practice, we don’t always know the appropriate weights wk to use. In the context of
online experimentation, these can usually be computed from users not in the experiment.
As we will see in Section 3.3, when we formulate the same problem in the form of control
variates (Section 3.2), we no longer need to estimate the weights.
Eliminate variance among groups
𝚫: an unbiased estimator for the shift in the means
25. ● Quote from Netflix paper:
1. “We have learned through years of research that many factors not related to the product correlate with our
business metrics. For example, the signup country of users correlates with retention. “
2. “The most impactful factors are leveraged as covariates in stratified sampling to help reduce the sampling variance of
business metrics. More covariates are leveraged for existing members since we know more about them.”
Netflix’s stratify implementation
26. ● Online stratification goal: assign each user to the “k-th stratum”, and also decide the ctrl or test group.
1. A certain “trigger” makes a user included in the experiment (ex: user opens the “kids channel”)
2. Assign a “stratum” first, according to predefined rules or similarity to strata.
3. In each stratum, use a shuffled 1-100 sequence as in (b) to assign the user to ctrl (1) or test (2) as in (c)
(Ex: want a 3% test => assign cells with b < 3 to test.)
● Ensuring balance is crucial to variance reduction performance (avoid introducing extra variation)
● However, Netflix’s conclusion is:
“online stratification is NOT suggested.
Post-stratification is easier and less limited by
online constraints (ex: distributed across multiple
machines)”
Netflix’s stratify implementation
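A toy post-stratification sketch (strata, weights, and numbers are hypothetical, not from the Netflix paper): once we average within strata and reweight, the between-strata part of the variance disappears.

```python
import random
from statistics import variance

random.seed(0)
# Hypothetical: metric depends heavily on the stratum (e.g. country), not on treatment.
strata_means = {0: 1.0, 1: 11.0}
data = [(k, random.gauss(strata_means[k], 1.0))
        for k in random.choices([0, 1], k=10000)]

ys = [y for _, y in data]
overall_var = variance(ys)  # = within-strata + between-strata variance

# Post-stratification: keep only the within-stratum variance, weighted by stratum share.
adjusted_var = 0.0
for k in strata_means:
    group = [y for s, y in data if s == k]
    adjusted_var += (len(group) / len(data)) * variance(group)
# adjusted_var ~= 1, overall_var ~= 26: the between-strata component is removed.
```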
29. ● Reference: Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data
(Microsoft 2013)
● Main idea: the pre-experiment variance is not an effect of the experiment, therefore it can be removed.
● Impact: in Bing search, variance is reduced by ~50%, achieving the same statistical power with half of the
users, or half the duration.
● Ex: an experiment slowing down
page load time by 200ms
○ a t-test needs 14 days to achieve
p-val < 0.05
○ CUPED achieves p-val < 0.05
from the 1st day.
CUPED
30. ● The math behind CUPED, with comments
● Now the difficulty of applying it boils down to “finding a control variate X that
(a) is highly correlated with Y, and (b) has known E(X)”
● A simple yet effective way is using the same variable from the pre-experiment period as the covariate X:
(a) X is highly correlated with Y since X is (pre-experiment) Y, and (b) E(X) is known, since it’s E(Y)
Math under the hood (feel free to skip)
● Ycv is an unbiased estimator of E(Y) (CV = control variate).
Suppose we have a magical X highly correlated with Y whose E(X) is somehow known; 𝚹 is just “some constant” here.
● Since E(X) is known, var(E(X)) = 0, and expanding the adjusted variance gives
var(Ycv) = var(Y) + 𝚹²·var(X) − 2𝚹·cov(Y,X)
● d/d𝚹 var(Ycv) = 2𝚹·var(X) − 2·cov(Y,X) = 0,
thus the minimum happens at 𝚹 = cov(Y,X) / var(X), so now we have 𝚹
● Using this adjusted Ycv instead of Y, the variance is reduced by ρ² (i.e. to (1 − ρ²)·var(Y)),
since ρ = corr(Y,X) = cov(Y,X) / √(var(X)·var(Y))
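The derivation translates almost line by line into code; a self-contained sketch on simulated data (the correlation and lift values are made up):

```python
import random
from statistics import mean, variance

random.seed(1)
# X = pre-experiment metric, Y = in-experiment metric, highly correlated with X.
x = [random.gauss(5.0, 2.0) for _ in range(5000)]
y = [xi + random.gauss(0.5, 1.0) for xi in x]  # small in-experiment shift plus noise

def cuped(y, x):
    """Y_cv = Y - theta * (X - mean(X)), with theta = cov(Y, X) / var(X)."""
    mx, my = mean(x), mean(y)
    cov_yx = sum((yi - my) * (xi - mx) for yi, xi in zip(y, x)) / (len(x) - 1)
    theta = cov_yx / variance(x)
    return [yi - theta * (xi - mx) for yi, xi in zip(y, x)]

y_cv = cuped(y, x)
# mean(y_cv) == mean(y): the estimator stays unbiased,
# while var(y_cv) ~= (1 - rho^2) * var(y), a big cut when rho is high.
```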
31. ● We could choose X to be something else; the effect depends on corr(Y,X).
● The author tried using an X (EntryDay) not quite related to Y (Queries)
○ The result is bad (blue curve), compared to:
○ Green: using X = pre-experiment Y
○ Red: leveraging both, a bit better than green
PS-1: what if using X = pre-experiment Y ?
32. PS-2: pre/post experiment data contribution
● The more pre-experiment data, the better.
(increased pre-experiment user coverage leads to a larger variance reduction)
● More post-experiment data is NOT necessarily better.
(it decreases pre-experiment user coverage, since more new users have no history)
● This is just the case for this particular scenario, as a reference; it is not applicable to all experiments.
33. [CUPED]
We use Ycv = Y − 𝚹X instead of Y,
where X = historical Y and 𝚹 = cov(Y,X)/var(X),
which reduces variance by ρ², where ρ = corr(Y, X).
So the better ctrl aligns before/after treatment, the fewer samples we need.
(Ex: a user who never clicked before treatment and is in the ctrl group keeps corr(Y, X) = 1
=> a click in the test group after treatment is detectable even with only 1 sample!)
TL;DR
35. ● Reference: Unbiased variance reduction in randomized experiments (Google 2019)
● Usually in an A/B test we have an effect size 𝛕
● From CUPED we have Ycv = Y − 𝚹X, and an unbiased estimator of 𝛕 as 𝚫cv,
where
○ X = auxiliary information independent of the treatment
○ 𝚹 = some constant depending on how we define this X
● And we want to know whether 𝚫cv is statistically significant.
Preliminary recap (from CUPED result)
36. ● Still start from Ycv = Y − 𝚹X, but now we replace X with ML model predictions
(rather than X = pre-experiment Y as in CUPED).
● Leverage covariates (plural, which is hard w/o ML) with machine learning, by compiling the
covariates into a “model expectation” through the model.
○ If the result is clearly outside the model expectation, the treatment is effective!
○ If even ctrl doesn’t align with the model prediction, the adjustment won’t work well.
Covariate adjustment with ML
(Figure: metric over time; ctrl tracks the model prediction, while test diverges after the treatment starts.)
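A minimal sketch of the idea with a one-feature “model” (an ordinary least-squares slope fit on non-experiment users; the data-generating numbers are invented): the model prediction H plays the role of X in the CUPED-style adjustment.

```python
import random
from statistics import mean, variance

random.seed(2)

def simulate(n, lift=0.0):
    """Hypothetical metric driven by a pre-treatment covariate z plus noise."""
    z = [random.gauss(0.0, 1.0) for _ in range(n)]
    y = [3.0 * zi + lift + random.gauss(0.0, 1.0) for zi in z]
    return z, y

# 1) Fit the "model" (no-intercept OLS slope) on users OUTSIDE the experiment,
#    so the covariate is independent of the treatment.
z_hold, y_hold = simulate(5000)
slope = sum(a * b for a, b in zip(z_hold, y_hold)) / sum(a * a for a in z_hold)

# 2) Adjust experiment users: H = model prediction, Y_adj = Y - theta * (H - mean(H)).
z_exp, y_exp = simulate(5000, lift=0.5)
h = [slope * zi for zi in z_exp]
mh, my = mean(h), mean(y_exp)
cov_yh = sum((yi - my) * (hi - mh) for yi, hi in zip(y_exp, h)) / (len(h) - 1)
theta = cov_yh / variance(h)
y_adj = [yi - theta * (hi - mh) for yi, hi in zip(y_exp, h)]
# var(y_adj) is far below var(y_exp); the mean (and thus the lift) is untouched.
```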
37. ● The model needs to be “asymptotically unbiased”
○ There are some sufficient & necessary conditions, but basically, all sane ML models
fit the criteria. (skipping the detailed math proofs in the paper here)
● Simulation results check whether the adjusted metrics are unbiased.
(well, they seem unbiased ...)
That’s cheap? What’s the cost?
38. ● We’re also interested in the final variance and how much it is reduced.
● Still skipping the super long mathematical proofs, we only wrap up the most important conclusions here:
○ Note that features used by the model CANNOT be anything affected by the treatment!
○ g is a real-valued differentiable function (ex: log for ratio metrics)
○ assume all the pairwise correlations are zero, except ρ = corr(Y,H) and ρ* = corr(Y*,H*)
○ T = effect / * = treatment / Y = target metric / H = model prediction
○ variance reduced: the final variance upper bound is (1 − min{ρ², ρ*²})·var(T0)
How much do we gain?
39. [Covariate Adjustment with ML]
Final variance upper bound = (1 − min{ρ², ρ*²})·var(T0),
where ρ = corr(Y, H), ρ* = corr(Y*, H*).
The better the model predictions (H/H*) align with the truth (Y/Y*),
the more variance is reduced.
(Ex: suppose the model is perfect so Y = H; we could close the test with 1 sample.)
TL;DR
41. Recap of all algorithms
1. Stratification
○ Remove variance among strata; focus only on variance within strata.
2. CUPED
○ The better ctrl aligns before/after treatment, the fewer samples we need.
3. Covariate Adjustment with ML
○ The better the model predictions (H/H*) align with the truth (Y/Y*), the more variance
is reduced.
43. 1. (Microsoft) CUPED: Improving the Sensitivity of Online Controlled Experiments by Utilizing
Pre-Experiment Data / 3rd-party slides from Tokyo University
2. (Google) Variance reduction: https://arxiv.org/abs/1904.03817
3. (Netflix) Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix
4. (Booking.com) Increasing the power of online experiments with CUPED
5. An overview of variance reduction techniques for improving the power of your A/B tests
6. (Unlearn.ai) Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic
score
References