This document discusses best practices for interpreting A/B test results and making decisions based on those results. It cautions against relying solely on significance values and p-values, noting that a p-value only gives the probability of data at least as extreme as what was observed if the null hypothesis were true, not the probability that the null hypothesis is true. It emphasizes considering confidence intervals, choosing the right metrics like profit or revenue, and weighing the risks of false positives versus missed opportunities when deciding whether to roll out a change. The key takeaways are to incorporate risk into test design, focus on the real goal of increasing profit, and recognize that the stakes are generally lower for online experiments than medical trials.
Statistics for CRO - Conversion Conference London (Tom Capper)
The document discusses common mistakes in A/B testing and provides guidance on properly designing A/B tests, interpreting their results, and making decisions from them. It notes that the most common serious errors include not accounting for multiple testing, choosing the wrong metrics, and improper stopping rules. It emphasizes the importance of considering significance and risk during test design, focusing on actual key performance indicators like profit, and recognizing the different risks between medical testing and business experimentation.
The document provides information about SPC (Statistical Process Control) training conducted by Hopez Institute. It includes the course content which covers topics like process definition, defect detection vs prevention, statistics fundamentals, variation and causes of variation, control charts for variables and attributes. It also discusses the history and evolution of SPC, which was pioneered by Walter Shewhart. The document aims to help participants understand why SPC is important and how it can be applied to processes.
This document discusses using the 5 Whys technique for root cause analysis. It begins by explaining why root cause analysis is used, which is to find the root causes of complex problems. It then provides an overview of the 5 Whys process, which involves identifying the problem, asking why it occurred, and repeating until the root cause is uncovered. As an example, it analyzes a problem where gloves were unexpectedly mixed into rubber compound using 5 Whys. It determines through iterative questioning that the root cause was lack of trash bins for glove disposal in the production area. Corrective actions included removing contaminated rubber and remilling, while preventative action was to provide trash bins.
This document presents a decision problem faced by a manufacturer. The manufacturer produces items that have a probability of being defective. These items are formed into batches of 150. The manufacturer can either screen each item in a batch to check for defects at a cost of $10 per item screened, or use the items directly without screening and incur a cost of $100 per defective item that makes it through. Based on the given probabilities of good vs. bad quality batches, the expected costs per batch are calculated for each option. A decision tree is constructed to model the problem, taking into account the option to first test a single randomly selected item from the batch before deciding whether to screen the entire batch or not. The optimal strategy is determined to be
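A minimal sketch of the expected-cost comparison this deck describes. The batch-quality probabilities and per-quality defect rates below are illustrative assumptions, since the deck's actual figures are not reproduced here:

```python
# Screening decision sketch: the quality probabilities are ASSUMED for
# illustration; only the batch size and unit costs come from the summary above.
BATCH_SIZE = 150
SCREEN_COST = 10        # $ per item screened
DEFECT_COST = 100       # $ per defective item that slips through

P_GOOD = 0.8                                  # assumed prior P(good batch)
DEFECT_RATE = {"good": 0.05, "bad": 0.30}     # assumed defect rates

def cost_screen() -> float:
    """Screening inspects every item, so the cost is fixed."""
    return BATCH_SIZE * SCREEN_COST

def cost_no_screen() -> float:
    """Expected cost of using the batch directly: $100 per expected defect."""
    expected_defects = sum(
        p * DEFECT_RATE[quality] * BATCH_SIZE
        for quality, p in (("good", P_GOOD), ("bad", 1 - P_GOOD))
    )
    return expected_defects * DEFECT_COST

print(f"Screen everything: ${cost_screen():,.0f} per batch")
print(f"No screening:      ${cost_no_screen():,.0f} per batch")
# With these assumed numbers both options cost $1,500, which is exactly the
# situation where the deck's extra branch (test one item first, then decide)
# earns its keep.
```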
Why you should stop using the generic term “Technical Debt” and start resolvi... (Alex Fedorov)
Do you feel like you are accruing all this Technical Debt every week and it slows your whole team down? Do you want to deal with it so badly, but you don’t know where even to start? Did you have “refactoring/rewriting sprints” in the past, and they have failed, and you had to revert?
The first piece of advice is to stop calling it “Technical Debt.” Instead, sit down with your team and figure out the list of specific problems that are causing the most trouble in your software right now. Then attack only the first problem on the list. Repeat on a regular basis.
In this talk, you will learn WHY you should do that, and HOW to do that. Moreover, you are going to be able to apply it tomorrow at your workplace!
The document discusses the 5 Why's technique for root cause analysis. It can be used for troubleshooting, quality improvement, and problem solving. The process involves repeatedly asking "Why?" five times to determine the root cause of a problem by drilling down through its symptoms. Tools like Ishikawa charts, design of experiments, and statistical analysis can also aid in root cause analysis.
The document describes the Kepner-Tregoe methodology, a structured approach for problem solving, decision making, and risk analysis. It was developed in the 1960s and has been used by teams like those that solved problems during the Apollo 13 mission. The methodology involves systematically gathering information, prioritizing objectives, generating and evaluating alternatives, and verifying solutions. It provides step-by-step guidance for tasks like defining problems, identifying potential causes, testing solutions, and monitoring outcomes. Examples are given for applying the various steps to hypothetical problems regarding product defects, customer issues, and other scenarios.
SAMPLE SIZE – The indispensable A/B test calculation that you’re not making (Zack Notes)
If you’re a marketer, it’s very likely that you’ve run an A/B test. It’s also likely that you’ve never calculated the sample size for your tests and instead run tests until they reach statistical significance. If so, your strategy is statistically flawed. Honoring the required sample size means marketers must wait longer for test results, but ignoring it will produce false positives and lead to bad decisions.
This deck was created for an email audience, but there are valuable lessons for anyone who runs A/B tests.
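As a rough illustration of the calculation the deck argues for, here is a minimal sample-size sketch using statsmodels; the 10% baseline conversion rate and 2-point minimum detectable lift are assumed values, not the deck's:

```python
# Pre-test sample size for a two-proportion test; the baseline rate
# and minimum detectable effect below are assumptions for illustration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # assumed control conversion rate
mde = 0.02        # assumed minimum detectable absolute lift

effect = proportion_effectsize(baseline + mde, baseline)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,   # tolerated false positive rate
    power=0.80,   # chance of detecting the lift if it is real
    ratio=1.0,    # equal-size control and treatment arms
)
print(f"~{n_per_arm:,.0f} visitors needed per arm")  # ≈ 1,900 here
```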
This document provides an overview of 5 Why analysis, a root cause analysis tool. It discusses when to use 5 Why analysis, such as for recurring errors or quality issues. The general guidelines for 5 Why analysis include using a cross-functional team, asking "why" until the root cause is uncovered, and ensuring corrective actions address root causes rather than just symptoms. Examples of applying 5 Why analysis to problems like a vehicle not starting and long assembly times are also provided. Potential problems that can occur with 5 Why analysis include stopping at symptoms rather than root causes and different conclusions from different people.
Test reporting is something few testers take time to practice. Nevertheless, it's a fundamental skill—vital for your professional credibility and your own self-management. Many people think management judges testing by bugs found or test cases executed. Actually, testing is judged by the story it tells. If your story sounds good, you win. A test report is the story of your testing. It begins as the story we tell ourselves, each moment we are testing, about what we are doing and why. We use the test story within our own minds, to guide our work. James Bach explores the skill of test reporting and examines some of the many different forms a test report might take. As in other areas of testing, context drives good reporting. Sometimes we make an oral report; occasionally we need to write it down. Join James for an in-depth look at the art of test reporting.
Are you one of those "gifted debuggers" that everyone turns to when they need to solve a difficult problem? Great! This talk isn't for you.
For the rest of us, debugging is often considered a mysterious trait that some engineers were born with, but alas, some simply haven't. This talk is here to bust that myth. It calls "bullshit" on the gifted-debugger myth and claims that with well-structured methodology and a couple of simple tips, we can all master debugging and stop using trial and error (and other witchcraft tactics) to find the cause of our problems. This methodology has served me well over the years to solve difficult problems, and will hopefully serve you as well.
A test strategy is the set of ideas that guides your test design. It's what explains why you test this instead of that, and why you test this way instead of that way. Strategic thinking matters because testers must make quick decisions about what needs testing right now and what can be left alone. You must be able to work through major threads without being overwhelmed by tiny details. James Bach describes how test strategy is organized around risk but is not defined before testing begins. Rather, it evolves alongside testing as we learn more about the product. We start with a vague idea of our strategy, organize it quickly, and document as needed in a concise way. In the end, the strategy can be as formal and detailed as you want it to be. In the beginning, though, we start small. If you want to focus on testing and not paperwork, this approach is for you.
SXSW 2016 - Everything you think about A/B testing is wrong (Dan Chuparkoff)
Everything you've learned about A/B Testing is based on the fundamentally flawed belief that there's one right answer. But the era of mass-market, one-right-answers is over. A/B Testing is our most valuable tool in the battle to create a more engaging web. But our strategy is broken. Don't worry, we can gain a better understanding of our users with a little data science. And we can reinvent A/B Testing... I will show you how.
At Civis Analytics, we specialize in Data Science. From here, we can clearly see that all people are not the same. So why are A/B Tests designed to search for a single solution? In this session I'll show you where A/B Testing is headed next. See you in Austin!
Workshop on Root Cause Analysis tools: Ask Why five times and fishbone (Ishikawa) diagram. I use this to teach basic concepts and give people an experience of using the tools.
This document introduces the A3 problem solving process. The A3 process provides a structured approach to address complex problems involving multiple causes across an organization. It involves planning to understand the current and target conditions, analyzing root causes, developing countermeasures through experiments, checking the results, and acting on lessons learned. An example is provided of using the A3 process to address an increase in serious defects found in code releases. The example walks through planning, root cause analysis identifying potential causes like insufficient testing time and large stories, developing countermeasures like weekly backlog grooming and test automation, and checking for results like reduced defects.
When confronted with a problem, have you ever stopped and asked "why" five times? The Five Whys technique is a simple but powerful way to troubleshoot problems by exploring cause-and-effect relationships.
5 why’s technique and cause and effect analysis (Bhagya Silva)
The document describes the 5 Whys technique and cause and effect analysis for problem solving. The 5 Whys technique was developed in the 1930s by Toyota to repeatedly ask "Why?" to identify the root cause of a problem. Cause and effect analysis uses a diagram to brainstorm potential causes within categories like people, materials, and equipment that may be contributing to a problem. The technique provides a structured approach to analyze problems, uncover relationships between causes, and identify solutions.
A Rapid Introduction to Rapid Software Testing (TechWell)
This document provides a summary of a presentation on Rapid Software Testing. The presentation was given by Michael Bolton of DevelopSense and covered the methodology and mindset of rapid software testing. It emphasizes testing software expertly under uncertainty and time pressure. The presentation defines rapid testing as testing more quickly and less expensively while still achieving excellent results. It compares rapid testing to other approaches like exhaustive, ponderous, and slapdash testing. The presentation also discusses principles of rapid testing, how to recognize problems quickly using heuristics, and testing rapidly to fulfill the mission of testing.
Julian Harty - Alternatives To Testing - EuroSTAR 2010 (TEST Huddle)
EuroSTAR Software Testing Conference 2010 presentation on "Alternatives To Testing" by Julian Harty. See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/
The document discusses exploratory testing and concepts such as testing with intent, testing as performance versus testing as artifact creation, and realizing the dynamic nature of testing. It provides examples of exploratory testing activities like finding the happy path in a system, identifying coverage, and reporting bugs using the RIMGEA method. The document also examines learning about features in layers and different roles in testing such as testers, programmers, and test automators. Overall, it presents exploratory testing as a systematic and rigorous approach to discovering risks in a system through analysis and testing heuristics.
My presentation for the Belgium Testing Days 2013 and the TestKit conference 2012. This presentation shows how a structured approach can help you choose and implement a test tool successfully.
Continuous Testing: Preparing for DevOps (STePINForum)
Digital transformation requires continuous testing to quickly adapt to changes and deliver value to customers. Traditional testing relies heavily on manual testing which is inefficient. Continuous testing uses test automation to test early and often through the development process. This shifts testing left and allows for more frequent testing of applications and APIs. Continuous testing leverages techniques like exploratory testing, test automation, and continuous integration to significantly increase testing efficiency and reduce risks. The goal is to close knowledge gaps through continuous and adaptive investigation.
The document provides 10 guidelines for running effective A/B tests:
1. Have one key metric per experiment to clarify decision making.
2. Use your key metric to calculate statistical power and determine required sample size.
3. Run experiments for the planned duration without early stopping.
4. Don't search for differences across many segments to avoid false positives.
5. Ensure experiment groups are balanced to avoid bucketing issues.
6. Don't overcomplicate methods when basics suffice.
7. Be cautious launching changes that didn't hurt without evidence of benefit.
8. Involve data scientists in the entire process for better design and analysis.
9. Only analyze people actually exposed to variations.
Webinar: Experimentation & Product Management by Indeed Product Lead (Product School)
Main Takeaways:
- Why should I run experiments as a Product Manager?
- How long should I run experiments?
- How do I interpret experiment results and make low-risk decisions?
Meta-Analyses in Experimentation: The Whats and Hows (VWO)
Ruben shows a step-by-step method of using research and experimentation to combine all learnings into behavioral insights. With this method, you will not only improve your research and conversion rate optimization practices but also provide a much more pleasant journey for your potential customers! This will undoubtedly help you increase conversion rates and become a lot more successful in your job.
The Flexibility of Business Experimentation | Masters of Conversion by VWO (VWO)
There are times when rules can be broken. The severity of breaking or bending these rules changes as your industry and risk change. This presentation is about the shortcuts that business experimentation programs can take that still lead to business value but bend the rules of the standard scientific method. There are certain liberties that business testing programs can take that lead to a similar or identical outcome.
WEBINAR: How to Set Up and Run Hypothesis Tests (ENCORE!) (GoLeanSixSigma.com)
The first live presentation of this webinar was so popular that we’re doing an encore presentation!
Join us for this 1-hour advanced webinar where we answer the question, “Why do we need hypothesis tests in process improvement?” and then stay with us as we walk you through a real, live hypothesis test direct from the Bahama Bistro!
Odoo Partners - From Seed to Tree; How to scale your Business (Odoo)
The document provides tips for growing a business, including focusing on sales channels, trusting account managers, doing demos, making clear proposals, implementing solutions internally first, starting small, upgrading existing clients, using recurrent service models, and continually learning. It also notes that defining problems clearly helps lead to solutions, and that recurrent services can help build revenue over time.
This document discusses various machine learning model validation techniques and ensemble methods such as bagging and boosting. It defines key concepts like overfitting, underfitting, bias-variance tradeoff, and different validation metrics. Cross validation techniques like k-fold and bootstrap are explained as ways to estimate model performance on unseen data. Bagging creates multiple models on resampled data and averages their predictions to reduce variance. Boosting iteratively adjusts weights of misclassified observations to build strong models, but risks overfitting. Gradient boosting and XGBoost are powerful ensemble methods.
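As a brief illustration of the validation techniques summarized above, here is a minimal k-fold cross-validation sketch with scikit-learn; the toy dataset and the gradient-boosting model are arbitrary stand-ins:

```python
# Estimate out-of-sample accuracy with 5-fold cross validation instead
# of trusting the (overfitting-inflated) training score.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = GradientBoostingClassifier(random_state=0)  # a boosting ensemble

scores = cross_val_score(model, X, y, cv=5)  # accuracy on each held-out fold
print(f"fold accuracies: {scores.round(3)}")
print(f"mean ± sd: {scores.mean():.3f} ± {scores.std():.3f}")
```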
One of the most commonly asked questions is “when is an MVT experiment or AB test finished?”
Is it at 30 days...? 100 conversions...? 10,000 visitors...?
The short answer is... it depends.
Gerlof Hoekstra - OMG What Have We Done - EuroSTAR 2013 (TEST Huddle)
EuroSTAR Software Testing Conference 2013 presentation on "OMG What Have We Done" by Gerlof Hoekstra.
See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/
Being Right Starts By Knowing You're Wrong (Data Con LA)
Data Con LA 2020
Description
The recent proliferation of predictive analytics within companies is of limited benefit unless these companies learn to measure, understand, and embrace a critical concept: error. There is no such thing as a perfect predictive model, and all tools using any sort of predictive model will have error. Despite being relatively easy to implement and understand, consistent error measurement continues to be underutilized or even completely avoided. In this session we will discuss:
*Why embracing error is so valuable to companies.
*We will then review basic ways to measure error in commonly used models and in data source systems such as CRMs and ERPs.
*Most importantly, we will review some ways to approach company leadership with the concept of error.
Speaker
Ryan Johnson, GoGuardian, Director of Science and Analytics
This document discusses the importance of testing in direct marketing and how to extract value from testing. It notes that testing is essential for survival and growth in a changing digital landscape. To extract value, tests should focus on questions that are actionable, relevant, and repeatable and can deliver a return on investment, such as tests of offers, contact strategies, channel mix, and customer preferences. The document emphasizes starting with a clear understanding of what needs to be tested and communicating results clearly to stakeholders.
1) "How to design powerful experiments" provides tips for setting up successful A/B tests and experiments to make data-driven decisions and reduce risks.
2) Key tips include having the proper experimentation infrastructure in place, following an iterative process of developing hypotheses, designing experiments, analyzing results, and executing.
3) Case studies show that A/B tests at Google and REA Group led to increased annual revenue and conversion rates, demonstrating the value of experimentation.
10 Best Practices to Becoming a Feedback Ninja (by @peoplemetrics @smcdade) (PeopleMetrics)
Customer feedback is all the rage, but how do you know you're using it effectively? Take a quick read through these 10 tips to using your customer experience data more effectively and efficiently.
In this presentation we answer the question, "Why do we need hypothesis tests in process improvement?" Then we walk you through a real, live hypothesis test direct from the Bahama Bistro!
You can find the rest of the webinar materials and questions from the webinar here:
https://goleansixsigma.com/webinar-set-run-hypothesis-tests/
Anton Muzhailo - Practical Test Process Improvement using ISTQB (Ievgenii Katsan)
Here are a few potential questions from the document:
- What is the true value of ISTQB certifications beyond just checking a box for management? How can the knowledge be applied practically?
- How can metrics be designed and used effectively to assess quality and test coverage in an agile environment? What are some examples of valid and invalid metrics?
- What artifacts or information are useful to include in a test plan even for agile teams using tools like JIRA? How can a test plan provide value beyond just additional paperwork?
- What techniques can be used to effectively estimate defect severity when multiple testers with different perspectives are involved? How can consistency be achieved?
- How can root cause analysis be applied
What you need to know for trustworthy A/B tests (Minho Lee)
Slides from a guest lecture given on 2021-09-04.
---
Many people say that A/B testing is important.
But what exactly are we trusting when we hand decisions over to an A/B test?
An A/B test is not a magic tool that produces results just by being run.
This talk looks at what further thought is needed to get trustworthy experiment results.
This document summarizes key concepts from a presentation on A/B testing fundamentals. It discusses:
1. The different possible outcomes of A/B tests and how they relate to concepts like true positives, false positives, etc.
2. The difference between false positive rate and false discovery rate. False positive rate considers the probability of a false positive from a single test, while false discovery rate accounts for running multiple tests.
3. How to balance factors like error rates, effect size detection, and test duration by making tradeoffs between them, such as running tests longer to reduce error rates or detect smaller effects.
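To make the false positive rate vs. false discovery rate distinction concrete, here is a small back-of-the-envelope sketch; the 10% base rate of real effects, 5% alpha, and 80% power are assumed numbers:

```python
# Across many experiments, what fraction of *significant* results are
# false discoveries? All three inputs below are assumptions.
alpha = 0.05     # false positive rate per test
power = 0.80     # chance a real effect reaches significance
p_real = 0.10    # assumed fraction of experiments with a real effect

false_pos = alpha * (1 - p_real)   # nulls that cross the threshold
true_pos = power * p_real          # real effects that cross it
fdr = false_pos / (false_pos + true_pos)

print(f"False positive rate: {alpha:.0%} per test")
print(f"False discovery rate: {fdr:.0%} of significant results")
# ≈ 36% here: far above 5%, because most tested ideas are nulls.
```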
15. We hope to convince you:
● Standard A/B testing methods lead you to overestimate the business impact of rolled out features.
● The only way around this is to massively increase sample size, run replications/holdouts, or to model out the bias.
● You should think hard about your goal: Is it just hypothesis testing or is it estimation too?
30. What’s going on?
● Our measurements will be biased when:
○ We expect variability in our measurements, and
○ We condition on only those measurements that pass some evidence threshold.
31. Goal of research
● To generalize from some sample to a population:
○ Inferences about the population are only as good as the sample.
42. Simulated A/B tests
● Control: draw 50k samples from a binomial distribution where probability of success = 50%.
● Treatment: draw 50k samples from a binomial distribution where probability of success = 50% * 1.01 = 50.5%.
● Run statistical test: roll out 2019 if it’s better than 2011 and p < 0.05.
● Repeat 10k times.
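The simulation described on this slide fits in a few lines of Python; below is a hedged sketch (NumPy/SciPy) using the same 50k-sample, 1%-lift setup, with a two-sided z-test standing in for whatever test the deck actually used:

```python
# Winner's-curse simulation: true 1% relative lift (50.0% -> 50.5%),
# roll out only when treatment wins significantly, then look at the
# average *estimated* lift among the roll-outs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, P_C, P_T, RUNS = 50_000, 0.500, 0.505, 10_000

kept_lifts = []
for _ in range(RUNS):
    c = rng.binomial(N, P_C) / N      # observed control rate
    t = rng.binomial(N, P_T) / N      # observed treatment rate
    se = np.sqrt(c * (1 - c) / N + t * (1 - t) / N)
    p = 2 * stats.norm.sf(abs(t - c) / se)
    if t > c and p < 0.05:            # "roll out if better and p < 0.05"
        kept_lifts.append((t - c) / c)

print(f"rolled out in {len(kept_lifts) / RUNS:.1%} of runs")
print(f"true lift 1.0%; mean estimated lift after roll-out: "
      f"{np.mean(kept_lifts):.2%}")
# Conditioning on significance selects the lucky draws, so the
# surviving estimates overstate the true 1% lift (≈ 1.7% here).
```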
55. 2 goals of experimentation
Hypothesis Testing
● Goal: know if treatment is different from control.
● Output: probability of observed data given null.
Treatment Effect Estimation
● Goal: know how different treatment is from control.
● Output: distribution of treatment effect sizes compatible with observed data.
60. Solutions
Increase Sample Size
● Pro: no inflation when power = 1.
● Cons: time consuming, opportunity costs.
Holdouts/Replications
● Pro: no inflation.
● Cons: resource intensive, time consuming, opportunity costs.
Estimate and Remove Bias
● Pro: easy.
● Cons: relies on assumptions, based on long-run expectation.
65. Simulated A/B tests (1% lift)
● Control: 50k samples from binomial with probability of success = 50%.
● Treatment: 50k samples from binomial with probability of success = 50% * 1.01 = 50.5%.
72. Inflation is a function of power
● Problem: we never know the true power of any test.
● Solution: infer power from the observed difference.
76. Infer inflation from observed difference
● Once we infer the expected inflation, we can shrink the observed difference appropriately.
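One way to implement this is to find the true effect whose conditional mean, given that it cleared the significance cutoff, equals the difference we observed. The sketch below uses a one-sided cutoff and a truncated-normal mean; the deck's exact method may differ:

```python
# Shrink a significant observed difference for selection bias.
from scipy import stats
from scipy.optimize import brentq

def expected_obs_given_sig(mu, se, z_crit=1.96):
    """Mean of a N(mu, se) estimate conditional on exceeding z_crit * se."""
    cut = z_crit * se
    a = (cut - mu) / se
    tail = stats.norm.sf(a)                     # P(clearing the cutoff)
    return mu + se * stats.norm.pdf(a) / tail   # truncated-normal mean

def shrink(observed_diff, se, z_crit=1.96):
    """Find the mu whose post-selection mean matches the observed diff."""
    return brentq(
        lambda mu: expected_obs_given_sig(mu, se, z_crit) - observed_diff,
        -10 * se, observed_diff,                # bracket: true mu <= observed
    )

# E.g. an observed lift of 0.83 points with a 0.32-point standard error
# shrinks back to roughly the 0.5-point true lift from the simulation.
print(f"shrunk estimate: {shrink(0.0083, 0.0032):.4f}")
```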
81. We hope we convinced you:
● Standard A/B testing methods lead you to overestimate the business impact of rolled out features.
● The only way around this is to massively increase sample size, run replications/holdouts, or to model out the bias.
● You should think hard about your goal: Is it just hypothesis testing or is it estimation too?
87. Early stopping/sequential testing
● Efficient A/B testing where you peek at the data more than once.
○ Controls the Type I error rate by “spending” alpha incrementally at each peek.
○ Example: collect data for 1 week, run the test, and stop if the t-statistic is more extreme than the critical t-statistic at that peek.
● Problem: leads to even more bias in treatment estimates.