Optimizely Workshop: Take Action on Results with Statistics

Take Action on Results
With Statisitcs
An Optimizely Online Workshop
Statistician: Leonid Pekelis

Optimizely’s Stats Engine is designed to work with you, not
against you, to provide results which are reliable and
accurate, without requiring statistical training.
At the same time, by knowing some statistics of your own, you
can tune Stats Engine to get the most performance for your
unique needs.

1. Which two A/B Testing pitfalls inflate error rates when using
classical statistics, and are avoided with Stats Engine?
2. What are the three tradeoffs in an A/B Test? And how are they
related?
3. How can you use Optimizely’s results page to best tune the
tradeoffs to achieve your experimentation goals?
After this workshop, you should be able to answer…

How to choose the number
of goals and variations for your experiment.
We will also preview

• A)
The original, or baseline version of content that
you are testing through a variation.
• B)
Metric used to measure impact of control and
variation
• C)
The control group’s expected conversion rate.
• D)
The relative percentage diﬀerence of your
variation from baseline.
• E)
The number of visitors in your test.
Which is the
Improvement?

• A) Control and Variation
The original, or baseline version of content that
you are testing through a variation.
• B) Goal
Metric used to measure impact of control and
variation
• C) Baseline conversion rate
The control group’s expected conversion rate.
• D) Improvement
The relative percentage diﬀerence of your
variation from baseline.
• E) Sample size
The number of visitors in your test.

Stats Engine corrects the
pitfalls of A/B Testing with
classical statistics.

A procedure for classical statistics
(a.k.a. “T-test”, a.k.a. “Traditional Frequentist”, a.k.a “Fixed Horizon Testing” )
Farmer Fred
wants to compare the effect of two fertilizers on crop yield.
1. Chooses how many plots to use (sample size).
2. Waits for a crop cycle, collects data once at the end.
3. Asks “What are the chances I’d have gotten these results if
there was no difference between the fertilizers?” (a.k.a. p-value)
If p-value < 5%, his results are significant.
4. Goes on, maybe to test irrigation methods.

1915
Data is expensive.
Data is slow.
Practitioners are trained.
2015
Data is cheap.
Data is real-time.
Practitioners are everyone.
Classical statistics were designed for an
offline world.

The modern A/B Testing procedure is different
1. Start without good estimate of sample size.
2. Check results early and often. Estimate ROI as quickly as
possible.
3. Ask “How likely did my testing procedure give a wrong
answer?”
4. Many variations on multiple goals, not just 1.
5. Iterate. Iterate. Iterate.

p-Value < 5%.
Signiﬁcant!
p-Value > 5%.
Inconclusive.
p-Value > 5%.
Inconclusive.
Min Sample Size
Peeking
Time
Experiment Starts
p-Value > 5%.
Inconclusive.

Why is this a problem?
There is a ~5% chance of false
positive each time you peek.

p-Value < 5%.
Signiﬁcant!
p-Value > 5%.
Inconclusive.
p-Value > 5%.
Inconclusive.
Min Sample Size
Peeking
Time
Experiment Starts
p-Value > 5%.
Inconclusive.
4 peeks —> ~18% chance of seeing a false positive

Pitfall 2. Mistaking
“False Positive Rate” for
“Chance of a wrong
conclusion”

1 original page, 5 variations, 6 goals = 30 “A/B Tests”

After I reach my minimum sample size,
I stop the experiment and see
2 of my variations beating control
and 1 variation losing to control

Winner
Winner
Loser
Classical statistics guarantee
<= 5% false positives.
What % of my 2 winners and 1
loser do I expect to be false
positives?

Winner
Winner
Loser
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
2 winners, 1 loser, and 27 inconclusives

Winner
Winner
Loser
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
Inconclusive
30 A/B Tests x 5% = 1.5 false positives!

Winner
Winner
Loser
Classical statistics guarantee <= 5%
false positives.
What % of my winners & losers do I
expect to be false positives?
Answer: C) With 30 A/B Tests, we can
expect a
= 50% chance of a wrong conclusion!
In general, we can’t say without
knowing how many other goals &
variations were tested.
1.5
3

related?
After this workshop you should be able to answer …

A. Peeking and mistaking “False Positive Rate” for “Chance of
a wrong conclusion.”
After this webinar, you should be able to answer …

Error rates Runtime
Improvement
& Baseline CR

Error rates Runtime
Improvement
& Baseline CR
conclusion”

Error rates Runtime
Improvement
& Baseline CR
conclusion calling a non-
winner a winner, or a non-
loser a loser.”

Where is the error rate on Optimizely’s results page?
I. II. III. IV.
Statistical Signiﬁcance
=
“Chance of a right conclusion”
= (a.k.a.)
100 x (1 - False Discovery Rate)

How can you control the error rate?

Where is runtime on Optimizely’s results page?

Error rates Runtime
Were you expecting a funny
picture?
Improvement
& Baseline CR

Where is effect size on Optimizely’s results page?

Improvement
& Baseline CR
These three quantities are all …
Error rates Runtime
Inversely
Related

At any number of visitors,
the higher error rate I allow,
the smaller improvement you can
detect.
Error rates Runtime
Inversely
Related
Improvement
& Baseline CR

Error rates Runtime
Inversely
Related
At any error rate threshold,
stopping your test earlier means
you can only detect larger
improvements.
Improvement
& Baseline CR

For any improvement,
the lower error rate you want,
the longer you need to run your test.
Error rates Runtime
Inversely
Related
Improvement
& Baseline CR

What does this look like in practice?
Average Visitors needed to reach
signiﬁcance with Stats Engine
Improvement (relative)
5% 10% 25%
Signiﬁcance
Threshold
(Error Rate)
95 (5%) 62 K 14 K 1,800
90 (10%) 59 K 12 K 1,700
80 (20%) 53 K 11 K 1,500
Baseline conversion rate = 10%

~ 1 K visitors per day
5% 10% 25%
Signiﬁcance
Threshold
(Error Rate)
95 (5%) 62 K 14 K 1,800
90 (10%) 59 K 12 K 1,700
80 (20%) 53 K 11 K 1,500 (1 day)

~ 10K visitors per day
5% 10% 25%
Signiﬁcance
Threshold
(Error Rate)
95 (5%) 62 K 14 K 1,800
90 (10%) 59 K 12 K 1,700
80 (20%) 53 K 11 K (1 day) 1,500

~ 50K visitors per day
3% 5% 10%
Signiﬁcance
Threshold
(Error Rate)
95 (5%) 190 K 62 K 14 K
90 (10%) 180 K 59 K 12 K
80 (20%) 160 K 53 K (1 day) 11 K

> 100K visitors per day
3% 5% 10%
Signiﬁcance
Threshold
(Error Rate)
95 (5%) 190 K 62 K 14 K
90 (10%) 180 K 59 K 12 K
80 (20%) 160 K (1 day) 53 K 11 K

related?
After this workshop, you should be able to answer …

related?
A. Error Rates, Runtime, and Eﬀect Size. They are all inversely
related.

Use tradeoffs to align your
testing goals

5%
Error rates Runtime
Improvement
& Baseline CR
Inversely
Related
+5%,
10%
53 K
?
In the beginning, we make an educated guess …

… but after 1 day …
Data!
How can we update the tradeoffs?

Improvement turns out to be better …
Instead of:
53K - 10K
=
43K
5% 1,600
Error rates Runtime
+13%,
10%
Inversely
Related
Improvement
& Baseline CR

… or worse.
5% 75 K
Error rates Runtime
+2%,
8%
Inversely
Related
Improvement
& Baseline CR

2. Accept higher / lower error
rate

Improvement turns out to be better …
1% 43 K
Error rates Runtime
+13%,
10%
Inversely
Related
Improvement
& Baseline CR

… or worse.
30% 43 K
Error rates Runtime
+2%,
8%
Inversely
Related
Improvement
& Baseline CR

3. Admit it. It’s inconclusive.

… or a lot worse.
> 99% > 100K
Error rates Runtime
+.2%,
8%
Inversely
Related
Improvement
& Baseline CR
iterate,
iterate,
iterate!

Your experiments will not always have the same
improvement over time.
So, run A/B Tests for at least a business cycle
appropriate for that test and your company.
Seasonality & Time Variation

classical statistics, and are avoided with for Stats Engine?
2. What are the three tradeoﬀs in one A/B Test?
A. Adjust your timeline. Accept higher / lower error rate. Admit
an inconclusive result.

1. Which two A/B Testing pitfalls inflate error rates when using classical statistics,
and are avoided with Stats Engine?
A. Peeking and mistaking “False Positive Rate” for “Chance of a Wrong
Answer.”
B. Error Rates, Runtime, and Effect Size. They are all negatively related.
3. How can you use Optimizely’s results page to best tune the tradeoffs to achieve
your experimentation goals?
C. Accept higher / lower error rate. Adjust your timeline. Admit an
inconclusive result.
Answer.”
Answer.”
Answer.”
Review

Preview: How many goals and
variations should I use?

Stats Engine is more conservative when
there are more goals that are not aﬀected by
a variation.
So, adding a lot of “random” goals will slow
down your experiment.

Tips & Tricks for using Stats Engine with multiple goals
and variations
• Ask: Which goal is most important to me?
-This should be the primary goal (not impacted by all other
goals)
• Run large, or large multivariate tests without fear of ﬁnding
spurious results, but be prepared for the cost of exploration.
• For maximum velocity, only test goals and variations that you
believe will have highest impact.

Optimizely Workshop: Take Action on Results with Statistics

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Optimizely Workshop: Take Action on Results with Statistics

Similar to Optimizely Workshop: Take Action on Results with Statistics (20)

More from Optimizely

More from Optimizely (20)

Recently uploaded

Recently uploaded (20)

Optimizely Workshop: Take Action on Results with Statistics