Online A-B Tests
From Statistics to Distributed Systems
Part 1. Applied Statistics
Ming Lei
2019/11
The Statistics and Concepts
• Problem Structure in Statistics
• Formal Statistical Methodologies
• Hypothesis Testing: Superiority test, Non-inf test
• Two-sample T-Tests: 1-tail vs. 2-tail, paired vs. unpaired
• A/B Test vs. Multi-treatments Test
• Power Analysis
• Is the sample set large enough?
• How MSZ relates to lift, std-dev and mean of population
• Sampling vs. Activation
• Activation is ultimately on impressions, where end users interact with computer
programs
Objective of an A-B Test
• Find how a treatment affects a random variable.
• Assume a treatment only shifts the mean of its distribution.
• Lift: the change in mean, measured as a ratio.
• Example variables (metrics):
• CTR on the results of a search query (metrics for search engines)
• CTR on an ad impression (metrics for CPC ads)
• GMB of a user, conversion rate of a user's sessions (metrics for e-commerce)
• Example treatments:
• New ranking model (search engine)
• New CTR model (ads)
• New item view page (e-commerce)
A random process generates variable v:
p(x1, x2, … , ε) → v
Change a parameter x1 that we can control:
p(x1′, x2, … , ε) → v′
We want to see whether there is a meaningful difference
between v and v′.
Is the population mean of v′ different from that of v?
But we can’t directly measure the population mean.
Run the random process a few times to
generate two sample sets:
{v1, v2, v3, … }  {v1′, v2′, v3′, … }
A naïve idea:
Compute lift = mean(v′) / mean(v) − 1 from the means of the
two sample sets.
Is lift = 0?
But the lift itself is a random variable!
Hence the naïve idea does not seem very smart or complete!
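A quick way to see that the lift is itself a random variable (a sketch with hypothetical CTR-like Bernoulli samples; all numbers are made up): draw the two sample sets repeatedly with no true effect and watch the computed lift move.

```python
import numpy as np

# Hypothetical CTR-like samples: "control" and "treatment" come from the
# SAME population, i.e. the true lift is exactly 0.
lifts = []
for seed in range(5):
    rng = np.random.default_rng(seed)
    control = rng.binomial(1, 0.10, size=10_000)   # assumed 10% CTR
    treated = rng.binomial(1, 0.10, size=10_000)   # identical distribution
    lifts.append(treated.mean() / control.mean() - 1)

# Even with zero true effect, the observed lift fluctuates from run to run,
# so "is lift = 0?" cannot be answered by a single point estimate.
print([f"{l:+.2%}" for l in lifts])
```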
It turns out:
The lift roughly follows a known distribution.
CLT: the sum of a large number of IID variables
follows a normal distribution;
hence so does the sample mean of IID variables.
As long as the following are satisfied:
• “IID” (independent and identically distributed)
• A “large number” of samples
[Figure: probability density function; the shaded area is Prob(v > x)]
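A minimal numerical check of the CLT claim, using a deliberately skewed IID variable (an exponential, standing in for something like per-session GMB; the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 2_000, 5_000

# Draw `reps` independent sample sets of size n from a skewed
# distribution (exponential with population mean 1.0, std-dev 1.0).
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# By the CLT, the sample mean is approximately Normal(mu, sigma/sqrt(n)):
print(sample_means.mean())   # close to the population mean 1.0
print(sample_means.std())    # close to 1.0 / sqrt(2000), about 0.022
```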
Formal Methodology
• Hypothesis Testing
• Form a hypothesis on random variables.
• Collect evidence from samples to reject the hypothesis.
• Two-sample T-tests
• Test a hypothesis on the sample means or the lift of the sample means of our
variables.
• Reject the hypothesis with alpha (type I error).
• Power Analysis
• Is the sample set large enough?
Null Hypothesis on Lift
Let lift = treatedMean / controlMean - 1
• H0: lift = 0
This is usually referred to as a two-tailed test. It tests both superiority
(lift > 0) and inferiority (lift < 0). It is the H0 used by eBay EP.
• H0: lift <= 0
One-tailed test for superiority.
• H0: lift <= -errorMargin
This is the non-inferiority test, where we want to reject the H0 that
treated is worse than control by more than the errorMargin.
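The three null hypotheses map directly onto the `alternative` argument of `scipy.stats.ttest_ind`. A sketch with synthetic normal samples (the means, margin, and sample sizes are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(10.0, 2.0, size=5_000)
treated = rng.normal(10.2, 2.0, size=5_000)   # small positive true lift

# H0: lift = 0  (two-tailed)
p_two = stats.ttest_ind(treated, control).pvalue

# H0: lift <= 0  (one-tailed superiority); when the observed lift is
# positive, this p-value is exactly half the two-tailed one.
p_sup = stats.ttest_ind(treated, control, alternative="greater").pvalue

# H0: lift <= -errorMargin  (non-inferiority): shift the null by the
# margin, then run a one-tailed test.
margin = 0.005 * control.mean()
p_noninf = stats.ttest_ind(treated + margin, control,
                           alternative="greater").pvalue

print(p_two, p_sup, p_noninf)
```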
Assumptions of T-Test
• The means of the two populations being compared should follow a normal
distribution.
• Implied by the Central Limit Theorem: the sum/mean of a large number of
IID (independently and identically distributed) variables tends to follow
a normal distribution.
• The two populations being compared should have the same variance.
• Implied by the assumption that the treatment only changes the mean of the
population.
• Variables are sampled independently.
• This implies that the metric computation has to be consistent with the
serving-time treatment assignment (e.g., session-scope metrics are not
consistent with assigning GUIDs to treatments).
Independent T-Test for Two Samples
H0: no difference between the two groups
α (Type I error) - false positive:
Pr(reject H0 | H0 is true)
β (Type II error) - false negative:
Pr(not reject H0 | H0 is false)
Power = 1 − β = Pr(reject H0 | H0 is false)
[Figure: when the dotted green line (observed statistic) is to the left of the
left orange critical line OR to the right of the right orange critical line,
the mean of Test is not the same as the mean of Control.]
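α has a direct operational meaning that can be checked by simulation: when H0 is true, a level-0.05 t-test should reject about 5% of the time. A small Monte Carlo sketch (the sample sizes and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, trials = 0.05, 2_000

rejections = 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, size=200)
    b = rng.normal(0.0, 1.0, size=200)   # H0 true: identical populations
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1

# The empirical rejection rate approximates the type I error alpha.
print(rejections / trials)
```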
P-Value
1-tail vs. 2-tail two-sample T-Tests are computed on sample data only
(assuming the sample size is large enough!)
Can we trust the P-Value?!
→ Can we trust our two-sample T-Test?!
→ Can we trust the assumptions made by our two-sample T-Test?!
Is our sample size really large enough?
→ Power Analysis!
Power
• More than just the type II error.
• It also evaluates whether the sample size is large enough to confirm
the assumption made by p-value calculation in two-sample T-Test.
• Intuition: Repeatability/Stability of the t-test result.
Power vs. Sample Size and Lift
MSZ (Minimum Sample Size)
The minimum number of sampled variables required to reach a target statistical
power (e.g., β = 0.1) for the test. MSZ depends on:
• the population std-dev: approximated using a large historical data set
• the population mean: approximated by the mean of the control sample
• a preconfigured MDE (minimum detectable lift)
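One way to compute MSZ from these three inputs is standard power analysis, e.g. with `statsmodels` (the numeric values below are hypothetical placeholders, not eBay's):

```python
from statsmodels.stats.power import TTestIndPower

mu = 10.0      # control mean, approximated from the control sample (assumed)
sigma = 25.0   # population std-dev, approximated from historical data (assumed)
mde = 0.01     # preconfigured minimum detectable lift (1%, assumed)

# Convert the lift into Cohen's d: the absolute mean shift over the std-dev.
effect_size = (mde * mu) / sigma

# Solve for the per-group sample size at alpha=0.05, power=0.9 (beta=0.1).
msz = TTestIndPower().solve_power(effect_size=effect_size,
                                  alpha=0.05, power=0.9,
                                  alternative="two-sided")
print(f"MSZ per group ~ {msz:,.0f}")
```

Note how MSZ scales with the square of std-dev/(MDE × mean), which is why high-variance metrics like GMB need enormous samples.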
Fixed Horizon A-B Test
Timeline: the past 3 months of data for the pre-test, then a fixed test duration.
Pre-Test:
• Estimate variable stats of the tested population using the part of the site
traffic covered by the test.
• Collect historical test data to infer MDE (minimum detectable effect).
Test planning:
• Estimate MSZ with the MDE and population stats.
• Compute the test duration from MSZ.
Test ends:
• If sample# < MSZ:
Update the MSZ calculation for the extended period and
recommend extending the test duration.
• Else:
If p-value < alpha, reject H0: lift = 0
(regardless of whether lift > MDE);
else don’t reject H0.
Fixed Horizon with a Trial Period
Timeline: the past 3 months of data for the pre-test, a 1-week trial period, then the remaining test duration.
Pre-Test:
• (as before)
Test planning:
• The initial MSZ and test duration are inaccurate.
End of trial (1 week):
• Re-estimate the population variance with test traffic.
• Re-estimate MDE.
• Re-estimate MSZ and duration with the estimated variance and observed lift.
Test ends:
• (as before)
Non-Inferiority Test
● The p-value is ½ that of the 2-tailed test, when the observed
lift > -errorMargin.
● Power analysis uses both the MDE and the errorMargin.
Non-Inferiority Test
● A non-inf test is easier: lower MDE and lower MSZ.

Test                 CV    alpha  beta  alternative  ErrorMargin  MDE(lift)  MSZ
Two-sided/Non-equiv  5.15  0.1    0.2   two-sided    N/A          1%         3.3M
Non-inf              5.15  0.1    0.2   larger       -0.5%        0.5%       2.4M
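The MSZ column can be reproduced from the other columns with the standard normal-approximation sample-size formula, assuming the non-inferiority test's detectable shift is MDE + |errorMargin| (a reading consistent with the numbers in the table):

```python
from scipy.stats import norm

def msz(cv, alpha, beta, effect, two_sided=True):
    """Per-group minimum sample size for a two-sample mean test:
    n = 2 * (z_{1-a} + z_{1-b})^2 * CV^2 / effect^2,
    where CV = sigma/mu and effect is the relative lift to detect."""
    z_a = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_b = norm.ppf(1 - beta)
    return 2 * (z_a + z_b) ** 2 * cv ** 2 / effect ** 2

# Two-sided row: CV=5.15, alpha=0.1, beta=0.2, MDE=1%
two_sided = msz(5.15, 0.1, 0.2, 0.01)
# Non-inf row: one-sided, detectable shift = MDE + |errorMargin| = 1%
non_inf = msz(5.15, 0.1, 0.2, 0.005 + 0.005, two_sided=False)
print(f"{two_sided/1e6:.1f}M  {non_inf/1e6:.1f}M")   # ~3.3M and ~2.4M
```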
Usage of Non-inferiority Tests
• It serves as a “guardrail” against degradation.
• Applies to:
• ML model refreshes.
• Code rewrites and bug fixes.
• Use non-inf tests on all non-primary metrics, and a superiority test on primary
metrics.
Multi-treatments Test
• A/B test has a pair {treatment, control}.
• Multivariate Test has multiple treatments and a single control.
• Warning:
If you simply treat a multivariate test as multiple A/B tests that share a
control, you will underestimate the type-I error,
because a single control sample is reused for multiple comparisons.
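To see the scale of the problem, here is the family-wise type I error for k independent comparisons at the same nominal level (with a shared control the comparisons are positively correlated, so the exact number differs, but it still exceeds alpha):

```python
# Probability of at least one false positive among k independent
# comparisons, each run at significance level alpha.
alpha, k = 0.05, 5
fwer = 1 - (1 - alpha) ** k
print(f"{fwer:.3f}")   # 0.226: far above the nominal 0.05
```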
Two-steps for Multi-treatments Test
• Step 1: Test whether the means of the different treatments are different.
H0: μ0 = μ1 = μ2 = …
Use a one-way ANOVA test.
• Step 2: Find which treatments are significantly better than control.
Use the Tukey HSD test.
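A sketch of the two-step procedure with `scipy` (synthetic data; the treatment effect sizes are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(10.0, 2.0, size=3_000)
treat_a = rng.normal(10.0, 2.0, size=3_000)   # no real effect
treat_b = rng.normal(10.5, 2.0, size=3_000)   # real positive effect

# Step 1: one-way ANOVA on H0: mu0 = mu1 = mu2
f_stat, p_anova = stats.f_oneway(control, treat_a, treat_b)
print(p_anova)        # small: at least one mean differs

# Step 2: Tukey HSD locates which groups differ, while controlling
# the family-wise error rate across all pairwise comparisons.
res = stats.tukey_hsd(control, treat_a, treat_b)
print(res.pvalue)     # 3x3 matrix of pairwise p-values
```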
Paired Two-Sample T-Test in Conversion Model Evaluation
[Diagram] The same input record {Query, Result set, conversions} is processed
through two different treatments:
• Treatment A produces {ResultList, conversions} → NDCG of conversions
• Treatment B produces {ResultList′, conversions} → NDCG′ of conversions
Each input record thus generates two paired variables for comparison.
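A sketch of why pairing helps here, with hypothetical per-query NDCG scores (the score model is invented for illustration): each query is evaluated under both treatments, and the paired test removes the large query-to-query variance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_queries = 1_000

# Per-query NDCG under treatment A; queries vary widely in difficulty.
ndcg_a = rng.uniform(0.3, 0.9, size=n_queries)
# Treatment B: a small per-query improvement plus noise, same queries.
ndcg_b = np.clip(ndcg_a + rng.normal(0.01, 0.05, size=n_queries), 0.0, 1.0)

paired = stats.ttest_rel(ndcg_b, ndcg_a)      # exploits the pairing
unpaired = stats.ttest_ind(ndcg_b, ndcg_a)    # ignores it

# The paired test detects the small lift; the unpaired test's p-value is
# inflated by the between-query variance.
print(paired.pvalue, unpaired.pvalue)
```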
Activation vs. Sampling
● Sampling
A random process that samples set A for control and set B for treatment
from the same population.
● Activation
A treatment is activated on a variable only if some condition is satisfied.
E.g., a clinical trial may have the following qualification requirements:
● the person has a certain disease, and
● is willing to take the drug (often enforced by overseeing both the
treatment and control groups as they take the real and placebo drugs,
respectively).
Why Does Activation Stand Out?
● Can it be implemented as part of the sampling process?
● A clinical A-B test of a cancer drug:
● YES, the sampling process can be limited to the population with cancer!
● But is that the only condition for “activation”?
● What if half of the sampled patients throw away the drug?
● The observed lift may drop 50% compared to the expected lift.
● Add an administrative step to your test procedure:
● Monitor your patients taking the drug.
● This is your activation step.
A-B Test in Online Services
● Users interact with online services in a pattern of
“Request – Impression”.
● Sampling can be applied to different units:
users, sessions, requests/impressions, …
● But the treatment is ultimately defined on what is fed to the user:
the impression.
● Treating a request is only a necessary condition for treating an
impression.
Activation in Online Service A-B Test
● Activation
● is not checking whether the treatment was applied during request processing;
● it is checked on impressions.
● Example: a refreshed ranking model for a search engine.
● Every request is processed differently by it than by the base model, but
only a small percentage of user impressions (SRPs) are actually
different, and hence treated at all.
Test Boundaries in Online Services
● A component view
● An online application has a frontend, backend, data store, etc.
● Each component is a self-contained subsystem and runs its own A-B
test.
● That is fine, as long as you understand who your test subjects are.
● If the subjects are users,
then regardless of where you treat the requests,
the boundary of your test is the interaction between
the application as a whole and its users.