Introductory Online Controlled Experiments
Bowen Li,
Staff Data Scientist @Vpon
2016/04/08
Outline
1 Introduction
2 Online Experiment Procedure
3 Experiment Designs
4 Experiment Analytics
5 Further on Online Experiments
6 Discussions
Introduction
Applications
Validate segments for advertising
Enhance algorithms: search, ads, personalization, recommendation
Change apps, UI, content management system
Among many others
Motivations
Scientifically verify the hypothesis:
If a specific change is introduced, will it improve key metrics?
Establish causal relationships:
Unlike data mining techniques, which find correlation patterns
Why Online Experiment?
Intuition for assessing the value of ideas is not reliable
Most ideas fail to improve key metrics:
Google: Only about 10% of experiments led to business changes
Netflix: 90% of what they try turns out to be wrong
Even small gains aggregate across millions of users & events
Getting trustworthy results is hard
Shared pitfalls and puzzling results:
Kohavi et al. (2010, 2012); Kohavi & Longbotham (2010)
Experiment Basics
Factor: A controlled variable thought to influence the metric
Test factor: Its effect is of interest
Non-test factors: Their effects are not of interest
A/B test: A single factor with two levels
A vs. B
Control vs. Treatment
Existing vs. New
A/B/n test: One factor with more than two levels
Multivariable test: More than one factor
Variant: E.g. an A/B test has two experimental variants
Randomization unit: Based on the assumption that units are independent
Online Experiment Procedure (1/2)
Online Experiment Procedure (2/2)
A/B test procedure
1 Define
Overall Evaluation Criterion (OEC): for decision making
Metrics of interest: for finding insights
2 Sample size calculation
3 Random assignment to Treatment & Control
4 Log data collection
5 Online monitoring
6 Experiment analytics
7 Decision making based on OEC
Online Experiment Procedure: Details (1/5)
Step 1.1: Define the OEC for decision making
Single metric: Incorporates tradeoffs between metrics
Frequently, an experiment will improve one metric but hurt another
Must be decided in advance
Otherwise it induces familywise Type-I error (later)
Guideline:
Bad OEC: Short-term profit that does not predict long-term value
Good OEC: Drivers of lifetime value
E.g. sessions per user, repeat visits, conversion rates, etc.
Step 1.2: Define metrics for finding insights
Compute many metrics
Must control the False Discovery Rate (later)
Online Experiment Procedure: Details (2/5)
Step 2: Calculate sample size
Sample size based on a 50%/50% Treatment/Control split
For maximum testing power
Determines how long the experiment runs
See later for details
Online Experiment Procedure: Details (3/5)
Step 3: Randomly assign users to Treatment or Control
George Box: "Block what you can control and randomize what you cannot"
Blocking: (later)
Use when some non-test factors can be controlled
Randomization: (later)
Use when these non-test factors cannot be controlled
In a consistent manner:
Same experience across a user's repeated visits (see the assignment sketch below)
Step 4: Collect log data
Collect logs for online monitoring & experiment analytics
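The "consistent manner" requirement above is commonly met by deterministic hashing of the user ID together with an experiment name. A minimal Python sketch of that idea, not the platform's actual implementation; the experiment name, hashing scheme, and 50/50 split are illustrative assumptions.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    Hashing (experiment, user_id) returns the same bucket on every visit,
    so a user keeps the same experience, and salting with the experiment
    name decorrelates assignments across different experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 10_000  # 10,000 fine-grained buckets
    return "treatment" if bucket < treatment_pct * 10_000 else "control"

# Repeated visits by the same user get the same variant
print(assign_variant("user-42", "checkout_redesign"))
print(assign_variant("user-42", "checkout_redesign"))  # identical result
```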
Online Experiment Procedure: Details (4/5)
Step 5: Online monitoring
Treatment ramp-up
Initiate Treatment with a 0.1%/99.9% split
Ramp up from 0.1% to 0.5%, 2.5%, 10%, 50%
At each step (lasting hours), analyze the data to catch egregious problems
These can be detected quickly even on small samples
Sample ratio mismatch (SRM) graph: (see the SRM check sketch below)
Monitor (1) # users, (2) OEC/metrics, etc., in each variant, over time
Interactions between overlapping experiments (later)
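An SRM check compares the observed user counts per variant against the configured split. A minimal sketch using a chi-square goodness-of-fit test; scipy is assumed available, and the counts, split, and alpha threshold are illustrative.

```python
from scipy import stats

def srm_check(n_control: int, n_treatment: int,
              expected_split=(0.5, 0.5), alpha: float = 0.001) -> bool:
    """Return True if a sample ratio mismatch is detected.

    A tiny p-value means the observed counts are inconsistent with the
    configured traffic split, which usually signals an assignment or
    logging bug rather than a real treatment effect.
    """
    total = n_control + n_treatment
    expected = [p * total for p in expected_split]
    _, p_value = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return p_value < alpha

# Illustrative: 50.5% vs 49.5% on ~1M users is a red flag under a 50/50 split
print(srm_check(505_000, 495_000))  # True
```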
Online Experiment Procedure: Details (5/5)
Step 6: Experiment analytics
Compare the Treatment's & Control's OEC distributions
Hypothesis testing for the experiment effect
Estimation of the experiment effect
See later for details
Experiment and analysis may be defined with different units; for example
Experiment unit: User
Analysis unit: User-session
Apply the bootstrapping technique, among others (later); see the sketch below
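When the analysis unit (user-session) is finer than the experiment unit (user), sessions within a user are correlated, and resampling whole users is one way bootstrapping handles this. A minimal sketch under the assumption that data arrive as per-user lists of session-level metric values; all names and data are illustrative.

```python
import numpy as np

def bootstrap_mean_se(sessions_by_user, n_boot=2000, seed=0):
    """Bootstrap the session-level mean by resampling whole users (clusters)."""
    rng = np.random.default_rng(seed)
    n_users = len(sessions_by_user)
    point = np.mean([v for user in sessions_by_user for v in user])
    boot_means = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_users, size=n_users)      # sample users with replacement
        vals = [v for i in idx for v in sessions_by_user[i]]
        boot_means.append(np.mean(vals))
    return point, float(np.std(boot_means, ddof=1))

# Illustrative: three users with different numbers of sessions
mean, se = bootstrap_mean_se([[0.0, 1.0], [1.0], [0.0, 0.0, 1.0]])
print(mean, se)
```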
Statistics for Experiments (1/2)
Hypotheses for testing
Null hypothesis: $H_0$
Treatment and Control are not different
Any observed differences are due to random fluctuations
Alternative hypothesis: $H_1$
Treatment is different from, or better than, Control
Testing the null hypothesis: $H_0 : O_B = O_A$
$O_X$: OEC for Treatment ($X = B$) and Control ($X = A$), respectively
$\hat{O}_X$: Estimated OEC
Statistics for Experiments (2/2)
Hypothesis testing basics
Type-I error: Pr(H1|H0) = α
Probability of rejecting H0 when H0 is true (common: 5%)
Type-II error: Pr(H0|H1) = β
Probability of not rejecting H0 when H0 is false
Confidence level: Pr(H0|H0) = 1 − α
Probability of not rejecting H0 when H0 is true (common: 95%)
Power: Pr(H1|H1) = 1 − β
Probability of rejecting H0 when H0 is false (common: 80-95%)
Decision \ Condition   | H0 is true (H0)  | H0 is false (H1)
Reject H0 (H1)         | Type-I error     | Power
Not reject H0 (H0)     | Confidence level | Type-II error
Sample Size Calculation
Hypothesis testing:
$H_0 : O_B = O_A$, with desired confidence level $1 - \alpha$
$H_1 : O_B - O_A = \Delta$, with desired power $1 - \beta$
Minimum sample size per variant: solve
$$0 + z_{1-\alpha/2}\,\sigma\sqrt{\tfrac{2}{n}} = \Delta - z_{1-\beta}\,\sigma\sqrt{\tfrac{2}{n}} \;\Rightarrow\; n = \frac{2\sigma^2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\Delta^2}$$
(see the sketch below for a small calculator)
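A small helper that evaluates the formula above for the per-variant sample size. The normal quantiles come from scipy; the example values of sigma, delta, alpha, and beta are purely illustrative.

```python
import math
from scipy.stats import norm

def min_sample_size(sigma: float, delta: float,
                    alpha: float = 0.05, beta: float = 0.20) -> int:
    """n per variant = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(1 - beta)
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)

# Illustrative: detect a 0.01 absolute lift on a metric with std dev 0.5,
# at 95% confidence and 80% power
print(min_sample_size(sigma=0.5, delta=0.01))  # roughly 39,000+ per variant
```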
Experiment Effect Testing & Estimation (1/2)
Absolute effect: $O_B - O_A$
95% confidence interval (CI) for $O_B - O_A$: $\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$
$\hat{\sigma}_d$: Estimated standard deviation of $\hat{O}_B - \hat{O}_A$
See Appendix for derivations
Hypothesis testing for $O_B - O_A$: Based on the CI (see the sketch below)
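A minimal sketch of the absolute-effect CI for a mean-type OEC, with sigma_d estimated from the two sample variances of independent groups; the data and names are illustrative.

```python
import numpy as np

def absolute_effect_ci(control, treatment, z=1.96):
    """95% CI for O_B - O_A: (mean_B - mean_A) +/- z * sigma_d,
    where sigma_d^2 = var_B/n_B + var_A/n_A for independent samples."""
    diff = treatment.mean() - control.mean()
    sigma_d = np.sqrt(treatment.var(ddof=1) / len(treatment)
                      + control.var(ddof=1) / len(control))
    return diff, (diff - z * sigma_d, diff + z * sigma_d)

# Illustrative data: per-user conversion indicators
rng = np.random.default_rng(1)
a = rng.binomial(1, 0.10, size=50_000).astype(float)  # Control
b = rng.binomial(1, 0.11, size=50_000).astype(float)  # Treatment
effect, ci = absolute_effect_ci(a, b)
print(effect, ci)  # a CI excluding 0 indicates a significant effect
```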
Experiment Effect Testing & Estimation (2/2)
Percent effect: $\dfrac{O_B - O_A}{O_A} \cdot 100\%$
95% confidence interval (CI):
$$\left(\frac{\hat{O}_B - \hat{O}_A}{\hat{O}_A} + 1\right)\frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2} - 1$$
$\widehat{CV}_B = \hat{\sigma}_B / \hat{O}_B$: Estimated coefficient of variation (CV); $\widehat{CV}_A$ defined analogously
$\hat{\sigma}_B$: Estimated standard deviation of $\hat{O}_B$
See Appendix for derivations
Hypothesis testing: Based on the CI (see the sketch below)
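A sketch that evaluates the Fieller-based percent-effect CI above, given the estimated OECs and their standard deviations; the example numbers are illustrative.

```python
import math

def percent_effect_ci(o_a, se_a, o_b, se_b, z=1.96):
    """Fieller-based 95% CI for the percent effect (O_B - O_A) / O_A.

    o_a, o_b  : estimated OECs for Control and Treatment
    se_a, se_b: their estimated standard deviations
    """
    cv_a, cv_b = se_a / o_a, se_b / o_b
    disc = math.sqrt(cv_a**2 + cv_b**2 - z**2 * cv_a**2 * cv_b**2)
    denom = 1 - z**2 * cv_a**2
    ratio = o_b / o_a
    lower = ratio * (1 - z * disc) / denom - 1
    upper = ratio * (1 + z * disc) / denom - 1
    return lower * 100, upper * 100  # in percent

# Illustrative: a ~10% lift with tight standard errors gives a CI of roughly (7%, 13%)
print(percent_effect_ci(o_a=0.100, se_a=0.001, o_b=0.110, se_b=0.001))
```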
Further Experiment Analytics
To reduce variance and increase power:
Increase sample size: Increases experiment length
Adjust analysis units by features: May shorten experiment length (see the sketch below)
Pre-experiment user metrics
User demographics: gender, age, location
User behavior analytics: device, app
Among many others
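One common way to adjust by pre-experiment user metrics is to regress the in-experiment metric on its pre-experiment value and analyze the adjusted metric, which reduces variance without changing the mean. This is an illustrative sketch of that covariate-adjustment idea (in the spirit of CUPED), not a method prescribed by the slides; the simulated data are assumptions.

```python
import numpy as np

def covariate_adjusted(y, x_pre):
    """Adjust metric y by the pre-experiment covariate x_pre.

    theta is the OLS slope of y on x_pre; subtracting theta * (x - mean(x))
    keeps the mean of y unchanged while shrinking its variance whenever
    x_pre is correlated with y.
    """
    theta = np.cov(y, x_pre, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# Illustrative: in-experiment metric strongly correlated with its pre-experiment value
rng = np.random.default_rng(2)
x_pre = rng.normal(10, 3, size=10_000)
y = 0.8 * x_pre + rng.normal(0, 1, size=10_000)
print(np.var(y), np.var(covariate_adjusted(y, x_pre)))  # adjusted variance is much smaller
```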
Validation of Experiments
A/A test (null test):
Tests the experimental & randomization setup
Assign users to variant groups, but expose them to the same experience
If the system is working properly, H0 should be retained
Rejected only about 5% of the time (see the simulation sketch below)
Other application: Software migration
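A quick simulation of repeated A/A tests: with a sound setup, H0 is rejected at roughly the nominal 5% rate. A minimal sketch; the metric distribution and the two-sample t-test are illustrative choices, not the deck's prescribed test.

```python
import numpy as np
from scipy import stats

def aa_rejection_rate(n_tests=1000, n_users=5000, alpha=0.05, seed=3):
    """Fraction of A/A tests that (falsely) reject H0 at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_tests):
        a = rng.normal(0.0, 1.0, n_users)  # both groups receive the same experience
        b = rng.normal(0.0, 1.0, n_users)
        _, p = stats.ttest_ind(a, b)
        rejections += p < alpha
    return rejections / n_tests

print(aa_rejection_rate())  # should be close to 0.05 if the setup is sound
```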
Limitations of Experiments (1/2)
Quantitative metrics, but no explanations:
Possible to know which variant is better and by how much, but not why
Long-term effects:
Online tests are typically run for short periods, e.g. a few days/weeks
Find good OEC metrics that predict long-term effects
Running experiments longer is hard in practice due to survivorship bias in online cohorts:
Many cookies churn, especially in anonymous settings
Primacy effect & newness effect:
Run the experiment longer, or compute the OEC only for new users
Primacy effect:
Experienced users may be less efficient at first while getting used to the Treatment
Newness effect:
When a Treatment is introduced, some users click everywhere to explore it
Limitations of Experiments (2/2)
Feature must be implemented:
In early stages, use paper prototyping for quick feedback/refinement
Consistency:
Users need a consistent experience
Overlapping experiments:
Previous experience: Strong interactions are rare in practice(?)
Initially, avoid tests that could interact
Perform pairwise tests: Flag interactions automatically
Launch events:
All users need to see them, so we cannot run an experiment
Other Practical Concerns (1/2)
Triggering
Example: A change to the checkout page, which only 10% of users reach
Analyze only users who were exposed to the variants (the checkout pages)
Reduces the variance of treatment-effect estimates
Automatic optimization
Run experiments to optimize areas amenable to automated search
Once an organization has a clear OEC
Multi-armed bandit algorithms / Hoeffding races (later); see the sketch below
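Multi-armed bandit algorithms shift traffic toward better-performing variants while still exploring. Epsilon-greedy is one of the simplest members of this family; the sketch below is illustrative only and is not the specific algorithm referenced on the slide, and the conversion rates are made up.

```python
import random

def epsilon_greedy(n_rounds, true_rates, epsilon=0.1, seed=4):
    """Simulate epsilon-greedy allocation over variants with Bernoulli rewards."""
    random.seed(seed)
    pulls = [0] * len(true_rates)
    wins = [0] * len(true_rates)
    for _ in range(n_rounds):
        if random.random() < epsilon or sum(pulls) == 0:
            arm = random.randrange(len(true_rates))  # explore
        else:
            arm = max(range(len(true_rates)),
                      key=lambda i: wins[i] / pulls[i] if pulls[i] else 0.0)  # exploit
        pulls[arm] += 1
        wins[arm] += random.random() < true_rates[arm]
    return pulls

# Illustrative: the best arm (3% conversion) should end up with most of the traffic
print(epsilon_greedy(20_000, true_rates=[0.01, 0.02, 0.03]))
```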
Other Practical Concerns (2/2)
Robots removal
Their activity can severely bias results
Perform Treatment assignment via JavaScript (client-side), not server-side
Exclude robots that reject cookies and send unidentified requests
Exclude robots that do not delete cookies and have very many actions
Robots-removal approaches (see the sketch below):
Lists of known robots
Heuristics (Kohavi & Parekh, 2003)
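A sketch of the two listed approaches: drop traffic whose user-agent matches a list of known robots, and drop cookie-less or extremely active clients. The robot list, thresholds, and field names are illustrative assumptions, not the heuristics from Kohavi & Parekh (2003).

```python
KNOWN_ROBOT_AGENTS = ("bot", "spider", "crawler")  # illustrative list
MAX_ACTIONS_PER_DAY = 1000                         # illustrative threshold

def is_robot(user_agent, actions_last_day, has_cookie):
    """Heuristic robot filter: known agents, cookie-less traffic, or extreme activity."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in KNOWN_ROBOT_AGENTS):
        return True
    if not has_cookie:                   # rejects cookies / unidentified requests
        return True
    return actions_last_day > MAX_ACTIONS_PER_DAY

events = [
    {"ua": "Mozilla/5.0", "actions": 12, "cookie": True},
    {"ua": "Googlebot/2.1", "actions": 500, "cookie": False},
]
clean = [e for e in events if not is_robot(e["ua"], e["actions"], e["cookie"])]
print(len(clean))  # 1
```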
Discussions
Online experiments are extremely important for building data products across various applications
For fast iteration, we will build an online experiment platform with:
Random assignment to Treatment or Control
Online monitoring for ramp-up, SRM, and interactions
Experiment analytics with data query, ETL, and statistical inference
Next: Segment Validation SOP as the first application
References
Box et al. (2005). Statistics for Experimenters: Design, Innovation, and Discovery
Kohavi & Longbotham (2015). Online controlled experiments and A/B tests. Encyclopedia of Machine Learning and Data Mining
Kohavi et al. (2009). Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery (DMKD)
van Belle (2002). Statistical Rules of Thumb
Willan & Briggs (2006). Statistical Analysis of Cost-Effectiveness Data
Thank you for listening!
Appendix: Derivations of CI for Absolute Effect
Under $H_0 : O_B = O_A$:
$E(\hat{O}_B - \hat{O}_A) = 0$
$\mathrm{Var}(\hat{O}_B - \hat{O}_A)$ can be estimated by $\hat{\sigma}_d^2$
As the sample size is large, by the Central Limit Theorem
$$\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d} \xrightarrow{d} N(0, 1)$$
Thus
$$\Pr\left(\left|\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d}\right| \le 1.96\right) \approx 95\%$$
CI for the absolute effect:
$$\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$$
Appendix: Derivations of CI for Percent Effect (1/2)
Fieller (1954):
Define $R = O_B / O_A$
Obtain a CI for $R$ based on $\hat{O}_B - R\,\hat{O}_A$
Apply the Central Limit Theorem:
$$\hat{O}_B - R\,\hat{O}_A \xrightarrow{d} N\!\left(0,\ \mathrm{Var}[\hat{O}_B - R\,\hat{O}_A]\right)$$
$\mathrm{Var}[\hat{O}_B - R\,\hat{O}_A] = \sigma_B^2 + R^2\sigma_A^2$ (since $\mathrm{Cov}(\hat{O}_B, \hat{O}_A) = 0$)
Thus
$$\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\sigma_B^2 + R^2\sigma_A^2}} \xrightarrow{d} N(0, 1)$$
$$\Pr\left(\left|\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\sigma_B^2 + R^2\sigma_A^2}}\right| \le 1.96\right) = 95\%$$
Appendix: Derivation of CI for Percent Effect (2/2)
CI for $R$: Solve the quadratic equation in $R$
$$\left(\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}}\right)^2 = 1.96^2$$
$$R = \frac{\hat{O}_B}{\hat{O}_A}\cdot\frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2}$$
Note: $\dfrac{O_B - O_A}{O_A} = \dfrac{O_B}{O_A} - 1$
CI for the percent effect:
$$\frac{\hat{O}_B}{\hat{O}_A}\cdot\frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2} - 1$$