5. Agenda
• Design the Experiment
▫ 2 main questions – how many users and how long to run the test
▫ Define a reasonable number of KPIs
▫ Pay attention to seasonality/weekday effects
• Analyze the Experiment
▫ Statistical methods for checking significance
▫ Non-parametric methods
▫ Outliers/bots/fraud
• Data-driven culture
• Pitfalls
• Open Discussion
6. Design
• Test Duration & Sample Size
▫ Duration needs to be defined before the experiment is started!
▫ Depends on distribution of main KPIs
About 80% of KPIs have a Binomial distribution (conversion rate, CTR, etc.); the CLT can help.
The other 20% are count events, revenue, and similar.
Use power calculations to define N (sample size) and t (duration), or use rules of thumb.
General rule: the smaller the difference you want to detect, the more data you’ll need to collect.
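A minimal sketch of such a power calculation for a Binomial KPI, assuming a two-sided two-proportion z-test with the usual normal approximation. The function names, the 5% significance level, the 80% power, and the conversion rates below are illustrative assumptions, not values from the slides.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect p_base vs. p_target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    n = (z_alpha + z_beta) ** 2 * variance / (p_base - p_target) ** 2
    return math.ceil(n)


def duration_days(n_per_group, daily_users_per_group):
    """Days needed to collect n_per_group users, rounded up to whole days."""
    return math.ceil(n_per_group / daily_users_per_group)


# Hypothetical baseline of 5% conversion; compare a 1-point and a 0.5-point lift.
n_large_effect = sample_size_per_group(0.05, 0.06)
n_small_effect = sample_size_per_group(0.05, 0.055)
print(n_large_effect, n_small_effect)
```

Halving the difference to detect roughly quadruples the required sample, which is the general rule above in numbers.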
7. Design
▫ Example – # of searches per user (SweetIM)
Poisson is the usual assumption for count events
It is not appropriate when variance >> mean (overdispersion)
The Negative Binomial (NB) distribution was found appropriate
Power limitation of NB
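A quick way to see whether the Poisson assumption is plausible is to compare the sample variance with the sample mean. The sketch below does that and, when the data are overdispersed, derives method-of-moments NB parameters (an assumption here; the slides do not say how NB was fitted). The counts are made up for illustration and are not SweetIM data.

```python
from statistics import mean, variance


def dispersion_summary(counts):
    """Compare variance with mean; add NB moment estimates if overdispersed."""
    m = mean(counts)
    v = variance(counts)  # sample variance
    summary = {"mean": m, "variance": v, "ratio": v / m}
    if v > m:
        # NB parameterized so that mean = r(1-p)/p and var = mean + mean^2 / r
        r = m * m / (v - m)
        summary["nb_r"] = r
        summary["nb_p"] = r / (r + m)
    return summary


# Hypothetical searches-per-user counts with a long right tail.
searches = [0, 1, 1, 2, 2, 3, 3, 4, 5, 8, 12, 20, 35, 60]
info = dispersion_summary(searches)
print(info)
```

A variance-to-mean ratio far above 1 suggests Poisson will understate the spread of the data.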
9. Design
• Define Reasonable Number of KPIs
▫ It’s impossible to reach a conclusion based on 20 KPIs
• Project your KPI on Main Business (Lead) Indicators
• Consider Weighted KPIs or GPI (General Performance Indicator)
• Seasonality
▫ Weekends may have different user behavior than Weekdays
▫ Holidays can be unpredictable
• 7-days rule of thumb
11. Analysis
• Statistical Parametric Methods
▫ Use confidence intervals based on KPI distribution
▫ T-test, Chi-square test, etc. will work, but…
T-test assumes a normal distribution of the test statistic
Chi-square can be weak when low frequencies are observed
▫ Try hypothesis testing based on the KPI distribution: it’s not simple, but worth it
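For a Binomial KPI such as conversion rate, a confidence interval and a significance test can be sketched with the normal approximation. This is one common approach, not necessarily the one used by the author; the function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (difference, confidence interval, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the test (H0: both rates are equal).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the interval around the difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


# Hypothetical counts: 500/10,000 conversions in A vs. 560/10,000 in B.
diff, ci, p_value = two_proportion_test(500, 10_000, 560, 10_000)
print(diff, ci, p_value)
```

The approximation relies on the CLT, so it is weakest for small samples or very low rates.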
12. Negative Binomial (NB)
• Can be used as a generalization of Poisson in overdispersed cases (Var >> Mean).
• Has been used before in other domains to analyze count data (genetics, traffic modeling).
• Fits the real distribution well.
[Figure: frequency vs. number of searches, comparing real data with the fitted NB and fitted Poisson distributions]
13. Analysis
• Non-parametric tests
▫ When it’s hard to estimate the distribution
▫ As QA (a cross-check) for parametric tests
• Mann-Whitney, Kolmogorov-Smirnov
▫ Pros:
Can be appropriate for unknown or not Normal distributions
More robust than t-test
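▫ Cons:
Less sensitive and less powerful than parametric tests (they use the median rather than the mean as the parameter)
Assume that both samples come from the same distribution under the null hypothesis
Rely on a normal approximation of the test statistic in large samples (Mann-Whitney)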
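To make the Mann-Whitney test concrete, here is a standard-library sketch using the large-sample normal approximation with a tie correction and no continuity correction. It assumes the samples are not all tied (otherwise the variance is zero), and the data are made up. In practice a statistics library would normally be used instead.

```python
import math
from statistics import NormalDist


def mann_whitney_u(sample_a, sample_b):
    """Return (U statistic for sample_a, two-sided p-value)."""
    n_a, n_b = len(sample_a), len(sample_b)
    n = n_a + n_b
    pooled = sorted([(x, 0) for x in sample_a] + [(x, 1) for x in sample_b])
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j for tied values
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - mean_u) / math.sqrt(var_u)
    return u_a, 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical per-user values for two groups.
u_stat, p_val = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(u_stat, p_val)
```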
14. Analysis
• Permutations tests
1. Calculate the test statistic on the original groups
2. Shuffle the pooled data and resample 2 random groups
3. Calculate the test statistic again
4. Compare it to your original statistic; if it is more extreme, k = k + 1
5. Repeat from step 2, N times in total
6. Calculate the probability of getting a result more extreme than your original: k/N. This is your p-value
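The steps above can be sketched as follows. The test statistic is assumed to be the absolute difference in means, "more extreme" is implemented as "at least as extreme", and the function name, seed, and data are illustrative choices.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)

    def statistic(a, b):
        return abs(sum(b) / len(b) - sum(a) / len(a))

    observed = statistic(group_a, group_b)        # step 1
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    k = 0
    for _ in range(n_permutations):               # step 5: repeat N times
        rng.shuffle(pooled)                       # step 2: reshuffle groups
        stat = statistic(pooled[:n_a], pooled[n_a:])  # step 3
        if stat >= observed:                      # step 4
            k += 1
    return k / n_permutations                     # step 6: p-value


# Hypothetical data: two clearly separated groups.
p_perm = permutation_test(list(range(1, 9)), list(range(11, 19)), 2_000)
print(p_perm)
```

No distributional assumption is needed, which is why the approach suits KPIs whose distribution is hard to estimate.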
15. Analysis
• Check for outliers
▫ Plot your data on daily/hourly level
▫ Descriptive statistics can help (variance)
• Try to filter bots and crawlers
▫ It is almost impossible to filter all non-human activity on the web.
▫ Automatic bots and crawlers can bias the results and lead to wrong conclusions.
• Run a continuous A/A test as a sanity check for the whole system
▫ What difference do you observe between the A groups, and is it insignificant?
▫ Technical and tracking issues
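One way to reason about an A/A test: if both groups get the same experience, a test at the 5% level should flag a "significant" difference in roughly 5% of runs. The simulation below is a sketch under assumed values (10% conversion rate, 1,000 users per group, 500 simulated tests) and uses a two-proportion z-test as the significance check.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all converted): nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(rate=0.1, n_per_group=1_000, n_tests=500,
                           alpha=0.05, seed=0):
    """Share of simulated A/A tests that look significant at level alpha."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_tests):
        conv_a = sum(rng.random() < rate for _ in range(n_per_group))
        conv_b = sum(rng.random() < rate for _ in range(n_per_group))
        if z_test_p_value(conv_a, n_per_group, conv_b, n_per_group) < alpha:
            significant += 1
    return significant / n_tests


fp_rate = aa_false_positive_rate()
print(fp_rate)
```

A real A/A false-positive rate well above the chosen level points to technical or tracking issues rather than to chance.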
16. Data-Driven Culture
• Avoid a HiPPO (Highest Paid Person’s Opinion) that is not supported by data
• Be clear about your KPIs and how they affect your business
• Fight your ego – numbers don’t lie
• 80%-90% of tests won’t give a positive result
• Learn from failed tests
17. Pitfalls
• Picking an easy-to-beat KPI without relation to lead business
metrics
▫ Example – focusing on increasing click-through rate for banners/buttons and ignoring other metrics like user retention or revenue.
• Using incorrect statistical methods or violating their assumptions
▫ Example 1 – assuming that the KPI has a Normal distribution without actually checking it.
▫ Example 2 – using online significance calculators without understanding the data distribution
18. Pitfalls
• Combining ratios from different proportions over time (Simpson’s Paradox)
▫ Example:
• Ignoring outliers and bots, and not plotting data on a timeline
▫ Example: One outlier can change the test results
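To illustrate the Simpson’s Paradox pitfall on this slide, the sketch below uses made-up numbers in which the traffic split between groups differs between a low-converting and a high-converting period. The treatment wins in each period but loses when the periods are pooled. All counts are hypothetical.

```python
# (conversions, users) per period; the traffic split differs between periods.
control = {"weekday": (20, 1_000), "weekend": (450, 9_000)}
treatment = {"weekday": (198, 9_000), "weekend": (55, 1_000)}


def rate(conversions, users):
    """Conversion rate for one period."""
    return conversions / users


def pooled_rate(groups):
    """Conversion rate after naively summing all periods together."""
    total_conversions = sum(c for c, _ in groups.values())
    total_users = sum(u for _, u in groups.values())
    return total_conversions / total_users


for period in control:
    print(period, rate(*control[period]), rate(*treatment[period]))
print("pooled", pooled_rate(control), pooled_rate(treatment))
```

Keeping the allocation ratio constant over the test, or comparing within periods, avoids the reversal.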
19. Pitfalls
• Starting a test without validation (an A/A test is a solution)
• Changing the control group during the test (solution: change both groups!)
• Technical issues with the experiment group
▫ Example – redirects, cache, new technology
• Running your experiment “until it reaches a significant difference”
• Not “anchoring” users to one group only (also cookie problems)
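One common way to anchor users to a single group without relying on cookies alone is deterministic hashing of a stable user ID. This is a sketch of that idea, not a method described in the slides; the function name, ID format, and experiment name are hypothetical, and it assumes a stable user ID is available.

```python
import hashlib


def assign_group(user_id, experiment, groups=("control", "treatment")):
    """Map a user to the same group on every call for a given experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(groups)  # stable bucket index
    return groups[bucket]


# The same user always lands in the same group for this experiment.
first = assign_group("user-42", "new-search-box")
second = assign_group("user-42", "new-search-box")
print(first, second)
```

Including the experiment name in the hash keeps assignments independent across experiments.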
21. Reference
▫ How Not To Run An A/B Test
▫ http://www.evanmiller.org/how-not-to-run-an-ab-test.html
▫ Microsoft Experimentation Platform
▫ http://www.exp-platform.com/Pages/ExPpitfalls.aspx
▫ Simpson’s Paradox
▫ http://vudlab.com/simpsons/