Controlled Experiments for Decision-Making in e-Commerce Search
1. Controlled Experiments for Decision-Making in e-Commerce Search
Anjan Goswami Wei Han Zhenrui Wang Angela Jiang
October 26, 2015
Anjan Goswami, Wei Han, Zhenrui Wang, Angela Jiang (WalmartLabs)IEEE big data 2016 October 26, 2015 1 / 1
2. Agenda of this Presentation
Background of Controlled Experiments
Proposed Guidelines for Feature Development for E-commerce Search
Know your Bias
Know your Metrics
Know your Tests
Know your Results
Summary
3. Background of Controlled Experiments
Statistically-sound approach for causal inference
A/B testing as its simplest form
More complex designs exist, such as Randomized Complete Block
Design, Factorial Design and Split-Plot Design (each with a regression
model behind the scenes)
Three core pillars: randomization, replication, blocking
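As a minimal sketch of how randomization and blocking can work together, the snippet below splits visitors into control and treatment at random within each block. The visitor tuples, the `user_type` blocking variable and the function name `blocked_assignment` are hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict

def blocked_assignment(units, block_of, seed=0):
    """Randomly split units into treatment/control within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of(unit)].append(unit)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)          # randomization inside the block
        half = len(members) // 2      # odd-sized blocks give control the extra unit
        for i, unit in enumerate(members):
            assignment[unit] = "treatment" if i < half else "control"
    return assignment

# Hypothetical visitors: (visitor_id, user_type)
visitors = [(i, "new" if i % 3 == 0 else "returning") for i in range(12)]
assignment = blocked_assignment(visitors, block_of=lambda v: v[1])
```

Because the split happens inside each block, new and returning users are balanced across the two groups by construction rather than by chance.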
4. Know Your Bias
Bias: additional influencing factors that, if not quantified or eliminated,
lead to unfair comparisons and misleading experiment conclusions
Visit-level factors: visits from existing users vs. visits from new users (who may
click out of curiosity)
Query-level factors: a query-level performance boost != a visit-level boost;
effects may be confined to particular targeted queries
Item-level factors: impact of item differences (popular/unpopular,
price-competitive/high-end)
5. Know Your Bias
Avoiding or reducing bias is non-trivial
Proper randomization/blocking in experiment design;
Correct selection of target population
Suitable statistical analysis [proper tests]
6. Know Your Metrics
Numerous metrics generated from user’s activities on walmart.com, for
example,
Product View Rate (PVR):
$$\mathrm{PVR} = \frac{\text{num of visits/query sessions with at least one click on an item}}{\text{num of visits}}$$
Add to Cart Rate (ATC):
$$\mathrm{ATC} = \frac{\text{num of visits/query sessions with at least one cart addition}}{\text{num of visits}}$$
Conversion Rate (CR):
$$\mathrm{CR} = \frac{\text{num of visits/query sessions with at least one converted item}}{\text{num of visits}}$$
Average Order Size (AOS):
$$\mathrm{AOS} = \frac{\text{total revenue}}{\text{num of visits with at least one converted item}}$$
Revenue Per Visit (RPV):
$$\mathrm{RPV} = \frac{\text{total revenue}}{\text{num of visits}}$$
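The metric definitions above can be sketched in code as follows. This is an illustration under stated assumptions: the `Visit` record and its fields are hypothetical, every record is one visit, and a visit counts as converted when its revenue is positive.

```python
from dataclasses import dataclass

@dataclass
class Visit:
    clicked: bool         # at least one click on an item
    added_to_cart: bool   # at least one cart addition
    revenue: float        # 0.0 when nothing was purchased (assumption)

def search_metrics(visits):
    """Compute visit-level PVR, ATC, CR, AOS and RPV for a non-empty list of visits."""
    n = len(visits)
    converted = [v for v in visits if v.revenue > 0]
    total_revenue = sum(v.revenue for v in visits)
    return {
        "PVR": sum(v.clicked for v in visits) / n,
        "ATC": sum(v.added_to_cart for v in visits) / n,
        "CR": len(converted) / n,
        "AOS": total_revenue / len(converted) if converted else 0.0,
        "RPV": total_revenue / n,
    }

visits = [
    Visit(True, True, 40.0),
    Visit(True, False, 0.0),
    Visit(False, False, 0.0),
    Visit(True, True, 60.0),
]
metrics = search_metrics(visits)
```

With these definitions RPV equals CR times AOS, which is a quick consistency check on any metric pipeline.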
7. Know Your Metrics
Collect descriptive statistics: mean, median, variance and quantiles
Plot the empirical distribution to check for (i) normality, (ii) long tails, (iii)
skewness, (iv) multi-modality, etc.
A normal distribution leads to straightforward analysis. For a non-normal
distribution, check whether the conditions of the central limit theorem (CLT) are
met. Use a non-parametric test when the CLT is not applicable
For long-tailed data, verify that the observations in the tail are correct
Multi-modal data suggest that the data come from multiple sub-populations
and call for further segmentation
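A minimal sketch of the descriptive checks listed above, using only the Python standard library. The revenue values are made up to mimic a zero-inflated, long-tailed metric, and the moment-based skewness coefficient is one of several possible definitions.

```python
import statistics

def describe(values):
    """Summary statistics used to judge normality, skewness and zero inflation."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population s.d., used only for the skewness ratio
    skewness = sum((x - mean) ** 3 for x in values) / (n * sd ** 3) if sd > 0 else 0.0
    return {
        "mean": mean,
        "median": statistics.median(values),
        "variance": statistics.variance(values),          # sample variance
        "quartiles": tuple(statistics.quantiles(values, n=4)),
        "skewness": skewness,                             # > 0 means a long right tail
        "zero_share": sum(1 for x in values if x == 0) / n,
    }

# Hypothetical per-visit revenue: 95% zeros plus a few purchases
revenue = [0.0] * 95 + [20.0, 35.0, 50.0, 80.0, 400.0]
summary = describe(revenue)
```

A mean far above the median together with a large positive skewness is the kind of signal that should prompt a closer look at the tail before choosing a test.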
8. Know Your Metrics
Know Your Metrics: Test design and analysis
This knowledge is then used to:
Determine the sample size from these descriptive statistics, to
avoid an underpowered test
Remove erroneous measurements and choose proper metrics based on
their distributions
Choose the hypothesis test based on the distribution; for example, RPV
is highly skewed with zero inflation (more than 95% of the data are zeros)
Account for seasonality: weekly seasonality suggests the test should run for
multiple weeks
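To make the sample-size step concrete, here is a sketch based on the standard normal-approximation formula for comparing two proportions. The baseline rate, the target rate, the 5% significance level and the 80% power are illustrative assumptions, not values from the presentation.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Visits needed per group for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical goal: detect a lift in conversion rate from 5.0% to 5.5%
n = sample_size_per_group(0.050, 0.055)
```

The required sample grows with the inverse square of the effect size, so halving the detectable lift roughly quadruples the number of visits needed.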
9. Know Your Tests
Hypothesis Testing
A method of statistical inference for testing a statistical hypothesis (e.g.
has the product view rate improved after deploying a new feature?)
Components of a hypothesis test:
Hypotheses:
two competing beliefs (the null and alternative hypotheses, denoted
H0 and H1) about the metric of interest
the testing procedure decides which belief the collected data support
Test statistic: a random variable, computed from the data, whose distribution
depends on whether H0 or H1 holds
Rejection region: the set of values of the test statistic that lead to
rejection of H0
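The components above can be illustrated with a one-sided two-proportion z-test on product view rate. This is a sketch, not the authors' procedure: the counts are invented, and the pooled-variance z statistic is one common choice for large samples.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(success_c, n_c, success_t, n_t, alpha=0.05):
    """One-sided test of H0: p_t <= p_c against H1: p_t > p_c."""
    p_c, p_t = success_c / n_c, success_t / n_t
    pooled = (success_c + success_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se                        # test statistic
    critical = NormalDist().inv_cdf(1 - alpha)  # rejection region is z > critical
    p_value = 1 - NormalDist().cdf(z)
    return z, critical, p_value, z > critical

# Hypothetical data: visits with a product view out of 10,000 visits per group
z, critical, p_value, reject = two_proportion_z_test(4000, 10000, 4200, 10000)
```

Here H0 and H1 are the hypotheses, `z` is the test statistic, and the region `z > critical` is the rejection region.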
10. Know Your Tests
Common Tests on Location Parameters
IID normal data, or large samples where the CLT applies: two-sample t-test
Paired, non-normal data: Wilcoxon signed-rank test, bootstrapping
(resampling with replacement)
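For the paired, non-normal case, a percentile bootstrap of the mean paired difference can be sketched as below. The differences are made-up numbers, and the percentile interval is only one of several bootstrap variants.

```python
import random
import statistics

def bootstrap_mean_ci(diffs, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample mean
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical paired differences (variation minus control), skewed by two large values
diffs = [0.4, 0.1, 0.3, 2.5, 0.2, 0.0, 0.6, 0.1, 0.3, 1.8, 0.2, 0.5, 0.1, 0.4]
low, high = bootstrap_mean_ci(diffs)
```

If the resulting interval excludes zero, the data are inconsistent with a zero mean difference at the chosen confidence level, without relying on a normality assumption.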
11. Know Your Tests
Example Pitfall
One common pitfall is to overlook the independence assumption and use a
t-test on autocorrelated data (e.g. the daily metric differences between the
control and variation groups).
[Figure: simulation of 100 replications of a t-test on autocorrelated data with zero mean]
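A simulation in the spirit of this pitfall can be sketched as follows. The AR(1) process, the coefficient 0.8 and the 30-day series length are assumptions chosen for illustration; they are not taken from the original slide.

```python
import math
import random
import statistics

T_CRIT_DF29 = 2.045  # two-sided 5% critical value of Student's t with 29 d.f.

def ar1_series(n, phi, rng):
    """Zero-mean AR(1) series: x_t = phi * x_{t-1} + standard normal noise."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def rejects_zero_mean(series):
    """One-sample t-test of H0: mean = 0, wrongly assuming independent observations."""
    n = len(series)
    t = statistics.fmean(series) / (statistics.stdev(series) / math.sqrt(n))
    return abs(t) > T_CRIT_DF29

def false_rejection_rate(phi, replications=100, n_days=30, seed=0):
    rng = random.Random(seed)
    hits = sum(
        rejects_zero_mean(ar1_series(n_days, phi, rng)) for _ in range(replications)
    )
    return hits / replications

independent = false_rejection_rate(phi=0.0)     # should stay near the nominal 5%
autocorrelated = false_rejection_rate(phi=0.8)  # H0 is still true, yet often rejected
```

Since the true mean is zero in both cases, every rejection is a false positive; positive autocorrelation makes the t-test understate the variance of the mean, which inflates the false rejection rate.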
12. Know Your Result
Visualization
time series plots for all business metrics
show the trend over time
different performance for weekdays and weekends
compare business metrics over different segmentation/sub-populations
segmented by query intent in terms of product category
segmented by user’s devices and browsers
segmented by new users and returning users
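Before plotting, segment-level comparisons need the metric computed per segment and per experiment group. A minimal sketch, assuming hypothetical rows of the form (segment, group, converted):

```python
from collections import defaultdict

def conversion_by_segment(visits):
    """Return {(segment, group): conversion rate} from (segment, group, converted) rows."""
    counts = defaultdict(lambda: [0, 0])  # key -> [conversions, visits]
    for segment, group, converted in visits:
        counts[(segment, group)][0] += int(converted)
        counts[(segment, group)][1] += 1
    return {key: conv / total for key, (conv, total) in counts.items()}

# Hypothetical visits segmented by device
visits = [
    ("desktop", "control", True), ("desktop", "control", False),
    ("desktop", "variation", True), ("desktop", "variation", True),
    ("mobile", "control", False), ("mobile", "control", False),
    ("mobile", "variation", True), ("mobile", "variation", False),
]
rates = conversion_by_segment(visits)
```

The same grouping works for any segmentation key, such as product category or new versus returning users.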
14. Know Your Result
Representation of A/B testing result
Confidence interval: an interval estimate of a population
parameter. It is computed from the observations, so it in principle
differs from sample to sample, and it includes the parameter of interest
in a stated proportion of repeated experiments. That proportion is the
confidence level associated with the interval, usually 95%
P-value: the probability, given that H0 is true, of a test statistic at
least as extreme as the observed one
A small p-value leads to rejection of H0; statistical significance !=
practical significance
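The sketch below reports an A/B result as a difference in conversion rates with a normal-approximation (Wald) confidence interval and a two-sided p-value. The counts are invented and deliberately large, to show a result that is statistically significant while the lift itself is small.

```python
import math
from statistics import NormalDist

def diff_ci_and_p_value(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Difference in proportions with a Wald confidence interval and two-sided p-value."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf((1 + confidence) / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, ci, p_value

# Hypothetical counts: 5.00% vs 5.07% conversion on one million visits each
diff, ci, p_value = diff_ci_and_p_value(50_000, 1_000_000, 50_700, 1_000_000)
```

With samples this large the interval excludes zero and the p-value falls below 0.05, yet the estimated lift is under a tenth of a percentage point; whether that matters is a business judgment, not a statistical one.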
15. Know Your Result
Interpretation of Test Results
Hypothesis tests are a decision-making process subject to two types
of decision errors
Type I error: rejecting H0 when H0 is true; its probability is the significance level (1 - confidence level)
Type II error: failing to reject H0 when H1 is true; its probability is 1 - power
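Both error rates can be estimated by simulation. The sketch below uses a two-sided z-test on normal samples with known unit variance; the sample size, the effect size of 0.4 and the number of trials are arbitrary assumptions for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mean, n=50, trials=2000, alpha=0.05, seed=0):
    """Share of simulated experiments in which H0: mean = 0 is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = fmean(sample) * math.sqrt(n)  # z statistic with known sigma = 1
        rejections += abs(z) > z_crit
    return rejections / trials

type_one_rate = rejection_rate(true_mean=0.0)  # H0 true: rejections are type I errors
power = rejection_rate(true_mean=0.4)          # H1 true: rejections are correct
type_two_rate = 1 - power                      # H1 true but H0 not rejected
```

The estimated type I rate should sit close to the chosen significance level, while the type II rate depends on the effect size and the sample size.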
16. Summary
Many factors (query/item/visit level) contribute to biased estimates
in controlled experiments in the e-Commerce setting
Understand the chosen metric (target population/distribution/impacting
factors) to design the experiment
Verify the assumptions of the chosen hypothesis test (autocorrelation? A/A
test?)
Interpret the test results (p-value/power/significance)