Experimentation abounds, but how do we test our tests? I’ll share some ways we at Etsy proved our experimentation methods broken, and the approach we took to fixing them. I’ll discuss multiple ways of running A/A tests (as opposed to A/B tests), and a statistical method called bootstrapping, which we used to remedy our experiment analysis.
21. take the 95% confidence interval
[Figure: the boot 1–4 results are pooled into a distribution of t-statistics; its 2.5th and 97.5th percentiles bound the 95% confidence interval]
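A minimal sketch of that step, in Python with illustrative names and simulated values rather than anything from the talk: it takes the 2.5th and 97.5th percentiles of an array of bootstrapped t-statistics.

```python
import numpy as np

def percentile_ci(t_stats, alpha=0.05):
    """95% interval: the 2.5th and 97.5th percentiles of the
    bootstrapped t-statistic distribution."""
    return tuple(np.percentile(t_stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

# Stand-in for the pooled "boot 1..4 results"; real code would collect
# one t-statistic per bootstrap replicate.
rng = np.random.default_rng(0)
t_stats = rng.standard_t(df=50, size=10_000)
print(percentile_ci(t_stats))
```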
22. BOOTSTRAP BY USER
Bootstrap by user instead of by "visit" (one user can have many "visits").

user id   visit id   $
A         45         1
A         23         1
A         85         0
B         37         0
C         12         1
C         72         0

Grouping the visits under their users (A, B, C) and resampling whole users is closer to i.i.d. (independent & identically distributed).
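Here is a small sketch of that user-level resampling, assuming the slide's table sits in a pandas DataFrame; the column names and the `bootstrap_by_user` helper are illustrative, not Etsy's actual code.

```python
import numpy as np
import pandas as pd

# Visits nested within users, in the shape of the slide's table.
visits = pd.DataFrame({
    "user_id":  ["A", "A", "A", "B", "C", "C"],
    "visit_id": [45, 23, 85, 37, 12, 72],
    "dollars":  [1, 1, 0, 0, 1, 0],
})

def bootstrap_by_user(df, rng):
    """Resample USERS with replacement and keep every visit belonging to
    each sampled user, so the resampled unit is closer to i.i.d."""
    users = df["user_id"].unique()
    sampled = rng.choice(users, size=len(users), replace=True)
    return pd.concat([df[df["user_id"] == u] for u in sampled], ignore_index=True)

rng = np.random.default_rng(0)
boot = bootstrap_by_user(visits, rng)
```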
25. fetch a bag: resample withOUT replacement
[Figure: from the users in our original experiment (A, B, C, D, E, F), bag 1 is drawn without replacement, e.g. A, E, F, B]
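A sketch of fetching a bag, assuming we already hold the list of user ids; `fetch_bag` and the bag size are illustrative choices, not the talk's code.

```python
import numpy as np

users = np.array(["A", "B", "C", "D", "E", "F"])  # users in our original experiment

def fetch_bag(users, bag_size, rng):
    """Draw one bag: a subset of users sampled withOUT replacement."""
    return rng.choice(users, size=bag_size, replace=False)

rng = np.random.default_rng(1)
bag_1 = fetch_bag(users, bag_size=4, rng=rng)  # e.g. something like ['A' 'E' 'F' 'B']
```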
26. monte carlo subsamples: resample WITH replacement (from our bag) *TO SIZE OF ORIGINAL DATA SET*
[Figure: from the original experiment's users (A, B, C, D, E, F), bag 1 = F, A, E, B; boot 1 resamples the bag with replacement back up to six users, e.g. E, B, F, A, F, B]
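Continuing the sketch, the boot resamples with replacement from the bag back up to the original sample size; again the names and example bag are illustrative.

```python
import numpy as np

users = np.array(["A", "B", "C", "D", "E", "F"])   # original experiment
bag_1 = np.array(["F", "A", "E", "B"])             # a bag drawn without replacement

def monte_carlo_boot(bag, original_size, rng):
    """Resample WITH replacement from the bag, back up to the size of the
    original data set, so each boot has the full sample size but only
    contains users from its bag."""
    return rng.choice(bag, size=original_size, replace=True)

rng = np.random.default_rng(2)
boot_1 = monte_carlo_boot(bag_1, original_size=len(users), rng=rng)
```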
29. confidence interval from bag 1
[Figure: from bag 1 (F, A, E, B), several Monte Carlo boots (boot 1–4) are drawn with replacement; the t-statistics across those boots form a distribution whose 2.5th and 97.5th percentiles give the confidence interval from bag 1]
30. average the confidence intervals
[Figure: bag 1–4 each produce their own boot results and t-statistic distribution; the per-bag 2.5th and 97.5th percentiles are averaged into avg. 2.5th and avg. 97.5th bounds over the averaged t-statistics]
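A sketch of this last step, averaging the per-bag intervals endpoint-wise; the intervals below are made-up numbers just to show the shape of the computation.

```python
import numpy as np

def average_cis(cis):
    """Average the per-bag confidence intervals endpoint-wise: the mean of the
    2.5th-percentile bounds and the mean of the 97.5th-percentile bounds."""
    cis = np.asarray(cis)          # shape (n_bags, 2): one (lower, upper) per bag
    return cis[:, 0].mean(), cis[:, 1].mean()

# Made-up (lower, upper) intervals standing in for the bag 1..4 results.
bag_cis = [(-2.1, 1.9), (-1.8, 2.2), (-2.0, 2.0), (-1.9, 2.1)]
print(average_cis(bag_cis))        # (-1.95, 2.05)
```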
31. WINS
Fixes i.i.d. & distribution.
Suitable for distributed systems.
Faster, less memory.
40. Resources
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. A scalable bootstrap for massive data. arXiv preprint arXiv:1112.5016v2, 2012. URL http://arxiv.org/abs/1112.5016v2
Bakshy, E., and Eckles, D. Uncertainty in Online Experiments with Dependent Data: An Evaluation of Bootstrap Methods. arXiv preprint arXiv:1304.7406v3, 2013. URL https://arxiv.org/pdf/1304.7406v3.pdf
Idea for bootstrapping at Etsy: @hpster (Hilary Parker)