TrueMotion Data Science Lunch Seminar for September 19, 2016, wherein we discuss the theory behind A/B testing and some best practices for its real-world application.
Presenter: Nicholas Arcolano, Ph.D., Senior Data Scientist
20. Casagrande et al. (1978)
Approximate formula gives the desired sample size $n$ as a function of $p_1$, $p_2$, $\alpha$, and $\beta$:

$$ n = A \left[ \frac{1 + \sqrt{1 + 4(p_1 - p_2)/A}}{2(p_1 - p_2)} \right]^2 $$

where $A$ is a "correction factor" (related to the $\chi^2$ test statistic) given by

$$ A = \left[ z_{1-\alpha/2} \sqrt{2\bar{p}(1 - \bar{p})} + z_{1-\beta} \sqrt{p_1(1 - p_1) + p_2(1 - p_2)} \right]^2 $$

with $\bar{p} = (p_1 + p_2)/2$, and where $z_p = \Phi^{-1}(p)$ denotes the standard normal quantile function, i.e. $z_p$ is the location of the $p$-th quantile for $N(0, 1)$.
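The quantile function $z_p$ is available directly in SciPy; a quick illustration (assuming the same `scipy.stats` import used in the code cells below):

```python
from scipy import stats

# z_p = Phi^{-1}(p), the p-th quantile of the standard normal N(0, 1)
z = stats.norm.ppf(0.975)    # z_{1 - alpha/2} for alpha = 0.05
print(z)                     # ≈ 1.96
print(stats.norm.cdf(z))     # round trip through Phi recovers 0.975
```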
21. Example
In [9]: p1, p2 = 0.40, 0.60
alpha = 0.05
beta = 0.05
# Evaluate quantile functions
p_bar = (p1 + p2)/2.0
za = stats.norm.ppf(1 - alpha/2) # Two-sided test
zb = stats.norm.ppf(1 - beta)
# Compute correction factor
A = (za*np.sqrt(2*p_bar*(1-p_bar)) + zb*np.sqrt(p1*(1-p1) + p2*(1-p2)))**2
# Estimate samples required
n = A*(((1 + np.sqrt(1 + 4*(p1-p2)/A))) / (2*(p1-p2)))**2
print n
149.285261921
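As a quick sanity check (not on the slide), we can ask what power the plain normal approximation predicts for the rounded-up sample size from In [9], assuming a two-sided, pooled two-proportion z-test:

```python
import numpy as np
from scipy import stats

def approx_power(n, p1, p2, alpha=0.05):
    """Normal-approximation power of a two-sided two-proportion z-test
    with n users per group (variance pooled under the null)."""
    za = stats.norm.ppf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2.0
    se0 = np.sqrt(2 * p_bar * (1 - p_bar) / n)          # SE under the null
    se1 = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)  # SE under the alternative
    return stats.norm.sf((za * se0 - abs(p1 - p2)) / se1)

print(approx_power(150, 0.40, 0.60))   # ≈ 0.94, in the neighborhood of 1 - beta
```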
23. In [10]: p1, p2 = 0.0500, 0.0515
alpha = 0.05
beta = 0.05
# Evaluate quantile functions
p_bar = (p1 + p2)/2.0
za = stats.norm.ppf(1 - alpha/2) # Two-sided test
zb = stats.norm.ppf(1 - beta)
# Compute correction factor
A = (za*np.sqrt(2*p_bar*(1-p_bar)) + zb*np.sqrt(p1*(1-p1) + p2*(1-p2)))**2
# Estimate samples required
n = A*(((1 + np.sqrt(1 + 4*(p1-p2)/A))) / (2*(p1-p2)))**2
print n
555118.763831
2n ≈ 1,110,238, so for test and control combined we'll need at least 1.1 million users.
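The In [9] and In [10] cells repeat the same computation; it is handy to wrap it as a reusable helper (a sketch using the slide's formula verbatim, with a hypothetical function name):

```python
import numpy as np
from scipy import stats

def casagrande_sample_size(p1, p2, alpha=0.05, beta=0.05):
    """Per-group sample size from the approximate Casagrande et al. (1978)
    formula, exactly as computed in the In [9]/In [10] cells."""
    p_bar = (p1 + p2) / 2.0
    za = stats.norm.ppf(1 - alpha / 2)   # two-sided test
    zb = stats.norm.ppf(1 - beta)
    A = (za * np.sqrt(2 * p_bar * (1 - p_bar))
         + zb * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return A * ((1 + np.sqrt(1 + 4 * (p1 - p2) / A)) / (2 * (p1 - p2))) ** 2

print(casagrande_sample_size(0.40, 0.60))     # ≈ 149.29
print(casagrande_sample_size(0.05, 0.0515))   # ≈ 555119
```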
24. Also, let's verify that this calculation even works...
In [11]: n = 555119
n_trials = 10000
# Simulate experimental results when null is true
control0 = stats.binom.rvs(n, p1, size=n_trials)
test0 = stats.binom.rvs(n, p1, size=n_trials) # Test and control are the same
tables0 = [[[a, n-a], [b, n-b]] for a, b in zip(control0, test0)]
results0 = [stats.chi2_contingency(T) for T in tables0]
decisions0 = [x[1] <= alpha for x in results0]
# Simulate Experimental results when alternate is true
control1 = stats.binom.rvs(n, p1, size=n_trials)
test1 = stats.binom.rvs(n, p2, size=n_trials) # Test and control are different
tables1 = [[[a, n-a], [b, n-b]] for a, b in zip(control1, test1)]
results1 = [stats.chi2_contingency(T) for T in tables1]
decisions1 = [x[1] <= alpha for x in results1]
# Compute false alarm and correct detection rates
alpha_est = sum(decisions0)/float(n_trials)
power_est = sum(decisions1)/float(n_trials)
print('Theoretical false alarm rate = {:0.4f}, '.format(alpha) +
'empirical false alarm rate = {:0.4f}'.format(alpha_est))
print('Theoretical power = {:0.4f}, '.format(1 - beta) +
'empirical power = {:0.4f}'.format(power_est))
Theoretical false alarm rate = 0.0500, empirical false alarm rate = 0.0482
Theoretical power = 0.9500, empirical power = 0.9466
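The per-table χ² loop in In [11] is correct but slow. As a possible alternative (an assumption, not the slide's code), the same check can be vectorized with a pooled two-proportion z-test, which is asymptotically equivalent to the 2×2 χ² test without continuity correction:

```python
import numpy as np
from scipy import stats

n, n_trials, alpha = 555119, 10000, 0.05
p1, p2 = 0.0500, 0.0515
rng = np.random.default_rng(0)

def reject_rate(pa, pb):
    """Fraction of simulated A/B tests rejected by a two-sided,
    pooled two-proportion z-test at level alpha."""
    xa = rng.binomial(n, pa, size=n_trials)  # control conversions
    xb = rng.binomial(n, pb, size=n_trials)  # test conversions
    p_pool = (xa + xb) / (2.0 * n)           # pooled rate under the null
    se = np.sqrt(p_pool * (1 - p_pool) * 2.0 / n)
    z = (xb / n - xa / n) / se
    pvals = 2 * stats.norm.sf(np.abs(z))
    return np.mean(pvals <= alpha)

print(reject_rate(p1, p1))   # empirical false alarm rate, ≈ alpha
print(reject_rate(p1, p2))   # empirical power, ≈ 1 - beta
```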