Equal or unequal cell sizes in A/B testing?
Tom Haxton
Senior Data Scientist, Chegg
August 30, 2016

When running an A/B test with a small experimental cell, one needs to decide whether to use all remaining traffic as the control group or to use a control group of the same size as the experimental cell. Here I construct a minimal theoretical model to show that test results can depend on this choice of control group. This comes from a combination of two effects: (1) unconverted (anonymous) visitors coming back to a website with an identity that cannot be linked to their first visit (e.g. on a new device or with cookies turned off) and (2) return rates and/or second-visit conversion rates varying between control and experiment experiences.

So what do we do? In the model I found that test results will usually be most accurate when we use equal-size experimental and control cells, so I recommend using equal-size cells with a holdout cell whenever a 50/50 split is not appropriate. However, I found that even in this case results will not in general agree with what we would measure if we could track visitors perfectly. This is a reminder that A/B test results on anonymous web traffic must be taken with a grain of salt.

Published in: Data & Analytics

At Chegg we often run A/B tests to measure differences in conversion rates when we change a webpage design. Most often, we split traffic evenly into control and experimental cells. However, for a variety of reasons we sometimes allocate only a small fraction of our incoming traffic (e.g. 5%) to the experimental cell. In these cases, we need to decide which control group to compare to the small experimental group. In a truly randomized experiment, our results should not depend on our choice of control group, because sample means are unbiased estimators of population means. Thus, to reach a desired level of statistical certainty fastest, we would want to use the entire remaining 95% of traffic as the control group. However, at Chegg we have found that our test results (e.g. differences in conversion rates) can vary if we use an imbalanced control group (95%) vs. an equal control group (5%, with a 90% "holdout cell" removed from analysis).

I looked on the web for any discussion of why A/B test results could depend on the size of the control cell. I found conflicting advice on whether to use equal or unequal control cell sizes and no explanation of why, except for those pointing out that confidence intervals calculated assuming normal distributions will be less accurate when cell sizes are smaller. So I constructed a minimal theoretical model for A/B tests measuring conversion and solved the model to find out if measured conversion rates depend on cell sizes. For those interested in the details, read on to the next section. If you just want the punchline, it turns out that the dependence of test results on cell size comes from a combination of two effects: (1) unconverted (anonymous) visitors coming back to a website with an identity that cannot be linked to their first visit (e.g. on a new device or with cookies turned off) and (2) return rates and/or second-visit conversion rates varying between control and experiment experiences. The first effect reminds us that A/B tests on anonymous web traffic are not truly randomized experiments, because the anonymous visitors we treat as independent may in fact be the same people.
So what do we do? In the model I found that test results will usually be most accurate when we use equal-size experimental and control cells, so I recommend using equal-size cells with a holdout cell whenever a 50/50 split is not appropriate. However, I found that even in this case results will not in general agree with what we would measure if we could track visitors perfectly. This is another reminder that A/B test results on anonymous web traffic must be taken with a grain of salt.

In the following sections, I will discuss (1) the model and math leading to the results, (2) trends, and (3) the case of equal cell sizes.

1 Model and math

For simplicity, assume we have only one experimental cell. We want to know whether the difference in conversion rates that we measure depends on the size of the control cell and, if so, why. This approach should generalize to multiple experimental cells and to metrics other than conversion that are led by conversion.

We have an experimental cell of size $f$ and a control cell of size $1-f$. Assume that visitors convert on their first visit to the control (experimental) cell with probability $p^c_1$ ($p^e_1$). Assume that they do not convert but return with probability $r^c_1$ ($r^e_1$). Assume that some fraction $d$ of those return with an identity that cannot be linked with their initial identity, and that there is no interaction between the likelihood to come back with a new identity and the other probabilities. The probability to convert on a second visit can depend on the experience in both the first and second visits, so there may be four distinct probabilities to convert on the second visit, $p^{cc}_2$, $p^{ce}_2$, $p^{ec}_2$, and $p^{ee}_2$, where the first (second) superscript refers to the first (second) visit. For simplicity, let's assume that no one returns for a third visit, but these results could be generalized to multiple return visits.
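Before working through the algebra, the model just described can be enumerated numerically: a visitor's path is (first-visit cell, convert or return, linked or new identity, second-visit cell), and each path contributes its probability to the conversion and unique-identity counts of a cell. A minimal sketch, with all rate values invented purely for illustration:

```python
# Exact enumeration (expected values, no sampling) of the two-visit model.
# Keys "c"/"e" stand for the control/experimental superscripts in the text;
# all numeric parameter values below are made-up examples.

def apparent_rates(f, d, p1, r1, p2):
    """Return apparent conversion rates (p^c, p^e) for cell sizes (1-f, f).

    p1[x]    : first-visit conversion rate in cell x
    r1[x]    : return rate of unconverted first-visit x visitors
    p2[x][y] : second-visit conversion rate, first visit in x, second in y
    d        : fraction of returners whose new identity cannot be linked
    """
    size = {"c": 1.0 - f, "e": f}
    conv = {"c": 0.0, "e": 0.0}   # conversions attributed to each cell
    ids = {"c": 0.0, "e": 0.0}    # unique identities counted in each cell
    for x in ("c", "e"):
        ids[x] += size[x]
        conv[x] += size[x] * p1[x]            # convert on the first visit
        returners = size[x] * r1[x]           # do not convert but return
        # linked identity: still bucketed in cell x, no new identity counted
        conv[x] += returners * (1.0 - d) * p2[x][x]
        # new identity: re-bucketed at random into cell y and counted again
        for y in ("c", "e"):
            ids[y] += returners * d * size[y]
            conv[y] += returners * d * size[y] * p2[x][y]
    return conv["c"] / ids["c"], conv["e"] / ids["e"]

p1 = {"c": 0.10, "e": 0.12}
r1 = {"c": 0.30, "e": 0.20}
p2 = {"c": {"c": 0.05, "e": 0.06}, "e": {"c": 0.05, "e": 0.06}}

for f in (0.05, 0.5):
    pc, pe = apparent_rates(f, d=0.4, p1=p1, r1=r1, p2=p2)
    print(f"f={f:.2f}: pe - pc = {pe - pc:.5f}")
```

With $d = 0$ the printed difference is identical for both allocations; with $d > 0$ and unequal return rates it shifts with $f$, which is exactly the effect derived below.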
The number of conversions in the control cell (relative to the total number of visitors) is

    (1-f) p^c_1 + (1-f) r^c_1 (1-d) p^{cc}_2 + (1-f) r^c_1 d (1-f) p^{cc}_2 + f r^e_1 d (1-f) p^{ec}_2.   (1)

The first term in Eq. 1 represents visitors who arrive in the control cell and convert on the first visit. The second term represents visitors who arrive in the control cell, do not convert but return, return with the same identity, and convert on the second visit. The third term represents visitors who arrive in the control cell, do not convert but return, return with a different identity, arrive in the control cell on their second visit, and convert. The fourth term represents visitors who arrive in the experimental cell on their first visit, do not convert but return, return with a different identity, arrive in the control cell on their second visit, and convert.

Similarly, the number of conversions in the experimental cell (relative to the total number of visitors) is

    f p^e_1 + f r^e_1 (1-d) p^{ee}_2 + f r^e_1 d f p^{ee}_2 + (1-f) r^c_1 d f p^{ce}_2.   (2)

The number of unique identities counted in the control cell (relative to the total number of visitors) is

    (1-f) + (1-f) r^c_1 d (1-f) + f r^e_1 d (1-f).   (3)

The first term in Eq. 3 represents visitors who arrive first in the control cell. The second term represents visitors who arrive in the control cell, do not convert but return, return with a different identity, and arrive in the control cell the second time. The third term represents visitors who arrive in the experimental cell, do not convert but return, return with a different identity, and arrive in the control cell the second time.

Similarly, the number of unique identities counted in the experimental cell (relative to the total number of visitors) is

    f + f r^e_1 d f + (1-f) r^c_1 d f.   (4)

The apparent conversion rates $p^c$ and $p^e$ are obtained by dividing Eq. 1 by Eq. 3 and Eq. 2 by Eq. 4. We get

    p^c = \frac{p^c_1 + r^c_1 (1-d) p^{cc}_2 + r^c_1 d (1-f) p^{cc}_2 + f r^e_1 d \, p^{ec}_2}{1 + r^c_1 d (1-f) + f r^e_1 d}   (5)

and

    p^e = \frac{p^e_1 + r^e_1 (1-d) p^{ee}_2 + r^e_1 d f p^{ee}_2 + (1-f) r^c_1 d \, p^{ce}_2}{1 + r^e_1 d f + (1-f) r^c_1 d}.   (6)

From Eqs. 5 and 6 we see that if we can always identify visitors perfectly ($d = 0$) there should be no dependence of apparent conversion rates on allocation size. In that case

    p^c = p^c_1 + r^c_1 p^{cc}_2   (7)

and

    p^e = p^e_1 + r^e_1 p^{ee}_2.   (8)

However, if we lose some identities ($d > 0$), then the apparent conversion rates will depend on allocation size unless both the return rates are the same, $r^e_1 = r^c_1$,
and the second-visit conversion rates do not depend on the experience in the first visit, $p^{cc}_2 = p^{ec}_2$ and $p^{ce}_2 = p^{ee}_2$. If both of these types of rates differ between cells, the dependence on allocation is complicated (Eqs. 5 and 6). Usually, we would expect the return rate to differ more between cells than the dependence of second-visit conversion on first-visit experience, so to get the dominant behavior we assume that $p^{cc}_2 = p^{ec}_2 \equiv p^c_2$ and $p^{ce}_2 = p^{ee}_2 \equiv p^e_2$. Then

    p^c = \frac{p^c_1 + r^c_1 p^c_2 + (r^e_1 - r^c_1) d f p^c_2}{1 + r^c_1 d + f (r^e_1 - r^c_1) d}   (9)

and

    p^e = \frac{p^e_1 + r^e_1 p^e_2 + (r^c_1 - r^e_1) d (1-f) p^e_2}{1 + r^e_1 d + (1-f)(r^c_1 - r^e_1) d}.   (10)

Expanding in $d$,

    p^c = p^c_1 + r^c_1 p^c_2 + \left[ (r^e_1 - r^c_1) f (p^c_2 - p^c_1 - r^c_1 p^c_2) - (p^c_1 + r^c_1 p^c_2) r^c_1 \right] d + O(d^2)   (11)

and

    p^e = p^e_1 + r^e_1 p^e_2 + \left[ (r^c_1 - r^e_1)(1-f)(p^e_2 - p^e_1 - r^e_1 p^e_2) - (p^e_1 + r^e_1 p^e_2) r^e_1 \right] d + O(d^2).   (12)

The apparent conversion rates change with allocation size according to

    \frac{dp^c}{df} = d (r^e_1 - r^c_1)(p^c_2 - p^c_1 - r^c_1 p^c_2) + O(d^2)   (13)

and

    \frac{dp^e}{df} = d (r^e_1 - r^c_1)(p^e_2 - p^e_1 - r^e_1 p^e_2) + O(d^2),   (14)

so that the change in relative conversion rates is

    \frac{d(p^e - p^c)}{df} = d (r^e_1 - r^c_1) \left( (p^e_2 - p^c_2) - (p^e_1 - p^c_1) - (r^e_1 p^e_2 - r^c_1 p^c_2) \right) + O(d^2).   (15)

Dropping higher-order terms in the return rates (assuming these rates are substantially less than 1), this simplifies to

    \frac{d(p^e - p^c)}{df} = d (r^e_1 - r^c_1) \left( (p^e_2 - p^c_2) - (p^e_1 - p^c_1) \right) + O(d \, r_1^2 p_2) + O(d^2).   (16)

2 Trends

Depending on the values on the right side of Eq. 16, this effect could go either way. In general, we expect that second-visit conversion rates are lower than first-visit conversion rates, so differences between second-visit conversion rates will also usually be smaller than differences between first-visit conversion rates, $|p^e_2 - p^c_2| < |p^e_1 - p^c_1|$. This means that to estimate the direction of the effect we can consider the simpler approximation

    \frac{d(p^e - p^c)}{df} \sim -d (r^e_1 - r^c_1)(p^e_1 - p^c_1).   (17)

Additionally, when return rates and second-visit conversion rates are small, we expect the sign of $p^e - p^c$ to be the same as the sign of $p^e_1 - p^c_1$, so the direction of the effect is given by

    \mathrm{Sign}\left( \frac{d(p^e - p^c)}{df} \right) = -\,\mathrm{Sign}(r^e_1 - r^c_1)\,\mathrm{Sign}(p^e - p^c).   (18)

This means that when the return rate is larger for unconverted visitors from the experimental cell, the difference in conversion rates (whichever way it goes) is increasingly overestimated as the control cell gets bigger ($f$ decreases). Conversely, when the return rate is smaller for unconverted visitors from the experimental cell, the difference in conversion rates is increasingly underestimated as the control cell gets bigger. The effect is not likely to switch the sign of the difference in conversion rates (which would lead to qualitatively wrong results) because $f$, $d$, and $|r^e_1 - r^c_1|$ in Eq. 17 all must be less than 1.

3 Should we trust same-size cells?

Given that our results depend on cell size, our intuition has been to trust the results of A/B tests with equal-size cells, since this seems to compare the variations on more equal footing. But should we fully trust these results? That is, are the results with equal-size cells the same as what we would find in the ideal experimental design where we perfectly track all visitors' identities? If the cells are the same size ($f = 1 - f = 1/2$) the difference in apparent conversion rates turns out to be

    p^e - p^c = \frac{p^e_1 - p^c_1 + r^e_1 (1 - d/2) p^e_2 - r^c_1 (1 - d/2) p^c_2 + r^c_1 d \, p^e_2 / 2 - r^e_1 d \, p^c_2 / 2}{1 + (r^e_1 + r^c_1) d / 2}.   (19)

Comparing this to the conversion rate we would measure if we lost no identities ($d = 0$),

    (p^e - p^c)_0 = p^e_1 - p^c_1 + r^e_1 p^e_2 - r^c_1 p^c_2,   (20)

we find that the lost identities change the apparent difference in conversion rates by

    p^e - p^c - (p^e - p^c)_0 = \frac{r^c_1 p^e_2 - r^e_1 p^c_2 + (r^c_1 + r^e_1)(r^c_1 p^c_2 - r^e_1 p^e_2)}{2/d + (r^e_1 + r^c_1)}.   (21)

The right side of Eq. 21 does not equal 0 in general, so even if we use equal-size cells, the difference in conversion rates that we measure is not the same as what we would measure if we could perfectly keep track of everyone's identity. To lowest order in return rates Eq. 21 can be written

    p^e - p^c - (p^e - p^c)_0 = \frac{d}{2} (r^c_1 p^e_2 - r^e_1 p^c_2) + O(r_1^2 p_2).   (22)

This effect is small whenever second-visit conversion rates are small (or whenever their particular combination with return rates in Eq. 22 is small). In general we expect that first-visit conversion rates are larger than second-visit conversion rates, so the discrepancy from having unequal cells will usually be larger than the discrepancy purely from losing visitors' identities. As with the discrepancy from unequal cells, the direction of this latter effect can go either way.

4 Bottom line

Whenever we cannot perfectly track visitors' identities, we must take A/B tests with a grain of salt: measured conversion rates will be different from what we would measure if we could perfectly track identities. Although part of this effect (and usually the larger part) can be avoided by using same-size cells, even A/B tests with same-size cells will not in general give accurate results unless we can perfectly track visitors' identities.
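The residual bias for equal cells can be seen with a quick numeric check: evaluate the equal-cell difference of Eq. 19 against the perfect-tracking difference of Eq. 20. A minimal sketch, with all rate values invented for illustration:

```python
# Compare the equal-cell measurement (Eq. 19) with what perfect identity
# tracking would give (Eq. 20). All parameter values are made-up examples.
d = 0.4                    # fraction of returners with an unlinkable identity
p1c, p1e = 0.10, 0.12      # first-visit conversion rates
r1c, r1e = 0.30, 0.20      # return rates of unconverted visitors
p2c, p2e = 0.05, 0.06      # second-visit conversion rates (Sec. 1 simplification)

# Eq. 19: apparent difference in conversion rates with equal cells (f = 1/2)
num = (p1e - p1c
       + r1e * (1 - d / 2) * p2e - r1c * (1 - d / 2) * p2c
       + r1c * d * p2e / 2 - r1e * d * p2c / 2)
measured = num / (1 + (r1e + r1c) * d / 2)

# Eq. 20: the difference we would measure with no lost identities (d = 0)
ideal = p1e - p1c + r1e * p2e - r1c * p2c

print(f"equal cells: {measured:.5f}, perfect tracking: {ideal:.5f}")
print(f"residual bias: {measured - ideal:+.5f}")
```

For these example rates the two values disagree in the fourth decimal place: small, consistent with the claim above that the equal-cell discrepancy is usually the lesser of the two effects, but not zero.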
