Uncertainty in online experiments with dependent data (KDD 2013 presentation)
Presentation from KDD 2013 paper with Dean Eckles: Uncertainty in Online Experiments with Dependent Data: An Evaluation of Bootstrap Methods.

See http://arxiv.org/pdf/1304.7406v2.pdf

Transcript of "Uncertainty in online experiments with dependent data (KDD 2013 presentation)"

  1. Eytan Bakshy and Dean Eckles, Facebook Data Science. ACM KDD 2013, Chicago, IL, August 12, 2013. Uncertainty in Online Experiments with Dependent Data: An Evaluation of Bootstrap Methods.
  2. Outline
     ▪ Motivation
     ▪ A model of user-item experiments and its variance
     ▪ The bootstrap
     ▪ Evaluation of bootstrap methods using search, ads, and feed data
       ▪ Under the sharp null (no effect)
       ▪ Under the null with effects
     ▪ Takeaways
  3. 100 items, 100 users, 100,000 impressions. Does N = 100,000?
     http://www.tucsonsentinel.com/local/report/062712_az_students/state-natl-rankings-vary-widely-az-student-performance/
  4. A ranking experiment
     Pattern of exposure under control, Z(0) (d = 0):
               Item 1  Item 2  Item 3  Item 4  Item 5
     User 1      1       1       1       0       0
     User 2      1       1       0       0       0
     User 3      0       1       0       1       0
     User 4      0       0       0       0       1
     Pattern of exposure under treatment, Z(1) (d = 1):
               Item 1  Item 2  Item 3  Item 4  Item 5
     User 1      1       0       1       1       0
     User 2      1       0       1       0       0
     User 3      0       1       0       1       0
     User 4      1       0       0       0       0
     Many ranking experiments affect the pattern of exposure.
  5. A ranking experiment
     Observed potential outcomes under control (d = 0):
               Item 1  Item 2  Item 3  Item 4  Item 5
     User 1    Y_1,1   Y_1,2   Y_1,3     -       -
     User 2    Y_2,1   Y_2,2     -       -       -
     User 3      -     Y_3,2     -     Y_3,4     -
     User 4      -       -       -       -     Y_4,5
     Observed potential outcomes under treatment (d = 1):
               Item 1  Item 2  Item 3  Item 4  Item 5
     User 1    Y_1,1     -     Y_1,3   Y_1,4     -
     User 2    Y_2,1     -     Y_2,3     -       -
     User 3      -     Y_3,2     -     Y_3,4     -
     User 4    Y_4,1     -       -       -       -
     Many ranking experiments affect the pattern of exposure, but not the potential
     outcomes: Y = Y(0) = Y(1), the potential outcomes under control and treatment.
  6. A user interface experiment
     Z(0) = Z(1): the pattern of exposure is the same for treatment and control. UI
     experiments may not affect the pattern of exposure, but do affect responses to
     items: Y(0) < Y(1). For either condition d = i:
               Item 1    Item 2    Item 3    Item 4    Item 5
     User 1   Y_1,1(i)  Y_1,2(i)  Y_1,3(i)     -         -
     User 2   Y_2,1(i)  Y_2,2(i)     -         -         -
     User 3      -      Y_3,2(i)     -      Y_3,4(i)     -
     User 4      -         -         -         -      Y_4,5(i)
     The response to the same item differs depending on treatment status.
  7. Basic crossed random effects model
     Random effects models are a general way of describing a data-generating process.
     In the two-way crossed random effects model [2, 24], each observation is generated
     by some function f (often the identity, logit, or probit) of a linear combination
     of a grand mean μ, a random effect α_i for the idiosyncratic deviation of user i,
     a random effect β_j for the idiosyncratic deviation of item j (e.g. an ad, a
     search result, a URL), and an error term ε_ij for each user's idiosyncratic
     response to each item, which could reflect, for example, how relevant the item is
     to the user:

       Y_ij = f(μ + α_i + β_j + ε_ij)
       α_i ~ H(0, σ²_α_i),  β_j ~ H(0, σ²_β_j),  ε_ij ~ H(0, σ²_ε_ij)

     Each random effect is drawn from some distribution H with zero mean and some
     variance. In the homogeneous random effects model this variance is the same for
     each user or item (σ_α_i = σ_α); in the heterogeneous model, each unit or group
     of units may have its own variance.
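The data-generating process on this slide can be sketched in a few lines. This is a toy simulation, assuming an identity link f and normal H; the sizes and variances below are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items = 100, 50
mu = 0.5
sigma_alpha, sigma_beta, sigma_eps = 0.3, 0.2, 0.1  # illustrative variances

# One random effect per user (alpha_i), one per item (beta_j), and one
# idiosyncratic user-item error term (eps_ij).
alpha = rng.normal(0.0, sigma_alpha, size=n_users)
beta = rng.normal(0.0, sigma_beta, size=n_items)
eps = rng.normal(0.0, sigma_eps, size=(n_users, n_items))

# Y_ij = f(mu + alpha_i + beta_j + eps_ij), here with f = identity.
Y = mu + alpha[:, None] + beta[None, :] + eps
```

Every observation in row i shares alpha_i and every observation in column j shares beta_j; this crossed dependence is exactly what the iid bootstrap later ignores.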
  8. A random effects model for experiments
     A model of potential outcomes for thinking about ads, search, and feed
     experiments. For the sake of exposition, we restrict attention to linear models
     with normally distributed random effects (Y unbounded, f the identity), and work
     conditional on the treatment assignment D [18, 19]:

       Y_ij^(d) = μ^(d) + α_i^(d) + β_j^(d) + ε_ij^(d)
       α⃗_i ~ N(0, Σ_α),  β⃗_j ~ N(0, Σ_β),  ε⃗_ij ~ N(0, Σ_ε)            (1)

     Here α⃗_i, etc., are vectors, where each element corresponds to the random effect
     of a unit under a given treatment: d might affect units in different ways. We wish
     to estimate quantities comparing outcomes observed under different values of D,
     most simply the difference in means for a binary treatment,

       δ ≡ E[Y_ij^(1) | Z_ij^(1) = 1] - E[Y_ij^(0) | Z_ij^(0) = 1].

     Experiments can produce a non-zero δ simply by changing the pattern of exposure.
     For example, a search ranking experiment could primarily have effects by changing
     which items are displayed as results (and thus observed); at the extreme, the
     potential outcomes could be identical (Y_ij^(0) = Y_ij^(1) for all i, j) while
     the pattern of exposure differs (Z_ij^(0) ≠ Z_ij^(1)), so that δ ≠ 0. Other
     experiments produce a non-zero δ while leaving the pattern of exposure identical
     (Z(0) = Z(1)) or otherwise ignorably similar, e.g. by displaying the same item
     slightly differently so that Y_ij^(0) ≠ Y_ij^(1) for some i, j. If
     Z = Z(0) = Z(1), δ is an average treatment effect (ATE), a true difference in
     means for the same units [20]:

       δ = E[Y_ij^(1) - Y_ij^(0) | Z_ij = 1].
  9. "Bad luck" in an A/A test
     "Clicky" users (D = 0) vs. KDD attendees (D = 1). The treatment has no true
     effect, but the estimated CTR (mean click rate) is higher for the control:
     3/5 clicks for D = 0, 1/3 clicks for D = 1.
               Item 1    Item 2    Item 3    Item 4    Item 5
     User 1   Y_1,1(0)  Y_1,2(0)  Y_1,3(0)  Y_1,4(0)     -
     User 2   Y_2,1(0)     -         -         -         -
     User 3      -         -      Y_3,3(1)  Y_3,4(1)     -
     User 4      -         -         -         -      Y_4,5(1)

       Ȳ(0) = 3/5 clicks,  Ȳ(1) = 1/3 clicks,  δ̂ = 1/3 - 3/5 ≈ -0.27.
 10. Variance of ATE for user-item experiments
     Estimates of the average treatment effect (ATE) include noise that depends on
     users and items. To further simplify, we can introduce duplication coefficients
     measuring how much units are duplicated in the data. Following previous work
     [18, 19], define

       ν_A^(d) ≡ (1/N) Σ_i (n_i•^(d))²,   ν_B^(d) ≡ (1/N) Σ_j (n_•j^(d))²,

     the average number of observations sharing the same user (the ν_As) or item (the
     ν_Bs) as an observation, including itself. For the units assigned to conditions
     (in this case, users), either n_i•^(0) or n_i•^(1) is zero for each i; for the
     non-assigned units (items), we need a measure of between-condition duplication:

       ω_B ≡ (1/N) Σ_j n_•j^(0) n_•j^(1).

     Under the homogeneous random effects model (1), with an equal number of
     observations in each condition, randomizing over users with Z(0) = Z(1), the
     variance simplifies to

       V[δ̂] = (1/N) [ (ν_A^(1) σ²_α(1) + ν_A^(0) σ²_α(0))
                      + (ν_B^(1) σ²_β(1) + ν_B^(0) σ²_β(0) - 2 ω_B σ_β(0),β(1))
                      + σ²_ε(0) + σ²_ε(1) ].                                      (3)

     This expression makes clear that if the random effects for items in treatment and
     control are correlated (as we would usually expect), then an increase in the
     balance of how often items appear in each condition reduces the variance of the
     estimated treatment effect.
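The duplication coefficients ν_A, ν_B, and ω_B can be computed directly from an impression log. A minimal sketch; the tiny log below is made up for illustration:

```python
import numpy as np

# Hypothetical impression log: (user, item, condition) per observation.
users = np.array([0, 0, 1, 1, 2, 2, 3])
items = np.array([0, 1, 0, 2, 1, 2, 0])
d = np.array([0, 0, 0, 0, 1, 1, 1])  # users 0-1 in control, users 2-3 in treatment
N = len(users)

def nu(ids, cond):
    # nu^(d): average number of observations in condition d sharing a unit
    # with an observation, including itself (sum of squared unit counts / N).
    counts = np.bincount(ids[d == cond], minlength=ids.max() + 1)
    return (counts ** 2).sum() / N

nu_A = {c: nu(users, c) for c in (0, 1)}  # user duplication per condition
nu_B = {c: nu(items, c) for c in (0, 1)}  # item duplication per condition

# omega_B: between-condition duplication of items.
c0 = np.bincount(items[d == 0], minlength=items.max() + 1)
c1 = np.bincount(items[d == 1], minlength=items.max() + 1)
omega_B = (c0 * c1).sum() / N
```

Items that appear in both conditions raise ω_B, which, with positively correlated item effects, lowers V[δ̂] in (3).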
 11. Bootstrap Illustration
 12. IID bootstrap illustration, R = 1
     clicked:       1  0  1  0  1
     weight (iid):  1  0  0  1  0
     product:       1  0  0  0  0
     t = (1+0+1+0+1)/5 = 3/5 = 0.6
     t*_r=1 = (1+0+0+0+0)/(1+0+0+1+0) = 1/2 = 0.5
 13. IID bootstrap illustration, R = 2
     clicked:       1  0  1  0  1
     weight (iid):  1  1  1  0  0
     product:       1  0  1  0  0
     t = (1+0+1+0+1)/5 = 3/5 = 0.6
     t*_r=2 = (1+0+1+0+0)/(1+1+1+0+0) = 2/3 ≈ 0.67
 14. Repeat 500 times...
     ▪ Repeat the process 500 times, and compute the 95% interval of the estimated
       means
     [Histogram: R = 500 bootstrap replicates of the estimated mean clicks per
     impression; frequency vs. estimated average effect, roughly 0.70 to 0.80]
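The procedure on slides 12-14 can be sketched with Poisson(1) resampling weights, a common streaming stand-in for multinomial iid resampling. The five clicks are the toy data from the slides, with R = 500 replicates:

```python
import numpy as np

rng = np.random.default_rng(42)
clicked = np.array([1, 0, 1, 0, 1])  # toy impressions: t = 3/5 = 0.6
R = 500

reps = []
while len(reps) < R:
    # One Poisson(1) weight per observation: the iid bootstrap.
    w = rng.poisson(1.0, size=clicked.size)
    if w.sum() > 0:  # skip the rare all-zero replicate
        reps.append((clicked * w).sum() / w.sum())

# 95% percentile interval of the bootstrap replicates.
lo, hi = np.percentile(reps, [2.5, 97.5])
```

Each replicate recomputes the weighted click-through rate, and the 2.5th and 97.5th percentiles of the replicates give the nominal 95% interval.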
 15. Problem: variance from the IID bootstrap does not account for dependence due to
     users or items.
 16. Problem: variance from the IID bootstrap does not account for dependence due to
     users or items. "95% confidence intervals" may not include the true mean 95% of
     the time.
 17. User bootstrap illustration, R = 1
     Weights are drawn per user, so all impressions from the same user share a weight.
     clicked:        1  0  1  0  1
     weight (user):  0  1  1  0  1
     click*weight:   0  0  1  0  1
     t = (1+0+1+0+1)/5 = 3/5 = 0.6
     t*_r=1 = (0+0+1+0+1)/(0+1+1+0+1) = 2/3 ≈ 0.67
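The user bootstrap draws one weight per user, so all of a user's impressions move together. A sketch with a made-up user id per impression:

```python
import numpy as np

rng = np.random.default_rng(7)
user = np.array([0, 0, 1, 2, 2])      # hypothetical user id per impression
clicked = np.array([1, 0, 1, 0, 1])
R = 500

reps = []
while len(reps) < R:
    wu = rng.poisson(1.0, size=user.max() + 1)  # one weight per user...
    w = wu[user]                                # ...shared by that user's impressions
    if w.sum() > 0:
        reps.append((clicked * w).sum() / w.sum())

lo, hi = np.percentile(reps, [2.5, 97.5])
```

Because whole users are up- or down-weighted at once, the replicate-to-replicate spread reflects between-user variation rather than treating impressions as independent.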
 18. Multiway bootstrap illustration, R = 1
     clicked:         1  0  1  0  1
     w1 (user):       0  1  1  0  1
     w2 (ad):         1  1  1  0  0
     w1*w2:           0  1  1  0  0
     clicked*w1*w2:   0  0  1  0  0
     t = (1+0+1+0+1)/5 = 3/5 = 0.6
     t*_r=1 = (0+0+1+0+0)/(0+1+1+0+0) = 1/2 = 0.5
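The multiway bootstrap multiplies an independent per-user weight and per-ad weight for each observation, so each replicate resamples users and ads at once. A sketch; the user and ad ids are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
user = np.array([0, 0, 1, 2, 2])      # hypothetical user id per impression
ad = np.array([0, 1, 1, 0, 2])        # hypothetical ad id per impression
clicked = np.array([1, 0, 1, 0, 1])
R = 500

reps = []
while len(reps) < R:
    wu = rng.poisson(1.0, size=user.max() + 1)  # one weight per user
    wa = rng.poisson(1.0, size=ad.max() + 1)    # one weight per ad
    w = wu[user] * wa[ad]                       # multiway weight: the product
    if w.sum() > 0:
        reps.append((clicked * w).sum() / w.sum())

lo, hi = np.percentile(reps, [2.5, 97.5])
```

An observation survives a replicate only if both its user and its ad draw nonzero weights, which captures dependence along both dimensions.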
 19. Bootstrap evaluation under the sharp null
     1. Randomly assign 1% segments of users to treatment (0) or control (1)
     2. Bootstrap the difference in mean outcomes under (0) and (1)
     3. Repeat 500 times
     4. True coverage is the proportion of times the null is accepted (i.e., the "95%"
        confidence interval for the difference in means crosses zero)
     [Plot: 95% bootstrap CIs by rank for the iid, item, user, and multiway
     bootstraps, on a scale of roughly -1e-03 to 1e-03]
     Sharp null hypothesis. Under the sharp null hypothesis for user-item experiments,
     the treatment has no average, interaction, or exposure effects; that is, the
     outcome for a particular user-item pair, and whether or not the item is displayed
     to the user, would be the same regardless of treatment assignment. In the context
     of our model, in addition to δ = 0, the sharp null can be defined by:

       Z_ij ≡ Z_ij^(0) = Z_ij^(1)
       σ²_α ≡ σ²_α(1) = σ²_α(0) = σ_α(0),α(1)
       σ²_β ≡ σ²_β(1) = σ²_β(0) = σ_β(0),β(1)
       σ²_ε ≡ σ²_ε(1) = σ²_ε(0) = σ_ε(0),ε(1)

     In this case, only random effects for items that are not balanced across
     conditions contribute to the variance of our difference: the contribution a
     single item j makes to the variance simplifies to (n_•j^(0) - n_•j^(1))² σ²_β;
     that is, it depends only on the squared difference in duplication between
     treatment and control. It is easy to show that

       V[δ̂] = (1/N) [ (ν_A^(1) + ν_A^(0)) σ²_α + ν̃_B σ²_β + 2 σ²_ε ],

     where ν̃_B ≡ (1/N) Σ_j (n_•j^(1) - n_•j^(0))² measures the average
     between-condition duplication of observations of items. If items, like users,
     also only appear in either treatment or control, then ν̃_B = ν_B^(1) + ν_B^(0),
     highlighting the resulting symmetry between users' and items' contributions to
     the uncertainty.
     Non-sharp null hypothesis. Experiments may have zero average effects (δ = 0) and
     still violate the sharp null; for example, the pattern of exposure may change
     such that users are exposed to different items.
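The evaluation recipe in steps 1-4 above can be checked on synthetic A/A data. This toy sketch, with strong made-up user effects and small made-up sizes (not the paper's datasets), compares the true coverage of the iid and user bootstraps; on data like this the iid interval is far too narrow:

```python
import numpy as np

rng = np.random.default_rng(1)

def iid_rep(y, user, d):
    idx = rng.integers(0, y.size, y.size)       # resample impressions
    yb, db = y[idx], d[idx]
    return yb[db == 1].mean() - yb[db == 0].mean()

def user_rep(y, user, d):
    w = rng.poisson(1.0, user.max() + 1)[user]  # resample whole users
    m1 = (y * w)[d == 1].sum() / max(w[d == 1].sum(), 1)
    m0 = (y * w)[d == 0].sum() / max(w[d == 0].sum(), 1)
    return m1 - m0

def coverage(rep, sims=100, n_users=40, n_per=25, R=200):
    # Fraction of "95%" CIs that contain the true effect (zero) in A/A tests.
    hits = 0
    for _ in range(sims):
        alpha = rng.normal(0.0, 1.0, n_users)   # strong user effects
        user = np.repeat(np.arange(n_users), n_per)
        y = alpha[user] + rng.normal(0.0, 0.2, user.size)
        d = user % 2                            # A/A split by user
        reps = [rep(y, user, d) for _ in range(R)]
        lo, hi = np.percentile(reps, [2.5, 97.5])
        hits += (lo <= 0.0 <= hi)
    return hits / sims

cov_iid, cov_user = coverage(iid_rep), coverage(user_rep)
```

With outcomes this strongly clustered by user, the iid bootstrap's coverage falls well below the nominal 95%, while the user bootstrap's coverage stays close to it.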
 20. Duplication increases over time
     Variance increases with duplication.
     [Figure 2: True coverage for nominal 95% confidence intervals produced by the
     iid, single-way, and multiway bootstrap for A/A tests segmented by user id as a
     function of time, for ads, search, and feed. Uncertainty estimates for the iid
     and item-level bootstrap become increasingly inaccurate over time, while the
     user-level and multiway bootstrap have the advertised or conservative Type I
     error rate.]
     Table 1: The amount of duplication present in our datasets for a single 1%
     segment of users.
                          Ads          Search       Feed
       users              4,515,816    908,339      545,218
       items              317,159      1,362,061    326,831
       user-item pairs    24,081,939   4,263,769    2,882,452
       ν_users            18.5         35.5         20.3
       ν_items            6,625.9      543.6        1,333.0
     [Figure 3: Duplication (ν) for users and items over time, relative to the first
     day, for ads, search, and feed.]
     For the restricted categories of items we consider in each dataset, there are
     more users exposed to ads than to search results or feed stories.
     The estimator and its variance:

       δ̂ = δ + (1/N) [ Σ_i (n_i•^(1) α_i^(1) - n_i•^(0) α_i^(0))
                       + Σ_j (n_•j^(1) β_j^(1) - n_•j^(0) β_j^(0))
                       + Σ_i Σ_j ε_ij^(D_ij) ]

       V[δ̂] = (1/N²) [ Σ_i ((n_i•^(1))² σ²_α(1) + (n_i•^(0))² σ²_α(0))
                       + Σ_j ((n_•j^(1))² σ²_β(1) + (n_•j^(0))² σ²_β(0)
                              - 2 n_•j^(0) n_•j^(1) σ_β(0),β(1))
                       + Σ_i Σ_j ((n_ij^(1))² σ²_ε(1),ij + (n_ij^(0))² σ²_ε(0),ij) ]  (2)

     The first term is the contribution of the random effects of users to the
     variance, and the second is the contribution of the random effects of items. The
     covariance term, present for items, is absent for users and user-item pairs,
     since each is only observed in either the treatment or the control. (For true
     experiments, D is randomly assigned, but under some circumstances, i.e.
     conditional ignorability, treatment effects may be estimated without
     randomization [7, 17, 21].) When the sharp-null conditions do not hold, we say
     there are interaction effects of the treatment and units; for example, there may
     be an item-treatment interaction effect.
 21. Anti-conservatism increases over time
     [Plot: true coverage of nominal 95% bootstrap CIs over days 0 to 20 for ads,
     search, and feed, for the iid, item, user, and multiway bootstraps]
 22. Experiments with effects
     ▪ Fit a probit random effects model to ads data from several small countries to
       estimate random effects for a realistic generative model
     ▪ Simulate varying treatment effect levels via a synthetic model with a fixed
       layout, with duplication similar to real-world duplication
     The A/A tests above cannot tell us how bootstrap procedures perform in situations
     where treatments do have effects. For example, an ads experiment that manipulates
     the display of advertising units may only affect certain items and their users
     [3]. To explore these circumstances, we conduct simulations with a probit random
     effects model parameterized to mirror the kinds of outcomes described above,
     varying the presence of an item-treatment interaction, a plausible source of
     violations of the sharp null hypothesis.
     Probit model: we modify (1) so that Y is binary, with a single intercept common
     to both treatment and control, reflecting the lack of an ATE:

       y_ij^(d) = μ + α_i^(d) + β_j^(d) + ε_ij^(d)        (6)
       Y_ij^(d) = 1{y_ij^(d) > 0}                         (7)

     Reflecting the absence of an ATE, we restrict the random effect variances to be
     the same in treatment and control; for example, the covariance matrix for the
     item random effects is

       Σ_β = [ σ²_β       ρ_β σ²_β ]
             [ ρ_β σ²_β   σ²_β     ]

     To make realistic choices for the variances of the random effects, we fit probit
     random effects models to the ads data from a large random sample of users in each
     of several small countries. This produced several estimates of σ_α; we report
     simulation results for σ_α = 0.3.
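The probit generative model in (6)-(7) can be sketched as follows. All sizes and parameter values here are illustrative (only σ_α = 0.3 comes from the slide), and the error scale is fixed at 1 as is conventional for a probit latent variable:

```python
import numpy as np

rng = np.random.default_rng(5)
n_users, n_items = 200, 100
mu, s_a, s_b, rho = -1.0, 0.3, 0.3, 0.5   # illustrative; slide reports s_a = 0.3

# Item effects correlated across conditions, with covariance Sigma_beta.
cov = (s_b ** 2) * np.array([[1.0, rho], [rho, 1.0]])
beta = rng.multivariate_normal([0.0, 0.0], cov, size=n_items)  # (items, 2)
alpha = rng.normal(0.0, s_a, size=(n_users, 2))                # user effects

d = rng.integers(0, 2, size=n_users)            # randomize over users
a = alpha[np.arange(n_users), d]                # user effect in own condition
eps = rng.normal(0.0, 1.0, size=(n_users, n_items))

# Latent y_ij as in (6), binary Y_ij = 1{y_ij > 0} as in (7).
latent = mu + a[:, None] + beta[:, d].T + eps
Y = (latent > 0).astype(int)

ate_hat = Y[d == 1].mean() - Y[d == 0].mean()   # no true ATE in this model
```

Lowering rho weakens the correlation between an item's effect under treatment and under control, which is the item-treatment interaction the simulations vary.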
 23. Experiments with effects
     ▪ Fit a probit random effects model to ads data from several small countries to
       estimate random effects for a realistic generative model
     ▪ Simulate varying treatment effect levels via a synthetic model with a fixed
       layout, with duplication similar to real-world duplication
     [Figure 5: Effects of item-treatment interaction effects on the true coverage of
     95% confidence intervals, for the user, item, and multiway bootstraps, as a
     function of the strength of the item-treatment interaction (1 - ρ_β), across
     σ_β ∈ {0.1, 0.3, 0.5, 1} with σ_α = 0.3. Decreasing ρ_β, which makes the random
     item effects less correlated between treatment and control, reduces true
     coverage.]
 24. Takeaways
     ▪ Not accounting for any dependence results in incorrect CIs
     ▪ The bootstrap provides a simple way to account for dependence
     ▪ Bootstrapping on the units being randomized over (e.g. users) is sufficient to
       test the (very narrow) sharp null
     ▪ When experiments have effects, not using multiway CIs can result in
       anti-conservative estimates
 25. Thanks
     ▪ Coauthor: Dean Eckles
     ▪ Helpful comments: Dan Merl, Yaron Greif, Alex Deng, Daniel Ting, Wojtek Galuba,
       Art Owen
     ▪ For more details, see the paper at http://arxiv.org/abs/1304.7406