
# Implementing and analyzing online experiments

Randomized experiments are the gold standard for understanding and quantifying causal relationships. This talk is divided into two parts corresponding to before and after the experiment is run. In the first section, we discuss how to design and implement online experiments using PlanOut, an open-source toolkit for advanced online experimentation used at Facebook. We will show how basic “A/B tests”, within-subjects designs, as well as more sophisticated experiments can be implemented. In the second section, we cover methods to estimate causal quantities of interest and construct appropriate confidence intervals. Particular attention will be given to scalable methods suitable for “big data”, including working with weighted data and clustered bootstrapping.

Published in: Data & Analytics

## 1. IMPLEMENTING AND ANALYZING ONLINE EXPERIMENTS

Sean J. Taylor, 28 Jul 2015, Multithreaded Data
## 2. WHO AM I?

- Core Data Science Team at Facebook
- PhD from NYU in Information Systems
- Four academic papers employing online field experiments
- Teach and consult on experimental design at Facebook

http://seanjtaylor.com · http://github.com/seanjtaylor · http://facebook.com/seanjtaylor · @seanjtaylor
## 3. I ASSUME YOU KNOW

- Why causality matters
- A little bit of Python and R
- Basic statistics + linear regression
## 4. SIMPLEST POSSIBLE EXPERIMENT

| user_id | version | spent |
|---------|---------|-------|
| 123     | B       | $10   |
| 596     | A       | $0    |
| 456     | A       | $4    |
| 991     | B       | $9    |

```python
def get_version(user_id):
    if user_id % 2:
        return 'A'
    else:
        return 'B'
```

```r
> t.test(c(0, 4), c(10, 9))

        Welch Two Sample t-test

data:  c(0, 4) and c(10, 9)
t = -3.638, df = 1.1245, p-value = 0.1487
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -27.74338  12.74338
sample estimates:
mean of x mean of y
      2.0       9.5
```
## 5. FIN
## 6. COMMON PROBLEMS

- Type I errors from measuring too many effects
- Type II and M errors from lack of power
- Repeated use of the same population ("pollution")
- Type I errors from violation of the i.i.d. assumption
- Composing many changes into one experiment
## 7. POWER

or: THE ONLY WAY TO TRULY FAIL AT AN EXPERIMENT
or: THE SIZE OF YOUR CONFIDENCE INTERVALS
## 8. ERRORS

- Type I: thinking your metric changed when it didn't. We usually bound this at 1 or 5%.
- Type II: thinking your metric didn't change when it did. You can control this through better planning.
## 9. HOW TO MAKE TYPE I ERRORS

Measure a ton of metrics ($ spent, time spent, survey satisfaction), then find a subgroup it "works" on (male under 25, female under 25, male 25 and over, female 25 and over).
## 10. AVOID TYPE II ERRORS WITH POWER

1. Use enough subjects in your experiment.
2. Test a reasonably strong treatment.

Remember: you care about the difference.
## 11. POWER ANALYSIS

The first step in designing an experiment is to determine how much data you'll need to learn the answer to your question. Process:

- Set the smallest effect size you'd like to detect.
- Simulate your experiment 200 times at various sample sizes.
- Count the number of simulated experiments where you correctly reject the null of effect = 0.
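As a sketch, that simulation loop might look like this in plain Python. The numbers here are hypothetical (a $1 lift on spend with a standard deviation of 10), and a simple normal-approximation test stands in for whatever test you would actually run:

```python
import random
import statistics

random.seed(1)

def simulate_power(n_per_arm, effect=1.0, sd=10.0, sims=200, z_crit=1.96):
    """Fraction of simulated experiments that reject the null of effect = 0."""
    rejections = 0
    for _ in range(sims):
        control = [random.gauss(0.0, sd) for _ in range(n_per_arm)]
        treated = [random.gauss(effect, sd) for _ in range(n_per_arm)]
        diff = statistics.mean(treated) - statistics.mean(control)
        se = (statistics.variance(treated) / n_per_arm +
              statistics.variance(control) / n_per_arm) ** 0.5
        if abs(diff / se) > z_crit:  # reject the null
            rejections += 1
    return rejections / sims

# Power grows with sample size; pick the smallest n that clears your target.
for n in [100, 1000, 4000]:
    print(n, simulate_power(n))
```

With these hypothetical numbers, 100 subjects per arm is badly underpowered while 4,000 is comfortably powered, which is exactly the comparison the procedure is meant to surface.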
## 12. TYPE M ERRORS

- Magnitude error: reporting an effect size which is too large.
- Happens when your experiment is underpowered AND you only report the significant results.
## 13. IMPLEMENTATION
## 14. PLANOUT: KEY IDEAS

- An experiment is just a pseudo-random mapping from (user, context) → parameters, and is serializable.
- Persistent randomizations are implemented through hash functions; salts make experiments orthogonal.
- Always log exposures (parameter assignments) to improve precision and provide a randomization check.
- Namespaces create the ability to do sequential experiments on new blocks of users.

https://facebook.github.io/planout/
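To make the salted-hash idea concrete, here is a minimal sketch of the concept (this is illustrative only, not PlanOut's actual implementation): hashing a salt together with the unit gives a deterministic, persistent assignment, and changing the salt re-randomizes the same population:

```python
import hashlib

def assign(salt, unit, choices):
    # Deterministic "randomization": hash (salt, unit) and map it to an arm.
    digest = hashlib.sha1(f"{salt}.{unit}".encode()).hexdigest()
    return choices[int(digest, 16) % len(choices)]

# Persistent: the same unit always gets the same assignment.
assert assign('button_text', 212, ['A', 'B']) == assign('button_text', 212, ['A', 'B'])

# Orthogonal: a different salt (i.e. a different experiment) re-randomizes
# the same users, so assignments across experiments are uncorrelated.
users = range(10000)
a = [assign('exp1', u, [0, 1]) for u in users]
b = [assign('exp2', u, [0, 1]) for u in users]
agree = sum(x == y for x, y in zip(a, b)) / len(users)
print(agree)  # close to 0.5: the two experiments are independent
```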
## 15. A/B TESTING IN PLANOUT

```python
from planout.ops.random import *
from planout.experiment import SimpleExperiment

class ButtonCopyExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        # `params` is always the first argument.
        params.button_text = UniformChoice(
            choices=["Buy now!", "Buy later!"],
            unit=user_id
        )

# Later in your production code:
from myexperiments import ButtonCopyExperiment

e = ButtonCopyExperiment(user_id=212)
print(e.get('button_text'))

# Even later:
e = ButtonCopyExperiment(user_id=212)
e.log_event('purchase', {'amount': 9.43})
```
## 16. PLANOUT LOGS → DATA

Exposure event:

```json
{"inputs": {"user_id": 212},
 "name": "ButtonCopyExperiment",
 "checksum": "646e69a5",
 "params": {"button_text": "Buy later!"},
 "time": 1437952369,
 "salt": "ButtonCopyExperiment",
 "event": "exposure"}
```

Purchase event:

```json
{"inputs": {"user_id": 212},
 "name": "ButtonCopyExperiment",
 "checksum": "646e69a5",
 "params": {"button_text": "Buy later!"},
 "time": 1437952369,
 "extra_data": {"amount": 9.43},
 "salt": "ButtonCopyExperiment",
 "event": "purchase"}
```

Exposures:

| user_id | button_text |
|---------|-------------|
| 123     | Buy later!  |
| 596     | Buy later!  |
| 456     | Buy now!    |
| 991     | Buy later!  |

Purchases:

| user_id | amount |
|---------|--------|
| 123     | $12    |
| 596     | $9     |
## 17. ADVANCED DESIGN 1: FACTORIAL DESIGN

Can use conditional logic as well as other random assignment operators: RandomInteger, RandomFloat, WeightedChoice, Sample.

```python
class FactorialExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        params.button_text = UniformChoice(
            choices=["Buy now!", "Buy later!"],
            unit=user_id
        )
        params.button_color = UniformChoice(
            choices=["blue", "orange"],
            unit=user_id
        )
```
## 18. ADVANCED DESIGN 2: INCREMENTAL CHANGES

```python
## We're going to try two different button redesigns.
class FirstExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        pass  # ... set some params

class SecondExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        pass  # ... set some params differently

class ButtonNamespace(SimpleNamespace):
    def setup(self):
        self.name = 'button_experiment_sequence'
        self.primary_unit = 'user_id'
        self.num_segments = 1000

    def setup_experiments(self):
        # Allocate and deallocate experiments here.
        # First gets 100 out of 1000 segments.
        self.add_experiment('first', FirstExperiment, 100)
        self.add_experiment('second', SecondExperiment, 100)
```
## 19. ADVANCED DESIGN 3: WITHIN-SUBJECTS

The previous experiments persistently assigned the same treatment to each user, but the unit of analysis can be more complex:

```python
class DiscountExperiment(SimpleExperiment):
    def assign(self, params, user_id, item_id):
        params.discount = BernoulliTrial(p=0.1, unit=[user_id, item_id])
        if params.discount:
            params.discount_amount = RandomInteger(
                min=5, max=15, unit=user_id
            )
        else:
            params.discount_amount = 0

e = DiscountExperiment(user_id=212, item_id=2)
print(e.get('discount_amount'))
```
## 20. ANALYSIS
## 21. THE IDEAL DATA SET

| Subject / User | Gender | Age | Button Size | Button Text | Spent | Bounce |
|----------------|--------|-----|-------------|-------------|-------|--------|
| Erin           | F      | 22  | Large       | Buy Now!    | $20   | 0      |
| Ashley         | F      | 29  | Large       | Buy Later!  | $4    | 0      |
| Gary           | M      | 34  | Small       | Buy Now!    | $0    | 1      |
| Leo            | M      | 18  | Large       | Buy Now!    | $0    | 1      |
| Ed             | M      | 46  | Small       | Buy Later!  | $9    | 0      |
| Sam            | M      | 25  | Small       | Buy Now!    | $5    | 0      |

Rows are independent observations; Gender and Age are pre-experiment covariates; Button Size and Button Text are randomly assigned; Spent and Bounce are metrics.
## 22. SIMPLEST CASE: OLS

```r
> summary(lm(spent ~ button.size, data = df))

Call:
lm(formula = spent ~ button.size, data = df)

Residuals:
    1     2     3     4     5     6
 10.0  -0.5  -4.5 -10.0   4.5   0.5

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             10.000      5.489   1.822    0.143
factor(button.size)s    -5.500      6.722  -0.818    0.459

Residual standard error: 7.762 on 4 degrees of freedom
Multiple R-squared:  0.1434,  Adjusted R-squared:  -0.07079
F-statistic: 0.6694 on 1 and 4 DF,  p-value: 0.4592
```
## 23. DATA REDUCTION

Row-level data:

| Subject | Xi | Di | Yi |
|---------|----|----|----|
| Evan    | M  | 0  | 1  |
| Ashley  | F  | 0  | 1  |
| Greg    | M  | 1  | 0  |
| Leena   | F  | 1  | 0  |
| Ema     | F  | 0  | 0  |
| Seamus  | M  | 1  | 1  |

Reduced data:

| X | D | Y | Cases |
|---|---|---|-------|
| M | 0 | 1 | 1     |
| M | 1 | 1 | 1     |
| F | 0 | 1 | 1     |
| F | 1 | 1 | 0     |
| M | 0 | 0 | 0     |
| M | 1 | 0 | 1     |
| F | 0 | 0 | 1     |
| F | 1 | 0 | 1     |

N rows reduce to at most (# treatments × # groups × # outcomes) rows.
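The reduction is just a group-by count. A plain-Python sketch, using the same six hypothetical subjects as the slide (`Counter` drops the zero-count cells, but the full reduced table never exceeds 2 × 2 × 2 = 8 rows here no matter how large N gets):

```python
from collections import Counter

# Hypothetical row-level data: (covariate X, treatment D, outcome Y).
rows = [
    ('M', 0, 1),  # Evan
    ('F', 0, 1),  # Ashley
    ('M', 1, 0),  # Greg
    ('F', 1, 0),  # Leena
    ('F', 0, 0),  # Ema
    ('M', 1, 1),  # Seamus
]

# Collapse N rows into one weighted row per (X, D, Y) cell.
reduced = Counter(rows)
for (x, d, y), n in sorted(reduced.items()):
    print(x, d, y, n)
```

Downstream models then use the counts as weights, as the weighted-OLS slide below shows in R.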
## 24. DATA REDUCTION + WEIGHTED OLS

```r
source('css_stats.R')

reduced <- df %>%
  mutate(rounded.spent = round(spent, 0)) %>%
  group_by(button.size, rounded.spent) %>%
  summarise(n = n())

> lm(rounded.spent ~ button.size, data = reduced, weights = n) %>%
+   coeftest(vcov = sandwich.lm)

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)
(Intercept)   7.43137    0.45162 16.4548 7.522e-14 ***
button.sizes -2.45178    0.59032 -4.1533 0.0004149 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
## 25. FACTORIAL DESIGNS

Identify two types of effects: marginal effects and interactions. Need to fix one group as the baseline.

```r
> coeftest(lm(spent ~ button.size * button.text, data = df))

t test of coefficients:

                          Estimate Std. Error t value  Pr(>|t|)
(Intercept)                6.79643    0.62998 10.7884 < 2.2e-16 ***
button.sizes              -2.43253    0.86673 -2.8066  0.006064 **
button.textn               2.11611    0.86673  2.4415  0.016458 *
button.sizes:button.textn -2.57660    1.27584 -2.0195  0.046219 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
## 26. USING COVARIATES TO GAIN PRECISION

- With simple random assignment, using covariates is not necessary.
- However, you can improve the precision of ATE estimates if covariates explain a lot of variation in the potential outcomes.
- Covariates can be added to a linear model, and the standard errors should get smaller if they are helpful.
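A small simulation makes the precision gain visible. This sketch uses hypothetical data (a pre-experiment covariate that drives most of the outcome variation), and residualizing the outcome on the centered covariate stands in for adding it to the regression:

```python
import random
import statistics

random.seed(0)
n = 2000
x = [random.gauss(50, 10) for _ in range(n)]   # hypothetical pre-experiment covariate
d = [i % 2 for i in range(n)]                  # random assignment
# Outcome: mostly driven by the covariate, plus a true treatment effect of 2.
y = [0.8 * xi + 2.0 * di + random.gauss(0, 5) for xi, di in zip(x, d)]

def diff_in_means(y, d):
    y1 = [yi for yi, di in zip(y, d) if di == 1]
    y0 = [yi for yi, di in zip(y, d) if di == 0]
    est = statistics.mean(y1) - statistics.mean(y0)
    se = (statistics.variance(y1) / len(y1) +
          statistics.variance(y0) / len(y0)) ** 0.5
    return est, se

est_raw, se_raw = diff_in_means(y, d)

# Residualize the outcome on the centered covariate, then re-estimate.
xbar, ybar = statistics.mean(x), statistics.mean(y)
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) /
     sum((xi - xbar) ** 2 for xi in x))
resid = [yi - b * (xi - xbar) for xi, yi in zip(x, y)]
est_adj, se_adj = diff_in_means(resid, d)

print(se_raw, se_adj)  # the covariate-adjusted SE is much smaller
```

Both estimators target the same ATE; only the standard error shrinks, because the covariate soaks up outcome variance that would otherwise be noise.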
## 27. NON-IID DATA

- Repeated observations of the same user are not independent.
- Ditto if you're experimenting on certain items only.
- If you ignore the dependence, the true confidence intervals are larger than you think.

| Subject / User | Item  | Button Size | Spent |
|----------------|-------|-------------|-------|
| Erin           | Shirt | Large       | $20   |
| Erin           | Socks | Large       | $4    |
| Erin           | Pants | Large       | $0    |
| Leo            | Shirt | Large       | $0    |
| Ed             | Shirt | Small       | $9    |
| Ed             | Socks | Small       | $5    |
## 28. THE BOOTSTRAP

- Generate random sub-samples R1, R2, …, R500 from all your data.
- Compute statistics or estimate model parameters on each, giving s1, s2, …, s500.
- Get a distribution over the statistic of interest (e.g. the treatment effect):
  - point estimate: take the mean
  - CIs: the 95% quantiles
  - SEs: the standard deviation
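Those three steps can be sketched as a percentile bootstrap in plain Python, on hypothetical per-user spend data:

```python
import random
import statistics

random.seed(2)
# Hypothetical per-user spend: skewed, as spend data usually is.
data = [random.expovariate(0.2) for _ in range(500)]

# Step 1 + 2: 500 resamples (with replacement), statistic computed on each.
replicates = []
for _ in range(500):
    sample = [random.choice(data) for _ in range(len(data))]
    replicates.append(statistics.mean(sample))

# Step 3: summarize the distribution of the statistic.
replicates.sort()
est = statistics.mean(replicates)        # point estimate
ci = (replicates[12], replicates[487])   # ~95% CI from the 2.5%/97.5% quantiles
se = statistics.stdev(replicates)        # bootstrap SE
print(est, ci, se)
```

The same recipe works for any statistic, including a regression coefficient for the treatment effect, which is what the R code on the next slide does.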
## 29. USER AND USER-ITEM BOOTSTRAPS

```r
source('css_stats.R')
library(broom)  ## for extracting model coefficients

fitter <- function(.data) {
  lm(spent ~ button.size, data = .data, weights = .weights) %>%
    tidy
}

iid.replicates    <- iid.bootstrap(df, fitter, .combine = bind_rows)
oneway.replicates <- clustered.bootstrap(df, c('user_id'), fitter, .combine = bind_rows)
twoway.replicates <- clustered.bootstrap(df, c('user_id', 'item_id'), fitter, .combine = bind_rows)

> head(iid.replicates)
          term   estimate  std.error statistic      p.value
1  (Intercept)  0.4700000 0.04795154  9.801561 6.296919e-18
2 button.sizes -0.2200000 0.08003333 -2.748855 6.695621e-03
3  (Intercept)  0.4250000 0.05307832  8.007036 5.768641e-13
4 button.sizes -0.1750000 0.08456729 -2.069358 4.049329e-02
5  (Intercept)  0.4137931 0.05141050  8.048805 4.118301e-13
6 button.sizes -0.1429598 0.08621804 -1.658119 9.965016e-02
```
## 30. DEPENDENCE CHANGES CONFIDENCE INTERVALS
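A small simulation illustrates how much dependence matters. This is a plain-Python sketch on hypothetical data (not the `css_stats.R` implementation): users share a user-level shock across their rows, and resampling whole users (clusters) yields a larger, more honest SE than resampling rows as if they were i.i.d.:

```python
import random
import statistics
from collections import defaultdict

random.seed(3)
# Hypothetical user-item data: 200 users x 5 rows each; rows from the same
# user share a user-level shock, so they are not independent.
user_effect = {u: random.gauss(0, 2) for u in range(200)}
rows = [(u, user_effect[u] + random.gauss(0, 1))
        for u in user_effect for _ in range(5)]

by_user = defaultdict(list)
for u, y in rows:
    by_user[u].append(y)

def boot_se(clustered, reps=300):
    """Bootstrap SE of the mean, resampling rows or whole users."""
    means = []
    ids = list(by_user)
    for _ in range(reps):
        if clustered:
            # Resample users, keeping all of each user's rows together.
            draw = [y for u in random.choices(ids, k=len(ids))
                    for y in by_user[u]]
        else:
            draw = random.choices([y for _, y in rows], k=len(rows))
        means.append(statistics.mean(draw))
    return statistics.stdev(means)

print(boot_se(False), boot_se(True))  # the i.i.d. SE understates the clustered SE
```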
## 31. DATA REDUCTION WITH DEPENDENT DATA

| Subject | Di | Yij |
|---------|----|-----|
| Evan    | 1  | 1   |
| Evan    | 1  | 0   |
| Ashley  | 0  | 1   |
| Ashley  | 0  | 1   |
| Ashley  | 0  | 1   |
| Greg    | 1  | 0   |
| Leena   | 1  | 0   |
| Leena   | 1  | 1   |
| Ema     | 0  | 0   |
| Seamus  | 1  | 1   |

Process: create bootstrap replicates R1, R2, R3, …; reduce each replicate as if it were i.i.d., giving r1, r2, r3, …; then compute statistics s1, s2, s3, … on the reduced data.
## 32. THANKS! HERE ARE SOME RESOURCES

- Me: http://seanjtaylor.com
- These slides: http://www.slideshare.net/seanjtaylor/implementing-and-analyzing-online-experiments
- Full-featured tutorial: http://eytan.github.io/www-15-tutorial/
- "Field Experiments" by Gerber and Green