Overview
• Experimental Design
• Data Analysis
• Power Calculations
• Sampling
Most Important
• You need a control group (C) and a treatment group (T).
• Change only one thing at a time in your treatment groups.
• Answer your question: what needs to change?
• Calculate your sample size for T and C (power) before you begin the experiment.
• Individuals should be randomly assigned.
Why a control group?
What's causing a change in user behavior?
• Your experiment?
• Or external events?
A control group is exposed to the same external events but not to your experiment, so it lets you tell the two apart.
Experimental Design
What's your unit? A person or a group?
• If the treatment spills over into the control group, you'll need to randomize at a higher clustering level than the individual user.
What's your design?
• 1 control, 1 treatment
• 1 control, multiple treatments
• One period or multiple periods?
• "Ethical" designs: roll-in, within-group, and encouragement
Think Ahead: Analysis
Either you'll have a one-time treatment:
• Cross section: Δ(T − C), the difference between the treatment and control groups
Or the treatment(s) will last over time:
• Panel: ΔT − ΔC, the change in the treatment group minus the change in the control group
Which one you have changes your analysis.
Here Δ is a difference in the average (or other statistic) of your outcome variable of interest.
If it's not the average, then things are just a tad more complicated (as variances and medians are not Gaussian).
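The two estimates above can be sketched in a few lines of Python. This is only an illustration using averages: the function names and the sales figures are invented for the example and are not from the original slides.

```python
from statistics import mean

def cross_section_effect(treated, control):
    """One-time treatment: mean outcome in T minus mean outcome in C."""
    return mean(treated) - mean(control)

def panel_effect(t_before, t_after, c_before, c_after):
    """Treatment over time: (change in T) minus (change in C)."""
    delta_t = mean(t_after) - mean(t_before)
    delta_c = mean(c_after) - mean(c_before)
    return delta_t - delta_c

# Hypothetical sales per user, invented for illustration only.
t_before, t_after = [12, 14, 13], [15, 16, 14]
c_before, c_after = [10, 11, 12], [12, 12, 12]

print("cross section:", cross_section_effect(t_after, c_after))
print("panel:", panel_effect(t_before, t_after, c_before, c_after))
```

In this made-up data the treatment group already started out higher than the control group, so the cross-section estimate is larger than the panel estimate, which nets out the starting gap.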
How many users? -> Power
The standard sample-size formula for comparing two means:
$$ n = \frac{2\sigma^{2}\,(z_{\alpha/2} + z_{\beta})^{2}}{\delta^{2}} $$
• n: sample size in each group (assumes equal-sized groups)
• σ: standard deviation of the outcome variable, e.g. sales
• z_{α/2}: represents the desired level of statistical significance (typically 1.96)
• z_β: represents the desired power (typically 0.84 for 80% power)
• δ: effect size (the difference in means, μ1 − μ0), e.g. change in sales
Then check a tool (e.g. http://www.statisticalsolutions.net/pss_calc.php).
Then use a stats tool for bells and whistles (R, Stata, SPSS, Matlab): R (pwr), Stata (sampsi, sampclus).
To maximize power, set the ratio of the sample sizes equal to the ratio of the standard deviations of the outcomes (List et al., 2008).
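As a rough sketch, the per-group sample size can be computed with the Python standard library alone. It assumes the usual two-sample formula n = 2σ²(z_{α/2} + z_β)² / δ², which matches the terms described above. The function names and the example numbers are illustrative assumptions, not part of the original slides.

```python
import math
from statistics import NormalDist

def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n for comparing two means with equal-sized groups.

    sigma: standard deviation of the outcome (e.g. sales)
    delta: smallest difference in means you want to detect
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = 2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2
    return math.ceil(n)                            # round up to whole users

def split_by_sd(total, sd_t, sd_c):
    """Split a total sample so that n_T / n_C equals sd_T / sd_C."""
    n_t = round(total * sd_t / (sd_t + sd_c))
    return n_t, total - n_t

# Hypothetical inputs: outcome SD of 10, detect a difference of 2.
print(sample_size_per_group(sigma=10, delta=2))
print(split_by_sd(total=300, sd_t=20, sd_c=10))
```

A dedicated package (R's pwr, Stata's sampsi) is still the better choice for anything beyond this simple case, such as clustered designs.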
Factors Affecting Power
1. Size of the effect
2. Standard deviation of the characteristic
3. Bigger sample size
4. Significance level desired
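To see how each of the four factors moves power, here is a minimal sketch using the normal approximation for a two-sided comparison of two means with equal-sized groups. The function name and all input values are assumptions made for illustration, and the small probability of rejecting in the wrong direction is ignored.

```python
import math
from statistics import NormalDist

def power_two_sample(n, sigma, delta, alpha=0.05):
    """Approximate power with n users per group (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * math.sqrt(2 / n)  # standard error of the difference in means
    return NormalDist().cdf(abs(delta) / se - z_alpha)

base = dict(n=400, sigma=10, delta=2, alpha=0.05)  # hypothetical baseline
print("baseline         ", round(power_two_sample(**base), 3))
print("bigger effect    ", round(power_two_sample(**{**base, "delta": 3}), 3))
print("noisier outcome  ", round(power_two_sample(**{**base, "sigma": 15}), 3))
print("bigger sample    ", round(power_two_sample(**{**base, "n": 800}), 3))
print("stricter alpha   ", round(power_two_sample(**{**base, "alpha": 0.01}), 3))
```

Power rises with a bigger effect or a bigger sample, and falls with a noisier outcome or a stricter significance level.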
Power and A/B Testing
A/B testing tools for Ruby and Python often run experiments without a pre-specified sample size.
Issues with that:
• When do you stop the experiment if you find no statistically significant effect between T and C? Just wait and wait?
• No. As the sample size increases, the standard error of the estimated difference shrinks, and you'll have the power to detect very small differences between μ0 and μ1, including differences too small to matter in practice.
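One way to make this concrete is to compute the smallest difference an experiment can reliably detect as the sample keeps growing. The sketch below rearranges the standard two-sample sample-size formula for the effect size. The function name and the numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist

def minimum_detectable_effect(n, sigma, alpha=0.05, power=0.80):
    """Smallest difference in means detectable with n users per group."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * sigma * math.sqrt(2 / n)

# Hypothetical outcome with a standard deviation of 10.
for n in (100, 10_000, 1_000_000):
    print(n, round(minimum_detectable_effect(n, sigma=10), 3))
```

Each hundredfold increase in the sample size cuts the detectable difference by a factor of ten, so an experiment left running indefinitely will eventually flag differences of no practical importance.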
Sampling
So we figured out how many people to randomly select into T (N1) and C (N2). Now what?
• Randomly assign N1 individuals into T and N2 individuals into C.
• Tools: R (pwr), Stata (bsample)
• You might want to stratify by gender, user type, etc., which would require a discussion on sampling weights.
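Simple random assignment can be sketched with the Python standard library. This is a minimal, unstratified example; the function name, the user IDs, and the group sizes are hypothetical.

```python
import random

def assign_groups(user_ids, n_treatment, n_control, seed=None):
    """Randomly draw N1 users for T and N2 users for C, without overlap."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    chosen = rng.sample(list(user_ids), n_treatment + n_control)
    # rng.sample returns users in random order, so slicing keeps it random.
    return chosen[:n_treatment], chosen[n_treatment:]

# Hypothetical population of 1,000 user IDs.
treatment, control = assign_groups(range(1000), n_treatment=50, n_control=50, seed=42)
print(len(treatment), len(control))
```

Drawing both groups from a single sample guarantees that no user ends up in both T and C. Recording the seed lets you reproduce the assignment later.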
Some References
• Cochran (1977) Sampling Techniques (everyone should read)
• List et al. (2008) So you want to run an experiment, now what? Some simple rules of thumb for optimal experimental design
• Levy & Lemeshow (1991) Sampling of Populations: Methods and Applications
• Valliant, Dever & Kreuter (2010) Practical Tools for Designing and Weighting Survey Samples
• Brady West (http://www-personal.umich.edu/~bwest/)
• I follow Andrew Gelman on the blogworld, who has a note (http://www.stat.columbia.edu/~gelman/stuff_for_blog/chap20.pdf) which may be useful.