A/B test with problematic data
Ben Paul
May 20, 2015
Data science exercise to analyze A/B test results and address problematic data

Instructions: https://github.com/benspaul/acme/blob/gh-pages/readme.pdf

Repo: https://github.com/benspaul/acme/

Background

• It has previously been shown that user experience on our site is better if users first answer a few questions about their preferences.
• We are testing a new landing page to determine if it will cause more users to answer at least one question about their preferences.
• If the new landing page causes any statistically significant increase in conversion rate (percentage of users who complete at least one question), then it will be considered a success.

Hypotheses

• The new landing page will cause a statistically significant increase in conversion rate.

Method

• Randomly assign 50% of users to a control group that will be shown the old landing page and the other 50% of users to a treatment group that will be shown the new landing page.
• Track whether each user answers at least one question or not.
• Run a z-test to determine if the treatment group had a greater conversion rate than the control group, with the conventional cutoff for statistical significance of p < 0.05, two-tailed.

Analysis

Set up environment

library("plyr")
library("dplyr", warn.conflicts = FALSE) # I'm aware of the plyr/dplyr conflicts
library("scales")
knitr::opts_chunk$set(comment = NA) # remove hashes in output

Read data

dat <- read.csv("data/takehome.csv")
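As a quick sanity check on the 50/50 random assignment described in Method, the realized split can be tabulated right after loading. This is a minimal sketch using only the ab column inspected in the next section; any sizable imbalance would be worth investigating during cleaning.

# proportion of rows in each condition; should be close to 50/50 if assignment worked as designed
prop.table(table(dat$ab))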
Clean data

Handle data types

Check that data types are appropriate.

summary(dat); str(dat);

    user_id                ts                    ab
 Min.   :2.325e+04   Min.   :1.357e+09   control  : 90815
 1st Qu.:2.488e+09   1st Qu.:1.357e+09   treatment:100333
 Median :4.997e+09   Median :1.357e+09
 Mean   :4.998e+09   Mean   :1.357e+09
 3rd Qu.:7.508e+09   3rd Qu.:1.357e+09
 Max.   :1.000e+10   Max.   :1.357e+09
   landing_page     converted
 new_page:95574   Min.   :0.0000
 old_page:95574   1st Qu.:0.0000
                  Median :0.0000
                  Mean   :0.1011
                  3rd Qu.:0.0000
                  Max.   :1.0000
'data.frame':   191148 obs. of  5 variables:
 $ user_id     : num  9.64e+09 2.46e+09 9.67e+09 2.25e+09 7.81e+09 ...
 $ ts          : num  1.36e+09 1.36e+09 1.36e+09 1.36e+09 1.36e+09 ...
 $ ab          : Factor w/ 2 levels "control","treatment": 2 2 1 2 1 1 1 2 2 1 ...
 $ landing_page: Factor w/ 2 levels "new_page","old_page": 1 1 2 1 2 2 2 1 2 2 ...
 $ converted   : int  0 0 0 0 0 1 1 0 0 0 ...

Data types appear to be appropriate. The independent variables "ab" and "landing_page" each have two levels, corresponding to the control condition ("control"/"old_page") and the treatment condition ("treatment"/"new_page"). The dependent variable "converted" is an integer with just two possible values representing whether the user answered at least one question (1) or not (0). Let's ensure that it has no other values:

unique(dat$converted)

[1] 0 1

The dependent variable has no other values besides 0 and 1, so no cleaning is required. In summary, there are no problematic data types or values apparent from initial inspection.

Handle duplicates

The documentation indicated that each user should be assigned to just one condition, either the control group (ab = "control"), which was shown the old landing page (landing_page = "old_page"), or the treatment group (ab = "treatment"), which was shown the new landing page (landing_page = "new_page"). Therefore, each user_id should have just one row in the data set, with information about the one condition they were assigned as well as the one landing page they were shown. If any user has more than one row, something may have gone wrong and we will need to explore the data to determine how to handle it. Let's start by determining if this is an issue.
# find user_ids with multiple rows
dat$multi_obs <- (duplicated(dat$user_id) | duplicated(dat$user_id, fromLast = TRUE))

# print the number of rows with this issue
dat[dat$multi_obs, ] %>% nrow

[1] 9528

# print the percentage of rows that have this issue
percent((dat[dat$multi_obs, ] %>% nrow) / (dat %>% nrow))

[1] "4.98%"

These calculations show that some users do have multiple rows. These multi-observation users account for 9,528 observations, or 5% of all observations. This is concerning. To understand this issue more fully, the next step will be to visually inspect a sample of multi-observation users' data.

# print a sample of multi-observation users' data
dat[dat$multi_obs, ] %>%
  arrange(user_id, ts) %>% # show each user's data chronologically
  head(30) %>%
  mutate(
    # convert timestamps to human-readable form
    ts = ts %>% as.POSIXct(origin = "1970-01-01", tz = "GMT")
  )

    user_id                  ts        ab landing_page converted multi_obs
1    203042 2013-01-01 02:56:48 treatment     new_page         0      TRUE
2    203042 2013-01-01 02:56:49 treatment     old_page         1      TRUE
3   2394489 2013-01-01 11:23:54 treatment     new_page         0      TRUE
4   2394489 2013-01-01 11:23:55 treatment     old_page         1      TRUE
5   2695427 2013-01-01 18:37:58 treatment     new_page         0      TRUE
6   2695427 2013-01-01 18:37:59 treatment     old_page         0      TRUE
7   3789396 2013-01-01 01:05:13 treatment     new_page         0      TRUE
8   3789396 2013-01-01 01:05:14 treatment     old_page         0      TRUE
9   6213582 2013-01-01 12:43:13 treatment     new_page         0      TRUE
10  6213582 2013-01-01 12:43:14 treatment     old_page         0      TRUE
11  7647078 2013-01-01 20:04:34 treatment     new_page         0      TRUE
12  7647078 2013-01-01 20:04:35 treatment     old_page         1      TRUE
13 11584819 2013-01-01 12:53:41 treatment     new_page         0      TRUE
14 11584819 2013-01-01 12:53:42 treatment     old_page         0      TRUE
15 11803291 2013-01-01 21:33:00 treatment     new_page         0      TRUE
16 11803291 2013-01-01 21:33:01 treatment     old_page         0      TRUE
17 22522327 2013-01-01 12:45:08 treatment     new_page         0      TRUE
18 22522327 2013-01-01 12:45:09 treatment     old_page         0      TRUE
19 22577434 2013-01-01 06:13:05 treatment     new_page         0      TRUE
20 22577434 2013-01-01 06:13:06 treatment     old_page         0      TRUE
21 24144768 2013-01-01 21:42:04 treatment     new_page         0      TRUE
22 24144768 2013-01-01 21:42:05 treatment     old_page         0      TRUE
23 25758261 2013-01-01 14:52:11 treatment     new_page         0      TRUE
24 25758261 2013-01-01 14:52:12 treatment     old_page         0      TRUE
25 29616796 2013-01-01 02:17:18 treatment     new_page         0      TRUE
26 29616796 2013-01-01 02:17:19 treatment     old_page         0      TRUE
27 32617932 2013-01-01 21:50:20 treatment     new_page         0      TRUE
28 32617932 2013-01-01 21:50:21 treatment     old_page         1      TRUE
29 32786569 2013-01-01 07:48:23 treatment     new_page         0      TRUE
30 32786569 2013-01-01 07:48:24 treatment     old_page         1      TRUE

In this sample of multi-observation users, it appears that such users see the new page first and then land on the old page one second later. Inspection of all multi-observation user data verified this (a sketch of that check appears below). Inspection of this sample also raised the question of whether multi-observation users are primarily in the treatment group. Analysis of all multi-observation user data (below) confirmed that 99.9% of multi-observation users were assigned to the treatment group, and therefore should have been shown only the new page. However, what actually happened is that multi-observation users saw the new page for one second before ultimately landing on the old page, which was intended for the control group. This behavior does not match the intended experimental design. The sample data also suggest that multi-observation users never convert on the new page, which would make sense since it was shown for just one second before they landed on the old page. Analysis of all multi-observation user data (below) confirmed that none of these users converted on the new page.

# calculate percentage of multi-observation users assigned only to the treatment group
multi_summary <- dat[dat$multi_obs, ] %>%
  group_by(user_id) %>%
  summarize(all_treatment = as.numeric(all(ab == "treatment"))) # 1 if all of the user's rows are "treatment"
percent(sum(multi_summary$all_treatment) / nrow(multi_summary))

[1] "99.9%"

# count number of times multi-observation users converted on the new page
dat[dat$multi_obs, ] %>%
  filter(landing_page == "new_page", converted == 1) %>%
  nrow

[1] 0

The calculations above demonstrate that, as previously discussed, 99.9% of multi-observation users were in the treatment group, but none of them converted from the new landing page. It would be possible to correct such users' data by changing their label from "treatment" to "control" and by removing the data from when they loaded the new page for a second. However, their responses may have been influenced by a glitch in the website, which would not be generalizable to the wider audience for which these changes are intended. In addition, they were not exposed to the experimental design as intended. Therefore, their data would be difficult to interpret and should be removed altogether.

Note that the decision to remove their data entirely would be defensible only if multi-observation users represented a random subset of the population under test. If multi-observation users represent a non-random subset (e.g., people who use Internet Explorer), it would not be wise to delete their data, as it would limit the generalizability of the results (e.g., results would then only apply to people who don't use Internet Explorer). Therefore, if the glitch affected a non-random subset of users, I would advise running more users through the study after fixing the glitch. For the sake of this assignment, I will assume this is due to a random glitch and we can remove their data.
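For reference, the claim that every multi-observation user saw the new page first and then the old page one second later can be checked across all such users, not just the sample above. This is a minimal sketch using dplyr on the columns already in the data; the object name order_check is illustrative, and values near 1 would confirm the pattern described above.

# for each multi-observation user, check whether the earlier row is the new page,
# the later row is the old page, and the two rows are one second apart
order_check <- dat[dat$multi_obs, ] %>%
  arrange(user_id, ts) %>%
  group_by(user_id) %>%
  summarize(
    saw_new_then_old = n() == 2 &&
      first(landing_page) == "new_page" &&
      last(landing_page) == "old_page",
    one_second_apart = n() == 2 && (last(ts) - first(ts) == 1)
  )
mean(order_check$saw_new_then_old) # proportion of multi-observation users following the pattern
mean(order_check$one_second_apart)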
dat <- dat[!dat$multi_obs, ]

Check for further experimental errors

As previously mentioned, users in the control group should only see the old page, and users in the treatment group should only see the new page. Therefore, after we removed users with multiple observations, if there are still any users left that saw the wrong page given their condition, we will need to decide how to handle them.

# check that treatment and control groups saw their corresponding pages
table(dat$ab, dat$landing_page)

            new_page old_page
  control          0    90809
  treatment    90811        0

The table indicates that we have fully removed the problematic users; each condition is now associated with the correct landing page.

Analyze data

Now that the data has been cleaned, we can conduct a z-test to determine if there was an effect of experimental condition on conversion rate.

tbl <- table(dat$ab, dat$converted)
res <- tbl %>% prop.test # aka z-test
names(res$estimate) <- c("control", "treatment") # make results readable

# invert point estimates to show conversion rate rather than non-conversion rate
rates <- (1 - res$estimate)

# confidence interval of the difference between groups
diff.conf.int <- res$conf.int

# to help with interpretation, also calculate the conversion rate confidence interval for each group separately
control.conf.int <- prop.test(tbl["control", "1"], sum(tbl["control", ])) %>% .$conf.int
treatment.conf.int <- prop.test(tbl["treatment", "1"], sum(tbl["treatment", ])) %>% .$conf.int
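On a 2×2 table, prop.test actually runs a chi-squared test, which, without the default continuity correction, is equivalent to the two-sided two-proportion z-test described in the Method (the z statistic squared equals the chi-squared statistic). As an optional cross-check, the z statistic can be computed directly; this is a sketch that reuses the tbl object created above, and its p-value should match prop.test(tbl, correct = FALSE) rather than the continuity-corrected result reported in the next section.

# two-proportion z-test computed by hand as a cross-check
x <- tbl[, "1"]            # conversions in each group
n <- rowSums(tbl)          # users in each group
p_hat <- x / n             # observed conversion rates
p_pool <- sum(x) / sum(n)  # pooled conversion rate under the null hypothesis
z <- (p_hat["treatment"] - p_hat["control"]) /
  sqrt(p_pool * (1 - p_pool) * (1 / n["control"] + 1 / n["treatment"]))
2 * pnorm(-abs(unname(z))) # two-tailed p-value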
Results

Examine results.

control.conf.int %>% round(3) %>% percent

[1] "9.8%" "10.2%"

treatment.conf.int %>% round(3) %>% percent

[1] "10.5%" "10.9%"

rates %>% round(3) %>% sapply(percent)

  control treatment
    "10%"   "10.7%"

diff.conf.int %>% round(3) %>% percent

[1] "0.3%" "0.9%"

res["p.value"]

$p.value
[1] 1.104298e-05

The conversion rate of the old page is 10.0% (95% confidence interval, 9.8% - 10.2%). The conversion rate of the new page is 10.7% (95% confidence interval, 10.5% - 10.9%). The new page has a higher conversion rate than the old page (95% confidence interval of difference, 0.3% - 0.9%), p < 0.001. If the decision to remove the problematic users was correct, then we can say with 95% confidence that the new page's conversion rate is 0.3 to 0.9 percentage points higher than the old page's, which is roughly a 3% to 9% relative increase.

Discussion

Given the higher conversion rate of the new landing page, I would recommend we switch all users over to it and monitor whether the conversion rate increases as expected.

Regarding the discrepancy between our data and the third party's data, I believe our data is more accurate because we have cleaned problematic observations from it. There is no reason to believe that the third party cleaned the data, although I would contact them to confirm this. I would explain the discrepancy to the project manager by stating that some people were mislabeled as having seen the new page, when really they saw the old page. Acme's system isn't set up to catch these problems, but as a result of her request we were able to find and delete the bad data, uncovering the significant results that she suspected were there all along.

To protect future experiments, it would be important to understand why these glitches occurred. Therefore, I would discuss the issue with developers and quality assurance analysts and try to reproduce the problematic behavior. If I'm not able to, I would offer an incentive to anyone in the company who could. (This strategy has been successful for me in my current company: employees will actually race to reproduce an issue to earn a gold star.) Once the conditions for reproduction are identified, we can determine how to prevent this glitch in the future.

I would also suggest we set up monitoring in similar experiments to ensure that these problematic conditions don't occur again. In particular, (a) each user should have just one observation, and (b) each experimental condition should be associated with the expected behavior (e.g., the treatment condition should be associated with only the new page and the control condition should be associated with only the old page). A first step would be to set up a daily email indicating whether (a) and (b) are satisfied (a sketch of such a check appears below). As we grow more confident in the system, we could have it email us only if (a) and (b) are not satisfied. Whenever problems arise, we should analyze what went wrong, explore whether we need to delete or correct the relevant data, and continue to implement more safeguards to prevent similar problems in the future.
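As a rough illustration of that monitoring step, the check below is a minimal sketch, assuming future experiment extracts arrive with the same user_id, ab, and landing_page columns as this data set; the function name check_experiment_health and the named logical output are illustrative, not an existing tool.

# sketch: TRUE/FALSE for the two monitoring conditions described above
check_experiment_health <- function(d) {
  # (a) each user should have exactly one observation
  one_obs_per_user <- !any(duplicated(d$user_id))
  # (b) each condition should be paired only with its intended landing page
  pairing <- table(d$ab, d$landing_page)
  correct_pairing <- pairing["control", "new_page"] == 0 &&
    pairing["treatment", "old_page"] == 0
  c(one_obs_per_user = one_obs_per_user, correct_pairing = correct_pairing)
}

check_experiment_health(dat) # on the cleaned data, both checks should be TRUE

A scheduled job could run this daily and email the result (or, later, email only when a check fails).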
