Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Making Causal Claims as a Data Scientist: Tips and Tricks Using R

1,090 views

Published on

Making believable causal claims can be difficult, especially with the much repeated adage “correlation is not causation”. This talk will walk through some tools often used to practice safe causation, such as propensity scores and sensitivity analyses. In addition, we will cover principles that suggest causation such as the understanding of counterfactuals, and applying Hill’s criteria in a data science setting. We will walk through specific examples, as well as provide R code for all methods discussed.

Published in: Data & Analytics
  • Be the first to comment

Making Causal Claims as a Data Scientist: Tips and Tricks Using R

  1. 1. D A T A D A Y T E X A S 2 0 1 8 Making Causal Claims as a Data Scientist: Tips and Tricks Using R Lucy D’Agostino McGowan
  2. 2. D A T A D A Y T E X A S 2 0 1 8 http://bit.ly/LucyStatsDDTX18
  3. 3. D A T A D A Y T E X A S 2 0 1 8 h-ps://xkcd.com/552/
  4. 4. D A T A D A Y T E X A S 2 0 1 8 h-ps://xkcd.com/552/
  5. 5. D A T A D A Y T E X A S 2 0 1 8
  6. 6. D A T A D A Y T E X A S 2 0 1 8
  7. 7. D A T A D A Y T E X A S 2 0 1 8 ✌ parts 1. Discuss some ways to strengthen a causal argument: Hill’s criteria 2. Discuss a specific causal inference method: propensity scores + sensitivity analyses
  8. 8. D A T A D A Y T E X A S 2 0 1 8 Hill’s Criteria h-ps://commons.wikimedia.org/wiki/File:AusBn_Bradford_Hill.jpg
  9. 9. D A T A D A Y T E X A S 2 0 1 8 http://nssdeviations.com h<p://livefreeordichotomize.com/2016/12/15/hill-for-the-data-scienDst-an-xkcd-story/
  10. 10. D A T A D A Y T E X A S 2 0 1 8 Strength h-ps://xkcd.com/539/
  11. 11. D A T A D A Y T E X A S 2 0 1 8 h-ps://xkcd.com/242/ Consistency
  12. 12. D A T A D A Y T E X A S 2 0 1 8 h-ps://xkcd.com/1217/ Specificity
  13. 13. D A T A D A Y T E X A S 2 0 1 8 h-ps://xkcd.com/925/ Temporality
  14. 14. D A T A D A Y T E X A S 2 0 1 8 h-ps://xkcd.com/323/ Biological gradient
  15. 15. D A T A D A Y T E X A S 2 0 1 8 h-ps://xkcd.com/605/ Plausibility
  16. 16. D A T A D A Y T E X A S 2 0 1 8 Coherence h-ps://xkcd.com/1170/
  17. 17. D A T A D A Y T E X A S 2 0 1 8 Analogy h-ps://xkcd.com/882/
  18. 18. D A T A D A Y T E X A S 2 0 1 8 Experiment h-ps://xkcd.com/1462/
  19. 19. D A T A D A Y T E X A S 2 0 1 8 What if you can’t do a controlled experiment?
  20. 20. D A T A D A Y T E X A S 2 0 1 8 Observational Studies Goal: Answer a research question exposure outcome
  21. 21. D A T A D A Y T E X A S 2 0 1 8 “Gold Standard” – Randomized Controlled Trials Treatment Control P=0.5 P=0.5 Observational Studies
  22. 22. D A T A D A Y T E X A S 2 0 1 8 “Gold Standard” – Randomized Controlled Trials A B P=0.5 P=0.5 Observational Studies
  23. 23. D A T A D A Y T E X A S 2 0 1 8 Treatment Control P=0.9 P=0.1 Observational Studies
  24. 24. D A T A D A Y T E X A S 2 0 1 8 Exposed Unexposed P=0.8 P=0.2 Observational Studies
  25. 25. D A T A D A Y T E X A S 2 0 1 8 Exposed Unexposed P=0.8 P=0.2 Observational Studies
  26. 26. D A T A D A Y T E X A S 2 0 1 8 Exposed Unexposed P=0.9 P=0.1 Observational Studies
  27. 27. D A T A D A Y T E X A S 2 0 1 8
  28. 28. D A T A D A Y T E X A S 2 0 1 8
  29. 29. D A T A D A Y T E X A S 2 0 1 8 Confounding Exposure Outcome Confounder
  30. 30. D A T A D A Y T E X A S 2 0 1 8 Confounding Exposure Outcome Confounder
  31. 31. D A T A D A Y T E X A S 2 0 1 8 Meaningful confounders 1. How imbalanced is the confounder between the exposure groups? 2. How predictive is the confounder of the outcome?
  32. 32. D A T A D A Y T E X A S 2 0 1 8 Confounding Exposure Outcome Confounder
  33. 33. D A T A D A Y T E X A S 2 0 1 8 Meaningful confounders 1. How imbalanced is the confounder between the exposure groups? 2. How predictive is the confounder of the outcome?
  34. 34. D A T A D A Y T E X A S 2 0 1 8 Confounding Exposure Outcome Confounder
  35. 35. D A T A D A Y T E X A S 2 0 1 8
  36. 36. D A T A D A Y T E X A S 2 0 1 8 Rosenbaum and Rubin showed in observational studies, conditioning on propensity scores can lead to unbiased estimates of the exposure effect 1. There are no unmeasured confounders 2. Every subject has a nonzero probability of receiving either exposure Propensity scores
  37. 37. D A T A D A Y T E X A S 2 0 1 8 Rosenbaum and Rubin showed in observational studies, conditioning on propensity scores can lead to unbiased estimates of the exposure effect 1. There are no unmeasured confounders 2. Every subject has a nonzero probability of receiving either exposure Propensity scores
  38. 38. D A T A D A Y T E X A S 2 0 1 8 • Fit a logistic regression predicting exposure using known covariates Pr exposure = 1 = 1 1 + exp −X. • Each individuals' predicted values are the propensity scores Propensity scores
  39. 39. D A T A D A Y T E X A S 2 0 1 8 Propensity scores glm(exposure ~ smoker + cowboy_hat + sick . . ., data = df, family = binomial)
  40. 40. D A T A D A Y T E X A S 2 0 1 8 Propensity scores glm(exposure ~ smoker + cowboy_hat + sick . . ., data = df, family = binomial) %>% predict(type = “response”)
  41. 41. D A T A D A Y T E X A S 2 0 1 8 Propensity scores glm(exposure ~ smoker + cowboy_hat + sick . . ., data = df, family = binomial) %>% predict(type = “response”) #> 1 2 3 4 5 #> 0.6967404 0.6370019 0.7195434 0.9079320 0.5905542
  42. 42. D A T A D A Y T E X A S 2 0 1 8 Propensity scores p <- glm(exposure ~ smoker + cowboy_hat + sick . . ., data = df, family = binomial) %>% predict(type = “response”)
  43. 43. D A T A D A Y T E X A S 2 0 1 8 0.6 0.9 0.8 0.2 0.7 0.1 0.2 0.3 0.2 0.7 0.8 0.2 0.3 0.7 0.2 0.1 0.1 0.6 0.1 0.3 0.9 Propensity scores
  44. 44. DA T A DA Y T E X A S 2 0 1 8 0.6 0.9 0.1 0.3 0.8 0.3 0.7 0.6 0.9 0.1 0.8 0.2 0.2 0.2 0.2 0.7 Propensity scores
  45. 45. DA T A DA Y T E X A S 2 0 1 8 0.1 0.2 0.2 0.3 0.6 0.7 0.8 0.9 0.1 0.2 0.2 0.3 0.6 0.7 0.8 0.9 Propensity scores
  46. 46. DA T A DA Y T E X A S 2 0 1 8 Propensity Scores • Weighting • Matching • Stratification • Direct Adjustment • …
  47. 47. DA T A DA Y T E X A S 2 0 1 8 Propensity Scores • Weighting • Matching • Stratification • Direct Adjustment • …
  48. 48. DA T A DA Y T E X A S 2 0 1 8 Weighting !"#$ = min{*+, 1 − *+} 0+*+ + (1 − 0+)(1 − *+)
  49. 49. DA T A DA Y T E X A S 2 0 1 8 Weighting p_0 <- 1 - p df$weight <- pmin(p, p_0) / ifelse(exposure == 1, p, p_0)
  50. 50. DA T A DA Y T E X A S 2 0 1 8 0.1 0.2 0.2 0.3 0.6 0.7 0.8 0.9 0.1 0.2 0.2 0.3 0.6 0.7 0.8 0.9 Imbalance between exposures
  51. 51. D A T A D A Y T E X A S 2 0 1 8 0.1 0.2 0.2 0.3 0.6 0.7 0.8 0.9 0.1 0.2 0.2 0.3 0.6 0.7 0.8 0.9 Propensity Scores Imbalance between exposures
  52. 52. DA T A DA Y T E X A S 2 0 1 8 Standardized mean difference ! = #$%#&'(%! − #$*+%#&'(%! (%#&'(%! , + (*+%#&'(%! , ,
  53. 53. DA T A DA Y T E X A S 2 0 1 8 Standardized Mean Difference library(survey) svy_des <- svydesign( ids = ~ 1, data = df, weights = ~ weight)
  54. 54. DA T A DA Y T E X A S 2 0 1 8 Standardized Mean Difference library(tableone) smd_table <- svyCreateTableOne( vars = c(“smoker”, “cowboy_hat”, “sick”, . . .), strata = “exposure”, data = svy_des, test = FALSE) print(smd_table, smd = TRUE)
  55. 55. DA T A DA Y T E X A S 2 0 1 8 Exposed Unexposed SMD n 1897.07 1900.24 mustache 60.71 (16.95) 60.90 (15.61) 0.012 cowboy hat 204.59 (101.45) 202.90 (106.44) 0.016 glasses 70.77 (27.00) 70.51 (27.50) 0.01 smoker 37.46 (10.93) 37.40 (11.28) 0.005 crazy hair 2.27 (2.46) 2.27 (1.82) 0.003 smile 2.49 (5.48) 2.47 (4.80) 0.003 sick 30.92 (8.23) 30.95 (7.54) 0.003 harry potter 3.04 (0.69) 3.03 (0.93) 0.01 healthy 145.4 (7.7) 152.2 (8.0) 0.013 Standardized Mean Difference
  56. 56. DA T A DA Y T E X A S 2 0 1 8 Standardized Mean Difference library(tableone) smd_table_unweighted <- CreateTableOne( vars = c(“smoker”, “cowboy_hat”, “sick”, . . .), strata = “exposure”, data = df, test = FALSE) print(smd_table_unweighted, smd = TRUE)
  57. 57. DA T A DA Y T E X A S 2 0 1 8 Love Plots
  58. 58. DA T A DA Y T E X A S 2 0 1 8 Love Plots Imbalance between exposures
  59. 59. DA T A DA Y T E X A S 2 0 1 8 Standardized Mean Difference library(tidyr) plot_df <- data.frame( var = names(ExtractSmd(smd_table)), Unadjusted = ExtractSmd(smd_table_unweighted ), Weighted = ExtractSmd(smd_table)) %>% gather("Method", "SMD", -var)
  60. 60. DA T A DA Y T E X A S 2 0 1 8 Standardized Mean Difference library(ggplot2) ggplot( data = plot_df, mapping = aes(x = var, y = SMD, group = Method, color = Method) ) + geom_line() + geom_point() + geom_hline(yintercept = 0.1, color = "black", size = 0.1) + coord_flip()
  61. 61. DA T A DA Y T E X A S 2 0 1 8 Love Plots
  62. 62. DA T A DA Y T E X A S 2 0 1 8 Outcome model svyglm( outcome ~ exposure, data = svy_des, family = binomial)
  63. 63. DA T A DA Y T E X A S 2 0 1 8 Outcome model svyglm( outcome ~ exposure, data = svy_des, family = binomial) %>% confint(parm = “exposure”)
  64. 64. DA T A DA Y T E X A S 2 0 1 8 Outcome model svyglm( outcome ~ exposure, data = svy_des, family = binomial) %>% confint(parm = “exposure”) %>% exp() #> 2.5% 97.5% #> exposure 1.51 2.01
  65. 65. DA T A DA Y T E X A S 2 0 1 8 unmeasured confounding
  66. 66. DA T A DA Y T E X A S 2 0 1 8 All you need ü Exposure-outcome effect ü Exposure-unmeasured confounder effect ü Outcome-unmeasured confounder effect
  67. 67. DA T A DA Y T E X A S 2 0 1 8 All you need ü Exposure-outcome effect generally estimated from a model, for example an odds ratio, hazard ratio, or risk ratio ü Exposure-unmeasured confounder effect ü Outcome-unmeasured confounder effect Odds Ratio Confidence Interval: (1.5, 2)
  68. 68. DA T A DA Y T E X A S 2 0 1 8 All you need ü Exposure-outcome effect generally estimated from a model, for example an odds ratio, hazard ratio, or risk ratio ü Exposure-unmeasured confounder effect ü Outcome-unmeasured confounder effect
  69. 69. DA T A DA Y T E X A S 2 0 1 8 Tipping point analyses what will tip our confidence bound to cross 1
  70. 70. DA T A DA Y T E X A S 2 0 1 8 All you need ü Exposure-outcome effect generally estimated from a model, for example an odds ratio, hazard ratio, or risk ratio ü Exposure-unmeasured confounder effect can use my observed variables to estimate this ü Outcome-unmeasured confounder effect
  71. 71. DA T A DA Y T E X A S 2 0 1 8 All you need ü Exposure-outcome effect generally estimated from a model, for example an odds ratio, hazard ratio, or risk ratio ü Exposure-unmeasured confounder effect can use my observed variables to estimate this ü Outcome-unmeasured confounder effect
  72. 72. DA T A DA Y T E X A S 2 0 1 8 Exposure-unmeasured confounder effect •Prevalence of the unmeasured confounder among the exposed •Prevalence of the unmeasured confounder among the unexposed
  73. 73. DA T A DA Y T E X A S 2 0 1 8 Love Plots !" = 0.3 !' = 0.1
  74. 74. DA T A DA Y T E X A S 2 0 1 8 Confounding Exposure Outcome Confounder !" = 0.3 !' = 0.1 ? Odds Ratio Confidence Interval: (1.5, 2)
  75. 75. DA T A DA Y T E X A S 2 0 1 8 Standardized Mean Difference library(tipr) tip_with_binary( p0 = 0.1, p1 = 0.3, lb = 1.5, ub = 2)
  76. 76. DA T A DA Y T E X A S 2 0 1 8 Standardized Mean Difference library(tipr) tip_with_binary( p0 = 0.1, p1 = 0.3, lb = 1.5, ub = 2) #> [1] 4.333333
  77. 77. DA T A DA Y T E X A S 2 0 1 8 A hypothetical unobserved binary confounder that is prevalent in 30% of the exposed population and 10% of the unexposed population would need to have an association with Y of 4.33 to tip this analysis at the 5% level, rendering it inconclusive. ” “
  78. 78. DA T A DA Y T E X A S 2 0 1 8 ✌ parts 1. Discuss some ways to strengthen a causal argument: Hill’s criteria 2. Discuss a specific causal inference method: propensity scores + sensitivity analyses
  79. 79. DA T A DA Y T E X A S 2 0 1 8 h.ps://xkcd.com/552/
  80. 80. DA T A DA Y T E X A S 2 0 1 8 R • survey • tableone • 9dyr • ggplot2 • 9pr
  81. 81. DA T A DA Y T E X A S 2 0 1 8 References 1.Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., & Wynder, E. L. (2009). Smoking and lung cancer: recent evidence and a discussion of some quesQons. 1959. Interna'onal journal of epidemiology (Vol. 38, pp. 1175–1191). Oxford University Press. hp://doi.org/10.1093/ije/dyp289 2.Schlesselman, J. J. (1978). Assessing effects of confounding variables. American Journal of Epidemiology, 108(1), 3–8. 3.Rosenbaum, P. R., & Rubin, D. B. (1983). The Central Role of the Propensity Score in ObservaQonal Studies for Causal Effects. Biometrika, 70(1), 41. hp://doi.org/10.2307/2335942 4.Lin, D. Y., Psaty, B. M., & Kronmal, R. A. (1998). Assessing the sensiQvity of regression results to unmeasured confounders in observaQonal studies. Biometrics, 54(3), 948–963. 5.hps://cran.r-project.org/web/packages/tableone/vignees/smd.html
  82. 82. DA T A DA Y T E X A S 2 0 1 8 Thank you! @LucyStats http://bit.ly/LucyStatsDDTX18 lucymcgowan.com

×