
Repurposing predictive tools for causal research



Talk at Boston University's Questrom School of Business, Oct 29, 2018

Published in: Data & Analytics


  1. 1. Repurposing Predictive Tools for Causal Research Questrom School of Business, Boston University, Oct 29 2018 Galit Shmueli 徐茉莉 Institute of Service Science
  2. 2. Repurposing Trees for Causal Research We tackle 2 key issues in causal research: Self Selection Identifying confounders
  3. 3. A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Big Data Yahav, Shmueli & Mani (2016), A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Big Data, MIS Quarterly, vol. 40, no. 4, pp. 819-848. With Inbal Yahav, Bar-Ilan University & Deepa Mani, Indian School of Business
  4. 4. RCT: Random Assignment Manipulation
  5. 5. Quasi-Experiment (self-selection or administrator selection) Manipulation Self Selection
  6. 6. Self selection: the challenge In impact studies of an intervention: • Individuals/firms self-select intervention group/duration (quasi-experiment) • Even in randomized experiments, some variables might remain unbalanced in sample How to identify and adjust for self-selection?
  7. 7. Three Applications (MISQ 2016) Impact of labor training on earnings Field experiment by US govt • LaLonde (1986) compared to observational control • Re-analysis by PSM (Dehejia & Wahba, 1999, 2002) Randomized Experiment Impact of e-Gov service in India New online passport service • survey of online + offline users • bribes, travel time, etc. Impact of outsourcing contract features on financial performance • pricing mechanism • contract duration Quasi Experiment Observational
  8. 8. Common Approaches • 2SLS modeling (Heckman correction) -- econometrics • Propensity Score Approach (PS) -- statistics Two steps: 1. Selection model: T = f(X) 2. Performance analysis on matched samples Y = performance measure(s) T = intervention X = pre-intervention variables
  9. 9. Propensity Scores Approach Step 1: Estimate selection model logit(T) = f(X) to compute propensity scores P(T|X) Step 2: Use scores to create matched samples (PSM = use matching algorithm; PSS = divide scores into bins) Step 3: Estimate effect on Y (compare groups), e.g., t-test or Y = b0 + b1 T + b2 X + b3 PS + e. Y = performance measure(s) T = intervention X = pre-intervention variables Self-selection: P(T|X) ≠ P(T)
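The three steps on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic, the selection model is a one-variable logistic regression fitted by plain gradient ascent, the PSS (binning) variant is used for step 2, and the names `fit_logistic` and `stratified_effect` are hypothetical.

```python
import math
import random

def fit_logistic(x, t, steps=300, lr=1.0):
    """Step 1: fit logit(T) = a + b*x by gradient ascent on the mean log-likelihood."""
    a = b = 0.0
    n = len(x)
    for _ in range(steps):
        ga = gb = 0.0
        for xi, ti in zip(x, t):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            ga += ti - p
            gb += (ti - p) * xi
        a += lr * ga / n
        b += lr * gb / n
    return a, b

def stratified_effect(scores, t, y, n_bins=5):
    """Steps 2-3 (PSS): bin records by score, then average the per-bin mean differences."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // n_bins
    total, weight = 0.0, 0
    for k in range(n_bins):
        idx = order[k * size:(k + 1) * size] if k < n_bins - 1 else order[k * size:]
        treated = [y[i] for i in idx if t[i] == 1]
        control = [y[i] for i in idx if t[i] == 0]
        if treated and control:  # a bin lacking one group is dropped
            diff = sum(treated) / len(treated) - sum(control) / len(control)
            total += diff * len(idx)
            weight += len(idx)
    return total / weight

# Synthetic data (assumption): x drives both selection and outcome; true effect is 2.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(1000)]
t = [1 if random.random() < 1 / (1 + math.exp(-1.5 * xi)) else 0 for xi in x]
y = [2.0 * ti + 3.0 * xi + random.gauss(0, 1) for xi, ti in zip(x, t)]

a, b = fit_logistic(x, t)
scores = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
n_treated = sum(t)
naive = (sum(yi for yi, ti in zip(y, t) if ti == 1) / n_treated
         - sum(yi for yi, ti in zip(y, t) if ti == 0) / (len(t) - n_treated))
adjusted = stratified_effect(scores, t, y)
print(f"naive={naive:.2f} adjusted={adjusted:.2f}")
```

Because selection here depends on `x`, the naive group comparison overstates the effect, while the stratified estimate moves toward the true value; some residual bias within bins is expected.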
  10. 10. Challenges of PS in Big Data 1. Matching leads to severe data loss 2. PS methods suffer from “data dredging” 3. No variable selection (cannot identify variables that drive the selection) 4. Assumes constant intervention effect 5. Sequential nature is computationally costly 6. Logistic model requires researcher to specify exact form of selection model
  11. 11. Our Proposed Solution: Tree-based approach Propensity scores P(T|X) Y, T, X E(Y|T) Even E(Y|T,X) “Kill the Intermediary”
  12. 12. Proposed Method Output: T (treat/control) Inputs: X’s (income, education, family…) Records in each terminal node share same profile (X) and same propensity score P(T=1| X)
  13. 13. Tree-Based Approach Four steps, plus an optional fifth: 1. Run selection model: fit tree T = f(X) 2. Present resulting tree; see unbalanced X’s 3. Treat each terminal node as sub-sample for measuring Y; conduct terminal-node-level performance analysis 4. Present terminal-node-analyses visually 5. [optional]: combine analyses from nodes with homogeneous effects. Like PS, assumes observable self-selection
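Steps 1 and 3 can be illustrated with a one-level tree. The sketch below is a simplified stand-in, not the method from the paper: it picks a single binary covariate by Pearson chi-square association with T rather than fitting a full conditional-inference tree, and the records, the covariate names `hs` and `urban`, and the helper names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def best_split(records, covariates):
    """Step 1 (one level only): pick the binary X most associated with T."""
    def stat(name):
        cells = defaultdict(int)
        for r in records:
            cells[(r[name], r["t"])] += 1
        return chi2_2x2(cells[(0, 0)], cells[(0, 1)], cells[(1, 0)], cells[(1, 1)])
    return max(covariates, key=stat)

def node_effects(records, split):
    """Step 3: treated-minus-control mean outcome inside each terminal node."""
    out = {}
    for value in (0, 1):
        node = [r for r in records if r[split] == value]
        treated = [r["y"] for r in node if r["t"] == 1]
        control = [r["y"] for r in node if r["t"] == 0]
        out[value] = mean(treated) - mean(control)
    return out

# Hypothetical records: hs is unbalanced across T, urban is balanced.
records = []
for hs, t, y, count in [(0, 0, 10, 60), (0, 1, 11, 20), (1, 0, 12, 20), (1, 1, 15, 60)]:
    for i in range(count):
        records.append({"hs": hs, "urban": i % 2, "t": t, "y": y})

split = best_split(records, ["hs", "urban"])
effects = node_effects(records, split)
print(split, effects)
```

The chosen split flags the unbalanced covariate, and the per-node differences show whether the effect is the same across nodes.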
  14. 14. Tree Algorithm Choice Conditional-Inference trees (Hothorn et al., 2006) – Stop tree growth using statistical tests of independence – Binary splits
  15. 15. Study 1: Impact of training on financial gains (LaLonde 1986; Dehejia & Wahba 1999, 2002) Experiment: US govt program randomly assigns eligible candidates to labor training program • Goal: increase future earnings • LaLonde (1986) shows: Groups statistically equal in terms of demographic & pre-train earnings → ATE = $1794 (p<0.004)
  16. 16. Tree on Lalonde’s RCT data If groups are completely balanced, we expect… Y = Earnings in 1978 T = Received NSW training (T = 1) or not (T = 0) X = Demographic information and prior earnings
  17. 17. Tree reveals… 1. Unbalanced variable (HS degree) 2. Heterogeneous effect. Tree split: High school degree (no / yes).
      Group | LaLonde’s naïve approach (experiment) | Tree approach: HS dropout (n=348) | Tree approach: HS degree (n=97)
      Not trained (n=260) | $4,554 | $4,495 | $4,855
      Trained (n=185) | $6,349 | $5,649 | $8,047
      Training effect | $1,794 (p=0.004) | $1,154 (p=0.063) | $3,192 (p=0.015)
      Overall: $1,598 (p=0.017)
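The slide does not say how the overall figure is obtained. As a quick check, and only as an assumption about the calculation, weighting the two node-level training effects by node size gives a value that matches the reported overall effect of $1598:

```python
# Node sizes and node-level training effects as reported on the slide.
nodes = {
    "HS dropout": {"n": 348, "effect": 1154},
    "HS degree": {"n": 97, "effect": 3192},
}

total_n = sum(node["n"] for node in nodes.values())
# Size-weighted average of the node effects (assumed aggregation rule).
overall = sum(node["n"] * node["effect"] for node in nodes.values()) / total_n
print(total_n, round(overall))
```

The node sizes sum to 445, the same total as the 260 untrained plus 185 trained participants.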
  18. 18. Labor Training effect: Observational control group • LaLonde also compared with observational control groups (PSID, CPS) – experimental training group + obs control – shows training effect not estimated correctly with structural equations • Dehejia & Wahba (1999,2002) re-analyze CPS control group (n=15,991), using PSM – Effects in range $1122-$1681, depends on settings – “Best” setting effect: $1360 – Uses only 119 control group members (out of 15,991)
  19. 19. Tree for obs control group reveals… 1. Unbalanced variables 2. Heterogeneous effect in u74 3. Outlier 4. Eligibility issue. Tree annotations: unemployed prior to training in 1974 (u74=0) -> negative effect; outlier; eligibility issue: some profiles are rare in the trained group but common in the control group
  20. 20. Study 2: Impact of eGov Initiative (India) Survey commissioned by Govt of India in 2006 • >9500 individuals who used passport services • Representative sample of 13 Passport Offices • “Quasi-experimental, non-equivalent groups design” • Equal number of offline and online users, matched by geography and demographics
  21. 21. Current Practice Assess impact by comparing online/offline performance stats
  22. 22. Awareness of electronic services provided by Government of India. Outcomes shown: % bribe RPO, % use agent, % prefer online, % bribe police. 1. Demographics properly balanced 2. Unbalanced variable (Aware) 3. Heterogeneous effects on various y’s + even Simpson’s paradox
  23. 23. PSM. Awareness of electronic services provided by Government of India. Would we detect this with PSM?
  24. 24. Heterogeneous effect
  25. 25. Scaling Up to Big Data • We inflated eGov dataset by bootstrap • Up to 9,000,000 records and 360 variables • 10 runs for each configuration: runtime for tree 20 sec
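Inflating a dataset by bootstrap means resampling its records with replacement until the target size is reached. A minimal sketch follows; the function name `inflate` and the toy records are hypothetical, and how the authors expanded the number of variables is not shown on the slide, so it is not attempted here.

```python
import random

def inflate(records, target_size, seed=0):
    """Draw target_size records with replacement from the original sample."""
    rng = random.Random(seed)
    return rng.choices(records, k=target_size)

# Toy stand-in for survey rows (hypothetical values).
survey = [{"id": i, "online": i % 2} for i in range(100)]
inflated = inflate(survey, 10_000)
print(len(inflated))
```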
  26. 26. Large Simulation Intervention type: binary/continuous Sample size: 10K, 100K, 1M Dimension of pre-intervention variables: 4, 50, + interactions Pre-intervention variable types: binary, Likert, continuous Outcome variable types: binary/continuous Selection models: logistic with/out interactions Intervention effects: homogeneous/heterogeneous
  27. 27. Big Data Scalability Theoretical Complexity: • O(mn/p) for binary X • O((m/p) n log n) for continuous X. Figure: runtime as function of sample size, dimension
  28. 28. Scaling Trees Even Further • “Big Data” in research vs. industry • Industrial scaling – Sequential trees: efficient data structure, access (SPRINT, SLIQ, RainForest) – Parallel computing (parallel SPRINT, ScalParC, SPARK, PLANET) “as long as split metric can be computed on subsets of the training data and later aggregated, PLANET can be easily extended”
  29. 29. Example: Heterogeneous Effect
  30. 30. Tree Approach Benefits 1. Data-driven selection model 2. Scales up to Big Data 3. Fewer user choices (less data dredging) 4. Nuanced insights • Detect unbalanced variables • Detect heterogeneous effect from anticipated outcomes 5. Simple to communicate 6. Automatic variable selection 7. Missing values do not remove record 8. Binary, multiple, continuous interventions 9. Post-analysis of RCT, quasi-experiments & observational studies
  31. 31. Tree Approach Limits 1. Assumes selection on observables 2. Need sufficient data 3. Continuous variables can lead to large tree 4. Instability [possible solution: use variable importance scores (forest)]
  32. 32. Detecting Simpson’s Paradox in Big Data Using Trees Shmueli & Yahav (2017), The Forest or the Trees? Tackling Simpson’s Paradox with Classification Trees, Production & Operations Management Journal, Forthcoming. With Inbal Yahav, Bar-Ilan University
  33. 33. Simpson’s Paradox The direction of a cause on an effect appears reversed when a sample (or population) is examined in aggregate vs. disaggregated. “Simpson's Paradox is the reversal of an association between two variables after a third variable (a confounding factor) is taken into account.” - Schield (1999). “The phenomenon whereby an event B increases the probability of A in a given population p, at the same time, decreases the probability of A in every subpopulation of p.” - Pearl (2009)
  34. 34. Death Sentence and Race (Agresti, 1984) Does defendant's race (X) affect chance of death sentence (Y)? Causal explanation: Black murderers tend to kill blacks; hence lower overall death sentence rates Causal effect seems to reverse when disaggregating by victim race (Z)
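The reversal can be checked directly from a cross-tabulation. The slide shows no counts, so the numbers below are an assumption: they are the counts commonly reproduced for this example from Radelet (1981) in Agresti's text (326 defendants). The helper names are hypothetical.

```python
# counts[(defendant race, victim race)] = (death sentences, total defendants)
# Assumed counts; not taken from the slide.
counts = {
    ("white", "white"): (19, 151),
    ("white", "black"): (0, 9),
    ("black", "white"): (11, 63),
    ("black", "black"): (6, 103),
}

def rate(cells):
    """Pooled death-sentence rate over a list of (deaths, total) cells."""
    deaths = sum(d for d, _ in cells)
    total = sum(n for _, n in cells)
    return deaths / total

def diff_by_defendant(victims):
    """P(death | white defendant) - P(death | black defendant) over the given victim races."""
    white = rate([counts[("white", v)] for v in victims])
    black = rate([counts[("black", v)] for v in victims])
    return white - black

aggregate = diff_by_defendant(["white", "black"])
by_victim = {v: diff_by_defendant([v]) for v in ("white", "black")}
# Paradox: every subgroup difference has the opposite sign to the aggregate one.
reversal = all((d > 0) != (aggregate > 0) for d in by_victim.values())
print(round(aggregate, 4), {v: round(d, 4) for v, d in by_victim.items()}, reversal)
```

With these counts the aggregate difference is positive while both victim-race subgroups show a negative difference, which is the pattern the slide describes.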
  35. 35. Goal: Does a dataset exhibit SP? A = cause, C = confounder, E = effect. Quantities compared: P(E|C) – P(E|C’) and P(E|A) – P(E|A’). “If Cornfield’s minimum effect size is not reached, [you] can assume no causality” - Schield, 1999 (Cornfield et al.’s Criterion)
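The two differences on this slide are easy to compute from counts. The comparison operator did not survive in the transcript, so the check below rests on an assumption: the common reading of Cornfield's condition, under which a confounder can account for the observed association only if its own difference in outcome rates is at least as large as that of the presumed cause. The counts are the same assumed Radelet (1981) figures often used with the death sentence example, with A = white defendant and C = white victim.

```python
# (death sentences, total) by (defendant race, victim race); assumed counts.
counts = {
    ("white", "white"): (19, 151),
    ("white", "black"): (0, 9),
    ("black", "white"): (11, 63),
    ("black", "black"): (6, 103),
}

def pooled_rate(keys):
    deaths = sum(counts[k][0] for k in keys)
    total = sum(counts[k][1] for k in keys)
    return deaths / total

# P(E|A) - P(E|A'): difference by presumed cause (defendant race).
cause_diff = (pooled_rate([("white", "white"), ("white", "black")])
              - pooled_rate([("black", "white"), ("black", "black")]))
# P(E|C) - P(E|C'): difference by candidate confounder (victim race).
confounder_diff = (pooled_rate([("white", "white"), ("black", "white")])
                   - pooled_rate([("white", "black"), ("black", "black")]))

# Assumed reading of the criterion: the confounder's difference must be at least as large.
could_explain = abs(confounder_diff) >= abs(cause_diff)
print(round(cause_diff, 4), round(confounder_diff, 4), could_explain)
```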
  36. 36. Translate Cornfield’s Criterion into a Tree Y = outcome of interest X = causal variable Z = confounding variable(s) Predictors #1 If cause -> effect, then cause should appear in tree #2 If Z is confounding, then Z should appear in tree
  37. 37. Five potential tree structures: single causal variable (X) and single confounding variable (Z). Which might exhibit Simpson’s Paradox? P(E|C) – P(E|C’); P(E|A) – P(E|A’)
  38. 38. Simpson’s Paradox on a Tree #1 If cause -> effect, then cause should appear in tree #2 If Z is confounding, then Z should appear in tree
  39. 39. Death Sentence and Race: Tree Approach #1: full tree P(death)
  40. 40. Accounting for Sampling Error Logistic/linear regression: Interaction X*Z significant? No → no paradox; Yes → ? Trees: Tree structure + significance of interaction = conditional-inference tree. Tree splits based on statistical tests (χ², F, permutation tests)
  41. 41. Tree Approach #2: Conditional Inference tree (Hothorn et al., JCGS 2006) • Recursive partitioning with early stopping • Separate steps for variable selection and split search • R packages party, partykit (function ctree) Variable selection based on statistical test (χ²)
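The variable-selection step with early stopping can be sketched as follows. This is a simplified stand-in in Python, not the `ctree` implementation: it uses a Pearson chi-square test on 2x2 tables with a Bonferroni adjustment, and the predictor names and tables are hypothetical.

```python
import math

def chi2_2x2(table):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def p_value_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def select_variable(tables, alpha=0.05):
    """Pick the predictor with the smallest adjusted p-value; None means stop growing."""
    adjusted = {name: min(1.0, p_value_1df(chi2_2x2(t)) * len(tables))
                for name, t in tables.items()}
    best = min(adjusted, key=adjusted.get)
    if adjusted[best] < alpha:
        return best, adjusted[best]
    return None, adjusted[best]

# Hypothetical predictor-by-outcome tables at one node.
tables = {"x1": [[60, 20], [20, 60]], "x2": [[40, 40], [40, 40]]}
chosen, p_adj = select_variable(tables)
stopped, _ = select_variable({"x2": tables["x2"]})
print(chosen, p_adj, stopped)
```

Selecting the variable first and searching for the split point afterwards is what keeps the two steps separate; when no predictor passes the threshold, the node becomes terminal.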
  42. 42. Cornfield’s criterion + sampling error: Conditional Inference Trees
  43. 43. Proof for trees that use concave impurity measure (Gini, entropy) as well as χ²: CART, CHAID, Conditional-Inference Trees
  44. 44. Accounting for Sampling Error: Conditional-Inference Tree P(death)
  45. 45. Seatbelts and Injuries (Agresti 2012) Does use of seat-belts (X) reduce chance of injury (Y)? Z = Passenger gender and accident location n=68,694 passengers involved in accidents in Maine Potential Paradox (by location) How about logistic regression?
  46. 46. % Injuries
  47. 47. Simpson’s Paradox in Big Data Large n , High-dimensional Z
  48. 48. Multiple Potential Confounders (Z) The Challenge: Statistical significance of Simpson’s paradox ≠ Significance threshold of tree splits in CI tree. Figure panels: CI Tree, Full Tree. Solution: X-Terminal Tree
  49. 49. Paradox Detection in Big Data (Tree Approach #3): X-Terminal Trees X-Terminal Tree: Grow tree only until X-splits
  50. 50. Tree paths with terminal X nodes • Full paradox, statistically significant • Partial paradox, statistically significant • Statistically insignificant paradox • No paradox Pivot table equivalence: Filter by Z variables above terminal X node
  51. 51. Impact of Initiative Survey commissioned by Govt of India in 2006 • >9500 individuals who used passport services • Representative sample of 13 Passport Offices • Equal number of offline and online users, matched by geography and demographics. Let’s focus on Police bribing
  52. 52. Y = police bribe (0/1) X = online/offline Z = {demographics; survey Qs} Tree annotations: Split p=.32; Paradox p=0.003; Paradox p=0.16; No paradox
  53. 53. Kidney Allocation in USA (104,000 patients, 19 confounders) Is the kidney allocation system racist? Type 4 tree, but no significant Simpson’s paradox detected! Y = waiting time (days) X = patient race Z = {patient demog, health, bio}
  54. 54. Summary & Challenges Summary: Full tree: eliminate non-type-4 trees • Conditional-inference trees: for single Z • X-terminal trees: for multiple Zs • More efficient than stepwise regression • Tree structure more informative than interaction terms • Extends: continuous Y, >2 subpopulations. Challenges: • Greediness of tree • A weak paradox, or one confined to a small subset of data, can go undetected • Highly correlated Z’s might lead to “wrong” Z choice
  55. 55. Repurposing Trees for Causal Research We tackled 2 key issues in causal research: Self Selection Identifying confounders
  56. 56. Analytics Humanity Responsibility Galit Shmueli 徐茉莉 Institute of Service Science