
Talk at Boston University's Questrom School of Business, Oct 29, 2018

Published in: Data & Analytics


Slide note (orange node): although the p-value is very large, we still have a split

- 1. Repurposing Predictive Tools for Causal Research. Questrom School of Business, Boston University, Oct 29, 2018. Galit Shmueli 徐茉莉, Institute of Service Science
- 2. Repurposing Trees for Causal Research. We tackle two key issues in causal research: self-selection and identifying confounders.
- 3. A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Big Data. Yahav, Shmueli & Mani (2016), A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Big Data, MIS Quarterly, vol. 40, no. 4, pp. 819-848. With Inbal Yahav, Bar-Ilan University, and Deepa Mani, Indian School of Business.
- 4. RCT: Random Assignment Manipulation
- 5. Quasi-Experiment (self-selection or administrator selection) Manipulation Self Selection
- 6. Self-selection: the challenge. In impact studies of an intervention: • Individuals/firms self-select into the intervention group or its duration (quasi-experiment) • Even in randomized experiments, some variables might remain unbalanced in the sample. How can we identify and adjust for self-selection?
- 7. Three Applications (MISQ 2016). (1) Randomized experiment: impact of labor training on earnings. Field experiment by the US govt; LaLonde (1986) compared it to an observational control; re-analyzed by PSM (Dehejia & Wahba, 1999, 2002). (2) Quasi-experiment: impact of an e-Gov service in India. New online passport service; survey of online + offline users; bribes, travel time, etc. (3) Observational: impact of outsourcing contract features on financial performance (pricing mechanism; contract duration).
- 8. Common Approaches: • 2SLS modeling (Heckman correction) -- econometrics • Propensity Score approach (PS) -- statistics. Two steps: 1. Selection model: T = f(X) 2. Performance analysis on matched samples. (Y = performance measure(s); T = intervention; X = pre-intervention variables)
- 9. Propensity Scores Approach. Self-selection: P(T|X) ≠ P(T). Step 1: Estimate the selection model logit(T) = f(X) to compute propensity scores P(T|X). Step 2: Use the scores to create matched samples (PSM = use a matching algorithm; PSS = divide scores into bins). Step 3: Estimate the effect on Y (compare groups), e.g., a t-test or Y = b0 + b1 T + b2 X + b3 PS + e. (Y = performance measure(s); T = intervention; X = pre-intervention variables)
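As a sketch of the three-step PS workflow above (illustrative Python with scikit-learn on synthetic data, not the paper's code), one can fit a logistic selection model, subclassify the scores into quintile bins (the PSS variant), and average within-stratum outcome differences:

```python
# Propensity-score subclassification (PSS) sketch on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                               # pre-intervention variables
p_select = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))   # self-selection mechanism
T = rng.binomial(1, p_select)                             # intervention indicator
Y = 2.0 * T + X[:, 0] + rng.normal(size=n)                # outcome; true effect = 2.0

# Step 1: selection model logit(T) = f(X) -> propensity scores P(T=1 | X)
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# Step 2: divide the scores into bins (quintile strata)
stratum = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))

# Step 3: effect on Y = average of within-stratum group differences
effects = [Y[(stratum == s) & (T == 1)].mean() - Y[(stratum == s) & (T == 0)].mean()
           for s in range(5)]
print(round(float(np.mean(effects)), 2))  # should land near the true effect of 2.0
```

A naive comparison `Y[T == 1].mean() - Y[T == 0].mean()` would be biased upward here, because `X[:, 0]` drives both selection and the outcome; the stratification removes most of that bias.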
- 10. Challenges of PS in Big Data 1. Matching leads to severe data loss 2. PS methods suffer from “data dredging” 3. No variable selection (cannot identify variables that drive the selection) 4. Assumes constant intervention effect 5. Sequential nature is computationally costly 6. Logistic model requires researcher to specify exact form of selection model
- 11. Our Proposed Solution: tree-based approach. [Diagram: (Y, T, X) -> propensity scores P(T|X) -> E(Y|T), even E(Y|T,X)] "Kill the intermediary"
- 12. Proposed Method Output: T (treat/control) Inputs: X’s (income, education, family…) Records in each terminal node share same profile (X) and same propensity score P(T=1| X)
- 13. Tree-Based Approach. Four steps (plus an optional fifth): 1. Run the selection model: fit a tree T = f(X) 2. Present the resulting tree; see the unbalanced X's 3. Treat each terminal node as a sub-sample for measuring Y; conduct terminal-node-level performance analysis 4. Present the terminal-node analyses visually 5. [optional] Combine analyses from nodes with homogeneous effects. Like PS, assumes observable self-selection.
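The steps above can be mimicked with a plain CART classifier standing in for the conditional-inference tree the paper uses (illustrative Python with scikit-learn; the variable name `hs_degree` echoes the LaLonde example below, but the data are synthetic):

```python
# Tree-based sub-sample analysis: fit T = f(X), then estimate the effect of T
# on Y separately within each terminal node.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 3000
hs_degree = rng.binomial(1, 0.3, n)              # drives self-selection
income = rng.normal(30, 10, n)                   # irrelevant to selection
X = np.column_stack([hs_degree, income])
T = rng.binomial(1, 0.3 + 0.4 * hs_degree)       # P(T=1) higher with a degree
Y = 1000 + 1000 * T + 2000 * T * hs_degree + rng.normal(0, 500, n)  # heterogeneous effect

# Step 1: selection model as a tree (depth 1 for clarity)
tree = DecisionTreeClassifier(max_depth=1, min_samples_leaf=200).fit(X, T)
leaf = tree.apply(X)                             # terminal-node id per record

# Step 3: terminal-node-level performance analysis
node_effects = []
for node in np.unique(leaf):
    m = leaf == node
    effect = Y[m & (T == 1)].mean() - Y[m & (T == 0)].mean()
    node_effects.append(effect)
    print(f"node {node}: n={m.sum()}, estimated effect {effect:.0f}")
```

The tree splits on `hs_degree` (the unbalanced variable), and the two node-level estimates recover roughly 1000 and 3000, the heterogeneous effects that a single pooled estimate would blur.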
- 14. Tree Algorithm Choice Conditional-Inference trees (Hothorn et al., 2006) – Stop tree growth using statistical tests of independence – Binary splits
- 15. Study 1: Impact of training on financial gains (LaLonde 1986; Dehejia & Wahba 1999, 2002). Experiment: a US govt program randomly assigns eligible candidates to a labor training program. • Goal: increase future earnings • LaLonde (1986) shows: groups statistically equal in terms of demographics & pre-training earnings; ATE = $1,794 (p<0.004)
- 16. Tree on LaLonde's RCT data. If the groups are completely balanced, we expect… Y = earnings in 1978; T = received NSW training (T=1) or not (T=0); X = demographic information and prior earnings
- 17. Tree reveals… (tree splits on high-school degree: no/yes)
                         Naive (experiment)   Tree: HS dropout (n=348)   Tree: HS degree (n=97)
    Not trained (n=260)  $4,554               $4,495                     $4,855
    Trained (n=185)      $6,349               $5,649                     $8,047
    Training effect      $1,794 (p=0.004)     $1,154 (p=0.063)           $3,192 (p=0.015)
    Tree overall effect: $1,598 (p=0.017). Takeaways: 1. Unbalanced variable (HS degree) 2. Heterogeneous effect
- 18. Labor training effect: observational control group. • LaLonde also compared with observational control groups (PSID, CPS): experimental training group + observational control; shows the training effect is not estimated correctly with structural equations. • Dehejia & Wahba (1999, 2002) re-analyze the CPS control group (n=15,991) using PSM: effects in the range $1,122-$1,681, depending on settings; "best" setting effect: $1,360; uses only 119 control-group members (out of 15,991).
- 19. Tree for the observational control group reveals… 1. Unbalanced variables 2. Heterogeneous effect in u74: unemployed prior to training in 1974 (u74=0) -> negative effect 3. Outlier 4. Eligibility issue: some profiles are rare in the trained group but popular in the control group
- 20. Study 2: Impact of eGov Initiative (India). Survey commissioned by the Govt of India in 2006: • >9,500 individuals who used passport services • Representative sample of 13 Passport Offices • "Quasi-experimental, non-equivalent groups design" • Equal number of offline and online users, matched by geography and demographics
- 21. Current Practice Assess impact by comparing online/offline performance stats
- 22. Awareness of electronic services provided by the Government of India. [Charts: % bribe RPO, % use agent, % prefer online, % bribe police] Findings: 1. Demographics properly balanced 2. Unbalanced variable (Aware) 3. Heterogeneous effects on various y's, and even a Simpson's paradox
- 23. PSM on awareness of electronic services provided by the Government of India: would we detect this with PSM?
- 24. Heterogeneous effect
- 25. Scaling Up to Big Data. • We inflated the eGov dataset by bootstrap • Up to 9,000,000 records and 360 variables • 10 runs for each configuration: runtime for the tree ≈ 20 sec
- 26. Large Simulation. Intervention type: binary/continuous. Sample size: 10K, 100K, 1M. Dimension of pre-intervention variables: 4, 50, + interactions. Pre-intervention variable types: binary, Likert, continuous. Outcome variable types: binary/continuous. Selection models: logistic with/without interactions. Intervention effects: homogeneous/heterogeneous.
- 27. Big Data Scalability. Theoretical complexity: • O(mn/p) for binary X • O((m/p) n log(n)) for continuous X. Runtime as a function of sample size and dimension.
- 28. Scaling Trees Even Further • “Big Data” in research vs. industry • Industrial scaling – Sequential trees: efficient data structure, access (SPRINT, SLIQ, RainForest) – Parallel computing (parallel SPRINT, ScalParC, SPARK, PLANET) “as long as split metric can be computed on subsets of the training data and later aggregated, PLANET can be easily extended”
- 29. Example: Heterogeneous Effect
- 30. Tree Approach Benefits: 1. Data-driven selection model 2. Scales up to Big Data 3. Fewer user choices (less data dredging) 4. Nuanced insights: detect unbalanced variables; detect heterogeneous effects and departures from anticipated outcomes 5. Simple to communicate 6. Automatic variable selection 7. Missing values do not remove a record 8. Binary, multiple, and continuous interventions 9. Post-analysis of RCTs, quasi-experiments & observational studies
- 31. Tree Approach Limits: 1. Assumes selection on observables 2. Needs sufficient data 3. Continuous variables can lead to a large tree 4. Instability [possible solution: use variable importance scores from a forest]
- 32. Detecting Simpson's Paradox in Big Data Using Trees. Shmueli & Yahav (2017), The Forest or the Trees? Tackling Simpson's Paradox with Classification Trees, Production & Operations Management Journal, forthcoming. With Inbal Yahav, Bar-Ilan University.
- 33. Simpson's Paradox: the direction of the effect of a cause appears reversed when examining the aggregate vs. the disaggregated sample (or population). "Simpson's Paradox is the reversal of an association between two variables after a third variable (a confounding factor) is taken into account." - Schield (1999). "The phenomenon whereby an event B increases the probability of A in a given population p and, at the same time, decreases the probability of A in every subpopulation of p." - Pearl (2009)
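These definitions can be checked mechanically on a 2x2x2 table. The counts below are the well-known kidney-stone treatment numbers (Charig et al. 1986), used here only because they provably exhibit the reversal; they are not the death-penalty data discussed next:

```python
# counts[z][x] = (successes, trials), i.e., P(Y=1 | X, Z) as a fraction.
counts = {
    "Z=0": {"X=0": (81, 87),   "X=1": (234, 270)},
    "Z=1": {"X=0": (192, 263), "X=1": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

# Within every level of Z, X=0 has the higher success rate...
for z in counts:
    assert rate(*counts[z]["X=0"]) > rate(*counts[z]["X=1"])

# ...yet pooled over Z the association reverses: Simpson's paradox.
pooled = {x: tuple(sum(v) for v in zip(*(counts[z][x] for z in counts)))
          for x in ("X=0", "X=1")}
print(rate(*pooled["X=0"]) < rate(*pooled["X=1"]))  # True: reversal in aggregate
```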
- 34. Death Sentence and Race (Agresti, 1984) Does defendant's race (X) affect chance of death sentence (Y)? Causal explanation: Black murderers tend to kill blacks; hence lower overall death sentence rates Causal effect seems to reverse when disaggregating by victim race (Z)
- 35. Goal: Does a dataset exhibit SP? Cornfield et al.'s criterion (A = cause, C = confounder, E = effect): P(E|C) − P(E|C′) ≥ P(E|A) − P(E|A′). "If Cornfield's minimum effect size is not reached, [you] can assume no causality" - Schield, 1999
- 36. Translate Cornfield's criterion into a tree. Y = outcome of interest; X = causal variable; Z = confounding variable(s); X and Z serve as predictors. #1 If cause -> effect, then the cause should appear in the tree. #2 If Z is confounding, then Z should appear in the tree.
- 37. Five potential tree structures with a single causal variable (X) and a single confounding variable (Z). Which might exhibit Simpson's Paradox?
- 38. Simpson’s Paradox on a Tree #1 If cause -> effect, then cause should appear in tree #2 If Z is confounding, then Z should appear in tree
- 39. Death Sentence and Race, Tree Approach #1: full tree. [Tree diagram shows P(death) in each node]
- 40. Accounting for Sampling Error. Logistic/linear regression: is the interaction X*Z significant? No -> no paradox; Yes -> ? Trees: tree structure + significance of interaction = conditional-inference tree. Tree splits are based on statistical tests (χ², F, permutation tests).
- 41. Tree Approach #2: Conditional-inference tree (Hothorn et al., JCGS 2006). • Recursive partitioning with early stopping • Separate steps for variable selection and split search • Variable selection based on a statistical test (χ²) • R packages party, partykit (function ctree)
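The variable-selection-with-early-stopping step can be sketched as follows (illustrative Python with SciPy; partykit's ctree uses permutation-test statistics, for which a χ² test of independence stands in here, with a Bonferroni adjustment over candidates as in the ctree framework):

```python
# Conditional-inference-style variable selection with early stopping:
# split only if some predictor is significantly associated with the response.
import numpy as np
from scipy.stats import chi2_contingency

def select_split_variable(Xcols, t, alpha=0.05):
    """Xcols: dict of name -> binary array; t: binary response array.
    Returns the name of the split variable, or None (stop growing)."""
    pvals = {}
    for name, x in Xcols.items():
        table = np.array([[np.sum((x == i) & (t == j)) for j in (0, 1)]
                          for i in (0, 1)])
        pvals[name] = chi2_contingency(table)[1]  # p-value of independence test
    best = min(pvals, key=pvals.get)
    if pvals[best] * len(pvals) < alpha:          # Bonferroni over candidates
        return best
    return None                                   # early stopping: no split

rng = np.random.default_rng(2)
n = 1000
relevant = rng.binomial(1, 0.5, n)
noise = rng.binomial(1, 0.5, n)
t = rng.binomial(1, 0.2 + 0.5 * relevant)         # depends only on `relevant`
print(select_split_variable({"relevant": relevant, "noise": noise}, t))
```

The strongly associated variable is selected; with purely independent candidates the function returns None, which is what stops tree growth early.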
- 42. Cornfield’s criterion + sampling error: Conditional Inference Trees
- 43. Proof for trees that use a concave impurity measure (Gini, entropy) as well as χ²: CART, CHAID, conditional-inference trees.
- 44. Accounting for Sampling Error: Conditional-Inference Tree P(death)
- 45. Seatbelts and Injuries (Agresti 2012) Does use of seat-belts (X) reduce chance of injury (Y)? Z = Passenger gender and accident location n=68,694 passengers involved in accidents in Maine Potential Paradox (by location) How about logistic regression?
- 46. [Chart: % injuries]
- 47. Simpson’s Paradox in Big Data Large n , High-dimensional Z
- 48. Multiple Potential Confounders (Z). The Challenge: statistical significance of Simpson's paradox ≠ the significance threshold of tree splits in a CI tree. Solution: the X-Terminal Tree.
- 49. Paradox Detection in Big Data (Tree Approach #3): X-Terminal Trees X-Terminal Tree: Grow tree only until X-splits
- 50. Tree paths with terminal X nodes: • Full paradox, statistically significant • Partial paradox, statistically significant • Statistically insignificant paradox • No paradox. Pivot-table equivalence: filter by the Z variables above the terminal X node.
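The pivot-table equivalence can be sketched with pandas on synthetic data (the column names X, Y, Z follow the slides; the data-generating numbers are made up): group by the Z variables above a terminal X node, compute P(Y=1 | X) in each cell, and flag cells whose X-effect sign disagrees with the aggregate:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 4000
Z = rng.binomial(1, 0.5, n)                    # confounder
X = rng.binomial(1, 0.2 + 0.6 * Z)             # cause, associated with Z
Y = rng.binomial(1, 0.7 - 0.4 * Z + 0.1 * X)   # effect: +0.1 for X within any Z
df = pd.DataFrame({"X": X, "Y": Y, "Z": Z})

# Aggregate association of X with Y (ignoring Z)
agg_sign = np.sign(df.loc[df.X == 1, "Y"].mean() - df.loc[df.X == 0, "Y"].mean())

# Pivot-table view: P(Y=1 | X) within each Z cell, then the X-effect per cell
rates = df.groupby(["Z", "X"])["Y"].mean().unstack("X")
diffs = rates[1] - rates[0]

print("aggregate sign:", agg_sign)
print("reversed in some subgroup:", bool((np.sign(diffs) != agg_sign).any()))
```

Here every Z subgroup shows a positive X-effect while the aggregate association is negative, exactly the pattern an X-terminal tree path would surface.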
- 51. Impact of the Initiative. Survey commissioned by the Govt of India in 2006: >9,500 individuals who used passport services • Representative sample of 13 Passport Offices • Equal number of offline and online users, matched by geography and demographics. Let's focus on police bribing.
- 52. Y = police bribe (0/1); X = online/offline; Z = {demographics; survey Qs}. [Tree: split p=.32; paradox p=0.003; paradox p=0.16; no paradox]
- 53. Kidney Allocation in the USA (104,000 patients, 19 confounders). Is the kidney allocation system racist? Y = waiting time (days); X = patient race; Z = {patient demographics, health, bio}. A type-4 tree, but no significant Simpson's paradox detected!
- 54. Summary & Challenges. Approaches: full tree (eliminate non-type-4 trees); conditional-inference trees (for a single Z); X-terminal trees (for multiple Zs). Benefits: • More efficient than stepwise regression • Tree structure more informative than interaction terms • Extends to continuous Y and >2 subpopulations. Challenges: • Greediness of the tree • A weak paradox, or one in a small subset of the data, can go undetected • Highly correlated Z's might lead to the "wrong" Z choice
- 55. Repurposing Trees for Causal Research. We tackled two key issues in causal research: self-selection and identifying confounders.
- 56. Analytics Humanity Responsibility. Galit Shmueli 徐茉莉, Institute of Service Science
