Repurposing Classification & Regression Trees
for Causal Research with High-Dimensional Data
Galit Shmueli 徐茉莉
Institute of Service Science
WOMBAT 2019
Monash University
We tackle 2 key issues in causal research:
Self Selection
Identifying Confounders
A Tree-Based Approach
for Addressing Self-Selection
in Impact Studies with Big Data
Yahav, Shmueli & Mani (2016), A Tree-Based Approach for Addressing Self-Selection
in Impact Studies with Big Data, MIS Quarterly, vol 40 no 4, pp. 819-848.
With Inbal Yahav (Tel Aviv U) & Deepa Mani (Indian School of Business)
The Challenge in Impact Studies
• Individuals/firms self-select intervention
group/duration (quasi-experiment)
• Even in randomized experiments, some variables
might remain unbalanced in sample
How to identify and adjust for self-selection?
Randomized Experiment
Manipulation
Quasi-Experiment
(self-selection or administrator selection)
Manipulation
Self-Selection
Common Approaches
for Addressing Self-Selection
Two steps:
1. Selection model: T = f(X)
2. Performance analysis on matched samples
Y = performance measure(s)
T = treatment
X = pre-intervention variables
Self-selection:
P(T|X) ≠ P(T)
• 2SLS modeling (Heckman correction) -- econometrics
• Propensity Score Approach (PS) -- statistics
Propensity Scores Approach
Step 1: Estimate selection model logit(T) = f(X)
to compute propensity scores P(T|X)
Step 2: Use scores to create matched samples
PSM = use matching algorithm
PSS = divide scores into bins
Step 3: Estimate effect (compare matched groups)
e.g., t-test or Y = b0 + b1 T + b2 X + b3 PS + e
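The three steps can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes a single categorical pre-intervention variable, so P(T=1|X) is estimated as the treated share within each X profile (a stand-in for the logit model). The records and the helper names `propensity_by_profile` and `stratified_effect` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (X profile, treatment T, outcome Y)
records = [
    ("low", 1, 12.0), ("low", 1, 10.0), ("low", 0, 8.0),
    ("high", 1, 20.0), ("high", 0, 17.0), ("high", 0, 15.0),
]

def propensity_by_profile(rows):
    """Step 1: estimate P(T=1 | X) as the treated share within each X profile."""
    counts = defaultdict(lambda: [0, 0])  # profile -> [n_treated, n_total]
    for x, t, _ in rows:
        counts[x][0] += t
        counts[x][1] += 1
    return {x: n_t / n for x, (n_t, n) in counts.items()}

def stratified_effect(rows, n_bins=2):
    """Steps 2-3 (PSS): bin records by score, then average within-bin differences."""
    ps = propensity_by_profile(rows)
    bins = defaultdict(lambda: {0: [], 1: []})
    for x, t, y in rows:
        b = min(int(ps[x] * n_bins), n_bins - 1)  # equal-width score bins
        bins[b][t].append(y)
    total, weighted = 0, 0.0
    for groups in bins.values():
        if groups[0] and groups[1]:  # a bin needs both groups to be comparable
            n = len(groups[0]) + len(groups[1])
            diff = sum(groups[1]) / len(groups[1]) - sum(groups[0]) / len(groups[0])
            weighted += n * diff
            total += n
    return weighted / total

effect = stratified_effect(records)
```

On these made-up records the raw treated-minus-control difference is about 0.67, while the stratified estimate is 3.5, because the "low" profile is both more likely to be treated and has lower outcomes.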
Challenges of PS in Big Data
1. Matching leads to severe data loss
2. PS methods suffer from “data dredging”
3. No variable selection (what drives selection?)
4. Assumes constant intervention effect
5. Sequential process is computationally costly
6. Logistic model requires specifying exact form
of selection model
Our Proposed Solution:
Trees
Y, T, X → propensity scores P(T|X) → E(Y|T)
Even E(Y|T,X)
“Kill the Intermediary”
Proposed Method: Tree
Output: T (treat/control)
Inputs: X’s (income, education, family…)
Records in each terminal node
share same profile (X) and same
propensity score P(T=1|X)
Tree-Based Approach
Four steps:
1. Run selection model: fit tree T = f(X)
2. Visualize tree; see unbalanced X’s
3. Treat each terminal node as sub-sample;
conduct terminal-node-level performance
analysis
4. Present terminal-node-analyses visually
5. [optional]: combine analyses from nodes with
homogeneous effects
Like PS, assumes observable self-selection
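The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method as published: the tree is a bare-bones Gini-based partition on binary X's with no pruning or minimum node size, and the data, variable names (`degree`, `male`), and helper functions are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, features):
    """Pick the binary feature whose split most reduces Gini impurity of T."""
    parent = gini([r["T"] for r in rows])
    best, best_gain = None, 0.0
    for f in features:
        left = [r["T"] for r in rows if r[f] == 0]
        right = [r["T"] for r in rows if r[f] == 1]
        if not left or not right:
            continue
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if parent - child > best_gain + 1e-12:
            best, best_gain = f, parent - child
    return best

def grow(rows, features, path=()):
    """Step 1: fit T = f(X); return terminal nodes as (path, rows) pairs."""
    f = best_split(rows, features)
    if f is None:  # no split improves purity: this is a terminal node
        return [(path, rows)]
    rest = [g for g in features if g != f]
    return (grow([r for r in rows if r[f] == 0], rest, path + ((f, 0),))
            + grow([r for r in rows if r[f] == 1], rest, path + ((f, 1),)))

def node_effect(rows):
    """Step 3: treated-minus-control difference in mean Y within a node."""
    y1 = [r["Y"] for r in rows if r["T"] == 1]
    y0 = [r["Y"] for r in rows if r["T"] == 0]
    if not y1 or not y0:
        return None
    return sum(y1) / len(y1) - sum(y0) / len(y0)

# Hypothetical data: treatment is unbalanced on `degree`, balanced on `male`.
rows = []
for degree, t, y, n_per_gender in [(0, 1, 6.0, 1), (0, 0, 5.0, 3),
                                   (1, 1, 11.0, 3), (1, 0, 8.0, 1)]:
    for male in (0, 1):
        rows += [{"degree": degree, "male": male, "T": t, "Y": y}] * n_per_gender

leaves = grow(rows, ["degree", "male"])
effects = {path: node_effect(node_rows) for path, node_rows in leaves}
naive = node_effect(rows)
```

Here the tree splits only on `degree` (the unbalanced variable, step 2), and the node-level effects are 1.0 and 3.0, against a naive pooled difference of 4.0.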
Three Applications (MISQ 2016)
1. Impact of labor training on earnings
(Famous) randomized experiment by US gov
2. Impact of new online passport service on
bribing, efficiency,…
Quasi-experiment by India gov
3. Impact of outsourcing contract pricing &
duration on financial performance
Study 1: Impact of training on financial gains
In mid-1970’s US govt program randomly
assigned eligible candidates to labor training
program
• Goal: increase future earnings
• LaLonde (1986) showed:
Groups statistically equal in terms of demographic
& pre-train earnings
→ ATE = $1794 (p<0.004)
Tree on LaLonde’s RCT data
If groups are completely
balanced, we expect…
Y = Earnings in 1978
T = Received NSW training (T = 1) or not (T = 0)
X = Demographic information and prior earnings
Tree reveals…
                      LaLonde’s naïve approach   Tree approach:         Tree approach:
                      (experiment)               HS dropout (n=348)     HS degree (n=97)
Not trained (n=260)   $4,554                     $4,495                 $4,855
Trained (n=185)       $6,349                     $5,649                 $8,047
Training effect       $1,794 (p=0.004)           $1,154 (p=0.063)       $3,192 (p=0.015)
Overall: $1,598 (p=0.017)
Tree split: High school degree (no / yes)
1. Unbalanced variable (HS degree)
2. Heterogeneous effect
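As a quick arithmetic check on the node-level figures reported above, the overall estimate is consistent with a node-size-weighted average of the two node effects. That weighting is an assumption here; the slide does not say how the overall figure was combined.

```python
# Node-level estimates reported above: (records in node, training effect in $)
nodes = {"HS dropout": (348, 1154), "HS degree": (97, 3192)}

n_total = sum(n for n, _ in nodes.values())
overall = sum(n * effect for n, effect in nodes.values()) / n_total

print(n_total)         # 445, i.e. 260 not trained + 185 trained
print(round(overall))  # 1598
```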
Labor Training effect:
Observational control group
• LaLonde also compared with observational
control groups (PSID, CPS)
– experimental training group vs. obs control
– showed training effect not estimated correctly with
structural equations
• Dehejia & Wahba (1999,2002) re-analyzed CPS
control group (n=15,991), using PSM
– Effects in range [$1122, $1681], depends on settings
– “Best” setting effect: $1360
– Uses only 119 control group members (out of 15,991)
Tree for obs control group reveals…
unemployed prior to training
in 1974 (u74=0 )
-> negative effect outlier
eligibility
issue!
some profiles are rare in
trained group but
popular in control group
1. Unbalanced variables
2. Heterogeneous effect in u74
3. Outlier
4. Eligibility issue
Study 2:
Impact of eGov Initiative
(India)
Survey commissioned by Govt of India in 2006
• >9500 individuals who used passport services
• Representative sample of 13 Passport Offices
• “Quasi-experimental, non-equivalent groups design”
• Equal number of offline and online users, matched by
geography and demographics
Naïve Approach
Assess impact by
comparing
online/offline
performance stats
[Charts, online vs. offline: % bribe RPO, % use agent, % prefer online, % bribe police; panels: naïve, by aware / unaware]
Awareness of electronic services
provided by Government of India
Simpson’s
Paradox
1. Demographics properly balanced
2. Unbalanced variable (Awareness)
3. Heterogeneous effects on various y’s
+ even Simpson’s paradox
PSM
Awareness of electronic services
provided by Government of India
Would we detect this
with PSM?
Heterogeneous effect
Scaling Up to Big Data
• We inflated eGov dataset by bootstrap
• Up to 9,000,000 records and 360 variables
• 10 runs for each configuration: runtime for tree
20 sec
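Inflating the record count by bootstrap amounts to resampling rows with replacement. A minimal sketch, assuming simple row-level resampling; the slides do not say how the variable count was inflated, so that part is not shown. The function name and toy rows are hypothetical.

```python
import random

def inflate_by_bootstrap(rows, target_n, seed=0):
    """Resample rows with replacement until the dataset has target_n records."""
    rng = random.Random(seed)  # fixed seed so repeated runs are comparable
    return rng.choices(rows, k=target_n)

# Hypothetical survey rows: (online user?, paid bribe?)
small = [(1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]
big = inflate_by_bootstrap(small, target_n=1000)
```

Every inflated record is a copy of an original one, so the joint distribution of the variables is preserved up to sampling noise while n grows.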
Big Data Simulation
Interventions: binary, T = {0, 1}; continuous, T ∼ N
Sample sizes (n): 10K, 100K, 1M
Number of pre-intervention variables (p): 4, 50 (+ interactions)
Pre-intervention variable types: binary, Likert-scale, continuous
Outcome variable types: binary, continuous
Selection models:
#1: logit P(T=1) = b0 + b1 x1 + … + bp xp
#2: logit P(T=1) = b0 + b1 x1 + … + bp xp + interactions
Intervention effects:
1. Homogeneous
Control: E(Y | T = 0) = 0.5
Intervention: E(Y | T = 1) = 0.7
2. Heterogeneous
Control: E(Y | T = 0) = 0.5
Intervention: E(Y | T = 1, X1=0) = 0.7
E(Y | T = 1, X1=1) = 0.3
1. Homogeneous
Control: E(Y | T = 0) = 0
Intervention: E(Y | T = 1) = 1
2. Heterogeneous
Control: E(Y | T = 0) = 0
Intervention: E(Y | T = 1, X1=0) = 1
E(Y | T = 1, X1=1) = -1
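One cell of this design can be sketched as a data generator. This is an illustrative sketch only: it takes the binary-outcome heterogeneous case (0.5 for control, 0.7 and 0.3 for treated by X1), uses a single pre-intervention variable, and the selection coefficients `b0` and `b1` are hypothetical values, not taken from the study.

```python
import math
import random

def simulate(n, b0=-0.5, b1=1.0, seed=1):
    """Binary outcome, heterogeneous effect, logistic selection on X1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.randint(0, 1)
        # Selection model: logit P(T=1) = b0 + b1 * x1
        p_treat = 1 / (1 + math.exp(-(b0 + b1 * x1)))
        t = 1 if rng.random() < p_treat else 0
        if t == 0:
            p_y = 0.5                       # E(Y | T=0)
        else:
            p_y = 0.7 if x1 == 0 else 0.3   # E(Y | T=1, X1)
        y = 1 if rng.random() < p_y else 0
        data.append((x1, t, y))
    return data

def mean_y(data, t, x1=None):
    """Mean outcome for treatment group t, optionally within an X1 level."""
    ys = [y for (x, tt, y) in data if tt == t and (x1 is None or x == x1)]
    return sum(ys) / len(ys)

data = simulate(50_000)
```

Comparing `mean_y(data, 1)` with `mean_y(data, 0)` mixes the two X1 profiles, whereas conditioning on X1 recovers the opposite-signed effects.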
Example: Heterogeneous Effect
Big Data Scalability
Theoretical Complexity:
• O(mn/p) for binary X
• O(m/p · n log(n)) for continuous X
Runtime as function of sample size, dimension
Scaling Trees Even Further
• “Big Data” in research vs. industry
• Industrial scaling
– Sequential trees: efficient data structure, access
(SPRINT, SLIQ, RainForest)
– Parallel computing (parallel SPRINT, ScalParC,
SPARK, PLANET) “as long as split metric can be
computed on subsets of the training data and
later aggregated, PLANET can be easily extended”
Tree Approach Benefits
1. Data-driven selection model
2. Scales up to Big Data
3. Fewer user choices (data dredging)
4. Nuanced insights
• Detect unbalanced variables
• Detect heterogeneous effect from anticipated outcomes
5. Simple to communicate
6. Automatic variable selection
7. Missing values do not remove record
8. Binary, multiple, continuous interventions
9. Post-analysis of RCTs, quasi-experiments & observational studies
Tree Approach Limits
1. Assumes selection on observables
2. Need sufficient data
3. Continuous variables can lead to large tree
4. Instability
[possible solution: use variable importance scores (forest)]
Detecting
Simpson’s Paradox
in Big Data
Using Trees
Shmueli & Yahav (2017), The Forest or the Trees? Tackling Simpson’s Paradox with
Classification Trees, Production & Operations Management Journal, Forthcoming
With Inbal Yahav, Tel Aviv University
Simpson’s Paradox
The direction of a cause's effect appears reversed when
comparing the aggregate sample (or population) with its disaggregated subgroups
Simpson's Paradox is the reversal
of an association between two
variables after a third variable is
taken into account
Schield (1999)
The phenomenon whereby an event B
increases the probability of A in a given
population p, at the same time, decreases the
probability of A in every subpopulation of p.
Pearl (2009)
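The definitions above translate into a simple check: compare the sign of the pooled association with its sign in every subgroup. A minimal sketch with made-up counts; the function name and the table layout are hypothetical.

```python
def rate(events, total):
    return events / total

def simpson_check(table):
    """table: {z: {x: (events, total)}} with x in {0, 1}.

    Returns (reversed_everywhere, pooled_diff, subgroup_diffs), where each
    diff is P(Y | X=1) - P(Y | X=0).
    """
    pooled = {0: [0, 0], 1: [0, 0]}
    sub_diffs = []
    for by_x in table.values():
        for x, (events, total) in by_x.items():
            pooled[x][0] += events
            pooled[x][1] += total
        sub_diffs.append(rate(*by_x[1]) - rate(*by_x[0]))
    pooled_diff = rate(*pooled[1]) - rate(*pooled[0])
    # Opposite signs give a negative product in every subgroup.
    reversed_everywhere = all(d * pooled_diff < 0 for d in sub_diffs)
    return reversed_everywhere, pooled_diff, sub_diffs

# Hypothetical counts: X=1 does better within each Z group but worse overall.
table = {
    "A": {1: (90, 100), 0: (800, 1000)},
    "B": {1: (300, 1000), 0: (20, 100)},
}
flag, pooled_diff, sub_diffs = simpson_check(table)
```

With these counts the within-group differences are both +0.1, while the pooled difference is about -0.39, so the check flags a reversal.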
Death Sentence and Race
(Agresti, 1984)
Does defendant's race (X) affect
chance of death sentence (Y)?
Causal explanation:
Black murderers tend to kill blacks;
hence lower overall death sentence rates
Causal effect seems
to reverse when
disaggregating by
victim race (Z)
Goal: Does a dataset exhibit SP?
A = cause
E = effect
C = confounder
P(E|C) – P(E|C’)   vs.   P(E|A) – P(E|A’)
“If Cornfield’s minimum effect size is not reached,
[you] can assume no causality” Schield, 1999
Cornfield et al’s Criterion
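The two differences in the criterion can be computed directly from records. This sketch assumes they are meant to be compared, with the confounder's difference needing to be at least as large as the cause's for the confounder to account for the association. That reading of the direction is an assumption, and the data and function names are hypothetical.

```python
def risk_difference(records, factor):
    """P(E | factor) - P(E | not factor) from dicts with 0/1 fields."""
    with_f = [r["E"] for r in records if r[factor] == 1]
    without = [r["E"] for r in records if r[factor] == 0]
    return sum(with_f) / len(with_f) - sum(without) / len(without)

def expand(a, c, events, total):
    """Turn one cell of counts into individual records."""
    return ([{"A": a, "C": c, "E": 1}] * events
            + [{"A": a, "C": c, "E": 0}] * (total - events))

# Hypothetical counts per (A, C) cell: (events, total)
records = (expand(1, 1, 800, 1000) + expand(0, 1, 90, 100)
           + expand(1, 0, 20, 100) + expand(0, 0, 300, 1000))

cause_diff = risk_difference(records, "A")       # P(E|A) - P(E|A')
confounder_diff = risk_difference(records, "C")  # P(E|C) - P(E|C')
meets_minimum = confounder_diff >= cause_diff
```

In this made-up example the cause's difference is about 0.39 and the confounder's about 0.52, so C is strong enough to be a candidate explanation; within each level of C the effect of A is in fact -0.1.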
Translate Cornfield’s Criterion
into a Tree
Y = outcome of interest
X = causal variable
Z = confounding variable(s)
Tree Predictors
#1
If cause -> effect, then
cause should appear in tree
#2
If Z is confounder, then
Z should appear in tree
5 potential tree structures
- single causal variable (X)
- single confounding variable (Z)
Which might exhibit Simpson’s Paradox?
P(E|C) – P(E|C’)   vs.   P(E|A) – P(E|A’)
Simpson’s Paradox on a Tree
#1
If cause -> effect, then cause
should appear in tree
#2
If Z is confounder, then
Z should appear in tree
#3
Z should appear before cause
(Cornfield criterion)
Death Sentence and Race:
Tree Approach #1: full tree
P(death)
Accounting for Sampling Error
Logistic/linear regression:
Interaction X*Z significant?
No → no paradox
Yes → ?
Trees:
Tree structure + significance of interaction
= conditional-inference tree
Tree splits based on statistical tests
(χ², F, permutation tests)
Tree Approach #2:
Conditional Inference tree
(Hothorn et al., JCGS 2006)
Variable selection based on statistical test (χ²)
• Recursive partitioning with early stopping
• Separate steps for variable selection and split search
• R packages party, partykit (function ctree)
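The test-based variable selection with early stopping can be illustrated outside R. The sketch below is not `ctree`: it is a simplified stand-in that uses a plain Pearson χ² test on 2x2 tables, without the multiplicity adjustment or permutation framework of conditional inference trees. The function names and counts are hypothetical.

```python
import math

def chi2_2x2(table):
    """Pearson chi-square statistic and p-value (1 df) for a 2x2 count table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = row[i] * col[j] / n
            stat += (obs - expected) ** 2 / expected
    # For 1 df, P(chi2 > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def select_split(candidates, alpha=0.05):
    """Variable selection step: smallest p-value wins; None means stop splitting."""
    scored = {name: chi2_2x2(t)[1] for name, t in candidates.items()}
    best = min(scored, key=scored.get)
    return best if scored[best] < alpha else None

# Hypothetical candidate splits: rows = variable level, columns = outcome (yes, no)
candidates = {
    "z1": [[30, 10], [10, 30]],
    "z2": [[10, 10], [10, 10]],
}
chosen = select_split(candidates)
```

Only after a variable passes the test would the split point be searched, which keeps variable selection and split search as separate steps.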
Cornfield’s criterion + sampling error:
Conditional Inference Trees
Proof for trees that use concave impurity
measure (Gini, entropy) as well as χ²
CART, CHAID, Conditional-Inference Trees
Accounting for Sampling Error:
Conditional-Inference Tree
P(death)
Seatbelts and Injuries (Agresti 2012)
Does use of seat-belts (X) reduce chance of injury (Y)?
Z = Passenger gender and accident location
n=68,694 passengers involved in accidents in Maine
Potential Paradox
(by location)
How about logistic regression?
% Injuries
Simpson’s Paradox in Big Data
Large n , High-dimensional Z
Multiple Potential Confounders (Z)
The Challenge
Statistical significance of
Simpson’s paradox
≠
Significance threshold of
tree splits in CI tree
[Figure: CI tree vs. full tree]
Solution: X-Terminal Tree
Paradox Detection in Big Data
(Tree Approach #3):
X-Terminal Trees
X-Terminal Tree:
Grow tree only
until X-splits
Tree paths with terminal X nodes
can indicate…
• Full paradox, statistically significant
• Partial paradox, statistically significant
• Statistically insignificant paradox
• No paradox
Pivot table equivalence:
Filter by Z variables above terminal X node
Impact of eGov Initiative
(India)
Survey commissioned by Govt of India in 2006
• >9500 individuals who used passport services
• Representative sample of 13 Passport Offices
• “Quasi-experimental, non-equivalent groups design”
• Equal number of offline and online users, matched by
geography and demographics
Y = police bribe (0/1)
X = online/offline
Z = {demographics; survey Qs}
[Tree figure annotations: split p=.32; paradox p=0.003; paradox p=0.16; no paradox]
Kidney Allocation in USA
(104,000 patients, 19 confounders)
Is the kidney allocation system racist?
Type 4 tree, but no significant Simpson’s paradox
detected!
Y = waiting time (days)
X = patient race
Z = {patient demog, health, bio}
• Greediness of tree
• Weak paradox or in
small subset of data
can go undetected
• Highly correlated Z’s
might lead to
“wrong” Z choice
Summary & Challenges
Full tree: eliminate non-type-4 trees
Conditional-inference trees: for single Z
X-terminal trees: for multiple Zs
• More efficient than
stepwise regression
• Tree structure more
informative than
interaction terms
• Extends: continuous Y,
>2 subpopulations
We tackle 2 key issues in causal research:
Self Selection
Identifying Confounders
Analytics
Humanity
Responsibility
Galit Shmueli 徐茉莉
Institute of Service Science

More Related Content

What's hot

Statistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, DescribingStatistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, Describing
Galit Shmueli
 
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)Predictive analytics in Information Systems Research (TSWIM 2015 keynote)
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)
Galit Shmueli
 
Nbe rtopicsandrecomvlecture1
Nbe rtopicsandrecomvlecture1Nbe rtopicsandrecomvlecture1
Nbe rtopicsandrecomvlecture1
NBER
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
gokulprasath06
 
Statistical and Predictive Modelling
Statistical and Predictive ModellingStatistical and Predictive Modelling
Statistical and Predictive Modelling
JMP software from SAS
 
Alleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal ModelsAlleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal Models
Amit Sharma
 
Lecture 4: NBERMetrics
Lecture 4: NBERMetricsLecture 4: NBERMetrics
Lecture 4: NBERMetricsNBER
 
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
台灣資料科學年會
 
孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)
孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)
孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)
台灣資料科學年會
 
Exploratory data analysis project
Exploratory data analysis project Exploratory data analysis project
Exploratory data analysis project
BabatundeSogunro
 
Lecture 7
Lecture 7Lecture 7
Lecture 7butest
 
Gbs1
Gbs1Gbs1
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
Bill Liu
 
Data cleaning and screening
Data cleaning and screeningData cleaning and screening
Data cleaning and screening
Hassan Hussein
 
Statistical Modeling: The Two Cultures
Statistical Modeling: The Two CulturesStatistical Modeling: The Two Cultures
Statistical Modeling: The Two Cultures
Christoph Molnar
 
Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion
antimo musone
 
Data Analysis
Data AnalysisData Analysis
Data Analysis
sikander kushwaha
 
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
IDEAS - Int'l Data Engineering and Science Association
 

What's hot (20)

Statistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, DescribingStatistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, Describing
 
Shmueli
ShmueliShmueli
Shmueli
 
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)Predictive analytics in Information Systems Research (TSWIM 2015 keynote)
Predictive analytics in Information Systems Research (TSWIM 2015 keynote)
 
Nbe rtopicsandrecomvlecture1
Nbe rtopicsandrecomvlecture1Nbe rtopicsandrecomvlecture1
Nbe rtopicsandrecomvlecture1
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
Statistical and Predictive Modelling
Statistical and Predictive ModellingStatistical and Predictive Modelling
Statistical and Predictive Modelling
 
Alleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal ModelsAlleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal Models
 
Lecture 4: NBERMetrics
Lecture 4: NBERMetricsLecture 4: NBERMetrics
Lecture 4: NBERMetrics
 
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
 
孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)
孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)
孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)
 
Exploratory data analysis project
Exploratory data analysis project Exploratory data analysis project
Exploratory data analysis project
 
Lecture 7
Lecture 7Lecture 7
Lecture 7
 
Gbs1
Gbs1Gbs1
Gbs1
 
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
 
Data cleaning and screening
Data cleaning and screeningData cleaning and screening
Data cleaning and screening
 
Eda sri
Eda sriEda sri
Eda sri
 
Statistical Modeling: The Two Cultures
Statistical Modeling: The Two CulturesStatistical Modeling: The Two Cultures
Statistical Modeling: The Two Cultures
 
Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion
 
Data Analysis
Data AnalysisData Analysis
Data Analysis
 
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
 

Similar to Repurposing Classification & Regression Trees for Causal Research with High-Dimensional Data

A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...
A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...
A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...
Galit Shmueli
 
SPSS statistics - get help using SPSS
SPSS statistics - get help using SPSSSPSS statistics - get help using SPSS
SPSS statistics - get help using SPSS
csula its training
 
A Tree-Based Approach for Addressing Self-selection in Impact Studies with B...
A Tree-Based Approach  for Addressing Self-selection in Impact Studies with B...A Tree-Based Approach  for Addressing Self-selection in Impact Studies with B...
A Tree-Based Approach for Addressing Self-selection in Impact Studies with B...
Galit Shmueli
 
Stat-Lesson.pptx
Stat-Lesson.pptxStat-Lesson.pptx
Stat-Lesson.pptx
JennilynFeliciano2
 
GradTrack: Getting Started with Statistics September 20, 2018
GradTrack: Getting Started with Statistics September 20, 2018GradTrack: Getting Started with Statistics September 20, 2018
GradTrack: Getting Started with Statistics September 20, 2018
Nancy Garmer
 
GradTrack: Getting Started with Statistics September 20, 2018
GradTrack: Getting Started with Statistics September 20, 2018GradTrack: Getting Started with Statistics September 20, 2018
GradTrack: Getting Started with Statistics September 20, 2018
Evans Library at Florida Institute of Technology
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research data
Atula Ahuja
 
Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01Henock Beyene
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research data
Atula Ahuja
 
ecir2019tutorial-finalised
ecir2019tutorial-finalisedecir2019tutorial-finalised
ecir2019tutorial-finalised
Tetsuya Sakai
 
Introduction to spss
Introduction to spssIntroduction to spss
Introduction to spss
Manish Parihar
 
Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learning
Sadia Zafar
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
diogor21atlas
 
An Introduction to SPSS
An Introduction to SPSSAn Introduction to SPSS
An Introduction to SPSS
Rayman Soe
 
Basic Level Quantitative Analysis Using SPSS.ppt
Basic Level Quantitative Analysis Using SPSS.pptBasic Level Quantitative Analysis Using SPSS.ppt
Basic Level Quantitative Analysis Using SPSS.ppt
Dr. Imran Ghaffar Sulehri
 
sigir2018tutorial
sigir2018tutorialsigir2018tutorial
sigir2018tutorial
Tetsuya Sakai
 
Topic_6
Topic_6Topic_6
Topic_6butest
 
ecir2019tutorial
ecir2019tutorialecir2019tutorial
ecir2019tutorial
Tetsuya Sakai
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics
Bahzad5
 

Similar to Repurposing Classification & Regression Trees for Causal Research with High-Dimensional Data (20)

A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...
A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...
A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Bi...
 
SPSS statistics - get help using SPSS
SPSS statistics - get help using SPSSSPSS statistics - get help using SPSS
SPSS statistics - get help using SPSS
 
A Tree-Based Approach for Addressing Self-selection in Impact Studies with B...
A Tree-Based Approach  for Addressing Self-selection in Impact Studies with B...A Tree-Based Approach  for Addressing Self-selection in Impact Studies with B...
A Tree-Based Approach for Addressing Self-selection in Impact Studies with B...
 
Stat-Lesson.pptx
Stat-Lesson.pptxStat-Lesson.pptx
Stat-Lesson.pptx
 
GradTrack: Getting Started with Statistics September 20, 2018
GradTrack: Getting Started with Statistics September 20, 2018GradTrack: Getting Started with Statistics September 20, 2018
GradTrack: Getting Started with Statistics September 20, 2018
 
GradTrack: Getting Started with Statistics September 20, 2018
GradTrack: Getting Started with Statistics September 20, 2018GradTrack: Getting Started with Statistics September 20, 2018
GradTrack: Getting Started with Statistics September 20, 2018
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research data
 
Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research data
 
ecir2019tutorial-finalised
ecir2019tutorial-finalisedecir2019tutorial-finalised
ecir2019tutorial-finalised
 
Introduction to spss
Introduction to spssIntroduction to spss
Introduction to spss
 
Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learning
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
An Introduction to SPSS
An Introduction to SPSSAn Introduction to SPSS
An Introduction to SPSS
 
Basic Level Quantitative Analysis Using SPSS.ppt
Basic Level Quantitative Analysis Using SPSS.pptBasic Level Quantitative Analysis Using SPSS.ppt
Basic Level Quantitative Analysis Using SPSS.ppt
 
sigir2018tutorial
sigir2018tutorialsigir2018tutorial
sigir2018tutorial
 
Topic_6
Topic_6Topic_6
Topic_6
 
ecir2019tutorial
ecir2019tutorialecir2019tutorial
ecir2019tutorial
 
DATA COLLECTION IN RESEARCH
DATA COLLECTION IN RESEARCHDATA COLLECTION IN RESEARCH
DATA COLLECTION IN RESEARCH
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics
 

More from Galit Shmueli

“Improving” prediction of human behavior using behavior modification
“Improving” prediction of human behavior using behavior modification“Improving” prediction of human behavior using behavior modification
“Improving” prediction of human behavior using behavior modification
Galit Shmueli
 
Behavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare ResearchBehavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare Research
Galit Shmueli
 
Reinventing the Data Analytics Classroom
Reinventing the Data Analytics ClassroomReinventing the Data Analytics Classroom
Reinventing the Data Analytics Classroom
Galit Shmueli
 
Behavioral Big Data & Healthcare Research: Talk at WiDS Taipei
Behavioral Big Data & Healthcare Research: Talk at WiDS TaipeiBehavioral Big Data & Healthcare Research: Talk at WiDS Taipei
Behavioral Big Data & Healthcare Research: Talk at WiDS Taipei
Galit Shmueli
 
Workshop on Information Quality
Workshop on Information QualityWorkshop on Information Quality
Workshop on Information Quality
Galit Shmueli
 
Behavioral Big Data: Why Quality Engineers Should Care
Behavioral Big Data: Why Quality Engineers Should CareBehavioral Big Data: Why Quality Engineers Should Care
Behavioral Big Data: Why Quality Engineers Should Care
Galit Shmueli
 
Researcher Dilemmas using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...
Researcher Dilemmas  using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...Researcher Dilemmas  using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...
Researcher Dilemmas using Behavioral Big Data in Healthcare (INFORMS DMDA Wo...
Galit Shmueli
 
Prediction-based Model Selection in PLS-PM
Prediction-based Model Selection in PLS-PMPrediction-based Model Selection in PLS-PM
Prediction-based Model Selection in PLS-PM
Galit Shmueli
 
When Prediction Met PLS: What We learned in 3 Years of Marriage
When Prediction Met PLS: What We learned in 3 Years of MarriageWhen Prediction Met PLS: What We learned in 3 Years of Marriage
When Prediction Met PLS: What We learned in 3 Years of Marriage
Galit Shmueli
 
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...
Galit Shmueli
 
Research Using Behavioral Big Data (BBD)
Research Using Behavioral Big Data (BBD)Research Using Behavioral Big Data (BBD)
Research Using Behavioral Big Data (BBD)
Galit Shmueli
 
Analyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral Issues
Analyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral IssuesAnalyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral Issues
Analyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral Issues
Galit Shmueli
 
Information Quality: A Framework for Evaluating Empirical Studies
Information Quality: A Framework for Evaluating Empirical Studies Information Quality: A Framework for Evaluating Empirical Studies
Information Quality: A Framework for Evaluating Empirical Studies
Galit Shmueli
 
E.SUN Academic Award presentation (Jan 2016)
E.SUN Academic Award presentation (Jan 2016)E.SUN Academic Award presentation (Jan 2016)
E.SUN Academic Award presentation (Jan 2016)
Galit Shmueli
 
Big Data & Analytics in the Digital Creative Industries
Big Data & Analytics in the Digital Creative IndustriesBig Data & Analytics in the Digital Creative Industries
Big Data & Analytics in the Digital Creative Industries
Galit Shmueli
 
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)
On Information Quality: Can Your Data Do The Job? (SCECR 2015 Keynote)
Galit Shmueli
 
Introducing the NTHU-EZTABLE Kaggle Contest (Predicting Repeat Restaurant Boo...
Introducing the NTHU-EZTABLE Kaggle Contest (Predicting Repeat Restaurant Boo...Introducing the NTHU-EZTABLE Kaggle Contest (Predicting Repeat Restaurant Boo...
Introducing the NTHU-EZTABLE Kaggle Contest (Predicting Repeat Restaurant Boo...
Galit Shmueli
 
Opening Data With Kaggle
Opening Data With KaggleOpening Data With Kaggle
Opening Data With Kaggle
Galit Shmueli
 
Linear Probability Models and Big Data: Kosher or Not?
Linear Probability Models and Big Data: Kosher or Not?Linear Probability Models and Big Data: Kosher or Not?
Linear Probability Models and Big Data: Kosher or Not?
Galit Shmueli
 
Prediction, Explanation and the Business Analytics Toolkit
Prediction, Explanation and the Business Analytics Toolkit Prediction, Explanation and the Business Analytics Toolkit
Prediction, Explanation and the Business Analytics Toolkit
Galit Shmueli
 

More from Galit Shmueli (20)

“Improving” prediction of human behavior using behavior modification
“Improving” prediction of human behavior using behavior modification“Improving” prediction of human behavior using behavior modification
“Improving” prediction of human behavior using behavior modification
 
Behavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare ResearchBehavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare Research
 
Reinventing the Data Analytics Classroom
Reinventing the Data Analytics ClassroomReinventing the Data Analytics Classroom
Reinventing the Data Analytics Classroom
 
Behavioral Big Data & Healthcare Research: Talk at WiDS Taipei

Repurposing Classification & Regression Trees for Causal Research with High-Dimensional Data

  • 1. Repurposing Classification & Regression Trees for Causal Research with High-Dimensional Data Galit Shmueli 徐茉莉 Institute of Service Science WOMBAT 2019 Monash University
  • 2. We tackle 2 key issues in causal research: (1) Self-Selection (2) Identifying Confounders
  • 3. A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Big Data Yahav, Shmueli & Mani (2016), A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Big Data, MIS Quarterly, vol 40 no 4, pp. 819-848. With Inbal Yahav (Tel Aviv U) & Deepa Mani (Indian School of Business)
  • 4. The Challenge in Impact Studies • Individuals/firms self-select intervention group/duration (quasi-experiment) • Even in randomized experiments, some variables might remain unbalanced in sample How to identify and adjust for self-selection?
  • 6. Quasi-Experiment (self-selection or administrator selection) Manipulation Self Selection
  • 7. Common Approaches for Addressing Self-Selection Two steps: 1. Selection model: T = f(X) 2. Performance analysis on matched samples. Notation: Y = performance measure(s), T = treatment, X = pre-intervention variables. Self-selection: P(T|X) ≠ P(T) • 2SLS modeling (Heckman correction) -- econometrics • Propensity Score Approach (PS) -- statistics
  • 8. Propensity Scores Approach Step 1: Estimate selection model logit(T) = f(X) to compute propensity scores P(T|X) Step 2: Use scores to create matched samples (PSM = use matching algorithm; PSS = divide scores into bins) Step 3: Estimate effect (compare matched groups), e.g., t-test or Y = b0 + b1 T + b2 X + b3 PS + e
  • 9. Challenges of PS in Big Data 1. Matching leads to severe data loss 2. PS methods suffer from “data dredging” 3. No variable selection (what drives selection?) 4. Assumes constant intervention effect 5. Sequential process is computationally costly 6. Logistic model requires specifying exact form of selection model
  • 10. Our Proposed Solution: Trees Propensity scores P(T|X) Y, T, X E(Y|T) Even E(Y|T,X) “Kill the Intermediary”
  • 11. Proposed Method: Tree Output: T (treat/control) Inputs: X’s (income, education, family…) Records in each terminal node share same profile (X) and same propensity score P(T=1|X)
  • 12. Tree-Based Approach Four steps: 1. Run selection model: fit tree T = f(X) 2. Visualize tree; see unbalanced X’s 3. Treat each terminal node as sub-sample; conduct terminal-node-level performance analysis 4. Present terminal-node-analyses visually 5. [optional]: combine analyses from nodes with homogeneous effects Like PS, assumes observable self-selection
  • 13. Three Applications (MISQ 2016) 1. Impact of labor training on earnings (Famous) randomized experiment by US gov 2. Impact of new online passport service on bribing, efficiency,… Quasi-experiment by India gov 3. Impact of outsourcing contract pricing & duration on financial performance
  • 14. Study 1: Impact of training on financial gains In mid-1970’s US govt program randomly assigned eligible candidates to labor training program • Goal: increase future earnings • LaLonde (1986) showed: Groups statistically equal in terms of demographic & pre-train earnings → ATE = $1794 (p<0.004)
  • 15. Tree on LaLonde’s RCT data If groups are completely balanced, we expect… Y = Earnings in 1978 T = Received NSW training (T = 1) or not (T = 0) X = Demographic information and prior earnings
  • 16. Tree reveals… Split variable: High school degree (no / yes). LaLonde’s naïve approach (experiment): Not trained (n=260) $4,554; Trained (n=185) $6,349; Training effect $1,794 (p=0.004). Tree approach, HS dropout (n=348): Not trained $4,495; Trained $5,649; Training effect $1,154 (p=0.063). Tree approach, HS degree (n=97): Not trained $4,855; Trained $8,047; Training effect $3,192 (p=0.015). Tree approach, overall: $1,598 (p=0.017). 1. Unbalanced variable (HS degree) 2. Heterogeneous effect
  • 17. Labor Training effect: Observational control group • LaLonde also compared with observational control groups (PSID, CPS) – experimental training group vs. obs control – showed training effect not estimated correctly with structural equations • Dehejia & Wahba (1999,2002) re-analyzed CPS control group (n=15,991), using PSM – Effects in range [$1122, $1681], depends on settings – “Best” setting effect: $1360 – Uses only 119 control group members (out of 15,991)
  • 18. Tree for obs control group reveals… 1. Unbalanced variables 2. Heterogeneous effect in u74 3. Outlier 4. Eligibility issue. Figure annotations: unemployed prior to training in 1974 (u74=0) -> negative effect; outlier; eligibility issue! some profiles are rare in trained group but popular in control group
  • 19. Study 2: Impact of eGov Initiative (India) Survey commissioned by Govt of India in 2006 • >9500 individuals who used passport services • Representative sample of 13 Passport Offices • “Quasi-experimental, non-equivalent groups design” • Equal number of offline and online users, matched by geography and demographics
  • 20. Naïve Approach Assess impact by comparing online/offline performance stats
  • 21. [Figure: naive comparison vs. comparison by awareness (aware / unaware) of electronic services provided by Government of India; panels: % bribe RPO, % use agent, % prefer online, % bribe police; annotation: Simpson’s Paradox] 1. Demographics properly balanced 2. Unbalanced variable (Awareness) 3. Heterogeneous effects on various y’s + even Simpson’s paradox
  • 22. PSM: Awareness of electronic services provided by Government of India. Would we detect this with PSM?
  • 24. Scaling Up to Big Data • We inflated eGov dataset by bootstrap • Up to 9,000,000 records and 360 variables • 10 runs for each configuration: runtime for tree 20 sec
  • 25. Big Data Simulation. Intervention: binary, T = {0, 1}; continuous, T ∼ N. Sample sizes (n): 10K, 100K, 1M. # Pre-intervention variables (p): 4, 50 (+ interactions). Pre-intervention variable types: binary, Likert-scale, continuous. Outcome variable types: binary, continuous. Selection models: #1: P(T=1) = logit(b0 + b1 x1 + … + bp xp); #2: P(T=1) = logit(b0 + b1 x1 + … + bp xp + interactions). Intervention effects, first set: 1. Homogeneous: Control E(Y | T = 0) = 0.5; Intervention E(Y | T = 1) = 0.7. 2. Heterogeneous: Control E(Y | T = 0) = 0.5; Intervention E(Y | T = 1, X1 = 0) = 0.7, E(Y | T = 1, X1 = 1) = 0.3. Intervention effects, second set: 1. Homogeneous: Control E(Y | T = 0) = 0; Intervention E(Y | T = 1) = 1. 2. Heterogeneous: Control E(Y | T = 0) = 0; Intervention E(Y | T = 1, X1 = 0) = 1, E(Y | T = 1, X1 = 1) = -1
  • 27. Big Data Scalability Theoretical Complexity: • O(mn/p) for binary X • O((m/p) n log(n)) for continuous X Runtime as function of sample size, dimension
  • 28. Scaling Trees Even Further • “Big Data” in research vs. industry • Industrial scaling – Sequential trees: efficient data structure, access (SPRINT, SLIQ, RainForest) – Parallel computing (parallel SPRINT, ScalParC, SPARK, PLANET) “as long as split metric can be computed on subsets of the training data and later aggregated, PLANET can be easily extended”
  • 29. Tree Approach Benefits 1. Data-driven selection model 2. Scales up to Big Data 3. Fewer user choices (data dredging) 4. Nuanced insights • Detect unbalanced variables • Detect heterogeneous effect from anticipated outcomes 5. Simple to communicate 6. Automatic variable selection 7. Missing values do not remove record 8. Binary, multiple, continuous interventions 9. Post-analysis of RCTs, quasi-experiments & observational studies
  • 30. Tree Approach Limits 1. Assumes selection on observables 2. Need sufficient data 3. Continuous variables can lead to large tree 4. Instability [possible solution: use variable importance scores (forest)]
  • 31. Detecting Simpson’s Paradox in Big Data Using Trees Shmueli & Yahav (2017), The Forest or the Trees? Tackling Simpson’s Paradox with Classification Trees, Production & Operations Management Journal, Forthcoming With Inbal Yahav, Tel Aviv University
  • 32. Simpson’s Paradox The direction of a cause on an effect appears reversed when examining aggregate vs. disaggregate of a sample (or population). Simpson's Paradox is the reversal of an association between two variables after a third variable is taken into account (Schield, 1999). The phenomenon whereby an event B increases the probability of A in a given population p and, at the same time, decreases the probability of A in every subpopulation of p (Pearl, 2009).
  • 33. Death Sentence and Race (Agresti, 1984) Does defendant's race (X) affect chance of death sentence (Y)? Causal explanation: Black murderers tend to kill blacks; hence lower overall death sentence rates Causal effect seems to reverse when disaggregating by victim race (Z)
  • 34. Goal: Does a dataset exhibit SP? A = cause, C = confounder, E = effect. Cornfield et al.’s Criterion compares P(E|C) – P(E|C’) with P(E|A) – P(E|A’). “If Cornfield’s minimum effect size is not reached, [you] can assume no causality” (Schield, 1999)
  • 35. Translate Cornfield’s Criterion into a Tree Y = outcome of interest X = causal variable Z = confounding variable(s) Tree Predictors #1 If cause -> effect, then cause should appear in tree #2 If Z is confounder, then Z should appear in tree
  • 36. 5 potential tree structures - single causal variable (X) - single confounding variable (Z) Which might exhibit Simpson’s Paradox? Compare P(E|C) – P(E|C’) with P(E|A) – P(E|A’)
  • 37. Simpson’s Paradox on a Tree #1 If cause -> effect, then cause should appear in tree #2 If Z is confounder, then Z should appear in tree #3 Z should appear before cause (Cornfield criterion)
  • 38. Death Sentence and Race: Tree Approach #1: full tree P(death)
  • 39. Accounting for Sampling Error Logistic/linear regression: Interaction X*Z significant? No → no paradox Yes → ? Trees: Tree structure + significance of interaction = conditional-inference tree Tree splits based on statistical tests (χ², F, permutation tests)
  • 40. Tree Approach #2: Conditional Inference tree (Hothorn et al., JCGS 2006) Variable selection based on statistical test (χ²) • Recursive partitioning with early stopping • Separate steps for variable selection and split search • R packages party, partykit (function ctree)
  • 41. Cornfield’s criterion + sampling error: Conditional Inference Trees
  • 42. Proof for trees that use concave impurity measure (Gini, entropy) as well as χ²: CART, CHAID, Conditional-Inference Trees
  • 43. Accounting for Sampling Error: Conditional-Inference Tree P(death)
  • 44. Seatbelts and Injuries (Agresti 2012) Does use of seat-belts (X) reduce chance of injury (Y)? Z = Passenger gender and accident location n=68,694 passengers involved in accidents in Maine Potential Paradox (by location) How about logistic regression?
  • 46. Simpson’s Paradox in Big Data: Large n, High-dimensional Z
  • 47. Multiple Potential Confounders (Z) The Challenge: Statistical significance of Simpson’s paradox ≠ Significance threshold of tree splits in CI tree [Figure panels: CI Tree, Full Tree] Solution: X-Terminal Tree
  • 48. Paradox Detection in Big Data (Tree Approach #3): X-Terminal Trees X-Terminal Tree: Grow tree only until X-splits
  • 49. Tree paths with terminal X nodes can indicate… • Full paradox, statistically significant • Partial paradox, statistically significant • Statistically insignificant paradox • No paradox Pivot table equivalence: Filter by Z variables above terminal X node
  • 50. Impact of eGov Initiative (India) Survey commissioned by Govt of India in 2006 • >9500 individuals who used passport services • Representative sample of 13 Passport Offices • “Quasi-experimental, non-equivalent groups design” • Equal number of offline and online users, matched by geography and demographics
  • 51. Y = police bribe (0/1) X = online/offline Z = {demographics; survey Qs} Figure annotations: Split p=.32; Paradox p=0.003; Paradox p=0.16; No paradox
  • 52. Kidney Allocation in USA (104,000 patients, 19 confounders) Is the kidney allocation system racist? Type 4 tree, but no significant Simpson’s paradox detected! Y = waiting time (days) X = patient race Z = {patient demog, health, bio}
  • 53. Summary & Challenges. Summary: Full tree: eliminate non-type-4 trees; Conditional-inference trees: for single Z; X-terminal trees: for multiple Zs • More efficient than stepwise regression • Tree structure more informative than interaction terms • Extends: continuous Y, >2 subpopulations. Challenges: • Greediness of tree • Weak paradox or in small subset of data can go undetected • Highly correlated Z’s might lead to “wrong” Z choice
  • 54. We tackle 2 key issues in causal research: (1) Self-Selection (2) Identifying Confounders
  • 55. Analytics Humanity Responsibility Galit Shmueli 徐茉莉 Institute of Service Science
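
The terminal-node results on slide 16 can be tied back to the tree approach's overall estimate with a short calculation. The Python sketch below uses only the node sizes and effects printed on that slide. It assumes, which the slides do not state, that the overall effect is a node-size-weighted average of the terminal-node effects; the function name is hypothetical.

```python
def weighted_overall_effect(nodes):
    """Average terminal-node effects, weighting each node by its record count.

    `nodes` is a list of (node_size, estimated_effect) pairs.
    Assumption: weighting by node size; the slides do not state the formula.
    """
    total_n = sum(n for n, _ in nodes)
    return sum(n * effect for n, effect in nodes) / total_n


# (node size, estimated training effect in dollars), as printed on slide 16
lalonde_nodes = [
    (348, 1154),  # HS dropout
    (97, 3192),   # HS degree
]

overall = weighted_overall_effect(lalonde_nodes)
print(round(overall))  # prints 1598
```

Under this assumption the result agrees with the $1,598 overall effect reported for the tree approach, which suggests the slide's figure pools the two nodes in proportion to their sizes.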

Editor's Notes

  1. Heckman correction builds selection model based on economic theory
  2. Proof for Entropy
  3. Blue: no paradox (same direction as the overall effect); splitting stops when it reaches online/offline (X). Orange: although the p-value is very large, we still have a split.
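
The direction check behind note 3 and the categories on slide 49 (full, partial, or no paradox) can be sketched as a comparison of each subgroup's effect with the pooled effect. This Python sketch is an illustration only. The counts are hypothetical, the function names are invented, and it assumes that "full" means the direction reverses in every Z subgroup and "partial" means it reverses in some. It ignores statistical significance, which the talk handles through tree splits.

```python
def rate_difference(x1, x0):
    """Outcome-rate difference between X=1 and X=0; each is (events, total)."""
    return x1[0] / x1[1] - x0[0] / x0[1]


def pool(groups, key):
    """Sum (events, total) pairs for one X level across all Z subgroups."""
    events = sum(g[key][0] for g in groups.values())
    total = sum(g[key][1] for g in groups.values())
    return events, total


def paradox_status(groups):
    """Return 'full', 'partial', or 'none' (assumed definitions, see lead-in)."""
    overall = rate_difference(pool(groups, "x1"), pool(groups, "x0"))
    # A subgroup is "reversed" when its effect has the opposite sign
    # to the pooled effect.
    reversed_levels = [
        z for z, g in groups.items()
        if rate_difference(g["x1"], g["x0"]) * overall < 0
    ]
    if reversed_levels and len(reversed_levels) == len(groups):
        return "full"
    return "partial" if reversed_levels else "none"


# Hypothetical counts: (events, total) for X=1 and X=0 within each Z level
toy = {
    "z=a": {"x1": (9, 10), "x0": (80, 100)},
    "z=b": {"x1": (30, 100), "x0": (2, 10)},
}
print(paradox_status(toy))  # prints full
```

In the toy data the X=1 rate is higher within each Z level but lower after pooling, because the two X groups are distributed very differently across Z. This is the pivot-table view that slide 49 describes as filtering by the Z variables above a terminal X node.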