To Explain or To Predict?
Predictive Analytics in IS Research
3rd Taiwan Summer Workshop
on Information Management
July 2015
Galit Shmueli (徐茉莉)
www.galitshmueli.com
❶ 1994-2000
Israel Institute of Technology
MSc + PhD, Statistics
❷ 2000-2002
Carnegie Mellon Univ.
Visiting Assistant Prof.
Dept. of Statistics
❸ 2002-2012
Univ. of Maryland College Park
Assistant then Associate Prof. of Statistics & Management Science
R H Smith School of Business
2008-2014
Rigsum Institute (Bhutan)
Co-Director, Rigsum Research Lab
❹ 2011-2014
Indian School of Business
SRITNE Chaired Prof. of Data Analytics, Associate Prof. of Statistics & Info Systems
2014-… NTHU
Institute of Service Science
Director, Center for Service Innovation & Analytics
Research in Data Analytics
www.galitshmueli.com
• Statistical strategy
• ‘Entrepreneurial’ statistical &
data mining modeling (new
conditions & environments)
• Business analytics
In progress…
www.iss.nthu.edu.tw
Road Map
Definitions
Explanatory-dominated MIS
Explanatory modeling ≠ predictive modeling
Why?
Different modeling paths
Explanatory power vs. predictive power
How do I use this?
Definitions
Explanatory modeling:
Theory-based, statistical testing of
causal hypotheses
Explanatory power:
Strength of relationship in statistical
model
Definitions
Predictive modeling:
Empirical method for predicting new
observations
Predictive power:
Ability to accurately predict new
observations
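To make the two definitions concrete, here is a minimal sketch (hypothetical data and variable names, not from the talk) of how each kind of power is typically assessed: explanatory power from the strength of estimated relationships in the fitted model, predictive power from accuracy on new (holdout) observations.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Hypothetical data: y depends on two predictors plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=500)

# Set aside new observations for assessing predictive power
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=1)

# Explanatory power: strength of relationships in the statistical model (in-sample)
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.rsquared)   # R-squared: strength of relationship
print(ols.pvalues)    # significance of each coefficient

# Predictive power: accuracy on new observations (out-of-sample)
y_pred = ols.predict(sm.add_constant(X_hold))
rmse = np.sqrt(np.mean((y_hold - y_pred) ** 2))
print(rmse)           # holdout RMSE
```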
Explain
Predict
Describe
Matching Game
Social Sciences
(MIS included)
Machine
learning
Statistics
Statistical modeling in
MIS research
Purpose: test causal theory (“explain”)
Association-based statistical models
Prediction nearly absent
Explanatory modeling à-la MIS
Start with a causal theory
→ Generate causal hypotheses on constructs
→ Operationalize constructs into measurable variables
→ Fit statistical model
→ Statistical inference → causal conclusions
In MIS,
data analysis is mainly used for testing
causal theory.
“If it explains, it predicts”
“Empirical prediction alone
is un-scientific”
Some statisticians share this view:
The two goals in analyzing data... I prefer to describe
as “management” and “science”. Management seeks
profit... Science seeks truth.
- Parzen, Statistical Science 2001
Prediction in top research journals in
Information Systems
Predictive goal?
Predictive modeling?
Predictive assessment?
1990-2006
52 “predictive” articles among 1,072
in Information Systems top journals
Why Predict? for Scientific Research
generate new theory
develop measures
compare theories
improve theory
assess relevance
evaluate predictability
Shmueli & Koppius, "Predictive Analytics in IS Research", MIS Quarterly, 2011
“A good explanatory model will also
predict well”
“You must understand the underlying
causes in order to predict”
Philosophy of Science
“Explanation and prediction have the
same logical structure”
Hempel & Oppenheim, 1948
“It becomes pertinent to investigate the
possibilities of predictive procedures
autonomous of those used for explanation”
Helmer & Rescher, 1959
“Theories of social and human behavior
address themselves to two distinct goals of
science: (1) prediction and (2) understanding”
Dubin, Theory Building, 1969
Why statistical
explanatory modeling
differs from
predictive modeling
Explanatory Model:
Test/quantify causal effect for
“average” record in population
Predictive Model:
Predict new individual
observations
Different Scientific Goals
Different generalization
Theory vs. its manifestation
Notation
Theoretical constructs: X, Y
Causal theoretical model: Y = F(X)
Measurable variables: x, y
Statistical model: E(y) = f(x)
Four aspects
1. Theory – Data
2. Causation – Association
3. Retrospective – Prospective
4. Bias – Variance
Y = F(X)
E(y) = f(x)
“The goal of finding models that are
predictively accurate differs from the
goal of finding models that are true.”
Best explanatory model ≠ best predictive model
Point #1
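A compact way to state Point #1, following the bias–variance argument in Shmueli (2010), cited at the end of the deck: the expected prediction error (EPE) of an estimated model f̂ decomposes as

```latex
\mathrm{EPE}
  = \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible noise}}
  + \underbrace{\bigl[\mathrm{E}\,\hat{f}(x) - f(x)\bigr]^{2}}_{\text{bias}^{2}}
  + \underbrace{\operatorname{Var}\bigl(\hat{f}(x)\bigr)}_{\text{estimation variance}}
```

so a deliberately "wrong" (biased) model with low variance can have a smaller expected prediction error than the correctly specified model, which is why the best predictive model need not be the best explanatory model.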
Four aspects
1. Theory – Data
2. Causation – Association
3. Retrospective – Prospective
4. Bias – Variance
Y = F(X)
E(y) = f(x)
Predict ≠ Explain
“we tried to benefit from an extensive
set of attributes describing each of the
movies in the dataset. Those attributes
certainly carry a significant signal and
can explain some of the user behavior.
However… they could not help at all
for improving the [predictive]
accuracy.”
Bell et al., 2008
Predict ≠ Explain
Explain ≠ Predict
The FDA considers two products
bioequivalent if the 90% CI of the
relative mean of the generic to brand
formulation is within 80%-125%
“We are planning to… develop predictive models for bioavailability
and bioequivalence”
Lester M. Crawford, 2005
Acting Commissioner of Food & Drugs
“For a long time, we thought that
Tamoxifen was roughly 80%
effective for breast cancer
patients.
But now we know much more:
we know that it’s 100% effective
in 70%-80% of the patients, and
ineffective in the rest.”
Goal Definition → Study Design & Data Collection → Data Preparation → EDA → Variables? → Methods? → Evaluation, Validation & Model Selection → Model Use & Reporting
Study design & data collection
Observational or experiment?
Primary or secondary data?
Instrument (reliability + validity vs. measurement accuracy)
Hierarchical data
How much data?
How to sample?
Data preprocessing
missing values, partitioning, reduced-feature models
Data exploration & reduction
PCA, SVD, interactive visualization
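A minimal sketch of two of the steps named above, with made-up data and names (assumptions, not from the talk): partitioning the data before any model fitting, and reducing dimension with PCA for exploration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

# Hypothetical dataset: 1,000 records, 20 measured variables
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000)

# Partitioning: set aside a holdout sample *before* modeling,
# so predictive power is later assessed on truly new observations
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=7)

# Data reduction: project onto the first few principal components
pca = PCA(n_components=3)
scores = pca.fit_transform(X_train)   # component scores for exploration/plotting
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```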
Which Variables?
causation vs. association
endogeneity
multicollinearity?
ex-post availability
A, B, A*B?
Methods / Models
ensembles
shrinkage models
variance vs. bias
blackbox vs. interpretable
mapping to theory
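Shrinkage models trade a little bias for a large reduction in variance, which often improves predictive power even though the coefficients are deliberately biased. A hedged sketch on hypothetical data (not from the talk); LassoCV picks the shrinkage level by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical data: many predictors, few truly relevant
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=3)

# Unbiased OLS vs. biased-but-lower-variance lasso (shrinkage)
ols = LinearRegression().fit(X_train, y_train)
lasso = LassoCV(cv=5).fit(X_train, y_train)  # shrinkage level chosen by cross-validation

for name, model in [("OLS", ols), ("Lasso", lasso)]:
    rmse = mean_squared_error(y_hold, model.predict(X_hold)) ** 0.5
    print(name, round(rmse, 3))  # the shrunken model typically predicts better here
```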
Evaluation, Validation
& Model Selection
Training data → Empirical model → Holdout data
Predictive power
Over-fitting
analysis
Theoretical
model
Empirical
model
Data
Validation
Model fit ≠
Explanatory power
Inference
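An over-fitting analysis in miniature, on hypothetical data (the pattern, not the numbers, is the point): as model complexity grows, training error keeps falling while holdout error eventually turns back up.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical nonlinear data with noise
rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=(150, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=150)

x_train, x_hold, y_train, y_hold = train_test_split(x, y, test_size=0.4, random_state=5)

for degree in [1, 3, 5, 9, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_rmse = mean_squared_error(y_train, model.predict(x_train)) ** 0.5
    hold_rmse = mean_squared_error(y_hold, model.predict(x_hold)) ** 0.5
    # Training error keeps shrinking; holdout error rises once the model over-fits
    print(degree, round(train_rmse, 3), round(hold_rmse, 3))
```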
Model Use
test causal theory
generate new theory
develop measures
compare theories
improve theory
assess relevance
Evaluate predictability
Predictive performance
Over-fitting analysis
Null hypothesis
Naïve/baseline
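One way to evaluate predictability, as listed above, is to compare against a naïve baseline, the predictive analogue of a null hypothesis. A sketch under assumed data (DummyRegressor simply predicts the training-set mean):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical data with a modest signal
rng = np.random.default_rng(11)
X = rng.normal(size=(400, 4))
y = 0.6 * X[:, 0] + rng.normal(size=400)

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=11)

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)  # naïve benchmark
model = LinearRegression().fit(X_train, y_train)

for name, m in [("naive baseline", baseline), ("model", model)]:
    rmse = mean_squared_error(y_hold, m.predict(X_hold)) ** 0.5
    print(name, round(rmse, 3))  # if the model cannot beat the baseline, predictability is low
```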
Point #2
Explanatory power ≠ Predictive power
Cannot infer one from the other
Performance Metrics
Explanatory power: goodness-of-fit, R², p-values, type I & II errors, interpretation
Predictive power: out-of-sample performance, prediction accuracy, over-fitting, costs, training vs. holdout
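For classification, the predictive-power column translates into holdout error counts and, when mistakes have unequal consequences, costs. A small hedged sketch with hypothetical data and made-up cost figures:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Hypothetical binary outcome driven by two predictors
rng = np.random.default_rng(8)
X = rng.normal(size=(600, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=600) > 0).astype(int)

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=8)
clf = LogisticRegression().fit(X_train, y_train)

# Holdout confusion matrix: rows = actual, columns = predicted
tn, fp, fn, tp = confusion_matrix(y_hold, clf.predict(X_hold)).ravel()
accuracy = (tn + tp) / (tn + fp + fn + tp)

# Asymmetric costs (hypothetical): a missed positive costs 10x a false alarm
cost = 1 * fp + 10 * fn
print(accuracy, cost)
```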
The predictive power of an
explanatory model has important
scientific value
Relevance, reality check, predictability
Current State in Social Sciences
(and MIS)
“While the value of scientific prediction… is beyond
question… the inexact sciences [do not] have…the
use of predictive expertise well in hand.”
Helmer & Rescher, 1959
Distinction blurred
Unfamiliarity with predictive
modeling/assessment
Prediction underappreciated
How does this impact
Scientific Research?
State-of-the-art in Industry
Distinction blurred
Prediction over-appreciated
“Big Data” synonymous with prediction
How does this impact an
organization’s actions?
…and our lives?
What can be done?
Acknowledge difference
Learn/teach prediction
Leverage prediction in research
BUT
focus on its scientific uses:
generate new theory
develop measures
compare theories
improve theory
assess relevance
evaluate predictability
Why Predict? for Scientific Research
Shmueli & Koppius, “Predictive Analytics in IS Research”
MIS Quarterly, 2011
Shmueli (2010) “To Explain or To Predict?”, Statistical Science
Shmueli & Koppius (2011) “Predictive Analytics in IS Research”, MISQ
