To explain or to predict

Galit Shmuéli
Israel Statistical Association
& Tel Aviv University
July 9, 2012

To Explain or To Predict?

Points for discussion: goo.gl/gcjlN

Twitter: #explainpredict

Road Map
Definitions
Explanatory-dominated social sciences
Explanatory ≠ predictive modeling
Why?
Different modeling paths
Explanatory vs. predictive power

So what?

Definitions

Explanatory modeling:
Theory-based, statistical testing of
causal hypotheses

Explanatory power:
Strength of relationship in statistical
model

Definitions

Predictive modeling:
Empirical method for predicting new
observations

Predictive power:
Ability to accurately predict new
observations

Statistical modeling in
social science research

Purpose: test causal theory (“explain”)
Association-based statistical models
Prediction nearly absent

Explanatory modeling à la social sciences
1. Start with a causal theory
2. Generate causal hypotheses on constructs
3. Operationalize constructs → Measurable variables
4. Fit statistical model
5. Statistical inference → Causal conclusions

In the social sciences,

data analysis is mainly used for testing
causal theory.

“If it explains, it predicts”

“Empirical prediction alone
is un-scientific”

Some statisticians share this view:

The two goals in analyzing data... I prefer to describe
as “management” and “science”. Management seeks
profit... Science seeks truth.

- Parzen, Statistical Science 2001

52 “predictive” articles among 1,072
in Information Systems top journals

Why Predict? for Scientific Research
new theory
develop measures
compare theories
improve theory
assess relevance
predictability

Shmueli & Koppius, “Predictive Analytics in IS Research”
(MISQ, 2011)

“A good explanatory model will also
predict well”
“You must understand the underlying
causes in order to predict”

Philosophy of Science
“Explanation and prediction have the
same logical structure”
Hempel & Oppenheim, 1948

“It becomes pertinent to investigate the
possibilities of predictive procedures
autonomous of those used for explanation”
Helmer & Rescher, 1959

“Theories of social and human behavior
address themselves to two distinct goals of
science: (1) prediction and (2) understanding”
Dubin, Theory Building, 1969

Why statistical explanatory modeling
differs from predictive modeling
Shmueli (2010), Statistical Science

Theory vs. its manifestation


Notation

Theoretical constructs: X, Y
Causal theoretical model: Y=F(X)
Measurable variables: X, Y
Statistical model: E(Y)=f(X)

Four aspects
Y=F(X) vs. E(Y)=f(X)
1. Theory – Data
2. Causation – Association
3. Retrospective – Prospective
4. Bias – Variance
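
The bias-variance aspect is usually stated through the decomposition of the expected prediction error (EPE) for a new observation at x under squared-error loss. The block below is a standard textbook form, not taken from the slides; the symbol \hat{f} (the model estimated from data) and the reading of f(x) as E(Y | x) are assumptions added here.

```latex
\begin{aligned}
\mathrm{EPE}(x) &= E\{Y - \hat{f}(x)\}^2 \\
  &= E\{Y - f(x)\}^2 + \{E[\hat{f}(x)] - f(x)\}^2 + E\{\hat{f}(x) - E[\hat{f}(x)]\}^2 \\
  &= \mathrm{Var}(Y \mid x) + \mathrm{Bias}^2 + \mathrm{Var}(\hat{f}(x))
\end{aligned}
```

Read this way, explanatory modeling concentrates on driving the bias term toward zero, while predictive modeling minimizes the sum of bias and variance and may accept some bias in exchange for lower variance.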

“The goal of finding models that are
predictively accurate differs from the
goal of finding models that are true.”

Point #1
Best explanatory model

≠
Best predictive model
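
Point #1 can be made concrete with a small simulation. The sketch below is illustrative only: the coefficients, correlation, noise level and sample size are assumptions chosen for the demonstration, and all function names are hypothetical. Data come from a two-predictor linear model; a correctly specified model is compared with a reduced model that omits x2, using the exact expected squared error on a new observation.

```python
import random


def fit_full(x1, x2, y):
    """Least squares (no intercept) of y on x1 and x2 via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def fit_reduced(x1, y):
    """Least squares (no intercept) of y on x1 only; x2 is deliberately omitted."""
    return sum(a * c for a, c in zip(x1, y)) / sum(a * a for a in x1)


def expected_sq_error(b1, b2, beta1, beta2, rho, sigma):
    """Exact expected squared error on a new observation for the predictor b1*x1 + b2*x2,
    when x1, x2 have unit variance and correlation rho and the noise has sd sigma."""
    d1, d2 = beta1 - b1, beta2 - b2
    return sigma ** 2 + d1 * d1 + d2 * d2 + 2 * rho * d1 * d2


def simulate(reps=2000, n=20, beta1=1.0, beta2=0.2, rho=0.9, sigma=2.0, seed=0):
    """Average out-of-sample error of the correctly specified and the reduced model.
    All default parameter values are assumptions chosen for illustration."""
    rng = random.Random(seed)
    full_total = reduced_total = 0.0
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 has unit variance and correlation rho with x1
        x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        y = [beta1 * a + beta2 * b + rng.gauss(0, sigma) for a, b in zip(x1, x2)]
        b1, b2 = fit_full(x1, x2, y)
        full_total += expected_sq_error(b1, b2, beta1, beta2, rho, sigma)
        reduced_total += expected_sq_error(fit_reduced(x1, y), 0.0, beta1, beta2, rho, sigma)
    return full_total / reps, reduced_total / reps


if __name__ == "__main__":
    full, reduced = simulate()
    print(f"correctly specified model: {full:.3f}")
    print(f"reduced model (x2 omitted): {reduced:.3f}")
```

Under these assumed settings (small sample, noisy data, correlated predictors, weak second effect) the reduced model should show the lower average error, because the variance saved by estimating one coefficient fewer outweighs the bias from omitting x2. With a larger sample or a stronger x2 effect the ordering can reverse.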

Four aspects
Y=F(X) vs. E(Y)=f(X)
1. Theory – Data
2. Causation – Association
3. Retrospective – Prospective
4. Bias – Variance

Predict ≠ Explain
“we tried to benefit from an extensive
set of attributes describing each of the
movies in the dataset. Those attributes
certainly carry a significant signal and
can explain some of the user behavior.
However… they could not help at all
for improving the [predictive]
accuracy.”
Bell et al., 2008

Predict ≠ Explain
The FDA considers two products
bioequivalent if the 90% CI of the
relative mean of the generic to brand
formulation is within 80%-125%

“We are planning to… develop predictive models for bioavailability
and bioequivalence”
Lester M. Crawford, 2005
Acting Commissioner of Food & Drugs

Goal Definition
→ Design & Data Collection
→ Data Preparation
→ EDA
→ Variables?
→ Methods?
→ Evaluation, Validation & Model Selection
→ Model Use & Reporting

Study design
& data collection
Observational or experiment?
Primary or secondary data?
Instrument (reliability + validity vs. measurement accuracy)
How much data?
How to sample?

Hierarchical data

Data Preprocessing

Missing values
Reduced-feature models
Partitioning

Data exploration & reduction

Interactive
visualization
PCA
SVD

Which Variables?

Endogeneity
Ex-post availability
Causation vs. associations
Multicollinearity?
A, B, A*B?

Methods / Models
Blackbox / interpretable
Mapping to theory

Bias vs. variance

Shrinkage models
ensembles

[Diagram: Theoretical model, Empirical model, Data. Annotations: Model fit ≠, Validation, Explanatory power]

Evaluation, Validation
& Model Selection

Empirical model: training data vs. holdout data
Over-fitting analysis
Predictive power
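
The training/holdout idea above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the author's procedure: the data-generating line, the 70/30 split and the function names are all assumptions. The model is fitted on the training partition only, then its error is compared across the two partitions and against a naive benchmark (the training mean).

```python
import random
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


def holdout_report(n=200, train_frac=0.7, seed=1):
    rng = random.Random(seed)
    # Synthetic data (an assumption for illustration): y = 2 + 0.5x + noise
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [2.0 + 0.5 * a + rng.gauss(0, 1) for a in x]

    # Random partition into training and holdout sets
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(train_frac * n)
    x_tr, y_tr = [x[i] for i in idx[:cut]], [y[i] for i in idx[:cut]]
    x_ho, y_ho = [x[i] for i in idx[cut:]], [y[i] for i in idx[cut:]]

    # Fit on the training data only; the holdout set is never used for fitting
    intercept, slope = fit_line(x_tr, y_tr)
    naive = sum(y_tr) / len(y_tr)  # naive benchmark: predict the training mean

    def predict(xs):
        return [intercept + slope * a for a in xs]

    return {
        "model_train": rmse(y_tr, predict(x_tr)),
        "model_holdout": rmse(y_ho, predict(x_ho)),
        "naive_train": rmse(y_tr, [naive] * len(y_tr)),
        "naive_holdout": rmse(y_ho, [naive] * len(y_ho)),
    }


if __name__ == "__main__":
    for name, value in holdout_report().items():
        print(f"{name}: {value:.3f}")
```

A holdout error far above the training error is a sign of over-fitting, and a holdout error that does not beat the naive benchmark suggests the model has little predictive power, whatever its fit to the training data.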

Model Use
Explanatory: test causal theory → inference, null hypothesis
Predictive: new theory, develop measures, compare theories, improve theory, assess relevance, predictability → predictive performance, naïve/baseline, over-fitting analysis

Point #2

Explanatory Power ≠ Predictive Power

Cannot infer one from the other

Performance Metrics
Explanatory: interpretation, p-values, R², goodness-of-fit, type I and II errors
Predictive: out-of-sample, prediction accuracy, costs, training vs. holdout, over-fitting
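
Point #2 can also be illustrated numerically. The sketch below is a toy example under assumed settings (a weak true slope of 0.1, unit noise, 20,000 observations); the function name and all parameters are hypothetical. It reports an explanatory-style metric (the t-statistic for the slope, estimated on training data) next to a predictive-style metric (holdout RMSE relative to a naive benchmark).

```python
import random
from math import sqrt


def explanatory_vs_predictive(n=20000, slope=0.1, train_frac=0.7, seed=2):
    rng = random.Random(seed)
    # Synthetic data (an assumption): a weak but real linear effect of x on y
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [slope * a + rng.gauss(0, 1) for a in x]
    cut = int(train_frac * n)  # observations are i.i.d., so a simple split is a random partition
    x_tr, y_tr, x_ho, y_ho = x[:cut], y[:cut], x[cut:], y[cut:]

    # Explanatory view: fit on training data and test H0: slope = 0
    mx, my = sum(x_tr) / cut, sum(y_tr) / cut
    sxx = sum((a - mx) ** 2 for a in x_tr)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x_tr, y_tr)) / sxx
    b0 = my - b1 * mx
    sse = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x_tr, y_tr))
    t_stat = b1 / sqrt(sse / (cut - 2) / sxx)
    r2_train = 1 - sse / sum((b - my) ** 2 for b in y_tr)

    # Predictive view: holdout RMSE against the naive training-mean benchmark
    m = len(y_ho)
    rmse_model = sqrt(sum((b - b0 - b1 * a) ** 2 for a, b in zip(x_ho, y_ho)) / m)
    rmse_naive = sqrt(sum((b - my) ** 2 for b in y_ho) / m)

    return {
        "t_stat": t_stat,
        "r2_train": r2_train,
        "rmse_model": rmse_model,
        "rmse_naive": rmse_naive,
        "relative_gain": 1 - rmse_model / rmse_naive,
    }


if __name__ == "__main__":
    for name, value in explanatory_vs_predictive().items():
        print(f"{name}: {value:.4f}")
```

In this setup the slope is overwhelmingly statistically significant, yet the holdout RMSE is nearly the same as the naive benchmark's. A strong inferential result therefore says little about predictive accuracy, which has to be measured directly on data not used for fitting.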

Predictive Power

Explanatory Power

The predictive power of an
explanatory model has important
scientific value

Relevance, reality check, predictability

In “explanatory” fields
Prediction underappreciated

Distinction blurred
Unfamiliar with predictive
modeling/assessment
“While the value of scientific prediction… is beyond
question… the inexact sciences [do not] have… the
use of predictive expertise well in hand.”
Helmer & Rescher, 1959

How does all this impact
Scientific Research?

What can be done?

acknowledge
incorporate prediction into
curriculum

What happens in other fields?

Epidemiology
Engineering
Life sciences

What about “predictive only”
fields? http://goo.gl/gcjlN

Shmueli (2010), “To Explain or To Predict?”, Statistical Science
Shmueli & Koppius (2011), “Predictive analytics in IS research”, MISQ
