Big Data - To Explain or To Predict? Talk at U Toronto's Rotman School of Management

Big Data –
To Explain or To Predict?
Big Data Experts Speaker Series
Rotman School of Management, U Toronto, March 2016
Galit Shmueli

Galit Shmueli (徐茉莉)
www.galitshmueli.com
❶ 1994-2000
Israel Institute of Technology
MSc + PhD, Statistics
❷ 2000-2002
Carnegie Mellon Univ.
Visiting Assistant Prof., Dept. of Statistics
❸ 2002-2012
Univ. of Maryland College Park
Assistant then Associate Prof. of Statistics & Management Science, R H Smith School of Business
2008-2014
Rigsum Institute (Bhutan)
Co-Director, Rigsum Research Lab
❹ 2011-2014
Indian School of Business
SRITNE Chaired Prof. of Data Analytics, Associate Prof. of Statistics & Info Systems
2014-… NTHU
Institute of Service Science
Director, Center for Service Innovation & Analytics

Research in Data Analytics
‘Entrepreneurial’ statistical
& data mining modeling
(for today’s problems)
Interdisciplinary modeling
Statistical Strategy
To Explain or To Predict?
Information Quality
Regression with Big Data

Road Map
Definitions
Explanatory-dominated social sciences
Explanatory modeling ≠ predictive modeling
Why?
Different modeling paths
Explanatory power vs. predictive power
Implications

Definitions
Explanatory modeling:
Theory-based, statistical testing of
causal hypotheses
Explanatory power:
Strength of relationship in statistical
model

Definitions
Predictive modeling:
Empirical method for predicting new
observations
Predictive power:
Ability to accurately predict new
observations

Matching Game
Explain / Predict / Describe
Social Sciences / Machine learning / Statistics

Statistical modeling in
social sciences &
management research
Purpose: test causal theory (“explain”)
Association-based statistical models
Prediction nearly absent

Classic journal paper
Start with a causal theory
Generate causal hypotheses on constructs
Operationalize constructs → Measurable variables
Fit statistical model
Statistical inference → Causal conclusions

In the social sciences,
data analysis is mainly used for testing
causal theory.
“If it explains, it predicts”

“Empirical prediction alone
is un-scientific”
Some statisticians share this view:
The two goals in analyzing data... I prefer to describe
as “management” and “science”. Management seeks
profit... Science seeks truth.
- Parzen, Statistical Science 2001

Prediction in top research journals in
Information Systems
Predictive goal?
Predictive modeling?
Predictive assessment?
1990-2006

52 “predictive” articles among 1,072
in Information Systems top journals

“A good explanatory model will also
predict well”
“You must understand the underlying
causes in order to predict”

Philosophy of Science
“Explanation and prediction have the
same logical structure”
Hempel & Oppenheim, 1948
“It becomes pertinent to investigate the
possibilities of predictive procedures
autonomous of those used for explanation”
Helmer & Rescher, 1959
“Theories of social and human behavior
address themselves to two distinct goals of
science: (1) prediction and (2) understanding”
Dubin, Theory Building, 1969

Why statistical
explanatory modeling
differs from
predictive modeling

Explanatory Model:
Test/quantify causal effect for
“average” record in population
Predictive Model:
Predict new individual
observations
Different Scientific Goals
Different generalization

Theory vs. its manifestation
?

Four aspects
1. Theory – Data
2. Causation – Association
3. Retrospective – Prospective
4. Bias - Variance
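
The fourth aspect can be written out with the standard decomposition of expected squared prediction error. The notation below is assumed for illustration and does not appear on the slides: a new observation Y = f(x) + ε with mean-zero noise, and an estimate f̂ fitted on a separate sample.

```latex
% Assumed setup (not on the slides): Y = f(x) + \varepsilon, E[\varepsilon] = 0,
% and \hat{f} is estimated from a sample independent of the new observation.
\[
\mathrm{EPE}(x)
  = \mathrm{E}\big[(Y - \hat{f}(x))^2\big]
  = \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible noise}}
  + \underbrace{\big(\mathrm{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{estimation variance}}
\]
```

Explanatory modeling concentrates on keeping the bias term small so that the model represents the theory faithfully. Predictive modeling targets the sum of bias and estimation variance, and may accept some bias in exchange for lower variance.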

“The goal of finding models that are
predictively accurate differs from the
goal of finding models that are true.”

Point #1
Best explanatory model ≠ Best predictive model

Predict ≠ Explain
+ ?
“we tried to benefit from an extensive
set of attributes describing each of the
movies in the dataset. Those attributes
certainly carry a significant signal and
can explain some of the user behavior.
However… they could not help at all
for improving the [predictive]
accuracy.”
Bell et al., 2008

Explain ≠ Predict
The FDA considers two products
bioequivalent if the 90% CI of the
relative mean of the generic to brand
formulation is within 80%-125%
“We are planning to… develop predictive models for bioavailability
and bioequivalence”
Lester M. Crawford, 2005
Acting Commissioner of Food & Drugs
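
The bioequivalence rule quoted above is a statement about a population-level mean ratio, not about any individual patient. A minimal sketch of the check follows, assuming the 90% CI has already been computed elsewhere and is expressed as a ratio (1.0 = 100%). The function name `is_bioequivalent` and the inclusive treatment of the limits are illustrative assumptions, not the FDA's wording.

```python
def is_bioequivalent(ci_lower: float, ci_upper: float,
                     low_limit: float = 0.80, high_limit: float = 1.25) -> bool:
    """Return True when the whole 90% CI of the generic/brand mean ratio
    lies inside [low_limit, high_limit].

    Hypothetical helper: inclusive limits are an assumption made here.
    """
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    return low_limit <= ci_lower and ci_upper <= high_limit


# Illustrative (made-up) intervals, not real trial results
print(is_bioequivalent(0.92, 1.10))  # interval fully inside the limits
print(is_bioequivalent(0.78, 1.05))  # lower end falls below 80%
```

The check passes or fails on the interval for the average ratio only. It says nothing about how well the generic's effect can be predicted for a new individual, which is the gap the Crawford quote points to.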

“For a long time, we thought that
Tamoxifen was roughly 80%
effective for breast cancer
patients.
But now we know much more:
we know that it’s 100% effective
in 70%-80% of the patients, and
ineffective in the rest.”

Goal Definition → Design & Collection → Data Preparation → EDA → Variables? Methods? → Evaluation, Validation & Model Selection → Model Use & Reporting

Study design & data collection
Hierarchical data
Observational or experiment?
Primary or secondary data?
Instrument (reliability+validity vs. meas. accuracy)
How much data?
How to sample?

Data Preprocessing
reduced-feature models
missing
partitioning

Data exploration, viz, reduction
PCA
Factor Analysis
(interpretable)
Dimension Reduction
(fast, small)

Which Variables?
Multicollinearity?
causation associations
endogeneity
ex-post
availability
A, B, A*B?

ensembles
Shrinkage models
variance bias
Methods / Models
Blackbox / interpretable
Mapping to theory

Evaluation, Validation & Model Selection
Training data, Empirical model, Holdout data
Predictive power
Over-fitting analysis
Theoretical model, Empirical model, Data
Validation
Model fit ≠ Explanatory power
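
The predictive side of this slide (training data, holdout data, over-fitting analysis) can be sketched in a few lines. This is a minimal illustration on simulated data, assuming a single predictor, a simple 70/30 split, and the training mean as the naive benchmark. All names, the data-generating values, and the split are assumptions made for the example.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def r_squared(xs, ys, a, b):
    """Share of variance in ys captured by the line a + b*x."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def rmse(xs, ys, a, b):
    """Root mean squared error of the line a + b*x on the given records."""
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Simulated data (assumed values): y = 1 + 2x + noise
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + random.gauss(0, 3) for x in xs]

# Partition: first 140 records for training, last 60 held out
train_x, train_y = xs[:140], ys[:140]
hold_x, hold_y = xs[140:], ys[140:]

a, b = fit_line(train_x, train_y)
train_r2 = r_squared(train_x, train_y, a, b)      # in-sample fit
train_rmse = rmse(train_x, train_y, a, b)
holdout_rmse = rmse(hold_x, hold_y, a, b)         # predictive power
naive_rmse = rmse(hold_x, hold_y, sum(train_y) / len(train_y), 0.0)  # baseline

print(f"training R^2: {train_r2:.3f}")
print(f"training RMSE: {train_rmse:.3f}  holdout RMSE: {holdout_rmse:.3f}")
print(f"naive holdout RMSE: {naive_rmse:.3f}")
```

A large gap between training and holdout error is one sign of over-fitting. The naive comparison shows whether the model adds predictive value beyond simply predicting the training mean.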

Inference
Model Use: Industry
Identify causal factors
generate predictions for new data
Predictive performance
Over-fitting analysis
Null hypothesis
Naïve/baseline

Inference
Model Use (Science)
test causal theory
generate new theory
develop measures
compare theories
improve theory
assess relevance
Evaluate predictability
Predictive performance
Over-fitting analysis
Null hypothesis
Naïve/baseline

Point #2
Explanatory Power ≠ Predictive Power
Cannot infer one from the other
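
One way to see this is a simulation in which an effect is clearly statistically significant yet adds almost nothing to out-of-sample accuracy. The sketch below assumes a small true slope (0.05), standard normal noise, and a large sample. It uses the slope's t-statistic as a stand-in for the inference side and holdout RMSE against a naive mean prediction for the predictive side. All values and names are illustrative assumptions.

```python
import math
import random

random.seed(1)
n = 50_000
effect = 0.05  # assumed small true slope

xs = [random.gauss(0, 1) for _ in range(n)]
ys = [effect * x + random.gauss(0, 1) for x in xs]

half = n // 2
train_x, train_y = xs[:half], ys[:half]
hold_x, hold_y = xs[half:], ys[half:]

# Fit y = intercept + slope * x by ordinary least squares on the training half
mean_x = sum(train_x) / half
mean_y = sum(train_y) / half
sxx = sum((x - mean_x) ** 2 for x in train_x)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Inference side: t-statistic of the slope
ss_res = sum((y - intercept - slope * x) ** 2 for x, y in zip(train_x, train_y))
se_slope = math.sqrt(ss_res / (half - 2) / sxx)
t_stat = slope / se_slope


# Predictive side: holdout RMSE of the model vs. the naive training mean
def holdout_rmse(predict):
    """RMSE of a prediction function on the holdout half."""
    sq_err = sum((y - predict(x)) ** 2 for x, y in zip(hold_x, hold_y))
    return math.sqrt(sq_err / len(hold_x))


model_rmse = holdout_rmse(lambda x: intercept + slope * x)
naive_rmse = holdout_rmse(lambda x: mean_y)
relative_gain = 1 - model_rmse / naive_rmse

print(f"t-statistic of slope: {t_stat:.1f}")
print(f"holdout RMSE, model: {model_rmse:.4f}  naive: {naive_rmse:.4f}")
print(f"relative RMSE gain over naive: {relative_gain:.2%}")
```

With a sample this large the slope is estimated precisely enough to pass a conventional significance test, while the holdout error stays close to the naive benchmark. The reverse can also occur, which is why each kind of power has to be assessed directly.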

Performance Metrics
out-of-sample
type I, II errors
goodness-of-fit
p-values
over-fitting
costs
prediction accuracy
interpretation
Training vs. holdout
R²
Explanatory Power
Predictive Power

The predictive power of an
explanatory model has important
scientific value
Relevance, reality check, predictability

Current state in academia
(social sciences and management)
“While the value of scientific prediction… is beyond
question… the inexact sciences [do not] have…the
use of predictive expertise well in hand.”
Helmer & Rescher, 1959
Distinction blurred
Unfamiliarity with predictive
modeling/assessment
Prediction underappreciated

State-of-the-art in industry
Distinction blurred
Prediction over-appreciated
“Big Data” synonymous with prediction

How does this impact
Scientific research?

How does this impact
organizations’ actions?
…and our lives?

Will the
customer pay?
What causes
non-payment?

Explain
Predict
Predict
Potential
explanations

Shmueli (2010) “To Explain or To Predict?”, Statistical Science
Shmueli & Koppius (2011) “Predictive Analytics in IS Research”, MISQ
