5. Statistical modeling in social science research
Purpose: test causal theory (“explain”)
Association-based statistical models
Prediction nearly absent
6. Lesson #1: Whether statisticians like it or not, in the social sciences, association-based statistical models are used for testing causal theory. Justification: a strong underlying theoretical model provides the causality.
15. Lesson #2: In the social sciences, empirical analysis is mainly used for testing causal theory; empirical prediction is considered un-academic.
Some statisticians share this view:
“The two goals in analyzing data... I prefer to describe as ‘management’ and ‘science’. Management seeks profit... Science seeks truth.” (Parzen, Statistical Science, 2001)
18. 1,072 articles, of which 52 were empirical with predictive claims.
“Examples of [predictive] theory in IS do not come readily to hand, suggesting that they are not common.” (Gregor, MISQ 2006)
23. “The goal of finding models that are predictively accurate differs from the goal of finding models that are true.”
24. Given the research environment in the social sciences, two critically important points:
1. Explanatory power and predictive accuracy cannot be inferred from one another.
2. The “best” explanatory model is (nearly) never the “best” predictive model, and vice versa.
31. Point #2 Best explanatory model ≠ Best predictive model
32. Predict ≠ Explain
“We should mention that not all data features were found to be useful. For example, we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However, we concluded that they could not help at all for improving the accuracy of well tuned collaborative filtering models.” (Bell et al., 2008)
33. Predict ≠ Explain
The FDA considers two products bioequivalent if the 90% CI of the relative mean of the generic to the brand formulation is within 80%–125%.
“We are planning to... develop predictive models for bioavailability and bioequivalence.” (Lester M. Crawford, Acting Commissioner of Food & Drugs, 2005)
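The FDA rule above can be sketched numerically. This is a minimal illustration on synthetic data, not from the talk: it uses a normal-approximation 90% CI (z = 1.645) for the ratio of geometric means on the log scale, rather than the full TOST t-based procedure, and the helper name is mine.

```python
import math
import random

def bioequivalence_90ci(generic, brand, z=1.645):
    """Approximate 90% CI for the ratio of geometric means,
    computed on the log scale; bioequivalent if CI is in [0.80, 1.25]."""
    lg = [math.log(v) for v in generic]
    lb = [math.log(v) for v in brand]
    mg, mb = sum(lg) / len(lg), sum(lb) / len(lb)
    vg = sum((v - mg) ** 2 for v in lg) / (len(lg) - 1)
    vb = sum((v - mb) ** 2 for v in lb) / (len(lb) - 1)
    se = math.sqrt(vg / len(lg) + vb / len(lb))
    lo, hi = math.exp(mg - mb - z * se), math.exp(mg - mb + z * se)
    return lo, hi, (0.80 <= lo and hi <= 1.25)

# Synthetic log-normal bioavailability measurements (arbitrary parameters)
random.seed(0)
brand   = [math.exp(random.gauss(4.00, 0.15)) for _ in range(40)]
generic = [math.exp(random.gauss(4.02, 0.15)) for _ in range(40)]
lo, hi, ok = bioequivalence_90ci(generic, brand)
print(f"90% CI for ratio: [{lo:.3f}, {hi:.3f}]  bioequivalent: {ok}")
```

The point of the slide stands either way: a predictive model of bioequivalence would be judged on this interval criterion, not on whether its parameters are causally interpretable.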
36. What is Optimized?
Prediction MSE = Var(Y) + Bias² + estimation variance
Var(Y): uncontrollable (irreducible) error
Bias²: model misspecification
Estimation variance: sampling variance of the fitted model
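The decomposition on this slide is the standard bias–variance algebra. Writing Y = f(x) + ε with ε independent of the fitted model f̂:

```latex
\begin{aligned}
E\big[(Y - \hat f(x))^2\big]
  &= E\big[(f(x) + \varepsilon - \hat f(x))^2\big] \\
  &= \underbrace{\operatorname{Var}(\varepsilon)}_{\text{uncontrollable error}}
   + \underbrace{\big(f(x) - E[\hat f(x)]\big)^2}_{\text{bias}^2 = \text{misspecification}}
   + \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{estimation variance}}
\end{aligned}
```

Explanatory modeling focuses on minimizing bias (getting the specification right); predictive modeling minimizes the sum, so it may accept bias in exchange for lower estimation variance.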
37. Linear Regression Example
True model: y = β₁x₁ + β₂x₂ + ε
Estimated model 1: correctly specified (x₁ and x₂)
Estimated model 2: underspecified (x₁ only)
MSE₂ < MSE₁ when: σ² is large; |β₂| is small; corr(x₁, x₂) is high; the range of the x’s is limited.
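The conditions on this slide can be checked in a small simulation. This is a sketch with arbitrary parameter values chosen to match them (σ large, |β₂| small, corr(x₁, x₂) high, small training sample); under these conditions the underspecified model tends to have lower holdout MSE than the true specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def sim(n_train=20, n_test=1000, beta=(2.0, 0.2), rho=0.95, sigma=5.0):
    """One replication: fit the true (x1, x2) model and the
    underspecified (x1-only) model; return their holdout MSEs."""
    b = np.asarray(beta)
    cov = [[1.0, rho], [rho, 1.0]]
    Xtr = rng.multivariate_normal([0, 0], cov, size=n_train)
    Xte = rng.multivariate_normal([0, 0], cov, size=n_test)
    ytr = Xtr @ b + rng.normal(0, sigma, n_train)
    yte = Xte @ b + rng.normal(0, sigma, n_test)
    b_full, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    b_under, *_ = np.linalg.lstsq(Xtr[:, :1], ytr, rcond=None)
    mse_full = np.mean((yte - Xte @ b_full) ** 2)
    mse_under = np.mean((yte - Xte[:, :1] @ b_under) ** 2)
    return mse_full, mse_under

# Average holdout MSE over many replications
res = np.mean([sim() for _ in range(300)], axis=0)
print(f"avg holdout MSE, true model: {res[0]:.2f}  underspecified: {res[1]:.2f}")
```

The "wrong" model wins predictively because dropping x₂ trades a small squared bias for a large saving in estimation variance, which collinearity inflates.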
39. Goal Definition → Study Design & Collection → Data Preparation → EDA → Variables? → Methods? → Evaluation, Validation & Model Selection → Model Use & Reporting
Study design & data collection
Observational or experiment?
Primary or secondary data?
Instrument (reliability + validity vs. measurement accuracy)
Hierarchical data
How much data? How to sample?
Methods / Models
Bias vs. variance
Blackbox vs. interpretable
Mapping to theory
Examples: ridge regression, ensembles, boosting, PLS, PCR
Evaluation, Validation & Model Selection: Model fit ≠ Validation
Explanatory power: theoretical model ↔ empirical model ↔ data
Predictive power: empirical model fit on training data, over-fitting analysis, holdout data
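The training/holdout split on this slide can be illustrated in a few lines. A sketch on synthetic data (my own setup): training MSE keeps improving as the model grows more flexible, while holdout MSE reveals when fit has turned into over-fit.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.3, 60)
x_tr, y_tr = x[:40], y[:40]          # training data
x_ho, y_ho = x[40:], y[40:]          # holdout data

def mse(deg):
    """Fit a degree-`deg` polynomial on the training set only;
    report training and holdout MSE to expose over-fitting."""
    coef = np.polyfit(x_tr, y_tr, deg)
    tr = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    ho = np.mean((y_ho - np.polyval(coef, x_ho)) ** 2)
    return tr, ho

for deg in (1, 3, 15):
    tr, ho = mse(deg)
    print(f"degree {deg:2d}: train MSE {tr:.3f}  holdout MSE {ho:.3f}")
```

In-sample fit statistics alone cannot distinguish the degree-3 model from the degree-15 one; only data the model never saw can.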
Model Use
Inference: test causal theory, null hypothesis
Predictions: utility, relevance, new theory, predictability
Predictive performance: over-fitting analysis, naïve/baseline comparison
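The naïve/baseline comparison mentioned above is the predictive analogue of a null hypothesis: a model is only useful if it beats a trivial predictor. A minimal sketch on synthetic data (my own setup), comparing a least-squares line to the "always predict the training mean" baseline on a holdout set:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(0, 1.0, size=200)
x_tr, y_tr = x[:150], y[:150]        # training data
x_ho, y_ho = x[150:], y[150:]        # holdout data

# Candidate model: least-squares line; naive baseline: training mean
slope, intercept = np.polyfit(x_tr, y_tr, 1)
pred_model = slope * x_ho + intercept
pred_naive = np.full_like(y_ho, y_tr.mean())

mse_model = np.mean((y_ho - pred_model) ** 2)
mse_naive = np.mean((y_ho - pred_naive) ** 2)
print(f"model MSE {mse_model:.3f}  naive-baseline MSE {mse_naive:.3f}")
```

Reporting performance only in absolute terms hides whether the model adds anything at all; the baseline makes the comparison honest.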
Goal Definition → Study Design & Collection → Data Preparation → EDA → Variables? → Methods? → Evaluation, Validation & Model Selection → Model Use & Reporting
48. How does all this impact research in the (social) sciences?
49. Three Current Problems
1. Prediction is underappreciated
2. The explain/predict distinction is blurred
3. Inappropriate modeling and assessment
“While the value of scientific prediction... is beyond question... the inexact sciences [do not] have... the use of predictive expertise well in hand.” (Helmer & Rescher, 1959)
50. Why? What can be done? Statisticians should acknowledge the difference and teach it!