To Explain or To Predict?

Galit Shmueli
Israel Statistical Association & Tel Aviv University
July 9, 2012

Points for discussion: goo.gl/gcjlN

Twitter: #explainpredict
Road Map
Definitions
Explanatory-dominated social sciences
Explanatory ≠ predictive modeling
 Why?
 Different modeling paths
 Explanatory vs. predictive power


So what?
Definitions

Explanatory modeling:
Theory-based, statistical testing of
causal hypotheses

Explanatory power:
Strength of relationship in statistical
model
Definitions

Predictive modeling:
Empirical method for predicting new
observations

Predictive power:
Ability to accurately predict new
observations
Statistical modeling in social science research

Purpose: test causal theory (“explain”)
Association-based statistical models
Prediction nearly absent
Explanatory modeling à la social sciences
Start with a causal
theory

Generate causal
hypotheses on
constructs

Operationalize constructs → Measurable variables

Fit statistical model

Statistical inference → Causal conclusions
In the social sciences, data analysis is mainly used for testing causal theory.

“If it explains, it predicts”

“Empirical prediction alone is un-scientific”

Some statisticians share this view:

   The two goals in analyzing data... I prefer to describe
   as “management” and “science”. Management seeks
   profit... Science seeks truth.

                        - Parzen, Statistical Science 2001
52 “predictive” articles among 1,072
in Information Systems top journals
Why Predict? For Scientific Research
          new theory
          develop measures
          compare theories
          improve theory
          assess relevance
          predictability

Shmueli & Koppius, “Predictive Analytics in IS Research”
(MISQ, 2011)
“A good explanatory model will also
predict well”
“You must understand the underlying
causes in order to predict”
Philosophy of Science
“Explanation and prediction have the
same logical structure”
                Hempel & Oppenheim, 1948

  “It becomes pertinent to investigate the
  possibilities of predictive procedures
  autonomous of those used for explanation”
                             Helmer & Rescher, 1959

         “Theories of social and human behavior
         address themselves to two distinct goals of
         science: (1) prediction and (2) understanding”
                                Dubin, Theory Building, 1969
Why statistical explanatory modeling differs from predictive modeling

Shmueli (2010), Statistical Science
Theory vs. its manifestation

[Diagram: how does a theory relate to its measurable manifestation?]
Notation

Theoretical constructs: 𝒳, 𝒴
Causal theoretical model: 𝒴 = F(𝒳)
Measurable variables: X, Y
Statistical model: E(Y) = f(X)
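
For concreteness, a minimal sketch of the two levels for a linear case (the specific linear operationalization is an illustrative assumption, not part of the slides):

```latex
% Theory operates on constructs; the statistical model on their measurements.
\mathcal{Y} = F(\mathcal{X})
\qquad\text{vs.}\qquad
E(Y) = f(X) = \beta_0 + \beta_1 X   % hypothetical linear operationalization
```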
Four aspects: 𝒴 = F(𝒳) vs. E(Y) = f(X)
1. Theory – Data
2. Causation – Association
3. Retrospective – Prospective
4. Bias – Variance
“The goal of finding models that are
predictively accurate differs from the
goal of finding models that are true.”
Point #1

Best explanatory model ≠ Best predictive model
Four aspects: 𝒴 = F(𝒳) vs. E(Y) = f(X)
1. Theory – Data
2. Causation – Association
3. Retrospective – Prospective
4. Bias – Variance (decomposition sketched below)
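
To make the bias–variance aspect concrete, the standard decomposition of expected prediction error (as discussed in Shmueli 2010) is sketched below: a predictive model may accept some bias if that buys a larger reduction in variance, whereas explanatory modeling seeks an unbiased representation of f.

```latex
\mathrm{EPE}
  = E\!\left[\bigl(Y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\operatorname{Var}(Y)}_{\text{irreducible error}}
  + \underbrace{\bigl[E\hat{f}(x) - f(x)\bigr]^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\bigl(\hat{f}(x)\bigr)}_{\text{estimation variance}}
```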
Predict ≠ Explain
“we tried to benefit from an extensive set of attributes describing each of
the movies in the dataset. Those attributes certainly carry a significant
signal and can explain some of the user behavior. However… they could not
help at all for improving the [predictive] accuracy.”
                                         Bell et al., 2008
Predict ≠ Explain
The FDA considers two products bioequivalent if the 90% CI of the ratio of
means (generic to brand formulation) falls within 80%–125%




“We are planning to… develop predictive models for bioavailability
and bioequivalence”
                                           Lester M. Crawford, 2005
                                Acting Commissioner of Food & Drugs
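
As an illustration only, here is a minimal sketch of the 90% CI check on simulated data, using the common log-scale construction for a ratio of means; the data, sample sizes, and the two-sample simplification are assumptions, not the FDA's crossover analysis.

```python
# Hypothetical sketch: 90% CI for the generic/brand ratio of means,
# built on the log scale and compared with the 80%-125% limits.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
generic = rng.lognormal(mean=3.00, sigma=0.2, size=24)  # simulated exposure (e.g., AUC)
brand = rng.lognormal(mean=3.05, sigma=0.2, size=24)

lg, lb = np.log(generic), np.log(brand)
n1, n2 = len(lg), len(lb)
diff = lg.mean() - lb.mean()                            # log of the ratio of geometric means
sp2 = ((n1 - 1) * lg.var(ddof=1) + (n2 - 1) * lb.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t = stats.t.ppf(0.95, n1 + n2 - 2)                      # two-sided 90% interval
lo, hi = np.exp(diff - t * se), np.exp(diff + t * se)   # back-transform to ratio scale

print(f"90% CI for the ratio: [{lo:.3f}, {hi:.3f}]")
print("within 0.80-1.25" if lo >= 0.80 and hi <= 1.25 else "outside 0.80-1.25")
```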
The modeling process:
Goal Definition → Design & Data Collection → Data Preparation → EDA →
Variables? → Methods? → Evaluation, Validation & Model Selection →
Model Use & Reporting
Study design & data collection
Observational or experimental?
Primary or secondary data?
Instrument (reliability + validity vs. measurement accuracy)
How much data?
How to sample?
Hierarchical data
Data Preprocessing
Missing data (reduced-feature models)
Data partitioning
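
A minimal sketch of predictive-style preprocessing on made-up data (the variable names and mean imputation are illustrative choices): partition before modeling, and let the training partition alone drive any imputation.

```python
# Sketch: hold out data up front; impute missing values using training data only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.normal(size=200)
X[rng.random(X.shape) < 0.05] = np.nan                  # inject some missingness

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=1)
imputer = SimpleImputer(strategy="mean").fit(X_train)   # fit on training data only
X_train_filled = imputer.transform(X_train)
X_holdout_filled = imputer.transform(X_holdout)
```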
Data exploration & reduction
Interactive visualization
PCA
SVD
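
A minimal sketch of dimension reduction via PCA on made-up data (illustrative only); in an explanatory study the retained components would also need a theoretical interpretation.

```python
# Sketch: reduce many correlated variables to a few principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.3 * rng.normal(size=(200, 10))

pca = PCA(n_components=3).fit(X)
scores = pca.transform(X)                     # reduced feature set for modeling
print(pca.explained_variance_ratio_.round(2))
```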
Which Variables?
Causation vs. association
Endogeneity
Ex-post availability
Multicollinearity? A, B, A*B?
Methods / Models
Black-box vs. interpretable
Mapping to theory
Variance vs. bias: shrinkage models, ensembles (sketch below)
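
A minimal sketch of the shrinkage idea on simulated data (the data-generating process and the alpha value are assumptions): ridge regression biases coefficients toward zero, yet can out-predict unbiased OLS when predictors are many and correlated.

```python
# Sketch: biased (ridge) vs. unbiased (OLS) estimation, judged on holdout error.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
n, p = 80, 30
X = rng.normal(size=(n, p)) + rng.normal(size=(n, 1))   # correlated predictors
beta = np.zeros(p)
beta[:5] = 1.0
y = X @ beta + rng.normal(scale=2.0, size=n)

X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.5, random_state=3)
for name, model in [("OLS", LinearRegression()), ("Ridge", Ridge(alpha=10.0))]:
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_ho, model.predict(X_ho))
    print(name, "holdout MSE:", round(mse, 2))
```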
Evaluation, Validation & Model Selection

Model fit ≠ Validation

Explanatory power: theoretical model ↔ empirical model ↔ data
Predictive power: empirical model on training data vs. holdout data,
over-fitting analysis
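
A minimal sketch of "model fit ≠ validation" on made-up data (the polynomial degrees are arbitrary): a flexible model can fit the training data very well yet validate poorly on the holdout set.

```python
# Sketch: in-sample fit vs. holdout validation for two model complexities.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.5, size=60)
x_tr, x_ho, y_tr, y_ho = train_test_split(x, y, test_size=0.5, random_state=4)

for degree in (1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    print(f"degree {degree}: train R2 = {model.score(x_tr, y_tr):.2f}, "
          f"holdout R2 = {model.score(x_ho, y_ho):.2f}")
```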
Model Use
Explanatory: test causal theory → inference, null hypothesis testing
Predictive: new theory, develop measures, compare theories, improve theory,
assess relevance, predictability → predictive performance vs. naïve/baseline,
over-fitting analysis
Point #2

Explanatory Power ≠ Predictive Power

Cannot infer one from the other
Performance Metrics (definitions below)
Explanatory: interpretation, p-values, R², goodness-of-fit, type I/II errors
Predictive: out-of-sample prediction accuracy, costs, training vs. holdout,
over-fitting
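
For reference, the standard definitions behind the two families of metrics (added here for concreteness; in-sample R² summarizes explanatory fit, holdout RMSE summarizes predictive accuracy):

```latex
R^2 = 1 - \frac{\sum_{i\in\text{sample}} (y_i - \hat{y}_i)^2}
               {\sum_{i\in\text{sample}} (y_i - \bar{y})^2}
\qquad
\mathrm{RMSE}_{\text{holdout}} = \sqrt{\frac{1}{m}\sum_{j\in\text{holdout}} (y_j - \hat{y}_j)^2}
```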
[Diagram: predictive power vs. explanatory power as two separate dimensions]
The predictive power of an
explanatory model has important
scientific value


Relevance, reality check, predictability
In “explanatory” fields
Prediction underappreciated
Distinction blurred
Unfamiliar with predictive modeling/assessment

“While the value of scientific prediction… is beyond question… the inexact
sciences [do not] have… the use of predictive expertise well in hand.”
                               Helmer & Rescher, 1959
How does all this impact
   Scientific Research?
What can be done?

Acknowledge the explain/predict distinction
Incorporate prediction into the curriculum
What happens in other fields?

     Epidemiology
         Engineering
             Life sciences

What about “predictive only” fields?

http://goo.gl/gcjlN
Shmueli (2010), “To Explain or To Predict?”, Statistical Science
Shmueli & Koppius (2011), “Predictive analytics in IS research”, MISQ
