5. Statistical modeling in social science research
Purpose: test causal theory (“explain”)
Association-based statistical models
Prediction nearly absent
6. Lesson #1: Whether statisticians like it or not, in the social sciences, association-based statistical models are used for testing causal theory. Justification: a strong underlying theoretical model provides the causality.
15. Lesson #2: In the social sciences, empirical analysis is mainly used for testing causal theory; empirical prediction is considered un-academic.
Some statisticians share this view:
“The two goals in analyzing data... I prefer to describe as ‘management’ and ‘science’. Management seeks profit... Science seeks truth.” (Parzen, Statistical Science, 2001)
18. 1,072 articles, of which 52 were empirical with predictive claims.
“Examples of [predictive] theory in IS do not come readily to hand, suggesting that they are not common.” (Gregor, MISQ 2006)
23. “The goal of finding models that are predictively accurate differs from the goal of finding models that are true.”
24. Given the research environment in the social sciences, two critically important points:
1. Explanatory power and predictive accuracy cannot be inferred from one another.
2. The “best” explanatory model is (nearly) never the “best” predictive model, and vice versa.
31. Point #2 Best explanatory model ≠ Best predictive model
32. Predict ≠ Explain
“We should mention that not all data features were found to be useful. For example, we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However, we concluded that they could not help at all for improving the accuracy of well tuned collaborative filtering models.” (Bell et al., 2008)
33. Predict ≠ Explain
The FDA considers two products bioequivalent if the 90% CI of the relative mean of the generic to the brand formulation is within 80%–125%.
“We are planning to... develop predictive models for bioavailability and bioequivalence.” (Lester M. Crawford, Acting Commissioner of Food & Drugs, 2005)
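The FDA rule above can be sketched numerically. This is a minimal illustration on synthetic data, not from the talk: it uses a normal-approximation 90% CI (z = 1.645) for the ratio of geometric means on the log scale, rather than the full TOST t-based procedure, and the helper name is mine.

```python
import math
import random

def bioequivalence_90ci(generic, brand, z=1.645):
    """Approximate 90% CI for the ratio of geometric means,
    computed on the log scale; bioequivalent if CI is in [0.80, 1.25]."""
    lg = [math.log(v) for v in generic]
    lb = [math.log(v) for v in brand]
    mg, mb = sum(lg) / len(lg), sum(lb) / len(lb)
    vg = sum((v - mg) ** 2 for v in lg) / (len(lg) - 1)
    vb = sum((v - mb) ** 2 for v in lb) / (len(lb) - 1)
    se = math.sqrt(vg / len(lg) + vb / len(lb))
    lo, hi = math.exp(mg - mb - z * se), math.exp(mg - mb + z * se)
    return lo, hi, (0.80 <= lo and hi <= 1.25)

# Synthetic log-normal bioavailability measurements (arbitrary parameters)
random.seed(0)
brand   = [math.exp(random.gauss(4.00, 0.15)) for _ in range(40)]
generic = [math.exp(random.gauss(4.02, 0.15)) for _ in range(40)]
lo, hi, ok = bioequivalence_90ci(generic, brand)
print(f"90% CI for ratio: [{lo:.3f}, {hi:.3f}]  bioequivalent: {ok}")
```

The point of the slide stands either way: a predictive model of bioequivalence would be judged on this interval criterion, not on whether its parameters are causally interpretable.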
36. What is Optimized?
Prediction MSE = Var(Y) + Bias² + estimation variance
Var(Y): uncontrollable (irreducible) error
Bias²: model misspecification
Estimation variance: sampling variance of the fitted model
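The decomposition on this slide is the standard bias–variance algebra. Writing Y = f(x) + ε with ε independent of the fitted model f̂:

```latex
\begin{aligned}
E\big[(Y - \hat f(x))^2\big]
  &= E\big[(f(x) + \varepsilon - \hat f(x))^2\big] \\
  &= \underbrace{\operatorname{Var}(\varepsilon)}_{\text{uncontrollable error}}
   + \underbrace{\big(f(x) - E[\hat f(x)]\big)^2}_{\text{bias}^2 = \text{misspecification}}
   + \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{estimation variance}}
\end{aligned}
```

Explanatory modeling focuses on minimizing bias (getting the specification right); predictive modeling minimizes the sum, so it may accept bias in exchange for lower estimation variance.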
37. Linear Regression Example
True model: y = β₁x₁ + β₂x₂ + ε
Estimated model 1: correctly specified (x₁ and x₂)
Estimated model 2: underspecified (x₁ only)
MSE₂ < MSE₁ when: σ² is large; |β₂| is small; corr(x₁, x₂) is high; the range of the x’s is limited.
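The conditions on this slide can be checked in a small simulation. This is a sketch with arbitrary parameter values chosen to match them (σ large, |β₂| small, corr(x₁, x₂) high, small training sample); under these conditions the underspecified model tends to have lower holdout MSE than the true specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def sim(n_train=20, n_test=1000, beta=(2.0, 0.2), rho=0.95, sigma=5.0):
    """One replication: fit the true (x1, x2) model and the
    underspecified (x1-only) model; return their holdout MSEs."""
    b = np.asarray(beta)
    cov = [[1.0, rho], [rho, 1.0]]
    Xtr = rng.multivariate_normal([0, 0], cov, size=n_train)
    Xte = rng.multivariate_normal([0, 0], cov, size=n_test)
    ytr = Xtr @ b + rng.normal(0, sigma, n_train)
    yte = Xte @ b + rng.normal(0, sigma, n_test)
    b_full, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    b_under, *_ = np.linalg.lstsq(Xtr[:, :1], ytr, rcond=None)
    mse_full = np.mean((yte - Xte @ b_full) ** 2)
    mse_under = np.mean((yte - Xte[:, :1] @ b_under) ** 2)
    return mse_full, mse_under

# Average holdout MSE over many replications
res = np.mean([sim() for _ in range(300)], axis=0)
print(f"avg holdout MSE, true model: {res[0]:.2f}  underspecified: {res[1]:.2f}")
```

The "wrong" model wins predictively because dropping x₂ trades a small squared bias for a large saving in estimation variance, which collinearity inflates.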
39. Goal Definition → Study Design & Collection → Data Preparation → EDA → Variables? → Methods? → Evaluation, Validation & Model Selection → Model Use & Reporting
Study design & data collection
Observational or experiment?
Primary or secondary data?
Instrument (reliability + validity vs. measurement accuracy)
Hierarchical data
How much data? How to sample?
Methods / Models
Bias vs. variance
Blackbox vs. interpretable
Mapping to theory
Examples: ridge regression, ensembles, boosting, PLS, PCR
Evaluation, Validation & Model Selection: Model fit ≠ Validation
Explanatory power: theoretical model ↔ empirical model ↔ data
Predictive power: empirical model fit on training data, over-fitting analysis, holdout data
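The training/holdout split on this slide can be illustrated in a few lines. A sketch on synthetic data (my own setup): training MSE keeps improving as the model grows more flexible, while holdout MSE reveals when fit has turned into over-fit.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.3, 60)
x_tr, y_tr = x[:40], y[:40]          # training data
x_ho, y_ho = x[40:], y[40:]          # holdout data

def mse(deg):
    """Fit a degree-`deg` polynomial on the training set only;
    report training and holdout MSE to expose over-fitting."""
    coef = np.polyfit(x_tr, y_tr, deg)
    tr = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    ho = np.mean((y_ho - np.polyval(coef, x_ho)) ** 2)
    return tr, ho

for deg in (1, 3, 15):
    tr, ho = mse(deg)
    print(f"degree {deg:2d}: train MSE {tr:.3f}  holdout MSE {ho:.3f}")
```

In-sample fit statistics alone cannot distinguish the degree-3 model from the degree-15 one; only data the model never saw can.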
Model Use
Inference: test causal theory, null hypothesis
Predictions: utility, relevance, new theory, predictability
Predictive performance: over-fitting analysis, naïve/baseline comparison
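The naïve/baseline comparison mentioned above is the predictive analogue of a null hypothesis: a model is only useful if it beats a trivial predictor. A minimal sketch on synthetic data (my own setup), comparing a least-squares line to the "always predict the training mean" baseline on a holdout set:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(0, 1.0, size=200)
x_tr, y_tr = x[:150], y[:150]        # training data
x_ho, y_ho = x[150:], y[150:]        # holdout data

# Candidate model: least-squares line; naive baseline: training mean
slope, intercept = np.polyfit(x_tr, y_tr, 1)
pred_model = slope * x_ho + intercept
pred_naive = np.full_like(y_ho, y_tr.mean())

mse_model = np.mean((y_ho - pred_model) ** 2)
mse_naive = np.mean((y_ho - pred_naive) ** 2)
print(f"model MSE {mse_model:.3f}  naive-baseline MSE {mse_naive:.3f}")
```

Reporting performance only in absolute terms hides whether the model adds anything at all; the baseline makes the comparison honest.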
Goal Definition → Study Design & Collection → Data Preparation → EDA → Variables? → Methods? → Evaluation, Validation & Model Selection → Model Use & Reporting
48. How does all this impact research in the (social) sciences?
49. Three Current Problems
1. Prediction is underappreciated
2. The explain/predict distinction is blurred
3. Inappropriate modeling and assessment
“While the value of scientific prediction... is beyond question... the inexact sciences [do not] have... the use of predictive expertise well in hand.” (Helmer & Rescher, 1959)
50. Why? What can be done? Statisticians should acknowledge the difference and teach it!