5-4 1
Methodology and Ontology in Statistical Modeling: Some Error Statistical Reflections

Our presentation falls under the second of the bulleted questions for the conference:

How do methods of data generation, statistical modeling, and inference influence the construction and appraisal of theories?

Statistical methodology can influence what we think we're finding out about the world in the most problematic ways, traceable to such facts as:
• All statistical models are false
• Statistical significance is not substantive significance
• Statistical association is not causation
• No evidence against a statistical null hypothesis is not evidence the null is true
• If you torture the data enough, they will confess (or just omit unfavorable data)

These points are ancient (lying with statistics; "lies, damned lies, and statistics").

People are discussing these problems more than ever (big data), but what is rarely realized is how much certain methodologies are at the root of the current problems.
All Statistical Models are False

Take the popular slogan in statistics and elsewhere: "all statistical models are false!"

What the "all models are false" charge boils down to:
(1) the statistical model of the data is at most an idealized and partial representation of the actual data-generating source.
(2) a statistical inference is at most an idealized and partial answer to a substantive theory or question.

• But we already know our models are idealizations: that's what makes them models.
• Reasserting these facts is not informative.
• Yet they are taken to have various (dire) implications about the nature and limits of statistical methodology.
• Neither of these facts precludes using models to find out true things.
• On the contrary, it would be impossible to learn about the world if we did not deliberately falsify and simplify.
• Notably, the "all models are false" slogan is followed up by "but some are useful."
• Their usefulness, we claim, consists in being capable of adequately capturing an aspect of a phenomenon of interest.
• Then a hypothesis asserting a model's adequacy (or inadequacy) is capable of being true!

Note: All methods of statistical inference rest on statistical models.
What differentiates accounts is how well they step up to the plate in checking adequacy and in learning despite violations of statistical assumptions (robustness).
Statistical Significance is not Substantive Significance

Statistical models (as they arise in the methodology of statistical inference) live somewhere between:
1. Substantive questions, hypotheses, theories: H
2. Statistical models of phenomena, experiments, data: M
3. Data: x

What statistical inference has to do is afford adequate link-ups among these levels (reporting precision, accuracy, reliability).
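A minimal sketch of why a statistically significant result does not by itself settle substantive significance, assuming a one-sided one-sample z-test of H0: μ = 0 with known σ = 1 and hypothetical numbers: the same observed discrepancy of 0.01 standard deviations, negligible for most substantive purposes, moves from unremarkable to overwhelmingly "significant" purely as n grows.

```python
from math import erfc, sqrt

def z_test(mean_diff: float, sigma: float, n: int) -> tuple[float, float]:
    """One-sided z-test of H0: mu = 0 vs mu > 0 with known sigma.

    Returns the observed z statistic and its p-value Pr(Z >= z; H0),
    using erfc for an accurate upper tail even when z is large.
    """
    z = mean_diff / (sigma / sqrt(n))
    p = 0.5 * erfc(z / sqrt(2.0))
    return z, p

# Hypothetical: an identical tiny observed effect (0.01 SD) at increasing sample sizes.
for n in (100, 40_000, 1_000_000):
    z, p = z_test(mean_diff=0.01, sigma=1.0, n=n)
    print(f"n={n:>9,}  z={z:6.2f}  p={p:.3g}")
```

Only the sample size changes between rows, so a small p-value still has to be linked back to H (how large a discrepancy is actually indicated) before it counts as substantive.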
Recent Higgs reports on evidence of a real (Higgs-like) effect (July 2012, March 2013)

Researchers define a "global signal strength" parameter μ:
H0: μ = 0 corresponds to the background (null hypothesis);
μ > 0 corresponds to background + Standard Model Higgs boson signal;
but μ is only indirectly related to parameters in substantive models.

As is typical of so much of actual inference (experimental and non-experimental), testable predictions are statistical: they deduced what would be expected statistically from background alone (compared to the 5 sigma observed), in particular alluding to an overall test S:

Pr(Test S would yield d(X) > 5 standard deviations; H0) ≤ .0000003.

This is an example of an error probability.
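Assuming, as a simplification, that the test statistic d(X) is approximately standard normal under H0 (the actual analyses are far more involved), the quoted bound is just the one-sided upper tail at 5 standard deviations, about 2.9e-7. A minimal sketch:

```python
from math import erfc, sqrt

def upper_tail(z: float) -> float:
    """Pr(Z > z) for a standard normal Z, computed via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2.0))

# Error probability of observing d(X) > 5 standard deviations if H0 (background only) holds.
p_5sigma = upper_tail(5.0)
print(f"Pr(d(X) > 5; H0) = {p_5sigma:.2e}")
print("within the reported bound:", p_5sigma <= 0.0000003)
```

The number is a property of the testing procedure under H0, not a probability that H0 is true; that is what makes it an error probability.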
The move from statistical report to evidence

The inference actually detached from the evidence can be put in any number of ways:
There is strong evidence for H: a Higgs (or a Higgs-like) particle.

An implicit principle of inference is:
Why do data x0 from a test S provide evidence for rejecting H0?
Because were H0 a reasonably adequate description of the process generating the data, H0 would (very probably) have survived (with respect to the question).

Yet statistically significant departures are generated: July 2012, March 2013 (from 5 to 7 sigma).
Inferring that the observed difference is "real" (non-fluke) has thus been put to a severe test.
Philosophers often call this an "argument from coincidence."

(This is a highly stringent level; apparently, in this arena of particle physics, smaller observed effects often disappear.)
Even so, we cannot infer to any full theory.

That's what's wrong with the slogan "Inference to the 'Best' Explanation":
Some explanatory hypothesis T entails a statistically significant effect.
Statistical effect x is observed.
Therefore, data x are good evidence for T.

The problem: Pr(T "fits" data x; T is false) = high.

And in other, less theoretical fields, the perils of "theory-laden" interpretation of even genuine statistical effects are great.
[Babies look statistically significantly longer when red balls are picked from a basket with few red balls. Does this show they are running, at some intuitive level, a statistical significance test, recognizing statistically surprising results? It's not clear.]
The general worry reflects an implicit requirement for evidence:

Minimal Requirement for Evidence. If data are in accordance with a theory T, but the method would have issued so good a fit even if T is false, then the data provide poor or no evidence for T.

The basic principle isn't new; we find it in Peirce, Popper, Glymour…. What's new is finding a way to use error probabilities from frequentist statistics (error statistics) to cash it out, to resolve controversies in statistics, and even to give a foundation for rival accounts.
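One way error probabilities can cash out this minimal requirement is a severity assessment. A minimal sketch, assuming the simplest case: a one-sided test of H0: μ ≤ 0 vs μ > 0 on Normal data with known σ, and hypothetical numbers (observed mean 0.4, σ = 1, n = 100, so z = 4). After a rejection, the claim μ > μ1 is assessed by Pr(X̄ ≤ x̄0; μ = μ1), the probability the test would have produced a result according less well with the claim were μ only μ1.

```python
from math import sqrt
from statistics import NormalDist

def severity_greater(xbar: float, mu1: float, sigma: float, n: int) -> float:
    """Severity for the claim mu > mu1 after observing sample mean xbar.

    Computed as Pr(Xbar <= xbar; mu = mu1) for Normal data with known sigma:
    how probably a less impressive result would have occurred were mu only mu1.
    """
    se = sigma / sqrt(n)
    return NormalDist(mu=mu1, sigma=se).cdf(xbar)

# Hypothetical observed result: xbar = 0.4 with sigma = 1, n = 100.
for mu1 in (0.0, 0.2, 0.4, 0.6):
    print(f"SEV(mu > {mu1}) = {severity_greater(0.4, mu1, 1.0, 100):.3f}")
```

The same rejection warrants μ > 0.2 with high severity (about 0.98) but μ > 0.6 with very low severity (about 0.02), so a "significant" result does not license inferring whatever discrepancy one likes.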
Dirty Hands: But these statistical assessments, some object, depend on methodological choices in specifying statistical methods; outputs are influenced by discretionary judgments: the "dirty hands" argument.

While it is obvious that human judgments and human measurements are involved, this observation (like "all models are false") is too trivial to distinguish how different accounts handle threats of bias and unwarranted inferences.

Regardless of the values behind choices in collecting, modeling, and drawing inferences from data, I can critically evaluate how good a job has been done (test too sensitive, not sensitive enough, violated assumptions).
An even more extreme argument moves from "models are false," to models are objects of belief, to: therefore statistical inference is all about subjective probability.

By the time we get to the "confirmatory stage," we've made so many judgments, so why fuss over a few subjective beliefs at the last part?

George Box (a well-known statistician): "the confirmatory stage of an investigation…will typically occupy, perhaps, only the last 5 per cent of the experimental effort. The other 95 per cent—the wondering journey that has finally led to that destination—involves many heroic subjective choices (what variables? What levels? What scales? etc. etc.)…. Since there is no way to avoid these subjective choices…why should we fuss over subjective probability?" (70)

It is one thing to say our models are objects of belief, and quite another to convert the entire task to modeling beliefs.
We may call this a shift from phenomena to epiphenomena (Glymour 2010).

Yes, there are assumptions, but we can test them, or at least discern how they may render our inferences less precise, or completely wrong.
The choice isn't between full-blown truth and degrees of belief.
We may warrant models (and inferences) to various degrees, such as by assessing how well corroborated they are.

Some try to adopt this perspective of testing their statistical models, but give us tools with very little power to find violations.
• Some of these same people, ironically, say that since we know our model is false, the criterion of high power to detect falsity is not of interest (Gelman).
• Knowing something is an approximation is not the same as pinpointing where it is false, or how to get a better model.
[Unless you have methods with power to probe this approximation, you will have learned nothing about where the model stands up and where it breaks down, what flaws you can rule out, and which you cannot.]
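What "very little power to find violations" means can be made concrete. A minimal sketch, assuming a hypothetical misspecification check that reduces to a one-sided z-test at level 0.05, where delta is the size of the violation in standard-deviation units and n is the sample size:

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided(delta: float, n: int, alpha: float = 0.05) -> float:
    """Power of a one-sided z-test of 'no violation' (delta = 0) when the true violation is delta.

    power = Pr(Z > z_alpha - delta * sqrt(n)), with Z standard normal.
    """
    std = NormalDist()
    z_alpha = std.inv_cdf(1.0 - alpha)
    return 1.0 - std.cdf(z_alpha - delta * sqrt(n))

# A violation of size 0.2 SD: usually missed with n = 25, reliably detected with n = 400.
for n in (25, 100, 400):
    print(f"n={n:>4}  power={power_one_sided(0.2, n):.3f}")
```

With power around 0.26, a check that "passes" tells us little: failing to flag the violation is the usual outcome even when a 0.2 SD violation is present, so passing rules it out only weakly.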
Back to our question

How do methods of data generation, statistical modeling, and analysis influence the construction and appraisal of theories at multiple levels?
• All statistical models are false
• Statistical significance is not substantive significance
• Statistical association is not causation
• No evidence against a statistical null hypothesis is not evidence the null is true
• If you torture the data enough, they will confess (or just omit unfavorable data)

These facts open the door to a variety of antiquated statistical fallacies, and the "all models are false," "dirty hands," and "it's all subjective" arguments encourage them.

From popularized to sophisticated research, in the social sciences, medicine, social psychology:
"We're more fooled by noise than ever before, and it's because of a nasty phenomenon called 'big data'. With big data, researchers have brought cherry-picking to an industrial level." (Taleb, Fooled by Randomness 2013)

It's not big data, it's big mistakes about methodology and modeling.
This business of cherry-picking falls under a more general issue of "selection effects" that I have been studying and writing about for many years.

Selection effects come in various forms and are given different names: double counting, hunting with a shotgun (for statistical significance), looking for the pony, look-elsewhere effects, data dredging, multiple testing, p-value hacking.

One common example: a published result of a clinical trial alleges a statistically significant benefit (of a given drug for a given disease) at a small level, .01, but ignores 19 other non-significant trials. Hunting through 20 such tries makes it easy to find a positive result on one factor or another, even if all are spurious.

The probability that the procedure yields erroneous rejections differs from, and will be much greater than, 0.01 (nominal vs. actual significance levels).
How to adjust for hunting and multiple testing is a separate issue (e.g., false discovery rates).
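The gap between nominal and actual significance levels can be computed in the simplest case. A minimal sketch, assuming the 20 trials are independent and every null hypothesis is true, so each has a 0.01 chance of a spurious rejection; the Monte Carlo check uses uniform p-values under the null, with a hypothetical seed and number of repetitions:

```python
import random

def actual_level(nominal: float, k: int) -> float:
    """Pr(at least one of k independent true-null tests rejects at the nominal level)."""
    return 1.0 - (1.0 - nominal) ** k

def simulate_hunting(nominal: float, k: int, reps: int, seed: int = 1) -> float:
    """Monte Carlo: under a true null each p-value is Uniform(0, 1); report the smallest."""
    rng = random.Random(seed)
    hits = sum(
        min(rng.random() for _ in range(k)) < nominal
        for _ in range(reps)
    )
    return hits / reps

print(f"exact:     {actual_level(0.01, 20):.3f}")
print(f"simulated: {simulate_hunting(0.01, 20, reps=50_000):.3f}")
```

A single reported "significant at .01" result selected from 20 tries thus corresponds to roughly an 18% chance of at least one spurious rejection, not 1%.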
If one reports results selectively, or stops when the data look good, etc., it becomes easy to prejudge hypotheses:
Your favored hypothesis H might be said to have "passed" the test, but it is a test that lacks stringency or severity (our minimal principle for evidence again).
• Selection effects alter the error probabilities of tests and estimation methods, so at least methods that compute them can pick up on the influences.
• If, on the other hand, such results are reported in the same way as if no selection had occurred, significance testing's basic principles are being twisted, distorted, and invalidly used.
• It is not a problem about long runs either: we cannot say about the case at hand that it has done a good job of avoiding the source of misinterpretation, since the procedure makes it so easy to find a fit even if H is false.
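"Stopping when the data look good" alters error probabilities in a way a simulation makes visible. A minimal sketch, assuming Normal(0, 1) data with the null μ = 0 true, a two-sided z-test at nominal level 0.05 (|z| > 1.96) recomputed after each new observation from n = 10 up to a hypothetical maximum of 100, and stopping at the first "significant" result:

```python
import random
from math import sqrt

def optional_stopping_rate(reps: int, n_min: int = 10, n_max: int = 100,
                           z_crit: float = 1.96, seed: int = 7) -> float:
    """Fraction of experiments, all with H0 (mu = 0) true, that ever reach |z| > z_crit.

    Data are Normal(0, 1); the z statistic sum/sqrt(n) is recomputed after every
    new observation from n_min to n_max, and sampling stops at the first rejection.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            if n >= n_min and abs(total / sqrt(n)) > z_crit:
                rejections += 1  # stop as soon as the data "look good"
                break
    return rejections / reps

rate = optional_stopping_rate(reps=2000)
print(f"nominal level: 0.05, actual level with optional stopping: {rate:.3f}")
```

Setting n_min equal to n_max recovers a single test at a fixed sample size, whose rejection rate stays near the nominal 0.05; the inflation comes entirely from the stopping rule.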
The growth of fallacious statistics is due to the acceptability of methods that declare themselves free from such error-probabilistic encumbrances (e.g., Bayesian accounts).
Popular methods of model selection (AIC and others) suffer from similar blind spots.

Whole new fields for discerning spurious statistics and non-replicable results ("statistical forensics") all use error statistical methods to identify flaws (Stan Young, Uri Simonsohn, Brad Efron, Baggerly and Coombes).
• All statistical models are false
• Statistical significance is not substantive significance
• Statistical association is not causation
• No evidence against a statistical null hypothesis is not evidence the null is true
• If you torture the data enough, they will confess (or just omit unfavorable data)

To us, the list is not a list of embarrassments but of justifications for the account we favor.
Models are false:
• This does not prevent finding out true things with them.
Discretionary choices in modeling:
• Do not entail we are only really learning about beliefs.
• Do not prevent critically evaluating the properties of the tools you chose.

A methodology that uses probability to assess and control error probabilities has the basis for pinpointing the fallacies (statistical forensics, meta-statistical analytics).

These models work because they need only capture rather coarse properties of the phenomena being probed: the error probabilities assessed are approximately related to actual ones.

Problems are intertwined with testing the assumptions of statistical models.
The person from whom I've learned the most about this is Aris Spanos, who will now turn to that.