Good Data Dredging
Clark Glymour
My Theses
• There is good data dredging and bad data dredging
• Finite sample error probabilities are illusory.
• Good data dredging is serious severe testing.
• Good data dredging should accompany every empirical study in the
social and behavioral sciences.
The Main Argument
1. Historically, data dredging has been key to fundamental scientific discoveries.
2. The best hypotheses are often not conceived of.
3. Much of modern science considers large numbers of variables.
4. The number of possible causal relations grows super-exponentially with the
number of variables.
5. There are sometimes more possible causes of a phenomenon than can be
tested by experiment.
6. Many experiments require adjustments for common causes.
7. Statistical experience is with bad data dredging.
8. Error probabilities for causal inference presuppose causal relations that
themselves have no error probabilities.
9. Good data-dredging methods are available and improving; they can help, and
have helped, with problems 2-5.
10. Good data dredging is severe testing.
Data Dredging Is:
• Examining data to “generate” hypotheses, bringing them to notice, and…
• Evaluating the hypotheses on that same data: taking the data to support,
confirm, or provide reason to believe (or disbelieve) the hypotheses the data
bring to notice.
• Data dredging is more than “suggesting” a hypothesis.
• Modern data dredging is done by computational search algorithms.
• Modern search algorithms are multivariate statistical estimators of parameter
structures (“models”), often interpreted as collections of causal claims.
Who Dredged Data?
• Kepler
• Darwin
• Cannizzaro
• Mendeleev, etc.
• Everyone who has ever based a causal claim, in whole or in part, on a
regression analysis, ANOVA, or factor analysis using non-experimental
data.
• Everyone who has selected variables to adjust for in experiments
using such methods.
Why Should Data Be Dredged?
“But it is not, I conceive, a valid reason for accepting any given
hypothesis, that we are unable to imagine any other which will
account for the facts. There is no necessity for supposing that the true
explanation must be one which, with only our present experience, we
could imagine.” --J.S. Mill, A System of Logic
Why Should Data Be Dredged?
”The historical record of scientific inquiry…is
characterized by the problem of unconceived
alternatives. Past scientists have routinely failed even
to conceive of alternatives to their own theories and
lines of theoretical investigation, alternatives that
were both well-confirmed by the evidence available
at the time and sufficiently serious as to be ultimately
accepted by later scientific communities.”
--P. Kyle Stanford, Exceeding Our Grasp
Why Should Data Be Dredged?
Number of acyclic graphs as a function of the number of variables:

Variables   Acyclic graphs
0           1
1           1
2           3
3           25
4           543
5           29281
6           3781503
7           1138779265
8           783702329343
9           1213442454842881
10          4175098976430598143
11          31603459396418917607425
12          521939651343829405020504063
13          18676600744432035186664816926721
14          1439428141044398334941790719839535103
15          237725265553410354992180218286376719253505
16          83756670773733320287699303047996412235223138303
17          62707921196923889899446452602494921906963551482675201
18          99421195322159515895228914592354524516555026878588305014783
19          332771901227107591736177573311261125883583076258421902583546773505
20          2344880451051088988152559855229099188899081192234291298795803236068491263
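The counts above follow Robinson's recurrence for labeled acyclic digraphs. A minimal Python check (the function name count_dags is my own):

```python
# Robinson's recurrence for the number of labeled DAGs on n nodes:
#   a(n) = sum_{k=1}^{n} (-1)^(k+1) * C(n,k) * 2^(k(n-k)) * a(n-k),  a(0) = 1
from math import comb

def count_dags(n: int) -> int:
    a = [1]  # a(0) = 1: the empty graph
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

for n in range(6):
    print(n, count_dags(n))  # 1, 1, 3, 25, 543, 29281 -- matches the table
```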
What is Severe Testing?
Mayo: were H false it would very probably have failed the test.
Spanos: test as many aspects of H as you can, in as many conditions as
you can, such that if H is false it should fail one of them. An H that
survives is severely tested.
Their various wordings suggest they agree.
Serious severe testing: test multiple, diverse implications of complex
causal claims.
Bad Data Dredging: Regression
Suppose Y is some variable and the Xi (indexed by i) are a few variables,
some of which, we know not which, we think might be causes of Y.
We want to find as many of the Xi as we can that are causes of Y, but we
want to avoid false positives.
I will suppose we have a large sample of Y and the Xi: cases drawn
independently and at random from a much larger population.
Assume all the variables and noise terms are Gaussian.
Bad Data Dredging: Regression
[Figure: the true linear Gaussian model over X1–X6 and Y, with unobserved
variables U1 and U2. The Xi are assumed not to be effects of Y. N = 1,000.]
Regression result: all variables except X6 are reported as direct causes of Y.
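A minimal simulation sketch of the phenomenon, assuming numpy and statsmodels (my illustrative graph, smaller than the slide's):

```python
# Sketch: an unobserved confounder U makes regression report a non-cause (X1)
# as a "direct cause" of Y. Illustrative graph, not the slide's exact model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
U  = rng.normal(size=n)        # unobserved common cause
X1 = U + rng.normal(size=n)    # NOT a cause of Y, but confounded with it by U
X5 = rng.normal(size=n)        # the only true direct cause of Y
Y  = X5 + U + rng.normal(size=n)

X = sm.add_constant(np.column_stack([X1, X5]))
print(sm.OLS(Y, X).fit().summary())  # X1's coefficient comes out large and "significant"
```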
Automated Causal Search Result (FCI)
[Figure: the PAG returned by FCI on the same data; circles at edge endpoints
mark orientations the data cannot determine.]
Only X5 is a direct cause of Y; the other variables are not even indirect causes of Y.
Regression had to be told that Y is not a cause of the Xi; FCI did not need that
information.
Severe and Appropriate Testing
• Regression does only one test for each Xi -> Y, and the wrong test:
whether Xi and Y are independent conditional on all the other Xj.
• Regression takes colliders A -> B <- C to indicate that A causes C, or vice versa.
• The search algorithm (FCI) does multiple tests for each Xi -> Y and allows
the edge only if every test rejects the null hypothesis that Xi and Y
are conditionally independent.
• The search algorithm correctly accounts for colliders and uses them to direct
edges.
• The search algorithm recognizes that if A -> B <-> C <- D, then B and C have a
common unmeasured cause.
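A sketch of FCI on the confounded data from the regression example, assuming the causal-learn package (pip install causal-learn); the printed edge format varies by library version:

```python
# Sketch: FCI's battery of conditional-independence tests on confounded data.
import numpy as np
from causallearn.search.ConstraintBased.FCI import fci

rng = np.random.default_rng(0)
n = 1000
U  = rng.normal(size=n)          # unobserved confounder
X1 = U + rng.normal(size=n)      # non-cause of Y
X5 = rng.normal(size=n)          # true direct cause of Y
Y  = X5 + U + rng.normal(size=n)

data = np.column_stack([X1, X5, Y])
g, edges = fci(data, alpha=0.05)  # returns a PAG and its edge list
for edge in edges:
    print(edge)  # expect arrowheads into Y: the tests mark Y as a collider of X1 and X5
```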
Why Should We Trust the Output of a Search
Procedure?
[Figure omitted: four checkmarks and two question marks.]
OK, Serious Science
• Does acid rain influence plant growth in North Carolina?
• Small sample of Spartina plants from the Cape Fear Estuary; 14 variables.
• What genes influence time to bolting in Arabidopsis thaliana?
• RNA sampled from 47 plants; >19,000 variables.
Spartina Grass Growth
• What factors influence the biomass of Spartina grass in the Cape Fear
estuary?
• In other words, what causes some Spartina plants to be larger and heavier
than others?
Many Possible Answers
• 13 chemical factors (biologist-identified):
• Acidity (due to acid rain)
• Salinity (due to proximity to the ocean) (Biologist’s Preferred Answer)
• Ammonia salts (due to agricultural drainage)
• Phosphates
• Nitrogen
• Metal ions of many kinds
• Etc.
PC vs. Regression
• Determine the factors that influence Spartina biomass from
measurements of 45 plugs sampled from the estuary.
• A statistician tried many versions of regression and gave up on the
problem (J. Rawlings, Applied Regression Analysis, 1st edition).
• A greenhouse experiment followed, varying pH while controlling for
salinity and varying salinity while controlling for pH.
• The simple PC search algorithm predicted that pH is the controlling
variable (a code sketch follows this slide).
• The PC result agreed with the greenhouse experiment:
salinity does not influence biomass when pH is held constant;
pH does influence biomass when salinity is held constant.
(Spirtes et al., 1993.)
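A hedged sketch of a PC run, again assuming causal-learn; the file spartina.csv and its column layout are hypothetical stand-ins for the 45-plug, 14-variable data:

```python
# Sketch: PC search over observational measurements (hypothetical data file).
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

# Rows = sampled plugs; columns = biomass, pH, salinity, and other factors.
data = np.loadtxt("spartina.csv", delimiter=",", skiprows=1)  # hypothetical file
cg = pc(data, alpha=0.05)             # Fisher-z independence tests by default
for edge in cg.G.get_graph_edges():   # inspect the recovered edges
    print(edge)                       # look for the edge pointing into biomass
```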
Arabidopsis thaliana: Genes Regulating Bolting Time
• mRNA from 47 plants
• 14 predicted regulator genes
• 9 mutant plants survived
• 4 novel regulators found
• 5 known regulators found
• Correlation and LASSO failed.
Used PC + resampling + estimation of total effects + ranking of edges
(a simplified sketch of the resampling-ranking idea follows).
Analogous results with the identification of regulators in yeast, where the
method outperformed other algorithms.
(Stekhoven et al., Nature, 2017.)
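A rough sketch of the resampling-plus-ranking idea referenced above (my own simplification, not the paper's exact pipeline): rerun the search on bootstrap resamples and rank edges by how often they recur.

```python
# Sketch: stability ranking of edges over bootstrap resamples (simplified).
from collections import Counter
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

def edge_stability(data: np.ndarray, n_boot: int = 100, alpha: float = 0.05) -> Counter:
    rng = np.random.default_rng(0)
    counts = Counter()
    n = data.shape[0]
    for _ in range(n_boot):
        sample = data[rng.integers(0, n, size=n)]  # bootstrap resample of the rows
        cg = pc(sample, alpha=alpha)
        counts.update(str(e) for e in cg.G.get_graph_edges())
    return counts  # edges recurring in most resamples are the most credible

# usage: for edge, k in edge_stability(data).most_common(10): print(edge, k / 100)
```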
Reproducibility: Overlooking Unobserved
Confounders
• Examples:
• Boston Housing Study
Build Pure Clusters Result
• 1972 Election (Goldberger)
General Remarks about Search
• There are no nontrivial confidence intervals for the results of causal search.
• Confidence intervals are erroneous or meaningless if the causal structure is wrong.
• The structure of the data should be considered in choosing a search method.
• Prior knowledge constraining causal relations helps search.
• The credibility of search results can be improved by resampling strategies.
• The reliability of search methods can be assessed by selective experiments.
• The reliability of search methods can be assessed by simulation experiments.
• Frequentist tests vs. quasi-Bayesian scores in search is not a foundational
issue: use whatever works best. (p-values can be used as scores, even
converted to odds; see the sketch after this list.)
• As yet there is no appropriate Bayesian score search for identifying
unobserved confounders.
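On converting p-values to odds: one standard route is the Sellke-Bayarri-Berger bound, which for p < 1/e lower-bounds the Bayes factor in favor of the null by -e·p·ln(p). A small illustration:

```python
# Sellke-Bayarri-Berger bound: for p < 1/e, BF(null : alternative) >= -e * p * ln(p).
from math import e, log

def min_bayes_factor(p: float) -> float:
    assert 0 < p < 1 / e, "bound applies only for p < 1/e"
    return -e * p * log(p)

for p in (0.05, 0.01, 0.001):
    bf = min_bayes_factor(p)
    print(f"p = {p}: BF >= {bf:.3f}, so odds against the null are at most {1/bf:.1f} : 1")
```

So p = 0.05 corresponds to odds against the null of at most about 2.5 : 1, far weaker than "1 in 20" suggests.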
Why Don’t Social Scientists Search?
• They believe that estimation requires confidence intervals, and that
confidence intervals say something correct about error in the individual
sample.
• They have been taught that search is evil witchcraft.
• The history of science is no part of their education.
• They do not believe the assumptions required for sound causal
search, but they assume them in every case.
• They do search, in fact, but with bad methods.
Ancillary Remarks
• Correction for multiple tests can be done with search algorithms, but
probably shouldn't be.
• Bonferroni: see Mayo-Wilson's remarks.
• False Discovery Rate:
• Good if you want to choose a test alpha to control the fraction of false
positives in a specified set of tests…
• But don't think that alpha says anything about the probability that any
particular positive among the tests is false…
• Because the FDR varies with the set of tests included, as does
Bonferroni (a small demonstration follows this list).
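A minimal Benjamini-Hochberg sketch (my own toy numbers) showing that which tests are rejected depends on the family of tests included:

```python
# Benjamini-Hochberg: reject the k smallest p-values, where k is the largest
# rank with p_(k) <= k * q / m, for family size m and target FDR q.
def bh_reject(pvals, q=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])  # indices of rejected hypotheses

small_family = [0.001, 0.008, 0.039, 0.041]
big_family = small_family + [0.3, 0.5, 0.7, 0.9]  # same tests plus null-ish extras
print(bh_reject(small_family))  # [0, 1, 2, 3]: all four rejected
print(bh_reject(big_family))    # [0, 1]: same p-values, fewer rejections
```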
Ancillary Remarks
• It is unethical always to ignore outcomes that are not prespecified.
• Unexpected “side effects”
• It is inefficient to ignore outcomes that are not prespecified.
• Serendipity; opportunity costs
• Rather than prespecifying outcomes, investigator bias is better
controlled by blinding the data analyst to the meaning of the
variables.
• Is the widely claimed reproducibility failure in biomedical and
psychological experiments due in part to failures to search for
confounders in treatment assignments?

