And thereby hangs a tail


36th Fisher Memorial Lecture delivered at the Royal Statistical Society meeting on 4 September 2017

Published in: Data & Analytics

  1. 1. And thereby hangs a tail The strange history of P-values Stephen Senn 36th Fisher Memorial Lecture (c) Stephen Senn 1
  2. 2. Acknowledgements: My thanks to the Fisher Memorial Trust for inviting me to give the 36th Fisher Memorial Lecture and to the Royal Statistical Society for kindly agreeing to host it. Sandy Zabell’s 2008 paper on Student has been extremely useful, as have various papers and comments by John Aldrich and Stephen Stigler, David Howie’s book Interpreting Probability, and Hald’s and Stigler’s histories. I thank John Aldrich and Andy Grieve for helpful comments on an earlier version. This work is partly supported by the European Union’s 7th Framework Programme for research, technological development and demonstration under grant agreement no. 602552 (“IDeAl”).
  3. 3. An apology to all of you: The abstract promised much more than I can deliver. The complete history of P-values would start with Arbuthnot (1710) or perhaps Daniel Bernoulli (1734) and continue to 2017. I shall limit myself (largely) to the first half of the 20th century (in fact, really, to the years 1908-1939): 31/307 ≈ 10%. I shall just occasionally pretentiously sprinkle a few other names. I have various excuses, but the best is this: this lecture stands between you and a drink.
  4. 4. An apology to Bayesians: I am going to claim that part of the problem with the current debate about the suitability or otherwise of P-values is to do with Bayesian statistics. This most emphatically does not mean that I think that the Bayesian form of inference is bad. My own thinking on statistical inference has been profoundly influenced (for the better!) by having had interactions with prominent Bayesian statisticians. Nevertheless, I am going to claim that part of the perceived problem with P-values reflects an unresolved struggle in Bayesian inference that has been going on for very nearly 100 years (since 1918). I am now going to try and explain why. I shall start with the case against Fisher.
  5. 5. Fisher the cause of inferential confusion? David Colquhoun quoting Robert Matthews
  6. 6. Fisher the arch villain? ‘As was said above, reports of clinical experiments often culminate in a significance test (or a set of tests, one for each variable observed) of the null hypothesis that the new treatment is indistinguishable from the standard. To anyone whose sensibilities have not been blunted by professional dedication to the science of statistics, such tests bring a pleasing touch of mystery and ceremony to the proceedings. It is natural to suppose that well-established statistical theory supports such tests. That is not so. Significance tests of such null hypotheses at the end of an experiment can fairly be laid at R. A. Fisher’s door, especially because of his insistence on them in The Design of Experiments.’ Francis Anscombe, 1990
  7. 7. Fisher the false plagiarising prophet? The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. ‘We want to persuade you of one claim: that William Sealy Gosset (1876-1937)—aka "Student" of the Student's t-test—was right, and that his difficult friend, Ronald A. Fisher, though a genius, was wrong. Fit is not the same thing as importance. Statistical significance is not the same thing as scientific importance. R², t-statistic, F-test, and all the more sophisticated versions of them in time series and the most advanced statistics are misleading, at best.’ Ziliak and McCloskey, 2006
  8. 8. The false history • To the extent that scientists were using formal statistical methods prior to the 1920s they were what we would now call Bayesian methods • RA Fisher invented P-values as part of his rival system of frequentist statistics • These seemed to give significance much more easily • They became an instant hit and seduced scientists away from the path of Bayesian rectitude • This is (largely) responsible for the replication crisis we now face
  9. 9. However • The history is not like that • P-values may or may not be good statistics • My own view is that they are one amongst many ways of looking at data • Setting the record straight will help us see what the problem is • We are not close to a resolution • Understanding what is necessary will help us do better statistics • I am well aware that the problems we face will have to be solved by better brains than mine
  10. 10. William Sealy Gosset 1876-1937 • Born Canterbury 1876 • Educated Winchester and Oxford • First in mathematical moderations 1897 and first in degree in Chemistry 1899 • Started with Guinness in 1899 in Dublin • Autumn 1906-spring 1907 with Karl Pearson at UCL • 1908 published ‘The probable error of a mean’ in Biometrika
  11. 11. Student, Biometrika VI, March 1908, p. 2
  12. 12. See The James Lind Library and Cushny AR, Peebles AR (1905). The action of optical isomers. II. Hyoscines. J Physiology 32:501-510.
  13. 13. Student 1908
  14. 14. Student, Biometrika, 1908, VI, 1, p. 21
  15. 15. What Student did and did not do. What he did: • Obtained the distribution of the sample standard deviation (calculated using divisor n) • Showed it was uncorrelated with the sample mean and its square (assuming a symmetric data distribution) • Obtained the distribution of the ratio of the mean to the sample standard deviation, assuming independence • Tabulated this distribution • Carried out various empirical investigations • Applied it • Interpreted the probabilities in a ‘Bayesian’ way. What he did not do: • Show that the sample mean and variance were independent (he just showed they were uncorrelated) • Generalise the problem beyond one sample • Define the t-statistic in its modern form (the ratio of the mean to the standard error, with the SD calculated using divisor n-1) • Use the modern significance test interpretation • Explicitly use Bayes theorem or any derivation that we would now call Bayesian
  16. 16. The extent to which Student uses a Bayesian input? Student, Biometrika VI, March 1908, p. 1. However, more explicit reference to prior distributions is provided in his correlation coefficient paper published at the same time.
  17. 17. Now compare Student and Fisher: Fisher, Statistical Methods for Research Workers, 1925. An inverse (or what we would now call Bayesian) probability statement. A direct probability statement.
  18. 18. Looking at all three of Student’s analyses
  19. 19. What Fisher did and did not do. What Fisher did: • Reformulated the statistic so that asymptotically it was Normal (0,1) • Showed that it could be adapted for use in many other problems • Two sample t (with suitable assumptions) • Regression coefficients • Generalised it to three or more means • Stressed an alternative interpretation (the one we now use) • Note, however, that these could also be found in Karl Pearson’s work • Suggested a doubling • This last (controversial) step gave significance less easily! What Fisher did not do: • In this example he did not calculate the P-value • He merely noted ‘significance’ at a conventional level (1%) • This was computationally convenient • He does calculate the P-value for the sign test: (1/2)^9 = 1/512
  20. 20. Numerically Student and Fisher (would) agree: 0.0028 = 2 × 0.0014
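The agreement between the two tail areas can be checked numerically. The sketch below is only an illustration: the ten paired differences are the values usually quoted for the Cushny and Peebles sleep data that Student analysed (they are not given in this transcript), and the tail area is obtained by a simple Simpson's-rule integration of the t density rather than from Student's or Fisher's own tables. The function names are hypothetical.

```python
import math

# Assumption: paired differences in hours of sleep gained for ten patients,
# as commonly quoted from Student (1908); not present in the transcript.
diffs = [1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4]

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3

n = len(diffs)
mean = sum(diffs) / n
ss = sum((d - mean) ** 2 for d in diffs)

student_z = mean / math.sqrt(ss / n)                    # Student's form: SD with divisor n
t_stat = mean / (math.sqrt(ss / (n - 1)) / math.sqrt(n))  # modern form: mean over standard error

one_sided = upper_tail(t_stat, n - 1)   # the single tail area
two_sided = 2 * one_sided               # Fisher's doubling

# Sign test on the nonzero differences, all of which are positive here.
nonzero = [d for d in diffs if d != 0]
sign_test_p = 0.5 ** len(nonzero)

print(f"z = {student_z:.3f}, t = {t_stat:.3f}")
print(f"one-sided = {one_sided:.4f}, two-sided = {two_sided:.4f}")
print(f"sign test = 1/{round(1 / sign_test_p)}")
```

Under these assumptions the single tail area and its double should land close to the 0.0014 and 0.0028 quoted on the slide, which is the sense in which the doubling makes significance harder, not easier, to reach.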
  21. 21. Diversion: NHST (null hypothesis significance testing) • This is a monstrous hybrid philosophy formed by mixing Fisherian significance tests (using P-values) and Neyman-Pearson hypothesis tests using rejection/acceptance and Type I and II error rates • People talk NP but do Fisherian • This leads them into inferential error. Goodman, 1992, Statistics in Medicine, p. 878
  22. 22. Not so: p. 70. IMO Fisherian and NP approaches do not differ (at least for common standard cases) as regards this.
  23. 23. The way Student saw it: Following Laplace (via Airy?) you could just invert the probability statement. The distribution is centred on the statistic and is a statement about the probability of the parameter.
  24. 24. This is how Fisher saw it
  25. 25. The two interpretations described: The distributions are different but one is a translation of the other. In fact you can reflect one about 2.03 (the average of the mean, 4.06, of one distribution and the mean, 0, of the other) to get the other distribution. Different tail areas are numerically identical but have different interpretations.
  26. 26. A magnification of the previous diagram. Bayesian: the probability that the treatment that appears to be better is worse after all. Frequentist: the probability of a result as extreme or more extreme if the null hypothesis is true. NB Andy Grieve has made the point to me that Bayesians would more naturally use 1-P.
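The numerical identity of the two tail areas can be written out in one line. This is a sketch in notation that is not on the slides: $\hat\tau$ is the observed mean difference, $s_{\hat\tau}$ its standard error, $t_{\mathrm{obs}} = \hat\tau / s_{\hat\tau}$, and $T$ follows a (symmetric) $t$ distribution with $n-1$ degrees of freedom.

```latex
% Inverse (Student/Laplace) reading: the parameter is distributed about the statistic,
%   \tau = \hat\tau - s_{\hat\tau} T, with T ~ t_{n-1}.
% Direct (Fisher) reading: the statistic is distributed about the null value \tau = 0.
\begin{align*}
\underbrace{P(\tau \le 0 \mid \text{data})}_{\text{inverse probability}}
  &= P\!\left(\hat\tau - s_{\hat\tau} T \le 0\right)
   = P\!\left(T \ge t_{\mathrm{obs}}\right) \\
  &= \underbrace{P\!\left(\frac{\hat\tau}{s_{\hat\tau}} \ge t_{\mathrm{obs}} \,\Big|\, \tau = 0\right)}_{\text{direct probability (one-sided P-value)}}
\end{align*}
```

The same number is thus read either as a statement about the parameter given the data or as a statement about the data given the null hypothesis.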
  27. 27. Two independent observations, X1 and X2, from a Normal distribution with mean 0. 10% level of significance (two-sided). Red circles: significant. Black x: non-significant. Contours give probability densities for the circular Normal. 100 points have been simulated; 9 simulated values are ‘significant’.
  28. 28. Two independent observations, X1 and X2, from a Normal distribution with mean 2. 10% level of significance (two-sided). Red circles: significant. Black x: non-significant. Contours give probability densities for the circular Normal. 100 points have been simulated; 88 simulated values are ‘significant’.
  29. 29. Two independent observations, X1 and X2, from a Normal distribution with mean 0. 10% level of significance (two-sided). Red circles: significant. Black x: non-significant. Contours give probability densities for the circular Normal. 100 points have been simulated; 8 simulated values are ‘significant’.
  30. 30. Two independent observations, X1 and X2, from a Normal distribution with mean 2. 10% level of significance (two-sided). Red circles: significant. Black x: non-significant. Contours give probability densities for the circular Normal. 100 points have been simulated; 35 simulated values are ‘significant’.
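The four simulation slides can be imitated with a short script. The transcript does not say which tests were used, so the sketch below rests on an assumption: that the first pair of slides uses a test with the standard deviation known to be 1, and the second pair uses a t-test with the standard deviation estimated from the two observations (one degree of freedom). The function name, seed, and simulation size are all hypothetical choices.

```python
import math
import random

def rejection_rates(mu, n_sims=20000, seed=1):
    """Fraction of simulated pairs (X1, X2) ~ N(mu, 1) declared significant at the
    two-sided 10% level by (a) a known-variance z-test and (b) a one-sample t-test."""
    rng = random.Random(seed)
    z_crit = 1.6449                      # 95th percentile of the standard Normal
    t_crit = math.tan(0.45 * math.pi)    # 95th percentile of t with 1 df (a Cauchy)
    z_hits = 0
    t_hits = 0
    for _ in range(n_sims):
        x1 = rng.gauss(mu, 1.0)
        x2 = rng.gauss(mu, 1.0)
        mean = (x1 + x2) / 2
        # (a) SD assumed known and equal to 1, so SE(mean) = 1 / sqrt(2).
        if abs(mean) * math.sqrt(2) > z_crit:
            z_hits += 1
        # (b) SD estimated with divisor n - 1 = 1: s = |x1 - x2| / sqrt(2),
        #     so the estimated SE(mean) = s / sqrt(2) = |x1 - x2| / 2.
        se_hat = abs(x1 - x2) / 2
        if abs(mean) > t_crit * se_hat:
            t_hits += 1
    return z_hits / n_sims, t_hits / n_sims

z_null, t_null = rejection_rates(0.0)
z_alt, t_alt = rejection_rates(2.0)

print(f"mean 0: z-test {z_null:.3f}, t-test {t_null:.3f}")
print(f"mean 2: z-test {z_alt:.3f}, t-test {t_alt:.3f}")
```

If the assumption is right, both tests should reject about 10% of the time when the mean is 0, while at mean 2 the t-test pays heavily in power for having to estimate the standard deviation from two observations, which is the pattern of counts reported on the slides.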
  31. 31. To sum up • Fisher and Student did not disagree as regards probabilities numerically • At least not in any way that casts Fisher as more liberal • Two-tailed controversy • They differed as regards interpretation of the probability • Fisher saw that any Bayesian interpretation depended on prior assumptions • Student simply used a standard default argument • At least if the evidence of his 1908 paper is anything to go by • Student’s paper was only eventually influential thanks to Fisher • Speculation: Until Fisher’s work made an impact, estimation continued to be largely ‘Bayesian’ but ignoring nuisance parameter uncertainty
  32. 32. So who did produce a formal Bayesian derivation of the t-distribution? Take your pick from: Dedekind, 1860 (nearly); Luroth, 1874 (only considered 50% probability but is otherwise more general); Edgeworth, 1883; Burnside, 1923; Jeffreys, 1931 (and also in his book of 1939). We shall now consider the story with Jeffreys, for although Jeffreys produced a result for the t-distribution that is essentially the Luroth/Student/Fisher result, he also did something radically different. However, to get to Jeffreys we need to consider Laplace (briefly) and then Broad.
  33. 33. Some pretentious sprinkling of names (as promised): Laplace (1774); De Morgan (1838); Venn (1888), pp. 196-197
  34. 34. CD Broad 1887*-1971 • Graduated Cambridge 1910 • Fellow of Trinity 1911 • Lectured at St Andrews & Bristol • Returned to Cambridge 1926 • Knightbridge Professor of Philosophy 1933-1953 • Interested in epistemology and psychic research *NB Harold Jeffreys born 1891 & Fisher 1890
  35. 35. CD Broad, 1918: pp. 393-394. As m goes to infinity the first approaches 1. If n is much greater than m the latter is small.
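The two probabilities contrasted on this slide appear only as images, so the following sketch rests on an assumed setup: a population of n items of which an unknown number K are positive, a uniform prior on K over 0..n, and m items drawn without replacement that all turn out positive. "The first" is then taken to be the probability that the next item drawn is positive, and "the latter" the probability that all n items are positive. The function names are hypothetical, and the closed forms quoted in the comments are the standard results for this setup, checked here by brute force.

```python
from fractions import Fraction
from math import comb

def posterior_over_k(n, m):
    """Posterior for K (number of positives among n) after m draws without
    replacement were all positive, starting from a uniform prior on K = 0..n."""
    weights = [Fraction(comb(k, m), comb(n, m)) for k in range(n + 1)]  # likelihoods
    total = sum(weights)
    return [w / total for w in weights]

def prob_next_positive(n, m):
    """Probability that the next (m+1)th draw is also positive."""
    post = posterior_over_k(n, m)
    return sum(p * Fraction(k - m, n - m) for k, p in enumerate(post) if k >= m)

def prob_all_positive(n, m):
    """Probability that every one of the n items is positive (the general law)."""
    return posterior_over_k(n, m)[n]

n, m = 1000, 10
next_one = prob_next_positive(n, m)   # standard closed form: (m + 1) / (m + 2)
whole_law = prob_all_positive(n, m)   # standard closed form: (m + 1) / (n + 1)

print(f"next item positive: {next_one} ~ {float(next_one):.3f}")
print(f"all {n} positive:   {whole_law} ~ {float(whole_law):.4f}")
```

The first quantity does not depend on n and creeps towards 1 as m grows; the second stays small for as long as the sample is a small fraction of the population.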
  36. 36. What Jeffreys understood: Theory of Probability, 3rd edition, p. 128
  37. 37. The Economist gets it wrong: The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.
  38. 38. What Jeffreys (and Wrinch) concluded • If you have an uninformative prior distribution the probability of a precise hypothesis is very low • It will remain low even if you have lots of data consistent with it • It will never become plausible • You need to allocate a solid lump of probability that it is true • Nature has decided, other things being equal, that simpler hypotheses are more likely. Dorothy Wrinch
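The effect of a lump of prior probability can be illustrated with a deliberately simple model, which is an assumption of this sketch and not a reconstruction of Jeffreys and Wrinch's own argument: a success probability θ, a prior that puts mass `lump` on the precise hypothesis θ = 1 and spreads the remainder uniformly over [0, 1], and m successes observed in m trials. The function name is hypothetical.

```python
from fractions import Fraction

def posterior_of_law(lump, m):
    """Posterior probability that theta = 1 exactly, after m successes in m trials.

    Prior: probability `lump` on theta = 1; the remaining 1 - lump spread
    uniformly on [0, 1]. Under the uniform part the marginal probability of
    m straight successes is the integral of theta**m, i.e. 1 / (m + 1).
    """
    lump = Fraction(lump)
    evidence_law = lump * 1                         # theta = 1 predicts the data with certainty
    evidence_rest = (1 - lump) * Fraction(1, m + 1)
    return evidence_law / (evidence_law + evidence_rest)

for m in (1, 10, 100, 1000):
    no_lump = posterior_of_law(0, m)                # purely uninformative prior
    half_lump = posterior_of_law(Fraction(1, 2), m)
    print(f"m = {m:4d}: no lump -> {float(no_lump):.3f}, lump 1/2 -> {float(half_lump):.3f}")
```

With no lump the precise hypothesis stays at probability zero however much confirming data arrives; with a lump it can become plausible, which is the sense of the fourth bullet above.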
  39. 39. When you switch from testing H0: τ ≤ 0 (dividing hypothesis) to H0: τ = 0 (plausible hypothesis), it makes rather little difference to the performance of a frequentist test. This may or may not be a good thing. In the Bayesian case it makes a world of difference. (The terminology is due to David Cox, 1977.)
  40. 40. Why the difference? • Imagine a point estimate of two standard errors (large sample) • Now consider the likelihood ratio for a given value of the parameter, τ, under the alternative to one under the null • Dividing hypothesis (smooth prior): for any given value τ = τ′ compare to τ = −τ′ • Plausible hypothesis (lump prior): for any given value τ = τ′ compare to τ = 0. Figure labels: H0, H1
  41. 41. Why the difference? • Imagine a point estimate of two standard errors (large sample) • Now consider the likelihood ratio for a given value of the parameter, τ, under the alternative to one under the null • Dividing hypothesis (smooth prior): for any given value τ = τ′ compare to τ = −τ′ • Plausible hypothesis (lump prior): for any given value τ = τ′ compare to τ = 0. Figure labels: H1, H1, H0
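The two comparisons can be made concrete with a Normal likelihood. This is a minimal sketch under stated assumptions: the estimate is measured in standard-error units (so it equals 2), the likelihood is Normal with unit standard error, and `tau` stands for the candidate value under the alternative. The function names are hypothetical.

```python
import math

ESTIMATE = 2.0  # point estimate of two standard errors, as on the slide

def likelihood(tau, estimate=ESTIMATE):
    """Normal likelihood of parameter value tau (SE units), constant factor dropped."""
    return math.exp(-0.5 * (estimate - tau) ** 2)

def lr_dividing(tau):
    """Dividing hypothesis: compare tau with its mirror image -tau."""
    return likelihood(tau) / likelihood(-tau)

def lr_plausible(tau):
    """Plausible (point) hypothesis: compare tau with exactly 0."""
    return likelihood(tau) / likelihood(0.0)

for tau in (0.5, 1.0, 2.0, 3.0, 4.0, 5.0):
    print(f"tau = {tau:3.1f}: vs -tau -> {lr_dividing(tau):12.1f}, vs 0 -> {lr_plausible(tau):6.2f}")
```

Against the mirror-image value the ratio favours the alternative for every positive value and grows without limit, whereas against the point null it peaks at the estimate and falls below 1 for values far enough beyond it. Averaging such ratios over a prior therefore gives very different answers in the two cases.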
  42. 42. The real history • Scientists before Fisher were using tail area probabilities to calculate posterior probabilities • This was following Laplace’s use of uninformative prior distributions • Fisher pointed out that this interpretation was unsafe and offered a more conservative one • Jeffreys, influenced by CD Broad’s criticism, was unsatisfied with the Laplacian framework and used a lump prior probability on a point hypothesis being true • Etz and Wagenmakers have claimed that Haldane 1932 anticipated Jeffreys • It is Bayesian Jeffreys versus Bayesian Laplace that makes the dramatic difference, not frequentist Fisher versus Bayesian Laplace
  43. 43. In summary • The major disagreement is not between P-values and Bayes using informative prior distributions • It’s between two Bayesian approaches • Using uninformative prior distributions • Using a highly informative one • The conflict is not going to go away by banning P-values • There is no automatic Bayesianism • You have to do it for real
  44. 44. My (tentative) opinion • The fundamental conflict will not disappear by banning P-values nor by modifying them nor by re-calibrating them • There may be a harmful culture of ‘significance’, however this is defined • P-values have a (very) limited use as rough and ready tools using little structure • Where you have more structure you can often do better • Likelihood (Fisher) • Confidence distributions • Severity (Deborah Mayo) • Point estimates and standard errors • Extremely useful for future research synthesizers and should be provided regularly
  45. 45. And also, of course, Bayes! Good: • For ‘personal’ decision-making • Ramsey, De Finetti, Savage, Lindley • Involves elicitation problems: O’Hagan • In pragmatic compromises • Good • Box (1980) • Racine, Grieve, Fluehler, Smith (1986) • As an aid to thinking • The reverse Bayes of Robert Matthews • The conditional Bayes approach of Spiegelhalter, Freedman & Parmar, JRSSA, 1994 • BART. Not so good? • Bayesian significance tests • Bayes-factors • P-values modified to behave like Bayesian tests • Or Bayesian approaches modified just to make them behave like P-values
  46. 46. Speaking of BART: This lecture no longer stands between you and a drink