Views of the role of hypothesis falsification in statistical testing do not divide as cleanly between frequentist and Bayesian views as is commonly supposed. This can be shown by considering the two major variants of the Bayesian approach to statistical inference and the two major variants of the frequentist one.
A good case can be made that the Bayesian, de Finetti, just like Popper, was a falsificationist. A thumbnail view of de Finetti’s theory of learning, which is not just a caricature, is that your subjective probabilities are modified through experience by noticing which of your predictions are wrong, striking out the sequences that involved them and renormalising.
On the other hand, in the formal frequentist Neyman-Pearson approach to hypothesis testing, you can, if you wish, switch the roles of the conventional null and alternative hypotheses, making the latter the straw man and, by ‘disproving’ it, asserting the former.
The frequentist, Fisher, however, at least in his approach to testing hypotheses, seems to have taken a strong view that the null hypothesis was quite different from any other and that there was a strong asymmetry in the inferences that followed from the application of significance tests.
Finally, to complete a quartet, the Bayesian geophysicist Jeffreys, inspired by Broad, specifically developed his approach to significance testing in order to be able to ‘prove’ scientific laws.
By considering the controversial case of equivalence testing in clinical trials, where the object is to prove that ‘treatments’ do not differ from each other, I shall show that there are fundamental differences between ‘proving’ and falsifying a hypothesis and that this distinction does not disappear by adopting a Bayesian philosophy. I conclude that falsificationism is important for Bayesians also, although it is an open question as to whether it is enough for frequentists.
Unfortunately, some have interpreted Numbers Needed to Treat as indicating the proportion of patients on whom the treatment has had a causal effect. This interpretation is very rarely, if ever, necessarily correct. It is certainly inappropriate if based on a responder dichotomy. I shall illustrate the problem using simple causal models.
One also sometimes encounters the claim that the extent to which two distributions of outcomes overlap from a clinical trial indicates how many patients benefit. This is also false and can be traced to a similar causal confusion.
The Seven Habits of Highly Effective Statisticians
Stephen Senn
If you know why the title of this talk is extremely stupid, then you clearly know something about control, data and reasoning: in short, you have most of what it takes to be a statistician. If you have studied statistics then you will also know that a large amount of anything, and this includes successful careers, is luck.
In this talk I shall try to share some of my experiences of being a statistician in the hope that it will help you make the most of whatever luck life throws at you. In so doing, I shall try my best to overcome the distorting influence of that easiest of sciences, hindsight. Without giving too much away, I shall be recommending that you read, listen, think, calculate, understand, communicate, and do. I shall give you some examples of what I think works and what I think doesn’t.
In all of this you should never forget the power of negativity and also the joy of being able to wake up every day and say to yourself ‘I love the smell of data in the morning’.
Sample size determination in clinical trials is considered from various ethical and practical perspectives. It is concluded that cost is a missing dimension and that the value of information is key.
When estimating sample sizes for clinical trials there are several different views that might be taken as to what definition and meaning should be given to the sought-for treatment effect. However, if the concept of a ‘minimally important difference’ (MID) does have relevance to interpreting clinical trials (which can be disputed) then its value cannot be the same as the ‘clinically relevant difference’ (CRD) that would be used for planning them.
A doubly pernicious use of the MID is as a means of classifying patients as responders and non-responders. Not only does such an analysis lead to an increase in the necessary sample size but it misleads trialists into making causal distinctions that the data cannot support and has been responsible for exaggerating the scope for personalised medicine.
In this talk these statistical points will be explained using a minimum of technical detail.
Personalised medicine: a sceptical view
Stephen Senn
Some grounds for believing that the current enthusiasm about personalised medicine is exaggerated, founded on poor statistics and represents a disappointing loss of ambition.
Clinical trials: quo vadis in the age of COVID?
Stephen Senn
A discussion of the role of clinical trials in the age of COVID. My contribution to the phastar 2020 life sciences summit https://phastar.com/phastar-life-science-summit
This year marks the 70th anniversary of the Medical Research Council randomised clinical trial (RCT) of streptomycin in tuberculosis led by Bradford Hill. This is widely regarded as a landmark in clinical research. Despite its widespread use in drug regulation and in clinical research more widely, and its high standing with the evidence based medicine movement, the RCT continues to attract criticism. I show that many of these criticisms are traceable to a failure to understand two key concepts in statistics: probabilistic inference and design efficiency. To these methodological misunderstandings can be added the practical one of failing to appreciate that entry into clinical trials is not simultaneous but sequential.
I conclude that although randomisation should not be used as an excuse for ignoring prognostic variables, it is valuable and that many standard criticisms of RCTs are invalid.
Minimisation is an approach to allocating patients to treatment in clinical trials that forces a greater degree of balance than does randomisation. Here I explain why I dislike it.
What should we expect from reproducibility?
Stephen Senn
Is there really a reproducibility crisis and, if so, are P-values to blame? Choose any statistic you like and carry out two identical independent studies, reporting this statistic for each. In advance of collecting any data, you ought to expect that it is just as likely that statistic 1 will be smaller than statistic 2 as vice versa. Once you have seen statistic 1, things are not so simple, but if they are not so simple it is because you have other information in some form. However, it is at least instructive that you need to be careful in jumping to conclusions about what to expect from reproducibility. Furthermore, the forecasts of good Bayesians ought to obey a martingale property. On average you should be in the future where you are now but, of course, your inferential random walk may lead to some peregrination before it homes in on “the truth”. But you certainly can’t generally expect that a probability will get smaller as you continue. P-values, like other statistics, are a position not a movement. Although often claimed, there is no such thing as a trend towards significance.
Using these and other philosophical considerations I shall try to establish what it is we want from reproducibility. I shall conclude that we statisticians should probably be paying more attention to checking that standard errors are being calculated appropriately and rather less to the inferential framework.
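The martingale property referred to here can be stated in one line. For a hypothesis H and accumulating data X_1, X_2, ..., the law of iterated expectations gives

E[ P(H | X_1, ..., X_{n+1}) | X_1, ..., X_n ] = P(H | X_1, ..., X_n),

so today’s posterior probability is the expected value of tomorrow’s: a coherent forecaster cannot expect the probability to drift in any particular direction as data accumulate.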
How to combine results from randomised clinical trials on the additive scale with real world data to provide predictions on the clinically relevant scale for individual patients
The Rothamsted school meets Lord's paradox
Stephen Senn
Lord’s ‘paradox’ is a notoriously difficult puzzle that is guaranteed to provoke discussion, dissent and disagreement. Two statisticians analyse some observational data and come to radically different conclusions, each of which has acquired defenders over the years since Lord first proposed his puzzle in 1967. It features in the recent Book of Why by Pearl and Mackenzie, who use it to demonstrate the power of Pearl’s causal calculus, obtaining a solution they claim is unambiguously right. They also claim that statisticians have failed to get to grips with causal questions for well over a century, in fact ever since Karl Pearson developed Galton’s idea of correlation and warned the scientific world that correlation is not causation.
However, only two years before Lord published his paradox, John Nelder outlined a powerful causal calculus for analysing designed experiments based on a careful distinction between block and treatment structure. This represents an important advance in formalising the approach to analysing complex experiments that started with Fisher 100 years ago, when he proposed splitting variability using the square of the standard deviation, which he called the variance; it was continued by Yates and has been developed since the 1960s by Rosemary Bailey, amongst others. This tradition might be referred to as the Rothamsted School. It is fully implemented in Genstat® but, as far as I am aware, not in any other package.
With the help of Genstat®, I demonstrate how the Rothamsted School would approach Lord’s paradox and come to a solution that is not the same as the one reached by Pearl and Mackenzie, although given certain strong but untestable assumptions it would reduce to it. I conclude that the statistical tradition may have more to offer in this respect than has been supposed.
Presidents' invited lecture ISCB Vigo 2017
Discusses various issues to do with how randomised clinical trials should be analysed. See also https://errorstatistics.com/2017/07/01/s-senn-fishing-for-fakes-with-fisher-guest-post/
An early and overlooked causal revolution in statistics was the development of the theory of experimental design, initially associated with the Rothamsted School. An important stage in the evolution of this theory was the experimental calculus developed by John Nelder in the 1960s, with its clear distinction between block and treatment factors in designed experiments. This experimental calculus produced appropriate models automatically from more basic formal considerations but was, unfortunately, only ever implemented in Genstat®, a package widely used in agriculture but rarely so in medical research. In consequence, its importance has not been appreciated and the approach of many statistical packages to designed experiments is poor. A key feature of the Rothamsted School approach is that identification of the appropriate components of variation for judging treatment effects is simple and automatic.
The impressive, more recent causal revolution in epidemiology, associated with Judea Pearl, seems to have no place for components of variation, however. By considering the application of Nelder’s experimental calculus to Lord’s Paradox, I shall show that solutions that have been proposed using the more modern causal calculus are problematic. I shall also show that lessons from designed clinical trials have important implications for the use of historical data and big data more generally.
It is argued that, when it comes to nuisance parameters, an assumption of ignorance is harmful. On the other hand, this raises problems as to how far one should go in searching for further data when combining evidence.
The statistical revolution of the 20th century was largely concerned with developing methods for analysing small datasets. Student’s paper of 1908 was the first in the English literature to address the problem of second order uncertainty (uncertainty about the measures of uncertainty) seriously and was hailed by Fisher as heralding a new age of statistics. Much of what Fisher did was concerned with problems of what might be called ‘small data’, not only as regards efficient analysis but also as regards efficient design and in addition paying close attention to what was necessary to measure uncertainty validly.
I shall consider the history of some of these developments, in particular those that are associated with what might be called the Rothamsted School, starting with Fisher and having its apotheosis in John Nelder’s theory of General Balance and see what lessons they hold for the supposed ‘big data’ revolution of the 21st century.
In Search of Lost Infinities: What is the “n” in big data?
Stephen Senn
In designing complex experiments, agricultural scientists, with the help of their statistician collaborators, soon came to realise that variation at different levels had very different consequences for estimating different treatment effects, depending on how the treatments were mapped onto the underlying block structure. This was a key feature of the Rothamsted approach to design and analysis and a strong thread running through the work of Fisher, Yates and Nelder, being expressed in topics such as split-plot designs, recovering inter-block information and fractional factorials. The null block structure of an experiment is key to this philosophy of design and analysis. However, modern techniques for analysing experiments stress models rather than symmetries, and this modelling approach requires much greater care in analysis, with the consequence that you can easily make mistakes and often will.
In this talk I shall underline the obvious, but often unintentionally overlooked, fact that understanding variation at the various levels at which it occurs is crucial to analysis. I shall take three examples, an application of John Nelder’s theory of general balance to Lord’s Paradox, the use of historical data in drug development and a hybrid randomised non-randomised clinical trial, the TARGET study, to show that the data that many, including those promoting a so-called causal revolution, assume to be ‘big’ may actually be rather ‘small’. The consequence is that there is a danger that the size of standard errors will be underestimated or even that the appropriate regression coefficients for adjusting for confounding may not be identified correctly.
I conclude that an old but powerful experimental design approach holds important lessons for observational data about limitations in interpretation that mere numbers cannot overcome. Small may be beautiful, after all.
Talk given at RSS 2016 Manchester
I consider the problems that the ASA faced in getting a P-value statement together, not in terms of the process, but by looking at the expressed opinions of 21 published commentaries on the agreed statement. I then trace the history of the development of P-values. I show that the perceived problem with P-values is not just one of a supposed inadequacy of frequentist statistics but reflects a struggle at the very heart of Bayesian inference. I conclude that replacing P-values by automatic Bayesian approaches is unlikely to abolish controversy. It may be better to try to embrace diversity than to pretend it is not there.
There are many questions one might ask of a clinical trial, ranging from ‘what was the effect in the patients studied?’ to ‘what might the effect be in future patients?’ via ‘what was the effect in individual patients?’. The extent to which the answers to these questions are similar depends on the various assumptions made, and in some cases the design used may not permit any meaningful answer to be given at all.
A related issue is confusion between randomisation, random sampling, linear modelling and true multivariate-based modelling. These distinctions don’t matter much for some purposes and under some circumstances, but for others they do.
P Values and Replication: the problem is not what you think
Lecture at MRC Brain Science & Cognition, Cambridge 16 December 2015
Abstract
It has been claimed that there is a crisis of replication in science. Prominent amongst the many factors that have been fingered as being responsible is the humble and ubiquitous P-value. One journal has even gone so far as to ban all inferential statistics. However, it is one thing to banish measures of uncertainty and another to banish uncertainty from your measures. I shall claim that the apparent discrepancy between P-values and posterior probabilities is as much a discrepancy between two approaches to Bayesian inference as it is between frequentist and Bayesian frameworks, and that a further problem has been misunderstandings regarding predictive probabilities. I conclude that banning P-values won’t make all published results repeatable and that it is possibly undesirable that it should.
Stephen Senn slides: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”, presented May 23 at the session on “The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference” at the 2015 APS Annual Convention in NYC
The history of P-values is covered to try to shed light on a mystery: why did Student and Fisher agree numerically but disagree in terms of interpretation?
The replication crisis: are P-values the problem and are Bayes factors the so...
Stephen Senn
Today’s posterior is tomorrow’s prior. (Dennis Lindley [1], p. 2)
It has been claimed that science is undergoing a replication crisis and that when looking for culprits, the cult of significance is the chief suspect. It has also been claimed that Bayes factors might provide a solution.
In my opinion, these claims are misleading and part of the problem is our understanding of the purpose and nature of replication, which has only recently been subject to formal analysis [2]. What we are or should be interested in is truth. Replication is a coherence not a correspondence requirement [3] and one that has a strong dependence on the size of the replication study [4].
Consideration of Bayes factors raises a puzzling question. Should the Bayes factor for a replication study be calculated as if it were the initial study? If the answer is yes, the approach is not fully Bayesian and furthermore the Bayes factors will be subject to exactly the same replication ‘paradox’ as P-values. If the answer is no, then in what sense can an initially found Bayes factor be replicated and what are the implications for how we should view replication of P-values?
A further issue is that little attention has been paid to false negatives and, by extension, to true negatives. Yet, as is well known from the theory of diagnostic tests, it is meaningless to consider the performance of a test in terms of false positives alone.
I shall argue that we are in danger of confusing evidence with the conclusions we draw and that any reforms of scientific practice should concentrate on producing evidence that is as reliable as it can be qua evidence. There are many basic scientific practices in need of reform. Pseudoreplication [5], for example, and the routine destruction of information through dichotomisation [6] are far more serious problems than many matters of inferential framing that seem to have excited statisticians.
References
1. Lindley DV. Bayesian statistics: A review. SIAM; 1972.
2. Devezer B, Navarro DJ, Vandekerckhove J, Ozge Buzbas E. The case for formal methodology in scientific reform. R Soc Open Sci. Mar 31 2021;8(3):200805. doi:10.1098/rsos.200805
3. Walker RCS. Theories of Truth. In: Hale B, Wright C, Miller A, eds. A Companion to the Philosophy of Language. John Wiley & Sons; 2017:532-553: chap 21.
4. Senn SJ. A comment on replication, p-values and evidence by S.N.Goodman, Statistics in Medicine 1992; 11:875-879. Letter. Statistics in Medicine. 2002;21(16):2437-44.
5. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecological monographs. 1984;54(2):187-211.
6. Senn SJ. Being Efficient About Efficacy Estimation. Statistics in Biopharmaceutical Research. 2013;5(3):204-210. doi:10.1080/19466315.2012.754726
What is the significance of the P-value when reporting a statistical analysis? Is there an alternative to Fisher's approach and, if so, what is it? These are some of the issues addressed here.
There are many questions one might ask of a clinical trial, ranging from ‘what was the effect in the patients studied?’ to ‘what might the effect be in future patients?’ via ‘what was the effect in individual patients?’. The extent to which the answers to these questions are similar depends on the various assumptions made, and in some cases the design used may not permit any meaningful answer to be given at all.
A related issue is confusion between randomisation, random sampling, linear modelling and true multivariate-based modelling. These distinctions don’t matter much for some purposes and under some circumstances, but for others they do.
A yet further issue is that causal analysis in epidemiology, which has brought valuable insights in many cases, has tended to stress point estimates and ignore standard errors. This has potentially misleading consequences.
An understanding of components of variation is key. Unfortunately, the development of two particular topics in recent years, evidence synthesis by the evidence based medicine movement and personalised medicine by bench scientists, has paid scant attention to components of variation, or to the questions being asked, or both, resulting in confusion about many issues.
For instance, it is often claimed that numbers needed to treat indicate the proportion of patients for whom treatments work, that inclusion criteria determine the generalisability of results and that heterogeneity means that a random effects meta-analysis is required. None of these is true. The scope for personalised medicine has very plausibly been exaggerated and an important cause of variation in the healthcare system, physicians, is often overlooked.
I shall argue that thinking about questions is important.
The response to the COVID-19 crisis by various vaccine developers has been extraordinary, both in terms of speed of response and the delivered efficacy of the vaccines. It has also raised some fascinating issues of design, analysis and interpretation. I shall consider some of these issues, taking as my examples five vaccines: Pfizer/BioNTech, AstraZeneca/Oxford, Moderna, Novavax and J&J Janssen, but concentrating mainly on the first two. Among matters covered will be concurrent control, efficient design, issues of measurement raised by two-shot vaccines and implications for roll-out, and the surprising effectiveness of simple analyses. Differences between the five development programmes as they affect statistics will be covered but some essential similarities will also be discussed.
Talk given at ISCB 2016 Birmingham
For indications and treatments where their use is possible, n-of-1 trials represent a promising means of investigating potential treatments for rare diseases. Each patient permits repeated comparison of the treatments being investigated and this both increases the number of observations and reduces their variability compared to conventional parallel group trials.
However, depending on whether the framework used for analysis is randomisation-based or model-based, puzzling differences in inference arise. This can easily be shown by starting, on the one hand, with the randomisation philosophy associated with the Rothamsted school of inference and building up the analysis through the block + treatment structure approach associated with John Nelder’s theory of general balance (as implemented in GenStat®), or starting, on the other hand, with a plausible variance component approach through a mixed model. However, it can be shown that these differences are related not so much to the modelling approach per se as to the questions one attempts to answer: ranging from testing whether there was a difference between treatments in the patients studied, to predicting the true difference for a future patient, via making inferences about the effect in the average patient.
This in turn yields interesting insight into the long-running debate over the use of fixed-effect or random-effects meta-analysis.
Some practical issues of analysis will also be covered in R and SAS®, in which languages some functions and macros to facilitate analysis have been written. It is concluded that n-of-1 trials hold great promise for investigating chronic rare diseases but that careful consideration of matters of purpose, design and analysis is necessary to make best use of them.
Acknowledgement
This work is partly supported by the European Union’s 7th Framework Programme for research, technological development and demonstration under grant agreement no. 602552. “IDEAL”
History of how and why a complex cross-over trial was designed to prove the equivalence of two formulations of a beta-agonist and what the eventual results were. Presented at the Newton Institute 28 July 2008. Warning: following the important paper by Kenward & Roger (Biostatistics, 2010), I no longer think the random effects analysis is appropriate, although, in fact, the results are pretty much the same as for the fixed effects analysis.
De Finetti meets Popper
1. De Finetti meets Popper
or Should Bayesians care about falsificationism?
Stephen Senn, Edinburgh
(C) Stephen Senn 2019
Lecture at the Popper Symposium on 7 August 2019 at the 16th International Congress on Logic, Methodology & Philosophy of Science, Prague
2. Basic thesis / Outline
Basic thesis:
The distinction between refuting and ‘corroborating’ a hypothesis is fundamental. It does not become irrelevant by adopting a Bayesian approach to inference. It has no direct bearing on the choice of meaning for probability: subjective, relative frequency, propensity, logical etc. Various practical problems in analysing clinical trials illustrate this.
Outline:
Basic background
• De Finetti’s falsificationism
• Simple illustration
• Jeffreys’s alternative approach
• Inspired by Broad’s challenge
Falsificationist issues in clinical trials
• Bioequivalence
• Equivalence and falsificationism
• Blinding
• Competence
• Causal analysis versus prediction in clinical trials
Conclusions
(C) Stephen Senn 2019
3. A puzzle to keep you thinking
(C) Stephen Senn 2019
Suppose we are to have 1 million independent trials with a binary outcome. We wish to decide, in advance of beginning the trials, which of the following is more likely:
A: 1 million successes and no failures
B: 500,000 successes and 500,000 failures in any order
We use a Bayesian approach with a uniform prior for the binary outcome (such as would have been employed by Laplace).
What is the correct answer?
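A sketch of the calculation behind the puzzle (not on the slides): under Laplace's uniform prior on the success probability \(\theta\), the marginal probability of seeing exactly k successes in n trials is

\[
P(k \text{ successes}) \;=\; \binom{n}{k}\int_0^1 \theta^{k}(1-\theta)^{\,n-k}\,d\theta \;=\; \binom{n}{k}\,\frac{k!\,(n-k)!}{(n+1)!} \;=\; \frac{1}{n+1},
\]

whatever the value of k. With n = 1,000,000 both A and B therefore have probability 1/1,000,001: on these assumptions the two are equally likely.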
5. “The acquisition of a further piece of information, H - in other words experience, since experience is nothing more than the acquisition of further information - acts always and only in the way we have just described: suppressing the alternatives that turn out to be no longer possible.”
Popper?
No, de Finetti
(C) Stephen Senn 2019
6. Example
• A man has a CD of popular music with 12 tracks on it
• He can play tracks in random order (Shuffle) or in sequential order (Play)
• On a particular occasion he thinks he has pressed Shuffle (that was his intention) but the first track played is the first track, F, on the CD
• What is the probability that he did, in fact, press Shuffle as intended?
(C) Stephen Senn 2019
7. We can put this together as follows

“Hypothesis” | Prior Probability P | Evidence | Likelihood | P x L
Shuffle      | 9/10                | F        | 1/12       | 9/120
Shuffle      | 9/10                | X        | 11/12      | 99/120
Play         | 1/10                | F        | 1          | 12/120
Play         | 1/10                | X        | 0          | 0
TOTAL        |                     |          |            | 120/120 = 1

(C) Stephen Senn 2019
Note that in de Finetti’s theory the relevant historical process is that of the individual’s thought process, not “real world” events
8. After seeing (hearing) the evidence, however, only two rows remain

“Hypothesis” | Prior Probability P | Evidence | Likelihood | P x L
Shuffle      | 9/10                | F        | 1/12       | 9/120
Shuffle      | 9/10                | X        | 11/12      | 99/120
Play         | 1/10                | F        | 1          | 12/120
Play         | 1/10                | X        | 0          | 0
TOTAL        |                     |          |            | 21/120

(C) Stephen Senn 2019
9. So we rescale by dividing by the total probability

“Hypothesis” | Prior Probability P | Evidence | Likelihood | P x L   | Posterior Probability
Shuffle      | 9/10                | F        | 1/12       | 9/120   | (9/120)/(21/120) = 9/21
Shuffle      | 9/10                | X        | 11/12      | 99/120  |
Play         | 1/10                | F        | 1          | 12/120  | (12/120)/(21/120) = 12/21
Play         | 1/10                | X        | 0          | 0       |
TOTAL        |                     |          |            | 21/120  | 21/21 = 1

(C) Stephen Senn 2019
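The same prior-times-likelihood-then-renormalise calculation written out in code (a minimal sketch, not part of the slides; the numbers are those of the CD example above):

```python
# Shuffle/Play example from slides 6-9: prior x likelihood, then renormalise
# over the hypotheses that remain possible after the evidence.

priors = {"Shuffle": 9 / 10, "Play": 1 / 10}
# Likelihood of the observed evidence (first track F played first) under each hypothesis
likelihoods = {"Shuffle": 1 / 12, "Play": 1.0}

joint = {h: priors[h] * likelihoods[h] for h in priors}   # the P x L column
total = sum(joint.values())                                # 21/120
posterior = {h: joint[h] / total for h in joint}           # rescale by the total

print(posterior)  # {'Shuffle': 0.4285..., 'Play': 0.5714...}, i.e. 9/21 and 12/21
```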
10. Returning to de Finetti’s general approach
• Suppose we declare all possible sequences of some binary outcome (say S = success and F = failure) equally likely
• Then no learning is possible
• This is because, for any sequence consisting of a number of S and F outcomes, every possible forward sequence of S and F is also equally likely
• Thus, observing which sequences have not occurred and renormalising changes nothing
• Caution is required!
• This is one reason why de Finetti was sceptical about any automatic approaches to Bayesian inference
(C) Stephen Senn 2019
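A sketch of why no learning occurs (not on the slides): if all \(2^n\) sequences of S and F are judged equally likely, then for any observed initial segment \(s_1, \dots, s_k\),

\[
P(\text{next outcome} = S \mid s_1, \dots, s_k) \;=\; \frac{2^{\,n-k-1}}{2^{\,n-k}} \;=\; \frac{1}{2},
\]

since exactly half of the equally likely continuations begin with S. Whatever is observed, the prediction is unchanged.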
12. C. D. Broad, 1918
(C) Stephen Senn 2019
[Images of formulae from Broad (1918), pp. 393-394]
As m goes to infinity the first approaches 1
If n is much greater than m the latter is small
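The two expressions themselves appear only as images on the original slide. On the assumption that they are the standard Laplacean results Broad discussed, they are

\[
P(\text{next instance has the property} \mid m \text{ instances, all with it}) = \frac{m+1}{m+2}, \quad
P(\text{all } n \text{ members have it} \mid \text{the same evidence}) = \frac{m+1}{n+1};
\]

the first tends to 1 as m grows, while the second is small whenever n is much greater than m.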
13. The Economist gets it wrong
(C) Stephen Senn 2019
“The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.”
The Economist, ‘In praise of Bayes’, September 2000
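One way of seeing why the quoted account misleads, using the same Laplacean assumptions the passage itself invokes (a sketch, not on the slides): after m sunrises in m days the probability of a sunrise tomorrow is \((m+1)/(m+2)\), which does approach 1, but the probability that the sun rises on every one of the next n days is

\[
\int_0^1 \theta^{\,n}\,(m+1)\,\theta^{\,m}\,d\theta \;=\; \frac{m+1}{m+n+1},
\]

which tends to 0 as n grows. High confidence in the next instance is not near-certainty that the sun will always rise.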
14. Jeffreys’s solution
• The fact that ‘laws’ cannot be proved using Bayes theorem if the Laplacian approach to choosing prior distributions is adopted means that the choice of prior distribution is wrong
• His solution is to place a mass of probability on the hypothesis being true
• This gives simpler representations of the world more prior weight than more complex ones
• In his view this is necessary to permit induction to work
• Prior probability replaces (or reflects) parsimony as a principle
(C) Stephen Senn 2019
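A toy version of the device (a sketch, not on the slides): put prior mass \(\pi_0\) on the law ‘all instances are positive’ (\(\theta = 1\)) and spread the remaining probability uniformly over \(\theta\) in (0, 1). After m confirming instances the marginal likelihood is 1 under the law and \(1/(m+1)\) under the uniform alternative, so

\[
P(\text{law} \mid m \text{ confirmations}) \;=\; \frac{\pi_0}{\pi_0 + (1-\pi_0)/(m+1)} \;\to\; 1 \quad \text{as } m \to \infty.
\]

With the lump of prior probability in place, accumulating confirmations can drive the probability of the law towards 1 in a way that Laplace's prior cannot.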
15. Falsificationist issues in clinical trials
Rather more technical – again please accept my apologies
(C) Stephen Senn 2019
16. Equivalence studies (including bioequivalence)
• Studies in which one tries to prove that treatments do not differ
• The most extreme example is so-called bioequivalence studies
• The molecule is the same but the formulation differs
• The same manufacturer may wish to replace one route of administration by another
• For example a suppository by a pill
• Or a single-dose inhaler with a multi-dose one
• Or a different, so-called generic, manufacturer may wish to supply the market with its version of a now off-patent brand-name product
• Or a manufacturer may wish, for labelling reasons, to prove that a drug does not differ whether given with or without food
(C) Stephen Senn 2019
17. But surely, a drug is a drug?
• In fact, no, changing the formulation can have dramatic effects on the potency of a drug
• Here is an example I was involved with
• Bronchodilator in asthma
• Seven treatments compared over twelve hours using forced expiratory volume in one second (FEV1)
• Placebo
• 6, 12 and 24 µg of the new formulation (MTA)
• 6, 12 and 24 µg of the old formulation (ISF)
• Other details omitted for the sake of brevity
• The results follow (high values of FEV1 are good)
(C) Stephen Senn 2019
Senn, S.J., et al., An incomplete blocks cross-over in asthma: a case study in collaboration, in Cross-over Clinical Trials, J. Vollmar and L.A. Hothorn, Editors. 1997, Fischer: Stuttgart. p. 3-26.
18. (C) Stephen Senn 2019
[Figure: FEV1 (L) against time (0-720 minutes) for Placebo, MT&A 6, MT&A 12 and MT&A 24]
Placebo and the 3 doses of the new formulation
19. (C) Stephen Senn 2019
[Figure: FEV1 (L) against time (0-720 minutes) for Placebo and the 3 doses each of MT&A and ISF]
With the 3 doses of reference formulation added
20. Bioequivalence in terms of confidence
intervals
What is considered ‘proven’
A: neither equivalence nor difference proven
B: exact equivalence rejected
C: inconclusive
D & E: practical equivalence proven
F: practical equivalence proven but exact equivalence
rejected
G: exact and practical equivalence rejected
(C) Stephen Senn 2019
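The figure’s logic can be stated mechanically: practical equivalence is ‘proven’ when the whole confidence interval lies inside the equivalence region, exact equivalence is rejected when the interval excludes zero, and the two judgements are made separately, which is how case F can arise. A minimal Python sketch, with an invented equivalence margin and invented intervals purely for illustration (the mapping to case labels follows the descriptions above):

def classify(lower, upper, margin):
    # Classify a confidence interval (lower, upper) for a treatment difference
    # against an equivalence region (-margin, +margin).
    practical_equivalence = (-margin < lower) and (upper < margin)  # CI inside the region
    difference_shown = (lower > 0) or (upper < 0)                   # CI excludes zero
    if practical_equivalence and difference_shown:
        return "practical equivalence proven but exact equivalence rejected (case F)"
    if practical_equivalence:
        return "practical equivalence proven (cases D and E)"
    if difference_shown and (lower >= margin or upper <= -margin):
        return "exact and practical equivalence rejected (case G)"
    if difference_shown:
        return "exact equivalence rejected (case B)"
    return "neither equivalence nor difference proven / inconclusive (cases A and C)"

# Invented examples, margin of 0.2 on some suitable scale
for ci in [(-0.5, 0.1), (0.1, 0.6), (-0.3, 0.3), (-0.1, 0.1), (0.05, 0.15), (0.3, 0.7)]:
    print(ci, "->", classify(*ci, margin=0.2))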
21. (C) Stephen Senn 2019
First issue: Blinding and Equivalence
• Running a double blind trial does not protect you against a false conclusion
of equivalence
• You do not need to know the treatment code to bias results towards
equivalence
• Consider a particularly simple (and very common) form of trial in which
two oral formulations of a molecule are compared by looking at the
concentration time profile in a cross-over trial
• Equivalence of these profiles is taken to mean equivalence of the
formulations
• “The blood is a gate through which the drug must pass”
22. (C) Stephen Senn 2019
The Unscrupulous Pharmacokineticist
• Take the 12 test tubes from day one for a given
volunteer
• hours 1, 2, … 12
• Take the 12 test tubes from day two for the same
volunteer
• hours 1, 2, … 12
• Mix each pair (by hour) together
• Divide each mixture into two
• Et voilà
• Perfect equivalence without having to unblind (a small simulation follows)
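The reason the trick works is that, after mixing hour by hour, the two ‘formulations’ share literally the same concentration at every sampling time, so any summary of the profiles (AUC, Cmax or anything else) is identical by construction. A small illustrative simulation with made-up concentration values:

import numpy as np

rng = np.random.default_rng(1)
hours = np.arange(1, 13)

# Made-up concentration-time profiles for the two administrations;
# formulation B is deliberately very different (roughly half the exposure of A).
conc_a = 10 * hours * np.exp(-0.5 * hours) + rng.normal(0, 0.2, 12)
conc_b = 5 * hours * np.exp(-0.5 * hours) + rng.normal(0, 0.2, 12)

def auc(conc):
    # Trapezoidal area under the concentration-time curve (unit spacing in hours)
    return float(np.sum((conc[:-1] + conc[1:]) / 2))

# Mix each pair of tubes (by hour) and divide the mixture into two:
# both 'samples' now carry the average concentration at every hour.
mixed = (conc_a + conc_b) / 2
faked_a, faked_b = mixed.copy(), mixed.copy()

print("true AUC ratio :", auc(conc_b) / auc(conc_a))    # roughly 0.5
print("faked AUC ratio:", auc(faked_b) / auc(faked_a))  # exactly 1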
23. (C) Stephen Senn 2019
Fanciful?
• In fact blinding does not protect against false conclusions of
equivalence
• Pharmaceutical companies commonly prosecute cheating doctors
• Reason
• Trial fails to show any effect whereas others do
• Explanation
• The trial never took place
• The data have been invented
• This will produce a conclusion of equivalence
24. (C) Stephen Senn 2019
Second issue: Competence
• Experiment is fair if treatments are handled equivalently
• in all aspects except those that form the essence (definition) of the treatment
• cannot be determined by looking at outcomes
• Competence is the ability to detect differences
• can only partly be determined on external grounds
• can be established if difference is detected
• It is a matter of “assay sensitivity”
25. (C) Stephen Senn 2019
A Model for Competence
States of nature: the treatments are equivalent or inequivalent; the trial is competent or not competent; the evidence is an observed difference or no difference
Likelihoods: for each of the four combinations of (in)equivalence and (in)competence, the probabilities of observing a difference and of observing no difference sum to 1
"Priors": P(E), the prior probability of equivalence, and P(C | not E), the probability of competence given non-equivalence
See Senn, S.J., Inherent difficulties
with active control equivalence
studies. Statistics in Medicine, 1993.
12(24): p. 2367-75.
26. (C) Stephen Senn 2019
Interpretation of These Parameters
• Two of the likelihood parameters reflect the ‘precision’ of ‘competent’ experiments
• Their converses are analogous to type I and type II error rates
• These error rates can be reduced by more and more precise experiments
• A further parameter represents the probability that, where a difference between
treatments really does exist, a poor (not competent) experiment will
indicate that it exists
• The joint effect of the two ‘prior’ parameters represents factors beyond our control
• One is the probability that ‘Nature’ has decided the two treatments are
equivalent
• The other is the probability that the trial is competent given that the treatments are
not equivalent
27. (C) Stephen Senn 2019
Notes
Under this formulation of the likelihoods it is irrelevant whether
the trial is competent if the treatments are equivalent.
We could treat the combination EC as impossible.
We require the probability that a competent trial detects a real difference to be greater than the corresponding probability for an incompetent trial, but this is a linguistic convention.
28. (C) Stephen Senn 2019
For those who like formulae
Expressions, obtained from Bayes' theorem and the priors and likelihoods above, for the posterior probabilities of equivalence and non-equivalence given that a difference is, or is not, observed: P(E | D), P(not E | D), P(E | no D) and P(not E | no D). (A reconstruction in assumed notation is sketched below.)
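The Greek symbols on the original slide did not survive extraction, so what follows is only a hedged reconstruction. It simply applies Bayes' theorem to the priors and likelihoods of the two previous slides, writing γ = P(E) for the prior probability of equivalence, θ = P(C | E̅) for the probability of competence given non-equivalence, π = P(D | E̅C) for the probability that a competent trial detects a real difference (this symbol is confirmed in the notes further below), ε = P(D | E̅C̅) for the corresponding probability in an incompetent trial, and α = P(D | E) for the probability that a difference is declared when the treatments are in fact equivalent. The names γ, θ, ε and α are my assumptions; the parameterisation on the original slide and in Senn (1993) may differ.

\begin{align*}
P(E \mid D) &= \frac{\alpha\gamma}{\alpha\gamma + \pi\theta(1-\gamma) + \varepsilon(1-\theta)(1-\gamma)},\\
P(\bar{E} \mid D) &= \frac{\pi\theta(1-\gamma) + \varepsilon(1-\theta)(1-\gamma)}{\alpha\gamma + \pi\theta(1-\gamma) + \varepsilon(1-\theta)(1-\gamma)},\\
P(E \mid \bar{D}) &= \frac{(1-\alpha)\gamma}{(1-\alpha)\gamma + (1-\pi)\theta(1-\gamma) + (1-\varepsilon)(1-\theta)(1-\gamma)},\\
P(\bar{E} \mid \bar{D}) &= 1 - P(E \mid \bar{D}).
\end{align*}

Under this parameterisation P(E̅ | D) can be pushed towards 1 by driving α down (a well designed trial rarely declares a difference when there is none), whereas P(E | D̅) cannot be pushed to 1 merely by making the trial more precise; it also requires the probability of competence θ to be high, which is the asymmetry discussed on the next slide.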
32. Consequences
• Asymmetry between concluding equivalence and difference
• The former is more problematic
• Not just a matter of reformulating the problem
• Conditional on an assumption of competence we can conclude
equivalence
• However, if we have any doubts about competence, these doubts increase by
finding a difference
• Speculation: this is a concrete instance of the more general point
made by Popper and Miller 1987
(C) Stephen Senn 2019
33. Hunt the thimble
• You are looking for a thimble in a room
• Consider two cases
• You find the thimble
• You search but don’t find the thimble
• Inferences about whether the thimble is in the room or not are
fundamentally different in the two cases
• In the first case, you conclude it is, and your competence as a searcher for
thimbles is irrelevant to this conclusion
• In the second case, you may believe that the thimble is not in the room but
this belief depends on your competence as a thimble-searcher, about
which you may come to have doubts
(C) Stephen Senn 2019
34. Third issue: causal versus predictive inference
• Clinical trials can be used to try and answer a number of very
different questions
• Two examples are
• Did the treatment have an effect in these patients?
• A causal purpose
• What will the effect be in future patients?
• A predictive purpose
• Unfortunately, in practice, an answer is produced without stating
what the question was
• Given certain assumptions these questions can be answered using the
same analysis but the assumptions are strong and rarely stated
(C) Stephen Senn 2019
35. Two models
Predictive
• The population is taken to be ‘patients in
general’
• Of course this really means future patients
• They are the ones to whom the treatment
will be applied
• We treat the patients in the trial as an
appropriate selection from this
population
• This does not require them to be typical
but it does require additivity of the
treatment effect
Causal
• We take the patients as fixed
• We want to know what the effect
was for them
• Unfortunately there are missing
counterfactuals
• What would have happened to
control patients given intervention
and vice-versa
• The population is the population of
all possible allocations to the
patients studied
(C) Stephen Senn 2019
36. Coverage probabilities for two questions
Average treatment effect in the population is 300 ml FEV1
Panels: Predictive (LHS) and Causal (RHS)
Horizontal dashed line is the population average effect (LHS & RHS). Blue horizontal bar is the true
trial effect (RHS). Black CIs cover the true effect; red ones do not. (An illustrative simulation is sketched below.)
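A sketch of the sort of simulation that can produce such a figure (all numerical values here are assumptions for illustration, not the ones used for the slide): trial-level true effects are drawn around a population mean of 300 ml, patient responses are drawn around each trial's own effect, and every trial's 95% confidence interval is checked both against the population mean (the predictive question) and against the trial's own effect (the causal question).

import numpy as np

rng = np.random.default_rng(42)
n_trials, n_per_arm = 60, 30
population_mean = 300.0     # population average treatment effect (ml FEV1)
between_trial_sd = 150.0    # assumed variation of the 'local' true effect between trials
within_sd = 400.0           # assumed patient-to-patient variability

covers_population, covers_local = 0, 0
for _ in range(n_trials):
    local_effect = rng.normal(population_mean, between_trial_sd)  # this trial's true effect
    treated = rng.normal(local_effect, within_sd, n_per_arm)
    control = rng.normal(0.0, within_sd, n_per_arm)
    estimate = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / n_per_arm + control.var(ddof=1) / n_per_arm)
    low, high = estimate - 1.96 * se, estimate + 1.96 * se        # normal approximation
    covers_population += low <= population_mean <= high           # predictive question
    covers_local += low <= local_effect <= high                   # causal question

print("intervals covering the population mean  :", covers_population, "of", n_trials)  # typically well below 95%
print("intervals covering the trial's own effect:", covers_local, "of", n_trials)      # close to 95%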
37. Conclusion
• There is a fundamental difference between
• Demonstrating that things are different
• Demonstrating they are the same
• There is a fundamental difference between
• Concluding something had an effect
• Concluding it must always have this effect
• Many features of clinical trials reflect this
• The value of blinding
• Competence (assay sensitivity)
• Causal versus predictive inference
• These are not a consequence of being frequentist
• They are not vanquished by becoming Bayesian
• The choice of a Bayesian or frequentist framework does not depend on this
(C) Stephen Senn 2019
39. The answer to the puzzle
(C) Stephen Senn 2019
Both are equally likely
The prior distribution is uniform.
By the time we have completed the trials the relative frequency will be the probability
But the prior distribution says every probability is equally likely
Therefore it is hardly surprising that every relative frequency will be equally likely (a quick check is given below)
Senn, S.J., Dicing with Death. 2003,
Cambridge: Cambridge University Press.
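The 'hardly surprising' step can be checked directly: under a uniform prior on the success probability, the prior predictive probability of k successes in n trials is the integral of C(n, k) p^k (1 − p)^(n−k) over [0, 1], which equals 1/(n + 1) whatever the value of k. A quick numerical confirmation (the value of n is arbitrary):

from fractions import Fraction
from math import comb, factorial

def prior_predictive(n, k):
    # Integral of p**k * (1 - p)**(n - k) over [0, 1] is the Beta function
    # B(k + 1, n - k + 1) = k! (n - k)! / (n + 1)!
    beta_integral = Fraction(factorial(k) * factorial(n - k), factorial(n + 1))
    return comb(n, k) * beta_integral

n = 10
print([prior_predictive(n, k) for k in range(n + 1)])  # eleven copies of Fraction(1, 11)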
Editor's Notes
In other words, falsificationism is a valuable perspective for Bayesians and Frequentist statisticians alike
See Dicing with Death, Cambridge, 2003, chapter 4
Some general discussion of what it means to be a Bayesian (and also a frequentist) will be found in
Senn, S.J., You may believe you are a Bayesian but you are probably wrong. Rationality, Markets and Morals, 2011. 2: p. 48-66.
See
http://www.frankfurt-school-verlag.de/rmm/downloads/Article_Senn.pdf
de Finetti, B., Theory of Probability, Volume 1. 1974, Chichester: Wiley, p. 141.
This is based on a real example. I was playing the CD Hysteria by Def Leppard when this happened. The example is discussed in more detail in chapter 4 of Statistical Issues in Drug Development.
The four rows give the combinations of the two hypotheses and the two sorts of evidence
The P column gives the marginal prior probability of the “hypothesis”
The evidence column has two sorts of evidence indicated. F for first track on CD and X for any other track.
The Likelihood column gives the conditional probability of the evidence given the hypothesis
The column headed P x L gives the joint probability of a given hypothesis and evidence combination
Strictly speaking, in the de Finetti view, P x L exists directly
The probabilities of the two cases which remain do not add up to 1.
However, since these two cases cover all the possibilities which remain, their combined probability must be 1.
Therefore, we rescale the individual probabilities to make them add to 1.
We can do this without changing their relative value by dividing by their total, 21/120.
This has been done in the table below.
This completes the Bayesian solution and the posterior probability is given in the extra final column
“In an article entitled, "In praise of Bayes", that appeared in The Economist in September 2000, the unnamed author tried to show how a newborn baby could, through successively observed sunrises and the application of Laplace's Law of succession, acquire increasing certainty that the sun would always rise. As The Economist put it, "Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise". This is false: not so much praise as hype. The Economist had confused the probability that the sun will rise tomorrow with the probability that it will always rise. One can only hope this astronomical confusion at that journal does not also attach to beliefs about share prices.
In praise of Bayes. September 2000.”
Dicing with Death, 2003 p77
See
https://errorstatistics.com/2015/05/09/stephen-senn-double-jeopardy-judge-jeffreys-upholds-the-law-guest-post/
and also
http://www.senns.demon.co.uk/Papers/Comment%20on%20Robert.pdf
See also
http://www.senns.demon.co.uk/Papers/Falsificationism.pdf
There has been a surprising amount of disagreement amongst frequentists as well as amongst Bayesians and of course between the two major camps as to how to analyse such studies. There is no time here to go over all this. However, see http://www.senns.demon.co.uk/Papers/Bioequivalence%20SiM.pdf
for an overview
Also this blog
https://errorstatistics.com/2014/06/05/stephen-senn-blood-simple-the-complicated-and-controversial-world-of-bioequivalence-guest-post/
gives an overview
In fact this was a so-called incomplete blocks cross-over design in which each patient received five of the seven treatments on a total of five days (one day for each treatment) separated by a suitable wash-out. Twenty-one sequences were chosen so that each treatment was used equally often, each of the 21 pairs of treatments was studied in the same number of patients and each treatment appeared equally often in each period. The trial was double blind and six-fold replication was targeted (6 x 21 patients were planned to be recruited). Many different centres were employed and in the event more patients were recruited than planned.
The model to analyse the treatment effect used “patient” and “period” (that is to say day 1, 2, 3, 4 or 5) in addition to treatment as factors.
Rather than presenting the confidence intervals for the difference here I shall just show the time curves for FEV1 (appropriately adjusted for other effects), since these are sufficient to make the point.
A full description of the trials will be found in
Senn, S.J., et al., An incomplete blocks cross-over in asthma: a case study in collaboration, in Cross-over Clinical Trials, J. Vollmar and L.A. Hothorn, Editors. 1997, Fischer: Stuttgart. p. 3-26.
http://www.senns.demon.co.uk/Papers/SELIPATI.pdf
This is the time course in which patient and period effects have been eliminated (in other words it is a fair comparison). Only placebo and the three doses of the new treatment (MTA) are shown. The efficacy of MTA is clearly shown and there is a gratifying dose response.
Unfortunately the highest dose of MTA has an observed effect that is lower than the lowest dose of ISF. The conclusion was, much to everyone’s surprise and dismay, that the formulations differed in potency by a factor of 4 to 1.
“In the case of trial A, the treatment estimate lies outside the region of equivalence. However, the confidence intervals are so wide that exact equality of the treatments is not ruled out. In case B exact equality is ruled out (if the conventions of hypothesis testing are accepted), since there is a significant difference, but the possibility that the true treatment difference lies within the region of equivalence is not. In case C, no treatment difference is observed, but the confidence intervals are so wide that values outside the regions of equivalence are still plausible. In cases D, E and F, practical equivalence is ‘demonstrated’. However, in case E it corresponds to no observed difference at all, whereas in case F the treatments are significantly different (confidence interval does not straddle zero) even though practical equivalence appears to have been demonstrated (confidence interval lies within region of equivalence). In case G there is a significant difference and equivalence may be rejected. ”
Statistical Issues in Drug Development (2nd edition, 2007), Chapter 15
This concrete illustration was first proposed to me by Joachim Roehmel.
See also http://www.senns.demon.co.uk/Papers/Fisher%27s%20game%20with%20the%20Devil.pdf
In other words, to fake results to produce a conclusion that two treatments are different, you would have to know which treatment was which.
To fake results that two treatments are equivalent you do not need to know which treatment is which.
The difference is that in the first case you wish to assert that two distributions are necessary. Thus assignment to the correct distribution is crucial.
In the second case you wish to assert that only one distribution is needed.
“The value of blinding in clinical trials, is essentially this: despite making sure that there are no superficial and nonpharmacological differences which enable us to distinguish one treatment from another (the trial is double-blind), the labels ‘experimental’ and ‘control’ do have an importance for prognosis. Thus, for a conventional trial where such a difference between groups is observed, because the trial has been run double-blind, we are able to assert that the difference between the groups cannot be due to prejudice and must therefore be due either to pharmacology or to chance. The whole purpose of ACES, however, is to be able to assert that there is no difference between treatment and clearly, therefore, blinding does not protect us against the prejudice that all patients ought to have similar outcomes. The point can be illustrated quite simply by considering the task of a statistician who has been ordered to fake equivalence by simulating suitable data. It is clear that he does not even need to know what the treatment codes are. All he needs to do is simulate data from a single Normal distribution with a suitable standard deviation (Senn, 1994). Whatever the allocation of patients, he is almost bound to demonstrate equivalence. If he is required to prove that one treatment is superior to another, however, such a strategy will not work. He needs to know the treatment codes.” Statistical Issues in Drug Development (2nd edition, 2007), Chapter 15
“There is a paradox of competence associated with equivalence trials and that is that the more we tend to provide proof within a trial of the equivalence of the two treatments, the more we ought to suspect that we have not been looking at the issue in the correct way: that the trial is incapable of finding a difference where it exists. In other words, there is more to a proof of equivalence than the matter of reversing the usual roles of null and alternative hypotheses. Even if in a given trial the test results indicated that the effects of the treatments being compared were very similar (as, say, in case D) the possibility could not be ruled out that a trial with different patients, or alternative measurements or some different approach altogether would have succeeded in finding a difference. No probabilistic calculation on the data in hand has anything to say about this possibility: it is essentially a matter of data not collected. There is a difference in kind between ‘proving’ that drugs are similar and proving that they are not similar. This difference is analogous to the difference which exists in principle between a proof of marital infidelity and fidelity. The first may be provided simply enough (in principle) by evidence; the second, if at all, only by a repeated failure to find the evidence which the first demands.”
Statistical Issues in Drug Development (2nd edition, 2007), Chapter 15
See Senn, S.J., Inherent difficulties with active control equivalence studies. Statistics in Medicine, 1993. 12(24): p. 2367-75.
http://www.senns.demon.co.uk/Papers/ACES%20SiM%201993.pdf
In the paper different symbols were used, but the change has been made here to avoid confusion, since α and β are often used for type I and type II error rates.
One could argue that it is the joint effect of three of the parameters that reflects matters beyond our control.
On the other hand, our knowledge of statistics (and experimental design) enables us to fix the remaining two, the error rates of a competent trial
This will probably be skipped over in the lecture
Reminder: one parameter is the prior probability of equivalence, the other is the probability of competence given non-equivalence
It is assumed that a ‘difference’ has been observed
The horizontal axis gives the probability, P(D | E′C) = π, of observing a ‘difference’ given that the trial is competent (C) and that non-equivalence (E′) obtains. Other things being equal, we expect that the more precise the experiment, the bigger this value will be
The vertical axis gives the posterior probability of non-equivalence
A limit is reached as π approaches 1 but this is because the other parameters do not increase
Note to self. The program is “ACES Bayesian.gen” and the location is C:\Users\Stephen\Documents\Genstat\GenStat Files\Research\Equivalence
Now that the relevant error rate has been reduced, the limit for the posterior probability is much higher. In principle, simply by designing better experiments, we can make better and better inferences regarding differences.
However, this slide shows that the same is not true of equivalence. There is a limit to what we can conclude unless we can make a judgement of competence that relies on external matters.
This may seem puzzling, since what is equivalence but that which applies when non-equivalence does not? The real reason is that three alternatives are involved: ‘equivalent’, ‘not competent’ and ‘different’. It is distinguishing between the first two that is the problem.
Popper, K. and Miller, D., ‘Why probabilistic support is not inductive’, Philosophical Transactions of the
Royal Society of London, Series A, 321, 569-591 (1987).
See also The Jealous Husband’s dilemma , Dicing With Death, chapter 4.
Example of a trial in asthma comparing a bronchodilator to placebo using forced expiratory volume in one second (FEV1) in mL
This is a simulation to illustrate the issues. In the simulation a population of patients for whom the treatment effect is not identical has been considered. Each clinical trial has a different average treatment effect because it involves a different (possibly unidentifiable) sub-population of patients. This is achieved by drawing a common patient effect for the trial from an overall distribution.
Sixty trials are simulated
Once this value has been established for the trial, then individual patient values are simulated from the distribution for the trial.
The point estimates (diamonds) and 95% confidence intervals (whiskers) are calculated.
On the LHS the confidence intervals are judged according to whether they cover the population value (given by the horizontal line at 300 mL). Black, yes, red, no. It can be seen that the claimed 95% coverage does not apply.
On the RHS coverage is judged by whether they cover the ‘true’ local effect (which is given by the small blue horizontal bar, which varies from trial to trial). The theory holds up well and in fact, 3 out of 60, that is to say 5%, of the true values are not within the intervals.
Of course formal proofs, using either calculus (integrating out) or mathematical induction, are possible. To understand what your choice of prior distribution commits you to, you have to see the answer. This is an example of what Popper once wrote about scientists liking ‘weak’ proofs because they often bring more understanding.
The example is discussed and a proof by induction is given in chapter 4 of Dicing with Death.