This year marks the 70th anniversary of the Medical Research Council randomised clinical trial (RCT) of streptomycin in tuberculosis led by Bradford Hill, which is widely regarded as a landmark in clinical research. Despite its widespread use in drug regulation and in clinical research more widely, and its high standing with the evidence-based medicine movement, the RCT continues to attract criticism. I show that many of these criticisms are traceable to a failure to understand two key concepts in statistics: probabilistic inference and design efficiency. To these methodological misunderstandings can be added the practical one of failing to appreciate that entry into clinical trials is not simultaneous but sequential.
I conclude that although randomisation should not be used as an excuse for ignoring prognostic variables, it is valuable and that many standard criticisms of RCTs are invalid.
The Rothamsted school meets Lord's paradox
Stephen Senn
Lord's ‘paradox’ is a notoriously difficult puzzle that is guaranteed to provoke discussion, dissent and disagreement. Two statisticians analyse some observational data and come to radically different conclusions, each of which has acquired defenders over the years since Lord first proposed his puzzle in 1967. It features in the recent Book of Why by Pearl and Mackenzie, who use it to demonstrate the power of Pearl’s causal calculus, obtaining a solution they claim is unambiguously right. They also claim that statisticians have failed to get to grips with causal questions for well over a century, in fact ever since Karl Pearson developed Galton’s idea of correlation and warned the scientific world that correlation is not causation.
However, only two years before Lord published his paradox, John Nelder outlined a powerful causal calculus for analysing designed experiments based on a careful distinction between block and treatment structure. This represented an important advance in formalising the approach to analysing complex experiments that started with Fisher 100 years ago, when he proposed splitting variability using the square of the standard deviation, which he called the variance, continued with Yates and has been developed since the 1960s by Rosemary Bailey, amongst others. This tradition might be referred to as the Rothamsted School. It is fully implemented in Genstat® but, as far as I am aware, not in any other package.
With the help of Genstat®, I demonstrate how the Rothamsted School would approach Lord’s paradox and come to a solution that is not the same as the one reached by Pearl and Mackenzie, although given certain strong but untestable assumptions it would reduce to it. I conclude that the statistical tradition may have more to offer in this respect than has been supposed.
President's invited lecture, ISCB Vigo 2017
Discusses various issues to do with how randomised clinical trials should be analysed. See also https://errorstatistics.com/2017/07/01/s-senn-fishing-for-fakes-with-fisher-guest-post/
The Seven Habits of Highly Effective Statisticians
Stephen Senn
If you know why the title of this talk is extremely stupid, then you clearly know something about control, data and reasoning: in short, you have most of what it takes to be a statistician. If you have studied statistics then you will also know that a large amount of anything, and this includes successful careers, is luck.
In this talk I shall try to share some of my experiences of being a statistician in the hope that it will help you make the most of whatever luck life throws at you. In so doing, I shall try my best to overcome the distorting influence of that easiest of sciences, hindsight. Without giving too much away, I shall be recommending that you read, listen, think, calculate, understand, communicate, and do. I shall give you some examples of what I think works and what I think doesn't.
In all of this you should never forget the power of negativity and also the joy of being able to wake up every day and say to yourself ‘I love the smell of data in the morning’.
Talk given at ISCB 2016 Birmingham
For indications and treatments where their use is possible, n-of-1 trials represent a promising means of investigating potential treatments for rare diseases. Each patient permits repeated comparison of the treatments being investigated and this both increases the number of observations and reduces their variability compared to conventional parallel group trials.
However, whether the framework used for analysis is randomisation-based or model-based produces puzzling differences in inferences. This can easily be shown by starting on the one hand with the randomisation philosophy associated with the Rothamsted school of inference and building up the analysis through the block + treatment structure approach associated with John Nelder’s theory of general balance (as implemented in GenStat®), or starting on the other hand with a plausible variance component approach through a mixed model. However, it can be shown that these differences are related not so much to the modelling approach per se as to the questions one attempts to answer: ranging from testing whether there was a difference between treatments in the patients studied, to predicting the true difference for a future patient, via making inferences about the effect in the average patient.
This in turn yields interesting insight into the long-run debate over the use of fixed or random effect meta-analysis.
Some practical issues of analysis will also be covered in R and SAS®, in which languages some functions and macros to facilitate analysis have been written. It is concluded that n-of-1 trials hold great promise in investigating chronic rare diseases but that careful consideration of matters of purpose, design and analysis is necessary to make best use of them.
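By way of illustration, a minimal sketch in R of the model-based side of this comparison; the lme4 package is assumed, and the layout (8 patients, 3 cycles, 2 treatments) and all numbers are invented:

# Minimal illustrative sketch of a model-based analysis of a series of
# n-of-1 trials. Toy data, so convergence warnings are possible.
library(lme4)
set.seed(1)
d <- expand.grid(patient = factor(1:8), cycle = factor(1:3), treat = c(0, 1))
tau <- rnorm(8, mean = 1, sd = 0.5)          # patient-specific true effects
d$y <- rnorm(nrow(d), mean = 10 + tau[d$patient] * d$treat, sd = 1)
# Random treatment-by-patient effect: aims at the effect for a future patient
fit_random <- lmer(y ~ treat + (treat | patient), data = d)
# Fixed patient effects: aims at the average effect in the patients studied
fit_fixed <- lm(y ~ treat + patient, data = d)
summary(fit_random)$coefficients["treat", ]
summary(fit_fixed)$coefficients["treat", ]
# Similar point estimates, but the standard errors answer different
# questions: the random-effects one includes between-patient variation
# in the treatment effect, the fixed-effects one does not.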
Acknowledgement
This work is partly supported by the European Union’s 7th Framework Programme for research, technological development and demonstration under grant agreement no. 602552 (“IDEAL”).
How to combine results from randomised clinical trials on the additive scale with real world data to provide predictions on the clinically relevant scale for individual patients
Unfortunately, some have interpreted Numbers Needed to Treat as indicating the proportion of patients on whom the treatment has had a causal effect. This interpretation is very rarely, if ever, necessarily correct. It is certainly inappropriate if based on a responder dichotomy. I shall illustrate the problem using simple causal models.
One also sometimes encounters the claim that the extent to which two distributions of outcomes overlap from a clinical trial indicates how many patients benefit. This is also false and can be traced to a similar causal confusion.
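A minimal sketch in R of both confusions, with all effect sizes invented: a treatment that benefits every patient by the same amount nevertheless yields a ‘response rate’ difference, and hence an NNT, that seems to say it works for only a few:

# All numbers invented for illustration.
set.seed(2)
n <- 100000
control <- rnorm(n)                  # outcome without treatment
treated <- control + 0.3             # the SAME benefit of 0.3 for everyone
p_t <- mean(treated > 1)             # 'responders' on treatment (cut-off 1)
p_c <- mean(control > 1)             # 'responders' on control
1 / (p_t - p_c)                      # NNT of about 12, yet all n benefit
# Overlap of the two outcome distributions is equally uninformative:
mean(rnorm(n, 0.3) > rnorm(n))       # about 0.58, despite universal benefit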
What should we expect from reproducibility?
Stephen Senn
Is there really a reproducibility crisis and, if so, are P-values to blame? Choose any statistic you like, carry out two identical independent studies and report this statistic for each. In advance of collecting any data, you ought to expect that it is just as likely that statistic 1 will be smaller than statistic 2 as vice versa. Once you have seen statistic 1, things are not so simple, but if they are not so simple it is because you have other information in some form. However, it is at least instructive that you need to be careful in jumping to conclusions about what to expect from reproducibility. Furthermore, the forecasts of good Bayesians ought to obey a martingale property. On average you should be in the future where you are now but, of course, your inferential random walk may lead to some peregrination before it homes in on “the truth”. But you certainly can’t generally expect that a probability will get smaller as you continue. P-values, like other statistics, are a position, not a movement. Although often claimed, there is no such thing as a trend towards significance.
Using these and other philosophical considerations I shall try to establish what it is we want from reproducibility. I shall conclude that we statisticians should probably be paying more attention to checking that standard errors are being calculated appropriately and rather less to the inferential framework.
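A small simulation in R of the opening point, under an invented design: whatever the true effect, the P-value of the first of two identical studies is as likely as not to be smaller than that of the second:

# Design (n = 50 per arm, true effect 0.2) invented for illustration.
set.seed(3)
one_study <- function(n = 50, delta = 0.2)
  t.test(rnorm(n, delta), rnorm(n))$p.value
p1 <- replicate(5000, one_study())
p2 <- replicate(5000, one_study())
mean(p1 < p2)   # close to 0.5, whatever the value of delta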
It is argued that when it comes to nuisance parameters an assumption of ignorance is harmful. On the other hand, this raises problems as to how far one should go in searching for further data when combining evidence.
Views of the role of hypothesis falsification in statistical testing do not divide as cleanly between frequentist and Bayesian views as is commonly supposed. This can be shown by considering the two major variants of the Bayesian approach to statistical inference and the two major variants of the frequentist one.
A good case can be made that the Bayesian, de Finetti, just like Popper, was a falsificationist. A thumbnail view, which is not just a caricature, of de Finetti’s theory of learning, is that your subjective probabilities are modified through experience by noticing which of your predictions are wrong, striking out the sequences that involved them and renormalising.
On the other hand, in the formal frequentist Neyman-Pearson approach to hypothesis testing, you can, if you wish, shift the conventional null and alternative hypotheses, making the latter the straw man and, by ‘disproving’ it, assert the former.
The frequentist, Fisher, however, at least in his approach to the testing of hypotheses, seems to have taken a strong view that the null hypothesis was quite different from any other and that there was a strong asymmetry in the inferences that followed from the application of significance tests.
Finally, to complete a quartet, the Bayesian geophysicist Jeffreys, inspired by Broad, specifically developed his approach to significance testing in order to be able to ‘prove’ scientific laws.
By considering the controversial case of equivalence testing in clinical trials, where the object is to prove that ‘treatments’ do not differ from each other, I shall show that there are fundamental differences between ‘proving’ and falsifying a hypothesis and that this distinction does not disappear by adopting a Bayesian philosophy. I conclude that falsificationism is important for Bayesians also, although it is an open question as to whether it is enough for frequentists.
When estimating sample sizes for clinical trials there are several different views that might be taken as to what definition and meaning should be given to the sought-for treatment effect. However, if the concept of a ‘minimally important difference’ (MID) does have relevance to interpreting clinical trials (which can be disputed) then its value cannot be the same as the ‘clinically relevant difference’ (CRD) that would be used for planning them.
A doubly pernicious use of the MID is as a means of classifying patients as responders and non-responders. Not only does such an analysis lead to an increase in the necessary sample size but it misleads trialists into making causal distinctions that the data cannot support and has been responsible for exaggerating the scope for personalised medicine.
In this talk these statistical points will be explained using a minimum of technical detail.
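As a rough illustration of the sample-size point, a sketch in R comparing the power of an analysis of the continuous outcome with a responder analysis dichotomised at an ‘MID’; the effect size, cut-off and sample size are all invented:

# Power comparison: continuous analysis vs responder dichotomy.
set.seed(4)
delta <- 0.3; n <- 100; mid <- 0.5
sim <- function() {
  a <- rnorm(n, delta); b <- rnorm(n)
  c(continuous   = t.test(a, b)$p.value,
    dichotomised = prop.test(c(sum(a > mid), sum(b > mid)), c(n, n))$p.value)
}
res <- replicate(5000, sim())
rowMeans(res < 0.05)   # power is markedly lower for the dichotomised analysis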
An early and overlooked causal revolution in statistics was the development of the theory of experimental design, initially associated with the "Rothamsted School". An important stage in the evolution of this theory was the experimental calculus developed by John Nelder in the 1960s with its clear distinction between block and treatment factors in designed experiments. This experimental calculus produced appropriate models automatically from more basic formal considerations but was, unfortunately, only ever implemented in Genstat®, a package widely used in agriculture but rarely so in medical research. In consequence its importance has not been appreciated and the approach of many statistical packages to designed experiments is poor. A key feature of the Rothamsted School approach is that identification of the appropriate components of variation for judging treatment effects is simple and automatic.
The impressive, more recent causal revolution in epidemiology, associated with Judea Pearl, seems to have no place for components of variation, however. By considering the application of Nelder’s experimental calculus to Lord’s Paradox, I shall show that solutions that have been proposed using the more modern causal calculus are problematic. I shall also show that lessons from designed clinical trials have important implications for the use of historical data and big data more generally.
Clinical trials: quo vadis in the age of COVID?
Stephen Senn
A discussion of the role of clinical trials in the age of COVID. My contribution to the Phastar 2020 Life Science Summit: https://phastar.com/phastar-life-science-summit
The statistical revolution of the 20th century was largely concerned with developing methods for analysing small datasets. Student’s paper of 1908 was the first in the English literature to address seriously the problem of second-order uncertainty (uncertainty about the measures of uncertainty) and was hailed by Fisher as heralding a new age of statistics. Much of what Fisher did was concerned with problems of what might be called ‘small data’, not only as regards efficient analysis but also as regards efficient design, in addition paying close attention to what was necessary to measure uncertainty validly.
I shall consider the history of some of these developments, in particular those that are associated with what might be called the Rothamsted School, starting with Fisher and having its apotheosis in John Nelder’s theory of General Balance and see what lessons they hold for the supposed ‘big data’ revolution of the 21st century.
There are many questions one might ask of a clinical trial, ranging from ‘what was the effect in the patients studied?’ to ‘what might the effect be in future patients?’, via ‘what was the effect in individual patients?’. The extent to which the answers to these questions are similar depends on the various assumptions made, and in some cases the design used may not permit any meaningful answer to be given at all.
A related issue is confusion between randomisation, random sampling, linear modelling and true multivariate modelling. These distinctions don’t matter much for some purposes and under some circumstances, but for others they do.
The history of P-values is covered to try to shed light on a mystery: why did Student and Fisher agree numerically but disagree in terms of interpretation?
Personalised medicine: a sceptical view
Stephen Senn
Some grounds for believing that the current enthusiasm about personalised medicine is exaggerated, founded on poor statistics and represents a disappointing loss of ambition.
Minimisation is an approach to allocating patients to treatment in clinical trials that forces a greater degree of balance than does randomisation. Here I explain why I dislike it.
There are many valid criticisms of P-values but the criticism that they are largely responsible for the reproducibility crisis has been accepted rather lightly in some quarters. Whatever the inferential statistic that is used, it is quite illogical to assume that as the sample size increases it will tend to show more evidence against the null hypothesis. This applies to Bayesian posterior probabilities as much as it does to P-values. In the context of P-values it can be referred to as the trend towards significance fallacy but more generally, for reasons I shall explain, it could be referred to as the anticipated evidence fallacy.
The anticipated evidence fallacy is itself an example of the overstated evidence fallacy. I shall also discuss this fallacy and other relevant matters affecting reproducible science including the problem of false negatives.
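A small R sketch of the simplest version of this point: when the null hypothesis is true, increasing the sample size does nothing whatever to the distribution of the P-value, so there is no ‘trend towards significance’ to anticipate (one-sample designs with invented sizes, true mean zero throughout):

set.seed(5)
p_small <- replicate(5000, t.test(rnorm(50))$p.value)    # n = 50
p_large <- replicate(5000, t.test(rnorm(500))$p.value)   # n = 500
quantile(p_small, c(0.1, 0.5, 0.9))
quantile(p_large, c(0.1, 0.5, 0.9))   # essentially identical distributions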
Sample size determination in clinical trials is considered from various ethical and practical perspectives. It is concluded that cost is a missing dimension and that the value of information is key.
The two statistical cornerstones of replicability: addressing selective infer...
jemille6
Tukey’s last published work in 2020 was an obscure entry on multiple comparisons in the Encyclopedia of Behavioral Sciences, addressing the two topics in the title. Replicability was not mentioned at all, nor was any other connection made between the two topics. I shall demonstrate how these two topics critically affect replicability using recently completed studies. I shall review how these have been addressed in the past. I shall review in more detail the available ways to address selective inference. My conclusion is that conducting many small replicability studies without strict standardization is the way to assure replicability of results in science, and we should introduce policies to make this happen.
The ASA president Task Force Statement on Statistical Significance and Replic...
jemille6
Yoav Benjamini's slides “The ASA President's Task Force Statement on Statistical Significance and Replicability” for the Special Session of the (remote) Phil Stat Forum: “Statistical Significance Test Anxiety” on 11 January 2022.
Introduces and explains the use of multiple linear regression, a multivariate correlational statistical technique. For more info, see the lecture page at http://goo.gl/CeBsv. See also the slides for the MLR II lecture http://www.slideshare.net/jtneill/multiple-linear-regression-ii
Quantitative Analysis for Empirical Research
Amit Kamble
Overview of approaches and methods for quantitative analysis, including:
1) Planning of experiments
2) Data generation
3) Presentation of reports
together with some numerical methods, data modelling and hypothesis testing.
There are many questions one might ask of a clinical trial, ranging from ‘what was the effect in the patients studied?’ to ‘what might the effect be in future patients?’, via ‘what was the effect in individual patients?’. The extent to which the answers to these questions are similar depends on the various assumptions made, and in some cases the design used may not permit any meaningful answer to be given at all.
A related issue is confusion between randomisation, random sampling, linear modelling and true multivariate modelling. These distinctions don’t matter much for some purposes and under some circumstances, but for others they do.
A yet further issue is that causal analysis in epidemiology, which has brought valuable insights in many cases, has tended to stress point estimates and ignore standard errors. This has potentially misleading consequences.
An understanding of components of variation is key. Unfortunately, the development of two particular topics in recent years, evidence synthesis by the evidence-based medicine movement and personalised medicine by bench scientists, has paid scant attention to components of variation, or to the questions being asked, or to both, resulting in confusion about many issues.
For instance, it is often claimed that numbers needed to treat indicate the proportion of patients for whom treatments work, that inclusion criteria determine the generalisability of results and that heterogeneity means that a random effects meta-analysis is required. None of these is true. The scope for personalised medicine has very plausibly been exaggerated and an important cause of variation in the healthcare system, physicians, is often overlooked.
I shall argue that thinking about questions is important.
The response to the COVID-19 crisis by various vaccine developers has been extraordinary, both in terms of speed of response and the delivered efficacy of the vaccines. It has also raised some fascinating issues of design, analysis and interpretation. I shall consider some of these issues, taking as my examples five vaccines: Pfizer/BioNTech, AstraZeneca/Oxford, Moderna, Novavax, and J&J Janssen, but concentrating mainly on the first two. Among matters covered will be concurrent control, efficient design, issues of measurement raised by two-shot vaccines and implications for roll-out, and the surprising effectiveness of simple analyses. Differences between the five development programmes as they affect statistics will be covered but some essential similarities will also be discussed.
In Search of Lost Infinities: What is the “n” in big data?
Stephen Senn
In designing complex experiments, agricultural scientists, with the help of their statistician collaborators, soon came to realise that variation at different levels had very different consequences for estimating different treatment effects, depending on how the treatments were mapped onto the underlying block structure. This was a key feature of the Rothamsted approach to design and analysis and a strong thread running through the work of Fisher, Yates and Nelder, being expressed in topics such as split-plot designs, recovering inter-block information and fractional factorials. The null block-structure of an experiment is key to this philosophy of design and analysis. However, modern techniques for analysing experiments stress models rather than symmetries, and this modelling approach requires much greater care in analysis, with the consequence that you can easily make mistakes and often will.
In this talk I shall underline the obvious, but often unintentionally overlooked, fact that understanding variation at the various levels at which it occurs is crucial to analysis. I shall take three examples, an application of John Nelder’s theory of general balance to Lord’s Paradox, the use of historical data in drug development and a hybrid randomised non-randomised clinical trial, the TARGET study, to show that the data that many, including those promoting a so-called causal revolution, assume to be ‘big’ may actually be rather ‘small’. The consequence is that there is a danger that the size of standard errors will be underestimated or even that the appropriate regression coefficients for adjusting for confounding may not be identified correctly.
I conclude that an old but powerful experimental design approach holds important lessons for observational data about limitations in interpretation that mere numbers cannot overcome. Small may be beautiful, after all.
History of how and why a complex cross-over trial was designed to prove the equivalence of two formulations of a beta-agonist, and what the eventual results were. Presented at the Newton Institute 28 July 2008. Warning: following the important paper by Kenward & Roger (Biostatistics, 2010), I no longer think the random effects analysis is appropriate although, in fact, the results are pretty much the same as for the fixed effects analysis.
Confounding, politics, frustration and knavish tricks
1. Confounding, politics, frustration and knavish tricks
Stephen Senn
Bradford Hill lecture 2008
2. "If when the tide is falling you take out water with a twopenny pail, you and the moon together can do a great deal"
Bradford Hill, A., and Hill, I. D. (1990), Principles of Medical Statistics (12th edition), p. 247
3. The Central Problem of Epidemiology
• This is generally recognised to be confounding
• Where experiments cannot be conducted we must make do with observational studies
• There is also the risk that due to hidden confounders we will conclude causation when all we have is association
• Hill was a (the) key figure in promoting randomised controlled trials (RCTs)
• But he also recognised that RCTs were not enough and was a pioneer of observational studies
– Case control as in Doll and Hill (1950)
– Cohort as in Doll and Hill (1954)
4. Outline
• Some statistics of the propensity score
• An explanation of the propensity score
• Comparison to ANCOVA
• Some criticisms
• Conclusions
Acknowledgement: this is based on joint work with Erika Graf and Angelika Caputo. Senn, S., Graf, E., and Caputo, A. (2007), "Stratification for the Propensity Score Compared with Linear Regression Techniques to Assess the Effect of Treatment or Exposure," Statistics in Medicine, 26, 5529-5544.
5. A Question for you to Consider
• Consider these two experiments
– A completely randomised trial: patients allocated with 50% probability to A or B
– Randomised matched pairs: member of any pair randomised with 50% probability to A or B
• In analysing, would you ignore the matching in the second case?
6. Propensity score: background
• Due to Rosenbaum and Rubin, Biometrika 1983
• Has been cited over 1000 times since first published
• Citation rate has grown rapidly since 1995 and is now more than 200 per year
7. [Figure: annual citations of Rosenbaum and Rubin, 1985–2005 (Year vs Citations, with Data and Fit series). This model predicts more than 300 citations in 2008.]
8. [Figure: cumulative citations of Rosenbaum and Rubin, 1985–2005 (Year vs Cumulative citations, rising to about 1200).]
9. [Figure: citations by subject category] ECONOMICS (19.45%), STATISTICS & PROBABILITY (17.24%), PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH (12.75%), CARDIAC & CARDIOVASCULAR SYSTEMS (10.69%), SOCIAL SCIENCES, MATHEMATICAL METHODS (8.99%), MATHEMATICAL & COMPUTATIONAL BIOLOGY (6.63%), SURGERY (6.48%), HEALTH CARE SCIENCES & SERVICES (6.34%), RESPIRATORY SYSTEM (5.75%), MEDICINE, GENERAL & INTERNAL (5.67%)
10. Propensity Score Explanation
• We consider two ‘treatments’ or exposures a subject might have received
• The assignment indicator is X
– X = 0, if subject receives exposure 0
– X = 1, if subject receives exposure 1
• There is a vector of covariates W
11. Counterfactual responses
• For every subject we have two responses, r0 and r1
• One of these will be observed
• One of these is unobserved
– Counterfactual
12. Propensity score: definition
e(W) = P(X = 1 | W)
This is a form of balancing score b(W). A balancing score is defined as follows. If r0 is the response given by a subject that is unexposed (indexed by 0) and r1 is the response when the same subject is exposed (indexed by 1), and
(r0, r1) ⊥ X | W and (r0, r1) ⊥ X | b(W)
then b(W) is a balancing score. R & R show that the finest such score is W itself and the coarsest is the propensity score.
13. Propensity score uses
• Calculate the propensity score for each subject
• Stratify by the propensity score
– In practice fifths are used
• The resulting estimator is unbiased
– The possible confounding influence of W has been eliminated
14. Propensity Score: An Example
Disposition of subjects in a study
                        Exposure
                    A       B     Total
Male    Young     240      80       320
        Old        60      20        80
Female  Young      80     240       320
        Old        20      60        80
Total             400     400       800
15.
Exposure by sex:
Sex       A      B    Total
Male    300    100      400
Female  100    300      400
Total   400    400      800

Exposure by age:
Age       A      B    Total
Young   320    320      640
Old      80     80      160
Total   400    400      800

Sex is predictive of exposure but age is not.
16.
Class           Relative frequency (A)    ‘Probability’ of disposition (to A)
Young males     240/320                   3/4
Old males       60/80                     3/4
Young females   80/320                    1/4
Old females     20/80                     1/4

The philosophy of the propensity score is to stratify by probability of allocation. In this case this is equivalent to stratifying by sex.
17. Response
Mean response by sex:
Sex       A      B    Difference
Male     96    136    40
Female   96    136    40

Mean response by age:
Age       A      B    Difference
Young   100    140    40
Old      80    120    40

Age is predictive of outcome but sex is not.
18. The Difference to Conventional Approaches
• Conventional approaches correct for covariates if they are predictive of outcome
– Analysis of covariance
– Stratification
• The propensity score corrects if covariates are predictive of assignment (allocation)
• In this example correcting either for sex (propensity score) or age (ANCOVA) will produce an “unbiased” estimate
19. In terms of linear regression
To define some general notation, let β_UV be the marginal regression of U on V and let β_UV.T be the conditional regression of U on V given T. Then
β_YX = β_YX.W + β_YW.X β_WX
∴ β_YX = β_YX.W if β_YW.X = 0 (1) or β_WX = 0 (2)
(1) is the analysis of covariance condition for not including something in the model and (2) is the propensity score condition. Now consider a specific implementation where Y is outcome, X is treatment and W is covariate.
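This identity is exact for least-squares coefficients and can be checked numerically; a minimal sketch in R, with an invented data-generating model:

# Numerical check of beta_YX = beta_YX.W + beta_YW.X * beta_WX.
set.seed(6)
n <- 1000
x <- rnorm(n)
w <- 0.5 * x + rnorm(n)              # W associated with X
y <- x + 2 * w + rnorm(n)            # Y depends on both
b_yx <- coef(lm(y ~ x))["x"]         # marginal regression of Y on X
fit <- lm(y ~ x + w)                 # conditional regressions given W
b_yx.w <- coef(fit)["x"]; b_yw.x <- coef(fit)["w"]
b_wx <- coef(lm(w ~ x))["x"]
c(b_yx, b_yx.w + b_yw.x * b_wx)      # identical, as the identity requires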
20. Some myths of the propensity score
• Collinearity of predictors makes traditional regression adjustments unusable
• Quintile stratification on the propensity score eliminates bias more effectively than ANCOVA
• The propensity score can be more efficient than ANCOVA
• The coarsening property of the propensity score benefits efficiency
21. Collinearity of Predictors
Consider a simple example in which the following predictor pattern is repeated a number of times:
Covariate/Confounder    Exposure
W1    W2                X
0     0                 0
0     0                 1
1     1                 0
1     1                 1
Clearly the effects of W1 and W2 are not identifiable, but the effect of X is, and any decent statistical package should be able to estimate the effect even if W1 and W2 are in the model. In the following example it is supposed that
Y = W1 + W2 + X + ε,  ε ~ N(0, 1)
and that we have the same basic pattern of predictors for 1000 observations.
22. Analysis with GenStat 1
Case where W1 and W2 are completely collinear.
Message: term W2 cannot be included in the model because it is aliased with terms already in the model. (W2) = (W1)
Regression analysis; estimates of parameters:
Parameter   estimate   s.e.     t(997)   t pr.
Constant    -0.0266    0.0542   -0.49    0.624
W1           2.0067    0.0626   32.05    <.001
X            1.0377    0.0626   16.57    <.001

23. Analysis with GenStat 2
Case where W1 and W2 are strongly collinear (a small bit of noise added to W2).
Regression analysis; estimates of parameters:
Parameter   estimate   s.e.     t(996)   t pr.
Constant    -0.0270    0.0542   -0.50    0.619
W1          -0.82      3.16     -0.26    0.795
W2e          2.83      3.16      0.89    0.372
X            1.0372    0.0626   16.56    <.001
Message: the variance of some parameter estimates is seriously inflated, due to near collinearity or aliasing between the following parameters, listed with their variance inflation factors: W1 2553.00, W2e 2553.00.
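The same example can be reproduced in most packages; a sketch in R (rather than GenStat) under the same invented model:

# The effect of X is estimable even when W1 and W2 are aliased.
set.seed(7)
base <- data.frame(w1 = c(0, 0, 1, 1), w2 = c(0, 0, 1, 1), x = c(0, 1, 0, 1))
d <- base[rep(1:4, 250), ]                     # the pattern repeated: 1000 rows
d$y <- d$w1 + d$w2 + d$x + rnorm(1000)
coef(lm(y ~ w1 + w2 + x, data = d))            # w2 is NA (aliased); x is fine
d$w2e <- d$w2 + rnorm(1000, sd = 0.01)         # a small bit of noise added
coef(summary(lm(y ~ w1 + w2e + x, data = d)))  # huge s.e. for w1, w2e; not for x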
24. Better at eliminating bias?
• Some papers have purported to show this
• Claims have been demonstrated using simulation
• But the simulations have been unfair
– For example using models of different implicit complexity
• It is trivial to produce examples where quintile stratification does not work
– Suppose a baseline covariate differs by one standard deviation between exposures and outcome is a linear function of this
• ANCOVA works perfectly, the propensity score is biased
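A sketch in R of this counter-example, with the set-up invented to match the description above; since the propensity score is monotone in the covariate here, fifths of the covariate are fifths of the score:

# Covariate differs by one SD between groups; true exposure effect is zero.
set.seed(8)
n <- 5000
x <- rep(0:1, each = n)                        # exposure
w <- rnorm(2 * n, mean = x)                    # shifted by 1 SD when x = 1
y <- w + rnorm(2 * n)                          # outcome depends on w only
coef(lm(y ~ x + w))["x"]                       # ANCOVA: close to the true 0
q <- cut(w, quantile(w, 0:5 / 5), include.lowest = TRUE)   # fifths of w
# simple average of within-fifth differences: residual confounding remains
mean(tapply(y[x == 1], q[x == 1], mean) - tapply(y[x == 0], q[x == 0], mean))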
25. More efficient than ANCOVA?
• Stratification is by probability of assignment
• But ANCOVA stratifies by predictors of outcome, not assignment
• By definition, residual variance is less for ANCOVA
• By definition, loss of orthogonality is greater for propensity
• Consequence: variance of estimators is higher for the propensity score
• Propensity score incoherent?
26. Furthermore
• The coarseness property of the propensity score is completely irrelevant
• There is no gain in efficiency through this property
• The loss in orthogonality is equivalent to fitting all covariates and their interactions with each other
• You might as well just use (multivariate) W
27. A Regression Reminder
To define some notation, let P = [X, W] be the matrix of regressors. Then
var(β̂) = σ² (P′P)⁻¹
and let a_XX denote the diagonal element of (P′P)⁻¹ corresponding to X, so that var(β̂_X) = σ² a_XX. The propensity score philosophy chooses the members of W in such a way that a_XX is maximised. Analysis of covariance chooses the members so that σ² is minimised.
28. Another Example
              Young            Old             All ages
          X = 0  X = 1    X = 0  X = 1    X = 0  X = 1   Total
Male        3      7       80     30       83     37      120
Female      8     42        9     21       17     63       80
Total      11     49       89     51      100    100      200
(Totals by age: young 60, old 140; overall 200)

29. The same table, noting that young males and old females share the same propensity score, e(W) = 0.7.
30. Propensity score stratification
Stratum or strata             Propensity score   X = 0   X = 1   Total
Old males                     e(W) = 0.27          80      30     110
Young males + old females     e(W) = 0.70          12      28      40
Young females                 e(W) = 0.84           8      42      50
Total                                              100     100     200
31. For our Second Example
Factors in model in addition to exposure    Variance multiplier, a_XX
None                                        0.0200
Age                                         0.0242
Sex                                         0.0257
Sex + age                                   0.0267
Sex + age + sex × age                       0.0271
The last of these is the same as for the propensity score.
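The variance multipliers in this table can be checked by rebuilding the 200 subjects from the table on slide 28 and reading off the X diagonal element of the inverse cross-product matrix; a sketch in R:

cell <- expand.grid(x = 0:1, age = c("young", "old"), sex = c("M", "F"))
cell$n <- c(3, 7, 80, 30, 8, 42, 9, 21)        # counts from slide 28
d <- cell[rep(1:8, cell$n), ]                  # 200 subjects
axx <- function(f) {
  P <- model.matrix(f, data = d)               # regressors for a given model
  solve(crossprod(P))["x", "x"]                # the a_XX element of (P'P)^-1
}
axx(~ x)              # 0.0200
axx(~ x + age)        # 0.0242
axx(~ x + sex)        # 0.0257
axx(~ x + sex + age)  # 0.0267
axx(~ x + sex * age)  # 0.0271, the same as for the propensity score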
32. Conditional Distributions and the Propensity Score
• The appropriateness of the propensity score is always illustrated in terms of the expectation of the treatment estimate
– Unbiasedness in a linear framework
• Its suitability when looked at in terms of the full conditional distribution is less obvious, as will now be demonstrated
33. Suppose that we are interested in the conditional distribution of an outcome variable Y given a putative causal variable X and a further covariate W. We wish to investigate the circumstances under which W can be ignored. That is to say, we wish to know the conditions under which f(Y | W ∩ X) = f(Y | X).
Now,
f(W ∩ Y ∩ X) = f(X) f(W | X) f(Y | W ∩ X)   (1)
and
f(W ∩ Y ∩ X) = f(X) f(Y | X) f(W | Y ∩ X)   (2)
Equating the right-hand sides of (1) and (2):
f(W | X) f(Y | W ∩ X) = f(Y | X) f(W | Y ∩ X)
∴ f(Y | W ∩ X) = f(Y | X) f(W | Y ∩ X) / f(W | X)   (3)
Now, (3) is not in general equivalent to f(Y | W ∩ X) = f(Y | X)   (4)
unless f(W | Y ∩ X) = f(W | X), which implies W ⊥ Y | X and hence Y ⊥ W | X.
34. Conclusion
• The claims that are made for the propensity score are true in terms of conditional expectation (at least for the linear model)
• However, they are not true in terms of the full conditional model
• For W to be ignorable in that sense requires Y ⊥ W | X
• This is the ANCOVA condition
35. Implications for Modelling
• It is not true that ignoring a covariate that is predictive of outcome but not assignment is acceptable
• In the linear case estimators are unbiased but their variances are “incorrect”
• More generally, however, conditional and unconditional estimators are different
– Logistic regression, survival analysis
Y
Z
X4
X2
X3
X1
X5
X6
What should join Z
in the model?
37. [The same causal diagram with inappropriate terms removed]
38. [The same causal diagram showing propensity score adjustment]
40. Non-linear example
Simulation as before but with a binary response defined by Y > 1.5, with balanced covariates.
With W1 and W2e in the model:
Parameter   estimate   s.e.     t(*)     t pr.   antilog of estimate
Constant    -2.442     0.185    -13.18   <.001   0.08696
W1           4.98      8.51       0.59   0.558   146.2
W2e         -1.73      8.51      -0.20   0.839   0.1768
X            1.689     0.192      8.78   <.001   5.413
Omitting W1 and W2e:
Parameter   estimate   s.e.     t(*)     t pr.   antilog of estimate
Constant    -0.4642    0.0918    -5.06   <.001   0.6287
X            0.962     0.130      7.40   <.001   2.617
41. Not convinced? An Example
• An open trial of the effect of alcohol consumption on the ability to memorize word lists
• Volunteers to be drawn at random and divided into two groups
• One lot to be given a glass of wine, the other a glass of water

42. Two Possible Approaches
Experiment 1
• A subject has name drawn at random
• If chosen for control group, given blue ball
• If chosen for treatment group, given red ball
• “All you who have a blue ball please come to receive your glass of water, red ball to receive your glass of wine”
Experiment 2
• A subject has name drawn at random
• If chosen for control group, given glass of beer to drink
• Otherwise given nothing
• “All you who have had a beer come to receive your glass of water; if you had nothing, to receive your glass of wine.”
43. Experiment 1
• Probability of receiving wine if ball blue = 0
• Probability of receiving wine if ball red = 1
• The propensity score takes on the values 0 and 1
• Do you have to stratify by the propensity score?

44. Experiment 2
• Probability of receiving wine if beer = 0
• Probability of receiving wine if no beer = 1
• The propensity score takes on the values 0 and 1
• Do you have to stratify by the propensity score?
45. The Difference?
• The difference between these two experiments is not the propensity score
• This is 0 and 1 in both cases and all subjects in both cases have a score of 0 or 1
• The difference is that in the first case the covariate used to construct the score is predictive of outcome and in the second it is not
46. Consequence
• It is association with outcome that is important
– ANCOVA tradition
• Not association with assignment
– Propensity point of view
47. And that Question
• Consider these two experiments
– A completely randomised trial: patients allocated with 50% probability to A or B
– Randomised matched pairs: member of any pair randomised with 50% probability to A or B
• In analysing, would you ignore the matching in the second case?
• The propensity score philosophy says you can!
48. Finally
"All scientific work is incomplete - whether it be observational or experimental. All scientific work is liable to be upset or modified by advancing knowledge. That does not confer upon us a freedom to ignore the knowledge we already have, or to postpone the action that it appears to demand at a given time."
Sir Austin Bradford Hill, 1965
Editor's Notes
Lecture given at the London School of Hygiene and Tropical Medicine 3 June 2008
The dangers of concluding that subsequence is consequence
Doll R, Hill AB. (1950) Smoking and carcinoma of the lung. Preliminary report, British Medical Journal, 2: 739-748.
Doll R, Hill AB. (1954) The mortality of doctors in relation to their smoking habits. British Medical Journal, 228: 1451-5.
It was hearing Erika Graf lecture on this in the mid 1990s that first got me interested in this topic.
See Graf, E. (1997). "The propensity score in the analysis of therapeutic studies." Biometrical Journal 39: 297-307.
This is an elementary issue in applied statistics that we teach students to understand
ECONOMICS: 264
STATISTICS & PROBABILITY: 234
PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH: 173
CARDIAC & CARDIOVASCULAR SYSTEMS: 145
SOCIAL SCIENCES, MATHEMATICAL METHODS: 122
MATHEMATICAL & COMPUTATIONAL BIOLOGY: 90
SURGERY: 88
HEALTH CARE SCIENCES & SERVICES: 86
RESPIRATORY SYSTEM: 78
MEDICINE, GENERAL & INTERNAL: 77
A given subject receives at most one.
One of responses is realised the other is not
The first of these conditions involving r0 and r1 is the assumption of no unmeasured confounders
“This means that the counterfactual responses and treatment assignment are conditionally independent given the vector of covariates.” (Senn, Graf and Caputo)
The second is a condition for some function of the covariates to be ‘enough’ to stratify on.
Stratification by fifths is often referred to as quintile stratification
Example to illustrate the propensity score.
We have two exposures (A and B) and two explanatory factors age (old and young) and sex (male and female).
The outcome is not related to sex but is definitely related to age.
The point about these marginal tables is that they show that the treatment groups are imbalanced by sex but are not imbalanced by age. The philosophy of the propensity score is to stratify by the probability of allocation to one of the treatment groups (say group A). This is, in fact, equivalent to stratifying by sex, since this is the factor that affects the probability of allocation.
In the example here, relative frequencies rather than probabilities are used. In fact the propensity score is defined in terms of the latter and this can be seen as a weakness. The distinction is ignored here.
We can define the strata by the probability of exposure (the propensity score). In this example, this is equivalent to stratifying by sex.
This, on the other hand, shows a more relevant stratification from the point of view of traditional ANCOVA.
When looked at in terms of variance the propensity score appears in a less satisfactory light.
These two groups have the same propensity score: P(X = 0 | W) = 3/10 = 9/30 = 0.3, or equivalently e(W) = 0.7.
In fact although we can classify subjects by four covariate combinations, there are only three strata in the propensity score. The score is coarsened.
In other words the propensity score has gained nothing in terms of efficiency compared to fitting the full model.
An indicator of exposure, Z, an outcome variable Y and some potential confounders, X1-X6.
With inappropriate confounders removed from the model.
W1 and W2e are almost orthogonal to X. However their omission in a non-linear model leads to a huge bias in the estimate of the effect of X.
In this example the response is 1 if Y > 1.5 and 0 otherwise. Other details are as before.