Lord’s ‘paradox’ is a notoriously difficult puzzle that is guaranteed to provoke discussion, dissent and disagreement. Two statisticians analyse some observational data and come to radically different conclusions, each of which has acquired defenders over the years since Lord first proposed his puzzle in 1967. It features in the recent Book of Why by Pearl and Mackenzie, who use it to demonstrate the power of Pearl’s causal calculus, obtaining a solution they claim is unambiguously right. They also claim that statisticians have failed to get to grips with causal questions for well over a century, in fact ever since Karl Pearson developed Galton’s idea of correlation and warned the scientific world that correlation is not causation.
However, only two years before Lord published his paradox, John Nelder outlined a powerful causal calculus for analysing designed experiments, based on a careful distinction between block and treatment structure. This represents an important advance in formalising the approach to analysing complex experiments that started with Fisher 100 years ago, when he proposed splitting variability using the square of the standard deviation, which he called the variance. The approach continued with Yates and has been developed since the 1960s by Rosemary Bailey, amongst others. This tradition might be referred to as the Rothamsted School. It is fully implemented in Genstat® but, as far as I am aware, not in any other package.
With the help of Genstat®, I demonstrate how the Rothamsted School would approach Lord’s paradox and come to a solution that is not the same as the one reached by Pearl and Mackenzie, although given certain strong but untestable assumptions it would reduce to it. I conclude that the statistical tradition may have more to offer in this respect than has been supposed.
2. Outline
Topic Number of Slides
Adjusting for baseline in clinical trials 12
Lord’s Paradox 6
The Book of Why versus Lord’s Paradox 2
The Rothamsted School 8
Genstat® versus Lord’s paradox 11
Conclusions 2
(C) Stephen Senn 2018 2
3. Disclaimer
• I shall be criticising one particular claim made in The Book of Why
• This should not be taken as a criticism of the causal calculus
• In fact, I regard this as being important for statisticians
• I freely admit that my work would benefit from my being more familiar with it
4. Adjusting for baseline in clinical trials
Some standard and not-so-standard theory
5. SACS and ANCOVA
A simple randomised clinical trial in which there are two treatment groups and only two measurements per patient: a baseline measurement, X, and an outcome measurement, Y.
Popular choices of outcome measure are
1) raw outcomes, Y
2) change scores, d = Y - X (the simple analysis of change scores, SACS)
3) covariance-adjusted outcomes, Y - βX (where β is chosen appropriately)
NB As Laird (Am Stat., 37, 329-330, 1983) has shown, covariate-adjusted change scores are the same as 3)
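One way to see how the three choices relate is to write the treatment contrast as a single family indexed by the slope applied to the baseline difference. This is only a sketch: the group labels t and c and the symbols \hat{\tau}(b) and \hat{\beta} are notation assumed here for illustration.

```latex
\hat{\tau}(b) = (\bar{Y}_t - \bar{Y}_c) - b\,(\bar{X}_t - \bar{X}_c),
\qquad
b =
\begin{cases}
0 & \text{raw outcomes} \\
1 & \text{change scores (SACS)} \\
\hat{\beta} & \text{ANCOVA, slope estimated from the data}
\end{cases}
```

Seen this way, the raw and change-score analyses fix the slope in advance at 0 and 1 respectively, whereas ANCOVA lets the data choose it.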
6. Which to use?
• ANCOVA has a variance that is always less than or equal to that of the other two
• Provided the slope (adjustment) parameter is known
• The Gauss-Markov theorem does not apply to random regressors, so one could do slightly better in theory
• Analogous to recovering inter-block information
• ANCOVA is conditionally unbiased
• It exhausts the information in the baselines
• If an additive model applies
• Nevertheless, it is usually better and most commentators have concluded it is the approach to use
7. [Figure not reproduced]
Here the variances at outcome and baseline are assumed to be the same, in which case the regression coefficient is just the correlation.
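The variance ranking of the three outcome measures can be checked with a short calculation. The sketch below assumes a common variance σ² at baseline and outcome, a correlation ρ between them, and a known slope, which under equal variances is β = ρ. The symbols σ² and ρ are notation introduced here, and the variances are per patient.

```latex
\begin{aligned}
\operatorname{Var}(Y) &= \sigma^2, \\
\operatorname{Var}(Y - X) &= 2\sigma^2(1 - \rho), \\
\operatorname{Var}(Y - \rho X) &= \sigma^2(1 - \rho^2) = \sigma^2(1 - \rho)(1 + \rho).
\end{aligned}
```

Since 1 - ρ² ≤ 1 and 1 + ρ ≤ 2, the covariance-adjusted outcome never has the largest variance of the three. The change score beats the raw outcome only when ρ > 1/2.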
8. Counter-Claims
• There is a significant minority of papers arguing against ANCOVA as a means of dealing with bias
• E.g. Liang and Zeger (2000), Sankhya; Samuelson (1986), American Statistician
• The variance claims are accepted
• However, claims are made that unless there is balance at baseline ANCOVA is biased
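9. Justification of the Counter-Claim
$E(X_c) = \mu_c$
$E(X_t) = \mu_t$
$E(Y_c) = \mu_c$
$E(Y_t) = \mu_t + \tau$
Hence
$E(\bar{X}_t - \bar{X}_c) = \mu_t - \mu_c = \theta$
$E(\bar{Y}_t - \bar{Y}_c) = \tau + \theta$
$E[(\bar{Y}_t - \bar{Y}_c) - (\bar{X}_t - \bar{X}_c)] = \tau$
$E[(\bar{Y}_t - \bar{Y}_c) - \beta(\bar{X}_t - \bar{X}_c)] = \tau + (1 - \beta)\theta$
SACS is unbiased
ANCOVA is biased unless θ = 0
This just proves how misleading models can be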
10. A Counter Counter-Example
(C) Stephen Senn 2005
• Suppose we design a bizarre clinical trial
• Only persons with diastolic blood pressure at baseline equal to 95 mmHg or 105 mmHg may enter
• In the first stratum they are randomised 3 to 1 and in the second 1 to 3
• Situation as follows
11. A Stupid Trial
Numbers of patients by baseline diastolic blood pressure (dbp) and treatment

Baseline dbp   Treatment A   Treatment B   Total
95 mmHg        300           100           400
105 mmHg       100           300           400
Total          400           400           800
12. Approach to Analysis
• Stratify by baseline dbp
• Produce treatment estimate for each stratum
• Overall estimate is average of the two estimates
• Stratification deals with the imbalance
13. An Equivalent Approach
• Create dummy variable stratum
S = -1 if baseline dbp, X = 95 mmHg
S = 1 if baseline dbp, X = 105 mmHg
• Regress dbp at outcome, Y, on treatment indicator T and on stratum indicator S
• Estimate will be the same as by stratification
• If you want the variance estimate to be exactly the same you need to include the interaction also
14. An Equivalent Equivalent Approach
• Regress Y on T and X rather than on T and S
• This is called ANCOVA!
• Note that S = (X - 100)/5
• Hence, this approach is equivalent to the previous one, which is equivalent to stratification, which is unbiased
• On the other hand SACS is biased
• Hence we have produced a counter-example
15. Conclusion
• Contrary to what is often claimed, there are cases where ANCOVA is unbiased but SACS is biased.
• No simple statement of the form “ANCOVA is more efficient but SACS is unbiased” is possible.
• In fact it is very difficult to imagine cases where SACS is the preferred analysis
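The counter-example can be checked numerically. The following is a minimal sketch, not the author's own calculation. It uses the cell counts from the "stupid trial" table and a hypothetical noise-free outcome model Y = ALPHA + BETA*X + TAU*T, whose three constants are assumptions chosen purely for illustration. Because there is no noise, each estimator returns its expected value.

```python
from statistics import mean

# Hypothetical outcome model (an assumption, not from the slides):
# Y = ALPHA + BETA * X + TAU * T, with T = 0 for treatment A and T = 1 for B.
ALPHA, BETA, TAU = 40.0, 0.5, -4.0

# Cell counts from the trial table: (baseline dbp, treatment) -> patients
CELLS = {(95, 0): 300, (95, 1): 100, (105, 0): 100, (105, 1): 300}


def build_rows():
    """Expand the cells into one (x, t, y) row per patient."""
    rows = []
    for (x, t), n in CELLS.items():
        y = ALPHA + BETA * x + TAU * t
        rows.extend([(x, t, y)] * n)
    return rows


def sacs(rows):
    """B minus A difference in mean change score Y - X."""
    change = {0: [], 1: []}
    for x, t, y in rows:
        change[t].append(y - x)
    return mean(change[1]) - mean(change[0])


def stratified(rows):
    """Simple average of the within-stratum B minus A differences."""
    estimates = []
    for level in sorted({x for x, _, _ in rows}):
        y_b = [y for x, t, y in rows if x == level and t == 1]
        y_a = [y for x, t, y in rows if x == level and t == 0]
        estimates.append(mean(y_b) - mean(y_a))
    return mean(estimates)


def ancova(rows):
    """Treatment coefficient from least squares of Y on T and X.

    Uses the pooled within-treatment slope, which is algebraically the
    same as fitting Y on an intercept, T and X.
    """
    sxx = sxy = 0.0
    centre = {}
    for t in (0, 1):
        group = [(x, y) for x, tt, y in rows if tt == t]
        mx = mean(x for x, _ in group)
        my = mean(y for _, y in group)
        centre[t] = (mx, my)
        sxx += sum((x - mx) ** 2 for x, _ in group)
        sxy += sum((x - mx) * (y - my) for x, y in group)
    slope = sxy / sxx
    return (centre[1][1] - centre[0][1]) - slope * (centre[1][0] - centre[0][0])


rows = build_rows()
print("true effect:", TAU)
print("stratified :", stratified(rows))
print("ANCOVA     :", ancova(rows))
print("SACS       :", sacs(rows))
```

Under these assumed values, stratification and ANCOVA both recover TAU, while SACS is off by (BETA - 1) times the baseline imbalance of 5 mmHg. SACS would agree only if the assumed slope were exactly 1.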
17. Lord’s Paradox
Lord, F.M. (1967) “A paradox in the interpretation of group comparisons”, Psychological Bulletin, 68, 304-305.
“A large university is interested in investigating the effects on the students of the diet provided in the university dining halls…. Various types of data are gathered. In particular the weight of each student at the time of his arrival in September and his weight in the following June are recorded”
We shall consider this in the Wainer and Brown version (also considered by Pearl) in which there are two halls, each assigned a different one of the two diets being compared.
18. Two Statisticians
Statistician One (Say John)
• Calculates difference in weight (outcome - baseline) for each hall
• No significant difference between diets as regards this ‘change score’
• Concludes no evidence of difference between diets
Statistician Two (Say Jane)
• Adjusts for initial weight as a covariate
• Finds significant diet effect on adjusted weight
• Concludes there is a difference between diets
22. Pearl’s causal calculus versus Lord’s Paradox
Is expectation enough? What about variance?
23. Judea Pearl, born 1936
• Israeli-American computer scientist and philosopher
• Has developed powerful causal calculus based on distinguishing between seeing and doing
• Explains Simpson’s paradox
• Causality: Models, Reasoning and Inference (2000)
• Has recently co-authored a popular book with Dana Mackenzie, The Book of Why, 2018
24. Pearl & Mackenzie, 2018
[Causal diagram, not reproduced, with nodes D (Diet), W1 and WF]
“In this diagram, W1, is a confounder of D and WF and not a mediator. Therefore, the second statistician would be unambiguously right here.” The Book of Why p216
“However, for statisticians who are trained in “conventional” (i.e. model-blind) methodology and avoid using causal lenses, it is deeply paradoxical” The Book of Why p217
25. The Rothamsted School
A century of variance from ANOVA to Genstat® and back via General Balance
26. The Rothamsted School
• RA Fisher, 1890-1962: variance, ANOVA, randomisation, design, significance tests
• Frank Yates, 1902-1994: factorials, recovering inter-block information
• John Nelder, 1924-2010: general balance, computing, Genstat®
and Frank Anscombe, David Finney, Rosemary Bailey, Roger Payne etc
27. General Balance
• An idea of John Nelder’s
• Two papers in the Proceedings of the Royal Society, 1965 concerning “The analysis of randomized experiments with orthogonal block structure”
• Block structure and the null analysis of variance
• Treatment structure and the general analysis of variance
28. Basic Idea
• Splits an experiment into two radically different components
• The block structure, which describes the way that the experimental units are organised
• The way that variation amongst units can be described
• Null ANOVA, an idea of Anscombe’s
• The treatment structure, which reflects the way that treatments are combined for the scientific purpose of the experiment
29. Design Driven Modelling
• Together with a third piece of information, the design matrix, these determine the analysis of variance
• Note that because both block and treatment structure can be hierarchical, such a design matrix is not, on its own, sufficient to derive an ANOVA
• But together with John’s block and treatment structure it is
• For designs exhibiting general balance
• This approach is incorporated in Genstat®
30. Genstat® Help File Example
"This is a field experiment to study the effects of nitrogen and sulphur on the yield of wheat with a randomized block design."

Block  Plot    S     N   Yield
1         1    0     0   0.750
1         4    0   180   1.204
1         3    0   230   0.799
1        12   10     0   0.925
1         5   10   180   1.648
1         8   10   230   1.463
1         7   20     0   0.654
1         2   20   180   1.596
1        10   20   230   1.594
1        11   40     0   0.526
1         9   40   180   1.672
1         6   40   230   1.804
2         8    0     0   0.503
2        10    0   180   0.489
etc

BLOCKSTRUCTURE Block / Plot
TREATMENTSTRUCTURE N * S
ANOVA [PRINT=aov; FPROBABILITY=yes] Yield
34. Start with the randomised equivalent
• We suppose that the diets had been randomised to the two halls
• Let us suppose there are 100 students per hall
• Generate some data
• See what Genstat® says about analysis
• Note that it is a particular feature of Genstat® that it does not have to have outcome data to do this
• Given the block and treatment structure alone it will give us a skeleton ANOVA
• We start by ignoring the covariate
35.
Code
BLOCKSTRUCTURE Hall/Student
TREATMENTSTRUCTURE Diet
ANOVA

Output
Analysis of variance
Source of variation      d.f.
Hall stratum
Diet                        1
Hall.Student stratum      198
Total                     199

Genstat® points out the obvious (which, however, has been universally overlooked). There are no degrees of freedom to estimate the variability of the Diet estimate, which appears in the Hall and not the Hall.Student stratum.
36. Consequences and further considerations
• Using outcomes only we cannot analyse this experiment
• We have no degrees of freedom to estimate the variance of any treatment estimate
• We will return to baselines in due course
• Let’s first consider how we could fix this ‘experiment’
• Let’s increase the number of halls, while keeping fixed the total number of students we shall follow
• 20 halls
• 10 halls per diet
• 10 students followed per hall
37. Analysis of variance
Source of variation      d.f.
Hall stratum
Diet                        1
Residual                   18
Hall.Student stratum      180
Total                     199

We now see that this experiment is analysable. Had we carried out an experiment of this form we would not need to use baseline values, but we could do so. Let’s consider John’s and Jane’s estimators again. Would they produce valid analyses?
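The degrees of freedom in the two skeleton ANOVA tables follow from simple counting. The sketch below illustrates that counting for a Hall/Student block structure with Diet applied to whole halls. It is not Genstat's algorithm, and the function name and dictionary keys are hypothetical.

```python
def skeleton_anova(n_diets, halls_per_diet, students_per_hall):
    """Degrees of freedom for students nested in halls, diets applied to halls."""
    n_halls = n_diets * halls_per_diet
    n_students = n_halls * students_per_hall
    return {
        # Hall stratum: n_halls - 1 d.f., split into Diet and Residual
        "Hall stratum: Diet": n_diets - 1,
        "Hall stratum: Residual": n_halls - n_diets,
        # Hall.Student stratum: variation between students within halls
        "Hall.Student stratum": n_halls * (students_per_hall - 1),
        "Total": n_students - 1,
    }


two_halls = skeleton_anova(n_diets=2, halls_per_diet=1, students_per_hall=100)
twenty_halls = skeleton_anova(n_diets=2, halls_per_diet=10, students_per_hall=10)

for label, table in (("2 halls", two_halls), ("20 halls", twenty_halls)):
    print(label)
    for source, df in table.items():
        print(f"  {source:<24}{df:>5}")
```

With one hall per diet the Hall-stratum residual has zero degrees of freedom, which is why the two-hall design gives no estimate of the variance relevant to the Diet contrast.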
38. The two estimators compared
John
• Type: change score
• Formula: $(\bar{Y}_t - \bar{Y}_c) - (\bar{X}_t - \bar{X}_c)$
• Consistent? Yes
• Correct variance? Not without strong assumptions
Jane
• Type: ANCOVA
• Formula: $(\bar{Y}_t - \bar{Y}_c) - \beta(\bar{X}_t - \bar{X}_c)$
• Consistent? Yes
• Correct variance? Not without strong assumptions
NB
1. As the number of halls goes to infinity, the second term for either estimator goes to zero.
2. Since the first term is the same, asymptotically they give the same answer.
3. The expectation of the first term, over all randomisations, is the effect of diet.
4. Thus, the two estimators are consistent.
5. The question is, which has the correct variance?
39. Adding covariates
Parameter settings
Students per hall                 10
Number of halls per diet          10
γ², variance between halls        25.00
σ², variance within halls         16.00
μ, average student weight         75.00
Δ, effect of diet                  3.00
ρh, between halls                  0.70
ρs, within halls                   0.50

Analysis code
Correct
BLOCKSTRUCTURE Hall/Student
TREATMENTSTRUCTURE Diet
COVARIATE Base
ANOVA Weight
Or incorrect
BLOCKSTRUCTURE Student
etc
42. We now understand the situation well enough to return to the two hall case
The minimal requirement for the analyses to be valid is the following
Change-score (John)
• The between-hall component of variance must be zero having subtracted the baseline
• Between-hall regression must be equal to 1
ANCOVA (Jane)
• The between-hall component of variance must be zero having conditioned on the baselines
• The regression between halls must be as predicted by the regression within
43. The Necessary Condition for ANCOVA to be Unbiased
$E[(\bar{Y}_t - \bar{Y}_c) - \beta(\bar{X}_t - \bar{X}_c)] = \tau$
$E(\bar{Y}_t - \bar{Y}_c) - \beta E(\bar{X}_t - \bar{X}_c) = \tau$
$E(\bar{Y}_t - \bar{Y}_c) - \tau = \beta E(\bar{X}_t - \bar{X}_c)$
Or in everyday language, that the bias in the raw comparison at outcome should be β times the bias at baseline, where β is the individual regression effect.
This requires a strong assumption that is untestable in the two-hall case.
But in any case, the fact that the estimate is unbiased is not a guarantee that the estimate of the variance of the estimate is unbiased
45. Lord’s Paradox
• It is not true that ‘the second statistician would be unambiguously right’
• Additional untestable assumptions would be needed
• This does not mean that the first statistician would be right
• A lesson is that we need to consider the probability distribution of an inference
• At least the variance and not just the expectation
• I note, by the by, that this is a mistake made in developing the propensity score approach (See Senn, Graf and Caputo, 2007)
46. More generally
• The Rothamsted approach is valuable but sadly neglected
• Only implemented in Genstat®
• An R package is in development by Cullis and Smith
• All too often we take completely randomised designs as being the default analogy to observational data-sets
• More complex designs may be appropriate
• Such as cluster randomised
• Even where we have identified the ‘correct’ confounders (perhaps with the help of causal calculus) we may be getting the standard errors wrong
• Lessons for epidemiology?
• Variances matter
• It is an open question for me whether the causal calculus in its current form is adequate to deal with complex data-sets
• Can it deal adequately with hierarchical structures?
47. References
1. Nelder JA. The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London Series A. 1965;283:147-62.
2. Nelder JA. The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London Series A. 1965;283:163-78.
3. Lord FM. A paradox in the interpretation of group comparisons. Psychological Bulletin. 1967;68:304-5.
4. Holland PW, Rubin DB. On Lord's Paradox. In: Wainer H, Messick S, editors. Principals of Modern Psychological Measurement. Hillsdale, NJ: Lawrence Erlbaum Associates; 1983.
5. Liang KY, Zeger SL. Longitudinal data analysis of continuous and discrete responses for pre-post designs. Sankhya-the Indian Journal of Statistics Series B. 2000;62:134-48.
6. Wainer H, Brown LM. Two statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. American Statistician. 2004;58(2):117-23.
7. Senn SJ. Change from baseline and analysis of covariance revisited. Statistics in Medicine. 2006;25(24):4334-44.
8. Senn SJ, Graf E, Caputo A. Stratification for the propensity score compared with linear regression techniques to assess the effect of treatment or exposure. Statistics in Medicine. 2007;26(30):5529-44.
9. Van Breukelen GJ. ANCOVA versus change from baseline had more power in randomized studies and more bias in nonrandomized studies. Journal of Clinical Epidemiology. 2006;59(9):920-5.
10. Pearl J, Mackenzie D. The Book of Why: Basic Books; 2018.