Lecture on causal inference to the pediatric hematology/oncology fellows at Texas Children's Hospital as part of their Biostatistics for Busy Clinicians lecture series.
Pavlos Msaouel MD, PhD on Causal Inference and RCT Interpretation
1. Pavlos Msaouel MD, PhD
Assistant Professor
Genitourinary Medical Oncology
Translational Molecular Pathology
Confounding and Causal Inference
2. Disclosures
• Advisory Boards / Honoraria:
o Mirati Therapeutics
o Bristol-Myers Squibb
o Exelixis
• Non-branded educational programs:
o Exelixis
o Pfizer
• Clinical Trials with Grant Support:
o Mirati Therapeutics
o Bristol-Myers Squibb
o Takeda Pharmaceutical Company
• All my clinical trials use Bayesian designs.
• I refuse to use 3+3.
3. Statistics cannot directly encode causal knowledge
• 4 Analytical inputs:
Question
Knowledge: causal knowledge is a necessary ingredient to construct the golem
Data
Relationship
• Examples:
Mud does not cause rain
Symptoms do not cause the disease
• The bulk of human knowledge is causal. The bulk of medical and translational knowledge
is causal.
4. Statistics cannot directly encode causal knowledge
• Clinical scenario: a patient with clear cell renal cell carcinoma comes to clinic. We give
her immunotherapy instead of oral TKI and she has a complete response. How do we
know that choosing immunotherapy over the TKI caused the complete response?
The only way to truly know if immunotherapy was the cause of the complete
response is if we had access to an alternate universe (“counterfactual universe”)
where everything else was equal but we gave oral TKI instead of immunotherapy.
Counterfactual universes are the true Gold Standard for causality.
5. Statistics cannot directly encode causal knowledge
• No access to counterfactual universes: can never claim causality (Hume, Russell,
Pearson etc)
• Less extreme position: randomization can allow us to infer causality (but needs
assumptions)
The process of randomization lies outside statistics
There are other ways to infer causality (they also need assumptions)
• Even less extreme position: lab experiments can allow us to infer causality (but need
assumptions)
6. What is the purpose of RCTs?
• RCTs are clinical experiments.
• Their purpose is to compare two (or more) interventions.
• Relative measures are used to compare the interventions.
Differences
Ratios (more transportable)
o Hazard ratios (HR) and odds ratios (OR)
• The most reliable estimates are those contrasting all patients enrolled in each
intervention.
• Subgroup inferences are always less precise.
7. Interpreting RCT results
The MF07-01 multicenter, phase III RCT (Soran et al. Annals of Surgical Oncology, 2018)
compared resection of the primary tumor (LRT group) vs no surgery (ST group) in de novo
metastatic breast cancer. The overall survival results were HR = 0.66, 95% CI 0.49 to 0.88,
p = 0.005 favoring the LRT group. However, when looking at Table 1 of the manuscript you
see the following imbalances:
Which of the following statements is most correct?
1. The imbalances in tumor type between LRT vs ST do not bias the results.
2. The results are biased because the ST group had more triple-negative (worse
prognosis) and fewer ER/PR+ (better prognosis) patients
3. The imbalances in tumor type between LRT vs ST suggest that the quality of
randomization was poor with p < 0.05.
4. I have no idea.
10. Table 1 fallacy
• The practice of seeing imbalances in baseline variables in Table 1 from an RCT and
concluding that these imbalances bias the results.
• Further reading:
• https://discourse.datamethods.org/t/should-we-ignore-covariate-imbalance-and-stop-presenting-a-stratified-table-one-for-randomized-trials/547
• Assmann et al. "Subgroup analysis and other (mis)uses of baseline data in clinical
trials" Lancet (2000) PMID: 10744093
• Senn S. “Baseline comparisons in randomized clinical trials” Stat Med. (1991) PMID:
1876802
• Begg CB. “Suspended judgment. Significance tests of covariate imbalance in clinical
trials”. Control Clin Trials (1990) PMID: 2171874
11. Randomness comes in clusters
Random coin flip sequence (heads vs tails) 30 times using random.org website:
H H T T H T H H T T H H T H H T T T T T T T H T H T H T T H
First 15 coin flips: 9 heads and 6 tails
Last 15 coin flips: 4 heads and 11 tails
[Figure panels: “Random pattern” and “Non-random pattern”]
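The tallies quoted for this coin-flip sequence can be checked directly. The short sketch below only re-counts the 30 flips shown above and also reports the longest streak; it adds no new data.

```python
from itertools import groupby

# The 30 flips quoted on the slide (H = heads, T = tails).
flips = "H H T T H T H H T T H H T H H T T T T T T T H T H T H T T H".split()

first_half, second_half = flips[:15], flips[15:]
first_counts = (first_half.count("H"), first_half.count("T"))
second_counts = (second_half.count("H"), second_half.count("T"))

# Longest run of identical outcomes anywhere in the sequence.
longest_run = max(len(list(group)) for _, group in groupby(flips))

print("first 15 (heads, tails):", first_counts)
print("last 15 (heads, tails):", second_counts)
print("longest run of identical flips:", longest_run)
```

A fair coin produced a streak of seven tails here, which is the kind of clustering that can look like a pattern but is not one.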
12. Randomness comes in clusters
• The very definition of randomization implies there will be imbalances in prognostic factors
• Valid inference does not require such balance
• Stratification balances prognostic factors
• Balanced prognostic factors result in better precision
• Randomization removes bias from confounders resulting in more accurate inferences
Balance what you know. Randomize what you don’t know.
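A small simulation can make the first two bullets concrete. The sketch below is illustrative only: the effect sizes, the trial size of 100, and the 10-percentage-point threshold for calling a trial "imbalanced" are all assumptions, and the outcome is a simple continuous score rather than survival.

```python
import random
import statistics

TRUE_EFFECT = 1.0        # assumed treatment effect on the outcome
PROGNOSTIC_EFFECT = 2.0  # assumed effect of a binary prognostic factor

def simulate_trial(rng, n=100):
    """Return (treatment-effect estimate, between-arm imbalance in the factor)."""
    outcomes = {0: [], 1: []}   # outcomes per arm
    factor = {0: [], 1: []}     # prognostic factor per arm
    for _ in range(n):
        prognostic = int(rng.random() < 0.5)   # e.g. a favorable tumor type
        treat = int(rng.random() < 0.5)        # simple, unstratified randomization
        y = TRUE_EFFECT * treat + PROGNOSTIC_EFFECT * prognostic + rng.gauss(0, 1)
        outcomes[treat].append(y)
        factor[treat].append(prognostic)
    estimate = statistics.fmean(outcomes[1]) - statistics.fmean(outcomes[0])
    imbalance = statistics.fmean(factor[1]) - statistics.fmean(factor[0])
    return estimate, imbalance

rng = random.Random(0)
results = [simulate_trial(rng) for _ in range(2000)]
estimates = [estimate for estimate, _ in results]
imbalances = [imbalance for _, imbalance in results]

mean_estimate = statistics.fmean(estimates)
share_imbalanced = sum(abs(i) > 0.10 for i in imbalances) / len(imbalances)

print(f"average estimate over trials: {mean_estimate:.2f} (truth {TRUE_EFFECT})")
print(f"trials with >10-point imbalance in the factor: {share_imbalanced:.0%}")
```

Many individual trials show a visible imbalance, yet the estimates average out to the true effect: the imbalance costs precision in any one trial, not accuracy.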
13. Precision and accuracy
• Accuracy is the opposite of bias. High accuracy = low bias.
Accuracy = trueness (answers the question: is my inference true?)
Randomization is a causal concept. Answers the question: Was choosing LRT over ST
the cause of the better overall survival?
• Precision measures variability
Answers the question: how close are my measurements to each other?
The less correct term is “power”
14. Precision and accuracy
• The width of the 95% CI can tell us the precision of the study.
• Higher precision: more narrow confidence intervals
• Balanced prognostic factors (e.g., via stratification) -> more narrow confidence intervals
15. Precision and accuracy
• RCT with balanced prognostic factors: accurate & precise
• RCT with imbalanced prognostic factors: accurate but not precise
• Large observational study with strong unaccounted confounding: not accurate but precise
• Small observational study with strong unaccounted confounding: not accurate & not precise
16. Interpreting RCT results
The MF07-01 multicenter, phase III RCT (Soran et al. Annals of Surgical Oncology, 2018)
compared resection of the primary tumor (LRT group) vs no surgery (ST group) in de novo
metastatic breast cancer.
The overall survival results were HR = 0.66, 95% CI 0.49 to 0.88, p = 0.005.
The results were precise enough to make the inference that LRT produces better overall
survival vs ST
If tumor types (and other prognostic variables) were balanced, this would have resulted in
even higher precision (more narrow confidence intervals)
Balance improves “power” (the correct term is precision) even if sample size is the same
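The link between precision and interval width can be worked out from the reported numbers. The sketch below back-calculates the standard error of log(HR) from the published 95% CI using the usual normal approximation on the log scale, then shows how the interval would narrow under a purely hypothetical 20% smaller standard error. The 20% figure is an assumption for illustration, not a result from the trial.

```python
import math

hr, ci_low, ci_high = 0.66, 0.49, 0.88   # values reported on the slide
Z_95 = 1.959964                           # standard normal quantile for a 95% CI

# On the log scale the CI is approximately log(HR) +/- Z_95 * SE.
se_log_hr = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)

# Wald z statistic and two-sided p value implied by that standard error.
z = math.log(hr) / se_log_hr
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

def ci_for(se):
    """95% CI for the same HR under a different standard error."""
    center = math.log(hr)
    return math.exp(center - Z_95 * se), math.exp(center + Z_95 * se)

# Hypothetical: the same point estimate with a 20% smaller standard error.
narrow_low, narrow_high = ci_for(0.8 * se_log_hr)

print(f"SE of log(HR): {se_log_hr:.3f}, z = {z:.2f}, p = {p_two_sided:.4f}")
print(f"hypothetical narrower CI: {narrow_low:.2f} to {narrow_high:.2f}")
```

The implied p value agrees with the reported p = 0.005 up to rounding, which is a useful sanity check when reading a trial report.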
17. How to achieve balance
• Stratification
• Adjustment (covariate adjustment): include the prognostic variable in a regression model
This will produce adjusted HRs (vs unadjusted HRs)
"A Chosen One shall come, born of no father, and through him will ultimate balance in the
Force be restored."
- Ancient Jedi prophecy
18. Stratification vs covariate adjustment
The KEYNOTE-426 trial (Rini et al. NEJM, 2019) was a multicenter phase 3 RCT comparing
pembrolizumab + axitinib vs. sunitinib as first-line therapy for clear cell RCC.
Randomization was stratified according to the International Metastatic Renal Cell
Carcinoma Database Consortium (IMDC) risk group (favorable, intermediate, or poor risk)
and geographic region (North America, Western Europe, or the rest of the world).
Which of the following statements is most correct?
1. There is no need to covariate adjust for IMDC and geographic region as those are
already balanced between the treatment groups due to stratification.
2. The unadjusted hazard ratio is a less biased estimate that is more generalizable to
our patient population. Adjusted hazard ratios are less generalizable.
3. Covariate adjustment will further increase the precision of the hazard ratio
estimate.
4. I have no idea.
20. The value of covariate adjustment
• Increases precision (even more than stratification)
• Produces more generalizable estimates:
Adjusted HR: compares a patient who received pembrolizumab + axitinib to a patient
who received sunitinib and started with the same IMDC risk and from the same
geographic region
Unadjusted HRs make more assumptions because they depend on the entire sample
mix and will not transport to a population with a different covariate mix.
• Balance prognostic factors by using both stratification and adjustment. But if you have to
choose, choose adjustment over stratification.
• Further reading:
Senn S. “Seven myths of randomisation in clinical trials” Stat Med (2013) PMID:
23255195
https://www.fharrell.com/post/covadj/
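The precision gain from adjustment can be seen in a toy simulation. This is a sketch under simplifying assumptions: it uses a continuous outcome and a difference in means instead of a Cox model and hazard ratio, a single binary risk group stands in for IMDC risk, and all effect sizes are invented for illustration.

```python
import random
import statistics

TRUE_EFFECT = 1.0   # assumed treatment effect on a continuous outcome
RISK_EFFECT = 3.0   # assumed effect of a binary prognostic risk group

def arm_difference(rows):
    """Mean outcome in the treated arm minus mean outcome in the control arm."""
    treated = [y for treat, _, y in rows if treat == 1]
    control = [y for treat, _, y in rows if treat == 0]
    return statistics.fmean(treated) - statistics.fmean(control)

def one_trial(rng, n=200):
    """Simulate one RCT and return (unadjusted, adjusted) effect estimates."""
    rows = []
    for _ in range(n):
        risk = int(rng.random() < 0.5)
        treat = int(rng.random() < 0.5)   # simple randomization
        y = TRUE_EFFECT * treat + RISK_EFFECT * risk + rng.gauss(0, 1)
        rows.append((treat, risk, y))
    unadjusted = arm_difference(rows)
    # Adjusted estimate: within-risk-group differences weighted by group size.
    adjusted = 0.0
    for level in (0, 1):
        stratum = [row for row in rows if row[1] == level]
        adjusted += arm_difference(stratum) * len(stratum) / n
    return unadjusted, adjusted

rng = random.Random(1)
estimates = [one_trial(rng) for _ in range(2000)]
unadjusted = [u for u, _ in estimates]
adjusted = [a for _, a in estimates]

sd_unadjusted = statistics.stdev(unadjusted)
sd_adjusted = statistics.stdev(adjusted)
print(f"unadjusted: mean {statistics.fmean(unadjusted):.2f}, SD {sd_unadjusted:.3f}")
print(f"adjusted:   mean {statistics.fmean(adjusted):.2f}, SD {sd_adjusted:.3f}")
```

Both estimators center on the true effect, but the adjusted one varies less from trial to trial, which is what "more precise" means here.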
22. Example graph used in basic & translational Research
Msaouel et al. Cancer Cell, 2020
23. Example graph used in clinical & co-clinical research
Shapiro & Msaouel. Clin Genitour Cancer, 2020
24. The bulk of human knowledge is causal
We need a mathematical language to encode causal knowledge: the do-calculus
Judea Pearl et al. “Causal Inference in Statistics: A Primer”
25. The bulk of human knowledge is causal
H0 : null hypothesis
D: data
Probability theories:
• P(D | H0) -> Frequentist probability
• P(H0 | D) -> Bayesian probability
do-Calculus:
• P(D | do(X = x)) distinguishes cases where *we* fix X = x
• “Do” vs “See”
• P(Rain | do(Mud)) = P(Rain) ≠ P(Rain | Mud): making mud has zero effect on rain
• P(Disease | do(Symptoms)) = P(Disease) ≠ P(Disease | Symptoms)
• P(IO colitis | do(Diarrhea)) = P(IO colitis) ≠ P(IO colitis | Diarrhea), where IO = immunotherapy
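The “do” vs “see” distinction can be simulated. The sketch below assumes a toy causal model, Rain -> Mud <- Sprinkler, with made-up probabilities. Seeing mud raises the probability of rain, while making mud by hand leaves the probability of rain at its baseline, because mud has zero causal effect on rain.

```python
import random

rng = random.Random(2)
P_RAIN, P_SPRINKLER = 0.3, 0.1   # assumed probabilities, for illustration only

def draw(do_mud=None):
    """One day from the toy model Rain -> Mud <- Sprinkler.

    Passing do_mud overrides the mechanism that normally sets Mud,
    which is what the do-operator represents.
    """
    rain = rng.random() < P_RAIN
    sprinkler = rng.random() < P_SPRINKLER
    mud = (rain or sprinkler) if do_mud is None else do_mud
    return rain, mud

N = 100_000

# "See": observe days as they come and keep the muddy ones.
observed = [draw() for _ in range(N)]
rain_on_muddy_days = [rain for rain, mud in observed if mud]
p_see = sum(rain_on_muddy_days) / len(rain_on_muddy_days)   # P(Rain | Mud)

# "Do": make every day muddy ourselves, regardless of the weather.
intervened = [draw(do_mud=True) for _ in range(N)]
p_do = sum(rain for rain, _ in intervened) / N               # P(Rain | do(Mud))

print(f"P(Rain | Mud)     ~ {p_see:.2f}")
print(f"P(Rain | do(Mud)) ~ {p_do:.2f}  (baseline P(Rain) = {P_RAIN})")
```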
26. The Ladder of Causation
• “If Aristotle was alive today, he would be breathing water”
• Starting with a false premise always gets you a true inference statement in classical logic
• The goal is to develop artificial general intelligence
• But we can also use the do-calculus to make more intelligent clinical decisions
Judea Pearl and Dana MacKenzie. “The Book of Why: The New Science of Cause and Effect”
27. Directed acyclic graphs (DAGs)
• Directed acyclic graphs (DAGs) are qualitative visual representations of the structural
causal model describing the functional relationships (ie, the structural equations)
between variables of interest.
• DAGs fully correspond to the do-calculus.
• “Directed”: all arcs have arrows
Judea Pearl et al. “Causal Inference in Statistics: A Primer”
[Figure: example non-directional arc (no arrowheads) between A and B]
28. Directed acyclic graphs (DAGs)
• Directed acyclic graphs (DAGs) are qualitative visual representations of the structural
causal model describing the functional relationships (ie, the structural equations)
between variables of interest.
• DAGs fully correspond to the do-calculus.
• “Acyclic”: no directed path in the graph forms a closed loop
Judea Pearl et al. “Causal Inference in Statistics: A Primer”
[Figure: graph with nodes A, B, C, and D. Cyclic graph: A causes A]
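Whether a proposed graph is acyclic can be checked mechanically. The sketch below is a standard depth-first search for a closed directed loop; the two example graphs, including the specific edges among A, B, C, and D, are hypothetical and are not taken from the slide figure.

```python
def has_cycle(graph):
    """Return True if the directed graph (dict: node -> list of children) has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for child in graph.get(node, []):
            state = color.get(child, WHITE)
            if state == GRAY:                   # arrow back into the current path
                return True
            if state == WHITE and visit(child):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)

# Hypothetical DAG: A -> B -> D and A -> C -> D.
dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

# Hypothetical cyclic graph: A -> B -> C -> D -> A, so A ends up causing A.
cyclic = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}

print("dag has a cycle:", has_cycle(dag))
print("cyclic has a cycle:", has_cycle(cyclic))
```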
29. Basic Types of Causal Relationships
• Direct causal relationship between exposure (A) and outcome (B).
• The causal relationship between exposure (A) and outcome (B) is mediated by M. M is a “mediator”.
• C acts as a collider blocking the causal relationship between exposure (A) and outcome (B).
• C acts as a confounder. There is no causal relationship between exposure (A) and outcome (B).
30. Basic Types of Causal Relationships
• Direct relationship: no need for adjustment.
• Mediator: do not adjust for mediators (usually). Adjustment for M blocks the causal effect we want to estimate. ↑ Bias (↓ accuracy)
• Collider: never adjust for colliders. Adjustment for C opens a false causal pathway between A and B (“collider bias”). ↑ Bias (↓ accuracy)
• Confounder: always adjust for confounders (if they are known and the sample size is sufficient). ↓ Bias (↑ accuracy)
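The rules for confounders and colliders can be checked in a toy simulation. In both scenarios below the exposure A has no effect on the outcome B, so any nonzero estimate is bias. The data-generating numbers are assumptions chosen for illustration, and "adjustment" is done by averaging within-stratum differences over a binary C.

```python
import random
import statistics

rng = random.Random(3)
N = 20_000

def mean_diff(rows):
    """Mean of B among A=1 minus mean of B among A=0; rows are (a, b, c)."""
    exposed = [b for a, b, _ in rows if a == 1]
    unexposed = [b for a, b, _ in rows if a == 0]
    return statistics.fmean(exposed) - statistics.fmean(unexposed)

def adjusted_diff(rows):
    """Adjust for binary C by averaging within-stratum differences."""
    total = 0.0
    for level in (0, 1):
        stratum = [row for row in rows if row[2] == level]
        total += mean_diff(stratum) * len(stratum) / len(rows)
    return total

# Confounder: C -> A and C -> B, with no effect of A on B.
confounded = []
for _ in range(N):
    c = int(rng.random() < 0.5)
    a = int(rng.random() < 0.2 + 0.6 * c)
    b = 2.0 * c + rng.gauss(0, 1)
    confounded.append((a, b, c))

# Collider: A -> C <- B, again with no effect of A on B.
collided = []
for _ in range(N):
    a = int(rng.random() < 0.5)
    b = rng.gauss(0, 1)
    c = int(a + b > 0.5)
    collided.append((a, b, c))

crude_conf, adj_conf = mean_diff(confounded), adjusted_diff(confounded)
crude_coll, adj_coll = mean_diff(collided), adjusted_diff(collided)

print(f"confounder: unadjusted {crude_conf:+.2f}, adjusted {adj_conf:+.2f}")
print(f"collider:   unadjusted {crude_coll:+.2f}, adjusted {adj_coll:+.2f}")
```

Adjusting for the confounder removes a spurious association; adjusting for the collider creates one where none existed.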
31. Question: what is the effect of primary tumor size on overall survival?
Scenarios
32. Question: what is the effect of primary tumor size on number of metastases?
Scenarios
35. To adjust or not to adjust?
We are performing a retrospective analysis to determine whether a new immunotherapy
(superlumab) improves overall survival compared with other immunotherapies. We have
also measured baseline serum IL-6 levels for these patients. The scientific consensus on
the relationship between immunotherapy type, serum IL-6 and overall survival is codified
below:
Which of the following statements is most correct?
1. Baseline IL-6 level is a mediator. It should not be adjusted for in the analysis.
2. Baseline IL-6 level is a collider. It should be adjusted for in the analysis.
3. Baseline IL-6 level is a confounder. It should be adjusted for in the analysis.
4. I have no idea.
[DAG figure with nodes: Immunotherapy, Baseline IL-6 level, Overall survival]
37. To adjust or not to adjust? (sequel)
We are performing a retrospective analysis to determine whether a new immunotherapy
(superlumab) improves overall survival compared with other immunotherapies. We have
also measured TGFβ genotypes for these patients. The scientific consensus on the
relationship between immunotherapy type, TGFβ genotype and overall survival is codified
below:
Which of the following statements is most correct?
1. TGFβ genotype is a mediator. It should not be adjusted for in the analysis.
2. TGFβ genotype is a collider. It should be adjusted for in the analysis.
3. TGFβ genotype is a confounder. It should be adjusted for in the analysis.
4. I have no idea.
[DAG figure with nodes: Immunotherapy, TGFβ genotype, Overall survival]
39. Types of causally “neutral” variables
Gender is a “neutral variable”
Adjustment for gender does not affect bias/accuracy.
Adjustment for gender increases precision (↓ outcome heterogeneity)
Gender is a prognostic factor!
40. Types of causally “neutral” variables
Cancer center is a “neutral variable”
Adjustment for cancer center does not affect bias/accuracy.
Adjustment for cancer center decreases precision (↓ exposure heterogeneity)
41. What randomization truly does
• Observational study: adjustment for IMDC reduces bias (↑ accuracy)
• Randomized study (do random treatment): adjustment for IMDC increases precision
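A toy simulation can contrast the two settings. In the sketch below a binary "poor risk" flag stands in for IMDC risk, and the treatment probabilities and effect sizes are assumptions. The only difference between the two runs is how treatment is assigned: by risk-dependent clinician choice, or by a coin flip that cuts the arrow from risk to treatment.

```python
import random
import statistics

TRUE_EFFECT = 1.0   # assumed benefit of treatment on a continuous outcome

def crude_estimate(rng, randomized, n=20_000):
    """Unadjusted treated-minus-control difference in mean outcome."""
    treated, control = [], []
    for _ in range(n):
        poor_risk = int(rng.random() < 0.5)        # stand-in for IMDC risk
        if randomized:
            p_treat = 0.5                          # coin flip: risk no longer drives treatment
        else:
            p_treat = 0.2 if poor_risk else 0.8    # assumed clinician preference
        treat = int(rng.random() < p_treat)
        y = TRUE_EFFECT * treat - 2.0 * poor_risk + rng.gauss(0, 1)
        (treated if treat else control).append(y)
    return statistics.fmean(treated) - statistics.fmean(control)

rng = random.Random(4)
observational = crude_estimate(rng, randomized=False)
randomized = crude_estimate(rng, randomized=True)

print(f"observational, unadjusted: {observational:.2f}")
print(f"randomized, unadjusted:    {randomized:.2f}  (truth {TRUE_EFFECT})")
```

In the observational run the unadjusted estimate is inflated because good-risk patients are treated more often; in the randomized run the same unadjusted contrast recovers the true effect, so any further adjustment only buys precision.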
42. How to make your own DAGs
http://www.dagitty.net/dags.html
43. How to make your own DAGs
https://causalfusion.net
44. Estimating the effect of RCC histology on overall survival
Typical multivariable regression model
Corresponding DAG
Shapiro & Msaouel et al. Clin Genitour Cancer, 2020
45. Estimating the effect of RCC histology on overall survival
Start with a plausible DAG
Corresponding regression
model
Shapiro & Msaouel et al. Clin Genitour Cancer, 2020
46. Estimating the effect of RCC histology on overall survival
Start with a plausible DAG
Table 2 Fallacy: this regression model should not be used to estimate the causal effect of biological sex on overall survival. RCC subtype is a mediator for the effect of biological sex on overall survival.
Shapiro & Msaouel et al. Clin Genitour Cancer, 2020
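The Table 2 fallacy can be illustrated with a toy simulation. The sketch assumes a simplified structure, sex -> subtype -> outcome plus a direct sex -> outcome arrow, with invented coefficients and a continuous outcome in place of survival. A model that conditions on subtype recovers only the direct effect of sex, not its total effect.

```python
import random
import statistics

rng = random.Random(5)
N = 40_000
DIRECT_SEX_EFFECT = 0.5   # assumed direct effect of sex on the outcome
SUBTYPE_EFFECT = 2.0      # assumed effect of subtype on the outcome

rows = []
for _ in range(N):
    sex = int(rng.random() < 0.5)
    subtype = int(rng.random() < (0.7 if sex else 0.3))   # sex -> subtype
    y = DIRECT_SEX_EFFECT * sex + SUBTYPE_EFFECT * subtype + rng.gauss(0, 1)
    rows.append((sex, subtype, y))

def diff_by_sex(subset):
    """Mean outcome for sex=1 minus mean outcome for sex=0."""
    ones = [y for sex, _, y in subset if sex == 1]
    zeros = [y for sex, _, y in subset if sex == 0]
    return statistics.fmean(ones) - statistics.fmean(zeros)

# Sex has no causes in this toy model, so the crude contrast is its total effect.
total_effect = diff_by_sex(rows)

# Conditioning on the mediator (as the histology model does) leaves the direct effect.
sex_given_subtype = 0.0
for level in (0, 1):
    stratum = [row for row in rows if row[1] == level]
    sex_given_subtype += diff_by_sex(stratum) * len(stratum) / N

expected_total = DIRECT_SEX_EFFECT + SUBTYPE_EFFECT * (0.7 - 0.3)
print(f"total effect of sex: {total_effect:.2f} (expected {expected_total:.2f})")
print(f"sex term after adjusting for subtype: {sex_given_subtype:.2f}")
```

The same fitted model can be the right one for the histology question and the wrong one for the sex question, which is why each coefficient in a multivariable table needs its own DAG-based justification.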
47. Summary
• Imbalances in prognostic factors between the arms of an RCT reduce precision but not accuracy
• Stratification and adjustment for prognostic factors increase the
precision of RCTs
• Always adjust, even if you stratify
• DAGs can be used to represent causal relations of interest
• Adjust for confounders but not for colliders or mediators
• The statistical analysis (regression model) will always depend on the
question at hand