Correcting for missing data, measurement error and confounding
Maarten van Smeden, PhD
University Medical Center Utrecht
Julius Center for Health Sciences and Primary Care
The Netherlands
Twitter: @MvanSmeden
Email: M.vanSmeden@umcutrecht.nl
30 November 2020
Methods meeting
Epidemiology methods group, UMC Utrecht
I have no conflicts of interest to declare
3. Twitter: @MaartenvSmedenUtrecht, November 30 2020
Rationale
• Confounding -> correlation is not causation
• Measurement error & missing data -> correlation is not always
correlation
• In causal epidemiologic research we often see all three
… but we rarely try to “correct” for all three
There is no shortage of methods
Confounding | Missing data | Measurement error
Multivariable adjustments | Multiple imputation | Regression calibration
Weighting | Weighting | Weighting
Matching | Full information maximum likelihood | Multiple imputation for ME
Instrumental variable analysis | Last observation carried forward | SIMEX
RANDOMIZATION (!) | Missing indicator methods | Full information maximum likelihood
“Bayesian approaches” | “Bayesian approaches” | “Bayesian approaches”
A non-exhaustive list of statistical correction strategies
Outline
• Confounding (1 slide)
• Missing data (2 slides)
• Measurement error (many slides)
• How to “solve” all three? (a couple more slides)
• What about prediction? (if I have time left)
Confounding
• A: treatment (Tx; 1 for treated, 0 for not treated)
• Y: outcome (1 for death, 0 for survival)
• Potential outcomes
Y^{a=1}: outcome under Tx; Y^{a=0}: outcome under no Tx
We usually observe only one of Y^{a=0} and Y^{a=1} for an individual
• Randomized trials: Y^a ⊥ A (unconditional exchangeability)
• Observational studies aim for: Y^a ⊥ A | L (conditional exchangeability)
L: confounding variables -> requires no unmeasured confounding
• (Additional causal assumptions: positivity, consistency, SUTVA, …)
More info: Hernán & Robins, Causal Inference: What If
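A minimal simulation sketch of conditional exchangeability, assuming a single binary confounder L and made-up probabilities (none of these numbers come from the talk). The crude risk difference mixes the treatment effect with the effect of L; standardizing over L should recover the true effect when Y^a ⊥ A | L holds.

```python
import numpy as np

rng = np.random.default_rng(2020)
n = 200_000

# Hypothetical data-generating values, chosen only for illustration.
L = rng.binomial(1, 0.4, n)                       # confounder
A = rng.binomial(1, np.where(L == 1, 0.7, 0.2))   # treatment depends on L
p0 = np.where(L == 1, 0.5, 0.2)                   # death risk under no Tx
Y0 = rng.binomial(1, p0)                          # potential outcome Y^{a=0}
Y1 = rng.binomial(1, p0 - 0.10)                   # potential outcome Y^{a=1}
Y = np.where(A == 1, Y1, Y0)                      # only one of the two is observed

true_rd = Y1.mean() - Y0.mean()                   # known only in a simulation
crude_rd = Y[A == 1].mean() - Y[A == 0].mean()    # confounded comparison

# Standardization: stratum-specific risk differences averaged over P(L).
std_rd = sum(
    (Y[(A == 1) & (L == l)].mean() - Y[(A == 0) & (L == l)].mean()) * np.mean(L == l)
    for l in (0, 1)
)
```

With these assumed values the crude comparison even points in the wrong direction (treatment looks harmful), while the standardized estimate lands near the true risk difference of about -0.10.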
Missing data
• Missing values are observations/records which were:
– never collected (either by design or not)
– lost accidentally
– wrongly collected and so deleted (measurement error?)
• Usually distinguish between three types of missing data
– MCAR: the probability that data are missing does NOT
depend on the values of observed or missing data
– MAR: the probability that data are missing depends on the
values of the observed data, but does NOT depend on the
values of the missing data
– MNAR: the probability that data are missing depends on the
values of the missing data
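The three mechanisms above can be sketched in a small simulation. This is an illustration under assumed values (the variables x and z, the logistic missingness models, and all coefficients are hypothetical): z is subject to missingness, x is always observed, and the complete-case mean of z is compared with its true mean of 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical variables: x is fully observed, z is subject to missingness.
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(scale=0.6, size=n)    # true mean of z is 0

def expit(v):
    return 1.0 / (1.0 + np.exp(-v))

# Missingness indicators under the three mechanisms (True = missing).
miss = {
    "MCAR": rng.random(n) < 0.3,                # constant probability
    "MAR": rng.random(n) < expit(-1 + 2 * x),   # depends on observed x only
    "MNAR": rng.random(n) < expit(-1 + 2 * z),  # depends on z itself
}

# Complete-case mean of z under each mechanism.
cc_mean = {k: z[~m].mean() for k, m in miss.items()}

# Under MAR, the regression of z on x among complete cases is still valid,
# so predicting z for everyone from x recovers the overall mean.
obs = ~miss["MAR"]
slope, intercept = np.polyfit(x[obs], z[obs], 1)
mar_corrected_mean = (intercept + slope * x).mean()
```

In this sketch the complete-case mean is fine under MCAR and biased under MAR and MNAR; only the MAR bias can be repaired using the observed data alone.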
Personal observations (I may be biased)
Causal inference epidemiology
• Confounding on center stage in analyses and discussion
• Missing data often cannot be ignored (especially for higher %):
performing multiple imputation becoming mainstream?
• Measurement error is the elephant in the room: usually relegated to the
discussion section (not the methods), with lots of misconceptions!
• (Note: not independent, e.g. measurement error can result in
problems with confounding)
Measurement error
“Errors in reading, calculating or recording a
numerical value. The difference between
observed values of a variable recorded
under similar conditions and some fixed true
value.”
The Cambridge Dictionary of Statistics (4th ed), ISBN: 9780521766999
Measurement error: a long list
• Blood pressure
• Dietary intake
• Smoking status
• Air pollution
• BMI
• Physical activity
• Vaccination status
• Social class
• Carotid intima media thickness
• Thyroid hormone levels
• Glucose levels
• Cholesterol levels
• Income
• Family history
• Mental health history
• Education level
• “Intelligence”
• Respiratory rates
• Medication use
• Sedentary hours
• Vitamin use
• Immigration status
• Age at first intercourse
• Age at menopause
• ICD coding
• Symptoms
• Date of symptom onset
• Medication use
• Visceral adipose tissue
• Angina class
• Heart rate
• Grip and pinch strength
• Cough frequency
• Infant height
• Gestational age
• Disease specific mortality
• ….
Measurement error mentioned
Journals of epidemiology
Jurek et al. 2006 [1]: 61% (N = 35)
Brakenhoff et al. 2018 [2]: 56% (N = 198)
Shaw et al. 2019 [3]: 80% (N = 65)
doi: [1] 10.1007/s10654-006-9083-0; [2] 10.1016/j.jclinepi.2018.02.02; [3] 10.1016/j.annepidem.2018.09.001
Measurement error mentioned
Journals of general medicine
Brakenhoff et al. 2018 [2]: 25% (N = 57)
doi: [2] 10.1016/j.jclinepi.2018.02.02
Measurement error “corrections” applied
Journals of epidemiology
Jurek et al. 2006 [1]: 2% (N = 1)
Brakenhoff et al. 2018 [2]: 4% (N = 13)
Shaw et al. 2019 [3]: 6% (N = 5)
doi: [1] 10.1007/s10654-006-9083-0; [2] 10.1016/j.jclinepi.2018.02.02; [3] 10.1016/j.annepidem.2018.09.001
Measurement error “corrections” applied
Journals of general medicine
Brakenhoff et al. 2018 [2]: 0% (N = 0)
doi: [2] 10.1016/j.jclinepi.2018.02.02
• Myth 1: measurement error can be compensated for by large
numbers of observations
• Myth 2: the exposure effect is underestimated when variables
are measured with error
• Myth 3: exposure measurement error is nondifferential if
measurements are taken without knowledge of the outcome
• Myth 4: measurement error can be prevented but not mitigated
in epidemiological data analyses
• Myth 5: certain types of epidemiological research are
unaffected by measurement error
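Myth 1 can be checked with a short simulation. This is a sketch under assumed values (one exposure, classical error, true effect 1.0, error variance equal to the exposure variance), not a result from the talk. In this simple single-covariate case the naive slope converges to the true effect times var(X) / (var(X) + var(error)); more observations only make the estimate more precisely wrong. The attenuation seen here is specific to this setting and is not the general direction of bias (see Myth 2).

```python
import numpy as np

rng = np.random.default_rng(42)
beta = 1.0        # hypothetical true exposure effect
sd_error = 1.0    # hypothetical classical error SD (true exposure has SD 1)

def naive_slope(n):
    x = rng.normal(size=n)                            # true exposure
    x_star = x + rng.normal(scale=sd_error, size=n)   # error-prone exposure
    y = beta * x + rng.normal(size=n)
    return np.polyfit(x_star, y, 1)[0]                # OLS slope of y on x_star

slopes = {n: naive_slope(n) for n in (1_000, 100_000, 1_000_000)}
expected = beta * 1.0 / (1.0 + sd_error**2)           # attenuation factor times beta
```

Under these assumptions `expected` is 0.5, and the estimated slopes settle around that value rather than around 1.0 as n grows.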
Types of measurement error
• Classical (random) measurement error: measurements fluctuate around their true value
• Systematic measurement error: measurements are consistently wrong in a particular direction
• Differential measurement error: measurements are consistently wrong in a particular direction, varying per group
Courtesy: Linda Nab
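The three error types can be generated in a few lines. This is a sketch with hypothetical numbers (a blood-pressure-like variable, a +5 shift for systematic error, a +8 shift in one group for differential error), intended only to make the definitions concrete.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

x = rng.normal(loc=120, scale=15, size=n)   # hypothetical true values
group = rng.binomial(1, 0.5, n)             # hypothetical grouping (e.g. outcome status)

# Classical: random noise around the true value.
x_classical = x + rng.normal(scale=10, size=n)
# Systematic: consistently off in one direction (here +5 for everyone).
x_systematic = x + 5 + rng.normal(scale=10, size=n)
# Differential: the shift depends on the group (here +8 in group 1 only).
x_differential = x + np.where(group == 1, 8, 0) + rng.normal(scale=10, size=n)

def mean_error_by_group(x_obs):
    err = x_obs - x
    return err[group == 0].mean(), err[group == 1].mean()

errors = {
    "classical": mean_error_by_group(x_classical),
    "systematic": mean_error_by_group(x_systematic),
    "differential": mean_error_by_group(x_differential),
}
```

The mean error is close to zero in both groups for classical error, shifted equally in both groups for systematic error, and shifted in only one group for differential error.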
Example: classical measurement error
Second Manifestations of ARTerial disease (SMART) cohort
doi: 10.1371/journal.pone.0192298
[Diagram: effect of interest, confounder with error, outcome]
[Figure: % bias in hazard ratio for SBP (multivariable Cox regression model)]
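Why classical error in a confounder can bias the effect of interest can be sketched with a linear-regression analogue. This is not the Cox model or the SMART data; all variables and coefficients below are hypothetical. Adjusting for the error-prone confounder removes only part of the confounding, so the exposure coefficient keeps some of the confounder's effect.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200_000

# Hypothetical linear analogue (not the Cox model or the SMART data).
L = rng.normal(size=n)                        # true confounder
A = 0.6 * L + rng.normal(scale=0.8, size=n)   # exposure, correlated with L
Y = 1.0 * A + 2.0 * L + rng.normal(size=n)    # true exposure effect is 1.0
L_star = L + rng.normal(scale=1.0, size=n)    # confounder with classical error

def exposure_coef(confounder):
    # OLS of Y on an intercept, the exposure and the supplied confounder.
    X = np.column_stack([np.ones(n), A, confounder])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef[1]

adjusted_true = exposure_coef(L)         # adjusts for the true confounder
adjusted_noisy = exposure_coef(L_star)   # adjusts for the error-prone confounder
```

Here the bias is away from the null: the exposure effect is overestimated, which is the opposite of what Myth 2 would predict.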
Triple whammy of measurement error
• Bias (usually the target for measurement error “corrections”)
• Increased imprecision
• Masked functional relations
Measurement error corrections: external validation set
(Diagram, labeled 𝑌∗)
• Study sample: standard measurements
• External validation set: standard measurements + validated measurements
Measurement error corrections: internal validation set
(Diagram, labeled 𝑌∗)
• Study sample: standard measurements
• Internal validation set (subset of the study sample): standard measurements + validated measurements
Simulation study
OLS regression
Y = a_0 + a_1 A + b_1 L_1 + … + b_p L_p + e, e ~ N(0, s)
a_1: effect of primary interest
A, L ~ multivariate normal with mean vector 0 and a correlation matrix
with equal pairwise correlations
Random measurement error: on A, generating a new A*
Missing data (MAR): on L_1
True values: a_0 = 0, a_1 = 10, and b_1 = b_2 = … = b_p based on total
confounding effect (crude minus adjusted)
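One way to generate a single data set of this kind is sketched below. Only a_0 = 0 and a_1 = 10 come from the slide; the number of confounders, the pairwise correlation, the common b value, the residual and error SDs, and the MAR model (missingness in L_1 depending on the observed Y) are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(30)

def simulate(n=100, p=3, rho=0.3, a1=10.0, b=2.0, sd_e=5.0, sd_me=1.0):
    """One data set; p, rho, b, sd_e, sd_me and the MAR model are assumed values."""
    # (A, L_1..L_p) ~ MVN(0, R) with equal pairwise correlations rho.
    R = np.full((p + 1, p + 1), rho)
    np.fill_diagonal(R, 1.0)
    AL = rng.multivariate_normal(np.zeros(p + 1), R, size=n)
    A, L = AL[:, 0], AL[:, 1:]
    Y = 0.0 + a1 * A + L @ np.full(p, b) + rng.normal(scale=sd_e, size=n)
    A_star = A + rng.normal(scale=sd_me, size=n)   # random measurement error on A
    # MAR: missingness in L_1 depends only on the fully observed Y.
    p_miss = 1.0 / (1.0 + np.exp(-(-1.0 + 0.05 * Y)))
    L_obs = L.copy()
    L_obs[rng.random(n) < p_miss, 0] = np.nan
    return Y, A, A_star, L, L_obs

Y, A, A_star, L, L_obs = simulate(n=5_000)
```

The function returns both the error-free and the error-prone versions (A and A*, L and L with missing values), so that any correction method can be compared against the complete-data analysis.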
Sequential models
• MIME: Multiple imputation for measurement error
Multiply impute both A (only observed in a subset) and the missing values
in L_1 using full conditional specification (Y, A, A*, L), followed by OLS
using A and L as covariates (pooled with Rubin’s rules)
• MIRC: Multiple imputation and regression calibration
1. Impute missing values in L_1
2. In the subset: OLS for A given A*, L
3. For the entire set: A_rc = E(A | A*, L)
4. For each multiply imputed set: OLS using A_rc and L as
covariates, and adjust standard errors (RC)
5. Combine using Rubin’s rules
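The regression calibration part of MIRC (steps 2 to 4) can be sketched as follows. This is a simplified illustration under stated assumptions: a single confounder L with no missing values (so the imputation and Rubin's rules steps are skipped), hypothetical sample sizes and coefficients, and point estimates only. The standard error adjustment mentioned in step 4 is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_val = 20_000, 5_000   # hypothetical sizes: full sample, internal validation subset

L = rng.normal(size=n)
A = 0.5 * L + rng.normal(scale=np.sqrt(0.75), size=n)
A_star = A + rng.normal(size=n)                  # error-prone version of A
Y = 10.0 * A + 3.0 * L + rng.normal(scale=5.0, size=n)
val = np.arange(n) < n_val                       # A is treated as observed only here

def ols(X, y):
    # OLS coefficients with an intercept in position 0.
    return np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)[0]

# Step 2: calibration model for A given A* and L, fitted in the validation subset.
g = ols(np.column_stack([A_star[val], L[val]]), A[val])
# Step 3: A_rc = E(A | A*, L) for the entire set.
A_rc = g[0] + g[1] * A_star + g[2] * L
# Step 4 (point estimate only): outcome model with A_rc and L as covariates.
a1_rc = ols(np.column_stack([A_rc, L]), Y)[1]
a1_naive = ols(np.column_stack([A_star, L]), Y)[1]
```

Under these assumptions the naive estimate is strongly attenuated while the calibrated estimate is close to the true value of 10; in practice the uncertainty from the calibration model also has to be carried into the standard errors.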
Simultaneous models
Conditional submodels
• Y | A, L (primary analysis model)
• A*| Y, A, L
• A | L
• L_1 | L_2, …, L_p
Estimated simultaneously
• MCMC: Bayes (uninformative priors)
• Full information maximum likelihood: FIML (structural equation
model)
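Read as a likelihood, the four conditional submodels above multiply into one joint density for the variables that can be unobserved or error-prone. A sketch, assuming L = (L_1, …, L_p) and separate parameter blocks θ for each submodel (this notation is an assumption, not taken from the slides):

```latex
f(Y, A^*, A, L_1 \mid L_2, \dots, L_p; \theta)
  = f(Y \mid A, L; \theta_Y)\,
    f(A^* \mid Y, A, L; \theta_{A^*})\,
    f(A \mid L; \theta_A)\,
    f(L_1 \mid L_2, \dots, L_p; \theta_L)
```

For a record in which A or L_1 is unobserved, the contribution to the likelihood is this product integrated over the unobserved value, which is what estimating the submodels simultaneously (by MCMC or FIML) takes care of.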
What does this mean?
• Simple setting (OLS, 1 covariate with missing data, 1 covariate
with measurement error, internal validation): “full adjustment”
approaches work really well, even with N as small as 100
• Differences between approaches are mainly in rMSE; nearly no bias
• The Bayesian approach seems most promising (for its
frequentist properties!): least bias, and easy to extend to other link
functions and to multivariate missing data and measurement error
Exceptions?
• Measurement error in prognostic factors in an RCT
– Same argument about missing data (e.g. see White and
Thompson, Stat Med 2005)
• Special case of measurement error in a confounder
– e.g. confounding by indication, where indication was based
on the confounder with error
• Prediction models BUT…..
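The first exception above (measurement error in a prognostic factor in an RCT) can be illustrated with a linear-model sketch. All values are hypothetical, and the argument is shown for OLS only: because treatment is randomized and therefore independent of the prognostic factor, adjusting for an error-prone version of that factor still gives an unbiased treatment effect, at the cost of some precision.

```python
import numpy as np

rng = np.random.default_rng(99)
n = 100_000

x = rng.normal(size=n)                        # prognostic factor (true value)
x_star = x + rng.normal(scale=1.0, size=n)    # measured with classical error
t = rng.binomial(1, 0.5, n)                   # randomized treatment, independent of x
y = 2.0 * t + 3.0 * x + rng.normal(size=n)    # hypothetical true treatment effect: 2.0

def treatment_coef(covariate):
    # OLS of y on an intercept, treatment and the supplied covariate.
    X = np.column_stack([np.ones(n), t, covariate])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

effect_true_cov = treatment_coef(x)
effect_noisy_cov = treatment_coef(x_star)
```

Both estimates land near the assumed effect of 2.0; the same reasoning does not carry over to observational settings where the error-prone variable is a confounder.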