WEON preconference Greenland

Modern methods for
(causal) modeling
in health and medical science:
Cautions and capabilities
Sander Greenland
Epidemiology and Statistics Departments
University of California, Los Angeles
(UCLA)
5 June 2013

New(‐ish) tools to aid causal inference
• To aid identification of bias sources and sets
of adjustment covariates: DAGs.
• For adjustment of measured confounders:
algorithmic treatment modeling (PS, IPTW,
or OTW) combined with outcome modeling
to achieve “double robustness”.
• To account for uncertainty about
unmeasured confounders and other
uncontrolled bias sources: bias analysis.

I. Tools are not packaged with skills
to use them well or safely

Background readings:
• Greenland S (2010). Overthrowing the tyranny
of null hypotheses hidden in causal diagrams.
Ch. 22 in: Dechter, R., Geffner, H., and Halpern,
J.Y. (eds.). Heuristics, Probabilities, and
Causality: A Tribute to Judea Pearl. London:
College Publications, 365‐382.
• Greenland S (2012). Causal inference as a
prediction problem: Assumptions, identification,
and evidence synthesis. Ch. 5 in Berzuini C,
Dawid AP, Bernardinelli L, eds. Causal Inference:
Statistical Perspectives and Applications. Wiley,
Chichester, 43‐58.

“Mathematics is one necessary tool [but] any
statistician who actually practices his art
must possess many additional resources…the
mathematical tail has been allowed to wag
the statistical dog for far too long… I think
that the built‐in mathematical bias of many
statistics departments and of much that we
are presently teaching is not innocuous; it is
in fact antiscientific.”
– George Box, Statistical Science 1990

Cautions (conclusions, 2011)
• Current formal “causal inference”
approaches are mostly about modeling
effects in single studies, and projection to
conditionally exchangeable populations.
• As technically sophisticated as current
causal‐inference methods may seem, they
are far too simple to encompass the
diversity of evidence that has to be
synthesized in most real health and medical
decision problems.

• From both a purely logical and practical
point of view, causal inference is about
predicting outcomes under different
interventions or actions (or sometimes,
with added ambiguity, what would have
happened under counterfactual actions).
• Any algorithm developed for prediction
should thus be applicable, e.g., both
classical fixed‐model inference and modern
machine‐learning algorithms. However…

• Current “causal inference” research seeks
algorithms that can automatically recognize
causal structure (effective interventions)
with human accuracy or better.
• As with models for visual and other perceptual
cognition, these algorithms can help within
causal inference but are not yet near the
best human cognition derived from
research synthesis.
• Hence the human element enters into real
causal inference – for better and for worse…

• Intuition is notoriously faulty, full of biases,
some innate, some value driven, and is
horrific at probability logic.
• Cognitive psychology and behavioral
economics provide books full of dramatic
examples – which can be used to recognize
biases (e.g., double counting, confirmation
bias, overconfidence, and wish bias):
“My colleagues, they study artificial
intelligence; me, I study natural stupidity.”
‐Amos Tversky

Unfortunately, application of formal
(statistical) methods is also full of human
frailties because it requires
• choice of fixed background assumptions
(meta‐models, theories) from an infinite set
of possibilities, and
• potentially distortive simplifications and
conventions to make them operational.
Values can and do influence these choices,
hence…

• Even in situations with clear risks, those
with statistical sophistication have not
outperformed those without (Susser, AJE
1977 gives classic health‐science examples).
• Similar lessons are seen in econometrics,
where pseudo‐Nobel laureates with
impressive mathematical skills have lost
fortunes for investors (e.g., the 1998 LTCM
fund disaster in which Merton and Scholes
lost nearly $5 billion; the 2009 Trinsum
bankruptcy; etc.): do(X=x), X= buy, sell

Subjective elements and values play a
decisive role in all statistical analyses
• There is an illusory sense of objectivity
induced when there is great overconfidence,
as when individuals feel infallible or there is
strong social agreement.
• Feelings of objectivity in turn feed back to
create overconfidence. This is well illustrated
historically by scientists, statisticians, and
often entire fields being certain of
hypotheses later refuted.

Classic statistician examples:
• Fisher against smoking causing lung cancer
• Jeffreys against continental drift
Classic clinician‐researcher examples:
• Feinstein & Horwitz against estrogen
therapy causing much endometrial cancer
• Indiscriminate promotion of trans‐fat
margarine and low‐fat diets in the 1970s and
1980s for weight loss and CHD prevention,
along with dismissal of the sugar relation.

Some facts of statistical life:
• Data alone do not convey information; they
are interpreted via models for their
generation.
• Models are sets of assumptions about the
data‐generation process (DGP).
• Models are analogous to language
grammars: No model, no meaning.
• Unfortunately, unlike bad grammar, bad stat
modeling may not produce gibberish, even
if the models and outputs are very wrong.

The classical tensions
• Bias vs. precision: Assumptions introduce
bias to the extent they are incorrect, but
increase precision to the extent that they
exclude or down‐weight most alternatives.
• Procedures are “optimal” only under
meta‐assumptions, some untestable.
• Models for causal inference always include
untestable terminal randomization (no
residual confounding, “ignorability”).

II. Useful new tools
from not so new ideas

Neyman’s (1923) potential‐outcome
(“counterfactual”) causal meta‐model:
• Say X and Y are the treatment and outcome
variables of interest. Then Y is replaced by
a list (vector) of the outcomes that would
follow under different treatments. So if X =
1 or 0, Y is replaced by the
potential‐outcome vector (Y1,Y0) where
Y1 = outcome if X is 1, Y0 = outcome if X is 0
• Yx can be replaced by a parameter θx , e.g.,
the outcome probability (risk)

Causal inference under the potential‐outcome
model becomes a prediction problem:
• Causal‐inference (CI) problems are
isomorphic to missing‐data problems:
At most only one potential outcome is
observed; the rest are missing (Rubin, Ann
Stat 1978).
• Thus the vast predictive (imputational)
machinery of statistics can be used for
inference about causal parameters.

Further insights from potential outcomes:
• The value of X tells us which potential
outcome we can observe; for binary X,
Yobs = XY1 + (1‐X)Y0 ,
called the causal “consistency” property (a
corollary of Pearl’s axiomatization of
potential outcomes, but called an
assumption by some epidemiologists).
• Thus, for any set of covariates Z, we have
p(yobs|x,z) = p(yx|x,z)
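
The consistency property and the missing-data view can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the risks 0.2 and 0.5, the 50% randomization probability, and the function name simulate_potential_outcomes are hypothetical choices, not part of the slides.

```python
import random


def simulate_potential_outcomes(n, seed=0):
    """Return rows (x, y1, y0, y_obs) for n units with binary X and Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y0 = 1 if rng.random() < 0.2 else 0  # outcome if X is 0 (assumed risk 0.2)
        y1 = 1 if rng.random() < 0.5 else 0  # outcome if X is 1 (assumed risk 0.5)
        x = 1 if rng.random() < 0.5 else 0   # treatment assigned at random (assumed)
        y_obs = x * y1 + (1 - x) * y0        # consistency: Yobs = XY1 + (1-X)Y0
        rows.append((x, y1, y0, y_obs))
    return rows


rows = simulate_potential_outcomes(10_000)

# What an analyst actually sees: the potential outcome not selected by X is missing.
observed = [
    (x, y1 if x == 1 else None, y0 if x == 0 else None, y_obs)
    for x, y1, y0, y_obs in rows
]

treated = [y for x, _, _, y in rows if x == 1]
untreated = [y for x, _, _, y in rows if x == 0]
risk_difference = sum(treated) / len(treated) - sum(untreated) / len(untreated)
print(round(risk_difference, 3))
```

Because treatment is randomized in this toy setup, the observed risk difference should land near the difference of the two assumed risks; with nonrandom treatment the missing potential outcomes would have to be predicted from covariates.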

From consistency, we get a precise definition
of sufficiency for confounding control:
A set of covariates Z is sufficient for control
of confounding if the outcomes we observe
when X=x follow the distribution of Yx given Z:
p(yobs|x,z) ≡ p(yx|x,z) = p(yx|z)
which is independence of X and Yx given Z:
For all x and z, X ╨ Yx | z,
(“no residual confounding”, “no unmeasured
confounding”, “weak ignorability”); Z is also
minimal sufficient if no subset of Z is sufficient.

Further insights from potential outcomes:
• By the 1960s, methodologists were
developing methods for summarizing
confounder sets using discriminant or
regression scores. The performance of the
various proposals was not clear, however.
• Rosenbaum & Rubin (1983) showed that,
given a sufficient set Z, the conditional
treatment distribution p(x|z) is itself
sufficient to control confounding of marginal
(total‐population) X effects by covariates in Z.

• For binary X, p(1|z) is usually called the
“propensity score” (PS); control of this score
will remove confounding when Z is sufficient.
• For other X, Robins, Mark & Newey (1992)
showed that, when Z is sufficient, control of
the regression score E(X|z) is sufficient for
control of confounding of additive effects of
X on Y. (note: PS = E(X|z) when X is binary)
Nonetheless, the missing‐data viewpoint leads
to other, more general ways to adjust for
confounding using treatment probabilities.

• Inverse probability of treatment weighting
(IPTW) was adapted from survey weighting
ideas (Robins, Hernán, Brumback 2000).
• It can also be derived from classical direct
standardization (Sato and Matsuyama 2003):
p(y|x) = ∑z p(y|x,z)p(z) = ∑z p(y,x,z)p(z)/p(x,z)
= ∑z p(y,x,z)/p(x|z) = ∑z wz p(y,x,z),
where wz=1/p(x|z).
• Thus, if Z is sufficient, then IPTW removes
marginal confounding by averaging using the
same weights for all x (standardization).
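
The algebra linking standardization to inverse-probability weighting can be checked numerically. The sketch below uses a made-up distribution for binary Z, X and Y (all probabilities are hypothetical) and compares the standardized risk ∑z p(y|x,z)p(z) with the weighted sum ∑z wz p(y,x,z), where wz = 1/p(x|z).

```python
# Hypothetical distribution for binary Z, X, Y (illustration only).
p_z = {0: 0.6, 1: 0.4}                      # p(z)
p_x1_given_z = {0: 0.2, 1: 0.7}             # p(X=1 | z)
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.3,  # p(Y=1 | x, z), keyed by (x, z)
                 (1, 0): 0.2, (1, 1): 0.5}


def p_x_given_z(x, z):
    """Treatment probability p(x | z)."""
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]


def p_joint(x, z):
    """Joint probability p(Y=1, x, z)."""
    return p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z]


def standardized_risk(x):
    """Direct standardization: sum over z of p(Y=1 | x, z) p(z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)


def iptw_risk(x):
    """IPTW form: sum over z of w_z p(Y=1, x, z) with w_z = 1 / p(x | z)."""
    return sum(p_joint(x, z) / p_x_given_z(x, z) for z in p_z)


for x in (0, 1):
    print(x, standardized_risk(x), iptw_risk(x))
```

The two columns agree for each x, which is the identity in the derivation; whether the common value has a causal interpretation still depends on Z being sufficient.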

Despite PO/PS/IPTW theory providing
landmark insights, it is far from complete
for most health/med analyses:
• It does not say how to model treatment,
but mismodeling can render the estimated
PS insufficient and bias the effect estimate;
• It does not address sampling variation or
how to balance bias vs. variance, e.g., in an
RCT, the randomization indicator predicts
treatment perfectly so controlling it yields
infinite variance yet adjusts for no bias;

• It focuses on marginal (population‐averaged)
effects (ACE, LATE, CACE). It does
not guide accurate estimation of effect
heterogeneity (modification) or conditional
effects (e.g., effects in men vs. women),
which are essential for clinical practice;
• It defines but does not operationalize how
to find a sufficient or minimal sufficient Z.
These deficiencies are largely traceable to
omitting the outcome from modeling
(which Rubin AAS 2008 strongly advises).

A simple solution: Treatment modeling
followed by outcome modeling
Classical modeling for causal inference
regresses the outcome Y on X and Z, for if Z
is sufficient, E(Yobs|x,z) ≡ E(Yx|x,z) = E(Yx|z).
• The model for potential means E(Yx|z) is
called a structural model or structural
equation.
• This approach estimates conditional effects
as well as marginal effects (by averaging
over Z). As with PS, however, it will be
biased by mismodeling.

By combining treatment modeling with
outcome modeling, we can create estimates
that are at least approximately doubly
robust (DR): If Z is sufficient, the estimated
effect of X on Y will be unconfounded if
either of the models is correct.
The simplest DR approaches either
• regress Y on X, Z, and PS as a covariate,
• regress Y on X, Z in a PS‐matched sample, or
• regress Y on X, Z using IPT or OT weights.
Each of these approaches has pros and cons.
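
The third option (regress Y on X, Z using IPT weights) can be sketched with the standard library alone. Everything specific here is an assumption made for illustration: the simulated data (a true X effect of 2), the single binary Z, the stratum-proportion PS estimate, and the helper names solve and weighted_ls. Both models are correct in this toy setup, so it shows the mechanics only, not robustness to mismodeling.

```python
import random


def solve(a, b):
    """Solve the small linear system a @ coef = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def weighted_ls(design, y, w):
    """Weighted least squares via the normal equations (X'WX) coef = X'Wy."""
    k = len(design[0])
    xtwx = [[sum(wi * r[i] * r[j] for r, wi in zip(design, w)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(wi * r[i] * yi for r, yi, wi in zip(design, y, w)) for i in range(k)]
    return solve(xtwx, xtwy)


# Simulated data (assumed): Z confounds X and Y; the true effect of X on Y is 2.
rng = random.Random(1)
n = 20_000
z = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
x = [1 if rng.random() < 0.2 + 0.5 * zi else 0 for zi in z]
y = [1 + 2 * xi + 3 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

# Treatment model: PS estimated as the proportion treated within each Z stratum.
ps_by_z = {v: sum(xi for xi, zi in zip(x, z) if zi == v) / z.count(v) for v in (0, 1)}
ps = [ps_by_z[zi] for zi in z]
weights = [1 / p if xi == 1 else 1 / (1 - p) for xi, p in zip(x, ps)]

# Outcome model: Y regressed on intercept, X and Z with IPT weights.
design = [(1.0, xi, zi) for xi, zi in zip(x, z)]
effect = weighted_ls(design, y, weights)[1]

# Unadjusted contrast for comparison.
mean_treated = sum(yi for yi, xi in zip(y, x) if xi == 1) / sum(x)
mean_untreated = sum(yi for yi, xi in zip(y, x) if xi == 0) / (n - sum(x))
crude = mean_treated - mean_untreated
print(round(effect, 2), round(crude, 2))
```

In this setup the weighted regression coefficient should sit close to the simulated effect while the crude contrast is pulled away from it by Z.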

Treating PS as a covariate:
• The relation of the PS to risk can be highly
nonlinear and can be discontinuous when
covariates are discrete. Thus entering the PS
as a few terms may not retain sufficiency.
Highly flexible formulations may be needed
(e.g., many category indicators for the PS, or
machine‐learning procedures).
• The PS is a composite of Z; it thus can be
highly collinear with Z terms in the outcome
model, leading to imprecision.

Outcome regression after PS matching:
• Almost all PS matching is to the treated
(X=1). This alters the distribution of effect
modifiers to that seen in the exposed, which
in turn changes the target parameter to the
effect in the treated rather than in the total
(Kurth et al., AJE 2006). This may be a good
change if the exposed are the target. But,
• Typical PS matching tends to discard many
subjects, harming efficiency.

Weighted outcome regression:
• Ordinary fitting methods for estimating
treatment probabilities tend to produce very
small values for some subjects, resulting in
huge, highly unstable weights. There are
several approaches to weight stabilization:
1. Restore the X margin: Robins and crew
use wz = p(x)/p(x|z), but this weight may still
be too unstable, leading to crude fixes like
weight trimming to obtain sensible results.

2. Ridgeway & McCaffrey (2004, 2007)
weight by the odds of X=1 vs. X=Xobs:
wz=1 if X=1, wz= p(1|z)/p(0|z) if X=0.
• This odds‐of‐treatment weighting (OTW)
standardizes to the treated (X=1), as in PS
matching to the exposed.
• They fit these odds with a machine‐learning
algorithm (boosted lasso).
Their approach eliminates stability problems.
Similar results have been reported using
related algorithms to fit probabilities for IPTW.
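
The three weight definitions on these slides can be written side by side. The function names and the example values below are hypothetical; the formulas are the ones stated above (1/p(x|z), p(x)/p(x|z), and the odds p(1|z)/p(0|z) for the untreated).

```python
def iptw_weight(x, ps):
    """Unstabilized IPT weight 1/p(x|z), where ps = p(X=1|z)."""
    return 1.0 / ps if x == 1 else 1.0 / (1.0 - ps)


def stabilized_weight(x, ps, p_treated):
    """Weight p(x)/p(x|z), restoring the X margin; p_treated = p(X=1)."""
    return p_treated / ps if x == 1 else (1.0 - p_treated) / (1.0 - ps)


def otw_weight(x, ps):
    """Odds-of-treatment weight: 1 for the treated, p(1|z)/p(0|z) for the untreated."""
    return 1.0 if x == 1 else ps / (1.0 - ps)


# A treated subject with a very small fitted treatment probability (assumed values).
x_example, ps_example, p_treated_example = 1, 0.01, 0.3
for name, value in [
    ("IPTW", iptw_weight(x_example, ps_example)),
    ("stabilized", stabilized_weight(x_example, ps_example, p_treated_example)),
    ("OTW", otw_weight(x_example, ps_example)),
]:
    print(name, round(value, 2))
```

For this treated subject the unstabilized weight is very large and the stabilized one is still large, while the OTW weight is fixed at 1. Note that OTW weights for untreated subjects still grow as the fitted probability approaches 1, and that the target shifts to the effect in the treated.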

Now, what does and doesn’t belong in Z?
• The answers were known intuitively to
some and demonstrated using potential
outcomes well before the ascendance of
causal diagrams, but the explanations
were opaque to many (e.g., see Robins &
Greenland, 1992).
• The development of formal causal graphs
in the 1980s opened the way to fast
screening algorithms for Z candidates.

Graphical models predate causal models
• Graph theory began in the 1700s and was
used for circuit analysis in the 19th century.
Applications in probability and computer
science date back at least to the 1960s.
• Causal path diagrams appeared circa 1920.
• By the 1980s, AI research merged directed
acyclic graph (DAG) models for probabilities
(Bayes nets) with path diagrams, to produce
causal DAGs (causal Bayesian networks).

Example DAG
[Figure: DAG with nodes A, B, C, F, E, D; arrows not recoverable from the extracted text]

Directed acyclic graphs and causal diagrams
• A DAG shows the factors in the problem as
nodes linked by arrows only, with no
feedback loops.
• A graph is a causal diagram if the arrows
are interpreted as links in causal chains
(formalization is a bit controversial; R&R).
• Causal effects of one variable on another
are transmitted by causal sequences, which
are directed (head‐tail) paths: X→Y→Z
means X can affect Z

Assumptions inherent in causal diagrams
Assumptions of a causal diagram are of two
forms:
1) Arrow direction: resolvable by time order
2) Arrow absence: No directed path from X
to Y corresponds to a null hypothesis that,
upon stratifying on all direct causes
(“parents”) of X, X and Y would be
independent (“Causal Markov Condition”)
Thus: Most DAGs are full of null hypotheses!

Colliders vs. noncolliders on a path
• Paths are closed (blocked) at colliders:
Associations cannot be transmitted across
a collider (→C←) on a path unless we
stratify (condition) on it or something it
affects (such as F in C→F).
• Paths are open (unblocked) at noncolliders:
Associations can be transmitted across a
noncollider (a mediator →C→ or a fork
←C→) on a path unless we stratify on it
completely.

Think of associations as signals flowing
through the graph
• A variable can transmit associations along
some open (unblocked) directions but not
along closed (blocked) directions.
• The open and closed directions are
switched around by conditioning
(stratifying) on the variable, and are
partially switched by partially or indirectly
conditioning.

Example DAG: A diagram with an
embedded M path from E to D, E‐A‐C‐B‐D
[Figure: DAG with nodes A, B, C, F, E, D]

“Control” of bias in causal modeling
• Target path: A path that transmits some of
the effect we want to estimate; it is a
directed path from cause to effect.
• Biasing path: Any other open path between
the cause and effect variables.
• By judicious conditioning, we must close all
biasing paths without closing target paths
or opening new biasing paths. (This isn’t
always possible with available data.)

Graphical sufficiency
• If conditioning on Z closes all biasing paths
while leaving all target paths open, Z is
sufficient for control of bias.
• If Z is sufficient (for control of bias) but no
subset is sufficient, Z is minimal sufficient.
Like almost all graphical concepts and results,
these are qualitative (topological); they do
not address extent of bias. But they can aid
initial covariate screening and more.

Example: inadequacy of statistical criteria
Among traditional statistical criteria for
defining or detecting confounders are:
• C is associated with E and with D given E
• Adjustment for C changes the E‐D
association (noncollapsibility).
These are equivalent in linear systems.
(Often added: C must precede E and D.)
Graphs illustrate how both criteria can fail,
leading to adjustment that increases bias.

Pure M‐bias: C assoc with E and D|E,
yet no bias unless you adjust for C or F
[Figure: DAG with nodes (A), (B), C, F, E, D]
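
The M-bias claim can be checked mechanically with a small d-separation routine. This is a sketch under assumptions: the arrows A→E, A→C, B→C, B→D, C→F are inferred from the path E‐A‐C‐B‐D and the C→F example mentioned earlier (the figure's arrows did not survive extraction), no E→D arrow is included so that any E‐D association is pure bias, and the function names are hypothetical. The test used is the standard one based on the moralized ancestral graph.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(edges, a, b, given=()):
    """True if a and b are d-separated given the conditioning set in the DAG."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    keep = ancestors(parents, {a, b, *given})
    # Moralize the ancestral subgraph: link each child to its parents,
    # link parents that share a child, and ignore arrow direction.
    nbrs = {v: set() for v in keep}
    for child in keep:
        ps = parents.get(child, set())
        for p in ps:
            nbrs[p].add(child)
            nbrs[child].add(p)
        for p, q in combinations(ps, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # a and b are d-separated iff no path joins them once the conditioned nodes are removed.
    blocked = set(given)
    seen, stack = {a}, [a]
    while stack:
        node = stack.pop()
        if node == b:
            return False
        for nb in nbrs[node]:
            if nb not in seen and nb not in blocked:
                seen.add(nb)
                stack.append(nb)
    return True


# Assumed arrows of the M diagram (no E -> D arrow, i.e. a null effect).
m_graph = [("A", "E"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "F")]

for adjustment in [set(), {"C"}, {"F"}, {"C", "A"}, {"C", "B"}]:
    print(sorted(adjustment), d_separated(m_graph, "E", "D", adjustment))
```

With these assumed arrows, E and D are separated with no adjustment, become connected after adjusting for C or for F, and are separated again once A or B is added to C. C also passes the traditional criteria (it is connected to E, and to D given E), which is why those criteria mislead here.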

Instrumental variables in a linear system:
A and F assoc with E and D|E, yet worse bias if you
adjust conventionally for A or F
[Figure: DAG with nodes A, (B), F, E, D]
A may be intent‐to‐treat

Estimation of direct effects by adjustment for
intermediates (Judd‐Kenny 1981, Robins‐Greenland
1992 by POs, Hernan‐Cole 2002 by cDAG)
[Figure: DAG with nodes E, (B), [C], D]
E associated with D|C yet no direct effect

What do graphs say about complex cases?
• Traditional statistical criteria need
refinement: When we add adjustment
variables, we have to weigh potential bias
eliminated against potential bias added.
• Complex graphs inherit all biases in their
simple subgraphs (like M‐bias), so simple
graphs are great warning devices, but…
• Due to their qualitative nature, graphs give
us only clues about the balance of bias, and
say nothing about bias‐variance tradeoffs.

Confounding paths from E to D:
EACD, ECBD, ECD
[Figure: DAG with nodes A, B, C, F, E, D]

Confounding paths from E to D after
conditioning on C: EACBD
[Figure: the same DAG with C conditioned on, shown as [C]]

Confounding paths from E to D: None!
[Figure: the same DAG with B and C conditioned on, shown as [B] and [C]]

What if essential variables are not
measured? (no sufficient Z available)
We then have to turn to sensitivity analysis of
bias (bias analysis; see Ch. 19 of ME3) to
get an idea of how much bias is left after
adjustment for measured covariates, and
how much uncertainty is appropriate.
• Ordinary statistics ignore uncertainty about
unmeasured or mismeasured variables, and
so are grossly overconfident (intervals much
too narrow, P‐values much too small).

All the usual validity problems
can be viewed as bias due to missing data
• Confounding: nonrandomly missing
potential outcomes
• Selection bias: nonrandomly missing
subjects
• Measurement error: missing actual
variables of interest, so we use proxies in
their place (which may produce bias even if
the error is random)

This view enables use of imputation methods
for bias analysis (Greenland, 2009):
Completed data = observed + imputed data
• To make any inference beyond what we see
(the observed), we must have a model that
projects from the observed data to the
missing data (or to aspects of the data, like
means) to get the completed data.
• In bias analysis, however, key parameters
are not identified by the observations.
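
Because the bias parameters are not identified by the data, they have to be supplied as inputs, for example as prior distributions. The sketch below is one simple way to do that and is not taken from the slides: it swaps in the classical external-adjustment formula for a single binary unmeasured confounder U, with bias factor (p1(RR−1)+1)/(p0(RR−1)+1), where p1 and p0 are the prevalences of U among exposed and unexposed and RR is the U‐outcome risk ratio. The observed risk ratio of 2.0, its standard error, every prior, and the function names are hypothetical.

```python
import math
import random


def bias_factor(p1, p0, rr_ud):
    """Confounding bias factor for a binary unmeasured confounder U."""
    return (p1 * (rr_ud - 1) + 1) / (p0 * (rr_ud - 1) + 1)


def simulate_adjusted_rr(rr_obs, se_log_rr, n_draws=20_000, seed=0):
    """Monte Carlo draws of the bias-adjusted risk ratio under assumed priors."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        p1 = rng.uniform(0.4, 0.7)                        # prior: U prevalence in exposed
        p0 = rng.uniform(0.2, 0.5)                        # prior: U prevalence in unexposed
        rr_ud = math.exp(rng.gauss(math.log(2.0), 0.3))   # prior: U-outcome risk ratio
        log_rr = rng.gauss(math.log(rr_obs), se_log_rr)   # conventional random error
        draws.append(math.exp(log_rr) / bias_factor(p1, p0, rr_ud))
    draws.sort()
    return draws


def percentile(sorted_vals, q):
    """Simple percentile lookup on an already sorted list."""
    return sorted_vals[int(q * (len(sorted_vals) - 1))]


rr_obs, se_log_rr = 2.0, 0.15  # hypothetical conventional estimate and standard error
draws = simulate_adjusted_rr(rr_obs, se_log_rr)
lower, median, upper = (percentile(draws, q) for q in (0.025, 0.5, 0.975))

conventional = (rr_obs * math.exp(-1.96 * se_log_rr), rr_obs * math.exp(1.96 * se_log_rr))
print("conventional 95% limits:", tuple(round(v, 2) for v in conventional))
print("bias-adjusted limits:   ", (round(lower, 2), round(upper, 2)), "median", round(median, 2))
```

Under these assumed priors the simulation interval is wider than the conventional one and its center moves toward the null; different priors would give different answers, since the result depends on inputs the data cannot check.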

As a result, bias analysis can have far more
impact on results than other methods. Yet it
has seen the least adoption. Possible reasons:
• It requires far more investigator effort to
specify the model and inputs (one group is
trying to formulate guidelines to ease this),
• Once specified, it is nowhere near as easy to
run with commercial software as other
methods,
• It can completely ruin any hint of
decisiveness or “significance” of results.

III. Conclusion:
Some modern tools you should know
• For identification of bias sources and
sufficient adjustment sets: DAGs.
• For adjustment of measured confounders:
algorithmic treatment modeling (PS, IPTW,
or OTW) combined with outcome modeling
to achieve double robustness.
• To account for uncertainty about
uncontrolled bias: bias analysis.
