Estimating Causal Effects from Observations
Advanced Data Analysis from an Elementary Point of View
Credits Team
The slides below are derived from Chapter 27 of the book “Advanced Data Analysis from an Elementary Point of View” by Cosma Shalizi of Carnegie Mellon University, written to support the “Advanced Data Analysis” course at CMU.
Antigoni-Maria Founta,
UIM: 647
Ioannis Athanasiadis,
UIM: 607
Overview
Identification & Estimation
➔ Back-door Criteria & Estimators
◆ Estimating Average Causal Effects
◆ Avoiding Estimating Marginal Distributions
◆ Matching
◆ Propensity Scores
◆ Propensity Score Matching
➔ Instrumental Variables Estimates
Conclusion
➔ Uncertainty & Inference
➔ Comparison of Methods
Identification of Causal Effects
● Causal Conditioning: Pr(Y|do(X = x))
//Counter-factual, not always identifiable even if X & Y are observable
● Probabilistic Conditioning: Pr(Y|X = x)
//Factual, always identifiable if X & Y are observable
Our goal is to identify when quantities like the former are
functions of the distribution of observable variables!
And once identified, the next question is how we can actually estimate it...
Identification/Estimation of Causal Effects
Each of the following techniques provides both identification of causal effects and formulas to estimate them:
1. Back-door Criterion
2. Front-door Criterion
3. Instrumental Variables
* Since estimating with the front-door criterion amounts to doing two rounds of
back-door adjustment, we will continue this lecture working only with the back-door
criterion.
Back-door Criteria
Estimations
Back-door criterion
CRITERIA:
I. S blocks every back-door path between X and Y
II. No node in S is a descendant of X
S is any subset of variables that meets the back-door criterion.
E.g. (for the DAG on the slide):
● S={S1,S2}, S={S3} and S={S1,S2,S3} meet the criteria
● S={S1}, S={S2} and S={S3,B} fail the criteria
Back-door Criteria Formula
When S satisfies the back-door criterion, the formula to estimate the Causal Conditioning becomes:
Pr(Y | do(X = x)) = Σₛ Pr(Y | X = x, S = s) Pr(S = s)
...so we have transformed the unidentifiable Causal Conditioning into a distribution of observables.
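A minimal numerical sketch of this adjustment on synthetic data, where every variable is binary and S is the only confounder (the DAG and all numbers here are illustrative, not taken from the slides):

```python
import random

random.seed(0)

# Toy DAG: S -> X, S -> Y, X -> Y; S satisfies the back-door criterion.
data = []
for _ in range(100_000):
    s = random.random() < 0.5
    x = random.random() < (0.8 if s else 0.2)        # S confounds X
    y = random.random() < 0.2 + 0.5 * x + 0.2 * s    # true effect of X: +0.5
    data.append((s, x, y))

def pr(pred, rows):
    """Empirical probability that pred holds over the given rows."""
    rows = list(rows)
    return sum(pred(r) for r in rows) / len(rows)

def backdoor(x_val):
    """Pr(Y=1 | do(X=x)) = sum_s Pr(Y=1 | X=x, S=s) * Pr(S=s)."""
    total = 0.0
    for s_val in (False, True):
        p_s = pr(lambda r: r[0] == s_val, data)
        stratum = [r for r in data if r[0] == s_val and r[1] == x_val]
        total += p_s * pr(lambda r: r[2], stratum)
    return total

ate = backdoor(True) - backdoor(False)   # close to the true +0.5
```

On the same data, the naive contrast Pr(Y=1 | X=1) − Pr(Y=1 | X=0) comes out near 0.62, illustrating the confounding bias that the adjustment removes.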
Estimations in the Back-Door Criteria
There are some special cases and tricks which are worth knowing about,
that will be analysed below.
1. Estimating Average Causal Effects
2. Avoiding Estimating Marginal Distributions
3. Matching
4. Propensity Scores
5. Propensity Score Matching
1. Estimating average causal effects I
The average causal effect summarizes much of what there is to know about the effect X has on Y. We don’t learn everything from it, but it is still useful!
● Average Effect (or sometimes just the effect of do(X=x)): E[Y | do(X = x)] *
* When it makes sense for Y to have an expectation value
1. Estimating average causal effects II
● If we apply the Back-Door Criterion Formula with control variables S:
E[Y | do(X = x)] = Σₛ Pr(S = s) µ(x, s), where µ(x, s) = E[Y | X = x, S = s]
(the inner conditional expectation is just a regression function)
We can estimate µ(x, s) by regression, but we still need to know Pr(S=s)...
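A sketch of this two-step recipe on synthetic data with a binary confounder, where a linear regression for µ happens to be well specified (all coefficients and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Binary confounder S opens a back-door path S -> X, S -> Y.
s = (rng.random(n) < 0.5).astype(float)
x = (rng.random(n) < 0.2 + 0.6 * s).astype(float)
y = 1.0 + 0.5 * x + 0.8 * s + rng.normal(0.0, 1.0, n)   # true effect: 0.5

# Step 1: estimate the regression function mu(x, s) = E[Y | X=x, S=s].
design = np.column_stack([np.ones(n), x, s])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

def mu(x_val, s_val):
    return beta @ np.array([1.0, x_val, s_val])

# Step 2: average mu(x, s) over the marginal distribution Pr(S = s).
def effect(x_val):
    return sum(mu(x_val, s_val) * np.mean(s == s_val)
               for s_val in (0.0, 1.0))

ace = effect(1.0) - effect(0.0)   # recovers roughly the true 0.5
```

Step 2 is exactly the slide's formula Σₛ Pr(S=s) µ(x, s), with Pr(S=s) replaced by sample frequencies.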
2. Avoiding Estimating Marginal Distributions
● The demands of computing, enumerating and summarizing the various distributions in the Back-Door Criteria Formula are very high, so we need to reduce these burdens. One shortcut is to smartly enumerate the possible values of S.
● How? Law of Large Numbers!
With a large sample:
E[Y | do(X = x)] ≈ (1/n) Σᵢ µ(x, sᵢ)
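This shortcut matters most when S is continuous, where Pr(S = s) would require a density estimate; averaging µ(x, sᵢ) over the observed sample sidesteps that entirely. A hedged sketch with illustrative synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Continuous confounder: estimating Pr(S = s) would need a density
# estimate, but averaging mu(x, s_i) over the sample does not.
s = rng.normal(0.0, 1.0, n)
x = (rng.random(n) < 1 / (1 + np.exp(-s))).astype(float)
y = 0.5 * x + 1.0 * s + rng.normal(0.0, 1.0, n)         # true effect: 0.5

design = np.column_stack([np.ones(n), x, s])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

def mu(x_val, s_arr):
    return beta[0] + beta[1] * x_val + beta[2] * s_arr

# E[Y | do(X=x)] ~ (1/n) sum_i mu(x, s_i): the sample stands in for Pr(S).
def effect(x_val):
    return np.mean(mu(x_val, s))

ace = effect(1.0) - effect(0.0)
```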
3. Matching (Binary Levels)
Combination of the above methods:
● From the Estimation of Average Causal Effects we got:
E[Y | do(X = x)] = Σₛ Pr(S = s) µ(x, s)
● From the Law of Large Numbers we got:
E[Y | do(X = x)] ≈ (1/n) Σᵢ µ(x, sᵢ)
When the treatment variable is binary, we are interested in the expectations under X = 0 and X = 1.
So we want the difference: E[Y | do(X = 1)] − E[Y | do(X = 0)]
This is called the Average Treatment Effect, or ATE!
3. Matching (Binary Levels)
ATE = E[Y | do(X = 1)] − E[Y | do(X = 0)] → ATE = Σₛ Pr(S = s) [µ(1, s) − µ(0, s)]
OR, using the sample shortcut: ATE ≈ (1/n) Σᵢ [µ(1, sᵢ) − µ(0, sᵢ)]
3. Matching (Binary Levels)
● But we can never observe µ(x, s) directly - we need to consider the errors!
● What we actually see is Yᵢ = µ(xᵢ, sᵢ) + εᵢ!
● So if we compute the difference Yᵢ − Yⱼ between two units, we actually have:
Yᵢ − Yⱼ = µ(xᵢ, sᵢ) − µ(xⱼ, sⱼ) + (εᵢ − εⱼ)
3. Matching (Binary Levels)
If we assume there are two units i and i* with the same value of s, where Xᵢ = 1 and Xᵢ* = 0, then:
Yᵢ − Yᵢ* = µ(1, s) − µ(0, s) + (εᵢ − εᵢ*)
So, if we can find a match i* for every treated unit i, averaging these matched differences estimates the ATE.
When the model for µ is mis-specified, matching can work vastly better than estimating µ through a linear model!
3. Matching (Binary Levels) - Conclusion
Matching is really just Nearest Neighbor Regression!
● Many technicalities arise. What if we must match one unit against multiple units, or can’t find an exact match?
If there is no exact match → match each treated unit against the control-group unit with the closest values of the covariates.
● Matching does not solve the identification problem. We still need S to satisfy the back-door criterion!
● As with any nearest-neighbor method, many issues arise. A small, fixed number of matches K keeps bias low but variance high, and prioritizing low bias is not always a good idea.
The Bias-Variance tradeoff is a tradeoff!
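A minimal sketch of nearest-neighbor matching on a one-dimensional covariate (synthetic data; for simplicity it matches only the treated units, so it estimates the effect on the treated, which coincides with the ATE here because the effect is homogeneous):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4_000

# One-dimensional covariate s; treatment probability depends on s.
s = rng.normal(0.0, 1.0, n)
x = (rng.random(n) < 1 / (1 + np.exp(-s))).astype(bool)
y = 0.5 * x + s + rng.normal(0.0, 0.3, n)       # true effect: 0.5

treated, control = s[x], s[~x]
y_t, y_c = y[x], y[~x]

# Match each treated unit to the control unit with the closest
# covariate value (1-nearest-neighbour matching, with replacement).
idx = np.abs(treated[:, None] - control[None, :]).argmin(axis=1)
att = np.mean(y_t - y_c[idx])    # average effect on the treated
```

No model for µ is fit anywhere: the matched control plays the role of µ(0, s) for each treated unit, which is exactly why mis-specification of a regression model cannot bias this estimate.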
4. Propensity Scores
● The curse of dimensionality over data or conditional distributions is only manageable if the control variables have few dimensions.
● If R and S are both sets of control variables that satisfy the back-door criterion, all else being equal we should choose the one with fewer variables (say R).
● This holds especially when R = f(S): if S satisfies the back-door criterion, then so does R. Since R is a function of S, both the computational and the statistical problems which come from using R are no worse than those of using S, and possibly much better, if R has much lower dimension.
4. Propensity Scores
● All that is required is that some combination of the variables in S carries the same information about X as the whole of S does - a summary score!
● We (hope we) can reduce a p-dimensional set of control variables to a one-dimensional summary.
● In general this is difficult, since in a non-parametric setting it depends on the unknown functional form. However, there is a special case where a one-dimensional summary always exists: when X is binary! In that case:
Propensity score: f(S) = Pr(X=1|S=s), which is our summary R
● However, there is still no analytical formula for f(S), so it needs to be modeled and estimated. The most common model used is Logistic Regression, though with a high-dimensional S that would also suffer from the curse of dimensionality.
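A sketch of estimating the propensity score by logistic regression, with the model fit by plain gradient ascent so the example stays self-contained (any standard logistic-regression routine would do; the five-variable setup and all coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20_000, 5

# Five control variables; treatment depends on them through a logit.
S = rng.normal(0.0, 1.0, (n, p))
true_w = np.array([0.8, -0.5, 0.3, 0.0, 0.2])
x = (rng.random(n) < 1 / (1 + np.exp(-(S @ true_w)))).astype(float)

# Fit the propensity model f(s) = Pr(X=1 | S=s) by logistic regression,
# maximising the average log-likelihood with gradient ascent.
w = np.zeros(p)
for _ in range(500):
    pred = 1 / (1 + np.exp(-(S @ w)))
    w += 0.1 * S.T @ (x - pred) / n

scores = 1 / (1 + np.exp(-(S @ w)))   # the 1-D summary R = f(S)
```

The p-dimensional S has been collapsed into the single column `scores`, which is all that later matching or stratification needs.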
5. Propensity Score Matching
● Even if X is a binary variable, if S is high-dimensional then matching is a really tough task → the more levels the variables in S take, the harder it is to find matches.
● Solution: Matching on R = f(S) - the propensity score!
● The gain here is in computational tractability and (perhaps) statistical
efficiency, not in fundamental identification. Que sera, sera…
Incredibly popular method ever since, huge number of implementations.
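A sketch of the idea: match treated and control units on the one-dimensional score instead of the full five-dimensional S. For clarity this matches on the true propensity score; in practice f(S) would itself be estimated, e.g. by logistic regression (all data and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 4_000, 5

# Five control variables confound X and Y through the same linear index.
S = rng.normal(0.0, 1.0, (n, p))
w_true = np.array([0.8, -0.5, 0.3, 0.0, 0.2])
prop = 1 / (1 + np.exp(-(S @ w_true)))      # propensity score f(S)
x = rng.random(n) < prop
y = 0.5 * x + S @ w_true + rng.normal(0.0, 0.5, n)   # true effect: 0.5

# Nearest-neighbour matching on the SCALAR score, not on 5-D covariates.
idx = np.abs(prop[x][:, None] - prop[~x][None, :]).argmin(axis=1)
att = np.mean(y[x] - y[~x][idx])            # average effect on the treated
```

Matching on one number rather than five covariates is the whole computational gain; the identification assumption (S blocks the back-doors) is unchanged.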
Overview
Identification & Estimation
➔ Back-door Criteria & Estimators
◆ Estimating Average Causal Effects
◆ Avoiding Estimating Marginal Distributions
◆ Matching
◆ Propensity Scores
◆ Propensity Score Matching
➔ Instrumental Variables Estimates
Conclusion
➔ Uncertainty & Inference
➔ Comparison of Methods
Instrumental Variables
Instrumental Variables
● I is an instrument for identifying the effect of X on Y if:
○ I is a cause of X
○ I is associated with Y only through X
○ Given some controls S, I is a valid instrument when every path from I to Y left open by S has an
arrow into X
● A one-unit change in I → an α-unit change in X → an αβ-unit change in Y
● So, β is how we estimate the causal coefficient of X on Y!
● We need to estimate β in a way that satisfies all the above criteria. However, things
get very complicated when we have many instruments and/or are interested in the
causal effects of multiple variables.
Instrumental Variables
● In general, we need two steps * //2-stage Regression / 2-stage Least Squares
1. Regress X on I and S. Call the fitted values xˆ.
//We see how much changes in the instruments affect X.
2. Regress Y on xˆ and S, but not on I. The coefficient of Y on xˆ is a consistent estimate
of β.
//We see how much these I-caused changes in X change Y
● In the simplest case, when everything is linear, the above procedure reduces to
the Wald estimator:
β̂ = Cov(I, Y) / Cov(I, X)
* This works, provided that the linearity assumption holds.
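A sketch of both routes on synthetic linear data with an unobserved confounder u (all coefficients are illustrative); the two-stage slope and the Wald ratio agree, as they must in this all-linear case:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000

# Unobserved confounder u; instrument i affects y only through x.
u = rng.normal(0.0, 1.0, n)
i = rng.normal(0.0, 1.0, n)
x = 1.0 * i + 1.0 * u + rng.normal(0.0, 1.0, n)   # alpha = 1
y = 0.5 * x + 1.0 * u + rng.normal(0.0, 1.0, n)   # beta = 0.5 (target)

# Stage 1: regress X on I; keep the fitted values x_hat.
a = np.polyfit(i, x, 1)
x_hat = np.polyval(a, i)

# Stage 2: regress Y on x_hat; the slope consistently estimates beta.
beta_2sls = np.polyfit(x_hat, y, 1)[0]

# Wald estimator: Cov(I, Y) / Cov(I, X) gives the same answer here.
beta_wald = np.cov(i, y)[0, 1] / np.cov(i, x)[0, 1]
```

Ordinary regression of Y on X on this data gives a slope near 0.83 rather than 0.5, because of the confounder u; the instrument removes that bias.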
Instrumental Variables
● There are circumstances where it is possible to use instrumental variables in
nonlinear and even nonparametric models, BUT the technique becomes far
more complicated.
● It requires solving integral equations of the form E[Y | I = i] = ∫ g(x) dPr(X = x | I = i) for the unknown function g:
It is not impossible, BUT it’s painful :-(
The techniques needed are really complicated.
Overview
Identification & Estimation
➔ Back-door Criteria & Estimators
◆ Estimating Average Causal Effects
◆ Avoiding Estimating Marginal Distributions
◆ Matching
◆ Propensity Scores
◆ Propensity Score Matching
➔ Instrumental Variables Estimates
Conclusion
➔ Uncertainty & Inference
➔ Comparison of Methods
Uncertainty & Inference
We can reduce the problem of causal inference → ordinary statistical inference.
So, in general, with the use of analytical formulas we can assess our uncertainty
about any of our estimates of causal effects the same way we would assess any
other statistical inference (e.g. standard errors).
However, in two-stage least squares, taking standard errors etc. for β from the
usual formulas for the second regression neglects the fact that this estimate of β
comes from regressing Y on x̂, which is itself an estimate and therefore uncertain.
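One standard way to respect that first-stage uncertainty is to bootstrap whole units and redo both stages inside every resample; a hedged sketch on synthetic data, using the Wald form of the estimator for brevity:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000

u = rng.normal(size=n)                    # unobserved confounder
i = rng.normal(size=n)                    # instrument
x = i + u + rng.normal(size=n)
y = 0.5 * x + u + rng.normal(size=n)

def wald(i_s, x_s, y_s):
    return np.cov(i_s, y_s)[0, 1] / np.cov(i_s, x_s)[0, 1]

# Resample whole units and rerun BOTH stages each time, so the standard
# error reflects the uncertainty in x_hat as well as in stage two.
boots = []
for _ in range(500):
    idx = rng.integers(0, n, n)
    boots.append(wald(i[idx], x[idx], y[idx]))
se = np.std(boots)
```

The spread of the bootstrap replicates is a standard error for β that does not pretend the first-stage fitted values were known exactly.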
Comparison of Methods
Both methods are (+) clever, but (-) rest on assumptions about the underlying DAG.
Crucial Point
● Matching: the covariates block all back-door pathways between X and Y.
● Instrumental Variables: the instrument is an indirect cause of Y, but only through X (no other unblocked paths connecting I & Y).
According to Practitioners, the usual objection
● Matching: the covariates used are not enough to block all the back-door paths.
● Instrumental Variables: analysts are too quick to discount the possibility that the instruments are connected to Y through unmeasured pathways.
Thank You!

More Related Content

What's hot

Confidence Intervals: Basic concepts and overview
Confidence Intervals: Basic concepts and overviewConfidence Intervals: Basic concepts and overview
Confidence Intervals: Basic concepts and overview
Rizwan S A
 
Bayesian statistics
Bayesian statisticsBayesian statistics
Bayesian statistics
Sagar Kamble
 
Bayesian decision making in clinical research
Bayesian decision making in clinical researchBayesian decision making in clinical research
Bayesian decision making in clinical research
Bhaswat Chakraborty
 
Chapter 6 part2-Introduction to Inference-Tests of Significance, Stating Hyp...
Chapter 6 part2-Introduction to Inference-Tests of Significance,  Stating Hyp...Chapter 6 part2-Introduction to Inference-Tests of Significance,  Stating Hyp...
Chapter 6 part2-Introduction to Inference-Tests of Significance, Stating Hyp...
nszakir
 
Spss software
Spss softwareSpss software
Spss software
ManonmaniA3
 
Sampling and statistical inference
Sampling and statistical inferenceSampling and statistical inference
Sampling and statistical inference
Bhavik A Shah
 
Z test, f-test,etc
Z test, f-test,etcZ test, f-test,etc
Z test, f-test,etc
Juan Apolinario Reyes
 
Regression analysis in R
Regression analysis in RRegression analysis in R
Regression analysis in R
Alichy Sowmya
 
Autocorrelation (1)
Autocorrelation (1)Autocorrelation (1)
Autocorrelation (1)
Manokamna Kochar
 
hands on machine learning Chapter 6&7 decision tree, ensemble and random forest
hands on machine learning Chapter 6&7 decision tree, ensemble and random foresthands on machine learning Chapter 6&7 decision tree, ensemble and random forest
hands on machine learning Chapter 6&7 decision tree, ensemble and random forest
Jaey Jeong
 
Chapter13
Chapter13Chapter13
Chapter13
rwmiller
 
Machine Learning and Causal Inference
Machine Learning and Causal InferenceMachine Learning and Causal Inference
Machine Learning and Causal Inference
NBER
 
Statistical significance
Statistical significanceStatistical significance
Statistical significance
Mai Ngoc Duc
 
Parametric & non parametric
Parametric & non parametricParametric & non parametric
Parametric & non parametric
ANCYBS
 
Poisson Distribution, Poisson Process & Geometric Distribution
Poisson Distribution, Poisson Process & Geometric DistributionPoisson Distribution, Poisson Process & Geometric Distribution
Poisson Distribution, Poisson Process & Geometric Distribution
DataminingTools Inc
 
Hypothesis Testing
Hypothesis TestingHypothesis Testing
Hypothesis Testing
Jeremy Lane
 
Causal inference in practice
Causal inference in practiceCausal inference in practice
Causal inference in practice
Amit Sharma
 
Analysis of Time Series
Analysis of Time SeriesAnalysis of Time Series
Analysis of Time Series
Manu Antony
 
Basics of Hypothesis Testing
Basics of Hypothesis TestingBasics of Hypothesis Testing
Basics of Hypothesis Testing
Long Beach City College
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
Shameer P Hamsa
 

What's hot (20)

Confidence Intervals: Basic concepts and overview
Confidence Intervals: Basic concepts and overviewConfidence Intervals: Basic concepts and overview
Confidence Intervals: Basic concepts and overview
 
Bayesian statistics
Bayesian statisticsBayesian statistics
Bayesian statistics
 
Bayesian decision making in clinical research
Bayesian decision making in clinical researchBayesian decision making in clinical research
Bayesian decision making in clinical research
 
Chapter 6 part2-Introduction to Inference-Tests of Significance, Stating Hyp...
Chapter 6 part2-Introduction to Inference-Tests of Significance,  Stating Hyp...Chapter 6 part2-Introduction to Inference-Tests of Significance,  Stating Hyp...
Chapter 6 part2-Introduction to Inference-Tests of Significance, Stating Hyp...
 
Spss software
Spss softwareSpss software
Spss software
 
Sampling and statistical inference
Sampling and statistical inferenceSampling and statistical inference
Sampling and statistical inference
 
Z test, f-test,etc
Z test, f-test,etcZ test, f-test,etc
Z test, f-test,etc
 
Regression analysis in R
Regression analysis in RRegression analysis in R
Regression analysis in R
 
Autocorrelation (1)
Autocorrelation (1)Autocorrelation (1)
Autocorrelation (1)
 
hands on machine learning Chapter 6&7 decision tree, ensemble and random forest
hands on machine learning Chapter 6&7 decision tree, ensemble and random foresthands on machine learning Chapter 6&7 decision tree, ensemble and random forest
hands on machine learning Chapter 6&7 decision tree, ensemble and random forest
 
Chapter13
Chapter13Chapter13
Chapter13
 
Machine Learning and Causal Inference
Machine Learning and Causal InferenceMachine Learning and Causal Inference
Machine Learning and Causal Inference
 
Statistical significance
Statistical significanceStatistical significance
Statistical significance
 
Parametric & non parametric
Parametric & non parametricParametric & non parametric
Parametric & non parametric
 
Poisson Distribution, Poisson Process & Geometric Distribution
Poisson Distribution, Poisson Process & Geometric DistributionPoisson Distribution, Poisson Process & Geometric Distribution
Poisson Distribution, Poisson Process & Geometric Distribution
 
Hypothesis Testing
Hypothesis TestingHypothesis Testing
Hypothesis Testing
 
Causal inference in practice
Causal inference in practiceCausal inference in practice
Causal inference in practice
 
Analysis of Time Series
Analysis of Time SeriesAnalysis of Time Series
Analysis of Time Series
 
Basics of Hypothesis Testing
Basics of Hypothesis TestingBasics of Hypothesis Testing
Basics of Hypothesis Testing
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 

Viewers also liked

Τweetfix: Data Analytics on Match Fixing
Τweetfix: Data Analytics on Match FixingΤweetfix: Data Analytics on Match Fixing
Τweetfix: Data Analytics on Match Fixing
Antigoni-Maria Founta
 
Exploring Language Communities on Github
Exploring Language Communities on GithubExploring Language Communities on Github
Exploring Language Communities on Github
Antigoni-Maria Founta
 
Experimental Causal Inference
Experimental Causal InferenceExperimental Causal Inference
Experimental Causal Inference
Antigoni-Maria Founta
 
Social Media Fraud Metrics
Social Media Fraud MetricsSocial Media Fraud Metrics
Social Media Fraud Metrics
Antigoni-Maria Founta
 
Transitivity of Trust
Transitivity of TrustTransitivity of Trust
Transitivity of Trust
Antigoni-Maria Founta
 
Opinion mining
Opinion miningOpinion mining
Opinion mining
Antigoni-Maria Founta
 
Periscope: A Content-based Image Retrieval Engine
Periscope: A Content-based Image Retrieval EnginePeriscope: A Content-based Image Retrieval Engine
Periscope: A Content-based Image Retrieval Engine
Antigoni-Maria Founta
 
Semantic Linked Data
Semantic Linked DataSemantic Linked Data
Linked data and Graph properties
Linked data and Graph propertiesLinked data and Graph properties
Linked data and Graph properties
Praxitelis Nikolaos Kouroupetroglou
 
Incremental clustering in search engines
Incremental clustering in search enginesIncremental clustering in search engines
Incremental clustering in search engines
Praxitelis Nikolaos Kouroupetroglou
 

Viewers also liked (10)

Τweetfix: Data Analytics on Match Fixing
Τweetfix: Data Analytics on Match FixingΤweetfix: Data Analytics on Match Fixing
Τweetfix: Data Analytics on Match Fixing
 
Exploring Language Communities on Github
Exploring Language Communities on GithubExploring Language Communities on Github
Exploring Language Communities on Github
 
Experimental Causal Inference
Experimental Causal InferenceExperimental Causal Inference
Experimental Causal Inference
 
Social Media Fraud Metrics
Social Media Fraud MetricsSocial Media Fraud Metrics
Social Media Fraud Metrics
 
Transitivity of Trust
Transitivity of TrustTransitivity of Trust
Transitivity of Trust
 
Opinion mining
Opinion miningOpinion mining
Opinion mining
 
Periscope: A Content-based Image Retrieval Engine
Periscope: A Content-based Image Retrieval EnginePeriscope: A Content-based Image Retrieval Engine
Periscope: A Content-based Image Retrieval Engine
 
Semantic Linked Data
Semantic Linked DataSemantic Linked Data
Semantic Linked Data
 
Linked data and Graph properties
Linked data and Graph propertiesLinked data and Graph properties
Linked data and Graph properties
 
Incremental clustering in search engines
Incremental clustering in search enginesIncremental clustering in search engines
Incremental clustering in search engines
 

Similar to Estimating Causal Effects from Observations

Py data19 final
Py data19   finalPy data19   final
Py data19 final
Maria Navarro Jiménez
 
Dowhy: An end-to-end library for causal inference
Dowhy: An end-to-end library for causal inferenceDowhy: An end-to-end library for causal inference
Dowhy: An end-to-end library for causal inference
Amit Sharma
 
chap4_Parametric_Methods.ppt
chap4_Parametric_Methods.pptchap4_Parametric_Methods.ppt
chap4_Parametric_Methods.ppt
ShayanChowdary
 
Anomaly detection Full Article
Anomaly detection Full ArticleAnomaly detection Full Article
Anomaly detection Full Article
MenglinLiu1
 
Detail Study of the concept of Regression model.pptx
Detail Study of the concept of  Regression model.pptxDetail Study of the concept of  Regression model.pptx
Detail Study of the concept of Regression model.pptx
truptikulkarni2066
 
Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regression
Avjinder (Avi) Kaler
 
Statistical tests
Statistical testsStatistical tests
Statistical tests
martyynyyte
 
Lecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptxLecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptx
ajondaree
 
Representing and generating uncertainty effectively presentatıon
Representing and generating uncertainty effectively presentatıonRepresenting and generating uncertainty effectively presentatıon
Representing and generating uncertainty effectively presentatıon
Azdeen Najah
 
2. diagnostics, collinearity, transformation, and missing data
2. diagnostics, collinearity, transformation, and missing data 2. diagnostics, collinearity, transformation, and missing data
2. diagnostics, collinearity, transformation, and missing data
Malik Hassan Qayyum 🕵🏻‍♂️
 
15 anomaly detection
15 anomaly detection15 anomaly detection
15 anomaly detection
TanmayVijay1
 
SAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docxSAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docx
anhlodge
 
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
Maninda Edirisooriya
 
statistical estimation
statistical estimationstatistical estimation
statistical estimation
Amish Akbar
 
Simple egression.pptx
Simple egression.pptxSimple egression.pptx
Simple egression.pptx
AbdalrahmanTahaJaya
 
Simple Linear Regression.pptx
Simple Linear Regression.pptxSimple Linear Regression.pptx
Simple Linear Regression.pptx
AbdalrahmanTahaJaya
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms I
Machine Learning Valencia
 
WEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been LearnedWEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been Learned
weka Content
 
WEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedWEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been Learned
DataminingTools Inc
 
Chapter 14 Part Ii
Chapter 14 Part IiChapter 14 Part Ii
Chapter 14 Part Ii
Matthew L Levy
 

Similar to Estimating Causal Effects from Observations (20)

Py data19 final
Py data19   finalPy data19   final
Py data19 final
 
Dowhy: An end-to-end library for causal inference
Dowhy: An end-to-end library for causal inferenceDowhy: An end-to-end library for causal inference
Dowhy: An end-to-end library for causal inference
 
chap4_Parametric_Methods.ppt
chap4_Parametric_Methods.pptchap4_Parametric_Methods.ppt
chap4_Parametric_Methods.ppt
 
Anomaly detection Full Article
Anomaly detection Full ArticleAnomaly detection Full Article
Anomaly detection Full Article
 
Detail Study of the concept of Regression model.pptx
Detail Study of the concept of  Regression model.pptxDetail Study of the concept of  Regression model.pptx
Detail Study of the concept of Regression model.pptx
 
Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regression
 
Statistical tests
Statistical testsStatistical tests
Statistical tests
 
Lecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptxLecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptx
 
Representing and generating uncertainty effectively presentatıon
Representing and generating uncertainty effectively presentatıonRepresenting and generating uncertainty effectively presentatıon
Representing and generating uncertainty effectively presentatıon
 
2. diagnostics, collinearity, transformation, and missing data
2. diagnostics, collinearity, transformation, and missing data 2. diagnostics, collinearity, transformation, and missing data
2. diagnostics, collinearity, transformation, and missing data
 
15 anomaly detection
15 anomaly detection15 anomaly detection
15 anomaly detection
 
SAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docxSAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docx
 
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
 
statistical estimation
statistical estimationstatistical estimation
statistical estimation
 
Simple egression.pptx
Simple egression.pptxSimple egression.pptx
Simple egression.pptx
 
Simple Linear Regression.pptx
Simple Linear Regression.pptxSimple Linear Regression.pptx
Simple Linear Regression.pptx
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms I
 
WEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been LearnedWEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been Learned
 
WEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedWEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been Learned
 
Chapter 14 Part Ii
Chapter 14 Part IiChapter 14 Part Ii
Chapter 14 Part Ii
 

Recently uploaded

一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
facilitymanager11
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
UofT毕业证如何办理
UofT毕业证如何办理UofT毕业证如何办理
UofT毕业证如何办理
exukyp
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
a9qfiubqu
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
slg6lamcq
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 

Recently uploaded (20)

一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样

Estimating Causal Effects from Observations

  • 1. Estimating Causal Effects from Observations Advanced Data Analysis from an Elementary Point of View
  • 2. Credits Team The slides below are derived from Chapter 27 of the book “Advanced Data Analysis from an Elementary Point of View” by Cosma Shalizi of Carnegie Mellon University, written to accompany CMU's “Advanced Data Analysis” course. Antigoni-Maria Founta, UIM: 647 Ioannis Athanasiadis, UIM: 607
  • 3. Overview Identification & Estimation ➔ Back-door Criteria & Estimators ◆ Estimating Average Causal Effects ◆ Avoiding Estimating Marginal Distributions ◆ Matching ◆ Propensity Scores ◆ Propensity Score Matching ➔ Instrumental Variables Estimates Conclusion ➔ Uncertainty & Inference ➔ Comparison of Methods
  • 4. Identification of Causal Effects ● Causal Conditioning: Pr(Y|do(X = x)) //Counter-factual, not always identifiable even if X & Y are observable ● Probabilistic Conditioning: Pr(Y|X = x) //Factual, always identifiable if X & Y are observable Our goal is to identify when quantities like the former are functions of the distribution of observable variables! And once identified, the next question is how we can actually estimate it...
  • 5. Identification/Estimation of Causal Effects Each of the following techniques provides both identification of causal effects and a formula to estimate them: 1. Back-door Criterion 2. Front-door Criterion 3. Instrumental Variables * Since estimating with the front-door criterion amounts to doing two rounds of back-door adjustment, we will continue this lecture working only with the back-door criterion.
  • 7. Back-door criterion CRITERIA: I. S blocks every back-door path between X and Y II. No node in S is a descendant of X S is any set of variables that meets the back-door criteria. E.g. ● S={S1,S2} or S={S3} or S={S1,S2,S3} meet the criteria ● S={S1} or S={S2} or S={S3,B} fail the criteria
  • 8. Back-door Criteria Formula When S satisfies the back-door criterion, the formula to estimate the Causal Conditioning becomes: Pr(Y|do(X=x)) = Σs Pr(Y|X=x, S=s) Pr(S=s) ...so we have transformed the unidentifiable Causal Conditioning into a distribution of observables.
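On fully discrete data the adjustment sum can be computed by direct counting. A minimal sketch (the toy records and the helper name `backdoor_adjust` are ours, purely for illustration):

```python
from collections import Counter

def backdoor_adjust(data, x, y_vals):
    """Estimate Pr(Y=y | do(X=x)) by the back-door formula:
    sum over s of Pr(Y=y | X=x, S=s) * Pr(S=s)."""
    n = len(data)
    s_counts = Counter(s for (_, s, _) in data)
    out = {}
    for y in y_vals:
        total = 0.0
        for s, ns in s_counts.items():
            # outcomes observed in the stratum X=x, S=s
            cell = [yy for (xx, ss, yy) in data if xx == x and ss == s]
            if cell:  # skip strata with no X=x observations
                total += (sum(1 for yy in cell if yy == y) / len(cell)) * (ns / n)
        out[y] = total
    return out

# toy records: (X, S, Y), all binary, invented for the example
data = [(1, 0, 1), (1, 0, 0), (0, 0, 0), (0, 0, 0),
        (1, 1, 1), (1, 1, 1), (0, 1, 1), (0, 1, 0)]
print(backdoor_adjust(data, x=1, y_vals=[0, 1]))
```

Here Pr(Y=1|X=1, S=0) = 1/2, Pr(Y=1|X=1, S=1) = 1, and Pr(S=0) = Pr(S=1) = 1/2, so the adjusted Pr(Y=1|do(X=1)) comes out to 0.75.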
  • 9. Estimations in the Back-Door Criteria There are some special cases and tricks which are worth knowing about, analysed below. 1. Estimating Average Causal Effects 2. Avoiding Estimating Marginal Distributions 3. Matching 4. Propensity Scores 5. Propensity Score Matching
  • 10. 1. Estimating average causal effects I The average causal effect can summarize much of what there is to know about the effect X has on Y. We don't always learn all there is to know this way, but it's still useful! ● Average Effect (or sometimes just the effect) of do(X=x): E[Y|do(X=x)] * *When it makes sense for Y to have an expectation value
  • 11. 1. Estimating average causal effects II ● If we apply the Back-Door Criterion Formula with control variables S: E[Y|do(X=x)] = Σs E[Y|X=x, S=s] Pr(S=s) = Σs µ(x, s) Pr(S=s) (the inner conditional expectation µ(x, s) = E[Y|X=x, S=s] is a regression function) We can estimate µ(x, s) by regression, but we still need to know Pr(S=s)...
  • 12. 2. Avoiding Estimating Marginal Distributions ● The demands of computing, enumerating and summarizing the various distributions in the Back-Door Criteria Formula are very high, so we need to reduce these burdens. One shortcut is to let the sample itself enumerate the possible values of S. ● How? Law of Large Numbers! With a large sample: (1/n) Σi µ(x, si) → E[Y|do(X=x)]
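The shortcut can be sketched in a few lines. Here the regression function µ is a hypothetical linear one we made up for illustration; in practice it would be fitted from data:

```python
import random

random.seed(0)

# Hypothetical regression function mu(x, s) = E[Y | X=x, S=s];
# in practice this would be estimated by regression.
def mu(x, s):
    return 2.0 * x + 0.5 * s

# Observed covariate values s_i (a made-up Gaussian sample).
sample_s = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Law-of-large-numbers shortcut: average mu(x, s_i) over the sample
# instead of estimating and summing over Pr(S=s).
def avg_effect(x):
    return sum(mu(x, s) for s in sample_s) / len(sample_s)

print(avg_effect(1) - avg_effect(0))
```

For this particular µ the s-terms cancel, so the difference of the two sample averages equals the true effect of 2.0.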
  • 13. 3. Matching (Binary Levels) Combination of the above methods: ● By the Estimation of Average Causal Effects we got: E[Y|do(X=x)] = Σs Pr(S=s) µ(x, s) ● By the Law of Large Numbers we got: E[Y|do(X=x)] ≈ 1/n Σi µ(x, si) When X is binary, we are interested in the expectations under X=0 and X=1. So we want the difference: E[Y|do(X=1)] - E[Y|do(X=0)] This is called the Average Treatment Effect, or ATE!
  • 14. 3. Matching (Binary Levels) ATE = E[Y|do(X=1)] - E[Y|do(X=0)] ≈ 1/n Σi µ(1, si) - 1/n Σi µ(0, si) OR, equivalently, 1/n Σi [µ(1, si) - µ(0, si)]
  • 15. 3. Matching (Binary Levels) ● But we can never observe µ(x, s) directly; we need to consider the errors! ● What we can see is Yi = µ(xi, si) + εi! ● So if we want to compute Yi - Yj we will actually have: Yi - Yj = µ(xi, si) - µ(xj, sj) + εi - εj
  • 16. 3. Matching (Binary Levels) If we assume there are two units i and i* with the same value of s, where Xi=1 and Xi*=0, then we can say that: Yi - Yi* = µ(1, si) - µ(0, si) + εi - εi* So, if we can find a match i* for every treated unit i, then (with n1 the number of treated units): ATE ≈ 1/n1 Σi: Xi=1 (Yi - Yi*) Matching works vastly better than estimating µ through a linear model!
  • 17. 3. Matching (Binary Levels) - Conclusion Matching is really just Nearest Neighbor Regression! ● Many technicalities arise. What if we match a unit against multiple units, or can't find an exact match? If there is no exact match -> match each treated unit against the control-group unit with the closest values of the covariates. ● Matching does not solve the identification problem. We still need S to satisfy the back-door criterion! ● As with any NN method, many issues arise. Fixed K → low bias, but not always what we want. Prioritizing low bias is not always a good idea. Bias - Variance tradeoff is a tradeoff!
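The matching estimate of the ATE can be sketched as one-nearest-neighbour matching on a single covariate. All numbers below are invented, with a treatment effect of about +2 built in:

```python
# (s_i, y_i) pairs for treated (X=1) and control (X=0) units;
# the values are made up for illustration.
treated = [(0.1, 2.2), (0.5, 2.9), (0.9, 3.5)]
controls = [(0.15, 0.1), (0.45, 0.8), (0.95, 1.4)]

def match_ate(treated, controls):
    """Average of Y_i - Y_{i*} over treated units, where i* is the
    control unit with the closest covariate value."""
    diffs = []
    for s_i, y_i in treated:
        # nearest neighbour in covariate space
        s_j, y_j = min(controls, key=lambda c: abs(c[0] - s_i))
        diffs.append(y_i - y_j)
    return sum(diffs) / len(diffs)

print(match_ate(treated, controls))
```

Each treated unit pairs with the control whose s is closest, and the average difference recovers the built-in effect of 2.1.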
  • 18. 4. Propensity Scores ● The curse of dimensionality over data or conditional distributions is actually manageable if the control variables have few dimensions. ● If R and S are sets of control variables that both satisfy the back-door criterion and are alike in all else but dimension, we should choose the one with fewer variables (e.g. R). ● Especially when we can write R = f(S): if S satisfies the back-door criterion, then so does R. Since R is a function of S, both the computational and the statistical problems which come from using R are no worse than those of using S, and possibly much better if R has much lower dimension.
  • 19. 4. Propensity Scores ● All that is required is that some combination of the variables in S carries the same information about X as the whole of S does - a summary score! ● We (hope we) can reduce a p-dimensional set of control variables to a one-dimensional one. ● In reality this is difficult, as it depends on the functional form, which is unknown in the non-parametric case. However, there is a special case where we can possibly use a one-dimensional summary: when X is binary! In that case: Propensity score: f(S) = Pr(X=1|S=s), which is our summary R ● However, there is still no analytical formula for f(S), so it needs to be modeled and estimated. The most common model used is Logistic Regression, but with a high-dimensional S that would also suffer from the curse of dimensionality.
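When S is discrete with few levels, the propensity score Pr(X=1|S=s) can be estimated directly by stratum frequencies, with no logistic model needed. A minimal sketch on made-up (x, s) pairs:

```python
from collections import defaultdict

def propensity_scores(data):
    """Estimate f(s) = Pr(X=1 | S=s) from (x, s) pairs by counting."""
    counts = defaultdict(lambda: [0, 0])  # s -> [n_total, n_treated]
    for x, s in data:
        counts[s][0] += 1
        counts[s][1] += x
    return {s: n1 / n for s, (n, n1) in counts.items()}

# invented binary-treatment records with a two-level covariate
data = [(1, 'a'), (0, 'a'), (1, 'a'), (1, 'a'),
        (1, 'b'), (0, 'b'), (0, 'b'), (0, 'b')]
print(propensity_scores(data))  # → {'a': 0.75, 'b': 0.25}
```

A fitted logistic regression would play the same role when S is continuous or high-dimensional.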
  • 20. 5. Propensity Score Matching ● Even when X is a binary variable, if S is high-dimensional then matching is a really tough task → the more levels the variables in S have, the harder it is to find matches. ● Solution: match on R = f(S) - the propensity score! ● The gain here is in computational tractability and (perhaps) statistical efficiency, not in fundamental identification. Que sera, sera… An incredibly popular method ever since, with a huge number of implementations.
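Propensity score matching is then just one-nearest-neighbour matching on the scalar score instead of the full covariate vector. The scores and outcomes below are invented for illustration:

```python
# (treated?, propensity score, outcome) for each unit; values made up,
# with an effect of roughly +2 built in.
units = [
    (1, 0.80, 5.0), (1, 0.55, 4.1), (1, 0.30, 3.2),
    (0, 0.78, 2.9), (0, 0.50, 2.0), (0, 0.35, 1.2),
]

def ps_match_ate(units):
    """Match each treated unit to the control with the closest
    propensity score and average the outcome differences."""
    treated = [(p, y) for t, p, y in units if t == 1]
    controls = [(p, y) for t, p, y in units if t == 0]
    diffs = []
    for p_i, y_i in treated:
        _, y_j = min(controls, key=lambda c: abs(c[0] - p_i))
        diffs.append(y_i - y_j)
    return sum(diffs) / len(diffs)

print(ps_match_ate(units))
```

Matching on one number sidesteps the curse of dimensionality in the matching step, though identification still rests on S blocking the back-door paths.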
  • 21. Overview Identification & Estimation ➔ Back-door Criteria & Estimators ◆ Estimating Average Causal Effects ◆ Avoiding Estimating Marginal Distributions ◆ Matching ◆ Propensity Scores ◆ Propensity Score Matching ➔ Instrumental Variables Estimates Conclusion ➔ Uncertainty & Inference ➔ Comparison of Methods
  • 23. Instrumental Variables ● I is an instrument for identifying the effect of X on Y if: ○ I is a cause of X ○ I is associated with Y only through X ○ Given some controls S, I is a valid instrument when every path from I to Y left open by S has an arrow into X ● A one-unit change in I → an α-unit change in X → an αβ-unit change in Y ● So, β is the causal coefficient of X on Y that we want to estimate! ● We need to estimate β in a way that satisfies all the above criteria. However, things get very complicated when we have many instruments and/or care about the causal effects of multiple variables.
  • 24. Instrumental Variables ● In general, we need two steps * //2-stage Regression / 2-stage Least Squares 1. Regress X on I and S. Call the fitted values xˆ. //We see how much changes in the instruments affect X. 2. Regress Y on xˆ and S, but not on I. The coefficient of Y on xˆ is a consistent estimate of β. //We see how much these I-caused changes in X change Y ● In the simplest case, when everything is linear, the above procedure reduces to the Wald estimator: βˆ = Cov(I, Y) / Cov(I, X) * This works, provided that the linearity assumption holds.
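The Wald estimator can be checked on simulated data. The structural coefficients below are made up: X = I + U + noise and Y = 3X + 2U + noise, with U an unobserved confounder, so the naive regression of Y on X is biased while the instrument recovers β = 3:

```python
import random

random.seed(42)

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

# Simulated linear system with an unobserved confounder U.
n = 50_000
I = [random.gauss(0, 1) for _ in range(n)]
U = [random.gauss(0, 1) for _ in range(n)]
X = [i + u + random.gauss(0, 0.1) for i, u in zip(I, U)]
Y = [3.0 * x + 2.0 * u + random.gauss(0, 0.1) for x, u in zip(X, U)]

beta_iv = cov(I, Y) / cov(I, X)   # Wald/IV estimate, close to 3
beta_ols = cov(X, Y) / cov(X, X)  # naive regression, biased upward by U
print(round(beta_iv, 2), round(beta_ols, 2))
```

The instrument I is correlated with X but reaches Y only through X, which is exactly what lets the ratio of covariances isolate β.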
  • 25. Instrumental Variables ● There are circumstances where it is possible to use instrumental variables in nonlinear and even nonparametric models, BUT the technique becomes far more complicated: it requires solving an integral equation for the unknown causal function. It is not impossible, BUT it's painful :-( The techniques needed are really complicated.
  • 26. Overview Identification & Estimation ➔ Back-door Criteria & Estimators ◆ Estimating Average Causal Effects ◆ Avoiding Estimating Marginal Distributions ◆ Matching ◆ Propensity Scores ◆ Propensity Score Matching ➔ Instrumental Variables Estimates Conclusion ➔ Uncertainty & Inference ➔ Comparison of Methods
  • 27. Uncertainty & Inference We can reduce the problem of causal inference → ordinary statistical inference. So, in general, with the analytical formulas we can assess our uncertainty about any of our estimates of causal effects the same way we would assess any other statistical inference (e.g. with standard errors). However, in two-stage least squares, taking standard errors etc. for β from the usual formulas for the second regression neglects the fact that this estimate of β comes from regressing Y on xˆ, which is itself an estimate and therefore uncertain.
  • 28. Comparison of Methods (+) Both are clever; (-) both rest on assumptions about the underlying DAG. Crucial point for Matching: the covariates block all back-door paths between X and Y. Crucial point for Instrumental Variables: the instrument is an indirect cause of Y, but only through X (no other unblocked paths connecting I & Y). According to practitioners, the typical failures are: for matching, the covariates used are not enough to block all the back-door paths; for instrumental variables, being too quick to discount the possibility that the instruments are connected to Y through unmeasured pathways.