Ten years ago there were rumours of the death of causal inference. Big data was supposed to enable us to rely on purely correlational data to predict and control the world. In this talk, I argue that the rumours were strongly exaggerated. Causal inference is becoming increasingly relevant thanks to improvements in inference methods and, ironically, the availability of data. Far from becoming marginalised, causal inference is today more relevant than it’s ever been.
2. Structure of talk
1. The case against causal
inference in the big data era
2. Reasons why big data did not
(and will not) end causal
inference
3. A reconciliation between big
data and causal inference?
4. Conclusions
3. Causal inference: Any modelling
approach where some parts of the
model are assumed to correspond
to some aspects of the causal
structure of the world.
Big data: Lots of observations and/or
variables per observation.
5. “Scientists are trained to recognize that correlation is
not causation, that no conclusions should be drawn
simply on the basis of correlation between X and Y (it
could just be a coincidence). Instead, you must
understand the underlying mechanisms that connect
the two. […]
There is now a better way. Petabytes allow us to say:
‘Correlation is enough.’ We can stop looking for
models. We can analyze the data without hypotheses
about what it might show. We can throw the numbers
into the biggest computing clusters the world has ever
seen and let statistical algorithms find patterns where
science cannot.”
Chris Anderson, Wired 2008
6. 1. Humans are bad at coming up with
causal hypotheses
Coming up with hypotheses about the world
and then testing them against experimental
evidence seems old-fashioned, and there’s a
sense that humans are somehow bad at this.
The experimental method was certainly
developed before powerful computing, so it’s
not a crazy idea that it’s due for a revolution
just like so many other things have been.
2. Correlational models form a more
accurate picture of reality
Anderson refers to the fact that models are
abstractions of the underlying reality. He
suggests that the correlational approach
results in a more accurate picture of how the
world works because such an approach is more
flexible in its assumptions and can
incorporate more complexity.
3. Data analysis just seems to be
headed towards the correlational
approach
It’s clear that running correlational analyses
on big datasets has resulted in progress both
in science and business, and this big data
driven progress has presumably become
greater over time. So we could extrapolate
that progress is increasingly going to be
based on the correlational approach.
Reasons why big data might have ended causal inference
7. And still… causal inference seems
to be doing just fine
Science
Randomised controlled trials, mediation
analysis and quasi-experiments: 41400
Google Scholar hits since the beginning
of the year
Business and policy
A/B testing in business, incrementality
measurement in advertising,
experimental methods in public policy
And last but not least…
At Uber, we’re applying causal inference
methods to answer questions relevant to
our business
8. Reasons why big data did not (and will not) end causal
inference
9. Humans are good at causal
hypotheses
This is because during our
evolutionary history it’s been useful
to be able to answer the question:
“If I changed X, what would happen
to Y?”
Abstractness is what makes
models useful
When we abstract away from the
particulars of a situation, we can
generalize to other similar
contexts.
Correlational approaches don’t
give us the counterfactual
Estimating a causal effect requires
estimating what would have
happened in the absence of the
cause.
Three quick considerations
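The counterfactual point can be illustrated with a small simulation. This is a minimal sketch on made-up data, not an analysis from the talk: the hidden "engaged" trait, the assignment probabilities and the true effect of 1.0 are all assumptions chosen for illustration. When engaged users are both more likely to be treated and more likely to have good outcomes, the raw treated-versus-untreated difference mixes the causal effect with the confounder; random assignment breaks that link.

```python
import random

TRUE_EFFECT = 1.0  # assumed causal effect of the treatment (illustrative)


def difference_in_means(n, randomized, rng):
    """Simulate n units and return mean(treated) - mean(untreated)."""
    treated, untreated = [], []
    for _ in range(n):
        engaged = rng.random() < 0.5  # hidden confounder (hypothetical)
        if randomized:
            p_treat = 0.5  # coin flip, independent of engagement
        else:
            p_treat = 0.8 if engaged else 0.2  # engaged users treated more often
        t = rng.random() < p_treat
        # Outcome depends on both the confounder and the treatment.
        y = 2.0 * engaged + TRUE_EFFECT * t + rng.gauss(0.0, 1.0)
        (treated if t else untreated).append(y)
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)


rng = random.Random(0)
naive = difference_in_means(20_000, randomized=False, rng=rng)
experimental = difference_in_means(20_000, randomized=True, rng=rng)
print(f"observational difference: {naive:.2f}")
print(f"randomized difference:    {experimental:.2f}")
```

Under these assumptions the observational difference converges to 1.0 + 2.0 * (0.8 - 0.2) = 2.2 rather than 1.0, because the untreated group is a poor stand-in for what would have happened to the treated group without treatment.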
10. Bigger data = better causal
inference
Bigger sample sizes enable us to
identify smaller causal effects and/or
have a greater number of
treatment arms in a standard RCT.
Participant matching approaches
benefit from a larger number of
covariates. Time series based causal
inference methods require multiple
observations… All of which were
difficult to achieve before the big
data era.
Technology
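To make the sample size point concrete, here is a rough sketch of the textbook normal-approximation formula for a two-arm comparison of means with equal arm sizes and a two-sided test. The function name and the default alpha and power values are illustrative choices, not something specified in the talk.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_arm(effect, sd, alpha=0.05, power=0.8):
    """Approximate units needed per arm to detect a difference in means.

    Uses n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / effect) ** 2,
    the normal-approximation formula for two equally sized arms.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


# Smaller effects (in standard deviation units) need far more data.
for effect in (0.5, 0.1, 0.01):
    print(effect, required_n_per_arm(effect, sd=1.0))
```

Because the required sample size grows with the inverse square of the effect, detecting an effect ten times smaller takes roughly a hundred times more observations per arm, and every extra treatment arm needs its own allocation.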
11. Email open rate before
and after a
personalised title
Interrupted time series
analysis is a classic method
for inferring the causal
impact of an intervention,
based on qualitative
assumptions of the
underlying causal
structure. However, until
recently, high quality time
series data was hard to get.
Example:
interrupted time
series analysis
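One simple form of interrupted time series analysis can be sketched as follows: fit a trend to the pre-intervention period, project it forward as the counterfactual, and average the gap between observed and projected values. The weekly open rates below are simulated, the 0.05 lift is an assumed value, and the sketch relies on the qualitative assumptions that the pre-intervention trend would have continued and that nothing else changed at the intervention date.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def its_effect(series, start):
    """Average gap between observed values and the projected pre-trend."""
    slope, intercept = fit_line(list(range(start)), series[:start])
    gaps = [series[t] - (intercept + slope * t) for t in range(start, len(series))]
    return sum(gaps) / len(gaps)


rng = random.Random(1)
LIFT = 0.05  # assumed jump in open rate after the personalised title
START = 40   # hypothetical week in which the intervention began

# Simulated weekly open rates: slow upward trend, noise, and a level shift.
open_rate = [
    0.20 + 0.001 * t + (LIFT if t >= START else 0.0) + rng.gauss(0.0, 0.005)
    for t in range(60)
]

estimate = its_effect(open_rate, START)
print(f"estimated lift: {estimate:.3f}")
```

A fuller analysis would also model seasonality and autocorrelation and report uncertainty around the estimate; this sketch only shows the core counterfactual projection.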
12. Causal inference is in its
infancy
The formal language to describe
causal relationships has been
developed fairly recently. This has
enabled both the development of
better computational methods for
causal inference as well as the
clarification of key assumptions.
New methodologies
13. How much of the
impact of an email is
mediated via click
through?
Causal mediation modeling
is designed to reveal the
mechanisms through which
the impact of an
intervention is mediated.
Until recently, there
were no easy-to-implement
methods for running mediation
models on nonparametric
data.
Example: causal
mediation modeling
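As a rough illustration of the mediation question, here is the classic linear product-of-coefficients decomposition on simulated data. This is a deliberately simplified parametric stand-in, not the nonparametric approach the slide refers to: it assumes linear effects, a randomized email, no treatment-mediator interaction and no unmeasured confounding of the mediator and the outcome. All variable names and coefficients are made up.

```python
import random


def cov(xs, ys):
    """Sample covariance (cov(x, x) is the sample variance)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def mediation(t, m, y):
    """Split the effect of t on y into a part via m and a direct part."""
    a = cov(t, m) / cov(t, t)  # treatment -> mediator
    denom = cov(t, t) * cov(m, m) - cov(t, m) ** 2
    # Coefficients of the two-predictor regression y ~ t + m.
    b = (cov(m, y) * cov(t, t) - cov(t, y) * cov(t, m)) / denom
    direct = (cov(t, y) * cov(m, m) - cov(m, y) * cov(t, m)) / denom
    indirect = a * b
    return {
        "indirect": indirect,
        "direct": direct,
        "share_mediated": indirect / (indirect + direct),
    }


rng = random.Random(3)
email, click, outcome = [], [], []
for _ in range(20_000):
    t = float(rng.random() < 0.5)          # email sent at random
    m = 0.6 * t + rng.gauss(0.0, 1.0)      # hypothetical click-through score
    y = 0.5 * m + 0.2 * t + rng.gauss(0.0, 1.0)
    email.append(t)
    click.append(m)
    outcome.append(y)

result = mediation(email, click, outcome)
print(result)
```

In this simulation the assumed indirect effect is 0.6 * 0.5 = 0.3 and the assumed direct effect is 0.2, so about 60% of the email's impact runs through click-through.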
15. Big data enables us to do more and better causal inference
Better participant matching, subpopulation analyses, multi-arm trials and the ability to
identify smaller effects are some of the benefits to causal inference that arise from the
existence of big data.
Big data findings can inspire causal hypotheses
Experiments, quasi-experiments and causal modelling can be used to test hypotheses
about patterns that arise from correlational analysis.
Machine learning methods can help us to estimate causal quantities
Exciting developments in machine learning may help us to estimate counterfactuals like what
the outcome would have looked like for the treated in the absence of the intervention.
Three immediate ways in which big data and causal inference complement each other
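The idea of using a predictive model to estimate the counterfactual for the treated can be sketched like this: fit an outcome model on untreated units only, predict what treated units would have looked like without the intervention, and average the difference. A one-variable least-squares line stands in here for a machine learning model. The data are simulated, and the sketch assumes the single covariate x captures everything that drives both treatment and outcome.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


rng = random.Random(2)
TRUE_EFFECT = 1.0  # assumed effect of the intervention (illustrative)

controls, treated = [], []
for _ in range(10_000):
    x = rng.random()           # observed covariate (hypothetical)
    t = rng.random() < x       # higher x means more likely to be treated
    y = 1.0 + 3.0 * x + TRUE_EFFECT * t + rng.gauss(0.0, 0.5)
    (treated if t else controls).append((x, y))

# Outcome model learned from untreated units only.
slope, intercept = fit_line([x for x, _ in controls], [y for _, y in controls])

# Predicted no-intervention outcome for each treated unit, then the average gap.
att = sum(y - (intercept + slope * x) for x, y in treated) / len(treated)

naive = (sum(y for _, y in treated) / len(treated)
         - sum(y for _, y in controls) / len(controls))
print(f"model-based effect on the treated: {att:.2f}")
print(f"naive difference in means:         {naive:.2f}")
```

With more covariates and nonlinear relationships, the line would be replaced by a more flexible model, which is where larger datasets help.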
17. The rumours of the death of causal
inference were strongly exaggerated
The arguments probably weren’t very good to
begin with, but they did have the merit of
drawing our attention to the intersection of
these two important fields.
Causal inference is here to stay
If anything, the field has become more active
in recent years, thanks to technological and
methodological developments. Correlational
methods alone don’t answer the question of
what would have happened in the absence of
an intervention.
The future belongs to both big data
and causal inference
When it comes to the relationship between
big data and causal inference, perhaps the
most exciting recent developments are in the
area of combining causal inference methods
with big data approaches.
Conclusions