Causal challenges in Artificial Intelligence

Causal challenges for AI
David Lopez-Paz
Facebook AI Research
Clever Hans (1907)
(Sturm, 2014)
Outline
What’s wrong with machine learning?
A causal proposal
Searching for causality I: observational data
Searching for causality II: multiple environments
Conclusion
What succeeds in machine learning?
The recent winner (Hu et al., 2017) achieves a super-human error rate of 2.2%.
What succeeds in machine learning?
(From Kartik Audhkhasi)
What succeeds in machine learning?
(Wikipedia, 2018)
What succeeds in machine learning?
(Silver et al., 2016)
What are the reasons for these successes?
Machines pull off impressive performances at
− recognizing objects after training on more images than a human can see,
− translating natural languages after training on more bilingual text than a human can read,
− beating humans at Atari after playing more games than any teenager can endure,
− mastering Go after playing more grandmaster-level games than mankind ever has
Models consume too much data to solve a single task!
(From Léon Bottou)
What fails in machine learning?
(From Pietro Perona)
What fails in machine learning?
(From Pietro Perona)
What fails in machine learning?
(Rosenfeld et al., 2018)
What fails in machine learning?
(Stock and Cisse, 2017)
What fails in machine learning?
(From Jamie Kiros)
What fails in machine learning?
(Jabri et al., 2016)
What fails in machine learning?
(Szegedy et al., 2013)
What fails in machine learning?
(IBM system at ICLR 2017)
What are the reasons for these failures?
The big lie¹ in machine learning:
    Ptrain(X, Y) = Ptest(X, Y)
¹ As called by Zoubin Ghahramani.
− focus on interpolation
− out-of-distribution catastrophes
− over-justification of “minimizing the average error”
− emphasize the common, forget the rare
− reckless learning
Horses cheat our statistical estimation problems by using unexpected features
Outline
What’s wrong with machine learning?
A causal proposal
Searching for causality I: observational data
Searching for causality II: multiple environments
Conclusion
This talk in one slide
Predict Y from (X, Z). Process generating labeled training data:
    X ← N(0, 1)
    Y ← X + N(0, 1)
    Z ← Y + N(0, 1)
Least-squares solution: Ŷ_LS = X/2 + Z/2
Causal solution: Ŷ_Cau = X
Predict Y from (X, Z). Process generating unlabeled testing data:
    X ← N(0, 1)
    Y ← X + N(0, 1)
    Z ← Y + N(0, 10)
The least-squares solution breaks at testing time!
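The toy example above can be simulated in a few lines; a minimal sketch (the 1/2 coefficients are the population least-squares solution implied by the training covariances):

```python
# Simulate the toy example: least-squares on (X, Z) learns roughly
# Y = X/2 + Z/2 from training data, then breaks when the noise on Z grows.
import numpy as np

rng = np.random.default_rng(0)

def sample(n, z_noise_std):
    X = rng.normal(0, 1, n)
    Y = X + rng.normal(0, 1, n)
    Z = Y + rng.normal(0, z_noise_std, n)
    return X, Y, Z

# Training environment: Z = Y + N(0, 1)
X, Y, Z = sample(100_000, 1.0)
w_ls, *_ = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)

# Testing environment: Z = Y + N(0, 10)
Xt, Yt, Zt = sample(100_000, 10.0)
err_ls = np.mean((np.column_stack([Xt, Zt]) @ w_ls - Yt) ** 2)
err_causal = np.mean((Xt - Yt) ** 2)  # the causal solution predicts Y from X

print(w_ls)        # close to [0.5, 0.5]
print(err_ls)      # explodes: the regression leaned on Z
print(err_causal)  # stays near 1, the irreducible noise level
```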
Getting around the big lie of machine learning
Horses absorb all training correlations recklessly, incl. confounders and spurious patterns
∼
If Ptrain ̸= Ptest, what correlations should we learn and what correlations should we ignore?
Reichenbach’s Principle of Common Cause
Correlations between X and Y arise due to one of three causal structures:
    X → Y,    X ← Y,    or X ← Z → Y (a common cause Z)
What happens to Y when someone manipulates X? Why is Y = 2?
(Reichenbach, 1956) formalizes the claim “dependence does not imply causation”
∼
We are interested in causal correlations (from features to target)
Predicting open umbrellas from rain is more stable than predicting rain from open umbrellas
Focus on causal correlations for invariance?
(Woodward, 2005)
Focus on causal correlations for truth?
(Pearl, 2018)
The causal explanation predicts the outcome of real experiments in the world
∼
We will now explore two ways to discover causality in data using data alone
Outline
What’s wrong with machine learning?
A causal proposal
Searching for causality I: observational data
Searching for causality II: multiple environments
Conclusion
What does causation look like?
(Hertzsprung–Russell diagrams, 1911)
What does causation look like?
(Messerli, 2012)
What does causation look like?
[Scatter plots of V against U and of U against V]
Effect = f(Cause) + Noise, with the Cause independent from the Noise
(Peters et al., 2014)
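The additive-noise footprint can be probed without any learned model. Below is a minimal caricature of the idea (not the method of the slide's citations) on synthetic data: regress each variable on the other and score how strongly the residual spread depends on the input; the bin-based score, the polynomial degree, and the data-generating mechanism are all illustrative choices.

```python
# In the causal direction the regression residuals look like the noise,
# independent of the input; in the anticausal direction they do not.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
X = rng.uniform(-1, 1, n)
Y = X ** 3 + rng.uniform(-0.2, 0.2, n)  # Effect = f(Cause) + Noise

def dependence_score(a, b, degree=5, bins=10):
    """Regress b on a, then measure how much the residual spread varies with a."""
    residuals = b - np.polyval(np.polyfit(a, b, degree), a)
    edges = np.quantile(a, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.digitize(a, edges[1:-1]), 0, bins - 1)
    per_bin_std = np.array([residuals[idx == k].std() for k in range(bins)])
    return per_bin_std.std() / per_bin_std.mean()

forward = dependence_score(X, Y)   # residuals ~ noise: low score
backward = dependence_score(Y, X)  # residual spread varies with Y: high score
print(forward < backward)  # the low-score direction is declared the causal one
```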
What does causation look like?
[Plot of Y = f(X), with the marginal densities P(X) and P(Y)]
Effect = f(Cause), with p(Cause) independent from f′
(Daniusis et al., 2010)
What does causation look like?
[Grid of scatter plots of real cause-effect pairs, each labeled x → y or x ← y]
(Mooij et al., 2014)
NCC: learning causation footprints
[NCC architecture: each point (xij, yij) of the sample {(xij, yij)}, j = 1..mi, is featurized separately by embedding layers; the mi featurizations are averaged; classifier layers output the score P̂(Xi → Yi)]
(Lopez-Paz et al., 2017)
Trained using synthetic data!
NCC is the state-of-the-art
[Classification accuracy versus decision rate, against the RCC, ANM, and IGCI baselines]
NCC is the state-of-the-art
NCC discovers causation in images
Features inside bounding boxes are caused by the presence of objects (wheel)
Features outside bounding boxes cause the presence of objects (road)
(Lopez-Paz et al., 2017)
NCC discovers causation in language
Relation concepts such as “smoking → cancer” emerge between word2vec vectors
[Test accuracy (0.4 to 0.9) of baseline, distribution-based, and feature-based predictors built on counts, PMI, precedence, and word2vec features]
(Rojas-Carulla et al., 2017)
New hopes for unsupervised learning?
There are unexpected causal signals in unsupervised data!
These allow us to gain causal intuitions from data, reducing the need for experimentation
What metrics/divergences best extract these causal signals, while discarding the rest?
We want simple models for a complex world (IKEA instructions)
− Against the usual hope of consistency (P = Q as n → ∞)
First results
Cause-effect discovery ≈ choosing the simplest model (Stegle et al., 2010) using a divergence
− GAN divergences distinguish between cause and effect (Lopez-Paz and Oquab, 2016)
− Discriminator((Cause, Generator(Cause, Noise)), (Cause, Effect))
is harder than
Discriminator((Generator(Effect, Noise), Effect), (Cause, Effect))
− These ideas extend to multiple variables (Goudet et al., 2017; Kalainathan et al., 2018)
− Each divergence has important geometry implications (Bottou et al., 2018)
− Hyperbolic divergences recover complex causal hierarchies (Klimovskaia et al., 2018)
[Embedding of points p1, …, p5 in Euclidean space versus the Poincaré ball, preserving pairwise distances]
First conclusion
There are causal signals in unsupervised data ready to be leveraged in novel ways
Outline
What’s wrong with machine learning?
A causal proposal
Searching for causality I: observational data
Searching for causality II: multiple environments
Conclusion
Moving beyond the big lie
Ptrain(X, Y ) ̸= Ptest(X, Y )
Then, what remains invariant between train and test data?
∼
We assume that Ptrain and Ptest produce data about the same phenomena under different
experimental conditions, circumstances, or environments
∼
To succeed in the test environment, we observe multiple training environments and
− learn what is invariant across environments
− discard what is specific to each environment
∼
There is a causal justification for proceeding this way!
Functional causal models
A common tool to describe causal structures is the Functional Causal Model (FCM)
[Graph: X1 → X2, X3, X4; X3 → X2; X2, X3 → Y]
X1 ← f1(N1)
X2 ← f2(X1, X3, N2)
X3 ← f3(X1, N3) // X1 causes X3
X4 ← f4(X1, N4)
Y ← fy(X2, X3, Ny)
Ni ∼ P(N)
FCMs are compositional and allow counterfactual reasoning
FCMs are generative: evaluating their equations produces the observational distribution P(X, Y)
We can also intervene on the FCM equations to produce interventional distributions P̃(X, Y)!
∼
Each intervention produces one environment (distribution) of the phenomena (FCM) of
interest!
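A toy instantiation makes this concrete. The mechanisms below are made up (the slides leave f1, …, fy abstract, so the linear coefficients are illustrative assumptions): evaluating the equations samples the observational distribution, and replacing one equation by a constant samples the interventional distribution, e.g. under do(X3 = 1.5).

```python
# Sample a toy FCM observationally and under the intervention do(X3 = 1.5).
import numpy as np

rng = np.random.default_rng(0)

def sample_fcm(n, do_x3=None):
    N = rng.normal(0, 1, (n, 5))                  # independent noises Ni
    X1 = N[:, 0]                                  # X1 <- f1(N1)
    X3 = (np.full(n, do_x3) if do_x3 is not None  # intervention severs X3
          else 0.8 * X1 + N[:, 2])                # X3 <- f3(X1, N3)
    X2 = 0.5 * X1 - 0.7 * X3 + N[:, 1]            # X2 <- f2(X1, X3, N2)
    X4 = -0.3 * X1 + N[:, 3]                      # X4 <- f4(X1, N4)
    Y = X2 + X3 + N[:, 4]                         # Y <- fy(X2, X3, Ny)
    return X3, Y

X3_obs, Y_obs = sample_fcm(100_000)
X3_int, Y_int = sample_fcm(100_000, do_x3=1.5)

print(Y_obs.mean())  # ~ 0 under the observational distribution
print(X3_int.std())  # 0: the intervened X3 no longer listens to its parents
print(Y_int.mean())  # shifts: Y responds to do(X3 = 1.5)
```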
Functional causal models
One FCM = multiple interventions/distributions/environments
P^1_train(X, Y) is generated by:
    X1 ← f1(N1)
    X2 ← f2(X1, X3, N2)
    X3 ← 1.5    (intervention)
    X4 ← f4(X1, N4)
    Y ← fy(X2, X3, Ny)
    Ni ∼ P(N)
Functional causal models
One FCM = multiple interventions/distributions/environments
P^2_train(X, Y) is generated by:
    X1 ← N(0, 1)    (intervention)
    X2 ← f2(X1, X3, N2)
    X3 ← f3(X1, N3)
    X4 ← f4(X1, N4)
    Y ← fy(X2, X3, Ny)
    Ni ∼ P(N)
Functional causal models
One FCM = multiple interventions/distributions/environments
P^3_train(X, Y) is generated by:
    X1 ← f1(N1)
    X2 ← f2(X1, X3, N2) + U(−10, 10)    (intervention)
    X3 ← f3(X1, N3)
    X4 ← f4(X1, N4)
    Y ← fy(X2, X3, Ny)
    Ni ∼ P(N)
Functional causal models
[Graph: X1 → X2, X3, X4; X3 → X2; X2, X3 → Y]
    X1 ← f1(N1)
    X2 ← f2(X1, X3, N2)
    X3 ← f3(X1, N3)
    X4 ← f4(X1, N4)
    Y ← fy(X2, X3, Ny)
    Ni ∼ P(N)
If mechanisms are autonomous, and
no intervention disturbs the conditional expectation of the target causal equation:
− the causal conditional expectation E(Y | X2, X3) remains invariant
− the non-causal conditional expectation E(Y | X) may vary wildly!
This reveals the link between invariances across environments and causal structures
∼
How can we find invariant causal predictors?
A simple example: X → Y → Z
For all environments e > 0:
    X^e ← N(0, e)
    Y^e ← X^e + N(0, e)
    Z^e ← Y^e + N(0, 1)
The task is to predict Y^e given (X^e, Z^e) for an unknown test e. We have three options:
    E[Y^e | X^e = x] = x
    E[Y^e | Z^e = z] = (2e / (2e + 1)) z
    E[Y^e | X^e = x, Z^e = z] = (1 / (e + 1)) x + (e / (e + 1)) z
The causal predictor, based on x alone, is invariant!
The state-of-the-art (Ganin et al., 2016; Peters et al., 2016) fails at this simple example
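A numerical check of the example above (a sketch, reading N(0, e) as "Gaussian with variance e"): the regression of Y on X alone has the same coefficient in every environment, while the least-squares predictor using both X and Z drifts with e.

```python
# Per-environment regression coefficients for the X -> Y -> Z example.
import numpy as np

rng = np.random.default_rng(0)

def coefficients(e, n=200_000):
    X = rng.normal(0, np.sqrt(e), n)
    Y = X + rng.normal(0, np.sqrt(e), n)
    Z = Y + rng.normal(0, 1, n)
    w_x = np.polyfit(X, Y, 1)[0]  # slope of E[Y | X]: always 1
    w_xz, *_ = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)
    return w_x, w_xz              # w_xz approaches [1/(e+1), e/(e+1)]

for e in (1.0, 5.0):
    w_x, w_xz = coefficients(e)
    print(e, round(w_x, 2), np.round(w_xz, 2))
# the X-only coefficient stays near 1 in both environments, while the
# (X, Z) coefficients drift from about [1/2, 1/2] toward [1/6, 5/6]
```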
Our proposal
Find a feature representation that leads to the same optimal classifier across environments.
∼
Let w^e_ϕ be the optimal classifier for environment e, when using the featurizer ϕ:

    w^e_ϕ = argmin_w R_{P^e}(w ∘ ϕ),

where R_{P^e}(f) = E_{(x,y)∼P^e}[Error(f(x), y)]. Measure the classifier discrepancy as:

    ‖w^e_ϕ − w^{e′}_ϕ‖_P = ∫ (w^e_ϕ(ϕ(x)) − w^{e′}_ϕ(ϕ(x)))² dP(x)

Let w̄ = (1/|E|) Σ_e w^e_ϕ. Then, our new learning objective is:

    argmin_ϕ Σ_e R_{P^e}(w̄ ∘ ϕ) + λ Σ_{e, e′≠e} ‖w^e_ϕ − w^{e′}_ϕ‖_{P^e}
(Arjovsky et al., 2018)
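To see what the invariance penalty rewards, here is a much-simplified illustration of the objective on the X → Y → Z example (not the method's optimizer): with a linear featurizer ϕ and squared loss, the per-environment optimal classifiers agree when ϕ keeps only the causal feature X, and disagree when it also keeps Z. As an assumption for brevity, the discrepancy below is a plain squared distance between weight vectors rather than the ∫ … dP(x) integral.

```python
# Per-environment optimal classifiers agree only under a causal featurizer.
import numpy as np

rng = np.random.default_rng(0)

def environment(e, n=100_000):
    X = rng.normal(0, np.sqrt(e), n)
    Y = X + rng.normal(0, np.sqrt(e), n)
    Z = Y + rng.normal(0, 1, n)
    return np.column_stack([X, Z]), Y

def optimal_w(features, y, phi):
    F = features @ phi.T  # featurized inputs
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    return w

def discrepancy(phi, envs):
    ws = [optimal_w(f, y, phi) for f, y in envs]
    return sum(np.sum((wa - wb) ** 2) for wa in ws for wb in ws)

envs = [environment(1.0), environment(5.0)]
phi_causal = np.array([[1.0, 0.0]])  # keep X only
phi_all = np.eye(2)                  # keep X and Z

print(discrepancy(phi_causal, envs))  # ~ 0: an invariant classifier exists
print(discrepancy(phi_all, envs))     # > 0: classifiers disagree across e
```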
An approximation to our proposal
    C(ϕ) = Σ_e R_{P^e}(w̄ ∘ ϕ) + λ Σ_{e, e′≠e} ‖w^e_ϕ − w^{e′}_ϕ‖_{P^e}
is an intractable bi-level optimization problem, since each w^e_ϕ is itself the solution of an optimization problem
We approximate the interactions between the two levels using unrolled gradients
∼
1. Initialize ϕ and w^e_ϕ at random, for all e. Then, repeat:
   1.1 Update w^e_ϕ ← Gradient(R_{P^e}, w^e_ϕ) using one step and fixed ϕ, for all e
   1.2 Compute m^e_ϕ ← Gradient(R_{P^e}, w^e_ϕ) using k steps and fixed ϕ, for all e
   1.3 Update ϕ ← Gradient(C, m^e_ϕ) using one step and fixed m^e_ϕ
2. Return ((1/|E|) Σ_e w^e_ϕ) ∘ ϕ
(Arjovsky et al., 2018)
First results
Empirical risk minimization: [figure]
Causal risk minimization: [figure]
∼
Implications for fairness? Partitions of one dataset? Theory?
Multiple environments in the big picture
setup                       | training          | test
----------------------------|-------------------|------------------
generative learning         | U^1_1             | ∅
unsupervised learning       | U^1_1             | U^1_2
supervised learning         | L^1_1             | U^1_1
semi-supervised learning    | L^1_1, U^1_1      | U^1_2
transductive learning       | L^1_1, U^1_1      | U^1_1
multitask learning          | L^1_1, L^2_1      | U^1_2, U^2_2
domain adaptation           | L^1_1, U^2_1      | U^2_2
transfer learning           | U^1_1, L^2_1      | U^2_1
continual learning          | L^1_1, …, L^∞_1   | U^1_1, …, U^∞_1
multi-environment learning  | L^1_1, L^2_1      | U^3_1, U^4_1

− L^i_j: labeled dataset number j drawn from distribution i
− U^i_j: unlabeled dataset number j drawn from distribution i
Second conclusion
Prediction rules based on stable correlations across environments are likely to be causal¹
¹ I call this the principle of causal concentration.
Outline
What’s wrong with machine learning?
A causal proposal
Searching for causality I: observational data
Searching for causality II: multiple environments
Conclusion
Finally: from machine learning to artificial intelligence
AIs will be world simulators that will
− align with the causal outcomes in the world,
− perform robustly across diverse environments,
− interrogate composable autonomous mechanisms to extrapolate,
− allow us to imagine multiple futures given uncertainty about a situation,
− enable counterfactual reasoning for extreme generalization
These causal desiderata are out of reach for current machine learning systems. Let’s get to it!
∼
Thanks!
References I
Martin Arjovsky, Leon Bottou, and David Lopez-Paz. Learning invariant causal rules across environments. In preparation, 2018.
Leon Bottou, Martin Arjovsky, David Lopez-Paz, and Maxime Oquab. Geometrical insights for implicit generative modeling. In Braverman
Readings in Machine Learning. Key Ideas from Inception to Current State. Springer, 2018.
Povilas Daniusis, Dominik Janzing, Joris Mooij, Jakob Zscheischler, Bastian Steudel, Kun Zhang, and Bernhard Schölkopf. Inferring deterministic causal relations. In UAI, 2010.
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
O. Goudet, D. Kalainathan, P. Caillou, I. Guyon, D. Lopez-Paz, and M. Sebag. Causal Generative Neural Networks. arXiv, 2017.
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv, 2017.
Allan Jabri, Armand Joulin, and Laurens van der Maaten. Revisiting visual question answering baselines. In ECCV, 2016.
D. Kalainathan, O. Goudet, I. Guyon, D. Lopez-Paz, and M. Sebag. SAM: Structural Agnostic Model, Causal Discovery and Penalized
Adversarial Learning. arXiv, 2018.
Anna Klimovskaia, Leon Bottou, David Lopez-Paz, and Maximilian Nickel. Poincaré maps recover continuous hierarchies in single-cell data. In preparation, 2018.
David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. ICLR, 2016.
David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Schölkopf, and Léon Bottou. Discovering causal signals in images. CVPR, 2017.
Franz H. Messerli. Chocolate consumption, cognitive function, and nobel laureates. New England Journal of Medicine, 2012.
Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. JMLR, 2014.
Judea Pearl. Theoretical impediments to machine learning with seven sparks from the causal revolution. arXiv, 2018.
Jonas Peters, Joris M. Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. JMLR, 2014.
Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, 2016.
Hans Reichenbach. The direction of time. Dover, 1956.
Mateo Rojas-Carulla, Marco Baroni, and David Lopez-Paz. Causal discovery using proxy variables. In preparation, 2017.
A. Rosenfeld, R. Zemel, and J. K. Tsotsos. The Elephant in the Room. arXiv, 2018.
References II
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
Oliver Stegle, Dominik Janzing, Kun Zhang, Joris M. Mooij, and Bernhard Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In NIPS, 2010.
Pierre Stock and Moustapha Cisse. Convnets and imagenet beyond accuracy: Explanations, bias detection, adversarial examples and
model criticism. arXiv, 2017.
B. L. Sturm. A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia, 2014.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing
properties of neural networks. ICLR, 2013.
James Woodward. Making things happen: A theory of causal explanation. Oxford university press, 2005.
1 of 54

Recommended

ChatGPT and the Future of Work - Clark Boyd by
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
24.3K views69 slides
Getting into the tech field. what next by
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
5.7K views22 slides
Google's Just Not That Into You: Understanding Core Updates & Search Intent by
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
6.4K views99 slides
How to have difficult conversations by
How to have difficult conversations How to have difficult conversations
How to have difficult conversations Rajiv Jayarajah, MAppComm, ACC
5K views19 slides
Introduction to Data Science by
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceChristy Abraham Joy
82.3K views51 slides
Time Management & Productivity - Best Practices by
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
169.7K views42 slides

More Related Content

Recently uploaded

Piloting & Scaling Successfully With Microsoft Viva by
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft VivaRichard Harbridge
13 views160 slides
Data Integrity for Banking and Financial Services by
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial ServicesPrecisely
29 views26 slides
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...Jasper Oosterveld
27 views49 slides
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...Bernd Ruecker
48 views69 slides
The Forbidden VPN Secrets.pdf by
The Forbidden VPN Secrets.pdfThe Forbidden VPN Secrets.pdf
The Forbidden VPN Secrets.pdfMariam Shaba
20 views72 slides
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc
72 views29 slides

Recently uploaded(20)

Piloting & Scaling Successfully With Microsoft Viva by Richard Harbridge
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
Data Integrity for Banking and Financial Services by Precisely
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial Services
Precisely29 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker48 views
The Forbidden VPN Secrets.pdf by Mariam Shaba
The Forbidden VPN Secrets.pdfThe Forbidden VPN Secrets.pdf
The Forbidden VPN Secrets.pdf
Mariam Shaba20 views
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by TrustArc
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc72 views
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf by Dr. Jimmy Schwarzkopf
STKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdfSTKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf
"Surviving highload with Node.js", Andrii Shumada by Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays33 views
Unit 1_Lecture 2_Physical Design of IoT.pdf by StephenTec
Unit 1_Lecture 2_Physical Design of IoT.pdfUnit 1_Lecture 2_Physical Design of IoT.pdf
Unit 1_Lecture 2_Physical Design of IoT.pdf
StephenTec15 views
Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman38 views
"Running students' code in isolation. The hard way", Yurii Holiuk by Fwdays
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk
Fwdays24 views
HTTP headers that make your website go faster - devs.gent November 2023 by Thijs Feryn
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023
Thijs Feryn26 views
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by sugiuralab
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors
sugiuralab23 views
Future of AR - Facebook Presentation by ssuserb54b561
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
ssuserb54b56122 views

Featured

Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present... by
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
55.5K views138 slides
12 Ways to Increase Your Influence at Work by
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
401.7K views64 slides
ChatGPT webinar slides by
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slidesAlireza Esmikhani
30.4K views36 slides
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G... by
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
3.6K views12 slides
Barbie - Brand Strategy Presentation by
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
25.1K views46 slides

Featured(20)

Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present... by Applitools
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Applitools55.5K views
12 Ways to Increase Your Influence at Work by GetSmarter
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
GetSmarter401.7K views
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G... by DevGAMM Conference
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
DevGAMM Conference3.6K views
Barbie - Brand Strategy Presentation by Erica Santiago
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
Erica Santiago25.1K views
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well by Saba Software
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Saba Software25.2K views
Introduction to C Programming Language by Simplilearn
Introduction to C Programming LanguageIntroduction to C Programming Language
Introduction to C Programming Language
Simplilearn8.4K views
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr... by Palo Alto Software
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...
Palo Alto Software88.4K views
9 Tips for a Work-free Vacation by Weekdone.com
9 Tips for a Work-free Vacation9 Tips for a Work-free Vacation
9 Tips for a Work-free Vacation
Weekdone.com7.2K views
How to Map Your Future by SlideShop.com
How to Map Your FutureHow to Map Your Future
How to Map Your Future
SlideShop.com275.1K views
Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -... by AccuraCast
Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -...Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -...
Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -...
AccuraCast3.4K views
Exploring ChatGPT for Effective Teaching and Learning.pptx by Stan Skrabut, Ed.D.
Exploring ChatGPT for Effective Teaching and Learning.pptxExploring ChatGPT for Effective Teaching and Learning.pptx
Exploring ChatGPT for Effective Teaching and Learning.pptx
Stan Skrabut, Ed.D.57.7K views
How to train your robot (with Deep Reinforcement Learning) by Lucas García, PhD
How to train your robot (with Deep Reinforcement Learning)How to train your robot (with Deep Reinforcement Learning)
How to train your robot (with Deep Reinforcement Learning)
Lucas García, PhD42.5K views
4 Strategies to Renew Your Career Passion by Daniel Goleman
4 Strategies to Renew Your Career Passion4 Strategies to Renew Your Career Passion
4 Strategies to Renew Your Career Passion
Daniel Goleman122K views
The Student's Guide to LinkedIn by LinkedIn
The Student's Guide to LinkedInThe Student's Guide to LinkedIn
The Student's Guide to LinkedIn
LinkedIn88K views
Different Roles in Machine Learning Career by Intellipaat
Different Roles in Machine Learning CareerDifferent Roles in Machine Learning Career
Different Roles in Machine Learning Career
Intellipaat12.4K views
Defining a Tech Project Vision in Eight Quick Steps pdf by TechSoup
Defining a Tech Project Vision in Eight Quick Steps pdfDefining a Tech Project Vision in Eight Quick Steps pdf
Defining a Tech Project Vision in Eight Quick Steps pdf
TechSoup 9.7K views

Causal challenges in Artificial Intelligence

  • 1. Causal challenges for AI David Lopez-Paz Facebook AI Research
  • 3. Outline What’s wrong with machine learning? A causal proposal Searching for causality I: observational data Searching for causality II: multiple environments Conclusion
  • 4. What succeeds in machine learning? The recent winner (Hu et al., 2017) achieves a super-human performance of 2.2%.
  • 5. What succeeds in machine learning? (From Kartik Audhkhasi)
  • 6. What succeeds in machine learning? (Wikipedia, 2018)
  • 7. What succeeds in machine learning? (Silver et al., 2016)
  • 8. What are the reasons for these successes? Machines pull impressive performances at − recognizing objects after training on more images than a human can see, − translating natural languages after training on more bilingual text than a human can read, − beating humans at Atari after playing more games than any teenager can endure, − reigning Go after playing more grandmaster level games than mankind Models consume too much data to solve a single task! (From L´eon Bottou)
  • 9. What fails in machine learning? (From Pietro Perona)
  • 10. What fails in machine learning? (From Pietro Perona)
  • 11. What fails in machine learning? (Rosenfeld et al., 2018)
  • 12. What fails in machine learning? (Stock and Cisse, 2017)
  • 13. What fails in machine learning? (From Jamie Kiros)
  • 14. What fails in machine learning? (Jabri et al., 2016)
  • 15. What fails in machine learning? (Szegedy et al., 2013)
  • 16. What fails in machine learning? (IBM system at ICLR 2017)
  • 17. What are the reasons for these failures? The big liea in machine learning: Ptrain(X, Y ) = Ptest(X, Y ) aAs called by Zoubin Ghahramani. − focus on interpolation − out-of-distribution catastrophes − over-justification of “minimizing the average error” − emphasize the common, forget the rare − reckless learning Horses cheat our statistical estimation problems by using unexpected features
  • 18. Outline What’s wrong with machine learning? A causal proposal Searching for causality I: observational data Searching for causality II: multiple environments Conclusion
  • 19. This talk in one slide Predict Y from (X, Z). Process generating labeled training data: X ← N(0, 1), Y ← X + N(0, 1) Z ← Y + N(0, 1). Least-squares solution: YLS = X 2 + Z 2 Causal solution: YCau = X Predict Y from (X, Z). Process generating unlabeled testing data: X ← N(0, 1), Y ← X + N(0, 1) Z ← Y + N(0, 10). Least-squares solution breaks at testing time!
  • 20. Getting around the big lie machine learning Horses absorb all training correlations recklessly, incl. confounders and spurious patterns ∼ If Ptrain ̸= Ptest, what correlations should we learn and what correlations should we ignore?
  • 21. Reichenbach’s Principle of Common Cause Correlations between X and Y arise due to one of the three causal structures X Y X Y X Y Z What happens to Y when someone manipulates X? Why is Y = 2? (Reichenbach, 1956) formalizes the claim “dependence does not imply causation” ∼ We are interested in causal correlations (from features to target) Predicting open umbrellas from rain is more stable than predicting rain from open umbrellas
  • 22. Focus on causal correlations for invariance? (Woodward, 2005)
  • 23. Focus on causal correlations for truth? (Pearl, 2018) The causal explanation predicts the outcome of real experiments in the world ∼ We will now explore two ways to discover causality in data using data alone
  • 24. Outline What’s wrong with machine learning? A causal proposal Searching for causality I: observational data Searching for causality II: multiple environments Conclusion
  • 25. How does causation look like? (Hertzsprung–Russell diagrams, 1911)
  • 26. How does causation look like? (Messerli, 2012)
  • 27. How does causation look like? −1 0 1 U −1 0 1 V −1 0 1 V −1 0 1 U Effect = f(Cause) + Noise Cause independent from Noise (Peters et al., 2014)
  • 28. What does causation look like? [Plot of the densities P(X) and P(Y ) for a deterministic mechanism] Effect = f(Cause), with p(Cause) independent from f′ (Daniusis et al., 2010)
  • 29. What does causation look like? [Benchmark of x → y versus x ← y decisions over many real cause-effect pairs] (Mooij et al., 2014)
  • 30. NCC: learning causation footprints Given a scatterplot {(xij, yij)} for j = 1, . . . , mi: embedding layers featurize each point separately, the point features are averaged, (1/mi) Σj(·), and classifier layers output P̂(Xi → Yi). Trained using synthetic data! (Lopez-Paz et al., 2017)
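The forward pass of this architecture is short enough to sketch. The version below uses untrained random weights and illustrative layer sizes, so it only demonstrates the shape of the computation, not a trained NCC:

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(a):
    return np.maximum(a, 0.0)

# Random, untrained weights; layer sizes are illustrative only
W1 = rng.normal(size=(2, 64)) / np.sqrt(2)   # embedding layers
W2 = rng.normal(size=(64, 1)) / np.sqrt(64)  # classifier layers

def ncc_probability(points):
    emb = relu(points @ W1)              # featurize each point separately
    bag = emb.mean(axis=0)               # permutation-invariant average
    logit = float(bag @ W2)              # classify the averaged embedding
    return 1.0 / (1.0 + np.exp(-logit))  # estimate of P(X_i -> Y_i)

scatter = rng.normal(size=(200, 2))      # one scatterplot {(x_j, y_j)}
p = ncc_probability(scatter)             # a probability in (0, 1)
```

Averaging before classifying makes the output invariant to the ordering of the points, which is why one network can consume scatterplots of any size.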
  • 31. NCC is the state-of-the-art [Plot: classification accuracy versus decision rate, comparing RCC, ANM, and IGCI]
  • 32. NCC is the state-of-the-art
  • 33. NCC discovers causation in images Features inside bounding boxes are caused by the presence of objects (wheel) Features outside bounding boxes cause the presence of objects (road) [Plot: object-feature ratio] (Lopez-Paz et al., 2017)
  • 34. NCC discovers causation in language Between word2vec vectors, NCC recovers relation concepts such as “smoking → cancer” [Plot: test accuracy of baseline, distribution-based, and feature-based methods] (Rojas-Carulla et al., 2017)
  • 35. New hopes for unsupervised learning? There are unexpected causal signals in unsupervised data! These allow us to gain causal intuitions from data, reducing the need for experimentation What metrics/divergences best extract these causal signals, while discarding the rest? We want simple models for a complex world (IKEA instructions) − Against the usual hope of consistency (P = Q as n → ∞)
  • 36. First results Cause-effect discovery ≈ choosing the simplest model (Stegle et al., 2010) using a divergence − GAN divergences distinguish between cause and effect (Lopez-Paz and Oquab, 2016) − Discriminator((Cause, Generator(Cause, Noise)), (Cause, Effect)) is harder than Discriminator((Generator(Effect, Noise), Effect), (Cause, Effect)) − These ideas extend to multiple variables (Goudet et al., 2017; Kalainathan et al., 2018) − Each divergence has important geometry implications (Bottou et al., 2018) − Hyperbolic divergences recover complex causal hierarchies (Klimovskaia et al., 2018) [Diagram: embedding a hierarchy in the Poincaré ball, rather than in Euclidean space, preserves pairwise distances]
  • 37. First conclusion There are causal signals in unsupervised data ready to be leveraged in novel ways
  • 38. Outline What’s wrong with machine learning? A causal proposal Searching for causality I: observational data Searching for causality II: multiple environments Conclusion
  • 39. Moving beyond the big lie Ptrain(X, Y ) ̸= Ptest(X, Y ) Then, what remains invariant between train and test data? ∼ We assume that Ptrain and Ptest produce data about the same phenomena under different experimental conditions, circumstances, or environments ∼ To succeed in the test environment, we observe multiple training environments and − learn what is invariant across environments − discard what is specific to each environment ∼ There is a causal justification for proceeding this way!
  • 40. Functional causal models A common tool to describe causal structures is the Functional Causal Model (FCM) X1 ← f1(N1) X2 ← f2(X1, X3, N2) X3 ← f3(X1, N3) // X1 causes X3 X4 ← f4(X1, N4) Y ← fy(X2, X3, Ny) Ni ∼ P(N) FCMs are compositional and allow counterfactual reasoning FCMs are generative: executing their equations produces the observational distribution P(X, Y ) We can also intervene on the FCM equations to produce interventional distributions P̃(X, Y )! ∼ Each intervention produces one environment (distribution) of the phenomenon (FCM) of interest!
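Executing and intervening on an FCM can be sketched in a few lines. Since the slide leaves the mechanisms f_i unspecified, the linear forms below are placeholders of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_fcm(n, do_x3=None):
    # The slide's graph with placeholder mechanisms (not from the talk)
    n1, n2, n3, n4, ny = rng.normal(size=(5, n))
    x1 = n1                                                  # X1 <- f1(N1)
    x3 = np.full(n, do_x3) if do_x3 is not None else x1 + n3 # X3 <- f3(X1, N3)
    x2 = x1 - x3 + n2                                        # X2 <- f2(X1, X3, N2)
    x4 = 2.0 * x1 + n4                                       # X4 <- f4(X1, N4)
    y = x2 + x3 + ny                                         # Y  <- fy(X2, X3, Ny)
    return x1, x2, x3, x4, y

obs = sample_fcm(10_000)             # observational P(X, Y)
itv = sample_fcm(10_000, do_x3=1.5)  # interventional distribution, do(X3 = 1.5)
```

The same code object yields one distribution per intervention, which is exactly the sense in which one FCM generates multiple environments.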
  • 41. Functional causal models One FCM = multiple interventions/distributions/environments P1 train(X, Y ) ∼ X1 = f1(N1) X2 = f2(X1, X3, N2) X3 = 1.5 X4 = f4(X1, N4) Y = fy(X2, X3, Ny) Ni ∼ P(N)
  • 42. Functional causal models One FCM = multiple interventions/distributions/environments P2 train(X, Y ) ∼ X1 ∼ N(0, 1) X2 = f2(X1, X3, N2) X3 = f3(X1, N3) X4 = f4(X1, N4) Y = fy(X2, X3, Ny) Ni ∼ P(N)
  • 43. Functional causal models One FCM = multiple interventions/distributions/environments P3 train(X, Y ) ∼ X1 = f1(N1) X2 = f2(X1, X3, N2) + U(−10, 10) X3 = f3(X1, N3) X4 = f4(X1, N4) Y = fy(X2, X3, Ny) Ni ∼ P(N)
  • 44. Functional causal models X1 = f1(N1) X2 = f2(X1, X3, N2) X3 = f3(X1, N3) X4 = f4(X1, N4) Y = fy(X2, X3, Ny) Ni ∼ P(N) If mechanisms are autonomous, and no intervention disturbs the target causal equation: − the causal conditional expectation E(Y |X2, X3) remains invariant − the non-causal conditional expectation E(Y |X) may vary wildly! This reveals the link between invariances across environments and causal structures ∼ How can we find invariant causal predictors?
  • 45. A simple example: X → Y → Z For all environments e ∈ R: Xe ← N(0, e), Y e ← Xe + N(0, e), Ze ← Y e + N(0, 1). The task is to predict Y e given (Xe, Ze) for an unknown test e. We have three options: E[Y e | Xe = x] = x, E[Y e | Ze = z] = (2e / (2e + 1)) z, E[Y e | Xe = x, Ze = z] = (1 / (e + 1)) x + (e / (e + 1)) z. The causal predictor based on x is invariant! The state-of-the-art (Ganin et al., 2016; Peters et al., 2016) fails at this simple example
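The three regression coefficients can be checked numerically across environments; a numpy sketch, interpreting N(0, e) as a Gaussian with variance e:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

def regressions(e):
    # X^e <- N(0, e), Y^e <- X^e + N(0, e), Z^e <- Y^e + N(0, 1)
    x = rng.normal(0, np.sqrt(e), n)
    y = x + rng.normal(0, np.sqrt(e), n)
    z = y + rng.normal(0, 1, n)
    b_x = np.polyfit(x, y, 1)[0]    # slope of E[Y^e | X^e]
    b_z = np.polyfit(z, y, 1)[0]    # slope of E[Y^e | Z^e]
    b_xz, *_ = np.linalg.lstsq(np.column_stack([x, z]), y, rcond=None)
    return b_x, b_z, b_xz

results = {e: regressions(e) for e in (1.0, 5.0)}
# b_x stays at 1 for every e; b_z tracks 2e/(2e+1);
# b_xz tracks (1/(e+1), e/(e+1)): only the causal predictor is invariant
```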
  • 46. Our proposal Find a feature representation that leads to the same optimal classifier across environments. ∼ Let w^e_ϕ be the optimal classifier for environment e, when using the featurizer ϕ: w^e_ϕ = arg min_w R_{P^e}(w ◦ ϕ), where R_{P^e}(f) = E_{(x,y)∼P^e}[Error(f(x), y)]. Measure classifier discrepancy: ∥w^e_ϕ − w^{e′}_ϕ∥_P = ∫ (w^e_ϕ(ϕ(x)) − w^{e′}_ϕ(ϕ(x)))² dP(x). Let w̄ be the average of the per-environment classifiers w^e_ϕ. Then, our new learning objective is: arg min_ϕ Σ_e R_{P^e}(w̄ ◦ ϕ) + λ Σ_{e, e′≠e} ∥w^e_ϕ − w^{e′}_ϕ∥_{P^e} (Arjovsky et al., 2018)
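On the earlier X → Y → Z example, the discrepancy term already separates causal from non-causal featurizers. A simplified sketch where ϕ selects feature columns and a plain Euclidean distance between weights stands in for the ∥·∥_{P^e} norm (both simplifications are mine):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

def environment(noise_z):
    # X -> Y -> Z, with environment-specific noise on Z
    x = rng.normal(size=n)
    y = x + rng.normal(size=n)
    z = y + noise_z * rng.normal(size=n)
    return np.column_stack([x, z]), y

envs = [environment(1.0), environment(3.0)]  # two training environments

def discrepancy(feature_idx):
    # phi keeps only the given feature columns; w^e is the
    # per-environment least-squares classifier on top of phi
    ws = []
    for X, y in envs:
        w, *_ = np.linalg.lstsq(X[:, feature_idx], y, rcond=None)
        ws.append(w)
    return float(np.sum((ws[0] - ws[1]) ** 2))

pen_causal = discrepancy([0])   # phi keeps X only: w^e barely moves
pen_all = discrepancy([0, 1])   # phi keeps (X, Z): w^e shifts with e
# The invariance penalty favours the causal featurizer phi = X
```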
  • 47. An approximation to our proposal C(ϕ) = Σ_e R_{P^e}(w̄ ◦ ϕ) + λ Σ_{e, e′≠e} ∥w^e_ϕ − w^{e′}_ϕ∥_{P^e} is an intractable bi-level optimization problem, since each w^e_ϕ is itself the solution of an inner optimization problem. We approximate the interactions between the optimization problems using unrolled gradients ∼ 1. Initialize ϕ and w^e_ϕ at random, for all e 1.1 Update w^e_ϕ ← Gradient(R_{P^e}, w^e_ϕ) using one step and fixed ϕ, for all e 1.2 Update m^e_ϕ ← Gradient(R_{P^e}, w^e_ϕ) using k steps and fixed ϕ, for all e 1.3 Update ϕ ← Gradient(C, m^e_ϕ) using one step and fixed m^e_ϕ 2. Return w̄ ◦ ϕ, with w̄ the average of the w^e_ϕ (Arjovsky et al., 2018)
  • 48. First results Empirical risk minimization: [figure] Causal risk minimization: [figure] ∼ Implications for fairness? Partitions of one dataset? Theory?
  • 49. Multiple environments in the big picture

    setup                      | training              | test
    generative learning        | U_1^1                 | ∅
    unsupervised learning      | U_1^1                 | U_1^2
    supervised learning        | L_1^1                 | U_1^1
    semi-supervised learning   | L_1^1, U_1^1          | U_1^2
    transductive learning      | L_1^1, U_1^1          | U_1^1
    multitask learning         | L_1^1, L_2^1          | U_1^2, U_2^2
    domain adaptation          | L_1^1, U_2^1          | U_2^2
    transfer learning          | U_1^1, L_2^1          | U_2^1
    continual learning         | L_1^1, . . . , L_∞^1  | U_1^1, . . . , U_∞^1
    multi-environment learning | L_1^1, L_2^1          | U_3^1, U_4^1

    − L_i^j: labeled dataset number j drawn from distribution i − U_i^j: unlabeled dataset number j drawn from distribution i
  • 50. Second conclusion Prediction rules based on stable correlations across environments are likely to be causal¹ ¹I call this the principle of causal concentration.
  • 51. Outline What’s wrong with machine learning? A causal proposal Searching for causality I: observational data Searching for causality II: multiple environments Conclusion
  • 52. Finally: from machine learning to artificial intelligence AIs will be world simulators that will − align with the causal outcomes in the world, − perform robustly across diverse environments, − interrogate composable autonomous mechanisms to extrapolate, − allow us to imagine multiple futures given uncertainty about a situation, − enable counterfactual reasoning for extreme generalization These causal desiderata are out of reach for current machine learning systems. Let’s get to it! ∼ Thanks!
  • 53. References I
    Martin Arjovsky, Léon Bottou, and David Lopez-Paz. Learning invariant causal rules across environments. In preparation, 2018.
    Léon Bottou, Martin Arjovsky, David Lopez-Paz, and Maxime Oquab. Geometrical insights for implicit generative modeling. In Braverman Readings in Machine Learning. Key Ideas from Inception to Current State. Springer, 2018.
    Povilas Daniusis, Dominik Janzing, Joris Mooij, Jakob Zscheischler, Bastian Steudel, Kun Zhang, and Bernhard Schölkopf. Inferring deterministic causal relations. In UAI, 2010.
    Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
    O. Goudet, D. Kalainathan, P. Caillou, I. Guyon, D. Lopez-Paz, and M. Sebag. Causal Generative Neural Networks. arXiv, 2017.
    Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv, 2017.
    Allan Jabri, Armand Joulin, and Laurens van der Maaten. Revisiting visual question answering baselines. In ECCV, 2016.
    D. Kalainathan, O. Goudet, I. Guyon, D. Lopez-Paz, and M. Sebag. SAM: Structural Agnostic Model, Causal Discovery and Penalized Adversarial Learning. arXiv, 2018.
    Anna Klimovskaia, Léon Bottou, David Lopez-Paz, and Maximilian Nickel. Poincaré maps recover continuous hierarchies in single-cell data. In preparation, 2018.
    David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. ICLR, 2016.
    David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Schölkopf, and Léon Bottou. Discovering causal signals in images. CVPR, 2017.
    Franz H. Messerli. Chocolate consumption, cognitive function, and Nobel laureates. New England Journal of Medicine, 2012.
    Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. JMLR, 2014.
    Judea Pearl. Theoretical impediments to machine learning with seven sparks from the causal revolution. arXiv, 2018.
    Jonas Peters, Joris M. Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. JMLR, 2014.
    Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, 2016.
    Hans Reichenbach. The direction of time. Dover, 1956.
    Mateo Rojas-Carulla, Marco Baroni, and David Lopez-Paz. Causal discovery using proxy variables. In preparation, 2017.
    A. Rosenfeld, R. Zemel, and J. K. Tsotsos. The Elephant in the Room. arXiv, 2018.
  • 54. References II
    David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
    Oliver Stegle, Dominik Janzing, Kun Zhang, Joris M. Mooij, and Bernhard Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In NIPS, 2010.
    Pierre Stock and Moustapha Cisse. Convnets and imagenet beyond accuracy: Explanations, bias detection, adversarial examples and model criticism. arXiv, 2017.
    B. L. Sturm. A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia, 2014.
    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ICLR, 2013.
    James Woodward. Making things happen: A theory of causal explanation. Oxford University Press, 2005.