- 1. Causal challenges for AI. David Lopez-Paz, Facebook AI Research
- 2. Clever Hans (1907) (Sturm, 2014)
- 3. Outline
  - What’s wrong with machine learning?
  - A causal proposal
  - Searching for causality I: observational data
  - Searching for causality II: multiple environments
  - Conclusion
- 4. What succeeds in machine learning? The recent ImageNet winner (Hu et al., 2017) achieves a super-human top-5 error of 2.2%.
- 5. What succeeds in machine learning? (From Kartik Audhkhasi)
- 6. What succeeds in machine learning? (Wikipedia, 2018)
- 7. What succeeds in machine learning? (Silver et al., 2016)
- 8. What are the reasons for these successes? Machines achieve impressive performance at
  - recognizing objects after training on more images than a human can see,
  - translating natural languages after training on more bilingual text than a human can read,
  - beating humans at Atari after playing more games than any teenager can endure,
  - reigning at Go after playing more grandmaster-level games than mankind has.

  Models consume too much data to solve a single task! (From Léon Bottou)
- 9. What fails in machine learning? (From Pietro Perona)
- 10. What fails in machine learning? (From Pietro Perona)
- 11. What fails in machine learning? (Rosenfeld et al., 2018)
- 12. What fails in machine learning? (Stock and Cisse, 2017)
- 13. What fails in machine learning? (From Jamie Kiros)
- 14. What fails in machine learning? (Jabri et al., 2016)
- 15. What fails in machine learning? (Szegedy et al., 2013)
- 16. What fails in machine learning? (IBM system at ICLR 2017)
- 17. What are the reasons for these failures? The big lie in machine learning (as called by Zoubin Ghahramani): $P_{\text{train}}(X, Y) = P_{\text{test}}(X, Y)$
  - focus on interpolation
  - out-of-distribution catastrophes
  - over-justification of “minimizing the average error”
  - emphasize the common, forget the rare
  - reckless learning

  Horses cheat our statistical estimation problems by using unexpected features
- 18. Outline
  - What’s wrong with machine learning?
  - A causal proposal
  - Searching for causality I: observational data
  - Searching for causality II: multiple environments
  - Conclusion
- 19. This talk in one slide. Predict $Y$ from $(X, Z)$.
  - Process generating labeled training data: $X \leftarrow N(0, 1)$, $Y \leftarrow X + N(0, 1)$, $Z \leftarrow Y + N(0, 1)$.
  - Least-squares solution: $Y_{\text{LS}} = \frac{X}{2} + \frac{Z}{2}$
  - Causal solution: $Y_{\text{Cau}} = X$
  - Process generating unlabeled testing data: $X \leftarrow N(0, 1)$, $Y \leftarrow X + N(0, 1)$, $Z \leftarrow Y + N(0, 10)$.

  Least-squares solution breaks at testing time!
- 20. Getting around the big lie of machine learning. Horses absorb all training correlations recklessly, including confounders and spurious patterns ∼ If $P_{\text{train}} \neq P_{\text{test}}$, what correlations should we learn and what correlations should we ignore?
- 21. Reichenbach’s Principle of Common Cause. Correlations between X and Y arise due to one of three causal structures: X → Y, X ← Y, or X ← Z → Y. What happens to Y when someone manipulates X? Why is Y = 2? (Reichenbach, 1956) formalizes the claim “dependence does not imply causation” ∼ We are interested in causal correlations (from features to target). Predicting open umbrellas from rain is more stable than predicting rain from open umbrellas
- 22. Focus on causal correlations for invariance? (Woodward, 2005)
- 23. Focus on causal correlations for truth? (Pearl, 2018) The causal explanation predicts the outcome of real experiments in the world ∼ We will now explore two ways to discover causality in data using data alone
- 24. Outline
  - What’s wrong with machine learning?
  - A causal proposal
  - Searching for causality I: observational data
  - Searching for causality II: multiple environments
  - Conclusion
- 25. What does causation look like? (Hertzsprung–Russell diagrams, 1911)
- 26. What does causation look like? (Messerli, 2012)
- 27. What does causation look like? [Figure: scatter plots of U against V and of V against U, axes from −1 to 1] Effect = f(Cause) + Noise, with Cause independent from Noise (Peters et al., 2014)
- 28. What does causation look like? [Figure: Y against X, with marginals P(X) and P(Y)] Effect = f(Cause), with p(Cause) independent from f′ (Daniusis et al., 2010)
- 29. What does causation look like? [Figure: grid of scatter plots of variable pairs, each labeled x → y or x ← y] (Mooij et al., 2014)
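- 30. NCC: learning causation footprints. Each sample $\{(x_{ij}, y_{ij})\}_{j=1}^{m_i}$ has its points $(x_{i1}, y_{i1}), \ldots, (x_{im_i}, y_{im_i})$ featurized separately by embedding layers; the embeddings are averaged, $\frac{1}{m_i} \sum_{j=1}^{m_i} (\cdot)$, and classifier layers map the average to $\hat{P}(X_i \to Y_i)$ (Lopez-Paz et al., 2017). Trained using synthetic data!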
- 31. NCC is the state-of-the-art [Figure: classification accuracy against decision rate, both axes from 0 to 100, with curves for RCC, ANM, and IGCI]
- 32. NCC is the state-of-the-art
- 33. NCC discovers causation in images. Features inside bounding boxes are caused by the presence of objects (wheel). Features outside bounding boxes cause the presence of objects (road). [Figure axis: object-feature ratio] (Lopez-Paz et al., 2017)
- 34. NCC discovers causation in language, between word2vec vectors of related concepts such as “smoking → cancer” [Figure: test accuracy, axis from 0.4 to 0.9, for three groups of methods (baselines, distribution-based, feature-based) built from counts, PMI, PPMI, precedence, and word2vec statistics] (Rojas-Carulla et al., 2017)
- 35. New hopes for unsupervised learning? There are unexpected causal signals in unsupervised data! These allow us to gain causal intuitions from data, reducing the need for experimentation. What metrics/divergences best extract these causal signals, while discarding the rest? We want simple models for a complex world (IKEA instructions)
  - Against the usual hope of consistency ($P = Q$ as $n \to \infty$)
- 36. First results. Cause-effect discovery ≈ choosing the simplest model (Stegle et al., 2010) using a divergence
  - GAN divergences distinguish between cause and effect (Lopez-Paz and Oquab, 2016)
  - Discriminator((Cause, Generator(Cause, Noise)), (Cause, Effect)) is harder than Discriminator((Generator(Effect, Noise), Effect), (Cause, Effect))
  - These ideas extend to multiple variables (Goudet et al., 2017; Kalainathan et al., 2018)
  - Each divergence has important geometric implications (Bottou et al., 2018)
  - Hyperbolic divergences recover complex causal hierarchies (Klimovskaia et al., 2018)

  [Figure, panels a to c: points p1 to p5; Euclidean space; Poincaré ball; preserve pairwise distances]
- 37. First conclusion There are causal signals in unsupervised data ready to be leveraged in novel ways
- 38. Outline
  - What’s wrong with machine learning?
  - A causal proposal
  - Searching for causality I: observational data
  - Searching for causality II: multiple environments
  - Conclusion
- 39. Moving beyond the big lie: $P_{\text{train}}(X, Y) \neq P_{\text{test}}(X, Y)$. Then, what remains invariant between train and test data? ∼ We assume that $P_{\text{train}}$ and $P_{\text{test}}$ produce data about the same phenomena under different experimental conditions, circumstances, or environments ∼ To succeed at the test environment, we observe multiple training environments and
  - learn what is invariant across environments
  - discard what is specific to each environment

  There is a causal justification for proceeding this way!
- 40. Functional causal models. A common tool to describe causal structures is the Functional Causal Model (FCM). [Figure: causal graph over $X_1, X_2, X_3, X_4, Y$]
  - $X_1 \leftarrow f_1(N_1)$
  - $X_2 \leftarrow f_2(X_1, X_3, N_2)$
  - $X_3 \leftarrow f_3(X_1, N_3)$ ($X_1$ causes $X_3$)
  - $X_4 \leftarrow f_4(X_1, N_4)$
  - $Y \leftarrow f_y(X_2, X_3, N_y)$
  - $N_i \sim P(N)$

  FCMs are compositional and allow counterfactual reasoning. FCMs are generative: observing their equations produces the observational distribution $P(X, Y)$. We can also intervene on the FCM equations to produce interventional distributions $\tilde{P}(X, Y)$! Each intervention produces one environment (distribution) of the phenomena (FCM) of interest!
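- 41. Functional causal models. One FCM = multiple interventions/distributions/environments. $P^1_{\text{train}}(X, Y) \sim$ [Figure: causal graph over $X_1, X_2, X_3, X_4, Y$] $X_1 = f_1(N_1)$, $X_2 = f_2(X_1, X_3, N_2)$, $X_3 = 1.5$ (intervention), $X_4 = f_4(X_1, N_4)$, $Y = f_y(X_2, X_3, N_y)$, $N_i \sim P(N)$
- 42. Functional causal models. One FCM = multiple interventions/distributions/environments. $P^2_{\text{train}}(X, Y) \sim$ [Figure: causal graph over $X_1, X_2, X_3, X_4, Y$] $X_1 \sim N(0, 1)$ (intervention), $X_2 = f_2(X_1, X_3, N_2)$, $X_3 = f_3(X_1, N_3)$, $X_4 = f_4(X_1, N_4)$, $Y = f_y(X_2, X_3, N_y)$, $N_i \sim P(N)$
- 43. Functional causal models. One FCM = multiple interventions/distributions/environments. $P^3_{\text{train}}(X, Y) \sim$ [Figure: causal graph over $X_1, X_2, X_3, X_4, Y$] $X_1 = f_1(N_1)$, $X_2 = f_2(X_1, X_3, N_2) + U(-10, 10)$ (intervention), $X_3 = f_3(X_1, N_3)$, $X_4 = f_4(X_1, N_4)$, $Y = f_y(X_2, X_3, N_y)$, $N_i \sim P(N)$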
- 44. Functional causal models. [Figure: causal graph over $X_1, X_2, X_3, X_4, Y$] $X_1 = f_1(N_1)$, $X_2 = f_2(X_1, X_3, N_2)$, $X_3 = f_3(X_1, N_3)$, $X_4 = f_4(X_1, N_4)$, $Y = f_y(X_2, X_3, N_y)$, $N_i \sim P(N)$. If mechanisms are autonomous, and no intervention disturbs the conditional expectation of the target causal equation:
  - the causal conditional expectation $E(Y \mid X_2, X_3)$ remains invariant
  - the non-causal conditional expectation $E(Y \mid X)$ may vary wildly!

  This reveals the link between invariances across environments and causal structures ∼ How can we find invariant causal predictors?
- 45. A simple example: $X \to Y \to Z$. For all environments $e \in \mathbb{R}$: $X^e \leftarrow N(0, e)$, $Y^e \leftarrow X^e + N(0, e)$, $Z^e \leftarrow Y^e + N(0, 1)$. The task is to predict $Y^e$ given $(X^e, Z^e)$ for unknown test $e$. We have three options:
  - $E[Y^e \mid X^e = x] = x$
  - $E[Y^e \mid Z^e = z] = \frac{2e}{2e + 1} z$
  - $E[Y^e \mid X^e = x, Z^e = z] = \frac{1}{e + 1} x + \frac{e}{e + 1} z$

  The causal predictor based on $x$ is invariant! The state-of-the-art (Ganin et al., 2016; Peters et al., 2016) fails at this simple example
- 46. Our proposal. Find a feature representation that leads to the same optimal classifier across environments. ∼ Let $w^e_\phi$ be the optimal classifier for environment $e$, when using the featurizer $\phi$:

  $$w^e_\phi = \arg\min_w R_{P^e}(w \circ \phi), \quad \text{where } R_{P^e}(f) = E_{(x,y) \sim P^e}\left[\text{Error}(f(x), y)\right].$$

  Measure classifier discrepancy:

  $$\|w^e_\phi - w^{e'}_\phi\|_P = \int \left(w^e_\phi(\phi(x)) - w^{e'}_\phi(\phi(x))\right)^2 dP(X)$$

  Let $\bar{w} = \frac{1}{e} \sum_e w^e_\phi$. Then, our new learning objective is:

  $$\arg\min_\phi \sum_e R_{P^e}(\bar{w} \circ \phi) + \lambda \sum_{e, e' \neq e} \|w^e_\phi - w^{e'}_\phi\|_{P^e}$$

  (Arjovsky et al., 2018)
- 47. An approximation to our proposal.

  $$C(\phi) = \sum_e R_{P^e}(\bar{w} \circ \phi) + \lambda \sum_{e, e' \neq e} \|w^e_\phi - w^{e'}_\phi\|_{P^e}$$

  is an intractable bi-level optimization problem, since $w^e_\phi$ is an optimization problem itself. We approximate the interactions between the optimization problems using unrolled gradients:
  - 1. Initialize at random $\phi$ and $w^e_\phi$, for all $e$
    - 1.1 Update $w^e_\phi \leftarrow \text{Gradient}(R_{P^e}, w^e_\phi)$ using one step and fixed $\phi$, for all $e$
    - 1.2 Update $m^e_\phi \leftarrow \text{Gradient}(R_{P^e}, w^e_\phi)$ using $k$ steps and fixed $\phi$, for all $e$
    - 1.3 Update $\phi \leftarrow \text{Gradient}(C, m^e_\phi)$ using one step and fixed $m^e_\phi$
  - 2. Return $\left(\frac{1}{e} \sum_e w^e_\phi\right) \circ \phi$

  (Arjovsky et al., 2018)
- 48. First results. Empirical risk minimization: [figure]. Causal risk minimization: [figure] ∼ Implications for fairness? Partitions of one dataset? Theory?
- 49. Multiple environments in the big picture

  | setup | training | test |
  |---|---|---|
  | generative learning | $U^1_1$ | $\emptyset$ |
  | unsupervised learning | $U^1_1$ | $U^1_2$ |
  | supervised learning | $L^1_1$ | $U^1_1$ |
  | semi-supervised learning | $L^1_1 U^1_1$ | $U^1_2$ |
  | transductive learning | $L^1_1 U^1_1$ | $U^1_1$ |
  | multitask learning | $L^1_1 L^2_1$ | $U^1_2 U^2_2$ |
  | domain adaptation | $L^1_1 U^2_1$ | $U^2_2$ |
  | transfer learning | $U^1_1 L^2_1$ | $U^2_1$ |
  | continual learning | $L^1_1, \ldots, L^\infty_1$ | $U^1_1, \ldots, U^\infty_1$ |
  | multi-environment learning | $L^1_1 L^2_1$ | $U^3_1 U^4_1$ |

  - $L^i_j$: labeled dataset number $j$ drawn from distribution $i$
  - $U^i_j$: unlabeled dataset number $j$ drawn from distribution $i$
- 50. Second conclusion. Prediction rules based on stable correlations across environments are likely to be causal. (Footnote: I call this the principle of causal concentration.)
- 51. Outline
  - What’s wrong with machine learning?
  - A causal proposal
  - Searching for causality I: observational data
  - Searching for causality II: multiple environments
  - Conclusion
- 52. Finally: from machine learning to artificial intelligence. AIs will be world simulators that will
  - align with the causal outcomes in the world,
  - perform robustly across diverse environments,
  - interrogate composable autonomous mechanisms to extrapolate,
  - allow us to imagine multiple futures given uncertainty about a situation,
  - enable counterfactual reasoning for extreme generalization

  These causal desiderata are out of reach for current machine learning systems. Let’s get to it! ∼ Thanks!
- 53. References I
  - Martin Arjovsky, Léon Bottou, and David Lopez-Paz. Learning invariant causal rules across environments. In preparation, 2018.
  - Léon Bottou, Martin Arjovsky, David Lopez-Paz, and Maxime Oquab. Geometrical insights for implicit generative modeling. In Braverman Readings in Machine Learning. Key Ideas from Inception to Current State. Springer, 2018.
  - Povilas Daniusis, Dominik Janzing, Joris Mooij, Jakob Zscheischler, Bastian Steudel, Kun Zhang, and Bernhard Schölkopf. Inferring deterministic causal relations. In UAI, 2010.
  - Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
  - O. Goudet, D. Kalainathan, P. Caillou, I. Guyon, D. Lopez-Paz, and M. Sebag. Causal Generative Neural Networks. arXiv, 2017.
  - Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv, 2017.
  - Allan Jabri, Armand Joulin, and Laurens van der Maaten. Revisiting visual question answering baselines. In ECCV, 2016.
  - D. Kalainathan, O. Goudet, I. Guyon, D. Lopez-Paz, and M. Sebag. SAM: Structural Agnostic Model, Causal Discovery and Penalized Adversarial Learning. arXiv, 2018.
  - Anna Klimovskaia, Léon Bottou, David Lopez-Paz, and Maximilian Nickel. Poincaré maps recover continuous hierarchies in single-cell data. In preparation, 2018.
  - David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. ICLR, 2016.
  - David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Schölkopf, and Léon Bottou. Discovering causal signals in images. CVPR, 2017.
  - Franz H. Messerli. Chocolate consumption, cognitive function, and Nobel laureates. New England Journal of Medicine, 2012.
  - Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. JMLR, 2014.
  - Judea Pearl. Theoretical impediments to machine learning with seven sparks from the causal revolution. arXiv, 2018.
  - Jonas Peters, Joris M. Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. JMLR, 2014.
  - Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, 2016.
  - Hans Reichenbach. The direction of time. Dover, 1956.
  - Mateo Rojas-Carulla, Marco Baroni, and David Lopez-Paz. Causal discovery using proxy variables. In preparation, 2017.
  - A. Rosenfeld, R. Zemel, and J. K. Tsotsos. The Elephant in the Room. arXiv, 2018.
- 54. References II
  - David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
  - Oliver Stegle, Dominik Janzing, Kun Zhang, Joris M. Mooij, and Bernhard Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In NIPS, 2010.
  - Pierre Stock and Moustapha Cisse. Convnets and ImageNet beyond accuracy: Explanations, bias detection, adversarial examples and model criticism. arXiv, 2017.
  - B. L. Sturm. A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia, 2014.
  - Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ICLR, 2013.
  - James Woodward. Making things happen: A theory of causal explanation. Oxford University Press, 2005.
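The toy problem of slide 19 ("This talk in one slide") can be checked numerically. The following is a minimal sketch, not the speaker's code: it assumes the second argument of N(·, ·) is a variance, and the names `sample` and `mse` are illustrative. It fits least squares on the training process and compares it with the causal predictor Y = X on the shifted test process.

```python
import numpy as np

def sample(n, z_noise_var, rng):
    """Draw (X, Y, Z) from the chain X -> Y -> Z of slide 19."""
    x = rng.normal(0.0, 1.0, n)
    y = x + rng.normal(0.0, 1.0, n)
    # Assumption: N(0, v) is read as variance v, so the std is sqrt(v).
    z = y + rng.normal(0.0, np.sqrt(z_noise_var), n)
    return x, y, z

def mse(pred, y):
    """Mean squared error between predictions and targets."""
    return float(np.mean((pred - y) ** 2))

rng = np.random.default_rng(0)
x_tr, y_tr, z_tr = sample(200_000, 1.0, rng)    # training: Z <- Y + N(0, 1)
x_te, y_te, z_te = sample(200_000, 10.0, rng)   # testing:  Z <- Y + N(0, 10)

# Least squares on (X, Z) without intercept (all variables are zero-mean).
features_tr = np.column_stack([x_tr, z_tr])
coef_ls, *_ = np.linalg.lstsq(features_tr, y_tr, rcond=None)

train_mse_ls = mse(features_tr @ coef_ls, y_tr)
train_mse_causal = mse(x_tr, y_tr)
test_mse_ls = mse(np.column_stack([x_te, z_te]) @ coef_ls, y_te)
test_mse_causal = mse(x_te, y_te)

print("least-squares coefficients on (X, Z):", coef_ls)
print("train MSE, least squares vs causal:", train_mse_ls, train_mse_causal)
print("test MSE,  least squares vs causal:", test_mse_ls, test_mse_causal)
```

Under these assumptions the fitted coefficients should land near (1/2, 1/2), matching the slide; least squares wins on the training process and loses once the noise on Z grows, while the causal predictor's error stays put.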
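The NCC pipeline of slide 30 (featurize each point separately, average, classify) can be sketched as a forward pass. This is only an illustration of the data flow: the layer sizes, the ReLU and sigmoid choices, and the random untrained weights are assumptions, and `init_params` and `ncc_forward` are hypothetical names rather than the authors' implementation.

```python
import numpy as np

def init_params(rng, embed_dim=16, hidden_dim=16):
    """Random, untrained weights for a tiny NCC-shaped network (sizes are assumptions)."""
    return {
        "W_embed": rng.normal(0.0, 0.5, (2, embed_dim)),
        "b_embed": np.zeros(embed_dim),
        "W_hidden": rng.normal(0.0, 0.5, (embed_dim, hidden_dim)),
        "b_hidden": np.zeros(hidden_dim),
        "w_out": rng.normal(0.0, 0.5, hidden_dim),
        "b_out": 0.0,
    }

def ncc_forward(params, x, y):
    """Return a score in [0, 1] playing the role of P(X -> Y) for one sample."""
    points = np.column_stack([x, y])  # shape (m_i, 2)
    # Embedding layer: each point (x_ij, y_ij) is featurized separately.
    embedded = np.maximum(points @ params["W_embed"] + params["b_embed"], 0.0)
    # Average over the m_i points, so the sample size may vary.
    pooled = embedded.mean(axis=0)
    # Classifier layers applied to the averaged embedding.
    hidden = np.maximum(pooled @ params["W_hidden"] + params["b_hidden"], 0.0)
    logit = hidden @ params["w_out"] + params["b_out"]
    return float(1.0 / (1.0 + np.exp(-logit)))

rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.normal(size=50)
y = x ** 3 + 0.1 * rng.normal(size=50)

score = ncc_forward(params, x, y)
perm = rng.permutation(50)
score_permuted = ncc_forward(params, x[perm], y[perm])  # same points, new order
score_small = ncc_forward(params, x[:20], y[:20])       # fewer points
print(score, score_permuted, score_small)
```

Because of the averaging step, the score does not depend on the order of the points and the same network accepts samples of any size. With untrained weights the score itself carries no causal meaning.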
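Slides 40 to 44 argue that interventions on an FCM leave the causal conditional expectation E(Y | X2, X3) invariant while non-causal ones may move. A minimal sketch of that claim follows, assuming linear mechanisms and standard normal noise that are not in the slides (the coefficients 0.5, 1, 2 are arbitrary illustrative choices), and using no-intercept least squares as a stand-in for the conditional expectation.

```python
import numpy as np

def sample_fcm(n, rng, do_x3=None, shift_x2=False):
    """Sample the FCM of slide 40 with hypothetical linear mechanisms."""
    n1, n2, n3, n4, ny = rng.normal(size=(5, n))  # N_i ~ P(N), assumed standard normal
    x1 = n1                                        # X1 <- f1(N1)
    if do_x3 is None:
        x3 = 0.5 * x1 + n3                         # X3 <- f3(X1, N3)
    else:
        x3 = np.full(n, do_x3)                     # intervention: X3 set to a constant
    x2 = x1 - x3 + n2                              # X2 <- f2(X1, X3, N2)
    if shift_x2:
        x2 = x2 + rng.uniform(-10.0, 10.0, n)      # intervention: add U(-10, 10) to X2
    x4 = 2.0 * x1 + n4                             # X4 <- f4(X1, N4)
    y = x2 + 2.0 * x3 + ny                         # Y <- fy(X2, X3, Ny)
    return x1, x2, x3, x4, y

def fit(features, y):
    """Least-squares coefficients without intercept."""
    coef, *_ = np.linalg.lstsq(features, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
environments = {
    "observational": sample_fcm(100_000, rng),
    "do(X3 = 1.5)": sample_fcm(100_000, rng, do_x3=1.5),
    "X2 + U(-10, 10)": sample_fcm(100_000, rng, shift_x2=True),
}

causal_coefs, noncausal_coefs = {}, {}
for name, (x1, x2, x3, x4, y) in environments.items():
    causal_coefs[name] = fit(np.column_stack([x2, x3]), y)  # regress Y on its parents
    noncausal_coefs[name] = fit(x4[:, None], y)             # regress Y on non-parent X4
    print(name, causal_coefs[name], noncausal_coefs[name])
```

In this sketch the regression on the parents (X2, X3) should recover roughly (1, 2) in every environment, whereas the coefficient on X4 changes once X3 is intervened on.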
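The three closed-form predictors of slide 45 can be compared against simulated regressions. The sketch below assumes N(0, v) denotes variance v (the reading under which the slide's formulas hold), restricts to positive e, and uses illustrative names `sample_env` and `fit`; the environments 0.5, 1, and 4 are arbitrary.

```python
import numpy as np

def sample_env(e, n, rng):
    """Draw (X, Y, Z) for environment e > 0, reading N(0, v) as variance v."""
    x = rng.normal(0.0, np.sqrt(e), n)
    y = x + rng.normal(0.0, np.sqrt(e), n)
    z = y + rng.normal(0.0, 1.0, n)
    return x, y, z

def fit(features, y):
    """Least-squares coefficients without intercept (all variables are zero-mean)."""
    coef, *_ = np.linalg.lstsq(features, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
results = {}
for e in (0.5, 1.0, 4.0):
    x, y, z = sample_env(e, 200_000, rng)
    results[e] = {
        "x": fit(x[:, None], y)[0],             # slope of Y on X alone
        "z": fit(z[:, None], y)[0],             # slope of Y on Z alone
        "xz": fit(np.column_stack([x, z]), y),  # slopes of Y on (X, Z)
    }
    print(f"e={e}: Y|X -> {results[e]['x']:.3f}, "
          f"Y|Z -> {results[e]['z']:.3f} (closed form {2 * e / (2 * e + 1):.3f}), "
          f"Y|X,Z -> {results[e]['xz'].round(3)} "
          f"(closed form {1 / (e + 1):.3f}, {e / (e + 1):.3f})")
```

Only the slope on X alone should stay at 1 for every e; the other two predictors depend on the environment, which is why they cannot be trusted for an unknown test e.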