Causality-inspired ML: what can ideas from causality do for ML? The domain adaptation case
Sara Magliacane
MIT-IBM Watson AI Lab
Causality-inspired ML
• Real-world ML needs to deal with:
  • Biased data (fairness, selection bias, generalization)
  • Heterogeneous data, small samples, missing/corrupted data, non-iid data
  • Actionable insights (decisions cannot be made on correlations alone)
• Causal inference can help with some of these questions:
  • Systematic data fusion and reuse with biased, heterogeneous, non-iid data
  • A systematic way to extract actionable insights
• “Full” causality can be unnecessary or too expensive -> causality-inspired ML
Unsupervised domain adaptation by inferring separating sets of features
(joint work with Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg and Joris Mooij)
Transfer learning and causal inference
• Transfer learning:
  • How can I predict what happens when the distribution changes?
• Causal inference:
  • How can I predict what happens when the distribution changes after an intervention?
  • Perfect intervention do(X): do-calculus [Pearl, 2009]
  • Soft intervention on X ≈ change of the distribution P(X | parents) in the graph ≈ arbitrary change of distribution
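To make the two notions concrete, here is a minimal simulation sketch (the SCM, the variable names and the coefficients are illustrative assumptions, not from the talk): a perfect intervention do(X = x) removes the dependence of X on its parents entirely, while a soft intervention only changes the mechanism P(X | parents); the mechanism P(Y | X) is untouched in both cases.

```python
# Minimal sketch: perfect vs soft intervention in a toy linear SCM Z -> X -> Y.
# All names and coefficients are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(intervention=None):
    Z = rng.normal(0.0, 1.0, n)
    if intervention == "perfect":        # do(X = 2): cut the edge Z -> X entirely
        X = np.full(n, 2.0)
    elif intervention == "soft":         # change P(X | Z), but keep the edge Z -> X
        X = 3.0 * Z + rng.normal(1.0, 2.0, n)
    else:                                # observational mechanism
        X = 1.5 * Z + rng.normal(0.0, 1.0, n)
    Y = 2.0 * X + rng.normal(0.0, 1.0, n)  # P(Y | X) is invariant in all three cases
    return X, Y

for mode in [None, "perfect", "soft"]:
    X, Y = sample(mode)
    slope = np.cov(X, Y)[0, 1] / X.var() if X.var() > 0 else float("nan")
    print(f"{mode}: E[X] = {X.mean():.2f}, regression slope of Y on X = {slope:.2f}")
```

In all three cases the regression slope of Y on X stays near 2 (undefined under the perfect intervention, where X is constant), while the marginal of X changes: that is exactly the invariance that the rest of the talk exploits.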
Domain adaptation
• Source domain (or domains): {X_i, Y_i}_{i=1}^n ∼ P_S(X, Y)
• Target domain: {X_j, Y_j}_{j=1}^m ∼ P_T(X, Y) ≠ P_S(X, Y)
• Domain adaptation: predict Y in the target domain using (also) information from the source domain
• Unsupervised domain adaptation: no labels in the target domain (zero-shot), i.e. we only observe {X_j}_{j=1}^m ∼ P_T(X)

Source domain (“Normal”):
X1   X2  Y
0.1  2   0
0.2  3   0
1.1  2   1
0.1  3   0

Target domain (“Gene A”):
X1   X2  Y
3.1  2   ?
3.2  3   ?
4    2   ?
3.2  3   ?
Domain adaptation from a graphical perspective
• Add a variable D to represent the domain
• Consider the data as coming from a single distribution P(X, Y, D)
• We can represent P(X, Y, D) with a (possibly unknown) causal graph

[Figure: causal graph over D, X1, X2, Y]

Pooled data with the domain variable D (see the pooling sketch below):
D       X1   X2  Y
Normal  0.1  2   0
Normal  0.2  3   0
Normal  1.1  2   1
Normal  0.1  3   0
Gene A  3.1  2   ?
Gene A  3.2  3   ?
Gene A  4    2   ?
Gene A  3.2  3   ?
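A minimal sketch of this pooling step, using pandas and the toy values from the tables above (the column names follow the tables; everything else is just illustration):

```python
# Minimal sketch: pool source and target data into one dataset with an explicit
# domain variable D, so everything comes from a single distribution P(X, Y, D).
import pandas as pd

source = pd.DataFrame({"X1": [0.1, 0.2, 1.1, 0.1],
                       "X2": [2, 3, 2, 3],
                       "Y":  [0, 0, 1, 0]})
target = pd.DataFrame({"X1": [3.1, 3.2, 4.0, 3.2],
                       "X2": [2, 3, 2, 3],
                       "Y":  [None, None, None, None]})  # unlabeled target

source["D"] = "Normal"
target["D"] = "Gene A"
pooled = pd.concat([source, target], ignore_index=True)
print(pooled)
```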
D-separation with slightly less tears
• Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read off conditional independences
• Chain (e.g. D → X1 → Y): Y /⊥⊥ D, but Y ⊥⊥ D | X1, because conditioning on X1 closes the path
• Collider (e.g. D → X2 ← Y): Y ⊥⊥ D, but Y /⊥⊥ D | X2, because conditioning on X2 opens a closed path
• Combining both (D → X1 → Y and D → X2 ← Y): Y /⊥⊥ D | X1, X2, because conditioning on the collider X2 re-opens a path even though X1 blocks the chain (see the check below)
• Conditional independences with D tell us about invariance:
  Y ⊥⊥ D | X1 ⟹ P(Y | X1, D) = P(Y | X1) ⟹ P(Y | X1, D = d1) = P(Y | X1, D = d2)
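These statements can be checked programmatically. A minimal sketch, assuming networkx ≥ 2.8 (which provides nx.d_separated; in networkx ≥ 3.3 the same check is nx.is_d_separator), on the combined example graph:

```python
# Minimal sketch: check the d-separation statements above on the combined graph
# D -> X1 -> Y and D -> X2 <- Y. Assumes networkx >= 2.8 (nx.d_separated).
import networkx as nx

G = nx.DiGraph([("D", "X1"), ("X1", "Y"), ("Y", "X2"), ("D", "X2")])

print(nx.d_separated(G, {"Y"}, {"D"}, set()))         # False: the chain D -> X1 -> Y is open
print(nx.d_separated(G, {"Y"}, {"D"}, {"X1"}))        # True:  conditioning on X1 closes the chain
print(nx.d_separated(G, {"Y"}, {"D"}, {"X2"}))        # False: conditioning on the collider X2 opens a path
print(nx.d_separated(G, {"Y"}, {"D"}, {"X1", "X2"}))  # False: X2 re-opens a path although X1 blocks the chain
```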
Domain adaptation classification from a graphical perspective [Zhang et al. 2013]
• Covariate shift assumption: P_S(Y | X) = P_T(Y | X), i.e. Y ⊥⊥ D | X
  • Related to counterfactual inference [Johansson et al. 2016, 2018]
• Target shift and generalized target shift: here X is caused by Y, and the domain changes P(Y) (target shift) or also P(X | Y) (generalized target shift)

[Figures: causal graphs for covariate shift (D acting on X) and for (generalized) target shift (D → Y → X)]
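When the covariate shift assumption holds, a classical correction (a standard technique, not specific to this talk) is to reweight source examples by the density ratio w(x) ≈ p_T(x)/p_S(x), here estimated with a domain classifier; the data and model choices below are illustrative assumptions:

```python
# Minimal sketch: importance weighting under covariate shift.
# Assumes Y ⊥⊥ D | X actually holds; all names and numbers are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
Xs = rng.normal(0, 1, (500, 1))
Ys = (Xs[:, 0] + rng.normal(0, 0.5, 500) > 0).astype(int)
Xt = rng.normal(1, 1, (500, 1))      # shifted P(X), same P(Y | X), labels unobserved

# 1. Train a domain classifier to estimate p(target | x).
dom = LogisticRegression().fit(np.vstack([Xs, Xt]),
                               np.r_[np.zeros(500), np.ones(500)])
p_t = dom.predict_proba(Xs)[:, 1]
w = p_t / (1 - p_t)                  # proportional to p_T(x) / p_S(x)

# 2. Fit the label model on reweighted source data.
clf = LogisticRegression().fit(Xs, Ys, sample_weight=w)
print(clf.predict(Xt[:5]))
```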
Common misconceptions: 1. An invariant feature need not be causal

[Figure: causal graph over D, X1, X2, Y in which Y ⊥⊥ D | X1 and Y ⊥⊥ D | X1, X2]

• Y | X1, X2 is invariant ⟹ invariant features are not necessarily parents of Y
• Invariance across “many different datasets” is not enough in general to find the causal parents; we need more assumptions
• Invariant Causal Prediction [Peters et al. 2016] under causal sufficiency (sketched in code below):
  S* = ⋂_{S : Y ⊥⊥ D | S} S ⊆ Pa(Y), e.g. {X1, X2} ∩ {X1} = {X1}
• Sometimes too conservative: in another graph {X1}, {X2} and {X1, X2} are all invariant, but {X1} ∩ {X2} ∩ {X1, X2} = ∅
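A minimal sketch of the ICP intersection itself; `ci_test` stands in for any conditional independence test, and the toy oracle below just reproduces the first example above:

```python
# Minimal sketch of the Invariant Causal Prediction idea [Peters et al. 2016]:
# intersect all feature sets S for which Y ⊥⊥ D | S is accepted.
from itertools import chain, combinations

def powerset(features):
    return chain.from_iterable(combinations(features, r)
                               for r in range(len(features) + 1))

def icp(features, ci_test):
    """Intersect all sets S with Y ⊥⊥ D | S accepted; returns ∅ if none is."""
    invariant = [set(S) for S in powerset(features) if ci_test("Y", "D", set(S))]
    if not invariant:
        return set()
    S_star = set(features)
    for S in invariant:
        S_star &= S
    return S_star            # guaranteed ⊆ Pa(Y) under causal sufficiency

# Toy oracle for the example above: {X1} and {X1, X2} are the invariant sets.
oracle = lambda y, d, S: S in ({"X1"}, {"X1", "X2"})
print(icp(["X1", "X2"], oracle))   # {'X1'}
```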
Common misconception 2: Parents are not enough under latent confounding

[Figure: causal graph over D, X1, X2, Y with latent confounding, in which Y ⊥⊥ D | X1, Y /⊥⊥ D | X2 and Y /⊥⊥ D | X1, X2]

• Y | X1 is invariant, Y | X2 is not: even if we knew all the parents, under latent confounding this wouldn’t necessarily help transfer
• Conclusion: causality (e.g. using the causal parents, learning the complete causal graph) is neither necessary nor sufficient* for transfer
Desiderata for a causality-inspired domain adaptation method
• X, Y and the changes can be represented by an unknown causal graph
• Allow for latent confounders
• Avoid parametric assumptions; allow for heterogeneous effects across domains
• Instead of modeling the changes between each pair of domains, distinguish the change between the mixture of sources and the target
• Avoid the common assumption that if Y | T(X) is invariant across multiple source domains, then Y | T(X) is also invariant in the target domain
  • This is also (implicitly) assumed by methods based on the idea that invariance ⟹ causality
Causal domain adaptation problem [Magliacane et al. 2018]
• Unsupervised multi-source domain adaptation
• We interpret the change in the target domain as a soft intervention
• We assume Y cannot be intervened upon directly; P(Y) can still change

Source domain 1 (“Normal”):
X1   X2  Y
0.1  2   0
0.2  3   0
1.1  2   1
0.1  3   0

Source domain 2 (“Gene A”):
X1   X2  Y
3.1  2   1
3.2  3   1
4    1   1
3.2  3   0

Target domain (“Gene B”):
X1   X2  Y
0.2  1   ?
0.3  1   ?
0.3  2   ?
0.4  1   ?
Joint Causal Inference [Mooij et al. 2020]
• We jointly represent the different distributions as a single unknown causal graph
• Instead of a single domain variable, we add several context variables, so we can disentangle changes in distribution across the datasets
• If we know nothing about the changes in the datasets, we use indicator variables

The three datasets above, pooled with indicator context variables C1 (“Gene A”) and C2 (“Gene B”):
C1  C2  X1   X2  Y
0   0   0.1  2   0
0   0   0.2  3   0
0   0   1.1  2   1
0   0   0.1  3   0
1   0   3.1  2   1
1   0   3.2  3   1
1   0   4    1   1
1   0   3.2  3   0
0   1   0.2  1   ?
0   1   0.3  1   ?
0   1   0.3  2   ?
0   1   0.4  1   ?
Joint Causal Inference [Mooij et al. 2020]
• The joint graph is the union of all the (possibly different) subgraphs in each dataset, plus context variables that model the differences between the datasets
• We allow for latent confounders and, in general, also cyclic causal graphs

[Figure: the subgraphs over X1, Y, X2 for C1 = 0 and C1 = 1, and their union joined by the context variable C1]
Joint Causal Inference [Mooij et al. 2020]
• We can learn an equivalence class of the unknown single causal graph using conditional independence tests on the systematically pooled data
• We treat context variables as normal variables that we know are uncaused, and we can use any standard constraint-based causal discovery method

Example tests on the pooled data (see the CI-test sketch below):
C1 ⊥⊥ Y | X1
X1 ⊥⊥ X2 | Y, C2
X1 /⊥⊥ X2 | Y
…

[Figure: learned graph over X1, Y, X2 with context variables C1, C2]

Pooled toy data (all three datasets labeled in this example):
C1  C2  X1   X2  Y
0   0   0.1  1   0
0   0   0.2  1   0
0   0   1.1  2   1
0   1   0.2  0   0
0   1   0.3  0   1
0   1   0.3  1   0
1   0   3.1  2   1
1   0   3.2  3   1
1   0   4    3   1
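A minimal sketch of the basic building block, a conditional independence test on the pooled data; this Fisher-z partial-correlation test assumes roughly Gaussian variables, and `df` stands for a pooled DataFrame like the one above (real analyses would pick a test suited to the data types):

```python
# Minimal sketch: Fisher-z partial-correlation CI test, the building block of
# constraint-based causal discovery. Assumes roughly Gaussian variables.
import numpy as np
import pandas as pd
from scipy import stats

def ci_test(df: pd.DataFrame, x: str, y: str, given=(), alpha=0.05) -> bool:
    """Return True if x ⊥⊥ y | given is accepted at level alpha."""
    data = df[[x, y, *given]].to_numpy(dtype=float)
    prec = np.linalg.pinv(np.corrcoef(data, rowvar=False))  # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])      # partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(df) - len(given) - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha

# e.g., on the labeled rows of a pooled dataset `df`:
#   ci_test(df, "C1", "Y", given=("X1",))   # accepted -> consistent with C1 ⊥⊥ Y | X1
```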
Causal domain adaptation: separating features
• For simplicity: consider only one source domain and one target domain, and assume we knew the causal graph
• Which features can we use to predict Y in the target domain?

[Figure: causal graph over C1, X1, Y, X2, X3; scatter plots of Y vs X1 and Y vs X2 in the source and target domains]

• Separating features: sets of features that d-separate Y from the context variable C1 representing the target domain, i.e. exactly the features for which covariate shift holds:
  Y ⊥⊥ C1 | X1 ≡ P(Y | X1, C1 = 0) = P(Y | X1, C1 = 1)
  Y /⊥⊥ C1 | X2 ≡ P(Y | X2, C1 = 0) ≠ P(Y | X2, C1 = 1)
• {X1} is a separating feature set; using {X1, X2} could lead to arbitrarily large error (see the simulation sketch below)
• Separating features need not be causal (parents of Y), e.g. X3
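The following simulation sketch illustrates the point with an illustrative SCM (not the one in the talk): regressing on the separating set {X1} transfers, while adding the non-separating X2 (a child of Y whose mechanism flips sign in the target) hurts badly:

```python
# Minimal sketch: {X1} is separating and transfers; {X1, X2} is not and fails.
# The SCM and all coefficients are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def domain(c1, n=20_000):
    X1 = rng.normal(0, 1, n)
    Y = X1 + rng.normal(0, 0.3, n)          # P(Y | X1) invariant across domains
    sign = 1.0 if c1 == 0 else -1.0          # mechanism of X2 changes in the target
    X2 = sign * Y + rng.normal(0, 0.3, n)
    return np.c_[X1, X2], Y

Xs, Ys = domain(c1=0)   # source
Xt, Yt = domain(c1=1)   # target (labels used only to evaluate)

for feats, name in [([0], "{X1}"), ([0, 1], "{X1, X2}")]:
    model = LinearRegression().fit(Xs[:, feats], Ys)
    mse = np.mean((model.predict(Xt[:, feats]) - Yt) ** 2)
    print(f"features {name}: target MSE = {mse:.2f}")
```

The source regression happily leans on X2 (it is nearly equal to Y in the source), so the {X1, X2} model breaks in the target, while the {X1} model keeps the same error.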
What if the causal graph is unknown?
• Idea: we could test the conditional independences in the data: Y ⊥⊥ C1 | X1? Y ⊥⊥ C1 | X2?
• Problem: Y is always missing when C1 = 1, so we cannot test these directly

C1  X1   X2  Y
0   0.1  2   0
0   0.2  3   0
0   1.1  2   1
0   0.1  3   0
1   0.2  1   ?
1   0.3  1   ?
1   0.3  2   ?
1   0.4  1   ?

• Idea: can we use all the other (testable) independences instead? E.g.
  X1 /⊥⊥ X2
  X1 /⊥⊥ C1
  X1 ⊥⊥ X2 | Y, C1 = 0
  X1 /⊥⊥ X2 | C1
  …
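A minimal sketch of which tests are allowed: any statement that pairs Y with C1 is untestable, because Y is unobserved whenever C1 = 1; statements involving Y but not C1 can still be tested within the source data (implicitly conditioning on C1 = 0), and statements not involving Y can use all data, including the unlabeled target:

```python
# Minimal sketch: enumerate CI statements and flag which ones are testable
# when Y is missing in the target domain (C1 = 1).
from itertools import combinations

variables = ["C1", "X1", "X2", "Y"]

def testable(x, y, given):
    involves_y = "Y" in {x, y} or "Y" in given
    involves_c1 = "C1" in {x, y} or "C1" in given
    # A statement relating Y and C1 can never be tested: Y is unobserved when C1 = 1.
    # Statements with Y but without C1 are tested on the source rows (C1 = 0) only.
    return not (involves_y and involves_c1)

for x, y in combinations(variables, 2):
    rest = [v for v in variables if v not in (x, y)]
    for r in range(len(rest) + 1):
        for given in combinations(rest, r):
            tag = "testable" if testable(x, y, set(given)) else "NOT testable"
            print(f"{x} ⊥⊥ {y} | {set(given) or '∅'}: {tag}")
```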
Assumptions [Magliacane et al. 2018]
• We assume that there exists an acyclic causal graph that fits all the data (Joint Causal Inference)
• We assume Y cannot be intervened upon directly
• We assume no extra dependences involving Y in the target domain C1 = 1: for all A, B, D ⊂ V ∖ {Y, C1},
  Y ⊥⊥ A | B, C1 = 0 ⟹ Y ⊥⊥ A | B, C1 = 1
  A ⊥⊥ D | B, Y, C1 = 0 ⟹ A ⊥⊥ D | B, Y, C1 = 1
• Note that this does not assume anything about the separating-set test C1 ⊥⊥ Y | B
A slightly more complicated example

Pooled data (two labeled sources, one unlabeled target with C1 = 1):
C1  C2  X1   X2  Y
0   0   0.1  1   0
0   0   0.2  1   0
0   0   1.1  2   1
0   1   3.1  2   1
0   1   3.2  3   1
0   1   4    3   1
1   0   0.2  0   ?
1   0   0.3  0   ?
1   0   0.3  1   ?

• Perform the allowed CI tests, e.g.:
  Y ⊥⊥ C2 | X1, C1 = 0
  Y /⊥⊥ C2 | C1 = 0
  X2 ⊥⊥ C2 | Y, C1 = 0

[Figure: all possible graphs over X1, Y, X2, C1, C2 compatible with these test results]

• We can prove an untestable separating statement without reconstructing the graph: Y ⊥⊥ C1 | X1 is true in all possible compatible graphs
Invariance in sources is not invariance in target

[Figure: the joint causal graph over X1, Y, X2 with context variables C1, C2, and the source-only (C1 = 0) subgraph]

• Features that are invariant across the source domains (C1 = 0):
  Y ⊥⊥ C2 | X1, C1 = 0    Y ⊥⊥ C2 | X2, C1 = 0    Y ⊥⊥ C2 | X1, X2, C1 = 0
• But in the target domain (C1 = 1) only the first invariance survives:
  Y ⊥⊥ C2 | X1, C1 = 1    Y /⊥⊥ C2 | X2, C1 = 1    Y /⊥⊥ C2 | X1, X2, C1 = 1
• Idea: also use data from the target in order to learn invariant features
Inferring separating sets without enumerating all possible causal graphs

Query: Y ⊥⊥ C1 | X1?

Inputs:
• All testable conditional independences from data, e.g. Y ⊥⊥ C2 | X1, C1 = 0; X1 ⊥⊥ X3 | X4; X2 ⊥⊥ C2 | Y, C1 = 0; …
• The assumptions above

A theorem prover with a logic encoding of d-separation [Hyttinen et al. 2014] then returns one of three answers (see the toy illustration below):
• Y ⊥⊥ C1 | X1: provably separating
• Y /⊥⊥ C1 | X1: provably not separating
• ?: not identifiable
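For intuition about what the query means, here is a brute-force sketch that enumerates tiny DAGs compatible with the known facts and checks whether the query holds in all of them; the actual method uses the logic encoding precisely to avoid this enumeration, and it also allows latent confounders, which this toy version ignores (it additionally assumes faithfulness, matching independences to d-separations exactly, and needs networkx ≥ 2.8 for nx.d_separated):

```python
# For intuition only: brute-force semantics of the theorem-prover query over
# tiny DAGs without latent variables, assuming faithfulness.
from itertools import product
import networkx as nx

nodes = ["C1", "X1", "Y"]
candidate_edges = [(a, b) for a in nodes for b in nodes
                   if a != b and b != "C1"]       # the context variable C1 is uncaused

def all_dags():
    for mask in product([0, 1], repeat=len(candidate_edges)):
        G = nx.DiGraph([e for e, keep in zip(candidate_edges, mask) if keep])
        G.add_nodes_from(nodes)
        if nx.is_directed_acyclic_graph(G):
            yield G

# Known (testable) facts as (x, y, conditioning set, independent?) tuples:
facts = [("X1", "C1", set(), False),              # X1 /⊥⊥ C1
         ("Y", "X1", set(), False)]               # Y /⊥⊥ X1

compatible = [G for G in all_dags()
              if all(nx.d_separated(G, {x}, {y}, z) == ind for x, y, z, ind in facts)]

answers = {nx.d_separated(G, {"Y"}, {"C1"}, {"X1"}) for G in compatible}
if answers == {True}:
    print("provably separating")
elif answers == {False}:
    print("provably not separating")
else:
    print("not identifiable")  # with only these two facts the query is undetermined
```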
Causal feature selection
• Use a standard feature selection method to create a list L of subsets of features and context variables, ordered by lowest source-domain loss
  • For example, L = {X1, C2}, {X1, X2, C2}, {X1, X2}, …
• Iteratively query the theorem prover until a provably separating set A is found
• Train a model f(A) to predict Y on the source domains
• Apply f(A) to predict Y on the target domain
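Putting the pieces together, a minimal sketch of the loop described above; `candidate_sets` (ordered by source loss), `is_provably_separating` (the theorem-prover query) and the random forest are placeholders, and the feature matrices are assumed to be pandas DataFrames:

```python
# Minimal sketch of the causal feature selection loop described above.
from sklearn.ensemble import RandomForestRegressor

def causal_feature_selection(candidate_sets, is_provably_separating,
                             X_source, y_source, X_target):
    for A in candidate_sets:                 # ordered by source-domain loss
        if is_provably_separating(A):        # theorem-prover query: Y ⊥⊥ C1 | A?
            model = RandomForestRegressor().fit(X_source[list(A)], y_source)
            return A, model.predict(X_target[list(A)])
    return None, None                        # no provably separating set found
```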
Results
• Simulated data: 200 random causal graphs
  • Baseline method for feature selection and regression: random forest
• Real-world data: CRM Causal Inference Challenge
  • 441 wild-type mice with 16 phenotypes
  • ~13 mutant mice (single-gene knockout) for each of 10 interventions
• “Causal cross-validation”:
  • We randomly selected 3 phenotypes and 2 knockouts and created 1000 datasets
  • For each dataset we randomly chose Y and C1 and ignored the values of Y when C1 = 1
Desiderata for a causality-inspired domain adaptation method (recap)
• X, Y and the changes can be represented by an unknown causal graph: no need to find the causal graph or its equivalence class
• Allow for latent confounders
• Avoid parametric assumptions; allow for heterogeneous effects across domains
• Instead of modeling the changes between each pair of domains, distinguish the change between the mixture of sources and the target
• Avoid the common assumption that if Y | T(X) is invariant across multiple source domains, then Y | T(X) is also invariant in the target domain
• Only search for invariant features with respect to the current target task
Limitations and future work
• Potentially too conservative: separating sets may exist that are not provably separating
  • Extension: can we use active learning / intervention design to decide on the most informative interventions?
• Cannot be used directly for transformations of features T(X)
  • Extension: incorporate structure learning with deterministic relations
• Scalability: the (error-correcting) logic-based encoding with all CI tests as input scales to tens of variables (including context variables)
  • Extension: use approximate algorithms, combine with low-dimensional representations
• Can we apply this to multi-task RL (e.g. in factored MDPs)?
Thanks!

Causality-inspired ML - a use case in unsupervised domain adaptation

  • 1. Causality-inspired ML: what can ideas from causality do for ML? The domain adaptation case Sara Magliacane MIT-IBM Watson AI Lab
  • 2. Causality-inspired ML • Real-world ML needs to deal with: • Biased data (fairness, selection bias, generalization) • Heterogeneous data, small samples, missing/corrupted data, not iid • Actionable insights (decisions cannot be made on correlations)
  • 3. Causality-inspired ML • Real-world ML needs to deal with: • Biased data (fairness, selection bias, generalization) • Heterogeneous data, small samples, missing/corrupted data, not iid • Actionable insights (decisions cannot be made on correlations) • Causal inference can help with some of these questions: • Systematic data fusion and reuse with biased data, heterogenous, not iid data • A systematic way to extract actionable insights
  • 4. Causality-inspired ML • Real-world ML needs to deal with: • Biased data (fairness, selection bias, generalization) • Heterogeneous data, small samples, missing/corrupted data, not iid • Actionable insights (decisions cannot be made on correlations) • Causal inference can help with some of these questions: • Systematic data fusion and reuse with biased data, heterogenous, not iid data • A systematic way to extract actionable insights • “Full” causality can be not necessary or too expensive -> causality-inspired ML
  • 5. Unsupervised domain adaptation by inferring separating sets of features (joint work with Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg and Joris Mooij)
  • 6. Transfer learning and causal inference • Transfer learning: • How can I predict what happens when the distribution changes?
  • 7. Transfer learning and causal inference • Transfer learning: • How can I predict what happens when the distribution changes?
  • 8. Transfer learning and causal inference • Transfer learning: • How can I predict what happens when the distribution changes? • Causal inference: • How can I predict what happens when the distribution changes after an intervention? • Perfect intervention do(X): • do-calculus [Pearl, 2009]
  • 9. Transfer learning and causal inference • Transfer learning: • How can I predict what happens when the distribution changes? • Causal inference: • How can I predict what happens when the distribution changes after an intervention? • Perfect intervention do(X): • do-calculus [Pearl, 2009] • Soft intervention on X change of distribution of P(X| parents) in the graph • arbitrary change of distribution ≈ ≈
  • 11. Domain adaptation • Source domain (or domains): • Target domain: • Domain adaptation: predict Y in target domain using (also) information from source domain • Unsupervised domain adaptation: no labels in the target domain (zero-shot) {Xi, Yi}n i=1 ∼ PS(X, Y) {Xj, Yj}m j=1 ∼ PT(X, Y) ≠ PS(X, Y) {Xj}m j=1 ∼ PT(X)
  • 12. Domain adaptation X1 X2 Y Normal 0,1 2 0 Normal 0,2 3 0 Normal 1,1 2 1 Normal 0,1 3 0 X1 X2 Y Gene A 3,1 2 ? Gene A 3,2 3 ? Gene A 4 2 ? Gene A 3,2 3 ? • Source domain (or domains): • Target domain: • Domain adaptation: predict Y in target domain using (also) information from source domain • Unsupervised domain adaptation: no labels in the target domain (zero-shot) {Xi, Yi}n i=1 ∼ PS(X, Y) {Xj, Yj}m j=1 ∼ PT(X, Y) ≠ PS(X, Y) {Xj}m j=1 ∼ PT(X)
  • 13. Domain adaptation from a graphical perspective X1 X2 Y Normal 0,1 2 0 Normal 0,2 3 0 Normal 1,1 2 1 Normal 0,1 3 0 X1 X2 Y Gene A 3,1 2 ? Gene A 3,2 3 ? Gene A 4 2 ? Gene A 3,2 3 ?
  • 14. Domain adaptation from a graphical perspective D X1 X2 Y Normal 0,1 2 0 Normal 0,2 3 0 Normal 1,1 2 1 Normal 0,1 3 0 • Add a variable D to represent the domain X1 X2 Y Gene A 3,1 2 ? Gene A 3,2 3 ? Gene A 4 2 ? Gene A 3,2 3 ?
  • 15. Domain adaptation from a graphical perspective • Add a variable D to represent the domain • Consider the data as coming from a single distribution P(X,Y, D) X1 X2 Y Gene A 3,1 2 ? Gene A 3,2 3 ? Gene A 4 2 ? Gene A 3,2 3 ? D X1 X2 Y Normal 0,1 2 0 Normal 0,2 3 0 Normal 1,1 2 1 Normal 0,1 3 0
  • 16. Domain adaptation from a graphical perspective • Add a variable D to represent the domain • Consider the data as coming from a single distribution P(X,Y, D) X1 Y X2D • We can represent P(X,Y, D) with a (possibly unknown) causal graph X1 X2 Y Gene A 3,1 2 ? Gene A 3,2 3 ? Gene A 4 2 ? Gene A 3,2 3 ? D X1 X2 Y Normal 0,1 2 0 Normal 0,2 3 0 Normal 1,1 2 1 Normal 0,1 3 0
  • 17. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y /⊥⊥ D
  • 18. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path
  • 19. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path Y ⊥⊥ D Y X2 D
  • 20. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path Y ⊥⊥ D Y /⊥⊥ D|X2 • Conditioning on X2 opens a closed path Y X2 D
  • 21. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path Y X2 D Y ⊥⊥ D Y /⊥⊥ D|X2 • Conditioning on X2 opens a closed path X1 YD X2
  • 22. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path Y X2 D Y ⊥⊥ D Y /⊥⊥ D|X2 • Conditioning on X2 opens a closed path X1 YD X2 Y /⊥⊥ D|X1, X2
  • 23. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path Y X2 D Y ⊥⊥ D Y /⊥⊥ D|X2 • Conditioning on X2 opens a closed path • Conditional independences with D tell us about invariance Y ⊥⊥ D|X1 ⟹ P(Y|X1, D) = P(Y|X1) ⟹ P(Y|X1, D = d1) = P(Y|X1, D = d2)
  • 24. Domain adaptation classification from a graphical perspective [Zhang et al. 2013] • Covariate shift assumption: • Related to counterfactual inference [Johansson et al. 2016, 2018] PS(Y|X) = PT(Y|X) ≡ Y ⊥⊥ D|X X1 YX2D
  • 25. Domain adaptation classification from a graphical perspective [Zhang et al. 2013] • Covariate shift assumption: • Related to counterfactual inference [Johansson et al. 2016, 2018] • Target shift and generalized target shift PS(Y|X) = PT(Y|X) ≡ Y ⊥⊥ D|X X1 YX2D YD X YD X
  • 26. Common misconceptions: 1. An invariant feature need not be causal X1 Y X2D Y ⊥⊥ D|X1 • Y|X1,X2 is invariant invariant features are not necessarily parents of Y⟹ Y ⊥⊥ D|X1, X2
  • 27. Common misconceptions: 1. An invariant feature need not be causal X1 Y X2D Y ⊥⊥ D|X1 • Y|X1,X2 is invariant invariant features are not necessarily parents of Y⟹ Y ⊥⊥ D|X1, X2 Invariant feature across “many different datasets” is not enough in general to find causal parents, need more assumptions
  • 28. Common misconceptions: 1. An invariant feature need not be causal X1 Y X2D Y ⊥⊥ D|X1 • Y|X1,X2 is invariant invariant features are not necessarily parents of Y • Invariant Causal Prediction [Peters et al. 2016] under causal sufficiency: ⟹ S* = ⋂ Y⊥⊥D|S S ⊆ Pa(Y) {X1, X2} ∩ {X1} = {X1} Y ⊥⊥ D|X1, X2
  • 29. Common misconceptions: 1. An invariant feature need not be causal X1 Y X2D Y ⊥⊥ D|X1 • Y|X1,X2 is invariant invariant features are not necessarily parents of Y • Invariant Causal Prediction [Peters et al. 2016] under causal sufficiency: ⟹ S* = ⋂ Y⊥⊥D|S S ⊆ Pa(Y) {X1, X2} ∩ {X1} = {X1} Y ⊥⊥ D|X1, X2
  • 30. Common misconceptions: 1. An invariant feature need not be causal X1 Y X2D Y ⊥⊥ D|X1 {X1, X2} ∩ {X1} = {X1} Y ⊥⊥ D|X1, X2 X1 YX2D {X1} ∩ {X2} ∩ {X1, X2} = ∅ • Y|X1,X2 is invariant invariant features are not necessarily parents of Y • Invariant Causal Prediction [Peters et al. 2016] under causal sufficiency: ⟹ S* = ⋂ Y⊥⊥D|S S ⊆ Pa(Y) Sometimes too conservative:
• 31–32. Common misconception 2: Parents are not enough under latent confounding
  • [figure: graph where X2 is a parent of Y but shares a latent confounder with Y]
  • Y ⊥⊥ D | X1, Y /⊥⊥ D | X2, Y /⊥⊥ D | X1, X2
  • Y | X1 is invariant, Y | X2 is not: even if we knew all the parents, under latent confounding this wouldn't necessarily help transfer
  • Conclusion: causality (e.g. using the causal parents, learning the complete causal graph) is neither necessary nor sufficient for transfer
• 33–35. Desiderata for a causality-inspired domain adaptation method
  • X, Y and the changes can be represented by an unknown causal graph
  • Allow for latent confounders
  • Avoid parametric assumptions; allow for heterogeneous effects across domains
  • Instead of modeling the changes between each pair of domains, distinguish the change between the mixture of sources and the target
  • Avoid the common assumption that if Y | T(X) is invariant across multiple source domains, then Y | T(X) is also invariant in the target domain
  • This is also (implicitly) assumed by methods built on the idea that invariance ⟹ causality
• 36. Causal domain adaptation problem [Magliacane et al. 2018]
  • [tables: two labeled source datasets (Normal, Gene A) and one unlabeled target dataset (Gene B) over X1, X2, Y]
  • Unsupervised multi-source domain adaptation
  • We interpret the change in the target domain as a soft intervention
  • We assume Y cannot be intervened upon directly (P(Y) can still change)
• 37. Joint Causal Inference [Mooij et al. 2020]
  • We represent several different distributions jointly as a single unknown causal graph
  • Instead of a single domain variable, we add several context variables, so we can disentangle changes in distribution across the datasets
  • If we know nothing about the changes in the datasets, we use indicator variables
  • [tables: the three datasets pooled into one, with added indicator columns (Normal: C1 = 0, C2 = 0; Gene A: C1 = 1, C2 = 0; Gene B: C1 = 0, C2 = 1)]
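A minimal pandas sketch of the pooling step (the numbers are abbreviated stand-ins for the slide's tables): each dataset keeps its rows and gains indicator columns recording which context it came from.

```python
import pandas as pd

# Abbreviated versions of the slide's three datasets.
normal = pd.DataFrame({"X1": [0.1, 0.2, 1.1], "X2": [2, 3, 2], "Y": [0, 0, 1]})
gene_a = pd.DataFrame({"X1": [3.1, 3.2, 4.0], "X2": [2, 3, 1], "Y": [1, 1, 1]})
gene_b = pd.DataFrame({"X1": [0.2, 0.3, 0.3], "X2": [1, 1, 2],
                       "Y": [None, None, None]})   # unlabeled target

# Pool everything into one table with indicator context variables.
pooled = pd.concat([normal.assign(C1=0, C2=0),
                    gene_a.assign(C1=1, C2=0),
                    gene_b.assign(C1=0, C2=1)], ignore_index=True)
print(pooled)
```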
• 38. Joint Causal Inference [Mooij et al. 2020]
  • The joint graph is the union of the (possibly different) subgraphs in each dataset, plus context variables that model the differences between datasets
  • [figure: the X1, Y, X2 subgraphs for C1 = 0 and C1 = 1, and their union with C1 added as a node]
  • We allow for latent confounders and, in general, also cyclic causal graphs
• 39. Joint Causal Inference [Mooij et al. 2020]
  • We can learn an equivalence class of the unknown causal graph using conditional independence tests on systematically pooled data, e.g. C1 ⊥⊥ Y | X1, X1 ⊥⊥ X2 | Y, C2, X1 /⊥⊥ X2 | Y, …
  • We treat context variables as ordinary variables that we know are uncaused, so we can use any standard constraint-based causal discovery method
  • [tables: pooled datasets with C1, C2 indicator columns]
• 40. Causal domain adaptation: separating features
  • For simplicity, consider only one source and one target domain
  • [tables: source data (C1 = 0, Y observed) and target data (C1 = 1, Y missing)]
  • Assume we knew the causal graph: [figure: C1 → X1 → Y → X2 ← C1]
  • Which features can we use to predict Y in the target domain?
• 41–45. Causal domain adaptation: separating features
  • Separating features: sets of features that d-separate Y from the context variable C1 representing the target domain
  • Y ⊥⊥ C1 | X1 ≡ P(Y | X1, C1 = 0) = P(Y | X1, C1 = 1): Y | X1 matches across source and target
  • Y /⊥⊥ C1 | X2 ≡ P(Y | X2, C1 = 0) ≠ P(Y | X2, C1 = 1): Y | X2 differs
  • Separating features are exactly the features for which covariate shift holds!
  • {X1} is a separating feature set, while using {X1, X2} could lead to arbitrarily large error
  • Separating features need not be causal (parents of Y): e.g. a non-parent X3 added to the figure can be part of a separating set
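When the graph is known, finding separating feature sets is just a d-separation search, as in the sketch below (assuming, as above, the structure C1 → X1 → Y → X2 ← C1 for the slide's figure):

```python
from itertools import combinations
import networkx as nx

# The slide's example joint graph: C1 -> X1 -> Y -> X2 <- C1.
G = nx.DiGraph([("C1", "X1"), ("X1", "Y"), ("Y", "X2"), ("C1", "X2")])

features = ["X1", "X2"]
for r in range(len(features) + 1):
    for S in combinations(features, r):
        # S is separating iff Y ⊥⊥ C1 | S, i.e. covariate shift holds for S.
        sep = nx.d_separated(G, {"Y"}, {"C1"}, set(S))
        print(set(S) or "{}", "->", "separating" if sep else "not separating")
# Only {X1} comes out separating; adding X2 conditions on a collider
# and reopens a path between Y and C1.
```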
• 46–48. What if the causal graph is unknown?
  • Idea: we could test the conditional independences Y ⊥⊥ C1 | X1? and Y ⊥⊥ C1 | X2? in the data
  • [tables: source rows (C1 = 0) with Y observed; target rows (C1 = 1) with Y missing]
  • Problem: Y is always missing when C1 = 1, so we cannot test these
  • Idea: can we use all the other, testable independences? e.g. X1 /⊥⊥ X2, X1 /⊥⊥ C1, X1 ⊥⊥ X2 | Y, C1 = 0, X1 /⊥⊥ X2 | C1, …
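A sketch of what "testable" looks like in practice, using a standard Gaussian CI test (partial correlation with a Fisher z p-value; a generic stand-in, not necessarily the test used in the paper). Tests involving both Y and C1 are impossible because Y is missing whenever C1 = 1; everything else can run on the pooled rows or on the source rows:

```python
import numpy as np
from scipy import stats

def fisher_z_test(data, i, j, cond):
    """p-value for column i ⊥⊥ column j | cond under a Gaussian model."""
    sub = data[:, [i, j] + list(cond)]
    prec = np.linalg.inv(np.corrcoef(sub, rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z)))

# Toy data following the example graph C1 -> X1 -> Y -> X2 <- C1.
rng = np.random.default_rng(0)
n = 2000
C1 = rng.integers(0, 2, n)                    # 0 = source, 1 = target
X1 = C1 + rng.normal(0, 1, n)
Y = X1 + rng.normal(0, 1, n)
X2 = Y + 2 * C1 + rng.normal(0, 1, n)
data = np.column_stack([C1, X1, X2, Y])       # columns: C1, X1, X2, Y

src = data[data[:, 0] == 0]                   # rows where Y is observed
print(fisher_z_test(src, 1, 2, [3]))          # X1 ⊥⊥ X2 | Y, C1 = 0: large p
print(fisher_z_test(data, 1, 2, []))          # X1 /⊥⊥ X2: tiny p
```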
• 49–51. Assumptions [Magliacane et al. 2018]
  • We assume that there exists an acyclic causal graph that fits all the data (Joint Causal Inference)
  • We assume Y cannot be intervened upon directly
  • We assume no extra dependences involving Y appear in the target domain C1 = 1: for all A, D, B ⊂ V ∖ {Y, C1},
    Y ⊥⊥ A | B, C1 = 0 ⟹ Y ⊥⊥ A | B, C1 = 1
    A ⊥⊥ D | B, Y, C1 = 0 ⟹ A ⊥⊥ D | B, Y, C1 = 1
  • Note that this does not assume anything about the separating-set statements C1 ⊥⊥ Y | B themselves
• 52–55. A slightly more complicated example
  • [tables: two source datasets (C1 = 0, with C2 distinguishing them) with Y observed, and a target dataset (C1 = 1) with Y missing]
  • Perform the allowed CI tests: Y ⊥⊥ C2 | X1, C1 = 0; Y /⊥⊥ C2 | C1 = 0; X2 ⊥⊥ C2 | Y, C1 = 0
  • Consider all possible causal graphs compatible with these results
  • We can prove an untestable separating statement without reconstructing the graph: Y ⊥⊥ C1 | X1 is true in all possible compatible graphs
• 56–58. Invariance in sources is not invariance in target
  • [figures: the joint graph over C1, C2, X1, X2, Y; the graph restricted to the source regime C1 = 0]
  • Features that are invariant across the source domains: Y ⊥⊥ C2 | X1, C1 = 0; Y ⊥⊥ C2 | X2, C1 = 0; Y ⊥⊥ C2 | X1, X2, C1 = 0
  • But in the target domain: Y ⊥⊥ C2 | X1, C1 = 1, while Y /⊥⊥ C2 | X2, C1 = 1 and Y /⊥⊥ C2 | X1, X2, C1 = 1
  • Idea: use data from the target as well in order to learn which features are truly invariant
• 59–62. Inferring separating sets without enumerating all possible causal graphs
  • Query: Y ⊥⊥ C1 | X1?
  • Inputs: all testable conditional independences from the data (e.g. Y ⊥⊥ C2 | X1, C1 = 0; X1 ⊥⊥ X3 | X4; X2 ⊥⊥ C2 | Y, C1 = 0; …) plus the assumptions
  • A theorem prover over a logic encoding of d-separation [Hyttinen et al. 2014] answers the query
  • Three possible outcomes: provably separating (Y ⊥⊥ C1 | X1), provably not separating (Y /⊥⊥ C1 | X1), or not identifiable
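For intuition only (this is not the paper's machinery), the theorem-prover step can be imitated by brute force on a small system: enumerate every DAG consistent with background knowledge (context variables exogenous, no context → Y edge, and no latent confounders here for simplicity), keep those matching the testable (in)dependences of the running example, and check whether the query holds in all of them. The ASP encoding of Hyttinen et al. (2014) does this declaratively, with latent confounders, instead of by enumeration. Reading the source-regime test results as d-separations with C1 in the conditioning set is my simplification of what the assumptions license.

```python
from itertools import combinations, product
import networkx as nx

CONTEXT, SYSTEM = ["C1", "C2"], ["X1", "X2", "Y"]
NODES = CONTEXT + SYSTEM

def allowed(a, b):
    # Edge a -> b: contexts are exogenous (nothing points into them),
    # and Y cannot be intervened on directly (no context -> Y edge).
    return b not in CONTEXT and not (a in CONTEXT and b == "Y")

def all_dags():
    pairs = list(combinations(NODES, 2))
    for choice in product(range(3), repeat=len(pairs)):  # 0: absent, 1: a->b, 2: b->a
        edges = [(a, b) if c == 1 else (b, a)
                 for (a, b), c in zip(pairs, choice) if c]
        if all(allowed(a, b) for a, b in edges):
            G = nx.DiGraph(edges)
            G.add_nodes_from(NODES)
            if nx.is_directed_acyclic_graph(G):
                yield G

def dsep(G, a, b, Z):
    # nx.d_separated is the NetworkX 2.4+ name (is_d_separator in newer releases).
    return nx.d_separated(G, {a}, {b}, set(Z))

# The running example's source-regime test results, read as d-separation
# statements in the joint graph (C1 added to each conditioning set).
constraints = [("Y", "C2", {"X1", "C1"}, True),    # Y ⊥⊥ C2 | X1, C1
               ("Y", "C2", {"C1"}, False),         # Y /⊥⊥ C2 | C1
               ("X2", "C2", {"Y", "C1"}, True)]    # X2 ⊥⊥ C2 | Y, C1

compatible = [G for G in all_dags()
              if all(dsep(G, a, b, Z) == ind for (a, b, Z, ind) in constraints)]

# Provably separating iff the untestable query holds in EVERY compatible DAG.
verdicts = [dsep(G, "Y", "C1", {"X1"}) for G in compatible]
print(len(compatible), "compatible DAGs; Y ⊥⊥ C1 | X1 holds in all:", all(verdicts))
```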
• 63–64. Causal feature selection
  • Use a standard feature selection method to create a list L of subsets of features and context variables, ordered by lowest source-domain loss
  • For example, L = {X1, C2}, {X1, X2, C2}, {X1, X2}, …
  • Iteratively query the theorem prover until a provably separating set A is found
  • Train a model f(A) to predict Y on the source domains
  • Apply f(A) to predict Y on the target domain
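A schematic sketch of this loop, with the theorem prover abstracted into a callable (all names here are mine, not the paper's):

```python
def causal_feature_selection(candidates, provably_separating, train):
    """Walk candidate feature sets in order of increasing source-domain loss
    and return a model trained on the first provably separating one."""
    for A in candidates:
        if provably_separating(A):     # query the theorem prover
            return A, train(A)         # fit f(A) on the source domains only
    return None, None                  # no candidate could be certified

# Toy usage with a hard-coded "prover" answer for the running example:
candidates = [{"X1", "X2"}, {"X1"}, {"X2"}]
A, model = causal_feature_selection(
    candidates,
    provably_separating=lambda A: A == {"X1"},
    train=lambda A: f"model fitted on {sorted(A)}")
print(A, model)
```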
• 65. Results
  • Simulated data: 200 random causal graphs
  • Baseline method for feature selection and regression: random forest
  • Real-world data: CRM Causal Inference Challenge
    • 441 wild-type mice with 16 phenotypes
    • ~13 mutant mice (single-gene knockouts) for each of 10 interventions
  • "Causal cross-validation": we randomly selected 3 phenotypes and 2 knockouts and created 1000 datasets; for each dataset we randomly chose Y and C1 and ignored the values of Y in C1 = 1
• 66. Desiderata for a causality-inspired domain adaptation method (revisited)
  • X, Y and the changes can be represented by an unknown causal graph
  • Allow for latent confounders
  • Avoid parametric assumptions; allow for heterogeneous effects across domains
  • Instead of modeling the changes between each pair of domains, distinguish the change between the mixture of sources and the target
  • Avoid the common assumption that if Y | T(X) is invariant across multiple source domains, then Y | T(X) is also invariant in the target domain
  • Only search for invariant features with respect to the current target task: no need to find the causal graph or its equivalence class
• 67. Limitations and future work
  • Potentially too conservative: separating sets may exist that are not provably separating
    • Extension: can we use active learning / intervention design to decide on the most informative interventions?
  • Cannot be used directly for transformations of features T(X)
    • Extension: incorporate structure learning with deterministic relations
  • Scalability: the (error-correcting) logic-based encoding with all CI tests as input scales to tens of variables (including context variables)
    • Extension: use approximate algorithms, combine with low-dimensional representations
  • Can we apply this to multi-task RL (e.g. in factored MDPs)?