Causality-inspired ML: what can ideas from causality do for ML? The domain adaptation case
Sara Magliacane
MIT-IBM Watson AI Lab
Causality-inspired ML
• Real-world ML needs to deal with:
  • Biased data (fairness, selection bias, generalization)
  • Heterogeneous data, small samples, missing/corrupted data, non-iid data
  • Actionable insights (decisions cannot be made on correlations alone)
• Causal inference can help with some of these questions:
  • Systematic data fusion and reuse with biased, heterogeneous, non-iid data
  • A systematic way to extract actionable insights
• “Full” causality can be unnecessary or too expensive -> causality-inspired ML
Unsupervised domain adaptation by inferring separating sets of features
(joint work with Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg and Joris Mooij)
Transfer learning and causal inference
• Transfer learning:
  • How can I predict what happens when the distribution changes?
• Causal inference:
  • How can I predict what happens when the distribution changes after an intervention?
  • Perfect intervention do(X): do-calculus [Pearl, 2009]
  • Soft intervention on X ≈ change of the distribution P(X | parents) in the graph ≈ arbitrary change of distribution
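To make the two notions concrete, here is a minimal simulation sketch (the SCM, the variable names and the coefficients are illustrative assumptions, not from the talk): a perfect intervention do(X = x) removes the dependence of X on its parents entirely, while a soft intervention only changes the mechanism P(X | parents); the mechanism P(Y | X) is untouched in both cases.

```python
# Minimal sketch: perfect vs soft intervention in a toy linear SCM Z -> X -> Y.
# All names and coefficients are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(intervention=None):
    Z = rng.normal(0.0, 1.0, n)
    if intervention == "perfect":        # do(X = 2): cut the edge Z -> X entirely
        X = np.full(n, 2.0)
    elif intervention == "soft":         # change P(X | Z), but keep the edge Z -> X
        X = 3.0 * Z + rng.normal(1.0, 2.0, n)
    else:                                # observational mechanism
        X = 1.5 * Z + rng.normal(0.0, 1.0, n)
    Y = 2.0 * X + rng.normal(0.0, 1.0, n)  # P(Y | X) is invariant in all three cases
    return X, Y

for mode in [None, "perfect", "soft"]:
    X, Y = sample(mode)
    slope = np.cov(X, Y)[0, 1] / X.var() if X.var() > 0 else float("nan")
    print(f"{mode}: E[X] = {X.mean():.2f}, regression slope of Y on X = {slope:.2f}")
```

In all three cases the regression slope of Y on X stays near 2 (undefined under the perfect intervention, where X is constant), while the marginal of X changes: that is exactly the invariance that the rest of the talk exploits.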
Domain adaptation
• Source domain (or domains): {X_i, Y_i}_{i=1}^n ∼ P_S(X, Y)
• Target domain: {X_j, Y_j}_{j=1}^m ∼ P_T(X, Y) ≠ P_S(X, Y)
• Domain adaptation: predict Y in the target domain using (also) information from the source domain
• Unsupervised domain adaptation: no labels in the target domain (zero-shot), i.e. we only observe {X_j}_{j=1}^m ∼ P_T(X)

Source domain (“Normal”):
X1   X2  Y
0.1  2   0
0.2  3   0
1.1  2   1
0.1  3   0

Target domain (“Gene A”):
X1   X2  Y
3.1  2   ?
3.2  3   ?
4    2   ?
3.2  3   ?
Domain adaptation from a graphical perspective
• Add a variable D to represent the domain
• Consider the data as coming from a single distribution P(X, Y, D)
• We can represent P(X, Y, D) with a (possibly unknown) causal graph

[Figure: causal graph over D, X1, X2, Y]

Pooled data with the domain variable D (see the pooling sketch below):
D       X1   X2  Y
Normal  0.1  2   0
Normal  0.2  3   0
Normal  1.1  2   1
Normal  0.1  3   0
Gene A  3.1  2   ?
Gene A  3.2  3   ?
Gene A  4    2   ?
Gene A  3.2  3   ?
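A minimal sketch of this pooling step, using pandas and the toy values from the tables above (the column names follow the tables; everything else is just illustration):

```python
# Minimal sketch: pool source and target data into one dataset with an explicit
# domain variable D, so everything comes from a single distribution P(X, Y, D).
import pandas as pd

source = pd.DataFrame({"X1": [0.1, 0.2, 1.1, 0.1],
                       "X2": [2, 3, 2, 3],
                       "Y":  [0, 0, 1, 0]})
target = pd.DataFrame({"X1": [3.1, 3.2, 4.0, 3.2],
                       "X2": [2, 3, 2, 3],
                       "Y":  [None, None, None, None]})  # unlabeled target

source["D"] = "Normal"
target["D"] = "Gene A"
pooled = pd.concat([source, target], ignore_index=True)
print(pooled)
```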
D-separation with slightly less tears
• Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read off conditional independences
• Chain (e.g. D → X1 → Y): Y /⊥⊥ D, but Y ⊥⊥ D | X1, because conditioning on X1 closes the path
• Collider (e.g. D → X2 ← Y): Y ⊥⊥ D, but Y /⊥⊥ D | X2, because conditioning on X2 opens a closed path
• Combining both (D → X1 → Y and D → X2 ← Y): Y /⊥⊥ D | X1, X2, because conditioning on the collider X2 re-opens a path even though X1 blocks the chain (see the check below)
• Conditional independences with D tell us about invariance:
  Y ⊥⊥ D | X1 ⟹ P(Y | X1, D) = P(Y | X1) ⟹ P(Y | X1, D = d1) = P(Y | X1, D = d2)
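These statements can be checked programmatically. A minimal sketch, assuming networkx ≥ 2.8 (which provides nx.d_separated; in networkx ≥ 3.3 the same check is nx.is_d_separator), on the combined example graph:

```python
# Minimal sketch: check the d-separation statements above on the combined graph
# D -> X1 -> Y and D -> X2 <- Y. Assumes networkx >= 2.8 (nx.d_separated).
import networkx as nx

G = nx.DiGraph([("D", "X1"), ("X1", "Y"), ("Y", "X2"), ("D", "X2")])

print(nx.d_separated(G, {"Y"}, {"D"}, set()))         # False: the chain D -> X1 -> Y is open
print(nx.d_separated(G, {"Y"}, {"D"}, {"X1"}))        # True:  conditioning on X1 closes the chain
print(nx.d_separated(G, {"Y"}, {"D"}, {"X2"}))        # False: conditioning on the collider X2 opens a path
print(nx.d_separated(G, {"Y"}, {"D"}, {"X1", "X2"}))  # False: X2 re-opens a path although X1 blocks the chain
```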
Domain adaptation classification from a graphical perspective [Zhang et al. 2013]
• Covariate shift assumption: P_S(Y | X) = P_T(Y | X), i.e. Y ⊥⊥ D | X
  • Related to counterfactual inference [Johansson et al. 2016, 2018]
• Target shift and generalized target shift: here X is caused by Y, and the domain changes P(Y) (target shift) or also P(X | Y) (generalized target shift)

[Figures: causal graphs for covariate shift (D acting on X) and for (generalized) target shift (D → Y → X)]
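When the covariate shift assumption holds, a classical correction (a standard technique, not specific to this talk) is to reweight source examples by the density ratio w(x) ≈ p_T(x)/p_S(x), here estimated with a domain classifier; the data and model choices below are illustrative assumptions:

```python
# Minimal sketch: importance weighting under covariate shift.
# Assumes Y ⊥⊥ D | X actually holds; all names and numbers are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
Xs = rng.normal(0, 1, (500, 1))
Ys = (Xs[:, 0] + rng.normal(0, 0.5, 500) > 0).astype(int)
Xt = rng.normal(1, 1, (500, 1))      # shifted P(X), same P(Y | X), labels unobserved

# 1. Train a domain classifier to estimate p(target | x).
dom = LogisticRegression().fit(np.vstack([Xs, Xt]),
                               np.r_[np.zeros(500), np.ones(500)])
p_t = dom.predict_proba(Xs)[:, 1]
w = p_t / (1 - p_t)                  # proportional to p_T(x) / p_S(x)

# 2. Fit the label model on reweighted source data.
clf = LogisticRegression().fit(Xs, Ys, sample_weight=w)
print(clf.predict(Xt[:5]))
```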
Common misconceptions: 1. An invariant feature need not be causal

[Figure: causal graph over D, X1, X2, Y in which Y ⊥⊥ D | X1 and Y ⊥⊥ D | X1, X2]

• Y | X1, X2 is invariant ⟹ invariant features are not necessarily parents of Y
• Invariance across “many different datasets” is not enough in general to find the causal parents; we need more assumptions
• Invariant Causal Prediction [Peters et al. 2016] under causal sufficiency (sketched in code below):
  S* = ⋂_{S : Y ⊥⊥ D | S} S ⊆ Pa(Y), e.g. {X1, X2} ∩ {X1} = {X1}
• Sometimes too conservative: in another graph {X1}, {X2} and {X1, X2} are all invariant, but {X1} ∩ {X2} ∩ {X1, X2} = ∅
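A minimal sketch of the ICP intersection itself; `ci_test` stands in for any conditional independence test, and the toy oracle below just reproduces the first example above:

```python
# Minimal sketch of the Invariant Causal Prediction idea [Peters et al. 2016]:
# intersect all feature sets S for which Y ⊥⊥ D | S is accepted.
from itertools import chain, combinations

def powerset(features):
    return chain.from_iterable(combinations(features, r)
                               for r in range(len(features) + 1))

def icp(features, ci_test):
    """Intersect all sets S with Y ⊥⊥ D | S accepted; returns ∅ if none is."""
    invariant = [set(S) for S in powerset(features) if ci_test("Y", "D", set(S))]
    if not invariant:
        return set()
    S_star = set(features)
    for S in invariant:
        S_star &= S
    return S_star            # guaranteed ⊆ Pa(Y) under causal sufficiency

# Toy oracle for the example above: {X1} and {X1, X2} are the invariant sets.
oracle = lambda y, d, S: S in ({"X1"}, {"X1", "X2"})
print(icp(["X1", "X2"], oracle))   # {'X1'}
```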
Common misconception 2: Parents are not enough under latent confounding

[Figure: causal graph over D, X1, X2, Y with latent confounding, in which Y ⊥⊥ D | X1, Y /⊥⊥ D | X2 and Y /⊥⊥ D | X1, X2]

• Y | X1 is invariant, Y | X2 is not: even if we knew all the parents, under latent confounding this wouldn’t necessarily help transfer
• Conclusion: causality (e.g. using the causal parents, learning the complete causal graph) is neither necessary nor sufficient* for transfer
Desiderata for a causality-inspired domain adaptation method
• X, Y and the changes can be represented by an unknown causal graph
• Allow for latent confounders
• Avoid parametric assumptions; allow for heterogeneous effects across domains
• Instead of modeling the changes between each pair of domains, distinguish the change between the mixture of sources and the target
• Avoid the common assumption that if Y | T(X) is invariant across multiple source domains, then Y | T(X) is also invariant in the target domain
  • This is also (implicitly) assumed by methods based on the idea that invariance ⟹ causality
Causal domain adaptation problem [Magliacane et al. 2018]
• Unsupervised multi-source domain adaptation
• We interpret the change in the target domain as a soft intervention
• We assume Y cannot be intervened upon directly; P(Y) can still change

Source domain 1 (“Normal”):
X1   X2  Y
0.1  2   0
0.2  3   0
1.1  2   1
0.1  3   0

Source domain 2 (“Gene A”):
X1   X2  Y
3.1  2   1
3.2  3   1
4    1   1
3.2  3   0

Target domain (“Gene B”):
X1   X2  Y
0.2  1   ?
0.3  1   ?
0.3  2   ?
0.4  1   ?
Joint Causal Inference [Mooij et al. 2020]
• We jointly represent the different distributions as a single unknown causal graph
• Instead of a single domain variable, we add several context variables, so we can disentangle changes in distribution across the datasets
• If we know nothing about the changes in the datasets, we use indicator variables

The three datasets above, pooled with indicator context variables C1 (“Gene A”) and C2 (“Gene B”):
C1  C2  X1   X2  Y
0   0   0.1  2   0
0   0   0.2  3   0
0   0   1.1  2   1
0   0   0.1  3   0
1   0   3.1  2   1
1   0   3.2  3   1
1   0   4    1   1
1   0   3.2  3   0
0   1   0.2  1   ?
0   1   0.3  1   ?
0   1   0.3  2   ?
0   1   0.4  1   ?
Joint Causal Inference [Mooij et al. 2020]
• The joint graph is the union of all the (possibly different) subgraphs in each dataset, plus context variables that model the differences between the datasets
• We allow for latent confounders and, in general, also cyclic causal graphs

[Figure: the subgraphs over X1, Y, X2 for C1 = 0 and C1 = 1, and their union joined by the context variable C1]
Joint Causal Inference [Mooij et al. 2020]
• We can learn an equivalence class of the unknown single causal graph using conditional independence tests on the systematically pooled data
• We treat context variables as normal variables that we know are uncaused, and we can use any standard constraint-based causal discovery method

Example tests on the pooled data (see the CI-test sketch below):
C1 ⊥⊥ Y | X1
X1 ⊥⊥ X2 | Y, C2
X1 /⊥⊥ X2 | Y
…

[Figure: learned graph over X1, Y, X2 with context variables C1, C2]

Pooled toy data (all three datasets labeled in this example):
C1  C2  X1   X2  Y
0   0   0.1  1   0
0   0   0.2  1   0
0   0   1.1  2   1
0   1   0.2  0   0
0   1   0.3  0   1
0   1   0.3  1   0
1   0   3.1  2   1
1   0   3.2  3   1
1   0   4    3   1
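A minimal sketch of the basic building block, a conditional independence test on the pooled data; this Fisher-z partial-correlation test assumes roughly Gaussian variables, and `df` stands for a pooled DataFrame like the one above (real analyses would pick a test suited to the data types):

```python
# Minimal sketch: Fisher-z partial-correlation CI test, the building block of
# constraint-based causal discovery. Assumes roughly Gaussian variables.
import numpy as np
import pandas as pd
from scipy import stats

def ci_test(df: pd.DataFrame, x: str, y: str, given=(), alpha=0.05) -> bool:
    """Return True if x ⊥⊥ y | given is accepted at level alpha."""
    data = df[[x, y, *given]].to_numpy(dtype=float)
    prec = np.linalg.pinv(np.corrcoef(data, rowvar=False))  # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])      # partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(df) - len(given) - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha

# e.g., on the labeled rows of a pooled dataset `df`:
#   ci_test(df, "C1", "Y", given=("X1",))   # accepted -> consistent with C1 ⊥⊥ Y | X1
```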
Causal domain adaptation: separating features
• For simplicity: consider only one source domain and one target domain, and assume we knew the causal graph
• Which features can we use to predict Y in the target domain?

[Figure: causal graph over C1, X1, Y, X2, X3; scatter plots of Y vs X1 and Y vs X2 in the source and target domains]

• Separating features: sets of features that d-separate Y from the context variable C1 representing the target domain, i.e. exactly the features for which covariate shift holds:
  Y ⊥⊥ C1 | X1 ≡ P(Y | X1, C1 = 0) = P(Y | X1, C1 = 1)
  Y /⊥⊥ C1 | X2 ≡ P(Y | X2, C1 = 0) ≠ P(Y | X2, C1 = 1)
• {X1} is a separating feature set; using {X1, X2} could lead to arbitrarily large error (see the simulation sketch below)
• Separating features need not be causal (parents of Y), e.g. X3
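The following simulation sketch illustrates the point with an illustrative SCM (not the one in the talk): regressing on the separating set {X1} transfers, while adding the non-separating X2 (a child of Y whose mechanism flips sign in the target) hurts badly:

```python
# Minimal sketch: {X1} is separating and transfers; {X1, X2} is not and fails.
# The SCM and all coefficients are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def domain(c1, n=20_000):
    X1 = rng.normal(0, 1, n)
    Y = X1 + rng.normal(0, 0.3, n)          # P(Y | X1) invariant across domains
    sign = 1.0 if c1 == 0 else -1.0          # mechanism of X2 changes in the target
    X2 = sign * Y + rng.normal(0, 0.3, n)
    return np.c_[X1, X2], Y

Xs, Ys = domain(c1=0)   # source
Xt, Yt = domain(c1=1)   # target (labels used only to evaluate)

for feats, name in [([0], "{X1}"), ([0, 1], "{X1, X2}")]:
    model = LinearRegression().fit(Xs[:, feats], Ys)
    mse = np.mean((model.predict(Xt[:, feats]) - Yt) ** 2)
    print(f"features {name}: target MSE = {mse:.2f}")
```

The source regression happily leans on X2 (it is nearly equal to Y in the source), so the {X1, X2} model breaks in the target, while the {X1} model keeps the same error.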
What if the causal graph is unknown?
• Idea: we could test the conditional independences in the data: Y ⊥⊥ C1 | X1? Y ⊥⊥ C1 | X2?
• Problem: Y is always missing when C1 = 1, so we cannot test these directly

C1  X1   X2  Y
0   0.1  2   0
0   0.2  3   0
0   1.1  2   1
0   0.1  3   0
1   0.2  1   ?
1   0.3  1   ?
1   0.3  2   ?
1   0.4  1   ?

• Idea: can we use all the other (testable) independences instead? E.g.
  X1 /⊥⊥ X2
  X1 /⊥⊥ C1
  X1 ⊥⊥ X2 | Y, C1 = 0
  X1 /⊥⊥ X2 | C1
  …
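A minimal sketch of which tests are allowed: any statement that pairs Y with C1 is untestable, because Y is unobserved whenever C1 = 1; statements involving Y but not C1 can still be tested within the source data (implicitly conditioning on C1 = 0), and statements not involving Y can use all data, including the unlabeled target:

```python
# Minimal sketch: enumerate CI statements and flag which ones are testable
# when Y is missing in the target domain (C1 = 1).
from itertools import combinations

variables = ["C1", "X1", "X2", "Y"]

def testable(x, y, given):
    involves_y = "Y" in {x, y} or "Y" in given
    involves_c1 = "C1" in {x, y} or "C1" in given
    # A statement relating Y and C1 can never be tested: Y is unobserved when C1 = 1.
    # Statements with Y but without C1 are tested on the source rows (C1 = 0) only.
    return not (involves_y and involves_c1)

for x, y in combinations(variables, 2):
    rest = [v for v in variables if v not in (x, y)]
    for r in range(len(rest) + 1):
        for given in combinations(rest, r):
            tag = "testable" if testable(x, y, set(given)) else "NOT testable"
            print(f"{x} ⊥⊥ {y} | {set(given) or '∅'}: {tag}")
```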
Assumptions [Magliacane et al. 2018]
• We assume that there exists an acyclic causal graph that fits all the data (Joint Causal Inference)
• We assume Y cannot be intervened upon directly
• We assume no extra dependences involving Y in the target domain C1 = 1: for all A, B, D ⊂ V ∖ {Y, C1},
  Y ⊥⊥ A | B, C1 = 0 ⟹ Y ⊥⊥ A | B, C1 = 1
  A ⊥⊥ D | B, Y, C1 = 0 ⟹ A ⊥⊥ D | B, Y, C1 = 1
• Note that this does not assume anything about the separating-set test C1 ⊥⊥ Y | B
A slightly more complicated example

Pooled data (two labeled sources, one unlabeled target with C1 = 1):
C1  C2  X1   X2  Y
0   0   0.1  1   0
0   0   0.2  1   0
0   0   1.1  2   1
0   1   3.1  2   1
0   1   3.2  3   1
0   1   4    3   1
1   0   0.2  0   ?
1   0   0.3  0   ?
1   0   0.3  1   ?

• Perform the allowed CI tests, e.g.:
  Y ⊥⊥ C2 | X1, C1 = 0
  Y /⊥⊥ C2 | C1 = 0
  X2 ⊥⊥ C2 | Y, C1 = 0

[Figure: all possible graphs over X1, Y, X2, C1, C2 compatible with these test results]

• We can prove an untestable separating statement without reconstructing the graph: Y ⊥⊥ C1 | X1 is true in all possible compatible graphs
Invariance in sources is not invariance in target

[Figure: the joint causal graph over X1, Y, X2 with context variables C1, C2, and the source-only (C1 = 0) subgraph]

• Features that are invariant across the source domains (C1 = 0):
  Y ⊥⊥ C2 | X1, C1 = 0    Y ⊥⊥ C2 | X2, C1 = 0    Y ⊥⊥ C2 | X1, X2, C1 = 0
• But in the target domain (C1 = 1) only the first invariance survives:
  Y ⊥⊥ C2 | X1, C1 = 1    Y /⊥⊥ C2 | X2, C1 = 1    Y /⊥⊥ C2 | X1, X2, C1 = 1
• Idea: also use data from the target in order to learn invariant features
Inferring separating sets without enumerating all possible causal graphs

Query: Y ⊥⊥ C1 | X1?

Inputs:
• All testable conditional independences from data, e.g. Y ⊥⊥ C2 | X1, C1 = 0; X1 ⊥⊥ X3 | X4; X2 ⊥⊥ C2 | Y, C1 = 0; …
• The assumptions above

A theorem prover with a logic encoding of d-separation [Hyttinen et al. 2014] then returns one of three answers (see the toy illustration below):
• Y ⊥⊥ C1 | X1: provably separating
• Y /⊥⊥ C1 | X1: provably not separating
• ?: not identifiable
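For intuition about what the query means, here is a brute-force sketch that enumerates tiny DAGs compatible with the known facts and checks whether the query holds in all of them; the actual method uses the logic encoding precisely to avoid this enumeration, and it also allows latent confounders, which this toy version ignores (it additionally assumes faithfulness, matching independences to d-separations exactly, and needs networkx ≥ 2.8 for nx.d_separated):

```python
# For intuition only: brute-force semantics of the theorem-prover query over
# tiny DAGs without latent variables, assuming faithfulness.
from itertools import product
import networkx as nx

nodes = ["C1", "X1", "Y"]
candidate_edges = [(a, b) for a in nodes for b in nodes
                   if a != b and b != "C1"]       # the context variable C1 is uncaused

def all_dags():
    for mask in product([0, 1], repeat=len(candidate_edges)):
        G = nx.DiGraph([e for e, keep in zip(candidate_edges, mask) if keep])
        G.add_nodes_from(nodes)
        if nx.is_directed_acyclic_graph(G):
            yield G

# Known (testable) facts as (x, y, conditioning set, independent?) tuples:
facts = [("X1", "C1", set(), False),              # X1 /⊥⊥ C1
         ("Y", "X1", set(), False)]               # Y /⊥⊥ X1

compatible = [G for G in all_dags()
              if all(nx.d_separated(G, {x}, {y}, z) == ind for x, y, z, ind in facts)]

answers = {nx.d_separated(G, {"Y"}, {"C1"}, {"X1"}) for G in compatible}
if answers == {True}:
    print("provably separating")
elif answers == {False}:
    print("provably not separating")
else:
    print("not identifiable")  # with only these two facts the query is undetermined
```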
Causal feature selection
• Use a standard feature selection method to create a list L of subsets of features and context variables, ordered by lowest source-domain loss
  • For example, L = {X1, C2}, {X1, X2, C2}, {X1, X2}, …
• Iteratively query the theorem prover until a provably separating set A is found
• Train a model f(A) to predict Y on the source domains
• Apply f(A) to predict Y on the target domain
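Putting the pieces together, a minimal sketch of the loop described above; `candidate_sets` (ordered by source loss), `is_provably_separating` (the theorem-prover query) and the random forest are placeholders, and the feature matrices are assumed to be pandas DataFrames:

```python
# Minimal sketch of the causal feature selection loop described above.
from sklearn.ensemble import RandomForestRegressor

def causal_feature_selection(candidate_sets, is_provably_separating,
                             X_source, y_source, X_target):
    for A in candidate_sets:                 # ordered by source-domain loss
        if is_provably_separating(A):        # theorem-prover query: Y ⊥⊥ C1 | A?
            model = RandomForestRegressor().fit(X_source[list(A)], y_source)
            return A, model.predict(X_target[list(A)])
    return None, None                        # no provably separating set found
```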
Results
• Simulated data: 200 random causal graphs
  • Baseline method for feature selection and regression: random forest
• Real-world data: CRM Causal Inference Challenge
  • 441 wild-type mice with 16 phenotypes
  • ~13 mutant mice (single-gene knockout) for each of 10 interventions
• “Causal cross-validation”:
  • We randomly selected 3 phenotypes and 2 knockouts and created 1000 datasets
  • For each dataset we randomly chose Y and C1 and ignored the values of Y when C1 = 1
Desiderata for a causality-inspired domain adaptation method (recap)
• X, Y and the changes can be represented by an unknown causal graph: no need to find the causal graph or its equivalence class
• Allow for latent confounders
• Avoid parametric assumptions; allow for heterogeneous effects across domains
• Instead of modeling the changes between each pair of domains, distinguish the change between the mixture of sources and the target
• Avoid the common assumption that if Y | T(X) is invariant across multiple source domains, then Y | T(X) is also invariant in the target domain
• Only search for invariant features with respect to the current target task
Limitations and future work
• Potentially too conservative: separating sets may exist that are not provably separating
  • Extension: can we use active learning / intervention design to decide on the most informative interventions?
• Cannot be used directly for transformations of features T(X)
  • Extension: incorporate structure learning with deterministic relations
• Scalability: the (error-correcting) logic-based encoding with all CI tests as input scales to tens of variables (including context variables)
  • Extension: use approximate algorithms, combine with low-dimensional representations
• Can we apply this to multi-task RL (e.g. in factored MDPs)?
Thanks!

Causality-inspired ML - a use case in unsupervised domain adaptation

  • 1. Causality-inspired ML: what can ideas from causality do for ML? The domain adaptation case Sara Magliacane MIT-IBM Watson AI Lab
  • 2. Causality-inspired ML • Real-world ML needs to deal with: • Biased data (fairness, selection bias, generalization) • Heterogeneous data, small samples, missing/corrupted data, not iid • Actionable insights (decisions cannot be made on correlations)
  • 3. Causality-inspired ML • Real-world ML needs to deal with: • Biased data (fairness, selection bias, generalization) • Heterogeneous data, small samples, missing/corrupted data, not iid • Actionable insights (decisions cannot be made on correlations) • Causal inference can help with some of these questions: • Systematic data fusion and reuse with biased data, heterogenous, not iid data • A systematic way to extract actionable insights
  • 4. Causality-inspired ML • Real-world ML needs to deal with: • Biased data (fairness, selection bias, generalization) • Heterogeneous data, small samples, missing/corrupted data, not iid • Actionable insights (decisions cannot be made on correlations) • Causal inference can help with some of these questions: • Systematic data fusion and reuse with biased data, heterogenous, not iid data • A systematic way to extract actionable insights • “Full” causality can be not necessary or too expensive -> causality-inspired ML
  • 5. Unsupervised domain adaptation by inferring separating sets of features (joint work with Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg and Joris Mooij)
  • 6. Transfer learning and causal inference • Transfer learning: • How can I predict what happens when the distribution changes?
  • 7. Transfer learning and causal inference • Transfer learning: • How can I predict what happens when the distribution changes?
  • 8. Transfer learning and causal inference • Transfer learning: • How can I predict what happens when the distribution changes? • Causal inference: • How can I predict what happens when the distribution changes after an intervention? • Perfect intervention do(X): • do-calculus [Pearl, 2009]
  • 9. Transfer learning and causal inference • Transfer learning: • How can I predict what happens when the distribution changes? • Causal inference: • How can I predict what happens when the distribution changes after an intervention? • Perfect intervention do(X): • do-calculus [Pearl, 2009] • Soft intervention on X change of distribution of P(X| parents) in the graph • arbitrary change of distribution ≈ ≈
  • 11. Domain adaptation • Source domain (or domains): • Target domain: • Domain adaptation: predict Y in target domain using (also) information from source domain • Unsupervised domain adaptation: no labels in the target domain (zero-shot) {Xi, Yi}n i=1 ∼ PS(X, Y) {Xj, Yj}m j=1 ∼ PT(X, Y) ≠ PS(X, Y) {Xj}m j=1 ∼ PT(X)
  • 12. Domain adaptation X1 X2 Y Normal 0,1 2 0 Normal 0,2 3 0 Normal 1,1 2 1 Normal 0,1 3 0 X1 X2 Y Gene A 3,1 2 ? Gene A 3,2 3 ? Gene A 4 2 ? Gene A 3,2 3 ? • Source domain (or domains): • Target domain: • Domain adaptation: predict Y in target domain using (also) information from source domain • Unsupervised domain adaptation: no labels in the target domain (zero-shot) {Xi, Yi}n i=1 ∼ PS(X, Y) {Xj, Yj}m j=1 ∼ PT(X, Y) ≠ PS(X, Y) {Xj}m j=1 ∼ PT(X)
  • 13. Domain adaptation from a graphical perspective X1 X2 Y Normal 0,1 2 0 Normal 0,2 3 0 Normal 1,1 2 1 Normal 0,1 3 0 X1 X2 Y Gene A 3,1 2 ? Gene A 3,2 3 ? Gene A 4 2 ? Gene A 3,2 3 ?
  • 14. Domain adaptation from a graphical perspective D X1 X2 Y Normal 0,1 2 0 Normal 0,2 3 0 Normal 1,1 2 1 Normal 0,1 3 0 • Add a variable D to represent the domain X1 X2 Y Gene A 3,1 2 ? Gene A 3,2 3 ? Gene A 4 2 ? Gene A 3,2 3 ?
  • 15. Domain adaptation from a graphical perspective • Add a variable D to represent the domain • Consider the data as coming from a single distribution P(X,Y, D) X1 X2 Y Gene A 3,1 2 ? Gene A 3,2 3 ? Gene A 4 2 ? Gene A 3,2 3 ? D X1 X2 Y Normal 0,1 2 0 Normal 0,2 3 0 Normal 1,1 2 1 Normal 0,1 3 0
  • 16. Domain adaptation from a graphical perspective • Add a variable D to represent the domain • Consider the data as coming from a single distribution P(X,Y, D) X1 Y X2D • We can represent P(X,Y, D) with a (possibly unknown) causal graph X1 X2 Y Gene A 3,1 2 ? Gene A 3,2 3 ? Gene A 4 2 ? Gene A 3,2 3 ? D X1 X2 Y Normal 0,1 2 0 Normal 0,2 3 0 Normal 1,1 2 1 Normal 0,1 3 0
  • 17. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y /⊥⊥ D
  • 18. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path
  • 19. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path Y ⊥⊥ D Y X2 D
  • 20. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path Y ⊥⊥ D Y /⊥⊥ D|X2 • Conditioning on X2 opens a closed path Y X2 D
  • 21. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path Y X2 D Y ⊥⊥ D Y /⊥⊥ D|X2 • Conditioning on X2 opens a closed path X1 YD X2
  • 22. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path Y X2 D Y ⊥⊥ D Y /⊥⊥ D|X2 • Conditioning on X2 opens a closed path X1 YD X2 Y /⊥⊥ D|X1, X2
  • 23. D-separation with slightly less tears X1 YD • Given a causal graph, d-separation is a graphical criterion that (under standard conditions) allows us to read conditional independences Y ⊥⊥ D|X1Y /⊥⊥ D • Conditioning on X1 closes the path Y X2 D Y ⊥⊥ D Y /⊥⊥ D|X2 • Conditioning on X2 opens a closed path • Conditional independences with D tell us about invariance Y ⊥⊥ D|X1 ⟹ P(Y|X1, D) = P(Y|X1) ⟹ P(Y|X1, D = d1) = P(Y|X1, D = d2)
  • 24. Domain adaptation classification from a graphical perspective [Zhang et al. 2013] • Covariate shift assumption: • Related to counterfactual inference [Johansson et al. 2016, 2018] PS(Y|X) = PT(Y|X) ≡ Y ⊥⊥ D|X X1 YX2D
  • 25. Domain adaptation classification from a graphical perspective [Zhang et al. 2013] • Covariate shift assumption: • Related to counterfactual inference [Johansson et al. 2016, 2018] • Target shift and generalized target shift PS(Y|X) = PT(Y|X) ≡ Y ⊥⊥ D|X X1 YX2D YD X YD X
  • 26. Common misconceptions: 1. An invariant feature need not be causal X1 Y X2D Y ⊥⊥ D|X1 • Y|X1,X2 is invariant invariant features are not necessarily parents of Y⟹ Y ⊥⊥ D|X1, X2
  • 27. Common misconceptions: 1. An invariant feature need not be causal X1 Y X2D Y ⊥⊥ D|X1 • Y|X1,X2 is invariant invariant features are not necessarily parents of Y⟹ Y ⊥⊥ D|X1, X2 Invariant feature across “many different datasets” is not enough in general to find causal parents, need more assumptions
  • 28. Common misconceptions: 1. An invariant feature need not be causal X1 Y X2D Y ⊥⊥ D|X1 • Y|X1,X2 is invariant invariant features are not necessarily parents of Y • Invariant Causal Prediction [Peters et al. 2016] under causal sufficiency: ⟹ S* = ⋂ Y⊥⊥D|S S ⊆ Pa(Y) {X1, X2} ∩ {X1} = {X1} Y ⊥⊥ D|X1, X2
  • 29. Common misconceptions: 1. An invariant feature need not be causal X1 Y X2D Y ⊥⊥ D|X1 • Y|X1,X2 is invariant invariant features are not necessarily parents of Y • Invariant Causal Prediction [Peters et al. 2016] under causal sufficiency: ⟹ S* = ⋂ Y⊥⊥D|S S ⊆ Pa(Y) {X1, X2} ∩ {X1} = {X1} Y ⊥⊥ D|X1, X2
  • 30. Common misconceptions: 1. An invariant feature need not be causal X1 Y X2D Y ⊥⊥ D|X1 {X1, X2} ∩ {X1} = {X1} Y ⊥⊥ D|X1, X2 X1 YX2D {X1} ∩ {X2} ∩ {X1, X2} = ∅ • Y|X1,X2 is invariant invariant features are not necessarily parents of Y • Invariant Causal Prediction [Peters et al. 2016] under causal sufficiency: ⟹ S* = ⋂ Y⊥⊥D|S S ⊆ Pa(Y) Sometimes too conservative:
• 31–32. Common misconception 2: Parents are not enough under latent confounding
  • [figure: graph where X2 is a parent of Y but shares a latent confounder with Y]
  • Y ⊥⊥ D | X1, Y /⊥⊥ D | X2, Y /⊥⊥ D | X1, X2
  • Y | X1 is invariant, Y | X2 is not: even if we knew all the parents, under latent confounding this wouldn't necessarily help transfer
  • Conclusion: causality (e.g. using the causal parents, learning the complete causal graph) is neither necessary nor sufficient for transfer
• 33–35. Desiderata for a causality-inspired domain adaptation method
  • X, Y and the changes can be represented by an unknown causal graph
  • Allow for latent confounders
  • Avoid parametric assumptions; allow for heterogeneous effects across domains
  • Instead of modeling the changes between each pair of domains, distinguish the change between the mixture of sources and the target
  • Avoid the common assumption that if Y | T(X) is invariant across multiple source domains, then Y | T(X) is also invariant in the target domain
  • This is also (implicitly) assumed by methods built on the idea that invariance ⟹ causality
• 36. Causal domain adaptation problem [Magliacane et al. 2018]
  • [tables: two labeled source datasets (Normal, Gene A) and one unlabeled target dataset (Gene B) over X1, X2, Y]
  • Unsupervised multi-source domain adaptation
  • We interpret the change in the target domain as a soft intervention
  • We assume Y cannot be intervened upon directly (P(Y) can still change)
• 37. Joint Causal Inference [Mooij et al. 2020]
  • We represent several different distributions jointly as a single unknown causal graph
  • Instead of a single domain variable, we add several context variables, so we can disentangle changes in distribution across the datasets
  • If we know nothing about the changes in the datasets, we use indicator variables
  • [tables: the three datasets pooled into one, with added indicator columns (Normal: C1 = 0, C2 = 0; Gene A: C1 = 1, C2 = 0; Gene B: C1 = 0, C2 = 1)]
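A minimal pandas sketch of the pooling step (the numbers are abbreviated stand-ins for the slide's tables): each dataset keeps its rows and gains indicator columns recording which context it came from.

```python
import pandas as pd

# Abbreviated versions of the slide's three datasets.
normal = pd.DataFrame({"X1": [0.1, 0.2, 1.1], "X2": [2, 3, 2], "Y": [0, 0, 1]})
gene_a = pd.DataFrame({"X1": [3.1, 3.2, 4.0], "X2": [2, 3, 1], "Y": [1, 1, 1]})
gene_b = pd.DataFrame({"X1": [0.2, 0.3, 0.3], "X2": [1, 1, 2],
                       "Y": [None, None, None]})   # unlabeled target

# Pool everything into one table with indicator context variables.
pooled = pd.concat([normal.assign(C1=0, C2=0),
                    gene_a.assign(C1=1, C2=0),
                    gene_b.assign(C1=0, C2=1)], ignore_index=True)
print(pooled)
```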
• 38. Joint Causal Inference [Mooij et al. 2020]
  • The joint graph is the union of the (possibly different) subgraphs in each dataset, plus context variables that model the differences between datasets
  • [figure: the X1, Y, X2 subgraphs for C1 = 0 and C1 = 1, and their union with C1 added as a node]
  • We allow for latent confounders and, in general, also cyclic causal graphs
• 39. Joint Causal Inference [Mooij et al. 2020]
  • We can learn an equivalence class of the unknown causal graph using conditional independence tests on systematically pooled data, e.g. C1 ⊥⊥ Y | X1, X1 ⊥⊥ X2 | Y, C2, X1 /⊥⊥ X2 | Y, …
  • We treat context variables as ordinary variables that we know are uncaused, so we can use any standard constraint-based causal discovery method
  • [tables: pooled datasets with C1, C2 indicator columns]
• 40. Causal domain adaptation: separating features
  • For simplicity, consider only one source and one target domain
  • [tables: source data (C1 = 0, Y observed) and target data (C1 = 1, Y missing)]
  • Assume we knew the causal graph: [figure: C1 → X1 → Y → X2 ← C1]
  • Which features can we use to predict Y in the target domain?
• 41–45. Causal domain adaptation: separating features
  • Separating features: sets of features that d-separate Y from the context variable C1 representing the target domain
  • Y ⊥⊥ C1 | X1 ≡ P(Y | X1, C1 = 0) = P(Y | X1, C1 = 1): Y | X1 matches across source and target
  • Y /⊥⊥ C1 | X2 ≡ P(Y | X2, C1 = 0) ≠ P(Y | X2, C1 = 1): Y | X2 differs
  • Separating features are exactly the features for which covariate shift holds!
  • {X1} is a separating feature set, while using {X1, X2} could lead to arbitrarily large error
  • Separating features need not be causal (parents of Y): e.g. a non-parent X3 added to the figure can be part of a separating set
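When the graph is known, finding separating feature sets is just a d-separation search, as in the sketch below (assuming, as above, the structure C1 → X1 → Y → X2 ← C1 for the slide's figure):

```python
from itertools import combinations
import networkx as nx

# The slide's example joint graph: C1 -> X1 -> Y -> X2 <- C1.
G = nx.DiGraph([("C1", "X1"), ("X1", "Y"), ("Y", "X2"), ("C1", "X2")])

features = ["X1", "X2"]
for r in range(len(features) + 1):
    for S in combinations(features, r):
        # S is separating iff Y ⊥⊥ C1 | S, i.e. covariate shift holds for S.
        sep = nx.d_separated(G, {"Y"}, {"C1"}, set(S))
        print(set(S) or "{}", "->", "separating" if sep else "not separating")
# Only {X1} comes out separating; adding X2 conditions on a collider
# and reopens a path between Y and C1.
```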
• 46–48. What if the causal graph is unknown?
  • Idea: we could test the conditional independences Y ⊥⊥ C1 | X1? and Y ⊥⊥ C1 | X2? in the data
  • [tables: source rows (C1 = 0) with Y observed; target rows (C1 = 1) with Y missing]
  • Problem: Y is always missing when C1 = 1, so we cannot test these
  • Idea: can we use all the other, testable independences? e.g. X1 /⊥⊥ X2, X1 /⊥⊥ C1, X1 ⊥⊥ X2 | Y, C1 = 0, X1 /⊥⊥ X2 | C1, …
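A sketch of what "testable" looks like in practice, using a standard Gaussian CI test (partial correlation with a Fisher z p-value; a generic stand-in, not necessarily the test used in the paper). Tests involving both Y and C1 are impossible because Y is missing whenever C1 = 1; everything else can run on the pooled rows or on the source rows:

```python
import numpy as np
from scipy import stats

def fisher_z_test(data, i, j, cond):
    """p-value for column i ⊥⊥ column j | cond under a Gaussian model."""
    sub = data[:, [i, j] + list(cond)]
    prec = np.linalg.inv(np.corrcoef(sub, rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z)))

# Toy data following the example graph C1 -> X1 -> Y -> X2 <- C1.
rng = np.random.default_rng(0)
n = 2000
C1 = rng.integers(0, 2, n)                    # 0 = source, 1 = target
X1 = C1 + rng.normal(0, 1, n)
Y = X1 + rng.normal(0, 1, n)
X2 = Y + 2 * C1 + rng.normal(0, 1, n)
data = np.column_stack([C1, X1, X2, Y])       # columns: C1, X1, X2, Y

src = data[data[:, 0] == 0]                   # rows where Y is observed
print(fisher_z_test(src, 1, 2, [3]))          # X1 ⊥⊥ X2 | Y, C1 = 0: large p
print(fisher_z_test(data, 1, 2, []))          # X1 /⊥⊥ X2: tiny p
```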
• 49–51. Assumptions [Magliacane et al. 2018]
  • We assume that there exists an acyclic causal graph that fits all the data (Joint Causal Inference)
  • We assume Y cannot be intervened upon directly
  • We assume no extra dependences involving Y appear in the target domain C1 = 1: for all A, D, B ⊂ V ∖ {Y, C1},
    Y ⊥⊥ A | B, C1 = 0 ⟹ Y ⊥⊥ A | B, C1 = 1
    A ⊥⊥ D | B, Y, C1 = 0 ⟹ A ⊥⊥ D | B, Y, C1 = 1
  • Note that this does not assume anything about the separating-set statements C1 ⊥⊥ Y | B themselves
• 52–55. A slightly more complicated example
  • [tables: two source datasets (C1 = 0, with C2 distinguishing them) with Y observed, and a target dataset (C1 = 1) with Y missing]
  • Perform the allowed CI tests: Y ⊥⊥ C2 | X1, C1 = 0; Y /⊥⊥ C2 | C1 = 0; X2 ⊥⊥ C2 | Y, C1 = 0
  • Consider all possible causal graphs compatible with these results
  • We can prove an untestable separating statement without reconstructing the graph: Y ⊥⊥ C1 | X1 is true in all possible compatible graphs
• 56–58. Invariance in sources is not invariance in target
  • [figures: the joint graph over C1, C2, X1, X2, Y; the graph restricted to the source regime C1 = 0]
  • Features that are invariant across the source domains: Y ⊥⊥ C2 | X1, C1 = 0; Y ⊥⊥ C2 | X2, C1 = 0; Y ⊥⊥ C2 | X1, X2, C1 = 0
  • But in the target domain: Y ⊥⊥ C2 | X1, C1 = 1, while Y /⊥⊥ C2 | X2, C1 = 1 and Y /⊥⊥ C2 | X1, X2, C1 = 1
  • Idea: use data from the target as well in order to learn which features are truly invariant
• 59–62. Inferring separating sets without enumerating all possible causal graphs
  • Query: Y ⊥⊥ C1 | X1?
  • Inputs: all testable conditional independences from the data (e.g. Y ⊥⊥ C2 | X1, C1 = 0; X1 ⊥⊥ X3 | X4; X2 ⊥⊥ C2 | Y, C1 = 0; …) plus the assumptions
  • A theorem prover over a logic encoding of d-separation [Hyttinen et al. 2014] answers the query
  • Three possible outcomes: provably separating (Y ⊥⊥ C1 | X1), provably not separating (Y /⊥⊥ C1 | X1), or not identifiable
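For intuition only (this is not the paper's machinery), the theorem-prover step can be imitated by brute force on a small system: enumerate every DAG consistent with background knowledge (context variables exogenous, no context → Y edge, and no latent confounders here for simplicity), keep those matching the testable (in)dependences of the running example, and check whether the query holds in all of them. The ASP encoding of Hyttinen et al. (2014) does this declaratively, with latent confounders, instead of by enumeration. Reading the source-regime test results as d-separations with C1 in the conditioning set is my simplification of what the assumptions license.

```python
from itertools import combinations, product
import networkx as nx

CONTEXT, SYSTEM = ["C1", "C2"], ["X1", "X2", "Y"]
NODES = CONTEXT + SYSTEM

def allowed(a, b):
    # Edge a -> b: contexts are exogenous (nothing points into them),
    # and Y cannot be intervened on directly (no context -> Y edge).
    return b not in CONTEXT and not (a in CONTEXT and b == "Y")

def all_dags():
    pairs = list(combinations(NODES, 2))
    for choice in product(range(3), repeat=len(pairs)):  # 0: absent, 1: a->b, 2: b->a
        edges = [(a, b) if c == 1 else (b, a)
                 for (a, b), c in zip(pairs, choice) if c]
        if all(allowed(a, b) for a, b in edges):
            G = nx.DiGraph(edges)
            G.add_nodes_from(NODES)
            if nx.is_directed_acyclic_graph(G):
                yield G

def dsep(G, a, b, Z):
    # nx.d_separated is the NetworkX 2.4+ name (is_d_separator in newer releases).
    return nx.d_separated(G, {a}, {b}, set(Z))

# The running example's source-regime test results, read as d-separation
# statements in the joint graph (C1 added to each conditioning set).
constraints = [("Y", "C2", {"X1", "C1"}, True),    # Y ⊥⊥ C2 | X1, C1
               ("Y", "C2", {"C1"}, False),         # Y /⊥⊥ C2 | C1
               ("X2", "C2", {"Y", "C1"}, True)]    # X2 ⊥⊥ C2 | Y, C1

compatible = [G for G in all_dags()
              if all(dsep(G, a, b, Z) == ind for (a, b, Z, ind) in constraints)]

# Provably separating iff the untestable query holds in EVERY compatible DAG.
verdicts = [dsep(G, "Y", "C1", {"X1"}) for G in compatible]
print(len(compatible), "compatible DAGs; Y ⊥⊥ C1 | X1 holds in all:", all(verdicts))
```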
• 63–64. Causal feature selection
  • Use a standard feature selection method to create a list L of subsets of features and context variables, ordered by lowest source-domain loss
  • For example, L = {X1, C2}, {X1, X2, C2}, {X1, X2}, …
  • Iteratively query the theorem prover until a provably separating set A is found
  • Train a model f(A) to predict Y on the source domains
  • Apply f(A) to predict Y on the target domain
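A schematic sketch of this loop, with the theorem prover abstracted into a callable (all names here are mine, not the paper's):

```python
def causal_feature_selection(candidates, provably_separating, train):
    """Walk candidate feature sets in order of increasing source-domain loss
    and return a model trained on the first provably separating one."""
    for A in candidates:
        if provably_separating(A):     # query the theorem prover
            return A, train(A)         # fit f(A) on the source domains only
    return None, None                  # no candidate could be certified

# Toy usage with a hard-coded "prover" answer for the running example:
candidates = [{"X1", "X2"}, {"X1"}, {"X2"}]
A, model = causal_feature_selection(
    candidates,
    provably_separating=lambda A: A == {"X1"},
    train=lambda A: f"model fitted on {sorted(A)}")
print(A, model)
```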
• 65. Results
  • Simulated data: 200 random causal graphs
  • Baseline method for feature selection and regression: random forest
  • Real-world data: CRM Causal Inference Challenge
    • 441 wild-type mice with 16 phenotypes
    • ~13 mutant mice (single-gene knockouts) for each of 10 interventions
  • "Causal cross-validation": we randomly selected 3 phenotypes and 2 knockouts and created 1000 datasets; for each dataset we randomly chose Y and C1 and ignored the values of Y in C1 = 1
• 66. Desiderata for a causality-inspired domain adaptation method (revisited)
  • X, Y and the changes can be represented by an unknown causal graph
  • Allow for latent confounders
  • Avoid parametric assumptions; allow for heterogeneous effects across domains
  • Instead of modeling the changes between each pair of domains, distinguish the change between the mixture of sources and the target
  • Avoid the common assumption that if Y | T(X) is invariant across multiple source domains, then Y | T(X) is also invariant in the target domain
  • Only search for invariant features with respect to the current target task: no need to find the causal graph or its equivalence class
• 67. Limitations and future work
  • Potentially too conservative: separating sets may exist that are not provably separating
    • Extension: can we use active learning / intervention design to decide on the most informative interventions?
  • Cannot be used directly for transformations of features T(X)
    • Extension: incorporate structure learning with deterministic relations
  • Scalability: the (error-correcting) logic-based encoding with all CI tests as input scales to tens of variables (including context variables)
    • Extension: use approximate algorithms, combine with low-dimensional representations
  • Can we apply this to multi-task RL (e.g. in factored MDPs)?