A non-Gaussian model for causal discovery in the presence of hidden common causes
1. Shohei Shimizu
Shiga University / Osaka University
Japan
2016 Munich Workshop on
Causal Inference and Information Theory
2. Abstract
• Managing hidden common causes is
essential in causal discovery
• Non-causally-related observed variables
can be correlated due to hidden common
causes
• Propose a linear non-Gaussian model for
estimating causal direction in cases with
hidden common causes
4. Strong correlation between chocolate
consumption and number of Nobel
laureates (Messerli12NEJM)
[Figure: scatter plot, 2002-2011. Axes: chocolate consumption (kg/yr/capita) and number of Nobel laureates per 10 million population.]
Corr. 0.791
P-value < 0.001
5. Does eating more chocolate increase
the number of Nobel laureates?
• Interpretational drift (Maurage+13, J. Nutrition)
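[Figure: candidate causal graphs relating Chclt and Nobel, with GDP as a possible common cause]
[Figure: the observed correlation between Chocolate and Nobel (Corr. 0.791, P-value < 0.001) set against the candidate causal graphs, each with a hidden common cause. Manage this gap!]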
7. Structural causal models
(Pearl, 2000,2009; cf. Bollen, 1989)
• A framework for describing causal relations
• Generally speaking, if the value of 𝑥1 has
been changed and then that of 𝑥2 changes,
then 𝑥1 causes 𝑥2
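$x_1 = g_1(f, e_1)$
$x_2 = g_2(x_1, f, e_2)$
[Figure: graph over x1 and x2 with hidden common cause f and errors e1, e2; example variables: GDP, Nobel, Chclt]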
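The notion of causation on this slide can be sketched in a few lines: change x1 by intervention and check whether x2 changes, then do the reverse. The functional forms and coefficients of g1 and g2 below are hypothetical choices for illustration only; the slide leaves them unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Exogenous variables: hidden common cause f and errors e1, e2.
f = rng.normal(size=n)
e1 = rng.normal(size=n)
e2 = rng.normal(size=n)

def g1(f, e1):
    # Hypothetical structural equation for x1.
    return 0.5 * f + e1

def g2(x1, f, e2):
    # Hypothetical structural equation for x2; it takes x1 as an input.
    return 0.8 * x1 + 0.5 * f + e2

def generate(do_x1=None, do_x2=None):
    """Evaluate the structural equations, optionally fixing x1 or x2 by intervention."""
    x1 = g1(f, e1) if do_x1 is None else np.full(n, float(do_x1))
    x2 = g2(x1, f, e2) if do_x2 is None else np.full(n, float(do_x2))
    return x1, x2

# Changing x1 shifts x2 ...
_, x2_low = generate(do_x1=0.0)
_, x2_high = generate(do_x1=1.0)
effect_of_x1_on_x2 = float(np.mean(x2_high - x2_low))

# ... whereas changing x2 leaves x1 untouched.
x1_low, _ = generate(do_x2=0.0)
x1_high, _ = generate(do_x2=1.0)
effect_of_x2_on_x1 = float(np.mean(x1_high - x1_low))

print(effect_of_x1_on_x2, effect_of_x2_on_x1)
```

Because f, e1 and e2 are held fixed across the two interventions, the difference isolates the effect of the change itself.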
8. Challenge in causal discovery
• Assume that one of the following three models, each with a hidden common cause f and distributions $p(e_1)$, $p(e_2)$, $p(f)$, generated the data:
– $x_1 = g_1(f, e_1)$, $x_2 = g_2(x_1, f, e_2)$
– $x_1 = g_1(x_2, f, e_1)$, $x_2 = g_2(f, e_2)$
– $x_1 = g_1(f, e_1)$, $x_2 = g_2(f, e_2)$
• Data matrix: rows x1, x2; columns obs. 1, obs. 2, …, obs. n, with $(x_1, x_2) \sim \text{i.i.d. } p(x_1, x_2)$
• Estimate which of the three models generated the data
9. Under what conditions
can we manage the gap?
• We have shown that it is possible under three
assumptions: i) linearity; ii) acyclicity;
iii) non-Gaussianity (Hoyer+08IJAR; Shimizu+14JMLR)
• The classical Bayesian network approach cannot do this
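[Figure: which of three graphs with hidden common cause f1 generated the data? x1 → x2, or x1 ← x2, or no direct effect between x1 and x2]
x1 → x2: $x_1 = \lambda_{11} f_1 + e_1$, $x_2 = b_{21} x_1 + \lambda_{21} f_1 + e_2$
x1 ← x2: $x_1 = b_{12} x_2 + \lambda_{11} f_1 + e_1$, $x_2 = \lambda_{21} f_1 + e_2$
No direct effect: $x_1 = \lambda_{11} f_1 + e_1$, $x_2 = \lambda_{21} f_1 + e_2$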
10. Basic non-Gaussian model
(No hidden common cause)
S. Shimizu, P. O. Hoyer, A. Hyvärinen
and A. Kerminen
Journal of Machine Learning Research
2006
11. Linear Non-Gaussian Acyclic
Model (LiNGAM) (Shimizu et al., 2006)
• Identifiable: causal directions and coefficients
• Various extensions including nonlinear (Hoyer+08NIPS,
Zhang+09UAI) and cyclic (Lacerda+08UAI) models
$x_i = \sum_{j \neq i} b_{ij} x_j + e_i$
[Figure: example graph over x1, x2, x3 with edge coefficients b21, b23, b13 and errors e1, e2, e3]
Linearity
Acyclicity
Non-Gaussian errors ei
Independence of errors ei
(no hidden common causes)
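A minimal simulation sketch of a three-variable model satisfying these assumptions. The edge set (b21, b23, b13) follows the example graph on the slide; the coefficient values, the Laplace error distribution and the sample size are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# B[i, j] is b_ij, the direct effect of x_j on x_i (0-based indices here).
# Hypothetical values for the edges b21, b23, b13.
B = np.zeros((3, 3))
B[1, 0] = 0.8    # b21: x1 -> x2
B[1, 2] = -0.5   # b23: x3 -> x2
B[0, 2] = 0.6    # b13: x3 -> x1

# Independent non-Gaussian (Laplace) errors, one column per variable.
e = rng.laplace(size=(n, 3))

# x = B x + e  =>  x = (I - B)^{-1} e, applied row-wise to the sample.
X = e @ np.linalg.inv(np.eye(3) - B).T

# Acyclicity: in the causal order x3, x1, x2 the matrix is strictly lower triangular.
order = [2, 0, 1]
B_ordered = B[np.ix_(order, order)]
is_acyclic = bool(np.allclose(np.triu(B_ordered), 0.0))

# With the true parents known, least squares recovers b21 and b23.
coef, *_ = np.linalg.lstsq(X[:, [0, 2]], X[:, 1], rcond=None)
b21_hat, b23_hat = coef
print(is_acyclic, b21_hat, b23_hat)
```

The regression step only illustrates that the coefficients are estimable once the structure is known; finding the structure itself is what the identifiability argument addresses.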
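12. Different directions give
different data distributions
Model 1: $x_1 = e_1$, $x_2 = 0.8 x_1 + e_2$
Model 2: $x_1 = 0.8 x_2 + e_1$, $x_2 = e_2$
with $E(e_1) = E(e_2) = 0$ and $\mathrm{var}(x_1) = \mathrm{var}(x_2) = 1$
[Figure: scatter plots of x1 against x2 under Model 1 and Model 2, for Gaussian and non-Gaussian (ex. uniform) errors e1, e2]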
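The point of this slide can be checked numerically. The sketch below simulates both models with Gaussian and with uniform errors, then measures how strongly the least-squares residual depends on the regressor in each direction. The statistic used, the mean of regressor cubed times residual, is an illustrative choice and not the method of the talk. It is close to zero in the true causal direction and, for non-Gaussian errors, clearly non-zero in the reverse direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def noise(dist, sd):
    """Zero-mean noise with standard deviation sd."""
    if dist == "gaussian":
        return rng.normal(0.0, sd, n)
    half_width = sd * np.sqrt(3.0)  # uniform on [-a, a] has sd a / sqrt(3)
    return rng.uniform(-half_width, half_width, n)

def simulate(model, dist):
    """Model 1: x1 -> x2. Model 2: x2 -> x1. Both give var(x1) = var(x2) = 1."""
    if model == 1:
        x1 = noise(dist, 1.0)
        x2 = 0.8 * x1 + noise(dist, 0.6)
    else:
        x2 = noise(dist, 1.0)
        x1 = 0.8 * x2 + noise(dist, 0.6)
    return x1, x2

def dependence(cause, effect):
    """Mean of cause^3 * residual, after regressing effect on cause."""
    cause = cause - cause.mean()
    effect = effect - effect.mean()
    slope = np.cov(cause, effect)[0, 1] / np.var(cause, ddof=1)
    residual = effect - slope * cause
    return float(np.mean(cause ** 3 * residual))

scores = {}
for dist in ("gaussian", "uniform"):
    for model in (1, 2):
        x1, x2 = simulate(model, dist)
        # Positive score: x1 -> x2 fits better. Negative: x2 -> x1 fits better.
        scores[(dist, model)] = abs(dependence(x2, x1)) - abs(dependence(x1, x2))

for key, value in scores.items():
    print(key, round(value, 3))
```

With Gaussian errors the score stays near zero for both models, because the two models then imply the same joint distribution. With uniform errors its sign separates Model 1 from Model 2.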
13. Independent Component Analysis
(ICA) (Jutten & Herault, 1991; Comon, 1994; Hyvarinen et al., 2001)
• Observed variables $x_i$ are modeled by
$x_i = \sum_{j=1}^{p} a_{ij} s_j$ or $x = As$
where
– Hidden variables $s_j$ ($j = 1, \dots, p$) are non-Gaussian and
independent
• Then, mixing matrix A is identifiable up to
permutation and scaling of the columns
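The "up to permutation and scaling" caveat can be made concrete: permuting and rescaling the columns of A, with the inverse operation applied to the sources, reproduces exactly the same observations. The matrices below are hypothetical examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2x2 mixing matrix and independent non-Gaussian sources.
A = np.array([[1.0, 0.0],
              [0.8, 1.0]])
s = rng.laplace(size=(2, 1000))
x = A @ s

# Swap the two columns (P) and rescale them (D).
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
D = np.diag([2.0, -0.5])

A_alt = A @ P @ D
s_alt = np.linalg.inv(D) @ P.T @ s   # still independent and non-Gaussian

same_observations = bool(np.allclose(A_alt @ s_alt, x))
print(same_observations)
```

Since the data cannot distinguish A from A_alt, any further structure, such as a known zero pattern, has to come from additional modeling assumptions.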
14. Sketch of the identifiability proof
• Different directions give different zero/non-
zero patterns of the mixing matrices
– No zeros on the diagonal in the causal model
– No permutation indeterminacy
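Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$, i.e. $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ b_{21} & 1 \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \end{pmatrix}$ (of the form $x = As$)
Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$, i.e. $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 & b_{12} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \end{pmatrix}$ (of the form $x = As$)
[Figure: graphs of Model 1 (x1 → x2) and Model 2 (x1 ← x2) with errors e1, e2; the zero entry of A marks the absent direction]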
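A sketch of how the two bullets resolve the ICA indeterminacy in the two-variable case: search for the column order whose diagonal has no zeros, rescale each column to a unit diagonal, and read the direction off the remaining zero. The function name `recover_mixing` and the test matrix are hypothetical. A mixing matrix estimated from data has no exact zeros, so a practical version would need a threshold or a statistical test, which this sketch omits.

```python
import numpy as np
from itertools import permutations

def recover_mixing(A_ica, tol=1e-8):
    """Undo column permutation and scaling so that the diagonal is all ones."""
    p = A_ica.shape[0]
    for perm in permutations(range(p)):
        candidate = A_ica[:, list(perm)]
        diag = np.diag(candidate)
        if np.all(np.abs(diag) > tol):      # no zeros on the diagonal
            return candidate / diag         # divide column j by its diagonal entry
    raise ValueError("no column order gives a zero-free diagonal")

# Model 1 (x1 -> x2) with a hypothetical b21 = 0.8, as ICA might return it:
# columns swapped and rescaled.
A_true = np.array([[1.0, 0.0],
                   [0.8, 1.0]])
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
D = np.diag([2.0, -0.5])
A_ica = A_true @ P @ D

A_rec = recover_mixing(A_ica)
direction = "x1 -> x2" if abs(A_rec[0, 1]) < 1e-8 else "x1 <- x2"
print(A_rec, direction)
```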
15. LiNGAM with hidden
common causes
P. O. Hoyer, S. Shimizu, A. Kerminen,
and M. Palviainen
Int. J. Approximate Reasoning
2008
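17. $x_i = \sum_{q=1}^{Q} \lambda_{iq} f_q + \sum_{j \neq i} b_{ij} x_j + e_i$
Without loss of generality (WLOG), hidden common causes $f_q$
are assumed to be independent
[Figure: x1 and x2 with errors e1, e2 and dependent hidden common causes f1, f2, redrawn with independent hidden common causes $e_{f_1}$, $e_{f_2}$]
$\begin{pmatrix} f_1 \\ f_2 \end{pmatrix} = \begin{pmatrix} a_{11} & 0 \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} e_{f_1} \\ e_{f_2} \end{pmatrix}$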
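The reduction from dependent to independent hidden common causes can be verified numerically: if the dependent causes are a linear mixture of independent sources, absorbing the mixture into the loadings yields the same observed data. All numerical values and the names `Lam`, `A`, `e_f` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Independent non-Gaussian sources and a lower-triangular map A
# (hypothetical values) producing dependent hidden common causes f1, f2.
e_f = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.0],
              [0.7, 0.5]])
f = A @ e_f

# Hypothetical loadings of x1, x2 on f1, f2, direct effect b21 = 0.8, and errors.
Lam = np.array([[0.6, 0.3],
                [0.4, 0.9]])
B = np.array([[0.0, 0.0],
              [0.8, 0.0]])
e = rng.laplace(size=(2, n))
mix = np.linalg.inv(np.eye(2) - B)

x_dependent = mix @ (Lam @ f + e)            # model with dependent f1, f2
x_independent = mix @ ((Lam @ A) @ e_f + e)  # same data, independent causes, loadings Lam A

equivalent = bool(np.allclose(x_dependent, x_independent))
corr_f = float(np.corrcoef(f)[0, 1])
print(equivalent, corr_f)
```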
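18. Different directions give different
zero/non-zero patterns (Hoyer+08IJAR)
• Faithfulness on $x_i$, $f_i$ + Number of $f_i$ given
[Figure: scatter plots of x1 against x2 for Gaussian and non-Gaussian e1, e2, f1]
Zero/non-zero patterns of the mixing matrix A for the three models over x1, x2, f1 (* = non-zero; rows x1, x2; columns f1, e1, e2):
– x1 → x2: (* * 0), (* * *)
– x1 ← x2: (* * *), (* 0 *)
– No direct effect: (* * 0), (* 0 *)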
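The three patterns can be derived mechanically from the structural equations. The sketch below builds the mixing matrix for one hidden common cause with sources ordered (f1, e1, e2) and prints its zero/non-zero pattern. The function names and coefficient values are hypothetical, and the values are chosen so that no coefficients cancel (faithfulness).

```python
import numpy as np

def mixing_matrix(b21=0.0, b12=0.0, lam=(0.5, 0.7)):
    """Mixing matrix A in x = A s with s = (f1, e1, e2), for one hidden common cause."""
    B = np.array([[0.0, b12],
                  [b21, 0.0]])
    # x = B x + lam * f1 + e  =>  x = (I - B)^{-1} [lam | I] s
    loadings = np.column_stack([np.asarray(lam, dtype=float), np.eye(2)])
    return np.linalg.inv(np.eye(2) - B) @ loadings

def pattern(A, tol=1e-12):
    """One string per row: '*' for a non-zero entry, '0' for a zero entry."""
    return ["".join("*" if abs(a) > tol else "0" for a in row) for row in A]

patterns = {
    "x1 -> x2": pattern(mixing_matrix(b21=0.8)),
    "x1 <- x2": pattern(mixing_matrix(b12=0.8)),
    "no direct effect": pattern(mixing_matrix()),
}
for name, rows in patterns.items():
    print(name, rows)
```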
19. Previous estimation methods
(Hoyer+08IJAR; Henao+11JMLR)
• Explicitly model hidden common causes
• Do model comparison based on maximum
likelihood principle or Bayesian approach
• Need to specify their number and distributions,
which is difficult in general
[Figure: the two candidate graphs, x1 → x2 or x1 ← x2, each with hidden common causes f1, …, fQ and errors e1, e2]
20. Our proposal:
A Bayesian LiNGAM
approach
S. Shimizu and K. Bollen.
Journal of Machine Learning Research,
2014
and something extra
21. Key idea (1/2)
• Transform the model to a model with
no hidden common causes
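[Figure, left: LiNGAM with hidden common causes: x1 → x2 with coefficient b21, hidden common causes f1, …, fQ, and errors e1, e2]
[Figure, right: LiNGAM with no hidden common causes but with possibly different intercepts over obs.: for each observation 1, …, m, …, $x_1^{(m)} \to x_2^{(m)}$ with the same coefficient b21, errors $e_1^{(m)}$, $e_2^{(m)}$, and intercepts $\mu_1^{(m)}$, $\mu_2^{(m)}$]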
22. Key idea (2/2)
• Include the sums of hidden common causes as
model parameters, i.e., observation-specific
intercepts:
m-th obs.: $x_2^{(m)} = \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)} + b_{21} x_1^{(m)} + e_2^{(m)} = \mu_2^{(m)} + b_{21} x_1^{(m)} + e_2^{(m)}$
where $\mu_2^{(m)} = \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}$ is the obs.-specific intercept
• Hidden common causes are not modeled explicitly
– It is necessary neither to specify the number of hidden
common causes Q nor to estimate the coefficients $\lambda_{2q}$
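A small numerical sketch of this reparameterization: data generated with Q explicit hidden common causes coincide with data generated from one intercept per observation and per variable. The number Q, the coefficient values, the Laplace distributions and the variable names (`mu1`, `mu2` for the observation-specific intercepts) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, Q = 500, 4                      # Q hidden common causes (hypothetical choice)

f = rng.laplace(size=(n, Q))       # hidden common causes, one row per observation
lam1 = rng.normal(size=Q)          # coefficients of the hidden causes on x1
lam2 = rng.normal(size=Q)          # coefficients of the hidden causes on x2
b21 = 0.8
e1 = rng.laplace(size=n)
e2 = rng.laplace(size=n)

# Original model with explicit hidden common causes.
x1 = f @ lam1 + e1
x2 = f @ lam2 + b21 * x1 + e2

# Reparameterized model: one intercept per observation and per variable.
mu1 = f @ lam1
mu2 = f @ lam2
x1_re = mu1 + e1
x2_re = mu2 + b21 * x1_re + e2

same_data = bool(np.allclose(x1, x1_re) and np.allclose(x2, x2_re))
print(same_data, mu1.shape, mu2.shape)
```

The reparameterized model never refers to Q or to the individual coefficients, at the price of 2n intercept parameters.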
23. Bayesian model selection
• Compare the marginal likelihoods of the two models, with the data standardized
• Once a direction has been estimated, compute the
posterior of the connection strength b21 or b12
• Many obs.-specific intercepts $\mu_i^{(m)}$ ($i = 1, 2$; $m = 1, \dots, n$)
– Similar to mixed models and multi-level models
– Informative prior
Model 3 (x1 → x2):
$x_1^{(m)} = \mu_1^{(m)} + e_1^{(m)}$
$x_2^{(m)} = \mu_2^{(m)} + b_{21} x_1^{(m)} + e_2^{(m)}$
Model 4 (x1 ← x2):
$x_1^{(m)} = \mu_1^{(m)} + b_{12} x_2^{(m)} + e_1^{(m)}$
$x_2^{(m)} = \mu_2^{(m)} + e_2^{(m)}$
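In generic notation, comparing marginal likelihoods means integrating each model's likelihood over the prior of its parameters and choosing the model with the larger value. The symbols $D$ (the data), $M_k$ (model k) and $\theta_k$ (all parameters of model k, including the observation-specific intercepts) are introduced here for illustration and do not appear on the slide.

```latex
p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k,
\qquad
\hat{k} = \arg\max_{k \in \{3, 4\}} p(D \mid M_k)
```

With equal prior probabilities for the two models, this choice coincides with picking the model of higher posterior probability.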
24. Prior for the observation-specific
intercepts
• Motivation: Central limit theorem
– Sums of independent variables tend to be more Gaussian
• Approximate the density by a bell-shaped distribution
– The intercepts are dependent due to the shared hidden common causes $f_q^{(m)}$:
$\mu_1^{(m)} = \sum_{q=1}^{Q} \lambda_{1q} f_q^{(m)}$, $\mu_2^{(m)} = \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}$
– $(\mu_1^{(m)}, \mu_2^{(m)})$ ~ t-distribution with one sd per intercept, a correlation between the two, and DOF $\nu$ (here, 8)
• Select the hyper-parameter values
that maximize the marginal likelihood
– Candidate sd values: {0.4, 0.6, 0.8}
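A sketch of drawing observation-specific intercepts from a bivariate t-distribution with 8 degrees of freedom. It assumes that "sd" refers to the standard deviation of the t-distribution itself, so the scale matrix is shrunk by (DOF − 2)/DOF; the slide does not settle this. The function name `sample_intercepts`, the correlation value 0.5 and the sample size are hypothetical; the sds 0.4 and 0.8 are taken from the candidate values on the slide.

```python
import numpy as np

def sample_intercepts(n, sd1, sd2, corr, dof, rng):
    """Draw n pairs from a zero-mean bivariate t-distribution.

    Assumption: sd1 and sd2 are the standard deviations of the t-distribution,
    whose covariance equals dof / (dof - 2) times its scale matrix.
    """
    cov = np.array([[sd1 ** 2, corr * sd1 * sd2],
                    [corr * sd1 * sd2, sd2 ** 2]])
    scale = cov * (dof - 2) / dof
    z = rng.multivariate_normal(np.zeros(2), scale, size=n)
    w = rng.chisquare(dof, size=n) / dof
    # A Gaussian vector divided by sqrt(chi-square / dof) is multivariate t.
    return z / np.sqrt(w)[:, None]

rng = np.random.default_rng(0)
mu = sample_intercepts(n=200_000, sd1=0.4, sd2=0.8, corr=0.5, dof=8, rng=rng)

sample_sd = mu.std(axis=0)
sample_corr = float(np.corrcoef(mu.T)[0, 1])
print(sample_sd, sample_corr)
```

The t-distribution keeps the bell shape motivated by the central limit theorem while allowing heavier tails than a Gaussian.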
25. Error distributions and other
priors used in the experiment
• Error distributions $p(e_1)$, $p(e_2)$
– Fixed to be the Laplace distribution
– Could instead be estimated, assuming for example a family of
generalized Gaussian distributions
• Priors for the other parameters
$b_{21} \sim N(0, 0.75^2)$
$b_{12} \sim N(0, 0.75^2)$
correlation of the intercepts $\sim U(-1, 1)$
$\mathrm{std}(e_1) \sim U(0, 1)$
$\mathrm{std}(e_2) \sim U(0, 1)$
27. Sociology data
• Source: General Social Survey (n=1380)
– Non-farm background, ages 35-44, white, male, in the labor
force, no missing data for any of the covariates, 1972-2006
• 15 pairs with known temporal directions
(Duncan+1972)
[Figure: status attainment model (Duncan et al., 1972); variables include x2: Son’s Income]
28. Numbers of successes
(n=1380)
Cf. LiNGAM-GU-UK (Chen+13NECO): 0.20; PNL (Zhang+09UAI): 0.60
[Figure: known (temporal) orderings of 15 pairs among Father’s Education, Son’s Education, Son’s Occupation, Son’s Income, …, with hidden common cause f1]
30. Conclusion
• Estimation of causal direction in the presence of
hidden common causes is a major challenge in
causal discovery
• Proposed a linear non-Gaussian SEM approach
– Not necessary to model individual hidden common
causes
• Future directions
– Cyclic cases: Using some prior for forcing the
identifiability condition of Lacerda+08UAI?
– Non-stationarity: Combining with Kun’s method
(Huang+15IJCAI)?