1. Inferring Gene Regulatory Networks from
Time-Course Gene Expression Data
Camille Charbonnier, Julien Chiquet, and Christophe Ambroise
Laboratoire Statistique et Génome, La génopole – Université d'Évry
JOBIM – September 7–9, 2010
2. Problem
[Figure: a series of microarrays collected at times t0, t1, ..., tn (on the order of tens of arrays) versus thousands of probes ("genes") — which interactions?]
The main statistical issue is the high-dimensional setting: many more probes than arrays.
3. Handling the scarcity of the data
By introducing some prior information
Priors should be biologically grounded:
1. few genes effectively interact (sparsity),
2. networks are organized (latent clustering).
[Figure: a sparse network over genes G1–G13; in the next build, the same network with its nodes grouped into clusters A, B and C.]
6. Outline
Statistical models
Gaussian Graphical Model for Time-course data
Structured Regularization
Algorithms and methods
Overall view
Model selection
Numerical experiments
Inference methods
Performance on simulated data
E. coli S.O.S DNA repair network
7. Gaussian Graphical Model for Time-course data
Collecting gene expression
1. Follow-up of one single experiment/individual;
2. Close enough time-points to ensure
dependency between consecutive measurements;
homogeneity of the Markov process.
[Figure: the network G over X1, ..., X5 stands for the set of directed dependencies between the variables at time t and the variables at time t+1.]
[Figure: unrolled over the whole time course, the same network G links the variables at each time point to those at the next, for all measurements 1, ..., n.]
10. Gaussian Graphical Model for Time-course data
Assumption
A microarray can be represented as a multivariate Gaussian vector $X = (X^{(1)}, \dots, X^{(p)}) \in \mathbb{R}^p$, following a first-order vector autoregressive process VAR(1):
$$X_t = \Theta X_{t-1} + b + \varepsilon_t, \qquad t \in [1, n],$$
where we are looking for $\Theta = (\theta_{ij})_{i,j \in P}$.
Graphical interpretation
There is an edge $j \to i$, i.e. a conditional dependency between $X_{t-1}^{(j)}$ and $X_t^{(i)}$,
if and only if
the partial correlation between $X_{t-1}^{(j)}$ and $X_t^{(i)}$ is non-null, that is $\theta_{ij} \neq 0$.
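As a quick illustration of this generative model, here is a minimal R sketch (toy dimensions and a hand-picked sparse Θ, not the simulation design used later in the talk) drawing one trajectory of the VAR(1) process; the non-zero entries θij define the edges j → i.

```r
## Toy VAR(1) trajectory: X_t = Theta %*% X_{t-1} + b + eps_t
## (Theta, b and the noise level are illustrative values)
set.seed(1)
p <- 5; n <- 50
Theta <- matrix(0, p, p)
Theta[1, 2] <- 0.8    # theta_12 != 0: gene 2 at time t-1 influences gene 1 at time t
Theta[3, 2] <- -0.6   # edge 2 -> 3
Theta[4, 5] <- 0.7    # edge 5 -> 4
b <- rep(0, p)

X <- matrix(0, nrow = n + 1, ncol = p)   # row t+1 holds the observation X_t
for (t in 2:(n + 1)) {
  X[t, ] <- Theta %*% X[t - 1, ] + b + rnorm(p, sd = 0.1)
}
```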
13. Gaussian Graphical Model for Time-course data
Let
$\mathbf{X}$ be the $n \times p$ matrix whose $k$th row is $X_k$,
$S = n^{-1}\,\mathbf{X}_n^\top \mathbf{X}_n$ be the within-time covariance matrix,
$V = n^{-1}\,\mathbf{X}_n^\top \mathbf{X}_0$ be the across-time covariance matrix.
The log-likelihood
$$\mathcal{L}_{\mathrm{time}}(\Theta; S, V) = n\,\mathrm{Trace}(V\Theta) - \frac{n}{2}\,\mathrm{Trace}(\Theta^\top S\,\Theta) + c.$$
Maximum Likelihood Estimator $\hat{\Theta}^{\mathrm{MLE}} = S^{-1}V$:
not defined for $n < p$;
even if $n > p$, requires multiple testing.
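For concreteness, a rough base-R sketch of these quantities (my own reading of the notation, with the lagged observations stacked in X0 and the current ones in Xn; depending on the exact convention the estimate below may be the transpose of the slide's S⁻¹V):

```r
## Empirical covariances and the unregularized estimator
## (toy data from the VAR(1) sketch above; only meaningful when n > p)
X0 <- X[1:n, , drop = FALSE]         # X_0, ..., X_{n-1}  (lagged)
Xn <- X[2:(n + 1), , drop = FALSE]   # X_1, ..., X_n      (current)

S <- crossprod(X0) / n       # within-time covariance, p x p
V <- crossprod(X0, Xn) / n   # across-time covariance, p x p

## Least-squares / ML estimate of Theta in X_t = Theta X_{t-1}:
Theta_mle <- t(solve(S, V))  # fails (singular S) as soon as n < p
```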
14. Structured regularization
$\ell_1$-penalized log-likelihood (Charbonnier, Chiquet, Ambroise, SAGMB 2010)
$$\hat{\Theta}^{\lambda} = \arg\max_{\Theta} \; \mathcal{L}_{\mathrm{time}}(\Theta; S, V) - \lambda \sum_{i,j \in P} P^{Z}_{ij}\,|\Theta_{ij}|,$$
where $\lambda$ is an overall tuning parameter and $P^Z$ is a (non-symmetric) matrix of weights depending on the underlying clustering $Z$.
It performs
1. regularization (needed when $n \ll p$),
2. selection (a specificity of the $\ell_1$-norm),
3. cluster-driven inference (penalty adapted to $Z$).
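Because the VAR(1) log-likelihood decomposes into p independent regressions (one per target gene), the weighted ℓ1 estimate can be sketched with off-the-shelf lasso software. The snippet below is an illustration using the glmnet package, not the authors' SIMoNe implementation; note that glmnet rescales penalty.factor internally, so the weights match the formula above only up to a constant.

```r
## Weighted-l1 estimation of Theta, one lasso regression per target gene i:
## regress X_t^(i) on all lagged profiles, with per-coefficient weights PZ[i, ]
library(glmnet)

weighted_var1_lasso <- function(X0, Xn, PZ, lambda) {
  p <- ncol(X0)
  Theta_hat <- matrix(0, p, p)
  for (i in 1:p) {
    fit <- glmnet(X0, Xn[, i], alpha = 1, lambda = lambda,
                  penalty.factor = PZ[i, ], standardize = FALSE)
    Theta_hat[i, ] <- as.numeric(coef(fit))[-1]  # drop the intercept b_i
  }
  Theta_hat  # theta_ij != 0  <=>  inferred edge j -> i
}
```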
15. Structured regularization
"Bayesian" interpretation of $\ell_1$ regularization
The Laplacian prior on $\Theta$ depends on the clustering $Z$:
$$\mathbb{P}(\Theta \mid Z) \propto \exp\Big(-\lambda \sum_{i,j} P^{Z}_{ij}\,|\Theta_{ij}|\Big).$$
$P^Z$ summarizes prior information on the position of the edges.
[Figure: Laplace prior density of $\Theta_{ij}$ on $[-1, 1]$ for $\lambda P^Z_{ij} = 1$ and $\lambda P^Z_{ij} = 2$ — larger weights concentrate the prior around zero.]
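The curves in the figure are simply Laplace densities with rate $\lambda P^Z_{ij}$; a short R check (values chosen to match the figure):

```r
## Laplace prior density f(theta) = (rate / 2) * exp(-rate * |theta|)
laplace_prior <- function(theta, rate) rate / 2 * exp(-rate * abs(theta))
theta <- seq(-1, 1, by = 0.01)
plot(theta, laplace_prior(theta, rate = 2), type = "l",
     xlab = expression(Theta[ij]), ylab = "prior density")  # lambda * PZ_ij = 2
lines(theta, laplace_prior(theta, rate = 1), lty = 2)       # lambda * PZ_ij = 1
```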
16. How to come up with a latent clustering?
Biological expertise
Build Z from prior biological information
transcription factors vs. regulatees,
number of potential binding sites,
KEGG pathways, . . .
Build the weight matrix from Z.
Inference: Erdős–Rényi Mixture for Networks (Daudin et al., 2008)
Spread the nodes into $Q$ classes;
connection probabilities depend upon the node classes:
$$\mathbb{P}(i \to j \mid i \in \text{class } q,\ j \in \text{class } \ell) = \pi_{q\ell}.$$
Build $P^Z \propto 1 - \pi_{q\ell}$.
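A minimal sketch of this construction (assuming the clustering step has already produced a class vector z in {1, ..., Q} and a Q × Q connectivity matrix; the normalization is one arbitrary choice, the slide only requires proportionality):

```r
## Penalty weights from a latent clustering: PZ[i, j] proportional to
## 1 - pi_{z_i, z_j}, so that edges likely under the clustering are less penalized
build_PZ <- function(piZ, z) {
  PZ <- 1 - piZ[z, z]   # p x p matrix of weights
  PZ / mean(PZ)         # normalize; the overall level is carried by lambda
}
```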
17. Algorithm
Suppose you want to recover a clustered network (the target network and its adjacency matrix).
[Figure: flow chart of the SIMoNE pipeline, built up in the following steps.]
1. Start from the microarray data.
2. Inference without prior: estimate a first adjacency matrix, corresponding to a graph G.
3. Clustering (Mixer): from this first graph, estimate the connectivity matrix πZ.
4. Decreasing transformation: turn πZ into the penalty matrix PZ.
5. Inference with the clustered prior: re-estimate the network, yielding the adjacency matrix of GZ.
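Putting the previous sketches together, the pipeline of the diagram above can be written schematically as follows; cluster_adjacency is a hypothetical placeholder for the Mixer/MixNet clustering step (returning class memberships z and the connectivity matrix pi), the actual procedure being implemented in the SIMoNE package.

```r
## Schematic two-step inference: plain l1 fit, clustering of the first guess,
## penalty weights from the clustering, then re-fit with the structured prior
infer_clustered_network <- function(X0, Xn, lambda, Q) {
  p <- ncol(X0)
  ## 1. inference without prior (flat weights)
  Theta0 <- weighted_var1_lasso(X0, Xn, PZ = matrix(1, p, p), lambda = lambda)
  ## 2. latent clustering of the first adjacency matrix (hypothetical helper)
  cl <- cluster_adjacency(Theta0 != 0, Q = Q)   # returns list(z = ..., pi = ...)
  ## 3. decreasing transformation of the connectivity matrix
  PZ <- build_PZ(cl$pi, cl$z)
  ## 4. inference with the clustered prior
  weighted_var1_lasso(X0, Xn, PZ = PZ, lambda = lambda)
}
```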
22. Tuning the penalty parameter
BIC / AIC
Degrees of freedom of the Lasso (Zou et al., 2008):
$$\mathrm{df}(\hat{\beta}^{\lambda}) = \sum_{k} \mathbf{1}(\hat{\beta}^{\lambda}_{k} \neq 0).$$
Straightforward extensions to the graphical framework:
$$\mathrm{BIC}(\lambda) = \mathcal{L}(\hat{\Theta}^{\lambda}; \mathbf{X}) - \mathrm{df}(\hat{\Theta}^{\lambda})\,\frac{\log n}{2}, \qquad \mathrm{AIC}(\lambda) = \mathcal{L}(\hat{\Theta}^{\lambda}; \mathbf{X}) - \mathrm{df}(\hat{\Theta}^{\lambda}).$$
These criteria rely on asymptotic approximations, but remain relevant on simulated small samples.
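In code, this selection step amounts to scanning a grid of λ values; a sketch (log-likelihood written up to constants, assuming centred data and unit noise variance, so only relative values matter):

```r
## Pick lambda by BIC / AIC over a grid, reusing weighted_var1_lasso() above
select_lambda <- function(X0, Xn, PZ, lambdas) {
  n <- nrow(Xn)
  scores <- sapply(lambdas, function(lam) {
    Theta  <- weighted_var1_lasso(X0, Xn, PZ, lambda = lam)
    loglik <- -0.5 * sum((Xn - X0 %*% t(Theta))^2)  # up to an additive constant
    df     <- sum(Theta != 0)                       # number of non-zero coefficients
    c(BIC = loglik - df * log(n) / 2,
      AIC = loglik - df)
  })
  list(lambda_BIC = lambdas[which.max(scores["BIC", ])],
       lambda_AIC = lambdas[which.max(scores["AIC", ])])
}
```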
23. Inference methods
1. Lasso (Tibshirani)
2. Adaptive Lasso (Zou et al.)
Weights inversely proportional to an initial Lasso estimate.
3. KnwCl
Weights structured according to the true (known) clustering.
4. InfCl
Weights structured according to an inferred clustering.
5. Renet-VAR (Shimamura et al.)
Edge estimation based on a recursive elastic net.
6. G1DBN (Lèbre et al.)
Edge estimation based on dynamic Bayesian networks, followed by statistical testing of the edges.
7. Shrinkage (Opgen-Rhein et al.)
Edge estimation based on shrinkage, followed by multiple-testing correction via local false discovery rates.
24. Simulations: time-course data with star-pattern
Simulation settings
2 classes, hubs and leaves, with proportions α = (0.1, 0.9),
K = 2p edges, among which:
85% from hubs to leaves,
15% between hubs.
p genes   n arrays   samples
20        40         500
20        20         500
20        10         500
100       100        200
100       50         200
800       20         100
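A sketch of such a star-pattern generator is given below (edge weights and the stabilization step are my own illustrative choices; only the class proportions, the number of edges and the 85%/15% split come from the settings above).

```r
## Star-pattern Theta: 10% hubs, K = 2p edges, 85% hub -> leaf, 15% hub -> hub
simulate_star_network <- function(p) {
  pick   <- function(x) x[sample(length(x), 1)]   # safe single draw from a vector
  K      <- 2 * p
  hubs   <- sample(p, size = max(2, round(0.1 * p)))
  leaves <- setdiff(seq_len(p), hubs)
  Theta  <- matrix(0, p, p)
  for (k in seq_len(round(0.85 * K)))  # hub -> leaf edges (overwrites may slightly reduce the count)
    Theta[pick(leaves), pick(hubs)] <- sample(c(-1, 1), 1) * runif(1, 0.3, 0.8)
  for (k in seq_len(round(0.15 * K)))  # hub -> hub edges
    Theta[pick(hubs), pick(hubs)] <- sample(c(-1, 1), 1) * runif(1, 0.3, 0.8)
  ## rescale so that the VAR(1) process stays stable (spectral radius < 1)
  rho <- max(Mod(eigen(Theta, only.values = TRUE)$values))
  if (rho > 0.9) Theta <- Theta * 0.9 / rho
  Theta
}
```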
25. Simulations: time-course data with star-pattern
[Figure: precision and recall for the six simulation settings — a) p=20, n=40; b) p=20, n=20; c) p=20, n=10; d) p=100, n=100; e) p=100, n=50; f) p=800, n=20 — comparing the LASSO, Adaptive Lasso, KnwCl and InfCl (each tuned by BIC or AIC) with Renet-VAR, G1DBN and Shrinkage.]
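For reference, precision and recall here compare the support of the estimated matrix with the support of the true one; a small helper (diagonal excluded):

```r
## Precision = TP / #estimated edges ; Recall = TP / #true edges
edge_precision_recall <- function(Theta_hat, Theta_true) {
  offdiag <- !diag(nrow(Theta_true))
  est <- (Theta_hat  != 0) & offdiag
  tru <- (Theta_true != 0) & offdiag
  tp  <- sum(est & tru)
  c(precision = tp / max(sum(est), 1),
    recall    = tp / max(sum(tru), 1))
}
```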
26. Reasonable computing time
Figure: Computing times (from about 1 second to 4 hours) on the log-log scale for Renet-VAR, G1DBN and InfCl (including the inference of classes), as a function of p/n. Intel Dual Core 3.40 GHz processor.
27. E. coli S.O.S DNA repair network
E. coli S.O.S data
[Figure: the E. coli S.O.S DNA repair network data.]
28. E. coli S.O.S DNA repair network
Precision and Recall rates
[Figure: precision and recall on the four S.O.S experiments (Exp. 1–4), comparing the Lasso, Adaptive Lasso, KnwCl, InfCl, Renet-VAR and G1DBN.]
30. Conclusions
To sum up
cluster-driven inference of gene regulatory networks from time-course data,
expert-based or inferred latent structure,
embedded in the SIMoNe R package along with similar algorithms dealing with steady-state or multitask data.
Perspectives
inference of truly dynamic networks,
use of additional biological information to refine the inference,
comparison of inferred networks.
32. Publications
Ambroise, Chiquet, Matias, 2009. Inferring sparse Gaussian graphical models with latent structure. Electronic Journal of Statistics, 3, 205–238.
Chiquet, Smith, Grasseau, Matias, Ambroise, 2009. SIMoNe: Statistical Inference for MOdular NEtworks. Bioinformatics, 25(3), 417–418.
Charbonnier, Chiquet, Ambroise, 2010. Weighted-Lasso for Structured Network Inference from Time Course Data. SAGMB, 9.
Chiquet, Grandvalet, Ambroise, 2010. Inferring multiple Gaussian graphical models. Statistics and Computing.
Working paper: Chiquet, Charbonnier, Ambroise, Grasseau. SIMoNe: An R package for inferring Gaussian networks with latent structure. Journal of Statistical Software.
Working paper: Chiquet, Grandvalet, Ambroise, Jeanmougin. Biological analysis of breast cancer by multitask learning.