These slides address the inference of gene regulatory networks from time-course gene expression data: recovering interactions between genes from microarray time series that are high-dimensional (thousands of probes) yet scarce (tens of time points). They propose a Gaussian graphical model with biologically grounded priors, namely sparsity and latent clustering of the network, to compensate for the scarcity of data. Several statistical models and algorithms are described for regularized network inference, with or without a known or inferred latent clustering structure. The methods are evaluated on simulated time-course data and on the real E. coli S.O.S DNA repair network.
1. Inferring Gene Regulatory Networks from
Time-Course Gene Expression Data
Camille Charbonnier and Julien Chiquet and Christophe Ambroise
Laboratoire Statistique et Génome,
La génopole - Université d’Évry
JOBIM – 7-9 September 2010
Camille Charbonnier and Julien Chiquet and Christophe Ambroise 1
2. Problem
≈ 10s of microarrays over time (t0, t1, ..., tn)
≈ 1000s of probes (“genes”)
Inference: which interactions?
The main statistical issue is the high dimensional setting.
3. Handling the scarcity of the data
By introducing some prior
Priors should be biologically grounded
1. few genes effectively interact (sparsity),
2. networks are organized (latent clustering),
[Figure: example network on 13 genes, G1 to G13.]
5. Handling the scarcity of the data
By introducing some prior
Priors should be biologically grounded
1. few genes effectively interact (sparsity),
2. networks are organized (latent clustering),
[Figure: the same 13-node network relabeled by cluster: A with A1 to A4, B with B1 to B5, and C with C1.]
6. Outline
Statistical models
Gaussian Graphical Model for Time-course data
Structured Regularization
Algorithms and methods
Overall view
Model selection
Numerical experiments
Inference methods
Performance on simulated data
E. coli S.O.S DNA repair network
7. Gaussian Graphical Model for Time-course data
Collecting gene expression
1. Follow-up of one single experiment/individual;
2. Close enough time-points to ensure
dependency between consecutive measurements;
homogeneity of the Markov process.
[Figure: the graph G on the genes X1, ..., X5 stands for the dependencies between expression levels at time t and at time t+1.]
9. Gaussian Graphical Model for Time-course data
Collecting gene expression
1. Follow-up of one single experiment/individual;
2. Close enough time-points to ensure
dependency between consecutive measurements;
homogeneity of the Markov process.
[Figure: the same graph G repeated between consecutive time points, linking (X1, ..., X5) at times 1, 2, ..., n.]
10. Gaussian Graphical Model for Time-Course data
Assumption
A microarray can be represented as a multivariate Gaussian
vector $X = (X(1), \dots, X(p)) \in \mathbb{R}^p$, following a first order vector
autoregressive process VAR(1):
$$X_t = \Theta X_{t-1} + b + \varepsilon_t, \quad t \in [1, n],$$
where we are looking for $\Theta = (\theta_{ij})_{i,j \in \mathcal{P}}$.
Graphical interpretation
There is an edge from j to i, that is, a conditional dependency between $X_{t-1}(j)$ and $X_t(i)$
(equivalently, a non-null partial correlation between $X_{t-1}(j)$ and $X_t(i)$),
if and only if
$\theta_{ij} \neq 0$.
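To make the VAR(1) assumption concrete, here is a minimal simulation sketch in Python. The function name `simulate_var1`, the toy 3-gene matrix, the noise level and the starting point are all illustrative assumptions, not taken from the slides; the only element borrowed from them is the recursion X_t = Θ X_{t-1} + b + ε_t and the reading of non-zero entries of Θ as edges.

```python
import numpy as np

def simulate_var1(theta, n, b=None, noise_sd=1.0, seed=0):
    """Simulate X_0, ..., X_n from X_t = theta @ X_{t-1} + b + eps_t.

    Returns an (n + 1, p) array whose row t is X_t.
    """
    rng = np.random.default_rng(seed)
    p = theta.shape[0]
    b = np.zeros(p) if b is None else np.asarray(b, dtype=float)
    x = np.zeros((n + 1, p))
    x[0] = rng.normal(scale=noise_sd, size=p)  # arbitrary starting point (assumption)
    for t in range(1, n + 1):
        eps = rng.normal(scale=noise_sd, size=p)
        x[t] = theta @ x[t - 1] + b + eps
    return x

# Hypothetical toy network: gene 0 regulates itself and gene 1, gene 1 regulates gene 2.
theta = np.array([[0.5, 0.0, 0.0],
                  [0.8, 0.0, 0.0],
                  [0.0, -0.7, 0.0]])
x = simulate_var1(theta, n=10)

# An edge j -> i is present exactly when theta[i, j] is non-zero.
edges = [(int(j), int(i)) for i, j in zip(*np.nonzero(theta))]
```

Each row of `x` plays the role of one microarray, and `edges` lists the regulatory links that an inference method would try to recover from `x` alone.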
13. Gaussian Graphical Model for Time-course data
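Let
X be the n × p matrix whose kth row is $X_k$,
$S = n^{-1} X_{\backslash n}^\top X_{\backslash n}$ be the within time covariance matrix,
$V = n^{-1} X_{\backslash n}^\top X_{\backslash 0}$ be the across time covariance matrix,
where $X_{\backslash n}$ (resp. $X_{\backslash 0}$) is read here as X without its last (resp. first) time point.
The log-likelihood
$$\mathcal{L}_{\mathrm{time}}(\Theta; S, V) = n \,\mathrm{Trace}(V\Theta) - \frac{n}{2} \,\mathrm{Trace}(\Theta^\top S \Theta) + c.$$
Maximum Likelihood Estimator $\Theta^{\mathrm{MLE}} = S^{-1} V$
not defined for n < p;
even if n > p, requires multiple testing.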
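The quantities on this slide can be computed in a few lines. The sketch below is an illustration under stated assumptions: time points are stored as rows, the data are centered to absorb the intercept b, and S^{-1}V is read as the least-squares fit of each row on the previous one (so the returned matrix acts on row vectors). The function names are hypothetical.

```python
import numpy as np

def empirical_covariances(x):
    """x: (n + 1, p) array whose rows are consecutive time points X_0, ..., X_n."""
    past, future = x[:-1], x[1:]           # X without its last row / without its first row
    past = past - past.mean(axis=0)        # centering absorbs the intercept b (assumption)
    future = future - future.mean(axis=0)
    n = past.shape[0]
    s = past.T @ past / n                  # within-time covariance S, shape (p, p)
    v = past.T @ future / n                # across-time covariance V, shape (p, p)
    return s, v

def var1_mle(x):
    """Return S^{-1} V, or fail when S cannot be inverted."""
    s, v = empirical_covariances(x)
    p = s.shape[0]
    if np.linalg.matrix_rank(s) < p:
        raise ValueError("S is singular (too few time points): the MLE is not defined")
    return np.linalg.solve(s, v)
```

With fewer time points than genes, `var1_mle` raises an error because S is rank-deficient, which is the situation that motivates the regularized estimators discussed next.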
14. Structured regularization
$\ell_1$ penalized log-likelihood
Charbonnier, Chiquet, Ambroise, SAGMB 2010
$$\hat{\Theta}_\lambda = \arg\max_{\Theta} \; \mathcal{L}_{\mathrm{time}}(\Theta; S, V) - \lambda \cdot \sum_{i,j \in \mathcal{P}} P^Z_{ij} \, |\Theta_{ij}|$$
where λ is an overall tuning parameter and $P^Z$ is a
(non-symmetric) matrix of weights depending on the underlying
clustering Z.
It performs
1. regularization (needed when $n \ll p$),
2. selection (specificity of the $\ell_1$-norm),
3. cluster-driven inference (penalty adapted to Z).
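One simple way to solve a weighted $\ell_1$ problem of this kind is proximal gradient descent (ISTA). The sketch below is not the solver used in SIMoNe; it is a generic illustration in which the smooth part is written so that, with λ = 0 and an invertible S, the iterations converge to S^{-1}V. That choice of transposes, the function names and the iteration count are assumptions.

```python
import numpy as np

def soft_threshold(a, thresh):
    """Elementwise soft-thresholding: the proximal operator of a weighted l1 norm."""
    return np.sign(a) * np.maximum(np.abs(a) - thresh, 0.0)

def weighted_lasso_var1(s, v, weights, lam, n, n_iter=500):
    """Minimize 0.5*tr(T' S T) - tr(V' T) + (lam / n) * sum(weights * |T|) by ISTA.

    s, v    : within-time and across-time covariance matrices, shape (p, p)
    weights : non-negative penalty weights, shape (p, p), in the role of P^Z
    lam     : overall tuning parameter
    n       : number of time transitions used to build s and v
    """
    # Step size 1/L, where L is the largest eigenvalue of S (Lipschitz constant).
    step = 1.0 / max(np.linalg.eigvalsh(s)[-1], 1e-12)
    theta = np.zeros_like(v, dtype=float)
    for _ in range(n_iter):
        grad = s @ theta - v                      # gradient of the smooth part
        theta = soft_threshold(theta - step * grad, step * lam / n * weights)
    return theta
```

Entries with a large weight are shrunk to exactly zero first, which is how a clustering-derived weight matrix steers the selection of edges.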
15. Structured regularization
“Bayesian” interpretation of $\ell_1$ regularization
Laplacian prior on Θ depends on the clustering Z
$$P(\Theta \mid Z) \propto \prod_{i,j} \exp\left\{ -\lambda \cdot P^Z_{ij} \cdot |\Theta_{ij}| \right\}.$$
$P^Z$ summarizes prior information on the position of edges
[Figure: prior distribution of $\Theta_{ij}$ over [-1, 1] for $\lambda P^Z_{ij} = 1$ and $\lambda P^Z_{ij} = 2$.]
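The link between this prior and the penalized criterion can be spelled out in one step. The derivation below is a sketch that assumes independent Laplace priors on the entries of Θ, as the product form suggests, and treats all terms that do not depend on Θ as constants.

```latex
\begin{align*}
\hat{\Theta}^{\mathrm{MAP}}
  &= \arg\max_{\Theta}\; \log P(X \mid \Theta) + \log P(\Theta \mid Z) \\
  &= \arg\max_{\Theta}\; \mathcal{L}_{\mathrm{time}}(\Theta; S, V)
     - \lambda \sum_{i,j} P^Z_{ij}\, |\Theta_{ij}| .
\end{align*}
```

Under these assumptions the maximum a posteriori estimate coincides with the weighted $\ell_1$ penalized estimator, and a larger $\lambda P^Z_{ij}$ corresponds to a prior more sharply concentrated at zero for that edge.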
16. How to come up with a latent clustering?
Biological expertise
Build Z from prior biological information
transcription factors vs. regulatees,
number of potential binding sites,
KEGG pathways, . . .
Build the weight matrix from Z.
Inference: Erdős-Rényi Mixture for Networks
(Daudin et al., 2008)
Spread the nodes into Q classes;
Connection probabilities depend on node classes:
$P(i \to j \mid i \in \text{class } q, j \in \text{class } \ell) = \pi_{q\ell}$.
Build $P^Z_{ij} \propto 1 - \pi_{q\ell}$.
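The last step, turning class-level connection probabilities into edge-level weights, can be sketched as follows. The function name and the example probabilities are hypothetical; the only rule taken from the slide is that the weight for the pair (i, j) decreases with the connection probability between the classes of i and j, here with proportionality constant 1.

```python
import numpy as np

def penalty_from_clustering(classes, pi):
    """Return weights P[i, j] = 1 - pi[q, l], with i in class q and j in class l.

    classes : length-p sequence of class labels in {0, ..., Q - 1}
    pi      : (Q, Q) matrix of connection probabilities between classes
    """
    classes = np.asarray(classes)
    pi = np.asarray(pi, dtype=float)
    return 1.0 - pi[np.ix_(classes, classes)]

# Hypothetical example: one hub (class 0) and three leaves (class 1).
pi = np.array([[0.90, 0.60],
               [0.05, 0.01]])
weights = penalty_from_clustering([0, 1, 1, 1], pi)
```

Because `pi` need not be symmetric, neither is `weights`: likely edges (hub to leaf) receive a light penalty, unlikely ones (leaf to leaf) a heavy one.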
17. Algorithm
Suppose you want to recover a clustered network:
[Figure: target network (graph) and its target adjacency matrix.]
18. Algorithm
Start with microarray data
[Diagram: Data → SIMoNE]
19. Algorithm
[Diagram: Data → SIMoNE (inference without prior) → adjacency matrix corresponding to G]
20. Algorithm
[Diagram: Data → SIMoNE (inference without prior) → adjacency matrix corresponding to G → Mixer → connectivity matrix πZ → decreasing transformation → penalty matrix PZ]
21. Algorithm
[Diagram: Data → SIMoNE (inference without prior) → adjacency matrix corresponding to G → Mixer → connectivity matrix πZ → decreasing transformation → penalty matrix PZ; then Data + PZ → SIMoNE (inference with clustered prior) → adjacency matrix corresponding to GZ]
22. Tuning the penalty parameter
BIC / AIC
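Degrees of freedom of the Lasso (Zou et al. 2008)
$$\mathrm{df}(\hat{\beta}^\lambda) = \sum_k \mathbb{1}(\hat{\beta}^\lambda_k \neq 0)$$
Straightforward extensions to the graphical framework
$$\mathrm{BIC}(\lambda) = \mathcal{L}(\hat{\Theta}_\lambda; X) - \mathrm{df}(\hat{\Theta}_\lambda) \frac{\log n}{2}$$
$$\mathrm{AIC}(\lambda) = \mathcal{L}(\hat{\Theta}_\lambda; X) - \mathrm{df}(\hat{\Theta}_\lambda)$$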
Rely on asymptotic approximations,
but still relevant on simulated small samples.
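As an illustration of how these criteria pick λ, the sketch below scores a list of candidate fits and keeps the best one. It assumes the log-likelihood of each fit has already been evaluated; the function names, the zero tolerance and the candidate format are hypothetical.

```python
import numpy as np

def degrees_of_freedom(theta, tol=1e-10):
    """Number of non-zero entries of the estimate."""
    return int(np.sum(np.abs(theta) > tol))

def bic(loglik, theta, n):
    return loglik - degrees_of_freedom(theta) * np.log(n) / 2.0

def aic(loglik, theta):
    return loglik - degrees_of_freedom(theta)

def select_lambda(candidates, n, criterion="bic"):
    """candidates: list of (lam, loglik, theta); return the lam with the highest score."""
    def score(item):
        _, loglik, theta = item
        return bic(loglik, theta, n) if criterion == "bic" else aic(loglik, theta)
    return max(candidates, key=score)[0]
```

Since log(n)/2 exceeds 1 as soon as n is at least 8, BIC penalizes each retained edge more heavily than AIC and tends to select sparser networks.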
23. Inference methods
1. Lasso (Tibshirani)
2. Adaptive Lasso (Zou et al.)
Weights inversely proportional to an initial Lasso estimate.
3. KnwCl
Weights structured according to true clustering.
4. InfCl
Weights structured according to inferred clustering.
5. Renet-VAR (Shimamura et al.)
Edge estimation based on a recursive elastic net.
6. G1DBN (Lèbre et al.)
Edge estimation based on dynamic Bayesian networks followed by statistical
testing of edges.
7. Shrinkage (Opgen-Rhein et al.)
Edge estimation based on shrinkage followed by multiple testing local false
discovery rate correction.
24. Simulations: time-course data with star-pattern
Simulation settings
2 classes, hubs and leaves, with proportions α = (0.1, 0.9),
K = 2p edges, among which:
85% from hubs to leaves,
15% between hubs.
p (genes)   n (arrays)   samples
20          40           500
20          20           500
20          10           500
100         100          200
100         50           200
800         20           100
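A star-pattern network of this kind could be drawn as in the sketch below. Several choices here are assumptions rather than the authors' procedure: hubs are taken to be the first 10% of nodes, edges are distinct directed pairs without self-loops, and the requested counts are capped when too few distinct pairs exist (which happens for hub-to-hub edges when p is small). The function name is hypothetical.

```python
import numpy as np

def star_network(p, hub_fraction=0.1, edges_per_node=2, hub_to_leaf=0.85, seed=0):
    """Draw a directed hub/leaf adjacency matrix; adj[i, j] = 1 encodes i -> j."""
    rng = np.random.default_rng(seed)
    n_hubs = max(1, round(hub_fraction * p))
    hubs = range(n_hubs)                      # first nodes are hubs (assumption)
    leaves = range(n_hubs, p)
    k = edges_per_node * p                    # K = 2p edges requested in total

    hl_pairs = [(h, l) for h in hubs for l in leaves]
    hh_pairs = [(h, g) for h in hubs for g in hubs if h != g]
    k_hl = min(round(hub_to_leaf * k), len(hl_pairs))        # hub -> leaf edges
    k_hh = min(k - round(hub_to_leaf * k), len(hh_pairs))    # hub -> hub edges

    adj = np.zeros((p, p), dtype=int)
    for pairs, count in ((hl_pairs, k_hl), (hh_pairs, k_hh)):
        chosen = rng.choice(len(pairs), size=count, replace=False)
        for idx in chosen:
            i, j = pairs[idx]
            adj[i, j] = 1
    classes = np.array([0] * n_hubs + [1] * (p - n_hubs))    # 0 = hub, 1 = leaf
    return adj, classes

adj, classes = star_network(100)
```

The non-zero pattern of `adj` can then be filled with coefficients to obtain a matrix Θ for simulating time-course data.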
25. Simulations: time-course data with star-pattern
[Figure: precision and recall of each method (Lasso, Adaptive Lasso, KnwCl and InfCl, each tuned with BIC and with AIC, plus Renet-VAR, G1DBN and Shrinkage) in six panels: a) p=20, n=40; b) p=20, n=20; c) p=20, n=10; d) p=100, n=100; e) p=100, n=50; f) p=800, n=20.]
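For reference, precision and recall of an inferred edge set against a known network can be computed as in this minimal sketch. Representing edges as (source, target) pairs and returning 1.0 when a denominator is empty are conventions chosen here for illustration, not taken from the slides.

```python
def precision_recall(true_edges, inferred_edges):
    """Compare two collections of directed edges given as (source, target) pairs."""
    true_edges, inferred_edges = set(true_edges), set(inferred_edges)
    true_positives = len(true_edges & inferred_edges)
    # Precision: fraction of inferred edges that are real.
    precision = true_positives / len(inferred_edges) if inferred_edges else 1.0
    # Recall: fraction of real edges that were recovered.
    recall = true_positives / len(true_edges) if true_edges else 1.0
    return precision, recall

# Hypothetical example: three true edges, two inferred, one of them correct.
precision, recall = precision_recall(
    true_edges=[(0, 1), (1, 2), (2, 3)],
    inferred_edges=[(0, 1), (3, 0)],
)
```

A sparse estimate typically trades recall for precision, which is why both rates are reported side by side.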
26. Reasonable computing time
[Figure: log of CPU time (sec), from 1 sec. to 4 hr., against log of p/n, for G1DBN, Renet-VAR and InfCl.]
Figure: Computing times on the log-log scale for Renet-VAR, G1DBN
and InfCl (including inference of classes). Intel Dual Core 3.40 GHz
processor.
27. E. coli S.O.S DNA repair network
E.coli S.O.S data
28. E. coli S.O.S DNA repair network
Precision and Recall rates
[Figure: precision and recall of Lasso, Adaptive, KnowCl, InferCl, Renet and G1DBN on four experiments (Exp. 1 to Exp. 4).]
30. Conclusions
To sum up
cluster-driven inference of gene regulatory networks from
time-course data,
expert-based or inferred latent structure,
embedded in the SIMoNe R package along with similar
algorithms dealing with steady-state or multitask data.
Perspectives
inference of truly dynamic networks,
use of additional biological information to refine the
inference,
comparison of inferred networks.
32. Publications
Ambroise, Chiquet, Matias, 2009.
Inferring sparse Gaussian graphical models with latent structure.
Electronic Journal of Statistics, 3, 205-238.
Chiquet, Smith, Grasseau, Matias, Ambroise, 2009.
SIMoNe: Statistical Inference for MOdular NEtworks.
Bioinformatics, 25(3), 417-418.
Charbonnier, Chiquet, Ambroise, 2010.
Weighted-Lasso for Structured Network Inference from Time Course Data.
SAGMB, 9.
Chiquet, Grandvalet, Ambroise, 2010.
Inferring multiple Gaussian graphical models.
Statistics and Computing.
Working paper: Chiquet, Charbonnier, Ambroise, Grasseau.
SIMoNe: An R package for inferring Gaussian networks with latent structure.
Journal of Statistical Software.
Working paper: Chiquet, Grandvalet, Ambroise, Jeanmougin.
Biological analysis of breast cancer by multitasks learning.