Biological Network Inference via Gaussian Graphical Models

1. An introduction to biological network inference via Gaussian Graphical Models
Christophe Ambroise, Julien Chiquet
Statistique et Génome, CNRS & Université d'Évry Val d'Essonne
São Paulo – School on Advanced Science – October 2012
http://stat.genopole.cnrs.fr/~cambroise
2. Outline
Introduction
  Motivations
  Background on omics
  Modeling issue
Modeling tools
  Statistical dependence
  Graphical models
  Covariance selection and Gaussian vector
Gaussian Graphical Models for genomic data
  Steady-state data
  Time-course data
Statistical inference
  Penalized likelihood approach
  Inducing sparsity and regularization
  The Lasso
Application in Post-genomics
  Modeling time-course data
  Illustrations
  Multitask learning
5. Real networks
- Many scientific fields: World Wide Web; biology, sociology, physics.
- Nature of data under study: interactions between N objects, hence O(N²) possible interactions.
- Network topology: describes the way nodes interact, the structure/function relationship.
Figure: sample of 250 blogs (nodes) with their links (edges) of the French political blogosphere.
6. What the reconstructed networks are expected to be (1)
Regulatory networks (the E. coli regulatory network)
- relationships between genes and their products,
- inhibition/activation,
- impossible to recover at large scale,
- always incomplete¹.
¹ and are presumably wrongly assumed to be
7. What the reconstructed networks are expected to be (2)
Regulatory networks
Figure: Regulatory network identified in mammalian cells: highly structured
8. What the reconstructed networks are expected to be (3)
Protein-protein interaction networks
Figure: yeast PPI network — do not be misled by the representation, trust the statistics!
12. What are we looking at?
Central dogma of molecular biology
DNA –(transcription)→ mRNA –(translation)→ Proteins (DNA also undergoes replication)
Proteins
- are the building blocks of any cellular functionality,
- are encoded by the genes,
- do interact (at the protein and gene level – regulations).
13. What questions in functional genomics? (1)
Various levels/scales of study
- genome: sequence analysis,
- transcriptome: gene expression levels,
- proteome: protein functions and interactions.
Questions
1. Biological understanding
   - mechanisms of diseases,
   - gene/protein functions and interactions.
2. Medical/clinical care
   - diagnosis (type of disease),
   - prognosis (survival analysis),
   - treatment (prediction of response).
15. What questions in functional genomics? (2)
Central dogma of molecular biology
DNA –(transcription)→ mRNA –(translation)→ Proteins (DNA also undergoes replication)
Basic biostatistical issues
- selecting some genes of interest (biomarkers),
- looking for interactions between them (pathway analysis).
16. How is this measured? (1)
Microarray technology: parallel measurement of many biological features (signal processing, then pretreatment).
Matrix of features, with n ≪ p: the expression levels of p probes are simultaneously monitored for n individuals,
$$\mathbf{X} = \begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^p \\ \vdots & & & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^p \end{pmatrix}.$$
17. How is this measured? (2)
Next Generation Sequencing: parallel measurement of even many more biological features (assembling, then pretreatment).
Matrix of features, with n ≪ p: expression counts are extracted from small repeated sequences and monitored for n individuals,
$$\mathbf{X} = \begin{pmatrix} k_1^1 & k_1^2 & \cdots & k_1^p \\ \vdots & & & \vdots \\ k_n^1 & k_n^2 & \cdots & k_n^p \end{pmatrix}.$$
18. What questions are we dealing with? (1)
Supervised canonical example at the gene level: differential analysis
Leukemia (Golub data, thanks to P. Neuvial)
- AML – Acute Myeloblastic Leukemia, n1 = 11,
- ALL – Acute Lymphoblastic Leukemia, n2 = 27,
- an (n1 + n2)-vector of outcomes giving each patient's tumor type.
Supervised classification: find genes with significantly different expression levels between the groups – biomarkers (prediction purpose).
19. What questions are we dealing with? (2)
Unsupervised canonical example at the gene level: hierarchical clustering
Same kind of data, but no outcome is considered.
(Unsupervised) clustering: find groups of genes which show statistical dependencies/commonalities – hoping for biological interactions (exploratory purpose, functional understanding).
Can we do better than that? And how do genes interact anyway?
22. The problem at hand
Inference
- ≈ 10s/100s microarray/sequencing experiments,
- ≈ 1000s probes ("genes").
Modeling questions prior to inference
1. What do the nodes represent? (the easiest one)
2. What is/should be the meaning of an edge? (the toughest one)
   - Biologically?
   - Statistically?
26. More questions/issues
Modelling
- Is the network dynamic or static?
- How has the data been generated? (time-course/steady-state)
- Are the edges oriented or not? (causality)
- What do the edges represent for my particular problem?
Statistical challenges
- (Ultra) high dimensionality,
- noisy data, lack of reproducibility,
- heterogeneity of the data (many techniques, various signals).
29. Canonical model settings
Biological microarrays in comparable conditions
Notation
1. a set P = {1, ..., p} of p variables: these are typically the genes (could be proteins);
2. a sample N = {1, ..., n} of individuals associated with the variables: these are typically the microarrays (could be sequence counts).
Stacking (X^1, ..., X^n), we meet the usual individual/variable table
$$\mathbf{X} = \begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^p \\ \vdots & & & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^p \end{pmatrix}.$$
Basic statistical model
This can be viewed as
- a random vector X in R^p, whose jth entry is the jth variable,
- an n-size sample (X^1, ..., X^n), such that X^i is the ith microarray,
  - either independent identically distributed copies (steady-state data),
  - or dependent in a certain way (time-course data);
- assuming a parametric probability distribution for X (Gaussian).
33. Modeling relationships between variables (1)
Independence
Definition (Independence of events)
Two events A and B are independent if and only if
$$P(A, B) = P(A)\,P(B),$$
which is usually denoted by A ⊥⊥ B. Equivalently,
- A ⊥⊥ B ⇔ P(A|B) = P(A),
- A ⊥⊥ B ⇔ P(A|B) = P(A|Bᶜ).
Example (class vs party)

class         Labour   Tory        class         Labour   Tory
working        0.42    0.28        working        0.60    0.40
bourgeoisie    0.06    0.24        bourgeoisie    0.20    0.80

Table: joint probability (left) vs. conditional probability given class (right).
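A quick check on these numbers: the marginals are P(working) = 0.42 + 0.28 = 0.70 and P(Labour) = 0.42 + 0.06 = 0.48, so under independence we would expect P(working, Labour) = 0.70 × 0.48 = 0.336, not the observed 0.42. Class and party are therefore dependent, which is also visible in the conditional table (0.60 ≠ 0.20).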
35. Modeling relationships between variables (2)
Conditional independence
Generalizing to more than two events requires strong assumptions (mutual independence). This is better handled with
Definition (Conditional independence of events)
Two events A and B are conditionally independent given C if and only if
$$P(A, B \mid C) = P(A|C)\,P(B|C),$$
which is usually denoted by A ⊥⊥ B | C.
Example (Does IQ depend on weight?)
Consider the events A = "having a low IQ" and B = "having a low weight". (Naively) estimating P(A, B), P(A) and P(B) in a sample would lead to P(A, B) ≠ P(A)P(B). But in fact, introducing C = "having a given age",
$$P(A, B \mid C) = P(A|C)\,P(B|C).$$
39. Independence of random vectors (1)
Independence and conditional independence: natural generalization
Definition
Consider three random vectors X, Y, Z with distributions f_X, f_Y, f_Z, and joint distributions f_XY, f_XYZ. Then,
- X and Y are independent iff f_XY(x, y) = f_X(x) f_Y(y);
- X and Y are conditionally independent given Z iff, for all z such that f_Z(z) > 0, f_{XY|Z}(x, y; z) = f_{X|Z}(x; z) f_{Y|Z}(y; z).
Proposition (Factorization criterion)
X and Y are independent (resp. conditionally independent given Z) iff there exist functions g and h such that, for all x and y,
1. f_XY(x, y) = g(x) h(y),
2. f_XYZ(x, y, z) = g(x, z) h(y, z), for all z with f_Z(z) > 0.
41. Independence of random vectors (2)
Independence vs conditional independence
Figure: graphs corresponding to different factorizations of f — mutual independence (f = f_X f_Y f_Z), the three conditional independences (f such that X ⊥⊥ Y | Z, X ⊥⊥ Z | Y, or Y ⊥⊥ Z | X), and full dependence (a general f_XYZ).
43. Definition
Definition
A graphical model gives a graphical (intuitive) representation of the dependence structure of a probability distribution.
Graphical structure ↔ random variables/random vector
It links
1. a random vector (or a set of random variables) X = {X_1, ..., X_p} with distribution P,
2. a graph G = (P, E) where
   - P = {1, ..., p} is the set of nodes, one associated with each variable,
   - E is a set of edges describing the dependence relationships of X ∼ P.
45. Conditional Independence Graphs
Definition
The conditional independence graph of a random vector X is the undirected graph G = (P, E) with set of nodes P = {1, ..., p} and where
$$(i, j) \notin E \iff X_i \perp\!\!\!\perp X_j \mid X_{P \setminus \{i, j\}}.$$
Property
It enjoys the Markov property: any two subsets of variables separated by a third one are independent conditionally on the variables in the third set.
47. Conditional Independence Graphs
An example
Let X_1, X_2, X_3, X_4 be four random variables with joint probability density function f_X(x) = exp(u + x_1 + x_1 x_2 + x_2 x_3 x_4), with u a given constant.
Apply the factorization property:
$$f_X(x) = \exp(u + x_1 + x_1 x_2 + x_2 x_3 x_4) = \exp(u) \cdot \exp(x_1 + x_1 x_2) \cdot \exp(x_2 x_3 x_4).$$
Graphical representation
The factor exp(x_1 + x_1 x_2) couples variables 1 and 2; the factor exp(x_2 x_3 x_4) couples variables 2, 3 and 4 pairwise. Hence G = (P, E) with P = {1, 2, 3, 4} and
$$E = \{(1, 2), (2, 3), (2, 4), (3, 4)\}.$$
51. Directed Acyclic conditional independence Graph (DAG)
Motivation
Limitation of undirected graphs
Sometimes an ordering on the variables is known, which allows us to break the symmetry in the graphical representation and to introduce, in some sense, "causality" in the modeling.
Consequences
- Each element of E has to be directed.
- There is no directed cycle in the graph.
We thus deal with a directed acyclic graph (or DAG).
52. Directed Acyclic conditional independence Graph (DAG)
Definition
Definition (Ordering)
An ordering ≺ between the variables {1, ..., p} is a relation such that: i) for every couple (i, j), either i ≺ j or j ≺ i; ii) ≺ is transitive; iii) ≺ is not reflexive.
- A natural ordering is obtained when the variables are observed across time.
- A natural conditioning set for a pair of variables (i, j) is the past, denoted P(j) = {i : i ≺ j}.
Definition (DAG)
The directed conditional dependence graph of X is the directed graph G = (P, E) where, for (i, j) such that i ≺ j,
$$(i, j) \notin E \iff X_j \perp\!\!\!\perp X_i \mid X_{P(j) \setminus \{i, j\}}.$$
54. Directed Acyclic conditional independence Graph (DAG)
Factorization and Markov property
Another view uses parent/descendant relationships to deal with the ordering of the nodes.
The factorization property
$$f_X(x) = \prod_{k=1}^{p} f_{X_k \mid \mathrm{pa}_k}(x_k \mid \mathrm{pa}_k),$$
where pa_k are the parents of node k.
63. Directed Acyclic conditional independence Graph (DAG)
Markov property
Local Markov property
For any Y ∉ de_k, where de_k are the descendants of k,
$$X_k \perp\!\!\!\perp Y \mid \mathrm{pa}_k,$$
that is, X_k is conditionally independent of its non-descendants given its parents.
67. Modeling the genomic data
Gaussian assumption
The data:
$$\mathbf{X} = \begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^p \\ \vdots & & & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^p \end{pmatrix}.$$
Assuming f_X is multivariate Gaussian greatly simplifies the inference:
- it naturally links independence and conditional independence to the covariance and partial covariance,
- it gives a straightforward interpretation to the graphical modeling previously considered.
69. Start gently with the univariate Gaussian distribution
The Gaussian distribution is the natural model for the expression level of a gene (noisy data).
We note X ∼ N(μ, σ²), so that E X = μ, Var X = σ², and
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\},$$
and
$$\log f_X(x) = -\log\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2}(x - \mu)^2.$$
Useless, though, for modeling the distribution of the expression levels of a whole bunch of genes.
71. One step forward: bivariate Gaussian distribution
Need concepts of covariance and correlation
Let X, Y be two real random variables.
Definitions
$$\mathrm{cov}(X, Y) = \mathbb{E}\left[(X - \mathbb{E}X)(Y - \mathbb{E}Y)\right] = \mathbb{E}(XY) - \mathbb{E}(X)\,\mathbb{E}(Y),$$
$$\rho_{XY} = \mathrm{cor}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}}.$$
Proposition
- cov(X, X) = Var(X) = E[(X − EX)²],
- cov(X + Y, Z) = cov(X, Z) + cov(Y, Z),
- Var(X + Y) = Var(X) + Var(Y) + 2 cov(X, Y),
- X ⊥⊥ Y ⇒ cov(X, Y) = 0,
- X ⊥⊥ Y ⇔ cov(X, Y) = 0 when X, Y are Gaussian.
73. The bivariate Gaussian distribution
$$f_{XY}(x, y) = \frac{1}{2\pi\sqrt{\det\Sigma}} \exp\left\{-\frac{1}{2} \begin{pmatrix} x - \mu_1 & y - \mu_2 \end{pmatrix} \Sigma^{-1} \begin{pmatrix} x - \mu_1 \\ y - \mu_2 \end{pmatrix}\right\},$$
where Σ is the variance/covariance matrix, which is symmetric and positive definite:
$$\Sigma = \begin{pmatrix} \mathrm{Var}(X) & \mathrm{cov}(X, Y) \\ \mathrm{cov}(X, Y) & \mathrm{Var}(Y) \end{pmatrix}.$$
If standardized,
$$\Sigma = \begin{pmatrix} 1 & \rho_{XY} \\ \rho_{XY} & 1 \end{pmatrix}
\quad\text{and}\quad
f_{XY}(x, y) = \frac{1}{2\pi\sqrt{1 - \rho_{XY}^2}} \exp\left\{-\frac{1}{2(1 - \rho_{XY}^2)}\left(x^2 + y^2 - 2\rho_{XY}\, x y\right)\right\},$$
where ρ_XY is the correlation between X and Y and describes the interaction between them.
75. The bivariate Gaussian distribution
The covariance matrix
Let X ∼ N(0, Σ) with unit variances. For ρ_XY = 0,
$$\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$
while for ρ_XY = 0.9,
$$\Sigma = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}.$$
The shape of the 2-D distribution evolves accordingly (figure: the density contours stretch along the diagonal as ρ_XY grows).
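To see this concretely, here is a minimal R sketch (the variable names are illustrative; it assumes the MASS package, whose mvrnorm() samples from a multivariate Gaussian):

    library(MASS)
    Sigma0 <- diag(2)                                  # rho = 0
    Sigma9 <- matrix(c(1, 0.9, 0.9, 1), 2, 2)          # rho = 0.9
    x0 <- mvrnorm(1000, mu = c(0, 0), Sigma = Sigma0)  # round cloud
    x9 <- mvrnorm(1000, mu = c(0, 0), Sigma = Sigma9)  # elongated cloud
    par(mfrow = c(1, 2)); plot(x0, asp = 1); plot(x9, asp = 1)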
77. Full generalization: multivariate Gaussian vector
Now need partial covariance and partial correlation
Let X, Y, Z be real random variables.
Definitions
$$\mathrm{cov}(X, Y \mid Z) = \mathrm{cov}(X, Y) - \frac{\mathrm{cov}(X, Z)\,\mathrm{cov}(Y, Z)}{\mathrm{Var}(Z)},$$
$$\rho_{XY|Z} = \frac{\rho_{XY} - \rho_{XZ}\,\rho_{YZ}}{\sqrt{1 - \rho_{XZ}^2}\,\sqrt{1 - \rho_{YZ}^2}}.$$
These give the interaction between X and Y once the effect of Z has been removed.
Proposition
When X, Y, Z are jointly Gaussian, then
$$\mathrm{cov}(X, Y \mid Z) = 0 \iff \mathrm{cor}(X, Y \mid Z) = 0 \iff X \perp\!\!\!\perp Y \mid Z.$$
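The partial-correlation formula translates directly into code; a minimal R sketch (function and input names are illustrative):

    ## partial correlation of X and Y given Z, from the pairwise correlations
    pcor_given_z <- function(r_xy, r_xz, r_yz)
      (r_xy - r_xz * r_yz) / (sqrt(1 - r_xz^2) * sqrt(1 - r_yz^2))

    pcor_given_z(0.70, 0.60, 0.80)  # a strong marginal link largely explained by Z

Here a marginal correlation of 0.70 drops to about 0.46 once Z is accounted for.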
79. The multivariate Gaussian distribution
Allows modeling the expression levels of a whole set of genes P.
Gaussian vector
Let X ∼ N(μ, Σ), and consider any block decomposition with {a, b} a partition of P:
$$\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}.$$
Then
1. X_a is Gaussian with distribution N(μ_a, Σ_aa);
2. X_a | X_b = x is Gaussian with distribution N(μ_{a|b}, Σ_{a|b}), with known parameters μ_{a|b} = μ_a + Σ_ab Σ_bb⁻¹ (x − μ_b) and Σ_{a|b} = Σ_aa − Σ_ab Σ_bb⁻¹ Σ_ba.
82. Steady-state data: scheme
Inference: from ≈ 10s microarray experiments and ≈ 1000s probes ("genes"), which interactions?
83. Modeling the underlying distribution (1)
Model for data generation
- A microarray can be represented as a multivariate vector X = (X_1, ..., X_p) ∈ R^p.
- Consider n biological replicates in the same condition, which form a usual n-size sample (X^1, ..., X^n).
Consequence: a Gaussian Graphical Model
- X ∼ N(μ, Σ), with X^1, ..., X^n i.i.d. copies of X;
- Θ = (θ_ij)_{i,j ∈ P} := Σ⁻¹ is called the concentration matrix.
85. Modeling the underlying distribution (2)
Interpretation as a GGM
Multivariate Gaussian vector and covariance selection:
$$-\frac{\theta_{ij}}{\sqrt{\theta_{ii}\,\theta_{jj}}} = \mathrm{cor}\left(X_i, X_j \mid X_{P \setminus \{i, j\}}\right) = \rho_{ij \mid P \setminus \{i, j\}}.$$
Graphical interpretation
The matrix Θ = (θ_ij)_{i,j ∈ P} encodes the network G we are looking for: there is an edge between i and j
if and only if there is a conditional dependency between X_i and X_j (equivalently, a non-null partial correlation between X_i and X_j),
if and only if θ_ij ≠ 0.
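In R, the partial-correlation reading of Θ is one line of linear algebra; a minimal sketch with an illustrative 3 × 3 covariance:

    Sigma <- matrix(c(1.0, 0.5, 0.25,
                      0.5, 1.0, 0.5,
                      0.25, 0.5, 1.0), 3, 3)
    Theta <- solve(Sigma)    # concentration matrix
    P <- -cov2cor(Theta)     # off-diagonal: partial correlations rho_{ij|rest}
    diag(P) <- 1
    P                        # zeros in Theta <=> missing edges in G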
88. Time-course data: scheme
Inference: from ≈ 10s microarrays collected over time (t_0, t_1, ..., t_n) and ≈ 1000s probes ("genes"), which interactions?
89. Modeling time-course data with DAG
Collecting gene expression
1. Follow-up of one single experiment/individual;
2. Close enough time-points to ensure
   - dependency between consecutive measurements,
   - homogeneity of the Markov process.
Figure: the network G over the genes {1, ..., 5} stands for the bipartite directed graph between X^t and X^{t+1}; unrolled over time, (X^1, X^2, ..., X^n) forms a chain in which every transition is governed by the same G.
92. DAG: remark
Figure: the network G may contain a cycle (argh! :'(), yet the corresponding time-unrolled graph between X^t and X^{t+1} is indeed a DAG.
This overcomes the rather restrictive acyclicity requirement.
93. Modeling the underlying distribution (1)
Model for data generation
A microarray can be represented as a multivariate vector X = (X_1, ..., X_p) ∈ R^p, generated through a first-order vector autoregressive process VAR(1):
$$X^t = \Theta X^{t-1} + b + \varepsilon^t, \quad t = 1, \dots, n,$$
where ε^t is a white noise ensuring the Markov property, and X^0 ∼ N(0, Σ_0).
Consequence: a Gaussian Graphical Model
- Each X^t | X^{t−1} ∼ N(Θ X^{t−1}, Σ),
- or, equivalently, X_j^t | X^{t−1} ∼ N(Θ_j X^{t−1}, Σ),
where Σ is known and Θ_j is the jth row of Θ.
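A minimal R sketch of this data-generating process (Θ, the noise level, and the dimensions are illustrative assumptions):

    set.seed(1)
    p <- 5; n <- 20
    Theta <- matrix(0, p, p)
    Theta[2, 1] <- 0.8; Theta[3, 2] <- -0.5; Theta[4, 2] <- 0.6  # a few edges
    X <- matrix(0, n, p)
    X[1, ] <- rnorm(p)
    for (t in 2:n)                    # VAR(1) recursion X^t = Theta X^{t-1} + eps^t
      X[t, ] <- Theta %*% X[t - 1, ] + rnorm(p, sd = 0.3)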
96. Modeling the underlying distribution (3)
Interpretation as a GGM
The VAR(1) as a covariance selection model:
$$\theta_{ij} = \frac{\mathrm{cov}\left(X_i^t, X_j^{t-1} \mid X_{P \setminus j}^{t-1}\right)}{\mathrm{var}\left(X_j^{t-1} \mid X_{P \setminus j}^{t-1}\right)}.$$
Graphical interpretation
The matrix Θ = (θ_ij)_{i,j ∈ P} encodes the network G we are looking for: there is a directed edge from j to i
if and only if there is a conditional dependency between X_j^{t−1} and X_i^t (equivalently, a non-null partial correlation between X_j^{t−1} and X_i^t),
if and only if θ_ij ≠ 0.
100. The graphical models: reminder (for goldfish-like memories)
Assumption
A microarray can be represented as a multivariate Gaussian vector X.
Collecting gene expression
1. Steady-state data leads to an i.i.d. sample.
2. Time-course data gives a time series.
Graphical interpretation
- Steady-state: an edge i – j if and only if there is a conditional dependency (equivalently, a non-null partial correlation) between X(i) and X(j).
- Time-course: a directed edge j → i if and only if there is a conditional dependency (equivalently, a non-null partial correlation) between X_t(i) and X_{t−1}(j).
In both cases the structure is encoded in an unknown matrix of parameters Θ.
103. The Maximum likelihood estimator
The natural approach for parametric statistics
Let X be a random vector with distribution defined by f_X(x; Θ), where Θ are the model parameters.
Maximum likelihood estimator
$$\hat\Theta = \arg\max_{\Theta}\, \mathcal{L}(\Theta; \mathbf{X}),$$
where L is the log-likelihood, a function of the parameters:
$$\mathcal{L}(\Theta; \mathbf{X}) = \log \prod_{k=1}^{n} f_X(x_k; \Theta) = \sum_{k=1}^{n} \log f_X(x_k; \Theta),$$
where x_k is the kth row of X.
Remarks
- This is a convex optimization problem.
- We just need to detect the nonzero coefficients of Θ.
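For the steady-state Gaussian model the unpenalized MLE has a closed form, which makes the difficulty explicit; a one-line R sketch (X is an illustrative n × p expression matrix):

    Theta_mle <- solve(var(X))   # MLE of the concentration matrix: S^{-1}

This requires the empirical covariance S to be invertible, which fails as soon as n < p — precisely the post-genomics regime, hence the penalized approach of the next slide.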
104. The penalized likelihood approach
Let Θ be the parameters to infer (the edges).
A penalized likelihood approach
$$\hat\Theta_\lambda = \arg\max_{\Theta}\, \mathcal{L}(\Theta; \mathbf{X}) - \lambda\, \mathrm{pen}_{\ell_1}(\Theta),$$
where
- L is the model log-likelihood,
- pen_{ℓ1} is a penalty function tuned by λ > 0.
It performs
1. regularization (needed when n ≪ p),
2. selection (sparsity induced by the ℓ1-norm).
107. A Geometric View of Sparsity
Constrained optimization
We basically want to solve a problem of the form
$$\max_{\beta_1, \beta_2} f(\beta_1, \beta_2; \mathbf{X}),$$
where f is typically a concave likelihood function. This is strictly equivalent to solving
$$\min_{\beta_1, \beta_2} g(\beta_1, \beta_2; \mathbf{X}),$$
where g = −f is convex (for instance, the squared loss in OLS). Now add a constraint:
$$\max_{\beta_1, \beta_2} f(\beta_1, \beta_2; \mathbf{X}) \quad \text{s.t. } \Omega(\beta_1, \beta_2) \le c,$$
where Ω defines a domain that constrains β. This is in turn equivalent to the penalized form
$$\max_{\beta_1, \beta_2} f(\beta_1, \beta_2; \mathbf{X}) - \lambda\, \Omega(\beta_1, \beta_2).$$
How shall we define Ω to induce sparsity?
111. A Geometric View of Sparsity
Supporting hyperplane
A hyperplane supports a set iff
- the set is contained in one half-space, and
- the set has at least one point on the hyperplane.
There are supporting hyperplanes at all points of a convex set: they generalize tangents.
Figure: supporting hyperplanes at various boundary points of a smooth convex set and of the ℓ1 ball.
117. A Geometric View of Sparsity
Dual cone
The dual cone generalizes normals.
Figure: dual cones at various boundary points; at a smooth point the cone reduces to a single direction, at a corner it is large.
Shape of dual cones ⇒ sparsity pattern: the larger the dual cone at a corner of the constraint set, the more likely the solution sits exactly at that corner, where some coefficients are exactly zero.
122. The LASSO
R. Tibshirani, 1996. The Lasso: Least Absolute Shrinkage and Selection Operator.
S. Chen, D. Donoho, M. Saunders, 1995. Basis Pursuit.
Weisberg, 1980. Forward Stagewise regression.
$$\min_{\beta \in \mathbb{R}^2} \|y - \mathbf{X}\beta\|_2^2 \quad \text{s.t. } \|\beta\|_1 = |\beta_1| + |\beta_2| \le c
\quad\iff\quad
\min_{\beta \in \mathbb{R}^2} \|y - \mathbf{X}\beta\|_2^2 + \lambda \|\beta\|_1.$$
Figure: comparison of the solutions of ℓ1- and ℓ2-regularized problems — the ℓ1 ball has corners on the axes, so the solution can hit them exactly, unlike the ℓ2 ball (β̂^ls denotes the least-squares solution).
123. Orthogonal case and link to the OLS
OLS shrinkage
The Lasso has no analytical solution except in the orthogonal case: when XᵀX = I (never true for real data),
$$\hat\beta_j^{\text{lasso}} = \mathrm{sign}\left(\hat\beta_j^{\text{ols}}\right) \max\left(0, \left|\hat\beta_j^{\text{ols}}\right| - \lambda\right).$$
Figure: the Lasso estimate as a soft-thresholding of the OLS estimate (flat at zero on [−λ, λ], shifted towards zero elsewhere).
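The soft-thresholding rule is a two-liner; a minimal R sketch (the function name is illustrative):

    ## soft-thresholding of the OLS coefficients: the orthogonal-design Lasso
    soft_threshold <- function(beta_ols, lambda)
      sign(beta_ols) * pmax(0, abs(beta_ols) - lambda)

    soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)  # -2  0  0  1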
124. LARS: Least Angle Regression
B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, 2004. Least Angle Regression.
An efficient algorithm to compute the Lasso solutions
The LARS solution consists of a curve giving the solution for each value of λ:
- it constructs a piecewise-linear path of solutions, starting from the null vector and moving towards the OLS estimate,
- at (almost) the same cost as OLS,
- well adapted to cross-validation (helps us choose λ).
125. Example: prostate cancer I
Lasso solution path with LARS
> library(lars)
> load("prostate.rda")        # loads the predictors x and the response y
> x <- scale(as.matrix(x))    # standardize the predictors
> out <- lars(x, y)           # compute the whole Lasso path
> plot(out)                   # coefficient profiles along the path
127. Choice of the tuning parameter I
Model selection criteria
$$\mathrm{BIC}(\lambda) = \|y - \mathbf{X}\hat\beta_\lambda\|_2^2 + \log n \cdot \mathrm{df}(\hat\beta_\lambda),
\qquad
\mathrm{AIC}(\lambda) = \|y - \mathbf{X}\hat\beta_\lambda\|_2^2 + 2\, \mathrm{df}(\hat\beta_\lambda),$$
where df(β̂_λ) is the number of nonzero entries in β̂_λ.
Cross-validation
1. Split the data into K folds,
2. use each of the K folds successively as the testing set,
3. compute the test error on that fold,
4. average to obtain the CV estimate of the test error.
λ is chosen to minimize the CV test error.
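For concreteness, here is a hand-rolled K-fold CV sketch over the Lasso path (variable names are illustrative; cv.lars() on the next slide packages exactly this):

    K <- 10
    folds <- sample(rep(1:K, length.out = nrow(x)))
    s_grid <- seq(0, 1, length.out = 50)    # positions along the l1 path
    cv_err <- sapply(1:K, function(k) {
      fit  <- lars(x[folds != k, ], y[folds != k])
      pred <- predict(fit, x[folds == k, ], s = s_grid, mode = "fraction")$fit
      colMeans((y[folds == k] - pred)^2)    # test error per path position
    })
    best_s <- s_grid[which.min(rowMeans(cv_err))]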
128. Choice of the tuning parameter II
CV choice of λ
> cv.lars(x, y, K=10)   # plots the 10-fold CV error curve along the path
130. Many variations
Group-Lasso: activates the variables by groups (given by the user).
Adaptive/weighted Lasso: adjusts the penalty level of each variable, according to prior knowledge or with data-driven weights.
BoLasso: a bootstrapped version that removes false positives and stabilizes the estimate.
Etc., plus many theoretical results.
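As an illustration of the weighted flavour, a minimal R sketch of adaptive-Lasso-style weights (assuming the glmnet package and its penalty.factor argument; the OLS-based weighting shown is one common choice, not the only one):

    library(glmnet)
    w   <- 1 / abs(coef(lm(y ~ x))[-1])        # data-driven weights from OLS
    fit <- glmnet(x, y, penalty.factor = w)    # heavier penalty on weak variables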
133. Problem
Inference: from ≈ 10s microarrays collected over time (t_0, t_1, ..., t_n) and ≈ 1000s probes ("genes"), which interactions?
The main statistical issue is the high-dimensional setting.
134. Handling the scarcity of the data
By introducing some prior
Priors should be biologically grounded:
1. few genes effectively interact (sparsity),
2. networks are organized (latent clustering).
Figure: a 13-gene network (G1, ..., G13) and the same network with its latent clustering into groups A, B and C.
137. Penalized log-likelihood
Banerjee et al., JMLR 2008
$$\hat\Theta_\lambda = \arg\max_{\Theta}\, \mathcal{L}_{\mathrm{iid}}(\Theta; S) - \lambda \|\Theta\|_{\ell_1},$$
efficiently solved by the graphical Lasso of Friedman et al., 2008.
Ambroise, Chiquet, Matias, EJS 2009
Use adaptive penalty parameters for the different coefficients:
$$\tilde{\mathcal{L}}_{\mathrm{iid}}(\Theta; S) - \lambda \|P_Z \star \Theta\|_{\ell_1},$$
where P_Z is a matrix of weights depending on the underlying clustering Z, applied entrywise to Θ. Works with the pseudo log-likelihood (computationally efficient).
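A minimal R sketch of the plain graphical-Lasso step (assuming the glasso package; S, rho and the threshold are illustrative):

    library(glasso)
    S     <- var(scale(X))          # empirical covariance of the expression matrix
    fit   <- glasso(S, rho = 0.1)   # l1-penalized concentration-matrix estimate
    Theta <- fit$wi                 # estimated Sigma^{-1}
    A     <- abs(Theta) > 1e-8      # adjacency of the inferred network
    diag(A) <- FALSE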
139. Neighborhood selection (1)
Let
- X_i be the ith column of X,
- X_{∖i} be X deprived of X_i.
Then
$$\mathbf{X}_i = \mathbf{X}_{\setminus i}\, \beta + \varepsilon, \quad \text{where } \beta_j = -\frac{\theta_{ij}}{\theta_{ii}}.$$
Meinshausen and Bühlmann, 2006
Since sign(cor_{ij|P∖{i,j}}) = sign(β_j), select the neighbors of i with
$$\arg\min_{\beta}\, \frac{1}{n} \left\|\mathbf{X}_i - \mathbf{X}_{\setminus i}\, \beta\right\|_2^2 + \lambda \|\beta\|_{\ell_1}.$$
The sign pattern of Θ is inferred after a symmetrization step.
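In practice one runs p such Lasso regressions and symmetrizes; a minimal R sketch reusing glmnet (the AND rule shown is one common symmetrization, the penalty level is illustrative):

    library(glmnet)
    p <- ncol(X)
    A <- matrix(FALSE, p, p)
    for (i in 1:p) {
      fit <- glmnet(X[, -i], X[, i], lambda = 0.1)  # neighborhood of node i
      A[i, -i] <- as.vector(coef(fit)[-1]) != 0     # drop the intercept
    }
    A <- A & t(A)   # AND rule: keep an edge only if selected in both directions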
140. Neighborhood selection (2)
The pseudo log-likelihood of the i.i.d. Gaussian sample is
$$\tilde{\mathcal{L}}_{\mathrm{iid}}(\Theta; S) = \sum_{i=1}^{p} \log\left(\prod_{k=1}^{n} P\left(X_k(i) \mid X_k(P \setminus i); \Theta_i\right)\right)
= \frac{n}{2} \log\det(D) - \frac{n}{2} \mathrm{Trace}\left(D^{-1/2}\, \Theta S \Theta\, D^{-1/2}\right) - \frac{n}{2}\log(2\pi),$$
where D = diag(Θ).
Proposition
$$\hat\Theta^{\mathrm{pseudo}} = \arg\max_{\Theta}\, \tilde{\mathcal{L}}_{\mathrm{iid}}(\Theta; S) - \lambda \|\Theta\|_{\ell_1}$$
(penalizing the off-diagonal entries θ_ij, i ≠ j) has the same null entries as those inferred by neighborhood selection.
141. Structured regularization
Introduce prior knowledge
Building the weights
1. Build w from prior biological information:
   - transcription factors vs. regulatees,
   - number of potential binding sites,
   - KEGG pathways, Gene Ontology, ...
2. Build the weight matrix from a clustering algorithm:
   - infer the network G⁰ with w = 1 for each node,
   - apply a clustering algorithm to G⁰,
   - re-infer G with w built according to the clustering Z.