Learning Gene Networks with Only Positive Data

On Learning Gene
Regulatory Networks with
Only Positive Examples
Luigi Cerulo, University of Sannio, Italy
Michele Ceccarelli, University of Sannio, Italy

Tuesday, October 12, 2010

Outline
• Supervised inference of gene regulatory
networks
• The positive only problem
• Negative selection approaches
• Effect on prediction accuracy
• Conclusions and future directions

Gene Regulatory
Network (GRN)
The network of transcriptional dependencies among the genes of an organism,
formed by regulators known as transcription factors (TFs) and their binding sites.

[Figure: diagram of a transcription factor (TF) protein with Gene A and Gene B]


Gene Regulatory
Network (GRN)
• A gene regulatory network can be represented as a graph
G = (Vertices, Edges)
• Vertices = Genes
• Edges = Interactions
[Figure: example graph with nodes G1 to G8]
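A minimal sketch of the graph view above, assuming a directed graph stored as a vertex set plus an edge set of (regulator, target) pairs. The gene names follow the slide; the specific edges and the helper `targets_of` are illustrative assumptions, not read off the figure.

```python
# Vertices are genes; edges are directed (regulator, target) interactions.
vertices = {f"G{i}" for i in range(1, 9)}  # G1 .. G8, as on the slide

# Hypothetical edges, chosen only for illustration.
edges = {("G1", "G2"), ("G2", "G3"), ("G6", "G7"), ("G7", "G8")}


def targets_of(regulator, edges):
    """Return the set of genes directly regulated by `regulator`."""
    return {target for source, target in edges if source == regulator}


print(sorted(targets_of("G7", edges)))  # ['G8']
```

Storing edges as a set of pairs keeps membership tests cheap, which is the operation a pairwise classifier setup relies on most.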


Inference of Gene
regulatory networks
[Figure: network graph with nodes G1 to G8]
Gi = {e1, e2, e3, ..., en}


GRN
unsupervised inference
• Correlation models (e.g. mutual
information)
• Bayesian networks
• Boolean networks
• Ordinary differential equations (ODE)
• ...

GRN
supervised Inference
• Part of the network is known in advance
from public databases (e.g. RegulonDB)
[Figure: network graph with nodes G1 to G8]


GRN
supervised Inference
[Figure: two copies of the network graph with nodes G1 to G8, joined by a "+" sign]
Gi = {e1, e2, e3, ..., en}
T = {(G1, G2), (G2, G3), (G6, G7), (G7, G8)}

Binary classifier (SVM, Decision Tree, Neural Networks, ...)
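One way the inputs of such a binary classifier could be assembled is sketched below. It reads Gi as the expression profile of gene i and assumes a gene pair is encoded by concatenating the two profiles; that encoding, the toy values, and the name `pair_features` are assumptions made for illustration, not details given on the slide.

```python
from itertools import permutations

# Toy expression profiles Gi = {e1, ..., en}; the values are made up.
profiles = {
    "G1": [0.9, 0.1, 0.4],
    "G2": [0.8, 0.2, 0.5],
    "G3": [0.1, 0.9, 0.3],
}

# Known regulations (regulator, target), playing the role of the set T.
known = {("G1", "G2"), ("G2", "G3")}


def pair_features(regulator, target):
    """Assumed encoding: concatenate the two expression profiles."""
    return profiles[regulator] + profiles[target]


# One example per ordered gene pair: label 1 if known, else unlabeled (None).
dataset = [
    (pair_features(a, b), 1 if (a, b) in known else None)
    for a, b in permutations(sorted(profiles), 2)
]

n_positive = sum(1 for _, label in dataset if label == 1)
print(len(dataset), n_positive)  # 6 2
```

Pairs outside T are left unlabeled rather than marked negative, since the absence of a known regulation does not prove that two genes do not interact.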

Related work


• SIRENE approach
  • trains an SVM classifier for each gene and predicts
  which genes are regulated by that gene
  • combines all predicted regulations to obtain the full
  regulatory network
[Figure: a series of network graphs with nodes G1 to G8]
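The two steps of the approach (one local model per regulator, then a merge of all predictions) can be sketched as follows. This is not the SIRENE implementation: the SVM is swapped for a plain centroid-distance score so that the example needs no external library, and the profiles, `known_targets`, and `score_targets` are hypothetical.

```python
import math

# Toy expression profiles (made-up values).
profiles = {
    "G1": [1.0, 0.0], "G2": [0.9, 0.1], "G3": [0.8, 0.2],
    "G4": [0.0, 1.0], "G5": [0.1, 0.9],
}
# Known targets per regulator (hypothetical).
known_targets = {"TF_A": {"G1", "G2"}, "TF_B": {"G4"}}


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def score_targets(tf):
    """Local model for one regulator: score the remaining genes by closeness
    to the centroid of its known targets (stand-in for an SVM score)."""
    center = centroid([profiles[g] for g in known_targets[tf]])
    return {
        gene: -math.dist(vec, center)
        for gene, vec in profiles.items()
        if gene not in known_targets[tf]
    }


# Combine the per-regulator predictions into one ranked list of candidates.
ranked = sorted(
    ((score, tf, gene)
     for tf in known_targets
     for gene, score in score_targets(tf).items()),
    reverse=True,
)
best_score, best_tf, best_gene = ranked[0]
print(best_tf, best_gene)  # TF_B G5
```

Merging the local scores into a single ranking assumes they are comparable across regulators, which a real implementation would need to check or calibrate.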


[Figure: two panels comparing CLR, SIRENE and SIRENE-Bias. Left: ratio of true positives vs. ratio of false positives. Right: precision vs. recall.]

Method               Recall at 60% precision   Recall at 80% precision
SIRENE               44.5%                     17.6%
CLR                  7.5%                      5.5%
Relevance networks   4.7%                      3.3%
ARACNe               1%                        0%
Bayesian network     1%                        0%

Compared with unsupervised methods (Mordelet and Vert, 2008)
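For readers unfamiliar with the metric in the table, the sketch below shows one common way to compute recall at a fixed precision from a ranked list of predictions. The function name and the toy ranking are assumptions; the exact evaluation protocol of the cited comparison is not given on the slide.

```python
def recall_at_precision(ranked_labels, min_precision):
    """Largest recall reached at any cutoff whose precision >= min_precision.

    ranked_labels: 1/0 truth labels, ordered from most to least confident.
    """
    total_positives = sum(ranked_labels)
    if total_positives == 0:
        return 0.0
    best_recall = 0.0
    true_positives = 0
    for cutoff, label in enumerate(ranked_labels, start=1):
        true_positives += label
        precision = true_positives / cutoff
        if precision >= min_precision:
            best_recall = max(best_recall, true_positives / total_positives)
    return best_recall


# Made-up ranking: 1 = true regulation, 0 = false prediction.
ranking = [1, 1, 0, 1, 0, 0, 1, 0]
print(recall_at_precision(ranking, 0.8))  # 0.5
print(recall_at_precision(ranking, 0.6))  # 0.75
```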


[Figure: true positives (0 to 60) among the top N predictions (N = 0 to 400), supervised (SIRENE) vs. unsupervised (ARACNE)]
Prediction of new c-Myc regulations
True positives are validated with IPA
(www.ingenuity.com)

Supervised learning
[Figure: animation in five steps. Labeled positive (+) and negative (-) examples are scattered in feature space, a decision function f(x) is drawn between them, and two unknown (?) points are then labeled one by one according to f(x).]

Supervised learning with
unlabeled data
[Figure: animation in four steps. Starting from the labeled positive (+) and negative (-) examples, a growing number of unlabeled (?) examples is added; the last step shows a decision function f(x) and the label "PU-learning".]


Supervised learning of
gene regulatory networks
[Figure: network graph with nodes G1 to G8. Four known edges are labeled +1; two gene pairs without a known edge are each marked "Is this a negative example?"]


Training set
[Figure: the training set divided into Labeled (P) and Unlabeled (Q and N); P and Q are Positive, N is Negative]
% of Known Positives = |P| / |P ∪ Q|
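The ratio above can be reproduced in a simulation where all true positives are known and a chosen share of them is hidden. The sketch below assumes a random split; `split_positives`, the seed, and the gene pairs are hypothetical.

```python
import random


def split_positives(positives, known_fraction, seed=0):
    """Split all true positives into labeled P and hidden Q."""
    rng = random.Random(seed)
    shuffled = sorted(positives)
    rng.shuffle(shuffled)
    n_known = round(known_fraction * len(shuffled))
    return set(shuffled[:n_known]), set(shuffled[n_known:])


# Hypothetical positive gene pairs.
all_positives = {("TF1", f"G{i}") for i in range(10)}
P, Q = split_positives(all_positives, known_fraction=0.3)

percent_known = 100 * len(P) / len(P | Q)
print(percent_known)  # 30.0
```

Sweeping `known_fraction` from 0.1 to 1.0 gives the kind of x-axis used when performance is plotted against the percentage of known positives.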


[Figure: AUROC (0.5 to 1.0) vs. % of known positives (10 to 100)]

Effect of PU-learning
E.coli dataset [J.J. Faith et al., 2007]

Reliable negative
selection
[Figure: animation in three steps on the scatter plot of positive (+), negative (-) and unlabeled (?) examples, showing the PU-learning decision function f(x) and, in the last step, a second decision function f'(x)]

Reliable negative
selection in text mining
• B. Liu et al. Building Text Classifiers Using
Positive and Unlabeled Examples, in ICDM
2003
• Yu et al. PEBL: Positive Example Based
Learning for Web Page Classification Using
SVM, in KDD 2002
• Denis et al. Text classification from positive
and unlabeled Examples, in IPMU 2002


Methods based on
reliable negative selection
[Figure: the original training set, with P labeled and Q and N unlabeled, passes through a negative selection heuristic to give a new training set made of P and RN]


Quality of RN

• RN could be contaminated with positives
embedded in unlabeled data
• The fraction of positive contamination is the
ratio between the number of positives in RN
and the total number of unknown positives |Q|
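The contamination ratio defined above translates directly into a set computation. It can only be evaluated in a simulation, where the hidden positives Q are actually known; the function name and the gene pairs below are hypothetical.

```python
def positive_contamination(reliable_negatives, hidden_positives):
    """Number of hidden positives that ended up in RN, divided by |Q|."""
    if not hidden_positives:
        return 0.0
    return len(reliable_negatives & hidden_positives) / len(hidden_positives)


# Hypothetical gene pairs.
Q = {("TF1", "G1"), ("TF1", "G2"), ("TF2", "G3"), ("TF2", "G4")}
RN = {("TF1", "G2"), ("TF3", "G5"), ("TF3", "G6")}

print(positive_contamination(RN, Q))  # 0.25
```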


[Figure: AUROC (0.5 to 1.0) over x-axis ticks 10 to 100, comparing "positive contamination = 0" with PU-learning]

Effect of positive contamination

[Figure: F-Measure (0.0 to 0.6) over x-axis ticks 10 to 100, with a PU-learning curve]




Network topology
based heuristics


Network motifs

Network motifs are small connected
subnetworks that occur in a network
significantly more or less often than
would be expected just by chance

[Figure: example motifs drawn on nodes A to E]


B. Goemann, E. Wingender, and A. P. Potapov, “An approach to evaluate the
topological significance of motifs and other patterns in regulatory networks,”
BMC Systems Biology, vol. 3, no. 53, May 2009.

S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, “Network motifs in the
transcriptional regulation network of Escherichia coli,” Nature Genetics, vol. 31,
no. 1, pp. 64–68, May 2002.


Network Motifs
Heuristic

• For each three-gene subnetwork T:
• If T matches a network motif M, then
consider all connections not present in M
as negatives
[Figure: three-gene subnetwork with nodes A, B and C]
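The heuristic above can be sketched as a scan over all three-gene subnetworks. Two points are assumptions made here, since the slide does not specify them: the motif is taken to be a feed-forward loop (x regulates y and z, y regulates z), and "matches" means the known edges among the three genes are exactly the motif edges. All names and edges are hypothetical.

```python
from itertools import permutations

# Known (positive) regulations; hypothetical example.
known_edges = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")}
genes = {"A", "B", "C", "D"}


def motif_edges(x, y, z):
    """Assumed motif: feed-forward loop x -> y, x -> z, y -> z."""
    return {(x, y), (x, z), (y, z)}


def motif_negatives(genes, known_edges):
    """Collect connections absent from a matched motif as negatives."""
    negatives = set()
    for x, y, z in permutations(sorted(genes), 3):
        all_pairs = set(permutations((x, y, z), 2))  # 6 directed pairs
        present = all_pairs & known_edges
        if present == motif_edges(x, y, z):  # exact match assumed
            negatives |= all_pairs - present
    return negatives


print(sorted(motif_negatives(genes, known_edges)))
# [('B', 'A'), ('C', 'A'), ('C', 'B')]
```

The scan is cubic in the number of genes, so for a genome-scale network it would have to be restricted to triples that share at least one known edge.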


[Figure: AUROC (0.5 to 1.0) over x-axis ticks 10 to 100, comparing MOTIF with PU-learning]

MOTIF selection performance
E.coli dataset [J.J. Faith et al., 2007 and RegulonDB]

[Figure: F-Measure (0.0 to 0.6) over x-axis ticks 10 to 100, comparing MOTIF with PU-learning]



Scale free networks

Albert-László Barabási and Zoltán N. Oltvai
Network biology: Understanding the cell’s functional organization
Nature Reviews Genetics 5, 101-113 (2004)


Hierarchical networks

Hong-Wu Ma, Jan Buer, and An-Ping Zeng
Hierarchical structure and modules in the Escherichia coli transcriptional
regulatory network revealed by a new top-down approach
BMC Bioinformatics 2004 5:199


Experimental data
• 445 Affymetrix Antisense2 microarray
expression profiles for 4345 genes of E.coli
[J.J. Faith et al., 2007]
• Data were standardized (i.e. zero mean and unit
standard deviation)
• Regulations extracted from RegulonDB (v. 5)
between 154 Transcription Factors and
1211 genes
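The standardization step mentioned above can be sketched for a single expression profile as follows. Whether the original data were standardized per gene or per experiment, and with the population or the sample standard deviation, is not stated; this sketch assumes per-profile scaling with the population standard deviation, and the values are made up.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale one expression profile to zero mean and unit std deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant profile: nothing to scale
    return [(v - mu) / sigma for v in values]


profile = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up expression values
z = standardize(profile)
print(z[0], z[-1])  # -1.5 2.0
```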


Summary and
conclusions
• Learning gene regulations is affected by the problem
of learning from positive-only data
• At least for E.coli
• The study of positive contamination shows that
there is room for new heuristics
• Topology-based heuristics (e.g. motifs) have shown
promising results
• Open issues arise for higher-level organisms, where
gene interactions are more complex


Reweighting strategy
(PosOnly)
Cerulo et al. BMC Bioinformatics 2010, 11:228
http://www.biomedcentral.com/1471-2105/11/228

Research article, Open Access

Learning gene regulatory networks from only
positive and unlabeled data
Luigi Cerulo, Charles Elkan and Michele Ceccarelli
Abstract
Background: Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks
from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair
of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs.
This approach has been proven to outperform previous unsupervised methods. However, the supervised approach
raises open questions. In particular, although known regulatory connections can safely be assumed to be positive
training examples, obtaining negative examples is not straightforward, because definite knowledge is typically not
available that a given pair of genes do not interact.
Results: A recent advance in research on data mining is a method capable of learning a classifier from only positive and
unlabeled examples, that does not need labeled negative examples. Applied to the reconstruction of gene regulatory
networks, we show that this method significantly outperforms the current state of the art of machine learning
methods. We assess the new method using both simulated and experimental data, and obtain major performance
improvement.

PosOnly: How it works
Let x be a random example
[Figure: the training set divided into Labeled (s=1), containing P, and Unlabeled (s=0), containing Q and N; P and Q are Positive (y=1), N is Negative (y=0)]

s=1 iff x is labeled, s=0 iff x is unlabeled
y=1 iff x is positive, y=0 iff x is negative
If s=1 then y=1 (the converse is not always true!)
p(s=1|x,y=0) = 0

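[Figure: the training set divided into Labeled (s=1), containing P, and Unlabeled (s=0), containing Q and N; P and Q are Positive (y=1), N is Negative (y=0)]

• The goal is to learn a classifier such that:
f(x) = p(y=1|x)

• It is easy to see that (Elkan and Noto, 2008):
f(x) = p(s=1|x)/p(s=1|y=1)
= p(s=1 and y=1|x)/p(s=1|y=1)
= p(s=1|y=1,x)p(y=1|x)/p(s=1|y=1)
= p(y=1|x)
The second line holds because s=1 implies y=1; the last line assumes that
labeled positives are selected completely at random, i.e. p(s=1|y=1,x) = p(s=1|y=1)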


f(x) = p(s=1|x) / p(s=1|y=1)
• p(s=1|x): binary classifier trained with
labeled and unlabeled examples
• p(s=1|y=1): unknown constant estimated
empirically in a number of ways
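A minimal sketch of how the formula could be applied once a classifier g(x) estimating p(s=1|x) is available. The constant is estimated here as the mean of g over held-out labeled examples, which is one of the estimators proposed by Elkan and Noto (2008); the clipping to 1.0, the function names, and the scores are assumptions added for illustration.

```python
def estimate_c(g_scores_of_labeled):
    """Estimate c = p(s=1|y=1) as the mean of g(x) = p(s=1|x)
    over held-out labeled (hence positive) examples."""
    return sum(g_scores_of_labeled) / len(g_scores_of_labeled)


def posonly_probability(g_x, c):
    """f(x) = p(s=1|x) / p(s=1|y=1), clipped to a valid probability."""
    return min(1.0, g_x / c)


# Made-up outputs of a classifier g trained on labeled vs. unlabeled examples.
g_on_held_out_positives = [0.5, 0.25, 0.75]
c = estimate_c(g_on_held_out_positives)  # 0.5

print(posonly_probability(0.25, c))  # 0.5
print(posonly_probability(0.75, c))  # 1.0 (clipped)
```

Because c is at most 1, dividing by it can only raise the raw score, which reflects that an unlabeled example may still be a positive.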


Results: experimental data
[Figure: mean of F-Measure vs. % of Known Positives]
