Learning gene regulations with only positive examples
Upcoming SlideShare
Loading in...5
×
 

Learning gene regulations with only positive examples

on

  • 1,483 views

 

Statistics

Views

Total Views
1,483
Views on SlideShare
1,393
Embed Views
90

Actions

Likes
0
Downloads
10
Comments
0

3 Embeds 90

http://rcost.unisannio.it 47
http://www.rcost.unisannio.it 41
http://www.dsba.unisannio.it 2

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Learning gene regulations with only positive examples Learning gene regulations with only positive examples Presentation Transcript

  • On Learning Gene Regulatory Networks with Only Positive Examples Luigi Cerulo, University of Sannio, Italy Michele Ceccarelli, University of Sannio, Italy Thursday, September 30, 2010
  • Outline • Supervised inference of gene regulatory networks • The positive only problem • Negative selection approaches • Effect on prediction accuracy • Conclusions and future directions Thursday, September 30, 2010
  • Gene Regulatory Network (GRN) The network of transcription dependences among genes of an organism, known as transcription factors, and their binding site. TF protein TF Gene A Gene B gene A gene B Thursday, September 30, 2010
  • Gene Regulatory Network (GRN) • A gene regulatory G2 network can be G1 represented as a graph G = (Vertices, Edges) G6 G3 • Vertices = Genes G7 • Edges = Interactions G5 G4 G8 Thursday, September 30, 2010
  • Inference of Gene regulatory networks G2 G1 G6 G3 G7 G5 G4 G8 Gi = {e1 , e2 , e3 , . . . , en } Thursday, September 30, 2010
  • GRN unsupervised inference • Gene relevance (eg. Mutual information) • Bayesian Network • Boolean networks • ODE • ... Thursday, September 30, 2010
  • GRN supervised Inference G2 G1 • Part of the network is known in advance G6 G3 from public databases G7 (Eg. RegulonDB) G5 G4 G8 Thursday, September 30, 2010
  • GRN supervised Inference G2 G2 G1 G1 + G6 G3 G6 G3 G7 G7 G5 G5 G4 G4 G8 G8 Gi = {e1 , e2 , e3 , . . . , en } T = {(G1 , G2 ), (G2 , G3 ), (G6 , G7 ), (G7 , G8 )} Binary classifier (SVM, Decision Tree, Neural Networks,...) Thursday, September 30, 2010
  • Related work Thursday, September 30, 2010
  • • SIRENE approach • trains an SVM classifier for each gene and predicts which genes are regulated by that gene • combines all predicted regulations to obtain the full regulatory network G2 G2 G2 G1 G1 G1 ... G6 G3 G6 G3 G6 G3 G7 G7 G7 G5 G5 G5 G4 G4 G4 G8 G8 G8 Thursday, September 30, 2010
  • 1 1 CLR SIRENE 0.8 0.8 SIRENE−Bias Ratio of true positives 0.6 0.6 Precision Precision 0.4 0.4 0.2 CLR 0.2 SIRENE SIRENE−Bias 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Ratio of false positives Recall Method Recall at 60% Recall at 80% SIRENE 44.5% 17.6% CLR 7.5% 5.5% Relevance networks 4.7% 3.3% ARACNe 1% 0% Bayesian network 1% 0% Compared with unsupervised methods (Mordelet and Vert, 2008) Thursday, September 30, 2010
  • 60 supervised (SIRENE) 45 True Positives 30 unsupervised (ARACNE) 15 0 0 100 200 300 400 Top N prediction of new c-Myc regulations True positives are validated with IPA (www.ingenuity.com) Thursday, September 30, 2010
  • Supervised learning + + - + + + + + + - + + - - - - + - - - + - Thursday, September 30, 2010
  • Supervised learning + + - + + + + + + - + + - - - - + - - - + - f(x) Thursday, September 30, 2010
  • Supervised learning + + - + + + + - ?+ + + - + - - - + - - ? - + - f(x) Thursday, September 30, 2010
  • Supervised learning + + - + + + + - ?+ + + - + - - - + - - - - + - f(x) Thursday, September 30, 2010
  • Supervised learning + + - + + + + - ++ + + - + - - - + - - - - + - f(x) Thursday, September 30, 2010
  • Supervised learning with unlabeled data + + - + + + + + + - + + - - - - + - - - + - Thursday, September 30, 2010
  • Supervised learning with unlabeled data + + ? - + + + ? + + + - + + - ? - - - +? - - - + - Thursday, September 30, 2010
  • Supervised learning with unlabeled data + + ? - ? + + + + ? + + - + ? + - ? ? - - ? ? - ? + ? - ? - ? - ? + - ? Thursday, September 30, 2010
  • Supervised learning with unlabeled data + + ? - ? + + + + ? + + - + ? + f(x) - ? ? - - ? ? - ? + ? PU-learning - ? - ? - ? + - ? Thursday, September 30, 2010
  • Supervised learning of gene regulatory networks +1 G2 G1 +1 G6 G3 +1 G7 Is this a negative G5 example? +1 G4 G8 Is this a negative example? Thursday, September 30, 2010
  • Training set Labeled Unlabeled s=1 s=0 P Q N Positive Negative y=1 y=0 |P | % of Known Positives |P ∪ Q| Thursday, September 30, 2010
  • 1.0 0.9 0.8 AUROC 0.7 0.6 0.5 10 20 30 40 50 60 70 80 90 100 % of known positives Effect of PU-learning E.coli dataset [J.J. Faith et al., 2007] Thursday, September 30, 2010
  • Reliable negative selection + + ? - ? + + + + ? + + - + ? + f(x) -? ? - - ? ? - ? + ? PU-learning - ? - ? - ? + - ? Thursday, September 30, 2010
  • Reliable negative selection + + ? - + + + ?+ + + - + + f(x) -? ? - - - ? + ? PU-learning - - ? - + -? Thursday, September 30, 2010
  • Reliable negative selection + + ? - + + + ?+ + + - + + f(x) -? ? - - - ? + ? PU-learning - - ? - + -? f’(x) Thursday, September 30, 2010
  • Reliable negative selection in text mining • B. Liu et al. Building Text Classifiers Using Positive and Unlabeled Examples, in ICDM 2003 • Yu et al. PEBL: Positive Example Based Learning for Web Page Classification Using SVM, in KDD 2002 • Denis et al. Text classification from positive and unlabeled Examples, in IPMU 2002 Thursday, September 30, 2010
  • Methods based on reliable negative selection Labeled Unlabeled Original training set P Q N Negative selection heuristic New training set P RN Thursday, September 30, 2010
  • Quality of RN RN • RN could be contaminated with positives embedded in unlabeled data • The fraction of positive contamination is the ratio between the number of positives in RN and the total number of unknown positives |Q| Thursday, September 30, 2010
  • 1.0 0.9 0.8 AUROC 0.7 0.6 % of Known Positives 10 % 30 % 50 % 70 % 90 % 20 % 40 % 60 % 80 % 100 % 0.5 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 positive contamination Effect of positive contamination E.coli dataset [J.J. Faith et al., 2007] Thursday, September 30, 2010
  • 1.0 positive contamination = 0 0.9 0.8 AUROC 0.7 positive contamination = 1 (PU-learning) 0.6 0.5 10 20 30 40 50 60 70 80 90 100 % of known positives Area of improvement E.coli dataset [J.J. Faith et al., 2007] Thursday, September 30, 2010
  • Network topology based heuristics Thursday, September 30, 2010
  • Network motifs Network motifs are small connected subnetworks that a network exhibit in a significant higher or lower occurrences than would be expected just by chance A A B C B C D E Thursday, September 30, 2010
  • B. Goemann, E. Wingender, and A. P. Potapov, “An approach to evaluate the topological significance of motifs and other patterns in regulatory networks.” BMC System Biology, vol. 3, no. 53, May 2009. S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, “Network motifs in the transcriptional regulation network of escherichia coli,” Nature Genetics, vol. 31, no. 1, pp. 64–68, May 2002. Thursday, September 30, 2010
  • Network Motifs Heuristic • For each three genes sub networks T: • If matches a network motifs M then considers all connections not present in M as negatives B A C Thursday, September 30, 2010
  • Network Motifs Heuristic • For each three genes sub networks T: • If matches a network motifs M then considers all connections not present in M as negatives B A C Thursday, September 30, 2010
  • Network Motifs Heuristic • For each three genes sub networks T: • If matches a network motifs M then considers all connections not present in M as negatives B A C Thursday, September 30, 2010
  • Network Motifs Heuristic • For each three genes sub networks T: • If matches a network motifs M then considers all connections not present in M as negatives B A C Thursday, September 30, 2010
  • AUROC % of known positives MOTIF selection performance E.coli dataset [J.J. Faith et al., 2007 and RegulonDB] Thursday, September 30, 2010
  • F-measure % of known positives MOTIF selection performance E.coli dataset [J.J. Faith et al., 2007 and RegulonDB] Thursday, September 30, 2010
  • Scale free networks Albert-László Barabási and Zoltán N. Oltvai Network biology: Understanding the cell’s functional organization Nature Reviews Genetics 5, 101-113 (2004) Thursday, September 30, 2010
  • Hierarchical networks Hong-Wu Ma, Jan Buer, and An-Ping Zeng Hierarchical structure and modules in the Escherichia coli transcriptional regulatory network revealed by a new top-down approach BMC Bioinformatics 2004 5:199 Thursday, September 30, 2010
  • Experimental data • 445 Affymetrix Antisense2 microarray expression profiles for 4345 genes of E.coli [J.J. Faith et al., 2007] • Data were standardized (i.e. zero mean unit standard deviation) • Regulations extracted from RegulonDB (v. 5) between 154 Transcription Factors and 1211 genes Thursday, September 30, 2010
  • Summary and conclusions • Learning gene regulations is affected by the problem of learning from positive only data • At least for E.coli • The study of positive contamination shows that there is room for new heuristics • Topology based heuristics (eg. motifs) have shown promising results. • Open issues arise on higher level organisms where gene interactions are more complex Thursday, September 30, 2010
  • Re weighting strategy (PosOnly) Cerulo et al. BMC Bioinformatics 2010, 11:228 http://www.biomedcentral.com/1471-2105/11/228 RESEARCH ARTICLE Open Access Learning gene regulatory networks from only Research article positive and unlabeled data Luigi Cerulo*1,2, Charles Elkan3 and Michele Ceccarelli1,2 Abstract Background: Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This approach has been proven to outperform previous unsupervised methods. However, the supervised approach raises open questions. In particular, although known regulatory connections can safely be assumed to be positive training examples, obtaining negative examples is not straightforward, because definite knowledge is typically not available that a given pair of genes do not interact. Results: A recent advance in research on data mining is a method capable of learning a classifier from only positive and unlabeled examples, that does not need labeled negative examples. Applied to the reconstruction of gene regulatory networks, we show that this method significantly outperforms the current state of the art of machine learning methods. We assess the new method using both simulated and experimental data, and obtain major performance Thursday, September 30, 2010 improvement.
  • PosOnly: How it works Labeled Unlabeled s=1 s=0 Let x be a P Q N random example Positive Negative y=1 y=0 s=1 iff x is labeled, s=0 iff x is unlabeled y=1 iff x is positive, y=0 iff x is negative If s=1 then y=1 (the contrary is not always true!) p(s=1|x,y=0) = 0 Thursday, September 30, 2010
  • PosOnly: How it works Labeled Unlabeled s=1 s=0 • The goal is to learn P Q N a classifier such that: f(x) = p(y=1|x) Positive y=1 Negative y=0 • It is easy to see that (Elkan and Noto, 2008): f(x) = p(s=1|x)/p(s=1|y=1) = p(s=1 and y=1|x)/p(s=1|y=1) = p(s=1|y=1,x)p(y=1|x)/p(s=1|y=1) = p(y=1|x) Thursday, September 30, 2010
  • PosOnly: How it works binary classifier trained with labeled and unlabeled examples p(s = 1|x) f (x) = p(s = 1|y = 1) unknown constant estimated empirically in a number of ways Thursday, September 30, 2010
  • Mean of F-Measure % of Known Positives Results: experimental data Thursday, September 30, 2010