Learning gene regulations with only positive examples

  1. On Learning Gene Regulatory Networks with Only Positive Examples
     Luigi Cerulo, University of Sannio, Italy
     Michele Ceccarelli, University of Sannio, Italy
     Thursday, September 30, 2010
  2. Outline
     • Supervised inference of gene regulatory networks
     • The positive only problem
     • Negative selection approaches
     • Effect on prediction accuracy
     • Conclusions and future directions
  3. Gene Regulatory Network (GRN)
     The network of transcription dependences among the genes of an organism: the genes that encode transcription factors (TFs) and the genes whose binding sites those factors target.
     [Figure: a TF protein, gene A and gene B, shown next to the corresponding network nodes Gene A and Gene B.]
  4. Gene Regulatory Network (GRN)
     • A gene regulatory network can be represented as a graph G = (Vertices, Edges)
     • Vertices = Genes
     • Edges = Interactions
     [Figure: example network over the genes G1 to G8.]
  5. Inference of gene regulatory networks
     Gi = {e1, e2, e3, ..., en}
     [Figure: example network over the genes G1 to G8.]
  6. GRN unsupervised inference
     • Gene relevance (e.g. mutual information)
     • Bayesian networks
     • Boolean networks
     • ODE
     • ...
  7. GRN supervised inference
     • Part of the network is known in advance from public databases (e.g. RegulonDB)
     [Figure: example network over the genes G1 to G8.]
  8. GRN supervised inference
     Gi = {e1, e2, e3, ..., en}
     T = {(G1, G2), (G2, G3), (G6, G7), (G7, G8)}
     Binary classifier (SVM, decision tree, neural network, ...)
     [Figure: two copies of the example network over the genes G1 to G8, joined by a + sign.]
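A minimal sketch of how the inputs on slide 8 could be turned into examples for a binary classifier. The slide does not say how a gene pair is represented; concatenating the two expression profiles is an assumption made here for illustration, and the toy expression values and the helper names (`pair_features`, `build_examples`) are hypothetical.

```python
from itertools import permutations

# Toy expression profiles Gi = {e1, ..., en}; the numbers are invented.
expression = {
    "G1": [0.9, 0.1, 0.4],
    "G2": [0.8, 0.2, 0.5],
    "G3": [0.1, 0.9, 0.3],
    "G6": [0.2, 0.7, 0.6],
}

# Known regulations: the set T from the slide, restricted to the toy genes.
known = {("G1", "G2"), ("G2", "G3")}


def pair_features(regulator, target):
    """Represent an ordered gene pair by concatenating both profiles (assumption)."""
    return expression[regulator] + expression[target]


def build_examples(genes, known_pairs):
    """Return (pair, features, label) for every ordered gene pair.

    label is 1 for a known regulation and None when nothing is known,
    because an unlisted pair is not necessarily a negative example.
    """
    examples = []
    for regulator, target in permutations(genes, 2):
        label = 1 if (regulator, target) in known_pairs else None
        examples.append(((regulator, target), pair_features(regulator, target), label))
    return examples


examples = build_examples(sorted(expression), known)
positives = [pair for pair, _, label in examples if label == 1]
unlabeled = [pair for pair, _, label in examples if label is None]
```

Leaving the label of unlisted pairs as `None` rather than 0 keeps the distinction between "negative" and "unknown" explicit in the data.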
  9. Related work
  10. SIRENE approach
      • Trains an SVM classifier for each gene and predicts which genes are regulated by that gene
      • Combines all predicted regulations to obtain the full regulatory network
      [Figure: the example network over the genes G1 to G8, repeated once per classifier.]
  11. SIRENE compared with unsupervised methods (Mordelet and Vert, 2008)
      [Figure, two panels: ratio of true positives against ratio of false positives, and precision against recall, for CLR, SIRENE and SIRENE-Bias; all axes run from 0 to 1.]

      Method               Recall at 60% precision   Recall at 80% precision
      SIRENE               44.5%                     17.6%
      CLR                  7.5%                      5.5%
      Relevance networks   4.7%                      3.3%
      ARACNe               1%                        0%
      Bayesian network     1%                        0%
  12. Top N predictions of new c-Myc regulations
      [Figure: true positives (0 to 60) against the top N predictions (0 to 400), for supervised (SIRENE) and unsupervised (ARACNE) inference.]
      True positives are validated with IPA (www.ingenuity.com).
  13-17. Supervised learning
      [Figure, built up over five slides: a scatter of examples labeled positive (+) or negative (-); a decision function f(x) is learned from them; two new examples (?) are then labeled by f(x), one as negative and one as positive.]
  18-21. Supervised learning with unlabeled data
      [Figure, built up over four slides: the same scatter, in which a growing number of examples are unlabeled (?); a decision function f(x) is then learned from the labeled and unlabeled examples (PU-learning).]
  22. Supervised learning of gene regulatory networks
      [Figure: the example network over the genes G1 to G8 with known regulations labeled +1; two unconnected gene pairs are annotated "Is this a negative example?"]
  23. Training set
      • Labeled (s=1): P
      • Unlabeled (s=0): Q and N
      • Positive (y=1): P and Q
      • Negative (y=0): N
      % of known positives = |P| / |P ∪ Q|
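The split on slide 23 can be simulated when the true positives are known, by hiding part of them. The sketch below is one way to do that under stated assumptions: the function names, the random hiding strategy and the seed are illustrative and not taken from the slides.

```python
import random


def split_known_positives(true_positives, fraction_known, seed=0):
    """Keep a fraction of the true positives as the labeled set P and hide the rest as Q."""
    rng = random.Random(seed)
    shuffled = sorted(true_positives)  # sort first so the result depends only on the seed
    rng.shuffle(shuffled)
    n_known = round(fraction_known * len(shuffled))
    return set(shuffled[:n_known]), set(shuffled[n_known:])


def percent_known_positives(P, Q):
    """|P| / |P ∪ Q|, expressed as a percentage."""
    return 100.0 * len(P) / len(P | Q)


# Ten invented regulations (TF index, gene index); 30% stay labeled.
all_positives = {(tf, gene) for tf in range(2) for gene in range(5)}
P, Q = split_known_positives(all_positives, 0.3)
```

The unlabeled set used for training would then be Q together with the negatives N, with no indication of which is which.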
  24. Effect of PU-learning
      [Figure: AUROC (0.5 to 1.0) against % of known positives (10 to 100).]
      E.coli dataset [J.J. Faith et al., 2007]
  25-27. Reliable negative selection
      [Figure, built up over three slides: the PU-learning scatter with its decision function f(x); part of the unlabeled examples (?) is removed from the scatter; a new decision function f'(x) is learned.]
  28. Reliable negative selection in text mining
      • B. Liu et al. Building Text Classifiers Using Positive and Unlabeled Examples, in ICDM 2003
      • Yu et al. PEBL: Positive Example Based Learning for Web Page Classification Using SVM, in KDD 2002
      • Denis et al. Text Classification from Positive and Unlabeled Examples, in IPMU 2002
  29. Methods based on reliable negative selection
      • Original training set: P (labeled); Q and N (unlabeled)
      • A negative selection heuristic is applied to the unlabeled examples
      • New training set: P and RN (the reliable negatives)
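The two-step scheme on slide 29 can be sketched as follows. The slide does not fix a particular heuristic, so the one used here (take the unlabeled examples farthest from the centroid of the positives) and the nearest-centroid classifier in the second step are stand-ins chosen only to keep the example short; all function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_reliable_negatives(P, U, k):
    """Step 1: pick the k unlabeled vectors farthest from the positive centroid."""
    c = centroid(P)
    return sorted(U, key=lambda x: math.dist(x, c), reverse=True)[:k]


def train_nearest_centroid(P, RN):
    """Step 2: train on P versus RN; here a nearest-centroid rule stands in for the classifier."""
    cp, cn = centroid(P), centroid(RN)

    def f(x):
        return 1 if math.dist(x, cp) <= math.dist(x, cn) else 0

    return f


P = [[1.0, 1.0], [0.9, 1.1]]
U = [[1.0, 0.9], [0.0, 0.0], [0.1, -0.1], [0.8, 1.0]]
RN = select_reliable_negatives(P, U, k=2)
f = train_nearest_centroid(P, RN)
```

Any classifier that accepts a positive and a negative set could replace the second step; only the selection of RN is specific to this family of methods.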
  30. Quality of RN
      • RN could be contaminated with positives embedded in the unlabeled data
      • The fraction of positive contamination is the ratio between the number of positives in RN and the total number of unknown positives: |RN ∩ Q| / |Q|
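As a small worked example of the ratio defined on slide 30, the sketch below computes the positive contamination of a candidate RN. It can only be evaluated when Q is known, as in a simulation; the function name and the toy gene pairs are illustrative.

```python
def positive_contamination(RN, Q):
    """Fraction of the unknown positives Q that ended up in RN: |RN ∩ Q| / |Q|."""
    if not Q:
        raise ValueError("Q must contain at least one unknown positive")
    return len(RN & Q) / len(Q)


# Four hidden positives; one of them was wrongly selected as a reliable negative.
Q = {("G1", "G2"), ("G2", "G3"), ("G6", "G7"), ("G7", "G8")}
RN = {("G1", "G2"), ("G4", "G5"), ("G5", "G8")}
contamination = positive_contamination(RN, Q)
```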
  31. Effect of positive contamination
      [Figure: AUROC (0.5 to 1.0) against positive contamination (0 to 1), with one curve for each % of known positives from 10% to 100%.]
      E.coli dataset [J.J. Faith et al., 2007]
  32. Area of improvement
      [Figure: AUROC (0.5 to 1.0) against % of known positives (10 to 100), with one curve for positive contamination = 0 and one for positive contamination = 1 (PU-learning).]
      E.coli dataset [J.J. Faith et al., 2007]
  33. Network topology based heuristics
  34. Network motifs
      Network motifs are small connected subnetworks that occur in a network significantly more or less often than would be expected by chance.
      [Figure: example subnetworks with nodes labeled A to E.]
  35. B. Goemann, E. Wingender, and A. P. Potapov, “An approach to evaluate the topological significance of motifs and other patterns in regulatory networks,” BMC Systems Biology, vol. 3, no. 53, May 2009.
      S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, “Network motifs in the transcriptional regulation network of Escherichia coli,” Nature Genetics, vol. 31, no. 1, pp. 64–68, May 2002.
  36-39. Network motifs heuristic
      • For each three-gene subnetwork T:
        • If T matches a network motif M, consider all connections not present in M as negatives
      [Figure, built up over four slides: a three-gene subnetwork with the nodes A, B and C.]
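The heuristic on slides 36-39 could be sketched as below. Several choices are assumptions, since the slides do not state them: the motif used is the feed-forward loop (a motif reported for E. coli by Shen-Orr et al.), a triple "matches" when its known edges are exactly the motif's edges, and the function and variable names are hypothetical.

```python
from itertools import permutations

# Feed-forward loop over the roles (0, 1, 2): 0 -> 1, 0 -> 2, 1 -> 2.
FEED_FORWARD_LOOP = {(0, 1), (0, 2), (1, 2)}


def motif_negatives(genes, known_edges, motif=FEED_FORWARD_LOOP):
    """Collect directed gene pairs assumed negative because they are absent from a matched motif."""
    roles = [(i, j) for i in range(3) for j in range(3) if i != j]
    negatives = set()
    for triple in permutations(genes, 3):
        # Known edges inside this triple, expressed in role positions.
        present = {(i, j) for i, j in roles if (triple[i], triple[j]) in known_edges}
        if present == motif:
            # Every connection of the triple that the motif lacks becomes a negative.
            negatives |= {(triple[i], triple[j]) for i, j in roles if (i, j) not in motif}
    return negatives


genes = ["A", "B", "C", "D"]
known_edges = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")}
negatives = motif_negatives(genes, known_edges)
```

Enumerating all ordered triples is cubic in the number of genes, which is fine for a toy network but would need pruning (for example, starting from known edges) on a genome-scale one.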
  40. MOTIF selection performance: AUROC
      [Figure: AUROC against % of known positives.]
      E.coli dataset [J.J. Faith et al., 2007 and RegulonDB]
  41. MOTIF selection performance: F-measure
      [Figure: F-measure against % of known positives.]
      E.coli dataset [J.J. Faith et al., 2007 and RegulonDB]
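For readers who want the two measures on slides 40 and 41 spelled out, here is a short sketch using their standard definitions: AUROC as the probability that a random positive is scored above a random negative, and the F-measure as the harmonic mean of precision and recall. The toy scores and labels are invented.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f_measure(predicted, labels):
    """Harmonic mean of precision and recall for binary predictions."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = [1, 0, 1, 0]
area = auroc([0.9, 0.8, 0.3, 0.1], labels)
f1 = f_measure([1, 1, 0, 0], labels)
```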
  42. Scale free networks
      Albert-László Barabási and Zoltán N. Oltvai, “Network biology: Understanding the cell’s functional organization,” Nature Reviews Genetics 5, 101-113 (2004).
  43. Hierarchical networks
      Hong-Wu Ma, Jan Buer, and An-Ping Zeng, “Hierarchical structure and modules in the Escherichia coli transcriptional regulatory network revealed by a new top-down approach,” BMC Bioinformatics 2004, 5:199.
  44. Experimental data
      • 445 Affymetrix Antisense2 microarray expression profiles for 4345 genes of E. coli [J.J. Faith et al., 2007]
      • Data were standardized (i.e. zero mean and unit standard deviation)
      • Regulations extracted from RegulonDB (v. 5) between 154 transcription factors and 1211 genes
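The standardization step mentioned on slide 44 amounts to a z-score. The sketch below applies it to one expression profile; whether the original study standardized per gene or per experiment, and whether it used the population or the sample standard deviation, is not stated on the slide, so the per-profile, population-deviation choice here is an assumption.

```python
from statistics import mean, pstdev


def standardize(profile):
    """Rescale a profile to zero mean and unit (population) standard deviation."""
    mu = mean(profile)
    sigma = pstdev(profile)
    if sigma == 0:
        # A constant profile carries no signal; map it to zeros instead of dividing by zero.
        return [0.0] * len(profile)
    return [(value - mu) / sigma for value in profile]


z = standardize([1.0, 2.0, 3.0, 4.0])
```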
  45. Summary and conclusions
      • Learning gene regulations is affected by the problem of learning from positive only data
        • At least for E. coli
      • The study of positive contamination shows that there is room for new heuristics
      • Topology based heuristics (e.g. motifs) have shown promising results
      • Open issues arise for higher-level organisms, where gene interactions are more complex
  46. Re-weighting strategy (PosOnly)
      Luigi Cerulo, Charles Elkan and Michele Ceccarelli, “Learning gene regulatory networks from only positive and unlabeled data,” BMC Bioinformatics 2010, 11:228. http://www.biomedcentral.com/1471-2105/11/228

      Abstract
      Background: Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This approach has been proven to outperform previous unsupervised methods. However, the supervised approach raises open questions. In particular, although known regulatory connections can safely be assumed to be positive training examples, obtaining negative examples is not straightforward, because definite knowledge is typically not available that a given pair of genes do not interact.
      Results: A recent advance in research on data mining is a method capable of learning a classifier from only positive and unlabeled examples, that does not need labeled negative examples. Applied to the reconstruction of gene regulatory networks, we show that this method significantly outperforms the current state of the art of machine learning methods. We assess the new method using both simulated and experimental data, and obtain major performance improvement.
  47. PosOnly: How it works
      Let x be a random example.
      • s=1 iff x is labeled, s=0 iff x is unlabeled
      • y=1 iff x is positive, y=0 iff x is negative
      • If s=1 then y=1 (the converse is not always true!)
      • p(s=1|x,y=0) = 0
      [Diagram: labeled (s=1) set P; unlabeled (s=0) sets Q and N; positives (y=1) P and Q; negatives (y=0) N.]
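  48. PosOnly: How it works
      • The goal is to learn a classifier such that f(x) = p(y=1|x)
      • It is easy to see that (Elkan and Noto, 2008):
          f(x) = p(s=1|x) / p(s=1|y=1)
               = p(s=1 and y=1|x) / p(s=1|y=1)
               = p(s=1|y=1,x) p(y=1|x) / p(s=1|y=1)
               = p(y=1|x)
        The second line holds because s=1 implies y=1. The last line assumes that the labeled positives are chosen at random from all positives, so that p(s=1|y=1,x) = p(s=1|y=1).
      [Diagram: labeled (s=1) set P; unlabeled (s=0) sets Q and N; positives (y=1) P and Q; negatives (y=0) N.]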
  49. PosOnly: How it works
      f(x) = p(s=1|x) / p(s=1|y=1)
      • p(s=1|x): binary classifier trained with labeled and unlabeled examples
      • p(s=1|y=1): unknown constant, estimated empirically in a number of ways
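One of the empirical estimates of the constant p(s=1|y=1) described by Elkan and Noto (2008) is the average output of the labeled-versus-unlabeled classifier over labeled positives held out for validation. The sketch below assumes those classifier outputs are already available as plain numbers; the function names, the clipping to 1 and the toy scores are illustrative choices.

```python
def estimate_c(g_on_labeled_validation):
    """Estimate c = p(s=1|y=1) as the mean of g(x) = p(s=1|x) over held-out labeled positives."""
    return sum(g_on_labeled_validation) / len(g_on_labeled_validation)


def posonly_probability(g_x, c):
    """f(x) = g(x) / c, clipped to 1 because the estimate of c can be too small."""
    return min(1.0, g_x / c)


# Invented outputs of the labeled-vs-unlabeled classifier on three held-out labeled positives.
c = estimate_c([0.5, 0.3, 0.4])
f_x = posonly_probability(0.2, c)
```

Because c is a constant, dividing by it does not change the ranking of gene pairs; it only rescales the scores so that they can be read as p(y=1|x).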
  50. Results: experimental data
      [Figure: mean of F-measure against % of known positives.]