Large scale machine learning challenges for systems biology
Yvan Saeys
Bioinformatics and Evolutionary Genomics (BEG), Department of Plant Systems Biology, VIB/UGent
Machine Learning techniques
“A class of data mining techniques that aim to learn the underlying theory (knowledge) automatically from the data, usually based on inductive reasoning.”
- Predictive modelling: classification/prediction, regression
- Descriptive modelling: clustering, association rule mining, dimensionality reduction, feature selection, outlier detection
ML challenges for systems biology
- Scale (size and dimensionality) of the data
  - NGS analysis
  - Text mining on PubMed scale: 20 million citations
  - Full genome microarrays, high-resolution mass spectrometry, high-resolution microscopy
- Complex and diverse structure of the samples
  - Sequences, graphs, images, spectra, literature, …
- Designing robust methodologies
  - Quantifying and improving robustness of methods
- Data integration
- New learning paradigms
  - Semi-supervised learning: combining labeled and unlabeled information
  - Transferring knowledge from one domain to another: transfer learning, domain adaptation
3 Case studies
- Robust biomarker discovery
- PubMed: the Big Friendly Giant
- Network inference
Case study 1: Robust biomarker discovery
Biomarker selection: challenges
- Goal: find the entities that best explain the differences in phenotypes
  - E.g. patients with disease versus normal patients
  - Increased biomass: plants with small leaves versus large leaves
- Challenges with current data sets
  - Many possible biomarkers (high dimensionality)
  - Only very few biomarkers are important for the specific phenotypic difference
  - Very few samples
Biomarker selection: challenges
- Microarray data: thousands of variables, tens/hundreds of samples
- Mass spec data: tens/hundreds of thousands of variables, tens/hundreds of samples
- SNP data (e.g. new sequencing technologies): hundreds of thousands/millions of variables, tens/hundreds of samples
Abeel, T., Helleputte, T., Van de Peer, Y., Dupont, P., Saeys, Y. (2010) Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 26, 392-398.
The need for robust marker selection algorithms
- Ranked gene list: gene A, gene B, gene C, gene D, gene E, …
- Ranked gene list: gene X, gene A, gene W, gene Y, gene C, …
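The disagreement between the two ranked lists above can be put into a number. The slide does not name a stability measure, so the Jaccard overlap of the top-k genes used here is an assumption chosen for illustration, and `jaccard_top_k` is a hypothetical helper name.

```python
def jaccard_top_k(ranking_a, ranking_b, k):
    """Jaccard overlap of the top-k entries of two ranked lists."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)

# The two ranked gene lists from the slide (only the five visible entries).
run_1 = ["gene A", "gene B", "gene C", "gene D", "gene E"]
run_2 = ["gene X", "gene A", "gene W", "gene Y", "gene C"]

overlap = jaccard_top_k(run_1, run_2, k=5)
print(overlap)  # 2 shared genes out of 8 distinct ones -> 0.25
```

A value of 1 would mean both runs select the same top five genes; here only gene A and gene C are shared.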
Scalable ensemble feature selection
- Instead of applying biomarker selection once, repeatedly apply the algorithm on slight variations of the original data set
- Subsequently, average over the repetitions and generate a consensus ranking
- Can be efficiently parallelized on a computing cluster
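The procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of the cited paper: the "slight variations" are taken to be bootstrap resamples, the base ranker is a simple signal-to-noise score, the consensus is the mean rank, and the data are synthetic. All function names are hypothetical.

```python
import random
from statistics import mean, pstdev

def rank_features(samples, labels):
    """Return feature indices ordered best-first by a signal-to-noise score."""
    scores = []
    for j in range(len(samples[0])):
        pos = [row[j] for row, y in zip(samples, labels) if y == 1]
        neg = [row[j] for row, y in zip(samples, labels) if y == 0]
        spread = pstdev(pos) + pstdev(neg) + 1e-9  # guard against zero spread
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    return sorted(range(len(scores)), key=lambda j: -scores[j])

def ensemble_ranking(samples, labels, n_repeats=50, seed=0):
    """Average feature ranks over bootstrap resamples of the data set."""
    rng = random.Random(seed)
    n, n_features = len(samples), len(samples[0])
    rank_sums = [0] * n_features
    completed = 0
    while completed < n_repeats:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        boot_y = [labels[i] for i in idx]
        if len(set(boot_y)) < 2:
            continue  # a resample needs both classes to be rankable
        ranking = rank_features([samples[i] for i in idx], boot_y)
        for position, feature in enumerate(ranking):
            rank_sums[feature] += position
        completed += 1
    # Lower summed rank = more consistently near the top.
    return sorted(range(n_features), key=lambda j: rank_sums[j])

# Synthetic toy data: 20 samples, 6 features; only feature 0 carries signal.
_rng = random.Random(1)
labels = [i % 2 for i in range(20)]
samples = [[5.0 * y + _rng.gauss(0, 1)] + [_rng.gauss(0, 1) for _ in range(5)]
           for y in labels]

consensus = ensemble_ranking(samples, labels)
print(consensus)
```

Each repetition depends only on its own resample, so the loop body could be distributed over cluster nodes and the rank sums added up afterwards.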
Results: stability
Results: classification performance
Case study 2: PubMed: the Big Friendly Giant
Automated literature screening
“MAD-3 masks the nuclear localization signal of p65 and inhibits p65 DNA binding.”
Event 1, Event 2, Event 3
- 3 proteins
  - T1 : Protein : “MAD-3”
  - T2 : Protein : “p65” (first occurrence)
  - T3 : Protein : “p65” (second occurrence)
- 3 triggers
  - T4 : Negative regulation : “masks”
  - T5 : Negative regulation : “inhibits”
  - T6 : Binding : “binding”
- 1 extra argument
  - T7 : Entity : “nuclear localization signal”
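The annotations above can be held in a small data structure. The sketch below locates each text-bound annotation (T1 to T7) in the sentence by character offset. The slide does not spell out the arguments of the three events, so the `events` list is one plausible reading of the sentence and an assumption, as are the class and function names.

```python
from dataclasses import dataclass

SENTENCE = ("MAD-3 masks the nuclear localization signal of p65 "
            "and inhibits p65 DNA binding.")

@dataclass(frozen=True)
class TextBound:
    ident: str
    kind: str
    text: str
    start: int  # character offset of the first character
    end: int    # character offset one past the last character

def locate(ident, kind, text, occurrence=1):
    """Find the n-th occurrence of `text` in the sentence."""
    start = -1
    for _ in range(occurrence):
        start = SENTENCE.index(text, start + 1)
    return TextBound(ident, kind, text, start, start + len(text))

annotations = [
    locate("T1", "Protein", "MAD-3"),
    locate("T2", "Protein", "p65", occurrence=1),
    locate("T3", "Protein", "p65", occurrence=2),
    locate("T4", "Negative regulation", "masks"),
    locate("T5", "Negative regulation", "inhibits"),
    locate("T6", "Binding", "binding"),
    locate("T7", "Entity", "nuclear localization signal"),
]
spans = {a.ident: (a.start, a.end) for a in annotations}

# ASSUMPTION: event arguments are not given on the slide; this is one
# plausible reading of the sentence, with hypothetical event labels.
events = [
    {"id": "masking", "trigger": "T4", "cause": "T1", "theme": "T2", "site": "T7"},
    {"id": "binding", "trigger": "T6", "theme": "T3"},
    {"id": "inhibition", "trigger": "T5", "cause": "T1", "theme": "binding"},
]

for a in annotations:
    print(a.ident, a.kind, a.start, a.end, a.text)
```

Note that the theme of the last event is itself an event, which is what lets nested statements such as "inhibits p65 DNA binding" be represented.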
Current state-of-the-art
- Extraction of specific biological relationships
- Potential for automatic summarization of articles
- Current performance [BioNLP Shared Task]
From text mining to integrated networks
[Saeys, Y., Van Landeghem, S., Van de Peer, Y. (2010) Event based text mining for integrated network construction. Journal of Machine Learning Research, Workshop and Conference Proceedings 8, 112-121.]
Edge types: Binding/unspecified, Regulation, Phosphorylation, Transcription, Positive Regulation, Negative Regulation
Recent advances and applications
- Going from abstracts to full text
  - Mining figures, tables, …
- Text mining at PubMed scale
  - Requires high-performance computing environment
  - Required time: 346 CPU days
  - Currently only done on abstracts
  - Full text currently under investigation
Example: apoptosis pathway
[Björne J, Ginter F, Pyysalo S, Tsujii J, Salakoski T. (2010) Scaling up Biomedical Event Extraction to the Entire PubMed. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, pp. 28-36.]
Case study 3: Large scale network inference
DREAM5 Network Inference challenge
Problem setting
Data:
Network         # Transcription factors   # Genes   # Chips
In silico       195                       1643      805
S. aureus       99                        2810      160
S. cerevisiae   333                       5950      536
E. coli         334                       4511      805
Vân Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, Yvan Saeys, and Pierre Geurts (2010) Regulatory network inference with GENIE3: application to the DREAM5 challenge. RECOMB Regulatory Genomics workshop.
Genie3: Gene Network Inference using Ensembles of Trees
Results: gold standard evaluation
Networks: In silico, E. coli, S. cerevisiae
Method       Overall score
Genie3-RF    40.28
Team 543     34.02
Team 776     31.1
Team 862     28.75
CLR          23.93
ARACNE       3.22
Lin. Regr.   7.15
Team 548     22.711
GGM          5.81
Advantages of Genie3
- Scalable, state-of-the-art network inference tool
- Can handle multivariate effects
- Features used can be very versatile
  - Expression values
  - MicroRNAs
  - Genotypic data (e.g. markers, SNPs, …)
- Straightforward data integration framework
Conclusions
- Ensemble methods are essential for scalable learning models
  - State-of-the-art performance
  - Improve robustness
  - Straightforward data integration
- Model robustness should be incorporated as an evaluation criterion, complementary to model performance
- High-performance computing clusters should be considered as the de facto standard for large scale learning
Acknowledgements
- @UGent-VIB: Thomas Abeel, Sofie Van Landeghem, Yvan Saeys
- @ULG: Vân Anh Huynh-Thu, Pierre Geurts, Alexandre Irrthum, Louis Wehenkel
- @UCL: Thibault Helleputte, Pierre Dupont
