Large scale machine learning challenges for systems biology
by Dr. Yvan Saeys - Machine Learning and Data Mining group, Bioinformatics and Systems Biology Division, VIB-UGent Department of Plant Systems Biology

Due to technological advances, the amount of biological data and the pace at which it is generated have increased dramatically during the past decade. To extract new knowledge from these ever-increasing data sets, automated techniques such as data mining and machine learning have become standard practice.
In this talk, I will give an overview of large-scale machine learning challenges in bioinformatics and systems biology, highlighting the importance of scalable and robust techniques such as ensemble learning methods implemented on large computing grids.
I will present some of our state-of-the-art tools for problems such as biomarker discovery, large-scale network inference, and biomedical text mining at PubMed scale.

Transcript

  1. Large scale machine learning challenges for systems biology
     Yvan Saeys, Bioinformatics and Evolutionary Genomics (BEG), Department of Plant Systems Biology, VIB/UGent
  2. Machine Learning techniques
     “A class of data mining techniques that aim to learn the underlying theory (knowledge) automatically from the data, usually based on inductive reasoning.”
     • Predictive modelling:
       • Classification/prediction
       • Regression
     • Descriptive modelling:
       • Clustering
       • Association rule mining
       • Dimensionality reduction
       • Feature selection
       • Outlier detection
  3. ML challenges for systems biology
     • Scale (size and dimensionality) of the data
       • NGS analysis
       • Text mining on PubMed scale
         • 20 million citations
       • Full genome microarrays, high-resolution mass spectrometry, high-resolution microscopy
     • Complex and diverse structure of the samples
       • Sequences, graphs, images, spectra, literature, …
     • Designing robust methodologies
       • Quantifying and improving robustness of methods
       • Data integration
     • New learning paradigms
       • Semi-supervised learning: combining labeled and unlabeled information
       • Transferring knowledge from one domain to another
         • Transfer learning
         • Domain adaptation
  4. 3 Case studies
     • Robust biomarker discovery
     • PubMed: the Big Friendly Giant
     • Network inference
  5. Case study 1: Robust biomarker discovery
  6. Biomarker selection: challenges
     • Goal: find the entities that best explain the differences in phenotypes:
       • E.g. patients with disease versus normal patients
       • Increased biomass: plants with small leaves versus large leaves
     • Challenges with current data sets:
       • Many possible biomarkers (high dimensionality)
       • Only very few biomarkers are important for the specific phenotypic difference
       • Very few samples
  7. Biomarker selection: challenges
     • Microarray data: thousands of variables, tens/hundreds of samples
     • Mass spec data: tens/hundreds of thousands of variables, tens/hundreds of samples
     • SNP data (e.g. new sequencing technologies): hundreds of thousands/millions of variables, tens/hundreds of samples
     Abeel, T., Helleputte, T., Van de Peer, Y., Dupont, P., Saeys, Y. (2010) Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 26, 392-398.
  8. The need for robust marker selection algorithms
     • Ranked gene list: gene A, gene B, gene C, gene D, gene E, …
     • Ranked gene list: gene X, gene A, gene W, gene Y, gene C, …
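The two ranked lists on this slide disagree even though they are meant to describe the same phenotype. One simple way to put a number on that disagreement is the overlap of the top-k entries. The sketch below assumes the Jaccard index as the overlap measure; the slides do not say which stability measure was used, and the function name `top_k_jaccard` is purely illustrative. The gene names are the placeholders from the slide.

```python
def top_k_jaccard(ranking_a, ranking_b, k):
    """Jaccard overlap of the top-k entries of two ranked lists.

    1.0 means both runs selected the same k genes; 0.0 means no gene in common.
    """
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


# The two placeholder rankings shown on the slide.
run_1 = ["gene A", "gene B", "gene C", "gene D", "gene E"]
run_2 = ["gene X", "gene A", "gene W", "gene Y", "gene C"]

# Only gene A and gene C appear in both top-5 lists (2 shared out of 8 distinct).
overlap = top_k_jaccard(run_1, run_2, k=5)
```

A low overlap such as this one indicates that the selected markers depend heavily on the particular run, which is the motivation for the ensemble approach on the next slide.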
  9. Scalable ensemble feature selection
     • Instead of applying biomarker selection once, repeatedly apply the algorithm on slight variations of the original data set
     • Subsequently, average over the repetitions and generate a consensus ranking
     • Can be efficiently parallelized on a computing cluster
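The procedure on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the "slight variations" are taken to be bootstrap resamples, the base selector is a simple absolute difference of class means, and the consensus is the mean rank over all rounds. The function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def score_features(samples, labels):
    """Base selector (an assumption): absolute difference of class means per feature."""
    scores = []
    for j in range(len(samples[0])):
        pos = [x[j] for x, y in zip(samples, labels) if y == 1]
        neg = [x[j] for x, y in zip(samples, labels) if y == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return scores


def ranks_from_scores(scores):
    """Convert scores to ranks; rank 0 is the most discriminative feature."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks


def ensemble_ranking(samples, labels, n_rounds=50, seed=0):
    """Rank features on bootstrap resamples and return the consensus order."""
    rng = random.Random(seed)
    n = len(samples)
    rank_sums = [0] * len(samples[0])
    done = 0
    while done < n_rounds:
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        boot_labels = [labels[i] for i in idx]
        if len(set(boot_labels)) < 2:
            continue  # redraw if one class is missing from the resample
        boot_samples = [samples[i] for i in idx]
        ranks = ranks_from_scores(score_features(boot_samples, boot_labels))
        rank_sums = [s + r for s, r in zip(rank_sums, ranks)]
        done += 1
    # Lowest summed rank first, which is equivalent to lowest mean rank.
    return sorted(range(len(rank_sums)), key=lambda j: rank_sums[j])


# Toy data: 20 samples, 5 features; only feature 2 differs between the classes.
rng = random.Random(1)
labels = [i % 2 for i in range(20)]
samples = [[rng.gauss(5.0 * y if j == 2 else 0.0, 1.0) for j in range(5)]
           for y in labels]
consensus = ensemble_ranking(samples, labels)
```

Each round depends only on its own resample, so the loop body is the natural unit to distribute across cluster nodes, with the rank sums combined at the end.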
  10. Results: stability
  11. Results: classification performance
  12. Case study 2: PubMed: the Big Friendly Giant
  13. Automated literature screening
      “MAD-3 masks the nuclear localization signal of p65 and inhibits p65 DNA binding.”
      Figure labels: Event 1, Event 2, Event 3
      • 3 proteins
        • T1 : Protein : “MAD-3”
        • T2 : Protein : “p65” (first occurrence)
        • T3 : Protein : “p65” (second occurrence)
      • 3 triggers
        • T4 : Negative regulation : “masks”
        • T5 : Negative regulation : “inhibits”
        • T6 : Binding : “binding”
      • 1 extra argument
        • T7 : Entity : “nuclear localization signal”
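The text-bound annotations T1 to T7 listed on this slide can be represented as a small data structure. The sketch below is only an illustration: the class name `TextBound`, the helper `annotate`, and the use of character offsets are assumptions rather than the format used by the authors' tools. The arguments of the three events are not spelled out on the slide, so they are deliberately left out.

```python
from dataclasses import dataclass

SENTENCE = ("MAD-3 masks the nuclear localization signal of p65 "
            "and inhibits p65 DNA binding.")


@dataclass(frozen=True)
class TextBound:
    """One annotated span: identifier, type, covered text and character offsets."""
    id: str
    type: str
    text: str
    start: int
    end: int


def annotate(sentence, specs):
    """Locate each (id, type, text, occurrence) spec; occurrence is 1-based."""
    annotations = []
    for ann_id, ann_type, text, occurrence in specs:
        start = -1
        for _ in range(occurrence):
            start = sentence.find(text, start + 1)
            if start < 0:
                raise ValueError(f"{text!r} not found often enough")
        annotations.append(TextBound(ann_id, ann_type, text, start, start + len(text)))
    return annotations


# Identifiers, types and texts as listed on the slide.
SPECS = [
    ("T1", "Protein", "MAD-3", 1),
    ("T2", "Protein", "p65", 1),
    ("T3", "Protein", "p65", 2),
    ("T4", "Negative regulation", "masks", 1),
    ("T5", "Negative regulation", "inhibits", 1),
    ("T6", "Binding", "binding", 1),
    ("T7", "Entity", "nuclear localization signal", 1),
]
annotations = annotate(SENTENCE, SPECS)
```

Storing offsets rather than only the covered text is what keeps the two occurrences of “p65” (T2 and T3) distinguishable.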
  14. Current state-of-the-art
      • Extraction of specific biological relationships
      • Potential for automatic summarization of articles
      • Current performance [BioNLP Shared Task]
  15. From text mining to integrated networks
      [Saeys, Y., Van Landeghem, S., Van de Peer, Y. (2010) Event based text mining for integrated network construction. Journal of Machine Learning Research, Workshop and Conference proceedings 8, 112-121.]
      Figure legend: Binding/unspecified Regulation Phosphorylation Transcription Positive Regulation Negative Regulation
  16. Recent advances and applications
      • Going from abstracts to full text
      • Mining figures, tables, …
      • Text mining at PubMed scale
        • Requires high-performance computing environment
          • Required time: 346 CPU days
        • Currently only done on abstracts
        • Full text currently under investigation
  17. Example: apoptosis pathway
      [Björne J, Ginter F, Pyysalo S, Tsujii J, Salakoski T. Scaling up Biomedical Event Extraction to the Entire PubMed (2010) In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, pp. 28-36.]
  18. Case study 3: Large scale network inference
      DREAM5 Network Inference challenge
  19. Problem setting
      Data:
      Network         # Transc. Factors   # Genes   # Chips
      In silico       195                 1643      805
      S. aureus       99                  2810      160
      S. cerevisiae   333                 5950      536
      E. coli         334                 4511      805
      Vân Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, Yvan Saeys, and Pierre Geurts (2010) Regulatory network inference with GENIE3: application to the DREAM5 challenge. RECOMB Regulatory Genomics workshop.
  20. Genie3: Gene Network Inference using Ensembles of Trees
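The slide gives only the name of the method. As commonly described, the idea is to treat each gene in turn as a regression target, fit a tree ensemble that predicts it from the other genes, and read the feature importances as weights of candidate regulatory edges. The sketch below is a simplified illustration of that idea and not the authors' GENIE3 code: it swaps full Random Forests for depth-1 trees (stumps) fitted on bootstrap samples, uses every other gene as a candidate regulator, and measures importance as the fraction of target variance removed by the split. All function names and the toy data are hypothetical.

```python
import random


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_stump(xs_by_gene, y):
    """Return (gene, relative variance reduction) of the best single split."""
    n = len(y)
    total = sum_sq(y)
    if total == 0:
        return None, 0.0
    best_gene, best_gain = None, 0.0
    for gene, xs in xs_by_gene.items():
        order = sorted(range(n), key=lambda i: xs[i])
        for cut in range(1, n):
            if xs[order[cut - 1]] == xs[order[cut]]:
                continue  # no threshold separates equal values
            left = [y[i] for i in order[:cut]]
            right = [y[i] for i in order[cut:]]
            gain = (total - sum_sq(left) - sum_sq(right)) / total
            if gain > best_gain:
                best_gene, best_gain = gene, gain
    return best_gene, best_gain


def infer_network(expression, n_trees=30, seed=0):
    """expression: dict mapping gene name to a list of values, one per chip.

    Returns ((regulator, target), weight) pairs sorted by decreasing weight.
    """
    rng = random.Random(seed)
    genes = list(expression)
    n = len(expression[genes[0]])
    edges = {}
    for target in genes:
        importance = {g: 0.0 for g in genes if g != target}
        for _ in range(n_trees):
            idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of chips
            y = [expression[target][i] for i in idx]
            xs = {g: [expression[g][i] for i in idx] for g in importance}
            gene, gain = best_stump(xs, y)
            if gene is not None:
                importance[gene] += gain / n_trees
        for regulator, weight in importance.items():
            edges[(regulator, target)] = weight
    return sorted(edges.items(), key=lambda kv: -kv[1])


# Toy data: B closely follows A, while C is independent noise.
rng = random.Random(42)
a = [rng.gauss(0, 1) for _ in range(40)]
toy = {
    "A": a,
    "B": [2 * v + rng.gauss(0, 0.1) for v in a],
    "C": [rng.gauss(0, 1) for _ in range(40)],
}
ranked_edges = infer_network(toy)
```

Because each target gene is handled independently, the outer loop splits naturally across the nodes of a computing cluster.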
  21. Results: gold standard evaluation
      Figure panels: In silico, E. coli, S. cerevisiae
      Method       Overall score
      Genie3-RF    40.28
      Team 543     34.02
      Team 776     31.1
      Team 862     28.75
      CLR          23.93
      ARACNE       3.22
      Lin. Regr.   7.15
      Team 548     22.711
      GGM          5.81
  22. Advantages of Genie3
      • Scalable, state-of-the-art network inference tool
      • Can handle multivariate effects
      • Features used can be very versatile:
        • Expression values
        • MicroRNAs
        • Genotypic data (e.g. markers, SNPs, …)
      • Straightforward data integration framework
  23. Conclusions
      • Ensemble methods are essential for scalable learning models
        • State-of-the-art performance
        • Improve robustness
        • Straightforward data integration
      • Model robustness should be incorporated as an evaluation criterion, complementary to model performance
      • High-performance computing clusters should be considered as the de facto standard for large scale learning
  24. Acknowledgements
      • @UGent-VIB
        • Thomas Abeel
        • Sofie Van Landeghem
        • Yvan Saeys
      • @ULG
        • Vân Anh Huynh-Thu
        • Pierre Geurts
        • Alexandre Irrthum
        • Louis Wehenkel
      • @UCL
        • Thibault Helleputte
        • Pierre Dupont
