Agile large-scale machine-learning pipelines in drug discovery


Presentation held at EBI on Aug 12, 2015

Published in: Science

  1. 1. Agile large-scale machine-learning pipelines in drug discovery Ola Spjuth Department of Pharmaceutical Biosciences and Science for Life Laboratory Uppsala University, Sweden
  2. 2. Outline • My research in perspective • Our approach to machine learning in ligand-based modeling • Challenges when data grows • Automation workflows/pipelines • HPC, Cloud Computing and Big Data Analytics
  3. 3. From data to insights • We have access to a wealth of information • Data mining and predictive modeling can be useful
  4. 4. History: Bioclipse – an open source workbench for the life sciences O. Spjuth, J. Alvarsson, A. Berg, M. Eklund, S. Kuhn, C. Mäsak, G. Torrance, J. Wagener, E.L. Willighagen, C. Steinbeck, and J.E.S. Wikberg. Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 2009, 10:397 Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES: Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.
  5. 5. How is the compound metabolized? Are any of its metabolites reactive/toxic? Here? Here? Is it toxic? Chemical liabilities (drug safety, alerts) Adverse effects? Can we, based on existing experimental studies, IT, and statistical models, predict the outcome for new compounds?
  6. 6. Starting out in 2008 with a challenge: • Build a system with predictive models which runs on the client – Initial problem: Site-of-metabolism prediction
  7. 7. Site-of-metabolism (SOM) predictions – MetaPrint2D L. Carlsson, O. Spjuth, S. Adams, R. C. Glen, and S. Boyer. Use of historic metabolic biotransformation data as a means of anticipating metabolic sites using MetaPrint2D and Bioclipse. BMC Bioinformatics 2010, 11:362 Boyer S, Arnby CH, Carlsson L, Smith J, Stein V, Glen RC. Reaction site mapping of xenobiotic biotransformations. J Chem Inf Model. 2007 Mar-Apr;47(2):583-90. Reaction Database MetaPrint2D database Circular Fingerprints Highest probability of metabolism Low probability of metabolism Medium probability of metabolism Mapping
  8. 8. Bioclipse and MetaPrint2D
  9. 9. Next challenge: Extend to general predictive models • Fast predictive models, allow for instant updates upon structural changes • Span from virtual screening to lead optimization
  10. 10. Bioclipse Decision Support • Integrate various predictive methods – Similarity searches (InChI, signatures, fingerprints) – Structural alerts (toxicophores) – QSAR models (classification, regression) • Visual interpretation – Highlight important substructures O. Spjuth, L. Carlsson, M. Eklund, E. Ahlberg Helgee, and Scott Boyer. Integrated decision support for assessing chemical liabilities. Accepted in J. Chem. Inf. Model, 2011.
  11. 11. Ligand-based predictive modeling Quantitative Structure-Activity Relationship (QSAR) – Start with a dataset of chemical structures with measured property to model (inhibition, toxicity, etc) – Describe chemicals using descriptors – Make use of statistical modeling to relate chemical structures to a response
  12. 12. Machine learning pipelines Preprocessing Model building Validation Reporting
  13. 13. QSAR modeling • Signatures [1] descriptor in CDK [2] – Canonical representation of atom environments • Support Vector Machine (SVM) – Robust modeling [1] Faulon, J.-L.; Visco, D. P.; Pophale, R. S. Journal of Chemical Information and Computer Sciences, 2003, 43, 707-720. [2] Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. Journal of Chemical Information and Computer Sciences, 2003, 43, 493-500.
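To make the idea of a canonical atom-environment descriptor concrete, here is a minimal sketch in Python. It is a simplified neighbourhood-relabelling scheme, not the signature algorithm of Faulon et al. and not the CDK implementation. The function name `atom_environments` and the toy molecule are assumptions made for illustration.

```python
from collections import Counter

def atom_environments(atoms, bonds, height):
    """Return a sparse count vector of atom-environment strings.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Simplified illustration only, not the published signature algorithm.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Height 0: each atom is described by its element symbol alone.
    labels = list(atoms)
    for _ in range(height):
        # Each round appends the sorted labels of the neighbours, so the
        # result does not depend on the order in which atoms were listed.
        labels = [
            labels[i] + "(" + ",".join(sorted(labels[j] for j in neighbours[i])) + ")"
            for i in range(len(atoms))
        ]
    # Only environments that occur are stored: a sparse representation.
    return Counter(labels)

# Heavy atoms of ethanol, C-C-O, given in two different atom orders.
ethanol = atom_environments(["C", "C", "O"], [(0, 1), (1, 2)], height=1)
ethanol_reordered = atom_environments(["O", "C", "C"], [(0, 1), (1, 2)], height=1)
```

The resulting counts can serve as input features for a learner such as an SVM. Storing them as a dictionary keeps memory use proportional to the number of environments actually present.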
  14. 14. Local interpretation of nonlinear QSAR models • Method – Compute gradient of decision function for prediction – Extract descriptor(s) with largest component in the gradient • Demonstrated on RF, SVM, and PLS Carlsson, L., Helgee, E. A., and Boyer, S. Interpretation of nonlinear QSAR models applied to Ames mutagenicity data. J Chem Inf Model 49, 11 (Nov 2009), 2551–2558. Lars Carlsson, AstraZeneca R&D
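The two method steps on this slide can be sketched as follows. This is a hedged illustration, not the authors' implementation. It approximates the gradient by central finite differences and ranks components by absolute value, both of which are assumptions. `toy_decision` is a hypothetical stand-in for a trained model's decision function.

```python
import math

def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of f at the descriptor vector x."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        up[k] += eps
        down = list(x)
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def most_important_descriptor(f, x):
    """Index and value of the gradient component with the largest magnitude."""
    grad = numerical_gradient(f, x)
    k = max(range(len(grad)), key=lambda i: abs(grad[i]))
    return k, grad[k]

def toy_decision(x):
    # Hypothetical nonlinear decision function, standing in for a trained model.
    return math.tanh(0.2 * x[0] - 1.5 * x[1] + 0.3 * x[0] * x[2])

index, component = most_important_descriptor(toy_decision, [1.0, 0.0, 1.0])
```

For this toy function the second descriptor dominates locally and has a negative sign, meaning that increasing it would lower the predicted value. In a structure-based setting the selected descriptor would be mapped back to the substructure it encodes.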
  15. 15. Bioclipse Decision Support
  16. 16. Next challenge: Simple model building • Build a solution where: – Scientists can build accurate models without modeling expertise, in order to aid their decision making – Combine these models with other models
  17. 17. Simple model building with graphical wizards
  18. 18. Next challenge: Predict using distributed services • OpenTox - European project for creating an interoperable framework for toxicity predictions • Academia and industry • Parts – Ontology and API – Query and invocation of predictive services – Methods and algorithms – Authentication and authorization
  19. 19. Bioclipse Decision Support Model discovery predictions
  20. 20. Bioclipse and OpenTox Collaboration with
  21. 21. OpenTox in Bioclipse
  22. 22. Summary of Bioclipse Decision Support • Flexible, general method – Apply to any collection of molecules • State-of-the-art machine-learning methods • Handles large data sets • Fast predictions
  23. 23. Advantages with the DS method • Fast: Can run on local computer – “Instant predictions”, “calculate as you draw” • Interpretable results: Can be used for hypothesis generation • General: Apply any modeling technique to any data set • Extensible: Very easy to add new components • Open: Free, open source
  24. 24. Observations • Predictive drug discovery is becoming data-intensive – High throughput technologies • Drug/chemical screening • Molecular biology (omics) – More and bigger publicly available data sources • Data is continuously updated → We need scalable and automated methods for predictive modeling
  25. 25. Challenges with bigger data sets for machine learning • Modeling time increases – Reduce/avoid parameter tuning – Run on high-performance e-infrastructures – Use approximate methods • Not all implementations can handle large data sets – Use sparse implementations
  26. 26. Determine parameter intervals for modeling (sweetspot) J. Alvarsson, M. Eklund, C. Andersson, L. Carlsson, O. Spjuth, and Jarl Wikberg. Benchmarking study of parameter variation when using signature fingerprints together with support vector machines. J Chem Inf Model. 2014, 54(11), pp 3211–3217. SVM: Cost and Gamma parameters Signatures: Heights
  27. 27. Example 1: Modeling large number of observations Jonathan Alvarsson
  28. 28. Example 2: Target predictions Alvarsson J, Eklund M, Engkvist O, Spjuth O, Carlsson L, Wikberg JE, Noeske T. Ligand-based target prediction with signature fingerprints. J Chem Inf Model. 2014 Oct 27;54(10):2647-53
  29. 29. Challenge with running on HPC • Reduce manual work – Automate data preprocessing and modeling – Support modeling life cycle (build, validate, document, version, publish, re-train …) • Automating model building is not trivial – Aim: Agile, component-based architecture
  30. 30. Example application: Training large number of datasets Aim: Build models for hundreds of targets – Challenge to extract – Challenge to automate model building Data sources Samuel Lampa
  31. 31. Automating analysis on HPC clusters • Workflow systems can aid development and deployment • We used the Luigi workflow system • Integrate with queuing system (SLURM) Train and assess model Samuel Lampa
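As a rough illustration of handing model-training jobs to SLURM, the sketch below builds one `sbatch` command per target. It is not the Luigi integration used in the talk. The script name `train_model.py`, the target names, the log path, and the resource values are all hypothetical. The commands are only constructed here, not submitted.

```python
import shlex

def sbatch_command(target, cores=4, hours=2):
    """Build an sbatch argument list that trains one model for one target."""
    # train_model.py is a hypothetical training script, assumed for illustration.
    train = f"python train_model.py --target {shlex.quote(target)}"
    return [
        "sbatch",
        f"--job-name=qsar-{target}",
        f"--cpus-per-task={cores}",
        f"--time={hours:02d}:00:00",
        f"--output=logs/{target}.out",
        f"--wrap={train}",
    ]

# Hypothetical target identifiers; a real run would read them from the data source.
targets = ["target_A", "target_B", "target_C"]
commands = [sbatch_command(t) for t in targets]

for cmd in commands:
    # Printed for inspection; pass cmd to subprocess.run to actually submit.
    print(shlex.join(cmd))
```

A workflow system adds what this sketch leaves out: dependency tracking between steps, detection of already completed outputs, and resubmission after failures.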
  32. 32. Example ML pipeline (unpublished data)
  33. 33. Publishing models • Publish models for easy access and consumption • We used P2 (OSGi) provisioning system v. 1.3 v. 1.2 v. 1.1 Use models
  34. 34. Reactive/continuous modeling Data sources Coordinate Integrate Version Monitor Publish models Archive models Train and assess model User Bioclipse
  35. 35. Model building WFs on HPC is not trivial • Many workflow systems exist – DSLs vs APIs – Dynamic input/output in e.g. cross-validation not supported out of the box • Time-consuming to create WFs • Workflows can be useful but are not (yet) the silver bullet we sought O. Spjuth, E. Bongcam-Rudloff, G. C. Hernandez, L. Forer, M. Giovacchini, R. V. Guimera, A. Kallio, E. Korpelainen, M. Kandula, M. Krachunov, D. P. Kreil, O. Kulev, P. P. Labaj, S. Lampa, L. Pireddu, S. Schönherr, A. Siretskiy, and D. Vassilev. Experiences with workflows for automating data-intensive bioinformatics. Accepted in Biology Direct.
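The point about dynamic input/output in cross-validation can be illustrated with a small sketch. The number of train/assess branches, and the files each would read and write, depends on a run-time parameter `k`, which is awkward to express in a workflow whose graph is fixed when it is defined. The function name and the interleaved fold assignment are assumptions chosen for brevity.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    # Interleaved assignment: item i goes to fold i mod k.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        yield train, test

# The number of downstream train/assess tasks is only known once k is chosen.
splits = list(k_fold_splits(n_items=10, k=3))
```

Each element of `splits` would correspond to one train-and-assess branch in the pipeline, followed by a step that merges the per-fold results.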
  36. 36. Could cloud computing improve things?
  37. 37. QSAR Modeling on Amazon Elastic Cloud [Figure: time (hours) versus number of cores (1 to 16), with series labeled 20k, 75k, 150k, and 300k] B. Torabi, J. Alvarsson, M. Holm, M. Eklund, L. Carlsson, and O. Spjuth. “Scaling predictive modeling in drug development with cloud computing”. J. Chem. Inf. Model., 2015, 55 (1), pp 19-25
  38. 38. Private clouds • We set up an OpenStack system at UPPMAX (our HPC center) • Primarily Infrastructure as a Service (IaaS) – users can run virtual machines • Platform-as-a-Service (PaaS): Hadoop and Spark – Our question: Can this be useful for model building?
  39. 39. Managing Virtual Machine Images • Open catalogue of VMIs • Hosted at Uppsala University M. Dahlö, F. Haziza, A. Kallio, E. Korpelainen, E. Bongcam-Rudloff, and O. Spjuth. A catalogue of virtual machine images for the life sciences. Accepted in Bioinformatics and Biology Insights.
  40. 40. Cloud computing enables Big Data Analytics • Hadoop – Open Source Map-Reduce, suited for massively parallel tasks – Distributed execution, high availability, fault-tolerant, can be run on commodity hardware – E.g. Google, Facebook and Twitter use it • Hadoop Distributed File System (HDFS) distributes data on nodes, computing done in parallel – “bring computations to data”
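The MapReduce model mentioned above can be sketched in a few lines of plain Python. This single-process version only shows the map, shuffle, and reduce phases. It does not use Hadoop, and the records (target identifier, active flag) are made-up toy data.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    target, active = record
    yield target, 1 if active else 0

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Toy records: (target identifier, whether the compound was active).
records = [("T1", True), ("T2", False), ("T1", True), ("T2", True)]

pairs = [pair for record in records for pair in map_phase(record)]
actives_per_target = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a distributed setting the map calls run on the nodes that already hold the data blocks, which is the sense of "bring computations to data". Only the grouped intermediate pairs travel over the network.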
  41. 41. Hadoop (MapReduce) for massively parallel analysis
  42. 42. Evaluating Hadoop for next-generation sequencing • Compare Hadoop and HPC – Create pipelines that are as identical as possible – Calculate efficiency as a function of data size – Conclusion: Hadoop pipeline scales better than HPC and is economical for current data sizes Alexey Siretskiy, former postdoc at UPPMAX A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. Gigascience (2015) Jun 4; 4:26. A. Siretskiy and O. Spjuth. HTSeq-Hadoop: Extending HTSeq for Massively Parallel Sequencing Data Analysis using Hadoop. In e-Science, 2014 IEEE 10th International Conference on (2014), vol. 1, pp. 317–323.
  43. 43. SPARK • Add caching to Hadoop (MapReduce) – in-memory computing • Good for iterative algorithms • We applied it to ligand-based virtual screening With Åke Edlund, HPCViz, KTH L. Ahmed, A. Edlund, E. Laure, O. Spjuth. Using Iterative MapReduce for Parallel Virtual Screening. Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on, vol. 2, pp. 27-32, 2013
  44. 44. Large-scale machine learning on Spark • Ongoing project: Create a large-scale machine learning pipeline for QSAR using Spark ML as alternative to Luigi workflow system – Apply to large data sets – Apply to many data sets – Compare Spark with workflows on Batch system – Aim: Use for Reactive Modeling
  45. 45. Some conclusions so far on cloud computing and Hadoop/Spark for bioinformatics • Cloud computing – Easy provisioning of infrastructures, services and platforms • Hadoop – Scalable and efficient, but at the price of software incompatibility • Spark – Improves on Hadoop with in-memory computing and a more intuitive interface • Current working hypothesis: Spark is more advantageous than workflows on batch systems for machine learning pipelines
  46. 46. Conformal prediction Seek answer to: “How good is your prediction?” • Traditional machine learning algorithms: – Simple predictions (e.g. “Class A”, 8.45) • Conformal predictions – Prediction intervals for a given confidence level – based on a consistent and well-defined mathematical framework [1] [1] Vovk, V.; Gammerman, A.; Shafer, G. “Algorithmic learning in a random world”; Springer: New York, 2005.
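To show how a point prediction such as 8.45 becomes a prediction interval, here is a minimal sketch of split (inductive) conformal regression. It uses absolute residuals on a held-out calibration set as nonconformity scores. The slides do not say which conformal variant was used, so this choice is an assumption, and the residual values are made-up.

```python
import math

def conformal_interval(prediction, calibration_residuals, confidence):
    """Prediction interval from calibration-set residuals (split conformal)."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Rank of the calibration score that bounds the interval half-width.
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        # Too few calibration examples for this confidence level.
        return (-math.inf, math.inf)
    half_width = scores[rank - 1]
    return (prediction - half_width, prediction + half_width)

# Made-up residuals (observed minus predicted) on nine calibration compounds.
residuals = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9]

low, high = conformal_interval(8.45, residuals, confidence=0.8)
unbounded = conformal_interval(8.45, residuals, confidence=0.95)
```

Raising the confidence level widens the interval. With only nine calibration examples, a 95% level cannot be guaranteed, so the sketch returns an unbounded interval rather than a misleadingly narrow one.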
  47. 47. Conformal predictions Norinder, U., Carlsson, L., Boyer, S., and Eklund, M. Introducing conformal prediction in predictive modeling: a transparent and flexible alternative to applicability domain determination. J Chem Inf Model 54, 6 (Jun 2014), 1596–603.
  48. 48. Some projects on Conformal Predictions • CP Feature Highlighting • CP in Spark • Large-scale model building in cheminformatics and virtual screening – Ongoing projects Ahlberg E, Spjuth O, Hasselgren C, Carlsson L. Interpretation of Conformal Prediction Classification Models. Statistical Learning and Data Sciences. Springer International Publishing; 2015. pp. 323–334. Capuccini M, Carlsson L, Norinder U., and Spjuth O. Conformal Prediction in Spark: Large-Scale Machine Learning with Confidence. Submitted.
  49. 49. Two pilots for clinical data management
  50. 50. CML, Lucia Cavelier
  51. 51. MDR, Åsa Melhus
  52. 52. e-Science (cyberinfrastructure, “big data”) “Systematic and advanced use of computers in research” – High-performance computing – Distributed data, “Big data” – Enabling science!
  53. 53. Acknowledgements • Workflows: Samuel Lampa, David Kreil, Maciej Kańduła, Martin Dahlö, Frédéric Haziza, Mentell Design • Hadoop & Spark: Alexey Siretskiy, Åke Edlund, Izhar ul Hassan, Marco Capuccini, Staffan Arvidsson • Cloud computing: Frédéric Haziza, Tore Sundqvist, Behrooz Torabi, Salman Toor, Andreas Hellander • Predictive modeling: Lars Carlsson, Ernst Ahlberg-Helgee, Martin Eklund, Ulf Norinder, Wesley Schaal, Jonathan Alvarsson • Bioclipse: Arvid Berg, Egon Willighagen, all Bioclipse and CDK contributors
  54. 54. Thank you