
Using research software in a production environment


The exponential increase in data has caused an analysis bottleneck: the effort needed to manage the data and develop complex analysis pipelines now exceeds the effort of collecting the data. I discuss some of the major techniques we used to turn our research pipelines into a production system able to analyze diverse datasets with minimal failures. I highlight the importance of valid metadata, the adaptation of research software, and surrounding infrastructure including workflow systems.

Published in: Science

  1. Using research software in a production environment
      Morgan Taschuk (@morgantaschuk), Senior Manager, Genome Sequence Informatics, Ontario Institute for Cancer Research
  2. Ontario Institute for Cancer Research: Genome Sequence Informatics (est. 2008)
      • Primary analysis and QC at OICR
      • 8100 cores
      • 2 petabytes of disk
      • Supports dozens of projects and hundreds of publications
      • Half bioinformaticians
      • Half software developers/engineers
  3. Core process
  4. We are consumers of research software
  5. https://doi.org/10.1371/journal.pcbi.1005412
  6. Good software is not enough
  7. Big Data
      • Scale: one sequenced human whole genome is 30-45 GB
      • Genomics England’s 100 000 Genomes Project will take ~20 PB of disk to store
      • Between 5000 and 20 000 cases must be sequenced to confidently link rare variants with disease
  8. Data is too big!!
      Figure: costs of whole genome sequencing (grey line) and computing power (Moore’s law, black line). Source: Clinical and Translational Radiation Oncology 2017; 3: 16-20. DOI: 10.1016/j.ctro.2017.03.002
  9. Translation to the clinic
      • Only 10-25% of research can be translated into clinical practice
      • Example: the recommended laboratory test turnaround time is 14 days
      • Genomic test results take ~35 days from biopsy to report
      Aung et al. Clin Cancer Res. 2018 doi: 10.1158/1078-0432.
  10. Growing pains
      OICR acquires a lot of sequencing instruments
  11. [image only]
  12. [image only]
  13. GSI in 2017
      • 17 staff, but only ~2 monitor this system
      • 90,098 analysis workflows executed on human whole genome, exome, targeted panel, and RNA sequencing data
      • 1 successful workflow every 6 minutes
      • The vast majority of data never needs human intervention
      • My goal is/was to reduce turnaround time… stay tuned for the end of the talk
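
The "every 6 minutes" figure on slide 13 can be checked with a little arithmetic. This is only a sanity check: it assumes the 90,098 workflows were spread evenly over the 365 days of 2017, which the slide does not state.

```python
# Assumption: the 90,098 workflows ran evenly across all of 2017.
WORKFLOWS_2017 = 90_098
MINUTES_IN_2017 = 365 * 24 * 60  # 525,600 minutes in a non-leap year

minutes_per_workflow = MINUTES_IN_2017 / WORKFLOWS_2017
print(f"{minutes_per_workflow:.1f} minutes per workflow")  # prints "5.8 minutes per workflow"
```

That rounds to the quoted rate of roughly one workflow every 6 minutes.
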
  14. Our current approach
      1. Nothing should be on fire
  15. Our current approach
      1. Control our inputs (data and metadata)
      2. As little human intervention as possible
      3. Fail fast, fail loudly
      4. Totally traceable and reproducible
  16. Total assimilation
      • Borg’ed out on supply chain management
      • Assimilate all aspects of metadata and data management to ensure consistent quality
  17. Caveat
  18. Our approach
      Diagram labels: Monitoring, Valid metadata, Workflow system, Automation, Genomics, Reports, HPC, Research software
      Valid metadata entering an automated system running on robust software with reproducible results, and everything tracked and monitored.
  19. Total assimilation
      Diagram labels: Valid metadata, Workflow system, Automation, Reports/Data, Genomics, SCIENCE!!
  20. Only good metadata enter
      • Control and validate metadata as far upstream as we can
      • Laboratory Information Management System (LIMS)
  21. MISO LIMS as the metadata solution
  22. MISO as the metadata solution
      • Since 2017, MISO LIMS
          • open source
          • completely customizable
      • Validate data at entry
      • Sanity checks
      • Reduce data entry and thus reduce data entry errors
      https://github.com/TGAC/miso-lims
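
To make "validate data at entry" and "sanity checks" on slide 22 concrete, here is a minimal sketch in Python. It is not MISO's API or data model: the `LibraryRecord` fields, the library type codes, and the plausible insert-size range are all hypothetical, chosen only to show a record being rejected before it enters the system.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical controlled vocabulary; a real LIMS defines its own.
ALLOWED_LIBRARY_TYPES = {"WG", "EX", "TS", "RNA"}


@dataclass(frozen=True)
class LibraryRecord:
    """Hypothetical metadata record for one sequencing library."""
    sample_id: str
    library_type: str
    insert_size_bp: int


def validate(record: LibraryRecord) -> list[str]:
    """Return a list of problems; an empty list means the record may enter."""
    problems = []
    if not record.sample_id.strip():
        problems.append("sample_id is empty")
    if record.library_type not in ALLOWED_LIBRARY_TYPES:
        problems.append(f"unknown library_type: {record.library_type!r}")
    # Sanity check: reject values that are not physically plausible.
    if not 0 < record.insert_size_bp < 10_000:
        problems.append(f"implausible insert_size_bp: {record.insert_size_bp}")
    return problems


print(validate(LibraryRecord("SAMPLE_0001", "WG", 350)))
print(validate(LibraryRecord("", "XX", -5)))
```

Returning every problem at once, instead of stopping at the first, lets the person entering the data fix the whole record in one pass.
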
  23. Automation
      • Deciders:
          • take in metadata and data
          • decide what analysis to perform using rules (if-then, map-reduce, etc.)
          • check whether data has previously been analyzed
          • check whether the system is at capacity
      • Difficult to write
          • especially when metadata is poor
          • software needs to understand all metadata
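
The four decider responsibilities on slide 23 can be sketched as a single function. This is an illustration, not GSI's implementation: the `Lane` record, the `RULES` table, and the workflow names are hypothetical, and a simple if-then lookup stands in for whatever rule engine a real decider uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    """Hypothetical unit of sequencing data plus its metadata."""
    lane_id: str
    library_type: str  # hypothetical codes such as "WG" or "RNA"


# Hypothetical if-then rules: library type -> workflow to run.
RULES = {"WG": "wg_alignment", "EX": "exome_alignment", "RNA": "rna_alignment"}


def decide(lanes, already_analyzed, free_slots):
    """Return the (lane_id, workflow) pairs to launch now."""
    launches = []
    for lane in lanes:
        workflow = RULES.get(lane.library_type)
        if workflow is None:
            # Metadata the rules do not understand: fail loudly, do not guess.
            raise ValueError(f"{lane.lane_id}: no rule for {lane.library_type!r}")
        if (lane.lane_id, workflow) in already_analyzed:
            continue  # previously analyzed, nothing to do
        if len(launches) >= free_slots:
            break  # system is at capacity; remaining lanes wait for the next pass
        launches.append((lane.lane_id, workflow))
    return launches


lanes = [Lane("L1", "WG"), Lane("L2", "RNA"), Lane("L3", "EX")]
print(decide(lanes, already_analyzed={("L1", "wg_alignment")}, free_slots=1))
# prints [('L2', 'rna_alignment')]
```

The `ValueError` branch is where poor metadata hurts: every value the rules do not cover stops the decider until a person steps in.
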
  24. Monitoring
      • Track everything before you need it
      • Silence on success
          • but make sure you detect when systems go offline!
      • Dashboards and tickets instead of emails
      • Fail fast, fail loudly
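
"Silence on success" has a catch that slide 24 points out: a system that has gone offline is just as silent as a healthy one. One common way to tell them apart is a heartbeat check, sketched below. The service names and the 10-minute threshold are hypothetical, and the sketch prints where a real setup would open a ticket.

```python
import time


def stale_services(last_heartbeat, now, max_silence_s=600.0):
    """Return names of services whose last heartbeat is older than max_silence_s seconds."""
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > max_silence_s
    )


now = time.time()
# Hypothetical services and the time each one last checked in.
heartbeats = {"sequencer-sync": now - 30, "workflow-engine": now - 3600}
for name in stale_services(heartbeats, now):
    print(f"ALERT: no heartbeat from {name}")
```

Healthy services produce no output at all, so anything this check prints is worth a ticket.
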
  25. How machines are performing...
  26. Whether I should worry about disk...
  27. Tickets and alerts instead of emails
      Automatic, of course
  28. Workflows
      • Workflow systems:
          • take in input data and parameters
          • run the data through analysis steps
          • produce data
      • Analysis steps:
          • Good research software
          • Absolutely critical and integral to all other systems discussed so far
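
Slide 28's description of a workflow system (inputs and parameters in, analysis steps in sequence, data out) can be sketched in a few lines. This toy runner is an assumption-laden illustration, not any particular workflow engine: the two step functions are hypothetical stand-ins for real research software.

```python
def run_workflow(steps, data, params):
    """Run each (name, function) step in order, feeding each output to the next step."""
    for name, step in steps:
        try:
            data = step(data, params)
        except Exception as exc:
            # Fail fast: stop at the first broken step and say which one it was.
            raise RuntimeError(f"step {name!r} failed: {exc}") from exc
    return data


# Hypothetical analysis steps standing in for real research software.
def drop_short_reads(reads, params):
    return [read for read in reads if len(read) >= params["min_length"]]


def count_reads(reads, params):
    return {"read_count": len(reads)}


result = run_workflow(
    [("filter", drop_short_reads), ("count", count_reads)],
    data=["ACGT", "AC", "ACGTA"],
    params={"min_length": 4},
)
print(result)  # prints {'read_count': 2}
```

The runner only orders the steps and reports failures; the quality of the result still depends on the software inside each step.
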
  29. Having good software is not enough
      Diagram labels: Monitoring, Metadata validation, Automation, Workflow systems, Software
  30. Turnaround time
      • Sequencing to alignment has dropped from about 20 days to 7 days for HiSeq whole genome lanes
      • Anecdotal: variability reduced, hands-on time reduced
  31. Current/future work
      • Automation
          • make it simpler
          • more complete
          • (never going to be done)
      • Research is a changing field by nature
      • Flexibility versus robustness
      • Hot new things: sc-seq, ct-seq, immuno-onco-genomics
      • Underlying assumptions change over time
  32. We’re investing in good infrastructure
      Turonno! entry-level! Look for GSI! Software dev! report to me!! Apply! http://bit.ly/oicr-gsi-dev
  33. Conclusions
      • The FUTURE is
          • hundreds of thousands of samples
          • expediting clinical results
          • no loss of reproducibility or quality
      • Everyone needs a little production-style infrastructure, even if you’re not production
          • control your metadata!
          • automate!
          • standardize your analysis!
          • monitor all the things!
  34. Acknowledgements
      To past and current GSI members: Lars Jorgensen, Lawrence Heisler, Michael Laszloffy, Heather Armstrong, Dillan Cooke, Andre Masella, Iain Bancarz, Timothy Beck, Peter Ruzanov, Prisni Rath, Jonathan Torchia, Richard Jovelin, Yogi Sundaravadanam, Xuemei Luo, and many excellent co-op students.
  35. OICR Technology Programs enable cancer research in Ontario by providing value-added expertise, training and access to high-end infrastructure and technologies. Find out more at oicr.on.ca. This project was supported by the OICR Adaptive Oncology Program.
  36. Funding for the Ontario Institute for Cancer Research is provided by the Government of Ontario
  37. Attributions
      • Jensflorian, CC BY-SA 3.0
      • Timothy Dilich - Noun Project, CC0
      • http://andrewjrobinson.github.io/training_docs/tutorials/variant_calling_galaxy_1/variant_calling_galaxy_1/
      • By David pogrebeshsky [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons
      • Star Trek ® Paramount Pictures
  38. GSI on the web
      • https://github.com/oicr-gsi
      • https://gsi.oicr.on.ca
