
Joe Parker: Benchmarking Bioinformatics

396 views


Talk presented at Benchmarking 2016 symposium, KCL, London, 20th April 2016.

Covers performance measurement strategies (or lack of) in common genomics workflows.

Published in: Science


  1. Why Aren't We Benchmarking Bioinformatics? Joe Parker, Early Career Research Fellow (Phylogenomics), Department of Biodiversity Informatics and Spatial Analysis, Royal Botanic Gardens, Kew, Richmond TW9 3AB, joe.parker@kew.org
  2. Outline • Introduction • Brief history of bioinformatics • Benchmarking in bioinformatics • Case study 1: Typical benchmarking across environments • Case study 2: Mean-variance relationship for repeated measures • Conclusions: implications for statistical genomics
  3. A (very) brief history of bioinformatics: Kluge & Farris (1969) Syst. Zool. 18:1-32
  4. A (very) brief history of bioinformatics: Stewart et al. (1987) Nature 330:401-404
  5. A (very) brief history of bioinformatics: ENCODE Consortium (2012) Nature 489:57-74
  6. A (very) brief history of bioinformatics: Kluge & Farris (1969) Syst. Zool. 18:1-32
  7. Benchmarking to biologists • Benchmarking is seen as a comparative process, i.e. 'which software's best?' / 'which platform?' • Benchmarking application logic / profiling is largely unknown • Environments / runtimes are generally either assumed to be identical, or else loosely categorised into 'laptops vs clusters'
  8. Case Study 1, aka 'Which program's the best?'
  9. Bioinformatics environments are very heterogeneous
     • Laptop: portable; very costly form-factor; maté? beer?
     • Raspberry Pi: low cost and energy (& power) consumption; highly portable; hackable form-factor
     • Clusters: not portable; setup costs
     • The cloud: power closely tied to budget (compute is only as limited as your budget); almost infinitely scalable; you have to have a connection to get data up there (and down!); fiddly setup
  10. Benchmarking to biologists
  11. Comparison
      System             | Arch | CPU type, clock (GHz) | Cores | RAM (GB) | HDD (GB @ type)
      Haemodorum         | i686 | Xeon E5620 @ 2.4      | 8     | 33       | 1000 @ SATA
      Raspberry Pi 2 B+  | ARM  | ARMv7 @ 1.0           | 1     | 1        | 8 @ flash card
      Macbook Pro (2011) | x64  | Core i7 @ 2.2         | 4     | 8        | 250 @ SSD
      EC2 m4.10xlarge    | x64  | Xeon E5 @ 2.4         | 40    | 160      | 320 @ SSD
  12. Reviewing / comparing new methods • Biological problems often scale horribly and unpredictably • Algorithm analyses only go so far • So empirical measurements on different problem sets are needed to predict how problems will scale… (a scaling-measurement sketch follows the transcript)
  13. Workflow Setup: BLAST 2.2.30, CEGMA genes, short reads, MUSCLE 3.8.31, RAxML 7.2.8+ (a scripted sketch follows the transcript)
      • Set up the workflow, binaries, and reference / alignment data; deploy to machines.
      • Protein-protein BLAST the reads (from the MG-RAST repository, Bass Strait oil field) against 458 core eukaryote genes from CEGMA, using the max. num_threads available; keep only the top hits.
      • Append the top-hit sequences to the CEGMA alignments.
      • For each alignment: align in MUSCLE using default parameters, then infer a de novo phylogeny in RAxML under Dayhoff, with a random starting tree and max. PTHREADS.
      • Output and parse times.
  14. Results - BLASTP
  15. Results - RAxML
  16. Case Study 2, aka 'What the hell's a random seed?' (a repeated-runs sketch follows the transcript)
  17. Properly benchmarking workflows • Limiting-step analyses are often ignored – in many workflows the limiting step might actually be data cleaning / parsing / transformation • Or (the most common error) inefficient iteration • Or even disk I/O! (a per-step timing sketch follows the transcript)
  18. Workflow benchmarking is very rare • Many bioinformatics workflows / pipelines are limited at odd steps, parsing, etc. • http://beast.bio.ed.ac.uk/benchmarks • Benchmarks in many bioinformatics papers – more harm than good?
  19. Conclusion • Biologists and error • Current practice • Help! • Interesting challenges too… Thanks: RBG Kew, BI&SA, Mike Chester
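
The sketches referenced in slides 12, 13, 16 and 17 follow. They are illustrative only, not code from the talk. First, the kind of empirical scaling measurement slide 12 argues for: time a tool on inputs of increasing size and fit a power law on log-log axes. The tool invocation, file names and input sizes are placeholder assumptions.

```python
# Sketch: empirical scaling measurement for a command-line tool.
# Assumptions: 'blastp' is on PATH, the database exists, and the per-size
# query files have been prepared; all names and sizes are placeholders.
import subprocess
import time
import numpy as np

input_sizes = [1_000, 5_000, 10_000, 50_000]             # e.g. reads per query file
query_files = [f"reads_{n}.fasta" for n in input_sizes]  # hypothetical pre-made subsets

runtimes = []
for query in query_files:
    start = time.perf_counter()
    subprocess.run(
        ["blastp", "-query", query, "-db", "cegma_proteins",
         "-outfmt", "6", "-out", query + ".tsv"],
        check=True,
    )
    runtimes.append(time.perf_counter() - start)

# Fit runtime ~ a * n^b on log-log axes; b estimates the empirical scaling exponent.
b, log_a = np.polyfit(np.log(input_sizes), np.log(runtimes), 1)
print(f"empirical scaling exponent ~ {b:.2f}")
```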
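Next, a rough scripted version of the slide-13 workflow (BLASTP, top hits appended to the CEGMA alignments, MUSCLE, RAxML) with wall-clock timing per step. File names, the BLAST database name and thread counts are assumptions; the append step is only noted in a comment.

```python
# Sketch of the Case Study 1 workflow with per-step timing.
# Assumptions: blastp (BLAST 2.2.30), muscle (3.8.31) and raxmlHPC-PTHREADS are
# on PATH; reads.fasta, the cegma_proteins BLAST db and the per-gene alignments
# under cegma_alignments/ are illustrative placeholders.
import glob
import os
import subprocess
import time

THREADS = os.cpu_count() or 1
times = {}

def timed(label, cmd):
    """Run a command and accumulate its wall-clock time under `label`."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    times[label] = times.get(label, 0.0) + (time.perf_counter() - start)

# 1. Protein-protein BLAST of the reads against the CEGMA gene set; keep the top hit only.
timed("blastp", [
    "blastp", "-query", "reads.fasta", "-db", "cegma_proteins",
    "-max_target_seqs", "1", "-outfmt", "6",
    "-num_threads", str(THREADS), "-out", "top_hits.tsv",
])

# 2. For each CEGMA alignment with its top-hit sequences appended (append step not
#    shown here): re-align with MUSCLE, then infer a de novo ML tree in RAxML (Dayhoff).
for aln in glob.glob("cegma_alignments/*.with_hits.fasta"):
    gene = os.path.basename(aln).split(".")[0]
    timed("muscle", ["muscle", "-in", aln, "-out", f"{gene}.aln.fasta"])
    timed("raxml", [
        "raxmlHPC-PTHREADS", "-T", str(THREADS),
        "-s", f"{gene}.aln.fasta", "-n", gene,
        "-m", "PROTGAMMADAYHOFF", "-p", "12345",
    ])

# 3. Output and parse times.
for step, secs in times.items():
    print(f"{step}: {secs:.1f} s")
```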
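For Case Study 2 (slide 16), a sketch of a repeated-measures run: the same RAxML inference repeated with different random seeds, so the mean and variance of runtime can be examined rather than a single measurement. The alignment name, thread count and seed list are placeholders.

```python
# Sketch: repeated runs of one inference with different random seeds, reporting
# the mean and variance of wall-clock time. All file names are placeholders.
import statistics
import subprocess
import time

seeds = [11, 23, 42, 101, 271, 314, 577, 999]  # arbitrary seeds, one per replicate
runtimes = []

for seed in seeds:
    start = time.perf_counter()
    subprocess.run(
        ["raxmlHPC-PTHREADS", "-T", "4",
         "-s", "alignment.fasta", "-n", f"rep_seed{seed}",
         "-m", "PROTGAMMADAYHOFF", "-p", str(seed)],
        check=True,
    )
    runtimes.append(time.perf_counter() - start)

print(f"mean runtime:     {statistics.mean(runtimes):.1f} s")
print(f"runtime variance: {statistics.variance(runtimes):.2f} s^2")
```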
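Finally, slide 17's point, that the limiting step is often parsing, transformation or disk I/O rather than the headline analysis, is the kind of thing a coarse per-step timer makes visible. A sketch with made-up stage names and a placeholder input file:

```python
# Sketch: coarse per-step timing of a workflow to find the limiting step.
# The stages below are hypothetical stand-ins for real pipeline steps.
import time
from contextlib import contextmanager

step_times = {}

@contextmanager
def step(name):
    """Time a named block of the workflow and record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        step_times[name] = step_times.get(name, 0.0) + (time.perf_counter() - start)

def run_pipeline():
    with step("read input (disk I/O)"):
        raw = open("reads.fasta").read()            # placeholder input file
    with step("parse / clean"):
        records = [line for line in raw.splitlines() if not line.startswith(">")]
    with step("analysis"):
        total_bases = sum(len(r) for r in records)  # stand-in for the real computation
    return total_bases

run_pipeline()
for name, secs in sorted(step_times.items(), key=lambda kv: -kv[1]):
    print(f"{name:<25} {secs:.3f} s")
```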
