Joe Parker: Why Aren't We Benchmarking Bioinformatics?

  1. Why Aren't We Benchmarking Bioinformatics? Joe Parker Early Career Research Fellow (Phylogenomics) Department of Biodiversity Informatics and Spatial Analysis Royal Botanic Gardens, Kew Richmond TW9 3AB joe.parker@kew.org
  2. Outline • Introduction • Brief history of bioinformatics • Benchmarking in bioinformatics • Case study 1: Typical benchmarking across environments • Case study 2: Mean-variance relationship for repeated measures • Conclusions: implications for statistical genomics
  3. A (very) brief history of bioinformatics Kluge & Farris (1969) Syst. Zool. 18:1-32
  4. A (very) brief history of bioinformatics Stewart et al. (1987) Nature 330:401-404
  5. A (very) brief history of bioinformatics ENCODE Consortium (2012) Nature 489:57-74
  6. A (very) brief history of bioinformatics Kluge & Farris (1969) Syst. Zool. 18:1-32
  7. Benchmarking to biologists • Benchmarking is seen as a comparative process, i.e. ‘which software’s best?’ / ‘which platform?’ • Benchmarking application logic / profiling is largely unknown • Environments / runtimes are generally either assumed to be identical, or else loosely categorised into ‘laptops vs clusters’
  8. Case Study 1 aka ‘Which program’s the best?’
  9. Bioinformatics environments are very heterogeneous • Laptop: – Portable – Very costly form-factor – Maté? Beer? • Raspberry Pi: – Low cost, energy (& power) use – Highly portable – Hackable form-factor • Clusters: – Not portable – Setup costs • The cloud: – Power closely linked to budget (and just as limited) – Almost infinitely scalable – Have to have a connection to get data up there (and down!) – Fiddly setup
  10. Benchmarking to biologists
  11. Comparison
  System | Arch | CPU type, clock (GHz) | Cores | RAM (Gb) | HDD (Gb)
  Haemodorum | i686 | Xeon E5620 @ 2.4 | 8 | 33 | 1000 @ SATA
  Raspberry Pi 2 B+ | ARM | ARMv7 @ 1.0 | 1 | 1 | 8 @ flash card
  Macbook Pro (2011) | x64 | Core i7 @ 2.2 | 4 | 8 | 250 @ SSD
  EC2 m4.10xlarge | x64 | Xeon E5 @ 2.4 | 40 | 160 | 320 @ SSD
  12. Reviewing / comparing new methods • Biological problems often scale horribly unpredictably • Algorithmic complexity analyses are rarely understood or applied • So empirical measurements on different problem sets are really the only way to predict how problems will scale… (see the sketch below)
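A minimal sketch of what such an empirical scaling measurement might look like: time the same tool on inputs of increasing size and fit the slope on a log-log scale to estimate an empirical complexity exponent. The tool name and input file names here are placeholder assumptions, not taken from the talk.

```python
# Hypothetical sketch: estimate how runtime scales with input size by
# timing a tool on progressively larger inputs and fitting a power law.
import math
import subprocess
import time

sizes = [1_000, 10_000, 100_000]   # illustrative input sizes (e.g. reads)
runtimes = []

for n in sizes:
    start = time.perf_counter()
    # "mytool" and "reads_{n}.fasta" are placeholders for whichever
    # program and subsampled inputs you are actually benchmarking.
    subprocess.run(["mytool", f"reads_{n}.fasta"], check=True)
    runtimes.append(time.perf_counter() - start)

# Least-squares slope of log(time) vs log(size) approximates the
# exponent k in time ~ size**k (k ~ 1 linear, k ~ 2 quadratic, ...).
xs = [math.log(n) for n in sizes]
ys = [math.log(t) for t in runtimes]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
print(f"empirical scaling exponent k ~= {k:.2f}")
```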
  13. Workflow • Setup: set up workflow, binaries (BLAST 2.2.30, MUSCLE 3.8.31, RAxML 7.2.8+), and reference / alignment data (CEGMA genes); deploy to machines • BLAST: protein-protein blast short reads (from the MG-RAST repository, Bass Strait oil field) against 458 core eukaryote genes from CEGMA; keep only top hits; use max. num_threads available • Concatenate: append top-hit sequences to the CEGMA alignments • MUSCLE: for each alignment, align in MUSCLE using default parameters • RAxML: infer de novo phylogeny in RAxML under Dayhoff, with a random starting tree and max. PTHREADS • Output and parse times (a timing sketch follows)
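A minimal sketch of how each stage might be timed, assuming stock command-line invocations for BLAST+, MUSCLE 3.8 and RAxML; the exact flags, database and file names used in the talk are not given, so those below are illustrative.

```python
# Minimal sketch: time each stage of the BLAST -> MUSCLE -> RAxML workflow.
# Binary names, file paths, and thread counts are illustrative assumptions.
import subprocess
import time

def timed(label, cmd):
    """Run a command, print its wall-clock time, and return it."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    print(f"{label}\t{elapsed:.1f}s")
    return elapsed

threads = "8"  # set per machine (e.g. 1 on the Pi, 40 on the EC2 instance)

timed("blastp", ["blastp", "-query", "reads.fasta", "-db", "cegma_458",
                 "-max_target_seqs", "1", "-num_threads", threads,
                 "-outfmt", "6", "-out", "hits.tsv"])
timed("muscle", ["muscle", "-in", "cegma_plus_hits.fasta",
                 "-out", "aligned.fasta"])
timed("raxml", ["raxmlHPC-PTHREADS", "-T", threads, "-m", "PROTGAMMADAYHOFF",
                "-p", "12345", "-s", "aligned.fasta", "-n", "bench"])
```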
  14. Results - BLASTP
  15. Results - RAxML
  16. Case Study 2 aka ‘What the hell’s a random seed?’
  17. Properly benchmarking workflows • Wall-clock time of the headline tool is only one part of the process • The limiting step in many workflows might actually be data cleaning / parsing / transformation • Or (most common error) inefficiently iterating over parameterisations • Or even disk I/O! (see the sketch below)
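One cheap way to spot an I/O-bound (rather than CPU-bound) step is to compare CPU time against wall-clock time for each stage: a stage whose CPU time is far below its wall time is mostly waiting, often on the disk. A minimal, self-contained sketch, where the parsing function is a stand-in for whatever stage is under suspicion:

```python
# Compare wall-clock vs CPU time for a pipeline stage. A large gap
# suggests the stage is I/O-bound rather than compute-bound.
import time

def parse_stage(path):
    """Stand-in parsing step: count sequences in a FASTA file."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count

wall_start = time.perf_counter()
cpu_start = time.process_time()
n = parse_stage("reads.fasta")   # placeholder input file
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

print(f"parsed {n} sequences: wall={wall:.2f}s cpu={cpu:.2f}s "
      f"(cpu/wall={cpu / wall:.0%})")
```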
  18. Workflow benchmarking very rare • Many bioinformatics workflows / pipelines are limited at odd steps, parsing etc • http://beast.bio.ed.ac.uk/benchmarks is a rare published example • Many bioinformatics papers benchmark the wrong things, or not at all • More harm than good? (see the repeated-measures sketch below)
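Case study 2 in the outline concerns the mean-variance relationship for repeated measures: a single timing of a stochastic tool (different random seeds, different random starting trees) says little about reproducibility. A minimal sketch of repeating an analysis across seeds and summarising the spread, assuming an illustrative RAxML invocation rather than the exact one from the talk:

```python
# Hypothetical sketch: repeat a stochastic analysis across random seeds
# and report the mean and variance of wall-clock times, not a single run.
import statistics
import subprocess
import time

def run_once(seed):
    """Time one RAxML run; -p sets the seed for the random starting tree."""
    start = time.perf_counter()
    subprocess.run(["raxmlHPC-PTHREADS", "-T", "8", "-m", "PROTGAMMADAYHOFF",
                    "-p", str(seed), "-s", "aligned.fasta",
                    "-n", f"rep{seed}"], check=True)
    return time.perf_counter() - start

times = [run_once(seed) for seed in range(1, 21)]
mean = statistics.mean(times)
var = statistics.variance(times)
print(f"n={len(times)} mean={mean:.1f}s variance={var:.2f} "
      f"CV={var ** 0.5 / mean:.1%}")
```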
  19. Conclusion • Biologists and error • Current practice • Help! • Interesting challenges too… Thanks: RBG Kew, BI&SA, Mike Chester

Editor's Notes

  1. NB: slot is 30 mins INCLUDING questions. Bioinformatics (computational manipulation, analysis and modelling of biological and biochemical data) is a substantial area of primary research activity as well as a growing market worth up to $13bn US (2020) by some estimates[1]. In particular, exponential growth in the capacity and speed of DNA sequencers has led to a critical shortfall in analytical tools, often termed a 'DNA deluge'[2]. Despite this critical need and substantial and growing investment in bioinformatics software, many best practices in software design are neglected, ignored, or even unacknowledged. I will illustrate the problem with an overview of two case studies in current bioinformatics projects, and present data showing worrying reproducibility deficits in commonly used software. [1]http://www.marketsandmarkets.com/PressReleases/bioinformatics-market.asp [2]http://spectrum.ieee.org/biomedical/devices/the-dna-data-deluge
  2. I am basically an empirical scientist, but driven by methods. Replication -> empirical investigation. This is a big gap in bioinformatics at the moment.
  3. Biological problems often scale horribly unpredictably. On top of which, hardly any biologists and few bioinformaticians (me included) understand algorithmic complexity / analysis properly. So empirical measurements on different problem sets are really the only way to predict how problems will scale…
  4. Wall-clock time is only one part of the process. The slowest (limiting) steps in many workflows might actually be data cleaning / parsing / transformation. Or inefficiently iterating over parameterisations etc. Or even disk I/O!
  5. Many bioinformatics workflows / pipelines are limiting at odd steps (parsing etc); see http://beast.bio.ed.ac.uk/benchmarks and many bioinformatics papers. Biologists are benchmarking the wrong things, or not at all.