Joe Parker: Benchmarking Bioinformatics


Talk presented at Benchmarking 2016 symposium, KCL, London, 20th April 2016.

Covers performance measurement strategies (or lack of) in common genomics workflows.


  1. Why Aren't We Benchmarking Bioinformatics? Joe Parker, Early Career Research Fellow (Phylogenomics), Department of Biodiversity Informatics and Spatial Analysis, Royal Botanic Gardens, Kew, Richmond TW9 3AB, joe.parker@kew.org
  2. Outline • Introduction • Brief history of Bioinformatics • Benchmarking in Bioinformatics • Case study 1: Typical benchmarking across environments • Case study 2: Mean-variance relationship for repeated measures • Conclusions: implications for statistical genomics
  3. A (very) brief history of bioinformatics. Kluge & Farris (1969) Syst. Zool. 18:1-32
  4. A (very) brief history of bioinformatics. Stewart et al. (1987) Nature 330:401-404
  5. A (very) brief history of bioinformatics. ENCODE Consortium (2012) Nature 489:57-74
  6. A (very) brief history of bioinformatics. Kluge & Farris (1969) Syst. Zool. 18:1-32
  7. Benchmarking to biologists • Benchmarking as a comparative process, i.e. ‘which software’s best?’ / ‘which platform?’ • Benchmarking application logic / profiling unknown • Environments / runtimes generally either assumed to be identical, or else loosely categorised into ‘laptops vs clusters’
  8. Case Study 1, aka ‘Which program’s the best?’
  9. Bioinformatics environments are very heterogeneous • Laptop: portable; very costly form-factor; Maté? Beer? • Raspberry Pi: low cost, low energy (& power); highly portable; hackable form-factor • Clusters: not portable; setup costs • The cloud: power closely linked to budget (and limited by it); almost infinitely scalable; have to have a connection to get data up there (and down!); fiddly setup
  10. Benchmarking to biologists
  11. Comparison
      System             | Arch | CPU type, clock (GHz) | Cores | RAM (GB) | HDD (GB)
      Haemodorum         | i686 | Xeon E5620 @ 2.4      | 8     | 33       | 1000 @ SATA
      Raspberry Pi 2 B+  | ARM  | ARMv7 @ 1.0           | 1     | 1        | 8 @ flash card
      Macbook Pro (2011) | x64  | Core i7 @ 2.2         | 4     | 8        | 250 @ SSD
      EC2 m4.10xlarge    | x64  | Xeon E5 @ 2.4         | 40    | 160      | 320 @ SSD
  12. Reviewing / comparing new methods • Biological problems often scale horribly unpredictably • Algorithmic complexity analysis is rarely done (or properly understood) • So empirical measurements on different problem sets are the only way to predict how problems will scale…
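The empirical approach the slide advocates can be sketched as a power-law fit: time a tool at several problem sizes, fit t ≈ c·nᵏ on log-log axes, and use the exponent k to extrapolate. A minimal sketch; the sizes and runtimes below are made-up illustrative numbers, not measurements from the talk.

```python
import math

# Hypothetical measurements: problem size n vs wall-clock seconds t.
sizes    = [100, 200, 400, 800, 1600]
runtimes = [0.8, 3.1, 12.5, 50.0, 201.0]  # roughly quadratic, for illustration

# Fit log t = k * log n + log c by ordinary least squares on the logs.
xs = [math.log(n) for n in sizes]
ys = [math.log(t) for t in runtimes]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
c = math.exp(my - k * mx)

print(f"fitted exponent k ~ {k:.2f}")             # close to 2 -> quadratic scaling
print(f"predicted time at n=3200: {c * 3200 ** k:.0f} s")
```

The fitted exponent, not any single timing, is what says whether a dataset twice the size will take two, four, or two hundred times as long.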
  13. Workflow. Setup: BLAST 2.2.30, CEGMA genes, short reads; concatenate hits to CEGMA alignments; MUSCLE 3.8.31; RAxML 7.2.8+. Set up workflow, binaries, and reference / alignment data; deploy to machines. Protein-protein BLAST the reads (from the MG-RAST repository, Bass Strait oil field) against 458 core eukaryote genes from CEGMA; keep only top hits; use max. num_threads available. Append top-hit sequences to the CEGMA alignments. For each: align in MUSCLE using default parameters; infer a de novo phylogeny in RAxML under Dayhoff, with a random starting tree and max. PTHREADS. Output and parse times.
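The "output and parse times" step amounts to wrapping each external tool invocation in a timer. A minimal sketch; the real workflow calls blastp, muscle, and the RAxML binary, but the demo below substitutes a portable no-op command so it runs anywhere.

```python
import subprocess
import sys
import time

def timed_run(argv):
    """Run an external tool and return (elapsed_seconds, returncode).
    In the workflow above, argv would be a blastp / muscle / RAxML
    command line; here we only assume it is a valid argv list."""
    t0 = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True)
    return time.perf_counter() - t0, proc.returncode

# Demo with a no-op Python process standing in for a real binary, e.g.
#   timed_run(["blastp", "-query", "reads.fa", "-db", "cegma", ...])
elapsed, rc = timed_run([sys.executable, "-c", "pass"])
print(f"{elapsed:.3f}s exit={rc}")
```

Logging the per-invocation times (rather than one wall-clock figure for the whole pipeline) is what makes the cross-environment comparison on the next slides possible.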
  14. Results - BLASTP
  15. Results - RAxML
  16. Case Study 2, aka ‘What the hell’s a random seed?’
  17. Properly benchmarking workflows • Wall-clock time of the headline analysis is only one part of the process • The limiting steps in many workflows might actually be data cleaning / parsing / transformation • Or (most common error) inefficient iteration • Or even disk I/O!
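The per-step breakdown this slide calls for can be sketched with a simple wall-clock harness: time every stage of a pipeline, not just the headline tool, so a slow parsing or I/O step cannot hide. The stage names and bodies below are toy stand-ins, not the talk's actual workflow:

```python
import time

def profile(stages):
    """Run each (name, fn) stage in order, recording wall-clock seconds."""
    timings = {}
    for name, fn in stages:
        t0 = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - t0
    return timings

# Toy pipeline: the 'analysis' step is not necessarily the limiting one.
timings = profile([
    ("parse_input",  lambda: [ln.split("\t") for ln in ("a\tb\n" * 50_000).splitlines()]),
    ("analysis",     lambda: sum(i * i for i in range(100_000))),
    ("write_output", lambda: "".join(str(i) for i in range(50_000))),
])
limiting = max(timings, key=timings.get)
for name, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:>12}: {secs * 1000:.1f} ms")
print(f"limiting step: {limiting}")
```

Which stage dominates is machine- and data-dependent, which is exactly the slide's point: it has to be measured, not assumed.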
  18. Workflow benchmarking is very rare • Many bioinformatics workflows / pipelines are limiting at odd steps, parsing etc. • http://beast.bio.ed.ac.uk/benchmarks • Many bioinformatics papers benchmark the wrong things, or not at all • More harm than good?
  19. Conclusion • Biologists and error • Current practice • Help! • Interesting challenges too… Thanks: RBG Kew, BI&SA, Mike Chester

Editor's Notes

  • NB slot is 30mins INCLUDING questions

    Bioinformatics (computational manipulation, analysis and modelling of biological and biochemical data) is a substantial area of primary research activity as well as a growing market worth up to $13bn US (2020) by some estimates[1]. In particular, exponential growth in the capacity and speed of DNA sequencers has led to a critical shortfall in analytical tools often termed a 'DNA deluge'[2].

    Despite this critical need and substantial and growing investment in bioinformatics software, many best practices in software design are neglected, ignored, or even unacknowledged.

    I will illustrate the problem with an overview of two case studies in current bioinformatics projects, and present data showing worrying reproducibility deficits in commonly-used software.

    [1]http://www.marketsandmarkets.com/PressReleases/bioinformatics-market.asp
    [2]http://spectrum.ieee.org/biomedical/devices/the-dna-data-deluge
  • I am basically an empirical scientist
    But driven by methods
    Replication -> empirical investigation
    This is a big gap in bioinformatics at the moment
  • Biological problems often scale horribly unpredictably
    On top of which hardly any biologists and few bioinformaticians (me included) understand algorithmic complexity / analysis properly
    So empirical measurements on different problem sets really the only way to predict how problems will scale…

  • Wall-clock time is only one part of the process
    Slowest (limiting) steps in many workflows might actually be data cleaning / parsing / transformation
    Or inefficiently iterating over parameterizations etc
    Or even disk I/O!
  • Many bioinformatics workflows / pipelines limiting at odd steps, parsing etc
    http://beast.bio.ed.ac.uk/benchmarks
    Many e.g. bioinformatics papers
    Biologists are benchmarking the wrong things, or not at all