Why Aren't We Benchmarking Bioinformatics?
Joe Parker
Early Career Research Fellow (Phylogenomics)
Department of Biodiversity Informatics and Spatial Analysis
Royal Botanic Gardens, Kew
Richmond TW9 3AB
joe.parker@kew.org
Outline
• Introduction
• Brief history of bioinformatics
• Benchmarking in bioinformatics
• Case study 1: Typical benchmarking across environments
• Case study 2: Mean-variance relationship for repeated measures
• Conclusions: implications for statistical genomics
A (very) brief history of bioinformatics
Kluge & Farris (1969) Syst. Zool. 18:1-32
A (very) brief history of bioinformatics
Stewart et al. (1987) Nature 330:401-404
A (very) brief history of bioinformatics
ENCODE Consortium (2012) Nature 489:57-74
A (very) brief history of bioinformatics
Kluge & Farris (1969) Syst. Zool. 18:1-32
Benchmarking to biologists
• Benchmarking as a comparative process
• i.e. ‘which software’s best?’ / ‘which platform?’ (a timing sketch follows below)
• Benchmarking application logic (profiling) is largely unknown
• Environments / runtimes generally either assumed to be identical, or else loosely categorised into ‘laptops vs clusters’
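For illustration, a minimal sketch of that comparative ‘which software’s best?’ style of benchmark: time two command-line tools on the same input. The tool names, flags and input file here are placeholder assumptions, not anything from the talk.

```python
# Hypothetical head-to-head wall-clock comparison of two aligners.
# Tool names, flags and "input.fasta" are illustrative assumptions.
import subprocess
import time

CANDIDATES = {
    "muscle": ["muscle", "-in", "input.fasta", "-out", "muscle.afa"],
    "mafft":  ["mafft", "input.fasta"],
}

def wall_clock(cmd, runs=3):
    """Return wall-clock times (seconds) for `runs` executions of cmd."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times

for name, cmd in CANDIDATES.items():
    t = wall_clock(cmd)
    print(f"{name}: mean {sum(t) / len(t):.2f}s over {len(t)} runs")
```

Note what this answers: ‘which is faster on this machine, on this input?’ It says nothing about where the time actually goes, which is the profiling gap the later slides return to.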
Bioinformatics environments are very heterogeneous
• Laptop:
– Portable
– Very costly form-factor
– Maté? Beer?
• Raspberry Pi:
– Low cost, low energy (& power)
– Highly portable
– Hackable form-factor
• Clusters:
– Not portable; setup costs
• The cloud:
– Power closely linked to budget (and as limited as it)
– Almost infinitely scalable
– Need a connection to get data up there (and down!)
– Fiddly setup
Comparison
System               Arch   CPU type @ clock (GHz)   Cores   RAM (GB)   Storage (GB, type)
Haemodorum           i686   Xeon E5620 @ 2.4         8       33         1000, SATA
Raspberry Pi 2 B+    ARM    ARMv7 @ 1.0              1       1          8, flash card
Macbook Pro (2011)   x64    Core i7 @ 2.2            4       8          250, SSD
EC2 m4.10xlarge      x64    Xeon E5 @ 2.4            40      160        320, SSD
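One way to make comparisons like this reproducible is to capture the runtime environment automatically with every benchmark run, so ‘laptop vs cluster’ stops being an unlabelled variable. A minimal sketch; the field set is my assumption of a useful minimum:

```python
# Record the runtime environment alongside every benchmark result.
# Field choices here are an assumed minimal set, not a standard.
import json
import os
import platform

env = {
    "machine": platform.node(),
    "arch": platform.machine(),
    "processor": platform.processor(),
    "os": platform.platform(),
    "cpu_count": os.cpu_count(),
    "python": platform.python_version(),
}
print(json.dumps(env, indent=2))
```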
Reviewing / comparing new methods
• Biological problems often scale horribly / unpredictably
• Hardly any biologists (and few bioinformaticians) understand algorithmic complexity analysis properly
• So empirical measurements on different problem sets are the only practical way to predict how problems will scale… (a fitting sketch follows below)
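To make that concrete, here is a minimal sketch of such an empirical scaling fit: a least-squares power-law fit t ≈ c·n^k through wall-clock timings at a few problem sizes, then a (cautious) extrapolation. All timing numbers below are invented for illustration.

```python
# Hypothetical scaling fit: given wall-clock times at a few problem
# sizes, fit t ≈ c * n^k by least squares in log-log space.
# All timing numbers below are invented for illustration.
import math

# (problem size n, wall-clock seconds t) -- illustrative measurements
observations = [(1_000, 2.1), (10_000, 35.0), (100_000, 520.0)]

xs = [math.log(n) for n, _ in observations]
ys = [math.log(t) for _, t in observations]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Slope of the log-log regression is the empirical scaling exponent k
k = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
c = math.exp(mean_y - k * mean_x)

print(f"empirical scaling: t ~ {c:.3g} * n^{k:.2f}")
print(f"predicted time at n = 1e6: {c * 1e6 ** k / 3600:.1f} h")
```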
Workflow
• Setup: set up workflow, binaries, and reference / alignment data; deploy to machines.
• BLAST 2.2.30: protein-protein BLAST short reads (from the MG-RAST repository, Bass Strait oil field) against 458 core eukaryote genes from CEGMA. Keep only top hits; use max. num_threads available.
• Concatenate: append top-hit sequences to the CEGMA alignments.
• For each alignment:
– align in MUSCLE 3.8.31 using default parameters
– infer a de novo phylogeny in RAxML 7.2.8+ under Dayhoff, with a random starting tree and max. PTHREADS.
• Output and parse times. (A pipeline sketch follows below.)
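A minimal sketch of this pipeline with a wall-clock reading per step. The binary names follow the tools on the slide (blastp from BLAST 2.2.30, muscle, raxmlHPC-PTHREADS), but the paths, database name, seeds and thread counts are assumptions, not the talk's exact configuration.

```python
# Sketch of the slide's pipeline with per-step wall-clock capture.
# Paths, database name, seed and thread counts are assumptions.
import subprocess
import time

def timed(label, cmd):
    """Run a command and print its wall-clock time."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    print(f"{label}\t{time.perf_counter() - start:.1f}s")

timed("blast", ["blastp", "-query", "reads.fasta", "-db", "cegma458",
                "-max_target_seqs", "1", "-num_threads", "8",
                "-outfmt", "6", "-out", "hits.tsv"])
# ...append top-hit sequences to each CEGMA alignment here, then:
timed("muscle", ["muscle", "-in", "cegma_plus_hits.fasta",
                 "-out", "aligned.afa"])
timed("raxml", ["raxmlHPC-PTHREADS", "-s", "aligned.afa", "-n", "run1",
                "-m", "PROTCATDAYHOFF", "-d", "-p", "12345", "-T", "8"])
```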
Properly benchmarking workflows
• Typical benchmarks ignore limiting-step analyses
– in many workflows the slowest step might actually be data cleaning / parsing / transformation
• Or (most common error) inefficient iteration over parameterisations
• Or even disk I/O! (a profiling sketch follows below)
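A minimal sketch of per-stage profiling that times every stage, including parsing and disk I/O, rather than just the headline analysis step. The stage bodies here are illustrative stand-ins.

```python
# A per-stage timer so parsing and I/O are measured alongside the
# 'real' analysis step. Stage bodies below are illustrative stand-ins.
import time
from contextlib import contextmanager

timings = []

@contextmanager
def stage(name):
    # Record wall-clock time for everything inside the `with` block.
    start = time.perf_counter()
    yield
    timings.append((name, time.perf_counter() - start))

with stage("read input"):
    data = open("reads.fasta").read()            # disk I/O counts too
with stage("parse / clean"):
    records = [r for r in data.split(">") if r]  # crude FASTA split
with stage("analysis"):
    time.sleep(0.1)                              # stand-in for compute

# Report slowest first -- the limiting step is often not the analysis.
for name, secs in sorted(timings, key=lambda x: -x[1]):
    print(f"{name:15s} {secs:8.3f}s")
```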
Workflow benchmarking very rare
• Many bioinformatics workflows / pipelines are limited at odd steps: parsing, etc.
• A rare public example: http://beast.bio.ed.ac.uk/benchmarks
• Many bioinformatics papers benchmark the wrong things, or nothing at all
• More harm than good?
Conclusion
• Biologists and error
• Current practice
• Help!
• Interesting challenges too…
Thanks:
RBG Kew, BI&SA, Mike Chester
Editor's Notes
NB slot is 30mins INCLUDING questions
Bioinformatics (computational manipulation, analysis and modelling of biological and biochemical data) is a substantial area of primary research activity as well as a growing market worth up to $13bn US (2020) by some estimates[1]. In particular, exponential growth in the capacity and speed of DNA sequencers has led to a critical shortfall in analytical tools, a situation often termed the 'DNA deluge'[2].
Despite this critical need and substantial and growing investment in bioinformatics software, many best practices in software design are neglected, ignored, or even unacknowledged.
I will illustrate the problem with an overview of two case studies in current bioinformatics projects, and present data showing worrying reproducibility deficits in commonly-used software.
[1]http://www.marketsandmarkets.com/PressReleases/bioinformatics-market.asp
[2]http://spectrum.ieee.org/biomedical/devices/the-dna-data-deluge
I am basically an empirical scientist
But driven by methods
Replication -> empirical investigation
This is a big gap in bioinformatics at the moment
Biological problems often scale horribly unpredictably
On top of which hardly any biologists and few bioinformaticians (me included) understand algorithmic complexity / analysis properly
So empirical measurements on different problem sets really the only way to predict how problems will scale…
Wall-clock time is usually measured on only one part of the process
Slowest (limiting) steps in many workflows might actually be data cleaning / parsing / transformation
Or inefficiently iterating over parameterizations etc
Or even disk I/O!
Many bioinformatics workflows / pipelines limiting at odd steps, parsing etc
http://beast.bio.ed.ac.uk/benchmarks
Very few equivalent examples exist in bioinformatics papers
Biologists are benchmarking the wrong things, or not at all