SlideShare a Scribd company logo
1 of 20
Why Aren't We Benchmarking
Bioinformatics?
Joe Parker
Early Career Research Fellow (Phylogenomics)
Department of Biodiversity Informatics and Spatial Analysis
Royal Botanic Gardens, Kew
Richmond TW9 3AB
joe.parker@kew.org
Outline
• Introduction
• Brief history of Bioinformatics
• Benchmarking in Bioinformatics
• Case study 1: Typical benchmarking across
environments
• Case study 2: Mean-variance relationship for
repeated measures
• Conclusions: implication for statistical
genomics
A (very) brief history of
bioinformatics
Kluge & Farris (1969) Syst. Zool. 18:1-32
A (very) brief history of
bioinformatics
Stewart et al. (1987) Nature 330:401-404
A (very) brief history of
bioinformatics
ENCODE Consortium (2012) Nature 489:57-74
A (very) brief history of
bioinformatics
Kluge & Farris (1969) Syst. Zool. 18:1-32
Benchmarking to biologists
• Benchmarking as a comparative process
• i.e. ‘which software’s best?’ / ‘which
platform’
• Benchmarking application logic / profiling
unknown
• Environments / runtimes generally either
assumed to be identical, or else loosely
categorised into ‘laptops vs clusters’
Case Study 1
aka
‘Which program’s the best?’
Bioinformatics environments are
very heterogeneous
• Laptop:
– Portable
– Very costly form-factor
– Maté? Beer?
• Raspi:
– Low: cost, energy (& power)
– Highly portable
– Hackable form-factor
• Clusters:
• Not portable, setup costs
• The cloud:
– Power closely linked to budget (as limited as)
– Almost infinitely scalable
-Have to have a connection to get data up there
(and down!)
– Fiddly setup
Benchmarking to biologists
Comparison
System Arch
CPU type,
clock GHz
cores
RAM Gb /
MHz / type
HDD Gb
Haemodorum i686
Xeon E5620 @
2.4
8 33
1000
@ SATA
Raspberry Pi 2
B+
ARM
ARMv7
@ 1.0
1 1
8
@ flash card
Macbook Pro
(2011)
x64
Core i7
@ 2.2
4 8
250
@ SSD
EC2
m4.10xlarge
x64
Xeon E5
@ 2.4
40 160
320
@ SSD
Reviewing / comparing new methods
• Biological problems
often scale horribly
unpredictably
• Algorithm analyses
• So empirical
measurements on
different problem
sets to predict how
problems will scale…
Workflow
Setup
BLAST
2.2.30
CEGMA
genes
Short reads
Concatenate hits to CEGMA
alignments
Muscle
3.8.31
RAxML
7.2.8+
Set up workflow, binaries, and reference /
alignment data.
Deploy to machines.
Protein-protein blast reads (from MG-RAST
repository, Bass Strait oil field) against 458
core eukaryote genes from CEGMA. Keep
only top hits. Use max. num_threads
available.
Append top hit sequences to CEGMA
alignments.
For each:
Align in MUSCLE using default
parameters
Infer de novo phylogeny in RAxML
under Dayhoff, random starting tree
and max. PTHREADS.
Output and parse times.
Results - BLASTP
Results - RAxML
Case Study 2
aka
‘What the hell’s a random
seed?’
Properly benchmarking workflows
• Ignoring limiting-steps analyses
– in many workflows might actually be data
cleaning / parsing / transformation
• Or (most common error) inefficiently
iterating
• Or even disk I/O!
Workflow benchmarking very rare
• Many bioinformatics workflows / pipelines
limiting at odd steps, parsing etc
• http://beast.bio.ed.ac.uk/benchmarks
• Many e.g. bioinformatics papers
• More harm than good?
Conclusion
• Biologists and error
• Current practice
• Help!
• Interesting challenges
too…
Thanks:
RBG Kew, BI&SA, Mike Chester

More Related Content

What's hot

Reconstructing paleoenvironments using metagenomics
Reconstructing paleoenvironments using metagenomicsReconstructing paleoenvironments using metagenomics
Reconstructing paleoenvironments using metagenomicsRutger Vos
 
Expanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGSExpanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGSIntegrated DNA Technologies
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekGenomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekData Driven Innovation
 
Molecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics PipelineMolecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics PipelineCandy Smellie
 
Next Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesNext Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesChung-Tsai Su
 
Making Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsMaking Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsJoão André Carriço
 
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...VHIR Vall d’Hebron Institut de Recerca
 
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...EMC
 
2nd stripe rust izmir
2nd stripe rust izmir2nd stripe rust izmir
2nd stripe rust izmirICARDA
 
DNA sequencer by kk sahu
DNA sequencer by kk sahu DNA sequencer by kk sahu
DNA sequencer by kk sahu KAUSHAL SAHU
 
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun SequencesTools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun SequencesSurya Saha
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global communityExternalEvents
 
Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...Integrated DNA Technologies
 
Phylogenomic methods for comparative evolutionary biology - University Colleg...
Phylogenomic methods for comparative evolutionary biology - University Colleg...Phylogenomic methods for comparative evolutionary biology - University Colleg...
Phylogenomic methods for comparative evolutionary biology - University Colleg...Joe Parker
 
A decade into Next Generation Sequencing on marine non-model organisms: curre...
A decade into Next Generation Sequencing on marine non-model organisms: curre...A decade into Next Generation Sequencing on marine non-model organisms: curre...
A decade into Next Generation Sequencing on marine non-model organisms: curre...Alexander Jueterbock
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015Torsten Seemann
 
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning Achievements
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning AchievementsIg Nobel Prize 2021 – The 9 Funny and Scientific Winning Achievements
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning AchievementsEurofins Genomics Germany GmbH
 

What's hot (20)

Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
Reconstructing paleoenvironments using metagenomics
Reconstructing paleoenvironments using metagenomicsReconstructing paleoenvironments using metagenomics
Reconstructing paleoenvironments using metagenomics
 
Expanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGSExpanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGS
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekGenomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
 
Molecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics PipelineMolecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics Pipeline
 
Next Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesNext Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and Opportunities
 
Making Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsMaking Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and Annotations
 
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
 
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
 
Ion Torrent Sequencing
Ion Torrent SequencingIon Torrent Sequencing
Ion Torrent Sequencing
 
2nd stripe rust izmir
2nd stripe rust izmir2nd stripe rust izmir
2nd stripe rust izmir
 
DNA sequencer by kk sahu
DNA sequencer by kk sahu DNA sequencer by kk sahu
DNA sequencer by kk sahu
 
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun SequencesTools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
 
Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...
 
Phylogenomic methods for comparative evolutionary biology - University Colleg...
Phylogenomic methods for comparative evolutionary biology - University Colleg...Phylogenomic methods for comparative evolutionary biology - University Colleg...
Phylogenomic methods for comparative evolutionary biology - University Colleg...
 
A decade into Next Generation Sequencing on marine non-model organisms: curre...
A decade into Next Generation Sequencing on marine non-model organisms: curre...A decade into Next Generation Sequencing on marine non-model organisms: curre...
A decade into Next Generation Sequencing on marine non-model organisms: curre...
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
 
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning Achievements
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning AchievementsIg Nobel Prize 2021 – The 9 Funny and Scientific Winning Achievements
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning Achievements
 

Viewers also liked

Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid Detection
Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid DetectionCapture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid Detection
Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid DetectionCFTCC
 
Interpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasetsInterpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasetsJoe Parker
 
Real-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe ParkerReal-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe ParkerJoe Parker
 
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...Joe Parker
 
High Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can KnowHigh Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can KnowBrian Krueger
 
20160308 dtl ngs_focus_group_meeting_slideshare
20160308 dtl ngs_focus_group_meeting_slideshare20160308 dtl ngs_focus_group_meeting_slideshare
20160308 dtl ngs_focus_group_meeting_slidesharehansjansen9999
 
Nanotechnology in biology and medicine
Nanotechnology in biology and medicineNanotechnology in biology and medicine
Nanotechnology in biology and medicineravirajkuber991
 
Reframing Phylogenomics
Reframing PhylogenomicsReframing Phylogenomics
Reframing PhylogenomicsJoe Parker
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.mkim8
 
Driving Connectivity in the Scottish Islands: Droneways and Airmasts
Driving Connectivity in the Scottish Islands: Droneways and AirmastsDriving Connectivity in the Scottish Islands: Droneways and Airmasts
Driving Connectivity in the Scottish Islands: Droneways and Airmasts3G4G
 
An Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case StudiesAn Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case Studies3G4G
 
5G Network Architecture and Design
5G Network Architecture and Design5G Network Architecture and Design
5G Network Architecture and Design3G4G
 
3GPP Standards for the Internet-of-Things
3GPP Standards for the Internet-of-Things3GPP Standards for the Internet-of-Things
3GPP Standards for the Internet-of-ThingsEiko Seidel
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsAnnelies Haegeman
 

Viewers also liked (16)

Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid Detection
Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid DetectionCapture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid Detection
Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid Detection
 
Interpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasetsInterpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasets
 
Real-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe ParkerReal-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe Parker
 
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...
 
High Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can KnowHigh Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can Know
 
20160308 dtl ngs_focus_group_meeting_slideshare
20160308 dtl ngs_focus_group_meeting_slideshare20160308 dtl ngs_focus_group_meeting_slideshare
20160308 dtl ngs_focus_group_meeting_slideshare
 
Oxford Nanopore MinION
Oxford Nanopore MinIONOxford Nanopore MinION
Oxford Nanopore MinION
 
Nanotechnology in biology and medicine
Nanotechnology in biology and medicineNanotechnology in biology and medicine
Nanotechnology in biology and medicine
 
Reframing Phylogenomics
Reframing PhylogenomicsReframing Phylogenomics
Reframing Phylogenomics
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.
 
Driving Connectivity in the Scottish Islands: Droneways and Airmasts
Driving Connectivity in the Scottish Islands: Droneways and AirmastsDriving Connectivity in the Scottish Islands: Droneways and Airmasts
Driving Connectivity in the Scottish Islands: Droneways and Airmasts
 
An Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case StudiesAn Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case Studies
 
5G Network Architecture and Design
5G Network Architecture and Design5G Network Architecture and Design
5G Network Architecture and Design
 
3GPP Standards for the Internet-of-Things
3GPP Standards for the Internet-of-Things3GPP Standards for the Internet-of-Things
3GPP Standards for the Internet-of-Things
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platforms
 

Similar to Joe parker-benchmarking-bioinformatics

Thesis def
Thesis defThesis def
Thesis defJay Vyas
 
Novozymes Enzyme Stability Prediction
Novozymes Enzyme Stability PredictionNovozymes Enzyme Stability Prediction
Novozymes Enzyme Stability PredictionIttrainingIttraining
 
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Paolo Missier
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012Dan Gaston
 
20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop final20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop finalMeng-Ru (Raymond) Tsai
 
Distributed approach for Peptide Identification
Distributed approach for Peptide IdentificationDistributed approach for Peptide Identification
Distributed approach for Peptide Identificationabhinav vedanbhatla
 
Paper memo: persistent homology on biological problems
Paper memo: persistent homology on biological problemsPaper memo: persistent homology on biological problems
Paper memo: persistent homology on biological problemsRyohei Suzuki
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology Sean Ekins
 
171114 best practices for benchmarking variant calls justin
171114 best practices for benchmarking variant calls justin171114 best practices for benchmarking variant calls justin
171114 best practices for benchmarking variant calls justinGenomeInABottle
 
Bioinfomatics Presentation
Bioinfomatics PresentationBioinfomatics Presentation
Bioinfomatics PresentationZhenhong Bao
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible researchYannick Wurm
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataPhilip Cheung
 

Similar to Joe parker-benchmarking-bioinformatics (20)

Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
Folker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data AnnotationFolker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data Annotation
 
Thesis def
Thesis defThesis def
Thesis def
 
Novozymes Enzyme Stability Prediction
Novozymes Enzyme Stability PredictionNovozymes Enzyme Stability Prediction
Novozymes Enzyme Stability Prediction
 
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop final20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop final
 
Cshl minseqe 2013_ouellette
Cshl minseqe 2013_ouelletteCshl minseqe 2013_ouellette
Cshl minseqe 2013_ouellette
 
Intro to databases
Intro to databasesIntro to databases
Intro to databases
 
Distributed approach for Peptide Identification
Distributed approach for Peptide IdentificationDistributed approach for Peptide Identification
Distributed approach for Peptide Identification
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Paper memo: persistent homology on biological problems
Paper memo: persistent homology on biological problemsPaper memo: persistent homology on biological problems
Paper memo: persistent homology on biological problems
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology
 
171114 best practices for benchmarking variant calls justin
171114 best practices for benchmarking variant calls justin171114 best practices for benchmarking variant calls justin
171114 best practices for benchmarking variant calls justin
 
HUG @ NGCLE@e-Novia 15.11.2017
HUG @ NGCLE@e-Novia 15.11.2017HUG @ NGCLE@e-Novia 15.11.2017
HUG @ NGCLE@e-Novia 15.11.2017
 
Bioinfomatics Presentation
Bioinfomatics PresentationBioinfomatics Presentation
Bioinfomatics Presentation
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
Bioinformatica 06-10-2011-t2-databases
Bioinformatica 06-10-2011-t2-databasesBioinformatica 06-10-2011-t2-databases
Bioinformatica 06-10-2011-t2-databases
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadata
 

Recently uploaded

Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Silpa
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfTukamushabaBismark
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Servicenishacall1
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsOrtegaSyrineMay
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptxryanrooker
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Servicemonikaservice1
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Youngkajalvid75
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
Introduction to Viruses
Introduction to VirusesIntroduction to Viruses
Introduction to VirusesAreesha Ahmad
 

Recently uploaded (20)

Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdf
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Introduction to Viruses
Introduction to VirusesIntroduction to Viruses
Introduction to Viruses
 

Joe parker-benchmarking-bioinformatics

  • 1. Why Aren't We Benchmarking Bioinformatics? Joe Parker Early Career Research Fellow (Phylogenomics) Department of Biodiversity Informatics and Spatial Analysis Royal Botanic Gardens, Kew Richmond TW9 3AB joe.parker@kew.org
  • 2. Outline • Introduction • Brief history of Bioinformatics • Benchmarking in Bioinformatics • Case study 1: Typical benchmarking across environments • Case study 2: Mean-variance relationship for repeated measures • Conclusions: implication for statistical genomics
  • 3. A (very) brief history of bioinformatics Kluge & Farris (1969) Syst. Zool. 18:1-32
  • 4. A (very) brief history of bioinformatics Stewart et al. (1987) Nature 330:401-404
  • 5. A (very) brief history of bioinformatics ENCODE Consortium (2012) Nature 489:57-74
  • 6. A (very) brief history of bioinformatics Kluge & Farris (1969) Syst. Zool. 18:1-32
  • 7. Benchmarking to biologists • Benchmarking as a comparative process • i.e. ‘which software’s best?’ / ‘which platform’ • Benchmarking application logic / profiling unknown • Environments / runtimes generally either assumed to be identical, or else loosely categorised into ‘laptops vs clusters’
  • 8. Case Study 1 aka ‘Which program’s the best?’
  • 9. Bioinformatics environments are very heterogeneous • Laptop: – Portable – Very costly form-factor – Maté? Beer? • Raspi: – Low: cost, energy (& power) – Highly portable – Hackable form-factor • Clusters: • Not portable, setup costs • The cloud: – Power closely linked to budget (as limited as) – Almost infinitely scalable -Have to have a connection to get data up there (and down!) – Fiddly setup
  • 11. Comparison System Arch CPU type, clock GHz cores RAM Gb / MHz / type HDD Gb Haemodorum i686 Xeon E5620 @ 2.4 8 33 1000 @ SATA Raspberry Pi 2 B+ ARM ARMv7 @ 1.0 1 1 8 @ flash card Macbook Pro (2011) x64 Core i7 @ 2.2 4 8 250 @ SSD EC2 m4.10xlarge x64 Xeon E5 @ 2.4 40 160 320 @ SSD
  • 12. Reviewing / comparing new methods • Biological problems often scale horribly unpredictably • Algorithm analyses • So empirical measurements on different problem sets to predict how problems will scale…
  • 13. Workflow Setup BLAST 2.2.30 CEGMA genes Short reads Concatenate hits to CEGMA alignments Muscle 3.8.31 RAxML 7.2.8+ Set up workflow, binaries, and reference / alignment data. Deploy to machines. Protein-protein blast reads (from MG-RAST repository, Bass Strait oil field) against 458 core eukaryote genes from CEGMA. Keep only top hits. Use max. num_threads available. Append top hit sequences to CEGMA alignments. For each: Align in MUSCLE using default parameters Infer de novo phylogeny in RAxML under Dayhoff, random starting tree and max. PTHREADS. Output and parse times.
  • 16. Case Study 2 aka ‘What the hell’s a random seed?’
  • 17.
  • 18. Properly benchmarking workflows • Ignoring limiting-steps analyses – in many workflows might actually be data cleaning / parsing / transformation • Or (most common error) inefficiently iterating • Or even disk I/O!
  • 19. Workflow benchmarking very rare • Many bioinformatics workflows / pipelines limiting at odd steps, parsing etc • http://beast.bio.ed.ac.uk/benchmarks • Many e.g. bioinformatics papers • More harm than good?
  • 20. Conclusion • Biologists and error • Current practice • Help! • Interesting challenges too… Thanks: RBG Kew, BI&SA, Mike Chester

Editor's Notes

  1. NB slot is 30mins INCLUDING questions Bioinformatics (computational manipulation, analysis and modelling of biological and biochemical data) is a substantial area of primary research activity as well as a growing market worth up to $13bn US (2020) by some estimates[1]. In particular, exponential growth in the capacity and speed of DNA sequencers has led to a critical shortfall in analytical tools often termed a 'DNA deluge'[2]. Despite this critical need and substantial and growing investment in bioinformatics software, many best practices in software design are neglected, ignored, or even unacknowledged. I will illustrate the problem with an overview of two case studies in current bioinformatics projects, and present data showing worrying reproducibility deficits in commonly-used software. [1]http://www.marketsandmarkets.com/PressReleases/bioinformatics-market.asp [2]http://spectrum.ieee.org/biomedical/devices/the-dna-data-deluge
  2. I am basically an empirical scientist But driven by methods Replication -> empirical investigation This is a big gap in bioinfo at the mo
  3. Biological problems often scale horribly unpredictably On top of which hardly any biologists and few bioinformaticians (me included) understand algorithmic complexity / analysis properly So empirical measurements on different problem sets really the only way to predict how problems will scale…
  4. Wall-clock time on ly one part of the process Slowest (limiting) steps in many workflows might actually be data cleaning / parsing / transformation Or inefficiently iterating over parametizations etc Or even disk I/O!
  5. Many bioinformatics workflows / pipelines limiting at odd steps, parsing etc http://beast.bio.ed.ac.uk/benchmarks Many e.g. bioinformatics papers Biologists are benchmarking the wrong things, or not at all
  6. Many bioinformatics workflows / pipelines limiting at odd steps, parsing etc http://beast.bio.ed.ac.uk/benchmarks Many e.g. bioinformatics papers Biologists are benchmarking the wrong things, or not at all