João André Carriço,
Microbiology Institute and Instituto de Medicina Molecular,
Faculty of Medicine, University of Lisbon
jcarrico@fm.ul.pt twitter: @jacarrico
Whole genome sequencing for clinical microbiology:
Translation into routine applications
2 September 2017, Basel
A pipeline (in software engineering) consists of a chain of
processing elements arranged so that the output of each
element is the input of the next; the name is by analogy to
a physical pipeline
https://en.wikipedia.org/wiki/Pipeline_(software)
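A minimal Python sketch of the definition above: each processing element is a function, and the pipeline feeds the output of one element into the next. The step names (`trim`, `uppercase`, `count_gc`) are hypothetical toy stand-ins, not real bioinformatics tools.

```python
from functools import reduce
from typing import Any, Callable, Sequence

Step = Callable[[Any], Any]


def run_pipeline(steps: Sequence[Step], data: Any) -> Any:
    """Feed data through each step; the output of one step is the input of the next."""
    return reduce(lambda value, step: step(value), steps, data)


# Hypothetical toy steps, for illustration only.
def trim(seq: str) -> str:
    """Remove leading and trailing ambiguous bases (N)."""
    return seq.strip("N")


def uppercase(seq: str) -> str:
    return seq.upper()


def count_gc(seq: str) -> int:
    """Count G and C bases in the sequence."""
    return sum(base in "GC" for base in seq)


result = run_pipeline([trim, uppercase, count_gc], "NNacgtgcNN")
print(result)
```

Swapping, removing or adding a function in the list changes the pipeline without touching `run_pipeline` itself.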
Physical pipeline
Software Pipeline
Software / Algorithm module
The Ideal Scenario
Microbiological Sample
Magic Box of NGS Wonders for Clinical Microbiology
Completely characterized strain:
• Species Identification
• Serotype
• Multilocus Sequence Type (MLST)
• cgMLST / wgMLST / SNPs
• Antibiotic resistance profile
• Virulence factors
• Other SBTM information, e.g.:
• spa (S. aureus)
• emm (Group A Streptococcus)
Actionable information for:
• Diagnostics
• Surveillance
• Outbreak detection
Magic Box of NGS Wonders for Clinical Microbiology
Pipelines of HTS analysis software
• Comparability: the same analysis workflow is applied to multiple samples
• Accountability: keeping track of which software (and version) did the analysis
• Modularity: adding new software to the pipeline without changing the existing one
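The three properties above can be sketched in a few lines of Python. This is an illustrative toy, not any real workflow tool: the `Step` and `Pipeline` classes, the step names and the version strings are all hypothetical. The same pipeline object is applied to every sample (comparability), each run returns a record of which step and version ran (accountability), and steps are added without editing existing ones (modularity).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Step:
    name: str
    version: str
    func: Callable[[Any], Any]


@dataclass
class Pipeline:
    steps: List[Step] = field(default_factory=list)

    def add(self, name: str, version: str, func: Callable[[Any], Any]) -> "Pipeline":
        """Append a new step without modifying the existing ones."""
        self.steps.append(Step(name, version, func))
        return self

    def run(self, sample: Any) -> Tuple[Any, List[Dict[str, str]]]:
        """Run all steps in order and record which step (and version) ran."""
        provenance: List[Dict[str, str]] = []
        value = sample
        for step in self.steps:
            value = step.func(value)
            provenance.append({"step": step.name, "version": step.version})
        return value, provenance


# Hypothetical toy steps and versions.
pipeline = (
    Pipeline()
    .add("toy_trimmer", "0.1", lambda reads: [r.strip("N") for r in reads])
    .add("toy_length_filter", "0.2", lambda reads: [r for r in reads if len(r) >= 4])
)

samples = {"strain_A": ["NNACGTNN", "NAN"], "strain_B": ["ACGTACGT"]}
results = {name: pipeline.run(reads) for name, reads in samples.items()}
```

Every entry in `results` carries the same provenance list, which is what makes results from different samples comparable.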
• Bioinformatics workflow software:
Nextflow https://www.nextflow.io/
Bionode Watermill https://github.com/bionode/bionode-watermill
Snakemake https://snakemake.readthedocs.io/en/stable/
• Re-run as needed: if a module doesn’t run, there is no need to re-run the whole analysis
• Compatible with High Performance Computing job schedulers (SLURM, etc.)
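The "re-run as needed" idea can be sketched as follows. This is a simplified illustration under stated assumptions, not how Nextflow, Snakemake or Bionode Watermill are implemented: a step is skipped when its output file already exists, so only the missing part of the analysis is recomputed. The step names and file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def run_step(output: Path, produce: Callable[[], str], log: List[str]) -> str:
    """Run `produce` only if `output` does not exist yet; otherwise reuse the file."""
    if output.exists():
        log.append(f"skip {output.name}")
    else:
        output.write_text(produce())
        log.append(f"run {output.name}")
    return output.read_text()


def analyse(workdir: Path, log: List[str]) -> str:
    # Hypothetical two-step analysis: "assemble", then "annotate".
    assembly = run_step(workdir / "assembly.txt", lambda: "ACGT", log)
    return run_step(workdir / "annotation.txt", lambda: f"gene1:{assembly}", log)


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    first_log: List[str] = []
    analyse(workdir, first_log)

    # Simulate a failed second step by removing its output.
    (workdir / "annotation.txt").unlink()

    second_log: List[str] = []
    final = analyse(workdir, second_log)
```

On the second call the assembly is reused and only the annotation step runs again.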
• Software validation: most software contains bugs that can affect the results. Pipelines can make the problem harder to track down
• Reproducibility: running the same strain “should” yield the same results, but some software has stochastic steps
• Opacity: given the dependency on multiple software tools, it can be difficult to determine how the final results were achieved
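A small Python illustration of the reproducibility point, assuming a hypothetical stochastic step that randomly subsamples reads. With a fixed seed, two runs on the same input give the same result; without one, they may differ. The function name and the seed value are illustrative only.

```python
import random
from typing import List, Optional


def subsample_reads(reads: List[str], n: int, seed: Optional[int] = None) -> List[str]:
    """Hypothetical stochastic step: pick n reads at random.

    Passing a seed makes the choice repeatable; seed=None uses system entropy.
    """
    rng = random.Random(seed)
    return rng.sample(reads, n)


reads = [f"read_{i}" for i in range(100)]

# Same input, same seed: identical output on every run.
run_1 = subsample_reads(reads, 10, seed=42)
run_2 = subsample_reads(reads, 10, seed=42)

# Same input, no seed: the output is not guaranteed to match.
run_3 = subsample_reads(reads, 10)
```

Recording the seed alongside the software version is one way to make such a step repeatable.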
• Database dependency: many bioinformatics tools depend on publicly available, curated databases. False positives and false negatives are difficult to assess.
Virulence Factor Databases
• VFDB (http://www.mgc.ac.cn/VFs/main.htm)
• Pathosystems Resource Integration Center (PATRIC) VF (https://www.patricbrc.org/)
• Victors (http://www.phidias.us/victors/)
• PHI-Base (http://www.phi-base.org/)
• MvirDB (http://mvirdb.llnl.gov/)
To know more:
- Presentation in the session “Controversies in interpreting whole genome sequence data”: http://eccmidlive.org/#resources/how-can-we-design-actionable-virulome-databases
• Comprehensive Antibiotic Resistance Database (CARD) (https://card.mcmaster.ca/)
• ResFinder 2.1 (https://cge.cbs.dtu.dk/services/ResFinder/); DB repository: https://bitbucket.org/genomicepidemiology/resfinder_db
• Repository of Antibiotic resistance Cassettes (RAC) (http://rac.aihi.mq.edu.au/rac/)
• INTEGRALL: The integron database (http://integrall.bio.ua.pt/)
(…)
• Software dependencies: if a tool is updated and its output changes, the pipeline breaks and needs to be revised
• Database / URL format changes: when databases, or the URLs where data is stored in public repositories, change, several software modules can be affected (a.k.a. the NCBI effect)
• Setting up the pipeline: not as easy as it seems. The Bus effect.
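One defensive pattern against the "output changes, pipeline breaks" problem is to validate a tool's output format before passing it downstream, so that a change fails loudly instead of silently corrupting results. The sketch below assumes a hypothetical tab-separated report with columns `sample`, `scheme` and `st`; it does not describe the output of any real tool.

```python
import csv
import io
from typing import Dict, List

# Hypothetical column layout that the downstream module was written against.
EXPECTED_COLUMNS = ["sample", "scheme", "st"]


class OutputFormatError(ValueError):
    """Raised when an upstream tool's output no longer has the expected layout."""


def parse_typing_report(text: str) -> List[Dict[str, str]]:
    """Parse a tab-separated report, refusing any unexpected header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise OutputFormatError(
            f"expected columns {EXPECTED_COLUMNS}, got {reader.fieldnames}"
        )
    return list(reader)


old_output = "sample\tscheme\tst\nstrain_A\tecoli\t131\n"
# Columns reordered after a hypothetical software update.
new_output = "sample\tst\tscheme\nstrain_A\t131\tecoli\n"

rows = parse_typing_report(old_output)

try:
    parse_typing_report(new_output)
    change_detected = False
except OutputFormatError:
    change_detected = True
```

The check does not fix the break, but it pinpoints which module's input changed.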
The output of one software tool is used as the input of another:
most bioinformatics software tools are pipelines!
INNUca → Assembly pipeline
Prokka → Genome annotation pipeline
Nullarbor → All-in-one pipeline
Web platforms → INNUENDO platform
https://www.cdc.gov/pulsenet/pathogens/wgs.html
Contamination
Mislabelling
E. coli
E. fergusonii
Mixture
Barcode bleaching
Wrong file assignment
INNUca https://github.com/INNUENDOCON/INNUca
Dependencies:
• Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
• samtools http://www.htslib.org/
• FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc
• SPAdes http://cab.spbu.ru/software/spades/
• Pilon https://github.com/broadinstitute/pilon
• MLST 2 https://github.com/tseemann/mlst
Features:
• Species confirmation
• Contamination detection
• Assembly correction
• Multiple allele detection -> multiple strains
Output
20-40 min per strain (60x-100x coverage; 8 CPUs)
High Performance Cluster:
6-7 nodes, 244 CPUs used: 3h57m for 124 E. coli ≈ 1.9 min per strain
Benchmark
Contamination and
multi-strain detection
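The per-strain figure quoted for the cluster run follows from simple arithmetic, worked through here in Python using only the numbers on the slide. The throughput ratio against the 20-40 min single-run figure is a derived illustration, not a number from the slide.

```python
def minutes_per_strain(hours: int, minutes: int, n_strains: int) -> float:
    """Average wall-clock minutes per strain for a batch run."""
    return (hours * 60 + minutes) / n_strains


# 3h57m for 124 E. coli strains on the cluster.
cluster_rate = minutes_per_strain(3, 57, 124)
print(round(cluster_rate, 1))  # → 1.9

# Derived throughput gain compared with 20-40 min per strain on 8 CPUs.
gain_low = 20 / cluster_rate
gain_high = 40 / cluster_rate
print(f"{gain_low:.1f}x to {gain_high:.1f}x")
```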
• Genome annotation made easy by Torsten Seemann (slides by Torsten)
• Genome annotation: adding biological information to the sequence by describing features
To know more:
http://www.slideshare.net/torstenseemann/prokka-rapid-bacterial-genome-annotation-abphm-2013
Available at: https://github.com/tseemann/prokka
• Complete pipeline from reads to reports by Torsten Seemann
• The objective is to automate analysis for everyday use in public health labs / research settings
• Uses and distills the outputs of many software tools
• Available at: https://github.com/tseemann/nullarbor
Slide by Torsten Seemann
From: https://github.com/tseemann/nullarbor
Slides by Torsten Seemann
Web platforms:
• Facilitate the use of pipelines by non-bioinformaticians (the old and boring Windows vs Linux software debate can end (?) …)
• Facilitate data sharing and comparison: creation of federated strain databases
A novel cross-sectorial platform for the
integration of genomics in surveillance of
foodborne pathogens
http://www.innuendoweb.org/
Target species:
Escherichia coli
Salmonella enterica
Yersinia enterocolitica
Campylobacter sp.
http://www.irida.ca/
INNUENDO Platform
[Architecture diagram. Components shown: Client Browser (Chrome), Web Application, Job Processing Application, REST API (shown twice), Frontend / DB Server, Calculation Server, Metadata Storage, Sequences Storage, LDAP, NGS Onto, SLURM Job Scheduler, Computation Module, INNUca, ReMatCh, chewBBACA, PHYLOViZ Online]
Slide credit: Bruno Gonçalves
Target users: Reference laboratories. Small groups.
• Multi-user
• Create projects within a species for:
• Outbreaks
• Surveillance
Apply multiple pipelines to the same strains and queue them for processing using SLURM.
Can use a High Performance Computer if available.
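As a rough illustration of queuing several pipelines per strain, the Python sketch below only generates SLURM batch scripts as text; it does not submit anything and does not reflect how the INNUENDO platform builds its jobs. The pipeline names, strain names and the `echo` command are placeholders. The `#SBATCH` options used (`--job-name`, `--cpus-per-task`, `--output` with the `%j` job-ID pattern) are standard SLURM directives.

```python
from typing import Dict, Tuple


def sbatch_script(job_name: str, command: str, cpus: int = 8) -> str:
    """Build the text of a SLURM batch script for one job."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --output={job_name}.%j.log",  # %j expands to the SLURM job ID
        command,
    ]
    return "\n".join(lines) + "\n"


strains = ["strain_A", "strain_B"]
pipelines = ["assembly", "typing"]  # hypothetical pipeline names

# One job per (strain, pipeline) pair; the command is a placeholder.
scripts: Dict[Tuple[str, str], str] = {
    (strain, pipeline): sbatch_script(
        f"{pipeline}_{strain}", f"echo running {pipeline} on {strain}"
    )
    for strain in strains
    for pipeline in pipelines
}

print(scripts[("strain_A", "assembly")])
```

Each script could then be written to a file and handed to `sbatch` on a cluster where SLURM is available.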
Aggregate selected strains from multiple projects into reports:
• Reports can be saved and exported
• Gene-by-gene analyses can be visualized directly in PHYLOViZ Online, and the resulting trees saved and shared
• The N closest strains in the database can be added to the tree automatically
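One plausible way to pick the N closest strains in a gene-by-gene setting is to count the loci at which two allelic profiles differ. The sketch below assumes that distance measure and uses made-up profiles; the slide does not say how the platform actually computes closeness.

```python
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]  # allele numbers, one per locus


def allelic_distance(a: Profile, b: Profile) -> int:
    """Number of loci at which two profiles carry different alleles."""
    return sum(x != y for x, y in zip(a, b))


def n_closest(query: Profile, database: Dict[str, Profile], n: int) -> List[Tuple[str, int]]:
    """Return the n database strains with the fewest allelic differences."""
    distances = [
        (name, allelic_distance(query, profile)) for name, profile in database.items()
    ]
    # Sort by distance, then by name so ties are resolved deterministically.
    return sorted(distances, key=lambda item: (item[1], item[0]))[:n]


# Made-up four-locus profiles, for illustration only.
database = {
    "strain_A": (1, 1, 1, 1),
    "strain_B": (1, 2, 1, 1),
    "strain_C": (3, 2, 4, 1),
    "strain_D": (1, 1, 1, 2),
}

closest = n_closest((1, 1, 1, 1), database, 2)
print(closest)
```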
The metadata filled in for the project is added automatically, and several tree
analyses can be performed:
• NLVGraph
• Interactive distance matrix
• Dynamic exploration of wgMLST schemas
To know more: https://online.phyloviz.net/index
Input Output
Black box | See-through box
Commercial/Freeware | Freeware
You get what it gives you | You can “tailor”
Ready to use | “Major” headache
Stealth change | Visible change
Standalone | Dependencies
Slide credit: Mario Ramirez
• Pipelines can provide actionable results for Clinical Microbiology out of HTS data
• One must be aware of the limitations of each pipeline. Setting up a maintainable pipeline requires bioinformaticians.
• Most are Linux based, but web platforms can give non-bioinformaticians an easy-to-use way in and are useful to stimulate data sharing.
• Pipelines greatly benefit from High Performance Computing clusters. Nevertheless, these need specialized personnel to install and maintain.
http://im.fm.ul.pt
INNUENDO project [GP/EFSA/AFSCO/2015/01/CT2]
BacGenTrack project [FCT / Scientific and Technological Research Council of Turkey, TUBITAK/0004/2014]
ONEIDA project (LISBOA-01-0145-FEDER-016417) co-funded by FEEI - “Fundos Europeus Estruturais e de
Investimento” from “Programa Operacional Regional Lisboa 2020” and by national funds from FCT -
“Fundação para a Ciência e Tecnologia”
Disclaimer
The conclusions, findings, and opinions expressed in this presentation reflect only the
view of the INNUENDO consortium members and not the official position of the
European Food Safety Authority nor of the Government of the Basque Country that are
not responsible for any use that may be made of the information they contain.

Software Pipelines: The Good, The Bad and The Ugly
