Making Use of NGS Data: From Reads to Trees and Annotations

João André Carriço, PhD
Microbiology Institute/Institute for Molecular Medicine
Faculty of Medicine, University of Lisbon
Portugal
http://im.fm.ul.pt
http://imm.fm.ul.pt
http://www.joaocarrico.info
WORKSHOP 24: NGS FOR MICROBIAL GENOMIC SURVEILLANCE AND MORE - ONE TECHNOLOGY FITS ALL

• This presentation is not intended to cover all available software or databases (that would take several weeks or months).
• I'll present what I use or intend to use in the near future.
• I gladly accept suggestions of tools to include in similar presentations in the future.
• It is meant to be interactive, so ask away during the presentation.

• What is in the reads (FASTQ files)
• Available databases
  • Virulence factor and AMR databases
  • Sequence-based typing databases: PubMLST.org / Enterobase
• High-throughput sequencing data analysis (freeware)
  • Prokka
  • Roary
  • Nullarbor
  • Microreact.org
  • PHYLOViZ
• Commercial solutions
  • Bionumerics 7.5
  • CLC Genomics Workbench (CLC Bio)
  • Ridom SeqSphere+

[Figure: the sequenced reads contain the isolate genome*, other isolates in the sequencing run, and contamination. Slide source: Nick Loman]
* Chromosome + plasmids + phages
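To make the content of a reads file concrete, here is a minimal sketch of reading FASTQ records, assuming the common four-lines-per-record layout and Phred+33 quality encoding. The function names and the two example reads are hypothetical and only for illustration; real data would normally be read with an established library.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("FASTQ header must start with '@': " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for " + header)
        # The read identifier is the first word after '@'.
        yield header[1:].split()[0], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return mean(ord(char) - offset for char in quality)


# Hypothetical two-read example, not real sequencing data.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!II\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
for read_id, sequence, quality in records:
    print(read_id, len(sequence), mean_phred(quality))
```

Note that nothing in a record says which organism a read came from: reads from the isolate, from other isolates in the run, and from contaminants all look the same at this level.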

Virulence Factor Databases
• VFDB (http://www.mgc.ac.cn/VFs/main.htm)
• Pathosystems Resource Integration Center (PATRIC) VF (https://www.patricbrc.org/)
• Victors (http://www.phidias.us/victors/)
• PHI-base (http://www.phi-base.org/)
• MvirDB (http://mvirdb.llnl.gov/)
To know more:
- Presentation in the "Controversies in interpreting whole genome sequence data" session:
http://eccmidlive.org/#resources/how-can-we-design-actionable-virulome-databases

• Comprehensive Antibiotic Resistance Database (CARD) (https://card.mcmaster.ca/)
• Repository of Antibiotic resistance Cassettes (RAC) (http://rac.aihi.mq.edu.au/rac/)
• INTEGRALL: The Integron Database (http://integrall.bio.ua.pt/)
(…)

http://www.pubmlst.org
http://bigsdb.web.pasteur.fr/

slide by @happy_khan
Martin Sergeant
Mark Achtman
Nabil-Fareed Alikhan
Zhemin Zhou

To know more:
http://www.slideshare.net/nickloman/eccmid-2015-so-i-have-sequenced-my-genome-what-now
[Figure: workflow overview]
Reads (fastq files) → De novo assembler → contigs (fasta files) → Prokka → Annotated contigs (gbk/gff files) → Roary: pan-genome analysis
Other tools shown in the figure:
• Enterobase
• BIGSdb
• Nullarbor
• PHYLOViZ: tree + metadata visualization
• Microreact.org: tree + metadata + visualization

• Genome annotation made easy, by Torsten Seemann (slides by Torsten)
• Genome annotation: adding biological information to the sequence by describing features
To know more:
http://www.slideshare.net/torstenseemann/prokka-rapid-bacterial-genome-annotation-abphm-2013
Available at: https://github.com/tseemann/prokka
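As an illustration of what "describing features" looks like in a gff file, the sketch below counts feature types in GFF3 text. It assumes the standard nine tab-separated columns and an optional trailing `##FASTA` section holding the sequences; the example records and the function name are hypothetical.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count the feature types (column 3) in GFF3-formatted text."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # everything after this marker is sequence, not annotation
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


# Hypothetical annotation of a single contig.
EXAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "contig1\texample\tCDS\t1\t300\t.\t+\t0\tID=gene1;product=hypothetical protein",
    "contig1\texample\tCDS\t400\t900\t.\t-\t0\tID=gene2;product=DNA gyrase subunit A",
    "contig1\texample\ttRNA\t950\t1025\t.\t+\t.\tID=gene3;product=tRNA-Ala",
    "##FASTA",
    ">contig1",
    "ACGT",
])

feature_counts = count_feature_types(EXAMPLE_GFF)
print(dict(feature_counts))
```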

• Pan-genome analysis, by Andrew Page
• Available at: https://sangerpathogens.github.io/Roary/
[Figure: the pan-genome comprises the core genome and the accessory genome]

• Inputs: annotated de novo assemblies (GFF files)
  • Typically from the annotation pipeline
• Outputs:
  • Spreadsheet with presence and absence of genes
  • Multi-FASTA alignment of core genes, so you can build a tree without a reference
  • Multi-FASTA alignments for each gene
  • Plots for the open/closed genome, unique genes
  • Integrates with Phandango so you can visualise all structural variation
  • QC report from Kraken to help identify suspect samples
(Slide by Andrew Page)

• Core (n or n-1 strains)
• Soft-Core (n-2 or n-3 strains)
• Shell (8(?) to n-3 strains)
• Cloud (<8(?) strains)
Core genome: Core + Soft-Core
Accessory genome: Shell + Cloud
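The four categories above can be sketched as a simple classification of each gene by the share of strains that carry it. The slide gives count-based cut-offs and marks some of them with question marks, so the fractional thresholds below (99%, 95%, 15%) are assumed defaults for illustration, not the slide's values; the gene names and the presence/absence data are hypothetical.

```python
from collections import Counter


def classify_gene(n_present, n_total, core=0.99, soft_core=0.95, shell=0.15):
    """Assign a gene to a pan-genome category by the fraction of strains carrying it."""
    fraction = n_present / n_total
    if fraction >= core:
        return "core"
    if fraction >= soft_core:
        return "soft-core"
    if fraction >= shell:
        return "shell"
    return "cloud"


def summarise_pangenome(presence, n_total):
    """Count genes per category; `presence` maps gene name -> set of strains."""
    return Counter(classify_gene(len(strains), n_total) for strains in presence.values())


# Hypothetical presence/absence data for 100 strains.
N_STRAINS = 100
presence = {
    "geneA": set(range(100)),  # present in every strain
    "geneB": set(range(97)),   # missing from three strains
    "geneC": set(range(50)),   # present in half of the strains
    "geneD": set(range(3)),    # rare gene
}

summary = summarise_pangenome(presence, N_STRAINS)
core_genome = summary["core"] + summary["soft-core"]
accessory_genome = summary["shell"] + summary["cloud"]
print(dict(summary), core_genome, accessory_genome)
```

Keeping the thresholds as parameters makes it easy to try count-based rules such as "n or n-1 strains" by converting them to fractions for a given collection size.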

iCANDY output of presence and
absence of genes in accessory
genome.
S. Weltevreden & public S. enterica
genomes
(Slide by Andrew Page)

• Complete pipeline from reads to reports, by Torsten Seemann
• The objective is to automate analysis for everyday use in public health labs and research settings
• Uses and distills the outputs of many software tools
• Available at: https://github.com/tseemann/nullarbor

From: https://github.com/tseemann/nullarbor

Inputs:
- Tab-separated text (profiles)
- FASTA files
- Automatic database retrieval (MLST)
Outputs:
• goeBURST and goeBURST MST
• Link quality assessment
• High-quality images
Can be easily applied to:
- MLST/ cgMLST/wgMLST
- MLVA
- SNP data*
- Gene Presence/absence
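To show how a tree can be derived from tab-separated allelic profiles, here is a minimal sketch that computes pairwise Hamming distances (the number of differing loci) and builds a plain minimum spanning tree with Kruskal's algorithm. This is a simplification: goeBURST applies its own tie-breaking rules when choosing links, which are not reproduced here. The profiles and function names are hypothetical.

```python
from itertools import combinations


def parse_profiles(text):
    """Parse tab-separated profiles: a header line, then one 'ID<TAB>alleles...' line each."""
    lines = text.strip().splitlines()
    profiles = {}
    for line in lines[1:]:  # skip the header line
        fields = line.split("\t")
        profiles[fields[0]] = tuple(fields[1:])
    return profiles


def hamming(profile_a, profile_b):
    """Number of loci at which two profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def minimum_spanning_tree(profiles):
    """Kruskal's algorithm over all pairwise Hamming distances."""
    edges = sorted(
        (hamming(profiles[a], profiles[b]), a, b)
        for a, b in combinations(sorted(profiles), 2)
    )
    parent = {name: name for name in profiles}

    def find(node):
        # Follow parent links to the set representative, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for distance, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, distance))
    return tree


# Hypothetical 7-locus profiles.
EXAMPLE = (
    "ST\tg1\tg2\tg3\tg4\tg5\tg6\tg7\n"
    "ST1\t1\t1\t1\t1\t1\t1\t1\n"
    "ST2\t1\t1\t1\t1\t1\t1\t2\n"
    "ST3\t1\t1\t1\t1\t1\t2\t2\n"
    "ST4\t5\t5\t5\t5\t1\t1\t1\n"
)
profiles = parse_profiles(EXAMPLE)
tree = minimum_spanning_tree(profiles)
print(tree)
```

The same distance works for MLST, cgMLST/wgMLST and gene presence/absence profiles, since each is a vector of categorical values per isolate.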

New features:
• Hierarchical clustering
• Neighbor-Joining
• Project Saving

• Available at http://online.phyloviz.net
• Web-based version of PHYLOViZ
• Allows users to create their own datasets, save them and share their data (privately or publicly)
• REST API available
• Scalable to thousands of nodes
• Tree analysis tools:
  • Interactive distance matrix
  • NLV graph
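The NLV graph can be illustrated with a short sketch, assuming that "NLV" stands for N-locus variant and that the graph links every pair of profiles differing at no more than N loci. Unlike a spanning tree, such a graph keeps all close relationships, including cycles. The profiles and the function name are hypothetical.

```python
from itertools import combinations


def nlv_edges(profiles, n):
    """Return (id_a, id_b, distance) for all profile pairs differing at <= n loci."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        distance = sum(x != y for x, y in zip(profiles[a], profiles[b]))
        if distance <= n:
            edges.append((a, b, distance))
    return edges


# Hypothetical 7-locus profiles.
example_profiles = {
    "ST1": (1, 1, 1, 1, 1, 1, 1),
    "ST2": (1, 1, 1, 1, 1, 1, 2),
    "ST3": (1, 1, 1, 1, 1, 2, 2),
    "ST4": (5, 5, 5, 5, 1, 1, 1),
}

single_locus_links = nlv_edges(example_profiles, 1)
double_locus_links = nlv_edges(example_profiles, 2)
print(single_locus_links)
print(double_locus_links)
```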

NLV Graph
Tree cut-off
Full MST

Create Selections
Change tree options

• Available at http://microreact.org/
• Presentation in the session "Harnessing whole genome sequence data for public health applications": Novel open access tools for WGS-based pathogen surveillance and the identification of high-risk clones
• http://eccmidlive.org/#resources/novel-open-access-tools-for-wgs-based-pathogen-surveillance-and-the-identification-of-high-risk-clones

• Ridom Seqsphere+ : http://www.ridom.de/seqsphere/
• Applied Maths Bionumerics 7.6: http://www.applied-maths.com/bionumerics
• CLC bio Genomics Workbench: http://www.clcbio.com/blog/clc-genomics-workbench-7-5/

• Huge variety of software and database solutions
• There is no single One-Size-Fits-All solution (job
security for bioinformaticians)
• Different questions require different approaches
• Always question the results and data provenance

• ECCMID 2015 Meet-the-expert session on "What bioinformatic tools should I use for analysis of High-Throughput Sequencing data for molecular diagnostics?"
  • Nick Loman: http://www.slideshare.net/nickloman/eccmid-2015-meettheexpert-bioinformatics-tools
  • João André Carriço: http://www.slideshare.net/joaoandrecarrico/eccmid-meet-theexpert2015

• UMMI Members
  • Bruno Gonçalves
  • Mário Ramirez
  • José Melo-Cristino
• INESC-ID
  • Alexandre Francisco
  • Cátia Vaz
  • Marta Nascimento
• EFSA INNUENDO Project (https://sites.google.com/site/innuendocon/)
  • Mirko Rossi
• FP7 PathoNGenTrace (http://www.patho-ngen-trace.eu/):
  • Dag Harmsen (Univ. Muenster)
  • Stefan Niemann (Research Center Borstel)
  • Keith Jolley, James Bray and Martin Maiden (Univ. Oxford)
  • Joerg Rothganger (RIDOM)
  • Hannes Pouseele (Applied Maths)
• Genome Canada IRIDA project (www.irida.ca)
  • Franklin Bristow, Thomas Matthews, Aaron Petkau, Morag Graham and Gary Van Domselaar (NLM, PHAC)
  • Ed Taboada and Peter Kruczkiewicz (Lab Foodborne Zoonoses, PHAC)
  • Fiona Brinkman (SFU)
  • William Hsiao (BCCDC)
INTEGRATED RAPID INFECTIOUS DISEASE ANALYSIS
