ECCMID 2015 - So I have sequenced my genome ... what now?

Nick Loman
University of Birmingham
So I have sequenced my
organism … what do I do now?
Oh dear
Sequence some more
Sensible
Useful things
Whole-genome sequencing:
utility in clinical microbiology
• Diagnostics
– Species, subspecies, strain identification
– In silico antibiogram
– In silico virulence profile
• Surveillance
– Typing (including backwards compatibility with MLST and serotype)
– What strains and resistance elements are lurking in my hospital/community?
• Forensic epidemiology
– Is there an outbreak?
– Who gave what to whom?
Common types of sequencing
• Paired-end Illumina (typically 150 – 300 bases)
• Single-end Ion Torrent (typically 300-400
bases)
– Can be treated more or less the same
• Pacific Biosciences or Oxford Nanopore
– Requires special handling, not covered today
Quality Control: Questions to Ask
• Did my sequencing work?
• What are the fragment lengths?
• Is my sample what I think it is?
• Is my sample contaminated?
• Read QC: FastQC, Qualimap, Kraken, BLAST
• Adaptor/quality trimming: Trimmomatic
• Species ID: BLAST, Metaphlan, MOCAT
• Sample QC: Blobology
Did my sequencing work?
• FastQC:
What coverage do I have?
• SNP calling: >10x (>15x better)
• De novo assembly: >30x (50x probably better)
• No benefit above about 100x for standard
applications: extra depth only slows everything
down and takes more disk space
• (BTW, FASTQ files are probably a waste of
space)
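A minimal sketch of the coverage arithmetic behind these rules of thumb, assuming mean depth is simply total sequenced bases divided by genome size. The function names and the example run (1 million reads of 250 bases on a 5 Mb genome) are illustrative assumptions, not values from the talk.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def coverage_advice(depth: float) -> str:
    """Map a mean depth onto the thresholds listed on the slide."""
    if depth > 100:
        return "more than needed: consider downsampling"
    if depth > 30:
        return "enough for SNP calling and de novo assembly"
    if depth > 10:
        return "enough for SNP calling only"
    return "too low"


# Hypothetical run: 1 million reads of 250 bases on a 5 Mb genome.
depth = mean_coverage(num_reads=1_000_000, read_length=250, genome_size=5_000_000)
print(f"{depth:.0f}x: {coverage_advice(depth)}")
```

For paired-end data, count both reads of each pair in `num_reads`.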
What are the fragment lengths?
• Qualimap (or just BWA)
• Bad: fragment length < read length
• OK: fragment length > read length
• Good: fragment length > 2x read length
You are in dangerous territory dealing with
repetitive regions longer than the fragment
length, regardless of read depth coverage
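The fragment-length rule can be sketched as a small classifier. This is only an illustration: the function names are hypothetical, and treating a fragment exactly equal to the read length as "OK" is an assumption, since the slide does not cover that case.

```python
def classify_fragment_length(fragment_length: int, read_length: int) -> str:
    """Apply the Bad / OK / Good rule of thumb to one library."""
    if fragment_length < read_length:
        return "Bad"   # reads run through the insert into the adaptor
    if fragment_length > 2 * read_length:
        return "Good"  # the two reads of a pair do not overlap
    return "OK"        # includes the boundary cases (assumption)


def repeat_is_risky(repeat_length: int, fragment_length: int) -> bool:
    """Repeats longer than the fragment cannot be spanned by a read pair."""
    return repeat_length > fragment_length


# Hypothetical library: 2 x 250 base reads, 600 base fragments.
print(classify_fragment_length(fragment_length=600, read_length=250))
# A ~5 kb rRNA operon is longer than any such fragment, whatever the depth.
print(repeat_is_risky(repeat_length=5000, fragment_length=600))
```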
Repetitive regions
This is important because repeat-containing regions are often
the most interesting parts of the genome! Think:
• Insertion elements
• Transposons
• Plasmids
• Ribosomal RNA
REPEAT: You are in dangerous territory dealing
with repetitive regions longer than the fragment
length, regardless of read depth coverage
Do not trust the computer
Bioinformatics software will do its best to look
like it is dealing with repeats in a rational way,
but it is in fact plotting aggressively to ruin your
analysis without telling you.
Computers are just like that!
If repeats are important to your analysis, you need an
alternative sequencing strategy: long mate-pairs, long reads
(Pacific Biosciences or Oxford Nanopore). Don’t drive
yourself mad making short reads do what they can’t.
Adaptor trim reads
• With Nextera libraries, failing to adaptor trim
will KILL your assemblies.
• Particularly important when mean fragment
length < read length.
• Many trimmers available: I like to use
Trimmomatic
• Quality trimming is not important with modern
tools (BWA and SPAdes)
For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
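To see why short fragments make adaptor trimming matter, here is a toy exact-match trimmer. It is a sketch only and not how Trimmomatic works (real trimmers tolerate mismatches and use paired-end information). The adaptor constant is the commonly used Nextera transposase sequence; the function name and `min_overlap` default are assumptions for illustration.

```python
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def trim_adapter(read: str, adapter: str = NEXTERA_ADAPTER, min_overlap: int = 5) -> str:
    """Clip the adaptor and everything after it from the 3' end of a read."""
    # Case 1: the whole adaptor is present somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adaptor.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


# A fragment shorter than the read length: the read runs into the adaptor.
insert = "ACGTACGTAGGCTA"
print(trim_adapter(insert + NEXTERA_ADAPTER + "GGG"))
print(trim_adapter(insert + NEXTERA_ADAPTER[:8]))
```

Left untrimmed, the adaptor bases are shared by many unrelated reads, which is one way an assembly graph gets tangled.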
Is my sample what I think it is?
• BLASTing a few random reads is usually a very
efficient quality control check, as well as
helping to identify a reference genome
• Kraken or Metaphlan can give rapid organism
report
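One way to pick "a few random reads" without loading the whole file is reservoir sampling. The sketch below assumes well-formed four-line FASTQ records. The function names are hypothetical, and the output is plain FASTA text that could be pasted into a BLAST search.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (read name, sequence)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line
        next(it)  # quality string, not needed for BLAST
        yield header.strip()[1:], seq


def sample_reads(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Reservoir sampling: a uniform random sample of k records in one pass."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def to_fasta(records: Iterable[Record]) -> str:
    """Format records as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# In-memory demo; with real data, pass an open file handle instead of a list.
demo = []
for n in range(20):
    demo += [f"@read{n}", "ACGTACGT", "+", "IIIIIIII"]
print(to_fasta(sample_reads(read_fastq(demo), k=3)))
```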
Species identification
• Methods:
– 16S rDNA extraction (typically following de novo
assembly and annotation) and BLAST
– Taxon-defining genes (e.g. Metaphlan)
– Phylogenetic approach (e.g. MOCAT, Phylosift)
For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
[Figure: diagram relating the isolate genome, the sequence reads, other samples on the sequencing run, contamination and unsequenced regions]
Sources of contamination
• Accidental multiple colony picks or mixed liquid
culture
– Same or different organism
– E.g. Achromobacter & Pseudomonas aeruginosa in CF
• Reagent contamination (DNA extractions)
• Sequencer “carry-over” (0.2%?)
• PhiX control sequence <- don’t be this guy
• Barcode “cross-over” (bad pipetting technique or
contaminated reagents)
Blobology
Contamination
Reference-based or de novo?
• Reference-based
– Implies ALIGNMENT to reference
– Implies you HAVE a reference
– Allows exquisitely sensitive and specific SNP calling
(forensic SNP calling to single mutation precision)
– Important for looking at CHAINS OF TRANSMISSION
– Can only call in parts of the genome COMMON
between your SAMPLES and REFERENCE: the CORE
Reference-based or de novo?
• De novo
– Implies de novo assembly
– Does NOT require a reference
– Gives access to the entire PAN-genome
– E.g.
• Unexpected antibiotic resistance genes
• Virulence factors
– Can give misleading results in REPEAT sequences
– Not suitable for very fine-resolution SNP analysis
In practice
• Most people will want to do both.
• And if you have no reference, you can use a
draft de novo assembly AS your reference
– But exercise caution
Reference-based approach
• Read QC: FastQC, Qualimap, Kraken, BLAST
• Adaptor/quality trimming: Trimmomatic
• Species ID: BLAST, Metaphlan, MOCAT
• Sample QC: Blobology
• Alignment: BWA
• Variant calling: Samtools/VarScan, GATK
• SNP extraction & filter: custom script, snippy, snpEff, BRESEQ
• Recombination filtering: Gubbins, ClonalFrameML
• Tree building: FastTree, RAxML
• MLST/Antibiogram: SRST2
Analysis choice highly species
dependent: not one size fits all!
• What is the mode and tempo of evolution?
• Monomorphic organisms:
– Characterised by vertical pattern of inheritance
– Isolates differ by few mutations
• Highly recombinogenic organisms
– Mutations dominated by recombination
– May have vast differences in gene content, gene
order
– “Clonal frame” may be obscured or absent
Different species require different
analysis strategies
[Figure: species arranged along an axis of variation, with the labels "Clonal population structure", "Branching phylogenies", "Open pan-genome", "Horizontal gene transfer", "High rates of recombination" and "Phylogenetic networks". Species shown: M. tuberculosis, S. aureus, B. anthracis, E. coli, P. aeruginosa, N. meningitidis, S. pneumoniae, Salmonella]
Tips for picking a reference
• The higher quality the better (aim for pre-NGS
Sanger genomes, e.g. <2001)
• Ideally single contig, no gaps
• Canonical strains have the most portable and
most widely cited gene references, e.g. TB H37Rv,
PAO1, E. coli K-12 etc.
• For SNP calling specificity: more closely
related is better
The core genome
• The core genome used to
call SNPs will reduce as
more genomes are added
• Particularly noticeable in
species with highly
plastic genomes: E. coli
• Has significance for
forensic applications
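The shrinking core can be illustrated with set intersections. This sketch assumes each sample is represented by the set of reference positions at which it can be called, which is a hypothetical data layout chosen for clarity and not the format of any particular tool.

```python
from typing import Iterable, List, Set


def core_size_as_samples_added(callable_positions: Iterable[Set[int]]) -> List[int]:
    """Size of the shared (core) region after each sample is added."""
    sizes: List[int] = []
    core = None
    for positions in callable_positions:
        core = set(positions) if core is None else core & positions
        sizes.append(len(core))
    return sizes


# Toy 10-position reference; each sample lacks a different region.
samples = [
    set(range(0, 10)),  # sample A covers everything
    set(range(0, 8)),   # sample B lacks the last two positions
    set(range(2, 10)),  # sample C lacks the first two positions
]
print(core_size_as_samples_added(samples))
```

The sizes can only stay the same or fall as samples are added, so a SNP distance computed over the core depends on which other genomes are in the comparison.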
Is my reference good enough?
• Assess core genome size
– Harvest will do this for you
• Or look at samtools flagstat (?)
• Between-sample SNP calling efficiency goes
down with reference divergence
• Luxury option: get a Pacific Biosciences
complete reference done for each “clone” in
your dataset (for some definition of clone)
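As a rough check of reference fit, the fraction of mapped reads can be pulled out of `samtools flagstat` text. The sketch below is hedged: the example text is hand-written in the style of flagstat output and not from a real run, the exact layout varies between samtools versions, and the function name is hypothetical.

```python
import re

# Hand-written, abridged example in the style of flagstat output (assumption).
FLAGSTAT_EXAMPLE = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
970 + 0 mapped (97.00% : N/A)
1000 + 0 paired in sequencing
"""

LINE = re.compile(r"(\d+) \+ (\d+) (in total|mapped) ")


def mapped_fraction(flagstat_text: str) -> float:
    """Return QC-passed mapped reads divided by QC-passed total reads."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # also skips lines such as "primary mapped"
        if m.group(3) == "in total":
            total = int(m.group(1))
        else:
            mapped = int(m.group(1))
    if not total or mapped is None:
        raise ValueError("could not find total and mapped counts")
    return mapped / total


print(f"{mapped_fraction(FLAGSTAT_EXAMPLE):.0%} of reads mapped")
```

A low mapped fraction suggests the reference is missing a sizeable part of the sample's genome, or that the sample is not what you think it is.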
Effect of closer reference on P. aeruginosa genotyping

Reference          SNPs  Indels  Mapped
PAO1 reference     23    4       77%
PacBio reference   40    5       97%

Quick, Loman et al. BMJ Open 2014
SNP filtering
• A specific SNP dataset (few false positives) is vital for
effective phylogenetic reconstruction and outbreak
tracing
• Most SNP calling errors come from
– A) misalignment (sequence present in the sample but not
in the reference, so its reads align elsewhere)
– B) copy number variation (2 copies in sample, 1 copy
in reference)
• NOT from sequencing error (at least with
Illumina: other platforms have systematic errors)
SNP filtering (2)
• Allele frequency filter is most effective SNP filter
– AF > 0.9 (90%) works very well empirically
• Strand filter also very useful to prevent SNPs
around structural variations
• Filtering for low coverage is not that helpful:
– 1/1000 error rate (Q30) at a minimum coverage of 3 gives
(1/1000)^3 = 0.000000001 chance of an error per position,
i.e. < 1 error per genome
• Avoid SNPs at the ends of contigs as these may be due to
mismapping
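The filters above can be sketched on per-site read counts. The `SnpCall` record, the function names and the strand rule (alternate allele seen on both strands) are assumptions made for illustration; the slide gives the AF > 0.9 threshold but does not say how the strand filter is implemented. The last function reproduces the Q30 arithmetic.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    """Read counts at one candidate SNP site (hypothetical record layout)."""
    position: int
    alt_forward: int
    alt_reverse: int
    ref_forward: int
    ref_reverse: int

    @property
    def depth(self) -> int:
        return self.alt_forward + self.alt_reverse + self.ref_forward + self.ref_reverse

    @property
    def allele_frequency(self) -> float:
        return (self.alt_forward + self.alt_reverse) / self.depth


def passes_filters(call: SnpCall, min_af: float = 0.9, min_alt_per_strand: int = 1) -> bool:
    """Keep a SNP only if AF > min_af and both strands support the alternate allele."""
    if call.depth == 0:
        return False
    if call.allele_frequency <= min_af:
        return False
    return call.alt_forward >= min_alt_per_strand and call.alt_reverse >= min_alt_per_strand


def chance_all_reads_wrong(phred: int, depth: int) -> float:
    """Chance that every one of `depth` independent reads is wrong at a position."""
    return (10 ** (-phred / 10)) ** depth


calls = [
    SnpCall(100, alt_forward=10, alt_reverse=9, ref_forward=0, ref_reverse=1),
    SnpCall(200, alt_forward=12, alt_reverse=0, ref_forward=0, ref_reverse=0),
    SnpCall(300, alt_forward=5, alt_reverse=5, ref_forward=5, ref_reverse=5),
]
print([c.position for c in calls if passes_filters(c)])

# Q30 at depth 3, scaled to a hypothetical 5 Mb genome.
print(chance_all_reads_wrong(30, 3) * 5_000_000)
```

The independence assumption makes the error estimate optimistic, which is the slide's point: the errors that matter are systematic (misalignment, copy number), not random.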
Detecting recombination
• Simple algorithms rely on SNP density, more
complex ones assess impact on “clonal
frame”
Normal SNP density Recombining region
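A minimal sketch of the simple SNP-density idea: count SNPs in fixed windows and flag windows with unusually many. The window size, the threshold and the function names are illustrative assumptions. Tools such as Gubbins and ClonalFrameML use more sophisticated models than this.

```python
from typing import List, Sequence


def snp_counts_per_window(positions: Sequence[int], genome_length: int, window: int = 1000) -> List[int]:
    """Count SNPs (0-based positions) in consecutive fixed-size windows."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts


def dense_windows(counts: Sequence[int], max_snps: int) -> List[int]:
    """Indices of windows holding more SNPs than expected from mutation alone."""
    return [i for i, c in enumerate(counts) if c > max_snps]


# Toy 10 kb genome: scattered SNPs plus a cluster between 6,000 and 7,000.
scattered = [150, 2300, 4800, 9100]
cluster = [6010, 6050, 6120, 6300, 6420, 6555, 6700, 6890]
counts = snp_counts_per_window(scattered + cluster, genome_length=10_000)
print(counts)
print(dense_windows(counts, max_snps=3))
```

SNPs in flagged windows would be masked before tree building, so that imported DNA does not inflate branch lengths.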
Impact of recombination filtering
De novo approach
• Interrogate the accessory genome
– Novel genes
• Some important applications take contigs
rather than reads as primary input
• SNP calling with de novo assembly is
fundamentally less reliable due to lack of
allele frequency information; but fine for
broad-scale clustering
Reference-based approach
• Read QC: FastQC, Qualimap
• Adaptor/quality trimming: Trimmomatic
• Species ID: BLAST, Metaphlan, MOCAT
• Sample QC: Blobology, Kraken, BLAST
• Alignment: BWA
• Variant calling: Samtools/VarScan, GATK
• SNP extraction & filter: custom script, snippy
• Recombination filtering: Gubbins, ClonalFrameML
• Tree building: FastTree, RAxML
• MLST/Antibiogram: SRST2
De novo approach
• Assembly: Velvet, SPAdes
• MLST/Antibiogram: mlst, Abricate
• Annotation: Prokka
• Tree building: Harvest
• Population genomics: BIGSdb, PHYLOViZ
• Pan-genome: LS-BSR
Concluding thoughts
1. Don’t trust your sequencing data (or others’)
– sense-check and validate each step
2. Make extensive use of visualisation tools to
do this
3. There’s more than one way to do any one
task
CLoud Infrastructure for Microbial
Bioinformatics (CLIMB)
• MRC funded project to
develop Cloud
Infrastructure for
microbial bioinformatics
• £4M of hardware, capable
of supporting >1000
individual virtual servers
• Amazon/Google cloud for
Academics
Meet-The-Expert
• Meet-The-Expert: Joao Carrico and I
• Tomorrow (Monday)
• 07:45 (really)
• Hall M
• Session ME11 What bioinformatics tools do I use for whole-
genome sequence (WGS)-based bacterial diagnostics and
typing?
Acknowledgements
• Twitter comments:
– Tom Connor, Alan McNally, Torsten Seemann, C.
Titus Brown, Heng Li, Christoffer Flensburg, Matt
MacManes, Rachel Glover, Willem van Schaik, Bill
Hanage, Jennifer Gardy, Mick Watson, Esther
Robinson, Nicola Fawcett, Aziz Aboobaker, Ruth
Massey
ECCMID 2015 - So I have sequenced my genome ... what now?

  • 1. So I have sequenced my organism … what do I do now? Nick Loman
  • 7. Whole-genome sequencing: utility in clinical microbiology • Diagnostics – Species, subspecies, strain identification – In silico antibiogram – In silico virulence profile • Surveillance • Typing (including backwards compatibility with MLST and serotype) • What strains and resistance elements are lurking in my hospital/community? • Forensic epidemiology – Is there an outbreak? • Who gave what to who?
  • 8. Common types of sequencing • Paired-end Illumina (typically 150 – 300 bases) • Single-end Ion Torrent (typically 300-400 bases) – Can be treated more or less the same • Pacific Biosciences or Oxford Nanopore – Requires special handling, not covered today
  • 9. Quality Control: Questions to Ask • Did my sequencing work? • What are the fragment lengths? • Is my sample what I think it is? • Is my sample contaminated? Read QC Adaptor/quality trimming Species ID Sample QC FastQC, Qualimap, Kraken, BLAST Trimmomatic BLAST, Metaphlan, MOCAT Blobology
  • 10. Did my sequencing work? • FastQC:
  • 11. What coverage do I have? • SNP calling: >10x (>15x better) • De novo assembly: >30x (50x probably better) • Absolutely no benefits over about 100x for standard applications and slows everything down and takes more disk space • (BTW, FASTQ files are probably a waste of space)
  • 12. What are the fragment lengths? • Qualimap (or just BWA) Bad Fragment length < read length OK Fragment length > read length Good Fragment length > 2x read length You are in dangerous territory dealing with repetitive regions longer than the fragment length, regardless of read depth coverage
  • 13. Repetitive regions This is important because repeat-containing are often the most interesting parts of the genome! Think: • Insertion elements • Transposons • Plasmids • Ribosomal RNA REPEAT: You are in dangerous territory dealing with repetitive regions longer than the fragment length, regardless of read depth coverage
  • 14. Do not trust the computer Bioinformatics software will do its best to look like it is dealing with repeats in a rational way, but it is in fact plotting aggressively to ruin your analysis without telling you. Computers are just like that! If repeats are important to your analysis, you need an alternative sequencing strategy: long mate-pairs, long reads (Pacific Biosciences or Oxford Nanopore). Don’t drive yourself mad making short reads do what they can’t.
  • 15. Adaptor trim reads • With Nextera libraries, failing to adaptor trim will KILL your assemblies. • Particularly important when mean fragment length < read length. • Many trimmers available: I like to use Trimmomatic • Quality trimming not important with modern tools (BWA and Spades) For more explanation: http://nickloman.github.io/high- throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die- experiences-with-nextera-libraries/
  • 16. Is my sample what I think it is? • BLASTing a few random reads usually very efficient quality control check, as well as helping identify a reference genome • Kraken or Metaphlan can give rapid organism report
  • 17. Species identification • Methods: – 16S rDNA extraction (typically following de novo assembly and annotation) and BLAST – Taxon-defining genes (e.g. Metaphlan) – Phylogenetic approach (e.g. MOCAT, Phylosift) For more explanation: http://nickloman.github.io/high- throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die- experiences-with-nextera-libraries/
  • 18. Isolate genome Sequence reads Other samples on sequencing run Contamination Unsequenced regions
  • 20. Sources of contamination • Accidental multiple colony picks or mixed liquid culture – Same or different organism – E.g. Achromobacter & Pseudomonas aeruginosa in CF • Reagent contamination (DNA extractions) • Sequencer “carry-over” (0.2%?) • PhiX control sequence <- don’t be this guy • Barcode “cross-over” (bad pipetting technique or contaminated reagents)
  • 23. Adaptor trim reads • With Nextera libraries, failing to adaptor trim will KILL your assemblies. • Particularly important when mean fragment length < read length. • Many trimmers available: I like to use Trimmomatic For more explanation: http://nickloman.github.io/high- throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die- experiences-with-nextera-libraries/
  • 25. Reference-based or de novo? • Reference-based – Implies ALIGNMENT to reference – Implies you HAVE a reference – Allows exquisitely sensitive and specific SNP calling (forensic SNP calling to single mutation precision) – Important for looking at CHAINS OF TRANSMISSION – Can only call in parts of the genome COMMON between your SAMPLES and REFERENCE: the CORE
  • 26. Reference-based or de novo? • De-novo – Implies de novo assembly – Does NOT require a reference – Gives access to the entire PAN-genome – E.g. • Unexpected antibiotic resistance genes • Virulence factors – Can give misleading results in REPEAT sequences – Not suitable for very fine-resolution SNP analysis
  • 27. In practice • Most people will want to do both. • And if you have no reference, you can use a draft de novo assembly AS your reference – But exercise caution
  • 28. Reference-based approach Alignment Variant calling SNP extraction & filter Recombination filtering Tree building MLST/Antibiogram Read QC Adaptor/quality trimming Species ID Sample QC FastQC, Qualimap, Kraken, BLAST Trimmomatic BLAST, Metaphlan, MOCAT Blobology BWA Samtools/VarScan GATK Custom script, snippy, snpEff, BRESEQ Gubbins, ClonalFrameML FastTree, RaXML SRST2
  • 29. Analysis choice highly species dependent: not one size fits all! • What is the mode and tempo of evolution? • Monomorphic organisms: – Characterised by vertical pattern of inheritance – Isolates differ by few mutations • Highly recombinogenic organisms – Mutations dominated by recombination – May have vast differences in gene content, gene order – “Clonal frame” may be obscured or absent
  • 30. Different species require different analysis strategies • [Figure] Axis: variation • Species: M. tuberculosis, S. aureus, B. anthracis, E. coli, P. aeruginosa, N. meningitidis, S. pneumoniae, Salmonella • Annotations: clonal population structure; branching phylogenies; open pan-genome; horizontal gene transfer; high rates of recombination; phylogenetic networks
  • 31. Tips for picking a reference • The higher the quality the better (aim for pre-NGS Sanger genomes, e.g. <2001) • Ideally a single contig with no gaps • Canonical strains have the most portable and widely cited gene references, e.g. TB H37Rv, PAO1, E. coli K-12 etc. • For SNP calling specificity: more closely related is better
  • 32. The core genome • The core genome used to call SNPs will reduce as more genomes are added • Particularly noticeable in species with highly plastic genomes: E. coli • Has significance for forensic applications
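The shrinking core can be sketched as a running set intersection. The snippet below is a minimal illustration, assuming each sample is summarised as the set of reference positions where a call can be made. The sample names, the coordinates and the `core_size_as_samples_added` helper are all hypothetical.

```python
from typing import Dict, List, Set


def core_size_as_samples_added(
    callable_sites: Dict[str, Set[int]], order: List[str]
) -> List[int]:
    """Size of the shared (core) position set after adding each sample in turn."""
    core: Set[int] = set()
    sizes: List[int] = []
    for i, name in enumerate(order):
        # The first sample defines the starting core; every later sample
        # can only keep or remove positions, never add them.
        core = set(callable_sites[name]) if i == 0 else core & callable_sites[name]
        sizes.append(len(core))
    return sizes


# Toy 10-position reference; each sample covers a different stretch.
sites = {
    "sample_A": set(range(0, 10)),
    "sample_B": set(range(0, 8)),
    "sample_C": set(range(2, 10)),
}
print(core_size_as_samples_added(sites, ["sample_A", "sample_B", "sample_C"]))
# [10, 8, 6]
```

Because intersection can never grow the set, the core is non-increasing as genomes are added, which is why a SNP distance reported on one dataset may not be comparable to one reported on a larger dataset.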
  • 33. Is my reference good enough? • Assess core genome size – Harvest will do this for you • Or look at samtools flagstat (?) • Between-sample SNP calling efficiency goes down with reference divergence • Luxury option: get a Pacific Biosciences complete reference done for each “clone” in your dataset (for some definition of clone)
  • 34. Effect of closer reference on P. aeruginosa genotyping • PAO1 reference: 23 SNPs, 4 indels, 77% mapped • PacBio reference: 40 SNPs, 5 indels, 97% mapped • Source: Quick, Loman et al. BMJ Open 2014
  • 35. SNP filtering • A specific SNP dataset (few false positives) is vital for effective phylogenetic reconstruction and outbreak tracing • Most SNP calling errors come from – A) misalignment (reads from sequence present in the sample but not in the reference still align, in the wrong place) – B) copy number variation (2 copies in sample, 1 copy in reference) • NOT from sequencing error (at least with Illumina: other platforms have systematic errors)
  • 36. SNP filtering (2) • Allele frequency filter is the most effective SNP filter – AF > 0.9 (90%) works very well empirically • Strand filter also very useful to prevent false SNP calls around structural variations • Filtering for low coverage not that helpful: – with a 1/1000 per-base error rate (Q30) and a minimum coverage of 3, the chance that all three reads are wrong at a position is (1/1000)^3 = 0.000000001, i.e. < 1 error per genome • Avoid SNPs at ends of contigs as these may result from mismapping
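A minimal Python sketch of the two filters and of the coverage arithmetic follows. The `Candidate` record layout is hypothetical, the reading of "strand filter" as "variant seen on both strands" is an assumption, and the 5 Mb genome size is just an example figure. Only the AF > 0.9 threshold and the Q30 / coverage-of-3 numbers come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Read counts at one candidate SNP position (hypothetical record layout)."""
    pos: int
    alt_fwd: int  # variant-supporting reads on the forward strand
    alt_rev: int  # variant-supporting reads on the reverse strand
    depth: int    # all reads covering the position

    @property
    def allele_frequency(self) -> float:
        if self.depth == 0:
            return 0.0
        return (self.alt_fwd + self.alt_rev) / self.depth


def passes_filters(c: Candidate, min_af: float = 0.9) -> bool:
    """AF > 0.9 (from the slide) plus an assumed both-strands requirement."""
    return c.allele_frequency > min_af and c.alt_fwd > 0 and c.alt_rev > 0


def chance_all_reads_wrong(phred: int, coverage: int) -> float:
    """Probability that every covering read is wrong at one position,
    treating errors as independent (a simplifying assumption)."""
    per_base_error = 10 ** (-phred / 10)  # Phred definition: Q30 -> 1/1000
    return per_base_error ** coverage


candidates = [
    Candidate(pos=100, alt_fwd=12, alt_rev=11, depth=24),  # AF ~0.96, both strands
    Candidate(pos=200, alt_fwd=6, alt_rev=5, depth=20),    # AF 0.55: mixed signal
    Candidate(pos=300, alt_fwd=15, alt_rev=0, depth=15),   # AF 1.0 but one strand only
]
print([c.pos for c in candidates if passes_filters(c)])  # [100]

p = chance_all_reads_wrong(phred=30, coverage=3)
genome_size = 5_000_000  # assumed genome length in bases
print(f"{p:.1e} per position, {p * genome_size:.3f} expected per genome")
```

The second candidate is the kind of call a copy-number difference or a mixed sample tends to produce, which is why the allele frequency filter removes more false positives than a depth cut-off does.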
  • 37. Detecting recombination • Simple algorithms rely on SNP density; more complex ones assess impact on the “clonal frame” • [Figure labels: normal SNP density; recombining region]
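The SNP-density idea can be sketched in a few lines of Python. This is only the naive "count SNPs per window" approach, not a reimplementation of Gubbins or ClonalFrameML. The window size, the threshold and the `dense_windows` helper are arbitrary choices made for the example.

```python
from typing import List, Tuple


def dense_windows(
    snp_positions: List[int],
    genome_length: int,
    window: int = 1000,
    max_snps: int = 3,
) -> List[Tuple[int, int, int]]:
    """Return (start, end, count) for each fixed, non-overlapping window
    that holds more than max_snps SNPs. Positions are 0-based."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in snp_positions:
        counts[pos // window] += 1
    return [
        (i * window, min((i + 1) * window, genome_length), n)
        for i, n in enumerate(counts)
        if n > max_snps
    ]


# Toy example: scattered SNPs plus a cluster between 5,000 and 6,000.
snps = [150, 2900, 5010, 5100, 5230, 5400, 5810, 5990, 8700]
print(dense_windows(snps, genome_length=10_000))  # [(5000, 6000, 6)]
```

Windows flagged this way are candidates for masking before tree building. A sensible threshold depends on the background mutation rate of the organism, so the default used here should not be read as a recommendation.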
  • 39. De novo approach • Interrogate the accessory genome – Novel genes • Some important applications take contigs rather than reads as primary input • SNP calling with de novo assembly is fundamentally less reliable due to lack of allele frequency information; but fine for broad-scale clustering
  • 40. Reference-based approach • Read QC: FastQC, Qualimap • Adaptor/quality trimming: Trimmomatic • Species ID: BLAST, MetaPhlAn, MOCAT • Sample QC: Blobology, Kraken, BLAST • Alignment: BWA • Variant calling: Samtools/VarScan, GATK • SNP extraction & filter: custom script, snippy • Recombination filtering: Gubbins, ClonalFrameML • Tree building: FastTree, RAxML • MLST/Antibiogram: SRST2 • De novo approach • Assembly: Velvet, SPAdes • Annotation: Prokka • Tree building: Harvest • Population genomics: BIGSdb, PHYLOViZ • Pan-genome: LS-BSR • MLST/Antibiogram: mlst, Abricate
  • 41. Concluding thoughts 1. Don’t trust your sequencing data (or others’) – sense-check and validate each step 2. Make extensive use of visualisation tools to do this 3. There’s more than one way to do any one task
  • 42. CLoud Infrastructure for Microbial Bioinformatics (CLIMB) • MRC funded project to develop Cloud Infrastructure for microbial bioinformatics • £4M of hardware, capable of supporting >1000 individual virtual servers • Amazon/Google cloud for Academics
  • 43. Meet-The-Expert • Meet-The-Expert: Joao Carrico and I • Tomorrow (Monday) • 07:45 (really) • Hall M • Session ME11: What bioinformatics tools do I use for whole-genome sequence (WGS)-based bacterial diagnostics and typing?
  • 44. Acknowledgements • Twitter comments: – Tom Connor, Alan McNally, Torsten Seemann, C. Titus Brown, Heng Li, Christoffer Flensburg, Matt MacManes, Rachel Glover, Willem van Schaik, Bill Hanage, Jennifer Gardy, Mick Watson, Esther Robinson, Nicola Fawcett, Aziz Aboobaker, Ruth Massey

Editor's Notes

  1. Reminds me of an old joke: A man is travelling and stops an old man on the road and says “How do I get to xyz?”. The man pauses and has a good think about it. He asks “You want to get to xyz?”. He pauses again and concludes: “Well if I wanted to get to xyz, I wouldn’t have started from here.”
  2. Caution with filtering: several important antibiotic resistance mutations may occur in only some copies of a multi-copy gene, e.g. 23S rRNA (linezolid resistance). Filtering will exclude these!
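A rough worked example of note 2 in Python. It assumes reads from every copy of the gene pile up at a single locus with equal depth, which is a simplification. The copy numbers are hypothetical, and `expected_allele_frequency` is a helper invented for this illustration.

```python
def expected_allele_frequency(mutated_copies: int, total_copies: int) -> float:
    """Fraction of reads expected to carry the mutation when reads from all
    copies of a multi-copy gene pile up at one locus with equal depth."""
    if total_copies <= 0 or not 0 <= mutated_copies <= total_copies:
        raise ValueError("need 0 <= mutated_copies <= total_copies and total_copies > 0")
    return mutated_copies / total_copies


AF_THRESHOLD = 0.9  # the empirical cut-off quoted in the SNP filtering slides

# Hypothetical organism with 5 copies of the gene, 2 of them mutated.
af = expected_allele_frequency(mutated_copies=2, total_copies=5)
print(af, af > AF_THRESHOLD)  # 0.4 False
```

Under these assumptions the mutation is dropped unless every copy carries it, so resistance loci of this kind are worth checking with a separate, unfiltered look at the pileup.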