ECCMID 2015 - So I have sequenced my genome ... what now?
Presentation for Sunday 26 April 2015

Published in: Science

  1. So I have sequenced my organism … what do I do now? Nick Loman
  2. Oh dear
  3. Sequence some more
  4. Sensible
  5. Useful things
  6. Whole-genome sequencing: utility in clinical microbiology
      • Diagnostics
        – Species, subspecies, strain identification
        – In silico antibiogram
        – In silico virulence profile
      • Surveillance
      • Typing (including backwards compatibility with MLST and serotype)
      • What strains and resistance elements are lurking in my hospital/community?
      • Forensic epidemiology
        – Is there an outbreak?
      • Who gave what to who?
  7. Common types of sequencing
      • Paired-end Illumina (typically 150–300 bases)
      • Single-end Ion Torrent (typically 300–400 bases)
        – Can be treated more or less the same
      • Pacific Biosciences or Oxford Nanopore
        – Requires special handling, not covered today
  8. Quality Control: Questions to Ask
      • Did my sequencing work?
      • What are the fragment lengths?
      • Is my sample what I think it is?
      • Is my sample contaminated?
      Workflow steps and tools:
      • Read QC: FastQC, Qualimap, Kraken, BLAST
      • Adaptor/quality trimming: Trimmomatic
      • Species ID: BLAST, Metaphlan, MOCAT
      • Sample QC: Blobology
  9. Did my sequencing work?
      • FastQC:
  10. What coverage do I have?
      • SNP calling: >10x (>15x better)
      • De novo assembly: >30x (50x probably better)
      • No benefit above about 100x for standard applications: it slows everything down and takes more disk space
      • (BTW, FASTQ files are probably a waste of space)
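The coverage thresholds on this slide can be checked with simple arithmetic: mean depth is roughly the total number of sequenced bases divided by the genome size. The sketch below is a minimal illustration; the function names and the example run (2,000,000 reads of 150 bases on a 5 Mb genome) are assumptions for demonstration, not values from the talk.

```python
def estimate_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate mean depth: total sequenced bases / genome size.

    For paired-end data, num_reads should count both mates.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def coverage_advice(depth: float) -> str:
    """Map a depth onto the rule-of-thumb thresholds from the slide."""
    if depth > 100:
        return "more than needed: consider downsampling"
    if depth > 30:
        return "enough for de novo assembly and SNP calling"
    if depth > 10:
        return "enough for SNP calling only"
    return "too low for SNP calling or assembly"


# Hypothetical run: 2,000,000 reads of 150 bases on a 5 Mb genome.
depth = estimate_coverage(2_000_000, 150, 5_000_000)
print(depth, coverage_advice(depth))
```

This estimate ignores reads lost to trimming, contamination, or failure to map, so the depth actually available for SNP calling will usually be somewhat lower.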
  11. What are the fragment lengths?
      • Qualimap (or just BWA)
      • Bad: fragment length < read length
      • OK: fragment length > read length
      • Good: fragment length > 2x read length
      You are in dangerous territory dealing with repetitive regions longer than the fragment length, regardless of read depth coverage
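The Bad/OK/Good rule above can be written as a small check on the median fragment (insert) size reported by a tool such as Qualimap. This is a sketch only: the function names are hypothetical, and treating a fragment length exactly equal to the read length as "ok" is an assumption, since the slide does not cover that case.

```python
from statistics import median
from typing import Sequence


def classify_fragment_length(fragment_length: float, read_length: int) -> str:
    """Apply the slide's rule of thumb to one fragment length."""
    if fragment_length < read_length:
        return "bad"   # reads run through the fragment into the adaptor
    if fragment_length > 2 * read_length:
        return "good"  # paired reads do not overlap
    return "ok"        # assumption: the boundary cases count as "ok"


def classify_library(fragment_lengths: Sequence[int], read_length: int) -> str:
    """Classify a library by its median fragment length."""
    return classify_fragment_length(median(fragment_lengths), read_length)


# Hypothetical insert sizes for a run with 150-base reads.
print(classify_library([120, 130, 140, 160, 135], 150))
print(classify_library([400, 450, 380, 500, 420], 150))
```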
  12. Repetitive regions
      This is important because repeat-containing regions are often the most interesting parts of the genome! Think:
      • Insertion elements
      • Transposons
      • Plasmids
      • Ribosomal RNA
      REPEAT: You are in dangerous territory dealing with repetitive regions longer than the fragment length, regardless of read depth coverage
  13. Do not trust the computer
      Bioinformatics software will do its best to look like it is dealing with repeats in a rational way, but it is in fact plotting aggressively to ruin your analysis without telling you. Computers are just like that!
      If repeats are important to your analysis, you need an alternative sequencing strategy: long mate-pairs, long reads (Pacific Biosciences or Oxford Nanopore). Don’t drive yourself mad making short reads do what they can’t.
  14. Adaptor trim reads
      • With Nextera libraries, failing to adaptor trim will KILL your assemblies.
      • Particularly important when mean fragment length < read length.
      • Many trimmers available: I like to use Trimmomatic
      • Quality trimming not important with modern tools (BWA and SPAdes)
      For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
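To show why short fragments matter here: when the fragment is shorter than the read, the read runs off the end of the insert and into the adaptor, so adaptor sequence ends up at the 3' end of the read. The sketch below illustrates the idea with exact matching only. It is not how Trimmomatic works, and the adaptor string is a made-up placeholder, not a real Nextera sequence.

```python
def trim_adaptor(read: str, adaptor: str, min_overlap: int = 5) -> str:
    """Remove adaptor sequence from the 3' end of a read (exact matches only)."""
    # Case 1: the whole adaptor occurs inside the read (fragment much shorter than read).
    idx = read.find(adaptor)
    if idx != -1:
        return read[:idx]
    # Case 2: only the start of the adaptor fits before the read ends.
    longest = min(len(adaptor), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:overlap]):
            return read[:-overlap]
    return read


# Hypothetical adaptor and insert, for illustration only.
ADAPTOR = "AGGCTTCACGTA"
insert = "TTGACCATGG"

print(trim_adaptor(insert + ADAPTOR + "TT", ADAPTOR))  # full adaptor read-through
print(trim_adaptor(insert + ADAPTOR[:6], ADAPTOR))     # partial adaptor at the read end
```

Real trimmers also tolerate sequencing errors in the adaptor and use paired-end overlap, which is why a dedicated tool is preferable to a hand-rolled script.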
  15. Is my sample what I think it is?
      • BLASTing a few random reads is usually a very efficient quality control check, and also helps identify a reference genome
      • Kraken or Metaphlan can give a rapid organism report
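One way to pick "a few random reads" without loading a whole FASTQ file into memory is reservoir sampling, then writing the sample as FASTA for pasting into BLAST. This is a minimal sketch: it assumes simple four-line FASTQ records (no wrapped sequence lines), and the function names are illustrative.

```python
import io
import random
from typing import Iterable, Iterator, List, TextIO, Tuple

Record = Tuple[str, str]  # (read name, sequence)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield (name, sequence) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        yield header.strip()[1:], sequence


def sample_reads(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Reservoir-sample k records in a single pass."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def to_fasta(records: Iterable[Record]) -> str:
    """Format records as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Tiny in-memory example with made-up reads.
fastq_text = "".join(f"@read{i}\nACGT{'A' * i}\n+\n{'I' * (4 + i)}\n" for i in range(10))
print(to_fasta(sample_reads(read_fastq(io.StringIO(fastq_text)), k=3)))
```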
  16. Species identification
      • Methods:
        – 16S rDNA extraction (typically following de novo assembly and annotation) and BLAST
        – Taxon-defining genes (e.g. Metaphlan)
        – Phylogenetic approach (e.g. MOCAT, Phylosift)
      For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
  17. [Figure labels: Isolate genome; Sequence reads; Other samples on sequencing run; Contamination; Unsequenced regions]
  18. Sources of contamination
      • Accidental multiple colony picks or mixed liquid culture
        – Same or different organism
        – E.g. Achromobacter & Pseudomonas aeruginosa in CF
      • Reagent contamination (DNA extractions)
      • Sequencer “carry-over” (0.2%?)
      • PhiX control sequence <- don’t be this guy
      • Barcode “cross-over” (bad pipetting technique or contaminated reagents)
  19. Blobology
      [Figure label: Contamination]
  20. Adaptor trim reads
      • With Nextera libraries, failing to adaptor trim will KILL your assemblies.
      • Particularly important when mean fragment length < read length.
      • Many trimmers available: I like to use Trimmomatic
      For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
  21. Reference-based or de novo?
  22. Reference-based or de novo?
      • Reference-based
        – Implies ALIGNMENT to reference
        – Implies you HAVE a reference
        – Allows exquisitely sensitive and specific SNP calling (forensic SNP calling to single mutation precision)
        – Important for looking at CHAINS OF TRANSMISSION
        – Can only call in parts of the genome COMMON between your SAMPLES and REFERENCE: the CORE
  23. Reference-based or de novo?
      • De novo
        – Implies de novo assembly
        – Does NOT require a reference
        – Gives access to the entire PAN-genome
        – E.g.
          • Unexpected antibiotic resistance genes
          • Virulence factors
        – Can give misleading results in REPEAT sequences
        – Not suitable for very fine-resolution SNP analysis
  24. In practice
      • Most people will want to do both.
      • And if you have no reference, you can use a draft de novo assembly AS your reference
        – But exercise caution
  25. Reference-based approach
      Quality control steps and tools:
      • Read QC: FastQC, Qualimap, Kraken, BLAST
      • Adaptor/quality trimming: Trimmomatic
      • Species ID: BLAST, Metaphlan, MOCAT
      • Sample QC: Blobology
      Reference-based steps and tools:
      • Alignment: BWA
      • Variant calling: Samtools/VarScan, GATK
      • SNP extraction & filter: Custom script, snippy, snpEff, BRESEQ
      • Recombination filtering: Gubbins, ClonalFrameML
      • Tree building: FastTree, RAxML
      • MLST/Antibiogram: SRST2
  26. Analysis choice highly species dependent: not one size fits all!
      • What is the mode and tempo of evolution?
      • Monomorphic organisms:
        – Characterised by vertical pattern of inheritance
        – Isolates differ by few mutations
      • Highly recombinogenic organisms
        – Mutations dominated by recombination
        – May have vast differences in gene content, gene order
        – “Clonal frame” may be obscured or absent
  27. Different species require different analysis strategies
      [Figure. Axis: Variation. Species shown: M. tuberculosis, S. aureus, B. anthracis, E. coli, P. aeruginosa, N. meningitidis, S. pneumoniae, Salmonella. Labels: Clonal population structure; Branching phylogenies; Open pan-genome; Horizontal gene transfer; High rates of recombination; Phylogenetic networks]
  28. Tips for picking a reference
      • The higher quality the better (aim for pre-NGS Sanger genomes, e.g. <2001)
      • Ideally single contig, no gaps
      • Canonical strains have the most portable and most widely cited gene references, e.g. TB H37Rv, PAO1, E. coli K-12 etc.
      • For SNP calling specificity: more closely related is better
  29. The core genome
      • The core genome used to call SNPs will reduce as more genomes are added
      • Particularly noticeable in species with highly plastic genomes: E. coli
      • Has significance for forensic applications
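The shrinking core genome follows directly from its definition as an intersection: a reference position stays in the core only if every sample covers it, so adding a sample can remove positions but never add them. A minimal sketch, assuming each sample is represented as the set of reference positions where it can be called (a simplification of what tools such as Harvest compute); the sample data are made up.

```python
from typing import List, Set


def core_genome_sizes(callable_positions: List[Set[int]]) -> List[int]:
    """Return the core genome size after each successive sample is added."""
    sizes: List[int] = []
    core: Set[int] = set()
    for i, positions in enumerate(callable_positions):
        # The first sample defines the starting core; later samples can only shrink it.
        core = set(positions) if i == 0 else core & positions
        sizes.append(len(core))
    return sizes


# Three hypothetical samples against a 1,000-position reference.
samples = [
    set(range(0, 1000)),   # covers the whole reference
    set(range(100, 1000)), # missing the first 100 positions
    set(range(0, 900)),    # missing the last 100 positions
]
print(core_genome_sizes(samples))
```

In a forensic comparison this means the set of sites at which two isolates can be compared depends on which other genomes are in the dataset.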
  30. Is my reference good enough?
      • Assess core genome size
        – Harvest will do this for you
      • Or look at samtools flagstat (?)
      • Between-sample SNP calling efficiency goes down with reference divergence
      • Luxury option: get a Pacific Biosciences complete reference done for each “clone” in your dataset (for some definition of clone)
  31. Effect of closer reference on P. aeruginosa genotyping
      Reference           SNPs   Indels   Mapped
      PAO1 reference      23     4        77%
      PacBio reference    40     5        97%
      Source: Quick, Loman et al. BMJ Open 2014
  32. SNP filtering
      • A specific SNP dataset is vital for effective phylogenetic reconstructions and outbreak tracing
      • Most SNP calling errors come from
        – A) misalignment (sequence present in sample but not in reference still aligns somewhere)
        – B) copy number variation (2 copies in sample, 1 copy in reference)
      • NOT from sequencing error (at least with Illumina: systematic errors with other platforms)
  33. SNP filtering (2)
      • Allele frequency filter is the most effective SNP filter
        – AF > 0.9 (90%) works very well empirically
      • Strand filter also very useful to prevent false SNP calls around structural variations
      • Filtering for low coverage not that helpful:
        – A 1/1000 error rate (Q30) at a minimum coverage of 3 gives (1/1000)^3 = 0.000000001 chance of an error per position, i.e. < 1 error per genome
      • Avoid SNPs at ends of contigs as these may be mismapping
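The filters on this slide can be sketched in a few lines. The `Candidate` record and its field names are hypothetical, and "alternate allele seen on both strands" is one simple reading of a strand filter, chosen here as an assumption. The last lines reproduce the Q30 arithmetic for a minimum coverage of 3, using a hypothetical 5 Mb genome.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Read counts at one candidate SNP position (hypothetical structure)."""
    position: int
    ref_reads: int
    alt_forward: int
    alt_reverse: int


def allele_frequency(c: Candidate) -> float:
    """Fraction of reads supporting the alternate allele."""
    total = c.ref_reads + c.alt_forward + c.alt_reverse
    return (c.alt_forward + c.alt_reverse) / total if total else 0.0


def passes_filters(c: Candidate, min_af: float = 0.9) -> bool:
    """Allele frequency filter (AF > 0.9) plus a simple both-strands check."""
    on_both_strands = c.alt_forward > 0 and c.alt_reverse > 0
    return allele_frequency(c) > min_af and on_both_strands


candidates = [
    Candidate(100, ref_reads=1, alt_forward=10, alt_reverse=9),    # AF 0.95, both strands
    Candidate(200, ref_reads=10, alt_forward=10, alt_reverse=10),  # AF ~0.67, e.g. a repeat
    Candidate(300, ref_reads=0, alt_forward=20, alt_reverse=0),    # one strand only
]
print([c.position for c in candidates if passes_filters(c)])

# Q30 arithmetic: chance that three independent reads are all wrong at one position.
per_base_error = 10 ** (-30 / 10)           # Q30 = 1 error in 1000
p_all_three_wrong = per_base_error ** 3     # about 1e-9
expected_per_genome = p_all_three_wrong * 5_000_000  # hypothetical 5 Mb genome
print(p_all_three_wrong, expected_per_genome)
```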
  34. Detecting recombination
      • Simple algorithms rely on SNP density, more complex ones assess impact on “clonal frame”
      [Figure labels: Normal SNP density; Recombining region]
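The "simple algorithm" idea can be illustrated by counting SNPs in fixed windows along the genome and flagging windows with unusually many. This is only a sketch of the density approach: the non-overlapping windows, the window size, and the threshold are all assumptions, and it is not how Gubbins or ClonalFrameML work.

```python
from typing import Iterable, List, Tuple


def dense_windows(
    snp_positions: Iterable[int],
    genome_length: int,
    window: int = 1000,
    max_snps: int = 3,
) -> List[Tuple[int, int, int]]:
    """Return (start, end, count) for windows holding more than max_snps SNPs.

    Positions are 0-based and must be smaller than genome_length.
    """
    num_windows = (genome_length + window - 1) // window
    counts = [0] * num_windows
    for pos in snp_positions:
        counts[pos // window] += 1
    return [
        (i * window, min((i + 1) * window, genome_length), n)
        for i, n in enumerate(counts)
        if n > max_snps
    ]


# Hypothetical SNP positions: a cluster between 2000 and 3000 suggests recombination.
snps = [50, 2100, 2150, 2200, 2300, 2400, 7000]
print(dense_windows(snps, genome_length=10_000))
```

SNPs inside flagged windows would typically be excluded before tree building, so that imported variation does not inflate branch lengths.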
  35. Impact of recombination filtering
  36. De novo approach
      • Interrogate the accessory genome
        – Novel genes
      • Some important applications take contigs rather than reads as primary input
      • SNP calling with de novo assembly is fundamentally less reliable due to lack of allele frequency information; but fine for broad-scale clustering
  37. Reference-based approach
      Quality control steps and tools:
      • Read QC: FastQC, Qualimap
      • Adaptor/quality trimming: Trimmomatic
      • Species ID: BLAST, Metaphlan, MOCAT
      • Sample QC: Blobology, Kraken, BLAST
      Reference-based steps and tools:
      • Alignment: BWA
      • Variant calling: Samtools/VarScan, GATK
      • SNP extraction & filter: Custom script, snippy
      • Recombination filtering: Gubbins, ClonalFrameML
      • Tree building: FastTree, RAxML
      • MLST/Antibiogram: SRST2
      De novo approach, steps and tools:
      • Assembly: Velvet, SPAdes
      • Annotation: Prokka
      • Tree building: Harvest
      • Population genomics: BigsDB, Phyloviz
      • Pan-genome: LS-BSR
      • MLST/Antibiogram: mlst, Abricate
  38. Concluding thoughts
      1. Don’t trust your sequencing data (or others’): sense-check and validate each step
      2. Make extensive use of visualisation tools to do this
      3. There’s more than one way to do any one task
  39. CLoud Infrastructure for Microbial Bioinformatics (CLIMB)
      • MRC funded project to develop Cloud Infrastructure for microbial bioinformatics
      • £4M of hardware, capable of supporting >1000 individual virtual servers
      • Amazon/Google cloud for Academics
  40. Meet-The-Expert
      • Meet-The-Expert: Joao Carrico and I
      • Tomorrow (Monday)
      • 07:45 (really)
      • Hall M
      • Session ME11 What bioinformatics tools do I use for whole-genome sequence (WGS)-based bacterial diagnostics and typing?
  41. Acknowledgements
      • Twitter comments:
        – Tom Connor, Alan McNally, Torsten Seemann, C. Titus Brown, Heng Li, Christoffer Flensburg, Matt MacManes, Rachel Glover, Willem van Schaik, Bill Hanage, Jennifer Gardy, Mick Watson, Esther Robinson, Nicola Fawcett, Aziz Aboobaker, Ruth Massey
