This document discusses bioinformatic tools for analyzing high-throughput sequencing data for molecular diagnostics. For quality control, it recommends FastQC and Qualimap to check whether sequencing worked and what the fragment lengths are. BLAST and Kraken can check whether a sample matches its expected identity or is contaminated. Trimmomatic is recommended for adaptor trimming. For analysis, it describes a reference-based approach using BWA and GATK and a de novo approach using SPAdes. Both approaches have advantages, so most people will want to do both, using a de novo assembly as the reference if none exists.
1. What bioinformatic tools should I use for analysis of high-throughput sequencing data for molecular diagnostics?
Nick Loman
2. Workflow diagram: read QC and trimming, then a reference-based approach and a de novo approach. Steps and tools named on the slide:
• Read QC: FastQC, Qualimap, Kraken, BLAST
• Adaptor/quality trimming: Trimmomatic
• Alignment: BWA
• Variant calling: Samtools/VarScan, GATK
• SNP extraction: Python script, Snippy
• Recombination filtering: Gubbins
• Tree building: FastTree, RAxML, Harvest
• Assembly: SPAdes
• Whole-genome alignment: Mauve, Parsnp
• Annotation: Prokka
• MLST/Antibiogram: mlst, abricate, SRST2
• Pan-genome: LS-BSR
• Population genomics: BIGSdb, PHYLOViZ
3. Quality Control: Questions to Ask
• Did my sequencing work?
• What are the fragment lengths?
• Is my sample what I think it is?
• Is my sample contaminated?
5. What are the fragment lengths?
• Qualimap (or just BWA)
• Bad: fragment length < read length
• OK: fragment length > read length
• Good: fragment length > 2x read length
Will affect: genome coverage, de novo assembly performance, alignment performance
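The three categories above can be sketched as a small check. This is a minimal illustration, not part of the original slides: the function name, the use of the median, and the "borderline" label for the case where fragment length equals read length (which the slide does not classify) are all assumptions.

```python
from statistics import median
from typing import Sequence


def classify_fragment_length(fragment_length: float, read_length: int) -> str:
    """Apply the slide's thresholds to one fragment length."""
    if fragment_length > 2 * read_length:
        return "good"
    if fragment_length > read_length:
        return "ok"
    if fragment_length < read_length:
        return "bad"
    # Exactly equal: not covered by the slide, labelled here by assumption.
    return "borderline"


def classify_library(fragment_lengths: Sequence[int], read_length: int) -> str:
    """Classify a library by its median fragment length (an assumed summary)."""
    return classify_fragment_length(median(fragment_lengths), read_length)


if __name__ == "__main__":
    # Hypothetical fragment lengths, e.g. taken from an alignment report.
    print(classify_library([180, 220, 400, 520, 610], read_length=150))
```

The fragment lengths themselves would come from a tool such as Qualimap or from aligning read pairs with BWA, as noted on the slide.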
6. Is my sample what I think it is?
• BLASTing a few reads is usually very efficient
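To BLAST a few reads, they first need to be pulled out of the FASTQ file as FASTA. A minimal sketch follows, assuming standard four-line FASTQ records (no wrapped sequence lines). The function name and the file name in the usage note are hypothetical, and taking the first few reads rather than a random sample is a simplification.

```python
from itertools import islice
from typing import Iterable, Iterator


def fastq_head_to_fasta(lines: Iterable[str], n_reads: int = 10) -> Iterator[str]:
    """Yield the first n_reads FASTQ records as FASTA-formatted strings."""
    it = iter(lines)
    for _ in range(n_reads):
        record = list(islice(it, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            break  # end of input or truncated record
        header = record[0].rstrip("\n")
        sequence = record[1].rstrip("\n")
        # Drop the leading '@' of the FASTQ header and emit a FASTA record.
        yield f">{header[1:]}\n{sequence}\n"
```

Usage might look like `with open("reads.fastq") as fh: fasta = "".join(fastq_head_to_fasta(fh, 5))`, with the resulting FASTA text pasted into a BLAST search.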
8. Adaptor trim reads
• With Nextera libraries, failing to adaptor trim will KILL your assemblies.
• Particularly important when mean fragment length < read length.
• Many trimmers available: I like to use Trimmomatic
For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
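When the fragment is shorter than the read, the sequencer reads through the insert into the adaptor, so adaptor bases end up at the 3' end of the read. The sketch below illustrates the idea with exact matching only. It is not how Trimmomatic works internally, and a real trimmer also tolerates mismatches. The function name, the `min_overlap` parameter and the adaptor string in the usage note are assumptions for illustration.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an exact adaptor match (full or 3'-partial) from a read."""
    # Case 1: the whole adaptor occurs in the read; cut at its first position.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway into the adaptor. Look for the longest
    # read suffix that equals an adaptor prefix of at least min_overlap bases.
    shortest = max(1, min_overlap)
    longest = min(len(read), len(adapter) - 1)
    for k in range(longest, shortest - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read  # no adaptor found


if __name__ == "__main__":
    # Illustrative adaptor string; check your library kit for the real one.
    adapter = "CTGTCTCTTATACACATCT"
    print(trim_adapter("ACGTACGT" + adapter + "GGG", adapter))
```

A small `min_overlap` removes more partial adaptors but also trims a few genuine bases by chance, which is one reason dedicated trimmers expose tunable thresholds.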
11. Reference-based or de novo?
• Reference-based
– Implies ALIGNMENT to reference
– Implies you HAVE a reference
– Allows exquisitely sensitive and specific SNP calling (forensic SNP calling to single mutation precision)
– Important for looking at CHAINS OF TRANSMISSION
– Can only call in parts of the genome COMMON between your SAMPLES and REFERENCE
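The last two points can be illustrated with a pairwise SNP distance. This is a minimal sketch, assuming each sample has already been turned into a sequence in reference coordinates, with `-` or `N` marking positions that could not be called. The function name and the sequences in the example are hypothetical.

```python
from typing import Tuple


def snp_distance(seq_a: str, seq_b: str, uncalled: str = "-N") -> Tuple[int, int]:
    """Return (differences, positions compared) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip positions that are missing or uncalled in either sample:
        # only sites common to both can contribute to the distance.
        if a in uncalled or b in uncalled:
            continue
        compared += 1
        if a != b:
            differences += 1
    return differences, compared


if __name__ == "__main__":
    print(snp_distance("ACG-ACGT", "ACGTNCGA"))
```

Reporting the number of compared positions alongside the distance makes it visible how much of the genome the comparison actually covers.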
12. Reference-based or de novo?
• De novo
– Implies de novo assembly
– Does NOT require a reference
– Gives access to the entire PAN-genome
– E.g.
• Unexpected antibiotic resistance genes
• Virulence factors
– Can give misleading results in REPEAT sequences
– Not suitable for very fine-resolution SNP analysis
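Once assemblies are annotated, the pan-genome idea can be sketched as set operations on per-sample gene lists: genes in every sample form the core, and everything else is accessory, which is where unexpected resistance or virulence genes would show up. The function name is an assumption, and the sample and gene names in the example are illustrative labels only, not results.

```python
from typing import Dict, Set, Tuple


def core_and_accessory(gene_sets: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split the pan-genome into core genes and accessory genes."""
    all_sets = list(gene_sets.values())
    if not all_sets:
        return set(), set()
    core = set.intersection(*all_sets)   # present in every sample
    pan = set.union(*all_sets)           # present in at least one sample
    return core, pan - core


if __name__ == "__main__":
    # Hypothetical per-sample gene lists, e.g. parsed from annotation output.
    samples = {
        "sample_1": {"gyrA", "rpoB", "blaCTX-M-15"},
        "sample_2": {"gyrA", "rpoB"},
    }
    core, accessory = core_and_accessory(samples)
    print(sorted(core), sorted(accessory))
```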
13. In practice
• Most people will want to do both.
• And if you have no reference, you can use a draft de novo assembly AS your reference.
14. Acknowledgements
• Twitter comments:
– Tom Connor, Alan McNally, Torsten Seemann, C. Titus Brown, Heng Li, Christoffer Flensburg, Matt MacManes, Rachel Glover, Willem van Schaik