ECCMID 2015 Meet-The-Expert: Bioinformatics Tools

Supporting material for ECCMID 2015 Meet-The-Expert session.

Published in: Science

  1. 1. What bioinformatic tools should I use for analysis of high-throughput sequencing data for molecular diagnostics? Nick Loman
  2. 2. Read QC: FastQC, Qualimap • Adaptor/quality trimming: Trimmomatic • Species ID: BLAST, Metaphlan, MOCAT • Sample QC: Blobology, Kraken, BLAST
     Reference-based approach • Alignment: BWA • Variant calling: Samtools/VarScan, GATK • SNP extraction & filter: Custom script, snippy, SnpEff, BRESEQ • Recombination filtering: Gubbins, ClonalFrameML • Tree building: FastTree, RAxML • MLST/Antibiogram: SRST2
     De novo approach • Assembly: Velvet, SPAdes • MLST/Antibiogram: mlst, Abricate • Annotation: Prokka • Tree building: Harvest • Population genomics: BigsDB, Phyloviz • Pan-genome: LS-BSR
  3. 3. FastQC • What: Analyse read-level sequence quality. • Why: Detect serious problems in read quality that might affect downstream analysis. • Where: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  4. 4. FastQC
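
The kind of read-level summary FastQC reports can be illustrated with a minimal sketch. This is not FastQC's implementation; it assumes Phred+33 quality encoding, and the quality strings are invented for illustration.

```python
from statistics import mean


def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality line into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality_by_position(quality_lines):
    """Mean Phred score at each read position, using only reads that reach it."""
    decoded = [phred_scores(line) for line in quality_lines]
    longest = max(len(scores) for scores in decoded)
    per_position = []
    for i in range(longest):
        per_position.append(mean(s[i] for s in decoded if len(s) > i))
    return per_position


# Toy quality strings (invented): 'I' = Q40, '5' = Q20, '#' = Q2
toy_qualities = ["IIII5", "III5#", "II5"]
profile = mean_quality_by_position(toy_qualities)
print(profile)
```

A profile that drops sharply towards the 3' end, as in this toy data, is the sort of pattern that would prompt a closer look before running a pipeline.
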
  5. 5. Qualimap • What: Analyse insert size distribution • Why: Determine whether sequencing has been effective, particularly for de novo assembly, and whether adaptor trimming is needed • Where: http://qualimap.bioinfo.cipf.es/
  6. 6. Trimmomatic • What: One of several million read trimmers • Why: To remove sequencing adaptors which may influence the results of de novo assembly • Where: http://www.usadellab.org/cms/?page=trimmomatic
  7. 7. Species ID: BLAST • What: Only the most famous bioinformatics algorithm ever made • Why: A few random BLAST searches will reveal much important information about your data before you start on a pipeline analysis • Where: http://ncbi.nlm.nih.gov/BLAST
  8. 8. Species ID: Metaphlan • What: Designed for metagenomics, this algorithm will find “taxon-defining” genes to identify what species are in a sample • Why: Check for extent of sample contamination, give an accurate species ID for unknown samples • Where: https://bitbucket.org/biobakery/metaphlan2
  9. 9. Species ID: Kraken • What: Similar to Metaphlan but even faster and with a more complete database • Why: Check for extent of sample contamination, give an accurate species ID for unknown samples • Where: https://ccb.jhu.edu/software/kraken/
  10. 10. Species ID: MOCAT • What: Uses a phylogenetic approach to identify novel or divergent species by relying on distances in conserved marker genes • Why: Sometimes you sequence something completely novel and want to know more about its relationships • Where: http://vm-lux.embl.de/~kultima/MOCAT/ • Alternatives: Phylosift, rMLST
  11. 11. Sample QC: Blobology • What: A simple method of plotting de novo assembly contigs by GC, coverage and taxon • Why: Characterise contamination, plasmids, lytic phage in a sample • Where: https://github.com/blaxterlab/blobology
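
The GC axis of a Blobology-style plot can be sketched as follows. This is a minimal illustration rather than the Blobology code; contig names, sequences and coverage values are invented, and taxon assignment is left out.

```python
def gc_content(sequence):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


# Hypothetical contigs: name -> (sequence, mean read coverage)
contigs = {
    "contig_1": ("ATGCATGC", 50.0),
    "contig_2": ("GGGCCCGGAT", 4.5),
}

# One (name, GC, coverage) row per contig, ready to hand to a plotting library
rows = [(name, gc_content(seq), cov) for name, (seq, cov) in contigs.items()]
for row in rows:
    print(row)
```

Contigs that sit apart from the main cloud in GC and coverage, like the second toy contig here, are the candidates for contamination, plasmids or phage.
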
  12. 12. Reference approach
  13. 13. Alignment: BWA • What: The standard method for aligning Illumina sequences to a reference; use it in BWA-MEM mode, which works well with most read lengths • Why: Finds the likely location of each sequence read in a reference genome • Where: https://github.com/lh3/bwa • Alternatives: SMALT, Bowtie2 (beware standard insert size parameters)
  14. 14. Variant calling: samtools & VarScan • What: A way of calling SNPs against a reference in one or more samples • Why: VarScan permits easy filtering of SNPs by allele frequency and strand, useful for getting a precise dataset • Where: http://www.htslib.org/ • http://varscan.sourceforge.net/ • Alternatives: GATK, snippy, Nesoni
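
Filtering by allele frequency and strand can be sketched in a few lines. The record layout and the thresholds below are assumptions chosen for illustration; they are not VarScan's output format or its defaults.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    """Read counts supporting one candidate SNP (hypothetical record layout)."""
    position: int
    ref_forward: int
    ref_reverse: int
    alt_forward: int
    alt_reverse: int

    @property
    def alt_frequency(self) -> float:
        alt = self.alt_forward + self.alt_reverse
        depth = alt + self.ref_forward + self.ref_reverse
        return alt / depth if depth else 0.0


def passes_filters(call, min_frequency=0.9, min_alt_per_strand=2):
    """Keep a call only if the variant allele dominates and is seen on both strands."""
    return (
        call.alt_frequency >= min_frequency
        and call.alt_forward >= min_alt_per_strand
        and call.alt_reverse >= min_alt_per_strand
    )


# Invented example calls
calls = [
    SnpCall(1042, 0, 1, 25, 22),   # high frequency, supported on both strands
    SnpCall(2310, 20, 18, 6, 5),   # low-frequency variant
    SnpCall(5877, 0, 0, 30, 0),    # variant seen on the forward strand only
]
kept = [call.position for call in calls if passes_filters(call)]
print(kept)
```

Only the first toy call survives: the second fails the frequency threshold and the third has no reverse-strand support.
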
  15. 15. Recombination filtering: Gubbins • What: Detects regions that have undergone recombination, which confounds phylogenetic reconstructions that assume clonality • Why: Important when attempting phylogenetic reconstructions from recombining organisms • Where: http://sanger-pathogens.github.io/gubbins/ • Alternatives: ClonalFrameML, BRATNextGen
  16. 16. Tree building: FastTree • What: Phylogenetic reconstructions from SNP data • Why: Tree reconstructions are an effective way of examining evolutionary relationships in isolates and testing if they are from an outbreak • Note: Ensure you don’t hit the double-precision bug! (http://darlinglab.org/blog/2015/03/23/not-so-fast-fasttree.html) • Where: http://meta.microbesonline.org/fasttree/ • Alternatives: RAxML (more thorough, slower), REALPHY (http://realphy.unibas.ch/fcgi/realphy)
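
Before building a tree, a pairwise SNP distance table is a quick way to see how close isolates are. The sketch below is a plain count of differing sites, not what FastTree or RAxML compute; isolate names and sequences are invented, and sites with N or a gap in either sequence are skipped by assumption.

```python
from itertools import combinations

AMBIGUOUS = {"N", "-"}


def snp_distance(seq_a, seq_b):
    """Count aligned sites where both bases are unambiguous and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in AMBIGUOUS and b not in AMBIGUOUS
    )


def pairwise_distances(alignment):
    """SNP distance for every pair of isolates, keyed by sorted name pair."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Toy alignment (invented)
alignment = {
    "isolate_A": "ACGTACGT",
    "isolate_B": "ACGTACGA",
    "isolate_C": "TCGTNCGA",
}
distances = pairwise_distances(alignment)
for pair, dist in distances.items():
    print(pair, dist)
```
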
  17. 17. MLST & Antibiogram: SRST2 • What: Aligns reads against MLST and antibiotic resistance databases • Why: Permits MLST typing with genome data and a rough prediction of antibiotic resistance • Where: http://katholt.github.io/srst2/
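
Once alleles have been called at each MLST locus, assigning a sequence type is a table lookup. The sketch below shows only that final step; the locus names, allele numbers and sequence types are invented and do not come from any real scheme, and SRST2's read alignment is not reproduced.

```python
# Hypothetical three-locus scheme (real schemes typically use more loci)
LOCI = ("locus1", "locus2", "locus3")

# Allele profile -> sequence type (invented values)
PROFILES = {
    (1, 1, 1): 1,
    (1, 2, 1): 2,
    (3, 2, 4): 5,
}


def sequence_type(alleles):
    """Return the sequence type for a dict of locus -> allele number.

    Returns None when a locus is missing or the profile is not in the table.
    """
    try:
        profile = tuple(alleles[locus] for locus in LOCI)
    except KeyError:
        return None
    return PROFILES.get(profile)


known = sequence_type({"locus1": 1, "locus2": 2, "locus3": 1})
novel = sequence_type({"locus1": 9, "locus2": 9, "locus3": 9})
print(known, novel)
```
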
  18. 18. De novo approach
  19. 19. De novo assembly: SPAdes • What: A reliable de novo assembler which works well with multiple data types • Why: Has a built-in error corrector so no need for read trimming, can use multiple values of k so less need for experimentation, consistently performs well in comparisons • Where: http://bioinf.spbau.ru/spades
  20. 20. De novo assembly: Velvet • What: The original short-read assembler • Why: Extremely fast for draft assemblies, particularly if you just want to do MLST or antibiograms • Where: https://www.ebi.ac.uk/~zerbino/velvet/ • Alternatives: MEGAHIT – even faster!
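
The "values of k" mentioned for SPAdes refer to the length of the substrings (k-mers) that short-read assemblers break reads into. A minimal sketch of counting k-mers at two values of k follows; it illustrates the concept only and is not how either assembler is implemented. The reads are invented.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Two toy reads (invented) that overlap by five bases
reads = ["ACGTAC", "CGTACG"]
for k in (3, 5):
    kmers = count_kmers(reads, k)
    print(k, len(kmers), kmers.most_common(1))
```

Small k gives many shared k-mers and a tangled graph; large k gives fewer, more specific k-mers but needs higher coverage, which is why trying several values can help.
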
  21. 21. Annotation: Prokka • What: Takes de novo assembly contig files and annotates them with coding sequences and non-coding features such as RNAs • Why: A very sensible set of tools and reference databases in a single package, produces usable output for other software and database submission • Where: http://www.vicbioinformatics.com/software.prokka.shtml • Alternatives: xBASE annotation interface
  22. 22. Tree building: Harvest • What: Takes de novo assembly contigs, performs whole-genome alignment and permits reconstruction of core genome phylogenies • Why: Scalable to hundreds of genomes on a laptop and with an excellent viewer • Where: http://harvest.readthedocs.org/en/latest/index.html • Alternatives: Mauve
  23. 23. Population genomics: BIGSDB • What: Takes de novo assembly contigs and applies MLST-like schemes working on hundreds or thousands of core genes • Why: Scalable to thousands of genomes for rapid population-level clustering • Where: http://pubmlst.org/software/database/bigsdb/ • Alternatives: Bionumerics
  24. 24. Pan/accessory genomes: LS-BSR • What: Takes de novo assembly contigs or annotations and compares gene content • Why: To determine differences in gene content between 1 to 1000s of strains • Where: https://github.com/jasonsahl/LS-BSR • Alternatives: OrthoMCL
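
Comparing gene content comes down to asking which genes every strain shares (core) and which only some carry (accessory). The sketch below assumes gene presence has already been decided per strain and uses simple set operations; it is not LS-BSR's scoring method. Strain names and gene lists are invented.

```python
def split_pan_genome(gene_content):
    """Split genes into (core, accessory) given a dict of strain -> gene names."""
    genomes = [set(genes) for genes in gene_content.values()]
    if not genomes:
        return set(), set()
    pan = set().union(*genomes)          # every gene seen in any strain
    core = set.intersection(*genomes)    # genes present in all strains
    return core, pan - core


# Toy gene content (invented)
gene_content = {
    "strain_1": {"gyrA", "rpoB", "blaTEM"},
    "strain_2": {"gyrA", "rpoB"},
    "strain_3": {"gyrA", "rpoB", "tetA"},
}
core, accessory = split_pan_genome(gene_content)
print(sorted(core), sorted(accessory))
```
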
  25. 25. MLST/Antibiogram: mlst and Abricate • What: Works on de novo assembly to give MLST prediction and antibiotic resistance prediction • Why: A very fast method • Where: https://github.com/tseemann/mlst • https://github.com/tseemann/abricate • Alternatives: SRST2
  26. 26. CLoud Infrastructure for Microbial Bioinformatics (CLIMB) • MRC funded project to develop Cloud Infrastructure for microbial bioinformatics • £4M of hardware, capable of supporting >1000 individual virtual servers • Amazon/Google cloud for Academics
  27. 27. Acknowledgements • Twitter comments: – Tom Connor, Alan McNally, Torsten Seemann, C. Titus Brown, Heng Li, Christoffer Flensburg, Matt MacManes, Rachel Glover, Willem van Schaik, Bill Hanage, Jennifer Gardy, Mick Watson, Esther Robinson, Nicola Fawcett, Aziz Aboobaker, Ruth Massey