
CCBC tutorial beiko


Rob's 16S tutorial from the Canadian Computational Biology Conference, 2016

Published in: Science


  1. 1. Microbiome Analysis 16S AND METAGENOMICS
  2. 2. Welcome! Your Tutorial Team: Me (16S theory) Mike Hall (16S practical) Morgan Langille (metagenomics theory and practical) Special thanks to: Will Hsiao (CBW presentation) 2
  3. 3. Today’s presentation CBW “Analysis of metagenomic data” 3 http://bioinformatics.ca/workshops/2015/analysis-metagenomic-data-2015
  4. 4. Overview Morning session 1. A brief history of molecules and microbes 2. Why 16S? 3. How 16S analysis is usually done 4. Assumptions 5. Hands-on practical Afternoon session 1. 16S vs Metagenomics 2. Metagenome Taxonomic Composition 3. Metagenome Functional Composition 4. PICRUSt: Functional Inference 5. Hands-on practical 4
  5. 5. Learning objectives At the end of the 16S tutorial, you should be able to do the following: 1. Run a simple QIIME analysis of a data set (https://www.dropbox.com/s/kpte51nm17wav9o/stool_data.zip) 2. Interpret analysis results 3. Understand the limitations of the standard 16S analysis pipeline 5
  6. 6. Defining metagenomics Microbiome: Attributed to Joshua Lederberg by Hooper and Gordon (2001): “the collective genome of our indigenous microbes (microflora), the idea being that a comprehensive genetic view of Homo sapiens as a life-form should include the genes in our microbiome” Is also used to mean microbiota, the group of microorganisms found in a particular setting (usage varies: be careful and precise!) Metagenome: Handelsman et al. (1998) “…advances in molecular biology and eukaryotic genomics, which have laid the groundwork for cloning and functional analysis of the collective genomes of soil microflora, which we term the metagenome of the soil.” Does not encompass marker-gene surveys (e.g., 16S) This report says it does. 6
  7. 7. Micro-what? Metagenomics is often defined to encompass only Bacteria and Archaea (and often Archaea are excluded too!) Other small things to consider: ◦ Viruses / phages ◦ Microbial eukaryotes ◦ Worms (helminths, nematodes, …) 7 Lukeš et al. (2015) PLoS Pathogens
  8. 8. The dawn of metagenomics 3.5 BYA – the Archaean Eon 16S position 349 (-ish) ? G A Archaea Bacteria 8
  9. 9. Aaaaand more recently 9
  10. 10. The 16S ribosomal RNA gene THE FIRST WORD IN MICROBIAL BIODIVERSITY 10
  11. 11. 11 Yarza et al. (2014) Escherichia coli ribosome (PDB 4YBB) So much RNA!
  12. 12. Why 16S? The “universal phylogenetic marker” (1) Present in all living organisms (2) Single copy* (no recombination) (3) Highly conserved + highly variable regions (4) Huge reference databases 12
  13. 13. Milestones 13 1990: “proposal for the domains Archaea, Bacteria, and Eucarya”
  14. 14. Milestones 14 Nature (1990) 2002: “…as much as 50% of the total surface microbial community…”
  15. 15. Milestones 15 PNAS (2006) Many critical papers followed (error filtering, clustering approaches, …)
  16. 16. Milestones 16 Huttenhower, Gevers et al. (2012) + 681 metagenomic samples
  17. 17. 16S analysis HOW IT’S DONE 17
  18. 18. Your basic workflow Sample collection DNA extraction Amplification Analysis 18
  19. 19. Sample collection and DNA extraction Defined protocols exist, many kits (e.g. PowerSoil®) Need to consider barriers to DNA recovery and PCR (e.g. humic acids from soil, bile salts from feces) Additional mechanical approaches (e.g., mechanical lysis of tissues with bead beating) Kits and rogue lab DNA can end up in your sample – need to run negative controls!! ◦ Example from [year redacted]: shocking finding of bacterial DNA in the [location redacted]! However, [taxonomic group redacted] was a known frequent contaminant of DNA extraction kits. 19
  20. 20. 20 Size fractionation http://www.jove.com/video/52685/automated-gel-size-selection-to-improve-quality-next-generation
  21. 21. Choosing a PCR strategy Need to consider: ◦ Correct melting temperature (60-65 degrees C for Illumina protocol) ◦ DNA sequencing read length (influences choice of primers) ◦ Primer specificity! ◦ Comparability with previous studies? [Good luck with that] [but that’s what the Earth Microbiome Project protocol http://www.earthmicrobiome.org/emp-standard-protocols/16s/ is meant to achieve] 21
  22. 22. Which variable regions to target? V1-V3 favours Prevotella, Fusobacterium, Streptococcus, Granulicatella, Bacteroides, Porphyromonas and Treponema V4-V6 favours Streptococcus, Treponema, Prevotella, Eubacterium, Porphyromonas, Campylobacter and Enterococcus. ◦ failed to detect Fusobacterium V7-V9 favours Veillonella, Streptococcus, Eubacterium, Enterococcus, Treponema, Catonella and Selenomonas. ◦ failed to detect Selenomonas, TM7 and Mycoplasma 22
  23. 23. At least there’s no shortage of options… 23 Detailed in silico evaluation of primers, experimental evaluation of two sets Heavily biased recovery of Bacteria, Archaea, and missing groups depending on primer choice. “Out of the 175 primers and 512 primer pairs checked, only 10 can be recommended as broad-range primers.”
  24. 24. Amplification Example: Illumina protocol 24
  25. 25. Analysis (examples mostly from QIIME) 1. Quality Control ◦ Error checking 2. Sample diversity ◦ Taxonomy agnostic ◦ Taxonomy aware 3. Similarity among samples 4. Associations with metadata/groups (ANOSIM, MRPP) 5. Machine-learning classification 6. Functional prediction 25
  26. 26. QIIME (http://www.ncbi.nlm.nih.gov/pubmed/24060131)
      ◦ A Python interface to glue together many programs
      ◦ Wrappers for existing programs
      ◦ Large number of dependencies / VM available
      ◦ More scalable
      ◦ Steeper learning curve, but a more flexible workflow if you can write your own scripts
      Mothur (http://www.mothur.org/wiki/MiSeq_SOP)
      ◦ Single program with minimal external dependencies
      ◦ Reimplementation of popular algorithms
      ◦ Easy to install and set up; works best on a single multi-core server with lots of memory
      ◦ Less scalable
      ◦ Easy to learn, but the workflow works best with built-in tools
      Will Hsiao
  27. 27. “Analysis” #1 Quality Control 27 Quality score filtering: ◦ Minimal length of consecutive high-quality bases (as % of total read length) ◦ Maximal number of consecutive low-quality bases ◦ Maximal number of ambiguous bases (N’s) ◦ Minimum Phred quality score Other quality filtering tools available ◦ Cutadapt (https://github.com/marcelm/cutadapt) ◦ Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic) ◦ Sickle (https://github.com/najoshi/sickle) Chimera checking: ◦ UCHIME
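The quality-score criteria on this slide can be illustrated with a minimal sketch. It is not QIIME's implementation: the function name, the default thresholds, and the Phred+33 encoding are all assumptions chosen for illustration. A read is truncated where a run of low-quality bases becomes too long, then rejected if too little of it survives or if it contains too many N's.

```python
def quality_filter(seq, qual, min_phred=20, max_bad_run=3,
                   min_good_fraction=0.75, max_n=0, offset=33):
    """Return the trimmed read, or None if it fails the filter.

    All thresholds are illustrative assumptions, not recommended values.
    `qual` is a FASTQ quality string, assumed to be Phred+33 encoded.
    """
    scores = [ord(ch) - offset for ch in qual]
    bad_run = 0
    keep = len(seq)
    for i, score in enumerate(scores):
        if score < min_phred:
            bad_run += 1
            if bad_run > max_bad_run:
                # Truncate just before the low-quality run started.
                keep = i - bad_run + 1
                break
        else:
            bad_run = 0
    trimmed = seq[:keep]
    # Require a minimum high-quality length as a fraction of the read.
    if keep < min_good_fraction * len(seq):
        return None
    # Reject reads with too many ambiguous bases.
    if trimmed.upper().count("N") > max_n:
        return None
    return trimmed


good = quality_filter("ACGTACGT", "IIIIIIII")
tail_trimmed = quality_filter("ACGTACGTAC", "IIIIIIII##", max_bad_run=1)
rejected = quality_filter("ACGTACGTAC", "II########", max_bad_run=1)
```

In practice one of the dedicated tools listed on the slide would do this step; the sketch only shows how the four criteria interact.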
  28. 28. 28 Sequence quality summary using FASTQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  29. 29. Analysis #2 Within-sample (“alpha”) diversity To describe the diversity of a sample, you need to know what you are counting! Individual sequences? ◦ Most precise, but vulnerable to sequencing error effects – inflation of diversity Clusters of sequences? ◦ Operational taxonomic units (OTUs) – 97% sequence identity as the “species” level of similarity Taxonomic groups? ◦ It’s always reassuring to put names on things, but taxonomic labels can be extremely misleading 29
  30. 30. OTU clustering 30 Choose a % identity threshold 97% Cluster centroids in some order (e.g., length, abundance) – these are reference sequences Continue procedure until all sequences are clustered OTU (singletons may be excluded) Calculate distances between sequences 6%
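The clustering procedure outlined on this slide can be sketched as a greedy centroid method. This is a simplified illustration, not the algorithm of any particular tool: it assumes sequences are already aligned and of equal length, orders them by abundance, and uses a naive position-by-position identity. The names `identity` and `greedy_cluster` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions (assumes pre-aligned sequences)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seq_counts, threshold=0.97):
    """Greedy centroid clustering.

    seq_counts maps each unique sequence to its abundance. Sequences are
    visited from most to least abundant; each one joins the first existing
    centroid it matches at >= threshold, or else founds a new OTU.
    """
    ordered = sorted(seq_counts, key=lambda s: (-seq_counts[s], s))
    otus = {}  # centroid sequence -> list of member sequences
    for seq in ordered:
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid].append(seq)
                break
        else:
            otus[seq] = [seq]  # no centroid was close enough
    return otus


reads = {
    "A" * 100: 10,            # most abundant: becomes the first centroid
    "A" * 98 + "CC": 3,       # 98% identical: joins the first OTU
    "A" * 95 + "GGGGG": 1,    # 95% identical: founds a second OTU
}
otus = greedy_cluster(reads)
```

Note that the result depends on the visiting order, which is one reason different clustering tools can give different OTUs from the same data.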
  31. 31. What’s in a name? 31 Bacteroides Parabacteroides Ruminococcus ??? ??? ??? ??? Akkermansia
  32. 32. Taxonomic assignment Many choices: BLAST – assign taxonomic label of closest match (simple, possibly too simple) Phylogenetic placement – e.g. Pplacer (Matsen et al., BMC Bioinformatics 2010) Machine-learning classification, in particular Naïve Bayes e.g. RDP Classifier, Wang et al. (2007) BMC Bioinformatics 32
  33. 33. Example RDP Classifier output (includes bootstrap support)
      GD6JEAT01AYGPE Root rootrank 1.0 Bacteria domain 1.0 "Planctomycetes" phylum 1.0 "Planctomycetacia" class 1.0 Planctomycetales order 1.0 Planctomycetaceae family 1.0 Schlesneria genus 0.96
      GD6JEAT01BEUG6 Root rootrank 1.0 Bacteria domain 1.0 Firmicutes phylum 0.32 Clostridia class 0.26 Clostridiales order 0.23 Ruminococcaceae family 0.22 Anaerotruncus genus 0.19
  34. 34. Calculating alpha diversity OTU counts – richness only Simpson index – probability of sampling two individuals of the same type Phylogenetic diversity – sum of branch lengths 34
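The three measures on this slide can be sketched in a few lines. The data structures here are assumptions: a sample is a dict of OTU counts, and the tree is a hypothetical dict mapping each node to its parent and branch length. The Simpson value computed is the probability stated on the slide (the sum of squared proportions), so lower values mean higher diversity.

```python
def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts.values() if c > 0)


def simpson(counts):
    """Probability that two randomly drawn individuals share an OTU."""
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())


def phylogenetic_diversity(counts, parent):
    """Sum of branch lengths connecting the observed OTUs to the root.

    parent maps node -> (parent_node, branch_length); the root has no entry.
    """
    seen = set()
    total = 0.0
    for otu, count in counts.items():
        if count <= 0:
            continue
        node = otu
        # Walk towards the root, counting each branch only once.
        while node in parent and node not in seen:
            seen.add(node)
            node, length = parent[node][0], parent[node][1]
            total += length
    return total


sample = {"otu_a": 5, "otu_b": 5, "otu_c": 0}
tree = {  # hypothetical toy tree
    "otu_a": ("x", 1.0),
    "otu_b": ("x", 2.0),
    "otu_c": ("root", 4.0),
    "x": ("root", 0.5),
}
```

For the toy sample, richness is 2, the Simpson probability is 0.5, and phylogenetic diversity is 1.0 + 2.0 + 0.5 = 3.5 because the unobserved `otu_c` branch is not counted.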
  35. 35. Example: human body-site diversity 35 Huttenhower, Gevers et al. (2012)
  36. 36. Analysis #3 Among-sample (“beta”) diversity 1. Perform pairwise comparisons between all samples to build a dissimilarity matrix 2. Summarize the matrix based on major patterns of covariance or hierarchical similarity 36
  37. 37. Analysis #3 Among-sample (“beta”) diversity Given a pair of samples (described as e.g. OTU abundance), calculate their dissimilarity Beta-diversity measures can be: ◦ non-phylogenetic or phylogenetic ◦ weighted or unweighted There are a lot of measures! -Bray-Curtis (weighted, non-phylogenetic) -Jaccard (unweighted, non-phylogenetic) -Weighted UniFrac (weighted, phylogenetic) -… 37
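As a minimal sketch of the two non-phylogenetic measures named above, assuming each sample is a dict of OTU counts (the function names and toy data are illustrative): Bray-Curtis uses abundances (weighted), while Jaccard uses presence/absence only (unweighted).

```python
from itertools import combinations


def bray_curtis(a, b):
    """Weighted, non-phylogenetic dissimilarity between two samples."""
    otus = set(a) | set(b)
    diff = sum(abs(a.get(o, 0) - b.get(o, 0)) for o in otus)
    total = sum(a.get(o, 0) + b.get(o, 0) for o in otus)
    return diff / total


def jaccard(a, b):
    """Unweighted, non-phylogenetic dissimilarity (presence/absence)."""
    present_a = {o for o, c in a.items() if c > 0}
    present_b = {o for o, c in b.items() if c > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)


def dissimilarity_matrix(samples, measure):
    """Pairwise dissimilarities, keyed by (sample_1, sample_2)."""
    return {(s, t): measure(samples[s], samples[t])
            for s, t in combinations(sorted(samples), 2)}


samples = {  # toy data
    "s1": {"x": 10, "y": 0, "z": 5},
    "s2": {"x": 5, "y": 5, "z": 5},
    "s3": {"x": 0, "y": 20, "z": 0},
}
bc = dissimilarity_matrix(samples, bray_curtis)
jc = dissimilarity_matrix(samples, jaccard)
```

Samples `s1` and `s3` share no OTUs, so both measures give them the maximum dissimilarity of 1.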
  38. 38. Analysis #3 Among-sample (“beta”) diversity How similar are the results of different measures? CORRELATIONS between calculated values 38 Parks and Beiko (2013): ISME J
  39. 39. Analysis #3 Among-sample (“beta”) diversity What to do with a dissimilarity matrix? 39 Yatsunenko et al. (2012) Nature Parks and Beiko (2012) Mol Biol Evol Ordination Clustering
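The clustering branch of this slide can be sketched with average-linkage (UPGMA-style) agglomeration over a dissimilarity matrix. This is one common choice, assumed here for illustration rather than taken from the cited papers; the input format (a dict keyed by frozenset pairs) and the function name are also assumptions.

```python
def average_linkage(names, dist):
    """Agglomerative clustering; returns the list of merges in order.

    dist maps frozenset({a, b}) -> dissimilarity between samples a and b.
    Each merge is (cluster_1, cluster_2, average_distance).
    """
    def cluster_distance(c1, c2):
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        merges.append((clusters[i], clusters[j],
                       cluster_distance(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


toy = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.8,
    frozenset(("B", "C")): 0.6,
}
merges = average_linkage(["A", "B", "C"], toy)
```

Here A and B merge first at 0.1, and C joins them at the average of 0.8 and 0.6.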
  40. 40. Analysis #3 Among-sample (“beta”) diversity Different beta-diversity measures can yield dramatically different clusters! 40 Parks and Beiko (2013): ISME J
  41. 41. Analysis #4 Associations with metadata PERMANOVA: Permutational multivariate analysis of variance ANOSIM: Rank-based analysis of similarity (comparison of between-group vs within-group distances) Mantel test: Correlation between two distance matrices 41 Good review: Anderson and Walsh (2013) Ecological Monographs Example: Weighted UniFrac distance: root compartment explains 46.62% of variance (PERMANOVA p<0.001) Unweighted UniFrac: root compartment explains only 18.07% of variance (PERMANOVA p<0.001); soil type is more important
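To make the "explains X% of variance" and p-value figures concrete, here is a minimal one-factor PERMANOVA sketch. It follows the usual sums-of-squared-distances formulation, but the function names, the toy matrix, and the default of 999 permutations are assumptions for illustration, not the procedure used in the example study.

```python
import random


def _ss_within(dist, labels):
    """Within-group sum of squares and the number of groups."""
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    total = 0.0
    for members in groups.values():
        squared = sum(dist[i][j] ** 2
                      for k, i in enumerate(members)
                      for j in members[k + 1:])
        total += squared / len(members)
    return total, len(groups)


def permanova(dist, labels, permutations=999, seed=0):
    """Return (pseudo_F, R2, p_value) for one grouping factor."""
    n = len(labels)
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n

    def pseudo_f(labs):
        ss_w, n_groups = _ss_within(dist, labs)
        ss_among = ss_total - ss_w
        return (ss_among / (n_groups - 1)) / (ss_w / (n - n_groups))

    observed = pseudo_f(labels)
    ss_w, _ = _ss_within(dist, labels)
    r2 = (ss_total - ss_w) / ss_total  # fraction of variance explained
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # break any label/sample association
        if pseudo_f(shuffled) >= observed:
            hits += 1
    return observed, r2, (hits + 1) / (permutations + 1)


toy_dist = [  # two tight groups, far apart
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
f_stat, r2, p_value = permanova(toy_dist, ["g1", "g1", "g2", "g2"])
```

With only four samples there are very few distinct label arrangements, so the p-value cannot become small no matter how well separated the groups are; this is a reminder that the permutation p-value depends on sample size as well as effect size.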
  42. 42. Analysis #5 Machine-learning classification Identify aspects of community structure that are predictive of sample attributes Advantages of machine-learning approaches: ◦ Non-linear combinations of variables ◦ Data transformations ◦ Can accommodate many different representations of the data Disadvantages: ◦ Complex, may “overfit” ◦ Can be time consuming ◦ Obfuscation of predictive rules 42
  43. 43. Random forests (supervised_learning.py) 43 “…there are only weak and, for the most part, non-significant associations of particular taxa or overall diversity with the obese human gut that hold true across different studies. However, using supervised learning with receiver operator curves to maximize sensitivity and specificity, one can categorize subjects according to lean and obese states with in some cases considerable accuracy…”
  44. 44. Tree-based classifications Nested clade analysis and feature selection Classification of plaque samples using support vector machines 44 Ning and Beiko (2015): Microbiome
  45. 45. Analysis #6 Functional prediction PICRUSt: Langille et al (2013) Nat Biotechnol 45 Morgan can tell you about this…
  46. 46. Assumptions THAT ARE OFTEN FALSE 46
  47. 47. Do not assume that #1: 16S is an effective proxy for microbial diversity. #2: All 16S studies are created equal, with results that are comparable. #3: Rarefaction is a good idea. #4: 16S OTUs describe ecologically cohesive units (“species”?). #5: The 16S tree is the “Tree of Life”. 47
  48. 48. Assumption #1 16S is an effective proxy for microbial diversity. 48 rrnDB: Stoddard et al. NAR (2014) Estimating copy number: Kembel et al. (2012) and PICRUSt (coming up later) Variation: Coenye and Vandamme (2003)
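One reason the assumption fails is that 16S copy number varies among genomes, so read counts overstate taxa with many copies. A minimal sketch of a copy-number correction, assuming the copy number of each taxon is known (in practice it has to be looked up or estimated, as in the references above); the taxon names and numbers below are made up for illustration.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Convert 16S read counts to estimated relative cell abundances.

    Each taxon's read count is divided by its 16S copy number, then the
    results are rescaled to sum to 1.
    """
    adjusted = {taxon: read_counts[taxon] / copy_numbers[taxon]
                for taxon in read_counts}
    total = sum(adjusted.values())
    return {taxon: value / total for taxon, value in adjusted.items()}


reads = {"taxon_A": 70, "taxon_B": 10}    # hypothetical read counts
copies = {"taxon_A": 7, "taxon_B": 1}     # hypothetical copy numbers
abundance = correct_for_copy_number(reads, copies)
```

In this toy case taxon_A has seven times the reads of taxon_B, yet the two are estimated to be equally abundant once copy number is taken into account.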
  49. 49. Assumption #1 16S is an effective proxy for microbial diversity. Alternative marker genes: cpn60, rpoB, … Smaller reference databases! Protein-coding genes! 49
  50. 50. Assumption #2 All 16S studies are created equal. Effects of sequencing platform, V region, amplicon vs metagenomics 50 Tremblay et al. (2015) Front Microbiol
  51. 51. Assumption #3 Rarefaction is a good idea. Example of statistics before and after rarefaction: Loss of statistical power Random subsampling can increase false-positive differences Arbitrary minimum library size chosen for downsampling Alternatives: negative binomial fitting (e.g., DESeq2) 51 McMurdie and Holmes (2014) PLoS Comp Biol
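For readers unfamiliar with what rarefaction does mechanically, here is a minimal sketch: each sample is randomly subsampled without replacement to a common depth, and samples below that depth are dropped. The function name, the toy counts, and the fixed seed are assumptions for illustration.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample a dict of OTU counts to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, which is
    how whole samples get discarded by this procedure.
    """
    pool = [otu for otu, c in counts.items() for _ in range(c)]
    if len(pool) < depth:
        return None
    rng = random.Random(seed)
    rarefied = {}
    for otu in rng.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


deep = {"otu_1": 900, "otu_2": 90, "otu_3": 10}   # 1000 reads
shallow = {"otu_1": 40, "otu_2": 10}              # 50 reads
deep_rarefied = rarefy(deep, 100)
shallow_rarefied = rarefy(shallow, 100)
```

Rarefying the deep sample to 100 reads throws away 90% of its data and may lose the rare `otu_3` entirely, while the shallow sample is discarded; both effects illustrate the loss of power listed on the slide.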
  52. 52. Assumption #4 16S OTUs describe ecologically cohesive units. 52 Distribution of sequence similarity (dashed line = OTU threshold) branch lengths Nguyen et al. (2016) npj Biofilms and Microbiomes
  53. 53. Assumption #4 16S OTUs describe ecologically cohesive units. 53 Hall et al., in preparation Same OTU, different temporal patterns
  54. 54. Assumption #4 16S OTUs describe ecologically cohesive units. 54 Many alternatives exist, including Swarm: Mahé et al. (2015) PeerJ
  55. 55. Assumption #5 The 16S tree is the “Tree of Life”. 16S is limited for several reasons: Limited resolving power Subject to compositional bias Subject to recombination and lateral transfer Models typically applied to protein-coding genes do not make sense for noncoding RNA 55
  56. 56. Moving On ADVENTURES IN “MULTI-OMICS” 56
  57. 57. Multi-omics?? 16S can profile the biodiversity of a microbial sample… But we need the metagenome to shine a light on function… The metatranscriptome tells us what is expressed under specific conditions… And the metaproteome can quantify the relative abundance of different enzymes… While the metametabolome focuses on the products of metabolism. What do we really need? 57
  58. 58. Metagenomic / metatranscriptomic AMD analysis - Hua et al., ISME J (2015) Draft genomes at MG-RAST
  59. 59. 59 Differences in the microbiome between arsenic-exposed and control mice 16S taxonomic analysis + metametabolomics Taxonomy Metabolic function
  60. 60. Hands on! LET’S MAKE SCIENCE HAPPEN 60
  61. 61. The Dataset 61
  62. 62. Workflow 1. Retrieve data 2. Cluster sequences 3. Taxonomic classification 4. Phylogenetic tree construction 5. OTU table creation 6. Downstream visualization / analysis 62
  63. 63. FIN 63 Presentations http://www.slideshare.net/MickWatson/studying-the-microbiome http://bioinformatics.ca/metagenomics2015module2pptx
