Introduction to 16S rRNA gene multivariate analysis



Short introductory talk on multivariate statistics for 16S rRNA gene analysis, given at the 2nd Soil Metagenomics conference in Braunschweig, Germany, December 2013. A previous talk had discussed quality filtering, chimera detection, and clustering algorithms.

Published in: Technology


  1. 1. Multivariate exploration of microbial communities. Josh D. Neufeld. Braunschweig, Germany, December 2013. Andre Masella (MSc): computer science. Michael Lynch (PhD): taxonomy, phylogenetics, ecology. Michael Hall (co-op): mathematics, programming, user friendly! Posted on Slideshare without images and unpublished data.
  2. 2. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  3. 3. Who lives with whom, and why, and where? Data reduction is essential for: a) summarizing large numbers of observations into manageable numbers; b) visualizing many interconnected variables in a compact manner. Alpha diversity: species richness (and evenness) within a single sample. Beta diversity: change in species composition across a collection of samples. Gamma diversity: total species richness across an environmental gradient.
  4. 4. An (abbreviated) history. Numerical ecology: phenetics and statistical analysis of organismal counts; macroecology. 16S rRNA gene era: sequence analysis as a surrogate for counting; mapping of marker to taxonomy. NGS enabled synthesis of phenetics, phylogenetics, and numerical ecology.
  5. 5. Now generate V3-V4 bacterial amplicons (~450 bases) Usually PE 300
  6. 6. Assembling paired-end reads dramatically reduces error Corrects mismatches in region of overlap (quality threshold >0.9), set a minimum overlap. Can compare to perfect overlap assembly: “completelymissesthepoint” (name changing soon)
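
The overlap idea on this slide can be illustrated with a small sketch. This is not PANDAseq's probabilistic algorithm; it is a simplified stand-in that assumes plain A/C/G/T reads, integer quality scores, a match-minus-mismatch overlap score, and a hypothetical `min_overlap` default. Inside the overlap, a mismatch is resolved in favour of the base with the higher quality.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, fwd_qual, rev, rev_qual, min_overlap=10):
    """Merge a forward and a reverse read; return None if no overlap scores above 0.

    Illustrative scoring only (matches minus mismatches), not PANDAseq's model.
    """
    rev_rc = reverse_complement(rev)
    rev_rc_qual = rev_qual[::-1]  # qualities follow the reversed orientation
    best = None  # (score, overlap_length)
    for overlap in range(min_overlap, min(len(fwd), len(rev_rc)) + 1):
        tail, head = fwd[-overlap:], rev_rc[:overlap]
        matches = sum(a == b for a, b in zip(tail, head))
        score = matches - (overlap - matches)
        if best is None or score > best[0]:
            best = (score, overlap)
    if best is None or best[0] <= 0:
        return None
    overlap = best[1]
    start = len(fwd) - overlap
    merged = list(fwd[:start])
    for i in range(overlap):
        a, qa = fwd[start + i], fwd_qual[start + i]
        b, qb = rev_rc[i], rev_rc_qual[i]
        # Agreeing bases are kept; disagreements take the higher-quality call.
        merged.append(a if (a == b or qa >= qb) else b)
    merged.extend(rev_rc[overlap:])
    return "".join(merged)
```

A low-quality miscall in the overlap of one read is replaced by the confident call from its mate, which is the intuition behind the error reduction claimed on the slide.
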
  7. 7. PANDAseq >30x faster than next fastest alternative assembler
  8. 8. 1. p-value threshold 2. parallelizes correctly (both are now added or fixed in PANDAseq)
  9. 9. Biological Observation Matrix: BIOM file format (McDonald et al. 2012). Standard recognized by EMP, MG-RAST, VAMPS. Based on JSON data interchange format. Computational structure in multiple languages. “facilitates the efficient handling and storage of large, sparse biological contingency tables”. Encapsulates metadata and contingency table (e.g., OTU table) in one file.
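
To make the "sparse contingency table plus metadata in one JSON file" idea concrete, here is a minimal sketch that packs a dense OTU table into a BIOM-style dictionary. The field names follow my recollection of the BIOM 1.0 layout and several required fields (for example `format_url`, `generated_by`, `date`) are omitted, so treat this as an illustration rather than a validated BIOM file. The function name `dense_to_biom_like` is hypothetical.

```python
import json


def dense_to_biom_like(otu_ids, sample_ids, dense_counts, sample_metadata=None):
    """Build a BIOM-style dict from a dense OTU-by-sample count table.

    Only non-zero cells are stored, as [row_index, column_index, value].
    """
    sample_metadata = sample_metadata or {}
    data = [
        [r, c, value]
        for r, row in enumerate(dense_counts)
        for c, value in enumerate(row)
        if value != 0
    ]
    return {
        "id": None,
        "format": "Biological Observation Matrix 1.0.0",
        "type": "OTU table",
        "matrix_type": "sparse",
        "matrix_element_type": "int",
        "shape": [len(otu_ids), len(sample_ids)],
        "rows": [{"id": otu, "metadata": None} for otu in otu_ids],
        "columns": [
            {"id": s, "metadata": sample_metadata.get(s)} for s in sample_ids
        ],
        "data": data,
    }


table = dense_to_biom_like(
    ["OTU_1", "OTU_2", "OTU_3"],
    ["soil_A", "soil_B"],
    [[5, 0], [0, 2], [1, 0]],
    sample_metadata={"soil_A": {"depth_cm": 10}},  # made-up metadata
)
biom_text = json.dumps(table, indent=2)
```

For real data, the biom-format package is the safer route; the sketch only shows why sparse storage suits OTU tables in which most cells are zero.
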
  10. 10. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  11. 11. Who lives with whom, and why, and where? Data reduction is essential for: a) summarizing large numbers of observations into manageable numbers; b) visualizing many interconnected variables in a compact manner. Alpha diversity: species richness (and evenness) within a single sample. Beta diversity: change in species composition across a collection of samples. Gamma diversity: total species richness across an environmental gradient.
  12. 12. Diversity (richness and evenness)
  13. 13. α-diversity: richness and evenness. Shannon index (H’), estimators (Chao1, ACE), phylogenetic diversity. Shannon index (H’): richness and evenness. Estimators: richness. Faith’s PD: phylogenetic richness. Stearns et al., 2011; Hughes et al., 2001.
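
Two of the indices named on this slide are short enough to write out. The sketch below assumes a plain list of per-OTU read counts for one sample, uses the natural logarithm for H', and uses the bias-corrected form of Chao1; ACE and Faith's PD are left out.

```python
import math


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of singleton and doubleton OTUs.
    """
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


even_sample = [10, 10, 10, 10]   # four equally abundant OTUs
skewed_sample = [37, 1, 1, 1]    # same richness, one dominant OTU
```

Both example samples have the same richness, but `shannon_index` is lower for the skewed one, which is the "richness and evenness" distinction the slide draws. Chao1 depends heavily on singletons, so it is sensitive to sequencing error and to how reads were clustered.
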
  14. 14. “All biologists who sample natural communities are plagued with the problem of how well a sample reflects a community’s ‘true’ diversity.”
  15. 15. Hughes et al. 2001 “Nonparametric estimators show particular promise for microbial data and in some habitats may require sample sizes of only 200 to 1,000 clones to detect richness differences of only tens of species.”
  16. 16. [Figure: Google Scholar proportion of “[sequencing tech] AND 16S” results for Sanger, 454, and Illumina, together with “Rare biosphere” citations, plotted against time (year).] Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
  17. 17. GOALS Understanding of community structure Better alpha-diversity measures Robust beta-diversity measures Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
  18. 18. Stearns et al. 2011
  19. 19. Bartram et al. 2011
  20. 20. Clustering algorithms (influence alpha diversity primarily). CD-HIT (Li and Godzik, Sanford-Burnham Medical Research Institute): ‘longest-sequence-first’ removal algorithm; fast, many implementations (nucleotide, protein, OTU-specific); tends to be more stringent than UCLUST. UCLUST (R. Edgar): faster than CD-HIT; tends to generate larger number of low-abundance OTUs; broader range of clustering thresholds. “I do not recommend using the UCLUST algorithm or CD-HIT for generating OTUs” - Robert Edgar
  21. 21. CROP: Clustering 16S rRNA for OTU Prediction (CROP) “CROP can find clusters based on the natural organization of data without setting a hard cut-off threshold (3%/5%) as required by hierarchical clustering methods.”
  22. 22. Chimeras. DNA from two or more parent molecules; PCR artifact. Can easily be classified as a “novel” sequence. Increases α-diversity. Software: ChimeraSlayer, Bellerophon, UCHIME, Pintail. Reference database or de novo.
  23. 23. Classification and taxonomy. Ribosomal Database Project (RDP) classifier: naïve Bayesian classifier (James Cole and Tiedje). pplacer: phylogenetic placement and visualization. BLAST: the tool we know and love. RTAX (UC Berkeley, Rob Knight involved). mothur (Patrick Schloss). SINA (SILVA).
  24. 24. RDP classifier. Large training sets require active memory management. Can be easily run in parallel by breaking up very large data sets. Can classify Bacteria/Archaea SSU and fungal LSU (can be re-trained). Algorithm: determine the probability that an unknown query sequence is a member of a known genus (training set), based on the profile of word subsets of known genera. Confidence estimation: the number of times in 100 trials that a genus was selected based on a random subset of words in the query. Take home: the higher the diversity (bigger sequence space) of the training set, the better the assignment. Longer query = better and more reliable assignment. Short reads (i.e., <250 bases) will have lower confidence estimates (cutoff of 0.5 suggested).
  25. 25. Database sources. GreenGenes: latest May 2013. SILVA: latest 115 (August 2013); includes 18S, 23S, 28S, LSU. RDP Database: latest 11 (October 2013). GenBank. Research-specific, e.g., CORE Oral.
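
The word-based scoring and the bootstrap confidence described on this slide can be sketched as a toy naive Bayes classifier. Assumptions to flag: the smoothing formulas and the one-eighth word subset follow my reading of the published RDP classifier description, the word size `k` is left as a parameter (the toy test uses 4), and none of the function names come from the RDP software itself.

```python
import math
import random
from collections import Counter, defaultdict


def words(seq, k):
    """Set of overlapping k-letter words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(training, k):
    """training: list of (genus, sequence) pairs. Returns a simple model tuple."""
    genus_word_counts = defaultdict(Counter)  # genus -> word -> sequences containing it
    genus_sizes = Counter()                   # genus -> number of training sequences
    overall = Counter()                       # word -> sequences containing it
    for genus, seq in training:
        ws = words(seq, k)
        genus_word_counts[genus].update(ws)
        overall.update(ws)
        genus_sizes[genus] += 1
    return genus_word_counts, genus_sizes, overall, len(training)


def log_likelihood(model, genus, query_words):
    """Sum of smoothed log word probabilities for one genus."""
    genus_word_counts, genus_sizes, overall, n_total = model
    score = 0.0
    for w in query_words:
        prior = (overall[w] + 0.5) / (n_total + 1)
        score += math.log((genus_word_counts[genus][w] + prior) / (genus_sizes[genus] + 1))
    return score


def classify(model, query, k, trials=100, seed=0):
    """Return (best genus, fraction of bootstrap trials that agree with it)."""
    genera = sorted(model[1])
    rng = random.Random(seed)
    query_words = sorted(words(query, k))

    def best(ws):
        return max(genera, key=lambda g: log_likelihood(model, g, ws))

    assignment = best(query_words)
    subset_size = max(1, len(query_words) // 8)  # assumed subset fraction
    hits = sum(
        best(rng.sample(query_words, subset_size)) == assignment
        for _ in range(trials)
    )
    return assignment, hits / trials
```

Because each bootstrap trial sees only a fraction of the query's words, a short query leaves very few words per trial, which is one way to see why short reads tend to get lower confidence estimates.
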
  26. 26. Multivariate data reduction
  27. 27. β-diversity. Visualization (ordination) versus hypothesis testing (MRPP, indicator species analysis). Many more algorithms out there for exploration and statistical testing, mostly through widely used R packages: vegan (Community Ecology Package), labdsv (Ordination and Multivariate Analysis for Ecology), ape (Analyses of Phylogenetics and Evolution), picante (community analyses etc.).
  28. 28. Visualization (ordination). Complementary to data clustering (which looks for discontinuities). Ordination extracts main trends as continuous axes: analysis of the square matrix derived from the OTU table. Non-parametric, unconstrained ordination methods are most widely used (and best suited): methods that can work directly on a square matrix. An appropriate metric is required to derive this square matrix: many options...
  29. 29. Metrics. Ordination is essentially reducing dimensionality; first requirement: accurately model differences among samples. Models are *really* important. Examples include: OTU presence/absence (Dice, Jaccard); OTU abundance (Bray-Curtis); phylogenetic (UniFrac). “All models are wrong, some are useful” - G.E. Box. “You can't publish anything without a PCoA plot anymore, but METRICS used to draw plot important.” - Susan Huse
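
The non-phylogenetic metrics named on this slide are easy to state in code. The sketch assumes two samples given as equal-length lists of OTU counts in the same OTU order; the function names are illustrative.

```python
def _present(counts):
    """Indices of OTUs with a non-zero count."""
    return {i for i, c in enumerate(counts) if c > 0}


def jaccard_distance(a, b):
    """1 - |shared OTUs| / |OTUs present in either sample| (presence/absence)."""
    pa, pb = _present(a), _present(b)
    union = pa | pb
    return 1 - len(pa & pb) / len(union) if union else 0.0


def dice_distance(a, b):
    """1 - 2 * |shared OTUs| / (|OTUs in a| + |OTUs in b|) (presence/absence)."""
    pa, pb = _present(a), _present(b)
    total = len(pa) + len(pb)
    return 1 - 2 * len(pa & pb) / total if total else 0.0


def bray_curtis(a, b):
    """Sum of absolute count differences over the summed totals (abundance)."""
    total = sum(a) + sum(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / total if total else 0.0


sample_1 = [10, 0, 5, 0]
sample_2 = [5, 5, 0, 0]
```

The same pair of samples gets a different distance under each metric, which is why the choice of metric matters before any ordination is drawn. Bray-Curtis on raw counts is also sensitive to unequal sequencing depth, so counts are usually normalized or subsampled first.
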
  30. 30. Metrics: UniFrac. A distance measure comparing multiple communities using phylogenetic information. Requires sequence alignment and tree-building (PyNAST, MUSCLE, Infernal). Time-consuming and susceptible to poor phylogenetic inference (does it matter?). Weighted (abundance): ecological features related to abundance. Unweighted: ecological features related to taxonomic presence/absence.
  31. 31. Ordination example 1 (of many): Principal Coordinates Analysis. Classical Multidimensional Scaling (MDS; Gower 1966). Procedure: based on eigenvectors; position objects in low-dimensional space while preserving distance relationships as well as possible. Highly flexible: can choose among many association measures. In microbial ecology, used for visualizing phylogenetic or count-based distances. Consistent visual output for given distance matrix. Include variance explained (%) on Axis 1 and 2.
  32. 32. Ordination example 2 (of many): Non-metric Multidimensional Scaling. Ordination not based on eigenvectors. Does not preserve exact distances among objects; attempts to preserve ordering of samples (“ranks”). Procedure: iterative; tries to position the objects in a few (2-3) dimensions in such a way that minimizes the “stress” (how well does the new ranked distribution of points represent the original distances in the association matrix?). Can express as R² on axes 1 and 2. The adjustment goes on until the stress value reaches a local minimum (heuristic solution). NMDS often represents distance relationships better than PCoA in the same number of dimensions. Susceptible to the “local minimum issue”, and therefore should have a strong starting point (e.g., PCoA) or many permutations. You won't get the same result each time you run the analysis. Try several runs until you are comfortable with the result.
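
Unweighted UniFrac reduces to a ratio of branch lengths once the tree is available. The sketch below sidesteps alignment and tree-building entirely: it assumes the tree has already been flattened into a list of `(branch_length, set_of_descendant_tips)` pairs, which is a hypothetical representation chosen for brevity, and it ignores abundance.

```python
def unweighted_unifrac(branches, otus_a, otus_b):
    """Fraction of observed branch length that leads to only one of two samples.

    branches: iterable of (length, set of tip OTU names below that branch).
    otus_a, otus_b: sets of OTU names present in each sample.
    """
    unique = shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & otus_a)
        in_b = bool(tips & otus_b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
    total = unique + shared
    return unique / total if total else 0.0


# Toy tree ((A:1,B:1):1,(C:1,D:1):1) flattened into branches.
toy_branches = [
    (1.0, {"A"}), (1.0, {"B"}), (1.0, {"A", "B"}),
    (1.0, {"C"}), (1.0, {"D"}), (1.0, {"C", "D"}),
]
```

Branches that lead to neither sample are ignored, so the distance depends only on the part of the tree the two samples actually occupy.
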
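
The eigenvector procedure behind PCoA can be sketched without any numerical library. This version Gower-centres the squared distances and extracts only the first axis by power iteration. It assumes a symmetric distance matrix whose dominant eigenvalue is positive (true for Euclidean-like distances; some ecological metrics can produce negative eigenvalues), and the variance explained is reported relative to the trace of the centred matrix.

```python
import math


def pcoa_first_axis(dist, iterations=500):
    """Return (coordinates on axis 1, fraction of variance explained by axis 1)."""
    n = len(dist)
    # Gower centring of A = -0.5 * d^2: subtract row and column means, add grand mean.
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand = sum(row_mean) / n
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
         for i in range(n)]
    # Power iteration for the dominant eigenvector of the centred matrix.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("all samples are identical; no axis to extract")
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    coords = [x * math.sqrt(max(eigenvalue, 0.0)) for x in v]
    explained = eigenvalue / sum(b[i][i] for i in range(n))
    return coords, explained


# Three samples lying on a line at positions 0, 1 and 3.
line_dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
axis_1, variance_explained = pcoa_first_axis(line_dist)
```

Because the result is determined by the distance matrix (up to the sign of each axis), repeated runs give the same picture, which is the "consistent visual output" noted on the slide. In practice a full eigendecomposition from a numerical library would return every axis at once.
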
  33. 33. Do my treatments separate?
  34. 34. Beta-diversity: hypothesis testing. Multiple methods, implemented in QIIME, mothur, AXIOME, e.g., MRPP, adonis, NP-MANOVA (perMANOVA), ANOSIM. Are treatment effects significant? Because these are predominantly nonparametric methods, tests for significance rely on testing by permutation. Let's focus on MRPP.
  35. 35. Multiresponse Permutation Procedures. Compare intragroup average distances with the average distances that would have resulted from all the other possible combinations. T statistic: more negative with increasing group separation (T>-10 common for ecology). A statistic: degree of scatter within groups (A=1 when all points fall on top of one another). p value: likelihood of similar separation with randomized data.
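
A compact sketch of the MRPP logic on this slide follows. It assumes a precomputed square distance matrix, at least two samples per group, group weights of n_i / N, and random label permutations rather than full enumeration. The T statistic here is standardized with the permutation mean and standard deviation, which is a simplification of the moment-based approximation used in standard implementations.

```python
import random
import statistics
from itertools import combinations


def mrpp_delta(dist, groups):
    """Weighted mean within-group distance (weights n_i / N)."""
    n_total = len(groups)
    delta = 0.0
    for label in set(groups):
        members = [i for i, g in enumerate(groups) if g == label]
        pairs = list(combinations(members, 2))
        mean_dist = sum(dist[i][j] for i, j in pairs) / len(pairs)
        delta += len(members) / n_total * mean_dist
    return delta


def mrpp(dist, groups, permutations=999, seed=0):
    """Return a dict with the observed delta, T, A and a permutation p value."""
    rng = random.Random(seed)
    observed = mrpp_delta(dist, groups)
    shuffled = list(groups)
    permuted = []
    for _ in range(permutations):
        rng.shuffle(shuffled)  # reassign group labels at random
        permuted.append(mrpp_delta(dist, shuffled))
    expected = statistics.mean(permuted)
    as_extreme = sum(d <= observed + 1e-12 for d in permuted)
    return {
        "delta": observed,
        "T": (observed - expected) / statistics.pstdev(permuted),
        "A": 1 - observed / expected,
        "p": (as_extreme + 1) / (permutations + 1),
    }


# Two well-separated groups of four samples on a line.
positions = [0, 1, 2, 3, 100, 101, 102, 103]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
toy_groups = ["control"] * 4 + ["treated"] * 4
result = mrpp(toy_dist, toy_groups)
```

With only a handful of samples per group, the number of distinct label arrangements is small, so the smallest achievable p value is bounded no matter how clean the separation looks.
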
  36. 36. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  37. 37. “PCoA plots are the first step of a community analysis, not the last.” Josh Neufeld
  38. 38. Searching for species that matter. High-dimensional data often have too many features to investigate; solution: identify and study species significantly associated with categorical metadata. Indicator species (Dufrene-Legendre): calculates indicator value (fidelity and relative abundance) of species; permutation test for significance. Need solution for sparse data: be wary of groups with small numbers of sites (influence on permutation tests); low abundance can artificially inflate indicator values.
  39. 39. Specificity Fidelity
  40. 40. IndVal (Dufrene & Legendre, 1997). Specificity: large mean abundance within group relative to summed mean abundances of other groups. Fidelity: presence in most or all sites of that group. Groups defined a priori by metadata or statistical clustering.
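
The two IndVal components combine by simple multiplication. The sketch below handles one species at a time and assumes per-site abundances with a matching list of group labels. Specificity is taken as the group's mean abundance divided by the sum of mean abundances over all groups (the form I recall from Dufrene & Legendre 1997), and the permutation test for significance is omitted.

```python
def indval(abundances, groups):
    """Indicator value (0-100) of one species for each group.

    abundances: abundance of the species at each site.
    groups: group label of each site, in the same order.
    """
    labels = sorted(set(groups))
    mean_abundance, fidelity = {}, {}
    for label in labels:
        values = [a for a, g in zip(abundances, groups) if g == label]
        mean_abundance[label] = sum(values) / len(values)
        # Fidelity: fraction of the group's sites where the species occurs.
        fidelity[label] = sum(1 for v in values if v > 0) / len(values)
    total = sum(mean_abundance.values())
    return {
        label: 100 * (mean_abundance[label] / total if total else 0.0) * fidelity[label]
        for label in labels
    }


site_abundances = [10, 8, 12, 0, 0, 2]
site_groups = ["forest"] * 3 + ["field"] * 3
scores = indval(site_abundances, site_groups)
```

A species scores highly only when it is both concentrated in a group and present at most of that group's sites, so a taxon seen once at high abundance in a single site does not become an indicator.
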
  41. 41. Simple linear correlations
      Metadata | Taxon | R^2 value
      mbc | k__Bacteria;p__Planctomycetes;c__Planctomycetia;o__Gemmatales;f__Isosphaeraceae;g__ | 0.611368489781491
      mbc | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.677209935419981
      mbn | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.64092523702996
      soil_depth | k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Intrasporangiaceae;g__ | 0.669761188668774
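
An R^2 value like those listed can be obtained as the squared Pearson correlation between a metadata variable and a taxon's abundance across samples. The sketch assumes that is how the values were computed, which the slide does not state, and the example numbers are made up.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical values: a metadata variable and one taxon's relative abundance.
metadata_values = [1, 2, 3, 4]
taxon_abundance = [1, 3, 2, 4]
fit = r_squared(metadata_values, taxon_abundance)
```

Screening every taxon against every metadata column this way produces many comparisons, so the strongest R^2 values are best treated as leads to follow up rather than as tested results.
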
  42. 42. mothur: cooccurrence function, measuring whether populations are co-occurring more frequently than you would expect by chance.
  43. 43. Non-negative Matrix Factorization NMF as a representation method for portraying high-dimensional data as a small number of taxonomic components. Patterns of co-occurring OTUs can be described by a smaller number of taxonomic components. Each sample represented by the collection of component taxa, helping identify relationships between taxa and the environment. Jonathan Dushoff, McMaster University, Ontario, Canada
  44. 44. SSUnique
  45. 45. SILVA
  46. 46. SILVA
  47. 47. SILVA
  48. 48. SILVA
  49. 49. SILVA
  50. 50. Nakai et al. 2012 Lynch et al. 2012
  51. 51. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  52. 52. Why pipelines? Merge and manage (many) disparate techniques. Democratize analysis: improve accessibility. Accelerate pace of innovation, collaboration, and research.
  53. 53. Early synthesis for numerical microbial ecology. Synthesis of 16S phylogenetics (Woese et al.) and Hughes (Counting the uncountable). Numerical ecology for microorganisms. Algorithm development: libshuff, dotur (mothur). Analysis pipelines: QIIME, mothur.
  54. 54. QIIME (Knight Lab, U. Colorado at Boulder). Predominantly a collection of integrated Python/R scripts. Many dependencies; easy managed installation: qiime-deploy, MacQIIME. Virtual box and Ubuntu fork: avoid for anything but small runs. Becoming the standard for marker gene studies: integrated analysis and visualization; easy access to broad computational biology toolbox (Python/R).
  55. 55. Automation and extension: AXIOME and phyloseq. Extend existing technologies (QIIME, mothur, R, custom). Layers of abstraction. Automation and rapid re-analysis. Promote reproducible research (iPython, XML, make). Implement existing techniques (e.g., MRPP, Dufrene-Legendre IndVal); numerical microbial ecology needs to better incorporate modern statistical theory. Develop and test new techniques.
  56. 56. Axiometic: GUI companion for AXIOME. Cross-platform. New implementation in development. Generates AXIOME file (XML). xls template coming soon for all commands, sample metadata, and extra info… much easier for everyone.
  57. 57. “QIIME wraps many other software packages, and these should be cited if they are used. Any time you're using tools that QIIME wraps, it is essential to cite those tools.”
  58. 58. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  59. 59. The future. As data get bigger, interpretation should be “hands off”. Move towards hypothesis testing of high-dimension taxonomic data. Convergence on Galaxy, e.g., QIIME in Galaxy is developing. Further extension to cloud services, e.g., Amazon EC2. Machine learning and data mining applications.
  60. 60. Open-source, web-based platform Deployed locally or in the cloud Ongoing development of 16S rRNA gene analysis
  61. 61. Galaxy Workshed (available tools)
  62. 62. “The advantages of having large numbers of samples at shallow coverage (~1,000 sequences per sample) clearly outweigh having a small number of samples at greater coverage for many datasets, suggesting that the focus for future studies should be on broader sampling that can reveal association with key biological parameters rather than on deeper sequencing.”
  63. 63. “….even [phylogenetic beta-diversity] measures suited to the underlying mechanism of differentiation may require deep sequencing to reveal subtle patterns” Dr. Donovan Parks
  64. 64. Method standardization: impossible. Data storage: sequence reads outpacing data storage costs; federated data? File formats: e.g., FASTA (difficult to search, difficult to retrieve sequences, not space efficient, does not ensure data is in correct format, no space for metadata, no absolute standard)… relational databases? Software: Free and Open Source enables an experiment to be faithfully replicated. Algorithms: memory! Many clustering and phylogenetic inference algorithms scale as n²; distributed, parallel, or cloud computing may not be helpful. Metadata: what to do with it? How to marry sequence and metadata sets? We need better metadata integration, not necessarily more/better metadata.
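
The memory point about n² algorithms can be made concrete with a little arithmetic. The sketch assumes the algorithm stores one value per unordered pair of sequences and that each value takes 4 bytes; both are illustrative assumptions rather than properties of any particular tool.

```python
def pairwise_matrix_bytes(n_sequences, bytes_per_value=4):
    """Bytes needed to hold one value for every unordered pair of sequences."""
    n_pairs = n_sequences * (n_sequences - 1) // 2
    return n_pairs * bytes_per_value


small_run = pairwise_matrix_bytes(1_000)       # about 2 MB
large_run = pairwise_matrix_bytes(1_000_000)   # about 2 TB
```

A thousand-fold increase in sequences costs roughly a million-fold increase in memory, which is why adding more machines does not by itself rescue an algorithm that needs the full pairwise matrix.
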
  65. 65. What should we be doing? (take-home messages) *Surveys are really important for spatial and temporal mapping *Hypothesis testing follows (or implicit) *What species account for treatment effects? *Who tracks with who? (why=function) *Who avoids who? *Are all microorganisms accounted for? (no) *How can we use this information to manipulate, manage and predict ecosystems?
  66. 66. What should we be doing? (take-home messages) There is no “one way” to analyze 16S rRNA You need to build a pipeline for you. If this seems daunting, it is. If this is not daunting, your hands are dirty. It’s getting better all the tii-ime.
  67. 67. Helpful resources
  68. 68. Thank you