Accounting for uncertainty in species delineation during the analysis of environmental DNA sequence data

Tutorial accompanying the paper of the same name, published in Methods in Ecology and Evolution

Full paper
http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00122.x/abstract


  1. Tutorial: Accounting for uncertainty in species delineation during the analysis of environmental DNA sequence data. Jeff R. Powell
  2. DNA-based taxonomic approaches and biodiversity estimation. The current biodiversity crisis has led some to advocate a primary role for high-throughput DNA sequencing technologies in taxonomic research.
  3. Therefore, a major imperative in bioinformatics is the development of theoretical and practical approaches for generating biodiversity estimates from DNA sequences.
  4. This is especially true for taxa that are relatively understudied from a taxonomic perspective (particularly microorganisms and cryptic taxa).
  5. A promising approach uses a mixed model that differentiates speciation events from population coalescent events based on the timing of divergences within a taxon. Estimating species boundaries from environmental DNA sequences using the general mixed Yule-coalescent (GMYC) model. Current GMYC approach: fit models predicting inter- and intra-specific divergence rates, and the threshold times differentiating these processes, to multispecies coalescent trees; models contain a single threshold (Pons et al. 2006 Syst Biol) or multiple thresholds (Monaghan et al. 2009 Syst Biol). Step 1: compare the maximum likelihood (ML) single-threshold model to the null hypothesis of a single coalescent population (Fontaneto et al. 2007 PLoS Biol). Step 2: compare the ML multiple-threshold model to the ML single-threshold model to determine whether the increased number of parameters significantly improves model fit (Monaghan et al. 2009 Syst Biol). This ignores models whose thresholds fit the data only slightly less well than the maximum likelihood models. [Figure after Pons et al. 2006 Syst Biol; annotation: ‘Speciation’]
  6. Probabilistic diversity estimation with uncertain species boundaries (using GMYC and model averaging). Extension to the current approach (Powell 2011 Methods Ecol Evol). Step 1: estimate the AIC of each model (all single- and multiple-threshold models) and rank the models by fit to the data. Step 2a: estimate the probability that two taxa belong to the same ‘species’ based on the weights associated with each model. Step 2b: estimate sample richness (and the variance associated with this estimate) using model averaging. An added benefit is that uncertainty in species boundaries can be directly incorporated into the variance associated with diversity estimates. [Figure annotation: ‘Several models fit well’]
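The three steps on this slide can be written compactly using the standard Akaike-weight definitions from Burnham and Anderson (2002). This is a sketch under assumptions: the symbols are introduced here for illustration, and the slides do not confirm that the supplemental script uses exactly these expressions. Here $\mathrm{AIC}_i$ is the score of model $i$, $I_i(a,b)$ equals 1 if tips $a$ and $b$ fall in the same cluster under model $i$ and 0 otherwise, and $S_i$ is the richness predicted by model $i$.

```latex
\begin{align}
\Delta_i &= \mathrm{AIC}_i - \min_j \mathrm{AIC}_j
  && \text{(Step 1: rank models)} \\
w_i &= \frac{\exp(-\Delta_i / 2)}{\sum_j \exp(-\Delta_j / 2)}
  && \text{(Akaike weight of model } i\text{)} \\
P(a \sim b) &= \sum_i w_i \, I_i(a, b)
  && \text{(Step 2a: same-`species' probability)} \\
\bar{S} &= \sum_i w_i \, S_i
  && \text{(Step 2b: model-averaged richness)}
\end{align}
```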
  7. R is available at http://cran.r-project.org/.
  8. The commands to enter are preceded by ‘> ’; modify these as appropriate for your data. Notes are entered after the ‘#’ symbol.
  9. ‘Powell_supplemental_script.R’ contains the functions for GMYC model averaging and most of the code used here; it is available at: http://dx.doi.org/10.1111/j.2041-210X.2011.00122.x
  10. The source file can be opened with a text editor (e.g. Notepad, TextEdit). These R packages (and their dependencies) are required to run the following functions; instructions for installing them are on the following slides (install ‘splits’ after the other packages). Command annotation: downloads and installs ‘geiger’ and its dependencies.
  11. Command annotations: downloads and installs the ‘igraph’ package; downloads and installs the ‘vegan’ package; downloads and installs the ‘gtools’ package.
  12. ‘ape’ and ‘paran’ are also required by the ‘splits’ package. ‘splits’ needs to be installed from source; use the following:
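The install commands themselves did not survive in this transcript, so the following is a minimal sketch of what the installation step might look like. The package names come from the slides; the R-Forge repository URL and the `type = "source"` argument for ‘splits’ are assumptions, not taken from the slides.

```r
# CRAN packages named on the slides, plus 'ape' and 'paran',
# which the slides say 'splits' requires.
cran_pkgs <- c("geiger", "igraph", "vegan", "gtools", "ape", "paran")

# Install only the packages that are not already available.
is_installed <- vapply(cran_pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!is_installed)) install.packages(cran_pkgs[!is_installed])

# 'splits' is installed last, from source.
# ASSUMPTION: the R-Forge repository URL below is not stated in the slides.
if (!requireNamespace("splits", quietly = TRUE)) {
  install.packages("splits",
                   repos = "http://R-Forge.R-project.org",
                   type = "source")
}
```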
  13. Read the functions into R from the source file in the working directory; calls to load the required packages are also in the source file. Show the workspace to check that the functions were read correctly into the R workspace.
  14. Read the tree into R; normally the tree would be read from a file in the working directory. Newick format: “read.tree(‘treefile.phylo’)”. Nexus format: “read.nexus(‘treefile.nex’)”. Print the tree summary to check that the tree was read correctly (proper number of tips); the tree needs to be fully dichotomous (the number of nodes is one fewer than the number of tips).
  15. Plot the tree; it needs to be ultrametric, meaning the distance from the root to each tip is the same. This can be checked with ‘is.ultrametric(test.tr)’.
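The checks described on the last two slides can be sketched with functions from the ‘ape’ package. To keep the example self-contained, a simulated coalescent tree from `rcoal()` stands in for a tree read from file; the seed and tree size are arbitrary choices for illustration.

```r
library(ape)

# Stand-in for read.tree('treefile.phylo') or read.nexus('treefile.nex'):
# rcoal() simulates a random, ultrametric coalescent tree.
set.seed(1)
test.tr <- rcoal(20)

# Tree summary: check the number of tips and internal nodes.
print(test.tr)

# Fully dichotomous: the number of internal nodes is one fewer than the tips.
stopifnot(Nnode(test.tr) == Ntip(test.tr) - 1)

# Ultrametric: every tip is the same distance from the root.
stopifnot(is.ultrametric(test.tr))

plot(test.tr, cex = 0.6)
```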
  16. Plot the accumulation of branches (N) through time; the GMYC model is used to detect abrupt changes in this accumulation rate.
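A lineages-through-time plot of this kind can be drawn with `ltt.plot()` from ‘ape’. The sketch below again uses a simulated tree as a stand-in, and the log-scaled y-axis is a presentation choice assumed here rather than taken from the slide.

```r
library(ape)

# Simulated ultrametric tree standing in for the tutorial tree.
set.seed(1)
test.tr <- rcoal(20)

# Node ages, oldest first; each branching event adds one lineage.
node_ages <- sort(branching.times(test.tr), decreasing = TRUE)

# Number of lineages (N) through time; a sharp upturn near the present
# is the kind of rate change the GMYC model is meant to detect.
ltt.plot(test.tr, log = "y", xlab = "Time", ylab = "N")
```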
  17. Fit the GMYC model to the tree using the single-threshold method (“method=‘s’”). Results are stored in the object ‘test.sing’.
  18. The model is fit using each node (first column) as the threshold, from the second to the last branching event (age in the second column), and the model likelihood is estimated for each (third column).
  19. The last three columns are for diagnostic purposes (convergence warnings, number of iterations, and number of clusters) and are not important here. Output annotation: maximum likelihood (ML) model.
  20. The single-threshold procedure takes less than one minute to run.
  21. The time required increases with tree size. Summary of results: comparison of the ML model (five parameters) to the null model (single coalescent population, two parameters); number of clusters and entities (clusters + singletons) predicted by the ML model, with the CI defined as models within 2 log-likelihood units of the ML model; node age for the threshold in the ML model.
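The comparison and the confidence set in this summary can be stated as follows. The parameter counts and the 2 log-likelihood-unit rule come from the slide; treating the comparison as a likelihood ratio statistic referred to a chi-square distribution with degrees of freedom equal to the difference in parameter counts is the usual convention and is assumed here, not confirmed by the slides. $L_{\mathrm{ML}}$ is the likelihood of the ML model, $L_0$ that of the null model, and $L_i$ that of the model using node $i$ as the threshold.

```latex
\begin{align}
\Lambda &= 2 \left( \ln L_{\mathrm{ML}} - \ln L_{0} \right) \\
\mathrm{df} &= 5 - 2 = 3 \\
\mathrm{CI} &= \left\{\, i : \ln L_{\mathrm{ML}} - \ln L_i \le 2 \,\right\}
\end{align}
```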
  22. Fit the GMYC model to the tree using the multiple-thresholds method (“method=‘m’”). Results are stored in the object ‘test.mult’.
  23. The procedure starts by placing a single threshold at a fixed point in the tree, then introduces additional thresholds closer to or further from the node for particular lineages.
  24. The model likelihood is printed to the screen when an improvement is observed.
  25. The procedure finishes when improvements over the null model or earlier GMYC models are no longer found. The multiple-threshold procedure takes approximately five minutes to run here.
  26. Run time increases (approximately exponentially) with tree size and decreases if multiple thresholds are not detected. Summary of results: comparison of the ML model (≥ six parameters) to the null model (single coalescent population, two parameters); number of clusters and entities (clusters + singletons) predicted by the ML model, with the CI defined as models within 2 log-likelihood units of the ML model; node ages for the thresholds in the ML model.
  27. Calculate AICc scores for GMYC models using different thresholds. Specify the object(s) containing GMYC model output fit using ‘gmyc.edit()’. Output: model-averaged parameter estimates; other information (e.g., only single- or multiple-threshold output objects specified).
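For readers unfamiliar with AICc, the small-sample correction can be sketched in base R. This does not reproduce the supplemental script: the function name `aicc`, the log-likelihoods, and the sample size `n` are all hypothetical, and which quantity plays the role of `n` for a GMYC fit is not stated in the slides.

```r
# Small-sample corrected AIC (Burnham and Anderson 2002).
# loglik: maximized log-likelihood; k: number of parameters; n: sample size.
aicc <- function(loglik, k, n) {
  stopifnot(all(n - k - 1 > 0))
  -2 * loglik + 2 * k + (2 * k * (k + 1)) / (n - k - 1)
}

# HYPOTHETICAL values: a null model (two parameters) and three
# single-threshold models (five parameters each).
loglik <- c(null = 410.2, step12 = 421.9, step13 = 421.1, step14 = 418.4)
k      <- c(2, 5, 5, 5)
scores <- aicc(loglik, k, n = 149)

# Difference from the best-scoring model (delta AICc).
round(scores - min(scores), 2)
```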
  28. Generate some summary output: specify the object containing the model score calculations, and specify the cutoff for the maximum delta AICc for printing model summaries to the screen. Output: models ranked by increasing delta AICc; ‘step’ is used to identify the model output in the ‘gmyc.edit()’ results.
  29. Generate some summary output, continued. Output: models ranked by increasing delta AICc; the last column (spilled over in the screen output here) indicates the Akaike weight given to the model in the model-averaged parameter estimates.
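The ranking and weighting shown in this output can be illustrated with a few lines of base R. The scores and the cutoff of 7 are hypothetical, and `akaike_weights` is a name introduced here for illustration, not a function from the supplemental script.

```r
# Delta AICc and Akaike weights for a set of model scores.
akaike_weights <- function(aicc) {
  delta <- aicc - min(aicc)
  rel   <- exp(-delta / 2)            # relative likelihood of each model
  data.frame(aicc = aicc, delta = delta, weight = rel / sum(rel))
}

# HYPOTHETICAL AICc scores, labelled by the threshold 'step'.
scores <- c(step12 = -833.4, step13 = -832.9, step14 = -830.1, step20 = -821.0)

tab <- akaike_weights(scores)
tab <- tab[order(tab$delta), ]        # rank by increasing delta AICc

# Print only models within the chosen delta AICc cutoff.
tab[tab$delta <= 7, ]
```

Weights are computed over all models before the cutoff is applied, so the cutoff only affects what is printed.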
  30. Generate some summary output, continued. Output: model-averaged parameter estimates (this output does not account for the deltaAICc argument).
  31. Estimate the number of clusters, the number of entities (clusters + singletons), and Shannon diversity; also estimate the variance associated with these parameters. Specify the object containing the model score calculations; specify the cutoff for the maximum delta AICc of included models. Enter ‘y’ to continue, ‘n’ to stop (e.g., if there are too many models and time is limited).
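One common way to obtain a model-averaged estimate and a variance that includes model-selection uncertainty is the unconditional variance estimator of Burnham and Anderson (2002). The sketch below applies it to hypothetical entity counts; whether the supplemental script uses this exact estimator is an assumption, and the within-model variances are set to zero purely for illustration.

```r
# Model-averaged estimate and unconditional variance.
# est: per-model estimates; var_within: per-model sampling variances;
# weight: Akaike weights (renormalized over the included models).
model_average <- function(est, var_within, weight) {
  weight <- weight / sum(weight)
  avg    <- sum(weight * est)
  # Each model contributes its own variance plus its squared deviation
  # from the averaged estimate (the species-boundary uncertainty).
  se     <- sum(weight * sqrt(var_within + (est - avg)^2))
  c(est = avg, var = se^2)
}

# HYPOTHETICAL numbers of entities predicted by three models.
result <- model_average(est        = c(28, 30, 35),
                        var_within = c(0, 0, 0),
                        weight     = c(0.5, 0.3, 0.2))
result
```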
  32. Calculate pairwise probabilities that tips co-occur within GMYC clusters. Specify the object containing the model score calculations; specify the cutoff for the maximum delta AICc of included models. Enter ‘y’ to continue, ‘n’ to stop (e.g., if there are too many models and time is limited). Delta AIC and level of empirical support (Burnham and Anderson, 2002, Model Selection and Multimodel Inference, page 170):
      delta AIC    Level of empirical support
      0-2          substantial
      4-7          considerably less
      >10          essentially none
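The pairwise probability for two tips can be thought of as the summed weight of the models that place them in the same cluster. A minimal sketch in base R, with hypothetical cluster memberships and weights (`cooccur` and `membership` are names introduced here, not part of the supplemental script):

```r
# Weighted probability that each pair of tips shares a cluster.
# membership: tips in rows, models in columns; NA marks a singleton.
# w: Akaike weights of the models, summing to 1.
cooccur <- function(membership, w) {
  n <- nrow(membership)
  p <- matrix(0, n, n, dimnames = list(rownames(membership), rownames(membership)))
  for (i in seq_len(ncol(membership))) {
    same <- outer(membership[, i], membership[, i], "==")
    same[is.na(same)] <- FALSE        # singletons share a cluster with no one
    p <- p + w[i] * same
  }
  diag(p) <- 1                        # a tip always co-occurs with itself
  p
}

# HYPOTHETICAL cluster assignments for five tips under three models.
membership <- cbind(m1 = c(1, 1, 1, 2, 2),
                    m2 = c(1, 1, NA, 2, 2),
                    m3 = c(1, 1, 1, 1, 1))
rownames(membership) <- paste0("tip", 1:5)

p <- cooccur(membership, w = c(0.6, 0.3, 0.1))
round(p, 2)
```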
  33. Visual representation of cluster sizes and uncertainty; probabilities range from white (1) to red (0); x- and y-axis labels are arbitrary.
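A plot with this colour scheme can be approximated in base R with `image()`. The probability matrix below is hypothetical, and the slide does not say which plotting function the tutorial actually used, so treat this only as one possible rendering.

```r
# HYPOTHETICAL pairwise co-occurrence probabilities for three tips.
p <- matrix(c(1.0, 0.9, 0.1,
              0.9, 1.0, 0.2,
              0.1, 0.2, 1.0), nrow = 3, byrow = TRUE)

# Colour ramp from red (probability 0) to white (probability 1).
cols <- colorRampPalette(c("red", "white"))(50)

# Fix the colour scale to [0, 1]; axes are dropped because the
# default axis labels carry no meaning here.
image(p, zlim = c(0, 1), col = cols, axes = FALSE)
```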
  34. Plot the tree; numbers above branches represent the probabilities that all tips nested within a node exist in a single GMYC cluster (hard to see in the default plot window).
  35. Plot to file, specifying dimensions (in inches) to plot over a larger area: open the connection; plot to file; close the connection; show the files in the working directory.
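The open, plot, close sequence can be sketched with the base `pdf()` device and ‘ape’ plotting functions. The tree is simulated, the node probabilities are random placeholders, and the file is written to a temporary directory rather than the working directory; all of these are assumptions made to keep the example self-contained.

```r
library(ape)

set.seed(1)
test.tr <- rcoal(20)

# HYPOTHETICAL support values, one per internal node.
support <- round(runif(Nnode(test.tr)), 2)

out <- file.path(tempdir(), "tree_support.pdf")

pdf(out, width = 10, height = 20)     # open connection; dimensions in inches
plot(test.tr, cex = 0.8)              # plot to file
nodelabels(support, adj = c(1.1, -0.3), frame = "none", cex = 0.6)
dev.off()                             # close connection

list.files(tempdir(), pattern = "\\.pdf$")   # show the file that was written
```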
  36. The file is found in the working directory; numbers above branches represent the probabilities that all tips nested within a node exist in a GMYC cluster.
  37. Finish the session: show all objects in the workspace; quit R. Specifying ‘y’ to save the image will result in this workspace being restored at the next start, as long as the user navigates to the current directory before starting R. Alternatively, “save.image(‘tutorial.rdata’)” results in an image that can be loaded from any directory.
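The alternative mentioned on this slide, saving the workspace to a named file, might look like the following. The file name comes from the slide; writing it to a temporary directory and the example object `richness` are choices made here so the sketch leaves nothing behind.

```r
richness <- c(est = 30, var = 4)      # HYPOTHETICAL object to keep

ls()                                  # show all objects in the workspace

# Save the whole workspace to a named image file.
img <- file.path(tempdir(), "tutorial.rdata")
save.image(img)

# In a later session, started from any directory:
load(img)

# q()  # quit R; answering 'y' saves the image in the working directory
```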
  38. Reload the session to demonstrate sample-specific diversity estimates: show the working directory (started here); show the files in the working directory, which contains a species-sample matrix (‘test.samples.txt’); reload the source file to load the necessary packages.
  39. These data were randomly generated and written to file using the code below, with cells representing species presence/abundance in samples (species in rows, samples in columns). Read the species-sample information from file (tab-delimited: “sep=‘\t’”). Show the structure of the samples object: the data are in a data.frame object (the default of ‘read.table()’), with 150 species in rows and two samples in columns. Show a summary of the samples object.
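The generating code is not preserved in this transcript, so the following is only a sketch of how a comparable species-sample matrix could be created, written to a tab-delimited file, and read back. The presence probability, the row and column labels, and the object name `test.samples` are assumptions; the file name and the 150 by 2 dimensions come from the slides.

```r
set.seed(42)

# HYPOTHETICAL random presence/absence data: 150 species, two samples.
samples <- matrix(rbinom(150 * 2, size = 1, prob = 0.5), nrow = 150,
                  dimnames = list(paste0("sp", 1:150),
                                  c("sample1", "sample2")))

# Written to a temporary directory here rather than the working directory.
f <- file.path(tempdir(), "test.samples.txt")
write.table(samples, f, sep = "\t", quote = FALSE)

# The header has one fewer field than the data rows, so read.table()
# uses the first column as row names.
test.samples <- read.table(f, header = TRUE, sep = "\t")

str(test.samples)      # structure: data.frame, 150 rows, 2 columns
summary(test.samples)  # per-sample summary
```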
  40. Model-averaged diversity estimates for the whole tree (as previously calculated, for comparison).
  41. Model-averaged diversity estimates in each sample. For example, ‘est’: species richness in each sample; ‘var’: variance of the richness estimate, which can be propagated through further analyses.
  42. Output annotations: average richness; variance around the mean (including species boundary uncertainty); variance (underestimated, neglects species boundary uncertainty).
  43. Tutorial: Accounting for uncertainty in species delineation during the analysis of environmental DNA sequence data. For more information: jeffpowell2@gmail.com or Jeff.Powell@uws.edu.au
