Tutorial accompanying the paper of the same name, published in Methods in Ecology and Evolution
Full paper
http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00122.x/abstract
3. Therefore, a major imperative in bioinformatics is the development of theoretical and practical approaches for generating biodiversity estimates from DNA sequences
4. This is especially true for taxa that are relatively understudied from a taxonomic perspective (particularly microorganisms and cryptic taxa)
6. Probabilistic diversity estimation with uncertain species boundaries (using GMYC and model averaging), an extension to the current approach (Powell 2011, Methods Ecol Evol). Step 1: estimate the AIC of each model (all single- and multiple-threshold models) and rank the models based on fit to the data. Step 2a: estimate the probabilities that two taxa belong to the same ‘species’ based on the weights associated with each model. Step 2b: estimate sample richness (and the variance associated with this estimate) using model averaging. An added benefit is that uncertainty in species boundaries can be directly incorporated into the variance associated with diversity estimates. Several models can fit the data comparably well.
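The tutorial carries out these steps in R with functions supplied in its source file, but the model-averaging arithmetic behind Steps 1 and 2b can be seen in isolation. A minimal Python sketch, where the function names, AIC scores, and richness estimates are all hypothetical illustrations rather than part of the tutorial's code:

```python
import math

def akaike_weights(aic_scores):
    """Akaike weights: w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2),
    where delta_i is each model's AIC minus the best (lowest) AIC."""
    best = min(aic_scores)
    rel = [math.exp(-(a - best) / 2.0) for a in aic_scores]
    total = sum(rel)
    return [r / total for r in rel]

def model_averaged_richness(aic_scores, richness_estimates):
    """Step 2b: weight each model's richness estimate by its Akaike weight."""
    w = akaike_weights(aic_scores)
    return sum(wi * ri for wi, ri in zip(w, richness_estimates))

# hypothetical AIC scores and richness estimates for three GMYC models
aics = [100.0, 101.2, 106.0]
richness = [12.0, 14.0, 20.0]
avg = model_averaged_richness(aics, richness)  # about 12.9 for these numbers
```

The best-fitting model dominates the average, but poorer models still contribute in proportion to their weights; this is how uncertainty about which delimitation model is correct carries through to the final richness estimate.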
8. The commands to enter are preceded by ‘> ‘; modify these as appropriate for your data. Notes are entered after the ‘#’ symbol.
11. Download and install the ‘igraph’, ‘vegan’, and ‘gtools’ packages.
12. The ‘ape’ and ‘paran’ packages are also required by the ‘splits’ package; ‘splits’ needs to be installed from source, use the following:
13. Read the functions into R from the source file in the working directory; calls to load the required packages are also in the source file. Show the workspace to check that the functions were read correctly into the R workspace.
14. Read the tree into R; normally you would read the tree from a file in the working directory. Newick format: “read.tree(‘treefile.phylo’)”; Nexus format: “read.nexus(‘treefile.nex’)”. Check the tree summary to confirm the tree was read correctly (proper number of tips); the tree needs to be fully dichotomous (the number of internal nodes is one fewer than the number of tips).
15. Plot the tree; it needs to be ultrametric, meaning the distance from the root to each tip is the same. This can be checked with ‘is.ultrametric(test.tr)’.
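In the tutorial this check is a single R call, ‘is.ultrametric(test.tr)’. To illustrate what that property means, here is a small Python sketch on a hypothetical three-tip tree (the tree structure and names are invented for the example):

```python
# Toy tree stored as child -> (parent, branch length); None marks the root.
# The tutorial checks the same property in R with is.ultrametric().
tree = {
    "root": (None, 0.0),
    "n1": ("root", 2.0),
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "C": ("root", 3.0),
}

def root_to_tip(node):
    """Sum of branch lengths from the root down to `node`."""
    parent, length = tree[node]
    return length if parent is None else length + root_to_tip(parent)

def is_ultrametric(tips, tol=1e-8):
    """Ultrametric: every tip is the same distance from the root."""
    depths = [root_to_tip(t) for t in tips]
    return max(depths) - min(depths) < tol

print(is_ultrametric(["A", "B", "C"]))  # True: all root-to-tip distances are 3.0
```

If any branch length is perturbed so that one tip sits closer to the root than the others, the check fails; this is why trees usually need to be made ultrametric (e.g., by rate smoothing or a clock model) before a GMYC analysis.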
16. Plot the accumulation of branches (N) through time; the GMYC model is used to detect abrupt changes in this accumulation rate.
18. The model is fit using each node (first column) as the threshold, from the second to the last branching event (ages in the second column), and the model likelihood is estimated (third column).
23. The procedure starts by placing a single threshold at a fixed point in the tree, then introduces additional thresholds closer to or further from that node for particular lineages.
27. Calculate AICc scores for the GMYC models fit using different thresholds. Specify the object(s) containing the GMYC model output fit using ‘gmyc.edit()’. Output: model-averaged parameter estimates, plus other information (e.g., whether only single- or multiple-threshold output objects were specified).
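‘gmyc.edit()’ is a function supplied with the tutorial's source file, and the slides do not print the formula behind the AICc scores. Assuming the standard small-sample correction is used, the arithmetic looks like this (a Python sketch for illustration; the log-likelihoods and parameter counts are hypothetical):

```python
def aicc(log_lik, k, n):
    """Small-sample corrected AIC: AICc = -2*lnL + 2*k + 2*k*(k+1)/(n - k - 1),
    where k is the number of model parameters and n is the sample size."""
    aic = -2.0 * log_lik + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

# hypothetical GMYC fits: (log-likelihood, number of parameters), n = 20 tips
fits = [(-50.0, 3), (-49.1, 5)]
scores = [aicc(ll, k, 20) for ll, k in fits]
deltas = [s - min(scores) for s in scores]  # delta AICc, as in the ranked output
```

The correction term matters when k is not small relative to n (few tips, many thresholds); as n grows, AICc converges to plain AIC.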
28. Generate some summary output: specify the object containing the model score calculations, and the cutoff for the maximum delta AICc at which to print a model summary to the screen. Output: models ranked by increasing delta AICc; ‘step’ is used to identify the model output in the ‘gmyc.edit()’ results.
29. Generate some summary output, continued. Output: models ranked by increasing delta AICc; the last column (spilled over in the screen output here) indicates the Akaike weight given to each model in the model-averaged parameter estimates.
30. Generate some summary output, continued. Output: model-averaged parameter estimates (this output does not account for the deltaAICc argument).
31. Estimate the number of clusters, the number of entities (clusters + singletons), and Shannon diversity, along with the variance associated with these estimates. Specify the object containing the model score calculations and the cutoff for the maximum delta AICc of included models. Enter ‘y’ to continue or ‘n’ to stop (e.g., if there are too many models and time is a limitation).
32. Calculate the pairwise probabilities that tips co-occur within GMYC clusters. Specify the object containing the model score calculations and the cutoff for the maximum delta AICc of included models. Enter ‘y’ to continue or ‘n’ to stop (e.g., if there are too many models and time is a limitation). Delta AIC and level of empirical support: 0–2, substantial; 4–7, considerably less; >10, essentially none (Burnham and Anderson, 2002, Model Selection and Multimodel Inference, page 170).
33. Visual representation of cluster sizes and uncertainty; probabilities range from white (1) to red (0); the x- and y-axis labels are arbitrary.
34. Plot the tree; the numbers above branches represent the probabilities that all tips nested within a node exist in a single GMYC cluster (hard to see in the default plot window).
35. Plot to a file, specifying the dimensions (in inches) to plot over a larger area: open the connection, plot to the file, close the connection, then show the files in the working directory.
36. The file is found in the working directory; the numbers above branches represent the probabilities that all tips nested within a node exist in a single GMYC cluster.
37. Finish the session: show all objects in the workspace, then quit R. Specifying ‘y’ to save the image will result in this workspace being restored upon the next start, as long as the user first navigates to the current directory before starting R. Alternatively, “save.image(‘tutorial.rdata’)” saves an image that can be loaded from any directory.
38. Reload the session to demonstrate sample-specific diversity estimates: show the working directory (started here) and the files in it, which include a species–sample matrix (‘test.samples.txt’); reload the source file to load the necessary packages.
41. Model-averaged diversity estimates in each sample. For example, ‘est’ gives the species richness in each sample; ‘var’ gives the variance of the richness estimate, which can be propagated through further analyses.
42. Average richness; variance around the mean (including species-boundary uncertainty); variance (underestimated, as it neglects species-boundary uncertainty).
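The slides contrast the two variances but do not print the estimator. One common form (following Burnham and Anderson's treatment of model-selection uncertainty) adds a between-model squared-deviation term to each model's conditional variance; whether the tutorial's code uses exactly this form is an assumption here. A Python sketch with hypothetical weights and estimates:

```python
def model_averaged_variance(weights, estimates, variances):
    """Variance of a model-averaged estimate that includes model-selection
    (here, species-boundary) uncertainty:
        sum_i w_i * (var_i + (est_i - avg)^2)
    Dropping the (est_i - avg)^2 term gives the naive variance, which
    underestimates, as noted on the slide."""
    avg = sum(w * e for w, e in zip(weights, estimates))
    return sum(w * (v + (e - avg) ** 2)
               for w, e, v in zip(weights, estimates, variances))

# hypothetical: two delimitation models disagree on richness (8 vs 12)
full = model_averaged_variance([0.5, 0.5], [8.0, 12.0], [2.0, 2.0])   # 6.0
naive = 0.5 * 2.0 + 0.5 * 2.0  # 2.0: ignores species-boundary uncertainty
```

When the candidate models disagree about richness, the between-model term dominates, which is exactly the uncertainty the naive variance misses.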
43. Tutorial: Accounting for uncertainty in species delineation during the analysis of environmental DNA sequence data. For more information: jeffpowell2@gmail.com or Jeff.Powell@uws.edu.au