Accounting for uncertainty in species delineation during the analysis of environmental DNA sequence data
 

Tutorial accompanying the paper of the same name, published in Methods in Ecology and Evolution

Full paper
http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00122.x/abstract


    • Tutorial: Accounting for uncertainty in species delineation during the analysis of environmental DNA sequence data
      Jeff R. Powell
    • DNA-based taxonomic approaches and biodiversity estimation
      • The current biodiversity crisis has led some to advocate a primary role for high-throughput DNA sequencing technologies in taxonomic research
      • Therefore, a major imperative in bioinformatics is the development of theoretical and practical approaches for generating biodiversity estimates from DNA sequences
      • This is especially true for taxa that are relatively understudied from a taxonomic perspective (particularly microorganisms and cryptic taxa)
      • A promising approach utilizes a mixed model that differentiates speciation events from population coalescent events based on timing of divergences within a taxon
    • Estimating species boundaries from environmental DNA sequences using the General Mixed-Yule Coalescent (GMYC) model
      Current GMYC approach:
      Fit models predicting inter- and intra-specific divergence rates and threshold times differentiating these processes to multispecies coalescent trees; models contain single (Pons et al. 2006 Syst Biol) or multiple (Monaghan et al. 2009 Syst Biol) thresholds
      Step 1: compare maximum likelihood (ML) single-threshold model to the null hypothesis of a single coalescent population (Fontaneto et al. 2007 PLoS Biol)
      Step 2: compare ML multiple-threshold model to ML single-threshold model to determine if increased number of parameters significantly enhances model fit (Monaghan et al. 2009 Syst Biol)
      This approach ignores models whose thresholds fit the data only slightly less well than the maximum likelihood models
      [Figure, after Pons et al. 2006 Syst Biol; label: Speciation]
    • Probabilistic diversity estimation with uncertain species boundaries
      (using GMYC and model averaging)
      Extension to current approach: (Powell 2011 Methods Ecol Evol)
      Step 1: Estimate AIC of each model (all single- and multiple-threshold models) and rank based on fit to the data
      Step 2a: Estimate probabilities that two taxa belong to the same ‘species’ based on the weights associated with each model
      Step 2b: Estimate sample richness (and variance associated with this estimate) using model averaging
      Added benefit is that uncertainty in species boundaries can be directly incorporated into the variance associated with diversity estimates
      Several models fit well
      • R is available at http://cran.r-project.org/.
      • The commands to enter are preceded by ‘> ’; modify these as appropriate for your data. Notes are entered after the ‘#’ symbol
      • ‘Powell_supplemental_script.R’ contains functions for GMYC model averaging and most of the code used here; it is available at:
      http://dx.doi.org/10.1111/j.2041-210X.2011.00122.x
    • The source file can be opened with a text editor (e.g. Notepad, TextEdit)
      • These R packages (and their dependencies) are required to run the following functions; instructions for installing on the following slides (install ‘splits’ after other packages)
    •  Downloads and installs ‘geiger’ and its dependencies
    •  Downloads and installs ‘igraph’ package
       Downloads and installs ‘vegan’ package
       Downloads and installs ‘gtools’ package
    • ‘ape’ and ‘paran’ are also required by the ‘splits’ package
      ‘splits’ needs to be installed from source; use the following:
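The installation steps described on these slides might look like the sketch below. It assumes that the listed packages are still available from CRAN and that ‘splits’ is served from the R-Forge repository; that repository URL is an assumption and is not stated in the slides.

```r
# Install the CRAN packages first; their dependencies are pulled in automatically.
install.packages(c("ape", "paran", "geiger", "igraph", "vegan", "gtools"))

# Then install 'splits' from source.
# Assumption: the package is hosted on R-Forge (URL not given in the slides).
install.packages("splits", repos = "http://R-Forge.R-project.org", type = "source")
```

Installing ‘splits’ last follows the order recommended above, so that its requirements are already in place.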
    • Read functions into R from source file in the working directory; calls to load required packages are also in source file
      Show the workspace to check that functions were read correctly into the R workspace
    • Read tree into R; normally would read tree from file in working directory:
      Newick format: “read.tree(‘treefile.phylo’)”
      Nexus format: “read.nexus(‘treefile.nex’)”
      Tree summary to check that tree was read correctly (proper number of tips); the tree needs to be fully dichotomous (number of nodes is one fewer than number of tips)
    • Plot tree; needs to be ultrametric, meaning the distance from root to each tip is the same
       Can check with ‘is.ultrametric(test.tr)’
    • Plot the accumulation of branches (N) through time; the GMYC model is used to detect abrupt changes in this accumulation rate
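The tree checks and plots described on the preceding slides can be sketched as follows. This is a minimal illustration using the ‘ape’ package; the simulated coalescent tree from ‘rcoal()’ is a stand-in for a tree read from file and is not the tutorial's data.

```r
library(ape)

# Stand-in for a tree read from file: simulate a 20-tip coalescent tree.
# With real data use read.tree("treefile.phylo") or read.nexus("treefile.nex").
set.seed(1)
test.tr <- rcoal(20)

print(test.tr)   # tree summary: check the number of tips and nodes

# A fully dichotomous rooted tree has one fewer internal node than tips.
is.dichotomous <- Nnode(test.tr) == Ntip(test.tr) - 1

# The tree must also be ultrametric (same root-to-tip distance for every tip).
ultrametric <- is.ultrametric(test.tr)

plot(test.tr, cex = 0.6)   # plot the tree
ltt.plot(test.tr)          # number of lineages (N) through time
```

If either check returns FALSE, the tree needs to be fixed before the model is fit.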
    • Fit the GMYC model to the tree using the single-threshold method (“method=‘s’”)
      • Results are stored in object ‘test.sing’
      • The model is fit using each node (first column) as the threshold, from the second to the last branching event (age in second column), and estimates the model likelihood (third column)
      • The last three columns are for diagnostic purposes (convergence warnings, number of iterations, and number of clusters), not important here
    •  Maximum likelihood (ML) model
      • Less than one minute to run single-threshold procedure
      • Time required increases with tree size
    • Summary of results
      Comparison of the ML model (five parameters) to the null model (single coalescent population, two parameters)
      Number of clusters, entities (clusters + singletons) predicted by the ML model; CI: models within 2 log-likelihood units of ML model
      Node age for the threshold in the ML model
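The comparison of the ML model to the null model can be illustrated with a likelihood ratio test. The sketch below assumes a chi-square reference distribution with degrees of freedom equal to the difference in parameter counts (five versus two); the log-likelihood values are invented for illustration and are not results from the tutorial.

```r
# Hypothetical log-likelihoods, made up for illustration only.
loglik.null <- 100.0   # null model: single coalescent population (2 parameters)
loglik.gmyc <- 106.5   # ML single-threshold GMYC model (5 parameters)

# Likelihood ratio statistic and its degrees of freedom.
lr.stat <- 2 * (loglik.gmyc - loglik.null)
df <- 5 - 2

# Upper-tail probability under the chi-square distribution.
p.value <- pchisq(lr.stat, df = df, lower.tail = FALSE)
print(c(LR = lr.stat, df = df, p = p.value))
```

A small p-value indicates that the GMYC model fits better than a single coalescent population.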
    • Fit the GMYC model to the tree using the multiple-thresholds method (“method=‘m’”)
      • Results are stored in object ‘test.mult’
      • Procedure starts by placing single threshold at a fixed point in the tree, then introducing additional thresholds closer to/further from node for particular lineages
      • Model likelihood is printed to screen when improvement is observed
      • Procedure finishes when improvements over null model or earlier GMYC models are no longer found
      • Approximately five minutes to run multiple-threshold procedure here
      • Run time increases (approximately exponentially) with tree size, and decreases if multiple thresholds are not detected
    • Summary of results
      Comparison of the ML model (≥ six parameters) to the null model (single coalescent population, two parameters)
      Number of clusters, entities (clusters + singletons) predicted by the ML model; CI: models within 2 log-likelihood units of ML model
      Node ages for the thresholds in the ML model
    • Calculate AICc scores for GMYC models using different thresholds
      Specify object(s) containing GMYC model output fit using ‘gmyc.edit()’
      Output:
      Model-averaged parameter estimates
      Other information (e.g., only single/multiple-threshold output objects specified)
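For reference, the standard small-sample corrected AIC can be computed as sketched below. The function name ‘aicc()’, the log-likelihoods and the sample size ‘n’ are hypothetical; the slides do not state what is counted as ‘n’ for a GMYC fit, so the value used here is a placeholder only.

```r
# AICc for a model with log-likelihood 'loglik', 'k' parameters and 'n' observations.
aicc <- function(loglik, k, n) {
  -2 * loglik + 2 * k + (2 * k * (k + 1)) / (n - k - 1)
}

# Hypothetical log-likelihoods for three single-threshold models (k = 5 each).
loglik <- c(106.5, 106.1, 104.0)

# Assumption: n = 149 is a placeholder, not a value taken from the tutorial.
scores <- aicc(loglik, k = 5, n = 149)
print(scores)   # lower scores indicate better-supported models
```

With equal parameter counts the ranking follows the log-likelihoods; the penalty terms matter when single- and multiple-threshold models are compared.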
    • Generate some summary output: specify object containing model scores calculations; specify cutoff for maximum delta AICc to print model summary to screen
      Output:
      Models ranked by increasing delta AICc; ‘step’ used to identify model output in ‘gmyc.edit()’ results
    • Generate some summary output, continued:
      Output:
      Models ranked by increasing delta AICc; last column (spilled over in screen output here) indicates Akaike weight given to model in the model-averaged parameter estimates
    • Generate some summary output, continued:
      Output:
      Model-averaged parameter estimates (this output does not account for the deltaAICc argument)
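The ranking by delta AICc and the Akaike weights shown in this output follow standard formulas, sketched below in base R. All scores and threshold ages are hypothetical, and this is not the code from the supplemental script.

```r
# Hypothetical AICc scores for four candidate models.
aicc.scores <- c(-202.6, -201.8, -197.6, -190.1)

# Delta AICc: difference from the best (lowest) score.
delta <- aicc.scores - min(aicc.scores)

# Akaike weights: relative likelihoods rescaled to sum to one.
rel.lik <- exp(-delta / 2)
weights <- rel.lik / sum(rel.lik)

# Rank models by increasing delta AICc.
ranking <- order(delta)
print(data.frame(model = ranking, delta = delta[ranking], weight = weights[ranking]))

# Model-averaged estimate of a parameter (hypothetical threshold ages).
threshold.age <- c(-0.012, -0.010, -0.020, -0.005)
avg.threshold <- sum(weights * threshold.age)
```

Models with a large delta AICc receive weights close to zero and contribute little to the averaged estimate.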
    • Estimate number of clusters, entities (clusters + singletons), Shannon diversity; also estimate variance associated with these parameters
      Specify object containing model scores calculations; specify cutoff for maximum delta AICc of included models
      Enter ‘y’ to continue, ‘n’ to stop (e.g., if too many models – time limitation)
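A model-averaged estimate of the number of entities, with a variance that reflects disagreement among models, can be sketched with the unconditional variance estimator of Burnham and Anderson (2002). The weights and entity counts below are hypothetical, the within-model variances are set to zero as a simplifying assumption, and the supplemental script may compute these quantities differently.

```r
weights  <- c(0.55, 0.30, 0.15)   # hypothetical Akaike weights (sum to one)
entities <- c(12, 14, 11)         # hypothetical entity counts, one per model

# Model-averaged number of entities.
avg.entities <- sum(weights * entities)

# Assumption: no sampling variance within a model, so all of the variance
# comes from differences among models in where the thresholds fall.
var.within <- c(0, 0, 0)

# Unconditional variance estimator (Burnham and Anderson 2002).
var.entities <- sum(weights * sqrt(var.within + (entities - avg.entities)^2))^2

print(c(estimate = avg.entities, variance = var.entities))
```

The variance is zero only when every included model predicts the same number of entities.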
    • Calculate pairwise probabilities that tips co-occur within GMYC clusters
      Specify object containing model scores calculations; specify cutoff for maximum delta AICc of included models
      Enter ‘y’ to continue, ‘n’ to stop (e.g., if too many models – time limitation)
      delta AIC    Level of empirical support
      0-2          substantial
      4-7          considerably less
      >10          essentially none
      (Burnham and Anderson, 2002, Model Selection and Multimodel Inference, page 170)
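The pairwise probabilities can be understood as weighted sums: for each pair of tips, add up the Akaike weights of the models that place both tips in the same cluster. The sketch below uses hypothetical cluster memberships and weights for four tips and three models; it is not the code from the supplemental script.

```r
tips <- c("t1", "t2", "t3", "t4")

# Rows are models, columns are tips; entries are hypothetical cluster IDs.
membership <- rbind(c(1, 1, 2, 2),
                    c(1, 1, 1, 2),
                    c(1, 2, 3, 3))
colnames(membership) <- tips
weights <- c(0.5, 0.3, 0.2)   # hypothetical Akaike weights (sum to one)

# Accumulate, for every pair of tips, the weights of the models
# in which the two tips share a cluster ID.
cooccur <- matrix(0, length(tips), length(tips), dimnames = list(tips, tips))
for (i in seq_len(nrow(membership))) {
  same <- outer(membership[i, ], membership[i, ], "==")
  cooccur <- cooccur + weights[i] * same
}
print(cooccur)
```

Here ‘t1’ and ‘t2’ share a cluster in the first two models, so their co-occurrence probability is 0.5 + 0.3 = 0.8.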
    • Visual representation of cluster sizes, uncertainty; probabilities range from white (1) to red (0); x- and y-axis labels are arbitrary
    • Plot tree, numbers above branches represent probabilities that all tips nested within node exist in a single GMYC cluster (hard to see in the default plot window)
    • Plot to file, specify dimensions (in inches) to plot over larger area
      Open connection
      Plot to file
      Close connection
      Show files in working directory
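The open, plot, close sequence for writing a plot to file looks like the following in base R. The file name and dimensions are hypothetical, a temporary directory is used here instead of the working directory, and the scatter plot is a placeholder for the tree plot.

```r
# Hypothetical output file; drop tempdir() to write into the working directory.
out.file <- file.path(tempdir(), "tree_plot.pdf")

pdf(out.file, width = 10, height = 20)   # open connection; dimensions in inches
plot(1:10, main = "placeholder for the tree plot")   # plot to file
dev.off()                                # close connection

list.files(tempdir())                    # show files in the output directory
```

A taller page gives each tip more vertical space, which makes the numbers above the branches easier to read.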
    • File found in working directory; numbers above branches represent probabilities that all tips nested within node exist in a GMYC cluster
    • Finish session:
      Show all objects in the workspace
      Quit R; specifying ‘y’ to save image will result in this workspace being restored upon next start, as long as the user first navigates to the current directory before starting R
      - alternatively: “save.image(‘tutorial.rdata’)”  results in image to load from any directory
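Saving the workspace to a named file and restoring it later can be sketched as follows. The temporary directory is an assumption made to keep the example self-contained; the slide writes ‘tutorial.rdata’ to the working directory.

```r
example.object <- 1:5   # hypothetical object standing in for the tutorial results
ls()                    # show all objects in the workspace

# Save the whole workspace to a named image file.
image.file <- file.path(tempdir(), "tutorial.rdata")
save.image(image.file)

# In a later session the image can be loaded from any directory by its path.
load(image.file)
```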
    • Reload session to demonstrate sample-specific diversity estimates:
      Show working directory (started here)
      Show files in working directory; contains a species-sample matrix (‘test.samples.txt’)
      Reload source file to load necessary packages
      • These data were randomly generated and written to file using the code below, cells representing species presence/abundance in samples – species in rows, samples in columns
    • Read species-sample information from file (tab-delimited: “sep=‘\t’”)
      Show structure of samples object: data are in a data.frame object (default of ‘read.table()’), with 150 species in rows and two samples in columns
      Show summary of samples object
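Generating, writing and reading back a species-sample matrix of this shape can be sketched as below. The random abundances, species names and temporary file location are hypothetical; with real data the row names would need to match the tip labels of the tree.

```r
set.seed(42)

# 150 species (rows) by 2 samples (columns) of hypothetical random abundances.
abund <- matrix(rpois(300, lambda = 1), nrow = 150, ncol = 2,
                dimnames = list(paste0("sp", 1:150), c("sample1", "sample2")))

# Write a tab-delimited file, then read it back.
samples.file <- file.path(tempdir(), "test.samples.txt")
write.table(abund, samples.file, sep = "\t", quote = FALSE)

# The header has one fewer field than the data rows, so read.table()
# uses the first column as row names.
test.samples <- read.table(samples.file, sep = "\t", header = TRUE)

str(test.samples)       # structure of the samples object
summary(test.samples)   # summary of the samples object
```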
    • Model-averaged diversity estimates for whole tree (as previously calculated, for comparison)
    • Model-averaged diversity estimates in each sample
      For example,
      ‘est’: Species richness in each sample
      ‘var’: Variance of richness estimate – can propagate through further analyses
    •  Average richness
       Variance around the mean (including species boundary uncertainty)
       Variance (underestimated, neglects species boundary uncertainty)
    • Tutorial: Accounting for uncertainty in species delineation during the analysis of environmental DNA sequence data
      For more information:
      jeffpowell2@gmail.com
      or
      Jeff.Powell@uws.edu.au