Tutorial: Accounting for uncertainty in species delineation during the analysis of environmental DNA sequence data
Jeff R. Powell
DNA-based taxonomic approaches and biodiversity estimation
Therefore, a major imperative in bioinformatics is the development of theoretical and practical approaches for generating biodiversity estimates from DNA sequences
This is especially true for taxa that are relatively understudied from a taxonomic perspective (particularly microorganisms and cryptic taxa)
A promising approach utilizes a mixed model that differentiates speciation events from population coalescent events based on the timing of divergences within a taxon
Probabilistic diversity estimation with uncertain species boundaries (using GMYC and model averaging)
Extension to current approach (Powell 2011, Methods Ecol Evol):
Step 1: Estimate AIC of each model (all single- and multiple-threshold models) and rank based on fit to the data
Step 2a: Estimate probabilities that two taxa belong to the same ‘species’ based on the weights associated with each model
Step 2b: Estimate sample richness (and variance associated with this estimate) using model averaging
An added benefit is that uncertainty in species boundaries can be directly incorporated into the variance associated with diversity estimates
Several models fit well
The commands to enter are preceded by ‘> ’; modify these as appropriate for your data. Notes are entered after the ‘#’ symbol.
‘Powell_supplemental_script.R’ contains functions for GMYC model averaging and most of the code used here; it is available at: http://dx.doi.org/10.1111/j.2041-210X.2011.00122.x
The source file can be opened with a text editor (e.g. Notepad, TextEdit)
Download and install the ‘igraph’ package
Download and install the ‘vegan’ package
Download and install the ‘gtools’ package
‘ape’ and ‘paran’ are also required by the ‘splits’ package
‘splits’ needs to be installed from source
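The installation commands from the original slides did not survive, so the following is only a sketch of one way to do it. `install_if_missing` is a hypothetical helper (not part of the tutorial script), and the R-Forge repository URL for ‘splits’ is an assumption that should be checked before use.

```r
# Sketch: install only the packages that are not already present.
# `install_if_missing` is a hypothetical helper, not part of the tutorial script.
install_if_missing <- function(pkgs, ...) {
  # Compare the requested names against packages already in the library
  missing <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing) > 0) {
    install.packages(missing, ...)   # extra arguments (repos, type) are passed through
  }
  invisible(missing)                 # names of the packages that had to be installed
}

# Typical calls, left commented out so nothing is downloaded by accident:
# install_if_missing(c("igraph", "vegan", "gtools", "ape", "paran"))
# Assumption: 'splits' is hosted on R-Forge and is installed as a source package
# install_if_missing("splits", repos = "http://R-Forge.R-project.org", type = "source")
```

Wrapping the call this way avoids reinstalling packages on every run of the script.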
Read functions into R from the source file in the working directory; calls to load required packages are also in the source file
Show the workspace to check that the functions were read correctly into the R workspace
Read the tree into R; normally the tree would be read from a file in the working directory:
Newick format: “read.tree('treefile.phylo')”
Nexus format: “read.nexus('treefile.nex')”
Print the tree summary to check that the tree was read correctly (proper number of tips); the tree needs to be fully dichotomous (number of nodes is one fewer than number of tips)
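As a minimal sketch of this check, the example below reads a small tree from text instead of a file so that it is self-contained. The Newick string is made-up data; `test.tr` follows the object name used elsewhere in the tutorial.

```r
library(ape)

# A small Newick tree supplied as text (hypothetical data) so the example is
# self-contained; with a file, use read.tree("treefile.phylo") instead
newick <- "((A:1,B:1):2,((C:0.5,D:0.5):1.5,E:2):1);"
test.tr <- read.tree(text = newick)

n_tips  <- Ntip(test.tr)    # number of tips
n_nodes <- Nnode(test.tr)   # number of internal nodes

# A rooted, fully dichotomous tree has exactly one fewer internal node than tips
fully_dichotomous <- is.rooted(test.tr) && (n_nodes == n_tips - 1)

print(test.tr)              # tree summary
fully_dichotomous
```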
Plot the tree; it needs to be ultrametric, meaning the distance from the root to each tip is the same
This can be checked with ‘is.ultrametric(test.tr)’
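To make the definition concrete, the sketch below computes the root-to-tip distances by hand and compares the result with ‘is.ultrametric()’. The tree is the same made-up example as before and the tolerance of 1e-8 is an arbitrary choice.

```r
library(ape)

test.tr <- read.tree(text = "((A:1,B:1):2,((C:0.5,D:0.5):1.5,E:2):1);")

# Distance from the root to every node; the first Ntip() entries are the tips
depths <- node.depth.edgelength(test.tr)
root_to_tip <- depths[seq_len(Ntip(test.tr))]
names(root_to_tip) <- test.tr$tip.label

# Ultrametric means all root-to-tip distances agree (up to numerical tolerance)
manual_check <- diff(range(root_to_tip)) < 1e-8

root_to_tip
manual_check
is.ultrametric(test.tr)   # the built-in check named in the tutorial
```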
Plot the accumulation of branches (N) through time; the GMYC model is used to detect abrupt changes in this accumulation rate
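The accumulation curve can be sketched directly from the node ages. The example below uses the same made-up tree; the column names in `ltt` are illustrative only.

```r
library(ape)

test.tr <- read.tree(text = "((A:1,B:1):2,((C:0.5,D:0.5):1.5,E:2):1);")

# Ages of the internal nodes, oldest (root) first
bt <- sort(branching.times(test.tr), decreasing = TRUE)

# Each branching event adds one lineage; the root event takes N from 1 to 2
ltt <- data.frame(time_before_present = unname(bt),
                  N = seq_along(bt) + 1)
ltt

# ape draws the same curve directly; a log y-axis makes rate changes easier to see
ltt.plot(test.tr, log = "y")
```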
Fit the GMYC model to the tree using the single-threshold method (“method='s'”)
The model is fit using each node (first column) as the threshold, from the second to the last branching event (age in second column), and the model likelihood is estimated for each (third column)
The last three columns are for diagnostic purposes (convergence warnings, number of iterations, and number of clusters) and are not important here
Time required increases with tree size
Fit the GMYC model to the tree using the multiple-thresholds method (“method='m'”)
The procedure starts by placing a single threshold at a fixed point in the tree, then introduces additional thresholds closer to or further from the node for particular lineages
Model likelihood is printed to screen when improvement is observed
The procedure finishes when improvements over the null model or earlier GMYC models are no longer found
Time required increases (approximately exponentially) with tree size and decreases if multiple thresholds are not detected
Calculate AICc scores for GMYC models using different thresholds
Specify the object(s) containing GMYC model output fit using ‘gmyc.edit()’
Output:
  Model-averaged parameter estimates
  Other information (e.g., only single/multiple-threshold output objects specified)
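For readers unfamiliar with AICc, the sketch below shows the standard small-sample correction in base R. It is not the code from the supplemental script: the log-likelihoods, the parameter count `k`, and the sample size `n` are hypothetical, and what counts as `n` for a GMYC fit is left to the caller.

```r
# AICc from a log-likelihood, number of parameters k, and sample size n.
# What counts as n for a GMYC fit is an assumption left to the caller.
aicc <- function(logLik, k, n) {
  aic <- -2 * logLik + 2 * k
  aic + (2 * k * (k + 1)) / (n - k - 1)   # small-sample correction term
}

# Hypothetical log-likelihoods for three candidate thresholds
fits <- data.frame(step = c(12, 13, 14),
                   logLik = c(-101.2, -100.0, -100.9),
                   k = 5)
fits$AICc  <- aicc(fits$logLik, fits$k, n = 50)
fits$delta <- fits$AICc - min(fits$AICc)   # delta AICc relative to the best model

fits[order(fits$delta), ]                  # models ranked by increasing delta AICc
```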
Generate some summary output: specify the object containing the model score calculations; specify the cutoff for maximum delta AICc to print the model summary to screen
Output: Models ranked by increasing delta AICc; ‘step’ is used to identify the model output in the ‘gmyc.edit()’ results
Generate some summary output, continued:
Output: Models ranked by increasing delta AICc; the last column (spilled over in the screen output here) indicates the Akaike weight given to each model in the model-averaged parameter estimates
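The Akaike weight of a model follows from its delta AICc. A minimal base R sketch, with hypothetical delta values, is:

```r
# Akaike weights: relative likelihood of each model, normalised to sum to one
akaike_weights <- function(delta) {
  rel_lik <- exp(-0.5 * delta)
  rel_lik / sum(rel_lik)
}

delta <- c(0, 1.8, 2.4, 6.0)   # hypothetical delta AICc values
w <- akaike_weights(delta)

round(w, 3)
sum(w)                         # weights sum to one
```

Models with a small delta AICc receive most of the weight, which is why several well-fitting models can all contribute to the averaged estimates.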
Generate some summary output, continued:
Output: Model-averaged parameter estimates (this output does not account for the deltaAICc argument)
Estimate the number of clusters, the number of entities (clusters + singletons), and Shannon diversity; also estimate the variance associated with these parameters
Specify the object containing the model score calculations; specify the cutoff for maximum delta AICc of included models
Enter ‘y’ to continue, ‘n’ to stop (e.g., if there are too many models and time is limited)
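The idea behind these estimates can be sketched in base R. The example assumes the unconditional variance estimator of Burnham and Anderson; whether the supplemental script uses exactly this form is not confirmed here. `model_average` and `shannon` are hypothetical helper names and the entity counts are made up.

```r
# Model-averaged estimate and unconditional variance for a quantity such as
# the number of entities. Assumption: Burnham and Anderson's estimator,
# var = (sum_i w_i * sqrt(var_i + (est_i - avg)^2))^2
model_average <- function(est, delta, var_within = 0, max_delta = Inf) {
  keep <- delta <= max_delta                    # apply the delta AICc cutoff
  var_within <- rep_len(var_within, length(delta))[keep]
  est <- est[keep]
  w <- exp(-0.5 * delta[keep])
  w <- w / sum(w)                               # renormalise weights over kept models
  avg <- sum(w * est)
  se <- sum(w * sqrt(var_within + (est - avg)^2))
  c(est = avg, var = se^2, n_models = sum(keep))
}

# Shannon diversity from the number of tips in each entity
shannon <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Hypothetical entity counts under three models
entities <- c(10, 12, 11)
delta    <- c(0, 1, 7)
model_average(entities, delta, max_delta = 2)   # third model is excluded
shannon(c(5, 3, 2))
```

Because models that disagree about the number of entities widen the spread around the average, uncertainty in species boundaries feeds directly into the variance.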
Calculate pairwise probabilities that tips co-occur within GMYC clusters
Specify the object containing the model score calculations; specify the cutoff for maximum delta AICc of included models
Enter ‘y’ to continue, ‘n’ to stop (e.g., if there are too many models and time is limited)

delta AIC    Level of empirical support
0-2          substantial
4-7          considerably less
>10          essentially none
(Burnham and Anderson, 2002, Model Selection and Multimodel Inference, page 170)
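The pairwise probabilities can be thought of as a weighted vote across models. The sketch below is an illustration in base R rather than the supplemental script's implementation: `cooccurrence_prob`, the membership matrix, and the weights are all hypothetical, and singletons are assumed to carry their own unique label.

```r
# Probability that two tips fall in the same entity, averaged over models.
# memberships: one row per tip, one column per model; entries are entity labels
#              (assumption: each singleton has its own unique label)
# w:           Akaike weights of the models, summing to one
cooccurrence_prob <- function(memberships, w) {
  n <- nrow(memberships)
  p <- matrix(0, n, n,
              dimnames = list(rownames(memberships), rownames(memberships)))
  for (m in seq_len(ncol(memberships))) {
    # TRUE where two tips share a label under model m
    same <- outer(memberships[, m], memberships[, m], "==")
    p <- p + w[m] * same
  }
  p
}

# Hypothetical example: four tips, two models
memberships <- cbind(m1 = c(1, 1, 2, 2),   # model 1: {A,B} and {C,D}
                     m2 = c(1, 1, 2, 3))   # model 2: {A,B}, C and D are singletons
rownames(memberships) <- c("A", "B", "C", "D")
w <- c(0.7, 0.3)

pairs <- cooccurrence_prob(memberships, w)
pairs
```

Here A and B are grouped by both models, while C and D are grouped only by the model that carries a weight of 0.7.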
Visual representation of cluster sizes and uncertainty; probabilities range from white (1) to red (0); x- and y-axis labels are arbitrary
Plot the tree; numbers above branches represent probabilities that all tips nested within a node belong to a single GMYC cluster (hard to see in the default plot window)
Plot to file, specifying dimensions (in inches) to plot over a larger area:
Open connection
Plot to file
Close connection
Show files in working directory
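These four steps can be sketched as follows. The tree and the node probabilities are made up, the page size is an arbitrary choice, and the file is written to a temporary directory rather than the working directory so that the example leaves nothing behind.

```r
library(ape)

test.tr <- read.tree(text = "((A:1,B:1):2,((C:0.5,D:0.5):1.5,E:2):1);")

# Hypothetical probabilities, one per internal node (in node-number order)
node_prob <- c(0.00, 0.95, 0.10, 0.80)

# Assumption: write to a temporary directory; use a plain file name to write
# to the working directory instead
out_file <- file.path(tempdir(), "tree_with_probabilities.pdf")

pdf(out_file, width = 8, height = 11)       # open connection; dimensions in inches
plot(test.tr, cex = 0.8)                    # plot to file
nodelabels(format(node_prob, nsmall = 2),   # probabilities placed beside each node
           frame = "none", adj = c(1.1, -0.4), cex = 0.7)
dev.off()                                   # close connection

list.files(tempdir(), pattern = "\\.pdf$")  # show the files that were written
```

Increasing `height` in proportion to the number of tips keeps the labels legible on large trees.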
The file is found in the working directory; numbers above branches represent probabilities that all tips nested within a node belong to a single GMYC cluster
Finish session:
Show all objects in the workspace
Quit R; specifying ‘y’ to save the image will result in this workspace being restored upon next start, as long as the user first navigates to the current directory before starting R
Alternatively, “save.image('tutorial.rdata')” saves an image that can be loaded from any directory
Reload session to demonstrate sample-specific diversity estimates:
Show working directory (started here)
Show files in working directory; it contains a species-sample matrix (‘test.samples.txt’)
Reload source file to load necessary packages

Model-averaged diversity estimates for whole tree (as previously calculated, for comparison)
Model-averaged diversity estimates in each sample
For example:
‘est’: Species richness in each sample
‘var’: Variance of richness estimate, which can be propagated through further analyses
Average richness
Variance around the mean (including species boundary uncertainty)
Variance (underestimated; neglects species boundary uncertainty)
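One way to picture the per-sample estimates is sketched below in base R. It is not the supplemental script's code: `sample_richness` is a hypothetical helper, the data are made up, the sample matrix is assumed to have tips in rows and samples in columns, and the variance uses the same weighted-deviation form as Burnham and Anderson's estimator with no within-model variance.

```r
# Model-averaged richness per sample.
# memberships: tips x models matrix of entity labels
# w:           Akaike weights of the models, summing to one
# samples:     tips x samples presence/absence (0/1) matrix, same row order
sample_richness <- function(memberships, w, samples) {
  # Richness of a sample under one model = distinct entities among its tips
  rich <- sapply(seq_len(ncol(memberships)), function(m) {
    apply(samples, 2, function(present) {
      length(unique(memberships[present > 0, m]))
    })
  })
  rich <- matrix(rich, nrow = ncol(samples))   # samples in rows, models in columns
  est <- as.vector(rich %*% w)                 # weighted mean across models
  se  <- as.vector(abs(rich - est) %*% w)      # weighted deviation across models
  data.frame(sample = colnames(samples), est = est, var = se^2)
}

# Hypothetical data: four tips, two models, two samples
memberships <- cbind(m1 = c(1, 1, 2, 2), m2 = c(1, 1, 2, 3))
rownames(memberships) <- c("A", "B", "C", "D")
w <- c(0.7, 0.3)
samples <- cbind(s1 = c(1, 0, 1, 1),   # sample 1 contains A, C and D
                 s2 = c(1, 1, 0, 0))   # sample 2 contains A and B
rownames(samples) <- rownames(memberships)

richness <- sample_richness(memberships, w, samples)
richness
```

In this example the models disagree about whether C and D are one species, so sample 1 has a non-zero variance, while sample 2 has none because both models group A and B.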
For more information: jeffpowell2@gmail.com or Jeff.Powell@uws.edu.au