4. Clustering
• Swarm – Not biased to input order. Works on a d-difference method
• CD-HIT – Biased to input order. Works on % identity
• Vsearch – Biased to input order. Works on % identity
• BLASTCLUST – Unknown bias to input order. Works on % identity
• Bowtie – PERFECT matches only!
[Figure labels: Read, database]
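To make the difference between a "d difference" criterion and a "% identity" criterion concrete, here is a minimal Python sketch. It is an illustration only, not how any of the tools above are implemented: `edit_distance`, `within_d`, `percent_identity` and `within_identity` are hypothetical helper names, and the identity formula (based on edit distance over the longer sequence) is an assumption, since real tools derive identity from alignments using their own definitions.

```python
def edit_distance(s, t):
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # deletion from s
                current[j - 1] + 1,            # insertion into s
                previous[j - 1] + (cs != ct),  # match or substitution
            ))
        previous = current
    return previous[-1]


def within_d(s, t, d=1):
    """d-difference criterion: link two sequences if they differ by at most d edits."""
    return edit_distance(s, t) <= d


def percent_identity(s, t):
    """Assumed identity definition: 100 * (L - edits) / L, with L the longer length."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - edit_distance(s, t)) / longest


def within_identity(s, t, threshold=97.0):
    """% identity criterion: link two sequences if identity reaches the threshold."""
    return percent_identity(s, t) >= threshold
```

With these definitions, two 200-base sequences that differ at three positions are 98.5% identical, so they would be linked at a 97% threshold but not at d = 1. A fixed percentage therefore allows more differences as sequences get longer, whereas d stays constant.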
5. New approach to Metabarcoding analysis – Exact sequence variants
Jamie Orr
OTU – Operational Taxonomic Unit
6. OTUs vs ZOTUs (Exact sequence variants)
• OTUs define sequence-similar groups; variation could be biological, or technical (PCR/sequencing).
• ZOTUs explicitly try to correct PCR and sequencing errors.
7. Correcting - ZOTUs
• Two sequences (A and B)
• Skew = abundanceA/abundanceB
• $B(d) = 1 / 2^{a d + 1}$
Where “d” is the number of positional differences between two sequences
“a” is set by the user
• If skew is less than B(d) then A is assigned to B
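The rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of any particular denoiser: the function names are hypothetical, "positional differences" is taken to mean mismatches between two equal-length sequences, and the default a = 2.0 is only a placeholder for the user-set parameter.

```python
def abundance_threshold(d, a=2.0):
    """B(d) = 1 / 2**(a*d + 1): the maximum skew at which A is treated as an error of B."""
    return 1.0 / 2 ** (a * d + 1)


def positional_differences(seq_a, seq_b):
    """Count mismatching positions (assumes equal-length sequences)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length in this sketch")
    return sum(x != y for x, y in zip(seq_a, seq_b))


def assign_a_to_b(seq_a, abundance_a, seq_b, abundance_b, a=2.0):
    """Return True if A should be merged into B under the skew rule."""
    d = positional_differences(seq_a, seq_b)
    skew = abundance_a / abundance_b
    return skew < abundance_threshold(d, a)
```

For example, with a = 2 and one difference, B(1) = 1/8, so a sequence seen 5 times next to a neighbour seen 100 times (skew 0.05) is assigned to the neighbour, while one seen 50 times (skew 0.5) is kept as its own variant. With two differences the threshold drops to 1/32, so the same skew of 0.05 no longer triggers assignment.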
8. Method – Tool
DNA extraction/ PCR – Nested PCR
DNAseq – Illumina overlapping read
QC, Trim, Chimera detection – Fastqc, Trimmomatic, Vsearch
Assemble reads – Flash / PEAR
Convert FQ, FA – Biopython
Trim primers off seq – Python
Cluster – Swarm, CD-HIT, Vsearch, Bowtie, Blastclust
Compare clustering – Python: sklearn
Graphics
Summarise species – Python
DADA2 (ZOTU)
9. Metapy checks database is OK :
INFO: QC passed on sequences:
assembled_skew: normal skewtest assemb_lens = 0.718 pvalue = 0.4731
database_skew: normal skewtest db_lens = -2.703 pvalue = 0.0069
Mann_whitney U test: 0.000104940190514
INFO: db_mean= 196.958 db_stdev= 18.808 assem_mean = 189.681 , assem_stdev = 21.155
10. Metapy checks database is OK to use:
FAILED – Used in previous publication
The assembled size of your reads is significantly different to your database. You need to adjust your DB sequences to that of the region you sequenced.
assembled_skew: normal skewtest assemb_lens = 0.718 pvalue = 0.4731
database_skew: normal skewtest db_lens = -8.199 pvalue = 0.0000
Mann_whitney U test: 1.3189757498e-85
INFO: db_mean= 711.194 db_stdev= 218.250 assem_mean = 189.681 , assem_stdev = 21.155
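The kind of length check reported above can be sketched as follows. This is not Metapy's code: it is a minimal standard-library illustration of a two-sided Mann-Whitney U test (normal approximation, mid-ranks for ties, no continuity correction) applied to read lengths versus database lengths. The function names and the `alpha` cut-off are hypothetical; the slides do not state which threshold the tool applies.

```python
import math


def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U, p_value). Tied values receive mid-ranks and the variance
    carries the usual tie correction. Suited to reasonably large samples.
    """
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        group = j - i
        midrank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += group ** 3 - group
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2.0
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return u, 1.0
    z = (u - mu) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))


def lengths_compatible(read_lengths, db_lengths, alpha=0.05):
    """Return (ok, p): ok is False when the two length distributions differ.

    alpha is an assumed cut-off for this sketch only.
    """
    _, p = mann_whitney_u(read_lengths, db_lengths)
    return p >= alpha, p
```

A database of full-length entries (several hundred bases) compared against assembled reads of roughly 190 bases gives a vanishingly small p-value with this sketch, which matches the situation the FAILED message describes: the database should be trimmed to the sequenced region.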
11. Database matters!!!
• If you are going to pick species based on a database, these entries matter!
• Reference database quality critically determines classification accuracy!
• Compare 5 Phytophthora databases.
• 2 used for publications
12. Database matters!!!
Phytophthora_db_v0.001
• Tracked on Github
• Can be automatically updated and generated by scripts.
13. Compare databases
Out of a known 10 species "spiked" sample - DNAmix

| Tool | Category | 235_FULL length_error_removed | 235_trimmed_to_ITS1 | Santi_modified | David’s pre-database trimmed to ITS1 | Phytophthora DB version 0.01 |
|---|---|---|---|---|---|---|
| Result found in all tools | true positives | 0 | 4 | 5 | 2 | 4 |
| Result found in all tools | mis-cluster | 0 | 3 | 1 | 3 | 1 |
| Blastclust | true positives | 3 | 7 | 9 | 8 | 9 |
| Blastclust | mis-cluster | 23 | 34 | 37 | 29 | 21 |
| Bowtie | true positives | 4 | 4 | 5 | 3 | 4 |
| Bowtie | mis-cluster | 5 | 3 | 1 | 3 | 1 |
| cdhit | true positives | 7 | 7 | 8 | 7 | 8 |
| cdhit | mis-cluster | 39 | 23 | 17 | 26 | 19 |
| Swarm | true positives | 0 | 7 | 8 | 7 | 8 |
| Swarm | mis-cluster | 0 | 11 | 8 | 20 | 19 |
| Vsearch fastclust | true positives | 4 | 7 | 8 | 4 | 7 |
| Vsearch fastclust | mis-cluster | 26 | 15 | 12 | 11 | 8 |
| Vsearch | true positives | 6 | 6 | 8 | 4 | 7 |
| Vsearch | mis-cluster | 7 | 7 | 7 | 8 | 5 |
| DADA2 | true positives | 0 | 3 | 3 | 2 | 3 |
| DADA2 | mis-cluster | 0 | 2 | 1 | 2 | 0 |
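Counts like those in the table can be obtained by comparing the species a tool reports with the species known to be in the spiked mix. The sketch below is one plausible way to do that and is an assumption, not the scoring script used for these slides: it treats any reported species that is not in the spike as a mis-cluster, and the species labels are placeholders rather than the real contents of DNAmix.

```python
def score_against_spike(found_species, spiked_species):
    """Compare reported species with the known spiked-in species.

    Returns counts of true positives (reported and spiked), mis-clusters
    (reported but not spiked) and missed species (spiked but not reported).
    """
    found = set(found_species)
    spiked = set(spiked_species)
    return {
        "true_positives": len(found & spiked),
        "mis_cluster": len(found - spiked),
        "missed": len(spiked - found),
    }


# Placeholder labels standing in for the 10 spiked species.
spiked = [f"species_{i:02d}" for i in range(1, 11)]
# A hypothetical tool result: 7 spiked species plus 3 unexpected ones.
reported = spiked[:7] + ["unexpected_a", "unexpected_b", "unexpected_c"]
result = score_against_spike(reported, spiked)
```

Running the same scoring for every tool and database combination yields one "true positives" and one "mis-cluster" number per cell, which is the layout of the table above.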
14. Compare databases
Out of a known 10 species "spiked" sample - DNAmix (table as on slide 13)
Message 1:
• Database length matters.
• Including non-ITS1 region has negative impact (obvious, but used in publications!)
15. Compare databases
Out of a known 10 species "spiked" sample - DNAmix (table as on slide 13)
Message 2:
• Bowtie and DADA2 reduced mis-cluster rate (and true positive rate)
16. Compare databases
Out of a known 10 species "spiked" sample - DNAmix (table as on slide 13)
Message 2:
• Bowtie and DADA2 reduced false positive rate
Message 3:
• Blastclust is the worst. We knew that already!!
• Blastclust does not produce reliable identifications with these ITS1 databases.
• Blastclust also deprecated – do not use!
17. Compare databases
Out of a known 10 species "spiked" sample - DNAmix (table as on slide 13)
Message 2:
• Bowtie and DADA2 reduced false positive rate
Message 3:
• Blastclust is the worst.
Message 4:
• These results are helping us refine the DB. Mis-cluster rate is now reducing
18. Other software made for this project
Quantify gene copy number:
• Software estimates copy number of a given gene of interest.
• ITS(theoretical) = ∑ITS_hits ⋅ (x̅ ITS_coverage(assembled) / x̅ gene_coverage)
https://github.com/widdowquinn/THAPBI/tree/master/Phyt_ITS_identifying_pipeline
Sanger sequencing identification:
• No need for “pointy and clicky” sequencing editor, then web BLAST
• Does it all for you! Sanger read ----> Species
https://github.com/peterthorpe5/public_scripts/tree/master/Sanger_read_metagenetics
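As a worked illustration of the copy-number formula on this slide, here is a small Python sketch. It is not taken from the linked repository: the function name and arguments are hypothetical, and it assumes ∑ITS_hits is the total number of ITS hits in the assembly and that the two x̅ terms are plain arithmetic means of per-base or per-region coverage values.

```python
from statistics import mean


def theoretical_its_copies(total_its_hits, its_coverages, gene_coverages):
    """ITS(theoretical) = sum(ITS hits) * mean(ITS coverage) / mean(gene coverage).

    If the assembled ITS copies are covered far more deeply than typical
    genes, the assembly has probably collapsed several copies into one.
    """
    gene_mean = mean(gene_coverages)
    if gene_mean == 0:
        raise ValueError("mean gene coverage must be non-zero")
    return total_its_hits * mean(its_coverages) / gene_mean


# Hypothetical numbers: 2 assembled ITS hits at ~300x, typical genes at ~30x.
estimate = theoretical_its_copies(2, [290, 300, 310], [28, 30, 32])
```

With these made-up inputs the estimate is 20 copies: each of the 2 assembled hits is covered about ten times more deeply than an average gene.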
19. Future directions
“Pipeline” needs to be verified with controls.
Sequencing controls: known spikes, “fake” sequences to obtain error rates, identification limitations
TODO: Write Bayesian based clustering/ probabilistic model
20. Thanks!
Plant health testing and natural ecosystem surveillance via in situ water sampling and metabarcoding of Phytophthora diversity
THE TEAM!
David Cooke
Leighton Pritchard
Eva Randall & Beatrix Clark
Editor's Notes
First why whither and what does it mean? “What is the likely future of”
To remind us that language also changes and evolves as do species concepts
Terminology – metabarcoding better term