DATA ANALYSIS SOFTWARE: METAPY

METAPY
• https://github.com/widdowquinn/THAPBI-pycits

Method | Tool
DNA extraction / PCR | Nested PCR
DNAseq | Illumina overlapping reads
QC, trim, chimera detection | Fastqc, Trimmomatic, Vsearch
Assemble reads | Flash / PEAR
Error correction (???) | Bayes Hammer
Convert FQ, FA | Biopython
Trim primers off seq | Python
Cluster | Swarm, CD-HIT, Vsearch, Bowtie, Blastclust
Compare clustering | Python: sklearn
Graphics, summarise species | Python
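
The "Convert FQ, FA" and "Trim primers off seq" steps can be illustrated with a minimal pure-Python sketch. This is not the Metapy code (the slides list Biopython for the conversion step); the function names, the example read, and the exact-prefix primer rule are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, Tuple


def fastq_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) from 4-line FASTQ records."""
    rows = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(rows) - 3, 4):
        header, seq = rows[i], rows[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"Bad FASTQ header: {header!r}")
        yield header[1:], seq


def trim_primer(seq: str, primer: str) -> str:
    """Drop the primer if the read starts with it exactly (assumed rule)."""
    return seq[len(primer):] if primer and seq.startswith(primer) else seq


def fastq_to_fasta(lines: Iterable[str], primer: str = "") -> str:
    """Convert FASTQ text to FASTA text, trimming a leading primer."""
    return "\n".join(
        f">{read_id}\n{trim_primer(seq, primer)}"
        for read_id, seq in fastq_records(lines)
    )


# Hypothetical read and primer, for illustration only.
example = ["@read1", "ACGTTTGA", "+", "IIIIIIII"]
print(fastq_to_fasta(example, primer="ACG"))
```

A real pipeline would also need to handle mismatches in the primer and reverse-complemented primers, which this sketch ignores.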

Clustering
• Swarm: not biased by input order. Works on a d-difference method.
• CD-HIT: biased by input order. Works on % identity.
• Vsearch: biased by input order. Works on % identity.
• BLASTCLUST: unknown bias to input order. Works on % identity.
• Bowtie: PERFECT matches only!
[Diagram labels: read, database]
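
The input-order point above can be seen with a toy example. The sketch below is not how CD-HIT, Vsearch or Swarm are implemented; it assumes equal-length, unaligned reads, a simple greedy centroid rule for the % identity case, and plain linkage of sequences within d differences for the d-difference case. All names are hypothetical.

```python
def differences(a: str, b: str) -> int:
    """Count mismatched positions (assumes equal-length, unaligned reads)."""
    return sum(x != y for x, y in zip(a, b))


def greedy_identity_clusters(seqs, min_pct):
    """Greedy centroid clustering: each sequence joins the first centroid
    it matches at >= min_pct identity, otherwise it founds a new cluster."""
    clusters = []  # each cluster is a list; element 0 is its centroid
    for s in seqs:
        for c in clusters:
            matches = len(s) - differences(s, c[0])
            if 100 * matches >= min_pct * len(s):
                c.append(s)
                break
        else:
            clusters.append([s])
    return {frozenset(c) for c in clusters}


def d_linkage_clusters(seqs, d=1):
    """Link any two sequences with <= d differences and return the
    connected components, which do not depend on input order."""
    unvisited = set(seqs)
    clusters = set()
    while unvisited:
        stack = [unvisited.pop()]
        component = set(stack)
        while stack:
            cur = stack.pop()
            near = {s for s in unvisited if differences(cur, s) <= d}
            unvisited -= near
            component |= near
            stack.extend(near)
        clusters.add(frozenset(component))
    return clusters


a, b, c = "AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT"
print(len(greedy_identity_clusters([a, b, c], 90)))  # 2 clusters
print(len(greedy_identity_clusters([b, a, c], 90)))  # 1 cluster
print(len(d_linkage_clusters([a, b, c], d=1)))       # 1 cluster
print(len(d_linkage_clusters([c, a, b], d=1)))       # 1 cluster
```

With the greedy rule, which sequence becomes the centroid depends on which one is read first, so the same three reads give different clusters in a different order.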

New approach to metabarcoding analysis: exact sequence variants
Jamie Orr
OTU – Operational Taxonomic Unit

OTUs vs ZOTUs (exact sequence variants)
• OTUs define sequence-similar groups; variation could be biological or technical (PCR/sequencing).
• ZOTUs explicitly try to correct PCR and sequencing errors.

Correcting: ZOTUs
• Two sequences (A and B)
• Skew = abundance(A) / abundance(B)
• B(d) = 1 / 2^{ad + 1}
  where "d" is the number of positional differences between the two sequences and "a" is set by the user
• If skew is less than B(d), then A is assigned to B
Method | Tool
DNA extraction / PCR | Nested PCR
DNAseq | Illumina overlapping reads
QC, trim, chimera detection | Fastqc, Trimmomatic, Vsearch
Assemble reads | Flash / PEAR
Convert FQ, FA | Biopython
Trim primers off seq | Python
Cluster | Swarm, CD-HIT, Vsearch, Bowtie, Blastclust, DADA2 (ZOTU)
Compare clustering | Python: sklearn
Graphics, summarise species | Python

Metapy checks database is OK:
INFO: QC passed on sequences:
assembled_skew: normal skewtest assemb_lens = 0.718 pvalue = 0.4731
database_skew: normal skewtest db_lens = -2.703 pvalue = 0.0069
Mann_whitney U test: 0.000104940190514
INFO: db_mean= 196.958 db_stdev= 18.808 assem_mean = 189.681 , assem_stdev = 21.155

Metapy checks database is OK to use:
FAILED – Used in previous publication
The assembled size of your reads is significantly different to your database. You need to adjust your DB sequences to that of the region you sequenced.
assembled_skew: normal skewtest assemb_lens = 0.718 pvalue = 0.4731
database_skew: normal skewtest db_lens = -8.199 pvalue = 0.0000
Mann_whitney U test: 1.3189757498e-85
db_mean= 711.194 db_stdev= 218.250 assem_mean = 189.681 , assem_stdev = 21.155
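
The kind of length check reported in the logs above can be approximated with the standard library alone. This is a sketch, not the Metapy implementation: it covers only the summary statistics and a Mann-Whitney U test (normal approximation, tie-corrected), leaves out the skew tests, and uses a hypothetical `p_threshold` parameter and made-up lengths.

```python
import math
from statistics import mean, stdev


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value via the normal approximation
    (tie-corrected, no continuity correction)."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2, n = len(x), len(y), len(x) + len(y)
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def check_database(assembled_lens, db_lens, p_threshold):
    """Compare assembled read lengths with database entry lengths."""
    p = mann_whitney_p(assembled_lens, db_lens)
    return {
        "assem_mean": mean(assembled_lens), "assem_stdev": stdev(assembled_lens),
        "db_mean": mean(db_lens), "db_stdev": stdev(db_lens),
        "mann_whitney_p": p,
        "passed": p >= p_threshold,  # assumed pass rule
    }


# Made-up lengths: a database trimmed to the amplicon vs. a full-length one.
assembled = [180, 185, 190, 195, 200, 188, 192]
db_trimmed = [182, 187, 191, 196, 199, 189, 193]
db_full = [650, 700, 720, 710, 690, 800, 750]
print(check_database(assembled, db_trimmed, p_threshold=0.01)["passed"])
print(check_database(assembled, db_full, p_threshold=0.01)["passed"])
```

The cut-off Metapy applies is not stated in the slides, so `p_threshold` is left as an explicit argument rather than given a default.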

Database matters!!!
• If you are going to pick species based on a database, its entries matter!
• Reference database quality critically determines classification accuracy!
• We compared 5 Phytophthora databases.
• 2 of them were used for publications.

Database matters!!!
Phytophthora_db_v0.001
• Tracked on Github
• Can be automatically updated and generated by scripts.

Compare databases
Out of a known 10-species "spiked" sample (DNAmix)

Columns 3 to 7 are the databases.
Tool | Category | 235_FULL length_error_removed | 235_trimmed_to_ITS1 | Santi_modified | David's pre-database trimmed to ITS1 | Phytophthora DB version 0.01
Result found in all tools | true positives | 0 | 4 | 5 | 2 | 4
Result found in all tools | mis-cluster | 0 | 3 | 1 | 3 | 1
Blastclust | true positives | 3 | 7 | 9 | 8 | 9
Blastclust | mis-cluster | 23 | 34 | 37 | 29 | 21
Bowtie | true positives | 4 | 4 | 5 | 3 | 4
Bowtie | mis-cluster | 5 | 3 | 1 | 3 | 1
cdhit | true positives | 7 | 7 | 8 | 7 | 8
cdhit | mis-cluster | 39 | 23 | 17 | 26 | 19
Swarm | true positives | 0 | 7 | 8 | 7 | 8
Swarm | mis-cluster | 0 | 11 | 8 | 20 | 19
Vsearch fastclust | true positives | 4 | 7 | 8 | 4 | 7
Vsearch fastclust | mis-cluster | 26 | 15 | 12 | 11 | 8
Vsearch | true positives | 6 | 6 | 8 | 4 | 7
Vsearch | mis-cluster | 7 | 7 | 7 | 8 | 5
DADA2 | true positives | 0 | 3 | 3 | 2 | 3
DADA2 | mis-cluster | 0 | 2 | 1 | 2 | 0
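
Counts like those in the table can be derived by comparing the species a tool reports with the species known to be in the spiked sample. The sketch below assumes that a "true positive" is a reported species that is in the spike and a "mis-cluster" is a reported species that is not; the slides do not define the terms, and the function name and species lists are illustrative, not the DNAmix contents.

```python
def score_calls(expected_species, called_species):
    """Count true positives and mis-clusters for one tool/database run.

    Assumed definitions: a true positive is a called species present in
    the spiked sample; a mis-cluster is a called species that is absent.
    """
    expected = set(expected_species)
    called = set(called_species)
    return {
        "true_positives": len(called & expected),
        "mis_cluster": len(called - expected),
        "missed": len(expected - called),
    }


# Illustrative species lists only.
spiked = ["Phytophthora infestans", "Phytophthora ramorum", "Phytophthora cactorum"]
reported = ["Phytophthora infestans", "Phytophthora ramorum", "Phytophthora cinnamomi"]
print(score_calls(spiked, reported))
```

Reporting the missed species alongside the other two counts makes it easier to see when a low mis-cluster count comes at the cost of sensitivity.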

Compare databases
Message 1:
• Database length matters.
• Including the non-ITS1 region has a negative impact (obvious, but such databases have been used in publications!)

Compare databases
Message 2:
• Bowtie and DADA2 reduced mis-cluster rate (and true positive rate)

Compare databases
Message 2:
• Bowtie and DADA2 reduced false positive rate
Message 3:
• Blastclust is the worst. We knew that already!!
• Blastclust does not produce reliable identifications with these ITS1 databases.
• Blastclust is also deprecated: do not use!
Compare databases
Message 4:
• These results are helping us refine the DB. Mis-cluster rate is now reducing.

Other software made for this project
Quantify gene copy number:
• Software estimates copy number of a given gene of interest.
• ITS(theoretical) = ∑ITS_hits ⋅ (x̅ ITS_coverage(assembled) / x̅ gene_coverage)
https://github.com/widdowquinn/THAPBI/tree/master/Phyt_ITS_identifying_pipeline
Sanger sequencing identification:
• No need for a "pointy and clicky" sequence editor, then web BLAST
• Does it all for you! Sanger read ----> Species
https://github.com/peterthorpe5/public_scripts/tree/master/Sanger_read_metagenetics
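
The copy-number estimate given on this slide, ITS(theoretical) = ∑ITS_hits ⋅ (mean ITS coverage / mean gene coverage), can be written as a short function. This is a sketch of the formula only, not the pipeline linked above; the function name, argument layout and example numbers are assumptions.

```python
from statistics import mean


def theoretical_its_copies(its_hits, its_coverages, gene_coverages):
    """ITS(theoretical) = sum(ITS_hits) * mean(ITS coverage) / mean(gene coverage).

    its_hits:       hit counts for ITS regions found in the assembly
    its_coverages:  read coverage over the assembled ITS regions
    gene_coverages: read coverage over reference (assumed single-copy) genes
    """
    gene_mean = mean(gene_coverages)
    if gene_mean == 0:
        raise ValueError("gene coverage must be non-zero")
    return sum(its_hits) * (mean(its_coverages) / gene_mean)


# Made-up numbers: 3 assembled ITS hits at ~300x, reference genes at ~30x.
print(theoretical_its_copies([1, 1, 1], [290, 300, 310], [28, 30, 32]))  # 30.0
```

The ratio of coverages scales the number of assembled hits up to account for repeat copies that collapse into a single assembled region.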

Future directions
“Pipeline” needs to be verified with controls.
• Sequencing controls: known spikes, “fake” sequences to obtain error rates, identification limitations
• TODO: write Bayesian-based clustering / probabilistic model

Thanks!
Plant health testing and natural ecosystem surveillance
via In situ water sampling and metabarcoding of
Phytophthora diversity
THE TEAM!
David Cooke
Leighton Pritchard
Eva Randall & Beatrix Clark
