What are you measuring?:
(lessons from synthetic control sequences)
Leighton Pritchard
Phytothreats
BBSRC/RESAS-funded collaborationPete Thorpe
David Cooke
Peter Cock
Sarah Green
 Phytophthora spp. cause disease
on crops, ornamentals
 Many Phytophthoras invasive,
and/or notorious e.g. P. ramorum, P.
austrocedri, P. infestans
 Track and trace Phytophthora
spp. through plant nurseries
 Accreditation scheme
 Advice on “clean practice”
 Stakes for business if quarantine
organisms discovered
ITS1 Identification
 ITS1 used for Phytophthora phylogenetics
 “One species, one ITS1”
 Some species separated by a single SNP
Metabarcoding for plant pathogens
 Three “pinch points”:
 Are (non-model pathogen) barcode databases comprehensive?
 riboSeed: mining 16S from Illumina reads: https://nickp60.github.io/riboSeed/
 Are standard barcodes precise enough (for non-model pathogens)?
 Redesign of non-16S markers for Pectobacterium atrosepticum in Scotland
 How do barcodes perform?
 How individualizing are oomycete ITS1 sequences in metabarcoding?
Synthetic “ITS1” control sequences
 Four artificial, unique sequences
 Same base composition as Phytophthora ITS1
 random sequence
 “Crosstalk”/contamination assumed if:
 Synthetics found in sample channels
 Phytophthora sequence found in synthetic channels
 Estimate sequencing error, and potential for (cross-) contamination.
Synthetic controls
Many extra sequences seen!
 Synthetic mixture – four controls only:
 Target read counts: ≈1e6, ≈1e4, ≈1e3, ≈1e2
 Six combinations of synthetics (GC1-3, GL4-6)
 Each mixture at three dilutions (sensitivity):
 1x, 10x, 100x -> 18 controls in total
 Long tail of amplicons that not expected in
the original mixture
 How many (different) sequence variants are
there?
GC1
GC2
GL6
GL5
GC3
GL4
1x 10x 100x
log10 abundance
PCR amplification introduces variation
 Most abundant control has most variants
 Sequence variants (`000s) can be more abundant than sample sequences (4)!
 Variants differ from control sequence by up to ≈5-8% (≈10-15bp)
 Most abundant control equally abundant at all dilutions
 Other controls become undetectable with dilution: NOT QUANTITATIVE
 “unmatched” non-control sequences present up to ≈1e2 abundance
 The most abundant sequence may not be in the sample!
log10 abundance
1x 10x 100x
Synthetic controls: DADA2
 Most variation is excluded, but…
 ITS1 sequences truncated
 PCR variants still come through
 PCR variants still more abundant than
controls
 Still not quantitative
Cross-contamination
Estimate from synthetic controls
 BLASTN+ non-control sequences vs NCBI nt
 Identifiable oomycete sequences at up to
≈25 abundance
 Possible cross-contamination filtering
threshold?
 ≈50% of variants are unique to each lane
 Variation introduced in PCR amplification,
not in original sample?
non-oomycete
contaminant best hit abundance
Cedecea neteri 64
Lactifluus acrissimus 35
Mycobacterium sp. djl-10 39
Paracoccus zhejiangensis 114
Raphanus sativus 37
Roseomonas gilardii 67
uncultured fungus 36
oomycete
abundance
besthitE-value
“One species, one ITS1”?
 Genomics: “One isolate, ≈40-170 ITS1”
 (and they can be quite diverse, as in P. infestans…)
Single-isolate controls
 One Phytophthora isolate per well/lane/barcode
 Also synthetic controls and environmental samples
 Some controls amplified with different PCR cycle numbers
 Estimate within-isolate variation
 How many OTUs per individual Phytophthora isolate?
 Do similar OTUs make individualization difficult?
Unique amplicons
Many unique ITS1s in all samples
Distinct amplicons by lane/barcode
Single isolates: up to ≈2000
Environmental isolates: up to ≈3000
Synthetic controls: up to ≈2000
 Identify ITS1 matches:
 Profile HMM of known Phytophthora
ITS1 sequences
 Contamination abundance up to ≈1e3!
log 10 abundance
profilebitscore
OTUs by isolate
Several OTUs (swarm) for each isolate
 Are OTUs good representations of
diversity?
Swarms/OTUs by lane/barcode
Single isolates: up to ≈150
Environmental isolates: up to ≈450
Synthetic controls: up to ≈20
Most Phytophthora isolates: ≈40-60
Remove contamination
Reduces OTUs per isolate
 Remove all amplicons <25 abundance
 Use synthetic controls on every plate to
threshold?
Swarms/OTUs by lane/barcode
Single isolates: up to ≈6
Environmental isolates: up to ≈40
Synthetic controls: up to ≈4
Most Phytophthora isolates: ≈1-2
Are the reduced OTUs individualizing?
 Jaccard distance between
swarm/OTU members
 Several OTUs unique to Phytophthora
species
 Some OTUs common to Phytophthora
species
 ITS1 can be individualising:
1. Threshold for contamination/low
abundance sequence variants
2. Filter against ITS1 HMM profile
What is a classifier?
Classifier
P. infestans: 8932
P. austrocedri: 755
P. alni: 132
What’s in a classifier?
P. infestans: 8932
P. austrocedri: 755
P. alni: 132
Database of sequences and classes
Method for associating input
sequences with classes
{Performance metrics: accuracy, FPR, precision, etc.}
How to get performance metrics
 Apply the classifier to known positive and negative examples
 Determine when the classifier is correct and incorrect (TP, TN, FP, FN)
 Calculate metrics
We need:
• Known inputs
(real data from known
sources)
• Multiple classifiers
(so we can
compare/choose the best)
Automating the process
ITS1
“core db”
NCBI
legacy
control
isolates
classification
algorithms
BLAST HMM swarm random
forest
Requires a programmatic framework
Compare classifiers (db + algorithm)
on same test data
Where are we?
 Building reference database framework
 Data sources:
 legacy (PT’s old data); NCBI ITS1; control isolate
 Trust framework
 Automated (e.g. internal consistency); manual (evidence
base)
 Rules-based subsetting
 trusted sequences; data source; single-clade; etc.
Where next?
 Create training data (artificial communities)
 combinations of reads from single-isolate controls
 Apply and evaluate current approaches
 sequence identity (BLAST); clustering (swarm)
 use filtering information from synthetic controls
 test against artificial and real mock communities
 Extend systematically to new algorithms/dbs
 Apply validated classifiers to old/new sample
Summary
 synthetic controls: cross-
contamination/methodological artefacts
 single-isolate controls: individualization/potential
for confusion
 classifier framework: systematic identification of
optimal database/algorithm combination(s)
Beatrix Clark
Pete Hedley
Ewan Mollison
Eva Randall
Paul Sharp
Acknowledgements

Leighton Pritchard wp1 Nov 2018

  • 1.
    What are youmeasuring?: (lessons from synthetic control sequences) Leighton Pritchard
  • 2.
    Phytothreats BBSRC/RESAS-funded collaborationPete Thorpe DavidCooke Peter Cock Sarah Green  Phytophthora spp. cause disease on crops, ornamentals  Many Phytophthoras invasive, and/or notorious e.g. P. ramorum, P. austrocedri, P. infestans  Track and trace Phytophthora spp. through plant nurseries  Accreditation scheme  Advice on “clean practice”  Stakes for business if quarantine organisms discovered
  • 3.
    ITS1 Identification  ITS1used for Phytophthora phylogenetics  “One species, one ITS1”  Some species separated by a single SNP
  • 4.
    Metabarcoding for plantpathogens  Three “pinch points”:  Are (non-model pathogen) barcode databases comprehensive?  riboSeed: mining 16S from Illumina reads: https://nickp60.github.io/riboSeed/  Are standard barcodes precise enough (for non-model pathogens)?  Redesign of non-16S markers for Pectobacterium atrosepticum in Scotland  How do barcodes perform?  How individualizing are oomycete ITS1 sequences in metabarcoding?
  • 5.
    Synthetic “ITS1” controlsequences  Four artificial, unique sequences  Same base composition as Phytophthora ITS1  random sequence  “Crosstalk”/contamination assumed if:  Synthetics found in sample channels  Phytophthora sequence found in synthetic channels  Estimate sequencing error, and potential for (cross-) contamination.
  • 6.
    Synthetic controls Many extrasequences seen!  Synthetic mixture – four controls only:  Target read counts: ≈1e6, ≈1e4, ≈1e3, ≈1e2  Six combinations of synthetics (GC1-3, GL4-6)  Each mixture at three dilutions (sensitivity):  1x, 10x, 100x -> 18 controls in total  Long tail of amplicons that not expected in the original mixture  How many (different) sequence variants are there? GC1 GC2 GL6 GL5 GC3 GL4 1x 10x 100x log10 abundance
  • 7.
    PCR amplification introducesvariation  Most abundant control has most variants  Sequence variants (`000s) can be more abundant than sample sequences (4)!  Variants differ from control sequence by up to ≈5-8% (≈10-15bp)  Most abundant control equally abundant at all dilutions  Other controls become undetectable with dilution: NOT QUANTITATIVE  “unmatched” non-control sequences present up to ≈1e2 abundance  The most abundant sequence may not be in the sample! log10 abundance 1x 10x 100x
  • 8.
    Synthetic controls: DADA2 Most variation is excluded, but…  ITS1 sequences truncated  PCR variants still come through  PCR variants still more abundant than controls  Still not quantitative
  • 9.
    Cross-contamination Estimate from syntheticcontrols  BLASTN+ non-control sequences vs NCBI nt  Identifiable oomycete sequences at up to ≈25 abundance  Possible cross-contamination filtering threshold?  ≈50% of variants are unique to each lane  Variation introduced in PCR amplification, not in original sample? non-oomycete contaminant best hit abundance Cedecea neteri 64 Lactifluus acrissimus 35 Mycobacterium sp. djl-10 39 Paracoccus zhejiangensis 114 Raphanus sativus 37 Roseomonas gilardii 67 uncultured fungus 36 oomycete abundance besthitE-value
  • 10.
    “One species, oneITS1”?  Genomics: “One isolate, ≈40-170 ITS1”  (and they can be quite diverse, as in P. infestans…)
  • 11.
    Single-isolate controls  OnePhytophthora isolate per well/lane/barcode  Also synthetic controls and environmental samples  Some controls amplified with different PCR cycle numbers  Estimate within-isolate variation  How many OTUs per individual Phytophthora isolate?  Do similar OTUs make individualization difficult?
  • 12.
    Unique amplicons Many uniqueITS1s in all samples Distinct amplicons by lane/barcode Single isolates: up to ≈2000 Environmental isolates: up to ≈3000 Synthetic controls: up to ≈2000  Identify ITS1 matches:  Profile HMM of known Phytophthora ITS1 sequences  Contamination abundance up to ≈1e3! log 10 abundance profilebitscore
  • 13.
    OTUs by isolate SeveralOTUs (swarm) for each isolate  Are OTUs good representations of diversity? Swarms/OTUs by lane/barcode Single isolates: up to ≈150 Environmental isolates: up to ≈450 Synthetic controls: up to ≈20 Most Phytophthora isolates: ≈40-60
  • 14.
    Remove contamination Reduces OTUsper isolate  Remove all amplicons <25 abundance  Use synthetic controls on every plate to threshold? Swarms/OTUs by lane/barcode Single isolates: up to ≈6 Environmental isolates: up to ≈40 Synthetic controls: up to ≈4 Most Phytophthora isolates: ≈1-2
  • 15.
    Are the reducedOTUs individualizing?  Jaccard distance between swarm/OTU members  Several OTUs unique to Phytophthora species  Some OTUs common to Phytophthora species  ITS1 can be individualising: 1. Threshold for contamination/low abundance sequence variants 2. Filter against ITS1 HMM profile
  • 16.
    What is aclassifier? Classifier P. infestans: 8932 P. austrocedri: 755 P. alni: 132
  • 17.
    What’s in aclassifier? P. infestans: 8932 P. austrocedri: 755 P. alni: 132 Database of sequences and classes Method for associating input sequences with classes {Performance metrics: accuracy, FPR, precision, etc.}
  • 18.
    How to getperformance metrics  Apply the classifier to known positive and negative examples  Determine when the classifier is correct and incorrect (TP, TN, FP, FN)  Calculate metrics We need: • Known inputs (real data from known sources) • Multiple classifiers (so we can compare/choose the best)
  • 19.
    Automating the process ITS1 “coredb” NCBI legacy control isolates classification algorithms BLAST HMM swarm random forest Requires a programmatic framework Compare classifiers (db + algorithm) on same test data
  • 20.
    Where are we? Building reference database framework  Data sources:  legacy (PT’s old data); NCBI ITS1; control isolate  Trust framework  Automated (e.g. internal consistency); manual (evidence base)  Rules-based subsetting  trusted sequences; data source; single-clade; etc.
  • 21.
    Where next?  Createtraining data (artificial communities)  combinations of reads from single-isolate controls  Apply and evaluate current approaches  sequence identity (BLAST); clustering (swarm)  use filtering information from synthetic controls  test against artificial and real mock communities  Extend systematically to new algorithms/dbs  Apply validated classifiers to old/new sample
  • 22.
    Summary  synthetic controls:cross- contamination/methodological artefacts  single-isolate controls: individualization/potential for confusion  classifier framework: systematic identification of optimal database/algorithm combination(s)
  • 23.
    Beatrix Clark Pete Hedley EwanMollison Eva Randall Paul Sharp Acknowledgements