Phytothreats WP4
13th November 2019
Predicting risk via analysis of Phytophthora genome evolution
Ewan Mollison, Paul Sharp – University of Edinburgh
Sarah Green – Forest Research
Leighton Pritchard, David Cooke – James Hutton Institute
Phytophthora species
• Oomycete pathogens that can cause serious plant disease
• Over 170 species identified to date and more being discovered all the time
• Impact can vary greatly between species and is currently impossible to predict: can genomes help?
P. infestans in potato
P. austrocedri killing juniper in the Lake District
P. ramorum in larch
Rationale
• Sequence three Phytophthora species thought to be less damaging than close relatives
• Compare genes from sequenced Phytophthora genomes
  • Including those sequenced for this study, genomes are available for 30 species from 10 phylogenetic clades
• Can we identify sources of variation between highly damaging and less damaging species that may improve understanding of the genes / gene families involved in Phytophthora virulence?
Target species
• P. europaea
  Associated with European oak
  Clade 7 species closely related to P. alni
• P. foliorum
  Causes leaf blight in azaleas
  Clade 8 species closely related to P. ramorum
• P. obscura
  Associated with horse chestnut and pieris
  Clade 8 species closely related to P. austrocedri
Contig assembly
• PacBio sequencing was done at Exeter Sequencing Service
• 2 SMRT cells / species to ensure high genome coverage
• Assembled using Canu with 100 Mbp genome estimate
• High N50 and low number of contigs show a high degree of contiguity in all three assemblies
                          P. europaea   P. foliorum   P. obscura
No. contigs               112           127           103
Contig N50 (Mbp)          2.8           3.0           2.4
Max contig length (Mbp)   9.6           6.8           5.6
Mean contig length (Mbp)  0.7           0.5           0.6
Total length (Mbp)        76.5          60.4          61.8
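For readers less familiar with the N50 statistic: it is the length L such that contigs of length L or longer together contain at least half of the assembled bases. A minimal sketch of the calculation follows; the function name and the toy lengths are illustrative assumptions, not values taken from these assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half the assembly
        if running >= half_total:
            return length
    raise ValueError("no lengths supplied")


# Toy example (hypothetical lengths in Mbp, not the real contigs)
toy_lengths = [9, 6, 4, 3, 2, 1]
print(n50(toy_lengths))
```

Because the statistic depends only on the sorted lengths, the same function applies to contigs before scaffolding and to scaffolds afterwards.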
Scaffolding, gap-filling and polishing
• Scaffolded contigs with SSPACE-Longread
• Gaps filled with GapFinisher
• Re-scaffold gap-filled contigs, then gap-fill again: iterate until no change (3 – 5 rounds)
• Run scaffolds and reads through Arrow for 3 rounds of polishing
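The "iterate until no change" step is a fixed-point loop. The sketch below shows only that control flow, under stated assumptions: `scaffold` and `gap_fill` are hypothetical callables standing in for wrappers around the real tools, and the toy assembly is just a tuple of piece lengths. It does not reproduce how the tools were actually invoked.

```python
def iterate_until_stable(assembly, scaffold, gap_fill, max_rounds=10):
    """Alternate scaffolding and gap-filling until the assembly stops changing.

    `scaffold` and `gap_fill` are hypothetical callables; each takes an
    assembly and returns a (possibly updated) assembly.
    Returns the final assembly and the number of rounds that changed it.
    """
    for completed in range(max_rounds):
        updated = gap_fill(scaffold(assembly))
        if updated == assembly:
            return assembly, completed
        assembly = updated
    return assembly, max_rounds


def toy_scaffold(assembly):
    """Toy stand-in: join the two shortest pieces while more than three remain."""
    if len(assembly) <= 3:
        return assembly
    ordered = sorted(assembly)
    return tuple(ordered[2:] + [ordered[0] + ordered[1]])


def toy_gap_fill(assembly):
    """Toy stand-in: leaves the assembly unchanged."""
    return assembly


final, rounds = iterate_until_stable((1, 2, 3, 4, 5), toy_scaffold, toy_gap_fill)
print(final, rounds)
```

The `max_rounds` cap is a safety net for the sketch; the slide reports convergence after 3 – 5 rounds.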
After scaffolding, gap-filling and polishing
                     P. europaea   P. foliorum   P. obscura
Scaffolds            39            71            29
N50 (Mbp)            11.0          7.1           6.4
Max scaffold (Mbp)   15.7          12.4          12.7
Mean scaffold (Mbp)  2.0           0.9           2.1
Total length (Mbp)   76.9          61.0          62.2
Inserted Ns          44,056        114,405       34,148
% GC content         53.62         53.20         51.88
• Highly contiguous
  • Low number of scaffolds, high N50
• BUT: what’s going on with P. foliorum?
  • 71 scaffolds – others have 29 and 39
  • Contamination?
Checking for contamination
• Check for scaffolds arising from contaminant sequences
• Use blobtools to identify taxonomic origin of scaffolds in the assembly
• P. foliorum: 49 scaffolds (1.77 Mbp) possibly from bacterial contamination (Firmicutes)
  • Very short scaffolds with lower GC than rest of assembly
• P. obscura: 1 scaffold (14 kbp) possibly from bacterial contamination (Bacteroidetes)
• P. europaea: none found
• Manually removed these scaffolds
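Once each scaffold carries a taxonomic label, the removal step is a simple filter. The sketch below assumes a list of records with hypothetical keys ("name", "length", "phylum") and made-up values; it is not the blobtools output format, and the slides state that the removal itself was done manually.

```python
def remove_contaminants(scaffolds, contaminant_phyla):
    """Split scaffold records into kept and removed lists.

    `scaffolds` is a list of dicts with hypothetical keys "name",
    "length" and "phylum" (an assumed layout for illustration).
    """
    kept = [s for s in scaffolds if s["phylum"] not in contaminant_phyla]
    removed = [s for s in scaffolds if s["phylum"] in contaminant_phyla]
    return kept, removed


# Made-up records for illustration only
toy_scaffolds = [
    {"name": "scaffold_1", "length": 12_400_000, "phylum": "Oomycota"},
    {"name": "scaffold_2", "length": 7_500_000, "phylum": "Oomycota"},
    {"name": "scaffold_3", "length": 36_000, "phylum": "Firmicutes"},
]
kept, removed = remove_contaminants(toy_scaffolds, {"Firmicutes", "Bacteroidetes"})
print(len(kept), "kept;", sum(s["length"] for s in removed), "bp removed")
```

Reporting the removed base count alongside the scaffold count makes it easy to compare against the totals before and after cleaning.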
[Figure: plot with a “Depth of coverage” axis]
Final assembly statistics
                           P. europaea   P. foliorum   P. obscura
Scaffolds                  39            22            28
N50 (Mbp)                  11.0          7.5           6.4
Max scaffold (Mbp)         15.7          12.4          12.7
Mean scaffold (Mbp)        2.0           2.7           2.2
Total length (Mbp)         76.9          59.1          62.2
Inserted Ns                44,056        6,799         34,145
% GC content               53.6          53.4          51.9
% Repeat content           35.5          29.0          28.7
Augustus gene predictions  19,658        19,484        19,441
• Augustus training sets based on closest available relative
  • P. europaea: P. rubi-based
  • P. foliorum & P. obscura: P. austrocedri-based
• Very similar gene model counts
• Comparison of scaffold count and N50 across all 30 genomes
• Closest published assembly in contiguity is P. sojae
• Many assemblies highly fragmented (thousands of scaffolds)
Completeness of coverage
P. europaea
Complete – 229 (97.9%)
Single – 226 (96.6%)
Duplicated – 3 (1.3%)
Fragmented – 3 (1.3%)
Missing – 2 (0.8%)
P. foliorum
Complete – 229 (97.9%)
Single – 228 (97.4%)
Duplicated – 1 (0.4%)
Fragmented – 1 (0.4%)
Missing – 4 (1.8%)
P. obscura
Complete – 231 (98.7%)
Single – 230 (98.3%)
Duplicated – 1 (0.4%)
Fragmented – 0
Missing – 3 (1.3%)
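The percentages above follow from the category counts. As a worked check, the sketch below recomputes them for P. obscura; the total of 234 is inferred here as complete + fragmented + missing, and the function and key names are illustrative assumptions.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the total searched."""
    complete = single + duplicated
    total = complete + fragmented + missing

    def pct(count):
        return round(100 * count / total, 1)

    return {
        "total": total,
        "complete_pct": pct(complete),
        "single_pct": pct(single),
        "duplicated_pct": pct(duplicated),
        "fragmented_pct": pct(fragmented),
        "missing_pct": pct(missing),
    }


# Counts for P. obscura as listed on the slide
print(busco_summary(single=230, duplicated=1, fragmented=0, missing=3))
```

Percentages recomputed this way can differ from a tool's own summary in the last decimal place, depending on how that tool rounds.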
Gene complement comparison
• Using protein sets from the 26 species with >85% BUSCO completeness (not using P. alni, P. cambivora, P. palmivora or P. lateralis)
• Identify “core” set of proteins common to all genomes as well as single-copy orthologues
• All-by-all BLAST followed by MCL clustering with OrthoFinder
• 54,963 clusters identified
  • 33,254 are single-protein clusters
  • 8,666 have proteins from 20 – 26 genomes
  • 5,097 classed as single-copy clusters with proteins from 20 – 26 genomes
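The cluster categories above can be tallied from a table of cluster membership. The sketch below assumes a plain dict mapping a cluster ID to (species, protein ID) pairs, which is a hypothetical layout rather than OrthoFinder's output format, and it assumes "single-copy" means exactly one protein from each species present in the cluster; the talk does not spell out the definition used.

```python
from collections import Counter


def classify_clusters(clusters, min_genomes=20):
    """Tally single-protein, core and single-copy core clusters.

    `clusters` maps a cluster ID to a list of (species, protein_id) pairs
    (an assumed layout for illustration).
    """
    counts = {"single_protein": 0, "core": 0, "single_copy_core": 0}
    for members in clusters.values():
        if len(members) == 1:
            counts["single_protein"] += 1
            continue
        per_species = Counter(species for species, _ in members)
        if len(per_species) >= min_genomes:
            counts["core"] += 1
            # Assumed definition: one protein per species present
            if all(n == 1 for n in per_species.values()):
                counts["single_copy_core"] += 1
    return counts


# Toy clusters with a lowered threshold of 3 genomes (hypothetical IDs)
toy_clusters = {
    "c1": [("sp1", "a")],
    "c2": [("sp1", "b"), ("sp2", "c"), ("sp3", "d")],
    "c3": [("sp1", "e"), ("sp1", "f"), ("sp2", "g"), ("sp3", "h")],
    "c4": [("sp1", "i"), ("sp2", "j")],
}
print(classify_clusters(toy_clusters, min_genomes=3))
```

With the 26 protein sets used in the talk, `min_genomes=20` corresponds to the "20 – 26 genomes" threshold.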
Gene content differences: clade 7
• Woody host set
  • Genes from either P. cinnamomi or P. rubi
• Non-woody host set
  • Genes from P. fragariae, P. pisi or P. sojae
• Further filtering shows a set of 101 genes common to all five selected pathogens, but not present in P. europaea
  • 36 of these are present in the P. europaea genome sequence, but are disrupted by internal stop codons or indels that may affect expression or function
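One simple way to flag the kind of disruption described here is to scan a candidate coding sequence for in-frame stop codons before its end and for a length that is not a multiple of three. The sketch below does that using the standard stop codons; how the disrupted genes were actually identified is not stated in the slides, so the function name, return keys and toy sequences are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_disruptions(cds):
    """Report in-frame internal stop codons and a broken reading frame.

    `cds` is a DNA coding sequence read from its first codon.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Ignore the final codon, where a stop is expected
    internal_stops = [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    return {"frame_broken": len(cds) % 3 != 0, "internal_stops": internal_stops}


# Toy sequences, not real Phytophthora genes
print(find_disruptions("ATGAAAGGGTAA"))     # intact
print(find_disruptions("ATGAAATAGGGGTAA"))  # premature stop at codon index 2
```

A length that is not a multiple of three is only a crude indicator of a frameshifting indel; a real comparison would align the sequence against the intact orthologue.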
Gene content differences: clade 8
• Selected pathogens can all infect woody hosts
  • Genes from either P. cryptogea, P. austrocedri or P. ramorum
  • P. cryptogea infects woody and non-woody hosts
• Further filtering shows
  • 101 genes common to all three selected pathogens, but not present in P. foliorum
  • 133 common to all three pathogens, but not present in P. obscura
  • 40 common to all three pathogens and absent from both P. obscura and P. foliorum
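The filtering described on this slide amounts to set operations on cluster presence/absence. The sketch below assumes a dict mapping each cluster ID to the set of species with at least one member; the cluster IDs and toy data are hypothetical, and only the species names come from the slide.

```python
def missing_from_targets(presence, pathogens, targets):
    """Find clusters shared by all pathogens but absent from each target.

    `presence` maps a cluster ID to the set of species that have at least
    one protein in that cluster (an assumed layout for illustration).
    """
    required = set(pathogens)
    shared = {cid for cid, species in presence.items() if required <= species}
    per_target = {t: {cid for cid in shared if t not in presence[cid]} for t in targets}
    # Clusters absent from every target at once
    per_target["all_targets"] = set.intersection(*per_target.values()) if targets else set()
    return per_target


pathogens = ["P. cryptogea", "P. austrocedri", "P. ramorum"]
core = set(pathogens)
# Toy presence/absence data (hypothetical cluster IDs)
toy_presence = {
    "OG1": core,
    "OG2": core | {"P. foliorum"},
    "OG3": core | {"P. obscura"},
    "OG4": {"P. ramorum", "P. foliorum"},
    "OG5": core | {"P. foliorum", "P. obscura"},
}
result = missing_from_targets(toy_presence, pathogens, ["P. foliorum", "P. obscura"])
for name, ids in result.items():
    print(name, sorted(ids))
```

The same function covers the clade 7 comparison by passing the five selected pathogens and P. europaea as the single target.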
Summary
• PacBio-only sequencing has produced three highly contiguous assemblies, much more contiguous than published Illumina-based assemblies for other species
• High BUSCO completeness of 98 – 99% indicates very good coverage of the gene space
• Low BUSCO duplication (0.4 – 1.3%) suggests good resolution of the haplotypes
• Core set of 8,666 protein clusters present in at least 20 of the 26 species compared (about 77%)
• A number of genes common to pathogenic Phytophthora species are “missing” from the less pathogenic species
  • Some of these are still present in the genome sequence but may not be expressed, or may be non-functional, because of internal stop codons or indels
Funding and partners

Ewan Mollison wp4 14 Nov 19


Editor's Notes

  • #3 Narrow host range – e.g. P. rubi and P. fragariae are highly specific to soft fruit (raspberry, strawberry). Broad host range – e.g. P. parasitica: 255 species from 90 genera. P. infestans was sequenced in 2009 and has a large genome of 240 Mbp; the smallest genomes can be down to 30 Mbp.
  • #7 SSPACE parameters: -g 500 (min gap between 2 contigs); -l 10 (min links between pairs); -o 100 (min bp overlap between contigs); -k 1 (retain scaffolding info to use with GapFinisher!)