Paul Sharp and Ewan Mollison wp4 Nov 2018

Phytothreats WP4
20th November 2018
Predicting risk via analysis of Phytophthora genome evolution
Ewan Mollison, Paul Sharp – University of Edinburgh
Sarah Green – Forest Research
Leighton Pritchard, David Cooke – James Hutton Institute

Introduction
• What can drive evolution of a pathogen?
• “Intrinsic” factors: duplication, rearrangement, insertion, deletion of DNA
regions
• “Extrinsic” factors: hybridisation between species, transfer of genes between
species
• Allow pathogens to
• Adapt to evolving host defences
• Expand host range
• Increase virulence

Nowell, Laue, Sharp & Green (2016)
“Comparative genomics reveals genes significantly associated with woody hosts in the plant pathogen Pseudomonas syringae”
Molec. Plant Path. 17:1409-1424
• Genomes of 64 strains of Pseudomonas syringae: 38 from woody hosts, 26 others
[Figure: genes associated with woody hosts]

Aims
• Compare genes from available sequenced Phytophthora genomes
• Identify a core set of Phytophthora genes, common to all species
• Identify species-specific genes or variation
• Sequence genomes from three less damaging species, which are
closely related to highly damaging species
• Topic of this talk
• Study target genes / gene families known to be important for
virulence
• How do variations in these influence the pathogen, e.g. host range, damage
caused, etc.?

Phytophthora species
Genome assemblies available for 27 species from 10 clades (11 are from clades 7 & 8)

Clade  Species
1a     P. cactorum
1c     P. infestans
1      P. parasitica
2a     P. colocasiae
2b     P. capsici
2c     P. multivora
2c     P. plurivora
3      P. pluvialis
3?     P. taxon totara
4      P. litchii
4      P. megakarya
4      P. palmivora
5      P. agathidicida
6b     P. pinifolia
7a     P. alni
7a     P. cambivora
7a     P. fragariae
7a     P. rubi
7b     P. pisi
7b     P. sojae
7c     P. cinnamomi
8a     P. cryptogea
8c     P. lateralis
8c     P. ramorum
8d     P. austrocedri
9b     P. fallax
10     P. kernoviae

[Figure: phylogenetic tree of the 27 previously released genomes, based on an alignment of concatenated DNA sequences from 7 genes (as in Yang et al. 2017). Clade labels on the tree: 8, 6, 9, 10, 7, 3, 2, 4, 1, 5]

P. europeae
• Mainly infects European Oak (roots),
also identified in North America
• Clade 7 (Subclade 7a): most closely related to
P. alni, P. cambivora (both woody host), P.
fragariae, P. rubi (both soft fruit host)

P. foliorum
• Causes leaf blight in azaleas
• Clade 8 (Subclade 8c): most
closely related to P. lateralis, P.
ramorum (both woody host)

P. obscura
• Found associated with horse
chestnut and pieris
• Clade 8 (subclade 8d): most
closely related to P. austrocedri
(juniper and other cypress
species)

[Figure: phylogenetic tree of all 30 species, based on an alignment of concatenated DNA sequences from 7 genes (as in Yang et al. 2017). Clade labels on the tree: 8, 6, 9, 10, 7, 3, 2, 4, 1, 5]

Three less damaging, although still pathogenic, Phytophthora species:
• P. europeae
• P. foliorum
• P. obscura
Sequencing now complete!
PacBio sequencing of 2 SMRT cells for each species (Exeter)
DNA prepared by Carolyn Riddell (Forest Research)

Why PacBio rather than Illumina?
• MUCH longer read lengths can be achieved
• Tens of Kbp rather than 150 – 300bp
• Repeats more easily resolved
• Greater overall contiguity
• Random source of error rather than systematic bias
• Over-coverage can be used to help error-correct rather than amplify bias
• P. austrocedri – hybrid Illumina/PacBio
• Other assembled Phytophthoras – Illumina only
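
The point about random versus systematic error can be illustrated with a toy majority-vote consensus. This is only a minimal sketch: the reads are made up, they are assumed to be already aligned and of equal length, and the `consensus` function is a hypothetical helper, not the error-correction method used by any particular assembler.

```python
from collections import Counter

def consensus(reads):
    """Column-wise majority vote over equal-length, already aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

truth = "ACGTACGTAC"

# Random errors: each read is wrong at a different position.
random_error_reads = [
    "TCGTACGTAC",  # error at position 0
    "ACGAACGTAC",  # error at position 3
    "ACGTACCTAC",  # error at position 6
    "ACGTACGTAG",  # error at position 9
    "ACGTACGTAC",  # error-free
]

# Systematic bias: most reads share the same wrong base at position 3.
biased_reads = [
    "ACGAACGTAC",
    "ACGAACGTAC",
    "ACGAACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
]

print(consensus(random_error_reads) == truth)  # extra coverage outvotes random errors
print(consensus(biased_reads) == truth)        # extra coverage reinforces a shared bias
```

With random errors, adding reads makes the correct base win each column; with a shared bias, adding reads only strengthens the wrong call.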

P. austrocedri reassembly (Peter Thorpe)
• Hybrid sequencing – both PacBio and Illumina
• Hampered by not quite enough coverage of either for optimal assembly
• Reassembled P. austrocedri using only the PacBio reads
• Error-corrected using trimmed, de-duplicated Illumina reads
• Purged “haplotigs” to produce consensus haploid assembly

                      Hybrid assembly   Corrected PacBio
No. scaffolds         43,700            862
Scaffold N50          41,889            213,073
Max scaffold length   422,335           861,531
Mean scaffold length  3,089             121,524
Total length (Mbp)    135.01            104.75
% GC                  51.4              51.5
% Repeat masked       49.0              39.3
No. gene models       38,492            26,960

Raw sequence generated

P. foliorum          SMRT 1    SMRT 2    Combined
No. reads            634,588   836,118   1,470,706
Max read length      77,730    82,231    82,231
Mean read length     9,669     8,144     8,802
Total length (Gbp)   6.1       6.8       12.9

P. obscura           SMRT 1    SMRT 2    Combined
No. reads            475,375   564,534   1,039,909
Max read length      79,519    80,795    80,795
Mean read length     12,308    10,876    11,531
Total length (Gbp)   5.9       6.1       12.0

P. europeae          SMRT 1    SMRT 2    Combined
No. reads            739,454   723,675   1,463,129
Max read length      85,983    81,473    85,983
Mean read length     9,956     8,983     9,475
Total length (Gbp)   7.4       6.5       13.9

• Variable read length but high read N50 indicates good overall read length achieved
• Max read length >80Kbp for all three species
• Generally good consistency across both SMRT cells
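
The Combined column can be cross-checked from the per-cell figures: read counts add up, and the combined mean is the read-count-weighted mean of the two per-cell means. A minimal sketch using the numbers from the slide; `combine_cells` is an illustrative helper name, and because the per-cell means are themselves rounded, the totals are approximate.

```python
def combine_cells(cells):
    """cells: list of (number_of_reads, mean_read_length) tuples, one per SMRT cell.

    Returns (total reads, weighted mean read length, approximate total bases).
    """
    total_reads = sum(n for n, _ in cells)
    total_bases = sum(n * mean for n, mean in cells)
    return total_reads, total_bases / total_reads, total_bases

# Per-cell read counts and mean read lengths as reported on the slide.
per_cell = {
    "P. foliorum": [(634_588, 9_669), (836_118, 8_144)],
    "P. obscura": [(475_375, 12_308), (564_534, 10_876)],
    "P. europeae": [(739_454, 9_956), (723_675, 8_983)],
}

for species, cells in per_cell.items():
    reads, mean_len, bases = combine_cells(cells)
    print(f"{species}: {reads:,} reads, mean {mean_len:,.0f} bp, {bases / 1e9:.1f} Gbp")
```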

Overall strategy
PacBio sequencing → Canu assembly → SSPACE long-read scaffolding → BUSCO completeness estimate → Additional error-correction & assembly polishing → Repeat masking → Gene model prediction → Final assembly
• Conflicting opinions on whether best to error-correct and polish before or after scaffolding
• Correction can take a few weeks, so have run early repeat mask and gene prediction on initial scaffolds to get preliminary values
Sequencing and assembly summary
• Canu assembly of first cell from each to get “quick” picture of what’s in there
• Run with initial assumption of approx. 100Mbp genome size
• Early estimate of genome size from k-mer analysis of corrected, trimmed reads (k=31)
before assembling full data sets
• P. europeae: 95Mbp
• P. foliorum: 70Mbp
• P. obscura: 63Mbp
• Run full assembly with estimated genome size of 100Mbp for all three
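
One common way to turn a k-mer depth histogram into a genome size estimate is to divide the total number of k-mers by the depth at the main histogram peak. The slides do not say which tool or formula was used, so the sketch below is an assumption: the function name, the `min_depth` cut-off for discarding likely error k-mers, and the toy histogram are all hypothetical. The second part simply divides the combined raw yield by the k-mer estimates quoted above to get approximate fold coverage.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (an assumed, tunable cut-off).
    """
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)              # depth of single-copy k-mers
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth

# Toy histogram (made-up numbers): error k-mers at depth 1-2,
# a single-copy peak at depth 20 and a small two-copy bump at depth 40.
toy_histogram = {1: 5000, 2: 800, 19: 150, 20: 400, 21: 150, 40: 50}
print(estimate_genome_size(toy_histogram))

# Approximate fold coverage: combined raw yield (Mbp) / k-mer size estimate (Mbp),
# using the figures reported in the slides.
yield_mbp = {"P. europeae": 13_900, "P. foliorum": 12_900, "P. obscura": 12_000}
kmer_size_mbp = {"P. europeae": 95, "P. foliorum": 70, "P. obscura": 63}
coverage = {sp: yield_mbp[sp] / kmer_size_mbp[sp] for sp in yield_mbp}
for species, fold in coverage.items():
    print(f"{species}: ~{fold:.0f}x")
```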

Canu contig level assembly

                                            P. europeae   P. foliorum   P. obscura
No. contigs                                 112           103           127
Contig N50 (Mbp)                            2.83          2.42          2.99
Max contig length (Mbp)                     9.61          5.60          6.83
Mean contig length (Mbp)                    0.68          0.60          0.48
Total length (Mbp)                          76.5          61.8          60.4
Process duration (correct, trim, assemble)  12d 7h        3d 20h        3d 2h

• High N50 and low number of contigs show a very high degree of contiguity in all three assemblies
• N50: the length N such that 50% of the assembled sequence is contained in fragments of length N or greater
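
The N50 definition can be written out directly. A minimal sketch, assuming the input is just a list of contig or scaffold lengths; the function name `n50` and the toy lengths are illustrative.

```python
def n50(lengths):
    """Return the length N such that fragments of length >= N
    contain at least half of the total sequence."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:   # reached 50% of the total length
            return length
    raise ValueError("no lengths given")

# Toy example: total length 20, so the longest fragments are added
# until their running sum reaches 10.
print(n50([2, 3, 4, 5, 6]))
```

Sorting from longest to shortest and stopping at the halfway point is why a few very long contigs give a high N50 even when many short contigs remain.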

Scaffolding
• Scaffold contigs using full set of PacBio reads with SSPACE long-read
• Scaffolding links contigs together with gaps of known length padded out with “N” characters
• Reduced number of scaffolds, N50 now >2.5Mbp for each assembly

                            P. europeae   P. foliorum   P. obscura
No. scaffolds               69            67            77
Scaffold N50 (Mbp)          4.28          2.88          5.42
Max scaffold length (Mbp)   9.61          8.01          7.11
Mean scaffold length (Mbp)  1.11          0.92          0.79
Total length (Mbp)          76.7          61.9          61.1
No. N's                     124,828       99,240        440,106
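
How contigs are linked into a scaffold with “N” padding can be shown on toy sequences. This is a sketch of the idea only, not of how SSPACE works internally: `join_contigs`, the contig strings and the gap lengths are all made up for illustration.

```python
def join_contigs(contigs, gap_lengths):
    """Join ordered contigs into one scaffold, padding each gap with N characters.

    gap_lengths[i] is the estimated gap between contigs[i] and contigs[i + 1].
    """
    if not contigs or len(gap_lengths) != len(contigs) - 1:
        raise ValueError("need one gap length between each pair of adjacent contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_lengths, contigs[1:]):
        parts.append("N" * gap)   # unknown bases spanning the gap
        parts.append(contig)
    return "".join(parts)

scaffold = join_contigs(["ACGT", "GGCC", "TTAA"], [3, 5])
n_count = scaffold.count("N")
print(scaffold, n_count, f"{100 * n_count / len(scaffold):.1f}% N")
```

The same ratio of N's to total length applied to the reported figures gives a gap content well under 1% for each of the three assemblies.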

[Figure: comparison of scaffold count and N50 across all 30 genomes]
• Assembly is comparable to that of P. sojae

“BUSCO” completeness (n = 234)

                     P. europeae   P. foliorum   P. obscura
Complete BUSCOs      230 (98.3%)   230 (98.3%)   230 (98.3%)
Complete/single      227 (97.0%)   229 (97.9%)   229 (97.9%)
Complete/duplicated  3 (1.3%)      1 (0.4%)      1 (0.4%)
Fragmented           2 (0.9%)      0 (0.0%)      1 (0.4%)
Missing              2 (0.9%)      4 (1.7%)      3 (1.3%)

• High estimate of completeness for all three assemblies (98%)
• Good coverage of the “gene-space” achieved
• Very low level of duplication seen in all three (~1%)
• Suggests good resolution of haplotypes within assembly
• Also suggests polyploidy unlikely
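
The percentages in the BUSCO summary follow directly from the four category counts. A minimal sketch, assuming only the counts are known; `busco_summary` is an illustrative helper and not part of the BUSCO tool itself.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Return {category: (count, percent of all BUSCOs searched)}."""
    total = single + duplicated + fragmented + missing
    counts = {
        "complete": single + duplicated,
        "single": single,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
    }
    return {name: (n, round(100 * n / total, 1)) for name, n in counts.items()}

# Counts for the first assembly column of the BUSCO table (n = 234).
summary = busco_summary(single=227, duplicated=3, fragmented=2, missing=2)
for name, (count, pct) in summary.items():
    print(f"{name}: {count} ({pct}%)")
```

The duplicated fraction is the figure behind the haplotype and polyploidy bullets: an assembly that kept both haplotypes, or a polyploid genome, would be expected to show many more duplicated single-copy genes.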

Repeat content and gene model estimation
• RepeatMasker run vs. scaffolded assemblies with models derived from multiple Phytophthora species (generated by RepeatModeler)
• Augustus run vs. masked assemblies using training set from closest available relative
• P. europeae – P. rubi based set
• P. foliorum, P. obscura – P. austrocedri based set
• No. predicted gene models comparable to P. infestans, P. ramorum, etc. – realistic looking figure

                  P. europeae   P. foliorum   P. obscura
% GC              53.6          51.9          53.3
% Repeat masked   45            38            35
No. gene models   15,863        15,907        17,178

Sample gene family: Xylanases
• Class of cell wall degrading enzymes which break down hemicellulose by degrading β-1,4-xylan into xylose
• Hemicellulose is a major constituent of the
plant cell wall
• Xylanase enzymes play a major role in the ability of
micro-organisms to degrade plant material
• Help the pathogen enter host tissues by breaking
down the cell wall
• Expand previous xylanase analysis to include
new genome assemblies

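Full xylanase tree
• Two major clades
• One containing xyn1 and xyn2
• The other containing xyn3 and xyn4
[Figure: xylanase gene tree with xyn1, xyn2, xyn3 and xyn4 labelled]

Number of xylanase genes varies among clades
[Figure: species tree annotated with xylanase gene counts per clade (clade: count): 8: 3, 6: 4, 9: 2, 10: 2, 7: 4, 3: 4, 2: 4, 4: 4, 1: 4, 5: 4]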

Next stages
• Finalise assembly improvement
• Remove contaminant reads, polishing, gap-filling, etc.
• Re-scaffold assemblies
• Re-run repeat masking, gene model prediction
• Bring these assemblies together with the others for downstream
comparative analysis
• Identify orthologous groups, targeted gene family studies, etc.
