1. Mathu Malar C., Jennifer Yuzon, Takao Kasuga and Sucheta Tripathy
UC Davis, CA, USA.
CSIR-Indian Institute of Chemical Biology, Kolkata, India.
2. Background
Phytophthora ramorum is a highly destructive pathogen with a wide host range that causes Sudden Oak Death in western North America and Sudden Larch Death in the UK.
P. ramorum was first reported in 1995, and the origins of the pathogen are still unclear.
P. ramorum can be spread over several miles in mists, air currents, watercourses and rain splash. Phytophthora pathogens are also known to spread on footwear, dogs' paws, bicycle wheels, tools and equipment.
Parke, J. L., and S. Lucas. 2008. Sudden oak death and ramorum blight. The Plant Health Instructor. DOI: 10.1094/PHI-I-2008-0227-01
https://sites.google.com/site/phytophthoragenomicslab/home
3. For strain Pr102
Platform | No. of reads generated | Total reads used for assembly | Organism | Read coverage
PacBio | 435399 | 33% and 47% | Pr102 | 25X
Illumina | 20942377 | 20942377 (100%) | Pr102 | 10X
For ND886
Platform | No. of reads generated | Total reads used for assembly | Organism | Read coverage
PacBio | 402170 | 285487 (70%) | ND886 | 50X
Illumina | 43676830 | 43676830 (100%) | ND886 | 50X
4. V1 assembly (Tyler BM et al., 2006) by Sanger sequencing: 65 Mb, genome coverage 7.7X, total gaps 12 Mb.
V2 Assembly (September 2015)
V3 Assembly (December 2015)
V4 Assembly (March 2016)
V5 Assembly (April 2016)
5. Improved 3-way error correction protocol
PacBio Pr102: 435399 raw reads
ECTools with Sanger unitigs from the 2006 phyra V1 assembly
  Corrected: 147429 reads (33%)
  Uncorrected: 287970 reads (67%)
ECTools with mock intermediate assembly (Illumina reads + unitigs (V1) derived 6K and 20K simulated libraries, using ALLPATHS)
  Corrected: 1418 reads (0.49%)
  Uncorrected: 286552 reads (66.50%)
PBCR auto error correction assembly used as input to ECTools for error correction
  Corrected: 57640 reads (13.2%)
  Uncorrected: 228912 reads (52%)
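The three correction passes can be tallied with a few lines of Python. This is only a bookkeeping sketch: the read counts are the ones shown on the slide, while the stage labels and the `tally` helper are shorthand introduced here for illustration. It assumes each pass is applied to the reads left uncorrected by the previous pass, which is what the slide's numbers suggest.

```python
# Read counts are taken from the slide; labels and helper names are illustrative.
RAW_READS = 435399

# (stage label, reads corrected at that stage)
STAGES = [
    ("ECTools + Sanger unitigs (V1)", 147429),
    ("ECTools + mock intermediate assembly", 1418),
    ("PBCR auto-corrected assembly as ECTools input", 57640),
]


def tally(raw_reads, stages):
    """Return per-stage rows and the cumulative number of corrected reads."""
    rows = []
    remaining = raw_reads
    corrected_total = 0
    for label, corrected in stages:
        remaining -= corrected          # reads still uncorrected after this pass
        corrected_total += corrected    # running total of corrected reads
        cumulative_pct = 100.0 * corrected_total / raw_reads
        rows.append((label, corrected, remaining, cumulative_pct))
    return rows, corrected_total


rows, total = tally(RAW_READS, STAGES)
for label, corrected, remaining, cumulative_pct in rows:
    print(f"{label}: corrected={corrected}, "
          f"still uncorrected={remaining}, cumulative={cumulative_pct:.1f}%")
print("total corrected reads:", total)
```

Run as written, the remaining-read counts reproduce the "Uncorrected" figures on the slide, and the total equals the 206487 error corrected reads used as input for the V5 protocol.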
6. An Overview of Assemblers and tools used in this study
Tools | Input type | Function
ECTools | PacBio reads with a reference dataset (unitigs) for read error correction | Correcting errors in PacBio reads
PBCR (PacBioToCA) | PacBio reads | Error correction and assembly
Canu | PacBio reads | Successor of the PBCR assembler
SSPACE (stand-alone scaffolder of pre-assembled contigs using paired-read data) | Pre-assembled contigs, short reads (paired end and mate pair) | Not a de novo assembler; used for scaffolding and extending contigs
SSPACE Long Reads | Pre-assembled contigs and long reads (here, the PacBio reads) | Successor of SSPACE; performs better on a case-by-case basis
Dedupe | Sequence reads | Removes PCR duplicates and identical sequences prior to mapping
Redundans | Hybrid datasets | Recently developed (2016); specifically effective for heterozygous genomes
7. Other Assembly Protocols
Improved error corrected reads (49%), Illumina reads
Dedupe
Redundans
2325 scaffolds, 76 Mb, largest = 781884, N50 = 65030
Canu
Total scaffolds = 920, largest scaffold = 655506, smallest = 3055, N50 = 116386, size = 61 Mb
V3 Assembly
Celera, minimus, SSPACE, SSPACE Long Reads
1114 scaffolds, largest = 886281, smallest = 15009, total length = 79285078
Previous error correction protocol (33%)
V2 Assembly
minimus, SSPACE, SSPACE Long Reads
SSPACE, SSPACE Long Reads
Improved error corrected reads (49%)
V4 Assembly
8. Protocol for V5 assembly
Total error corrected reads: 206487
Celera assembly with length cut off 10k (2735 contigs, 77 Mb)
Input data for Redundans
Library | No. of reads | Read length
Illumina | R1=10157419, R2=10784958 | varies from 50 nt to 100 nt
V1 unitigs MP 20k | R1=28379, R2=28379 | 101
PacBio corrected MP 10k | R1=5234, R2=5234 | 150
PacBio corrected MP 6k | R1=59180, R2=59180 | 101
V1 unitigs (2006 assembly) | 7589 (unitigs) | variable
Redundans assembly: 65 Mb, 1825 scaffolds, N50 = 76861, largest = 781884, smallest = 2000
Comparison with Phyra unitigs using MUMmer; CAP3 run on unmapped sequences from the V1 unitigs, which were appended back to the assembly
No. of scaffolds = 2005, largest scaffold = 781884, smallest scaffold = 2000, N50 = 76032, total length = 67996746, gaps = 220 bases
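Since N50 is the headline statistic on these slides, a minimal sketch of how it is usually computed may help. It uses the common definition (the length of the sequence at which the running sum of lengths, sorted longest first, reaches half of the total). The function name and the toy scaffold lengths are made up for illustration and are not taken from any assembly in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Lengths are sorted from longest to shortest and summed until the running
    total reaches at least half of all bases; the length at which that
    happens is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one sequence length")


# Toy scaffold lengths (hypothetical, for illustration only).
toy_scaffolds = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print("N50 of toy scaffolds:", n50(toy_scaffolds))
```

In practice the lengths would come from the scaffold FASTA of each assembly version, so the same function can be applied to every protocol's output for a like-for-like comparison.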
15. Assembly version | No. of core proteins (of 248 completely, highly conserved CEGs) | Unique gene | % completeness | Core genes present in genome (out of 458)
V1 | 236 | KOG0948 Nuclear exosomal RNA helicase MTR4 | 95.16 | 412
V2 | 237 | KOG0434 Isoleucyl-tRNA synthetase | 95.56 | 412
V3 | 236 | KOG0734 AAA+-type ATPase containing the peptide | 95.16 | 413
V4 | 237 | KOG2311 NAD/FAD-utilizing protein | 95.56 | 416
V5 | 238 | KOG1158 NADP/FAD dependent oxidoreductase | 95.97 | 414
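The completeness column appears to be the number of core proteins found divided by the 248 CEGs, expressed as a percentage. A small sketch of that arithmetic follows. The counts are copied from the table, while the dictionary and function names are illustrative only.

```python
CEG_TOTAL = 248  # size of the highly conserved core gene set named on the slide

# Core proteins found per assembly version (values from the table).
CORE_FOUND = {"V1": 236, "V2": 237, "V3": 236, "V4": 237, "V5": 238}


def completeness(n_found, n_total=CEG_TOTAL):
    """Percentage of the core set recovered, rounded to two decimals."""
    return round(100.0 * n_found / n_total, 2)


for version, n_found in CORE_FOUND.items():
    print(version, n_found, completeness(n_found))
```

The computed values agree with the percentages reported in the table, which supports reading the column as found/248.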
16. Effector Prediction Pipeline
V5 Assembly
SignalP-predicted protein sequences (7159)
Removed proteins with transmembrane domains
RXLR motifs on the N terminus (373 sequences)
Motif prediction with MEME (W Y domain)
343 sequences were detected in MEME
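The RXLR step of the pipeline can be illustrated with a simple regular-expression scan. This is a sketch under stated assumptions, not the pipeline used in the study: the pattern `R.LR` (arginine, any residue, leucine, arginine) is the usual reading of "RXLR", but the N-terminal window bounds (`window_start`, `window_end`) are hypothetical parameters chosen for the example, and the two protein sequences are invented.

```python
import re

# R, any residue, L, R. Assumes upper-case one-letter amino acid codes.
RXLR = re.compile(r"R.LR")


def find_rxlr(seq, window_start=20, window_end=80):
    """Return the 0-based start of the first RXLR match that lies entirely
    inside the N-terminal window [window_start, window_end), or None.

    The window bounds are illustrative defaults, not values from the slides.
    """
    match = RXLR.search(seq, window_start, window_end)
    return match.start() if match else None


# Invented toy sequences: one with the motif near the N terminus, one without.
near_n_terminus = "M" + "A" * 29 + "RSLR" + "A" * 40
far_from_n_terminus = "M" + "A" * 99 + "RSLR"

print(find_rxlr(near_n_terminus))      # motif inside the window
print(find_rxlr(far_from_n_terminus))  # motif outside the window
```

A real run would apply the scan only to the secreted, non-transmembrane candidates from the earlier steps and would then pass the hits on to motif discovery.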
18. ND886 error correction and Read statistics
PacBio raw reads: 402170
ECTools with Sanger unitigs (V1 assembly)
  Corrected (70.9%): 285487 reads
  Uncorrected (29.1%)
19. ND886 assembly
Total error corrected reads: 285487
Celera Assembly
Minimus
Dedupe
Read statistics
Library | No. of reads | Read length
Illumina | R1=28389986, R2=28334221 | varies from 50 nt to 100 nt
PacBio corrected MP 10k | R1=91555, R2=91555 | 101
PacBio corrected MP 6k | R1=131703, R2=131703 | 101
SSPACE [with Illumina reads]: total contigs = 6443, largest contig = 648889, smallest contig = 2098, assembly size = 150 Mb
Redundans: No. of scaffolds = 2225, largest = 648906, smallest = 2745, N50 = 48161, total length = 92877686, gaps = 4133
Assembly | No. of core proteins (from 248) | % completeness | No. of core genes (out of 458)
ND886 | 234 | 94.35 | 410
20. Comparison of ND886 against Pr102 2006 assembly
P. ramorum ND886
P. ramorum Pr102 (2006)
21. De novo assemblers alone are not enough for a good genome assembly.
PacBio reads are marred with errors, and one error correction protocol alone does not always produce the best result.
Hybrid assembly in combination with scaffolders and duplicate removers is effective.
No single protocol works best for both genomes; protocols have to be mixed and matched.
Assembly improvement does not necessarily change the gene space; rather, it works better for repetitive regions and for correcting the assembly.
24. Long reads range from 14,000 to 48,000 base pairs, longer than Sanger and NGS reads.
Shortest run time (30 min).
Least GC bias.
No amplification bias.
Handles highly repetitive genomes and can fill gaps efficiently.
Reference: http://www.pacificbiosciences.com/products/smrt-technology/smrt-sequencing-advantage/
25. Repeat regions captured in the genome
Feature | P. ramorum 2006 | Protocol 1b | Protocol 2 | Protocol 3 | Bangalore meeting
Bases masked | 7847064 bp (11.77%) | 16553511 bp (24.34%) | 21185972 bp (27.00%) | 12854764 bp (21.06%) | 16192690 bp (20.68%)
Small RNA | 11 (6033 bp), 0.01% | 75 (49453 bp), 0.07% | 605 (349328 bp), 0.45% | 64 (69255 bp), 0.11% | 8 (3317 bp), 0.00%
Simple repeats | 5336 (242077 bp), 0.36% | 7122 (331100 bp), 0.49% | 11702 (586604 bp), 0.75% | 6801 (323105 bp), 0.53% | 7549 (340933 bp), 0.44%
Low complexity | 422 (20747 bp), 0.03% | 816 (40373 bp), 0.06% | 1787 (91389 bp), 0.12% | 679 (33221 bp), 0.05% | 699 (33372 bp), 0.04%
GC content | 53.86% | 53.98% | 52.40% | 54.09% | 54.32%
Total interspersed repeats | 7580618 bp (11.37%) | 16138229 bp (23.73%) | 20163370 bp (25.70%) | 12434182 bp (20.37%) | 15819353 bp (20.20%)
LINE [R2/R4/NeSL] | 53 (88470 bp), 0.13% | 87 (250632 bp), 0.37% | 112 (308607 bp), 0.39% | 64 (176881 bp), 0.29% | 92 (250092 bp), 0.32%
LTR elements (Ty1/Copia, Gypsy/DIRS1) | 5972 (6669143 bp), 10.01% | 8822 (14885437 bp), 21.89% | 11127 (18710327 bp), 23.85% | 6752 (11415393 bp), 18.70% | 2498 (4413567 bp), 5.64%
DNA transposon | 1174 (823005 bp), 1.23% | 1419 (1002160 bp), 1.47% | 1756 (1144436 bp), 1.46% | 1133 (841908 bp), 1.38% | 1560 (1155118 bp), 1.48%
PiggyBac | 200 (104977 bp), 0.16% | 198 (104684 bp), 0.15% | 231 (129697 bp), 0.17% | 191 (105824 bp), 0.17% | 228 (126831 bp), 0.16%
Tourist/Harbinger | 12 (5609 bp), 0.01% | 13 (5809 bp), 0.01% | 12 (5417 bp), 0.01% | 12 (6211 bp), 0.01% | 15 (6376 bp), 0.01%
28. Assembly | No. of genes predicted | Average gene length | Largest gene | Mapping with V1 assembly | Mapping with V2 assembly | Mapping with V3 assembly | Mapping with V4 assembly | Mapping with V5 assembly
V1 | 16134 | 1673 | 21479 | NA | 15978 | 15645 | 15855 | 16072
V2 | 20741 | 2162.78 | 31832 | 20739 | NA | 20377 | 20519 | 20675
V3 | 15110 | 2005.05 | 46572 | 15055 | 15019 | NA | 14990 | 15073
V4 | 17311 | 1821.26 | 47518 | 17307 | 17245 | 16906 | NA | 17277
V5 | 19278 | 1829.68 | 31832 | 19273 | 19167 | 18861 | 19051 | NA
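To compare the mapping counts across assemblies of different gene-set sizes, it can help to express them as percentages. The sketch below assumes that each "Mapping with Vx" cell is the number of genes predicted on the row's assembly that map to assembly Vx. That reading is an interpretation of the table, and the dictionary and function names are illustrative.

```python
# Genes predicted per assembly (values from the table).
GENES = {"V1": 16134, "V2": 20741, "V3": 15110, "V4": 17311, "V5": 19278}

# MAPPED[source][target]: genes from `source` that map to `target`
# (values from the table; "NA" self-comparisons are omitted).
MAPPED = {
    "V1": {"V2": 15978, "V3": 15645, "V4": 15855, "V5": 16072},
    "V2": {"V1": 20739, "V3": 20377, "V4": 20519, "V5": 20675},
    "V3": {"V1": 15055, "V2": 15019, "V4": 14990, "V5": 15073},
    "V4": {"V1": 17307, "V2": 17245, "V3": 16906, "V5": 17277},
    "V5": {"V1": 19273, "V2": 19167, "V3": 18861, "V4": 19051},
}


def mapped_percent(source, target):
    """Percentage of genes predicted on `source` that map to `target`."""
    return 100.0 * MAPPED[source][target] / GENES[source]


for source, targets in MAPPED.items():
    for target in targets:
        print(f"{source} -> {target}: {mapped_percent(source, target):.2f}%")
```

Under this reading, nearly all predicted genes map across versions, which is consistent with the conclusion that assembly improvement mainly affects repetitive regions rather than the gene space.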