Genome assembly from three sequencing platforms: MinION, MiSeq and PacBio
1. Genome Assembly from Three Sequencing Platforms: MinION, MiSeq and PacBio
Francesca Giordano, Louise Aigrain, Michael Quail, James Bonfield,
Robert Davies, David Jackson, Thomas Keane, Zemin Ning and Richard Durbin
2. Yeast strains: S288c, SK1, N44, CBS
S288c Reference:
12 million bases, 17 chromosomes
Sequenced at the Wellcome Trust Sanger Institute
MinION Reads
Strain  Bases (Mb)  Reads  Mean Length  Longest Read  Coverage  Identity  Number of Runs  Flowcell
S288c   323         32770  9843         56477         27X       93%       3               R7
N44     130         15654  8292         37837         11X       N/A       4               R7
CBS     109         12211  8952         46481         9X        N/A       4               R7
SK1     51          5938   8589         36791         4X        N/A       2               R7
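The Coverage column can be cross-checked from the table itself, since fold coverage is roughly total sequenced bases divided by genome size. The following is a minimal Python sketch, assuming the rounded 12 Mb genome size quoted on this slide; the function name and dictionary are illustrative and not taken from the slides.

```python
# Rough coverage check: coverage ~ total sequenced bases / genome size.
# GENOME_SIZE_MB is the rounded 12 Mb figure from the slide (an assumption;
# the exact reference length differs slightly).
GENOME_SIZE_MB = 12.0

# Total MinION bases (Mb) per strain, copied from the MinION Reads table.
MINION_BASES_MB = {"S288c": 323, "N44": 130, "CBS": 109, "SK1": 51}


def estimate_coverage(bases_mb, genome_size_mb=GENOME_SIZE_MB):
    """Return approximate fold coverage, rounded to the nearest integer."""
    return round(bases_mb / genome_size_mb)


coverage = {s: estimate_coverage(mb) for s, mb in MINION_BASES_MB.items()}
for strain, fold in coverage.items():
    print(f"{strain}: ~{fold}X")
```

With the rounded genome size, the estimates agree with the MinION coverage values reported in the table.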
MiSeq reads, for each strain: ~120X coverage, Identity ~93%
[Figures: ONT read lengths; PacBio read lengths]
PacBio Reads
Strain  Bases (Mb)  Reads   Mean Length  Longest Read  Coverage  Identity  Number of Runs
S288c   1463        239408  6109         35196         120X      93%       3
N44     1794        371025  4834         33906         148X      N/A       3
CBS     1639        324414  5052         34173         134X      N/A       2
SK1     3019        697989  4325         34080         248X      N/A       5
3. Assemblers and other analysis tools
De Novo Assembly with Long Reads (ONT & PacBio)
Canu https://github.com/marbl/canu
Falcon https://github.com/PacificBiosciences/falcon
miniasm https://github.com/lh3/miniasm
PBcR http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR
Nanopolish https://github.com/jts/nanopolish
Analysis I, S288c: de novo assembly with long reads only (ONT or PacBio)
Analysis II, all strains: de novo assembly with MiSeq reads, scaffolding using long reads (ONT or PacBio)
More tools used:
Poretools (https://github.com/arq5x/poretools),
dnadiff (https://github.com/garviz/MUMmer)
De Novo Assembly with MiSeq Reads
SOAPdenovo http://soap.genomics.org.cn/soapdenovo.html
Fermi https://github.com/lh3/fermi
SPAdes http://bioinf.spbau.ru/spades
MaSuRCA http://www.genome.umd.edu/masurca.html
Scaffolding pipelines
Hybrid-SPAdes http://bioinf.spbau.ru/spades
SMIS https://sourceforge.net/projects/phusion2/files/smis/
4. Analysis I: S288c, de novo assembly with long reads only
Nanopore vs. PacBio Platforms
Read length distribution
S288c Reads  Bases (Mb)  Reads  Mean Read Length  Longest Read  Coverage  Identity
Nanopore     323         32770  9843              56477         27X       93%
PacBio       328         34248  9584              32921         27X       92%
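The columns in this comparison (reads, bases, mean read length, longest read) are simple summaries of the read set. The following is a minimal Python sketch of how such figures could be computed from a FASTA file; the function name and the toy input are hypothetical, and the slides do not state which tool produced the reported numbers.

```python
import io


def read_length_stats(fasta_handle):
    """Summarise read lengths from an open FASTA file handle."""
    lengths = []
    current = 0
    in_record = False
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if in_record:
                lengths.append(current)
            current = 0
            in_record = True
        elif in_record:
            # Sequence may span several lines; add them up.
            current += len(line)
    if in_record:
        lengths.append(current)
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "longest": max(lengths, default=0),
    }


# Tiny in-memory example with three made-up reads (lengths 6, 10 and 2).
example = io.StringIO(">r1\nACGT\nAC\n>r2\nACGTACGTAC\n>r3\nAC\n")
stats = read_length_stats(example)
print(stats)
```

For a real data set, the same function could be applied to an opened FASTA file of reads instead of the in-memory example.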
6. Additional tests performed
Analysis I: S288c, de novo assembly with long reads only
De novo assembly with varied coverage: 10X, 20X, 27X for both ONT and PacBio data
De novo assembly using 2D reads from the Pass folder versus using 2D reads from the Pass+Fail folders
Polish assemblies with Nanopolish to improve accuracy
De novo assembly with the full PacBio data samples: > 120X per strain
Analysis II: Scaffolding draft assemblies from MiSeq data
De novo assembly with MiSeq reads and compare results of scaffolding by Hybrid-SPAdes and by the SMIS pipeline
De novo assembly using 2D reads from the Pass folder versus using 2D reads from the Pass+Fail folders
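The varied-coverage test (10X, 20X, 27X) requires drawing subsets of reads that add up to a chosen coverage. The slides do not say how the subsets were drawn; the following Python sketch shows one possible approach, random selection until the target number of bases is reached. The function name, the fixed seed and the toy data are assumptions made for illustration.

```python
import random


def subsample_to_coverage(read_lengths, target_coverage, genome_size, seed=0):
    """Pick reads at random until the target coverage is reached.

    Returns the sorted indices of the selected reads. If the full read set
    falls short of the target, all reads are returned.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    order = list(range(len(read_lengths)))
    rng.shuffle(order)
    target_bases = target_coverage * genome_size
    chosen, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        chosen.append(idx)
        total += read_lengths[idx]
    return sorted(chosen)


# Toy example: 1,000 reads of 100 bases on a 1,000-base "genome" (100X total).
lengths = [100] * 1000
subset_10x = subsample_to_coverage(lengths, 10, 1000)
subset_20x = subsample_to_coverage(lengths, 20, 1000)
print(len(subset_10x), len(subset_20x))
```

Random selection keeps the read length distribution of each subset close to that of the full set, which matters when comparing assemblies across coverage levels.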
Thank you!
Acknowledgments:
Louise Aigrain, Michael Quail, James Bonfield, Robert Davies, David Jackson, Thomas Keane, Zemin Ning, Richard Durbin and Gianni Liti