A different kettle of fish entirely: bioinformatic challenges and solutions for whole de novo genome assembly of Atlantic cod and Atlantic salmon
1. A different kettle of fish entirely
Bioinformatic challenges and solutions for whole de novo genome assembly of Atlantic cod and Atlantic salmon
Lex Nederbragt, NSC and CEES
lex.nederbragt@bio.uio.no
@lexnederbragt
8. Sequence data
Figure: original DNA → fragments → sequenced ends (reads) → contigs → scaffolds
http://www.cbcb.umd.edu/research/assembly_primer.shtml
26. Salmon: phase 1
Sanger sequencing, Illumina sequencing
Phase 1 assembly
555 960 sequences
2.4 Gbp of 3 Gbp
Half of that in pieces of 9 300 bp or longer
Figure: scaffold = contig, gap, contig
http://www.flickr.com/photos/jurvetson/57080968/
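The "half of that in pieces of 9 300 bp or longer" figure is the statistic usually called N50. A minimal sketch of how it can be computed from a list of sequence lengths follows; the function name `n50` and the toy lengths are illustrative assumptions, not data from the salmon assembly.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


# Made-up lengths, for illustration only (not from the slides).
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))  # 80: the two longest pieces hold 180 of 300 bases
```

The same function applies to contigs, scaffolds, or reads; only the input list changes.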
27. Salmon: phase 2
Illumina sequencing
Paired end
Mate Pair 3kb and longer
Phase 2 stated goal
Scaffolds greater than 1 Mbp
Half the genome in contigs of at least 50 000 bp
The female named “Sally” with double-haploid genome of estimated length 3 Gbp.
29. Cod: the genome
Heterozygote
850 million bases (Mbp)
‘Wild-caught’
30. Cod: phase 1
454 sequencing (Sanger sequencing)
Phase 1 assembly
157 887 sequences
753 Mbp of 830 Mbp
Half in scaffolds of at least 460 000 bp
Half in contigs of at least 2 800 bp
32. Cod: phase 2
Phase 2
Illumina sequencing
Paired end >200x
Mate Pair 5kb >100x
Phase 2 goal
Half in scaffolds of at least 1 Mbp
Half in contigs of at least 10 000 to 15 000 bp
41. Pacbio for salmon and cod
SMRTbell template libraries
Standard Sequencing
Large insert sizes
Generates one pass on each molecule sequenced
Aim for looooong insert sizes
Circular Consensus Sequencing
Small insert sizes
Generates multiple passes on each molecule sequenced
42. Salmon: PacBio reads
Data set 1
1.1x coverage
Half of all bases in reads at least 5.5 kbp
Longest 26.5 kbp
104 SMRT Cells
Data set 2
Latest chemistry and enzyme (C2-XL)
0.7x coverage
By PacBio Menlo Park
Half of all bases in reads at least 6 kbp
Longest 25 kbp
Figure: SMRTbell template library types (Standard Sequencing with large insert sizes, Circular Consensus Sequencing with small insert sizes)
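Coverage figures such as "1.1x" relate the total number of sequenced bases to the genome size. The sketch below shows that arithmetic in both directions, assuming the estimated 3 Gbp salmon genome size given in the slides; the function names and the toy read lengths are illustrative assumptions.

```python
def coverage(total_read_bases, genome_size_bp):
    """Average depth: sequenced bases divided by genome size."""
    return total_read_bases / genome_size_bp


def bases_needed(target_coverage, genome_size_bp):
    """Number of sequenced bases required to reach a target depth."""
    return target_coverage * genome_size_bp


SALMON_GENOME_BP = 3_000_000_000  # estimated 3 Gbp, as stated in the slides

# Made-up read lengths, for illustration only.
toy_read_lengths = [5_500, 26_500, 3_000, 1_200]
print(coverage(sum(toy_read_lengths), SALMON_GENOME_BP))

# Bases implied by 1.1x coverage of a 3 Gbp genome (roughly 3.3 Gbp).
print(f"{bases_needed(1.1, SALMON_GENOME_BP) / 1e9:.1f} Gbp")
```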
43. Salmon: PacBio reads
Alignments of at least 1kb to released assembly
Alignments binned by % identity
Portion of the alignments
Bin for read accuracy reported in the alignment
Cumulative Alignment Quantity
Figure courtesy of Jason Miller, JCVI, USA
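A plot of alignments binned by % identity could be produced along the following lines. This is a sketch under stated assumptions, not the method used for the figure: the input format (pairs of percent identity and aligned length), the 5% bin width, and the function name `bin_by_identity` are all hypothetical. Only the 1 kb minimum alignment length comes from the slide.

```python
from collections import Counter


def bin_by_identity(alignments, min_length=1000, bin_width=5):
    """Return {bin_start: portion of alignments} for alignments of at
    least min_length bases.

    alignments: iterable of (percent_identity, aligned_length) tuples
    (an assumed input format).
    """
    kept = [(pid, length) for pid, length in alignments if length >= min_length]
    # Each identity value is assigned to the bin starting at the
    # nearest lower multiple of bin_width.
    counts = Counter(int(pid // bin_width) * bin_width for pid, _ in kept)
    total = len(kept)
    return {b: counts[b] / total for b in sorted(counts)}


# Made-up alignments, for illustration only.
toy_alignments = [(82.0, 1500), (87.5, 3000), (86.0, 1200), (88.1, 900), (91.0, 2000)]
print(bin_by_identity(toy_alignments))  # {80: 0.25, 85: 0.5, 90: 0.25}
```

The 900-base alignment is dropped by the length filter, so the portions are computed over the four remaining alignments.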
62. Cod: error-correction
P_errorCorrection pipeline from
93% of reads recovered
2.7x
Alignments of at least 1kb to published assembly
+
23x
+
24 CPUs
4.5 days
100 GB RAM
63. Cod: prospect
PacBio reads span many gaps
PacBio reads may span heterozygous regions
Figure labels: contig 1, polymorphic contig 2, polymorphic contig 3, contig 4
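Whether a long read spans a gap can be checked with simple interval logic. The sketch below assumes, as a simplification, that each read is summarized by one interval of scaffold coordinates covering its outermost aligned positions, and that a read must extend at least `min_flank` bases into the sequence on both sides of the gap. The function name, the 100-base flank, and the coordinates are hypothetical.

```python
def spanned_gaps(gaps, alignments, min_flank=100):
    """Return the gaps covered by at least one read interval with
    min_flank bases of flanking sequence on each side.

    gaps, alignments: lists of (start, end) scaffold coordinates.
    """
    spanned = []
    for gap_start, gap_end in gaps:
        if any(
            a_start <= gap_start - min_flank and a_end >= gap_end + min_flank
            for a_start, a_end in alignments
        ):
            spanned.append((gap_start, gap_end))
    return spanned


# Made-up coordinates, for illustration only.
toy_gaps = [(1000, 1200), (5000, 9000)]
toy_reads = [(500, 3000), (4950, 6000)]
print(spanned_gaps(toy_gaps, toy_reads))  # [(1000, 1200)]
```

In this toy case the second gap is not spanned: the first read ends before it, and the second read starts inside the required left flank and ends within the gap.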
64. Summary
Salmon and cod extra challenging
Assembly is difficult
reads → contigs → scaffolds
PacBio has a huge potential
3-7 kb repeats mapped to PacBio reads
left flank repeat right flank
http://en.wikipedia.org, http://fishandboat.com
65. Acknowledgements
University of Oslo
Jason Miller, JCVI
Pacific Biosciences
Sequencing team NSC
ICSASG
Ole Kristian Tørresen
Kjetill Jakobsen
Sissel Jentoft
Cod genome group