A presentation I gave on the first day of the special workshop:
"Current Challenges 2015 - Next Generation Sequencing (592003) 1-3 ECTS" (7-11 December 2015)
at University of Helsinki Viikki Campus
http://www.helsinki.fi/dpps/currentchallenges2015.htm
The purpose was to inform the Finnish researchers present about the pros and cons to consider before starting a genome project, and to show the resources available at SciLifeLab.
9. Things to consider
• Repeats
• Heterozygosity
• Size of your genome
• GC content
• Access to material and specifically HMW DNA
• Access to a good computational cluster
• Good bioinformaticians / lab technicians
10. Things to consider
• Repeats
• Heterozygosity
• Size of your genome
• Access to material and specifically HMW DNA
• Access to a good computational cluster
• Good bioinformaticians / lab technicians
WHAT IS YOUR SCIENTIFIC QUESTION?
11. Variation space
WHAT IS YOUR SCIENTIFIC QUESTION?
12. Things to consider
WHAT IS YOUR SCIENTIFIC QUESTION?
http://www.intechopen.com/books/recent-advances-in-autism-spectrum-disorders-volume-i/discovering-the-genetics-of-autism
13. Things to consider
WHAT IS YOUR SCIENTIFIC QUESTION?
Alkan C., Coe B.P., Eichler E.E. Nature Rev Genetics (2011)
14. Things to consider
WHAT IS YOUR SCIENTIFIC QUESTION?
15. Things to consider
WHAT IS YOUR SCIENTIFIC QUESTION?
Ward L.D. & Kellis M. Nat Biotechnology (2012)
17. Figure 1 | Cost-effectiveness of Pool-seq. The accuracy of allele frequency estimates is compared for whole-genome sequencing of pools of individuals (Pool-seq) and whole-genome sequencing of individuals using the ratio of the standard deviation (SD) of the estimated allele frequency with both methods. The same number of reads is used for both sequencing strategies. A value smaller than one indicates that Pool-seq is more accurate than sequencing of individuals. a | The influence of the pool size is shown. A larger pool size results in higher accuracy of Pool-seq, but Pool-seq still produces more accurate allele frequency estimates even for pool sizes of 50 individuals in most
[Figure, two panels (a, b), Nature Reviews | Genetics. x-axis: number of individuals sequenced separately (10 to 50); y-axis: SDpool/SDindividuals (0.4 to 1.1). Curves in one panel: pool size 100 with coverage per sequenced individual / deviation in DNA content from each individual in the pool of 20×/0%, 20×/30%, 5×/30% and 1×/30%. Curves in the other panel: pool sizes 500, 100 and 50, each at 5× coverage and 30% deviation.]
Schlötterer C., Tobler R., Kofler R. and Nolte V. Nature Rev Genetics (2014)
Why pooling?
19. SciLifeLab (promotion slides)
SciLifeLab
National service
Local scientific center
Director (July 2015): Olli Kallioniemi
Co-director: Kerstin Lindblad-Toh
Vision: To be an internationally leading center that develops, uses and provides access to advanced technologies for molecular biosciences with focus on health and environment.
www.scilifelab.se
2010: Strategic research initiative
2013: National resource
2015: New management and chairman
23. The Bioinformatics Platform 2016
Funding
• The Research Council
• SciLifeLab
• KAW foundation
• Host universities
Applied at the Research Council as continued national infrastructure 2016-2023. Decision late 2015.
Custom-tailored support, Tools, Training
Today: ~70 FTE
24. Long-term Support
Wallenberg Advanced Bioinformatics Infrastructure
www.scilifelab.se/facilities/wabi/
Tailored solutions – high impact
Directors: Siv Andersson, Gunnar von Heijne
Managers: Björn Nystedt, Thomas Svensson
Applied bioinformatics: 500h free support/project
• Variant analyses in health and disease
• Transcriptomics
• Single-cell analyses
• Epigenetics
• Metagenomics
Sweden's strongest unit for analyses of large-scale genomic data (24 FTE)
National committee reviews and selects projects based on scientific quality
Staff in Stockholm, Uppsala, Lund, Gothenburg, Linköping, Umeå.
56. Fig4.1 17-mer depth distribution
[Figure: 17-mer depth distribution. x-axis: Depth (X), 0 to 100; y-axis: Percentage (%), 0 to 2]
Table4.2 17-mer Data statistics
K   K-mer_NO.       Peak_depth  Genome Size  Used Bases      Used Reads   X
17  22,425,038,045  32          700,782,438  26,827,492,429  275,153,399  38.28
In total, 26.82 Gb of data was retained for 17-mer analysis. The 17-mer frequency distribution derived from the sequenced reads was plotted in Fig1; the peak of the 17-mer distribution is about 32 and the total K-mer count is 22,425,038,045, so the genome size can be estimated ( by
Heterozygosity with kmer graphs
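The quoted report text is cut off at "( by", so the exact formula it used is not shown. A minimal sketch, assuming the common estimate (genome size = total k-mer count / peak depth) and assuming every read contributes (length - k + 1) k-mers, reproduces the numbers in Table4.2. The function names are illustrative, not taken from any tool:

```python
def kmer_count(used_bases: int, used_reads: int, k: int) -> int:
    """Total k-mers in a read set, assuming each read of length L yields L - k + 1 k-mers."""
    return used_bases - used_reads * (k - 1)


def genome_size_from_kmers(total_kmers: int, peak_depth: int) -> int:
    """Assumed estimate: total k-mer count divided by the depth of the main peak."""
    return total_kmers // peak_depth


# Values copied from Table4.2 on the slide
K = 17
USED_BASES = 26_827_492_429
USED_READS = 275_153_399
PEAK_DEPTH = 32

total_kmers = kmer_count(USED_BASES, USED_READS, K)
genome_size = genome_size_from_kmers(total_kmers, PEAK_DEPTH)
base_coverage = USED_BASES / genome_size  # the "X" column of the table

print(total_kmers)              # 22425038045
print(genome_size)              # 700782438
print(round(base_coverage, 2))  # 38.28
```

All three printed values match the table, which suggests (but does not prove) that the report used the same relation.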
57. Fig4.2 Hybrid effect on K-mer distribution.
[Figure: 17-mer depth curves. x-axis: Depth (X), 0 to 80; y-axis: Percentage, 0 to 2; curves: Epi, H_0.01067, H_0.012, H_0.015]
The X axis is the depth of 17-mer and the Y axis is the ratio of 17-mer. Epi is the 17-mer curve of herring. H_0.01067 means that the heterozygosity rate is 1.067%, H_0.012 is 1.2% and H_0.015 is 1.5%.
From this figure, we can see that as the heterozygosity rate increases, the sub-peak becomes more apparent at half of the expected K-mer depth on the X axis. We conclude that the heterozygosity rate of the herring genome is about 1.5%.
Heterozygosity with kmer graphs
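Why a modest heterozygosity rate produces such a visible sub-peak can be sketched with a small calculation. This is an illustration under simplifying assumptions (heterozygous sites are single-base differences, independent and evenly spread along the genome), not the method used in the quoted report; the function names are hypothetical:

```python
def heterozygous_kmer_fraction(het_rate: float, k: int) -> float:
    """Fraction of k-mer positions overlapping at least one heterozygous site.

    Assumes independent, evenly spread single-base differences (a simplification).
    """
    return 1.0 - (1.0 - het_rate) ** k


def expected_subpeak_depth(main_peak_depth: float) -> float:
    """Haplotype-specific k-mers are seen at about half the main peak depth."""
    return main_peak_depth / 2.0


# Heterozygosity rates shown in Fig4.2, with k = 17 as on the slide
for rate in (0.01067, 0.012, 0.015):
    print(rate, round(heterozygous_kmer_fraction(rate, 17), 3))

# Main peak depth of 32 from Table4.2
print(expected_subpeak_depth(32))
```

Under these assumptions, at a rate of 1.5% roughly a fifth of all 17-mer positions overlap a heterozygous site, so a sizeable share of k-mers moves from the main peak to the half-depth sub-peak.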
58. Fig4.1 17-mer depth distribution, Table4.2 17-mer Data statistics and Fig4.2 Hybrid effect on K-mer distribution (repeated from the previous slide)
The heterozygosity was estimated to be 1.5%
Heterozygosity with kmer graphs
66. Repeat library from low coverage data
Quantify your repeat seqs
R R R’ R R’’
Independent set of sparse data
Screen reads with repeat seqs
33% of all bases in the reads are covered by repeat seqs
⇔
33% of the genome is “repeated”
Warning! The quantification depends heavily on the size of the original read set
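The quantification step on this slide can be sketched as follows. The sketch assumes the reads have already been screened against the repeat library and that repeat-covered bases are soft-masked as lowercase letters; that input convention and the function name are assumptions made for illustration:

```python
def repeat_fraction(masked_reads):
    """Fraction of all read bases covered by repeat seqs (lowercase = covered)."""
    repeat_bases = 0
    total_bases = 0
    for read in masked_reads:
        total_bases += len(read)
        repeat_bases += sum(1 for base in read if base.islower())
    if total_bases == 0:
        raise ValueError("no bases in the read set")
    return repeat_bases / total_bases


# Toy read set: 4 + 4 + 0 = 8 repeat-covered bases out of 30
toy_reads = ["ACGTacgtAC", "ttttGGGGCC", "ACGTACGTAC"]
fraction = repeat_fraction(toy_reads)
print(f"{fraction:.1%} of all bases in the reads are covered by repeat seqs")
```

If the reads are an unbiased sample of the genome, this fraction is also the estimate of the repeated fraction of the genome, with the caveat from the slide about the size of the read set used to build the library.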
71. [Figure: coverage distribution for 454, Illumina and SOLiD data, with the average coverage marked. x-axis: Coverage (0 to 100); y-axis: Number of Mb's in hg19 (0 to 350)]
• Stephan C. Schuster (Penn U)
75. Short Reads (Illumina) - graph assembly
adapter removal, quality trimming, error correction
de Bruijn or string graph construction
contigs
scaffolding with read pairs (gaps shown as NNNNNN)
read mapping
Long Reads (PacBio) - HGAP assembly
reads (read length)
read self-correction
overlap-layout-consensus assembly
consensus calling with quiver
assembled genome
[Figure: error-containing reads aligned for consensus calling]
ATCGTT-CCGAGTCTCCCCGCAATCGCAAGCG-TTTCAT
CGAGTCT-CGCGCAATCGCAAGCG-TTTC
ATCGTT-CCGAGTCTCCCCGCCATC
TT-CCGAGACTCCCCGCAATCGCAAGCGATT
GTTTCCGAGTCTCCCCGCAATCGCTAGCG-TTGCAT
1 pre-processing 2 assembly 3 finishing/polishing
the overall assembly strategy is the same…
…but the data and tools are fundamentally different
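To make the "de Bruijn graph construction" step of the short-read track concrete, here is a minimal sketch. It is a toy illustration, not how any production assembler is implemented: it ignores reverse complements, sequencing errors and coverage thresholds, and the function name is made up for this example:

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it.

    One edge is added per k-mer occurrence, so repeated k-mers give repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads
reads = ["ATCGTT", "CGTTCC"]
graph = de_bruijn_graph(reads, k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Contigs correspond to unbranched paths in such a graph. The branch at node TC in this toy example (followed by CG in one place and CC in another) is the kind of ambiguity that repeats and heterozygosity create at genome scale.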
100. other options for assembling PacBio reads
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads
104. • PacBio data cannot (currently) be assembled in its raw state
• several strategies exist for correcting reads prior to assembly
• correction without complementary technology used to be difficult
– until recently, it was limited by computational power and SMRT cell throughput
PacBio data is noisy
Koren & Phillippy Curr Op Micro 2014
106. Hybrid assemblers
other options for assembling PacBio reads
Zimin A.V., Marçais G., Puiu D., Roberts M., Salzberg S.L., Yorke J.A. Bioinformatics (2013)
108. Pure PacBio
[Figure: the Illumina graph assembly and PacBio HGAP assembly workflows, repeated from slide 75]
1 pre-processing 2 assembly 3 finishing/polishing
the overall assembly strategy is the same…
…but the data and tools are fundamentally different
122. LETTER doi:10.1038/nature15714
Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum
Robert VanBuren1*, Doug Bryant1*, Patrick P. Edger2,3, Haibao Tang4,5, Diane Burgess2, Dinakar Challabathula6†, Kristi Spittle7, Richard Hall7, Jenny Gu7, Eric Lyons4, Michael Freeling2, Dorothea Bartels6, Boudewijn Ten Hallers8, Alex Hastie8, Todd P. Michael9 & Todd C. Mockler1
Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly1. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE)2. Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetium genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a ‘near-complete’ draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. The Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.
The genomes of Arabidopsis3, rice4, poplar, grape and Sorghum5 were first sequenced using high-quality and reiterative Sanger-based approaches producing a series of ‘gold standard’ reference genomes. The advent of next-generation sequencing (NGS) technologies reduced [...] and comparative genomics, although draft genomes are now available for most agriculturally important grasses1. The largest genome assemblies, such as maize (2,300 megabases (Mb))7, barley (5,100 Mb)8 and wheat (hexaploid, 17,000 Mb)9 are highly fragmented as a result of the inability of current sequencing technologies to span complex repeat regions. Near-finished reference genomes are available for rice4, Sorghum5 and Brachypodium10, but more high-quality grass genomes are needed for comparative genomics and gene discovery. Here we present the ‘near-complete’ draft genome of the grass Oropetium thomaeum, the first high-quality reference genome from the Chloridoideae subfamily. The draft genome is near complete because we were able to sequence through complex repeat regions that are unassembled in most draft genomes. Oropetium has the smallest known grass genome at 245 Mb and is also a resurrection plant that can survive the extreme water stress such as loss of >95% of cellular water (Fig. 1)11.
Single-molecule real-time (SMRT) sequencing (Pacific Biosciences) produces long and unbiased sequences, which enables assembly of complex repeat structures and GC- and AT-rich regions that are often unassembled or highly fragmented in NGS-based draft genomes. We generated ~72× sequencing coverage of the Oropetium genome using 32 SMRT cells on the PacBio RS II platform (which is equivalent to <1 week of sequencing time and <US$10,000 in reagents). The resulting sequence had a read N50 length of over 16 kilobases (kb), and there was 10× coverage of reads over 20 kb in length (Extended Data Fig. 1a). The raw reads were error-corrected using the hierarchical genome assembly process (HGAP), and the longest reads (>16 kb) were assembled using Celera assembler followed by two rounds of genome polishing using Quiver12. The assembly contains 650 contigs spanning 99% (244 Mb)
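The paper quoted above reports both a contig N50 and a read N50. As a reminder of what that statistic means, here is a minimal sketch using the usual definition (the length L such that sequences of length L or longer contain at least half of all bases). The toy lengths are made up and the function name is illustrative:

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one length")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (arbitrary units), total = 300
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70, because 80 + 70 = 150 reaches half of 300
```

The same function applies to read lengths, which is how a "read N50" such as the one quoted in the paper is defined.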
143. Things that are not there
[Figure: chromosome ideogram heatmap for chromosomes 1 to 22 and X (scale bar 100 Mb). Legend: STR density from low to high; markers for closed gap, inversion and complex event]
Extended Data Figure 3 | Genome distribution of closed gaps and insertions. Chromosome ideogram heatmap depicts the normalized density of inserted CHM1 base pairs per 5-Mb bin with a strong bias noted near the end of most chromosomes. Locations of structural variants and closed gaps are given by coloured diamonds to the left of each chromosome: closed gap sequences (red), inversions (green), and complex events (blue).
Chaison M.J.P. et al. Nature (2014)
[...] by high-throughput DNA sequencing (ChIP-seq) analysis (Supplementary Information). We identified a significant 15-fold enrichment of short tandem repeats (STRs) when compared to a random sample (P < 0.00001) (Fig. 1a). A total of 78% (39 out of 50) of the closed gap sequences were composed of 10% or more of STRs. The STRs were frequently embedded in longer, more complex, tandem arrays of degenerate repeats reaching up to 8,000 bp in length (Extended Data Fig. 1a–c), some of which bore resemblance to sequences known to be toxic to Escherichia coli16. Because most human reference sequences17,18 have been derived from clones propagated in E. coli, it is perhaps not surprising that the application of a long-read sequence technology to uncloned DNA would resolve such gaps. Moreover, the length and complex degeneracy of these STRs embedded within (G+C)-rich DNA probably thwarted efforts to follow up most of these by PCR amplification and sequencing.
Next, we developed a computational pipeline (Extended Data Fig. 2) to characterize structural variation systematically (structural variation defined here as differences ≥50 bp in length, including deletions, duplications, insertions and inversions7). Structural variants were discovered by mapping SMRT sequencing reads to the human reference genome11
[Figure 1, two panels. a: proportion of region with simple repeats (0 to 1) for Gaps versus Reference, P < 2.2 × 10^-16. b: (G+C) content (0 to 100) for reference flank, gap closure and tandem repeat, with P = 0.02712, P = 0.00003 and P < 0.00001]
Figure 1 | Sequence content of gap closures. a, Gap closures are enriched for simple repeats compared to equivalently sized regions randomly sampled from GRCh37. b, Human genome gaps typically consist of (G+C)-rich sequence (yellow) flanking complex (A+T)-rich STRs (green) (empirical P value; Supplementary Information). Red line indicates genomic (G+C) content.
144. Things that are not there
Steinberg K.M. et al. Genome Research (2014)
Figure 5. Overview of the Chr 11 (NC_018922.2) 1.9-Mb region, exhibiting three alignment bins with a large number of PacBio “cliff” reads where the alignment coverage dropped off sharply. WGS component (light green lines) boundaries flanked by such reads are marked with red dashed lines. The ends of each component at the boundary are labeled with letters to show orientation. Pairs of alignments corresponding to three different PacBio reads are marked in yellow, green, and dark blue. These alignments overlap by < 10% on each of the reads. The split alignments for these three reads suggest that the two WGS components marked in purple should be inverted and translocated as indicated by the arrow at the top of the image. The other PacBio reads in these bins exhibit the same pattern of split alignments, which supports the proposed reordering and orientation of the WGS components. The bottom light green lines show a proposed tiling path with the orientation corrected; the letters indicate where each end of the initial tiling path components should be placed.
CHM1 assembly of the human genome
145. Summary
• Genome size and repeat content can be estimated without an assembly.
• Removing adapters and trimming low-QV bases is good unless the assembly program does error correction (EC) itself.
• Assess the level of heterozygosity in your target genome before you assemble (or sequence) it and set your expectations accordingly.
• Choose an assembler that excels in the area you are interested in (e.g., coverage, continuity) and make libraries for it.
• Interested only in coding potential analyses (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs)? => Consider studying exome assemblies.
• Or consider a proxy: study a species that is evolutionarily close enough and whose genome is of good quality.
146. Summary
• Genome size and repeat content can be estimated without an assembly.
• Removing adapters and trimming low-QV bases is good unless the assembly program does error correction (EC) itself.
• Assess the level of heterozygosity in your target genome before you assemble (or sequence) it and set your expectations accordingly.
• Choose an assembler that excels in the area you are interested in (e.g., coverage, continuity, or number of error-free bases).
• Interested only in coding potential analyses (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs)? => Consider studying exome assemblies.
• Or consider a proxy: study a species that is evolutionarily close enough and whose genome is of good quality.
Settle on an assembly so science can continue!