Helsinki genome project-20151210-amb

Things to consider when
initiating a genome project
the assembly pipeline @ SciLifeLab
Helsinki, Dec 9th 2015
Álvaro Martínez Barrio, PhD
Alvaro.Martinez.Barrio@scilifelab.se
linkedin.com/in/ambarrio
@ambarrio

Workshop Outline
• Introducing SciLifeLab
• The important considerations of all genome
projects
• The annotation and assembly platforms
• A vision into the future

Survey
• How many of you have used sequencing
facilities?
• Assembled a genome?
• Planning to start a genome project?
• Have worked with NGS data?
• Just curious about NGS?

Things to consider
• Repeats
• Heterozygosity
• Size of your genome
• GC content
• Access to material and specifically HMW DNA
• Access to a good computational cluster
• Good bioinformaticians / lab technicians

Things to consider
• Repeats
• Heterozygosity
WHAT IS YOUR SCIENTIFIC QUESTION?

Variation space
• Repeats
• Heterozygosity

Things to consider
• Repeats
• Heterozygosity
http://www.intechopen.com/books/recent-advances-in-autism-
spectrum-disorders-volume-i/discovering-the-genetics-of-autism

Things to consider
• Repeats
• Heterozygosity
Alkan C., Coe B.P., Eichler E.E.. Nature Rev Genetics (2011)

Things to consider
• Repeats
• Heterozygosity
Ward L.D. & Kellis M.
Nat Biotechnology (2012)

About me
Álvaro Martínez Barrio, PhD
Alvaro.Martinez.Barrio@scilifelab.se
linkedin.com/in/ambarrio
@ambarrio
• PhD Bioinformatics 2010
• Postdoc Pop Genetics / Comp Biol 2014,
L. Andersson + H. Ronne
• Herring: Illumina, SOLiD, Moleculo, PacBio
• Species Plant: 454, SOLiD, Illumina
• Species Seal (~3Gb): Illumina
• Species Beetle: Illumina, PacBio

Figure 1 | Cost-effectiveness of Pool-seq. The accuracy of allele frequency estimates is compared for whole-genome
sequencing of pools of individuals (Pool-seq) and whole-genome sequencing of individuals using the ratio of the
standard deviation (SD) of the estimated allele frequency with both methods. The same number of reads is used for
both sequencing strategies. A value smaller than one indicates that Pool-seq is more accurate than sequencing of
individuals. a | The influence of the pool size is shown. A larger pool size results in higher accuracy of Pool-seq, but
Pool-seq still produces more accurate allele frequency estimates even for pool sizes of 50 individuals in most […]
[Plot, panels a and b: x axis, number of individuals sequenced separately (10 to 50); y axis, SDpool/SDindividuals (0.4 to 1.1).
Legend with varying pool size (500, 100, 50): coverage per sequenced individual 5×, deviation in DNA content from each individual in the pool 30%.
Legend with pool size 100: coverage per sequenced individual / deviation in DNA content of 20× / 0%, 20× / 30%, 5× / 30% and 1× / 30%.]
Schlötterer C., Tobler R., Kofler R. and Nolte V. Nature Rev Genetics (2014)
Why pooling?

Schlötterer C., Tobler R., Kofler R. and Nolte V. Nature Rev Genetics (2014)

SciLifeLab (promotion slides)
SciLifeLab
National service
Local scientific
center
Director (July 2015)
Olli Kallioniemi
Co-director
Kerstin Lindblad-Toh
Vision:
To be an internationally leading center that
develops, uses and provides access to
advanced technologies for molecular
biosciences with focus on health and
environment.
www.scilifelab.se
2010: Strategic research initiative
2013: National resource
2015: New management and chairman

SciLifeLab platforms
SciLifeLab
National
Genomics
Infrastructure
National
Bioinformatics
Infrastructure
Sweden
Joakim Lundeberg
Ann-Christine Syvänen
Ulf Gyllensten
Bengt Persson
Clinical
Diagnostics
…
Lars Engstrand
Computer
resources
free for
Swedish
researchers
VR
SNIC
Ongoing merge of BILS,
WABI and more; complete
2016.
National, distributed

NBIS - We’re here for you!

The Bioinformatics Platform 2016
Funding
•  The Research
Council
•  SciLifeLab
•  KAW foundation
•  Host universities
Applied at the Research Council as continued
national infrastructure 2016-2023. Decision late 2015.
Custom-tailored support Tools Training
Today
~70 FTE

Long-term Support
Wallenberg Advanced Bioinformatics Infrastructure
www.scilifelab.se/facilities/wabi/
Björn Nystedt Thomas Svensson
Tailored solutions – high impact
Siv Andersson, Gunnar von Heijne
Applied bioinformatics: 500h free support/project
•  Variant analyses in health and disease
•  Transcriptomics
•  Single-cell analyses
•  Epigenetics
•  Metagenomics
Directors
Managers
Sweden's strongest unit for analyses of
large-scale genomic data (24 FTE)
National committee reviews and selects
projects based on scientific quality
Staff in Stockholm, Uppsala, Lund,
Gothenburg, Linköping, Umeå.

WABI personnel (2013-2014)
Johan Reimegård Mikael Huss Åsa Björklund Pär Engström Jakub
Orzechowski
Westholm
Estelle Proux-
Wéra
Sanella
Kjellqvist
Diana Ekman Pall Olason Anna Johansson Marcel Martin
Alvaro Martinez
Barrio
Per Unneberg

Today: Human genome sequenced in days
- towards $1000 genome
Massively parallel sequencing…
…requires supercomputers
for analysis and storage

SciLifeLab Bioinformatics Compute and Storage (UPPNEX)
1. Sample transfer
2. Data delivery
3. Analysis
Scientists
High-performance computers and
large-scale storage for
bioinformatics analysis.
www.uppmax.uu.se/uppnex

How do you work on UPPMAX computers?
Login
Submit jobs
Job queue
Job assigned
Work interactively

Resources
Research ~8000 cores
Production ~3200 cores
Redundancy 768 cores
Storage ~11 PB
2015
Long-term storage
Mosler 384 cores
Research 3328 cores
Production 768 cores
Storage ~7 PB
Long-term storage
2014
Mosler 384 cores
Private Cloud 1600 cores
Chipster, CanvasDB

Project growth
[Plot: number of active projects per year, 2004 to 2013 (y axis 100 to 400), for UPPMAX and UPPNEX.]

UPPNEX history
2009: 13,152 MSEK from KAW and SNIC
2012: 23.8 MSEK from KAW/SNIC
2014: 20 + 20 MSEK from KAW for WGS
      SNIC receives 47.8 MSEK from VR to handle sensitive data

Olga Vinnere Pettersson (UGC)
olga.pettersson@igp.uu.se
mp4 - http://bit.ly/1Ul7RmH
pptx - http://bit.ly/1Z6yIFH
Q&A - http://bit.ly/1I1Sb6o

Know how to measure your assembly results

Just a word on N50…
N50 typically refers to a contig (or scaffold) length

But…
•  The original definition is the number of contigs needed to reach half
of the genome size (L50 is the length)
•  Many programs use the total assembly size as a proxy for the
genome size; this is sometimes completely misleading: Use NG50!
•  PIs don’t understand N50 anyway; use something more intuitive:
- contigs larger than 1 kbp sum to 93% of the genome size

[Figure: a genome and its assembly drawn side by side, marking genome size, assembly size, NG50 and N50; labels: 3 contigs, 100 kbp; 5 contigs, 30 kbp.]
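The difference between N50 and NG50 is easiest to see in a few lines of code. The following is a minimal sketch in Python, using today's common convention in which N50/NG50 is a length and the contig count is reported alongside it; the function names `nx50` and `fraction_in_long_contigs` are hypothetical helpers made up for this illustration, not part of any assembler or evaluation tool.

```python
def nx50(lengths, total=None):
    """Return (length, contig_count) at which the sorted contigs reach half of `total`.

    With total=None the assembly size is used (N50);
    pass the estimated genome size to get NG50 instead.
    """
    ordered = sorted(lengths, reverse=True)
    target = (sum(ordered) if total is None else total) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    # The assembly is shorter than half of the genome: NG50 is undefined.
    return None, None


def fraction_in_long_contigs(lengths, genome_size, min_len=1000):
    """The 'more intuitive' number: share of the genome in contigs longer than min_len."""
    return sum(l for l in lengths if l > min_len) / genome_size


contigs = [100_000, 50_000, 30_000, 20_000, 10_000]
print(nx50(contigs))                 # N50 against the assembly size (210 kbp)
print(nx50(contigs, total=400_000))  # NG50 against a 400 kbp genome
```

When the assembly is much smaller than the genome, the two values diverge, which is the misleading case the slide warns about.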

Know why assembling is difficult

Two types of assemblies
Case 1 : Flycatcher (1.2 Gbp)
Herring (800 Mbp)
Malassezia (7 Mbp)
Case 2 : Spruce (20 Gbp)
Barnacle (1.4 Gbp)
Wolbachia (4 Mbp)
Two types of assemblies

Pre-assembly
• Quality trimming
• (Error correction)
• Kmer analysis
• De novo repeat library

Quality trimming
De Bruijn graph assemblers are in principle sensitive to errors
since they do not take base quality values into account
•  Trim adapters (e.g. Cutadapt)
•  Filter on quality, both 5’ and 3’ end! (e.g. Trimmomatic)
•  Consider hard-trimming of 5’ end
•  Error correction (e.g. Quake)
•  Inspect (e.g. FastQC)
Plots by Olof Karlberg
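To make "filter on quality, both 5’ and 3’ end" concrete, here is a deliberately simple Python sketch. It assumes Phred+33 encoded quality strings and a threshold of Q20, both of which are assumptions for the example; it only strips low-quality bases from the two ends and is not how Trimmomatic or Cutadapt work internally (those tools offer sliding windows, adapter matching and more).

```python
def trim_read(seq, qual, min_q=20, phred_offset=33):
    """Strip bases with quality below min_q from both the 5' and the 3' end."""
    scores = [ord(c) - phred_offset for c in qual]
    start = 0
    while start < len(scores) and scores[start] < min_q:
        start += 1  # walk in from the 5' end
    end = len(scores)
    while end > start and scores[end - 1] < min_q:
        end -= 1    # walk in from the 3' end
    return seq[start:end], qual[start:end]


# '#' encodes Q2 and 'I' encodes Q40 in Phred+33
print(trim_read("ACGTACGT", "##IIII##"))
```

In practice one would also discard reads that become too short after trimming and keep read pairs in sync.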

Reads vs kmers
1 read: L = 100 bp
Kmers: k = 21 bp
Number of kmers per read: N = L − k + 1 = 100 − 21 + 1 = 80

Kmer coverage = Base coverage × (L − k + 1) / L

Ex: 50X × (100 − 21 + 1) / 100 = 40X (i.e. kmer coverage is 80% of base coverage)
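The conversion between base coverage and kmer coverage can be written as two small functions. This is a sketch of the formula on the slide only; the function names are made up for the example, and the formula assumes that all reads have the same length L.

```python
def kmers_per_read(read_len, k):
    """Number of kmers contained in one read of length read_len."""
    return read_len - k + 1


def kmer_coverage(base_cov, read_len, k):
    """Kmer coverage implied by a given base coverage."""
    return base_cov * kmers_per_read(read_len, k) / read_len


def base_coverage(kmer_cov, read_len, k):
    """Inverse conversion: base coverage implied by a kmer coverage peak."""
    return kmer_cov * read_len / kmers_per_read(read_len, k)


print(kmers_per_read(100, 21))     # 80 kmers in a 100 bp read
print(kmer_coverage(50, 100, 21))  # 50X base coverage seen as kmer coverage
print(base_coverage(55, 100, 21))  # a kmer peak at 55 seen as base coverage
```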

Kmer analyses
Compute the frequency of each
kmer in the dataset
(e.g. Jellyfish --both-strands)

Note: RAM-intense!
How to count kmers?
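For intuition about what a kmer counter does, the following Python sketch counts canonical kmers (a kmer and its reverse complement are counted together, which is what counting on both strands amounts to) and builds the frequency histogram. It is a toy for small inputs, not a replacement for Jellyfish: a plain dictionary will run out of memory on real data, which is the "RAM-intense" point above. The function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def count_kmers(reads, k):
    """Count canonical kmers over all reads, skipping kmers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts):
    """Map each depth to the number of distinct kmers seen that many times."""
    return Counter(counts.values())


counts = count_kmers(["ACGTT"], 3)
print(counts)
print(kmer_histogram(counts))
```

The histogram (depth versus number of distinct kmers) is the input for the genome size, repeat and heterozygosity estimates that follow.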

Digging into the kmers
Genome size
•  Remove low-copy kmers
•  Identify the coverage peak
•  Divide total nb of kmers by peak

Genome size = Ktot / Cpeak
Here: 80 G / 55 ≈ 1.4 Gbp
Note: Ktot = Nb reads × (L − k + 1)

Cpeak: “20 million distinct kmers occur
55 times in all reads combined”

Base coverage = Cpeak / ((L − k + 1) / L)
Here: 55 / ((100 − 21 + 1) / 100) ≈ 69X
Interpreting kmer graphs (1/2)
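The three steps above (drop low-copy kmers, find the coverage peak, divide the total number of kmers by the peak) can be sketched as follows. The input is assumed to be a histogram mapping depth to the number of distinct kmers at that depth; the cutoff `min_depth=5` for "low-copy" (likely error) kmers is an arbitrary assumption for the example and should be chosen by looking at the actual histogram.

```python
def estimate_genome_size(hist, min_depth=5):
    """Estimate genome size as Ktot / Cpeak from a kmer depth histogram.

    hist: dict mapping depth -> number of distinct kmers observed at that depth.
    Returns (estimated_size, peak_depth).
    """
    # Step 1: remove low-copy kmers, which are mostly sequencing errors.
    kept = {depth: n for depth, n in hist.items() if depth >= min_depth}
    # Step 2: the coverage peak is the depth shared by the most distinct kmers.
    peak = max(kept, key=kept.get)
    # Step 3: total number of kmer occurrences divided by the peak depth.
    total_kmers = sum(depth * n for depth, n in kept.items())
    return total_kmers / peak, peak


toy_hist = {1: 1000, 2: 100, 9: 10, 10: 50, 11: 10, 20: 5}
print(estimate_genome_size(toy_hist))
```

With a heterozygous sample the histogram can have two peaks, in which case picking the tallest one blindly may select the wrong peak.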

Repeats: first shot
The nb of distinct kmers in
the single-copy peak
corresponds roughly to the
single-copy genome size
[Plot labels: Repeats, Single-copy]
Example
Beetle: 0.75 Gbp is single-copy, so almost
40% of the 1.2 Gbp genome is repeated
(kmer=27)
Interpreting kmer graphs (2/2)

Heterozygosity and ploidy
…and humans are easy.
Bacteria, archaea,
fungi, some plants
Most animals,
some plants
Many plants
Also: Heterozygosity is generally very low in mammals;
most other species are much harder

Heterozygosity with kmer graphs
Double peak in the kmer histogram; clear indication of heterozygosity
Not entirely easy to quantify (although attempts have been made)

A word on quality filtering…
Light QC filter Hard QC filter
A word of precaution on quality filtering!


Fig4.1 17-mer depth distribution
[Plot: x axis, Depth (X), 0 to 100; y axis, Percentage (%), 0 to 2.]
Table4.2 17-mer Data statistics
K    K-mer_NO.        Peak_depth  Genome Size   Used Bases       Used Reads    X
17   22,425,038,045   32          700,782,438   26,827,492,429   275,153,399   38.28
A total of 26.82 Gb of data was retained for 17-mer analysis. The 17-mer frequency distribution
derived from the sequenced reads is plotted in Fig4.1. The peak of the 17-mer distribution is at about
32 and the total K-mer count is 22,425,038,045, from which the genome size can be estimated (Table4.2).
Fig4.2 Hybrid effect on K-mer distribution.
The X axis is the depth of 17-mer and the Y axis is the ratio of 17-mer. Epi is the 17-mer
curve of herring. H_0.01067 means that the heterozygosis rate is 1.067%, H_0.012 is
1.2% and H_0.015 is 1.5%.
From this figure, we can see that as the heterozygosis rate increases, the sub-peak
becomes more apparent at half of the expected K-mer depth on the X axis. We
conclude that the heterozygosis rate of the herring genome is about 1.5%.
[Plot: x axis, Depth (X), 0 to 80; y axis, Percentage (%), 0 to 2; curves H_0.01067, Epi, H_0.012, H_0.015.]

The heterozygosity was estimated to be 1.5%

Repeats: first shot
The nb of distinct kmers in
the single-copy peak
corresponds roughly to the
single-copy genome size
[Plot labels: Repeats, Single-copy]
Example
Beetle: 0.75 Gbp is single-copy, so almost
40% of the 1.2 Gbp genome is repeated
(kmer=27)
Estimating repeats with kmer graphs

Why repeats destroy assemblies
Genome assembly - things to think about

Repeat library and repeat quantification
Create a de novo repeat library
•  Run a low-coverage (e.g. 0.1X) assembly (e.g. RepeatExplorer or Trinity)
•  Filter contaminants and mito/chloro
•  [ Make non-redundant (e.g. CD-HIT) ]
•  Quantify the (high) repeat content by an independent subset of reads
- Mapping (e.g. bwa), or
- Mask with RepeatMasker

Repeat library and repeat quantification
A real example
[Plot: coverage vs %GC]
5 Mbp mitochondrion in spruce

Repeat library from low coverage data
R R R’ R R’’
Overlaps?
Sparse
seq data
Assembled contigs
Warning! Beware of contaminations, plastids etc

Repeat library from low coverage data
Quantify your repeat seqs
R R R’ R R’’
Independent
set of sparse
data
Screen reads with
repeat seqs
33% of all bases in the reads are covered by repeat seqs
⇔
33% of the genome is “repeated”
Warning! The quantification depends heavily on the size of the original read set
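The quantification step boils down to "what fraction of all read bases is covered by hits to the repeat library". A minimal Python sketch of that bookkeeping is shown below. The input format (read length plus a list of half-open hit coordinates per read) is an assumption for the example; in practice the hits would be parsed from a mapper or RepeatMasker output, which is not shown here.

```python
def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def repeat_fraction(reads):
    """reads: iterable of (read_length, [(start, end), ...]) repeat hits per read.

    Returns the fraction of all read bases covered by at least one repeat hit.
    """
    covered = total = 0
    for length, hits in reads:
        total += length
        covered += sum(end - start for start, end in merge_intervals(hits))
    return covered / total if total else 0.0


reads = [
    (100, [(0, 30), (20, 50)]),  # two overlapping hits, 50 bases covered
    (100, []),                   # no repeat hit
    (100, [(10, 60)]),           # 50 bases covered
]
print(repeat_fraction(reads))
```

Merging overlapping hits first avoids counting the same base twice when several library entries match the same part of a read.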

Classifying repeats
LTR Gypsy/Copia
LINE/SINE
DNA elements
…
This is very tricky…

Classifying the repeat library directly
•  RepeatMasker
•  Repeat protein domain search (http://www.repeatmasker.org/cgi-bin/RepeatProteinMaskRequest)
Problems
•  No close homologs in databases
•  Rapid evolution of repeats (like transposable elements, TEs)
•  Non-autonomous TEs do not contain proteins

Solutions
•  Fetch intact ORFs from hits in assembly
•  Extend assembly matches and get more complete elements
•  Check match alignment profiles in assembly (LINEs conserved at 3’ end but not at 5’..)

=> Often slow, manual, species-specific solutions

[Plot: number of Mb in hg19 (y axis, 0 to 350) at each coverage (x axis, 0 to 100) for 454, Illumina and SOLiD, with the average coverage marked.]
• Stephan C. Schuster (Penn U)

Clark M.J., et al.
Nat. Biotech (2011)
Performance comparison of exome
DNA sequencing technologies.
(Mike Snyder’s lab)

Ning L., et al. Scientific
Reports (2015)

the overall assembly strategy is the same…
…but the data and tools are fundamentally different
Short Reads (Illumina) - graph assembly
1 pre-processing: adapter removal, quality trimming, error correction
2 assembly: de Bruijn or string graph construction → contigs
3 finishing/polishing: scaffolding of contigs with read pairs (NNNNNN gaps), read mapping
Long Reads (PacBio) - HGAP assembly
1 pre-processing: read self-correction
2 assembly: overlap-layout-consensus assembly
3 finishing/polishing: consensus calling with quiver → assembled genome

http://www.lucigen.com/NxSeq-Long-Mate-Pair-Library-Kit/

Many instruments… to
Assembler Name   Algorithm       Input
Arachne          OLC             Sanger
CAP3             OLC             Sanger
TIGR             Greedy          Sanger
Newbler          OLC             454/Roche
Edena            OLC             Illumina
SGA              OLC             Illumina
MaSuRCA          De Bruijn/OLC   Illumina
Velvet           De Bruijn       Illumina
ALLPATHS         De Bruijn       Illumina/PacBio
ABySS            De Bruijn       Illumina
SOAPdenovo       De Bruijn       Illumina
CLC              De Bruijn       Illumina/454
CABOG            OLC             Hybrid
•  Currently efforts ongoing to

OLC
• Pros: Can use longer reads properly
• Cons: Time consuming, high memory
requirements

Generate assembly via de Bruijn
Martin & Wang, Nat. Rev. Genet. (2011)

• Pros: Computationally efficient, can work with
large coverage short read datasets
• Cons: Sensitive to sequence errors, connection
between assembly and read is lost, does not
work so well with longer reads
De Bruijn
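A toy de Bruijn graph helps to see both the pros and the cons listed above: reads are chopped into kmers (so the link between assembly and individual reads is lost), and a single sequencing error would create new kmers and therefore spurious branches. The Python sketch below is an illustration only, with hypothetical function names; real assemblers add canonical kmers, coverage-based error pruning, bubble popping and much more, and the walk shown here ignores in-degree for simplicity.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each distinct kmer is one edge."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # each distinct kmer contributes a single edge
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig, node, visited = start, start, {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:  # stop on cycles (e.g. repeats)
            break
        visited.add(node)
        contig += node[-1]
    return contig


graph = de_bruijn(["ATGGCG", "GGCGTG"], k=4)
print(walk_unambiguous(graph, "ATG"))
```

The two overlapping reads are merged into one contig without ever computing a read-to-read overlap, which is why the approach scales to high-coverage short-read data.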

•  No easy way to determine best
assembly/assembler
•  Implemented heuristics are the
key issue
•  Choice of approach depends on
data being assembled

•  Current efforts are ongoing to
establish best practices
•  Assemblathons and GAGE to
evaluate existing solutions…

Some recommendations
• Large eukaryote genome, Illumina data: ALLPATHS-LG (needs
specific libraries), SOAPdenovo, SGA, MaSuRCA, DISCOVAR
• Large eukaryote genome, additional longer reads: MaSuRCA,
Newbler, CABOG
• Small eukaryote or prokaryote genome, Illumina data: SPAdes,
MaSuRCA, SOAPdenovo, ABySS, Velvet, DISCOVAR
• Small eukaryote or prokaryote genome, mixed data: MIRA,
SPAdes, MaSuRCA, Newbler
• Need to run in parallel: ABySS, Ray
• Amplified data (Single Cell Genomics): SPAdes

Standard contiguity metrics
Just a word on N50…
N50 typically refers to a contig (or scaffold) length

But…
•  The original definition is the number of contigs needed to reach half
of the genome size (L50 is the length)
•  Many programs use the total assembly size as a proxy for the
genome size; this is sometimes completely misleading: Use NG50!
•  PIs don’t understand N50 anyway; use something more intuitive

[Figure: a genome and its assembly drawn side by side, marking genome size, assembly size, NG50 and N50; labels: 3 contigs, 100 kbp; 5 contigs, 30 kbp.]

The devil is in the repeats
Repeats and Short Reads
C R A B
Mathematically best result:

Repeat errors
•  Overlapping non-identical reads
•  Collapsed repeats and chimeras
•  Wrong contig order
•  Inversions

ATCGGGTATATAG-CCTA
||||||| || || ||||
ATCGGGTGTACAGCCCTA

[Diagram: repeat copies A and B merged into a single copy “A & B”]
Collapsible repeat errors (worst!)

Know how to patch gaps/finalize

other options for assembling PacBio reads
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads

•  PacBio data cannot (currently) be assembled in its raw
state
•  several strategies exist for correcting reads prior to assembly
•  correction without complementary technology used to be
difficult
–  until recently, was limited by computational power and SMRT cell
throughput
PacBio data is noisy
Koren & Phillippy, Curr Opin Microbiol 2014

Hybrid assemblers (for PacBio)

Hybrid assemblers
Zimin A.V., Marçais G., Puiu D., Roberts M., Salzberg S.L., Yorke J.A. Bioinformatics (2013)

Hybrid assemblers
Zimin A.V., Marçais G., Puiu D., Roberts M., Salzberg S.L., Yorke J.A. Bioinformatics (2013)

Pure PacBio

Pure PacBio

Finishing/Polishing (Olli-Pekka)

Finishing/Polishing (Olli-Pekka)
quiver isn’t perfect
using Pilon to polish remaining indels
•  makes use of short read mapping to identify potential indels,
SNPs, ambiguous bases, local misassemblies
$ java -Xmx16G -jar path/to/pilon-1.8.jar \
    --genome path/to/fasta --unpaired path/to/mapping.bam \
    --output sample_name --changes --variant --tracks \
    --mindepth 100
Pilon removed 128 remaining indels in 3.8 Mbp genome despite

LETTER doi:10.1038/nature15714
Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum
Robert VanBuren, Doug Bryant, Patrick P. Edger, Haibao Tang, Diane Burgess, Dinakar Challabathula, Kristi Spittle,
Richard Hall, Jenny Gu, Eric Lyons, Michael Freeling, Dorothea Bartels, Boudewijn Ten Hallers, Alex Hastie,
Todd P. Michael & Todd C. Mockler
Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which
complicates genome assembly [1]. The short read lengths of early Sanger and current next-generation sequencing
platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented,
lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the
Encyclopedia of DNA Elements (ENCODE) [2]. Here we report the whole-genome sequencing and assembly of the
desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates
long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetium genome into
625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a ‘near-complete’ draft genome which
includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres,
transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466
protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest
known grass genome. The Oropetium genome demonstrates the utility of single-molecule real-time sequencing for
assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant
comparative genomics community.

The genomes of Arabidopsis [3], rice [4], poplar, grape and Sorghum [5] were first sequenced using high-quality and
reiterative Sanger-based approaches producing a series of ‘gold standard’ reference genomes. The advent of
next-generation sequencing (NGS) technologies reduced …

… and comparative genomics, although draft genomes are now available for most agriculturally important grasses [1].
The largest genome assemblies, such as maize (2,300 megabases (Mb)) [7], barley (5,100 Mb) [8] and wheat (hexaploid,
17,000 Mb) [9] are highly fragmented as a result of the inability of current sequencing technologies to span complex
repeat regions. Near-finished reference genomes are available for rice [4], Sorghum [5] and Brachypodium [10], but
more high-quality grass genomes are needed for comparative genomics and gene discovery. Here we present the
‘near-complete’ draft genome of the grass Oropetium thomaeum, the first high-quality reference genome from the
Chloridoideae subfamily. The draft genome is near complete because we were able to sequence through complex repeat
regions that are unassembled in most draft genomes. Oropetium has the smallest known grass genome at 245 Mb and is
also a resurrection plant that can survive the extreme water stress such as loss of >95% of cellular water (Fig. 1) [11].

Single-molecule real-time (SMRT) sequencing (Pacific Biosciences) produces long and unbiased sequences, which
enables assembly of complex repeat structures and GC- and AT-rich regions that are often unassembled or highly
fragmented in NGS-based draft genomes. We generated ~72× sequencing coverage of the Oropetium genome using
32 SMRT cells on the PacBio RS II platform (which is equivalent to <1 week of sequencing time and <US$10,000 in
reagents). The resulting sequence had a read N50 length of over 16 kilobases (kb), and there was 10× coverage of
reads over 20 kb in length (Extended Data Fig. 1a). The raw reads were error-corrected using the hierarchical genome
assembly process (HGAP), and the longest reads (>16 kb) were assembled using Celera assembler followed by two
rounds of genome polishing using Quiver [12]. The assembly contains 650 contigs spanning 99% (244 Mb)

Annotation (Jarkko)
BILS assembly and annotation service
Henrik Lantz: Team leader
Mahesh Panchal: Assembly
Jacques Dainat: Annotation
Martin Norling: Assembly
Lucile Soler: Annotation
5 PhDs, all in Uppsala
•  Annotation 2 years, assembly 1 year
•  Not driving own research, focusing on support
•  80 h of free support to all projects - submitted by customer
•  Dedicated compute cluster for annotation, ~160 cores
•  Assemblies run on shared cluster, ~3200 cores
•  All organisms - all types of data
•  Close contact with sequencing facilities


Annotation/Assembly technology
Assembly
Perl/Make pipeline
•  Pre-assembly
–  Quality control
–  kmer analyses
•  Assembly
–  Different assembly
programs
•  Assembly validation
–  FRCbam
–  Quast
–  Own tools
Annotation
•  Maker-MPI
–  proteins
–  RNA-seq
•  Refinement scripts
•  Functional annotation
–  Blast
–  Synteny

Assembly validation
Assembly validation… is it important?
Sometimes, easy questions are the most difficult:
•  Is my de novo assembly correct?
•  Which assembler do I need to use?
•  I just used all the possible assemblers one
can think of…. How do I pick one now?
•  Does my assembly contain genes?
•  Is my assembly good enough to
perform gene annotation?

Assembly validation
Assembly validation is extremely difficult
•  Too often only connectivity measures are used
•  There is no real solution, only a set of best practices
that one can follow

Recently a lot of attention on assembly validation:

Evaluating assemblies with a reference
Counting errors not always possible:
•  Reference almost always absent.
•  Error types are not weighted
accordingly.
Visualization is useful, however:
•  No automation
•  Does not scale on large genomes
WOW…. Looks like it is difficult even
with the answer

Evaluating assemblies without a reference
•  Statistics (N50, etc.)
•  Congruency with raw sequencing data:
–  Alignments
–  QAtools
–  FRCbam
–  REAPR
•  Gene space
–  CEGMA
–  reference genes
–  transcriptome
There is no real recipe, or a tool. We can only suggest some
best practice.

Your reads are often the best source to validate your
assemblies
Post assembly… am I on the right track?
•  Check your library insert sizes again, for both PE and MP libraries (Picard Tools, http://picard.sourceforge.net)
[Plots: insert size histograms for All_Reads (insert size vs count; FR, RF and TANDEM orientations) in
MP_on_masurca_sorted.bam, PE_on_masurca_sorted.bam and
7_130425_AD1YUEACXX_P469_101_index12_trimmed−to−assembly.abyss.scaf_onlyAligned.bam]
•  Failed MP or bad assembly?
Look at the plots and at the tables; duplication rate is an important measure.
You need to check if the plot(s) coincide with what you expect.
•  Plot coverage vs %GC vs length
[Plots: coverage vs GC, coverage histogram, and contig length (kbp) vs coverage; labels: your genome, mitochondrion, contaminations]
Plotting coverage and GC content
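The length and %GC columns for such a plot can be computed directly from the assembly FASTA. The sketch below is a minimal, dependency-free Python example with hypothetical function names; the per-contig coverage would have to come from mapping the reads back to the assembly (not shown), after which the three values can be plotted against each other to spot organelles and contaminants.

```python
def parse_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # keep only the first word of the header as the contig name
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_stats(lines):
    """Return a list of (name, length, gc_fraction); Ns are ignored for the GC fraction."""
    rows = []
    for name, seq in parse_fasta(lines):
        seq = seq.upper()
        acgt = sum(seq.count(base) for base in "ACGT")
        gc = (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0
        rows.append((name, len(seq), gc))
    return rows


fasta = [">c1 some description", "ACGT", "GGCC", ">c2", "AATT"]
for row in contig_stats(fasta):
    print(row)
```

For a real assembly, pass an open file handle instead of the list, for example `contig_stats(open("assembly.fasta"))`, where the file name is a placeholder.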

Data congruency
Idea: Map read pairs back to assembly and look for discrepancies like:
•  no read coverage
•  no span coverage
•  too long/short pair distances
Reads can be aligned
back to the assembly to
identify “suspicious”
features.
But what do we do with these features?
FRCbam (Vezzi et al. 2012)

Data congruency
FRCbam (Vezzi et al. 2012)
Reads can be aligned back to
the assembly to identify
“suspicious” features.
Features
4 coverage-related features:
•  LOW_COV_PE, HIGH_COV_PE, LOW_NORM_COV_PE, and HIGH_NORM_COV_PE
4 features for compression/expansion events (CE stats):
•  COMPR_PE, STRECH_PE, COMPR_MP, and STRECH_MP
6 features on suspicious pair/mate orientations:
•  HIGH_SINGLE_PE, and HIGH_SINGLE_MP
•  HIGH_SPAN_PE, and HIGH_SPAN_MP
•  HIGH_OUTIE_PE, and HIGH_OUTIE_MP
[Diagrams: regions A, B and C with repeat copies R1 and R2 (R1,2 where merged); example reads AGAGCTAGC and AGATCTCGC]

FRCurve
FRCbam predicted “Assemblathon 2” outcome
The Feature Response Curve (FRCurve) characterizes the sensitivity
(coverage) of the sequence assembler as a function of its
discrimination threshold (number of features).
Feature Response Curve:
•  Overcomes limits of standard indicators (e.g. N50)
•  Captures trade-off between quality and contiguity
•  Deeply connected to ROC curves
•  Features can be used to identify problematic regions
•  Single features can be plotted to identify assembler-specific bias
[Plot: “Feature Space rhody TOTAL”; x axis, feature threshold (0 to 8,000); y axis, approximate coverage (%), 0 to 120; one curve per assembler: SGA, Ray, CLC, SOAPdenovo, ALLPATHS-LG PB, ABySS, MSRA-CA, CABOG PB, CABOG, VELVET, ALLPATHS-LG]
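The idea behind the curve can be sketched in a few lines of Python: sort the contigs from longest to shortest, accumulate their suspicious features, and report how much of the genome is covered by the contigs that fit under a given feature threshold. This is an illustration of the principle as described on the slide, under the assumption that per-contig feature counts are already available; it is not FRCbam's actual implementation, and the function names are hypothetical.

```python
def feature_response_curve(contigs, genome_size):
    """contigs: list of (length, n_features).

    Returns a list of (cumulative_features, coverage_percent) points,
    adding contigs from the longest to the shortest.
    """
    points = []
    features = length = 0
    for contig_len, n_features in sorted(contigs, key=lambda c: c[0], reverse=True):
        features += n_features
        length += contig_len
        points.append((features, 100.0 * length / genome_size))
    return points


def coverage_at(points, threshold):
    """Approximate genome coverage reachable with at most `threshold` features."""
    best = 0.0
    for features, coverage in points:
        if features > threshold:
            break
        best = coverage
    return best


curve = feature_response_curve([(500, 3), (300, 1), (200, 6)], genome_size=1000)
print(curve)
print(coverage_at(curve, 4))
```

An assembler whose curve rises steeply reaches high coverage with few suspicious features, which is the quality versus contiguity trade-off mentioned above.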

Features and PCA
[Plots: two PCA panels (PCA1 vs PCA2) of the assembly features, points labelled by organism: bifido, clap, clap19, copro, ecoli, egg, entero, eubac, fragilis, fuso7, fusonuke, kleb, rhody, staphylocossus, strep, swig, tim]
Assembled 18 bacterial genomes
with 11 assemblers
(Illumina + PacBio data)
PCA performed on features:
•  Assemblies of the same organism
(family) tend to cluster;
•  No clear difference when using
PacBio data;

REAPR (Hunt et al. 2013)
Uses same principle of FRCurve:
•  Identifies suspicious/erroneous
positions
•  Breaks assemblies in suspicious
positions
•  The “broken assembly” is more
fragmented but hopefully more
correct (REAPR cannot make
things worse…)

Conserved core (species) gene space
CEGMA (http://korflab.ucdavis.edu/datasets/cegma/)
HMMs for 248 core eukaryotic genes aligned to your assembly to assess completeness of gene space
“complete”: 70% aligned
“partial”: 30% aligned

Similar idea based on aa or nt alignments of
• Gold-standard genes from own species
• Transcriptome assembly
• Reference species protein set
Use e.g. GSNAP/BLAT (nt), exonerate/SCIPIO (aa)
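
The gene-space check above can be sketched in a few lines of Python. This is only an illustration of the idea, not CEGMA itself: the `Hit` record, the gene names, and the reading of the slide's cut-offs as "at least 70% / at least 30% of the gene length aligned" are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    gene: str         # core-gene identifier (made-up names in the example)
    gene_len: int     # length of the gene or protein model
    aligned_len: int  # number of its positions aligned to the assembly


def classify(hit, complete_cut=0.70, partial_cut=0.30):
    """Label one gene by the fraction of its length that aligned."""
    frac = hit.aligned_len / hit.gene_len
    if frac >= complete_cut:
        return "complete"
    if frac >= partial_cut:
        return "partial"
    return "missing"


def summarize(hits):
    """Return {label: (count, percentage)} over all genes in `hits`."""
    counts = Counter(classify(h) for h in hits)
    total = len(hits)
    return {label: (counts[label], 100.0 * counts[label] / total)
            for label in ("complete", "partial", "missing")}


# Toy input: three hypothetical genes with different alignment coverage.
hits = [Hit("geneA", 400, 390), Hit("geneB", 300, 120), Hit("geneC", 500, 20)]
report = summarize(hits)
```

The same summary works whether the aligned lengths come from HMM hits, GSNAP/BLAT nucleotide alignments, or exonerate/SCIPIO protein alignments, as long as they are expressed in the same units as `gene_len`.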

Other external validation methods
• Restriction map
◦ Representation of the cut sites on a given DNA molecule to provide spatial information on genetic loci
Optical maps can be used to check assembly correctness.
Long PacBio reads can be used as well.
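
To make the restriction-map idea concrete, here is a minimal sketch of an in-silico digest: it predicts fragment lengths for a contig and compares them with measured fragment sizes. The function names, the toy contig, and the 10% size tolerance are assumptions for illustration; real optical-map comparison uses dedicated alignment software and has to handle missed or spurious cut sites, which this sketch does not. The example uses the EcoRI site (GAATTC, cut after the first G), which is palindromic, so scanning one strand is enough.

```python
def cut_positions(seq, site, cut_offset=0):
    """Return 0-based cut positions for every occurrence of `site` in `seq`."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = seq.find(site, start + 1)  # step by 1 to catch overlapping sites
    return positions


def fragment_lengths(seq, site, cut_offset=0):
    """Predicted restriction fragment lengths, in order along the contig."""
    edges = [0] + cut_positions(seq, site, cut_offset) + [len(seq)]
    return [b - a for a, b in zip(edges, edges[1:]) if b > a]


def maps_agree(predicted, observed, rel_tol=0.10):
    """True if both maps have the same number of fragments and each pair
    of sizes differs by at most `rel_tol` of the observed size."""
    if len(predicted) != len(observed):
        return False
    return all(abs(p - o) <= rel_tol * o for p, o in zip(predicted, observed))


# Toy contig with two EcoRI sites.
contig = "AAAGAATTCCCCCGAATTCTT"
predicted = fragment_lengths(contig, "GAATTC", cut_offset=1)
```

A contig whose predicted fragment pattern cannot be matched anywhere on the optical map is a candidate misassembly.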

Other external validation methods
De novo assembly (Illumina, PGM): set of unordered and unoriented contigs
Optical map + DNA sequence contigs:
• De novo reconstructs parts missing in the reference strain
• Correctly assembles long tandem repeats

Don’t panic. And don’t rush
Keeping up with the development can be stressful, so you need to stay calm
• Choose quality before quantity
• Know your biological system, so you know what to expect
• Combine sequencing with other data
• Share knowledge and be nice to your bioinformatics friends
For each conclusion, ask yourself if it can be an artefact due to
• Incomplete assembly
• Repeats
• Indels
• Coverage bias
• Divergent sequences (mapping)

Know that your final assembly will be incomplete

Things that are not there
[Figure: chromosome ideogram heatmap for chromosomes 1–22 and X (scale bar 100 Mb). Legend: closed gap, inversion, complex event; STR density from low to high.]
Extended Data Figure 3 | Genome distribution of closed gaps and
insertions. Chromosome ideogram heatmap depicts the normalized density of
inserted CHM1 base pairs per 5-Mb bin with a strong bias noted near the end of
most chromosomes. Locations of structural variants and closed gaps are given
by coloured diamonds to the left of each chromosome: closed gap sequences
(red), inversions (green), and complex events (blue).
Chaison M.J.P et al. Nature (2014)
…by high-throughput DNA sequencing (ChIP-seq) analysis (Supplementary Information). We identified a significant 15-fold enrichment of short tandem repeats (STRs) when compared to a random sample (P < 0.00001) (Fig. 1a). A total of 78% (39 out of 50) of the closed gap sequences were composed of 10% or more of STRs. The STRs were frequently embedded in longer, more complex, tandem arrays of degenerate repeats reaching up to 8,000 bp in length (Extended Data Fig. 1a–c), some of which bore resemblance to sequences known to be toxic to Escherichia coli. Because most human reference sequences have been derived from clones propagated in E. coli, it is perhaps not surprising that the application of a long-read sequence technology to uncloned DNA would resolve such gaps. Moreover, the length and complex degeneracy of these STRs embedded within (G+C)-rich DNA probably thwarted efforts to follow up most of these by PCR amplification and sequencing.
Next, we developed a computational pipeline (Extended Data Fig. 2) to characterize structural variation systematically (structural variation defined here as differences ≥50 bp in length, including deletions, duplications, insertions and inversions). Structural variants were discovered by mapping SMRT sequencing reads to the human reference genome…
[Figure: panel a, proportion of region with simple repeats, gaps versus reference (P < 2.2 × 10–16); panel b, (G+C) content (0–100) of reference flank, gap closure and tandem repeat for gaps with only tandem repeats, gaps without tandem repeats, and sampled reference.]
Figure 1 | Sequence content of gap closures. a, Gap closures are enriched for simple repeats compared to equivalently sized regions randomly sampled from GRCh37. b, Human genome gaps typically consist of (G+C)-rich sequence (yellow) flanking complex (A+T)-rich STRs (green) (empirical P value; Supplementary Information). Red line indicates genomic (G+C) content.

Things that are not there
Steinberg K.M. et al. Genome Research (2014)
Figure 5. Overview of the Chr 11 (NC_018922.2) 1.9-Mb region, exhibiting three alignment bins with a large number of PacBio “cliff” reads where the
alignment coverage dropped off sharply. WGS component (light green lines) boundaries flanked by such reads are marked with red dashed lines. The ends
of each component at the boundary are labeled with letters to show orientation. Pairs of alignments corresponding to three different PacBio reads are
marked in yellow, green, and dark blue. These alignments overlap by < 10% on each of the reads. The split alignments for these three reads suggest that
the two WGS components marked in purple should be inverted and translocated as indicated by the arrow at the top of the image. The other PacBio reads
in these bins exhibit the same pattern of split alignments, which supports the proposed reordering and orientation of the WGS components. The bottom
light green lines show a proposed tiling path with the orientation corrected; the letters indicate where each end of the initial tiling path components should
be placed.
CHM1 assembly of the human genome

Summary
• Genome size and repeat content can be estimated without an assembly.
• Removing adapters and trimming low-quality (low-QV) bases is good practice unless the assembly program does error correction (EC) itself.
• Assess the level of heterozygosity in your target genome before you
assemble (or sequence) it and set your expectations accordingly.
• Choose an assembler that excels in the area you are interested in
(e.g., coverage, continuity) and prepare libraries that suit it.
• Interested only in coding-potential analyses (e.g., training a
gene finder, studying codon usage bias, looking for intron-specific
motifs)? => Consider studying exome assemblies.
• Or consider a proxy: study a species that is evolutionarily close enough
and whose genome is of good quality.
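
As a rough illustration of the first point, genome size can be estimated from a k-mer depth histogram of the raw reads: divide the total number of k-mer occurrences by the depth of the main (homozygous) peak. The sketch below assumes a histogram already produced by a k-mer counter; the function names, the error cut-off `min_depth`, the 1.5× peak threshold for calling k-mers repetitive, and the toy numbers are all assumptions for the example.

```python
def estimate_genome_size(hist, min_depth=3):
    """Estimate genome size from a k-mer histogram.

    `hist` maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below `min_depth` are treated as sequencing errors and ignored.
    Returns (estimated size in k-mers, depth of the main peak).
    """
    solid = {d: n for d, n in hist.items() if d >= min_depth}
    peak_depth = max(solid, key=solid.get)         # most common solid depth
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth, peak_depth


def repeat_fraction(hist, min_depth=3, factor=1.5):
    """Crude repeat content: share of k-mer occurrences at depth >= factor x peak."""
    _, peak_depth = estimate_genome_size(hist, min_depth)
    solid = {d: n for d, n in hist.items() if d >= min_depth}
    total = sum(d * n for d, n in solid.items())
    repeats = sum(d * n for d, n in solid.items() if d >= factor * peak_depth)
    return repeats / total


# Toy histogram: error k-mers at depth 1-2, main peak at 20x, repeats at 40x.
hist = {1: 5000, 2: 800, 18: 200, 19: 600, 20: 1000, 21: 600, 22: 200, 40: 50}
size, peak = estimate_genome_size(hist)
```

On real data a heterozygous genome shows a second peak at about half the main depth, so the peak choice should be checked by eye before trusting the number.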

Summary
• Genome size and repeat content can be estimated without an assembly.
• Removing adapters and trimming low-quality (low-QV) bases is good practice unless the assembly program does error correction (EC) itself.
• Assess the level of heterozygosity in your target genome before you
assemble (or sequence) it and set your expectations accordingly.
• Choose an assembler that excels in the area you are interested in
(e.g., coverage, continuity, or number of error-free bases).
• Interested only in coding-potential analyses (e.g., training a
gene finder, studying codon usage bias, looking for intron-specific
motifs)? => Consider studying exome assemblies.
• Or consider a proxy: study a species that is evolutionarily close enough
and whose genome is of good quality.
Settle on an assembly so science can continue!

Acknowledgements
• Olga Vinnere Pettersson
• Björn Nystedt
• Ola Spjuth
• Henrik Lantz
• Jacques Dainat
• Francesco Vezzi
• BGI
• Jon Badalamenti (Bond Lab)
• Stephan C. Schuster (Penn U)
