Genome assembly: then and now
Keith Bradnam
Image from Wellcome Trust
Image from flickr.com/photos/dougitdesign/5613967601/
Contents
Sequencing 101
Genome assembly: then
Genome assembly: now
Assemblathon 1
Assemblathon 2
Assemblathon 3
More info
✤ http://assemblathon.org
✤ http://arxiv.org
✤ http://twitter.com/assemblathon
The Assemblathon 2 paper has been reviewed; we are just dealing with the reviewers' comments.
Sequencing 101
A, C, G, T...
Image from nlm.nih.gov
Fred Sanger
Read
Most sequencing technologies start with a sequencing read. A read could be as short as 25
bp (Solexa sequencing from a few years ago), or >15,000 bp (PacBio with latest chemistry).
Read pair
Most sequencing is done with pairs of connected reads, separated by a short interval whose
length is known. Read pairs can also overlap with each other.
Read pair
Mate pair
Mate pairs, also known as jumping pairs, have much larger inserts (thousands or tens of
thousands of bp), but it is hard to make good mate pair libraries. Having very large inserts is
very useful for the purposes of genome assembly.
Sequence a whole lot of read pairs, and hopefully they will overlap with each other and allow
you to start making contiguous sequences...
Contigs
...which are better known as contigs.
Scaffold
NNNNNNNNNNNNNNNNNNN
Mate pairs — or other information — can hopefully be used to connect contigs together into
scaffolds. The unknown gap between contigs is replaced with unknown bases (Ns).
Assembly size
[Figure: a toy assembly of 12 scaffolds, some containing runs of Ns, with lengths of 70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, and 5 Mbp; total 200 Mbp]
Assembly size is simply the sum of all scaffolds or contigs that are included in the final
genome assembly. If you are calculating the assembly size from scaffolds, then some fraction
of that final size will come from the Ns in scaffold sequences.
Here we have a toy genome assembly, with 12 scaffolds totaling 200 Mbp.
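As a rough illustration of the point about Ns, here is a minimal Python sketch that sums scaffold lengths and counts how much of that total is unknown bases. The function name `assembly_size` and the three short sequences are made up for this example; they are not from any real assembly.

```python
def assembly_size(scaffolds):
    """Return (total_bp, n_bp) for a collection of scaffold sequences."""
    total = sum(len(seq) for seq in scaffolds)
    # Count unknown bases, whether written as 'N' or 'n'.
    n_count = sum(seq.upper().count("N") for seq in scaffolds)
    return total, n_count


# Hypothetical toy scaffolds, not real data.
scaffolds = ["ACGTNNNNACGT", "GGCCTTAA", "ACNNGT"]
total, n_count = assembly_size(scaffolds)
print(total, n_count, round(100 * n_count / total, 1))  # 26 6 23.1
```

In practice the sequences would be read from a FASTA file, but the arithmetic is the same.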
N50 length
[Figure: the same 12 scaffolds (200 Mbp in total); running total: 70 Mbp]
The most widely used measure to describe genome assemblies is the N50 length of
scaffolds or contigs. This is essentially a weighted median, designed to be more informative
than a crude mean length (which is not very useful if you end up with thousands of very short
contigs). To calculate the N50 scaffold length, start with the length of the longest scaffold...
N50 length
[Figure: the same 12 scaffolds (200 Mbp in total); running total: 70 + 25 = 95 Mbp]
If this length does not exceed 50% of the total assembly size (50% is why it is N50), proceed
to the next longest scaffold, and add the length to a running total.
N50 length
[Figure: the same 12 scaffolds (200 Mbp in total); running total: 95 + 20 = 115 Mbp]
Now we have exceeded 50% of the total assembly size.
N50 length
[Figure: the same 12 scaffolds; total 200 Mbp]
The length of the contig or scaffold that takes you past 50% is what is reported as the N50
length. So here, we have an N50 length of 20 Mbp.
N50 length
[Figure: the same 12 scaffolds, with two of the 5 Mbp scaffolds removed]
N50 may be more robust than using a simple mean length, but it can still be easily
manipulated. What if we excluded the two shortest scaffolds from our assembly?
N50 length
[Figure: the remaining 10 scaffolds; total 190 Mbp]
Now the total assembly size is 10 Mbp smaller, which is only 5%, but the N50 increases to 25
Mbp...a 25% increase in size. If these were two different assemblies and you only saw an N50
of 25 Mbp vs N50 of 20 Mbp, you might think the first assembly was much better.
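The N50 walk-through, and the effect of dropping two short scaffolds, can be sketched in a few lines of Python. This is only an illustration using the toy scaffold lengths from the slides. The function name `n50` is made up, and the choice to stop when the running total reaches at least 50% (rather than strictly more than 50%) is an assumption. It matters here because in the trimmed assembly 70 + 25 = 95 Mbp is exactly half of 190 Mbp.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Assumed convention: stop once the running total reaches
        # at least 50% of the total assembly size.
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


toy = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5]  # Mbp, total 200
trimmed = sorted(toy, reverse=True)[:-2]             # drop two 5 Mbp scaffolds

print(sum(toy), n50(toy))          # 200 20
print(sum(trimmed), n50(trimmed))  # 190 25
```

Removing 5% of the sequence changes the reported N50 from 20 to 25 Mbp, which is why an N50 should always be read alongside the total assembly size.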
N50 for two assemblies
[Figure: two fictional assemblies. First assembly: 208 Mbp, N50 = 15 Mbp. Second assembly: 190 Mbp, N50 = 25 Mbp]
Here are another two fictional assemblies. The first assembly now has a lower N50 value, but
this is purely because it contains more sequence (albeit in short scaffolds). Do you
want more sequence in your assembly, or fewer but longer sequences?
NG50 for two assemblies
[Figure: the same two assemblies (208 Mbp and 190 Mbp) shown against an expected genome size of 250 Mbp]
We prefer a measure called NG50. This does not use the assembly size, but instead uses the
known (or estimated) genome size (the 'G' in NG50 refers to the Genome).
NG50 for two assemblies
[Figure: expected genome size = 250 Mbp; first assembly NG50 = 15 Mbp, second assembly NG50 = 15 Mbp]
The NG50 of these two assemblies is now the same. We think that NG50 is a fairer way of
comparing genome assemblies that might differ in their total size.
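A minimal sketch of NG50, assuming the 190 Mbp assembly is the 10-scaffold toy assembly used earlier in the talk (the scaffold lengths of the 208 Mbp assembly are not given on the slides, so it is not reproduced here). The function name `ng50` and the choice to return `None` when the assembly covers less than half of the genome are assumptions made for this example.

```python
def ng50(lengths, genome_size):
    """Return NG50, or None if the assembly covers under half the genome."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # The threshold is 50% of the genome size, not of the assembly size.
        if 2 * running >= genome_size:
            return length
    return None


second = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]  # Mbp, the 190 Mbp toy assembly
print(ng50(second, 250))  # 15
print(ng50(second, 400))  # None
```

The only change from N50 is the denominator, so two assemblies of the same genome are always judged against the same target.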
How do I describe thee?
Let me count the ways
Metric | Description
Assembly size | With or without very short contigs?
N50 / NG50 | For contigs and/or scaffolds
Coverage | When compared to a reference sequence
Errors | Base errors from alignment to reference sequence and/or input read data
Number of genes | From comparison to reference transcriptome and/or set of known genes
And many, many more...
Apart from assembly size, and N50/NG50 length, there are many other ways to describe a
genome assembly.
Genome assembly
Back in the day...
How were genomes assembled back in the late 1990s when genome sequencing projects
were starting to make the news?
Genome assembly: then
Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓
Genome sequencing projects often had a fantastic amount of supporting material which
helped put the genome together. They were further helped by targeting genomes which had
low heterozygosity. And of course this was all done with Sanger sequencing which gave long,
accurate reads.
So what was the result of spending millions of dollars
to assemble genomes of well-characterized species,
with accurate long reads, and detailed maps???
So hopefully this gave us a useful set of finished genomes, right?
Arabidopsis thaliana
✤ 2000: published genome size = 125 Mbp
✤ 2007: genome size = 157 Mbp
✤ 2012: genome size = 135 Mbp
✤ Amount sequenced = 119 Mbp
✤ Ns = 0.2% of genome
Published genome sizes are often based on estimates, which can be wrong. As
they sequenced more and more of the Arabidopsis genome, they had to revise how big it was.
So between 2000 and 2007 they produced more sequence but paradoxically it became less
complete because the estimate of the size went up. Now it has come back down again. But
the genome remains unfinished.
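To see how much the completeness figure depends on the size estimate, here is a small Python sketch using the numbers above. It assumes, purely for illustration, that the single "amount sequenced" figure of 119 Mbp applies to all three years; in reality less had been sequenced in 2000, so the earlier percentages are only indicative. The names `SEQUENCED_MBP` and `percent_complete` are made up for this example.

```python
SEQUENCED_MBP = 119  # "Amount sequenced" from the slide

# Estimated genome size (Mbp) by year, from the slide.
estimates = {2000: 125, 2007: 157, 2012: 135}


def percent_complete(sequenced, genome_size):
    """Sequenced amount as a percentage of the estimated genome size."""
    return 100 * sequenced / genome_size


for year, size in estimates.items():
    print(year, f"{percent_complete(SEQUENCED_MBP, size):.1f}%")
# 2000 95.2%
# 2007 75.8%
# 2012 88.1%
```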
Drosophila melanogaster
✤ Genome published 1998
✤ Heterochromatin finished 2007
✤ Ns = 4% of genome
The fly genome was 'finished' in 1998. But this was only really the easy-to-sequence portion
of the genome (the euchromatin). The trickier heterochromatin was sequenced as a separate
project that didn't finish until almost a decade later. The fly genome remains unfinished.
Caenorhabditis elegans
✤ Genome published 1998
✤ 2004: last N removed
✤ 1998–2013: genome sequence changes
✤ 558 insertions
✤ 230 deletions
✤ 614 substitutions
(all made in Nov 2012)
The worm genome has no unknown bases in it. However, since the publication of the genome
sequence the genome has continued to be refined as errors are corrected. The last batch of
changes all occurred in November 2012. So almost 15 years after the genome was published,
we can still find over 1,400 errors in one of the best-characterized genome
sequences that exists.
Saccharomyces cerevisiae
✤ Genome published 1997
✤ 12 Mbp genome
✤ 1,653 changes to genome since 1997
✤ Last changes made in 2011
Likewise in yeast. The first eukaryotic genome sequence continues to receive fixes to correct
the sequence. The last set of changes was made in 2011. These changes affected coding
sequences, not just intergenic and intronic DNA.
Genome assembly: then
Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓
And all of this was done in an era when we had all of these supporting materials.
Genome assembly: now
Genetic maps ✗
Physical maps ✗
Understanding of target genome ✗
Haploid / low heterozygosity genome ✗
Accurate & long reads ✗
Resources (time, money, people) ✗
We don't have these now! Genome sequencing no longer requires an international
consortium; it could be a project for a grad student.
Assembling & finishing
a genome is not easy!
It was never easy, even when we had access to lots of resources to help us put together genomes.
And it is not easy now. Don't be fooled into thinking that, because there are many published
genome sequences, these sequences represent the absolute ideal genome sequence.
Assemblathons
A new idea is born
Image from flickr.com/photos/dullhunk/4422952630
The Assemblathon was born out of the Genome 10K project.
If you sequence 10,000 genomes...
...you need to assemble 10,000 genomes
How many assembly tools are out there?
Ray, Celera, MIRA, ALLPATHS-LG, SGA, Curtain, Metassembler, Phusion, ABySS, Amos, Arapan,
CLC, Cortex, DNAnexus, DNA Dragon, EULER, Edena, Forge, Geneious, IDBA, Newbler, PRICE,
PADENA, PASHA, Phrap, TIGR, Sequencher, SeqMan NGen, SHARCGS, SOPRA, SSAKE, SPAdes, Taipan,
VCAKE, Velvet, Arachne, PCAP, GAM, Monument, Atlas, ABBA, Anchor, ATAC, Contrail, DecGPU,
GenoMiner, Lasergene, PE-Assembler, Pipeline Pilot, QSRA, SeqPrep, SHORTY, fermi, Telescoper,
Quast, SCARPA, Hapsembler, HapCompass, HaploMerger, SWiPS, GigAssembler, MSR-CA
Which is the best?
There are many, many tools out there for assembling, or helping to assemble, a genome
sequence. It seems reasonable to ask...which is the best?
Comparing assemblers
✤ Can't fairly compare two assemblers if they:
✤ produced assemblies from different species
✤ assembled same species, but used sequence data from different NGS platforms
✤ used same NGS platform but different sequence libraries
✤ Even using different options for the same assembler may produce very different assemblies!
However, it is not always straightforward to compare two tools if they were used on different
species or on different datasets from the same species.
A genome assembly competition
That's where the Assemblathon came in.
An attempt to standardize some aspects
of the genome assembly process
Genome assembly contests
Others have been trying to do the same thing. E.g. GAGE, and dnGASP.
Assemblathon 1
✤ 2010–2011
✤ Used synthetic data
✤ Small genome (~100 Mbp)
✤ We knew the answer!
It is easier to judge a tool when you know what the final answer should look like. However,
many people who work on developing assemblers would prefer to work with real data...
Here we go again
...which is where Assemblathon 2 came in.
Contest | Type of data | Number of genomes | Size of genomes | Do we know the answer?
Assemblathon 1 | Synthetic | 1 | Small | ✓
Assemblathon 2 | Real | 3 | Large | ✗
Melopsittacus undulatus
Boa constrictor constrictor
Maylandia zebra
A budgie, a cichlid fish from Lake Malawi, and a reptile.
Bird
Snake
Fish
Let's simplify the names for the rest of the talk.
Why these three species?
Because they were there
There is no special reason why these species were used. People had a need to sequence the
genomes, and some companies were willing to donate sequences.
Assemble this!
Species | Estimated genome size | Illumina | Roche 454 | PacBio
Bird | 1.2 Gbp | 285x (14 libraries) | 16x (3 libraries) | 10x (2 libraries)
Fish | 1.0 Gbp | 192x (8 libraries) | - | -
Snake | 1.6 Gbp | 125x (4 libraries) | - | -
Lots of sequence data was provided for the bird. Mate pair and read pair libraries were
available for all Illumina datasets.
Who took part?
21 teams
43 assemblies
52,013,623,777 bp of sequence
Lots of teams took part. Not just from the big sequencing/genome centers.
Entries
Species | Competitive entries | Evaluation entries
Bird | 12 | 3
Fish | 10 | 6
Snake | 12 | 0
There were evaluation entries (not eligible to be declared the winner) allowed in addition to
competition entries (only 1 per team).
Goals
✤ Assess 'quality' of assemblies
✤ Define quality!
✤ Produce ranking of assemblies for each species
✤ Produce ranking of assemblers across species?
Who did what?
Person/group | Jobs
Me, Ian, and Joseph Fass | Perform various analyses of all assemblies
David Schwarz et al. | Produce & evaluate optical maps
Jay Shendure et al. | Produce Fosmid sequences (bird & snake only)
Martin Hunt & Thomas Otto | Perform REAPR analysis
Dent Earl & Benedict Paten | Help with meta-analysis of final rankings
91 co-authors!
flickr.com/photos/jamescridland/613445810
Hard to get agreement on how best to interpret the results. Some analyses and
interpretations in the Assemblathon 2 paper end up being compromises.
Results!
Lots of results!
A screen grab of my master spreadsheet that contains all of the numerical results.
102 different metrics!
10 key metrics
We focused on 10 of 102 metrics that we thought were a) useful and b) captured different
aspects of an assembly's quality.
Key Metric Description
1 NG50 scaffold length
2 NG50 contig length
3 Amount of assembly in 'gene-sized' scaffolds
4 Number of 'core genes' present
5 Fosmid coverage
6 Fosmid validity
7 Short-range scaffold accuracy
8 Optical map: level 1
9 Optical map: levels 1–3
10 REAPR summary score
The 10 key metrics.
1) Scaffold NG50 lengths
✤ Can calculate NG50 length for each assembly
✤ But also calculate NG60, NG70 etc.
✤ Plot all results as a graph
An N50 (or NG50) value on its own doesn't tell you that much. Ideally you should always be
aware of the total assembly size and the distribution of lengths when comparing assemblies.
You can do this by not only calculating NG50, but NG1..NG100. NG1 would be the length of
scaffold that captures 1% of the estimated genome size (when summing scaffolds from
longest to shortest).
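The NG1..NG100 series can be sketched as follows. This is only an illustration: the function name `ng_values`, the toy scaffold lengths, and the 250 Mbp genome size are assumptions for the example, as is the convention of reporting 0 when the assembly never reaches x% of the genome size.

```python
def ng_values(lengths, genome_size, percents=range(1, 101)):
    """Return {x: NGx}; NGx is 0 when the assembly never reaches x% of the genome."""
    ordered = sorted(lengths, reverse=True)
    result = {}
    for x in percents:
        running = 0
        result[x] = 0  # assumed convention for "never reached"
        for length in ordered:
            running += length
            # Stop at the scaffold that takes the running total to x% of the genome.
            if 100 * running >= x * genome_size:
                result[x] = length
                break
    return result


toy = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]  # hypothetical 190 Mbp assembly
curve = ng_values(toy, genome_size=250)
print(curve[1], curve[50], curve[76], curve[77], curve[100])  # 70 15 5 0 0
```

Plotting x against NGx for each assembly gives one line per assembly, so the whole length distribution can be compared at a glance rather than a single NG50 point.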
1) Scaffold NG50 lengths
Scaffold length is on a log axis and team identifiers are shown in the legend.
The black dashed line shows the NG50 value, but the point where each series starts on the
left shows the lengths of the longest scaffolds. Also, if the NG100 value is greater than zero,
then that assembly is bigger than the known/estimated genome size.
2) Contig vs scaffold NG50
We did the same thing for contig NG50 as well as scaffold NG50. The two measures are
sometimes, but not always, correlated. The two highlighted data points show outliers for bird
assemblies, reflecting assemblies that are good at making long contigs *or* good at making
long scaffolds, but not both.
3) Gene-sized scaffolds
✤ Do assemblers get a little too excited by length?
✤ How long is 'long enough' for a scaffold?
✤ What if you just wanted to find genes?
✤ Average vertebrate gene = ~25 Kbp
It is great to have long scaffolds, but maybe for many questions that you might be interested
in (e.g. studying codon usage bias), you only need to have scaffolds that have a good chance
of capturing a full-length gene.
3) Gene-sized scaffolds
The blue line shows the percentage of the estimated genome size that is present in scaffolds
of 25 Kbp or longer. Most assemblies, even if they have a much shorter *average* scaffold
length, may contain many scaffolds that are long enough to contain a single gene.
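The 'gene-sized scaffold' measure described above can be sketched like this. The function name `percent_in_long_scaffolds`, the scaffold lengths, and the 1 Mbp genome size are hypothetical values chosen for the example; only the 25 Kbp threshold comes from the slides.

```python
def percent_in_long_scaffolds(lengths, genome_size, min_length=25_000):
    """Percentage of the estimated genome size held in scaffolds >= min_length bp."""
    long_bp = sum(length for length in lengths if length >= min_length)
    return 100 * long_bp / genome_size


# Hypothetical scaffold lengths in bp, not real data.
lengths = [400_000, 120_000, 30_000, 25_000, 24_999, 8_000, 1_500]
print(percent_in_long_scaffolds(lengths, genome_size=1_000_000))  # 57.5
```

Because the denominator is the estimated genome size rather than the assembly size, the value can be compared across assemblies of different total length.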
4) Core genes
✤ Used CEGMA tool
✤ CEGMA = set of 458 'Core Eukaryotic Genes' (CEGs)
✤ How many full-length CEGs are in each assembly?
A previously developed tool (CEGMA) was used to see how many 'core genes' (extremely
highly conserved genes) are present in each assembly. Note that CEGMA only finds genes where a
full-length (or nearly full-length) gene is present within a single scaffold. Many core genes
might be present, but split across scaffolds.
4) Core genes
These results show the number of CEGMA genes that were present in any one assembly as a
percentage of all possible CEGMA genes (i.e. those present across all assemblies for each
species).
4) Core genes
Core genes (out of 458)
Species    Best individual assembly    Across all assemblies
Bird       420                         442
Fish       436                         455
Snake      438                         454
In the three species, most of the core genes were present across all assemblies, but
individual assemblies typically lacked several core genes.
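The two columns in the table can be thought of as a maximum and a set union. A minimal sketch, assuming each assembly's CEGMA result has been reduced to a set of gene identifiers (the assembly names and gene IDs below are hypothetical):

```python
def core_gene_summary(genes_by_assembly):
    """Return (best individual count, count across all assemblies)."""
    best_individual = max(len(genes) for genes in genes_by_assembly.values())
    across_all = len(set().union(*genes_by_assembly.values()))
    return best_individual, across_all


# Hypothetical core genes found in three assemblies of one species.
found = {
    "assembly_A": {"CEG1", "CEG2", "CEG3"},
    "assembly_B": {"CEG2", "CEG3", "CEG4"},
    "assembly_C": {"CEG1", "CEG5"},
}
print(core_gene_summary(found))  # (3, 5)
```

No single toy assembly has more than three genes, yet five distinct genes are recovered across all of them, which is the pattern the table shows.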
ABYSS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
BCM MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
CRACS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
CURT MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
GAM MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
MERAC MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
PHUS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
RAY MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
SGA MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
SYMB MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
SOAP MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
************************************************ *****
ABYSS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
BCM FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
CRACS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
CURT FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
GAM FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNNLPHTHI
MERAC FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
PHUS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
RAY FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SGA FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SYMB FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SOAP FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
******************************************************
ABYSS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
BCM ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
CRACS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
CURT ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
GAM YGHALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLK------------------
MERAC ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
PHUS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
RAY ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SGA ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SYMB ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SOAP ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
***************************************
ABYSS ILPLVTGAGHISVPFPDTYKMTKSY
BCM ILPLVTGAGHISVPFPDTYKMTKSY
CRACS ILPLVTGAGHISVPFPDTYKMTKSY
CURT ILPLVTGAGHISVPFPDTYKMTKSY
GAM -------------------------
4) Core genes
Example of one core gene predicted in bird assemblies. CEGMA gene predictions are available
as supplementary material with the paper.
5) Fosmid coverage
✤ Had to first assemble Fosmids
✤ Looked at repeat content & coverage across Fosmids
✤ Aligned assembly scaffolds to Fosmids
✤ Only had Fosmids for bird and snake
5) Fosmid coverage
We looked at coverage of the Fosmids by aligning some of the input reads to them.
Occasionally we see small gaps in coverage. These represent either Fosmid assembly errors or
regions of the genome not captured by the input read data. We also aligned scaffolds to the
Fosmids and saw that most assemblies contain most of the Fosmids, but repeats complicate
the picture.
5) Fosmid coverage
✤ Only used regions of Fosmids that were validated by
one or more assemblies
✤ Validated Fosmid Regions (VFRs)
✤ 99% of bird Fosmids
✤ 89% of snake Fosmids
Most of the Fosmid sequences were used as 'Trusted' reference sequences by which to assess
the assemblies.
5 & 6) Coverage & Validity
COMPASS tool by Joe Fass
The COMPASS tool compared the Validated Fosmid Regions (VFRs) to the scaffolds to
calculate four measures, two of which ('coverage' and 'validity') were used as key metrics.
5 & 6) Coverage & Validity
Some COMPASS results from the bird assemblies. Multiplicity is high when the assemblies
were large (compared to the estimated genome size).
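To make these measures concrete, here is a rough sketch of how coverage, validity and multiplicity could be computed from scaffold-to-VFR alignments. The definitions in the code are assumptions made for illustration, not COMPASS's actual implementation: coverage is taken as the fraction of reference bases covered by at least one alignment, validity as aligned length over total scaffold length, and multiplicity as aligned length over covered length. Alignments are given as hypothetical half-open (start, end) intervals on the reference.

```python
def covered_length(intervals):
    """Reference positions covered by at least one half-open (start, end) interval."""
    covered, current_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered


def compass_like_metrics(ref_intervals, reference_length, scaffold_total_length):
    """Assumed definitions; aligned length on the reference stands in for aligned scaffold bases."""
    aligned = sum(end - start for start, end in ref_intervals)
    covered = covered_length(ref_intervals)
    return {
        "coverage": covered / reference_length,
        "validity": aligned / scaffold_total_length,
        "multiplicity": aligned / covered,
    }


# Hypothetical example: one region of the reference is aligned twice over.
metrics = compass_like_metrics([(0, 400), (300, 700), (300, 700)],
                               reference_length=1000,
                               scaffold_total_length=1500)
print(metrics)
```

Under these assumed definitions, redundant sequence in an assembly (the same region aligning more than once) pushes multiplicity above 1, which fits the observation that multiplicity was high for the larger assemblies.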
7) Short-range scaffold accuracy
Validated Fosmid Region
100 nt 100 nt
We also used the VFRs in another way. We took pairs of 100 nt 'tag' sequences from either
end of consecutive 1000 nt fragments across all VFR sequences.
7) Short-range scaffold accuracy
Validated Fosmid Region
Map pairs of 'tag' sequences to assembly scaffolds
How many map as a pair to one scaffold?
How many map at expected distance apart (900 ± 2 bp)?
The start coordinates of each pair of tag sequences should map 900 nt apart in the
assemblies and hopefully both tags map only to the same scaffold. We combined both of
these into one 'summary score' metric.
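The tag-pair procedure can be sketched as follows. This is an illustration under stated assumptions rather than the script used for the paper: the function names are hypothetical, fragments are taken as consecutive and non-overlapping, and a mapping result is represented as a simple (scaffold_id, start) tuple. The 1000 nt, 100 nt, 900 nt and ± 2 values come from the slides.

```python
def tag_pairs(vfr_sequence, fragment_length=1000, tag_length=100):
    """Yield (left_tag, right_tag) from consecutive, non-overlapping fragments."""
    last_start = len(vfr_sequence) - fragment_length
    for start in range(0, last_start + 1, fragment_length):
        fragment = vfr_sequence[start:start + fragment_length]
        yield fragment[:tag_length], fragment[-tag_length:]


def pair_is_consistent(left_hit, right_hit, expected=900, tolerance=2):
    """Hits are (scaffold_id, start) tuples; None means the tag did not map."""
    if left_hit is None or right_hit is None:
        return False
    same_scaffold = left_hit[0] == right_hit[0]
    distance = abs(right_hit[1] - left_hit[1])
    return same_scaffold and abs(distance - expected) <= tolerance


# Hypothetical 2,500 nt VFR: two full 1000 nt fragments, remainder ignored.
vfr = "ACGT" * 625
pairs = list(tag_pairs(vfr))
print(len(pairs))
print(pair_is_consistent(("scaffold_1", 5000), ("scaffold_1", 5901)))
print(pair_is_consistent(("scaffold_1", 5000), ("scaffold_2", 5900)))
```

How the two counts (same scaffold, expected distance) are combined into the summary score is not described on the slides, so the sketch stops at the per-pair check.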
7) Short-range scaffold accuracy
Expected distance apart (900 bp)
Species    Shortest    Longest
Bird       702 bp      41,949 bp
Snake      673 bp      46,813 bp
Most pairs of tags mapped to the same scaffold, and at the expected distance apart, but
there were a few notable exceptions.
7) Short-range scaffold accuracy
The red line indicates the theoretical maximum summary score that could be achieved.
8 & 9) Optical maps
✤ Stretch out DNA
✤ Cut with restriction enzymes
✤ Note lengths of fragments
✤ Compare to in silico digest of scaffolds
✤ Not all scaffolds suitable for analysis
For optical map analysis, scaffolds had to be a certain minimum length *and* possess enough
restriction enzyme sites.
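The 'in silico digest' step can be sketched as below. The slides do not say which enzyme or thresholds were used, so the recognition site (GAATTC, cut after the first base), the minimum length and the minimum number of sites are all placeholders chosen for illustration.

```python
def in_silico_digest(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting at every occurrence of `site`."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(boundaries, boundaries[1:])]


def suitable_for_optical_map(sequence, min_length, min_sites, site="GAATTC"):
    """A scaffold must be long enough *and* contain enough restriction sites."""
    n_sites = len(in_silico_digest(sequence, site)) - 1
    return len(sequence) >= min_length and n_sites >= min_sites


# Hypothetical 47 nt scaffold with two recognition sites.
scaffold = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 5
print(in_silico_digest(scaffold))  # [11, 26, 10]
print(suitable_for_optical_map(scaffold, min_length=40, min_sites=2))
```

The list of predicted fragment lengths is what would then be compared against the fragment lengths measured on the optical map.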
8 & 9) Optical maps
Image from University of Wisconsin-Madison
An example of an optical map. After cutting, each DNA fragment is measured to estimate its
length. Optical map results were divided into three categories (levels 1–3).
8 & 9) Optical maps
White bars: total length of scaffolds that were suitable for optical map analysis. Dark blue:
global alignments of scaffolds to maps (these are the best quality). Light blue: global
alignments with more permissive thresholds. Orange bars: local alignments. We used level 1
(dark blue) as one key metric and levels 1+2+3 as a second key metric. The MLK assembly is
good, *relatively* speaking (high percentage of suitable scaffolds are in level 1 category), but
we record scores on an absolute basis (MERAC highest for level 1, SOAP highest for levels
8 & 9) Optical maps
Fish optical map results were much worse than in bird, with very few assemblies having
scaffolds with 'level 1' global alignments to the optical map. SGA had the most level 1
coverage, but a much lower amount of sequence that was alignable at any level (1, 2, or 3).
8 & 9) Optical maps
Snake optical map results were intermediate compared to bird and fish.
10) REAPR summary score
REAPR is a tool that aligns input reads to scaffolds and looks for base errors and regions
which might represent misassemblies (where scaffolds should ideally be split in two). These
two facets are combined into one summary score.
What does this all mean?
102 metrics per assembly
10 key metrics
1 final ranking
Using the 10 key metrics, we combined the results to produce a single score for each
assembly by which to rank them.
Assembly    Number of core genes    Rank    Z-score
CRACS       438                     1       +0.68
SYMB        436                     2       +0.59
PHUS        435                     3       +0.54
BCM         434                     4       +0.49
SGA         433                     5       +0.44
MERAC       430                     6       +0.30
ABYSS       429                     7       +0.25
SOAP        428                     8       +0.21
RAY         422                     9       –0.08
GAM         415                     10      –0.41
CURT        360                     11      –3.02
Although we did take an average rank from the 10 individual rankings, we preferred to use a
Z-score approach. Each assembly was scored based on the total number of standard
deviations from the average of each metric. This rewards/penalizes assemblies with very
high/low scores in individual metrics. The above results are from the CEGMA metric in bird.
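The Z-scores in the table can be recomputed from the core-gene counts. A minimal sketch: the function name is our own, and the choice of the population standard deviation (`statistics.pstdev`) is an inference, made because it reproduces the values shown on the slide.

```python
from statistics import mean, pstdev

# Core-gene counts from the table above.
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433,
    "MERAC": 430, "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415,
    "CURT": 360,
}


def z_scores(values):
    """Number of standard deviations each assembly lies from the metric's mean."""
    xs = list(values.values())
    mu, sigma = mean(xs), pstdev(xs)
    return {name: (x - mu) / sigma for name, x in values.items()}


for name, z in z_scores(core_genes).items():
    print(f"{name:6s} {z:+.2f}")
```

One outlying assembly (CURT) is nearly three standard deviations below the mean, whereas a rank-based score would only record it as one place below 10th.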
This graph shows the final rankings of bird assemblies based on their sum Z-scores.
Assemblies in red are the evaluation entries. The error bars reflect what would be the highest
and lowest sum Z-score if we had used any combination of 9 key metrics rather than 10.
Note that the highest ranked bird assembly was an evaluation assembly by Baylor College of
Medicine; their competitive entry ranked number 2.
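The sum Z-score and its error bars can be sketched like this. It assumes that 'any combination of 9 key metrics' means dropping each metric in turn. The function name and the ten Z-scores for one assembly are hypothetical.

```python
def sum_z_with_range(metric_z_scores):
    """Return (sum over all metrics, lowest and highest sum with one metric dropped)."""
    total = sum(metric_z_scores)
    leave_one_out = [total - z for z in metric_z_scores]
    return total, min(leave_one_out), max(leave_one_out)


# Hypothetical Z-scores for one assembly across the 10 key metrics.
z_per_metric = [2.0, 1.4, 1.5, 0.5, 0.3, -0.2, 0.8, 1.1, 0.0, 0.6]
total, lowest, highest = sum_z_with_range(z_per_metric)
print(f"sum Z = {total:.1f}, range with 9 metrics = {lowest:.1f} to {highest:.1f}")
```

The lower bound comes from dropping the assembly's best metric and the upper bound from dropping its worst, so an assembly that depends on one outstanding metric gets a long lower error bar.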
In fish, BCM ranked 1st, though the error bars suggest there is much variability. The lack of
Fosmid data means that there are only 7 key metrics rather than 10.
Snake seemed to be the only species in which one assembler appeared to outperform all
others (SGA, in this case). We will return to this issue. Note that there were no evaluation
entries for snake.
Another way of looking at all of this data is to plot the Z-scores for each metric as a heat
map (red = higher Z-scores).
A parallel coordinates plot is another way of trying to show all of the information at once.
What does this all mean?
No really, what does this all mean?
Still a bit hard to make sense of the overall rankings. What are the main findings from our
paper?
Some conclusions
✤ Very hard to find assemblers that performed well across
all 10 key metrics
✤ Assemblers that perform well in one species do not
always perform as well in another
✤ Bird & snake assemblies appear better than fish
✤ No real 'winner' for bird and fish
SGA — best assembler for snake?
Even if we had happened to use 9 key metrics rather than 10, and even if we threw out the
metric where SGA performed the best, it would still probably rank 1st. So is that the end of
the story?
Description Rank of snake SGA assembly
NG50 scaffold length 2
NG50 contig length 5
Amount of assembly in 'gene-sized' scaffolds 7
Number of 'core genes' present 5
Fosmid coverage 2
Fosmid validity 2
Short-range scaffold accuracy 3
Optical map: level 1 2
Optical map: levels 1–3 1
REAPR summary score 2
SGA only ranked 1st in one of the ten key metrics and ranked 7th in another. So it is a good
assembler *on average*. But if one of these metrics was highly important to you, you may
want to use an assembler that ranked higher in that metric.
Best assembler across species?
Assembler     Number of 1st places (out of 27)
BCM           5
Meraculous    4
Symbiose      4
Ray           3
Excluding evaluation entries
Not all assemblers were used for all species, but many teams submitted entries for 2 or 3 of
the species. In theory, if a team submitted an entry for all species, and if their assembler
ranked 1st in all metrics, they could achieve 1st place twenty-seven times (10 + 10 + 7 for
fish). So what was the best assembler across species, as judged by total number of 1st
places? It is BCM. But Ray comes 4th with three 1st places.
Ray performance
Species Final ranking
Bird 7
Fish 7
Snake 9
However, Ray ranks much lower when looking at its performance across all key metrics. So
some assemblers do very well in specific measures but not so well in others, while other
assemblers do moderately well across lots of metrics (e.g. SGA).
We found it interesting that the best bird assembly was the evaluation entry by Baylor College
of Medicine. What is different about this entry compared to their competitive entry?
BCM bird assemblies
Assembler            Final rank    NGS data used in assembly    Coverage Z-score    Validity Z-score    NG50 contig Z-score
BCM - evaluation     1             Illumina + 454               +2.0                +1.4                +1.5
BCM - competitive    2             Illumina + 454 + PacBio      –0.3                –0.8                +2.7
The only difference is that the BCM competitive entry included PacBio data, and somehow this
led to the paradoxical situation where including more sequence data produced lower measures
for coverage and validity (from the Fosmids), though one key metric (NG50 contig length) did
improve.
BCM evaluation scaffold
NNNNNNNNNNNNNNNNNNN
BCM competition scaffold
NNNNNNNNNNNNNNNNNNN
PacBio sequence
BCM used PacBio data to help fill in the gaps in their scaffolds.
BCM evaluation scaffold
NNNNNNNNNNNNNNNNNNN
BCM competition scaffold
CGTCGNNATCNNGGTTACG
Mismatches from PacBio sequence penalized alignment
score more than matching unknown bases
Errors in the PacBio sequence were penalized by the choice of alignment program used to
align Fosmids to scaffolds.
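A toy example shows how this can happen. It is not lastz's scoring scheme: the match, mismatch and unknown-base values are placeholders, the alignment is ungapped, and the sequences (including the number of errors) are invented. It only shows that if an N is scored as neutral while a mismatch is penalized, a gap filled with error-prone sequence can score lower than the same gap left as Ns.

```python
def column_score(ref_base, scaffold_base, match=1, mismatch=-3, unknown=0):
    """Score one alignment column; an N in the scaffold is treated as neutral."""
    if scaffold_base == "N":
        return unknown
    return match if ref_base == scaffold_base else mismatch


def ungapped_score(reference, scaffold, **scores):
    """Sum of column scores for two equal-length, ungapped sequences."""
    return sum(column_score(r, s, **scores) for r, s in zip(reference, scaffold))


# Hypothetical trusted Fosmid sequence spanning a scaffold gap.
fosmid = "ACGTACGTACGTACGTACGT"
unfilled = "N" * len(fosmid)

# Hypothetical gap-filled version with six substitution errors.
errors = {1: "T", 4: "G", 8: "C", 11: "A", 15: "C", 18: "A"}
gap_filled = "".join(errors.get(i, base) for i, base in enumerate(fosmid))

print(ungapped_score(fosmid, unfilled))    # 0
print(ungapped_score(fosmid, gap_filled))  # -4
```

With a milder mismatch penalty (for example `mismatch=-1`) the gap-filled sequence scores higher than the run of Ns, so the outcome depends on the scoring parameters chosen.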
The choice of one command-line option,
used by one tool in the calculation of one key metric...
...probably made enough difference to drop
the PacBio-containing assembly to 2nd place.
This was actually down to the use of a single command-line option to the lastz alignment
program. If we had not chosen this option, the PacBio-containing entry would have probably
ranked 1st among all bird assemblies.
Other conclusions
✤ Different metrics tell different stories
✤ Heterozygosity was a big issue for bird & fish assemblies
✤ Final rankings very sensitive to changes in metrics
✤ N50 is a semi-useful predictor of assembly quality
The last point may disappoint some. Despite looking at many different metrics, N50 scaffold
length still does a reasonable job of predicting overall quality. However...
...the outliers in this relationship should be noted. The highlighted bird assembly had the
second highest scaffold N50 length, but ranked 6th among bird assemblies.
Inter-specific differences matter
✤ The three species have genomes with different properties
✤ repeats
✤ heterozygosity
✤ The three genomes had very different NGS data sets
✤ Only bird had PacBio & 454 data
✤ Different insert sizes in short-insert libraries
Biological differences may account for differences in assembler performance between
different species. However, the input data for each species were also very different, and this
may play a role as well (some assemblers prefer certain short-insert sizes).
The Big Conclusion
"You can't always get what you want"
Sir Michael Jagger, 1969
What comes next?
3?
There may be an Assemblathon 3. This will be decided at the next Genome 10K workshop (in
April, 2013).
A wish list for Assemblathon 3
✤ Only have 1 species
✤ Teams have to 'buy' resources using virtual budgets
✤ Factor in CPU time/cost?
✤ Agree on metrics before evaluating assemblies!
✤ Encourage experimental assemblies
✤ Use new FASTG genome assembly file format
✤ Get someone else to write the paper!
If there is to be an Assemblathon 3, here are some things that we have learned from
Assemblathon 2.
~ fin ~

Genome assembly: then and now — v1.0

  • 1. Genome assembly: then and now Keith Bradnam Image from Wellcome Trust
  • 2. Image from flickr.com/photos/dougitdesign/5613967601/ Contents Sequencing 101 Genome assembly: then Genome assembly: now Assemblathon 1 Assemblathon 2 Assemblathon 3
  • 3. More info ✤ http://assemblathon.org ✤ http://arxiv.org ✤ http://twitter.com/assemblathon Assemblathon 2 paper has been reviewed, just dealing with reviewer's comments.
  • 4. Sequencing 101 A, C, G, T... Image from nlm.nih.gov Fred Sanger
  • 5. Read Most sequencing technologies start with a sequencing read. A read could be as short as 25 bp (Solexa sequencing from a few years ago), or >15,000 bp (PacBio with latest chemistry).
  • 6. Read pair Most sequencing is done with pairs of connected reads, separated by a short interval whose length is known. Read pairs can also overlap with each other.
  • 7. Read pair Mate pair Mate pairs, also known as jumping pairs, have much larger inserts (thousands or tens of thousands of bp), but it is hard to make good mate pair libraries. Having very large inserts is very useful for the purposes of genome assembly.
  • 8. Sequence a whole lot of read pairs, and hopefully they will overlap with each other and allow you to start making contiguous sequences...
  • 9. Contigs ...which are better known as contigs.
  • 10. Mate pairs — or other information — can hopefully be used to connect contigs together into scaffolds. The unknown gap between contigs is replaced with unknown bases (Ns).
  • 11. Mate pairs — or other information — can hopefully be used to connect contigs together into scaffolds. The unknown gap between contigs is replaced with unknown bases (Ns).
  • 12. Scaffold NNNNNNNNNNNNNNNNNNN Mate pairs — or other information — can hopefully be used to connect contigs together into scaffolds. The unknown gap between contigs is replaced with unknown bases (Ns).
  • 13. Assembly size NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 15 15 15 5 Assembly size is simply the sum of all scaffolds or contigs that are included in the final genome assembly. If you are calculating the assembly size from scaffolds, then some fraction of that final size will come from the Ns in scaffold sequences. Here we have a toy genome assembly, with 12 scaffolds totaling 200 Mbp.
  • 14. Assembly size NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 200 Mbp 15 15 15 5 Assembly size is simply the sum of all scaffolds or contigs that are included in the final genome assembly. If you are calculating the assembly size from scaffolds, then some fraction of that final size will come from the Ns in scaffold sequences. Here we have a toy genome assembly, with 12 scaffolds totaling 200 Mbp.
  • 15. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 200 Mbp 15 15 15 5 The most widely used measure to describe genome assemblies is the N50 lengths of scaffolds or contigs. This is essentially a weighted mean, designed to be more informative than a crude mean length (which is not very useful if you end up with thousands of very short contigs). To calculate the N50 scaffold length, start with the length of the longest scaffold...
  • 16. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 200 Mbp 15 15 15 5 The most widely used measure to describe genome assemblies is the N50 lengths of scaffolds or contigs. This is essentially a weighted mean, designed to be more informative than a crude mean length (which is not very useful if you end up with thousands of very short contigs). To calculate the N50 scaffold length, start with the length of the longest scaffold...
  • 17. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 200 Mbp 15 15 15 5 70 The most widely used measure to describe genome assemblies is the N50 lengths of scaffolds or contigs. This is essentially a weighted mean, designed to be more informative than a crude mean length (which is not very useful if you end up with thousands of very short contigs). To calculate the N50 scaffold length, start with the length of the longest scaffold...
  • 18. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 15 15 15 5 200 Mbp 95 If this length does not exceed 50% of the total assembly size (50% is why it is N50), proceed to the next longest scaffold, and add the length to a running total.
  • 19. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 15 15 15 5 200 Mbp 95 If this length does not exceed 50% of the total assembly size (50% is why it is N50), proceed to the next longest scaffold, and add the length to a running total.
  • 22. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 15 15 15 5 200 Mbp The length of the contig or scaffold that takes you past 50% is what is reported as the N50 length. So here, we have an N50 length of 20 Mbp.
  • 23. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 15 15 15 5 5 N50 may be more robust than using a simple mean length, but it can still be easily manipulated. What if we excluded the two shortest scaffolds from our assembly?
  • 24. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 15 15 15 5 5 N50 may be more robust than using a simple mean length, but it can still be easily manipulated. What if we excluded the two shortest scaffolds from our assembly?
  • 25. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 15 15 15 N50 may be more robust than using a simple mean length, but it can still be easily manipulated. What if we excluded the two shortest scaffolds from our assembly?
  • 28. N50 length [Figure: scaffolds of 70, 25, 20, 15, 15, 15, 10, 10, 5 and 5 Mbp; total assembly size 190 Mbp] Now the total assembly size is 10 Mbp smaller, which is only 5%, but the N50 increases to 25 Mbp, a 25% increase in size. If these were two different assemblies and you only saw an N50 of 25 Mbp vs an N50 of 20 Mbp, you might think the first assembly was much better.
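The N50 walk-through on these slides can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the Assemblathon: the function name `n50` is made up here, and the lengths are the fictional scaffold sizes (in Mbp) from the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold to the shortest, keeping a running total.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings the running total
        # to at least half of the total assembly size.
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


# Fictional assembly from the slides (lengths in Mbp, 200 Mbp in total).
scaffolds_mbp = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5]
print(n50(scaffolds_mbp))   # 20

# Same assembly with the two shortest scaffolds excluded (190 Mbp in total).
trimmed_mbp = sorted(scaffolds_mbp, reverse=True)[:-2]
print(n50(trimmed_mbp))     # 25
```

Dropping just 10 Mbp of short scaffolds moves the halfway point from 100 Mbp to 95 Mbp, which is exactly the running total after the second scaffold, so the reported value jumps from 20 to 25.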
  • 31. N50 for two assemblies [Figure: first assembly 208 Mbp, N50 = 15 Mbp; second assembly 190 Mbp, N50 = 25 Mbp] Here are two more fictional assemblies. The first assembly now has a lower N50 value, but this is purely because it contains more sequence (albeit in short scaffolds). Do you want more sequence in your assembly, or fewer but longer sequences?
  • 34. NG50 for two assemblies Expected genome size = 250 Mbp We prefer a measure called NG50. This does not use the assembly size, but instead uses the known (or estimated) genome size (the 'G' in NG50 refers to the Genome).
  • 36. NG50 for two assemblies [Figure: expected genome size = 250 Mbp; NG50 = 15 Mbp for both assemblies] The NG50 of these two assemblies is now the same. We think that NG50 is a fairer way of comparing genome assemblies that might differ in their total size.
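NG50 only changes the denominator: the halfway point is taken from the expected genome size instead of the assembly size. A minimal sketch follows, with some assumptions: the function name `ng50` is made up, the scaffold lengths of the two fictional assemblies are not given in the transcript, and the list below reuses the 190 Mbp example from the earlier N50 slides as a stand-in for the second assembly.

```python
def ng50(lengths, genome_size):
    """Return the NG50, or None if the assembly covers less than half the genome."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # The threshold is half of the expected genome size,
        # not half of the assembly size.
        if 2 * running >= genome_size:
            return length
    return None


# Assumed stand-in for the 190 Mbp assembly (lengths in Mbp).
assembly_mbp = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]

print(ng50(assembly_mbp, 250))  # 15, matching the NG50 shown on the slide
print(ng50(assembly_mbp, 190))  # 25, the same as N50 when the genome size equals the assembly size
print(ng50(assembly_mbp, 400))  # None, the assembly is smaller than half the genome
```

Returning `None` for a too-small assembly is a design choice of this sketch; other tools may report zero or skip the value.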
  • 39. How do I describe thee? Let me count the ways Apart from assembly size and N50/NG50 length, there are many other ways to describe a genome assembly.
      Metric: Description
      Assembly size: With or without very short contigs?
      N50 / NG50: For contigs and/or scaffolds
      Coverage: When compared to a reference sequence
      Errors: Base errors from alignment to reference sequence and/or input read data
      Number of genes: From comparison to reference transcriptome and/or set of known genes
      And many, many more...
  • 40. Genome assembly Back in the day... How were genomes assembled back in the late 1990s when genome sequencing projects were starting to make the news?
  • 42. Genome assembly: then Genetic maps ✓ Physical maps ✓ Understanding of target genome ✓ Haploid / low heterozygosity genome ✓ Accurate & long reads ✓ Resources (time, money, people) ✓ Genome sequencing projects often had a fantastic amount of supporting material which helped put the genome together. They were further helped by targeting genomes which had low heterozygosity. And of course this was all done with Sanger sequencing which gave long, accurate reads.
  • 48. So what was the result of spending millions of dollars to assemble genomes of well-characterized species, with accurate long reads, and detailed maps? So hopefully this gave us a useful set of finished genomes, right?
  • 51. Arabidopsis thaliana ✤ 2000: published genome size = 125 Mbp ✤ 2007: genome size = 157 Mbp ✤ 2012: genome size = 135 Mbp ✤ Amount sequenced = 119 Mbp ✤ Ns = 0.2% of genome Many published genome sizes are based on estimates, which can be wrong. As they sequenced more and more of the Arabidopsis genome, they had to revise how big it was. So between 2000 and 2007 they produced more sequence, but paradoxically the genome became less complete because the estimate of its size went up. Now it has come back down again. But the genome remains unfinished.
  • 53. Drosophila melanogaster ✤ Genome published 1998 ✤ Heterochromatin finished 2007 ✤ Ns = 4% of genome The fly genome was 'finished' in 1998. But this was only really the easy-to-sequence portion of the genome (the euchromatin). The trickier heterochromatin was sequenced as a separate project that didn't finish until almost a decade later. The fly genome remains unfinished.
  • 57. Caenorhabditis elegans ✤ Genome published 1998 ✤ 2004: last N removed ✤ 1998–2013: genome sequence changes ✤ Nov 2012: 558 insertions, 230 deletions, 614 substitutions The worm genome has no unknown bases in it. However, since the publication of the genome sequence, the genome has continued to be refined as errors are corrected. The last batch of changes all occurred just last November. So almost 15 years after publication, we can still find over 1,400 errors in one of the best-characterized genome sequences that exist.
  • 59. Saccharomyces cerevisiae ✤ Genome published 1997 ✤ 12 Mbp genome ✤ 1,653 changes to genome since 1997 ✤ Last changes made in 2011 Likewise in yeast. The first eukaryotic genome sequence continues to receive fixes to correct the sequence. The most recent set of changes was made in 2011. These changes affected coding sequences, not just intergenic and intronic DNA.
  • 60. Genome assembly: then Genetic maps ✓ Physical maps ✓ Understanding of target genome ✓ Haploid / low heterozygosity genome ✓ Accurate & long reads ✓ Resources (time, money, people) ✓ And all of this was done in an era when we had all of these supporting materials.
  • 61. Genome assembly: now Genetic maps ✗ Physical maps ✗ Understanding of target genome ✗ Haploid / low heterozygosity genome ✗ Accurate & long reads ✗ Resources (time, money, people) ✗ We don't have these now! Genome sequencing no longer requires an international consortium; it could be a project for a grad student.
  • 62. Assembling & finishing a genome is not easy! It was never easy, even when we had access to lots of resources to help us put together genomes. And it is not easy now. Don't be fooled into thinking that, because there are many published genome sequences, these sequences represent the absolute ideal genome sequence.
  • 63. Assemblathons A new idea is born Image from flickr.com/photos/dullhunk/4422952630
  • 65. If you sequence 10,000 genomes... ...you need to assemble 10,000 genomes The Assemblathon was born out of the Genome 10K project.
  • 68. How many assembly tools are out there? Ray Celera MIRA ALLPATHS-LG SGA Curtain Metassembler Phusion ABySS Amos Arapan CLC Cortex DNAnexus DNA Dragon EULER Edena Forge Geneious IDBA Newbler PRICE PADENA PASHA Phrap TIGR Sequencher SeqMan NGen SHARCGS SOPRA SSAKE SPAdes Taipan VCAKE Velvet Arachne PCAP GAM Monument Atlas ABBA Anchor ATAC Contrail DecGPU GenoMiner Lasergene PE-Assembler Pipeline Pilot QSRA SeqPrep SHORTY fermi Telescoper Quast SCARPA Hapsembler HapCompass HaploMerger SWiPS GigAssembler MSR-CA Which is the best? There are many, many tools out there for assembling, or helping to assemble, a genome sequence. It seems reasonable to ask...which is the best?
  • 73. Comparing assemblers ✤ Can't fairly compare two assemblers if they: ✤ produced assemblies from different species ✤ assembled same species, but used sequence data from different NGS platforms ✤ used same NGS platform but different sequence libraries ✤ Even using different options for the same assembler may produce very different assemblies! However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.
  • 74. A genome assembly competition That's where the Assemblathon came in.
  • 75. Genome assembly contests An attempt to standardize some aspects of the genome assembly process. Others have been trying to do the same thing, e.g. GAGE and dnGASP.
  • 76. Assemblathon 1 ✤ 2010–2011 ✤ Used synthetic data ✤ Small genome (~100 Mbp) ✤ We knew the answer! It is easier to judge a tool when you know what the final answer should look like. However, many people that work on developing assemblers would prefer to work with real data...
  • 77. Here we go again ...which is where Assemblathon 2 came in.
  • 79.
      Contest: Type of data; Number of genomes; Size of genomes; Do we know the answer?
      Assemblathon 1: Synthetic; 1; Small; ✓
      Assemblathon 2: Real; 3; Large; ✗
  • 80. Melopsittacus undulatus, Boa constrictor constrictor, Maylandia zebra A budgie, a cichlid fish from Lake Malawi, and a reptile.
  • 81. Bird, Snake, Fish Let's simplify the names for the rest of the talk.
  • 83. Why these three species? Because they were there There is no special reason why these species were used. People had a need to sequence the genomes, and some companies were willing to donate sequences.
  • 87. Assemble this! Lots of sequence data was provided for the bird. Mate pair and read pair libraries were available for all Illumina datasets.
      Species: Estimated genome size; Illumina; Roche 454; PacBio
      Bird: 1.2 Gbp; 285x (14 libraries); 16x (3 libraries); 10x (2 libraries)
      Fish: 1.0 Gbp; 192x (8 libraries); -; -
      Snake: 1.6 Gbp; 125x (4 libraries); -; -
  • 90. Who took part? 21 teams; 43 assemblies; 52,013,623,777 bp of sequence. Lots of teams took part, not just from the big sequencing/genome centers.
  • 92. Entries There were evaluation entries (not eligible to be declared the winner) allowed in addition to competition entries (only 1 per team).
      Species: Competitive entries; Evaluation entries
      Bird: 12; 3
      Fish: 10; 6
      Snake: 12; 0
  • 97. Goals ✤ Assess 'quality' of assemblies ✤ Define quality! ✤ Produce ranking of assemblies for each species ✤ Produce ranking of assemblers across species?
  • 98. Who did what?
      Person/group: Jobs
      Me, Ian, and Joseph Fass: Perform various analyses of all assemblies
      David Schwarz et al.: Produce & evaluate optical maps
      Jay Shendure et al.: Produce Fosmid sequences (bird & snake only)
      Martin Hunt & Thomas Otto: Performed REAPR analysis
      Dent Earl & Benedict Paten: Help with meta-analysis of final rankings
  • 100. 91 co-authors! flickr.com/photos/jamescridland/613445810 Hard to get agreement on how best to interpret the results. Some analyses and interpretations in the Assemblathon 2 paper end up being compromises.
  • 102. Lots of results! A screen grab of my master spreadsheet that contains all of the numerical results.
  • 105. 10 key metrics We focused on 10 of 102 metrics that we thought were a) useful and b) captured different aspects of an assembly's quality.
  • 106. The 10 key metrics.
      Key metric: Description
      1: NG50 scaffold length
      2: NG50 contig length
      3: Amount of assembly in 'gene-sized' scaffolds
      4: Number of 'core genes' present
      5: Fosmid coverage
      6: Fosmid validity
      7: Short-range scaffold accuracy
      8: Optical map: level 1
      9: Optical map: levels 1–3
      10: REAPR summary score
  • 107. 1) Scaffold NG50 lengths ✤ Can calculate NG50 length for each assembly ✤ But also calculate NG60, NG70 etc. ✤ Plot all results as a graph An N50 (or NG50) value on its own doesn't tell you that much. Ideally you should always be aware of the total assembly size and the distribution of lengths when comparing assemblies. You can do this by calculating not only NG50 but NG1 through NG100. NG1 would be the length of the scaffold that captures 1% of the estimated genome size (when summing scaffolds from longest to shortest).
  • 108. 1) Scaffold NG50 lengths Scaffold length is on a log axis and team identifiers are shown in the legend. The black dashed line shows the NG50 value, but the point where each series starts on the left shows the lengths of the longest scaffolds. Also, if the NG100 value is greater than zero, then that assembly is bigger than the known/estimated genome size.
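The NG1 through NG100 series behind this kind of plot can be sketched as follows. This is only an illustration under stated assumptions: the function name `ngx_series` is made up, undefined values are reported as 0 (which matches the slide's remark about NG100 being greater than zero only for assemblies at least as large as the genome), and the lengths are the fictional 190 Mbp example with an expected genome size of 250 Mbp.

```python
def ngx_series(lengths, genome_size, percents=range(1, 101)):
    """Map each percentage x to the NGx length (0 where the assembly is too small).

    `percents` must be in ascending order.
    """
    ordered = sorted(lengths, reverse=True)
    series = {}
    running = 0
    used = 0  # number of scaffolds added to the running total so far
    for x in percents:
        # Add scaffolds until the running total reaches x% of the genome size.
        # Integer arithmetic avoids floating-point comparisons.
        while used < len(ordered) and running * 100 < genome_size * x:
            running += ordered[used]
            used += 1
        reached = used > 0 and running * 100 >= genome_size * x
        series[x] = ordered[used - 1] if reached else 0
    return series


assembly_mbp = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]  # 190 Mbp in total
series = ngx_series(assembly_mbp, genome_size=250)
print(series[1], series[50], series[76], series[77], series[100])  # 70 15 5 0 0
```

Because the assembly holds only 76% of the expected genome, the series drops to zero from NG77 onwards; plotting the whole series shows this at a glance, which a single NG50 value cannot.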
  • 111. 2) Contig vs scaffold NG50 We did the same thing for contig NG50 as well as scaffold NG50. The two measures are sometimes, but not always, correlated. The two highlighted data points show outliers for bird assemblies, reflecting assemblies that are good at making long contigs *or* good at making long scaffolds, but not both.
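Contig statistics need contig lengths, which can be recovered from scaffolds by breaking them at runs of Ns. The sketch below is an assumption-laden illustration: the function name `split_into_contigs` and the `min_gap` parameter are made up, and the minimum run of Ns that counts as a gap is a choice each analysis has to make for itself.

```python
import re


def split_into_contigs(scaffold, min_gap=1):
    """Split a scaffold sequence into contigs at runs of at least `min_gap` Ns."""
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty pieces produced by leading or trailing runs of Ns.
    return [piece for piece in gap.split(scaffold) if piece]


# Toy scaffold: three contigs joined by gaps of 5 and 2 unknown bases.
scaffold = "ACGTACGT" + "N" * 5 + "GGCC" + "N" * 2 + "TTTAAA"

print([len(c) for c in split_into_contigs(scaffold)])             # [8, 4, 6]
print([len(c) for c in split_into_contigs(scaffold, min_gap=3)])  # [8, 12]
```

The second call shows how the gap threshold changes the result: with a stricter threshold the short run of two Ns is kept inside a contig, so contig lengths (and any contig NG50 derived from them) go up.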
  • 116. 3) Gene-sized scaffolds ✤ Do assemblers get a little too excited by length? ✤ How long is 'long enough' for a scaffold? ✤ What if you just wanted to find genes? ✤ Average vertebrate gene = ~25 Kbp It is great to have long scaffolds, but maybe for many questions that you might be interested in (e.g. studying codon usage bias), you only need to have scaffolds that have a good chance of capturing a full-length gene.
  • 117. 3) Gene-sized scaffolds The blue line shows the percentage of the estimated genome size that is present in scaffolds of 25 Kbp or longer. Most assemblies, even if they have a much shorter *average* scaffold length, may contain many scaffolds that are long enough to contain a single gene.
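The gene-sized scaffold measure described here is a simple filtered sum. A minimal sketch, assuming a made-up function name `percent_in_long_scaffolds` and invented scaffold lengths (in bp) for a hypothetical 1 Mbp genome:

```python
def percent_in_long_scaffolds(lengths, genome_size, min_length=25_000):
    """Percentage of the estimated genome size held in scaffolds >= min_length."""
    kept = sum(length for length in lengths if length >= min_length)
    return 100 * kept / genome_size


# Hypothetical assembly: three scaffolds are at least 25 Kbp long.
lengths_bp = [400_000, 120_000, 30_000, 24_000, 10_000, 5_000]
print(percent_in_long_scaffolds(lengths_bp, genome_size=1_000_000))  # 55.0
```

As with NG50, the denominator is the estimated genome size rather than the assembly size, so assemblies of different total size stay comparable.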
  • 121. 4) Core genes ✤ Used CEGMA tool ✤ CEGMA = set of 458 'Core Eukaryotic Genes' (CEGs) ✤ How many full-length CEGs are in each assembly? A previously developed tool (CEGMA) was used to see how many 'core genes' (extremely highly conserved genes) are present in each assembly. Note that CEGMA finds genes where a full-length (or nearly full-length) gene is present within a single scaffold. Many core genes might be present, but split across scaffolds.
  • 122. 4) Core genes These results show the number of CEGMA genes that were present in any one assembly as a percentage of all possible CEGMA genes (i.e. those present across all assemblies for each species).
  • 124. 4) Core genes In the three species, most of the core genes were present across all assemblies, but individual assemblies typically lacked several core genes.
      Core genes (out of 458)
      Species: Best individual assembly; Across all assemblies
      Bird: 420; 442
      Fish: 436; 455
      Snake: 438; 454
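The two ways of expressing a core gene count (relative to the genes found in any assembly of that species, or relative to the full set of 458) can be worked through with the numbers in the table. The helper name `percent` and the dictionary layout are illustrative only.

```python
TOTAL_CEGS = 458

# (best individual assembly, found across all assemblies), taken from the table.
core_genes = {"Bird": (420, 442), "Fish": (436, 455), "Snake": (438, 454)}


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


for species, (best, across_all) in core_genes.items():
    print(
        f"{species}: best assembly has {percent(best, across_all)}% of the core genes "
        f"found in any assembly, {percent(best, TOTAL_CEGS)}% of all {TOTAL_CEGS}"
    )
```

Using the genes seen across all assemblies as the denominator discounts core genes that no team could recover, which may be absent from the input data rather than missed by a particular assembler.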
  • 125. 4) Core genes Example of one core gene predicted in bird assemblies. CEGMA gene predictions are available as supplementary material with the paper.
      ABYSS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
      BCM   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
      CRACS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
      CURT  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
      GAM   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
      MERAC MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
      PHUS  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
      RAY   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
      SGA   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
      SYMB  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
      SOAP  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
            ************************************************       *****

      ABYSS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
      BCM   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
      CRACS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
      CURT  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
      GAM   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNNLPHTHI
      MERAC FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
      PHUS  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
      RAY   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
      SGA   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
      SYMB  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
      SOAP  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
            ******************************************************

      ABYSS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
      BCM   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
      CRACS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
      CURT  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
      GAM   YGHALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLK------------------
      MERAC ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
      PHUS  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
      RAY   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
      SGA   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
      SYMB  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
      SOAP  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
               ***************************************

      ABYSS ILPLVTGAGHISVPFPDTYKMTKSY
      BCM   ILPLVTGAGHISVPFPDTYKMTKSY
      CRACS ILPLVTGAGHISVPFPDTYKMTKSY
      CURT  ILPLVTGAGHISVPFPDTYKMTKSY
      GAM   -------------------------
  • 130. 5) Fosmid coverage ✤ Had to first assemble Fosmids ✤ Looked at repeat content & coverage across Fosmids ✤ Aligned assembly scaffolds to Fosmids ✤ Only had Fosmids for bird and snake
  • 131. 5) Fosmid coverage Looked at coverage of Fosmids by aligning some of the input reads to the Fosmids. Occasionally we see small gaps in coverage. These represent Fosmid assembly errors or regions of the genome not captured by the input read data. We aligned scaffolds to the Fosmids and see that most assemblies contain most of the Fosmids, but repeats complicate the picture.
  • 140. 5) Fosmid coverage ✤ Only used regions of Fosmids that were validated by one or more assemblies ✤ Validated Fosmid Regions (VFRs) ✤ 99% of bird Fosmids ✤ 89% of snake Fosmids Most of the Fosmid sequences were used as 'Trusted' reference sequences by which to assess the assemblies.
  • 141. 5 & 6) Coverage & Validity COMPASS tool by Joe Fass The COMPASS tool compared the Validated Fosmid Regions (VFRs) to the scaffolds to calculate four measures, two of which ('coverage' and 'validity') were used as key metrics.
  • 142. 5 & 6) Coverage & Validity Some COMPASS results from the bird assemblies. Multiplicity is high when the assemblies were large (compared to the estimated genome size).
  • 145. 7) Short-range scaffold accuracy [Diagram: a Validated Fosmid Region with a 100 nt tag marked at each end of a fragment.] We also used the VFRs in another way. We took pairs of 100 nt 'tag' sequences from either end of consecutive 1000 nt fragments across all VFR sequences.
  • 149. 7) Short-range scaffold accuracy ✤ Map pairs of 'tag' sequences from each Validated Fosmid Region to assembly scaffolds ✤ How many map as a pair to one scaffold? ✤ How many map at the expected distance apart (900 ± 2 bp)? The start coordinates of each pair of tag sequences should map 900 nt apart in the assemblies, and hopefully both tags map only to the same scaffold. We combined both of these into one 'summary score' metric.
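The tag-pair procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in the paper: the function names, the hit representation `(scaffold_id, start)`, and the category labels are all hypothetical. Only the 1000 nt fragment size, the 100 nt tag size, and the 900 ± 2 bp window come from the slides.

```python
def tag_pairs(vfr, fragment_len=1000, tag_len=100):
    """Cut a VFR into consecutive fragments and take a tag from each end.

    Returns a list of (start_a, tag_a, start_b, tag_b); the two start
    coordinates are fragment_len - tag_len (900 nt) apart.
    """
    pairs = []
    for frag_start in range(0, len(vfr) - fragment_len + 1, fragment_len):
        start_a = frag_start
        start_b = frag_start + fragment_len - tag_len
        pairs.append((start_a, vfr[start_a:start_a + tag_len],
                      start_b, vfr[start_b:start_b + tag_len]))
    return pairs


def classify_pair(hit_a, hit_b, expected=900, tolerance=2):
    """Classify one mapped tag pair.

    Each hit is a hypothetical (scaffold_id, start) tuple, or None if the
    tag did not map at all.
    """
    if hit_a is None or hit_b is None:
        return "unmapped"
    if hit_a[0] != hit_b[0]:
        return "different_scaffolds"
    distance = abs(hit_b[1] - hit_a[1])
    if abs(distance - expected) <= tolerance:
        return "expected_distance"
    return "wrong_distance"


# Toy 3,500 nt "VFR": yields three complete 1000 nt fragments.
vfr = "ACGT" * 875
pairs = tag_pairs(vfr)
print(len(pairs), pairs[0][2] - pairs[0][0])
print(classify_pair(("scaffold_1", 100), ("scaffold_1", 1001)))
```

A summary score could then be built from the counts in each category; how the paper weights them is not stated on the slides, so that step is left out.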
  • 150. 7) Short-range scaffold accuracy Expected distance apart: 900 bp. Species: shortest / longest. Bird: 702 bp / 41,949 bp. Snake: 673 bp / 46,813 bp. Most pairs of tags mapped to the same scaffold, and at the expected distance apart, but there were a few notable exceptions.
  • 151. 7) Short-range scaffold accuracy The red line indicates the theoretical maximum summary score that could be achieved.
  • 157. 8 & 9) Optical maps ✤ Stretch out DNA ✤ Cut with restriction enzymes ✤ Note lengths of fragments ✤ Compare to in silico digest of scaffolds ✤ Not all scaffolds suitable for analysis For optical map analysis, scaffolds had to be a certain minimum length *and* possess enough restriction enzyme sites.
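The "in silico digest" step can be illustrated with a minimal sketch. It is not the pipeline used in the paper: the recognition site `GAATTC`, the cut offset, and the suitability thresholds are illustrative assumptions, since the slides name neither the enzyme nor the cut-offs.

```python
def in_silico_digest(sequence, site="GAATTC", cut_offset=1):
    """Return the fragment lengths produced by cutting at every site.

    site and cut_offset are assumptions chosen for illustration only.
    """
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


def is_suitable(sequence, min_length, min_sites, site="GAATTC"):
    """A scaffold qualifies only if it is long enough AND has enough sites.

    min_length and min_sites are hypothetical thresholds.
    """
    n_sites = len(in_silico_digest(sequence, site)) - 1
    return len(sequence) >= min_length and n_sites >= min_sites


# Toy scaffold with two recognition sites.
scaffold = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 5
fragments = in_silico_digest(scaffold)
print(fragments)
print(is_suitable(scaffold, min_length=40, min_sites=2))
```

The resulting list of fragment lengths is what would be compared against the fragment lengths measured on the optical map.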
  • 158. 8 & 9) Optical maps Image from University of Wisconsin-Madison An example of an optical map. After cutting, each DNA fragment is measured to estimate its length. Optical map results were divided into three categories (levels 1–3).
  • 159. 8 & 9) Optical maps White bars: total length of scaffolds that were suitable for optical map analysis. Dark blue: global alignments of scaffolds to maps (these are the best quality). Light blue: global alignments with more permissive thresholds. Orange bars: local alignments. We used level 1 (dark blue) as one key metric and levels 1+2+3 as a second key metric. The MLK assembly is good, *relatively* speaking (high percentage of suitable scaffolds are in level 1 category), but we record scores on an absolute basis (MERAC highest for level 1, SOAP highest for levels
  • 160. 8 & 9) Optical maps Fish optical map results were much worse than in bird, with very few assemblies having scaffolds with 'level 1' global alignments to the optical map. SGA had the most level 1 coverage, but a much lower amount of sequence that was alignable at any level (1, 2, or 3).
  • 161. 8 & 9) Optical maps Snake optical map results were intermediate compared to bird and fish.
  • 162. 10) REAPR summary score REAPR is a tool that aligns input reads to scaffolds and looks for base errors and regions which might represent misassemblies (where scaffolds should ideally be split in two). These two facets are combined into one summary score.
  • 164. What does this all mean?
  • 165. 102 metrics per assembly 10 key metrics 1 final ranking Using the 10 key metrics, we combined the results to produce a single score for each assembly by which to rank them.
  • 168.
      Assembly   Number of core genes   Rank   Z-score
      CRACS      438                    1      +0.68
      SYMB       436                    2      +0.59
      PHUS       435                    3      +0.54
      BCM        434                    4      +0.49
      SGA        433                    5      +0.44
      MERAC      430                    6      +0.30
      ABYSS      429                    7      +0.25
      SOAP       428                    8      +0.21
      RAY        422                    9      –0.08
      GAM        415                    10     –0.41
      CURT       360                    11     –3.02
      Although we did take an average rank from the 10 individual rankings, we preferred to use a Z-score approach. Each assembly was scored based on the total number of standard deviations from the average of each metric. This rewards/penalizes assemblies with very high/low scores in individual metrics. The above results are from the CEGMA metric in bird.
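The Z-scores in this table can be recomputed from the core gene counts. The sketch below assumes the population standard deviation (`statistics.pstdev`); the slides do not say which form was used, but this choice reproduces the tabulated values to two decimal places. The function name is hypothetical.

```python
from statistics import mean, pstdev

# Core gene counts for the bird assemblies, taken from the table above.
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433,
    "MERAC": 430, "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415,
    "CURT": 360,
}


def z_scores(values):
    """Number of standard deviations each assembly sits from the mean."""
    scores = list(values.values())
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population SD, not sample SD
    return {name: (score - mu) / sigma for name, score in values.items()}


z = z_scores(core_genes)
for name, value in z.items():
    print(f"{name:6s} {value:+.2f}")
```

Note how CURT, which is only 55 genes behind the next assembly, is pushed to roughly three standard deviations below the mean. This is the penalizing effect on outliers that a simple rank would hide.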
  • 173. This graph shows the final rankings of bird assemblies based on their sum Z-scores. Assemblies in red are the evaluation entries. The error bars show the highest and lowest sum Z-scores that would result from using any combination of 9 key metrics rather than 10. Note that the highest-ranked bird assembly was an evaluation assembly by Baylor College of Medicine; their competitive entry ranked 2nd.
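The sum Z-score and its error bars can be sketched as a leave-one-out calculation. The per-metric Z-scores below are made-up values for one hypothetical assembly, and the function name is an assumption; only the idea of dropping each key metric in turn comes from the slide.

```python
def sum_z_with_range(z_by_metric):
    """Return (sum of Z-scores, lowest and highest sum with one metric dropped)."""
    total = sum(z_by_metric)
    leave_one_out = [total - z for z in z_by_metric]
    return total, min(leave_one_out), max(leave_one_out)


# Hypothetical Z-scores for one assembly across three metrics.
total, low, high = sum_z_with_range([1.0, -0.5, 2.0])
print(total, low, high)
```

An assembly whose score depends heavily on a single metric will show a wide range, which is what the long error bars on the graph indicate.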
  • 174. In fish, BCM ranked 1st, though the error bars suggest there is much variability. The lack of Fosmid data means that there are only 7 key metrics rather than 10.
  • 175. Snake seemed to be the only species in which one assembler (SGA, in this case) outwardly appeared to outperform all others. We will return to this issue. Note that there were no evaluation entries for snake.
  • 176. Another way of looking at all of this data is to plot the Z-scores for each metric as a heat map (red = higher Z-scores).
  • 177. A parallel coordinates plot is another way of trying to show all of the information at once.
  • 178. What does this all mean?
  • 179. No really, what does this all mean? Still a bit hard to make sense of the overall rankings. What are the main findings from our paper?
  • 180. Some conclusions ✤ Very hard to find assemblers that performed well across all 10 key metrics ✤ Assemblers that perform well in one species do not always perform as well in another ✤ Bird & snake assemblies appear better than fish ✤ No real 'winner' for bird and fish
  • 181. SGA — best assembler for snake? Even if we had happened to use 9 key metrics rather than 10, and even if we threw out the metric where SGA performed the best, it would still probably rank 1st. So is that the end of the story?
  • 184.
      Description                                    Rank of snake SGA assembly
      NG50 scaffold length                           2
      NG50 contig length                             5
      Amount of assembly in 'gene-sized' scaffolds   7
      Number of 'core genes' present                 5
      Fosmid coverage                                2
      Fosmid validity                                2
      Short-range scaffold accuracy                  3
      Optical map: level 1                           2
      Optical map: levels 1–3                        1
      REAPR summary score                            2
      SGA only ranked 1st in one of the ten key metrics and ranked 7th in another. So it is a good assembler *on average*. But if one of these metrics was highly important to you, you may want to use an assembler that ranked higher in that metric.
  • 187. Best assembler across species? Number of 1st places (out of 27), excluding evaluation entries: BCM 5; Meraculous 4; Symbiose 4; Ray 3. Not all assemblers were used for all species, but many teams submitted entries for 2 or 3 of the species. In theory, if a team submitted an entry for all species, and if their assembler ranked 1st in all metrics, they could achieve 1st place twenty-seven times (10 + 10 + 7 for fish). So what was the best assembler across species, as judged by total number of 1st places? It is BCM. But Ray comes 4th with three 1st places.
  • 188. Ray performance. Final ranking by species: Bird 7; Fish 7; Snake 9. However, Ray ranks much lower when looking at its performance across all key metrics. So some assemblers do very well in specific measures and not so well in others, while other assemblers do moderately well across lots of metrics (e.g. SGA).
  • 189. We found it interesting that the best bird assembly was the evaluation entry by Baylor College of Medicine. What is different about this entry compared to their competitive entry?
  • 195. BCM bird assemblies
      Assembler           Final rank   NGS data used in assembly   Coverage Z-score   Validity Z-score   NG50 contig Z-score
      BCM - evaluation    1            Illumina + 454              +2.0               +1.4               +1.5
      BCM - competitive   2            Illumina + 454 + PacBio     –0.3               –0.8               +2.7
      The only difference is that the BCM competitive entry included PacBio data, and somehow this led to the paradoxical situation where including more sequence data produced lower measures for coverage and validity (from the Fosmids), though one key metric (NG50 contig length) did improve.
  • 198. BCM evaluation scaffold NNNNNNNNNNNNNNNNNNN BCM competition scaffold NNNNNNNNNNNNNNNNNNN PacBio sequence BCM used PacBio data to help fill in the gaps in their scaffolds.
  • 200. BCM evaluation scaffold NNNNNNNNNNNNNNNNNNN BCM competition scaffold CGTCGNNATCNNGGTTACG Mismatches from PacBio sequence penalized alignment score more than matching unknown bases Errors in the PacBio sequence were penalized by the choice of alignment program used to align Fosmids to scaffolds.
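A toy calculation shows how a gap filled with error-containing sequence can score worse than the same gap left as Ns. Everything here is hypothetical: the sequences are invented, and the match, mismatch and unknown-base scores are illustrative values, not lastz's actual parameters or the option used in the paper.

```python
def alignment_score(reference, scaffold, match=1, mismatch=-3, unknown=0):
    """Score an ungapped, same-length alignment base by base.

    The scores are illustrative assumptions, not those of any real aligner.
    """
    score = 0
    for ref_base, scaffold_base in zip(reference, scaffold):
        if scaffold_base == "N":
            score += unknown
        elif ref_base == scaffold_base:
            score += match
        else:
            score += mismatch
    return score


fosmid = "CGTCGAAATCTTGGTTACG"       # invented 'trusted' sequence
evaluation = "N" * len(fosmid)       # gap left as unknown bases
competitive = "CGACGNNTTCNNGCTTAGG"  # gap partly filled, with 4 errors

# Scheme 1: unknown bases are neutral, so leaving Ns beats filling with errors.
print(alignment_score(fosmid, evaluation), alignment_score(fosmid, competitive))

# Scheme 2: unknown bases are penalized like mismatches, so filling wins.
print(alignment_score(fosmid, evaluation, unknown=-3),
      alignment_score(fosmid, competitive, unknown=-3))
```

Under the first scheme the unfilled scaffold scores higher; under the second the order flips. This is the kind of sensitivity to a single scoring choice that the slides describe.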
  • 201. The choice of one command-line option, used by one tool in the calculation of one key metric... ...probably made enough difference to drop the PacBio-containing assembly to 2nd place. This was actually down to the use of a single command-line option to the lastz alignment program. If we had not chosen this option, the PacBio-containing entry would have probably ranked 1st among all bird assemblies.
  • 202. Other conclusions ✤ Different metrics tell different stories ✤ Heterozygosity was a big issue for bird & fish assemblies ✤ Final rankings very sensitive to changes in metrics ✤ N50 is a semi-useful predictor of assembly quality The last point may disappoint some. Despite looking at many different metrics, N50 scaffold length still does a reasonable job of predicting overall quality. However...
  • 203. ...the outliers in this relationship should be noted. The highlighted bird assembly had the second highest scaffold N50 length, but ranked 6th among bird assemblies.
  • 207. Inter-specific differences matter ✤ The three species have genomes with different properties ✤ repeats ✤ heterozygosity ✤ The three genomes had very different NGS data sets ✤ Only bird had PacBio & 454 data ✤ Different insert sizes in short-insert libraries Biological differences may account for differences in assembler performance between different species. However, the input data for each species were also very different, and this may play a role as well (some assemblers prefer certain short-insert sizes).
  • 209. The Big Conclusion "You can't always get what you want" Sir Michael Jagger, 1969
  • 212. What comes next? 3? There may be an Assemblathon 3. This will be decided at the next Genome 10K workshop (in April, 2013).
  • 220. A wish list for Assemblathon 3 ✤ Only have 1 species ✤ Teams have to 'buy' resources using virtual budgets ✤ Factor in CPU time/cost? ✤ Agree on metrics before evaluating assemblies! ✤ Encourage experimental assemblies ✤ Use new FASTG genome assembly file format ✤ Get someone else to write the paper! If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.