Genome assembly: then and now — v1.0

A talk that I gave to a general audience at UC Davis. Slides were also used for Prof. Ian Korf's presentation at the Genome 10K workshop (May 25th, 2013). This talk mostly concerns the results of the Assemblathon 2 contest, but also covers other issues relating to genome assembly.

Note: this talk has been superseded by updated versions (also available on SlideShare).

Transcript

  • 1. Genome assembly: then and now. Keith Bradnam. Image from the Wellcome Trust.
  • 2. Contents: Sequencing 101; Genome assembly: then; Genome assembly: now; Assemblathon 1; Assemblathon 2; Assemblathon 3. Image from flickr.com/photos/dougitdesign/5613967601/
  • 3. More info ✤ http://assemblathon.org ✤ http://arxiv.org ✤ http://twitter.com/assemblathon. The Assemblathon 2 paper has been reviewed; we are just dealing with reviewers' comments.
  • 4. Sequencing 101: A, C, G, T... Fred Sanger. Image from nlm.nih.gov
  • 5. Read. Most sequencing technologies start with a sequencing read. A read could be as short as 25 bp (Solexa sequencing from a few years ago) or longer than 15,000 bp (PacBio with the latest chemistry).
  • 6. Read pair. Most sequencing is done with pairs of connected reads, separated by a short interval whose length is known. Read pairs can also overlap with each other.
  • 7. Read pair / mate pair. Mate pairs, also known as jumping pairs, have much larger inserts (thousands or tens of thousands of bp), but it is hard to make good mate pair libraries. Having very large inserts is very useful for the purposes of genome assembly.
  • 8. Sequence a whole lot of read pairs, and hopefully they will overlap with each other and allow you to start making contiguous sequences...
  • 9. Contigs: ...which are better known as contigs.
  • 10–12. Scaffold. Mate pairs, or other information, can hopefully be used to connect contigs together into scaffolds. The unknown gap between contigs is replaced with unknown bases (Ns).
  • 13–14. Assembly size. [Toy assembly of 12 scaffolds: 70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5 Mbp, totaling 200 Mbp.] Assembly size is simply the sum of all scaffolds or contigs that are included in the final genome assembly. If you are calculating the assembly size from scaffolds, then some fraction of that final size will come from the Ns in scaffold sequences. Here we have a toy genome assembly, with 12 scaffolds totaling 200 Mbp.
  • 15–17. N50 length. The most widely used measure to describe genome assemblies is the N50 length of scaffolds or contigs. This is essentially a weighted mean, designed to be more informative than a crude mean length (which is not very useful if you end up with thousands of very short contigs). To calculate the N50 scaffold length, start with the length of the longest scaffold (70 Mbp)...
  • 18–19. If this length does not exceed 50% of the total assembly size (50% is why it is N50), proceed to the next longest scaffold and add its length to a running total (95 Mbp).
  • 20–21. Now we have exceeded 50% of the total assembly size (115 Mbp).
  • 22. The length of the contig or scaffold that takes you past 50% is what is reported as the N50 length. So here, we have an N50 length of 20 Mbp (see the code sketch below).
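
A minimal Python sketch of the N50 calculation just described, using the toy scaffold lengths from the slides (values in Mbp); this is an illustration only, not code from the talk:

def n50(lengths):
    """Length of the scaffold/contig that takes the running total past 50% of the assembly size."""
    assembly_size = sum(lengths)
    running_total = 0
    for length in sorted(lengths, reverse=True):   # longest first
        running_total += length
        if running_total >= assembly_size / 2:     # passed 50% of the assembly size
            return length

toy_scaffolds = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5]   # 12 scaffolds, 200 Mbp
print(n50(toy_scaffolds))   # prints 20, matching the 20 Mbp N50 from the slides
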
  • 23–25. N50 may be more robust than a simple mean length, but it can still be easily manipulated. What if we excluded the two shortest scaffolds from our assembly?
  • 26–28. Now the total assembly size is 10 Mbp smaller (a reduction of only 5%), but the N50 increases to 25 Mbp, a 25% increase. If these were two different assemblies and you only saw an N50 of 25 Mbp vs an N50 of 20 Mbp, you might think the first assembly was much better.
  • 29–31. N50 for two assemblies: 208 Mbp (N50 = 15 Mbp) vs 190 Mbp (N50 = 25 Mbp). Here are another two fictional assemblies. The first assembly now has a lower N50 value, but this is purely because it contains more sequence (albeit in short scaffolds). Do you want more sequence in your assembly, or fewer but longer sequences?
  • 32–34. NG50 for two assemblies (expected genome size = 250 Mbp). We prefer a measure called NG50. This does not use the assembly size, but instead uses the known (or estimated) genome size (the G in NG50 refers to the Genome).
  • 35–36. NG50 = 15 Mbp for both assemblies. The NG50 of these two assemblies is now the same. We think that NG50 is a fairer way of comparing genome assemblies that might differ in their total size.
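
The same bookkeeping gives NG50 if the running total is compared with half of the known or estimated genome size rather than half of the assembly size. A hedged sketch (illustration only, not code from the talk):

def ng50(lengths, genome_size):
    """Like N50, but relative to the estimated genome size rather than the assembly size."""
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= genome_size / 2:
            return length
    return None   # assembly captures less than 50% of the genome estimate

# Per the slides, with an expected genome size of 250 Mbp the 208 Mbp and 190 Mbp
# fictional assemblies both end up with an NG50 of 15 Mbp, even though their N50
# values differ (15 Mbp vs 25 Mbp).
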
  • 37–39. How do I describe thee? Let me count the ways. Metrics include: assembly size (with or without very short contigs?); N50 / NG50 (for contigs and/or scaffolds); coverage (when compared to a reference sequence); errors (base errors from alignment to a reference sequence and/or the input read data); number of genes (from comparison to a reference transcriptome and/or a set of known genes); and many, many more. Apart from assembly size and N50/NG50 length, there are many other ways to describe a genome assembly.
  • 40. Genome assembly: back in the day... How were genomes assembled back in the late 1990s, when genome sequencing projects were starting to make the news?
  • 41. Genome assembly: thenGenome sequencing projects often had a fantastic amount of supporting material whichhelped put the genome together. They were further helped by targeting genomes which hadlow heterozygosity. And of course this was all done with Sanger sequencing which gave long,accurate reads.
  • 42. Genetic maps ✓Physical maps ✓Understanding of target genome ✓Haploid / low heterozygosity genome ✓Accurate & long reads ✓Resources (time, money, people) ✓Genome assembly: thenGenome sequencing projects often had a fantastic amount of supporting material whichhelped put the genome together. They were further helped by targeting genomes which hadlow heterozygosity. And of course this was all done with Sanger sequencing which gave long,accurate reads.
  • 43. Genetic maps ✓Physical maps ✓Understanding of target genome ✓Haploid / low heterozygosity genome ✓Accurate & long reads ✓Resources (time, money, people) ✓Genome assembly: thenGenome sequencing projects often had a fantastic amount of supporting material whichhelped put the genome together. They were further helped by targeting genomes which hadlow heterozygosity. And of course this was all done with Sanger sequencing which gave long,accurate reads.
  • 44. Genetic maps ✓Physical maps ✓Understanding of target genome ✓Haploid / low heterozygosity genome ✓Accurate & long reads ✓Resources (time, money, people) ✓Genome assembly: thenGenome sequencing projects often had a fantastic amount of supporting material whichhelped put the genome together. They were further helped by targeting genomes which hadlow heterozygosity. And of course this was all done with Sanger sequencing which gave long,accurate reads.
  • 45. Genetic maps ✓Physical maps ✓Understanding of target genome ✓Haploid / low heterozygosity genome ✓Accurate & long reads ✓Resources (time, money, people) ✓Genome assembly: thenGenome sequencing projects often had a fantastic amount of supporting material whichhelped put the genome together. They were further helped by targeting genomes which hadlow heterozygosity. And of course this was all done with Sanger sequencing which gave long,accurate reads.
  • 46. Genetic maps ✓Physical maps ✓Understanding of target genome ✓Haploid / low heterozygosity genome ✓Accurate & long reads ✓Resources (time, money, people) ✓Genome assembly: thenGenome sequencing projects often had a fantastic amount of supporting material whichhelped put the genome together. They were further helped by targeting genomes which hadlow heterozygosity. And of course this was all done with Sanger sequencing which gave long,accurate reads.
  • 47. Genetic maps ✓Physical maps ✓Understanding of target genome ✓Haploid / low heterozygosity genome ✓Accurate & long reads ✓Resources (time, money, people) ✓Genome assembly: thenGenome sequencing projects often had a fantastic amount of supporting material whichhelped put the genome together. They were further helped by targeting genomes which hadlow heterozygosity. And of course this was all done with Sanger sequencing which gave long,accurate reads.
  • 48. So what was the result of spending millions of dollars to assemble genomes of well-characterized species, with accurate long reads, and detailed maps??? So hopefully this gave us a useful set of finished genomes, right?
  • 49–51. Arabidopsis thaliana ✤ 2000: published genome size = 125 Mbp ✤ 2007: genome size = 157 Mbp ✤ 2012: genome size = 135 Mbp ✤ Amount sequenced = 119 Mbp ✤ Ns = 0.2% of genome. Many published genome sizes are based on estimates which can be wrong. As more and more of the Arabidopsis genome was sequenced, its estimated size had to be revised. So between 2000 and 2007 more sequence was produced, but the genome paradoxically became less complete because the size estimate went up. The estimate has since come back down again, but the genome remains unfinished.
  • 52–53. Drosophila melanogaster ✤ Genome published 1998 ✤ Heterochromatin finished 2007 ✤ Ns = 4% of genome. The fly genome was finished in 1998, but this was only really the easy-to-sequence portion of the genome (the euchromatin). The trickier heterochromatin was sequenced as a separate project that didn't finish until almost a decade later. The fly genome remains unfinished.
  • 54–57. Caenorhabditis elegans ✤ Genome published 1998 ✤ 2004: last N removed ✤ 1998–2013 genome sequence changes: 558 insertions, 230 deletions, 614 substitutions (Nov 2012). The worm genome has no unknown bases in it. However, since the publication of the genome sequence it has continued to be refined as errors are corrected; the last batch of changes occurred just last November. So almost 15 years after publication, we can still find over 1,400 errors in one of the best-characterized genome sequences that exists.
  • 58–59. Saccharomyces cerevisiae ✤ Genome published 1997 ✤ 12 Mbp genome ✤ 1,653 changes to the genome since 1997 ✤ Last changes made in 2011. Likewise in yeast: the first eukaryotic genome sequence continues to receive fixes to correct the sequence. The last set of changes was made in 2011, and these changes affected coding sequences, not just intergenic and intronic DNA.
  • 60. Genome assembly: then. Genetic maps ✓ Physical maps ✓ Understanding of target genome ✓ Haploid / low heterozygosity genome ✓ Accurate & long reads ✓ Resources (time, money, people) ✓. And all of this was done in an era when we had all of these supporting materials.
  • 61. Genome assembly: now. Genetic maps ✗ Physical maps ✗ Understanding of target genome ✗ Haploid / low heterozygosity genome ✗ Accurate & long reads ✗ Resources (time, money, people) ✗. We don't have these now! Genome sequencing no longer requires an international consortium; it could instead be a project for a grad student.
  • 62. Assembling & finishing a genome is not easy! It was never easy, even when we had access to lots of resources to help us put genomes together, and it is not easy now. Don't be fooled into thinking that, because there are many published genome sequences, these sequences represent the absolute ideal genome sequence.
  • 63. Assemblathons: a new idea is born. Image from flickr.com/photos/dullhunk/4422952630
  • 64–65. If you sequence 10,000 genomes... you need to assemble 10,000 genomes. The Assemblathon was born out of the Genome 10K project.
  • 66–68. How many assembly tools are out there? Which is the best? [Slide shows a cloud of assembler and assembly-related tool names: Ray, Celera, MIRA, ALLPATHS-LG, SGA, Curtain, Metassembler, Phusion, ABySS, Amos, Arapan, CLC, Cortex, DNAnexus, DNA Dragon, EULER, Edena, Forge, Geneious, IDBA, Newbler, PRICE, PADENA, PASHA, Phrap, TIGR, Sequencher, SeqMan NGen, SHARCGS, SOPRA, SSAKE, SPAdes, Taipan, VCAKE, Velvet, Arachne, PCAP, GAM, Monument, Atlas, ABBA, Anchor, ATAC, Contrail, DecGPU, GenoMiner, Lasergene, PE-Assembler, Pipeline Pilot, QSRA, SeqPrep, SHORTY, fermi, Telescoper, Quast, SCARPA, Hapsembler, HapCompass, HaploMerger, SWiPS, GigAssembler, MSR-CA.] There are many, many tools out there for assembling, or helping to assemble, a genome sequence. It seems reasonable to ask: which is the best?
  • 69–73. Comparing assemblers: you can't fairly compare two assemblers if they produced assemblies from different species; assembled the same species but used sequence data from different NGS platforms; or used the same NGS platform but different sequence libraries. Even using different options for the same assembler may produce very different assemblies! In short, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.
  • 74. A genome assembly competition. That's where the Assemblathon came in.
  • 75. Genome assembly contests: an attempt to standardize some aspects of the genome assembly process. Others have been trying to do the same thing, e.g. GAGE and dnGASP.
  • 76. Assemblathon 1 ✤ 2010–2011 ✤ Used synthetic data ✤ Small genome (~100 Mbp) ✤ We knew the answer! It is easier to judge a tool when you know what the final answer should look like. However, many people that work on developing assemblers would prefer to work with real data...
  • 77. Here we go again...which is where Assemblathon 2 came in.
  • 78–79. Assemblathon 1 vs Assemblathon 2: Assemblathon 1 used synthetic data, 1 genome, a small genome size, and we knew the answer (✓); Assemblathon 2 used real data, 3 genomes, large genome sizes, and we did not know the answer (✗).
  • 80. Melopsittacus undulatus, Boa constrictor constrictor, Maylandia zebra: a budgie, a cichlid fish from Lake Malawi, and a reptile.
  • 81. Bird, Snake, Fish. Let's simplify the names for the rest of the talk.
  • 82–83. Why these three species? Because they were there. There is no special reason why these species were used: people had a need to sequence the genomes, and some companies were willing to donate sequences.
  • 84–87. Assemble this! Bird: estimated genome size 1.2 Gbp; Illumina 285x (14 libraries), Roche 454 16x (3 libraries), PacBio 10x (2 libraries). Fish: 1.0 Gbp; Illumina 192x (8 libraries). Snake: 1.6 Gbp; Illumina 125x (4 libraries). Lots of sequence data was provided for the bird. Mate pair and read pair libraries were available for all Illumina datasets.
  • 88–90. Who took part? 21 teams, 43 assemblies, 52,013,623,777 bp of sequence. Lots of teams took part, not just the big sequencing/genome centers.
  • 91–92. Entries. Bird: 12 competitive, 3 evaluation; Fish: 10 competitive, 6 evaluation; Snake: 12 competitive, 0 evaluation. Evaluation entries (not eligible to be declared the winner) were allowed in addition to competitive entries (only one competitive entry per team).
  • 93–97. Goals ✤ Assess quality of assemblies ✤ Define quality! ✤ Produce a ranking of assemblies for each species ✤ Produce a ranking of assemblers across species?
  • 98. Who did what? Me, Ian, and Joseph Fass: perform various analyses of all assemblies. David Schwartz et al.: produce & evaluate optical maps. Jay Shendure et al.: produce Fosmid sequences (bird & snake only). Martin Hunt & Thomas Otto: perform REAPR analysis. Dent Earl & Benedict Paten: help with meta-analysis of final rankings.
  • 99–100. 91 co-authors! (Image from flickr.com/photos/jamescridland/613445810.) It was hard to get agreement on how best to interpret the results; some analyses and interpretations in the Assemblathon 2 paper ended up being compromises.
  • 101. Results!
  • 102. Lots of results! A screen grab of my master spreadsheet that contains all of the numerical results.
  • 103. 102 different metrics!
  • 104. 10 key metrics. We focused on 10 of the 102 metrics that we thought were a) useful and b) captured different aspects of an assembly's quality.
  • 105. The 10 key metrics: 1) NG50 scaffold length; 2) NG50 contig length; 3) amount of assembly in gene-sized scaffolds; 4) number of core genes present; 5) Fosmid coverage; 6) Fosmid validity; 7) short-range scaffold accuracy; 8) optical map: level 1; 9) optical map: levels 1–3; 10) REAPR summary score.
  • 106. 1) Scaffold NG50 lengths ✤ Can calculate an NG50 length for each assembly ✤ But also calculate NG60, NG70, etc. ✤ Plot all results as a graph. An N50 (or NG50) value on its own doesn't tell you that much. Ideally you should always be aware of the total assembly size and the distribution of lengths when comparing assemblies. You can do this by calculating not only NG50 but NG1..NG100; NG1 would be the length of the scaffold that captures 1% of the estimated genome size (when summing scaffolds from longest to shortest).
  • 107. 1) Scaffold NG50 lengths. Scaffold length is on a log axis and team identifiers are shown in the legend. The black dashed line shows the NG50 value, while the point where each series starts on the left shows the length of the longest scaffold. Also, if the NG100 value is greater than zero, then that assembly is bigger than the known/estimated genome size.
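
One way to produce curves like these is to compute the whole NG(x) series from NG1 to NG100. A rough sketch of the idea (illustration only; assumes simple accumulation from longest to shortest scaffold):

def ng_series(lengths, genome_size):
    """Return {x: NG(x)} for x = 1..100; NG(x) is the scaffold length reached
    once x% of the estimated genome size has been accumulated (0 if never)."""
    lengths = sorted(lengths, reverse=True)
    series, running_total, i = {}, 0, 0
    for x in range(1, 101):
        threshold = genome_size * x / 100
        while running_total < threshold and i < len(lengths):
            running_total += lengths[i]
            i += 1
        series[x] = lengths[i - 1] if running_total >= threshold else 0
    return series
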
  • 108–110. 2) Contig vs scaffold NG50. We did the same thing for contig NG50 as well as scaffold NG50. The two measures are sometimes, but not always, correlated. The two highlighted data points show outliers for bird assemblies, reflecting assemblies that are good at making long contigs *or* good at making long scaffolds, but not both.
  • 111–115. 3) Gene-sized scaffolds ✤ Do assemblers get a little too excited by length? ✤ How long is long enough for a scaffold? ✤ What if you just wanted to find genes? ✤ Average vertebrate gene = ~25 kbp. It is great to have long scaffolds, but for many questions that you might be interested in (e.g. studying codon usage bias), you only need scaffolds that have a good chance of capturing a full-length gene.
  • 116. 3) Gene-sized scaffolds. The blue line shows the percentage of the estimated genome size that is present in scaffolds of 25 kbp or longer. Most assemblies, even if they have a much shorter *average* scaffold length, may contain many scaffolds that are long enough to contain a single gene (see the sketch below).
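
A small sketch of the gene-sized-scaffolds metric, assuming a 25 kbp threshold as in the slides (illustration only, not code from the talk):

def pct_genome_in_long_scaffolds(lengths, genome_size, min_length=25_000):
    """Percentage of the estimated genome size held in scaffolds of at least min_length bp."""
    return 100 * sum(l for l in lengths if l >= min_length) / genome_size
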
  • 117–120. 4) Core genes ✤ Used the CEGMA tool ✤ CEGMA = set of 458 Core Eukaryotic Genes (CEGs) ✤ How many full-length CEGs are in each assembly? A previously developed tool (CEGMA) was used to see how many core genes (extremely highly conserved genes) are present in each assembly. Note that CEGMA only finds genes where a full-length (or nearly full-length) gene is present within a single scaffold; many core genes might be present but split across scaffolds.
  • 121. 4) Core genes. These results show the number of CEGMA genes that were present in any one assembly as a percentage of all possible CEGMA genes (i.e. those present across all assemblies for each species).
  • 122–123. 4) Core genes (out of 458). Bird: best individual assembly 420, across all assemblies 442. Fish: 436 and 455. Snake: 438 and 454. In all three species, most of the core genes were present across all assemblies, but individual assemblies typically lacked several core genes.
  • 124. 4) Core genes. [Slide shows a protein alignment of one core gene as predicted from each bird assembly (ABySS, BCM, CRACS, CURT, GAM, MERAC, PHUS, RAY, SGA, SYMB, SOAP); the predictions largely agree, with insertions, deletions, and truncations in a few assemblies.] Example of one core gene predicted in bird assemblies. CEGMA gene predictions are available as supplementary material with the paper.
  • 125–129. 5) Fosmid coverage ✤ Had to first assemble the Fosmids ✤ Looked at repeat content & coverage across Fosmids ✤ Aligned assembly scaffolds to Fosmids ✤ Only had Fosmids for bird and snake.
  • 130–136. 5) Fosmid coverage. We looked at coverage of the Fosmids by aligning some of the input reads to them. Occasionally we see small gaps in coverage; these represent Fosmid assembly errors or regions of the genome not captured by the input read data. We aligned scaffolds to the Fosmids and see that most assemblies contain most of the Fosmids, but repeats complicate the picture.
  • 137–139. 5) Fosmid coverage ✤ Only used regions of Fosmids that were validated by one or more assemblies ✤ Validated Fosmid Regions (VFRs): 99% of bird Fosmids, 89% of snake Fosmids. Most of the Fosmid sequence was used as a trusted reference against which to assess the assemblies.
  • 140. 5 & 6) Coverage & validity: the COMPASS tool by Joe Fass. COMPASS compared the Validated Fosmid Regions (VFRs) to the scaffolds to calculate four measures, two of which (coverage and validity) were used as key metrics.
  • 141. 5 & 6) Coverage & validity. Some COMPASS results from the bird assemblies. Multiplicity is high when the assemblies were large (compared to the estimated genome size).
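
The talk does not give the COMPASS formulas, but the intuition behind coverage and validity can be sketched roughly as follows (an illustration of the idea only, not the actual COMPASS implementation; the alignment intervals are assumed to come from an aligner of your choice):

def merged_length(intervals):
    """Total length covered by (start, end) intervals, with overlaps merged."""
    total, current_start, current_end = 0, None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total

def coverage(vfr_alignment_intervals, total_vfr_length):
    """Roughly: fraction of the trusted VFR sequence covered by scaffold alignments."""
    return merged_length(vfr_alignment_intervals) / total_vfr_length

def validity(scaffold_alignment_intervals, total_scaffold_length):
    """Roughly: fraction of the scaffold sequence that can be validated by alignment to the VFRs."""
    return merged_length(scaffold_alignment_intervals) / total_scaffold_length
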
  • 142–144. 7) Short-range scaffold accuracy. We also used the VFRs in another way: we took pairs of 100 nt tag sequences from either end of consecutive 1,000 nt fragments across all VFR sequences.
  • 145–148. Map pairs of tag sequences to assembly scaffolds. How many map as a pair to one scaffold? How many map the expected distance apart (900 ± 2 bp)? The start coordinates of each pair of tag sequences should map 900 nt apart in the assemblies, and hopefully both tags map only to the same scaffold. We combined both of these into one summary score metric.
  • 149. 7) Short-range scaffold accuracy. Observed distances between tag pairs expected to be 900 bp apart: bird, shortest 702 bp, longest 41,949 bp; snake, shortest 673 bp, longest 46,813 bp. Most pairs of tags mapped to the same scaffold, and at the expected distance apart, but there were a few notable exceptions (a small code sketch of the tag-pair check follows).
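
A hedged sketch of the tag-pair idea (illustration only; the mapping step is assumed to be done with a separate aligner, and the hit coordinates are hypothetical):

def make_tag_pairs(vfr_sequence, window=1000, tag=100):
    """Yield (left_tag, right_tag) pairs from either end of consecutive 1,000 nt windows."""
    for start in range(0, len(vfr_sequence) - window + 1, window):
        fragment = vfr_sequence[start:start + window]
        yield fragment[:tag], fragment[-tag:]

def pair_is_consistent(left_hit, right_hit, expected=900, tolerance=2):
    """left_hit/right_hit = (scaffold_id, start_coordinate) from mapping each tag."""
    same_scaffold = left_hit[0] == right_hit[0]
    distance_ok = abs(abs(right_hit[1] - left_hit[1]) - expected) <= tolerance
    return same_scaffold and distance_ok
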
  • 150. 7) Short-range scaffold accuracy. The red line indicates the theoretical maximum summary score that could be achieved.
  • 151–156. 8 & 9) Optical maps ✤ Stretch out DNA ✤ Cut with restriction enzymes ✤ Note the lengths of the fragments ✤ Compare to an in silico digest of the scaffolds ✤ Not all scaffolds are suitable for analysis. For the optical map analysis, scaffolds had to be a certain minimum length *and* possess enough restriction enzyme sites (a toy in silico digest is sketched below).
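
A toy in silico digest, to illustrate cutting scaffold sequences at restriction sites and collecting fragment lengths for comparison with the optical map (the recognition site below is just an example, not necessarily the enzyme used in Assemblathon 2, and real digests cut within the site rather than at its first base):

def in_silico_digest(sequence, site="GAATTC"):
    """Return fragment lengths from cutting the sequence at each occurrence of the site."""
    fragment_lengths = []
    previous_cut = 0
    position = sequence.find(site)
    while position != -1:
        fragment_lengths.append(position - previous_cut)
        previous_cut = position
        position = sequence.find(site, position + 1)
    fragment_lengths.append(len(sequence) - previous_cut)
    return fragment_lengths
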
  • 157. 8 & 9) Optical maps. Image from the University of Wisconsin-Madison. An example of an optical map: after cutting, each DNA fragment is measured to estimate its length. Optical map results were divided into three categories (levels 1–3).
  • 158. 8 & 9) Optical maps. White bars: total length of scaffolds that were suitable for optical map analysis. Dark blue: global alignments of scaffolds to maps (these are the best quality). Light blue: global alignments with more permissive thresholds. Orange bars: local alignments. We used level 1 (dark blue) as one key metric and levels 1+2+3 as a second key metric. The MLK assembly is good, *relatively* speaking (a high percentage of its suitable scaffolds are in the level 1 category), but we record scores on an absolute basis (MERAC is highest for level 1, SOAP is highest for levels 1–3).
  • 159. 8 & 9) Optical maps. Fish optical map results were much worse than in bird, with very few assemblies having scaffolds with level 1 global alignments to the optical map. SGA had the most level 1 coverage, but a much lower amount of sequence that was alignable at any level (1, 2, or 3).
  • 160. 8 & 9) Optical maps. Snake optical map results were intermediate compared to bird and fish.
  • 161–162. 10) REAPR summary score. REAPR is a tool that aligns input reads to scaffolds and looks for base errors and for regions which might represent misassemblies (where scaffolds should ideally be split in two). These two facets are combined into one summary score.
  • 163. What does this all mean?
  • 164. 102 metrics per assembly → 10 key metrics → 1 final ranking. Using the 10 key metrics, we combined the results to produce a single score for each assembly by which to rank them.
  • 165–167. Z-scores for the core gene (CEGMA) metric in bird. Assembly, number of core genes, rank, Z-score: CRACS 438, 1, +0.68; SYMB 436, 2, +0.59; PHUS 435, 3, +0.54; BCM 434, 4, +0.49; SGA 433, 5, +0.44; MERAC 430, 6, +0.30; ABYSS 429, 7, +0.25; SOAP 428, 8, +0.21; RAY 422, 9, –0.08; GAM 415, 10, –0.41; CURT 360, 11, –3.02. Although we did take an average rank from the 10 individual rankings, we preferred a Z-score approach: each assembly was scored by the number of standard deviations from the mean of each metric, and these Z-scores were then summed. This rewards or penalizes assemblies with very high or very low scores in individual metrics (see the sketch below).
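
A sketch of the Z-score ranking approach described above (illustration only; it assumes higher metric values are better, so metrics where lower is better would need their sign flipped, and it uses the sample standard deviation):

from statistics import mean, stdev

def z_scores(metric_values):
    """metric_values: {assembly: value} for one metric; returns {assembly: Z-score}."""
    values = list(metric_values.values())
    mu, sd = mean(values), stdev(values)
    return {assembly: (value - mu) / sd for assembly, value in metric_values.items()}

def sum_z_ranking(key_metrics):
    """key_metrics: {metric_name: {assembly: value}}; rank assemblies by summed Z-score."""
    totals = {}
    for metric_values in key_metrics.values():
        for assembly, z in z_scores(metric_values).items():
            totals[assembly] = totals.get(assembly, 0.0) + z
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
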
  • 168–172. This graph shows the final rankings of the bird assemblies based on their sum Z-scores. Assemblies in red are the evaluation entries. The error bars reflect the highest and lowest sum Z-scores that would result from using any combination of 9 of the 10 key metrics. Note that the highest-ranked bird assembly was an evaluation assembly by Baylor College of Medicine; their competitive entry ranked 2nd.
  • 173. In fish, BCM ranked 1st, though the error bars suggest there is much variability. The lack of Fosmid data means that there are only 7 key metrics rather than 10.
  • 174. Snake seemed to be the only species where one assembler outwardly appeared to outperform all others (SGA, in this case). We will return to this issue. Note that there were no evaluation entries for snake.
  • 175. Another way of looking at all of this data is to plot the Z-scores for each metric as a heatmap (red = higher Z-scores).
  • 176. A parallel coordinates plot is another way of trying to show all of the information at once.
  • 177. What does this all mean?
  • 178. No really, what does this all mean? It is still a bit hard to make sense of the overall rankings. What are the main findings from our paper?
  • 179. Some conclusions ✤ Very hard to find assemblers that performed well across all 10 key metrics ✤ Assemblers that perform well in one species do not always perform as well in another ✤ Bird & snake assemblies appear better than fish ✤ No real winner for bird and fish.
  • 180–181. SGA: the best assembler for snake? Even if we had happened to use 9 key metrics rather than 10, and even if we threw out the metric where SGA performed best, it would still probably rank 1st. So is that the end of the story?
  • 182–183. Rank of the snake SGA assembly in each key metric: NG50 scaffold length 2; NG50 contig length 5; amount of assembly in gene-sized scaffolds 7; number of core genes present 5; Fosmid coverage 2; Fosmid validity 2; short-range scaffold accuracy 3; optical map level 1: 2; optical map levels 1–3: 1; REAPR summary score 2. SGA ranked 1st in only one of the ten key metrics and 7th in another. So it is a good assembler *on average*, but if one of these metrics is highly important to you, you may want to use an assembler that ranked higher in that metric.
  • 184–186. Best assembler across species? Number of 1st places (out of 27, excluding evaluation entries): BCM 5, Meraculous 4, Symbiose 4, Ray 3. Not all assemblers were used for all species, but many teams submitted entries for two or three of the species. In theory, if a team submitted an entry for every species and their assembler ranked 1st in all metrics, it could achieve 1st place twenty-seven times (10 + 10 + 7 for fish). Judged by the total number of 1st places, the best assembler across species is BCM, with Ray 4th on three 1st places.
  • 187. Ray performance: final ranking of 7 in bird, 7 in fish, and 9 in snake. However, Ray ranks much lower when looking at its performance across all key metrics. So some assemblers do very well in specific measures and not so well in others, while other assemblers (e.g. SGA) do moderately well across lots of metrics.
  • 188–189. We found it interesting that the best bird assembly was the evaluation entry by Baylor College of Medicine. What is different about this entry compared to their competitive entry?
  • 190. AssemblerFinalrankNGS dataused inassemblyCoverageZ-scoreValidityZ-scoreNG50 ContigZ-scoreBCM -evaluation1Illumina +454+2.0 +1.4 +1.5BCM -competitive2Illumina +454 + PacBio–0.3 –0.8 +2.7BCM bird assembliesThe only difference is that the BCM competitive entry included PacBio data, and somehow thisled to the paradoxical situation where including more sequenced produced a lower measuresfor coverage and validity (from the Fosmids), though one key metric (NG50 contig length) didimprove.
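A Z-score expresses a value in terms of standard deviations from the mean of a group; here, presumably, each assembly's value for a metric relative to all the other entries for that species. The sketch below shows the calculation itself; the assembly names and NG50 values are invented for illustration and are not Assemblathon 2 data.

```python
# Hedged sketch: computing a per-metric z-score across competing assemblies.
# The assembly names and NG50 values are invented, NOT real Assemblathon 2 numbers.
from statistics import mean, stdev

def z_scores(values):
    """Return each value expressed in standard deviations from the group mean."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical NG50 contig lengths (bp) for four bird assemblies.
ng50_contig = {"A": 60_000, "B": 25_000, "C": 110_000, "D": 40_000}

for name, z in zip(ng50_contig, z_scores(list(ng50_contig.values()))):
    print(f"Assembly {name}: z = {z:+.2f}")
```

Expressing each metric this way puts very different quantities (coverage, validity, NG50) on a comparable scale, which is why positive and negative Z-scores can be read directly as "better or worse than the average entry".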
  • 195. BCM evaluation scaffold: NNNNNNNNNNNNNNNNNNN
BCM used PacBio data to help fill in the gaps in their scaffolds.
  • 196. BCM evaluation scaffold: NNNNNNNNNNNNNNNNNNN
BCM competition scaffold: NNNNNNNNNNNNNNNNNNN
BCM used PacBio data to help fill in the gaps in their scaffolds.
  • 197. BCM evaluation scaffold: NNNNNNNNNNNNNNNNNNN
BCM competition scaffold: NNNNNNNNNNNNNNNNNNN
PacBio sequence
BCM used PacBio data to help fill in the gaps in their scaffolds.
  • 198. BCM evaluation scaffold: NNNNNNNNNNNNNNNNNNN
BCM competition scaffold: CGTCGNNATCNNGGTTACG
Errors in the PacBio sequence were penalized by the choice of alignment program used to align Fosmids to scaffolds.
  • 199. BCM evaluation scaffold: NNNNNNNNNNNNNNNNNNN
BCM competition scaffold: CGTCGNNATCNNGGTTACG
Mismatches from PacBio sequence penalized alignment score more than matching unknown bases.
Errors in the PacBio sequence were penalized by the choice of alignment program used to align Fosmids to scaffolds.
  • 200. The choice of one command-line option, used by one tool, in the calculation of one key metric... probably made enough difference to drop the PacBio-containing assembly to 2nd place.
This was actually down to the use of a single command-line option to the lastz alignment program. If we had not chosen this option, the PacBio-containing entry would probably have ranked 1st among all bird assemblies.
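The following toy sketch illustrates the general effect being described: if an aligner scores an unknown base (N) as neutral but penalizes a mismatch, then filling a gap with sequence that contains a few errors can score worse than leaving the gap as Ns. The scoring values and sequences below are invented; this is not lastz's actual scoring scheme or the specific option used in the evaluation.

```python
# Toy illustration of how N-handling in alignment scoring can penalize gap-filling.
# The scores and sequences are invented; this is NOT lastz's real scoring scheme,
# just the general principle described on the slide.
MATCH, MISMATCH, N_SCORE = 1, -3, 0   # hypothetical: N is neutral, mismatches cost a lot

def score(reference, scaffold):
    """Score an ungapped alignment column by column."""
    total = 0
    for r, s in zip(reference, scaffold):
        if s == "N":
            total += N_SCORE      # unknown base: neither rewarded nor penalized
        elif r == s:
            total += MATCH
        else:
            total += MISMATCH     # a wrong base is penalized outright
    return total

fosmid           = "ACGTACGTACGTACGTACGT"   # trusted "truth" sequence
scaffold_with_ns = "ACGTACGTNNNNNNNTACGT"   # gap left as Ns
scaffold_filled  = "ACGTACGTACCTAGTTACGT"   # gap filled, but with a few wrong bases

print("Gap left as Ns     :", score(fosmid, scaffold_with_ns))   # 13
print("Gap filled (errors):", score(fosmid, scaffold_filled))    # 8
```

Even though the filled scaffold contains more real sequence, the mismatch penalty outweighs the neutral treatment of Ns under this scoring choice, which is essentially the situation that pushed the PacBio-containing entry down.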
  • 201. Other conclusions
✤ Different metrics tell different stories
✤ Heterozygosity was a big issue for bird & fish assemblies
✤ Final rankings very sensitive to changes in metrics
✤ N50 is a semi-useful predictor of assembly quality
The last point may disappoint some. Despite looking at many different metrics, N50 scaffold length still does a reasonable job of predicting overall quality. However...
  • 202. ...the outliers in this relationship should be noted. The highlighted bird assembly had the second highest scaffold N50 length, but ranked 6th among bird assemblies.
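Since N50 (just described as a semi-useful predictor) and NG50 (used in the key metrics above) come up repeatedly, here is a minimal sketch of how the two statistics are calculated. The scaffold lengths and genome-size estimate are invented; the point is only that NG50 is measured against an estimated genome size rather than the assembly's own total size, which makes it comparable between assemblies of different sizes.

```python
# Minimal sketch of N50 vs NG50 (invented scaffold lengths, in Mbp; not Assemblathon data).
def n50(lengths, target_size=None):
    """N50 if target_size is None; NG50 if target_size is an estimated genome size.

    Returns the length L such that scaffolds of length >= L together contain
    at least half of the target size.
    """
    total = target_size if target_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return 0  # the assembly covers less than half of the target size

scaffolds = [40, 30, 25, 20, 15, 10, 10]                 # toy assembly totalling 150 Mbp
print("N50 :", n50(scaffolds), "Mbp")                    # 25 (against the 150 Mbp assembly)
print("NG50:", n50(scaffolds, target_size=200), "Mbp")   # 20 (against a 200 Mbp genome estimate)
```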
  • 204. Inter-specific differences matter
Biological differences may account for differences in assembler performance between different species. However, the input data for each species was also very different, and this may play a role as well (some assemblers prefer certain short-insert sizes).
  • 205. Inter-specific differences matter
✤ The three species have genomes with different properties
  ✤ repeats
  ✤ heterozygosity
Biological differences may account for differences in assembler performance between different species. However, the input data for each species was also very different, and this may play a role as well (some assemblers prefer certain short-insert sizes).
  • 206. Inter-specific differences matter
✤ The three species have genomes with different properties
  ✤ repeats
  ✤ heterozygosity
✤ The three genomes had very different NGS data sets
  ✤ Only bird had PacBio & 454 data
  ✤ Different insert sizes in short-insert libraries
Biological differences may account for differences in assembler performance between different species. However, the input data for each species was also very different, and this may play a role as well (some assemblers prefer certain short-insert sizes).
  • 207. The Big Conclusion
  • 208. The Big Conclusion
"You can't always get what you want"
Sir Michael Jagger, 1969
  • 209. What comes next?
  • 210. What comes next?
There may be an Assemblathon 3. This will be decided at the next Genome 10K workshop (in April 2013).
  • 211. What comes next? 3?
There may be an Assemblathon 3. This will be decided at the next Genome 10K workshop (in April 2013).
  • 212. A wish list for Assemblathon 3
If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.
  • 213. A wish list for Assemblathon 3
✤ Only have 1 species
If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.
  • 214. A wish list for Assemblathon 3
✤ Only have 1 species
✤ Teams have to buy resources using virtual budgets
If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.
  • 215. A wish list for Assemblathon 3
✤ Only have 1 species
✤ Teams have to buy resources using virtual budgets
✤ Factor in CPU time/cost?
If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.
  • 216. A wish list for Assemblathon 3
✤ Only have 1 species
✤ Teams have to buy resources using virtual budgets
✤ Factor in CPU time/cost?
✤ Agree on metrics before evaluating assemblies!
If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.
  • 217. A wish list for Assemblathon 3
✤ Only have 1 species
✤ Teams have to buy resources using virtual budgets
✤ Factor in CPU time/cost?
✤ Agree on metrics before evaluating assemblies!
✤ Encourage experimental assemblies
If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.
  • 218. A wish list for Assemblathon 3
✤ Only have 1 species
✤ Teams have to buy resources using virtual budgets
✤ Factor in CPU time/cost?
✤ Agree on metrics before evaluating assemblies!
✤ Encourage experimental assemblies
✤ Use new FASTG genome assembly file format
If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.
  • 219. A wish list for Assemblathon 3
✤ Only have 1 species
✤ Teams have to buy resources using virtual budgets
✤ Factor in CPU time/cost?
✤ Agree on metrics before evaluating assemblies!
✤ Encourage experimental assemblies
✤ Use new FASTG genome assembly file format
✤ Get someone else to write the paper!
If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.
  • 220. ~ fin ~