Genome assembly: then and now — v1.1

This talk was given on 2014-06-19 for the Genome Center’s Bioinformatics Core as part of a one-week workshop on using Galaxy. It covers the Assemblathon projects and other aspects of genome assembly.

A version of this talk is also available on Slideshare with embedded notes.

Note: this is an evolving talk. Older and newer versions are also available on SlideShare.

Published in: Education, Technology

Genome assembly: then and now — v1.1

  1. 1. Genome assembly: then and now Keith Bradnam Image from Wellcome Trust
  2. 2. Contents: Sequencing 101; Genome assembly: then; Genome assembly: now; Assemblathon 1 & 2; Advice & Angst; The future. Image from flickr.com/photos/dougitdesign/5613967601/
  3. 3. More info ✤ http://assemblathon.org ✤ http://gigasciencejournal.com ✤ http://twitter.com/assemblathon
  4. 4. Sequencing 101 A, C, G, T... Image from nlm.nih.gov
  5. 5. Read
  6. 6. Read pair
  7. 7. Read pair Mate pair
  8. 8. Contigs
  9. 9. Scaffold NNNNNNNNNNNNNNNNNNN
  10. 10. Assembly size NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 15 15 15 5
  11. 11. Assembly size NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 200 Mbp 15 15 15 5
  12. 12. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 200 Mbp 15 15 15 5
  13. 13. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 200 Mbp 15 15 15 5
  14. 14. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 200 Mbp 15 15 15 5 70
  15. 15. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 15 15 15 5 200 Mbp 95
  16. 16. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 15 15 15 5 200 Mbp 95
  17. 17. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 15 15 15 5 200 Mbp 115
  18. 18. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 15 15 15 5 200 Mbp 115
  19. 19. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 5 15 15 15 5 200 Mbp
  20. 20. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 15 15 15 5 5
  21. 21. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 15 15 15 5 5
  22. 22. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 15 15 15
  23. 23. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 15 15 15
  24. 24. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 15 15 15 190 Mbp
  25. 25. N50 length NNNNNNNNNNNNNNNNNNN NNNNNNNNNNN NNNNNNNNNNN 70 25 20 10 10 5 5 15 15 15 190 Mbp
  26. 26. N50 for two assemblies
  27. 27. N50 for two assemblies 208 Mbp 190 Mbp
  28. 28. N50 for two assemblies 208 Mbp 190 Mbp N50 = 15 Mbp N50 = 25 Mbp
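
The N50 walk-through on the preceding slides can be sketched in a few lines of Python. This is a minimal sketch rather than code from the talk: the function name `n50` is an assumption, and the two lists are the scaffold lengths (in Mbp) shown on the slides for the 200 Mbp and 190 Mbp examples.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues at exactly 50%.
        if running * 2 >= total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


# Scaffold lengths (Mbp) as drawn on the slides.
assembly_200 = [70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15, 5]
assembly_190 = [70, 25, 20, 10, 10, 5, 5, 15, 15, 15]

print(n50(assembly_200))  # 20: the running total reaches 115 of 200 Mbp
print(n50(assembly_190))  # 25: the running total reaches 95 of 190 Mbp
```

Dropping two 5 Mbp scaffolds raises the N50 from 20 to 25 Mbp even though nothing about the remaining scaffolds changed, which is the point the slides build towards.
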
  29. 29. NG50 for two assemblies 208 Mbp 190 Mbp
  30. 30. NG50 for two assemblies
  31. 31. NG50 for two assemblies Expected genome size = 250 Mbp
  32. 32. Expected genome size = 250 Mbp NG50 for two assemblies
  33. 33. NG50 = 15 Mbp NG50 = 15 Mbp Expected genome size = 250 Mbp NG50 for two assemblies
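
NG50 differs from N50 only in the denominator: the running total is compared against half of the expected genome size instead of half of the assembly size. A minimal sketch, assuming a hypothetical helper named `ng50` and reusing the 190 Mbp scaffold lengths from the earlier slides:

```python
def ng50(lengths, genome_size):
    """Return the NG50: the length at which the running total of sorted
    sequence lengths first reaches half of the expected genome size.
    Returns None if the assembly covers less than half of the genome."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= genome_size:
            return length
    return None


# Scaffold lengths (Mbp) of the 190 Mbp example assembly from the slides.
assembly_190 = [70, 25, 20, 10, 10, 5, 5, 15, 15, 15]

print(ng50(assembly_190, genome_size=250))  # 15
```

Because every assembly of the same species is measured against the same genome size, NG50 values are comparable between assemblies in a way that N50 values are not.
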
  34. 34. You should check that high N50 values are not simply due to lots of Ns in the scaffolds!
  35. 35. Assembly 'x'
  36. 36. Assembly 'x' Size: 859 Mbp; Number of scaffolds: 28; N50 = 70.3 Mbp
  37. 37. Assembly 'x' Size: 859 Mbp; Number of scaffolds: 28; N50 = 70.3 Mbp; Ns = 90.6% !!!
  38. 38. Assembly 'x' Size: 859 Mbp; Number of scaffolds: 28; N50 = 70.3 Mbp; Ns = 90.6% !!!
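
Checking the proportion of Ns takes very little code. The sketch below is an illustration only: the function name `n_fraction` and the toy sequences are assumptions, and it expects plain FASTA text passed in as an iterable of lines.

```python
def n_fraction(fasta_lines):
    """Return the fraction of bases that are N (or n) in FASTA-formatted lines."""
    n_count = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += len(line)
        n_count += line.upper().count("N")
    return n_count / total if total else 0.0


# Toy input: two made-up scaffolds, mostly unknown bases.
example = [
    ">scaffold_1",
    "ACGTNNNNNNNNNNNNACGT",
    ">scaffold_2",
    "NNNNNNNNNN",
]

print(f"{n_fraction(example):.1%} Ns")
```

An open file handle can be passed in place of the list, since iterating over a text file also yields lines.
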
  39. 39. Basic assembly metrics
  40. 40. Basic assembly metrics (Metric: Description). Assembly size: with or without very short contigs? N50 / NG50: for contigs and/or scaffolds. Coverage: when compared to a reference sequence. Errors: base errors from alignment to reference sequence and/or input read data. Number of genes: from comparison to reference transcriptome and/or set of known genes.
  41. 41. Basic assembly metrics (Metric: Description). Assembly size: with or without very short contigs? N50 / NG50: for contigs and/or scaffolds. Coverage: when compared to a reference sequence. Errors: base errors from alignment to reference sequence and/or input read data. Number of genes: from comparison to reference transcriptome and/or set of known genes. And many, many more...
  42. 42. Genome assembly Back in the day...
  43. 43. Genome assembly Back in the day... 1998
  44. 44. Genome assembly: then
  45. 45. Genetic maps ✓ Genome assembly: then
  46. 46. Genetic maps ✓ Physical maps ✓ Genome assembly: then
  47. 47. Genetic maps ✓ Physical maps ✓ Understanding of target genome ✓ Genome assembly: then
  48. 48. Genetic maps ✓ Physical maps ✓ Understanding of target genome ✓ Haploid / low heterozygosity genome ✓ Genome assembly: then
  49. 49. Genetic maps ✓ Physical maps ✓ Understanding of target genome ✓ Haploid / low heterozygosity genome ✓ Accurate & long reads ✓ Genome assembly: then
  50. 50. Genetic maps ✓ Physical maps ✓ Understanding of target genome ✓ Haploid / low heterozygosity genome ✓ Accurate & long reads ✓ Resources (time, money, people) ✓ Genome assembly: then
  51. 51. So what was the result of spending millions of dollars to assemble genomes of well-characterized species, with accurate long reads, and detailed maps???
  52. 52. ✤ 2000: published genome size = 125 Mbp ✤ 2007: genome size = 157 Mbp ✤ 2012: genome size = 135 Mbp Arabidopsis thaliana
  53. 53. ✤ 2000: published genome size = 125 Mbp ✤ 2007: genome size = 157 Mbp ✤ 2012: genome size = 135 Mbp ✤ Amount sequenced = 119 Mbp Arabidopsis thaliana
  54. 54. ✤ 2000: published genome size = 125 Mbp ✤ 2007: genome size = 157 Mbp ✤ 2012: genome size = 135 Mbp ✤ Amount sequenced = 119 Mbp ✤ Ns = 0.2% of genome Arabidopsis thaliana
  55. 55. Two views of the same gene
  56. 56. Two views of the same gene. Top: from genome sequence view on TAIR web site. Bottom: from gene sequence file on TAIR FTP site
  57. 57. Drosophila melanogaster ✤ Genome published 1998 ✤ Heterochromatin finished 2007
  58. 58. Drosophila melanogaster ✤ Genome published 1998 ✤ Heterochromatin finished 2007 ✤ Ns = 4% of genome
  59. 59. Caenorhabditis elegans ✤ Genome published 1998 ✤ 2004: last N removed
  60. 60. Caenorhabditis elegans ✤ Genome published 1998 ✤ 2004: last N removed ✤ 1998–2014: genome sequence changes
  61. 61. Caenorhabditis elegans ✤ Genome published 1998 ✤ 2004: last N removed ✤ 1998–2014: genome sequence changes ✤ 558 insertions ✤ 230 deletions ✤ 614 substitutions
  62. 62. Caenorhabditis elegans ✤ Genome published 1998 ✤ 2004: last N removed ✤ 1998–2014: genome sequence changes ✤ 558 insertions ✤ 230 deletions ✤ 614 substitutions }Nov 2012
  63. 63. Saccharomyces cerevisiae ✤ Genome published 1997 ✤ 12 Mbp genome ✤ 1,653 changes to genome since 1997
  64. 64. Saccharomyces cerevisiae ✤ Genome published 1997 ✤ 12 Mbp genome ✤ 1,653 changes to genome since 1997 ✤ Last changes made in 2011
  65. 65. Genetic maps ✓ Physical maps ✓ Understanding of target genome ✓ Haploid / low heterozygosity genome ✓ Accurate & long reads ✓ Resources (time, money, people) ✓ Genome assembly: then
  66. 66. Genetic maps ✗ Physical maps ✗ Understanding of target genome ✗ Haploid / low heterozygosity genome ✗ Accurate & long reads ✗ Resources (time, money, people) ✗ Genome assembly: now
  67. 67. Assembling & finishing a genome is not easy!
  68. 68. Assemblathons A new idea is born Image from flickr.com/photos/dullhunk/4422952630
  69. 69. If you sequence 10,000 genomes... ...you need to assemble 10,000 genomes
  70. 70. How many assembly tools are out there?
  71. 71. How many assembly tools are out there? bambus2 Ray Celera MIRA ALLPATHS-LG SGA Curtain Metassembler Phusion ABySS Amos Arapan CLC Cortex DNAnexus DNA Dragon Edena Forge Geneious IDBA Newbler PRICE PADENA PASHA Phrap TIGR Sequencher SeqMan NGen SHARCGS SOPRA SSAKE SPAdes Taipan VCAKE Velvet Arachne PCAP GAM Monument Atlas ABBA Anchor ATAC Contrail DecGPU GenoMiner Lasergene PE-Assembler Pipeline Pilot QSRA SeqPrep SHORTY fermi Telescoper Quast SCARPA Hapsembler HapCompass HaploMerger SWiPS GigAssembler MSR-CA MaSuRCA GARM Cerulean TIGRA ngsShoRT PERGA SOAPdenovo REAPR FRCBam EULER-SR SSPACE Opera mip gapfiller image PBJelly HGAP FALCON Dazzler GGAKE A5 CABOG SHRAP SR-ASM SuccinctAssembly SUTTA Ragout Tedna Trinity SWAP-Assembler SILP3 AutoAssemblyD KGBAssembler MetAMOS iMetAMOS MetaVelvet-SL KmerGenie Nesoni Pilon Platanus CGAL GAGM Enly BESST Khmer GRIT IDBA-MTP dipSPAdes WhatsHap SHEAR ELOPER OMACC
  72. 72. How many assembly tools are out there?
  73. 73. How many assembly tools are out there? bambus2 Ray Celera MIRA ALLPATHS-LG SGA Curtain Metassembler Phusion ABySS Amos Arapan CLC Cortex DNAnexus DNA Dragon Edena Forge Geneious IDBA Newbler PRICE PADENA PASHA Phrap TIGR Sequencher SeqMan NGen SHARCGS SOPRA SSAKE SPAdes Taipan VCAKE Velvet Arachne PCAP GAM Monument Atlas ABBA Anchor ATAC Contrail DecGPU GenoMiner Lasergene PE-Assembler Pipeline Pilot QSRA SeqPrep SHORTY fermi Telescoper Quast SCARPA Hapsembler HapCompass HaploMerger SWiPS GigAssembler MSR-CA MaSuRCA GARM Cerulean TIGRA ngsShoRT PERGA SOAPdenovo REAPR FRCBam EULER-SR SSPACE Opera mip gapfiller image PBJelly HGAP FALCON Dazzler GGAKE A5 CABOG SHRAP SR-ASM SuccinctAssembly SUTTA Ragout Tedna Trinity SWAP-Assembler SILP3 AutoAssemblyD KGBAssembler MetAMOS iMetAMOS MetaVelvet-SL KmerGenie Nesoni Pilon Platanus CGAL GAGM Enly BESST Khmer GRIT IDBA-MTP dipSPAdes WhatsHap SHEAR ELOPER OMACC
  74. 74. How many assembly tools are out there? bambus2 Ray Celera MIRA ALLPATHS-LG SGA Curtain Metassembler Phusion ABySS Amos Arapan CLC Cortex DNAnexus DNA Dragon Edena Forge Geneious IDBA Newbler PRICE PADENA PASHA Phrap TIGR Sequencher SeqMan NGen SHARCGS SOPRA SSAKE SPAdes Taipan VCAKE Velvet Arachne PCAP GAM Monument Atlas ABBA Anchor ATAC Contrail DecGPU GenoMiner Lasergene PE-Assembler Pipeline Pilot QSRA SeqPrep SHORTY fermi Telescoper Quast SCARPA Hapsembler HapCompass HaploMerger SWiPS GigAssembler MSR-CA MaSuRCA GARM Cerulean TIGRA ngsShoRT PERGA SOAPdenovo REAPR FRCBam EULER-SR SSPACE Opera mip gapfiller image PBJelly HGAP FALCON Dazzler GGAKE A5 CABOG SHRAP SR-ASM SuccinctAssembly SUTTA Ragout Tedna Trinity SWAP-Assembler SILP3 AutoAssemblyD KGBAssembler MetAMOS iMetAMOS MetaVelvet-SL KmerGenie Nesoni Pilon Platanus CGAL GAGM Enly BESST Khmer GRIT IDBA-MTP dipSPAdes WhatsHap SHEAR ELOPER OMACC Which is the best?
  75. 75. Comparing assemblers ✤ Can't fairly compare two assemblers if they:
  76. 76. Comparing assemblers ✤ Can't fairly compare two assemblers if they: ✤ produced assemblies from different species
  77. 77. Comparing assemblers ✤ Can't fairly compare two assemblers if they: ✤ produced assemblies from different species ✤ assembled same species, but used sequence data from different sequencing technologies
  78. 78. Comparing assemblers ✤ Can't fairly compare two assemblers if they: ✤ produced assemblies from different species ✤ assembled same species, but used sequence data from different sequencing technologies ✤ used same sequencing technologies but have different sequence libraries
  79. 79. Comparing assemblers ✤ Can't fairly compare two assemblers if they: ✤ produced assemblies from different species ✤ assembled same species, but used sequence data from different sequencing technologies ✤ used same sequencing technologies but have different sequence libraries ✤ Even using different options for the same assembler may produce very different assemblies!
  80. 80. The PRICE genome assembler has 52 command-line options!!!
  81. 81. The PRICE genome assembler has 52 command-line options!!! how many of them are you going to learn?
  82. 82. A genome assembly competition
  83. 83. Genome assembly contests: an attempt to standardize some aspects of the genome assembly process
  84. 84. Assemblathon 1 ✤ 2010–2011 ✤ Used synthetic data ✤ Small genome (~100 Mbp) ✤ We knew the answer
  85. 85. Here we go again
  86. 86. Assemblathon 1: type of data = synthetic; number of genomes = 1; size of genomes = small; do we know the answer? ✓
  87. 87. Assemblathon 1: type of data = synthetic; number of genomes = 1; size of genomes = small; do we know the answer? ✓. Assemblathon 2: type of data = real; number of genomes = 3; size of genomes = large; do we know the answer? ✗
  88. 88. Melopsittacus undulatus; Boa constrictor constrictor; Maylandia zebra
  89. 89. Bird; Snake; Fish
  90. 90. Why these three species?
  91. 91. Why these three species? Because they were there
  92. 92. Assemble this! Species: Bird | Fish | Snake. Estimated genome size: 1.2 Gbp | 1.0 Gbp | 1.6 Gbp
  93. 93. Assemble this! Species: Bird | Fish | Snake. Estimated genome size: 1.2 Gbp | 1.0 Gbp | 1.6 Gbp. Illumina: 285x (14 libraries) | 192x (8 libraries) | 125x (4 libraries)
  94. 94. Assemble this! Species: Bird | Fish | Snake. Estimated genome size: 1.2 Gbp | 1.0 Gbp | 1.6 Gbp. Illumina: 285x (14 libraries) | 192x (8 libraries) | 125x (4 libraries). Roche 454 (bird only): 16x (3 libraries)
  95. 95. Assemble this! Species: Bird | Fish | Snake. Estimated genome size: 1.2 Gbp | 1.0 Gbp | 1.6 Gbp. Illumina: 285x (14 libraries) | 192x (8 libraries) | 125x (4 libraries). Roche 454 (bird only): 16x (3 libraries). PacBio (bird only): 10x (2 libraries)
  96. 96. Who took part?
  97. 97. Who took part?
  98. 98. Who took part? 21 teams; 43 assemblies; 52,013,623,777 bp of sequence
  99. 99. Entries. Species: Bird | Fish | Snake. Competitive entries: 12 | 10 | 12
  100. 100. Entries. Species: Bird | Fish | Snake. Competitive entries: 12 | 10 | 12. Evaluation entries: 3 | 6 | 0
  101. 101. Goals
  102. 102. Goals ✤ Assess 'quality' of assemblies
  103. 103. Goals ✤ Assess 'quality' of assemblies ✤ Define quality!
  104. 104. Goals ✤ Assess 'quality' of assemblies ✤ Define quality! ✤ Produce ranking of assemblies for each species
  105. 105. Goals ✤ Assess 'quality' of assemblies ✤ Define quality! ✤ Produce ranking of assemblies for each species ✤ Produce ranking of assemblers across species?
  106. 106. Who did what? (Person/group: Jobs) Me, Ian Korf, and Joseph Fass: perform various analyses of all assemblies. David Schwarz et al.: produce & evaluate optical maps. Jay Shendure et al.: produce Fosmid sequences (bird & snake only). Martin Hunt & Thomas Otto: perform REAPR analysis. Dent Earl & Benedict Paten: help with meta-analysis of final rankings.
  107. 107. 91 co-authors! flickr.com/photos/jamescridland/613445810
  108. 108. Results!
  109. 109. Lots of results!
  110. 110. 102 different metrics!
  111. 111. 10 key metrics
  112. 112. Key metrics: 1) NG50 scaffold length; 2) NG50 contig length; 3) Amount of assembly in 'gene-sized' scaffolds; 4) Number of 'core genes' present; 5) Fosmid coverage; 6) Fosmid validity; 7) Short-range scaffold accuracy; 8) Optical map: level 1; 9) Optical map: levels 1–3; 10) REAPR summary score
  113. 113. Key metrics: 1) NG50 scaffold length; 2) NG50 contig length; 3) Amount of assembly in 'gene-sized' scaffolds; 4) Number of 'core genes' present; 5) Fosmid coverage; 6) Fosmid validity; 7) Short-range scaffold accuracy; 8) Optical map: level 1; 9) Optical map: levels 1–3; 10) REAPR summary score
  114. 114. 1) Scaffold NG50 lengths ✤ Can calculate NG50 length for each assembly ✤ But also calculate NG60, NG70 etc. ✤ Plot all results as a graph
  115. 115. 1) Scaffold NG50 lengths
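
Generalising NG50 to NG10 through NG90 gives the series of points that such a graph would plot. The sketch below is illustrative only: `ngx` is a hypothetical helper, the lengths and the 250 Mbp genome size are borrowed from the earlier example slides rather than from any Assemblathon 2 entry, and returning 0 when the assembly never reaches the threshold is an assumption made so the curve drops to zero.

```python
def ngx(lengths, genome_size, x):
    """Return NG(x): the length at which the running total of sorted
    sequence lengths first reaches x% of the expected genome size."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic keeps the threshold comparison exact.
        if running * 100 >= x * genome_size:
            return length
    return 0  # the assembly never reaches x% of the genome


lengths = [70, 25, 20, 10, 10, 5, 5, 15, 15, 15]  # Mbp, illustrative
series = {x: ngx(lengths, 250, x) for x in range(10, 100, 10)}
print(series)
```

Plotting `series` for every assembly on the same axes shows how quickly each one runs out of long scaffolds, which a single NG50 value hides.
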
  116. 116. 2) Contig vs scaffold NG50
  117. 117. 2) Contig vs scaffold NG50
  118. 118. 2) Contig vs scaffold NG50
  119. 119. 3) Gene-sized scaffolds
  120. 120. 3) Gene-sized scaffolds ✤ Some assembly folks get a little obsessed by length!
  121. 121. 3) Gene-sized scaffolds ✤ Some assembly folks get a little obsessed by length! ✤ How long is 'long enough' for a scaffold?
  122. 122. 3) Gene-sized scaffolds ✤ Some assembly folks get a little obsessed by length! ✤ How long is 'long enough' for a scaffold? ✤ What if you just wanted to find genes?
  123. 123. 3) Gene-sized scaffolds ✤ Some assembly folks get a little obsessed by length! ✤ How long is 'long enough' for a scaffold? ✤ What if you just wanted to find genes? ✤ Average vertebrate gene = ~25 Kbp
  124. 124. 3) Gene-sized scaffolds
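
One plausible reading of this metric is the share of an assembly that sits in scaffolds at least as long as an average vertebrate gene. The sketch below makes that assumption explicit: the function name `gene_sized_fraction` is hypothetical, and dividing by the assembly size (rather than the estimated genome size) is a choice made here, not something the slides specify.

```python
GENE_SIZED = 25_000  # bp; the ~25 Kbp average vertebrate gene from the slide


def gene_sized_fraction(scaffold_lengths, min_length=GENE_SIZED):
    """Return the fraction of assembled bases in scaffolds >= min_length."""
    total = sum(scaffold_lengths)
    if total == 0:
        return 0.0
    long_enough = sum(n for n in scaffold_lengths if n >= min_length)
    return long_enough / total


# Made-up scaffold lengths in bp.
print(gene_sized_fraction([30_000, 25_000, 10_000, 5_000]))
```
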
  125. 125. 4) Core genes
  126. 126. 4) Core genes ✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach)
  127. 127. 4) Core genes ✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach) ✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs)
  128. 128. 4) Core genes ✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach) ✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs) ✤ CEGs are conserved in: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens
  129. 129. 4) Core genes ✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach) ✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs) ✤ CEGs are conserved in: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens ✤ How many full-length CEGs are in each assembly?
  130. 130. 4) Core genes (out of 458). Species: Bird | Fish | Snake. Best individual assembly: 420 | 436 | 438
  131. 131. 4) Core genes (out of 458). Species: Bird | Fish | Snake. Best individual assembly: 420 | 436 | 438. Across all assemblies: 442 | 455 | 454
  132. 132. 4) Core genes
  133. 133. 4) Core genes

       ABYSS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
       BCM   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
       CRACS MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
       CURT  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
       GAM   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
       MERAC MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
       PHUS  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
       RAY   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
       SGA   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
       SYMB  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
       SOAP  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
             ************************************************       *****

       ABYSS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
       BCM   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
       CRACS FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
       CURT  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
       GAM   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNNLPHTHI
       MERAC FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
       PHUS  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
       RAY   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
       SGA   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
       SYMB  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
       SOAP  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
             ******************************************************

       ABYSS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
       BCM   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
       CRACS ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
       CURT  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
       GAM   YGHALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLK------------------
       MERAC ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
       PHUS  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
       RAY   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
       SGA   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
       SYMB  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
       SOAP  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
                ***************************************

  134. 134. 8 & 9) Optical maps
  135. 135. 8 & 9) Optical maps ✤ Stretch out DNA
  136. 136. 8 & 9) Optical maps ✤ Stretch out DNA ✤ Cut with restriction enzymes
  137. 137. 8 & 9) Optical maps ✤ Stretch out DNA ✤ Cut with restriction enzymes ✤ Note lengths of fragments
  138. 138. 8 & 9) Optical maps ✤ Stretch out DNA ✤ Cut with restriction enzymes ✤ Note lengths of fragments ✤ Compare to in silico digest of scaffolds
  139. 139. 8 & 9) Optical maps ✤ Stretch out DNA ✤ Cut with restriction enzymes ✤ Note lengths of fragments ✤ Compare to in silico digest of scaffolds ✤ Not all scaffolds suitable for analysis
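
The in silico digest step can be illustrated with a short sketch. Everything specific here is an assumption for illustration: the slides do not say which enzyme was used, so the default below is the EcoRI recognition site (GAATTC, cut after the first G), and `in_silico_digest` is a hypothetical helper, not the tool used in the Assemblathon analysis.

```python
def in_silico_digest(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting `sequence` at every
    occurrence of `site`, `cut_offset` bases into the site."""
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    # Fragment boundaries run from the sequence start, through each cut,
    # to the sequence end.
    boundaries = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


print(in_silico_digest("AAGAATTCTTGAATTCGG"))  # [3, 8, 7]
```

The resulting list of fragment lengths is what would be compared against the fragment lengths measured on the optical map.
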
  140. 140. 8 & 9) Optical maps Image from University of Wisconsin-Madison
  141. 141. 8 & 9) Optical maps
  142. 142. 8 & 9) Optical maps
  143. 143. 8 & 9) Optical maps
  144. 144. What does this all mean?
  145. 145. 102 metrics per assembly, 10 key metrics, 1 final ranking
  146. 146. Assembly: CRACS | SYMB | PHUS | BCM | SGA | MERAC | ABYSS | SOAP | RAY | GAM | CURT. Number of core genes: 438 | 436 | 435 | 434 | 433 | 430 | 429 | 428 | 422 | 415 | 360
  147. 147. Assembly: CRACS | SYMB | PHUS | BCM | SGA | MERAC | ABYSS | SOAP | RAY | GAM | CURT. Number of core genes: 438 | 436 | 435 | 434 | 433 | 430 | 429 | 428 | 422 | 415 | 360. Rank: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
  148. 148. Assembly: CRACS | SYMB | PHUS | BCM | SGA | MERAC | ABYSS | SOAP | RAY | GAM | CURT. Number of core genes: 438 | 436 | 435 | 434 | 433 | 430 | 429 | 428 | 422 | 415 | 360. Rank: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11. Z-score: +0.68 | +0.59 | +0.54 | +0.49 | +0.44 | +0.30 | +0.25 | +0.21 | –0.08 | –0.41 | –3.02
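
The Z-scores on this slide can be reproduced from the core gene counts. The sketch below assumes the population standard deviation (`statistics.pstdev`); that choice is inferred from the fact that it matches the slide's numbers, not stated in the talk, and the helper name `z_scores` is hypothetical.

```python
from statistics import mean, pstdev

# Core gene counts per assembly, copied from the slide.
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433,
    "MERAC": 430, "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415,
    "CURT": 360,
}


def z_scores(values):
    """Return (value - mean) / population standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


for name, z in zip(core_genes, z_scores(list(core_genes.values()))):
    print(f"{name:6s} {z:+.2f}")
```

Unlike ranks, the Z-scores show that the top ten assemblies are tightly clustered while the last one is a clear outlier.
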
  149. 149. What does this all mean?
  150. 150. No really, what does this all mean?
  151. 151. Some conclusions ✤ Very hard to find assemblers that performed well across all 10 key metrics ✤ Assemblers that perform well in one species do not always perform as well in another ✤ Bird & snake assemblies appear better than fish ✤ No real 'winner' for bird and fish
  152. 152. SGA — best assembler for snake?
  153. 153. SGA — best assembler for snake?
  154. 154. Description Rank of snake SGA assembly NG50 scaffold length 2 NG50 contig length 5 Amount of assembly in 'gene-sized' scaffolds 7 Number of 'core genes' present 5 Fosmid coverage 2 Fosmid validity 2 Short-range scaffold accuracy 3 Optical map: level 1 2 Optical map: levels 1–3 1 REAPR summary score 2
  155. 155. Description Rank of snake SGA assembly NG50 scaffold length 2 NG50 contig length 5 Amount of assembly in 'gene-sized' scaffolds 7 Number of 'core genes' present 5 Fosmid coverage 2 Fosmid validity 2 Short-range scaffold accuracy 3 Optical map: level 1 2 Optical map: levels 1–3 1 REAPR summary score 2
  156. 156. Best assembler across species?
  157. 157. Best assembler across species? Assembler Number of 1st places (out of 27) BCM 5 Meraculous 4 Symbiose 4 Ray 3 Excluding evaluation entries
  158. 158. Best assembler across species? Assembler Number of 1st places (out of 27) BCM 5 Meraculous 4 Symbiose 4 Ray 3 Excluding evaluation entries
  159. 159. Ray performance Species Final ranking Bird 7th Fish 7th Snake 9th
  160. 160. BCM bird assemblies. Assembler: BCM - evaluation | BCM - competitive. Final rank: 1 | 2. NGS data used in assembly: Illumina + 454 | Illumina + 454 + PacBio
  161. 161. BCM bird assemblies. Assembler: BCM - evaluation | BCM - competitive. Final rank: 1 | 2. NGS data used in assembly: Illumina + 454 | Illumina + 454 + PacBio
  162. 162. BCM bird assemblies. Assembler: BCM - evaluation | BCM - competitive. Final rank: 1 | 2. NGS data used in assembly: Illumina + 454 | Illumina + 454 + PacBio. Coverage Z-score: +2.0 | –0.3
  163. 163. BCM bird assemblies. Assembler: BCM - evaluation | BCM - competitive. Final rank: 1 | 2. NGS data used in assembly: Illumina + 454 | Illumina + 454 + PacBio. Coverage Z-score: +2.0 | –0.3. Validity Z-score: +1.4 | –0.8
  164. 164. BCM bird assemblies. Assembler: BCM - evaluation | BCM - competitive. Final rank: 1 | 2. NGS data used in assembly: Illumina + 454 | Illumina + 454 + PacBio. Coverage Z-score: +2.0 | –0.3. Validity Z-score: +1.4 | –0.8. NG50 contig Z-score: +1.5 | +2.7
  165. 165. BCM evaluation scaffold NNNNNNNNNNNNNNNNNNN
  166. 166. BCM evaluation scaffold NNNNNNNNNNNNNNNNNNN BCM competition scaffold NNNNNNNNNNNNNNNNNNN
  167. 167. BCM evaluation scaffold NNNNNNNNNNNNNNNNNNN BCM competition scaffold NNNNNNNNNNNNNNNNNNN PacBio sequence
  168. 168. BCM evaluation scaffold NNNNNNNNNNNNNNNNNNN BCM competition scaffold CGTCGNNATCNNGGTTACG
  169. 169. BCM evaluation scaffold NNNNNNNNNNNNNNNNNNN BCM competition scaffold CGTCGNNATCNNGGTTACG Mismatches from PacBio sequence penalized alignment score more than matching unknown bases
  170. 170. The choice of one command-line option, used by one tool in the calculation of one key metric... ...probably made enough difference to drop the PacBio-containing assembly to 2nd place.
  171. 171. Other conclusions ✤ Different metrics tell different stories ✤ Heterozygosity was a big issue for bird & fish assemblies ✤ Final rankings very sensitive to changes in metrics ✤ N50 is a semi-useful predictor of assembly quality
  172. 172. Inter-specific differences matter
  173. 173. Inter-specific differences matter ✤ The three species have genomes with different properties ✤ repeats ✤ heterozygosity
  174. 174. Inter-specific differences matter ✤ The three species have genomes with different properties ✤ repeats ✤ heterozygosity ✤ The three genomes had very different NGS data sets ✤ Only bird had PacBio & 454 data ✤ Different insert sizes in short-insert libraries
  175. 175. The Big Conclusion
  176. 176. The Big Conclusion "You can't always get what you want" Sir Michael Jagger, 1969
  177. 177. What comes next?
  178. 178. What comes next?
  179. 179. What comes next? 3?
  180. 180. A wish list for Assemblathon 3
  181. 181. A wish list for Assemblathon 3 ✤ Only have 1 species
  182. 182. A wish list for Assemblathon 3 ✤ Only have 1 species ✤ Teams have to 'buy' resources using virtual budgets
  183. 183. A wish list for Assemblathon 3 ✤ Only have 1 species ✤ Teams have to 'buy' resources using virtual budgets ✤ Factor in CPU time/cost?
  184. 184. A wish list for Assemblathon 3 ✤ Only have 1 species ✤ Teams have to 'buy' resources using virtual budgets ✤ Factor in CPU time/cost? ✤ Agree on metrics before evaluating assemblies!
  185. 185. A wish list for Assemblathon 3 ✤ Only have 1 species ✤ Teams have to 'buy' resources using virtual budgets ✤ Factor in CPU time/cost? ✤ Agree on metrics before evaluating assemblies! ✤ Encourage experimental assemblies
  186. 186. A wish list for Assemblathon 3 ✤ Only have 1 species ✤ Teams have to 'buy' resources using virtual budgets ✤ Factor in CPU time/cost? ✤ Agree on metrics before evaluating assemblies! ✤ Encourage experimental assemblies ✤ Use new FASTG genome assembly file format
  187. 187. A wish list for Assemblathon 3 ✤ Only have 1 species ✤ Teams have to 'buy' resources using virtual budgets ✤ Factor in CPU time/cost? ✤ Agree on metrics before evaluating assemblies! ✤ Encourage experimental assemblies ✤ Use new FASTG genome assembly file format ✤ Get someone else to write the paper!
  188. 188. Intermission
  189. 189. NGS must die!
  190. 190. NGS must die! ‘NGS’ is used to refer to everything post-Sanger
  191. 191. NGS must die! ‘NGS’ is used to refer to everything post-Sanger Pyrosequencing was developed ~1996
  192. 192. NGS madness Next generation sequencing aka second generation sequencing
  193. 193. NGS madness Next generation sequencing aka second generation sequencing but there’s also:
  194. 194. NGS madness Next generation sequencing aka second generation sequencing but there’s also: third generation sequencing
  195. 195. NGS madness Next generation sequencing aka second generation sequencing but there’s also: third generation sequencing fourth generation sequencing
  196. 196. NGS madness Next generation sequencing aka second generation sequencing but there’s also: third generation sequencing fourth generation sequencing next-next generation sequencing
  197. 197. NGS madness Next generation sequencing aka second generation sequencing but there’s also: third generation sequencing fourth generation sequencing next-next generation sequencing next-next-next generation sequencing
  198. 198. NGS madness. Technology: Complete Genomics | Ion Torrent | PacBio | Oxford Nanopore. According to some papers…: 2nd generation | 2nd generation | 2nd generation | 3rd generation
  199. 199. NGS madness. Technology: Complete Genomics | Ion Torrent | PacBio | Oxford Nanopore. According to some papers…: 2nd generation | 2nd generation | 2nd generation | 3rd generation. According to other papers…: 3rd generation | 3rd generation | 3rd generation | 4th generation
  200. 200. NGS madness “PacBio is a 2.5th generation” “Helicos lies between the transition of next-generation to third generation”
  201. 201. NGS madness There are different sequencing methodologies, and there are different sequencing platforms.
  202. 202. NGS madness There are different sequencing methodologies, and there are different sequencing platforms. Use one or the other.
  203. 203. NGS madness There are different sequencing methodologies, and there are different sequencing platforms. Use one or the other. Or just say ‘current sequencing technologies’.
  204. 204. Intermission
  205. 205. My #1 piece of advice. Image from flickr.com/julia_manzerova
  206. 206. flickr.com/thomashawk
  207. 207. flickr.com/thomashawk Look at your data!
  208. 208. I looked at the shortest 10 sequences in 34 different genome assemblies…
  209. 209. I looked at the shortest 10 sequences in 34 different genome assemblies…
  210. 210. I looked at the shortest 10 sequences in 34 different genome assemblies…
  211. 211. I looked at the shortest 10 sequences in 34 different genome assemblies…
  212. 212. From a vertebrate genome assembly with 72,214 sequences…
  213. 213. From a vertebrate genome assembly with 72,214 sequences…
  214. 214. From a vertebrate genome assembly with 72,214 sequences…
  215. 215. From a vertebrate genome assembly with 72,214 sequences…
  216. 216. From a vertebrate genome assembly with 72,214 sequences…
  217. 217. From a vertebrate genome assembly with 72,214 sequences… Length of 10 shortest sequences: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp!
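
Looking at the shortest sequences in an assembly is a quick check to automate. The following is a minimal sketch under stated assumptions: the helper names `sequence_lengths` and `shortest` are hypothetical, the input is plain FASTA text supplied as an iterable of lines, and the toy records are made up.

```python
import heapq


def sequence_lengths(fasta_lines):
    """Yield the length of each sequence in FASTA-formatted lines."""
    length = None  # None until the first header has been seen
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def shortest(fasta_lines, n=10):
    """Return the n smallest sequence lengths, in ascending order."""
    return heapq.nsmallest(n, sequence_lengths(fasta_lines))


records = [">a", "ACGT", "AC", ">b", "A", ">c", "ACGTACGT"]
print(shortest(records, n=2))  # [1, 6]
```

Sequences of only a few bases, like the 3 bp and 12 bp entries on the slide, are unlikely to be useful and are worth filtering or at least questioning.
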
  218. 218. Reasons to be cheerful flickr.com/danielygo
  219. 219. Data from Lex Nederbragt’s blog, June 2014
  220. 220. Data from Lex Nederbragt’s blog, June 2014
  221. 221. Long-read technology Moleculo read data from Illumina BaseSpace, July 2013
  222. 222. Long-read technology: PacBio data. From https://flxlexblog.wordpress.com (Lex Nederbragt's blog)
  223. 223. Long-read technology MinION from Oxford Nanopore
  224. 224. Long-read technology MinION from Oxford Nanopore
  225. 225. Where is the data?
  226. 226. Where is the data?
  227. 227. Where is the data? Nick Loman published the first real-world data on June 10th
  228. 228. Single chromosome assembly?
  229. 229. Single chromosome assembly?
  230. 230. Single chromosome assembly?
  231. 231. Tackling heterozygosity 1000 Genomes project plans to sequence 15 'trios' at high depth
  232. 232. Hi-C ✤ Nature Biotechnology, 31, 2013 ✤ Burton et al. ✤ Selvaraj et al. ✤ Kaplan & Dekker
  233. 233. The future of genome assembly
  234. 234. Kwik-E-Assembler acgtaacacaancac gggaacnnnacatta acnactagcataata nnnnnnnnnnaacac actttaaattatatc The future of genome assembly
  235. 235. The future of genome assembly
  236. 236. The future of genome assembly ✤ At some point we will look back with embarrassment at this era.
  237. 237. The future of genome assembly ✤ At some point we will look back with embarrassment at this era. ✤ Assembly must, and will, get better, but...
  238. 238. The future of genome assembly ✤ At some point we will look back with embarrassment at this era. ✤ Assembly must, and will, get better, but... ✤ ...'perfect' genomes may remain elusive.
  239. 239. The future of genome assembly ✤ At some point we will look back with embarrassment at this era. ✤ Assembly must, and will, get better, but... ✤ ...'perfect' genomes may remain elusive. ✤ Data management will remain an issue:
  240. 240. The future of genome assembly ✤ At some point we will look back with embarrassment at this era. ✤ Assembly must, and will, get better, but... ✤ ...'perfect' genomes may remain elusive. ✤ Data management will remain an issue: ✤ the human genome -> human genomes -> tissue-specific genomes
  241. 241. Summary
  242. 242. Summary ✤ There is no real consensus on how to make a good genome assembly
  243. 243. Summary ✤ There is no real consensus on how to make a good genome assembly ✤ Try different assemblers, try different command-line options
  244. 244. Summary ✤ There is no real consensus on how to make a good genome assembly ✤ Try different assemblers, try different command-line options ✤ Decide what it is you want to get out of a genome assembly
  245. 245. Summary ✤ There is no real consensus on how to make a good genome assembly ✤ Try different assemblers, try different command-line options ✤ Decide what it is you want to get out of a genome assembly ✤ Look at your input and output data
  246. 246. Summary ✤ There is no real consensus on how to make a good genome assembly ✤ Try different assemblers, try different command-line options ✤ Decide what it is you want to get out of a genome assembly ✤ Look at your input and output data ✤ Wait 5 years and come back, we’ll (probably) have solved everything!
  247. 247. Resources ✤ Lex Nederbragt’s blog - https://flxlexblog.wordpress.com ✤ Nick Loman’s blog - http://pathogenomics.bham.ac.uk/blog/ ✤ Assemblathon twitter feed - https://twitter.com/assemblathon
