Combining PacBio with short read technology for improved de novo genome assembly

Published in: Technology

  1. The best of both worlds: Combining PacBio with short read technology for improved de novo genome assembly. Lex Nederbragt, NSC and CEES, lex.nederbragt@bio.uio.no
  2. This talk
  3. Why does everybody want longer reads? … for genome assemblies
  4. What is a genome assembly? Hierarchical structure: reads, contigs, scaffolds
  5. Sequence data: Reads. Original DNA fragments; sequenced ends. http://www.cbcb.umd.edu/research/assembly_primer.shtml
  6. Contigs: Building contigs. Aligned reads: ACGCGATTCAGGTTACCACG GCGATTCAGGTTACCACGCG GATTCAGGTTACCACGCGTA TTCAGGTTACCACGCGTAGC CAGGTTACCACGCGTAGCGC GGTTACCACGCGTAGCGCAT TTACCACGCGTAGCGCATTA ACCACGCGTAGCGCATTACA CACGCGTAGCGCATTACACA CGCGTAGCGCATTACACAGA CGTAGCGCATTACACAGATT TAGCGCATTACACAGATTAG. Consensus contig: ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
  7. Contigs: Building contigs. Repeat copy 1, repeat copy 2; collapsed repeat consensus. Contig orientation? Contig order? http://www.cbcb.umd.edu/research/assembly_primer.shtml
  8. Mate pairs: Other read type. Repeat copy 1, repeat copy 2. (Much) longer fragments; mate pair reads
  9. Scaffolds: Ordered, oriented contigs. Mate pairs; contigs; gap size estimate
  10. What is a genome assembly? Hierarchical structure. Reads (aligned reads): ACGCGATTCAGGTTACCACG GCGATTCAGGTTACCACGCG GATTCAGGTTACCACGCGTA TTCAGGTTACCACGCGTAGC CAGGTTACCACGCGTAGCGC GGTTACCACGCGTAGCGCAT TTACCACGCGTAGCGCATTA ACCACGCGTAGCGCATTACA CACGCGTAGCGCATTACACA CGCGTAGCGCATTACACAGA CGTAGCGCATTACACAGATT TAGCGCATTACACAGATTAG. Contigs (consensus contig): ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG. Scaffolds
  11. Genome assembly: So, what’s so hard about it?
  12. 1) Repeats: Repeats break up contigs. Repeat copy 1, repeat copy 2; collapsed repeat consensus. http://www.cbcb.umd.edu/research/assembly_primer.shtml
  13. 2) Heterozygosity: Differences between sister chromosomes. http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
  14. 2) Heterozygosity: Contig 1, polymorphic contig 2, polymorphic contig 3, contig 4
  15. 2) Heterozygosity. http://www.astraean.com/borderwars/wp-content/uploads/2012/04/heterozygoats.jpg and many other sites
  16. 3) Many programs to choose from. Zhang et al. (2011) doi:10.1371/journal.pone.0017915.g001
  17. Assembly: challenges. Repeats (repeat copy 1, repeat copy 2). Heterozygosity (contig 1, polymorphic contig 2, polymorphic contig 3, contig 4). Knowing how to use the programs
  18. So, why does everybody want longer reads? http://www.autobizz.com.my/forum/forum/General-Chat/944-The-worlds-longest-car.html
  19. Longer reads? Long reads can span repeats and heterozygous regions. Repeat copy 1, repeat copy 2. Contig 1, polymorphic contig 2, polymorphic contig 3, contig 4
  20. PacBio to the rescue?
  21. High-throughput sequencing: Library preparation. SMRTbell template. Standard Sequencing: large insert sizes; generates one pass on each molecule sequenced (single pass). Circular Consensus Sequencing: small insert sizes; generates multiple passes on each molecule sequenced. Continued generations of reads
  22. High-throughput sequencing: Raw read length
  23. High-throughput sequencing: Raw reads and subreads. SMRTbell template. Standard Sequencing: large insert sizes; generates one pass on each molecule sequenced (single pass). Circular Consensus Sequencing: small insert sizes; generates multiple passes on each molecule sequenced. ‘Subreads’
  24. PacBio: uses. Long reads, low quality: 85-87% accuracy. Useful for assembly? Diagram: SMRTbell template, Standard Sequencing (large insert sizes, single pass) and Circular Consensus Sequencing (small insert sizes, multiple passes)
  25. Solutions for assembly
  26. Solutions for assembly (1): Designed by Pacific Biosciences. http://www.clker.com/clipart-4245.html
  27. Solutions for assembly (2): Broad Institute. Need a special recipe for sequencing
  28. Solutions for assembly (3): PacBioToCA. Error correct with short reads; Celera assembler. http://schatzlab.cshl.edu/presentations/2012-01-17.PAG.SMRTassembly.pdf
  29. PacBioToCA. Koren et al., 2012
  30. Shameless self-promotion: flxlexblog.wordpress.com
  31. Shameless self-promotion: @lexnederbragt
  32. The Atlantic cod genome project
  33. First draft: Fragmented assembly (short contigs, many gap bases). http://en.wikipedia.org
  34. First draft: 6467 scaffolds, 35% gap bases
  35. The causes: Short Tandem Repeats (>20% of gaps)
  36. The causes: Heterozygosity? Contig 1, polymorphic contig 2, polymorphic contig 3, contig 4
  37. The goal: 23 pseudochromosomes; longer contigs; below 5% gap bases. PacBio to the rescue?
  38. The approach: Libraries. Aim for looooong insert sizes. Diagram: SMRTbell template, Standard Sequencing (large insert sizes) and Circular Consensus Sequencing (small insert sizes)
  39. The approach: Sequencing. Sequence with 90 minute movies. 10x coverage in reads of at least 3000 bp. Diagram: SMRTbell template, Standard Sequencing (large insert sizes, single pass) and Circular Consensus Sequencing (small insert sizes, multiple passes). No, we don’t throw this away…
  40. The approach: Error-correction
  41. PacBio results: Relative throughput at different minimum length cutoffs. Plot: percentage of total sequence (fraction of bases at minimum length, 0 to 100) against length cutoff for the longest subread (0, 3, 5, 10 and 15 kbp), for 10kb lib 1, 10kb lib 2 and 4kb lib. Large library insert size important!
  42. PacBio results: 64 SMRT Cells. 3.2 gigabases in raw reads of at least 3 kb (3.8x coverage). 2.2 gigabases in longest subreads. Largest 15 kbp. Diagram: SMRTbell template, Standard Sequencing and Circular Consensus Sequencing
  43. PacBio results: Mapping to the cod genome. 11.4 kbp subread, 10.6 kbp subread, 10.9 kbp subread
  44. Example 1: ACACAC repeat, 232 bp gap, TGTGTG repeat
  45. Example 1
  46. Example 1
  47. Example 1: Scaffold: ...ACACAC TGTGTG... PacBio reads. Unplaced contig
  48. Example 2: TGTGTG repeat, 344 bp gap
  49. Example 2
  50. Example 2: Scaffold: ...TGTGTG. PacBio reads. Heterozygosity?
  51. Example 3: Scaffold. PacBio reads. 300 bp misassembly?
  52. Error-correction: Work in progress. http://openclipart.org/
  53. Outlook: Will PacBio solve our problems?
  54. Outlook: Or
  55. Outlook: Will we find the heterozygous regions? Contig 1, polymorphic contig 2, polymorphic contig 3, contig 4
  56. Outlook. http://www.pasteur.fr/recherche/unites/Bbi/, en.wikipedia.org and Martin Malmstrøm
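
Slide 6 shows twelve overlapping reads collapsing into one consensus contig. As a minimal sketch of that idea, the Python below joins the slide's reads by their longest suffix-prefix overlap. It assumes error-free reads that are already in left-to-right order, which real assemblers cannot assume. The names `suffix_prefix_overlap`, `merge_ordered_reads` and `min_overlap` are made up for illustration and do not come from the talk.

```python
def suffix_prefix_overlap(left, right, min_overlap=3):
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    # Try the longest candidate first so the first hit is the best one.
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_ordered_reads(reads, min_overlap=3):
    """Extend a contig read by read, assuming the reads are sorted by position."""
    contig = reads[0]
    for read in reads[1:]:
        size = suffix_prefix_overlap(contig, read, min_overlap)
        if size == 0:
            raise ValueError(f"read {read!r} does not overlap the contig end")
        # Append only the part of the read that extends past the overlap.
        contig += read[size:]
    return contig


# The twelve aligned reads from slide 6.
SLIDE_READS = [
    "ACGCGATTCAGGTTACCACG",
    "GCGATTCAGGTTACCACGCG",
    "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC",
    "CAGGTTACCACGCGTAGCGC",
    "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA",
    "ACCACGCGTAGCGCATTACA",
    "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA",
    "CGTAGCGCATTACACAGATT",
    "TAGCGCATTACACAGATTAG",
]

if __name__ == "__main__":
    print(merge_ordered_reads(SLIDE_READS))
```

On these reads every step finds an 18 bp overlap and adds 2 bp, so the result matches the 42 bp consensus contig on the slide. A repeat longer than the reads would make the overlaps ambiguous, which is the problem slides 7 and 12 describe.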

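
Slides 39, 41 and 42 rest on simple arithmetic: fold coverage is sequenced bases divided by genome size, and the throughput plot reports the fraction of bases that sit in reads at or above a length cutoff. The Python sketch below works through those numbers. It reads the 3.2 figure on slide 42 as gigabases of sequence and assumes a constant yield per SMRT Cell when extrapolating to the 10x target; both are assumptions, as are all function names and the toy read lengths.

```python
import math


def implied_genome_size(total_bases, fold_coverage):
    """Genome size implied by a base count and the coverage it represents."""
    return total_bases / fold_coverage


def fraction_of_bases_at_cutoff(read_lengths, cutoff):
    """Fraction of all sequenced bases found in reads of at least `cutoff` bp."""
    total = sum(read_lengths)
    if total == 0:
        return 0.0
    return sum(n for n in read_lengths if n >= cutoff) / total


def cells_for_target(target_coverage, genome_size, bases_per_cell):
    """SMRT Cells needed for a coverage target, assuming constant yield per cell."""
    return math.ceil(target_coverage * genome_size / bases_per_cell)


# Figures from slide 42 (3.2 read as gigabases, an assumption).
RAW_BASES_AT_LEAST_3KB = 3.2e9
COVERAGE_REPORTED = 3.8
SMRT_CELLS_RUN = 64

genome_size = implied_genome_size(RAW_BASES_AT_LEAST_3KB, COVERAGE_REPORTED)
bases_per_cell = RAW_BASES_AT_LEAST_3KB / SMRT_CELLS_RUN
cells_needed = cells_for_target(10, genome_size, bases_per_cell)

# Hypothetical read lengths, only to show how the cutoff fraction is computed.
TOY_READ_LENGTHS = [1000, 4000, 5000]
toy_fraction = fraction_of_bases_at_cutoff(TOY_READ_LENGTHS, 3000)

if __name__ == "__main__":
    print(f"Implied genome size: {genome_size / 1e6:.0f} Mbp")
    print(f"Yield per SMRT Cell: {bases_per_cell / 1e6:.0f} Mbp")
    print(f"Cells for 10x in reads of at least 3 kb: {cells_needed}")
    print(f"Toy fraction of bases at or above 3 kb: {toy_fraction:.2f}")
```

The cell count is only a rough extrapolation: as slide 41 notes, the share of bases in long reads depends strongly on library insert size, so per-cell yield above a cutoff will differ between libraries.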