
Improving and validating the Atlantic Cod genome assembly using PacBio

My talk at the PacBio European Usergroup Meeting, November 20th, 2013


  1. Improving and validating the Atlantic Cod genome assembly using error-corrected as well as raw PacBio reads. Lex Nederbragt, NSC and CEES, lex.nederbragt@ibv.uio.no, @lexnederbragt
  2. Acknowledgements. University of Oslo: Sequencing team NSC, Ole Kristian Tøressen, Kjetill Jakobsen, Sissel Jentoft, Cod genome group. Jason Miller, JCVI. Pacific Biosciences
  3. The Atlantic cod genome project
  4. Cod: the genome. 850 million bases (Mbp); 'wild-caught'; heterozygote
  5. Cod: phase 1. 454 sequencing (Sanger sequencing)
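  6. N50: 50% of the genome is in contigs as large as the N50 value or larger. Example for a 1000 bp genome: the running sum of contig lengths, largest first, goes 400, 445, 490, 520; the contig that takes the sum past 500 bp sets the N50. Courtesy of Michael Schatz, CSHL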
  7. Cod: phase 1. 454 sequencing (Sanger sequencing). Phase 1 assembly: 157 887 sequences; 753 Mbp of 830 Mbp; scaffold N50 460 kbp; contig N50 2.8 kbp. (Diagram labels: scaffold, contig, gap)
  8. Cod: phase 1. 6467 scaffolds; 35% gap bases
  9. The causes: Short Tandem Repeats (>20% of gaps)
  10. The causes: Heterozygosity? (Diagram labels: Contig 1, Polymorphic contig 2, Polymorphic contig 3, Contig 4)
  11. Cod: phase 2. New data: Illumina sequencing, paired end >200x, mate pair 5 kb >100x. Improved/new software
  12. Cod: phase 2 goal. 23 pseudochromosomes; longer contigs; below 5% gap bases. Scaffold N50 1 Mbp; contig N50 15 kbp
  13. Cod: phase 2 programs. Zhang et al., PLoS ONE 2011
  14. Cod phase 2: status (contig N50 / gaps / scaffold N50). Goal: 15 kbp / <5% / 1.5 Mbp. Celera, 454 + Illumina: 9 kbp / 5% / too short. Newbler, 454: 6 kbp / 24% / OK
  15. Enter PacBio
  16. SMRT® Technology. SMRTbell template. Standard Sequencing: large insert sizes. Circular Consensus Sequencing: small insert sizes. Aim for looooong insert sizes. Chemistries: C2, C2-XL, XL-XL. Coverage / average raw read length entries: 3.2x / 4.6 kb; 3.5x / 5.1 kb; 9.2x / 3.0 kb. Total: 15.9x from 147 SMRT Cells. Photo: Tore Oldeide Elgvin
  17. Error-correction. Celera Assembler merTrim; PacBioToCA (Koren et al.). Coverage labels: 13.7x, 27x, 234x; 27x → 9x (67%) recovered
  18. Using PacBio reads
  19. PacBio reads for cod. Error-corrected reads: assembly improvement (Celera), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  20. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  21. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  22. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  23. Assembly improvement: corrected reads (N50 / gaps). Goal: 15 kbp / <5%. Celera, 454 reads: 9 kbp / 5%. + corrected PacBio + PBJelly: 11 kbp / 1.5%
  24. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  25. Assembly improvement: raw reads (N50 / gaps). Goal: 15 kbp / <5%. Newbler, 454: 6 kbp / 24%. + raw PacBio + PBJelly: 30 kbp / 20%
  26. Assembly improvement: raw reads (N50 / gaps). Goal: 15 kbp / <5%. Celera, 454 + Illumina: 9 kbp / 5%. + raw PacBio + PBJelly: 46 kbp / 1.5%. Too good to be true?
  27. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  28. Assembly validation. Sequence
  29. Assembly validation. Sequence; coverage; aligned raw PacBio reads
  30. Assembly validation. Sequence; coverage; aligned raw PacBio reads; aligned corrected PacBio reads
  31. Assembly validation. Newbler scaffold; corrected PacBio reads; raw PacBio reads; 308 bp gap; (TG)n repeat; (TG)n repeat
  32. Assembly validation. Newbler scaffold; raw PacBio reads; 939 bp gap; (AG)n repeat; heterozygous region
  33. Assembly validation. Raw PacBio reads; Celera scaffold; misassembly?
  34. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  35. Assembly validation: bridgemapper (beta). Split alignments: structural variation, misassemblies
  36. bridgemapper (beta) on E. coli. Illumina + Velvet; positions in the contig color coded
  37. bridgemapper (beta) on cod. s05514: 2510 bp gap; point to a 2350 bp scaffold
  38. bridgemapper (beta) on cod. s08737: 2145 bp gap; point to a 3 kbp scaffold
  39. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  40. Assembly with error-corrected reads (contig N50 / gaps / scaffolds). Goal: 15 kbp / <5%. Celera Assembly: 9 kbp / 5% / too short. CA + corrected PacBio + 454 mates: 8 kbp / 2% / very short. 1.4 times genome size → underassembled
  41. The improved Atlantic cod genome: status. (Image: http://en.wikipedia.org)
  42. Newbler plus Celera. Celera: long contigs, short scaffolds. (Diagram labels: scaffold, gap, contig.) Slide courtesy of Ole Kristian Tøressen
  43. Newbler plus Celera. Celera: long contigs, short scaffolds. Newbler: short contigs, long scaffolds. (Diagram labels: scaffold, gap, contig.) Slide courtesy of Ole Kristian Tøressen
  44. Newbler plus Celera. Celera: long contigs, short scaffolds. Newbler: short contigs, long scaffolds. Combined: long contigs, long scaffolds. (Diagram labels: scaffold, gap, contig.) Slide courtesy of Ole Kristian Tøressen
  45. Adding PacBio. Using PBJelly: PacBio reads; closed gap; reduced gap. (Diagram labels: contig, scaffold.) Slide courtesy of Ole Kristian Tøressen
  46. Polishing the assembly. 454 and Illumina reads. Contig N50: 30 - 40 kbp. Scaffold N50: 1 - 1.5 Mbp. (Diagram labels: contig, scaffold.) Slide courtesy of Ole Kristian Tøressen
  47. Image by Mathieu Thouvenin, http://www.flickr.com/photos/mathoov/4681491052/
  48. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  49. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper), de novo assembly (Celera)
  50. Assembly (contig N50 / gaps / scaffolds). Goal: 15 kbp / <5%. CA + corrected PacBio + 454 mates: 8 kbp / 2% / very short. CA + raw PacBio reads + 454 mates: 38 kbp / <1% / very short. 1.6 times genome size → underassembled
  51. Lessons learned from PacBio reads
  52. Cod genome. Homozygous regions; heterozygous: large polymorphism (100s of bases); heterozygous: large indel (100s of bases)
  53. Atlantic cod version 2. 23 pseudochromosomes; longer contigs; below 5% gap bases; new annotation
  54. From observation to insight: we need better programs. Images: Mathias Bigge, Ricordisamoa, others (Wikimedia Commons)
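
The two numbers the slides keep returning to, N50 (defined on slide 6) and the percentage of gap bases, can be computed roughly as in the sketch below. This is an illustration under stated assumptions, not the code used in the talk: the function names are made up for this example, the contig lengths are hypothetical values chosen so that the running sum passes through 400, 445, 490 and 520 as on slide 6, and gaps are assumed to be stored as runs of `N` in the scaffold sequences. When an explicit genome size is supplied, as on slide 6, the statistic is often called NG50.

```python
def n50(lengths, genome_size=None):
    """Length of the contig at which the running sum of contig lengths,
    taken from largest to smallest, first reaches half of the target.

    The target is genome_size if given (as on slide 6), otherwise the
    total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    target = genome_size if genome_size is not None else sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= target:
            return length
    raise ValueError("contigs cover less than half of genome_size")


def gap_percent(scaffolds):
    """Percentage of scaffold bases that are gap bases.

    Assumption: gaps are written as 'N' (or 'n') characters.
    """
    scaffolds = list(scaffolds)
    total = sum(len(seq) for seq in scaffolds)
    if total == 0:
        raise ValueError("no scaffold sequence given")
    gaps = sum(seq.upper().count("N") for seq in scaffolds)
    return 100.0 * gaps / total


if __name__ == "__main__":
    # Hypothetical contig lengths; running sum: 300, 400, 445, 490, 520, ...
    contigs = [300, 100, 45, 45, 30, 20, 15, 15, 10, 10]
    print(n50(contigs, genome_size=1000))

    # Two toy scaffolds with 6 gap bases out of 16 in total.
    print(gap_percent(["ACGTNNNNAC", "NNACGT"]))
```

Passing `genome_size` makes assemblies of different total length comparable against the same target, which matters here because slides 40 and 50 report assemblies 1.4 to 1.6 times the genome size.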
