Improving and validating the Atlantic Cod genome assembly using PacBio

My talk at the PacBio European Usergroup Meeting, November 20th, 2013

Transcript

  • 1. Improving and validating the Atlantic Cod genome assembly using error-corrected as well as raw PacBio reads. Lex Nederbragt, NSC and CEES, lex.nederbragt@ibv.uio.no, @lexnederbragt
  • 2. Acknowledgements University of Oslo Sequencing team NSC Ole Kristian Tøressen Kjetill Jakobsen Sissel Jentoft Cod genome group Jason Miller, JCVI Pacific Biosciences
  • 3. The Atlantic cod genome project
  • 4. Cod: the genome. 850 million bases (Mbp); ‘wild-caught’; heterozygote
  • 5. Cod: phase 1. 454 sequencing (Sanger sequencing)
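  • 6. N50: 50% of the genome is in contigs at least as large as the N50 value. Example for a 1000 bp genome: adding contig lengths from longest to shortest gives running sums of 400, 445, 490, 520; the contig that takes the sum past 500 bp sets the N50. Courtesy of Michael Schatz, CSHL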
  • 7. Cod: phase 1. 454 sequencing (Sanger sequencing). Phase 1 assembly: 157 887 sequences; 753 Mbp of 830 Mbp; scaffold N50 460 kbp; contig N50 2.8 kbp (diagram labels: scaffold, contig, gap)
  • 8. Cod: phase 1. 6467 scaffolds, 35% gap bases
  • 9. The causes: short tandem repeats (>20% of gaps)
  • 10. The causes: heterozygosity? Diagram: contig 1, polymorphic contig 2, polymorphic contig 3, contig 4
  • 11. Cod: phase 2. New data: Illumina sequencing, paired end >200x, mate pair 5 kb >100x. Improved/new software
  • 12. Cod: phase 2 goal. 23 pseudochromosomes; longer contigs; below 5% gap bases. Phase 2 goal: scaffold N50 1 Mbp, contig N50 15 kbp
  • 13. Cod: phase 2 programs. Zhang et al., PLoS ONE 2011
  • 14. Cod phase 2: status (contig N50 / gaps / scaffold N50). Goal: 15 kbp / <5% / 1.5 Mbp. Celera, 454 + Ilmn: 9 kbp / 5% / too short. Newbler, 454: 6 kbp / 24% / OK
  • 15. Enter PacBio
  • 16. SMRT® Technology. SMRTbell template. Standard sequencing: large insert sizes. Circular consensus sequencing: small insert sizes. Aim for looooong insert sizes. Table (chemistry, coverage, av. raw length) for chemistries C2, C2-XL and XL-XL, values in transcript order: 3.2x, 4.6 kb; 3.5x, 5.1 kb; 3.0 kb; 9.2x. Total: 147 SMRT Cells, 15.9x. Photo: Tore Oldeide Elgvin
  • 17. Error-correction: Celera Assembler, merTrim, pacBioToCA (Koren et al). 13.7x 27x + + 234x 27x → 9x (67%) recovered
  • 18. Using PacBio reads
  • 19. PacBio reads for cod. Error-corrected reads: assembly improvement (Celera), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  • 20. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  • 21. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  • 22. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  • 23. Assembly improvement: corrected reads (N50 / gaps). Goal: 15 kbp / <5%. Celera, 454 reads: 9 kbp / 5%. + corrected PacBio + PBJelly: 11 kbp / 1.5%
  • 24. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  • 25. Assembly improvement: raw reads (N50 / gaps). Goal: 15 kbp / <5%. Newbler, 454: 6 kbp / 24%. + raw PacBio + PBJelly: 30 kbp / 20%
  • 26. Assembly improvement: raw reads (N50 / gaps). Goal: 15 kbp / <5%. Celera, 454 + Ilmn: 9 kbp / 5%. + raw PacBio + PBJelly: 46 kbp / 1.5%. Too good to be true?
  • 27. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  • 28. Assembly validation: sequence
  • 29. Assembly validation: sequence; coverage; aligned raw PacBio reads
  • 30. Assembly validation: sequence; coverage; aligned raw PacBio reads; aligned corrected PacBio reads
  • 31. Assembly validation. Newbler scaffold; corrected PacBio reads; raw PacBio reads; 308 bp gap; (TG)n repeat; (TG)n repeat
  • 32. Assembly validation. Newbler scaffold; raw PacBio reads; 939 bp gap; (AG)n repeat; heterozygous region
  • 33. Assembly validation. Raw PacBio reads; Celera scaffold. Misassembly?
  • 34. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  • 35. Assembly validation: bridgemapper (beta). Split alignments: structural variation, misassemblies
  • 36. bridgemapper (beta) on E. coli. Positions in the contig color coded. Illumina + velvet
  • 37. bridgemapper (beta) on cod: s05514, 2510 bp gap. Point to a 2350 bp scaffold
  • 38. bridgemapper (beta) on cod: s08737, 2145 bp gap. Point to a 3 kbp scaffold
  • 39. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  • 40. Assembly with error-corrected reads (contig N50 / gaps / scaffolds). Goal: 15 kbp / <5%. Celera: 9 kbp / 5% / too short. CA + corrected PacBio + 454 mates: 8 kbp / 2% / very short. 1.4 times genome size → underassembled
  • 41. The improved Atlantic cod genome: status http://en.wikipedia.org
  • 42. Newbler plus Celera. Celera: long contigs, short scaffolds. Diagram labels: scaffold, gap, contig. Slide courtesy of Ole Kristian Tøressen
  • 43. Newbler plus Celera. Celera: long contigs, short scaffolds. Newbler: short contigs, long scaffolds. Diagram labels: scaffold, gap, contig. Slide courtesy of Ole Kristian Tøressen
  • 44. Newbler plus Celera. Celera: long contigs, short scaffolds. Newbler: short contigs, long scaffolds. Combined: long contigs, long scaffolds. Diagram labels: scaffold, gap, contig. Slide courtesy of Ole Kristian Tøressen
  • 45. Adding PacBio. Using PBJelly: closed gap, reduced gap. Diagram labels: PacBio reads, contig, scaffold. Slide courtesy of Ole Kristian Tøressen
  • 46. Polishing the assembly. 454 and Illumina reads. Contig N50: 30 - 40 kbp. Scaffold N50: 1 - 1.5 Mbp. Diagram labels: contig, scaffold. Slide courtesy of Ole Kristian Tøressen
  • 47. Image by Mathieu Thouvenin http://www.flickr.com/photos/mathoov/4681491052/
  • 48. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper)
  • 49. PacBio reads for cod. Error-corrected reads: assembly improvement (PBJelly), assembly validation (blasr), de novo assembly (Celera). Raw reads: assembly improvement (PBJelly), assembly validation (blasr, bridgemapper), de novo assembly (Celera)
  • 50. Assembly (contig N50 / gaps / scaffolds). Goal: 15 kbp / <5%. CA + corrected PacBio + 454 mates: 8 kbp / 2% / very short. CA + raw PacBio reads + 454 mates: 38 kbp / <1% / very short. 1.6 times genome size → underassembled
  • 51. Lessons learned from PacBio reads
  • 52. Cod genome Heterozygous: Large polymorphism (100’s of bases) Homozygous Homozygous Heterozygous: Large indel (100’s of bases) Homozygous
  • 53. Atlantic cod version 2 23 pseudochromosomes Longer contigs Below 5% gap bases New annotation
  • 54. From observation to insight We need better programs Mathias Bigge, Ricordisamoa, others (wikimedia commons)
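
Slide 6 defines N50, and the status slides report contig N50, scaffold N50 and the percentage of gap bases. The Python below is a minimal sketch of how such numbers can be computed from scaffold sequences. It is not the code used for the cod assemblies. The function names, the convention that gaps are runs of N, and the toy data are all assumptions. The toy contig lengths were picked so that the running sums pass through 400, 445, 490 and 520, as in the slide 6 example.

```python
import re


def n50(lengths, genome_size=None):
    """Return the length L such that pieces of length >= L cover at least
    half of the total (or half of genome_size, if one is given)."""
    target = (genome_size if genome_size is not None else sum(lengths)) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the pieces never reach half of the requested size


def split_contigs(scaffold):
    """Split a scaffold into contigs at runs of N (assumed gap convention)."""
    return [c for c in re.split(r"[Nn]+", scaffold) if c]


def gap_percentage(scaffolds):
    """Percentage of all scaffold bases that are gap bases (N)."""
    total = sum(len(s) for s in scaffolds)
    gaps = sum(s.upper().count("N") for s in scaffolds)
    return 100.0 * gaps / total if total else 0.0


# Hypothetical contig lengths; with a 1000 bp genome the running sum
# (300, 400, 445, 490, 520) first reaches 500 bp at the 30 bp contig.
toy_contigs = [300, 100, 45, 45, 30]
toy_n50 = n50(toy_contigs, genome_size=1000)

# Hypothetical scaffolds: contigs joined by a run of N.
toy_scaffolds = ["ACGT" * 5 + "N" * 10 + "ACGT" * 5, "ACGTACGTAC"]
contig_lengths = [len(c) for s in toy_scaffolds for c in split_contigs(s)]
scaffold_lengths = [len(s) for s in toy_scaffolds]

print("toy N50:", toy_n50)
print("contig N50:", n50(contig_lengths))
print("scaffold N50:", n50(scaffold_lengths))
print("gap bases: %.1f%%" % gap_percentage(toy_scaffolds))
```

Passing `genome_size` measures against the expected genome size rather than the assembly size. That distinction matters when an assembly comes out larger than the genome, as on slides 40 and 50.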
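
Slides 29 to 33 validate scaffolds by looking at the coverage of aligned PacBio reads. As a rough sketch of that idea, and not the blasr-based workflow from the talk, the Python below turns alignment intervals into a per-base depth profile and lists stretches with no read support. The interval format (0-based, end-exclusive), the function names and the toy alignments are assumptions for illustration.

```python
def depth_profile(scaffold_length, alignments):
    """Per-base read depth from (start, end) alignments on one scaffold.

    Coordinates are assumed to be 0-based and end-exclusive.
    """
    delta = [0] * (scaffold_length + 1)
    for start, end in alignments:
        start = max(0, start)
        end = min(scaffold_length, end)
        if start < end:
            delta[start] += 1  # depth rises where an alignment begins
            delta[end] -= 1    # and falls just after it ends
    depth, current = [], 0
    for change in delta[:scaffold_length]:
        current += change
        depth.append(current)
    return depth


def uncovered_regions(depth):
    """Return (start, end) intervals, end-exclusive, where depth is zero."""
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d == 0 and start is None:
            start = pos
        elif d != 0 and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# Hypothetical alignments on a 100 bp scaffold: nothing spans 55..70.
toy_alignments = [(0, 40), (10, 55), (70, 100)]
toy_depth = depth_profile(100, toy_alignments)
toy_holes = uncovered_regions(toy_depth)
print(toy_holes)
```

A stretch that no read spans only marks a candidate for closer inspection. On the slides, such stretches coincide with gaps, repeats or heterozygous regions.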
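
Slide 35 describes bridgemapper as reporting split alignments, which can indicate structural variation or misassemblies. The sketch below illustrates the general idea of flagging reads whose aligned pieces do not fit a single locus. It is not bridgemapper's algorithm. The `Segment` record layout, the `max_jump` threshold and the toy data are hypothetical, and strand is ignored for brevity.

```python
from collections import defaultdict, namedtuple

# Hypothetical record layout; real aligner output carries more fields.
Segment = namedtuple("Segment", "read target t_start t_end q_start q_end")


def find_split_reads(segments, max_jump=1000):
    """Flag reads whose aligned segments are inconsistent with one locus.

    Returns {read: "different_targets" | "distant_on_same_target"}.
    max_jump is an arbitrary illustrative threshold in bases.
    """
    by_read = defaultdict(list)
    for seg in segments:
        by_read[seg.read].append(seg)

    flagged = {}
    for read, segs in by_read.items():
        if len(segs) < 2:
            continue  # a single segment cannot be a split alignment
        segs.sort(key=lambda s: s.q_start)  # order segments along the read
        for left, right in zip(segs, segs[1:]):
            if left.target != right.target:
                flagged[read] = "different_targets"
                break
            if abs(right.t_start - left.t_end) > max_jump:
                flagged[read] = "distant_on_same_target"
                break
    return flagged


toy_segments = [
    Segment("r1", "s1", 100, 2100, 0, 2000),      # r1 continues on another scaffold
    Segment("r1", "s2", 50, 1550, 2000, 3500),
    Segment("r2", "s1", 500, 1500, 0, 1000),      # r2 jumps 7500 bases on s1
    Segment("r2", "s1", 9000, 10000, 1000, 2000),
    Segment("r3", "s1", 0, 3000, 0, 3000),        # r3 aligns in one piece
    Segment("r4", "s1", 0, 1000, 0, 1000),        # r4 has only a 50 base jump
    Segment("r4", "s1", 1050, 2000, 1000, 1950),
]
toy_flags = find_split_reads(toy_segments)
print(toy_flags)
```

In this toy data, read `r1` continues from one scaffold into another. That is the kind of signal slides 37 and 38 show, where a gap points to a small scaffold.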