
A different kettle of fish entirely: bioinformatic challenges and solutions for whole de novo genome assembly of Atlantic cod and Atlantic salmon

A talk I gave at the 4th yearly seminar of the Norwegian Sequencing Centre (www.sequencing.uio.no)

  1. 1. A different kettle of fish entirely: bioinformatic challenges and solutions for whole de novo genome assembly of Atlantic cod and Atlantic salmon. Lex Nederbragt, NSC and CEES, lex.nederbragt@bio.uio.no, @lexnederbragt
  2. 2. Developments in High Throughput Sequencing
  3. 3. Developments in High Throughput Sequencing. [Figure: gigabases per run (log scale) against read length (log scale) for ABI 3730xl (‘Sanger’), Roche/454 (GS FLX, GS Junior), Illumina (GA II, HiSeq, MiSeq), Life Tech SOLiD, Ion Torrent (PGM, Proton) and PacBio RS] http://dx.doi.org/10.6084/m9.figshare.100940
  4. 4. Developments in High Throughput Sequencing. [Same figure, with the platforms grouped by read length: short, intermediate, ‘Sanger like’ and long] http://dx.doi.org/10.6084/m9.figshare.100940
  5. 5. Developments in High Throughput Sequencing. [Same figure, with the platforms grouped by read length: short, intermediate, ‘Sanger like’ and long] http://dx.doi.org/10.6084/m9.figshare.100940
  6. 6. What is this thing called ‘genome assembly’?
  7. 7. Hierarchical structure: reads, contigs, scaffolds
  8. 8. Sequence data: reads. Original DNA fragments; sequenced ends. http://www.cbcb.umd.edu/research/assembly_primer.shtml
  9. 9. Reads! http://www.sciencephoto.com/media/210915/enlarge
  10. 10. Contigs: building contigs
      Aligned reads:
      ACGCGATTCAGGTTACCACG
        GCGATTCAGGTTACCACGCG
          GATTCAGGTTACCACGCGTA
            TTCAGGTTACCACGCGTAGC
              CAGGTTACCACGCGTAGCGC
                GGTTACCACGCGTAGCGCAT
                  TTACCACGCGTAGCGCATTA
                    ACCACGCGTAGCGCATTACA
                      CACGCGTAGCGCATTACACA
                        CGCGTAGCGCATTACACAGA
                          CGTAGCGCATTACACAGATT
                            TAGCGCATTACACAGATTAG
      Consensus contig:
      ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
  11. 11. Contigs: building contigs. Repeat copy 1 and repeat copy 2 end up as a collapsed repeat consensus. Contig orientation? Contig order? http://www.cbcb.umd.edu/research/assembly_primer.shtml
  12. 12. Mate pairs: other read type. Mate pair reads come from (much) longer fragments. [Figure: mate pair reads placed around repeat copy 1 and repeat copy 2]
  13. 13. Mate pairs. Paired end reads: 100-500 bp insert (original DNA fragments, sequenced ends). Mate pairs: 2-20 kb insert. [Figure: mate pair reads placed around repeat copy 1 and repeat copy 2]
  14. 14. Scaffolds: ordered, oriented contigs. Mate pairs link the contigs and give a gap size estimate. [Figure legend: scaffold, gap, contig] http://dx.doi.org/10.6084/m9.figshare.100940
  15. 15. Hierarchical structure: reads (aligned reads), contigs (consensus contig), scaffolds (contigs separated by gaps). [Figure repeats the read alignment and consensus contig of slide 10]
  16. 16. Why is genome assembly such a difficult problem?
  17. 17. 1) Repeats. Repeats break up the assembly: repeat copy 1 and repeat copy 2 end up as a collapsed repeat consensus. http://www.cbcb.umd.edu/research/assembly_primer.shtml
  18. 18. 2) Diploidy. Differences between sister chromosomes: ‘heterozygosity’. http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
  19. 19. 2) Diploidy. [Figure: region 1 (homozygous), polymorphic regions 2 and 3 (heterozygous), region 4 (homozygous)]
  20. 20. 2) Diploidy. http://www.astraean.com/borderwars/wp-content/uploads/2012/04/heterozygoats.jpg and many other sites
  21. 21. 3) Polyploidy. http://en.wikipedia.org/wiki/Polyploidy
  22. 22. 4) Many programs to choose from. Zhang et al., PLoS ONE 2011
  23. 23. The Atlantic salmon and Atlantic cod genome projects http://kettleoffish.net/
  24. 24. Salmon: the players. ‘Sally’: the female named “Sally”, with a double-haploid genome of estimated length 3 Gbp.
  25. 25. Salmon: the genome. Pseudotetraploid. 3 billion bases (Gbp). ‘Double haploid’. 30-35% repetitive DNA; DNA transposons of ~1500 bp: 6-10%. Davidson et al., 2010, http://genomebiology.com/2010/11/9/403
  26. 26. Salmon: phase 1. Sanger sequencing and Illumina sequencing. Phase 1 assembly: 555 960 sequences; 2.4 Gbp of 3 Gbp; half of that in pieces of 9 300 bp or longer. http://www.flickr.com/photos/jurvetson/57080968/
  27. 27. Salmon: phase 2. Illumina sequencing: paired end, and mate pair of 3 kb and longer. Phase 2 stated goal: scaffolds greater than 1 Mbp; half the genome in contigs of at least 50 000 bp.
  28. 28. Cod: the players. Unnamed Atlantic cod
  29. 29. Cod: the genome. Heterozygote. 850 million bases (Mbp). ‘Wild-caught’.
  30. 30. Cod: phase 1. 454 sequencing (Sanger sequencing). Phase 1 assembly: 157 887 sequences; 753 Mbp of 830 Mbp; half in scaffolds of at least 460 000 bp; half in contigs of at least 2 800 bp.
  31. 31. Cod: phase 1
  32. 32. Cod: phase 2. Illumina sequencing: paired end >200x, mate pair 5 kb >100x. Phase 2 goal: half in scaffolds of at least 1 Mbp; half in contigs of at least 10 000-15 000 bp.
  33. 33. Atlantic salmon and Atlantic cod. Pseudotetraploid; heterozygosity; long repeats. [Figure: reads, contigs, ?, scaffolds]
  34. 34. What we need? Long reads!
  35. 35. Longer reads! Long reads can span repeats and heterozygous regions. [Figure: repeat copies 1 and 2; contig 1, polymorphic contigs 2 and 3, contig 4]
  36. 36. Developments in High Throughput Sequencing. [Figure: gigabases per run (log scale) against read length (log scale) for ABI 3730xl (‘Sanger’), Roche/454 (GS FLX, GS Junior), Illumina (GA II, HiSeq, MiSeq), Life Tech SOLiD, Ion Torrent (PGM, Proton) and PacBio RS] http://dx.doi.org/10.6084/m9.figshare.100940
  37. 37. PacBio sequencing. Single-molecule. C2 (current) chemistry: average read length 3100 bp; 36 000 reads; 110 Mbp per ‘run’.
  38. 38. PacBio sequencing. SMRTbell template. Sequencing ‘modes’: Standard Sequencing (large insert sizes; generates one pass on each molecule sequenced; single pass; ‘subreads’) and Circular Consensus Sequencing (small insert sizes; generates multiple passes on each molecule sequenced; multiple passes).
  39. 39. PacBio: uses. SMRTbell template. Standard Sequencing (large insert sizes, single pass): long reads, low quality, 85-87% accuracy. Circular Consensus Sequencing (small insert sizes, multiple passes on each molecule sequenced). Useful for assembly?
  40. 40. Solutions for assembly
  41. 41. PacBio for salmon and cod. Libraries: Standard Sequencing, large insert sizes. Aim for looooong insert sizes. [Figure: SMRTbell template with the Standard Sequencing and Circular Consensus Sequencing modes]
  42. 42. Salmon: PacBio reads. Data set 1: 1.1x coverage; half of all bases in reads of at least 5.5 kbp; longest 26.5 kbp. 104 SMRT Cells. Data set 2: latest chemistry and enzyme (C2-XL); 0.7x coverage; by PacBio, Menlo Park; half of all bases in reads of at least 6 kbp; longest 25 kbp.
  43. 43. Salmon: PacBio reads. Alignments of at least 1 kb to the released assembly. [Figure: alignments binned by identity. Labels: portion of the alignments; bin for read accuracy reported in the alignment; cumulative alignment quantity] Figure courtesy of Jason Miller, JCVI, USA
  44. 44. Salmon: PacBio reads. [Figure: salmon repeat database and PacBio reads, linked by ‘Mapping’; PacBio reads and scaffolds (contigs and gaps), linked by ‘Mapping’]
  45. 45. Salmon: repeats. 1.6 kb repeats mapped to PacBio reads. [Figure: left flank, repeat and right flank per read; scale 0-25 000 bp]
  46. 46. Salmon: repeats. 3-7 kb repeats mapped to PacBio reads. [Figure: left flank, repeat and right flank per read; scale 0-25 000 bp]
  47. 47. Salmon: error-correction. PacBioToCA. Jason Miller, JCVI: “Low fraction of reads recovered”, “Improves contig lengths by enabling new joins”, “Challenge for error-correction: polymorphic repeat copies”. [Figure: repeat copy 1, repeat copy 2]
  48. 48. Salmon: prospect. PacBio reads span even the longest repeats. [Figure: 3-7 kb repeats mapped to PacBio reads, with left flank, repeat and right flank; repeat copy 1, repeat copy 2]
  49. 49. Cod: PacBio reads. 8.1x coverage; half of all bases in reads of at least 4 kbp; longest 16.5 kbp. 104 SMRT Cells; regular C2 chemistry; Univ. of Oslo, Norway.
  50. 50. Cod: PacBio reads. [Figure: PacBio reads and scaffolds (contigs and gaps), linked by ‘Mapping’]
  51. 51. Cod: PacBio results. Mapping to the published genome: 11.4 kbp subread, 10.6 kbp subread, 10.9 kbp subread.
  52. 52. Cod: example 1. Assembly: ...ACACAC TGTGTG... 232 bp gap TGTGTG...
  53. 53. Cod: example 1. ACACAC repeat, 232 bp gap, TGTGTG repeat
  54. 54. Cod: example 1
  55. 55. Cod: example 1
  56. 56. Cod: example 1. Assembly: ...ACACAC TGTGTG... ...ACACACAC TGTGTG... ...ACACACAC TGTGTG... Unplaced region: AC TGTGTG...
  57. 57. Cod: example 2. Assembly: ...TGTGTG 344 bp gap
  58. 58. Cod: example 2. TGTGTG repeat, 344 bp gap
  59. 59. Cod: example 2
  60. 60. Cod: example 2. Assembly: ...TGTGTG ...TGTGTG ...TGTGTG ...TGTGTG Heterozygosity?
  61. 61. Cod: example 3. Assembly: 300 bp misassembly?
  62. 62. Cod: error-correction. P_errorCorrection pipeline from [...]. 93% of reads recovered. 2.7x + 23x + 24 CPUs, 4.5 days, 100 GB RAM. Alignments of at least 1 kb to the published assembly.
  63. 63. Cod: prospect. PacBio reads span many gaps. PacBio reads may span heterozygous regions. [Figure: contig 1, polymorphic contigs 2 and 3, contig 4]
  64. 64. Summary. Salmon and cod are extra challenging. Assembly is difficult (reads, contigs, scaffolds). PacBio has huge potential. [Figure: 3-7 kb repeats mapped to PacBio reads, with left flank, repeat and right flank] http://en.wikipedia.org, http://fishandboat.com
  65. 65. Acknowledgements. University of Oslo; Jason Miller, JCVI; Pacific Biosciences; Sequencing team NSC; ICSASG; Ole Kristian Tørresen; Kjetill Jakobsen; Sissel Jentoft; Cod genome group.
  66. 66. http://wiki.galaxyproject.org/Events/GCC2013
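
Slide 10 shows twelve 20-base reads, each shifted two bases to the right of the previous one, and the consensus contig they produce. The following is a minimal Python sketch of that idea. It assumes the reads are error-free and already sorted left to right; the names `overlap` and `merge_in_order` and the `min_len` threshold are illustrative and not taken from the talk. Real assemblers work on overlap graphs or de Bruijn graphs and have to cope with errors, repeats and heterozygosity, which this ordered merge ignores.

```python
def overlap(left, right, min_len=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_in_order(reads, min_len=10):
    """Extend a contig read by read, assuming the reads are sorted left to right."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read, min_len)
        if k == 0:
            raise ValueError("read does not overlap the growing contig: " + read)
        contig += read[k:]  # append only the bases not yet in the contig
    return contig


# The twelve reads shown on slide 10.
SLIDE_10_READS = [
    "ACGCGATTCAGGTTACCACG",
    "GCGATTCAGGTTACCACGCG",
    "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC",
    "CAGGTTACCACGCGTAGCGC",
    "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA",
    "ACCACGCGTAGCGCATTACA",
    "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA",
    "CGTAGCGCATTACACAGATT",
    "TAGCGCATTACACAGATTAG",
]

if __name__ == "__main__":
    print(merge_in_order(SLIDE_10_READS))
```

With two near-identical repeat copies in the input, a merge like this has no way of telling the copies apart, which is the collapsed repeat consensus of slides 11 and 17.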
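
Several slides summarise an assembly or a read set as "half of the bases in pieces of at least X bp" (slides 26, 30, 42 and 49). This statistic is commonly called N50. The sketch below shows one way to compute it in Python; the function name `n50` is illustrative and the example lengths are invented toy values, not data from the salmon or cod projects.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    toy_lengths = [100, 200, 300, 400, 500]  # invented values
    print(n50(toy_lengths))
```

The same function applies to contig lengths, scaffold lengths or read lengths; only the input list changes.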
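
Slide 37 quotes an average read length of 3100 bp and 36 000 reads for roughly 110 Mbp per run, and slides 42 and 49 express the PacBio data sets as fold coverage. The arithmetic behind those figures can be sketched as follows. The function names are illustrative, and the genome sizes are the estimates given on slides 25 and 30.

```python
def total_bases(n_reads, mean_read_length):
    """Bases produced by n_reads reads of the given mean length."""
    return n_reads * mean_read_length


def coverage(bases, genome_size):
    """Fold coverage: sequenced bases divided by genome size."""
    return bases / genome_size


def bases_for_coverage(target_coverage, genome_size):
    """Bases needed to reach a target fold coverage."""
    return target_coverage * genome_size


if __name__ == "__main__":
    per_run = total_bases(36_000, 3_100)
    print(per_run)  # 111600000, i.e. the roughly 110 Mbp per run of slide 37

    salmon_genome = 3_000_000_000  # 3 Gbp estimate, slide 25
    cod_genome = 830_000_000       # 830 Mbp, slide 30
    print(bases_for_coverage(1.1, salmon_genome) / 1e9)  # Gbp behind 1.1x salmon
    print(bases_for_coverage(8.1, cod_genome) / 1e9)     # Gbp behind 8.1x cod
```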
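
Slide 14 notes that mate pairs linking two contigs give a gap size estimate. A rough Python sketch of that estimate is shown below, under simplifying assumptions that are not from the talk: every pair has the same known insert size, one read starts at a 0-based position in the left contig and points towards its end, and its mate ends at a known offset from the start of the right contig. The names `gap_from_pair` and `estimate_gap` are hypothetical, and the example numbers are invented.

```python
from statistics import median


def gap_from_pair(insert_size, start_in_left, left_length, end_in_right):
    """Gap implied by one mate pair.

    The fragment covers (left_length - start_in_left) bases of the left
    contig, then the gap, then end_in_right bases of the right contig.
    """
    return insert_size - (left_length - start_in_left) - end_in_right


def estimate_gap(insert_size, left_length, pairs):
    """Median gap over all linking pairs.

    `pairs` is a list of (start_in_left, end_in_right) tuples.
    """
    if not pairs:
        raise ValueError("need at least one linking mate pair")
    return median(
        gap_from_pair(insert_size, start, left_length, end)
        for start, end in pairs
    )


if __name__ == "__main__":
    # Invented example: 3 kb mate pairs linking a 10 kb contig to its neighbour.
    linking_pairs = [(9000, 1500), (9200, 1720), (8800, 1280)]
    print(estimate_gap(3000, 10_000, linking_pairs))
```

In practice insert sizes vary around a mean, so scaffolders work with the insert size distribution rather than a single value; the median here only dampens the effect of outlier pairs.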
