A different kettle of fish entirely: bioinformatic challenges and solutions for whole de novo genome assembly of Atlantic cod and Atlantic salmon

A talk I gave for the 4th yearly seminar of the Norwegian Sequencing Centre (www.sequencing.uio.no)

  • November 2012
  • Transcript

    • 1. A different kettle of fish entirely: Bioinformatic challenges and solutions for whole de novo genome assembly of Atlantic cod and Atlantic salmon. Lex Nederbragt, NSC and CEES, lex.nederbragt@bio.uio.no, @lexnederbragt
    • 2. Developments in High Throughput Sequencing
    • 3. Developments in High Throughput Sequencing. [Figure: gigabases per run (log scale) against read length (log scale). Platforms shown: ABI 3730xl (‘Sanger’), Roche/454 (GS FLX, GS Junior), Illumina (GA II, MiSeq, HiSeq), Life Tech SOLiD, Ion Torrent (PGM, Proton), PacBio RS.] http://dx.doi.org/10.6084/m9.figshare.100940
    • 4. Developments in High Throughput Sequencing. [Same figure, with read-length classes labelled: short, intermediate, ‘Sanger like’, long.] http://dx.doi.org/10.6084/m9.figshare.100940
    • 5. Developments in High Throughput Sequencing. [Same figure, with read-length classes labelled: short, intermediate, ‘Sanger like’, long.] http://dx.doi.org/10.6084/m9.figshare.100940
    • 6. What is this thing called ‘genome assembly’?
    • 7. Hierarchical structure: reads, contigs, scaffolds
    • 8. Sequence data: reads. Original DNA fragments; sequenced ends. http://www.cbcb.umd.edu/research/assembly_primer.shtml
    • 9. Reads! http://www.sciencephoto.com/media/210915/enlarge
    • 10. Contigs: building contigs
        Aligned reads:
          ACGCGATTCAGGTTACCACG
            GCGATTCAGGTTACCACGCG
              GATTCAGGTTACCACGCGTA
                TTCAGGTTACCACGCGTAGC
                  CAGGTTACCACGCGTAGCGC
                    GGTTACCACGCGTAGCGCAT
                      TTACCACGCGTAGCGCATTA
                        ACCACGCGTAGCGCATTACA
                          CACGCGTAGCGCATTACACA
                            CGCGTAGCGCATTACACAGA
                              CGTAGCGCATTACACAGATT
                                TAGCGCATTACACAGATTAG
        Consensus contig:
          ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
    • 11. Contigs: building contigs. Repeat copy 1, repeat copy 2; collapsed repeat consensus. Contig orientation? Contig order? http://www.cbcb.umd.edu/research/assembly_primer.shtml
    • 12. Mate pairs: another read type. Repeat copy 1, repeat copy 2. Mate pair reads come from (much) longer fragments.
    • 13. Mate pairs. Paired end reads: 100-500 bp insert (original DNA fragments, sequenced ends). Mate pairs: 2-20 kb insert. Repeat copy 1, repeat copy 2; mate pair reads.
    • 14. Scaffolds: ordered, oriented contigs. Mate pairs link contigs and give a gap size estimate; a scaffold consists of contigs separated by gaps. http://dx.doi.org/10.6084/m9.figshare.100940
    • 15. Hierarchical structure. Reads: aligned reads. Contigs: consensus contig. Scaffolds: contigs separated by gaps.
    • 16. Why is genome assembly such a difficult problem?
    • 17. 1) Repeats. Repeat copy 1, repeat copy 2; collapsed repeat consensus. Repeats break up the assembly. http://www.cbcb.umd.edu/research/assembly_primer.shtml
    • 18. 2) Diploidy. Differences between sister chromosomes: ‘heterozygosity’. http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
    • 19. 2) Diploidy. Region 1 (homozygous); polymorphic regions 2 and 3 (heterozygous); region 4 (homozygous).
    • 20. 2) Diploidy. http://www.astraean.com/borderwars/wp-content/uploads/2012/04/heterozygoats.jpg and many other sites
    • 21. 3) Polyploidy. http://en.wikipedia.org/wiki/Polyploidy
    • 22. 4) Many programs to choose from. Zhang et al., PLoS ONE 2011
    • 23. The Atlantic salmon and Atlantic cod genome projects http://kettleoffish.net/
    • 24. Salmon: the players. ‘Sally’: the female named “Sally”, with a double-haploid genome of estimated length 3 Gbp.
    • 25. Salmon: the genome. Pseudotetraploid; 3 billion bases (Gbp); ‘double haploid’. 30-35% repetitive DNA; DNA transposons, ~1500 bp: 6-10%. Davidson et al., 2010, http://genomebiology.com/2010/11/9/403
    • 26. Salmon: phase 1. Sanger sequencing; Illumina sequencing. Phase 1 assembly: 555 960 sequences; 2.4 Gbp of 3 Gbp; half of that in pieces of 9 300 bp or longer. http://www.flickr.com/photos/jurvetson/57080968/
    • 27. Salmon: phase 2. Illumina sequencing: paired end; mate pair, 3 kb and longer. Phase 2 stated goal: scaffolds greater than 1 Mbp; half the genome in contigs of at least 50 000 bp.
    • 28. Cod: the players. Unnamed Atlantic cod
    • 29. Cod: the genome. Heterozygote; 850 million bases (Mbp); ‘wild-caught’.
    • 30. Cod: phase 1. 454 sequencing (Sanger sequencing). Phase 1 assembly: 157 887 sequences; 753 Mbp of 830 Mbp; half in scaffolds of at least 460 000 bp; half in contigs of at least 2 800 bp.
    • 31. Cod: phase 1
    • 32. Cod: phase 2. Illumina sequencing: paired end, >200x; mate pair 5 kb, >100x. Phase 2 goal: half in scaffolds of at least 1 Mbp; half in contigs of at least 10 000-15 000 bp.
    • 33. Atlantic salmon and Atlantic cod. Pseudotetraploid; heterozygosity; long repeats (repeat copy 1, repeat copy 2). Reads, contigs, scaffolds?
    • 34. What do we need? Long reads!
    • 35. Longer reads! Long reads can span repeats (repeat copy 1, repeat copy 2) and heterozygous regions (contig 1, polymorphic contigs 2 and 3, contig 4).
    • 36. Developments in High Throughput Sequencing. [Figure: gigabases per run (log scale) against read length (log scale). Platforms shown: ABI 3730xl (‘Sanger’), Roche/454 (GS FLX, GS Junior), Illumina (GA II, MiSeq, HiSeq), Life Tech SOLiD, Ion Torrent (PGM, Proton), PacBio RS.] http://dx.doi.org/10.6084/m9.figshare.100940
    • 37. PacBio sequencing. Single-molecule. C2 (current) chemistry: average read length 3100 bp; 36 000 reads; 110 Mbp per ‘run’.
    • 38. PacBio sequencing: sequencing ‘modes’ (SMRTbell template). Standard sequencing: large insert sizes; generates one pass on each molecule sequenced (single pass; ‘subreads’). Circular consensus sequencing: small insert sizes; generates multiple passes on each molecule sequenced.
    • 39. PacBio: uses (SMRTbell template). Standard sequencing: large insert sizes; generates one pass on each molecule sequenced (single pass). Long reads, low quality: 85-87% accuracy. Useful for assembly? Circular consensus sequencing: small insert sizes; generates multiple passes on each molecule sequenced.
    • 40. Solutions for assembly
    • 41. PacBio for salmon and cod. Libraries: aim for looooong insert sizes. [Figure: SMRTbell template; standard sequencing (large insert sizes) and circular consensus sequencing (small insert sizes).]
    • 42. Salmon: PacBio reads. Data set 1: 1.1x coverage; half of all bases in reads of at least 5.5 kbp; longest 26.5 kbp. 104 SMRT Cells. Data set 2: latest chemistry and enzyme (C2-XL); 0.7x coverage; by PacBio, Menlo Park; half of all bases in reads of at least 6 kbp; longest 25 kbp. [Figure: SMRTbell template; standard sequencing (large insert sizes) and circular consensus sequencing (small insert sizes).]
    • 43. Salmon: PacBio reads. Alignments of at least 1 kb to the released assembly. [Figure: alignments binned by % identity; portion of the alignments per bin of read accuracy reported in the alignment; cumulative alignment quantity.] Figure courtesy of Jason Miller, JCVI, USA
    • 44. Salmon: PacBio reads. Mapping: salmon repeat database (repeat copy 1, repeat copy 2). Mapping: scaffolds (contigs and gaps). [Figure: SMRTbell template; standard sequencing (large insert sizes) and circular consensus sequencing (small insert sizes).]
    • 45. Salmon: repeats. 1.6 kb repeats mapped to PacBio reads. [Figure: left flank, repeat, right flank; scale 0-25000 bp.]
    • 46. Salmon: repeats. 3-7 kb repeats mapped to PacBio reads. [Figure: left flank, repeat, right flank; scale 0-25000 bp.]
    • 47. Salmon: error-correction. PacBioToCA. Jason Miller, JCVI: “Low fraction of reads recovered”; “Improves contig lengths by enabling new joins”; “Challenge for error-correction: polymorphic repeat copies” (repeat copy 1, repeat copy 2).
    • 48. Salmon: prospect. PacBio reads span even the longest repeats. [Figure: 3-7 kb repeats mapped to PacBio reads; left flank, repeat, right flank.]
    • 49. Cod: PacBio reads. 8.1x coverage; half of all bases in reads of at least 4 kbp; longest 16.5 kbp. 104 SMRT Cells; regular C2 chemistry; Univ. of Oslo, Norway. [Figure: SMRTbell template; standard sequencing (large insert sizes) and circular consensus sequencing (small insert sizes).]
    • 50. Cod: PacBio reads. Mapping: scaffolds (contigs and gaps). [Figure: SMRTbell template; standard sequencing (large insert sizes) and circular consensus sequencing (small insert sizes).]
    • 51. Cod: PacBio results. Mapping to the published genome: 11.4 kbp subread, 10.6 kbp subread, 10.9 kbp subread.
    • 52. Cod: example 1. Assembly: ...ACACAC [232 bp gap] TGTGTG...
    • 53. Cod: example 1. ACACAC repeat, 232 bp gap, TGTGTG repeat.
    • 54. Cod: example 1
    • 55. Cod: example 1
    • 56. Cod: example 1. Assembly: ...ACACAC TGTGTG...; ...ACACACAC TGTGTG...; ...ACACACAC TGTGTG...; unplaced region: AC TGTGTG...
    • 57. Cod: example 2. Assembly: ...TGTGTG, 344 bp gap.
    • 58. Cod: example 2. TGTGTG repeat, 344 bp gap.
    • 59. Cod: example 2
    • 60. Cod: example 2. Assembly: ...TGTGTG; ...TGTGTG; ...TGTGTG; ...TGTGTG. Heterozygosity?
    • 61. Cod: example 3. Assembly: 300 bp misassembly?
    • 62. Cod: error-correction. P_errorCorrection pipeline from [source missing]: 93% of reads recovered. 2.7x (alignments of at least 1 kb to published assembly) + 23x + 24 CPUs, 4.5 days, 100 Gb RAM.
    • 63. Cod: prospect. PacBio reads span many gaps. PacBio reads may span heterozygous regions (contig 1, polymorphic contigs 2 and 3, contig 4).
    • 64. Summary. Salmon and cod are extra challenging. Assembly is difficult (reads, contigs, scaffolds). PacBio has huge potential. [Figure: 3-7 kb repeats mapped to PacBio reads; left flank, repeat, right flank.] http://en.wikipedia.org, http://fishandboat.com
    • 65. Acknowledgements. University of Oslo; Jason Miller, JCVI; Pacific Biosciences; Sequencing team NSC; ICSASG; Ole Kristian Tørresen; Kjetill Jakobsen; Sissel Jentoft; Cod genome group.
    • 66. http://wiki.galaxyproject.org/Events/GCC2013
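
Slide 10 shows twelve overlapping reads collapsing into one consensus contig. A minimal Python sketch of that idea follows. It assumes the reads are error-free, already ordered left to right, and joined by exact suffix/prefix overlaps. The function name `merge_reads` and the `min_overlap` parameter are illustrative and not taken from the talk; real assemblers have to cope with sequencing errors, unordered reads and repeats, which this sketch ignores.

```python
def merge_reads(reads, min_overlap=3):
    """Extend a contig read by read using the longest exact suffix/prefix overlap.

    Assumes `reads` is ordered left to right and error-free (illustration only).
    """
    contig = reads[0]
    for read in reads[1:]:
        longest = min(len(contig), len(read))
        # Try the longest possible overlap first, then shorter ones.
        for k in range(longest, min_overlap - 1, -1):
            if contig.endswith(read[:k]):
                contig += read[k:]  # append only the non-overlapping tail
                break
        else:
            raise ValueError(f"no overlap of at least {min_overlap} bp for {read}")
    return contig


# The twelve reads from slide 10.
reads = [
    "ACGCGATTCAGGTTACCACG",
    "GCGATTCAGGTTACCACGCG",
    "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC",
    "CAGGTTACCACGCGTAGCGC",
    "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA",
    "ACCACGCGTAGCGCATTACA",
    "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA",
    "CGTAGCGCATTACACAGATT",
    "TAGCGCATTACACAGATTAG",
]

contig = merge_reads(reads)
print(contig)
```

With these reads the result should match the consensus contig printed on the slide. If a repeat longer than the reads were present, several overlaps would look equally valid, which is the problem slides 11 and 17 describe.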
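
Slide 14 notes that mate pairs give a gap size estimate between contigs. One common way to reason about this, sketched here under stated assumptions: if a mate pair with a known insert size has one read in contig A and the other in contig B, the gap is roughly the insert size minus the parts of the fragment that lie inside each contig. The 5 kb insert size matches the cod mate pair library on slide 32; the distances and the function name `estimate_gap` are invented for illustration and are not from the talk.

```python
from statistics import mean


def estimate_gap(insert_size, links):
    """Average gap estimate from mate pairs linking two contigs.

    Each link is (bases_inside_contig_a, bases_inside_contig_b): the part of
    the original fragment covered by contig A (from read 1 to the contig end)
    and by contig B (from the contig start to read 2). Hypothetical inputs.
    """
    return mean(insert_size - in_a - in_b for in_a, in_b in links)


# Three hypothetical mate pairs from a 5 kb library.
links = [(1800, 2700), (2500, 2100), (900, 3500)]
gap = estimate_gap(5000, links)
print(f"estimated gap: {gap} bp")
```

In practice insert sizes vary around their mean, so a real scaffolder would work with a distribution and report an uncertainty rather than a single number.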
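
Slides 26 and 30 summarise assemblies with statements such as "half of that in pieces of 9 300 bp or longer". This statistic is commonly called N50. The sketch below computes it for a toy list of sequence lengths; the lengths are invented and are not project data, and the function name `n50` is only a label for this example.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Toy assembly: five sequences, 300 bp in total.
toy_lengths = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_lengths)
print(f"N50 = {toy_n50} bp")
```

The same calculation applies to the read statistics on slides 42 and 49 ("half of all bases in reads at least 4 kbp"), with read lengths in place of contig or scaffold lengths.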
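
The PacBio numbers on slide 37 can be checked with simple arithmetic, and the coverage figures on slides 42 and 49 can be translated into bases. This sketch assumes that coverage means total sequenced bases divided by genome size, and it uses the genome sizes from slides 25 and 29 (3 Gbp and 850 Mbp). Variable and function names are illustrative.

```python
# Slide 37: C2 chemistry, per 'run'.
reads_per_run = 36_000
mean_read_length_bp = 3_100
yield_bp = reads_per_run * mean_read_length_bp  # close to the quoted 110 Mbp


def bases_for_coverage(coverage, genome_size_bp):
    """Total bases implied by a coverage value (coverage = bases / genome size)."""
    return coverage * genome_size_bp


cod_bases = bases_for_coverage(8.1, 850e6)    # slide 49, genome size from slide 29
salmon_bases = bases_for_coverage(1.1, 3e9)   # slide 42 data set 1, size from slide 25

print(f"yield per run: {yield_bp / 1e6:.1f} Mbp")
print(f"cod PacBio data: {cod_bases / 1e9:.2f} Gbp")
print(f"salmon PacBio data set 1: {salmon_bases / 1e9:.2f} Gbp")
```

These are back-of-the-envelope figures; the slides do not say whether the quoted coverage was computed from raw or filtered reads.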
