A different kettle of fish entirely: bioinformatic challenges and solutions for whole de novo genome assembly of Atlantic cod and Atlantic salmon

A talk I gave at the 4th yearly seminar of the Norwegian Sequencing Centre (www.sequencing.uio.no)

    1. A different kettle of fish entirely: bioinformatic challenges and solutions for whole de novo genome assembly of Atlantic cod and Atlantic salmon. Lex Nederbragt, NSC and CEES, lex.nederbragt@bio.uio.no, @lexnederbragt
    2. Developments in High Throughput Sequencing
    3. Developments in High Throughput Sequencing. [Figure: gigabases per run (log scale) versus read length (log scale) for ABI 3730xl ('Sanger'), Roche/454 (GS FLX, GS Junior), Illumina (GA II, HiSeq, MiSeq), Life Tech SOLiD, Ion Torrent (PGM, Proton) and PacBio RS] http://dx.doi.org/10.6084/m9.figshare.100940
    4. Developments in High Throughput Sequencing. [Same figure, with read-length classes marked: short, intermediate, 'Sanger like', long] http://dx.doi.org/10.6084/m9.figshare.100940
    5. Developments in High Throughput Sequencing. [Same figure, with read-length classes marked: short, intermediate, 'Sanger like', long] http://dx.doi.org/10.6084/m9.figshare.100940
    6. What is this thing called ‘genome assembly’?
    7. Hierarchical structure: reads, contigs, scaffolds
    8. Sequence data: reads. Original DNA fragments; sequenced ends. http://www.cbcb.umd.edu/research/assembly_primer.shtml
    9. Reads! http://www.sciencephoto.com/media/210915/enlarge
    10. Contigs: building contigs. Aligned reads: ACGCGATTCAGGTTACCACG, GCGATTCAGGTTACCACGCG, GATTCAGGTTACCACGCGTA, TTCAGGTTACCACGCGTAGC, CAGGTTACCACGCGTAGCGC, GGTTACCACGCGTAGCGCAT, TTACCACGCGTAGCGCATTA, ACCACGCGTAGCGCATTACA, CACGCGTAGCGCATTACACA, CGCGTAGCGCATTACACAGA, CGTAGCGCATTACACAGATT, TAGCGCATTACACAGATTAG. Consensus contig: ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
    11. Contigs: building contigs. Repeat copy 1 and repeat copy 2 give a collapsed repeat consensus. Contig orientation? Contig order? http://www.cbcb.umd.edu/research/assembly_primer.shtml
    12. Mate pairs: another read type. Mate pair reads come from (much) longer fragments and can link across repeat copy 1 and repeat copy 2.
    13. Mate pairs. Paired end reads: 100-500 bp insert (original DNA fragments, sequenced ends). Mate pairs: 2-20 kb insert. Mate pair reads link across repeat copy 1 and repeat copy 2.
    14. Scaffolds: ordered, oriented contigs. Mate pairs link contigs and give a gap size estimate. Scaffold: contig, gap, contig. http://dx.doi.org/10.6084/m9.figshare.100940
    15. Hierarchical structure. Reads (aligned): ACGCGATTCAGGTTACCACG, GCGATTCAGGTTACCACGCG, GATTCAGGTTACCACGCGTA, TTCAGGTTACCACGCGTAGC, CAGGTTACCACGCGTAGCGC, GGTTACCACGCGTAGCGCAT, TTACCACGCGTAGCGCATTA, ACCACGCGTAGCGCATTACA, CACGCGTAGCGCATTACACA, CGCGTAGCGCATTACACAGA, CGTAGCGCATTACACAGA, TAGCGCATTACACAGA. Contigs (consensus contig): ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGA. Scaffolds: contig, gap, contig.
    16. Why is genome assembly such a difficult problem?
    17. 1) Repeats. Repeat copy 1 and repeat copy 2 give a collapsed repeat consensus. Repeats break up the assembly. http://www.cbcb.umd.edu/research/assembly_primer.shtml
    18. 2) Diploidy. Differences between sister chromosomes: ‘heterozygosity’. http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
    19. 2) Diploidy. Region 1 (homozygous); polymorphic regions 2 and 3 (heterozygous); region 4 (homozygous).
    20. 2) Diploidy. http://www.astraean.com/borderwars/wp-content/uploads/2012/04/heterozygoats.jpg and many other sites
    21. 3) Polyploidy. http://en.wikipedia.org/wiki/Polyploidy
    22. 4) Many programs to choose from. Zhang et al., PLoS ONE 2011
    23. The Atlantic salmon and Atlantic cod genome projects. http://kettleoffish.net/
    24. Salmon: the players. ‘Sally’: the female named “Sally”, with double-haploid genome of estimated length 3 Gbp.
    25. Salmon: the genome. Pseudotetraploid; 3 billion bases (Gbp); ‘double haploid’. 30-35% repetitive DNA; DNA transposons of ~1500 bp: 6-10%. Davidson et al., 2010, http://genomebiology.com/2010/11/9/403
    26. Salmon: phase 1. Sanger sequencing and Illumina sequencing. Phase 1 assembly: 555 960 sequences; 2.4 Gbp of 3 Gbp; half of that in pieces of 9 300 bp or longer. Scaffold: contig, gap, contig. http://www.flickr.com/photos/jurvetson/57080968/
    27. Salmon: phase 2. Illumina sequencing: paired end; mate pair 3 kb and longer. Phase 2 stated goal: scaffolds greater than 1 Mbp; half the genome in contigs of at least 50 000 bp. Scaffold: contig, gap, contig.
    28. Cod: the players. Unnamed Atlantic cod.
    29. Cod: the genome. Heterozygote; 850 million bases (Mbp); ‘wild-caught’.
    30. Cod: phase 1. 454 sequencing (Sanger sequencing). Phase 1 assembly: 157 887 sequences; 753 Mbp of 830 Mbp; half in scaffolds of at least 460 000 bp; half in contigs of at least 2 800 bp. Scaffold: contig, gap, contig.
    31. Cod: phase 1
    32. Cod: phase 2. Illumina sequencing: paired end >200x; mate pair 5 kb >100x. Phase 2 goal: half in scaffolds of at least 1 Mbp; half in contigs of at least 10 000 to 15 000 bp.
    33. Atlantic salmon and Atlantic cod. Pseudotetraploid; heterozygosity; long repeats (repeat copy 1, repeat copy 2). Reads, contigs, then scaffolds?
    34. What we need? Long reads!
    35. Longer reads! Long reads can span repeats (repeat copy 1, repeat copy 2) and heterozygous regions (contig 1; polymorphic contigs 2 and 3; contig 4).
    36. Developments in High Throughput Sequencing. [Figure: gigabases per run (log scale) versus read length (log scale) for ABI 3730xl ('Sanger'), Roche/454 (GS FLX, GS Junior), Illumina (GA II, HiSeq, MiSeq), Life Tech SOLiD, Ion Torrent (PGM, Proton) and PacBio RS] http://dx.doi.org/10.6084/m9.figshare.100940
    37. PacBio sequencing. Single-molecule. C2 (current) chemistry: average read length 3100 bp; 36 000 reads; 110 Mbp per ‘run’.
    38. PacBio sequencing. SMRTbell template. Sequencing ‘modes’. Standard sequencing: large insert sizes; generates one pass on each molecule sequenced (single pass). Circular consensus sequencing: small insert sizes; generates multiple passes on each molecule sequenced (‘subreads’).
    39. PacBio: uses. SMRTbell template. Standard sequencing: large insert sizes; generates one pass on each molecule sequenced. Long reads, low quality: 85-87% accuracy (single pass). Useful for assembly? Circular consensus sequencing: small insert sizes; generates multiple passes on each molecule sequenced.
    40. Solutions for assembly
    41. PacBio for salmon and cod. Libraries: aim for looooong insert sizes. SMRTbell template. Standard sequencing: large insert sizes; generates one pass on each molecule sequenced. Circular consensus sequencing: small insert sizes; generates multiple passes on each molecule sequenced.
    42. Salmon: PacBio reads. Data set 1: 1.1x coverage; half of all bases in reads of at least 5.5 kbp; longest 26.5 kbp. 104 SMRT Cells. Data set 2: latest chemistry and enzyme (C2-XL); 0.7x coverage; by PacBio, Menlo Park; half of all bases in reads of at least 6 kbp; longest 25 kbp. Standard sequencing, large insert sizes.
    43. Salmon: PacBio reads. Alignments of at least 1 kb to the released assembly. [Figure: alignments binned by identity; labels: portion of the alignments, bin for read accuracy reported in the alignment, cumulative alignment quantity] Figure courtesy of Jason Miller, JCVI, USA
    44. Salmon: PacBio reads. Mapping of the salmon repeat database (repeat copy 1, repeat copy 2) to the reads; mapping of the reads to the assembly (scaffold: contig, gap, contig). Standard sequencing, large insert sizes.
    45. Salmon: repeats. 1.6 kb repeats mapped to PacBio reads. [Figure: left flank, repeat, right flank; scale 0 to 25 000 bp]
    46. Salmon: repeats. 3-7 kb repeats mapped to PacBio reads. [Figure: left flank, repeat, right flank; scale 0 to 25 000 bp]
    47. Salmon: error-correction. PacBioToCA. Jason Miller, JCVI: “Low fraction of reads recovered”; “Improves contig lengths by enabling new joins”; “Challenge for error-correction: polymorphic repeat copies” (repeat copy 1, repeat copy 2).
    48. Salmon: prospect. PacBio reads span even the longest repeats. [Figure: 3-7 kb repeats mapped to PacBio reads; left flank, repeat, right flank]
    49. Cod: PacBio reads. 8.1x coverage; half of all bases in reads of at least 4 kbp; longest 16.5 kbp. 104 SMRT Cells; regular C2 chemistry; Univ. of Oslo, Norway. Standard sequencing, large insert sizes.
    50. Cod: PacBio reads. Mapping of the reads to the assembly (scaffold: contig, gap, contig). Standard sequencing, large insert sizes.
    51. Cod: PacBio results. Mapping to the published genome: 11.4 kbp subread; 10.6 kbp subread; 10.9 kbp subread.
    52. Cod: example 1. Assembly: ...ACACAC, 232 bp gap, TGTGTG...
    53. Cod: example 1. ACACAC repeat, 232 bp gap, TGTGTG repeat.
    54. Cod: example 1
    55. Cod: example 1
    56. Cod: example 1. Assembly: ...ACACAC TGTGTG...; ...ACACACAC TGTGTG...; ...ACACACAC TGTGTG...; unplaced region: AC TGTGTG...
    57. Cod: example 2. Assembly: ...TGTGTG, 344 bp gap.
    58. Cod: example 2. TGTGTG repeat, 344 bp gap.
    59. Cod: example 2
    60. Cod: example 2. Assembly: ...TGTGTG; ...TGTGTG; ...TGTGTG; ...TGTGTG. Heterozygosity?
    61. Cod: example 3. Assembly: 300 bp misassembly?
    62. Cod: error-correction. P_errorCorrection pipeline from [name missing in transcript]. 93% of reads recovered. 2.7x + 23x + [input labels missing in transcript]. Alignments of at least 1 kb to the published assembly. 24 CPUs, 4.5 days, 100 GB RAM.
    63. Cod: prospect. PacBio reads span many gaps. PacBio reads may span heterozygous regions (contig 1; polymorphic contigs 2 and 3; contig 4).
    64. Summary. Assembly is difficult (reads, contigs, scaffolds). Salmon and cod are extra challenging. PacBio has huge potential. [Figure: 3-7 kb repeats mapped to PacBio reads; left flank, repeat, right flank] http://en.wikipedia.org, http://fishandboat.com
    65. Acknowledgements. University of Oslo: sequencing team NSC, Ole Kristian Tørresen, Kjetill Jakobsen, Sissel Jentoft, cod genome group. Jason Miller, JCVI. Pacific Biosciences. ICSASG.
    66. http://wiki.galaxyproject.org/Events/GCC2013
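
The "Building contigs" slide shows twelve overlapping reads and the consensus contig they produce. As a minimal sketch of that idea, the Python below extends a contig one read at a time using the longest suffix-prefix overlap. It assumes error-free reads that are already in left-to-right order, which real data never are; the function names `overlap` and `merge_in_order` are made up for this illustration, and production assemblers use overlap or de Bruijn graphs rather than this simple loop.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_in_order(reads, min_len=3):
    """Extend a contig read by read (assumes error-free, pre-sorted reads)."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap(contig, read, min_len)
        if size == 0:
            raise ValueError(f"read {read!r} does not overlap the contig")
        # Append only the part of the read that sticks out past the contig.
        contig += read[size:]
    return contig


# The twelve aligned reads from the "Building contigs" slide.
slide_reads = [
    "ACGCGATTCAGGTTACCACG",
    "GCGATTCAGGTTACCACGCG",
    "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC",
    "CAGGTTACCACGCGTAGCGC",
    "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA",
    "ACCACGCGTAGCGCATTACA",
    "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA",
    "CGTAGCGCATTACACAGATT",
    "TAGCGCATTACACAGATTAG",
]

contig = merge_in_order(slide_reads)
print(contig)
```

Each read on the slide is shifted two bases relative to the previous one, so every merge step adds two new bases to the contig. A repeat longer than the reads would make some overlaps ambiguous, which is the problem the "Repeats" slide describes.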

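
The phase 1 and phase 2 slides describe assemblies with statements such as "half in contigs of at least 2 800 bp". That kind of figure is commonly called the N50. The sketch below shows one way to compute it from a list of sequence lengths; the function name `n50` and the toy lengths are illustrative and are not taken from the cod or salmon assemblies.

```python
def n50(lengths):
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths in bp (made up for illustration).
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))
```

The same function works for scaffold lengths, which is how the scaffold and contig figures on one slide can differ so much: scaffolds join many short contigs across gaps.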