A different kettle of fish entirely
Bioinformatic challenges and solutions for whole de novo
  genome assembly of Atlantic cod and Atlantic salmon

                 Lex Nederbragt, NSC and CEES
                   lex.nederbragt@bio.uio.no
                         @lexnederbragt

Developments in
High Throughput Sequencing
Developments in High Throughput Sequencing

[Figure: gigabases per run (log scale, 0.00001 to 1000) plotted against read length (log scale, 10 to 10000). Points: GA II, HiSeq, MiSeq, SOLiD, PGM, Proton, GS Junior, GS FLX, PacBio RS and 'Sanger'. Legend: ABI 3730xl, Roche/454 GS, Illumina HiSeq, Life Tech SOLiD, MiSeq, IonTorrent PGM, PacBio RS, GS Junior, Ion Proton.]
http://dx.doi.org/10.6084/m9.figshare.100940
Developments in High Throughput Sequencing

[Figure: gigabases per run (log scale) plotted against read length (log scale), with the platforms grouped by read length and labelled 'Short', 'Intermediate', 'Long' and 'Sanger like'.]
http://dx.doi.org/10.6084/m9.figshare.100940
What is this thing called ‘genome assembly’?
Hierarchical structure



reads

 contigs

   scaffolds
Sequence data

Reads

[Figure: the original DNA is broken into fragments, and the ends of the fragments are sequenced.]

http://www.cbcb.umd.edu/research/assembly_primer.shtml
Reads!

http://www.sciencephoto.com/media/210915/enlarge
Contigs

Building contigs

                   ACGCGATTCAGGTTACCACG
                     GCGATTCAGGTTACCACGCG
                       GATTCAGGTTACCACGCGTA
                         TTCAGGTTACCACGCGTAGC
                           CAGGTTACCACGCGTAGCGC
    Aligned reads            GGTTACCACGCGTAGCGCAT
                               TTACCACGCGTAGCGCATTA
                                 ACCACGCGTAGCGCATTACA
                                   CACGCGTAGCGCATTACACA
                                     CGCGTAGCGCATTACACAGA
                                       CGTAGCGCATTACACAGATT
                                         TAGCGCATTACACAGATTAG
  Consensus contig ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
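
The aligned reads on this slide can be merged into the consensus contig with a few lines of Python. This is only a minimal sketch: it assumes the reads are error-free and already sorted from left to right, and the helper names `overlap` and `build_contig` are made up for the illustration. Real assemblers have to find the order themselves and cope with sequencing errors.

```python
def overlap(left: str, right: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def build_contig(reads: list[str], min_overlap: int = 10) -> str:
    """Merge reads that are already in left-to-right order into one contig."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap(contig, read, min_overlap)
        if size == 0:
            raise ValueError(f"read {read!r} does not overlap the contig")
        # Append only the part of the read that extends past the contig.
        contig += read[size:]
    return contig


# The twelve reads from the slide, in the order shown.
reads = [
    "ACGCGATTCAGGTTACCACG",
    "GCGATTCAGGTTACCACGCG",
    "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC",
    "CAGGTTACCACGCGTAGCGC",
    "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA",
    "ACCACGCGTAGCGCATTACA",
    "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA",
    "CGTAGCGCATTACACAGATT",
    "TAGCGCATTACACAGATTAG",
]

contig = build_contig(reads)
print(contig)
```

Each read overlaps the growing contig by 18 bases and adds 2 new ones, so the 12 reads of 20 bases give a contig of 42 bases.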
Contigs

Building contigs

[Figure: Repeat copy 1 and Repeat copy 2 are merged into a single collapsed repeat consensus.]

Contig orientation?
Contig order?

http://www.cbcb.umd.edu/research/assembly_primer.shtml
Mate pairs

Other read type

[Figure: mate pair reads, sequenced from (much) longer fragments, drawn relative to Repeat copy 1 and Repeat copy 2.]
Mate pairs


Paired end reads: 100-500 bp insert
[Figure: original DNA, fragments, sequenced ends.]

Mate pairs: 2-20 kb insert
[Figure: mate pair reads drawn relative to Repeat copy 1 and Repeat copy 2.]
Scaffolds

• Ordered, oriented contigs

[Figure: mate pairs link contigs into a scaffold; the gap between two contigs comes with a gap size estimate.]

http://dx.doi.org/10.6084/m9.figshare.100940
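
One way to picture the gap size estimate: a mate pair with a known insert size starts in one contig and ends in the next, and whatever part of the insert is not accounted for inside the two contigs must lie in the gap. The sketch below assumes exactly that simple model; the function names and all numbers (a 5 kb insert, a 10 000 bp left contig, three pairs) are hypothetical, and real scaffolders also model the spread of insert sizes.

```python
from statistics import mean


def gap_from_pair(insert_size, left_contig_length, left_read_start, right_read_end):
    """Gap implied by one mate pair that links two neighbouring contigs.

    left_read_start: position in the left contig where the pair begins
    right_read_end:  position in the right contig where the pair ends
    """
    inside_left = left_contig_length - left_read_start
    inside_right = right_read_end
    return insert_size - inside_left - inside_right


def estimate_gap(insert_size, left_contig_length, pairs):
    """Average the gap implied by every linking mate pair."""
    return mean(
        gap_from_pair(insert_size, left_contig_length, start, end)
        for start, end in pairs
    )


# Hypothetical example: (start in left contig, end in right contig) per pair.
pairs = [(8000, 1800), (9000, 2900), (7500, 1200)]
gap = estimate_gap(insert_size=5000, left_contig_length=10_000, pairs=pairs)
print(gap)
```

The three pairs imply gaps of 1200, 1100 and 1300 bp, so the averaged estimate is 1200 bp.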
Hierarchical structure



[Figure: recap of the hierarchy. Aligned reads give a consensus contig; contigs separated by gaps are ordered into a scaffold.]
Why is genome assembly such a difficult problem?
1) Repeats


Repeats break up assembly

[Figure: Repeat copy 1 and Repeat copy 2 are merged into a single collapsed repeat consensus.]

http://www.cbcb.umd.edu/research/assembly_primer.shtml
2) Diploidy



Differences between sister chromosomes: ‘heterozygosity’

[Figure: a chromosome with the differing positions marked *.]

http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
2) Diploidy




Region 1 (homozygous)
Polymorphic region 2 / Polymorphic region 3 (heterozygous)
Region 4 (homozygous)
2) Diploidy




http://www.astraean.com/borderwars/wp-content/uploads/2012/04/heterozygoats.jpg
and many other sites
3) Polyploidy




http://en.wikipedia.org/wiki/Polyploidy
4) Many programs to choose from




Zhang et al., PLoS ONE 2011
The Atlantic salmon and Atlantic cod genome projects




           http://kettleoffish.net/
Salmon: the players




‘Sally’
The female named “Sally” with double-haploid genome of estimated length 3 Gbp.
Salmon: the genome
3 billion bases (Gbp)
Pseudotetraploid
‘Double haploid’

30-35%: repetitive DNA
DNA transposons ~ 1500 bp: 6-10% *

* Davidson et al., 2010 http://genomebiology.com/2010/11/9/403
Salmon: phase 1

Sanger sequencing
Illumina sequencing

Phase 1 assembly
555 960 sequences
2.4 Gbp of 3 Gbp
Half of that in pieces of 9 300 bp or longer

http://www.flickr.com/photos/jurvetson/57080968/
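
"Half of that in pieces of 9 300 bp or longer" is the kind of statistic usually called N50: sort the pieces from longest to shortest and add them up until half of the assembled bases are reached. A minimal sketch, with a made-up list of lengths rather than the real salmon assembly:

```python
def n50(lengths):
    """Length L such that pieces of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")


# Hypothetical toy assembly: nine pieces, 54 bases in total.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

For the toy list, the three longest pieces (10 + 9 + 8 = 27) already hold half of the 54 bases, so the N50 is 8.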
Salmon: phase 2

Illumina sequencing
Paired end
Mate Pair 3kb and longer

Phase 2 stated goal
Scaffolds greater than 1 Mbp
Half the genome in contigs of at least 50 000 bp
Cod: the players




Unnamed Atlantic cod
Cod: the genome
850 million bases (Mbp)
Heterozygote
‘Wild-caught’
Cod: phase 1

454 sequencing (Sanger sequencing)

Phase 1 assembly
157 887 sequences
753 Mbp of 830 Mbp
Half in scaffolds of at least 460 000 bp
Half in contigs of at least 2 800 bp
Cod: phase 1
Cod: phase 2

Illumina sequencing
Paired end: >200x
Mate Pair 5kb: >100x

Phase 2 goal
Half in scaffolds of at least 1 Mbp
Half in contigs of at least 10 000-15 000 bp
Atlantic salmon and Atlantic cod
Heterozygosity
Pseudotetraploid
Long repeats

[Figure: the reads, contigs, scaffolds hierarchy with a question mark between contigs and scaffolds; Repeat copy 1, Repeat copy 2.]
What we need? Long reads!
Longer reads!
Long reads can span repeats and heterozygous regions

[Figure: Repeat copy 1 and Repeat copy 2; Contig 1, Polymorphic contig 2 / Polymorphic contig 3, Contig 4.]
Developments in High Throughput Sequencing

[Figure: gigabases per run (log scale) plotted against read length (log scale), shown again.]
http://dx.doi.org/10.6084/m9.figshare.100940
PacBio sequencing

                              Single-molecule




C2 (current) chemistry:
Average read length 3100 bp
36 000 reads
110 Mbp per ‘run’
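
The three C2 numbers fit together: 36 000 reads of 3100 bp on average is about 110 Mbp. The sketch below redoes that arithmetic and asks how many such 'runs' would be needed for 1x of the 3 Gbp salmon genome. It assumes every run delivers exactly the average figures from the slide, which real runs will not; the function name is made up for the illustration.

```python
import math

READS_PER_RUN = 36_000          # from the slide
AVERAGE_READ_LENGTH_BP = 3_100  # from the slide

bases_per_run = READS_PER_RUN * AVERAGE_READ_LENGTH_BP


def runs_for_coverage(genome_size_bp, target_coverage):
    """Runs needed to reach the target coverage, assuming average runs."""
    return math.ceil(genome_size_bp * target_coverage / bases_per_run)


print(f"{bases_per_run / 1e6:.1f} Mbp per run")
print(runs_for_coverage(3_000_000_000, 1.0), "runs for 1x of a 3 Gbp genome")
```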
PacBio sequencing
Sequencing ‘modes’

SMRTbell template

Standard Sequencing
Large insert sizes
Generates one pass on each molecule sequenced (single pass)
‘Subreads’

Circular Consensus Sequencing
Small insert sizes
Generates multiple passes on each molecule sequenced
PacBio: uses
Long reads: low quality

SMRTbell template

Standard Sequencing
Large insert sizes
Generates one pass on each molecule sequenced (single pass)
85-87% accuracy

Circular Consensus Sequencing
Small insert sizes
Generates multiple passes on each molecule sequenced

Useful for assembly?
Solutions for assembly
Pacbio for salmon and cod
Libraries

Aim for looooong insert sizes

[Figure: SMRTbell template. Standard Sequencing (large insert sizes) generates one pass on each molecule sequenced; Circular Consensus Sequencing (small insert sizes) generates multiple passes.]
Salmon: PacBio reads

104 SMRT Cells
Latest chemistry and enzyme (C2-XL)
By PacBio Menlo Park

Data set 1
  1.1x coverage
  Half of all bases in reads at least 5.5 kbp
  Longest 26.5 kbp

Data set 2
  0.7x coverage
  Half of all bases in reads at least 6 kbp
  Longest 25 kbp
Salmon: PacBio reads



Alignments of at least 1kb to released assembly

[Figure: Alignments binned by % identity. Axes: bin for read accuracy reported in the alignment; portion of the alignments. Also shown: Cumulative Alignment Quantity.]

Figure courtesy of Jason Miller, JCVI, USA
Salmon: PacBio reads
[Figure: PacBio reads from Standard Sequencing (large insert sizes) are mapped to the salmon repeat database and to the assembly scaffolds (contigs and gaps); Repeat copy 1, Repeat copy 2.]
Salmon: repeats
1.6 kb repeats mapped to PacBio reads
[Figure: reads drawn as left flank, repeat, right flank on a scale of 0 to 25000 bp.]
Salmon: repeats
3-7 kb repeats mapped to PacBio reads
[Figure: reads drawn as left flank, repeat, right flank on a scale of 0 to 25000 bp.]
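
A read only helps to resolve a repeat if it carries unique sequence on both sides of it, which is what the left flank and right flank in these plots show. A minimal sketch of that check, assuming the repeat has already been located on the read; the `RepeatHit` class, the 500 bp minimum flank and the three example hits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RepeatHit:
    """Where a repeat was found on one read (coordinates on the read, in bp)."""
    read_length: int
    repeat_start: int  # first base of the repeat, 0-based
    repeat_end: int    # one past the last base of the repeat

    @property
    def left_flank(self) -> int:
        return self.repeat_start

    @property
    def right_flank(self) -> int:
        return self.read_length - self.repeat_end


def spans_repeat(hit: RepeatHit, min_flank: int = 500) -> bool:
    """True if the read has at least `min_flank` bp on both sides of the repeat."""
    return hit.left_flank >= min_flank and hit.right_flank >= min_flank


hits = [
    RepeatHit(read_length=12_000, repeat_start=4_000, repeat_end=9_000),  # spans
    RepeatHit(read_length=6_000, repeat_start=100, repeat_end=5_100),     # left flank too short
    RepeatHit(read_length=8_000, repeat_start=2_000, repeat_end=8_000),   # read ends inside the repeat
]
spanning = [spans_repeat(hit) for hit in hits]
print(spanning)
```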
Salmon: error-correction
PacBioToCA

Jason Miller, JCVI:
“Low fraction of reads recovered”
“Improves contig lengths by enabling new joins”
“Challenge for error-correction: polymorphic repeat copies”

[Figure: Repeat copy 1, Repeat copy 2.]
Salmon: prospect

PacBio reads span even the longest repeats

[Figure: 3-7 kb repeats mapped to PacBio reads (left flank, repeat, right flank); Repeat copy 1, Repeat copy 2.]
Cod: PacBio reads

104 SMRT Cells
Regular C2 chemistry
Univ. of Oslo, Norway

8.1x coverage
Half of all bases in reads at least 4 kbp
Longest 16.5 kbp
Cod: PacBio reads

[Figure: PacBio reads from Standard Sequencing (large insert sizes) are mapped to the assembly scaffolds (contigs and gaps).]
Cod: PacBio results

Mapping to the published genome
11.4 kbp subread
10.6 kbp subread
10.9 kbp subread
Cod: example 1
Assembly
           ...ACACAC                TGTGTG...
                       232 bp gap

                                    TGTGTG...
Cod: example 1


 ACACAC repeat




 232 bp Gap




 TGTGTG repeat
Cod: example 1
Cod: example 1
Cod: example 1
Assembly
                      ...ACACAC     TGTGTG...
                      ...ACACACAC   TGTGTG...
                      ...ACACACAC   TGTGTG...
           Unplaced region   AC     TGTGTG...
Cod: example 2
Assembly
           ...TGTGTG
                       344 bp gap
Cod: example 2


  TGTGTG repeat




       344 bp Gap
Cod: example 2
Cod: example 2
Assembly
           ...TGTGTG
           ...TGTGTG
           ...TGTGTG
           ...TGTGTG

                 Heterozygosity?
Cod: example 3
Assembly




              300 bp misassembly?
Cod: error-correction
P_errorCorrection pipeline from

2.7x + 23x + 24 cpus, 4.5 days, 100 Gb RAM
93% of reads recovered

Alignments of at least 1kb to published assembly
Cod: prospect
PacBio reads span many gaps
PacBio reads may span heterozygous regions

[Figure: Contig 1, Polymorphic contig 2 / Polymorphic contig 3, Contig 4.]
Summary
Assembly is difficult (reads, contigs, scaffolds)
Salmon and cod extra challenging
PacBio has a huge potential

[Figure: 3-7 kb repeats mapped to PacBio reads (left flank, repeat, right flank).]

http://en.wikipedia.org, http://fishandboat.com
Acknowledgements
University of Oslo
Sequencing team NSC
Ole Kristian Tørresen
Kjetill Jakobsen
Sissel Jentoft
Cod genome group

Jason Miller, JCVI
Pacific Biosciences
ICSASG

http://wiki.galaxyproject.org/Events/GCC2013

Editor's Notes

  1. November 2012