An Introduction to NGS
(Next Generation Sequencing)
        François Paillier - 22/02/2011
Plan
  [ Reminder about Sanger Sequencing ]



• NGS Definition
• Overview of NGS technologies
• NGS Applications & examples
• Conclusion

 NOT discussed here : Sequence accuracy, assembly and sampling ; NGS
 data Analysis & BioInformatics tools
A word about Sanger Sequencing
  (First generation sequencing machine  Video)
                                                                         3730xl
Principle (only the tube G + dideoxyG)




                                                                               From gel to
                                                                               capillary




         Still a gold standard, but capillary sequencing has reached its technical
         limits (costs and performance are expected to remain unchanged)
Short Reminder about « Classical » Assembly
                 projects

 Project steps : Sample → Libraries → n Sequencing sub-projects → Assembly
                 → Finishing: Draft (Q40) → Annotation → Annotated Genome

 Each sub-project : Target genome → Cloning → SubTargets (BACs, cosmids, ..)
                    → Clone selection & Sequencing → Assembly

 Other strategy : wgs
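
The « Q40 » finishing target is most naturally read as a Phred quality score, where Q = -10 x log10(error probability). The sketch below converts between the two; the function names are illustrative and not part of any sequencing toolkit.

```python
import math

def phred_to_error(q):
    """Error probability implied by a Phred quality score Q."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality score implied by an error probability p."""
    return -10 * math.log10(p)

# Q40 corresponds to roughly 1 expected error in 10,000 bases.
print(phred_to_error(40))
print(error_to_phred(0.001))
```

Under this reading, a Q40 draft of a 4 Mb genome would be expected to carry on the order of 400 base errors.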
Sequencing, what for ?
                          Assembly projects for example

           In bioinformatics, sequence assembly refers to aligning and merging fragments of
           a much longer DNA sequence in order to reconstruct the original sequence. This
           is needed as DNA sequencing technology cannot read whole genomes in one go,
           but rather small pieces between 20 and 1000 bases, depending on the technology
           used. Typically the short fragments, called reads, result from shotgun sequencing
           genomic DNA or gene transcripts (ESTs).
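
To make "aligning and merging fragments" concrete, here is a toy greedy overlap-merge sketch. It assumes error-free reads from a single strand and uses hypothetical helper names (overlap, greedy_assemble); real assemblers handle sequencing errors, reverse complements and repeats, none of which is attempted here.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the list of contigs left when no overlap >= min_len remains.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # remaining reads do not overlap: separate contigs (gaps)
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Four overlapping reads sampled from the toy genome ATGGCGTGCAAT
print(greedy_assemble(["ATGGCG", "GCGTGC", "GTGCAA", "GCAAT"]))
# -> ['ATGGCGTGCAAT']
```

When two groups of reads share no overlap, the function returns several contigs, which is the situation the gaps in a scaffold represent.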



[Figure: Target genome → Sequencing → reads → Assembly → assembled reads.
 Overlapping reads (example: 4X local coverage) yield a consensus; contigs
 separated by gaps are linked into a scaffold.]
Vocabulary that should be kept in mind
                  in the sequencing field

•   Assembly : The result of clustering and merging sequences based on their
    local similarity
•   Contig : A set of overlapping DNA segments
•   Coverage (in sequencing) : The mean number of times a nucleotide is
    sequenced in a genome (example: 10X coverage)

•   Scaffolds : A series of contigs that are in the right order but not necessarily
    connected in one contiguous stretch
•   Mate pairs : Paired sequences read from the 3′ and 5′ ends of a single
    clone




•   WGS = Whole genome shotgun sequencing strategy
•   ESS = Environmental Shotgun Sequencing
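
The coverage definition above reduces to simple arithmetic: mean coverage = (number of reads x read length) / genome size. A minimal sketch follows; the 4 Mb genome is a hypothetical bacterial-sized example, and the function names are illustrative.

```python
import math

def mean_coverage(n_reads, read_length, genome_size):
    """Mean number of times each base is sequenced."""
    return n_reads * read_length / genome_size

def reads_needed(coverage, read_length, genome_size):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(coverage * genome_size / read_length)

GENOME_SIZE = 4_000_000  # hypothetical 4 Mb genome (assumption)

print(mean_coverage(1_000_000, 400, GENOME_SIZE))  # 100.0, i.e. 100X
print(reads_needed(10, 400, GENOME_SIZE))          # 100000 reads for 10X
```

Mean coverage says nothing about how evenly the reads are spread, so local coverage can be much lower (or zero, at gaps).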
NGS = Next Generation
         Sequencing



    After PCR,
THE new revolution
   in Biology ?
NGS Synonym is : High-throughput Sequencing
                     (HTS)




                                    Third Generation :
                                    NGS = HTS, Single
                                    Molecule Sequencing

                     Second Generation :
                     NGS = Massively
                     Parallel Sequencing
First Generation :
SANGER Sequencing
Overview of current NGS technologies
                 (Second generation sequencing machines)

Year 2005*

                                Roche, 454 GS-FLX
                                Titanium Protocol a must                           Each machine with
                                                                                   different :
 2006                                                                              - Throughput
                                                                                   - Sequence accuracy
                                 Illumina,        GA1 then      GA2
                                                                                   - Data formats (and
                                                                                   programs)
 2007
                                                   Applied Bio.,
                                                   Solid v3


*NGS “proof of principle” was done in 2000 by Lynx Therapeutics, which published and marketed "MPSS", a parallelized,
adapter/ligation-mediated, bead-based sequencing technology, launching "next-generation" sequencing.
Throughput per
Illumina Channel
HOW is it
Possible ? 
NGS Principle

Building sequencing devices at nanoscale

 Polony : Discrete clonal amplifications of a single DNA molecule,
  grown in a gel matrix. The clusters can then be individually
  sequenced, producing short reads. Polony-based sequencing is
  the basis of most second generation sequencers


A typical NGS Workflow is:
1) Library construction
2) Template CLONAL amplification
3) Massively PARALLEL sequencing
High Parallelism is Achieved in
     Polony Sequencing

Sanger                   Polony
Generation of Polony array: DNA
       Beads (454, SOLiD)




DNA Beads are generated using Emulsion PCR
Generation of Polony array: DNA
     Beads (454, SOLiD)




   DNA Beads are placed in wells
Sequencing: Pyrosequencing (454)

                                          DNA Polymerase




« pyrogram » / « Flowgram »
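
A flowgram can be read as follows: nucleotides are flowed over the wells one at a time in a fixed cyclic order, and the light signal of each flow is roughly proportional to the number of identical bases incorporated. The sketch below assumes a T, A, C, G flow order and idealised, already-normalised signals; decode_flowgram is a hypothetical helper, not part of the 454 software.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Turn normalised flow signals into a base sequence.

    Each signal is rounded to the nearest integer, which is taken as the
    homopolymer length for the nucleotide of that flow.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = int(signal + 0.5)  # round half up (signals are >= 0)
        bases.append(nucleotide * run_length)
    return "".join(bases)

# Flows:      T     A     C     G     T     A     C    G
signals = [1.02, 0.05, 0.98, 2.10, 0.03, 3.05, 0.00, 0.95]
print(decode_flowgram(signals))  # TCGGAAAG

# Long homopolymers: a small change in signal flips the call by one base.
print(decode_flowgram([5.45]))   # TTTTT
print(decode_flowgram([5.55]))   # TTTTTT
```

The last two calls illustrate why signals for long homopolymers are hard to resolve, and why the resulting errors are insertions or deletions rather than substitutions.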
454 Process : Emulsion PCR &
       Pyrosequencing




              Titanium =
              Read lengths approx. 400 nt
              1 million reads / Run
               400 Mb / day
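
The Titanium figures are mutually consistent: yield is simply reads x read length. A quick check, with run_yield_mb as an illustrative name:

```python
def run_yield_mb(n_reads, read_length_nt):
    """Total sequenced bases in megabases (1 Mb = 1e6 bases)."""
    return n_reads * read_length_nt / 1e6

# 1 million reads of approx. 400 nt each
print(run_yield_mb(1_000_000, 400))  # 400.0 Mb
```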


              VIDEOs
              About Pyrosequencing 1’53’’: <here>

              Summary about GS Flex 4’34’’: <click
              here>
454 GS FLX titanium



Pros
 - No more cloning step
 - From purified DNA to sequencing
 - Fits on the laboratory bench top / small
 - LONG sequences (400 nt)
 - GS Junior system is not so expensive
 - Capabilities : multiplexing & paired-ends

Cons
 - Sequence accuracy is not so high (especially for homopolymers)
   → Main error type is indel
 - Cost : approx. 20K€ / Gb. Cost per base is cheaper than Sanger
   but still high compared with other next-gen machines

Well fitted to :
 - Prokaryotic genome sequencing
 - RNA-seq
Illumina* : Bridge PCR




                GA2x Version =
                Read lengths
                approx. 100 nt
                240 million reads
                 1500 Mb / day
                 30000 Mb / Run
Generation of Polony array: Bridge-
          PCR (Solexa)




DNA fragments are attached to array and
        used as PCR templates

<Watch VIDEO : Related Links  Video : Genome
    Analyzer workflow  Panel technology>
Illumina Chemistry : 4-color DNA sequencing-by-synthesis using reversible
              terminators with removable fluorescent dyes




                                                                   8
                                                                   Lanes




                                                   A Flow cell
Illumina seq. Accuracy
Illumina Throughput
Illumina



Pros
 - No more cloning step
 - From purified DNA to sequencing
 - Fits on the laboratory bench top / small
 - Good sequence accuracy
 - Capabilities : multiplexing & paired-ends
 - Cost : approx. 2K€ / Gb. Cost per base is cheaper than 454
 - Most widely used NGS platform
 - Requires least DNA

Cons
 - Machine is very expensive
 - Main error type is mismatch
 - Read lengths are still too short: not fitted to big genomes (repeats)
 - Poor coverage of AT-rich regions

Well fitted to :
 - Prokaryotic genome sequencing
 - RNA-seq, ChIP-Seq, Methyl-Seq
SOLiD system : 4-color DNA Sequencing by
                 Ligation




                         SOLiD V3 =
                         Read lengths
                         approx. 50 nt
                         400 million reads
                          1500 Mb / day
                          20000 Mb / Run
                          1500€ / Gb

                         <Watch Video> 4’46’’
Sequencing by ligation rxn: Fluorescently Labeled
             Nucleotides (ABI SOLiD)




Complementary strand elongation: DNA Ligase
Sequencing by ligation ABI SOLiD
Sequencing: Fluorescently Labeled Nucleotides
                (ABI SOLiD)




            5 reading frames, each
             position is read twice
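
Because each position is interrogated twice, SOLiD reads are reported as "colors" that encode pairs of adjacent bases rather than single bases. The sketch below assumes the commonly published two-base scheme, which is equivalent to XOR-ing 2-bit base codes (A=0, C=1, G=2, T=3); the helper names are illustrative and the first base is assumed known (in practice it comes from the adapter).

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def encode_colors(seq):
    """Color for each pair of adjacent bases (XOR of the 2-bit base codes)."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BASES[CODE[bases[-1]] ^ color])
    return "".join(bases)

print(encode_colors("ATGGC"))            # [3, 1, 0, 3]
print(decode_colors("A", [3, 1, 0, 3]))  # ATGGC

# A true SNP (ATGGC -> ATAGC) changes two adjacent colors...
print(encode_colors("ATAGC"))            # [3, 3, 2, 3]
# ...whereas a single miscalled color corrupts every downstream base.
print(decode_colors("A", [3, 2, 0, 3]))  # ATCCG
```

This is the basis of the low error rate attributed to dibase probes: an isolated color change is more likely a measurement error than a real variant. It also hints at why the technology is described as not intuitive.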
Sequencing: Fluorescently Labeled
    Nucleotides (ABI SOLiD)
SOLiD



Pros
 - No more cloning step
 - From purified DNA or RNA to sequencing
 - Fits on the laboratory bench top / small
 - Good sequence accuracy
 - Capabilities : multiplexing & paired-ends
 - Cost : approx. 1.5K€ / Gb. Cost per base is cheaper than Illumina

Cons
 - This technology is NOT intuitive
 - Machine is VERY expensive
 - HUGE amount of data produced (1500 Gb !!)
 - Long run times
 - It has been demonstrated that certain reads don’t match the reference !

Well fitted to :
 - Resequencing
 - RNA-seq, ChIP-Seq, Methyl-Seq
Focusing NGS effort on predefined targets :
« Target Enrichment » Technology (Capture Array)
Focusing NGS effort on predefined targets :
« Target Enrichment » Technology (Capture Beads)
Summary : NGS Workflows




   +/- Target Enrichment Strategy

                                    Source: BCG
Prokaryotic Genome Sequencing
 Project as a mix of NGS technologies




                                         Conclusion :
  - High quality drafts can be produced for small genomes without any Sanger data input.
- We found that 454 GSFLX and Solexa/Illumina show great complementarity in producing
                     large contigs and supercontigs with a low error rate.
NGS Applications
DEEPER insight into biological processes
BROADER sampling of populations (cells, viruses,
Ecosystems…)



   • In different fields…
      – Metagenomics
      – Genomics
      – Transcriptomics
      – proteomics
Genome
  * De Novo Sequencing
  * Targeted Resequencing (SNP, Indel, CNV)
  * Whole Genome Resequencing
  * Metagenome analyses
Transcriptome
  * Gene Expression Profiling
  * Small RNA Analysis
  * Whole Transcriptome Analysis
Epigenome
  * Chromatin Immunoprecipitation Sequencing (ChIP-Seq)
  * Methylation Analysis

…for different purposes…
  - Towards Personalized Medicine
  - Biodiversity assessment
  - De Novo Sequencing of prokaryotic or eukaryotic genomes (or re-sequencing)
  - RNA-Seq → Annotation of eukaryotic genomes
  - SNP calling : identification of mutations
  - ChIP-Seq : identification of DNA/protein interactions
What is the current impact of
                NGS on Biology ?



• Both transcriptomics and genomics can now be
  addressed using one technology with higher
  accuracy and robustness (instead of Sanger
  sequencing + µarrays, for example) (→ Example of RNA-Seq)
• SNP calling can rely on ultra-deep assemblies
• Whole-genome overview of transcription factor
  binding sites
• Biodiversity assessment ( Metagenomics projects)
• And so much more…
About whole-exome sequencing :
 « For the First Time, DNA Sequencing Technology
                Saves A Child's Life »




« Proponents of genetic medicine say DNA sequencing is the future of
medicine and that soon every truly sick person will have his or her genome
sequenced. Critics cite privacy concerns and note that genetic mutations and
variations don’t necessarily lead to medical outcomes. Whatever the
position, it’s hard to argue that this isn’t good news: the first child – plagued
by undiagnosable illness – has been saved by DNA sequencing.
That may be a bit of a strong statement – six-year-old Nicholas Volker is
doing well, though complications could soon arise. But it’s highly likely that
the sequencing of young Nicholas’s genome saved his life. »
<Link> <Article>
                     Mayer et al., Genetics in Medicine • Volume xx, Number xx, 01 2011
What’s Next ?


Second Generation : NGS = Massively Parallel Sequencing (polony sequencing)
  - Roche, 454 GS-FLX Titanium
  - Illumina, GA2
  - Applied BioSys, Solid v3

IonTorrent, PacBio

Third Generation :
  - Single Molecule Sequencing (no bias)
  - Faster
  - Cheaper (or not)
  - 1000€ Human genome ?
Conclusion : impact of NGS
               Global Shift to sequencing-based technologies

 • Great improvements on-going : higher throughput, longer reads
 • Is it the end of µarrays ? Will they become a sub-part of NGS workflows, restricted to
   target enrichment ?
 • Is it the end of forward genetics ? Reverse genetics only ?
 • Biologists' education should integrate NGS knowledge
 • Is it the end of « Big sequencing centers » ? A change in their mission ?


Next bottleneck : BioInformatics


- Storing data is a problem (SRA soon down ?) AND IT network speeds are
FAR too low → very difficult to share NGS data → fridges instead of
disks !?
- Analyzing data is a problem → great improvements, but a lot of work
remains to be done
Thanks
for your attention !
Technology Summary

                     Read length   Sequencing        Throughput   Cost
                                   technology        (per run)    (per Mbp)*
   Sanger            ~800 bp       Sanger            400 kbp      500$
   454               ~400 bp       Polony            500 Mbp      60$
   Solexa/Illumina   75 bp         Polony            20 Gbp       2$
   SOLiD             75 bp         Polony            60 Gbp       2$
   Helicos           30-35 bp      Single molecule   25 Gbp       1$

*Source: Shendure & Ji, Nat Biotech, 2008
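
As a rough worked example, the per-Mbp costs in this table can be turned into the raw sequencing cost of a genome project: cost = genome size x coverage x cost per Mbp. The sketch assumes a human genome of about 3,000 Mbp and a 30X target coverage; both values, and the function name, are assumptions made for illustration and do not come from the slides.

```python
# Cost per Mbp in US dollars, taken from the table above
COST_PER_MBP_USD = {
    "Sanger": 500,
    "454": 60,
    "Solexa/Illumina": 2,
    "SOLiD": 2,
    "Helicos": 1,
}

HUMAN_GENOME_MBP = 3000  # approximate size (assumption)
COVERAGE = 30            # illustrative target coverage (assumption)

def project_cost_usd(genome_size_mbp, coverage, cost_per_mbp):
    """Raw sequencing cost, ignoring labour, instruments and analysis."""
    return genome_size_mbp * coverage * cost_per_mbp

for platform, cost in COST_PER_MBP_USD.items():
    total = project_cost_usd(HUMAN_GENOME_MBP, COVERAGE, cost)
    print(f"{platform:16s} {total:>12,.0f} $")
```

Under these assumptions even the cheapest platform in the table comes out far above the 1000€ human genome mentioned as a third-generation goal.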
NGS Technology Comparison
           ABI SOLiD               Illumina GA               454 Roche FLX
Cost       SOLiD 4: $495k          IIe: $470k                Titanium: $500k
           SOLiD PI: $240k         IIx: $250k
                                   HiSeq: $690k
Quantity   SOLiD 4: 100Gb          IIe: 20 - 38 Gb           450 Mb
of Data    SOLiD PI: 50Gb          IIx: 50 – 95 Gb
per run                            HiSeq: 200Gb +

Run Time   7 Days                  4 Days                    9 Hours

Pros       Low error rate due to   Most widely used          Short run time. Long
           dibase probes           NGS platform.             reads better for de
                                   Requires least DNA        novo sequencing
Cons       Long run times. Has     Least multiplexing        Expensive reagent
           been demonstrated       capability of the 3.      cost. Difficulty
           certain reads don’t     Poor coverage of AT       reading
           match reference         rich regions              homopolymer
                                                             regions
                                                     Source: The University of Western Ontario
