1. NGS Overview Part I:
A Comparison of Next-Generation
Sequencing Platforms.
MIN SOO KIM
APRIL 1, 2013
QUANTITATIVE BIOMEDICAL RESEARCH CENTER
DEPARTMENT OF CLINICAL SCIENCES
UT SOUTHWESTERN
2. OUTLINE
Background.
Common Pipeline.
Library Prep.
Sequencing – Massively Parallel Sequencing.
Bioinformatics - Data Analysis .
Popular Platforms:
Roche 454 , AB SOLiD, Illumina(HiSeq,MiSeq).
Newer Platforms (Third Generation):
Ion Torrent, PacBio RS, Oxford Nanopore.
4. Next Generation Sequencing
Pipeline
DNA Sample NGS Instrument Data
Library Data
Sequencing
Preparation Analysis
5. Library Preparation
DNA samples are randomly fragmented and
platform-specific adaptors are added to the flanking
ends to produce a “library”.
Library is then amplified through PCR. (Platform-
specific amplification e.g. beads or glass)
Amplification Introduces Bias:
Amplification bias against AT, GC rich regions. (corrected by
adding PCR additives.)
Alteration of representational abundances(duplicates).
Important for quantitative applications like RNA-seq.
Overrepresentation of smaller fragments. (corrected by running
fewer PCR cycles.)
6. Massively Parallel Sequencing
In Sanger-sequencing the DNA synthesis and
detection steps are two separate procedures (slow).
Next generation sequencing relies on coupling the
DNA synthesis and detection (sequencing by
synthesis) and multiple sequencing reactions are run
simultaneously (Massively Parallel Sequencing).
For most NGS platforms desynchronization of reads
during the sequencing and detection cycle is the
main cause of sequencing errors and shorter reads.
7. Platform Chemistry Read Run Gb/Run Advantage Disadvantage
Length Time
454 GS Pyro- 500 8 hrs. 0.04 Long Read High error rate
Junior sequencing Length in
(Roche) homopolymer
454 GS Pyro- 700 23 hrs. 0.7 Long Read High error rate
FLX+ sequencing Length in
(Roche) homopolymer
HiSeq Reversible 2*100 2 days 120 High- Short reads
(Illumina) Terminator (rapid (rapid throughput Long run time
mode) mode) / cost (normal mode)
SOLiD Ligation 85 8 days 150 Low Error Short reads
(Life) Rate Long run time
Ion Proton Proton 200 2 hrs. 100 Short Run New*
(Life) Detection times
3000 No PCR
PacBio RS Real-time (up to 20 min 3 Longest High Error
Sequencing 15,000) Read Rate
Length
8. Roche 454 – Library Prep.
Emulsion PCR (emPCR)
1. DNA fragmentations and
adaptor ligation.
2. DNA fragments are added to
an oil mixture containing
millions of beads.
3. Emulsion PCR results in
multiple copies of the
fragment.
4. Beads are deposited on plate
wells ready for sequencing.
11. Illumina – Sequence by Synthesis
1. Add dye-labeled nucleotides.
2. Scan and detect nucleotide
specific fluorescence.
3. Remove 3’ – blocking group
(Reversible termination).
4. Cleave fluorescent group.
5. Rinse and Repeat.
http://hmg.oxfordjournals.org
12. Illumina – Sequencing by Synthesis
Platform Run Yield Error Error
Time (GB) Type Rate
GAIIx 14 days 96 Sub 0.1
HiSeq 10/2 600/120 Sub 0.03*
2000/2500 days
MiSeq 1 day 2 Sub 0.03*
13. Illumina
Advantages
High throughput / cost.
Suitable for a wide rage of applications most notably
whole genome sequencing.
Disadvantages
Substitution error rates (recently improved).
Lagging strand dephasing causes sequence quality
deterioration towards the end of read.
14. SOLiD (Life/AB)
SOLiD: Sequencing by Oligonucleotide Ligation and
Detection.
Library Prep: emPCR
“2-base encoding” – Instead of the typical single
dNTP addition, two base matching probes are used.
(possible 16 probes).
Color Space – four color sequencing encoding
further increases accuracy.
15. SOLiD (Life/AB)
1. Anneal primer and hybridize
probe(8-mer).
2. Ligation and Detection.
3. Cleave fluorescent tail(3-mer).
4. Repeat ligation cycle.
Repeat steps 1-4 with (n-1) primer.
16. SOLiD – Color Space
• 16 possible base combination
are represented by 4 colors.
• All possible sequencing
combination need to be decoded.
18. SOLiD – SNP Detection
Summary: SOLiD has one of the lowest error-rates (~0.01) due to 2-base
encoding. It is however still limited by short read lengths ( 35 bp / 85 bp for PE).
19. Ion Torrent (Life Technologies)
• Similar to pyrosequencing but uses
semiconducting chip to detect dNTP
incorporation.
• The chip measure differences in pH.
• Shown to have problems with
homopolymer reads and coverage
bias with GC-rich regions.
• Ion Proton™ promises higher output
and longer reads.
J. M. Rothberg, et al. Nature (2011) 475:348-352
20. PacBio RS
Single Molecule Sequencing – instead of sequencing
clonally amplified templates from beads (Pyro) or
clusters (Illumina) DNA synthesis is detected on a
single DNA strand.
Zero-mode waveguide (ZMW)
• DNA polymerase is affixed to the bottom
of a tiny hole (~70nm).
• Only the bottom portion of the hole is
illuminated allowing for detection of
incorporation of dye-labeled nucleotide.
21. PacBio RS
Real-time Sequencing – Unlike reversible termination
methods (Illumina) the DNA synthesis process is never
halted. Detection occurs in real-time.
Library Prep.
• DNA template is circularized by the
use of “bell” shaped adapters.
• As long as the polymerase is stable
this allows for continuous sequencing
of both strands.
22. PacBio RS
Advantages
No amplification required.
Extremely long read lengths.
Average 2500 bp. Longest 15,000bp.
Disadvantages
High error rates.
Error rate of ~15% for Indels. 1% Substitutions.
23. Oxford Nanopore
Announced Feb. 2012 at
ABGT conference.
Measure changes in ion
flow through nanopore.
Potential for long read
lengths and short
sequencing times.
http://www2.technologyreview.com
25. Thank you.
Acknowledgements
Dr. Tae Hyun Hwang
References
Mardis, E. R. A decade’s perspective on DNA sequencing technology.
Nature (2011) 470:198 - 203
Metzker, M.L. Sequencing technologies – the next generation. Nature
Review Genetics(2010) 11:31 – 46
N. J. Loman, C. Constantinidou, Jacqueline Z. M. Chan, M. Halachev, M.
Sergeant, C. W. Penn, E. R. Robinson & M. J. Pallen. High-throughput
bacterial genome sequencing: an embarrassment of choice, a world of
opportunity. Nature Review Microbiology 10, 599-606.
Editor's Notes
Thank you for joining us in our tech talk series. Today’s tech talk will be an overview of next generation sequencing platforms.
I will start of by giving a brief history on the origin of next generation sequencing. I will then cover some of the common steps in doing next-gen sequening. We will then cover in more detail the specific technologies behind some of the more popular sequencing platforms we use as well discuss some of the new comers in the sequencing Market. These next generation sequencing technologies have been mostly developed and sold by individual companies. I am however not affiliated with any of them. I will try to present both the advantages and disadvantage of each system.
Although it has become quite common to do nex-gen sequencing studies many biomedical research labs, the technologies behind the data have really only come about in the past decade. The story of next generation sequencing start with the Human Genome Project.2000 First draft of Human Genome Complete. - This set the roadmap for modern biomedical research. - Most of the DNA sequencing for the Human Genome Project was based on Sanger Sequencing. - Sanger Sequencing methods were slow and expensive. Estimated cost of the human genome project is 3 billion dollars and 10 years to generate the first draft of Human Genome. - After the completion of the first Human Genome there was an increased focus on improving the machinery used to generate sequencing data.Introduction of Next Generation Sequencing Instruments. - Next Generation Sequencing refers to the advancement of sequencing technologies that allowed for significant improvements in quantity, speed, and cost of generating sequencing data. - In 2004 first parallelized version of 454 pyrosequencing instrument was commercialize. 6-fold cost reduction compared to capillary based Sanger Sequencing. - Since then a wide range of similar sequencing technologies have been introduced bringing down the cost even further. - We are now getting very close to the $1000 per sample mark for Whole Genome Sequencing.
Single-molecule template sequencing avoids some of these problems by circumventing the amplification step all together. I will go over an example of a platfDNA amplification results in multiple copies of the initial DNA fragment. For most NGS platforms, except for single molecule sequencing instruments, this is necessary because sequencing takes places on all the fragments at the same time. The sequencing signal from all the fragments are read in at once. This is also the main reason why the read lengths of most NGS platforms are short. Incomplete extension or over extension of the sequencing stand can result in strand dephasing which causes noise in the detection signal. Accumulation of dephased strands deceases the signal to noise ratio to a point where the sequence can no longer be read causing the termination of the sequencing read. Real-time sequencers can avoid this problem leading to much longer sequencing reads, but they suffer from other limitations ( which I will go over later. ) ormthat uses single-molecule sequencing.
The main purpose for dna amplification is so that you can sequence all the fragments at once. This is essentially what sets Next generation sequencing apart from Sanger. In Sanger …
SOLiD has lowest error rate. Good for resequencing.Illumina has the highest throughput / cost.PacBio has the longest read length. Up to 15,000 bp. Applications in DeNovo sequencing and structural variation studies.Ion Torrent from Life Technologies has two instruments. The Personal Genome Machine which has been out since 2010. It’s mostly only applicable for small genome sequencing like bacterial genomes. The Ion Proton, which promises much greater throughput and it has been recently made commercially available. There is however limited information on it’s actual performance.
Emulsion PCR Fragment DNA and attache adapters.PCR is carried out in a oil mixture containing beads and reagents.Each of the beads has a small adaptor that is complementary to the adaptor attached to the DNA fragment.Once the DNA is added to the emulsion ideally you get one fragment attached to each bead.You then carry out the PCR reaction which results in multiple clonal DNA fragments on each bead.The beads are then lowered on plates containing millions of wells roughly the size of each bead.
Additional beads are added to each well containing Sulphurulase and Luciferase.Reagents are flowed through the top of the plates for the sequencing reaction to occur. A key step to the process is that only one nucleotide type (ex: cytosine) is added to the plate.If the added nucleotide is complementary to the template strand it is incorporated into the growing strand by the polymerase releasing a pyrophosphate.The pyrophosphate is then picked up by the catalytic enzymes causing a chain reaction which ultamately emits light.The light is then picked up by CCD camera and recoded on a flowgram.The cycle repeated between reagent and washing.A key limitation to this procedure is that if you have a chain of identical nucleotides on the template sequences you will get the addition of multiple nucleotides in a single reaction cycle. The light intensity will be proportional to the number of nucleotides added but the detection limit is around 6 or 7. The camera will therefore not be able to detect homopolymer runs that are longer than 6 or 7 nucleotides.The most common error rate in pyrosequencing is therefore indels which is around 1%. Substitution error rate however are very low which makes this platform great for targeted resequencing and studies where you are looking for single nucleotide substitutions.
Illumina Library Prep is done through a process called bridge amplification.1. PCR is done on glass plate called flow cells.2. DNA is fragmented and ligate adaptors are added to both ends similar to emulsion PCR.3. Single stranded fragments are attached to surface of flowcells. The flowcell contains the complementary primers to the fragment adaptors.4. PCR reaction is carried out which extents the bridge.5. Through the PCR cycle multiple single strand copies are made creating a dense cluster on the flowcell.
Improve chemistry and detection for Illumina HiSeq/MiSeq
Summary – Due to the 2-base encoding system you get very low error rate. (Error rate of around .01)It is still however limited by short reads. (35bp)
Obvious advantage is no need for clonal amplification. Reduces amplification bias.
Indels errors occur when multiple nucleotide are incorporated too fast for the camera to detect.Improvments to the technology primarily have come with advances in it’s chemistry. Primarily it’s polymerase enzyme. Improvements include higher nucleotide incorporation accuracy. Higher tolerance to heat ( longer reads ).