[I0D51A] Bioinformatics: High-Throughput Analysis
 Next-generation sequencing. Part 1: Technologies
Prof Jan Aerts
Faculty of Engineering - ESAT/SCD
jan.aerts@esat.kuleuven.be

TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)




                                                           1
Announcements

May 27th (9am-noon): evaluation


open book




                                  2
Note to self...

Upload s_1_sequence.txt and s_2_sequence.txt to Galaxy first...




                                                                 3
Overview

• linux refresher (6/5)


• next-generation sequencing technologies and applications (6/5)


• sequence mapping (13/5)


• variant calling - SNPs (20/5)


• variant calling - structural variation (20/5)




                                                                   4
Linux Refresher...




                     5
Next-generation sequencing technologies




                                          6
General principle




                    7
Big data...




              8
First vs second generation sequencing
Sanger sequencing (1st gen)   2nd/next gen sequencing




                                                 Shendure & Ji, 2008




                                                                       9
Paired-end sequencing




                        Korbel et al, 2007




                                             10
General approaches

• 2nd generation: clonally amplified single molecules


  • Roche 454 pyrosequencing


  • Illumina Genome Analyzer -> HiSeq: reversible terminator technology


  • ABI SOLiD: ligation-based extension


• Next-next-generation/3rd generation: true single molecule


  • Helicos: HeliScope


  • Pacific Biosciences: SMRT
                                                                          11
Mardis, 2011

               12
Steps


        genome enrichment




                    template preparation



                              sequencing and imaging



                                           data analysis




                                                           13
A. Genome enrichment




                       14
Sequencing costs




                   15
What?

Only sequence relevant parts of the genome instead of whole genome, e.g.:


• specific Mb-scale regions known to be involved in a particular disease (e.g.
  based on GWAS)


• specific candidate genes belonging to a disease pathway


• exome (= all exons)


 => how to isolate these from non-target sequence? “pulldown”




                                                                              16
Pulldown: on-array




                     Turner et al, 2009




                                          17
Pulldown: in-solution




                        Turner et al, 2009




                                             18
Performance metrics

• fold-enrichment: ratio of abundance of target sequences post-enrichment vs
  pre-enrichment


• capture specificity: fraction of sequence reads that map to target


• uniformity: relative abundance of individual targets after enrichment


• completeness: fraction of target bases detectably captured
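
As a rough illustration of the metrics above, the sketch below computes capture specificity, fold-enrichment and completeness from read counts and per-base depths. The function names and all numbers are hypothetical, and the fold-enrichment calculation assumes that reads were spread evenly over the genome before enrichment, so that the pre-enrichment abundance of the target equals target size divided by genome size.

```python
def capture_specificity(on_target_reads, total_mapped_reads):
    """Fraction of mapped reads that fall on the target."""
    return on_target_reads / total_mapped_reads


def fold_enrichment(on_target_reads, total_mapped_reads, target_bp, genome_bp):
    """Target abundance after enrichment relative to before enrichment.

    Assumption: before enrichment, reads are spread evenly over the genome,
    so the expected on-target fraction is target_bp / genome_bp.
    """
    post = on_target_reads / total_mapped_reads
    pre = target_bp / genome_bp
    return post / pre


def completeness(per_base_depth, min_depth=1):
    """Fraction of target bases covered by at least min_depth reads."""
    covered = sum(1 for depth in per_base_depth if depth >= min_depth)
    return covered / len(per_base_depth)


# Made-up example: 30 Mb target in a 3 Gb genome, 60% of reads on target.
print(capture_specificity(600_000, 1_000_000))
print(fold_enrichment(600_000, 1_000_000, 30_000_000, 3_000_000_000))
print(completeness([0, 3, 5, 0, 1]))
```

Uniformity is left out here because it needs a per-target abundance table; one option would be to report the spread of mean depth across targets.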




                                                                           19
B. Template preparation




                          20
Problem: most imaging systems are not designed to detect a single fluorescent
event => need amplified templates


Aim: to produce a representative, non-biased source of nucleic acid material
from the genome under investigation => population of identical templates


Steps:


   1. shear DNA


   2. amplify templates


 Options: emulsion PCR (emPCR) or solid phase amplification

                                                                               21
Amplification by emulsion PCR

emulsion = mixture of two or more immiscible (unblendable) liquids; e.g.
mayonnaise, vinaigrette


emPCR: thousands of microreactors/micro-eppendorfs


one bead + one DNA molecule per microreactor => PCR to 1000s of copies




                                                                           22
Williams et al, 2006




 Metzker et al, 2010


                       23
Solid-phase amplification




                                             http://bit.ly/6JYIUz




http://www.youtube.com/watch?v=77r5p8IBwJk&NR=1
                                                                    Metzker et al, 2010
                                                                                       24
C. Sequencing and imaging




                            25
Sequencing and imaging

Technologies:


1. cyclic reversible termination


2. sequencing by ligation


3. pyrosequencing


4. real-time sequencing




                                   26
Cyclic reversible termination

DNA synthesis is terminated after adding a single nucleotide


start/stop/start/stop/start/stop/...

                            Illumina: 4-colour



sequencing result
                      sequencing steps




                               Metzker et al, 2010
                                                             27
Helicos: 1-colour




         sequencing steps




sequencing result




                                      Metzker et al, 2010




          Metzker et al, 2010



                                                            28
Sequencing by ligation




   http://bit.ly/fPh22X




sequencing steps




                          29
sequencing result




http://bit.ly/fPh22X




                       30
Pyrosequencing




                                  Metzker et al, 2010




            Metzker et al, 2010                         31
Real-time sequencing




                    “ZMW” zero-mode waveguide
   DNA polymerase

                                        “strobe sequencing”


                                                              32
Platform     Run time   Gb/run

Roche 454    8.5 hr     0.45
Illumina     9 days     35
SOLiD        14 days    50
Helicos      8 days     37
PacBio       ?          ?


                                33
Accuracy - base calling error

• base quality drops along read


        Sanger > SOLiD > Illumina > 454 > Helicos


        (“dephasing” within clusters)




• base calling errors




                                                    34
Accuracy - homopolymer runs

 Issue for Roche 454:


   39% of errors occur in homopolymers


      A5 motifs: 3.3% error rate


      A8 motifs: 50% error rate


   Reason: signal intensity is used as a measure of homopolymer length
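
To see where a read is exposed to this problem, one can list its homopolymer runs. The sketch below is only an illustration: the function name and the example sequence are made up, and it simply reports every run of identical bases of at least a given length (e.g. the A5 and A8 motifs mentioned above).

```python
from itertools import groupby


def homopolymer_runs(seq, min_length=2):
    """Return (start, base, length) for each run of identical bases."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((pos, base, length))
        pos += length
    return runs


# Made-up sequence containing an A5 and a T8 run.
print(homopolymer_runs("ACAAAAAGTTTTTTTTC", min_length=5))
# [(2, 'A', 5), (8, 'T', 8)]
```

In pyrosequencing each of these runs is read in a single flow, so the base caller has to decide from one intensity value whether it saw 4, 5 or 6 identical bases.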




                                                                      35
36
Ronaghi, Genome Res 11:3-11 (2001)




                                     37
http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg




                                                       38
Is it 4? Is it 5? Is it 4?




      http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg




                                                             39
Consensus accuracy

Increase accuracy for SNP calling by increasing coverage:


   Illumina: 20X


   SOLiD: 12X


   454: 7.4X


   Sanger: 3X


Factors: raw accuracy + read length


How deep do you have to sequence? => Poisson distribution: “If you sequence at
an average of 10X, how much of the genome will be covered at least 5X?”
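
The question above can be answered with a few lines of code. This is a minimal sketch that assumes per-base depth follows an ideal Poisson distribution (real data are usually more dispersed, so it gives an optimistic estimate); the function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k reads at a base."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def fraction_covered_at_least(min_depth, mean_depth):
    """Expected fraction of the genome covered by >= min_depth reads."""
    below = sum(poisson_pmf(k, mean_depth) for k in range(min_depth))
    return 1.0 - below


# Average depth 10X, required depth 5X: about 97% of the genome.
print(round(fraction_covered_at_least(5, 10), 4))
```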

                                                                                 40
Bentley et al, Nature 456:53-59 (2008)




                                         41
FASTQ file format

Each FASTQ entry consists of four lines:

  1. “@” + identifier
  2. sequence
  3. “+” + identifier (optional)
  4. phred-based quality scores

[Figure: example fasta entries (n=2), example fastq entries (n=2), and the phred
quality score encoding. Source: Wikipedia]
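
The four-line layout and the phred encoding can be made concrete with a small parser. This is a sketch, not a full FASTQ implementation: it assumes strictly four lines per record (no wrapped sequences), the function names are hypothetical, and the example record is made up. The ASCII offset is 33 for Sanger-style files and 64 for Illumina 1.3+ files; a phred score Q corresponds to an error probability of 10^(-Q/10).

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality_string) tuples.

    Assumption: every record is exactly four lines long.
    """
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if not line and not record:
            continue  # skip blank lines between records
        record.append(line)
        if len(record) == 4:
            header, seq, plus, qual = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError("not a four-line FASTQ record: %r" % header)
            yield header[1:], seq, qual
            record = []


def decode_quality(qual, offset=33):
    """Convert a quality string to numeric phred scores."""
    return [ord(char) - offset for char in qual]


def error_probability(phred):
    """Probability that a base call with this phred score is wrong."""
    return 10 ** (-phred / 10)


# Made-up record in Sanger encoding (offset 33).
example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
for name, seq, qual in parse_fastq(example):
    print(name, seq, decode_quality(qual))
# read1 ACGT [40, 40, 20, 0]
```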

                                                                                 42
Sequence quality control

Is this good sequence? (essential!)


E.g.: using FastQC tool (Babraham Institute, UK;
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/)




                                                           43
Sequence quality control
              per base sequence quality
                    good         bad




                                          44
Sequence quality control
              per sequence quality scores
                    good         bad




                                            45
Sequence quality control
              per base sequence content
                   good         bad




                                          46
Sequence quality control
                per base GC content
                  good         bad




                                      47
Sequence quality control
               per sequence GC content
                   good        bad




                                         48
Sequence quality control
                   k-mer content
                  good       bad




                                   49
Intermezzo: Galaxy




                     50
Online genome analysis

http://galaxy.psu.edu/


“Galaxy allows you to do analyses you cannot do anywhere else without the
need to install or download anything. You can analyze multiple alignments,
compare genomic annotations, profile metagenomic samples and much much
more...”




                                                                             51
52
53
Applications of next-generation sequencing




                                             54
Kahvejian et al, 2008


                        55
DNA-seq

ChIP-seq




           RNA-seq




                        Kahvejian et al, 2008


                                                56
identify
                                                            sequence
                                                            variations



                          DNA-seq

            ChIP-seq




                       RNA-seq

 identify
pathogens

                                    Kahvejian et al, 2008


                                                                         57
Exercises




            58
Try to log in to the server mentioned on Toledo with the username and password
provided there.



There are 2 FASTQ files in /mnt/homes/jaerts/: s_1_sequence.txt and
s_2_sequence.txt (= paired ends)



  • How many sequences are in s_1_sequence.txt?


  • What encoding was used for the quality score? Illumina? Sanger?


  • What are the numerical quality scores for the first sequence in
    s_1_sequence.txt (i.e. 7172283/1)?
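
One possible way to approach the first two questions is sketched below. It assumes strictly four lines per record and no blank lines inside the file; the function names and the demo records are made up. The encoding guess is only a heuristic: characters below ASCII 59 cannot occur with an offset of 64, so seeing one points to offset 33 (Sanger), while a file of uniformly high-quality Sanger reads could be misclassified.

```python
import io


def count_records(handle):
    """Number of FASTQ records, assuming four lines per record."""
    n_lines = sum(1 for line in handle if line.strip())
    return n_lines // 4


def guess_offset(handle, max_records=1000):
    """Heuristically guess the quality offset (33 or 64)."""
    lowest = 255
    for i, line in enumerate(handle):
        if i // 4 >= max_records:
            break
        if i % 4 == 3:  # fourth line of each record holds the qualities
            qual = line.rstrip("\n")
            if qual:
                lowest = min(lowest, min(ord(char) for char in qual))
    return 33 if lowest < 59 else 64


# Made-up two-record file held in memory.
demo = "@r1\nACGT\n+\n!!II\n@r2\nTTGA\n+\nIIII\n"
print(count_records(io.StringIO(demo)))  # 2
print(guess_offset(io.StringIO(demo)))   # 33
```

On the server the same functions could be applied to a file opened with open("/mnt/homes/jaerts/s_1_sequence.txt"), re-opening the file for each call.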




                                                                            59
• Create an account on the Galaxy server



• Download s_1_sequence.txt and s_2_sequence.txt from Toledo and upload
  them into Galaxy. These files are also available on the Linux server.



• Have a look at the contents of s_1_sequence.txt.



• Convert quality scores to numeric values for s_1_sequence.txt (“FASTQ
  Groomer”)



• Draw the quality score boxplot for s_1_sequence.txt



• Draw the nucleotide distribution chart for s_1_sequence.txt
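
For comparison with the Galaxy chart, the numbers behind a nucleotide distribution plot can be computed directly. This is a minimal sketch with hypothetical function names and made-up reads; it counts bases per read position and converts the counts to fractions, without drawing anything.

```python
from collections import Counter


def nucleotide_distribution(sequences):
    """Return one Counter of base counts per read position."""
    counts = []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    return counts


def as_fractions(counter):
    """Convert base counts at one position to fractions."""
    total = sum(counter.values())
    return {base: n / total for base, n in counter.items()}


# Made-up reads of unequal length.
dist = nucleotide_distribution(["ACGT", "ACGA", "TCG"])
for position, counter in enumerate(dist, start=1):
    print(position, dict(counter))
```

A strong skew at particular positions is the kind of pattern the "per base sequence content" check is meant to flag.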

                                                                          60
References

Bentley DR et al. Accurate whole human genome sequencing using reversible
terminator chemistry. Nature 456:53-59 (2008)
Kahvejian A, Quackenbush J & Thompson JF. What would you do if you could
sequence everything? Nature Biotechnology 26:1125-1133 (2008)
Korbel JO et al. Paired-end mapping reveals extensive structural variation in the
human genome. Science 318:420-426 (2007)
Mardis ER. A decade’s perspective on DNA sequencing technology. Nature
470:198-203 (2011)
Metzker ML. Sequencing technologies - the next generation. Nature Reviews
Genetics 11:31-46 (2010)
Shendure J & Ji H. Next-generation DNA sequencing. Nature Biotechnology
26:1135-1145 (2008)
Turner EH et al. Methods for genomic partitioning. Annual Review of Genomics
and Human Genetics 10 (2009)

                                                                                61
