Next-generation sequencing course, part 1: technologies

[I0D51A] Bioinformatics: High-Throughput Analysis
Next-generation sequencing. Part 1: Technologies
Prof Jan Aerts
Faculty of Engineering - ESAT/SCD
jan.aerts@esat.kuleuven.be

TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)


Announcements

May 27th (9am-noon): evaluation

open book


Note to self...

Upload s_1_sequence.txt and s_2_sequence.txt to Galaxy first...


Overview

• linux refresher (6/5)

• next-generation sequencing technologies and applications (6/5)

• sequence mapping (13/5)

• variant calling - SNPs (20/5)

• variant calling - structural variation (20/5)


Linux Refresher...


Next-generation sequencing technologies


General principle


First vs second generation sequencing
Sanger sequencing (1st gen) | 2nd/next gen sequencing

Shendure & Ji, 2008


Paired-end sequencing

Korbel et al, 2007


General approaches

• 2nd generation: clonally amplified single molecules

  • Roche 454 pyrosequencing

  • Illumina Genome Analyzer -> HiSeq: reversible terminator technology

  • ABI SOLiD: ligation-based extension

• Next-next-generation/3rd generation: true single molecule

  • Helicos: HeliScope

  • Pacific Biosciences: SMRT

Steps

genome enrichment

template preparation

sequencing and imaging

data analysis


A. Genome enrichment


Sequencing costs


What?

Only sequence relevant parts of the genome instead of whole genome, e.g.:

• specific Mb-scale regions known to be involved in particular disease (e.g.
based on GWAS)

• specific candidate genes belonging to disease pathway

• exome (= all exons)

=> how to isolate these from non-target sequence? “pulldown”


Pulldown: on-array

Turner et al, 2009


Pulldown: in-solution

Turner et al, 2009


Performance metrics

• fold-enrichment: ratio of abundance of target sequences post-enrichment vs
pre-enrichment

• capture specificity: fraction of sequence reads that map to target

• uniformity: relative abundance of individual targets after enrichment

• completeness: fraction of target bases detectably captured
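As a rough illustration, three of the metrics above can be computed from read counts and per-base depth. This is a minimal sketch: the function names, the example numbers, and the assumption that pre-enrichment target abundance equals target size divided by genome size are all illustrative and not taken from Turner et al.

```python
def capture_specificity(on_target_reads, mapped_reads):
    """Fraction of mapped reads that fall on the target."""
    return on_target_reads / mapped_reads


def fold_enrichment(post_fraction, pre_fraction):
    """Ratio of target abundance after vs before enrichment."""
    return post_fraction / pre_fraction


def completeness(target_depths, min_depth=1):
    """Fraction of target bases covered by at least min_depth reads."""
    covered = sum(1 for depth in target_depths if depth >= min_depth)
    return covered / len(target_depths)


# Hypothetical numbers: a 30 Mb target in a 3,000 Mb genome,
# with 6 million of 10 million mapped reads landing on target.
pre = 30 / 3000
post = capture_specificity(6_000_000, 10_000_000)
print("specificity:", post)
print("fold-enrichment:", round(fold_enrichment(post, pre), 1))
print("completeness:", completeness([0, 3, 12, 8, 0, 5]))
```

Uniformity is left out here because it needs a per-target abundance table rather than a single ratio.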


B. Template preparation


Problem: most imaging systems not designed to detect single fluorescent event
=> need amplified templates

Aim: to produce a representative, non-biased source of nucleic acid material
from the genome under investigation => population of identical templates

Steps:

1. shear DNA

2. amplify templates

Options: emulsion PCR (emPCR) or solid phase amplification


Amplification by emulsion PCR

emulsion = mixture of two or more immiscible (unblendable) liquids; e.g.
mayonnaise, vinaigrette

emPCR: thousands of microreactors/micro-eppendorfs

one bead + one DNA molecule per microreactor => PCR to 1000s of copies


Williams et al, 2006

Metzker, 2010


Solid-phase amplification

http://bit.ly/6JYIUz

http://www.youtube.com/watch?v=77r5p8IBwJk&NR=1
Metzker, 2010

C. Sequencing and imaging


Sequencing and imaging

Technologies:

1. cyclic reversible termination

2. sequencing by ligation

3. pyrosequencing

4. real-time sequencing


Cyclic reversible termination

DNA synthesis is terminated after adding single nucleotide

start/stop/start/stop/start/stop/...

Illumina: 4-colour

sequencing result
sequencing steps

Metzker, 2010

Helicos: 1-colour

sequencing steps

sequencing result

Metzker, 2010


Sequencing by ligation

http://bit.ly/fPh22X

sequencing steps


sequencing result

http://bit.ly/fPh22X


Pyrosequencing

Metzker, 2010

Real-time sequencing

“ZMW” zero-mode waveguide
DNA polymerase

“strobe sequencing”


Platform     Run time   Gb/run

Roche 454    8.5 hr     0.45

Illumina     9 days     35

SOLiD        14 days    50

Helicos      8 days     37

PacBio       ?          ?


Accuracy - base calling error

• base quality drops along read (“dephasing” within clusters)

• base calling errors; raw accuracy: Sanger > SOLiD > Illumina > 454 > Helicos


Accuracy - homopolymer runs

Issue for Roche 454:

39% of errors are homopolymers

A5 motifs: 3.3% error rate

A8 motifs: 50% error rate

Reason: use signal intensity as a measure for homopolymer length
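To see why intensity-based calling struggles with long runs, here is a deliberately simplified sketch of turning a flowgram into bases. It assumes normalized intensities (1.0 per incorporated base), a repeating TACG flow order, and plain rounding; the function name and the example values are hypothetical, and real base callers use more elaborate signal models.

```python
def call_flowgram(intensities, flow_order="TACG"):
    """Translate normalized flow intensities into bases by rounding."""
    bases = []
    for i, signal in enumerate(intensities):
        base = flow_order[i % len(flow_order)]
        run_length = int(signal + 0.5)  # nearest integer homopolymer length
        bases.append(base * run_length)
    return "".join(bases)


# Hypothetical flowgram: the sixth flow (A) reads 4.48, right at the
# boundary between a run of four and a run of five.
flows = [1.02, 0.05, 2.10, 0.97, 0.03, 4.48, 0.0, 1.1]
print(call_flowgram(flows))

# A slightly stronger signal in the same flow adds one more A.
flows[5] = 4.52
print(call_flowgram(flows))
```

A difference of 0.04 in signal changes the called run length, and the relative gap between n and n+1 bases shrinks as n grows, which fits the higher error rate for A8 than for A5 motifs.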


Ronaghi, Genome Res 11:3-11 (2001)


http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg


Is it 4? Is it 5? Is it 4?

http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg


Consensus accuracy

Increase accuracy for SNP calling by increasing coverage:

Illumina: 20X

SOLiD: 12X

454: 7.4X

Sanger: 3X

Factors: raw accuracy + read length

How deep do you have to sequence? => Poisson distribution: “If you sequence at
an average of 10X, how much of the genome will be covered at least 5X?”
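The Poisson question can be worked out directly. The sketch below assumes per-base depth follows a Poisson distribution with the average coverage as its mean (an idealization that ignores GC bias and mappability); the function name is illustrative.

```python
import math


def fraction_covered(mean_depth, min_depth):
    """P(depth >= min_depth) when depth ~ Poisson(mean_depth)."""
    p_below = sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - p_below


# Average 10X, at least 5X: roughly 97% of the genome under this model.
print(round(fraction_covered(10, 5), 4))
```

Under the same model, raising the threshold or lowering the average depth quickly reduces the covered fraction, which is one reason platforms with lower raw accuracy need deeper sequencing.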


Bentley et al, Nature 456:53-59 (2008)


FASTQ file format
Each FASTQ entry has four lines:

1. “@” + identifier

2. sequence

3. “+” + identifier (optional)

4. phred-based quality scores

Figures: example fasta entries (n=2); example fastq entries (n=2); phred quality score encoding (Wikipedia)
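The entry layout above maps directly onto a small reader. This is a minimal sketch that assumes strictly four lines per entry (no wrapped sequences) and a Sanger-style offset of 33 by default; the function names and the example read are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality_string) for 4-line FASTQ entries."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the "+" line, optionally repeating the identifier
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string into numerical phred scores."""
    return [ord(char) - offset for char in quality]


# Hypothetical single-entry file held in memory.
example = io.StringIO("@read_1\nGATTACA\n+\nIIIIH#5\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, phred_scores(qual))
```

A phred score Q corresponds to an error probability of 10^(-Q/10), so the "#" in the example (Q = 2) marks a base that is more likely wrong than right.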


Sequence quality control

Is this good sequence? (essential!)

E.g.: using FastQC tool (Babraham Institute, UK; http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/)


per base sequence quality
good bad


per sequence quality scores
good bad


per base sequence content
good bad


per base GC content
good bad


per sequence GC content
good bad


k-mer content
good bad


Intermezzo: Galaxy


Online genome analysis

http://galaxy.psu.edu/

“Galaxy allows you to do analyses you cannot do anywhere else without the
need to install or download anything. You can analyze multiple alignments,
compare genomic annotations, profile metagenomic samples and much much
more...”


Applications of next-generation sequencing


Kahvejian et al, 2008


DNA-seq

ChIP-seq

RNA-seq

identify sequence variations

identify pathogens

Try to log in to the server mentioned on Toledo with the username and password
provided there.

There are 2 FASTQ files in /mnt/homes/jaerts/: s_1_sequence.txt and
s_2_sequence.txt (= paired ends)

• How many sequences are in s_1_sequence.txt?

• What encoding was used for the quality score? Illumina? Sanger?

• What are the numerical quality scores for the first sequence in
s_1_sequence.txt (i.e. 7172283/1)?
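One possible way to approach the first two questions in Python is sketched below. It assumes four lines per FASTQ entry and uses a common heuristic for the encoding: any quality character below ASCII 59 can only occur with an offset of 33 (Sanger), otherwise an offset of 64 (Illumina/Solexa) is guessed. The function names and the in-memory example are hypothetical; the real s_1_sequence.txt is not reproduced here.

```python
import io


def count_records(handle):
    """Count FASTQ entries, assuming exactly four lines per entry."""
    n_lines = sum(1 for line in handle if line.strip())
    return n_lines // 4


def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest quality character."""
    lowest = min(ord(char) for quality in quality_strings for char in quality)
    return 33 if lowest < 59 else 64


# Hypothetical two-entry file held in memory.
fastq_text = "@read_1/1\nACGT\n+\nhhgB\n@read_2/1\nTTGA\n+\nhfBB\n"
print(count_records(io.StringIO(fastq_text)))

# Every fourth line, starting at the fourth, is a quality string.
qualities = fastq_text.splitlines()[3::4]
print(guess_offset(qualities))
```

The guess can be wrong for a file that happens to contain only high Sanger scores, so it is worth checking it against many reads rather than the first few.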


• Create an account on the Galaxy server

• Download s_1_sequence.txt and s_2_sequence.txt from Toledo and upload
them into Galaxy. These files are also available on the linux server.

• Have a look at the contents of s_1_sequence.txt.

• Convert quality scores to numeric values for s_1_sequence.txt (“FASTQ
Groomer”)

• Draw the quality score boxplot for s_1_sequence.txt

• Draw the nucleotide distribution chart for s_1_sequence.txt


References

Bentley DR et al. Accurate whole human genome sequencing using reversible
terminator chemistry. Nature 456: 53-59 (2008)
Kahvejian A, Quackenbush J & Thompson JF. What would you do if you could
sequence everything? Nature Biotechnology 26: 1125-1133 (2008)
Korbel JO et al. Paired-end mapping reveals extensive structural variation in the
human genome. Science 318: 420-426 (2007)
Mardis ER. A decade’s perspective on DNA sequencing technology. Nature
470: 198-203 (2011)
Metzker ML. Sequencing technologies - the next generation. Nature Reviews
Genetics 11: 31-46 (2010)
Shendure J & Ji H. Next-generation DNA sequencing. Nature Biotechnology
26: 1135-1145 (2008)
Turner EH et al. Methods for genomic partitioning. Annual Review of Genomics
and Human Genetics 10 (2009)
