Sequencing

Surya Saha
Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, NY
ss2489@cornell.edu // @SahaSurya
BTI PGRP Summer Internship Program 2014
http://www.acgt.me/blog/2014/3/7/next-generation-sequencing-must-die

Why Sequencing?
• Targeted interrogation of genome
• Economical
• Technological developments
• High-throughput assays
• But requires subsequent validation
7/8/2014 BTI PGRP Summer Internship Program 2014

Sequencing over the Ages
1953: DNA structure discovery
1977: Sanger DNA sequencing by chain-terminating inhibitors
1984: Epstein-Barr virus (170 Kb)
1987: ABI 370 sequencer
1995: Haemophilus influenzae (1.83 Mb)
2001: Homo sapiens (3.0 Gb)
2005-2013 (timeline marks 2005, 2007, 2011, 2012, 2013): 454, Solexa, SOLiD, Illumina, Ion Torrent, PacBio, Illumina HiSeq X, Pinus taeda (24 Gb)
Slide credit: Aureliano Bombarely

First generation sequencing

Sanger method
Frederick Sanger
13 Aug 1918 – 19 Nov 2013
Won the Nobel Prize for Chemistry in 1958 and
1980. Published the dideoxy chain termination
method or “Sanger method” in 1977
http://dailym.ai/1f1XeTB

Sanger method
http://bit.ly/1g6Cudq
http://bit.ly/1lcQO4J

First generation sequencing
• Very high quality sequences (99.999%)
• Very low throughput
Platform                           Run time    Read length  Reads / run  Total nucleotides sequenced  Cost / Mb
Capillary sequencing (ABI 3730xl)  20 min-3 h  400-900 bp   96 or 384    1.9-84 Kb                    $2400
http://bit.ly/1clLps3
http://1.usa.gov/1cLqIRd

Name the specific technology used
to generate the data
– Illumina HiSeq/MiSeq/NextSeq
– Pacific Biosciences RS/RS II
– Ion Torrent Proton/PGM
– SOLiD
http://www.acgt.me/blog/2014/3/10/next-generation-sequencing-must-diepart-2

454 Pyrosequencing
One purified DNA fragment, to one bead, to one read.
http://bit.ly/1ehwxWN
GS FLX Titanium
http://bit.ly/1ehAcEh

Illumina
             MiSeq       NextSeq      HiSeq 2500                           HiSeq X
Output       15 Gb       120 Gb       1000 Gb                              1800 Gb
Reads        25 million  400 million  4 billion                            6 billion
Read length  2x300 bp    2x150 bp     2x125 bp (2x250 bp update mid-2014)  2x150 bp
Cost         $99K        $250K        $740K                                $10M
Source: Illumina

Illumina
http://1.usa.gov/1fP9ybl

Illumina: Moleculo
http://bit.ly/1aEPOBn

Pacific Biosciences SMRT sequencing
Single Molecule Real Time sequencing
http://bit.ly/1naxgTe

Error correction methods
Hierarchical genome-assembly process (HGAP)
PBJelly
English et al., PLoS ONE, 2012

7/8/2014 Centre for Agricultural Bioinformatics, Pusa
Read Lengths

Oxford Nanopore
https://www.nanoporetech.com/
• No data yet??
• Error model
http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion

Others
• Ion Torrent Proton/PGM
• Helicos
• Nabsys
• SOLiD
• ……

Comparison

Next generation sequencing
Platform             Run time  Read length  Quality                      Total nucleotides sequenced  Cost / Mb
454 Pyrosequencing   24 h      700 bp       Q20-Q30                      0.7 Gb                       $10
Illumina MiSeq       27 h      2x250 bp     >Q30                         15 Gb                        $0.15
Illumina HiSeq 2500  11 days   2x125 bp     >Q30                         1000 Gb                      $0.05
Ion Torrent          2 h       400 bp       >Q20                         50 Mb-1 Gb                   $1
Pacific Biosciences  2 h       8.5-20 kb    >Q30 consensus, >Q10 single  400-850 Mb / SMRT cell       $0.33-$1
http://bit.ly/1clLps3
http://1.usa.gov/1cLqIRd
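As a rough worked example of how the cost column in the comparison table translates into a project budget, the sketch below multiplies cost per Mb by genome size and depth of coverage. The 3.0 Gb genome size is the Homo sapiens figure from the timeline. The 30x coverage, the use of the lower bound where the table gives a range, and the function name are assumptions made for illustration.

```python
# Cost per megabase (USD), taken from the comparison table.
# Where the table gives a range, the lower bound is used (an assumption).
COST_PER_MB = {
    "454 Pyrosequencing": 10.0,
    "Illumina MiSeq": 0.15,
    "Illumina HiSeq 2500": 0.05,
    "Ion Torrent": 1.0,
    "Pacific Biosciences": 0.33,
}


def sequencing_cost(genome_size_mb, coverage, cost_per_mb):
    """Estimated cost = genome size (Mb) x depth of coverage x cost per Mb."""
    return genome_size_mb * coverage * cost_per_mb


genome_mb = 3000  # 3.0 Gb genome (Homo sapiens)
coverage = 30     # hypothetical target depth, chosen for illustration

for platform, cost in COST_PER_MB.items():
    total = sequencing_cost(genome_mb, coverage, cost)
    print(f"{platform:22s} ${total:,.0f}")
```

The estimate covers raw sequencing output only and says nothing about library preparation, analysis or storage.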

http://omicsmaps.com/
Next Generation Genomics:
World Map of High-throughput Sequencers

Real cost of Sequencing!!
Sboner et al., Genome Biology, 2011

Library Types
Single end: F
Paired end (PE, 150-800 bp, Fwd: /1, Rev: /2): F R (Illumina)
Mate pair (MP, 2 Kb to 20 Kb): F R (454/Roche), F R (Illumina)

Implications of Choice of Library
Reads -> Consensus sequence (Contig)
Contigs + pair read information -> Scaffold (or Supercontig), with gaps shown as NNNNN
Scaffolds + genetic information (markers) -> Pseudomolecule (or ultracontig)
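The contig-to-scaffold step can be sketched in a few lines: ordered contigs are joined with runs of N whose length reflects the gap size estimated from paired reads. The contig sequences, gap sizes and function name here are made up for illustration, and a real scaffolder also has to infer contig order and orientation, which this sketch takes as given.

```python
def build_scaffold(contigs, gap_sizes):
    """Join ordered contigs into one scaffold, filling each gap with Ns.

    contigs   -- contig sequences, already ordered and oriented
    gap_sizes -- estimated gap length between consecutive contigs
                 (one fewer entry than contigs)
    """
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases spanning the gap
        parts.append(contig)
    return "".join(parts)


scaffold = build_scaffold(["ACGTAC", "GGCTA", "TTAGC"], [5, 2])
print(scaffold)  # ACGTACNNNNNGGCTANNTTAGC
```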

Multiplexing Libraries
Use of different tags (4-6 nucleotides) to identify
different samples in the same lane/sector.
Example tags: AGTCGT, TGAGCA
Tagged libraries are pooled, then sequenced.
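Demultiplexing can be illustrated with a short sketch that bins reads by the tag at the start of each read. The two tags are the ones shown on the slide. The sample names, the assumption that the tag is the first six bases of the read, and exact matching with no mismatch tolerance are simplifications for illustration.

```python
from collections import defaultdict

# Tag -> sample name. Tags come from the slide; sample names are hypothetical.
BARCODES = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}


def demultiplex(reads, barcodes, tag_length=6):
    """Group reads by their leading tag and strip the tag from assigned reads."""
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        sample = barcodes.get(tag, "unassigned")  # exact match only
        # Unassigned reads are kept whole so nothing is lost.
        bins[sample].append(insert if sample != "unassigned" else read)
    return dict(bins)


reads = ["AGTCGTACGGTTA", "TGAGCATTGACCA", "AGTCGTGGGCATT", "CCCCCCATATATA"]
for sample, seqs in demultiplex(reads, BARCODES).items():
    print(sample, seqs)
```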

File Formats
Fasta files:
It is a text-based format for representing either nucleotide sequences or peptide
sequences, in which nucleotides or amino acids are represented using single-letter codes.
-Wikipedia
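A minimal sketch of reading FASTA records follows. It relies on the standard convention that each record starts with a ">" header line, followed by one or more sequence lines. The record names and sequences are invented for the example.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # drop the leading ">"
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


example = """>seq1 hypothetical nucleotide record
ACGTACGT
TTGA
>seq2 hypothetical peptide record
MKVL
""".splitlines()

for name, seq in parse_fasta(example):
    print(name, len(seq))
```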

File Formats
Fastq files:
FASTQ format is a text-based format for storing both a biological sequence (usually
nucleotide sequence) and its corresponding quality scores.
-Wikipedia
• Single-line ID with the at symbol (“@”) in the first column.
• The sequence can span multiple lines after the ID line.
• Single line with the plus symbol (“+”) in the first column, separating the sequence from the quality values.
• The “+” line may optionally repeat the ID.
• Quality values can span multiple lines after the “+” line, but their total length must be identical to the sequence length.
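The FASTQ rules above can be sketched as a small parser. It reads sequence lines up to the "+" line, then reads quality lines until their length matches the sequence length. The read names and data are invented, and the sketch assumes the first sequence line of a record never starts with "+".

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines."""
    it = (line.rstrip("\n") for line in lines)
    for line in it:
        if not line:
            continue  # skip blank lines between records
        if not line.startswith("@"):
            raise ValueError(f"expected '@' ID line, got: {line!r}")
        read_id = line[1:]
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        seq = "".join(seq_parts)
        # Quality lines are consumed by length, not by marker, because
        # '@' and '+' are themselves valid quality characters.
        qual = ""
        while len(qual) < len(seq):
            chunk = next(it, None)
            if chunk is None:
                raise ValueError(f"truncated record: {read_id}")
            qual += chunk
        if len(qual) != len(seq):
            raise ValueError(f"quality and sequence lengths differ: {read_id}")
        yield read_id, seq, qual


example = """@read1
ACGTAC
GTTA
+read1
IIIIII
@@II
@read2
GGCA
+
!!II
""".splitlines()

for read_id, seq, qual in parse_fastq(example):
    print(read_id, seq, qual)
```

Note that the second quality line of read1 starts with "@", which is why a parser cannot treat every "@" in the first column as the start of a new record.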

Quality control: Encoding
Fastq files:
!"#$%&'()*+,-./0123456789 Offset by 33 (Phred+33)
KLMNOPQRSTUVWXYZ[]^_`abcdefgh Offset by 64 (Phred+64)


http://bit.ly/N28yUd
Phred score of a base is:
Q_phred = -10 log10(e)
where e is the estimated probability of a base being wrong
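As a small worked example of the encoding and the Phred formula, the sketch below converts quality characters to scores by subtracting the offset, then inverts the formula to recover the error probability e = 10^(-Q/10). The function names and the sample quality strings are chosen for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Convert an ASCII quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Invert Q = -10 * log10(e) to get e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


quals = phred_scores("!5I")  # Phred+33
print(quals)                 # [0, 20, 40]
print([error_probability(q) for q in quals])
print(phred_scores("h", offset=64))  # Phred+64, [40]
```

A score of Q20 therefore corresponds to a 1 in 100 chance that the base is wrong, and Q30 to 1 in 1000.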

Pre-processing: Tools
Quality check and trimming
• FastQC
• FASTX toolkit
• Trimmomatic
• Scythe
Joining paired-end reads
• fastq-join
• FLASH
• PANDAseq
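To make the trimming step concrete, here is a deliberately simple 3'-end quality trim. It is not the algorithm used by any of the tools listed above. The Q20 threshold, the Phred+33 offset and the function name are assumptions for illustration.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end until one has quality >= min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Qualities: I = Q40, 5 = Q20, # = Q2, ! = Q0 (Phred+33)
seq, qual = trim_3prime("ACGTACGT", "IIIII5#!")
print(seq, qual)  # ACGTAC IIIII5
```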

Pre-processing: Error correction

Thank you!!
