3. First generation sequencing
3/29/2016 BTI Plant Bioinformatics Course 2016 3
Sanger. Annu Rev Biochem. 1988;57:1-28.
Thanks to Nick Loman for the mention
6. Sanger method
Frederick Sanger
13 Aug 1918 – 19 Nov 2013
Won the Nobel Prize in Chemistry in 1958 and 1980.
Published the dideoxy chain termination method, or “Sanger method”, in 1977.
http://dailym.ai/1f1XeTB
8. First generation sequencing
• Very high quality sequences (99.999% accuracy, or Q50)
• Very low throughput
Capillary Sequencing (ABI3730xl)
• Run Time: 20m-3h
• Read Length: 400-900 bp
• Reads / Run: 96 or 384
• Total nucleotides sequenced: 1.9-84 Kb
• Cost / Mb: $2400
http://www.hindawi.com/journals/bmri/2012/251364/tab1/
10. Use the specific technology used
to generate the data
– Illumina Hiseq/Miseq/NextSeq
– Pacific Biosciences RS I/RS II
– Ion Torrent Proton/PGM
– SOLiD
– Oxford Nanopore
http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-diepart-2
11. 454 Pyrosequencing
One purified DNA fragment, to one bead, to one read.
http://www.genengnews.com/
GS FLX Titanium
https://mariamuir.com/wp-content/uploads/2013/04/rip.gif
12. Illumina
Instrument model labels from the slide image: 500, 2500, 3000, 4000
Output: 0.3-15 Gb | 20-120 Gb | 10-1500 Gb | 900-1800 Gb
Number of Reads / Flow cell: 25 Million | 130-400 Million | 300 Million to 2.5 Billion | 3 Billion
Read Length: 2x300 bp | 2x150 bp | 2x250 - 2x125 bp | 2x150 bp
Cost: $99K | $250K | $740K | $10M (10 units)
Source: Illumina
13. Illumina
$1000 human genome??
17. Pacific Biosciences SMRT sequencing
Single Molecule Real Time sequencing
http://smrt.med.cornell.edu/images/pacbio_library_prep-1.gif
Instruments: RS II, Sequel
18. Pacific Biosciences SMRT sequencing
Error correction methods
• Hierarchical genome-assembly process (HGAP)
• PBJelly (English et al., PLOS One. 2012)
26. GemCode
http://mms.businesswire.com/media/20150225005296/en/454639/5/GemCodePlatform.jpg
• Long-read information from short reads using 14 bp barcodes
• Very low input DNA (ng) and 20-minute library preparation time
• 1 ng of DNA is split across 100,000 Gel bead-in-EMulsion partitions (GEMs)
• Chromium instrument for single-cell RNA-seq
29. Human MHC map
• Sample prep requires very high molecular weight DNA
• Nicks at 10 sites / 100 kb
• Individual molecules are assembled into optical maps
• Optical maps and sequences are merged in a hybrid assembly
http://www.bionanogenomics.com/technology/why-genome-mapping/
38. Library Types
Single end
Paired end (PE, 150-300 bp; Fwd: /1, Rev: /2)
Mate pair (MP, 2 Kb to 20 Kb)
[Figure: read orientation diagrams (F = forward read, R = reverse read) for each library type, labeled 454/Roche and Illumina]
Slide credit: Aureliano Bombarely
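The /1 and /2 suffixes listed for paired-end reads can be used to match the two mates of a fragment. The sketch below is only an illustration of that naming convention: the read IDs and the function names are hypothetical, and newer instruments may label mates differently.

```python
def split_pair_suffix(read_id):
    """Return (template_name, mate), where mate is 1, 2, or None if no suffix."""
    if read_id.endswith("/1"):
        return read_id[:-2], 1
    if read_id.endswith("/2"):
        return read_id[:-2], 2
    return read_id, None


def pair_reads(read_ids):
    """Group read IDs by template name; keep only templates with both mates."""
    mates = {}
    for rid in read_ids:
        name, mate = split_pair_suffix(rid)
        if mate is not None:
            mates.setdefault(name, {})[mate] = rid
    return {name: (m[1], m[2]) for name, m in mates.items() if 1 in m and 2 in m}


# Hypothetical read IDs: one complete pair, one orphan mate, one single-end read.
ids = ["frag7/1", "frag9/1", "frag7/2", "single3"]
pairs = pair_reads(ids)
```

Here only frag7 ends up in pairs, because frag9 has no reverse mate and single3 carries no suffix.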
39. Implications of Choice of Library
Slide credit: Aureliano Bombarely
• Reads are assembled into a Consensus sequence (Contig)
• Pair Read information joins contigs into a Scaffold (or Supercontig); gaps are written as N (NNNNN)
• Genetic information (markers) or Optical maps join scaffolds into a Pseudomolecule (or ultracontig)
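The NNNNN stretches in a scaffold stand for gaps between contigs whose order is known from pair read information. A minimal sketch of that idea follows; the function name, the toy contigs, and the default gap length are all assumptions for illustration, not the behavior of any particular assembler.

```python
def build_scaffold(contigs, gap_sizes, unknown_gap=100):
    """Join ordered contigs into one scaffold, padding each gap with 'N'.

    gap_sizes[i] is the estimated distance between contigs[i] and
    contigs[i + 1]; None means the size is unknown, so a placeholder
    run of `unknown_gap` Ns is used instead (an assumed convention).
    """
    if not contigs:
        return ""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        n = unknown_gap if gap is None else max(gap, 1)
        parts.append("N" * n)
        parts.append(contig)
    return "".join(parts)


# Toy contigs; the second gap has no size estimate.
scaffold = build_scaffold(["ACGTAC", "GGCTA", "TTAGC"], [5, None], unknown_gap=3)
```

The same joining step could be repeated one level up, with scaffolds ordered by markers or optical maps, to sketch a pseudomolecule.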
40. Multiplexing Libraries
Different tags (barcodes of 4-6 nucleotides) identify different samples pooled in the same lane/sector.
Slide credit: Aureliano Bombarely
[Figure: fragments from two samples, tagged AGTCGT and TGAGCA, are pooled and go through Sequencing together]
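After sequencing, the pooled reads have to be sorted back into samples by their tags. The sketch below assumes the tag sits at the very start of each read and must match exactly; real pipelines often read the tag separately and tolerate mismatches. The sample names and reads are made up, and only the two tags come from the slide.

```python
def demultiplex(reads, barcodes):
    """Sort reads into per-sample bins by their leading barcode.

    barcodes maps barcode sequence -> sample name. The barcode is trimmed
    from each assigned read; reads with no matching barcode are collected
    under 'undetermined'.
    """
    bins = {sample: [] for sample in barcodes.values()}
    bins["undetermined"] = []
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            bins["undetermined"].append(read)
    return bins


# Tags from the slide; sample names and read sequences are hypothetical.
barcodes = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}
reads = ["AGTCGTTTGACC", "TGAGCAGGATCC", "AGTCGTCCATGA", "CCCCCCAAAAAA"]
bins = demultiplex(reads, barcodes)
```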
42. File Formats
Fasta files:
It is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
-Wikipedia
Slide credit: Aureliano Bombarely
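To make the Fasta description concrete, here is a small reader written from scratch. It relies on the usual convention that each record starts with a ">" header line followed by one or more sequence lines; the function name and the toy records are illustrative, and a library such as Biopython would normally be used instead.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


# Toy input: one nucleotide record wrapped over two lines, one peptide record.
example = io.StringIO(
    ">seq1 toy nucleotide record\nACGTAC\nGTTA\n>pep1 toy peptide record\nMKV\n"
)
records = list(parse_fasta(example))
```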
43. Fastq files:
FASTQ format is a text-based format for storing both a biological sequence (usually
nucleotide sequence) and its corresponding quality scores.
-Wikipedia
• A single ID line with the at symbol (“@”) in the first column.
• The sequence follows the ID line and can span multiple lines.
• A single line with the plus symbol (“+”) in the first column separates the sequence from the quality values.
• The “+” line may optionally repeat the ID.
• Quality values follow the “+” line and can span multiple lines, but their total length must equal the sequence length.
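The rules above can be sketched as a small reader. Because a quality line may itself begin with "@", the sketch decides where a record ends by counting quality characters against the sequence length rather than by looking for the next "@". The function name and the toy records are assumptions for illustration only.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from a FASTQ text handle.

    Sequence and quality may each be wrapped over several lines.
    """
    lines = (ln.rstrip("\r\n") for ln in handle)
    for line in lines:
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError(f"expected '@' ID line, got: {line!r}")
        read_id = line[1:]
        seq_parts = []
        for line in lines:
            if line.startswith("+"):
                break  # the '+' line ends the sequence block
            seq_parts.append(line)
        else:
            raise ValueError(f"record {read_id!r} has no '+' line")
        seq = "".join(seq_parts)
        qual = ""
        while len(qual) < len(seq):
            chunk = next(lines, None)
            if chunk is None:
                raise ValueError(f"record {read_id!r} has truncated quality")
            qual += chunk
        if len(qual) != len(seq):
            raise ValueError(f"record {read_id!r}: quality and sequence lengths differ")
        yield read_id, seq, qual


# Toy input: read1 is wrapped, and its second quality line starts with '@'.
example = io.StringIO("@read1\nACGT\nACGT\n+read1\nIIII\n@III\n@read2\nGG\n+\n#5\n")
records = list(parse_fastq(example))
```

Most current files use exactly four lines per record, in which case the wrapping logic is simply never exercised.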
Slide credit: Aureliano Bombarely
46. Quality control: Encoding
http://en.wikipedia.org/wiki/Phred_quality_score
Phred score of a base is:
Q_phred = -10 × log10(e)
where e is the estimated error probability of a base
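The formula can be checked with a few lines of code, together with the way scores are usually stored as characters in a Fastq quality line. The ASCII offset of 33 used below is an assumption (it is the common Sanger-style encoding, but older files may use 64), and the function names are illustrative.

```python
import math


def phred_from_error(e):
    """Phred score for an estimated per-base error probability e."""
    return -10 * math.log10(e)


def error_from_phred(q):
    """Inverse of the formula: error probability for a Phred score q."""
    return 10 ** (-q / 10)


def decode_quality(qual_string, offset=33):
    """Convert a quality string to Phred scores (offset 33 is assumed)."""
    return [ord(c) - offset for c in qual_string]


q_sanger = phred_from_error(1e-5)   # 99.999% accuracy, as quoted for Sanger reads
e_q20 = error_from_phred(20)        # error probability at Q20
scores = decode_quality("I5#")      # three quality characters
```

With these inputs, an error probability of 1 in 100,000 corresponds to Q50, Q20 corresponds to 1 error in 100, and the characters I, 5 and # decode to 40, 20 and 2.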