Sequencing 2017

Surya Saha
Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, NY
suryasaha@cornell.edu // Twitter:@SahaSurya
BTI Plant Bioinformatics Course 2017
http://www.acgt.me/blog/2015/3/7/next-generation-sequencing-must-die

Earth BioGenome Project (EBP)
3/28/2017 BTI Plant Bioinformatics Course 2017 2
• Complete genome of 1 representative from each eukaryotic family (9000)
• Low coverage sequencing of a species from each of the 150,000 to 200,000 genera
• Budget estimate $4.8 billion
Maybe better to sequence less to higher quality and invest in interpretation???
http://omicsomics.blogspot.com/2017/02/earth-biogenome-project-ill-conceived.html

Sequencing over the Ages
Slide concept: Aureliano Bombarely
1953: DNA structure discovery
1977: Sanger DNA sequencing by chain-terminating inhibitors
1984: Epstein-Barr virus (170 Kb)
1987: ABI 370 sequencer
1995: Haemophilus influenzae (1.83 Mb)
2001: Homo sapiens (3.0 Gb)
2005 to 2007: 454, Solexa, SOLiD, Illumina
2011: Ion Torrent, PacBio
2012 to 2014: Illumina HiSeq X, Pinus taeda (24 Gb), Nanopore MinION
2015: 10X Genomics

First generation sequencing
Sanger. Annu Rev Biochem. 1988;57:1-28.
Thanks to Nick Loman for the mention

Maxam-Gilbert method

Maxam-Gilbert method
http://en.wikipedia.org/wiki/File:Maxam-Gilbert_sequencing_en.svg
https://www.nationaldiagnostics.com/electrophoresis/article/maxam-gilbert-sequencing

Sanger method
Frederick Sanger
13 Aug 1918 – 19 Nov 2013
Won the Nobel Prize in Chemistry in 1958 and 1980.
Published the dideoxy chain-termination method, or “Sanger method”, in 1977.
http://dailym.ai/1f1XeTB

Sanger method
http://en.wikipedia.org/wiki/File:Sanger-sequencing.svg
http://en.wikipedia.org/wiki/File:Radioactive_Fluorescent_Seq.jpg

First generation sequencing
• Very high quality sequences (99.999% or Q50)
• Very low throughput
Platform | Run time | Read length | Reads / run | Total nucleotides sequenced | Cost / Mb
Capillary sequencing (ABI3730xl) | 20 min to 3 h | 400-900 bp | 96 or 384 | 1.9-84 Kb | $2400
http://www.hindawi.com/journals/bmri/2012/251364/tab1/

Next generation sequencing

Name the specific technology that generated the data:
– Illumina Hiseq/Miseq/NextSeq
– Pacific Biosciences RS I/RS II
– Ion Torrent Proton/PGM
– SOLiD
– Oxford Nanopore
http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-diepart-2

454 Pyrosequencing
GS FLX Titanium
One purified DNA fragment, to one bead, to one read.
http://www.genengnews.com/
https://mariamuir.com/wp-content/uploads/2013/04/rip.gif

Illumina
Instrument | Output | Max number of reads / run | Max read length | Cost
MiSeq | 15 Gb | 25 million | 2x300 bp | $99K
NextSeq 500/550 | 120 Gb | 400 million | 2x150 bp | $250K
HiSeq 2500/3000/4000 | 1500 Gb | 5 billion | 2x125 bp to 2x250 bp (RR mode) | $740K
HiSeq X Ten | 1800 Gb | 6 billion | 2x150 bp | $10M (10 units)
Source: Illumina

Illumina
Mardis 2008. Annu. Rev. Genomics Hum. Genet. 2008. 9:387–402

Pacific Biosciences SMRT sequencing
Single Molecule Real Time (SMRT) sequencing
http://smrt.med.cornell.edu/images/pacbio_library_prep-1.gif
RS II
Sequel

Error correction methods
Hierarchical genome-assembly process (HGAP)
PBJelly (English et al., PLOS One. 2012)

Error correction methods
PBcRPipeline

Read Lengths

Oxford Nanopore
https://www.nanoporetech.com/
http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion
http://halegrafx.com/vector-art/free-vector-despicable-me-minions/

http://lab.loman.net/2017/03/09/ultrareads-for-nanopore/
E. coli K-12 MG1655 on a standard
FLO-MIN106 (R9.4) flowcell

Next generation sequencing
Platform | Run time | Read length | Quality | Total nucleotides sequenced | Cost / Mb
454 Pyrosequencing | 24 h | 700 bp | Q20-Q30 | 1 Gb | $10
Illumina MiSeq | 27 h | 2x300 bp | >Q30 | 15 Gb | $0.15
Illumina HiSeq 2500 | 1-10 days | 2x250 bp | >Q30 | 3000 Gb | $0.05
Ion Torrent | 2 h | 400 bp | >Q20 | 50 Mb to 1 Gb | $1
Pacific Biosciences | 30 min to 4 h | 10 kb to >40 kb | >Q50 consensus, >Q10 single | 500-1000 Mb / SMRT cell | $0.13-$0.60
http://www.hindawi.com/journals/bmri/2012/251364/
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431227
Note: Some figures might be out of date
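The cost per megabase figures in the table above can be turned into a rough budget by multiplying genome size, target coverage, and cost per Mb. The sketch below is only an illustration: the 5 Mb genome and 100X coverage are assumed example values, the function name is hypothetical, and the estimate covers sequencing output only, not library preparation, labor, or analysis.

```python
# Cost per Mb (USD) copied from the table above; the slide notes they may be out of date.
COST_PER_MB = {
    "454 Pyrosequencing": 10.0,
    "Illumina MiSeq": 0.15,
    "Illumina HiSeq 2500": 0.05,
    "Ion Torrent": 1.0,
}


def sequencing_cost(genome_size_mb, coverage, cost_per_mb):
    """Estimate sequencing cost: megabases needed times cost per megabase."""
    megabases_needed = genome_size_mb * coverage
    return megabases_needed * cost_per_mb


if __name__ == "__main__":
    # Assumed example: a 5 Mb microbial genome sequenced to 100X coverage.
    for platform, cost_per_mb in COST_PER_MB.items():
        print(f"{platform}: ${sequencing_cost(5, 100, cost_per_mb):,.2f}")
```

With these assumed inputs the MiSeq estimate is about $75, while the same data from 454 would be about $5000.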

Long range scaffolding

Hi-C Crosslinking

GemCode
http://mms.businesswire.com/media/20150225005296/en/454639/5/GemCodePlatform.jpg
• Long read information from short reads using 14 bp barcodes
• Very low input DNA (as low as 0.625 ng)
• Short library preparation time
• 1 ng of DNA is split across 100,000 Gel bead-in-EMulsion partitions (GEMs)
• Chromium instrument for single-cell RNAseq

GemCode
http://mms.businesswire.com/media/20150225005296/en/454639/5/GemCodePlatform.jpg
http://www.nature.com/nbt/journal/v34/n3/full/nbt.3432.html

http://www.bionanogenomics.com/technology/why-genome-mapping/

Human MHC map
• Sample prep requires very high molecular weight DNA
• Nicks at 10 sites / 100kb
• Individual molecules are assembled into optical maps
• Optical maps and sequences are merged in a hybrid assembly
http://www.bionanogenomics.com/technology/why-genome-mapping/

Many Others..
• Ion Torrent Proton/PGM
• Dovetail
• Supporting technologies
– Nabsys
– OpGen
– Fluidigm
http://nextgenseek.com/2012/11/did-you-know-there-are-at-least-14-next-gen-sequence-technology-companies/

Real cost of Sequencing!!
Sboner, Genome Biology, 2011

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2011-12-8-125

So What Sequencer Do I Use??
Microbial genome
• Draft genome
– Illumina Miseq (100-130X)
– Illumina Hiseq (<200X)
• Complete genome
– Pacific Biosciences (80-100X)
• Amplicons (16S, ITS)
– Illumina Miseq
Eukaryotic genome
• De novo assembly
– Pacific Biosciences (70-80X)
– Illumina Hiseq (100X+)
– 10X Genomics
– Bionano
• Genotyping (GBS)
– Illumina Hiseq
• BACs
– Pacific Biosciences
$$$$ ????
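The coverage targets above (for example 100-130X on a MiSeq) translate into a read count through the usual relation coverage = number of reads x read length / genome size. The sketch below is a minimal illustration; the function names and the 5 Mb example genome are assumptions, and for paired-end data the read length passed in is the combined length of both mates.

```python
import math


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads (or read pairs) needed to reach the target coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


def expected_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average coverage obtained from n_reads of a given length."""
    return n_reads * read_length_bp / genome_size_bp


if __name__ == "__main__":
    # Assumed example: 5 Mb microbial genome, 100X target, 2x300 bp pairs (600 bp per pair).
    pairs = reads_needed(5_000_000, 100, 600)
    print(f"Read pairs needed: {pairs}")
    print(f"Coverage check: {expected_coverage(pairs, 600, 5_000_000):.1f}X")
```

This is an average only; real projects usually add a margin for low-quality reads, duplicates, and uneven coverage.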

The diploid reference genome

Cornell Sequencing Core
• Illumina Hiseq 2500 (Rapid run and High output)
• Illumina Miseq
• Illumina Nextseq 500
• 10X Genomics GemCode
http://www.biotech.cornell.edu/brc/genomics/services/price-list#overlay-context=brc/genomics-facility/next-generation-sequencing

Library Types
Single end: F
Paired end (PE, 150-300 bp, Fwd:/1, Rev:/2): F R
Mate pair (MP, 2 Kb to 20 Kb): F R, with read orientation that differs between 454/Roche and Illumina
Slide credit: Aureliano Bombarely

Implications of Choice of Library
• Reads are assembled into a consensus sequence (contig)
• Pair read information links contigs into a scaffold (or supercontig), with gaps shown as NNNNN
• Genetic information (markers) or optical maps order scaffolds into a pseudomolecule (or ultracontig)

Multiplexing Libraries
Use of different tags (4-6 nucleotides) to identify different samples in the same lane/sector.
Example tags: AGTCGT, TGAGCA
Tagged samples are pooled and sequenced together.
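After sequencing, pooled reads are sorted back into samples by their tag (demultiplexing). The sketch below is a simplified illustration that assumes the tag sits at the start of each read and must match exactly; the sample names and the function name are hypothetical, and production pipelines usually read the tag from a separate index read and tolerate mismatches.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_sample, tag_length=6):
    """Group reads by their leading tag and strip the tag from each read."""
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        # Reads with an unknown tag are kept apart rather than discarded.
        sample = tag_to_sample.get(tag, "undetermined")
        bins[sample].append(insert)
    return dict(bins)


if __name__ == "__main__":
    # The two tags shown on the slide, mapped to hypothetical sample names.
    tags = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}
    pooled = ["AGTCGTACGTTT", "TGAGCAGGGCCC", "AGTCGTTTTAAA", "NNNNNNACGT"]
    for sample, inserts in demultiplex(pooled, tags).items():
        print(sample, inserts)
```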

Data!!

File Formats
Fasta files:
It is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
-Wikipedia
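A FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below is a minimal reader written for illustration only; the function name is hypothetical, and for real work an established library such as Biopython is the safer choice.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    example = [">seq1 example record", "ACGTAC", "GTTA", ">seq2", "MKV"]
    for name, sequence in parse_fasta(example):
        print(name, len(sequence), sequence)
```

Because it accepts any iterable of lines, the same function works on an open file handle.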

File Formats
Fastq files:
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
-Wikipedia
• Single line ID with at symbol (“@”) in the first column.
• Sequences can be in multiple lines after the ID line.
• Single line with plus symbol (“+”) in the first column to mark the start of the quality values.
• The “+” line may repeat the ID.
• Quality values can be in multiple lines after the “+” line, but their total length is identical to the sequence length.
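The FASTQ rules listed on this slide can be sketched as a small reader. Because "@" is also a valid quality character, the sketch reads quality lines until their length matches the sequence instead of looking for the next "@". The function name is hypothetical and this is an illustration, not a replacement for an established parser.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' ID line, got {header!r}")
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break  # the + line ends the sequence block
            seq_parts.append(line)
        sequence = "".join(seq_parts)
        quality = ""
        while len(quality) < len(sequence):
            chunk = next(it, None)
            if chunk is None:
                raise ValueError(f"truncated quality for {header[1:]!r}")
            quality += chunk
        if len(quality) != len(sequence):
            raise ValueError(f"quality/sequence length mismatch for {header[1:]!r}")
        yield header[1:], sequence, quality


if __name__ == "__main__":
    example = ["@r1", "ACGT", "AC", "+", "IIII", "@I", "@r2", "GG", "+r2", "@!"]
    for record in parse_fastq(example):
        print(record)
```

The example input includes a quality line that begins with "@", which a reader that only looks for "@" at line start would misread as a new record.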

Quality control: Encoding
Fastq files:
!"#$%&'()*+,-./0123456789 Offset by 33 (Phred+33)
KLMNOPQRSTUVWXYZ[\]^_`abcdefgh Offset by 64 (Phred+64)
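Decoding a quality string means subtracting the offset from the ASCII code of each character. The sketch below shows this for both offsets and adds a simple heuristic for guessing the offset; the function names are hypothetical, and the heuristic is an assumption that can misjudge Phred+33 reads containing only high quality values.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer quality scores."""
    return [ord(ch) - offset for ch in quality_string]


def guess_offset(quality_string):
    """Guess the encoding offset from the lowest character seen.

    Characters below ';' (ASCII 59) do not occur in the +64 encodings,
    so their presence points to Phred+33. Otherwise 64 is assumed.
    """
    lowest = min(ord(ch) for ch in quality_string)
    return 33 if lowest < 59 else 64


if __name__ == "__main__":
    print(decode_quality("!+5I"))          # Phred+33 string
    print(decode_quality("hh", offset=64))  # Phred+64 string
    print(guess_offset("!+5I"), guess_offset("KLMh"))
```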

http://en.wikipedia.org/wiki/Phred_quality_score
Phred score of a base is:
Qphred = -10 log10 (e)
where e is the estimated error probability of a base
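The formula can be applied in both directions: from an error probability to a Phred score, and from a score back to a probability. The sketch below uses hypothetical function names and is meant only as a worked illustration of the formula.

```python
import math


def phred_score(error_probability):
    """Q = -10 * log10(e), where e is the estimated error probability."""
    return -10 * math.log10(error_probability)


def error_probability(q):
    """Inverse of the Phred formula: e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in (10, 20, 30, 50):
        e = error_probability(q)
        print(f"Q{q}: error probability {e:g}, accuracy {100 * (1 - e):.3f}%")
```

For example, Q30 corresponds to 1 error in 1000 bases (99.9% accuracy) and Q50 to 1 error in 100,000 bases (99.999% accuracy).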

Pre-processing: Tools
Trimming
• FastQC
• FASTX toolkit
• Trimmomatic
• Scythe
Joining paired-end reads
• fastq-join
• FLASH
• PANDAseq

Thank you!!
