Sequencing, Genome Assembly and the SGN Platform

Surya Saha, Ph.D.
Cornell University & Boyce Thompson Institute
suryasaha@cornell.edu @SahaSurya
Centre for Agricultural Bioinformatics
Pusa, New Delhi
June 13, 2014
Slides: http://bit.ly/CABin_Pusa_2014
http://www.acgt.me/blog/2014/3/7/next-generation-sequencing-must-die
Genome Assembly
Jason Chin http://www.bit.ly/SZPKIG

6/15/2014 Centre for Agricultural Bioinformatics, Pusa 2
You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, live-blog, or post video of;
This presentation. Provided that:
You attribute the work to its author and respect the rights
and licenses associated with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero. Social Media Icons adapted with
permission from originals by Christopher Ross. Original images are available under GPL at
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites

Sequencing

Sequencing over the Ages
1953: DNA structure discovery
1977: Sanger DNA sequencing by chain-terminating inhibitors
1984: Epstein-Barr virus (170 Kb)
1987: ABI 370 sequencer
1995: Haemophilus influenzae (1.83 Mb)
2001: Homo sapiens (3.0 Gb)
2005 to 2007: 454, Solexa (Illumina), SOLiD
2011: Ion Torrent, PacBio
2014: Pinus taeda (24 Gb), Illumina HiSeq X, MinION
Slide credit: Aureliano Bombarely
The Next Generation

It's all about the $£€¥
http://www.genome.gov/sequencingcosts/

First generation sequencing

Sanger method
Frederick Sanger
13 Aug 1918 – 19 Nov 2013
Won the Nobel Prize for Chemistry in 1958 and
1980. Published the dideoxy chain termination
method or “Sanger method” in 1977
http://dailym.ai/1f1XeTB

Sanger method
http://bit.ly/1g6Cudq
http://bit.ly/1lcQO4J

First generation sequencing
• Very high quality sequences (99.999%)
• Very low throughput
Platform                            Run time     Read length   Reads / run   Total nucleotides sequenced   Cost / Mb
Capillary sequencing (ABI 3730xl)   20 min-3 h   400-900 bp    96 or 384     1.9-84 Kb                     $2400
http://bit.ly/1clLps3
http://1.usa.gov/1cLqIRd

Name the specific technology used
to generate the data
– Illumina HiSeq/MiSeq/NextSeq
– Pacific Biosciences RS I/RS II
– Ion Torrent Proton/PGM
– SOLiD
– 454
http://www.acgt.me/blog/2014/3/10/next-generation-sequencing-must-diepart-2

454 Pyrosequencing
One purified DNA fragment, to one bead, to one read.
http://bit.ly/1ehwxWN
GS FLX Titanium
http://bit.ly/1ehAcEh

Illumina
Output            15 Gb        120 Gb        1000 Gb                            1800 Gb
Number of reads   25 million   400 million   4 billion                          6 billion
Read length       2x300 bp     2x150 bp      2x125 bp (2x250 update mid-2014)   2x150 bp
Cost              $99K         $250K         $740K                              $10M
Source: Illumina
$1000 human genome??

Illumina
http://1.usa.gov/1fP9ybl

Illumina: Moleculo
http://bit.ly/1aEPOBn

Pacific Biosciences SMRT sequencing
Single Molecule, Real-Time (SMRT) sequencing
http://bit.ly/1naxgTe

Error correction methods
Hierarchical genome-assembly process (HGAP)
PBJelly
English et al., PLoS ONE, 2012

Read Lengths

Oxford Nanopore
https://www.nanoporetech.com/
• No data yet??
• Error model
http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion

Others
• Ion Torrent Proton/PGM
• Nabsys
• SOLiD

Comparison

Next generation sequencing
Platform              Run time   Read length   Quality                       Total nucleotides sequenced   Cost / Mb
454 Pyrosequencing    24 h       700 bp        Q20-Q30                       0.7 Gb                        $10
Illumina MiSeq        27 h       2x250 bp      >Q30                          15 Gb                         $0.15
Illumina HiSeq 2500   11 days    2x125 bp      >Q30                          1000 Gb                       $0.05
Ion Torrent           2 h        400 bp        >Q20                          50 Mb-1 Gb                    $1
Pacific Biosciences   2 h        10-20 kb      >Q30 consensus, >Q10 single   400-800 Mb / SMRT cell        $0.33-$1
http://bit.ly/1clLps3
http://1.usa.gov/1cLqIRd

http://omicsmaps.com/
Next Generation Genomics:
World Map of High-throughput Sequencers

http://bit.ly/18pfUId

Real cost of Sequencing!!
Sboner et al., Genome Biology, 2011

Library Types
Single end
Paired end (PE, 150-800 bp, Fwd: /1, Rev: /2)
Mate pair (MP, 2 Kb to 20 Kb)
[Diagram: forward (F) and reverse (R) read orientation for each library type, with mate-pair layouts labelled 454/Roche and Illumina]
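
The /1 and /2 suffixes noted for paired-end reads can be used to match mates by name. A minimal sketch, assuming read names end in /1 (forward) or /2 (reverse); the function name and the read names are made up for illustration.

```python
from collections import defaultdict


def pair_reads(read_names):
    """Group read names into (forward, reverse) pairs using /1 and /2 suffixes.

    Returns (pairs, orphans): pairs maps a fragment name to its two mates,
    orphans lists fragments for which only one mate was seen.
    """
    mates = defaultdict(dict)
    for name in read_names:
        fragment, sep, suffix = name.rpartition("/")
        if sep and suffix in ("1", "2"):
            mates[fragment][suffix] = name
    pairs = {f: (m["1"], m["2"]) for f, m in mates.items() if len(m) == 2}
    orphans = sorted(f for f, m in mates.items() if len(m) == 1)
    return pairs, orphans


# Hypothetical read names, for illustration only.
names = ["frag_a/1", "frag_b/1", "frag_a/2", "frag_c/2"]
pairs, orphans = pair_reads(names)
print(pairs)    # {'frag_a': ('frag_a/1', 'frag_a/2')}
print(orphans)  # ['frag_b', 'frag_c']
```

Reads whose mate is missing are reported separately, since many tools expect both files of a paired-end library to stay in sync.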

Implications of Choice of Library
Reads -> Consensus sequence (Contig)
Contigs + paired-read information -> Scaffold (or Supercontig), gaps shown as NNNNN
Scaffolds + genetic information (markers) -> Pseudomolecule (or ultracontig)
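
The step from contigs to a scaffold can be sketched as joining ordered contigs with runs of N that stand for the gaps estimated from paired reads. A minimal sketch, assuming the contig order and gap sizes are already known; the function name, sequences, and gap sizes are illustrative only.

```python
def build_scaffold(contigs, gap_sizes):
    """Join ordered contigs into one scaffold, filling each gap with N's.

    contigs:   contig sequences in scaffold order
    gap_sizes: estimated gap length between consecutive contigs
               (one fewer entry than contigs)
    """
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each contig pair")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)
        parts.append(contig)
    return "".join(parts)


# Hypothetical contigs and gap estimates.
scaffold = build_scaffold(["ACGTAC", "GGCTA", "TTAGC"], [5, 2])
print(scaffold)  # ACGTACNNNNNGGCTANNTTAGC
```

Larger insert sizes (mate pairs) let a scaffold span longer gaps and repeats than paired-end reads can.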

Quality control: Encoding
http://bit.ly/N28yUd
Phred score of a base is:
Qphred = -10 log10 (e)
where e is the estimated probability of a base
being incorrect
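
The Phred formula and the quality encoding can be tried out in a few lines. A minimal sketch, assuming the quality string uses the Phred+33 ASCII offset (Sanger / Illumina 1.8+); the function names and the example quality string are made up for illustration.

```python
import math


def phred_from_error(e):
    """Phred quality for an estimated error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)


def error_from_phred(q):
    """Inverse: error probability for a Phred quality q."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores.

    offset=33 is the Sanger / Illumina 1.8+ encoding; older Illumina
    pipelines (1.3 to 1.7) used offset=64.
    """
    return [ord(ch) - offset for ch in quality_string]


print(round(phred_from_error(0.001)))  # 30
print(error_from_phred(20))            # 0.01
print(decode_quality("II5+!"))         # [40, 40, 20, 10, 0]
```

So Q20 means 1 error in 100 bases and Q30 means 1 error in 1000 bases.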

Genome Assembly

Whole Genome Shotgun Sequencing
Slide credit: cbcb.umd.edu

Genome Sequencing Strategies

Genome Sequencing Strategies
International Human Genome Sequencing Consortium 2001
Overlap Layout Consensus
http://contig.wordpress.com/
cbcb.umd.edu
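
The overlap step of Overlap Layout Consensus can be sketched as an all-against-all search for suffix-prefix matches between reads. This is a naive illustration that assumes error-free reads and exact matches; real assemblers tolerate sequencing errors and use indexes to avoid comparing every pair. Function names and reads are made up.

```python
def longest_overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_length exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_length - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads, min_length=3):
    """Map each ordered read pair (a, b) with an exact overlap to its length."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                length = longest_overlap(a, b, min_length)
                if length:
                    edges[(a, b)] = length
    return edges


# Hypothetical error-free reads.
reads = ["ACGTTG", "TTGCAA", "CAAGGT"]
print(overlap_graph(reads))
# {('ACGTTG', 'TTGCAA'): 3, ('TTGCAA', 'CAAGGT'): 3}
```

The layout step would then order the reads along these edges, and the consensus step would merge them into a contig.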

De Bruijn Graph
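
A de Bruijn graph breaks reads into k-mers and links each k-mer's prefix to its suffix. A minimal sketch, assuming error-free reads and ignoring reverse complements; the function name, reads, and k are chosen for illustration only.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Each k-mer adds one edge from its first k-1 bases to its last k-1 bases,
    so repeated k-mers show up as repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical overlapping reads.
graph = de_bruijn_graph(["ACGTC", "CGTCA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
# AC -> ['CG']
# CG -> ['GT', 'GT']
# GT -> ['TC', 'TC']
# TC -> ['CA']
```

Walking the edges from AC to CA spells ACGTCA, the sequence both reads came from. Unlike overlap-based assembly, no read-to-read comparison is needed.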

Ingredient for a Good Assembly
Slide credit: Mike Schatz

Bird Snake

The BEST?? Genome Assembler for YOU
The best assembler is the one that:
• You have the expertise to install and run
• You have suitable infrastructure (CPU & RAM) to run
• You have sufficient time to run
• Is designed to work with the specific mix of NGS data that you have generated
• Best addresses what you want to get out of a genome assembly (bigger overall assembly, more genes, most accuracy, longer scaffolds, most resolution of haplotypes, most tolerant of repeats, etc.)
http://haldanessieve.org/2013/01/28/our-paper-making-pizzas-and-genome-assemblies/

Which technology to use??
• Microbial genomes
• Eukaryotic genomes
• Resequencing genomes
• RNAseq and other XXXseq methods
http://bit.ly/1ko9Kgh

SOL Genomics Network

The SGN Team!!
Surya Saha, Tom Fisher-York, Hartmut Foerster, Suzy Strickler, Jeremy Edwards,
Noe Fernandez, Naama Menda, Aure Bombarely, Aimin Yan, Isaak Tecle

SGN Website
http://solgenomics.net

Main web page (front page):
WEB ICONS
TOOL BAR

Main web page (front page):
TOOL BAR
(MENUS)

Community Data Curation
But the DATA can also be edited
Locus / Locus Editor / Data

You need
• An SGN account
• Submitter / Locus Editor privileges, activated by an SGN curator
Locus / Locus Editor / Data

Tools

Genome Browser: GBrowse

Genome Browser: JBrowse

CassavaBase
http://cassavabase.org/
Slide credit: Jeremy Edwards

NextGen Cassava Project
● Project: Adapt SGN database for Cassava Breeding
● Goal: Apply Genomic Selection to cassava breeding
● Predict breeding values from genotype information
● Shorten the breeding cycle
● Massive amounts of genotypic data (GBS)
● Phenotypic data
● Data management challenge
● Improve flowering
● http://nextgencassava.org

SGN/Cassavabase behind the scenes
● Perl/Catalyst MVC Framework
● PostgreSQL Database
● Generic Model Organism Database (GMOD)
– Chado relational database schema
– GBrowse
– JBrowse
● R
– Experimental design
– QTL mapping
– Genomic selection

Objectives
Provide cassava breeders and researchers access
to data and tools in a centralized, user-friendly
and reliable database.
– Improve partner breeding program information
tracking
– Streamline management of genotypic and
phenotypic data
– Pipeline genotypic and phenotypic data through
Genomic Selection prediction analyses

Genomic Selection
The 'training population' is genotyped and phenotyped to 'train'
the genomic selection (GS) prediction model. Genotypic
information from the breeding material is then fed into the
model to calculate genomic estimated breeding values (GEBV)
for these lines. From Heffner et al. 2009 Crop Sci. 49:1–12
Information from a majority of lines in the breeding population (the training set) is used to create the
prediction model. The model is then used to predict the phenotypes of the remaining lines (the validation
set), using genotypic information only. The results from the model are compared to the actual data to give
the prediction accuracy. Image courtesy of Martha Hamblin, Cornell University
Flow diagram of a genomic selection breeding program.
Breeding cycle time is shortened by removing phenotypic
evaluation of lines before selection as parents for the next
cycle. From Heffner et al. 2009 Crop Sci. 49:1–12
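
The validation step described above can be sketched in a few lines. This sketch assumes the marker effects have already been estimated from the training population (the model fitting itself is not shown), computes each GEBV as the sum of allele dosage times marker effect, and reports prediction accuracy as the Pearson correlation between GEBVs and observed phenotypes. All names and numbers are hypothetical.

```python
import math


def gebv(genotype, marker_effects):
    """Genomic estimated breeding value: sum of allele dosage x marker effect.

    genotype:       allele dosage per marker (0, 1 or 2 copies)
    marker_effects: effects assumed to come from a model trained elsewhere
    """
    return sum(g * e for g, e in zip(genotype, marker_effects))


def pearson(xs, ys):
    """Pearson correlation, used here as the prediction accuracy."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical numbers, for illustration only.
marker_effects = [0.5, -0.2, 1.0]
validation_genotypes = [[2, 0, 1], [0, 1, 0], [1, 2, 2], [0, 0, 1]]
observed_phenotypes = [2.1, -0.1, 2.0, 1.2]

predicted = [gebv(g, marker_effects) for g in validation_genotypes]
accuracy = pearson(predicted, observed_phenotypes)
print(predicted)
print(round(accuracy, 2))
```

In a breeding program the same GEBV calculation is applied to lines that have been genotyped but not phenotyped, which is what shortens the cycle.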

Data collection in the field
● Android tablets
● Field book app
– Jesse Poland's group at
USDA-ARS / Kansas
State University

Cassava Trait Ontology
Kulakow et al. 2011
● Standard terminology
● Facilitate the sharing of information
● Allow users to query keywords related to traits

Position available at Solgenomics
Cassavabase project
Plant Breeding + Bioinformatician
● Familiar with breeding
● Programming in Perl, R, SQL, Hadoop
● Linux
● Africa
● Genius
http://www.cassavabase.org/forum/posts.pl?topic_id=9

Thank you!!
Questions??
