SlideShare a Scribd company logo
1 of 64
Download to read offline
Insights into Plant Genome Sequence Assembly.
by
Aureliano Bombarely
ab782@cornell.edu
1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
1. A brief history of the sequence assembly.
http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html
Genome = N x Sequence of DNA
Gentile F. et al. Direct Imaging of DNA Fibers:TheVisage of Double Helix
Nano Lett., 2012, 12 (12), pp 6453–6458
DNA
sequencing
???
1. A brief history of the sequence assembly.
DNA Sequencing:
“Process of determining the precise order of nucleotides within a DNA molecule.”
-Wikipedia
ddATP
ddGTP
ddCTP
ddTTP
Taq-Polymerase
STOP
time
2) Chromatographic
Separation
GTCACCCTGAAT
1) PCR with ddNTPs
3) Chromatogram Read
DNA Sanger Sequencing
1. A brief history of the sequence assembly.
Gentile F. et al. Direct Imaging of DNA Fibers:TheVisage of Double Helix
Nano Lett., 2012, 12 (12), pp 6453–6458
ACGCGGTGACGTTGTCGA
ACGCGTTTTGTTGTGG
ACGATTTAAATGACGTTGTCGA
ACGCGGTGACGTTGTCGA
ACGCGTTTTGTTGTGG
ACCCCGACGTTGTCGA
DNA
sequencing
ACGCGTTTTGTTGTGG
ACCCCTGGGGGGGTTGTCGA
Fragments
Sequence
assembly
ACGCGTTTTGTTGTGGTGGCCACACCACGCAGTGACGGAGATAACGGCGAGAGCATGGACGGAGGATGAGGATGG
1. A brief history of the sequence assembly.
Sequence assembly
=
Resolve a puzzle and rebuild a DNA sequence from
its pieces (fragments)
Image courtesy of iStock photo
1. A brief history of the sequence assembly.
Resolve a puzzle and rebuild a DNA sequence from
its pieces (fragments)
BACs BAC-by-BAC sequencing
Whole Genome Shotgun (WGS) sequencing
1. A brief history of the sequence assembly.
MS2Bacteriophage(3.658Kb)1977
1978
1979
1980
1981
1982
1983
Epstein-BarrVirus(170Kb)1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
Haemophilusinfluenzae(1.83Mb)1995
Saccharomycescerevisiae(12.1Mb)1996
1997
Caenorhabditiselegans(100Mb)1998
1999
Arabidopsisthaliana(157Mb)2000
Homosapiens(3.2Gb)2001
Oryzasativa(420Mb)2002
2003
2004
2005
Populustrichocarpa(550Mb)2006
Vitisvinifera(487Mb)2007
Physcomitrela(480Mb);Caricapapaya(372Mb)2008
Sorghumbicolor(730Mb);ZeaMays(2.3Gb);Cucumissativus(367Mb)2009
Ectocarpus(214Mb);Malusxdomestica(742Mb);Glycinemax(1.1Gb)2010
Pigeonpea,Potato,Cannabis,A.lyrata,Cacao,Strawberry,Medicago2011
Mei,Benthamiana,Tomato,Setaria,Melon,Flax,T.salsuginea,Banana,Cotton,Orange,Pear2012
Sanger Sequencing
454
Solexa / Illumina
SOLiD
IonT
PB
1. A brief history of the sequence assembly.
1977MS2Bacteriophage(3.658Kb)1977
N
O
SO
FT
W
A
R
E
U
SED
1. A brief history of the sequence assembly.
Haemophilusinfluenzae(1.83Mb)1995
Software:
TIGR ASSEMBLER
Hardware:
SPARCenter 2000 (512 Mb RAM)
1. A brief history of the sequence assembly.
Haemophilusinfluenzae(1.83Mb)1995
Software:
TIGR ASSEMBLER Smith-Waterman alignments
Sutton GG. et al.TIGR Assembler:A New Tool to Assembly Large Shotgun Sequencing Projects
Genome Science and Technology, 1995,1:9-19
O
V
ER
LA
PPIN
G
M
ET
H
O
D
O
LO
G
Y
1. A brief history of the sequence assembly.
http://www.ebi.ac.uk/training/online/course/nucleotide-sequence-data-resources-ebi/what-ena/how-sequence-assembled
1. A brief history of the sequence assembly.
Homosapiens(3.2Gb)2001
BAC-by-BAC sequencing
Whole Genome Shotgun (WGS) sequencing
International Human Genome Sequencing Consortium. Initial Sequencing and Analysis of the Human Genome.
Nature. 2001. 409:860-921
Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351
O
V
ER
LA
PPIN
G
M
ET
H
O
D
O
LO
G
Y
1. A brief history of the sequence assembly.
Homosapiens(3.2Gb)2001
Whole Genome Shotgun (WGS) sequencing
Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351
Software:
WGAASSEMBLER (CABOG)
Hardware:
40 machines AlphaSMPs (4 Gb RAM/each and 4 cores/each, total=160 Gb RAM and
160 cores); 5 days.
1. A brief history of the sequence assembly.
Ailuropodamelanoleura(2.3Gb)2009
Whole Genome Shotgun (WGS) sequencing
Li R. et al.The Sequence and the Novo Assembly of the Giant Panda Genome. Nature. 2009. 463:311-317
Software:
SOAPdenovo
Hardware:
Supercomputer with 32 cores and 512 Gb RAM.
B
R
U
IJN
G
R
A
PH
S
M
ET
H
O
D
O
LO
G
Y
1. A brief history of the sequence assembly.
1. A brief history of the sequence assembly.
What is a Kmer ?
ATGCGCAGTGGAGAGAGAGCGATG Sequence A with 25 nt
Specific n-tuple or n-gram of nucleic acid or amino acid sequences.
-Wikipedia
ordered list
of elements
contiguous sequence
of n items from a given
sequence of text
5 Kmers of 20-mer
ATGCGCAGTGGAGAGAGAGC
TGCGCAGTGGAGAGAGAGCG
GCGCAGTGGAGAGAGAGCGA
CGCAGTGGAGAGAGAGCGAT
GCAGTGGAGAGAGAGCGATG
N_kmers = L_read - Kmer_size
Compeau PEC. et al. How to apply de Bruijn graphs to genome assembly. Nature Biotech. 2011. 29:287-291
1. A brief history of the sequence assembly.
OLC (Overlap-layout-consensus) algorithm is more
suitable for the low-coverage long reads, whereas the
DBG (De-Bruijn-Graph) algorithm is more suitable for
high-coverage short reads and especially for large
genome assembly
1. A brief history of the sequence assembly.
1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Technology
Library Preparation
Sequencing Amount
Fastq raw
Fastq Processed
Contigs
Scaffolds
Decisions during the
Experimental Design
Software
Identity %, Kmer ...
Post-assembly Filtering
Decisions during the
Assembly Optimization
Reads
processing
and
filtering
1. Low quality reads (qscore) (Q30)
2. Short reads (L50)
3. PCR duplications (Only Genomes).
4. Contaminations.
5. Corrections
Consensus
Scaffolding
Scaffolds
Gap Filling
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Technology
Library Preparation
Sequencing Amount
Decisions during the
Experimental Design
2. Sequencing, tools and computers.
2.1 Technologies
Run Time Sequence Length Reads/Run
Total nucleotides
sequenced per run
Capillary Sequencing
(ABI37000)
~2.5 h 800 bp 386 0.308 Mb
454 Pyrosequencing
(GS FLX Titanium XL+)
~23 h 700 bp 1,000,000
700 Mb
(0.7 Gb)
Illumina
(HiSeq X Ten)
72 h
(3 days)
2 x 150 bp 6,000,000,000
1,600,000,000 Mb
(1,600 Gb)
Illumina
(MiSeq)
65 h 2 x 300 bp 2 x 22,000,000
13,500 Mb
(13.5 Gb)
SOLID
(5500xl system)
120 h
(7 days)
2 x 60 bp 400,000,000
300,000 Mb
(300 Gb)
Ion Torrent
(Ion Proton I)
2 h 200 bp 60,000,000
10,000 Mb
(10 Gb)
PacBio
(PacBioRS II)
1.5 h ~8,500 bp 50,000
375 Mb
(0.37 Gb)
2. Sequencing, tools and computers.
Strenghs Weaknesses
454 Pyrosequencing
(GS FLX Titanium XL+)
- Long reads (450/700 bp).
- Long insert for mate pair libraries (20Kb).
- Low observed raw error rate (0.1%)
- Low percentage of PCR duplications for mate
pair libraries
- Homopolymer error.
- Low sequence yield per run (0.7 Gb).
- Preferred assembler (gsAssembler) uses
overlapping methodology.
Illumina
(HiSeq 2500)
- High sequence yield per run (600 Gb)
- Low observed raw error rate (0.26%)
- High percentage of PCR duplications for mate
pair libraries.
- Long run time (11 days)
- High instrument cost (~ $650K)
Illumina
(MiSeq)
- Medium read size (250 bp)
- Faster run than Illumina HiSeq - Medium sequence yield per run (8.5 Gb)
SOLID
(5500xl system)
- 2-base encoding reduce the observed raw error
rate (0.06%)
- 2-base color coding makes difficult the
sequence manipulation and assembly.
- Short reads (75 bp)
Ion Torrent
(Ion Proton I)
- Fast run (2 hours)
- Low instrument cost ($80K).
- Medium read size (200 bp)
- Medium sequence yield per run (10 Gb)
- Medium observed raw error rate (1.7%)
PacBio
(PacBioRS)
- Long reads (3000 bp)
- Fast run (2 hours)
- Really high observed raw error rate (12.7%)
- High instrument cost (~ $700K)
- No pair end/mate pair reads
2.1 Technologies
2. Sequencing, tools and computers.
★ Library types (orientations):
• Single reads
• Pair ends (PE) (150-800 bp insert size)
• Mate pairs (MP) (2-40 Kb insert size)
F
F
F
F
R
R
R 454/Roche
Illumina
Illumina
2.2 Libraries
2. Sequencing, tools and computers.
• Why is important the pair information ?
F
- novo assembly:
Consensus sequence
(Contig)
Reads
Scaffold
(or Supercontig)
Pair Read information
NNNNN
Genetic information (markers)
Pseudomolecule
(or ultracontig)
NNNNN NN
2.2 Libraries
2. Sequencing, tools and computers.
2.3 Sequencing Amount
* Slide from MC. Schatz
2. Sequencing, tools and computers.
2.3 Sequencing Amount
Depending of the genome complexity, technology used and assembler:
• More is better (if you have enough computational resources).
- Sanger > 10X (less for BACs-by-BACs approaches).
- 454 > 20X
- Illumina > 100X
• Polyploidy or high heterogozyty increase the amount of reads
needed.
• The use of different library types (pair ends and mate pairs with
different insert sizes is essential).
• Longer reads is preferable.
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Fastq raw
Fastq Processed
Contigs
Scaffolds
Software
Decisions during the
Assembly Optimization
Reads
processing
and
filtering
1. Low quality reads (qscore) (Q30)
2. Short reads (L50)
3. PCR duplications (Only Genomes).
4. Contaminations.
5. Corrections
Consensus
Scaffolding
Scaffolds
Gap Filling
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Fastq raw
Fastq Processed
Software
Decisions during the
Assembly Optimization
Reads
processing
and
filtering
1. Low quality reads (qscore) (Q30)
2. Short reads (L50)
3. PCR duplications (Only Genomes).
4. Contaminations.
5. Corrections
2. Sequencing, tools and computers.
2.4 Tools: Read Quality Evaluation
a) Length of the read.
b) Bases with qscore > 20 or 30.
•FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
2. Sequencing, tools and computers.
2.4 Tools: Read Trimming and Filtering
• Fastx-Toolkit (http://hannonlab.cshl.edu)
• Ea-Utils (http://code.google.com/p/ea-utils/)
• PrinSeq (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi)
Software Multiplexing Trimming/Filtering
Fastx-Toolkit fastx_barcode_splitter fastq_quality_filter
Ea-Utils fastq-multx fastq-mcf
PrinSeq PrinSeq PrinSeq
1. Adaptor removal.
2. Low quality reads (qscore) (Q30)
3. Short reads (L50)
4. PCR duplications (Only Genomes, Use PrinSeq).
2. Sequencing, tools and computers.
2.4 Tools: Contaminations
5. Contaminations
Contaminations can be removed mapping the reads against a reference with the
contaminants such as E. coli and human genomes.The most common tools are Bowtie
or BWA (for short reads) and Blast (for long reads).
2. Sequencing, tools and computers.
2.4 Tools: Read Corrections
6. Read Corrections
Read corrections are generally based in the Kmer analysis.
Medvedev P. et al. Error correction of high-throughput sequencing datasets with non-uniform coverage
Bioinformatics. 2011 27 (13):i137-i141
2. Sequencing, tools and computers.
2.4 Tools: Read Corrections
6. Read Corrections
• Quake (http://www.cbcb.umd.edu/software/quake/index.html)
• Reptile (http://aluru-sun.ece.iastate.edu/doku.php?id=software)
• ECHO (http://uc-echo.sourceforge.net/)
• Corrector (http://soap.genomics.org.cn/soapdenovo.html)
Usual Suspects:
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Fastq Processed
Contigs
Scaffolds
Software
Decisions during the
Assembly Optimization
Consensus
Scaffolding
2. Sequencing, tools and computers.
Type Technology Used Features Link
Arachne Overlap-layout-consensus Sanger, 454
Highly
configurable
http://www.broadinstitute.org/crd/wiki/
index.php/Main_Page
CABOG Overlap-layout-consensus Sanger, 454, Illumina
Highly
configurable
http://sourceforge.net/apps/mediawiki/wgs-
assembler/index.php?title=Main_Page
MIRA Overlap-layout-consensus Sanger, 454
Highly
configurable
http://sourceforge.net/apps/mediawiki/mira-
assembler
gsAssembler Overlap-layout-consensus Sanger, 454 Easy to use
http://454.com/products/analysis-software/
index.asp
iAssembler Overlap-layout-consensus Sanger, 454 Improves MIRA http://bioinfo.bti.cornell.edu/tool/iAssembler
ABySS Bruijn graph 454 or Illumina Easy to use
http://www.bcgsc.ca/platform/bioinfo/software/
abyss
ALLPATH-LG Bruijn graph 454 or Illumina Good results
http://www.broadinstitute.org/software/allpaths-
lg/blog
Ray Bruijn graph 454 or Illumina
Slow but use less
memory
http://denovoassembler.sf.net/
SOAPdenovo Bruijn graph 454 or Illumina Fastest http://soap.genomics.org.cn/soapdenovo.html
Velvet Bruijn graph
454 or Illumina or
SOLiD
SOLiD http://www.ebi.ac.uk/~zerbino/velvet/
2.4 Tools: Assemblers
2. Sequencing, tools and computers.
Type Technology Used Features Link
Arachne Overlap-layout-consensus Sanger, 454
Highly
configurable
http://www.broadinstitute.org/crd/wiki/
index.php/Main_Page
CABOG Overlap-layout-consensus Sanger, 454, Illumina
Highly
configurable
http://sourceforge.net/apps/mediawiki/wgs-
assembler/index.php?title=Main_Page
MIRA Overlap-layout-consensus Sanger, 454
Highly
configurable
http://sourceforge.net/apps/mediawiki/mira-
assembler
gsAssembler Overlap-layout-consensus Sanger, 454 Easy to use
http://454.com/products/analysis-software/
index.asp
iAssembler Overlap-layout-consensus Sanger, 454 Improves MIRA http://bioinfo.bti.cornell.edu/tool/iAssembler
ABySS Bruijn graph 454 or Illumina Easy to use
http://www.bcgsc.ca/platform/bioinfo/software/
abyss
ALLPATH-LG Bruijn graph 454 or Illumina Good results
http://www.broadinstitute.org/software/allpaths-
lg/blog
Ray Bruijn graph 454 or Illumina
Slow but use less
memory
http://denovoassembler.sf.net/
SOAPdenovo Bruijn graph 454 or Illumina Fastest http://soap.genomics.org.cn/soapdenovo.html
Velvet Bruijn graph
454 or Illumina or
SOLiD
SOLiD http://www.ebi.ac.uk/~zerbino/velvet/
2.4 Tools: Assemblers
2. Sequencing, tools and computers.
2.4 Tools: Assemblers
... but there are more assemblers and information... Take a look to SeqAnswers
http://seqanswers.com/wiki/Special:BrowseData/Bioinformatics_application?
Bioinformatics_method=Assembly&Biological_domain=De-novo_assembly
Also highly recommendable:
2. Sequencing, tools and computers.
2.0 Overview of a Sequencing Project: Assembly
Scaffolds
Software
Decisions during the
Assembly Optimization
Scaffolds
Gap Filling
2. Sequencing, tools and computers.
Type Technology Used Features Link
GapCloser Overlap-layout-consensus Illumina
C++
Free
http://soap.genomics.org.cn/soapdenovo.html
GapFiller Overlap-layout-consensus Any
Perl
Commercial*
http://www.baseclear.com/landingpages/
basetools-a-wide-range-of-bioinformatics-
solutions/gapfiller/
PBSuite Overlap-layout-consensus
454,
PacBio
Python,
Long reads
http://sourceforge.net/p/pb-jelly/wiki/
2.4 Tools: Gap Filling
2. Sequencing, tools and computers.
2.5 Computers
Bigger is better:
• How much do you need depends of:
➡ how big is your genome ?
~ Human size (3Gb) require ~256 Gb to 1 Tb
➡ how many reads do you have, ?
More reads, more memory and hard disk.
➡ what software are you going to use ?
OLC uses more memory and time than DBG.
➡ what parameters are you going to use ?
Bigger Kmers, more memory.
Four options:
1. Buy your own server (512 Gb, 4.5 Tb, 64 cores; ~ $15,000).
2. Rent a server for ~ 1 month (same specs. $1.5/h; ~$1,000)(CBSU).
3. Use a supercomputing center associated with NSF, NIH, USDA... where they
offer reduced prices (iPlant, Indiana University....).
4. Collaborate with some group with a big server.
2. Sequencing, tools and computers.
2.6 Assembly evaluation
During the assembly optimization will be generated several assemblies.The most used
parameters to evaluate the assembly are:
1. Total Assembly Size,
How far is this value from the estimated genome size
2. Total Number of Sequences (Scaffold/Contigs)
How far is this value from the number of chromosomes.
3. Longest scaffold/contig
4. Average scaffold/contig size
5. N50/L50 (or any other N/L)
Number sequence (N) and minimum size of them (L) that represents the 50% of
the assembly if the sequences are sorted by size, from bigger to smaller.
2. Sequencing, tools and computers.
2.6 Assembly evaluation
N50/L50
Total assembly size: 1000 Mb
93 87 75 68 62 56 50 44 37 30 25 20 18
Sequences order by descending size (Mb)
2. Sequencing, tools and computers.
2.6 Assembly evaluation
N50/L50
Total assembly size: 1000 Mb
93 87 75 68 62 56 50 44 37 30 25 20 18
Sequences order by descending size (Mb)
50 % assembly: 500 Mb
N50
N50 = 7 sequences
L50 = 50 Mb
2. Sequencing, tools and computers.
2.6 Assembly evaluation
N90/L90
Total assembly size: 1000 Mb
93 87 75 68 62 56 50 44 37 30 25 20 18
Sequences order by descending size (Mb)
90 % assembly: 900 Mb
N90
N90 = 29 sequences
L90 = 12.5 Mb
P. axillaris
Current Assembly Peaxi v1.6.2
Dataset Contigs Scaffolds
Total assembly size (Gb) 1.22 1.26
Total assembly sequences 109,892 83,639
Longest sequence (Mb) 0.57 8.56
Sequence length mean (Kb) 11.08 15.05
N90 (sequences) 13,481 1,051
L90 (Kb) 22.28 295.75
N50 (sequences) 3,943 309
L50 (Kb) 95.17 1,236.73
2. Sequencing, tools and computers.
2.6 Assembly evaluation
Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Arabidopsis thaliana (TAIR10)
(5 chromosomes with 119 Mb)
(96 gaps with 185 Kb total)
Solanum lycopersicum (v2.5)
(12 chromosomes with 824 Mb)
(26,866 gaps with 82 Mb total)
Petunia axillaris (v1.6.2)
(83,639 scaffolds with 1.26 Gb)
(26,268 gaps with 41 Mb total)
Nicotiana benthamiana (v0.4.4)
(141,339 scaffolds with 2.63 Gb*)
(337,433 gaps with 162 Mb total)
*0.61-0.67 Gb no assembled
Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
?
Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Genome size
vs
assembly size
Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Cytogenetic data
(Flow cytometry)
✴ http://data.kew.org/cvalues
Example: Petunia axillaris genome size 1.37 Gb
(White and Rees. 1987)
Whole Genome
Representation
Sequence
Status
Genes Usability
Incomplete for non-
repetitive regions
Small scaffolds and
contigs
Incomplete genes Markers development
Complete for non-
repetitive regions
Medium scaffolds and
contigs
Complete but 1-2
genes/contig
Gene mining
Complete for non-
repetitive regions
Large scaffolds and
contigs
Several dozens of
genes/contig
Microsynteny
Complete for almost the
whole genome
Pseudomolecules
Hundreds of genes/
contig
Any (Synteny,
Candidate gene by QTLs)
Complete
genome
Pseudomolecules
Thousands of genes/
contig5
4
3
2
1
2. Sequencing, tools and computers.
2.7 Assembly stages
Sequencing data
(Kmer count)
• Jellyfish (http://www.genome.umd.edu/
jellyfish.html)
• BFCounter (http://
pritchardlab.stanford.edu/bfcounter.html)
0 10 20 30 40 50 60
0e+002e+074e+076e+078e+071e+08
63 Kmer Frequency for Petunia axillaris
Coverage
KmerCount
peak = 31
Estimated genome size = 1.38 Gb
Possiblesequencingerrors
Sequencing data (Kmer count)
2. Sequencing, tools and computers.
G = (N − B) * K / D * 2
G = genome size
N = total kmers
B = possible kmers with errors
K = kmer length
D = kmer peak
2.7 Assembly stages
Estimated genome size for P. axillaris:
• Flow Cytometry (White & Rees,1987)= 1.37 Gb
• Kmer Count* = 1.38 Gb
• Assembly size (scaffolds v1.6.2) = 1.26 Gb
• Assembly size (contigs v1.6.2) = 1.22 Gb
Estimated genome not assembled = 110 - 120 Mb
2. Sequencing, tools and computers.
2.7 Assembly stages
1. A brief history of the sequence assembly.
2. Sequencing, tools and computers.
3. Things that you should know about genomes.
3. Things that you should know about genomes.
1. They have variable size, for example in angiosperm plants they range from 63 Mb
(Genlisea margaretae) to 150 Gb (Paris japonica) with an average of 5.6 Gb. More data
at:
✴ Plants: http://data.kew.org/cvalues/
✴ Animals: http://www.genomesize.com/
2. They can be polyploids. It means that homoeologus regions with highly similar will
collapse during the assembly.
3. They can be highly heterozygous and polymorphic. In this case some of the alleles will
collapse, some of them not.The effective coverage will lower than expected.
4. They can have recent whole genome duplication (or triplication) events. Paralogous
genes may collapse during the assembly.
5. Repeats, most of the genomes are full of repeats and they are difficult to resolve. By
default assemblers throw them away based in the Kmer histogram.
3. Things that you should know about genomes.
3.1 Collapsing problem
CACTTGACGACATGACG
CTTGACGACATGACGAC
CCCTTGACGACATGACG
CGCCCTTGACGACATGA
Gene 1 A
Gene 1 B
CGCCCTTGACGACATGACGACA
Collapsed consensus Gene 1 A + Gene 1 B
✴Polyploidy
✴Whole Genome Duplication
Gene 1
Gene 1 A
Gene 1 B
Schmutz J et al. Genome Sequence of the
Paleoploid Soybean. Nature 2010 463:178-183
3. Things that you should know about genomes.
3.1 Collapsing problem
✴Whole Genome Duplication
http://chibba.agtec.uga.edu/duplication/index/home
3. Things that you should know about genomes.
3.1 Collapsing evaluation
CACTTGACGACATGACG
CTTGACGACATGACGAC
CCCTTGACGACATGACG
CGCCCTTGACGACATGA
CGCCCTTGACGACATGACGACA Consensus
Reads
Read collapsing during the assembly process can be evaluated mapping reads
back to the consensus sequence and analyzing SNPs.
CACTTGACGACATGA
SNPs
=
Heterozygous Positions (AB)
+
Homoeologs Collapsing (ABBB,AABB,ABBB)
+
Paralogs Collapsing (variable)
3. Things that you should know about genomes.
3.1 Collapsing solutions
1. Sequence the two progenitors and use them as a reference.
✓Approach used previously (cotton genome; Patterson A. et. al 2012).
❖ Information not contained in the references will be lost.
❖ Progenitors are not always available.
2. Use single molecule technologies such as PacBio or Moleculo
✓ Best approach because it will give the right phase for long sequence
chunks. It exists software to integrate big sequences (minimus2...)
❖ Experimental and probably expensive.
3. Assembly everything with the higher Kmer, evaluate the collapsed regions
and rescue the haplotype/region using SNPs and pair end read information.
✓ Cheap, easy to apply for the current technologies.
❖ No software available
3. Things that you should know about genomes.
3.2 Repeats and assemblers
5. Repeats, most of the genomes are full of repeats and they are difficult to resolve. By
default some assemblers throw them away based in the kmer histogram.
For example, SOAPdenovo throw away anything up to 255.
It is convenient to do a Kmer analysis using Jellyfish (or other Kmer counter)
before do any assembly to analyze contaminations (bacterial contaminations may appears
with high Kmers content) and repeats.
Software to analyze repeats based
in sequence graphs:
RepeatExplorer
(http://repeatexplorer.umbr.cas.cz/)
Novak, P., Neumann, P. & Macas, J. Graph-based clustering and characterization of repetitive sequences in next-
generation sequencing data. Bmc Bioinformatics 11, 378 (2010)
Acknowledgements:
Petunia Genome Sequencing
Consortium
Petunia
NIU (USA)
Tom Sims
Mitrick Jones
BTI (USA)
Lukas Mueller
Aureliano Bombarely
UNIBE (Switzerland)
Cris Kuhlemeier
Remy Bruggmann
Michel Moser
UV (Italy)
Massimo Delledonne
Mario Pezzotti
RU andVU (Netherlands)
Tom Gerats.
Francesca.Quattrochio
BGI (China)

More Related Content

What's hot

methods for protein structure prediction
methods for protein structure predictionmethods for protein structure prediction
methods for protein structure predictionkaramveer prajapat
 
shotgun sequncing
 shotgun sequncing shotgun sequncing
shotgun sequncingSAIFALI444
 
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICSSTRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICSSHEETHUMOLKS
 
Secondary protein structure prediction
Secondary protein structure predictionSecondary protein structure prediction
Secondary protein structure predictionSiva Dharshini R
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomicshemantbreeder
 
DNA SEQUENCING METHODS AND STRATEGIES FOR GENOME SEQUENCING
DNA SEQUENCING METHODS AND STRATEGIES FOR GENOME SEQUENCINGDNA SEQUENCING METHODS AND STRATEGIES FOR GENOME SEQUENCING
DNA SEQUENCING METHODS AND STRATEGIES FOR GENOME SEQUENCINGPuneet Kulyana
 
Open Reading Frames
Open Reading FramesOpen Reading Frames
Open Reading FramesOsama Zahid
 
Gene prediction methods vijay
Gene prediction methods  vijayGene prediction methods  vijay
Gene prediction methods vijayVijay Hemmadi
 
Rna seq and chip seq
Rna seq and chip seqRna seq and chip seq
Rna seq and chip seqJyoti Singh
 
Ab Initio Protein Structure Prediction
Ab Initio Protein Structure PredictionAb Initio Protein Structure Prediction
Ab Initio Protein Structure PredictionArindam Ghosh
 
Messenger RNA (mRNA) enrichment
Messenger RNA (mRNA) enrichmentMessenger RNA (mRNA) enrichment
Messenger RNA (mRNA) enrichmentJasmin Russel
 
sequence alignment
sequence alignmentsequence alignment
sequence alignmentammar kareem
 
MULTIPLE SEQUENCE ALIGNMENT
MULTIPLE  SEQUENCE  ALIGNMENTMULTIPLE  SEQUENCE  ALIGNMENT
MULTIPLE SEQUENCE ALIGNMENTMariya Raju
 

What's hot (20)

methods for protein structure prediction
methods for protein structure predictionmethods for protein structure prediction
methods for protein structure prediction
 
shotgun sequncing
 shotgun sequncing shotgun sequncing
shotgun sequncing
 
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICSSTRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Secondary protein structure prediction
Secondary protein structure predictionSecondary protein structure prediction
Secondary protein structure prediction
 
blast bioinformatics
blast bioinformaticsblast bioinformatics
blast bioinformatics
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
DNA SEQUENCING METHODS AND STRATEGIES FOR GENOME SEQUENCING
DNA SEQUENCING METHODS AND STRATEGIES FOR GENOME SEQUENCINGDNA SEQUENCING METHODS AND STRATEGIES FOR GENOME SEQUENCING
DNA SEQUENCING METHODS AND STRATEGIES FOR GENOME SEQUENCING
 
Tools and database of NCBI
Tools and database of NCBITools and database of NCBI
Tools and database of NCBI
 
Genome assembly
Genome assemblyGenome assembly
Genome assembly
 
Open Reading Frames
Open Reading FramesOpen Reading Frames
Open Reading Frames
 
Rasmol
RasmolRasmol
Rasmol
 
Gene prediction methods vijay
Gene prediction methods  vijayGene prediction methods  vijay
Gene prediction methods vijay
 
Rna seq and chip seq
Rna seq and chip seqRna seq and chip seq
Rna seq and chip seq
 
Ab Initio Protein Structure Prediction
Ab Initio Protein Structure PredictionAb Initio Protein Structure Prediction
Ab Initio Protein Structure Prediction
 
Messenger RNA (mRNA) enrichment
Messenger RNA (mRNA) enrichmentMessenger RNA (mRNA) enrichment
Messenger RNA (mRNA) enrichment
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
Est database
Est databaseEst database
Est database
 
MULTIPLE SEQUENCE ALIGNMENT
MULTIPLE  SEQUENCE  ALIGNMENTMULTIPLE  SEQUENCE  ALIGNMENT
MULTIPLE SEQUENCE ALIGNMENT
 

Similar to Genome Assembly

20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_coursehansjansen9999
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...QBiC_Tue
 
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...VHIR Vall d’Hebron Institut de Recerca
 
Real-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe ParkerReal-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe ParkerJoe Parker
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshopc.titus.brown
 
Toward Real-Time Analysis of Large Data Volumes for Diffraction Studies by Ma...
Toward Real-Time Analysis of Large Data Volumes for Diffraction Studies by Ma...Toward Real-Time Analysis of Large Data Volumes for Diffraction Studies by Ma...
Toward Real-Time Analysis of Large Data Volumes for Diffraction Studies by Ma...EarthCube
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pubsesejun
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.mkim8
 
DNA sequencing: rapid improvements and their implications
DNA sequencing: rapid improvements and their implicationsDNA sequencing: rapid improvements and their implications
DNA sequencing: rapid improvements and their implicationsJeffrey Funk
 
Open pacbiomodelorgpaper j_landolin_20150121
Open pacbiomodelorgpaper j_landolin_20150121Open pacbiomodelorgpaper j_landolin_20150121
Open pacbiomodelorgpaper j_landolin_20150121Jane Landolin
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2BITS
 
Joe parker-benchmarking-bioinformatics
Joe parker-benchmarking-bioinformaticsJoe parker-benchmarking-bioinformatics
Joe parker-benchmarking-bioinformaticsJoe Parker
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialThomas Keane
 
Using field-based DNA sequencing to accelerate phylogenomics
Using field-based DNA sequencing to accelerate phylogenomicsUsing field-based DNA sequencing to accelerate phylogenomics
Using field-based DNA sequencing to accelerate phylogenomicsJoe Parker
 

Similar to Genome Assembly (20)

Genome Assembly 2018
Genome Assembly 2018Genome Assembly 2018
Genome Assembly 2018
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_course
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...
 
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
 
Real-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe ParkerReal-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe Parker
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshop
 
NCBI
NCBINCBI
NCBI
 
Toward Real-Time Analysis of Large Data Volumes for Diffraction Studies by Ma...
Toward Real-Time Analysis of Large Data Volumes for Diffraction Studies by Ma...Toward Real-Time Analysis of Large Data Volumes for Diffraction Studies by Ma...
Toward Real-Time Analysis of Large Data Volumes for Diffraction Studies by Ma...
 
BioSB meeting 2015
BioSB meeting 2015BioSB meeting 2015
BioSB meeting 2015
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.
 
DNA sequencing: rapid improvements and their implications
DNA sequencing: rapid improvements and their implicationsDNA sequencing: rapid improvements and their implications
DNA sequencing: rapid improvements and their implications
 
Open pacbiomodelorgpaper j_landolin_20150121
Open pacbiomodelorgpaper j_landolin_20150121Open pacbiomodelorgpaper j_landolin_20150121
Open pacbiomodelorgpaper j_landolin_20150121
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2
 
Joe parker-benchmarking-bioinformatics
Joe parker-benchmarking-bioinformaticsJoe parker-benchmarking-bioinformatics
Joe parker-benchmarking-bioinformatics
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing Tutorial
 
Using field-based DNA sequencing to accelerate phylogenomics
Using field-based DNA sequencing to accelerate phylogenomicsUsing field-based DNA sequencing to accelerate phylogenomics
Using field-based DNA sequencing to accelerate phylogenomics
 

More from Aureliano Bombarely (10)

Lesson mitochondria bombarely_a20180927
Lesson mitochondria bombarely_a20180927Lesson mitochondria bombarely_a20180927
Lesson mitochondria bombarely_a20180927
 
PlastidEvolution
PlastidEvolutionPlastidEvolution
PlastidEvolution
 
Genomic Data Analysis: From Reads to Variants
Genomic Data Analysis: From Reads to VariantsGenomic Data Analysis: From Reads to Variants
Genomic Data Analysis: From Reads to Variants
 
RNAseq Analysis
RNAseq AnalysisRNAseq Analysis
RNAseq Analysis
 
BasicLinux
BasicLinuxBasicLinux
BasicLinux
 
PerlTesting
PerlTestingPerlTesting
PerlTesting
 
PerlScripting
PerlScriptingPerlScripting
PerlScripting
 
Introduction2R
Introduction2RIntroduction2R
Introduction2R
 
GoTermsAnalysisWithR
GoTermsAnalysisWithRGoTermsAnalysisWithR
GoTermsAnalysisWithR
 
BasicGraphsWithR
BasicGraphsWithRBasicGraphsWithR
BasicGraphsWithR
 

Recently uploaded

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 

Recently uploaded (20)

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 

Genome Assembly

  • 1. Insights into Plant Genome Sequence Assembly. by Aureliano Bombarely ab782@cornell.edu
  • 2. 1. A brief history of the sequence assembly. 2. Sequencing, tools and computers. 3. Things that you should know about genomes.
  • 3. 1. A brief history of the sequence assembly. 2. Sequencing, tools and computers. 3. Things that you should know about genomes.
  • 4. 1. A brief history of the sequence assembly. http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html
  • 5. Genome = N x Sequence of DNA Gentile F. et al. Direct Imaging of DNA Fibers:TheVisage of Double Helix Nano Lett., 2012, 12 (12), pp 6453–6458 DNA sequencing ??? 1. A brief history of the sequence assembly.
  • 6. DNA Sequencing: “Process of determining the precise order of nucleotides within a DNA molecule.” -Wikipedia ddATP ddGTP ddCTP ddTTP Taq-Polymerase STOP time 2) Chromatographic Separation GTCACCCTGAAT 1) PCR with ddNTPs 3) Chromatogram Read DNA Sanger Sequencing 1. A brief history of the sequence assembly.
  • 7. Gentile F. et al. Direct Imaging of DNA Fibers:TheVisage of Double Helix Nano Lett., 2012, 12 (12), pp 6453–6458 ACGCGGTGACGTTGTCGA ACGCGTTTTGTTGTGG ACGATTTAAATGACGTTGTCGA ACGCGGTGACGTTGTCGA ACGCGTTTTGTTGTGG ACCCCGACGTTGTCGA DNA sequencing ACGCGTTTTGTTGTGG ACCCCTGGGGGGGTTGTCGA Fragments Sequence assembly ACGCGTTTTGTTGTGGTGGCCACACCACGCAGTGACGGAGATAACGGCGAGAGCATGGACGGAGGATGAGGATGG 1. A brief history of the sequence assembly.
  • 8. Sequence assembly = Resolve a puzzle and rebuild a DNA sequence from its pieces (fragments) Image courtesy of iStock photo 1. A brief history of the sequence assembly.
  • 9. Resolve a puzzle and rebuild a DNA sequence from its pieces (fragments) BACs BAC-by-BAC sequencing Whole Genome Shotgun (WGS) sequencing 1. A brief history of the sequence assembly.
  • 10. MS2Bacteriophage(3.658Kb)1977 1978 1979 1980 1981 1982 1983 Epstein-BarrVirus(170Kb)1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 Haemophilusinfluenzae(1.83Mb)1995 Saccharomycescerevisiae(12.1Mb)1996 1997 Caenorhabditiselegans(100Mb)1998 1999 Arabidopsisthaliana(157Mb)2000 Homosapiens(3.2Gb)2001 Oryzasativa(420Mb)2002 2003 2004 2005 Populustrichocarpa(550Mb)2006 Vitisvinifera(487Mb)2007 Physcomitrela(480Mb);Caricapapaya(372Mb)2008 Sorghumbicolor(730Mb);ZeaMays(2.3Gb);Cucumissativus(367Mb)2009 Ectocarpus(214Mb);Malusxdomestica(742Mb);Glycinemax(1.1Gb)2010 Pigeonpea,Potato,Cannabis,A.lyrata,Cacao,Strawberry,Medicago2011 Mei,Benthamiana,Tomato,Setaria,Melon,Flax,T.salsuginea,Banana,Cotton,Orange,Pear2012 Sanger Sequencing 454 Solexa / Illumina SOLiD IonT PB 1. A brief history of the sequence assembly.
  • 12. Haemophilusinfluenzae(1.83Mb)1995 Software: TIGR ASSEMBLER Hardware: SPARCenter 2000 (512 Mb RAM) 1. A brief history of the sequence assembly.
  • 13. Haemophilusinfluenzae(1.83Mb)1995 Software: TIGR ASSEMBLER Smith-Waterman alignments Sutton GG. et al.TIGR Assembler:A New Tool to Assembly Large Shotgun Sequencing Projects Genome Science and Technology, 1995,1:9-19 O V ER LA PPIN G M ET H O D O LO G Y 1. A brief history of the sequence assembly.
  • 15. Homosapiens(3.2Gb)2001 BAC-by-BAC sequencing Whole Genome Shotgun (WGS) sequencing International Human Genome Sequencing Consortium. Initial Sequencing and Analysis of the Human Genome. Nature. 2001. 409:860-921 Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351 O V ER LA PPIN G M ET H O D O LO G Y 1. A brief history of the sequence assembly.
  • 16. Homosapiens(3.2Gb)2001 Whole Genome Shotgun (WGS) sequencing Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351 Software: WGAASSEMBLER (CABOG) Hardware: 40 machines AlphaSMPs (4 Gb RAM/each and 4 cores/each, total=160 Gb RAM and 160 cores); 5 days. 1. A brief history of the sequence assembly.
  • 17. Ailuropodamelanoleura(2.3Gb)2009 Whole Genome Shotgun (WGS) sequencing Li R. et al.The Sequence and the Novo Assembly of the Giant Panda Genome. Nature. 2009. 463:311-317 Software: SOAPdenovo Hardware: Supercomputer with 32 cores and 512 Gb RAM. B R U IJN G R A PH S M ET H O D O LO G Y 1. A brief history of the sequence assembly.
  • 18. 1. A brief history of the sequence assembly. What is a Kmer ? ATGCGCAGTGGAGAGAGAGCGATG Sequence A with 25 nt Specific n-tuple or n-gram of nucleic acid or amino acid sequences. -Wikipedia ordered list of elements contiguous sequence of n items from a given sequence of text 5 Kmers of 20-mer ATGCGCAGTGGAGAGAGAGC TGCGCAGTGGAGAGAGAGCG GCGCAGTGGAGAGAGAGCGA CGCAGTGGAGAGAGAGCGAT GCAGTGGAGAGAGAGCGATG N_kmers = L_read - Kmer_size
  • 19. Compeau PEC. et al. How to apply de Bruijn graphs to genome assembly. Nature Biotech. 2011. 29:287-291 1. A brief history of the sequence assembly.
  • 20. OLC (Overlap-layout-consensus) algorithm is more suitable for the low-coverage long reads, whereas the DBG (De-Bruijn-Graph) algorithm is more suitable for high-coverage short reads and especially for large genome assembly 1. A brief history of the sequence assembly.
  • 21. 1. A brief history of the sequence assembly. 2. Sequencing, tools and computers. 3. Things that you should know about genomes.
  • 22. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Technology Library Preparation Sequencing Amount Fastq raw Fastq Processed Contigs Scaffolds Decisions during the Experimental Design Software Identity %, Kmer ... Post-assembly Filtering Decisions during the Assembly Optimization Reads processing and filtering 1. Low quality reads (qscore) (Q30) 2. Short reads (L50) 3. PCR duplications (Only Genomes). 4. Contaminations. 5. Corrections Consensus Scaffolding Scaffolds Gap Filling
  • 23. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Technology Library Preparation Sequencing Amount Decisions during the Experimental Design
  • 24. 2. Sequencing, tools and computers. 2.1 Technologies Run Time Sequence Length Reads/Run Total nucleotides sequenced per run Capillary Sequencing (ABI37000) ~2.5 h 800 bp 386 0.308 Mb 454 Pyrosequencing (GS FLX Titanium XL+) ~23 h 700 bp 1,000,000 700 Mb (0.7 Gb) Illumina (HiSeq X Ten) 72 h (3 days) 2 x 150 bp 6,000,000,000 1,600,000,000 Mb (1,600 Gb) Illumina (MiSeq) 65 h 2 x 300 bp 2 x 22,000,000 13,500 Mb (13.5 Gb) SOLID (5500xl system) 120 h (7 days) 2 x 60 bp 400,000,000 300,000 Mb (300 Gb) Ion Torrent (Ion Proton I) 2 h 200 bp 60,000,000 10,000 Mb (10 Gb) PacBio (PacBioRS II) 1.5 h ~8,500 bp 50,000 375 Mb (0.37 Gb)
  • 25. 2. Sequencing, tools and computers. Strenghs Weaknesses 454 Pyrosequencing (GS FLX Titanium XL+) - Long reads (450/700 bp). - Long insert for mate pair libraries (20Kb). - Low observed raw error rate (0.1%) - Low percentage of PCR duplications for mate pair libraries - Homopolymer error. - Low sequence yield per run (0.7 Gb). - Preferred assembler (gsAssembler) uses overlapping methodology. Illumina (HiSeq 2500) - High sequence yield per run (600 Gb) - Low observed raw error rate (0.26%) - High percentage of PCR duplications for mate pair libraries. - Long run time (11 days) - High instrument cost (~ $650K) Illumina (MiSeq) - Medium read size (250 bp) - Faster run than Illumina HiSeq - Medium sequence yield per run (8.5 Gb) SOLID (5500xl system) - 2-base encoding reduce the observed raw error rate (0.06%) - 2-base color coding makes difficult the sequence manipulation and assembly. - Short reads (75 bp) Ion Torrent (Ion Proton I) - Fast run (2 hours) - Low instrument cost ($80K). - Medium read size (200 bp) - Medium sequence yield per run (10 Gb) - Medium observed raw error rate (1.7%) PacBio (PacBioRS) - Long reads (3000 bp) - Fast run (2 hours) - Really high observed raw error rate (12.7%) - High instrument cost (~ $700K) - No pair end/mate pair reads 2.1 Technologies
  • 26. 2. Sequencing, tools and computers. ★ Library types (orientations): • Single reads • Pair ends (PE) (150-800 bp insert size) • Mate pairs (MP) (2-40 Kb insert size) F F F F R R R 454/Roche Illumina Illumina 2.2 Libraries
  • 27. 2. Sequencing, tools and computers. • Why is important the pair information ? F - novo assembly: Consensus sequence (Contig) Reads Scaffold (or Supercontig) Pair Read information NNNNN Genetic information (markers) Pseudomolecule (or ultracontig) NNNNN NN 2.2 Libraries
  • 28. 2. Sequencing, tools and computers. 2.3 Sequencing Amount * Slide from MC. Schatz
  • 29. 2. Sequencing, tools and computers. 2.3 Sequencing Amount Depending of the genome complexity, technology used and assembler: • More is better (if you have enough computational resources). - Sanger > 10X (less for BACs-by-BACs approaches). - 454 > 20X - Illumina > 100X • Polyploidy or high heterogozyty increase the amount of reads needed. • The use of different library types (pair ends and mate pairs with different insert sizes is essential). • Longer reads is preferable.
  • 30. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Fastq raw Fastq Processed Contigs Scaffolds Software Decisions during the Assembly Optimization Reads processing and filtering 1. Low quality reads (qscore) (Q30) 2. Short reads (L50) 3. PCR duplications (Only Genomes). 4. Contaminations. 5. Corrections Consensus Scaffolding Scaffolds Gap Filling
  • 31. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Fastq raw Fastq Processed Software Decisions during the Assembly Optimization Reads processing and filtering 1. Low quality reads (qscore) (Q30) 2. Short reads (L50) 3. PCR duplications (Only Genomes). 4. Contaminations. 5. Corrections
  • 32. 2. Sequencing, tools and computers. 2.4 Tools: Read Quality Evaluation a) Length of the read. b) Bases with qscore > 20 or 30. •FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
  • 33. 2. Sequencing, tools and computers. 2.4 Tools: Read Trimming and Filtering • Fastx-Toolkit (http://hannonlab.cshl.edu) • Ea-Utils (http://code.google.com/p/ea-utils/) • PrinSeq (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi) Software Multiplexing Trimming/Filtering Fastx-Toolkit fastx_barcode_splitter fastq_quality_filter Ea-Utils fastq-multx fastq-mcf PrinSeq PrinSeq PrinSeq 1. Adaptor removal. 2. Low quality reads (qscore) (Q30) 3. Short reads (L50) 4. PCR duplications (Only Genomes, Use PrinSeq).
  • 34. 2. Sequencing, tools and computers. 2.4 Tools: Contaminations 5. Contaminations Contaminations can be removed mapping the reads against a reference with the contaminants such as E. coli and human genomes.The most common tools are Bowtie or BWA (for short reads) and Blast (for long reads).
  • 35. 2. Sequencing, tools and computers. 2.4 Tools: Read Corrections 6. Read Corrections Read corrections are generally based in the Kmer analysis. Medvedev P. et al. Error correction of high-throughput sequencing datasets with non-uniform coverage Bioinformatics. 2011 27 (13):i137-i141
  • 36. 2. Sequencing, tools and computers. 2.4 Tools: Read Corrections 6. Read Corrections • Quake (http://www.cbcb.umd.edu/software/quake/index.html) • Reptile (http://aluru-sun.ece.iastate.edu/doku.php?id=software) • ECHO (http://uc-echo.sourceforge.net/) • Corrector (http://soap.genomics.org.cn/soapdenovo.html) Usual Suspects:
  • 37. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Fastq Processed Contigs Scaffolds Software Decisions during the Assembly Optimization Consensus Scaffolding
  • 38. 2. Sequencing, tools and computers. Type Technology Used Features Link Arachne Overlap-layout-consensus Sanger, 454 Highly configurable http://www.broadinstitute.org/crd/wiki/ index.php/Main_Page CABOG Overlap-layout-consensus Sanger, 454, Illumina Highly configurable http://sourceforge.net/apps/mediawiki/wgs- assembler/index.php?title=Main_Page MIRA Overlap-layout-consensus Sanger, 454 Highly configurable http://sourceforge.net/apps/mediawiki/mira- assembler gsAssembler Overlap-layout-consensus Sanger, 454 Easy to use http://454.com/products/analysis-software/ index.asp iAssembler Overlap-layout-consensus Sanger, 454 Improves MIRA http://bioinfo.bti.cornell.edu/tool/iAssembler ABySS Bruijn graph 454 or Illumina Easy to use http://www.bcgsc.ca/platform/bioinfo/software/ abyss ALLPATH-LG Bruijn graph 454 or Illumina Good results http://www.broadinstitute.org/software/allpaths- lg/blog Ray Bruijn graph 454 or Illumina Slow but use less memory http://denovoassembler.sf.net/ SOAPdenovo Bruijn graph 454 or Illumina Fastest http://soap.genomics.org.cn/soapdenovo.html Velvet Bruijn graph 454 or Illumina or SOLiD SOLiD http://www.ebi.ac.uk/~zerbino/velvet/ 2.4 Tools: Assemblers
  • 39. 2. Sequencing, tools and computers. Type Technology Used Features Link Arachne Overlap-layout-consensus Sanger, 454 Highly configurable http://www.broadinstitute.org/crd/wiki/ index.php/Main_Page CABOG Overlap-layout-consensus Sanger, 454, Illumina Highly configurable http://sourceforge.net/apps/mediawiki/wgs- assembler/index.php?title=Main_Page MIRA Overlap-layout-consensus Sanger, 454 Highly configurable http://sourceforge.net/apps/mediawiki/mira- assembler gsAssembler Overlap-layout-consensus Sanger, 454 Easy to use http://454.com/products/analysis-software/ index.asp iAssembler Overlap-layout-consensus Sanger, 454 Improves MIRA http://bioinfo.bti.cornell.edu/tool/iAssembler ABySS Bruijn graph 454 or Illumina Easy to use http://www.bcgsc.ca/platform/bioinfo/software/ abyss ALLPATH-LG Bruijn graph 454 or Illumina Good results http://www.broadinstitute.org/software/allpaths- lg/blog Ray Bruijn graph 454 or Illumina Slow but use less memory http://denovoassembler.sf.net/ SOAPdenovo Bruijn graph 454 or Illumina Fastest http://soap.genomics.org.cn/soapdenovo.html Velvet Bruijn graph 454 or Illumina or SOLiD SOLiD http://www.ebi.ac.uk/~zerbino/velvet/ 2.4 Tools: Assemblers
  • 40. 2. Sequencing, tools and computers. 2.4 Tools: Assemblers ... but there are more assemblers and information... Take a look to SeqAnswers http://seqanswers.com/wiki/Special:BrowseData/Bioinformatics_application? Bioinformatics_method=Assembly&Biological_domain=De-novo_assembly Also highly recommendable:
  • 41. 2. Sequencing, tools and computers. 2.0 Overview of a Sequencing Project: Assembly Scaffolds Software Decisions during the Assembly Optimization Scaffolds Gap Filling
  • 42. 2. Sequencing, tools and computers. Type Technology Used Features Link GapCloser Overlap-layout-consensus Illumina C++ Free http://soap.genomics.org.cn/soapdenovo.html GapFiller Overlap-layout-consensus Any Perl Commercial* http://www.baseclear.com/landingpages/ basetools-a-wide-range-of-bioinformatics- solutions/gapfiller/ PBSuite Overlap-layout-consensus 454, PacBio Python, Long reads http://sourceforge.net/p/pb-jelly/wiki/ 2.4 Tools: Gap Filling
  • 43. 2. Sequencing, tools and computers. 2.5 Computers Bigger is better: • How much do you need depends of: ➡ how big is your genome ? ~ Human size (3Gb) require ~256 Gb to 1 Tb ➡ how many reads do you have, ? More reads, more memory and hard disk. ➡ what software are you going to use ? OLC uses more memory and time than DBG. ➡ what parameters are you going to use ? Bigger Kmers, more memory. Four options: 1. Buy your own server (512 Gb, 4.5 Tb, 64 cores; ~ $15,000). 2. Rent a server for ~ 1 month (same specs. $1.5/h; ~$1,000)(CBSU). 3. Use a supercomputing center associated with NSF, NIH, USDA... where they offer reduced prices (iPlant, Indiana University....). 4. Collaborate with some group with a big server.
  • 44. 2. Sequencing, tools and computers. 2.6 Assembly evaluation During the assembly optimization will be generated several assemblies.The most used parameters to evaluate the assembly are: 1. Total Assembly Size, How far is this value from the estimated genome size 2. Total Number of Sequences (Scaffold/Contigs) How far is this value from the number of chromosomes. 3. Longest scaffold/contig 4. Average scaffold/contig size 5. N50/L50 (or any other N/L) Number sequence (N) and minimum size of them (L) that represents the 50% of the assembly if the sequences are sorted by size, from bigger to smaller.
  • 45. 2. Sequencing, tools and computers. 2.6 Assembly evaluation N50/L50 Total assembly size: 1000 Mb 93 87 75 68 62 56 50 44 37 30 25 20 18 Sequences order by descending size (Mb)
  • 46. 2. Sequencing, tools and computers. 2.6 Assembly evaluation N50/L50 Total assembly size: 1000 Mb 93 87 75 68 62 56 50 44 37 30 25 20 18 Sequences order by descending size (Mb) 50 % assembly: 500 Mb N50 N50 = 7 sequences L50 = 50 Mb
  • 47. 2. Sequencing, tools and computers. 2.6 Assembly evaluation N90/L90 Total assembly size: 1000 Mb 93 87 75 68 62 56 50 44 37 30 25 20 18 Sequences order by descending size (Mb) 90 % assembly: 900 Mb N90 N90 = 29 sequences L90 = 12.5 Mb
  • 48. P. axillaris Current Assembly Peaxi v1.6.2 Dataset Contigs Scaffolds Total assembly size (Gb) 1.22 1.26 Total assembly sequences 109,892 83,639 Longest sequence (Mb) 0.57 8.56 Sequence length mean (Kb) 11.08 15.05 N90 (sequences) 13,481 1,051 L90 (Kb) 22.28 295.75 N50 (sequences) 3,943 309 L50 (Kb) 95.17 1,236.73 2. Sequencing, tools and computers. 2.6 Assembly evaluation
  • 49. Whole Genome Representation Sequence Status Genes Usability Incomplete for non- repetitive regions Small scaffolds and contigs Incomplete genes Markers development Complete for non- repetitive regions Medium scaffolds and contigs Complete but 1-2 genes/contig Gene mining Complete for non- repetitive regions Large scaffolds and contigs Several dozens of genes/contig Microsynteny Complete for almost the whole genome Pseudomolecules Hundreds of genes/ contig Any (Synteny, Candidate gene by QTLs) Complete genome Pseudomolecules Thousands of genes/ contig5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages
  • 50. 5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages Arabidopsis thaliana (TAIR10) (5 chromosomes with 119 Mb) (96 gaps with 185 Kb total) Solanum lycopersicum (v2.5) (12 chromosomes with 824 Mb) (26,866 gaps with 82 Mb total) Petunia axillaris (v1.6.2) (83,639 scaffolds with 1.26 Gb) (26,268 gaps with 41 Mb total) Nicotiana benthamiana (v0.4.4) (141,339 scaffolds with 2.63 Gb*) (337,433 gaps with 162 Mb total) *0.61-0.67 Gb no assembled
  • 51. Whole Genome Representation Sequence Status Genes Usability Incomplete for non- repetitive regions Small scaffolds and contigs Incomplete genes Markers development Complete for non- repetitive regions Medium scaffolds and contigs Complete but 1-2 genes/contig Gene mining Complete for non- repetitive regions Large scaffolds and contigs Several dozens of genes/contig Microsynteny Complete for almost the whole genome Pseudomolecules Hundreds of genes/ contig Any (Synteny, Candidate gene by QTLs) Complete genome Pseudomolecules Thousands of genes/ contig5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages ?
  • 52. Whole Genome Representation Sequence Status Genes Usability Incomplete for non- repetitive regions Small scaffolds and contigs Incomplete genes Markers development Complete for non- repetitive regions Medium scaffolds and contigs Complete but 1-2 genes/contig Gene mining Complete for non- repetitive regions Large scaffolds and contigs Several dozens of genes/contig Microsynteny Complete for almost the whole genome Pseudomolecules Hundreds of genes/ contig Any (Synteny, Candidate gene by QTLs) Complete genome Pseudomolecules Thousands of genes/ contig5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages Genome size vs assembly size
  • 53. Whole Genome Representation Sequence Status Genes Usability Incomplete for non- repetitive regions Small scaffolds and contigs Incomplete genes Markers development Complete for non- repetitive regions Medium scaffolds and contigs Complete but 1-2 genes/contig Gene mining Complete for non- repetitive regions Large scaffolds and contigs Several dozens of genes/contig Microsynteny Complete for almost the whole genome Pseudomolecules Hundreds of genes/ contig Any (Synteny, Candidate gene by QTLs) Complete genome Pseudomolecules Thousands of genes/ contig5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages Cytogenetic data (Flow cytometry) ✴ http://data.kew.org/cvalues Example: Petunia axillaris genome size 1.37 Gb (White and Rees. 1987)
  • 54. Whole Genome Representation Sequence Status Genes Usability Incomplete for non- repetitive regions Small scaffolds and contigs Incomplete genes Markers development Complete for non- repetitive regions Medium scaffolds and contigs Complete but 1-2 genes/contig Gene mining Complete for non- repetitive regions Large scaffolds and contigs Several dozens of genes/contig Microsynteny Complete for almost the whole genome Pseudomolecules Hundreds of genes/ contig Any (Synteny, Candidate gene by QTLs) Complete genome Pseudomolecules Thousands of genes/ contig5 4 3 2 1 2. Sequencing, tools and computers. 2.7 Assembly stages Sequencing data (Kmer count) • Jellyfish (http://www.genome.umd.edu/ jellyfish.html) • BFCounter (http:// pritchardlab.stanford.edu/bfcounter.html)
  • 55. 0 10 20 30 40 50 60 0e+002e+074e+076e+078e+071e+08 63 Kmer Frequency for Petunia axillaris Coverage KmerCount peak = 31 Estimated genome size = 1.38 Gb Possiblesequencingerrors Sequencing data (Kmer count) 2. Sequencing, tools and computers. G = (N − B) * K / D * 2 G = genome size N = total kmers B = possible kmers with errors K = kmer length D = kmer peak 2.7 Assembly stages
  • 56. Estimated genome size for P. axillaris: • Flow Cytometry (White & Rees,1987)= 1.37 Gb • Kmer Count* = 1.38 Gb • Assembly size (scaffolds v1.6.2) = 1.26 Gb • Assembly size (contigs v1.6.2) = 1.22 Gb Estimated genome not assembled = 110 - 120 Mb 2. Sequencing, tools and computers. 2.7 Assembly stages
  • 57. 1. A brief history of the sequence assembly. 2. Sequencing, tools and computers. 3. Things that you should know about genomes.
  • 58. 3. Things that you should know about genomes. 1. They have variable size, for example in angiosperm plants they range from 63 Mb (Genlisea margaretae) to 150 Gb (Paris japonica) with an average of 5.6 Gb. More data at: ✴ Plants: http://data.kew.org/cvalues/ ✴ Animals: http://www.genomesize.com/ 2. They can be polyploids. It means that homoeologus regions with highly similar will collapse during the assembly. 3. They can be highly heterozygous and polymorphic. In this case some of the alleles will collapse, some of them not.The effective coverage will lower than expected. 4. They can have recent whole genome duplication (or triplication) events. Paralogous genes may collapse during the assembly. 5. Repeats, most of the genomes are full of repeats and they are difficult to resolve. By default assemblers throw them away based in the Kmer histogram.
  • 59. 3. Things that you should know about genomes. 3.1 Collapsing problem CACTTGACGACATGACG CTTGACGACATGACGAC CCCTTGACGACATGACG CGCCCTTGACGACATGA Gene 1 A Gene 1 B CGCCCTTGACGACATGACGACA Collapsed consensus Gene 1 A + Gene 1 B ✴Polyploidy ✴Whole Genome Duplication Gene 1 Gene 1 A Gene 1 B Schmutz J et al. Genome Sequence of the Paleoploid Soybean. Nature 2010 463:178-183
  • 60. 3. Things that you should know about genomes. 3.1 Collapsing problem ✴Whole Genome Duplication http://chibba.agtec.uga.edu/duplication/index/home
  • 61. 3. Things that you should know about genomes. 3.1 Collapsing evaluation CACTTGACGACATGACG CTTGACGACATGACGAC CCCTTGACGACATGACG CGCCCTTGACGACATGA CGCCCTTGACGACATGACGACA Consensus Reads Read collapsing during the assembly process can be evaluated mapping reads back to the consensus sequence and analyzing SNPs. CACTTGACGACATGA SNPs = Heterozygous Positions (AB) + Homoeologs Collapsing (ABBB,AABB,ABBB) + Paralogs Collapsing (variable)
  • 62. 3. Things that you should know about genomes. 3.1 Collapsing solutions 1. Sequence the two progenitors and use them as a reference. ✓Approach used previously (cotton genome; Patterson A. et. al 2012). ❖ Information not contained in the references will be lost. ❖ Progenitors are not always available. 2. Use single molecule technologies such as PacBio or Moleculo ✓ Best approach because it will give the right phase for long sequence chunks. It exists software to integrate big sequences (minimus2...) ❖ Experimental and probably expensive. 3. Assembly everything with the higher Kmer, evaluate the collapsed regions and rescue the haplotype/region using SNPs and pair end read information. ✓ Cheap, easy to apply for the current technologies. ❖ No software available
  • 63. 3. Things that you should know about genomes. 3.2 Repeats and assemblers 5. Repeats, most of the genomes are full of repeats and they are difficult to resolve. By default some assemblers throw them away based in the kmer histogram. For example, SOAPdenovo throw away anything up to 255. It is convenient to do a Kmer analysis using Jellyfish (or other Kmer counter) before do any assembly to analyze contaminations (bacterial contaminations may appears with high Kmers content) and repeats. Software to analyze repeats based in sequence graphs: RepeatExplorer (http://repeatexplorer.umbr.cas.cz/) Novak, P., Neumann, P. & Macas, J. Graph-based clustering and characterization of repetitive sequences in next- generation sequencing data. Bmc Bioinformatics 11, 378 (2010)
  • 64. Acknowledgements: Petunia Genome Sequencing Consortium Petunia NIU (USA) Tom Sims Mitrick Jones BTI (USA) Lukas Mueller Aureliano Bombarely UNIBE (Switzerland) Cris Kuhlemeier Remy Bruggmann Michel Moser UV (Italy) Massimo Delledonne Mario Pezzotti RU andVU (Netherlands) Tom Gerats. Francesca.Quattrochio BGI (China)