Insights into Plant Genome Sequence Assembly.
by
Aureliano Bombarely
ab782@cornell.edu

Presentation about genome sequence assembly

  2. Outline:
     1. A brief history of sequence assembly.
     2. Sequencing, tools and computers.
     3. Things that you should know about genomes.
  4. Section 1. A brief history of sequence assembly. http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html
  5. Genome = N x sequence of DNA. Gentile F. et al. Direct Imaging of DNA Fibers: The Visage of Double Helix. Nano Lett., 2012, 12 (12), pp 6453–6458. DNA sequencing?
  6. DNA sequencing: "Process of determining the precise order of nucleotides within a DNA molecule." (Wikipedia). Sanger sequencing: 1) PCR with ddNTPs (ddATP, ddGTP, ddCTP, ddTTP), which stop the extension by the Taq polymerase; 2) chromatographic separation of the fragments over time; 3) chromatogram read (e.g. GTCACCCTGAAT).
  7. DNA sequencing produces fragments of the DNA molecule (e.g. ACGCGGTGACGTTGTCGA, ACGCGTTTTGTTGTGG, ACGATTTAAATGACGTTGTCGA, ACCCCGACGTTGTCGA, ACCCCTGGGGGGGTTGTCGA). Sequence assembly rebuilds the full sequence from those fragments: ACGCGTTTTGTTGTGGTGGCCACACCACGCAGTGACGGAGATAACGGCGAGAGCATGGACGGAGGATGAGGATGG. (Gentile F. et al. Nano Lett., 2012, 12 (12), pp 6453–6458.)
  8. Sequence assembly = solve a puzzle and rebuild a DNA sequence from its pieces (fragments). Image courtesy of iStock photo.
  9. Approaches: BAC-by-BAC sequencing (using BACs) and Whole Genome Shotgun (WGS) sequencing.
  10. Timeline of sequenced genomes:
      - 1977: MS2 bacteriophage (3.658 Kb)
      - 1984: Epstein-Barr virus (170 Kb)
      - 1995: Haemophilus influenzae (1.83 Mb)
      - 1996: Saccharomyces cerevisiae (12.1 Mb)
      - 1998: Caenorhabditis elegans (100 Mb)
      - 2000: Arabidopsis thaliana (157 Mb)
      - 2001: Homo sapiens (3.2 Gb)
      - 2002: Oryza sativa (420 Mb)
      - 2006: Populus trichocarpa (550 Mb)
      - 2007: Vitis vinifera (487 Mb)
      - 2008: Physcomitrella (480 Mb); Carica papaya (372 Mb)
      - 2009: Sorghum bicolor (730 Mb); Zea mays (2.3 Gb); Cucumis sativus (367 Mb)
      - 2010: Ectocarpus (214 Mb); Malus x domestica (742 Mb); Glycine max (1.1 Gb)
      - 2011: Pigeonpea, Potato, Cannabis, A. lyrata, Cacao, Strawberry, Medicago
      - 2012: Mei, Benthamiana, Tomato, Setaria, Melon, Flax, T. salsuginea, Banana, Cotton, Orange, Pear
      Sequencing technologies across this period: Sanger sequencing, 454, Solexa/Illumina, SOLiD, Ion Torrent (IonT), PacBio (PB).
  11. 1977, MS2 bacteriophage (3.658 Kb): no software used.
  12. 1995, Haemophilus influenzae (1.83 Mb). Software: TIGR Assembler. Hardware: SPARCenter 2000 (512 MB RAM).
  13. TIGR Assembler follows an overlapping methodology based on Smith-Waterman alignments. Sutton GG. et al. TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects. Genome Science and Technology, 1995, 1:9-19.
  14. http://www.ebi.ac.uk/training/online/course/nucleotide-sequence-data-resources-ebi/what-ena/how-sequence-assembled
  15. 2001, Homo sapiens (3.2 Gb), overlapping methodology. BAC-by-BAC sequencing: International Human Genome Sequencing Consortium. Initial Sequencing and Analysis of the Human Genome. Nature. 2001. 409:860-921. Whole Genome Shotgun (WGS) sequencing: Venter JC. et al. The Sequence of the Human Genome. Science. 2001. 291:1304-1351.
  16. 2001, Homo sapiens (3.2 Gb), WGS sequencing (Venter JC. et al. 2001). Software: WGA Assembler (CABOG). Hardware: 40 AlphaSMP machines (4 GB RAM and 4 cores each; total = 160 GB RAM and 160 cores); 5 days.
  17. 2009, Ailuropoda melanoleuca (2.3 Gb), de Bruijn graph methodology. Whole Genome Shotgun (WGS) sequencing. Li R. et al. The Sequence and De Novo Assembly of the Giant Panda Genome. Nature. 2009. 463:311-317. Software: SOAPdenovo. Hardware: supercomputer with 32 cores and 512 GB RAM.
  18. What is a Kmer? A specific n-tuple or n-gram of nucleic acid or amino acid sequences (Wikipedia), where an n-tuple is an ordered list of elements and an n-gram is a contiguous sequence of n items from a given sequence of text. Example: sequence A, with 24 nt, ATGCGCAGTGGAGAGAGAGCGATG, contains 5 Kmers of 20 nt:
      ATGCGCAGTGGAGAGAGAGC
      TGCGCAGTGGAGAGAGAGCG
      GCGCAGTGGAGAGAGAGCGA
      CGCAGTGGAGAGAGAGCGAT
      GCAGTGGAGAGAGAGCGATG
      N_kmers = L_read - Kmer_size + 1
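The Kmer definition above can be sketched in a few lines of Python. This is only an illustration: the function name `kmers` is a hypothetical helper, not part of any tool mentioned in the slides. It enumerates every window of length k, so a sequence of length L yields L - k + 1 Kmers.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every k-mer of `sequence`, ordered by start position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length L has L - k + 1 overlapping windows of size k
    # (none at all when k is larger than the sequence).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    read = "ATGCGCAGTGGAGAGAGAGCGATG"  # example sequence from the slide
    for kmer in kmers(read, 20):
        print(kmer)
    print(len(read), "nt ->", len(kmers(read, 20)), "kmers of size 20")
```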
  19. Compeau PEC. et al. How to apply de Bruijn graphs to genome assembly. Nature Biotech. 2011. 29:987-991
  20. The OLC (Overlap-Layout-Consensus) algorithm is more suitable for low-coverage long reads, whereas the DBG (de Bruijn graph) algorithm is more suitable for high-coverage short reads, especially for large genome assemblies.
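To make the DBG idea concrete, here is a minimal sketch of how a de Bruijn graph can be built from reads: every Kmer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name `de_bruijn_edges` and the toy reads are assumptions for illustration; real assemblers add error removal, reverse complements and graph simplification on top of this.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes following it.

    Each k-mer found in a read contributes one edge, so an edge seen in
    several reads appears several times (its multiplicity).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    toy_reads = ["ACGTC", "CGTCA"]  # hypothetical overlapping reads
    for node, successors in de_bruijn_edges(toy_reads, 3).items():
        print(node, "->", successors)
```

Because the graph is built from Kmers rather than from pairwise read overlaps, its size depends on the number of distinct Kmers, which is one reason the approach scales to high-coverage short-read data.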
  22. Section 2. Sequencing, tools and computers. 2.0 Overview of a sequencing project (assembly):
      - Decisions during the experimental design: technology, library preparation, sequencing amount.
      - Decisions during the assembly optimization (software; identity %, Kmer, ...):
        - Fastq raw → Fastq processed: reads processing and filtering. 1. Low quality reads (qscore) (Q30). 2. Short reads (L50). 3. PCR duplications (only genomes). 4. Contaminations. 5. Corrections.
        - Fastq processed → contigs: consensus.
        - Contigs → scaffolds: scaffolding.
        - Scaffolds → scaffolds: gap filling.
        - Post-assembly filtering.
  24. 2.1 Technologies

      | Technology | Run time | Sequence length | Reads/run | Total nucleotides sequenced per run |
      |---|---|---|---|---|
      | Capillary sequencing (ABI 3730) | ~2.5 h | 800 bp | 386 | 0.308 Mb |
      | 454 Pyrosequencing (GS FLX Titanium XL+) | ~23 h | 700 bp | 1,000,000 | 700 Mb (0.7 Gb) |
      | Illumina (HiSeq X Ten) | 72 h (3 days) | 2 x 150 bp | 6,000,000,000 | 1,600,000 Mb (1,600 Gb) |
      | Illumina (MiSeq) | 65 h | 2 x 300 bp | 2 x 22,000,000 | 13,500 Mb (13.5 Gb) |
      | SOLiD (5500xl system) | 120 h (5 days) | 2 x 60 bp | 400,000,000 | 300,000 Mb (300 Gb) |
      | Ion Torrent (Ion Proton I) | 2 h | 200 bp | 60,000,000 | 10,000 Mb (10 Gb) |
      | PacBio (PacBio RS II) | 1.5 h | ~8,500 bp | 50,000 | 375 Mb (0.37 Gb) |
  25. 2.1 Technologies: strengths and weaknesses
      - 454 Pyrosequencing (GS FLX Titanium XL+)
        - Strengths: long reads (450/700 bp); long insert for mate pair libraries (20 Kb); low observed raw error rate (0.1%); low percentage of PCR duplications for mate pair libraries.
        - Weaknesses: homopolymer errors; low sequence yield per run (0.7 Gb); preferred assembler (gsAssembler) uses overlapping methodology.
      - Illumina (HiSeq 2500)
        - Strengths: high sequence yield per run (600 Gb); low observed raw error rate (0.26%).
        - Weaknesses: high percentage of PCR duplications for mate pair libraries; long run time (11 days); high instrument cost (~$650K).
      - Illumina (MiSeq)
        - Strengths: medium read size (250 bp); faster run than Illumina HiSeq.
        - Weaknesses: medium sequence yield per run (8.5 Gb).
      - SOLiD (5500xl system)
        - Strengths: 2-base encoding reduces the observed raw error rate (0.06%).
        - Weaknesses: 2-base color coding makes sequence manipulation and assembly difficult; short reads (75 bp).
      - Ion Torrent (Ion Proton I)
        - Strengths: fast run (2 hours); low instrument cost ($80K).
        - Weaknesses: medium read size (200 bp); medium sequence yield per run (10 Gb); medium observed raw error rate (1.7%).
      - PacBio (PacBio RS)
        - Strengths: long reads (3000 bp); fast run (2 hours).
        - Weaknesses: very high observed raw error rate (12.7%); high instrument cost (~$700K); no paired-end/mate-pair reads.
  26. 2.2 Libraries. Library types (orientations):
      • Single reads
      • Paired ends (PE) (150-800 bp insert size)
      • Mate pairs (MP) (2-40 Kb insert size)
      [Figure: forward (F) and reverse (R) read orientations for 454/Roche and Illumina libraries.]
  27. 2.2 Libraries. Why is the pair information important? In de novo assembly, reads are assembled into consensus sequences (contigs); paired-read information links contigs into scaffolds (or supercontigs), with the gaps represented as Ns (NNNNN); genetic information (markers) places scaffolds into pseudomolecules (or ultracontigs).
  28. 2.3 Sequencing Amount (slide from MC. Schatz).
  29. 2.3 Sequencing Amount. Depending on the genome complexity, the technology used and the assembler:
      • More is better (if you have enough computational resources):
        - Sanger > 10X (less for BAC-by-BAC approaches).
        - 454 > 20X
        - Illumina > 100X
      • Polyploidy or high heterozygosity increases the number of reads needed.
      • The use of different library types (paired ends and mate pairs with different insert sizes) is essential.
      • Longer reads are preferable.
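The coverage targets above (10X, 20X, 100X) translate into a number of reads through the usual relation coverage = reads x read length / genome size. The sketch below is a back-of-the-envelope helper; the function names and the 1.4 Gb / 100 bp figures in the usage part are illustrative assumptions, not values from the slides.

```python
import math


def expected_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth: total sequenced bases divided by the genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp: int, read_length_bp: int, target_coverage: float) -> int:
    """Smallest number of reads whose total bases reach the target coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    genome = 1_400_000_000  # hypothetical 1.4 Gb genome
    n = reads_needed(genome, read_length_bp=100, target_coverage=100)
    print(f"{n:,} reads of 100 bp give about {expected_coverage(n, 100, genome):.0f}X")
```

The estimate ignores reads lost to quality filtering, duplicates and contamination, so the raw sequencing amount usually has to be somewhat higher.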
  32. 2.4 Tools: Read Quality Evaluation. a) Length of the read. b) Bases with qscore > 20 or 30. • FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
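As an illustration of what a qscore threshold means, the following sketch decodes a FASTQ quality line and reports the fraction of bases at or above a threshold. It assumes the Phred+33 encoding (Sanger / Illumina 1.8+ style); the function names are hypothetical and this is not how FastQC is implemented.

```python
def phred_scores(quality_line: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def fraction_at_least(quality_line: str, threshold: int = 20, offset: int = 33) -> float:
    """Fraction of bases whose Phred score is >= threshold."""
    scores = phred_scores(quality_line, offset)
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


if __name__ == "__main__":
    quality = "IIIIIIII5555####"  # made-up quality line
    print(phred_scores(quality))
    print(fraction_at_least(quality, threshold=20))
    print(fraction_at_least(quality, threshold=30))
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means 1 error in 100 bases and Q30 means 1 in 1000.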
  33. 2.4 Tools: Read Trimming and Filtering
      • Fastx-Toolkit (http://hannonlab.cshl.edu)
      • Ea-Utils (http://code.google.com/p/ea-utils/)
      • PrinSeq (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi)

      | Software | Multiplexing | Trimming/Filtering |
      |---|---|---|
      | Fastx-Toolkit | fastx_barcode_splitter | fastq_quality_filter |
      | Ea-Utils | fastq-multx | fastq-mcf |
      | PrinSeq | PrinSeq | PrinSeq |

      Steps: 1. Adaptor removal. 2. Low quality reads (qscore) (Q30). 3. Short reads (L50). 4. PCR duplications (only genomes; use PrinSeq).
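The filtering steps listed above (low quality, short reads, duplicates) can be sketched as a single pass over the reads. This is a deliberately simplified illustration, assuming Phred+33 qualities and treating only exact sequence duplicates as PCR duplicates; the function name, record layout and default thresholds are hypothetical and do not reproduce the behaviour of Fastx-Toolkit, Ea-Utils or PrinSeq.

```python
def filter_reads(records, min_length=50, min_mean_q=30, offset=33):
    """Keep reads that are long enough, good enough and not exact duplicates.

    `records` is an iterable of (name, sequence, quality_line) tuples.
    """
    seen_sequences = set()
    kept = []
    for name, seq, qual in records:
        if not seq or len(seq) < min_length:
            continue  # short (or empty) read
        mean_q = sum(ord(ch) - offset for ch in qual) / len(qual)
        if mean_q < min_mean_q:
            continue  # low average quality
        if seq in seen_sequences:
            continue  # exact duplicate of a read already kept
        seen_sequences.add(seq)
        kept.append((name, seq, qual))
    return kept


if __name__ == "__main__":
    toy = [
        ("r1", "ACGT", "IIII"),
        ("r2", "ACGT", "IIII"),  # duplicate of r1
        ("r3", "AC", "II"),      # too short
        ("r4", "TTTT", "####"),  # low quality
    ]
    print(filter_reads(toy, min_length=4))
```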
  34. 2.4 Tools: Contaminations. Contaminations can be removed by mapping the reads against a reference containing the likely contaminants, such as the E. coli and human genomes. The most common tools are Bowtie or BWA (for short reads) and Blast (for long reads).
  35. 2.4 Tools: Read Corrections. Read corrections are generally based on Kmer analysis. Medvedev P. et al. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics. 2011. 27 (13):i137-i141
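The intuition behind Kmer-based correction is that a Kmer containing a sequencing error is seen far less often than a genuine one. The sketch below only flags such low-count ("weak") Kmers; it does not correct them, and it is not the algorithm of any specific tool. The function names and the `min_count` threshold are assumptions for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(reads, k, min_count=2):
    """K-mers seen fewer than `min_count` times: candidate sequencing errors."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n < min_count}


if __name__ == "__main__":
    # Three identical reads plus one read with a T->A substitution (made up).
    toy_reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
    print(sorted(weak_kmers(toy_reads, k=4)))
```

With uneven coverage a fixed count threshold can discard genuine low-coverage Kmers, which is the problem the reference cited above addresses.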
  36. 2.4 Tools: Read Corrections. Usual suspects:
      • Quake (http://www.cbcb.umd.edu/software/quake/index.html)
      • Reptile (http://aluru-sun.ece.iastate.edu/doku.php?id=software)
      • ECHO (http://uc-echo.sourceforge.net/)
      • Corrector (http://soap.genomics.org.cn/soapdenovo.html)
  38. 2.4 Tools: Assemblers

      | Assembler | Type | Technology used | Features | Link |
      |---|---|---|---|---|
      | Arachne | Overlap-layout-consensus | Sanger, 454 | Highly configurable | http://www.broadinstitute.org/crd/wiki/index.php/Main_Page |
      | CABOG | Overlap-layout-consensus | Sanger, 454, Illumina | Highly configurable | http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page |
      | MIRA | Overlap-layout-consensus | Sanger, 454 | Highly configurable | http://sourceforge.net/apps/mediawiki/mira-assembler |
      | gsAssembler | Overlap-layout-consensus | Sanger, 454 | Easy to use | http://454.com/products/analysis-software/index.asp |
      | iAssembler | Overlap-layout-consensus | Sanger, 454 | Improves MIRA | http://bioinfo.bti.cornell.edu/tool/iAssembler |
      | ABySS | de Bruijn graph | 454 or Illumina | Easy to use | http://www.bcgsc.ca/platform/bioinfo/software/abyss |
      | ALLPATHS-LG | de Bruijn graph | 454 or Illumina | Good results | http://www.broadinstitute.org/software/allpaths-lg/blog |
      | Ray | de Bruijn graph | 454 or Illumina | Slow but uses less memory | http://denovoassembler.sf.net/ |
      | SOAPdenovo | de Bruijn graph | 454 or Illumina | Fastest | http://soap.genomics.org.cn/soapdenovo.html |
      | Velvet | de Bruijn graph | 454 or Illumina or SOLiD | SOLiD | http://www.ebi.ac.uk/~zerbino/velvet/ |
  40. 2.4 Tools: Assemblers. There are more assemblers and more information available. Take a look at SeqAnswers: http://seqanswers.com/wiki/Special:BrowseData/Bioinformatics_application?Bioinformatics_method=Assembly&Biological_domain=De-novo_assembly
  42. 2.4 Tools: Gap Filling

      | Tool | Type | Technology used | Features | Link |
      |---|---|---|---|---|
      | GapCloser | Overlap-layout-consensus | Illumina | C++; free | http://soap.genomics.org.cn/soapdenovo.html |
      | GapFiller | Overlap-layout-consensus | Any | Perl; commercial* | http://www.baseclear.com/landingpages/basetools-a-wide-range-of-bioinformatics-solutions/gapfiller/ |
      | PBSuite | Overlap-layout-consensus | 454, PacBio | Python; long reads | http://sourceforge.net/p/pb-jelly/wiki/ |
  43. 2.5 Computers. Bigger is better. How much you need depends on:
      ➡ How big is your genome? A human-size genome (~3 Gb) requires ~256 GB to 1 TB of memory.
      ➡ How many reads do you have? More reads need more memory and hard disk.
      ➡ What software are you going to use? OLC uses more memory and time than DBG.
      ➡ What parameters are you going to use? Bigger Kmers need more memory.
      Four options:
      1. Buy your own server (512 GB RAM, 4.5 TB disk, 64 cores; ~$15,000).
      2. Rent a server for ~1 month (same specs; $1.5/h; ~$1,000) (CBSU).
      3. Use a supercomputing center associated with NSF, NIH, USDA... where they offer reduced prices (iPlant, Indiana University...).
      4. Collaborate with a group that has a big server.
  44. 2.6 Assembly evaluation. Several assemblies will be generated during the assembly optimization. The most used parameters to evaluate an assembly are:
      1. Total assembly size: how far is this value from the estimated genome size?
      2. Total number of sequences (scaffolds/contigs): how far is this value from the number of chromosomes?
      3. Longest scaffold/contig.
      4. Average scaffold/contig size.
      5. N50/L50 (or any other N/L): with the sequences sorted by size from biggest to smallest, the number of sequences (N) and the minimum size among them (L) that together represent 50% of the assembly. Many tools and papers use the opposite naming (N50 for the length, L50 for the count).
  46. 2.6 Assembly evaluation: N50/L50. Total assembly size: 1000 Mb. Sequences ordered by descending size (Mb): 93, 87, 75, 68, 62, 56, 50, 44, 37, 30, 25, 20, 18, ... 50% of the assembly: 500 Mb. The first 7 sequences add up to 491 Mb, so the 8th is needed to reach 500 Mb: N50 = 8 sequences, L50 = 44 Mb.
  47. 2.6 Assembly evaluation: N90/L90. Total assembly size: 1000 Mb. Sequences ordered by descending size (Mb): 93, 87, 75, 68, 62, 56, 50, 44, 37, 30, 25, 20, 18, ... 90% of the assembly: 900 Mb. N90 = 29 sequences, L90 = 12.5 Mb.
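The N/L statistics can be computed with a short cumulative sum. The sketch below follows the naming used in these slides (N = number of sequences, L = size of the smallest sequence in that set); the function name `nx_lx` and the toy lengths are assumptions for illustration.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (count, min_length) for the largest sequences covering `fraction`.

    Sequences are sorted from biggest to smallest and accumulated until their
    total reaches `fraction` of the assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count, length
    raise ValueError("no sequences given")


if __name__ == "__main__":
    toy_lengths = [80, 70, 50, 40, 30, 20, 10]  # hypothetical sizes in Mb, total 300
    print("N50/L50:", nx_lx(toy_lengths, 0.5))
    print("N90/L90:", nx_lx(toy_lengths, 0.9))
```

When comparing assemblies, the same minimum sequence length cutoff should be applied to all of them, because many tiny contigs change the counts considerably.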
  48. 2.6 Assembly evaluation. P. axillaris, current assembly Peaxi v1.6.2:

      | Dataset | Contigs | Scaffolds |
      |---|---|---|
      | Total assembly size (Gb) | 1.22 | 1.26 |
      | Total assembly sequences | 109,892 | 83,639 |
      | Longest sequence (Mb) | 0.57 | 8.56 |
      | Sequence length mean (Kb) | 11.08 | 15.05 |
      | N90 (sequences) | 13,481 | 1,051 |
      | L90 (Kb) | 22.28 | 295.75 |
      | N50 (sequences) | 3,943 | 309 |
      | L50 (Kb) | 95.17 | 1,236.73 |
  49. 2.7 Assembly stages. Five assembly stages:

      | Whole genome representation | Sequence status | Genes | Usability |
      |---|---|---|---|
      | Incomplete for non-repetitive regions | Small scaffolds and contigs | Incomplete genes | Markers development |
      | Complete for non-repetitive regions | Medium scaffolds and contigs | Complete, but 1-2 genes/contig | Gene mining |
      | Complete for non-repetitive regions | Large scaffolds and contigs | Several dozens of genes/contig | Microsynteny |
      | Complete for almost the whole genome | Pseudomolecules | Hundreds of genes/contig | Any (synteny, candidate genes by QTLs) |
      | Complete genome | Pseudomolecules | Thousands of genes/contig | |
  50. 2.7 Assembly stages. Examples:
      - Arabidopsis thaliana (TAIR10): 5 chromosomes with 119 Mb; 96 gaps with 185 Kb total.
      - Solanum lycopersicum (v2.5): 12 chromosomes with 824 Mb; 26,866 gaps with 82 Mb total.
      - Petunia axillaris (v1.6.2): 83,639 scaffolds with 1.26 Gb; 26,268 gaps with 41 Mb total.
      - Nicotiana benthamiana (v0.4.4): 141,339 scaffolds with 2.63 Gb*; 337,433 gaps with 162 Mb total. *0.61-0.67 Gb not assembled.
  52. 2.7 Assembly stages: genome size vs assembly size.
  53. Genome size from cytogenetic data (flow cytometry): ✴ http://data.kew.org/cvalues. Example: Petunia axillaris genome size 1.37 Gb (White and Rees, 1987).
  54. Genome size from sequencing data (Kmer count):
      • Jellyfish (http://www.genome.umd.edu/jellyfish.html)
      • BFCounter (http://pritchardlab.stanford.edu/bfcounter.html)
  55. 2.7 Assembly stages: sequencing data (Kmer count). [Figure: 63-mer frequency histogram for Petunia axillaris; x axis: coverage (0 to 60); y axis: Kmer count (0 to 1e+08); the Kmers at the lowest coverage are possible sequencing errors; peak = 31; estimated genome size = 1.38 Gb.] G = (N − B) * K / D * 2, where G = genome size, N = total kmers, B = possible kmers with errors, K = kmer length, D = kmer peak.
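The idea of estimating the genome size from a Kmer histogram can be sketched as follows. Note the assumption: this uses a commonly quoted simplification (trusted Kmer occurrences divided by the peak depth) and does not reproduce the K and 2 factors of the formula quoted on the slide. The function name, the `error_cutoff` parameter and the toy histogram are hypothetical.

```python
def estimate_genome_size(histogram, error_cutoff):
    """Estimate genome size from a k-mer depth histogram.

    `histogram` maps depth -> number of distinct k-mers seen at that depth.
    K-mers below `error_cutoff` are treated as sequencing errors and ignored.
    Returns (estimated_size, peak_depth).
    """
    trusted = {depth: n for depth, n in histogram.items() if depth >= error_cutoff}
    if not trusted:
        raise ValueError("no k-mers above the error cutoff")
    peak_depth = max(trusted, key=trusted.get)  # depth with most distinct k-mers
    total_kmers = sum(depth * n for depth, n in trusted.items())
    return total_kmers / peak_depth, peak_depth


if __name__ == "__main__":
    # Made-up histogram: an error spike at depth 1-2 and a main peak at depth 10.
    toy_histogram = {1: 5000, 2: 800, 9: 100, 10: 300, 11: 100}
    size, peak = estimate_genome_size(toy_histogram, error_cutoff=3)
    print(f"peak depth = {peak}, estimated size = {size:.0f} positions")
```

In practice the histogram would come from a Kmer counter such as Jellyfish, and heterozygosity or repeats add extra peaks that this simple estimate does not model.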
  56. 2.7 Assembly stages. Estimated genome size for P. axillaris:
      • Flow cytometry (White & Rees, 1987) = 1.37 Gb
      • Kmer count* = 1.38 Gb
      • Assembly size (scaffolds v1.6.2) = 1.26 Gb
      • Assembly size (contigs v1.6.2) = 1.22 Gb
      Estimated genome not assembled = 110 - 120 Mb
  58. Section 3. Things that you should know about genomes.
      1. They have variable size: for example, angiosperm genomes range from 63 Mb (Genlisea margaretae) to 150 Gb (Paris japonica), with an average of 5.6 Gb. More data at: ✴ Plants: http://data.kew.org/cvalues/ ✴ Animals: http://www.genomesize.com/
      2. They can be polyploid. This means that highly similar homoeologous regions will collapse during the assembly.
      3. They can be highly heterozygous and polymorphic. In this case some of the alleles will collapse and some will not. The effective coverage will be lower than expected.
      4. They can have recent whole genome duplication (or triplication) events. Paralogous genes may collapse during the assembly.
      5. Repeats: most genomes are full of repeats, and they are difficult to resolve. By default, assemblers throw them away based on the Kmer histogram.
  59. 3.1 Collapsing problem. Causes: ✴ polyploidy; ✴ whole genome duplication (Gene 1 becomes Gene 1 A and Gene 1 B). Reads from Gene 1 A and Gene 1 B (CACTTGACGACATGACG, CTTGACGACATGACGAC, CCCTTGACGACATGACG, CGCCCTTGACGACATGA) are merged into a single collapsed consensus (Gene 1 A + Gene 1 B): CGCCCTTGACGACATGACGACA. Schmutz J et al. Genome Sequence of the Palaeopolyploid Soybean. Nature 2010 463:178-183
  60. 3.1 Collapsing problem: whole genome duplication. http://chibba.agtec.uga.edu/duplication/index/home
  61. 3.1 Collapsing evaluation. Read collapsing during the assembly process can be evaluated by mapping the reads back to the consensus sequence and analyzing SNPs. Reads: CACTTGACGACATGACG, CTTGACGACATGACGAC, CCCTTGACGACATGACG, CGCCCTTGACGACATGA, CACTTGACGACATGA. Consensus: CGCCCTTGACGACATGACGACA. SNPs = heterozygous positions (AB) + homoeolog collapsing (AAAB, AABB, ABBB) + paralog collapsing (variable).
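One way to read the SNP classes above is through the allele ratio observed at each SNP once the reads are mapped back: a 1:1 ratio fits AB or AABB, a 1:3 ratio fits AAAB or ABBB, and anything else is harder to interpret. The sketch below applies that reading to per-SNP read counts. The function names, the tolerance and the tetraploid-style classes are assumptions for illustration, not a published method.

```python
def minor_allele_fraction(ref_count: int, alt_count: int) -> float:
    """Fraction of reads supporting the less frequent allele at a SNP."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no reads cover this position")
    return min(ref_count, alt_count) / total


def classify_snp(ref_count: int, alt_count: int, tolerance: float = 0.05) -> str:
    """Assign a SNP to an allele-ratio class based on read support."""
    fraction = minor_allele_fraction(ref_count, alt_count)
    if abs(fraction - 0.5) <= tolerance:
        return "1:1 (AB or AABB)"
    if abs(fraction - 0.25) <= tolerance:
        return "1:3 (AAAB or ABBB)"
    return "other (possible paralog collapsing or sequencing error)"


if __name__ == "__main__":
    # Hypothetical (reference reads, alternative reads) counts at three SNPs.
    for ref, alt in [(20, 20), (30, 10), (38, 2)]:
        print(ref, alt, "->", classify_snp(ref, alt))
```

Read depth matters here: at low coverage the observed ratios are too noisy to separate the classes reliably.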
  62. 3.1 Collapsing solutions
      1. Sequence the two progenitors and use them as a reference.
         ✓ Approach used previously (cotton genome; Paterson A. et al. 2012).
         ❖ Information not contained in the references will be lost.
         ❖ Progenitors are not always available.
      2. Use single molecule technologies such as PacBio or Moleculo.
         ✓ Best approach, because it gives the right phase for long sequence chunks. Software exists to integrate long sequences (minimus2...).
         ❖ Experimental and probably expensive.
      3. Assemble everything with a higher Kmer, evaluate the collapsed regions and rescue the haplotype/region using SNPs and paired-end read information.
         ✓ Cheap and easy to apply with current technologies.
         ❖ No software available.
  63. 3.2 Repeats and assemblers. Most genomes are full of repeats, and they are difficult to resolve. By default, some assemblers throw them away based on the Kmer histogram. For example, SOAPdenovo throws away anything up to 255. It is convenient to do a Kmer analysis using Jellyfish (or another Kmer counter) before doing any assembly, in order to analyze contaminations (bacterial contaminations may appear with high Kmer content) and repeats. Software to analyze repeats based on sequence graphs: RepeatExplorer (http://repeatexplorer.umbr.cas.cz/). Novak, P., Neumann, P. & Macas, J. Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11, 378 (2010)
  64. Acknowledgements: Petunia Genome Sequencing Consortium.
      - NIU (USA): Tom Sims, Mitrick Jones
      - BTI (USA): Lukas Mueller, Aureliano Bombarely
      - UNIBE (Switzerland): Cris Kuhlemeier, Remy Bruggmann, Michel Moser
      - UV (Italy): Massimo Delledonne, Mario Pezzotti
      - RU and VU (Netherlands): Tom Gerats, Francesca Quattrochio
      - BGI (China)