SeRC: de novo assembly workshop. Francesco Vezzi

De novo assembly, a multi-technology approach: Illumina, PacBio, and OpGen.

A multi-technology perspective on de novo assembly projects.

SeRC: de novo assembly workshop. Francesco Vezzi Presentation Transcript

  • 1. De novo assembly, a multi-technology approach: Illumina, PacBio, and OpGen. Francesco Vezzi, PhD, Senior Bioinformatician, NGI-Stockholm
  • 2. Both Stockholm and Uppsala nodes
      • Illumina HiSeq 2000/2500: 16
      • Illumina MiSeq: 3
      • Life Technologies SOLiD 5500xl: 4
      • Life Technologies SOLiD 5500wildfire: 2
      • Life Technologies Ion Torrent: 2
      • Life Technologies Ion Proton: 6
      • Life Technologies Sanger ABI3730: 2
      • Pacific Biosciences RSII: 1
      • Argus Whole Genome Mapping System: 1
    One of the 3 best-equipped sequencing sites in Europe
  • 3. In this talk
    Illumina (Stockholm):
      • 100/150 bp paired reads (low error rate)
      • 900/200 Gbp in 6/2 day(s)
    PacBio (Uppsala):
      • 8.5 Kbp reads (max 30 Kbp, high error rate)
      • 375 Mbp (1 SMRT Cell) in 10 hours
    OpGen Argus System (Stockholm):
      • ~300 Kbp maps
      • 10 Gbp in ~1 day
  • 4. Optical Maps
      • Restriction Map
          ◦ Representation of the cut sites on a given DNA molecule to provide spatial information of genetic loci
      • An enzyme is selected and used to cut the molecules. This provides a 2D representation of the molecule structure
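The idea of a restriction map can be illustrated with a small in silico digest. This is a minimal sketch, not the Argus software: the function name `restriction_map`, the toy sequence, and the choice of BamHI (recognition site GGATCC, cutting after the first G) are assumptions made for illustration only.

```python
def restriction_map(sequence, site, cut_offset=0):
    """Return (cut_positions, fragment_lengths) for a linear DNA molecule.

    cut_offset is the number of bases from the start of the recognition
    site to the cut position (hypothetical parameter for this sketch).
    """
    sequence = sequence.upper()
    site = site.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        # Move one base forward so overlapping sites are also found.
        start = sequence.find(site, start + 1)
    # Fragment lengths are the distances between consecutive boundaries.
    boundaries = [0] + cuts + [len(sequence)]
    fragments = [b - a for a, b in zip(boundaries, boundaries[1:])]
    return cuts, fragments


# Toy molecule with two GGATCC sites (illustrative data only).
toy = "AAAAGGATCCTTTTTTGGATCCCC"
cuts, fragments = restriction_map(toy, "GGATCC", cut_offset=1)
print(cuts, fragments)
```

The ordered list of fragment lengths is the information an optical map carries: it can be compared against the in silico digest of assembled contigs to order and orient them.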
  • 5. Optical Maps: workflow
    Steps                                    Time    Notes
    DNA extraction directly from culture     3-8h
    Quality control of extracted material    1h
    Prepare a chip                           1.5h
    Run Argus System                         1h
    Data assembly                            2-8h
  • 6. Closing genomes with Optical Maps
      • De novo reconstructs parts missing in the reference strain
      • Correctly assembles long tandem repeats
    De novo assembly (Illumina, PacBio): set of unordered and unoriented contigs
    Optical Map Contigs
  • 7. Case Study: Combining all the technologies
    ~15 Mbp genome sequenced at high coverage with:
      • Illumina HiSeq:
          ◦ 500X PE libraries (180 bp and 650 bp insert)
          ◦ 150X MP library (3 Kbp)
          ◦ 150X MP library (7 Kbp)
      • PacBio
          ◦ 50/60X with reads longer than 2 Kbp
      • OpGen
          ◦ 3 chips (only one worked really well)
          ◦ 300X coverage
          ◦ Average map length 320 Kbp
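To put the coverage figures of the case study into perspective, the arithmetic coverage = sequenced bases / genome size can be inverted. The sketch below is only a back-of-the-envelope calculation: the genome size is rounded to 15 Mbp, the 2 x 100 bp read-pair length is an assumption taken from the read lengths quoted earlier in the talk, and the function names are hypothetical.

```python
GENOME_SIZE_BP = 15_000_000  # "~15 Mbp genome", rounded (assumption)


def bases_needed(coverage, genome_size_bp=GENOME_SIZE_BP):
    """Total bases required to reach a given fold coverage."""
    return coverage * genome_size_bp


def units_needed(coverage, unit_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Number of reads, read pairs or maps of a given length (rounded up)."""
    total = bases_needed(coverage, genome_size_bp)
    return -(-total // unit_length_bp)  # integer ceiling division


# 500X of paired-end data, assuming 2 x 100 bp per read pair.
pe_pairs = units_needed(500, 2 * 100)
# 300X of optical maps with an average map length of 320 Kbp.
optical_maps = units_needed(300, 320_000)

print(bases_needed(500), pe_pairs, optical_maps)
```

Under these assumptions, 500X corresponds to 7.5 Gbp of Illumina data, which is a small fraction of a single HiSeq run.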
  • 8. Assembly Strategy
    https://github.com/vezzi/de_novo_scilife
    Semi-automated pipeline for de novo assembly:
      • Global configuration file: tools and system configuration
      • Sample configuration file: samples description
    3 modules:
      1. QC-module (Illumina only):
          • Adaptor removal, kmer-analysis, fastqc, (insert size estimation)
      2. Assemble-module (Illumina only):
          • Runs specified assemblers and outputs executed commands
      3. Validation-module:
          • FRCbam, coverage analysis, GC-analysis, (N50)
    I NEED USERS/FEEDBACK/CONTRIBUTIONS
  • 9. QC-Module
    Kmer analysis:
      • Samples complexity
      • Error rate
      • Heterozygosity
    FASTQC
    Adaptor removal
    Alignment (partial assembly)
    [Figure: Insert Size Histogram for All_Reads in file lib_3000.bam; x axis: Insert Size; y axis: Count; series: FR, RF, TANDEM]
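The k-mer analysis mentioned in the QC-module can be sketched as follows. This is an illustrative toy, not the code used by the pipeline: real analyses rely on dedicated k-mer counters, and the function names and the two example reads are assumptions. For simplicity the sketch does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers with that multiplicity."""
    return Counter(counts.values())


# Two toy reads differing by a single base (illustrative data only).
reads = ["ACGTACGT", "ACGTTCGT"]
spectrum = kmer_spectrum(kmer_counts(reads, 4))
print(sorted(spectrum.items()))
```

In a real data set, sequencing errors show up as a large number of k-mers with very low multiplicity, while heterozygosity typically produces a second peak at about half the main coverage peak.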
  • 10. Assemble-Module
    Illumina only:
      • SOAPdenovo
      • MaSuRCA
      • Allpaths-LG
    PacBio only:
      • HGAP
      • CABOG
    Hybrid:
      • PB-Jelly (HAH)

    >5000         #scaffolds  totalLength  maxContigLength     N50     N80  percentageNs
    Allpaths-LG          227     14513103           596012  139364   57619           15%
    MaSuRCA              163     18549484          1188669  526519  282507            2%
    HGAP                 290     14399273           763592  142483   37117            0%
    PB-Jelly             179     14718213           747750  195225   85127           13%

      • Trial-and-error process
      • Automated pipeline developed in order to streamline these analyses
      • MaSuRCA surprisingly the “best” assembler
  • 11. Validation-Module: MaSuRCA, HGAP, PB-Jelly (HAH)
  • 12. Validation-Module: FRCbam. The PacBio-only assembly clearly outperforms the others
  • 13. Optical Maps
    PacBio produces the best assembly; however, 290 contigs are produced. Optical maps made it possible to obtain the 2D representation of the 7 chromosomes. N.B. chromosome number was one of the biological questions of this project! But much more can be done!
  • 14. Incredible tool to finish (or almost finish) genomes
    Combining all the technologies

    Assemblies                      % contigs placed  Total size of placed contigs  % size placed contigs  % genome covered
    PacBio+OpGen                               94.12                      11578995                    97%             77.05
    Allpaths+OpGen                             71.88                      10692027                    84%             52.88
    Allpaths+MaSuRCA+OpGen                     80.65                      27506424                    92%             69.64
    Allpaths+PacBio+OpGen                      82.32                      22271022                    91%             83.05
    MaSuRCA+PacBio+OpGen                       94.44                      28393392                    98%             83.79
    Allpaths+MaSuRCA+PacBio+OpGen              85.42                      39085419                    94%             87.39
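The first three columns of the table above can be read as simple ratios over the contigs that were aligned to the optical map. The sketch below shows one plausible way to compute them; the definitions are an interpretation of the column headers, not taken from the talk, and the function name and the toy contig list are hypothetical.

```python
def placement_stats(contigs):
    """Summarise contig placement on an optical map.

    contigs: iterable of (name, length_bp, placed) tuples, where placed is
    True if the contig was aligned to the map (assumed input format).
    """
    contigs = list(contigs)
    total_count = len(contigs)
    total_size = sum(length for _, length, _ in contigs)
    placed = [(name, length) for name, length, ok in contigs if ok]
    placed_size = sum(length for _, length in placed)
    return {
        "pct_contigs_placed": 100 * len(placed) / total_count,
        "placed_size": placed_size,
        "pct_size_placed": 100 * placed_size / total_size,
    }


# Toy assembly of three contigs, two of them placed (illustrative data only).
toy_contigs = [("c1", 500, True), ("c2", 300, True), ("c3", 200, False)]
stats = placement_stats(toy_contigs)
print(stats)
```

Note that when several assemblies are combined, the total size of placed contigs can exceed the genome size, as in the rows of the table with more than one assembler, because the same region may be represented by contigs from different assemblies.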
  • 15. Conclusions: take-home message
    Attempt to automate the de novo assembly process:
      • https://github.com/vezzi/de_novo_scilife
      • Not 100% automated
    Illumina, PacBio, Hybrid assemblies:
      • PacBio alone seems to produce the best assemblies
      • Hybrid assembly does not seem able to correct merged-assembly problems
    Mixing technologies is always a good idea:
      • Possibility to compensate for technological biases
      • Makes it possible to produce better assemblies
  • 16. Thanks https://github.com/vezzi/de_novo_scilife