Assembly and finishing

  • Almost any sequencing project that needs assembly
  • Cases 1 and 3: reads of the same nature are assembled together, so dedicated assemblers can be used. Case 2: reads from different platforms -> which assembler? -> shredding is one option. 454/Solexa co-assembly is still an open question; Newbler is OK with longer Solexa reads.
  • We need to order our contigs and build scaffolds
  • Primers1 – standard primers to check the template
  • How do you break a broom? It jumps over.
  • You still have gaps after that….
  • 454 reads when aligned cover the entire genome
  • Still in use to this day!
  • Results vary from project to project, depending on GC%
  • Quality of the sequence you work with is even more important than for single genomes
  • Green: k=31 (hash length); purple: k=51
  • Assembly and finishing

    1. 1. Genome Assembly and Finishing. Alla Lapidus, Ph.D., Associate Professor, Fox Chase Cancer Center
    2. 2. A typical microbial (and not only microbial) project. Stages: Sequencing -> Draft assembly -> FINISHING -> Annotation -> Public release. Goals: completely restore the genome; produce a high-quality consensus.
    3. 3. Sequencing Technology at a Glance
    4. 4. Evolution of Microbial Drafts
        • Sanger only: 4x of 3kb plasmids + 4x of 8kb plasmids + 1x of fosmids; ~$50k per 5 Mb genome draft
        • Hybrid Sanger/pyrosequence/Illumina: 4x 8kb Sanger + 15x coverage 454 shotgun + 20x Illumina (quality improvement); ~$35k per 5 Mb genome draft
        • 454 + Solexa: 20x coverage 454 standard + 4x coverage 454 paired end (PE) + 50x coverage Illumina shotgun (quality improvement; gaps); ~$10k per 5 Mb genome
        • Solexa only: low cost; too fragmented; a good assembler is needed!
        • Solexa + PacBio: low cost; better scaffolding
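The "Nx coverage" figures above relate read count, read length and genome size. A minimal sketch of that arithmetic follows; the 400 bp read length in the usage note is an assumption chosen for illustration and does not come from the slides.

```python
def expected_coverage(n_reads, read_len_bp, genome_len_bp):
    """Average depth of coverage: total sequenced bases / genome length."""
    return n_reads * read_len_bp / genome_len_bp


def reads_needed(genome_len_bp, target_coverage, read_len_bp):
    """Smallest whole number of reads that reaches the target coverage."""
    total_bases = genome_len_bp * target_coverage
    # Integer ceiling division, so a partial read counts as a full one.
    return -(-total_bases // read_len_bp)
```

For example, under the assumed 400 bp read length, `reads_needed(5_000_000, 20, 400)` gives the read count for 20x coverage of a 5 Mb genome.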
    5. 5. Process Overview
    6. 6. Library Preparation - Sanger: DNA fragmentation; random fragment DNA
    7. 7. Library Preparation - new
    8. 8. Assembly (assembler)
        • Sanger reads only (phrap, PGA, Arachne)
        • 454/Solexa (Newbler, PCAP, Velvet, ALLPATHS etc)
        • Hybrid Sanger/pyrosequence/Solexa (no special assemblers; use Newbler, PGA, Arachne)
        Figure labels: paired reads from 3kb, 8kb and 40kb inserts; 454 contig; 454 shreds; 8kb read pairs; shotgun reads; PE reads
    9. 9. Draft assembly - what we get
        • Assembly: a set of contigs (labelled 10, 16 and 21 in the figure)
        • Paired ends (PE) give ordered sets of contigs (scaffolds)
        • Gap closing: clone walk (Sanger lib) or PCR - sequence (primers pri1 and pri2, PCR product)
        • New technologies: no clones to walk off, even if you can scaffold contigs (bPCR is a new approach to gap closing)
    10. 10. Primer walking. Captured gaps: clone walk (Clone A). Uncaptured gaps: PCR, then sequence the PCR product (template: gDNA).
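The "454 shreds" in the hybrid case refer to cutting 454 contigs into overlapping pseudo-reads so that a Sanger-oriented assembler can take them as input. A minimal sketch of that idea follows; the function name, shred length and overlap are hypothetical defaults, not values from the slides.

```python
def shred_contig(contig, shred_len=1000, overlap=500):
    """Cut a contig into overlapping pseudo-reads ("shreds").

    shred_len and overlap are illustrative assumptions; the last shred
    may be shorter than shred_len when the contig length does not fit.
    """
    if overlap >= shred_len:
        raise ValueError("overlap must be smaller than shred_len")
    if not contig:
        return []
    step = shred_len - overlap
    shreds = []
    start = 0
    while True:
        shreds.append(contig[start:start + shred_len])
        if start + shred_len >= len(contig):
            break  # this shred reached the end of the contig
        start += step
    return shreds
```

Every base of the contig ends up in at least one shred, and neighbouring shreds share `overlap` bases so the assembler can join them again.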
    11. 11. Why do we have gaps
        • What are gaps? Genome areas not covered by random shotgun.
        • Sequencing coverage may not span all regions of the genome, thus producing gaps in the assembly (colony picking).
        • Assembly of the shotgun reads may produce misassembled regions due to repetitive sequences (new and old tech).
        • A biased base content, which can result in failure to be cloned, poor stability in the chosen host-vector system, or inability of the polymerase to reliably copy the sequence:
            ~ AT-rich DNA clones poorly in bacteria (cloning bias; promoter-like structures {Sanger}) => uncaptured gaps
            ~ GC-rich DNA is difficult to PCR and to sequence and often requires the use of special chemistry => captured gaps
            ~ gaps in high-AT and high-GC regions caused by problematic PCR (new tech)
    12. 12. Assembling repeats Actual genome
    13. 13. High GC sequencing problems: the presence of small hairpins (inverted repeat sequences) in the DNA that reanneal either during sequencing or during electrophoresis, resulting in failed sequencing reactions or unreadable electrophoresis results. (This can be mitigated by adding modifiers to the reaction, sequencing smaller clones, and running gels at higher temperatures in the presence of stronger denaturants.)
    14. 14. Why more than one platform?
        • 454: high-quality, reliable skeletons of genomes (454 std + 454 PE): correctly assembled contigs; problems with repeats (unassembled, or assembled in contigs outside of main scaffolds); homopolymer-related frameshifts
        • Illumina: data is used to improve overall consensus quality, correct frameshifts and close secondary-structure-related gaps; not ready for de novo assembly of complex genomes (too many gaps!)
        • Sanger: finishing reads; fosmids for larger repeats and as templates for primer walking; less cost-effective but very useful in many cases
    15. 15. 454 (pyrosequence) and low GC genomes: Thermotoga lettingae TMO
        • Sanger-based draft assembly: 55 total contigs; 41 contigs >2kb; 38% GC; biased Sanger libraries
        • Draft assembly + 454: 2 total contigs; 1 contig >2kb; 454 involves no cloning
        • Average length of gaps: 166bp
    16. 16. 454 and High GC projects: Xylanimonas cellulosilytica DSM 15894 (3.8 Mb; 72.1% GC). PGA assembly of 9x of 8kb, without and with 454.
        Assembly      Total contigs  Major contigs  Scaffolds  Misassemblies*  N50
        PGA-8kb       210            166            4          165             41,048
        PGA-8kb+454   33             23             2          14              288,369
    17. 17. NextGen high-quality drafts at JGI (multiple sequencing platforms)
        1. Pyrosequence and Sanger to obtain the main ordered and oriented part of the assembly (Newbler assembler)
        2. GapResolution (in-house tool) to close some (up to 40%) gaps using unassembled 454 data (PGA or Newbler assemblers)
        3. Solexa reads to detect and correct errors in the consensus (the Polisher, a tool created in house) and to close gaps (Velvet)
        Figure labels: 454/Sanger contig; fosmid ends* and 454 PE; unassembled 454 reads; Solexa contig; Solexa
        * Fosmid ends not used for microbes
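The N50 column in the table above is a standard contiguity statistic: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. A minimal sketch of the calculation, with a function name chosen here for illustration:

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

A higher N50 for the same genome, as in the PGA-8kb+454 row, indicates that fewer and longer contigs carry most of the sequence.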
    18. 18. Solving gaps: gapResolution tool
        Step 1: For each gap, identify read pairs from contigs found on different scaffolds (pairs with one read next to the gap and its mate in a contig outside of this scaffold; such gaps are typically due to repeats).
        Step 2: Assemble reads in contigs adjacent to the gap together with reads obtained from contigs outside the scaffold. Sometimes an assembler other than Newbler (PGA) is used for sub-assemblies. The consensus from the sub-assembly fills the gap.
    19. 19. Solving gaps: gapResolution tool (II)
        Step 3: If the gap is not closed, the tool designs primers for sequencing reactions to close the gap.
        Step 4: Iterate as necessary (in sub-assemblies).
        http://www.jgi.doe.gov/ [email_address]
    20. 20. Solexa for gaps
        • Velvet assembly
        • Blast Velvet contigs against the ends of Newbler contigs
        • Use proper Velvet contigs to close gaps
        Velvet contigs close gaps caused by hairpins and secondary structures. Figure labels: 454 contig; gap; Velvet contig; Illumina reads.
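Step 1 can be pictured as a lookup over read placements. The sketch below is not the gapResolution tool itself; the data layout (a dict from read name to scaffold and contig) and all names are assumptions made for illustration.

```python
def mates_outside_scaffold(pairs, placement, flanking_contigs):
    """Collect reads whose mate sits in a contig flanking the gap
    while they themselves were placed on a different scaffold.

    pairs            -- iterable of (read_a, read_b) mate pairs (hypothetical input)
    placement        -- dict: read name -> (scaffold id, contig id)
    flanking_contigs -- set of contig ids adjacent to the gap
    """
    recruited = []
    for read_a, read_b in pairs:
        # Check the pair in both orientations: either mate may be the one near the gap.
        for near, far in ((read_a, read_b), (read_b, read_a)):
            if near not in placement or far not in placement:
                continue  # unplaced reads carry no scaffold information
            near_scaffold, near_contig = placement[near]
            far_scaffold, _ = placement[far]
            if near_contig in flanking_contigs and far_scaffold != near_scaffold:
                recruited.append(far)
    return recruited
```

The recruited reads, together with the reads of the flanking contigs, would then form the input of the Step 2 sub-assembly.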
    21. 21. Low quality areas – areas of potential frameshifts Assemblies contain low quality regions (red tags)
    22. 22. Homopolymer-related frameshifts: homopolymers (n>=3). Frameshift 1 (AAAAA, should be AAAA); Frameshift 2 (CCCC, should be CCC). Modified from N. Ivanova (JGI).
    23. 23. Polisher: software for consensus quality improvement
        Step 1: Align Illumina data to the 454-only or Sanger/454 hybrid assembly.
        Step 2: Analyze and correct consensus errors (figure example: C T T G A A A A A).
            Corrections: Illumina coverage >= 10X and at least 70% of Illumina bases disagree with the reference base.
            Unsupported: (a) Illumina coverage < 10X; (b) Illumina coverage >= 10X and <70% of Illumina bases agree with the reference base.
        Step 3: Design sequencing reactions for low-quality Sanger/454 regions and for regions unsupported by Illumina.
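Since these frameshifts cluster in homopolymers of length 3 or more, listing such runs in a consensus is a simple way to see where to look. A minimal sketch using a regular expression; the function name and output layout are choices made here, not part of any JGI tool.

```python
import re


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, run_length) for every homopolymer run of
    at least min_len identical bases; start is a 0-based position."""
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]
```

For instance, `homopolymer_runs("GATAAAAACGCCCCT")` reports the A run starting at position 3 and the C run starting at position 10.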
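The Step 2 thresholds can be written as a per-position decision rule. The sketch below is an interpretation, not the Polisher's code: it assumes the correction rule is checked before the unsupported rule (the two conditions overlap as stated), that the replacement is the most frequent disagreeing base, and that only substitutions are considered, so indels are out of scope.

```python
from collections import Counter


def classify_position(ref_base, illumina_bases, min_cov=10, min_frac=0.70):
    """Classify one consensus position from the Illumina bases aligned to it.

    Returns (label, base), where label is "corrected", "unsupported"
    or "supported" and base is the base to keep at this position.
    """
    coverage = len(illumina_bases)
    if coverage < min_cov:
        return ("unsupported", ref_base)  # rule (a): too little Illumina data
    counts = Counter(illumina_bases)
    agree = counts[ref_base]
    if (coverage - agree) / coverage >= min_frac:
        # Assumption: replace with the most common disagreeing base.
        alternatives = Counter(b for b in illumina_bases if b != ref_base)
        return ("corrected", alternatives.most_common(1)[0][0])
    if agree / coverage < min_frac:
        return ("unsupported", ref_base)  # rule (b): no clear majority
    return ("supported", ref_base)
```

Positions labelled "unsupported" are the ones Step 3 would target with additional sequencing reactions.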
    24. 24. Errors corrected by Solexa
        Figure labels: Frame shift detected (454 contig); 454 contig; Finished consensus; Sanger reads
        CCTCTTTGATGGAAATGATA**TCTTCGAGCATCGCCTC**GGGTTTTCCATACAGAGAACCTTTGATGATGAACCGGTTGAAGATCTGCGGGTCAAA
        CCTCTTTGATGGAAATAATA**TATTCGAGCATC
        TTAGTGGAAATGATA**TCTTCGAGCATCGCCTC
        CGAGCNTCGCCTC**GGGCTTTCCCT
        CGAGCATCGCCTC**GGGTTCTCCATACACAGA
        GCATCGCCTC**GGGTTTTCAATACAGAGAACCT
        CAGCGCCTC**GGGTTTTCCATACAGAGAACCTT
        ATCGCCTC**GGGTTTTCCAGACAGAGAACCTTT
        GGTTC**GGGTTTTCCATACAGAGAACCTTTGAT
        GTTTTCCATACAGAGAACATTTGATGATGAAC
        GTTGTCCATACAGAGAACTTTTGATGATGAAC
        TATANCATACAGAGAACCTTTGATGATGAACC
        ATTTCCAGACAGAGAACCNTTGATGATGAACC
        CAAACAGAGAACCTTTGAGGATGAACCGGTTG
        ACAGGGAACCTTAGATGATGAACCGGTTGAAG
        ACAGAGAACCTTAGATGATGAACCGGTTGAAG
        ACCGTTGATGATGAACCGGTTGAAGATCTGCG
        GATGGTGAACGGGTTGAAGATCTGCGGGTCAA
        GGTTTGAAGATCTGCGGGTCAAACCAGTCCTC
        GGTGGAAGATCTGCGGGTAAAACCAGTCCTCT
        GGT.GNAGAGCTGCGGGTCAAACCAGTCCTCTG
        TGAAGATCTGCGGTTCAAACCAGTCCTCTCCC
        GATCGGCGTGTCAAACCAGTCCTCTGCCTCGT
        TCTGCGGGTCAAACCAGTACTCTGCCTCGTTC
    25. 25. So, what is Finishing? The process of taking a rough draft assembly composed of shotgun sequencing reads, identifying and resolving misassemblies, sequence gaps and regions of low quality to produce a highly accurate finished DNA sequence. Final quality: the final error rate should be less than 1 per 50 Kb; no gaps, no misassembled areas, no characters other than ACGT.
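Two of the final-quality criteria can be checked mechanically: the ACGT-only alphabet and the error budget of less than 1 per 50 Kb. A minimal sketch under stated assumptions: the error count is supplied by the caller from some external estimate, gaps are assumed to appear as non-ACGT characters such as N, and misassemblies cannot be detected this way.

```python
def meets_finished_standard(sequence, estimated_errors, bp_per_error=50_000):
    """True if the sequence uses only A, C, G, T and the estimated number
    of errors is strictly below one per bp_per_error bases."""
    only_acgt = set(sequence.upper()) <= set("ACGT")
    # errors / length < 1 / bp_per_error, kept in integers to avoid rounding.
    within_budget = estimated_errors * bp_per_error < len(sequence)
    return only_acgt and within_budget
```

For a 5 Mb genome this budget works out to fewer than 100 errors in the whole finished sequence.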
    26. 26. Genome projects Archaea + Bacteria only http://www.genomesonline.org/ 298 Complete Genomes 137 Complete Genomes
    27. 27. Metagenomic assembly and Finishing
        • A metagenomic sequencing project is typically very large.
        • Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members.
        • Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies.
        • Chimeric contigs are produced by co-assembly of sequencing reads originating from different species.
        • Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly.
        • No assemblers have been developed for metagenomic data sets.
        The whole-genome shotgun sequencing approach has been used for a number of microbial community projects; however, useful quality control and assembly of these data require reassessing methods developed to handle relatively uniform sequences derived from isolate microbes.
    28. 28. QC: Annotation of poor quality sequence. To avoid this: make sure you use high quality sequence; choose a proper assembler. Source: A Bioinformatician's Guide to Metagenomics. Microbiol Mol Biol Rev. 2008 December; 72(4): 557–578.
    29. 29. Assembly mistakes A Bioinformatician's Guide to Metagenomics . Microbiol Mol Biol Rev. 2008 December; 72(4): 557–578.
    30. 30. Recommendations for metagenomic assembly
        • Use a trimmer (Lucy etc) to treat reads PRIOR to assembly.
        • None of the existing assemblers is designed for metagenomic data, but assemblers like PGA work better with paired-read information and produce better assemblies.
        • We are currently testing the Newbler assembler for second-generation sequencing: 454 only and 454/Solexa co-assembly.
    31. 31. Metagenomic finishing: approach. Binning: which DNA fragment is derived from which phylotype? (BLAST; GC%; read depth). Example: complete genome of Candidatus Accumulibacter phosphatis, Lucy/PGA. Figure labels: Candidatus Accumulibacter phosphatis (CAP) ~45%; CAP reads + non-CAP reads.
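Of the three binning signals listed (BLAST; GC%; read depth), GC% is the easiest to illustrate. The sketch below groups contigs into fixed-width GC bins; the 10% bin width and the function names are assumptions, and a real binning step would combine this with the other two signals.

```python
def gc_percent_bin(seq, width=10):
    """Index of the GC% bin for a sequence (bin 0 = 0-9% GC for width=10)."""
    if not seq:
        raise ValueError("empty sequence")
    upper = seq.upper()
    gc = upper.count("G") + upper.count("C")
    # Integer arithmetic: floor(GC% / width) without floating-point rounding.
    return (100 * gc) // (len(upper) * width)


def bin_contigs_by_gc(contigs, width=10):
    """Group contig names by GC bin; contigs is a dict name -> sequence."""
    bins = {}
    for name, seq in contigs.items():
        bins.setdefault(gc_percent_bin(seq, width), []).append(name)
    return bins
```

Contigs from organisms with clearly different base composition fall into different bins, which gives a first rough separation before BLAST hits and read depth are taken into account.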
    32. 32. Few more details: read quality
    33. 34. Merged assemblies (k=31 and k=51) with minimus (Cloneview used for visualization). Illumina-only data. Green: k=31; purple: k=51.
    34. 35. Stats for 31, 51 and merged 31-51 assemblies
    35. 36. Thank you!