NGS de novo assembly: progresses and challenges


Yingrui Li's talk at Assemblathon 2011

Published in: Technology

  1. 1. NGS de novo assembly: progresses and challenges
        Yingrui Li
        BGI Shenzhen
  2. 2. Overall sketch of SOAPdenovo
  3. 3. Overall sketch of SOAPdenovo
  4. 4. Main issues in NGS de novo assembly
        Efficient graph building and reduction
        Contig construction
        Scaffold construction
        Gap closure (to resolve repeats)
        Iterative refinement of assemblies
  5. 5. 1. Reducing graph complexity
        Eliminate errors in the original raw reads:
          • Graph-based
          • K-mer frequency spectrum-based
        Reducing errors beforehand makes graph construction memory- and time-efficient
        It also significantly reduces the load in the graph-reduction step
        It improves the reliability of primary contigs, which serve as the data basis for subsequent steps
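
The k-mer frequency spectrum idea on this slide can be illustrated with a minimal sketch. It is not SOAPdenovo's implementation: the function names, the toy reads, and the values of `k` and `min_count` are all assumptions made for illustration, and reverse complements are ignored. K-mers that occur only rarely across the reads are treated as likely products of sequencing errors.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(counts, min_count):
    """Return k-mers seen fewer than min_count times (candidate errors)."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data (hypothetical): four error-free copies of one read plus one copy
# carrying a single substitution at 0-based position 5 (G -> T).
reads = ["ACGTTGCATG"] * 4 + ["ACGTTTCATG"]
counts = kmer_counts(reads, k=5)
suspect = weak_kmers(counts, min_count=2)
print(sorted(suspect))
```

In this toy case every k-mer that covers the substituted base falls below the threshold, while the k-mers of the error-free reads do not. A real corrector would go on to replace the weak k-mers with their most plausible high-frequency neighbours.
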
  7. 7. Recent progress
        1) Larger k-mers (up to 27) can be used with acceptable memory and speed.
        2) The algorithm is optimized so that more erroneous bases can be corrected.
        3) Error correction is combined with merging of PE reads whose insert size is slightly shorter than the sum of the two reads' lengths (e.g., an insert size of 170 bp with a read length of 100 bp), which further improves the result.
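
For the example on this slide, two 100 bp reads from a 170 bp insert should overlap by 2 x 100 - 170 = 30 bp, which is what makes merging possible. The sketch below shows one simple way such a pair could be merged. It is an assumption-laden illustration, not the method used in the talk: `expected_overlap`, `merge_pair`, and their parameters are hypothetical names, and the toy fragment is scaled down to 20 bp reads from a 34 bp fragment.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def expected_overlap(read_length, insert_size):
    """Bases shared by the two reads of a pair, assuming equal read lengths."""
    return 2 * read_length - insert_size


def merge_pair(read1, read2, min_overlap=10, max_mismatches=0):
    """Merge a read pair whose 3' ends overlap.

    read2 is taken as sequenced (opposite strand), so it is
    reverse-complemented first. The longest acceptable overlap wins.
    Returns the merged fragment, or None if no overlap qualifies.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            return read1 + mate[overlap:]
    return None


# Toy fragment (hypothetical): 34 bp, read as two 20 bp ends.
fragment = "ATGGATTAGATGAT" + "CCGTAC" + "GTTAGCATTGACGA"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
merged = merge_pair(read1, read2, min_overlap=4)
print(expected_overlap(100, 170), merged == fragment)
```

Taking the longest acceptable overlap is a design choice that keeps the sketch short. A production tool would also weigh base qualities and guard against spurious overlaps in low-complexity sequence.
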
  8. 8. Simulation results on Arabidopsis data using different k-mer sizes
  9. 9. Results of different versions for error correction
        * overlap_cor: combination of error correction and merging of PE reads
  10. 10. 2. Contiging
          For SOAPdenovo, contiging is the process of finding all unique, unambiguous paths in the complexity-reduced de Bruijn graph
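
The notion of unambiguous paths in a de Bruijn graph can be sketched as follows. This is a toy illustration under stated assumptions, not SOAPdenovo's code: `build_graph` and `unambiguous_paths` are hypothetical names, reads are assumed error-free, reverse complements are ignored, and isolated cycles are skipped. Nodes are (k-1)-mers, and a contig is a maximal path whose inner nodes have exactly one predecessor and one successor.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred


def unambiguous_paths(succ, pred):
    """Spell every maximal path whose inner nodes have one way in and one way out."""
    nodes = set(succ) | set(pred)

    def is_inner(node):
        return len(pred[node]) == 1 and len(succ[node]) == 1

    contigs = []
    for start in sorted(nodes):
        if is_inner(start):
            continue  # paths start only at branch points or dead ends
        for nxt in sorted(succ[start]):
            seq, node = start + nxt[-1], nxt
            while is_inner(node):
                node = next(iter(succ[node]))
                seq += node[-1]
            contigs.append(seq)
    return contigs


# A repeat-free toy sequence yields a single contig ...
linear = unambiguous_paths(*build_graph(["ATGGCGTGCA"], k=4))
# ... while two sequences sharing the segment CCGTA break at the shared part.
branched = unambiguous_paths(*build_graph(["ACCGTAG", "TCCGTAT"], k=4))
print(linear, branched)
```

The second example shows why repeats limit contig length: the shared segment becomes its own short contig because the graph branches on both sides of it.
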
  11. 11. Progress
          1) Larger k-mers (up to 127) can be used with merged PE reads or with the longer reads coming out soon, e.g., 150 bp reads from Illumina.
          2) Longer repeats can be resolved using overhung PE reads.
  12. 12. 3. Scaffolding
          Scaffolding links primary contigs into an unambiguous path in the relationship graph
          It is the data basis for gap closure
          It is highly associated with final contig size
          Performance is hypersensitive to parameter settings
  13. 13. Progress
          1) Repetitive contigs are handled more cautiously.
          2) Some algorithmic logic is optimized to make fewer mistakes.
          * When one or more contigs in a scaffold are not in the correct position, it counts as an error.
  14. 14. 4. Gap closure
          Based on conservatively constructed scaffolds, intra-scaffold gaps (between linked contigs) are filled where possible to form longer contigs (scaftigs):
            • Unique regions that did not pass the stringent contiging threshold
            • Repeat regions that were cut or not assembled in the original assemblies
          This process carries a high risk of introducing errors
  15. 15. Progress
          1) Overhung PE reads are used to span small gaps and fill them.
          2) Gaps are treated according to their characteristics, e.g., gap size, length of nearby contigs, number of reads falling in the gap, whether the gap contains a tandem repeat…
          3) The local assembly strategy is optimized to make better decisions when encountering conflicts.
  16. 16. Results of different versions for gap filling
          * When the sequence of a fully filled gap is not exactly the same as the reference sequence, it counts as an error.
  17. 17. 5. Post-processing
          Align reads back to the assembly to evaluate the reliability of each locus
          Correct artifacts in the assemblies
          Analyze the possibility of further improvement
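
One simple way to evaluate each locus after aligning reads back is per-base read depth. The sketch below assumes depth is the reliability signal and that alignments are already available as half-open `(start, end)` intervals on one contig. The slide does not say which measure is actually used, so the function names and the `min_depth` threshold are illustrative assumptions.

```python
def per_base_depth(length, alignments):
    """Depth at every position, given half-open (start, end) alignment intervals."""
    delta = [0] * (length + 1)
    for start, end in alignments:
        delta[start] += 1   # a read begins covering here
        delta[end] -= 1     # and stops covering here
    depth, running = [], 0
    for change in delta[:length]:
        running += change
        depth.append(running)
    return depth


def low_support_loci(depth, min_depth):
    """Positions covered by fewer than min_depth reads."""
    return [pos for pos, d in enumerate(depth) if d < min_depth]


# Toy contig of 10 bases with three aligned reads (hypothetical coordinates).
depth = per_base_depth(10, [(0, 6), (2, 8), (4, 10)])
weak = low_support_loci(depth, min_depth=2)
print(depth, weak)
```

Positions flagged this way would be candidates for closer inspection. Read-pair consistency and mismatch rates are other signals a real pipeline might combine with depth.
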
  18. 18. 6. Computational performance
          With a number of low-level optimizations, one round of human genome assembly now takes 1 day on a node with 256G of memory
          A cloud-based assembler is on the horizon (dev code: Hecate)
          Memory footprint cut to <32G; speed scales with the number of nodes used.
  19. 19. Issues
          Achieving the theoretical upper limit in contiging
          Paired-end short reads + insert size ~= long reads
          Mixing up two haploids
          Several key factors affect the quality of WGS assembly:
            • Heterozygosity rate of the diploid genome
            • Repetitive sequence distribution pattern of the species' genome
            • K-mer size used when de Bruijn graph assembly is applied
  20. 20. Revised Hierarchical Assembly
          Build libraries hierarchically
          Using Fosmid clones
          Avoid combining two haploids
          Assemble hierarchically
          Combines de Bruijn graph & OLC strategies
          Provides an affordable sequencing solution for diploid & complex genomes
  21. 21. Flowchart of Revised Hierarchical Assembly
  22. 22. Revised Hierarchical de novo Assembly on an Asian Genome
          Data production:
            • 8x (500k) Fosmids on a human genome
            • ~16k index libraries
            • Optimally 30 Fosmid clones per pool
            • 40x raw data per Fosmid clone
            • 20x with 200 bp insert size
            • 20x with 500 bp insert size
            • 320 index libraries per lane
            • ~120 Illumina HiSeq lanes
            • Total amount of data: 1650G
            • Sequenced: 15 lanes
            • Produced data: 213G
          Expected outcomes:
            • Novel sequences for gap closure of the reference genome.
            • A comprehensive map of structural variations.
            • Diploid sequences in relatively highly heterogeneous regions.
            • An assembly that is more “real”
  33. 33. Progress of G10-BGI Species
  34. 34. PROGRESS STATUS
          Total species: 101
  35. 35. FINISHED SPECIES
  36. 36. Preliminary assembled species
  37. 37. Sequencing of species
  38. 38. Straw webhost on genomes
          Please advise what kinds of functions to include, considering that genomes will be available at different levels of completeness:
            • Finished map
            • Fine map w/ haploids resolved
            • Draft map w/ physical map anchored
  39. 39. Thank you!