NGS de novo assembly: progress and challenges
Yingrui Li
BGI Shenzhen
Overall sketch of SOAPdenovo2
Overall sketch of SOAPdenovo3
Main issues in NGS de novo assembly
  Efficient graph building and reduction
  Contig construction
  Scaffold construction
  Gap closure (to solve repeats)
  Iterative refining of assemblies
1. Reducing graph complexity
  Eliminate errors in the original raw reads
    Graph-based
    K-mer frequency spectrum-based
  Reduce errors beforehand so the graph can be constructed memory- and time-efficiently
  This also significantly reduces the load in the graph-reduction step
  Improves the reliability of primary contigs, which serve as the data basis for subsequent steps
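The k-mer frequency spectrum idea can be illustrated with a minimal sketch. This is not SOAPdenovo's corrector: the function names, the `min_count` cutoff and the single-substitution search are assumptions made for illustration only. K-mers seen fewer than `min_count` times are treated as likely errors, and a read is repaired if one base change makes all of its k-mers frequent.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmers(counts, min_count):
    """K-mers below the frequency cutoff; these are the suspected errors."""
    return {kmer for kmer, c in counts.items() if c < min_count}

def is_solid(read, counts, k, min_count):
    """True if every k-mer of the read reaches the frequency cutoff."""
    return all(counts[read[i:i + k]] >= min_count
               for i in range(len(read) - k + 1))

def correct_read(read, counts, k, min_count):
    """Return the first single-base substitution that makes the read solid.

    Hypothetical policy: the read is returned unchanged if it is already
    solid or if no single substitution fixes it.
    """
    if is_solid(read, counts, k, min_count):
        return read
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if is_solid(candidate, counts, k, min_count):
                return candidate
    return read

# Toy data: five error-free copies and one read with a G->T error.
reads = ["ACGTACGGTC"] * 5 + ["ACGTACTGTC"]
counts = count_kmers(reads, k=4)
fixed = correct_read("ACGTACTGTC", counts, k=4, min_count=2)
```

A real corrector works on billions of k-mers and has to use compact counting structures; the sketch only shows why removing weak k-mers before graph construction shrinks the graph.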
Recent progress
  1) Larger k-mers (up to 27) can be used with acceptable memory and speed.
  2) The algorithm is optimized so that more erroneous bases can be corrected.
  3) Error correction is combined with merging of PE reads whose insert size is slightly shorter than the sum of the two reads' lengths (e.g., an insert size of 170 bp with a read length of 100 bp), which further improves the result.
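The merging of overlapping paired-end reads mentioned above can be sketched as follows. With a 170 bp insert and 100 bp reads the two mates overlap by about 30 bp, so they can be joined into one longer read. The function names, the `min_overlap` and `max_mismatches` parameters and the longest-overlap-first search are assumptions for illustration, not the method used in SOAPdenovo.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G and T."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(read1, read2, min_overlap=10, max_mismatches=0):
    """Merge a forward read and its reverse-strand mate if they overlap.

    Tries the longest overlap first and returns the merged sequence, or
    None when no overlap of at least min_overlap bases is found.
    """
    mate = reverse_complement(read2)
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            return read1 + mate[overlap:]
    return None

# Toy example: a 17 bp fragment sequenced with 10 bp reads from both ends,
# so the two mates overlap by 3 bases.
fragment = "ACGTTGCAAGGCTTACG"
read1 = fragment[:10]
read2 = reverse_complement(fragment[-10:])
merged = merge_pair(read1, read2, min_overlap=3)
```

Preferring the longest overlap is one of several possible policies; a production tool would also weigh base qualities in the overlapping region.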
Simulation results on Arabidopsis data using different k-mer sizes
Results of different versions for error correction
  * overlap_cor: combination of error correction and merging of PE reads
2. Contiging
  For SOAPdenovo, contiging is the process of finding all unique, unambiguous paths in the complexity-reduced de Bruijn graph
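As a rough illustration of "unambiguous paths", the sketch below builds a toy de Bruijn graph whose nodes are (k-1)-mers and emits one contig per maximal non-branching path. It is a simplification under stated assumptions: forward strand only, no coverage information, and isolated cycles are skipped. The function names are hypothetical and this is not SOAPdenovo's implementation.

```python
from collections import defaultdict

def build_graph(kmers):
    """Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix."""
    succ, pred = defaultdict(set), defaultdict(set)
    for kmer in kmers:
        a, b = kmer[:-1], kmer[1:]
        succ[a].add(b)
        pred[b].add(a)
    return succ, pred

def unambiguous_contigs(kmers):
    """Return the sequences of all maximal non-branching paths."""
    succ, pred = build_graph(set(kmers))
    nodes = set(succ) | set(pred)

    def is_simple(node):
        # Exactly one way in and one way out: the path cannot branch here.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # interior node; it is reached from a path start
        for nxt in sorted(succ.get(node, ())):
            path, cur = node + nxt[-1], nxt
            while is_simple(cur):
                cur = next(iter(succ[cur]))
                path += cur[-1]
            contigs.append(path)
    return sorted(contigs)

def kmers_of(seq, k):
    """All k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

linear = unambiguous_contigs(kmers_of("ACGTTGCA", 4))
branched = unambiguous_contigs(kmers_of("ACGTA", 4) + kmers_of("ACGTC", 4))
```

In the branched case the path stops at the node where two continuations exist, which is why repeats and heterozygous sites fragment the contigs.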
Progress
  1) Larger k-mers (up to 127) can be used with merged PE reads or with the longer reads coming out soon, e.g., 150 bp reads from Illumina.
  2) Longer repeats can be resolved using overhung PE reads.
3. Scaffolding
  Scaffolding links primary contigs into an unambiguous path in the relationship graph
  The data basis for gap closure
  Highly associated with final contig size
  Performance is hypersensitive to parameter settings
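A minimal sketch of linking contigs into unambiguous paths is given below. It assumes each read pair has already been reduced to an (upstream contig, downstream contig) tuple, ignores orientation and gap-size estimation, and simply refuses any link that is weakly supported or ambiguous. The `min_links` threshold and all names are hypothetical, and contigs that only form a cycle are not reported.

```python
from collections import Counter

def build_scaffolds(contigs, pair_links, min_links=3):
    """Chain contigs along links that are well supported and unambiguous.

    contigs    : list of contig IDs
    pair_links : list of (upstream, downstream) contig IDs, one per read pair
    """
    support = Counter(pair_links)
    edges = [(a, b) for (a, b), n in support.items()
             if n >= min_links and a != b]
    succ_count = Counter(a for a, _ in edges)
    pred_count = Counter(b for _, b in edges)
    # Keep a link only when it is the sole way out of a and the sole way into b.
    nxt = {a: b for a, b in edges
           if succ_count[a] == 1 and pred_count[b] == 1}
    has_pred = set(nxt.values())

    scaffolds = []
    for contig in contigs:
        if contig in has_pred:
            continue  # not the start of a chain
        chain = [contig]
        while chain[-1] in nxt and nxt[chain[-1]] not in chain:
            chain.append(nxt[chain[-1]])
        scaffolds.append(chain)
    return scaffolds

links = [("c1", "c2")] * 4 + [("c2", "c3")] * 3 + [("c2", "c4")]
scaffolds = build_scaffolds(["c1", "c2", "c3", "c4"], links)

# A contig linked from two different neighbours is left unplaced.
ambiguous = build_scaffolds(["a", "b", "r"],
                            [("a", "r")] * 3 + [("b", "r")] * 3)
```

Changing `min_links` from 3 to 1 in the first example would make `c2` ambiguous and break the chain, a small illustration of how sensitive scaffolding is to parameter settings.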
Progress
  1) Repetitive contigs are handled more cautiously.
  2) Some of the algorithmic logic is optimized to make fewer mistakes.
  * When one or more contigs in a scaffold are not in the correct position, it counts as an error.
4. Gap closure
  Based on conservatively constructed scaffolds, intra-scaffold gaps (between linked contigs) are filled where possible to form longer contigs (scaftigs):
    Unique regions that did not pass the stringent contiging threshold
    Repeat regions that were cut or not assembled in the original assemblies
  A process with a high risk of introducing errors
Progress
  1) Overhung PE reads are used to span small gaps and fill them.
  2) Gaps are treated specifically according to their characteristics, e.g., gap size, length of nearby contigs, number of reads that fall in the gap, whether there is a tandem repeat inside…
  3) The local assembly strategy is optimized to make better decisions when encountering conflicts.
Results of different versions for gap filling
  * When the sequence of a fully filled gap is not exactly the same as the reference sequence, it counts as an error.
5. Post-processing
  Align reads back to the assembly to evaluate the reliability of each locus
  Correct artifacts in the assemblies
  Analyze the possibility of further improvement
6. Computational performance
  With a set of low-level optimizations, one round of assembly of a human genome now takes 1 day on a node with 256G of memory
  Cloud-based assembler at dawn (dev code: Hecate)
    Memory footprint cut to <32G; speed scales with the number of nodes used.
Issues
  Achieving the theoretical upper limit in contiging
    Paired-end short reads + insert size ~= long reads
  Mixing up two haploids
  Several key factors affect the quality of WGS assembly
    Heterozygous rate of the diploid genome
    Repetitive sequence distribution pattern of the species' genome
    K-mer size used when de Bruijn graph assembly is applied
Revised Hierarchical Assembly
  Build libraries hierarchically
    Using Fosmid clones
    Avoid combining two haploids
  Assemble hierarchically
    Combines de Bruijn graph & OLC strategies
  Providing an affordable sequencing solution for diploid & complex genomes
Flowchart of Revised Hierarchical Assembly
Revised Hierarchical de novo Assembly on an Asian Genome
  Data production:
  8x (500k) Fosmids on a human genome
  ~16k index libraries
  Optimally 30 Fosmid clones per pool
  40x raw data per Fosmid clone
  20x 200bp IS
  20x 500bp IS
  320 index libraries per lane
  ~120 Illumina HiSeq lanes

Total amount of data: 1650G
Produced data: 213G
Expected outcomes
  Novel sequences for gap closure of the reference genome
  A comprehensive map of structural variations
  Diploid sequences in relatively highly heterogeneous regions
  An assembly that is more “real”
Straw web host on genomes
  http://climb.genomics.org.cn/g10k/home.jsp
  Please advise what kind of functions to include, considering that genomes will be available at different levels of completeness:
    Finished map
    Fine map w/ haploids solved
    Draft map w/ physical map anchored