Reduce errors beforehand to construct the graph memory- and time-efficiently
Also significantly reduces the load in the graph-reduction step
Improves the reliability of primary contigs, which serve as the data basis for subsequent steps
Recent progress
1) A larger k-mer (up to 27) can be used with acceptable memory and speed.
2) The algorithm is optimized so that more erroneous bases can be corrected.
3) Error correction is combined with merging of PE reads whose insert size is slightly shorter than the sum of the two reads’ lengths (e.g., an insert size of 170 bp with a read length of 100 bp), which further improves the result.
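The PE-read merging in point 3 can be illustrated with a small sketch. With 100 bp reads and a 170 bp insert, the two mates are expected to overlap by 2 x 100 - 170 = 30 bp. The function names (`merge_pair`, `expected_overlap`), the parameters `min_overlap` and `max_mismatch`, and the longest-overlap-first search are assumptions made for illustration; this is not the SOAPdenovo implementation.

```python
import random

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))

def expected_overlap(read_len, insert_size):
    """Overlap expected between two mates of equal length."""
    return 2 * read_len - insert_size

def merge_pair(read1, read2, min_overlap=10, max_mismatch=0):
    """Merge a read and its reverse-strand mate into one fragment.

    Hypothetical helper: tries the longest overlap first and accepts the
    first one with at most max_mismatch mismatches. In the overlap, the
    bases of read1 are kept. Returns None if no overlap qualifies.
    """
    mate = reverse_complement(read2)
    for olap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-olap:], mate[:olap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch:
            return read1 + mate[olap:]
    return None

# Simulated example: a 170 bp fragment sequenced from both ends with 100 bp reads.
rng = random.Random(0)
fragment = "".join(rng.choice("ACGT") for _ in range(170))
read1 = fragment[:100]
read2 = reverse_complement(fragment[-100:])
merged = merge_pair(read1, read2)
```

Here `merged` reconstructs the full 170 bp fragment, which can then be treated as a single longer read.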
Simulation results for Arabidopsis data using different k-mer sizes
Results of different versions for error correction
* overlap_cor: combination of error correction and merging of PE reads
2. Contiging
For SOAPdenovo, contiging is the process of finding all unique, unambiguous paths in the complexity-reduced de Bruijn graph
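The idea of reading contigs off unambiguous paths can be sketched on a toy de Bruijn graph. The function names and the tiny reads below are made up for illustration. The sketch ignores reverse-complement strands, error removal, and isolated cycles, all of which a real assembler has to handle, so it is not SOAPdenovo's actual algorithm.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred

def unambiguous_paths(succ, pred):
    """Spell out every maximal path whose inner nodes have one way in and one way out."""
    def is_simple(node):
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for start in set(succ) | set(pred):
        if is_simple(start):
            continue  # paths only start at branch points or dead ends
        for nxt in succ.get(start, ()):
            path, cur = start + nxt[-1], nxt
            while is_simple(cur):
                cur = next(iter(succ[cur]))
                path += cur[-1]
            contigs.append(path)
    return sorted(contigs)

# Three overlapping reads from one unique region give a single contig.
linear = unambiguous_paths(*build_de_bruijn(["ATGGCGT", "GGCGTGC", "CGTGCAA"], k=4))

# Two reads that diverge after "AAGC" create a branch, so the path is cut there.
forked = unambiguous_paths(*build_de_bruijn(["AAGCT", "AAGCG"], k=4))
```

In the second case the graph yields three short contigs instead of one, which is why branching caused by repeats or heterozygosity limits contig length.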
Progress
1) A larger k-mer, up to 127, can be used with merged PE reads or with the longer reads coming out soon, e.g., 150 bp reads from Illumina.
2) Longer repeats can be resolved using overhung PE reads.
3. Scaffolding
Scaffolding links primary contigs into an unambiguous path in the relationship graph
Provides the data basis for gap closure
Highly associated with final contig size
Performance is hypersensitive to parameter settings
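A minimal sketch of chaining contigs along unambiguous links is given here, assuming each read pair whose mates map to two different contigs has already been turned into an oriented `(left_contig, right_contig)` link. The function name `build_scaffolds`, the `min_support` threshold, and the contig names are hypothetical. Gap-size estimation and orientation handling are left out, and the threshold stands in for the kind of parameter the scaffolding result is sensitive to.

```python
from collections import Counter, defaultdict

def build_scaffolds(pair_links, min_support=3):
    """Chain contigs joined by well-supported, unambiguous paired-end links."""
    support = Counter(pair_links)
    succ, pred = defaultdict(set), defaultdict(set)
    contigs = set()
    for (left, right), count in support.items():
        contigs.update((left, right))
        if count >= min_support:  # drop weakly supported links
            succ[left].add(right)
            pred[right].add(left)

    def unique_next(contig):
        """Successor of contig, only if the link is unambiguous on both sides."""
        nxt = succ.get(contig, set())
        if len(nxt) != 1:
            return None
        (candidate,) = nxt
        return candidate if len(pred.get(candidate, ())) == 1 else None

    scaffolds = []
    for contig in sorted(contigs):
        before = pred.get(contig, set())
        if len(before) == 1 and unique_next(next(iter(before))) == contig:
            continue  # contig sits inside a chain that starts elsewhere
        chain = [contig]
        nxt = unique_next(chain[-1])
        while nxt is not None and nxt not in chain:
            chain.append(nxt)
            nxt = unique_next(chain[-1])
        scaffolds.append(chain)
    return scaffolds

# Toy links: C is followed by both D and E (as a repetitive contig would be),
# and the single X -> B link is too weak to be trusted.
links = ([("A", "B")] * 5 + [("B", "C")] * 4 +
         [("C", "D")] * 3 + [("C", "E")] * 3 + [("X", "B")])
scaffolds = build_scaffolds(links)
```

The chain stops at C because its continuation is ambiguous, so D, E, and X remain as single-contig scaffolds.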
Progress
1) Repetitive contigs are handled more cautiously.
2) Some algorithmic logic is optimized to make fewer mistakes.
* When one or more contigs in a scaffold are not in the correct position, there is an error.
4. Gap closure
Based on conservatively constructed scaffolds, an attempt is made to fill intra-scaffold gaps (between linked contigs) to form longer contigs (scaftigs):
  Unique regions that did not pass the stringent contiging threshold
  Repeat regions that are cut or not assembled in the original assemblies
A process that has a high risk of introducing errors
Progress
1) Overhung PE reads are used to span small gaps and fill them.
2) Gaps are treated specifically according to their characteristics, e.g., gap size, length of nearby contigs, number of reads that fall in the gap, having a tandem repeat inside or not…
3) The local assembly strategy is optimized to make better decisions when encountering conflicts.
Results of different versions for gap filling
* When the sequence of a fully filled gap is not exactly the same as the reference sequence, there is an error.
5. Post-processing
Align reads back to the assembly to evaluate the reliability of each locus
Correct artifacts in the assemblies
Analyze the possibility of further improvement
6. Computational performance
A number of low-level optimizations have now been achieved
One round of assembly takes 1 day for the human genome on a node with 256 GB of memory
Cloud-based assembler at dawn (dev code: Hecate)
Memory footprint cut to <32 GB; speed scales with the number of nodes used.
Issues
Achieving the theoretical upper limit in contiging
Paired-end short reads + insert size ~= long reads
Mixing up two haploids
Several key factors affect the quality of a WGS assembly
  Heterozygosity rate of the diploid genome
  Repetitive sequence distribution pattern of the species’ genome
  K-mer size used when de Bruijn graph assembly is applied
Revised Hierarchical Assembly
Build libraries hierarchically
Using fosmid clones
Avoid combining two haploids
Assemble hierarchically
Combines de Bruijn graph & OLC strategies
Providing an affordable sequencing solution for diploid & complex genomes
Expected Outcomes
Novel sequences for gap closure of the reference genome.
A comprehensive map of structural variations.
Diploid sequences in relatively highly heterogeneous regions.
An assembly that is more “real”
Straw webhost on genomes
http://climb.genomics.org.cn/g10k/home.jsp
Please advise what kinds of functions to include, considering that genomes will be available at different levels of completeness:
Finished map
Fine map w/ haploids solved
Draft map w/ physical map anchored