NGS de novo assembly progresses and challenges

NGS de novo assembly: progresses and challenges
Yingrui Li
BGI Shenzhen

Main issues in NGS de novo assembly
Efficient graph building and reduction
Contig construction
Scaffold construction
Gap closure (to solve repeats)
Iterative refinement of assemblies

1. Reducing graph complexity
Eliminate errors in the original raw reads
  Graph-based
  K-mer frequency spectrum-based
Reduce errors beforehand so the graph can be constructed memory- and time-efficiently
This also significantly reduces the load in the graph-reduction step
Improves the reliability of primary contigs, which serve as the data basis for subsequent steps
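
To illustrate the k-mer frequency spectrum idea, here is a minimal Python sketch. It is not the SOAPdenovo corrector: the function names, the `min_count` threshold, and the brute-force single-substitution search are all assumptions made for illustration. K-mers seen fewer than `min_count` times are treated as likely sequencing errors, and a read is changed only if one substitution makes every k-mer in it frequent.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only, for brevity)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_read(read, counts, k, min_count=2):
    """Return the read with at most one base substituted so that all of its
    k-mers reach min_count; return the read unchanged if that is impossible."""

    def has_weak_kmer(seq):
        return any(counts[seq[i:i + k]] < min_count
                   for i in range(len(seq) - k + 1))

    if not has_weak_kmer(read):
        return read  # already consistent with the spectrum
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if not has_weak_kmer(candidate):
                return candidate
    return read  # no single substitution repairs the read
```

A real corrector would count k-mers on both strands, pick the threshold from the observed frequency histogram, and avoid the exhaustive search; the sketch only shows why removing low-frequency k-mers up front shrinks the graph.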

Recent progress
1) A larger k-mer (up to 27) can be used with acceptable memory and speed.
2) The algorithm is optimized so that more erroneous bases can be corrected.
3) Error correction is combined with merging of PE reads whose insert size is slightly shorter than the sum of the two reads' lengths (e.g., an insert size of 170 bp with a read length of 100 bp), which further improves the result.
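
With the numbers in the example above, two 100 bp reads from a 170 bp insert overlap by 2 x 100 - 170 = 30 bp, so the pair can be merged into a single 170 bp read. The following Python sketch shows one simple way to do such a merge. The function names, `min_overlap`, and `max_mismatch` are hypothetical, and the longest-overlap-first search is an assumption, not the method used in the tool described here.

```python
def expected_overlap(read_len, insert_size):
    """Overlap between the two mates when the insert is shorter than 2 reads."""
    return 2 * read_len - insert_size


def reverse_complement(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10, max_mismatch=0):
    """Merge read1 with the reverse complement of read2 when the end of read1
    matches the start of the mate; return None if no overlap is found."""
    mate = reverse_complement(read2)
    # Try the longest candidate overlap first, then shorter ones.
    for ov in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-ov:], mate[:ov]))
        if mismatches <= max_mismatch:
            return read1 + mate[ov:]
    return None
```

Merged reads behave like longer single reads, which is what allows a larger k-mer to be used in the contiging step.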

Simulation results on Arabidopsis data using different k-mer sizes

Results of different versions for error correction
* overlap_cor: combination of error correction and merging of PE reads

2. Contiging
For SOAPdenovo, contiging is the process of finding all unique, unambiguous paths in the complexity-reduced de Bruijn graph
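
As a rough illustration of what "unambiguous paths" means, the Python sketch below builds a small de Bruijn graph whose nodes are (k-1)-mers and whose edges are k-mers, then walks every maximal non-branching path. This is a textbook-style simplification under stated assumptions, not SOAPdenovo's implementation: it ignores reverse complements, coverage, and isolated cycles, and the function names are made up for the example.

```python
def kmers_of(seq, k):
    """All k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def unambiguous_paths(kmers):
    """Return the sequences spelled by maximal non-branching paths."""
    succ = {}   # (k-1)-mer -> list of successor (k-1)-mers
    indeg = {}  # (k-1)-mer -> number of incoming edges
    for kmer in sorted(set(kmers)):
        u, v = kmer[:-1], kmer[1:]
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
        indeg[v] = indeg.get(v, 0) + 1
        indeg.setdefault(u, 0)

    def is_simple(node):
        # Exactly one way in and one way out: the path cannot branch here.
        return indeg[node] == 1 and len(succ[node]) == 1

    contigs = []
    for node in sorted(succ):
        if is_simple(node):
            continue  # paths start only at branching or terminal nodes
        for nxt in succ[node]:
            path = node + nxt[-1]
            while is_simple(nxt):
                nxt = succ[nxt][0]
                path += nxt[-1]
            contigs.append(path)
    return contigs
```

Each returned string ends where the graph branches or stops, which is why unresolved repeats and errors fragment the primary contigs.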

Progress
1) A larger k-mer (up to 127) can be used with merged PE reads, or with the longer reads coming out soon, e.g., 150 bp reads from Illumina.
2) Longer repeats can be resolved using overhung PE reads.

3. Scaffolding
Scaffolding links primary contigs into an unambiguous path in the relationship graph
The data basis for gap closure
Highly associated with final contig size
Performance is hypersensitive to parameter settings
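
The idea of following only unambiguous links between contigs can be sketched in Python as follows. This is an illustrative simplification, not the scaffolder described in these slides: contig orientation and gap-size estimation are left out, and the `min_links` cutoff and function name are assumptions. Each input tuple stands for one read pair whose mates map to two different contigs.

```python
from collections import Counter


def scaffold(links, min_links=3):
    """links: iterable of (contig_a, contig_b), one entry per supporting read
    pair. Returns ordered contig lists joined only by unambiguous links."""
    support = Counter(links)
    succ, pred = {}, {}
    for (a, b), n in support.items():
        if n >= min_links:  # discard weakly supported links
            succ.setdefault(a, set()).add(b)
            pred.setdefault(b, set()).add(a)

    # Keep a link only if it is the single way out of a and the single way into b.
    nxt = {}
    for a, targets in succ.items():
        if len(targets) == 1:
            b = next(iter(targets))
            if len(pred[b]) == 1:
                nxt[a] = b

    scaffolds = []
    for start in sorted(set(nxt) - set(nxt.values())):
        path = [start]
        while path[-1] in nxt:
            path.append(nxt[path[-1]])
        scaffolds.append(path)
    return scaffolds
```

Changing `min_links` changes which links survive, which gives a feel for why scaffolding results are so sensitive to parameter settings.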

Progress
1) Repetitive contigs are handled more cautiously.
2) Some algorithmic logic is optimized to make fewer mistakes.
* When one or more contigs in a scaffold are not in the correct position, there is an error.

4. Gap closure
Based on conservatively constructed scaffolds, an attempt is made to fill intra-scaffold gaps (between linked contigs) to form longer contigs (scaftigs):
  Unique regions that did not pass the stringent contiging threshold
  Repeat regions that are cut or not assembled in the original assemblies
A process with a high risk of introducing errors

Progress
1) Overhung PE reads are used to span small gaps and fill them.
2) Gaps are treated specifically according to their characteristics, e.g., gap size, length of nearby contigs, number of reads falling in the gap, whether there is a tandem repeat inside...
3) The local assembly strategy is optimized to make better decisions when encountering conflicts.

Results of different versions for gap filling
* When the sequence of a fully filled gap is not exactly the same as the reference sequence, there is an error.

5. Post-processing
Align reads back to the assembly to evaluate the reliability of each locus
Correct artifacts in the assemblies
Analyze the possibility of further improvement
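
One simple way to evaluate per-locus reliability after aligning reads back is to compute read depth at every position and flag stretches with little support. The Python sketch below assumes alignments are already available as 0-based, half-open `(start, end)` intervals on one assembled sequence. The function names and the `min_depth` cutoff are hypothetical, and depth is only one of several signals a real pipeline might use.

```python
def depth_profile(length, alignments):
    """Per-position read depth from (start, end) half-open intervals."""
    diff = [0] * (length + 1)
    for start, end in alignments:
        s = min(max(start, 0), length)  # clamp to the sequence bounds
        e = min(max(end, 0), length)
        if s < e:
            diff[s] += 1
            diff[e] -= 1
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def low_support_regions(depth, min_depth):
    """Half-open (start, end) regions whose depth is below min_depth."""
    regions, start = [], None
    for i, d in enumerate(depth):
        if d < min_depth and start is None:
            start = i
        elif d >= min_depth and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions
```

Regions reported by `low_support_regions` are candidates for closer inspection, not proven errors.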

6. Computational performance
A number of low-level optimizations now bring one round of human genome assembly down to 1 day on a node with 256 GB of memory
Cloud-based assembler at dawn (dev code: Hecate)
Memory footprint cut to <32 GB; speed scales with the number of nodes used.

Issues
Achieving the theoretical upper limit in contiging
Paired-end short reads + insert size ~= long reads
Mixing up two haploids
Several key factors affect the quality of WGS assembly
  Heterozygosity rate of the diploid genome
  Repetitive sequence distribution pattern of the species' genome
  K-mer size used when de Bruijn graph assembly is applied

Revised Hierarchical Assembly
Build libraries hierarchically
  Using Fosmid clones
  Avoid combining two haploids
Assemble hierarchically
  Combines de Bruijn graph & OLC strategies
Providing an affordable sequencing solution for diploid & complex genomes

Flowchart of Revised Hierarchical Assembly

Revised Hierarchical de novo Assembly on an Asian Genome
Data production: 8x (500k) Fosmids on a human genome

Optimally 30 Fosmid clones per pool
