NGS de novo assembly: progresses and challenges

Yingrui Li's talk at Assemblathon 2011

Presentation Transcript

  • NGS de novo assembly: progresses and challenges
    Yingrui Li
    BGI Shenzhen
  • Overall sketch of SOAPdenovo
  • Main issues in NGS de novo assembly
    Efficient graph building and reduction
    Contig construction
    Scaffold construction
    Gap closure (to solve repeats)
    Iterative refining assemblies
  • 1. Reducing graph complexity
    Eliminate errors in original raw reads
    • Graph-based
    • K-mer frequency spectrum-based
    Reducing errors beforehand lets the graph be built with less memory and time
    It also significantly reduces the load on the graph-reduction step
    It improves the reliability of the primary contigs, which are the data basis for all subsequent steps
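
The k-mer frequency spectrum idea can be illustrated with a minimal sketch. This is not SOAPdenovo's corrector: the function names, the toy reads, and the `min_count` threshold are assumptions made for illustration, and real tools also count both strands, which is skipped here.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer that occurs in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmers(counts, min_count):
    """Return k-mers seen fewer than min_count times; these likely carry an error."""
    return {kmer for kmer, n in counts.items() if n < min_count}

# Toy data (hypothetical): three identical reads and one with a single substitution.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = count_kmers(reads, k=4)
suspect = weak_kmers(counts, min_count=2)
print(sorted(suspect))
```

Only the k-mers that overlap the substituted base fall below the threshold, so a corrector can focus on those positions before the graph is built.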
  • Recent progresses
    1) Larger k-mers (up to 27) can be used with acceptable memory and speed.
    2) The algorithm is optimized so that more erroneous bases can be corrected.
    3) Error correction is combined with merging of PE reads whose insert size is slightly shorter than the sum of the two read lengths (e.g., an insert size of 170 bp with a read length of 100 bp), which further improves the result.
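
With a 170 bp insert and 100 bp reads, the two mates overlap by 2 × 100 − 170 = 30 bp, so they can be joined into one longer read. The sketch below shows one simple way to do this under stated assumptions: `merge_pair`, `min_overlap`, and `max_mismatch` are hypothetical names, the toy fragment is 30 bp with 20 bp reads, and base qualities are ignored.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(read1, read2, min_overlap=10, max_mismatch=0):
    """Merge read1 with the reverse complement of its mate.

    Tries the longest suffix/prefix overlap first and returns the merged
    fragment, or None if no overlap of at least min_overlap bases is found.
    """
    mate = reverse_complement(read2)
    for olap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-olap:], mate[:olap]))
        if mismatches <= max_mismatch:
            return read1 + mate[olap:]
    return None

# Toy fragment (hypothetical): 30 bp insert, 20 bp reads, so a 10 bp overlap.
fragment = "ACGTTGCAAGGCTTACCGATGGATCCTAGA"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
merged = merge_pair(read1, read2, min_overlap=5)
unmerged = merge_pair("AAAAAAAAAA", "CCCCCCCCCC", min_overlap=5)
```

Trying the longest overlap first is a design choice for this sketch; a production merger would also score overlaps to avoid false merges in repeats.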
  • Simulation results on Arabidopsis data using different k-mer sizes
  • Results of different versions for error correction
    * overlap_cor: combination of error correction and merging of PE reads
  • 2. Contiging
    For SOAPdenovo, contiging is the process of finding all unique, unambiguous paths in the complexity-reduced de Bruijn graph
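
As a rough illustration of what "unambiguous paths" means, the sketch below builds a small de Bruijn graph and walks every non-branching path. It is a simplified teaching example, not SOAPdenovo's implementation: the function names are assumptions, reverse-complement strands are ignored, and isolated cycles are not reported.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred

def unambiguous_paths(succ, pred):
    """Spell out every maximal path whose inner nodes have one way in and one way out."""
    def is_simple(node):
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for node in sorted(set(succ) | set(pred)):
        if is_simple(node):
            continue  # paths start only at branching or terminal nodes
        for nxt in sorted(succ.get(node, ())):
            path = node + nxt[-1]
            while is_simple(nxt):
                (nxt,) = succ[nxt]
                path += nxt[-1]
            contigs.append(path)
    return contigs

# Toy reads (hypothetical) that share a prefix and then diverge.
succ, pred = build_graph(["AACGTC", "AACGTG"], k=3)
contigs = unambiguous_paths(succ, pred)
print(contigs)
```

The shared prefix becomes one contig and each branch becomes its own short contig, which is why unresolved branches (errors, repeats, heterozygosity) limit contig length.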
  • Progresses
    1) Larger k-mers (up to 127) can be used with merged PE reads or with the longer reads that are coming out soon, e.g., 150 bp reads from Illumina.
    2) Longer repeats can be resolved using overhung PE reads.
  • 3. Scaffolding
    Scaffolding links primary contigs into an unambiguous path in the relationship graph
    It is the data basis for gap closure
    It is highly associated with final contig size
    Performance is hyper-sensitive to parameter settings
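
One ingredient of scaffolding is estimating the gap between two linked contigs from read pairs that span them. The sketch below shows the usual arithmetic under simplifying assumptions: the coordinate conventions, the function name `estimate_gap`, and the toy numbers are all hypothetical, and the insert size is treated as a fixed value rather than a distribution.

```python
from statistics import mean

def estimate_gap(len_a, links, insert_size):
    """Estimate the gap between contig A and the following contig B.

    links holds (start_on_a, end_on_b) pairs: the 0-based leftmost position
    of the forward read on A, and the exclusive rightmost position of its
    mate on B. Each pair implies
        gap = insert_size - (bases covered on A) - (bases covered on B).
    """
    estimates = [insert_size - (len_a - start_a) - end_b
                 for start_a, end_b in links]
    return mean(estimates)

# Toy example (hypothetical): a 1000 bp contig A, a 500 bp library, three spanning pairs.
gap = estimate_gap(1000, [(800, 150), (850, 210), (900, 240)], insert_size=500)
print(gap)
```

Because every estimate inherits the spread of the library's insert size, averaging over many links matters, which is one reason results depend so strongly on parameter settings.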
  • Progresses
    1) Repetitive contigs are handled more cautiously.
    2) Some of the algorithmic logic is optimized to make fewer mistakes.
    * An error is counted when one or more contigs in a scaffold are not in the correct position.
  • 4. Gap closure
    Based on conservatively constructed scaffolds, an attempt is made to fill intra-scaffold gaps (between linked contigs) to form longer contigs (scaftigs):
    Unique regions that did not pass stringent contiging threshold
    Repeat regions that are cut/not assembled in original assemblies
    A process that carries a high risk of introducing errors
  • Progresses
    1) Overhung PE reads are used to span small gaps and fill them.
    2) Gaps are treated according to their characteristics, e.g., gap size, length of the neighboring contigs, number of reads that fall in the gap, whether the gap contains a tandem repeat…
    3) The local assembly strategy is optimized to make better decisions when conflicts are encountered.
  • Results of different versions for gap filling
    * An error is counted when the sequence of a fully filled gap is not exactly the same as the reference sequence.
  • 5. Post-processing
    Align reads back to the assembly to evaluate the reliability of each locus
    Correct artifacts in the assemblies
    Analyze the possibility of further improvement
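
One simple way to evaluate each locus after aligning reads back is to compute per-base depth and flag stretches with little read support. The sketch below assumes alignments are already available as half-open `(start, end)` intervals on one contig; the function names and the `min_depth` threshold are illustrative assumptions, not part of the talk.

```python
def per_base_depth(length, alignments):
    """Depth at each position from 0-based, half-open (start, end) alignments."""
    diff = [0] * (length + 1)
    for start, end in alignments:
        diff[start] += 1   # coverage begins here
        diff[end] -= 1     # coverage stops just before here
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth

def low_support_regions(depth, min_depth):
    """Return half-open intervals where depth stays below min_depth."""
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d < min_depth and start is None:
            start = pos
        elif d >= min_depth and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions

# Toy contig of 10 bases with three aligned reads (hypothetical coordinates).
depth = per_base_depth(10, [(0, 6), (2, 8), (7, 10)])
weak = low_support_regions(depth, min_depth=2)
print(depth, weak)
```

Flagged intervals are candidates for closer inspection; depth alone does not prove a misassembly, so paired-end consistency would normally be checked as well.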
  • 6. Computational performance
    With a number of low-level optimizations, one round of assembly of a human genome now takes 1 day on a node with 256G of memory
    Cloud-based assembler at dawn (dev code: Hecate)
    Memory footprint cut to <32G; speed scales with the number of nodes used.
  • Issues
    Achieving the theoretical upper limit in contiging
    Paired-end short reads + insert size ~= Long reads
    Mixing up two haploids
    Several key factors affect quality of WGS assembly
    Heterozygous rate of the diploid genome
    Repetitive sequence distribution pattern of the species’ genome
    K-mer size used when de Bruijn graph assembly is applied
  • Revised Hierarchical Assembly
    Build libraries hierarchically
    Using Fosmid clones
    Avoid combining two haploids
    Assemble hierarchically
    Combines de Bruijn graph & OLC strategies
    Providing an affordable sequencing solution for diploid & complex genomes
  • Flowchart of Revised Hierarchical Assembly
  • Revised Hierarchical de novo Assembly on an Asian Genome
    Data Production:
    • 8x(500k) Fosmids on a human genome
    • ~16k index libraries
    • Optimally 30 Fosmid clones per pool
    • 40x raw data per Fosmid clone
    • 20x 200bp IS
    • 20x 500bp IS
    • 320 index libraries per lane
    • ~120 Illumina HiSeq lanes
    • Total Amount of data: 1650G
    • Sequenced: 15 lanes
    • Produced data: 213G
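
The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers stated above, but the interpretation is an assumption: it reads "500k" as roughly 500,000 Fosmid clones, treats each pool as one index library, and treats "G" as a unit of sequence data whose per-lane yield stays constant.

```python
# Values taken from the slide; variable names are illustrative.
fosmid_clones = 500_000
clones_per_pool = 30
lanes_sequenced = 15
data_produced_g = 213
total_data_g = 1650

pools = fosmid_clones / clones_per_pool                 # compare with "~16k index libraries"
yield_per_lane_g = data_produced_g / lanes_sequenced    # observed yield so far
lanes_needed = total_data_g / yield_per_lane_g          # compare with "~120 HiSeq lanes"
fraction_done = data_produced_g / total_data_g

print(f"pools: {pools:.0f}")
print(f"yield per lane: {yield_per_lane_g:.1f}G")
print(f"lanes needed at that yield: {lanes_needed:.0f}")
print(f"fraction of data produced: {fraction_done:.1%}")
```

Under these assumptions the pool count and the lane count both land close to the approximate values quoted on the slide.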
  • Expected Outcomes
    Novel sequences for the gap closure of reference genome.
    A comprehensive map of structural variations.
    Diploid sequences in relatively highly heterogeneous regions.
    An assembly that is more “real”
  • Progress of G10-BGI Species
  • PROGRESS STATUS
    total species:101
  • FINISHED SPECIES
  • Preliminary assembled species
  • Sequencing of species
  • Straw webhost on genomes
    http://climb.genomics.org.cn/g10k/home.jsp
    Please advise what kind of functions to include, considering the fact that genomes will be available at different levels of completeness:
    Finished map
    Fine map w/ haploids solved
    Draft map w/ physical map anchored
  • Thank you!