Scaling metagenome assembly

Talk given at the JGI metagenome assembly workshop, Oct 12, 2011.

Speaker notes:
  • Thank organizers; point to talk online. Mention Susannah/first asst prof problem.
  • 1) Very high diversity ~30 billion k-mers. 2) No k-mer overlap between Iowa corn and prairie; co-assembly futile.
  • Indicate “surprising/awesome” components.
  • Connectivity source organism abundance
  • Comparing assemblies is hard, and we’ve had to build tools to let us compare assemblies. However, the results are good. Note: multi-k assemblies are essential.
  • Completely different style of assembler; useful for cross validation.
  • Note that all of this was done on Amazon in 68 GB of RAM.
  • Move towards loosely coupled environment for lossless approaches to scaling assembly? Weak classifiers & boosting theory can also be applied (trivially). Note, at some point you should just sequence single cells or something.
  • Funding: MSU startup, USDA NIFA, DOE, BEACON, Amazon.
  • Multi-k stuff.

    1. Scaling metagenome assembly – to infinity and beeeeeeeeeeyond!
        C. Titus Brown et al.
        Computer Science / Microbiology Depts
        Michigan State University
        In collaboration with Great Prairie Grand Challenge (Tiedje, Jansson, Tringe)
    2. SAMPLING LOCATIONS
    3. Sampling strategy per site
        [Figure: sampling layout around a reference soil core; distance labels 1 cM, 1 M, 10 M]
        Soil cores: 1 inch diameter, 4 inches deep
        Total: 8 reference metagenomes + 64 spatially separated cores (pyrotag sequencing)
    4. Great Prairie sequencing summary
        200x human genome…!
        > 10x more challenging (total diversity)
    5. Our perspective
        Great Prairie project: there is no end to the data!
        Immense biological depth: estimate ~1-2 TB (10**12) of raw sequence needed to assemble top ~20-40% of microbes.
        Improvements in sequencing tech
        Existing methods for scaling assembly simply will not suffice: this is a losing battle.
        Abundance filtering XXX
        Better data structures XXX
        Parallelization is not going to be sufficient; neither are advances in data structures.
        I think: bad scaling is holding back assembly progress.
    6. Our perspective, #2
        Deep sampling is needed for these samples
        Illumina is it, for now.
        The last thing in the world we want to do is write yet another assembler… pre-assembly filtering, instead.
        All of our techniques can be used together with any assembler.
        We’ve mostly stuck with Velvet, for reasons of historical contingency.
    7. Two enabling technologies
        Very efficient k-mer counting
            Bloom counting hash / Count-Min Sketch data structure; constant memory
            Scales ~10x over traditional data structures
            k-independent.
            Probabilistic properties well suited to next-gen data sets.
        Very efficient de Bruijn graph representation
            We traverse k-mers stored in constant-memory Bloom filters.
            Compressible probabilistic data structure; very accurate.
            Scales ~20x over traditional data structures.
            k-independent.
            …cannot directly be used for assembly because of false positives (FP).
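A minimal Python sketch of the constant-memory counting idea on slide 7, assuming a plain Count-Min Sketch. The class name `CountMinSketch`, the helper `kmers`, and the `width`/`depth` defaults are illustrative choices and are not taken from khmer. Counts can be overestimated when k-mers collide but are never underestimated, and memory stays fixed at `width * depth` counters regardless of k.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter (illustrative, not khmer's implementation)."""

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.depth = depth
        # depth independent rows of width counters; size never grows with the data.
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One salted hash per row picks the counter that the k-mer maps to.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # The minimum over rows is the least-inflated estimate.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    cms = CountMinSketch()
    for km in kmers("ACGTACGTAC", 4):
        cms.add(km)
    print(cms.count("ACGT"))
```

The same structure works for any k because only the hash of the k-mer is stored, which is one reading of "k-independent" on the slide.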
    8. Approach 1: Partitioning
        Use compressible graph representation to explore natural structure of data: many disconnected components.
    9. Partitioning for scaling
        Can be done in ~10x less memory than assembly.
        Partition at low k and assemble exactly at any higher k (DBG).
        Partitions can then be assembled independently
            Multiple processors -> scaling
            Multiple k, coverage -> improved assembly
            Multiple assembly packages (tailored to high variation, etc.)
        Can eliminate small partitions/contigs in the partitioning phase.
        In theory, an exact approach to divide and conquer/data reduction.
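The partitioning idea on slides 8 and 9 can be sketched in Python as follows. This is a simplification under stated assumptions: reads are joined whenever they share at least one exact k-mer, an ordinary dictionary stands in for the Bloom-filter graph from the talk, and reverse complements are ignored. The names `partition_reads` and `kmers` are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def partition_reads(reads, k):
    """Group reads into components connected through shared k-mers."""
    parent = list(range(len(reads)))  # union-find forest over read indices

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            owner = first_owner.setdefault(km, idx)
            root_a, root_b = find(owner), find(idx)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two components

    groups = defaultdict(list)
    for idx, read in enumerate(reads):
        groups[find(idx)].append(read)
    return list(groups.values())


if __name__ == "__main__":
    reads = ["ACGTACGGA", "ACGGATTCA", "TTTTTGGGGG", "GGGGGCCCCC"]
    for part in partition_reads(reads, 5):
        print(part)
```

Each returned group could then be handed to an assembler on its own, at whatever k and with whatever package suits that partition.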
    10. Adina Howe
    11. Partitioning challenges
        Technical challenge: existence of “knots” in the graph that artificially connect everything.
        Unfortunately, partitioning is not the solution.
        Runs afoul of same k-mer/error scaling problem that all k-mer assemblers have…
        20x scaling isn’t nearly enough, anyway
    12. Digression: sequencing artifacts
        Adina Howe
    13. Partitioning challenges
        Unfortunately, partitioning is not the solution.
        Runs afoul of same k-mer/error scaling problem that all k-mer assemblers have…
        20x scaling isn’t nearly enough, anyway
    14. Approach 2: Digital normalization
        “Squash” high coverage reads
        Eliminate reads we’ve seen before (e.g. “> 5 times”)
        Digital version of experimental “mRNA normalization”.
        Nice algorithm!
            Single-pass
            Constant memory
            Trivial to implement
            Easy to parallelize / scale (memory AND throughput)
        “Perfect” solution?
        (Works fine for MDA, mRNAseq…)
    15. Digital normalization
        Two benefits:
            Decrease amount of data (real, but redundant sequence)
            Eliminate errors associated with that redundant sequence.
        Single-pass algorithm (cf. streaming sketch algorithms)
    16. Digital normalization validation?
        Two independent methods for comparing assemblies… by both of them, we get very similar results for raw and treated.
    17. Comparing assemblies quantitatively
        Build a “vector basis” for assemblies out of orthogonal M-base windows of DNA.
        This allows us to disassemble assemblies into vectors, compare them, and even “subtract” them from one another.
    18. Running HMMs over de Bruijn graphs (=> cross validation)
        hmmgs: Assemble based on good-scoring HMM paths through the graph.
        Independent of other assemblers; very sensitive, specific.
        95% of hmmgs rplB domains are present in our partitioned assemblies.
        [Figure: de Bruijn graph with 3-mer nodes CTC, ACT, TTC, GTA, GAC, ATA, ACC, CTA, GTT]
        Jordan Fish, Qiong Wang, and Jim Cole (RDP)
    19. Digital normalization validation
        Two independent methods for comparing assemblies… by both of them, we get very similar results for raw and treated.
        hmmgs results tell us that Velvet multi-k assembly is also very sensitive.
        Our primary concern at this point is about long-range artifacts (chimeric assembly).
    20. Techniques
        Developed suite of techniques that work for scaling, without loss of information (?)
        While we have no good way to assess chimeras and misassemblies, basic sequence content and gene content stay the same across treatments.
        And… what, are we just sitting here writing code?
        No! We have data to assemble!
    21. Assembling Great Prairie data, v0.8
        Iowa corn GAII, ~500m reads / 50 Gb => largest partition ~200k reads
        84 Mb in 53,501 contigs > 1kb.
        Iowa prairie GAII, ~500m reads / 50 Gb => largest partition ~100k reads
        102 Mb in 70,895 contigs > 1kb.
        Both done on a single 8-core Amazon EC2 bigmem node, 68 GB of RAM, ~$100.
        (Yay, we can do it! Boo, we’re only using 2% of reads.)
        No systematic optimization of partitions yet; 2-4x improvement expected. Normalization of HiSeq is also yet to be done.
        Note: we have applied this to other metagenomes; longer story.
    22. Future directions?
        khmer software reasonably stable & well-tested; needs documentation, software engineering love.
        github.com/ctb/khmer/ (see ‘refactor’ branch…)
        Massively scalable implementation (HPC & cloud).
        Scalable digital normalization (~10 TB / 1 day? ;)
        Iterative partitioning
        Integrating other types of sequencing data (454, PacBio, …)?
        Polymorphism rates / error rates seem to be quite a bit higher.
        Validation and standard data sets? Someone? Please?
    23. Lossless assembly; boosting.
    24. Acknowledgements:
        The k-mer gang: Adina Howe, Jason Pell, Arend Hintze, Qingpeng Zhang, Rose Canino-Koning, Tim Brom.
        mRNAseq: Likit Preeyanon, Alexis Pyrkosz, Hans Cheng, Billie Swalla, and Weiming Li.
        HMM graph search: Jordan Fish, Qiong Wang, Jim Cole.
        Great Prairie consortium: Jim Tiedje, Rachel Mackelprang, Susannah Tringe, Janet Jansson
        Funding: USDA NIFA; MSU startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
    27. Lumps!
        Adina Howe
    28. Lumps!
        Adina Howe
    29. Knots in the graph are caused by sequencing artifacts.
    30. Identifying the source of knots
        Use a systematic traversal algorithm to identify highly-connected k-mers.
        Removal of these k-mers (trimming) breaks up the knots.
        Many, but not all, of these highly-connected k-mers are associated with high-abundance k-mers.
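The slide does not spell out the traversal, so the following Python sketch is only one plausible reading: flag a k-mer as highly connected when a bounded breadth-first walk from it reaches more than `threshold` distinct k-mers within `radius` steps. The function names, the radius, and the threshold are all assumptions made for illustration, and an exact set replaces the Bloom filter.

```python
from collections import deque


def neighbors(kmer, kmer_set):
    """k-mers in kmer_set that overlap kmer by k-1 bases on either side."""
    found = []
    for base in "ACGT":
        for cand in (kmer[1:] + base, base + kmer[:-1]):
            if cand in kmer_set and cand != kmer:
                found.append(cand)
    return found


def neighborhood_size(start, kmer_set, radius):
    """Number of distinct k-mers within radius graph steps of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == radius:
            continue  # do not expand beyond the radius
        for nxt in neighbors(node, kmer_set):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return len(seen)


def highly_connected(kmer_set, radius, threshold):
    """k-mers whose local neighborhood is larger than threshold."""
    return {
        km for km in kmer_set
        if neighborhood_size(km, kmer_set, radius) > threshold
    }


if __name__ == "__main__":
    # "CAG" is a hub: six other 3-mers overlap it by two bases.
    toy = {"CAG", "AGA", "AGC", "AGG", "AGT", "ACA", "CCA"}
    print(highly_connected(toy, radius=1, threshold=4))
```

Trimming would then amount to dropping the flagged k-mers from the set before partitioning, which disconnects the pieces that only the hub held together.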
    31. Highly connected k-mers are position-dependent
        Adina Howe
    32. HCKs under-represented in assembly
        Adina Howe
    33. HCKs tend to end contigs
        Adina Howe
    34. Our current model
        Contigs are extended or joined around artifacts, with an observation bias towards such extensions (because of length cutoff).
        Tendency is for a long contig to be extended by 1-2 reads, so artifacts trend towards location at end of contig.
        Adina Howe
    35. Conclusions (artifacts)
        They connect lots of stuff (preferential attachment)
        They result from something in the sequencing (3’ bias in reads)
        Assemblers don’t like using them
        The major effect of removing them is to shorten many contigs by a read.
    36. Digital normalization algorithm
        for read in dataset:
            if median_kmer_count(read) < CUTOFF:
                update_kmer_counts(read)
                save(read)
            else:
                pass  # discard read
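The pseudocode on slide 36 can be fleshed out into a small runnable Python sketch. It assumes an exact `Counter` for k-mer counts, where the talk describes a constant-memory probabilistic counter, and the function name `normalize` plus the default `k` and `cutoff` values are illustrative rather than khmer's.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=20, cutoff=5):
    """Single pass: keep a read only while its median k-mer count is below cutoff."""
    counts = Counter()  # exact counts; a stand-in for the constant-memory counter
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k carries no k-mers
        # Median abundance of the read's k-mers among reads kept so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)
            kept.append(read)
        # Otherwise the read is redundant at this coverage and is discarded.
    return kept


if __name__ == "__main__":
    reads = ["ACGTACGTGGCA"] * 10 + ["TTGACCAGTTCA"] * 2
    print(len(normalize(reads, k=4, cutoff=3)))
```

Because counts are only updated for reads that are kept, the high-coverage read stops being saved once its median reaches the cutoff, while the low-coverage read passes through untouched.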
    37. Supplemental: abundance filtering is very lossy.
    38. Per-partition assembly optimization
        Strategy:
            Vary k from 21 to 51, assemble with Velvet.
            Choose k that maximizes sum(contigs > 1kb)
        Ran top partitions in Iowa corn (4.2m reads, 303 partitions)
        For k=33, 3.5 Mb in 1876 contigs > 1kb, max 15.7 kb
        For best k for each partition (varied between 31 and 47), 5.7 Mb in 2511 contigs > 1kb, max 51.7 kb
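The selection rule on slide 38 is simple enough to sketch in Python. The `assemble` argument is a hypothetical callable that takes reads and a k and returns a list of contig strings; in practice it would wrap an assembler run such as Velvet, which is not shown here. Stepping through odd k only is an assumption.

```python
def score(contigs, min_len=1000):
    """Total bases in contigs strictly longer than min_len."""
    return sum(len(c) for c in contigs if len(c) > min_len)


def best_k(reads, assemble, k_values=range(21, 52, 2)):
    """Try each k and return (k, score) for the highest-scoring assembly.

    assemble(reads, k) is a caller-supplied function returning contigs;
    it is a placeholder for a real assembler invocation.
    """
    best = None
    for k in k_values:
        s = score(assemble(reads, k))
        if best is None or s > best[1]:
            best = (k, s)  # ties keep the smaller k seen first
    return best


if __name__ == "__main__":
    # Toy stand-in assembler: pretends k=33 gives the only long contig.
    def fake_assemble(reads, k):
        return ["A" * 1500] if k == 33 else ["A" * 900]

    print(best_k([], fake_assemble))
```

Running `best_k` once per partition reproduces the per-partition choice of k described on the slide, as opposed to fixing a single k for the whole data set.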
    39. Comparing assemblies quantitatively
        Build a “vector basis” for assemblies out of orthogonal M-base windows of DNA.
        This allows us to disassemble assemblies into vectors, compare them, and even “subtract” them from one another.
    40. Comparing assemblies / dendrogram
