Climbing Mt. Metagenome


Assembling very large metagenomes from Illumina short reads.

Published in: Technology
  • Briefly: all six open reading frames (ORFs) were translated by ORF_finder (or ORFs were predicted by MetaGene) using translation table 11, with a minimum length of 30 aa. The ORFs were clustered at 90% identity (the default) to identify non-redundant sequences, which were further clustered into families at a conservative threshold of 60% identity (default) over 80% (default) of the ORF length. The resulting ORFs were annotated against Pfam and TIGRFAM with HMMER (accelerated with Hammerhead), and against COG with RPS-BLAST at e-values below 0.001. GO annotations were mapped from Pfam or TIGRFAM, and EC numbers were mapped from the GO database.
  • Paint between the greens.
  • When a green connects two or more colors, recolor one color.
  • Dependent on minimum-density tagging

    1. Scaling Mt. Metagenome: Assembling very large data sets
       C. Titus Brown, Assistant Professor
       Computer Science and Engineering / Microbiology and Molecular Genetics
       Michigan State University
    2. Thanks for coming!
       Note: this talk is about the computational side of metagenome assembly, motivated by the Great Prairie Grand Challenge soil sequencing project.
       Jim Tiedje will talk about the project as a whole at the JGI User's Meeting.
    3. The basic problem.
       Lots of metagenomic sequence data (200 GB Illumina for < $20k?).
       Assembly, especially metagenome assembly, scales poorly (due to high diversity).
       Standard assembly techniques don't work well with sequences from genomes of mixed abundance.
       Many people don't have the computational resources (~1 TB of RAM or more) to assemble at all.
    4. We can't just throw more hardware at the problem, either.
       (Figure: Lincoln Stein.)
    5. Jumping to the end:
       We have implemented a solution to these problems: scalability of assembly, lack of resources, and parameter choice.
       We demonstrate this solution on a high-diversity sample (219.1 Gb of Iowa corn field soil metagenome).
       ...and there is an additional surprise or two, so you should stick around!
    6. Whole genome shotgun sequencing & assembly
       Randomly fragment & sequence DNA; reassemble computationally.
       (Figure: UMD assembly primer.)
    7. K-mer graphs: overlaps
       (Figure: J.R. Miller et al., Genomics, 2010.)
    8. K-mer graphs: branching
       For decisions about which paths to follow, biology-based heuristics come into play as well.
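The overlap-and-branching structure of slides 7-8 can be sketched in a few lines of Python. This is purely illustrative, not the khmer implementation: nodes are k-mers, edges are (k-1)-base overlaps, and a branch is any k-mer with more than one forward extension.

```python
def kmers(seq, k):
    """Yield every k-mer in a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_graph(reads, k):
    """Collect the set of k-mer nodes present in the reads."""
    return {km for read in reads for km in kmers(read, k)}

def successors(nodes, km):
    """k-mers that extend `km` by one base (a (k-1)-base overlap)."""
    return [km[1:] + b for b in "ACGT" if km[1:] + b in nodes]

reads = ["ACGTAC", "CGTACG", "GTACGA", "GTACTT"]  # toy reads
nodes = build_graph(reads, k=4)
print(successors(nodes, "GTAC"))  # ['TACG', 'TACT']: a branch point
```

The branch at GTAC is exactly the situation where an assembler must choose a path, which is where the biology-based heuristics mentioned above come in.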
    9. Too much data: what can we do?
       Reduce the size of the data (with either an approximate or an exact approach).
       Divide & conquer: subdivide the problem.
       For exact data reduction or subdivision, we need to grok the entire assembly graph structure...
       ...but that is why assembly scales poorly in the first place.
    10.-12. (Figure-only slides.)
    13. Abundance filtering
       Approach used in two published Illumina metagenomic papers (the MetaHIT/human microbiome and rumen papers).
       Remove or trim reads with low-abundance k-mers, which arise either from errors or from low-abundance organisms.
       Inexact data reduction: may or may not remove usable data.
       Works well for high-coverage data sets (the rumen data set is estimated at 56x!).
       However, for low-coverage or high-diversity data sets, abundance filtering will reject potentially useful reads.
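As a toy illustration of the trimming idea (not the exact MetaHIT or rumen pipelines, and with an arbitrary threshold), one can count k-mers and truncate each read at its first low-abundance k-mer:

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def trim_low_abundance(read, counts, k, min_count=2):
    """Truncate a read at its first low-abundance k-mer; sequencing
    errors tend to create k-mers seen only once."""
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] < min_count:
            return read[:i + k - 1]
    return read

reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]  # last read has an error
counts = kmer_counts(reads, 4)
print(trim_low_abundance("ACGTTCGT", counts, 4))  # 'ACGT': cut at the error
```

Note the slide's caveat applies directly: a genuinely rare organism also produces low-count k-mers, and this toy trimmer cannot tell the two apart.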
    14. Abundance filtering (figure)
    15. Two exact data reduction techniques:
       Eliminate reads that do not connect to many other reads.
       Group reads by connectivity into different partitions of the entire graph.
       For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
    16. Eliminating unconnected reads: "graphsize filtering"
    17. Subdividing reads by connection: "partitioning"
    18. Two exact data reduction techniques:
       Eliminate reads that do not connect to many other reads ("graphsize filtering").
       Group reads by connectivity into different partitions of the entire graph ("partitioning").
       For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
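A minimal sketch of partitioning, with union-find over exact shared k-mers standing in for khmer's Bloom-filter graph traversal (an illustration of the concept, not the real code):

```python
def partition_reads(reads, k):
    """Group read indices into partitions of reads that share any k-mer."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    owner = {}  # k-mer -> first read index that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if km in owner:
                union(idx, owner[km])
            else:
                owner[km] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())

reads = ["ACGTAC", "GTACGG", "TTTTTT", "TTTTTA"]
print(partition_reads(reads, 4))  # two partitions: [[0, 1], [2, 3]]
```

Each resulting partition is a disconnected subproblem that can be assembled on its own, which is what makes the reduction exact for k-mer graph assemblers.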
    19. Engineering overview
       Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure.
       With this, we can store graphs efficiently in memory: ~1-2 bytes per (unique) k-mer, for arbitrary k.
       Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes).
       For details, see the source code or the online webinar.
    20. Store graph nodes in a Bloom filter
       Graph traversal is done in full k-mer space.
       Presence/absence of individual nodes is kept in a Bloom filter data structure (hash tables without collision tracking).
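A toy Bloom filter for k-mer presence/absence might look like the following. The real khmer code uses prime-sized hash tables in C++, so take this only as a sketch of the idea:

```python
import hashlib

class KmerBloom:
    """Minimal Bloom filter: set membership with one-sided error."""
    def __init__(self, size, num_hashes=2):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, kmer):
        """Derive num_hashes bit positions from the k-mer."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos] = 1

    def __contains__(self, kmer):
        # "no" is guaranteed correct; "yes" may be a false positive
        return all(self.bits[pos] for pos in self._positions(kmer))

bf = KmerBloom(size=10000)
bf.add("ACGTACGTACGT")
print("ACGTACGTACGT" in bf)  # True
print("TTTTTTTTTTTT" in bf)  # False, barring a (rare) false positive
```

Because only bits are stored, not the k-mers themselves, memory stays constant per k-mer regardless of k, which is the point of the 1-2 bytes/k-mer figure above.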
    21. Practical application
       Enables:
       - graph trimming (exact removal)
       - partitioning (exact subdivision)
       - abundance filtering
       ...all for k <= 64, for 200+ GB sequence collections.
       All results (except for comparison) obtained using a single Amazon EC2 4xlarge node: 68 GB of RAM / 8 cores.
       Similar running times to using Velvet alone.
    22. We pre-filter data for assembly:
    23. Does removing small graphs work?
       Small data set (35 m reads / 3.4 GB rhizosphere soil sample).
       Filtered at k=32, assembled at k=33 with ABySS.
                                  N contigs / total bp / largest contig
       Unfiltered (35 m reads):   130 / 223,341 / 61,766
       Filtered (2 m reads):      130 / 223,341 / 61,766
       YES.
    24. Does partitioning into disconnected graphs work?
       Partitioned the same data set (35 m reads / 3.5 GB) into 45k partitions containing > 10 reads; assembled partitions separately (k0=32, k=33).
                                  N contigs / total bp / largest contig
       Unfiltered (35 m reads):   130 / 223,341 / 61,766
       Sum over partitions:       130 / 223,341 / 61,766
       YES.
    25. Data reduction for assembly: practical details
       Reduction performed on a machine with 16 GB of RAM.
       Removing poorly connected reads: 35 m -> 2 m reads.
       - Memory required reduced from 40 GB to 2 GB;
       - time reduced from 4 hrs to 20 minutes.
       Partitioning reads into disconnected groups:
       - biggest group is 300k reads;
       - memory required reduced from 40 GB to 500 MB;
       - time reduced from 4 hrs to < 5 minutes/group.
    26. Does it work on bigger data sets?
       35 m read data set, partition sizes:
       P1: 277,043 reads; P2: 5,776; P3: 4,444; P4: 3,513; P5: 2,528; P6: 2,397; ...
       Iowa continuous corn GA2 partitions (218.5 m reads):
       P1: 204,582,365 reads; P2: 3,583; P3: 2,917; P4: 2,463; P5: 2,435; P6: 2,316; ...
    27. Problem: big data sets have one big partition!?
       Too big to handle on EC2; assembles with low coverage.
       Contains 2.5 bn unique k-mers (~500 microbial genomes) at ~3-5x coverage.
       As we sequence more deeply, the "lump" becomes a bigger percentage of the reads => trouble!
       Both for our approach, and possibly for assembly in general (because it assembles more poorly than it should for the given coverage/size).
    28. Why this lump?
       - Real biological connectivity (rRNA, conserved genes, etc.)?
       - A bug in our software?
       - A sequencing artifact or error?
    29. Why this lump?
       Real biological connectivity? Probably not: increasing k from 32 to ~64 didn't break up the lump, so it is not biological.
       A bug in our software? Probably not: a second, completely separate approach & implementation confirmed the lump (bleu, by Rosangela Canino-Koning).
       A sequencing artifact or error? YES.
       (Note: we already filter & quality-trim all sequences.)
    30. "Good" vs "bad" assembly graph
       Low density vs high density.
    31. Non-biological levels of local graph connectivity:
    32. Higher local graph density correlates with position in read
    33. Higher local graph density correlates with position in read
       ARTIFACT
    34. Trimming reads
       Trim at high "soddd" (sum of degree-degree distribution):
       - From each k-mer in each read, walk two k-mers in all directions in the graph;
       - if more than 3 k-mers can be found at exactly two steps, trim the remainder of the sequence.
       Overly stringent; this actually trims the (k-1) connectivity graph by degree.
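One possible reading of the soddd rule, sketched in Python. The function names and the exact two-step walk are my interpretation of the slide, not the khmer code:

```python
def neighbors(nodes, km):
    """Graph nodes one base-shift away in either direction."""
    nbrs = {km[1:] + b for b in "ACGT" if km[1:] + b in nodes}
    nbrs |= {b + km[:-1] for b in "ACGT" if b + km[:-1] in nodes}
    nbrs.discard(km)
    return nbrs

def two_step_count(nodes, km):
    """How many distinct nodes sit exactly two steps from km."""
    two_away = set()
    for n1 in neighbors(nodes, km):
        two_away |= neighbors(nodes, n1)
    two_away.discard(km)
    return len(two_away)

def soddd_trim(read, nodes, k, max_two_step=3):
    """Trim the remainder of a read at the first overly dense k-mer."""
    for i in range(len(read) - k + 1):
        if two_step_count(nodes, read[i:i + k]) > max_two_step:
            return read[:i + k - 1]
    return read

# A simple linear/low-density graph is sparse, so the read survives intact:
read = "ACGTACG"
nodes = {read[i:i + 4] for i in range(len(read) - 3)}
print(soddd_trim(read, nodes, 4))  # 'ACGTACG', untrimmed
```

In a knot-like region, many nodes sit two steps away from an artifact k-mer, so the count blows past the threshold and the rest of the read is discarded.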
    36. Preferential attachment due to bias
       Any sufficiently large collection of connected reads will have one or more reads containing an artifact;
       these artifacts will then connect that group of reads to all other groups possessing artifacts;
       ...and all high-coverage contigs will amalgamate into a single graph.
    37. Artifacts from sequencing falsely connect graphs
    38. Preferential attachment due to bias
       (Repeat of slide 36.)
    39. Groxel view of knot-like region / Arend Hintze
    40. Density trimming breaks up the lump:
       Old P1, soddd-trimmed (204.6 m reads -> 179 m):
       P1: 23,444,332 reads; P2: 60,703; P3: 48,818; P4: 39,755; P5: 34,902; P6: 33,284; ...
       Untrimmed partitioning (218.5 m reads):
       P1: 204,582,365 reads; P2: 3,583; P3: 2,917; P4: 2,463; P5: 2,435; P6: 2,316; ...
    41. What does density trimming do to assembly?
       204 m reads in the lump: assembles into 52,610 contigs, totaling 73.5 MB.
       180 m reads in the trimmed lump: assembles into 57,135 contigs, totaling 83.6 MB.
       (All contigs > 1 kb.)
       Filtered/partitioned @ k=32, assembled @ k=33, exp_cov=auto, cov_cutoff=0.
    42. Wait, what?
       Yes, trimming these "knot-like" sequences improves the overall assembly!
       We remove 25.6 m reads and gain 10.1 MB!?
       The trend is the same for ABySS, another k-mer graph assembler.
    43. Is this a valid assembly?
       Paired-end usage is good.
       50% of contigs have a BLASTX hit better than 1e-20 in SwissProt;
       75% of contigs have a BLASTX hit better than 1e-20 in TrEMBL.
       Against reference genomes sequenced by JGI:
       Frateuria aurantia: 1,376 hits > 100 aa;
       Saprospira grandis: 1,114 hits > 100 aa
       (> 50% identity over > 50% of the gene).
    44. So what's going on?
       Current assemblers are bad at dealing with certain graph structures ("knots").
       If we can untangle knots for them, that's good, maybe?
       Or, by eliminating locations where reads from contigs of different abundance connect, repeat resolution improves?
       This happens with other k-mer graph assemblers (ABySS), and with at least one other (non-metagenomic) data set.
    45. OK, let's assemble!
       Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to:
       148,053 contigs, totaling 220 MB; max length 20,322 bp; max coverage ~10x.
       ...all done on Amazon EC2, in ~1 week, for under $500.
       Filtered/partitioned @ k=32, assembled @ k=33, exp_cov=auto, cov_cutoff=0.
    46. Full Iowa corn / mapping stats
       1,806,800,000 QC/trimmed reads (1.8 bn).
       204,900,000 reads map to some contig (11%);
       37,244,000 reads map to contigs > 1 kb (2.1%).
       > 1 kb is a stringent contig-size criterion! Compare:
       80% of MetaHIT reads map to contigs > 500 bp;
       65%+ of rumen reads map to contigs > 1 kb.
    47. Percentage mapped vs contig size
    48. High-coverage partitions assemble more reads
    49. Success, tentatively.
       We are still evaluating assembly and assembly parameters; it should be possible to improve in every way.
       (~10 hrs to redo the entire assembly, once partitioned.)
       The main engineering point is that we can actually run this entire pipeline on a relatively small machine (8 cores / 68 GB RAM).
       We can do dozens of these in parallel on Amazon rental hardware.
       And, from our preliminary results, we get roughly equivalent assembly results to what we would get by scaling up our hardware.
    50. Optimizing per-partition assembly
       Metagenomes contain genomes of mixed abundance.
       Current assemblers are not built for mixed-abundance samples (a problem with mRNAseq, too): repeat resolution, error/edge trimming.
       Since we're breaking the data set into multiple partitions containing reads that may assemble together, can we optimize assembler parameters (k, coverage) for each partition?
    51. Mixing parameters improves assembly statistics
       Objective function: maximize sum(contigs > 1 kb).
       4.5x average coverage: gained 228 contigs / 469 kb (over 152 contigs / 215 kb).
       5.8x average coverage: gained 78 contigs / 304 kb (over 248 contigs / 708 kb).
       8.2x average coverage: lost 58 contigs / gained 116 kb (over 279 contigs / 803 kb).
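The per-partition parameter search can be sketched as follows. `assemble` is a hypothetical stand-in for invoking a real assembler (e.g. Velvet at a given k); only the objective, maximizing total bp in contigs > 1 kb, comes from the slide:

```python
def score(contigs, min_len=1000):
    """Total bp in contigs longer than min_len."""
    return sum(len(c) for c in contigs if len(c) > min_len)

def best_assembly(partition, assemble, param_grid):
    """Assemble a partition under each parameter set; keep the best."""
    return max((assemble(partition, **params) for params in param_grid),
               key=score)

# Toy stand-in assembler: pretend different k values yield different contigs.
fake_results = {31: ["A" * 500], 33: ["A" * 1500, "A" * 1200], 35: ["A" * 2000]}
def assemble(partition, k):
    return fake_results[k]

best = best_assembly("partition1.fa", assemble,
                     [{"k": 31}, {"k": 33}, {"k": 35}])
print(score(best))  # 2700: the k=33 assembly wins
```

Because each partition re-assembles independently and quickly, this brute-force sweep is affordable in a way it never would be for the whole data set at once.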
    52. Conclusions
       Engineering: we can assemble large data sets.
       Scaling: we can assemble on rented machines.
       Science: we can optimize assembly for individual partitions.
       Science: we retain low-abundance reads.
    53. Caveats
       Quality of assembly??
       The Illumina sequencing bias/error issue needs to be explored.
       Regardless of the Illumina-specific issue, it's good to have tools/approaches for looking at the structure of large graphs.
       We need to better analyze the upper limits of our data structures.
       We have not applied our approaches to high-coverage data yet; in progress.
    54. Future thoughts
       Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers, so it is a good first step to try even if it doesn't reduce the problem significantly.
       The divide & conquer approach should allow more sophisticated (compute-intensive) graph analysis approaches in the future.
       This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence.
       Can k-mer filtering work for non-de Bruijn graph assemblers (SGA, ALLPATHS-LG, ...)?
    55. Acknowledgements
       The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze.
       Collaborators: Jim Tiedje (il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU).
       Funding: USDA NIFA; MSU startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
    56. (Figure-only slide.)
    57. A guide to khmer
       Python wrapping C++; BSD license.
       Tools for:
       - k-mer abundance filtering (constant memory; inexact)
       - assembly graph size filtering (constant memory; exact)
       - assembly graph partitioning (exact)
       - error trimming (constant memory; inexact)
       Still in alpha form, and especially undocumented.
    58. k-mer coverage by partition
    59. Abundance filtering affects low-coverage contigs dramatically
    60. Many read pairs map together
    61. Bonus slides
       How much more do we need to sequence, anyway??
    62. Calculating expected k-mer numbers
       Entire population; subsamples S1 and S2.
       Note: there is no simple way to correct for abundance bias, so we don't, yet.
    63. Coverage estimates
       (Based on k-mer mark/recapture analysis.)
       Iowa prairie (136 GB): est. 1.26x.
       Iowa corn (62 GB): est. 0.86x.
       Wisconsin corn (190 GB): est. 2.17x.
       For comparison, the panda genome assembly used ~50x coverage with short reads.
       Qingpeng Zhang
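The mark/recapture idea on slides 62-63 is presumably in the spirit of the classic Lincoln-Petersen estimator (an assumption on my part; the exact formula is not on the slide): estimate the total number of distinct k-mers from two subsamples and their overlap.

```python
def estimate_total_kmers(s1, s2):
    """Lincoln-Petersen-style estimate: total ~ |S1| * |S2| / |S1 & S2|.
    s1, s2 are sets of k-mers observed in two independent subsamples."""
    overlap = len(s1 & s2)
    if overlap == 0:
        raise ValueError("no overlap between samples; cannot estimate")
    return len(s1) * len(s2) // overlap

s1 = set(range(0, 600))     # toy stand-in for the k-mers seen in sample S1
s2 = set(range(400, 1000))  # toy stand-in for the k-mers seen in sample S2
print(estimate_total_kmers(s1, s2))  # 600 * 600 / 200 = 1800
```

Coverage then falls out as (k-mers observed so far) / (estimated total), which is how a 136 GB data set can be called "1.26x". As the slide notes, abundance bias is not corrected here.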
    64. Coverage estimates: getting to 50x...
       Human: 150 GB for 50x.
       Iowa prairie (136 GB, est. 1.26x): 5.4 TB for 50x.
       Iowa corn (62 GB, est. 0.86x): 3.6 TB for 50x.
       Wisconsin corn (190 GB, est. 2.17x): 4.4 TB for 50x.
       ...note that it's not clear what "coverage" exactly means here, since 16S-estimated diversity is very high.
    65. What does coverage mean here?
       "Unseen" sequence:
       1x: ~37%
       2x: ~14%
       5x: ~0.7%
       10x: ~0.0045%
       50x: ~2e-20%
       For metagenomes, coverage is of abundance-weighted DNA.
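These "unseen" fractions follow from a Poisson model of shotgun coverage: at mean coverage c, the chance a position is sequenced zero times is e^(-c).

```python
import math

def unseen_percent(c):
    """Percent of positions never sampled at mean coverage c (Poisson)."""
    return math.exp(-c) * 100

for c in (1, 2, 5, 10, 50):
    print(f"{c}x: {unseen_percent(c):.2g}% unseen")
```

The exponential decay is why going from 1x to 5x buys so much (37% unseen down to 0.7%), while the last few percent of abundance-weighted sequence costs enormous additional depth.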
    66. CAMERA annotation of the full contig set (> 1,000 bp)
       # of ORFs: 344,661 (MetaGene)
       Longest ORF: 1,974 bp; shortest ORF: 20 bp; average ORF: 173 bp.
       # of COG hits: 153,138 (e-value < 0.001)
       # of Pfam hits: 170,072
       # of TIGRFAM hits: 315,776
    67. CAMERA COG summary
    68. The k-mer oracle
       Q: is this k-mer present in the data set?
       A: no => then it is not, guaranteed.
       A: yes => it may or may not be present.
       This lets us store k-mers efficiently.
    69. Building on the k-mer oracle:
       Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:
    70. The k-mer graph oracle
       Q: does this k-mer overlap with this other k-mer?
       A: no => then it does not, guaranteed.
       A: yes => it may or may not.
       This lets us traverse de Bruijn graphs efficiently.
    71. The contig size oracle
       Q: could this read contribute to a contig bigger than N?
       A: no => then it does not, guaranteed.
       A: yes => then it might.
       This lets us eliminate reads that do not belong to "big" contigs.
    72. The read partition oracle
       Q: does this read connect to this other read in any way?
       A: no => then it does not, guaranteed.
       A: yes => then it might.
       This lets us subdivide the assembly problem into many smaller, disconnected problems that are much easier.
    73. Oracular fact
       All of these oracles are cheap, can yield answers from a different probability distribution, and can be "chained" together (so you can keep asking oracles for as long as you want, getting more and more accurate answers).
    74. Implementing a basic k-mer oracle
       Conveniently, perhaps the simplest data structure in computer science is what we need: a hash table that ignores collisions.
       Note: P(false positive) = fractional occupancy.
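The P(false positive) = fractional occupancy claim is easy to check empirically with a toy collision-ignoring table (the sizes below are arbitrary):

```python
import random

random.seed(1)
SIZE = 10007               # prime table size
table = [False] * SIZE     # one bit per slot; collisions are ignored

for x in random.sample(range(10**6), 3000):  # "store" 3000 items
    table[hash(x) % SIZE] = True

occupancy = sum(table) / SIZE
absent = range(10**6, 10**6 + 20000)         # items that were never stored
false_pos = sum(table[hash(x) % SIZE] for x in absent) / len(absent)
print(f"occupancy: {occupancy:.2f}, false-positive rate: {false_pos:.2f}")
```

Absent items hash to a uniformly random slot, so the chance that slot happens to be set is exactly the fraction of set slots, which is what the experiment shows.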
    75. A more reliable k-mer oracle
       Use a Bloom filter approach: multiple oracles, in series, are multiplicatively more reliable.
    76. Scaling the k-mer oracle
       Adding additional filters increases discrimination at the cost of speed.
       This gives you a fairly straightforward tradeoff: memory (decrease individual false-positive rates) vs computation (more filters!).
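The tradeoff can be made concrete: with N k-mers stored in M total bits split across n one-hash filters, each filter's occupancy (its individual false-positive rate) rises as n grows, but the filters' answers multiply, so the combined rate is occupancy**n. The sizes below are illustrative, not the talk's actual configuration.

```python
import math

def combined_fp(N, M, n):
    """Combined false-positive rate for N items in M bits over n filters."""
    occupancy = 1 - math.exp(-N / (M / n))  # expected per-filter occupancy
    return occupancy ** n

N, M = 1_000_000, 8_000_000  # 1 M k-mers in 8 M total bits (1 byte/k-mer)
for n in (1, 2, 4, 8):
    print(f"{n} filter(s): {combined_fp(N, M, n):.4f}")
```

With fixed memory, adding filters helps up to a point (each extra filter also makes every filter fuller), after which the combined rate creeps back up; more hashing also means more computation per query, which is the memory-vs-computation tradeoff the slide describes.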
    77.-78. (Figure-only slides.)
    79. The k-mer oracle, revisited
       We can now ask "does k-mer ACGTGGCAGG... occur in the data set?" quickly and accurately.
       This implicitly lets us store the graph structure, too!
    80. B. Partitioning graphs into disconnected subgraphs
       Which nodes do not connect to each other?
    81. Partitioning graphs: it looks easy
       Which nodes do not connect to each other?
    82. But partitioning big graphs is expensive
       Requires exhaustive exploration.
    83. But partitioning big graphs is expensive
    84. Tabu search: avoid global searches
    85.-88. Tabu search: systematic local exploration (animation frames)
    89. Strategies for completing big searches...
    90. Hard-to-traverse graphs are well connected
    91. Add neighborhood exclusion to tabu search
    92. The exclusion strategy lets you systematically explore big graphs with a local algorithm
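A toy version of local exploration with an exclusion ("claimed") set; this is a sketch of the idea, not khmer's actual traversal code:

```python
from collections import deque

def explore_component(graph, seed, claimed):
    """Collect all nodes connected to `seed`, skipping already-claimed
    nodes so repeated sweeps never redo work."""
    component, queue = set(), deque([seed])
    while queue:
        node = queue.popleft()
        if node in claimed:
            continue
        claimed.add(node)
        component.add(node)
        queue.extend(graph.get(node, ()))
    return component

graph = {1: [2], 2: [1, 3], 3: [2], 7: [8], 8: [7]}
claimed = set()
parts = [explore_component(graph, s, claimed)
         for s in graph if s not in claimed]
print(parts)  # [{1, 2, 3}, {7, 8}]
```

Because each node is claimed exactly once, sweeping seeds over the whole node set partitions the graph with purely local work, and the one-sided "no" guarantee of the oracle means components are never split apart, only (rarely) merged.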
    93. Potential problems
       Our oracle can mistakenly connect clusters.
    94. Potential problems
       This is a problem if the rate is sufficiently high!
    95. However, the error is one-sided:
       Graphs will never be erroneously disconnected.
    96. The error is one-sided:
       Nodes will never be erroneously disconnected.
    97. The error is one-sided:
       Nodes will never be erroneously disconnected.
       This is critically important: it guarantees that our k-mer graph representation yields reliable "no" answers.
       This, in turn, lets us reliably partition graphs into smaller graphs.
    98. Actual implementation