# Probabilistic breakdown of assembly graphs

Published in: Technology
• Note, no tolerance for indels
• Paint between the greens.
• When a green connects two or more colors, recolor one color.
• Dependent on minimum density tagging
• ### Probabilistic breakdown of assembly graphs

1. 1. C. Titus Brown ctb@msu.edu
2. 2. Assistant Professor (2008), Computer Science & Engineering / Microbiology and Molecular Genetics, Michigan State University. BA, Reed College (Math). PhD, Caltech (Developmental Biology). Member of the Python Software Foundation (a.k.a. awesomest programming language).
3. 3. I’m a bit sick, so I may cough loudly and obnoxiously at times.
4. 4. 1. O’Reilly folk asked if I had anything to talk about. 2. Professors love talking. 3. Nifty techniques, applied to a new problem. 1. Can they be applied to your problem? 2. Do you have any ideas for me?
5. 5. • ctb@msu.edu • http://ged.msu.edu/ • http://github.com/ctb/ ◦ khmer package, BSD license; k-mer analysis. ◦ …lotsa other stuff.
6. 6. Slide courtesy of Lincoln Stein. My blog: http://ivory.idyll.org/blog/oct-10/sky-is-falling ; cloud computing will not save us!
7. 7. “Quantity has a quality all its own” J. Stalin
8. 8. “Quantity has a quality all its own” J. Stalin “Ours is a just cause; victory will be ours!” V. Molotov
9. 9. SAMPLING LOCATIONS
10. 10. • Wisconsin ◦ Native prairie (Goose Pond, Audubon) ◦ Long term cultivation (corn) ◦ Switchgrass rotation (previously corn) ◦ Restored prairie (from 1998) • Iowa ◦ Native prairie (Morris prairie) ◦ Long term cultivation (corn) • Kansas ◦ Native prairie (Konza prairie) ◦ Long term cultivation (corn) Images: Iowa native prairie; switchgrass (Wisconsin); Iowa, >100 yr tilled.
11. 11. • 30 Gb of sequence from Iowa corn • 50 Gb of sequence from Iowa prairie • 200 Gb of sequence from Wisconsin corn, prairie http://ivory.idyll.org/blog/aug-10/assembly-part-i http://ivory.idyll.org/blog/jul-10/kmer-filtering http://ivory.idyll.org/blog/jul-10/illumina-read-phenomenology
12. 12. • Whole (meta)genome shotgun sequencing involves fragmenting and sequencing, followed by re-assembly. • The shorter the reads, the more difficult this is to do reliably. • Assembly scales poorly.
13. 13. Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
14. 14. Assembly is inherently an all by all process. There is no good way to subdivide the short sequences without potentially missing a key connection:
15. 15. Essentially, break reads (of any length) down into multiple overlapping words of fixed length k. ATGGACCAGATGACAC (k=12) => ATGGACCAGATG TGGACCAGATGA GGACCAGATGAC GACCAGATGACA ACCAGATGACAC
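The decomposition on this slide can be sketched in a few lines of Python. This is only an illustration of the sliding-window idea; the function name `kmers` is made up for this example and is not taken from the khmer package.

```python
def kmers(sequence, k):
    """Return every overlapping word of length k in sequence, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    for word in kmers("ATGGACCAGATGACAC", 12):
        print(word)
```

A read shorter than k simply yields no words, which is why very short reads drop out of a k-mer graph.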
16. 16. J.R. Miller et al. / Genomics (2010)
17. 17. J.R. Miller et al. / Genomics (2010)
18. 18. For decisions about which paths to follow, etc., biology-based heuristics come into play as well.
19. 19. • Fixed-length words => great CS techniques (hashing, trie structures, etc.) • Data loading/comparison scales with the size of your data, N. • Memory usage scales with the number of unique words. • This is an advantage over other techniques ◦ NxN comparisons… • Some disadvantages, too; see the review by J.R. Miller et al. / Genomics (2010)
20. 20. • Unlike some other common computational science problems in physics and chemistry, which are combinatorial in nature, graph analysis requires a lot of RAM (to store the graph). • This leads to the mildly unusual HPC scaling issue of RAM as a limiting factor. • …and RAM is expensive.
21. 21. • What if we knew which original genomes our short sequences came from? • Then we could just put all the sequences that came from a particular genome in a smaller bin, and assemble that independently!
22. 22. • Which nodes do not connect to each other?
23. 23. • What if we knew which original genomes our short sequences came from? • Then we could just put all the sequences that came from a particular genome in a smaller bin, and assemble that independently! • Unfortunately this is already equivalent to solving the hard component of the assembly problem…
24. 24. • Q: is this k-mer present in the data set? • A: no => then it is not. • A: yes => it may or may not be present. This lets us store k-mers efficiently.
25. 25. • Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:
26. 26. • Q: does this k-mer overlap with this other k-mer? • A: no => then it does not, guaranteed. • A: yes => it may or may not. This lets us traverse k-mer graphs efficiently.
27. 27. • Conveniently, perhaps the simplest data structure in computer science is what we need… • …a hash table that ignores collisions. • Note, P(false positive) = fractional occupancy.
28. 28. • If you ignore collisions… • O(1) query, insertion, update • Fixed memory usage • Ridiculously simple to implement (although developing a good hash function can take some effort)
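A minimal sketch of such a collision-ignoring table, assuming one byte per slot and a BLAKE2b digest as the hash function. Both choices, and the class name `CollisionIgnoringTable`, are illustrative assumptions rather than details given in the talk.

```python
import hashlib


class CollisionIgnoringTable:
    """Fixed-size presence table: a 'no' is always right, a 'yes' may be a collision."""

    def __init__(self, size):
        self.size = size
        self.slots = bytearray(size)  # one byte per slot for simplicity (assumption)

    def _slot(self, kmer):
        # Deterministic hash; Python's built-in hash() is salted per process.
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.size

    def add(self, kmer):
        self.slots[self._slot(kmer)] = 1  # collisions are simply ignored

    def __contains__(self, kmer):
        return self.slots[self._slot(kmer)] == 1

    def occupancy(self):
        """Fraction of slots set, i.e. the chance an absent k-mer answers 'yes'."""
        return sum(self.slots) / self.size
```

Every inserted k-mer is always reported as present. An absent k-mer is reported as present only when it happens to land on an occupied slot, which is where the "P(false positive) = fractional occupancy" note comes from.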
29. 29. • Conveniently, perhaps the simplest data structure in computer science is what we need… • …a hash table that ignores collisions. • Note, P(false positive) = fractional occupancy.
30. 30. Use a Bloom filter approach – multiple oracles, in serial, are multiplicatively more reliable. http://en.wikipedia.org/wiki/Bloom_filter
31. 31. Adding additional filters increases discrimination at the cost of speed. This gives you a fairly straightforward tradeoff: memory (decrease individual false positives) vs computation (more filters!)
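One way to picture the "multiple oracles in serial" idea is a stack of presence tables of different sizes, where a k-mer counts as present only if every table says yes. The sketch below is an assumption-laden illustration: the class name `MultiTableFilter`, the use of one BLAKE2b value reduced modulo each table size, and the example sizes are all hypothetical, not the talk's implementation.

```python
import hashlib


class MultiTableFilter:
    """Several collision-ignoring tables queried in series (Bloom-filter style)."""

    def __init__(self, sizes):
        # Distinct (ideally coprime) sizes make the per-table slots differ.
        self.tables = [bytearray(size) for size in sizes]

    def _slots(self, kmer):
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        return [value % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] = 1

    def __contains__(self, kmer):
        # A single 'no' from any table is a guaranteed 'no' overall.
        return all(table[slot] for table, slot in zip(self.tables, self._slots(kmer)))

    def false_positive_estimate(self):
        """Product of per-table occupancies, assuming the tables err independently."""
        estimate = 1.0
        for table in self.tables:
            estimate *= sum(table) / len(table)
        return estimate
```

With three tables each 10% full, the estimated false positive rate is 0.1 × 0.1 × 0.1 = 0.001, at the cost of three lookups per query. That is the memory-versus-computation tradeoff described on the slide.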
32. 32. Memory usage, Bloom filter vs trie (theoretical minimum)
33. 33. • We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately. • This implicitly lets us store the graph structure, too!
34. 34. Once you can look up k-mers quickly, traversal is easy: there are only 8 possible overlapping k-mers: 4 before, and 4 after.
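The eight candidates can be enumerated directly. In this sketch, `oracle` is any object supporting `in`; a plain Python set stands in for the probabilistic structure. The function names are illustrative, and reverse complements are ignored for simplicity.

```python
def neighbors(kmer, alphabet="ACGT"):
    """Return the 8 k-mers that overlap kmer by k-1 bases: 4 before, 4 after."""
    before = [base + kmer[:-1] for base in alphabet]
    after = [kmer[1:] + base for base in alphabet]
    return before + after


def present_neighbors(kmer, oracle):
    """Keep only the candidate neighbors that the membership oracle reports."""
    return [candidate for candidate in neighbors(kmer) if candidate in oracle]


if __name__ == "__main__":
    store = {"ATGGACCAGATG", "TGGACCAGATGA"}
    print(present_neighbors("ATGGACCAGATG", store))
```

No edge list is ever stored: the graph is implied by which of the eight membership queries come back positive.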
35. 35. • We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately. • This implicitly lets us store the graph structure, too, because there are only 8 possible connected nodes. • We can now traverse this graph structure and ask several types of questions:
36. 36. Which of these graphs has more than 3 nodes?
37. 37. Which of these graphs has more than 3 nodes?
38. 38. Which nodes do not connect to each other?
39. 39. Which nodes do not connect to each other?
40. 40. Our oracle can mistakenly connect clusters.
41. 41. This is a problem if the rate is sufficiently high!
42. 42. Graphs will never be erroneously disconnected
43. 43. Nodes will never be erroneously disconnected
44. 44. Nodes will never be erroneously disconnected. This is critically important: it guarantees that our k-mer graph representation yields reliable “no” answers. This, in turn, lets us reliably partition graphs into smaller graphs… …and we can do so iteratively.
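Partitioning by connectivity can be sketched as a breadth-first traversal that only ever asks membership questions. This is a toy version under stated assumptions: an exact Python set stands in for the probabilistic oracle, reverse complements are ignored, and the names `partition`, `kmers`, and `neighbors` are illustrative rather than taken from the khmer code.

```python
from collections import deque


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def neighbors(kmer, alphabet="ACGT"):
    return [b + kmer[:-1] for b in alphabet] + [kmer[1:] + b for b in alphabet]


def partition(oracle, seeds):
    """Group seed k-mers into connected components using only 'in' queries."""
    seen = set()
    components = []
    for seed in seeds:
        if seed in seen or seed not in oracle:
            continue
        component = {seed}
        seen.add(seed)
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            for candidate in neighbors(current):
                if candidate in oracle and candidate not in seen:
                    seen.add(candidate)
                    component.add(candidate)
                    queue.append(candidate)
        components.append(component)
    return components


if __name__ == "__main__":
    reads = ["ATGGACCAGATGACAC", "TTTTCCCCGGGGAAAA"]
    store = {word for read in reads for word in kmers(read, 12)}
    print([len(c) for c in partition(store, sorted(store))])
```

With a probabilistic oracle in place of the set, a false positive can only add a node or an edge. Two components may be merged by mistake, but a true connection is never lost, so reads in different partitions can safely be assembled separately.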
45. 45. 1. Built lightweight probabilistic data structure/algorithm for k-mer storage. - Constant memory, constant lookup - Linear time to create structure 2. Implemented systematic graph traversal of arbitrarily large graphs (> ~3 billion connected k-mers, so far) - Affine memory (with small linear constant) - Bounded time for exploration; bound traded for memory 3. Built partitioning system to eliminate small graphs and extract disconnected graphs.
46. 46. Pre-filter/partition for somebody else’s assembler N.B. This results in identical assembly.
47. 47. • Python wrapping C++, ~5000 LoC. (Python handles parallelization; go free, GIL!) • Partitioning & assembling a 2 Gb data set can be done in ~8 GB of RAM in < 1 day ◦ Compare with the 40 GB requirement for existing (released) assemblers. ◦ Probably 10-fold speed improvement easily (KISS; no premature opt) • Can partition and assemble ~50 Gb in < 1 wk in 70 GB of RAM, single chassis, 8 CPU. • Not yet clear how well it scales to 200 Gb, but should… • …all of this is running on Amazon cloud rentals.
48. 48. • Lightweight probabilistic storage system for k-mers, ~1 byte / k-mer. • Large graph traversal (10-20 bn k-mers) ◦ Tabu search ◦ Neighborhood exclusion • Graph partitioning, trimming, grokking. ◦ Iterative refinement is “perfect” ◦ Failure rate ~ memory usage, with good failover (connectivity increases).
49. 49. • More general assembly graph analysis • Breaking graphs in good places • Clustering of large protein similarity graphs/matrices Caveats: • Preferential attachment with false positives? First publication -- • Bloom counting hash (see kmer-filtering blog post)
50. 50. • We were lucky & could turn our graph traversal problem into a set membership query. • Tabu search / neighborhood exclusion for exhaustive graph traversal isn’t novel, but might be useful. Requires systematic tagging. • But… random and probabilistic approaches (skip lists, Bloom filters, etc.) can be surprisingly useful. ◦ One-sided errors are awesome for Big Data. http://en.wikipedia.org/wiki/Category:Probabilistic_data_structures
51. 51. GED lab / k-mer gang Adina Howe (w/Tiedje) Arend Hintze, postdoc Jason Pell, grad Rosangela Canino-Koning, grad Qingpeng Zhang, grad Collaborators (MSU) Weiming Li Charles Ofria Jim Tiedje (w/Janet Jansson, Rachel Mackelprang (JGI)) Funding USDA NIFA, NSF, DOE, Michigan State U.
52. 52. • ABySS assembler – multi-node assembly in RAM On-disk assembly: • SOAP assembler (BGI) – not open source • Cortex assembler (EBI) – unpub/not released • Contrail assembler (Michael Schatz) – unpub/not released It’s hard for me to tell how these last three compare ;) BUT our current approach is orthogonal and can be used in conjunction (as a pre-filter) with these assemblers.