2. Assistant Professor (2008)
Computer Science & Engineering /
Microbiology and Molecular Genetics,
Michigan State University
BA Reed College / Math
PhD Caltech / Developmental Biology
Member of the Python Software Foundation
(a.k.a. awesomest programming language)
3. I’m a bit sick, so I may cough loudly and
obnoxiously at times.
4. 1. O’Reilly folk asked if I had anything to talk
about.
2. Professors love talking.
3. Nifty techniques, applied to a new problem.
1. Can they be applied to your problem?
2. Do you have any ideas for me?
10. Wisconsin
◦ Native prairie (Goose Pond,
Audubon)
◦ Long term cultivation (corn)
◦ Switchgrass rotation (previously
corn)
◦ Restored prairie (from 1998)
Iowa
◦ Native prairie (Morris prairie)
◦ Long term cultivation (corn)
Kansas
◦ Native prairie (Konza prairie)
◦ Long term cultivation (corn)
[Photos: Iowa native prairie; switchgrass (Wisconsin); Iowa >100 yr tilled]
11. 30 Gb of sequence from Iowa corn
50 Gb of sequence from Iowa prairie
200 Gb of sequence from Wisconsin corn and
prairie
http://ivory.idyll.org/blog/aug-10/assembly-part-i
http://ivory.idyll.org/blog/jul-10/kmer-filtering
http://ivory.idyll.org/blog/jul-10/illumina-read-phenomenology
12. Whole (meta)genome shotgun sequencing
involves fragmenting and sequencing,
followed by re-assembly.
The shorter the reads, the more difficult this
is to do reliably.
Assembly scales poorly.
13. Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
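The process on this slide can be sketched in a few lines of Python. This is a toy simulation (not any real sequencing pipeline); the function name `shotgun_reads` and the parameters are made up for illustration:

```python
import random

def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Sample fixed-length reads uniformly at random from a genome string."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads

genome = "ATGGACCAGATGACACGTTAGGC"
reads = shotgun_reads(genome, read_len=8, n_reads=20)
# every read is a substring of the original genome; the assembler's
# job is to reverse this process from the reads alone
assert all(r in genome for r in reads)
```

The hard part, of course, is the "reassemble computationally" step that the rest of the talk is about.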
14. Assembly is inherently an all-by-all process.
There is no good way to subdivide the short
sequences without potentially missing a key
connection.
15. Essentially, break reads (of any length) down into
multiple overlapping words of fixed length k.
ATGGACCAGATGACAC (k=12) =>
ATGGACCAGATG
TGGACCAGATGA
GGACCAGATGAC
GACCAGATGACA
ACCAGATGACAC
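The decomposition above is a one-liner in Python (a sketch; `kmers` is a hypothetical helper name, not a real library call):

```python
def kmers(read, k):
    """Break a read into all overlapping k-length words."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

print(kmers("ATGGACCAGATGACAC", 12))
# => the 5 overlapping 12-mers listed on the slide
# (a read of length L yields L - k + 1 k-mers)
```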
18. For decisions about which paths to take, etc.,
biology-based heuristics come into play as well.
19. Fixed-length words => great CS techniques
(hashing, trie structures, etc.)
Data loading/comparison scales with size of your
data, N.
Memory usage scales with # of unique words.
This is an advantage over other techniques
◦ NxN comparisons…
Some disadvantages, too; see review,
J.R. Miller et al. / Genomics (2010)
20. Unlike many common computational science
problems in physics and chemistry, graph
analysis is combinatorial in nature and
requires a lot of RAM (to store the
graph).
This leads to the mildly unusual HPC scaling
issue of RAM as a limiting factor.
…and RAM is expensive.
21. What if we knew which original genomes our
short sequences came from?
Then we could just put all the sequences that
came from a particular genome in a smaller
bin, and assemble that independently!
23. What if we knew which original genomes our
short sequences came from?
Then we could just put all the sequences that
came from a particular genome in a smaller
bin, and assemble that independently!
Unfortunately this is already equivalent to
solving the hard component of the assembly
problem…
24. Q: is this k-mer present in the data set?
A: no => then it is not.
A: yes => it may or may not be present.
This lets us store k-mers efficiently.
25. Once we can store/query k-mers efficiently in
this oracle, we can build additional oracles on
top of it:
26. Q: does this k-mer overlap with this other k-mer?
A: no => then it does not, guaranteed.
A: yes => it may or may not.
This lets us traverse k-mer graphs efficiently.
27. Conveniently, perhaps
the simplest data
structure in computer
science is what we
need…
…a hash table that
ignores collisions.
Note, P(false positive) =
fractional occupancy.
28. If you ignore collisions…
O(1) query, insertion, update
Fixed memory usage
Ridiculously simple to implement (although
developing a good hash function can take
some effort)
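A minimal sketch of such a collision-ignoring table (this is an illustration, not the actual khmer code; Python's built-in `hash()` stands in for a real k-mer hash function):

```python
class CollisionIgnoringSet:
    """Fixed-size table of bits; collisions are simply ignored.
    'No' answers are exact; 'yes' answers may be false positives,
    with P(false positive) = fractional occupancy of the table."""

    def __init__(self, size):
        self.size = size
        self.bits = bytearray(size)  # one byte per slot, for simplicity

    def _slot(self, kmer):
        return hash(kmer) % self.size

    def add(self, kmer):
        self.bits[self._slot(kmer)] = 1

    def __contains__(self, kmer):
        return self.bits[self._slot(kmer)] == 1

table = CollisionIgnoringSet(1000)
table.add("ATGGACCAGATG")
assert "ATGGACCAGATG" in table  # stored k-mers always answer "yes"
```

Insertion and lookup are each a single hash plus an array index, hence O(1) with fixed memory, as the slide says.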
30. Use a Bloom filter approach – multiple oracles,
in serial, are multiplicatively more reliable.
http://en.wikipedia.org/wiki/Bloom_filter
31. Adding additional filters increases discrimination
at the cost of speed.
This gives you a fairly straightforward tradeoff:
memory (decrease individual false positives) vs
computation (more filters!)
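A textbook Bloom filter capturing this idea, sketched in Python (an illustration of the standard structure, not the talk's implementation; SHA-256 with a salt stands in for k independent hash functions):

```python
import hashlib

class BloomFilter:
    """One bit table queried through n_hashes independent hash functions.
    Each extra hash multiplies down the false-positive rate, at the
    cost of more hashing work per query."""

    def __init__(self, size, n_hashes):
        self.size = size
        self.n_hashes = n_hashes
        self.bits = bytearray(size)

    def _slots(self, item):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for s in self._slots(item):
            self.bits[s] = 1

    def __contains__(self, item):
        # "yes" only if every hash agrees; one "no" is definitive
        return all(self.bits[s] for s in self._slots(item))

bf = BloomFilter(10_000, n_hashes=4)
bf.add("ATGGACCAGATG")
assert "ATGGACCAGATG" in bf  # no false negatives, ever
```

With occupancy p per oracle and n oracles, the combined false-positive rate is roughly p^n: that is the memory-vs-computation tradeoff on the slide.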
35. We can now ask, “does k-mer
ACGTGGCAGG… occur in the data set?”,
quickly and accurately.
This implicitly lets us store the graph
structure, too!
36. Once you can look up k-mers quickly, traversal
is easy: there are only 8 possible overlapping
k-mers:
4 before, and 4 after.
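Enumerating those 8 candidate neighbors is trivial (a sketch with a made-up helper name):

```python
def neighbors(kmer):
    """All 8 possible overlapping k-mers for a DNA k-mer:
    4 extending one base to the left, 4 to the right."""
    before = [b + kmer[:-1] for b in "ACGT"]  # 4 before
    after = [kmer[1:] + b for b in "ACGT"]    # 4 after
    return before + after

print(neighbors("ACGTGGCAGG"))  # always exactly 8 candidates
```

Traversal then amounts to asking the membership oracle which of these 8 candidates actually occur in the data set.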
37. We can now ask, “does k-mer
ACGTGGCAGG… occur in the data set?”,
quickly and accurately.
This implicitly lets us store the graph
structure, too, because there are only 8
possible connected nodes.
We can now traverse this graph structure and
ask several types of questions:
55. Nodes will never be erroneously disconnected.
This is critically important: it guarantees that our
k-mer graph representation yields reliable “no”
answers.
This, in turn, lets us reliably partition graphs into
smaller graphs…
…and we can do so iteratively.
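The partitioning idea can be sketched as a breadth-first traversal over membership queries (an illustration only; here `present` is an exact Python set standing in for the probabilistic oracle, and the function names are made up):

```python
from collections import deque

def component(seed, present):
    """Collect every k-mer connected to `seed`. Because "no" answers
    are guaranteed, a component is never erroneously split."""
    def neighbors(kmer):
        return ([b + kmer[:-1] for b in "ACGT"] +
                [kmer[1:] + b for b in "ACGT"])
    seen = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nb in neighbors(node):
            if nb in present and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return seen

def partition(all_kmers):
    """Repeatedly pull one connected component out of the remainder."""
    remaining = set(all_kmers)
    parts = []
    while remaining:
        comp = component(next(iter(remaining)), remaining)
        parts.append(comp)
        remaining -= comp
    return parts
```

Each extracted component is a smaller, independent assembly problem, which is exactly the divide-and-conquer the earlier slides wished for.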
57. 1. Built lightweight probabilistic data
structure/algorithm for k-mer storage.
- Constant memory, constant lookup
- Linear time to create structure
2. Implemented systematic graph traversal of
arbitrarily large graphs (> ~3 billion connected
k-mers, so far)
- Affine memory (with small linear constant)
- Bounded time for exploration; bound traded for
memory
3. Built partitioning system to eliminate small
graphs and extract disconnected graphs.
60. Python wrapping C++, ~5000 LoC. (Python handles
parallelization; go free, GIL!)
Partitioning & assembling a 2 Gb data set can be done in ~8
GB of RAM in < 1 day
◦ Compare with 40 GB requirement for existing (released) assemblers.
◦ Probably 10-fold speed improvement easily (KISS; no premature opt)
Can partition, assemble ~50 Gb in < 1 wk in 70 GB of RAM,
single chassis, 8 CPUs.
Not yet clear how well it scales to 200 Gb, but should…
…all of this is running on Amazon cloud rentals.
61. Lightweight probabilistic storage system for
k-mers, ~1 byte / k-mer.
Large graph traversal (10-20 bn k-mers)
◦ Tabu search
◦ Neighborhood exclusion
Graph partitioning, trimming, grokking.
◦ Iterative refinement is “perfect”
◦ Failure rate ~ memory usage, with good failover (connectivity increases).
62. More general assembly graph analysis
Breaking graphs in good places
Clustering of large protein similarity graphs/matrices
Caveats:
Preferential attachment with false positives?
First publication --
Bloom counting hash (see kmer-filtering blog post)
63. We were lucky & could turn our graph traversal
problem into a set membership query.
Tabu search / neighborhood exclusion for
exhaustive graph traversal isn’t novel, but might
be useful. Requires systematic tagging.
But… random and probabilistic approaches (skip
lists, Bloom filters, etc.) can be surprisingly
useful.
◦ One sided errors are awesome for Big Data.
http://en.wikipedia.org/wiki/Category:Probabilistic_data_structures
64. GED lab / k-mer gang
Adina Howe (w/Tiedje)
Arend Hintze, postdoc
Jason Pell, grad
Rosangela Canino-Koning,
grad
Qingpeng Zhang, grad
Collaborators (MSU)
Weiming Li
Charles Ofria
Jim Tiedje
(w/Janet Jansson, Rachel
Mackelprang (JGI))
Funding
USDA NIFA, NSF, DOE,
Michigan State U.
65. ABySS assembler – multi-node assembly in RAM
On-disk assembly:
SOAP assembler (BGI) – not open source
Cortex assembler (EBI) – unpub/not released
Contrail assembler (Michael Schatz) – unpub/not
released
It’s hard for me to tell how these last three compare ;)
BUT our current approach is orthogonal and can be
used in conjunction (as a pre-filter) with these
assemblers.
Editor's Notes
Note, no tolerance for indels
Paint between the greens.
When a green connects two or more colors, recolor one color.