Probabilistic breakdown of assembly graphs

Assistant Professor (2008)
Computer Science & Engineering /
Microbiology and Molecular Genetics,
Michigan State University
BA Reed College/Math
PhD Caltech / Developmental Biology
Member of the Python Software Foundation
(a.k.a. awesomest programming language)

I’m a bit sick, so I may cough loudly and
obnoxiously at times.

1. O’Reilly folk asked if I had anything to talk
about.
2. Professors love talking.
3. Nifty techniques, applied to a new problem.
1. Can they be applied to your problem?
2. Do you have any ideas for me?

 ctb@msu.edu
 http://ged.msu.edu/
 http://github.com/ctb/
◦ khmer package, BSD license; k-mer analysis.
◦ …lotsa other stuff.

Slide courtesy of Lincoln Stein
My blog: http://ivory.idyll.org/blog/oct-10/sky-
is-falling ; cloud computing will not save us!

“Quantity has a quality all its own”
J. Stalin

“Quantity has a quality all its own”
J. Stalin
“Ours is a just cause; victory will be ours!”
V. Molotov

 Wisconsin
◦ Native prairie (Goose Pond,
Audubon)
◦ Long term cultivation (corn)
◦ Switchgrass rotation (previously
corn)
◦ Restored prairie (from 1998)
 Iowa
◦ Native prairie (Morris prairie)
 Kansas
◦ Native prairie (Konza prairie)
Iowa Native Praire
Switchgrass
(Wisconsin)
Iowa >100 yr tilled

 30 Gb of sequence from Iowa corn
 50 Gb of sequence from Iowa prairie
 200 Gb of sequence from Wisconsin corn,
prairie
http://ivory.idyll.org/blog/aug-10/assembly-part-i
http://ivory.idyll.org/blog/jul-10/kmer-filtering
http://ivory.idyll.org/blog/jul-10/illumina-read-
phenomenology

 Whole (meta)genome shotgun sequencing
involves fragmenting and sequencing,
followed by re-assembly.
 The shorter the reads, the more difficult this
is to do reliably.
 Assembly scales poorly.

Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)

Assembly is inherently an all by all process.
There is no good way to subdivide the short
sequences without potentially missing a key
connection:

Essentially, break reads (of any length) down into
multiple overlapping words of fixed length k.
ATGGACCAGATGACAC (k=12) =>
ATGGACCAGATG
TGGACCAGATGA
GGACCAGATGAC
GACCAGATGACA
ACCAGATGACAC

J.R. Miller et al. / Genomics (2010)

For decisions about which paths etc, biology-
based heuristics come into play as well.

 Fixed-length words => great CS techniques
(hashing, trie structures, etc.)
 Data loading/comparison scales with size of your
data, N.
 Memory usage scales with # of unique words.
 This is an advantage over other techniques
◦ NxN comparisons…
 Some disadvantages, too; see review,
 J.R. Miller et al. / Genomics (2010)

 Unlike some other common computational
science problems in physics and chemistry,
which are combinatorial in nature, graph
analysis requires a lot of RAM (to store the
graph).
 This leads to the mildly unusual HPC scaling
issue of RAM as a limiting factor.
 …and RAM is expensive.

 If we knew which original genomes our short
sequences came from?
 Then we could just put all the sequences that
came from a particular genome in a smaller
bin, and assemble that independently!

 Which nodes do not connect to each other?

 If we knew which original genomes our short
sequences came from?
 Then we could just put all the sequences that
came from a particular genome in a smaller
bin, and assemble that independently!
 Unfortunately this is already equivalent to
solving the hard component of the assembly
problem…

 Q: is this k-mer present in the data set?
 A: no => then it is not.
 A: yes => it may or may not be present.
This lets us store k-mers efficiently.

 Once we can store/query k-mers efficiently in
this oracle, we can build additional oracles on
top of it:

 Q: does this k-mer overlap with this other k-
mer?
 A: no => then it does not, guaranteed.
 A: yes => it may or may not.
This lets us traverse k-mer graphs efficiently.

 Conveniently, perhaps
the simplest data
structure in computer
science is what we
need…
 …a hash table that
ignores collisions.
 Note, P(false positive) =
fractional occupancy.

 If you ignore collisions…
 O(1) query, insertion, update
 Fixed memory usage
 Ridiculously simple to implement (although
developing a good hash function can take
some effort)

Use a Bloom filter approach – multiple oracles,
in serial, are multiplicatively more reliable.
http://en.wikipedia.org/wiki/Bloom_filter

Adding additional filters increases discrimination
at the cost of speed.
This gives you a fairly straightforward tradeoff:
memory (decrease individual false positives) vs
computation (more filters!)

Memory usage, Bloom filter vs trie (theoretical minimum)

 We can now ask, “does k-mer
ACGTGGCAGG… occur in the data set?”,
quickly and accurately.
 This implicitly lets us store the graph
structure, too!

Once you can look up k-mers quickly, traversal
is easy: there are only 8 possible overlapping
k-mers:
4 before, and 4 after.

 We can now ask, “does k-mer
ACGTGGCAGG… occur in the data set?”,
quickly and accurately.
 This implicitly lets us store the graph
structure, too, because there are only 8
possible connected nodes.
 We can now traverse this graph structure and
ask several times of questions:

Which of these graphs has more than 3 nodes?

Which nodes do not connect to each other?

Our oracle can mistakenly connect clusters.

This is a problem if the rate is sufficiently high!

Graphs will never be erroneously disconnected

Nodes will never be erroneously disconnected

Nodes will never be erroneously disconnected.
This is critically important: it guarantees that our
k-mer graph representation yields reliable “no”
answers.
This, in turn, lets us reliably partition graphs into
smaller graphs…
…and we can do so iteratively.

1. Built lightweight probabilistic data
structure/algorithm for k-mer storage.
- Constant memory, constant lookup
- Linear time to create structure
2. Implemented systematic graph traversal of
arbitrarily large graphs (> ~3 billion connected
k-mers, so far)
- Affine memory (with small linear constant)
- Bounded time for exploration; bound traded for
memory
3. Built partitioning system to eliminate small
graphs and extract disconnected graphs.

Pre-filter/partition for somebody else’s
assembler
N.B. This results in identical assembly.

 Python wrapping C++, ~5000 LoC. (Python handles
parallelization; go free, GIL!)
 Partitioning & assembling 2 Gb data set can be done in ~8
gb of RAM in < 1 day
◦ Compare with 40 gb requirement for existing (released) assemblers.
◦ Probably 10-fold speed improvement easily (KISS; no premature opt)
 Can partition, assemble ~50 Gb in < 1 wk in 70 gb of RAM,
single chassis, 8 CPU.
 Not yet clear how well it scales to 200 Gb, but should…
 …all of this is running on Amazon cloud rentals.

 Lightweight probabilistic storage system for
k-mers, ~1 byte / k-mer.
 Large graph traversal (10-20 bn k-mers)
◦ Tabu search
◦ Neighborhood exclusion
 Graph partitioning, trimming, grokking.
◦ Iterative refinement is “perfect”
◦ Failure rate ~ memory usage, with good failover (
connectivity increases).

 More general assembly graph analysis
 Breaking graphs in good places
 Clustering of large protein similarity graphs/matrices
Caveats:
 Preferential attachment with false positives?
First publication --
 Bloom counting hash (see kmer-filtering blog post)

 We were lucky & could turn our graph traversal
problem into a set membership query.
 Tabu search / neighborhood exclusion for
exhaustive graph traversal isn’t novel, but might
be useful. Requires systematic tagging.
 But… random and probabilistic approaches (skip
lists, Bloom filters, etc.) can be surprisingly
useful.
◦ One sided errors are awesome for Big Data.
http://en.wikipedia.org/wiki/Category:
Probabilistic_data_structures

GED lab / k-mer gang
Adina Howe (w/Tiedje)
Arend Hintze, postdoc
Jason Pell, grad
Rosangela Canino-Koning,
grad
Qingpeng Zhang, grad
Collaborators (MSU)
Weiming Li
Charles Ofria
Jim Tiedje
(w/Janet Jansson, Rachel
Mackelprang (JGI))
Funding
USDA NIFA, NSF, DOE,
Michigan State U.

 ABySS assembler – multi-node assembly in RAM
On-disk assembly:
 SOAP assembler (BGI) – not open source
 Cortex assembler (EBI) – unpub/not released
 Contrail assembler (Michael Schatz) – unpub/not
released
It’s hard for me to tell how these last three compare ;)
BUT our current approach is orthogonal and can be
used in conjunction (as a pre-filter) with these
assemblers.

Probabilistic breakdown of assembly graphs

Recommended

Recommended

More Related Content

What's hot

What's hot (7)

Viewers also liked

Viewers also liked (20)

Similar to Probabilistic breakdown of assembly graphs

Similar to Probabilistic breakdown of assembly graphs (20)

More from c.titus.brown

More from c.titus.brown (20)

Recently uploaded

Recently uploaded (20)

Probabilistic breakdown of assembly graphs

Editor's Notes