PyCon 2011 talk - ngram assembly with Bloom filters
  • Transcript

    • 1.
    • 2. Handling ridiculous amounts of data with probabilistic data structures
      C. Titus Brown
      Michigan State University
      Computer Science / Microbiology
    • 3. Resources
      http://www.slideshare.net/c.titus.brown/
      Webinar: http://oreillynet.com/pub/e/1784
      Source: github.com/ctb/
        N-grams (this talk): khmer-ngram
        DNA (the real cheese): khmer
      khmer is implemented in C++ with a Python wrapper, which has been awesome for scripting, testing, and general development. (But man, does C++ suck…)
    • 4. Lincoln Stein
      Sequencing capacity is outscaling Moore’s Law.
    • 5. Hat tip to Narayan Desai / ANL
      We don’t have enough resources or people to analyze data.
    • 6. Data generation vs. data analysis
      It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week.
      (Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.)
      …x1000 sequencers
      Many useful analyses do not scale linearly in RAM or CPU with the amount of data.
    • 7. The challenge?
      Massive (and increasing) data generation capacity, operating at a boutique level, with algorithms that are wholly incapable of scaling to the data volume.
      Note: cloud computing isn’t a solution to a sustained scaling problem!!
      (See: Moore’s Law slide)
    • 8. Life’s too short to tackle the easy problems – come to academia!
      (Chart labels: “Easy stuff like Google Search” vs. “Awesomeness”)
    • 9. A brief intro to shotgun assembly
      It was the best of times, it was the wor
      , it was the worst of times, it was the
      isdom, it was the age of foolishness
      mes, it was the age of wisdom, it was th

      It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

      …but for 2 bn+ fragments.
      Not subdivisible; not easy to distribute; memory intensive.
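      As a toy illustration of the overlap-merge step (a hypothetical sketch, not code from the talk; merge is a made-up helper):

      def merge(a, b, min_overlap=3):
          # merge b onto a using the longest suffix-of-a / prefix-of-b overlap
          for n in range(min(len(a), len(b)), min_overlap - 1, -1):
              if a[-n:] == b[:n]:
                  return a + b[n:]
          return None  # no sufficient overlap

      >>> merge('It was the best of times, it was the wor',
      ...       ', it was the worst of times, it was the ')
      'It was the best of times, it was the worst of times, it was the '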
    • 10. Define a hash function (word => num)
      # MAX_K (and DEFAULT_K, below) are constants defined elsewhere in the talk's code
      def hash(word):
          assert len(word) <= MAX_K
          value = 0
          for n, ch in enumerate(word):
              value += ord(ch) * 128**n
          return value
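      A quick sanity check (my example, not from the slides): this is a base-128 positional encoding, so it is exact until reduced modulo a table size.

      >>> hash('ab')  # ord('a') + ord('b') * 128
      12641
      >>> hash('ab') % 1001, hash('ab') % 1003  # different table sizes, different slots
      (629, 605)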
    • 11.
      class BloomFilter(object):
          def __init__(self, tablesizes, k=DEFAULT_K):
              self.tables = [(size, [0] * size)
                             for size in tablesizes]
              self.k = k

          def add(self, word):  # insert; ignore collisions
              val = hash(word)
              for size, ht in self.tables:
                  ht[val % size] = 1

          def __contains__(self, word):
              val = hash(word)
              return all(ht[val % size]
                         for (size, ht) in self.tables)
    • 14. Storing words in a Bloom filter
      >>> x = BloomFilter([1001, 1003, 1005])
      >>> 'oogaboog' in x
      False
      >>> x.add('oogaboog')
      >>> 'oogaboog' in x
      True

      >>> x = BloomFilter([2])
      >>> x.add('a')
      >>> 'a' in x  # no false negatives
      True
      >>> 'b' in x
      False
      >>> 'c' in x  # …but false positives
      True
    • 16. Storing text in a Bloom filter
      class BloomFilter(object):
          …
          def insert_text(self, text):
              for i in range(len(text) - self.k + 1):
                  self.add(text[i:i+self.k])
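      For example (my demo, not the talk's), with k=8 a text is stored as its overlapping 8-grams:

      >>> x = BloomFilter([1001, 1003, 1005], k=8)
      >>> x.insert_text('the quick')
      >>> 'the quic' in x  # first 8-gram
      True
      >>> 'he quick' in x  # second 8-gram
      True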
    • 17.
      def next_words(bf, word):  # try all 1-ch extensions
          prefix = word[1:]
          for ch in bf.allchars:  # bf.allchars: the alphabet; defined elsewhere in the talk's code
              word = prefix + ch
              if word in bf:
                  yield ch

      # descend into all successive 1-ch extensions
      def retrieve_all_sentences(bf, start):
          word = start[-bf.k:]
          n = -1
          for n, ch in enumerate(next_words(bf, word)):
              ss = retrieve_all_sentences(bf, start + ch)
              for sentence in ss:
                  yield sentence
          if n < 0:
              yield start
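      retrieve_first_sentence, used on the next slides, isn't shown in the deck; here is a minimal sketch consistent with retrieve_all_sentences above (my reconstruction, not the original code):

      def retrieve_first_sentence(bf, start):
          # follow the first 1-ch extension at every step until none remain
          word = start[-bf.k:]
          for ch in next_words(bf, word):
              return retrieve_first_sentence(bf, start + ch)
          return start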
    • 19. Storing and retrieving text
      >>> x = BloomFilter([1001, 1003, 1005, 1007])
      >>> x.insert_text('foo bar bazbif zap!')
      >>> x.insert_text('the quick brown fox jumped over the lazy dog')
      >>> print retrieve_first_sentence(x, 'foo bar ')
      foo bar bazbif zap!
      >>> print retrieve_first_sentence(x, 'the quic')
      the quick brown fox jumped over the lazy dog
    • 20. Sequence assembly
      >>> x = BloomFilter([1001, 1003, 1005, 1007])
      >>> x.insert_text('the quick brown fox jumped ')
      >>> x.insert_text('jumped over the lazy dog')
      >>> retrieve_first_sentence(x, 'the quic')
      the quick brown fox jumped over the lazy dog

      (This is known as the de Bruijn graph approach to assembly; c.f. Velvet, ABySS, SOAPdenovo.)
    • 21. Repetitive strings are the devil
      >>> x = BloomFilter([1001, 1003, 1005, 1007])
      >>> x.insert_text('nanana, batman!')
      >>> x.insert_text('my chemical romance: nanana')
      >>> retrieve_first_sentence(x, 'my chemical')
      'my chemical romance: nanana, batman!'
    • 22. Note, it’s a probabilistic data structure
      Retrieval errors:
      >>> x = BloomFilter([1001, 1003])  # small Bloom filter…
      >>> x.insert_text('the quick brown fox jumped over the lazy dog')
      >>> retrieve_first_sentence(x, 'the quic')
      'the quick brY'
    • 23. Assembling DNA sequence
      Can’t directly assemble with the Bloom filter approach (false connections, and also lacking many convenient graph properties).
      But we can use the data structure to grok graph properties and eliminate/break up data:
        • Eliminate small graphs (no false negatives!)
        • Disconnected partitions (parts -> map reduce)
        • Local graph complexity reduction & error/artifact trimming
      …and then feed into other programs.
      This is a data-reducing prefilter.
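      A sketch of how the "eliminate small graphs" step might look on the n-gram toy (an assumption for illustration only; the real khmer code differs and works on DNA k-mers):

      def explore_component(bf, seed, max_size=10000):
          # collect all words reachable from seed via 1-ch right-extensions;
          # a real traversal would also walk left and cap memory use
          seen = set([seed])
          stack = [seed]
          while stack and len(seen) < max_size:
              word = stack.pop()
              for ch in next_words(bf, word):
                  nxt = word[1:] + ch
                  if nxt not in seen:
                      seen.add(nxt)
                      stack.append(nxt)
          return seen  # small component => discard; else => its own partition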
    • 24. Right, but does it work??
      Can assemble ~200 GB of metagenome DNA on a single 4xlarge EC2 node (68 GB of RAM) in 1 week ($500).
      …compare with not at all on a 512 GB RAM machine.
      Error/repeat trimming on a tricky worm genome: reduction from
        170 GB resident / 60 hrs to
        54 GB resident / 13 hrs
    • 25. How good is this graph representation?
      Very low false positive rates at ~2 bytes/k-mer;
      nearly exact human genome graph in ~5 GB.
      Estimate we eventually need to store/traverse 50 billion k-mers (soil metagenome).
      Good failure mode: it’s all connected, Jim! (No loss of connections => good prefilter.)
      Did I mention it’s constant memory? And independent of word size?
      …only works for de Bruijn graphs.
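      A back-of-envelope check of the ~2 bytes/k-mer claim (my arithmetic, not the slide's; the table sizes are assumed): with 4 tables of 4 bits per stored k-mer each (16 bits = 2 bytes/k-mer total), each table is ~22% occupied, and a false positive requires a collision in all 4 tables at once:

      >>> import math
      >>> occupancy = 1 - math.exp(-1.0 / 4)  # expected fill fraction per table
      >>> round(occupancy ** 4, 4)            # all 4 tables must collide
      0.0024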
    • 26. Thoughts for the future
      Unless your algorithm scales sub-linearly as you distribute it across multiple nodes (hah!), or your problem size has an upper bound, cloud computing isn’t a long-term solution in bioinformatics.
      Synopsis data structures & algorithms (which include probabilistic data structures) are a neat approach to parsing problem structure.
      Scalable in-memory local graph exploration enables many other tricks, including near-optimal multinode graph distribution.
    • 27. Groxel view of knot-like region / Arend Hintze
    • 28. Acknowledgements
      The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze.
      Collaborators: Jim Tiedje (Il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU).
      Funding: USDA NIFA; MSU startup and iCER; DOE; BEACON/NSF STC; Amazon Education.