Slides1

Published in: Technology, Education

  1. Bloom Filter in Bioinformatics. Aleksey Kladov, aleksey.kladov@gmail.com, February 15, 2013
  2. Set data structure
     Implementations
       • Hash table
       • Search tree
       • Bloom filter
     Why is Bloom awesome?
       • Asymptotically no faster than a hash table
       • No delete operation
       • It simply doesn’t work!
  3. Meet the Bloom!
       • Actually, it’s all from Wikipedia :)
       • Burton Howard Bloom, 1970
       • m-bit vector B
       • k hash functions: element → [0, m)
  4. Add
       • Want to add element e?
       • Compute h1(e), h2(e), . . . , hk(e)
       • Compute H(e), an m-bit vector with ones at positions h1(e), h2(e), . . . , hk(e)
       • Let B = B ∨ H(e)
  5. Query
       • Want to check if element e was inserted into B?
       • Check if B ∧ H(e) == H(e)
       • If yes, then e may be in the set
       • If no, then e is definitely not in the set
       • False positives =(
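The Add and Query slides can be sketched in a few lines of Python. This is only an illustration, not the implementation used in the talk: the class name `BloomFilter`, the use of a Python integer as the bit vector, and the salted SHA-256 hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit vector B and k hash functions."""

    def __init__(self, m, k):
        self.m = m      # length of the bit vector
        self.k = k      # number of hash functions
        self.bits = 0   # a Python int stands in for the m-bit vector B

    def _mask(self, item):
        # H(e): an m-bit vector with ones at h_1(e), ..., h_k(e).
        # Assumption: the k hash functions are SHA-256 salted with i.
        mask = 0
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            mask |= 1 << (int.from_bytes(digest[:8], "big") % self.m)
        return mask

    def add(self, item):
        # B = B OR H(e)
        self.bits |= self._mask(item)

    def __contains__(self, item):
        # B AND H(e) == H(e): "maybe in the set" if true, "definitely not" if false.
        mask = self._mask(item)
        return (self.bits & mask) == mask
```

After `bf = BloomFilter(1024, 3)` and `bf.add("ACGT")`, the check `"ACGT" in bf` is always true; for an element that was never added the answer is usually, but not always, false.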
  6. So, why Bloom?
     From Wikipedia: a Bloom filter with 1% error and an optimal value of k requires only about 9.6 bits per element, regardless of the size of the elements.
  7. k, m, n and other letters
       • k – number of hash functions
       • m – length of the vector
       • p – false positive rate
       • n – number of elements in the set
  8. $Math$
       • Probability that the i-th bit is zero after one insertion: $\left(1 - \frac{1}{m}\right)^{k}$
       • After n insertions: $\left(1 - \frac{1}{m}\right)^{kn}$
       • Probability that the i-th bit is one after n insertions: $1 - \left(1 - \frac{1}{m}\right)^{kn}$
       • Probability of a positive answer: $\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}$
       • Recall that $\left(1 - \frac{1}{m}\right)^{m} \approx e^{-1}$; plug it into the last formula and we get the false positive rate estimate $\approx \left(1 - e^{-kn/m}\right)^{k}$
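To see how close the exponential estimate is to the formula it replaces, the two expressions can be evaluated side by side. The function names `fp_model` and `fp_approx` and the sample values of m, n and k are made up for this sketch.

```python
import math


def fp_model(m, n, k):
    # (1 - (1 - 1/m)^(kn))^k: the probability of a positive answer from the slide.
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def fp_approx(m, n, k):
    # (1 - e^(-kn/m))^k: the estimate obtained from (1 - 1/m)^m ~ 1/e.
    return (1.0 - math.exp(-k * n / m)) ** k


if __name__ == "__main__":
    m, n, k = 10_000, 1_000, 7  # illustrative sizes, not taken from the talk
    print(fp_model(m, n, k), fp_approx(m, n, k))
```

For these sizes both values come out slightly above 0.008, and they move closer together as m grows.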
  9. The optimal k
       • Can we choose k to minimize the false positive rate? $\min_{k} \left(1 - e^{-kn/m}\right)^{k}$
       • Optimal k: $\frac{m}{n} \ln 2$
       • False positive rate: $2^{-\frac{m}{n} \ln 2}$
       • Number of bits per item for a given p: $-\frac{\ln p}{(\ln 2)^{2}}$
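As a quick sanity check, the sizing formulas above reproduce the Wikipedia figure of about 9.6 bits per element for a 1% error rate. The helper names below are hypothetical and exist only for this sketch.

```python
import math


def bits_per_item(p):
    # m / n = -ln p / (ln 2)^2
    return -math.log(p) / (math.log(2) ** 2)


def optimal_k(bits_per_element):
    # k = (m / n) * ln 2; in practice it is rounded to a nearby integer.
    return bits_per_element * math.log(2)


def fp_rate_at_optimal_k(bits_per_element):
    # p = 2^(-(m / n) * ln 2)
    return 2.0 ** (-bits_per_element * math.log(2))


if __name__ == "__main__":
    ratio = bits_per_item(0.01)
    print(ratio, optimal_k(ratio), fp_rate_at_optimal_k(ratio))
```

For p = 0.01 this gives roughly 9.59 bits per item and an optimal k of about 6.6, so 7 hash functions would be the natural integer choice.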
  10. Excellent!
  11. Counting k-mers
       • Naive: a hash table requires a lot of memory =(
       • ≈ half of the k-mers are singletons!
       • An increase in coverage leads to more singletons (linear dependency)
       • Most applications need solid k-mers
  12. Bloom filter!
      Paper: "Efficient counting of k-mers in DNA sequences using a bloom filter" by Páll Melsted and Jonathan K Pritchard, 2011
       • Create a Bloom filter and a hash table
       • If a k-mer is not in the filter, insert it into the filter
       • If a k-mer is in the filter, insert it into the table with count 2, or update its count if it is already there
      The resulting counts exceed the true counts by at most one. To find precise counts, make a second pass.
      2x memory savings, assuming half of the k-mers are unique
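The counting scheme on this slide can be sketched as follows. This is a toy illustration of the idea, not BFCounter itself: the function names, the salted SHA-256 hashing, the default filter size, and the decision to ignore reverse complements are all assumptions of the example.

```python
import hashlib


class BloomFilter:
    """Compact Bloom filter; a Python int stands in for the bit vector."""

    def __init__(self, m, num_hashes):
        self.m, self.num_hashes, self.bits = m, num_hashes, 0

    def _mask(self, item):
        mask = 0
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            mask |= 1 << (int.from_bytes(digest[:8], "big") % self.m)
        return mask

    def add(self, item):
        self.bits |= self._mask(item)

    def __contains__(self, item):
        mask = self._mask(item)
        return (self.bits & mask) == mask


def kmers(seq, k):
    """Yield every substring of length k (reverse complements are ignored)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_solid_kmers(reads, k, m=1 << 16, num_hashes=4):
    """Return exact counts of k-mers seen at least twice.

    `reads` must be re-iterable (e.g. a list) because two passes are made.
    """
    seen = BloomFilter(m, num_hashes)
    counts = {}
    # First pass: singletons stay in the filter, repeats go to the table.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in counts:
                counts[kmer] += 1
            elif kmer in seen:
                counts[kmer] = 2   # may overcount by one on a false positive
            else:
                seen.add(kmer)
    # Second pass: recount only the k-mers that made it into the table.
    exact = dict.fromkeys(counts, 0)
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in exact:
                exact[kmer] += 1
    # Drop singletons that entered the table through a false positive.
    return {kmer: c for kmer, c in exact.items() if c >= 2}
```

For example, `count_solid_kmers(["ACGTACGT", "TTTT"], 4)` returns `{"ACGT": 2}`: the other 4-mers occur once and never need a slot in the final table.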
  13. Picture!
      Figure 1: Memory usage for BFCounter, Jellyfish and Google Dense at different coverage levels (Chromosome 21 data)
  14. Speeding it up I
      Paper: "Exact Pattern Matching with Feed-Forward Bloom Filters" by Iulian Moraru and David G. Andersen, 2011
      Problem: given a huge text and many patterns of the same length, find all occurrences of the patterns in the text.
      Solution
       • Create two Bloom filters, P and T
       • Insert all patterns into P
       • Look up each substring of the text (of pattern length) in P
       • For each substring that is in P, remember its position and add it to T
       • Filter out all patterns that are not in T
       • Check the remembered positions against the reduced set of patterns
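A simplified sketch of the steps above is shown here. It follows the slide rather than the paper's actual data layout: the function name `find_patterns`, the salted SHA-256 hashing and the default filter sizes are assumptions, and a non-empty list of equal-length patterns is taken for granted.

```python
import hashlib


class BloomFilter:
    """Compact Bloom filter; a Python int stands in for the bit vector."""

    def __init__(self, m, num_hashes):
        self.m, self.num_hashes, self.bits = m, num_hashes, 0

    def _mask(self, item):
        mask = 0
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            mask |= 1 << (int.from_bytes(digest[:8], "big") % self.m)
        return mask

    def add(self, item):
        self.bits |= self._mask(item)

    def __contains__(self, item):
        mask = self._mask(item)
        return (self.bits & mask) == mask


def find_patterns(text, patterns, m=1 << 16, num_hashes=4):
    """Return (position, pattern) pairs for all exact matches in `text`."""
    length = len(patterns[0])            # all patterns share one length
    p_filter = BloomFilter(m, num_hashes)
    t_filter = BloomFilter(m, num_hashes)
    for pattern in patterns:
        p_filter.add(pattern)
    # Scan the text: keep candidate positions and feed the hits into T.
    candidates = []
    for i in range(len(text) - length + 1):
        window = text[i:i + length]
        if window in p_filter:
            candidates.append(i)
            t_filter.add(window)
    # Only patterns that show up in T can possibly occur in the text.
    reduced = {p for p in patterns if p in t_filter}
    # Exact verification removes the Bloom filters' false positives.
    return [(i, text[i:i + length]) for i in candidates
            if text[i:i + length] in reduced]
```

Because the last step compares candidates against real patterns, the result is exact even though both filters may report false positives; the filters only shrink the amount of exact checking.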
  15. Speeding it up II
      Poor cache behavior
       • Split the filter into a small part that fits in cache and a large part that lives in main memory
      Lots of hash functions
       • Fast rolling hash functions
       • Hash functions don’t need to be independent: linear combinations are OK
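Both hashing tricks can be illustrated briefly. The sketch below uses a polynomial rolling hash and derives k bit positions as linear combinations of two base hashes; the base, the modulus and the function names are arbitrary choices for the example and are not necessarily what the paper uses.

```python
MOD = (1 << 61) - 1   # a large prime modulus, chosen for illustration
BASE = 257


def direct_hash(s):
    """Polynomial hash of s computed from scratch."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rolling_hashes(text, length):
    """Yield the hash of every window of `length` characters in O(1) per step."""
    if length <= 0 or len(text) < length:
        return
    top = pow(BASE, length - 1, MOD)     # weight of the outgoing character
    h = direct_hash(text[:length])
    yield h
    for i in range(length, len(text)):
        # Remove the leftmost character, shift, then append the new one.
        h = ((h - ord(text[i - length]) * top) * BASE + ord(text[i])) % MOD
        yield h


def bit_positions(h1, h2, k, m):
    """Derive k positions in [0, m) as linear combinations h1 + i * h2."""
    return [(h1 + i * h2) % m for i in range(k)]
```

Each window hash is updated from the previous one instead of being recomputed, and only two base hashes are needed no matter how large k is.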
  16. This is the end: \end{document}
