Upcoming SlideShare
×

Slides1

296 views

Published on

Published in: Technology, Education
0 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

• Be the first to like this

Views
Total views
296
On SlideShare
0
From Embeds
0
Number of Embeds
17
Actions
Shares
0
8
0
Likes
0
Embeds 0
No embeds

No notes for slide

Slides1

1. 1. Bloom Filter in Bioinformatics Aleksey Kladov aleksey.kladov@gmail.com February 15, 2013
2. 2. Set data structure Implementations • Hash table • Search tree • Bloom ﬁlter Why Bloom is awesome? • Asymptotically not faster then a hash table • No delete operation • It simply doesn’t work! 1 / 15
3. 3. Meet the Bloom! • Actually, it’s all from Wikipedia :) • Burton Howard Bloom, 1970 • m-bit vector B • k hash functions: element → [0, m) 2 / 15
4. 4. Add • Want to add element e? • Compute h1(e), h2(e), . . . , hk(e) • Compute H(e) m-bit vetcor with k ones at h1(e), h2(e), . . . , hk(e) • Let B = B ∨ H(e) 3 / 15
5. 5. Query • Want to check if element e was inserted in B? • Check if B ∧ H(e) == H(e) • If yes then e may be in the set • If no then e is deﬁnitely not in the set • False positives =( 4 / 15
6. 6. So, why Bloom? From Wikipedia A Bloom ﬁlter with 1% error and an optimal value of k, requires only about 9.6 bits per element — regardless of the size of the elements. 5 / 15
7. 7. k, m, n and other letters • k – number of hash functions • m – length of the vector • p – false positive rate • n – number of elements in the set 6 / 15
8. 8. \$Math\$ Probability that ith bit is zero after one insertion (1 − 1 m )k After n insertions (1 − 1 m )kn Probability that ith bit is one after n insertions 1 − (1 − 1 m )kn Probability of positive answer (1 − (1 − 1 m )kn )k Recall that (1 − 1 m ) m ≈ e−1 , plug it into the last formula and we get false positive rate estimate ≈ (1 − e −kn m )k 7 / 15
9. 9. The optimal k Can we choose k to minimize false positive rate? min k (1 − e −kn m )k Optimal k m n ln 2 False positive rate 2− m n ln 2 Number of bits per item for a given p − ln p (ln 2)2 8 / 15
10. 10. Excellent! 9 / 15
11. 11. Counting k-mers • Naive: hash table requires a lot of memory =( • ≈ half of the k-mers are singleton! • Increase in coverege leads to more singletons (linear dependency) • Most applications need solid k-mers 10 / 15
12. 12. Bloom ﬁlter! Paper ”Eﬃcient counting of k-mers in DNA sequences using a bloom ﬁlter” by P´all Melsted and Jonathan K Pritchard, 2011 • Create a bloom ﬁlter and a hash table • If k-mer is not in ﬁlter – insert into ﬁlter • If k-mer is in ﬁlter – insert into table with count 2 or update count. Resulting counts are higher then true counts at most by one. To ﬁnd precise counts make a second pass. 2x memory savings, assuming if half of the k-mers are unique 11 / 15
13. 13. Picture! Figure 1 : Memory usage for BFCounter, Jellyﬁsh and Google Dense at diﬀerent coverage levels(Chromosome 21 data) 12 / 15
14. 14. Speeding it up I Paper ”Exact Pattern Matching with Feed-Forward Bloom Filters” by Iulian Moraru and David G. Andersen, 2011 Problem Given the huge text and lots of patterns of the same length, ﬁnd all occurences of patterns in the text. Solution • Create P and T Bloom ﬁlters • Insert all patterns in P • Search substring of text in ﬁlter • For each substring of text that is in P, remember it’s positon and add it to the T • Filter out all patterns that are not in T • Check all positions against reduced set of patterns 13 / 15
15. 15. Speeding it up II Poor cache behavior Split ﬁlter into small cache part and large memory part Lot’s of hash functions • Fast rolling hash functions • Hash functions don’t need to be independent: linear combinations are OK 14 / 15
16. 16. This is the end end {document} 15 / 15