Bloom filter

A Bloom filter is a data structure that supports very fast membership queries while using very compact storage.

Published in: Technology, Business

  1. Bloom Filter. xuanzi.wp@taobao.com, 2011-11-18
  2. Agenda
     • A Membership Query Problem
     • What is Bloom Filter
     • Bloom Filter Math Theory
     • Compression
     • Application Scenario
  3. Membership Query Problem: Problem Description
     Given an element E, query whether it belongs to a big set of elements S. The answer should be:
     – As fast as possible
     – As small as possible
  4. Membership Query Problem: Some Solutions
     • Hash table: fast, but a big data structure
     • Bitmap index: can it be smaller?
  5. Membership Query Problem: Tradeoff Solutions
     To obtain speed and size improvements, allow some probability of error: the Bloom filter.
  6. What is Bloom Filter
     Supports approximate set membership. Given a set $S = \{x_1, x_2, \dots, x_n\}$, construct a data structure to answer queries of the form "Is y in S?"
     The data structure should be:
     – Fast (faster than searching through S).
     – Small (smaller than an explicit representation).
     To obtain speed and size improvements, allow some probability of error.
     – False positives: y ∉ S but we report y ∈ S
     – False negatives: y ∈ S but we report y ∉ S (a standard Bloom filter never produces these)
  7. What is Bloom Filter (n items, m = cn bits, k hash functions)
     Start with an m-bit array B, filled with 0s.
       B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     Hash each item $x_j$ in S k times. If $H_i(x_j) = a$, set B[a] = 1.
       B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
     To check if y is in S, check B at $H_i(y)$. All k values must be 1.
     A false positive is possible: all k values are 1, but y is not in S.
  8. What is Bloom Filter: False Positive
     [Figure: a bit array with elements A and B mapped through hash1, hash2 and hash3.]
  9. Bloom Filter Math Theory (n items, m = cn bits, k hash functions)
     Pr(specific bit of filter is 0) is $p' \equiv (1 - 1/m)^{kn} \approx e^{-kn/m} \equiv p$.
     If $\rho$ is the fraction of 0 bits in the filter, then the false positive probability is
     $(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k = (1 - e^{-k/c})^k$.
     The approximations are valid because $\rho$ is concentrated around $E[\rho]$.
     – A martingale argument suffices.
     Find the optimum at $k = (\ln 2)\, m/n$ by calculus.
     – So the optimal false positive probability is about $(0.6185)^{m/n}$.
  10. Bloom Filter Math Theory (n items, m = cn bits, k hash functions)
      [Figure: false positive rate (0 to 0.1) versus number of hash functions (0 to 10) for m/n = 8. Optimal k = 8 ln 2 = 5.545...]
  11. Bloom Filter Compression: Using a BF in Network Transmission
      • A BF sent as a message should be small enough to be transmitted over the network.
      • Compressing a bit vector is easy: arithmetic coding gets close to entropy.
      • Can Bloom filters be compressed?
  12. Bloom Filter Compression
      • Optimize to minimize false positives:
        $p = \Pr[\text{cell is empty}] = (1 - 1/m)^{kn} \approx e^{-kn/m}$
        $f = \Pr[\text{false pos}] = (1 - p)^k \approx (1 - e^{-kn/m})^k$
        $k = (m \ln 2)/n$
      • At $k = m (\ln 2)/n$, $p = 1/2$.
      • The Bloom filter looks like a random string.
        – Can't compress it.
        – $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
  13. Bloom Filter Compression
      By allowing a larger uncompressed size (storage), we can achieve compression.
      • Assumption: optimal compressor, $z = mH(p)$.
        – $H(p)$ is the entropy function; optimally we get $H(p)$ compressed bits per original table bit.
        – Arithmetic coding is close to optimal.
      • Optimization: given z bits for the compressed filter and n elements, choose the table size m and the number of hash functions k to minimize f, where
        $p \approx e^{-kn/m}$; $f \approx (1 - e^{-kn/m})^k$; $z \approx mH(p)$
  14. Bloom Filter Compression
      [Figure: false positive rate (0 to 0.1) versus number of hash functions (0 to 10) for z/n = 8, comparing the original and the compressed filter.]
  15. Bloom Filter Compression: Conclusion
      • At $k = m (\ln 2)/n$, false positives are maximized with a compressed Bloom filter.
        – The best case without compression is the worst case with compression; compression always helps.
        – Side benefit: compression allows fewer hash functions, giving a possible speedup.
  16. Application Scenario: speed up answers in a key-value-like system
      [Diagram labels: filter (memory), storage (memory)]
      • key1: filter says no
      • key2: filter says yes, disk access, success
      • key3: filter says yes, disk access, fail
  17. Application Scenario: Web Cache
      [Diagram: cache1, cache2, …, cache3 and a Web Server.]
  18. Q&A
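
The construction on slide 7 (an m-bit array B, k hash functions, set B[a] = 1 on insert, require all k bits on lookup) can be sketched in Python roughly as follows. The slides do not say which hash functions are used, so salting SHA-256 with the index i is purely an assumption of this sketch, as are the class and method names.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: an m-bit array B and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _positions(self, item: str):
        # Assumption: the i-th hash function H_i is simulated by salting
        # SHA-256 with the index i. The slides do not specify the hashes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.bits[a // 8] |= 1 << (a % 8)  # set B[a] = 1

    def __contains__(self, item: str) -> bool:
        # All k probed bits must be 1; a single 0 bit proves absence.
        return all(
            self.bits[a // 8] & (1 << (a % 8)) for a in self._positions(item)
        )


bf = BloomFilter(m=1024, k=5)
for word in ("apple", "banana", "cherry"):
    bf.add(word)

print("apple" in bf)   # True: an added item is always found
print("durian" in bf)  # very likely False, but a false positive is possible
```

Bits are never cleared, which is why a lookup for an inserted item cannot fail, while an item that was never inserted may still hit k bits set by other items.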
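
Slide 9 states that the optimal k is found "by calculus" without showing the steps. One common way to fill them in, using the slide's own symbols and its approximation $p = e^{-kn/m}$, is sketched below; this derivation is not taken from the slides.

```latex
\begin{align*}
f(k) &\approx \left(1 - e^{-kn/m}\right)^{k},
  \qquad p = e^{-kn/m} \;\Longrightarrow\; k = -\frac{m}{n}\ln p, \\
\ln f &= k \ln(1 - p) = -\frac{m}{n}\,\ln p \,\ln(1 - p). \\
\intertext{The product $\ln p \,\ln(1 - p)$ is symmetric in $p$ and $1 - p$
and is largest at $p = 1/2$, so $\ln f$ is smallest there:}
k_{\text{opt}} &= \frac{m}{n}\ln 2,
  \qquad f_{\text{opt}} = \left(\tfrac{1}{2}\right)^{k_{\text{opt}}}
  = \left(2^{-\ln 2}\right)^{m/n} \approx (0.6185)^{m/n}.
\end{align*}
```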
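
The curve on slide 10 can be reproduced numerically from the slide 9 formula $(1 - e^{-k/c})^k$ with c = m/n = 8. The following is a minimal sketch; the function name `false_positive_rate` is made up for this example.

```python
import math


def false_positive_rate(k: float, bits_per_item: float) -> float:
    """Approximate rate f = (1 - e^{-k/c})^k, where c = m/n."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


c = 8.0  # m/n = 8, as on slide 10
for k in range(1, 11):
    print(f"k = {k:2d}  f = {false_positive_rate(k, c):.4f}")

k_opt = c * math.log(2)  # 8 ln 2, about 5.545
print(f"optimal k = {k_opt:.3f}, f = {false_positive_rate(k_opt, c):.4f}")
```

Because k must be an integer in practice, the best usable value here is k = 6, which is only slightly worse than the real-valued optimum.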
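
The optimization on slide 13 and the comparison on slide 14 can be explored numerically. The sketch below assumes the ideal compressor from the slide (z = mH(p)) and, for each k, solves for the table size whose compressed size is exactly z/n = 8 bits per item. The bisection solver and all function names are assumptions of this example, not something the slides describe.

```python
import math


def entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def original_fpr(c: float, k: int) -> float:
    """Uncompressed filter with m/n = c: f = (1 - e^{-k/c})^k."""
    return (1.0 - math.exp(-k / c)) ** k


def compressed_fpr(z_per_item: float, k: int) -> float:
    """False positive rate for k hashes when the compressed size is z = m*H(p).

    With x = kn/m we have p = e^{-x} and z/n = k * H(e^{-x}) / x.
    The ratio H(e^{-x}) / x appears to decrease monotonically in x
    (checked numerically, not proven here), so bisection is used.
    The bracket below is an assumption that holds while z/(n*k) < ~30.
    """
    target = z_per_item / k
    lo, hi = 1e-9, 50.0  # ratio is large at lo and close to 0 at hi
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if entropy(math.exp(-mid)) / mid > target:
            lo = mid  # ratio still too large: move towards larger x
        else:
            hi = mid
    x = (lo + hi) / 2.0
    return (1.0 - math.exp(-x)) ** k


for k in range(1, 11):
    print(f"k = {k:2d}  original = {original_fpr(8.0, k):.4f}  "
          f"compressed = {compressed_fpr(8.0, k):.4f}")
```

Under these assumptions the compressed column never exceeds the original one, and small k with a large, sparse table gives the lowest rates, which is in line with the conclusion on slide 15.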
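
The key-value scenario on slide 16 might look roughly like this in code: an in-memory filter is consulted first, and the slow storage is touched only when the filter answers "yes". Everything here is hypothetical: `TinyBloom` and `FilteredStore` are invented names, a plain dict stands in for the disk, and the SHA-256 based hashing is an assumption.

```python
import hashlib


class TinyBloom:
    """Small in-memory Bloom filter (the hash choice is an assumption)."""

    def __init__(self, m: int = 4096, k: int = 4) -> None:
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for a in self._positions(key):
            self.bits[a] = True

    def might_contain(self, key: str) -> bool:
        return all(self.bits[a] for a in self._positions(key))


class FilteredStore:
    """Hypothetical key-value store that asks the filter before the disk."""

    def __init__(self) -> None:
        self.filter = TinyBloom()
        self.disk = {}       # stand-in for slow disk-backed storage
        self.disk_reads = 0  # counts simulated disk accesses

    def put(self, key: str, value: str) -> None:
        self.filter.add(key)
        self.disk[key] = value

    def get(self, key: str):
        if not self.filter.might_contain(key):
            return None            # filter says "no": skip the disk entirely
        self.disk_reads += 1       # filter says "yes": pay for a disk access
        return self.disk.get(key)  # None here means a false positive


store = FilteredStore()
store.put("key2", "value2")
print(store.get("key2"))  # value2 (filter yes, disk access succeeds)
print(store.get("key1"))  # None (normally answered by the filter alone)
print(store.disk_reads)
```

The key3 case on the slide corresponds to the last line of `get`: the filter says "yes", the disk is read, and nothing is found.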
