
- 1. Bloom Filter. xuanzi.wp@taobao.com, 2011-11-18
- 2. Agenda
  • A Membership Query Problem
  • What is Bloom Filter
  • Bloom Filter Math Theory
  • Compression
  • Application Scenario
- 3. Membership Query Problem: Problem Description
  Given an element E, query whether it belongs to a large set of elements S.
  – As fast as possible
  – As small as possible
- 4. Membership Query Problem: Some Solutions
  • Hashtable: fast, but a big data structure
  • Bitmap index
  • Can it be smaller?
- 5. Membership Query Problem: Tradeoff Solutions
  To obtain speed and size improvements, allow some probability of error.
  Solution: Bloom Filter
- 6. What is Bloom Filter
  Supports approximate set membership.
  Given a set $S = \{x_1, x_2, \dots, x_n\}$, construct a data structure to answer queries of the form "Is y in S?"
  The data structure should be:
  – Fast (faster than searching through S).
  – Small (smaller than an explicit representation).
  To obtain speed and size improvements, allow some probability of error.
  – False positives: y ∉ S but we report y ∈ S
  – False negatives: y ∈ S but we report y ∉ S
- 7. What is Bloom Filter (n items, m = cn bits, k hash functions)
  Start with an m-bit array B, filled with 0s.
  B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  Hash each item $x_j$ in S k times. If $H_i(x_j) = a$, set B[a] = 1.
  B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
  To check if y is in S, check B at $H_i(y)$. All k values must be 1.
  B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
  A false positive is possible: all k values are 1, but y is not in S.
  B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
- 8. What is Bloom Filter: False Positive
  [Figure: a bit array with elements A and B and hash functions hash1, hash2, hash3]
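
The construction on slide 7 can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the class name `BloomFilter`, the use of salted SHA-256 to stand in for the k hash functions $H_i$, and the one-byte-per-bit array are all assumptions made for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array B and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, all 0 initially

    def _positions(self, item: str):
        # Assumption: simulate H_1..H_k by salting SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Hash the item k times and set B[a] = 1 for each position a.
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item: str) -> bool:
        # All k positions must be 1; a single 0 proves the item is absent.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=64, k=3)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)

print("apple" in bf)   # inserted items are always reported as present
print("durian" in bf)  # usually False, but a false positive is possible
```

Because `add` only ever sets bits, an inserted item can never be reported as absent; the only error this structure makes is the false positive described on the slide.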
- 9. Bloom Filter Math Theory (n items, m = cn bits, k hash functions)
  Pr(specific bit of filter is 0) is $p' \equiv (1 - 1/m)^{kn} \approx e^{-kn/m} \equiv p$.
  If $\rho$ is the fraction of 0 bits in the filter, then the false positive probability is
  $(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k = (1 - e^{-k/c})^k$.
  The approximations are valid as $\rho$ is concentrated around $E[\rho]$.
  – A martingale argument suffices.
  Find the optimum at $k = (\ln 2)\, m/n$ by calculus.
  – So the optimal fpp is about $(0.6185)^{m/n}$.
- 10. Bloom Filter Math Theory (n items, m = cn bits, k hash functions)
  [Plot: false positive rate (0 to 0.1) against number of hash functions (0 to 10) for m/n = 8; optimal k = 8 ln 2 = 5.545...]
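
The curve on slide 10 can be reproduced numerically from the approximation $(1 - e^{-k/c})^k$ on slide 9. The sketch below assumes nothing beyond that formula; the function name `false_positive_rate` is a hypothetical helper, not something from the slides.

```python
import math


def false_positive_rate(c: float, k: float) -> float:
    """Approximate fpp (1 - e^{-k/c})^k for m = c*n bits and k hash functions."""
    return (1.0 - math.exp(-k / c)) ** k


c = 8  # bits per item, m/n = 8 as in the plot
rates = {k: false_positive_rate(c, k) for k in range(1, 11)}
best_k = min(rates, key=rates.get)   # best whole number of hash functions
k_opt = c * math.log(2)              # real-valued optimum, k = (ln 2) m/n

for k, f in rates.items():
    print(f"k={k:2d}  f={f:.4f}")
print(f"real-valued optimum k = {k_opt:.3f}, best integer k = {best_k}")
print(f"fpp at the optimum = {false_positive_rate(c, k_opt):.4f}, "
      f"0.6185^c = {0.6185 ** c:.4f}")
```

Since k must be a whole number in practice, the real-valued optimum of about 5.545 rounds to k = 6 for m/n = 8, where the rate is roughly 0.0216, very close to the $(0.6185)^{m/n}$ estimate.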
- 11. Bloom Filter Compression: Use BF on Network Transmission
  A BF sent as a message should be small enough to be transmitted over the network.
  Compressing a bit vector is easy.
  – Arithmetic coding gets close to entropy.
  Can Bloom filters be compressed?
- 12. Bloom Filter Compression
  • Optimize to minimize false positives.
    $p = \Pr[\text{cell is empty}] = (1 - 1/m)^{kn} \approx e^{-kn/m}$
    $f = \Pr[\text{false pos}] = (1 - p)^k \approx (1 - e^{-kn/m})^k$
    $k = (m \ln 2)/n$
  • At $k = m (\ln 2)/n$, $p = 1/2$.
  • The Bloom filter looks like a random string.
    – Can't compress it.
    – $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
- 13. Bloom Filter Compression
  With a larger decompressed size (storage), we can achieve compression.
  • Assumption: optimal compressor, $z = mH(p)$.
    – $H(p)$ is the entropy function; optimally get $H(p)$ compressed bits per original table bit.
    – Arithmetic coding is close to optimal.
  • Optimization: given z bits for the compressed filter and n elements, choose table size m and number of hash functions k to minimize f.
    $p \approx e^{-kn/m}$; $f \approx (1 - e^{-kn/m})^k$; $z \approx mH(p)$
- 14. Bloom Filter Compression
  [Plot: false positive rate (0 to 0.1) against number of hash functions (0 to 10) for z/n = 8; curves: Original, Compressed]
- 15. Bloom Filter Compression: Conclusion
  • At $k = m (\ln 2)/n$, false positives are maximized with a compressed Bloom filter.
    – Best case without compression is worst case with compression; compression always helps.
    – Side benefit: use fewer hash functions with compression; possible speedup.
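
The optimization on slide 13 can be explored numerically. The sketch below is one possible way to do it, under the slide's own assumption of an optimal compressor ($z = mH(p)$): for each k it searches by bisection for the table size c = m/n whose compressed size is exactly z/n bits per item, then compares the resulting false positive rate with the uncompressed filter of the same transmitted size. All function names are hypothetical helpers.

```python
import math


def entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def false_positive_rate(c: float, k: float) -> float:
    """f = (1 - e^{-k/c})^k for c = m/n bits per item."""
    return (1.0 - math.exp(-k / c)) ** k


def compressed_bits_per_item(c: float, k: int) -> float:
    """z/n = c * H(p) with p = e^{-k/c}, assuming an optimal compressor."""
    return c * entropy(math.exp(-k / c))


def table_size_for_budget(z_per_item: float, k: int) -> float:
    """Find c = m/n whose compressed size equals z_per_item bits per item."""
    lo = hi = z_per_item  # H(p) <= 1, so c = z/n never exceeds the budget
    while compressed_bits_per_item(hi, k) < z_per_item:
        hi *= 2.0
    for _ in range(100):  # bisection: compressed size grows with c
        mid = (lo + hi) / 2.0
        if compressed_bits_per_item(mid, k) < z_per_item:
            lo = mid
        else:
            hi = mid
    return lo


z_per_item = 8.0
for k in range(1, 11):
    c = table_size_for_budget(z_per_item, k)
    print(f"k={k:2d}  original f={false_positive_rate(z_per_item, k):.4f}  "
          f"compressed f={false_positive_rate(c, k):.4f}  (m/n={c:.1f})")
```

For every k the compressed filter uses a table at least as large as the uncompressed one for the same transmitted size, so its false positive rate is never worse; the gap closes near k = (ln 2) m/n, where the bit array is already incompressible.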
- 16. Application Scenario
  Speed up answers in a key-value-like system: an in-memory filter in front of storage.
  – key1: filter answers no
  – key2: filter answers yes, disk access, success
  – key3: filter answers yes, disk access, fail (false positive)
- 17. Application Scenario: Web Cache
  [Diagram: cache1, cache2, …, cache3 and a Web Server]
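
The key-value scenario on slide 16 can be sketched as follows. This is an illustration under stated assumptions, not a description of any real system: `FilteredStore` is a hypothetical class, a plain dictionary stands in for on-disk storage, `disk_reads` is a counter added only to make the saved accesses visible, and the small `BloomFilter` uses salted SHA-256 in place of k independent hash functions.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter with m bits and k salted SHA-256 hashes (assumption)."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[a] for a in self._positions(item))


class FilteredStore:
    """Hypothetical key-value store with an in-memory filter in front of storage."""

    def __init__(self, data: dict, m: int = 1024, k: int = 3) -> None:
        self._disk = dict(data)  # stand-in for slow, on-disk storage
        self.disk_reads = 0      # counts how often storage is touched
        self._filter = BloomFilter(m, k)
        for key in self._disk:
            self._filter.add(key)

    def get(self, key: str):
        if key not in self._filter:
            return None          # filter answers no: skip the disk entirely
        self.disk_reads += 1     # filter answers yes: go to storage
        return self._disk.get(key)  # None here means a false positive


store = FilteredStore({"key2": "value2"})
print(store.get("key2"), store.disk_reads)  # present: one disk read
print(store.get("key1"), store.disk_reads)  # absent: usually no extra read
```

The filter never hides a key that exists, so correctness is unaffected; a false positive only costs one wasted disk access.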
- 18. Q&A
