Tutorial 9 (bloom filters)

Bloom Filters
Kira Radinsky
Slides based on material from:
Michael Mitzenmacher and Hanoch Levy

Motivation - Cache
• Lookup questions:
Does item “x” exist in a set?
• Data set may be very big or expensive to
access. Filter lookup questions with negative
results before accessing data.
• Allow false positive errors, as they only cost us an
extra data access.
• Don’t allow false negative errors, because they
result in wrong answers.

Application of Bloom Filters:
Distributed Web Caches
[Figure: six cooperating web caches, Web Cache 1 through Web Cache 6]
• Send Bloom filters of URLs.
• False positives do not hurt much.
– Get errors from cache changes anyway

Web Caching
• Summary Cache: [Fan, Cao, Almeida, & Broder]
If local caches know each other’s content...
…try local cache before going out to Web
• Sending/updating lists of URLs too expensive.
• Solution: use Bloom filters.
• False positives
– Local requests go unfulfilled.
– Small cost, big potential gain

The Problem Solved by BF:
Approximate Set Membership
• Lookup Problem: Given a set S = {x1,x2,…,xn}, construct
data structure to answer queries of the form
“Is y in S?”
• Data structure should be:
– Fast (Faster than searching through S).
– Small (Smaller than explicit representation).
• To obtain speed and size improvements, allow some
probability of error.
– False positives: y ∉ S but we report y ∈ S
– False negatives: y ∈ S but we report y ∉ S

Bloom Filters
Start with an m-bit array B, filled with 0s.
B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
To check if y is in S, check B at Hi(y). All k values must be 1.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
Possible to have a false positive: all k values are 1, but y is not in S.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
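
The insert and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter` and the example URLs are hypothetical, and the k hash functions Hi are simulated by hashing the item together with the index i using SHA-256, which is an assumption made for the sketch.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array B and k hash functions H_1..H_k."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # B, initially all 0s

    def _positions(self, item):
        # Assumption: H_i(item) is simulated by hashing "i:item" with
        # SHA-256 and reducing the result modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set B[a] = 1 for every a = H_i(item).
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item):
        # Report "maybe in S" only if all k addressed bits are 1.
        return all(self.bits[a] for a in self._positions(item))


urls = ["a.example/index", "b.example/home", "c.example/about"]
bf = BloomFilter(m=64, k=3)
for url in urls:
    bf.add(url)

# Inserted items are always reported present (no false negatives).
print(all(url in bf for url in urls))
```

A query for an item that was never added usually returns False, but it may return True when all k of its positions happen to have been set by other items.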

Bloom Filter
[Figure: an element x is hashed by h1(x), h2(x), h3(x), ..., hk(x) into a bit vector with positions V0 to Vm-1]

Advantages
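• No overflow: any number of items can be inserted (only the false positive rate grows).
• Union and intersection of Bloom filters (same m and same hash functions)
– Union is a simple bitwise OR; a bitwise AND approximates the intersection.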
• Applications:
– Google BigTable
– The Squid Web Proxy Cache uses Bloom filters for
cache digests.
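
The union and intersection point can be illustrated with filters stored as integer bitmasks. This is a sketch under assumptions: the sizes `M` and `K`, the helper names, and the SHA-256 based hashing are all hypothetical choices for the example, and both filters must share the same size and hash functions.

```python
import hashlib

M, K = 64, 3  # hypothetical filter size and number of hash functions


def positions(item):
    # Assumption: the i-th hash is simulated with SHA-256 over "i:item".
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % M
        for i in range(K)
    ]


def build(items):
    """Return a Bloom filter for `items`, stored as an integer bitmask."""
    bits = 0
    for item in items:
        for a in positions(item):
            bits |= 1 << a
    return bits


def maybe_contains(bits, item):
    return all((bits >> a) & 1 for a in positions(item))


f1 = build(["x1", "x2"])
f2 = build(["x2", "x3"])

union = f1 | f2         # same bits as build(["x1", "x2", "x3"])
intersection = f1 & f2  # includes every bit of build(["x2"]), possibly more
```

The OR is exactly the filter of the union. The AND may contain extra 1 bits compared with a filter built directly from the intersection, so it still has no false negatives but can have a somewhat higher false positive rate.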

Bloom Errors
[Figure: bit vector V0 to Vm-1 after inserting a, b, c, d; the positions h1(x), h2(x), h3(x), ..., hk(x) are all 1]
x was never inserted, yet all of its bits are already set.

Example
[Figure: false positive rate (0 to 0.1) versus number of hash functions (0 to 10), for m/n = 8]
Optimal k = 8 ln 2 = 5.545...
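
The curve in this example can be recomputed from the standard approximation f ≈ (1 − e^(−kn/m))^k. That formula does not appear on the slide, so treating it as the plotted quantity is an assumption; the function name below is likewise hypothetical.

```python
import math


def false_positive_rate(bits_per_item, k):
    """Standard approximation f = (1 - exp(-k / (m/n)))**k."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


bits_per_item = 8                    # m/n = 8, as in the example
k_opt = bits_per_item * math.log(2)  # real-valued optimum, about 5.545

rates = {k: false_positive_rate(bits_per_item, k) for k in range(1, 11)}
best_k = min(rates, key=rates.get)   # best integer number of hash functions

for k, f in rates.items():
    print(f"k = {k:2d}  f = {f:.4f}")
print(f"optimal k = {k_opt:.3f}, best integer k = {best_k}")
```

Since k must be an integer in practice, the real-valued optimum is rounded; under this approximation k = 5 and k = 6 give nearly the same rate, a little above 2%.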

Tradeoffs
• Three parameters.
– Size m/n : bits per item.
• |U| = n: Number of elements to encode.
• hi: U → [1..m] : Maintain a Bit Vector V of size m
– Time k : number of hash functions.
• Use k hash functions (h1..hk)
– Error f : false positive probability.

Bloom Filter Tradeoffs
• Three factors: m, k and n.
• Normally, n and m are given, and we select k.
• Small k
– Fewer computations.
– The number of bits set (at most nk) is smaller, so the chance of a “step
over” is smaller too.
– However, fewer bits need to be stepped over to generate an error.
• For big k, the exact opposite holds.
• Not surprisingly, when k is optimal, the “hit ratio” (fraction of bits
set to 1 in the array) is about 0.5.
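
The last point can be checked numerically. The sketch below assumes each of the kn hash values lands uniformly and independently in the array, so a given bit stays 0 with probability (1 − 1/m)^(kn); the sizes `m` and `n` are hypothetical, chosen so that m/n = 8.

```python
import math


def expected_fill_ratio(m, n, k):
    """Expected fraction of 1 bits after inserting n items with k hashes each."""
    return 1.0 - (1.0 - 1.0 / m) ** (k * n)


m, n = 80_000, 10_000          # hypothetical sizes, m/n = 8
k_opt = (m / n) * math.log(2)  # real-valued optimum

for k in (2, k_opt, 10):
    print(f"k = {k:6.3f}  fill ratio = {expected_fill_ratio(m, n, k):.3f}")
```

With fewer hash functions than the optimum the array is mostly 0s, and with more it is mostly 1s; the optimum sits where half the bits are set.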

Alternative Approach for
Bloom Filters: Perfect Hashing Approach
[Figure: Element 1 through Element 5 are mapped one-to-one into five cells holding Fingerprint(4), Fingerprint(5), Fingerprint(2), Fingerprint(1), Fingerprint(3)]

Perfect Hashing Approach
• Folklore Bloom filter construction.
– Recall: Given a set S = {x1,x2,x3,…xn} on a universe U, we want
to answer membership queries.
– Method: Find an n-cell perfect hash function for S.
• Maps set of n elements to n cells in a 1-1 manner.
– Then keep a $\lceil \log_2(1/\epsilon) \rceil$-bit fingerprint of the item in each cell.
Lookups have false positive probability < $\epsilon$.
– Advantage: each bit/item reduces false positives by a factor
of 1/2, vs. a factor of $(1/2)^{\ln 2} \approx 0.62$ for a standard Bloom filter.
• Negatives:
– Perfect hash functions non-trivial to find.
– Cannot handle on-line insertions.
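
The space advantage can be made concrete with a short calculation. This sketch compares bits per item for a target false positive rate, using ⌈log2(1/ε)⌉ for the fingerprint approach and the standard optimal-k Bloom filter relation f = (1/2)^((m/n) ln 2); the function names are hypothetical, and the space needed to store the perfect hash function itself is ignored.

```python
import math


def fingerprint_bits(eps):
    """Bits per item for the perfect hashing approach: ceil(log2(1/eps))."""
    return math.ceil(math.log2(1.0 / eps))


def bloom_bits(eps):
    """Bits per item for a standard Bloom filter with optimal k."""
    # Solve eps = (1/2)**((m/n) * ln 2) for m/n.
    return math.log2(1.0 / eps) / math.log(2)


for eps in (0.1, 0.01, 0.001):
    print(f"eps = {eps}: fingerprint {fingerprint_bits(eps)} bits/item, "
          f"Bloom {bloom_bits(eps):.2f} bits/item")
```

For a 1% false positive rate this gives 7 bits per item for fingerprints against roughly 9.6 for the Bloom filter, a ratio of about 1/ln 2 ≈ 1.44 before rounding.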

Bloom Filters and Deletions
• Cache contents change
– Items both inserted and deleted.
• Insertions are easy – add bits to BF
• Can Bloom filters handle deletions?
– Use Counting Bloom Filters to track
insertions/deletions at hosts;
– Send Bloom filters.

Handling Deletions
• Bloom filters can handle insertions, but not
deletions.
• If deleting xi means resetting 1s to 0s, then
deleting xi will “delete” xj.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
[Figure: the bits of xi and xj in B, with at least one bit shared]

Counting Bloom Filters
Start with an array B of m counters, filled with 0s.
B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hash each item xj in S k times. If Hi(xj) = a, add 1 to B[a].
B: 0 3 0 0 1 0 2 0 0 3 2 1 0 2 1 0
To delete xj, decrement the corresponding counters.
B: 0 2 0 0 0 0 2 0 0 3 2 1 0 1 1 0
Can obtain a corresponding Bloom filter by reducing to 0/1.
B: 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 0
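
A counting Bloom filter following these steps might look like the sketch below. The class and method names are hypothetical, the hash functions are again simulated with SHA-256 (an assumption), and the counters are ordinary Python integers, so counter overflow is not modeled here.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with a counter per position, so items can be deleted."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        # Assumption: SHA-256 over "i:item" stands in for H_i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for a in self._positions(item):
            self.counters[a] += 1

    def remove(self, item):
        # Only valid for items that were actually inserted.
        for a in self._positions(item):
            self.counters[a] -= 1

    def to_bloom_bits(self):
        # Reduce counters to 0/1 to obtain the plain Bloom filter.
        return [1 if c > 0 else 0 for c in self.counters]

    def __contains__(self, item):
        return all(self.counters[a] > 0 for a in self._positions(item))


cbf = CountingBloomFilter(m=16, k=3)
cbf.add("xi")
cbf.add("xj")
cbf.remove("xi")  # xj stays present even if it shared positions with xi
```

Because shared positions are decremented rather than cleared, deleting one item cannot remove another. Calling `remove` on an item that was never added would corrupt the counters, so a caller has to guard against that.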

Counting Bloom Filters: Overflow
• Must choose counters large enough to avoid
overflow.
• Poisson approximation suggests 4 bits/counter.
– Average load using k = (ln 2)m/n counters is ln 2.
– Probability a counter has load at least 16: $\approx e^{-\ln 2} (\ln 2)^{16} / 16! \approx 6.78 \times 10^{-17}$
• Failsafes possible.
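
The overflow estimate can be reproduced directly. This sketch assumes, as in the Poisson approximation above, that a counter's load is Poisson with mean ln 2; the helper name `poisson_pmf` is hypothetical, and the tail sum is truncated at 60 terms because later terms are negligible.

```python
import math


def poisson_pmf(lam, j):
    """P(X = j) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** j / math.factorial(j)


lam = math.log(2)            # average counter load at the optimal k
p_16 = poisson_pmf(lam, 16)  # leading term: e^(-ln 2) (ln 2)^16 / 16!
p_overflow = sum(poisson_pmf(lam, j) for j in range(16, 60))  # P(load >= 16)

print(f"P(load = 16)  ~ {p_16:.3g}")
print(f"P(load >= 16) ~ {p_overflow:.3g}")
```

The full tail is only a few percent larger than its first term, which is why the single term is a reasonable estimate of the probability that a 4-bit counter overflows.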

Variations and Extensions
• Distance-Sensitive Bloom Filters
• Bloomier Filter

Extension: Distance-Sensitive Bloom Filters
• Instead of answering questions of the form “Is y ∈ S?”
we would like to answer questions of the form “Is y ≈ x for some x ∈ S?”
• That is, is the query close to some element of the set, under
some metric and some notion of close.
• Applications:
– DNA matching
– Virus/worm matching
– Databases
• Some initial results [KirschMitzenmacher]. Hard.

Extension: Bloomier Filter
• Bloom filters handle set membership.
• Counters to handle multi-set/count tracking.
• Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]:
– Extend to handle approximate functions.
– Each element of set has associated function value.
– Non-set elements should return null.
– Want to always return correct function value for set
elements.
– A false positive returns a non-null function value for a non-set
element.
