Data Structures and Algorithms – COMS21103
Bloom Filters
Benjamin Sach
(based on slides by Ashley Montanaro)
Introduction
In this lecture we are interested in space efficient data structures for storing a set S
which support only two basic operations:
INSERT(k) - inserts the key k from U into S
MEMBER(k) - outputs ‘yes’ if k ∈ S and ‘no’ otherwise
U is the universe, containing all possible keys
Let n be an upper bound on the number of keys that will ever be in S
Our motivation comes from applications where
the size of the universe U is much much larger than n
Important: You cannot ask “which keys are in S?”, only “is this key in S?”
[Figure: the universe U, with the keys in S marked]
Example and Motivation
Imagine you are attempting to build a blacklist of unsafe URLs that users should not visit
The universe contains all possible URLs
Whenever a new unsafe URL is discovered it is inserted into the data structure
Whenever we want to visit a URL we check the data structure.
INSERT(www.AwfulVirus.com)
INSERT(www.VirusStore.com)
Disclaimer: I take no responsibility for the contents of these websites
MEMBER(www.BBC.co.uk) - returns ‘no’
MEMBER(www.VirusStore.com) - returns ‘yes’
INSERT(www.CleanUpPC.com)
MEMBER(www.BBC.co.uk) - returns ‘yes’ ?!
A Bloom filter is a randomised data structure - sometimes it gets the answer wrong
Bloom filters
A Bloom filter is a randomised data structure for storing a set S
which supports two operations
The INSERT(k) operation inserts the key k from U into S
(it never does this incorrectly)
In a Bloom filter, the MEMBER(k) operation
always returns ‘yes’ if k ∈ S
however, if k is not in S
there is a small chance (say 1%) that it will still say ‘yes’
Why use a Bloom filter then?
Both operations run in O(1) time and the space used is very very good
It will use O(n) bits of space to store up to n keys
- the exact number of bits will depend on the failure probability
(we’ll come back to this at the end)
Approach 1: build an array
Before discussing Bloom filters, let’s consider a naive approach using an array...
For simplicity, let us think of the universe U as containing numbers 1, 2, 3, ..., |U|.
We could maintain a bit string B of length |U|
where B[k] = 1 if k ∈ S and B[k] = 0 otherwise
Example:
here |U| = 10 and S contains 3, 6 and 8
k:    1 2 3 4 5 6 7 8 9 10
B[k]: 0 0 1 0 0 1 0 1 0 0
While the operations take O(1) time, this array is |U| bits long!
It certainly isn’t suitable for the application we have seen
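The array approach can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name DirectBitArray and its method names are hypothetical, and a Python list of 0/1 integers stands in for a real bit string.

```python
class DirectBitArray:
    """Approach 1: one bit per possible key in the universe {1, ..., |U|}."""

    def __init__(self, universe_size):
        self.universe_size = universe_size
        # Index 0 is unused so that key k maps directly to position k.
        self.bits = [0] * (universe_size + 1)

    def insert(self, k):
        # INSERT(k): set B[k] = 1
        self.bits[k] = 1

    def member(self, k):
        # MEMBER(k): 'yes' exactly when B[k] = 1
        return self.bits[k] == 1


# The example from the slide: |U| = 10 and S contains 3, 6 and 8.
s = DirectBitArray(10)
for key in (3, 6, 8):
    s.insert(key)

print(s.bits[1:])   # [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
print(s.member(6))  # True
print(s.member(7))  # False
```

The structure never makes a mistake, but its size grows with |U| rather than with n.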
Approach 2: build a hash table
We could solve the problem by hashing...
We now maintain a much shorter bit string B of some length m < |U|
(to be determined later)
Assume we have access to a hash function h which maps each key k ∈ U
to an integer h(k) between 1 and m
INSERT(k) sets B[h(k)] = 1
MEMBER(k) returns ‘yes’ if B[h(k)] = 1
and ‘no’ if B[h(k)] = 0
Example:
Imagine that m = 3 and
h(www.AwfulVirus.com) = 2
h(www.VirusStore.com) = 3
h(www.BBC.co.uk) = 3
Initially B = 0 0 0 (positions 1 2 3)
INSERT(www.AwfulVirus.com) - now B = 0 1 0
INSERT(www.VirusStore.com) - now B = 0 1 1
MEMBER(www.BBC.co.uk) - returns ‘yes’
This is called a collision
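The worked example can be replayed in Python. This is a sketch under assumptions: the class name HashedBitString is hypothetical, and the hash function is simply a lookup table holding the three values given on the slide, not a real hash function.

```python
class HashedBitString:
    """Approach 2: a bit string of length m indexed through one hash function."""

    def __init__(self, m, h):
        self.m = m
        self.h = h                  # maps a key to a position in 1..m
        self.bits = [0] * (m + 1)   # index 0 unused, positions are 1..m

    def insert(self, k):
        # INSERT(k): set B[h(k)] = 1
        self.bits[self.h(k)] = 1

    def member(self, k):
        # MEMBER(k): 'yes' exactly when B[h(k)] = 1
        return self.bits[self.h(k)] == 1


# Hash values from the slide's example (m = 3).
EXAMPLE_H = {
    "www.AwfulVirus.com": 2,
    "www.VirusStore.com": 3,
    "www.BBC.co.uk": 3,
}

table = HashedBitString(3, EXAMPLE_H.__getitem__)
table.insert("www.AwfulVirus.com")
table.insert("www.VirusStore.com")

print(table.bits[1:])                 # [0, 1, 1]
print(table.member("www.BBC.co.uk"))  # True, although it was never inserted
```

The final call says ‘yes’ only because www.BBC.co.uk and www.VirusStore.com share position 3.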
Approach 2: build a hash table
The problem with hashing is that if m < |U| then
there will be some keys that hash to the same positions
(these are called collisions)
If we call MEMBER(k) for some key k not in S
but there is a key k′ ∈ S with h(k) = h(k′)
we will incorrectly output ‘yes’
To make sure that the probability of an error is low for every operation sequence,
we pick the hash function h at random
For every key k ∈ U, the value of h(k) is chosen independently and uniformly at random:
that is, the probability that h(k) = j is 1/m for all j between 1 and m
(each position is equally likely)
Important: h is chosen before any operations happen and never changes
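One way to picture such a random hash function is to draw each value h(k) the first time it is needed and then remember it forever. The sketch below does that in Python. The helper name make_random_hash is hypothetical, and the memo table is a modelling device only: it uses space proportional to the number of distinct keys seen, so it is not how a space efficient structure would implement h.

```python
import random


def make_random_hash(m, seed=None):
    """Return a function h where each h(k) is uniform on 1..m and fixed once drawn."""
    rng = random.Random(seed)
    chosen = {}  # remembers h(k) so that h never changes between calls

    def h(k):
        if k not in chosen:
            # Drawn independently of every other key, uniformly from 1..m.
            chosen[k] = rng.randint(1, m)
        return chosen[k]

    return h


h = make_random_hash(m=100, seed=1)
first = h("www.BBC.co.uk")
print(first == h("www.BBC.co.uk"))  # True: the value is fixed once chosen
```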
What is the probability of an error?
Assume we have already INSERTED n keys into the structure
Further, we have just called
MEMBER(k) for some key k not in S
(which will check whether B[h(k)] = 1)
We want to know the probability that the answer returned is ‘yes’ (which would be bad)
The bit-string B contains at most n 1’s among the m positions
By definition, h(k) is equally likely to be any position between 1 and m
Therefore the probability that B[h(k)] = 1 is at most n/m
If we choose m = 100n then we get a failure probability of at most 1%
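The n/m bound can be checked empirically with a small simulation. This is a rough sketch under assumptions: the function name and the choices n = 1000 and 100,000 trials are illustrative, and the random hash function is modelled by drawing a fresh uniform position for every distinct key.

```python
import random


def estimate_false_positive_rate(n, m, trials, seed=0):
    """Insert n keys, then query `trials` keys not in S and count wrong 'yes' answers."""
    rng = random.Random(seed)
    bits = [0] * (m + 1)  # index 0 unused, positions are 1..m

    # Each inserted key hashes to an independent uniform position.
    for _ in range(n):
        bits[rng.randint(1, m)] = 1

    # A key not in S also hashes to an independent uniform position;
    # MEMBER wrongly says 'yes' when that position already holds a 1.
    hits = sum(bits[rng.randint(1, m)] for _ in range(trials))
    return hits / trials


n = 1000
rate = estimate_false_positive_rate(n, m=100 * n, trials=100_000)
print(rate)  # expected to land close to, and not far above, n/m = 0.01
```

Because several inserted keys can land on the same position, the true rate is slightly below n/m, which is why the slide says "at most".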
Approach 2: build a hash table
We have developed a randomised data structure for storing a set S
which supports two operations
The INSERT(k) operation inserts the key k from U into S
(it never does this incorrectly)
Like in a Bloom filter, the MEMBER(k) operation
always returns ‘yes’ if k ∈ S
however, if k is not in S
there is a small chance (at most 1%) that it will still say ‘yes’
Both operations run in O(1) time and the space used is 100n bits
when storing up to n keys
neither the space nor the failure probability depends on |U|
if we wanted a better probability, we could use more space
Why use a Bloom filter then?
we will get much better space usage for the same probability
Approach 3: build a bloom filter
We still maintain a bit string B of some length m < |U|
Now we have r hash functions: h_1, h_2, . . . , h_r (we will choose r and m later)
Each hash function h_i maps a key k to an integer h_i(k) between 1 and m
INSERT(k) sets B[h_i(k)] = 1 for all i between 1 and r
MEMBER(k) returns ‘yes’ if and only if B[h_i(k)] = 1 for all i between 1 and r
Example:
Imagine that m = 4, r = 2 and
h_1(AwVi.com) = 2    h_2(AwVi.com) = 1
h_1(ViSt.com) = 3    h_2(ViSt.com) = 2
h_1(BBC.com) = 2     h_2(BBC.com) = 4
INSERT(AwVi.com)
INSERT(ViSt.com)
B = 1 1 1 0   (positions 1 2 3 4)
MEMBER(BBC.com) - returns ‘no’ (B[2] = 1 but B[4] = 0)
Much better!
(not convinced?)
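The example above can be replayed in a few lines of Python. This is only a sketch: the hash values are copied from the slide as lookup tables rather than computed, and the names HASHES, insert and member are chosen here for illustration.

```python
# Hash values copied from the slide's example (m = 4, r = 2).
# Real hash functions would compute these; here they are lookup tables.
H1 = {"AwVi.com": 2, "ViSt.com": 3, "BBC.com": 2}
H2 = {"AwVi.com": 1, "ViSt.com": 2, "BBC.com": 4}
HASHES = [H1, H2]
M = 4

# The slides index B from 1 to m, so position 0 is unused padding.
B = [0] * (M + 1)

def insert(k):
    """Set B[h_i(k)] = 1 for every hash function h_i."""
    for h in HASHES:
        B[h[k]] = 1

def member(k):
    """Return True ('yes') if and only if B[h_i(k)] = 1 for every h_i."""
    return all(B[h[k]] == 1 for h in HASHES)

insert("AwVi.com")
insert("ViSt.com")
bits = B[1:]                      # positions 1..4
answer_bbc = member("BBC.com")    # B[2] = 1 but B[4] = 0
print(bits, answer_bbc)           # prints: [1, 1, 1, 0] False
```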
Approach 3: build a bloom filter
We still maintain a bit string B of some length m < |U|
Now we have r hash functions: h_1, h_2, . . . , h_r (we will choose r and m later)
Each hash function h_i maps a key k to an integer h_i(k) between 1 and m
INSERT(k) sets B[h_i(k)] = 1 for all i between 1 and r
MEMBER(k) returns ‘yes’ if and only if B[h_i(k)] = 1 for all i between 1 and r
For every key k ∈ U, the value of each h_i(k) is chosen independently and uniformly at random:
that is, the probability that h_i(k) = j is 1/m for all j between 1 and m
(each position is equally likely)
but what is the probability of a wrong answer?
What is the probability of an error?
Assume we have already INSERTED n keys into the bloom filter
Further, we have just called MEMBER(k) for some key k not in S
this will check whether B[h_i(k)] = 1 for all i = 1, 2, . . . , r
This is the same as checking whether r randomly chosen bits of B all equal 1
We will now show that there is only a small probability of this happening
[Figure: the bit string B of length m, with some bits set to 1]
As there are at most n keys in the filter,
at most nr bits of B are set to 1
(each INSERT sets at most r bits to 1)
So the fraction of bits set to 1 is at most nr/m
so the probability that a randomly chosen bit is 1 is at most nr/m
so the probability that r randomly chosen bits all equal 1 is at most (nr/m)^r
(do this independently r times)
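The bound can be checked numerically. The sketch below assumes the idealised model from the slides (every hash value is an independent, uniformly random position); the function name simulate and the parameter values n = 100, m = 1000, r = 3 are illustrative choices, not taken from the lecture.

```python
import random

def simulate(n, m, r, trials=20000, seed=0):
    """Fill a filter under the idealised model and measure false positives."""
    rng = random.Random(seed)
    B = [0] * m
    # Each of the n INSERTs sets r uniformly random positions to 1.
    for _ in range(n * r):
        B[rng.randrange(m)] = 1
    ones = sum(B)
    # Exact chance that r fresh uniform probes all land on 1s,
    # given this particular filter state.
    exact = (ones / m) ** r
    # Empirical estimate: MEMBER on `trials` keys that were never inserted.
    hits = sum(all(B[rng.randrange(m)] for _ in range(r)) for _ in range(trials))
    return ones, exact, hits / trials

n, m, r = 100, 1000, 3
bound = (n * r / m) ** r          # the (nr/m)^r bound from the slide
ones, exact, observed = simulate(n, m, r)
print(f"bits set: {ones} (at most {n * r})")
print(f"bound: {bound:.4f}  exact for this filter: {exact:.4f}  observed: {observed:.4f}")
```

Because some of the nr probes land on the same position, fewer than nr bits are usually set, so the exact probability sits somewhat below the bound.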
What is the probability of a collision?
We now choose r to minimise this probability. . .
By differentiating, we can find that (nr/m)^r is minimised by
letting r = m/(ne) where e = 2.7183 . . .
If we plug this in we get that the probability of failure is at most
(1/e)^(m/(ne)) ≈ (0.69)^(m/n)
In particular to achieve a 1% failure probability,
we can set m ≈ 12.52n bits
This is much better than the 100n bits we needed with a single hash function
to achieve the same probability
neither the space nor the failure probability depends on |U|
if we wanted a better probability, we could use more space
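The numbers on this slide can be reproduced with a short calculation. This is a sketch under the slide's own bound (nr/m)^r; the function names and the sample values n = 1000, m = 12520 are illustrative assumptions. Note that r = m/(ne) is generally not a whole number, so a real filter would have to round it.

```python
import math

def bound(n, m, r):
    """Upper bound (nr/m)^r on the false positive probability."""
    return (n * r / m) ** r

def optimal_r(n, m):
    """Minimiser of the bound, r = m/(ne)."""
    return m / (n * math.e)

def bits_per_key(eps):
    """Solve (1/e)^(m/(ne)) = eps for m/n, giving m/n = e * ln(1/eps)."""
    return math.e * math.log(1 / eps)

n, m = 1000, 12520
r_star = optimal_r(n, m)
base = math.exp(-1 / math.e)      # (1/e)^(1/e), the 0.69 on the slide

print(f"optimal r           = {r_star:.2f}")
print(f"bound at optimal r  = {bound(n, m, r_star):.4f}")
print(f"base of the bound   = {base:.4f}")
print(f"bits per key for 1% = {bits_per_key(0.01):.2f}")
print(f"single hash, m=100n = {bound(n, 100 * n, 1):.4f}")
```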
Bloom filter summary
A Bloom filter is a randomised data structure for storing a set S
which supports two operations, each in O(1) time
The INSERT(k) operation inserts the key k from U into S
(it never does this incorrectly)
In a bloom filter, the MEMBER(k) operation
always returns ‘yes’ if k ∈ S
however, if k is not in S
there is a small chance, ε, that it will still say ‘yes’
We have seen that if ε = 0.01 (1%) then the space used is m ≈ 12.52n bits
when storing up to n keys
By improving the analysis, one can show that only ≈ 1.44 n log2(1/ε) bits are needed
(≈ 9.57n bits when ε = 0.01)
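To compare the two space figures quoted in the summary, the following sketch evaluates both per-key costs for a few failure probabilities. The function names are made up for illustration, and the constant 1.44 is taken as given from the slide rather than derived here.

```python
import math

def simple_bits_per_key(eps):
    """Space from the analysis in this lecture: m/n = e * ln(1/eps)."""
    return math.e * math.log(1 / eps)

def improved_bits_per_key(eps):
    """Space quoted for the improved analysis: m/n = 1.44 * log2(1/eps)."""
    return 1.44 * math.log2(1 / eps)

for eps in (0.1, 0.01, 0.001):
    print(f"eps = {eps:<6} simple: {simple_bits_per_key(eps):6.2f}n bits"
          f"   improved: {improved_bits_per_key(eps):6.2f}n bits")
```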
Practical hash functions
We made the unrealistic assumption that each hash function h_i maps a key k to a
uniformly random integer between 1 and m.
In practice, we pick each hash function h_i randomly from a fixed set of hash functions.
One way of doing this for integer keys is the following: (see CLRS 11.3.3)
For each i:
1. Pick a prime number p > |U|.
2. Pick random integers a ∈ {1, . . . , p − 1}, b ∈ {0, . . . , p − 1}.
3. Let h_i be defined by h_i(k) = 1 + (((ak + b) mod p) mod m).
Some number theory can be used to prove that this set of hash functions is “pseudorandom” in some
sense; however, technically they are not “random enough” for our analysis above to go through.
Nevertheless, in practice hash functions like this are very effective.
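The three steps above can be sketched as a small Python class. Several details are assumptions made for illustration and are not from the slides: the class and method names, the choice p = 2^31 − 1 (a prime, suitable only if every key is a non-negative integer below p), a single shared p for all r functions, and rounding r to 5 for m ≈ 12.52n.

```python
import random

class BloomFilter:
    """Bloom filter over integer keys using h_i(k) = 1 + ((a*k + b) mod p) mod m."""

    def __init__(self, m, r, p, seed=None):
        rng = random.Random(seed)
        self.m = m
        self.p = p                     # assumed prime and larger than every key
        self.bits = [0] * (m + 1)      # positions 1..m; index 0 is unused
        # One random pair (a, b) per hash function, as in steps 1 and 2.
        self.params = [(rng.randint(1, p - 1), rng.randint(0, p - 1))
                       for _ in range(r)]

    def _positions(self, k):
        # Step 3: each h_i(k) lies between 1 and m.
        return [1 + ((a * k + b) % self.p) % self.m for a, b in self.params]

    def insert(self, k):
        for j in self._positions(k):
            self.bits[j] = 1

    def member(self, k):
        return all(self.bits[j] for j in self._positions(k))

keys = list(range(0, 3000, 3))                     # n = 1000 integer keys
bf = BloomFilter(m=12520, r=5, p=2**31 - 1, seed=1)
for k in keys:
    bf.insert(k)

others = [k for k in range(3000) if k % 3 != 0]    # keys never inserted
false_positives = sum(bf.member(k) for k in others)
print("all inserted keys found:", all(bf.member(k) for k in keys))
print("false positive rate on other keys:", false_positives / len(others))
```

Inserted keys are always reported as present. The false positive rate that gets printed depends on the random choice of a and b, so no particular value is guaranteed.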

More Related Content

Similar to Bloom Filters

SF Big Analytics: Introduction to Succinct by UC Berkeley AmpLab
SF Big Analytics: Introduction to Succinct by UC Berkeley AmpLabSF Big Analytics: Introduction to Succinct by UC Berkeley AmpLab
SF Big Analytics: Introduction to Succinct by UC Berkeley AmpLabChester Chen
 
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...Cathrine Wilhelmsen
 
computer notes - Data Structures - 5
computer notes - Data Structures - 5computer notes - Data Structures - 5
computer notes - Data Structures - 5ecomputernotes
 
Computer notes - Josephus Problem
Computer notes - Josephus ProblemComputer notes - Josephus Problem
Computer notes - Josephus Problemecomputernotes
 
Splunk Enterprise for Information Security Hands-On Breakout Session
Splunk Enterprise for Information Security Hands-On Breakout SessionSplunk Enterprise for Information Security Hands-On Breakout Session
Splunk Enterprise for Information Security Hands-On Breakout SessionSplunk
 
The Art of Identifying Vulnerabilities - CascadiaFest 2015
The Art of Identifying Vulnerabilities  - CascadiaFest 2015The Art of Identifying Vulnerabilities  - CascadiaFest 2015
The Art of Identifying Vulnerabilities - CascadiaFest 2015Adam Baldwin
 
Data Strucure with STL
Data Strucure with STL Data Strucure with STL
Data Strucure with STL Mohamed Essam
 
Functional Pe(a)rls - the Purely Functional Datastructures edition
Functional Pe(a)rls - the Purely Functional Datastructures editionFunctional Pe(a)rls - the Purely Functional Datastructures edition
Functional Pe(a)rls - the Purely Functional Datastructures editionosfameron
 

Similar to Bloom Filters (8)

SF Big Analytics: Introduction to Succinct by UC Berkeley AmpLab
SF Big Analytics: Introduction to Succinct by UC Berkeley AmpLabSF Big Analytics: Introduction to Succinct by UC Berkeley AmpLab
SF Big Analytics: Introduction to Succinct by UC Berkeley AmpLab
 
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...
 
computer notes - Data Structures - 5
computer notes - Data Structures - 5computer notes - Data Structures - 5
computer notes - Data Structures - 5
 
Computer notes - Josephus Problem
Computer notes - Josephus ProblemComputer notes - Josephus Problem
Computer notes - Josephus Problem
 
Splunk Enterprise for Information Security Hands-On Breakout Session
Splunk Enterprise for Information Security Hands-On Breakout SessionSplunk Enterprise for Information Security Hands-On Breakout Session
Splunk Enterprise for Information Security Hands-On Breakout Session
 
The Art of Identifying Vulnerabilities - CascadiaFest 2015
The Art of Identifying Vulnerabilities  - CascadiaFest 2015The Art of Identifying Vulnerabilities  - CascadiaFest 2015
The Art of Identifying Vulnerabilities - CascadiaFest 2015
 
Data Strucure with STL
Data Strucure with STL Data Strucure with STL
Data Strucure with STL
 
Functional Pe(a)rls - the Purely Functional Datastructures edition
Functional Pe(a)rls - the Purely Functional Datastructures editionFunctional Pe(a)rls - the Purely Functional Datastructures edition
Functional Pe(a)rls - the Purely Functional Datastructures edition
 

More from Benjamin Sach

Approximation Algorithms Part Four: APTAS
Approximation Algorithms Part Four: APTASApproximation Algorithms Part Four: APTAS
Approximation Algorithms Part Four: APTASBenjamin Sach
 
Approximation Algorithms Part Three: (F)PTAS
Approximation Algorithms Part Three: (F)PTASApproximation Algorithms Part Three: (F)PTAS
Approximation Algorithms Part Three: (F)PTASBenjamin Sach
 
Approximation Algorithms Part Two: More Constant factor approximations
Approximation Algorithms Part Two: More Constant factor approximationsApproximation Algorithms Part Two: More Constant factor approximations
Approximation Algorithms Part Two: More Constant factor approximationsBenjamin Sach
 
Approximation Algorithms Part One: Constant factor approximations
Approximation Algorithms Part One: Constant factor approximationsApproximation Algorithms Part One: Constant factor approximations
Approximation Algorithms Part One: Constant factor approximationsBenjamin Sach
 
Orthogonal Range Searching
Orthogonal Range SearchingOrthogonal Range Searching
Orthogonal Range SearchingBenjamin Sach
 
Pattern Matching Part Two: k-mismatches
Pattern Matching Part Two: k-mismatchesPattern Matching Part Two: k-mismatches
Pattern Matching Part Two: k-mismatchesBenjamin Sach
 
Pattern Matching Part Three: Hamming Distance
Pattern Matching Part Three: Hamming DistancePattern Matching Part Three: Hamming Distance
Pattern Matching Part Three: Hamming DistanceBenjamin Sach
 
Lowest Common Ancestor
Lowest Common AncestorLowest Common Ancestor
Lowest Common AncestorBenjamin Sach
 
Range Minimum Queries
Range Minimum QueriesRange Minimum Queries
Range Minimum QueriesBenjamin Sach
 
Pattern Matching Part Two: Suffix Arrays
Pattern Matching Part Two: Suffix ArraysPattern Matching Part Two: Suffix Arrays
Pattern Matching Part Two: Suffix ArraysBenjamin Sach
 
Pattern Matching Part One: Suffix Trees
Pattern Matching Part One: Suffix TreesPattern Matching Part One: Suffix Trees
Pattern Matching Part One: Suffix TreesBenjamin Sach
 
Hashing Part Two: Cuckoo Hashing
Hashing Part Two: Cuckoo HashingHashing Part Two: Cuckoo Hashing
Hashing Part Two: Cuckoo HashingBenjamin Sach
 
Hashing Part Two: Static Perfect Hashing
Hashing Part Two: Static Perfect HashingHashing Part Two: Static Perfect Hashing
Hashing Part Two: Static Perfect HashingBenjamin Sach
 
Minimum Spanning Trees (via Disjoint Sets)
Minimum Spanning Trees (via Disjoint Sets)Minimum Spanning Trees (via Disjoint Sets)
Minimum Spanning Trees (via Disjoint Sets)Benjamin Sach
 
Shortest Paths Part 1: Priority Queues and Dijkstra's Algorithm
Shortest Paths Part 1: Priority Queues and Dijkstra's AlgorithmShortest Paths Part 1: Priority Queues and Dijkstra's Algorithm
Shortest Paths Part 1: Priority Queues and Dijkstra's AlgorithmBenjamin Sach
 
Depth First Search and Breadth First Search
Depth First Search and Breadth First SearchDepth First Search and Breadth First Search
Depth First Search and Breadth First SearchBenjamin Sach
 

More from Benjamin Sach (20)

Approximation Algorithms Part Four: APTAS
Approximation Algorithms Part Four: APTASApproximation Algorithms Part Four: APTAS
Approximation Algorithms Part Four: APTAS
 
Approximation Algorithms Part Three: (F)PTAS
Approximation Algorithms Part Three: (F)PTASApproximation Algorithms Part Three: (F)PTAS
Approximation Algorithms Part Three: (F)PTAS
 
Approximation Algorithms Part Two: More Constant factor approximations
Approximation Algorithms Part Two: More Constant factor approximationsApproximation Algorithms Part Two: More Constant factor approximations
Approximation Algorithms Part Two: More Constant factor approximations
 
Approximation Algorithms Part One: Constant factor approximations
Approximation Algorithms Part One: Constant factor approximationsApproximation Algorithms Part One: Constant factor approximations
Approximation Algorithms Part One: Constant factor approximations
 
van Emde Boas trees
van Emde Boas treesvan Emde Boas trees
van Emde Boas trees
 
Orthogonal Range Searching
Orthogonal Range SearchingOrthogonal Range Searching
Orthogonal Range Searching
 
Pattern Matching Part Two: k-mismatches
Pattern Matching Part Two: k-mismatchesPattern Matching Part Two: k-mismatches
Pattern Matching Part Two: k-mismatches
 
Pattern Matching Part Three: Hamming Distance
Pattern Matching Part Three: Hamming DistancePattern Matching Part Three: Hamming Distance
Pattern Matching Part Three: Hamming Distance
 
Lowest Common Ancestor
Lowest Common AncestorLowest Common Ancestor
Lowest Common Ancestor
 
Range Minimum Queries
Range Minimum QueriesRange Minimum Queries
Range Minimum Queries
 
Pattern Matching Part Two: Suffix Arrays
Pattern Matching Part Two: Suffix ArraysPattern Matching Part Two: Suffix Arrays
Pattern Matching Part Two: Suffix Arrays
 
Pattern Matching Part One: Suffix Trees
Pattern Matching Part One: Suffix TreesPattern Matching Part One: Suffix Trees
Pattern Matching Part One: Suffix Trees
 
Hashing Part Two: Cuckoo Hashing
Hashing Part Two: Cuckoo HashingHashing Part Two: Cuckoo Hashing
Hashing Part Two: Cuckoo Hashing
 
Hashing Part Two: Static Perfect Hashing
Hashing Part Two: Static Perfect HashingHashing Part Two: Static Perfect Hashing
Hashing Part Two: Static Perfect Hashing
 
Hashing Part One
Hashing Part OneHashing Part One
Hashing Part One
 
Probability Recap
Probability RecapProbability Recap
Probability Recap
 
Dynamic Programming
Dynamic ProgrammingDynamic Programming
Dynamic Programming
 
Minimum Spanning Trees (via Disjoint Sets)
Minimum Spanning Trees (via Disjoint Sets)Minimum Spanning Trees (via Disjoint Sets)
Minimum Spanning Trees (via Disjoint Sets)
 
Shortest Paths Part 1: Priority Queues and Dijkstra's Algorithm
Shortest Paths Part 1: Priority Queues and Dijkstra's AlgorithmShortest Paths Part 1: Priority Queues and Dijkstra's Algorithm
Shortest Paths Part 1: Priority Queues and Dijkstra's Algorithm
 
Depth First Search and Breadth First Search
Depth First Search and Breadth First SearchDepth First Search and Breadth First Search
Depth First Search and Breadth First Search
 

Recently uploaded

4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptxmary850239
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research DiscourseAnita GoswamiGiri
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptxmary850239
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...Nguyen Thanh Tu Collection
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptxmary850239
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfPrerana Jadhav
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1GloryAnnCastre1
 
ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6Vanessa Camilleri
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxSayali Powar
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxMichelleTuguinay1
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQuiz Club NITW
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQuiz Club NITW
 
CHEST Proprioceptive neuromuscular facilitation.pptx
CHEST Proprioceptive neuromuscular facilitation.pptxCHEST Proprioceptive neuromuscular facilitation.pptx
CHEST Proprioceptive neuromuscular facilitation.pptxAneriPatwari
 

Recently uploaded (20)

4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research Discourse
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx

Bloom Filters

  • 1. Data Structures and Algorithms – COMS21103. Bloom Filters. Benjamin Sach (based on slides by Ashley Montanaro)
  • 5. Introduction. In this lecture we are interested in space-efficient data structures for storing a set S which support only two basic operations: INSERT(k), which inserts the key k from U into S, and MEMBER(k), which outputs ‘yes’ if k ∈ S and ‘no’ otherwise. U is the universe, containing all possible keys. Let n be an upper bound on the number of keys that will ever be in S. Our motivation comes from applications where the size of the universe U is much, much larger than n. Important: you cannot ask “which keys are in S?”, only “is this key in S?”. [Figure: the universe U, with a key in S marked.]
  • 16. Example and Motivation. Imagine you are attempting to build a blacklist of unsafe URLs that users should not visit. The universe contains all possible URLs. Whenever a new unsafe URL is discovered, it is inserted into the data structure. Whenever we want to visit a URL, we check the data structure. Example sequence: INSERT(www.AwfulVirus.com); INSERT(www.VirusStore.com); MEMBER(www.BBC.co.uk) returns ‘no’; MEMBER(www.VirusStore.com) returns ‘yes’; INSERT(www.CleanUpPC.com); MEMBER(www.BBC.co.uk) returns ‘yes’ ?! A Bloom filter is a randomised data structure: sometimes it gets the answer wrong. (Disclaimer: I take no responsibility for the contents of these websites.)
  • 27. Bloom filters. A Bloom filter is a randomised data structure for storing a set S which supports two operations. The INSERT(k) operation inserts the key k from U into S (it never does this incorrectly). In a Bloom filter, the MEMBER(k) operation always returns ‘yes’ if k ∈ S; however, if k is not in S there is a small chance (say 1%) that it will still say ‘yes’. Why use a Bloom filter then? Both operations run in O(1) time and the space used is very, very good: it will use O(n) bits of space to store up to n keys. The exact number of bits will depend on the failure probability (we’ll come back to this at the end).
  • 34. Approach 1: build an array. Before discussing Bloom filters, let’s consider a naive approach using an array. For simplicity, let us think of the universe U as containing the numbers 1, 2, 3, . . . , |U|. We could maintain a bit string B of length |U|, where B[k] = 1 if k ∈ S and B[k] = 0 otherwise. Example: here |U| = 10 and S contains 3, 6 and 8, so B = 0 0 1 0 0 1 0 1 0 0 (positions 1 to 10). While the operations take O(1) time, this array is |U| bits long! It certainly isn’t suitable for the application we have seen.
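The array approach above can be sketched in a few lines of Python. This is only an illustration: the class name `BitArraySet` and its method names are hypothetical, and a `bytearray` spends one byte per position rather than one bit, which keeps the sketch short but overstates the space by a factor of 8.

```python
class BitArraySet:
    """Approach 1: one position per possible key in the universe {1, ..., universe_size}."""

    def __init__(self, universe_size: int) -> None:
        self.universe_size = universe_size
        # Position 0 is unused so that a key k can be used directly as an index.
        # (One byte per position for simplicity; a packed version would use one bit.)
        self.bits = bytearray(universe_size + 1)

    def _check(self, k: int) -> None:
        if not 1 <= k <= self.universe_size:
            raise ValueError(f"key {k} is outside the universe")

    def insert(self, k: int) -> None:
        self._check(k)
        self.bits[k] = 1          # B[k] = 1 means k is in S

    def member(self, k: int) -> bool:
        self._check(k)
        return self.bits[k] == 1  # never wrong, but needs |U| positions


# The example from the slide: |U| = 10 and S = {3, 6, 8}.
s = BitArraySet(10)
for key in (3, 6, 8):
    s.insert(key)

print([s.bits[k] for k in range(1, 11)])  # [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
print(s.member(6), s.member(7))           # True False
```

Both operations are a single array access, but the constructor allocates space proportional to |U|, which is the problem the later approaches address.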
  • 45. Approach 2: build a hash table. We could solve the problem by hashing. We now maintain a much shorter bit string B of some length m < |U| (to be determined later). Assume we have access to a hash function h which maps each key k ∈ U to an integer h(k) between 1 and m. INSERT(k) sets B[h(k)] = 1. MEMBER(k) returns ‘yes’ if B[h(k)] = 1 and ‘no’ if B[h(k)] = 0. Example: imagine that m = 3 and h(www.AwfulVirus.com) = 2, h(www.VirusStore.com) = 3, h(www.BBC.co.uk) = 3. Initially B = 0 0 0. After INSERT(www.AwfulVirus.com), B = 0 1 0. After INSERT(www.VirusStore.com), B = 0 1 1. Now MEMBER(www.BBC.co.uk) returns ‘yes’, because B[3] = 1. This is called a collision.
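The single-hash scheme and its collision can be replayed with a short Python sketch. The class name `SingleHashFilter` is hypothetical, and the hash function is simply a lookup table hard-coded with the three values from the example (an assumption made so the run matches the slide; it is not a real hash function).

```python
class SingleHashFilter:
    """Approach 2: a bit string B of length m indexed by one hash function h."""

    def __init__(self, m, h):
        self.m = m
        self.h = h                    # maps a key to an integer in 1..m
        self.bits = [0] * (m + 1)     # position 0 is unused (positions are 1-based)

    def insert(self, k):
        self.bits[self.h(k)] = 1      # INSERT(k) sets B[h(k)] = 1

    def member(self, k):
        return self.bits[self.h(k)] == 1   # 'yes' iff B[h(k)] = 1


# Hash values taken from the example, with m = 3.
EXAMPLE_HASH = {
    "www.AwfulVirus.com": 2,
    "www.VirusStore.com": 3,
    "www.BBC.co.uk": 3,
}

f = SingleHashFilter(3, EXAMPLE_HASH.__getitem__)
f.insert("www.AwfulVirus.com")
f.insert("www.VirusStore.com")

print(f.bits[1:])                 # [0, 1, 1]
print(f.member("www.BBC.co.uk"))  # True, although it was never inserted
```

The last line is the false positive from the example: www.BBC.co.uk shares position 3 with www.VirusStore.com.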
  • 50. Approach 2: build a hash table. The problem with hashing is that if m < |U| then there will be some keys that hash to the same position (these are called collisions). If we call MEMBER(k) for some key k not in S, but there is a key k′ ∈ S with h(k) = h(k′), we will incorrectly output ‘yes’. To make sure that the probability of an error is low for every operation sequence, we pick the hash function h at random. For every key k ∈ U, the value of h(k) is chosen independently and uniformly at random: that is, the probability that h(k) = j is 1/m for all j between 1 and m (each position is equally likely). Important: h is chosen before any operations happen and never changes.
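One way to picture "pick h at random, once, before any operations" is the Python sketch below. It is a practical stand-in, not the idealised function from the slide: a truly random function on U cannot be stored compactly, so the sketch assumes that a keyed BLAKE2b digest with a randomly chosen key behaves like one. The helper name `make_hash` is hypothetical.

```python
import hashlib
import secrets
from typing import Callable, Optional


def make_hash(m: int, key: Optional[bytes] = None) -> Callable[[str], int]:
    """Return a hash function h with h(k) in 1..m.

    The random key is drawn once, here, and never changes afterwards,
    mirroring "h is chosen before any operations happen".
    """
    if key is None:
        key = secrets.token_bytes(16)   # the one-off random choice

    def h(k: str) -> int:
        digest = hashlib.blake2b(k.encode("utf-8"), digest_size=8, key=key).digest()
        # Reduce the 64-bit digest to a position between 1 and m.
        return int.from_bytes(digest, "big") % m + 1

    return h


h = make_hash(100)
print(h("www.BBC.co.uk") == h("www.BBC.co.uk"))  # True: h never changes once chosen
print(1 <= h("www.VirusStore.com") <= 100)       # True: positions lie in 1..m
```

Calling `make_hash` again draws a fresh key and so gives a different function, which corresponds to a different random choice of h.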
  • 59. What is the probability of an error? Assume we have already INSERTed n keys into the structure. Further, we have just called MEMBER(k) for some key k not in S (which will check whether B[h(k)] = 1). We want to know the probability that the answer returned is ‘yes’ (which would be bad). The bit string B contains at most n 1’s among the m positions. By definition, h(k) is equally likely to be any position between 1 and m. Therefore the probability that B[h(k)] = 1 is at most n/m. If we choose m = 100n then we get a failure probability of at most 1%. [Figure: the bit string B of length m with its 1’s marked and the position h(k) indicated.]
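The n/m bound can be checked numerically with a small Python sketch. It assumes the idealised model from the analysis, where each inserted key lands on a uniformly random position; the function name `false_positive_probability` and the fixed seed are illustrative choices, not part of the lecture.

```python
import random


def false_positive_probability(n: int, m: int, seed: int = 0) -> float:
    """Insert n keys with idealised random hash values and return the
    probability that a fresh key k (not in S) sees B[h(k)] = 1."""
    rng = random.Random(seed)
    bits = [0] * (m + 1)                 # position 0 is unused
    for _ in range(n):
        bits[rng.randint(1, m)] = 1      # h(key) is uniform on 1..m
    # h(k) for a fresh key is uniform on 1..m and independent of B,
    # so the error probability is (number of 1's in B) / m.
    return sum(bits) / m


n = 1000
m = 100 * n
p = false_positive_probability(n, m)
bound = n / m

print(bound)        # 0.01
print(p <= bound)   # True: at most n of the m positions can be 1
```

The bound holds for every seed, because n insertions can set at most n positions; collisions among the inserted keys only make the true probability smaller.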
  • 71. Approach 2: build a hash table. We have developed a randomised data structure for storing a set S which supports two operations. The INSERT(k) operation inserts the key k from U into S (it never does this incorrectly). As in a Bloom filter, the MEMBER(k) operation always returns ‘yes’ if k ∈ S; however, if k is not in S there is a small chance (in fact at most 1%) that it will still say ‘yes’. Both operations run in O(1) time and the space used is 100n bits when storing up to n keys. Neither the space nor the failure probability depends on |U|. If we wanted a better probability, we could use more space. Why use a Bloom filter then? We will get much better space usage for the same probability.
  • 80. Approach 3: build a Bloom filter. We still maintain a bit string B of some length m < |U|. Now we have r hash functions h_1, h_2, . . . , h_r (we will choose r and m later). Each hash function h_i maps a key k to an integer h_i(k) between 1 and m. INSERT(k) sets B[h_i(k)] = 1 for all i between 1 and r. MEMBER(k) returns ‘yes’ if and only if B[h_i(k)] = 1 for all i. Example: imagine that m = 4, r = 2 and h_1(AwVi.com) = 2, h_2(AwVi.com) = 1; h_1(ViSt.com) = 3, h_2(ViSt.com) = 2; h_1(BBC.com) = 2, h_2(BBC.com) = 4. Initially B = 0 0 0 0. After INSERT(AwVi.com), B = 1 1 0 0. After INSERT(ViSt.com), B = 1 1 1 0. Now MEMBER(BBC.com) returns ‘no’, because B[4] = 0. Much better!
  • 81. Approach 3: build a bloom filter We still maintain a bit string B of some length m < |U| Example: Imagine that m = 4, r = 2 and INSERT(k) sets B[hi(k)] = 1 MEMBER(k) returns ‘yes’ if and only if for all i, B[hi(k)] = 1 h1(AwVi.com) = 2 INSERT(AwVi.com) MEMBER(BBC.com) - returns ‘no’ h1(ViSt.com) = 3 h1(BBC.com) = 2 1 INSERT(ViSt.com) 1 Now we have r hash functions: h1, h2, . . . , hrh1, h2, . . . , hr (we will choose r and m later) for all i between 1 and r Each hash function hi maps a key k, to an integer hi(k) between 1 and m 0 h2(AwVi.com) = 1 h2(ViSt.com) = 2 h2(BBC.com) = 4 Much better! 1 1 2 3 4 (not convinced?)
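The example can be replayed in a few lines of Python. This is only a minimal sketch: the class name `ToyBloomFilter` and the lookup tables `H1` and `H2` are made up for illustration, and the "hash functions" are simply the values listed in the example rather than real hash functions.

```python
class ToyBloomFilter:
    """Bit string B of length m with r hash functions; positions are 1..m."""

    def __init__(self, m, hash_functions):
        self.m = m
        self.hash_functions = hash_functions
        # Index 0 is unused so that positions match the 1-based slides.
        self.bits = [0] * (m + 1)

    def insert(self, key):
        # INSERT(k): set B[h_i(k)] = 1 for every i.
        for h in self.hash_functions:
            self.bits[h(key)] = 1

    def member(self, key):
        # MEMBER(k): 'yes' if and only if B[h_i(k)] = 1 for every i.
        return all(self.bits[h(key)] == 1 for h in self.hash_functions)


# Hash values copied from the example (m = 4, r = 2).
H1 = {"AwVi.com": 2, "ViSt.com": 3, "BBC.com": 2}
H2 = {"AwVi.com": 1, "ViSt.com": 2, "BBC.com": 4}

bloom = ToyBloomFilter(4, [H1.__getitem__, H2.__getitem__])
bloom.insert("AwVi.com")
bloom.insert("ViSt.com")

print(bloom.bits[1:])            # [1, 1, 1, 0]
print(bloom.member("BBC.com"))   # False: position 4 is still 0
print(bloom.member("AwVi.com"))  # True
```

With only h1, BBC.com would map to position 2, which is already set, so the second hash function is what lets the filter answer ‘no’ here.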
  • 82-83. Approach 3: build a Bloom filter
    We still maintain a bit string B of some length m < |U|.
    Now we have r hash functions: h1, h2, . . . , hr (we will choose r and m later). Each hash function hi maps a key k to an integer hi(k) between 1 and m.
    INSERT(k) sets B[hi(k)] = 1 for all i between 1 and r.
    MEMBER(k) returns ‘yes’ if and only if B[hi(k)] = 1 for all i between 1 and r.
    For every key k ∈ U, the value of each hi(k) is chosen independently and uniformly at random: that is, the probability that hi(k) = j is 1/m for all j between 1 and m (each position is equally likely).
    But what is the probability of a wrong answer?
  • 84-94. What is the probability of an error?
    Assume we have already INSERTed n keys into the Bloom filter. Further, we have just called MEMBER(k) for some key k not in S. This will check whether B[hi(k)] = 1 for all i = 1, 2, . . . , r.
    This is the same as checking whether r randomly chosen bits of B all equal 1. We will now show that there is only a small probability of this happening.
    As there are at most n keys in the filter, at most nr bits of B are set to 1 (each INSERT sets at most r bits to 1).
    [Figure: the bit string B of length m, with the bits that are set to 1 marked]
    So the fraction of bits set to 1 is at most $\frac{nr}{m}$,
    so the probability that a randomly chosen bit is 1 is at most $\frac{nr}{m}$,
    so the probability that r randomly chosen bits all equal 1 is at most $\left(\frac{nr}{m}\right)^{r}$ (do this independently r times).
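The bound can be compared against a small simulation. The sketch below assumes the idealised model from the slides, in which every hash value is an independent, uniformly random position; the function names, the parameters n = 100, m = 1000, r = 3 and the trial count are arbitrary choices for illustration.

```python
import random


def false_positive_bound(n, m, r):
    """Upper bound (nr/m)^r on the probability of a wrong 'yes'."""
    return min(1.0, (n * r / m) ** r)


def simulate_false_positive_rate(n, m, r, trials=2000, seed=0):
    """Estimate the error rate under the uniform-hashing assumption."""
    rng = random.Random(seed)
    wrong_yes = 0
    for _ in range(trials):
        bits = [0] * m
        # INSERT n keys: each sets r uniformly random positions.
        for _ in range(n * r):
            bits[rng.randrange(m)] = 1
        # MEMBER(k) for a key not in S: r fresh uniformly random positions.
        if all(bits[rng.randrange(m)] for _ in range(r)):
            wrong_yes += 1
    return wrong_yes / trials


n, m, r = 100, 1000, 3
bound = false_positive_bound(n, m, r)
rate = simulate_false_positive_rate(n, m, r)
print(f"bound (nr/m)^r = {bound:.4f}, simulated rate = {rate:.4f}")
```

The simulated rate is typically somewhat below the bound, because several INSERTs can set the same bit, so fewer than nr bits are usually set.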
  • 95-101. What is the probability of a collision?
    We now choose r to minimise this probability. . .
    By differentiating, we can find that $\left(\frac{nr}{m}\right)^{r}$ is minimised by letting $r = \frac{m}{ne}$, where e = 2.7182 . . .
    If we plug this in we get that the probability of failure is at most $\left(\frac{1}{e}\right)^{\frac{m}{ne}} \approx (0.69)^{\frac{m}{n}}$.
    In particular, to achieve a 1% failure probability, we can set m ≈ 12.52n bits.
    This is much better than the 100n bits we needed with a single hash function to achieve the same probability.
    Neither the space nor the failure probability depends on |U|.
    If we wanted a better probability, we could use more space.
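The differentiation step can be written out as follows. This is a sketch that treats r as a real variable, whereas in an actual filter r would be rounded to a whole number of hash functions; the name f is introduced here only for the derivation.

```latex
\begin{align*}
f(r) &= \left(\frac{nr}{m}\right)^{r}, \qquad \ln f(r) = r \ln\frac{nr}{m} \\
\frac{d}{dr}\ln f(r) &= \ln\frac{nr}{m} + 1 = 0
  \;\Longrightarrow\; \frac{nr}{m} = \frac{1}{e}
  \;\Longrightarrow\; r = \frac{m}{ne} \\
\frac{d^{2}}{dr^{2}}\ln f(r) &= \frac{1}{r} > 0
  \quad\text{(so this stationary point is a minimum)} \\
f\!\left(\frac{m}{ne}\right) &= \left(\frac{1}{e}\right)^{\frac{m}{ne}}
  = \left(e^{-1/e}\right)^{\frac{m}{n}} \approx (0.69)^{\frac{m}{n}} \\
\left(\frac{1}{e}\right)^{\frac{m}{ne}} = 0.01
  &\;\Longleftrightarrow\; m = e\,n \ln 100 \approx 12.52\,n
\end{align*}
```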
  • 102. Bloom filter summary
    A Bloom filter is a randomised data structure for storing a set S which supports two operations, each in O(1) time.
    The INSERT(k) operation inserts the key k from U into S (it never does this incorrectly).
    In a Bloom filter, the MEMBER(k) operation always returns ‘yes’ if k ∈ S; however, if k is not in S there is a small chance, ε, that it will still say ‘yes’.
    We have seen that if ε = 0.01 (1%), then the space used when storing up to n keys is m ≈ 12.52n bits.
    By improving the analysis, one can show that only ≈ 1.44 n log2(1/ε) bits are needed (≈ 9.57n bits when ε = 0.01).
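The two space figures in the summary can be reproduced numerically. In this sketch the function names are hypothetical; the first formula comes from solving (1/e)^(m/(ne)) = ε for m, and the constant 1.44 in the second is taken as stated in the summary.

```python
import math


def bits_simple_bound(n, eps):
    """Bits needed under the bound (1/e)^(m/(ne)) <= eps, i.e. m = e*n*ln(1/eps)."""
    return math.e * n * math.log(1 / eps)


def bits_improved(n, eps):
    """Bits needed under the improved analysis: about 1.44 * n * log2(1/eps)."""
    return 1.44 * n * math.log2(1 / eps)


def hash_count(n, m):
    """The minimising choice r = m / (n*e); round to an integer in practice."""
    return m / (n * math.e)


n, eps = 1_000_000, 0.01
m_simple = bits_simple_bound(n, eps)
m_improved = bits_improved(n, eps)
print(f"simple bound:   {m_simple / n:.2f} bits per key")
print(f"improved bound: {m_improved / n:.2f} bits per key")
print(f"r for the simple bound: {hash_count(n, m_simple):.2f}")
```

For ε = 0.01 the simple bound gives r = ln 100, roughly 4.6, so one would use 4 or 5 hash functions.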
  • 103-107. Practical hash functions
    We made the unrealistic assumption that each hash function hi maps a key k to a uniformly random integer between 1 and m.
    In practice, we pick each hash function hi randomly from a fixed set of hash functions.
    One way of doing this for integer keys is the following (see CLRS 11.3.3). For each i:
      1. Pick a prime number p > |U|.
      2. Pick random integers a ∈ {1, . . . , p − 1}, b ∈ {0, . . . , p − 1}.
      3. Let hi be defined by hi(k) = 1 + (((ak + b) mod p) mod m).
    Some number theory can be used to prove that this set of hash functions is “pseudorandom” in some sense; however, technically they are not “random enough” for our analysis above to go through.
    Nevertheless, in practice hash functions like this are very effective.
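The three steps can be sketched in Python as below. The concrete values are assumptions chosen for illustration: a universe U = {0, . . . , 99}, the prime p = 101, m = 16 and r = 3; the helper names `make_hash`, `insert` and `member` are hypothetical.

```python
import random


def make_hash(p, m, rng):
    """Pick h(k) = 1 + (((a*k + b) mod p) mod m) with random a and b."""
    a = rng.randrange(1, p)  # a in {1, ..., p-1}
    b = rng.randrange(0, p)  # b in {0, ..., p-1}

    def h(k):
        return 1 + ((a * k + b) % p) % m

    return h


# Assumed parameters: keys are integers in {0, ..., 99}, so p = 101 > |U|.
P, M, R = 101, 16, 3
rng = random.Random(1)
hashes = [make_hash(P, M, rng) for _ in range(R)]

bits = [0] * (M + 1)  # index 0 unused; positions are 1..M


def insert(k):
    for h in hashes:
        bits[h(k)] = 1


def member(k):
    return all(bits[h(k)] for h in hashes)


stored = [3, 14, 15, 92]
for k in stored:
    insert(k)

# Every stored key is reported as present (no false negatives).
print(all(member(k) for k in stored))  # True
```

Keys that were never inserted may still be reported as present; how often depends on the a and b that happen to be drawn.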