PROBABILISTIC
DATA STRUCTURES
Thinh Dang-An
Definitions
• Data structure
• It is a ‘structure’ that holds ‘data’, allowing you to extract
information
• Probabilistic
• Query may return a wrong answer
• The answer is ‘good enough’
• Uses a fraction of the resources, e.g. memory or CPU cycles
Four types:
• Membership
• Bloom Filter
• Cuckoo Filters
• Cardinality
• Linear Counting
• LogLog, SuperLogLog,
HyperLogLog, HyperLogLog++
• Frequency
• Count-Min Sketch
• Majority Algorithm
• Misra-Gries Algorithm
• Similarity
• Locality-Sensitive Hashing (LSH)
• MinHash
• SimHash
Bloom Filter
Membership
Properties
• It tells us that the element either definitely is not in
the set or may be in the set.
• Bloom filters are called filters because they are often
used as a cheap first pass to filter out segments of a
dataset that do not match a query.
How does it work
• A Bloom filter is a bit array of m bits, all set to 0 at the beginning
• To insert an element into the filter, compute all k hash functions for the element
and set the bits at the corresponding indices
• To test whether an element is in the filter, compute all k hash functions for the
element and check the bits at all corresponding indices:
• if all bits are set, the answer is “maybe”
• if at least one bit isn’t set, the answer is “definitely not”
• The time needed to insert or test an element is a fixed constant O(k), independent
of the number of items already in the filter
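The steps above can be sketched in Python. The double-hashing scheme (h1 + i·h2) used to derive the k indices from one digest is a common trick but an assumption here, as are the parameter values:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array and k hash functions.
    The k indices are derived from one SHA-256 digest via double hashing
    (h1 + i*h2), an assumption for illustration."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # "maybe" if every bit is set; "definitely not" otherwise
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("apple")
print(bf.might_contain("apple"))   # True ("maybe")
print(bf.might_contain("banana"))  # almost certainly False
```

Both `add` and `might_contain` touch exactly k bits, which is where the O(k) insert/test cost comes from.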
Application
• Google BigTable, Apache HBase and Apache Cassandra use Bloom filters to
reduce the disk lookups for non-existent rows or columns
• Medium uses Bloom filters to avoid recommending articles a user has previously
read
• Google Chrome web browser used to use a Bloom filter to identify malicious URLs
(moved to PrefixSet, Issue 71832)
• The Squid Web Proxy Cache uses Bloom filters for cache digests
Cuckoo Filters
Membership
Properties
• Practically better than a Bloom filter
• Supports adding and removing items dynamically
• Provides higher lookup performance
• Cuckoo hashing resolves collisions by evicting existing items and rehashing
them to their alternate bucket
How does it work
• Parameters of the Filter:
• 1. Two Hash Functions: h1 and h2
• 2. An array B with n buckets. The i-th bucket will be called B[i]
• Input: L, a list of elements to be inserted into the cuckoo filter.
How does it work
While L is not empty:
Let x be the first item in the list L. Remove x from the list.
If B[h1(x)] is empty:
place x in B[h1(x)]
Else, if B[h2(x)] is empty:
place x in B[h2(x)]
Else:
Let y be the element in B[h2(x)].
Prepend y to L
place x in B[h2(x)]
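The insertion loop above can be sketched in Python. The toy hash functions h1/h2 and the `max_kicks` bound (added so that a cycle of evictions terminates instead of looping forever) are assumptions, not part of the slide's pseudocode:

```python
def cuckoo_insert(items, n, h1, h2, max_kicks=50):
    """Cuckoo-hashing insertion: try h1's bucket, then h2's bucket,
    otherwise evict the occupant of h2's bucket and reinsert it."""
    B = [None] * n
    for x in items:
        for _ in range(max_kicks):
            if B[h1(x)] is None:
                B[h1(x)] = x
                break
            if B[h2(x)] is None:
                B[h2(x)] = x
                break
            # evict ("kick") the occupant of B[h2(x)] and retry it
            i = h2(x)
            x, B[i] = B[i], x
        else:
            raise RuntimeError("table too full; a real filter would resize")
    return B

# toy hash functions (assumptions, chosen so one eviction occurs)
n = 8
h1 = lambda x: x % n
h2 = lambda x: (x // 3) % n
table = cuckoo_insert([8, 6, 0], n, h1, h2)
# inserting 0 evicts 8 from bucket 0; 8 then moves to its h2 bucket 2
print(table)  # [0, None, 8, None, None, None, 6, None]
```

Note that a real cuckoo filter stores short fingerprints of items rather than the items themselves; this sketch shows only the collision-resolution mechanism.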
What if a cuckoo filter uses more than two
hash functions?
• It isn't necessary, because:
• Extra hash functions add implementation and lookup cost without
bringing a clear benefit.
• Overloaded buckets are better handled by storing more than one
entry per bucket, at the cost of extra space.
COMPARISON WITH BLOOM FILTER
• Space Efficiency
• Number of Memory Accesses
• Value Association
• Maximum Capacity
Count Min Sketch
Frequency
Properties
• Estimates can only over-count, never under-count.
• The time needed to add an element or return its frequency is a fixed
constant O(k), assuming that every hash function can be
evaluated in constant time.
How does it work
• Use multiple arrays with different hash functions to compute
the index.
• When queried, return the minimum of the counters across the arrays.
→ Count-Min Sketch
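A minimal sketch of this scheme, assuming md5 with a per-row salt as the hash family (any pairwise-independent family works):

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: d rows of w counters, one hash function per row.
    Hash functions are derived from md5 with a row salt (an assumption)."""

    def __init__(self, w=256, d=4):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _index(self, row, item):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # collisions can only inflate counters, so taking the minimum
        # across rows yields an estimate that never under-counts
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.d))

cms = CountMinSketch()
for word in ["a", "b", "a", "a", "c"]:
    cms.add(word)
print(cms.estimate("a"))  # >= 3 (exactly 3 unless every row collides)
```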
• AT&T has used Count-Min Sketch in network switches to perform analyses on
network traffic using limited memory
• At Google, a precursor of the count-min sketch (called the “count sketch”) has
been implemented on top of their MapReduce parallel processing infrastructure
• Implemented as a part of Algebird library from Twitter
HyperLogLog
Cardinality
Properties
• HyperLogLog is described by 2 parameters:
• p – number of bits that determine a bucket to use
averaging (m = 2^p is the number of buckets/substreams)
• h - hash function, that produces uniform hash values
• The HyperLogLog algorithm is able to estimate cardinalities of
> 10^9 with a typical error rate of 2%
• Observe the maximum number of leading zeros across all
hash values.
How does it work
• Stochastic averaging is used to reduce the large variability:
• The input stream of data elements S is divided into m substreams S(i) using the first p
bits of the hash values (m = 2^p)
• In each substream, the rank (after the initial p bits that are used for substreaming) is
measured independently.
• These numbers are kept in an array of registers M, where M[i] stores the maximum rank
seen so far for the substream with index i.
• A cardinality formula is then computed over the registers to approximate the cardinality
of the multiset.
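The steps above can be sketched as follows. This is a simplified raw estimator only: the small- and large-range corrections and the bias correction from the HyperLogLog++ paper are omitted, and SHA-1 stands in for the uniform hash function h:

```python
import hashlib

def hll_estimate(items, p=10):
    """Simplified HyperLogLog: first p bits of a 64-bit hash pick a
    register, the rank of the remaining bits is tracked per register."""
    m = 1 << p                 # number of registers / substreams
    M = [0] * m
    for item in items:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        bucket = h >> (64 - p)               # first p bits choose a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64-p bits
        rank = (64 - p) - rest.bit_length() + 1  # leading zeros + 1
        M[bucket] = max(M[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)         # bias constant for large m
    # raw estimate: normalized harmonic mean of 2^M[i]
    return alpha * m * m / sum(2.0 ** -r for r in M)

est = hll_estimate(range(100_000), p=10)
print(est)  # close to 100000; typical error is about 1.04/sqrt(m) ~ 3%
```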
Example
Application
• PFCount in Redis
• Counting unique visitors to a website,...
MinHash
Similarity
Properties
• Compute a “signature” for each set, so that similar documents have similar
signatures (and dissimilar docs are unlikely to have similar signatures)
• Trade-off: length of signature vs. accuracy
How does it work
For each row r = 0, 1, …, N-1 of the characteristic matrix:
  1. Compute h1(r), h2(r), …, hn(r)
  2. For each column c:
     a. If column c has 0 in row r, do nothing
     b. Otherwise, for each i = 1, 2, …, n set SIG(i, c) to min(hi(r), SIG(i, c))
Where r is the row, c the column, and i the index of the hash function.
Note: in practice we only need to iterate over the non-zero
elements.
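The row scan above is equivalent to taking, per hash function, the minimum hash value over each set's elements. A sketch of that formulation, with affine hashes (a·x + b) mod p standing in for random permutations (a standard approximation, and an assumption here):

```python
import random

def minhash_signatures(sets, num_hashes=64, seed=42):
    """MinHash signatures for sets of integer elements. Each 'hash
    function' is (a*x + b) mod prime with random a, b."""
    prime = 2_147_483_647  # Mersenne prime larger than the element universe
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return {name: [min((a * x + b) % prime for x in elems)
                   for a, b in params]
            for name, elems in sets.items()}

def estimated_jaccard(sig1, sig2):
    # fraction of signature positions where the two sets agree
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

sigs = minhash_signatures({"A": {1, 2, 3, 4}, "B": {2, 3, 4, 5}})
sim = estimated_jaccard(sigs["A"], sigs["B"])
print(sim)  # estimate of the true Jaccard similarity 3/5 = 0.6
```

Longer signatures tighten the estimate; this is the length-vs-accuracy trade-off from the Properties slide.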
Problem with MinHash
• Assume that we construct a 1,000-byte MinHash signature
for each document.
• A million documents then fit into 1 gigabyte of RAM.
But how much does it cost to find the nearest neighbor
of a document?
• Brute force: ½·N(N-1) comparisons.
• We need a way to reduce the number of comparisons.
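Working the arithmetic for the million-document case from the slide:

```python
# Brute-force nearest-neighbor search compares every pair once.
N = 1_000_000
comparisons = N * (N - 1) // 2
print(comparisons)  # 499999500000, i.e. about 5 * 10**11 comparisons
```

Even at a microsecond per comparison that is nearly six days of work, which is what motivates LSH below.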
Locality sensitive Hashing
Similarity
Properties
• Idea:
• Starting from the MinHash signature matrix, divide its rows into b bands
of r rows each; hash the columns in each band with a basic hash
function, so each band gets its own set of buckets (i.e. one hash table
per band)
• If sets S and T have the same values in a band, they will be
hashed into the same bucket in that band
• For nearest-neighbor search, the candidates are the items that fall in
the same bucket as the query item, in any band
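A sketch of the banding scheme, assuming 64-entry MinHash signatures split into b = 16 bands of r = 4 rows, with Python's built-in `hash` as the basic hash function (all parameter choices are illustrative):

```python
from collections import defaultdict

def lsh_candidates(signatures, bands=16, rows=4):
    """Bucket each band of each signature; return the pairs of names
    that share a bucket in at least one band (the candidate pairs)."""
    assert all(len(sig) == bands * rows for sig in signatures.values())
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[hash(band)].append(name)
        for names in buckets.values():
            for i in range(len(names)):
                for j in range(i + 1, len(names)):
                    candidates.add(frozenset((names[i], names[j])))
    return candidates

# toy signatures: S and T agree in every band, U differs everywhere
sigs = {"S": [1, 2, 3, 4] * 16, "T": [1, 2, 3, 4] * 16, "U": [9, 9, 9, 9] * 16}
pairs = lsh_candidates(sigs)
print(pairs)  # only {S, T} becomes a candidate pair
```

Only candidate pairs need the full signature comparison, which is how LSH avoids the ½·N(N-1) brute-force cost.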
Application
• Finding duplicate pages on the web
• Retrieving images
• Retrieving music
References
1. Probabilistic data structures series - Andrii Gakhov
2. Cuckoo Filter: Practically Better Than Bloom - Bin Fan, David G. Andersen, Michael
Kaminsky, Michael D. Mitzenmacher
3. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality
Estimation Algorithm - Stefan Heule, Marc Nunkesser, Alexander Hall
4. MinHash & LSH slides
Thank you for watching

Editor's Notes

  • #4 Membership: to determine whether an element belongs to a large set of elements. Frequency: to estimate the number of times an element occurs in a set. Cardinality: to determine the number of distinct elements. Similarity: to find clusters of similar documents, or duplicates of a document, in a document set.