PROBABILISTIC
DATA STRUCTURES
Thinh Dang-An
Definitions
• Data structure
• It is a ‘structure’ that holds ‘data’, allowing you to extract
information
• Probabilistic
• Query may return a wrong answer
• The answer is ‘good enough’
• Uses a fraction of the resources, e.g. memory or CPU cycles
Four types:
• Membership
• Bloom Filter
• Cuckoo Filters
• Cardinality
• Linear Counting
• LogLog, SuperLogLog,
HyperLogLog, HyperLogLog++
• Frequency
• Count-Min Sketch
• Majority Algorithm
• Misra-Gries Algorithm
• Similarity
• Locality-Sensitive Hashing (LSH)
• MinHash
• SimHash
Bloom Filter
Membership
Properties
• It tells us that the element either definitely is not in
the set or may be in the set.
• Bloom filters are called filters because they are often
used as a cheap first pass to filter out segments of a
dataset that do not match a query.
How does it work
• A Bloom filter is a bit array of m bits, all set to 0 at the beginning
• To insert an element into the filter, compute all k hash functions for the element
and set the bits at the corresponding indices
• To test whether an element is in the filter, compute all k hash functions for the
element and check the bits at all corresponding indices:
• if all bits are set, the answer is “maybe”
• if at least one bit isn’t set, the answer is “definitely not”
• The time needed to insert or test an element is a fixed constant O(k), independent
of the number of items already in the filter
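The steps above can be sketched in Python. The double-hashing scheme (h1 + i·h2) used to derive the k indices from one digest is a common trick but an assumption here, as are the parameter values:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array and k hash functions.
    The k indices are derived from one SHA-256 digest via double hashing
    (h1 + i*h2), an assumption for illustration."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # "maybe" if every bit is set; "definitely not" otherwise
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("apple")
print(bf.might_contain("apple"))   # True ("maybe")
print(bf.might_contain("banana"))  # almost certainly False
```

Both `add` and `might_contain` touch exactly k bits, which is where the O(k) insert/test cost comes from.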
Application
• Google BigTable, Apache HBase and Apache Cassandra use Bloom filters to
reduce the disk lookups for non-existent rows or columns
• Medium uses Bloom filters to avoid recommending articles a user has previously
read
• Google Chrome web browser used to use a Bloom filter to identify malicious URLs
(moved to PrefixSet, Issue 71832)
• The Squid Web Proxy Cache uses Bloom filters for cache digests
Cuckoo Filters
Membership
Properties
• Practically better than a Bloom filter
• Supports adding and removing items dynamically
• Provides higher lookup performance
• Cuckoo hashing resolves collisions by evicting existing items and rehashing
them to their alternate bucket
How does it work
• Parameters of the Filter:
• 1. Two Hash Functions: h1 and h2
• 2. An array B with n buckets. The i-th bucket will be called B[i]
• Input: L, a list of elements to be inserted into the cuckoo filter.
How does it work
While L is not empty:
Let x be the first item in the list L. Remove x from the list.
If B[h1(x)] is empty:
place x in B[h1(x)]
Else, if B[h2(x)] is empty:
place x in B[h2(x)]
Else:
Let y be the element in B[h2(x)].
Prepend y to L
place x in B[h2(x)]
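The insertion loop above can be sketched in Python. The toy hash functions h1/h2 and the `max_kicks` bound (added so that a cycle of evictions terminates instead of looping forever) are assumptions, not part of the slide's pseudocode:

```python
def cuckoo_insert(items, n, h1, h2, max_kicks=50):
    """Cuckoo-hashing insertion: try h1's bucket, then h2's bucket,
    otherwise evict the occupant of h2's bucket and reinsert it."""
    B = [None] * n
    for x in items:
        for _ in range(max_kicks):
            if B[h1(x)] is None:
                B[h1(x)] = x
                break
            if B[h2(x)] is None:
                B[h2(x)] = x
                break
            # evict ("kick") the occupant of B[h2(x)] and retry it
            i = h2(x)
            x, B[i] = B[i], x
        else:
            raise RuntimeError("table too full; a real filter would resize")
    return B

# toy hash functions (assumptions, chosen so one eviction occurs)
n = 8
h1 = lambda x: x % n
h2 = lambda x: (x // 3) % n
table = cuckoo_insert([8, 6, 0], n, h1, h2)
# inserting 0 evicts 8 from bucket 0; 8 then moves to its h2 bucket 2
print(table)  # [0, None, 8, None, None, None, 6, None]
```

Note that a real cuckoo filter stores short fingerprints of items rather than the items themselves; this sketch shows only the collision-resolution mechanism.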
What if a cuckoo filter uses more than two
hash functions?
• It isn't necessary, because:
• Extra hash functions add implementation and lookup cost without
bringing a clear benefit.
• Overloaded buckets are better handled by storing more than one
entry per bucket, at the cost of extra space.
COMPARISON WITH BLOOM FILTER
• Space Efficiency
• Number of Memory Accesses
• Value Association
• Maximum Capacity
Count Min Sketch
Frequency
Properties
• Estimates can only over-count, never under-count.
• The time needed to add an element or return its frequency is a fixed
constant O(k), assuming that every hash function can be
evaluated in constant time.
How does it work
• Use multiple arrays with different hash functions to compute
the index.
• When queried, return the minimum of the counters across the arrays.
→ Count-Min Sketch
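A minimal sketch of this scheme, assuming md5 with a per-row salt as the hash family (any pairwise-independent family works):

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: d rows of w counters, one hash function per row.
    Hash functions are derived from md5 with a row salt (an assumption)."""

    def __init__(self, w=256, d=4):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _index(self, row, item):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # collisions can only inflate counters, so taking the minimum
        # across rows yields an estimate that never under-counts
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.d))

cms = CountMinSketch()
for word in ["a", "b", "a", "a", "c"]:
    cms.add(word)
print(cms.estimate("a"))  # >= 3 (exactly 3 unless every row collides)
```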
• AT&T has used Count-Min Sketch in network switches to perform analyses on
network traffic using limited memory
• At Google, a precursor of the count-min sketch (called the “count sketch”) has
been implemented on top of their MapReduce parallel processing infrastructure
• Implemented as a part of Algebird library from Twitter
HyperLogLog
Cardinality
Properties
• HyperLogLog is described by 2 parameters:
• p – number of bits that determine a bucket to use
averaging (m = 2^p is the number of buckets/substreams)
• h - hash function, that produces uniform hash values
• The HyperLogLog algorithm is able to estimate cardinalities of
> 10^9 with a typical error rate of 2%
• Observe the maximum number of leading zeros across all
hash values.
How does it work
• Stochastic averaging is used to reduce the large variability:
• The input stream of data elements S is divided into m substreams S(i) using the first p
bits of the hash values (m = 2^p)
• In each substream, the rank (after the initial p bits that are used for substreaming) is
measured independently.
• These numbers are kept in an array of registers M, where M[i] stores the maximum rank
seen so far for the substream with index i.
• A cardinality formula is then computed over the registers to approximate the cardinality
of the multiset.
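The steps above can be sketched as follows. This is a simplified raw estimator only: the small- and large-range corrections and the bias correction from the HyperLogLog++ paper are omitted, and SHA-1 stands in for the uniform hash function h:

```python
import hashlib

def hll_estimate(items, p=10):
    """Simplified HyperLogLog: first p bits of a 64-bit hash pick a
    register, the rank of the remaining bits is tracked per register."""
    m = 1 << p                 # number of registers / substreams
    M = [0] * m
    for item in items:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        bucket = h >> (64 - p)               # first p bits choose a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64-p bits
        rank = (64 - p) - rest.bit_length() + 1  # leading zeros + 1
        M[bucket] = max(M[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)         # bias constant for large m
    # raw estimate: normalized harmonic mean of 2^M[i]
    return alpha * m * m / sum(2.0 ** -r for r in M)

est = hll_estimate(range(100_000), p=10)
print(est)  # close to 100000; typical error is about 1.04/sqrt(m) ~ 3%
```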
Example
Application
• PFCount in Redis
• Counting unique visitors to a website,...
MinHash
Similarity
Properties
• Compute a “signature” for each set, so that similar documents have similar
signatures (and dissimilar docs are unlikely to have similar signatures)
• Trade-off: length of signature vs. accuracy
How does it work
For each row r = 0, 1, …, N-1 of the characteristic matrix:
  1. Compute h1(r), h2(r), …, hn(r)
  2. For each column c:
     a. If column c has 0 in row r, do nothing
     b. Otherwise, for each i = 1, 2, …, n set SIG(i, c) to min(hi(r), SIG(i, c))
Where r is the row, c the column, and i the index of the hash function.
Note: in practice we only need to iterate over the non-zero
elements.
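The row scan above is equivalent to taking, per hash function, the minimum hash value over each set's elements. A sketch of that formulation, with affine hashes (a·x + b) mod p standing in for random permutations (a standard approximation, and an assumption here):

```python
import random

def minhash_signatures(sets, num_hashes=64, seed=42):
    """MinHash signatures for sets of integer elements. Each 'hash
    function' is (a*x + b) mod prime with random a, b."""
    prime = 2_147_483_647  # Mersenne prime larger than the element universe
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return {name: [min((a * x + b) % prime for x in elems)
                   for a, b in params]
            for name, elems in sets.items()}

def estimated_jaccard(sig1, sig2):
    # fraction of signature positions where the two sets agree
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

sigs = minhash_signatures({"A": {1, 2, 3, 4}, "B": {2, 3, 4, 5}})
sim = estimated_jaccard(sigs["A"], sigs["B"])
print(sim)  # estimate of the true Jaccard similarity 3/5 = 0.6
```

Longer signatures tighten the estimate; this is the length-vs-accuracy trade-off from the Properties slide.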
Problem with MinHash
• Assume that we construct a 1,000-byte MinHash signature
for each document.
• A million documents then fit into 1 gigabyte of RAM.
But how much does it cost to find the nearest neighbor
of a document?
• Brute force: ½·N(N-1) comparisons.
• We need a way to reduce the number of comparisons.
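Working the arithmetic for the million-document case from the slide:

```python
# Brute-force nearest-neighbor search compares every pair once.
N = 1_000_000
comparisons = N * (N - 1) // 2
print(comparisons)  # 499999500000, i.e. about 5 * 10**11 comparisons
```

Even at a microsecond per comparison that is nearly six days of work, which is what motivates LSH below.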
Locality sensitive Hashing
Similarity
Properties
• Idea:
• Starting from the MinHash signature matrix, divide its rows into b bands
of r rows each; hash the columns in each band with a basic hash
function, so each band gets its own set of buckets (i.e. one hash table
per band)
• If sets S and T have the same values in a band, they will be
hashed into the same bucket in that band
• For nearest-neighbor search, the candidates are the items that fall in
the same bucket as the query item, in any band
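A sketch of the banding scheme, assuming 64-entry MinHash signatures split into b = 16 bands of r = 4 rows, with Python's built-in `hash` as the basic hash function (all parameter choices are illustrative):

```python
from collections import defaultdict

def lsh_candidates(signatures, bands=16, rows=4):
    """Bucket each band of each signature; return the pairs of names
    that share a bucket in at least one band (the candidate pairs)."""
    assert all(len(sig) == bands * rows for sig in signatures.values())
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[hash(band)].append(name)
        for names in buckets.values():
            for i in range(len(names)):
                for j in range(i + 1, len(names)):
                    candidates.add(frozenset((names[i], names[j])))
    return candidates

# toy signatures: S and T agree in every band, U differs everywhere
sigs = {"S": [1, 2, 3, 4] * 16, "T": [1, 2, 3, 4] * 16, "U": [9, 9, 9, 9] * 16}
pairs = lsh_candidates(sigs)
print(pairs)  # only {S, T} becomes a candidate pair
```

Only candidate pairs need the full signature comparison, which is how LSH avoids the ½·N(N-1) brute-force cost.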
Application
• Finding duplicate pages on the web
• Retrieving images
• Retrieving music
References
1. Probabilistic data structures series - Andrii Gakhov
2. Cuckoo Filter: Practically Better Than Bloom - Bin Fan, David G. Andersen, Michael
Kaminsky, Michael D. Mitzenmacher
3. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality
Estimation Algorithm - Stefan Heule, Marc Nunkesser, Alexander Hall
4. MinHash & LSH slides
Thank you for watching

Editor's Notes

  • #4 Membership: to determine whether an element belongs to a large set of elements. Frequency: to estimate the number of times an element occurs in a set. Cardinality: to determine the number of distinct elements. Similarity: to find clusters of similar documents, or duplicates of a document, in a document set.