Probabilistic
Data Structures
By
Shrinivas Vasala
Outline
• Basic Idea
• Hyperloglog
• Bloom filters
• Count-min sketches
• References & further reading
Basic Idea
• Problem: you have a lot of data to count,
track or otherwise analyze
• Often an approximate answer is sufficient,
especially if you can place bounds on how
wrong the approximation is likely to be
• I/O is typically the most expensive operation, so compact in-memory
summaries pay off
Hyperloglog
• Originally described by Flajolet and colleagues in 2007
• Can estimate cardinalities well beyond 10^9 with a relative accuracy
(standard error) of 2% while using only 1.5 kB of memory
• Hashing turns any input into a value with a uniform distribution
• If the maximum number of leading zeros observed is ‘n’, an estimate
for the number of distinct elements in the set is 2^n
• Tuning Precision : set is split into multiple subsets
• Harmonic mean + low/high-range sampling adjustments → final result
• Increasing the number of bits of your hash increases the highest
possible number you can accurately approximate
• Commonly used hashing algorithm: Murmur
• Implementation in Redis
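The core idea above — hash, count leading zeros, estimate 2^n — can be sketched as a crude single-register estimator in Python. All names here (`hash32`, `leading_zeros`, `crude_estimate`) are illustrative; a real HyperLogLog splits the stream into many subsets and combines them with a harmonic mean, and typically uses Murmur rather than MD5:

```python
import hashlib

def hash32(item: str) -> int:
    # Hashing maps arbitrary input to a near-uniform 32-bit value
    # (MD5 used here only for portability; the slide mentions Murmur)
    return int.from_bytes(hashlib.md5(item.encode()).digest()[:4], "big")

def leading_zeros(x: int, bits: int = 32) -> int:
    # Number of leading zero bits in a fixed-width integer
    return bits if x == 0 else bits - x.bit_length()

def crude_estimate(items) -> int:
    # If the maximum number of leading zeros observed is n,
    # estimate 2^n distinct elements
    n = max(leading_zeros(hash32(i)) for i in items)
    return 2 ** n
```

With a single register the variance is huge; splitting into subsets and averaging (the "tuning precision" bullet) is what brings the standard error down to ~2%.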
Bloom filters
• Conceived by Burton Howard Bloom in 1970
• Used to test whether an element is a member of a set
• Query returns either "possibly in set" or "definitely not in set"
• An empty Bloom filter is a bit array of ‘m’ bits, all set to 0
• ‘k’ hash functions map the set element to one of the m array
positions with a uniform random distribution
• The bits at all these positions are set to 1
• To query, hash with the same hash functions and check whether all the
positions are set: if any bit is 0, the element is definitely not in
the set; if all are 1, it may be present
• For an optimal value of k with 1% error each element requires only
about 9.6 bits — regardless of the size of the elements
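The bullets above can be sketched as a minimal Bloom filter in Python. This is illustrative, not production code: the k hash functions are simulated by salting a single SHA-256 hash with the function index, a common stand-in for k independent uniform hashes:

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m          # empty filter: m bits, all set to 0

    def _positions(self, item: str):
        # Derive k array positions by salting one hash function with i
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1       # set the bit at each position

    def query(self, item: str) -> bool:
        # True  -> possibly in set (all k bits are set)
        # False -> definitely not in set (at least one bit is 0)
        return all(self.bits[pos] for pos in self._positions(item))
```

Note that `query` can return a false positive when unrelated items happen to have set all k bits, but never a false negative.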
1. Choose a ballpark value for no. of elements in the set ‘n’
2. Choose a value for m
3. Calculate the optimal value of k [ k = (m/n) ln 2 ]
4. Calculate the error rate ‘p’ [ p = (1 - e^(-kn/m))^k ]
If p is unacceptable, return to step 2 and change m;
otherwise we're done.
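Steps 3 and 4 above can be wrapped in a small helper (`bloom_params` is a hypothetical name) to iterate on m until p is acceptable:

```python
from math import exp, log

def bloom_params(n: int, m: int):
    # Step 3: optimal number of hash functions, k = (m/n) ln 2
    k = max(1, round((m / n) * log(2)))
    # Step 4: resulting false-positive rate, p = (1 - e^(-kn/m))^k
    p = (1 - exp(-k * n / m)) ** k
    return k, p

# Rule of thumb from the slide: ~9.6 bits per element gives ~1% error
k, p = bloom_params(n=1000, m=9600)
```

At 9.6 bits per element this yields k = 7 and p just under 1%, matching the figure quoted above.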
Count-min sketches
• Invented by Graham Cormode and S. Muthu Muthukrishnan in 2003
• Answers how many times each item occurs in a collection
• A sketch is a compact summary of a large amount of data: here, a 2D
array of w columns and d rows
• Each box is a counter
• Each row is indexed by a
corresponding hash function
• The estimated frequency of an item is the minimum of its d row
counters, e.g. min(a, b, c, d)
• ‘w’ limits the magnitude of the error [ error <= 2n/w ]
• ‘d’ controls the probability that the estimate exceeds that error
bound [ P(error exceeded) <= (1/2)^d ]
• Works best on skewed data
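The 2D counter array can be sketched in Python as follows. Names are illustrative, and each row's hash function is simulated by salting SHA-256 with the row index rather than using d independent hashes:

```python
import hashlib

class CountMinSketch:
    def __init__(self, w: int, d: int):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]   # d rows of w counters

    def _index(self, row: int, item: str) -> int:
        # Each row is indexed by its own (salted) hash function
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.d):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Minimum over the d row counters: never underestimates,
        # may overestimate when other items collide in every row
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.d))
```

The one-sided error is why it suits skewed data: heavy hitters dominate their counters, while rare items absorb most of the collision noise.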
References & further reading
Other Topics
• MinHash is a technique for quickly estimating how similar two sets are
• Quotient filters are AMQs (approximate membership query structures)
and provide many of the same benefits as Bloom filters
• Skip list is a data structure that allows fast search within an ordered sequence of elements
Hyperloglog
1. http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html
2. http://www.slideshare.net/c.titus.brown/2013-py-con-awesome-big-data-algorithms
3. http://en.wikipedia.org/wiki/HyperLogLog
Bloom filters
1. http://billmill.org/bloomfilter-tutorial/
2. http://en.wikipedia.org/wiki/Bloom_filter
Count-min sketches
1. http://www.slideshare.net/StampedeCon/a-survey-of-probabilistic-data-structures-stampedecon-2012
2. http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
3. https://sites.google.com/site/countminsketch/home
