Probabilistic
Data Structures
By
Shrinivas Vasala
Outline
• Basic Idea
• Hyperloglog
• Bloom filters
• Count-min sketches
• References & further reading
Basic Idea
• Problem: you have a lot of data to count,
track or otherwise analyze
• Often an approximate answer is sufficient,
especially if you can place bounds on how
wrong the approximation is likely to be
• I/O is typically the most expensive operation, so compact in-memory
summaries pay off
Hyperloglog
• Originally described by Flajolet and colleagues in 2007
• Can estimate cardinalities well beyond 10^9 with a relative accuracy
(standard error) of 2% while using only 1.5 kB of memory
• Hashing turns any input into a value with a uniform distribution
• If the maximum number of leading zeros observed is ‘n’, an estimate
for the number of distinct elements in the set is 2^n
• Tuning Precision : set is split into multiple subsets
• Harmonic mean + low/high-range sampling adjustments → final result
• Increasing the number of bits of your hash increases the highest
possible number you can accurately approximate
• Commonly used hashing algorithm: Murmur
• Implementation in Redis
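The core idea above — hash, count leading zeros, estimate 2^n — can be sketched as a crude single-register estimator in Python. All names here (`hash32`, `leading_zeros`, `crude_estimate`) are illustrative; a real HyperLogLog splits the stream into many subsets and combines them with a harmonic mean, and typically uses Murmur rather than MD5:

```python
import hashlib

def hash32(item: str) -> int:
    # Hashing maps arbitrary input to a near-uniform 32-bit value
    # (MD5 used here only for portability; the slide mentions Murmur)
    return int.from_bytes(hashlib.md5(item.encode()).digest()[:4], "big")

def leading_zeros(x: int, bits: int = 32) -> int:
    # Number of leading zero bits in a fixed-width integer
    return bits if x == 0 else bits - x.bit_length()

def crude_estimate(items) -> int:
    # If the maximum number of leading zeros observed is n,
    # estimate 2^n distinct elements
    n = max(leading_zeros(hash32(i)) for i in items)
    return 2 ** n
```

With a single register the variance is huge; splitting into subsets and averaging (the "tuning precision" bullet) is what brings the standard error down to ~2%.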
Bloom filters
• Conceived by Burton Howard Bloom in 1970
• Used to test whether an element is a member of a set
• Query returns either "possibly in set" or "definitely not in set"
• An empty Bloom filter is a bit array of ‘m’ bits, all set to 0
• ‘k’ hash functions map the set element to one of the m array
positions with a uniform random distribution
• The bits at all these positions are set to 1
• To query, hash with the same hash functions and check whether all the
positions are set: if any bit is 0, the element is definitely not in
the set; if all are 1, it may be present
• For an optimal value of k with 1% error each element requires only
about 9.6 bits — regardless of the size of the elements
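The bullets above can be sketched as a minimal Bloom filter in Python. This is illustrative, not production code: the k hash functions are simulated by salting a single SHA-256 hash with the function index, a common stand-in for k independent uniform hashes:

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m          # empty filter: m bits, all set to 0

    def _positions(self, item: str):
        # Derive k array positions by salting one hash function with i
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1       # set the bit at each position

    def query(self, item: str) -> bool:
        # True  -> possibly in set (all k bits are set)
        # False -> definitely not in set (at least one bit is 0)
        return all(self.bits[pos] for pos in self._positions(item))
```

Note that `query` can return a false positive when unrelated items happen to have set all k bits, but never a false negative.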
1. Choose a ballpark value for no. of elements in the set ‘n’
2. Choose a value for m
3. Calculate the optimal value of k [ k = (m/n) ln 2 ]
4. Calculate the error rate ‘p’ [ p = (1 - e^(-kn/m))^k ]
If p is unacceptable, return to step 2 and change m;
otherwise we're done.
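Steps 3 and 4 above can be wrapped in a small helper (`bloom_params` is a hypothetical name) to iterate on m until p is acceptable:

```python
from math import exp, log

def bloom_params(n: int, m: int):
    # Step 3: optimal number of hash functions, k = (m/n) ln 2
    k = max(1, round((m / n) * log(2)))
    # Step 4: resulting false-positive rate, p = (1 - e^(-kn/m))^k
    p = (1 - exp(-k * n / m)) ** k
    return k, p

# Rule of thumb from the slide: ~9.6 bits per element gives ~1% error
k, p = bloom_params(n=1000, m=9600)
```

At 9.6 bits per element this yields k = 7 and p just under 1%, matching the figure quoted above.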
Count-min sketches
• Invented by Graham Cormode and S. Muthu Muthukrishnan in 2003
• Answers how many times each item occurs in a collection
• A sketch is a compact summary of a large amount of data: here, a 2D
array of w columns and d rows
• Each box is a counter
• Each row is indexed by a
corresponding hash function
• The estimated frequency of an item is the minimum of its d row
counters, e.g. min(a, b, c, d)
• ‘w’ limits the magnitude of the error [ error <= 2n/w ]
• ‘d’ controls the probability that the estimate exceeds that error
bound [ P(error exceeded) <= (1/2)^d ]
• Works best on skewed data
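The 2D counter array can be sketched in Python as follows. Names are illustrative, and each row's hash function is simulated by salting SHA-256 with the row index rather than using d independent hashes:

```python
import hashlib

class CountMinSketch:
    def __init__(self, w: int, d: int):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]   # d rows of w counters

    def _index(self, row: int, item: str) -> int:
        # Each row is indexed by its own (salted) hash function
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.d):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Minimum over the d row counters: never underestimates,
        # may overestimate when other items collide in every row
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.d))
```

The one-sided error is why it suits skewed data: heavy hitters dominate their counters, while rare items absorb most of the collision noise.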
References & further reading
Other Topics
• MinHash is a technique for quickly estimating how similar two sets are
• Quotient filters are AMQs (approximate membership query structures)
and provide many of the same benefits as Bloom filters
• Skip list is a data structure that allows fast search within an ordered sequence of elements
Hyperloglog
1. http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html
2. http://www.slideshare.net/c.titus.brown/2013-py-con-awesome-big-data-algorithms
3. http://en.wikipedia.org/wiki/HyperLogLog
Bloom filters
1. http://billmill.org/bloomfilter-tutorial/
2. http://en.wikipedia.org/wiki/Bloom_filter
Count-min sketches
1. http://www.slideshare.net/StampedeCon/a-survey-of-probabilistic-data-structures-stampedecon-2012
2. http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
3. https://sites.google.com/site/countminsketch/home
