A Survey of Probabilistic Data Structures - StampedeCon 2012

PROBABILISTIC DATA STRUCTURES
Jim Duey
Lonocloud.com
@jimduey
http://clojure.net

Wednesday, August 1, 2012

WHAT IS A DATA STRUCTURE?

It is a ‘structure’ that holds ‘data’, allowing you to extract
information.

Data gets added to the structure.

Queries of various sorts are used to extract information.


INSPIRATION

Ilya Katsov

https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/


WORD OF CAUTION

Many probabilistic data structures use hashing

Java’s hashCode is not safe across multiple processes

“Java's hashCode is not safe for distributed systems”

http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html
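
One way to sidestep the caution above is to hash the bytes of a key with a fixed algorithm instead of calling hashCode(). The following is a minimal sketch under that assumption; the class name StableHash and its methods are hypothetical, and SHA-256 is just one convenient choice, not something the talk prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: a hash that depends only on the key's bytes.
class StableHash {
    // Hash the UTF-8 bytes of the key with SHA-256 and fold the first
    // 8 digest bytes into a long.
    static long hash64(String key) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0L;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    // Map a key to a position in [0, m); floorMod handles negative hashes.
    static int position(String key, int m) {
        return (int) Math.floorMod(hash64(key), (long) m);
    }
}
```

Because the result is computed from the key's bytes alone, two processes that call StableHash.position("thing", 1024) should agree on the position. A cryptographic digest is slower than a purpose-built non-cryptographic hash, so treat this as an illustration of the idea rather than a tuned choice.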


PROBABILISTIC

Query may return a wrong answer

The answer is ‘good enough’

Uses a fraction of the resources, i.e. memory or CPU cycles


HOW MANY ITEMS?

If you have a large collection of ‘things’ ...

And there are some duplicates ...

And you want to know how many unique things there are.


LINEAR COUNTING
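import java.util.BitSet;

class LinearCounter {
    BitSet mask = new BitSet(m); // m is a design parameter

    void add(Object value) {
        // get an index for value in the range 0 .. m - 1;
        // floorMod avoids a negative index when hashCode() is negative
        int position = Math.floorMod(value.hashCode(), m);

        mask.set(position);
    }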


LINEAR COUNTING
[Figure: add() hashes each of Thing 1 to Thing 4 to a position in a 12-bit mask and sets that bit; three bits end up set.]


LINEAR COUNTING
...
}
Question: How big is m?


LINEAR COUNTING
Load factor = n / m

n  Number of unique items expected

m  Size of bit mask

If the load factor is < 1: few collisions, so the number of
bits set is approximately the cardinality.


LINEAR COUNTING
Load factor = n / m

n  Number of unique items expected

m  Size of bit mask

If the load factor is very high (e.g. 100): all bits are set, so there is no
information about cardinality.


LINEAR COUNTING
Load factor = n / m

n  Number of unique items expected

m  Size of bit mask

If the load factor is higher than 1, but not too high:
many collisions, but some relationship might still exist
between the number of bits set and the cardinality.


LINEAR COUNTING

Finding the number of members of the collection

n = - m * ln ((m - w) / m)
n is the estimated number of unique members
m is the size of the bit map
w is the number of 1s in the bit map (the cardinality of the bit set itself)
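
The estimate above can be sketched as a small class. This is a minimal illustration, not the speaker's code: the name LinearCounterSketch is hypothetical, hashCode() is kept from the slide (so the earlier word of caution applies), and the saturated case is handled by returning infinity, which is an assumption made here.

```java
import java.util.BitSet;

// Hypothetical sketch of a linear counter with the estimate formula.
class LinearCounterSketch {
    private final int m;        // size of the bit map (design parameter)
    private final BitSet mask;

    LinearCounterSketch(int m) {
        this.m = m;
        this.mask = new BitSet(m);
    }

    void add(Object value) {
        // floorMod keeps the index in [0, m) even for negative hash codes.
        mask.set(Math.floorMod(value.hashCode(), m));
    }

    // n = -m * ln((m - w) / m), where w is the number of 1 bits.
    double estimate() {
        int w = mask.cardinality();   // BitSet.cardinality() counts set bits
        if (w == m) {
            // Every bit is set: the mask carries no information.
            return Double.POSITIVE_INFINITY;
        }
        return -m * Math.log((m - w) / (double) m);
    }
}
```

Adding the same value twice sets the same bit, so duplicates do not change the estimate. The quality of the estimate depends on how evenly the hash spreads values over the mask.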


LINEAR COUNTING
...
}
Question: How big is m?
m > max(5, 1 / (std-err * n / m) ** 2) * (e ** (n / m) - n / m - 1)

std-err is the acceptable standard error of the estimate.

For a standard error of 0.01:
On the order of 1M unique values, m = 154 Kbit, n/m = 6.5
On the order of 10M unique values, m = 1.1 Mbit, n/m = 9
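
Because m appears on both sides of the inequality, it has to be solved numerically. The sketch below searches for the smallest m that satisfies it; the class and method names are hypothetical, and the bisection relies on the assumption that the right-hand side shrinks as m grows.

```java
// Hypothetical helper for sizing a linear counter's bit map.
class LinearCounterSizing {
    // Right-hand side of the bound, with load factor t = n / m.
    static double bound(double n, double m, double stdErr) {
        double t = n / m;
        double a = Math.max(5.0, 1.0 / Math.pow(stdErr * t, 2));
        return a * (Math.exp(t) - t - 1.0);
    }

    // Smallest m (in bits) with m > bound(n, m, stdErr).
    static long minBits(long n, double stdErr) {
        long hi = 1;
        while (hi <= bound(n, hi, stdErr)) {
            hi *= 2;                    // grow until the inequality holds
        }
        long lo = hi / 2;               // inequality fails here (or lo == 0)
        while (hi - lo > 1) {
            long mid = lo + (hi - lo) / 2;
            if (mid > bound(n, mid, stdErr)) {
                hi = mid;
            } else {
                lo = mid;
            }
        }
        return hi;
    }
}
```

For n = 1,000,000 and a standard error of 0.01 this should land close to the 154 Kbit figure quoted above, and for n = 10,000,000 close to 1.1 Mbit.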


LINEAR COUNTING

“Linear-Time Probabilistic Counting Algorithm for
Database Applications”

Use the table in the paper to find the bit map size.

Check out Ilya’s blog post for some nice graphs.


LINEAR COUNTING
[Figure: the 12-bit mask after adding Thing 1 to Thing 4; three bits are set.]


[Figure: the same 12-bit mask with Thing 1 to Thing 4; seven bits are now set.]


BLOOM FILTER

If you have a large collection of ‘things’ ...

And you want to know if some thing is in the collection.


BLOOM FILTER
[Figure: Thing 1 to Thing 4 added to a 12-bit filter; seven bits are set.]


BLOOM FILTER
[Figure: querying ‘Other thing’ against the same 12-bit filter.]


BLOOM FILTER
[Figure: querying ‘Missing thing’ against the same 12-bit filter.]


BLOOM FILTER
How big to make ‘m’ and ‘k’?

‘m’ is the number of bits in the filter

‘k’ is the number of separate hash functions

m = - (n * ln p) / (ln 2) ** 2

n is the number of distinct items to be stored
p is the probability of a false positive

For n = 1M and p = 0.01:
m = - (1M * ln 0.01) / (ln 2) ** 2 = 9.6 Mbits = 1.2 MB


BLOOM FILTER
How big to make ‘m’ and ‘k’?

‘m’ is the number of bits in the filter

‘k’ is the number of separate hash functions

m = - (n * ln p) / (ln 2) ** 2
k = (m / n) * ln 2

k = (9.6M / 1M) * 0.69 ≈ 6.64, rounded to 7 hash functions
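
The two sizing formulas can be checked with a few lines of code. This is a minimal sketch; the class and method names are hypothetical, and rounding m up to whole bits and k to the nearest integer are choices made here.

```java
// Hypothetical helper that evaluates the Bloom filter sizing formulas.
class BloomSizing {
    // m = -(n * ln p) / (ln 2) ** 2, rounded up to whole bits.
    static long bits(long n, double p) {
        double ln2 = Math.log(2.0);
        return (long) Math.ceil(-(n * Math.log(p)) / (ln2 * ln2));
    }

    // k = (m / n) * ln 2, rounded to the nearest integer, at least 1.
    static int hashFunctions(long m, long n) {
        long k = Math.round((double) m / n * Math.log(2.0));
        return (int) Math.max(1L, k);
    }
}
```

With n = 1,000,000 and p = 0.01, bits(...) comes out near 9.6 million and hashFunctions(...) returns 7, matching the worked numbers on the slide.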


BLOOM FILTER

You can’t query a Bloom filter for cardinality

You can’t remove an item once it’s been added

There are many variants of the Bloom filter, some of which address these
issues


HASH FUNCTIONS
How to find many hash functions?

“Out of one, many”

Make the size of your bit mask a power of 2

By masking off bit fields, you can get multiple hash values
from a single hash function.

A 16-bit hash value covers a 65,536-bit (2 ** 16) index

A 512-bit hash gives 32 separate 16-bit hash values
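
The bit-field idea can be sketched as a small Bloom filter that derives all of its indexes from one digest. Everything named here is hypothetical: the class SlicedBloomFilter, the fixed 65,536-bit size, and the use of SHA-512 as the 512-bit hash. SHA-512 ships with common JDKs but is not on the list of algorithms every Java platform must support, so treat its availability as an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

// Hypothetical Bloom filter that slices one 512-bit digest into 16-bit indexes.
class SlicedBloomFilter {
    static final int M = 1 << 16;           // 65,536 bits, a power of 2
    private final int k;                    // indexes used per item (1..32)
    private final BitSet bits = new BitSet(M);

    SlicedBloomFilter(int k) {
        if (k < 1 || k > 32) {
            throw new IllegalArgumentException("k must be in 1..32");
        }
        this.k = k;
    }

    // Take the first k 16-bit fields of the digest; each one is an index.
    static int[] indexes(String item, int k) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-512")
                    .digest(item.getBytes(StandardCharsets.UTF_8));
            int[] out = new int[k];
            for (int i = 0; i < k; i++) {
                out[i] = ((d[2 * i] & 0xFF) << 8) | (d[2 * i + 1] & 0xFF);
            }
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    void add(String item) {
        for (int i : indexes(item, k)) {
            bits.set(i);
        }
    }

    // false means definitely absent; true means possibly present.
    boolean mightContain(String item) {
        for (int i : indexes(item, k)) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }
}
```

An item that was added always reports true, since the same digest yields the same indexes. An item that was never added usually reports false, but can report true when all of its bits happen to be set by other items.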


COUNT-MIN SKETCH
When you want to know how many of each item there are in a
collection.


COUNT-MIN SKETCH
[Figure: a grid of counters, w columns wide and d rows deep; adding ‘Thing 1’ adds +1 to one counter in each row.]

Each box is a counter.
Each row is indexed by a corresponding hash function.


COUNT-MIN SKETCH
[Figure: the same w by d grid; querying ‘Some thing’ reads one counter per row, with values a, b, c and d.]

Estimated frequency for ‘Some thing’ is min(a, b, c, d).


COUNT-MIN SKETCH
How big to make ‘w’ and ‘d’?

‘w’ is the number of counters per hash function;
it limits the magnitude of the error

‘d’ is the number of separate hash functions;
it controls the probability that the estimation error is greater than
that limit


COUNT-MIN SKETCH

error-limit <= 2 * n / w
probability the error stays within the limit >= 1 - (1 / 2) ** d

n = total number of items counted
w = number of counters per hash function
d = number of separate hash functions

Works best on skewed data.
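
The structure described on these slides can be sketched as a d by w array of counters. This is a minimal illustration: the class name, the use of hashCode() as the base hash (see the earlier word of caution), and the murmur3-style bit mixing used to derive one hash per row are all assumptions made here, not part of the talk.

```java
// Hypothetical Count-Min sketch: d rows of w counters each.
class CountMinSketch {
    private final int w;            // counters per hash function
    private final int d;            // number of hash functions (rows)
    private final long[][] counts;

    CountMinSketch(int w, int d) {
        this.w = w;
        this.d = d;
        this.counts = new long[d][w];
    }

    // Derive a per-row hash by mixing the item's hashCode with the row number.
    private int index(Object item, int row) {
        int h = item.hashCode() ^ (row * 0x9E3779B9);
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return Math.floorMod(h, w);
    }

    // Add one occurrence: +1 to one counter in each row.
    void add(Object item) {
        for (int row = 0; row < d; row++) {
            counts[row][index(item, row)]++;
        }
    }

    // Estimated frequency: the minimum of the item's counters.
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < d; row++) {
            min = Math.min(min, counts[row][index(item, row)]);
        }
        return min;
    }
}
```

Collisions can only add to a counter, so the estimate is never below the true count. Taking the minimum across rows picks the counter with the least collision noise.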


RESOURCES
https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

http://blog.aggregateknowledge.com/

http://lkozma.net/blog/sketching-data-structures/

https://sites.google.com/site/countminsketch/home

“PyCon 2011: Handling ridiculous amounts of data with
probabilistic data structures”

