At StampedeCon 2012 in St. Louis, Jim Duey of Lonocloud presents: Big data requires big resources which cost big money. But if you only need answers that are good enough, rather than precisely right, probabilistic data structures can be a way to get those answers with a fraction of the resources and cost. In this talk I’ll survey some different data structures, give some theory behind them and point out some use cases.
A Survey of Probabilistic Data Structures - StampedeCon 2012
1. PROBABILISTIC DATA STRUCTURES
Jim Duey
Lonocloud.com
@jimduey
http://clojure.net
Wednesday, August 1, 2012
2. WHAT IS A DATA STRUCTURE?
It is a ‘structure’ that holds ‘data’, allowing you to extract
information.
Data gets added to the structure.
Queries of various sorts are used to extract information.
3. INSPIRATION
Ilya Katsov
https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
4. WORD OF CAUTION
Many probabilistic data structures use hashing
Java’s hashCode is not safe across multiple processes
“Java's hashCode is not safe for distributed systems”
http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html
5. PROBABILISTIC
Query may return a wrong answer
The answer is ‘good enough’
Uses a fraction of the resources, e.g. memory or CPU cycles
6. HOW MANY ITEMS?
If you have a large collection of ‘things’ ...
And there are some duplicates ...
And you want to know how many unique things there are.
7. LINEAR COUNTING
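class LinearCounter {
  BitSet mask = new BitSet(m); // m is a design parameter
  void add(Object value) {
    // get an index for value in the range 0 .. m-1
    // (floorMod keeps the index non-negative when hashCode() is negative)
    int position = Math.floorMod(value.hashCode(), m);
    mask.set(position);
  }
}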
9. LINEAR COUNTING
class LinearCounter {
BitSet mask = new BitSet(m); // m is a design parameter
...
}
Question: How big is m ?
10. LINEAR COUNTING
Load factor = n / m
n: number of unique items expected
m: size of bit mask
If the load factor is < 1: few collisions, so the number of
bits set is approximately the cardinality.
11. LINEAR COUNTING
Load factor = n / m
n: number of unique items expected
m: size of bit mask
If the load factor is very high (say, 100): all bits are set, so
there is no information about cardinality.
12. LINEAR COUNTING
Load factor = n / m
n: number of unique items expected
m: size of bit mask
If the load factor is higher than 1, but not too high:
many collisions, but some relationship might exist
between the number of bits set and the cardinality.
13. LINEAR COUNTING
Finding the number of members of the collection
n = - m * ln ((m - w) / m)
m is the size of the bit map
w is the number of 1s in the bit map (the BitSet's cardinality)
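The add step and the estimate formula can be put together in a minimal sketch. The class name LinearCounterSketch, the constructor, and the estimate() method are illustrative names, not from the slides; the saturated-mask case (every bit set) is handled by returning infinity, which is an assumption about how one might report "no information".

```java
import java.util.BitSet;

class LinearCounterSketch {
    private final int m;        // size of the bit mask (design parameter)
    private final BitSet mask;

    LinearCounterSketch(int m) {
        this.m = m;
        this.mask = new BitSet(m);
    }

    void add(Object value) {
        // floorMod keeps the index in 0 .. m-1 even when hashCode() is negative
        mask.set(Math.floorMod(value.hashCode(), m));
    }

    // n = -m * ln((m - w) / m), where w is the number of 1s in the mask
    double estimate() {
        int w = mask.cardinality();
        if (w == m) {
            return Double.POSITIVE_INFINITY; // every bit set: nothing can be inferred
        }
        return -m * Math.log((m - w) / (double) m);
    }
}
```

Adding the same value twice sets the same bit, so duplicates do not change the estimate.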
14. LINEAR COUNTING
class LinearCounter {
BitSet mask = new BitSet(m); // m is a design parameter
...
}
Question: How big is m ?
m > max(5, 1 / (std-err * n / m) ** 2) * (e ** (n / m) - n / m - 1)
For a standard error of 0.01:
On the order of 1M unique values, m = 154 Kbit, n/m = 6.5
On the order of 10M unique values, m = 1.1 Mbit, n/m = 9
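Because m appears on both sides of the inequality, it has to be solved numerically. The following is a rough sketch, not the method from the paper: the names LinearCounterSizing, bigEnough and smallestM are hypothetical, and the search assumes the right-hand side shrinks as m grows.

```java
class LinearCounterSizing {
    // True when m > max(5, 1 / (stdErr * t)^2) * (e^t - t - 1), with t = n / m.
    static boolean bigEnough(long m, long n, double stdErr) {
        double t = (double) n / m;
        double factor = Math.max(5.0, 1.0 / Math.pow(stdErr * t, 2));
        return m > factor * (Math.exp(t) - t - 1.0);
    }

    // Doubles m until the bound holds, then binary-searches for the
    // smallest m that satisfies it.
    static long smallestM(long n, double stdErr) {
        long hi = 1;
        while (!bigEnough(hi, n, stdErr)) {
            hi *= 2;
        }
        long lo = hi / 2; // lo fails the bound (or is 0); hi passes it
        while (hi - lo > 1) {
            long mid = lo + (hi - lo) / 2;
            if (bigEnough(mid, n, stdErr)) {
                hi = mid;
            } else {
                lo = mid;
            }
        }
        return hi;
    }
}
```

For n = 1,000,000 and a standard error of 0.01 the search lands between 150,000 and 160,000 bits, in line with the 154 Kbit figure on the slide.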
15. LINEAR COUNTING
“Linear-Time Probabilistic Counting Algorithm for
Database Applications”
Use table to find bit map size.
Check out Ilya’s blog post for some nice graphs.
22. BLOOM FILTER
How big to make ‘m’ and ‘k’?
‘m’ is the number of bits in the filter
‘k’ is the number of separate hash functions
m = - (n * ln p) / (ln 2) ** 2
n is the number of distinct items to be stored
p is the probability of a false positive
m = - (1M * ln .01) / (ln 2) ** 2 = 9.6 Mbits = 1.2 MB
23. BLOOM FILTER
How big to make ‘m’ and ‘k’?
‘m’ is the number of bits in the filter
‘k’ is the number of separate hash functions
m = - (n * ln p) / (ln 2) ** 2
k = m / n * ln 2
k = 9.6M / 1M * 0.69 = 6.64, rounded up to 7 hash functions
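The two sizing formulas translate directly into code. This is a small sketch with hypothetical names (BloomSizing, bits, hashes); rounding m up and rounding k to the nearest whole number are assumptions, since the slide only shows 6.64 becoming 7.

```java
class BloomSizing {
    // m = -(n * ln p) / (ln 2)^2, rounded up to a whole number of bits
    static long bits(long n, double p) {
        return (long) Math.ceil(-(n * Math.log(p)) / Math.pow(Math.log(2), 2));
    }

    // k = (m / n) * ln 2, rounded to the nearest whole number, at least 1
    static int hashes(long m, long n) {
        return (int) Math.max(1, Math.round((double) m / n * Math.log(2)));
    }
}
```

For n = 1,000,000 and p = 0.01 this gives roughly 9.6 million bits and 7 hash functions, matching the worked numbers on the slides.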
24. BLOOM FILTER
You can’t query a Bloom filter for cardinality
You can’t remove an item once it’s been added
There are many variants of the Bloom filter, some of which address
these issues
25. HASH FUNCTIONS
How to find many hash functions?
“Out of one, many”
Make the size of your bit mask a power of 2
By masking off bit fields, you can get multiple hash values
from a single hash function.
A 16-bit hash will cover a 64 Kbit (65,536-bit) index
A 512-bit hash will give 32 16-bit hashes
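As an illustration of the bit-field idea, the sketch below cuts one 512-bit digest into 32 indexes of 16 bits each. The choice of SHA-512 is an assumption (the slides do not name a hash function), and HashSplitter and indexes are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashSplitter {
    // Splits one 512-bit SHA-512 digest into 32 indexes, each in 0 .. 65535.
    static int[] indexes(String item) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-512")
                    .digest(item.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        int[] result = new int[digest.length / 2];
        for (int i = 0; i < result.length; i++) {
            int high = digest[2 * i] & 0xFF;     // mask to treat the byte as unsigned
            int low = digest[2 * i + 1] & 0xFF;
            result[i] = (high << 8) | low;       // combine two bytes into a 16-bit index
        }
        return result;
    }
}
```

Because the digest is computed from the item's bytes rather than from hashCode, the indexes are the same in every process.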
26. COUNT-MIN SKETCH
When you want to know how many of each item there are in a
collection.
27. COUNT-MIN SKETCH
[Diagram: a grid of counters, w columns by d rows. Adding ‘Thing 1’ increments (+1) one counter in each row.]
Each box is a counter.
Each row is indexed by a corresponding hash function.
28. COUNT-MIN SKETCH
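[Diagram: the same w-by-d grid. ‘Some thing’ maps to one counter in each row; those four counters hold the values a, b, c and d.]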
Estimated frequency for ‘Some thing’ is min(a, b, c, d).
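The update and query steps can be sketched as follows. The per-row hash ((a * x + b) mod p) mod w is one common choice and an assumption here, as are the class name, the seed parameter, and the restriction to String items (whose hashCode is stable across processes).

```java
import java.util.Random;

class CountMinSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final int w;            // counters per row
    private final long[][] counts;  // d rows of w counters
    private final long[] a, b;      // per-row hash parameters

    CountMinSketch(int w, int d, long seed) {
        this.w = w;
        counts = new long[d][w];
        a = new long[d];
        b = new long[d];
        Random rnd = new Random(seed);
        for (int i = 0; i < d; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // 1 .. PRIME-1
            b[i] = rnd.nextInt(Integer.MAX_VALUE);          // 0 .. PRIME-1
        }
    }

    private int column(int row, String item) {
        long x = Math.floorMod(item.hashCode(), PRIME);
        return (int) ((a[row] * x + b[row]) % PRIME % w);
    }

    // Increment one counter in every row.
    void add(String item) {
        for (int row = 0; row < counts.length; row++) {
            counts[row][column(row, item)]++;
        }
    }

    // Collisions only ever inflate a counter, so the minimum is the best estimate.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < counts.length; row++) {
            min = Math.min(min, counts[row][column(row, item)]);
        }
        return min;
    }
}
```

The estimate can be too high but never too low: it is at least the true count and at most the total number of items added.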
29. COUNT-MIN SKETCH
How big to make ‘w’ and ‘d’?
‘w’ is the number of counters per hash function
limits the magnitude of the error
‘d’ is the number of separate hash functions
controls the probability that the estimation error exceeds
that limit
30. COUNT-MIN SKETCH
error-limit <= 2 * n / w
probability the estimate stays within the limit >= 1 - (1 / 2) ** d
(so the probability the limit is exceeded <= (1 / 2) ** d)
n = total number of items counted
w = number of counters per hash function
d = number of separate hash functions
Works best on skewed data.
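Reading the bounds as "the error is at most 2 * n / w with probability at least 1 - (1 / 2) ** d", both parameters can be derived from a target error limit and an acceptable failure probability. The names CountMinSizing, width and depth below are hypothetical.

```java
class CountMinSizing {
    // Smallest w with 2 * n / w <= errorLimit
    static int width(long n, double errorLimit) {
        return (int) Math.ceil(2.0 * n / errorLimit);
    }

    // Smallest d with (1 / 2)^d <= failureProbability
    static int depth(double failureProbability) {
        return (int) Math.ceil(Math.log(1.0 / failureProbability) / Math.log(2.0));
    }
}
```

For example, counting n = 1,000,000 items with an error limit of 1,000 needs w = 2,000 counters per row, and a 1% chance of exceeding that limit needs d = 7 rows.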
31. RESOURCES
https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
http://blog.aggregateknowledge.com/
http://lkozma.net/blog/sketching-data-structures/
https://sites.google.com/site/countminsketch/home
“PyCon 2011: Handling ridiculous amounts of data with
probabilistic data structures”