- 1. PROBABILISTIC DATA STRUCTURES Jim Duey Lonocloud.com @jimduey http://clojure.net Wednesday, August 1, 2012
- 2. WHAT IS A DATA STRUCTURE? It is a ‘structure’ that holds ‘data’, allowing you to extract information. Data gets added to the structure. Queries of various sorts are used to extract information.
- 3. INSPIRATION Ilya Katsov https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
- 4. WORD OF CAUTION Many probabilistic data structures use hashing. Java’s hashCode is not safe across multiple processes: for some types the value differs from one JVM process to another. See “Java's hashCode is not safe for distributed systems”: http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html
- 5. PROBABILISTIC A query may return a wrong answer. The answer is ‘good enough’. Uses a fraction of the resources, i.e. memory or CPU cycles.
- 6. HOW MANY ITEMS? If you have a large collection of ‘things’ ... And there are some duplicates ... And you want to know how many unique things there are.
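- 7. LINEAR COUNTING

      import java.util.BitSet;

      class LinearCounter {
          private final int m;        // size of the bit mask; m is a design parameter
          private final BitSet mask;

          LinearCounter(int m) {
              this.m = m;
              this.mask = new BitSet(m);
          }

          void add(Object value) {
              // get an index for value between 0 and m - 1
              // (floorMod, because hashCode() may be negative)
              int position = Math.floorMod(value.hashCode(), m);
              mask.set(position);
          }
      }
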
- 8. LINEAR COUNTING [Diagram: Thing 1 to Thing 4 each pass through add() into a 12-bit mask; 3 bits end up set.]
- 9. LINEAR COUNTING class LinearCounter { BitSet mask = new BitSet(m); /* m is a design parameter */ ... } Question: How big is m?
- 10. LINEAR COUNTING Load factor = n / m, where n is the number of unique items expected and m is the size of the bit mask. If the load factor is < 1: few collisions, so the number of bits set is approximately the cardinality.
- 11. LINEAR COUNTING Load factor = n / m, where n is the number of unique items expected and m is the size of the bit mask. If the load factor is very high, say 100: all bits are set, so there is no information about cardinality.
- 12. LINEAR COUNTING Load factor = n / m, where n is the number of unique items expected and m is the size of the bit mask. If the load factor is higher than 1, but not too high: many collisions, but some relationship might exist between the number of bits set and the cardinality.
- 13. LINEAR COUNTING Finding the number of members of the collection: n = - m * ln ((m - w) / m), where m is the size of the bit map and w is the number of 1s in the bit map (the bit map's own cardinality, e.g. BitSet.cardinality()).
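The estimate formula can be tried out with a small sketch. The class name `LinearCounterSketch` and the `mix` helper are assumptions for illustration, not part of the talk: `mix` is a murmur3-style finalizer added so that small integer keys spread over the bit map, and the estimate is written as m * ln(m / (m - w)), which is the same value as the slide's formula.

```java
import java.util.BitSet;

// Hypothetical class name; a sketch of linear counting, not the speaker's code.
class LinearCounterSketch {
    private final int m;        // size of the bit map
    private final BitSet mask;

    LinearCounterSketch(int m) {
        this.m = m;
        this.mask = new BitSet(m);
    }

    // Assumption: a murmur3-style finalizer to spread the bits of hashCode().
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    void add(Object value) {
        mask.set(Math.floorMod(mix(value.hashCode()), m));
    }

    // n = -m * ln((m - w) / m), computed as m * ln(m / (m - w)).
    double estimate() {
        int w = mask.cardinality(); // number of 1 bits
        if (w == m) {
            return Double.POSITIVE_INFINITY; // bit map saturated, no information left
        }
        return m * Math.log(m / (double) (m - w));
    }

    public static void main(String[] args) {
        LinearCounterSketch counter = new LinearCounterSketch(1 << 16);
        for (int i = 0; i < 10_000; i++) {
            counter.add(i);
            counter.add(i); // duplicates set the same bit again
        }
        System.out.println(counter.estimate());
    }
}
```

Adding a value twice sets the same bit, so duplicates never change the estimate.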
- 14. LINEAR COUNTING class LinearCounter { BitSet mask = new BitSet(m); /* m is a design parameter */ ... } Question: How big is m? m > max(5, 1 / (std-err * n / m) ** 2) * (e ** (n / m) - n / m - 1). On the order of 1M unique values: m = 154 Kbit, n/m = 6.5. On the order of 10M unique values: m = 1.1 Mbit, n/m = 9. Both are for a standard error of 0.01.
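Because m appears on both sides of the bound, one way to use it is to search for the smallest m that satisfies it. The sketch below does that with a binary search; the class and method names are hypothetical, and the result should land close to, though not necessarily exactly on, the values read from the paper's table.

```java
// Hypothetical helper; searches for the smallest bit map size that satisfies
// m > max(5, 1 / (stdErr * t)^2) * (e^t - t - 1), with t = n / m.
class LinearCounterSizing {
    static boolean bigEnough(long m, long n, double stdErr) {
        double t = (double) n / m; // load factor
        double factor = Math.max(5.0, 1.0 / Math.pow(stdErr * t, 2));
        return m > factor * (Math.expm1(t) - t); // expm1(t) is e^t - 1
    }

    static long minBits(long n, double stdErr) {
        long hi = 1;
        while (!bigEnough(hi, n, stdErr)) {
            hi *= 2; // grow until the bound holds
        }
        long lo = hi / 2; // either 0 or a size known to be too small
        while (hi - lo > 1) {
            long mid = lo + (hi - lo) / 2;
            if (bigEnough(mid, n, stdErr)) {
                hi = mid;
            } else {
                lo = mid;
            }
        }
        return hi;
    }

    public static void main(String[] args) {
        System.out.println(minBits(1_000_000L, 0.01));
        System.out.println(minBits(10_000_000L, 0.01));
    }
}
```

The search relies on the right-hand side shrinking as m grows, which holds because (e^t - t - 1) / t^2 increases with the load factor t.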
- 15. LINEAR COUNTING “Linear-Time Probabilistic Counting Algorithm for Database Applications”. Use the table in the paper to find the bit map size. Check out Ilya’s blog post for some nice graphs.
- 16. LINEAR COUNTING [Diagram: Thing 1 to Thing 4 mapped onto a 12-bit mask with 3 bits set.]
- 17. [Diagram: the same Thing 1 to Thing 4 mapped onto a 12-bit mask, now with 7 bits set.]
- 18. BLOOM FILTER If you have a large collection of ‘things’ ... And you want to know if some thing is in the collection.
- 19. BLOOM FILTER [Diagram: Thing 1 to Thing 4 mapped onto a 12-bit mask with 7 bits set.]
- 20. BLOOM FILTER [Diagram: ‘Other thing’ checked against the same 12-bit mask.]
- 21. BLOOM FILTER [Diagram: ‘Missing thing’ checked against the same 12-bit mask.]
- 22. BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter. ‘k’ is the number of separate hash functions. m = - (n * ln p) / (ln 2) ** 2, where n is the number of distinct items to be stored and p is the probability of a false positive. For n = 1M and p = .01: m = - (1M * ln .01) / (ln 2) ** 2 = 9.6 Mbits = 1.2 MB.
- 23. BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter. ‘k’ is the number of separate hash functions. m = - (n * ln p) / (ln 2) ** 2. k = m / n * ln 2. For the example: k = 9.6M / 1M * 0.69 = 6.64, rounded up to 7 hash functions.
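The two sizing formulas are easy to check numerically. This is a minimal sketch with hypothetical names (`BloomSizing`, `bits`, `hashes`); rounding m up and k to the nearest whole number is an assumption about how to turn the real-valued results into usable parameters.

```java
// Hypothetical helper for the two Bloom filter sizing formulas.
class BloomSizing {
    // m = -(n * ln p) / (ln 2)^2, rounded up to a whole number of bits.
    static long bits(long n, double p) {
        return (long) Math.ceil(-(n * Math.log(p)) / Math.pow(Math.log(2), 2));
    }

    // k = (m / n) * ln 2, rounded to the nearest whole hash function (at least 1).
    static int hashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        long m = bits(n, 0.01);
        System.out.println(m + " bits, " + hashes(m, n) + " hash functions");
    }
}
```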
- 24. BLOOM FILTER You can’t query a Bloom filter for cardinality. You can’t remove an item once it’s been added. There are many variants of the Bloom filter, some of which address these issues.
- 25. HASH FUNCTIONS How to find many hash functions? “Out of one, many”. Make the size of your bit mask a power of 2. By masking off bit fields, you can get multiple hash values from a single hash function. A 16-bit hash will cover a 65,536-bit index. A 512-bit hash will give 32 16-bit hashes.
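As an illustration of the bit-field idea, here is a sketch of a Bloom filter whose k positions are all cut from one 512-bit digest. The class name is hypothetical, and SHA-512 via `MessageDigest` is an assumption chosen because it yields exactly 32 16-bit fields; the talk does not name a specific hash function.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

// Hypothetical class; assumes SHA-512 as the single underlying hash function.
class SplitHashBloomFilter {
    private static final int M = 1 << 16; // 65,536 bits, indexed by one 16-bit field
    private final BitSet bits = new BitSet(M);
    private final int k;                  // number of 16-bit fields used, at most 32

    SplitHashBloomFilter(int k) {
        if (k < 1 || k > 32) {
            throw new IllegalArgumentException("k must be between 1 and 32");
        }
        this.k = k;
    }

    // Cut the first k 16-bit fields out of one 512-bit digest.
    private int[] positions(String item) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-512")
                    .digest(item.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            result[i] = ((digest[2 * i] & 0xff) << 8) | (digest[2 * i + 1] & 0xff);
        }
        return result;
    }

    void add(String item) {
        for (int p : positions(item)) {
            bits.set(p);
        }
    }

    // false: definitely absent. true: present, or a false positive.
    boolean mightContain(String item) {
        for (int p : positions(item)) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SplitHashBloomFilter filter = new SplitHashBloomFilter(7);
        filter.add("Thing 1");
        System.out.println(filter.mightContain("Thing 1"));
        System.out.println(filter.mightContain("Missing thing"));
    }
}
```

Hashing the item's bytes rather than calling hashCode() also sidesteps the cross-process caution from the start of the talk.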
- 26. COUNT-MIN SKETCH When you want to know how many of each item there are in a collection.
- 27. COUNT-MIN SKETCH [Diagram: a grid of counters, w wide and d deep; adding ‘Thing 1’ increments (+1) one counter in each row.] Each box is a counter. Each row is indexed by a corresponding hash function.
- 28. COUNT-MIN SKETCH [Diagram: the same w by d grid; ‘Some thing’ selects the counters a, b, c and d, one per row.] Estimated frequency for ‘Some thing’ is min(a, b, c, d).
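The add and query steps from the two diagrams can be sketched as follows. The class name and the way each row's hash is derived (mixing hashCode() with a per-row seed, then a murmur3-style finalizer) are assumptions for illustration; a production sketch would use hash functions that are stable across processes.

```java
// Hypothetical class; a sketch of the w by d counter grid from the slides.
class CountMinSketch {
    private final int w;          // counters per row
    private final int d;          // rows, one hash function per row
    private final long[][] table;

    CountMinSketch(int w, int d) {
        this.w = w;
        this.d = d;
        this.table = new long[d][w];
    }

    // Assumption: per-row hashes come from mixing hashCode() with a row seed.
    private int index(Object item, int row) {
        int h = item.hashCode() ^ (row + 1) * 0x9e3779b9;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, w);
    }

    // Increment one counter in every row.
    void add(Object item) {
        for (int row = 0; row < d; row++) {
            table[row][index(item, row)]++;
        }
    }

    // The smallest of the item's d counters; never below the true count.
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < d; row++) {
            min = Math.min(min, table[row][index(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch sketch = new CountMinSketch(1024, 4);
        for (int i = 0; i < 5; i++) {
            sketch.add("Some thing");
        }
        System.out.println(sketch.estimate("Some thing"));
    }
}
```

Collisions can only add to a counter, which is why taking the minimum across rows gives the tightest available estimate.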
- 29. COUNT-MIN SKETCH How big to make ‘w’ and ‘d’? ‘w’ is the number of counters per hash function; it limits the magnitude of the error. ‘d’ is the number of separate hash functions; it controls the probability that the estimate exceeds the error limit.
- 30. COUNT-MIN SKETCH error-limit <= 2 * n / w. Probability that the error stays within this limit >= 1 - (1 / 2) ** d. n = total number of items counted; w = number of counters per hash function; d = number of separate hash functions. Works best on skewed data.
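These bounds can be turned around to pick w and d. The sketch below reads them as: an estimate overshoots by more than 2 * n / w with probability at most (1 / 2) ** d, so it stays within the limit with probability at least 1 - (1 / 2) ** d. The names `epsilon` (error limit as a fraction of n) and `delta` (acceptable probability of exceeding it) are assumptions introduced here.

```java
// Hypothetical helper for choosing Count-Min sketch dimensions.
class CountMinSizing {
    // Smallest w with 2 * n / w <= epsilon * n, i.e. w >= 2 / epsilon.
    static int width(double epsilon) {
        if (epsilon <= 0) {
            throw new IllegalArgumentException("epsilon must be positive");
        }
        return (int) Math.ceil(2.0 / epsilon);
    }

    // Smallest d with (1 / 2) ** d <= delta, found by repeated halving.
    static int depth(double delta) {
        if (delta <= 0) {
            throw new IllegalArgumentException("delta must be positive");
        }
        int d = 0;
        for (double p = 1.0; p > delta; p /= 2) {
            d++;
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(width(1.0 / 1024) + " counters per row");
        System.out.println(depth(0.01) + " rows");
    }
}
```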
- 31. RESOURCES https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ http://blog.aggregateknowledge.com/ http://lkozma.net/blog/sketching-data-structures/ https://sites.google.com/site/countminsketch/home “PyCon 2011: Handling ridiculous amounts of data with probabilistic data structures”