A Survey of Probabilistic Data Structures - StampedeCon 2012

3,255 views

Published on

At StampedeCon 2012 in St. Louis, Jim Duey of Lonocloud presents: Big data requires big resources which cost big money. But if you only need answers that are good enough, rather than precisely right, probabilistic data structures can be a way to get those answers with a fraction of the resources and cost. In this talk I’ll survey some different data structures, give some theory behind them and point out some use cases.

Published in: Technology, Sports
0 Comments
12 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
3,255
On SlideShare
0
From Embeds
0
Number of Embeds
13
Actions
Shares
0
Downloads
0
Comments
0
Likes
12
Embeds 0
No embeds

No notes for slide

A Survey of Probabilistic Data Structures - StampedeCon 2012

  1. 1. PROBABILISTIC DATA STRUCTURES Jim Duey Lonocloud.com @jimduey http://clojure.netWednesday, August 1, 2012
  2. 2. WHAT IS A DATA STRUCTURE? It is a ‘structure’ that holds ‘data’, allowing you to extract information. Data gets added to the structure. Queries of various sorts are used to extract information.Wednesday, August 1, 2012
  3. 3. INSPIRATION Ilya Katsov https://highlyscalable.wordpress.com/2012/05/01/ probabilistic-structures-web-analytics-data-mining/Wednesday, August 1, 2012
  4. 4. WORD OF CAUTION Many probabilistic data structures use hashing Java’s hashCode is not safe across multiple processes “Javas hashCode is not safe for distributed systems” http://martin.kleppmann.com/2012/06/18/java-hashcode- unsafe-for-distributed-systems.htmlWednesday, August 1, 2012
  5. 5. PROBABILISTIC Query may return a wrong answer The answer is ‘good enough’ Uses a fraction of the resources i.e. memory or cpu cyclesWednesday, August 1, 2012
  6. 6. HOW MANY ITEMS? If you have a large collection of ‘things’ ... And there are some duplicates ... And you want to know how many unique things there are.Wednesday, August 1, 2012
  7. 7. LINEAR COUNTING class LinearCounter { BitSet mask = new BitSet(m); // m is a design parameter void add(value) { // get an index for value between 0 .. m int position = value.hashCode() % m; mask.set(position); }Wednesday, August 1, 2012
  8. 8. LINEAR COUNTING 1 add() 0 0 Thing 1 0 add() 1 Thing 2 0 0 Thing 3 0 add() 0 0 Thing 4 1 add() 0Wednesday, August 1, 2012
  9. 9. LINEAR COUNTING class LinearCounter { BitSet mask = new BitSet(m); // m is a design parameter ... } Question: How big is m ?Wednesday, August 1, 2012
  10. 10. LINEAR COUNTING Load Factor n Number of unique items expected m Size of bit mask If the load factor is < 1; few collisions, number of bits set is the cardinality.Wednesday, August 1, 2012
  11. 11. LINEAR COUNTING Load Factor n Number of unique items expected m Size of bit mask If the load factor is very high 100; all bits set, no information about cardinality.Wednesday, August 1, 2012
  12. 12. LINEAR COUNTING Load Factor n Number of unique items expected m Size of bit mask If the load factor is higher than 1, but not too high; many collisions, but some relationship might exist between number of bits set and cardinality.Wednesday, August 1, 2012
  13. 13. LINEAR COUNTING Finding the number of members of the collection n = - m * ln ((m - w) / m) m is the size of the bit map w is the number of 1 s in the bitmap (cardinality)Wednesday, August 1, 2012
  14. 14. LINEAR COUNTING class LinearCounter { BitSet mask = new BitSet(m); // m is a design parameter ... } Question: How big is m ? m > max(5, 1 / (std-err * n / m) ** 2) * (e ** (n / m) - n / m -1) On the order of 1M unique values, m = 154 Kbit, n/m = 6.5 On the order of 10M unique values, m = 1.1 Mbit, n/m = 9 for a standard error of 0.01Wednesday, August 1, 2012
  15. 15. LINEAR COUNTING “Linear-Time Probabilistic Counting Algorithm for Database Applications” Use table to find bit map size. Checkout Ilya’s blog post for some nice graphs.Wednesday, August 1, 2012
  16. 16. LINEAR COUNTING 1 0 0 Thing 1 0 1 Thing 2 0 0 Thing 3 0 0 0 Thing 4 1 0Wednesday, August 1, 2012
  17. 17. 1 0 1 Thing 1 0 1 Thing 2 1 0 Thing 3 0 1 0 Thing 4 1 1Wednesday, August 1, 2012
  18. 18. BLOOM FILTER If you have a large collection of ‘things’ ... And you want to know if some thing is in the collection.Wednesday, August 1, 2012
  19. 19. BLOOM FILTER 1 0 1 Thing 1 0 1 Thing 2 1 0 Thing 3 0 1 0 Thing 4 1 1Wednesday, August 1, 2012
  20. 20. BLOOM FILTER 1 0 1 0 1 1 Other thing 0 0 1 0 1 1Wednesday, August 1, 2012
  21. 21. BLOOM FILTER 1 0 1 0 1 1 Missing thing 0 0 1 0 1 1Wednesday, August 1, 2012
  22. 22. BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter ‘k’ is the number of separate hash functions m = - (n * ln p) / (ln 2) ** 2 n is the the number of distinct items to be stored p is the probability of a false positive m = - (1M * ln .01) / (ln 2) ** 2 = 9.6 Mbits = 1.2 MBWednesday, August 1, 2012
  23. 23. BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter ‘k’ is the number of separate hash functions m = - (n * ln p) / (ln 2) ^ 2 k = m / n * ln 2 k = 9.6M / 1M * 0.69 = 6.64 = 7 hash functionsWednesday, August 1, 2012
  24. 24. BLOOM FILTER You can’t query a Bloom filter for cardinality You can’t remove an item once it’s been added Many variants of the Bloom filter, some that address these issuesWednesday, August 1, 2012
  25. 25. HASH FUNCTIONS How to find many hash functions? “Out of one, many” Make the size of your bit mask a power of 2 By masking off bit fields, you can get multiple hash values from a single hash function. a 16 bit hash will cover a 65Kbit index 512 bit hash will give 32 16-bit hashesWednesday, August 1, 2012
  26. 26. COUNT-MIN SKETCH When you want to know how many of each item there is in a collection.Wednesday, August 1, 2012
  27. 27. COUNT-MIN SKETCH w +1 +1 Thing 1 d +1 +1 Each box is a counter. Each row is indexed by a corresponding hash function.Wednesday, August 1, 2012
  28. 28. COUNT-MIN SKETCH w a b Some thing d c d Estimated frequency for ‘Some thing’ is min(a, b, c, d).Wednesday, August 1, 2012
  29. 29. COUNT-MIN SKETCH How big to make ‘w’ and ‘d’? ‘w’ is the number of counters per hash function limits the magnitude of the error ‘d’ is the number of separate hash functions controls the probability that the estimation is greater than the errorWednesday, August 1, 2012
  30. 30. COUNT-MIN SKETCH error-limit <= 2 * n / w probability limit exceeded = 1 - (1 / 2) ** d n = total number of items counted w = number of counters per hash function d = number of separate hash functions Works best on skewed data.Wednesday, August 1, 2012
  31. 31. RESOURCES https://highlyscalable.wordpress.com/2012/05/01/ probabilistic-structures-web-analytics-data-mining/ http://blog.aggregateknowledge.com/ http://lkozma.net/blog/sketching-data-structures/ https://sites.google.com/site/countminsketch/home “PyCon 2011: Handling ridiculous amounts of data with probabilistic data structures”Wednesday, August 1, 2012

×