Your SlideShare is downloading.
×

×

Saving this for later?
Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.

Text the download link to your phone

Standard text messaging rates apply

Like this presentation? Why not share!

- Modern Algorithms and Data Structur... by Lorenzo Alberton 27708 views
- Probabilistic Data Structures and A... by PyData 1724 views
- Advanced Data Structures for inform... by nar_zack1 694 views
- MinHashing: Fast Similarity Search by Jano Suchal 2979 views
- Probabilistic Data Structures and A... by Oleksandr Pryymak 991 views
- Craig Kerstiens - Scalable Uniques ... by PostgresOpen 1287 views

2,074

Published on

At StampedeCon 2012 in St. Louis, Jim Duey of Lonocloud presents: Big data requires big resources which cost big money. But if you only need answers that are good enough, rather than precisely right, …

At StampedeCon 2012 in St. Louis, Jim Duey of Lonocloud presents: Big data requires big resources which cost big money. But if you only need answers that are good enough, rather than precisely right, probabilistic data structures can be a way to get those answers with a fraction of the resources and cost. In this talk I’ll survey some different data structures, give some theory behind them and point out some use cases.

No Downloads

Total Views

2,074

On Slideshare

0

From Embeds

0

Number of Embeds

0

Shares

0

Downloads

0

Comments

0

Likes

10

No embeds

No notes for slide

- 1. PROBABILISTIC DATA STRUCTURES Jim Duey Lonocloud.com @jimduey http://clojure.netWednesday, August 1, 2012
- 2. WHAT IS A DATA STRUCTURE? It is a ‘structure’ that holds ‘data’, allowing you to extract information. Data gets added to the structure. Queries of various sorts are used to extract information.Wednesday, August 1, 2012
- 3. INSPIRATION Ilya Katsov https://highlyscalable.wordpress.com/2012/05/01/ probabilistic-structures-web-analytics-data-mining/Wednesday, August 1, 2012
- 4. WORD OF CAUTION Many probabilistic data structures use hashing Java’s hashCode is not safe across multiple processes “Javas hashCode is not safe for distributed systems” http://martin.kleppmann.com/2012/06/18/java-hashcode- unsafe-for-distributed-systems.htmlWednesday, August 1, 2012
- 5. PROBABILISTIC Query may return a wrong answer The answer is ‘good enough’ Uses a fraction of the resources i.e. memory or cpu cyclesWednesday, August 1, 2012
- 6. HOW MANY ITEMS? If you have a large collection of ‘things’ ... And there are some duplicates ... And you want to know how many unique things there are.Wednesday, August 1, 2012
- 7. LINEAR COUNTING class LinearCounter { BitSet mask = new BitSet(m); // m is a design parameter void add(value) { // get an index for value between 0 .. m int position = value.hashCode() % m; mask.set(position); }Wednesday, August 1, 2012
- 8. LINEAR COUNTING 1 add() 0 0 Thing 1 0 add() 1 Thing 2 0 0 Thing 3 0 add() 0 0 Thing 4 1 add() 0Wednesday, August 1, 2012
- 9. LINEAR COUNTING class LinearCounter { BitSet mask = new BitSet(m); // m is a design parameter ... } Question: How big is m ?Wednesday, August 1, 2012
- 10. LINEAR COUNTING Load Factor n Number of unique items expected m Size of bit mask If the load factor is < 1; few collisions, number of bits set is the cardinality.Wednesday, August 1, 2012
- 11. LINEAR COUNTING Load Factor n Number of unique items expected m Size of bit mask If the load factor is very high 100; all bits set, no information about cardinality.Wednesday, August 1, 2012
- 12. LINEAR COUNTING Load Factor n Number of unique items expected m Size of bit mask If the load factor is higher than 1, but not too high; many collisions, but some relationship might exist between number of bits set and cardinality.Wednesday, August 1, 2012
- 13. LINEAR COUNTING Finding the number of members of the collection n = - m * ln ((m - w) / m) m is the size of the bit map w is the number of 1 s in the bitmap (cardinality)Wednesday, August 1, 2012
- 14. LINEAR COUNTING class LinearCounter { BitSet mask = new BitSet(m); // m is a design parameter ... } Question: How big is m ? m > max(5, 1 / (std-err * n / m) ** 2) * (e ** (n / m) - n / m -1) On the order of 1M unique values, m = 154 Kbit, n/m = 6.5 On the order of 10M unique values, m = 1.1 Mbit, n/m = 9 for a standard error of 0.01Wednesday, August 1, 2012
- 15. LINEAR COUNTING “Linear-Time Probabilistic Counting Algorithm for Database Applications” Use table to find bit map size. Checkout Ilya’s blog post for some nice graphs.Wednesday, August 1, 2012
- 16. LINEAR COUNTING 1 0 0 Thing 1 0 1 Thing 2 0 0 Thing 3 0 0 0 Thing 4 1 0Wednesday, August 1, 2012
- 17. 1 0 1 Thing 1 0 1 Thing 2 1 0 Thing 3 0 1 0 Thing 4 1 1Wednesday, August 1, 2012
- 18. BLOOM FILTER If you have a large collection of ‘things’ ... And you want to know if some thing is in the collection.Wednesday, August 1, 2012
- 19. BLOOM FILTER 1 0 1 Thing 1 0 1 Thing 2 1 0 Thing 3 0 1 0 Thing 4 1 1Wednesday, August 1, 2012
- 20. BLOOM FILTER 1 0 1 0 1 1 Other thing 0 0 1 0 1 1Wednesday, August 1, 2012
- 21. BLOOM FILTER 1 0 1 0 1 1 Missing thing 0 0 1 0 1 1Wednesday, August 1, 2012
- 22. BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter ‘k’ is the number of separate hash functions m = - (n * ln p) / (ln 2) ** 2 n is the the number of distinct items to be stored p is the probability of a false positive m = - (1M * ln .01) / (ln 2) ** 2 = 9.6 Mbits = 1.2 MBWednesday, August 1, 2012
- 23. BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter ‘k’ is the number of separate hash functions m = - (n * ln p) / (ln 2) ^ 2 k = m / n * ln 2 k = 9.6M / 1M * 0.69 = 6.64 = 7 hash functionsWednesday, August 1, 2012
- 24. BLOOM FILTER You can’t query a Bloom filter for cardinality You can’t remove an item once it’s been added Many variants of the Bloom filter, some that address these issuesWednesday, August 1, 2012
- 25. HASH FUNCTIONS How to find many hash functions? “Out of one, many” Make the size of your bit mask a power of 2 By masking off bit fields, you can get multiple hash values from a single hash function. a 16 bit hash will cover a 65Kbit index 512 bit hash will give 32 16-bit hashesWednesday, August 1, 2012
- 26. COUNT-MIN SKETCH When you want to know how many of each item there is in a collection.Wednesday, August 1, 2012
- 27. COUNT-MIN SKETCH w +1 +1 Thing 1 d +1 +1 Each box is a counter. Each row is indexed by a corresponding hash function.Wednesday, August 1, 2012
- 28. COUNT-MIN SKETCH w a b Some thing d c d Estimated frequency for ‘Some thing’ is min(a, b, c, d).Wednesday, August 1, 2012
- 29. COUNT-MIN SKETCH How big to make ‘w’ and ‘d’? ‘w’ is the number of counters per hash function limits the magnitude of the error ‘d’ is the number of separate hash functions controls the probability that the estimation is greater than the errorWednesday, August 1, 2012
- 30. COUNT-MIN SKETCH error-limit <= 2 * n / w probability limit exceeded = 1 - (1 / 2) ** d n = total number of items counted w = number of counters per hash function d = number of separate hash functions Works best on skewed data.Wednesday, August 1, 2012
- 31. RESOURCES https://highlyscalable.wordpress.com/2012/05/01/ probabilistic-structures-web-analytics-data-mining/ http://blog.aggregateknowledge.com/ http://lkozma.net/blog/sketching-data-structures/ https://sites.google.com/site/countminsketch/home “PyCon 2011: Handling ridiculous amounts of data with probabilistic data structures”Wednesday, August 1, 2012

Be the first to comment