A Survey of Probabilistic Data Structures - StampedeCon 2012
 


At StampedeCon 2012 in St. Louis, Jim Duey of Lonocloud presents: Big data requires big resources which cost big money. But if you only need answers that are good enough, rather than precisely right, probabilistic data structures can be a way to get those answers with a fraction of the resources and cost. In this talk I’ll survey some different data structures, give some theory behind them and point out some use cases.

Usage Rights

© All Rights Reserved


Presentation Transcript

  • PROBABILISTIC DATA STRUCTURES Jim Duey, Lonocloud.com, @jimduey, http://clojure.net. Wednesday, August 1, 2012
  • WHAT IS A DATA STRUCTURE? It is a ‘structure’ that holds ‘data’, allowing you to extract information. Data gets added to the structure. Queries of various sorts are used to extract information.
  • INSPIRATION Ilya Katsov https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
  • WORD OF CAUTION Many probabilistic data structures use hashing. Java’s hashCode is not safe across multiple processes. “Java’s hashCode is not safe for distributed systems” http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html
  • PROBABILISTIC Query may return a wrong answer. The answer is ‘good enough’. Uses a fraction of the resources, i.e. memory or CPU cycles.
  • HOW MANY ITEMS? If you have a large collection of ‘things’ ... And there are some duplicates ... And you want to know how many unique things there are.
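  • LINEAR COUNTING
      import java.util.BitSet;

      class LinearCounter {
          final int m;        // m is a design parameter: the size of the bit mask
          final BitSet mask;

          LinearCounter(int m) { this.m = m; this.mask = new BitSet(m); }

          void add(Object value) {
              // get an index for value in 0 .. m-1; floorMod avoids the negative
              // result that hashCode() % m gives for negative hash codes
              int position = Math.floorMod(value.hashCode(), m);
              mask.set(position);
          }
      }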
  • LINEAR COUNTING [Diagram: Thing 1 through Thing 4 are each passed to add(), which sets one bit in the bit mask.]
  • LINEAR COUNTING
      class LinearCounter {
          BitSet mask = new BitSet(m); // m is a design parameter
          ...
      }
    Question: How big is m?
  • LINEAR COUNTING Load factor = n / m, where n is the number of unique items expected and m is the size of the bit mask. If the load factor is < 1: few collisions, so the number of bits set is approximately the cardinality.
  • LINEAR COUNTING Load factor = n / m, where n is the number of unique items expected and m is the size of the bit mask. If the load factor is very high (e.g. 100): all bits are set, so there is no information about cardinality.
  • LINEAR COUNTING Load factor = n / m, where n is the number of unique items expected and m is the size of the bit mask. If the load factor is higher than 1, but not too high: many collisions, but some relationship might exist between the number of bits set and the cardinality.
  • LINEAR COUNTING Estimating the number of unique members of the collection: n = - m * ln((m - w) / m), where m is the size of the bit map and w is the number of 1s in the bit map (the count that java.util.BitSet calls cardinality()).
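The add step and the estimator above can be put together in a few lines. The following is only a minimal sketch, not the speaker's code: the class name `LinearCounterSketch` and the `mix` helper (a murmur3-style finalizer used to scramble `hashCode()`) are assumptions introduced for illustration, and `hashCode()` itself remains unsuitable across processes, as cautioned earlier.

```java
import java.util.BitSet;

// Minimal linear counter: a bit mask plus the estimator n = -m * ln((m - w) / m).
class LinearCounterSketch {
    private final int m;        // size of the bit mask (design parameter)
    private final BitSet mask;

    LinearCounterSketch(int m) {
        this.m = m;
        this.mask = new BitSet(m);
    }

    // Assumption: a murmur3-style finalizer scrambles hashCode() so that
    // similar values spread over the mask; any well-mixed hash would do.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    void add(Object value) {
        mask.set(Math.floorMod(mix(value.hashCode()), m));
    }

    // w is the number of bits set. Once every bit is set the mask carries
    // no information, so the estimate is reported as infinite.
    double estimate() {
        int w = mask.cardinality();
        if (w == m) {
            return Double.POSITIVE_INFINITY;
        }
        // m * ln(m / (m - w)) is the same value as -m * ln((m - w) / m)
        return m * Math.log(m / (double) (m - w));
    }
}
```

Adding the same value twice sets the same bit, so duplicates do not change the estimate.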
  • LINEAR COUNTING
      class LinearCounter {
          BitSet mask = new BitSet(m); // m is a design parameter
          ...
      }
    Question: How big is m?
      m > max(5, 1 / (std-err * n / m) ** 2) * (e ** (n / m) - n / m - 1)
    For a standard error of 0.01: on the order of 1M unique values, m = 154 Kbit, n/m = 6.5; on the order of 10M unique values, m = 1.1 Mbit, n/m = 9.
  • LINEAR COUNTING “Linear-Time Probabilistic Counting Algorithm for Database Applications”. Use the table in the paper to find the bit map size. Check out Ilya’s blog post for some nice graphs.
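As a rough check on the sizing rule, the right-hand side of the inequality can be evaluated directly. This is a sketch under stated assumptions: the class and method names are hypothetical, `t` stands for the load factor n / m, and the paper's table remains the authoritative source for actual sizes.

```java
// Evaluates the right-hand side of the linear counting sizing bound
// m > max(5, 1 / (stdErr * t)^2) * (e^t - t - 1), with t = n / m.
class LinearCounterSizing {
    // t is the load factor n / m; stdErr is the target standard error.
    static double minMaskBits(double t, double stdErr) {
        double a = Math.max(5.0, 1.0 / Math.pow(stdErr * t, 2));
        return a * (Math.exp(t) - t - 1.0);
    }

    // True when a mask of m bits satisfies the bound for n expected unique items.
    static boolean satisfiesBound(long n, long m, double stdErr) {
        double t = (double) n / m;
        return m > minMaskBits(t, stdErr);
    }
}
```

For a load factor of 9 and a standard error of 0.01, `minMaskBits` comes to roughly 1.0 Mbit, which sits below the 1.1 Mbit quoted for 10M unique values.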
  • LINEAR COUNTING [Diagram: the bit mask after Thing 1 through Thing 4 have been added.]
  • [Diagram: the same bit mask and Thing 1 through Thing 4, now with more bits set.]
  • BLOOM FILTER If you have a large collection of ‘things’ ... And you want to know if some thing is in the collection.
  • BLOOM FILTER [Diagram: Thing 1 through Thing 4 added to the bit mask.]
  • BLOOM FILTER [Diagram: ‘Other thing’ is looked up in the same bit mask.]
  • BLOOM FILTER [Diagram: ‘Missing thing’ is looked up in the same bit mask.]
  • BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter; ‘k’ is the number of separate hash functions. m = - (n * ln p) / (ln 2) ** 2, where n is the number of distinct items to be stored and p is the probability of a false positive. Example: m = - (1M * ln .01) / (ln 2) ** 2 = 9.6 Mbits = 1.2 MB
  • BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter; ‘k’ is the number of separate hash functions. m = - (n * ln p) / (ln 2) ** 2 and k = m / n * ln 2. Example: k = 9.6M / 1M * 0.69 = 6.64, rounded up to 7 hash functions
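The two sizing formulas translate directly into code. The helper below is a small sketch; the class and method names are made up for illustration, and the rounding choices (round m up, round k to the nearest whole number) are assumptions rather than anything stated on the slides.

```java
// Bloom filter sizing from the expected item count n and false-positive rate p.
class BloomSizing {
    // m = -(n * ln p) / (ln 2)^2, rounded up to a whole number of bits.
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-(n * Math.log(p)) / (Math.log(2) * Math.log(2)));
    }

    // k = (m / n) * ln 2, rounded to the nearest whole hash function, at least 1.
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }
}
```

With n = 1,000,000 and p = 0.01 this gives about 9.6 million bits and 7 hash functions, matching the worked example.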
  • BLOOM FILTER You can’t query a Bloom filter for cardinality. You can’t remove an item once it’s been added. There are many variants of the Bloom filter, some of which address these issues.
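A basic Bloom filter with m bits and k hash functions could look like the sketch below. It is illustrative only: the class name is hypothetical, and deriving the k hashes by mixing `hashCode()` with a per-index seed is an assumption of this sketch, not the scheme from the talk.

```java
import java.util.BitSet;

// Basic Bloom filter: add() sets k bits, mightContain() checks the same k bits.
class BloomFilterSketch {
    private final int m;       // number of bits in the filter
    private final int k;       // number of hash functions
    private final BitSet bits;

    BloomFilterSketch(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // Assumption: the i-th hash is hashCode() offset by a per-index seed and
    // then scrambled with a murmur3-style finalizer.
    private int index(Object value, int i) {
        int h = value.hashCode() + i * 0x9e3779b9;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, m);
    }

    void add(Object value) {
        for (int i = 0; i < k; i++) {
            bits.set(index(value, i));
        }
    }

    // false means definitely absent; true means present or a false positive.
    boolean mightContain(Object value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because bits are only ever set, an added item is never reported as absent, which is also why a plain Bloom filter cannot support removal.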
  • HASH FUNCTIONS How to find many hash functions? “Out of one, many”. Make the size of your bit mask a power of 2. By masking off bit fields, you can get multiple hash values from a single hash function. A 16-bit hash will cover a 64 Kbit (65,536-bit) index. A 512-bit hash will give 32 16-bit hashes.
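One way to get 32 16-bit hashes from a single 512-bit hash is to slice a SHA-512 digest. This is a sketch under assumptions: the talk does not name a hash function, so SHA-512 (via the standard `java.security.MessageDigest`) and the class name `HashSlicer` are choices made here for illustration. Unlike `hashCode()`, a digest of the UTF-8 bytes gives the same result in every process.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Cuts one 512-bit digest into 32 separate 16-bit hash values.
class HashSlicer {
    // Returns 32 indices, each in 0 .. 65535, suitable for a 65,536-bit mask.
    static int[] indices(String value) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-512")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 not available", e);
        }
        int[] out = new int[digest.length / 2];   // 64 bytes -> 32 slices
        for (int i = 0; i < out.length; i++) {
            // combine two bytes into one unsigned 16-bit value
            out[i] = ((digest[2 * i] & 0xff) << 8) | (digest[2 * i + 1] & 0xff);
        }
        return out;
    }
}
```

A filter that needs k hash functions would use the first k slices; a cryptographic digest is slower than a non-cryptographic hash, so this trades speed for simplicity.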
  • COUNT-MIN SKETCH When you want to know how many of each item there are in a collection.
  • COUNT-MIN SKETCH [Diagram: a grid of counters, w wide and d deep; adding Thing 1 increments one counter (+1) in each row.] Each box is a counter. Each row is indexed by a corresponding hash function.
  • COUNT-MIN SKETCH [Diagram: the same w-by-d grid; ‘Some thing’ maps to the counters a, b, c and d, one in each row.] Estimated frequency for ‘Some thing’ is min(a, b, c, d).
  • COUNT-MIN SKETCH How big to make ‘w’ and ‘d’? ‘w’ is the number of counters per hash function; it limits the magnitude of the error. ‘d’ is the number of separate hash functions; it controls the probability that the error of an estimate exceeds that limit.
  • COUNT-MIN SKETCH error-limit <= 2 * n / w. Probability that the error stays within the limit >= 1 - (1 / 2) ** d, so the probability that the limit is exceeded is at most (1 / 2) ** d. n = total number of items counted, w = number of counters per hash function, d = number of separate hash functions. Works best on skewed data.
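The structure described on these slides, d rows of w counters with one hash function per row, can be sketched as follows. The class name and the seeded mixing of `hashCode()` used to derive the per-row hashes are assumptions made for illustration, not the talk's implementation.

```java
// Count-min sketch: d rows of w counters; the estimate is the row-wise minimum.
class CountMinSketchExample {
    private final int w;              // counters per hash function
    private final int d;              // number of hash functions (rows)
    private final long[][] counters;

    CountMinSketchExample(int w, int d) {
        this.w = w;
        this.d = d;
        this.counters = new long[d][w];
    }

    // Assumption: each row's hash is hashCode() offset by a per-row seed and
    // then scrambled with a murmur3-style finalizer.
    private int index(Object value, int row) {
        int h = value.hashCode() + row * 0x9e3779b9;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, w);
    }

    // Increment one counter in every row.
    void add(Object value) {
        for (int row = 0; row < d; row++) {
            counters[row][index(value, row)]++;
        }
    }

    // Collisions only ever inflate a counter, so the smallest of the d
    // counters is the closest to the true count and never falls below it.
    long estimate(Object value) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < d; row++) {
            min = Math.min(min, counters[row][index(value, row)]);
        }
        return min;
    }
}
```

Taking the minimum is what makes the error one-sided: the estimate can overcount because of collisions, but it cannot undercount.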
  • RESOURCES
      https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
      http://blog.aggregateknowledge.com/
      http://lkozma.net/blog/sketching-data-structures/
      https://sites.google.com/site/countminsketch/home
      “PyCon 2011: Handling ridiculous amounts of data with probabilistic data structures”