Probabilistic Data Structures
and Approximate Solutions
by Oleksandr Pryymak
PyData London 2014
IPython notebook with code >>
1. Probabilistic Data Structures and Approximate Solutions
by Oleksandr Pryymak
PyData London 2014
IPython notebook with code >>

2. Probabilistic||Approximate: Why?
Often:
● an approximate answer is sufficient
● need to trade accuracy for scalability or speed
● need to analyse a stream of data
Catch:
● despite typically achieving good results, there is a chance of bad worst-case behaviour
● use on large datasets (law of large numbers)

3. Code: Approximation

import random
from numpy import average  # 'average' presumably came from NumPy/pylab in the original notebook

x = [random.randint(0, 80000) for _ in xrange(10000)]
y = [i >> 8 for i in x]  # trim 8 bits off the integers
z = x[:500]              # 5% sample (x is uniform)

avx = average(x)
avy = average(y) * 2**8  # add the 8 bits back
avz = average(z)

print avx
print avy, 'error %.06f%%' % (100*abs(avx-avy)/float(avx))
print avz, 'error %.06f%%' % (100*abs(avx-avz)/float(avx))

Output:
39547.8816
39420.7744 error 0.321401%
39591.424 error 0.110100%

4. Code: Sampling Data
Interview question:
Get K samples from an infinite stream

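The transcript doesn't include the notebook's answer; a minimal sketch of the standard solution, reservoir sampling (the name reservoir_sample is mine, not from the talk):

import random

def reservoir_sample(stream, k):
    # Keep the first k items; after that, item n (0-based) replaces
    # a random slot with probability k/(n+1), so every item seen so
    # far ends up in the reservoir with equal probability.
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            j = random.randint(0, n)  # inclusive bounds
            if j < k:
                reservoir[j] = item
    return reservoir

print reservoir_sample(xrange(1000000), 5)
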
5. Probabilistic Data Structures
Generally they:
● use less space than the full dataset
● require higher CPU load
● are stream-friendly
● can be parallelized
● have a controlled error rate

6. Hash functions
A one-way function mapping a key of arbitrary length
to a message of fixed length:
message = hash(key)
However, collisions are possible:
hash(key1) = hash(key2)

7. Code: Hashing

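The slide's code isn't in the transcript; a minimal illustration of the fixed-length property and of reducing a digest to a bucket index, using stdlib MD5 for convenience (the talk itself benchmarks faster non-cryptographic hashes):

import hashlib

# Keys of arbitrary length map to messages of fixed length.
for key in ['a', 'hello world', 'x' * 1000]:
    print '%4d chars -> %s' % (len(key), hashlib.md5(key).hexdigest())

# For a hashmap, reduce the digest to a bucket index:
m = 1024
print int(hashlib.md5('hello world').hexdigest(), 16) % m
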
8. Hash collisions and performance
● Cryptographic hashes (like bcrypt) are not ideal for our use
● Need a fast algorithm with the lowest number of collisions:

Hash           Lowercase              Random UUID           Numbers
=============  =====================  ====================  ====================
Murmur         145 ns, 6 collis       259 ns, 5 collis      92 ns, 0 collis
FNV-1          184 ns, 1 collis       730 ns, 5 collis      92 ns, 0 collis
DJB2           156 ns, 7 collis       437 ns, 6 collis      93 ns, 0 collis
SDBM           148 ns, 4 collis       484 ns, 6 collis      90 ns, 0 collis
SuperFastHash  164 ns, 85 collis      344 ns, 4 collis      118 ns, 18742 collis
CRC32          250 ns, 2 collis       946 ns, 0 collis      130 ns, 0 collis
LoseLose       338 ns, 215178 collis  -                     -

by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Murmur2 collisions:
● cataract collides with periti
● roquette collides with skivie
● shawl collides with stormbound
● dowlases collides with tramontane
● cricketings collides with twanger
● longans collides with whigs

9. Hash randomness visualised as a hashmap
Great: murmur2 on a sequence of numbers
Not so great: DJB2 on a sequence of numbers

10. Comparison: Locality Sensitive Hashing (LSH)

11. Comparison: Locality Sensitive Hashing (LSH)
Image hashes
"Kernelized locality-sensitive hashing for scalable image search"
B. Kulis, K. Grauman - IEEE 12th International Conference on Computer Vision, 2009
Abstract: Fast retrieval methods are critical for large-scale and data-driven vision
applications. Recent work has explored ways to embed high-dimensional features or
complex distance functions into a low-dimensional Hamming space where items can be ...

12. Membership test: Bloom filter
A Bloom filter is probabilistic but yields only false positives.
Hash each item k times to get indices into a bit field of size m.
At least one 0 means w definitely isn't in the set.
All 1s mean w probably is in the set.

13. Use Bloom filter to serve requests

14. Code: bloom filter

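The notebook's implementation isn't reproduced in the transcript; a minimal sketch of a Bloom filter, using k salted MD5 digests in place of a proper hash family (the class and method names are mine):

import hashlib

class BloomFilter(object):
    def __init__(self, m, k):
        self.m = m            # bits in the filter
        self.k = k            # hash functions
        self.bits = [0] * m

    def _indices(self, item):
        # k indices derived from salted MD5 digests; any fast hash
        # family (e.g. murmur with k seeds) would work the same way.
        for seed in xrange(self.k):
            digest = hashlib.md5('%d:%s' % (seed, item)).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item):
        # Any 0 bit -> definitely not in the set;
        # all 1s -> probably in the set (false positives possible).
        return all(self.bits[i] for i in self._indices(item))

bf = BloomFilter(m=1000, k=4)
bf.add('pydata')
print 'pydata' in bf   # True
print 'london' in bf   # almost certainly False
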
15. Use Bloom filter to store graphs
Graphs only gain nodes because of Bloom filter false positives.
Pell et al., PNAS 2012

16. Counting Distinct Elements
In: an infinite stream of data
Question: how many distinct elements are there?
This is similar to:
In: coin flips
Question: how many times has the coin been flipped?

17. Coin flips: intuition
● Long runs of HEADs in a random series are rare.
● The longer you look, the more likely you are to see a long one.
● So the length of the longest run you've seen correlates with how many coins you've flipped.

18. Code: Cardinality estimation

19. Cardinality estimation
Basic algorithm:
● n = 0
● For each input item:
○ Hash item into a bit string
○ Count trailing zeroes in the bit string
○ If this count > n:
■ Let n = count
● Estimated cardinality ("count distinct") = 2^n

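A minimal sketch of this basic (Flajolet-Martin style) estimator, assuming an MD5-derived hash; a single estimator like this has high variance, which is what averaging schemes such as HyperLogLog fix:

import hashlib, random

def trailing_zeroes(n):
    if n == 0:
        return 32  # cap for the (vanishingly rare) all-zero hash
    count = 0
    while n & 1 == 0:
        count += 1
        n >>= 1
    return count

def estimate_cardinality(stream):
    max_zeroes = 0
    for item in stream:
        h = int(hashlib.md5(str(item)).hexdigest(), 16)
        max_zeroes = max(max_zeroes, trailing_zeroes(h))
    return 2 ** max_zeroes

# A single estimator is very noisy: expect the right order of
# magnitude, not the right number.
print estimate_cardinality(random.randint(0, 10**9) for _ in xrange(10000))
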
20. Cardinality estimation: HyperLogLog
Demo by: http://www.aggregateknowledge.com/science/blog/hll.html
Billions of distinct values in 1.5KB of RAM with 2% relative error
"HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm"
P. Flajolet, É. Fusy, O. Gandouet, F. Meunier; 2007

21. Code: HyperLogLog

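The notebook code isn't in the transcript; a compact sketch of plain HyperLogLog, without the small- and large-range corrections, using a 32-bit MD5-derived hash (parameter names follow the paper: b index bits, m = 2^b registers):

import hashlib

def rank(w, bits):
    # Position of the leftmost 1-bit in a `bits`-wide value (1-based).
    r = 1
    mask = 1 << (bits - 1)
    while mask and not (w & mask):
        r += 1
        mask >>= 1
    return r

def hyperloglog(stream, b=10):
    m = 1 << b                        # number of registers
    alpha = 0.7213 / (1 + 1.079 / m)  # bias correction, valid for m >= 128
    registers = [0] * m
    for item in stream:
        h = int(hashlib.md5(str(item)).hexdigest()[:8], 16)  # 32-bit hash
        j = h & (m - 1)               # low b bits choose a register
        w = h >> b                    # the remaining bits feed the rank
        registers[j] = max(registers[j], rank(w, 32 - b))
    # Normalised harmonic mean of 2^register over all registers.
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print hyperloglog(xrange(100000))   # ~100000, within a few percent
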
22. Count-min sketch
count(value) = min{ w1[h1(value)], ..., wd[hd(value)] }
(d rows of w counters; hi is the hash function for row i)
Frequency histogram estimation with a chance of over-counting

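A minimal sketch matching the formula, with salted MD5 standing in for the d independent hash functions (the CountMinSketch class is mine):

import hashlib

class CountMinSketch(object):
    def __init__(self, w, d):
        self.w = w   # counters per row
        self.d = d   # rows, one hash function each
        self.rows = [[0] * w for _ in xrange(d)]

    def _h(self, i, value):
        # Row i's hash function, simulated by salting MD5 with i.
        return int(hashlib.md5('%d:%s' % (i, value)).hexdigest(), 16) % self.w

    def add(self, value):
        for i in xrange(self.d):
            self.rows[i][self._h(i, value)] += 1

    def count(self, value):
        # Collisions only ever inflate counters, so the minimum
        # across rows is the least over-counted estimate.
        return min(self.rows[i][self._h(i, value)] for i in xrange(self.d))

cms = CountMinSketch(w=1000, d=5)
for word in ['a', 'b', 'a', 'c', 'a']:
    cms.add(word)
print cms.count('a')   # 3 (may over-count, never under-counts)
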
23. Code: Frequent Itemsets

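The actual notebook code isn't shown; one plausible pattern (an assumption on my part) pairs the CountMinSketch class sketched above with a small candidate set to track the most frequent items in a stream:

def frequent_items(stream, n, w=1000, d=5):
    cms = CountMinSketch(w, d)
    candidates = {}
    for item in stream:
        cms.add(item)
        candidates[item] = cms.count(item)
        if len(candidates) > n:
            # Evict the weakest candidate; a heap would do this in O(log n).
            del candidates[min(candidates, key=candidates.get)]
    return candidates

print frequent_items(['a', 'b', 'a', 'c', 'a', 'b'], n=2)
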
24. Machine Learning: Feature hashing
High-dimensional machine learning without a feature dictionary
by Andrew Clegg, "Approximate methods for scalable data mining"

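A minimal sketch of the hashing trick the slide refers to: features are mapped straight to column indices by a hash, so no token-to-column dictionary is ever stored (the function name and the sign-bit detail are my additions):

import hashlib

def hash_features(tokens, dims=2**18):
    # The hashing trick: a token's column is hash(token) mod dims,
    # with a sign bit so that collisions cancel out in expectation.
    vec = {}   # sparse vector: column index -> weight
    for tok in tokens:
        h = int(hashlib.md5(tok).hexdigest(), 16)
        col = h % dims
        sign = 1.0 if (h >> 20) & 1 else -1.0
        vec[col] = vec.get(col, 0.0) + sign
    return vec

print hash_features('the quick brown fox jumps over the lazy dog'.split())
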
25. Locality-sensitive hashing
To approximate nearest neighbours
by Andrew Clegg, "Approximate methods for scalable data mining"

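A minimal sketch of one classic LSH family, random-hyperplane signatures for cosine similarity (an illustrative choice; the talk doesn't pin down a family): nearby vectors agree on most signature bits, so candidate neighbours can be found by comparing short bit strings instead of full vectors:

import random

def random_hyperplanes(dims, n_bits):
    return [[random.gauss(0, 1) for _ in xrange(dims)]
            for _ in xrange(n_bits)]

def signature(vec, planes):
    # One bit per hyperplane: which side of it the vector lies on.
    # Vectors at a small angle agree on most bits.
    return ''.join('1' if sum(p * v for p, v in zip(plane, vec)) >= 0 else '0'
                   for plane in planes)

planes = random_hyperplanes(dims=3, n_bits=16)
print signature([1.0, 0.9, 0.0], planes)
print signature([1.0, 1.0, 0.1], planes)    # similar vector, similar bits
print signature([-1.0, 0.2, 0.9], planes)   # dissimilar vector, different bits
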
26. Probabilistic Databases
● PrDB (University of Maryland)
● Orion (Purdue University)
● MayBMS (Cornell University)
● BlinkDB v0.1alpha (UC Berkeley and MIT)

27. BlinkDB: queries
Queries with Bounded Errors and Bounded Response Times on Very Large Data

28. BlinkDB: architecture

29. References
Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html

30. Summary
● know the data structures
● know what you sacrifice
● control errors
http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ by Ilya Katsov
