# Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak

Probabilistic Data Structures and Approximate Solutions by Oleksandr Pryymak. http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df

Published in: Technology
### Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak

1. Probabilistic Data Structures and Approximate Solutions, by Oleksandr Pryymak. PyData London 2014. IPython notebook with code: http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
2. Probabilistic || Approximate: Why?

   Often:
   - an approximate answer is sufficient
   - accuracy must be traded for scalability or speed
   - a stream of data must be analysed

   Catch:
   - despite typically achieving good results, there is a chance of bad worst-case behaviour
   - use on large datasets (law of large numbers)
3. Code: Approximation

       # Python 3 port; `average` is not defined on the slide, NumPy's is assumed here
       import random
       from numpy import average

       x = [random.randint(0, 80000) for _ in range(10000)]
       y = [i >> 8 for i in x]   # trim 8 bits off of integers
       z = x[:500]               # 5% sample (x is uniform)

       avx = average(x)
       avy = average(y) * 2**8   # add the 8 bits back
       avz = average(z)

       print(avx)
       print(avy, 'error %.06f%%' % (100 * abs(avx - avy) / float(avx)))
       print(avz, 'error %.06f%%' % (100 * abs(avx - avz) / float(avx)))

   Output shown on the slide:

       39547.8816
       39420.7744 error 0.321401%
       39591.424 error 0.110100%
4. Code: Sampling Data

   Interview question: get K samples from an infinite stream.
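The slide does not name a technique; one common answer to this question is reservoir sampling, sketched below. The function name `reservoir_sample` and its parameters are illustrative and not taken from the talk's notebook. The function consumes a finite iterable, but the reservoir is a uniform sample of everything seen so far at every step, so the same loop works on a stream that never ends.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so j is uniform over 0..i.
            j = rng.randint(0, i)
            if j < k:
                # The new item replaces a slot with probability k / (i + 1).
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), k=5)
print(sample)
```

Memory use is O(K) regardless of how many items pass through.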
5. Probabilistic Data Structures

   Generally they:
   - use less space than the full dataset
   - require a higher CPU load
   - are stream-friendly
   - can be parallelized
   - have a controlled error rate
6. Hash functions

   One-way function: maps a key of arbitrary length to a message of fixed length: `message = hash(key)`

   However, collisions are possible: `hash(key1) = hash(key2)`
7. Code: Hashing
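A minimal sketch of both points about hash functions: keys of any length map to a fixed-size value, and collisions must exist once there are more keys than possible values. MD5 from the standard library is used only because it is always available; `hash_key`, `find_collision`, and the `buckets` parameter are hypothetical names, not the notebook's code.

```python
import hashlib


def hash_key(key, buckets=2**32):
    """Map a string of any length to an integer in range(buckets)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def find_collision(keys, buckets):
    """Return the first pair of distinct keys sharing a hash value, or None."""
    seen = {}
    for key in keys:
        h = hash_key(key, buckets)
        if h in seen:
            return seen[h], key
        seen[h] = key
    return None


# Short and long keys both land in the same fixed-size range.
print(hash_key("a"), hash_key("a much longer key " * 100))

# Pigeonhole principle: 11 distinct keys in 10 buckets must collide.
collision = find_collision((str(i) for i in range(11)), buckets=10)
print(collision)
```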
8. Hash collisions and performance
   - Cryptographic hashes (like bcrypt) are not ideal for our use
   - Need a fast algorithm with the lowest number of collisions:

   | Hash          | Lowercase                  | Random UUID           | Numbers                    |
   |---------------|----------------------------|-----------------------|----------------------------|
   | Murmur        | 145 ns, 6 collisions       | 259 ns, 5 collisions  | 92 ns, 0 collisions        |
   | FNV-1         | 184 ns, 1 collision        | 730 ns, 5 collisions  | 92 ns, 0 collisions        |
   | DJB2          | 156 ns, 7 collisions       | 437 ns, 6 collisions  | 93 ns, 0 collisions        |
   | SDBM          | 148 ns, 4 collisions       | 484 ns, 6 collisions  | 90 ns, 0 collisions        |
   | SuperFastHash | 164 ns, 85 collisions      | 344 ns, 4 collisions  | 118 ns, 18742 collisions   |
   | CRC32         | 250 ns, 2 collisions       | 946 ns, 0 collisions  | 130 ns, 0 collisions       |
   | LoseLose      | 338 ns, 215178 collisions  | -                     | -                          |

   by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

   Murmur2 collisions:
   - cataract collides with periti
   - roquette collides with skivie
   - shawl collides with stormbound
   - dowlases collides with tramontane
   - cricketings collides with twanger
   - longans collides with whigs
9. Hash randomness visualised (hashmap)
   - Great: murmur2 on a sequence of numbers
   - Not so great: DJB2 on a sequence of numbers
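One way to see why DJB2 struggles on a sequence of numbers is to hash consecutive decimal strings and look at the gaps between the results. The sketch below assumes the classic form of DJB2 (start at 5381, multiply by 33, add each byte) truncated to 32 bits; the truncation width is an assumption, and murmur2 is not shown because it is not in the standard library.

```python
def djb2(key):
    """Classic DJB2 string hash, truncated to 32 bits (width is an assumption)."""
    h = 5381
    for byte in key.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


# Hash the strings "10", "11", ..., "19".
hashes = [djb2(str(n)) for n in range(10, 20)]
gaps = [b - a for a, b in zip(hashes, hashes[1:])]
print(gaps)  # → [1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Keys that differ only in the last character differ by the same amount in their hashes, so sequential inputs produce sequential outputs rather than a scattered pattern.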
10. Comparison: Locality Sensitive Hashing (LSH)
11. Comparison: Locality Sensitive Hashing (LSH)

    Image hashes

    Kernelized locality-sensitive hashing for scalable image search. B. Kulis, K. Grauman. Computer Vision, 2009 IEEE 12th …, 2009 (ieeexplore.ieee.org). Cited by 22.

    Abstract: Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be ...
12. Membership test: Bloom filter

    A Bloom filter is probabilistic but only yields false positives.

    Hash each item k times to get k indices (1..m) into a bit field of m bits.
    - At least one 0 means w definitely isn't in the set.
    - All 1s means w probably is in the set.
13. Use Bloom filter to serve requests
14. Code: Bloom filter
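A minimal Bloom filter sketch following the description above: k indices per item, set on insert, all checked on lookup. It is not the notebook's code. The k indices are derived from one SHA-256 digest by double hashing (`h1 + i * h2`), which is a swapped-in shortcut for k independent hash functions, and indices run 0..m-1 rather than 1..m. The class name and parameters are illustrative.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                              # number of bits in the field
        self.k = k                              # number of indices per item
        self.bits = bytearray((m + 7) // 8)     # bit field, initially all 0

    def _indices(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force odd so the step is never 0
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # Any 0 bit: definitely absent. All 1 bits: probably present.
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))


bf = BloomFilter(m=10_000, k=7)
for word in ["cataract", "roquette", "shawl"]:
    bf.add(word)

print("shawl" in bf)       # added items are always reported as present
print("longans" in bf)     # usually False; True would be a false positive
```

To serve requests, a filter like this can sit in front of a slower store: a negative answer skips the expensive lookup entirely, and a positive answer is confirmed against the store.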
15. Use Bloom filter to store graphs

    Graphs only gain nodes because of Bloom filter false positives. Pell et al., PNAS 2012
16. Counting Distinct Elements

    In: an infinite stream of data. Question: how many distinct elements are there?

    This is similar to:

    In: coin flips. Question: how many times has the coin been flipped?
17. Coin flips: intuition
    - Long runs of HEADs in a random series are rare.
    - The longer you look, the more likely you are to see a long one.
    - Long runs are very rare and are correlated with how many coins you've flipped.
18. Code: Cardinality estimation
19. Cardinality estimation

    Basic algorithm:
    - n = 0
    - For each input item:
        - Hash the item into a bit string
        - Count the trailing zeroes in the bit string
        - If this count > n:
            - Let n = count
    - Estimated cardinality ("count distinct") = 2^n
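The basic algorithm above translates almost line for line into Python. In this sketch the hash is the first 32 bits of an MD5 digest, which is an assumption made for convenience; `trailing_zeroes` and `estimate_cardinality` are illustrative names.

```python
import hashlib


def trailing_zeroes(x, width=32):
    """Number of trailing zero bits in x; an all-zero value counts as `width`."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1   # x & -x isolates the lowest set bit


def estimate_cardinality(items):
    n = 0
    for item in items:
        digest = hashlib.md5(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")   # 32-bit hash of the item
        n = max(n, trailing_zeroes(h))
    return 2 ** n


data = list(range(10_000)) * 3          # 10,000 distinct values, each seen 3 times
print(estimate_cardinality(data))
```

Repeats never change the result because equal items hash to equal bit strings. The estimate is always a power of two and has high variance, which is the weakness HyperLogLog addresses by averaging over many registers.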
20. Cardinality estimation: HyperLogLog

    Demo: http://www.aggregateknowledge.com/science/blog/hll.html

    Billions of distinct values in 1.5 KB of RAM with 2% relative error.

    HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. P. Flajolet, É. Fusy, O. Gandouet, F. Meunier; 2007
21. Code: HyperLogLog
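A compact HyperLogLog sketch, written for illustration and not taken from the notebook. It departs from the 2007 paper in one stated way: it uses a 64-bit hash (the first 8 bytes of SHA-1), so the paper's large-range correction for 32-bit hashes is left out. The bias constant used here is the paper's formula for m >= 128, and the small-range fallback is linear counting. The class name and the `p` parameter (number of index bits) are assumptions.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash
        idx = h >> (64 - self.p)                      # first p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in `rest` (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z   # normalised harmonic mean
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog(p=10)
for i in range(100_000):
    hll.add(i)
print(round(hll.count()))
```

With 1,024 registers the expected relative error is about 1.04 / sqrt(1024), roughly 3%; raising `p` trades memory for accuracy.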
22. Count-min sketch

    $\text{count}(value) = \min\{w_1[h_1(value)], \ldots, w_d[h_d(value)]\}$

    Frequency histogram estimation with a chance of over-counting.
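The formula above can be sketched as d rows of counters, each indexed by its own hash function, with the estimate taken as the minimum across rows. In this illustration the d hash functions are simulated by prefixing the row number to the value before hashing with MD5; that choice, the class name, and the default `width` and `depth` are assumptions.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width              # counters per row
        self.depth = depth              # number of rows (d hash functions)
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        # One hash function per row, simulated by salting with the row number.
        data = f"{row}:{value}".encode("utf-8")
        return int.from_bytes(hashlib.md5(data).digest()[:8], "big") % self.width

    def add(self, value, amount=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, value)] += amount

    def count(self, value):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.rows[row][self._index(row, value)]
                   for row in range(self.depth))


cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a", "b"]:
    cms.add(word)
print(cms.count("a"), cms.count("b"), cms.count("z"))
```

The estimate is never below the true count; it exceeds it only when the value collides with other values in every row.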
23. Code: Frequent Itemsets
24. Machine Learning: Feature hashing

    High-dimensional machine learning without a feature dictionary.

    by Andrew Clegg, "Approximate methods for scalable data mining"
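A minimal sketch of feature hashing: each token is hashed straight to a column of a fixed-size vector, so no token-to-index dictionary has to be built or stored. The signed update (one hash bit decides between +1 and -1) is a common variant and an assumption here, as are the function name and the `n_buckets` parameter.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Turn a list of string tokens into a fixed-length feature vector."""
    vec = [0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h % n_buckets                  # which column the token lands in
        sign = 1 if (h >> 63) & 1 else -1      # top bit picks the sign
        vec[index] += sign
    return vec


print(hash_features("the quick brown fox jumps over the lazy dog".split()))
```

Different tokens can share a column; the signed update makes such collisions tend to cancel instead of always adding up.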
25. Locality-sensitive hashing

    To approximate nearest neighbours.

    by Andrew Clegg, "Approximate methods for scalable data mining"
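The slide does not say which LSH family is meant. One widely used scheme for cosine similarity is random hyperplane hashing, sketched here: each bit of the signature records which side of a random hyperplane the vector falls on, so vectors pointing in similar directions get signatures with a small Hamming distance. All function names, the `seed` parameter, and the example vectors are hypothetical.

```python
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(n_bits=32, dim=3)
v = [1.0, 2.0, 3.0]
near = [1.1, 2.0, 2.9]        # almost the same direction as v
far = [-1.0, -2.0, -3.0]      # opposite direction

print(hamming(signature(v, planes), signature(near, planes)))
print(hamming(signature(v, planes), signature(far, planes)))
```

Candidate neighbours are then found by comparing short signatures, or by bucketing on them, before computing exact distances on the few candidates that remain.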
26. Probabilistic Databases
    - PrDB (University of Maryland)
    - Orion (Purdue University)
    - MayBMS (Cornell University)
    - BlinkDB v0.1alpha (UC Berkeley and MIT)
27. BlinkDB: queries

    Queries with Bounded Errors and Bounded Response Times on Very Large Data