Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

2,550 views

Published on

Published in:
Technology

No Downloads

Total views

2,550

On SlideShare

0

From Embeds

0

Number of Embeds

26

Shares

0

Downloads

41

Comments

0

Likes

4

No embeds

No notes for slide

- 1. Probabilistic Data Structures and Approximate Solutions by Oleksandr Pryymak PyData London 2014IPython notebook with code >>
- 2. Probabilistic||Approximate: Why? Often: ● an approximate answer is sufficient ● need to trade accuracy for scalability or speed ● need to analyse stream of data Catch: ● despite typically achieving good result, exists a chance of the bad worst case behaviour. ● use on large datasets (law of large numbers)
- 3. Code: Approximation import random x = [random.randint(0,80000) for _ in xrange(10000)] y = [i>>8 for i in x] # trim 8 bits off of integers z = x[:500] # 5% sample (x is uniform) avx = average(x) avy = average(y) * 2**8 # add 8 bits avz = average(z) print avx print avy, 'error %.06f%%' % (100*abs(avx-avy)/float(avx)) print avz, 'error %.06f%%' % (100*abs(avx-avz)/float(avx)) 39547.8816 39420.7744 error 0.321401% 39591.424 error 0.110100%
- 4. Code: Sampling Data Interview question: Get K samples from an infinite stream
- 5. Probabilistic Data Structures Generally they are: ● Use less space than a full dataset ● Require higher CPU load ● Stream-friendly ● Can be parallelized ● Have controlled error rate
- 6. Hash functions One-way function: arbitrary length of the key -> to a fixed length of the message message = hash(key) However, collisions are possible: hash(key1) = hash(key2)
- 7. Code: Hashing
- 8. Hash collisions and performance ● Cryptographic hashes not ideal for our use (like bcrypt) ● Need a fast algorithm with the lowest number of collisions: Hash Lowercase Random UUID Numbers ============= ============= =========== ============== Murmur 145 ns 259 ns 92 ns 6 collis 5 collis 0 collis FNV-1 184 ns 730 ns 92 ns 1 collis 5 collis 0 collis DJB2 156 ns 437 ns 93 ns 7 collis 6 collis 0 collis SDBM 148 ns 484 ns 90 ns 4 collis 6 collis 0 collis SuperFastHash 164 ns 344 ns 118 ns 85 collis 4 collis 18742 collis CRC32 250 ns 946 ns 130 ns 2 collis 0 collis 0 collis LoseLose 338 ns - - 215178 collis by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed Murmur2 collisions ● cataract collides with periti ● roquette collides with skivie ● shawl collides with stormbound ● dowlases collides with tramontane ● cricketingscollides with twanger ● longans collides with whigs
- 9. Hash randomness visualised hashmap Great murmur2 on a sequence of numbers Not so great DJB2 on a sequence of numbers
- 10. Comparison: Locality Sensitive Hashing (LSH)
- 11. Comparison: Locality Sensitive Hashing (LSH) Image hashes Kernelized locality-sensitive hashing for scalable image search B Kulis, K Grauman - Computer Vision, 2009 IEEE 12th …, 2009 - ieeexplore.ieee.org Abstract Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high- dimensional features or complex distance functions into a low-dimensional Hamming space where items can be ... Cited by 22
- 12. Membership test: Bloom filter Bloom filter is probabilistic but only yields false positives. Hash each item k times indices into bit field. ` 1..m At least one 0 means w definitely isn’t in set. All 1s would mean w probably is in set.
- 13. Use Bloom filter to serve requests
- 14. Code: bloom filter
- 15. Use Bloom filter to store graphs Graphs only gain nodes because of Bloom filter false positives. Pell et al., PNAS 2012
- 16. Counting Distinct Elements In: infinite stream of data Question: how many distinct elements are there? is similar to: In: coin flips Question: how many times it has been flipped?
- 17. Coin flips: intuition ● Long runs of HEADs in random series are rare. ● The longer you look, the more likely you see a long one. ● Long runs are very rare and are correlated with how many coins you’ve flipped.
- 18. Code: Cardinality estimation
- 19. Cardinality estimation Basic algorithm: ● n=0 ● For each input item: ○ Hash item into bit string ○ Count trailing zeroes in bit string ○ If this count > n: ■ Let n = count ● Estimated cardinality (“count distinct”) = 2^n
- 20. Cardinality estimation: HyperLogLog Demo by: http://www. aggregateknowledge. com/science/blog/hll.html Billions of distinct values in 1.5KB of RAM with 2% relative error HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm P.Flajolet, É.Fusy, O.Gandouet, F.Meunier; 2007
- 21. Code: HyperLogLog
- 22. Count-min sketch count(value) = min{w1 [h1 (value)], ... wd [hd (value)]} Frequency histogram estimation with chance of over-counting
- 23. Code: Frequent Itemsets
- 24. Machine Learning: Feature hashing High-dimensional machine learning without feature dictionary by Andrew Clegg “Approximate methods for scalable data mining”
- 25. Locality-sensitive hashing To approximate nearest neighbours by Andrew Clegg “Approximate methods for scalable data mining”
- 26. Probabilistic Databases ● PrDB (University of Maryland) ● Orion (Purdue University) ● MayBMS (Cornell University) ● BlinkDB v0.1alpha (UC Berkeley and MIT)
- 27. BlinkDB: queries Queries with Bounded Errors and Bounded Response Times on Very Large Data
- 28. BlinkDB: architecture
- 29. References Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman http://infolab.stanford.edu/~ullman/mmds.html
- 30. Summary ● know the data structures ● know what you sacrifice ● control errors http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df http://highlyscalable.wordpress.com/2012/05/01/probabilistic- structures-web-analytics-data-mining/ by Ilya Katsov

No public clipboards found for this slide

×
### Save the most important slides with Clipping

Clipping is a handy way to collect and organize the most important slides from a presentation. You can keep your great finds in clipboards organized around topics.

Be the first to comment