Published in:
Technology


- 1. Probabilistic Data Structures and Approximate Solutions (IPython notebook with code), by Oleksandr Pryymak, PyData London 2014
- 2. Probabilistic||Approximate: Why?
  - Often:
    - an approximate answer is sufficient
    - accuracy must be traded for scalability or speed
    - a stream of data must be analysed
  - Catch:
    - despite typically achieving a good result, there is a chance of bad worst-case behaviour
    - use on large datasets (law of large numbers)
- 3. Code: Approximation

      import random

      def average(values):
          # Arithmetic mean (the notebook presumably relied on pylab's average)
          return sum(values) / float(len(values))

      x = [random.randint(0, 80000) for _ in range(10000)]
      y = [i >> 8 for i in x]    # trim 8 bits off of integers
      z = x[:500]                # 5% sample (x is uniform)

      avx = average(x)
      avy = average(y) * 2**8    # add 8 bits
      avz = average(z)

      print(avx)
      print(avy, 'error %.06f%%' % (100 * abs(avx - avy) / float(avx)))
      print(avz, 'error %.06f%%' % (100 * abs(avx - avz) / float(avx)))

  Output of the run shown on the slide (values vary between runs):

      39547.8816
      39420.7744 error 0.321401%
      39591.424 error 0.110100%
- 4. Code: Sampling Data. Interview question: get K samples from an infinite stream.
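
The slide's code is not part of this transcript. One common answer to the interview question is reservoir sampling (Algorithm R); the sketch below is an assumption about what the notebook showed, and the name `reservoir_sample` is made up for illustration.

```python
import itertools
import random


def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # randint is inclusive, so slot < k happens with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


# A truly infinite stream never ends, so this demo samples a long finite prefix.
stream = itertools.islice(itertools.count(), 100000)
sample = reservoir_sample(stream, k=5, rng=random.Random(42))
print(sample)
```

Memory use stays at K items no matter how long the stream runs, which is the point of the exercise.
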
- 5. Probabilistic Data Structures. Generally they:
  - use less space than a full dataset
  - require higher CPU load
  - are stream-friendly
  - can be parallelized
  - have a controlled error rate
- 6. Hash functions
  - One-way function: maps a key of arbitrary length to a message of fixed length: message = hash(key)
  - However, collisions are possible: hash(key1) = hash(key2) for two different keys
- 7. Code: Hashing
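
The code for this slide is missing from the transcript. As a stand-in, here is a minimal sketch of a fast non-cryptographic hash, 32-bit FNV-1a, plus a helper that finds a collision once the output is truncated to a few bits. The choice of FNV-1a and the helper name `find_collision` are assumptions, not necessarily what the notebook used.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: arbitrary-length input, fixed-length (32-bit) output."""
    h = 0x811C9DC5                          # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF   # multiply by the FNV prime, keep 32 bits
    return h


def find_collision(keys, bits):
    """Return the first pair of distinct keys whose hashes agree in the low `bits` bits."""
    mask = (1 << bits) - 1
    seen = {}
    for key in keys:
        h = fnv1a_32(key.encode("utf-8")) & mask
        if h in seen:
            return seen[h], key
        seen[h] = key
    return None


print(hex(fnv1a_32(b"a")))

# 300 distinct keys into 256 possible values: a collision is guaranteed (pigeonhole).
keys = ["key%d" % i for i in range(300)]
print(find_collision(keys, bits=8))
```
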
- 8. Hash collisions and performance
  - Cryptographic hashes (like bcrypt) are not ideal for our use
  - Need a fast algorithm with the lowest number of collisions:

        Hash            Lowercase                Random UUID         Numbers
        =============   ======================   =================   ====================
        Murmur          145 ns, 6 collis         259 ns, 5 collis    92 ns, 0 collis
        FNV-1           184 ns, 1 collis         730 ns, 5 collis    92 ns, 0 collis
        DJB2            156 ns, 7 collis         437 ns, 6 collis    93 ns, 0 collis
        SDBM            148 ns, 4 collis         484 ns, 6 collis    90 ns, 0 collis
        SuperFastHash   164 ns, 85 collis        344 ns, 4 collis    118 ns, 18742 collis
        CRC32           250 ns, 2 collis         946 ns, 0 collis    130 ns, 0 collis
        LoseLose        338 ns, 215178 collis    -                   -

  - Murmur2 collisions:
    - cataract collides with periti
    - roquette collides with skivie
    - shawl collides with stormbound
    - dowlases collides with tramontane
    - cricketings collides with twanger
    - longans collides with whigs
  - by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
- 9. Hash randomness visualised as a hashmap. [Figures: murmur2 (great); DJB2 on a sequence of numbers (not so great).]
- 10. Comparison: Locality Sensitive Hashing (LSH)
- 11. Comparison: Locality Sensitive Hashing (LSH). Image hashes.
  - B. Kulis, K. Grauman, “Kernelized locality-sensitive hashing for scalable image search”, Computer Vision, 2009 IEEE 12th …, 2009 (ieeexplore.ieee.org). Cited by 22.
  - Abstract: Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be ...
- 12. Membership test: Bloom filter
  - A Bloom filter is probabilistic but only yields false positives.
  - Hash each item k times to get k indices into a bit field (positions 1..m).
  - At least one 0 means w definitely isn’t in the set.
  - All 1s would mean w probably is in the set.
- 13. Use Bloom filter to serve requests
- 14. Code: bloom filter
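
The slide's code is not in this transcript. A minimal sketch of the structure described on slide 12 could look like the following. The class name, the use of SHA-256, and the double-hashing trick for deriving k indices are illustrative assumptions, not the notebook's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                   # size of the bit field
        self.k = k                   # number of hash indices per item
        self.bits = bytearray(m)     # one byte per bit: simple rather than compact

    def _indices(self, item):
        # Derive k indices from two base hashes (double hashing): an assumed choice.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for index in self._indices(item):
            self.bits[index] = 1

    def __contains__(self, item):
        # Any 0 bit: definitely not in the set. All 1s: probably in the set.
        return all(self.bits[index] for index in self._indices(item))


bloom = BloomFilter(m=1000, k=3)
for word in ("cataract", "periti", "shawl"):
    bloom.add(word)
print("shawl" in bloom, "stormbound" in bloom)
```

Added items are always reported as present; an item that was never added is usually reported as absent, but can occasionally be a false positive.
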
- 15. Use Bloom filter to store graphs Graphs only gain nodes because of Bloom filter false positives. Pell et al., PNAS 2012
- 16. Counting Distinct Elements
  - In: infinite stream of data. Question: how many distinct elements are there?
  - This is similar to: In: coin flips. Question: how many times has the coin been flipped?
- 17. Coin flips: intuition
  - Long runs of HEADs in a random series are rare.
  - The longer you look, the more likely you are to see a long one.
  - Long runs are very rare and are correlated with how many coins you’ve flipped.
- 18. Code: Cardinality estimation
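
The slide's code is not in this transcript. To make the coin-flip intuition concrete, the simulation below (an illustrative sketch, with the made-up helper `longest_heads_run`) tracks the longest run of heads for increasing numbers of flips. Raising 2 to the run length gives only a rough, noisy guess at the number of flips.

```python
import random


def longest_heads_run(num_flips, rng):
    """Flip a fair coin num_flips times and return the longest run of heads."""
    longest = current = 0
    for _ in range(num_flips):
        if rng.random() < 0.5:      # heads
            current += 1
            longest = max(longest, current)
        else:                       # tails resets the run
            current = 0
    return longest


rng = random.Random(2014)
for num_flips in (10, 100, 1000, 10000, 100000):
    run = longest_heads_run(num_flips, rng)
    print(num_flips, "flips: longest run =", run, "rough guess 2**run =", 2 ** run)
```
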
- 19. Cardinality estimation. Basic algorithm:
  - n = 0
  - For each input item:
    - Hash item into bit string
    - Count trailing zeroes in bit string
    - If this count > n: let n = count
  - Estimated cardinality (“count distinct”) = 2^n
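
The basic algorithm above can be sketched as follows. The 32-bit slice of SHA-256 and the function names are assumptions made for illustration. Note that a single estimator like this is very noisy and can only return powers of two.

```python
import hashlib


def trailing_zeroes(value, width=32):
    """Number of trailing zero bits in value (width if value is 0)."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1   # isolate the lowest set bit


def estimate_cardinality(items):
    n = 0
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")  # hash item into a 32-bit string
        n = max(n, trailing_zeroes(h))         # keep the longest run of trailing zeroes
    return 2 ** n


stream = [i % 5000 for i in range(20000)]      # 5000 distinct values, each repeated
print(estimate_cardinality(stream))
```

Duplicates hash to the same bit string, so repeats never change the estimate. This is why the approach counts distinct elements rather than total elements.
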
- 20. Cardinality estimation: HyperLogLog
  - Demo: http://www.aggregateknowledge.com/science/blog/hll.html
  - Billions of distinct values in 1.5KB of RAM with 2% relative error
  - P. Flajolet, É. Fusy, O. Gandouet, F. Meunier, “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”, 2007
- 21. Code: HyperLogLog
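
The slide's code is not in this transcript. Below is a simplified HyperLogLog sketch that follows the structure of the 2007 paper: it splits a hash into a register index and a remainder, keeps the maximum rank per register, and combines the registers with a harmonic mean. It includes only the small-range correction. The 64-bit slice of SHA-256 and the default `p=12` are assumptions, and the constant used for alpha is valid only for 128 or more registers (`p >= 7`).

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                          # number of index bits (assumes p >= 7)
        self.m = 1 << p                     # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")           # 64-bit hash
        rest_bits = 64 - self.p
        index = x >> rest_bits                          # first p bits pick a register
        rest = x & ((1 << rest_bits) - 1)               # remaining bits
        rank = rest_bits - rest.bit_length() + 1        # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z     # harmonic-mean estimate
        if estimate <= 2.5 * self.m:
            zeros = self.registers.count(0)
            if zeros:
                # Small-range correction: fall back to linear counting.
                estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for i in range(50000):
    hll.add("user-%d" % i)
print(round(hll.count()))
```

With 4096 registers the expected relative error is about 1.04 / sqrt(4096), roughly 1.6%.
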
- 22. Count-min sketch. Frequency histogram estimation with a chance of over-counting: count(value) = min{w1[h1(value)], ..., wd[hd(value)]}
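
A minimal sketch of the formula above: d rows of w counters, one hash per row, and a query that returns the minimum over the rows. Deriving the row hashes by salting SHA-256 with the row number is an illustrative assumption, as are the class and parameter names.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width, depth):
        self.width = width                                   # counters per row
        self.depth = depth                                   # number of rows / hash functions
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        # One hash function per row, obtained by salting with the row number.
        digest = hashlib.sha256(("%d:%s" % (row, value)).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value, count=1):
        for row in range(self.depth):
            self.tables[row][self._index(row, value)] += count

    def count(self, value):
        # Collisions only ever inflate a counter, so the minimum is the best estimate.
        return min(self.tables[row][self._index(row, value)]
                   for row in range(self.depth))


sketch = CountMinSketch(width=2000, depth=4)
for word in ["a"] * 30 + ["b"] * 5 + ["c"]:
    sketch.add(word)
print(sketch.count("a"), sketch.count("b"), sketch.count("c"))
```

The estimate is never below the true count; it can only be above it when every row suffers a collision.
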
- 23. Code: Frequent Itemsets
- 24. Machine Learning: Feature hashing High-dimensional machine learning without feature dictionary by Andrew Clegg “Approximate methods for scalable data mining”
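
The idea of learning without a feature dictionary can be sketched with the hashing trick: each token is hashed straight to a column index in a fixed-size vector. The function name, the bucket count, and the signed update (a common variant that lets collisions partly cancel) are assumptions for illustration.

```python
import hashlib


def hashed_features(tokens, num_buckets=16):
    """Map a list of string tokens to a fixed-length vector without a dictionary."""
    vector = [0.0] * num_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % num_buckets   # column for this token
        sign = 1.0 if digest[8] & 1 else -1.0                     # separate bit picks the sign
        vector[index] += sign
    return vector


print(hashed_features("the quick brown fox jumps over the lazy dog".split()))
```

The vector length is fixed up front, so unseen tokens never grow the model; the cost is that different tokens may share a column.
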
- 25. Locality-sensitive hashing To approximate nearest neighbours by Andrew Clegg “Approximate methods for scalable data mining”
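
One well-known LSH family for cosine similarity uses random hyperplanes: each bit of the signature records which side of a random hyperplane a vector falls on, so vectors separated by a small angle agree on most bits. The sketch below assumes this family; the slide does not say which one the talk used, and all names are illustrative.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(num_bits=64, dim=3)
query = signature([1.0, 0.0, 0.0], planes)
near = signature([0.9, 0.1, 0.0], planes)
far = signature([-1.0, 0.2, 0.5], planes)
print(hamming(query, near), hamming(query, far))
```

Candidates for nearest neighbours are then the items whose signatures are close in Hamming distance, which avoids comparing the query against every stored vector in full.
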
- 26. Probabilistic Databases
  - PrDB (University of Maryland)
  - Orion (Purdue University)
  - MayBMS (Cornell University)
  - BlinkDB v0.1alpha (UC Berkeley and MIT)
- 27. BlinkDB: queries Queries with Bounded Errors and Bounded Response Times on Very Large Data
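
To illustrate what a query with a bounded error means, the sketch below answers an average query from a random sample and reports a confidence half-width alongside the estimate. It is not BlinkDB's API or its sampling strategy, only a plain-Python illustration; the function name and the normal-approximation interval (z = 1.96 for roughly 95%) are assumptions.

```python
import math
import random


def approximate_mean(values, sample_size, rng, z=1.96):
    """Estimate the mean from a uniform random sample; return (estimate, half_width)."""
    sample = rng.sample(values, sample_size)
    mean = sum(sample) / sample_size
    variance = sum((v - mean) ** 2 for v in sample) / (sample_size - 1)
    half_width = z * math.sqrt(variance / sample_size)   # normal-approximation error bar
    return mean, half_width


rng = random.Random(7)
data = [rng.randint(0, 80000) for _ in range(100000)]
estimate, half_width = approximate_mean(data, sample_size=1000, rng=rng)
print("approximate mean: %.1f +/- %.1f" % (estimate, half_width))
print("exact mean: %.1f" % (sum(data) / len(data)))
```

Increasing `sample_size` tightens the error bar at the cost of reading more data, which is the accuracy versus response-time trade-off the slide title refers to.
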
- 28. BlinkDB: architecture
- 29. References Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman http://infolab.stanford.edu/~ullman/mmds.html
- 30. Summary
  - know the data structures
  - know what you sacrifice
  - control errors
  - http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
  - http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ by Ilya Katsov
