Many statistics are impossible to compute exactly on streaming data in bounded memory. Some very clever algorithms, however, compute very good approximations of these values efficiently in terms of CPU and memory.

- 1. © 2014 MapR Technologies
- 2. • "Decoder ring" • "the next thing I want to do is this" • Flajolet
- 3. • What's the problem? – speed – feasibility – communication – incremental computation – tree-based pre-computation • What do we need? – on-line version – associative version
- 4. • Why is that hard (impossible)? – pathological inputs – median ... any element of the first half of the data could be the median – k-th most common ... any element could occur often enough in the second half to be the biggest – unique elements ... hashing loses information, so any compact representation must have false positives or negatives.
- 5. • What can we do? – give up ... a slow, but exact answer may not be sooo bad – give up ... a fast, but inexact answer may not be sooo bad • The good news: – approximate can be very, very close to exact
- 6. The Classic Problems • Most common (top-40) • Count distinct • Quantiles, with focus on extremes
- 7. Classic Solutions • Leaky counters – Forget values, remember uncertainties • Count min sketch – Many small hash tables • Count distinct with HyperLogLog – Many hashes again • New Solution - Quantiles by t-digest – A new low in clustering
- 8. Classic Solutions - Leaky counters • Intuition: – Common elements are rarely rare, rare elements are always rare • Leaky counter: – a new element is inserted with count f = 1 and error = ceiling((N-1)/w) – every w samples, drop all entries with f + error < ceiling(N/w) • Adaptation to heavy hitters is trivial
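The leaky counter on the slide above can be sketched in a few lines. This is a minimal illustration that follows the textbook lossy-counting formulation (a new entry's error is the current bucket number minus one, and an entry is dropped at a bucket boundary when `f + error` does not exceed the bucket number), which may differ by one from the exact expressions on the slide. The class name `LossyCounter` and its methods are hypothetical names chosen for this example.

```python
import math


class LossyCounter:
    """Lossy counting sketch with bucket width w (names are illustrative)."""

    def __init__(self, width):
        self.width = width   # w on the slide
        self.n = 0           # N: number of items seen so far
        self.entries = {}    # item -> [count f, maximum undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        entry = self.entries.get(item)
        if entry is not None:
            entry[0] += 1
        else:
            # The item may have been seen and dropped in earlier buckets,
            # so remember how many occurrences could have been missed.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # End of a bucket: forget entries that cannot be frequent.
            self.entries = {
                key: e for key, e in self.entries.items() if e[0] + e[1] > bucket
            }

    def heavy_hitters(self, support):
        """Items whose true frequency may be at least support * N."""
        threshold = (support - 1.0 / self.width) * self.n
        return {k: e[0] for k, e in self.entries.items() if e[0] >= threshold}
```

Under these assumptions each stored count undercounts the true count by at most N/w, so asking for `heavy_hitters(s)` with s larger than 1/w returns every item whose true frequency is at least s * N.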
- 9. Classic Solutions - Count min sketch • Intuition: – A gazillion hashed counters can't all be wrong • Big array of counters, each row has different hash function • Increment counter in each row determined by hashing • Probe by finding minimum hashed counter for probe key • Oops... finding heavy hitters is tricky ... requires keeping log n sketches
- 10. Increment Hashed Locations to Insert a: h_i(a)
- 11. Probe Using min of Counts: min_i k[h_i(a)]
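The insert and probe steps of the count-min sketch can be illustrated as follows. This is a sketch under stated assumptions, not the speaker's implementation: the row hash functions h_i are simulated by salting a single BLAKE2b hash with the row number, and the class name `CountMinSketch`, its default sizes, and the `merge` method are choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Rows of hashed counters; estimates never fall below the true count."""

    def __init__(self, depth=4, width=1024):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Salting one hash with the row number stands in for h_i(key).
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        # Insert: increment one counter in every row.
        for i in range(self.depth):
            self.rows[i][self._index(i, key)] += count

    def estimate(self, key):
        # Probe: collisions only inflate counters, so take the minimum.
        return min(self.rows[i][self._index(i, key)] for i in range(self.depth))

    def merge(self, other):
        # Sketches with identical shape and hashes combine by addition.
        for mine, theirs in zip(self.rows, other.rows):
            for j, value in enumerate(theirs):
                mine[j] += value
```

Because merging is plain element-wise addition, sketches built on separate partitions of a stream can be combined in any order, which is the associative behaviour asked for earlier in the deck.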
- 12. Classic Solutions - Count distinct with HyperLogLog • Intuition: – The smallest of n uniform samples is expected to be about 1/n (exactly 1/(n+1)) – Hashing turns anything into a uniform distribution – Hashing again turns anything into a new uniform distribution • Best done with pictures
- 13. What does hashing look like?
- 14. [Figure: plot of ix, axes from 0.0 to 1.0]
- 15. [Figure: plot of hash(ix), axes from 0.0 to 1.0]
- 16. Hashing fixes all ills
- 17. [Figure, two panels. Original distribution: x ~ G(0.2, 0.2), mean = 1, median = 0.1, 5th percentile = 10^-6. After hashing: values spread over 0.0 to 1.0]
- 18. Now the trick … what is the min?
- 19. Repeated Minimum: 10 samples, min is ~ 0.1
- 20. [Figure: PDF of Min(x); observed minimum value (100 samples x 10,000 replications)]
- 21. [Figure: PDF of Min(x); observed minimum value (100 samples x 10,000 replications) with the theoretical distribution]
- 22. [Figure: PDF of Min(x); observed minimum value (100 samples x 10,000 replications) with the theoretical distribution; mean = 0.0099]
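The experiment behind these plots is easy to repeat. The sketch below draws 100 uniform samples, keeps the minimum, and averages over 10,000 replications; the result should land near 1/(n+1) = 1/101, which is about 0.0099 and matches the mean quoted on the slide. The function name `mean_of_minimum` and the fixed seed are assumptions made for this example.

```python
import random


def mean_of_minimum(n_samples, replications, seed=0):
    """Average, over many replications, of the minimum of n uniform samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        total += min(rng.random() for _ in range(n_samples))
    return total / replications


if __name__ == "__main__":
    observed = mean_of_minimum(100, 10_000)
    print("observed mean of minimum:", observed)
    print("theoretical 1/(n+1):     ", 1 / 101)
```

Turning the relationship around is the core of the counting trick: given only the typical minimum, 1/minimum - 1 gives a rough estimate of how many distinct values were hashed.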
- 23. Counting leading zeros is taking the log (almost)
- 24. [Figure: PDF of Min(x) against observed minimum log10(value), with a second axis labelled Error (1e−05 to 0.1); mean = −2.3, 10^−2.3 = 0.0056]
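Putting the pieces together (hash to uniform, track the minimum through its leading zeros, and repeat over many registers) gives a simplified HyperLogLog. The sketch below is an illustration under stated assumptions rather than a production implementation: it uses a 64-bit BLAKE2b hash, the commonly quoted bias constant for 128 or more registers, and only the small-range (linear counting) correction. The class name `HyperLogLog` and the default `p=12` are choices made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with m = 2**p registers (requires p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        x = int.from_bytes(digest, "big")
        index = x >> (64 - self.p)              # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Position of the first 1 bit in `rest`, i.e. leading zeros plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        # Register-wise maximum: the sketch of a union of two streams.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Few items seen: counting empty registers is more accurate.
            return m * math.log(m / zeros)
        return raw
```

With 4,096 registers the typical relative error is around 1.04 / sqrt(4096), or roughly 1.6%, and because `merge` is a register-wise maximum, partial sketches can be combined in any order.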
- 25. T-digest for Quantiles • Intuition: – 1-d k-means with size cap – Make size cap depend on distance to nearest end • Experimental verification – Distribution in cluster very uniform – Accuracy far better than alternatives, especially at extremes
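The t-digest intuition, one-dimensional clustering where the size cap shrinks toward either end of the distribution, can be sketched as below. This is a much simplified illustration and not the t-digest library: the cap rule `4 * N * q * (1 - q) / compression`, the buffered greedy merge, and the names `SimpleTDigest`, `compression`, and `buffer_size` are all assumptions made for this example.

```python
class SimpleTDigest:
    """Toy quantile sketch: 1-d clusters, small near the tails (illustrative)."""

    def __init__(self, compression=100, buffer_size=500):
        self.compression = compression
        self.buffer_size = buffer_size
        self.centroids = []   # sorted list of (mean, weight)
        self.buffer = []      # unmerged points, each as (value, 1)
        self.total = 0

    def add(self, x):
        self.buffer.append((float(x), 1))
        if len(self.buffer) >= self.buffer_size:
            self._compress()

    def _compress(self):
        if not self.buffer:
            return
        items = sorted(self.centroids + self.buffer)
        self.buffer = []
        total = sum(w for _, w in items)
        merged = []
        mean, weight = items[0]
        seen = 0              # weight of all clusters already emitted
        for m, w in items[1:]:
            # Quantile at the centre of the cluster if this item were absorbed.
            q = (seen + (weight + w) / 2) / total
            cap = max(1.0, 4 * total * q * (1 - q) / self.compression)
            if weight + w <= cap:
                mean += (m - mean) * w / (weight + w)   # weighted mean update
                weight += w
            else:
                merged.append((mean, weight))
                seen += weight
                mean, weight = m, w
        merged.append((mean, weight))
        self.centroids = merged
        self.total = total

    def quantile(self, q):
        self._compress()
        if not self.centroids:
            raise ValueError("no data")
        target = q * self.total
        seen = 0.0
        prev_pos = prev_mean = None
        for mean, weight in self.centroids:
            pos = seen + weight / 2   # rank at the centre of this cluster
            if target <= pos:
                if prev_pos is None:
                    return mean
                frac = (target - prev_pos) / (pos - prev_pos)
                return prev_mean + frac * (mean - prev_mean)
            prev_pos, prev_mean = pos, mean
            seen += weight
        return self.centroids[-1][0]
```

Because the cap falls to a single point as q approaches 0 or 1, the extreme quantiles are kept almost exactly, while the middle of the distribution is summarised by a small number of larger clusters.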