- 1. Large-scale real-time analytics for everyone: fast, cheap and 98% correct
- 2. Pavel Kalaidin @facultyofwonder
- 3. We have a lot of data; memory is limited; one pass would be great; constant update time
- 4. max, min, mean are trivial
- 5. median, anyone?
- 6. Sampling?
- 7. Probabilistic algorithms
- 8. An estimate is OK, but it is nice to know how the error is distributed
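- 9.

      def frugal(stream):
          m = 0  # running median estimate: the only state kept
          for val in stream:
              if val > m:
                  m += 1  # value above the estimate: nudge up
              elif val < m:
                  m -= 1  # value below the estimate: nudge down
          return m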
- 10. Memory used: 1 int! (Same frugal code as on the previous slide.) It really works
- 11. Percentiles?
- 12. Demo: bit.ly/frugalsketch

      import numpy as np

      def frugal_1u(stream, m=0, q=0.5):
          for val in stream:
              r = np.random.random()
              if val > m and r > 1 - q:
                  m += 1  # step up with probability q
              elif val < m and r > q:
                  m -= 1  # step down with probability 1 - q
          return m
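
To get a feel for how close the frugal estimate lands, here is a small self-contained sketch that runs the same update rule on synthetic data and compares it with the exact quantiles. The function name `frugal_quantile`, the `rng` parameter and the uniform test data are assumptions made for this illustration; they are not from the slides.

```python
import random

def frugal_quantile(stream, q=0.5, m=0, rng=None):
    """Frugal-1U style estimate of the q-quantile using one integer of state."""
    rng = rng or random.Random()
    for val in stream:
        r = rng.random()
        if val > m and r > 1 - q:
            m += 1  # step up with probability q
        elif val < m and r > q:
            m -= 1  # step down with probability 1 - q
    return m

# Illustrative data: 100,000 integers drawn uniformly from 0..1000.
data_rng = random.Random(0)
data = [data_rng.randint(0, 1000) for _ in range(100_000)]

median_estimate = frugal_quantile(data, q=0.5, rng=random.Random(1))
p90_estimate = frugal_quantile(data, q=0.9, rng=random.Random(2))

# Exact values need the whole data set sorted in memory.
exact = sorted(data)
print("median:", median_estimate, "exact:", exact[len(exact) // 2])
print("p90:   ", p90_estimate, "exact:", exact[int(0.9 * len(exact))])
```

The estimate moves by at most 1 per element, so it needs a warm-up roughly proportional to the distance between the starting value `m` and the true quantile.
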
- 13. Streaming + probabilistic = sketch
- 14. What do we want? Get the number of unique users, a.k.a. the cardinality
- 15. What do we want? Get the number of unique users grouped by host, date, segment
- 16. When do we want it? Well, right now
- 17. Data: 10^10 elements, 10^9 unique, int32, 40 GB
- 18. Straightforward approach: hash table
- 19. Hash table: 4 GB
- 20. HyperLogLog: 1.5 KB, 2% error
- 21. It all starts with an algorithm called LogLog
- 22. Imagine I tell you I spent this morning flipping a coin
- 23. and now I tell you the longest uninterrupted run of heads
- 24. 2 times or 100 times
- 25. In which case did I flip the coin for longer?
- 26. We are interested in patterns in hashes (namely the longest runs of leading zeros = heads)
- 27. Hash, don’t sample!* (* needs a good hash function)
- 28. Expecting: 0xxxxxx hashes: ~50%; 1xxxxxx hashes: ~50%; 00xxxxx hashes: ~25%
- 29. Estimate: 2^R, where R is the longest run of leading zeros in the hashes
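
The pattern argument can be checked empirically. The sketch below hashes 10,000 integers, measures how many hashes start with one or two zero bits, and forms the crude single-experiment estimate 2^R. The 32-bit width, the use of SHA-1 and the helper names `hash32` and `leading_zeros` are assumptions chosen for illustration.

```python
import hashlib

BITS = 32  # hash width used in this sketch (an assumption, not from the slides)

def hash32(item):
    """Map an item to a 32-bit integer via SHA-1 (illustrative choice of hash)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def leading_zeros(h, bits=BITS):
    """Number of leading zero bits of h when written with `bits` binary digits."""
    return bits - h.bit_length()

hashes = [hash32(i) for i in range(10_000)]
share_0 = sum(leading_zeros(h) >= 1 for h in hashes) / len(hashes)   # 0xxxxxx
share_00 = sum(leading_zeros(h) >= 2 for h in hashes) / len(hashes)  # 00xxxxx

R = max(leading_zeros(h) for h in hashes)
crude_estimate = 2 ** R  # very noisy: a single lucky hash doubles it
print(share_0, share_00, R, crude_estimate)
```

A single maximum is a power of two and has high variance, which is what the averaging over several experiments is meant to address.
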
- 30. I can perform several flipping experiments
- 31. and average the number of zeros
- 32. This is called stochastic averaging
- 33. So far the estimate is 2^R, where R is the longest run of leading zeros in the hashes
- 34. We will be using M buckets
- 35. where ɑ is a normalization constant
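
The formula these two slides refer to did not survive in the transcript. For reference, the LogLog estimate as commonly stated in the literature has the form below. Here $R_j$ is the value kept in bucket $j$; in the published algorithm this is the position of the first 1-bit (leading zeros plus one), for which $\alpha \approx 0.397$ when $M$ is large. Whether the slide used exactly this notation is an assumption.

```latex
E_{\text{LogLog}} \;=\; \alpha \cdot M \cdot 2^{\frac{1}{M}\sum_{j=1}^{M} R_j}
```

The exponent is the arithmetic mean of the per-bucket values, which is the stochastic averaging step, and the factor $M$ accounts for each bucket seeing only about $1/M$ of the distinct elements.
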
- 36. LogLog, SuperLogLog
- 37. LogLog, SuperLogLog, HyperLogLog: arithmetic mean -> harmonic mean, plus a couple of tweaks
- 38. Standard error is 1.04/sqrt(M), where M is the number of buckets
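
A minimal HyperLogLog sketch in Python, to make the bucket and harmonic-mean steps concrete. It is an illustration under stated assumptions, not the code from the talk: it uses a 64-bit slice of SHA-1 as the hash, the widely published bias constant for M >= 128, and the linear-counting correction for small cardinalities. The class and method names are made up for this example.

```python
import hashlib
import math

class HyperLogLog:
    """Illustrative HyperLogLog with M = 2**p buckets (requires p >= 7)."""

    def __init__(self, p=10):
        if p < 7:
            raise ValueError("the alpha formula used here assumes M >= 128")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash
        idx = h >> (64 - self.p)                    # first p bits pick the bucket
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining bits
        # rank = leading zeros of the remaining bits, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        # harmonic mean of 2**R over the buckets
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        if estimate <= 2.5 * m:
            zeros = self.registers.count(0)
            if zeros:
                # small-range correction: linear counting on empty buckets
                estimate = m * math.log(m / zeros)
        return estimate

hll = HyperLogLog(p=10)            # 1024 buckets, standard error about 3%
for i in range(200_000):
    hll.add(i % 100_000)           # every value is seen twice
estimate = hll.count()
print(round(estimate))
```

Duplicates do not change the registers, so the sketch counts distinct values regardless of how often each one appears.
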
- 39. LogLog, SuperLogLog, HyperLogLog, HyperLogLog++ (Google, 2013): 32 bit -> 64 bit, plus fixes for low cardinality. bit.ly/HLLGoogle
- 40. LogLog, SuperLogLog, HyperLogLog, HyperLogLog++, Discrete Max-Count (Facebook, 2014): bit.ly/DiscreteMaxCount
- 41. Large scale?
- 42. Suppose we have two HLL sketches; let’s take the maximum value from corresponding buckets
- 43. The resulting sketch has no loss in accuracy!
- 44. What do we want? How many unique users belong to two segments?
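
The claim that merging loses nothing can be checked directly: the bucket-wise maximum of two sketches equals the sketch built from the union of the two inputs. The snippet below works on bare register lists; the helper names `registers` and `merge`, the SHA-1 hash and the bucket count are assumptions for illustration.

```python
import hashlib

P = 8  # 2**8 = 256 buckets (illustrative choice)

def registers(items, p=P):
    """Build HLL-style registers: per bucket, the largest rank seen."""
    regs = [0] * (1 << p)
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)
        rest = h & ((1 << (64 - p)) - 1)
        rank = (64 - p) - rest.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    return regs

def merge(a, b):
    """Union of two sketches: bucket-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]

monday = registers(range(0, 1000))
tuesday = registers(range(500, 2000))
both_days = registers(range(0, 2000))

print(merge(monday, tuesday) == both_days)
```

Both sketches must use the same hash function and the same number of buckets for the merge to be meaningful.
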
- 45. HLL intersection
- 46. Inclusion-exclusion principle
- 47. credits: http://research.neustar.biz/2012/12/17/hll-intersections-2/
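
For two segments $A$ and $B$, the inclusion-exclusion identity behind the intersection estimate is shown below. Each term on the right is read from a sketch, with the union coming from the bucket-wise maximum of the two sketches.

```latex
|A \cap B| \;=\; |A| + |B| - |A \cup B|
```

One caveat worth keeping in mind: the errors of the three terms scale with the sizes of $A$, $B$ and $A \cup B$, not with the size of the intersection, so a small overlap between large segments can be estimated poorly.
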
- 48. Python code: bit.ly/hloglog
- 49. What do we want? Get the churn rate
- 50. Straightforward: feed new data to a new sketch
- 51. Sliding-window HyperLogLog
- 52. We maintain a list of tuples (timestamp, R), where R is a possible future maximum
- 53. Values that no longer make sense are automatically discarded from the list
- 54. One list per bucket
- 55. Take the maximum R over the given past timeframe, then estimate as in a regular HLL
- 56. Extra memory is required
- 57. All the details: bit.ly/SlidingHLL
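
One possible reading of the per-bucket list, sketched for a single bucket. An entry is dropped when it falls out of the window, or when a newer entry has an R at least as large, since it can then never again be the maximum. The class name, method names and the deque representation are assumptions for illustration; the linked paper is the authority on the details.

```python
from collections import deque

class SlidingBucket:
    """One HLL bucket that can answer 'max R over the last w time units'."""

    def __init__(self, window):
        self.window = window      # longest timeframe we can be asked about
        self.entries = deque()    # (timestamp, R): time ascending, R descending

    def add(self, timestamp, r):
        # Discard entries that are too old to matter for any query.
        while self.entries and self.entries[0][0] < timestamp - self.window:
            self.entries.popleft()
        # Discard older entries dominated by the new one.
        while self.entries and self.entries[-1][1] <= r:
            self.entries.pop()
        self.entries.append((timestamp, r))

    def max_r(self, now, w):
        """Largest R among entries with timestamp >= now - w (w <= window)."""
        return max((r for t, r in self.entries if t >= now - w), default=0)

bucket = SlidingBucket(window=10)
for t, r in [(1, 3), (2, 5), (3, 2), (4, 4), (10, 1)]:
    bucket.add(t, r)

print(list(bucket.entries))
print(bucket.max_r(now=10, w=10), bucket.max_r(now=10, w=7), bucket.max_r(now=10, w=5))
```

A full sliding-window sketch would keep one such structure per bucket and feed the per-bucket maxima into the usual HLL estimator.
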
- 58. Hash, don’t sample; estimate, not precise; save memory; streaming. This slide is the sketch of the talk
- 59. Lots of sketches for various purposes: percentiles, heavy hitters, similarity, other stream statistics
- 60. Have we seen this user before?
- 61. Bloom filter
- 62. [Figure: item i is hashed by h1, h2, ..., hk; each hash sets one position of a bit array to 1]
- 63. How many times did we see a user?
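
A minimal Bloom filter sketch for the "have we seen this user before?" question. It answers "definitely not" or "probably yes". The sizes, the SHA-256 based double hashing used to derive the k positions, and all names are assumptions for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=10_000, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # False means never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
for uid in range(500):
    seen.add(f"user-{uid}")

print("user-42" in seen)
false_positives = sum(f"guest-{i}" in seen for i in range(1000))
print(false_positives)
```

Added items are always reported as present; the false-positive rate grows as more bits get set, so the bit array has to be sized for the expected number of items.
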
- 64. Count-Min sketch is the answer: bit.ly/CountMinSketch
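- 65. [Figure: a table of counters with d rows and w columns; item i is hashed by h1, ..., hd and one counter in each row gets +1] Estimate: take the minimum of the d values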
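
The counter table can be sketched in a few lines. Each row uses its own hash function, collisions can only inflate a counter, and taking the minimum over rows limits that inflation. The table dimensions, the salted SHA-1 hashing and the names are assumptions for illustration.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width    # w: counters per row
        self.depth = depth    # d: number of rows / hash functions
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salting with the row number gives one hash function per row.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Never below the true count; may be above it because of collisions.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

visits = CountMinSketch()
for user in ["alice", "bob", "alice", "alice"]:
    visits.add(user)

print(visits.estimate("alice"), visits.estimate("bob"), visits.estimate("carol"))
```

The estimate is one-sided: it can overcount but never undercount, which is why the minimum is the natural choice.
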
- 66. Percentiles
- 67. Frugal sketching is not precise enough
- 68. Sorting is a pain
- 69. Distribute incoming values to buckets?
- 70. Some sort of clustering, maybe
- 71. T-Digest
- 72. Size is log(n), error is relative to q(1-q)
- 73. Code: bit.ly/T-Digest-Java, bit.ly/T-Digest-Python
- 74. This is a growing field of computer science: stay tuned!
- 75. Thanks and happy sketching!
- 76. Reading list: Neustar Research blog: bit.ly/NRsketches; Sketches overview: bit.ly/SketchesOverview; Lecture notes on streaming algorithms: bit.ly/streaming-lectures
- 77. Bonus: HyperLogLog in SQL: bit.ly/HLLinSQL