My slides from Highload Strategy conference in Vilnius.


- 1. Large-scale real-time analytics for everyone: fast, cheap and 98% correct
- 2. Pavel Kalaidin @facultyofwonder
- 3. We have a lot of data; memory is limited; one pass would be great; constant update time
- 4. max, min, mean is trivial
- 5. median, anyone?
- 6. Sampling?
- 7. Probabilistic algorithms
- 8. An estimate is OK, but it's nice to know how the error is distributed
- 9. def frugal(stream):
         m = 0
         for val in stream:
             if val > m:
                 m += 1
             elif val < m:
                 m -= 1
         return m
- 10. Memory used: one int! It really works
- 11. Percentiles?
- 12. Demo: bit.ly/frugalsketch
      import numpy as np

      def frugal_1u(stream, m=0, q=0.5):
          for val in stream:
              r = np.random.random()
              if val > m and r > 1 - q:
                  m += 1
              elif val < m and r > q:
                  m -= 1
          return m
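To see the frugal sketch converge, here is a minimal demo of the one-unit update rule above, using the stdlib `random` module instead of NumPy and assuming a uniform integer stream (my own illustration, not the speaker's demo):

```python
import random

def frugal_1u(stream, m=0, q=0.5):
    # Move the estimate up with probability q when the value is above it,
    # down with probability 1 - q when below; m drifts toward the q-quantile.
    for val in stream:
        r = random.random()
        if val > m and r > 1 - q:
            m += 1
        elif val < m and r > q:
            m -= 1
    return m

random.seed(42)
stream = (random.randint(0, 100) for _ in range(100_000))
median_estimate = frugal_1u(stream, q=0.5)  # hovers around the true median, 50
```

Memory is still a single integer; the price is slow convergence and noise around the true quantile.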
- 13. Streaming + probabilistic = sketch
- 14. What do we want? Get the number of unique users, a.k.a. the cardinality
- 15. What do we want? Get the number of unique users grouped by host, date, segment
- 16. When do we want? Well, right now
- 17. Data: 10^10 elements (10^9 unique int32 values) = 40 GB
- 18. Straightforward approach: a hash table
- 19. Hash table: 4 GB
- 20. HyperLogLog: 1.5 KB, 2% error
- 21. It all starts with an algorithm called LogLog
- 22. Imagine I tell you I spent this morning flipping a coin
- 23. and now I tell you the longest uninterrupted run of heads
- 24. 2 times or 100 times
- 25. When did I flip the coin for a longer time?
- 26. We are interested in patterns in hashes (namely the longest runs of leading zeros = heads)
- 27. Hash, don’t sample!* * need a good hash function
- 28. Expecting: 0xxxxxx hashes ~50%, 1xxxxxx hashes ~50%, 00xxxxx hashes ~25%
- 29. Estimate: 2^R, where R is the longest run of leading zeros in the hashes
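The single-run estimator can be sketched directly; this is a toy illustration of the idea (my own code, with md5 standing in for the "good hash function" the slides call for):

```python
import hashlib

HASH_BITS = 32

def leading_zeros(x, bits=HASH_BITS):
    # Number of leading zero bits in a `bits`-wide hash value.
    if x == 0:
        return bits
    n = 0
    while not (x >> (bits - 1 - n)) & 1:
        n += 1
    return n

def hash32(item):
    # Any well-mixed hash works; md5 is just a convenient stand-in.
    return int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")

def naive_cardinality(items):
    # Estimate = 2^R, where R is the longest run of leading zeros seen.
    R = max(leading_zeros(hash32(it)) for it in items)
    return 2 ** R

est = naive_cardinality(range(10_000))  # right order of magnitude, very noisy
```

A single "experiment" like this is only right to within an order of magnitude, which is exactly why the next slides average over many buckets.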
- 30. I can perform several flipping experiments
- 31. and average the number of zeros
- 32. This is called stochastic averaging
- 33. So far the estimate is 2^R, where R is the longest run of leading zeros in the hashes
- 34. We will be using M buckets
- 35. The bucketed estimate is α · M · 2^(mean R over buckets), where α is a normalization constant
- 36. LogLog SuperLogLog
- 37. LogLog SuperLogLog HyperLogLog arithmetic mean -> harmonic mean plus a couple of tweaks
- 38. Standard error is 1.04/sqrt(M), where M is the number of buckets
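Putting the bucketing and the harmonic mean together, here is a compact sketch of the HyperLogLog estimator described above. It assumes M is a power of two and uses the common α ≈ 0.7213/(1 + 1.079/M) approximation, with no small- or large-range corrections (a minimal illustration, not the reference implementation):

```python
import hashlib

def hash64(item):
    # sha1 as a stand-in for a good 64-bit hash.
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

def rho(w, bits):
    # Position of the leftmost 1-bit (1-based), i.e. leading zeros + 1.
    for i in range(bits):
        if (w >> (bits - 1 - i)) & 1:
            return i + 1
    return bits + 1

class HyperLogLog:
    def __init__(self, p=10):            # M = 2^p buckets
        self.p = p
        self.M = 1 << p
        self.buckets = [0] * self.M
        self.alpha = 0.7213 / (1 + 1.079 / self.M)

    def add(self, item):
        x = hash64(item)
        j = x >> (64 - self.p)           # first p bits pick the bucket
        w = x & ((1 << (64 - self.p)) - 1)
        self.buckets[j] = max(self.buckets[j], rho(w, 64 - self.p))

    def count(self):
        # Harmonic mean of 2^bucket values, scaled by alpha * M^2.
        z = sum(2.0 ** -b for b in self.buckets)
        return self.alpha * self.M * self.M / z

hll = HyperLogLog(p=10)
for i in range(100_000):
    hll.add(i)
est = hll.count()  # close to 100,000
```

With p = 10 (M = 1024 buckets, about 1 KB), 1.04/sqrt(M) works out to roughly 3% standard error.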
- 39. LogLog SuperLogLog HyperLogLog HyperLogLog++ Google, 2013 32 bit -> 64 bit + fixes for low cardinality bit.ly/HLLGoogle
- 40. LogLog SuperLogLog HyperLogLog HyperLogLog++ Discrete Max-Count Facebook, 2014 bit.ly/DiscreteMaxCount
- 41. Large scale?
- 42. Suppose we have two HLL sketches; let's take the maximum value from the corresponding buckets
- 43. Resulting sketch has no loss in accuracy!
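The merge really is just a bucket-wise maximum; a minimal illustration, assuming each HLL is stored as a plain list of bucket values:

```python
def merge(buckets_a, buckets_b):
    # Union of two HLL sketches: take the max of each bucket pair.
    # This is exactly the sketch you would get by hashing the union
    # of both streams directly, hence no loss in accuracy.
    return [max(a, b) for a, b in zip(buckets_a, buckets_b)]

merged = merge([3, 0, 5, 1], [2, 4, 1, 1])  # [3, 4, 5, 1]
```

Because merging is lossless and associative, sketches can be built per host, per day, or per segment and combined later, which is what makes the grouped counts from slide 15 cheap.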
- 44. What do we want? How many unique users belong to both segments?
- 45. HLL intersection
- 46. Inclusion-exclusion principle
- 47. credits: http://research.neustar. biz/2012/12/17/hll-intersections-2/
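The inclusion-exclusion step is a one-liner once you can merge sketches; a sketch of the computation, assuming the three counts come from two HLLs and their bucket-wise-max union:

```python
def intersection_estimate(count_a, count_b, count_union):
    # |A ∩ B| ≈ |A| + |B| - |A ∪ B|, where the union count comes from
    # the merged (bucket-wise max) sketch. As the Neustar post above
    # discusses, the relative error blows up when the true intersection
    # is small compared to either set.
    return count_a + count_b - count_union

est = intersection_estimate(10_000, 8_000, 15_000)  # 3_000
```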
- 48. Python code: bit.ly/hloglog
- 49. What do we want? Get the churn rate
- 50. Straightforward: feed new data to a new sketch
- 51. Sliding-window HyperLogLog
- 52. We maintain a list of tuples (timestamp, R), where R is a possible maximum over future time
- 53. Values that no longer make sense are automatically discarded from the list
- 54. One list per bucket
- 55. Take a maximum R over the given timeframe from the past, then estimate as we do in a regular HLL
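The per-bucket list of "possible future maxima" behaves like a classic sliding-window maximum structure; a minimal sketch of one bucket's list (my own illustration of the scheme above, not the paper's code):

```python
class SlidingBucket:
    # Keeps (timestamp, R) pairs; timestamps increase, R strictly decreases.
    def __init__(self):
        self.lfpm = []  # list of future possible maxima

    def add(self, t, r):
        # An older pair with R <= r can never again be the window max:
        # the new pair is both fresher and at least as large. Discard it.
        while self.lfpm and self.lfpm[-1][1] <= r:
            self.lfpm.pop()
        self.lfpm.append((t, r))

    def max_since(self, t_start):
        # The first surviving pair inside the window holds the max R.
        for t, r in self.lfpm:
            if t >= t_start:
                return r
        return 0

b = SlidingBucket()
b.add(1, 3)
b.add(2, 1)   # will be dominated by the next pair
b.add(3, 2)   # evicts (2, 1): newer and at least as large
# b.lfpm == [(1, 3), (3, 2)]; b.max_since(2) == 2
```

One such list per bucket, then the estimate proceeds exactly as in a regular HLL over the per-window maxima; the extra memory is what these lists cost.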
- 56. Extra memory is required
- 57. All the details: bit.ly/SlidingHLL
- 58. Hash, don't sample. Estimate, don't be precise. Save memory. Stream. This slide is the sketch of the talk.
- 59. Lots of sketches for various purposes: percentiles, heavy hitters, similarity, other stream statistics
- 60. Have we seen this user before?
- 61. Bloom filter
- 62. [Bloom filter diagram: an item i is fed through hash functions h1…hk; each sets one bit in a bit array to 1]
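A minimal Bloom filter along the lines of the diagram; the k bit positions are derived from two halves of a single md5 digest (the classic double-hashing trick), and the sizes are purely illustrative:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = 0  # an m-bit array packed into one big int

    def _positions(self, item):
        # Derive k bit positions from two halves of one hash.
        digest = hashlib.md5(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # No false negatives; false positives possible.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
# "alice" in bf and "bob" in bf are True; unseen keys almost surely test False
```

"Have we seen this user before?" then costs m bits of memory total, at the price of a tunable false-positive rate.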
- 63. How many times did we see a user?
- 64. Count-Min sketch is the answer: bit.ly/CountMinSketch
- 65. [Count-Min diagram: a d × w array of counters; each hash function h1…hd increments one counter in its own row. Estimate: take the minimum of the d values]
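A minimal Count-Min sketch matching the diagram; the per-row hash is md5 salted with the row index, and the width/depth are purely illustrative:

```python
import hashlib

class CountMin:
    def __init__(self, w=1000, d=4):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _pos(self, item, row):
        # One independent-ish hash per row, via a row salt.
        digest = hashlib.md5(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._pos(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum over the
        # d rows is the least-overestimated count: never an undercount.
        return min(self.table[row][self._pos(item, row)]
                   for row in range(self.d))

cm = CountMin()
cm.add("x", 5)
cm.add("y", 3)
# cm.estimate("x") is at least 5, and almost surely exactly 5 here
```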
- 66. Percentiles
- 67. Frugal sketching is not precise enough
- 68. Sorting is pain
- 69. Distribute incoming values to buckets?
- 70. Some sort of clustering, maybe
- 71. T-Digest
- 72. Size is log(n), error is relative to q(1-q)
- 73. Code: bit.ly/T-Digest-Java bit.ly/T-Digest-Python
- 74. This is a growing field of computer science: stay tuned!
- 75. Thanks and happy sketching!
- 76. Reading list: Neustar Research blog: bit.ly/NRsketches Sketches overview: bit.ly/SketchesOverview Lecture notes on streaming algorithms: bit.ly/streaming-lectures
- 77. Bonus: HyperLogLog in SQL: bit.ly/HLLinSQL
