Pavel Kalaidin

Jan. 30, 2015•0 likes•1,045 views

My slides from Highload Strategy conference in Vilnius.

- 1. Large-scale real-time analytics for everyone: fast, cheap and 98% correct
- 3. We have a lot of data; memory is limited; one pass would be great; constant update time
- 4. max, min, mean are trivial
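
To make the "one pass, constant update time" requirement concrete, here is a minimal sketch of computing min, max and mean over a stream while keeping only a handful of numbers in memory. The function name `running_stats` is illustrative and does not come from the slides.

```python
def running_stats(stream):
    """Return (minimum, maximum, mean) of a non-empty stream in one pass.

    Only four values are kept in memory, regardless of stream length.
    """
    count = 0
    total = 0.0
    lo = hi = None
    for val in stream:
        count += 1
        total += val
        lo = val if lo is None or val < lo else lo
        hi = val if hi is None or val > hi else hi
    if count == 0:
        raise ValueError("stream is empty")
    return lo, hi, total / count
```

Order statistics such as the median do not fit this pattern, because they cannot be updated exactly from a fixed number of accumulators.
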
- 6. Sampling?
- 8. An estimate is OK, but it is nice to know how the error is distributed
- 9.
  ```python
  def frugal(stream):
      m = 0
      for val in stream:
          if val > m:
              m += 1
          elif val < m:
              m -= 1
      return m
  ```
- 10. Memory used by `frugal`: 1 int! It really works
- 12. Percentiles?
- 13. Demo: bit.ly/frugalsketch
  ```python
  import numpy as np

  def frugal_1u(stream, m=0, q=0.5):
      # m is the running estimate of the q-th quantile
      for val in stream:
          r = np.random.random()
          if val > m and r > 1 - q:
              m += 1
          elif val < m and r > q:
              m -= 1
      return m
  ```
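
One way to sanity-check the frugal estimator is to run it on a stream whose quantiles are known. The sketch below is not the linked demo; it is a small reimplementation of the same update rule under stated assumptions: the standard library's `random.Random` replaces NumPy so that runs can be seeded, and the stream is uniform integers in [0, 1000). The names `frugal_quantile` and `demo` are illustrative.

```python
import random


def frugal_quantile(stream, q=0.5, m=0, rng=None):
    """Same update rule as frugal_1u, with an injectable random source."""
    rng = rng or random.Random()
    for val in stream:
        r = rng.random()
        if val > m and r > 1 - q:
            m += 1
        elif val < m and r > q:
            m -= 1
    return m


def demo(q, n=200_000, seed=42):
    """Estimate the q-th quantile of n uniform integers in [0, 1000)."""
    data_rng = random.Random(seed)
    stream = (data_rng.randrange(1000) for _ in range(n))
    return frugal_quantile(stream, q=q, rng=random.Random(seed + 1))


if __name__ == "__main__":
    for q in (0.5, 0.9):
        print(q, demo(q))  # expected to land near 1000 * q
```

Because each update moves the estimate by exactly 1, the estimator needs a number of items comparable to the distance between the starting value and the true quantile before it settles.
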
- 14. Streaming + probabilistic = sketch
- 15. What do we want? Get the number of unique users (aka cardinality)
- 16. What do we want? Get the number of unique users grouped by host, date, segment
- 17. When do we want it? Well, right now
- 20. Hash-table: 4Gb
- 22. It all starts with an algorithm called LogLog
- 23. Imagine I tell you I spent this morning flipping a coin
- 24. and now tell you the longest uninterrupted run of heads
- 26. When did I flip the coin for a longer time?
- 27. We are interested in patterns in hashes (namely the longest runs of leading zeros = heads)
- 28. Hash, don’t sample!* (*need a good hash function)
- 29. Expecting:
  - 0xxxxxx hashes: ~50%
  - 1xxxxxx hashes: ~50%
  - 00xxxxx hashes: ~25%
- 30. Estimate: 2^R, where R is the longest run of leading zeros in hashes
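
The coin-flipping intuition can be tried out directly: hash every item, track the longest run of leading zeros, and return 2^R. The sketch below makes some assumptions that are not in the slides: SHA-1 truncated to 32 bits stands in for "a good hash function", and the helper names are illustrative. A single run like this is very noisy and can only ever return a power of two.

```python
import hashlib

HASH_BITS = 32  # assumption: work with 32-bit hashes


def hash32(item):
    """Map an item to a 32-bit integer using truncated SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def leading_zeros(x, bits=HASH_BITS):
    """Number of leading zero bits in x when written with `bits` bits."""
    return bits - x.bit_length()


def crude_estimate(items):
    """Estimate the number of distinct items as 2**R (very noisy)."""
    longest = max((leading_zeros(hash32(it)) for it in items), default=0)
    return 2 ** longest
```

Duplicates hash to the same value, so they never change R. This is why the estimate tracks the number of distinct items rather than the length of the stream.
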
- 31. I can perform several flipping experiments
- 32. and average the number of zeros
- 33. This is called stochastic averaging
- 34. So far the estimate is 2^R, where R is the longest run of leading zeros in hashes
- 35. We will be using M buckets
- 36. where ɑ is a normalization constant
- 38. LogLog -> SuperLogLog -> HyperLogLog: arithmetic mean -> harmonic mean, plus a couple of tweaks
- 39. Standard error is 1.04/sqrt(M), where M is the number of buckets
- 40. LogLog -> SuperLogLog -> HyperLogLog -> HyperLogLog++ (Google, 2013): 32 bit -> 64 bit + fixes for low cardinality. bit.ly/HLLGoogle
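
Putting the pieces together (buckets, the longest run of zeros per bucket, a harmonic mean and a normalization constant), a compact HyperLogLog might look like the sketch below. It is an illustration under stated assumptions, not the speaker's implementation: SHA-1 truncated to 64 bits is the hash, the registers store the position of the first 1-bit (leading zeros plus one), the constant 0.7213/(1 + 1.079/M) is the published approximation for M >= 128, and linear counting is used as the small-cardinality fix. The class name `HyperLogLog` and the parameter `p` are illustrative.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10):
        # p index bits give M = 2**p buckets; the alpha formula assumes p >= 7
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")       # 64-bit hash
        bucket = x >> (64 - self.p)                 # first p bits pick the bucket
        rest = x & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[bucket]:
            self.registers[bucket] = rank

    def estimate(self):
        # normalized harmonic mean of 2**register over all buckets
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction: linear counting on empty buckets
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog(p=10)
    for user_id in range(50_000):
        hll.add(user_id)
    print(round(hll.estimate()))  # typically within a few percent of 50000
```

With p=10 there are M = 1024 buckets, so the expected standard error is about 1.04/sqrt(1024), roughly 3%.
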
- 42. Large scale?
- 43. Suppose we have two HLL sketches; let’s take the maximum value from corresponding buckets
- 44. Resulting sketch has no loss in accuracy!
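
The "no loss in accuracy" claim can be checked mechanically: merging two sketches bucket by bucket gives exactly the registers you would get by feeding both streams into one sketch. A minimal sketch, assuming SHA-1 truncated to 64 bits as the hash and 256 buckets, with illustrative helper names `build_registers` and `merge`:

```python
import hashlib

P = 8  # assumption: 2**8 = 256 buckets


def build_registers(items, p=P):
    """Build HLL-style registers: per bucket, the max (leading zeros + 1)."""
    regs = [0] * (1 << p)
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        bucket = x >> (64 - p)
        rest = x & ((1 << (64 - p)) - 1)
        rank = (64 - p) - rest.bit_length() + 1
        regs[bucket] = max(regs[bucket], rank)
    return regs


def merge(a, b):
    """Union of two sketches: element-wise maximum of their registers."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of buckets")
    return [max(x, y) for x, y in zip(a, b)]


if __name__ == "__main__":
    monday = build_registers(range(0, 6000))
    tuesday = build_registers(range(4000, 10000))
    both_days = build_registers(range(0, 10000))
    print(merge(monday, tuesday) == both_days)
```

This is what makes the approach work at large scale: sketches can be built independently per host or per day and combined later, as long as they share the hash function and bucket count.
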
- 45. What do we want? How many unique users belong to two segments?
- 46. HLL intersection
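
The transcript names the operation but does not show how it is computed. One common approach, assumed here rather than taken from the slides, is inclusion-exclusion: estimate both segments and their union (the union comes from merging the two sketches), then subtract. The function name is illustrative.

```python
def intersection_estimate(card_a, card_b, card_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.

    All three inputs are themselves estimates, so the result is clamped
    at zero to avoid reporting a negative count.
    """
    return max(0.0, card_a + card_b - card_union)
```

The errors of the three estimates add up, so the absolute error scales with the sizes of the segments and their union rather than with the intersection. A small overlap between two large segments is therefore estimated poorly.
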
- 51. What do we want? Get the churn rate
- 52. Straightforward: feed new data to a new sketch
- 54. We maintain a list of tuples (timestamp, R), where R is a possible maximum over future time
- 55. Values that no longer make sense are automatically discarded from the list
- 57. One list per bucket
- 58. Take the maximum R over the given timeframe from the past, then estimate as we do in a regular HLL
- 59. Extra memory is required
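
The per-bucket list of (timestamp, R) tuples can be sketched as follows. This is an illustration under stated assumptions, not the speaker's code: timestamps arrive in non-decreasing order, queried timeframes are no longer than the configured window, and the class name `SlidingBucket` is illustrative. An older entry is discarded when it has left the window, or when a newer entry has an R at least as large, because the older one can then never be the maximum of any future query.

```python
from collections import deque


class SlidingBucket:
    """One HLL bucket that can answer 'max R over the last T time units'."""

    def __init__(self, window):
        self.window = window
        # (timestamp, R) pairs: timestamps increase, R strictly decreases
        self.entries = deque()

    def add(self, timestamp, r):
        # discard entries that fell out of the window
        while self.entries and self.entries[0][0] < timestamp - self.window:
            self.entries.popleft()
        # discard older entries dominated by the new value
        while self.entries and self.entries[-1][1] <= r:
            self.entries.pop()
        self.entries.append((timestamp, r))

    def max_r(self, now, timeframe):
        """Largest R seen in [now - timeframe, now], or 0 if none."""
        start = now - timeframe
        for ts, r in self.entries:
            if ts >= start:
                return r  # R decreases along the list, so the first hit is the max
        return 0
```

A full sliding sketch keeps one such object per bucket, collects `max_r` from each and plugs the values into the regular HLL estimate. The extra memory is the length of these lists.
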
- 61. Hash, don’t sample; estimate, not precise; save memory; streaming. This slide is the sketch of the talk
- 63. Lots of sketches for various purposes: percentiles, heavy hitters, similarity, other stream statistics
- 64. Have we seen this user before?
- 65. Bloom filter
- 66. [Figure: Bloom filter bit array; item i is hashed by h1, h2, ..., hk and the bits at those positions are set to 1]
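
A Bloom filter answers "have we seen this user before?" with k hash functions and a bit array: adding an item sets k bits, and a lookup checks that all k are set. The sketch below is a minimal illustration with assumed details: the k hash functions are derived from SHA-1 by prefixing a one-byte seed (so `num_hashes` must stay below 256), the defaults are arbitrary, and one byte per bit is used for readability rather than compactness.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        data = str(item).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("alice")
    print("alice" in seen)  # True: added items are always found
```

The filter never produces false negatives. The false-positive rate grows as the bit array fills up.
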
- 67. How many times did we see a user?
- 68. Count-Min sketch is the answer: bit.ly/CountMinSketch
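- 69. [Figure: Count-Min sketch as a table of width w and depth d; item i adds +1 to one counter per row, chosen by hash functions h1, ..., hd] Estimate: take the minimum of the d values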
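
The Count-Min update and query can be written in a few lines. The sketch below is a minimal illustration, not the linked implementation. Its assumptions: the d row hashes are derived from SHA-1 with a one-byte row prefix (so `depth` must stay below 256), and the default width and depth are arbitrary.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        data = str(item).encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.sha1(bytes([row]) + data).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # increment one counter in every row
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # collisions only inflate counters, so the minimum is the best guess
        return min(self.table[row][col] for row, col in self._cells(item))
```

Because counters are only ever incremented, the estimate can overcount but never undercount.
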
- 70. Percentiles
- 71. Frugal sketching is not precise enough
- 72. Sorting is a pain
- 73. Distribute incoming values to buckets?
- 74. Some sort of clustering, maybe
- 75. T-Digest
- 77. Size is log(n), error is relative to q(1-q)
- 79. This is a growing field of computer science: stay tuned!
- 82. Reading list:
  - Neustar Research blog: bit.ly/NRsketches
  - Sketches overview: bit.ly/SketchesOverview
  - Lecture notes on streaming algorithms: bit.ly/streaming-lectures