Realtime analytics

probabilistic data structures

Published in: Data & Analytics

  1. Multidimensional probabilistic real-time analytics at scale
       • Valentin Bazarevsky
  2. Questions to the audience
       • HLL
       • MinHash
       • Uniform distribution
       • Inclusion-exclusion principle
       • Bitmap
  3. Web analytics questions
       • How big is your audience?
       • Where does it come from?
       • How active is it?
       • Gender?
       • What browsers / devices?
       • How similar are two audiences?
       • Who is most similar to your audience?
       • What are the dynamics?
  4. Advanced web analytics questions
       • What characteristics will my audience have if I build it by a particular rule?
       • If a KPI can be described by a given rule, give me the audience that fits it better than the others
  5. Numbers
       • 2B cookie profiles
       • 50k segments
       • 35B cookie-segment pairs
       • 150M transaction predicate sets
       • 15 TB of transactional data
       • 50k requests per second

       Segment size   Segments
       > 1k           6k
       > 10k          6k
       > 100k         6k
       > 1M           6k
       > 10M          6k
       > 100M         2k
       > 1B           25
  6. Estimation pipeline
       • HyperLogLogs
       • MinHashes
       • 1% bitmaps
       • 1% and 0.01% samples as sets
  7. Probabilistic data structures landscape
       • HLL, zipped, 2% error: 400 B
       • MinHash: 32 KB
       • 1% bitmap: 2-5 MB
       • 1% sets: depends on set size (in our case up to 150 MB, a rare case)
  8. HyperLogLog intuition
       • Allows estimating the number of unique users in a set
       • The probability that a hashed value has 0 in the first position is 50%
       • Two zeros in a row: 25%
       • Three: 12.5%
       • Etc.
       • What can you say about the set if you know that the longest such run of zeros was 10?
  9. HLL intuition, pt. 2
       • 0011001010100
       • 1010010010100
       • 1101101010100
       • 1100111010100
       • 0111000010100
       • 0101001010100
       • 0001000000100
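The intuition on the two HLL slides can be sketched in a few lines of Python. A run of 10 zeros occurs with probability 2^-10, so seeing one suggests a set on the order of 2^10, or roughly a thousand distinct values; HyperLogLog averages many such observations across registers to reduce the variance. The class below is a minimal illustration, not the packed format used in the talk: `ToyHLL`, `hash64`, the SHA-1 hash, and the register count `p = 12` are all assumptions made for the example.

```python
import hashlib
import math


def hash64(item: str) -> int:
    """Hash an item to a 64-bit integer whose bits behave like fair coin flips."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyHLL:
    """Illustrative HyperLogLog (hypothetical helper, not the speaker's code)."""

    def __init__(self, p: int = 12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = hash64(item)
        index = h >> (64 - self.p)                 # first p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = number of leading zeros in `rest`, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias correction for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * m and empty:               # small-range correction
            return m * math.log(m / empty)
        return raw


hll = ToyHLL()
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates leave the registers unchanged
print(round(hll.estimate()))
```

With 4096 registers the standard error is about 1.04 / sqrt(4096), or roughly 1.6%, which is in the same range as the 2% quoted on the slides.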
  10. Set operations on HLL
       • Union
       • Intersection
       • Subtraction
       • Inclusion-exclusion principle
       • Accuracy degradation
       • Binomial coefficients
  11. Calculation tree transformation
       • An HLL can only be unioned with another HLL
       • To intersect an HLL with another HLL, you need the inclusion-exclusion principle:
       • |A and B| = |A| + |B| - |A or B|; the result is a number, not an HLL
       • So how do we estimate expressions like:
       • (A and B) or C => (A or C) and (B or C)
       • A recursive tree transformation is needed, one that leaves only a single final intersection and subtraction
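The rewrite on this slide can be checked with a small sketch. Plain Python sets stand in for HLLs here (an assumption, so the numbers come out exact), and the code deliberately uses only the two operations an HLL offers: union and cardinality. The function names are hypothetical.

```python
def union_count(*sketches) -> int:
    """Stand-in for 'merge the HLLs, then read the cardinality'."""
    merged = set()
    for sketch in sketches:
        merged |= sketch
    return len(merged)


def count_a_and_b_or_c(a, b, c) -> int:
    """Estimate |(A and B) or C| using unions and counts only.

    Step 1: (A and B) or C  =>  (A or C) and (B or C)
    Step 2: |X and Y| = |X| + |Y| - |X or Y|, with X = A or C and Y = B or C
    """
    x = union_count(a, c)            # |A or C|
    y = union_count(b, c)            # |B or C|
    x_or_y = union_count(a, b, c)    # |(A or C) or (B or C)| = |A or B or C|
    return x + y - x_or_y


a = set(range(0, 60))
b = set(range(40, 100))
c = set(range(90, 120))
print(count_a_and_b_or_c(a, b, c), len((a & b) | c))
```

With real HLLs each of the three counts carries its own error, and the subtraction can amplify it when the result is small compared with the unions, which is the accuracy degradation the deck mentions.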
  12. MinHash vs K Min Values
       • Jaccard index: J(A, B) = |A and B| / |A or B|
       • Sampling ratio normalization
       • Cardinality estimation via KMinValues
       • Accuracy degrades when the estimated result is much smaller than the bigger set
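A K Min Values sketch, as named on this slide, can be illustrated as follows. This is a minimal sketch under stated assumptions: SHA-1 as the hash, `k = 1024` (smaller than the 8k mentioned later in the deck, to keep the example fast), and hypothetical function names. It keeps the k smallest hash values of a set, estimates cardinality from the k-th smallest value, and estimates the Jaccard index from the k smallest values of the merged sketches.

```python
import hashlib

HASH_SPACE = 1 << 64


def hash64(item: str) -> int:
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def kmv_sketch(items, k: int = 1024):
    """Keep the k smallest distinct hash values, in ascending order."""
    return sorted({hash64(x) for x in items})[:k]


def kmv_union(s1, s2, k: int = 1024):
    """The k smallest values of the union are found in the two sketches."""
    return sorted(set(s1) | set(s2))[:k]


def kmv_cardinality(sketch, k: int = 1024) -> float:
    if len(sketch) < k:                # fewer than k values: the count is exact
        return float(len(sketch))
    return (k - 1) * HASH_SPACE / sketch[k - 1]


def kmv_jaccard(s1, s2, k: int = 1024) -> float:
    merged = kmv_union(s1, s2, k)
    if not merged:
        return 0.0
    in1, in2 = set(s1), set(s2)
    shared = sum(1 for v in merged if v in in1 and v in in2)
    return shared / len(merged)


def kmv_intersection(s1, s2, k: int = 1024) -> float:
    """|A and B| is estimated as J(A, B) * |A or B|."""
    return kmv_jaccard(s1, s2, k) * kmv_cardinality(kmv_union(s1, s2, k), k)


sk_a = kmv_sketch(f"user-{i}" for i in range(20_000))
sk_b = kmv_sketch(f"user-{i}" for i in range(10_000, 30_000))
print(kmv_jaccard(sk_a, sk_b), round(kmv_intersection(sk_a, sk_b)))
```

When the intersection is tiny relative to the union, only a handful of the k retained values fall inside it, so the relative error of the estimate grows; this matches the degradation noted in the last bullet.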
  13. Bitmaps
       • Each bit corresponds to a particular set item
       • Good estimation accuracy and performance
       • Memory-inefficient if the underlying set is small
       • Requires a mapping from element id to sequence number in the bitmap (a sync challenge for a distributed application)
       • Improvement: compressed bitmaps
       • Still a big overhead, as we need to store all the items
  14. Sampled audience as sets
       • Huge memory consumption for big audiences
       • Set operation performance depends on the smaller set
       • So operations on two big sets are slow
       • Resample big sets to 0.01% and use that sample only when all sets in the equation are big
       • No need to store the id-to-sequence-number mapping
       • Efficient for small audiences
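One way to make sampled sets composable is to sample by a hash of the user id, so that a given user is either in every sample or in none. The slides do not say how the samples are drawn, so this scheme, the SHA-1 hash, and the names `bucket`, `sample`, and `estimate_intersection` are assumptions for illustration. Under this scheme the intersection of two samples is itself a sample of the intersection, and the 0.01% sample is a subset of the 1% sample.

```python
import hashlib

BUCKETS = 10_000   # one bucket corresponds to 0.01% of the id space


def bucket(user_id: str) -> int:
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def sample(audience, keep_buckets: int):
    """keep_buckets=100 keeps about 1%; keep_buckets=1 keeps about 0.01%."""
    return {u for u in audience if bucket(u) < keep_buckets}


def estimate_intersection(sample_a, sample_b, keep_buckets: int) -> float:
    """Scale the sampled intersection back up by the sampling rate."""
    return len(sample_a & sample_b) * BUCKETS / keep_buckets


# Two audiences of 200k users that share 100k users.
one_pct_a = sample((f"user-{i}" for i in range(200_000)), 100)
one_pct_b = sample((f"user-{i}" for i in range(100_000, 300_000)), 100)
print(round(estimate_intersection(one_pct_a, one_pct_b, 100)))

# Resampling a 1% sample down to 0.01% needs no access to the full audience.
tiny_a = sample(one_pct_a, 1)
```

The 0.01% sample is only useful when every set in the expression is large; for small audiences it would contain too few users to give a stable estimate.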
  15. To sum up (2B audience)
       • HLL
           Size: 2 KB (400 B packed)
           Accuracy: 2% on average for cardinality
           Restrictions: significant degradation if set sizes differ by more than 10 times
           Supported operations: union natively; intersect and subtract via the inclusion-exclusion principle. Not every calculation tree can be estimated.
       • MinHash (8k)
           Size: 32 KB
           Accuracy: 2% if sets cardinality less than 100
           Restrictions: set sizes differ by more than 1000 times
           Supported operations: union, intersect, subtract. Recursive disjoint and intersection lead to accuracy degradation. Requires tree transformation.
       • Bitmaps 1%
           Size: 5 MB
           Accuracy: 2% if set size > 10k
           Restrictions: lots of extra data for big sets if there is no need to intersect with small ones
           Supported operations: union, intersect, subtract
       • Sets (1% + 0.01%)
           Size: 0-200 MB
           Accuracy: 2% if set size > 10k
           Restrictions: lots of extra data for big sets if there is no need to intersect with small ones
           Supported operations: union, intersect, subtract
  16. Combination of different approaches
       • HLL + MH
       • Use MH for intersection and subtraction
       • Bitmaps + sets
       • I.e. sparse and dense representations of a set
       • Store items as sets, then convert them to bitmaps after a certain threshold
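The sparse-to-dense switch described on this slide might look like the sketch below. It is an illustration under assumptions: elements are already mapped to integer sequence numbers, a Python `int` serves as the bitmap, and the class name `HybridSet` and its threshold are hypothetical.

```python
class HybridSet:
    """Stores items as a set, then converts to a bitmap past a threshold."""

    def __init__(self, threshold: int = 1000):
        self.threshold = threshold
        self.items = set()    # sparse representation
        self.bits = None      # dense representation: bit i set means i is present

    @property
    def is_dense(self) -> bool:
        return self.bits is not None

    def add(self, seq_no: int) -> None:
        if self.is_dense:
            self.bits |= 1 << seq_no
            return
        self.items.add(seq_no)
        if len(self.items) > self.threshold:
            # Convert once; from here on only the bitmap is kept.
            self.bits = 0
            for i in self.items:
                self.bits |= 1 << i
            self.items = None

    def __len__(self) -> int:
        if self.is_dense:
            return bin(self.bits).count("1")
        return len(self.items)

    def intersection_count(self, other: "HybridSet") -> int:
        if self.is_dense and other.is_dense:
            return bin(self.bits & other.bits).count("1")
        if not self.is_dense and not other.is_dense:
            return len(self.items & other.items)
        # Mixed case: probe the bitmap once per item of the sparse side.
        sparse, dense = (other, self) if self.is_dense else (self, other)
        return sum(1 for i in sparse.items if (dense.bits >> i) & 1)


big = HybridSet(threshold=10)
for i in range(0, 100, 2):
    big.add(i)
small = HybridSet(threshold=10)
for i in (0, 2, 3, 98):
    small.add(i)
print(big.is_dense, small.is_dense, big.intersection_count(small))
```

In the mixed case the cost is proportional to the size of the smaller set, which is the property the deck highlights for set-based storage. A production system would more likely use a compressed bitmap library than a Python integer.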
  17. What we store
       • Segment data (near real time)
       • Segment stats per day (HLL + MinHash): 14 GB, 1 GB per day
       • Affinities report (daily recount + near-real-time deltas)
       • 1% sample bitmap (no compression, in Redis, 190 GB)
       • 1% + 0.01% sample sets (40 GB)
       • Transaction predicate sets (daily)
       • HLL (compressed; 150M HLLs in 40 GB)
  18. Questions?
