2. Questions to the audience
HLL
MinHash
Uniform distribution
Inclusion-Exclusion principle
Bitmap
3. Web Analytics Questions
How big is your audience?
From where?
How active?
Gender?
What browsers / devices?
How similar are the audiences?
Who is most similar to your audience?
What are the dynamics?
4. Advanced Web Analytics Questions
What characteristics will my audience have if I build it by a particular rule?
If a KPI can be described by a given rule, find the audience that fits it better than others
7. Probabilistic data structures landscape
HLL, zipped, 2% error – 400 B
MinHash – 32 KB
1% bitmap – 2–5 MB
1% sets – depends on the set size
(in our case up to 150 MB, a rare case)
8. HyperLogLog intuition
Allows estimating the number of unique users in a set
The probability that a hash value has a 0 in the first bit position is 50%
Two zeros in a row: 25%
Three: 12.5%
Etc.
What can you say about the set if you know that the maximal run of zeros was 10?
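Intuitively, if the longest run of leading zeros you have seen is 10, you have probably seen on the order of 2^10 distinct hashes. A minimal Python sketch of this idea (a toy HLL; the hash function, bucket count, and bias-correction constant are illustrative assumptions, not the production implementation):

import hashlib

def leading_zeros(x, width):
    # count leading zeros of x within a fixed bit width
    return width - x.bit_length()

def hll_estimate(items, p=12):
    m = 1 << p                       # 4096 buckets
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], 'big')
        bucket = h & (m - 1)         # low p bits choose the bucket
        rest = h >> p                # remaining bits feed the zero-run counter
        rank = leading_zeros(rest, 64 - p) + 1   # position of the first 1-bit
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias correction for large m
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(hll_estimate(range(1_000_000)))            # ~1,000,000 within a few %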
10. Set operations on HLL
Union
Intersection
Subtraction
Inclusion-exclusion principle
Accuracy degradation
Binomial coefficients
11. Calculation tree transformation
An HLL can be unioned only with another HLL
To intersect one HLL with another, you need the inclusion-exclusion principle:
|A and B| = |A| + |B| - |A or B| (this yields a number, not an HLL)
So how do we estimate expressions like (A and B) or C?
Transform them: (A and B) or C => (A or C) and (B or C)
A recursive tree transformation is needed, leaving only one final intersection or subtraction
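A sketch of this transformation in Python, using plain sets as stand-ins for HLLs (only union and cardinality are used on the sketches, which is exactly what a real HLL offers):

def card(s):                 # with real HLLs: the HLL cardinality estimate
    return len(s)

def union(a, b):             # the only set operation an HLL supports natively
    return a | b

def intersect_card(a, b):
    # inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
    # note: this yields a number, not a new sketch
    return card(a) + card(b) - card(union(a, b))

A = set(range(0, 600))
B = set(range(300, 900))
C = set(range(800, 1000))

# To estimate |(A and B) or C|, rewrite it with the distributive law:
# (A and B) or C = (A or C) and (B or C), leaving one final intersection.
estimate = intersect_card(union(A, C), union(B, C))
print(estimate, len((A & B) | C))   # 500 500 (exact here; approximate with HLLs)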
12. MinHash vs K Min Values
Jaccard index: J(A, B) = |A and B| / |A or B|
Sampling ratio normalization
Cardinality estimation via KMinValues
Accuracy degrades when the estimated result is much smaller than the bigger set
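A minimal K Min Values sketch, assuming md5 as the hash and k = 1024: the k smallest normalized hash values yield a cardinality estimate via the k-th minimum, and a Jaccard estimate by checking how many of the union's k minima occur in both sketches:

import hashlib

K = 1024

def norm_hash(item):
    # map an item to a pseudo-uniform float in (0, 1)
    h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], 'big')
    return (h + 1) / 2**64

def kmv(items):
    # the sketch: the K smallest distinct hash values of the set
    return sorted({norm_hash(x) for x in items})[:K]

def cardinality(sketch):
    # the K-th smallest of n uniform values is ~ K/n, so n ~ (K-1)/kth_min
    return (len(sketch) - 1) / sketch[-1] if len(sketch) == K else len(sketch)

def intersect_estimate(s1, s2):
    merged = sorted(set(s1) | set(s2))[:K]        # K smallest of the union
    in_both = set(s1) & set(s2)
    jaccard = sum(1 for v in merged if v in in_both) / len(merged)
    return jaccard * cardinality(merged)          # |A and B| = J * |A or B|

A = kmv(range(0, 20_000))
B = kmv(range(10_000, 30_000))
print(cardinality(A))            # ~20,000
print(intersect_estimate(A, B))  # ~10,000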
13. Bitmaps
Each bit corresponds to a particular set item
Good estimation accuracy and performance
Not memory-efficient if the underlying set is small
A mapping from element id to sequence number in the bitmap is required (a sync challenge for distributed applications)
Improvement: compressed bitmaps
Still a big overhead, as we need to store all the items
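An illustrative bitmap in Python (ints as bitsets; a real deployment would use compressed bitmaps), including the id-to-sequence-number mapping discussed above:

class Bitmap:
    # Python ints as bit sets: bit i set <=> item with sequence number i present
    def __init__(self, bits=0):
        self.bits = bits

    def add(self, index):
        self.bits |= 1 << index

    def __or__(self, other):            # union
        return Bitmap(self.bits | other.bits)

    def __and__(self, other):           # intersection
        return Bitmap(self.bits & other.bits)

    def __sub__(self, other):           # subtraction
        return Bitmap(self.bits & ~other.bits)

    def count(self):
        return bin(self.bits).count('1')

# the element-id -> sequence-number mapping that must stay in sync across nodes
mapping = {}
def index_of(user_id):
    return mapping.setdefault(user_id, len(mapping))

a, b = Bitmap(), Bitmap()
for uid in ('u1', 'u2', 'u3'):
    a.add(index_of(uid))
for uid in ('u2', 'u3', 'u4'):
    b.add(index_of(uid))
print((a & b).count(), (a | b).count(), (a - b).count())   # 2 4 1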
14. Sampled audience as Sets
Huge memory consumption for big audiences
Set operation performance depends on the smaller set
So operations on two big sets are slow
Resample big sets down to 0.01% and use that sample only when all sets in the equation are big
No need to store an id-to-sequence-number mapping
Efficient for small audiences
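A sketch of the sampling approach (rates and hash choice are assumptions): deterministic hash-based sampling puts a given user into every audience's sample or into none, so intersecting two samples and rescaling estimates the true intersection:

import hashlib

def sample(user_ids, rate=0.01):
    # deterministic hash sampling: a given user lands in every audience's
    # sample or in none, so samples of different audiences stay comparable
    def keep(uid):
        h = int.from_bytes(hashlib.md5(str(uid).encode()).digest()[:8], 'big')
        return h / 2**64 < rate
    return {u for u in user_ids if keep(u)}

A = sample(range(0, 400_000))
B = sample(range(200_000, 600_000))
print(len(A & B) / 0.01)   # ~200,000: the true intersection size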
15. To sum up (2B audience)
HLL:
Size: 2 KB (400 B packed)
Accuracy: 2% on average for cardinality
Restrictions: significant degradation if set sizes differ by more than 10 times
Supported operations: union natively; intersect and subtract via the inclusion-exclusion principle; not every calculation tree can be estimated
MinHash (8k):
Size: 32 KB
Accuracy: 2% if the sets' cardinality ratio is less than 100
Restrictions: degradation if set sizes differ by more than 1000 times
Supported operations: union, intersect, subtract; recursive disjoint and intersection lead to accuracy degradation; requires tree transformation
Bitmaps 1%:
Size: 5 MB
Accuracy: 2% if set size > 10k
Restrictions: lots of extra data for big sets if there is no need to intersect them with small ones
Supported operations: union, intersect, subtract
Sets (1% + 0.01%):
Size: 0–200 MB
Accuracy: 2% if set size > 10k
Restrictions: lots of extra data for big sets if there is no need to intersect them with small ones
Supported operations: union, intersect, subtract
16. Combination of different approaches
HLL + MinHash
Use MinHash for intersection and subtraction
Bitmaps + Sets
i.e. sparse and dense representations of a set
Store items as sets, then convert them to bitmaps after a certain threshold
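A sketch of such a sparse/dense hybrid (the threshold value is hypothetical): keep items in a plain set while the audience is small, and densify into a bitmap once it crosses the threshold:

THRESHOLD = 4096   # hypothetical; pick it where a bitmap gets cheaper than a set

class HybridSet:
    def __init__(self):
        self.items = set()    # sparse representation: explicit ids
        self.bits = None      # dense representation: int as a bitset

    def add(self, index):
        if self.bits is not None:
            self.bits |= 1 << index
            return
        self.items.add(index)
        if len(self.items) > THRESHOLD:
            # densify: switch to one bit per possible sequence number
            bits = 0
            for i in self.items:
                bits |= 1 << i
            self.bits, self.items = bits, None

h = HybridSet()
for i in range(10_000):
    h.add(i)
print('dense' if h.bits is not None else 'sparse')   # dense after the threshold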
17. What we store
Segment data (near-realtime)
Segment stats per day (HLL + MinHash): 14 GB, 1 GB per day
Affinity report (daily recount + near-realtime deltas)
1% sample bitmap (no compression in Redis, 190 GB)
1% + 0.01% sample sets (40 GB)
Transaction predicate sets (daily)
HLL (compressed: 150M HLLs in 40 GB)