2. Questions to the audience
HLL
MinHash
Uniform distribution
Inclusion-Exclusion principle
Bitmap
3. Web Analytics Questions
How big is your audience?
From where?
How active?
Gender?
What browsers / devices?
How similar are the audiences?
Who is most similar to your audience?
What are the dynamics?
4. Advanced Web Analytics Questions
What characteristics will my audience have if I build it by a particular rule?
If a KPI can be described by a given rule, find the audience that fits it better than others
7. Probabilistic data structures landscape
HLL, zipped, 2% error – 400 B
MinHash – 32 KB
1% bitmap – 2–5 MB
1% sets – depends on the set size
(in our case up to 150 MB, a rare case)
8. HyperLogLog intuition
Allows estimating the number of unique users in a set
The probability that a hash value has a 0 in the first bit position is 50%
Two zeros in a row: 25%
Three: 12.5%
Etc.
What can you say about the set if you know that the maximal run of zeros was 10?
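Intuitively, if the longest run of leading zeros you have seen is 10, you have probably seen on the order of 2^10 distinct hashes. A minimal Python sketch of this idea (a toy HLL; the hash function, bucket count, and bias-correction constant are illustrative assumptions, not the production implementation):

import hashlib

def leading_zeros(x, width):
    # count leading zeros of x within a fixed bit width
    return width - x.bit_length()

def hll_estimate(items, p=12):
    m = 1 << p                       # 4096 buckets
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], 'big')
        bucket = h & (m - 1)         # low p bits choose the bucket
        rest = h >> p                # remaining bits feed the zero-run counter
        rank = leading_zeros(rest, 64 - p) + 1   # position of the first 1-bit
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias correction for large m
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(hll_estimate(range(1_000_000)))            # ~1,000,000 within a few %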
10. Set operations on HLL
Union
Intersection
Subtraction
Inclusion-exclusion principle
Accuracy degradation
Binomial coefficients
11. Calculation tree transformation
An HLL can be unioned only with another HLL
To intersect one HLL with another, you need the inclusion-exclusion principle:
|A and B| = |A| + |B| - |A or B| (this yields a number, not an HLL)
So how do we estimate expressions like (A and B) or C?
Transform them: (A and B) or C => (A or C) and (B or C)
A recursive tree transformation is needed, leaving only one final intersection or subtraction
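A sketch of this transformation in Python, using plain sets as stand-ins for HLLs (only union and cardinality are used on the sketches, which is exactly what a real HLL offers):

def card(s):                 # with real HLLs: the HLL cardinality estimate
    return len(s)

def union(a, b):             # the only set operation an HLL supports natively
    return a | b

def intersect_card(a, b):
    # inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
    # note: this yields a number, not a new sketch
    return card(a) + card(b) - card(union(a, b))

A = set(range(0, 600))
B = set(range(300, 900))
C = set(range(800, 1000))

# To estimate |(A and B) or C|, rewrite it with the distributive law:
# (A and B) or C = (A or C) and (B or C), leaving one final intersection.
estimate = intersect_card(union(A, C), union(B, C))
print(estimate, len((A & B) | C))   # 500 500 (exact here; approximate with HLLs)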
12. MinHash vs K Min Values
Jaccard index: J(A, B) = |A and B| / |A or B|
Sampling ratio normalization
Cardinality estimation via KMinValues
Accuracy degrades when the estimated result is much smaller than the bigger set
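A minimal K Min Values sketch, assuming md5 as the hash and k = 1024: the k smallest normalized hash values yield a cardinality estimate via the k-th minimum, and a Jaccard estimate by checking how many of the union's k minima occur in both sketches:

import hashlib

K = 1024

def norm_hash(item):
    # map an item to a pseudo-uniform float in (0, 1)
    h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], 'big')
    return (h + 1) / 2**64

def kmv(items):
    # the sketch: the K smallest distinct hash values of the set
    return sorted({norm_hash(x) for x in items})[:K]

def cardinality(sketch):
    # the K-th smallest of n uniform values is ~ K/n, so n ~ (K-1)/kth_min
    return (len(sketch) - 1) / sketch[-1] if len(sketch) == K else len(sketch)

def intersect_estimate(s1, s2):
    merged = sorted(set(s1) | set(s2))[:K]        # K smallest of the union
    in_both = set(s1) & set(s2)
    jaccard = sum(1 for v in merged if v in in_both) / len(merged)
    return jaccard * cardinality(merged)          # |A and B| = J * |A or B|

A = kmv(range(0, 20_000))
B = kmv(range(10_000, 30_000))
print(cardinality(A))            # ~20,000
print(intersect_estimate(A, B))  # ~10,000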
13. Bitmaps
Each bit corresponds to a particular set item
Good estimation accuracy and performance
Not memory-efficient if the underlying set is small
A mapping from element id to sequence number in the bitmap is required (a sync challenge for distributed applications)
Improvement: compressed bitmaps
Still a big overhead, as we need to store all the items
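An illustrative bitmap in Python (ints as bitsets; a real deployment would use compressed bitmaps), including the id-to-sequence-number mapping discussed above:

class Bitmap:
    # Python ints as bit sets: bit i set <=> item with sequence number i present
    def __init__(self, bits=0):
        self.bits = bits

    def add(self, index):
        self.bits |= 1 << index

    def __or__(self, other):            # union
        return Bitmap(self.bits | other.bits)

    def __and__(self, other):           # intersection
        return Bitmap(self.bits & other.bits)

    def __sub__(self, other):           # subtraction
        return Bitmap(self.bits & ~other.bits)

    def count(self):
        return bin(self.bits).count('1')

# the element-id -> sequence-number mapping that must stay in sync across nodes
mapping = {}
def index_of(user_id):
    return mapping.setdefault(user_id, len(mapping))

a, b = Bitmap(), Bitmap()
for uid in ('u1', 'u2', 'u3'):
    a.add(index_of(uid))
for uid in ('u2', 'u3', 'u4'):
    b.add(index_of(uid))
print((a & b).count(), (a | b).count(), (a - b).count())   # 2 4 1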
14. Sampled audience as Sets
Huge memory consumption for big audiences
Set operation performance depends on the smaller set
So operations on two big sets are slow
Resample big sets down to 0.01% and use that sample only when all sets in the equation are big
No need to store an id-to-sequence-number mapping
Efficient for small audiences
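A sketch of the sampling approach (rates and hash choice are assumptions): deterministic hash-based sampling puts a given user into every audience's sample or into none, so intersecting two samples and rescaling estimates the true intersection:

import hashlib

def sample(user_ids, rate=0.01):
    # deterministic hash sampling: a given user lands in every audience's
    # sample or in none, so samples of different audiences stay comparable
    def keep(uid):
        h = int.from_bytes(hashlib.md5(str(uid).encode()).digest()[:8], 'big')
        return h / 2**64 < rate
    return {u for u in user_ids if keep(u)}

A = sample(range(0, 400_000))
B = sample(range(200_000, 600_000))
print(len(A & B) / 0.01)   # ~200,000: the true intersection size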
15. To sum up (2B audience)
HLL:
Size: 2 KB (400 B packed)
Accuracy: 2% on average for cardinality
Restrictions: significant degradation if set sizes differ by more than 10 times
Supported operations: union natively; intersect and subtract via the inclusion-exclusion principle; not every calculation tree can be estimated
MinHash (8k):
Size: 32 KB
Accuracy: 2% if the sets' cardinality ratio is less than 100
Restrictions: degradation if set sizes differ by more than 1000 times
Supported operations: union, intersect, subtract; recursive disjoint and intersection lead to accuracy degradation; requires tree transformation
Bitmaps 1%:
Size: 5 MB
Accuracy: 2% if set size > 10k
Restrictions: lots of extra data for big sets if there is no need to intersect them with small ones
Supported operations: union, intersect, subtract
Sets (1% + 0.01%):
Size: 0–200 MB
Accuracy: 2% if set size > 10k
Restrictions: lots of extra data for big sets if there is no need to intersect them with small ones
Supported operations: union, intersect, subtract
16. Combination of different approaches
HLL + MinHash
Use MinHash for intersection and subtraction
Bitmaps + Sets
i.e. sparse and dense representations of a set
Store items as sets, then convert them to bitmaps after a certain threshold
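A sketch of such a sparse/dense hybrid (the threshold value is hypothetical): keep items in a plain set while the audience is small, and densify into a bitmap once it crosses the threshold:

THRESHOLD = 4096   # hypothetical; pick it where a bitmap gets cheaper than a set

class HybridSet:
    def __init__(self):
        self.items = set()    # sparse representation: explicit ids
        self.bits = None      # dense representation: int as a bitset

    def add(self, index):
        if self.bits is not None:
            self.bits |= 1 << index
            return
        self.items.add(index)
        if len(self.items) > THRESHOLD:
            # densify: switch to one bit per possible sequence number
            bits = 0
            for i in self.items:
                bits |= 1 << i
            self.bits, self.items = bits, None

h = HybridSet()
for i in range(10_000):
    h.add(i)
print('dense' if h.bits is not None else 'sparse')   # dense after the threshold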
17. What we store
Segment data (near-realtime)
Segment stats per day (HLL + MinHash): 14 GB, 1 GB per day
Affinity report (daily recount + near-realtime deltas)
1% sample bitmap (no compression in Redis, 190 GB)
1% + 0.01% sample sets (40 GB)
Transaction predicate sets (daily)
HLL (compressed: 150M HLLs in 40 GB)