HyperLogLog is dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes"), HyperLogLog performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LogLog, whose accuracy can be matched by consuming only 64% of the original memory.
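As a rough numerical illustration of the 1.04/√m figure, the sketch below computes the standard error for a given m and the smallest m that meets a target error. The function names and the restriction of m to powers of two are assumptions made for this example.

```python
import math


def hll_standard_error(m: int) -> float:
    """Typical relative standard error for m memory units: 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(m)


def registers_for_error(target: float) -> int:
    """Smallest power-of-two m whose standard error is at most `target`.

    Restricting m to powers of two is an assumption of this sketch.
    """
    m = 1
    while hll_standard_error(m) > target:
        m *= 2
    return m


print(hll_standard_error(1024))   # 0.0325, i.e. about 3.25%
print(registers_for_error(0.02))  # 4096
```

Since the error shrinks as 1/√m, halving it costs four times the memory.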
5. Definitions and facts
- harmonic mean: H = m / (1/x₁ + 1/x₂ + … + 1/xₘ)
- if each of a collection of m independent random variables has standard
deviation σ, then their arithmetic mean has standard deviation σ/√m
- the 68–95–99.7 rule
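The first two facts can be checked with a few lines of Python. This is a minimal sketch; the helper names and the sample values are made up for illustration.

```python
import math


def harmonic_mean(values):
    """Harmonic mean: number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


def std_of_mean(sigma: float, m: int) -> float:
    """Standard deviation of the arithmetic mean of m independent variables,
    each with standard deviation sigma."""
    return sigma / math.sqrt(m)


samples = [2, 2, 2, 200]             # one large outlier
print(sum(samples) / len(samples))   # arithmetic mean: 51.5
print(harmonic_mean(samples))        # about 2.66, barely moved by the outlier
print(std_of_mean(1.0, 100))         # 0.1
```

The harmonic mean is much less sensitive to a single very large value than the arithmetic mean.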
7. Problem statement and naive solution
- given a multiset M, find the number of its distinct elements
- hash table on M?
- sort(M) + linear scan?
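Both naive approaches are easy to write down. The sketch below uses hypothetical function names and a tiny sample multiset. Both versions are exact, but they need memory (or sorted storage) proportional to the input.

```python
def count_distinct_hash(items) -> int:
    """Exact count using a hash set: O(N) time, memory grows with the cardinality."""
    return len(set(items))


def count_distinct_sorted(items) -> int:
    """Exact count by sorting, then scanning for changes between neighbours."""
    ordered = sorted(items)
    count = 0
    for i, value in enumerate(ordered):
        # A new distinct element starts wherever the value differs from its predecessor.
        if i == 0 or value != ordered[i - 1]:
            count += 1
    return count


M = ["a", "b", "a", "c", "b"]
print(count_distinct_hash(M))    # 3
print(count_distinct_sorted(M))  # 3
```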
8. Issues
- the cardinality of the data set is too large to store it in memory
- the data set is stored in a distributed environment
9. Examples
- Google search, number of distinct search queries
- traffic monitoring (DoS attacks)
- correlation of genomes in human DNA, distinct subwords of fixed size k
10. Constraints
- the crucial step is to relax the requirement of computing the value of the
cardinality exactly
- this allows a whole range of probabilistic algorithms to be applied
- in 99% of practical applications, a tolerance of a few percent on the result
is acceptable
11. Idea of probabilistic counting
- imagine I flip a coin many times and count the number of consecutive
heads before the first tail
- repeat it several times
- Sequence 1: HHHT
- Sequence 2: HT
- Sequence 3: HHT
12. What if?
- what if I tell you that I ran 1000 sequences and got 2 as the maximum index?
- what if I tell you that I ran 10 sequences and got 100 as the maximum index?
- X ≈ 2^k, where X is the number of sequences and k is the maximum index
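The coin-flip intuition can be written as a small sketch. It assumes that the "index" of a sequence means the number of heads before the first tail. The function names are hypothetical.

```python
def heads_before_first_tail(flips: str) -> int:
    """Number of consecutive 'H' characters at the start of the sequence."""
    count = 0
    for flip in flips:
        if flip != "H":
            break
        count += 1
    return count


def estimate_sequence_count(sequences) -> int:
    """Crude estimate of how many sequences were observed: 2 ** (maximum index)."""
    k = max(heads_before_first_tail(s) for s in sequences)
    return 2 ** k


# The three sequences from the example: indices 3, 1 and 2, so k = 3.
print(estimate_sequence_count(["HHHT", "HT", "HHT"]))  # 8
```

With only three sequences the estimate of 8 is far off. A single maximum is a very noisy statistic, which is why the estimate needs to be averaged over many independent observations.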
14. m different hash functions, drawbacks
- complexity = O(Nm)
- it would necessitate a large set (e.g. 10^4 to decrease error by 10^2) of
independent hashing functions, for which no construction is known
15. Split one problem into m sub-problems
- split M into m buckets
- estimate cardinality of each bucket (X/m)
- compute the mean of all estimates
- multiply the result by m (gives an estimate with accuracy σ/√m)
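The bucket-splitting steps can be sketched with a single hash function. The first b bits of the hash pick one of m = 2^b buckets, and the remaining bits supply the "coin flips". This sketch makes several assumptions that the outline does not state: SHA-1 truncated to 64 bits as the hash, the harmonic mean as the way to combine buckets (the HyperLogLog choice), the bias constant 0.7213/(1 + 1.079/m), which applies for m ≥ 128, and no small- or large-range corrections. All names are hypothetical.

```python
import hashlib


def rho(w: int, bits: int) -> int:
    """1-based position of the leftmost 1-bit in a `bits`-wide word
    (number of leading zeros plus one); bits + 1 if w == 0."""
    return bits - w.bit_length() + 1


def estimate_cardinality(items, b: int = 10) -> float:
    """Estimate the number of distinct items using m = 2**b buckets (assumes b >= 7)."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # 64-bit hash value
        j = h >> (64 - b)                       # first b bits choose the bucket
        w = h & ((1 << (64 - b)) - 1)           # remaining bits feed the estimator
        registers[j] = max(registers[j], rho(w, 64 - b))
    alpha = 0.7213 / (1 + 1.079 / m)            # bias correction, valid for m >= 128
    # Harmonic mean of the per-bucket estimates 2**register, scaled by m.
    return alpha * m * m / sum(2.0 ** -r for r in registers)


print(estimate_cardinality(range(100_000)))  # expected to land within a few percent of 100000
```

Duplicates never change a register, because each register only keeps a maximum. The estimate therefore depends only on the set of distinct items.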
25. Data structure
- estimate the cardinality of the union of multiple sets. It is natural to
combine multiple HLLs: for each register, simply take the largest count of
consecutive leading 0’s across all the HLLs
- estimate the overlap of two sets. Since |A ∩ B| = |A| + |B| – |A ∪ B|, the
overlap of two sets can be calculated from the cardinality of each set and
the cardinality of their union
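Both operations can be sketched on plain register arrays. As assumptions of this example, the registers are built from SHA-1 truncated to 64 bits, there are m = 1024 registers, each register stores the number of leading zeros plus one, and the estimator is the harmonic-mean formula with constant 0.7213/(1 + 1.079/m). All names are hypothetical.

```python
import hashlib

B = 10          # number of index bits (assumed)
M = 1 << B      # number of registers


def build_registers(items):
    """Build the register array for a collection of items."""
    registers = [0] * M
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        j = h >> (64 - B)                           # register index
        w = h & ((1 << (64 - B)) - 1)               # remaining bits
        rank = (64 - B) - w.bit_length() + 1        # leading zeros + 1
        registers[j] = max(registers[j], rank)
    return registers


def merge(r1, r2):
    """Union of two sketches: register-wise maximum."""
    return [max(a, b) for a, b in zip(r1, r2)]


def estimate(registers):
    """Harmonic-mean cardinality estimate, without range corrections."""
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in registers)


def estimate_intersection(r1, r2):
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return estimate(r1) + estimate(r2) - estimate(merge(r1, r2))


a = build_registers(range(0, 60_000))
b = build_registers(range(40_000, 100_000))
print(estimate(merge(a, b)))        # estimate of |A ∪ B|, true value 100000
print(estimate_intersection(a, b))  # estimate of |A ∩ B|, true value 20000
```

Merging is lossless: the merged registers are identical to those built directly from the union. The overlap estimate, by contrast, inherits the absolute errors of all three terms, so it is likely to be unreliable when the overlap is small relative to the sets.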