Count-Distinct Problem

Yunhe Feng, Kai Zhang
Apr. 5, 2016

Questions
1. What is the cardinality of this data stream:
{1, 2, 4, 6, 8, 9, 2, 3, 11, 3, 1, 4}
2. Remember we use “bit pattern observables” to estimate
cardinality; describe the basic idea behind it.
3. How are the “buckets” useful in the LOGLOG COUNTING algorithm?

Outline
1. Overview
2. History
3. Algorithms
4. Implementation & Results
5. Open Issues
6. References

Overview
Definition:
Instance: A stream of elements $x_1, x_2, \ldots, x_s$ with repetitions, and
an integer $m$. Let $n$ be the number of distinct elements, namely
$n = |\{x_1, x_2, \ldots, x_s\}|$, and let these elements be $\{e_1, e_2, \ldots, e_n\}$.
Objective: Find an estimate $\hat{n}$ of $n$ using only $m$ storage units, where
$m \ll n$.
e.g. Count the cardinality of the stream: $a, b, a, c, d, b, d$. For this
instance, $n = |\{a, b, c, d\}| = 4$.
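As a baseline for the definition above, here is a minimal sketch of exact counting with a set. The function name `exact_cardinality` is hypothetical; the point is that this approach stores every distinct element, so it needs on the order of $n$ storage units rather than $m \ll n$.

```python
def exact_cardinality(stream):
    """Return the exact number of distinct elements in the stream."""
    seen = set()              # holds one entry per distinct element: O(n) memory
    for element in stream:
        seen.add(element)
    return len(seen)


stream = ["a", "b", "a", "c", "d", "b", "d"]
print(exact_cardinality(stream))  # 4
```

The estimators in the following sections trade this exactness for a fixed, small memory footprint.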

Example:
Keep track of the number of
Unique Visitors (UV) for a particular
product on Amazon in one day.
Operation: Searching, Insertion
Drawbacks:
• 1MB for each tree, 1 million items: 100GB memory! (200 million on Amazon)
• What if we want to know the number of UVs of 2 items together?

Other Applications
Application:
Networking / Traﬃc monitoring
• Detection of worm propagation
• Network attacks
• Link-based spam
Data mining of massive data sets
• Natural language texts
• Biological data
• Large structured databases
Google: Sawzall, Dremel and PowerDrill

History
1980: Optimization of classical algorithms for operations on databases:
union, intersection, sorting, …
Data set size >> RAM capacity.
• in one pass;
• using small auxiliary memory
1983: Probabilistic Counting by Flajolet and Martin
2003: LogLog Counting algorithm
2007: HyperLogLog Counting algorithm

1. LINEAR COUNTING
Step 1: Allocate a bit map (hash table) of size $m$; all entries are initialized to “0”;
0 0 0 0 0 0 … 0 0 0 0 0
1, 2, … … m
Step 2: Hash each value to a bit map address and set the addressed bit to “1”;
0 1 0 0 1 1 … 0 0 1 0 1
1, 2, … … m
Step 3: Count the empty bit map entries and divide the count by the bit map size $m$
(the fraction is $V_n$); the cardinality estimate is then:
$\hat{n} = -m \ln V_n$

LINEAR COUNTING
• cardinality: $n = 11$
• estimated cardinality: $\hat{n} = -m \ln V_n = -8 \ln \frac{1}{4} \doteq 11.09$

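LINEAR COUNTING
$n$ balls, $m$ boxes.
Let $A_j$ stand for the event that box $j$ is empty:
$P(A_j) = \left(1 - \frac{1}{m}\right)^n$
$P(A_j A_k) = \left(1 - \frac{2}{m}\right)^n, \quad j \neq k$
Let $U_n$ denote the number of empty boxes:
$E(U_n) = \sum_{j=1}^{m} P(A_j) = m\left(1 - \frac{1}{m}\right)^n \cong m e^{-n/m}$
$\hat{n} = -m \ln \frac{E(U_n)}{m}$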
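The balls-and-boxes expectation can be checked numerically. The sketch below is only an illustration: the function name `mean_empty_boxes`, the trial count, and the seed are assumptions, not part of the slides. It throws $n$ balls into $m$ boxes many times and compares the average number of empty boxes with $m(1 - 1/m)^n$ and with the approximation $m e^{-n/m}$.

```python
import math
import random


def mean_empty_boxes(n, m, trials=2000, seed=0):
    """Average number of empty boxes after throwing n balls into m boxes."""
    rng = random.Random(seed)
    total_empty = 0
    for _ in range(trials):
        occupied = {rng.randrange(m) for _ in range(n)}  # boxes hit at least once
        total_empty += m - len(occupied)
    return total_empty / trials


n, m = 200, 100
simulated = mean_empty_boxes(n, m)
exact = m * (1 - 1 / m) ** n          # E(U_n)
approx = m * math.exp(-n / m)         # m * e^(-n/m)
n_hat = -m * math.log(exact / m)      # inverting the expectation recovers roughly n

print(simulated, exact, approx, n_hat)
```

Inverting the expectation is what turns the observed fraction of empty entries into the estimate $\hat{n}$.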

LINEAR COUNTING
Algorithm Basic Linear Counting:
let $key_i$ = the key for the $i$-th tuple in the relation.
initialize the bit map to “0”s.
for $i$ = 1 to $q$ do
    hash_value = hash($key_i$)
    bit map(hash_value) = “1”
end for
$U_n$ = number of “0”s in the bit map
$V_n$ = $U_n / m$
$\hat{n} = -m \ln V_n$
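The pseudocode above could be sketched in Python roughly as follows. This is a hedged illustration, not the authors' implementation: `hash_to_index` is a hypothetical helper, and SHA-256 from the standard library is swapped in for the MurmurHash function the slides mention, purely to keep the example self-contained.

```python
import hashlib
import math


def hash_to_index(key, m):
    """Map a key to a bit map address in [0, m). SHA-256 is an assumption here."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def linear_counting(keys, m):
    """Estimate the number of distinct keys with a bit map of size m."""
    bitmap = [0] * m                       # initialize the bit map to 0s
    for key in keys:
        bitmap[hash_to_index(key, m)] = 1  # set the addressed bit
    u_n = bitmap.count(0)                  # U_n: number of empty entries
    if u_n == 0:
        raise ValueError("bit map is full; choose a larger m")
    v_n = u_n / m                          # V_n: fraction of empty entries
    return -m * math.log(v_n)


keys = [f"user{i}" for i in range(10000)]
estimate = linear_counting(keys + keys, m=8000)  # duplicates do not change the bit map
print(round(estimate))
```

Note that the bit map can be smaller than the number of distinct keys; the estimate only breaks down when no empty entry remains.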

Constraint 1:
How to choose the size $m$? The mean number of empty boxes must be
$a$ standard deviations from zero:
$E(U_n) - a \times \mathrm{StdDev}(U_n) > 0$
Lemma: The limiting distribution of $U_n$, the number of empty
boxes, is Poisson with expected value $\lambda$, as
$m e^{-n/m} \to \lambda$ with $n, m \to \infty$.
Thus,
$\lim_{n,m \to \infty} \Pr(U_n = k) = (\lambda^k / k!)\, e^{-\lambda}$
The fill-up probability is then obtained as
$\Pr(U_n = 0) = e^{-\lambda}$
If $a > \sqrt{5}$, that is $E(U_n) > \sqrt{5} \cdot \mathrm{StdDev}(U_n)$, then $\lambda > 5$, and
$\Pr(U_n = 0) < e^{-5} \doteq 0.007$ (0.7%)

Constraint 2:
Suppose the user wants to limit the standard error to $\epsilon$. With the load factor $t = n/m$, we have
$\frac{\left((e^t - t - 1)/m\right)^{1/2}}{t} < \epsilon$
or equivalently
$m > \frac{e^t - t - 1}{(\epsilon t)^2}$
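Because $t = n/m$ itself depends on $m$, Constraint 2 has to be solved numerically. The following is a minimal sketch under that reading; the function names are hypothetical, and it covers Constraint 2 only (Constraint 1 is not checked, so it should not be expected to reproduce map sizes where the fill-up constraint is the binding one).

```python
import math


def satisfies_constraint_2(n, m, eps):
    """True if m > (e^t - t - 1) / (eps * t)^2 with load factor t = n / m."""
    t = n / m
    if t > 700:
        # e^t would overflow a float; a map this small is far from sufficient.
        return False
    # expm1(t) - t equals e^t - t - 1 but is more accurate for small t.
    return m > (math.expm1(t) - t) / (eps * t) ** 2


def min_map_size(n, eps):
    """Smallest bit map size m meeting Constraint 2 (simple linear scan)."""
    m = 1
    while not satisfies_constraint_2(n, m, eps):
        m += 1
    return m


print(min_map_size(100, 0.01))  # 5034
print(min_map_size(100, 0.1))   # 80
```

The linear scan is valid because the right-hand side of the inequality shrinks as $m$ grows, so the first $m$ that passes is the minimum. Both printed values agree with the $n = 100$ row of the map size table.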

LINEAR COUNTING
Map size m by cardinality n and standard error ε (epsilon):

       n   m (ε=0.01)   m (ε=0.1)  |        n   m (ε=0.01)   m (ε=0.1)
     100         5034          80  |    20000        10506        3105
     200         5067         106  |    30000        12839        4417
     300         5100         129  |    40000        15036        5680
     400         5133         151  |    50000        17134        6909
     500         5166         172  |    60000        19156        8112
     600         5199         192  |    70000        21117        9294
     700         5231         212  |    80000        23029       10458
     800         5264         231  |    90000        24897       11608
     900         5296         249  |   100000        26729       12744
    1000         5329         268  |   200000        43710       23633
    2000         5647         441  |   300000        59264       33992
    3000         5957         618  |   400000        73999       44032
    4000         6260         786  |   500000        88175       53848
    5000         6556         948  |   600000       101932       63492
    6000         6847        1106  |   700000       115359       72997
    7000         7132        1261  |   800000       128514       82387
    8000         7412        1412  |   900000       141441       91677
    9000         7688        1562  |  1000000       154171      100880
   10000         7960        1709  |

LOGLOG COUNTING
Basic Idea: (Bit pattern observables)
Hash each data item to a binary string like “01001101001…”.
For each string $x \in \{0, 1\}^{\infty}$, let $\rho(x)$ denote the position of its first 1-bit:
$\rho(1\ldots) = 1$, $\rho(001\ldots) = 3$, etc.
Let $\mathcal{M}$ denote the data set after hashing. Clearly, we can expect about
$n/2^k$ amongst the distinct elements of $\mathcal{M}$ to have a $\rho$-value equal to
$k$, so
$R(\mathcal{M}) := \max_{1 \le j \le n} \rho(x_j)$
is a rough indication of the value of $\log_2 n$.
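The observable $\rho$ and the maximum $R$ are easy to state in code. This is a small sketch operating on bit strings; the names `rho` and `max_rank` are chosen here for illustration, and finite strings stand in for the infinite strings of the definition.

```python
def rho(bits):
    """1-based position of the first 1-bit in a string of '0'/'1' characters."""
    position = bits.find("1")
    if position < 0:
        raise ValueError("no 1-bit in the string")
    return position + 1


def max_rank(hashed_strings):
    """R: the largest rho value seen over all hashed strings."""
    return max(rho(x) for x in hashed_strings)


print(rho("1"))                                   # 1
print(rho("001"))                                 # 3
print(max_rank(["01001101001", "001011", "1"]))   # 3
```

A long run of leading zeros is rare, so seeing one suggests that many distinct values have been hashed; that is the intuition behind using $R$ as a proxy for $\log_2 n$.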

LOGLOG COUNTING
Suppose here comes a data stream: {234, 39102, 3, 4556, 90011, 87, …}.
The hash function hashes each value to a binary string; suppose “90001” is hashed to:
0 0 1 0 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1   (hash value)
The position of the first 1-bit of this {0,1}-string is 3: $\rho(001011\ldots) = 3$.
It has high variability: one experiment cannot suffice to obtain accurate
predictions.
Stochastic Averaging: emulating the effect of $m$ experiments.

LOGLOG COUNTING
Stochastic Averaging: emulating the effect of $m$ experiments.
0 0 1 0 1 1 0 1 0 0 1 1 1 0 1 0 | 0 1 0 0 0 1 0 1
hash value (the last 8 bits are the bucket index)
Use the last 8 digits to represent the bucket number:
8 bits can represent $m = 2^8 = 256$ buckets (experiments).
http://content.research.neustar.biz/blog/hll.html
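Applied to the 24-bit hash value shown on this slide, the bucket split works out as in the sketch below. The variable names are illustrative only.

```python
hash_bits = "001011010011101001000101"   # the 24-bit hash value from the slide
BUCKET_BITS = 8

bucket_index = int(hash_bits[-BUCKET_BITS:], 2)   # last 8 bits select the bucket
remaining = hash_bits[:-BUCKET_BITS]              # bits left for the rho observable
rank = remaining.index("1") + 1                   # position of the first 1-bit
num_buckets = 2 ** BUCKET_BITS

print(bucket_index, rank, num_buckets)  # 69 3 256
```

Each bucket keeps only the largest rank it has seen, so one pass over the stream behaves like 256 separate small experiments.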

2. LOGLOG COUNTING algorithm
LOGLOG COUNTING ($\mathcal{M}$: multiset of hashed values; $m \equiv 2^k$):
    initialize $M^{(1)}, M^{(2)}, \ldots, M^{(m)}$ to “0”;
    let $\rho(y)$ be the rank of the first 1-bit from the left in $y$;
    for $x = b_1 b_2 \ldots \in \mathcal{M}$ do
        set $j := \langle b_1, \ldots, b_k \rangle_2$ (value of first $k$ bits in base 2)
        set $M^{(j)} := \max(M^{(j)}, \rho(b_{k+1} b_{k+2} \ldots))$
    return $E := \alpha_m m\, 2^{\frac{1}{m} \sum_j M^{(j)}}$ as cardinality estimate.
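A minimal Python sketch of the algorithm above, under several stated assumptions: SHA-256 truncated to 64 bits replaces the MurmurHash function used in the slides, the asymptotic constant $\alpha_\infty \approx 0.39701$ from the LogLog paper is used in place of the exact $\alpha_m$, and the helper names are hypothetical.

```python
import hashlib

HASH_BITS = 64
ALPHA_INF = 0.39701   # asymptotic constant, used here as an approximation of alpha_m


def hash64(item):
    """64-bit hash of an item (SHA-256 prefix; an assumption, not MurmurHash)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def rho(value, width):
    """Rank of the first 1-bit from the left in a width-bit value."""
    return width - value.bit_length() + 1


def loglog_estimate(items, k):
    """LogLog cardinality estimate using m = 2^k buckets."""
    m = 1 << k
    registers = [0] * m
    rest_bits = HASH_BITS - k
    for item in items:
        x = hash64(item)
        j = x >> rest_bits                    # value of the first k bits
        rest = x & ((1 << rest_bits) - 1)     # bits b_{k+1} b_{k+2} ...
        registers[j] = max(registers[j], rho(rest, rest_bits))
    mean_rank = sum(registers) / m            # arithmetic mean of the registers
    return ALPHA_INF * m * 2 ** mean_rank


n = 100000
estimate = loglog_estimate(range(n), k=10)
print(round(estimate))
```

With $m = 1024$ buckets the expected standard error is about $1.30/\sqrt{m} \approx 4\%$, so the printed value should land near 100000 without matching it exactly.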

LOGLOG COUNTING
Theorem: Let $\omega(n)$ be a function that tends to infinity arbitrarily slowly and
consider the function
$l(n) = \log_2 \log_2\left(\frac{n}{m}\right) + \omega(n)$
Then, the $l(n)$-restricted algorithm and the LOGLOG algorithm provide the same
output with probability tending to 1 as $n$ tends to infinity.
e.g. Count cardinality till $2^{27}$ (a hundred million), adopt $m = 1024 = 2^{10}$ buckets;
each bucket is visited (roughly) $n/m = 2^{17}$ times;
we have $\log_2 \log_2 2^{17} \doteq 4.09$, adopt $\omega = 0.91$, each bucket: 5 bits;
Totally 1024*5/8 = 640 bytes! (with a standard error of 4%)
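The memory arithmetic in this example can be reproduced in a few lines. The variable names are illustrative, and the $1.30/\sqrt{m}$ standard error is the figure given in the LogLog paper rather than something derived here.

```python
import math

n_max = 2 ** 27        # cardinalities up to about a hundred million
m = 2 ** 10            # 1024 buckets
omega = 0.91           # slack term chosen on the slide

per_bucket = n_max // m                       # 2**17 elements per bucket
loglog = math.log2(math.log2(per_bucket))     # log2(17), about 4.09
bits_per_bucket = math.ceil(loglog + omega)   # 5 bits
total_bytes = m * bits_per_bucket // 8        # 640 bytes
standard_error = 1.30 / math.sqrt(m)          # about 0.04

print(bits_per_bucket, total_bytes, round(standard_error, 3))
```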

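HYPERLOGLOG COUNTING
LOGLOG COUNTING algorithm with Harmonic Mean
Arithmetic mean (LogLog):
$E := \alpha_m m\, 2^{\frac{1}{m} \sum_j M^{(j)}}$, with $\frac{1}{m}\left(M^{(1)} + M^{(2)} + \cdots + M^{(m)}\right)$
Harmonic mean (HyperLogLog):
$E := \alpha_m m^2 \left(\sum_{j=1}^{m} 2^{-M[j]}\right)^{-1}$, with $\frac{m}{\frac{1}{2^{M(1)}} + \frac{1}{2^{M(2)}} + \cdots + \frac{1}{2^{M(m)}}}$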
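One way to see why the harmonic mean helps is to compare how the two averages react to a single outlier register. The sketch below uses made-up register values and hypothetical function names; it omits the $\alpha_m m$ factors and compares only the averaged quantity.

```python
def loglog_indicator(registers):
    """2 raised to the arithmetic mean of the registers (LogLog style)."""
    return 2 ** (sum(registers) / len(registers))


def hyperloglog_indicator(registers):
    """Harmonic mean of 2^M[j] over the registers (HyperLogLog style)."""
    return len(registers) / sum(2.0 ** -r for r in registers)


uniform = [3, 3, 3, 3, 3, 3, 3, 3]
with_outlier = [3, 3, 3, 3, 3, 3, 3, 12]   # one unusually large register

print(loglog_indicator(uniform), hyperloglog_indicator(uniform))            # 8.0 8.0
print(loglog_indicator(with_outlier), hyperloglog_indicator(with_outlier))
```

On the uniform registers both give 8. With the outlier, the arithmetic-mean version more than doubles (about 17.4) while the harmonic mean moves only to about 9.1, which illustrates its lower sensitivity to large values.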

3. HYPERLOGLOG COUNTING algorithm
HYPERLOGLOG COUNTING (input $\mathcal{M}$: multiset of items):
    assume $m = 2^b$ with $b \in \mathbb{Z}_{>0}$;
    initialize a collection of $m$ integers, $M[1], \ldots, M[m]$, to $-\infty$;
    for $v \in \mathcal{M}$ do
        set $x := h(v)$;
        set $j = 1 + \langle x_1 x_2 \ldots x_b \rangle_2$ (the binary address determined
        by the first $b$ bits of $x$);
        set $w := x_{b+1} x_{b+2} \ldots$; set $M[j] := \max(M[j], \rho(w))$;
    compute $Z := \left(\sum_{j=1}^{m} 2^{-M[j]}\right)^{-1}$;
    return $E := \alpha_m m^2 Z$
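The raw HyperLogLog estimate could be sketched as follows. Several choices here are assumptions rather than the authors' code: SHA-256 truncated to 64 bits stands in for MurmurHash, registers start at 0 instead of $-\infty$ (as practical implementations do), indices are 0-based, and $\alpha_m = 0.7213/(1 + 1.079/m)$ is the approximation given in the HyperLogLog paper for $m \ge 128$.

```python
import hashlib

HASH_BITS = 64


def hash64(item):
    """64-bit hash of an item (SHA-256 prefix; an assumption, not MurmurHash)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hyperloglog_raw(items, b):
    """Raw HyperLogLog estimate with m = 2^b registers (no range corrections)."""
    m = 1 << b
    alpha_m = 0.7213 / (1 + 1.079 / m)        # approximation for m >= 128
    registers = [0] * m                        # 0 instead of -infinity (assumption)
    w_bits = HASH_BITS - b
    for v in items:
        x = hash64(v)
        j = x >> w_bits                        # first b bits, 0-based address
        w = x & ((1 << w_bits) - 1)            # remaining bits
        rank = w_bits - w.bit_length() + 1     # rho(w)
        registers[j] = max(registers[j], rank)
    z = 1.0 / sum(2.0 ** -r for r in registers)
    return alpha_m * m * m * z


n = 100000
estimate = hyperloglog_raw(range(n), b=10)
print(round(estimate))
```

For $m = 1024$ the expected standard error is roughly $1.04/\sqrt{m} \approx 3\%$, so the result should be close to, but not exactly, 100000.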

Implementation
Programming Language: Python 2.7
Hash Function: MurmurHash 3_64
Multiset: integers 1, 2, 3, …

MurmurHash3_32
Number of Elements in Each Bucket Follows a Uniform Distribution

MurmurHash3_32
Distributions of Position of 1st 1-bit of Hashed Binary Strings

Linear Counting
Performances of LC for Different Map Sizes; Load Factor vs. Standard Errors

LogLog Counting
Performances of LLC for Different Numbers of Buckets

LogLog Counting
A Large Error for Small Cardinalities

HyperLogLog Counting
Performances of HLLC for Different Numbers of Buckets

Comparison of HLLC and LLC
Comparison of HLLC and LLC when Number of Buckets is Small

Comparison of HLLC and LLC
Comparison of HLLC and LLC when Number of Buckets is Large

The “raw” estimate:
$E := \alpha_m m^2 \left(\sum_{j=1}^{m} 2^{-M[j]}\right)^{-1}$
Small Cardinalities:
When the cardinality is small, the
proportion of un-hit buckets is large,
which leads to inaccurate estimation.
Large Cardinalities:
A hash function of $L$ bits can at most
distinguish $2^L$ different values, and as the
cardinality $n$ approaches $2^L$, hash
collisions become more and more likely
and accurate estimation becomes impossible.
if $E \le \frac{5}{2} m$ then
    Let V be the number of registers equal to 0.
    if $V \ne 0$ then set E := LinearCounting(m, V)
    else
        do nothing
end
if $E \le \frac{1}{30} 2^{32}$ then
    E := E
if $E > \frac{1}{30} 2^{32}$ then
    $E := -2^{32} \log\left(1 - E/2^{32}\right)$
end
return E
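The range corrections can be sketched as a small function applied to the raw estimate. This is an illustration under assumptions: the function names are hypothetical, the three ranges are treated as mutually exclusive, a 32-bit hash is assumed as in the thresholds above, and `linear_counting(m, v)` is taken to be $m \ln(m/V)$, which matches $\hat{n} = -m \ln V_n$ with $V_n = V/m$.

```python
import math

TWO_32 = 2 ** 32


def linear_counting(m, v):
    """Linear Counting estimate from m registers, v of which are still 0."""
    return m * math.log(m / v)


def corrected_estimate(raw, m, zero_registers):
    """Apply small- and large-range corrections to a raw HyperLogLog estimate."""
    if raw <= 2.5 * m:
        # Small range: fall back to Linear Counting if any register is empty.
        if zero_registers != 0:
            return linear_counting(m, zero_registers)
        return raw
    if raw <= TWO_32 / 30:
        return raw                                  # intermediate range: no change
    return -TWO_32 * math.log(1 - raw / TWO_32)     # large range: undo collisions


print(corrected_estimate(1000.0, 1024, 512))   # small range, about 709.8
print(corrected_estimate(5000.0, 1024, 0))     # intermediate range, 5000.0
```

The large-range branch always returns a value above the raw estimate, compensating for distinct items that collided under the 32-bit hash.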

Correction for HyperLogLog Counting
Bad Performances for Small Cardinalities; Corrections for Small Cardinalities

Correction for HyperLogLog Counting
Performance Comparison between HLLC_raw and HLLC for Small Cardinalities

Bad Performances for Large Cardinalities; Corrections for Large Cardinalities

Performance Comparison between HLLC_raw and HLLC for Large Cardinalities

Open Issues
Are there other smart ideas to use?

References
Whang, Kyu-Young, Brad T. Vander-Zanden, and Howard M. Taylor. "A linear-time
probabilistic counting algorithm for database applications." ACM Transactions on
Database Systems (TODS) 15.2 (1990): 208-229.
Durand, Marianne, and Philippe Flajolet. "Loglog counting of large cardinalities."
Algorithms-ESA 2003. Springer Berlin Heidelberg, 2003. 605-617.
Flajolet, Philippe, et al. "Hyperloglog: the analysis of a near-optimal cardinality estimation
algorithm." DMTCS Proceedings 1 (2008).
Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic
engineering of a state of the art cardinality estimation algorithm." Proceedings of the 16th
International Conference on Extending Database Technology. ACM, 2013.
Metwally, Ahmed, Divyakant Agrawal, and Amr El Abbadi. "Why go logarithmic if we can go
linear?: Towards effective distinct counting of search traffic." Proceedings of the 11th
international conference on Extending database technology: Advances in database
technology. ACM, 2008.

History
Distinct Counting Algorithms
Sampling Algorithms | Sketch-Based Algorithms
Logarithmic Hashing Algorithms | Uniform Hashing Algorithms
Interval-Based Algorithms | Bucket-Based Algorithms
Pure-Bucket-Based Algorithms | Hybrid-Bucket-Based Algorithms
Hybrid Bucket-Based-Logarithmic Algorithms | Hybrid Bucket-Based-Sampling Algorithms