This talk covers some techniques for counting with a computer. Counting problems come up very often in databases, networking, and elsewhere. Counting by itself is so simple that it is hardly worth talking about, but some techniques offer truly impressive gains.
Accompanying notes at: http://www.slideshare.net/roshmat/counting-notes
2. Problems
• Count all elements
• count number of HTTP requests
• Count unique elements
• detect network attacks
• query optimisation in databases
3. Counting All Elements
• Simple solution – use a counter!
• Accurate answer
• Linear in time – O(n)
• Logarithmic in space – O(log n)
• log₂ n bits to store a count of n
4. Logarithmic Space
• To count up to n you need at least log₂ n bits
• 10 bits can count up to around one thousand
• 20 bits can count up to around one million
• 30 bits can count up to around one billion
• 64 bits can probably count everything
5. Problem Earlier?
• Memory wasn’t always cheap
• Robert Morris (1932–2011), Bell Labs, 1977
• “a programming situation that required using a large number of counters to keep track of the number of occurrences of many different events.”
• 8-bit counters!
6. Better than 2^n
• 8-bit counters can count up to 256
• Can we do better? No, 256 = 2^8 is the information-theoretic limit
• Can we count more with 8 bits? Say, up to 2 × 256 = 512? Any hack?
7. Ideas
• Can we reuse the counter? Loop through it twice?
• Flag that tells us if we’re using or reusing the counter?
• Count every other event, so effectively we’re counting double?
• Flag that keeps track of whether we should count the next event?
8. No More Bits
• A flag is another bit, and we don’t have a 9th bit
• We can’t count past 256 accurately with only 8 bits
• Can we count inaccurately?
9. Tossing Coins
• Toss a fair coin – ½ chance of heads (50%)
• If heads, we increment the counter
• If tails, we don’t
• Not deterministic. Probabilistic!
• Trading accuracy lets us break the information-theoretic bound
11. Coin Tossing
• Toss the fair coin twice, expect one head
• Toss the fair coin 10 times, expect five heads
• So our “counter” is expected to increment once for every two tosses.
[Figure: distribution of the number of heads in 10 fair tosses – x-axis: Number of Heads (0–10), y-axis: Probability (0–0.3)]
12. Estimator
• The “counter” isn’t a counter, it’s an estimator
• If the count is n, the estimator should be n/2
• If the estimator is k, the estimate of the count is 2k
• An 8-bit estimator goes to 256, so it can estimate up to 512!
13. Implementation
• Use a random number generator:
import random

def increment(estimator):
    # increment with probability 1/2 – a fair coin toss
    if random.random() < 0.5:
        return estimator + 1
    return estimator
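A quick sanity check of the fair-coin estimator (a minimal sketch; the event count of 500 is arbitrary, and the doubling follows slide 12):

estimator = 0
for _ in range(500):
    estimator = increment(estimator)
print(2 * estimator)  # the estimate; expected to land near 500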
14. Counting Further
• To count to higher values, say 768 (3 × 256) or 256,000 (1000 × 256), use a biased coin
• Bias the coin so it isn’t fair – not 50-50
• A coin with ⅓ chance of heads: the “counter” increments, on average, once every three tosses
• A coin with 1/1000 chance of heads: the “counter” increments, on average, once every 1000 tosses
15. Implementation
• Choose the bias of the coin:
import random

def increment(estimator, bias):
    # increment with probability `bias`
    if random.random() < bias:
        return estimator + 1
    return estimator
• Call with bias=1/3 or bias=1/1000 (see the sketch below)
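A usage sketch (recovering the estimate as estimator / bias mirrors the doubling in the fair-coin case; the event count is arbitrary):

estimator = 0
bias = 1 / 1000
for _ in range(250000):
    estimator = increment(estimator, bias)
print(estimator / bias)  # the estimate; expected to land near 250,000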
16. Fair Coin Error
• Suppose the actual count is 1
• With a fair coin, the estimator is: 0 (50% chance) or 1 (50% chance)
• The estimate is: 0 (50% chance) or 2 (50% chance)
• The error is 1, always
17. Biased Coin Error
• If the coin has a 1/1000 chance of heads, the estimator is: 0 (999/1000 chance) or 1 (1/1000 chance)
• The estimate is: 0 (999/1000 chance) or 1000 (1/1000 chance)
• The error is 1 or 999
• For small counts, the error can be huge
18. Another Estimator
• The estimator stores the value of log₂(n); for estimator k, the estimate will be 2^k
• If the estimator is k after n increments, k ≅ log₂ n
• k should become k + 1 after another n increments (when the count doubles)
• We’re only storing an integer, k
19. Increment?
• Given only the value k, how do we know when to increment? We use our coin intuition
• With probability 2^-k, we increment k
• With probability 1 - 2^-k, we don’t
21. Implementation
• This is simple too:
import random

def increment(estimator):
    # increment with probability 2^-estimator
    if random.random() < 2 ** -estimator:
        return estimator + 1
    return estimator
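A minimal sketch checking the estimator over repeated runs (the trial counts are arbitrary; on average the estimate 2^k tracks the true count):

import statistics

def run_once(n):
    k = 0
    for _ in range(n):
        k = increment(k)
    return 2 ** k

estimates = [run_once(1000) for _ in range(200)]
print(statistics.mean(estimates))  # typically lands close to 1000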
22. Error
• The error is “one binary order of magnitude” – up to 100%
• If the estimator is 10, the estimate is 1024, and the actual value might be between 512 and 1024
• This seems worse, but the error is regular, and it can be fixed.
23. Reducing Error
• We can improve the algorithm by changing the base of the estimator
• Instead of storing log₂ n, we store logₐ n
• For base a (with a < 2):
• With probability a^-k, we increment k
• With probability 1 - a^-k, we don’t
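A minimal sketch of the base-a variant (recovering the estimate as a ** estimator mirrors the base-2 case and is my extrapolation, not stated on the slides):

import random

def increment(estimator, a):
    # increment with probability a^-estimator
    if random.random() < a ** -estimator:
        return estimator + 1
    return estimator

# with a = 2 ** (2 ** -4) ≅ 1.044 (see the next slide),
# the estimate of the count is a ** estimator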
24. Example
• With a = 2^(2^-δ), we can count up to n using log₂ log₂ n + δ bits of storage (take this formula as fact)
• For example, with δ = 4, a = 2^(2^-4) ≅ 1.044, and we can count up to 65,536 using 8 bits:
log₂ 65,536 = 16; log₂ 16 = 4; 4 + 4 = 8
• Binary counter: 8 bits → 256; 16 bits → 65,536!
• Relative errors are typically < 15%
25. Approximate Counting
• Simple to implement on a computer
• Requires ≅ log₂ log₂ n bits of storage
• Small relative error
• Estimation, probabilistic techniques
26. Counting Unique Elements
• Keep track of what you’ve seen; increase the counter if the item is new
• Use a hash map or a set (sketched below)
• linear in time
• space proportional to the number of uniques
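The exact approach is only a few lines in Python (a minimal sketch; a set handles both the membership check and the count):

def count_uniques(items):
    seen = set()  # memory grows with the number of uniques
    for item in items:
        seen.add(item)
    return len(seen)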
27. Characteristics
• Accurate answer (good to have)
• Linear in time – good!
• Linear in space – not good!
• “Trade accuracy for space!”
28. Estimation
• Approximate Counting – allow for some error, and use probabilistic techniques
• What can we do? What estimator works here?
29. Sampling
• Try to estimate the cardinality of the complete set by calculating the cardinality of a sample
• Error rates are high; they depend on how often items repeat in the data
• Example: a million integers between 1 and 10; draw a sample of 1,000. The sample’s cardinality is still 10. Scaling up by 1,000, our estimate is 10,000!
30. Puzzle
• Choose 9 evenly spaced numbers between 0 and 100? With repetitions?
• 10, 20, 30, 40, 50, 60, 70, 80, 90
• Why not 11, 21, 31, 41, …, 91?
• What is the minimum value chosen?
31. Simpler Problem
• Can we do the reverse?
• If our data consisted of evenly distributed numbers, repeated arbitrarily and randomly shuffled,
• how can we estimate the number of unique values?
32. Min-Estimator
• With evenly spaced data between 0 and 100, if the minimum is 10, the data had 9 unique values
• The number of uniques is (b - a)/(x - a) - 1, where x is the minimum seen in data ranging from a to b (see the sketch below)
• Good solution! What’s the problem?
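A minimal sketch of the min-estimator (the range endpoints a and b are assumed to be known):

def estimate_uniques(data, a=0, b=100):
    x = min(data)  # minimum value seen
    # evenly spaced data in (a, b) implies this many unique values
    return (b - a) / (x - a) - 1

# shuffled, repeated multiples of 10:
print(estimate_uniques([30, 10, 90, 10, 50, 70, 20, 40, 60, 80, 30]))  # 9.0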
33. Reality
• Data are not going to be numbers
• Even if they were, they wouldn’t be spread evenly over some range
34. Hashing
• Hash functions solve that problem
• They map arbitrary data from any domain to 32-bit integers that are uniformly distributed over the 2^32 range
• All you need is a hash function for your data.
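A minimal sketch of such a mapping (MD5 is just one convenient, well-mixing choice here, not the talk’s prescription):

import hashlib

def hash32(value):
    # map arbitrary data to an integer roughly uniform over [0, 2^32)
    digest = hashlib.md5(str(value).encode()).digest()
    return int.from_bytes(digest[:4], "big")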
36. Back to the Problem
• min() is just one estimator that can be used
• Another is to count the maximum number of 0s at the beginning of the hashed values
• 0001 0111 0011 1101 → 3 leading zeros
• What sort of estimator is this? Logarithmic!
37. Probability
• For uniformly distributed 32-bit numbers:
• Approximately half the data should start with 1
• Half should start with 0
• Of those, half should start with 01
• The rest should start with 00
39. Estimator
• ρ(x) = number of leading zeros in the hash of x
• ρ(1…) = 0, ρ(01…) = 1, ρ(001…) = 2, …
• S = maximum of all ρ(x)
• E(S) ≅ log₂ n
• Estimate = 2^S unique values
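A minimal sketch of this estimator, reusing the hash32 helper sketched earlier (the function names are mine):

def rho(h, bits=32):
    # number of leading zeros in the bits-wide representation of h
    count = 0
    for i in range(bits - 1, -1, -1):
        if h & (1 << i):
            break
        count += 1
    return count

def estimate_cardinality(items):
    s = max(rho(hash32(item)) for item in items)  # S = max of all rho(x)
    return 2 ** s  # estimate = 2^S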
40. Error
• Approximately one binary order of magnitude
• In Approximate Counting, we changed the base of the log
• Alternatively, take m hash functions, derive m values S₁, S₂, …, Sₘ, and calculate their average, A
• This reduces the error by a factor of 1/√m
41. m Hashes?
• Finding m good hash functions isn’t easy
• There is a CPU cost to calculating these hashes for every value.
42. Stochastic Averaging
• Avoids the need for m separate hash functions
• Use a few bits from the hash to distribute values into m bins
• Use the remaining bits to estimate n/m
• Average these values and scale by m
43. Example
• If the hash is 1010 1010 1010,
• we can use the first three bits to put this value into one of 8 (2^3) bins,
• and the remaining bits, ---0 1010 1010, can be used as before, but to estimate n/8
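A minimal sketch of stochastic averaging, built on the hash32 and rho helpers above (averaging the per-bin estimators and scaling by m is my reading of slide 42; 3 bin bits match this slide’s example):

def estimate_with_bins(items, bin_bits=3):
    m = 1 << bin_bits  # number of bins, 8 for 3 bits
    s = [0] * m
    for item in items:
        h = hash32(item)
        b = h >> (32 - bin_bits)  # first bits pick the bin
        rest = h & ((1 << (32 - bin_bits)) - 1)  # remaining bits
        s[b] = max(s[b], rho(rest, bits=32 - bin_bits))
    a = sum(s) / m  # average the per-bin estimators
    return m * 2 ** a  # each bin estimates n/m; scale by m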
47. Advantages
• Simple, very easy to implement (though not obvious, and hard to analyse)
• log₂ log₂ n space complexity – 8 bits can count up to 2^(2^8) = 2^256 ≅ 10^77
• Linear time complexity – good
• Easily distributable – just exchange estimators
48. Ideas
• Estimation with probabilistic approaches – trade accuracy for space; averaging values reduces errors
• Hashing – to convert any data into a uniformly distributed set of numbers
• Don’t implement these yourself – find them in PostgreSQL, Cassandra, Redis, etc.