This talk covers some techniques for counting with a computer. Counting problems come up very often in databases, networking, and elsewhere. Counting by itself is so simple that it is hardly worth talking about, but some techniques offer truly impressive gains.
Accompanying notes at: http://www.slideshare.net/roshmat/counting-notes
2. Problems
• Count all elements
• count number of HTTP requests
• Count unique elements
• detect network attacks
• query optimisation in databases
3. Counting All Elements
• Simple solution – use a counter!
• Accurate answer
• Linear in time – O(n)
• Logarithmic in space – O(log n)
• log₂ n bits to store a count of n
4. Logarithmic Space
• To count up to n you need at least log₂ n bits
• 10 bits can count up to around one thousand
• 20 bits can count up to around one million
• 30 bits can count up to around one billion
• 64 bits can probably count everything
5. Problem Earlier?
• Memory wasn’t always cheap
• Robert Morris (1932–2011), Bell Labs, 1977
• “a programming situation that required using a large number of counters to keep track of the number of occurrences of many different events.”
• 8-bit counters!
6. Better than 2^n
• 8-bit counters can count up to 256
• Can we do better? No, 256 = 2^8 is the information-theoretic limit
• Can we count more with 8 bits? Say, up to 2 × 256 = 512? Any hack?
7. Ideas
• Can we reuse the counter? Loop through it twice?
• Flag that tells us if we’re using or reusing the counter?
• Count every other event, so effectively we’re counting double?
• Flag that keeps track of whether we should count the next event?
8. No More Bits
• A flag is another bit, and we don’t have a 9th bit
• We can’t count past 256 accurately with only 8 bits
• Can we count inaccurately?
9. Tossing Coins
• Toss a fair coin – ½ chance of heads (50%)
• If heads, we increment the counter
• If tails, we don’t
• Not deterministic. Probabilistic!
• Trading accuracy lets us break the information-theoretic bound
11. Coin Tossing
• Toss the fair coin twice, expect one head
• Toss the fair coin 10 times, expect five heads
• So our “counter” is expected to increment once for every two tosses.
[Figure: distribution of the number of heads in 10 fair tosses – x-axis: Number of Heads (0–10), y-axis: Probability (0–0.3)]
12. Estimator
• The “counter” isn’t a counter, it’s an estimator
• If the count is n, the estimator should be n/2
• If the estimator is k, the estimate of the count is 2k
• An 8-bit estimator goes to 256, so it can estimate up to 512!
13. Implementation
• Use a random number generator:
import random

def increment(estimator):
    # increment with probability 1/2 – a fair coin toss
    if random.random() < 0.5:
        return estimator + 1
    return estimator
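A quick sanity check of the fair-coin estimator (a minimal sketch; the event count of 500 is arbitrary, and the doubling follows slide 12):

estimator = 0
for _ in range(500):
    estimator = increment(estimator)
print(2 * estimator)  # the estimate; expected to land near 500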
14. Counting Further
• To count to higher values, say 768 (3 × 256) or 256,000 (1000 × 256), use a biased coin
• Bias the coin so it isn’t fair – not 50-50
• A coin with ⅓ chance of heads: the “counter” increments, on average, once every three tosses
• A coin with 1/1000 chance of heads: the “counter” increments, on average, once every 1000 tosses
15. Implementation
• Choose the bias of the coin:
import random

def increment(estimator, bias):
    # increment with probability `bias`
    if random.random() < bias:
        return estimator + 1
    return estimator
• Call with bias=1/3 or bias=1/1000 (see the sketch below)
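A usage sketch (recovering the estimate as estimator / bias mirrors the doubling in the fair-coin case; the event count is arbitrary):

estimator = 0
bias = 1 / 1000
for _ in range(250000):
    estimator = increment(estimator, bias)
print(estimator / bias)  # the estimate; expected to land near 250,000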
16. Fair Coin Error
• Suppose the actual count is 1
• With a fair coin, the estimator is: 0 (50% chance) or 1 (50% chance)
• The estimate is: 0 (50% chance) or 2 (50% chance)
• The error is 1, always
17. Biased Coin Error
• If the coin has a 1/1000 chance of heads, the estimator is: 0 (999/1000 chance) or 1 (1/1000 chance)
• The estimate is: 0 (999/1000 chance) or 1000 (1/1000 chance)
• The error is 1 or 999
• For small counts, the error can be huge
18. Another Estimator
• The estimator stores the value of log₂(n); for estimator k, the estimate will be 2^k
• If the estimator is k after n increments, k ≅ log₂ n
• k should become k + 1 after another n increments (when the count doubles)
• We’re only storing an integer, k
19. Increment?
• Given only the value k, how do we know when to increment? We use our coin intuition
• With probability 2^-k, we increment k
• With probability 1 - 2^-k, we don’t
21. Implementation
• This is simple too:
import random

def increment(estimator):
    # increment with probability 2^-estimator
    if random.random() < 2 ** -estimator:
        return estimator + 1
    return estimator
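A minimal sketch checking the estimator over repeated runs (the trial counts are arbitrary; on average the estimate 2^k tracks the true count):

import statistics

def run_once(n):
    k = 0
    for _ in range(n):
        k = increment(k)
    return 2 ** k

estimates = [run_once(1000) for _ in range(200)]
print(statistics.mean(estimates))  # typically lands close to 1000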
22. Error
• The error is “one binary order of magnitude” – up to 100%
• If the estimator is 10, the estimate is 1024, and the actual value might be between 512 and 1024
• This seems worse, but the error is regular, and it can be fixed.
23. Reducing Error
• We can improve the algorithm by changing the base of the estimator
• Instead of storing log₂ n, we store logₐ n
• For base a (with a < 2):
• With probability a^-k, we increment k
• With probability 1 - a^-k, we don’t
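A minimal sketch of the base-a variant (recovering the estimate as a ** estimator mirrors the base-2 case and is my extrapolation, not stated on the slides):

import random

def increment(estimator, a):
    # increment with probability a^-estimator
    if random.random() < a ** -estimator:
        return estimator + 1
    return estimator

# with a = 2 ** (2 ** -4) ≅ 1.044 (see the next slide),
# the estimate of the count is a ** estimator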
24. Example
• With a = 2^(2^-δ), we can count up to n using log₂ log₂ n + δ bits of storage (take this formula as fact)
• For example, with δ = 4, a = 2^(2^-4) ≅ 1.044, and we can count up to 65,536 using 8 bits:
log₂ 65,536 = 16; log₂ 16 = 4; 4 + 4 = 8
• Binary counter: 8 bits → 256; 16 bits → 65,536!
• Relative errors are typically < 15%
25. Approximate Counting
• Simple to implement on a computer
• Requires ≅ log₂ log₂ n bits of storage
• Small relative error
• Estimation, probabilistic techniques
26. Counting Unique Elements
• Keep track of what you’ve seen; increase the counter if the item is new
• Use a hash map or a set (sketched below)
• linear in time
• space proportional to the number of uniques
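The exact approach is only a few lines in Python (a minimal sketch; a set handles both the membership check and the count):

def count_uniques(items):
    seen = set()  # memory grows with the number of uniques
    for item in items:
        seen.add(item)
    return len(seen)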
27. Characteristics
• Accurate answer (good to have)
• Linear in time – good!
• Linear in space – not good!
• “Trade accuracy for space!”
28. Estimation
• Approximate Counting – allow for some error, and use probabilistic techniques
• What can we do? What estimator works here?
29. Sampling
• Try to estimate the cardinality of the complete set by calculating the cardinality of a sample
• Error rates are high; they depend on how often items repeat in the data
• Example: a million integers between 1 and 10; draw a sample of 1,000. The sample’s cardinality is still 10. Scaling up by 1,000, our estimate is 10,000!
30. Puzzle
• Choose 9 evenly spaced numbers between 0 and 100? With repetitions?
• 10, 20, 30, 40, 50, 60, 70, 80, 90
• Why not 11, 21, 31, 41, …, 91?
• What is the minimum value chosen?
31. Simpler Problem
• Can we do the reverse?
• If our data consisted of evenly distributed numbers, repeated arbitrarily and randomly shuffled,
• how can we estimate the number of unique values?
32. Min-Estimator
• With evenly spaced data between 0 and 100, if the minimum is 10, the data had 9 unique values
• The number of uniques is (b - a)/(x - a) - 1, where x is the minimum seen in data ranging from a to b (see the sketch below)
• Good solution! What’s the problem?
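A minimal sketch of the min-estimator (the range endpoints a and b are assumed to be known):

def estimate_uniques(data, a=0, b=100):
    x = min(data)  # minimum value seen
    # evenly spaced data in (a, b) implies this many unique values
    return (b - a) / (x - a) - 1

# shuffled, repeated multiples of 10:
print(estimate_uniques([30, 10, 90, 10, 50, 70, 20, 40, 60, 80, 30]))  # 9.0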
33. Reality
• Data are not going to be numbers
• Even if they were, they wouldn’t be spread evenly over some range
34. Hashing
• Hash functions solve that problem
• They map arbitrary data from any domain to 32-bit integers that are uniformly distributed over the 2^32 range
• All you need is a hash function for your data.
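A minimal sketch of such a mapping (MD5 is just one convenient, well-mixing choice here, not the talk’s prescription):

import hashlib

def hash32(value):
    # map arbitrary data to an integer roughly uniform over [0, 2^32)
    digest = hashlib.md5(str(value).encode()).digest()
    return int.from_bytes(digest[:4], "big")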
36. Back to the Problem
• min() is just one estimator that can be used
• Another is to count the maximum number of 0s at the beginning of the hashed values
• 0001 0111 0011 1101 → 3 leading zeros
• What sort of estimator is this? Logarithmic!
37. Probability
• For uniformly distributed 32-bit numbers:
• Approximately half the data should start with 1
• Half should start with 0
• Of those, half should start with 01
• The rest should start with 00
39. Estimator
• ρ(x) = number of leading zeros in the hash of x
• ρ(1…) = 0, ρ(01…) = 1, ρ(001…) = 2, …
• S = maximum of all ρ(x)
• E(S) ≅ log₂ n
• Estimate = 2^S unique values
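A minimal sketch of this estimator, reusing the hash32 helper sketched earlier (the function names are mine):

def rho(h, bits=32):
    # number of leading zeros in the bits-wide representation of h
    count = 0
    for i in range(bits - 1, -1, -1):
        if h & (1 << i):
            break
        count += 1
    return count

def estimate_cardinality(items):
    s = max(rho(hash32(item)) for item in items)  # S = max of all rho(x)
    return 2 ** s  # estimate = 2^S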
40. Error
• Approximately one binary order of magnitude
• In Approximate Counting, we changed the base of the log
• Alternatively, take m hash functions, derive m values S₁, S₂, …, Sₘ, and calculate their average, A
• This reduces the error by a factor of 1/√m
41. m Hashes?
• Finding m good hash functions isn’t easy
• There is a CPU cost to calculating these hashes for every value.
42. Stochastic Averaging
• Avoids the need for m separate hash functions
• Use a few bits from the hash to distribute values into m bins
• Use the remaining bits to estimate n/m
• Average these values and scale by m
43. Example
• If the hash is 1010 1010 1010,
• we can use the first three bits to put this value into one of 8 (2^3) bins,
• and the remaining bits, ---0 1010 1010, can be used as before, but to estimate n/8
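A minimal sketch of stochastic averaging, built on the hash32 and rho helpers above (averaging the per-bin estimators and scaling by m is my reading of slide 42; 3 bin bits match this slide’s example):

def estimate_with_bins(items, bin_bits=3):
    m = 1 << bin_bits  # number of bins, 8 for 3 bits
    s = [0] * m
    for item in items:
        h = hash32(item)
        b = h >> (32 - bin_bits)  # first bits pick the bin
        rest = h & ((1 << (32 - bin_bits)) - 1)  # remaining bits
        s[b] = max(s[b], rho(rest, bits=32 - bin_bits))
    a = sum(s) / m  # average the per-bin estimators
    return m * 2 ** a  # each bin estimates n/m; scale by m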
47. Advantages
• Simple, very easy to implement (though not obvious, and hard to analyse)
• log₂ log₂ n space complexity – 8 bits can count up to 2^(2^8) = 2^256 ≅ 10^77
• Linear time complexity – good
• Easily distributable – just exchange estimators
48. Ideas
• Estimation with probabilistic approaches – trade accuracy for space; averaging values reduces errors
• Hashing – to convert any data into a uniformly distributed set of numbers
• Don’t implement these yourself – find them in PostgreSQL, Cassandra, Redis, etc.