Counting
(Using Computer)
1
Problems
• Count all elements
  • count number of HTTP requests
• Count unique elements
  • detect network attacks
  • query optimisation in databases
2
Counting All Elements
• Simple solution – use a counter!
• Accurate answer
• Linear in time – O(n)
• Logarithmic in space – O(log n)
  • log₂ n bits to store n
3
Logarithmic Space
• To count up to n you need at least log₂ n bits
• 10 bits can count up to around one thousand
• 20 bits can count up to around one million
• 30 bits can count up to around one billion
• 64 bits can probably count everything
4
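The figures on this slide can be checked with a small Python sketch. The helper name `bits_needed` is ours, not from the slides; it relies on the built-in `int.bit_length()`.

```python
def bits_needed(n: int) -> int:
    """Number of bits needed to store the integer n in binary."""
    # int.bit_length() is floor(log2(n)) + 1 for n > 0, and 0 for n == 0.
    return max(1, n.bit_length())

for n in (1_000, 1_000_000, 1_000_000_000):
    print(n, "needs", bits_needed(n), "bits")
```

This prints 10, 20 and 30 bits for a thousand, a million and a billion.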
Problem earlier?
• Memory wasn’t always cheap
• Robert Morris (1932–2011)
• Bell Labs, 1977
• “a programming situation that required using a large number of counters to keep track of the number of occurrences of many different events.”
• 8 bits!
5
Better than 2^n
• An 8-bit counter has 2^8 = 256 distinct values, so it can count up to about 256
• Can we do better? No, 256 is the information-theoretic limit
• Can we count more with 8 bits? Say, up to 2 × 256 (512)? Any hack?
6
Ideas
• Can we reuse the counter? Loop through it twice?
  • Flag that tells us if we’re using or reusing the counter?
• Count every other event, so effectively we’re counting double?
  • Flag that keeps track of whether we should count the next event?
7
No More Bits
• A flag is another bit, and we don’t have a 9th bit
• We can’t count beyond 256 accurately with only 8 bits.
• Can we count inaccurately?
8
Tossing Coins
• Toss a fair coin – ½ chance of heads; 50%
• If heads, we increment the counter
• If tails, we don’t
• Not deterministic. Probabilistic!
• Trading accuracy for range, to get past the limit on exact counting
9
Coin-Tossing “Counter”
[Diagram: events 1 to 4, each followed by a coin toss with outcomes Heads and Tails]
10
Coin Tossing
• Toss the fair coin twice, expect one head
• Toss the fair coin 10 times, expect five heads
• So our “counter” is expected to increment once for every two tosses.
[Chart: probability (0 to 0.3) against number of heads (0 to 10)]
11
Estimator
• The “counter” isn’t a counter, it’s an estimator
• If the count is n, the estimator should be about n/2
• If the estimator is k, the estimate of the count is 2k (twice k)
• An 8-bit estimator that goes up to 256 can estimate counts up to 512!
12
Implementation
• Use a random number generator:

import random

def increment(estimator):
    # Heads (probability 1/2): count the event; tails: skip it.
    if random.random() < 0.5:
        return estimator + 1
    return estimator
13
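A short usage sketch for the fair-coin estimator, assuming we simply feed it n events and double the result. The driver `count_events` and the fixed seed are ours, added only so the demo is repeatable; the 8-bit cap on the estimator is not modelled here.

```python
import random

def increment(estimator):
    # Same rule as on the slide: count the event only on heads.
    if random.random() < 0.5:
        return estimator + 1
    return estimator

def count_events(n, seed=0):
    """Feed n events to the coin-tossing estimator; return (estimator, estimate)."""
    random.seed(seed)  # fixed seed only so the demo is repeatable
    estimator = 0
    for _ in range(n):
        estimator = increment(estimator)
    return estimator, 2 * estimator

estimator, estimate = count_events(10_000)
print("estimator:", estimator, "estimate:", estimate)
```

The estimator ends up near 5,000, so the estimate lands near the true count of 10,000 while the stored value is only half as large.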
Counting Further
• To count to higher values, say 768 (3 × 256) or 256,000 (1000 × 256), use a biased coin
• Bias: the coin isn’t fair, not 50-50
• Coin with ⅓ chance of heads: the “counter” increments by 1 about every three tosses
• Coin with 1/1000 chance of heads: the “counter” increments by 1 about every 1000 tosses
14
Implementation
• Choose bias of coin:

import random

def increment(estimator, bias):
    # bias is the probability of heads, i.e. of counting the event.
    if random.random() < bias:
        return estimator + 1
    return estimator

• Call with bias=1/3 or 1/1000
15
Fair Coin Error
• Suppose the actual count is 1
• With fair coin, the estimator is:

0 (50% chance), or 1 (50% chance)
• Estimate is: 0 (50% chance), or 2 (50% chance)
• Error is 1, always
16
Biased Coin Error
• If coin has 1/1000 chance of heads, the estimator
is:

0 (999/1000 chance), or 1 (1/1000 chance)
• Estimate is: 

0 (999/1000 chance), or 1000 (1/1000 chance)
• Error is 1 or 999
• For small counts, the error can be huge
17
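The two error slides can be put on one scale. After n events the estimator k is binomially distributed with parameters n and p (the chance of heads), so the estimate k/p has mean n and standard deviation √(n(1 − p)/p). The sketch below is our own illustration, not from the slides; it prints that standard deviation relative to n.

```python
import math

def relative_std_error(n, p):
    """Standard deviation of the estimate k / p, divided by the true count n.

    k is Binomial(n, p), so Var(k / p) = n * (1 - p) / p.
    """
    return math.sqrt(n * (1 - p) / p) / n

for p in (1 / 2, 1 / 3, 1 / 1000):
    for n in (1, 1_000, 1_000_000):
        print(f"p={p:.4f} n={n:>9,} relative error ~ {relative_std_error(n, p):.3f}")
```

For a fixed coin the relative error shrinks as the count grows, which is why the 1/1000 coin is poor for small counts but reasonable for counts in the millions.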
Another Estimator
• Estimator stores the value of log₂ n; for estimator k, the estimate will be 2^k
• If the estimator is k after n events, k ≈ log₂ n
• k should become k + 1 after n more events (when the count doubles)
• We’re only storing an integer, k
• We’re only storing an integer, k
18
Increment?
• Given only the value k, how do we know when to increment? We use our coin intuition
• With probability 2^(-k), we increment k
• With probability 1 - 2^(-k), we don’t
19
log-Estimator
[Diagram: events 1 to 4, each followed by a coin toss, with these probabilities]
Toss   Heads   Tails
1      1       0
2      1/2     1/2
3      1/4     3/4
4      1/8     7/8
20
Implementation
• This is simple too:

import random

def increment(estimator):
    # Increment with probability 2^(-estimator).
    if random.random() < 2 ** -estimator:
        return estimator + 1
    return estimator
21
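A sketch of the log-estimator in use. The slides give the estimate as 2^k; starting from k = 0, the expected value of 2^k after n events is n + 1, so this sketch subtracts 1 to make the estimate unbiased. The helpers `estimate` and `run`, the seed and the trial counts are our own choices for the demo.

```python
import random

def increment(k):
    # Bump the stored exponent with probability 2^(-k).
    if random.random() < 2 ** -k:
        return k + 1
    return k

def estimate(k):
    # Starting from k = 0, E[2^k] = n + 1 after n events, so subtract 1.
    return 2 ** k - 1

def run(n):
    """Feed n events to a fresh estimator and return the final k."""
    k = 0
    for _ in range(n):
        k = increment(k)
    return k

random.seed(1)  # fixed seed only so the demo is repeatable
n, trials = 1_000, 1_000
ks = [run(n) for _ in range(trials)]
mean_estimate = sum(estimate(k) for k in ks) / trials
print("largest k seen:", max(ks))
print("mean estimate over", trials, "runs:", mean_estimate)
```

A single run can be off by a factor of two or more; only the average over many runs settles near the true count, which is the "one binary order of magnitude" error discussed next.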
Error
• Error is about “one binary order of magnitude”, i.e. up to 100%
• If the estimator is 10, the estimate is 1024, and the actual value might be between 512 and 1024
• This seems worse, but it is regular, and can be fixed.
22
Reducing Error
• We can improve the algorithm by changing the
base of the estimator
• Instead of storing log₂ n, we store log_a n
• For base a (with 1 < a < 2):
  • With probability a^(-k), we increment k
  • With probability 1 - a^(-k), we don’t
23
Example
• With a = 2^(2^(-δ)), we can count up to n using storage of log₂ log₂ n + δ bits (take this formula as fact)
• For example, with δ = 4, a will be 2^(2^(-4)) = 2^(1/16) ≈ 1.044, and we can count up to 65,536 using 8 bits: log₂ 65,536 = 16, log₂ 16 = 4, 4 + 4 = 8
• Binary counter: 8 bits → 256, 16 bits → 65,536!
• Relative errors are typically <15%
24
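A sketch of the base-a variant with δ = 4. With increment probability a^(-k) and k starting at 0, the expected value of a^k after n events is 1 + n(a − 1), so this sketch inverts that relation to estimate the count. That inversion, the helper names, the seed and the trial counts are our assumptions; the slides only state that k stores roughly log_a n.

```python
import random

DELTA = 4
A = 2 ** (2 ** -DELTA)  # base a = 2^(1/16), about 1.044

def increment(k, a=A):
    # Bump the stored exponent with probability a^(-k).
    if random.random() < a ** -k:
        return k + 1
    return k

def estimate(k, a=A):
    # E[a^k] = 1 + n * (a - 1) after n events, so invert that relation.
    return (a ** k - 1) / (a - 1)

def run(n):
    """Feed n events to a fresh estimator and return the final k."""
    k = 0
    for _ in range(n):
        k = increment(k)
    return k

random.seed(2)  # fixed seed only so the demo is repeatable
n, trials = 10_000, 100
ks = [run(n) for _ in range(trials)]
estimates = [estimate(k) for k in ks]
print("largest k seen:", max(ks), "(an 8-bit register holds up to 255)")
print("mean estimate:", sum(estimates) / trials)
```

The stored value stays well inside 8 bits for a count of 10,000, and individual estimates scatter much less than with base 2.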
Approximate Counting
• Simple to implement on computer
• Requires ≈ log₂ log₂ n bits of storage.
• Small relative error
• Estimation, probabilistic techniques
25
Counting Unique Elements
• Keep track of what you’ve seen, increase
counter if new item
• Use a hash-map or a set
• linear in time
• space proportional to number of uniques
26
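The exact approach described on this slide can be sketched as follows; the function name `count_unique` and the sample addresses are made up for illustration.

```python
def count_unique(items):
    """Exact distinct count: remember every item seen so far."""
    seen = set()
    count = 0
    for item in items:
        if item not in seen:   # new item: remember it and count it
            seen.add(item)
            count += 1
    return count

requests = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
print(count_unique(requests))  # 3
```

The set grows with the number of distinct items, which is the space cost the following slides try to avoid.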
Characteristics
• Accurate answer (good to have)
• Linear with time – good!
• Linear with space – not good!
• “Trade accuracy for space!”
27
Estimation
• Approximate Counting – allow for some
error, and use probabilistic techniques
• What can we do? What estimator works here?
28
Sampling
• Try to estimate the cardinality of the complete
set, by calculating the cardinality of a sample
• Error rates are high, depends on the replication
of items in data
• Example: a million integers between 1 and 10,
draw sample of 1000. Cardinality is still 10.

Scaling up by 1,000, our estimate is 10,000!
29
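The sampling example on this slide can be reproduced with a few lines of Python. The seed and variable names are ours; the numbers follow the slide.

```python
import random

random.seed(3)  # fixed seed only so the demo is repeatable
data = random.choices(range(1, 11), k=1_000_000)  # a million integers in 1..10
sample = random.sample(data, 1_000)               # sample one item in a thousand

sample_cardinality = len(set(sample))
scaled_estimate = sample_cardinality * (len(data) // len(sample))
true_cardinality = len(set(data))

print("sample cardinality:", sample_cardinality)
print("scaled estimate:   ", scaled_estimate)
print("true cardinality:  ", true_cardinality)
```

The sample already contains all 10 values, so scaling its cardinality by 1,000 overshoots the true answer by three orders of magnitude.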
Puzzle
• Choose 9 evenly spaced numbers between 0
and 100? With repetitions?
• 10, 20, 30, 40, 50, 60, 70, 80, 90
• Why not 11, 21, 31, 41, …, 91?
• What is the minimum value chosen?
30
Simpler Problem
• Can we do the reverse?
• Suppose our data consisted of evenly distributed numbers, repeated arbitrarily and randomly shuffled
• How can we estimate the number of unique values?
31
Min-Estimator
• With evenly-spaced data between 0 and 100, if
minimum is 10 the data had 9 values
• The number of uniques is (b - a)/(x - a) - 1, where x
is the minimum seen in data ranging from a to b
• Good solution! What’s the problem?
32
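The min-estimator formula from this slide, as a minimal sketch. The function name `min_estimate` is ours, and the data are the evenly spaced values from the puzzle, repeated a random number of times and shuffled.

```python
import random

def min_estimate(data, a, b):
    """Estimate the number of distinct, evenly spaced values between a and b."""
    x = min(data)
    return (b - a) / (x - a) - 1

values = [10, 20, 30, 40, 50, 60, 70, 80, 90]
# Repeat each value an arbitrary number of times, then shuffle.
data = [v for v in values for _ in range(random.randint(1, 5))]
random.shuffle(data)

print(min_estimate(data, 0, 100))  # 9.0
```

Only the minimum is kept, so the estimator needs constant space and is unaffected by repetitions or ordering.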
Reality
• Data are not going to be numbers
• Even if they were, they wouldn’t be spread evenly in some range
33
Hashing
• Hash functions solve that problem
• Map arbitrary data from any domain to 32-bit integers that are uniformly distributed over the 2^32 range
• All you need is a hash function for your data.
34
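One possible way to get such 32-bit values in Python. The slides do not name a hash function; truncating SHA-256 from the standard `hashlib` module is our assumption, chosen because it is available everywhere and behaves uniformly, and `hash32` is a hypothetical helper name.

```python
import hashlib

def hash32(item):
    """Map any value (via its string form) to an integer in [0, 2**32)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # keep the first 32 bits

for item in ("alice", "bob", 42):
    print(item, "->", f"{hash32(item):032b}")
```

Equal inputs always map to the same integer, which is what makes repeated items harmless for the estimators that follow.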
Counting Unique Elements
• Philippe Flajolet (1948–2011)
• 1983, Approximate Counting
• 1985, Probabilistic Counting
• 2003, LogLog
• 2007, HyperLogLog
35
Back to the Problem
• min() is just one estimator that can be used
• Another is to count the maximum number of 0s at the beginning of the hashed values.
• 0001 0111 0011 1101 → 3
• What sort of estimator is this? Logarithmic!
36
Probability
• For uniformly distributed 32-bit numbers
• Approximately half the data should start with 1
• Half should start with 0
• Of this, half should start with 01
• The rest should start with 00
37
Probability of Pattern
Pattern       1…    01…   001…   0001…
Zeroes        0     1     2      3
Probability   1/2   1/4   1/8    1/16
38
Estimator
• ρ(x) = number of leading zeros in hash
• ρ(1) = 0, ρ(01) = 1, ρ(001) = 2, …
• S = maximum of all ρ(x)
• E(S) ≈ log₂ n
• Estimate = 2^S unique values
39
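A direct sketch of this estimator. The hash (SHA-256 truncated to 32 bits) and the names `hash32`, `rho` and `estimate_unique` are our assumptions; ρ counts leading zeros as defined on the slide.

```python
import hashlib

BITS = 32

def hash32(item):
    # Assumption: SHA-256 truncated to 32 bits stands in for "a good hash".
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def rho(h, bits=BITS):
    """Number of leading zero bits of h when written with `bits` binary digits."""
    return bits - h.bit_length()

def estimate_unique(items):
    s = max(rho(hash32(item)) for item in items)  # S = maximum of all rho(x)
    return 2 ** s

data = [f"user-{i}" for i in range(1_000)]
print("estimate:", estimate_unique(data), "true:", len(set(data)))
print("with repeats:", estimate_unique(data * 3))  # same S, so same estimate
```

The estimate is always a power of two, so a single S can easily be off by a factor of two or more; repeated items never change it.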
Error
• Approx. 1 binary order of magnitude
• In Approximate Counting, we changed the base of the log
• Alternative: take m hash functions, derive m values S<1>, S<2>, …, S<m>, and calculate their average, A
• This scales the error by a factor of 1/√m
40
m Hashes?
• Finding m good hash functions isn’t easy
• There is a CPU cost to calculating these hashes
for every value.
41
Stochastic Averaging
• Finding m good hash functions isn’t easy
• Use a few bits from the hash to distribute values into m bins
• Use remaining bits to estimate n/m
• Average these values and scale by m
42
Example
• If hash is 1010 1010 1010
• we can use the first three bits to put this value into one of 8 (2^3) bins
• the remaining bits, ---0 1010 1010, can be used
as before, but to estimate n/8
43
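Stochastic averaging can be sketched in the style of LogLog. Everything specific here is an assumption for illustration: SHA-256 truncated to 32 bits as the hash, m = 256 bins taken from the first 8 bits, and the asymptotic LogLog constant 0.39701. Following the LogLog convention, each bin stores the position of the first 1-bit (leading zeros + 1) rather than the plain count of leading zeros.

```python
import hashlib

BUCKET_BITS = 8              # first 8 hash bits choose one of m = 256 bins
M = 1 << BUCKET_BITS
REST_BITS = 32 - BUCKET_BITS
ALPHA = 0.39701              # asymptotic LogLog constant (assumed for large m)

def hash32(item):
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def rank(w, bits=REST_BITS):
    # Position of the first 1-bit, i.e. leading zeros + 1.
    return bits - w.bit_length() + 1

def loglog_estimate(items):
    registers = [0] * M
    for item in items:
        h = hash32(item)
        j = h >> REST_BITS                 # bin index from the top bits
        w = h & ((1 << REST_BITS) - 1)     # remaining bits
        registers[j] = max(registers[j], rank(w))
    mean_rank = sum(registers) / M         # average the per-bin observables
    return ALPHA * M * 2 ** mean_rank      # scale back up by m

data = [f"user-{i}" for i in range(50_000)]
print("estimate:", round(loglog_estimate(data)), "true:", len(data))
```

Only one hash is computed per item, yet the 256 bins behave like 256 separate estimators, so the result is far steadier than a single maximum.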
Techniques
• Estimator – “bit-pattern observable”
• Stochastic averaging
44
Algorithms
Algorithm                      Space      Error
Probabilistic Counting, 1985   m 32-bit   0.78/√m
LogLog, 2003                   m 5-bit    1.3/√m
HyperLogLog, 2007              m 5-bit    1.04/√m
45
LogLog
46
Advantages
• Simple, very easy to implement (not obvious,
hard to analyse)
• log₂ log₂ n space complexity – 8 bits suffice for n up to 2^(2^8) ≈ 10^77
• Linear time complexity – good
• Easily distributable – just exchange estimators
47
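"Just exchange estimators" can be illustrated with the single-maximum estimator: each node keeps only its own S, and the S for the combined data is the larger of the two. The hash choice (SHA-256 truncated to 32 bits), the helper names and the node data are assumptions for this sketch.

```python
import hashlib

def leading_zeros(item, bits=32):
    # Assumption: SHA-256 truncated to 32 bits as the hash function.
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "big")
    return bits - h.bit_length()

def sketch(items):
    """S = the largest number of leading zeros over all hashed items."""
    return max(leading_zeros(item) for item in items)

def merge(s1, s2):
    # The maximum over a union is the maximum of the two maxima.
    return max(s1, s2)

node_a = [f"user-{i}" for i in range(0, 6_000)]
node_b = [f"user-{i}" for i in range(4_000, 10_000)]   # overlaps with node_a

merged = merge(sketch(node_a), sketch(node_b))
print("merged S:", merged, "estimate:", 2 ** merged)
```

The merged value equals what a single node would have computed over all the data, and items seen by both nodes are not double counted. With m bins the same idea applies bin by bin.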
Ideas
• Estimation with probabilistic approaches – trade
accuracy for space – averaging values reduces
errors
• Hashing – to convert any data into a uniformly distributed set of numbers
• Don’t implement it yourself – find it in PostgreSQL, Cassandra, Redis, etc.
48
Slides, References:
Roshan.Mathews@ca.com
49

