Count-Distinct Problem
Yunhe Feng, Kai Zhang
Apr. 5, 2016
Questions
1. What is the cardinality of this data stream:
{1, 2, 4, 6, 8, 9, 2, 3, 11, 3, 1, 4}
2. Remember we use “bit pattern observables” to estimate
cardinality; describe the basic idea behind it.
3. How are the “buckets” useful in the LOGLOG COUNTING algorithm?
Outline
1. Overview
2. History
3. Algorithms
4. Implementation & Results
5. Open Issues
6. References
Overview
Definition:
Instance: A stream of elements $x_1, x_2, \ldots, x_s$ with repetitions, and
an integer $m$. Let $n$ be the number of distinct elements, namely
$n = |\{x_1, x_2, \ldots, x_s\}|$, and let these elements be $\{e_1, e_2, \ldots, e_n\}$.
Objective: Find an estimate $\hat{n}$ of $n$ using only $m$ storage units, where
$m \ll n$.
e.g. Count the cardinality of the stream: $a, b, a, c, d, b, d$. For this
instance, $n = |\{a, b, c, d\}| = 4$.
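For contrast with the estimators that follow, here is a minimal sketch of the exact baseline. The function name `exact_distinct` is an illustrative choice, not something from the slides. It remembers every distinct element, so its memory grows with $n$, which is what the objective above rules out.

```python
def exact_distinct(stream):
    """Exact count-distinct: stores every distinct element seen so far."""
    seen = set()
    for x in stream:
        seen.add(x)
    return len(seen)


# The example stream from the definition above.
print(exact_distinct(["a", "b", "a", "c", "d", "b", "d"]))  # 4
```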
Example:
Keep track of the number of
Unique Visitors (UV) for a particular
product on Amazon in one day.
Operation: Searching, Insertion
Drawbacks:
• 1 MB for each tree, 1 million items: 100 GB memory! (200 million on Amazon)
• What if we want to know the number of UVs of 2 items together?
Other Applications
Application:
Networking / Traffic monitoring
• Detection of worm propagation
• Network attacks
• Link-based spam
Data mining of massive data sets
• Natural language texts
• Biological data
• Large structured databases
Google: Sawzall, Dremel and PowerDrill
History
1980: Optimization of classical algorithms for operations on databases:
union, intersection, sorting, …
Data set size >> RAM capacities.
• in one pass;
• using small auxiliary memory
1983: Probabilistic Counting by Flajolet and Martin
2003: LogLog Counting algorithm
2007: HyperLogLog Counting algorithm
1. LINEAR COUNTING
Step 1: Allocate a bit map (hash table) of size $m$; all entries are initialized to “0”;
0 0 0 0 0 0 … 0 0 0 0 0
1, 2, … … m
Step 2: Hash each value to a bit map address and set the bit at that address to “1”;
0 1 0 0 1 1 … 0 0 1 0 1
1, 2, … … m
Step 3: Count the empty bit map entries and divide the count by the bit map size $m$
(the fraction is $V_n$); then the cardinality estimate is:
$\hat{n} = -m \ln V_n$
LINEAR COUNTING
• cardinality: $n = 11$
• estimated cardinality: $\hat{n} = -m \ln V_n = -8 \ln \frac{1}{4} \approx 11.09$
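LINEAR COUNTING
Model: $n$ balls thrown into $m$ boxes.
Let $A_j$ stand for the event that box $j$ is empty:
$P(A_j) = \left(1 - \frac{1}{m}\right)^n$
$P(A_j \cap A_k) = \left(1 - \frac{2}{m}\right)^n, \quad j \ne k$
Let $U_n$ denote the number of empty boxes:
$E(U_n) = \sum_{j=1}^{m} P(A_j) = m \left(1 - \frac{1}{m}\right)^n \cong m e^{-n/m}$
$\hat{n} = -m \ln \frac{E(U_n)}{m}$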
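A quick numeric check of the balls-in-boxes derivation, as a sketch. It compares the exact expectation $m(1 - 1/m)^n$ with the approximation $m e^{-n/m}$ and inverts the approximation to recover $n$. The function names are illustrative and do not come from the slides.

```python
import math


def expected_empty_exact(n, m):
    """E(U_n) = m * (1 - 1/m)^n: exact expected number of empty boxes."""
    return m * (1.0 - 1.0 / m) ** n


def expected_empty_approx(n, m):
    """The approximation m * exp(-n/m) used by Linear Counting."""
    return m * math.exp(-n / m)


def invert(expected_empty, m):
    """Solve E(U_n) = m * exp(-n/m) for n."""
    return -m * math.log(expected_empty / m)


n, m = 1000, 1000
print(expected_empty_exact(n, m), expected_empty_approx(n, m))
print(invert(expected_empty_approx(n, m), m))  # recovers n up to rounding
```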
LINEAR COUNTING
Algorithm Basic Linear Counting:
let $key_i$ = the key for the $i$-th tuple in the relation.
initialize the bit map to “0”s.
for $i$ = 1 to $q$ do
    hash_value = hash($key_i$)
    bit map(hash_value) = “1”
end for
$U_n$ = number of “0”s in the bit map
$V_n$ = $U_n / m$
$\hat{n} = -m \ln V_n$
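The Basic Linear Counting algorithm can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper `hash64` uses SHA-1 from the standard library as a stand-in for the MurmurHash function named in the slides, and the bit map is a plain Python list.

```python
import hashlib
import math


def hash64(value):
    """Assumed stand-in hash: first 8 bytes of SHA-1 as an unsigned integer."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def linear_counting(stream, m):
    """Estimate the number of distinct keys in `stream` with an m-bit map."""
    bitmap = [0] * m                      # initialize the bit map to "0"s
    for key in stream:
        bitmap[hash64(key) % m] = 1       # set the addressed bit to "1"
    empty = bitmap.count(0)               # U_n: number of "0"s
    if empty == 0:
        raise ValueError("bit map is full; choose a larger m")
    v = empty / m                         # V_n = U_n / m
    return -m * math.log(v)               # n_hat = -m ln V_n


stream = list(range(1, 1001)) * 3         # 1000 distinct keys, each seen 3 times
print(linear_counting(stream, 4096))
```

Repeated keys hash to the same address, so duplicates never change the bit map or the estimate.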
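How to choose size $m$?
Constraint 1:
The mean number of empty boxes must be $a$ standard deviations from zero:
$E(U_n) - a \times \mathrm{StdDev}(U_n) > 0$
Lemma: The limiting distribution of $U_n$, the number of empty
boxes, is Poisson with the expected value of $\lambda$, where
$m e^{-n/m} \to \lambda$ as $n, m \to \infty$:
$\lim_{n,m \to \infty} \Pr(U_n = k) = (\lambda^k / k!)\, e^{-\lambda}$
Thus, the fill-up probability is then obtained as
$\Pr(U_n = 0) = e^{-\lambda}$
For a Poisson variable, $E(U_n) = \lambda$ and $\mathrm{StdDev}(U_n) = \sqrt{\lambda}$.
If $a = \sqrt{5}$, that is $E(U_n) > \sqrt{5} \cdot \mathrm{StdDev}(U_n)$, then $\mathrm{StdDev}(U_n) > \sqrt{5}$, so $\lambda > 5$ and
$\Pr(U_n = 0) < e^{-5} \approx 0.007\ (0.7\%)$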
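As a small illustration of Constraint 1, the fill-up probability under the Poisson approximation can be computed directly. The function name `fill_up_probability` is a hypothetical helper, and the result is only as good as the limiting approximation in the lemma.

```python
import math


def fill_up_probability(n, m):
    """Pr(U_n = 0) = exp(-lambda), with lambda = m * exp(-n/m)."""
    lam = m * math.exp(-n / m)   # expected number of empty boxes
    return math.exp(-lam)


# Choose n so that lambda is 5 for m = 1000: the probability is then e^-5.
m = 1000
n = m * math.log(m / 5)
print(fill_up_probability(n, m))
```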
Constraint 2:
Suppose the user wants to limit the standard error to $\epsilon$. With the load factor $t = n/m$, we have
$\frac{\left((e^t - t - 1)/m\right)^{1/2}}{t} < \epsilon$
or equivalently
$m > \frac{e^t - t - 1}{(\epsilon t)^2}$
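Because $t$ itself depends on $m$, Constraint 2 has to be solved numerically. The sketch below takes $t = n/m$ (the load factor, an assumption stated here because the slide text does not define $t$) and scans for the smallest $m$ that satisfies the bound. It ignores Constraint 1, and the function names are illustrative.

```python
import math


def standard_error(n, m):
    """((e^t - t - 1) / m)^(1/2) / t with load factor t = n / m."""
    t = n / m
    return math.sqrt((math.expm1(t) - t) / m) / t


def min_map_size(n, eps):
    """Smallest bit map size m whose standard error is below eps."""
    m = max(1, math.ceil(n / 700))   # keeps exp(t) within floating-point range
    while standard_error(n, m) >= eps:
        m += 1
    return m


print(min_map_size(100, 0.1))   # 80
```

The linear scan is adequate for small $n$; since the standard error decreases as $m$ grows, a bisection search would give the same answer faster for large $n$.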
LINEAR COUNTING
Map size $m$ for a given cardinality $n$ and standard error $\epsilon$:

n        ε = 0.01   ε = 0.1   |  n         ε = 0.01   ε = 0.1
100      5034       80        |  20000     10506      3105
200      5067       106       |  30000     12839      4417
300      5100       129       |  40000     15036      5680
400      5133       151       |  50000     17134      6909
500      5166       172       |  60000     19156      8112
600      5199       192       |  70000     21117      9294
700      5231       212       |  80000     23029      10458
800      5264       231       |  90000     24897      11608
900      5296       249       |  100000    26729      12744
1000     5329       268       |  200000    43710      23633
2000     5647       441       |  300000    59264      33992
3000     5957       618       |  400000    73999      44032
4000     6260       786       |  500000    88175      53848
5000     6556       948       |  600000    101932     63492
6000     6847       1106      |  700000    115359     72997
7000     7132       1261      |  800000    128514     82387
8000     7412       1412      |  900000    141441     91677
9000     7688       1562      |  1000000   154171     100880
10000    7960       1709      |
LOGLOG COUNTING
Basic Idea: (Bit pattern observables)
Hash each data item to a binary string like “01001101001…”
For each string $x \in \{0, 1\}^\infty$, let $\rho(x)$ denote the position of its first 1-bit:
$\rho(1\ldots) = 1,\ \rho(001\ldots) = 3$, etc.,
and let $M$ denote the data set after hashing. Clearly, we can expect about
$n/2^k$ amongst the distinct elements of $M$ to have a $\rho$-value equal to
$k$, so
$R(M) := \max_{x \in M} \rho(x)$
is a rough indication on the value of $\log_2 n$.
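The bit pattern observable can be sketched on strings of '0' and '1' characters. The names `rho` and `rough_log2_estimate` are illustrative choices; the convention for an all-zero string (length plus one) is an assumption, since the slides only treat infinite strings.

```python
def rho(bits):
    """1-based position of the first 1-bit in a string of '0'/'1' characters."""
    position = bits.find("1")
    # Assumed convention: an all-zero string of length L gets rank L + 1.
    return position + 1 if position >= 0 else len(bits) + 1


def rough_log2_estimate(hashed_strings):
    """R(M): the largest rho over the hashed data set, roughly log2(n)."""
    return max(rho(x) for x in hashed_strings)


print(rho("1"), rho("001"), rho("01001101001"))      # 1 3 2
print(rough_log2_estimate(["1", "001", "01"]))       # 3
```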
LOGLOG COUNTING
Suppose here comes a data stream: {234, 39102, 3, 4556, 90011, 87, …},
the hash function hashes each value to a binary string; suppose “90011” is hashed to:
0 0 1 0 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1    (hash value)
The first 1-bit of this {0,1}-string is at position 3: $\rho(001011\ldots) = 3$.
It has high variability: one experiment cannot suffice to obtain accurate
predictions.
Stochastic Averaging: emulating the effect of $m$ experiments.
LOGLOG COUNTING
Stochastic Averaging: emulating the effect of $m$ experiments.
0 0 1 0 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1    (hash value; the last 8 bits form the bucket index)
Use the last 8 digits to represent the bucket number:
8 bits can represent $m = 2^8 = 256$ buckets (experiments).
http://content.research.neustar.biz/blog/hll.html
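A minimal sketch of the bucket split, applied to the 24-bit example hash value shown on the slide. The helper name `split_hash` is hypothetical. The last 8 bits select one of 256 buckets and the remaining bits feed the first-1-bit observable.

```python
HASH_BITS = "001011010011101001000101"   # the example hash value from the slide


def split_hash(bits, bucket_bits=8):
    """Split a bit string into (bucket index, remaining bits)."""
    bucket = int(bits[-bucket_bits:], 2)   # last 8 bits -> bucket number
    rest = bits[:-bucket_bits]             # bits used for the rank observable
    return bucket, rest


bucket, rest = split_hash(HASH_BITS)
rank = rest.find("1") + 1                  # position of the first 1-bit
print(bucket, rest, rank)                  # 69 0010110100111010 3
```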
2. LOGLOG COUNTING algorithm
LOGLOG COUNTING ($\mathcal{M}$: multiset of hashed values; $m \equiv 2^k$):
initialize $M^{(1)}, M^{(2)}, \ldots, M^{(m)}$ to “0”;
let $\rho(y)$ be the rank of the first 1-bit from the left in $y$;
for $x = b_1 b_2 \cdots \in \mathcal{M}$ do
    set $j := \langle b_1, \ldots, b_k \rangle_2$ (value of first $k$ bits in base 2)
    set $M^{(j)} := \max(M^{(j)}, \rho(b_{k+1} b_{k+2} \cdots))$
return $E := \alpha_m\, m\, 2^{\frac{1}{m} \sum_j M^{(j)}}$ as cardinality estimate.
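The LogLog pseudocode can be sketched in Python as below. Several details are assumptions made for the illustration: `hash64` uses SHA-1 in place of the MurmurHash function mentioned in the slides, hash values are 64 bits wide, buckets are indexed from 0, and the asymptotic constant $\alpha_\infty \approx 0.39701$ from the LogLog paper is used instead of an exact $\alpha_m$.

```python
import hashlib

ALPHA_INF = 0.39701   # asymptotic value of alpha_m, assumed adequate for large m


def hash64(value):
    """Assumed stand-in hash: first 8 bytes of SHA-1 as an unsigned integer."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def loglog_estimate(stream, k=10):
    """LogLog cardinality estimate with m = 2^k buckets."""
    m = 1 << k
    width = 64 - k                            # bits left after the bucket index
    registers = [0] * m                       # M(1..m), initialized to 0
    for item in stream:
        h = hash64(item)
        j = h >> width                        # value of the first k bits
        rest = h & ((1 << width) - 1)         # b_{k+1} b_{k+2} ...
        rank = width - rest.bit_length() + 1  # rho: position of the first 1-bit
        if rank > registers[j]:
            registers[j] = rank
    mean = sum(registers) / m                 # arithmetic mean of the registers
    return ALPHA_INF * m * 2 ** mean


print(loglog_estimate(range(100000), k=10))
```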
LOGLOG COUNTING
Theorem: Let $\omega(n)$ be a function that tends to infinity arbitrarily slowly and
consider the function
$\ell(n) = \log_2 \log_2 \left(\frac{n}{m}\right) + \omega(n)$
Then, the $\ell(n)$-restricted algorithm and the LOGLOG algorithm provide the same
output with probability tending to 1 as $n$ tends to infinity.
e.g. Count cardinality till $2^{27}$ (a hundred million), adopt $m = 1024 = 2^{10}$ buckets;
each bucket is visited (roughly) $n/m = 2^{17}$ times;
we have $\log_2 \log_2 2^{17} \approx 4.09$, adopt $\omega = 0.91$, each bucket: 5 bits;
Totally 1024*5/8 = 640 bytes! (with a standard error of 4%)
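The sizing arithmetic in this example can be reproduced with a few lines. This is a sketch: rounding $\log_2 \log_2 (n/m)$ up to the next integer stands in for the choice of $\omega$, and the standard error uses the $1.30/\sqrt{m}$ figure from the LogLog paper, which the slide rounds to 4%. The function name is illustrative.

```python
import math


def loglog_memory(n_max, m):
    """Register width, total bytes, and standard error for LogLog."""
    bits_per_bucket = math.ceil(math.log2(math.log2(n_max / m)))
    total_bytes = m * bits_per_bucket / 8
    std_error = 1.30 / math.sqrt(m)
    return bits_per_bucket, total_bytes, std_error


print(loglog_memory(2 ** 27, 1024))   # 5 bits per bucket, 640 bytes, about 4% error
```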
HYPERLOGLOG COUNTING
LOGLOG COUNTING algorithm with Harmonic Mean
$E := \alpha_m\, m\, 2^{\frac{1}{m} \sum_j M^{(j)}}$
Arithmetic mean: $\frac{1}{m} \left( M^{(1)} + M^{(2)} + \cdots + M^{(m)} \right)$
Harmonic Mean: $\dfrac{m}{\frac{1}{2^{M(1)}} + \frac{1}{2^{M(2)}} + \cdots + \frac{1}{2^{M(m)}}}$
$E := \alpha_m m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$
3. HYPERLOGLOG COUNTING algorithm
HYPERLOGLOG COUNTING (input $\mathcal{M}$: multiset of items):
assume $m = 2^b$ with $b \in \mathbb{Z}_{>0}$;
initialize a collection of $m$ integers, $M[1], \ldots, M[m]$, to $-\infty$;
for $v \in \mathcal{M}$ do
    set $x := h(v)$;
    set $j = 1 + \langle x_1 x_2 \ldots x_b \rangle_2$ (the binary address determined
    by the first $b$ bits of $x$);
    set $w := x_{b+1} x_{b+2} \ldots$; set $M[j] := \max(M[j], \rho(w))$;
compute $Z := \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$;
return $E := \alpha_m m^2 Z$
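A sketch of the raw HyperLogLog estimate in Python, under assumptions that differ slightly from the pseudocode: registers start at 0 rather than $-\infty$ (a common practical choice, so that empty registers do not force the estimate to 0), buckets are indexed from 0, hashes are 64 bits wide via a SHA-1 based `hash64` stand-in, and $\alpha_m = 0.7213 / (1 + 1.079/m)$ is the approximation given in the HyperLogLog paper for $m \ge 128$.

```python
import hashlib


def hash64(value):
    """Assumed stand-in hash: first 8 bytes of SHA-1 as an unsigned integer."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hyperloglog_raw(stream, b=10):
    """Raw HyperLogLog estimate with m = 2^b registers (no corrections)."""
    m = 1 << b
    width = 64 - b
    registers = [0] * m                       # assumed: 0 instead of -infinity
    for v in stream:
        x = hash64(v)
        j = x >> width                        # address from the first b bits
        w = x & ((1 << width) - 1)            # remaining bits x_{b+1} x_{b+2} ...
        rank = width - w.bit_length() + 1     # rho(w)
        registers[j] = max(registers[j], rank)
    alpha = 0.7213 / (1 + 1.079 / m)          # approximation, valid for m >= 128
    z = 1.0 / sum(2.0 ** -r for r in registers)
    return alpha * m * m * z                  # E = alpha_m * m^2 * Z


print(hyperloglog_raw(range(100000), b=10))
```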
Implementation
Programming Language: Python 2.7
Hash Function: MurmurHash 3_64
Multiset: integers 1, 2, 3, …
[Figure] MurmurHash3_32: Number of Elements in Each Bucket Follows a Uniform Distribution
[Figure] MurmurHash3_32: Distributions of Position of 1st 1-bit of Hashed Binary Strings
[Figure] Linear Counting: Performances of LC for Different Map Sizes; Load Factor vs. Standard Errors
[Figure] LogLog Counting: Performances of LLC for Different Numbers of Buckets
[Figure] LogLog Counting: A Large Error for Small Cardinalities
[Figure] HyperLogLog Counting: Performances of LLC for Different Numbers of Buckets
[Figure] Comparison of HLLC and LLC when Number of Buckets is Small
[Figure] Comparison of HLLC and LLC when Number of Buckets is Large
[Figure] Comparison of HLLC and LLC
The “raw” estimate:
$E := \alpha_m m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$

Small Cardinalities:
When the cardinality is small, the
proportion of un-hit buckets is large,
which leads to inaccurate estimation.

Large Cardinalities:
A hash function of $L$ bits can at most
distinguish $2^L$ different values, and as the
cardinality $n$ approaches $2^L$, hash
collisions become more and more likely
and accurate estimation becomes impossible.

if $E \le \frac{5}{2} m$ then
    Let V be the number of registers equal to 0.
    if $V \ne 0$ then set E := LinearCounting(m, V)
    else
        do nothing
    end
end
if $E \le \frac{1}{30} 2^{32}$ then
    E := E
end
if $E > \frac{1}{30} 2^{32}$ then
    $E := -2^{32} \log(1 - E/2^{32})$
end
return E
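The two range corrections can be sketched as a function that post-processes a raw estimate. This is an illustration under assumptions: `corrected_estimate` is a hypothetical name, LinearCounting(m, V) is taken to be $m \ln(m/V)$, consistent with the Linear Counting estimate $-m \ln V_n$, and the large-range branch assumes a 32-bit hash as on the slide.

```python
import math

TWO_32 = 2.0 ** 32   # number of distinct values of a 32-bit hash


def corrected_estimate(raw, registers):
    """Apply small- and large-range corrections to a raw HyperLogLog estimate."""
    m = len(registers)
    estimate = raw
    if raw <= 2.5 * m:
        zeros = registers.count(0)            # V: registers still equal to 0
        if zeros != 0:
            estimate = m * math.log(m / zeros)   # LinearCounting(m, V)
    elif raw > TWO_32 / 30.0:
        # Compensates for hash collisions; requires raw < 2^32.
        estimate = -TWO_32 * math.log(1.0 - raw / TWO_32)
    return estimate


print(corrected_estimate(0.72 * 16, [0] * 16))   # empty sketch -> 0.0
print(corrected_estimate(1000.0, [3] * 64))      # mid range -> unchanged
```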
[Figure] Correction for HyperLogLog Counting: Bad Performances for Small Cardinalities; Corrections for Small Cardinalities
[Figure] Correction for HyperLogLog Counting: Performance Comparison between HLLC_raw and HLLC for Small Cardinalities
[Figure] HyperLogLog Counting: Bad Performances for Large Cardinalities; Corrections for Large Cardinalities
[Figure] HyperLogLog Counting: Performance Comparison between HLLC_raw and HLLC for Large Cardinalities
Open Issues
Are there other smart ideas to use?
References
Whang, Kyu-Young, Brad T. Vander-Zanden, and Howard M. Taylor. "A linear-time probabilistic counting algorithm for database applications." ACM Transactions on Database Systems (TODS) 15.2 (1990): 208-229.
Durand, Marianne, and Philippe Flajolet. "Loglog counting of large cardinalities." Algorithms-ESA 2003. Springer Berlin Heidelberg, 2003. 605-617.
Flajolet, Philippe, et al. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013.
Metwally, Ahmed, Divyakant Agrawal, and Amr El Abbadi. "Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic." Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology. ACM, 2008.
History
Sketch-Based Algorithm
Distinct Counting Algorithm
Sampling Algorithms Sketch-Based Algorithm
Logarithmic Hashing Algorithms Uniform Hashing Algorithms
Interval-Based Algorithms Bucket-Based Algorithms
Pure-Bucket-Based Algorithms Hybrid-Bucket-Based Algorithms
Hybrid Bucket-Based-Logarithmic Algorithms Hybrid Bucket-Based-Sampling Algorithms
