Count-Distinct Problem
Yunhe Feng, Kai Zhang
Apr.5 2016
Questions
1. What is the cardinality of this data stream:
{1, 2, 4, 6, 8, 9, 2, 3, 11, 3, 1, 4}
2. Remember we use “bit pattern observables” to estimate
cardinality, describe the basic idea behind it.
3. How are the “buckets” useful in the LOGLOG COUNTING algorithm?
Outline
1. Overview
2. History
3. Algorithms
4. Implementation & Results
5. Open Issues
6. References
Overview
Definition:
Instance: A stream of elements $x_1, x_2, \ldots, x_s$ with repetitions, and an integer $m$. Let $n$ be the number of distinct elements, namely $n = |\{x_1, x_2, \ldots, x_s\}|$, and let these elements be $\{e_1, e_2, \ldots, e_n\}$.
Objective: Find an estimate $\hat{n}$ of $n$ using only $m$ storage units, where $m \ll n$.
e.g. Count the cardinality of the stream: $a, b, a, c, d, b, d$. For this instance, $n = |\{a, b, c, d\}| = 4$.
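For reference, the exact answer can be computed by storing every distinct element. The sketch below is only a baseline under that assumption (the function name `exact_cardinality` is illustrative, not from the slides); its memory grows with $n$, which is what the estimators in this deck try to avoid.

```python
def exact_cardinality(stream):
    """Exact distinct count; memory grows with the number of distinct elements."""
    return len(set(stream))


# The stream a, b, a, c, d, b, d from the definition above.
print(exact_cardinality("abacdbd"))
# The stream from Question 1.
print(exact_cardinality([1, 2, 4, 6, 8, 9, 2, 3, 11, 3, 1, 4]))
```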
Example:
Keep track of the number of Unique Visitors (UV) for a particular product on Amazon in one day.
Operation: Searching, Insertion
Drawbacks:
• 1MB for each tree, 1 million items: 100GB memory! (200 million on Amazon)
• What if we want to know the number of UVs of 2 items together?
Other Applications
Application:
Networking / Traffic monitoring
• Detection of worm propagation
• Network attacks
• Link-based spam
Data mining of massive data sets
• Natural language texts
• Biological data
• Large structured databases
Google: Sawzall, Dremel and PowerDrill
History
1980: Optimization of classical algorithmic operations on databases: union, intersection, sorting, …
Data set size >> RAM capacities.
• in one pass;
• using small auxiliary memory
1983: Probabilistic Counting by Flajolet and Martin
2003: LogLog Counting algorithm
2007: HyperLogLog Counting algorithm
1. LINEAR COUNTING
Step 1: Allocate a bit map (hash table) of size $m$; all entries are initialized to “0”:
0 0 0 0 0 0 … 0 0 0 0 0    (positions 1, 2, …, m)
Step 2: Hash each value to a bit map address and set the bit at that address to “1”:
0 1 0 0 1 1 … 0 0 1 0 1    (positions 1, 2, …, m)
Step 3: Count the empty bit map entries and divide the count by the bit map size $m$ (the fraction is $V_n$); then the cardinality estimate is:
$\hat{n} = -m \ln V_n$
LINEAR COUNTING
• cardinality: $n = 11$
• estimated cardinality: $\hat{n} = -m \ln V_n = -8 \ln \frac{1}{4} \doteq 11.09$
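The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch: the function name `linear_counting_estimate` is made up here, and the inputs assume a bit map of $m = 8$ entries of which 2 are still empty, so that $V_n = 1/4$.

```python
import math


def linear_counting_estimate(m, empty):
    """Linear counting estimate from map size m and the number of empty entries."""
    v = empty / m  # V_n, the fraction of empty entries
    return -m * math.log(v)


print(round(linear_counting_estimate(8, 2), 2))  # 11.09
```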
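LINEAR COUNTING
$n$ balls, $m$ boxes.
Let $A_j$ stand for the event that box $j$ is empty:
$P(A_j) = \left(1 - \frac{1}{m}\right)^n$
$P(A_j \cap A_k) = \left(1 - \frac{2}{m}\right)^n, \quad j \neq k$
Let $U_n$ denote the number of empty boxes:
$E(U_n) = \sum_{j=1}^{m} P(A_j) = m\left(1 - \frac{1}{m}\right)^n \cong m e^{-n/m}$
$\hat{n} = -m \ln \frac{E(U_n)}{m}$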
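To see how good the approximation $m(1 - 1/m)^n \cong m e^{-n/m}$ is, the two expressions can be compared numerically. The helper names and the values $n = 20000$, $m = 10000$ below are arbitrary choices for illustration.

```python
import math


def expected_empty_exact(n, m):
    """E(U_n) = m * (1 - 1/m)^n for n balls thrown into m boxes."""
    return m * (1 - 1 / m) ** n


def expected_empty_approx(n, m):
    """The approximation m * exp(-n/m)."""
    return m * math.exp(-n / m)


n, m = 20000, 10000
exact = expected_empty_exact(n, m)
approx = expected_empty_approx(n, m)
print(exact, approx)
# Inverting the approximation recovers n, which is the idea behind the estimator.
print(-m * math.log(approx / m))
```

For small maps the two values differ more noticeably, so the approximation is best read as a large-$m$ statement.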
LINEAR COUNTING
Algorithm Basic Linear Counting:
let $key_i$ = the key for the $i$th tuple in the relation.
initialize the bit map to “0”s.
for $i$ = 1 to $q$ do
    hash_value = hash($key_i$)
    bit map(hash_value) = “1”
end for
$U_n$ = number of “0”s in the bit map
$V_n$ = $U_n / m$
$\hat{n} = -m \ln V_n$
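The pseudocode above could be turned into Python roughly as follows. This is a sketch under assumptions: it uses SHA-256 from the standard library as a stand-in hash (the deck's own implementation used MurmurHash), and the names `hash_to_index` and `linear_count` are illustrative.

```python
import hashlib
import math


def hash_to_index(key, m):
    """Map a key to a bit map address in [0, m). SHA-256 is an assumed stand-in hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def linear_count(keys, m):
    """Basic linear counting over an iterable of keys with a bit map of size m."""
    bitmap = bytearray(m)  # all entries start at 0
    for key in keys:
        bitmap[hash_to_index(key, m)] = 1
    empty = m - sum(bitmap)  # U_n
    if empty == 0:
        raise ValueError("bit map is full; choose a larger m")
    return -m * math.log(empty / m)  # -m ln V_n


# 20000 elements, 5000 of them distinct.
stream = [i % 5000 for i in range(20000)]
estimate = linear_count(stream, 8192)
print(round(estimate))
```

The full-map check matters because $V_n = 0$ makes the logarithm undefined; the next slides discuss how to size $m$ so that this is unlikely.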
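Constraint 1:
How to choose size $m$? The mean number of empty boxes must be $a$ standard deviations from zero:
$E(U_n) - a \times \mathrm{StdDev}(U_n) > 0$
Lemma: The limiting distribution of $U_n$, the number of empty boxes, is Poisson with expected value $\lambda$, where $m e^{-n/m} \to \lambda$ as $n, m \to \infty$. Thus,
$\lim_{n,m \to \infty} \Pr(U_n = k) = (\lambda^k / k!)\, e^{-\lambda}$
The fill-up probability is then obtained as
$\Pr(U_n = 0) = e^{-\lambda}$
If $a = \sqrt{5}$, that is, $E(U_n) > \sqrt{5} \cdot \mathrm{StdDev}(U_n)$, then $\mathrm{StdDev}(U_n) > \sqrt{5}$ (for a Poisson variable the mean is $\lambda$ and the standard deviation is $\sqrt{\lambda}$, so $\lambda > 5$), and
$\Pr(U_n = 0) < e^{-5} \doteq 0.007\ (0.7\%)$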
Constraint 2:
Suppose the user wants to limit the standard error to $\epsilon$. With the load factor $t = n/m$, we have
$\frac{\left((e^t - t - 1)/m\right)^{1/2}}{t} < \epsilon$
or equivalently
$m > \frac{e^t - t - 1}{(\epsilon t)^2}$
LINEAR COUNTING
Map size $m$ for cardinality $n$ and standard error epsilon:

      n    m (epsilon = 0.01)    m (epsilon = 0.1)
    100                  5034                   80
    200                  5067                  106
    300                  5100                  129
    400                  5133                  151
    500                  5166                  172
    600                  5199                  192
    700                  5231                  212
    800                  5264                  231
    900                  5296                  249
   1000                  5329                  268
   2000                  5647                  441
   3000                  5957                  618
   4000                  6260                  786
   5000                  6556                  948
   6000                  6847                 1106
   7000                  7132                 1261
   8000                  7412                 1412
   9000                  7688                 1562
  10000                  7960                 1709
  20000                 10506                 3105
  30000                 12839                 4417
  40000                 15036                 5680
  50000                 17134                 6909
  60000                 19156                 8112
  70000                 21117                 9294
  80000                 23029                10458
  90000                 24897                11608
 100000                 26729                12744
 200000                 43710                23633
 300000                 59264                33992
 400000                 73999                44032
 500000                 88175                53848
 600000                101932                63492
 700000                115359                72997
 800000                128514                82387
 900000                141441                91677
1000000                154171               100880
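Constraint 2 can be checked directly for a candidate map size. The sketch below only tests the standard-error bound $m > (e^t - t - 1)/(\epsilon t)^2$ with $t = n/m$; it does not apply Constraint 1, so it is not claimed to reproduce every table entry. The function name `meets_error_bound` is illustrative.

```python
import math


def meets_error_bound(n, m, eps):
    """True if map size m keeps the standard error below eps for n distinct items."""
    t = n / m  # load factor
    # math.expm1(t) is e^t - 1, so expm1(t) - t equals e^t - t - 1.
    return m > (math.expm1(t) - t) / (eps * t) ** 2


# The n = 100 row of the table: 80 and 5034 pass, one entry less does not.
print(meets_error_bound(100, 80, 0.1), meets_error_bound(100, 79, 0.1))
print(meets_error_bound(100, 5034, 0.01), meets_error_bound(100, 5033, 0.01))
```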
LOGLOG COUNTING
Basic Idea: (Bit pattern observables)
Hash each data item to a binary string like “01001101001…”.
For each string $x \in \{0, 1\}^\infty$, let $\rho(x)$ denote the position of its first 1-bit:
$\rho(1\ldots) = 1,\ \rho(001\ldots) = 3$, etc.
Let $\mathcal{M}$ denote the data set after hashing. Clearly, we can expect about $n/2^k$ amongst the distinct elements of $\mathcal{M}$ to have a $\rho$-value equal to $k$, so
$R(\mathcal{M}) := \max_{1 \le j \le n} \rho(x_j)$
is a rough indication of the value of $\log_2 n$.
LOGLOG COUNTING
Suppose here comes a data stream: {234, 39102, 3, 4556, 90011, 87, …}.
The hash function hashes each value to a binary string; suppose the value “90001” hashes to:
hash value: 0 0 1 0 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1
The position of the first 1-bit of this {0,1}-string is 3: $\rho(001011\ldots) = 3$.
This single observation has high variability: one experiment cannot suffice to obtain accurate predictions.
Stochastic Averaging: emulating the effect of $m$ experiments.
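The observable $\rho$ is easy to compute on a bit string. The sketch below is illustrative: the name `rho` follows the slides, but returning the string length plus one for an all-zero string is a convention assumed here.

```python
def rho(bits):
    """1-based position of the first '1' in a bit string.

    For an all-zero string this returns len(bits) + 1 (an assumed convention).
    """
    index = bits.find("1")
    return index + 1 if index >= 0 else len(bits) + 1


hash_value = "001011010011101001000101"  # the 24-bit hash value shown above
print(rho(hash_value))  # 3
```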
LOGLOG COUNTING
Stochastic Averaging: emulating the effect of $m$ experiments.
hash value: 0 0 1 0 1 1 0 1 0 0 1 1 1 0 1 0 | 0 1 0 0 0 1 0 1    (the last 8 bits are the bucket index)
Use the last 8 digits to represent the bucket number: 8 bits can represent $m = 2^8 = 256$ buckets (experiments).
http://content.research.neustar.biz/blog/hll.html
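Splitting a hash value into a bucket index and the bits used for the observable might look like this. It is a sketch that follows this slide's choice of the last 8 bits for the bucket; the function name `split_hash` is made up for the example.

```python
def split_hash(bits, bucket_bits=8):
    """Split a bit string into (bucket index, remaining bits).

    The bucket index is read from the last `bucket_bits` bits, as on the slide.
    """
    bucket = int(bits[-bucket_bits:], 2)
    remainder = bits[:-bucket_bits]
    return bucket, remainder


bucket, rest = split_hash("001011010011101001000101")
print(bucket, rest)  # 69 0010110100111010
```

Each of the 256 buckets then keeps its own maximum $\rho$ value, which is what emulates running 256 experiments with one hash function.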
2. LOGLOG COUNTING algorithm
LOGLOG COUNTING ($\mathcal{M}$: multiset of hashed values; $m \equiv 2^k$):
initialize $M^{(1)}, M^{(2)}, \ldots, M^{(m)}$ to “0”;
let $\rho(y)$ be the rank of the first 1-bit from the left in $y$;
for $x = b_1 b_2 \ldots \in \mathcal{M}$ do
    set $j := \langle b_1, \ldots, b_k \rangle_2$ (value of first $k$ bits in base 2)
    set $M^{(j)} := \max(M^{(j)}, \rho(b_{k+1} b_{k+2} \ldots))$
return $E := \alpha_m m\, 2^{\frac{1}{m} \sum_j M^{(j)}}$ as cardinality estimate.
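A compact Python version of the algorithm above could look as follows. It is a sketch, not the deck's implementation: it assumes a 64-bit hash taken from SHA-256 (instead of MurmurHash), uses the asymptotic constant $\alpha_\infty \approx 0.39701$ from Durand and Flajolet in place of $\alpha_m$, and the names `hash64` and `loglog_estimate` are illustrative.

```python
import hashlib

ALPHA_INF = 0.39701  # asymptotic value of alpha_m (Durand and Flajolet, 2003)


def hash64(item):
    """64-bit hash of an item; SHA-256 is an assumed stand-in hash function."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def loglog_estimate(items, k=10):
    """LogLog cardinality estimate with m = 2^k buckets."""
    m = 1 << k
    width = 64 - k  # bits left after removing the bucket index
    registers = [0] * m
    for item in items:
        x = hash64(item)
        j = x >> width              # value of the first k bits
        w = x & ((1 << width) - 1)  # remaining bits b_{k+1} b_{k+2} ...
        rank = width - w.bit_length() + 1  # position of the first 1-bit from the left
        if rank > registers[j]:
            registers[j] = rank
    mean = sum(registers) / m
    return ALPHA_INF * m * 2 ** mean


loglog_result = loglog_estimate(range(100000))
print(round(loglog_result))
```

With $m = 1024$ the expected standard error is about 4%, so the printed value should land near 100000 but not exactly on it.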
LOGLOG COUNTING
Theorem: Let $\omega(n)$ be a function that tends to infinity arbitrarily slowly and consider the function
$\ell(n) = \log_2 \log_2\left(\frac{n}{m}\right) + \omega(n)$
Then, the $\ell(n)$-restricted algorithm and the LOGLOG algorithm provide the same output with probability tending to 1 as $n$ tends to infinity.
e.g. Count cardinality till $2^{27}$ (a hundred million), adopt $m = 1024 = 2^{10}$ buckets;
each bucket is visited (roughly) $n/m = 2^{17}$ times;
we have $\log_2 \log_2 2^{17} \doteq 4.09$, adopt $\omega = 0.91$, each bucket: 5 bits;
Totally 1024*5/8 = 640 bytes! (with a standard error of 4%)
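The memory arithmetic of this example can be replayed in Python. The variable names are illustrative, and the last line uses the asymptotic LogLog standard error $1.30/\sqrt{m}$ from Durand and Flajolet to show where the 4% figure comes from.

```python
import math

n_max = 2 ** 27              # cardinalities up to about a hundred million
m = 2 ** 10                  # 1024 buckets
per_bucket = n_max // m      # roughly 2^17 elements per bucket
loglog = math.log2(math.log2(per_bucket))
bits_per_bucket = math.ceil(loglog)   # 4.09 plus omega = 0.91 gives 5 bits
total_bytes = m * bits_per_bucket // 8

print(per_bucket, round(loglog, 2), bits_per_bucket, total_bytes)
print(round(1.30 / math.sqrt(m), 3))  # asymptotic standard error of LogLog
```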
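HYPERLOGLOG COUNTING
LOGLOG COUNTING algorithm with Harmonic Mean
LogLog uses the arithmetic mean of the registers:
$E := \alpha_m m\, 2^{\frac{1}{m} \sum_j M^{(j)}}$, with arithmetic mean $\frac{1}{m}\left(M^{(1)} + M^{(2)} + \cdots + M^{(m)}\right)$
HyperLogLog uses the harmonic mean of the values $2^{M^{(j)}}$:
$\frac{m}{\frac{1}{2^{M^{(1)}}} + \frac{1}{2^{M^{(2)}}} + \cdots + \frac{1}{2^{M^{(m)}}}}$, which gives $E := \alpha_m m^2 \left(\sum_{j=1}^{m} 2^{-M[j]}\right)^{-1}$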
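A small numeric comparison shows why the harmonic mean helps: one unusually large register moves it much less than it moves the LogLog average. The register values below are invented for illustration.

```python
registers = [3, 3, 3, 3, 10]  # four typical registers and one outlier
m = len(registers)

# LogLog: 2 raised to the arithmetic mean of the registers.
loglog_mean = 2 ** (sum(registers) / m)
# HyperLogLog: harmonic mean of the values 2^M[j].
harmonic_mean = m / sum(2.0 ** -r for r in registers)

print(round(loglog_mean, 2), round(harmonic_mean, 2))
```

Without the outlier both quantities would equal 8, so the harmonic mean stays much closer to the typical value.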
3. HYPERLOGLOG COUNTING algorithm
HYPERLOGLOG COUNTING (input $\mathcal{M}$: multiset of items):
assume $m = 2^b$ with $b \in \mathbb{Z}_{>0}$;
initialize a collection of $m$ integers, $M[1], \ldots, M[m]$, to $-\infty$;
for $v \in \mathcal{M}$ do
    set $x := h(v)$;
    set $j = 1 + \langle x_1 x_2 \ldots x_b \rangle_2$ (the binary address determined by the first $b$ bits of $x$);
    set $w := x_{b+1} x_{b+2} \ldots$; set $M[j] := \max(M[j], \rho(w))$;
compute $Z := \left(\sum_{j=1}^{m} 2^{-M[j]}\right)^{-1}$;
return $E := \alpha_m m^2 Z$
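The raw HyperLogLog estimate could be sketched in Python as below. Several choices are assumptions rather than the deck's code: SHA-256 truncated to 64 bits stands in for MurmurHash, registers start at 0 instead of $-\infty$ (common in practical implementations), indices are 0-based, and $\alpha_m = 0.7213/(1 + 1.079/m)$ is the approximation given by Flajolet et al. for $m \ge 128$.

```python
import hashlib


def hash64(item):
    """64-bit hash of an item; SHA-256 is an assumed stand-in hash function."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hyperloglog_raw(items, b=10):
    """Raw HyperLogLog estimate with m = 2^b registers (no range corrections)."""
    m = 1 << b
    width = 64 - b
    alpha = 0.7213 / (1 + 1.079 / m)  # approximation of alpha_m for m >= 128
    registers = [0] * m
    for item in items:
        x = hash64(item)
        j = x >> width              # address from the first b bits (0-based)
        w = x & ((1 << width) - 1)  # remaining bits
        rank = width - w.bit_length() + 1  # rho(w)
        if rank > registers[j]:
            registers[j] = rank
    z = 1.0 / sum(2.0 ** -r for r in registers)
    return alpha * m * m * z


hll_result = hyperloglog_raw(range(100000))
print(round(hll_result))
```

Starting registers at 0 makes empty registers contribute $2^0 = 1$ to the sum, which is what the small-cardinality correction later in the deck relies on.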
Implementation
Programming Language: Python 2.7
Hash Function: MurmurHash 3_64
Multiset: integers 1, 2, 3, …
[Figure, MurmurHash3_32: Number of Elements in Each Bucket Follows a Uniform Distribution]
[Figure, MurmurHash3_32: Distributions of Position of 1st 1-bit of Hashed Binary Strings]
Linear Counting
[Figures: Performances of LC for Different Map Sizes; Load Factor VS Standard Errors]
LogLog Counting
[Figure: Performances of LLC for Different Numbers of Buckets]
[Figure: A Large Error for Small Cardinalities]
HyperLogLog Counting
[Figure: Performances of LLC for Different Numbers of Buckets]
Comparison of HLLC and LLC
[Figure: Comparison of HLLC and LLC when Number of Buckets is Small]
[Figure: Comparison of HLLC and LLC when Number of Buckets is Large]
The “raw” estimate:
$E := \alpha_m m^2 \left(\sum_{j=1}^{m} 2^{-M[j]}\right)^{-1}$
Small Cardinalities: When the cardinality is small, the proportion of un-hit buckets is large, which leads to inaccurate estimation.
if $E \le \frac{5}{2} m$ then
    Let $V$ be the number of registers equal to 0.
    if $V \neq 0$ then set $E^* := \mathrm{LinearCounting}(m, V)$
    else set $E^* := E$
end
Large Cardinalities: A hash function of $L$ bits can at most distinguish $2^L$ different values, and as the cardinality $n$ approaches $2^L$, hash collisions become more and more likely and accurate estimation becomes impossible.
if $E \le \frac{1}{30} 2^{32}$ then
    $E^* := E$
end
if $E > \frac{1}{30} 2^{32}$ then
    $E^* := -2^{32} \log(1 - E/2^{32})$
end
return $E^*$
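The two corrections can be collected in one function. This is a sketch under stated assumptions: a 32-bit hash (hence the constant $2^{32}$), a raw estimate below $2^{32}$, and $\mathrm{LinearCounting}(m, V) = m \log(m/V)$ as defined in the HyperLogLog paper. The function name `corrected_estimate` is illustrative.

```python
import math

TWO_32 = 2.0 ** 32  # number of values a 32-bit hash can distinguish


def corrected_estimate(raw, m, zero_registers):
    """Apply the small- and large-range corrections to a raw HyperLogLog estimate."""
    if raw <= 2.5 * m:
        # Small range: fall back to linear counting if some registers are empty.
        if zero_registers != 0:
            return m * math.log(m / zero_registers)
        return raw
    if raw <= TWO_32 / 30:
        # Intermediate range: the raw estimate is used as is.
        return raw
    # Large range: compensate for hash collisions (assumes raw < 2^32).
    return -TWO_32 * math.log(1 - raw / TWO_32)


print(corrected_estimate(100.0, 1024, 930))
print(corrected_estimate(1e6, 1024, 0))
print(corrected_estimate(2.0 ** 31, 1024, 0))
```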
Correction for HyperLogLog Counting
[Figures: Bad Performances for Small Cardinalities; Corrections for Small Cardinalities]
[Figure: Performance Comparison between HLLC_raw and HLLC for Small Cardinalities]
[Figures: Bad Performances for Large Cardinalities; Corrections for Large Cardinalities]
[Figure: Performance Comparison between HLLC_raw and HLLC for Large Cardinalities]
Open Issues
Are there other smart ideas to use?
References
Whang, Kyu-Young, Brad T. Vander-Zanden, and Howard M. Taylor. "A linear-time probabilistic counting algorithm for database applications." ACM Transactions on Database Systems (TODS) 15.2 (1990): 208-229.
Durand, Marianne, and Philippe Flajolet. "Loglog counting of large cardinalities." Algorithms-ESA 2003. Springer Berlin Heidelberg, 2003. 605-617.
Flajolet, Philippe, et al. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013.
Metwally, Ahmed, Divyakant Agrawal, and Amr El Abbadi. "Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic." Proceedings of the 11th international conference on Extending database technology: Advances in database technology. ACM, 2008.
History
Distinct Counting Algorithm
Sampling Algorithms | Sketch-Based Algorithm
Logarithmic Hashing Algorithms | Uniform Hashing Algorithms
Interval-Based Algorithms | Bucket-Based Algorithms
Pure-Bucket-Based Algorithms | Hybrid-Bucket-Based Algorithms
Hybrid Bucket-Based-Logarithmic Algorithms | Hybrid Bucket-Based-Sampling Algorithms

Count-Distinct Problem

  • 2. Questions 1. What is the cardinality of this data stream: {1, 2, 4, 6, 8, 9, 2, 3, 11, 3, 1, 4} 2. Remember we use “bit pattern observables” to estimate cardinality, describe the basic idea behind it. 3. How is the “buckets” useful in LOGLOG COUNTING algorithm? n
  • 3. Outline 1. Overview 2. History 3.Algorithms 4. Implementation & Results 5. Open Issues 6. References
  • 4. Definition: Instance: A stream of elements with repetitions, and an integer . Let be the number of distinct elements, namely , and let these elements be . Overview x1, x2, ..., xs m n n =| {x1, x2, ..., xs} || {e1, e2, ..., en} | ˆn n m ⌧ n a, b, a, c, d, b, d n =| {a, b, c, d} |= 4 mObjective: Find an estimate of using only storage units, where . e.g. Count the cardinality of the stream: . For this instance, .
  • 5. Example: Keep track of the number of Unique Visitors (UV) for a particular product on Amazon in one day. • 1MB for each tree, 1 million items:100GB memory! (200 million on Amazon) • what if we want to know the number of UVs of 2 items together? Drawbacks: Operation: Searching, Insertion
  • 6. Other Applications Application: Networking / Traffic monitoring • Detection of worm propagation • Network attacks • Link-based spam Data mining of massive data set • Natural language texts • Biological data • Large structured databases Google: Sawzall, Dremel and PowerDrill
  • 7. 1980: Optimization of classical algorithms operations on data bases: union, intersection, sorting, … Data set size >> RAM capacities. • in one pass; • using small auxiliary memory 1983: Probabilistic Counting by Flajolet and Martin 2003: LogLog Counting algorithm 2007: HyperLogLog Counting algorithm History
  • 8. 1. LINEAR COUNTING 0 0 0 0 0 0 … 0 0 0 0 0 1, 2, … … m LINEAR COUNTING 0 1 0 0 1 1 … 0 0 1 0 1 1, 2, … … m Step 2: Hash the value to a bitmap address and set the address bit to “1”; m Vn ˆn = mlnVn Step 3: Count the empty bit map entries and divide it by the bit map size (fraction is ), then the cardinality estimation is: mStep 1: Allocate a bit map (hash table) of size , all entries are initialized to “0”;
  • 9. LINEAR COUNTING • cardinality: $n = 11$ • estimated cardinality: $\hat{n} = -m \ln V_n = -8 \ln \frac{1}{4} \doteq 11.09$
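  • 10. LINEAR COUNTING Model: $n$ balls thrown into $m$ boxes. Let $A_j$ stand for the event that box $j$ is empty: $P(A_j) = \left(1 - \frac{1}{m}\right)^n$, $P(A_j \cap A_k) = \left(1 - \frac{2}{m}\right)^n$, $j \neq k$. Let $U_n$ denote the number of empty boxes: $E(U_n) = \sum_{j=1}^{m} P(A_j) = m \left(1 - \frac{1}{m}\right)^n \cong m e^{-n/m}$, so $\hat{n} = -m \ln \frac{E(U_n)}{m}$.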
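To see how close the approximation $m(1 - 1/m)^n \cong m e^{-n/m}$ is, the two expressions can be compared numerically. The sketch below is illustrative only (the function names are made up here); it also inverts the approximation to recover $n$, which is the step that yields the estimator.

```python
import math


def expected_empty_exact(n, m):
    """E(U_n) = m * (1 - 1/m)^n for n balls thrown into m boxes."""
    return m * (1.0 - 1.0 / m) ** n


def expected_empty_approx(n, m):
    """Approximation E(U_n) ~= m * exp(-n/m)."""
    return m * math.exp(-float(n) / m)


def invert_to_cardinality(expected_empty, m):
    """Solve m * exp(-n/m) = E(U_n) for n, i.e. n = -m * ln(E(U_n) / m)."""
    return -m * math.log(expected_empty / m)


if __name__ == "__main__":
    n, m = 1000, 500
    print(expected_empty_exact(n, m), expected_empty_approx(n, m))
    print(invert_to_cardinality(expected_empty_approx(n, m), m))
```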
  • 11. LINEAR COUNTING Algorithm Basic Linear Counting:
      let $key_i$ = the key for the $i$th tuple in the relation.
      initialize the bit map to “0”s.
      for $i$ = 1 to $q$ do
        hash_value = hash($key_i$)
        bit map(hash_value) = “1”
      end for
      $U_n$ = number of “0”s in the bit map
      $V_n = U_n / m$
      $\hat{n} = -m \ln V_n$
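The Basic Linear Counting steps above could be sketched as follows. This is a hedged illustration, not the presenters' implementation: it is written for Python 3 (the slides mention Python 2.7), and it swaps in `hashlib.blake2b` from the standard library for the MurmurHash function named in the slides. The helper name `_hash64` is an assumption made here.

```python
import hashlib
import math


def _hash64(value):
    # Assumption: a 64-bit BLAKE2b digest stands in for MurmurHash.
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def linear_count(stream, m):
    """Estimate the number of distinct keys with a bit map of size m."""
    bitmap = [0] * m
    for key in stream:
        bitmap[_hash64(key) % m] = 1      # set the addressed bit to "1"
    u_n = bitmap.count(0)                 # number of empty entries
    if u_n == 0:
        raise ValueError("bit map is full; choose a larger m")
    v_n = u_n / m                         # fraction of empty entries
    return -m * math.log(v_n)             # n_hat = -m ln V_n


if __name__ == "__main__":
    data = list(range(1000)) * 2          # 1000 distinct keys, each repeated
    print(linear_count(data, 4096))
```

With $m = 4096$ and $n = 1000$ the load factor is about 0.24, so the estimate should land within a few percent of the true value.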
  • 12. How to choose the size $m$? The mean number of empty boxes must be $a$ standard deviations from zero: $E(U_n) - a \times StdDev(U_n) > 0$. Lemma: The limiting distribution of $U_n$, the number of empty boxes, is Poisson with expected value $\lambda$ when $m e^{-n/m} \to \lambda$ as $n, m \to \infty$. Thus, $\lim_{n,m \to \infty} \Pr(U_n = k) = (\lambda^k / k!) e^{-\lambda}$. The fill-up probability is then obtained as $\Pr(U_n = 0) = e^{-\lambda}$. If $a = \sqrt{5}$, that is, $E(U_n) > \sqrt{5} \cdot StdDev(U_n)$, then $\lambda > 5$, because a Poisson variable has standard deviation $\sqrt{\lambda}$. Constraint 1: $\Pr(U_n = 0) < e^{-5} \doteq 0.007$ (0.7%)
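  • 13. Constraint 2: Suppose the user wants to limit the standard error to $\epsilon$. With the load factor $t = n/m$, we have $\frac{\left((e^t - t - 1)/m\right)^{1/2}}{t} < \epsilon$, or equivalently $m > \frac{e^t - t - 1}{(\epsilon t)^2}$.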
  • 14. LINEAR COUNTING Map size $m$ for cardinality $n$ and standard error $\epsilon$ (epsilon):
      n        ε=0.01   ε=0.1    |  n         ε=0.01   ε=0.1
      100      5034     80       |  20000     10506    3105
      200      5067     106      |  30000     12839    4417
      300      5100     129      |  40000     15036    5680
      400      5133     151      |  50000     17134    6909
      500      5166     172      |  60000     19156    8112
      600      5199     192      |  70000     21117    9294
      700      5231     212      |  80000     23029    10458
      800      5264     231      |  90000     24897    11608
      900      5296     249      |  100000    26729    12744
      1000     5329     268      |  200000    43710    23633
      2000     5647     441      |  300000    59264    33992
      3000     5957     618      |  400000    73999    44032
      4000     6260     786      |  500000    88175    53848
      5000     6556     948      |  600000    101932   63492
      6000     6847     1106     |  700000    115359   72997
      7000     7132     1261     |  800000    128514   82387
      8000     7412     1412     |  900000    141441   91677
      9000     7688     1562     |  1000000   154171   100880
      10000    7960     1709     |
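A map size satisfying both constraints can be searched for numerically. The sketch below is an illustration under stated assumptions: Constraint 1 is taken as $m e^{-n/m} > 5$ (the Poisson mean $\lambda$ above 5) and Constraint 2 as $m > (e^t - t - 1)/(\epsilon t)^2$ with $t = n/m$. The function names are made up here, and the search relies on both conditions becoming easier to satisfy as $m$ grows.

```python
import math


def satisfies_constraints(m, n, eps):
    """True if map size m meets both constraints for n distinct keys."""
    t = float(n) / m                       # load factor
    if t > 700:                            # exp(t) would overflow; m is far too small
        return False
    g = math.expm1(t) - t                  # e^t - t - 1, computed stably
    constraint_1 = m * math.exp(-t) > 5    # expected empty boxes (lambda) above 5
    constraint_2 = m > g / (eps * t) ** 2  # standard error below eps
    return constraint_1 and constraint_2


def min_map_size(n, eps):
    """Smallest m meeting both constraints (doubling, then binary search)."""
    hi = 1
    while not satisfies_constraints(hi, n, eps):
        hi *= 2
    lo = hi // 2                           # known to fail (or 0)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if satisfies_constraints(mid, n, eps):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, min_map_size(n, 0.01), min_map_size(n, 0.1))
```

For $n = 100$ this search gives 5034 and 80 for $\epsilon = 0.01$ and $\epsilon = 0.1$, which agrees with the first row of the table on the slide.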
  • 15. LOGLOG COUNTING Basic Idea (bit pattern observables): Hash each data item to a binary string like “01001101001…”. For each string $x \in \{0, 1\}^\infty$, let $\rho(x)$ denote the position of its first 1-bit: $\rho(1\ldots) = 1$, $\rho(001\ldots) = 3$, etc., and let $M$ denote the data set after hashing. Clearly, we can expect about $n/2^k$ amongst the distinct elements of $M$ to have a $\rho$-value equal to $k$, so $R(M) := \max_{x \in M} \rho(x)$ is a rough indication of the value of $\log_2 n$.
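The observable $\rho$ and the $n/2^k$ claim can be checked on a small exhaustive example. This is a minimal sketch: the helper names are assumptions, and returning `len(bits) + 1` for an all-zero string is a convention chosen here for finite strings.

```python
from itertools import product


def rho(bits):
    """1-based position of the first '1' in a bit string."""
    pos = bits.find("1")
    if pos == -1:
        return len(bits) + 1   # convention for an all-zero finite string
    return pos + 1


def rho_histogram(length):
    """Count how many bit strings of the given length have each rho value."""
    counts = {}
    for tup in product("01", repeat=length):
        k = rho("".join(tup))
        counts[k] = counts.get(k, 0) + 1
    return counts


if __name__ == "__main__":
    print(rho("1"), rho("001"), rho("001011"))
    # Among all 256 strings of length 8, 256 / 2^k have rho equal to k.
    print(rho_histogram(8))
```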
  • 16. LOGLOG COUNTING Suppose here comes a data stream: {234, 39102, 3, 4556, 90011, 87, …}. The hash function hashes each value to a binary string; suppose “90011” is hashed to the hash value 0 0 1 0 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1. The first 1-bit of this {0,1}-string is at position 3: $\rho(001011\ldots) = 3$. This observable has high variability: one experiment cannot suffice to obtain accurate predictions. Stochastic Averaging: emulating the effect of $m$ experiments.
  • 17. LOGLOG COUNTING Hash value 0 0 1 0 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1, split into a hash value part and a bucket index part. Stochastic Averaging: emulating the effect of $m$ experiments. Use the last 8 bits to represent the bucket number: 8 bits can represent $m = 2^8 = 256$ buckets (experiments). http://content.research.neustar.biz/blog/hll.html
  • 18. 2. LOGLOG COUNTING algorithm
      LOGLOG COUNTING ($\mathcal{M}$: multiset of hashed values; $m \equiv 2^k$):
        initialize $M^{(1)}, M^{(2)}, \ldots, M^{(m)}$ to “0”;
        let $\rho(y)$ be the rank of the first 1-bit from the left in $y$;
        for $x = b_1 b_2 \ldots \in \mathcal{M}$ do
          set $j := \langle b_1, \ldots, b_k \rangle_2$ (value of the first $k$ bits in base 2)
          set $M^{(j)} := \max(M^{(j)}, \rho(b_{k+1} b_{k+2} \ldots))$
        return $E := \alpha_m m 2^{\frac{1}{m} \sum_j M^{(j)}}$ as cardinality estimate.
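The LOGLOG COUNTING pseudocode above could look roughly like this in Python 3. It is a sketch under assumptions, not the presenters' code: `hashlib.blake2b` replaces MurmurHash, buckets are indexed from 0, and the asymptotic constant $\alpha_\infty \approx 0.39701$ from the Durand and Flajolet paper is used in place of $\alpha_m$, which is reasonable only when $m$ is not tiny.

```python
import hashlib

ALPHA_INF = 0.39701  # asymptotic value of alpha_m (Durand and Flajolet)


def hash_bits(value, width=64):
    # Assumption: a 64-bit BLAKE2b digest stands in for MurmurHash.
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return format(int.from_bytes(digest, "big"), "0{}b".format(width))


def rho(bits):
    """1-based rank of the first 1-bit from the left."""
    pos = bits.find("1")
    return len(bits) + 1 if pos == -1 else pos + 1


def loglog_count(stream, k):
    """LogLog estimate using m = 2^k buckets."""
    m = 1 << k
    registers = [0] * m
    for value in stream:
        bits = hash_bits(value)
        j = int(bits[:k], 2)                        # bucket from the first k bits
        registers[j] = max(registers[j], rho(bits[k:]))
    mean_rank = sum(registers) / m                  # arithmetic mean of the M(j)
    return ALPHA_INF * m * 2 ** mean_rank


if __name__ == "__main__":
    print(loglog_count(range(100000), 10))
```

With $m = 1024$ buckets the estimate for 100000 distinct integers should typically fall within a few percent of the true value; for small cardinalities the raw formula is known to be inaccurate.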
  • 19. LOGLOG COUNTING Theorem: Let $\omega(n)$ be a function that tends to infinity arbitrarily slowly and consider the function $\ell(n) = \log_2 \log_2 \left(\frac{n}{m}\right) + \omega(n)$. Then the $\ell(n)$-restricted algorithm and the LOGLOG algorithm provide the same output with probability tending to 1 as $n$ tends to infinity. e.g. Count cardinalities up to $2^{27}$ (about a hundred million) and adopt $m = 1024 = 2^{10}$ buckets; each bucket is visited (roughly) $n/m = 2^{17}$ times; we have $\log_2 \log_2 2^{17} \doteq 4.09$, so adopt $\omega = 0.91$, giving 5 bits per bucket. In total 1024*5/8 = 640 bytes! (with a standard error of 4%)
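The memory arithmetic in this example can be reproduced in a few lines. The sketch below is illustrative (the function name is made up), and it assumes the standard error of LogLog is approximately $1.30/\sqrt{m}$, the figure given in the Durand and Flajolet paper.

```python
import math


def loglog_memory(n_max, m, omega):
    """Bits per bucket, total bytes, and approximate standard error."""
    visits = n_max // m                                  # n/m, visits per bucket
    ell = math.log2(math.log2(visits)) + omega           # l(n) = log2 log2(n/m) + omega
    bits_per_bucket = math.ceil(ell)
    total_bytes = m * bits_per_bucket // 8
    std_error = 1.30 / math.sqrt(m)
    return bits_per_bucket, total_bytes, std_error


if __name__ == "__main__":
    print(loglog_memory(2 ** 27, 2 ** 10, 0.91))
```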
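  • 20. HYPERLOGLOG COUNTING HYPERLOGLOG COUNTING = LOGLOG COUNTING algorithm with Harmonic Mean. LogLog uses the arithmetic mean $\frac{1}{m} (M^{(1)} + M^{(2)} + \cdots + M^{(m)})$: $E := \alpha_m m 2^{\frac{1}{m} \sum_j M^{(j)}}$. HyperLogLog uses the harmonic mean $\frac{m}{\frac{1}{2^{M^{(1)}}} + \frac{1}{2^{M^{(2)}}} + \cdots + \frac{1}{2^{M^{(m)}}}}$: $E := \alpha_m m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$.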
  • 21. 3. HYPERLOGLOG COUNTING algorithm
      HYPERLOGLOG COUNTING (input $\mathcal{M}$: multiset of items):
        assume $m = 2^b$ with $b \in \mathbb{Z}_{>0}$;
        initialize a collection of $m$ integers, $M[1], \ldots, M[m]$, to $-\infty$;
        for $v \in \mathcal{M}$ do
          set $x := h(v)$
          set $j = 1 + \langle x_1 x_2 \ldots x_b \rangle_2$ (the binary address determined by the first $b$ bits of $x$)
          set $w := x_{b+1} x_{b+2} \ldots$
          set $M[j] := \max(M[j], \rho(w))$
        compute $Z := \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$
        return $E := \alpha_m m^2 Z$
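A raw HyperLogLog estimator following the pseudocode above might be sketched as below. Several choices are assumptions made for this illustration: Python 3 with `hashlib.blake2b` instead of MurmurHash, 0-based bucket indices, registers initialized to 0 rather than $-\infty$ (the practical variant, so that empty buckets can later be counted), and the $\alpha_m$ values given in the Flajolet et al. paper.

```python
import hashlib


def alpha(m):
    """Bias-correction constant alpha_m (values from Flajolet et al.)."""
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)


def hash_bits(value, width=64):
    # Assumption: a 64-bit BLAKE2b digest stands in for MurmurHash.
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return format(int.from_bytes(digest, "big"), "0{}b".format(width))


def rho(bits):
    """1-based rank of the first 1-bit from the left."""
    pos = bits.find("1")
    return len(bits) + 1 if pos == -1 else pos + 1


def hyperloglog_raw(stream, b):
    """Raw HyperLogLog estimate E = alpha_m * m^2 * Z with m = 2^b registers."""
    m = 1 << b
    registers = [0] * m
    for v in stream:
        x = hash_bits(v)
        j = int(x[:b], 2)                             # address from the first b bits
        registers[j] = max(registers[j], rho(x[b:]))  # rank of the remaining bits
    z = 1.0 / sum(2.0 ** -r for r in registers)       # harmonic-mean indicator
    return alpha(m) * m * m * z


if __name__ == "__main__":
    print(hyperloglog_raw(range(100000), 10))
```

The raw estimate is only expected to be accurate in the intermediate range; small and very large cardinalities need the corrections discussed on the later slides.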
  • 22. Implementation Programming Language: Python 2.7 Hash Function: MurmurHash 3_64 Multiset: integers 1, 2, 3, …
  • 23. MurmurHash3_32 Number of Elements in Each Bucket Follows a Uniform Distribution
  • 24. MurmurHash3_32 Distributions of Position of 1st 1-bit of Hashed Binary Strings
  • 25. Linear Counting Performances of LC for Different Map Sizes Load Factor VS Standard Errors
  • 26. LogLog Counting Performances of LLC for Different Numbers of Buckets
  • 27. A Large Error for Small Cardinalities LogLog Counting
  • 28. HyperLogLog Counting Performances of HLLC for Different Numbers of Buckets
  • 29. Comparison of HLLC and LLC Comparison of HLLC and LLC when Number of Buckets is Small
  • 30. Comparison of HLLC and LLC Comparison of HLLC and LLC when Number of Buckets is Large
  • 32. Corrections of the raw estimate
      Small Cardinalities: When the cardinality is small, the proportion of un-hit buckets is large, which leads to inaccurate estimation.
      Large Cardinalities: A hash function of $L$ bits can distinguish at most $2^L$ different values, and as the cardinality $n$ approaches $2^L$, hash collisions become more and more likely and accurate estimation becomes impossible.
      The “raw” estimate: $E := \alpha_m m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$
        if $E \le \frac{5}{2} m$ then
          let $V$ be the number of registers equal to 0.
          if $V \neq 0$ then set $E^* := LinearCounting(m, V)$ else do nothing
        end if
        if $E \le \frac{1}{30} 2^{32}$ then $E^* := E$ end if
        if $E > \frac{1}{30} 2^{32}$ then $E^* := -2^{32} \log(1 - E / 2^{32})$ end if
        return $E^*$
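The range corrections could be applied to a raw estimate as in the following sketch. It assumes a 32-bit hash (hence the constant $2^{32}$), takes LinearCounting(m, V) to be $m \ln(m/V)$ in line with $\hat{n} = -m \ln V_n$ from the linear counting slides, and treats the three ranges as mutually exclusive cases. The function name `corrected_estimate` is hypothetical.

```python
import math

TWO_32 = 2.0 ** 32  # assumes a 32-bit hash function


def corrected_estimate(raw_e, registers):
    """Apply small- and large-range corrections to a raw HyperLogLog estimate."""
    m = len(registers)
    if raw_e <= 2.5 * m:
        # Small range: fall back to linear counting if any register is empty.
        v = registers.count(0)
        if v != 0:
            return m * math.log(m / v)
        return raw_e
    if raw_e <= TWO_32 / 30:
        # Intermediate range: the raw estimate is used unchanged.
        return raw_e
    # Large range: compensate for hash collisions (requires raw_e < 2^32).
    return -TWO_32 * math.log(1 - raw_e / TWO_32)


if __name__ == "__main__":
    print(corrected_estimate(10.0, [0] * 12 + [1] * 4))
    print(corrected_estimate(1000.0, [3] * 16))
    print(corrected_estimate(2.0 ** 31, [20] * 16))
```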
  • 33. Correction for HyperLogLog Counting Bad Performances for Small Cardinalities Corrections for Small Cardinalities
  • 34. Correction for HyperLogLog Counting Performance Comparison between HLLC_raw and HLLC for Small Cardinalities
  • 35. HyperLogLog Counting Bad Performances for Large Cardinalities Corrections for Large Cardinalities
  • 36. HyperLogLog Counting Performance Comparison between HLLC_raw and HLLC for Large Cardinalities
  • 37. HyperLogLog Counting Performance Comparison between HLLC_raw and HLLC for Large Cardinalities
  • 38. Open Issues Are there other smart ideas to use?
  • 39. References
      Whang, Kyu-Young, Brad T. Vander-Zanden, and Howard M. Taylor. "A linear-time probabilistic counting algorithm for database applications." ACM Transactions on Database Systems (TODS) 15.2 (1990): 208-229.
      Durand, Marianne, and Philippe Flajolet. "Loglog counting of large cardinalities." Algorithms-ESA 2003. Springer Berlin Heidelberg, 2003. 605-617.
      Flajolet, Philippe, et al. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
      Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013.
      Metwally, Ahmed, Divyakant Agrawal, and Amr El Abbadi. "Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic." Proceedings of the 11th international conference on Extending database technology: Advances in database technology. ACM, 2008.
  • 40. History Distinct Counting Algorithm Sampling Algorithms Sketch-Based Algorithms Logarithmic Hashing Algorithms Uniform Hashing Algorithms Interval-Based Algorithms Bucket-Based Algorithms Pure-Bucket-Based Algorithms Hybrid-Bucket-Based Algorithms Hybrid Bucket-Based-Logarithmic Algorithms Hybrid Bucket-Based-Sampling Algorithms