SlideShare a Scribd company logo
tech talk @ ferret
Andrii Gakhov
PROBABILISTIC DATA STRUCTURES
ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK
PART 2: CARDINALITY
CARDINALITY
Agenda:
▸ Linear Counting
▸ LogLog, SuperLogLog, HyperLogLog, HyperLogLog++
• To determine the number of distinct elements, also
called the cardinality, of a large set of elements
where duplicates are present
Calculating the exact cardinality of a multiset requires an amount of memory
proportional to the cardinality, which is impractical for very large data sets.
THE PROBLEM
LINEAR COUNTING
LINEAR COUNTING: ALGORITHM
• Linear counter is a bit map (hash table) of size m (all
elements set to 0 at the beginning).
• Algorithm consists of a few steps:
• for every element calculate hash function and set the
appropriate bit to 1
• calculate the fraction V of empty bits in the structure 

(divide the number of empty bits by the bit map size m )
• estimate cardinality as n ≈ -m ln V
LINEAR COUNTING: EXAMPLE
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
• Consider linear counter with 16 bits (m=16)
• Consider MurmurHash3 as the hash function h

(to calculate the appropriate index, we divide result by mod 16)
• Set of 10 elements: “bernau”, “bernau”, “bernau”,
“berlin”, “kiev”, “kiev”, “new york”, “germany”, “ukraine”,
“europe” (NOTE: the real cardinality n = 7)
h(“bernau”) = 4, h(“berlin”) = 4, h(“kiev”) = 6, h(“new york”) = 6,
h(“germany”) = 14, h(“ukraine”) = 7, h(“europe”) = 9
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0
LINEAR COUNTING: EXAMPLE
number of empty bits: 11
m = 16
V = 11 / 16 = 0.6875
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0
• Cardinality estimation is
n ≈ - 16 * ln (0.6875) = 5.995
LINEAR COUNTING: READ MORE
• http://dblab.kaist.ac.kr/Prof/pdf/Whang1990(linear).pdf
• http://www.codeproject.com/Articles/569718/
CardinalityplusEstimationplusinplusLinearplusTimep
HYPERLOGLOG
HYPERLOGLOG: INTUITION
• The cardinality of a multiset of uniformly distributed numbers can be estimated
by the maximum number of leading zeros in the binary representation of each
number. If such value is k, then the number of distinct elements in the set is 2k
P(rank=1) = 1/2 - probability to find a binary representation, that starts with 1
P(rank = 2) = 1/2
2
- probability to find a binary representation, that start with 01
…
P(rank=k) = 1/2
k
rank = number of leading zeros + 1, e.g. rank(f) = 3
0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0
0 1 2
leading zeros
3
f =
• Therefore, for 2k
binary representations we shell find at least one
representation with rank = k
• If we remember the maximal rank we’ve seen and it’s equal to k, then we can
use 2k
as the approximation of the number of elements
HYPERLOGLOG
• proposed by Flajolet et. al., 2007
• an extension of the Flajolet–Martin algorithm (1985)
• HyperLogLog is described by 2 parameters:
• p – number of bits that determine a bucket to use averaging

(m = 2p
is the number of buckets/substreams)
• h - hash function, that produces uniform hash values
• The HyperLogLog algorithm is able to estimate cardinalities of
> 109
with a typical error rate of 2%, using 1.5kB of memory
(Flajolet, P. et al., 2007).
HYPERLOGLOG: ALGORITHM
• HyperLogLog uses randomization to approximate the
cardinality of a multiset.This randomization is achieved by
using hash function h
• Observe the maximum number of leading zeros that for all
hash values:
• If the bit pattern 0L−1 1 is observed at the beginning of a
hash value (so, rank = L), then a good estimation of the size
of the multiset is 2L.
HYPERLOGLOG: ALGORITHM
• Stochastic averaging is used to reduce the large variability:
• The input stream of data elements S is divided into m substreams Si using
the first p bits of the hash values (m = 2p)
.
• In each substream, the rank (after the initial p bits that are used for
substreaming) is measured independently.
• These numbers are kept in an array of registers M, where M[i] stores the
maximum rank it seen for the substream with index i.
• The cardinality estimation is calculated computes as the normalized bias
corrected harmonic mean of the estimations on the substreams
DVHLL = const(m)⋅m2
⋅ 2
−M j
j=1
m
∑
⎛
⎝⎜
⎞
⎠⎟
−1
HYPERLOGLOG: EXAMPLE
• Consider L=8 bits hash function h
• Index elements “berlin” and “ferret”:
h(“berlin”) = 0110111 h(“ferret”) = 1100011
• Define buckets and calculate values to store:

(use first p =3 bits for buckets and least L - p = 5 bits for ranks)
• bucket(“berlin”) = 011 = 3 value(“berlin”) = rank(0111) = 2
• bucket(“ferret”) = 110 = 6 value(“ferret”) = rank(0011) = 3
• Let’s use p=3 bits to define a bucket (then m=23
=8 buckets).
1 2 3 4 5 6 7
0 0 0 0 0 0 0 0
0
M
1 2 3 4 5 6 7
0 0 0 2 0 0 3 0
0
M
HYPERLOGLOG: EXAMPLE
• Estimate the cardinality be the HLL formula (C ≈ 0.66):
DVHLL ≈ 0.66 * 82
/ (2-2
+ 2-4
) = 0.66 * 204.8 ≈ 135≠3
• Index element “kharkov”:
• h(“kharkov”) = 1100001
• bucket(“kharkov”) = 110 = 6 value(“kharkov”) = rank(0001) = 4
• M[6] = max(M[6], 4) = max(3, 4) = 4
1 2 3 4 5 6 7
0 0 0 2 0 0 4 0
0
M
NOTE: For small cardinalities HLL has a strong bias!!!
HYPERLOGLOG: PROPERTIES
• Memory requirement doesn't grow linearly with L (unlike MinCount or
Linear Counting) - for hash function of L bits and precision p, required
memory:
• original HyperLogLog uses 32 bit hash codes, which requires 5 · 2
p
bits
• It’s not necessary to calculate the full hash code for the element
• first p bits and number of leading zeros of the remaining bits are
enough
• There are no evidence that some of popular hash functions (MD5, Sha1,
Sha256, Murmur3) performs significantly better than others.
log2 L +1− p( )⎡⎢ ⎤⎥⋅2p
bits
HYPERLOGLOG: PROPERTIES
• The standard error can be estimated as:
σ =
1.04
2p
so, if we use 16 bits (p=16) for bucket indices, we receive the
standard error in 0.40625%
• Algorithm has large error for small cardinalities.
• For instance, for n = 0 the algorithm always returns roughly 0.7m
• To achieve better estimates for small cardinalities, use
LinearCounting below a threshold of 5m/2
HYPERLOGLOG: APPLICATIONS
• PFCOUNT in Redis returns the approximated cardinality
computed by the HyperLogLog data structure 

(http://antirez.com/news/75)
• Redis implementation uses 12Kb per key to count with a standard
error of 0.81%, and there is no limit to the number of items you can
count, unless you approach 264 items
HYPERLOGLOG: READ MORE
• http://algo.inria.fr/flajolet/Publications/DuFl03-LNCS.pdf
• http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
• https://stefanheule.com/papers/edbt13-hyperloglog.pdf
• https://highlyscalable.wordpress.com/2012/05/01/
probabilistic-structures-web-analytics-data-mining/
• https://hal.archives-ouvertes.fr/file/index/docid/465313/
filename/sliding_HyperLogLog.pdf
• http://stackoverflow.com/questions/12327004/how-does-
the-hyperloglog-algorithm-work
HYPERLOGLOG++
HYPERLOGLOG++
• proposed by Stefan Heule et. al., 2013 for Google PowerDrill
• an improved version of HyperLogLog (Flajolet et. al., 2007)
• HyperLogLog++ is described by 2 parameters:
• p – number of bits that determine a bucket to use averaging

(m = 2p
is the number of buckets/substreams)
• h - hash function, that produces uniform hash values
• The HyperLogLog++ algorithm is able to estimate cardinalities
of ~ 7.9 · 10
9
with a typical error rate of 1.625%, using 2.56KB of
memory (Micha Gorelick and Ian Ozsvald, High Performance
Python, 2014).
HYPERLOGLOG++: IMPROVEMENTS
• use 64-bit hash function
• algorithm that only uses the hash code of the input values is limited by the
number of bits of the hash codes when it comes to accurately estimating
large cardinalities
• In particular, a hash function of L bits can at most distinguish 2L
different
values, and as the cardinality n approaches 2L
hash collisions become
more and more likely and accurate estimation gets impossible
• if the cardinality approaches 264 ≈ 1.8 · 1019, hash collisions become a problem
• bias correction
• original algorithm overestimates the real cardinality for small sets, but
most of the error is due to bias.
• storage efficiency
• uses different encoding strategies for hash values, variable length
encoding for integers, difference encoding
HYPERLOGLOG++ VS HYPERLOGLOG
• accuracy is significantly better for large range of cardinalities
and equally good on the rest
• sparse representation allows for a more adaptive use of memory
• if the cardinality n is much smaller then m, then HyperLogLog++
requires significantly less memory
• For cardinalities between 12000 and 61000, the bias correction
allows for a lower error and avoids a spike in the error when
switching between sub-algorithms.
• 64 bit hash codes allow the algorithm to estimate cardinalities well
beyond 1 billion
HYPERLOGLOG++: APPLICATIONS
• cardinality metric in Elasticsearch is based on the
HyperLogLog++ algorithm for big cardinalities (adaptive
counting)
• Apache DataFu, collection of libraries for working with
large-scale data in Hadoop, has an implementation of
HyperLogLog++ algorithm
HYPERLOGLOG++: READ MORE
• http://static.googleusercontent.com/media/
research.google.com/en//pubs/archive/40671.pdf
• https://research.neustar.biz/2013/01/24/hyperloglog-
googles-take-on-engineering-hll/
▸ @gakhov
▸ linkedin.com/in/gakhov
▸ www.datacrucis.com
THANK YOU

More Related Content

What's hot

Lecture notes data structures tree
Lecture notes data structures   treeLecture notes data structures   tree
Lecture notes data structures tree
maamir farooq
 
Skip gram and cbow
Skip gram and cbowSkip gram and cbow
Skip gram and cbow
hyunyoung Lee
 
Garbage Collection
Garbage CollectionGarbage Collection
Garbage Collection
Eelco Visser
 
Sqoop
SqoopSqoop
Concept of hashing
Concept of hashingConcept of hashing
Concept of hashing
Rafi Dar
 
Python Flow Control
Python Flow ControlPython Flow Control
Python Flow Control
Mohammed Sikander
 
Generalized Reinforcement Learning
Generalized Reinforcement LearningGeneralized Reinforcement Learning
Generalized Reinforcement Learning
Po-Hsiang (Barnett) Chiu
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programming
izahn
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
Pınar Yahşi
 
Bloom filters
Bloom filtersBloom filters
Bloom filters
Devesh Maru
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
Nowell Strite
 
Decorators in Python
Decorators in PythonDecorators in Python
Decorators in Python
Ben James
 
6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx
Venkateswara Babu Ravipati
 
First Order Logic resolution
First Order Logic resolutionFirst Order Logic resolution
First Order Logic resolution
Amar Jukuntla
 
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VEC
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VECUnit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VEC
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VEC
sundarKanagaraj1
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
Hortonworks
 
DATA STRUCTURE AND ALGORITHMS
DATA STRUCTURE AND ALGORITHMS DATA STRUCTURE AND ALGORITHMS
DATA STRUCTURE AND ALGORITHMS
Adams Sidibe
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
DataWorks Summit
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Hashing In Data Structure
Hashing In Data Structure Hashing In Data Structure
Hashing In Data Structure
Meghaj Mallick
 

What's hot (20)

Lecture notes data structures tree
Lecture notes data structures   treeLecture notes data structures   tree
Lecture notes data structures tree
 
Skip gram and cbow
Skip gram and cbowSkip gram and cbow
Skip gram and cbow
 
Garbage Collection
Garbage CollectionGarbage Collection
Garbage Collection
 
Sqoop
SqoopSqoop
Sqoop
 
Concept of hashing
Concept of hashingConcept of hashing
Concept of hashing
 
Python Flow Control
Python Flow ControlPython Flow Control
Python Flow Control
 
Generalized Reinforcement Learning
Generalized Reinforcement LearningGeneralized Reinforcement Learning
Generalized Reinforcement Learning
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programming
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
 
Bloom filters
Bloom filtersBloom filters
Bloom filters
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
Decorators in Python
Decorators in PythonDecorators in Python
Decorators in Python
 
6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx
 
First Order Logic resolution
First Order Logic resolutionFirst Order Logic resolution
First Order Logic resolution
 
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VEC
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VECUnit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VEC
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VEC
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
DATA STRUCTURE AND ALGORITHMS
DATA STRUCTURE AND ALGORITHMS DATA STRUCTURE AND ALGORITHMS
DATA STRUCTURE AND ALGORITHMS
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Hashing In Data Structure
Hashing In Data Structure Hashing In Data Structure
Hashing In Data Structure
 

Viewers also liked

Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
shrinivasvasala
 
Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
nathanmarz
 
Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]
Qrator Labs
 
HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?
bzamecnik
 
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith...
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith...Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith...
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith...
Data Con LA
 
ReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений СафроненкоReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений Сафроненко
PechaKucha Ukraine
 
Big Data aggregation techniques
Big Data aggregation techniquesBig Data aggregation techniques
Big Data aggregation techniques
Valentin Logvinskiy
 
Hyper loglog
Hyper loglogHyper loglog
Hyper loglog
nybon
 
Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017
Roman Elizarov
 

Viewers also liked (9)

Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
 
Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
 
Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]
 
HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?
 
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith...
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith...Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith...
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith...
 
ReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений СафроненкоReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений Сафроненко
 
Big Data aggregation techniques
Big Data aggregation techniquesBig Data aggregation techniques
Big Data aggregation techniques
 
Hyper loglog
Hyper loglogHyper loglog
Hyper loglog
 
Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017
 

Similar to Probabilistic data structures. Part 2. Cardinality

2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
abramsm
 
2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
Open Analytics
 
Counting (Using Computer)
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)
roshmat
 
RecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect HashingRecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect Hashing
Thomas Mueller
 
Probabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. Frequency
Andrii Gakhov
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
Sandeep Joshi
 
Count-Distinct Problem
Count-Distinct ProblemCount-Distinct Problem
Count-Distinct Problem
Kai Zhang
 
358 33 powerpoint-slides_15-hashing-collision_chapter-15
358 33 powerpoint-slides_15-hashing-collision_chapter-15358 33 powerpoint-slides_15-hashing-collision_chapter-15
358 33 powerpoint-slides_15-hashing-collision_chapter-15
sumitbardhan
 
Chapter1
Chapter1Chapter1
Chapter1
Kiran kumar
 
Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015
Big Data Spain
 
Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1
Kshitij Singh
 
Analysis.ppt
Analysis.pptAnalysis.ppt
Analysis.ppt
ShivaniSharma335055
 
Data Structures - Lecture 1 [introduction]
Data Structures - Lecture 1 [introduction]Data Structures - Lecture 1 [introduction]
Data Structures - Lecture 1 [introduction]
Muhammad Hammad Waseem
 
Unit ii algorithm
Unit   ii algorithmUnit   ii algorithm
Unit ii algorithm
Tribhuvan University
 
Максим Харченко. Erlang lincx
Максим Харченко. Erlang lincxМаксим Харченко. Erlang lincx
Максим Харченко. Erlang lincx
Alina Dolgikh
 
0.5mln packets per second with Erlang
0.5mln packets per second with Erlang0.5mln packets per second with Erlang
0.5mln packets per second with Erlang
Maxim Kharchenko
 
Design and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture NotesDesign and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture Notes
Sreedhar Chowdam
 
A Lossless FBAR Compressor
A Lossless FBAR CompressorA Lossless FBAR Compressor
A Lossless FBAR Compressor
Philip Alipour
 
Large-scale real-time analytics for everyone
Large-scale real-time analytics for everyoneLarge-scale real-time analytics for everyone
Large-scale real-time analytics for everyone
Pavel Kalaidin
 
L3-.pptx
L3-.pptxL3-.pptx
L3-.pptx
asdq4
 

Similar to Probabilistic data structures. Part 2. Cardinality (20)

2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
 
2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
 
Counting (Using Computer)
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)
 
RecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect HashingRecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect Hashing
 
Probabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. Frequency
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 
Count-Distinct Problem
Count-Distinct ProblemCount-Distinct Problem
Count-Distinct Problem
 
358 33 powerpoint-slides_15-hashing-collision_chapter-15
358 33 powerpoint-slides_15-hashing-collision_chapter-15358 33 powerpoint-slides_15-hashing-collision_chapter-15
358 33 powerpoint-slides_15-hashing-collision_chapter-15
 
Chapter1
Chapter1Chapter1
Chapter1
 
Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015
 
Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1
 
Analysis.ppt
Analysis.pptAnalysis.ppt
Analysis.ppt
 
Data Structures - Lecture 1 [introduction]
Data Structures - Lecture 1 [introduction]Data Structures - Lecture 1 [introduction]
Data Structures - Lecture 1 [introduction]
 
Unit ii algorithm
Unit   ii algorithmUnit   ii algorithm
Unit ii algorithm
 
Максим Харченко. Erlang lincx
Максим Харченко. Erlang lincxМаксим Харченко. Erlang lincx
Максим Харченко. Erlang lincx
 
0.5mln packets per second with Erlang
0.5mln packets per second with Erlang0.5mln packets per second with Erlang
0.5mln packets per second with Erlang
 
Design and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture NotesDesign and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture Notes
 
A Lossless FBAR Compressor
A Lossless FBAR CompressorA Lossless FBAR Compressor
A Lossless FBAR Compressor
 
Large-scale real-time analytics for everyone
Large-scale real-time analytics for everyoneLarge-scale real-time analytics for everyone
Large-scale real-time analytics for everyone
 
L3-.pptx
L3-.pptxL3-.pptx
L3-.pptx
 

More from Andrii Gakhov

Let's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architecture
Andrii Gakhov
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Andrii Gakhov
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...
Andrii Gakhov
 
DNS Delegation
DNS DelegationDNS Delegation
DNS Delegation
Andrii Gakhov
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and Lua
Andrii Gakhov
 
Pecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food Traditions
Andrii Gakhov
 
Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. Similarity
Andrii Gakhov
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данных
Andrii Gakhov
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
Andrii Gakhov
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
Andrii Gakhov
 
Swagger / Quick Start Guide
Swagger / Quick Start GuideSwagger / Quick Start Guide
Swagger / Quick Start Guide
Andrii Gakhov
 
API Days Berlin highlights
API Days Berlin highlightsAPI Days Berlin highlights
API Days Berlin highlights
Andrii Gakhov
 
ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcases
Andrii Gakhov
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov
 
Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014
Andrii Gakhov
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014
Andrii Gakhov
 
Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014
Andrii Gakhov
 
Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014
Andrii Gakhov
 
Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014
Andrii Gakhov
 
Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014
Andrii Gakhov
 

More from Andrii Gakhov (20)

Let's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architecture
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...
 
DNS Delegation
DNS DelegationDNS Delegation
DNS Delegation
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and Lua
 
Pecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food Traditions
 
Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. Similarity
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данных
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
 
Swagger / Quick Start Guide
Swagger / Quick Start GuideSwagger / Quick Start Guide
Swagger / Quick Start Guide
 
API Days Berlin highlights
API Days Berlin highlightsAPI Days Berlin highlights
API Days Berlin highlights
 
ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcases
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014
 
Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014
 
Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014
 
Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014
 
Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014
 

Recently uploaded

Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
Data Hops
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
SAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloudSAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloud
maazsz111
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 

Recently uploaded (20)

Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
SAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloudSAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloud
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 

Probabilistic data structures. Part 2. Cardinality

  • 1. tech talk @ ferret Andrii Gakhov PROBABILISTIC DATA STRUCTURES ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK PART 2: CARDINALITY
  • 2. CARDINALITY Agenda: ▸ Linear Counting ▸ LogLog, SuperLogLog, HyperLogLog, HyperLogLog++
  • 3. • To determine the number of distinct elements, also called the cardinality, of a large set of elements where duplicates are present Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. THE PROBLEM
  • 5. LINEAR COUNTING: ALGORITHM • Linear counter is a bit map (hash table) of size m (all elements set to 0 at the beginning). • Algorithm consists of a few steps: • for every element calculate hash function and set the appropriate bit to 1 • calculate the fraction V of empty bits in the structure 
 (divide the number of empty bits by the bit map size m ) • estimate cardinality as n ≈ -m ln V
  • 6. LINEAR COUNTING: EXAMPLE 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 • Consider linear counter with 16 bits (m=16) • Consider MurmurHash3 as the hash function h
 (to calculate the appropriate index, we divide result by mod 16) • Set of 10 elements: “bernau”, “bernau”, “bernau”, “berlin”, “kiev”, “kiev”, “new york”, “germany”, “ukraine”, “europe” (NOTE: the real cardinality n = 7) h(“bernau”) = 4, h(“berlin”) = 4, h(“kiev”) = 6, h(“new york”) = 6, h(“germany”) = 14, h(“ukraine”) = 7, h(“europe”) = 9 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0
  • 7. LINEAR COUNTING: EXAMPLE number of empty bits: 11 m = 16 V = 11 / 16 = 0.6875 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 • Cardinality estimation is n ≈ - 16 * ln (0.6875) = 5.995
  • 8. LINEAR COUNTING: READ MORE • http://dblab.kaist.ac.kr/Prof/pdf/Whang1990(linear).pdf • http://www.codeproject.com/Articles/569718/ CardinalityplusEstimationplusinplusLinearplusTimep
  • 10. HYPERLOGLOG: INTUITION • The cardinality of a multiset of uniformly distributed numbers can be estimated by the maximum number of leading zeros in the binary representation of each number. If such value is k, then the number of distinct elements in the set is 2k P(rank=1) = 1/2 - probability to find a binary representation, that starts with 1 P(rank = 2) = 1/2 2 - probability to find a binary representation, that start with 01 … P(rank=k) = 1/2 k rank = number of leading zeros + 1, e.g. rank(f) = 3 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 2 leading zeros 3 f = • Therefore, for 2k binary representations we shell find at least one representation with rank = k • If we remember the maximal rank we’ve seen and it’s equal to k, then we can use 2k as the approximation of the number of elements
  • 11. HYPERLOGLOG • proposed by Flajolet et. al., 2007 • an extension of the Flajolet–Martin algorithm (1985) • HyperLogLog is described by 2 parameters: • p – number of bits that determine a bucket to use averaging
 (m = 2p is the number of buckets/substreams) • h - hash function, that produces uniform hash values • The HyperLogLog algorithm is able to estimate cardinalities of > 109 with a typical error rate of 2%, using 1.5kB of memory (Flajolet, P. et al., 2007).
  • 12. HYPERLOGLOG: ALGORITHM • HyperLogLog uses randomization to approximate the cardinality of a multiset.This randomization is achieved by using hash function h • Observe the maximum number of leading zeros that for all hash values: • If the bit pattern 0L−1 1 is observed at the beginning of a hash value (so, rank = L), then a good estimation of the size of the multiset is 2L.
  • 13. HYPERLOGLOG: ALGORITHM • Stochastic averaging is used to reduce the large variability: • The input stream of data elements S is divided into m substreams Si using the first p bits of the hash values (m = 2p) . • In each substream, the rank (after the initial p bits that are used for substreaming) is measured independently. • These numbers are kept in an array of registers M, where M[i] stores the maximum rank it seen for the substream with index i. • The cardinality estimation is calculated computes as the normalized bias corrected harmonic mean of the estimations on the substreams DVHLL = const(m)⋅m2 ⋅ 2 −M j j=1 m ∑ ⎛ ⎝⎜ ⎞ ⎠⎟ −1
  • 14. HYPERLOGLOG: EXAMPLE • Consider L=8 bits hash function h • Index elements “berlin” and “ferret”: h(“berlin”) = 0110111 h(“ferret”) = 1100011 • Define buckets and calculate values to store:
 (use first p =3 bits for buckets and least L - p = 5 bits for ranks) • bucket(“berlin”) = 011 = 3 value(“berlin”) = rank(0111) = 2 • bucket(“ferret”) = 110 = 6 value(“ferret”) = rank(0011) = 3 • Let’s use p=3 bits to define a bucket (then m=23 =8 buckets). 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 0 M 1 2 3 4 5 6 7 0 0 0 2 0 0 3 0 0 M
  • 15. HYPERLOGLOG: EXAMPLE • Estimate the cardinality be the HLL formula (C ≈ 0.66): DVHLL ≈ 0.66 * 82 / (2-2 + 2-4 ) = 0.66 * 204.8 ≈ 135≠3 • Index element “kharkov”: • h(“kharkov”) = 1100001 • bucket(“kharkov”) = 110 = 6 value(“kharkov”) = rank(0001) = 4 • M[6] = max(M[6], 4) = max(3, 4) = 4 1 2 3 4 5 6 7 0 0 0 2 0 0 4 0 0 M NOTE: For small cardinalities HLL has a strong bias!!!
  • 16. HYPERLOGLOG: PROPERTIES • Memory requirement doesn't grow linearly with L (unlike MinCount or Linear Counting) - for hash function of L bits and precision p, required memory: • original HyperLogLog uses 32 bit hash codes, which requires 5 · 2 p bits • It’s not necessary to calculate the full hash code for the element • first p bits and number of leading zeros of the remaining bits are enough • There are no evidence that some of popular hash functions (MD5, Sha1, Sha256, Murmur3) performs significantly better than others. log2 L +1− p( )⎡⎢ ⎤⎥⋅2p bits
  • 17. HYPERLOGLOG: PROPERTIES • The standard error can be estimated as: σ = 1.04 2p so, if we use 16 bits (p=16) for bucket indices, we receive the standard error in 0.40625% • Algorithm has large error for small cardinalities. • For instance, for n = 0 the algorithm always returns roughly 0.7m • To achieve better estimates for small cardinalities, use LinearCounting below a threshold of 5m/2
  • 18. HYPERLOGLOG: APPLICATIONS • PFCOUNT in Redis returns the approximated cardinality computed by the HyperLogLog data structure 
 (http://antirez.com/news/75) • Redis implementation uses 12Kb per key to count with a standard error of 0.81%, and there is no limit to the number of items you can count, unless you approach 264 items
  • 19. HYPERLOGLOG: READ MORE • http://algo.inria.fr/flajolet/Publications/DuFl03-LNCS.pdf • http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf • https://stefanheule.com/papers/edbt13-hyperloglog.pdf • https://highlyscalable.wordpress.com/2012/05/01/ probabilistic-structures-web-analytics-data-mining/ • https://hal.archives-ouvertes.fr/file/index/docid/465313/ filename/sliding_HyperLogLog.pdf • http://stackoverflow.com/questions/12327004/how-does- the-hyperloglog-algorithm-work
  • 21. HYPERLOGLOG++ • proposed by Stefan Heule et. al., 2013 for Google PowerDrill • an improved version of HyperLogLog (Flajolet et. al., 2007) • HyperLogLog++ is described by 2 parameters: • p – number of bits that determine a bucket to use averaging
 (m = 2p is the number of buckets/substreams) • h - hash function, that produces uniform hash values • The HyperLogLog++ algorithm is able to estimate cardinalities of ~ 7.9 · 10 9 with a typical error rate of 1.625%, using 2.56KB of memory (Micha Gorelick and Ian Ozsvald, High Performance Python, 2014).
  • 22. HYPERLOGLOG++: IMPROVEMENTS • use 64-bit hash function • algorithm that only uses the hash code of the input values is limited by the number of bits of the hash codes when it comes to accurately estimating large cardinalities • In particular, a hash function of L bits can at most distinguish 2L different values, and as the cardinality n approaches 2L hash collisions become more and more likely and accurate estimation gets impossible • if the cardinality approaches 264 ≈ 1.8 · 1019, hash collisions become a problem • bias correction • original algorithm overestimates the real cardinality for small sets, but most of the error is due to bias. • storage efficiency • uses different encoding strategies for hash values, variable length encoding for integers, difference encoding
  • 23. HYPERLOGLOG++ VS HYPERLOGLOG • accuracy is significantly better for large range of cardinalities and equally good on the rest • sparse representation allows for a more adaptive use of memory • if the cardinality n is much smaller then m, then HyperLogLog++ requires significantly less memory • For cardinalities between 12000 and 61000, the bias correction allows for a lower error and avoids a spike in the error when switching between sub-algorithms. • 64 bit hash codes allow the algorithm to estimate cardinalities well beyond 1 billion
  • 24. HYPERLOGLOG++: APPLICATIONS • cardinality metric in Elasticsearch is based on the HyperLogLog++ algorithm for big cardinalities (adaptive counting) • Apache DataFu, collection of libraries for working with large-scale data in Hadoop, has an implementation of HyperLogLog++ algorithm
  • 25. HYPERLOGLOG++: READ MORE • http://static.googleusercontent.com/media/ research.google.com/en//pubs/archive/40671.pdf • https://research.neustar.biz/2013/01/24/hyperloglog- googles-take-on-engineering-hll/
  • 26. ▸ @gakhov ▸ linkedin.com/in/gakhov ▸ www.datacrucis.com THANK YOU