Probabilistic data structures. Part 2. Cardinality

Andrii Gakhov, Ph.D., Senior Software Engineer at Ferret-Go
tech talk @ ferret
PROBABILISTIC DATA STRUCTURES
ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK
PART 2: CARDINALITY
CARDINALITY
Agenda:
▸ Linear Counting
▸ LogLog, SuperLogLog, HyperLogLog, HyperLogLog++
THE PROBLEM
• To determine the number of distinct elements, also
called the cardinality, of a large set of elements
where duplicates are present
• Calculating the exact cardinality of a multiset requires an amount of memory
proportional to the cardinality, which is impractical for very large data sets.
LINEAR COUNTING
LINEAR COUNTING: ALGORITHM
• A linear counter is a bit map (hash table) of size m (all
elements are set to 0 at the beginning).
• The algorithm consists of a few steps:
  • for every element, calculate the hash function and set the
  corresponding bit to 1
  • calculate the fraction V of empty bits in the structure
  (divide the number of empty bits by the bit map size m)
  • estimate the cardinality as n ≈ -m ln V
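The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the class name `LinearCounter` is hypothetical, and SHA-1 from the standard library stands in for whatever uniform hash function is actually used.

```python
import hashlib
import math


class LinearCounter:
    """Bit map of size m; estimates the cardinality as -m * ln(V)."""

    def __init__(self, m):
        self.m = m
        self.bits = [0] * m  # all bits are 0 at the beginning

    def _index(self, item):
        # Assumption: SHA-1 is used only as a convenient uniform hash;
        # the hash value is reduced modulo m to get a bit index.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        self.bits[self._index(item)] = 1

    def estimate(self):
        empty = self.bits.count(0)
        if empty == 0:
            raise ValueError("bit map is full; choose a larger m")
        v = empty / self.m  # fraction V of empty bits
        return -self.m * math.log(v)


counter = LinearCounter(m=1024)
for i in range(300):
    counter.add(f"user-{i % 100}")  # 300 insertions, 100 distinct values
print(round(counter.estimate()))
```

Duplicates set the same bit again, so they do not change the estimate; only hash collisions between different elements cause the estimate to drift from the true cardinality.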
LINEAR COUNTING: EXAMPLE
• Consider a linear counter with 16 bits (m = 16), all set to 0 at the beginning:
index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
bit:    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
• Consider MurmurHash3 as the hash function h
(to calculate the bit index, we take the hash value modulo 16)
• Set of 10 elements: “bernau”, “bernau”, “bernau”,
“berlin”, “kiev”, “kiev”, “new york”, “germany”, “ukraine”,
“europe” (NOTE: the real cardinality n = 7)
h(“bernau”) = 4, h(“berlin”) = 4, h(“kiev”) = 6, h(“new york”) = 6,
h(“germany”) = 14, h(“ukraine”) = 7, h(“europe”) = 9
• After all elements are indexed:
index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
bit:    0  0  0  0  1  0  1  1  0  1  0  0  0  0  1  0
LINEAR COUNTING: EXAMPLE
index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
bit:    0  0  0  0  1  0  1  1  0  1  0  0  0  0  1  0
• number of empty bits: 11
• m = 16
• V = 11 / 16 = 0.6875
• Cardinality estimation is
n ≈ -16 * ln(0.6875) ≈ 5.995
(the real cardinality is 7; the collisions at bits 4 and 6 hide two elements)
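The arithmetic of this example can be checked with a short script. The bit indices below are copied from the slide rather than recomputed with MurmurHash3; the variable names are illustrative only.

```python
import math

m = 16
# Bit indices as given in the example (hash value modulo 16).
indices = {"bernau": 4, "berlin": 4, "kiev": 6, "new york": 6,
           "germany": 14, "ukraine": 7, "europe": 9}
stream = ["bernau", "bernau", "bernau", "berlin", "kiev", "kiev",
          "new york", "germany", "ukraine", "europe"]

bits = [0] * m
for element in stream:
    bits[indices[element]] = 1  # duplicates and collisions hit the same bit

empty = bits.count(0)
v = empty / m
estimate = -m * math.log(v)
print(empty, v, round(estimate, 3))  # 11 0.6875 5.995
```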
LINEAR COUNTING: READ MORE
• http://dblab.kaist.ac.kr/Prof/pdf/Whang1990(linear).pdf
• http://www.codeproject.com/Articles/569718/CardinalityplusEstimationplusinplusLinearplusTimep
HYPERLOGLOG
HYPERLOGLOG: INTUITION
• The cardinality of a multiset of uniformly distributed numbers can be estimated
from the maximum number of leading zeros in the binary representation of each
number. If that value is k, then the number of distinct elements in the set is
approximately 2^k
• rank = number of leading zeros + 1, e.g. for
f = 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0
there are 2 leading zeros, so rank(f) = 3
• P(rank = 1) = 1/2 - probability to find a binary representation that starts with 1
P(rank = 2) = 1/2^2 - probability to find a binary representation that starts with 01
…
P(rank = k) = 1/2^k
• Therefore, among 2^k binary representations we expect to find at least one
representation with rank = k
• If we remember the maximal rank we’ve seen and it’s equal to k, then we can
use 2^k as the approximation of the number of elements
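A small sketch of the rank function makes the probabilities concrete. The function name `rank`, the fixed width of 16 bits, and the convention for an all-zero word are assumptions made for this illustration.

```python
def rank(value, width):
    """Number of leading zeros + 1 in a width-bit binary representation."""
    if value == 0:
        return width + 1  # assumed convention for an all-zero word
    return width - value.bit_length() + 1


f = 0b0010001000100000
print(rank(f, 16))  # 3

# Share of all 16-bit values that have rank k: exactly 1/2^k.
width = 16
total = 2 ** width
for k in (1, 2, 3):
    share = sum(1 for x in range(total) if rank(x, width) == k) / total
    print(k, share)  # 0.5, 0.25, 0.125
```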
HYPERLOGLOG
• proposed by Flajolet et al., 2007
• an extension of the Flajolet–Martin algorithm (1985)
• HyperLogLog is described by 2 parameters:
  • p – number of hash bits that select the bucket used for averaging
  (m = 2^p is the number of buckets/substreams)
  • h – hash function that produces uniform hash values
• The HyperLogLog algorithm is able to estimate cardinalities of
> 10^9 with a typical error rate of 2%, using 1.5 kB of memory
(Flajolet, P. et al., 2007).
HYPERLOGLOG: ALGORITHM
• HyperLogLog uses randomization to approximate the
cardinality of a multiset. This randomization is achieved by
using a hash function h
• Observe the maximum number of leading zeros over all
hash values:
  • If the bit pattern 0^(L−1)1 (L − 1 zeros followed by a 1) is observed at
  the beginning of a hash value (so, rank = L), then a good estimation of the
  size of the multiset is 2^L.
HYPERLOGLOG: ALGORITHM
• Stochastic averaging is used to reduce the large variability:
  • The input stream of data elements S is divided into m substreams S_i using
  the first p bits of the hash values (m = 2^p).
  • In each substream, the rank (after the initial p bits that are used for
  substreaming) is measured independently.
  • These numbers are kept in an array of registers M, where M[i] stores the
  maximum rank seen for the substream with index i.
• The cardinality estimate is computed as the normalized, bias-corrected
harmonic mean of the estimates on the substreams:

$$DV_{HLL} = \mathrm{const}(m) \cdot m^2 \cdot \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1}$$
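The substreaming and the estimator can be sketched as follows. This is a simplified illustration under stated assumptions: the class name `HyperLogLog` is hypothetical, the first 32 bits of SHA-1 stand in for a uniform hash function, the values of const(m) are the constants published by Flajolet et al. (2007) for m ≥ 16, and only the raw estimate is computed (no small-range or large-range correction).

```python
import hashlib


def const(m):
    # Bias-correction constants from Flajolet et al. (2007).
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    if m >= 128:
        return 0.7213 / (1 + 1.079 / m)
    raise ValueError("this sketch supports m >= 16 only")


class HyperLogLog:
    HASH_BITS = 32

    def __init__(self, p):
        self.p = p
        self.m = 1 << p          # m = 2^p registers
        self.M = [0] * self.m

    def _hash(self, item):
        # Assumption: first 32 bits of SHA-1 used as a uniform hash value.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")

    def add(self, item):
        x = self._hash(item)
        rest = self.HASH_BITS - self.p
        j = x >> rest                   # first p bits: substream index
        w = x & ((1 << rest) - 1)       # remaining bits
        r = rest - w.bit_length() + 1   # rank = leading zeros + 1
        self.M[j] = max(self.M[j], r)

    def estimate(self):
        z = sum(2.0 ** -r for r in self.M)
        return const(self.m) * self.m ** 2 / z


hll = HyperLogLog(p=10)
for i in range(50000):
    hll.add(f"item-{i % 10000}")  # 50000 insertions, 10000 distinct values
print(round(hll.estimate()))
```

With p = 10 the sketch keeps 1024 small registers no matter how many elements are added, which is where the memory savings come from.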
HYPERLOGLOG: EXAMPLE
• Consider a hash function h that produces L = 7-bit values
• Let’s use p = 3 bits to define a bucket (then m = 2^3 = 8 buckets); all registers start at 0:
index:  0 1 2 3 4 5 6 7
M:      0 0 0 0 0 0 0 0
• Index elements “berlin” and “ferret”:
h(“berlin”) = 0110111   h(“ferret”) = 1100011
• Define buckets and calculate values to store
(use the first p = 3 bits for buckets and the remaining L − p = 4 bits for ranks):
  • bucket(“berlin”) = 011 = 3   value(“berlin”) = rank(0111) = 2
  • bucket(“ferret”) = 110 = 6   value(“ferret”) = rank(0011) = 3
index:  0 1 2 3 4 5 6 7
M:      0 0 0 2 0 0 3 0
HYPERLOGLOG: EXAMPLE
• Index element “kharkov”:
  • h(“kharkov”) = 1100001
  • bucket(“kharkov”) = 110 = 6   value(“kharkov”) = rank(0001) = 4
  • M[6] = max(M[6], 4) = max(3, 4) = 4
index:  0 1 2 3 4 5 6 7
M:      0 0 0 2 0 0 4 0
• Estimate the cardinality by the HLL formula (const(m) ≈ 0.66). The sum runs
over all 8 registers, so each of the 6 empty registers contributes 2^0 = 1:
DV_HLL ≈ 0.66 · 8^2 / (6 · 2^0 + 2^(−2) + 2^(−4)) = 42.24 / 6.3125 ≈ 6.7 ≠ 3
NOTE: For small cardinalities HLL has a strong bias!!!
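The example can be replayed in code by applying the estimator literally, with all m registers in the sum (an empty register holds 0 and contributes 2^0 = 1). The hash values are the bit strings from the slide, the constant 0.66 is the value used there, and the helper `rank` is a hypothetical name for this sketch.

```python
def rank(bits):
    """Leading zeros + 1 for a bit string such as '0011'."""
    return len(bits) - len(bits.lstrip("0")) + 1


p, m = 3, 8
hashes = {"berlin": "0110111", "ferret": "1100011", "kharkov": "1100001"}

M = [0] * m
for element, h in hashes.items():
    bucket = int(h[:p], 2)                    # first p bits select the register
    M[bucket] = max(M[bucket], rank(h[p:]))   # remaining bits give the rank

estimate = 0.66 * m ** 2 / sum(2.0 ** -value for value in M)
print(M)                   # [0, 0, 0, 2, 0, 0, 4, 0]
print(round(estimate, 2))  # 6.69
```

Three elements were indexed, yet the raw estimate is more than twice that, which is the small-cardinality bias the note refers to.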
HYPERLOGLOG: PROPERTIES
• Memory requirement doesn't grow linearly with L (unlike MinCount or
Linear Counting) - for a hash function of L bits and precision p, the required
memory is:

$$\lceil \log_2 (L + 1 - p) \rceil \cdot 2^p \ \text{bits}$$

  • the original HyperLogLog uses 32-bit hash codes, which requires 5 · 2^p bits
• It’s not necessary to calculate the full hash code for the element
  • the first p bits and the number of leading zeros of the remaining bits are
  enough
• There is no evidence that any of the popular hash functions (MD5, SHA-1,
SHA-256, Murmur3) performs significantly better than the others.
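The memory formula is easy to evaluate directly. In the sketch below the function name `hll_memory_bits` and the choice p = 14 are assumptions made for illustration; only the formula itself comes from the slide.

```python
import math


def hll_memory_bits(L, p):
    """Register memory in bits: ceil(log2(L + 1 - p)) * 2**p."""
    bits_per_register = math.ceil(math.log2(L + 1 - p))
    return bits_per_register * 2 ** p


for L, p in [(32, 14), (64, 14)]:
    bits = hll_memory_bits(L, p)
    print(L, p, bits, bits / 8 / 1024)  # last column: size in KiB
# 32 14 81920 10.0
# 64 14 98304 12.0
```

Doubling the hash length from 32 to 64 bits adds only one bit per register (5 to 6), which is why the memory does not grow linearly with L.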
HYPERLOGLOG: PROPERTIES
• The standard error can be estimated as:

$$\sigma = \frac{1.04}{\sqrt{2^p}}$$

so, if we use 16 bits (p = 16) for bucket indices, we get a
standard error of 0.40625%
• The algorithm has a large error for small cardinalities.
  • For instance, for n = 0 the algorithm always returns roughly 0.7m
  • To achieve better estimates for small cardinalities, use
  Linear Counting below a threshold of 5m/2
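Both points can be sketched together: the standard-error formula, and the fallback to Linear Counting when the raw estimate is at most 5m/2. The function names are hypothetical, and the registers and the constant 0.66 are taken from the 8-register example above; the fallback treats the empty registers as the empty bits of a linear counter.

```python
import math


def standard_error(p):
    return 1.04 / math.sqrt(2 ** p)


def corrected_estimate(M, const_m):
    m = len(M)
    raw = const_m * m ** 2 / sum(2.0 ** -value for value in M)
    empty = M.count(0)
    if raw <= 2.5 * m and empty > 0:
        # Linear Counting on the registers: V = empty / m, n ≈ -m ln V
        return m * math.log(m / empty)
    return raw


print(standard_error(16))  # 0.0040625

M = [0, 0, 0, 2, 0, 0, 4, 0]  # registers after indexing 3 elements
print(round(corrected_estimate(M, 0.66), 2))  # 2.3
```

For the three indexed elements the fallback returns about 2.3, much closer to the real cardinality than the raw HLL estimate.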
HYPERLOGLOG: APPLICATIONS
• PFCOUNT in Redis returns the approximated cardinality
computed by the HyperLogLog data structure
(http://antirez.com/news/75)
• The Redis implementation uses 12 KB per key to count with a standard
error of 0.81%, and there is no limit to the number of items you can
count, unless you approach 2^64 items
HYPERLOGLOG: READ MORE
• http://algo.inria.fr/flajolet/Publications/DuFl03-LNCS.pdf
• http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
• https://stefanheule.com/papers/edbt13-hyperloglog.pdf
• https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
• https://hal.archives-ouvertes.fr/file/index/docid/465313/filename/sliding_HyperLogLog.pdf
• http://stackoverflow.com/questions/12327004/how-does-the-hyperloglog-algorithm-work
HYPERLOGLOG++
HYPERLOGLOG++
• proposed by Stefan Heule et al., 2013 for Google PowerDrill
• an improved version of HyperLogLog (Flajolet et al., 2007)
• HyperLogLog++ is described by 2 parameters:
  • p – number of hash bits that select the bucket used for averaging
  (m = 2^p is the number of buckets/substreams)
  • h – hash function that produces uniform hash values
• The HyperLogLog++ algorithm is able to estimate cardinalities
of ~ 7.9 · 10^9 with a typical error rate of 1.625%, using 2.56 KB of
memory (Micha Gorelick and Ian Ozsvald, High Performance
Python, 2014).
HYPERLOGLOG++: IMPROVEMENTS
• use a 64-bit hash function
  • an algorithm that only uses the hash code of the input values is limited by the
  number of bits of the hash codes when it comes to accurately estimating
  large cardinalities
  • In particular, a hash function of L bits can distinguish at most 2^L different
  values, and as the cardinality n approaches 2^L, hash collisions become
  more and more likely and accurate estimation becomes impossible
  • with 64-bit hash codes, collisions only become a problem if the cardinality
  approaches 2^64 ≈ 1.8 · 10^19
• bias correction
  • the original algorithm overestimates the real cardinality for small sets, and
  most of that error is due to bias, which can be corrected.
• storage efficiency
  • uses different encoding strategies for hash values: variable-length
  encoding for integers, difference encoding
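To see why the hash length matters, one can compute how many distinct hash values n distinct items are expected to produce. The sketch below assumes an ideal uniform hash and uses the standard occupancy formula N · (1 − (1 − 1/N)^n) with N = 2^L; the function name and the choice n = 10^9 are illustrative and not part of the original slides.

```python
import math


def expected_distinct_hashes(n, L):
    """Expected number of distinct L-bit hash values among n distinct items,
    assuming an ideal uniform hash: N * (1 - (1 - 1/N)**n) with N = 2**L."""
    N = 2 ** L
    # expm1/log1p keep precision when 1/N is tiny (e.g. L = 64).
    return -N * math.expm1(n * math.log1p(-1.0 / N))


n = 10 ** 9
for L in (32, 64):
    print(L, expected_distinct_hashes(n, L) / n)
```

Under these assumptions, roughly 11% of a billion items collide with another item's 32-bit hash value and become invisible to any hash-based estimator, while with 64-bit hash codes the loss is negligible.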
HYPERLOGLOG++ VS HYPERLOGLOG
• accuracy is significantly better for a large range of cardinalities
and equally good on the rest
• sparse representation allows for a more adaptive use of memory
  • if the cardinality n is much smaller than m, then HyperLogLog++
  requires significantly less memory
• For cardinalities between 12000 and 61000, the bias correction
allows for a lower error and avoids a spike in the error when
switching between sub-algorithms.
• 64-bit hash codes allow the algorithm to estimate cardinalities well
beyond 1 billion
HYPERLOGLOG++: APPLICATIONS
• the cardinality metric in Elasticsearch is based on the
HyperLogLog++ algorithm for big cardinalities (adaptive
counting)
• Apache DataFu, a collection of libraries for working with
large-scale data in Hadoop, has an implementation of the
HyperLogLog++ algorithm
HYPERLOGLOG++: READ MORE
• http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
• https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/
▸ @gakhov
▸ linkedin.com/in/gakhov
▸ www.datacrucis.com
THANK YOU
1 of 26

Recommended

HyperLogLog in Hive - How to count sheep efficiently? by
HyperLogLog in Hive - How to count sheep efficiently?HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?bzamecnik
4.7K views33 slides
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith... by
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith...Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith...
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith...Data Con LA
1.3K views36 slides
Probabilistic data structures. Part 3. Frequency by
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyAndrii Gakhov
1.7K views31 slides
Probabilistic data structures by
Probabilistic data structuresProbabilistic data structures
Probabilistic data structuresshrinivasvasala
595 views7 slides
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat... by
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Andrii Gakhov
735 views39 slides
Vasia Kalavri – Training: Gelly School by
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Flink Forward
7.1K views20 slides

More Related Content

What's hot

Garbage Collection by
Garbage CollectionGarbage Collection
Garbage CollectionEelco Visser
5.7K views49 slides
Michael Häusler – Everyday flink by
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flinkFlink Forward
8.5K views31 slides
R and cpp by
R and cppR and cpp
R and cppRomain Francois
1.6K views44 slides
Flux and InfluxDB 2.0 by Paul Dix by
Flux and InfluxDB 2.0 by Paul DixFlux and InfluxDB 2.0 by Paul Dix
Flux and InfluxDB 2.0 by Paul DixInfluxData
1.3K views85 slides
Time Series Processing with Solr and Spark by
Time Series Processing with Solr and SparkTime Series Processing with Solr and Spark
Time Series Processing with Solr and SparkJosef Adersberger
527 views64 slides
Apache Flink Training: DataSet API Basics by
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsFlink Forward
12.3K views37 slides

What's hot(20)

Garbage Collection by Eelco Visser
Garbage CollectionGarbage Collection
Garbage Collection
Eelco Visser5.7K views
Michael Häusler – Everyday flink by Flink Forward
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flink
Flink Forward8.5K views
Flux and InfluxDB 2.0 by Paul Dix by InfluxData
Flux and InfluxDB 2.0 by Paul DixFlux and InfluxDB 2.0 by Paul Dix
Flux and InfluxDB 2.0 by Paul Dix
InfluxData1.3K views
Time Series Processing with Solr and Spark by Josef Adersberger
Time Series Processing with Solr and SparkTime Series Processing with Solr and Spark
Time Series Processing with Solr and Spark
Josef Adersberger527 views
Apache Flink Training: DataSet API Basics by Flink Forward
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API Basics
Flink Forward12.3K views
Faster persistent data structures through hashing by Johan Tibell
Faster persistent data structures through hashingFaster persistent data structures through hashing
Faster persistent data structures through hashing
Johan Tibell4K views
Go and Uber’s time series database m3 by Rob Skillington
Go and Uber’s time series database m3Go and Uber’s time series database m3
Go and Uber’s time series database m3
Rob Skillington6.5K views
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi by InfluxData
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiMonitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
InfluxData228 views
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec... by Jen Aman
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman2.7K views
Map reduce: beyond word count by Jeff Patti
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
Jeff Patti11.3K views
Scott Anderson [InfluxData] | InfluxDB Tasks – Beyond Downsampling | InfluxDa... by InfluxData
Scott Anderson [InfluxData] | InfluxDB Tasks – Beyond Downsampling | InfluxDa...Scott Anderson [InfluxData] | InfluxDB Tasks – Beyond Downsampling | InfluxDa...
Scott Anderson [InfluxData] | InfluxDB Tasks – Beyond Downsampling | InfluxDa...
InfluxData136 views
A Fast and Efficient Time Series Storage Based on Apache Solr by QAware GmbH
A Fast and Efficient Time Series Storage Based on Apache SolrA Fast and Efficient Time Series Storage Based on Apache Solr
A Fast and Efficient Time Series Storage Based on Apache Solr
QAware GmbH3.5K views
Apache Flink Training: DataStream API Part 2 Advanced by Flink Forward
Apache Flink Training: DataStream API Part 2 Advanced Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced
Flink Forward9.6K views
How to Build a Telegraf Plugin by Noah Crowley by InfluxData
How to Build a Telegraf Plugin by Noah CrowleyHow to Build a Telegraf Plugin by Noah Crowley
How to Build a Telegraf Plugin by Noah Crowley
InfluxData3.4K views

Viewers also liked

Using Simplicity to Make Hard Big Data Problems Easy by
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easynathanmarz
6.3K views102 slides
Анализ количества посетителей на сайте [Считаем уникальные элементы] by
Анализ количества посетителей на сайте [Считаем уникальные элементы]Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]Qrator Labs
464 views87 slides
ReqLabs PechaKucha Евгений Сафроненко by
ReqLabs PechaKucha Евгений СафроненкоReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений СафроненкоPechaKucha Ukraine
624 views20 slides
Big Data aggregation techniques by
Big Data aggregation techniquesBig Data aggregation techniques
Big Data aggregation techniquesValentin Logvinskiy
1.3K views31 slides
Hyper loglog by
Hyper loglogHyper loglog
Hyper loglognybon
329 views20 slides
Deep dive into Coroutines on JVM @ KotlinConf 2017 by
Deep dive into Coroutines on JVM @ KotlinConf 2017Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017Roman Elizarov
5.8K views107 slides

Viewers also liked(6)

Using Simplicity to Make Hard Big Data Problems Easy by nathanmarz
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
nathanmarz6.3K views
Анализ количества посетителей на сайте [Считаем уникальные элементы] by Qrator Labs
Анализ количества посетителей на сайте [Считаем уникальные элементы]Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]
Qrator Labs464 views
ReqLabs PechaKucha Евгений Сафроненко by PechaKucha Ukraine
ReqLabs PechaKucha Евгений СафроненкоReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений Сафроненко
PechaKucha Ukraine624 views
Hyper loglog by nybon
Hyper loglogHyper loglog
Hyper loglog
nybon329 views
Deep dive into Coroutines on JVM @ KotlinConf 2017 by Roman Elizarov
Deep dive into Coroutines on JVM @ KotlinConf 2017Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017
Roman Elizarov5.8K views

Similar to Probabilistic data structures. Part 2. Cardinality

2013 open analytics_countingv3 by
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3abramsm
351 views40 slides
2013 open analytics_countingv3 by
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3Open Analytics
691 views39 slides
Counting (Using Computer) by
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)roshmat
311 views50 slides
RecSplit Minimal Perfect Hashing by
RecSplit Minimal Perfect HashingRecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect HashingThomas Mueller
1.5K views36 slides
Data streaming algorithms by
Data streaming algorithmsData streaming algorithms
Data streaming algorithmsSandeep Joshi
1.6K views33 slides
Probabilistic data structure by
Probabilistic data structureProbabilistic data structure
Probabilistic data structureThinh Dang
272 views31 slides

Similar to Probabilistic data structures. Part 2. Cardinality(20)

2013 open analytics_countingv3 by abramsm
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
abramsm351 views
2013 open analytics_countingv3 by Open Analytics
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
Open Analytics691 views
Counting (Using Computer) by roshmat
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)
roshmat311 views
RecSplit Minimal Perfect Hashing by Thomas Mueller
RecSplit Minimal Perfect HashingRecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect Hashing
Thomas Mueller1.5K views
Data streaming algorithms by Sandeep Joshi
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
Sandeep Joshi1.6K views
Probabilistic data structure by Thinh Dang
Probabilistic data structureProbabilistic data structure
Probabilistic data structure
Thinh Dang272 views
Count-Distinct Problem by Kai Zhang
Count-Distinct ProblemCount-Distinct Problem
Count-Distinct Problem
Kai Zhang3.4K views
358 33 powerpoint-slides_15-hashing-collision_chapter-15 by sumitbardhan
358 33 powerpoint-slides_15-hashing-collision_chapter-15358 33 powerpoint-slides_15-hashing-collision_chapter-15
358 33 powerpoint-slides_15-hashing-collision_chapter-15
sumitbardhan10.3K views
Building graphs to discover information by David Martínez at Big Data Spain 2015 by Big Data Spain
Building graphs to discover information by David Martínez at Big Data Spain 2015Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015
Big Data Spain729 views
Digital logic-formula-notes-final-1 by Kshitij Singh
Digital logic-formula-notes-final-1Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1
Kshitij Singh931 views
Максим Харченко. Erlang lincx by Alina Dolgikh
Максим Харченко. Erlang lincxМаксим Харченко. Erlang lincx
Максим Харченко. Erlang lincx
Alina Dolgikh1.3K views
0.5mln packets per second with Erlang by Maxim Kharchenko
0.5mln packets per second with Erlang0.5mln packets per second with Erlang
0.5mln packets per second with Erlang
Maxim Kharchenko1.5K views
Large-scale real-time analytics for everyone by Pavel Kalaidin
Large-scale real-time analytics for everyoneLarge-scale real-time analytics for everyone
Large-scale real-time analytics for everyone
Pavel Kalaidin1K views
L3-.pptx by asdq4
L3-.pptxL3-.pptx
L3-.pptx
asdq41 view
Logic Circuits Design - "Chapter 1: Digital Systems and Information" by Ra'Fat Al-Msie'deen
Logic Circuits Design - "Chapter 1: Digital Systems and Information"Logic Circuits Design - "Chapter 1: Digital Systems and Information"
Logic Circuits Design - "Chapter 1: Digital Systems and Information"
Ra'Fat Al-Msie'deen5.8K views
Scaling up genomic analysis with ADAM by fnothaft
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
fnothaft1.5K views

More from Andrii Gakhov

Let's start GraphQL: structure, behavior, and architecture by
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureAndrii Gakhov
423 views90 slides
Too Much Data? - Just Sample, Just Hash, ... by
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Andrii Gakhov
386 views23 slides
DNS Delegation by
DNS DelegationDNS Delegation
DNS DelegationAndrii Gakhov
902 views15 slides
Implementing a Fileserver with Nginx and Lua by
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaAndrii Gakhov
2.1K views16 slides
Pecha Kucha: Ukrainian Food Traditions by
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsAndrii Gakhov
838 views20 slides
Probabilistic data structures. Part 4. Similarity by
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityAndrii Gakhov
2.4K views46 slides

More from Andrii Gakhov(20)

Let's start GraphQL: structure, behavior, and architecture by Andrii Gakhov
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architecture
Andrii Gakhov423 views
Too Much Data? - Just Sample, Just Hash, ... by Andrii Gakhov
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...
Andrii Gakhov386 views
Implementing a Fileserver with Nginx and Lua by Andrii Gakhov
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and Lua
Andrii Gakhov2.1K views
Pecha Kucha: Ukrainian Food Traditions by Andrii Gakhov
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food Traditions
Andrii Gakhov838 views
Probabilistic data structures. Part 4. Similarity by Andrii Gakhov
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. Similarity
Andrii Gakhov2.4K views
Вероятностные структуры данных by Andrii Gakhov
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данных
Andrii Gakhov1.3K views
Recurrent Neural Networks. Part 1: Theory by Andrii Gakhov
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
Andrii Gakhov14.2K views
Apache Big Data Europe 2015: Selected Talks by Andrii Gakhov
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
Andrii Gakhov716 views
Swagger / Quick Start Guide by Andrii Gakhov
Swagger / Quick Start GuideSwagger / Quick Start Guide
Swagger / Quick Start Guide
Andrii Gakhov7.6K views
API Days Berlin highlights by Andrii Gakhov
API Days Berlin highlightsAPI Days Berlin highlights
API Days Berlin highlights
Andrii Gakhov787 views
ELK - What's new and showcases by Andrii Gakhov
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcases
Andrii Gakhov938 views
Apache Spark Overview @ ferret by Andrii Gakhov
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov1.2K views
Data Mining - lecture 8 - 2014 by Andrii Gakhov
Data Mining - lecture 8 - 2014Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014
Andrii Gakhov1.3K views
Data Mining - lecture 7 - 2014 by Andrii Gakhov
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014
Andrii Gakhov2.5K views
Data Mining - lecture 6 - 2014 by Andrii Gakhov
Data Mining - lecture 6 - 2014Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014
Andrii Gakhov1.1K views
Data Mining - lecture 5 - 2014 by Andrii Gakhov
Data Mining - lecture 5 - 2014Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014
Andrii Gakhov816 views
Data Mining - lecture 4 - 2014 by Andrii Gakhov
Data Mining - lecture 4 - 2014Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014
Andrii Gakhov1K views
Data Mining - lecture 3 - 2014 by Andrii Gakhov
Data Mining - lecture 3 - 2014Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014
Andrii Gakhov1K views
Decision Theory - lecture 1 (introduction) by Andrii Gakhov
Decision Theory - lecture 1 (introduction)Decision Theory - lecture 1 (introduction)
Decision Theory - lecture 1 (introduction)
Andrii Gakhov1.4K views

Recently uploaded

Microchip: CXL Use Cases and Enabling Ecosystem by
Microchip: CXL Use Cases and Enabling EcosystemMicrochip: CXL Use Cases and Enabling Ecosystem
Microchip: CXL Use Cases and Enabling EcosystemCXL Forum
129 views12 slides
How to reduce cold starts for Java Serverless applications in AWS at JCON Wor... by
How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...
How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...Vadym Kazulkin
70 views64 slides
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum... by
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...NUS-ISS
28 views35 slides
Astera Labs: Intelligent Connectivity for Cloud and AI Infrastructure by
Astera Labs:  Intelligent Connectivity for Cloud and AI InfrastructureAstera Labs:  Intelligent Connectivity for Cloud and AI Infrastructure
Astera Labs: Intelligent Connectivity for Cloud and AI InfrastructureCXL Forum
125 views16 slides
CXL at OCP by
CXL at OCPCXL at OCP
CXL at OCPCXL Forum
208 views66 slides

Recently uploaded(20)

Microchip: CXL Use Cases and Enabling Ecosystem by CXL Forum
Microchip: CXL Use Cases and Enabling EcosystemMicrochip: CXL Use Cases and Enabling Ecosystem
Microchip: CXL Use Cases and Enabling Ecosystem
CXL Forum129 views
How to reduce cold starts for Java Serverless applications in AWS at JCON Wor... by Vadym Kazulkin
How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...
How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...
Vadym Kazulkin70 views
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum... by NUS-ISS
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...
NUS-ISS28 views
Astera Labs: Intelligent Connectivity for Cloud and AI Infrastructure by CXL Forum
Astera Labs:  Intelligent Connectivity for Cloud and AI InfrastructureAstera Labs:  Intelligent Connectivity for Cloud and AI Infrastructure
Astera Labs: Intelligent Connectivity for Cloud and AI Infrastructure
CXL Forum125 views
CXL at OCP by CXL Forum
CXL at OCPCXL at OCP
CXL at OCP
CXL Forum208 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10165 views
Future of Learning - Yap Aye Wee.pdf by NUS-ISS
Future of Learning - Yap Aye Wee.pdfFuture of Learning - Yap Aye Wee.pdf
Future of Learning - Yap Aye Wee.pdf
NUS-ISS38 views
Empathic Computing: Delivering the Potential of the Metaverse by Mark Billinghurst
Empathic Computing: Delivering  the Potential of the MetaverseEmpathic Computing: Delivering  the Potential of the Metaverse
Empathic Computing: Delivering the Potential of the Metaverse
Mark Billinghurst449 views
AI: mind, matter, meaning, metaphors, being, becoming, life values by Twain Liu 刘秋艳
AI: mind, matter, meaning, metaphors, being, becoming, life valuesAI: mind, matter, meaning, metaphors, being, becoming, life values
AI: mind, matter, meaning, metaphors, being, becoming, life values
"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur by Fwdays
"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur
"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur
Probabilistic data structures. Part 2. Cardinality

  • 1. tech talk @ ferret Andrii Gakhov PROBABILISTIC DATA STRUCTURES ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK PART 2: CARDINALITY
  • 2. CARDINALITY Agenda: ▸ Linear Counting ▸ LogLog, SuperLogLog, HyperLogLog, HyperLogLog++
  • 3. THE PROBLEM • To determine the number of distinct elements, also called the cardinality, of a large set of elements where duplicates are present. • Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets.
  • 5. LINEAR COUNTING: ALGORITHM • A linear counter is a bit map (hash table) of size m, with all bits set to 0 at the beginning. • The algorithm consists of a few steps: • for every element, compute its hash value and set the corresponding bit to 1 • calculate the fraction V of empty bits in the structure (the number of empty bits divided by the bit map size m) • estimate the cardinality as n ≈ -m ln V
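The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `LinearCounter` is hypothetical, and BLAKE2b from the standard library stands in for MurmurHash3, which is not part of the standard library.

```python
import hashlib
import math


class LinearCounter:
    """Bit map of m bits; estimates cardinality as -m * ln(V)."""

    def __init__(self, m: int) -> None:
        self.m = m
        self.bits = [0] * m  # all bits are 0 at the beginning

    def _index(self, element: str) -> int:
        # Assumption: BLAKE2b is used as a stand-in for MurmurHash3.
        digest = hashlib.blake2b(element.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.m

    def add(self, element: str) -> None:
        self.bits[self._index(element)] = 1

    def estimate(self) -> float:
        empty = self.bits.count(0)
        if empty == 0:
            # The bit map is saturated, so ln(V) is undefined.
            return math.inf
        v = empty / self.m  # fraction V of empty bits
        return -self.m * math.log(v)


counter = LinearCounter(m=1024)
for word in ["bernau", "bernau", "bernau", "berlin", "kiev", "kiev",
             "new york", "germany", "ukraine", "europe"]:
    counter.add(word)
estimate = counter.estimate()
print(round(estimate, 2))
```

With a bit map much larger than the number of distinct elements, collisions are rare and the estimate stays close to the true cardinality of 7. Duplicates never change the result because they set the same bit again.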
  • 6. LINEAR COUNTING: EXAMPLE • Consider a linear counter with 16 bits (m = 16), all initially set to 0. • Use MurmurHash3 as the hash function h (to get the bit index, take the hash value modulo 16). • Set of 10 elements: “bernau”, “bernau”, “bernau”, “berlin”, “kiev”, “kiev”, “new york”, “germany”, “ukraine”, “europe” (NOTE: the real cardinality is n = 7). • h(“bernau”) = 4, h(“berlin”) = 4, h(“kiev”) = 6, h(“new york”) = 6, h(“germany”) = 14, h(“ukraine”) = 7, h(“europe”) = 9 • Resulting bit map:
      index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
      bit:    0  0  0  0  1  0  1  1  0  1  0  0  0  0  1  0
  • 7. LINEAR COUNTING: EXAMPLE • For the same bit map as on the previous slide: number of empty bits = 11, m = 16, so V = 11 / 16 = 0.6875 • The cardinality estimate is n ≈ -16 · ln(0.6875) ≈ 5.995
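The arithmetic of this example can be checked with a short script. The bit indices are taken directly from the slide rather than computed with MurmurHash3, and the variable names are illustrative only.

```python
import math

m = 16
# Bit indices as given on the slide (h(x) mod 16 for each element).
slide_indices = {"bernau": 4, "berlin": 4, "kiev": 6, "new york": 6,
                 "germany": 14, "ukraine": 7, "europe": 9}
stream = ["bernau", "bernau", "bernau", "berlin", "kiev", "kiev",
          "new york", "germany", "ukraine", "europe"]

bits = [0] * m
for element in stream:
    bits[slide_indices[element]] = 1  # duplicates and collisions hit the same bit

empty = bits.count(0)
v = empty / m
estimate = -m * math.log(v)
print(empty, v, round(estimate, 3))  # 11 0.6875 5.995
```

Two of the seven distinct elements collide with others (“berlin” with “bernau”, “new york” with “kiev”), which is why only five bits are set and the estimate falls below 7.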
  • 8. LINEAR COUNTING: READ MORE • http://dblab.kaist.ac.kr/Prof/pdf/Whang1990(linear).pdf • http://www.codeproject.com/Articles/569718/CardinalityplusEstimationplusinplusLinearplusTimep
  • 10. HYPERLOGLOG: INTUITION • The cardinality of a multiset of uniformly distributed numbers can be estimated from the maximum number of leading zeros in the binary representations of the numbers. If that value is k, then the number of distinct elements in the set is approximately $2^k$. • rank = number of leading zeros + 1 [Figure: a bit string f that starts with 001, with its 2 leading zeros marked; rank(f) = 3] • P(rank = 1) = $1/2$: the probability of a binary representation that starts with 1 • P(rank = 2) = $1/2^2$: the probability of a binary representation that starts with 01 • … • P(rank = k) = $1/2^k$ • Therefore, among $2^k$ binary representations we expect to find about one representation with rank = k. • If we remember the maximal rank we have seen and it is equal to k, then we can use $2^k$ as the approximation of the number of elements.
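A small sketch of the rank function may make the probabilities concrete. The helper name `rank` and the convention for an all-zero value are assumptions made for this illustration. The loop enumerates every 8-bit value and counts how many have each rank.

```python
from collections import Counter


def rank(value: int, width: int) -> int:
    """Number of leading zeros in the width-bit representation, plus 1."""
    if value == 0:
        return width + 1  # assumed convention when every bit is zero
    return width - value.bit_length() + 1


width = 8
example_rank = rank(0b00100010, width)  # starts with 001, so 2 leading zeros
print(example_rank)  # 3

# Count how many of the 2^8 possible values have each rank.
counts = Counter(rank(v, width) for v in range(2 ** width))
for k in range(1, 5):
    print(k, counts[k], counts[k] / 2 ** width)
```

Exactly half of the values have rank 1, a quarter have rank 2, and so on, matching P(rank = k) = 1/2^k for uniformly distributed values.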
  • 11. HYPERLOGLOG • proposed by Flajolet et al., 2007 • an extension of the Flajolet–Martin algorithm (1985) • HyperLogLog is described by 2 parameters: • p: the number of hash bits that select the bucket used for averaging ($m = 2^p$ is the number of buckets/substreams) • h: a hash function that produces uniform hash values • The HyperLogLog algorithm is able to estimate cardinalities of > $10^9$ with a typical error rate of 2%, using 1.5 kB of memory (Flajolet, P. et al., 2007).
  • 12. HYPERLOGLOG: ALGORITHM • HyperLogLog uses randomization to approximate the cardinality of a multiset. This randomization is achieved by using a hash function h. • Observe the maximum number of leading zeros over all hash values: if the bit pattern $0^{L-1}1$ is observed at the beginning of a hash value (so rank = L), then a good estimate of the size of the multiset is $2^L$.
  • 13. HYPERLOGLOG: ALGORITHM • Stochastic averaging is used to reduce the large variability: • The input stream of data elements S is divided into m substreams $S_i$ using the first p bits of the hash values ($m = 2^p$). • In each substream, the rank (after the initial p bits that are used for substreaming) is measured independently. • These numbers are kept in an array of registers M, where M[i] stores the maximum rank seen so far for the substream with index i. • The cardinality estimate is computed as the normalized, bias-corrected harmonic mean of the estimates on the substreams: $DV_{HLL} = \mathrm{const}(m) \cdot m^2 \cdot \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1}$
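The procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the class name is hypothetical, BLAKE2b stands in for whatever uniform 64-bit hash is used, const(m) uses the approximation 0.7213 / (1 + 1.079 / m) given by Flajolet et al. for m ≥ 128, and no small-range or large-range corrections are applied. Registers start at 0, so an empty register contributes 2^0 = 1 to the sum.

```python
import hashlib

HASH_BITS = 64  # width of the stand-in hash below


class HyperLogLogSketch:
    """Raw HyperLogLog estimator (no small- or large-range corrections)."""

    def __init__(self, p: int) -> None:
        self.p = p
        self.m = 1 << p                # m = 2^p registers
        self.registers = [0] * self.m  # M[i] = maximum rank seen in substream i

    @staticmethod
    def _hash(element: str) -> int:
        # Assumption: BLAKE2b as a stand-in for a uniform 64-bit hash.
        digest = hashlib.blake2b(element.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, element: str) -> None:
        x = self._hash(element)
        rest_bits = HASH_BITS - self.p
        index = x >> rest_bits                    # first p bits pick the substream
        rest = x & ((1 << rest_bits) - 1)         # remaining bits
        rank = rest_bits - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def raw_estimate(self) -> float:
        const_m = 0.7213 / (1 + 1.079 / self.m)   # valid for m >= 128
        harmonic_sum = sum(2.0 ** -r for r in self.registers)
        return const_m * self.m ** 2 / harmonic_sum


hll = HyperLogLogSketch(p=12)
for i in range(100_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates never lower a register
estimate = hll.raw_estimate()
print(round(estimate))
```

With p = 12 the sketch keeps 4096 small registers regardless of how many elements are added, and the estimate for 100,000 distinct elements should land within a few percent of the true value.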
  • 14. HYPERLOGLOG: EXAMPLE • Consider an L = 8 bit hash function h • Let’s use p = 3 bits to define a bucket (then $m = 2^3 = 8$ buckets); the registers M are initially all 0:
      index: 0 1 2 3 4 5 6 7
      M:     0 0 0 0 0 0 0 0
    • Index elements “berlin” and “ferret”: h(“berlin”) = 0110111, h(“ferret”) = 1100011 • Define buckets and calculate values to store (use the first p = 3 bits for buckets and the remaining L - p = 5 bits for ranks): • bucket(“berlin”) = 011 = 3, value(“berlin”) = rank(0111) = 2 • bucket(“ferret”) = 110 = 6, value(“ferret”) = rank(0011) = 3 • Registers after indexing both elements:
      index: 0 1 2 3 4 5 6 7
      M:     0 0 0 2 0 0 3 0
  • 15. HYPERLOGLOG: EXAMPLE • Index element “kharkov”: h(“kharkov”) = 1100001, bucket(“kharkov”) = 110 = 6, value(“kharkov”) = rank(0001) = 4, so M[6] = max(M[6], 4) = max(3, 4) = 4
      index: 0 1 2 3 4 5 6 7
      M:     0 0 0 2 0 0 4 0
    • Estimate the cardinality by the HLL formula (C ≈ 0.66), summing here over the two non-empty registers only: $DV_{HLL} \approx 0.66 \cdot 8^2 / (2^{-2} + 2^{-4}) = 0.66 \cdot 204.8 \approx 135 \neq 3$ • NOTE: For small cardinalities HLL has a strong bias!
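The example can be replayed in code. The hash values are used exactly as printed on the slides (as bit strings), the first p = 3 characters select the bucket, and the rest is used for the rank. The helper name `rank` is illustrative. Two estimates are computed: one that sums only the non-empty registers, as the slide's arithmetic does, and one that sums over all m registers, as the formula with j = 1..m suggests when empty registers hold 0.

```python
def rank(bits: str) -> int:
    """Position of the first 1 in a bit string (leading zeros + 1)."""
    return bits.index("1") + 1 if "1" in bits else len(bits) + 1


p = 3
m = 2 ** p
# Hash values as printed on the slides.
hashes = {"berlin": "0110111", "ferret": "1100011", "kharkov": "1100001"}

registers = [0] * m
for name, h in hashes.items():
    bucket = int(h[:p], 2)  # first p bits select the register
    registers[bucket] = max(registers[bucket], rank(h[p:]))
print(registers)  # [0, 0, 0, 2, 0, 0, 4, 0]

c = 0.66  # constant used on the slide
non_empty = [r for r in registers if r > 0]
slide_estimate = c * m ** 2 / sum(2.0 ** -r for r in non_empty)
all_registers_estimate = c * m ** 2 / sum(2.0 ** -r for r in registers)
print(round(slide_estimate, 1), round(all_registers_estimate, 1))
```

Either way the estimate is well above the true cardinality of 3 (roughly 135 and roughly 6.7 respectively), which illustrates the bias at small cardinalities.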
  • 16. HYPERLOGLOG: PROPERTIES • The memory requirement does not grow linearly with L (unlike MinCount or Linear Counting): for a hash function of L bits and precision p, the required memory is $\lceil \log_2(L + 1 - p) \rceil \cdot 2^p$ bits • the original HyperLogLog uses 32-bit hash codes, which requires $5 \cdot 2^p$ bits • It is not necessary to calculate the full hash code for an element: the first p bits and the number of leading zeros of the remaining bits are enough • There is no evidence that any of the popular hash functions (MD5, SHA-1, SHA-256, Murmur3) performs significantly better than the others.
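The memory formula is easy to evaluate for concrete parameters. The function name below is hypothetical, and the choice of p = 14 is only an example.

```python
import math


def hll_memory_bits(hash_bits: int, p: int) -> int:
    """Memory in bits: ceil(log2(L + 1 - p)) * 2^p."""
    register_width = math.ceil(math.log2(hash_bits + 1 - p))
    return register_width * 2 ** p


bits_32 = hll_memory_bits(32, 14)  # 5-bit registers
bits_64 = hll_memory_bits(64, 14)  # 6-bit registers
print(bits_32, bits_32 // 8)  # 81920 10240
print(bits_64, bits_64 // 8)  # 98304 12288
```

Doubling the hash width from 32 to 64 bits adds only one bit per register, so the total grows from 10240 to 12288 bytes for p = 14.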
  • 17. HYPERLOGLOG: PROPERTIES • The standard error can be estimated as $\sigma = 1.04 / \sqrt{2^p}$, so if we use 16 bits (p = 16) for bucket indices, the standard error is 0.40625% • The algorithm has a large error for small cardinalities. • For instance, for n = 0 the algorithm always returns roughly 0.7m • To achieve better estimates for small cardinalities, use Linear Counting below a threshold of 5m/2
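Both properties can be sketched in code. The function names are hypothetical, the constant is passed in by the caller, and the fallback applies Linear Counting to the registers (treating zero-valued registers as empty bits), which is an assumption about how the two estimators are combined.

```python
import math


def standard_error(p: int) -> float:
    """Relative standard error 1.04 / sqrt(2^p)."""
    return 1.04 / math.sqrt(2 ** p)


def raw_estimate(registers: list[int], const_m: float) -> float:
    m = len(registers)
    return const_m * m ** 2 / sum(2.0 ** -r for r in registers)


def corrected_estimate(registers: list[int], const_m: float) -> float:
    """Fall back to Linear Counting when the raw estimate is below 5m/2."""
    m = len(registers)
    estimate = raw_estimate(registers, const_m)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)  # same as -m * ln(V) with V = zeros / m
    return estimate


print(f"{standard_error(16):.5%}")  # 0.40625%

empty = [0] * 8
raw_for_empty = raw_estimate(empty, 0.66)            # roughly 0.7 * m
corrected_for_empty = corrected_estimate(empty, 0.66)
small = [0, 0, 0, 2, 0, 0, 4, 0]                     # 3 elements were indexed
corrected_for_small = corrected_estimate(small, 0.66)
print(raw_for_empty, corrected_for_empty, round(corrected_for_small, 2))
```

For the empty sketch the raw formula returns about 0.66 · 8 = 5.28, while the corrected estimate is 0. For the three-element register array the fallback gives about 2.3, much closer to the true value of 3 than the raw estimate.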
  • 18. HYPERLOGLOG: APPLICATIONS • PFCOUNT in Redis returns the approximated cardinality computed by the HyperLogLog data structure (http://antirez.com/news/75) • The Redis implementation uses 12 KB per key to count with a standard error of 0.81%, and there is no limit to the number of items you can count, unless you approach $2^{64}$ items
  • 19. HYPERLOGLOG: READ MORE • http://algo.inria.fr/flajolet/Publications/DuFl03-LNCS.pdf • http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf • https://stefanheule.com/papers/edbt13-hyperloglog.pdf • https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ • https://hal.archives-ouvertes.fr/file/index/docid/465313/filename/sliding_HyperLogLog.pdf • http://stackoverflow.com/questions/12327004/how-does-the-hyperloglog-algorithm-work
  • 21. HYPERLOGLOG++ • proposed by Stefan Heule et al., 2013 for Google PowerDrill • an improved version of HyperLogLog (Flajolet et al., 2007) • HyperLogLog++ is described by 2 parameters: • p: the number of hash bits that select the bucket used for averaging ($m = 2^p$ is the number of buckets/substreams) • h: a hash function that produces uniform hash values • The HyperLogLog++ algorithm is able to estimate cardinalities of ~ $7.9 \cdot 10^9$ with a typical error rate of 1.625%, using 2.56 KB of memory (Micha Gorelick and Ian Ozsvald, High Performance Python, 2014).
  • 22. HYPERLOGLOG++: IMPROVEMENTS • use a 64-bit hash function: an algorithm that only uses the hash codes of the input values is limited by the number of bits of the hash codes when it comes to accurately estimating large cardinalities. In particular, a hash function of L bits can distinguish at most $2^L$ different values, and as the cardinality n approaches $2^L$, hash collisions become more and more likely and accurate estimation becomes impossible. With 64-bit hash codes, collisions only become a problem if the cardinality approaches $2^{64} \approx 1.8 \cdot 10^{19}$ • bias correction: the original algorithm overestimates the real cardinality for small sets, but most of that error is due to bias, which HyperLogLog++ corrects • storage efficiency: uses different encoding strategies for hash values, variable-length encoding for integers, and difference encoding
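To get a feeling for why the hash width matters, one can estimate how many distinct hash values survive when n distinct elements are hashed into L bits. The sketch below assumes an ideal uniform hash and uses the standard approximation 2^L · (1 - e^(-n / 2^L)). The function name is hypothetical.

```python
import math


def expected_distinct_hashes(n: float, hash_bits: int) -> float:
    """Expected number of distinct hash values for n distinct inputs."""
    space = 2.0 ** hash_bits
    # expm1 keeps precision when n is tiny compared to the hash space.
    return -space * math.expm1(-n / space)


n = 1e9
loss_32 = 1 - expected_distinct_hashes(n, 32) / n
loss_64 = 1 - expected_distinct_hashes(n, 64) / n
print(f"32-bit hash: about {loss_32:.1%} of values lost to collisions")
print(f"64-bit hash: about {loss_64:.2e} of values lost to collisions")
```

Under these assumptions, roughly a tenth of one billion distinct elements collide in a 32-bit hash space, while the loss with 64-bit hash codes is negligible.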
  • 23. HYPERLOGLOG++ VS HYPERLOGLOG • accuracy is significantly better for a large range of cardinalities and equally good on the rest • the sparse representation allows for a more adaptive use of memory: if the cardinality n is much smaller than m, then HyperLogLog++ requires significantly less memory • For cardinalities between 12000 and 61000, the bias correction allows for a lower error and avoids a spike in the error when switching between sub-algorithms. • 64-bit hash codes allow the algorithm to estimate cardinalities well beyond 1 billion
  • 24. HYPERLOGLOG++: APPLICATIONS • the cardinality metric in Elasticsearch is based on the HyperLogLog++ algorithm for big cardinalities (adaptive counting) • Apache DataFu, a collection of libraries for working with large-scale data in Hadoop, has an implementation of the HyperLogLog++ algorithm
  • 25. HYPERLOGLOG++: READ MORE • http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf • https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/
  • 26. ▸ @gakhov ▸ linkedin.com/in/gakhov ▸ www.datacrucis.com THANK YOU