Exceeding Classical: Probabilistic Data Structures in Data Intensive Applications

EuroSciPy 2019

Bilbao, Spain
Andrii Gakhov
Senior Software Engineer at Ferret Go GmbH, Germany
Ph.D. in Mathematical Modelling, M.Sc. in Applied Mathematics
Twitter: @gakhov | Website: gakhov.com
Probabilistic Data Structures and Algorithms for Big Data Applications
ISBN: 9783748190486
https://pdsa.gakhov.com
0. Motivation
Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications
EuroSciPy 2019 | Andrii Gakhov @gakhov
Bioinformatics: Counting k-mers in DNA
Counting substrings of length k in DNA sequence data (k-mers) is
essential in bioinformatics, for instance, for metagenomic sequencing.
A large fraction of the storage is spent on k-mers that contain sequencing
errors and are observed only a single time in the data*.
For example, the team that sequenced the giant panda genome needed to
count 8.62 billion 27-mers, of which 68% were low-coverage k-mers.
Can we efficiently avoid persisting such invalid substrings?
Can we efficiently count valid substrings?
* Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinformatics 12(1), 333, 2011
1. Data-Intensive Applications in the Big Data Epoch
What is Big Data?
In 2001, Doug Laney described Big Data datasets as those that
contain greater variety, arriving in increasing volumes and with
ever-higher velocity. Today this is known as the famous 3V’s of Big Data.
Volume expresses the amount of data
Velocity describes the speed at which data is arriving
Variety refers to the number of types of data
What is Big Data?
Big Data is more than simply a
matter of size.
Big Data does not refer to data, it
refers to technology.
The datasets of Big Data are larger, more complex, and
generated more rapidly than our current resources can handle.
2. Probabilistic Data Structures and Algorithms
Probabilistic Data Structures and Algorithms (PDSA)
A family of advanced approaches that are optimized to
use sublinear memory and constant execution time.
They cannot provide exact answers and have some probability of error.
The tradeoff between the error and the resources is another feature
that distinguishes the algorithms and data structures of this family.
PDSA in Big Data Ecosystem
Membership (keep track of indexed elements): Bloom Filter, Quotient Filter, Cuckoo Filter
Counting (find the number of unique elements): Linear Counting, FM Sketch, LogLog, HyperLogLog
Frequency (estimate frequencies of elements): Count-Min Sketch, Count Sketch
Rank (approximate percentiles and quantiles): Random Sampling, t-digest, q-digest, Greenwald-Khanna
Similarity (find similar documents): MinHash, SimHash, LSH
[Diagram: the five groups arranged around the 3V's of Big Data: Volume, Velocity, Variety.]
PDSA in Apache Spark SQL (PySpark interface)
q-quantile estimation (Greenwald-Khanna)
# pyspark.sql.DataFrameStatFunctions(df).approxQuantile

df.approxQuantile("language", [0.5], 0.25)
Approximate number of distinct elements (HyperLogLog++)
# pyspark.sql.functions.approx_count_distinct
from pyspark.sql.functions import approx_count_distinct
df.agg(approx_count_distinct(df.language).alias('lang')).collect()
Spark SQL is Apache Spark's module for working with structured data.
PDSA in Production
3. Frequency
Frequency: Challenge
A hashtag is used to index a topic on Twitter and allows people to easily follow
items they are interested in. Hashtags are usually written with a # symbol in front.
Find the most trending hashtags on Twitter
about 6000 tweets are created on Twitter every second,
that is roughly 500 million items daily
most tweets are linked with one or more hashtags
https://www.internetlivestats.com/twitter-statistics/
Frequency: Traditional Approach
Build a table that lists all elements seen so far
with their corresponding counters
Increment the counter when a known element
arrives, or add a new element to the table and
initialize its counter
Return the value of the counter that
corresponds to the element as its frequency
requires linear memory
requires O(n) time lookup (worst case)
huge overhead for heavy hitters search
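The table-based approach described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `exact_frequencies` and the sample hashtags are made up for the example. It keeps one counter per distinct element, which is exactly why memory grows linearly with the number of unique elements.

```python
from collections import Counter


def exact_frequencies(stream):
    """Exact frequency table: one counter per distinct element (linear memory)."""
    counts = Counter()
    for element in stream:
        # Increment the counter of a known element, or start a new one at 1.
        counts[element] += 1
    return counts


# Hypothetical sample data, for illustration only.
hashtags = ["#python", "#scipy", "#python", "#bigdata", "#python"]
counts = exact_frequencies(hashtags)

# Frequency lookup for a single element.
python_frequency = counts["#python"]

# Heavy hitters search has to scan (or sort) the whole table.
top = counts.most_common(1)
print(python_frequency, top)
```

The table stores every distinct hashtag, so for an unbounded stream it never stops growing; this is the cost that the sketches in the following slides avoid.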
Frequency: Challenges for Big Data data streams
Continuous data streams
potentially unbounded number of unique elements
➡ sublinear (polylogarithmic at most) space
not feasible to re-process data streams
➡ one-pass algorithms preferred
high frequency throughput
➡ fast updates
Count-Min Sketch
a simple space-efficient probabilistic data structure that is used to estimate
frequencies of elements in data streams and can address the Heavy hitters problem
presented by Graham Cormode and Shan Muthukrishnan in 2003
Frequency: Estimation with a single counter
[Figure: a single counter, an array of m cells initialized to zero. Each incoming element is hashed with h( ) to one cell, and that cell is incremented by 1; an element seen twice leaves its cell at 2.]
[Figure: the CMSketch consists of k counters (counter 1, counter 2, ..., counter k), each an array of m cells.]
Frequency: Estimation with Count-Min Sketch
[Figure: to add an element, hash it with h1( ), h2( ), ..., hk( ) and add +1 to the matching cell in each of the k counters of the CMSketch.]
Frequency: Estimation with Count-Min Sketch
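[Figure: to estimate the frequency of an element, hash it with h1( ), h2( ), ..., hk( ), read the matching cell in each of the k counters, and take the minimum.]
f( ) = min (1, 3, ..., 5) = 1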
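The update and query steps shown on these slides can be sketched as a small class. This is a simplified illustration under stated assumptions, not the pdsa implementation: the class name `TinyCountMinSketch`, its parameter names, and the choice of salted SHA-256 to simulate the k hash functions are all made up for the example.

```python
import hashlib


class TinyCountMinSketch:
    """Illustrative Count-Min Sketch: k counter arrays of m cells each."""

    def __init__(self, num_of_counters, length_of_counter):
        self.k = num_of_counters
        self.m = length_of_counter
        self.rows = [[0] * self.m for _ in range(self.k)]

    def _index(self, row, item):
        # Assumption: simulate k hash functions by salting one hash with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Update: +1 in the matching cell of every counter array.
        for row in range(self.k):
            self.rows[row][self._index(row, item)] += 1

    def frequency(self, item):
        # Query: the minimum over the k matching cells.
        # Collisions can only inflate a cell, so the estimate never undercounts.
        return min(self.rows[row][self._index(row, item)] for row in range(self.k))


cms = TinyCountMinSketch(num_of_counters=5, length_of_counter=2000)
for tag in ["python", "java", "python", "ruby", "python"]:
    cms.add(tag)
estimate = cms.frequency("python")
print("Estimated frequency of python:", estimate)
```

Taking the minimum is what limits the damage of hash collisions: an element is overestimated only if it collides with other elements in every one of the k arrays.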
Frequency: Invoking Count-Min Sketch from Python


import json
from pdsa.frequency.count_min_sketch import CountMinSketch

cms = CountMinSketch(5, 2000)
with open('tweets.txt') as f:
    for line in f:
        # each line is a JSON record with a 'hashtag' field
        hashtag = json.loads(line)['hashtag']
        cms.add(hashtag)
print('Frequency of #Python', cms.frequency("Python"))
size_in_bytes = cms.sizeof()
print('Size in bytes', size_in_bytes)  # ~40 KB with 32-bit counters

4. Counting
Counting: Challenge
Count the number of unique visitors
Amazon and eBay had about 3.375 billion* visitors in June 2019
Assume 337 million unique IP addresses (128 bits per IPv6 record)
5.4 GB of memory just to store them all
*SimilarWeb.Com Data for June, 2019
What if we could count them with only 12 KB?
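The 5.4 GB figure follows directly from the numbers on the slide. The short calculation below is only a sanity check; the variable names are made up for the example, and GB is taken as 10^9 bytes.

```python
unique_ips = 337_000_000          # assumed number of unique visitors (from the slide)
bytes_per_ip = 128 // 8           # one IPv6 address is 128 bits = 16 bytes

total_bytes = unique_ips * bytes_per_ip
total_gb = total_bytes / 10**9    # decimal gigabytes

sketch_bytes = 12 * 1024          # the 12 KB budget mentioned on the slide
ratio = total_bytes / sketch_bytes

print(f"Exact storage: {total_gb:.1f} GB")
print(f"That is about {ratio:,.0f} times the 12 KB budget")
```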
Counting: Traditional Approach
Build a list of all unique elements
Sort / search to avoid listing elements twice
Count elements in the list
requires linear memory
requires O(n·log n) time
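A sort-based version of this exact approach might look like the sketch below. The function name `count_unique_sorted` is hypothetical; the point is only to show where the linear memory and the O(n·log n) time come from.

```python
def count_unique_sorted(elements):
    """Exact distinct count: sort, then count the positions where the value changes."""
    ordered = sorted(elements)    # O(n log n) time, O(n) extra memory
    unique = 0
    previous = object()           # sentinel that equals no real element
    for element in ordered:
        if element != previous:
            unique += 1
            previous = element
    return unique


visitors = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
print(count_unique_sorted(visitors))
```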
Counting: Approximate Counting
@katyperry has
107,287,629 followers
Would you really care if she has 107.2, 108.0, or 106.7 million followers?
HyperLogLog
a hash-based probabilistic algorithm for counting the number of distinct
values in the presence of duplicates
proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007
Counting: Estimation with a single counter (Flajolet, Martin)
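[Figure: each element is hashed with h( ) to a bit string, read in LSB-0 order; its rank is the index of the first 1-bit.]
h( ) = 0 0 0 0 1 1 … 0 0 1 0 1 0 0 0 0, rank ( ) = 4
h( ) = 1 1 0 0 0 1 … 0 0 1 0 1 0 0 0 0, rank ( ) = 0
[Figure: the FM Sketch is a bit array with cells 0 1 2 3 4 … m-1 m; the bit at each observed rank is set to 1, here giving 1 0 0 0 1 … 0 0, so R = 1.]
n ≈ 2^R / 0.77351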
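The single-counter (Flajolet-Martin) estimate above can be sketched as follows. This is an illustration under assumptions, not the pdsa code: the names `rank` and `TinyFMSketch` are made up, a truncated SHA-256 stands in for the hash function h, and R is taken to be the index of the lowest 0-bit in the bit array, as in the standard description of the algorithm.

```python
import hashlib


def rank(x, width=32):
    """Index of the least significant 1-bit (LSB-0); `width` if x has no 1-bit."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


class TinyFMSketch:
    """Illustrative Flajolet-Martin sketch with a single bit array."""

    PHI = 0.77351  # correction constant from the slide

    def __init__(self, m=32):
        self.m = m
        self.bitmap = [0] * m

    def add(self, item):
        # Assumption: the first 4 bytes of SHA-256 serve as a 32-bit hash value.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        r = rank(int.from_bytes(digest[:4], "big"), self.m)
        if r < self.m:
            self.bitmap[r] = 1

    def count(self):
        # R is the index of the lowest 0-bit in the bit array.
        R = self.bitmap.index(0) if 0 in self.bitmap else self.m
        return 2 ** R / self.PHI


# Reproduce the slide: ranks 4 and 0 were observed, so bits 0 and 4 are set.
slide_sketch = TinyFMSketch()
slide_sketch.bitmap[4] = 1
slide_sketch.bitmap[0] = 1
slide_estimate = slide_sketch.count()   # R = 1, so the estimate is 2 / 0.77351
print(slide_estimate)
```

A single bit array gives a very coarse estimate, which is the motivation for combining many counters in HyperLogLog.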
Counting: Estimation with HyperLogLog
[Figure: each element is hashed with h1( ), h2( ), ..., hk( ), and a rank is computed from each hash value (binary, LSB-0).]
h1( ) = 1 1 0 0 0 1 … 0 0 1 0 1 0 0 1 0, rank1 ( ) = 0
h2( ) = 0 0 0 0 0 1 … 0 0 1 1 1 0 0 0 0, rank2 ( ) = 5
hk( ) = 0 0 1 1 0 1 … 0 0 1 0 1 1 0 0 1, rankk ( ) = 2
[Figure: the HLL Sketch is an array of k counters, here 0 5 … 2; counter i is overwritten with ranki ( ) iff the rank is bigger than the existing value.]
n ≈ α ⋅ k ⋅ 2^AVG(HLLi)
Counting: HyperLogLog Algorithm
Based on a single 32-bit hash function
Simulates k hash functions using the stochastic averaging approach
hash(x) = 32-bit hash value = p addressing bits followed by (32 - p) rank computation bits
Stores only k = 2^p counters (registers), about 4 bytes each
The memory is always fixed, regardless of the number of unique elements
More counters provide less error (memory/accuracy trade-off)
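The splitting of one 32-bit hash into addressing bits and rank bits can be sketched as below. This is a simplified illustration, not the pdsa implementation. The class name `TinyHyperLogLog` and the truncated SHA-1 hash are assumptions made for the example. The rank here is the 1-based position of the leftmost 1-bit and the estimate uses the harmonic mean with the small-range correction, following the original HyperLogLog paper rather than the simplified averaging formula on the slide; the large-range correction is omitted.

```python
import hashlib
import math


class TinyHyperLogLog:
    """Illustrative HyperLogLog with k = 2**p registers and a 32-bit hash."""

    def __init__(self, precision=10):
        self.p = precision
        self.k = 1 << precision
        self.registers = [0] * self.k

    @staticmethod
    def _hash32(item):
        # Assumption: the first 4 bytes of SHA-1 serve as the 32-bit hash value.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")

    def add(self, item):
        x = self._hash32(item)
        index = x >> (32 - self.p)                    # first p bits: register address
        rest = x & ((1 << (32 - self.p)) - 1)         # remaining (32 - p) bits
        rank = (32 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        if rank > self.registers[index]:              # keep only the maximum rank
            self.registers[index] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.k)         # bias constant, valid for k >= 128
        raw = alpha * self.k ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.k and zeros:
            # Small-range correction: fall back to linear counting.
            return self.k * math.log(self.k / zeros)
        return raw


hll = TinyHyperLogLog(precision=10)
for i in range(10_000):
    hll.add(f"visitor-{i}")
first_estimate = hll.count()

# Re-adding elements that were already seen must not change the sketch.
for i in range(5_000):
    hll.add(f"visitor-{i}")
second_estimate = hll.count()
print(round(first_estimate), round(second_estimate))
```

The sketch holds 1024 small integers no matter how many elements are added, which is the fixed-memory property stated on the slide.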
Counting: Invoking HyperLogLog from Python


import json
from pdsa.cardinality.hyperloglog import HyperLogLog

hll = HyperLogLog(precision=10)  # 2^10 = 1024 counters
with open('visitors.txt') as f:
    for line in f:
        # each line is a JSON record with an 'ip' field
        ip = json.loads(line)['ip']
        hll.add(ip)
num_of_unique_visitors = hll.count()
print('Unique visitors', num_of_unique_visitors)
size_in_bytes = hll.sizeof()
print('Size in bytes', size_in_bytes)  # ~4 KB
Counting: Distinct Count in Redis
Redis uses the HyperLogLog data structure to count unique elements in a set
requires a small, constant amount of memory (12 KB) for every data structure
approximates the exact cardinality with a standard error of 0.81%
redis> PFADD hll python java ruby
(integer) 1
redis> PFADD hll python python python
(integer) 0
redis> PFADD hll java ruby
(integer) 0
redis> PFCOUNT hll
(integer) 3
http://antirez.com/news/75
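Both Redis figures can be reproduced with a short calculation. It assumes the layout Redis documents for its HyperLogLog, 16384 registers of 6 bits each, which is not stated on the slide, and the standard error formula 1.04/√m from the HyperLogLog paper. The variable names are made up for the example.

```python
import math

registers = 16384          # assumption: Redis uses 2**14 registers
bits_per_register = 6      # assumption: 6 bits per register

memory_bytes = registers * bits_per_register // 8
standard_error = 1.04 / math.sqrt(registers)

print(f"Memory: {memory_bytes} bytes = {memory_bytes / 1024:.0f} KB")
print(f"Standard error: {standard_error * 100:.2f}%")
```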
5. Final Notes
Final Notes
Think about Big Data as a technology challenge
Instead of buying new servers, learn new algorithms
Believe in hashing! Sample vs Hashing.
Probabilistic Data Structures and Algorithms become useful when your problem fits
Read More
[book] Probabilistic Data Structures and Algorithms for Big Data Applications 

https://pdsa.gakhov.com
[repo] Probabilistic Data Structures and Algorithms in Python

https://github.com/gakhov/pdsa
Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure 

https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
Redis new data structure: the HyperLogLog

http://antirez.com/news/75
Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles 

https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
Big Data with Sketchy Structures 

https://towardsdatascience.com/b73fb3a33e2a
Count-Min Sketch 

http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf
Eskerrik asko! (Thank you!)
6. Additional Slides
(for those who want more)
Counting: Interactive Presentation of HyperLogLog
Counting: Accuracy vs Memory Tradeoff in HyperLogLog
More counters require more memory (4 bytes per counter)
More counters need more bits for addressing them (m = 2^p)
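The tradeoff can be tabulated for a few precision values. The sketch below uses the 4 bytes per counter from the slide and, as an assumption not stated on it, the standard error 1.04/√m from the HyperLogLog paper; the function name `hll_tradeoff` is made up for the example.

```python
import math


def hll_tradeoff(p, bytes_per_counter=4):
    """Return (number of counters, memory in bytes, standard error) for precision p."""
    m = 2 ** p
    return m, m * bytes_per_counter, 1.04 / math.sqrt(m)


print(f"{'p':>2} {'counters':>9} {'memory, bytes':>14} {'std error':>10}")
for p in (4, 8, 10, 12, 14, 16):
    m, memory, error = hll_tradeoff(p)
    print(f"{p:>2} {m:>9} {memory:>14} {error:>10.2%}")
```

Each extra addressing bit doubles the memory but reduces the error only by a factor of √2, so accuracy becomes progressively more expensive.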
Counting: HyperLogLog++ Algorithm
HyperLogLog++ is an improved version of HyperLogLog
developed at Google and proposed in 2013
uses a 64-bit hash function, which allows counting more values
better bias correction using pre-trained data
proposes a sparse representation of the counters
(registers) to reduce memory requirements
DSD-INT 2023 SFINCS Modelling in the U.S. Pacific Northwest - Parker
Deltares8 views
What Can Employee Monitoring Software Do?​ by wAnywhere
What Can Employee Monitoring Software Do?​What Can Employee Monitoring Software Do?​
What Can Employee Monitoring Software Do?​
wAnywhere18 views
DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko... by Deltares
DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...
DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...
Deltares10 views
Citi TechTalk Session 2: Kafka Deep Dive by confluent
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
confluent17 views

Exceeding Classical: Probabilistic Data Structures in Data Intensive Applications

  • 1. Andrii Gakhov, PhD Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications EuroSciPy 2019
 Bilbao, Spain
  • 2. Andrii Gakhov Senior Software Engineer
 at Ferret Go GmbH, Germany Ph.D. in Mathematical Modelling, 
 M.Sc. in Applied Mathematics Twitter: @gakhov | Website: gakhov.com Probabilistic Data Structures and Algorithms
 for Big Data Applications ISBN: 9783748190486
 https://pdsa.gakhov.com
  • 3. 0. Motivation
  • 4. Bioinformatics: Counting k-mers in DNA. Counting substrings of length k in DNA sequence data (k-mers) is essential in bioinformatics, for instance, for metagenomic sequencing. A large fraction of the storage is spent on k-mers that contain sequencing errors and are observed only a single time in the data*. For example, the team that sequenced the giant panda genome needed to count 8.62 billion 27-mers, of which 68% were low-coverage k-mers. Can we efficiently avoid persisting such invalid substrings? Can we efficiently count valid substrings? * Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinformatics 12(1), 333, 2011
  • 5. 1. Data-Intensive Applications in the Big Data epoch
  • 6. What is Big Data? In 2001, Doug Laney described Big Data as datasets that contain greater variety, arriving in increasing volumes and with ever-higher velocity. Today this is known as the famous 3V’s of Big Data: Volume expresses the amount of data, Velocity describes the speed at which data is arriving, and Variety refers to the number of types of data.
  • 7. What is Big Data? Big Data is more than simply a matter of size. Big Data does not refer to data, it refers to technology. The datasets of Big Data are larger, more complex, and generated more rapidly than our current resources can handle. Image: https://www.freepngimg.com/electronics/technology
  • 8. 2. Probabilistic Data Structures and Algorithms
  • 9. Probabilistic Data Structures and Algorithms (PDSA). A family of advanced approaches that are optimized to use sublinear memory and constant execution time. They cannot provide exact answers and have some probability of error. The tradeoff between the error and the resources is another feature that distinguishes the algorithms and data structures of this family.
  • 10. PDSA in Big Data Ecosystem
 Membership (keep track of indexed elements): Bloom Filter, Quotient Filter, Cuckoo Filter
 Counting (find the number of unique elements): Linear Counting, FM Sketch, LogLog, HyperLogLog
 Frequency (estimate frequencies of elements): Count-Min Sketch, Count Sketch
 Rank (approximate percentiles and quantiles): Random Sampling, t-digest, q-digest, Greenwald-Khanna
 Similarity (find similar documents): MinHash, SimHash, LSH
  • 11. PDSA in Apache Spark SQL (PySpark interface). Spark SQL is Apache Spark's module for working with structured data.
 q-quantile estimation (Greenwald-Khanna):
 # pyspark.sql.DataFrameStatFunctions(df).approxQuantile
 df.approxQuantile("language", [0.5], 0.25)
 Approximate number of distinct elements (HyperLogLog++):
 # pyspark.sql.functions.approx_count_distinct
 df.agg(approx_count_distinct(df.language).alias('lang')).collect()
  • 13. 3. Frequency
  • 14. Frequency: Challenge. Find the most trending hashtags on Twitter. A hashtag is used to index a topic on Twitter and allows people to easily follow items they are interested in. Hashtags are usually written with a # symbol in front. Every second about 6,000 tweets are created on Twitter, which is roughly 500 million items daily, and most tweets are linked with one or more hashtags. https://www.internetlivestats.com/twitter-statistics/
  • 15. Frequency: Traditional Approach. Build a table that lists all elements seen thus far with corresponding counters. When an element arrives, increment its counter or, if it has not been seen before, add it to the table and initialize its counter. Return the value of the counter that corresponds to the element as its frequency. Drawbacks: requires linear memory; requires O(n) lookup time (worst case); huge overhead for heavy hitters search.
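The traditional approach described on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `exact_frequencies` and the sample hashtags are made up for the example, and the table is a standard `collections.Counter`, so memory grows with the number of distinct elements.

```python
from collections import Counter


def exact_frequencies(stream):
    """Exact counting: one table entry per distinct element (linear memory)."""
    counts = Counter()
    for element in stream:
        # A missing key starts at 0, so this both initializes and increments.
        counts[element] += 1
    return counts


# Hypothetical sample data, not taken from the talk.
hashtags = ["python", "java", "python", "rust", "python", "java"]
counts = exact_frequencies(hashtags)
top_two = counts.most_common(2)  # heavy hitters require scanning the whole table
```

Finding the most frequent elements means ranking every stored counter, which is the overhead the slide refers to for heavy hitters search.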
  • 16. Frequency: Challenges for Big Data streams. Continuous data streams:
 potentially unbounded number of unique elements
 ➡ sublinear (polylogarithmic at most) space
 not feasible to re-process data streams
 ➡ one-pass algorithms preferred
 high throughput
 ➡ fast updates
 Image: https://www.pngfind.com
  • 17. Count-Min Sketch: a simple, space-efficient probabilistic data structure that is used to estimate frequencies of elements in data streams and can address the heavy hitters problem. Presented by Graham Cormode and Shan Muthukrishnan in 2003.
  • 18. Frequency: Estimation with a single counter. [Figure: an array of m counters, all initially 0. Each incoming element is hashed with h( ) to an index in the array and the counter at that index is incremented (+1); after three updates one counter holds 2 and another holds 1.]
  • 19. Frequency: Estimation with Count-Min Sketch. [Figure: a CMSketch made of k counter arrays (counter 1, counter 2, ..., counter k) of length m, each with its own hash function h1( ), h2( ), ..., hk( ). Adding an element increments (+1) one counter in every array.]
  • 20. Frequency: Estimation with Count-Min Sketch. [Figure: to estimate the frequency of an element, read the counter that each hash function h1( ), h2( ), ..., hk( ) points to and take the minimum: f( ) = min(1, 3, ..., 5) = 1.]
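The update and estimation steps of Count-Min Sketch can be sketched as follows. This is a minimal illustration, not the pdsa library's implementation: the class name `TinyCountMinSketch` is hypothetical, and deriving the k hash functions from salted BLAKE2b is an assumption made only to keep the example self-contained.

```python
import hashlib


class TinyCountMinSketch:
    """Illustrative Count-Min Sketch: k counter arrays of length m."""

    def __init__(self, num_of_counters, length_of_counter):
        self.k = num_of_counters
        self.m = length_of_counter
        self.rows = [[0] * self.m for _ in range(self.k)]

    def _index(self, row, element):
        # Assumption: one hash function per row, obtained by salting BLAKE2b
        # with the row number.
        digest = hashlib.blake2b(
            element.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.m

    def add(self, element):
        # Increment one counter in every array.
        for row in range(self.k):
            self.rows[row][self._index(row, element)] += 1

    def frequency(self, element):
        # Collisions can only inflate a counter, so the minimum is the
        # tightest available upper bound on the true frequency.
        return min(
            self.rows[row][self._index(row, element)] for row in range(self.k)
        )


cms = TinyCountMinSketch(5, 2000)
for tag in ["python", "python", "java", "python"]:
    cms.add(tag)
estimate = cms.frequency("python")
```

The estimate never falls below the true frequency; it can exceed it only when another element collides with the queried one in every array.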
  • 21. Frequency: Invoking Count-Min Sketch from Python
 import json
 from pdsa.frequency.count_min_sketch import CountMinSketch
 # 5 counter arrays with 2000 counters each
 cms = CountMinSketch(5, 2000)
 with open('tweets.txt') as f:
     for line in f:
         # each line is a JSON record with a 'hashtag' field
         hashtag = json.loads(line)['hashtag']
         cms.add(hashtag)
 print('Frequency of #Python', cms.frequency("Python"))
 size_in_bytes = cms.sizeof()
 print('Size in bytes', size_in_bytes)  # ~40 KB with 32-bit counters

  • 22. 4. Counting
  • 23. Counting: Challenge. Count the number of unique visitors. Amazon and eBay had about 3.375 billion* visitors in June 2019. Assume 337 million unique IP addresses (128 bits per IPv6 record): that is 5.4 GB of memory just to store them all. What if we could count them with only 12 KB? *SimilarWeb.com data for June 2019. Image: https://www.cleanpng.com
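The memory figure on this slide follows from simple arithmetic, reproduced here as a short check. The variable names are illustrative, and treating 12 KB as 12 × 1024 bytes is an assumption about the unit.

```python
unique_ips = 337_000_000   # assumed number of unique visitors (from the slide)
bytes_per_ip = 128 // 8    # one IPv6 address is 128 bits = 16 bytes

exact_bytes = unique_ips * bytes_per_ip
exact_gb = exact_bytes / 10**9          # decimal gigabytes

sketch_bytes = 12 * 1024                # assumption: 12 KB = 12 * 1024 bytes
ratio = exact_bytes / sketch_bytes      # how many times smaller the sketch is

print(f"exact storage: {exact_gb:.1f} GB, sketch is ~{ratio:,.0f}x smaller")
```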
  • 24. Counting: Traditional Approach. Build a list of all unique elements. Sort / search to avoid listing elements twice. Count the elements in the list. Drawbacks: requires linear memory; requires O(n·log n) time.
  • 25. Counting: Approximate Counting. @katyperry has 107,287,629 followers. Would you really care if she has 107.2, 108.0, or 106.7 million followers?
  • 26. HyperLogLog: a hash-based probabilistic algorithm for counting the number of distinct values in the presence of duplicates. Proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007.
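  • 27. Counting: Estimation with a single counter (Flajolet, Martin). [Figure: each element is hashed with h( ), and the rank of the hash value is the index of its least-significant 1-bit in the binary (LSB-0) representation, e.g. rank = 4 for 0 0 0 0 1 1 … and rank = 0 for 1 1 0 0 0 1 …. The bit at index rank is set in the FM Sketch, a bit array of length m, here 1 0 0 0 1 …. R is the index of the lowest unset bit (R = 1 here), and the estimate is n ≈ 2^R / 0.77351.]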
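The single-counter Flajolet-Martin scheme on this slide can be sketched as below. It is a minimal illustration under stated assumptions: the helper names are hypothetical, the 32-bit hash is taken from the first four bytes of SHA-256, and the bit array is stored in a Python integer.

```python
import hashlib

PHI = 0.77351  # correction factor from the slide


def hash32(element):
    # Assumption: any well-mixed 32-bit hash works; SHA-256 is used for brevity.
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def rank(value, width=32):
    """Index of the least-significant 1-bit (LSB-0); `width` if value is 0."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1


def lowest_zero_bit(sketch):
    """R: index of the lowest unset bit in the sketch."""
    r = 0
    while sketch & (1 << r):
        r += 1
    return r


def fm_estimate(stream, width=32):
    sketch = 0  # bit array packed into an integer
    for element in stream:
        r = rank(hash32(element), width)
        if r < width:
            sketch |= 1 << r  # duplicates set the same bit again: no effect
    return 2 ** lowest_zero_bit(sketch) / PHI


estimate = fm_estimate(["a", "b", "c", "a", "b"])
```

A single sketch only produces estimates of the form 2^R / 0.77351, so its variance is high; that is the motivation for combining many counters in the slides that follow.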
  • 28. Counting: Estimation with HyperLogLog. [Figure: k hash functions h1( ), h2( ), ..., hk( ). For each one, the rank of the hash value is computed from its binary (LSB-0) representation (e.g. rank1 = 0, rank2 = 5) and written to the corresponding register of the HLL Sketch iff it is bigger than the existing value. Estimate: n ≈ α ⋅ k ⋅ 2^AVG(HLL_i).]
  • 29. Counting: HyperLogLog Algorithm. Based on a single 32-bit hash function. Simulates k hash functions using the stochastic averaging approach: the 32-bit hash value hash(x) is split into p addressing bits and (32 - p) rank computation bits. Stores only k = 2^p counters (registers), about 4 bytes each. The memory is always fixed, regardless of the number of unique elements. More counters provide less error (memory/accuracy trade-off).
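The algorithm outlined on this slide can be sketched as follows. This is an illustrative version, not the pdsa library's code: the class name `TinyHyperLogLog` is hypothetical, the 32-bit hash comes from SHA-256, the rank is taken from the leading bits of the remainder, and the estimate uses the harmonic mean and small-range correction from the original HyperLogLog paper rather than the simplified average shown in the talk.

```python
import hashlib
import math


class TinyHyperLogLog:
    """Illustrative HyperLogLog with k = 2^p registers and a 32-bit hash."""

    def __init__(self, precision=10):
        # The alpha formula below is the paper's value for 128 or more registers.
        assert 7 <= precision <= 16
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    @staticmethod
    def _hash32(element):
        digest = hashlib.sha256(element.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")

    def add(self, element):
        h = self._hash32(element)
        index = h >> (32 - self.p)              # p addressing bits
        rest = h & ((1 << (32 - self.p)) - 1)   # (32 - p) rank computation bits
        # 1-based position of the leftmost 1-bit; (32 - p) + 1 if rest is 0.
        rank = (32 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:        # keep only the maximum rank
            self.registers[index] = rank

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = TinyHyperLogLog(precision=10)
for i in range(10_000):
    hll.add(f"10.0.{i // 256}.{i % 256}")  # hypothetical visitor addresses
estimate = hll.count()
```

Memory stays at 1024 registers no matter how many elements are added; only the register values change.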
  • 30. Counting: Invoking HyperLogLog from Python
 import json
 from pdsa.cardinality.hyperloglog import HyperLogLog
 hll = HyperLogLog(precision=10)  # 2^10 = 1024 counters
 with open('visitors.txt') as f:
     for line in f:
         # each line is a JSON record with an 'ip' field
         ip = json.loads(line)['ip']
         hll.add(ip)
 num_of_unique_visitors = hll.count()
 print('Unique visitors', num_of_unique_visitors)
 size_in_bytes = hll.sizeof()
 print('Size in bytes', size_in_bytes)  # ~4 KB
  • 31. Counting: Distinct Count in Redis. Redis uses the HyperLogLog data structure to count unique elements in a set. It requires a small constant amount of memory (12 KB) for every data structure and approximates the exact cardinality with a standard error of 0.81%. http://antirez.com/news/75
 redis> PFADD hll python java ruby
 (integer) 1
 redis> PFADD hll python python python
 (integer) 0
 redis> PFADD hll java ruby
 (integer) 0
 redis> PFCOUNT hll
 (integer) 3
  • 32. 5. Final Notes
  • 33. Final Notes. Think about Big Data as a technology challenge. Instead of buying new servers, learn new algorithms. Believe in hashing! Sampling vs hashing. Probabilistic Data Structures and Algorithms become useful when your problem fits. Image: https://longfordpc.com/
  • 34. Read More
 [book] Probabilistic Data Structures and Algorithms for Big Data Applications
 https://pdsa.gakhov.com
 [repo] Probabilistic Data Structures and Algorithms in Python
 https://github.com/gakhov/pdsa
 Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure
 https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
 Redis new data structure: the HyperLogLog
 http://antirez.com/news/75
 Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles
 https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
 Big Data with Sketchy Structures
 https://towardsdatascience.com/b73fb3a33e2a
 Count-Min Sketch
 http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf
  • 35. Eskerrik asko! (Thank you!) Website: www.gakhov.com Twitter: @gakhov Book: Probabilistic Data Structures and Algorithms for Big Data Applications, pdsa.gakhov.com
  • 36. 6. Additional Slides (for that person who wants more)
  • 38. Counting: Accuracy vs Memory Tradeoff in HyperLogLog. More counters require more memory (4 bytes per counter). More counters need more bits for addressing them (m = 2^p).
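The tradeoff on this slide can be made concrete with the standard error of about 1.04 / √m reported in the original HyperLogLog paper. The sketch below tabulates it next to the memory cost of 4 bytes per counter given on the slide; the function names are illustrative.

```python
import math


def hll_standard_error(p):
    """Approximate relative standard error for m = 2^p counters."""
    return 1.04 / math.sqrt(2 ** p)


def hll_memory_bytes(p, bytes_per_counter=4):
    """Memory for m = 2^p counters at the slide's 4 bytes per counter."""
    return (2 ** p) * bytes_per_counter


for p in (8, 10, 12, 14):
    error = hll_standard_error(p) * 100
    print(f"p={p:2d}  m={2 ** p:6d}  memory={hll_memory_bytes(p):6d} B  error~{error:.2f}%")
```

With p = 14 (16384 counters) the formula gives about 0.81%, matching the standard error quoted for Redis. Redis reaches 12 KB by packing each register into 6 bits instead of 4 bytes.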
  • 39. Counting: HyperLogLog++ Algorithm. HyperLogLog++ is an improved version of HyperLogLog developed at Google and proposed in 2013. It uses a 64-bit hash function, which allows it to count more values; applies better bias correction using pre-trained data; and proposes a sparse representation of the counters (registers) to reduce memory requirements.