We interact with an ever-increasing amount of data, but classical data structures and algorithms no longer fit our requirements. This talk presents probabilistic algorithms and data structures and describes the main areas of their application.
Too Much Data? - Just Sample, Just Hash, ... by Andrii Gakhov
Code & Supply | Pittsburgh Meetup | May 31, 2019
Probabilistic Data Structures and Algorithms (PDSA) is a common name for data structures based on different hashing techniques. They have been incorporated into Spark SQL and are also used by Amazon Redshift, Google BigQuery, Redis, Elasticsearch, and many others. Consequently, PDSA is not just an interesting academic topic.
Book "Probabilistic Data Structures and Algorithms for Big Data Applications" (ISBN: 978-3748190486 ) https://pdsa.gakhov.com
Probabilistic data structures. Part 3. Frequency by Andrii Gakhov
The book "Probabilistic Data Structures and Algorithms in Big Data Applications" is now available at Amazon and from local bookstores. More details at https://pdsa.gakhov.com
In the presentation, I described popular and very simple data structures and algorithms to estimate the frequency of elements or find the most frequent values in a data stream, such as the Count-Min Sketch, the Majority Algorithm, and the Misra-Gries Algorithm. Each approach comes with the math behind it and simple examples to clarify the theory.
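As a rough illustration of the Misra-Gries idea mentioned above, the following minimal Python sketch (hypothetical stream data, not the presentation's own code) keeps at most k-1 counters and guarantees that any element occurring more than n/k times in a stream of length n survives:

def misra_gries(stream, k):
    """Keep at most k-1 counters; any element with frequency > n/k survives."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # decrement every counter and drop the ones that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, k=3))  # "a" (5 of 9 occurrences) is guaranteed to survive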
Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes and poor applicability to real-time use cases. On the other hand, when one is interested only in simple additive metrics like total page views or the average price of a conversion, raw data can obviously be summarized efficiently, for example on a daily basis or using simple in-stream counters. Computation of more advanced metrics like the number of unique visitors or the most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and trade estimation precision for memory consumption.
Probabilistic data structures. Part 2. Cardinality by Andrii Gakhov
The book "Probabilistic Data Structures and Algorithms in Big Data Applications" is now available at Amazon and from local bookstores. More details at https://pdsa.gakhov.com
In the presentation, I described common data structures and algorithms to estimate the number of distinct elements in a set (cardinality), such as Linear Counting, HyperLogLog, and HyperLogLog++. Each approach comes with the math behind it and simple examples to clarify the theory.
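For intuition, the Linear Counting approach mentioned above can be sketched in a few lines of Python (a hedged toy example, not the presentation's code): hash every element into an m-bit bitmap and estimate the cardinality from the fraction of bits that remain zero.

import hashlib
from math import log

def linear_counting(items, m=1 << 16):
    """Estimate cardinality as -m * ln(V), where V is the fraction of bits still zero."""
    bitmap = bytearray(m // 8)
    for item in items:
        pos = int(hashlib.sha1(item.encode()).hexdigest(), 16) % m
        bitmap[pos // 8] |= 1 << (pos % 8)
    zero_bits = m - sum(bin(byte).count("1") for byte in bitmap)
    return int(-m * log(zero_bits / m))

print(linear_counting(f"user-{i % 5_000}" for i in range(50_000)))  # close to 5,000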
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith... by Data Con LA
"At OpenX we not only use the tools in big data ecosystems to solve our business problems, but also explore the cutting edge algorithms for practical uses. HyperLogLog is one of the algorithm that we use intensively in our internal system. It has really low computation cost and can easily plug into map-reduce framework (hadoop or spark). Some of the applications that worth to highlight are:
* high cardinality test
* distinct count of unique users over time
* Visualize hyperloglog for fraud detection"
An overview of streaming algorithms: what they are, what the general principles behind them are, and how they fit into a big data architecture. It also covers four specific examples of streaming algorithms and their use cases.
STRIP: stream learning of influence probabilities by Albert Bifet
Influence-driven diffusion of information is a fundamental process in social networks. Learning the latent variables of such a process, i.e., the influence strength along each link, is a central question towards understanding the structure and function of complex networks, modeling information cascades, and developing applications such as viral marketing.
Motivated by modern microblogging platforms, such as Twitter, in this paper we study the problem of learning influence probabilities in a data-stream scenario, in which the network topology is relatively stable and the challenge of a learning algorithm is to keep up with a continuous stream of tweets using a small amount of time and memory. Our contribution is a number of randomized approximation algorithms, categorized according to the available space (superlinear, linear, and sublinear in the number of nodes n) and according to different models (landmark and sliding window). Among several results, we show that we can learn influence probabilities with one pass over the data, using O(n log n) space, in both the landmark model and the sliding-window model, and we further show that our algorithm is within a logarithmic factor of optimal.
For truly large graphs, when one needs to operate with sublinear space, we show that we can still learn influence probabilities in one pass, assuming that we restrict our attention to the most active users.
Our thorough experimental evaluation on large social graphs demonstrates that the empirical performance of our algorithms agrees with that predicted by the theory.
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon by Christopher Conlan
This talk will provide a very brief overview of graph algorithms and their expression using sparse linear algebra, followed by a high-level description of the GraphBLAS library and its usage.
Graphs are among the most important abstract data types in computer science, and the algorithms that operate on them are critical to modern life. Algorithms on graphs are applied in many ways in today's world—from Web rankings to metabolic networks, from finite element meshes to semantic graphs. Graphs have been shown to be powerful tools for modeling these complex problems because of their simplicity and generality. GraphBLAS is an API specification that defines standard building blocks for graph algorithms in the language of linear algebra. Graph algorithms have long taken advantage of the idea that a graph can be represented as a matrix, and graph operations can be performed as linear transformations and other linear algebraic operations on sparse matrices. For example, matrix-vector multiplication can be used to perform a step in a breadth-first search. The GraphBLAS specification (and the various libraries that implement it) provides data structures and functions to compute these linear algebraic operations. In particular, the GraphBLAS specifies sparse matrix objects which map well to graphs where vertices are likely connected to relatively few neighbors (i.e. the degree of a vertex is significantly smaller than the total number of vertices in the graph). The benefits of this approach are reduced algorithmic complexity, ease of implementation, and improved performance.
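To make the linear-algebra view concrete, here is a small Python sketch (using scipy.sparse on a hypothetical five-vertex graph, not the GraphBLAS API itself) in which each breadth-first-search frontier expansion is one sparse matrix-vector product:

import numpy as np
from scipy.sparse import csr_matrix

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]  # hypothetical directed graph
rows, cols = zip(*edges)
n = 5
A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

frontier = np.zeros(n)
frontier[0] = 1.0                        # start BFS from vertex 0
visited = frontier.copy()
level = 0
while frontier.any():
    print(f"level {level}: vertices {np.flatnonzero(frontier).tolist()}")
    reached = (A.T @ frontier) > 0       # one SpMV step reaches the next frontier
    frontier = np.where(visited > 0, 0.0, reached.astype(float))  # mask visited vertices
    visited += frontier
    level += 1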
The difference between a good data scientist and a great data scientist is understanding what's happening "under the hood."
Computer engineer Janice McMahon and data scientist Chris Conlan will give a crash course in computational efficiency tailored for data scientists.
Big data serving: Processing and inference at scale in real time by Itai Yaffe
Jon Bratseth (VP Architect) @ Verizon Media:
The big data world has mature technologies for offline analysis and learning from data, but has lacked options for making data-driven decisions in real time.
When it is sufficient to consider a single data point, model servers such as TensorFlow Serving can be used, but in many cases you want to consider many data points to make decisions.
This is a difficult engineering problem combining state, distributed algorithms and low latency, but solving it often makes it possible to create far superior solutions when applying machine learning.
This talk will explain why this is a hard problem, show the advantages of solving it, and introduce the open source Vespa.ai platform which is used to implement such solutions in some of the largest scale problems in the world including the world's third largest ad serving system.
source{d} is building the open-source components to enable large-scale code analysis and machine learning on source code. Their powerful tools can ingest all of the world’s public git repositories turning code into ASTs ready for machine learning and other analyses, all exposed through a flexible and friendly API. Francesc will show you how to run machine learning on source code with a series of live demos.
Performance Analysis of Hashing Methods on the Employment of App by IJECEIAES
The administrative process, carried out continuously, produces large amounts of data, so searching it takes a long time. Searching with hashing methods can save time. Hashing is a method that accesses data in a table directly by turning a key into an address in the table. The performance analysis of the hashing methods is done on 18-digit character key values, using applications in which the methods have been implemented. The hashing methods analyzed are progressive overflow (PO) and linear quotient (LQ). The main purpose of the performance analysis is to determine how well each method performs. The results show that, for the average collision value with 15 keys, 53.3% of the analyses yield the same value, while 46.7% show that linear quotient has better performance.
In this deck, Torsten Hoefler from ETH Zurich presents: Data-Centric Parallel Programming.
"The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill-set of the average domain scientist. To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating code definition from its optimization. We show how to tune several applications in this model and IR. Furthermore, we show a global, datacentric view of a state-of-the-art quantum transport simulator to optimize its execution on supercomputers. The approach yields coarse and fine-grained data-movement characteristics, which are used for performance and communication modeling, communication avoidance, and data-layout transformations. The transformations are tuned for the Piz Daint and Summit supercomputers, where each platform requires different caching and fusion strategies to perform optimally. We show that SDFGs deliver competitive performance, allowing domain scientists to develop applications naturally and port them to approach peak hardware performance without modifying the original scientific code."
Watch the video: https://wp.me/p3RLHQ-kup
Learn more: http://htor.inf.ethz.ch
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This presentation describes an intelligent IT monitoring solution that uses Nagios as the source of information, Esper as the CEP engine, and a PCA algorithm.
ScaleGraph - A High-Performance Library for Billion-Scale Graph Analytics by Toyotaro Suzumura
Please cite the following paper:
Toyotaro Suzumura and Koji Ueno, "ScaleGraph: A high-performance library for billion-scale graph analytics," 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 76-84.
doi: 10.1109/BigData.2015.7363744
Recently, large-scale graph analytics has become a very popular topic owing to the emergence of gigantic graphs whose number of vertices and edges is in millions, billions or even trillions. Many graph analytics libraries and frameworks have been proposed with various computational models and programming languages to deal with such graphs. X10 programming language is a PGAS language that aims at both software performance and programmer's productivity. We introduce ScaleGraph library developed using X10 programming to illustrate the use of X10 for large-scale graph analytics. ScaleGraph library provides XPregel framework that is inspired by Google's Pregel computation model, serving as a building block for implementing graph kernels. We also optimized X10 runtime in some parts such as collective communication and memory management. We evaluated the performance and scalability of ScaleGraph libraries. The result shows that most graph kernels have good performance and scalability. ScaleGraph library is 9.4 times faster than Giraph in the experiment of PageRank with 16 machine nodes. To the best of our knowledge, ScaleGraph is the first X10-based library to address performance, scalability and productivity issues in dealing with large-scale graph analytics.
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018 by Codemotion
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical in Black Fridays.
Building graphs to discover information by David Martínez at Big Data Spain 2015, Big Data Spain
The basic challenge of a data scientist is to unveil information from raw data. Traditional machine learning algorithms have treated "pure" data analytics situations that comply with a set of restrictions, such as access to labels, a clear prediction objective… However, practice shows that, due to the wide spread of data science nowadays, the exception is the norm, and it is usual to encounter situations that depend on gathering information from raw data which lacks any kind of structure or objective that classic approaches assume. In these situations, building a graph that encodes the information we are trying to unveil is the most intuitive place to start, or even the only feasible one when we lack any field knowledge or previously stated aim. Unfortunately, building such a graph from scratch when the number of nodes is huge is computationally challenging and requires some approximations to make it feasible. In this review, we will talk about the most standard way of building those graphs in practice, and how to exploit them to solve data science tasks.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-11.html#spch11.2
Towards an Incremental Schema-level Index for Distributed Linked Open Data G... by Till Blume
Semi-structured, schema-free data formats are used in many applications because their flexibility enables simple data exchange. Especially graph data formats like RDF have become well established in the Web of Data. For the Web of Data, it is known that data instances are not only added, changed, and removed regularly, but that their schemas are also subject to enormous changes over time. Unfortunately, the collection, indexing, and analysis of the evolution of data schemas on the web is still in its infancy. To enable a detailed analysis of the evolution of Linked Open Data, we lay the foundation for the implementation of incremental schema-level indices for the Web of Data. Unlike existing schema-level indices, incremental schema-level indices have an efficient update mechanism to avoid costly recomputations of the entire index. This enables us to monitor changes to data instances at schema-level, trace changes, and ultimately provide an always up-to-date schema-level index for the Web of Data. In this paper, we analyze in detail the challenges of updating arbitrary schema-level indices for the Web of Data. To this end, we extend our previously developed meta model FLuID. In addition, we outline an algorithm for performing the updates.
Scalable frequent itemset mining using heterogeneous computing par apriori a... by ijdpsjournal
Association rule mining is one of the dominant tasks of data mining, which concerns finding frequent itemsets in large volumes of data in order to produce summarized models of mined rules. These models are extended to generate association rules in various applications such as e-commerce, bio-informatics, associations between image contents and non-image features, analysis of sales effectiveness, the retail industry, etc. In vastly growing databases, the major challenge is mining frequent itemsets in a very short period of time; as the data grows, the time taken to process it should remain almost constant. Since high-performance computing offers many processors and many cores, consistent runtime performance for association rule mining over such very large databases can be achieved, so we must rely on high-performance parallel and/or distributed computing. In the literature survey, we studied sequential Apriori algorithms and identified the fundamental problems in both the sequential and the parallel environment. We propose ParApriori, a parallel algorithm for GPGPU, and analyze the results of our GPU parallel algorithm. We find that the proposed algorithm improves computing time and delivers consistent performance over increasing load. The empirical analysis of the algorithm also shows that efficiency and scalability are verified over the series of datasets experimented on a many-core GPU platform.
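For readers unfamiliar with the sequential baseline that the paper parallelizes, here is a brute-force Apriori sketch in Python (toy transactions and a hypothetical min_support threshold; no pruning and no GPU offloading as the paper proposes):

def apriori(transactions, min_support):
    """Return every itemset whose support (fraction of transactions) is at least min_support."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = [frozenset([item]) for item in items if support(frozenset([item])) >= min_support]
    frequent, k = list(current), 2
    while current:
        # join step: build candidate k-itemsets from the frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread", "eggs"}, {"bread", "eggs"}]
print(apriori(baskets, min_support=0.5))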
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge by DataWorks Summit
The business value of data decreases rapidly after it is created, particularly in use cases such as fraud prevention, cybersecurity, and real-time system monitoring. The high-volume, high-velocity datasets used to feed these use cases often contain valuable, but perishable, insights that must be acted upon immediately.
In order to maximize the value of their data, enterprises must fundamentally change their approach to processing real-time data, focusing on reducing the decision latency around the perishable insights that exist within their real-time data streams, thereby enabling the organization to act upon them while the window of opportunity is open.
Generating timely insights in a high-volume, high-velocity data environment is challenging for a multitude of reasons. As the volume of data increases, so does the amount of time required to transmit it back to the datacenter and process it. Secondly, as the velocity of the data increases, the data and the insights derived from it lose value faster.
In this talk, we will present a solution based on Apache Pulsar Functions that significantly reduces decision latency by using probabilistic algorithms to perform analytic calculations on the edge.
Let's start GraphQL: structure, behavior, and architecture by Andrii Gakhov
In this talk, I describe the path to start with GraphQL in a company that has experience with Python stack and REST API. We go from the definition of GraphQL, via behavioral aspects and data management, to the most common architectural questions.
Implementing a Fileserver with Nginx and Lua by Andrii Gakhov
Using the power of Nginx, it is easy to implement quite complex file-upload logic with metadata and authorization support, and without the need for any heavy application server. In this article you can find a basic implementation of such a fileserver using only Nginx and Lua.
Have you heard about Salad Olivje, Vereniki, Pirogi and Bliny, but you are unsure what it is all about? This easy Pecha Kucha presentation can help you to become an expert :)
Probabilistic data structures. Part 4. Similarity by Andrii Gakhov
The book "Probabilistic Data Structures and Algorithms in Big Data Applications" is now available at Amazon and from local bookstores. More details at https://pdsa.gakhov.com
In this presentation, I described popular algorithms that employ Locality-Sensitive Hashing (LSH) to solve similarity-related problems. I started with LSH in general and then switched to algorithms such as MinHash (LSH for Jaccard similarity) and SimHash (LSH for cosine similarity). Each approach comes with the math behind it and simple examples to clarify the theory.
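As a toy illustration of the MinHash idea described above (a hedged sketch with salted SHA-1 hashes standing in for random permutations, not the presentation's code), the fraction of matching signature positions approximates the Jaccard similarity of two token sets:

import hashlib

def minhash_signature(tokens, num_hashes=64):
    """Minimal MinHash: one salted hash per simulated permutation."""
    signature = []
    for salt in range(num_hashes):
        signature.append(min(
            int(hashlib.sha1(f"{salt}:{token}".encode()).hexdigest(), 16)
            for token in tokens
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    # the fraction of matching positions approximates the Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox sleeps under the lazy tree".split())
true_jaccard = len(doc_a & doc_b) / len(doc_a | doc_b)
print(true_jaccard, estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))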
Some highlights and impressions from API Days Berlin + API Strat Europe // April 24-25, 2015
* microservices
* hypermedia API
* swagger
* 3scale API Gateway
An Enterprise Resource Planning system includes various modules that reduce any business's workload and organize workflows, which drives productivity. Here is a detailed explanation of the ERP modules; going through the points will help you understand how the software is changing work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Field Employee Tracking System | MiTrack App | Best Employee Tracking Solution... by informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv... by Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Globus Compute with IRI Workflows - GlobusWorld 2024 by Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Large Language Models and the End of Programming by Matt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient... by Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. It’s here, custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
Providing Globus Services to Users of JASMIN for Environmental Data Analysis by Globus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge to organize and improve your code review process.
Top Nidhi Software Solution Free Download by vrstrong314
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Navigating the Metaverse: A Journey into Virtual Evolution by Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms.
Paketo Buildpacks: the best way to build OCI images? DevopsDa... by Anthony Dahanne
Buildpacks have been around for more than 10 years! At first, they were used to detect and build an application before deploying it to certain PaaS platforms. With their latest generation, the Cloud Native Buildpacks (a CNCF incubating project), we can now build Docker (OCI) images. Are they a good alternative to Dockerfiles? What are the Paketo buildpacks? Which communities support them, and how?
Come find out in this ignite session.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applications
1. Andrii Gakhov, PhD
Exceeding Classical:
Probabilistic Data Structures
in Data-Intensive Applications
EuroSciPy 2019
Bilbao, Spain
2. Andrii Gakhov
Senior Software Engineer
at Ferret Go GmbH, Germany
Ph.D. in Mathematical Modelling,
M.Sc. in Applied Mathematics
Twitter: @gakhov | Website: gakhov.com
Probabilistic Data Structures
and Algorithms
for Big Data Applications
ISBN: 9783748190486
https://pdsa.gakhov.com
4. Bioinformatics: Counting k-mers in DNA
Counting substrings of length k in DNA sequence data (k-mers) is
essential in bioinformatics, for instance, for metagenomic sequencing.
A large fraction of the storage is spent on k-mers that contain sequencing errors and are observed only a single time in the data*.
Can we efficiently avoid persisting such invalid substrings?
Can we efficiently count valid substrings?
Can we efficiently count valid substrings?
* Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12(1), 333, 2011
For example, the team that sequenced the giant panda genome needed to
count 8.62 billion 27-mers, where 68% were low-coverage k-mers.
5. 1. Data-Intensive Applications
in the Big Data epoch
Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications
EuroSciPy 2019 | Andrii Gakhov @gakhov
6. What is Big Data?
Doug Laney in 2001 described Big Data datasets as those that
contain greater variety, arriving in increasing volumes and with ever-higher velocity.
Today this is known as the famous 3V's of Big Data.
Volume expresses the amount of data.
Velocity describes the speed at which data is arriving.
Variety refers to the number of types of data.
7. What is Big Data?
Big Data is more than simply a matter of size.
Big Data does not refer to data, it refers to technology.
The datasets of Big Data are larger, more complex, and generated more rapidly than our current resources can handle.
Image: https://www.freepngimg.com/electronics/technology
8. 2. Probabilistic Data Structures
and Algorithms
Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications
EuroSciPy 2019 | Andrii Gakhov @gakhov
9. Probabilistic Data Structures and Algorithms (PDSA)
A family of advanced approaches that are optimized to use sublinear memory and constant execution time.
They cannot provide exact answers and have some probability of error.
The tradeoff between the error and the resources is another feature that distinguishes the algorithms and data structures of this family.
10. PDSA in Big Data Ecosystem
Membership (keep track of indexed elements): Bloom Filter, Quotient Filter, Cuckoo Filter
Counting (find the number of unique elements): Linear Counting, FM Sketch, LogLog, HyperLogLog
Frequency (estimate frequencies of elements): Count-Min Sketch, Count Sketch
Rank (approximate percentiles and quantiles): Random Sampling, Greenwald-Khanna, q-digest, t-digest
Similarity (find similar documents): MinHash, SimHash, LSH
11. PDSA in Apache Spark SQL (PySpark interface)
q-quantile estimation (Greenwald-Khanna)
# pyspark.sql.DataFrameStatFunctions(df).approxQuantile
df.approxQuantile("language", [0.5], 0.25)
Approximate number of distinct elements (HyperLogLog++)
#pyspark.sql.functions.approx_count_distinct
df.agg(approx_count_distinct(df.language).alias('lang')).collect()
Spark SQL is Apache Spark's module for working with structured data.
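A self-contained variant of the two calls above, on a hypothetical toy DataFrame (column names and values are illustrative only); approx_count_distinct uses HyperLogLog++ under the hood, and approxQuantile uses the Greenwald-Khanna algorithm:

from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct

spark = SparkSession.builder.appName("pdsa-demo").getOrCreate()
df = spark.createDataFrame(
    [("python", 120), ("java", 95), ("python", 300), ("ruby", 40), ("java", 210)],
    ["language", "latency_ms"],
)

# approximate number of distinct languages (HyperLogLog++)
df.agg(approx_count_distinct(df.language).alias("languages")).show()

# approximate median latency with a 10% relative error bound (Greenwald-Khanna)
print(df.approxQuantile("latency_ms", [0.5], 0.1))

spark.stop()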
14. Frequency: Challenge
A hashtag is used to index a topic on Twitter and allows people to easily follow
items they are interested in. Hashtags are usually written with a # symbol in front.
Find the most trending hashtags on Twitter
about 6,000 tweets are created on Twitter every second, which is roughly 500 million items daily
most tweets are tagged with one or more hashtags
https://www.internetlivestats.com/twitter-statistics/
15. Frequency: Traditional Approach
Build a table that lists all elements seen so far with their corresponding counters
Increment the counter when a known element arrives, or add the element to the table and initialize its counter
Return the value of the counter that corresponds to the element as its frequency
requires linear memory
requires O(n) lookup time (worst case)
huge overhead for heavy-hitter search
16. Frequency: Challenges for Big Data streams
Continuous data streams
potentially unbounded number of unique elements
➡ sublinear (polylogarithmic at most) space
not feasible to re-process data streams
➡ one-pass algorithms preferred
high frequency throughput
➡ fast updates
Image: https://www.pngfind.com
17. Count-Min Sketch
a simple, space-efficient probabilistic data structure that is used to estimate frequencies of elements in data streams and can address the heavy hitters problem
presented by Graham Cormode and Shan Muthukrishnan in 2003
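Before the library call on the next slide, the core idea can be sketched in plain Python (a toy implementation with salted SHA-1 hashes, not the pdsa code): every element increments one counter per row, and the estimate is the minimum over the rows, so collisions can only inflate the answer.

import hashlib

class ToyCountMinSketch:
    """Minimal Count-Min Sketch: depth rows of width counters, one salted hash per row."""
    def __init__(self, depth=5, width=2000):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha1(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += 1

    def frequency(self, item):
        # every row can only overestimate (collisions add), so take the minimum
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

cms = ToyCountMinSketch()
for tag in ["python", "python", "scipy", "python", "numpy"]:
    cms.add(tag)
print(cms.frequency("python"))  # 3 (possibly overestimated, never underestimated)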
21. Frequency: Invoking Count-Min Sketch from Python
import json
from pdsa.frequency.count_min_sketch import CountMinSketch
cms = CountMinSketch(5, 2000)
with open('tweets.txt') as f:
for line in f:
hashtag = json.loads(line)['hashtag']
cms.add(hashtag)
print('Frequency of #Python', cms.frequency("Python"))
size_in_bytes = cms.sizeof()
print('Size in bytes', size_in_bytes) # ~40Kb / 32-bit counters
22. 4. Counting
Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications
EuroSciPy 2019 | Andrii Gakhov @gakhov
23. Counting: Challenge
Count the number of unique visitors
Amazon and eBay had about 3.375 billion* visitors in June 2019
Assume 337 million unique IP addresses (128 bits per IPv6 record)
5.4 GB of memory just to store them all
*SimilarWeb.Com Data for June, 2019
What if we could count them with only 12 KB?
Image: https://www.cleanpng.com
24. Counting: Traditional Approach
Build a list of all unique elements
Sort / search to avoid listing elements twice
Count the elements in the list
requires linear memory
requires O(n·log n) time
26. HyperLogLog
a hash-based probabilistic algorithm for counting the number of distinct
values in the presence of duplicates
proposed by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier in 2007
29. Counting: HyperLogLog Algorithm
Based on a single 32-bit hash function
Simulates k hash functions using the stochastic averaging approach
hash(x) = 32-bit hash value, split into p addressing bits and (32 - p) bits used for rank computation
Stores only k = 2^p counters (registers), about 4 bytes each
The memory is always fixed, regardless of the number of unique elements
More counters provide less error (memory/accuracy trade-off)
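A toy Python rendering of the mechanics just described (hypothetical code, no small-range correction, alpha constant valid for m >= 128): the first p bits of the hash pick a register, the rank of the remaining bits updates it, and the estimate is a bias-corrected harmonic mean. The standard error is roughly 1.04/sqrt(m); for the 2^14 registers in Redis's 12 KB structure this gives the 0.81% quoted on a later slide.

import hashlib

class ToyHyperLogLog:
    """Minimal HyperLogLog with p addressing bits (no small-range correction)."""
    def __init__(self, p=10):
        self.p, self.m = p, 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        x = int(hashlib.sha1(item.encode()).hexdigest(), 16) & 0xFFFFFFFF  # 32-bit hash value
        j = x >> (32 - self.p)                         # first p bits choose the register
        rest = x & ((1 << (32 - self.p)) - 1)          # remaining bits are used for the rank
        rank = (32 - self.p) - rest.bit_length() + 1   # number of leading zeros + 1
        self.registers[j] = max(self.registers[j], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias-correction constant for m >= 128
        return int(alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers))

hll = ToyHyperLogLog(p=10)
for i in range(100_000):
    hll.add(f"user-{i % 20_000}")                      # 20,000 distinct values
print(hll.count())                                     # close to 20,000 (standard error ~3%)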
30. Counting: Invoking HyperLogLog from Python
import json
from pdsa.cardinality.hyperloglog import HyperLogLog
hll = HyperLogLog(precision=10) # 2^{10} = 1024 counters
with open('visitors.txt') as f:
for line in f:
ip = json.loads(line)['ip']
hll.add(ip)
num_of_unique_visitors = hll.count()
print('Unique visitors', num_of_unique_visitors)
size_in_bytes = hll.sizeof()
print('Size in bytes', size_in_bytes) # ~ 4Kb
31. Counting: Distinct Count in Redis
Redis uses the HyperLogLog data structure to count unique elements in a set
requires a small constant amount of memory of 12 KB for every data structure
approximates the exact cardinality with a standard error of 0.81%
redis> PFADD hll python java ruby
(integer) 1
redis> PFADD hll python python python
(integer) 0
redis> PFADD hll java ruby
(integer) 0
redis> PFCOUNT hll
(integer) 3
http://antirez.com/news/75
32. 5. Final Notes
Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications
EuroSciPy 2019 | Andrii Gakhov @gakhov
33. Final Notes
Think about Big Data as a technology challenge
Instead of buying new servers, learn new algorithms
Believe in hashing! Sample vs. Hashing.
Probabilistic Data Structures and Algorithms become useful when your problem fits
Image: https://longfordpc.com/
34. Read More
[book] Probabilistic Data Structures and Algorithms for Big Data Applications
https://pdsa.gakhov.com
[repo] Probabilistic Data Structures and Algorithms in Python
https://github.com/gakhov/pdsa
Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure
https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
Redis new data structure: the HyperLogLog
http://antirez.com/news/75
Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles
https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
Big Data with Sketchy Structures
https://towardsdatascience.com/b73fb3a33e2a
Count-Min Sketch
http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf
35. Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications
EuroSciPy 2019 | Andrii Gakhov @gakhov
Website: www.gakhov.com
Twitter: @gakhov
Probabilistic Data Structures and
Algorithms for Big Data Applications
pdsa.gakhov.com
Eskerrik asko! (Basque: "Thank you very much!")
36. 6.Additional Slides
Exceeding Classical: Probabilistic Data Structures in Data-Intensive Applications
EuroSciPy 2019 | Andrii Gakhov, @gakhov
(for that person who wants more)
38. Counting: Accuracy vs. Memory Tradeoff in HyperLogLog
More counters require more memory (4 bytes per counter)
More counters need more bits for addressing them (m = 2^p)
39. Counting: HyperLogLog++Algorithm
HyperLogLog++
uses a 64-bit hash function, which allows counting more values
provides better bias correction using pre-computed data
proposes a sparse representation of the counters (registers) to reduce memory requirements
HyperLogLog++ is an improved version of HyperLogLog, developed at Google and proposed in 2013