Probabilistic Data Structures and Approximate Solutions
by Oleksandr Pryymak
PyData London 2014
IPython notebook with code
● an approximate answer is sufficient
● need to trade accuracy for scalability or speed
● need to analyse a stream of data
● despite typically achieving good results, there is a
chance of bad worst-case behaviour
● use on large datasets (law of large numbers)
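import random
from numpy import average  # assumption: the slides relied on numpy's average

x = [random.randint(0, 80000) for _ in range(10000)]
y = [i >> 8 for i in x]  # trim 8 bits off of integers
z = x[:500]  # 5% sample (x is uniform)
avx = average(x)
avy = average(y) * 2**8  # add 8 bits
avz = average(z)
# relative error of each approximation against the exact average
print(avy, 'error %.06f%%' % (100 * abs(avx - avy) / float(avx)))
print(avz, 'error %.06f%%' % (100 * abs(avx - avz) / float(avx)))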
39420.7744 error 0.321401%
39591.424 error 0.110100%
Code: Sampling Data
Get K samples from an infinite stream
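One common way to keep K uniform samples from a stream of unknown length is reservoir sampling. The sketch below is a minimal version of that idea; the function name `reservoir_sample` and the optional `rng` parameter are illustrative, and it is an assumption that this matches what the talk's notebook showed.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return up to k items drawn uniformly at random from an iterable."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Only the K kept items are held in memory, so the stream can be consumed one element at a time, for example `reservoir_sample(open_stream(), 100)` where `open_stream` is a hypothetical generator.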
Probabilistic Data Structures
Generally they:
● Use less space than a full dataset
● Require higher CPU load
● Can be parallelized
● Have controlled error rate
A hash function maps a key of arbitrary length
to a message of fixed length:
message = hash(key)
However, collisions are possible:
hash(key1) = hash(key2)
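To make both points concrete, here is a small sketch that squeezes keys of any length into a fixed number of bits and then searches for two keys with the same message. The helper names `fixed_length_hash` and `find_collision`, the choice of SHA-1, and the 16-bit width are illustrative assumptions, not taken from the slides.

```python
import hashlib


def fixed_length_hash(key, n_bits=16):
    """Map a string of any length to an integer of exactly n_bits bits."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << n_bits)


def find_collision(n_bits=16):
    """Return (key1, key2, message) for two distinct keys with equal hashes."""
    seen = {}
    i = 0
    while True:
        key = "key%d" % i
        message = fixed_length_hash(key, n_bits)
        if message in seen:
            return seen[message], key, message
        seen[message] = key
        i += 1
```

Because there are only 2**n_bits possible messages, the search is guaranteed to stop after at most 2**n_bits + 1 keys; in practice it stops much earlier.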
Comparison: Locality Sensitive Hashing (LSH)
B. Kulis, K. Grauman. Kernelized locality-sensitive hashing for scalable image search. Computer Vision, 2009 IEEE 12th …, 2009.
Abstract: Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be ...
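Unlike an ordinary hash, a locality-sensitive hash tries to give similar inputs similar messages. The sketch below uses random hyperplanes, a simpler scheme than the kernelized method in the paper cited above; all names (`make_hyperplanes`, `lsh_signature`, `hamming`) are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (as normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of positions where two bit signatures differ."""
    return sum(x != y for x, y in zip(a, b))
```

Vectors separated by a small angle tend to differ in few signature bits, so the Hamming distance between signatures can serve as a cheap proxy for angular distance.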
Membership test: Bloom filter
A Bloom filter is probabilistic, but it only yields false positives, never false negatives.
Hash each item k times to get k indices into a bit field.
At least one 0 means w definitely isn’t in the set.
All 1s mean w probably is in the set.
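The membership test described above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the class name, the default sizes, and the trick of deriving k hashes by salting SHA-1 with a counter are all assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [0] * n_bits

    def _indices(self, item):
        # Derive k indices by hashing the item with k different salts.
        for salt in range(self.n_hashes):
            data = ("%d:%s" % (salt, item)).encode("utf-8")
            digest = hashlib.sha1(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        # Any 0 bit: definitely absent. All 1 bits: probably present.
        return all(self.bits[idx] for idx in self._indices(item))
```

After `bf = BloomFilter(); bf.add("w")`, the test `"w" in bf` is always true, while a test for an item that was never added is usually false but can occasionally be true.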
Use Bloom filter to store graphs
Graphs only gain nodes because of Bloom
filter false positives.
Pell et al., PNAS 2012
Counting Distinct Elements
In: infinite stream of data
Question: how many distinct elements are there?
is similar to:
In: coin flips
Question: how many times has the coin been flipped?
Coin flips: intuition
● Long runs of HEADs in a random series are rare.
● The longer you look, the more likely you see a long one.
● Long runs are very rare and are correlated with how
many coins you’ve flipped.
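The intuition can be checked with a quick simulation: the longest run of heads grows slowly, roughly like the base-2 logarithm of the number of flips. The function names below are illustrative, and the exact numbers printed depend on the random seed.

```python
import random


def longest_run_of_heads(flips):
    """Length of the longest consecutive run of truthy values (heads)."""
    longest = current = 0
    for flip in flips:
        current = current + 1 if flip else 0
        longest = max(longest, current)
    return longest


def simulate(n_flips, rng):
    """Flip a fair coin n_flips times and return the longest run of heads."""
    return longest_run_of_heads(rng.random() < 0.5 for _ in range(n_flips))


if __name__ == "__main__":
    rng = random.Random(2014)
    for n_flips in (10, 1000, 100000):
        print(n_flips, "flips, longest run of heads:", simulate(n_flips, rng))
```

Reading this backwards gives the estimator: seeing a run of length n suggests that on the order of 2^n flips have happened.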
● For each input item:
○ Hash item into bit string
○ Count trailing zeroes in bit string
○ If this count > n:
■ Let n = count
● Estimated cardinality (“count distinct”) = 2^n
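The steps above translate almost line for line into code. This is a single-estimator sketch, so its answer is always a power of two and can be far off; the 32-bit width and the use of SHA-1 as the hash are assumptions made for illustration.

```python
import hashlib


def trailing_zeroes(value, width=32):
    """Count trailing zero bits; an all-zero value counts as `width` zeroes."""
    if value == 0:
        return width
    count = 0
    while value & 1 == 0:
        value >>= 1
        count += 1
    return count


def estimate_cardinality(stream):
    """Estimate the number of distinct items as 2^n, n = max trailing zeroes."""
    n = 0
    for item in stream:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        bit_string = int.from_bytes(digest[:4], "big")  # 32-bit hash
        count = trailing_zeroes(bit_string)
        if count > n:
            n = count
    return 2 ** n
```

Duplicates hash to the same bit string, so repeating items never changes the estimate; only the single integer n has to be stored.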
Cardinality estimation: HyperLogLog
Demo by: http://www.
Billions of distinct values in 1.5KB of
RAM with 2% relative error
HyperLogLog: the analysis of a near-optimal
cardinality estimation algorithm
P.Flajolet, É.Fusy, O.Gandouet, F.Meunier;
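HyperLogLog reduces the variance of the single-counter idea by splitting the stream into m registers and combining them with a harmonic mean; the cited paper gives a standard error of about 1.04/sqrt(m). The sketch below is a simplified reading of that algorithm, not the author's demo code: it counts leading rather than trailing zeroes, uses SHA-1 truncated to 64 bits as an assumed hash, and stores registers in a plain Python list, which is far less compact than the packed registers behind the 1.5KB figure.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10):
        # m = 2^p registers; the alpha formula below assumes m >= 128 (p >= 7).
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1 bit in `rest` (leading zeroes + 1).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw
```

With the default p=10 (1024 registers) the expected relative error is roughly 3%; raising p lowers the error at the cost of more registers.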
Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
● know the data structures
● know what you sacrifice
● control errors
structures-web-analytics-data-mining/ by Ilya Katsov