tech talk @ ferret
Andrii Gakhov
PROBABILISTIC DATA STRUCTURES
ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK
PART 1: MEMBERSHIP
MEMBERSHIP
Agenda:
▸ Bloom Filter
▸ Quotient filter
THE PROBLEM
• To determine membership of an element in a large set of elements
BLOOM FILTER
• proposed by Burton Howard Bloom, 1970
• Bloom filter is a realisation of probabilistic set with 3 operations:
• add element into the set
• test whether an element is a member of the set
• test whether an element is not a member of the set
• Bloom filter is described by 2 parameters:
• m - length of the filter (proportional to the expected number of elements n)
• k - number of different hash functions (usually, k is much smaller than m)
• It doesn’t store the elements themselves and requires about 1 byte per stored element
BLOOM FILTER: ALGORITHM
• Bloom filter is a bit array of m bits, all set to 0 at the beginning
• To insert an element into the filter, calculate the values of all k hash functions for the element and set the bits at the corresponding indices
• To test if an element is in the filter, calculate all k hash functions for the element and check the bits at all corresponding indices:
• if all bits are set, then the answer is “maybe”
• if at least 1 bit isn’t set, then the answer is “definitely not”
• Time needed to insert or test an element is O(k), independent of the number of items already in the filter
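The insert and test steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the class name `BloomFilter`, the method names, and the use of a salted 32-bit FNV-1a hash as a stand-in for k independent hash functions are all assumptions made for the example.

```python
class BloomFilter:
    """Minimal Bloom filter sketch: m bits and k salted FNV-1a hashes (assumed scheme)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                            # length of the filter in bits
        self.k = k                            # number of hash functions (k <= 256 here)
        self.bits = bytearray((m + 7) // 8)   # all bits are 0 at the beginning

    @staticmethod
    def _fnv1a(data: bytes) -> int:
        # 32-bit FNV-1a: non-cryptographic and fast
        h = 0x811C9DC5
        for byte in data:
            h ^= byte
            h = (h * 0x01000193) & 0xFFFFFFFF
        return h

    def _indices(self, item: str):
        # Assumption: k "different" hash functions are simulated by
        # prefixing the input with a one-byte salt 0..k-1.
        data = item.encode("utf-8")
        for salt in range(self.k):
            yield self._fnv1a(bytes([salt]) + data) % self.m

    def add(self, item: str) -> None:
        for i in self._indices(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, item: str) -> bool:
        # True means "maybe", False means "definitely not"
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indices(item))


bf = BloomFilter(m=1024, k=3)
bf.add("ferret")
bf.add("bernau")
print(bf.might_contain("ferret"))   # True: added elements are always reported
print(bf.might_contain("paris"))    # usually False, but a false positive is possible
```

Both operations touch exactly k bit positions, which is where the O(k) cost comes from.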
BLOOM FILTER: EXAMPLE
• Consider Bloom filter of 16 bits (m=16)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
• Consider 2 hash functions (k=2): MurmurHash3 and Fowler-Noll-Vo (to obtain an index, we take each hash value modulo 16)
• Add element to the filter: “ferret”:
MurmurHash3(“ferret”) = 1, FNV(“ferret”) = 11
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
BLOOM FILTER: EXAMPLE
• Add element to the filter: “bernau”:
MurmurHash3(“bernau”) = 4, FNV(“bernau”) = 4
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0
• Test element: “berlin”:
MurmurHash3(“berlin”) = 4, FNV(“berlin”) = 12
Bit 12 is not set, so “berlin” is definitely not in the set
• Test element: “paris”:
MurmurHash3(“paris”) = 11, FNV(“paris”) = 4
Bits 4 and 11 are set, so the filter answers “maybe in the set”, although “paris” was never added (a false positive)
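The example can be replayed with a tiny script. To stay faithful to the slides, the sketch below does not compute MurmurHash3 or FNV; it hard-codes the index pairs quoted above in a lookup table (`SLIDE_INDICES`, a name made up for this illustration).

```python
M = 16  # filter length from the example

# Index pairs (hash value mod 16) copied from the slides, not computed here.
SLIDE_INDICES = {
    "ferret": (1, 11),
    "bernau": (4, 4),
    "berlin": (4, 12),
    "paris": (11, 4),
}

bits = [0] * M


def add(word: str) -> None:
    for i in SLIDE_INDICES[word]:
        bits[i] = 1


def might_contain(word: str) -> bool:
    return all(bits[i] == 1 for i in SLIDE_INDICES[word])


add("ferret")
add("bernau")
print(bits)                     # [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(might_contain("berlin"))  # False: bit 12 is not set
print(might_contain("paris"))   # True: a false positive, "paris" was never added
```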
BLOOM FILTER: PROPERTIES
• False positives are possible (the element is not a member, but the filter answers as if it were)
• False negatives are not possible (the filter reports that an element isn’t a member only if it really isn’t)
• Hash functions should be independent and uniformly distributed. They should also be as fast as possible (don’t use cryptographic ones, like SHA-1)
• By choice of k and m it is possible to decrease the false positive probability.
$$P(e \in \Im \mid e \notin \Im) \approx \left(1 - e^{-kn/m}\right)^{k}$$
$$P(e \notin \Im \mid e \in \Im) = 0$$
$$k^{*} = \frac{m}{n} \ln 2$$
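To get a feel for these formulas, the following sketch evaluates the approximate false positive probability and the optimal number of hash functions for given m and n. The function names are hypothetical; the formulas are the ones stated on this slide.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive probability (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """Number of hash functions k* = (m / n) * ln 2 that minimises the rate."""
    return (m / n) * math.log(2)


# The 16-bit example filter with 2 elements and 2 hash functions
print(round(false_positive_rate(m=16, n=2, k=2), 4))

# About 1 byte (8 bits) per element
k_star = optimal_k(m=8000, n=1000)
print(round(k_star, 2))
print(round(false_positive_rate(m=8000, n=1000, k=round(k_star)), 4))
```

In practice k has to be an integer, so k* is rounded to a nearby whole number.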
BLOOM FILTER: APPLICATIONS
• Google BigTable, Apache HBase and Apache Cassandra use
Bloom filters to reduce the disk lookups for non-existent rows
or columns
• Medium uses Bloom filters to avoid recommending articles a
user has previously read
• Google Chrome web browser used to use a Bloom filter to
identify malicious URLs (moved to PrefixSet, Issue 71832)
• The Squid Web Proxy Cache uses Bloom filters for cache
digests
BLOOM FILTER: PROBLEMS
• The basic Bloom filter doesn’t support deletion.
• Bloom filters work well when they fit in main memory
• ~1 byte per element (3-4 times more for Counting Bloom filters, which support deletion)
• What goes wrong when Bloom filters grow too big to fit in main memory?
• On disks with rotating platters and moving heads, Bloom filters choke. A rotational disk performs only 100–200 (random) I/Os per second, and each Bloom filter operation requires multiple I/Os.
• On flash-based solid-state drives, Bloom filters achieve only hundreds of operations per second, in contrast to the order of a million per second in main memory.
• Buffering can help. However, buffering scales poorly as the Bloom-filter
size increases compared to the in-memory buffer size, resulting in only a
few buffered updates per flash page on average.
BLOOM FILTER: VARIANTS
• Attenuated Bloom filters use arrays of Bloom filters to store shortest path
distance information
• Spectral Bloom filters extend the data structure to support estimates of
frequencies.
• Counting Bloom Filters replace each single-bit entry of the filter with a small counter. Insertions and deletions increment or decrement the counters respectively.
• Compressed Bloom filters can be easily adjusted to the desired tradeoff
between size and false-positive rate
• Bloom Filter Cascade implements filtering by cascading pipeline of Bloom
filters
• Scalable Bloom Filters can adapt dynamically to the number of elements
stored, while assuring a maximum false positive probability
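Of the variants listed above, the Counting Bloom filter is the easiest to sketch. The illustration below is hedged in the same way as any toy example: the class name, the unbounded Python integers used as counters (real implementations typically use a few bits per counter), and the salted FNV-1a hashing are assumptions for the sake of a short, runnable example.

```python
class CountingBloomFilter:
    """Sketch of a Counting Bloom filter: m counters instead of m bits."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m   # assumption: unbounded ints, not 4-bit counters

    @staticmethod
    def _fnv1a(data: bytes) -> int:
        h = 0x811C9DC5
        for byte in data:
            h ^= byte
            h = (h * 0x01000193) & 0xFFFFFFFF
        return h

    def _indices(self, item: str) -> list:
        # Assumption: k hash functions simulated by a one-byte salt prefix.
        data = item.encode("utf-8")
        return [self._fnv1a(bytes([s]) + data) % self.m for s in range(self.k)]

    def add(self, item: str) -> None:
        for i in self._indices(item):
            self.counters[i] += 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[i] > 0 for i in self._indices(item))

    def remove(self, item: str) -> None:
        # Only safe for items that were really added; removing a false
        # positive would corrupt the counters of other elements.
        if not self.might_contain(item):
            raise KeyError(item)
        for i in self._indices(item):
            self.counters[i] -= 1


cbf = CountingBloomFilter(m=64, k=2)
cbf.add("ferret")
cbf.remove("ferret")
print(cbf.might_contain("ferret"))  # False: all counters are back to 0
```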
BLOOM FILTER: PYTHON
• https://github.com/jaybaird/python-bloomfilter

pybloom is a module that includes a Bloom Filter data
structure along with an implementation of Scalable
Bloom Filters
• https://github.com/seomoz/pyreBloom

pyreBloom provides Redis backed Bloom Filter using
GETBIT and SETBIT
BLOOM FILTER: READ MORE
• http://dmod.eu/deca/ft_gateway.cfm.pdf
• https://en.wikipedia.org/wiki/Bloom_filter
• https://www.cs.uchicago.edu/~matei/PAPERS/bf.doc
• http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
• https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
QUOTIENT FILTER
• introduced by Michael Bender et al., 2011
• Quotient filter is a realisation of probabilistic set with 4 operations:
• add element into the set
• delete element from the set
• test whether an element is a member of a set
• test whether an element is not a member of a set
• Quotient filter is described by:
• p - size (in bits) for fingerprints
• single hash function that generates fingerprints
• Quotient filter stores p-bit fingerprints of elements
QUOTIENT FILTER: ALGORITHM
• Quotient filter is a compact open hash table with m = 2^q buckets. The hash table employs quotienting, a technique suggested by D. Knuth:
• the fingerprint f is partitioned into:
• the r least significant bits (fr = f mod 2^r, the remainder)
• the q = p - r most significant bits (fq = ⌊f / 2^r⌋, the quotient)
• The remainder is stored in the bucket indexed by the quotient
• Each bucket contains 3 bits, all 0 at the beginning: is_occupied, is_continuation, is_shifted
[Figure: an empty quotient filter with buckets 0, 1, 2, …, m; the is_occupied, is_continuation and is_shifted bits of every bucket are 0]
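Quotienting itself is just a shift and a mask. A minimal sketch, assuming a non-negative p-bit fingerprint (the function name and the example values are made up for illustration):

```python
def split_fingerprint(f: int, p: int, r: int) -> tuple:
    """Split a p-bit fingerprint f into (quotient fq, remainder fr)."""
    if not 0 < r < p:
        raise ValueError("need 0 < r < p")
    if not 0 <= f < (1 << p):
        raise ValueError("f must fit into p bits")
    fq = f >> r               # the q = p - r most significant bits
    fr = f & ((1 << r) - 1)   # the r least significant bits, i.e. f mod 2^r
    return fq, fr


fq, fr = split_fingerprint(0b10110110, p=8, r=5)
print(fq, fr)                          # 5 22
print((fq << 5) | fr == 0b10110110)    # True: the fingerprint can be rebuilt
```

Because the quotient is implied by the bucket index, only the r remainder bits need to be stored, and the full fingerprint can still be reconstructed.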
QUOTIENT FILTER: ALGORITHM
• If two fingerprints f and f′ have the same quotient (fq = f′q), it is a soft collision. All remainders of fingerprints with the same quotient are stored contiguously in a run.
• If necessary, a remainder is shifted forward from its original location and stored in a subsequent bucket, wrapping around at the end of the array.
• is_occupied is set when a bucket j is the canonical bucket (fq = j) for some fingerprint f stored (somewhere) in the filter
• is_continuation is set when a bucket is occupied but not by the first remainder in a run
• is_shifted is set when the remainder in a bucket is not in its canonical bucket
[Figure: buckets 0, 1, 2, …, m with their is_occupied, is_continuation and is_shifted bits; the remainders a, b, c, d, e are stored in consecutive buckets and one run is highlighted]
QUOTIENT FILTER: ALGORITHM
• To test a fingerprint f:
• if bucket fq is not occupied, then the element with fingerprint f is definitely not in the filter
• if bucket fq is occupied:
• starting with bucket fq, scan left to locate the start of the cluster (the first bucket whose is_shifted bit is not set), counting the buckets whose is_occupied bit is set
• scan right, decrementing the running count at every run start (a bucket whose is_continuation bit is not set), until the running count reaches 0: that is the quotient's run
• compare the remainder in each bucket of the quotient's run with fr
• if found, then the element is (probably) in the filter, else it is definitely not in the filter.
• To add a fingerprint f:
• follow a path similar to test until certain that the fingerprint is definitely not in the filter
• choose a bucket in the current run by keeping the sorted order and insert the remainder fr (set the is_occupied bit of bucket fq)
• shift forward all remainders at or after the chosen bucket and update the buckets’ bits.
QUOTIENT FILTER: EXAMPLE
• Consider Quotient filter with size p=8 and 32-bit signed MurmurHash3 as h
slot:            0 1 2 3 4 5 6 7
is_occupied:     0 0 0 0 0 0 0 0
is_continuation: 0 0 0 0 0 0 0 0
is_shifted:      0 0 0 0 0 0 0 0
• Add elements: “amsterdam”, “berlin”, “london”
fq(“amsterdam”) = 1, fq(“berlin”) = 4, fq(“london”) = 7
fr(“amsterdam”) = 164894540, fr(“berlin”) = -89622902, fr(“london”) = 232552816
slot:            0 1 2 3 4 5 6 7
is_occupied:     0 1 0 0 1 0 0 1
is_continuation: 0 0 0 0 0 0 0 0
is_shifted:      0 0 0 0 0 0 0 0
remainders: slot 1: 164894540, slot 4: -89622902, slot 7: 232552816
Insertion at this stage is easy since none of the canonical slots is occupied. Each of the slots 1, 4 and 7 forms its own run/cluster.
QUOTIENT FILTER: EXAMPLE
• Add element: “madrid”
fq(“madrid”) = 1, fr(“madrid”) = 249059682
The canonical slot 1 is already occupied. The shifted and continuation bits are not set, so we are at the beginning of the cluster, which is also the run’s start.
The remainder fr(“madrid”) is strictly greater than the existing remainder, so it is placed in the next available slot 2, and the shifted and continuation bits of that slot are set (but not the occupied bit, because it pertains to the slot, not to the remainder stored in it).
slot:            0 1 2 3 4 5 6 7
is_occupied:     0 1 0 0 1 0 0 1
is_continuation: 0 0 1 0 0 0 0 0
is_shifted:      0 0 1 0 0 0 0 0
remainders: slot 1: 164894540, slot 2: 249059682, slot 4: -89622902, slot 7: 232552816
Slots 1-2 form one run/cluster; slots 4 and 7 each form their own run/cluster.
QUOTIENT FILTER: EXAMPLE
• Add element: “ankara”
fq(“ankara”) = 2, fr(“ankara”) = 62147742
The canonical slot 2 is not occupied, but it is already in use. So fr(“ankara”) is shifted right into the nearest available slot 3 and its shifted bit is set. In addition, we need to flag the canonical slot 2 as occupied by setting its occupied bit.
slot:            0 1 2 3 4 5 6 7
is_occupied:     0 1 1 0 1 0 0 1
is_continuation: 0 0 1 0 0 0 0 0
is_shifted:      0 0 1 1 0 0 0 0
remainders: slot 1: 164894540, slot 2: 249059682, slot 3: 62147742, slot 4: -89622902, slot 7: 232552816
Slots 1-3 form one cluster with two runs (slots 1-2 and slot 3); slots 4 and 7 each form their own run/cluster.
QUOTIENT FILTER: EXAMPLE
• Add element: “abu dhabi”
fq(“abu dhabi”) = 1, fr(“abu dhabi”) = -265307463
The canonical slot 1 is already occupied. The shifted and continuation bits are not set, so we are at the beginning of the cluster, which is also the run’s start.
The remainder fr(“abu dhabi”) is strictly smaller than the existing remainder, so it takes slot 1, and the remainders of the run that started in slot 1 are shifted right and flagged as continuation and shifted.
If shifting affects remainders from other runs/clusters, we also shift them right and set their shifted bits (and carry over their continuation bits if they are set).
slot:            0 1 2 3 4 5 6 7
is_occupied:     0 1 1 0 1 0 0 1
is_continuation: 0 0 1 1 0 0 0 0
is_shifted:      0 0 1 1 1 1 0 0
remainders: slot 1: -265307463, slot 2: 164894540, slot 3: 249059682, slot 4: 62147742, slot 5: -89622902, slot 7: 232552816
Slots 1-5 form one cluster with three runs (slots 1-3, slot 4 and slot 5); slot 7 forms its own run/cluster.
QUOTIENT FILTER: EXAMPLE
• Test element: “ferret”
fq(“ferret”) = 1, fr(“ferret”) = 122150710
The canonical slot 1 is already occupied and it’s the start of the run. Iterate through the run and compare fr(“ferret”) with the existing remainders until we find a match, find a remainder that is strictly greater, or hit the run’s end.
We start from slot 1, whose remainder is smaller, so we continue to slot 2. The remainder in slot 2 is already greater than fr(“ferret”), so we conclude that “ferret” is definitely not in the filter.
slot:            0 1 2 3 4 5 6 7
is_occupied:     0 1 1 0 1 0 0 1
is_continuation: 0 0 1 1 0 0 0 0
is_shifted:      0 0 1 1 1 1 0 0
remainders: slot 1: -265307463, slot 2: 164894540, slot 3: 249059682, slot 4: 62147742, slot 5: -89622902, slot 7: 232552816
QUOTIENT FILTER: EXAMPLE
• Test element: “berlin”
fq(“berlin”) = 4, fr(“berlin”) = -89622902
The canonical slot 4 is already occupied, but its shifted bit is set, so the run for which it is the canonical slot exists, but is shifted right.
First, we need to find the run corresponding to the canonical slot 4 in the current cluster, so we scan left and count occupied slots. There are 2 occupied slots found (indices 1 and 2), therefore our run is the 3rd in the cluster and we can scan right until we find it (counting slots whose continuation bit is not set).
Our run starts in slot 5. We compare fr(“berlin”) with the values stored there and find a match, so we can conclude that “berlin” is probably in the filter.
slot:            0 1 2 3 4 5 6 7
is_occupied:     0 1 1 0 1 0 0 1
is_continuation: 0 0 1 1 0 0 0 0
is_shifted:      0 0 1 1 1 1 0 0
remainders: slot 1: -265307463, slot 2: 164894540, slot 3: 249059682, slot 4: 62147742, slot 5: -89622902, slot 7: 232552816
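The two test walkthroughs can be checked with a short lookup sketch. It is not taken from any of the implementations linked in this deck: the `Slot` type, the `slot` helper and the `might_contain` function are names invented here, and the table literal is the filter state derived from the insertions described in the example (quotients and remainders as quoted on the slides). Only the lookup is implemented; insertion with shifting is left out.

```python
from typing import NamedTuple, Optional


class Slot(NamedTuple):
    is_occupied: bool
    is_continuation: bool
    is_shifted: bool
    remainder: Optional[int]   # None for an empty slot


def slot(bits: str, remainder: Optional[int] = None) -> Slot:
    """Build a slot from a 3-character bit string: occupied, continuation, shifted."""
    occupied, continuation, shifted = (ch == "1" for ch in bits)
    return Slot(occupied, continuation, shifted, remainder)


# State after adding amsterdam, berlin, london, madrid, ankara, abu dhabi
TABLE = [
    slot("000"),
    slot("100", -265307463),   # abu dhabi (run of quotient 1 starts here)
    slot("111", 164894540),    # amsterdam
    slot("011", 249059682),    # madrid
    slot("101", 62147742),     # ankara (run of quotient 2)
    slot("001", -89622902),    # berlin (run of quotient 4)
    slot("000"),
    slot("100", 232552816),    # london (run of quotient 7)
]


def might_contain(table: list, fq: int, fr: int) -> bool:
    n = len(table)
    if not table[fq].is_occupied:
        return False                       # definitely not in the filter
    # Scan left to the cluster start, counting occupied slots before fq.
    i, runs_to_skip = fq, 0
    while table[i].is_shifted:
        i = (i - 1) % n
        if table[i].is_occupied:
            runs_to_skip += 1
    # Scan right; every slot without the continuation bit starts a new run.
    while runs_to_skip > 0:
        i = (i + 1) % n
        if not table[i].is_continuation:
            runs_to_skip -= 1
    # i is now the first slot of fq's run; remainders are kept sorted.
    while True:
        if table[i].remainder == fr:
            return True                    # probably in the filter
        if table[i].remainder > fr:
            return False
        i = (i + 1) % n
        if not table[i].is_continuation:
            return False                   # reached the end of the run


print(might_contain(TABLE, 1, 122150710))   # ferret -> False
print(might_contain(TABLE, 4, -89622902))   # berlin -> True
```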
QUOTIENT FILTER: PROPERTIES
• False positives are possible (the element is not a member, but the filter answers as if it were)
• False negatives are not possible (the filter reports that an element isn’t a member only if it really isn’t)
• Hash function should generate uniformly distributed fingerprints.
• The length of most runs is O(1) and it is highly likely that all runs have length O(log m)
• Quotient filter is efficient for a large number of elements (~1B for a 64-bit hash function)
$$P(e \in \Im \mid e \notin \Im) \le \frac{1}{2^{r}}$$
$$P(e \notin \Im \mid e \in \Im) = 0$$
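Read the other way round, the bound tells how many remainder bits are needed for a target false positive rate. A small sketch under that reading (the function names are hypothetical, and the per-bucket size simply adds the 3 metadata bits described earlier to the r remainder bits):

```python
import math


def remainder_bits_for(max_fp_rate: float) -> int:
    """Smallest r such that 1 / 2^r <= max_fp_rate."""
    if not 0.0 < max_fp_rate < 1.0:
        raise ValueError("rate must be between 0 and 1")
    return math.ceil(math.log2(1.0 / max_fp_rate))


def bits_per_bucket(r: int) -> int:
    """r remainder bits plus is_occupied, is_continuation, is_shifted."""
    return r + 3


r = remainder_bits_for(0.01)
print(r, 1 / 2 ** r)        # 7 0.0078125
print(bits_per_bucket(r))   # 10
```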
QUOTIENT FILTER VS BLOOM FILTER
• Quotient filters are about 20% bigger than Bloom filters, but faster
because each access requires evaluating only a single hash function
• Results of comparison of in-RAM performance (M. Bender et al.):
• inserts: BF: 690 000 inserts per second, QF: 2 400 000 inserts per second
• lookups: BF: 1 900 000 lookups per second, QF: 2 000 000 lookups per second
• Lookups in Quotient filters incur a single cache miss, as opposed to at
least two in expectation for a Bloom filter.
• Two Quotient Filters can be efficiently merged without affecting their false positive rates. This is not possible with Bloom filters.
• Quotient filters support deletion
QUOTIENT FILTER: IMPLEMENTATIONS
• https://github.com/vedantk/quotient-filter

Quotient filter in-memory implementation written in C
• https://github.com/dsx724/php-quotient-filter

Quotient filter implementation in pure PHP
• https://github.com/bucaojit/QuotientFilter 

Quotient filter implementation in Java
QUOTIENT FILTER: READ MORE
• http://static.usenix.org/events/hotstorage11/tech/final_files/Bender.pdf
• http://arxiv.org/pdf/1208.0290.pdf
• https://en.wikipedia.org/wiki/Quotient_filter
• http://www.vldb.org/pvldb/vol6/p589-dutta.pdf
▸ @gakhov
▸ linkedin.com/in/gakhov
▸ www.datacrucis.com
THANK YOU

Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 

Probabilistic data structures. Part 1. Membership

  • 1. tech talk @ ferret Andrii Gakhov PROBABILISTIC DATA STRUCTURES ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK PART 1: MEMBERSHIP
  • 3. THE PROBLEM • To determine membership of an element in a large set of elements
  • 5. BLOOM FILTER • proposed by Burton Howard Bloom, 1970 • Bloom filter is a realisation of a probabilistic set with 3 operations: • add an element to the set • test whether an element is a member of the set • test whether an element is not a member of the set • Bloom filter is described by 2 parameters: • m - length of the filter (proportional to the expected number of elements n) • k - number of different hash functions (usually, k is much smaller than m) • It doesn’t store the elements themselves and requires about 1 byte per stored element
  • 6. BLOOM FILTER: ALGORITHM • Bloom filter is a bit array of m bits, all set to 0 at the beginning • To insert an element into the filter - calculate the values of all k hash functions for the element and set the bits at the corresponding indices • To test if an element is in the filter - calculate all k hash functions for the element and check the bits at all corresponding indices: • if all bits are set, then the answer is “maybe” • if at least 1 bit isn’t set, then the answer is “definitely not” • Time needed to insert or test an element is a fixed constant, O(k), independent of the number of items already in the filter
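
The insert and test steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the class name, the method names and the trick of deriving k hash functions by prefixing a seed byte to a 32-bit FNV-1a hash are all assumptions made for the example. A real filter would use independent hash functions.

```python
class BloomFilter:
    """Bit array of m bits probed by k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # all bits are 0 at the beginning

    @staticmethod
    def _fnv1a(data: bytes) -> int:
        # 32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime.
        h = 0x811C9DC5
        for byte in data:
            h ^= byte
            h = (h * 0x01000193) & 0xFFFFFFFF
        return h

    def _indices(self, item: str) -> list[int]:
        # Assumption: k "different" hash functions are simulated by
        # prefixing the item with a seed byte (so k must stay below 256).
        data = item.encode("utf-8")
        return [self._fnv1a(bytes([seed]) + data) % self.m
                for seed in range(self.k)]

    def add(self, item: str) -> None:
        # Insert: set the bit at each of the k indices.
        for i in self._indices(item):
            self.bits[i] = 1

    def might_contain(self, item: str) -> bool:
        # Test: True means "maybe", False means "definitely not".
        return all(self.bits[i] for i in self._indices(item))


bf = BloomFilter(m=1024, k=3)
bf.add("ferret")
print(bf.might_contain("ferret"))  # an added element is never reported absent
```

Both operations touch exactly k positions, which is where the O(k) cost comes from.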
  • 7. BLOOM FILTER: EXAMPLE • Consider a Bloom filter of 16 bits (m=16)
 index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 bit:    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 • Consider 2 hash functions (k=2): MurmurHash3 and Fowler-Noll-Vo (to calculate the appropriate index, we take the result modulo 16) • Add element to the filter: “ferret”: MurmurHash3(“ferret”) = 1, FNV(“ferret”) = 11
 index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 bit:    0  1  0  0  0  0  0  0  0  0  0  1  0  0  0  0
  • 8. BLOOM FILTER: EXAMPLE • Add element to the filter: “bernau”: MurmurHash3(“bernau”) = 4, FNV(“bernau”) = 4
 index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 bit:    0  1  0  0  1  0  0  0  0  0  0  1  0  0  0  0
 • Test element: “berlin”: MurmurHash3(“berlin”) = 4, FNV(“berlin”) = 12. Bit 12 is not set, so “berlin” is definitely not in the set • Test element: “paris”: MurmurHash3(“paris”) = 11, FNV(“paris”) = 4. Bits 4 and 11 are set, so the element may be in the set (a false positive, since “paris” was never added)
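
The example can be replayed directly. The sketch below does not compute MurmurHash3 or FNV; it hard-codes the index values quoted on the slides (already reduced modulo 16), and the names `SLIDE_INDICES`, `add` and `query` are invented for the illustration.

```python
# Hash indices exactly as quoted on the slides (already taken modulo 16).
SLIDE_INDICES = {
    "ferret": (1, 11),
    "bernau": (4, 4),
    "berlin": (4, 12),
    "paris": (11, 4),
}

bits = [0] * 16  # m = 16, all bits start at 0


def add(word: str) -> None:
    # Set the bit at every index produced for this word.
    for i in SLIDE_INDICES[word]:
        bits[i] = 1


def query(word: str) -> bool:
    # True means "maybe in the set", False means "definitely not".
    return all(bits[i] for i in SLIDE_INDICES[word])


add("ferret")
add("bernau")
print(bits)             # bits 1, 4 and 11 are set
print(query("berlin"))  # False: bit 12 is not set
print(query("paris"))   # True: a false positive, "paris" was never added
```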
  • 9. BLOOM FILTER: PROPERTIES • False positives are possible (the element is not a member, but the filter reports that it is). • False negatives are not possible (the filter reports that an element isn’t a member only if it really isn’t). • Hash functions should be independent and uniformly distributed. They should also be as fast as possible (avoid cryptographic hashes such as SHA-1). • By choice of k and m it is possible to decrease the false positive probability: $P(e \in \Im \mid e \notin \Im) \approx \left(1 - e^{-kn/m}\right)^{k}$ and $P(e \notin \Im \mid e \in \Im) = 0$, with the optimal number of hash functions $k^{*} = \frac{m}{n} \ln 2$
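
To get a feel for these formulas, the following sketch evaluates the standard approximation of the false positive probability and the optimal k for given m and n. The function names are invented for the example, and the optimal k would be rounded to an integer in practice.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive probability (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """Number of hash functions that minimises the rate: (m/n) * ln 2."""
    return (m / n) * math.log(2)


# The 16-bit filter from the example, holding 2 elements with 2 hashes.
print(false_positive_rate(m=16, n=2, k=2))  # about 0.049
print(optimal_k(m=16, n=2))                 # about 5.5

# Roughly 1 byte (8 bits) per element, the figure quoted in the deck.
print(false_positive_rate(m=8_000, n=1_000, k=round(optimal_k(8_000, 1_000))))
```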
  • 10. BLOOM FILTER: APPLICATIONS • Google BigTable, Apache HBase and Apache Cassandra use Bloom filters to reduce the disk lookups for non-existent rows or columns • Medium uses Bloom filters to avoid recommending articles a user has previously read • Google Chrome web browser used to use a Bloom filter to identify malicious URLs (moved to PrefixSet, Issue 71832) • The Squid Web Proxy Cache uses Bloom filters for cache digests
  • 11. BLOOM FILTER: PROBLEMS • The basic Bloom filter doesn’t support deletion. • Bloom filters work well when they fit in main memory • ~1 byte per element (3x-4x more for Counting Bloom filters, which support deletion) • What goes wrong when Bloom filters grow too big to fit in main memory? • On disks with rotating platters and moving heads, Bloom filters choke. A rotational disk performs only 100–200 (random) I/Os per second, and each Bloom filter operation requires multiple I/Os. • On flash-based solid-state drives, Bloom filters achieve only hundreds of operations per second, in contrast to the order of a million per second in main memory. • Buffering can help. However, buffering scales poorly as the Bloom filter size increases compared to the in-memory buffer size, resulting in only a few buffered updates per flash page on average.
  • 12. BLOOM FILTER: VARIANTS • Attenuated Bloom filters use arrays of Bloom filters to store shortest path distance information • Spectral Bloom filters extend the data structure to support estimates of frequencies. • Counting Bloom filters replace each single bit in the filter with a small counter. Insertions and deletions increment or decrement the counters respectively. • Compressed Bloom filters can be easily adjusted to the desired tradeoff between size and false-positive rate • Bloom Filter Cascade implements filtering by cascading a pipeline of Bloom filters • Scalable Bloom Filters can adapt dynamically to the number of elements stored, while assuring a maximum false positive probability
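
As an illustration of the counting variant, here is a minimal sketch in which every position holds an integer counter instead of a bit, which makes deletion possible. The names, the seeded FNV-1a hashing and the unbounded Python integers are assumptions of the example; real implementations typically use small fixed-width counters.

```python
def fnv1a(data: bytes) -> int:
    # 32-bit FNV-1a hash.
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


class CountingBloomFilter:
    """Bloom filter with a counter per position (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _indices(self, item: str) -> set[int]:
        # Assumption: k hash functions simulated with a seed byte prefix.
        # A set is used so colliding hashes touch a counter only once.
        data = item.encode("utf-8")
        return {fnv1a(bytes([seed]) + data) % self.m for seed in range(self.k)}

    def add(self, item: str) -> None:
        for i in self._indices(item):
            self.counters[i] += 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[i] > 0 for i in self._indices(item))

    def remove(self, item: str) -> None:
        # Only meaningful for items that were added; removing a false
        # positive would corrupt the counters of other elements.
        if not self.might_contain(item):
            raise KeyError(item)
        for i in self._indices(item):
            self.counters[i] -= 1


cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("ferret")
cbf.remove("ferret")
print(cbf.might_contain("ferret"))  # False: all counters are back to 0
```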
  • 13. BLOOM FILTER: PYTHON • https://github.com/jaybaird/python-bloomfilter
 pybloom is a module that includes a Bloom Filter data structure along with an implementation of Scalable Bloom Filters • https://github.com/seomoz/pyreBloom
 pyreBloom provides Redis backed Bloom Filter using GETBIT and SETBIT
  • 14. BLOOM FILTER: READ MORE • http://dmod.eu/deca/ft_gateway.cfm.pdf • https://en.wikipedia.org/wiki/Bloom_filter • https://www.cs.uchicago.edu/~matei/PAPERS/bf.doc • http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf • https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
  • 16. QUOTIENT FILTER • introduced by Michael Bender et al., 2011 • Quotient filter is a realisation of probabilistic set with 4 operations: • add element into the set • delete element from the set • test whether an element is a member of a set • test whether an element is not a member of a set • Quotient filter is described by: • p - size (in bits) for fingerprints • single hash function that generates fingerprints • Quotient filter stores p-bit fingerprints of elements
  • 17. QUOTIENT FILTER: ALGORITHM • Quotient filter is a compact open hash table with $m = 2^q$ buckets. The hash table employs quotienting, a technique suggested by D. Knuth: • the fingerprint f is partitioned into: • the r least significant bits (the remainder fr, $f_r = f \bmod 2^r$) • the q = p - r most significant bits (the quotient fq, $f_q = \lfloor f / 2^r \rfloor$) • The remainder is stored in the bucket indexed by the quotient • Each bucket contains 3 bits, all 0 at the beginning: is_occupied, is_continuation, is_shifted
 [Figure: array of empty buckets, each with is_occupied, is_continuation and is_shifted bits set to 0]
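
Quotienting amounts to a shift and a mask. The sketch below splits a p-bit fingerprint into its quotient and remainder; the function name and the sample values are made up for the illustration.

```python
def split_fingerprint(f: int, p: int, r: int) -> tuple[int, int]:
    """Split a p-bit fingerprint into (quotient fq, remainder fr)."""
    if not 0 <= f < (1 << p):
        raise ValueError("fingerprint does not fit in p bits")
    fq = f >> r              # the q = p - r most significant bits
    fr = f & ((1 << r) - 1)  # the r least significant bits, f mod 2^r
    return fq, fr


# An 8-bit fingerprint with a 5-bit remainder leaves a 3-bit quotient,
# so the table has 2^3 = 8 buckets.
fq, fr = split_fingerprint(0b10110101, p=8, r=5)
print(fq, fr)  # 5 21

# The pair identifies the fingerprint uniquely: it can be reassembled.
assert (fq << 5) | fr == 0b10110101
```

Because the quotient is implied by the bucket position, only the r remainder bits need to be stored, which is what makes the table compact.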
  • 18. QUOTIENT FILTER: ALGORITHM • If two fingerprints f and f′ have the same quotient (fq=f′q) - it is a soft collision. All remainders of fingerprints with the same quotient are stored contiguously in a run. • If necessary, a remainder is shifted forward from its original location and stored in a subsequent bucket, wrapping around at the end of the array. • is_occupied is set when a bucket j is the canonical bucket (fq=j) for some fingerprint f, stored (somewhere) in the filter • is_continuation is set when a bucket is occupied but not by the first remainder in a run • is_shifted is set when the remainder in a bucket is not in its canonical bucket
 [Figure: example bucket array showing the is_occupied, is_continuation and is_shifted bits per bucket, with remainders a, b, c, d, e stored in runs]
  • 19. QUOTIENT FILTER: ALGORITHM • To test a fingerprint f: • if bucket fq is not occupied, then the element with fingerprint f is definitely not in the filter • if bucket fq is occupied: • starting with bucket fq, scan left to locate the first bucket whose is_shifted bit is not set (the start of the cluster) • scan right with a running count (is_occupied: +1, is_continuation: -1) until the running count reaches 0; the scan is then at the quotient's run • compare the remainder in each bucket in the quotient's run with fr • if found, then the element is (probably) in the filter, else it is definitely not in the filter. • To add a fingerprint f: • follow a path similar to test until certain that the fingerprint is definitely not in the filter • choose a bucket in the current run that keeps the sorted order and insert the remainder fr (set the is_occupied bit of the canonical bucket) • shift forward all remainders at or after the chosen bucket and update the buckets’ bits.
  • 20. QUOTIENT FILTER: EXAMPLE • Consider a Quotient filter with size p=8 and 32-bit signed MurmurHash3 as h
 [Figure: empty filter with slots 0-7; all is_occupied, is_continuation and is_shifted bits are 0]
 • Add elements: “amsterdam”, “berlin”, “london”
 fq(“amsterdam”) = 1, fq(“berlin”) = 4, fq(“london”) = 7
 fr(“amsterdam”) = 164894540, fr(“berlin”) = -89622902, fr(“london”) = 232552816
 Insertion at this stage is easy since none of the canonical slots is occupied.
 [Figure: slot 1 holds 164894540, slot 4 holds -89622902, slot 7 holds 232552816; only the is_occupied bit is set in each of these slots, and each forms its own run/cluster]
  • 21. QUOTIENT FILTER: EXAMPLE • Add element: “madrid”
 fq(“madrid”) = 1, fr(“madrid”) = 249059682. The canonical slot 1 is already occupied. The shifted and continuation bits are not set, so we are at the beginning of the cluster, which is also the run’s start. The remainder fr(“madrid”) is strictly greater than the existing remainder, so it should be shifted right into the next available slot 2, and the shifted/continuation bits should be set (but not the occupied bit, because it pertains to the slot, not to the remainder it contains).
 [Figure: slot 1 holds 164894540; slot 2 holds 249059682 with is_continuation and is_shifted set; slots 4 and 7 are unchanged]
  • 22. QUOTIENT FILTER: EXAMPLE • Add element: “ankara”
 fq(“ankara”) = 2, fr(“ankara”) = 62147742. The canonical slot 2 is not occupied, but it is already in use. So fr(“ankara”) should be shifted right into the nearest available slot 3 and its shifted bit should be set. In addition, we need to flag the canonical slot 2 as occupied by setting the occupied bit.
 [Figure: slot 2 now has is_occupied set; slot 3 holds 62147742 with is_shifted set; slots 1-3 form one cluster containing two runs]
  • 23. QUOTIENT FILTER: EXAMPLE • Add element: “abu dhabi”
 fq(“abu dhabi”) = 1, fr(“abu dhabi”) = -265307463. The canonical slot 1 is already occupied. The shifted and continuation bits are not set, so we are at the beginning of the cluster, which is also the run’s start. The remainder fr(“abu dhabi”) is strictly smaller than the existing remainder, so all remainders in the run of slot 1 should be shifted right and flagged as continuation and shifted. If shifting affects remainders from other runs/clusters, we also shift them right and set their shifted bits (and mirror the continuation bits if they are set there).
 [Figure: slots 1-5 now hold -265307463, 164894540, 249059682, 62147742 and -89622902 as three runs in one cluster; slot 7 still holds 232552816 as its own run/cluster]
  • 24. QUOTIENT FILTER: EXAMPLE • Test element: “ferret”
 fq(“ferret”) = 1, fr(“ferret”) = 122150710. The canonical slot 1 is already occupied and it’s the start of the run. Iterate through the run and compare fr(“ferret”) with the existing remainders until we find a match, find a remainder that is strictly greater, or hit the run’s end. We start from slot 1, whose remainder is smaller, so we continue to slot 2. The remainder in slot 2 is already greater than fr(“ferret”), so we conclude that “ferret” is definitely not in the filter.
 [Figure: filter state unchanged from the previous slide]
  • 25. QUOTIENT FILTER: EXAMPLE • Test element: “berlin”
 fq(“berlin”) = 4, fr(“berlin”) = -89622902. The canonical slot 4 is occupied, but its shifted bit is set, so the run for which it is the canonical slot exists but is shifted right. First, we need to find the run corresponding to canonical slot 4 in the current cluster, so we scan left and count occupied slots. There are 2 occupied slots found (indices 1 and 2), therefore our run is the 3rd in the cluster, and we scan right until we find it (counting slots whose continuation bit is not set). Our run starts in slot 5; we compare fr(“berlin”) with the stored values and find a match, so we conclude that “berlin” is probably in the filter.
 [Figure: filter state unchanged from the previous slide]
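
The lookup walk used for “ferret” and “berlin” can be sketched as follows. The slot contents and bit values are reconstructed from the slide text, which makes them an assumption about the original figures, and the `Slot` class and `lookup` function are invented for the illustration. Unlike the slides, this sketch scans the whole run instead of stopping early at the first greater remainder, and it does not handle a completely full table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    remainder: Optional[int] = None
    is_occupied: bool = False
    is_continuation: bool = False
    is_shifted: bool = False


def lookup(slots: list[Slot], fq: int, fr: int) -> bool:
    """Return True for "probably present", False for "definitely not"."""
    m = len(slots)
    if not slots[fq].is_occupied:
        return False
    # Scan left to the start of the cluster (first slot that is not shifted).
    b = fq
    while slots[b].is_shifted:
        b = (b - 1) % m
    # Walk right, pairing each occupied canonical slot b with its run start s.
    s = b
    while b != fq:
        s = (s + 1) % m
        while slots[s].is_continuation:  # skip the rest of the current run
            s = (s + 1) % m
        b = (b + 1) % m
        while not slots[b].is_occupied:  # next canonical slot that has a run
            b = (b + 1) % m
    # s is now the first slot of the run that belongs to quotient fq.
    while True:
        if slots[s].remainder == fr:
            return True
        s = (s + 1) % m
        if not slots[s].is_continuation:
            return False


# Filter state after adding "abu dhabi", as read from the slide text.
slots = [Slot() for _ in range(8)]
slots[1] = Slot(-265307463, is_occupied=True)
slots[2] = Slot(164894540, is_occupied=True, is_continuation=True, is_shifted=True)
slots[3] = Slot(249059682, is_continuation=True, is_shifted=True)
slots[4] = Slot(62147742, is_occupied=True, is_shifted=True)
slots[5] = Slot(-89622902, is_shifted=True)
slots[7] = Slot(232552816, is_occupied=True)

print(lookup(slots, 1, 122150710))  # "ferret": False
print(lookup(slots, 4, -89622902))  # "berlin": True
```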
  • 26. QUOTIENT FILTER: PROPERTIES • False positives are possible (the element is not a member, but the filter reports that it is). • False negatives are not possible (the filter reports that an element isn’t a member only if it really isn’t). • The hash function should generate uniformly distributed fingerprints. • The length of most runs is O(1) and it is highly likely that all runs have length O(log m) • Quotient filters are efficient for a large number of elements (~1B for a 64-bit hash function) • $P(e \in \Im \mid e \notin \Im) \le \frac{1}{2^r}$ and $P(e \notin \Im \mid e \in \Im) = 0$
  • 27. QUOTIENT FILTER VS BLOOM FILTER • Quotient filters are about 20% bigger than Bloom filters, but faster because each access requires evaluating only a single hash function • Results of a comparison of in-RAM performance (M. Bender et al.): • inserts: BF: 690 000 inserts per second, QF: 2 400 000 inserts per second • lookups: BF: 1 900 000 lookups per second, QF: 2 000 000 lookups per second • Lookups in Quotient filters incur a single cache miss, as opposed to at least two in expectation for a Bloom filter. • Two Quotient filters can be efficiently merged without affecting their false positive rates. This is not possible with Bloom filters. • Quotient filters support deletion
  • 28. QUOTIENT FILTER: IMPLEMENTATIONS • https://github.com/vedantk/quotient-filter
 Quotient filter in-memory implementation written in C • https://github.com/dsx724/php-quotient-filter
 Quotient filter implementation in pure PHP • https://github.com/bucaojit/QuotientFilter 
 Quotient filter implementation in Java
  • 29. QUOTIENT FILTER: READ MORE • http://static.usenix.org/events/hotstorage11/tech/final_files/Bender.pdf • http://arxiv.org/pdf/1208.0290.pdf • https://en.wikipedia.org/wiki/Quotient_filter • http://www.vldb.org/pvldb/vol6/p589-dutta.pdf
  • 30. ▸ @gakhov ▸ linkedin.com/in/gakhov ▸ www.datacrucis.com THANK YOU