Uploaded by Lorenzo Alberton

Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees

The document discusses 'modern' algorithms and data structures, focusing on Bloom filters and Merkle trees. Bloom filters are space-efficient probabilistic data structures used to test set membership with the possibility of false positives but no false negatives, and they are implemented in systems like Cassandra to optimize I/O operations. Merkle trees, on the other hand, are hash trees that ensure the integrity of data blocks exchanged in peer-to-peer networks, assisting in minimizing data transfer during consistency checks.

Lorenzo Alberton
@lorenzoalberton

“Modern” Algorithms
and Data Structures
Part 1
Bloom Filters, Merkle Trees

Cassandra-London, Monday 18th April 2011
1

Bloom Filters
Burton Howard Bloom, 1970

http://portal.acm.org/citation.cfm?doid=362686.362692
2

Bloom Filter

Space-efficient probabilistic data structure used to test set membership
http://en.wikipedia.org/wiki/Bloom_filter
3

Bloom Filter
Space-efficient probabilistic data structure that is used to test
whether an element is a member of a set

Hash Table ⇒ chance of collision: hash(x) and hash(y) may map to the same bucket

False positives are possible, false negatives are not.
It might be beneficial to build an exception list of known false positives.
4

Bloom Filter
Space-efficient probabilistic data structure that is used to test
whether an element is a member of a set

Not a Key-Value store

Array of bits indicating the
presence of a key in the filter

(*) Removing an element from the filter is not possible

5

Bloom Filter: Add & Query
m bits (initially set to 0)
k hash functions
if f(x) = A, set S[A] = 1

Add: x sets the bits at f(x), g(x), h(x); y sets the bits at f(y), g(y), h(y)
S: array of m bits, indexed 0, 1, 2 ... m-1, with six bits now set to 1

Query: z is checked at f(z), g(z), h(z)
one bit set to 0 ⇒ z ∉ S
6
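
The add and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular system: the class name `BloomFilter`, the method names, and the choice of salted SHA-256 to stand in for the k hash functions f, g, h are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: m bits, k salted hash functions (illustrative only)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _positions(self, item: str):
        # Assumption: the k hash functions are simulated by salting SHA-256
        # with the function index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bit at each of the k hashed positions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # A single 0 bit proves the item was never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.add("x")
bf.add("y")
print(bf.might_contain("x"))  # True: added items are never reported missing
print(bf.might_contain("z"))  # normally False; True would be a false positive
```

The query method is named `might_contain` rather than `contains` on purpose: a positive answer only means "possibly in the set".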

Bloom Filter: Hash Functions
k hash functions: uniform random distribution in [0...m)

k different hash functions
‣ Cryptographic Hash Functions (MD5, SHA-1, SHA-256, Tiger, Whirlpool ...)
‣ Murmur Hashes
  http://code.google.com/p/smhasher/

The same hash function with different salts

Double or triple hashing [1]: $g_i(x) = h_1(x) + i\,h_2(x) \bmod m$

2 hash functions can mimic k hashing functions

[1] Dillinger, Peter C.; Manolios, Panagiotis (2004b), "Bloom Filters in Probabilistic Verification",
http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf

http://www.strchr.com/hash_functions
7
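
As a rough sketch of the double hashing idea, the function below derives k positions from only two base hashes. The function name `double_hash_positions` is hypothetical, and taking h1 and h2 from two halves of one SHA-256 digest is an assumption made for brevity, not something the slide prescribes.

```python
import hashlib


def double_hash_positions(item: str, k: int, m: int) -> list[int]:
    """Derive k bit positions as g_i(x) = (h1(x) + i * h2(x)) mod m."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    # Assumption: h2 is forced odd so it is never 0. When m is a power of
    # two, an odd h2 also makes the k positions distinct for k <= m.
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]


print(double_hash_positions("x", k=7, m=1024))
```

Only one digest is computed per element, however large k is, which is the practical appeal of the technique.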

Bloom Filter: Usage

‣ Guard against expensive operations (like disk access)
‣ First line of defence in high performance (distributed) caches
‣ Peer to Peer communication
‣ Routing - Resource Location

Used in: Squid Proxy Cache, Google BigTable, Various RDBMS', Google Chrome, Cisco Routers, Cassandra, HBase, ...

8

Bloom Filter: Usage in Cassandra

Used to save I/O during key look-ups
(check for non-existent keys)

One bloom filter per SSTable.

org.apache.cassandra.utils.BloomFilter

9
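
To illustrate how a per-table filter saves I/O, here is a toy model. It is not Cassandra's code or API: `TinyBloom`, `SSTable`, `get` and `disk_reads` are hypothetical names, and the "disk" is just an in-memory dict with a counter.

```python
import hashlib


class TinyBloom:
    """Small Bloom filter backed by a Python int used as a bit set."""

    def __init__(self, m: int = 1024, k: int = 3) -> None:
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Hypothetical stand-in for an on-disk table, with one filter per table."""

    def __init__(self, rows: dict) -> None:
        self.rows = dict(rows)   # pretend this lives on disk
        self.disk_reads = 0
        self.filter = TinyBloom()
        for key in self.rows:
            self.filter.add(key)

    def get(self, key: str):
        if not self.filter.might_contain(key):
            return None          # definitely absent: skip the disk read
        self.disk_reads += 1     # possibly present: pay for the read
        return self.rows.get(key)


table = SSTable({"alice": 1, "bob": 2})
print(table.get("alice"), table.disk_reads)  # 1 1
```

A lookup for a missing key costs a disk read only when the filter returns a false positive.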

Bloom Filter: False Positive Rate

m = number of bits in the filter
n = number of elements
k = number of hashing functions

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
10
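
The formula image from this slide did not survive extraction. As an assumption, it most likely showed the standard approximation for the false positive probability p, which in terms of m, n and k reads:

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2
\;\Rightarrow\;
p_{\min} \approx \left(\tfrac{1}{2}\right)^{k_{\mathrm{opt}}} \approx 0.6185^{\,m/n}
```

Solving the last expression for the filter size gives $m/n = -\ln p / (\ln 2)^2$ bits per element for a target rate p.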

Bloom Filter: False Positive Rate

A bloom filter with an optimal value for k
and 1% error rate only needs 9.6 bits per key.
Add 4.8 bits/key and the error rate decreases by 10 times.

10,000 words, 1% error rate: 7 hash functions, ~12 KB of memory
10,000 words, 0.1% error rate: 11 hash functions, ~18 KB of memory

http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/
11
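
The figures above can be checked with a short calculation. This sketch assumes the usual sizing formulas, m = -n ln p / (ln 2)^2 and k = (m/n) ln 2; the function name `bloom_parameters` is made up for the example.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m bits, k hash functions) for n keys and target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k


for p in (0.01, 0.001):
    m, k = bloom_parameters(10_000, p)
    print(f"p={p}: {m / 10_000:.1f} bits/key, k={k}, {m / 8 / 1000:.1f} KB")
```

Under these assumptions the sketch gives about 9.6 and 14.4 bits per key, and roughly 12 KB and 18 KB of memory, in line with the slide. The k it computes for the 0.1% case is slightly lower than the 11 quoted above, which comes from the cited article.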

Bloom Filter: False Positive Rate
[Plot: false positive probability against bloom filter size (n)]
http://en.wikipedia.org/wiki/Bloom_filter
12

Counting Bloom Filter
Can handle deletions
Use counters instead of 0/1s
When adding an element, increment the counters
When deleting an element, decrement the counters
Counters must be large enough to avoid overflow (4 bits)

Example, after adding x (cells f(x), g(x), h(x)) and y (cells f(y), g(y), h(y)):
S 1 0 0 0 1 0 0 0 2 0 0 0 1 0 1
13
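
A counting filter can be sketched as follows. The names are hypothetical, salted SHA-256 again stands in for the k hash functions, and saturating at 15 models a 4-bit counter; none of this is taken from a specific library.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant with one small counter per cell (illustrative only)."""

    def __init__(self, m: int, k: int, max_count: int = 15) -> None:
        self.m, self.k, self.max_count = m, k, max_count
        self.counters = [0] * m

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            # Saturate at max_count (15 for a 4-bit counter) instead of overflowing.
            self.counters[pos] = min(self.counters[pos] + 1, self.max_count)

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item: str) -> None:
        # Only meaningful for items that were actually added; removing
        # anything else can introduce false negatives for other items.
        if not self.might_contain(item):
            return
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("x")
cbf.remove("x")
print(cbf.might_contain("x"))  # False: the counters are back to 0
```

The price of deletion is memory: with 4-bit counters the structure is four times larger than a plain bit array of the same m.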

Stable (Time-Based) Bloom Filter
Input Stream → Duplicate Filter → Output Stream

Duplicate Filter cells: 1 0 0 0 1 0 0 0 1 0

Before each insertion, P random cells are decremented by one.
The k cells for the new value xi are set to Max (usually < 7)
http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf

Alternatively, set an expiry time for each cell, with a TTL
dependent on the volume of data
http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/

14
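
The decrement-then-set rule can be sketched as a small duplicate detector. This is a simplified reading of the slide rather than the algorithm exactly as published: the class and method names are hypothetical, and picking the P cells independently at random (with replacement) is an assumption.

```python
import hashlib
import random


class StableBloomFilter:
    """Duplicate filter whose cells decay over time (illustrative only)."""

    def __init__(self, m: int, k: int, p: int, max_value: int = 3, seed=None) -> None:
        self.m, self.k, self.p, self.max_value = m, k, p, max_value
        self.cells = [0] * m
        self.rng = random.Random(seed)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def seen_before(self, item: str) -> bool:
        """Report whether item looks like a duplicate, then record it."""
        positions = list(self._positions(item))
        duplicate = all(self.cells[pos] > 0 for pos in positions)
        # Before the insertion, decrement P randomly chosen cells by one.
        for _ in range(self.p):
            j = self.rng.randrange(self.m)
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Then set the k cells of the new value to Max.
        for pos in positions:
            self.cells[pos] = self.max_value
        return duplicate


sbf = StableBloomFilter(m=256, k=3, p=4, seed=1)
print(sbf.seen_before("a"))  # False: first occurrence
print(sbf.seen_before("a"))  # True: immediate repeat
```

Because cells decay, old elements are gradually forgotten, so unlike a plain Bloom filter this variant can produce false negatives for items seen long ago.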

Bloom Filters: Further reading
Compressed Bloom Filters
Improve performance when the Bloom filter is passed as a message,
and its transmission size is a limiting factor.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.3346

Retouched Bloom Filters
Allow networked applications to trade off selected false positives
against false negatives
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.172.8453

Bloomier Filters
Extended to handle approximate functions (each element of the set
has an associated function value)
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.4154
http://arxiv.org/abs/0807.0928

Attenuated B.F., Spectral B.F., Distance-Sensitive B.F. ...
15

Merkle Trees
Ralph C. Merkle, 1979

http://www.springerlink.com/content/q865hwxq73ex1am9/
16

Merkle Trees (Hash Trees)

Data Structure containing a
tree of summary information
about a larger piece of data
to verify its contents

http://en.wikipedia.org/wiki/Hash_Tree
17

Merkle Trees (Hash Trees)
Leaves: hashes of data blocks.
Nodes: hashes of their children.
Used to detect inconsistencies between replicas (anti-entropy)
and to minimise the amount of transferred data

ROOT = hash(A, B)
A = hash(C, D)    B = hash(E, F)
C = hash(001)    D = hash(002)    E = hash(003)    F = hash(004)
Data Block 001    Data Block 002    Data Block 003    Data Block 004
18
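
The tree in the diagram can be built bottom-up in a few lines. This is a sketch under stated assumptions: SHA-1 is used because a later slide lists it, hash(A, B) is taken to mean hashing the concatenation of the two child digests, and the helper names are hypothetical.

```python
import hashlib


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def build_merkle_tree(blocks: list[bytes]) -> list[list[bytes]]:
    """Return the tree as a list of levels: leaf hashes first, root last."""
    if not blocks:
        raise ValueError("at least one data block is required")
    level = [sha1(block) for block in blocks]   # leaves: hashes of data blocks
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            # Assumption: on odd-sized levels the last hash is paired with itself.
            level = level + [level[-1]]
        # Inner nodes: hash of the two children concatenated.
        level = [sha1(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


blocks = [b"001", b"002", b"003", b"004"]
levels = build_merkle_tree(blocks)
root = levels[-1][0]
tampered = build_merkle_tree([b"001", b"002", b"003", b"00X"])[-1][0]
print(root != tampered)  # True: changing any block changes the root
```

Comparing two roots is therefore enough to tell whether two replicas hold identical data.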

Merkle Trees
Node A Node B
gossip
exchange

Minimal data transfer
Differences are easy to locate

SHA-1, Whirlpool or Tiger (TTH) hash functions
19
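
To show why differences are easy to locate, the sketch below compares two trees from the root down and descends only into subtrees whose hashes differ. It is a toy model of the exchange between two nodes, not any system's protocol: both trees live in one process, the helper names are hypothetical, and the number of blocks is assumed to be a power of two.

```python
import hashlib


def build_levels(blocks: list[bytes]) -> list[list[bytes]]:
    """Levels of a Merkle tree, root first. Assumes len(blocks) is a power of two."""
    level = [hashlib.sha1(block).digest() for block in blocks]
    levels = [level]
    while len(level) > 1:
        level = [hashlib.sha1(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
        levels.insert(0, level)
    return levels


def differing_blocks(a, b, depth: int = 0, index: int = 0) -> list[int]:
    """Indexes of leaves that differ, visiting only mismatching subtrees."""
    if a[depth][index] == b[depth][index]:
        return []                # identical subtree: nothing below needs exchanging
    if depth == len(a) - 1:
        return [index]           # reached a leaf whose hash differs
    return (differing_blocks(a, b, depth + 1, 2 * index)
            + differing_blocks(a, b, depth + 1, 2 * index + 1))


node_a = build_levels([b"001", b"002", b"003", b"004"])
node_b = build_levels([b"001", b"002", b"XXX", b"004"])
print(differing_blocks(node_a, node_b))  # [2]
```

With one changed block out of four, only the hashes along one root-to-leaf path plus their siblings are compared, and only block 003 would need to be transferred.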

Merkle Trees: Usage

Peer to Peer communication: DC++

Used in: Cassandra, Amazon Dynamo, HBase, Google BigTable, ZFS, Google Wave, ...

20

Merkle Trees: Usage in Cassandra

Ensure the P2P network of nodes receives
data blocks unaltered and unharmed.
Anti-entropy during major compactions
(via Scuttlebutt reconciliation).

One Merkle Tree per Column Family
(in Dynamo, one per node / key range)

org.apache.cassandra.utils.MerkleTree

http://wiki.apache.org/cassandra/ArchitectureAntiEntropy
21

References

Bloom Filters
http://bit.ly/bundles/quipo/1

Merkle Trees
http://bit.ly/bundles/quipo/2

22

We’re Hiring!

http://mediasift.com/careers
23

Lorenzo Alberton
@lorenzoalberton

Thank you!

lorenzo@alberton.info

http://www.alberton.info/talks
24

Editor's Notes

#2 \n
#3 \n
#4 \n
#5-#12 Two keys might map into the same bucket.
#17-#47 An empty Bloom filter is an array of m bits, all set to 0. There must be k hash functions defined, each of which maps an element to one of the m array positions with a uniform random distribution.
To add an element, feed it to each of the k hash functions to get k array positions, and set the bits at those positions to 1.
To test for an element, feed it to each of the k hash functions to get k array positions: if any of the bits at these positions is 0, the element is definitely not in the set; if all of them are 1, the element is either in the set or a false positive.
Union and intersection of two Bloom filters that share the same m and the same hash functions are simple bitwise OR and AND operations (the AND result only approximates the filter of the true intersection).
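The add and test steps described in this note can be sketched in a few lines of Python. This is a minimal illustration, not the Cassandra implementation: the class name `BloomFilter`, the method names, and the trick of simulating k hash functions by salting SHA-256 with the hash index are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: m bits and k hash functions (hypothetical sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # an empty filter is m bits, all set to 0

    def _positions(self, item: str) -> list:
        # Assumption: k hash functions are simulated by salting SHA-256
        # with the hash index; real systems use much faster hashes.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def add(self, item: str) -> None:
        # Set the bit at each of the k positions.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Any 0 bit: definitely absent. All 1s: present or a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        # Bitwise OR; only meaningful when both filters share m, k and hashes.
        result = BloomFilter(self.m, self.k)
        result.bits = [a | b for a, b in zip(self.bits, other.bits)]
        return result
```

A filter built this way never reports an added element as absent, which is the "no false negatives" property; a `True` answer for an element that was never added is a false positive.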
#48 Tiger is a cryptographic hash function optimised for 64-bit platforms (1995). Digest size: 192 bits (truncated versions: 128 and 160 bits).
MurmurHash is very fast and has a low collision rate (2008).
Another good non-cryptographic hash function is the Jenkins hash function (Bob Jenkins, 1997).
Hashing with checksum functions is possible, and may produce a sufficiently uniform distribution of hash values, as long as the hash range size n is small compared to the range of the checksum or fingerprint function. The CRC32 checksum provides only 16 bits (the higher half of the result) that are usable for hashing.
#49 Popular in distributed web caches (small cost, big potential gain). The Google Chrome web browser uses Bloom filters to speed up its Safe Browsing service. In relational databases, Bloom filters are often used for JOINs.
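#51 All the bits for an element that was never inserted might already be set, which produces a false positive. There is a clear trade-off between m and the probability of a false positive: for a given number of inserted elements n, a larger m gives a lower false-positive rate. The value of k that minimises the probability of false positives is (m/n) ln 2, roughly 0.7 m/n.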
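The trade-off in this note can be made concrete with the standard approximation for the false-positive probability, p ≈ (1 - e^(-kn/m))^k, where n is the number of inserted elements. The sketch below is only an illustration; the function names are made up for the example and the numbers in the usage note are assumed inputs, not measurements from the talk.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """Value of k that minimises the rate: (m/n) * ln 2, about 0.7 * m/n."""
    return (m / n) * math.log(2)
```

For example, with m = 10,000 bits and n = 1,000 elements (10 bits per element), `optimal_k` gives about 6.9, so k = 7 is the natural choice, and `false_positive_rate(10_000, 1_000, 7)` comes out at a little under 1%. Picking k well below or well above that value gives a worse rate for the same m.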
#53 An optimal number of hash functions k has been assumed.
#54 Standard Bloom filters can't handle deletions: if deleting x means resetting its 1s to 0s, then deleting one entry might also delete several others that share those bits.
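A tiny worked example shows why resetting bits is unsafe. The two hash functions below are toy functions chosen only so that the collision is easy to follow by hand; they are an assumption of this sketch, not something a real filter would use.

```python
M = 8  # number of bits in the toy filter


def positions(x: int) -> set:
    # Toy hash functions (assumption): x mod M and (3x + 1) mod M.
    return {x % M, (3 * x + 1) % M}


def add(bits: list, x: int) -> None:
    for p in positions(x):
        bits[p] = 1


def might_contain(bits: list, x: int) -> bool:
    return all(bits[p] for p in positions(x))


def naive_delete(bits: list, x: int) -> None:
    # Deliberately wrong: clears bits that other elements may share.
    for p in positions(x):
        bits[p] = 0


bits = [0] * M
add(bits, 1)            # sets bits 1 and 4
add(bits, 4)            # sets bits 4 and 5 (bit 4 is shared with element 1)
naive_delete(bits, 1)   # clears bits 1 and 4
lost = not might_contain(bits, 4)  # element 4 is now a false negative
```

After the naive delete, element 4 is reported as absent even though it was never removed, which breaks the "no false negatives" guarantee.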
#55-#58 2006. Precisely eliminating duplicates in an unbounded data stream (i.e. when you don't know the size of the data set up front) is not feasible in many streaming scenarios. A common characteristic of existing duplicate-elimination algorithms is the underlying assumption that the whole data set is stored and can be accessed if needed.
Use cases: URL crawlers, network monitoring (number of accesses by IP in the past hour), trending topics.
In many data stream applications, the allocated space is rather small compared to the size of the stream. As more and more elements arrive, the fraction of zeros in the Bloom filter decreases continuously and the false-positive rate increases accordingly, finally reaching the limit, 1, where every distinct element is reported as a duplicate and the Bloom filter is useless.
With a regular Bloom filter, there is no way to distinguish recent elements from past ones.
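The saturation effect described in this note can be observed with a small simulation. This is a rough sketch under assumed parameters: the filter size `M`, the number of hashes `K`, the element names and the salted SHA-256 hashing are all choices made for the demonstration.

```python
import hashlib

M, K = 64, 3  # deliberately tiny filter (assumed values for the demo)


def positions(item: str) -> list:
    # Assumption: K hash functions simulated by salting SHA-256.
    result = []
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        result.append(int.from_bytes(digest[:8], "big") % M)
    return result


def zero_fraction_over_stream(n_items: int, step: int) -> list:
    """Feed n_items distinct elements into a regular Bloom filter and
    sample the fraction of 0 bits every `step` items."""
    bits = [0] * M
    samples = [1.0]  # an empty filter is all zeros
    for n in range(1, n_items + 1):
        for p in positions(f"element-{n}"):
            bits[p] = 1
        if n % step == 0:
            samples.append(bits.count(0) / M)
    return samples


samples = zero_fraction_over_stream(200, 20)
```

Because bits are only ever set, the sampled fraction of zeros can never go back up; once it approaches 0, almost every lookup returns "probably present", whether or not the element was seen before.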
#59 Retouched Bloom Filters (RBF) permit the removal of selected false positives at the expense of generating random false negatives.
#61 Hash trees can be used to protect any kind of data stored, handled and transferred in and between computers.
#62 Each inner node is the hash value of the concatenation of its two children. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently, without requiring nodes to download the entire tree or the entire data set.
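The construction in this note, and the idea of checking one branch on its own, can be sketched as follows. This is a simplified illustration that assumes the number of data blocks is a power of two; the helper names (`build_levels`, `audit_path`, `verify_block`) are invented for the example, and SHA-1 is used only because it is one of the hash functions named on the slides.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def build_levels(blocks: list) -> list:
    """Return the tree as a list of levels, leaves first, root last.
    Assumes len(blocks) is a power of two."""
    level = [h(b) for b in blocks]          # leaves: hashes of data blocks
    levels = [level]
    while len(level) > 1:
        # inner node: hash of the concatenation of its two children
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def audit_path(levels: list, index: int) -> list:
    """Sibling hashes needed to recompute the root from one leaf."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1                 # the other child of the same parent
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify_block(block: bytes, path: list, root: bytes) -> bool:
    """Check a single branch without the rest of the tree or the data set."""
    node = h(block)
    for sibling_hash, sibling_is_left in path:
        node = h(sibling_hash + node) if sibling_is_left else h(node + sibling_hash)
    return node == root
```

Verifying one block needs only its audit path, i.e. one sibling hash per level, rather than every block or every node of the tree.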
#63-#66 For each key range of data, each member of the replica group computes a Merkle tree (a hash tree in which differences can be located quickly) and sends it to its neighbours. By comparing the received Merkle tree with its own tree, each member can quickly determine which portion of the data is out of sync. If any is, it sends the diff to the members that are behind.
Tiger is a cryptographic hash function optimised for 64-bit platforms (1995). Digest size: 192 bits (truncated versions: 128 and 160 bits).
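The comparison step in this note can be sketched as a top-down walk over two trees that only descends into subtrees whose hashes differ. This is an illustrative sketch, not Cassandra's MerkleTree class: it assumes both replicas hold the same power-of-two number of blocks, and the function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def build_levels(blocks: list) -> list:
    """Tree as a list of levels, leaves first, root last (power-of-two size)."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_blocks(a_levels: list, b_levels: list) -> list:
    """Indices of the data blocks on which two same-shaped trees disagree."""
    top = len(a_levels) - 1
    if a_levels[top][0] == b_levels[top][0]:
        return []                            # equal roots: replicas in sync
    suspects = [0]                           # start from the root
    for depth in range(top - 1, -1, -1):
        next_suspects = []
        for idx in suspects:
            for child in (2 * idx, 2 * idx + 1):
                if a_levels[depth][child] != b_levels[depth][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects
```

Only the blocks returned by `differing_blocks` need to be sent to the replica that is behind, which is where the "minimal data transfer" on the slides comes from.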
#67-#72 Hash trees can be used to protect any kind of data stored, handled and transferred in and between computers.
Before downloading a file on a P2P network, the top hash is acquired from a trusted source. Once the top hash (root hash) is available, the hash tree can be received from any non-trusted source.
Currently the main use of hash trees is to make sure that data blocks received from other peers in a peer-to-peer network arrive undamaged and unaltered, and even to check that the other peers do not lie and send fake blocks.
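The download flow in this note (trusted root hash first, tree from anyone, then block-by-block checks) can be sketched like this. It is a minimal illustration that assumes a power-of-two number of leaves and that the untrusted peer sends the list of leaf hashes; the function names are made up for the example.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def root_from_leaves(leaf_hashes: list) -> bytes:
    """Recompute the root hash from leaf hashes (non-empty, power-of-two length)."""
    level = list(leaf_hashes)
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def accept_tree(leaf_hashes: list, trusted_root: bytes) -> bool:
    """A tree from an untrusted peer is accepted only if it hashes to the trusted root."""
    return root_from_leaves(leaf_hashes) == trusted_root


def accept_block(index: int, block: bytes, leaf_hashes: list) -> bool:
    """A received block is accepted only if it matches its leaf in the accepted tree."""
    return h(block) == leaf_hashes[index]
```

Once `accept_tree` has succeeded, each block can be checked as it arrives from any peer, so a damaged or fake block is detected immediately instead of after the whole file has been downloaded.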
#73-#74 Merkle trees are exchanged; if they disagree, Cassandra does a range repair via compaction (using the Scuttlebutt reconciliation).
To ensure the data stays in sync even when no reads or writes occur, replica nodes periodically gossip with each other to find out whether any of them is out of sync. For each key range of data, each member of the replica group computes a Merkle tree (a hash tree in which differences can be located quickly) and sends it to its neighbours. By comparing the received Merkle tree with its own tree, each member can quickly determine which portion of the data is out of sync. If any is, it sends the diff to the members that are behind.
Anti-entropy is the "catch-all" way to guarantee eventual consistency, but it is also pretty expensive and is therefore not done frequently. By combining the data sync with read repair and hinted handoff, we can keep the replicas pretty up to date.
The key difference in Cassandra's implementation of anti-entropy is that the Merkle trees are built per column family, and they are not maintained for longer than it takes to send them to neighbouring nodes. Instead, the trees are generated as snapshots of the dataset during major compactions: this means that excess data might be sent across the network, but it saves local disk I/O, and is preferable for very large datasets.