Bloom Filters
A Simple Problem
• Design a malicious-URL detector for Chrome.
• Say you work at Google on the Chrome team and want to add a browser
feature that warns users when the URL they entered is malicious.
• Given a database of about 1 million known malicious URLs, any dictionary
used for matching will be around 50MB (1 million URLs at an average length
of 50 characters). 50MB is too heavy for a browser, so this cannot be done
locally!
• Anything stored locally should take < 2MB of memory, about 25x smaller.
• You cannot even compress the strings (e.g. with Huffman coding) down to that size.
Universal hashing review
• Goals
• Design $h: \text{objects} \to \{0, \dots, R-1\}$ such that
• If $O_1 \ne O_2$ then $h(O_1) \ne h(O_2)$ with high probability.
• $h$ is cheap to compute and requires almost no memory.
• Classical Example:
• Randomly generate $a$ and $b$ (the seed), choose a prime $P \gg R$.
• $h(x) = ((ax + b) \bmod P) \bmod R$ (faster if $R = 2^{32}$, why?)
• For strings or other objects, treat them as bit arrays, randomly shuffle the bits based on
a seed, then take the mod.
• Example: see MurmurHash.
• Memory: A seed per hash function!
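The classical example above can be sketched in a few lines of Python. This is a minimal illustration, not a production hash: the function name `make_universal_hash`, the default prime `P = 2**31 - 1`, and the use of `random.Random` for the seed are all assumptions made for the example.

```python
import random

def make_universal_hash(R, P=2_147_483_647, seed=None):
    """Return h(x) = ((a*x + b) mod P) mod R for randomly drawn a and b.

    P = 2**31 - 1 is a prime picked for illustration; it should be much
    larger than R (and than the integer keys being hashed).
    """
    rng = random.Random(seed)
    a = rng.randrange(1, P)  # a must be non-zero
    b = rng.randrange(0, P)

    def h(x):
        return ((a * x + b) % P) % R

    return h

h = make_universal_hash(R=1000, seed=42)
print(h(12345), h(12346))  # two values in 0..999
```

The only state kept per hash function is the pair (a, b), which is the "seed per hash function" noted above. When R is a power of two such as 2^32, the final `% R` reduces to keeping the low-order bits, which is typically cheaper than a general modulo.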
More concrete problem
• Given a universal hash function
• $h: \text{strings} \to \{0, \dots, 999\}$
• Given 100 queries (strings, average length 50 characters). Storing them takes 100 x 50 x 8
= 40,000 bits.
• Same task: given a new query, answer whether we have seen it.
• Given a new string, answer yes if it has been seen; if it has not been seen,
answer no, except for a wrong yes with small probability.
• What can we do in 1000 bits?
Bit-Maps and Universal Hashing
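[Figure: a bitmap, e.g. 0 0 1 0 1 0 1 0 0 1 0 1 1 1 0; inserting a string s sets bit h(s).]
• Given a query q, if bit h(q) is set to 1 return seen, else not seen.
• What is the chance of a false positive, given that N strings were inserted?
• $< 1 - \left(1 - \frac{1}{R}\right)^{N}$ ($R$ is the size of the bitmap)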
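A minimal sketch of the single-hash bitmap, using the numbers from the running example (R = 1000 bits, N = 100 strings). The hash here is SHA-256 reduced mod R, which is an assumption standing in for the universal hash; the names `insert`, `seen`, and `false_positive_bound` are made up for the example.

```python
import hashlib

R = 1000  # bitmap size in bits, as in the running example

def h(s: str) -> int:
    """Stand-in hash (assumption): SHA-256 of the string, reduced mod R."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % R

bitmap = [0] * R

def insert(s: str) -> None:
    bitmap[h(s)] = 1

def seen(s: str) -> bool:
    return bitmap[h(s)] == 1

for i in range(100):
    insert(f"query-{i}")

def false_positive_bound(n: int, r: int) -> float:
    """1 - (1 - 1/r)^n: chance that a given bit is set after n insertions."""
    return 1 - (1 - 1 / r) ** n

print(seen("query-7"))                          # True: inserted items are always found
print(round(false_positive_bound(100, R), 4))   # 0.0952
```

So with 1000 bits instead of 40,000, roughly one unseen query in ten would be wrongly reported as seen.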
Can we do better? Bloom Filters
• Use K independent hash functions instead of 1.
• Just select the seeds independently.
• K = 3 illustration
• Given a query q, if all of $h_1(q), h_2(q), h_3(q)$ are set, return seen, else not seen.
• False positive rate?
[Figure: the bitmap 0 0 1 0 1 0 1 0 0 1 0 1 1 1 0, with strings $S_1$ and $S_2$ hashed into it.]
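The K-hash idea can be sketched as a small Python class. This is an illustrative sketch under stated assumptions, not a reference implementation: the class name `BloomFilter` is made up, and the K hash functions are derived by prefixing a seed to the item and taking SHA-256, which stands in for K independently seeded universal hashes.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.R = num_bits
        self.K = num_hashes
        # One byte per bit keeps the sketch readable; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        # Seeded SHA-256 as a stand-in for K independent hash functions (assumption).
        for seed in range(self.K):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.R

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        # "Seen" only if every one of the K bits is set.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(num_bits=1000, num_hashes=3)
bf.add("http://evil.example/a")
print("http://evil.example/a" in bf)  # True: no false negatives
print("http://fine.example/b" in bf)  # almost certainly False, but a false positive is possible
```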
Illustration of Bloom Filters (K = 2)
• $h_1(x) = x \bmod 5$, $h_2(x) = (2x + 3) \bmod 5$
• Initialize the Bloom filter: 0 0 0 0 0
• Insert 9 (bits 4 and 1): 0 1 0 0 1
• Insert 11 (bits 1 and 0): 1 1 0 0 1
Illustration of Bloom Filters
• $h_1(x) = x \bmod 5$, $h_2(x) = (2x + 3) \bmod 5$
• Query 15: bits 0 and 3 → No
• Query 16: bits 1 and 0 → Yes (false positive)
• Filter state: 1 1 0 0 1
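The K = 2 illustration can be replayed directly in Python. The two hash functions and the inserted and queried values are the ones from the slides; the helper names `insert` and `query` are made up for the example.

```python
R = 5
hashes = [
    lambda x: x % R,            # h1(x) = x mod 5
    lambda x: (2 * x + 3) % R,  # h2(x) = (2x + 3) mod 5
]
bits = [0] * R

def insert(x: int) -> None:
    for h in hashes:
        bits[h(x)] = 1

def query(x: int) -> bool:
    return all(bits[h(x)] for h in hashes)

insert(9)         # sets bits 4 and 1
insert(11)        # sets bits 1 and 0
print(bits)       # [1, 1, 0, 0, 1]
print(query(15))  # False: bit 3 is not set
print(query(16))  # True: bits 1 and 0 are set, although 16 was never inserted
```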
Properties
• If the queried element was inserted before, a Bloom filter always returns true.
• No false negatives.
• However, it can return true for an element that was never inserted.
• There is a chance of false positives.
• If false positives are rare then caches still work.
• Why and How?
A bit of Analysis
• Pr that a given bit is not set after N strings are inserted?
• $\left(1 - \frac{1}{R}\right)^{KN}$
• Pr that a given bit is set: $1 - \left(1 - \frac{1}{R}\right)^{KN}$
• Pr that all K bits $h_1(q), h_2(q), \dots, h_K(q)$ are set without having seen q:
• $\left(1 - \left(1 - \frac{1}{R}\right)^{KN}\right)^{K} \approx \left(1 - e^{-KN/R}\right)^{K}$
• Minimized at $K = \ln(2) \times \frac{R}{N}$. If R is, say, 10N, then K = 6.9, or 7.
• The optimum false positive rate is approx. $0.618^{R/N}$, which is about 0.008 for R = 10N.
• Compare that with about 0.1 with K = 1.
• So with N strings we need 10N bits of space. Compare with a hash table of $N \times 8 \times 50$ bits (assuming a mean of 50
characters) ⇒ a 40x reduction in memory.
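The numbers in this analysis can be checked with a short script. It evaluates the approximation $(1 - e^{-KN/R})^K$ for the opening example of 1 million URLs with R = 10N; the function names are made up for the example.

```python
import math

def false_positive_rate(R: int, N: int, K: int) -> float:
    """Approximate false positive rate (1 - e^{-KN/R})^K."""
    return (1 - math.exp(-K * N / R)) ** K

def optimal_k(R: int, N: int) -> float:
    """K that minimizes the rate above: ln(2) * R / N."""
    return math.log(2) * R / N

N = 1_000_000   # number of malicious URLs in the opening example
R = 10 * N      # 10 bits per URL

print(optimal_k(R, N))               # about 6.93, so use K = 7
print(false_positive_rate(R, N, 7))  # about 0.0082
print(false_positive_rate(R, N, 1))  # about 0.095
print(R / 8 / 1e6, "MB")             # 1.25 MB
```

At 10 bits per URL the filter takes about 1.25MB, which fits the < 2MB budget from the opening problem.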
Generic set compression
• Given a set $S$ of $n$ objects, where each object is heavy (strings,
etc.).
• A Bloom filter can compress $S$ to around 10 bits per object
and still answer membership queries efficiently, with a small chance of
false positives.
• It can be updated dynamically on the fly.
• How about deletions?
Use of Bloom Filters from Wikipedia
• The servers of Akamai Technologies, a content delivery provider, use Bloom filters to prevent "one-hit-wonders" from being stored
in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly
three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that
object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and
increasing disk cache hit rates.
• Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce the disk lookups for
non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
• The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local
Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if
that too returned a positive result).
• The Squid Web Proxy Cache uses Bloom filters for cache digests.
• Bitcoin uses Bloom filters to speed up wallet synchronization.
• The Venti archival storage system uses Bloom filters to detect previously stored data.
• The SPIN model checker uses Bloom filters to track the reachable state space for large verification problems.
• The Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is
significantly larger than the other (often called Bloom join in the database literature).
• The Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature.
• Medium uses Bloom filters to avoid recommending articles a user has previously read.
• Ethereum uses Bloom filters for quickly finding logs on the Ethereum blockchain.
Popular Use: One-hit-wonders
• Content delivery networks deploy web caches around the world to cache
and serve web content to users with greater performance and reliability.
• A key application of Bloom filters is their use in efficiently determining
which web objects to store in these web caches.
• Nearly three-quarters of the URLs accessed from a typical web cache are
"one-hit-wonders" that are accessed by users only once and never again.
• To prevent caching one-hit-wonders, a Bloom filter is used to keep track of
all URLs that are accessed by users.
• A web object is cached only when it has been accessed at least once
before, i.e., the object is cached on its second request.
• The use of a Bloom filter in this fashion significantly reduces the disk write
workload, since one-hit-wonders are never written to the disk cache.
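The cache-on-second-request policy can be sketched as follows. This is a toy model under stated assumptions, not how any particular CDN implements it: the filter size, the seeded SHA-256 hashes, the in-memory `disk_cache` dictionary, and the `fetch_from_origin` helper are all hypothetical stand-ins.

```python
import hashlib

R, K = 10_000, 7          # illustrative sizes, not tuned values
seen_bits = bytearray(R)  # Bloom filter over every URL requested so far
disk_cache = {}           # stand-in for the disk cache

def positions(url: str):
    # Seeded SHA-256 as a stand-in for K independent hash functions (assumption).
    for seed in range(K):
        digest = hashlib.sha256(f"{seed}:{url}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % R

def fetch_from_origin(url: str) -> str:
    return f"<content of {url}>"  # hypothetical origin fetch

def handle_request(url: str) -> str:
    if url in disk_cache:
        return disk_cache[url]
    body = fetch_from_origin(url)
    pos = list(positions(url))
    if all(seen_bits[p] for p in pos):
        disk_cache[url] = body   # requested before (or a false positive): admit to cache
    else:
        for p in pos:
            seen_bits[p] = 1     # first request: remember the URL, skip the disk write
    return body

handle_request("/logo.png")
print("/logo.png" in disk_cache)  # False: not cached on the first request
handle_request("/logo.png")
print("/logo.png" in disk_cache)  # True: cached on the second request
```

A false positive here only means that an occasional one-hit-wonder gets cached anyway, which costs a little disk work but never returns wrong content.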
Deletion: Option 1
• Use two bloom filters
• One to keep track of added elements
• One to keep track of deleted elements
• What are the chances of false negatives?
• What are the chances of false positives?
• Decreased!
A Popular Alternative
• Counting filters
• Fan, Li; Cao, Pei; Almeida, Jussara; Broder, Andrei (2000), "Summary Cache: A
Scalable Wide-Area Web Cache Sharing Protocol"
• In a counting filter the array positions (buckets) are extended from
being a single bit to being an n-bit counter.
• Addition: Increment
• Deletion: Decrement
• Chances of both false positives and false negatives.
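A counting filter can be sketched by replacing each bit with a counter. This is a simplified illustration: the class name is made up, the hashes are seeded SHA-256 stand-ins, and the counters are unbounded Python integers, whereas a real counting filter would use small fixed-width counters that can overflow.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, num_counters: int, num_hashes: int):
        self.R = num_counters
        self.K = num_hashes
        self.counts = [0] * num_counters  # unbounded ints here; real filters use n-bit counters

    def _positions(self, item: str):
        # Seeded SHA-256 as a stand-in for K independent hash functions (assumption).
        for seed in range(self.K):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.R

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.counts[p] += 1

    def remove(self, item: str) -> None:
        # Only meaningful for items that were actually added.
        for p in self._positions(item):
            if self.counts[p] > 0:
                self.counts[p] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counts[p] > 0 for p in self._positions(item))

cbf = CountingBloomFilter(num_counters=1000, num_hashes=3)
cbf.add("x")
cbf.add("y")
cbf.remove("x")
print("y" in cbf)  # True: removing x only undoes x's own increments
```

False negatives can arise if `remove` is called for an item that was never added (it may decrement counters that belong to other items), or if fixed-width counters overflow.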
Union of two bloom filters?
• Given Bloom filter $B_1$ for set $S_1$ and another Bloom filter $B_2$ for set $S_2$,
with the same size and hash functions.
• What is the Bloom filter of $S_1 \cup S_2$?
• Just OR the two Bloom filters.
• Bloom filters can be organized in distributed data structures to
perform fully decentralized computations of aggregate functions.
Decentralized aggregation makes them ideal for several applications by
avoiding costly communication.
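A small sketch of the OR-based union, reusing the two hash functions from the earlier K = 2 illustration so the result can be checked by hand. The helper names `build` and `union` are made up for the example.

```python
R = 5
hashes = [
    lambda x: x % R,            # h1(x) = x mod 5
    lambda x: (2 * x + 3) % R,  # h2(x) = (2x + 3) mod 5
]

def build(items):
    """Build a Bloom filter (as a list of bits) over the given integers."""
    bits = [0] * R
    for x in items:
        for h in hashes:
            bits[h(x)] = 1
    return bits

def union(bits_a, bits_b):
    """Bitwise OR; only valid when both filters share size and hash functions."""
    if len(bits_a) != len(bits_b):
        raise ValueError("filters must have the same size")
    return [a | b for a, b in zip(bits_a, bits_b)]

b1 = build([9])    # [0, 1, 0, 0, 1]
b2 = build([11])   # [1, 1, 0, 0, 0]
print(union(b1, b2))                     # [1, 1, 0, 0, 1]
print(union(b1, b2) == build([9, 11]))   # True
```

Each node can therefore build a filter over its own data and ship only the bit array; the receiver merges filters without ever seeing the underlying elements.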
Shrink Size of Bloom Filters?
• Can we shrink the bloom filter to half of its size?
• How about doubling its size?
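One possible answer to the halving question, offered as a sketch rather than as the lecture's intended solution: if every position is computed as g(x) mod R for some function g that does not depend on R, and R is even, then OR-ing the top half of the filter onto the bottom half gives exactly the filter that would have been built with size R/2. The functions `g1` and `g2` and the helper names below are assumptions made for the example.

```python
def g1(x: int) -> int:
    return x            # hypothetical base hash, independent of R

def g2(x: int) -> int:
    return 3 * x + 1    # hypothetical base hash, independent of R

def build(items, R: int):
    """Bloom filter of size R using positions g(x) mod R."""
    bits = [0] * R
    for x in items:
        for g in (g1, g2):
            bits[g(x) % R] = 1
    return bits

def fold_in_half(bits):
    """Shrink a filter of even size R to R/2 by OR-ing its two halves."""
    half = len(bits) // 2
    return [bits[i] | bits[i + half] for i in range(half)]

big = build([5, 10], R=16)     # bits 0, 5, 10, 15 are set
small = fold_in_half(big)
print(small)                           # [1, 0, 1, 0, 0, 1, 0, 1]
print(small == build([5, 10], R=8))    # True
```

This works because (g mod R) mod (R/2) equals g mod (R/2). The folded filter has a higher false positive rate, since the same items now share half as many bits. Going the other way is harder: folding discards which half each bit came from, so a filter cannot be doubled from its bits alone.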
Weakness of Bloom Filters
• Needs fully independent hash functions. (Hard)
• The space usage is about 1.44x the information-theoretic minimum.
• Dynamically growing Bloom filters is hard. The best size depends on the target false
positive rate and the number of insertions.
A problem to ponder on
• You want to know if two people share friends on Facebook.
• For privacy and memory reasons, Facebook only gives you a
compressed Bloom filter of their graph.
• What should you ask Facebook to compress?
• How would you do it, and how good will it be?
