Bloom Filters
Kira Radinsky
Slides based on material from:
Michael Mitzenmacher and Hanoch Levy
Motivation - Cache
• Lookup questions:
Does item “x” exist in a set?
• Data set may be very big or expensive to
access. Filter lookup questions with negative
results before accessing data.
• Allow false positive errors, as they only cost us an
extra data access.
• Don’t allow false negative errors, because they
result in wrong answers.
Application of Bloom Filters:
Distributed Web Caches
[Figure: Web Cache 1 through Web Cache 6]
• Send Bloom filters of URLs.
• False positives do not hurt much.
– Get errors from cache changes anyway
Web Caching
• Summary Cache: [Fan, Cao, Almeida, & Broder]
If local caches know each other’s content...
…try local cache before going out to Web
• Sending/updating lists of URLs too expensive.
• Solution: use Bloom filters.
• False positives
– Local requests go unfulfilled.
– Small cost, big potential gain
The Problem Solved by BF:
Approximate Set Membership
• Lookup Problem: Given a set S = {x1,x2,…,xn}, construct
data structure to answer queries of the form
“Is y in S?”
• Data structure should be:
– Fast (Faster than searching through S).
– Small (Smaller than explicit representation).
• To obtain speed and size improvements, allow some
probability of error.
– False positives: y ∉ S but we report y ∈ S
– False negatives: y ∈ S but we report y ∉ S
Bloom Filters
Start with an m bit array, filled with 0s.
Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0B
0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B
To check if y is in S, check B at Hi(y). All k values must be 1.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
Possible to have a false positive; all k values are 1, but y is not in S.
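The insert and lookup steps above can be sketched in Python. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter`, the parameters `m=64` and `k=3`, and the use of salted SHA-256 to simulate the k hash functions Hi are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array B and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # B, initially all 0s

    def _positions(self, item: str):
        # Assumption: the i-th hash H_i is simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.bits[a] = 1  # if H_i(x) = a, set B[a] = 1

    def __contains__(self, item: str) -> bool:
        # All k positions must be 1; a single 0 proves the item was never added.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=64, k=3)
for url in ["a.com", "b.org", "c.net"]:
    bf.add(url)

print("a.com" in bf)  # True: inserted items are always found
```

A lookup for an item that was never added usually returns False, but it can return True if other items happen to have set all k of its positions.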
Bloom Filter
[Figure: element x is hashed by h1(x), h2(x), h3(x), …, hk(x) to positions in the bit vector V0 … Vm-1 (01000 10100 00010)]
Advantages
• No Overflow
• Union and intersection of Bloom filters
– Simple bitwise OR and AND operations (for filters with the same size and hash functions)
• Applications:
– Google BigTable
– The Squid Web Proxy Cache uses Bloom filters for
cache digests.
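The union and intersection property listed above can be illustrated with a short sketch that stores each filter as a Python integer. The names `build` and `might_contain`, the sizes `M` and `K`, and the salted SHA-256 hashing are assumptions chosen for the example.

```python
import hashlib

M, K = 64, 3  # hypothetical filter size and number of hash functions


def positions(item: str):
    # Assumption: salted SHA-256 stands in for k independent hash functions.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]


def build(items) -> int:
    """Return the filter as an M-bit Python integer."""
    bits = 0
    for item in items:
        for a in positions(item):
            bits |= 1 << a
    return bits


def might_contain(bits: int, item: str) -> bool:
    return all((bits >> a) & 1 for a in positions(item))


f1 = build(["a.com", "b.org"])
f2 = build(["b.org", "c.net"])

union = f1 | f2         # identical to the filter built from both sets together
intersection = f1 & f2  # never misses a common item, but may keep extra 1s

print(union == build(["a.com", "b.org", "c.net"]))  # True
print(might_contain(intersection, "b.org"))        # True
```

The OR result is exactly the filter of the union. The AND result is only an approximation of the filter of the intersection: it can contain bits that both filters set for different items, so its false positive rate may be somewhat higher.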
Bloom Errors
[Figure: bit vector V0 … Vm-1 (01000 10100 00010) with inserted elements a, b, c, d and the hash positions h1(x), h2(x), h3(x), …, hk(x) of a query x]
x didn’t appear, yet its bits are already set
Example
[Figure: false positive rate (0 to 0.1) plotted against the number of hash functions (0 to 10)]
m/n = 8
Opt k = 8 ln 2 = 5.545...
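The curve in this example can be reproduced numerically. The sketch below assumes the standard approximation f = (1 - e^(-kn/m))^k for the false positive rate, which is not stated on the slide; the function name `false_positive_rate` is hypothetical.

```python
import math


def false_positive_rate(bits_per_item: float, k: int) -> float:
    # Standard approximation: f = (1 - e^(-k n / m))^k, with m/n = bits_per_item.
    return (1.0 - math.exp(-k / bits_per_item)) ** k


bits_per_item = 8  # m/n = 8, as in the example
for k in range(1, 11):
    print(k, round(false_positive_rate(bits_per_item, k), 4))

k_opt = bits_per_item * math.log(2)
print(round(k_opt, 3))  # 5.545
```

Since k must be an integer, the practical choice here is k = 5 or k = 6; under this approximation both give a false positive rate of roughly 2%.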
Tradeoffs
• Three parameters.
– Size m/n: bits per item.
• |S| = n: number of elements to encode.
• hi: U → [1..m]: maintain a bit vector V of size m.
– Time k: number of hash functions.
• Use k hash functions (h1..hk).
– Error f: false positive probability.
Bloom Filter Tradeoffs
• Three factors: m, k and n.
• Normally, n and m are given, and we select k.
• Small k
– Fewer computations.
– The number of bits set (at most nk) is smaller, so the chance of a “step
over” is smaller too.
– However, fewer bits need to be stepped over to generate an error.
• For big k, the exact opposite holds.
• Not surprisingly, when k is optimal, the “hit ratio” (fraction of bits
set to 1 in the array) is exactly 0.5.
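The claim about the hit ratio can be checked with a small sketch. It assumes the usual approximation that a given bit is still 0 with probability e^(-kn/m), and it models ideal hash functions with uniform random positions. The sizes `m` and `n` and both function names are hypothetical.

```python
import math
import random


def expected_fill(m: int, n: int, k: float) -> float:
    # Expected fraction of 1 bits, using (1 - 1/m)^(k n) ~ e^(-k n / m).
    return 1.0 - math.exp(-k * n / m)


def simulated_fill(m: int, n: int, k: int, seed: int = 0) -> float:
    # Assumption: ideal hash functions are modelled by uniform random positions.
    rng = random.Random(seed)
    bits = [0] * m
    for _ in range(n * k):
        bits[rng.randrange(m)] = 1
    return sum(bits) / m


m, n = 8000, 1000  # hypothetical sizes with m/n = 8
k_opt = (m / n) * math.log(2)

print(round(expected_fill(m, n, k_opt), 3))  # 0.5
print(simulated_fill(m, n, round(k_opt)))     # close to expected_fill(m, n, 6)
```

The simulated value uses the nearest integer k, so it lands slightly above one half rather than exactly on it.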
Alternative Approach for
Bloom Filters: Perfect Hashing Approach
[Figure: a perfect hash function maps Element 1 to Element 5 to cells holding Fingerprint(4), Fingerprint(5), Fingerprint(2), Fingerprint(1), Fingerprint(3)]
Perfect Hashing Approach
• Folklore Bloom filter construction.
– Recall: Given a set S = {x1,x2,x3,…xn} on a universe U, we want
to answer membership queries.
– Method: Find an n-cell perfect hash function for S.
• Maps set of n elements to n cells in a 1-1 manner.
– Then keep a $\lceil \log_2(1/\epsilon) \rceil$-bit fingerprint of the item in each cell.
Lookups have false positive probability $< \epsilon$.
– Advantage: each bit/item reduces false positives by a factor
of 1/2, vs. about 0.6185 (that is, $(1/2)^{\ln 2}$) for a standard Bloom filter.
• Negatives:
– Perfect hash functions non-trivial to find.
– Cannot handle on-line insertions.
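To make the space comparison concrete, the sketch below computes bits per item for both approaches at a target false positive rate. It assumes a fingerprint of ceil(log2(1/eps)) bits for the perfect hashing approach, and for the standard Bloom filter it assumes the optimal k, which gives m/n = log2(1/eps) / ln 2. Both function names are hypothetical.

```python
import math


def fingerprint_bits(eps: float) -> int:
    # Perfect hashing approach: ceil(log2(1/eps)) fingerprint bits per item.
    return math.ceil(math.log2(1.0 / eps))


def bloom_bits_per_item(eps: float) -> float:
    # Standard Bloom filter with optimal k: m/n = log2(1/eps) / ln 2.
    return math.log2(1.0 / eps) / math.log(2)


for eps in (0.1, 0.01, 0.001):
    print(eps, fingerprint_bits(eps), round(bloom_bits_per_item(eps), 2))
```

The ratio between the two columns approaches 1/ln 2, about 1.44, which is the space overhead a standard Bloom filter pays in exchange for supporting on-line insertions.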
Bloom Filters and Deletions
• Cache contents change
– Items both inserted and deleted.
• Insertions are easy – add bits to BF
• Can Bloom filters handle deletions?
– Use Counting Bloom Filters to track
insertions/deletions at hosts;
– Send Bloom filters.
Handling Deletions
• Bloom filters can handle insertions, but not
deletions.
• If deleting xi means resetting 1s to 0s, then
deleting xi will “delete” xj.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
[Figure: xi and xj hash to overlapping positions in B]
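The deletion problem can be shown with a tiny example. The bit positions below are hypothetical and chosen by hand so that xi and xj share position 9; no real hash function is involved.

```python
# Hypothetical hash positions: xi and xj collide at position 9.
positions = {"xi": [1, 4, 9], "xj": [9, 10, 13]}

B = [0] * 16
for item in positions:
    for a in positions[item]:
        B[a] = 1


def might_contain(item: str) -> bool:
    return all(B[a] for a in positions[item])


# Naive deletion of xi: reset its bits to 0.
for a in positions["xi"]:
    B[a] = 0

print(might_contain("xj"))  # False: xj was never deleted, yet it is now missing
```

Resetting a shared bit turns the filter's guarantee upside down: it now produces a false negative, which is exactly the kind of error the structure is supposed to rule out.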
Counting Bloom Filters
Start with an array of m counters, all set to 0.
Hash each item xj in S k times. If Hi(xj) = a, add 1 to B[a].
B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B: 0 3 0 0 1 0 2 0 0 3 2 1 0 2 1 0
To delete xj decrement the corresponding counters.
B: 0 2 0 0 0 0 2 0 0 3 2 1 0 1 1 0
Can obtain a corresponding Bloom filter by reducing to 0/1.
B: 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 0
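The counting variant described above can be sketched as follows. The class name `CountingBloomFilter`, the sizes, and the salted SHA-256 hashing are assumptions for illustration, and the counters are unbounded Python integers, so overflow is not modelled here.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.counters = [0] * m  # one counter per position instead of one bit

    def _positions(self, item: str):
        # Assumption: salted SHA-256 stands in for the k hash functions H_i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.counters[a] += 1

    def remove(self, item: str) -> None:
        # Only valid for items that were previously added.
        for a in self._positions(item):
            self.counters[a] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[a] > 0 for a in self._positions(item))

    def to_bloom_bits(self):
        # Reduce counters to 0/1 to obtain the plain Bloom filter to send.
        return [1 if c > 0 else 0 for c in self.counters]


cbf = CountingBloomFilter(m=16, k=3)
cbf.add("xi")
cbf.add("xj")
cbf.remove("xi")
print("xj" in cbf)  # True: deleting xi does not disturb xj
```

In the web cache setting, each host would keep the counters locally and send only the output of `to_bloom_bits()` to its peers.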
Counting Bloom Filters: Overflow
• Must choose counters large enough to avoid
overflow.
• Poisson approximation suggests 4 bits/counter.
– Average load using k = (ln 2)m/n counters is ln 2.
– Probability a counter has load at least 16:
$\approx e^{-\ln 2} (\ln 2)^{16} / 16! \approx 6.78 \times 10^{-17}$
• Failsafes possible.
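The overflow estimate can be checked numerically. The sketch below assumes counter loads follow a Poisson distribution with mean ln 2, as the slide's approximation does; the helper name `poisson_pmf` is hypothetical.

```python
import math


def poisson_pmf(j: int, lam: float) -> float:
    # P(X = j) for a Poisson random variable with mean lam.
    return math.exp(-lam) * lam ** j / math.factorial(j)


lam = math.log(2)  # average counter load at k = (ln 2) m / n

# Leading term of P(load >= 16): the probability of a load of exactly 16.
print(f"{poisson_pmf(16, lam):.1e}")  # 6.8e-17

# Full tail, summed far enough that the remaining terms are negligible.
tail = sum(poisson_pmf(j, lam) for j in range(16, 60))
print(f"{tail:.1e}")
```

A 4-bit counter holds values up to 15, so a load of 16 is the first that overflows. The tail is dominated by its first term, which is why the single-term estimate is adequate.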
Variations and Extensions
• Distance-Sensitive Bloom Filters
• Bloomier Filter
Extension: Distance-Sensitive Bloom Filters
• Instead of answering questions of the form
“Is y ∈ S?”
we would like to answer questions of the form
“Is y ≈ x ∈ S?”
• That is, is the query close to some element of the set, under
some metric and some notion of close.
• Applications:
– DNA matching
– Virus/worm matching
– Databases
• Some initial results [Kirsch, Mitzenmacher]. Hard.
Extension: Bloomier Filter
• Bloom filters handle set membership.
• Counters to handle multi-set/count tracking.
• Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]:
– Extend to handle approximate functions.
– Each element of set has associated function value.
– Non-set elements should return null.
– Want to always return correct function value for set
elements.
– A false positive returns a (non-null) function value for a non-set
element.

Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 

Tutorial 9 (bloom filters)

  • 1. Bloom Filters Kira Radinsky Slides based on material from: Michael Mitzenmacher and Hanoch Levy
  • 2. Motivation - Cache • Lookup questions: Does item “x” exist in a set? • Data set may be very big or expensive to access. Filter lookup questions with negative results before accessing data. • Allow false positive errors, as they only cost us an extra data access. • Don’t allow false negative errors, because they result in wrong answers.
  • 3. Application of Bloom Filters: Distributed Web Caches Web Cache 1 Web Cache 2 Web Cache 3 Web Cache 6 Web Cache 5 Web Cache 4 • Send Bloom filters of URLs. • False positives do not hurt much. – Get errors from cache changes anyway
  • 4. Web Caching • Summary Cache: [Fan, Cao, Almeida, & Broder] If local caches know each other’s content... …try local cache before going out to Web • Sending/updating lists of URLs too expensive. • Solution: use Bloom filters. • False positives – Local requests go unfulfilled. – Small cost, big potential gain
  • 5. The Problem Solved by BF: Approximate Set Membership • Lookup Problem: Given a set S = {x1,x2,…,xn}, construct data structure to answer queries of the form “Is y in S?” • Data structure should be: – Fast (Faster than searching through S). – Small (Smaller than explicit representation). • To obtain speed and size improvements, allow some probability of error. – False positives: y ∉ S but we report y ∈ S – False negatives: y ∈ S but we report y ∉ S
  • 6. Bloom Filters Start with an m bit array B, filled with 0s. B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1. B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0 To check if y is in S, check B at Hi(y). All k values must be 1. Possible to have a false positive; all k values are 1, but y is not in S.
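The construction and lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the lecture: the class name `BloomFilter` and the way the k hash functions are derived (salting SHA-256 with the index i) are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """An m-bit array B with k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # B, filled with 0s

    def _positions(self, item: str):
        # Assumption: H_1..H_k are simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # If H_i(x) = a, set B[a] = 1.
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item: str) -> bool:
        # All k probed bits must be 1; this may be a false positive.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=64, k=3)
for url in ["x1", "x2", "x3"]:
    bf.add(url)
print("x2" in bf)  # inserted items are always reported as present
```

A lookup that returns False is always correct; a lookup that returns True only means the item may be in S.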
  • 7. Bloom Filter [Figure: element x is hashed by h1(x), h2(x), h3(x), ..., hk(x) into positions of the bit vector V0 .. Vm-1.]
  • 8. Advantages • No Overflow • Union and intersection of Bloom filters – Simple bitwise OR and AND operations • Applications: – Google BigTable – The Squid Web Proxy Cache uses Bloom filters for cache digests.
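As a rough sketch of the union and intersection point, the snippet below stores each filter as a Python integer used as a bit vector, so OR and AND are single operators. The helper names (`positions`, `build`, `might_contain`) and the SHA-256-based hashing are assumptions for illustration; both filters must share the same m and hash functions for the bitwise operations to be meaningful.

```python
import hashlib

M, K = 128, 3  # hypothetical filter size and number of hash functions


def positions(item: str) -> list:
    # Assumption: k hash functions simulated by salting SHA-256 with i.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]


def build(items) -> int:
    bits = 0
    for item in items:
        for a in positions(item):
            bits |= 1 << a
    return bits


def might_contain(bits: int, item: str) -> bool:
    return all((bits >> a) & 1 for a in positions(item))


f1 = build(["a", "b"])
f2 = build(["b", "c"])

union = f1 | f2         # identical to the filter built from {a, b, c}
intersection = f1 & f2  # contains every bit of the filter for {b}

print(union == build(["a", "b", "c"]))
print(might_contain(intersection, "b"))
```

The OR result is exactly the filter of the union set. The AND result never loses a common element, but it may carry extra bits, so it can have more false positives than a filter built directly from the intersection.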
  • 9. Bloom Errors [Figure: elements a, b, c, d have been hashed into the bit vector V0 .. Vm-1; h1(x), h2(x), h3(x), ..., hk(x) all land on bits that are already 1.] x didn’t appear, yet its bits are already set
  • 10. Example [Plot: false positive rate (0 to 0.1) against number of hash functions (0 to 10), for m/n = 8.] Opt k = 8 ln 2 = 5.545...
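The curve in this example can be reproduced numerically. The sketch below assumes the standard approximation f ≈ (1 − e^(−kn/m))^k for the false positive rate, which the slide does not state explicitly; the function name `false_positive_rate` is made up for the illustration.

```python
import math


def false_positive_rate(bits_per_item: float, k: int) -> float:
    # Standard approximation: f = (1 - e^(-k n / m))^k, with m/n = bits_per_item.
    return (1.0 - math.exp(-k / bits_per_item)) ** k


bits_per_item = 8
rates = {k: false_positive_rate(bits_per_item, k) for k in range(1, 11)}
best_k = min(rates, key=rates.get)

print(bits_per_item * math.log(2))  # real-valued optimum, about 5.545
print(best_k, rates[best_k])        # best integer k and its rate
for k, f in rates.items():
    print(k, round(f, 4))
```

Since k must be an integer, the practical choice is whichever of the two neighbours of 8 ln 2 gives the lower rate; under this approximation the two are very close.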
  • 11. Tradeoffs • Three parameters. – Size m/n : bits per item. • |U| = n: Number of elements to encode. • hi: U → [1..m] : Maintain a Bit Vector V of size m – Time k : number of hash functions. • Use k hash functions (h1..hk) – Error f : false positive probability.
  • 12. Bloom Filter Tradeoffs • Three factors: m, k and n. • Normally, n and m are given, and we select k. • Small k – Fewer computations. – Actual number of bits accessed (nk) is smaller, so the chance of a “step over” is smaller too. – However, fewer bits need to be stepped over to generate an error. • For big k, the exact opposite holds. • Not surprisingly, when k is optimal, the “hit ratio” (ratio of bits flipped in the array) is exactly 0.5
  • 13. Alternative Approach for Bloom Filters: Perfect Hashing Approach Element 1 Element 2 Element 3 Element 4 Element 5 Fingerprint(4) Fingerprint(5) Fingerprint(2) Fingerprint(1) Fingerprint(3)
  • 14. Perfect Hashing Approach • Folklore Bloom filter construction. – Recall: Given a set S = {x1,x2,x3,…xn} on a universe U, we want to answer membership queries. – Method: Find an n-cell perfect hash function for S. • Maps set of n elements to n cells in a 1-1 manner. – Then keep a $\lceil \log_2(1/\epsilon) \rceil$-bit fingerprint of the item in each cell. Lookups have false positive < $\epsilon$. – Advantage: each bit/item reduces false positives by a factor of 1/2, vs. $(1/2)^{\ln 2} \approx 0.62$ for a standard Bloom filter. • Negatives: – Perfect hash functions non-trivial to find. – Cannot handle on-line insertions.
  • 15. Bloom Filters and Deletions • Cache contents change – Items both inserted and deleted. • Insertions are easy – add bits to BF • Can Bloom filters handle deletions? – Use Counting Bloom Filters to track insertions/deletions at hosts; – Send Bloom filters.
  • 16. Handling Deletions • Bloom filters can handle insertions, but not deletions. • If deleting xi means resetting 1s to 0s, then deleting xi will “delete” xj whenever the two share a bit. B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0 [Figure: xi and xj point into B.]
  • 17. Counting Bloom Filters Start with an array B of m counters, filled with 0s. B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Hash each item xj in S k times. If Hi(xj) = a, add 1 to B[a]. B = 0 3 0 0 1 0 2 0 0 3 2 1 0 2 1 0 To delete xj decrement the corresponding counters. B = 0 2 0 0 0 0 2 0 0 3 2 1 0 1 1 0 Can obtain a corresponding Bloom filter by reducing to 0/1. B = 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 0
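A counting Bloom filter along these lines might look as follows. This is a hedged sketch: the class name `CountingBloomFilter`, the SHA-256-based hash derivation, and the unbounded Python integers used as counters are all assumptions of the example (a real deployment would use small fixed-width counters). It also assumes that only items that were actually inserted are ever removed.

```python
import hashlib


class CountingBloomFilter:
    """Array of m counters with k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item: str):
        # Assumption: k hash functions simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.counters[a] += 1

    def remove(self, item: str) -> None:
        # Only valid for items that were previously added.
        for a in self._positions(item):
            self.counters[a] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[a] > 0 for a in self._positions(item))

    def to_bits(self) -> list:
        # Reduce counters to 0/1 to obtain the plain Bloom filter to send.
        return [1 if c > 0 else 0 for c in self.counters]


cbf = CountingBloomFilter(m=32, k=3)
cbf.add("xi")
cbf.add("xj")
cbf.remove("xi")
print("xj" in cbf)  # deleting xi no longer "deletes" xj
```

Because shared positions are counted rather than overwritten, removing one item leaves the counters contributed by the others intact.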
  • 18. Counting Bloom Filters: Overflow • Must choose counters large enough to avoid overflow. • Poisson approximation suggests 4 bits/counter. – Average load using k = (ln 2)m/n hash functions is ln 2. – Probability a counter has load at least 16: $\approx e^{-\ln 2} (\ln 2)^{16} / 16! \approx 6.78\mathrm{E}{-17}$ • Failsafes possible.
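The overflow estimate can be checked numerically. The sketch below treats a counter's load as Poisson with mean ln 2, as the slide suggests, and compares the single dominant term at load 16 with the full tail; the helper name `poisson_pmf` and the cutoff of 60 terms are choices made for this illustration.

```python
import math

lam = math.log(2)  # average load per counter when k = (ln 2) m / n


def poisson_pmf(j: int, mean: float) -> float:
    return math.exp(-mean) * mean ** j / math.factorial(j)


# Dominant term used on the slide: e^(-ln 2) (ln 2)^16 / 16!
p16 = poisson_pmf(16, lam)

# Full tail P(load >= 16); terms beyond j = 60 are negligible.
tail = sum(poisson_pmf(j, lam) for j in range(16, 60))

print(p16)   # about 6.78e-17
print(tail)  # slightly larger, same order of magnitude
```

A 4-bit counter overflows at load 16, so under this approximation the per-counter overflow probability is on the order of 10^-16.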
  • 19. Variations and Extensions • Distance-Sensitive Bloom Filters • Bloomier Filter
  • 20. Extension: Distance-Sensitive Bloom Filters • Instead of answering questions of the form “Is y ∈ S?” we would like to answer questions of the form “Is y ≈ x ∈ S?” • That is, is the query close to some element of the set, under some metric and some notion of close. • Applications: – DNA matching – Virus/worm matching – Databases • Some initial results [KirschMitzenmacher]. Hard.
  • 21. Extension: Bloomier Filter • Bloom filters handle set membership. • Counters to handle multi-set/count tracking. • Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]: – Extend to handle approximate functions. – Each element of set has associated function value. – Non-set elements should return null. – Want to always return correct function value for set elements. – A false positive returns a non-null function value for a non-set element.