Bloom Filters
A Simple Problem
• Design Chrome Malicious URL Detector
• Let's say you work for Google, on the Chrome team, and you want to add a
feature to the browser that notifies the user if the URL they have entered is
malicious.
• Given a database of about 1 million known malicious URLs, any dictionary
for matching will be around 50MB (1 million URLs at an average length of 50
characters). 50MB is too heavy for a browser, so this cannot be done
locally!
• Anything local should use < 2MB of memory (25x smaller).
• Wait, you cannot even compress the strings (e.g., with Huffman coding) to that size.
Universal hashing review
• Goals
• Design $h: \text{objects} \to \{0, \dots, R-1\}$ such that
• If $O_1 \neq O_2$ then $h(O_1) \neq h(O_2)$ with high probability.
• h is cheap to compute and requires almost no memory.
• Classical Example:
• Randomly generate a and b (the seed), choose a prime P >> R.
• $h(x) = ((ax + b) \bmod P) \bmod R$ (faster if $R = 2^{32}$, why?)
• For strings or other objects, treat them as bit arrays, randomly shuffle the bits
based on a seed, then take the mod.
• Example: see MurmurHash.
• Memory: a seed per hash function!
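A minimal Python sketch of the classical construction above, assuming integer keys. The prime P = 2^61 - 1, the helper name make_hash, and the use of random.Random to draw a and b from a seed are illustrative assumptions, not part of the slides.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed to be much larger than R


def make_hash(seed: int, R: int):
    """Return h(x) = ((a*x + b) mod P) mod R, with a and b drawn from the seed."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)  # a must be nonzero
    b = rng.randrange(0, P)

    def h(x: int) -> int:
        return ((a * x + b) % P) % R

    return h


# Two hash functions that differ only in their seed.
h1 = make_hash(seed=1, R=1000)
h2 = make_hash(seed=2, R=1000)
print(h1(42), h2(42))  # two values in 0..999; the exact values depend on the seeds
```

The only state kept per hash function is the seed (here expanded into a and b), which matches the "a seed per hash function" memory cost.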
More concrete problem
• Given a universal hash function
• $h: \text{strings} \to \{0, \dots, 999\}$
• Given 100 queries (strings of average length 50 characters), storing them takes
100 x 50 x 8 = 40,000 bits.
• Same task: given a new query, answer whether we have seen it.
• Given a new string, answer yes if it has been seen; if it has not been seen,
answer yes only with small probability.
• What can we do in 1000 bits?
Bit-Maps and Universal Hashing
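[Figure: bitmap 0 0 1 0 1 0 1 0 0 1 0 1 1 1 0; each string s sets the bit at index h(s)]
• Given a query q, if the bit at index h(q) is 1, return seen, else not seen.
• What is the chance of a false positive, given that there are N strings?
• $< 1 - \left(1 - \frac{1}{R}\right)^{N}$ (R is the size of the bitmap)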
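The single-hash bitmap above can be sketched as follows. This is a rough illustration, not the lecture's code: SHA-256 stands in for the universal hash, and the names BitmapSet and false_positive_bound are made up for this example.

```python
import hashlib


def h(s: str, R: int) -> int:
    """Hash a string to a bit position in 0..R-1 (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % R


class BitmapSet:
    def __init__(self, R: int):
        self.R = R
        self.bits = bytearray(R)  # one byte per bit, for readability

    def add(self, s: str) -> None:
        self.bits[h(s, self.R)] = 1

    def seen(self, s: str) -> bool:
        return self.bits[h(s, self.R)] == 1


def false_positive_bound(N: int, R: int) -> float:
    """The bound from the slide: 1 - (1 - 1/R)^N."""
    return 1 - (1 - 1 / R) ** N


bm = BitmapSet(1000)
for i in range(100):
    bm.add(f"query-{i}")

print(bm.seen("query-7"))               # True: inserted strings are always reported as seen
print(false_positive_bound(100, 1000))  # roughly 0.095
```

With 100 strings in 1000 bits, roughly one unseen query in ten is wrongly reported as seen.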
Can we do better? Bloom Filters
• Use K independent hash functions instead of 1.
• Just select the seeds independently.
• K = 3 illustration
• Given a query q, if the bits at $h_1(q)$, $h_2(q)$, $h_3(q)$ are all set, return seen, else not seen.
• False positive rate?
[Figure: bitmap 0 0 1 0 1 0 1 0 0 1 0 1 1 1 0, with strings $S_1$ and $S_2$ mapped into it]
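A minimal sketch of a Bloom filter with K seeded hash functions, under stated assumptions: the K hashes are derived from SHA-256 by prefixing the seed to the item, which is an illustrative shortcut rather than the universal hashing family from the earlier slide. The class name BloomFilter is hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, R: int, K: int):
        self.R = R                # number of bits
        self.K = K                # number of hash functions
        self.bits = bytearray(R)  # one byte per bit, for readability

    def _positions(self, item: str):
        # One hash per seed 0..K-1; the seed is mixed in as a prefix.
        for seed in range(self.K):
            data = f"{seed}:{item}".encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.R

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # "Seen" only if every one of the K bits is set.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(R=1000, K=3)
bf.add("S1")
bf.add("S2")
print("S1" in bf)  # True: an inserted item is never reported as unseen
```

An item that was never added is reported as seen only if all K of its bits happen to have been set by other items.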
Illustration of Bloom Filters (K = 2)
• $h_1(x) = x \bmod 5$, $h_2(x) = (2x + 3) \bmod 5$
• Initialize the Bloom filter: 0 0 0 0 0
• Insert 9 (bits 4 and 1): 0 1 0 0 1
• Insert 11 (bits 1 and 0): 1 1 0 0 1
Illustration of Bloom Filters
• $h_1(x) = x \bmod 5$, $h_2(x) = (2x + 3) \bmod 5$
• Filter state: 1 1 0 0 1
• Query 15: bits 0 and 3 → No
• Query 16: bits 1 and 0 → Yes (False Positive)
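The K = 2 illustration can be replayed directly in Python. The function names insert and query are chosen for this sketch; the two hash functions and the inputs are the ones from the slides.

```python
def h1(x: int) -> int:
    return x % 5


def h2(x: int) -> int:
    return (2 * x + 3) % 5


bits = [0] * 5  # the 5-bit Bloom filter, initially empty


def insert(x: int) -> None:
    bits[h1(x)] = 1
    bits[h2(x)] = 1


def query(x: int) -> bool:
    return bits[h1(x)] == 1 and bits[h2(x)] == 1


insert(9)
print(bits)       # [0, 1, 0, 0, 1]
insert(11)
print(bits)       # [1, 1, 0, 0, 1]
print(query(15))  # False: bit 3 is not set
print(query(16))  # True, although 16 was never inserted (false positive)
```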
Properties
• If the queried element was inserted before, a Bloom filter always returns true.
• No false negatives.
• However, it can return true for an element that was never inserted.
• Chance of false positives.
• If false positives are rare, then caches still work.
• Why and How?
A bit of Analysis
• Pr that a given bit is not set after N strings are inserted?
• $\left(1 - \frac{1}{R}\right)^{KN}$
• Pr that a given bit is set: $1 - \left(1 - \frac{1}{R}\right)^{KN}$
• Pr that all K bits $h_1(q), h_2(q), \dots, h_K(q)$ are set without having seen q:
• $\left(1 - \left(1 - \frac{1}{R}\right)^{KN}\right)^{K} \approx \left(1 - e^{-KN/R}\right)^{K}$
• Minimized at $K = \ln(2) \times \frac{R}{N}$. If R is, say, 10N, then K = 6.9, or 7.
• Optimum false positive rate is approx. $0.618^{R/N}$, which is about 0.008 for R = 10N.
• Compare that with about 0.1 for K = 1.
• So with N strings we need 10N bits of space. Compare with a hash table of $N \times 8 \times 50$ bits (assuming a mean of 50
characters) ⇒ a 40x reduction in memory.
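The numbers in this analysis can be checked with a short calculation. This is only a numerical sanity check of the formulas on the slide; the function names are chosen for this sketch, and N = 1 million is taken from the opening URL example.

```python
import math


def false_positive_rate(K: int, N: int, R: int) -> float:
    """Expression from the analysis: (1 - (1 - 1/R)^(K*N))^K."""
    return (1 - (1 - 1 / R) ** (K * N)) ** K


def optimal_k(N: int, R: int) -> float:
    """K = ln(2) * R / N minimizes the false positive rate."""
    return math.log(2) * R / N


N = 1_000_000  # number of inserted strings
R = 10 * N     # 10 bits per string

print(optimal_k(N, R))               # about 6.93, so use K = 7
print(false_positive_rate(7, N, R))  # about 0.008
print(false_positive_rate(1, N, R))  # about 0.095
print((N * 8 * 50) / R)              # 40.0, the memory reduction factor
```

For the 1 million URLs of the opening problem, R = 10N is 10 million bits, about 1.25MB, which fits the 2MB budget.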
Generic set compression
• Given a set $S$ of $n$ objects, where each object is heavy (strings, etc.).
• A Bloom filter can compress $S$ to around 10 bits per object
and still answer membership queries efficiently, with a small chance of
false positives.
• It can be updated dynamically on the fly.
• How about deletions?
Use of Bloom Filters from Wikipedia
• The servers of Akamai Technologies, a content delivery provider, use Bloom filters to prevent "one-hit-wonders" from being stored
in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly
three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that
object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and
increasing disk cache hit rates.
• Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce disk lookups for
non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
• The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local
Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if
that too returned a positive result).
• The Squid Web Proxy Cache uses Bloom filters for cache digests.
• Bitcoin uses Bloom filters to speed up wallet synchronization.
• The Venti archival storage system uses Bloom filters to detect previously stored data.
• The SPIN model checker uses Bloom filters to track the reachable state space for large verification problems.
• The Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is
significantly larger than the other (often called Bloom join in the database literature).
• The Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature.
• Medium uses Bloom filters to avoid recommending articles a user has previously read.
• Ethereum uses Bloom filters for quickly finding logs on the Ethereum blockchain.
Popular Use: One-hit-wonders
• Content delivery networks deploy web caches around the world to cache
and serve web content to users with greater performance and reliability.
• A key application of Bloom filters is their use in efficiently determining
which web objects to store in these web caches.
• Nearly three-quarters of the URLs accessed from a typical web cache are
"one-hit-wonders" that are accessed by users only once and never again.
• To prevent caching one-hit-wonders, a Bloom filter is used to keep track of
all URLs that are accessed by users.
• A web object is cached only when it has been accessed at least once
before, i.e., the object is cached on its second request.
• The use of a Bloom filter in this fashion significantly reduces the disk write
workload, since one-hit-wonders are never written to the disk cache.
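The cache-on-second-request policy described above can be sketched as follows. This is an illustrative toy, not Akamai's implementation: the class names, the SHA-256 based hashing, the dictionary standing in for the disk cache, and the fetch callback are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, R: int, K: int):
        self.R, self.K = R, K
        self.bits = bytearray(R)

    def _positions(self, item: str):
        for seed in range(self.K):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.R

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


class SecondHitCache:
    """Admit an object to the cache only on its second request."""

    def __init__(self, R: int = 10_000, K: int = 7):
        self.seen_once = BloomFilter(R, K)
        self.cache = {}  # stand-in for the disk cache

    def request(self, url: str, fetch) -> bytes:
        if url in self.cache:
            return self.cache[url]   # cache hit
        body = fetch(url)            # cache miss: go to the origin
        if url in self.seen_once:
            self.cache[url] = body   # second request: now worth caching
        else:
            self.seen_once.add(url)  # first request: only remember the URL
        return body


def origin(url: str) -> bytes:
    return b"content of " + url.encode("utf-8")


cdn = SecondHitCache()
cdn.request("/a", origin)
print("/a" in cdn.cache)  # False: a first request is never cached
cdn.request("/a", origin)
print("/a" in cdn.cache)  # True: cached on the second request
```

A false positive in the filter only means that some one-hit-wonder gets cached on its first request, so correctness is unaffected and only a little disk work is wasted.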
Deletion: Option 1
• Use two bloom filters
• One to keep track of added elements
• One to keep track of deleted elements
• What are the chances of false negatives?
• What are the chances of false positives?
• Decreased!
A Popular Alternative
• Counting filters
• Fan, Li; Cao, Pei; Almeida, Jussara; Broder, Andrei (2000), "Summary Cache: A
Scalable Wide-Area Web Cache Sharing Protocol"
• In a counting filter the array positions (buckets) are extended from
being a single bit to being an n-bit counter.
• Addition: Increment
• Deletion: Decrement
• Chances of both false positives and false negatives.
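A minimal sketch of a counting filter, assuming SHA-256 based hashes and unbounded Python integers as counters (real counting filters use small fixed-width counters, which can overflow). The class and method names are chosen for this example.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, R: int, K: int):
        self.R, self.K = R, K
        self.counts = [0] * R  # one counter per bucket instead of one bit

    def _positions(self, item: str):
        for seed in range(self.K):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.R

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counts[pos] += 1

    def remove(self, item: str) -> None:
        # Only safe for items that were actually added: removing anything else
        # can drive another item's counter to zero and cause a false negative.
        for pos in self._positions(item):
            if self.counts[pos] > 0:
                self.counts[pos] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counts[pos] > 0 for pos in self._positions(item))


cf = CountingBloomFilter(R=1000, K=3)
cf.add("a")
cf.add("b")
cf.remove("a")
print("b" in cf)  # True: removing "a" only undoes the increments made by "a"
```

The price is memory: each bucket now holds a counter of several bits rather than a single bit.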
Union of two bloom filters?
• Given a Bloom filter $B_1$ for set $S_1$ and another Bloom filter $B_2$ for set $S_2$
with the same size and the same hash functions.
• What is the Bloom filter of $S_1 \cup S_2$?
• Just OR the two Bloom filters.
• Bloom filters can be organized in distributed data structures to
perform fully decentralized computations of aggregate functions.
Decentralized aggregation makes them ideal for several applications by
avoiding costly communication.
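The OR rule can be checked on the 5-bit toy filter from the K = 2 illustration. The helper names build and union are made up for this sketch; the hash functions are the ones from that slide.

```python
def build(items, R: int = 5) -> bytearray:
    """Toy Bloom filter with h1(x) = x mod R and h2(x) = (2x + 3) mod R."""
    bits = bytearray(R)
    for x in items:
        bits[x % R] = 1
        bits[(2 * x + 3) % R] = 1
    return bits


def union(b1: bytearray, b2: bytearray) -> bytearray:
    """Bitwise OR of two filters of the same size and hash functions."""
    if len(b1) != len(b2):
        raise ValueError("filters must have the same size")
    return bytearray(x | y for x, y in zip(b1, b2))


b1 = build({9})   # filter for S1 = {9}
b2 = build({11})  # filter for S2 = {11}

print(list(union(b1, b2)))              # [1, 1, 0, 0, 1]
print(union(b1, b2) == build({9, 11}))  # True: same as inserting both sets into one filter
```

Because only the bit arrays are exchanged, two machines can merge their filters without ever shipping the underlying sets.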
Shrink Size of Bloom Filters?
• Can we shrink the bloom filter to half of its size?
• How about doubling its size?
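One common answer to the halving question, sketched under the assumption that every hash is computed as some value modulo R and that R is even: OR the upper half of the filter onto the lower half. The result is the filter you would have built with the same hashes taken modulo R/2. The toy hash functions and helper names below are illustrative.

```python
def build(items, R: int) -> bytearray:
    """Toy Bloom filter with h1(x) = x mod R and h2(x) = (2x + 3) mod R."""
    bits = bytearray(R)
    for x in items:
        bits[x % R] = 1
        bits[(2 * x + 3) % R] = 1
    return bits


def halve(bits: bytearray) -> bytearray:
    """Fold the upper half onto the lower half with OR (the size must be even)."""
    half = len(bits) // 2
    if 2 * half != len(bits):
        raise ValueError("size must be even")
    return bytearray(bits[i] | bits[i + half] for i in range(half))


items = {9, 11, 20}
big = build(items, 16)
small = halve(big)
print(small == build(items, 8))  # True: folding matches a filter built at half the size
```

Doubling has no such shortcut: a set bit at index i does not say whether i or i + R should be set in the larger filter, so the original elements have to be re-inserted.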
Weakness of Bloom Filters
• Needs fully independent hash functions. (Hard)
• The space usage is about 1.44x the information-theoretic minimum.
• Dynamically growing Bloom filters is hard. The best size depends on the false
positive rate and the number of insertions.
A problem to ponder on
• You want to know if two people share friends on Facebook.
• For privacy and memory reasons, Facebook only gave you a
compressed Bloom filter of their graph.
• What should you ask Facebook to compress?
• How would you do it, and how good will it be?

Lecture_3.pptx

  • 2. A Simple Problem • Design Chrome Malicious URL Detector • Lets say you work for Google, in the Chrome team, and you want to add a feature to the browser which notifies the user if the url he has entered is a malicious URL. • Given a database of about known 1 million malicious URLs, the size of any dictionary for matching will be around 50MB (size of 1 million urls with 50 average string length). 50MB is too heavy for a browser so this cannot be locally done!! • Anything locally should be something < 2MB memory. (25x far) Wait you cannot even compress the strings (Huffman Coding) to that size.
  • 3. Universal hashing review • Goals • Design ℎ: 𝑜𝑏𝑗𝑒𝑐𝑡𝑠 → [0 − 𝑅] such that • If 𝑂1 ≠ 𝑂2 𝑡ℎ𝑒𝑛 ℎ 𝑂1 ≠ ℎ 𝑂2 with high probability. • h is cheap to compute and requires almost no memory. • Classical Example: • Randomly generate a and b (seed), choose a prime P >> R. • ℎ 𝑥 = 𝑎𝑥 + 𝑏 𝑚𝑜𝑑 𝑃 𝑚𝑜𝑑 𝑅 (faster if 𝑅 = 232 why?) • For strings or other objects, use bitarrays, randomly shuffle bits based on a some seed then take mod. • example: See Murmurhash. • Memory: A seed per hash function!
  • 4. More concrete problem • Given a universal hashing • ℎ: 𝑠𝑡𝑟𝑖𝑛𝑔𝑠 → 0 − 999 • Given 100 queries (strings) (average length 50 characters). Space is 100x50x8 = 40,000 bits. • Same Task: Given a new query, answer whether we have seen it? • Given a new string, answer no if it is not seen and if it is seen than answer no with small probability • What can we do in 1000 bits?
  • 5. Bit-Maps and Universal Hashing 0 0 1 0 1 0 1 0 0 1 0 1 1 1 0 h(s) • Given a query q, if h(q) = 1 return seen, else not seen. • What is chance for false positive? Given there are N Strings. • < 1 − 1 − 1 R 𝑁 (𝑅 𝑖𝑠 𝑠𝑖𝑧𝑒 𝑜𝑓 𝑏𝑖𝑡𝑚𝑎𝑝)
  • 6. Can we do better? Bloom Filters • Use K-independent hash functions instead of 1. • Just select seeds independently. • K = 3 illustration • Given a query q, if all ℎ1 𝑞 , ℎ2 𝑞 , ℎ3 𝑞 is set, return seen else no. • False Positive Rate? 0 0 1 0 1 0 1 0 0 1 0 1 1 1 0 𝑆1 𝑆2
  • 7. Illustration of Bloom Filters (K = 2) • ℎ1 𝑥 = 𝑥 𝑚𝑜𝑑 5, ℎ2 𝑥 = (2𝑥 + 3) 𝑚𝑜𝑑 5 • Initialize Bloom Filters • Insert 9 (4 and 1) • Insert 11 (1 and 0) 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1
  • 8. Illustration of Bloom Filters • ℎ1 𝑥 = 𝑥 𝑚𝑜𝑑 5, ℎ2 𝑥 = 2𝑥 + 3 𝑚𝑜𝑑 5 • Query 15 : 0 and 3  No • Query 16: 1 and 0  Yes (False Positive) 1 1 0 0 1
  • 9. Properties • If the query was inserted before, bloom filters always return true. • No false negatives. • However, it can return true for an element which was not inserted • Chances of false positives. • If false positives are rare then caches still work. • Why and How?
  • 10. A bit of Analysis • Pr a bit is not set given N strings are inserted? • 1 − 1 𝑅 𝐾𝑁 • Pr that a given bit is set 1 - 1 − 1 𝑅 𝐾𝑁 • Pr that all the k-bits ℎ1 𝑞 , ℎ2 𝑞 … ℎ𝐾 𝑞 is set without seeing q • 1 − 1 − 1 𝑅 𝐾𝑁 𝐾 ≈ 1 − 𝑒 𝐾𝑁 𝑅 𝐾 • Minimized at 𝐾 = ln(2) × 𝑅 𝑁 . If R is say 10N, then K = 6.9 or 7. • Optimum false positive is approx. 0.618 𝑅 𝑁 which is < 0.008 • Compare that with 0.1 with k =1. • So with N strings we need 10N bits space. Compare with hash table of 𝑵 × 𝟖 × 𝟓𝟎 (assuming 50 characters mean) ⇒ a reduction of 40x in memory.
  • 11. Generic set compression • Given a set 𝑆 of 𝑛 objects with each object being heavy such as strings, etc. • Bloom filter can compress 𝑆 to less memory around 10 bit per object and still answer membership queries efficiently with rare chances of false positives. • It can up updated dynamically on the fly. • How about deletions?
  • 12. Use of Bloom Filters from Wikipedia • The servers of Akamai Technologies, a content delivery provider, use Bloom filters to prevent "one-hit-wonders" from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates. • Google Bigtable, Apache HBase and Apache Cassandra, and Postgresql use Bloom filters to reduce the disk lookups for non- existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation. • The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result). • The Squid Web Proxy Cache uses Bloom filters for cache digests. • Bitcoin uses Bloom filters to speed up wallet synchronization. • The Venti archival storage system uses Bloom filters to detect previously stored data. • The SPIN model checker uses Bloom filters to track the reachable state space for large verification problems. • The Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is significantly larger than the other (often called Bloom join in the database literature). • The Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature. • Medium uses Bloom filters to avoid recommending articles a user has previously read. • Ethereum uses Bloom filters for quickly finding logs on the Ethereum blockchain.
  • 13. Popular Use: One-hit-wonders • Content delivery networks deploy web caches around the world to cache and serve web content to users with greater performance and reliability. • A key application of Bloom filters is their use in efficiently determining which web objects to store in these web caches. • Nearly three-quarters of the URLs accessed from a typical web cache are "one-hit-wonders" that are accessed by users only once and never again. • To prevent caching one-hit-wonders, a Bloom filter is used to keep track of all URLs that are accessed by users. • A web object is cached only when it has been accessed at least once before, i.e., the object is cached on its second request. • The use of a Bloom filter in this fashion significantly reduces the disk write workload, since one-hit-wonders are never written to the disk cache.
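The admission rule described on this slide can be sketched as follows. This is an illustration only: the class `SecondHitCache`, the `fetch` callback, and the in-memory `disk` dictionary are hypothetical stand-ins, not how any particular CDN implements it.

```python
import hashlib


class SecondHitCache:
    """Cache an object only on its second request (sketch)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.m = num_bits
        self.k = num_hashes
        self.seen = bytearray(num_bits)  # one byte per bit, for readability
        self.disk = {}                   # stand-in for the disk cache

    def _positions(self, url):
        data = url.encode("utf-8")
        return [
            int.from_bytes(
                hashlib.blake2b(data, digest_size=8, salt=bytes([seed])).digest(),
                "little",
            ) % self.m
            for seed in range(self.k)
        ]

    def request(self, url, fetch):
        if url in self.disk:
            return self.disk[url], "hit"
        body = fetch(url)  # go to the origin server
        pos = self._positions(url)
        if all(self.seen[p] for p in pos):
            # The filter says this URL was requested before: admit it.
            self.disk[url] = body
            return body, "miss, cached"
        # First sighting: remember the URL in the filter, write nothing to disk.
        for p in pos:
            self.seen[p] = 1
        return body, "miss, not cached"
```

A false positive in the filter only means that some one-hit-wonder is cached on its first request, so errors cost a little disk space but never correctness.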
  • 14. Deletion: Option 1 • Use two bloom filters • One to keep track of added elements • One to keep track of deleted elements • What are the chances of false negatives? • What are the chances of false positives? • Decreased!
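A sketch of the two-filter scheme, assuming an element counts as present when it is in the "added" filter and not in the "deleted" filter. The names `TwoFilterSet` and `positions` and the salted BLAKE2b hashing are illustrative assumptions.

```python
import hashlib


def positions(item, m, k):
    # k salted hashes of the item, each reduced to a position in [0, m).
    data = item.encode("utf-8")
    return [
        int.from_bytes(
            hashlib.blake2b(data, digest_size=8, salt=bytes([s])).digest(),
            "little",
        ) % m
        for s in range(k)
    ]


class TwoFilterSet:
    def __init__(self, m_added, m_deleted, k=3):
        self.k = k
        self.added = [False] * m_added
        self.deleted = [False] * m_deleted

    def add(self, item):
        for p in positions(item, len(self.added), self.k):
            self.added[p] = True

    def delete(self, item):
        for p in positions(item, len(self.deleted), self.k):
            self.deleted[p] = True

    def __contains__(self, item):
        in_added = all(
            self.added[p] for p in positions(item, len(self.added), self.k)
        )
        in_deleted = all(
            self.deleted[p] for p in positions(item, len(self.deleted), self.k)
        )
        return in_added and not in_deleted


s = TwoFilterSet(m_added=1024, m_deleted=1024)
s.add("a")
s.add("b")
s.delete("a")
```

False negatives now become possible: a live element that happens to be a false positive of the "deleted" filter is reported as absent. Under this sketch an element also cannot be re-added after deletion, because the "deleted" filter never forgets it.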
  • 15. A Popular Alternative • Counting filters • Fan, Li; Cao, Pei; Almeida, Jussara; Broder, Andrei (2000), "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol" • In a counting filter the array positions (buckets) are extended from being a single bit to being an n-bit counter. • Addition: Increment • Deletion: Decrement • Chances of both false positives and false negatives (the latter if a counter overflows or an element that was never added is deleted).
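A minimal counting-filter sketch. The class name, the saturating 4-bit counter limit (`max_count=15`), and the salted BLAKE2b hashing are assumptions chosen for illustration, not details taken from the cited paper.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, num_counters, num_hashes=3, max_count=15):
        self.m = num_counters
        self.k = num_hashes
        self.max_count = max_count  # 15 models a 4-bit counter
        self.counts = [0] * num_counters

    def _positions(self, item):
        data = item.encode("utf-8")
        return [
            int.from_bytes(
                hashlib.blake2b(data, digest_size=8, salt=bytes([s])).digest(),
                "little",
            ) % self.m
            for s in range(self.k)
        ]

    def add(self, item):
        for p in self._positions(item):
            if self.counts[p] < self.max_count:  # saturate instead of overflowing
                self.counts[p] += 1

    def remove(self, item):
        # Refuse to decrement when the item is definitely absent.
        if item not in self:
            return False
        for p in self._positions(item):
            if self.counts[p] > 0:
                self.counts[p] -= 1
        return True

    def __contains__(self, item):
        return all(self.counts[p] > 0 for p in self._positions(item))


cbf = CountingBloomFilter(num_counters=1024)
cbf.add("x")
cbf.add("y")
cbf.remove("x")
```

The membership guard in `remove` cannot catch an element that was never added but is a false positive; deleting such an element decrements counters that belong to other elements, which is one way false negatives arise.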
  • 16. Union of two bloom filters? • Given bloom filter 𝐵1 for set 𝑆1 and another bloom filter 𝐵2 for set 𝑆2, with the same size and the same hash functions. • What is the bloom filter of 𝑆1 ∪ 𝑆2 ? • Just OR the two bloom filters. • Bloom filters can be organized in distributed data structures to perform fully decentralized computations of aggregate functions. Decentralized aggregation makes them ideal for several applications by avoiding costly communication.
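The OR rule can be checked directly. In this sketch a Python integer serves as the bit array, and the helper names `positions`, `build`, and `might_contain` are hypothetical; both filters must share the same size `m` and the same `k` hash functions.

```python
import hashlib


def positions(item, m, k):
    data = item.encode("utf-8")
    return [
        int.from_bytes(
            hashlib.blake2b(data, digest_size=8, salt=bytes([s])).digest(),
            "little",
        ) % m
        for s in range(k)
    ]


def build(items, m, k):
    bits = 0  # Python int used as an m-bit array
    for item in items:
        for p in positions(item, m, k):
            bits |= 1 << p
    return bits


def might_contain(bits, item, m, k):
    return all((bits >> p) & 1 for p in positions(item, m, k))


M, K = 1024, 3
s1 = {"alice", "bob", "carol"}
s2 = {"carol", "dave"}
b1 = build(s1, M, K)
b2 = build(s2, M, K)

# Bitwise OR of the two filters equals the filter built from the union.
union = b1 | b2
```

Because the OR is bit-for-bit identical to the filter built from 𝑆1 ∪ 𝑆2, two machines can merge their filters without exchanging the underlying elements.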
  • 17. Shrink Size of Bloom Filters? • Can we shrink the bloom filter to half of its size? • How about doubling its size?
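One possible answer to the halving question, sketched under the assumption that every position is computed as h(x) mod m with m even: OR-ing the upper half of the bit array onto the lower half yields exactly the filter that would have been built with m/2 bits. The helper names `positions`, `build`, and `fold_in_half` are hypothetical.

```python
import hashlib


def positions(item, m, k):
    # Assumption: position = (64-bit hash) mod m, so halving m stays consistent.
    data = item.encode("utf-8")
    return [
        int.from_bytes(
            hashlib.blake2b(data, digest_size=8, salt=bytes([s])).digest(),
            "little",
        ) % m
        for s in range(k)
    ]


def build(items, m, k):
    bits = 0  # Python int used as an m-bit array
    for item in items:
        for p in positions(item, m, k):
            bits |= 1 << p
    return bits


def fold_in_half(bits, m):
    """OR the upper m/2 bits onto the lower m/2 bits."""
    if m % 2:
        raise ValueError("m must be even to fold")
    half = m // 2
    low = bits & ((1 << half) - 1)
    high = bits >> half
    return low | high


items = [f"item-{i}" for i in range(50)]
big = build(items, 1024, 3)
small = fold_in_half(big, 1024)
```

Folding works because (h mod m) mod (m/2) equals h mod (m/2). The price is a higher false positive rate, since the same elements now share half as many bits. Doubling has no such shortcut: the filter alone does not record which of the two candidate positions each set bit came from, so the original elements are needed to rebuild it.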
  • 18. Weakness of Bloom Filters • Needs fully independent hash functions. (Hard) • The space usage is about 1.44x the information-theoretic minimum. • Dynamically growing bloom filters is hard. Best size depends on false positive rate and number of insertions.
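The dependence of the best size on the false positive rate and the number of insertions can be made concrete with the standard sizing formulas m = −n ln p / (ln 2)² and k = (m/n) ln 2, which are textbook results rather than something stated on the slide. The function names below are hypothetical.

```python
import math


def bloom_parameters(n, p):
    """Bits m and hash count k for n insertions at target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


def lower_bound_bits(n, p):
    """Information-theoretic minimum: n * log2(1/p) bits."""
    return n * math.log2(1 / p)


# Example: 1 million items at a 1% false positive rate.
m, k = bloom_parameters(1_000_000, 0.01)
overhead = m / lower_bound_bits(1_000_000, 0.01)
```

For this example the filter needs about 9.6 bits per item with 7 hash functions, and `overhead` comes out near 1.44, which is the factor quoted on the slide. Since both n and p enter the formula for m, a filter sized for one workload is no longer optimal once many more items are inserted than planned.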
  • 19. A problem to ponder on • You want to know if two people share friends on Facebook. • For privacy and memory reasons, Facebook only gives you a compressed bloom filter of their graph. • What should you ask Facebook to compress? • How would you do it, and how good will it be?