PROBABILISTIC
DATA STRUCTURES
Thinh Dang-An
Definitions
• Data structure
  • A ‘structure’ that holds ‘data’, allowing you to extract information
• Probabilistic
  • A query may return a wrong answer
  • The answer is ‘good enough’
  • Uses a fraction of the resources (i.e., memory or CPU cycles)
Four types:
• Membership
  • Bloom Filter
  • Cuckoo Filters
• Cardinality
  • Linear Counting
  • LogLog, SuperLogLog, HyperLogLog, HyperLogLog++
• Frequency
  • Count-Min Sketch
  • Majority Algorithm
  • Misra-Gries Algorithm
• Similarity
  • Locality-Sensitive Hashing (LSH)
  • MinHash
  • SimHash
Bloom Filter
Membership
Properties
• It tells us that the element either definitely is not in
the set or may be in the set.
• Bloom filters are called filters because they are often
used as a cheap first pass to filter out segments of a
dataset that do not match a query.
How does it work
• A Bloom filter is a bit array of m bits, all set to 0 at the beginning
• To insert an element into the filter, compute all k hash functions for the element and set the bits at the corresponding indices
• To test whether an element is in the filter, compute all k hash functions for the element and check the bits at the corresponding indices:
  • if all bits are set, the answer is “maybe”
  • if at least 1 bit isn’t set, the answer is “definitely not”
• The time needed to insert or test an element is O(k), independent of the number of items already in the filter
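The insert and test steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the class name `BloomFilter`, the method names, and the choice of deriving the k hash functions by salting SHA-256 are all assumptions made for the example, and the "bit array" uses one byte per bit for readability.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits probed by k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit for clarity; all 0 at the start

    def _indices(self, item: str):
        # Assumption: simulate k hash functions by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bit at each of the k indices.
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not"; True means "maybe".
        return all(self.bits[idx] for idx in self._indices(item))
```

After `bf = BloomFilter(m=1024, k=3)` and `bf.add("apple")`, the call `bf.might_contain("apple")` is always true, because inserted elements never produce a false negative. A different key usually returns false, but may return true if its k bits happen to be set by other insertions.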
Application
• Google BigTable, Apache HBase and Apache Cassandra use Bloom filters to
reduce the disk lookups for non-existent rows or columns
• Medium uses Bloom filters to avoid recommending articles a user has previously
read
• Google Chrome web browser used to use a Bloom filter to identify malicious URLs
(moved to PrefixSet, Issue 71832)
• The Squid Web Proxy Cache uses Bloom filters for cache digests
Cuckoo Filters
Membership
Properties
• Practically better than a Bloom filter for many applications
• Supports adding and removing items dynamically
• Provides higher lookup performance
• Built on cuckoo hashing, which resolves collisions by moving (rehashing) an existing item to a new place
How does it work
• Parameters of the Filter:
• 1. Two Hash Functions: h1 and h2
• 2. An array B with n buckets. The i-th bucket will be called B[i]
• Input: L, a list of elements to be inserted into the cuckoo filter.
How does it work
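While L is not empty:
    Let x be the first item in the list L. Remove x from the list.
    If B[h1(x)] is empty:
        place x in B[h1(x)]
    Else, if B[h2(x)] is empty:
        place x in B[h2(x)]
    Else:
        Let y be the element in B[h2(x)].
        Prepend y to L
        place x in B[h2(x)]
Note: this loop can cycle when the table is too full, so implementations cap the number of evictions and then resize or rehash.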
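The insertion loop above can be sketched in Python as follows. Several things here are assumptions for illustration: the class name `CuckooTable`, the salted SHA-256 stand-ins for h1 and h2, and the `max_kicks` cap. The sketch also differs from the slide in two ways that should be read as deliberate simplifications: an evicted item is moved directly to its other bucket (the usual cuckoo hashing rule) instead of being re-queued, and whole items are stored, whereas the cuckoo filter of Fan et al. stores short fingerprints.

```python
import hashlib


def _hash(salt: str, item: str, n: int) -> int:
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


class CuckooTable:
    """Cuckoo hashing with two hash functions and one item per bucket."""

    def __init__(self, n: int, max_kicks: int = 32) -> None:
        self.n = n
        self.max_kicks = max_kicks   # hypothetical cap on evictions
        self.buckets = [None] * n    # B[0..n-1]

    def h1(self, item: str) -> int:
        return _hash("h1", item, self.n)

    def h2(self, item: str) -> int:
        return _hash("h2", item, self.n)

    def contains(self, item: str) -> bool:
        return item in (self.buckets[self.h1(item)], self.buckets[self.h2(item)])

    def remove(self, item: str) -> bool:
        for idx in (self.h1(item), self.h2(item)):
            if self.buckets[idx] == item:
                self.buckets[idx] = None
                return True
        return False

    def insert(self, item: str) -> bool:
        if self.contains(item):
            return True
        for idx in (self.h1(item), self.h2(item)):
            if self.buckets[idx] is None:
                self.buckets[idx] = item
                return True
        # Both buckets are full: evict the occupant of B[h2(x)].
        idx = self.h2(item)
        for _ in range(self.max_kicks):
            item, self.buckets[idx] = self.buckets[idx], item
            # Move the evicted item to its other bucket.
            a, b = self.h1(item), self.h2(item)
            idx = b if idx == a else a
            if self.buckets[idx] is None:
                self.buckets[idx] = item
                return True
        # Gave up: the item currently in hand is not stored.
        # A real implementation would resize or rehash here.
        return False
```

Lookups and removals touch at most two buckets, which is where the constant-time lookup and the support for deletion come from.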
What if a cuckoo filter uses more than two hash functions?
• It is usually not necessary, because:
  • Each extra hash function costs time to compute and to implement, without bringing a clear benefit.
  • When many inserted items concentrate on one bucket, the usual remedy is to store more elements per bucket, which needs more space.
COMPARISON WITH BLOOM FILTER
• Space Efficiency
• Number of Memory Accesses
• Value Association
• Maximum Capacity
Count Min Sketch
Frequency
Properties
• Can only over-estimate a frequency, never under-estimate it, because collisions only add to the counters.
• The time needed to add an element or return its frequency is O(k) for k hash functions, assuming that every hash function can be evaluated in constant time.
How does it work
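• Use multiple counter arrays, each with its own hash function to compute the index.
• To add an element, increment the counter at its index in every array.
• When queried, return the minimum of the element's counters across the arrays.
→ Count-Min Sketch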
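A minimal Python sketch of this update and query rule is shown below. The class name `CountMinSketch`, the parameter names `width` and `depth` (depth plays the role of k, the number of hash functions), and the use of salted SHA-256 as the per-row hash are assumptions for the example.

```python
import hashlib


class CountMinSketch:
    """depth counter arrays of size width, one hash function per array."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # Assumption: one hash function per row, simulated by salting SHA-256.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Increment the item's counter in every row.
        for r in range(self.depth):
            self.rows[r][self._index(r, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.rows[r][self._index(r, item)] for r in range(self.depth))
```

Taking the minimum works because every row's counter is at least the true count; the row with the fewest collisions gives the tightest upper bound.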
• AT&T has used Count-Min Sketch in network switches to perform analyses on
network traffic using limited memory
• At Google, a precursor of the count-min sketch (called the “count sketch”) has
been implemented on top of their MapReduce parallel processing infrastructure
• Implemented as a part of Algebird library from Twitter
HyperLogLog
Cardinality
Properties
• HyperLogLog is described by 2 parameters:
  • p: the number of hash bits used to select a bucket for averaging (m = 2^p is the number of buckets/substreams)
  • h: a hash function that produces uniformly distributed hash values
• The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical error rate of 2%
• Core observation: track the maximum number of leading zeros seen over all hash values; a run of k leading zeros suggests roughly 2^k distinct elements.
How does it work
• Stochastic averaging is used to reduce the large variability of a single estimate:
  • The input stream of data elements S is divided into m substreams S(i) using the first p bits of the hash values (m = 2^p)
  • In each substream, the rank (computed on the bits after the initial p bits that are used for substreaming) is measured independently.
  • These numbers are kept in an array of registers M, where M[i] stores the maximum rank seen so far for the substream with index i.
  • A cardinality formula over the registers then approximates the number of distinct elements in the multiset.
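The register update and the final estimate can be sketched in Python as below. This is a simplified illustration under stated assumptions: the class name `HyperLogLog`, the use of the first 64 bits of SHA-256 as the hash h, and the restriction to 7 ≤ p ≤ 16 are choices made for the example. It uses the standard raw estimate alpha_m · m² / Σ 2^(−M[i]) with linear counting for small cardinalities, and omits the large-range and bias corrections described in the HyperLogLog literature.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 12) -> None:
        if not 7 <= p <= 16:
            raise ValueError("this sketch assumes 7 <= p <= 16")
        self.p = p
        self.m = 1 << p                              # m = 2^p registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # constant for m >= 128

    def add(self, item: str) -> None:
        # Assumption: a 64-bit hash taken from SHA-256.
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        width = 64 - self.p
        idx = h >> width                             # first p bits pick the substream
        rest = h & ((1 << width) - 1)                # remaining bits
        rank = width - rest.bit_length() + 1         # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

Adding the same element twice leaves the registers unchanged, which is why the structure counts distinct elements rather than total events. With m registers the expected relative error is roughly 1.04 / sqrt(m).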
Application
• PFCOUNT in Redis
• Counting unique visitors to a website, etc.
MinHash
Similarity
Properties
• Compute a “signature” for each set, so that similar documents have similar
signatures (and dissimilar docs are unlikely to have similar signatures)
• Trade-off: length of signature vs. accuracy
How does it work
For each row r = 0, 1, …, N-1 of the characteristic matrix:
  1. Compute h1(r), h2(r), …, hn(r)
  2. For each column c:
     1. If column c has 0 in row r, do nothing
     2. Otherwise, for each i = 1, 2, …, n set SIG(i, c) to be min(hi(r), SIG(i, c))
With:
  r: row
  c: column
  i: index of hash
Note: in practice we only need to iterate through the non-zero elements.
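The row-by-row procedure above translates almost directly into Python. In this sketch the function names, the `seed` parameter, and the hash family h_i(r) = (a_i · r + b_i) mod P with P = 2^31 − 1 are assumptions chosen for illustration; the characteristic matrix is passed as a list of rows, where `matrix[r][c]` is 1 if element r belongs to set c.

```python
import random


def minhash_signatures(matrix, n_hashes, seed=0):
    """Return SIG, an n_hashes x n_columns signature matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    prime = 2_147_483_647  # 2^31 - 1, assumed larger than the number of rows
    rng = random.Random(seed)
    # Hypothetical hash family: h_i(r) = (a_i * r + b_i) mod prime
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(n_hashes)]
    sig = [[float("inf")] * n_cols for _ in range(n_hashes)]
    for r in range(n_rows):
        hashes = [(a * r + b) % prime for a, b in coeffs]   # step 1
        for c in range(n_cols):                             # step 2
            if matrix[r][c] == 0:
                continue                                    # do nothing
            for i in range(n_hashes):
                sig[i][c] = min(hashes[i], sig[i][c])
    return sig


def estimated_similarity(sig, c1, c2):
    """Fraction of signature rows on which columns c1 and c2 agree."""
    agree = sum(1 for row in sig if row[c1] == row[c2])
    return agree / len(sig)
```

The fraction of agreeing signature rows estimates the Jaccard similarity of the two sets, and a longer signature (more hash functions) gives a more accurate estimate at the cost of more space.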
Problem with MinHash
• Assume that we construct a 1,000-byte MinHash signature for each document.
• One million documents then fit into 1 gigabyte of RAM.
• But how much does it cost to find the nearest neighbor of every document?
• Brute force over all pairs: N(N-1)/2 comparisons, about 5 × 10^11 for N = 10^6.
• We need a way to reduce the number of comparisons.
Locality sensitive Hashing
Similarity
Properties
• Idea:
  • Starting from the MinHash signature matrix, divide its rows into b bands of r rows each.
  • Within each band, hash each column's r values into buckets with an ordinary hash function (i.e., one hash table per band).
  • If sets S and T have the same values in a band, they will be hashed into the same bucket in that band.
  • For nearest-neighbor search, the candidates are the items that share a bucket with the query item in at least one band.
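The banding idea can be sketched in Python as follows. The function name `lsh_candidate_pairs`, the dictionary input format, and the use of the band's tuple of values as the bucket key (relying on Python's built-in dictionary hashing in place of an explicit "basic hash function") are assumptions for the example.

```python
from collections import defaultdict


def lsh_candidate_pairs(signatures, bands, rows_per_band):
    """signatures maps a document id to a list of bands * rows_per_band values."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)          # one hash table per band
        start = band * rows_per_band
        for doc_id, sig in signatures.items():
            key = tuple(sig[start:start + rows_per_band])
            buckets[key].append(doc_id)
        # Documents sharing a bucket in this band become candidate pairs.
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    candidates.add(tuple(sorted((docs[i], docs[j]))))
    return candidates
```

For example, with signatures `{"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}`, 2 bands, and 2 rows per band, the only candidate pair is `("a", "b")`, because those two documents agree on the whole first band. Only candidate pairs then need a full signature comparison. For two sets with similarity s, the chance of becoming a candidate is 1 − (1 − s^r)^b, so b and r tune the similarity threshold.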
Application
• Finding duplicate pages on the web
• Retrieving images
• Retrieving music
References
1. Series on probabilistic data structures - Andrii Gakhov
2. Cuckoo Filter: Practically Better Than Bloom - Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher
3. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm - Stefan Heule, Marc Nunkesser, Alexander Hall
4. MinHash & LSH slide
Thank you for watching


Editor's Notes

  1. • Membership: to determine membership of an element in a large set of elements
     • Frequency: to estimate the number of times an element occurs in a set
     • Cardinality: to determine the number of distinct elements
     • Similarity: to find clusters of similar documents in the document set, and to find duplicates of a document in the document set