PROBABILISTIC
DATA STRUCTURES
Thinh Dang-An
Definitions
• Data structure
• It is a ‘structure’ that holds ‘data’, allowing you to extract
information
• Probabilistic
• A query may return a wrong answer
• The answer is ‘good enough’
• Uses a fraction of the resources, i.e., memory or CPU cycles
Four types:
• Membership
• Bloom Filter
• Cuckoo Filters
• Cardinality
• Linear Counting
• LogLog, SuperLogLog,
HyperLogLog, HyperLogLog++
• Frequency
• Count-Min Sketch
• Majority Algorithm
• Misra-Gries Algorithm
• Similarity
• Locality-Sensitive Hashing (LSH)
• MinHash
• SimHash
Bloom Filter
Membership
Properties
• It tells us that the element either definitely is not in
the set or may be in the set.
• Bloom filters are called filters because they are often
used as a cheap first pass to filter out segments of a
dataset that do not match a query.
How does it work
• A Bloom filter is a bit array of m bits, all set to 0 at the beginning
• To insert an element into the filter, compute all k hash functions for the element and set the bits at the corresponding indices
• To test whether an element is in the filter, compute all k hash functions for the element and check the bits at the corresponding indices:
• if all bits are set, the answer is “maybe”
• if at least one bit is not set, the answer is “definitely not”
• The time needed to insert or test an element is O(k), independent of the number of items already in the filter
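The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the method names, and the trick of simulating k hash functions by salting SHA-256 with the function index are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits probed by k hash functions (m and k chosen by the caller)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, kept simple for readability

    def _indices(self, item: str):
        # Assumption: the k hash functions are simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bit at each of the k indices.
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not"; True only means "maybe".
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter(m=1024, k=3)
bf.add("alice")
print(bf.might_contain("alice"))  # always True for an inserted item
```

Both `add` and `might_contain` touch exactly k positions, which is where the O(k) cost comes from.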
Application
• Google BigTable, Apache HBase and Apache Cassandra use Bloom filters to
reduce the disk lookups for non-existent rows or columns
• Medium uses Bloom filters to avoid recommending articles a user has previously
read
• The Google Chrome web browser formerly used a Bloom filter to identify malicious URLs
(it later moved to PrefixSet, Issue 71832)
• The Squid Web Proxy Cache uses Bloom filters for cache digests
Cuckoo Filters
Membership
Properties
• Practically better than a Bloom filter
• Supports adding and removing items dynamically
• Provides higher lookup performance
• Built on cuckoo hashing, which resolves collisions by moving an
existing item to an alternative location
How does it work
• Parameters of the Filter:
• 1. Two Hash Functions: h1 and h2
• 2. An array B with n buckets. The i-th bucket will be called B[i]
• Input: L, a list of elements to be inserted into the cuckoo filter.
How does it work
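While L is not empty:
    Let x be the first item in the list L. Remove x from the list.
    If B[h1(x)] is empty:
        place x in B[h1(x)]
    Else if B[h2(x)] is empty:
        place x in B[h2(x)]
    Else:
        Let y be the element in B[h2(x)].
        Prepend y to L
        place x in B[h2(x)]
Note: when the table is too full this loop can cycle forever, so implementations stop after a fixed number of evictions and report the filter as full.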
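The pseudocode above stores whole elements. The cuckoo filter described in the Fan et al. paper listed in the references instead stores a short fingerprint of each element and derives the second bucket from the first bucket and the fingerprint. The sketch below follows that idea under stated assumptions: the class and method names, the 8-bit fingerprint, the bucket size of 4, the limit of 500 evictions, and the SHA-256 based hashing are all illustrative choices, not taken from the slides.

```python
import hashlib
import random


def _h(data: bytes) -> int:
    # Assumption: SHA-256 stands in for a fast non-cryptographic hash.
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class CuckooFilter:
    def __init__(self, num_buckets=64, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR trick below within range.
        assert num_buckets & (num_buckets - 1) == 0
        self.n = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _fingerprint(self, item: str) -> int:
        return _h(b"fp:" + item.encode()) % 255 + 1  # 8-bit value in 1..255

    def _index1(self, item: str) -> int:
        return _h(item.encode()) % self.n

    def _alt_index(self, index: int, fp: int) -> int:
        # The other candidate bucket; applying this twice returns the original index.
        return (index ^ _h(fp.to_bytes(1, "big"))) % self.n

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index1(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # considered full; the last evicted fingerprint is dropped

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index1(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index1(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Because only fingerprints are stored, `contains` can return a false positive when two items share a fingerprint and a bucket, and `delete` should only be called for items that were actually inserted.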
What if a cuckoo filter uses more than two
hash functions?
• It still works, but it is usually unnecessary, because:
• Every extra hash function adds another candidate bucket to compute
and check on each operation, for little practical benefit.
• When many inserted items land in the same bucket, storing more
entries per bucket handles the collisions, at the cost of more space.
COMPARISON WITH BLOOM FILTER
• Space Efficiency
• Number of Memory Accesses
• Value Association
• Maximum Capacity
Count Min Sketch
Frequency
Properties
• Estimates can only be too high, never too low.
• The time needed to add an element or return its frequency is
O(k), where k is the number of hash functions, assuming that every
hash function can be evaluated in constant time.
How does it work
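• Use multiple arrays of counters, each with its own hash function to compute
the index.
• To add an element, increment the counter at its index in every array.
• When queried, return the minimum of the element's counters across the arrays.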
→ Count-Min Sketch
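A minimal sketch of the idea, assuming a table of `depth` counter arrays of length `width`. The names `CountMinSketch`, `width`, `depth`, `add` and `estimate` are illustrative, and the per-row hash functions are simulated by salting SHA-256 with the row number.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]  # one counter array per hash function

    def _index(self, row: int, item: str) -> int:
        # Assumption: a different hash function per row is simulated with a salt.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for r in range(self.depth):
            self.rows[r][self._index(r, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate counters, so the minimum is the best guess.
        return min(self.rows[r][self._index(r, item)] for r in range(self.depth))


cms = CountMinSketch(width=256, depth=4)
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))  # at least 3; larger only if every row had a collision
```

Taking the minimum is what guarantees the estimate never falls below the true count: each counter holds the true count plus whatever other items collided into it.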
Application
• AT&T has used Count-Min Sketch in network switches to perform analyses on
network traffic using limited memory
• At Google, a precursor of the count-min sketch (called the “count sketch”) has
been implemented on top of their MapReduce parallel processing infrastructure
• Implemented as part of the Algebird library from Twitter
HyperLogLog
Cardinality
Properties
• HyperLogLog is described by 2 parameters:
• p – the number of hash bits that select the bucket used for
averaging (m = 2^p is the number of buckets/substreams)
• h – a hash function that produces uniformly distributed hash values
• The HyperLogLog algorithm is able to estimate cardinalities of
> 10^9 with a typical error rate of 2%
• Core observation: track the maximum number of leading zeros over all
hash values. A hash value with at least k leading zeros occurs with
probability 2^-k, so a large maximum suggests many distinct elements.
How does it work
• Stochastic averaging is used to reduce the large variability of a single estimate:
• The input stream of data elements S is divided into m substreams S(i) using the first p
bits of the hash values (m = 2^p)
• In each substream, the rank (the position of the first 1-bit after the initial p bits that are
used for substreaming) is measured independently.
• These numbers are kept in an array of registers M, where M[i] stores the maximum rank
seen for the substream with index i.
• The cardinality of the multiset is then approximated by a formula based on the harmonic
mean of the values 2^M[i].
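The register update and the estimate can be sketched as follows. This is a simplified illustration under stated assumptions: a 64-bit hash taken from SHA-256, the bias constant 0.7213 / (1 + 1.079 / m) from the original HyperLogLog paper (valid for m >= 128), and only the small-range (linear counting) correction. The class and method names are illustrative, and the refinements of HyperLogLog++ are not included.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 10) -> None:
        assert p >= 7, "the alpha constant used here assumes m >= 128"
        self.p = p
        self.m = 1 << p                      # m = 2^p registers
        self.registers = [0] * self.m        # M[i] = maximum rank seen in substream i
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        x = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = x >> (64 - self.p)                    # first p bits pick the register
        rest = x & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits (1-based)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        # Harmonic-mean based raw estimate.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog(p=10)
for i in range(20000):
    hll.add(f"user-{i}")
print(round(hll.count()))  # close to 20000; the exact value depends on the hash
```

With p = 10 the sketch keeps 1,024 small registers regardless of how many elements are added, and adding the same element again never changes a register.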
Application
• PFCOUNT in Redis
• Counting unique visitors to a website, etc.
MinHash
Similarity
Properties
• Compute a “signature” for each set, so that similar documents have similar
signatures (and dissimilar docs are unlikely to have similar signatures)
• Trade-off: length of signature vs. accuracy
How does it work
Initialize SIG(i, c) to ∞ for all i and c.
For each row r = 0, 1, …, N-1 of the characteristic matrix:
1. Compute h1(r), h2(r), …, hn(r)
2. For each column c:
    1. If column c has 0 in row r, do nothing
    2. Otherwise, for each i = 1, 2, …, n set SIG(i, c) to min(hi(r), SIG(i, c))
With:
r: row (an element of the universal set)
c: column (a set, e.g. a document)
i: index of the hash function
Note: in practice we only need to iterate through the non-zero
elements.
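A small sketch of the procedure, iterating only over the non-zero rows of each column as the note suggests. It assumes each set is given as a Python set of integer row indices below `PRIME`, and that the hash functions have the form h_i(r) = (a_i * r + b_i) mod PRIME with random a_i and b_i. The function names and these choices are illustrative, not from the slides.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1; row indices are assumed to be smaller than this


def make_hashes(n: int, seed: int = 0):
    # n random hash functions h_i(r) = (a * r + b) mod PRIME, stored as (a, b) pairs.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(rows, hashes):
    # SIG(i, c) starts at infinity and is lowered by every row where column c has a 1.
    sig = [float("inf")] * len(hashes)
    for r in rows:
        for i, (a, b) in enumerate(hashes):
            sig[i] = min(sig[i], (a * r + b) % PRIME)
    return sig


def estimate_similarity(sig_a, sig_b) -> float:
    # The fraction of agreeing positions estimates the Jaccard similarity of the sets.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


hashes = make_hashes(200)
doc_a = minhash_signature({1, 2, 3, 4, 5, 6, 7, 8}, hashes)
doc_b = minhash_signature({1, 2, 3, 4, 5, 6, 9, 10}, hashes)
print(estimate_similarity(doc_a, doc_b))  # true Jaccard similarity is 6/10 = 0.6
```

A longer signature (more hash functions) tightens the estimate at the cost of more memory per document, which is the trade-off listed under Properties.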
Problem with MinHash
• Assume that we construct a 1,000-byte MinHash signature
for each document.
• A million documents then fit into 1 gigabyte of RAM.
But how much does it cost to find the nearest neighbor
of every document?
• Brute force: 1/2 N(N-1) comparisons, one for every pair of documents.
• We need a way to reduce the number of comparisons
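To make the scale concrete, the arithmetic for the numbers on this slide (one million documents, 1,000 bytes per signature) works out as follows. The variable names are illustrative.

```python
n_docs = 1_000_000
signature_bytes = 1_000

# Memory for all signatures, in gigabytes (10^9 bytes).
memory_gb = n_docs * signature_bytes / 1e9

# Number of document pairs a brute-force comparison has to examine: N(N-1)/2.
pairs = n_docs * (n_docs - 1) // 2

print(memory_gb)  # 1.0
print(pairs)      # 499999500000
```

Roughly 5 x 10^11 signature comparisons is why the memory fits comfortably while the brute-force search does not.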
Locality-Sensitive Hashing
Similarity
Properties
• Idea:
• Starting from the MinHash signature matrix, divide its rows into b bands
of r rows each.
• Within each band, hash each column's r values with a basic hash
function into buckets (i.e., one hash table per band).
• If sets S and T have the same values in a band, they will be
hashed into the same bucket in that band.
• For nearest-neighbor search, the candidates are the items that share a
bucket with the query item in at least one band
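The banding step can be sketched as below. The function name `lsh_candidates` and the input format (a dict from document id to signature list) are assumptions for this example. As a simplification, the tuple of r values in a band is used directly as the dictionary key, so two documents share a bucket only when their band values match exactly; a real implementation would hash the band into a fixed number of buckets.

```python
from collections import defaultdict


def lsh_candidates(signatures: dict, bands: int, rows: int) -> set:
    """Return pairs of document ids that share a bucket in at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)  # one hash table per band
        for doc_id, sig in signatures.items():
            # The r signature values of this band form the bucket key.
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[chunk].append(doc_id)
        for ids in buckets.values():
            # Every pair inside a bucket becomes a candidate pair.
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates


signatures = {
    "a": [1, 2, 3, 4],
    "b": [1, 2, 9, 9],   # agrees with "a" in the first band only
    "c": [7, 7, 7, 7],
}
print(lsh_candidates(signatures, bands=2, rows=2))  # {('a', 'b')}
```

Only the candidate pairs need a full signature comparison afterwards, which is how the number of comparisons is reduced. More bands with fewer rows each produce more candidates, and fewer bands with more rows each produce fewer.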
Application
• Finding duplicate pages on the web
• Retrieving images
• Retrieving music
References
1. Probabilistic data structures (series) - Andrii Gakhov
2. Cuckoo Filter: Practically Better Than Bloom - Bin Fan, David G. Andersen, Michael
Kaminsky, Michael D. Mitzenmacher
3. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality
Estimation Algorithm - Stefan Heule, Marc Nunkesser, Alexander Hall
4. MinHash & LSH slide
Thank you for watching


Editor's Notes

  1. The four types:
  • Membership: to determine whether an element belongs to a large set of elements
  • Frequency: to estimate the number of times an element occurs in a set
  • Cardinality: to determine the number of distinct elements
  • Similarity: to find clusters of similar documents in a document set, or to find duplicates of a document in the set