SlideShare a Scribd company logo
Count-min sketch to Infinity:
Using Probabilistic Data Structures to Solve Presence, Counting, and Distinct
Count Problems in .NET
presented by: Steve Lorello - Developer Advocate @Redis
Agenda
● What are Probabilistic Data Structures?
● Set Membership problems
● Bloom Filters
● Counting problems
● Count-Min-Sketch
● Distinct Count problems
● HyperLogLog
● Using Probabilistic Data Structures with Redis
What are Probabilistic Data Structures?
● Class of specialized data structures
● Tackle specific problems
● Use probability approximate
Probabilistic Data Structures Examples
Name Problem Solved Optimization
Bloom Filter Presence Space, Insertion Time, Lookup Time
Quotient Filter Presence Space, Insertion Time, Lookup Time
Skip List Ordering and Searching Insertion Time, Search time
HyperLogLog Set Cardinality Space, Insertion Time, Lookup Time
Count-min-sketch Counting occurrences on large sets Space, Insertion Time, Lookup Time
Cuckoo Filter Presence Space, Insertion Time, Lookup Time
Top-K Keep track of top records Space, Insertion Time, Lookup Time
Set Membership
Set Membership Problems
● Has a given element been inserted?
● e.g. Unique username for registration
Presence Problem Naive Approach 1
● Store User Info in table ‘users’ and Query
Check username Query username
Presence Problem Naive Approach 1
SELECT 1
FROM users
WHERE username = ‘selected_username’
Check username Query username
Summary
Access Type Disk
Lookup Time O(n)
Extra Space
(beyond storing
user info)
O(1)
Presence Problem Naive Approach 2
● Store User Info in table ‘users’
● Index username
Check username Query username
Presence Problem Naive Approach 2
SELECT 1
FROM users
WHERE username = ‘selected_username’
Check username Query username
Summary
Access Type Disk
Lookup Time O(log(n))
Extra Space
(beyond storing
user info)
O(n)
Presence Problem Naive Approach 3
● Store usernames in Redis cache
Check username SISMEMBER
Presence Problem Naive Approach 3
● Store usernames in Redis cache
SADD usernames selected_username
SISMEMBER usernames selected_username
Check username SISMEMBER
Summary
Access Type Memory
Lookup Time O(1)
Extra Space
(beyond storing
user info)
O(n)
Bloom Filters
Bloom Filter
● Specialized ‘Probabilistic’ Data Structure for presence checks
● Can say if element has definitely not been added
● Can say if element has probably been added
● Uses constant K-hashes scheme
● Represented as a 1D array of bits
● All operations O(1) complexity
● Space complexity O(n) - bits
Insert:
For i = 0->K:
FILTER[H[ i ](key)] = 1
Query:
For i = 0 -> K:
If FILTER[H[ i ](key)] == 0:
Return False
Return true
Complexities
Type Worst Case
Space O(n) - BITS
Insert O(1)
Lookup O(1)
Delete Not Available
Example Initial State
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 0 0 0 0 0 0 0 0
Example Insert username ‘razzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 0 0 0 0 0 0 0 0
Example Insert username ‘razzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 0 0 0 0 0
● H1(razzle) = 2
Example Insert username ‘razzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 0 0
● H1(razzle) = 2
● H2(razzle) = 5
Example Insert username ‘razzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
● H1(razzle) = 2
● H2(razzle) = 5
● H3(razzle) = 8
Example Query username ‘fizzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
H1(fizzle) = 8 - bit 8 is set—maybe?
Example Query username ‘fizzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
H1(fizzle) = 8 - bit 8 is set—maybe?
H3(fizzle) = 2 - bit 2 is set—maybe?
Example Query username ‘fizzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
H1(fizzle) = 8 - bit 8 is set—maybe?
H3(fizzle) = 2 - bit 2 is set—maybe?
H2(fizzle) = 4 - bit 4 is not set—definitely not.
False Positives and Optimal K
● This algorithm will never give you false negatives, but it is
possible to report false positives
● You can optimize false positives by optimizing K
● Let c = hash-table-size/num-records
● Optimal K = c * ln(2)
● This will result in a false positivity rate of .6185^c, this will be
quite small
Counting Problems
What’s a Counting Problem?
● How many times does an individual occur in a stream
● Easy to do on small-mid size streams of data
● e.g. Counting Views on YouTube
● Nearly impossible to scale to enormous data sets
Naive Approach: Hash Table
● Hash Table of Counters
● Lookup name in Hash table, instead of storing record,
store an integer
● On insert, increment the integer
● On query, check the integer
Pros
● Straight Forward
● Guaranteed accuracy (if
storing whole object)
● O(n) Space Complexity in the
best case
● O(n) worst case time
complexity
● Scales poorly (think billions
of unique records)
● If relying on only a single
hash - very vulnerable to
collisions and overcounts
Cons
Naive Approach Relational DB
● Issue a Query to a traditional Relational Database searching for a count of
record where some condition occurs
SELECT COUNT( * ) FROM views
WHERE name=”Gangnam Style”
Linear Time Complexity O(n)
Linear Space Complexity O(n)
What’s the problem with a Billion Unique Records?
● Each unique record needs its own space in a Hash Table or row in a RDBMS
(perhaps several rows across multiple tables)
● Taxing on memory for Hash Table
○ 8 bit integer? 1GB
○ 16 bit? 2GB
○ 32 bit? 4GB
○ 64 bit? 8GB
● Maintaining such large data structures in a typical program’s memory isn’t
feasible
● In a relational database, it’s stored on disk
Count-min Sketch
Count-Min Sketch
● Specialized data structure for keeping count on very large streams of data
● Similar to Bloom filter in Concept - multi-hashed record
● 2D array of counters
● Sublinear Space Complexity
● Constant Time complexity
● Never undercounts, sometimes over counts
Increment:
For i = 0 -> k:
Table[ H(i) ][ i ] += 1
Query:
minimum = infinity
For i = 0 -> k:
minimum = min(minimum,Table[H(i)][i])
return minimum
Video Views Sketch 10 x 3
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 0 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 0 0 0 0 0 0
H3 0 0 0 0 0 0 0 0 0 0
Increment Gangnam Style
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 0 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 0 0 0 0 0 0
H3 0 0 0 0 0 0 0 0 0 0
Increment Gangnam Style
● H1(Gangnam Style) = 0
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 1 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 0 0 0 0 0 0
H3 0 0 0 0 0 0 0 0 0 0
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 1 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 0 0 0 0 0
H3 0 0 0 0 0 0 0 0 0 0
Increment Gangnam Style
● H1(Gangnam Style) = 0
● H2(Gangnam Style) = 4
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 1 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 0 0 0 0 0
H3 0 0 0 0 0 0 1 0 0 0
Increment Gangnam Style
● H1(Gangnam Style) = 0
● H2(Gangnam Style) = 4
● H3(Gangnam Style) = 6
Increment Baby Shark
● H1(Baby Shark) = 0
● H2(Baby Shark) = 5
● H3(Baby Shark) = 6
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 1 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 0 0 0 0 0
H3 0 0 0 0 0 0 1 0 0 0
Increment Baby Shark
● H1(Baby Shark) = 0
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 0 0 0 0 0
H3 0 0 0 0 0 0 1 0 0 0
Increment Baby Shark
● H1(Baby Shark) = 0
● H2(Baby Shark) = 5
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 1 0 0 0
Increment Baby Shark
● H1(Baby Shark) = 0
● H2(Baby Shark) = 5
● H3(Baby Shark) = 6
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 2 0 0 0
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 2 0 0 0
Query Gangnam Style
● H1(Gangnam Style) = 0
● MIN (2)
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 2 0 0 0
Query Gangnam Style
● H1(Gangnam Style) = 0
● H2(Gangnam Style) = 4
● MIN (2, 1)
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 2 0 0 0
Query Gangnam Style
● H1(Gangnam Style) = 0
● H2(Gangnam Style) = 4
● H3(Gangnam Style) = 6
● MIN (2, 1, 2) = 1
Type Worst Case
Space Sublinear
Increment O(1)
Query O(1)
Delete Not Available
Complexities
CMS Pros
● Extremely Fast - O(1)
● Super compact - sublinear
● Impossible to undercount
● incidence of overcounting - all
results are approximations
CMS Cons
When to Use a Count Min Sketch?
● Counting many unique instances
● When Approximation is fine
● When counts are likely to be skewed (think YouTube video views)
Set Cardinality
Set Cardinality
● Counting distinct elements inserted into set
● Easier on smaller data sets
● For exact counts - must preserve all unique elements
● Scales very poorly
HyperLogLog
HyperLogLog
● Probabilistic Data Structure to Count Distinct Elements
● Space Complexity is about O(1)
● Time Complexity O(1)
● Can handle billions of elements with less than 2 kB of memory
● Can scale up as needed - HLL is effectively constant even
though you may want to increase its size for enormous
cardinalities.
HyperLogLog Walkthrough
● Initialize an array of registers of size 2^P (where P is some constant, usually
around 16-18)
● When an Item is inserted
○ Hash the Item
○ Determine the register to update: i - from the left P bits of the item’s hash
○ Set registers[i] to the index of the rightmost 1 in the binary representation of the hash
● When Querying
○ Compute harmonic mean of the registers that have been set
○ Multiply by a constant determined by size of P
Example: Insert Username ‘bar’ P = 16
H(bar) = 3103595182
● 1011 1000 1111 1101 0001 1010 1010 1110
● Take first 16 bits -> 1011 1000 1111 1101 -> 47357 = register index
● Index of rightmost 1 = 1
● registers[47357] = 1
Get Cardinality
● Calculate harmonic mean of only set registers.
Only 1 set register: 47357 -> 1
ceiling(.673 * 1 * 1/(2^1)) = 1
Cardinality = 1
Probabilistic Data
Structures with
Redis-Stack
What Is Redis Stack?
● Grouping of modules for Redis, including Redis Bloom
● Adds additional functionality, e.g. Probabilistic Data structures to Redis
Demo
Steve Lorello
Developer Advocate
@Redis
@slorello
github.com/slorello89
slorello.com
Resources
Redis
https://redis.io
RedisBloom
https://redisbloom.io
Source Code For Demo:
https://github.com/slorello89/probablistic-data-structures-blazor
C# Implementation Bloom Filter, HyperLogLog, and Count-Min Sketch:
https://github.com/TheAlgorithms/C-Sharp/tree/master/DataStructures/Probabilistic
Come Check Us Out!
Redis University:
https://university.redis.com
Discord:
https://discord.com/invite/redis
Baby Name Freq hash table
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
H(Liam) = 4
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
H(Liam) = 4
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 0 0 0 0 0
H(Sophia) = 8
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 0 0 0 0 0
H(Sophia) = 8
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 0 0 0 1 0
H(Liam) = 4
0 1 2 3 4 5 6 7 8 9
0 0 0 0 2 0 0 0 1 0
Baby Name existence table k = 3
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 0 1 0 0 0
H1(Susan)=0 H2(Susan) = 5 H3(Susan) = 6
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 0 1 0 0 0
H1(Susan)=0 H2(Susan) = 5 H3(Susan) = 6
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 1 1 0 0 0
Does Tom Exist? H1(Tom)=1 H2(Tom)=4 H3(Tom) = 5
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 1 1 0 0 0
Does Tom Exist? H1(Tom)=1 H2(Tom)=4 H3(Tom) = 5
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 1 1 0 0 0
A hash of Tom = 0, so no Tom does not exist!
Does Liam Exist? H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 1 1 0 0 0
All Hashes of Liam = 1, so we repot YES
2-Choice Hashing
● Use two Hash Functions instead of one
● Store @ index with Lowest Load (smallest linked list)
● Time Complexity goes from log(n) in traditional chain hash table -> log(log(n))
with high probability, so nearly constant
● Benefit stops at 2 hashes, additional hashes don’t help
● Still O(n) space complexity
Hash Tables
Hash Table
● Ubiquitous data structure for
storing associated data. E.g.
Map, Dictionary, Dict
● Set of Keys associated with
array of values
● Run hash function on key to find
position in array to store value
Source: wikipedia
Hash Collisions
● Hash Functions can
produce the same
output for different
keys - creates
collision
● Collision Resolution
either sequentially of
with linked-list
Hash Table Complexity - with chain hashing
Type Amortized Worst Case
Space O(n) O(n)
Insert O(1) O(n)
Lookup O(1) O(n)
Delete O(1) O(n)

More Related Content

What's hot

Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
CloudxLab
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
NGINX Installation and Tuning
NGINX Installation and TuningNGINX Installation and Tuning
NGINX Installation and Tuning
NGINX, Inc.
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
Flink Forward
 
Elk with Openstack
Elk with OpenstackElk with Openstack
Elk with Openstack
Arun prasath
 
Deep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesDeep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instances
Amazon Web Services
 
Storing 16 Bytes at Scale
Storing 16 Bytes at ScaleStoring 16 Bytes at Scale
Storing 16 Bytes at Scale
Fabian Reinartz
 
Deep dive into Kubernetes Networking
Deep dive into Kubernetes NetworkingDeep dive into Kubernetes Networking
Deep dive into Kubernetes Networking
Sreenivas Makam
 
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Henning Jacobs
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
Rohit Agrawal
 
ksqlDB로 실시간 데이터 변환 및 스트림 처리
ksqlDB로 실시간 데이터 변환 및 스트림 처리ksqlDB로 실시간 데이터 변환 및 스트림 처리
ksqlDB로 실시간 데이터 변환 및 스트림 처리
confluent
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Databricks
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
DataWorks Summit
 
Ozone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsOzone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objects
DataWorks Summit
 
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
OpenStack Korea Community
 
SecurityFest-22-Fitzl-beyond.pdf
SecurityFest-22-Fitzl-beyond.pdfSecurityFest-22-Fitzl-beyond.pdf
SecurityFest-22-Fitzl-beyond.pdf
Csaba Fitzl
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 

What's hot (20)

Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
NGINX Installation and Tuning
NGINX Installation and TuningNGINX Installation and Tuning
NGINX Installation and Tuning
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Elk with Openstack
Elk with OpenstackElk with Openstack
Elk with Openstack
 
Deep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesDeep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instances
 
Storing 16 Bytes at Scale
Storing 16 Bytes at ScaleStoring 16 Bytes at Scale
Storing 16 Bytes at Scale
 
Deep dive into Kubernetes Networking
Deep dive into Kubernetes NetworkingDeep dive into Kubernetes Networking
Deep dive into Kubernetes Networking
 
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
ksqlDB로 실시간 데이터 변환 및 스트림 처리
ksqlDB로 실시간 데이터 변환 및 스트림 처리ksqlDB로 실시간 데이터 변환 및 스트림 처리
ksqlDB로 실시간 데이터 변환 및 스트림 처리
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 
Ozone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsOzone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objects
 
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
 
SecurityFest-22-Fitzl-beyond.pdf
SecurityFest-22-Fitzl-beyond.pdfSecurityFest-22-Fitzl-beyond.pdf
SecurityFest-22-Fitzl-beyond.pdf
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 

Similar to Count-min sketch to Infinity.pdf

Tech talk Probabilistic Data Structure
Tech talk  Probabilistic Data StructureTech talk  Probabilistic Data Structure
Tech talk Probabilistic Data Structure
Rishabh Dugar
 
Interview questions slide deck
Interview questions slide deckInterview questions slide deck
Interview questions slide deck
MikeBegley
 
lecture8_Cuong.ppt
lecture8_Cuong.pptlecture8_Cuong.ppt
lecture8_Cuong.ppt
HongV34104
 
Python Memory Management 101(Europython)
Python Memory Management 101(Europython)Python Memory Management 101(Europython)
Python Memory Management 101(Europython)
Jose Manuel Ortega Candel
 
Python memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collectorPython memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collector
Jose Manuel Ortega Candel
 
Basics in algorithms and data structure
Basics in algorithms and data structure Basics in algorithms and data structure
Basics in algorithms and data structure
Eman magdy
 
Python for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo CruzPython for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo Cruz
rpmcruz
 
Basic arithmetic, instruction execution and program
Basic arithmetic, instruction execution and programBasic arithmetic, instruction execution and program
Basic arithmetic, instruction execution and program
JyotiprakashMishra18
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. Cardinality
Andrii Gakhov
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...
Andrii Gakhov
 
Beyond PFCount: Shrif Nada
Beyond PFCount: Shrif NadaBeyond PFCount: Shrif Nada
Beyond PFCount: Shrif Nada
Redis Labs
 
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event SlidesMelbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides
Victoria Schiffer
 
streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
GopiNathVelivela
 
Python Quick Start
Python Quick StartPython Quick Start
Python Quick Start
Abbas Ali
 
Python in 90 Minutes
Python in 90 MinutesPython in 90 Minutes
Python in 90 Minutes
Nachu Muthu
 
Introduction to Algorithms
Introduction to AlgorithmsIntroduction to Algorithms
Introduction to Algorithms
Mohamed Essam
 
GAN in_kakao
GAN in_kakaoGAN in_kakao
GAN in_kakao
Junho Kim
 
Data Mining Lecture_4.pptx
Data Mining Lecture_4.pptxData Mining Lecture_4.pptx
Data Mining Lecture_4.pptx
Subrata Kumer Paul
 
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsProbabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate Solutions
Oleksandr Pryymak
 
LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)
Wang Hsiangkai
 

Similar to Count-min sketch to Infinity.pdf (20)

Tech talk Probabilistic Data Structure
Tech talk  Probabilistic Data StructureTech talk  Probabilistic Data Structure
Tech talk Probabilistic Data Structure
 
Interview questions slide deck
Interview questions slide deckInterview questions slide deck
Interview questions slide deck
 
lecture8_Cuong.ppt
lecture8_Cuong.pptlecture8_Cuong.ppt
lecture8_Cuong.ppt
 
Python Memory Management 101(Europython)
Python Memory Management 101(Europython)Python Memory Management 101(Europython)
Python Memory Management 101(Europython)
 
Python memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collectorPython memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collector
 
Basics in algorithms and data structure
Basics in algorithms and data structure Basics in algorithms and data structure
Basics in algorithms and data structure
 
Python for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo CruzPython for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo Cruz
 
Basic arithmetic, instruction execution and program
Basic arithmetic, instruction execution and programBasic arithmetic, instruction execution and program
Basic arithmetic, instruction execution and program
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. Cardinality
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...
 
Beyond PFCount: Shrif Nada
Beyond PFCount: Shrif NadaBeyond PFCount: Shrif Nada
Beyond PFCount: Shrif Nada
 
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event SlidesMelbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides
 
streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
 
Python Quick Start
Python Quick StartPython Quick Start
Python Quick Start
 
Python in 90 Minutes
Python in 90 MinutesPython in 90 Minutes
Python in 90 Minutes
 
Introduction to Algorithms
Introduction to AlgorithmsIntroduction to Algorithms
Introduction to Algorithms
 
GAN in_kakao
GAN in_kakaoGAN in_kakao
GAN in_kakao
 
Data Mining Lecture_4.pptx
Data Mining Lecture_4.pptxData Mining Lecture_4.pptx
Data Mining Lecture_4.pptx
 
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsProbabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate Solutions
 
LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)
 

More from Stephen Lorello

Florida Man Uses Cache as Database.pdf
Florida Man Uses Cache as Database.pdfFlorida Man Uses Cache as Database.pdf
Florida Man Uses Cache as Database.pdf
Stephen Lorello
 
An Introduction to Redis for .NET Developers.pdf
An Introduction to Redis for .NET Developers.pdfAn Introduction to Redis for .NET Developers.pdf
An Introduction to Redis for .NET Developers.pdf
Stephen Lorello
 
An Introduction to Redis for Developers.pdf
An Introduction to Redis for Developers.pdfAn Introduction to Redis for Developers.pdf
An Introduction to Redis for Developers.pdf
Stephen Lorello
 
Indexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netIndexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .net
Stephen Lorello
 
Frontends w ithout javascript
Frontends w ithout javascriptFrontends w ithout javascript
Frontends w ithout javascript
Stephen Lorello
 
Intro to computer vision in .net update
Intro to computer vision in .net   updateIntro to computer vision in .net   update
Intro to computer vision in .net update
Stephen Lorello
 
Intro to computer vision in .net
Intro to computer vision in .netIntro to computer vision in .net
Intro to computer vision in .net
Stephen Lorello
 

More from Stephen Lorello (7)

Florida Man Uses Cache as Database.pdf
Florida Man Uses Cache as Database.pdfFlorida Man Uses Cache as Database.pdf
Florida Man Uses Cache as Database.pdf
 
An Introduction to Redis for .NET Developers.pdf
An Introduction to Redis for .NET Developers.pdfAn Introduction to Redis for .NET Developers.pdf
An Introduction to Redis for .NET Developers.pdf
 
An Introduction to Redis for Developers.pdf
An Introduction to Redis for Developers.pdfAn Introduction to Redis for Developers.pdf
An Introduction to Redis for Developers.pdf
 
Indexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netIndexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .net
 
Frontends w ithout javascript
Frontends w ithout javascriptFrontends w ithout javascript
Frontends w ithout javascript
 
Intro to computer vision in .net update
Intro to computer vision in .net   updateIntro to computer vision in .net   update
Intro to computer vision in .net update
 
Intro to computer vision in .net
Intro to computer vision in .netIntro to computer vision in .net
Intro to computer vision in .net
 

Recently uploaded

What is Master Data Management by PiLog Group
What is Master Data Management by PiLog GroupWhat is Master Data Management by PiLog Group
What is Master Data Management by PiLog Group
aymanquadri279
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
Hironori Washizaki
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Julian Hyde
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
Grant Fritchey
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
Octavian Nadolu
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
ICS
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise EditionWhy Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Envertis Software Solutions
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
Gerardo Pardo-Castellote
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
Quickdice ERP
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 
SMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API ServiceSMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API Service
Yara Milbes
 

Recently uploaded (20)

What is Master Data Management by PiLog Group
What is Master Data Management by PiLog GroupWhat is Master Data Management by PiLog Group
What is Master Data Management by PiLog Group
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise EditionWhy Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 
SMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API ServiceSMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API Service
 

Count-min sketch to Infinity.pdf

  • 1. Count-min sketch to Infinity: Using Probabilistic Data Structures to Solve Presence, Counting, and Distinct Count Problems in .NET presented by: Steve Lorello - Developer Advocate @Redis
  • 2. Agenda ● What are Probabilistic Data Structures? ● Set Membership problems ● Bloom Filters ● Counting problems ● Count-Min-Sketch ● Distinct Count problems ● HyperLogLog ● Using Probabilistic Data Structures with Redis
  • 3. What are Probabilistic Data Structures? ● Class of specialized data structures ● Tackle specific problems ● Use probability approximate
  • 4. Probabilistic Data Structures Examples Name Problem Solved Optimization Bloom Filter Presence Space, Insertion Time, Lookup Time Quotient Filter Presence Space, Insertion Time, Lookup Time Skip List Ordering and Searching Insertion Time, Search time HyperLogLog Set Cardinality Space, Insertion Time, Lookup Time Count-min-sketch Counting occurrences on large sets Space, Insertion Time, Lookup Time Cuckoo Filter Presence Space, Insertion Time, Lookup Time Top-K Keep track of top records Space, Insertion Time, Lookup Time
  • 6. Set Membership Problems ● Has a given element been inserted? ● e.g. Unique username for registration
  • 7. Presence Problem Naive Approach 1 ● Store User Info in table ‘users’ and Query Check username Query username
  • 8. Presence Problem Naive Approach 1 SELECT 1 FROM users WHERE username = ‘selected_username’ Check username Query username
  • 9. Summary Access Type Disk Lookup Time O(n) Extra Space (beyond storing user info) O(1)
  • 10. Presence Problem Naive Approach 2 ● Store User Info in table ‘users’ ● Index username Check username Query username
  • 11. Presence Problem Naive Approach 2 SELECT 1 FROM users WHERE username = ‘selected_username’ Check username Query username
  • 12. Summary Access Type Disk Lookup Time O(log(n)) Extra Space (beyond storing user info) O(n)
  • 13. Presence Problem Naive Approach 3 ● Store usernames in Redis cache Check username SISMEMBER
  • 14. Presence Problem Naive Approach 3 ● Store usernames in Redis cache SADD usernames selected_username SISMEMBER usernames selected_username Check username SISMEMBER
  • 15. Summary Access Type Memory Lookup Time O(1) Extra Space (beyond storing user info) O(n)
  • 17. Bloom Filter ● Specialized ‘Probabilistic’ Data Structure for presence checks ● Can say if element has definitely not been added ● Can say if element has probably been added ● Uses constant K-hashes scheme ● Represented as a 1D array of bits ● All operations O(1) complexity ● Space complexity O(n) - bits
  • 18. Insert: For i = 0->K: FILTER[H[ i ](key)] = 1
  • 19. Query: For i = 0 -> K: If FILTER[H[ i ](key)] == 0: Return False Return true
  • 20. Complexities Type Worst Case Space O(n) - BITS Insert O(1) Lookup O(1) Delete Not Available
  • 21. Example Initial State Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 0 0 0 0 0 0 0 0
  • 22. Example Insert username ‘razzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 0 0 0 0 0 0 0 0
  • 23. Example Insert username ‘razzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 0 0 0 0 0 ● H1(razzle) = 2
  • 24. Example Insert username ‘razzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 1 0 0 0 0 ● H1(razzle) = 2 ● H2(razzle) = 5
  • 25. Example Insert username ‘razzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 1 0 0 1 0 ● H1(razzle) = 2 ● H2(razzle) = 5 ● H3(razzle) = 8
  • 26. Example Query username ‘fizzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 1 0 0 1 0 H1(fizzle) = 8 - bit 8 is set—maybe?
  • 27. Example Query username ‘fizzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 1 0 0 1 0 H1(fizzle) = 8 - bit 8 is set—maybe? H3(fizzle) = 2 - bit 2 is set—maybe?
  • 28. Example Query username ‘fizzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 1 0 0 1 0 H1(fizzle) = 8 - bit 8 is set—maybe? H3(fizzle) = 2 - bit 2 is set—maybe? H2(fizzle) = 4 - bit 4 is not set—definitely not.
  • 29. False Positives and Optimal K ● This algorithm will never give you false negatives, but it is possible to report false positives ● You can optimize false positives by optimizing K ● Let c = hash-table-size/num-records ● Optimal K = c * ln(2) ● This will result in a false positivity rate of .6185^c, this will be quite small
  • 31. What’s a Counting Problem? ● How many times does an individual occur in a stream ● Easy to do on small-mid size streams of data ● e.g. Counting Views on YouTube ● Nearly impossible to scale to enormous data sets
  • 32. Naive Approach: Hash Table ● Hash Table of Counters ● Lookup name in Hash table, instead of storing record, store an integer ● On insert, increment the integer ● On query, check the integer
  • 33. Pros ● Straight Forward ● Guaranteed accuracy (if storing whole object) ● O(n) Space Complexity in the best case ● O(n) worst case time complexity ● Scales poorly (think billions of unique records) ● If relying on only a single hash - very vulnerable to collisions and overcounts Cons
  • 34. Naive Approach Relational DB ● Issue a Query to a traditional Relational Database searching for a count of record where some condition occurs SELECT COUNT( * ) FROM views WHERE name=”Gangnam Style” Linear Time Complexity O(n) Linear Space Complexity O(n)
  • 35. What’s the problem with a Billion Unique Records? ● Each unique record needs its own space in a Hash Table or row in a RDBMS (perhaps several rows across multiple tables) ● Taxing on memory for Hash Table ○ 8 bit integer? 1GB ○ 16 bit? 2GB ○ 32 bit? 4GB ○ 64 bit? 8GB ● Maintaining such large data structures in a typical program’s memory isn’t feasible ● In a relational database, it’s stored on disk
  • 37. Count-Min Sketch ● Specialized data structure for keeping count on very large streams of data ● Similar to Bloom filter in Concept - multi-hashed record ● 2D array of counters ● Sublinear Space Complexity ● Constant Time complexity ● Never undercounts, sometimes over counts
  • 38. Increment: For i = 0 -> k: Table[ H(i) ][ i ] += 1
  • 39. Query: minimum = infinity For i = 0 -> k: minimum = min(minimum,Table[H(i)][i]) return minimum
  • 40. Video Views Sketch 10 x 3 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 0 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 0 0 0 0 0 0 H3 0 0 0 0 0 0 0 0 0 0
  • 41. Increment Gangnam Style Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 0 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 0 0 0 0 0 0 H3 0 0 0 0 0 0 0 0 0 0
  • 42. Increment Gangnam Style ● H1(Gangnam Style) = 0 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 1 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 0 0 0 0 0 0 H3 0 0 0 0 0 0 0 0 0 0
  • 43. Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 1 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 0 0 0 0 0 H3 0 0 0 0 0 0 0 0 0 0 Increment Gangnam Style ● H1(Gangnam Style) = 0 ● H2(Gangnam Style) = 4
  • 44. Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 1 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 0 0 0 0 0 H3 0 0 0 0 0 0 1 0 0 0 Increment Gangnam Style ● H1(Gangnam Style) = 0 ● H2(Gangnam Style) = 4 ● H3(Gangnam Style) = 6
  • 45. Increment Baby Shark ● H1(Baby Shark) = 0 ● H2(Baby Shark) = 5 ● H3(Baby Shark) = 6 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 1 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 0 0 0 0 0 H3 0 0 0 0 0 0 1 0 0 0
  • 46. Increment Baby Shark ● H1(Baby Shark) = 0 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 0 0 0 0 0 H3 0 0 0 0 0 0 1 0 0 0
  • 47. Increment Baby Shark ● H1(Baby Shark) = 0 ● H2(Baby Shark) = 5 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 1 0 0 0 0 H3 0 0 0 0 0 0 1 0 0 0
  • 48. Increment Baby Shark ● H1(Baby Shark) = 0 ● H2(Baby Shark) = 5 ● H3(Baby Shark) = 6 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 1 0 0 0 0 H3 0 0 0 0 0 0 2 0 0 0
  • 49. Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 1 0 0 0 0 H3 0 0 0 0 0 0 2 0 0 0 Query Gangnam Style ● H1(Gangnam Style) = 0 ● MIN (2)
  • 50. Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 1 0 0 0 0 H3 0 0 0 0 0 0 2 0 0 0 Query Gangnam Style ● H1(Gangnam Style) = 0 ● H2(Gangnam Style) = 4 ● MIN (2, 1)
  • 51. Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 1 0 0 0 0 H3 0 0 0 0 0 0 2 0 0 0 Query Gangnam Style ● H1(Gangnam Style) = 0 ● H2(Gangnam Style) = 4 ● H3(Gangnam Style) = 6 ● MIN (2, 1, 2) = 1
  • 52. Type Worst Case Space Sublinear Increment O(1) Query O(1) Delete Not Available Complexities
  • 53. CMS Pros ● Extremely Fast - O(1) ● Super compact - sublinear ● Impossible to undercount ● incidence of overcounting - all results are approximations CMS Cons
  • 54. When to Use a Count Min Sketch? ● Counting many unique instances ● When Approximation is fine ● When counts are likely to be skewed (think YouTube video views)
  • 56. Set Cardinality ● Counting distinct elements inserted into set ● Easier on smaller data sets ● For exact counts - must preserve all unique elements ● Scales very poorly
  • 58. HyperLogLog ● Probabilistic Data Structure to Count Distinct Elements ● Space Complexity is about O(1) ● Time Complexity O(1) ● Can handle billions of elements with less than 2 kB of memory ● Can scale up as needed - HLL is effectively constant even though you may want to increase its size for enormous cardinalities.
  • 59. HyperLogLog Walkthrough ● Initialize an array of registers of size 2^P (where P is some constant, usually around 16-18) ● When an Item is inserted ○ Hash the Item ○ Determine the register to update: i - from the left P bits of the item’s hash ○ Set registers[i] to the index of the rightmost 1 in the binary representation of the hash ● When Querying ○ Compute harmonic mean of the registers that have been set ○ Multiply by a constant determined by size of P
  • 60. Example: Insert Username ‘bar’ P = 16 H(bar) = 3103595182 ● 1011 1000 1111 1101 0001 1010 1010 1110 ● Take first 16 bits -> 1011 1000 1111 1101 -> 47357 = register index ● Index of rightmost 1 = 1 ● registers[47357] = 1
  • 61. Get Cardinality ● Calculate harmonic mean of only set registers. Only 1 set register: 47357 -> 1 ceiling(.673 * 1 * 1/(2^1)) = 1 Cardinality = 1
  • 63. What Is Redis Stack? ● Grouping of modules for Redis, including Redis Bloom ● Adds additional functionality, e.g. Probabilistic Data structures to Redis
  • 64. Demo
  • 66. Resources Redis https://redis.io RedisBloom https://redisbloom.io Source Code For Demo: https://github.com/slorello89/probablistic-data-structures-blazor C# Implementation Bloom Filter, HyperLogLog, and Count-Min Sketch: https://github.com/TheAlgorithms/C-Sharp/tree/master/DataStructures/Probabilistic
  • 67. Come Check Us Out! Redis University: https://university.redis.com Discord: https://discord.com/invite/redis
  • 68.
  • 69. Baby Name Freq hash table 0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0
  • 70. H(Liam) = 4 0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0
  • 71. H(Liam) = 4 0 1 2 3 4 5 6 7 8 9 0 0 0 0 1 0 0 0 0 0
  • 72. H(Sophia) = 8 0 1 2 3 4 5 6 7 8 9 0 0 0 0 1 0 0 0 0 0
  • 73. H(Sophia) = 8 0 1 2 3 4 5 6 7 8 9 0 0 0 0 1 0 0 0 1 0
  • 74. H(Liam) = 4 0 1 2 3 4 5 6 7 8 9 0 0 0 0 2 0 0 0 1 0
  • 75. Baby Name existence table k = 3 0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0
  • 76. H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6 0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0
  • 77. H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 0 1 0 0 0
  • 78. H1(Susan)=0 H2(Susan) = 5 H3(Susan) = 6 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 0 1 0 0 0
  • 79. H1(Susan)=0 H2(Susan) = 5 H3(Susan) = 6 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 1 1 0 0 0
  • 80. Does Tom Exist? H1(Tom)=1 H2(Tom)=4 H3(Tom) = 5 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 1 1 0 0 0
  • 81. Does Tom Exist? H1(Tom)=1 H2(Tom)=4 H3(Tom) = 5 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 1 1 0 0 0 A hash of Tom = 0, so no Tom does not exist!
  • 82. Does Liam Exist? H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 1 1 0 0 0 All Hashes of Liam = 1, so we repot YES
  • 83. 2-Choice Hashing ● Use two Hash Functions instead of one ● Store @ index with Lowest Load (smallest linked list) ● Time Complexity goes from log(n) in traditional chain hash table -> log(log(n)) with high probability, so nearly constant ● Benefit stops at 2 hashes, additional hashes don’t help ● Still O(n) space complexity
  • 85. Hash Table ● Ubiquitous data structure for storing associated data. E.g. Map, Dictionary, Dict ● Set of Keys associated with array of values ● Run hash function on key to find position in array to store value Source: wikipedia
  • 86. Hash Collisions ● Hash Functions can produce the same output for different keys - creates collision ● Collision Resolution either sequentially of with linked-list
  • 87. Hash Table Complexity - with chain hashing Type Amortized Worst Case Space O(n) O(n) Insert O(1) O(n) Lookup O(1) O(n) Delete O(1) O(n)