Count-min sketch to Infinity:
Using Probabilistic Data Structures to Solve Presence, Counting, and Distinct
Count Problems in .NET
presented by: Steve Lorello - Developer Advocate @Redis
Agenda
● What are Probabilistic Data Structures?
● Set Membership problems
● Bloom Filters
● Counting problems
● Count-Min-Sketch
● Distinct Count problems
● HyperLogLog
● Using Probabilistic Data Structures with Redis
What are Probabilistic Data Structures?
● Class of specialized data structures
● Tackle specific problems
● Use probability to give approximate answers
Probabilistic Data Structures Examples
Name | Problem Solved | Optimization
Bloom Filter | Presence | Space, Insertion Time, Lookup Time
Quotient Filter | Presence | Space, Insertion Time, Lookup Time
Skip List | Ordering and Searching | Insertion Time, Search Time
HyperLogLog | Set Cardinality | Space, Insertion Time, Lookup Time
Count-min-sketch | Counting occurrences on large sets | Space, Insertion Time, Lookup Time
Cuckoo Filter | Presence | Space, Insertion Time, Lookup Time
Top-K | Keep track of top records | Space, Insertion Time, Lookup Time
Set Membership
Set Membership Problems
● Has a given element been inserted?
● e.g. Unique username for registration
Presence Problem Naive Approach 1
● Store User Info in table ‘users’ and Query
SELECT 1
FROM users
WHERE username = 'selected_username'
Summary
Access Type: Disk
Lookup Time: O(n)
Extra Space (beyond storing user info): O(1)
Presence Problem Naive Approach 2
● Store User Info in table ‘users’
● Index username
SELECT 1
FROM users
WHERE username = 'selected_username'
Summary
Access Type: Disk
Lookup Time: O(log(n))
Extra Space (beyond storing user info): O(n)
Presence Problem Naive Approach 3
● Store usernames in Redis cache
SADD usernames selected_username
SISMEMBER usernames selected_username
Summary
Access Type: Memory
Lookup Time: O(1)
Extra Space (beyond storing user info): O(n)
Bloom Filters
Bloom Filter
● Specialized ‘Probabilistic’ Data Structure for presence checks
● Can say if element has definitely not been added
● Can say if element has probably been added
● Uses a constant number (K) of hash functions
● Represented as a 1D array of bits
● All operations O(1) complexity
● Space complexity O(n) - bits
Insert:
  For i = 0 -> K-1:
    FILTER[H[i](key)] = 1
Query:
  For i = 0 -> K-1:
    If FILTER[H[i](key)] == 0:
      Return False
  Return True
Complexities
Type Worst Case
Space O(n) - BITS
Insert O(1)
Lookup O(1)
Delete Not Available
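The insert and query pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the implementation used in the talk or in Redis: the class name `BloomFilter`, the SHA-256 based hashing, and the sizes chosen are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: num_bits bits, num_hashes salted SHA-256 hashes."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive the i-th hash by salting the key with i (illustrative choice).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # Any unset bit proves the key was never added.
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("razzle")
print(bf.might_contain("razzle"))  # True: added keys are always found
```

A `True` result only means "probably added"; a `False` result means "definitely not added".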
Example Initial State
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 0 0 0 0 0 0 0 0
Example Insert username ‘razzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 0 0 0 0 0 0 0 0
Example Insert username ‘razzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 0 0 0 0 0
● H1(razzle) = 2
Example Insert username ‘razzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 0 0
● H1(razzle) = 2
● H2(razzle) = 5
Example Insert username ‘razzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
● H1(razzle) = 2
● H2(razzle) = 5
● H3(razzle) = 8
Example Query username ‘fizzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
H1(fizzle) = 8 - bit 8 is set—maybe?
Example Query username ‘fizzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
H1(fizzle) = 8 - bit 8 is set—maybe?
H3(fizzle) = 2 - bit 2 is set—maybe?
Example Query username ‘fizzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
H1(fizzle) = 8 - bit 8 is set—maybe?
H3(fizzle) = 2 - bit 2 is set—maybe?
H2(fizzle) = 4 - bit 4 is not set—definitely not.
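The walkthrough above can be replayed in a few lines of Python. The hash positions are copied from the slides rather than computed, so the dictionary `SLIDE_HASHES` is a stand-in for real hash functions.

```python
# Hash positions taken from the slides (H1, H2, H3), not from a real hash.
SLIDE_HASHES = {
    "razzle": [2, 5, 8],
    "fizzle": [8, 4, 2],
}

bits = [0] * 10  # the 10-bit filter from the example


def insert(name: str) -> None:
    for pos in SLIDE_HASHES[name]:
        bits[pos] = 1


def query(name: str) -> bool:
    # One unset bit is enough to answer "definitely not".
    return all(bits[pos] == 1 for pos in SLIDE_HASHES[name])


insert("razzle")
print(bits)             # [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
print(query("fizzle"))  # False: bit 4 is not set
```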
False Positives and Optimal K
● This algorithm never gives false negatives, but it can report false positives
● You can minimize the false-positive rate by choosing K well
● Let c = number of bits in the filter / number of records
● Optimal K = c * ln(2)
● This gives a false-positive rate of about 0.6185^c, which shrinks quickly as c grows
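As a quick worked example of the formulas above, the snippet below evaluates them for a filter with 10 bits per record. The function names and the sample sizes are made up for illustration.

```python
import math


def optimal_k(num_bits: int, num_records: int) -> float:
    """Optimal number of hash functions: c * ln(2), where c = bits per record."""
    c = num_bits / num_records
    return c * math.log(2)


def false_positive_rate(num_bits: int, num_records: int) -> float:
    """Approximate false-positive rate at the optimal K: 0.6185 ** c."""
    c = num_bits / num_records
    return 0.6185 ** c


# 10,000 bits for 1,000 records gives c = 10.
k = optimal_k(10_000, 1_000)
rate = false_positive_rate(10_000, 1_000)
print(round(k))      # 7 hash functions
print(rate < 0.01)   # True: under 1% false positives
```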
Counting Problems
What’s a Counting Problem?
● How many times does an individual element occur in a stream?
● Easy to do on small-mid size streams of data
● e.g. Counting Views on YouTube
● Nearly impossible to scale to enormous data sets
Naive Approach: Hash Table
● Hash Table of Counters
● Lookup name in Hash table, instead of storing record,
store an integer
● On insert, increment the integer
● On query, check the integer
Pros
● Straightforward
● Guaranteed accuracy (if storing whole object)
Cons
● O(n) Space Complexity in the best case
● O(n) worst case time complexity
● Scales poorly (think billions of unique records)
● If relying on only a single hash - very vulnerable to collisions and overcounts
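For comparison with the sketches that follow, the naive hash-table-of-counters approach looks like this in Python. `collections.Counter` is a dictionary of counters; the sample stream is invented for the example.

```python
from collections import Counter

# One exact counter per unique key: accurate, but memory grows with the
# number of distinct keys.
views = Counter()
for video in ["Gangnam Style", "Baby Shark", "Gangnam Style"]:
    views[video] += 1

print(views["Gangnam Style"])  # 2
print(len(views))              # 2 distinct keys stored
```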
Naive Approach Relational DB
● Issue a query to a traditional relational database that counts the records matching some condition
SELECT COUNT(*) FROM views
WHERE name = 'Gangnam Style'
Linear Time Complexity O(n)
Linear Space Complexity O(n)
What’s the problem with a Billion Unique Records?
● Each unique record needs its own space in a Hash Table or row in a RDBMS
(perhaps several rows across multiple tables)
● Taxing on memory for Hash Table
○ 8 bit integer? 1GB
○ 16 bit? 2GB
○ 32 bit? 4GB
○ 64 bit? 8GB
● Maintaining such large data structures in a typical program’s memory isn’t
feasible
● In a relational database, it’s stored on disk
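The memory figures above follow from simple arithmetic, reproduced here as a sanity check. The numbers cover the counters alone; keys and hash-table overhead would add to them. The function name is illustrative.

```python
UNIQUE_RECORDS = 1_000_000_000  # one billion unique keys


def counter_memory_gb(bits_per_counter: int, records: int = UNIQUE_RECORDS) -> float:
    """Memory for the counters alone, in decimal gigabytes."""
    return records * (bits_per_counter // 8) / 1e9


for bits in (8, 16, 32, 64):
    print(bits, counter_memory_gb(bits))
```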
Count-min Sketch
Count-Min Sketch
● Specialized data structure for keeping count on very large streams of data
● Similar to Bloom filter in Concept - multi-hashed record
● 2D array of counters
● Sublinear Space Complexity
● Constant Time complexity
● Never undercounts, sometimes over counts
Increment:
  For i = 0 -> k-1:
    Table[i][H[i](key)] += 1
Query:
  minimum = infinity
  For i = 0 -> k-1:
    minimum = min(minimum, Table[i][H[i](key)])
  return minimum
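A minimal Python sketch of the increment and query steps above, assuming one row of counters per hash function. The class name, the salted SHA-256 hashing, and the table dimensions are assumptions for illustration, not the implementation from the talk or from Redis.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: `depth` rows (one per hash), `width` counters each."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row: int, key: str) -> int:
        # Row-specific hash obtained by salting the key with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def increment(self, key: str, amount: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += amount

    def query(self, key: str) -> int:
        # Collisions only ever add to a counter, so the minimum is the
        # tightest upper bound on the true count.
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))


cms = CountMinSketch(width=1000, depth=3)
for _ in range(5):
    cms.increment("Gangnam Style")
cms.increment("Baby Shark")
print(cms.query("Gangnam Style") >= 5)  # True: never undercounts
```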
Video Views Sketch 10 x 3
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 0 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 0 0 0 0 0 0
H3 0 0 0 0 0 0 0 0 0 0
Increment Gangnam Style
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 0 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 0 0 0 0 0 0
H3 0 0 0 0 0 0 0 0 0 0
Increment Gangnam Style
● H1(Gangnam Style) = 0
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 1 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 0 0 0 0 0 0
H3 0 0 0 0 0 0 0 0 0 0
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 1 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 0 0 0 0 0
H3 0 0 0 0 0 0 0 0 0 0
Increment Gangnam Style
● H1(Gangnam Style) = 0
● H2(Gangnam Style) = 4
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 1 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 0 0 0 0 0
H3 0 0 0 0 0 0 1 0 0 0
Increment Gangnam Style
● H1(Gangnam Style) = 0
● H2(Gangnam Style) = 4
● H3(Gangnam Style) = 6
Increment Baby Shark
● H1(Baby Shark) = 0
● H2(Baby Shark) = 5
● H3(Baby Shark) = 6
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 1 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 0 0 0 0 0
H3 0 0 0 0 0 0 1 0 0 0
Increment Baby Shark
● H1(Baby Shark) = 0
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 0 0 0 0 0
H3 0 0 0 0 0 0 1 0 0 0
Increment Baby Shark
● H1(Baby Shark) = 0
● H2(Baby Shark) = 5
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 1 0 0 0
Increment Baby Shark
● H1(Baby Shark) = 0
● H2(Baby Shark) = 5
● H3(Baby Shark) = 6
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 2 0 0 0
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 2 0 0 0
Query Gangnam Style
● H1(Gangnam Style) = 0
● MIN (2)
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 2 0 0 0
Query Gangnam Style
● H1(Gangnam Style) = 0
● H2(Gangnam Style) = 4
● MIN (2, 1)
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 2 0 0 0
Query Gangnam Style
● H1(Gangnam Style) = 0
● H2(Gangnam Style) = 4
● H3(Gangnam Style) = 6
● MIN (2, 1, 2) = 1
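The same walkthrough can be replayed in Python. The column positions come straight from the slides, so `SLIDE_HASHES` stands in for real hash functions here.

```python
# Column chosen by H1, H2, H3 for each video, as given on the slides.
SLIDE_HASHES = {
    "Gangnam Style": [0, 4, 6],
    "Baby Shark": [0, 5, 6],
}

table = [[0] * 10 for _ in range(3)]  # the 10 x 3 sketch


def increment(name: str) -> None:
    for row, col in enumerate(SLIDE_HASHES[name]):
        table[row][col] += 1


def query(name: str) -> int:
    return min(table[row][col] for row, col in enumerate(SLIDE_HASHES[name]))


increment("Gangnam Style")
increment("Baby Shark")
print(query("Gangnam Style"))  # 1, i.e. MIN(2, 1, 2)
print(query("Baby Shark"))     # 1
```

The two videos collide in rows H1 and H3, but the H2 row keeps them apart, which is why the minimum still gives the right answer.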
Complexities
Type Worst Case
Space Sublinear
Increment O(1)
Query O(1)
Delete Not Available
CMS Pros
● Extremely Fast - O(1)
● Super compact - sublinear
● Impossible to undercount
CMS Cons
● Incidence of overcounting - all results are approximations
When to Use a Count Min Sketch?
● Counting many unique instances
● When Approximation is fine
● When counts are likely to be skewed (think YouTube video views)
Set Cardinality
Set Cardinality
● Counting distinct elements inserted into set
● Easier on smaller data sets
● For exact counts - must preserve all unique elements
● Scales very poorly
HyperLogLog
HyperLogLog
● Probabilistic Data Structure to Count Distinct Elements
● Space Complexity is about O(1)
● Time Complexity O(1)
● Can handle billions of elements with less than 2 kB of memory
● Can scale up as needed - HLL is effectively constant even
though you may want to increase its size for enormous
cardinalities.
HyperLogLog Walkthrough
● Initialize an array of registers of size 2^P (where P is some constant, usually
around 16-18)
● When an Item is inserted
○ Hash the Item
○ Determine the register to update: i - from the left P bits of the item’s hash
○ Set registers[i] to the index of the rightmost 1 in the binary representation of the hash, keeping the larger value if the register is already set
● When Querying
○ Compute harmonic mean of the registers that have been set
○ Multiply by a constant determined by size of P
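The steps above can be sketched in Python as follows. Several choices here are assumptions rather than details from the slides: the class name, P = 14, and a 64-bit hash taken from SHA-1. The query also differs from the simplified description: it uses the standard HyperLogLog estimator (a harmonic mean of 2^register over all registers, with a linear-counting fallback for small counts), and it stores a 1-based bit position so that 0 can mean "empty register".

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers."""

    def __init__(self, p: int = 14) -> None:
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash from the first 8 bytes of SHA-1 (illustrative choice).
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)             # left p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        if rest == 0:
            rank = 64 - self.p + 1
        else:
            rank = (rest & -rest).bit_length()  # 1-based position of rightmost 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
estimate = hll.count()
print(estimate)  # close to 100000, typically within a few percent
```

Re-adding the same items leaves every register unchanged, so duplicates never change the estimate.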
Example: Insert Username ‘bar’ P = 16
H(bar) = 3103595182
● 1011 1000 1111 1101 0001 1010 1010 1110
● Take first 16 bits -> 1011 1000 1111 1101 -> 47357 = register index
● Index of rightmost 1 = 1
● registers[47357] = 1
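The bit manipulation in this example can be checked with a few lines of Python. The hash value is the one given on the slide, treated as a 32-bit number; the variable names are illustrative.

```python
HASH = 3103595182  # H(bar) from the slide, a 32-bit value
P = 16

# The left P bits select the register.
register_index = HASH >> (32 - P)

# 0-based index of the rightmost 1 bit, counted from the right.
rightmost_one = (HASH & -HASH).bit_length() - 1

print(format(HASH, "032b"))  # 10111000111111010001101010101110
print(register_index)        # 47357
print(rightmost_one)         # 1
```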
Get Cardinality
● Calculate harmonic mean of only set registers.
Only 1 set register: 47357 -> 1
ceiling(.673 * 1 * 1/(2^1)) = 1
Cardinality = 1
Probabilistic Data Structures with Redis-Stack
What Is Redis Stack?
● Grouping of modules for Redis, including RedisBloom
● Adds functionality to Redis, e.g. probabilistic data structures
Demo
Steve Lorello
Developer Advocate
@Redis
@slorello
github.com/slorello89
slorello.com
Resources
Redis
https://redis.io
RedisBloom
https://redisbloom.io
Source Code For Demo:
https://github.com/slorello89/probablistic-data-structures-blazor
C# Implementation Bloom Filter, HyperLogLog, and Count-Min Sketch:
https://github.com/TheAlgorithms/C-Sharp/tree/master/DataStructures/Probabilistic
Come Check Us Out!
Redis University:
https://university.redis.com
Discord:
https://discord.com/invite/redis
Baby Name Freq hash table
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
H(Liam) = 4
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
H(Liam) = 4
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 0 0 0 0 0
H(Sophia) = 8
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 0 0 0 0 0
H(Sophia) = 8
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 0 0 0 1 0
H(Liam) = 4
0 1 2 3 4 5 6 7 8 9
0 0 0 0 2 0 0 0 1 0
Baby Name existence table k = 3
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 0 1 0 0 0
H1(Susan)=0 H2(Susan) = 5 H3(Susan) = 6
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 0 1 0 0 0
H1(Susan)=0 H2(Susan) = 5 H3(Susan) = 6
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 1 1 0 0 0
Does Tom Exist? H1(Tom)=1 H2(Tom)=4 H3(Tom) = 5
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 1 1 0 0 0
Does Tom Exist? H1(Tom)=1 H2(Tom)=4 H3(Tom) = 5
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 1 1 0 0 0
The bit at H1(Tom) = 1 is 0, so no, Tom does not exist!
Does Liam Exist? H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 1 1 0 0 0
All of Liam's hashed bits = 1, so we report YES
2-Choice Hashing
● Use two Hash Functions instead of one
● Store @ index with Lowest Load (smallest linked list)
● Worst-case time complexity goes from roughly log(n) in a traditional chained hash table -> log(log(n)) with high probability, so nearly constant
● Most of the benefit comes from the second hash; additional hashes improve the bound only by a constant factor
● Still O(n) space complexity
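A small experiment makes the effect visible by comparing the longest chain with one hash against two-choice placement. The helper names, the salted SHA-256 hashing, and the 10,000-key workload are assumptions for the sketch.

```python
import hashlib


def bucket(key: str, salt: str, num_buckets: int) -> int:
    """Salted SHA-256 hash reduced to a bucket index (illustrative choice)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def max_load_one_choice(keys, num_buckets: int) -> int:
    loads = [0] * num_buckets
    for key in keys:
        loads[bucket(key, "a", num_buckets)] += 1
    return max(loads)


def max_load_two_choice(keys, num_buckets: int) -> int:
    loads = [0] * num_buckets
    for key in keys:
        first = bucket(key, "a", num_buckets)
        second = bucket(key, "b", num_buckets)
        # Store at whichever candidate bucket currently holds fewer items.
        target = first if loads[first] <= loads[second] else second
        loads[target] += 1
    return max(loads)


keys = [f"user-{i}" for i in range(10_000)]
one = max_load_one_choice(keys, 10_000)
two = max_load_two_choice(keys, 10_000)
print(one, two)  # the two-choice maximum is expected to be the smaller one
```

Note that a lookup in a two-choice table has to check both candidate buckets, since the key could be in either.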
Hash Tables
Hash Table
● Ubiquitous data structure for
storing associated data. E.g.
Map, Dictionary, Dict
● Set of Keys associated with
array of values
● Run hash function on key to find
position in array to store value
Source: wikipedia
Hash Collisions
● Hash functions can produce the same output for different keys - this creates a collision
● Collisions are resolved either by sequential probing or with linked lists (chaining)
Hash Table Complexity - with chain hashing
Type Amortized Worst Case
Space O(n) O(n)
Insert O(1) O(n)
Lookup O(1) O(n)
Delete O(1) O(n)

More Related Content

Similar to Count-min sketch to Infinity.pdf

Python memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collectorPython memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collectorJose Manuel Ortega Candel
 
Basics in algorithms and data structure
Basics in algorithms and data structure Basics in algorithms and data structure
Basics in algorithms and data structure Eman magdy
 
Python for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo CruzPython for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo Cruzrpmcruz
 
Basic arithmetic, instruction execution and program
Basic arithmetic, instruction execution and programBasic arithmetic, instruction execution and program
Basic arithmetic, instruction execution and programJyotiprakashMishra18
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityAndrii Gakhov
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Andrii Gakhov
 
Beyond PFCount: Shrif Nada
Beyond PFCount: Shrif NadaBeyond PFCount: Shrif Nada
Beyond PFCount: Shrif NadaRedis Labs
 
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event SlidesMelbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event SlidesVictoria Schiffer
 
streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxGopiNathVelivela
 
Python in 90 Minutes
Python in 90 MinutesPython in 90 Minutes
Python in 90 MinutesNachu Muthu
 
Python Quick Start
Python Quick StartPython Quick Start
Python Quick StartAbbas Ali
 
Introduction to Algorithms
Introduction to AlgorithmsIntroduction to Algorithms
Introduction to AlgorithmsMohamed Essam
 
GAN in_kakao
GAN in_kakaoGAN in_kakao
GAN in_kakaoJunho Kim
 
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsProbabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsOleksandr Pryymak
 
LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)Wang Hsiangkai
 
Using R in remote computer clusters
Using R in remote computer clustersUsing R in remote computer clusters
Using R in remote computer clustersBurak Himmetoglu
 
Unit 5 Streams2.pptx
Unit 5 Streams2.pptxUnit 5 Streams2.pptx
Unit 5 Streams2.pptxSonaliAjankar
 
Paper Reading: Pessimistic Cardinality Estimation
Paper Reading: Pessimistic Cardinality EstimationPaper Reading: Pessimistic Cardinality Estimation
Paper Reading: Pessimistic Cardinality EstimationPingCAP
 

Similar to Count-min sketch to Infinity.pdf (20)

Python Memory Management 101(Europython)
Python Memory Management 101(Europython)Python Memory Management 101(Europython)
Python Memory Management 101(Europython)
 
Python memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collectorPython memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collector
 
Basics in algorithms and data structure
Basics in algorithms and data structure Basics in algorithms and data structure
Basics in algorithms and data structure
 
Python for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo CruzPython for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo Cruz
 
Basic arithmetic, instruction execution and program
Basic arithmetic, instruction execution and programBasic arithmetic, instruction execution and program
Basic arithmetic, instruction execution and program
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. Cardinality
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...
 
Beyond PFCount: Shrif Nada
Beyond PFCount: Shrif NadaBeyond PFCount: Shrif Nada
Beyond PFCount: Shrif Nada
 
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event SlidesMelbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides
 
streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
 
Python in 90 Minutes
Python in 90 MinutesPython in 90 Minutes
Python in 90 Minutes
 
Python Quick Start
Python Quick StartPython Quick Start
Python Quick Start
 
Introduction to Algorithms
Introduction to AlgorithmsIntroduction to Algorithms
Introduction to Algorithms
 
GAN in_kakao
GAN in_kakaoGAN in_kakao
GAN in_kakao
 
Data Mining Lecture_4.pptx
Data Mining Lecture_4.pptxData Mining Lecture_4.pptx
Data Mining Lecture_4.pptx
 
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsProbabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate Solutions
 
LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)
 
Using R in remote computer clusters
Using R in remote computer clustersUsing R in remote computer clusters
Using R in remote computer clusters
 
Unit 5 Streams2.pptx
Unit 5 Streams2.pptxUnit 5 Streams2.pptx
Unit 5 Streams2.pptx
 
Paper Reading: Pessimistic Cardinality Estimation
Paper Reading: Pessimistic Cardinality EstimationPaper Reading: Pessimistic Cardinality Estimation
Paper Reading: Pessimistic Cardinality Estimation
 

More from Stephen Lorello

Florida Man Uses Cache as Database.pdf
Florida Man Uses Cache as Database.pdfFlorida Man Uses Cache as Database.pdf
Florida Man Uses Cache as Database.pdfStephen Lorello
 
An Introduction to Redis for .NET Developers.pdf
An Introduction to Redis for .NET Developers.pdfAn Introduction to Redis for .NET Developers.pdf
An Introduction to Redis for .NET Developers.pdfStephen Lorello
 
An Introduction to Redis for Developers.pdf
An Introduction to Redis for Developers.pdfAn Introduction to Redis for Developers.pdf
An Introduction to Redis for Developers.pdfStephen Lorello
 
Indexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netIndexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netStephen Lorello
 
Frontends w ithout javascript
Frontends w ithout javascriptFrontends w ithout javascript
Frontends w ithout javascriptStephen Lorello
 
Intro to computer vision in .net update
Intro to computer vision in .net   updateIntro to computer vision in .net   update
Intro to computer vision in .net updateStephen Lorello
 
Intro to computer vision in .net
Intro to computer vision in .netIntro to computer vision in .net
Intro to computer vision in .netStephen Lorello
 

More from Stephen Lorello (7)

Florida Man Uses Cache as Database.pdf
Florida Man Uses Cache as Database.pdfFlorida Man Uses Cache as Database.pdf
Florida Man Uses Cache as Database.pdf
 
An Introduction to Redis for .NET Developers.pdf
An Introduction to Redis for .NET Developers.pdfAn Introduction to Redis for .NET Developers.pdf
An Introduction to Redis for .NET Developers.pdf
 
An Introduction to Redis for Developers.pdf
An Introduction to Redis for Developers.pdfAn Introduction to Redis for Developers.pdf
An Introduction to Redis for Developers.pdf
 
Indexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netIndexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .net
 
Frontends w ithout javascript
Frontends w ithout javascriptFrontends w ithout javascript
Frontends w ithout javascript
 
Intro to computer vision in .net update
Intro to computer vision in .net   updateIntro to computer vision in .net   update
Intro to computer vision in .net update
 
Intro to computer vision in .net
Intro to computer vision in .netIntro to computer vision in .net
Intro to computer vision in .net
 

Recently uploaded

The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 

Recently uploaded (20)

The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

Count-min sketch to Infinity.pdf

  • 1. Count-min sketch to Infinity: Using Probabilistic Data Structures to Solve Presence, Counting, and Distinct Count Problems in .NET presented by: Steve Lorello - Developer Advocate @Redis
  • 2. Agenda ● What are Probabilistic Data Structures? ● Set Membership problems ● Bloom Filters ● Counting problems ● Count-Min-Sketch ● Distinct Count problems ● HyperLogLog ● Using Probabilistic Data Structures with Redis
  • 3. What are Probabilistic Data Structures? ● Class of specialized data structures ● Tackle specific problems ● Use probability to approximate answers
  • 4. Probabilistic Data Structures Examples (Name | Problem Solved | Optimization): Bloom Filter | Presence | Space, Insertion Time, Lookup Time; Quotient Filter | Presence | Space, Insertion Time, Lookup Time; Skip List | Ordering and Searching | Insertion Time, Search Time; HyperLogLog | Set Cardinality | Space, Insertion Time, Lookup Time; Count-min-sketch | Counting occurrences on large sets | Space, Insertion Time, Lookup Time; Cuckoo Filter | Presence | Space, Insertion Time, Lookup Time; Top-K | Keep track of top records | Space, Insertion Time, Lookup Time
  • 6. Set Membership Problems ● Has a given element been inserted? ● e.g. Unique username for registration
  • 7. Presence Problem Naive Approach 1 ● Store User Info in table 'users' and Query (diagram labels: Check username, Query username)
  • 8. Presence Problem Naive Approach 1: SELECT 1 FROM users WHERE username = 'selected_username' (diagram labels: Check username, Query username)
  • 9. Summary: Access Type: Disk; Lookup Time: O(n); Extra Space (beyond storing user info): O(1)
  • 10. Presence Problem Naive Approach 2 ● Store User Info in table 'users' ● Index username (diagram labels: Check username, Query username)
  • 11. Presence Problem Naive Approach 2: SELECT 1 FROM users WHERE username = 'selected_username' (diagram labels: Check username, Query username)
  • 12. Summary: Access Type: Disk; Lookup Time: O(log(n)); Extra Space (beyond storing user info): O(n)
  • 13. Presence Problem Naive Approach 3 ● Store usernames in Redis cache (diagram labels: Check username, SISMEMBER)
  • 14. Presence Problem Naive Approach 3 ● Store usernames in Redis cache: SADD usernames selected_username; SISMEMBER usernames selected_username (diagram labels: Check username, SISMEMBER)
  • 15. Summary: Access Type: Memory; Lookup Time: O(1); Extra Space (beyond storing user info): O(n)
  • 17. Bloom Filter ● Specialized 'Probabilistic' Data Structure for presence checks ● Can say if element has definitely not been added ● Can say if element has probably been added ● Uses a fixed number (K) of hash functions ● Represented as a 1D array of bits ● All operations O(1) complexity ● Space complexity O(n) - bits
  • 18. Insert: for i = 0 to K-1: FILTER[H[i](key)] = 1
  • 19. Query: for i = 0 to K-1: if FILTER[H[i](key)] == 0: return False; return True
  • 20. Complexities (worst case): Space: O(n) bits; Insert: O(1); Lookup: O(1); Delete: Not Available
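The insert and query loops above can be sketched in Python roughly as follows. This is a minimal illustration, not the presenter's implementation: the class name `BloomFilter`, the method names, and the salted SHA-256 scheme used to derive the K hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: m bits and k hash functions (hypothetical design)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, key: str):
        # Assumption: derive the i-th hash by salting the key with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        # Insert: set every bit the key hashes to.
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # Query: a single unset bit proves the key was never added.
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter(m=1000, k=3)
bf.add("razzle")
print(bf.might_contain("razzle"))  # True: an inserted key is never reported missing
```

A `True` result only means "probably added", since other keys may have set the same bits; a `False` result is definitive.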
  • 21. Example Initial State Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 0 0 0 0 0 0 0 0
  • 22. Example Insert username ‘razzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 0 0 0 0 0 0 0 0
  • 23. Example Insert username ‘razzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 0 0 0 0 0 ● H1(razzle) = 2
  • 24. Example Insert username ‘razzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 1 0 0 0 0 ● H1(razzle) = 2 ● H2(razzle) = 5
  • 25. Example Insert username ‘razzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 1 0 0 1 0 ● H1(razzle) = 2 ● H2(razzle) = 5 ● H3(razzle) = 8
  • 26. Example Query username ‘fizzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 1 0 0 1 0 H1(fizzle) = 8 - bit 8 is set—maybe?
  • 27. Example Query username ‘fizzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 1 0 0 1 0 H1(fizzle) = 8 - bit 8 is set—maybe? H3(fizzle) = 2 - bit 2 is set—maybe?
  • 28. Example Query username ‘fizzle’ Bloom Filter k = 3 bit 0 1 2 3 4 5 6 7 8 9 state 0 0 1 0 0 1 0 0 1 0 H1(fizzle) = 8 - bit 8 is set—maybe? H3(fizzle) = 2 - bit 2 is set—maybe? H2(fizzle) = 4 - bit 4 is not set—definitely not.
  • 29. False Positives and Optimal K ● This algorithm will never give you false negatives, but it can report false positives ● You can minimize false positives by choosing K well ● Let c = (number of bits in the filter) / (number of records) ● Optimal K = c * ln(2) ● This results in a false-positive rate of about 0.6185^c, which is quite small for reasonably large c
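As a quick numeric check of the formulas on this slide, here is a small Python sketch. The function names and the example sizes (10,000 bits for 1,000 records) are illustrative assumptions; K is rounded to the nearest whole number because a filter cannot use a fractional number of hash functions.

```python
import math


def optimal_k(num_bits: int, num_records: int) -> int:
    # c is the number of filter bits available per stored record.
    c = num_bits / num_records
    return max(1, round(c * math.log(2)))


def false_positive_rate(num_bits: int, num_records: int) -> float:
    # Rate from the slide, valid when K is chosen optimally.
    c = num_bits / num_records
    return 0.6185 ** c


print(optimal_k(10_000, 1_000))            # 7, since 10 * ln(2) is about 6.93
print(false_positive_rate(10_000, 1_000))  # roughly 0.008
```

So with 10 bits per record, fewer than 1 in 100 lookups of absent keys would be expected to report a false positive.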
  • 31. What’s a Counting Problem? ● How many times does an individual occur in a stream ● Easy to do on small-mid size streams of data ● e.g. Counting Views on YouTube ● Nearly impossible to scale to enormous data sets
  • 32. Naive Approach: Hash Table ● Hash Table of Counters ● Lookup name in Hash table, instead of storing record, store an integer ● On insert, increment the integer ● On query, check the integer
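For comparison with the sketch-based approach later in the deck, the naive hash table of counters might look like this in Python. `collections.Counter` is a standard-library dictionary subclass; the video names are just sample data.

```python
from collections import Counter

# One exact counter per unique key: memory grows with the number of distinct keys.
views = Counter()

for video in ["Gangnam Style", "Baby Shark", "Gangnam Style"]:
    views[video] += 1  # on insert, increment the integer

print(views["Gangnam Style"])  # 2
print(views["Never Watched"])  # 0: missing keys count as zero
```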
  • 33. Pros: ● Straightforward ● Guaranteed accuracy (if storing whole object). Cons: ● O(n) space complexity in the best case ● O(n) worst-case time complexity ● Scales poorly (think billions of unique records) ● If relying on only a single hash, very vulnerable to collisions and overcounts
  • 34. Naive Approach Relational DB ● Issue a query to a traditional relational database for a count of records matching some condition: SELECT COUNT(*) FROM views WHERE name = 'Gangnam Style' ● Linear Time Complexity O(n) ● Linear Space Complexity O(n)
  • 35. What’s the problem with a Billion Unique Records? ● Each unique record needs its own space in a Hash Table or row in a RDBMS (perhaps several rows across multiple tables) ● Taxing on memory for Hash Table ○ 8 bit integer? 1GB ○ 16 bit? 2GB ○ 32 bit? 4GB ○ 64 bit? 8GB ● Maintaining such large data structures in a typical program’s memory isn’t feasible ● In a relational database, it’s stored on disk
  • 37. Count-Min Sketch ● Specialized data structure for keeping count on very large streams of data ● Similar to Bloom filter in Concept - multi-hashed record ● 2D array of counters ● Sublinear Space Complexity ● Constant Time complexity ● Never undercounts, sometimes over counts
  • 38. Increment: for i = 0 to k-1: Table[H[i](key)][i] += 1
  • 39. Query: minimum = infinity; for i = 0 to k-1: minimum = min(minimum, Table[H[i](key)][i]); return minimum
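The increment and query loops can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the class and method names are made up for the example, the per-row hash functions are derived by salting SHA-256 with the row number, and the table is indexed as `table[row][column]` rather than in the order the pseudocode uses.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: `depth` hash rows, each with `width` counters."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row: int, key: str) -> int:
        # Assumption: one hash function per row, derived by salting with the row.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def increment(self, key: str, amount: int = 1) -> None:
        # Bump exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += amount

    def query(self, key: str) -> int:
        # Collisions only ever inflate counters, so the minimum is the best estimate.
        return min(
            self.table[row][self._column(row, key)] for row in range(self.depth)
        )


cms = CountMinSketch(width=10, depth=3)
cms.increment("Gangnam Style")
cms.increment("Baby Shark")
print(cms.query("Gangnam Style"))  # at least 1; never an undercount
```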
  • 40. Video Views Sketch 10 x 3 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 0 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 0 0 0 0 0 0 H3 0 0 0 0 0 0 0 0 0 0
  • 41. Increment Gangnam Style Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 0 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 0 0 0 0 0 0 H3 0 0 0 0 0 0 0 0 0 0
  • 42. Increment Gangnam Style ● H1(Gangnam Style) = 0 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 1 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 0 0 0 0 0 0 H3 0 0 0 0 0 0 0 0 0 0
  • 43. Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 1 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 0 0 0 0 0 H3 0 0 0 0 0 0 0 0 0 0 Increment Gangnam Style ● H1(Gangnam Style) = 0 ● H2(Gangnam Style) = 4
  • 44. Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 1 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 0 0 0 0 0 H3 0 0 0 0 0 0 1 0 0 0 Increment Gangnam Style ● H1(Gangnam Style) = 0 ● H2(Gangnam Style) = 4 ● H3(Gangnam Style) = 6
  • 45. Increment Baby Shark ● H1(Baby Shark) = 0 ● H2(Baby Shark) = 5 ● H3(Baby Shark) = 6 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 1 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 0 0 0 0 0 H3 0 0 0 0 0 0 1 0 0 0
  • 46. Increment Baby Shark ● H1(Baby Shark) = 0 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 0 0 0 0 0 H3 0 0 0 0 0 0 1 0 0 0
  • 47. Increment Baby Shark ● H1(Baby Shark) = 0 ● H2(Baby Shark) = 5 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 1 0 0 0 0 H3 0 0 0 0 0 0 1 0 0 0
  • 48. Increment Baby Shark ● H1(Baby Shark) = 0 ● H2(Baby Shark) = 5 ● H3(Baby Shark) = 6 Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 1 0 0 0 0 H3 0 0 0 0 0 0 2 0 0 0
  • 49. Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 1 0 0 0 0 H3 0 0 0 0 0 0 2 0 0 0 Query Gangnam Style ● H1(Gangnam Style) = 0 ● MIN (2)
  • 50. Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 1 0 0 0 0 H3 0 0 0 0 0 0 2 0 0 0 Query Gangnam Style ● H1(Gangnam Style) = 0 ● H2(Gangnam Style) = 4 ● MIN (2, 1)
  • 51. Count Min Sketch position 0 1 2 3 4 5 6 7 8 9 H1 2 0 0 0 0 0 0 0 0 0 H2 0 0 0 0 1 1 0 0 0 0 H3 0 0 0 0 0 0 2 0 0 0 Query Gangnam Style ● H1(Gangnam Style) = 0 ● H2(Gangnam Style) = 4 ● H3(Gangnam Style) = 6 ● MIN (2, 1, 2) = 1
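The walkthrough on slides 40 to 51 can be replayed with a few lines of Python. The hash values here are copied from the slides rather than computed, so the lookup table `SLIDE_HASHES` stands in for real hash functions; the helper names are hypothetical.

```python
# Hash values taken from the slides; a real sketch would compute them.
SLIDE_HASHES = {
    "Gangnam Style": (0, 4, 6),  # H1, H2, H3
    "Baby Shark": (0, 5, 6),
}

# Three rows (H1..H3), ten positions each, all starting at zero.
table = [[0] * 10 for _ in range(3)]


def increment(key: str) -> None:
    for row, col in enumerate(SLIDE_HASHES[key]):
        table[row][col] += 1


def query(key: str) -> int:
    return min(table[row][col] for row, col in enumerate(SLIDE_HASHES[key]))


increment("Gangnam Style")
increment("Baby Shark")
print(query("Gangnam Style"))  # 1, i.e. MIN(2, 1, 2)
```

Rows H1 and H3 collide for the two videos and hold 2, but row H2 does not, so taking the minimum recovers the true count.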
  • 52. Complexities (worst case): Space: Sublinear; Increment: O(1); Query: O(1); Delete: Not Available
  • 53. CMS Pros: ● Extremely fast - O(1) ● Super compact - sublinear ● Impossible to undercount. CMS Cons: ● Incidence of overcounting - all results are approximations
  • 54. When to Use a Count Min Sketch? ● Counting many unique instances ● When Approximation is fine ● When counts are likely to be skewed (think YouTube video views)
  • 56. Set Cardinality ● Counting distinct elements inserted into set ● Easier on smaller data sets ● For exact counts - must preserve all unique elements ● Scales very poorly
  • 58. HyperLogLog ● Probabilistic Data Structure to Count Distinct Elements ● Space Complexity is about O(1) ● Time Complexity O(1) ● Can handle billions of elements with less than 2 kB of memory ● Can scale up as needed - HLL is effectively constant even though you may want to increase its size for enormous cardinalities.
  • 59. HyperLogLog Walkthrough ● Initialize an array of registers of size 2^P (where P is some constant, usually around 16-18) ● When an item is inserted ○ Hash the item ○ Determine the register to update, i, from the leftmost P bits of the item's hash ○ Set registers[i] to the larger of its current value and the index of the rightmost 1 in the binary representation of the hash ● When querying ○ Compute harmonic mean of the registers that have been set ○ Multiply by a constant determined by size of P
  • 60. Example: Insert Username ‘bar’ P = 16 H(bar) = 3103595182 ● 1011 1000 1111 1101 0001 1010 1010 1110 ● Take first 16 bits -> 1011 1000 1111 1101 -> 47357 = register index ● Index of rightmost 1 = 1 ● registers[47357] = 1
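The register arithmetic in this example can be checked with a short Python sketch. It assumes a 32-bit hash, as the binary string on the slide suggests, and uses the slide's hash value directly instead of computing one; the variable names are illustrative. Keeping the maximum per register is the standard HyperLogLog update rule.

```python
P = 16                 # number of leading bits used to pick a register
HASH_BITS = 32         # assumption: the slide's hash is 32 bits wide
h = 3103595182         # H(bar), copied from the slide

# Leftmost P bits select the register.
register_index = h >> (HASH_BITS - P)

# h & -h isolates the lowest set bit; bit_length() - 1 gives its zero-based index.
rightmost_one = (h & -h).bit_length() - 1

registers = [0] * (1 << P)
# Keep the largest value ever seen for this register.
registers[register_index] = max(registers[register_index], rightmost_one)

print(register_index)  # 47357
print(rightmost_one)   # 1
```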
  • 61. Get Cardinality ● Calculate harmonic mean of only set registers. Only 1 set register: 47357 -> 1 ceiling(.673 * 1 * 1/(2^1)) = 1 Cardinality = 1
  • 63. What Is Redis Stack? ● Grouping of modules for Redis, including Redis Bloom ● Adds additional functionality, e.g. Probabilistic Data structures to Redis
  • 64. Demo
  • 66. Resources Redis https://redis.io RedisBloom https://redisbloom.io Source Code For Demo: https://github.com/slorello89/probablistic-data-structures-blazor C# Implementation Bloom Filter, HyperLogLog, and Count-Min Sketch: https://github.com/TheAlgorithms/C-Sharp/tree/master/DataStructures/Probabilistic
  • 67. Come Check Us Out! Redis University: https://university.redis.com Discord: https://discord.com/invite/redis
  • 69. Baby Name Freq hash table 0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0
  • 70. H(Liam) = 4 0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0
  • 71. H(Liam) = 4 0 1 2 3 4 5 6 7 8 9 0 0 0 0 1 0 0 0 0 0
  • 72. H(Sophia) = 8 0 1 2 3 4 5 6 7 8 9 0 0 0 0 1 0 0 0 0 0
  • 73. H(Sophia) = 8 0 1 2 3 4 5 6 7 8 9 0 0 0 0 1 0 0 0 1 0
  • 74. H(Liam) = 4 0 1 2 3 4 5 6 7 8 9 0 0 0 0 2 0 0 0 1 0
  • 75. Baby Name existence table k = 3 0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0
  • 76. H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6 0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0
  • 77. H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 0 1 0 0 0
  • 78. H1(Susan)=0 H2(Susan) = 5 H3(Susan) = 6 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 0 1 0 0 0
  • 79. H1(Susan)=0 H2(Susan) = 5 H3(Susan) = 6 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 1 1 0 0 0
  • 80. Does Tom Exist? H1(Tom)=1 H2(Tom)=4 H3(Tom) = 5 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 1 1 0 0 0
  • 81. Does Tom Exist? H1(Tom)=1 H2(Tom)=4 H3(Tom) = 5 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 1 1 0 0 0 One of Tom's hashes (H1) points at a bit that is 0, so no, Tom does not exist!
  • 82. Does Liam Exist? H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6 0 1 2 3 4 5 6 7 8 9 1 0 0 0 1 1 1 0 0 0 All of Liam's hashes point at bits set to 1, so we report YES
  • 83. 2-Choice Hashing ● Use two Hash Functions instead of one ● Store @ index with Lowest Load (smallest linked list) ● Time Complexity goes from log(n) in traditional chain hash table -> log(log(n)) with high probability, so nearly constant ● Benefit stops at 2 hashes, additional hashes don’t help ● Still O(n) space complexity
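A minimal Python sketch of the two-choice idea follows. The class name `TwoChoiceTable`, the helper `_hash`, and the salted SHA-256 hashing are assumptions made for illustration; Python lists stand in for the linked lists mentioned on the slide.

```python
import hashlib


def _hash(salt: str, key: str, size: int) -> int:
    # Assumption: two independent hash functions obtained via different salts.
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % size


class TwoChoiceTable:
    """Chained hash set where each key may live in one of two buckets."""

    def __init__(self, size: int) -> None:
        self.buckets = [[] for _ in range(size)]

    def _candidates(self, key: str):
        size = len(self.buckets)
        return (
            self.buckets[_hash("a", key, size)],
            self.buckets[_hash("b", key, size)],
        )

    def insert(self, key: str) -> None:
        first, second = self._candidates(key)
        if key in first or key in second:
            return  # already stored
        # Store at the index with the lowest load (shortest chain).
        target = first if len(first) <= len(second) else second
        target.append(key)

    def contains(self, key: str) -> bool:
        # A lookup has to check both candidate buckets.
        first, second = self._candidates(key)
        return key in first or key in second


names = TwoChoiceTable(size=8)
for name in ["Liam", "Sophia", "Susan", "Tom"]:
    names.insert(name)
print(names.contains("Liam"))  # True
```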
  • 85. Hash Table ● Ubiquitous data structure for storing associated data. E.g. Map, Dictionary, Dict ● Set of Keys associated with array of values ● Run hash function on key to find position in array to store value Source: wikipedia
  • 86. Hash Collisions ● Hash functions can produce the same output for different keys, which creates a collision ● Collisions are resolved either sequentially or with a linked list
  • 87. Hash Table Complexity with chain hashing (amortized / worst case): Space: O(n) / O(n); Insert: O(1) / O(n); Lookup: O(1) / O(n); Delete: O(1) / O(n)