Probabilistic data structures
https://www.linkedin.com/in/taras-yaroshchuk-551383105/
Taras Yaroshchuk
Senior Data Engineer at Sigma Software
- 4 years in Data Engineering
- AdTech, IoT, FinTech
- Scala/Java/Python
- Trying to contribute to big data community
Skype/Telegram/FB/everywhere: taras.yaroshchuk
Use cases
● Membership (Bloom filter, Quotient filter, Cuckoo filter)
● Frequency (Frequent algorithm, Count-Min Sketch)
● Cardinality (Linear Counting, LogLog, HyperLogLog)
● Rank (Random sampling, q-digest, t-digest)
● Similarity (Locality-Sensitive Hashing, MinHash, SimHash)
Motivation
Hashing
Cryptographic hash functions
● Message-Digest Algorithm (MD5)
● Secure Hash Algorithms (SHA-256, SHA-512, etc.)
● RadioGatún
Non-cryptographic hash functions
● FNV1
● CityHash, FarmHash
● MurmurHash3
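As a rough illustration of what a non-cryptographic hash function looks like, here is a minimal sketch of 32-bit FNV-1 in Python. The constants are the published FNV offset basis and prime; the `bucket` helper is a hypothetical name introduced here to show the "hash, then take the remainder" step used on the later slides.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5  # 2166136261
FNV32_PRIME = 0x01000193         # 16777619


def fnv1_32(data: bytes) -> int:
    """32-bit FNV-1: multiply by the prime, then XOR in the next byte."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # keep the value within 32 bits
        h ^= byte
    return h


def bucket(key: str, size: int) -> int:
    """Hypothetical helper: map a string key to a slot index in [0, size)."""
    return fnv1_32(key.encode("utf-8")) % size


print(hex(fnv1_32(b"a")))       # 0x50c5d7e, the published FNV-1 test vector for "a"
print(bucket("QWERTY777", 10))  # some slot between 0 and 9
```

The slot printed for `QWERTY777` will not necessarily match the value on the slides, which are illustrative.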
Bloom Filter (Membership)
- Google Bigtable, HBase, Cassandra and PostgreSQL use Bloom filters to reduce disk lookups for non-existent rows or columns
- Medium uses a Bloom filter to avoid showing duplicate recommendations
- Google Chrome has used a Bloom filter to check URLs against a list of bad URLs
- Checking passwords against lists of compromised passwords
- It is like a Set, but it doesn't store the elements themselves
- Supports 2 operations: add an element, check whether an element exists
- Bit array:
0 1 2 3 4
1 0 1 1 0
- Uses multiple hash functions:
h1(x) = MurmurHash3(x) % 10
h2(x) = FNV1(x) % 10
Example:
- camera on highway
- bad internet connection
- police in 400m
Euroblyaha detection
QWERTY777, NET1234, ASDF999
1. Add QWERTY777
h1 = MurmurHash3(QWERTY777) % 10 = 7
h2 = FNV1(QWERTY777) % 10 = 3
0 1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 1 0 0
2. Add NET1234
h1 = MurmurHash3(NET1234) % 10 = 1
h2 = FNV1(NET1234) % 10 = 3
0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 0 1 0 0
3. Contains ASDF999? (false)
h1 = MurmurHash3(ASDF999) % 10 = 5
h2 = FNV1(ASDF999) % 10 = 6
Bits 5 and 6 are both 0, so ASDF999 was definitely never added.
4. Contains NET1234? (true)
h1 = MurmurHash3(NET1234) % 10 = 1
h2 = FNV1(NET1234) % 10 = 3
Bits 1 and 3 are both 1, so NET1234 may have been added.
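The add and contains steps of the walk-through can be sketched in a few lines of Python. This is a minimal illustration, not a production filter. The class and method names are made up for this example. The slides use MurmurHash3 and FNV1, but neither ships with the Python standard library, so salted BLAKE2b from `hashlib` is swapped in here. The bit positions will therefore differ from the ones on the slides.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: m bits and k hash functions (salted BLAKE2b, an assumption)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item: str):
        # One hash function per salt value; each yields a bit index in [0, m).
        for i in range(self.k):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=bytes([i])
            ).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not added"; True means "possibly added".
        return all(self.bits[pos] for pos in self._positions(item))


plates = BloomFilter(m=1024, k=2)
plates.add("QWERTY777")
plates.add("NET1234")

print(plates.might_contain("NET1234"))  # True: added items are never reported missing
print(plates.might_contain("ASDF999"))  # False unless both of its bits happen to collide
```

A larger bit array than the 10 slots on the slides is used here so that an accidental collision for `ASDF999` is unlikely.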
Possible answers:
- The element definitely doesn't exist in the set
- The element may exist in the set. Let's say, with 98% probability
Parameters:
p - false positive rate
m - size of the filter in bits
k - number of hash functions
n - number of elements inserted

k    m/n   p, %
4    6     5.62
6    8     2.15
8    12    0.314
11   16    0.0458

1 billion elements, p = 2% ~ 1 GB
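The numbers in the table can be roughly reproduced with the standard Bloom filter approximation p ≈ (1 - e^(-k·n/m))^k, and the memory estimate with m = -n·ln(p) / (ln 2)^2. The sketch below assumes those textbook formulas; the function names are hypothetical. Results may differ from the slide in the last digit because of rounding.

```python
import math


def false_positive_rate(k: int, bits_per_element: float) -> float:
    """Approximate false positive rate for k hash functions and m/n bits per element."""
    return (1.0 - math.exp(-k / bits_per_element)) ** k


def bits_needed(n: int, p: float) -> int:
    """Filter size m in bits for n elements and target rate p (optimal k assumed)."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


for k, ratio in [(4, 6), (6, 8), (8, 12), (11, 16)]:
    print(k, ratio, round(100 * false_positive_rate(k, ratio), 4), "%")

m = bits_needed(1_000_000_000, 0.02)
print(round(m / 8 / 1e9, 2), "GB")  # about 1 GB for a billion elements at p = 2%
```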
Cassandra Bloom filter
What does it look like?
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>22.0</version>
</dependency>
Count-Min Sketch (Frequency)
How many times has an element occurred?
Show the top X elements
For streaming applications that deal with huge amounts of data:
● DNS DDoS
● Intent surge
● Twitter trending hashtags
- Uses multiple hash functions
- Matrix of counters (not bits)
- Top frequent elements
- Gives an upper-bound estimate: the true count is less than or equal to the reported value
h1(x) = MurmurHash3(x) % 10
h2(x) = FNV1(x) % 10
Stream:
{ #quarantine, #quarantine, #brexit, #brexit, #alyonalyona, #quarantine, #tesla, #tesla, #quarantine, #brexit, #oscar, #quarantine }
Initial state:
   0 1 2 3 4 5 6 7 8 9
h1 0 0 0 0 0 0 0 0 0 0
h2 0 0 0 0 0 0 0 0 0 0
1. #quarantine
2. #quarantine
h1(x) = MurmurHash3(quarantine) % 10 = 4
h2(x) = FNV1(quarantine) % 10 = 7
   0 1 2 3 4 5 6 7 8 9
h1 0 0 0 0 2 0 0 0 0 0
h2 0 0 0 0 0 0 0 2 0 0
3. #brexit
4. #brexit
h1(x) = MurmurHash3(brexit) % 10 = 8
h2(x) = FNV1(brexit) % 10 = 2
   0 1 2 3 4 5 6 7 8 9
h1 0 0 0 0 2 0 0 0 2 0
h2 0 0 2 0 0 0 0 2 0 0
After the whole stream:
   0 1 2 3 4 5 6 7 8 9
h1 0 0 0 0 6 0 1 0 5 0
h2 0 0 3 0 0 1 0 6 0 2
h1(x) = MurmurHash3(brexit) % 10 = 8
h2(x) = FNV1(brexit) % 10 = 2
h1(x) = MurmurHash3(tesla) % 10 = 8
h2(x) = FNV1(tesla) % 10 = 9
How many times #tesla?
Final answer = min(h1[8], h2[9]) = min(5, 2) = 2
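The update and query steps above can be sketched in Python as follows. This is a minimal illustration with hypothetical class and method names. As an assumption, salted BLAKE2b from `hashlib` replaces MurmurHash3 and FNV1, so the cells that get incremented will differ from the slides. The invariant is the same: the estimate never falls below the true count.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Count-Min Sketch: depth rows of width counters, one hash function per row."""

    def __init__(self, depth: int, width: int) -> None:
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # Salted BLAKE2b stands in for the per-row hash function (an assumption).
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=bytes([row])
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )


stream = [
    "#quarantine", "#quarantine", "#brexit", "#brexit", "#alyonalyona",
    "#quarantine", "#tesla", "#tesla", "#quarantine", "#brexit",
    "#oscar", "#quarantine",
]

sketch = CountMinSketch(depth=2, width=10)
for tag in stream:
    sketch.add(tag)

for tag, true_count in Counter(stream).items():
    print(tag, "true:", true_count, "estimate:", sketch.estimate(tag))
```

With only 10 counters per row, some estimates may come out higher than the true counts; widening the rows reduces that.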
Sizing:
p = ⌈ln(1/σ)⌉
m = ⌈e/ɛ⌉, where e ≈ 2.71828
p - number of hash functions (rows)
σ - probability that an estimate exceeds the accepted overestimation
m - number of counters per row
ɛ - overestimation factor: accepted overestimation divided by the total number of elements
Example:
We expect to store 10 million elements, σ should be ~1%, and the accepted overestimation is 10.
p = ⌈ln(1/0.01)⌉ = ⌈4.61⌉ = 5
ɛ = 10/10^7 = 10^-6
m = 2.71828/10^-6 ≈ 2,718,280
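The sizing example can be checked with a few lines of Python. This is a sketch under the formulas given above, with the brackets read as rounding up; the function and variable names are hypothetical. Using the full-precision value of e gives a width two counters larger than the slide's figure, which comes from truncating e to 2.71828.

```python
import math
from typing import Tuple


def cms_dimensions(sigma: float, epsilon: float) -> Tuple[int, int]:
    """Return (number of hash functions, counters per row) for a Count-Min Sketch."""
    depth = math.ceil(math.log(1.0 / sigma))
    width = math.ceil(math.e / epsilon)
    return depth, width


n = 10_000_000                # expected number of elements
accepted_overestimation = 10  # how far an estimate may exceed the true count
epsilon = accepted_overestimation / n

depth, width = cms_dimensions(sigma=0.01, epsilon=epsilon)
print(depth, width)  # 5 2718282
```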
Conclusions
- Probabilistic data structures are not general purpose
- They should be used as optimization
- They can save you memory and time
- They sound complex, but are not so scary in practice
- Learn them and impress your interviewer
https://www.amazon.com/Probabilistic-Data-Structures-Algorithms-Applications/dp/3748190484
Thanks!
