1
Intro to Sketch Algorithms
19/10/2021
2
Large Stream of Events
- Did this IP visit me before?
- How many unique IPs have we seen this month?
- How many times did I see this IP?
- What is the median transaction value? The top 1% value?
- What are the most common collections of fonts available?
3
Fixed memory
Can’t store all unique values in memory.
4
Did I see this value before?
If we are willing to accept an arbitrarily low chance of false positives, we can solve this problem with Bloom filters.
5
Bloom Filter
Hash each value and turn on the bit for that hash bucket.
Repeat with k different hash functions; to query, check whether the bits for all k hash functions are set.
Some false positives, no false negatives.
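To make the add and query steps concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the implementation from the talk: the class name, the default sizes, and the trick of deriving the k hash functions by salting SHA-256 with an index are all choices made for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; num_bits and num_hashes are illustrative defaults."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _indexes(self, value: str):
        # Assumption: k hash functions are simulated by salting one hash with i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for idx in self._indexes(value):
            self.bits[idx] = 1

    def might_contain(self, value: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[idx] for idx in self._indexes(value))
```

Memory stays fixed at num_bits regardless of how many values are added; the price is that the false-positive rate grows as more bits are turned on.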
6
Cardinality estimation
If we hash all values and take the minimum of all the hashes, what is the expected minimum value?
7
Cardinality estimation
Let hash(x) : X => [0,1] be uniformly pseudo-random.
E[min(hash(x))] = 1/(k+1), where k is the number of distinct elements.
The observed minimum is an unbiased estimator of 1/(k+1), so k can be estimated as 1/min - 1.
If we repeat with several different hash functions, we can average the minima before inverting to reduce the variance.
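The estimator can be tried out with a short Python sketch. The names (unit_hash, MinHashCounter) and the use of salted SHA-256 as a stand-in for independent uniform hash functions are assumptions made for illustration. The sketch averages the per-function minima and only then inverts, because 1/min from a single hash function is extremely noisy.

```python
import hashlib


def unit_hash(value: str, seed: int) -> float:
    """Map a value to a pseudo-random float in [0, 1] (assumed uniform)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class MinHashCounter:
    """Tracks the minimum hash per hash function; memory is O(num_hashes)."""

    def __init__(self, num_hashes: int = 256):
        self.minima = [1.0] * num_hashes

    def add(self, value: str) -> None:
        for seed in range(len(self.minima)):
            h = unit_hash(value, seed)
            if h < self.minima[seed]:
                self.minima[seed] = h

    def estimate(self) -> float:
        # E[min] = 1/(k+1), so invert the averaged minimum to recover k.
        mean_min = sum(self.minima) / len(self.minima)
        return 1.0 / mean_min - 1.0
```

Repeated values hash to the same numbers and therefore never change the minima, which is why the estimate depends only on the number of distinct elements.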
8
How many times did we see this value?
Count-min sketch
Related to counting Bloom filters.
Hash the value and increment a counter at the hashed index.
Use multiple hash functions, each with a separate table (column), and return the minimum of all estimates.
Produces a biased estimate: estimate >= actual.
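A minimal Python sketch of this structure follows. The class name, the default width and depth, and the salted SHA-256 hashing are assumptions for illustration rather than details from the slides.

```python
import hashlib


class CountMinSketch:
    """depth tables of width counters each, one table per hash function."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, value: str, row: int) -> int:
        # Assumption: a separate hash per table is simulated by salting with row.
        digest = hashlib.sha256(f"{row}:{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value: str, count: int = 1) -> None:
        for row, table in enumerate(self.tables):
            table[self._index(value, row)] += count

    def estimate(self, value: str) -> int:
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(
            table[self._index(value, row)]
            for row, table in enumerate(self.tables)
        )
```

Because every counter that a value touches includes at least that value's own increments, the estimate can overshoot but never undershoot the true count.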
9
Median estimation
Naive - sample and calculate the median on the sample.
Remedian - calculate the median of medians (of medians…).
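The remedian idea can be sketched in Python as below. This is a simplified illustration: the class name and parameters are hypothetical, and the final estimate is taken from the highest non-empty buffer rather than from a weighted median over all buffers, which is a shortcut chosen here for brevity.

```python
import statistics


class Remedian:
    """Median of medians using `levels` buffers of `buffer_size` values each."""

    def __init__(self, buffer_size: int = 11, levels: int = 4):
        self.buffer_size = buffer_size
        self.buffers = [[] for _ in range(levels)]
        self.capacity = buffer_size ** levels  # most values the buffers can absorb
        self.count = 0

    def add(self, x: float) -> None:
        if self.count >= self.capacity:
            raise OverflowError("stream longer than buffer_size ** levels")
        self.count += 1
        for level, buf in enumerate(self.buffers):
            buf.append(x)
            is_top = level == len(self.buffers) - 1
            if len(buf) < self.buffer_size or is_top:
                return
            # Buffer is full: pass its median up one level and reuse the buffer.
            x = statistics.median(buf)
            buf.clear()

    def estimate(self) -> float:
        # Simplification: use the highest level that currently holds data.
        for buf in reversed(self.buffers):
            if buf:
                return statistics.median(buf)
        raise ValueError("no data seen yet")
```

With an odd buffer_size every intermediate median is an actual stream value, and total memory is buffer_size * levels values for a stream of up to buffer_size ** levels items.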
10
Biased quantile estimators
Naive - sample and calculate the quantile on the sample.
Sample and keep to K
Manku - maintain eps-approximate counts and quantiles: keep counts of values in intervals, and keep them balanced.
11
What are the top K most frequent values?
Answering exactly provably requires Ω(N) space; even finding the single most common value does.
Relax to the K-heavy-hitters problem: find all values with frequency at least 1/K.
Approximate K-heavy-hitters: return all values with frequency more than 1/K, and return no value with frequency below 1/K - epsilon.
12
Frequent algorithm
Initialize an empty Map m from elements to counters
def add(a)
  if m.contains(a) m(a) += 1
  else if m.size < k m(a) = 1
  else
    decrease all counters in m by 1
    remove any elements with count=0
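A runnable Python rendering of the pseudocode above might look like the following. The class name FrequentItems and the candidates helper are hypothetical additions for this example; the add logic follows the slide step by step.

```python
class FrequentItems:
    """Keeps at most k counters, as in the pseudocode above."""

    def __init__(self, k: int):
        self.k = k
        self.counters = {}  # element -> counter

    def add(self, item) -> None:
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k:
            self.counters[item] = 1
        else:
            # Decrease all counters by 1 and drop those that reach 0.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def candidates(self) -> dict:
        # Counters are lower bounds on true counts, not exact frequencies.
        return dict(self.counters)


sketch = FrequentItems(k=2)
for ip in ["a", "b", "a", "c", "a", "d", "a", "e", "a", "a"]:
    sketch.add(ip)
print(sketch.candidates())  # {'a': 4}
```

In this run "a" occurs 6 times out of 10 and survives with a counter of 4: the surviving set contains every sufficiently frequent value, while the counters themselves undercount, so a second pass or a separate counter is needed if exact frequencies matter.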
13
THANK YOU
14
Sampling K elements from a stream of N
Algorithm             Extra memory  Accurate results                     Materialized result
Shuffle and take      N elements    Yes                                  Yes
Reservoir             K elements    Yes                                  Yes
Indices reservoir     K indices     Yes                                  No
Independent sample    O(1)          Length not guaranteed                No
Accurate independent  O(1)          Slight correlation between elements  No
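As an illustration of the Reservoir row in the table, here is a minimal Python sketch of classic reservoir sampling. The function name and the optional rng parameter are choices made for this example; the slides do not give an implementation.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Only K elements are held at any time, and the result is materialized: the sample is available immediately when the stream ends, without a second pass.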
15
STDEV streaming - accurate algorithm
variance = E[(x - E[x])^2]
         = E[x^2 - 2xE[x] + E[x]^2]
         = E[x^2] - 2E[x]E[x] + E[x]^2
         = E[x^2] - E[x]^2
stdev = sqrt(variance)
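The identity above means a stream only needs three running totals: the count, the sum of x, and the sum of x^2. The Python sketch below illustrates this; the class name RunningStats is hypothetical, and it computes the population (not sample) standard deviation. One caveat that is not from the slides: with floating-point values whose mean is large relative to their spread, subtracting two nearly equal numbers loses precision, and Welford's algorithm is a common alternative in that case.

```python
import math


class RunningStats:
    """Streaming mean/variance from count, sum and sum of squares."""

    def __init__(self):
        self.n = 0
        self.total = 0.0     # running sum of x
        self.total_sq = 0.0  # running sum of x^2

    def add(self, x: float) -> None:
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def variance(self) -> float:
        # E[x^2] - E[x]^2, clamped at 0 to absorb rounding error.
        mean = self.total / self.n
        return max(self.total_sq / self.n - mean * mean, 0.0)

    def stdev(self) -> float:
        return math.sqrt(self.variance())


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(value)
print(stats.stdev())  # 2.0
```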
