Sketch algorithms

Meir Maor

Introduction to sketch algorithms. Approximate solutions with sublinear memory.

2
- Did this IP visit me before?
- How many unique IPs have we seen this month?
- How many times did I see this IP?
- What is the median transaction value? The top 1% value?
- What are the most common collections of available fonts?
Large Stream of Events

3
Can’t store all unique values in memory
Fixed memory

4
If we are willing to accept an arbitrarily low chance of false positives, we can solve this problem with Bloom filters.
Did I see this value before?

5
Hash each value and turn on the bit for that hash bucket.
Repeat with k different hash functions; to query, ask whether the bits for all k hash functions are set.
Some false positives, no false negatives.
Bloom Filter
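The add and query steps on this slide can be sketched in Python as below. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the use of seeded SHA-256 as the family of k hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hashing scheme are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _indexes(self, value):
        # Derive k hash functions by prefixing the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for i in self._indexes(value):
            self.bits[i] = 1

    def might_contain(self, value):
        # True may be a false positive; False is always correct.
        return all(self.bits[i] for i in self._indexes(value))
```

A value that was added always reports `True`; a value that was never added usually reports `False`, with a false-positive rate that depends on the number of bits, the number of hash functions and the number of stored values.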

6
If we hash all values, and calculate the minimum of
all hashes, what is the expected minimum value?
Cardinality estimation

7
Let hash(x) : X -> [0,1] be uniformly pseudo-random.
E[min(hash(x))] = 1/(k+1), where k is the number of distinct elements.
The observed minimum is an unbiased estimator of 1/(k+1), so k can be estimated as 1/min - 1.
If we repeat with several different hash functions, we can average the resulting minima to reduce the variance.
Cardinality estimation
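A rough Python sketch of this estimator follows. The function names `unit_hash` and `estimate_cardinality`, the parameter `num_hashes`, and the seeded SHA-256 construction are assumptions for illustration. The sketch averages the per-hash minima and then inverts the average, which is one reasonable way to combine several hash functions.

```python
import hashlib


def unit_hash(value, seed):
    """Map a value to a pseudo-random float in [0, 1) (seeded SHA-256, an assumption)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_cardinality(stream, num_hashes=64):
    """Estimate the number of distinct values using only num_hashes floats of state."""
    minima = [1.0] * num_hashes
    for value in stream:
        for seed in range(num_hashes):
            h = unit_hash(value, seed)
            if h < minima[seed]:
                minima[seed] = h
    # Each minimum has expectation 1/(k+1); average them, then solve for k.
    mean_min = sum(minima) / num_hashes
    return 1.0 / mean_min - 1.0
```

Duplicates do not change the result, because a repeated value hashes to the same numbers and cannot lower any minimum further. The relative error shrinks roughly with the square root of `num_hashes`.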

8
Counting Bloom filters.
Hash the value and increment a counter at the hashed index.
Use multiple hash functions, each with a separate table (column); return the minimum of all estimates.
Produces a biased estimate: estimate >= actual.
How many times did we see this value?
count–min sketch
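The counting scheme above might look like this in Python. The class name `CountMinSketch`, the parameters `width` and `depth`, and the seeded SHA-256 hashing are assumptions chosen for the example, not part of the slides.

```python
import hashlib


class CountMinSketch:
    """Illustrative count-min sketch: depth hash functions, one table each."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        # One hash function per table, derived by seeding with the row number.
        digest = hashlib.sha256(f"{row}:{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value, count=1):
        for row in range(self.depth):
            self.tables[row][self._index(row, value)] += count

    def estimate(self, value):
        # Collisions can only inflate a counter, so the minimum is the best guess
        # and never falls below the true count.
        return min(self.tables[row][self._index(row, value)]
                   for row in range(self.depth))
```

Memory use is fixed at `width * depth` counters no matter how many distinct values appear in the stream.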

9
Naive - sample and calculate the median on the sample
Remedian - calculate the median of medians (of medians…)
Median estimation
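The median-of-medians idea can be sketched as a chain of small buffers: when a buffer fills, its median is pushed to the next level. The class name `Remedian` and the parameter `base` are assumptions, and the final step is simplified: it returns the median of the highest non-empty buffer and ignores partially filled lower buffers, which a more careful version would weight in.

```python
import statistics


class Remedian:
    """Simplified remedian sketch; keeps about base * log_base(n) values."""

    def __init__(self, base=9):
        self.base = base          # buffer size; odd values give a true middle element
        self.buffers = [[]]       # buffers[i] holds medians of level i-1

    def add(self, value):
        level = 0
        while True:
            if level == len(self.buffers):
                self.buffers.append([])
            buf = self.buffers[level]
            buf.append(value)
            if len(buf) < self.base:
                return
            # Buffer is full: collapse it to its median and push that one level up.
            value = statistics.median(buf)
            buf.clear()
            level += 1

    def estimate(self):
        # Simplification: use only the highest level that holds any data.
        for buf in reversed(self.buffers):
            if buf:
                return statistics.median(buf)
        raise ValueError("no data seen yet")
```

The result is an approximation of the median in general; it is exact only for particular inputs.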

10
Naive - sample and calculate quantile on sample
Sample and keep to K
Manku - maintain eps-approximate counts and quantiles: keep counts of values in intervals, and keep the intervals balanced.
Biased quantile estimators

11
Provably requires at least O(N) space.
Even finding the single most common value does.
Relax to the K-heavy-hitters problem: find all values with frequency at least 1/K.
Approximate K heavy hitters: return all values with frequency more than 1/K, and return no value with frequency below 1/K - epsilon.
What are the top K most frequent values?

12
Initialize an empty Map m from elements to counters
def add(a):
    if m.contains(a): m(a) += 1
    else if m.size < k: m(a) = 1
    else:
        decrease all counters in m by 1
        remove any elements with count = 0
Frequent algorithm
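A runnable Python rendering of the pseudocode above, as a sketch. The class name `Frequent` and the attribute `counters` are assumptions; this scheme is commonly known as the Misra-Gries algorithm. The surviving counters are lower bounds on the true counts, so a second pass over the data is needed if exact frequencies matter.

```python
class Frequent:
    """Keep at most k counters; frequent values are guaranteed to survive."""

    def __init__(self, k):
        self.k = k
        self.counters = {}

    def add(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k:
            self.counters[item] = 1
        else:
            # No free slot: decrement every counter and drop those that hit zero.
            # The new item itself is not stored.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]
```

For example, feeding the stream a, a, a, a, a, a, b, c, d into `Frequent(k=2)` leaves `a` with a count of 5: the arrival of `c` finds the map full, decrements both counters and evicts `b`.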

14
Sampling K elements from a stream of N
Algorithm             Extra memory   Accurate results                       Materialized result
Shuffle and take      N elements     Yes                                    Yes
Reservoir             K elements     Yes                                    Yes
Indices reservoir     K indices      Yes                                    No
Independent sample    O(1)           Length not guaranteed                  No
Accurate independent  O(1)           Slight correlation between elements    No
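The "Reservoir" row can be illustrated with the classic reservoir sampling procedure, sketched here in Python. The function name `reservoir_sample` and the optional `rng` parameter are assumptions for the example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Only the K sampled elements are held in memory, and the sample is materialized at every point in the stream. If the stream has fewer than K items, all of them are returned.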

15
variance = E[(x - E[x])^2]
         = E[x^2 - 2xE[x] + E[x]^2]
         = E[x^2] - 2E[x]E[x] + E[x]^2
         = E[x^2] - E[x]^2
stdev = sqrt(variance)
STDEV streaming - accurate algorithm
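The identity above means the standard deviation can be computed in one pass by tracking only the count, the sum and the sum of squares. A minimal Python sketch follows; the class name `StreamingStdev` and its attribute names are assumptions. With floating-point data whose mean is large relative to its spread, the subtraction can lose precision, so the result is clamped at zero here.

```python
import math


class StreamingStdev:
    """Population standard deviation from three running totals."""

    def __init__(self):
        self.n = 0
        self.total = 0.0      # running sum of x
        self.total_sq = 0.0   # running sum of x^2

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def variance(self):
        if self.n == 0:
            raise ValueError("no data seen yet")
        mean = self.total / self.n
        # E[x^2] - E[x]^2, clamped to guard against tiny negative rounding errors.
        return max(0.0, self.total_sq / self.n - mean * mean)

    def stdev(self):
        return math.sqrt(self.variance())
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5, the variance is 4 and the standard deviation is 2.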

Intro to Sketch Algorithms, 19/10/2021
