SlideShare a Scribd company logo
1 of 31
Download to read offline
tech talk @ ferret
Andrii Gakhov
PROBABILISTIC DATA STRUCTURES
ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK
PART 3: FREQUENCY
FREQUENCY
Agenda:
▸ Count-Min Sketch
▸ Majority Algorithm
▸ Misra-Gries Algorithm
• To estimate number of times an element
occurs in a set
THE PROBLEM
COUNT-MIN SKETCH
COUNT-MIN SKETCH
• proposed by G. Cormode and S. Muthukrishnan in 2003
• CM Sketch is a sublinear space data structure that supports:
• add element to the structure
• count number of times the element has been added
(frequency)
• Count-Min Sketch is described by 2 parameters:
• m - number of buckets 

(independent from n, but much smaller)
• k - number of different hash functions, that map to 1…m

(usually, k is much smaller than m)
• required fixed space: m*k counters and k hash functions
COUNT-MIN SKETCH: ALGORITHM
• Count-Min Sketch is simply an matrix of counters (initially all 0), where each row
corresponds to a hash function hi, i=1…k
• To add element into the sketch - calculate all k hash functions increment
counters in positions [i, hi(element)], i=1…k
+1
+1
+1
+1
+1
h1
h2
hk
1 2 3 m
x
h1(x)
h2(x)
hk(x)
… …
…
COUNT-MIN SKETCH: ALGORITHM
• Because of soft collisions, we have k estimations of the true
frequency of the element, but because we never decrement
counter it can only overestimate, not underestimate.
• To get frequency of the element we calculate all k hash
functions and return the minimal value of the counters in
positions [i, hi(element)], i=1…k.
• Time needed to add element or return its frequency is a fixed
constant O(k), assuming that every hash function can be
evaluated in a constant time.
COUNT-MIN SKETCH: EXAMPLE
• Consider Count-Min Sketch with 16 columns (m=16)
• Consider 2 hash functions (k=2):

h1 is MurmurHash3 and h2 is Fowler-Noll-Vo

(to calculate the appropriate index, we divide result by mod 16)
• Add element to the structure: “berlin”

h1(“berlin”) = 4, h2(“berlin”) = 12
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 01
2
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 01
2
0
COUNT-MIN SKETCH: EXAMPLE
• Add element “berlin” 5 more times (so, 6 in total):
• Add element “bernau”: h1(“bernau”) = 4, h2(“bernau”) = 4
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0
0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 01
2
0
• Add element “paris”: h1(“paris”) = 11, h2(“paris”) = 4
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 1 0 0 0 0 0 0 0 6 0 0 0
0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 01
2
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 2 0 0 0 0 0 0 0 6 0 0 0
0 0 0 0 7 0 0 0 0 0 0 1 0 0 0 01
2
0
COUNT-MIN SKETCH: EXAMPLE
• Get frequency for element: “london”:

h1(“london”) = 7, h2(“london”) = 4

freq(“london”) = min (0, 2) = 0
• Get frequency for element: “berlin”:

h1(“berlin”) = 4, h2(“berlin”) = 12

freq(“berlin”) = min (7, 6) = 6
• Get frequency for element: “warsaw”:

h1(“warsaw”) = 4, h2(“warsaw”) = 12

freq(“warsaw”) = min (7, 6) = 6 !!! due to hash collision
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 2 0 0 0 0 0 0 0 6 0 0 0
0 0 0 0 7 0 0 0 0 0 0 1 0 0 0 01
2
0
COUNT-MIN SKETCH: PROPERTIES
• Count-Min Sketch only returns overestimates of true frequency
counts, never underestimates
• To achieve a target error probability of δ, we need k ≥ ln 1/δ.
• for δ around 1%, k = 5 is good enough
• Count-Min Sketch is essentially the same data structure as the
Counting Bloom filters. 

The difference is followed from the usage:
• Count-Min Sketch has a sublinear number of cells, related to the desired
approximation quality of the sketch
• Counting Bloom filter is sized to match the number of elements in the set
COUNT-MIN SKETCH: APPLICATIONS
• AT&T has used Count-Min Sketch in network switches to
perform analyses on network traffic using limited memory
• At Google, a precursor of the count-min sketch (called the
“count sketch”) has been implemented on top of their
MapReduce parallel processing infrastructure
• Implemented as a part of Algebird library from Twitter
COUNT-MIN SKETCH: PYTHON
• https://github.com/rafacarrascosa/countminsketch

CountMinSketch is a minimalistic Count-min Sketch in
pure Python
• https://github.com/farsightsec/fsisketch

FSI Sketch a disk-backed implementation of the Count-
Min Sketch algorithm
COUNT-MIN SKETCH: READ MORE
• http://dimacs.rutgers.edu/~graham/pubs/papers/cm-
latin.pdf
• http://dimacs.rutgers.edu/~graham/pubs/papers/
cmencyc.pdf
• http://dimacs.rutgers.edu/~graham/pubs/papers/
cmsoft.pdf
• http://theory.stanford.edu/~tim/s15/l/l2.pdf
• http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/
Notes/lecnotes.pdf
• To find a single element that occurs strictly
more than n / 2 times
n - number of elements in a set
MAJORITY PROBLEM
MAJORITY PROBLEM
• Mostly a toy problem, but it let us better understand the
frequency problems for data streams
• Majority problem has been formulated as a research problem
by J Strother Moore in the Journal of Algorithms in 1981.
• It’s possible that such element does not exists
• It could be at most one majority element in the set
• If such element exists, it will be equal to the median. So, the
Naïve linear-time solution - compute the median of the set
• Problem: it requires multiple pass through the stream
MAJORITY PROBLEM
• The Boyer-Moore Majority Vote Algorithm has been invented
by Bob Boyer and J Strother Moore in 1980 to solve the Majority
Problem in a single pass through the data stream.
• The similar solution was independently proposed by Michael J. Fischer and
Steven L. Salzberg in 1982.This is the most popular algorithm for
undergraduate classes due to its simplicity.
• An important pre-requirement is that the majority element
actually exists, without it the output of the algorithm will be an
arbitrary element of the data stream.
• Algorithm requires only 1 left-to-right pass!
• The Data structure for the Majority Algorithm just a pair of an
integer counter and monitored element current
MAJORITY ALGORITHM: ALGORITHM
• Initialise counter with 1 and current with the first element from
the left
• Going from left to right:
• If counter equals to 0, then take the current element as the
current and set counter to 1
• if counter isn’t 0, then increase counter by 1 if the element
equals to current or decrease counter by 1 otherwise
• The last current is the majority element (if counter bigger that 0)
Intuition: each element that contains a non-majority-value can only “cancel out” one copy
of the majority value. Since more than n/2 of the elements contain the majority value, there
is guaranteed to be a copy of it left standing at the end of the algorithm.
MAJORITY ALGORITHM: EXAMPLE
• Consider the following set of elements: {3,2,3,2,2,3,3,3}
• Iterate from left to right and update counter:
3 2 3 2 2 3 3 3
counter: 1 current: 3
3 2 3 2 2 3 3 3
counter: 1 current: 3
3 2 3 2 2 3 3 3
counter: 0 current: 3
3 2 3 2 2 3 3 3
counter: 0 current: 3
3 2 3 2 2 3 3 3
counter: 1 current: 2
3 2 3 2 2 3 3 3
counter: 0 current: 2
3 2 3 2 2 3 3 3
counter: 1 current: 3
3 2 3 2 2 3 3 3
counter: 2 current: 3
MAJORITY: READ MORE
• https://courses.cs.washington.edu/courses/cse522/14sp/
lectures/lect04.pdf
• http://theory.stanford.edu/~tim/s16/l/l2.pdf
• To find elements that occurs more than n / k
times (n >> k)
* also known as k-frequency-estimation
HEAVY HITTERS PROBLEM
• There are not more than k such values
• The majority problem is a particular case of the heavy hitters
problem
• k ≈ 2 − δ for a small value δ > 0
• with the additional promise that a majority element exists
• Theorem: There is no algorithm that solves the heavy hitters problems in one pass
while using a sublinear amount of auxiliary space
• ε-approximate heavy hitters (ε-HH) problem:
• every value that occurs at least n/k times is in the list
• every value in the list occurs at least n/k − εn times in the set
• ε-HH can be solved with Count-Min Sketch as well, but we
consider Misra-Gries / Frequent algorithm here …
HEAVY HITTERS PROBLEM
MISRA-GRIES SUMMARY
MISRA-GRIES / FREQUENT ALGORITHM
• A generalisation of the Majority Algorithm to track multiple
frequent items, known as Frequent Algorithm, was proposed
by Erik D. Demaine, et al. in 2002.
• After some time it was discovered that the Frequent algorithm actually the
same as the algorithm that was published by Jayadev Misra and David Gries
in 1982, known now as Misra-Gries Algorithm
• The trick is to run the Majority algorithm, but with many
counters instead of 1
• The time cost of the algorithm is dominated by the O(1)
dictionary operations per update, and the cost of decrementing
counts
MISRA-GRIES: ALGORITHM
The Data structure for Misra-Gries algorithm consists of 2 arrays:
• an array of k-1 frequency counters C (all 0) and k-1 locations X* (empty set)
For every element xi in the set:
• If xi is already in X* at some index j:
• increase its corresponding frequency counter by 1: C[j] = C[j] + 1
• If xi not in X*:
• If size of X* less than k (so, there is free space in the top):
• append xi to X* at index j
• set the corresponding frequency counter C[j] =1
• else If X* is already full:
• decrement all frequency counters by 1: C[j] = C[j] - 1, j=1..k-1
• remove such elements from X* whose counters are 0.
Top k-1 frequent elements : X*[j] with frequency estimations C[j], j=1..k-1
MISRA-GRIES:EXAMPLE
0 0C:
X*:
• Step 1: {3,2,1,2,2,3,1,3}

Element {3} isn’t in X* already and X* isn’t full, so we add it at position 0 and set C[0] = 1
• Step 2: {3,2,1,2,2,3,1,3}

Element {2} isn’t in X* already and X* isn’t full, so we add it at position 1 and set C[1] = 1
1 0C:
X*: 3
• Consider the following set of elements: {3,2,1,2,2,3,1,3}, n= 8
• We need to find top 2 elements in the set that appears at least n/3 times.
1 1
2
C:
X*: 3
MISRA-GRIES:EXAMPLE
• Step 3: {3,2,1,2,2,3,1,3}

Element {1} isn’t in X* already, but X* is full, so decrement all counters and remove
elements that have counters less then 1:
• Step 4: {3,2,1,2,2,3,1,3}

Element {2} isn’t in X* already and X* isn’t full, so we add it at position 0 and set C[0] = 1
0 0C:
X*:
1 0C:
X*: 2
• Step 5: {3,2,1,2,2,3,1,3}

Element {2} is in X* at position 0, so we increase its counter C[0] = C[0] + 1 = 2
2 0C:
X*: 2
MISRA-GRIES:EXAMPLE
• Step 6: {3,2,1,2,2,3,1,3}

Element {3} isn’t in X* already and X* isn’t full, so we append {3} at position 1 and set
its counter to 1: C[1] = 1
• Step 7: {3,2,1,2,2,3,1,3}

Element {1} isn’t in X* already, but X* is full, so decrement all counters and remove
elements that have 0 as the counters values:
2 1
3
C:
X*: 2
1 0C:
X*: 2
• Step 8: {3,2,1,2,2,3,1,3}

Element {3} isn’t in X* already and X* isn’t full, so we append {3} at position 1 and set
its counter to 1: C[1] = 1
2 1
3
C:
X*: 2
Top 2 elements
MISRA-GRIES: PROPERTIES
• The algorithm identifies at most k-1 candidates without
any probabilistic approach.
• It's still open question how to process such updates
quickly, or particularly, how to decrement and release
several counters simultaneously.
• For instance, it was proposed to use doubly linked list of counters and
store only the differences between same-value counter’s groups and use.
With such data structure each counter no longer needs to store a value,
but rather its group and its monitored element.
MISRA-GRIES: READ MORE
• https://courses.cs.washington.edu/courses/cse522/14sp/
lectures/lect04.pdf
• http://dimacs.rutgers.edu/~graham/pubs/papers/
encalgs-mg.pdf
• http://drops.dagstuhl.de/opus/volltexte/2016/5773/pdf/
3.pdf
• http://theory.stanford.edu/~tim/s16/l/l2.pdf
▸ @gakhov
▸ linkedin.com/in/gakhov
▸ www.datacrucis.com
THANK YOU

More Related Content

What's hot

Count-Distinct Problem
Count-Distinct ProblemCount-Distinct Problem
Count-Distinct ProblemKai Zhang
 
Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE MethodBrendan Gregg
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningAbhishek Vijayvargia
 
PCA (Principal Component Analysis)
PCA (Principal Component Analysis)PCA (Principal Component Analysis)
PCA (Principal Component Analysis)Luis Serrano
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
[234]멀티테넌트 하둡 클러스터 운영 경험기
[234]멀티테넌트 하둡 클러스터 운영 경험기[234]멀티테넌트 하둡 클러스터 운영 경험기
[234]멀티테넌트 하둡 클러스터 운영 경험기NAVER D2
 
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesModern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesLorenzo Alberton
 
Open addressiing &rehashing,extendiblevhashing
Open addressiing &rehashing,extendiblevhashingOpen addressiing &rehashing,extendiblevhashing
Open addressiing &rehashing,extendiblevhashingSangeethaSasi1
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and BoostingMohit Rajput
 
Master the RETE algorithm
Master the RETE algorithmMaster the RETE algorithm
Master the RETE algorithmMasahiko Umeno
 
Hadoop implementation for algorithms apriori, pcy, son
Hadoop implementation for algorithms apriori, pcy, sonHadoop implementation for algorithms apriori, pcy, son
Hadoop implementation for algorithms apriori, pcy, sonChengeng Ma
 
오토인코더의 모든 것
오토인코더의 모든 것오토인코더의 모든 것
오토인코더의 모든 것NAVER Engineering
 
Perspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer SheetPerspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer SheetHoang Nguyen Phong
 

What's hot (20)

Count-Distinct Problem
Count-Distinct ProblemCount-Distinct Problem
Count-Distinct Problem
 
Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE Method
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learning
 
Bayes Belief Networks
Bayes Belief NetworksBayes Belief Networks
Bayes Belief Networks
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
Extensible hashing
Extensible hashingExtensible hashing
Extensible hashing
 
PCA (Principal Component Analysis)
PCA (Principal Component Analysis)PCA (Principal Component Analysis)
PCA (Principal Component Analysis)
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
[234]멀티테넌트 하둡 클러스터 운영 경험기
[234]멀티테넌트 하둡 클러스터 운영 경험기[234]멀티테넌트 하둡 클러스터 운영 경험기
[234]멀티테넌트 하둡 클러스터 운영 경험기
 
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesModern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
 
Open addressiing &rehashing,extendiblevhashing
Open addressiing &rehashing,extendiblevhashingOpen addressiing &rehashing,extendiblevhashing
Open addressiing &rehashing,extendiblevhashing
 
Expression trees
Expression treesExpression trees
Expression trees
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
 
Master the RETE algorithm
Master the RETE algorithmMaster the RETE algorithm
Master the RETE algorithm
 
Hadoop implementation for algorithms apriori, pcy, son
Hadoop implementation for algorithms apriori, pcy, sonHadoop implementation for algorithms apriori, pcy, son
Hadoop implementation for algorithms apriori, pcy, son
 
오토인코더의 모든 것
오토인코더의 모든 것오토인코더의 모든 것
오토인코더의 모든 것
 
Perspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer SheetPerspective in Informatics 3 - Assignment 1 - Answer Sheet
Perspective in Informatics 3 - Assignment 1 - Answer Sheet
 

Viewers also liked

Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityAndrii Gakhov
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityAndrii Gakhov
 
Bloom filter
Bloom filterBloom filter
Bloom filterfeng lee
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данныхAndrii Gakhov
 
Big Data with Semantics - StampedeCon 2012
Big Data with Semantics - StampedeCon 2012Big Data with Semantics - StampedeCon 2012
Big Data with Semantics - StampedeCon 2012StampedeCon
 
Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013StampedeCon
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaAndrii Gakhov
 
Bloom filter
Bloom filterBloom filter
Bloom filterwang ping
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksAndrii Gakhov
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryAndrii Gakhov
 
Cuckoo Optimization ppt
Cuckoo Optimization pptCuckoo Optimization ppt
Cuckoo Optimization pptAnuja Joshi
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Andrii Gakhov
 

Viewers also liked (17)

Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. Similarity
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. Cardinality
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
Bloom filters
Bloom filtersBloom filters
Bloom filters
 
14 Skip Lists
14 Skip Lists14 Skip Lists
14 Skip Lists
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данных
 
Big Data with Semantics - StampedeCon 2012
Big Data with Semantics - StampedeCon 2012Big Data with Semantics - StampedeCon 2012
Big Data with Semantics - StampedeCon 2012
 
Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and Lua
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
skip list
skip listskip list
skip list
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
 
Cuckoo Optimization ppt
Cuckoo Optimization pptCuckoo Optimization ppt
Cuckoo Optimization ppt
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014
 
Cuckoo search algorithm
Cuckoo search algorithmCuckoo search algorithm
Cuckoo search algorithm
 

Similar to Probabilistic data structures. Part 3. Frequency

streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxGopiNathVelivela
 
Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Satalia
 
Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structureThinh Dang
 
Simple, fast, and scalable torch7 tutorial
Simple, fast, and scalable torch7 tutorialSimple, fast, and scalable torch7 tutorial
Simple, fast, and scalable torch7 tutorialJin-Hwa Kim
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithmsSandeep Joshi
 
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...GeeksLab Odessa
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidStreamNative
 
Unit 5 Streams2.pptx
Unit 5 Streams2.pptxUnit 5 Streams2.pptx
Unit 5 Streams2.pptxSonaliAjankar
 
lecture 23
lecture 23lecture 23
lecture 23sajinsc
 
computer logic and digital design chapter 1
computer logic and digital design chapter 1computer logic and digital design chapter 1
computer logic and digital design chapter 1tendaisigauke3
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Databricks
 
Advance data structure & algorithm
Advance data structure & algorithmAdvance data structure & algorithm
Advance data structure & algorithmK Hari Shankar
 
RecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect HashingRecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect HashingThomas Mueller
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Andrii Gakhov
 

Similar to Probabilistic data structures. Part 3. Frequency (20)

streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
 
Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++
 
Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structure
 
hash
 hash hash
hash
 
Simple, fast, and scalable torch7 tutorial
Simple, fast, and scalable torch7 tutorialSimple, fast, and scalable torch7 tutorial
Simple, fast, and scalable torch7 tutorial
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
Data Structures- Hashing
Data Structures- Hashing Data Structures- Hashing
Data Structures- Hashing
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 
Chapter1
Chapter1Chapter1
Chapter1
 
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
 
Unit 5 Streams2.pptx
Unit 5 Streams2.pptxUnit 5 Streams2.pptx
Unit 5 Streams2.pptx
 
lecture 23
lecture 23lecture 23
lecture 23
 
Chapter1.ppt
Chapter1.pptChapter1.ppt
Chapter1.ppt
 
computer logic and digital design chapter 1
computer logic and digital design chapter 1computer logic and digital design chapter 1
computer logic and digital design chapter 1
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
 
Advance data structure & algorithm
Advance data structure & algorithmAdvance data structure & algorithm
Advance data structure & algorithm
 
RecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect HashingRecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect Hashing
 
Practical data analysis with wine
Practical data analysis with winePractical data analysis with wine
Practical data analysis with wine
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
 

More from Andrii Gakhov

Let's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureAndrii Gakhov
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Andrii Gakhov
 
Pecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsAndrii Gakhov
 
Swagger / Quick Start Guide
Swagger / Quick Start GuideSwagger / Quick Start Guide
Swagger / Quick Start GuideAndrii Gakhov
 
API Days Berlin highlights
API Days Berlin highlightsAPI Days Berlin highlights
API Days Berlin highlightsAndrii Gakhov
 
ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcasesAndrii Gakhov
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014Andrii Gakhov
 
Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014Andrii Gakhov
 
Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014Andrii Gakhov
 
Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014Andrii Gakhov
 
Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014Andrii Gakhov
 
Decision Theory - lecture 1 (introduction)
Decision Theory - lecture 1 (introduction)Decision Theory - lecture 1 (introduction)
Decision Theory - lecture 1 (introduction)Andrii Gakhov
 
Data Mining - lecture 2 - 2014
Data Mining - lecture 2 - 2014Data Mining - lecture 2 - 2014
Data Mining - lecture 2 - 2014Andrii Gakhov
 
Data Mining - lecture 1 - 2014
Data Mining - lecture 1 - 2014Data Mining - lecture 1 - 2014
Data Mining - lecture 1 - 2014Andrii Gakhov
 
Buzzwords 2014 / Overview / part2
Buzzwords 2014 / Overview / part2Buzzwords 2014 / Overview / part2
Buzzwords 2014 / Overview / part2Andrii Gakhov
 
Buzzwords 2014 / Overview / part1
Buzzwords 2014 / Overview / part1Buzzwords 2014 / Overview / part1
Buzzwords 2014 / Overview / part1Andrii Gakhov
 

More from Andrii Gakhov (20)

Let's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architecture
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...
 
DNS Delegation
DNS DelegationDNS Delegation
DNS Delegation
 
Pecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food Traditions
 
Swagger / Quick Start Guide
Swagger / Quick Start GuideSwagger / Quick Start Guide
Swagger / Quick Start Guide
 
API Days Berlin highlights
API Days Berlin highlightsAPI Days Berlin highlights
API Days Berlin highlights
 
ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcases
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014
 
Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014
 
Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014
 
Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014
 
Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014
 
Decision Theory - lecture 1 (introduction)
Decision Theory - lecture 1 (introduction)Decision Theory - lecture 1 (introduction)
Decision Theory - lecture 1 (introduction)
 
Data Mining - lecture 2 - 2014
Data Mining - lecture 2 - 2014Data Mining - lecture 2 - 2014
Data Mining - lecture 2 - 2014
 
Data Mining - lecture 1 - 2014
Data Mining - lecture 1 - 2014Data Mining - lecture 1 - 2014
Data Mining - lecture 1 - 2014
 
Buzzwords 2014 / Overview / part2
Buzzwords 2014 / Overview / part2Buzzwords 2014 / Overview / part2
Buzzwords 2014 / Overview / part2
 
Buzzwords 2014 / Overview / part1
Buzzwords 2014 / Overview / part1Buzzwords 2014 / Overview / part1
Buzzwords 2014 / Overview / part1
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Lean analytics
Lean analyticsLean analytics
Lean analytics
 

Recently uploaded

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dashnarutouzumaki53779
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Recently uploaded (20)

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dash
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Probabilistic data structures. Part 3. Frequency

  • 1. tech talk @ ferret Andrii Gakhov PROBABILISTIC DATA STRUCTURES ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK PART 3: FREQUENCY
  • 2. FREQUENCY Agenda: ▸ Count-Min Sketch ▸ Majority Algorithm ▸ Misra-Gries Algorithm
  • 3. • To estimate number of times an element occurs in a set THE PROBLEM
  • 5. COUNT-MIN SKETCH • proposed by G. Cormode and S. Muthukrishnan in 2003 • CM Sketch is a sublinear space data structure that supports: • add element to the structure • count number of times the element has been added (frequency) • Count-Min Sketch is described by 2 parameters: • m - number of buckets 
 (independent from n, but much smaller) • k - number of different hash functions, that map to 1…m
 (usually, k is much smaller than m) • required fixed space: m*k counters and k hash functions
  • 6. COUNT-MIN SKETCH: ALGORITHM • Count-Min Sketch is simply an matrix of counters (initially all 0), where each row corresponds to a hash function hi, i=1…k • To add element into the sketch - calculate all k hash functions increment counters in positions [i, hi(element)], i=1…k +1 +1 +1 +1 +1 h1 h2 hk 1 2 3 m x h1(x) h2(x) hk(x) … … …
  • 7. COUNT-MIN SKETCH: ALGORITHM • Because of soft collisions, we have k estimations of the true frequency of the element, but because we never decrement counter it can only overestimate, not underestimate. • To get frequency of the element we calculate all k hash functions and return the minimal value of the counters in positions [i, hi(element)], i=1…k. • Time needed to add element or return its frequency is a fixed constant O(k), assuming that every hash function can be evaluated in a constant time.
  • 8. COUNT-MIN SKETCH: EXAMPLE • Consider Count-Min Sketch with 16 columns (m=16) • Consider 2 hash functions (k=2):
 h1 is MurmurHash3 and h2 is Fowler-Noll-Vo
 (to calculate the appropriate index, we divide result by mod 16) • Add element to the structure: “berlin”
 h1(“berlin”) = 4, h2(“berlin”) = 12 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 01 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 01 2 0
  • 9. COUNT-MIN SKETCH: EXAMPLE • Add element “berlin” 5 more times (so, 6 in total): • Add element “bernau”: h1(“bernau”) = 4, h2(“bernau”) = 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 01 2 0 • Add element “paris”: h1(“paris”) = 11, h2(“paris”) = 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 1 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 01 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 2 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 7 0 0 0 0 0 0 1 0 0 0 01 2 0
  • 10. COUNT-MIN SKETCH: EXAMPLE • Get frequency for element: “london”:
 h1(“london”) = 7, h2(“london”) = 4
 freq(“london”) = min (0, 2) = 0 • Get frequency for element: “berlin”:
 h1(“berlin”) = 4, h2(“berlin”) = 12
 freq(“berlin”) = min (7, 6) = 6 • Get frequency for element: “warsaw”:
 h1(“warsaw”) = 4, h2(“warsaw”) = 12
 freq(“warsaw”) = min (7, 6) = 6 !!! due to hash collision 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 2 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 7 0 0 0 0 0 0 1 0 0 0 01 2 0
  • 11. COUNT-MIN SKETCH: PROPERTIES • Count-Min Sketch only returns overestimates of true frequency counts, never underestimates • To achieve a target error probability of δ, we need k ≥ ln 1/δ. • for δ around 1%, k = 5 is good enough • Count-Min Sketch is essentially the same data structure as the Counting Bloom filters. 
 The difference is followed from the usage: • Count-Min Sketch has a sublinear number of cells, related to the desired approximation quality of the sketch • Counting Bloom filter is sized to match the number of elements in the set
  • 12. COUNT-MIN SKETCH: APPLICATIONS • AT&T has used Count-Min Sketch in network switches to perform analyses on network traffic using limited memory • At Google, a precursor of the count-min sketch (called the “count sketch”) has been implemented on top of their MapReduce parallel processing infrastructure • Implemented as a part of Algebird library from Twitter
  • 13. COUNT-MIN SKETCH: PYTHON • https://github.com/rafacarrascosa/countminsketch
 CountMinSketch is a minimalistic Count-min Sketch in pure Python • https://github.com/farsightsec/fsisketch
 FSI Sketch a disk-backed implementation of the Count- Min Sketch algorithm
  • 14. COUNT-MIN SKETCH: READ MORE • http://dimacs.rutgers.edu/~graham/pubs/papers/cm- latin.pdf • http://dimacs.rutgers.edu/~graham/pubs/papers/ cmencyc.pdf • http://dimacs.rutgers.edu/~graham/pubs/papers/ cmsoft.pdf • http://theory.stanford.edu/~tim/s15/l/l2.pdf • http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/ Notes/lecnotes.pdf
  • 15. • To find a single element that occurs strictly more than n / 2 times n - number of elements in a set MAJORITY PROBLEM
  • 16. MAJORITY PROBLEM • Mostly a toy problem, but it let us better understand the frequency problems for data streams • Majority problem has been formulated as a research problem by J Strother Moore in the Journal of Algorithms in 1981. • It’s possible that such element does not exists • It could be at most one majority element in the set • If such element exists, it will be equal to the median. So, the Naïve linear-time solution - compute the median of the set • Problem: it requires multiple pass through the stream
  • 17. MAJORITY PROBLEM • The Boyer-Moore Majority Vote Algorithm has been invented by Bob Boyer and J Strother Moore in 1980 to solve the Majority Problem in a single pass through the data stream. • The similar solution was independently proposed by Michael J. Fischer and Steven L. Salzberg in 1982.This is the most popular algorithm for undergraduate classes due to its simplicity. • An important pre-requirement is that the majority element actually exists, without it the output of the algorithm will be an arbitrary element of the data stream. • Algorithm requires only 1 left-to-right pass! • The Data structure for the Majority Algorithm just a pair of an integer counter and monitored element current
  • 18. MAJORITY ALGORITHM: ALGORITHM • Initialise counter with 1 and current with the first element from the left • Going from left to right: • If counter equals to 0, then take the current element as the current and set counter to 1 • if counter isn’t 0, then increase counter by 1 if the element equals to current or decrease counter by 1 otherwise • The last current is the majority element (if counter bigger that 0) Intuition: each element that contains a non-majority-value can only “cancel out” one copy of the majority value. Since more than n/2 of the elements contain the majority value, there is guaranteed to be a copy of it left standing at the end of the algorithm.
  • 19. MAJORITY ALGORITHM: EXAMPLE • Consider the following set of elements: {3,2,3,2,2,3,3,3} • Iterate from left to right and update counter: 3 2 3 2 2 3 3 3 counter: 1 current: 3 3 2 3 2 2 3 3 3 counter: 1 current: 3 3 2 3 2 2 3 3 3 counter: 0 current: 3 3 2 3 2 2 3 3 3 counter: 0 current: 3 3 2 3 2 2 3 3 3 counter: 1 current: 2 3 2 3 2 2 3 3 3 counter: 0 current: 2 3 2 3 2 2 3 3 3 counter: 1 current: 3 3 2 3 2 2 3 3 3 counter: 2 current: 3
  • 20. MAJORITY: READ MORE • https://courses.cs.washington.edu/courses/cse522/14sp/ lectures/lect04.pdf • http://theory.stanford.edu/~tim/s16/l/l2.pdf
  • 21. • To find elements that occurs more than n / k times (n >> k) * also known as k-frequency-estimation HEAVY HITTERS PROBLEM
  • 22. • There are not more than k such values • The majority problem is a particular case of the heavy hitters problem • k ≈ 2 − δ for a small value δ > 0 • with the additional promise that a majority element exists • Theorem: There is no algorithm that solves the heavy hitters problems in one pass while using a sublinear amount of auxiliary space • ε-approximate heavy hitters (ε-HH) problem: • every value that occurs at least n/k times is in the list • every value in the list occurs at least n/k − εn times in the set • ε-HH can be solved with Count-Min Sketch as well, but we consider Misra-Gries / Frequent algorithm here … HEAVY HITTERS PROBLEM
  • 24. MISRA-GRIES / FREQUENT ALGORITHM • A generalisation of the Majority Algorithm to track multiple frequent items, known as Frequent Algorithm, was proposed by Erik D. Demaine, et al. in 2002. • After some time it was discovered that the Frequent algorithm actually the same as the algorithm that was published by Jayadev Misra and David Gries in 1982, known now as Misra-Gries Algorithm • The trick is to run the Majority algorithm, but with many counters instead of 1 • The time cost of the algorithm is dominated by the O(1) dictionary operations per update, and the cost of decrementing counts
  • 25. MISRA-GRIES: ALGORITHM The Data structure for Misra-Gries algorithm consists of 2 arrays: • an array of k-1 frequency counters C (all 0) and k-1 locations X* (empty set) For every element xi in the set: • If xi is already in X* at some index j: • increase its corresponding frequency counter by 1: C[j] = C[j] + 1 • If xi not in X*: • If size of X* less than k (so, there is free space in the top): • append xi to X* at index j • set the corresponding frequency counter C[j] =1 • else If X* is already full: • decrement all frequency counters by 1: C[j] = C[j] - 1, j=1..k-1 • remove such elements from X* whose counters are 0. Top k-1 frequent elements : X*[j] with frequency estimations C[j], j=1..k-1
  • 26. MISRA-GRIES:EXAMPLE 0 0C: X*: • Step 1: {3,2,1,2,2,3,1,3}
 Element {3} isn’t in X* already and X* isn’t full, so we add it at position 0 and set C[0] = 1 • Step 2: {3,2,1,2,2,3,1,3}
 Element {2} isn’t in X* already and X* isn’t full, so we add it at position 1 and set C[1] = 1 1 0C: X*: 3 • Consider the following set of elements: {3,2,1,2,2,3,1,3}, n= 8 • We need to find top 2 elements in the set that appears at least n/3 times. 1 1 2 C: X*: 3
  • 27. MISRA-GRIES:EXAMPLE • Step 3: {3,2,1,2,2,3,1,3}
 Element {1} isn’t in X* already, but X* is full, so decrement all counters and remove elements that have counters less then 1: • Step 4: {3,2,1,2,2,3,1,3}
 Element {2} isn’t in X* already and X* isn’t full, so we add it at position 0 and set C[0] = 1 0 0C: X*: 1 0C: X*: 2 • Step 5: {3,2,1,2,2,3,1,3}
 Element {2} is in X* at position 0, so we increase its counter C[0] = C[0] + 1 = 2 2 0C: X*: 2
  • 28. MISRA-GRIES:EXAMPLE • Step 6: {3,2,1,2,2,3,1,3}
 Element {3} isn’t in X* already and X* isn’t full, so we append {3} at position 1 and set its counter to 1: C[1] = 1 • Step 7: {3,2,1,2,2,3,1,3}
 Element {1} isn’t in X* already, but X* is full, so decrement all counters and remove elements that have 0 as the counters values: 2 1 3 C: X*: 2 1 0C: X*: 2 • Step 8: {3,2,1,2,2,3,1,3}
 Element {3} isn’t in X* already and X* isn’t full, so we append {3} at position 1 and set its counter to 1: C[1] = 1 2 1 3 C: X*: 2 Top 2 elements
  • 29. MISRA-GRIES: PROPERTIES • The algorithm identifies at most k-1 candidates without any probabilistic approach. • It's still open question how to process such updates quickly, or particularly, how to decrement and release several counters simultaneously. • For instance, it was proposed to use doubly linked list of counters and store only the differences between same-value counter’s groups and use. With such data structure each counter no longer needs to store a value, but rather its group and its monitored element.
  • 30. MISRA-GRIES: READ MORE • https://courses.cs.washington.edu/courses/cse522/14sp/ lectures/lect04.pdf • http://dimacs.rutgers.edu/~graham/pubs/papers/ encalgs-mg.pdf • http://drops.dagstuhl.de/opus/volltexte/2016/5773/pdf/ 3.pdf • http://theory.stanford.edu/~tim/s16/l/l2.pdf
  • 31. ▸ @gakhov ▸ linkedin.com/in/gakhov ▸ www.datacrucis.com THANK YOU