SlideShare a Scribd company logo
tech talk @ ferret
Andrii Gakhov
PROBABILISTIC DATA STRUCTURES
ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK
PART 3: FREQUENCY
FREQUENCY
Agenda:
▸ Count-Min Sketch
▸ Majority Algorithm
▸ Misra-Gries Algorithm
• To estimate number of times an element
occurs in a set
THE PROBLEM
COUNT-MIN SKETCH
COUNT-MIN SKETCH
• proposed by G. Cormode and S. Muthukrishnan in 2003
• CM Sketch is a sublinear space data structure that supports:
• add element to the structure
• count number of times the element has been added
(frequency)
• Count-Min Sketch is described by 2 parameters:
• m - number of buckets 

(independent from n, but much smaller)
• k - number of different hash functions, that map to 1…m

(usually, k is much smaller than m)
• required fixed space: m*k counters and k hash functions
COUNT-MIN SKETCH: ALGORITHM
• Count-Min Sketch is simply an matrix of counters (initially all 0), where each row
corresponds to a hash function hi, i=1…k
• To add element into the sketch - calculate all k hash functions increment
counters in positions [i, hi(element)], i=1…k
+1
+1
+1
+1
+1
h1
h2
hk
1 2 3 m
x
h1(x)
h2(x)
hk(x)
… …
…
COUNT-MIN SKETCH: ALGORITHM
• Because of soft collisions, we have k estimations of the true
frequency of the element, but because we never decrement
counter it can only overestimate, not underestimate.
• To get frequency of the element we calculate all k hash
functions and return the minimal value of the counters in
positions [i, hi(element)], i=1…k.
• Time needed to add element or return its frequency is a fixed
constant O(k), assuming that every hash function can be
evaluated in a constant time.
COUNT-MIN SKETCH: EXAMPLE
• Consider Count-Min Sketch with 16 columns (m=16)
• Consider 2 hash functions (k=2):

h1 is MurmurHash3 and h2 is Fowler-Noll-Vo

(to calculate the appropriate index, we divide result by mod 16)
• Add element to the structure: “berlin”

h1(“berlin”) = 4, h2(“berlin”) = 12
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 01
2
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 01
2
0
COUNT-MIN SKETCH: EXAMPLE
• Add element “berlin” 5 more times (so, 6 in total):
• Add element “bernau”: h1(“bernau”) = 4, h2(“bernau”) = 4
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0
0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 01
2
0
• Add element “paris”: h1(“paris”) = 11, h2(“paris”) = 4
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 1 0 0 0 0 0 0 0 6 0 0 0
0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 01
2
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 2 0 0 0 0 0 0 0 6 0 0 0
0 0 0 0 7 0 0 0 0 0 0 1 0 0 0 01
2
0
COUNT-MIN SKETCH: EXAMPLE
• Get frequency for element: “london”:

h1(“london”) = 7, h2(“london”) = 4

freq(“london”) = min (0, 2) = 0
• Get frequency for element: “berlin”:

h1(“berlin”) = 4, h2(“berlin”) = 12

freq(“berlin”) = min (7, 6) = 6
• Get frequency for element: “warsaw”:

h1(“warsaw”) = 4, h2(“warsaw”) = 12

freq(“warsaw”) = min (7, 6) = 6 !!! due to hash collision
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 2 0 0 0 0 0 0 0 6 0 0 0
0 0 0 0 7 0 0 0 0 0 0 1 0 0 0 01
2
0
COUNT-MIN SKETCH: PROPERTIES
• Count-Min Sketch only returns overestimates of true frequency
counts, never underestimates
• To achieve a target error probability of δ, we need k ≥ ln 1/δ.
• for δ around 1%, k = 5 is good enough
• Count-Min Sketch is essentially the same data structure as the
Counting Bloom filters. 

The difference is followed from the usage:
• Count-Min Sketch has a sublinear number of cells, related to the desired
approximation quality of the sketch
• Counting Bloom filter is sized to match the number of elements in the set
COUNT-MIN SKETCH: APPLICATIONS
• AT&T has used Count-Min Sketch in network switches to
perform analyses on network traffic using limited memory
• At Google, a precursor of the count-min sketch (called the
“count sketch”) has been implemented on top of their
MapReduce parallel processing infrastructure
• Implemented as a part of Algebird library from Twitter
COUNT-MIN SKETCH: PYTHON
• https://github.com/rafacarrascosa/countminsketch

CountMinSketch is a minimalistic Count-min Sketch in
pure Python
• https://github.com/farsightsec/fsisketch

FSI Sketch a disk-backed implementation of the Count-
Min Sketch algorithm
COUNT-MIN SKETCH: READ MORE
• http://dimacs.rutgers.edu/~graham/pubs/papers/cm-
latin.pdf
• http://dimacs.rutgers.edu/~graham/pubs/papers/
cmencyc.pdf
• http://dimacs.rutgers.edu/~graham/pubs/papers/
cmsoft.pdf
• http://theory.stanford.edu/~tim/s15/l/l2.pdf
• http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/
Notes/lecnotes.pdf
• To find a single element that occurs strictly
more than n / 2 times
n - number of elements in a set
MAJORITY PROBLEM
MAJORITY PROBLEM
• Mostly a toy problem, but it let us better understand the
frequency problems for data streams
• Majority problem has been formulated as a research problem
by J Strother Moore in the Journal of Algorithms in 1981.
• It’s possible that such element does not exists
• It could be at most one majority element in the set
• If such element exists, it will be equal to the median. So, the
Naïve linear-time solution - compute the median of the set
• Problem: it requires multiple pass through the stream
MAJORITY PROBLEM
• The Boyer-Moore Majority Vote Algorithm has been invented
by Bob Boyer and J Strother Moore in 1980 to solve the Majority
Problem in a single pass through the data stream.
• The similar solution was independently proposed by Michael J. Fischer and
Steven L. Salzberg in 1982.This is the most popular algorithm for
undergraduate classes due to its simplicity.
• An important pre-requirement is that the majority element
actually exists, without it the output of the algorithm will be an
arbitrary element of the data stream.
• Algorithm requires only 1 left-to-right pass!
• The Data structure for the Majority Algorithm just a pair of an
integer counter and monitored element current
MAJORITY ALGORITHM: ALGORITHM
• Initialise counter with 1 and current with the first element from
the left
• Going from left to right:
• If counter equals to 0, then take the current element as the
current and set counter to 1
• if counter isn’t 0, then increase counter by 1 if the element
equals to current or decrease counter by 1 otherwise
• The last current is the majority element (if counter bigger that 0)
Intuition: each element that contains a non-majority-value can only “cancel out” one copy
of the majority value. Since more than n/2 of the elements contain the majority value, there
is guaranteed to be a copy of it left standing at the end of the algorithm.
MAJORITY ALGORITHM: EXAMPLE
• Consider the following set of elements: {3,2,3,2,2,3,3,3}
• Iterate from left to right and update counter:
3 2 3 2 2 3 3 3
counter: 1 current: 3
3 2 3 2 2 3 3 3
counter: 1 current: 3
3 2 3 2 2 3 3 3
counter: 0 current: 3
3 2 3 2 2 3 3 3
counter: 0 current: 3
3 2 3 2 2 3 3 3
counter: 1 current: 2
3 2 3 2 2 3 3 3
counter: 0 current: 2
3 2 3 2 2 3 3 3
counter: 1 current: 3
3 2 3 2 2 3 3 3
counter: 2 current: 3
MAJORITY: READ MORE
• https://courses.cs.washington.edu/courses/cse522/14sp/
lectures/lect04.pdf
• http://theory.stanford.edu/~tim/s16/l/l2.pdf
• To find elements that occurs more than n / k
times (n >> k)
* also known as k-frequency-estimation
HEAVY HITTERS PROBLEM
• There are not more than k such values
• The majority problem is a particular case of the heavy hitters
problem
• k ≈ 2 − δ for a small value δ > 0
• with the additional promise that a majority element exists
• Theorem: There is no algorithm that solves the heavy hitters problems in one pass
while using a sublinear amount of auxiliary space
• ε-approximate heavy hitters (ε-HH) problem:
• every value that occurs at least n/k times is in the list
• every value in the list occurs at least n/k − εn times in the set
• ε-HH can be solved with Count-Min Sketch as well, but we
consider Misra-Gries / Frequent algorithm here …
HEAVY HITTERS PROBLEM
MISRA-GRIES SUMMARY
MISRA-GRIES / FREQUENT ALGORITHM
• A generalisation of the Majority Algorithm to track multiple
frequent items, known as Frequent Algorithm, was proposed
by Erik D. Demaine, et al. in 2002.
• After some time it was discovered that the Frequent algorithm actually the
same as the algorithm that was published by Jayadev Misra and David Gries
in 1982, known now as Misra-Gries Algorithm
• The trick is to run the Majority algorithm, but with many
counters instead of 1
• The time cost of the algorithm is dominated by the O(1)
dictionary operations per update, and the cost of decrementing
counts
MISRA-GRIES: ALGORITHM
The Data structure for Misra-Gries algorithm consists of 2 arrays:
• an array of k-1 frequency counters C (all 0) and k-1 locations X* (empty set)
For every element xi in the set:
• If xi is already in X* at some index j:
• increase its corresponding frequency counter by 1: C[j] = C[j] + 1
• If xi not in X*:
• If size of X* less than k (so, there is free space in the top):
• append xi to X* at index j
• set the corresponding frequency counter C[j] =1
• else If X* is already full:
• decrement all frequency counters by 1: C[j] = C[j] - 1, j=1..k-1
• remove such elements from X* whose counters are 0.
Top k-1 frequent elements : X*[j] with frequency estimations C[j], j=1..k-1
MISRA-GRIES:EXAMPLE
0 0C:
X*:
• Step 1: {3,2,1,2,2,3,1,3}

Element {3} isn’t in X* already and X* isn’t full, so we add it at position 0 and set C[0] = 1
• Step 2: {3,2,1,2,2,3,1,3}

Element {2} isn’t in X* already and X* isn’t full, so we add it at position 1 and set C[1] = 1
1 0C:
X*: 3
• Consider the following set of elements: {3,2,1,2,2,3,1,3}, n= 8
• We need to find top 2 elements in the set that appears at least n/3 times.
1 1
2
C:
X*: 3
MISRA-GRIES:EXAMPLE
• Step 3: {3,2,1,2,2,3,1,3}

Element {1} isn’t in X* already, but X* is full, so decrement all counters and remove
elements that have counters less then 1:
• Step 4: {3,2,1,2,2,3,1,3}

Element {2} isn’t in X* already and X* isn’t full, so we add it at position 0 and set C[0] = 1
0 0C:
X*:
1 0C:
X*: 2
• Step 5: {3,2,1,2,2,3,1,3}

Element {2} is in X* at position 0, so we increase its counter C[0] = C[0] + 1 = 2
2 0C:
X*: 2
MISRA-GRIES:EXAMPLE
• Step 6: {3,2,1,2,2,3,1,3}

Element {3} isn’t in X* already and X* isn’t full, so we append {3} at position 1 and set
its counter to 1: C[1] = 1
• Step 7: {3,2,1,2,2,3,1,3}

Element {1} isn’t in X* already, but X* is full, so decrement all counters and remove
elements that have 0 as the counters values:
2 1
3
C:
X*: 2
1 0C:
X*: 2
• Step 8: {3,2,1,2,2,3,1,3}

Element {3} isn’t in X* already and X* isn’t full, so we append {3} at position 1 and set
its counter to 1: C[1] = 1
2 1
3
C:
X*: 2
Top 2 elements
MISRA-GRIES: PROPERTIES
• The algorithm identifies at most k-1 candidates without
any probabilistic approach.
• It's still open question how to process such updates
quickly, or particularly, how to decrement and release
several counters simultaneously.
• For instance, it was proposed to use doubly linked list of counters and
store only the differences between same-value counter’s groups and use.
With such data structure each counter no longer needs to store a value,
but rather its group and its monitored element.
MISRA-GRIES: READ MORE
• https://courses.cs.washington.edu/courses/cse522/14sp/
lectures/lect04.pdf
• http://dimacs.rutgers.edu/~graham/pubs/papers/
encalgs-mg.pdf
• http://drops.dagstuhl.de/opus/volltexte/2016/5773/pdf/
3.pdf
• http://theory.stanford.edu/~tim/s16/l/l2.pdf
▸ @gakhov
▸ linkedin.com/in/gakhov
▸ www.datacrucis.com
THANK YOU

More Related Content

What's hot

Optimization in Deep Learning
Optimization in Deep LearningOptimization in Deep Learning
Optimization in Deep Learning
Yan Xu
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
ESCOM
 
Lisp
LispLisp
INTRODUCTION TO LISP
INTRODUCTION TO LISPINTRODUCTION TO LISP
INTRODUCTION TO LISP
Nilt1234
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
Sri Ambati
 
5 csp
5 csp5 csp
5 csp
Mhd Sb
 
Circuit complexity
Circuit complexityCircuit complexity
Circuit complexity
Ami Prakash
 
Bloom filters
Bloom filtersBloom filters
Bloom filters
Devesh Maru
 
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Edureka!
 
Geospatial Options in Apache Spark
Geospatial Options in Apache SparkGeospatial Options in Apache Spark
Geospatial Options in Apache Spark
Databricks
 
P, NP, NP-Complete, and NP-Hard
P, NP, NP-Complete, and NP-HardP, NP, NP-Complete, and NP-Hard
P, NP, NP-Complete, and NP-Hard
Animesh Chaturvedi
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
Joonyoung Yi
 
Regular expressions and languages pdf
Regular expressions and languages pdfRegular expressions and languages pdf
Regular expressions and languages pdf
Dilouar Hossain
 
Two Pseudo-random Number Generators, an Overview
Two Pseudo-random Number Generators, an Overview Two Pseudo-random Number Generators, an Overview
Two Pseudo-random Number Generators, an Overview
Kato Mivule
 
Transformer xl
Transformer xlTransformer xl
Transformer xl
San Kim
 
Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)
SungminYou
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
KU Leuven
 
5.2 primitive recursive functions
5.2 primitive recursive functions5.2 primitive recursive functions
5.2 primitive recursive functions
Sampath Kumar S
 
NP completeness
NP completenessNP completeness
NP completeness
Amrinder Arora
 
Computational Complexity: Introduction-Turing Machines-Undecidability
Computational Complexity: Introduction-Turing Machines-UndecidabilityComputational Complexity: Introduction-Turing Machines-Undecidability
Computational Complexity: Introduction-Turing Machines-Undecidability
Antonis Antonopoulos
 

What's hot (20)

Optimization in Deep Learning
Optimization in Deep LearningOptimization in Deep Learning
Optimization in Deep Learning
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
 
Lisp
LispLisp
Lisp
 
INTRODUCTION TO LISP
INTRODUCTION TO LISPINTRODUCTION TO LISP
INTRODUCTION TO LISP
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
5 csp
5 csp5 csp
5 csp
 
Circuit complexity
Circuit complexityCircuit complexity
Circuit complexity
 
Bloom filters
Bloom filtersBloom filters
Bloom filters
 
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
 
Geospatial Options in Apache Spark
Geospatial Options in Apache SparkGeospatial Options in Apache Spark
Geospatial Options in Apache Spark
 
P, NP, NP-Complete, and NP-Hard
P, NP, NP-Complete, and NP-HardP, NP, NP-Complete, and NP-Hard
P, NP, NP-Complete, and NP-Hard
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
 
Regular expressions and languages pdf
Regular expressions and languages pdfRegular expressions and languages pdf
Regular expressions and languages pdf
 
Two Pseudo-random Number Generators, an Overview
Two Pseudo-random Number Generators, an Overview Two Pseudo-random Number Generators, an Overview
Two Pseudo-random Number Generators, an Overview
 
Transformer xl
Transformer xlTransformer xl
Transformer xl
 
Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
5.2 primitive recursive functions
5.2 primitive recursive functions5.2 primitive recursive functions
5.2 primitive recursive functions
 
NP completeness
NP completenessNP completeness
NP completeness
 
Computational Complexity: Introduction-Turing Machines-Undecidability
Computational Complexity: Introduction-Turing Machines-UndecidabilityComputational Complexity: Introduction-Turing Machines-Undecidability
Computational Complexity: Introduction-Turing Machines-Undecidability
 

Viewers also liked

Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. Similarity
Andrii Gakhov
 
Bloom filter
Bloom filterBloom filter
Bloom filter
feng lee
 
14 Skip Lists
14 Skip Lists14 Skip Lists
14 Skip Lists
Andres Mendez-Vazquez
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данных
Andrii Gakhov
 
Big Data with Semantics - StampedeCon 2012
Big Data with Semantics - StampedeCon 2012Big Data with Semantics - StampedeCon 2012
Big Data with Semantics - StampedeCon 2012
StampedeCon
 
Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013
StampedeCon
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and Lua
Andrii Gakhov
 
Bloom filter
Bloom filterBloom filter
Bloom filter
wang ping
 
Bloom filter
Bloom filterBloom filter
Bloom filter
Hamid Feizabadi
 
skip list
skip listskip list
skip list
iammutex
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
Andrii Gakhov
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
Andrii Gakhov
 
Cuckoo Optimization ppt
Cuckoo Optimization pptCuckoo Optimization ppt
Cuckoo Optimization ppt
Anuja Joshi
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014
Andrii Gakhov
 
Cuckoo search algorithm
Cuckoo search algorithmCuckoo search algorithm
Cuckoo search algorithm
Ahmed Fouad Ali
 

Viewers also liked (15)

Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. Similarity
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
14 Skip Lists
14 Skip Lists14 Skip Lists
14 Skip Lists
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данных
 
Big Data with Semantics - StampedeCon 2012
Big Data with Semantics - StampedeCon 2012Big Data with Semantics - StampedeCon 2012
Big Data with Semantics - StampedeCon 2012
 
Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and Lua
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
skip list
skip listskip list
skip list
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
 
Cuckoo Optimization ppt
Cuckoo Optimization pptCuckoo Optimization ppt
Cuckoo Optimization ppt
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014
 
Cuckoo search algorithm
Cuckoo search algorithmCuckoo search algorithm
Cuckoo search algorithm
 

Similar to Probabilistic data structures. Part 3. Frequency

streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
GopiNathVelivela
 
Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++
Satalia
 
Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structure
Thinh Dang
 
hash
 hash hash
hash
tim4911
 
Simple, fast, and scalable torch7 tutorial
Simple, fast, and scalable torch7 tutorialSimple, fast, and scalable torch7 tutorial
Simple, fast, and scalable torch7 tutorial
Jin-Hwa Kim
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
eXascale Infolab
 
Data Structures- Hashing
Data Structures- Hashing Data Structures- Hashing
Data Structures- Hashing
hemalatha athinarayanan
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
Sandeep Joshi
 
Chapter1
Chapter1Chapter1
Chapter1
Kiran kumar
 
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
GeeksLab Odessa
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
StreamNative
 
Unit 5 Streams2.pptx
Unit 5 Streams2.pptxUnit 5 Streams2.pptx
Unit 5 Streams2.pptx
SonaliAjankar
 
lecture 23
lecture 23lecture 23
lecture 23
sajinsc
 
Chapter1.ppt
Chapter1.pptChapter1.ppt
Chapter1.ppt
PravinPrajapati53
 
computer logic and digital design chapter 1
computer logic and digital design chapter 1computer logic and digital design chapter 1
computer logic and digital design chapter 1
tendaisigauke3
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Databricks
 
Advance data structure & algorithm
Advance data structure & algorithmAdvance data structure & algorithm
Advance data structure & algorithm
K Hari Shankar
 
RecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect HashingRecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect Hashing
Thomas Mueller
 
Practical data analysis with wine
Practical data analysis with winePractical data analysis with wine
Practical data analysis with wine
TOSHI STATS Co.,Ltd.
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Andrii Gakhov
 

Similar to Probabilistic data structures. Part 3. Frequency (20)

streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
 
Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++
 
Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structure
 
hash
 hash hash
hash
 
Simple, fast, and scalable torch7 tutorial
Simple, fast, and scalable torch7 tutorialSimple, fast, and scalable torch7 tutorial
Simple, fast, and scalable torch7 tutorial
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
Data Structures- Hashing
Data Structures- Hashing Data Structures- Hashing
Data Structures- Hashing
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 
Chapter1
Chapter1Chapter1
Chapter1
 
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
 
Unit 5 Streams2.pptx
Unit 5 Streams2.pptxUnit 5 Streams2.pptx
Unit 5 Streams2.pptx
 
lecture 23
lecture 23lecture 23
lecture 23
 
Chapter1.ppt
Chapter1.pptChapter1.ppt
Chapter1.ppt
 
computer logic and digital design chapter 1
computer logic and digital design chapter 1computer logic and digital design chapter 1
computer logic and digital design chapter 1
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
 
Advance data structure & algorithm
Advance data structure & algorithmAdvance data structure & algorithm
Advance data structure & algorithm
 
RecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect HashingRecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect Hashing
 
Practical data analysis with wine
Practical data analysis with winePractical data analysis with wine
Practical data analysis with wine
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
 

More from Andrii Gakhov

Let's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architecture
Andrii Gakhov
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...
Andrii Gakhov
 
DNS Delegation
DNS DelegationDNS Delegation
DNS Delegation
Andrii Gakhov
 
Pecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food Traditions
Andrii Gakhov
 
Swagger / Quick Start Guide
Swagger / Quick Start GuideSwagger / Quick Start Guide
Swagger / Quick Start Guide
Andrii Gakhov
 
API Days Berlin highlights
API Days Berlin highlightsAPI Days Berlin highlights
API Days Berlin highlights
Andrii Gakhov
 
ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcases
Andrii Gakhov
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov
 
Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014
Andrii Gakhov
 
Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014
Andrii Gakhov
 
Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014
Andrii Gakhov
 
Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014
Andrii Gakhov
 
Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014
Andrii Gakhov
 
Decision Theory - lecture 1 (introduction)
Decision Theory - lecture 1 (introduction)Decision Theory - lecture 1 (introduction)
Decision Theory - lecture 1 (introduction)
Andrii Gakhov
 
Data Mining - lecture 2 - 2014
Data Mining - lecture 2 - 2014Data Mining - lecture 2 - 2014
Data Mining - lecture 2 - 2014
Andrii Gakhov
 
Data Mining - lecture 1 - 2014
Data Mining - lecture 1 - 2014Data Mining - lecture 1 - 2014
Data Mining - lecture 1 - 2014
Andrii Gakhov
 
Buzzwords 2014 / Overview / part2
Buzzwords 2014 / Overview / part2Buzzwords 2014 / Overview / part2
Buzzwords 2014 / Overview / part2
Andrii Gakhov
 
Buzzwords 2014 / Overview / part1
Buzzwords 2014 / Overview / part1Buzzwords 2014 / Overview / part1
Buzzwords 2014 / Overview / part1
Andrii Gakhov
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
Andrii Gakhov
 
Lean analytics
Lean analyticsLean analytics
Lean analytics
Andrii Gakhov
 

More from Andrii Gakhov (20)

Let's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architecture
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...
 
DNS Delegation
DNS DelegationDNS Delegation
DNS Delegation
 
Pecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food Traditions
 
Swagger / Quick Start Guide
Swagger / Quick Start GuideSwagger / Quick Start Guide
Swagger / Quick Start Guide
 
API Days Berlin highlights
API Days Berlin highlightsAPI Days Berlin highlights
API Days Berlin highlights
 
ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcases
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014
 
Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014
 
Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014
 
Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014
 
Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014Data Mining - lecture 3 - 2014
Data Mining - lecture 3 - 2014
 
Decision Theory - lecture 1 (introduction)
Decision Theory - lecture 1 (introduction)Decision Theory - lecture 1 (introduction)
Decision Theory - lecture 1 (introduction)
 
Data Mining - lecture 2 - 2014
Data Mining - lecture 2 - 2014Data Mining - lecture 2 - 2014
Data Mining - lecture 2 - 2014
 
Data Mining - lecture 1 - 2014
Data Mining - lecture 1 - 2014Data Mining - lecture 1 - 2014
Data Mining - lecture 1 - 2014
 
Buzzwords 2014 / Overview / part2
Buzzwords 2014 / Overview / part2Buzzwords 2014 / Overview / part2
Buzzwords 2014 / Overview / part2
 
Buzzwords 2014 / Overview / part1
Buzzwords 2014 / Overview / part1Buzzwords 2014 / Overview / part1
Buzzwords 2014 / Overview / part1
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Lean analytics
Lean analyticsLean analytics
Lean analytics
 

Recently uploaded

20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 

Recently uploaded (20)

20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 

Probabilistic data structures. Part 3. Frequency

  • 1. tech talk @ ferret Andrii Gakhov PROBABILISTIC DATA STRUCTURES ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK PART 3: FREQUENCY
  • 2. FREQUENCY Agenda: ▸ Count-Min Sketch ▸ Majority Algorithm ▸ Misra-Gries Algorithm
  • 3. • To estimate number of times an element occurs in a set THE PROBLEM
  • 5. COUNT-MIN SKETCH • proposed by G. Cormode and S. Muthukrishnan in 2003 • CM Sketch is a sublinear space data structure that supports: • add element to the structure • count number of times the element has been added (frequency) • Count-Min Sketch is described by 2 parameters: • m - number of buckets 
 (independent from n, but much smaller) • k - number of different hash functions, that map to 1…m
 (usually, k is much smaller than m) • required fixed space: m*k counters and k hash functions
  • 6. COUNT-MIN SKETCH: ALGORITHM • Count-Min Sketch is simply an matrix of counters (initially all 0), where each row corresponds to a hash function hi, i=1…k • To add element into the sketch - calculate all k hash functions increment counters in positions [i, hi(element)], i=1…k +1 +1 +1 +1 +1 h1 h2 hk 1 2 3 m x h1(x) h2(x) hk(x) … … …
  • 7. COUNT-MIN SKETCH: ALGORITHM • Because of soft collisions, we have k estimations of the true frequency of the element, but because we never decrement counter it can only overestimate, not underestimate. • To get frequency of the element we calculate all k hash functions and return the minimal value of the counters in positions [i, hi(element)], i=1…k. • Time needed to add element or return its frequency is a fixed constant O(k), assuming that every hash function can be evaluated in a constant time.
  • 8. COUNT-MIN SKETCH: EXAMPLE • Consider Count-Min Sketch with 16 columns (m=16) • Consider 2 hash functions (k=2):
 h1 is MurmurHash3 and h2 is Fowler-Noll-Vo
 (to calculate the appropriate index, we divide result by mod 16) • Add element to the structure: “berlin”
 h1(“berlin”) = 4, h2(“berlin”) = 12 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 01 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 01 2 0
  • 9. COUNT-MIN SKETCH: EXAMPLE • Add element “berlin” 5 more times (so, 6 in total): • Add element “bernau”: h1(“bernau”) = 4, h2(“bernau”) = 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 01 2 0 • Add element “paris”: h1(“paris”) = 11, h2(“paris”) = 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 1 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 01 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 2 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 7 0 0 0 0 0 0 1 0 0 0 01 2 0
  • 10. COUNT-MIN SKETCH: EXAMPLE • Get frequency for element: “london”:
 h1(“london”) = 7, h2(“london”) = 4
 freq(“london”) = min (0, 2) = 0 • Get frequency for element: “berlin”:
 h1(“berlin”) = 4, h2(“berlin”) = 12
 freq(“berlin”) = min (7, 6) = 6 • Get frequency for element: “warsaw”:
 h1(“warsaw”) = 4, h2(“warsaw”) = 12
 freq(“warsaw”) = min (7, 6) = 6 !!! due to hash collision 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 0 0 2 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 7 0 0 0 0 0 0 1 0 0 0 01 2 0
  • 11. COUNT-MIN SKETCH: PROPERTIES • Count-Min Sketch only returns overestimates of true frequency counts, never underestimates • To achieve a target error probability of δ, we need k ≥ ln 1/δ. • for δ around 1%, k = 5 is good enough • Count-Min Sketch is essentially the same data structure as the Counting Bloom filters. 
 The difference is followed from the usage: • Count-Min Sketch has a sublinear number of cells, related to the desired approximation quality of the sketch • Counting Bloom filter is sized to match the number of elements in the set
  • 12. COUNT-MIN SKETCH: APPLICATIONS • AT&T has used Count-Min Sketch in network switches to perform analyses on network traffic using limited memory • At Google, a precursor of the count-min sketch (called the “count sketch”) has been implemented on top of their MapReduce parallel processing infrastructure • Implemented as a part of Algebird library from Twitter
  • 13. COUNT-MIN SKETCH: PYTHON • https://github.com/rafacarrascosa/countminsketch
 CountMinSketch is a minimalistic Count-min Sketch in pure Python • https://github.com/farsightsec/fsisketch
 FSI Sketch a disk-backed implementation of the Count- Min Sketch algorithm
  • 14. COUNT-MIN SKETCH: READ MORE • http://dimacs.rutgers.edu/~graham/pubs/papers/cm- latin.pdf • http://dimacs.rutgers.edu/~graham/pubs/papers/ cmencyc.pdf • http://dimacs.rutgers.edu/~graham/pubs/papers/ cmsoft.pdf • http://theory.stanford.edu/~tim/s15/l/l2.pdf • http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/ Notes/lecnotes.pdf
  • 15. • To find a single element that occurs strictly more than n / 2 times n - number of elements in a set MAJORITY PROBLEM
  • 16. MAJORITY PROBLEM • Mostly a toy problem, but it let us better understand the frequency problems for data streams • Majority problem has been formulated as a research problem by J Strother Moore in the Journal of Algorithms in 1981. • It’s possible that such element does not exists • It could be at most one majority element in the set • If such element exists, it will be equal to the median. So, the Naïve linear-time solution - compute the median of the set • Problem: it requires multiple pass through the stream
  • 17. MAJORITY PROBLEM • The Boyer-Moore Majority Vote Algorithm has been invented by Bob Boyer and J Strother Moore in 1980 to solve the Majority Problem in a single pass through the data stream. • The similar solution was independently proposed by Michael J. Fischer and Steven L. Salzberg in 1982.This is the most popular algorithm for undergraduate classes due to its simplicity. • An important pre-requirement is that the majority element actually exists, without it the output of the algorithm will be an arbitrary element of the data stream. • Algorithm requires only 1 left-to-right pass! • The Data structure for the Majority Algorithm just a pair of an integer counter and monitored element current
  • 18. MAJORITY ALGORITHM: ALGORITHM • Initialise counter with 1 and current with the first element from the left • Going from left to right: • If counter equals to 0, then take the current element as the current and set counter to 1 • if counter isn’t 0, then increase counter by 1 if the element equals to current or decrease counter by 1 otherwise • The last current is the majority element (if counter bigger that 0) Intuition: each element that contains a non-majority-value can only “cancel out” one copy of the majority value. Since more than n/2 of the elements contain the majority value, there is guaranteed to be a copy of it left standing at the end of the algorithm.
  • 19. MAJORITY ALGORITHM: EXAMPLE • Consider the following set of elements: {3,2,3,2,2,3,3,3} • Iterate from left to right and update counter: 3 2 3 2 2 3 3 3 counter: 1 current: 3 3 2 3 2 2 3 3 3 counter: 1 current: 3 3 2 3 2 2 3 3 3 counter: 0 current: 3 3 2 3 2 2 3 3 3 counter: 0 current: 3 3 2 3 2 2 3 3 3 counter: 1 current: 2 3 2 3 2 2 3 3 3 counter: 0 current: 2 3 2 3 2 2 3 3 3 counter: 1 current: 3 3 2 3 2 2 3 3 3 counter: 2 current: 3
  • 20. MAJORITY: READ MORE • https://courses.cs.washington.edu/courses/cse522/14sp/ lectures/lect04.pdf • http://theory.stanford.edu/~tim/s16/l/l2.pdf
  • 21. • To find elements that occurs more than n / k times (n >> k) * also known as k-frequency-estimation HEAVY HITTERS PROBLEM
  • 22. • There are not more than k such values • The majority problem is a particular case of the heavy hitters problem • k ≈ 2 − δ for a small value δ > 0 • with the additional promise that a majority element exists • Theorem: There is no algorithm that solves the heavy hitters problems in one pass while using a sublinear amount of auxiliary space • ε-approximate heavy hitters (ε-HH) problem: • every value that occurs at least n/k times is in the list • every value in the list occurs at least n/k − εn times in the set • ε-HH can be solved with Count-Min Sketch as well, but we consider Misra-Gries / Frequent algorithm here … HEAVY HITTERS PROBLEM
  • 24. MISRA-GRIES / FREQUENT ALGORITHM • A generalisation of the Majority Algorithm to track multiple frequent items, known as Frequent Algorithm, was proposed by Erik D. Demaine, et al. in 2002. • After some time it was discovered that the Frequent algorithm actually the same as the algorithm that was published by Jayadev Misra and David Gries in 1982, known now as Misra-Gries Algorithm • The trick is to run the Majority algorithm, but with many counters instead of 1 • The time cost of the algorithm is dominated by the O(1) dictionary operations per update, and the cost of decrementing counts
  • 25. MISRA-GRIES: ALGORITHM The Data structure for Misra-Gries algorithm consists of 2 arrays: • an array of k-1 frequency counters C (all 0) and k-1 locations X* (empty set) For every element xi in the set: • If xi is already in X* at some index j: • increase its corresponding frequency counter by 1: C[j] = C[j] + 1 • If xi not in X*: • If size of X* less than k (so, there is free space in the top): • append xi to X* at index j • set the corresponding frequency counter C[j] =1 • else If X* is already full: • decrement all frequency counters by 1: C[j] = C[j] - 1, j=1..k-1 • remove such elements from X* whose counters are 0. Top k-1 frequent elements : X*[j] with frequency estimations C[j], j=1..k-1
  • 26. MISRA-GRIES:EXAMPLE 0 0C: X*: • Step 1: {3,2,1,2,2,3,1,3}
 Element {3} isn’t in X* already and X* isn’t full, so we add it at position 0 and set C[0] = 1 • Step 2: {3,2,1,2,2,3,1,3}
 Element {2} isn’t in X* already and X* isn’t full, so we add it at position 1 and set C[1] = 1 1 0C: X*: 3 • Consider the following set of elements: {3,2,1,2,2,3,1,3}, n= 8 • We need to find top 2 elements in the set that appears at least n/3 times. 1 1 2 C: X*: 3
  • 27. MISRA-GRIES:EXAMPLE • Step 3: {3,2,1,2,2,3,1,3}
 Element {1} isn’t in X* already, but X* is full, so decrement all counters and remove elements that have counters less then 1: • Step 4: {3,2,1,2,2,3,1,3}
 Element {2} isn’t in X* already and X* isn’t full, so we add it at position 0 and set C[0] = 1 0 0C: X*: 1 0C: X*: 2 • Step 5: {3,2,1,2,2,3,1,3}
 Element {2} is in X* at position 0, so we increase its counter C[0] = C[0] + 1 = 2 2 0C: X*: 2
  • 28. MISRA-GRIES:EXAMPLE • Step 6: {3,2,1,2,2,3,1,3}
 Element {3} isn’t in X* already and X* isn’t full, so we append {3} at position 1 and set its counter to 1: C[1] = 1 • Step 7: {3,2,1,2,2,3,1,3}
 Element {1} isn’t in X* already, but X* is full, so decrement all counters and remove elements that have 0 as the counters values: 2 1 3 C: X*: 2 1 0C: X*: 2 • Step 8: {3,2,1,2,2,3,1,3}
 Element {3} isn’t in X* already and X* isn’t full, so we append {3} at position 1 and set its counter to 1: C[1] = 1 2 1 3 C: X*: 2 Top 2 elements
  • 29. MISRA-GRIES: PROPERTIES • The algorithm identifies at most k-1 candidates without any probabilistic approach. • It's still open question how to process such updates quickly, or particularly, how to decrement and release several counters simultaneously. • For instance, it was proposed to use doubly linked list of counters and store only the differences between same-value counter’s groups and use. With such data structure each counter no longer needs to store a value, but rather its group and its monitored element.
  • 30. MISRA-GRIES: READ MORE • https://courses.cs.washington.edu/courses/cse522/14sp/ lectures/lect04.pdf • http://dimacs.rutgers.edu/~graham/pubs/papers/ encalgs-mg.pdf • http://drops.dagstuhl.de/opus/volltexte/2016/5773/pdf/ 3.pdf • http://theory.stanford.edu/~tim/s16/l/l2.pdf
  • 31. ▸ @gakhov ▸ linkedin.com/in/gakhov ▸ www.datacrucis.com THANK YOU