An introduction to probabilistic data structures (and algorithms)
== Grokking Engineering, April 2016 ==
Võ Việt Hùng
vvhung@gmail.com
Who am I?
● A technical guy who has been working in IT for 15+ years
✔ In many roles: developer, sysadmin, DBA, big data
analyst
✔ Large systems: billions of requests per month
● Current: one of the biggest ad networks in Vietnam
● Past
✔ VNG: Zing Ads, Zing Me, ...
✔ Vietnamworks/Navigos Group...
Agenda (1)
● A real-world problem
● Probabilistic data structures (PDS), what?
● PDS, why?
● Some characteristics of PDS
● Some common PDS
● Membership Query – BloomFilter
● Cardinality Estimation – HyperLogLog
● Frequency Estimation – Count-Min Sketch
● Percentile and Quantile Estimation – t-digest
Agenda (2)
● Some case studies
● What else is in the jungle?
● References
● Q&A
A real-world problem (1)
When processing data sets, we often want to do some simple checks
(queries) like:
● Does the data set contain a particular element (membership query)?
● How many distinct elements are in the data set (i.e. what is the
cardinality of the data set)?
● What are the most frequent elements (i.e. top-k elements)?
● What are the frequencies of the most frequent elements?
● What is the mean/median value of some quantity in the data set?
A real-world problem (2)
The common approach is to use some kind of deterministic
data structure like HashSet or Hashtable for such purposes.
Another approach is to load the data into a database and run
SQL queries.
But as the data grows and the demand for fast responses rises,
we run into memory and CPU limits and slow queries.
Probabilistic data structures (PDS)
● PDS are a group of data structures that are extremely
useful for big data and streaming/realtime applications.
● These data structures use hash functions to randomize
and compactly represent a set of items.
● Collisions are ignored, but the resulting error can be kept
under a chosen threshold.
Why?
To deal with:
● the need for fast responses
● (very) large data that cannot fit in memory
● data that is (or could be) processed in one pass
● incremental updates (results)
● cases where 100% correctness is not needed, just a
controllable approximation
PDS characteristics
(as compared with error-free approaches)
● trade accuracy for space and performance
● use less memory
● have constant (and short) query time
● (usually) support union and intersection operations
● can be merged => map-reduce friendly
● can be parallelized and distributed
Some common PDS
● Membership Query
✔ Bloom Filter (BF)
✔ Bloom Filter extensions: counting-BF, scalable-BF,
stable-BF, layered-BF, inverse-BF
✔ Cuckoo hashing
● Cardinality Estimation – HyperLogLog (HLL), KMV, LC
● Frequency Estimation – Count-Min Sketch (CMS)
● Percentile and Quantile Estimation – t-digest
● Skip-list
● ….
Membership Query – Bloom Filter
● conceived by Burton Howard Bloom in 1970
● is used to test whether an element is a member of a set
● False-positive matches are possible, but false-negatives
are not. In other words, a query returns either "possibly in
set" or "definitely not in set"
● Elements can be added to the set, but not removed
(though this can be addressed with a "counting" filter).
● The more elements that are added to the set, the larger
the probability of false positives.
BloomFilter – algorithm behind
● effectively a hash table where collisions are ignored and
each element added to the table is hashed by some
number k of hash functions.
● There is one major difference: a bloom filter does NOT
store the hashed keys.
● Instead, it has a bit array as its underlying data structure;
each key is remembered by flipping on all of the bits the k
hash functions map it to.
BloomFilter – Simple implementation
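The mechanism described above can be sketched in Python as follows. This is a minimal illustration, not the implementation shown in the talk: the class name, the parameters m (bits) and k (hash functions), and the use of SHA-256 with double hashing to derive the k positions are all assumptions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash positions per key."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key: str):
        # Derive k positions from one SHA-256 digest via double hashing
        # (h1 + i*h2 mod m). This choice of hash is an assumption.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        # Remember the key by switching on every bit it maps to.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True means "possibly in set", False means "definitely not in set".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


bf = BloomFilter(m=1024, k=4)
for word in ("apple", "banana", "cherry"):
    bf.add(word)

print("apple" in bf)   # an added key is always reported as present
print("durian" in bf)  # usually False; a True here would be a false positive
```

Note that the keys themselves are never stored, only the bits they map to, which is why an element cannot be removed from this basic form.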
BloomFilter – Properties
● Unlike a standard hash table, a BF of a fixed size can represent a set
with an arbitrarily large number of elements
● adding an element never fails due to the data structure "filling up"
● Union and intersection of BFs with the same size and set of hash
functions can be implemented with bitwise OR (union) and AND
(intersection)
● The union operation on BFs is lossless in the sense that the resulting
BF is the same as the BF created from scratch using the union of the
two sets.
● The intersect operation satisfies a weaker property: the false-
positive probability in the resulting BF is at most the false-positive
probability in one of the constituent BFs, but may be larger than the
false-positive probability in the BF created from scratch using the
intersection of the two sets.
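The union and intersection properties can be checked with a small Python sketch. It stores each filter as an M-bit Python integer so that OR and AND are single operations; the sizes, key names and SHA-256 based hashing are illustrative assumptions.

```python
import hashlib

M, K = 512, 3  # filter size in bits and number of hash functions (illustrative)


def positions(key: str):
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % M for i in range(K)]


def build(keys) -> int:
    """Return a Bloom filter over keys, stored as an M-bit Python integer."""
    bits = 0
    for key in keys:
        for pos in positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, key: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(key))


a = build(["u1", "u2", "u3"])
b = build(["u3", "u4"])

union = a | b         # identical to a filter built from all four keys
intersection = a & b  # contains every bit of build(["u3"]), possibly more

print(union == build(["u1", "u2", "u3", "u4"]))
print(might_contain(intersection, "u3"))
```

The AND result may keep bits that were set by different keys in each filter, which is why its false-positive rate can exceed that of a filter built directly from the intersection.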
BloomFilter – simple usage
BloomFilter – Math behind
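For reference, the standard textbook approximations for a Bloom filter are given below. They are not taken from the original slide; the symbols m (bits in the filter), n (inserted elements), k (hash functions) and p (false-positive probability) are the usual ones, and the formulas assume independent, uniformly distributed hash functions.

```latex
% m = number of bits, n = number of inserted elements, k = number of hash functions
p \approx \left(1 - e^{-kn/m}\right)^{k}
\qquad
k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2
\qquad
\frac{m}{n} = -\frac{\ln p}{(\ln 2)^{2}} \quad \text{(at } k = k_{\mathrm{opt}}\text{)}
```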
BloomFilter – rules-of-thumb
Formulas and rules of thumb
(http://corte.si/posts/code/bloom-filter-rules-of-thumb/)
● Bits of filter needed per stored element for a target false-positive (fp) rate:
fp rate    bits per element
50%        1.44
10%        4.79
2%         8.14
1%         9.58
0.1%       14.38
0.01%      19.17
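The figures above agree (to within rounding) with the usual sizing rule, bits per element = -ln(p) / (ln 2)^2, which assumes the optimal number of hash functions is used. A short Python sketch that recomputes them; the function names are made up for this example.

```python
import math


def bits_per_element(fp_rate: float) -> float:
    """Filter bits needed per stored element for a target false-positive
    rate, assuming the optimal number of hash functions."""
    return -math.log(fp_rate) / (math.log(2) ** 2)


def optimal_hash_count(fp_rate: float) -> int:
    """Optimal k = (bits per element) * ln 2, rounded, and at least 1."""
    return max(1, round(bits_per_element(fp_rate) * math.log(2)))


for rate in (0.5, 0.1, 0.02, 0.01, 0.001, 0.0001):
    print(f"{rate:>8.2%}  {bits_per_element(rate):6.2f} bits  k={optimal_hash_count(rate)}")
```

For example, one million elements at a 1% false-positive rate need roughly 9.6 million bits, about 1.2 MB.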
BloomFilter – size over probability
BloomFilter extension – Counting
● Counting BFs provide a way to implement a delete operation
on a BF without recreating the filter afresh.
● In a counting filter the array positions (buckets) are extended
from being a single bit to being an n-bit counter.
● When an item is added, the corresponding counters are
incremented, and when it’s removed, the counters are
decremented.
● Counting BF takes n times more space than a regular BF,
and it also has a scalability limit. Because the counting BF
table cannot be expanded, the maximal number of keys to be
stored simultaneously in the filter must be known in advance.
Once the designed capacity of the table is exceeded, the false
positive rate will grow rapidly as more keys are inserted.
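A minimal Python sketch of a counting filter, assuming plain integer counters instead of fixed n-bit ones (so overflow is not modelled). The class and method names and the SHA-256 based hashing are illustrative assumptions.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose cells are counters, so keys can be removed."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m  # one counter per cell instead of one bit

    def _positions(self, key: str):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.counters[pos] += 1

    def remove(self, key: str) -> None:
        # Removing a key that was never added would corrupt counters shared
        # with other keys, so refuse when the key is definitely absent.
        if key not in self:
            raise KeyError(key)
        for pos in self._positions(key):
            self.counters[pos] -= 1

    def __contains__(self, key: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(key))


cbf = CountingBloomFilter(m=256, k=3)
cbf.add("session-1")
cbf.add("session-2")
cbf.remove("session-1")
print("session-2" in cbf)  # still present after the other key is removed
```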
BloomFilter extension – Scalable
● Standard BFs require knowing the size of the data set
ahead of time in order to keep the false-positive probability under control
● Scalable BFs are useful for cases where the size of the
data set isn’t known a priori and memory constraints
aren’t of particular concern.
● Scalable BF is essentially an array of BFs. New elements
are added to the last filter. When this filter becomes “full” –
when it reaches a target fill ratio – a new filter is added
with a tightened error probability.
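The growth scheme can be sketched in Python as below. Several simplifications are assumptions of this example: a stage counts as "full" after a fixed number of insertions rather than at a measured fill ratio, each new stage doubles in capacity, and the error rate is halved per stage so that the stage error rates sum to at most the overall target.

```python
import hashlib
import math


class _FixedFilter:
    """One fixed-size Bloom filter sized for `capacity` keys at `fp_rate`."""

    def __init__(self, capacity: int, fp_rate: float):
        self.capacity = capacity
        self.fp_rate = fp_rate
        self.count = 0
        self.m = math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = 0  # Python int used as an m-bit array

    def _mask(self, key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        mask = 0
        for i in range(self.k):
            mask |= 1 << ((h1 + i * h2) % self.m)
        return mask

    def add(self, key: str) -> None:
        self.bits |= self._mask(key)
        self.count += 1

    def __contains__(self, key: str) -> bool:
        mask = self._mask(key)
        return (self.bits & mask) == mask


class ScalableBloomFilter:
    def __init__(self, initial_capacity=100, fp_rate=0.01, tightening=0.5, growth=2):
        self.tightening = tightening
        self.growth = growth
        # The first stage gets fp_rate * (1 - tightening), so the geometric
        # series of stage error rates sums to at most fp_rate.
        self.filters = [_FixedFilter(initial_capacity, fp_rate * (1 - tightening))]

    def add(self, key: str) -> None:
        last = self.filters[-1]
        if last.count >= last.capacity:  # stage is "full": open a stricter one
            last = _FixedFilter(
                last.capacity * self.growth, last.fp_rate * self.tightening
            )
            self.filters.append(last)
        last.add(key)

    def __contains__(self, key: str) -> bool:
        return any(key in f for f in self.filters)


sbf = ScalableBloomFilter(initial_capacity=10)
for i in range(100):
    sbf.add(f"key-{i}")
print(len(sbf.filters))  # stages of capacity 10, 20, 40 and 80
```

Lookups must probe every stage, so query cost grows with the number of stages; that is part of the price for not knowing the data size up front.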
BloomFilter extension – Stable
● Stable BF is a variant of BFs for detecting duplicates in
unbounded data streams with limited space (memory).
In particular, if the stream is not uniformly distributed,
meaning duplicates are likely to be grouped closer
together, the rate of false positives becomes immaterial.
● Since there is no way to store the entire history of a stream
(which can be infinite), Stable BFs continuously evict stale
information to make room for more recent elements.
● Since stale information is evicted, the Stable BF introduces
false negatives, which do not appear in traditional Bloom
filters. But a tight upper bound of false positive rates is
guaranteed.
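A rough Python sketch of the eviction idea, following the scheme commonly described for Stable Bloom filters: before each insertion, decrement p randomly chosen cells, then set the k hashed cells to the maximum value. The parameter names (m, k, d, p), their defaults and the seeded random generator are assumptions made for this example.

```python
import hashlib
import random


class StableBloomFilter:
    def __init__(self, m=1024, k=3, d=2, p=10, seed=0):
        self.m, self.k, self.p = m, k, p
        self.max = (1 << d) - 1         # largest value a d-bit cell can hold
        self.cells = [0] * m
        self.rng = random.Random(seed)  # seeded only to make runs repeatable

    def _positions(self, key: str):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def __contains__(self, key: str) -> bool:
        return all(self.cells[pos] > 0 for pos in self._positions(key))

    def add(self, key: str) -> None:
        # Evict stale information: decay p randomly chosen cells.
        for _ in range(self.p):
            i = self.rng.randrange(self.m)
            if self.cells[i] > 0:
                self.cells[i] -= 1
        # Record the new element at full strength.
        for pos in self._positions(key):
            self.cells[pos] = self.max


sbf = StableBloomFilter()
sbf.add("event-1")
print("event-1" in sbf)  # a just-added element is always found
for i in range(100_000):
    sbf.add(f"other-{i}")
print("event-1" in sbf)  # old elements decay, so this may now be False
```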
BloomFilter extension – Layered
● A layered BF consists of multiple BF layers.
● Layered BFs allow keeping track of how many times an
item was added to the BF by checking how many layers
contain the item.
● With a layered BF a check operation will normally return
the deepest layer number the item was found in.
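A small Python sketch of the layering idea, assuming every layer uses the same hash positions and that each insertion goes into the first layer that does not yet contain the item. The class name, sizes and hashing scheme are illustrative.

```python
import hashlib


class LayeredBloomFilter:
    def __init__(self, layers: int = 8, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.layers = [0] * layers  # each layer is an m-bit filter held in an int

    def _mask(self, key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        mask = 0
        for i in range(self.k):
            mask |= 1 << ((h1 + i * h2) % self.m)
        return mask

    def add(self, key: str) -> int:
        """Insert key into the first layer missing it; return that layer number."""
        mask = self._mask(key)
        for i, layer in enumerate(self.layers):
            if (layer & mask) != mask:
                self.layers[i] |= mask
                return i + 1
        return len(self.layers)  # saturated: already present in every layer

    def count(self, key: str) -> int:
        """Deepest consecutive layer containing key (an estimate of add count)."""
        mask = self._mask(key)
        depth = 0
        for layer in self.layers:
            if (layer & mask) != mask:
                break
            depth += 1
        return depth


lbf = LayeredBloomFilter()
for _ in range(3):
    lbf.add("user-42")
print(lbf.count("user-42"))  # 3
```

The count saturates at the number of layers, and false positives in a layer can make it overestimate.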
BloomFilter extension – Inverse
● Inverse BF is an “opposite” of BF. It may report a false
negative but can never report a false positive. That is, it
may indicate that an item has not been seen when it
actually has, but it will never report an item as seen which
it hasn’t come across.
● Inverse BF behaves in a similar manner to a fixed-size
hash map of m buckets which doesn’t handle conflicts, but
it provides lock-free concurrency using an underlying CAS.
● Inverse BF is a nice option for dealing with unbounded
streams or large data sets due to its limited memory usage.
If duplicates are close together, the rate of false negatives
becomes vanishingly small with an adequately sized filter.
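A single-threaded Python sketch of the behaviour described above: a fixed array of slots in which each new key overwrites whatever previously hashed to the same slot. The lock-free CAS mentioned above is not modelled, and the class and method names are assumptions.

```python
import hashlib


class InverseBloomFilter:
    def __init__(self, m: int = 1024):
        self.slots = [None] * m  # stores the keys themselves, one per bucket

    def observe(self, key: str) -> bool:
        """Record key; return True only if it was definitely seen before."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % len(self.slots)
        previous, self.slots[idx] = self.slots[idx], key
        # Comparing the stored key rules out false positives; an overwritten
        # key is forgotten, which is where false negatives come from.
        return previous == key


dedup = InverseBloomFilter(m=16)
print(dedup.observe("msg-1"))  # False: first sighting
print(dedup.observe("msg-1"))  # True: duplicate arrived right after
```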
BloomFilter – Applications
● Akamai's web servers use Bloom filters to prevent "one-hit-
wonders" from being stored in its disk caches
● Google BigTable, Apache HBase and Apache Cassandra use
Bloom filters to reduce the disk lookups for non-existent rows or
columns
● Google Chrome web browser used to use a Bloom filter to identify
malicious URLs. Any URL was first checked against a local Bloom
filter, and only if the Bloom filter returned a positive result was a full
check of the URL performed
● The Squid Web Proxy Cache uses Bloom filters for cache digests
● Bitcoin uses Bloom filters to speed up wallet synchronization
● The Exim mail transfer agent (MTA) uses Bloom filters in its rate-
limit feature
BloomFilter – Alternatives
● Cuckoo hashing
https://en.wikipedia.org/wiki/Cuckoo_hashing
● Roaring bitmaps
http://roaringbitmap.org/
Cardinality Estimation – HyperLogLog
● a streaming algorithm used for estimating the number of
distinct elements (cardinality) of very large data sets.
● HyperLogLog counter can count one billion distinct items
with an accuracy of 2% using only 1.5 KB of memory.
● It is based on a bit-pattern observation: for a stream
of uniformly distributed random numbers, if the maximum
number of leading 0 bits seen in any number is k, the
cardinality of the stream is likely to be about 2^k.
HyperLogLog – simple explanation
● For example, given four bits there exist only 16 possible
representations. If in our stream the highest number of
leading zeroes seen were three (000), the probability of
seeing that pattern is 2 in 16 (or 1 in 8), so we conclude
that the cardinality of our stream is about 8.
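The arithmetic of this four-bit example can be reproduced in a few lines of Python; the helper name is made up for the illustration.

```python
BITS = 4


def leading_zeros(value: int, bits: int = BITS) -> int:
    """Number of leading zero bits when value is written with `bits` bits."""
    return bits - value.bit_length()


# All 4-bit patterns that start with at least three zeroes.
patterns = [format(v, f"0{BITS}b") for v in range(2 ** BITS) if leading_zeros(v) >= 3]
probability = len(patterns) / 2 ** BITS
estimate = 1 / probability

print(patterns)     # ['0000', '0001']
print(probability)  # 0.125
print(estimate)     # 8.0
```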
HyperLogLog – more details
● In the HLL algorithm, a hash function is applied to each
element in the original multiset (a set which allows multiple
occurrences of its elements), to obtain a multiset of uniformly
distributed random numbers with the same cardinality as the
original multiset. The cardinality of this randomly distributed
set can then be estimated using the algorithm above.
● The simple estimate of cardinality obtained using the algorithm
above has the disadvantage of a large variance. In the
HyperLogLog algorithm, the variance is minimised by splitting
the multiset into numerous subsets, calculating the maximum
number of leading zeros in the numbers in each of these
subsets, and using a harmonic mean to combine these
estimates for each subset into an estimate of the cardinality of
the whole set.
HyperLogLog – an implementation
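A compact Python sketch of the algorithm, not the code shown in the talk. It assumes a 64-bit hash taken from SHA-256, 2^p registers with p = 12, the usual bias constant for 128 or more registers, and the standard small-range correction; the large-range correction is omitted.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p               # number of registers (subsets)
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        x = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")
        idx = x >> (64 - self.p)                 # first p bits pick the register
        rest = x & ((1 << (64 - self.p)) - 1)    # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias constant, valid for m >= 128
        # Harmonic-mean style combination of the per-register estimates.
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

    def merge(self, other: "HyperLogLog") -> None:
        # The union of two sketches is the register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]


hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the registers
print(round(hll.count()))  # close to 50000; the standard error is about 1.04/sqrt(m)
```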
Frequency Est. – Count-Min Sketch
● Count-Min sketches are a family of memory-efficient data
structures that allow one to estimate frequency-related
properties of the data set, e.g. estimate frequencies of
particular elements, find top-K frequent elements,
perform range queries (where the goal is to find the sum of
frequencies of elements within a range), estimate
percentiles.
● It is somewhat similar to bloom filter. The main difference
is that bloom filter represents a set as a bitmap, while
Count-Min sketch represents a multi-set which keeps a
frequency distribution summary.
Frequency Est. – Count-Min Sketch
● Count-Min sketch is a two-dimensional array (d × w) of
integer counters. When a value arrives, it is mapped to
one position at each of d rows using d different and
preferably independent hash functions. Counters on each
position are incremented.
Frequency Est. – Count-Min Sketch
● The estimate of the counts for an item is the minimum
value of the counts at the array positions determined by
the d hash functions.
● The space used by Count-Min sketch is the array of w*d
counters. By choosing appropriate values for d and w, a very
small error can be achieved with high probability.
Count-Min Sketch – implementation
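A minimal Python sketch of the structure, not the code shown in the talk. The class and method names are assumptions, each row's hash function is simulated by salting SHA-256 with the row index, and `from_error` uses the commonly quoted sizing w = ceil(e / epsilon), d = ceil(ln(1 / delta)).

```python
import hashlib
import math


class CountMinSketch:
    def __init__(self, width: int, depth: int):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    @classmethod
    def from_error(cls, epsilon: float, delta: float) -> "CountMinSketch":
        """Size the sketch so the overcount is at most epsilon * N
        with probability at least 1 - delta (N = total count added)."""
        return cls(math.ceil(math.e / epsilon), math.ceil(math.log(1 / delta)))

    def _columns(self, item: str):
        # One column per row; the row index acts as the hash function's seed.
        return [
            int.from_bytes(
                hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()[:8], "big"
            ) % self.width
            for row in range(self.depth)
        ]

    def add(self, item: str, amount: int = 1) -> None:
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += amount

    def estimate(self, item: str) -> int:
        # Collisions only ever inflate a counter, so the minimum is the
        # least distorted of the d candidates.
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))

    def merge(self, other: "CountMinSketch") -> None:
        # Union of two sketches with identical dimensions: cell-wise addition.
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


cms = CountMinSketch(width=64, depth=4)
for _ in range(5):
    cms.add("a")
cms.add("b", 2)
print(cms.estimate("a"))  # at least 5, never less
```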
Count-Min Sketch – Properties
● Union can be performed by cell-wise ADD operation
● O(d) query time, where d is the number of hash functions
● Better accuracy for higher frequency items (heavy-hitters)
● Can only cause over-counting but not under-counting
Count-Min Sketch – Notes
● Accuracy of the Count-Min sketch depends on the ratio
between the sketch size and the total number of registered
events. This means that Count-Min technique provides
significant memory gains only for skewed data, i.e. data
where items have very different probabilities.
● Applicability of Count-Min sketches is not a straightforward
question and the best thing that can be recommended is
experimental evaluation of each particular case.
● Count-Min sketch performs well on highly skewed data, but
on low or moderately skewed data it is not so efficient
because of poor protection from the high number of hash
collisions – Count-Min sketch simply selects minimal (less
distorted) estimator => Count-Mean-Min sketch
Count-Mean-Min Sketch – implementation
● CMM estimates the noise for each hash function as the
average value of all counters in the row that corresponds to
this function (except the counter that corresponds to the query
itself), subtracts it from the estimate for this hash function,
and, finally, computes the median of the estimates for all
hash functions.
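The estimation step can be sketched in Python as follows. Because every insertion adds the same amount to exactly one counter per row, the sum of the other counters in a row is simply the running total minus the queried counter. Names, dimensions and the salted SHA-256 hashing are assumptions of this example.

```python
import hashlib
import statistics


class CountMeanMinSketch:
    def __init__(self, width: int = 256, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0  # sum of all amounts added (equals each row's sum)

    def _columns(self, item: str):
        return [
            int.from_bytes(
                hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()[:8], "big"
            ) % self.width
            for row in range(self.depth)
        ]

    def add(self, item: str, amount: int = 1) -> None:
        self.total += amount
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += amount

    def estimate(self, item: str) -> float:
        per_row = []
        for row, col in enumerate(self._columns(item)):
            counter = self.table[row][col]
            # Noise: average of the other width - 1 counters in this row.
            noise = (self.total - counter) / (self.width - 1)
            per_row.append(counter - noise)
        return statistics.median(per_row)


cmm = CountMeanMinSketch()
for _ in range(5):
    cmm.add("a")
print(cmm.estimate("a"))  # 5.0: with a single item there is no noise to subtract
```

Unlike the plain Count-Min estimate, this one can undershoot the true count, since the subtracted noise is only an average.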
Count-Min Sketch – Top-k problem
Find all elements in the data set with the frequencies greater than k
percent of the total number of elements in the data set.
● Maintain a standard Count-Min sketch during the scan of the data set
and put all elements into it.
● Maintain a heap of top elements, initially empty, and a counter N of the
total number of already processed elements.
● For each element in the data set:
✔ Put the element to the sketch
✔ Estimate the frequency of the element using the sketch. If frequency
is greater than a threshold (k*N), then put the element to the heap.
Heap should be periodically or continuously cleaned up to remove
elements that do not meet the threshold anymore.
● In general, the top-k problem makes sense only for skewed data, so
usage of Count-Min sketches is reasonable in this context.
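The procedure above can be sketched in Python as below. As a simplification, the candidates are kept in a dictionary that is pruned periodically rather than in a heap; the sketch dimensions, the pruning interval and the function names are assumptions.

```python
import hashlib

WIDTH, DEPTH = 512, 4  # sketch dimensions (illustrative values)


def columns(item: str):
    return [
        int.from_bytes(
            hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()[:8], "big"
        ) % WIDTH
        for row in range(DEPTH)
    ]


def heavy_hitters(stream, k: float):
    """Return {element: estimated count} for elements whose estimated
    frequency exceeds the fraction k of all elements processed."""
    table = [[0] * WIDTH for _ in range(DEPTH)]
    candidates = {}
    n = 0  # counter N of processed elements
    for item in stream:
        n += 1
        cols = columns(item)
        for row, col in enumerate(cols):      # put the element into the sketch
            table[row][col] += 1
        estimate = min(table[row][col] for row, col in enumerate(cols))
        if estimate > k * n:                  # above the threshold k*N
            candidates[item] = estimate
        if n % 1000 == 0:                     # periodic clean-up
            candidates = {e: c for e, c in candidates.items() if c > k * n}
    return {e: c for e, c in candidates.items() if c > k * n}


stream = ["hot"] * 60 + [f"cold-{i}" for i in range(40)]
print(heavy_hitters(stream, k=0.3))  # contains "hot" with an estimate of at least 60
```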
Percentile & Quantile Est. – t-digest
● The problem: calculating the median of a dataset in a
distributed environment (the median of per-node medians is
not, in general, equal to the overall median) => what's needed
is an algorithm that can approximate the median while still
being space efficient.
● the t-Digest is a probabilistic data structure for estimating
the median (and more generally any percentile) from
either distributed data or streaming data.
● Internally, the data structure is a sparse representation of
the cumulative distribution function. After ingesting data,
the data structure has learned the "interesting" points of
the CDF, called centroids.
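A tiny Python example of why per-partition medians cannot simply be combined. The three partitions stand for data held on three nodes; the numbers are made up for the illustration.

```python
import statistics

# Data split unevenly across three hypothetical nodes.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10, 11, 12, 13]]

median_of_medians = statistics.median(statistics.median(p) for p in partitions)
true_median = statistics.median(x for p in partitions for x in p)

print(median_of_medians)  # 5
print(true_median)        # 7
```

The per-node medians (2, 5, 10) ignore how many values each node holds, so combining them gives the wrong answer. A mergeable summary such as the t-digest keeps enough of the distribution to avoid this.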
Percentile & Quantile Est. – t-digest
● A new data structure for accurate on-line accumulation of rank-based
statistics such as quantiles and trimmed means. The t-digest
algorithm is also very parallel friendly making it useful in map-reduce
and parallel streaming applications.
● The t-digest construction algorithm uses a variant of 1-dimensional
k-means clustering to produce a data structure that is related to the
Q-digest. This t-digest data structure can be used to estimate quantiles
or compute other rank statistics.
● The advantage of the t-digest over the Q-digest is that the t-digest can
handle floating point values while the Q-digest is limited to integers.
With small changes, the t-digest can handle any values from any
ordered set that has something akin to a mean.
● Quantile estimates produced by t-digests can be orders of magnitude
more accurate than those produced by Q-digests, in spite of the fact
that t-digests are more compact when stored on disk.
t-digest – characteristics
● has smaller summaries than Q-digest
● works on doubles as well as integers.
● provides part per million accuracy for extreme quantiles
and typically <1000 ppm accuracy for middle quantiles
● is fast
● is very simple
● can be used with map-reduce very easily because digests
can be merged
Some remarks
● For some structures like HyperLogLog or Bloom filter,
there are simple and practical formulas to determine
parameters of the structure on the basis of expected data
volume and required error probability.
● Other structures like Count-(Mean-)Min Sketch have
complex dependency on statistical properties of data and
experiments are the only reasonable way to understand their
applicability to real use cases.
● Data-structures populated by different data sets can often be
combined to process complex queries.
● Some types of queries can be supported by using
customized versions of the described data-structures/
algorithms.
Case Study 1
● There is a system that tracks a huge number of web
events, and each event is marked by a number of tags
including the user ID the event corresponds to.
It is required to report the number of unique users that
meet a specified combination of tags (like users from
the city C that visited site A or site B)
Case Study 1: solution
● Solution 1:
✔ maintain a BF that tracks user IDs for each tag value
and a BF that contains user IDs that correspond to the
final result.
✔ A user ID from each incoming event is tested against
the per-tag filters – does it satisfy the required
combination of tags or not.
✔ If the user ID passes this test, it is additionally tested
against the additional BF that corresponds to the report
itself and, if passed, the final report counter is
increased.
● Solution 2: using HLL for each tag value
Case Study 2
● There is a system that receives events on user visits from
different internet sites.
● This system enables analysts to query the number of
unique visitors for a specified date range and site.
Case Study 2: solution
● HLL can be used to aggregate information about visitor
IDs for each day and site. The sketch for each day is saved,
and a query can be processed by merging the daily sketches
(register-wise max for HLL; bitwise OR applies to bitmap-style
counters such as Linear Counting).
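A Python sketch of this design, assuming one HyperLogLog register array per (site, day) and a register-wise maximum as the merge step. The helper names, the site and date values, the register count and the SHA-256 based hashing are all made up for the illustration.

```python
import hashlib
import math

P = 10      # 2**10 = 1024 registers per sketch (illustrative)
M = 1 << P


def new_sketch():
    return [0] * M


def add(sketch, visitor_id: str) -> None:
    x = int.from_bytes(hashlib.sha256(visitor_id.encode("utf-8")).digest()[:8], "big")
    idx, rest = x >> (64 - P), x & ((1 << (64 - P)) - 1)
    sketch[idx] = max(sketch[idx], (64 - P) - rest.bit_length() + 1)


def merge(sketches):
    """Union of several sketches: register-wise maximum."""
    return [max(regs) for regs in zip(*sketches)]


def estimate(sketch) -> float:
    raw = 0.7213 / (1 + 1.079 / M) * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    # Small-range correction (linear counting) when many registers are empty.
    return M * math.log(M / zeros) if raw <= 2.5 * M and zeros else raw


daily = {}  # (site, day) -> sketch


def record(site: str, day: str, visitor_id: str) -> None:
    add(daily.setdefault((site, day), new_sketch()), visitor_id)


def unique_visitors(site: str, days) -> float:
    sketches = [daily[(site, d)] for d in days if (site, d) in daily]
    return estimate(merge(sketches)) if sketches else 0.0


for i in range(500):
    record("site-a", "2016-04-01", f"v{i}")
for i in range(250, 750):
    record("site-a", "2016-04-02", f"v{i}")  # 250 visitors return on day two

print(round(unique_visitors("site-a", ["2016-04-01", "2016-04-02"])))  # roughly 750
```

Merging never double-counts returning visitors, because a repeated ID maps to the same register and rank in every daily sketch.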
Case Study 3
● There is a system that tracks traffic by IP address and it is
required to detect most traffic-intensive addresses.
Case Study 3: solution
● CMS?!!
● the problem is not trivial because we need to track the total
traffic for each address, not a frequency of items.
● counters in the CMS implementation can be incremented
not by 1, but by the absolute amount of traffic for each
observation (e.g. the size of the IP packet if the sketch is
updated for each packet)
● In this case, sketch will track amounts of traffic for each
address and a heap with the most traffic-intensive
addresses can be maintained (top-k or heavy-hitter).
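A Python sketch of the weighted variant: the sketch is incremented by the packet size, and a small candidate set of the heaviest addresses is kept alongside it. A plain dictionary trimmed to `TOP` entries stands in for the heap; the addresses, sizes and dimensions are made-up example values.

```python
import hashlib

WIDTH, DEPTH = 2048, 4  # sketch dimensions (illustrative values)
TOP = 2                 # how many heavy addresses to keep

table = [[0] * WIDTH for _ in range(DEPTH)]
top = {}                # address -> estimated bytes, at most TOP entries


def columns(ip: str):
    return [
        int.from_bytes(
            hashlib.sha256(f"{row}:{ip}".encode("utf-8")).digest()[:8], "big"
        ) % WIDTH
        for row in range(DEPTH)
    ]


def traffic_estimate(ip: str) -> int:
    return min(table[row][col] for row, col in enumerate(columns(ip)))


def record_packet(ip: str, size_bytes: int) -> None:
    # Increment by the amount of traffic instead of by 1.
    for row, col in enumerate(columns(ip)):
        table[row][col] += size_bytes
    top[ip] = traffic_estimate(ip)
    if len(top) > TOP:
        del top[min(top, key=top.get)]  # evict the lightest candidate


packets = [("10.0.0.1", 1500), ("10.0.0.2", 40), ("10.0.0.1", 1500), ("10.0.0.3", 576)]
for ip, size in packets:
    record_packet(ip, size)

print(traffic_estimate("10.0.0.1"))  # at least 3000
print(sorted(top))                   # includes "10.0.0.1"
```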
Case Study 4
● There is a system that monitors traffic and counts
unique visitors for different criteria (visited site,
geography, etc.).
● It is required to compute 100 most popular sites using a
number of unique visitors as a metric of popularity.
● Popularity should be computed every day on the basis of
data for the last 30 days, i.e. every day a one-day partition is
added and another one is removed from the scope.
Case Study 4: solution
● create a fresh set of per-site HLL counters every day and
maintain this set during 30 days, i.e. 30 sets of counters
are active at any moment of time.
Case Study 5
● Number of users performing an action (view, click...) on site objects
(banner, button, …) 1 time, 2 times, …., 10+ times
● Report looks like below
Filter: Object=X
  1 time: 98765
  2 times: 76543
  3 times: 54321
  …
  9 times: 1234
  10+ times: 343
Case Study 5: solution
● Should we use CMS???
● … and why/why NOT???
Case Study 5: solution
● Use scalable layered-BF to track k-times user actions on
objects
● Use HLL to count users on each k-times action
What else?
● Libs
✔ Redis: HLL already, BF in next 3.2
✔ https://github.com/twitter/algebird
✔ https://github.com/addthis/stream-lib
✔ https://github.com/tylertreat/BoomFilters
✔ https://github.com/tdunning/t-digest
● More
✔ Linear Counting
✔ MinHash
✔ Top-K
References
● https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
● https://dzone.com/articles/introduction-probabilistic-0
● http://bravenewgeek.com/stream-processing-and-probabilistic-methods/
● https://www.somethingsimilar.com/2012/05/21/the-opposite-of-a-bloom-filter/
● https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest
Grokking TechTalk #11 - An introduction to probabilistic data-structures (and algorithms)

COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.Kamal Acharya
 

Recently uploaded (20)

Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 

Grokking TechTalk #11 - An introduction to probabilistic data-structures (and algorithms)

  • 1. An introduction to probabilistic data-structures (and algorithms) == Grokking Engineering, April 2016 == Võ Việt Hùng vvhung@gmail.com
  • 2. Who am I? ● A technical guy who has been working in IT for 15+ years ✔ In many roles: developer, sys-admin, DBA, big data analyst ✔ Large systems: billions of requests per month ● Current: one of the biggest ad networks in Vietnam ● Past ✔ VNG: Zing Ads, Zing Me, ... ✔ Vietnamworks/Navigos Group...
  • 3. Agenda (1) ● A real-world problem ● Probabilistic data-structures (PDS), what? ● PDS, why? ● Some characteristics of PDS ● Some common PDS ● Membership Query – Bloom Filter ● Cardinality Estimation – HyperLogLog ● Frequency Estimation – Count-Min Sketch ● Percentile and Quantile Estimation – t-digest
  • 4. Agenda (2) ● Some case studies ● What else is in the jungle? ● References ● Q&A
  • 5. A real-world problem (1) When processing data sets, we often want to run some simple checks (queries) like: ● Does the data set contain a particular element (membership query)? ● How many distinct elements are in the data set (i.e. what is the cardinality of the data set)? ● What are the most frequent elements (i.e. top-k elements)? ● What are the frequencies of the most frequent elements? ● What is the mean/median value of some quantity in the data set?
  • 6. A real-world problem (2) The common approach is to use some kind of deterministic data structure such as a HashSet or Hashtable. Another approach is to load the data into a database and run SQL queries. But as the data grows and the demand for fast responses increases, we run into memory and CPU limits and slow queries.
  • 7. Probabilistic data-structures (PDS) ● PDS are a group of data structures that are extremely useful for big data and streaming/realtime applications. ● These data structures use hash functions to randomize and compactly represent a set of items. ● Collisions are ignored, but the resulting errors can be kept under a chosen threshold.
  • 8. Why? To deal with: ● fast response requirements ● (very) large data that cannot fit in memory ● data that can (or must) be processed in one pass ● incremental updates (of results) ● cases where a 100% correct answer is not needed, only an approximation with controllable error
  • 9. PDS characteristics (compared with error-free approaches) ● trade accuracy for space and performance ● use less memory ● have constant (and short) query time ● (usually) support union and intersection operations ● can be merged => map-reduce friendly ● can be parallelized and distributed
  • 10. Some common PDS ● Membership Query ✔ Bloom Filter (BF) ✔ Bloom Filter extensions: counting-BF, scalable-BF, stable-BF, layered-BF, inverse-BF ✔ Cuckoo hashing ● Cardinality Estimation – HyperLogLog (HLL), KMV, LC ● Frequency Estimation – Count-Min Sketch (CMS) ● Percentile and Quantile Estimation – t-digest ● Skip-list ● ...
  • 11. Membership Query – Bloom Filter ● Conceived by Burton Howard Bloom in 1970. ● Used to test whether an element is a member of a set. ● False-positive matches are possible, but false negatives are not. In other words, a query returns either "possibly in set" or "definitely not in set". ● Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). ● The more elements are added to the set, the larger the probability of false positives.
  • 12. BloomFilter – algorithm behind ● Effectively a hash table where collisions are ignored and each element added to the table is hashed by k hash functions. ● There is one major difference: a Bloom filter does NOT store the hashed keys. ● Instead, it has a bit array as its underlying data structure; each key is remembered by setting all of the bits that the k hash functions map it to.
  • 13. BloomFilter – Simple implementation
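The slide's own implementation did not survive in this transcript. As a stand-in, here is a minimal Python sketch of the algorithm described on the previous slide: a bit array plus k hash positions per key. The class name `BloomFilter`, its parameters, and the SHA-256 double-hashing scheme are assumptions made for illustration, not the speaker's code.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter backed by a bytearray used as a bit array."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one SHA-256
        # digest (double hashing: h1 + i * h2). Illustrative choice only.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Remember the key by setting every bit it maps to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # "Possibly in set" only if all k bits are set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(num_bits=1024, num_hashes=7)
bf.add("alice")
bf.add("bob")
print("alice" in bf)  # True: an added item is never reported as absent
print("carol" in bf)  # very likely False, but a false positive is possible
```

Note that the keys themselves are never stored; only the bit array is kept, which is where the memory savings come from.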
  • 14. BloomFilter – Properties ● Unlike a standard hash table, a BF of a fixed size can represent a set with an arbitrarily large number of elements. ● Adding an element never fails due to the data structure "filling up". ● Union and intersection of BFs with the same size and set of hash functions can be implemented with bitwise OR (union) and AND (intersection). ● The union operation on BFs is lossless in the sense that the resulting BF is the same as the BF created from scratch using the union of the two sets. ● The intersect operation satisfies a weaker property: the false-positive probability in the resulting BF is at most the false-positive probability in one of the constituent BFs, but may be larger than the false-positive probability in the BF created from scratch using the intersection of the two sets.
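The union and intersection properties above can be checked with a small sketch. Here each filter's bit array is packed into a single Python integer so that OR and AND apply directly; the constants `M` and `K` and the helper names are hypothetical choices for this example.

```python
import hashlib

M, K = 256, 3  # hypothetical filter size in bits and number of hash functions


def positions(item: str):
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    # Use K separate 4-byte slices of the digest as K hash values.
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]


def build(items) -> int:
    """Return the filter's bit array packed into one Python int."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, item: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(item))


a = build(["apple", "banana"])
b = build(["banana", "cherry"])

union = a | b         # same bits as a filter built from the union of both sets
intersection = a & b  # contains at least the bits of a filter built from {"banana"}

print(union == build(["apple", "banana", "cherry"]))  # True: union is lossless
print(might_contain(intersection, "banana"))          # True: no false negatives
```

The intersection may have extra bits set compared with a filter built directly from the common elements, which is why its false-positive rate can be higher.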
  • 15. BloomFilter – simple usage
  • 16. BloomFilter – Math behind
  • 17. BloomFilter – rules of thumb Formulas and rules of thumb (http://corte.si/posts/code/bloom-filter-rules-of-thumb/) ● Bits per element needed for a given false-positive (fp) rate: 50% → 1.44; 10% → 4.79; 2% → 8.14; 1% → 9.58; 0.1% → 14.38; 0.01% → 19.17
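The figures in the table follow from the standard sizing formulas for a Bloom filter with an optimal number of hash functions: bits per element m/n = -ln(p) / (ln 2)^2 and k = (m/n) · ln 2. The short sketch below recomputes the table; the function names are made up for this example.

```python
import math


def bits_per_element(fp_rate: float) -> float:
    """Bits per stored element for a target false-positive rate."""
    return -math.log(fp_rate) / (math.log(2) ** 2)


def optimal_hashes(bits_per_elem: float) -> int:
    """Number of hash functions that minimizes the false-positive rate."""
    return max(1, round(bits_per_elem * math.log(2)))


for p in (0.5, 0.1, 0.02, 0.01, 0.001, 0.0001):
    b = bits_per_element(p)
    print(f"fp rate {p:.2%}: {b:.2f} bits/element, k = {optimal_hashes(b)}")
```

A handy consequence: each additional factor of 10 in accuracy costs roughly 4.8 more bits per element, regardless of the size of the elements themselves.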
  • 18. BloomFilter – size over probability
  • 19. BloomFilter extension – Counting ● Counting BFs provide a way to implement a delete operation on a BF without recreating the filter afresh. ● In a counting filter the array positions (buckets) are extended from being a single bit to being an n-bit counter. ● When an item is added, the corresponding counters are incremented, and when it is removed, the counters are decremented. ● A counting BF takes n times more space than a regular BF, and it also has a scalability limit. Because the counting BF table cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must be known in advance. Once the designed capacity of the table is exceeded, the false-positive rate grows rapidly as more keys are inserted.
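A minimal sketch of the counting variant, under the same illustrative hashing assumptions as a plain Bloom filter. For simplicity the counters here are unbounded Python integers; a real implementation would use fixed n-bit counters and has to deal with counter overflow. All names are hypothetical.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, num_counters: int, num_hashes: int):
        self.m = num_counters
        self.k = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item: str) -> None:
        # Removing an item that was never added would corrupt the filter,
        # so refuse when the item is definitely absent.
        if item not in self:
            raise KeyError(item)
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(num_counters=512, num_hashes=4)
cbf.add("x")
cbf.add("y")
cbf.remove("x")
print("y" in cbf)  # True: removing "x" cannot hide "y"
print("x" in cbf)  # very likely False now that its counters were decremented
```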
  • 20. BloomFilter extension – Scalable ● Standard BFs require knowing the size of the data set ahead of time in order to keep the error probability controllable. ● Scalable BFs are useful for cases where the size of the data set isn't known a priori and memory constraints aren't of particular concern. ● A scalable BF is essentially an array of BFs. New elements are added to the last filter. When this filter becomes "full" (that is, when it reaches a target fill ratio), a new filter is added with a tightened error probability.
  • 21. BloomFilter extension – Scalable
  • 22. BloomFilter extension – Stable ● A stable BF is a variant of BFs for detecting duplicates in unbounded data streams with limited space (memory). In particular, if the stream is not uniformly distributed, meaning duplicates are likely to be grouped closer together, the rate of false positives becomes immaterial. ● Since there is no way to store the entire history of a stream (which can be infinite), stable BFs continuously evict stale information to make room for more recent elements. ● Since stale information is evicted, the stable BF introduces false negatives, which do not appear in traditional Bloom filters. But a tight upper bound on the false-positive rate is guaranteed.
  • 23. BloomFilter extension – Layered ● A layered BF consists of multiple BF layers. ● Layered BFs allow keeping track of how many times an item was added to the BF by checking how many layers contain the item. ● With a layered BF, a check operation will normally return the deepest layer number the item was found in.
  • 24. BloomFilter extension – Inverse ● An inverse BF is the "opposite" of a BF. It may report a false negative but can never report a false positive. That is, it may indicate that an item has not been seen when it actually has, but it will never report an item as seen when it hasn't come across it. ● An inverse BF behaves in a similar manner to a fixed-size hash map of m buckets which doesn't handle conflicts, but it provides lock-free concurrency using an underlying CAS (compare-and-swap). ● An inverse BF is a nice option for dealing with unbounded streams or large data sets due to its limited memory usage. If duplicates are close together, the rate of false negatives becomes vanishingly small with an adequately sized filter.
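The "fixed-size hash map that doesn't handle conflicts" idea can be sketched in a few lines. This is a single-threaded illustration only: the lock-free CAS mentioned on the slide is left out, and the class and method names are assumptions for this example.

```python
import hashlib


class InverseBloomFilter:
    def __init__(self, num_slots: int):
        self.slots = [None] * num_slots

    def observe(self, item: str) -> bool:
        """Record the item; return True only if it was definitely seen before."""
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.slots)
        previous = self.slots[index]
        self.slots[index] = item  # overwrite whatever lived in the slot
        return previous == item


f = InverseBloomFilter(num_slots=1024)
print(f.observe("event-1"))  # False: first sighting
print(f.observe("event-1"))  # True: the slot still holds this exact item

# With a single slot every other item evicts the previous one,
# which shows how false negatives arise.
tiny = InverseBloomFilter(num_slots=1)
tiny.observe("a")
tiny.observe("b")
print(tiny.observe("a"))  # False, although "a" was seen earlier
```

Because the stored value is compared for equality, a "seen" answer is never wrong; only evictions can make the filter forget.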
  • 25. BloomFilter – Applications ● Akamai's web servers use Bloom filters to prevent "one-hit-wonders" from being stored in its disk caches. ● Google BigTable, Apache HBase and Apache Cassandra use Bloom filters to reduce the disk lookups for non-existent rows or columns. ● The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed. ● The Squid Web Proxy Cache uses Bloom filters for cache digests. ● Bitcoin uses Bloom filters to speed up wallet synchronization. ● The Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature.
  • 26. BloomFilter – Alternatives ● Cuckoo hashing https://en.wikipedia.org/wiki/Cuckoo_hashing ● Roaring bitmaps http://roaringbitmap.org/
  • 27. Cardinality Estimation – HyperLogLog ● A streaming algorithm used for estimating the number of distinct elements (cardinality) of very large data sets. ● A HyperLogLog counter can count one billion distinct items with an accuracy of 2% using only 1.5 KB of memory. ● It is based on a bit-pattern observation: for a stream of randomly distributed numbers, if the maximum number of leading 0 bits seen in any number is k, the cardinality of the stream is likely to be close to 2^k.
  • 28. HyperLogLog – simple explanation ● For example, with four bits there are only 16 possible values. If the longest run of leading zeroes seen in our stream is three (000), the probability of seeing that pattern is 2 in 16 (or 1 in 8), so we conclude that the cardinality of the stream is about 8.
  • 29. HyperLogLog – more details ● In the HLL algorithm, a hash function is applied to each element in the original multiset (a set which allows multiple occurrences of its elements) to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly distributed set can then be estimated using the algorithm above. ● The simple estimate of cardinality obtained using the algorithm above has the disadvantage of a large variance. In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set.
  • 30. HyperLogLog – an implementation
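The implementation shown on this slide is not part of the transcript. The following is a simplified sketch of the steps described on the previous slides: hash each element, use the first p bits to pick a register (the "subset"), keep the maximum leading-zero rank per register, and combine registers with a harmonic mean. The class name, the 64-bit slice of SHA-256 used as the hash, and the default p = 12 are assumptions; only the small-range (linear counting) correction is included, and p should be at least 7 for the bias constant used here.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")            # 64-bit hash value
        index = x >> (64 - self.p)                       # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)            # remaining 64 - p bits
        # rank = number of leading zeros in `rest`, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        # Harmonic-mean based raw estimate over all registers.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small cardinalities: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=12)
for i in range(50_000):
    hll.add(f"user-{i}")
print(round(hll.count()))  # an estimate near 50000 (typical error about 1.6% for p = 12)
```

With 4096 registers the expected relative error is about 1.04 / sqrt(4096), and adding the same element twice never changes the registers, which is what makes the structure a distinct counter.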
  • 31. Frequency Est. – Count-Min Sketch ● Count-Min sketches are a family of memory-efficient data structures that allow one to estimate frequency-related properties of the data set, e.g. estimate frequencies of particular elements, find top-K frequent elements, perform range queries (where the goal is to find the sum of frequencies of elements within a range), and estimate percentiles. ● It is somewhat similar to a Bloom filter. The main difference is that a Bloom filter represents a set as a bitmap, while a Count-Min sketch represents a multiset and keeps a frequency distribution summary.
  • 32. Frequency Est. – Count-Min Sketch ● A Count-Min sketch is a two-dimensional array (d × w) of integer counters. When a value arrives, it is mapped to one position in each of the d rows using d different and preferably independent hash functions. The counter at each of these positions is incremented.
  • 33. Frequency Est. – Count-Min Sketch ● The estimate of the count for an item is the minimum of the counters at the array positions determined by the d hash functions. ● The space used by a Count-Min sketch is the array of w*d counters. By choosing appropriate values for d and w, a very small error can be achieved with high probability.
  • 34. Count-Min Sketch – implementation
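The slide's implementation is missing from the transcript, so here is a small Python sketch of the d × w counter array described on the previous slides. The class name, the row-salted SHA-256 hashing, and the optional `count` argument on `add` are assumptions made for this example; the `count` argument allows weighted updates such as adding a packet size instead of 1.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width: int, depth: int):
        self.w = width
        self.d = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item: str):
        # One column per row; the row number salts the hash so that each
        # row behaves like a different hash function.
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.w

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions only ever inflate counters, so the minimum over the
        # rows is the least distorted estimate.
        return min(self.table[row][col] for row, col in self._columns(item))


cms = CountMinSketch(width=256, depth=4)
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))  # at least 3; exactly 3 unless every row collides
print(cms.estimate("z"))  # 0 unless "z" collides with other items in every row
```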
  • 35. Count-Min Sketch – Properties ● Union can be performed by a cell-wise ADD operation ● O(d) query time (one counter lookup per hash function) ● Better accuracy for higher-frequency items (heavy hitters) ● Can only cause over-counting, never under-counting
  • 36. Count-Min Sketch – Notes ● The accuracy of a Count-Min sketch depends on the ratio between the sketch size and the total number of registered events. This means that the Count-Min technique provides significant memory gains only for skewed data, i.e. data where items have very different probabilities. ● The applicability of Count-Min sketches is not a straightforward question, and the best that can be recommended is an experimental evaluation of each particular case. ● A Count-Min sketch performs well on highly skewed data, but on low or moderately skewed data it is not so efficient because of poor protection from the high number of hash collisions: it simply selects the minimal (least distorted) estimator => Count-Mean-Min sketch
  • 37. Count-Mean-Min Sketch – implementation ● CMM estimates the noise for each hash function as the average value of all counters in the row that corresponds to this function (except the counter that corresponds to the query itself), subtracts it from the estimate for this hash function, and finally computes the median of the estimates over all hash functions.
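The noise-correction step described above can be sketched as follows. The update path is the same as for a Count-Min sketch; only the query changes. The class name and hashing scheme are illustrative assumptions, and the width must be at least 2 so that the row average over "all other counters" is defined.

```python
import hashlib
from statistics import median


class CountMeanMinSketch:
    def __init__(self, width: int, depth: int):
        self.w = width
        self.d = depth
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0  # total count of all registered events

    def _column(self, row: int, item: str) -> int:
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item: str, count: int = 1) -> None:
        self.total += count
        for row in range(self.d):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item: str) -> float:
        corrected = []
        for row in range(self.d):
            counter = self.table[row][self._column(row, item)]
            # Average of all other counters in this row. Every row sums to
            # `total`, so the others sum to total - counter.
            noise = (self.total - counter) / (self.w - 1)
            corrected.append(counter - noise)
        return median(corrected)


cmm = CountMeanMinSketch(width=64, depth=5)
cmm.add("a", 10)
print(cmm.estimate("a"))  # 10.0: with a single item there is no noise to subtract
```

Unlike the plain minimum, this estimate can come out negative for rare items; implementations may clamp it, for example to the range between zero and the Count-Min estimate.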
  • 38. Count-Min Sketch – Top-k problem Find all elements in the data set with frequencies greater than k percent of the total number of elements in the data set. ● Maintain a standard Count-Min sketch during the scan of the data set and put all elements into it. ● Maintain a heap of top elements, initially empty, and a counter N of the total number of already processed elements. ● For each element in the data set: ✔ Put the element into the sketch ✔ Estimate the frequency of the element using the sketch. If the frequency is greater than a threshold (k*N), then put the element into the heap. The heap should be periodically or continuously cleaned up to remove elements that no longer meet the threshold. ● In general, the top-k problem makes sense only for skewed data, so usage of Count-Min sketches is reasonable in this context.
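The procedure above can be sketched in Python as below. To keep the example short it tracks candidates in a plain set that is pruned periodically, where the slide uses a heap; the threshold logic (estimate greater than k*N) is the same. The compact `CountMinSketch` class, the function name `heavy_hitters`, and the parameters `fraction` and `prune_every` are assumptions for this illustration.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width: int, depth: int):
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _col(self, row: int, item: str) -> int:
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item: str) -> None:
        for row in range(self.d):
            self.table[row][self._col(row, item)] += 1

    def estimate(self, item: str) -> int:
        return min(self.table[row][self._col(row, item)] for row in range(self.d))


def heavy_hitters(stream, fraction, width=256, depth=4, prune_every=1000):
    """Return items whose estimated frequency exceeds fraction * N."""
    sketch = CountMinSketch(width, depth)
    candidates = set()  # stands in for the heap on the slide
    n = 0
    for item in stream:
        n += 1
        sketch.add(item)
        if sketch.estimate(item) > fraction * n:
            candidates.add(item)
        if n % prune_every == 0:
            # Periodic cleanup: drop items that fell below the moving threshold.
            candidates = {c for c in candidates if sketch.estimate(c) > fraction * n}
    return {c: sketch.estimate(c) for c in candidates
            if sketch.estimate(c) > fraction * n}


stream = ["a"] * 60 + ["b"] * 30 + [f"x{i}" for i in range(10)]
print(heavy_hitters(stream, fraction=0.2))  # includes "a" and "b"
```

Because the sketch only over-counts, a true heavy hitter is never missed at the moment it is checked; the cost is that a few light items may be reported when collisions inflate their estimates.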
  • 39. Percentile & Quantile Est. – t-digest ● The problem: calculating the median of a data set in a distributed environment (because the median of the per-partition medians is not equal to the overall median) => what's needed is an algorithm that can approximate the median while still being space efficient. ● The t-digest is a probabilistic data structure for estimating the median (and more generally any percentile) from either distributed data or streaming data. ● Internally, the data structure is a sparse representation of the cumulative distribution function (CDF). After ingesting data, the data structure has learned the "interesting" points of the CDF, called centroids.
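A tiny worked example of why per-partition medians cannot simply be combined. The two partitions are made-up data chosen to make the effect visible.

```python
from statistics import median

# Two hypothetical partitions of the numbers 1..10, of unequal size.
partitions = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10]]

median_of_medians = median(median(p) for p in partitions)
true_median = median(x for p in partitions for x in p)

print(median_of_medians)  # 4.5 (the median of 2 and 7)
print(true_median)        # 5.5
```

A mergeable summary such as the t-digest avoids this by shipping a compact description of each partition's distribution instead of a single number.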
  • 40. Percentile & Quantile Est. – t-digest ● A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly, making it useful in map-reduce and parallel streaming applications. ● The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. ● The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. ● Quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests, even though t-digests are more compact when stored on disk.
  • 41. t-digest – characteristics ● has smaller summaries than Q-digest ● works on doubles as well as integers ● provides part-per-million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles ● is fast ● is very simple ● can be used with map-reduce very easily because digests can be merged
  • 42. Some remarks ● For some structures like HyperLogLog or Bloom filter, there are simple and practical formulas to determine the parameters of the structure on the basis of expected data volume and required error probability. ● Other structures like Count-(Mean-)Min Sketch have a complex dependency on the statistical properties of the data, and experiments are the only reasonable way to understand their applicability to real use cases. ● Data-structures populated by different data sets can often be combined to process complex queries. ● Some types of queries can be supported by using customized versions of the described data-structures/algorithms.
  • 43. Case Study 1 ● There is a system that tracks a huge number of web events, and each event is marked by a number of tags, including the ID of the user this event corresponds to. It is required to report the number of unique users that meet a specified combination of tags (like users from city C that visited site A or site B).
  • 44. Case Study 1: solution ● Solution 1: ✔ Maintain a BF that tracks user IDs for each tag value, and a BF that contains the user IDs that correspond to the final result. ✔ A user ID from each incoming event is tested against the per-tag filters: does it satisfy the required combination of tags or not? ✔ If the user ID passes this test, it is additionally tested against the BF that corresponds to the report itself and, if it is not yet in that BF, it is added and the final report counter is increased. ● Solution 2: use an HLL for each tag value
  • 45. Case Study 2 ● There is a system that receives events on user visits from different internet sites. ● This system enables analysts to query the number of unique visitors for a specified date range and site.
  • 46. Case Study 2: solution ● HLL can be used to aggregate information about visitor IDs for each day and site. The sketch for each day is saved, and a query can be processed by merging the daily sketches (for HLL, taking the register-wise maximum; bitwise OR-ing applies to bitmap-based counters such as Linear Counting masks).
  • 47. Case Study 3 ● There is a system that tracks traffic by IP address, and it is required to detect the most traffic-intensive addresses.
  • 48. Case Study 3: solution ● CMS?!! ● The problem is not trivial because we need to track the total traffic for each address, not the frequency of items. ● Counters in the CMS implementation can be incremented not by 1, but by the absolute amount of traffic for each observation (i.e. the size of the IP packet if the sketch is updated for each packet). ● In this case, the sketch will track the amount of traffic for each address, and a heap with the most traffic-intensive addresses can be maintained (top-k or heavy-hitter).
  • 49. Case Study 4 ● There is a system that monitors traffic and counts unique visitors for different criteria (visited site, geography, etc.). ● It is required to compute the 100 most popular sites using the number of unique visitors as the metric of popularity. ● Popularity should be computed every day on the basis of data for the last 30 days, i.e. every day a one-day partition is added and another one is removed from the scope.
  • 50. Case Study 4: solution ● Create a fresh set of per-site HLL counters every day and maintain this set for 30 days, i.e. 30 sets of counters are active at any moment in time.
  • 51. Case Study 5 ● Count the number of users performing an action (view, click, ...) on site objects (banner, button, ...) 1 time, 2 times, ..., 10+ times ● The report looks like this: Filter: Object=X ✔ 1 time: 98765 ✔ 2 times: 76543 ✔ 3 times: 54321 ✔ … ✔ 9 times: 1234 ✔ 10+ times: 343
  • 52. Case Study 5: solution ● Should we use CMS??? ● … and why/why NOT???
  • 53. Case Study 5: solution ● Use a scalable layered BF to track how many times (k) each user has acted on each object ● Use one HLL per k to count the users who performed the action k times
  • 54. What else? ● Libs ✔ Redis: HLL already built in, BF expected in the next release (3.2) ✔ https://github.com/twitter/algebird ✔ https://github.com/addthis/stream-lib ✔ https://github.com/tylertreat/BoomFilters ✔ https://github.com/tdunning/t-digest ● More ✔ Linear Counting ✔ MinHash ✔ Top-K
  • 55. References ● https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ ● https://dzone.com/articles/introduction-probabilistic-0 ● http://bravenewgeek.com/stream-processing-and-probabilistic-methods/ ● https://www.somethingsimilar.com/2012/05/21/the-opposite-of-a-bloom-filter/ ● https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest