Talk 2: Introduction to some probabilistic data-structures for real-time processing - Vo Viet Hung (Vietnamese, slides in English)
An overview of probabilistic data structures (PDS): definitions, properties, and purposes.
An introduction to some PDS such as BloomFilter, HyperLogLog, Count-Min Sketch.
An outline of the design ideas and construction algorithms behind these structures, and the formulas that control their accuracy.
Some concrete application case studies.
Bio: Vo Viet Hung has more than 15 years of experience in the IT industry, in many roles ranging from developer and sys-admin to DBA and big-data analyst.
He has worked on many high-performance systems serving billions of requests per month.
https://www.facebook.com/events/1253414338006273/
Grokking TechTalk #11 - An introduction to probabilistic data-structures (and algorithms)
1. An introduction to probabilistic data-structures (and algorithms)
== Grokking Engineering, April 2016 ==
Võ Việt Hùng
vvhung@gmail.com
2. Who am I?
● A technical guy who has been working in IT for 15+ years
✔ In many roles: developer, sys-admin, DBA, big-data analyst
✔ Large systems: billions of requests per month
● Current: one of the biggest ad networks in Vietnam
● Past
✔ VNG: Zing Ads, Zing Me, ...
✔ Vietnamworks/Navigos Group...
3. Agenda (1)
● A real-world problem
● Probabilistic data structures (PDS), what?
● PDS, why?
● Some characteristics of PDS
● Some common PDS
● Membership Query – BloomFilter
● Cardinality Estimation – HyperLogLog
● Frequency Estimation – Count-Min Sketch
● Percentile and Quantile Estimation – t-digest
5. A real-world problem (1)
When processing data sets, we often want to do some simple checks
(queries) like:
● Does the data set contain a particular element (membership query)?
● How many distinct elements are in the data set (i.e. what is the
cardinality of the data set)?
● What are the most frequent elements (i.e. top-k elements)?
● What are the frequencies of the most frequent elements?
● What is the mean/median value of some quantity in the data set?
6. A real-world problem (2)
The common approach is to use some kind of deterministic data structure,
like a HashSet or Hashtable, for such purposes.
Another approach is to use a database and then run SQL queries.
But as the data grows and the demand for fast responses rises, problems
appear: memory and CPU limits, slow queries.
7. Probabilistic data structures (PDS)
● PDS are a group of data structures that are extremely
useful for big data and streaming/realtime applications.
● These data structures use hash functions to randomize
and compactly represent a set of items.
● Collisions are ignored, but errors can be well-controlled
under a certain threshold.
8. Why?
to deal with
● fast responses
● (very) large data that cannot fit in memory
● data that (can) only be processed in one pass
● incremental updates (of results)
● no need for 100% correctness, just a controllable approximation
9. PDS characteristics
(compared with error-free approaches)
● trade accuracy for space and performance
● use less memory
● have constant (and short) query time
● (usually) support union and intersection operations
● can be merged => map-reduce friendly
● can be parallelized and distributed
11. Membership Query – Bloom Filter
● conceived by Burton Howard Bloom in 1970
● is used to test whether an element is a member of a set
● False-positive matches are possible, but false-negatives
are not. In other words, a query returns either "possibly in
set" or "definitely not in set"
● Elements can be added to the set, but not removed
(though this can be addressed with a "counting" filter).
● The more elements that are added to the set, the larger
the probability of false positives.
12. BloomFilter – algorithm behind
● effectively a hash table where collisions are ignored and
each element added to the table is hashed by some number k
of hash functions.
● There is one major difference: a bloom filter does NOT
store the hashed keys.
● Instead, it has a bit array as its underlying data structure;
each key is remembered by flipping on all of the bits the k
hash functions map it to.
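A minimal Python sketch of this idea (SHA-256 with a per-function prefix stands in for k independent hash functions; the class and parameter names are illustrative, not from the talk):

import hashlib

class BloomFilter:
    """A minimal Bloom filter: an m-slot bit array and k hash functions."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k positions by hashing "i:item" for i = 0..k-1.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # All k bits set => "possibly in set"; any unset bit => "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, k=3)
bf.add("alice")
print("alice" in bf)  # True: possibly in set
print("bob" in bf)    # almost certainly False: definitely not in set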
14. BloomFilter – Properties
● Unlike a standard hash table, a BF of a fixed size can represent a set
with an arbitrarily large number of elements
● adding an element never fails due to the data structure "filling up"
● Union and intersection of BFs with the same size and set of hash
functions can be implemented with bitwise OR (union) and AND
(intersection)
● The union operation on BFs is lossless in the sense that the resulting
BF is the same as the BF created from scratch using the union of the
two sets.
● The intersect operation satisfies a weaker property: the false-
positive probability in the resulting BF is at most the false-positive
probability in one of the constituent BFs, but may be larger than the
false-positive probability in the BF created from scratch using the
intersection of the two sets.
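As a sketch of these properties, using the hypothetical BloomFilter class above (both filters must share m and the hash functions):

def union(a, b):
    # Bitwise OR: lossless, identical to a filter built from both sets.
    out = BloomFilter(a.m, a.k)
    out.bits = bytearray(x | y for x, y in zip(a.bits, b.bits))
    return out

def intersect(a, b):
    # Bitwise AND: the false-positive rate may exceed that of a filter
    # built from scratch on the true intersection.
    out = BloomFilter(a.m, a.k)
    out.bits = bytearray(x & y for x, y in zip(a.bits, b.bits))
    return out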
19. BloomFilter extension – Counting
● Counting BFs provide a way to implement a delete operation
on a BF without recreating the filter afresh.
● In a counting filter the array positions (buckets) are extended
from being a single bit to being an n-bit counter.
● When an item is added, the corresponding counters are
incremented, and when it’s removed, the counters are
decremented.
● Counting BF takes n-times more space than a regular BF,
but it also has a scalability limit. Because the counting BF
table cannot be expanded, the maximal number of keys to be
stored simultaneously in the filter must be known in advance.
Once the designed capacity of the table is exceeded, the false
positive rate will grow rapidly as more keys are inserted.
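A sketch of the counting variant on top of the BloomFilter class above (a bytearray gives 8-bit counters; a real implementation must guard against counter overflow):

class CountingBloomFilter(BloomFilter):
    """Buckets are small counters instead of single bits, so removal works."""
    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] += 1  # 8-bit bucket; overflow past 255 must be guarded

    def remove(self, item):
        # Only safe for items that were actually added earlier.
        for pos in self._positions(item):
            if self.bits[pos] > 0:
                self.bits[pos] -= 1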
20. BloomFilter extension – Scalable
● Standard BFs require knowing the size of the data set
ahead of time in order to keep the error probability controllable
● Scalable BFs are useful for cases where the size of the
data set isn’t known a priori and memory constraints
aren’t of particular concern.
● Scalable BF is essentially an array of BFs. New elements
are added to the last filter. When this filter becomes “full” –
when it reaches a target fill ratio – a new filter is added
with a tightened error probability.
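A simplified sketch of the idea (real scalable BFs tighten each new filter's false-positive budget geometrically; here each new filter merely gets a larger bit array, which is an assumption for brevity):

class ScalableBloomFilter:
    def __init__(self, m=1024, k=3, fill_ratio=0.5, growth=2):
        self.k, self.fill_ratio, self.growth = k, fill_ratio, growth
        self.filters = [BloomFilter(m, k)]

    def add(self, item):
        last = self.filters[-1]
        if sum(last.bits) / last.m >= self.fill_ratio:  # last filter is "full"
            last = BloomFilter(last.m * self.growth, self.k)
            self.filters.append(last)
        last.add(item)

    def __contains__(self, item):
        return any(item in f for f in self.filters)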
22. BloomFilter extension – Stable
● Stable BF is a variant of BFs for detecting duplicates in
unbounded data streams with limited space (memory).
In particular, if the stream is not uniformly distributed,
meaning duplicates are likely to be grouped closer
together, the rate of false positives becomes immaterial.
● Since there is no way to store the entire history of a stream
(which can be infinite), Stable BFs continuously evict stale
information to make room for more recent elements.
● Since stale information is evicted, the Stable BF introduces
false negatives, which do not appear in traditional Bloom
filters. But a tight upper bound of false positive rates is
guaranteed.
23. BloomFilter extension – Layered
● A layered BF consists of multiple BF layers.
● Layered BFs allow keeping track of how many times an
item was added to the BF by checking how many layers
contain the item.
● With a layered BF a check operation will normally return
the deepest layer number the item was found in.
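A sketch of a layered filter built from the BloomFilter class above (depth and sizes are illustrative):

class LayeredBloomFilter:
    """A stack of filters; an item's depth approximates its add count."""
    def __init__(self, depth=10, m=1024, k=3):
        self.layers = [BloomFilter(m, k) for _ in range(depth)]

    def add(self, item):
        for layer in self.layers:
            if item not in layer:  # set it in the first layer missing it
                layer.add(item)
                return

    def count(self, item):
        depth = 0
        for layer in self.layers:
            if item not in layer:
                break
            depth += 1
        return depth  # deepest layer the item was found in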
24. BloomFilter extension – Inverse
● Inverse BF is an “opposite” of BF. It may report a false
negative but can never report a false positive. That is, it
may indicate that an item has not been seen when it
actually has, but it will never report an item as seen which
it hasn’t come across.
● Inverse BF behaves in a similar manner to a fixed-size
hash map of m buckets which doesn’t handle conflicts, but
it provides lock-free concurrency using an underlying CAS.
● Inverse BF is a nice option for dealing with unbounded
streams or large data sets due to its limited memory usage.
If duplicates are close together, the rate of false negatives
becomes vanishingly small with an adequately sized filter.
25. BloomFilter – Applications
● Akamai's web servers use Bloom filters to prevent "one-hit-
wonders" from being stored in its disk caches
● Google BigTable, Apache HBase and Apache Cassandra use
Bloom filters to reduce the disk lookups for non-existent rows or
columns
● Google Chrome web browser used to use a Bloom filter to identify
malicious URLs. Any URL was first checked against a local Bloom
filter, and only if the Bloom filter returned a positive result was a full
check of the URL performed
● The Squid Web Proxy Cache uses Bloom filters for cache digests
● Bitcoin uses Bloom filters to speed up wallet synchronization
● The Exim mail transfer agent (MTA) uses Bloom filters in its rate-
limit feature
27. Cardinality Estimation – HyperLogLog
● a streaming algorithm used for estimating the number of
distinct elements (cardinality) of very large data sets.
● HyperLogLog counter can count one billion distinct items
with an accuracy of 2% using only 1.5 KB of memory.
● It is based on a bit-pattern observation: for a stream of
randomly distributed numbers, if the maximum number of leading
zero bits observed in any number is k, the cardinality of the
stream is very likely around 2^k.
28. HyperLogLog – simple explanation
● For example, with four bits there are only 16 possible values.
If the highest number of leading zeroes seen in our stream is
three (000...), the probability of that pattern is 2 in 16
(i.e. 1 in 8), so we conclude that the cardinality of our
streaming set is about 8.
29. HyperLogLog – more details
● In the HLL algorithm, a hash function is applied to each
element in the original multiset (a set which allows multiple
occurrences of its elements), to obtain a multiset of uniformly
distributed random numbers with the same cardinality as the
original multiset. The cardinality of this randomly distributed
set can then be estimated using the algorithm above.
● The simple estimate of cardinality obtained using the algorithm
above has the disadvantage of a large variance. In the
HyperLogLog algorithm, the variance is minimised by splitting
the multiset into numerous subsets, calculating the maximum
number of leading zeros in the numbers in each of these
subsets, and using a harmonic mean to combine these
estimates for each subset into an estimate of the cardinality of
the whole set.
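A minimal sketch of the estimator just described (2^p registers, register-wise maximum of leading-zero ranks, harmonic mean); it omits the small- and large-range corrections of the full algorithm, and the names and parameters are illustrative:

import hashlib

class HyperLogLog:
    def __init__(self, p=14):
        self.p, self.m = p, 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item):
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)               # first p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits give the rank
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Harmonic mean of 2^register over all m registers.
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m * z

hll = HyperLogLog()
for i in range(100_000):
    hll.add(i)
print(round(hll.count()))  # within a few percent of 100000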
31. Frequency Est. – Count-Min Sketch
● Count-Min Sketch is a family of memory-efficient data
structures that allow one to estimate frequency-related
properties of the data set, e.g. estimate frequencies of
particular elements, find top-K frequent elements,
perform range queries (where the goal is to find the sum of
frequencies of elements within a range), estimate
percentiles
● It is somewhat similar to a Bloom filter. The main difference
is that a Bloom filter represents a set as a bitmap, while a
Count-Min sketch represents a multi-set and keeps a frequency
distribution summary.
32. Frequency Est. – Count-Min Sketch
● Count-Min sketch is a two-dimensional array (d × w) of
integer counters. When a value arrives, it is mapped to one
position in each of the d rows using d different and preferably
independent hash functions, and the counters at those positions
are incremented.
33. Frequency Est. – Count-Min Sketch
● The estimate of the counts for an item is the minimum
value of the counts at the array positions determined by
the d hash functions.
● The space used by a Count-Min sketch is the array of w*d
counters. By choosing appropriate values for d and w, very
small error with high probability can be achieved.
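A minimal sketch of such a d × w structure (class and parameter names are illustrative):

import hashlib

class CountMinSketch:
    def __init__(self, d=5, w=2048):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _pos(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._pos(item, row)] += count

    def estimate(self, item):
        # Minimum over the d rows: may over-count due to collisions,
        # but never under-counts.
        return min(self.table[row][self._pos(item, row)] for row in range(self.d))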
35. Count-Min Sketch – Properties
● Union can be performed by cell-wise ADD operation
● O(d) query time (one lookup per hash function)
● Better accuracy for higher frequency items (heavy-hitters)
● Can only cause over-counting but not under-counting
36. Count-Min Sketch – Notes
● Accuracy of the Count-Min sketch depends on the ratio
between the sketch size and the total number of registered
events. This means that Count-Min technique provides
significant memory gains only for skewed data, i.e. data
where items have very different probabilities.
● Applicability of Count-Min sketches is not a straightforward
question and the best thing that can be recommended is
experimental evaluation of each particular case.
● Count-Min sketch performs well on highly skewed data, but
on low or moderately skewed data it is not so efficient
because of poor protection from the high number of hash
collisions – the Count-Min sketch simply selects the minimal
(least distorted) estimator => Count-Mean-Min sketch
37. Count-Mean-Min Sketch – implementation
● CMM estimates the noise for each hash function as the average
value of all counters in the row that corresponds to this
function (except the counter that corresponds to the query
itself), subtracts it from the estimate for this hash function,
and, finally, computes the median of the estimates over all
hash functions.
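A sketch of that estimator on top of the hypothetical CountMinSketch class above:

import statistics

class CountMeanMinSketch(CountMinSketch):
    def estimate(self, item):
        corrected = []
        for row in range(self.d):
            value = self.table[row][self._pos(item, row)]
            # Noise ≈ mean of all other counters in this row.
            noise = (sum(self.table[row]) - value) / (self.w - 1)
            corrected.append(value - noise)
        return statistics.median(corrected)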
38. Count-Min Sketch – Top-k problem
Find all elements in the data set with the frequencies greater than k
percent of the total number of elements in the data set.
● Maintain a standard Count-Min sketch during the scan of the data set
and put all elements into it.
● Maintain a heap of top elements, initially empty, and a counter N of the
total number of already processed elements.
● For each element in the data set:
✔ Put the element into the sketch
✔ Estimate the frequency of the element using the sketch. If the frequency
is greater than the threshold (k*N), put the element into the heap.
The heap should be periodically or continuously cleaned up to remove
elements that no longer meet the threshold (a sketch follows this list).
● In general, the top-k problem makes sense only for skewed data, so
usage of Count-Min sketches is reasonable in this context.
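A sketch of this procedure, reusing the CountMinSketch class above (a dict of candidates stands in for the heap, and the cleanup cadence is arbitrary):

def heavy_hitters(stream, k_percent, cms):
    candidates = {}  # element -> last frequency estimate
    n = 0            # total number of processed elements
    for element in stream:
        cms.add(element)
        n += 1
        threshold = k_percent / 100.0 * n
        est = cms.estimate(element)
        if est >= threshold:
            candidates[element] = est
        if n % 1000 == 0:  # periodic cleanup of stale candidates
            candidates = {e: c for e, c in candidates.items() if c >= threshold}
    return candidates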
39. Percentile & Quantile Est. – t-digest
● Consider the problem of calculating the median of a dataset in a
distributed environment (the median of medians is not equal to the
median) => what is needed is an algorithm that can approximate the
median while still being space efficient.
● The t-digest is a probabilistic data structure for estimating the
median (and, more generally, any percentile) from either distributed
data or streaming data.
● Internally, the data structure is a sparse representation of
the cumulative distribution function. After ingesting data,
the data structure has learned the "interesting" points of
the CDF, called centroids.
40. Percentile & Quantile Est. – t-digest
● A new data structure for accurate on-line accumulation of rank-based
statistics such as quantiles and trimmed means. The t-digest
algorithm is also very parallel friendly making it useful in map-reduce
and parallel streaming applications.
● The t-digest construction algorithm uses a variant of 1-dimensional
k-means clustering to produce a data structure that is related to the
Q-digest. This t-digest data structure can be used to estimate quantiles
or compute other rank statistics.
● The advantage of the t-digest over the Q-digest is that the t-digest can
handle floating point values while the Q-digest is limited to integers.
With small changes, the t-digest can handle any values from any
ordered set that has something akin to a mean.
● The accuracy of quantile estimates produced by t-digests can be orders
of magnitude more accurate than those produced by Q-digests in spite
of the fact that t-digests are more compact when stored on disk.
41. t-digest – characteristics
● has smaller summaries than Q-digest
● works on doubles as well as integers.
● provides part per million accuracy for extreme quantiles
and typically <1000 ppm accuracy for middle quantiles
● is fast
● is very simple
● can be used with map-reduce very easily because digests
can be merged
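A hedged usage sketch, assuming the third-party Python tdigest package (https://github.com/CamDavidsonPilon/tdigest); the exact API here is an assumption, not something shown in the talk:

from random import random
from tdigest import TDigest  # assumed API: update(), percentile(), +

d1, d2 = TDigest(), TDigest()
for _ in range(5000):
    d1.update(random())  # e.g. one digest per worker or partition
    d2.update(random())

merged = d1 + d2              # digests can be merged (map-reduce friendly)
print(merged.percentile(50))  # approximate median, close to 0.5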
42. Some remarks
● For some structures like HyperLogLog or Bloom filter, there are
simple and practical formulas to determine the parameters of the
structure from the expected data volume and the required error
probability (see the Bloom filter example after this list).
● Other structures like Count-(Mean-)Min Sketch have
complex dependency on statistical properties of data and
experiments are the only reasonable way to understand their
applicability to real use cases.
● Data-structures populated by different data sets can often be
combined to process complex queries.
● Some types of queries can be supported by using
customized versions of the described data-structures/
algorithms.
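For the Bloom filter, for instance, the standard sizing formulas for n expected elements and target false-positive probability p are (a well-known result, not a formula taken from the slides):

m = -\frac{n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n} \ln 2

For n = 10^6 and p = 0.01 this gives m ≈ 9.6 million bits (about 1.2 MB) and k ≈ 7 hash functions.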
43. Case Study 1
● There is a system that tracks a huge number of web events, and
each event is marked by a number of tags, including the ID of the
user the event corresponds to.
It is required to report the number of unique users that meet a
specified combination of tags (like users from city C that visited
site A or site B).
44. Case Study 1: solution
● Solution 1:
✔ maintain a BF that tracks user IDs for each tag value
and a BF that contains user IDs that correspond to the
final result.
✔ A user ID from each incoming event is tested against
the per-tag filters – does it satisfy the required
combination of tags or not.
✔ If the user ID passes this test, it is additionally checked
against the BF that corresponds to the report itself; if it is
not there yet, it is added and the final report counter is
increased.
● Solution 2: using HLL for each tag value
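One possible reading of Solution 1 as a sketch, reusing the hypothetical BloomFilter class from earlier (the tag names and sizes are made up for illustration):

tag_filters = {tag: BloomFilter(m=1 << 20, k=5)
               for tag in ("city:C", "site:A", "site:B")}
report_filter = BloomFilter(m=1 << 20, k=5)
unique_users = 0

def on_event(user_id, tags):
    global unique_users
    for tag in tags:
        if tag in tag_filters:
            tag_filters[tag].add(user_id)
    # Does this user satisfy "from city C AND visited site A or B"?
    if (user_id in tag_filters["city:C"]
            and (user_id in tag_filters["site:A"] or user_id in tag_filters["site:B"])):
        if user_id not in report_filter:  # not counted for this report yet
            report_filter.add(user_id)
            unique_users += 1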
45. Case Study 2
● There is a system that receives events on user visits from
different internet sites.
● This system enables analysts to query the number of unique
visitors for a specified date range and site.
46. Case Study 2: solution
● HLL can be used to aggregate information about visitor IDs for
each day and site; the HLL state for each day is saved, and a
query can be processed by merging the daily counters (for HLLs
this means taking the register-wise maximum rather than a plain
bitwise OR), as sketched below.
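A sketch of the query using the hypothetical HyperLogLog class from earlier:

def merge(hlls):
    # Merging HLLs = register-wise maximum.
    out = HyperLogLog(p=hlls[0].p)
    for h in hlls:
        out.registers = [max(a, b) for a, b in zip(out.registers, h.registers)]
    return out

# unique visitors of site S over a date range:
# merge([daily_hll[(S, day)] for day in date_range]).count()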
47. Case Study 3
● There is a system that tracks traffic by IP address, and it is
required to detect the most traffic-intensive addresses.
48. Case Study 3: solution
● CMS?!!
● The problem is not trivial because we need to track the total
traffic for each address, not the frequency of items.
● Counters in the CMS implementation can be incremented not by 1,
but by the absolute amount of traffic for each observation (i.e.
the size of the IP packet, if the sketch is updated for each packet).
● In this case, the sketch will track the amount of traffic for
each address, and a heap with the most traffic-intensive addresses
can be maintained (top-k or heavy-hitters), as sketched below.
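A sketch using the hypothetical CountMinSketch class from earlier, weighting each update by packet size:

traffic = CountMinSketch(d=5, w=1 << 16)

def on_packet(src_ip, packet_size):
    traffic.add(src_ip, count=packet_size)  # increment by bytes, not by 1

# traffic.estimate("10.0.0.1") ≈ total bytes sent by that address (never under-counts)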
49. Case Study 4
● There is a system that monitors traffic and counts
unique visitors for different criteria (visited site,
geography, etc.).
● It is required to compute 100 most popular sites using a
number of unique visitors as a metric of popularity.
● Popularity should be computed every day on the basis of data for
the last 30 days, i.e. every day a one-day partition is added and
another one is removed from the scope.
50. Case Study 4: solution
● create a fresh set of per-site HLL counters every day and maintain
this set for 30 days, i.e. 30 sets of counters are active at any
moment in time. To answer the query, merge the 30 daily counters for
each site and rank the sites by estimated unique-visitor count.
51. Case Study 5
● Count the number of users performing an action (view, click, ...) on
site objects (banner, button, ...) 1-times, 2-times, ..., 10+-times
● The report looks like below:
Filter: Object=X
1-times: 98765
2-times: 76543
3-times: 54321
…
9-times: 1234
10+-times: 343
52. Case Study 5: solution
● Should we use CMS???
● … and why/why NOT???
53. Case Study 5: solution
● Use a scalable layered BF to track k-times user actions on
objects
● Use HLL to count the users for each k-times action
54. What else?
● Libs
✔ Redis: HLL already, BF in next 3.2
✔ https://github.com/twitter/algebird
✔ https://github.com/addthis/stream-lib
✔ https://github.com/tylertreat/BoomFilters
✔ https://github.com/tdunning/t-digest
● More
✔ Linear Counting
✔ MinHash
✔ Top-K