2. 2
Large Stream of Events
- Did this IP visit me before?
- How many unique IPs have we seen this month?
- How many times did I see this IP?
- What is the median transaction value? The top 1% value?
- What are the most common collections of fonts available?
4. 4
Did I see this value before?
If we are willing to accept an arbitrarily low chance of false positives, we can solve this problem with Bloom filters.
5. 5
Bloom Filter
Hash each value and turn on the bit for that hash bucket.
Repeat with k different hash functions. To query, ask whether the bits for all k hash functions are set.
Some false positives, no false negatives.
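The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the trick of deriving k hash functions by prefixing a seed to SHA-256 input are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: num_bits buckets, num_hashes seeded hashes."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bucket for readability; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _indices(self, value: str):
        # Assumption: k hash functions are simulated by seeding SHA-256.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for i in self._indices(value):
            self.bits[i] = 1

    def might_contain(self, value: str) -> bool:
        # True may be a false positive; False is never a false negative.
        return all(self.bits[i] for i in self._indices(value))


seen = BloomFilter(num_bits=10_000, num_hashes=5)
for ip in ["10.0.0.1", "10.0.0.2", "192.168.1.7"]:
    seen.add(ip)
```

Calling `seen.might_contain("10.0.0.1")` returns `True` for every value that was added. For a value that was never added it usually returns `False`, with a false-positive chance that shrinks as `num_bits` grows relative to the number of stored values.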
6. 6
Cardinality estimation
If we hash all values and calculate the minimum of all hashes, what is the expected minimum value?
7. 7
Cardinality estimation
Let hash(x) : X => [0,1] be uniformly pseudo-random.
E[min(hash(x))] = 1/(k+1), where k is the number of distinct elements.
The observed minimum is an unbiased estimator of 1/(k+1), so k can be estimated as 1/min - 1.
If we repeat with several different hash functions, we can average the resulting estimates to reduce the variance.
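As a rough illustration of the minimum-hash idea, the sketch below tracks one minimum per hash function, averages the minima (each an estimate of 1/(k+1)), and inverts the average. The helper names `unit_hash` and `estimate_cardinality`, the seeded SHA-256 construction, and the choice of 200 hash functions are assumptions for this example only.

```python
import hashlib


def unit_hash(value: str, seed: int) -> float:
    """Map a value to a pseudo-random float in [0, 1] via seeded SHA-256."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_cardinality(stream, num_hashes: int = 200) -> float:
    # One running minimum per hash function; duplicates never change a minimum.
    minima = [1.0] * num_hashes
    for value in stream:
        for seed in range(num_hashes):
            h = unit_hash(value, seed)
            if h < minima[seed]:
                minima[seed] = h
    mean_min = sum(minima) / num_hashes  # estimates 1 / (k + 1)
    return 1.0 / mean_min - 1.0


# 2,000 events drawn from 500 distinct values.
stream = [f"ip-{i % 500}" for i in range(2_000)]
estimate = estimate_cardinality(stream)
```

The memory used is one float per hash function, independent of the stream length. The result is only approximate; more hash functions tighten it at the cost of more hashing per event.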
8. 8
How many times did we see this value?
Count-min sketch
Counting Bloom filters.
Hash the value and increment a counter at the hashed index.
Use multiple hash functions, each with a separate table (column), and return the minimum of all estimates.
Produces a biased estimate: estimate >= actual.
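A minimal count-min sketch along these lines might look as follows. The class name `CountMinSketch`, the parameters `width` and `depth`, and the seeded SHA-256 hashing are assumptions for the example; each hash function gets its own table, stored here as one row of a list of lists.

```python
import hashlib


class CountMinSketch:
    """Minimal count-min sketch: depth hash functions, one table of width counters each."""

    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, value: str, row: int) -> int:
        # Assumption: one hash function per table, simulated by seeding SHA-256.
        digest = hashlib.sha256(f"{row}:{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.tables[row][self._index(value, row)] += count

    def estimate(self, value: str) -> int:
        # Collisions only ever inflate a counter, so the minimum is the best guess
        # and can never fall below the true count.
        return min(
            self.tables[row][self._index(value, row)] for row in range(self.depth)
        )


sketch = CountMinSketch(width=1000, depth=4)
for ip in ["10.0.0.1"] * 5 + ["10.0.0.2"] * 2:
    sketch.add(ip)
```

Here `sketch.estimate("10.0.0.1")` is at least 5 and `sketch.estimate("10.0.0.2")` is at least 2. Any overcount comes from other values colliding in every table at once, which becomes less likely as `width` and `depth` grow.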
9. 9
Median estimation
Naive: sample and calculate the median on the sample.
Remedian: calculate the median of medians (of medians…).
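The median-of-medians idea can be sketched with a stack of small fixed-size buffers: when a buffer fills, its median is pushed one level up and the buffer is cleared. The class name `Remedian`, the default buffer size of 11, and the simplified final step (taking the median of the highest non-empty buffer rather than a weighted median across all levels) are assumptions made to keep the example short.

```python
import random
from statistics import median


class Remedian:
    """Simplified remedian: buffers of medians of medians (of medians...)."""

    def __init__(self, buffer_size: int = 11):
        self.buffer_size = buffer_size  # an odd size makes each median an actual element
        self.levels = [[]]

    def add(self, x: float) -> None:
        level = 0
        while True:
            if level == len(self.levels):
                self.levels.append([])
            self.levels[level].append(x)
            if len(self.levels[level]) < self.buffer_size:
                return
            # Buffer is full: pass its median up one level and start over.
            x = median(self.levels[level])
            self.levels[level].clear()
            level += 1

    def estimate(self) -> float:
        # Simplification: use only the highest non-empty buffer.
        for buf in reversed(self.levels):
            if buf:
                return median(buf)
        raise ValueError("no values added yet")


values = list(range(121))  # true median is 60
random.Random(0).shuffle(values)
r = Remedian(buffer_size=11)
for v in values:
    r.add(v)
approx_median = r.estimate()
```

Memory grows with the number of levels, roughly the logarithm of the stream length in base `buffer_size`, rather than with the stream itself. The estimate is approximate: for 121 values with buffers of 11 it is only guaranteed to land between the 36th smallest and 36th largest element.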
10. 10
Biased quantile estimators
Naive: sample and calculate the quantile on the sample.
Sample and keep the top K.
Manku: maintain eps-approximate counts and quantiles. Keep counts of values in intervals, and keep the intervals balanced.
11. 11
What are the top K most frequent values?
Provably requires at least linear space, Ω(N).
Even finding just the single most common value does.
Relax to the K-heavy-hitters problem: find all values with frequency at least 1/K.
Approximate K heavy hitters: return all values with frequency more than 1/K, and return no value with frequency below 1/K - epsilon.
12. 12
Frequent algorithm
Initialize an empty Map m from elements to counters
def add(a)
  if m.contains(a) m(a) += 1
  else if m.size < k m(a) = 1
  else
    decrease all counters in m by 1
    remove any elements with count = 0
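A direct Python transcription of the pseudocode above, as a sketch. The function name `frequent` and the example stream are made up for illustration. With at most k counters, any value occurring more than N/(k+1) times in a stream of N events is still in the map at the end, but the surviving counts are underestimates and the map can also hold values that are not frequent.

```python
def frequent(stream, k: int) -> dict:
    """Track at most k candidate heavy hitters with approximate counters."""
    counters = {}
    for a in stream:
        if a in counters:
            counters[a] += 1
        elif len(counters) < k:
            counters[a] = 1
        else:
            # Map is full: decrement every counter and drop those that hit zero.
            # The arriving element itself is discarded.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a"] * 6 + ["b", "c", "d"] + ["a"] * 3 + ["e", "b"]
candidates = frequent(stream, k=2)
# candidates == {"a": 7, "b": 1}; "a" really occurs 9 times out of 14.
```

Here "a" is a true heavy hitter and survives with an undercount, while "b" is only in the map because it arrived last. A second pass over the data, where one is possible, can confirm which candidates are genuinely frequent.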
14. 14
Sampling K elements from a stream of N
Algorithm             Extra memory  Accurate results                     Materialized result
Shuffle and take      N elements    Yes                                  Yes
Reservoir             K elements    Yes                                  Yes
Indices reservoir     K indices     Yes                                  No
Independent sample    O(1)          Length not guaranteed                No
Accurate independent  O(1)          Slight correlation between elements  No
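To make the "Reservoir" row concrete, here is a minimal sketch of classic reservoir sampling, which keeps exactly K elements in memory and leaves every element of the stream equally likely to be in the final sample. The function name `reservoir_sample` and the explicit `rng` parameter are assumptions for the example.

```python
import random


def reservoir_sample(stream, k: int, rng: random.Random) -> list:
    """Uniformly sample k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000), k=10, rng=random.Random(42))
```

The sample is materialized at every point in the stream, which is why the table marks it "Yes" in the last column. If the stream has fewer than K elements, this sketch simply returns all of them.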