Slides from my talk at the UAI 2012 conference.
We describe 北斎 Hokusai, a real time system which is able to capture frequency information for streams of arbitrary sequences of symbols. The algorithm uses the CountMin sketch as its basis and exploits the fact that sketching is linear. It provides real time statistics of arbitrary events, e.g. streams of queries as a function of time. We use a factorizing approximation to provide point estimates at arbitrary (time, item) combinations. Queries can be answered in constant time.
1. Hokusai
Sketching streams in real time
Sergiy Matusevych¹
Alexander J. Smola²
Amr Ahmed²
¹ Yahoo! Research, Santa Clara, CA
² Google, Mountain View, CA
UAI 2012
Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1
2. Thanks
Alex Smola
Google and CMU
Amr Ahmed
Google
3. Motivation
Compute frequencies of elements in the data stream
Item frequencies change over time.
Number of items unknown and variable.
Example - logging query frequency over time.
Applications
Flow counting for IP traffic (who sent what, when and how much)
Spam detection and filtering (detect bursts immediately)
Website analytics (feedback to editors, trend detection)
State of the art
CountMin sketch is instantaneous but does not log time.
Naive snapshotting costs linear memory.
MapReduce batch jobs provide exact counts but with long delays.
Resource constraints
Fixed memory footprint for entire sketch regardless of duration
High query throughput
Real time aggregation and response
5. Strategy
1. Use CountMin sketch to store snapshots of data.
(this solves the real time logging problem)
2. Compress snapshots linearly as they age
We care most about recent events
Logarithmic storage since $\sum_{t=1}^{T} t^{-1} = O(\log T)$
3. Exploit CountMin data structure for efficient compression
Variant 1: reduce storage per snapshot
Variant 2: increase timespan per snapshot
4. Interpolate between both variants for improved accuracy
6. CountMin Sketch (Cormode & Muthukrishnan)
$M \in \mathbb{R}^{d \times n}$ matrix
d hash functions
n bins
hash h1: M11 M12 M13 M14 M15 M16 ... M1n
hash h2: M21 M22 M23 M24 M25 M26 ... M2n
hash h3: M31 M32 M33 M34 M35 M36 ... M3n
Item x is hashed into one bin per row.
In-memory data structure for instantaneous retrieval
Aggregate statistic of observation interval (instantaneous retrieval)
Intuition — Bloom filter with integers
Algorithm
insert(x):
    for i = 1 to d do
        M[i, h_i(x)] ← M[i, h_i(x)] + 1
    end for
query(x):
    $\hat{n}_x \leftarrow \min_{i \in \{1, \dots, d\}} M[i, h_i(x)]$
    return $\hat{n}_x$
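The insert and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the class name `CountMinSketch`, the sizes, and the use of SHA-256 as a stand-in for the hash functions h_i are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Minimal CountMin sketch: d hash rows, n bins per row (illustrative only)."""

    def __init__(self, d, n):
        self.d, self.n = d, n
        self.M = [[0] * n for _ in range(d)]

    def _bin(self, i, x):
        # Row-specific hash h_i(x); SHA-256 is an assumed stand-in.
        digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.n

    def insert(self, x, count=1):
        # Increment one bin in every row.
        for i in range(self.d):
            self.M[i][self._bin(i, x)] += count

    def query(self, x):
        # Collisions only inflate bins, so the minimum over rows is the best estimate.
        return min(self.M[i][self._bin(i, x)] for i in range(self.d))


sketch = CountMinSketch(d=4, n=1024)
for term in ["wave", "wave", "fuji", "wave"]:
    sketch.insert(term)
print(sketch.query("wave"))  # never below the true count of 3
```

Because every bin only ever grows, the query can overestimate but never underestimate a count.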
7. Guarantees
Approximation guarantee
For a sketch with $d = \log \frac{1}{\delta}$ and $n = \frac{e}{\epsilon}$ we have with probability $1 - \delta$ that the estimate $\hat{n}_x$ deviates from the count $n_x$ via
$n_x \le \hat{n}_x \le n_x + \epsilon \sum_{x'} n_{x'}$ for all $x$.
Linear statistic of the data
Power law distributions with exponent z only use $O(N^{-1/z})$ space.
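As a worked example of the approximation guarantee, the sketch dimensions can be derived from a target accuracy. The helper name `countmin_dimensions` is hypothetical, and rounding both values up to integers is an assumption (the slide states the real-valued expressions).

```python
import math


def countmin_dimensions(epsilon, delta):
    """Return (d, n) for additive error epsilon * total with probability 1 - delta."""
    d = math.ceil(math.log(1.0 / delta))  # number of hash functions (rows)
    n = math.ceil(math.e / epsilon)       # number of bins per row
    return d, n


d, n = countmin_dimensions(epsilon=1e-3, delta=0.01)
print(d, n)  # 5 2719
```

With these settings, a stream of one million insertions would have each estimate off by at most about 1000 counts, with probability at least 0.99.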
8. Step 1: Combining time intervals
$M_T$ and $M_{T'}$ are sketches for time intervals $T$ and $T'$ with $T \cap T' = \emptyset$.
Combine sketches by adding them up: $M_{T \cup T'} = M_T + M_{T'}$
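Since sketching is linear, adding two sketches entry by entry gives the same result as sketching the combined stream. A small check of this, with assumed sizes, an assumed SHA-256 stand-in for the hash functions, and hypothetical helper names:

```python
import hashlib

D, N = 3, 64  # illustrative sketch dimensions (assumed)


def bin_index(i, x):
    digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % N


def sketch_of(items):
    """Build a D x N CountMin table for a list of items."""
    M = [[0] * N for _ in range(D)]
    for x in items:
        for i in range(D):
            M[i][bin_index(i, x)] += 1
    return M


def add_sketches(A, B):
    """Entry-wise sum of two sketches that share dimensions and hash functions."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


interval_1 = ["wave", "fuji", "wave"]
interval_2 = ["fuji", "kanagawa"]
merged = add_sketches(sketch_of(interval_1), sketch_of(interval_2))
print(merged == sketch_of(interval_1 + interval_2))  # True
```

The equality only holds when both sketches use the same hash functions and the same number of bins.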
9. Step 1: Efficient computation
Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \dots, 2^m\}$.
Insert into the leftmost aggregation interval.
Aggregate as cumulative sum from the left using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$
Computation is $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time, $O(\log t)$ space.
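One way to realize this doubling scheme is a carry-style update, where level j is touched only every 2^j ticks. The sketch below is an assumed reading of the slide, not the authors' code: `TimeAggregator`, the helper functions, and the sketch sizes are hypothetical, and each level here stores the most recent aligned block of 2^j ticks.

```python
import hashlib

D, N = 3, 64  # illustrative sketch dimensions (assumed)


def empty():
    return [[0] * N for _ in range(D)]


def bin_index(i, x):
    digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % N


def insert(M, x, count=1):
    for i in range(D):
        M[i][bin_index(i, x)] += count


def query(M, x):
    return min(M[i][bin_index(i, x)] for i in range(D))


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


class TimeAggregator:
    """levels[j] holds the sum over the most recent aligned block of 2**j ticks."""

    def __init__(self):
        self.t = 0
        self.levels = []

    def tick(self, unit):
        """Close one unit interval whose sketch is `unit`."""
        self.t += 1
        carry, j = unit, 0
        # Level j is updated only when 2**j divides t, i.e. every 2**j ticks.
        while self.t % (2 ** j) == 0:
            if j == len(self.levels):
                self.levels.append(empty())
            # The incoming block replaces level j; the old level j joins the carry.
            carry, self.levels[j] = add(carry, self.levels[j]), carry
            j += 1


agg = TimeAggregator()
for t in range(1, 9):
    unit = empty()
    insert(unit, "wave", count=t)  # tick t sees "wave" t times
    agg.tick(unit)

print([query(level, "wave") for level in agg.levels])  # [8, 15, 26, 36]
```

After eight ticks the four levels cover the last 1, 2, 4 and 8 ticks, so only a logarithmic number of sketches is kept.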
12. Step 2: Folding over
$M^b$ is a sketch with $n = 2^b$ bins.
$M^{b-1}$ can be obtained as
$M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$
by “folding over” the sketch
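The folding identity can be checked directly. The sketch below assumes that bin indices are computed as a full hash value modulo the number of bins, with the number of bins a power of two, so that a folded sketch matches one built at the smaller width. The helper names and SHA-256 are assumptions.

```python
import hashlib


def bin_index(i, x, n):
    # Full hash reduced modulo n; n is assumed to be a power of two.
    digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


def build(items, d, n):
    M = [[0] * n for _ in range(d)]
    for x in items:
        for i in range(d):
            M[i][bin_index(i, x, n)] += 1
    return M


def fold(M):
    """Halve the width: new bin j is the sum of old bins j and j + half."""
    half = len(M[0]) // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in M]


items = ["wave", "fuji", "wave", "kanagawa", "edo", "fuji", "wave"]
M4 = build(items, d=3, n=2 ** 4)
M3 = fold(M4)
print(M3 == build(items, d=3, n=2 ** 3))  # True
```

Folding keeps every row sum unchanged, so total counts are preserved while per-item accuracy drops.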
13. Step 2: Efficient computation
Halve the size of the sketch every $2^t$ intervals.
Computation costs O(1) time and O(log t) space.
Figure: interval 1 is stored as 1 x 16 bins, intervals 2-3 as 2 x 8 bins, intervals 4-7 as 4 x 4 bins, and so on.
14. Step 3: Resolution Interpolation
Time aggregation reports good estimate over long time interval.
Item aggregation reports poor estimate over short time interval.
Marginals of joint distribution — assume independence & interpolate
Figure: joint counts with marginals n(t) and n(x) and total n
Torso and Tail
Item aggregated estimate $n_x$
Time aggregated estimate $n_t$
Count interpolation
$\hat{n}_{xt} = \frac{n_x \cdot n_t}{n}$ where $n = \sum_t n_t = \sum_x n_x$
Head
Sketch accuracy decreases with e · t
Use regular CountMin sketch whenever
$\tilde{n}(x, t) > e \cdot t \cdot 2^{-b}$
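The count interpolation for the torso and tail is a product of marginals divided by the total. The toy numbers below are made up for illustration and stand in for values that would come from the item-aggregated and time-aggregated sketches; `interpolate` is a hypothetical helper.

```python
def interpolate(n_x, n_t, n):
    """Independence-based estimate of the joint count for item x at time t."""
    return n_x * n_t / n if n else 0.0


item_counts = {"wave": 60, "fuji": 30, "edo": 10}  # n_x, summed over time
time_counts = {"t1": 20, "t2": 80}                 # n_t, summed over items
n = sum(item_counts.values())                      # same total as the time marginal

estimates = {
    (x, t): interpolate(n_x, n_t, n)
    for x, n_x in item_counts.items()
    for t, n_t in time_counts.items()
}
print(estimates[("wave", "t2")])  # 48.0
```

The estimates always sum back to the total n, and they match the true joint counts exactly only when items and time really are independent.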
15. Setup and Throughput
Web query data, 5 days sample: 97.9M unique terms, 378.1M total
Wikipedia data: 4.5M unique terms, 1291.5M total
Figures: number of unique terms versus term frequency for each dataset (logarithmic axes)
Configuration
Platform
64-bit Linux
4-core 2GHz x86
16GB RAM
Gigabit network
Sketch setup
4 hash functions
$2^{23}$ bins
$2^{11}$ aggregation intervals (7 days in 5 minute intervals)
3-gram interpolation
12GB sketch with 3 hash functions and $2^{30}$ bins
16. Setup and Throughput
Speed
Software
Client-server system
ICE middleware
1 server, 10 clients
Throughput/s
50k inserts
22k requests (time aggregation)
8.5k requests (resolution interpolation)
Limiting Factors
TCP/IP Overhead
Package query
Memory latency
Random access
17. Accuracy (aggregate absolute error $\hat{n} - n$)
18. Accuracy (stratified absolute error $\hat{n} - n$)
19. Sketching for Graphical Models
Goal
Observe stream of observations
Estimate joint probability in O(1) time
CountMin is good for head but interpolation better for torso and tail
General Strategy
Markov network with junction tree: cliques C and separator sets S.
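Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate
$\hat{p}(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}$.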
Estimates are fast — only lookup in CountMin sketch. No need to
solve convex program for graphical model inference.
Markov Chain
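Unigrams: $p(abc) \approx n^{-3} \cdot \hat{n}_a \cdot \hat{n}_b \cdot \hat{n}_c$
Bigrams: $p(abc) \approx n^{-1} \cdot \frac{\hat{n}_{ab} \cdot \hat{n}_{bc}}{\hat{n}_b}$ (two cliques and one separator, so the exponent is $|\mathcal{S}| - |\mathcal{C}| = -1$)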
Backoff smoothing (e.g. Kneser-Ney) in practice.
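The Markov chain estimates can be illustrated with exact counters standing in for CountMin lookups. The toy corpus and function names are assumptions. The bigram estimate uses the normalization $n^{-1}$ that follows from the junction tree formula with two cliques {a, b}, {b, c} and one separator {b}, and it omits smoothing.

```python
from collections import Counter

tokens = "the great wave off kanagawa the great wave".split()
n = len(tokens)

# Exact counts stand in for sketch queries in this toy example.
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))


def p_unigram(a, b, c):
    """Fully factorized estimate: n^-3 * n_a * n_b * n_c."""
    return unigrams[a] * unigrams[b] * unigrams[c] / n ** 3


def p_bigram(a, b, c):
    """Chain estimate: n_ab * n_bc / (n * n_b), zero if b was never seen."""
    if unigrams[b] == 0:
        return 0.0
    return bigrams[(a, b)] * bigrams[(b, c)] / (n * unigrams[b])


print(p_unigram("the", "great", "wave"))  # 0.015625
print(p_bigram("the", "great", "wave"))   # 0.25
```

The trigram "the great wave" occurs 2 times among 8 tokens, so the bigram chain estimate of 0.25 is far closer to its empirical frequency than the unigram product.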
20. n-gram Interpolation
Trigram approximation
Wikipedia dataset (1291.5M terms, 405M unique trigrams)
Method                          Absolute error    Relative error
Unigram approximation           2.50 · 10^7       0.266
Bigram approximation            1.22 · 10^6       0.013
Trigram sketching (CountMin)    8.35 · 10^6       0.089
Sketching trigrams is not accurate enough on the tail.
21. Summary
Fast and simple algorithm to aggregate statistics of data streams.
Effective compressed representation of the temporal data.
Works well for graphical models.
High-performance scalable implementation with O(1) time access.
Can be distributed over many servers.
Hokusai Katsushika
Great Wave off Kanagawa