Hokusai
Sketching streams in real time
Sergiy Matusevych (1)
Alexander J. Smola (2)
Amr Ahmed (2)
(1) Yahoo! Research, Santa Clara, CA
(2) Google, Mountain View, CA
UAI 2012
Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1
Thanks
Alex Smola
Google and CMU
Amr Ahmed
Google
Motivation
Compute frequencies of elements in the data stream
Item frequencies change over time.
Number of items unknown and variable.
Example - logging query frequency over time.
Applications
Flow counting for IP traffic (who sent what, when and how much)
Spam detection and filtering (detect bursts immediately)
Website analytics (feedback to editors, trend detection)
State of the art
CountMin sketch is instantaneous but does not log time.
Naive snapshotting costs linear memory.
MapReduce batch job provides exact counts but with long delays.
Resource constraints
Fixed memory footprint for entire sketch regardless of duration
High query throughput
Real time aggregation and response
Strategy
1. Use CountMin sketch to store snapshots of data.
(this solves the real time logging problem)
2. Compress snapshots linearly as they age
We care most about recent events
Logarithmic storage since $\sum_{t=1}^{T} t^{-1} = O(\log T)$
3. Exploit CountMin data structure for efficient compression
Variant 1: reduce storage per snapshot
Variant 2: increase timespan per snapshot
4. Interpolate between both variants for improved accuracy
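The logarithmic storage bound in step 2 can be checked by comparing the harmonic sum with an integral. This is a short worked derivation, under the assumption (an interpretation of the slide, not something it states) that a snapshot of age t keeps a 1/t share of the storage of a fresh snapshot:

```latex
\sum_{t=1}^{T} \frac{1}{t}
  \;\le\; 1 + \int_{1}^{T} \frac{\mathrm{d}t}{t}
  \;=\; 1 + \ln T
  \;=\; O(\log T)
```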
CountMin Sketch (Cormode & Muthukrishnan)
$M \in \mathbb{R}^{d \times n}$ matrix
d hash functions
n bins
[Figure: d x n array of counters; row i holds $M_{i1}, \ldots, M_{in}$ and is indexed by hash $h_i$; an item x selects one bin in each row]
In-memory data structure for instantaneous retrieval
Aggregate statistic of observation interval (instantaneous retrieval)
Intuition — Bloom filter with integers
Algorithm
insert(x):
    for i = 1 to d do
        M[i, h_i(x)] ← M[i, h_i(x)] + 1
    end for
query(x):
    $\hat{n}_x \leftarrow \min_{i \in \{1, \ldots, d\}} M[i, h_i(x)]$
    return $\hat{n}_x$
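The insert and query routines above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `CountMinSketch` and the salted BLAKE2b hash family in `_bin` are assumptions made here so that the example is self-contained.

```python
import hashlib


class CountMinSketch:
    """Count-Min sketch with d hash rows and n bins per row."""

    def __init__(self, d: int, n: int) -> None:
        self.d = d
        self.n = n
        self.M = [[0] * n for _ in range(d)]

    def _bin(self, i: int, x: str) -> int:
        # Hypothetical hash family: row i salts a BLAKE2b digest with its index.
        digest = hashlib.blake2b(
            x.encode("utf-8"), digest_size=8, salt=i.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.n

    def insert(self, x: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for i in range(self.d):
            self.M[i][self._bin(i, x)] += count

    def query(self, x: str) -> int:
        # Collisions only ever add to a counter, so the minimum is the best estimate.
        return min(self.M[i][self._bin(i, x)] for i in range(self.d))
```

Because every collision inflates a counter, `query` never underestimates the true count; taking the minimum over rows picks the least contaminated counter.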
Guarantees
Approximation guarantee
For a sketch with $d = \log \frac{1}{\delta}$ and $n = \frac{e}{\epsilon}$ we have with probability
$1 - \delta$ that the estimate $\hat{n}_x$ deviates from the count $n_x$ via
$n_x \leq \hat{n}_x \leq n_x + \epsilon \sum_{x'} n_{x'}$ for all x.
Linear statistic of the data
Power law distributions with exponent z only use $O(N^{-1/z})$ space.
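As a worked example of the approximation guarantee, the sketch dimensions can be derived from a target error and failure probability. The helper name `sketch_dimensions` is hypothetical, and the code assumes the natural logarithm and rounding up to whole rows and bins, neither of which the slide spells out.

```python
import math


def sketch_dimensions(epsilon: float, delta: float) -> tuple:
    """Return (d, n) for additive error epsilon * (total count) with probability 1 - delta."""
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    d = math.ceil(math.log(1.0 / delta))  # number of hash functions (rows)
    n = math.ceil(math.e / epsilon)       # number of bins per row
    return d, n


d, n = sketch_dimensions(epsilon=0.001, delta=0.01)
```

With these inputs the sketch has 5 rows of 2719 bins, so the memory footprint depends only on the accuracy targets and not on the number of distinct items.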
Step 1: Combining time intervals
$M_T$ and $M_{T'}$ are sketches for time intervals $T$ and $T'$ with $T \cap T' = \emptyset$.
Combine sketches by adding them up: $M_{T \cup T'} = M_T + M_{T'}$
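Because the sketch is a linear statistic of the data, two sketches over disjoint intervals merge by entrywise addition. A minimal sketch, assuming both matrices were built with the same hash functions and the same shape (the function name `merge` is hypothetical):

```python
def merge(M_a, M_b):
    """Entrywise sum of two sketches given as lists of rows of counters."""
    if len(M_a) != len(M_b) or any(len(r) != len(s) for r, s in zip(M_a, M_b)):
        raise ValueError("sketches must have identical shape")
    return [[u + v for u, v in zip(r, s)] for r, s in zip(M_a, M_b)]


monday = [[1, 0], [0, 2]]
tuesday = [[3, 1], [1, 0]]
both_days = merge(monday, tuesday)
```

Querying the merged matrix gives the same answer as a single sketch that had observed both intervals.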
Step 1: Efficient computation
Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \ldots, 2^m\}$.
Insert into the leftmost aggregation interval.
Aggregate as cumulative sum from the left using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$
Computation is $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time, $O(\log t)$ space.
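One way to realise this cascade is to treat the aggregates like the digits of a binary counter: level j is rewritten only when $2^j$ divides the tick count. The code below is a sketch under that reading; the class name `TimeAggregator` is hypothetical, and flat lists of counters stand in for full sketches, which works because the aggregation is linear.

```python
def add(a, b):
    """Entrywise sum of two equally long counter lists."""
    return [u + v for u, v in zip(a, b)]


class TimeAggregator:
    def __init__(self, size: int) -> None:
        self.size = size
        self.t = 0
        self.levels = []  # levels[j] sums 2**j consecutive unit intervals

    def tick(self, unit) -> None:
        """Fold in the sketch `unit` of the unit interval that just finished."""
        self.t += 1
        carry = list(unit)
        j = 0
        # Cascade while 2**j divides t, like carries in a binary counter.
        while self.t % (1 << j) == 0:
            if j == len(self.levels):
                self.levels.append([0] * self.size)
            backup = carry
            carry = add(carry, self.levels[j])  # cumulative sum from the left
            self.levels[j] = backup
            j += 1


agg = TimeAggregator(size=1)
for value in range(1, 9):
    agg.tick([value])
```

Level j is touched once every $2^j$ ticks, which is where the constant amortized cost per tick comes from, and after t ticks only about $\log_2 t$ levels exist.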
[Figure: aggregation intervals of length 1, 1, 2, 4, ... merged from the left into longer intervals as time advances]
Step 2: Folding over
$M^b$ is a sketch with $n = 2^b$ bins.
$M^{b-1}$ can be obtained as
$M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$
by “folding over” the sketch
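The folding step can be sketched in a few lines. The function name `fold` is hypothetical, and the example assumes bins are addressed as a hash value modulo the number of bins, so that bins j and $j + 2^{b-1}$ of the larger sketch both land on bin j of the folded one.

```python
def fold(M):
    """Halve the number of bins: M'[i][j] = M[i][j] + M[i][j + n/2]."""
    n = len(M[0])
    if n < 2 or n % 2:
        raise ValueError("number of bins must be even")
    half = n // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in M]


M_b = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
M_b_minus_1 = fold(M_b)

# A hash value that addressed bin h % 4 before folding addresses bin h % 2 after it.
h = 7
assert M_b_minus_1[0][h % 2] >= M_b[0][h % 4]
```

Folding never lowers a counter, so the folded sketch still overestimates counts; it trades accuracy for half the memory.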
Step 2: Efficient computation
Halve the size of the sketch every $2^t$ intervals.
Computation costs O(1) time and O(log t) space.
[Figure: interval 1 is stored as 1 x 16 bins, intervals 2 and 3 as 2 x 8 bins, intervals 4 to 7 as 4 x 4 bins, . . .]
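The halving schedule in the figure can be written down as a small function. This is a sketch based only on the figure's numbers (16, 8, 8, 4, 4, 4, 4); the function names and the reading of the interval numbers as 1-based positions are assumptions.

```python
def bins_for_interval(index: int, full_bins: int = 16) -> int:
    """Bins kept for the interval at 1-based position `index` in the figure's numbering."""
    if index < 1:
        raise ValueError("index must be >= 1")
    halvings = index.bit_length() - 1  # floor(log2(index))
    return max(1, full_bins >> halvings)


def total_bins(num_intervals: int, full_bins: int = 16) -> int:
    """Total bins per hash row across the first `num_intervals` intervals."""
    return sum(bins_for_interval(i, full_bins) for i in range(1, num_intervals + 1))


schedule = [bins_for_interval(i) for i in range(1, 8)]
```

Each group of $2^k$ intervals uses the same total number of bins as one full-resolution interval, so seven intervals cost three times the full width, which matches the logarithmic space bound.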
Step 3: Resolution Interpolation
Time aggregation reports good estimate over long time interval.
Item aggregation reports poor estimate over short time interval.
Marginals of joint distribution — assume independence & interpolate
[Figure: marginals n(t) and n(x), total n]
Torso and Tail
Item aggregated estimate $n_x$
Time aggregated estimate $n_t$
Count interpolation
$\hat{n}_{xt} = \frac{n_x \cdot n_t}{n}$ where $n = \sum_t n_t = \sum_x n_x$
Head
Sketch accuracy decreases with e · t
Use regular CountMin sketch whenever
$\tilde{n}(x, t) > e \cdot t \cdot 2^{-b}$
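The head versus torso-and-tail rule can be sketched as a single function. The function and parameter names are hypothetical; `cm_estimate` plays the role of $\tilde{n}(x, t)$, and t and b are used exactly as they appear in the threshold on the slide.

```python
import math


def interpolate(n_x: float, n_t: float, n: float,
                cm_estimate: float, t: int, b: int) -> float:
    """Return the direct sketch estimate for head items, else n_x * n_t / n."""
    threshold = math.e * t * 2.0 ** (-b)
    if cm_estimate > threshold:
        # Head: the plain CountMin estimate is trusted.
        return float(cm_estimate)
    if n == 0:
        return 0.0
    # Torso and tail: product of marginals under the independence assumption.
    return n_x * n_t / n


head = interpolate(n_x=30, n_t=20, n=600, cm_estimate=5, t=4, b=10)
tail = interpolate(n_x=30, n_t=20, n=600, cm_estimate=5, t=1024, b=4)
```

In the first call the threshold is about 0.011, so the direct estimate 5.0 is returned; in the second it is about 174, so the interpolated value 30 * 20 / 600 = 1.0 is used instead.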
Setup and Throughput
Web query data, 5 days sample
97.9M unique terms, 378.1M total
[Figure: number of unique terms versus term frequency, log-log axes]
Wikipedia data
4.5M unique terms, 1291.5M total
[Figure: number of unique terms versus term frequency, log-log axes]
Configuration
Platform
64-bit Linux
4-core 2GHz x86
16GB RAM
Gigabit network
Sketch setup
4 hash functions
$2^{23}$ bins
$2^{11}$ aggregation intervals (7 days in 5 minute intervals)
3-gram interpolation
12GB sketch with 3 hash functions
$2^{30}$ bins
Speed
Software
Client-server system
ICE middleware
1 server, 10 clients
Throughput/s
50k inserts
22k requests (time aggregation)
8.5k requests (resolution interpolation)
Limiting Factors
TCP/IP Overhead
Package query
Memory latency
Random access
Accuracy (aggregate absolute error $\hat{n} - n$)
Accuracy (stratified absolute error $\hat{n} - n$)
Sketching for Graphical Models
Goal
Observe stream of observations
Estimate joint probability in O(1) time
CountMin is good for the head, but interpolation is better for torso and tail
General Strategy
Markov network with junction tree: cliques C and separator sets S.
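Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate
$\hat{p}(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}$.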
Estimates are fast — only lookup in CountMin sketch. No need to
solve convex program for graphical model inference.
Markov Chain
$p(abc) \approx n^{-3} \cdot \hat{n}_a \cdot \hat{n}_b \cdot \hat{n}_c$ (Unigrams)
$p(abc) \approx n^{-1} \cdot \frac{\hat{n}_{ab} \cdot \hat{n}_{bc}}{\hat{n}_b}$ (Bigrams)
Backoff smoothing (e.g. Kneser-Ney) in practice.
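The two Markov chain approximations can be tried on a toy corpus. In this sketch exact `Counter` lookups stand in for CountMin queries, and the function names are hypothetical. The bigram exponent comes from the junction tree formula with two cliques {ab}, {bc} and one separator {b}, which gives $n^{1-2} = n^{-1}$.

```python
from collections import Counter


def unigram_estimate(n_a, n_b, n_c, n):
    """p(abc) from three unigram counts: n^-3 * n_a * n_b * n_c."""
    return n_a * n_b * n_c / n ** 3


def bigram_estimate(n_ab, n_bc, n_b, n):
    """p(abc) from two bigram counts divided by the separator count n_b."""
    if n_b == 0:
        return 0.0
    return n_ab * n_bc / (n_b * n)


tokens = "a b c a b c a b d".split()
n = len(tokens)
uni = Counter(tokens)                    # exact counts stand in for sketch lookups
bi = Counter(zip(tokens, tokens[1:]))

p_uni = unigram_estimate(uni["a"], uni["b"], uni["c"], n)
p_bi = bigram_estimate(bi[("a", "b")], bi[("b", "c")], uni["b"], n)
```

Both estimates need only a constant number of lookups per query, which is the point of the slide: no inference problem has to be solved at query time.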
n-gram Interpolation
Trigram approximation
Wikipedia dataset (1291.5M terms, 405M unique trigrams)
                                Absolute error       Relative error
Unigram approximation           $2.50 \cdot 10^7$    0.266
Bigram approximation            $1.22 \cdot 10^6$    0.013
Trigram sketching (CountMin)    $8.35 \cdot 10^6$    0.089
Sketching trigrams is not accurate enough on the tail.
Summary
Fast and simple algorithm to aggregate statistics of data streams.
Effective compressed representation of the temporal data.
Works well for graphical models.
High-performance scalable implementation with O(1) time access.
Can be distributed over many servers.
Hokusai Katsushika
Great Wave off Kanagawa
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills MN
 
23PH301 - Optics - Optical Lenses.pptx
23PH301 - Optics  -  Optical Lenses.pptx23PH301 - Optics  -  Optical Lenses.pptx
23PH301 - Optics - Optical Lenses.pptx
RDhivya6
 
Modelo de slide quimica para powerpoint
Modelo  de slide quimica para powerpointModelo  de slide quimica para powerpoint
Modelo de slide quimica para powerpoint
Karen593256
 

Recently uploaded (20)

Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 
11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
HOW DO ORGANISMS REPRODUCE?reproduction part 1
HOW DO ORGANISMS REPRODUCE?reproduction part 1HOW DO ORGANISMS REPRODUCE?reproduction part 1
HOW DO ORGANISMS REPRODUCE?reproduction part 1
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
 
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
 
23PH301 - Optics - Optical Lenses.pptx
23PH301 - Optics  -  Optical Lenses.pptx23PH301 - Optics  -  Optical Lenses.pptx
23PH301 - Optics - Optical Lenses.pptx
 
Modelo de slide quimica para powerpoint
Modelo  de slide quimica para powerpointModelo  de slide quimica para powerpoint
Modelo de slide quimica para powerpoint
 

Hokusai - Sketching streams in real time

  • 1. Hokusai: Sketching streams in real time
       Sergiy Matusevych (1), Alexander J. Smola (2), Amr Ahmed (2)
       (1) Yahoo! Research, Santa Clara, CA
       (2) Google, Mountain View, CA
       UAI 2012
  • 2. Thanks
       Alex Smola (Google and CMU)
       Amr Ahmed (Google)
  • 4. Motivation
       Compute frequencies of elements in the data stream
         Item frequencies change over time.
         Number of items is unknown and variable.
         Example: logging query frequency over time.
       Applications
         Flow counting for IP traffic (who sent what, when, and how much)
         Spam detection and filtering (detect bursts immediately)
         Website analytics (feedback to editors, trend detection)
       State of the art
         CountMin sketch is instantaneous but does not log time.
         Naive snapshotting costs linear memory.
         MapReduce batch jobs provide exact counts but with long delays.
       Resource constraints
         Fixed memory footprint for the entire sketch, regardless of duration
         High query throughput
         Real-time aggregation and response
  • 5. Strategy
       1. Use CountMin sketch to store snapshots of data
          (this solves the real-time logging problem).
       2. Compress snapshots linearly as they age.
          We care most about recent events.
          Logarithmic storage, since $\sum_{t=1}^{T} t^{-1} = O(\log T)$.
       3. Exploit the CountMin data structure for efficient compression.
          Variant 1: reduce storage per snapshot.
          Variant 2: increase timespan per snapshot.
       4. Interpolate between both variants for improved accuracy.
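The logarithmic bound in step 2 can be checked with a short calculation. This is a sketch under one assumption: the snapshot of age t keeps a 1/t share of a full snapshot of size $S_0$ (the symbol $S_0$ is introduced here for illustration and does not appear on the slide).

```latex
\sum_{t=1}^{T} \frac{S_0}{t}
  \le S_0 \left( 1 + \int_{1}^{T} \frac{dt}{t} \right)
  = S_0 \, (1 + \ln T)
  = O(\log T)
```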
  • 6. CountMin Sketch (Cormode & Muthukrishnan)
       $M \in \mathbb{R}^{d \times n}$ matrix with d hash functions (one per row) and n bins (columns);
       hash $h_i$ maps an item x to bin $M_{i, h_i(x)}$ in row i.
       In-memory data structure for instantaneous retrieval
       Aggregate statistic of the observation interval (instantaneous retrieval)
       Intuition: a Bloom filter with integers
       Algorithm
         insert(x):
           for i = 1 to d do
             M[i, h_i(x)] ← M[i, h_i(x)] + 1
           end for
         query(x):
           $\hat{n}_x$ ← min over i ∈ {1, ..., d} of M[i, h_i(x)]
           return $\hat{n}_x$
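The insert and query pseudocode above can be sketched in Python as follows. The class name `CountMinSketch` and the salted SHA-256 hash family in `_bin` are assumptions made for illustration; the slides do not say which hash functions the authors use.

```python
import hashlib


class CountMinSketch:
    """Minimal CountMin sketch: d rows (hash functions) by n bins."""

    def __init__(self, d, n):
        self.d = d
        self.n = n
        self.M = [[0] * n for _ in range(d)]

    def _bin(self, i, x):
        # Hypothetical hash family: SHA-256 salted with the row index i.
        digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.n

    def insert(self, x, count=1):
        # Increment one bin per row.
        for i in range(self.d):
            self.M[i][self._bin(i, x)] += count

    def query(self, x):
        # The minimum over rows is the least-contaminated counter.
        return min(self.M[i][self._bin(i, x)] for i in range(self.d))


sketch = CountMinSketch(d=4, n=1024)
for term in ["wave", "wave", "fuji", "wave"]:
    sketch.insert(term)
```

Because collisions can only add to a counter, `sketch.query("wave")` is never below the true count of 3 and never above the total of 4 inserted items.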
  • 7. Guarantees
       $M \in \mathbb{R}^{d \times n}$ matrix, d hash functions, n bins
       Approximation guarantee
         For a sketch with $d = \log\frac{1}{\delta}$ and $n = \frac{e}{\epsilon}$, with probability $1 - \delta$ the estimate $\hat{n}_x$ deviates from the count $n_x$ via
         $n_x \le \hat{n}_x \le n_x + \epsilon \sum_{x'} n_{x'}$ for all x.
       Linear statistic of the data
       Power law distributions with exponent z only use $O(N^{-1/z})$ space.
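As a rough illustration of the guarantee, the sketch dimensions can be derived from a target error and failure probability. The function name `sketch_dimensions` is hypothetical, and rounding both values up to integers is an assumption; the slide states the formulas without rounding.

```python
import math


def sketch_dimensions(epsilon, delta):
    """Return (d, n) for additive error epsilon * total with probability 1 - delta."""
    d = math.ceil(math.log(1.0 / delta))  # number of hash functions (rows)
    n = math.ceil(math.e / epsilon)       # number of bins per row
    return d, n


d, n = sketch_dimensions(epsilon=0.001, delta=0.01)
```

With these inputs the sketch needs 5 rows of 2719 bins, independent of how many distinct items the stream contains.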
  • 8. Step 1: Combining time intervals
       $M \in \mathbb{R}^{d \times n}$ matrix, d hash functions, n bins
       $M_T$ and $M_{T'}$ are sketches at time intervals $T$ and $T'$ with $T \cap T' = \emptyset$.
       Combine sketches by adding them up: $M_{T \cup T'} = M_T + M_{T'}$.
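Adding sketches works because every counter is a plain sum over items. A minimal sketch of this, assuming both sketches share the same dimensions and hash functions; the helper names `bin_of`, `build`, and `add` and the SHA-256 hash are illustrative choices, not taken from the slides.

```python
import hashlib

D, N = 3, 64  # rows and bins, chosen arbitrarily for the example


def bin_of(i, x):
    # Hypothetical hash family: SHA-256 salted with the row index.
    digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N


def build(items):
    """Build a D x N CountMin matrix from a list of items."""
    M = [[0] * N for _ in range(D)]
    for x in items:
        for i in range(D):
            M[i][bin_of(i, x)] += 1
    return M


def add(A, B):
    """Element-wise sum of two sketches with identical shape and hashes."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


monday = ["a", "b", "a"]
tuesday = ["a", "c"]
combined = add(build(monday), build(tuesday))
```

The combined matrix is identical to the sketch built from both days at once, so no information beyond the usual sketch error is lost by merging.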
  • 9. Step 1: Efficient computation
       Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \ldots, 2^m\}$.
       Insert into the leftmost aggregation interval.
       Aggregate as cumulative sum from the left using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$.
       Computation is $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time, $O(\log t)$ space.
       [Figure: aggregation intervals of length 1, 2, and 4 over successive time steps]
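One possible reading of the cumulative-sum update is sketched below. It is an interpretation, not the authors' code: the class name `TimeAggregator` is hypothetical, each sketch is reduced to a single flat row of counters to keep the example short, and level j is assumed to hold the sum of the most recent $2^j$ unit intervals as of its last update.

```python
def add(a, b):
    """Element-wise sum of two flat sketches (linearity of CountMin)."""
    return [u + v for u, v in zip(a, b)]


class TimeAggregator:
    def __init__(self, width):
        self.width = width
        self.t = 0
        self.levels = []  # levels[j] sums 2**j unit intervals

    def tick(self, unit_sketch):
        """Absorb the sketch of one finished unit interval."""
        self.t += 1
        carry = list(unit_sketch)
        j = 0
        # Level j is rewritten only when t is a multiple of 2**j,
        # which yields the O(1) amortized cost quoted on the slide.
        while self.t % (2 ** j) == 0:
            if j == len(self.levels):
                self.levels.append([0] * self.width)
            previous = self.levels[j]
            self.levels[j] = carry          # new value for level j
            carry = add(carry, previous)    # cumulative sum moves on
            j += 1


agg = TimeAggregator(width=1)
for t in range(1, 5):
    agg.tick([t])  # unit interval t carries the count t
```

After four ticks the levels hold the sums over the last 1, 2, and 4 intervals, that is 4, 3 + 4, and 1 + 2 + 3 + 4.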
  • 12. Step 2: Folding over
       $M \in \mathbb{R}^{d \times n}$ matrix, d hash functions, n bins
       $M^b$ is a sketch with $n = 2^b$ bins.
       $M^{b-1}$ can be obtained as $M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$ by “folding over” the sketch.
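Folding stays consistent with later queries only if the bin index is a raw hash reduced modulo the current width, since then bins j and j + 2^(b-1) both map to j after halving. The sketch below assumes exactly that; `raw_hash`, `build`, and `fold` are illustrative names, and the SHA-256 hash is a stand-in for whatever hash family is used in practice.

```python
import hashlib


def raw_hash(i, x):
    # Hypothetical hash family: SHA-256 salted with the row index.
    digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build(items, d, n):
    """Build a d x n sketch; n is assumed to be a power of two."""
    M = [[0] * n for _ in range(d)]
    for x in items:
        for i in range(d):
            M[i][raw_hash(i, x) % n] += 1
    return M


def fold(M):
    """Halve the width: new[i][j] = old[i][j] + old[i][j + half]."""
    half = len(M[0]) // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in M]


items = ["wave", "fuji", "wave", "crane", "boat"]
wide = build(items, d=3, n=16)
narrow = fold(wide)
```

The folded sketch equals the sketch that would have been built with 8 bins from the start, so it can be queried with the same hashes taken modulo 8.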
  • 13. Step 2: Efficient computation
       Halve the size of the sketch every $2^t$ intervals.
       Computation costs $O(1)$ time and $O(\log t)$ space.
       [Figure: sketches for intervals 1 to 7, stored as 1 x 16 bins, 2 x 8 bins, 4 x 4 bins, ...]
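The halving schedule can be sketched as follows. This is an assumption-laden illustration rather than the authors' algorithm: the class name `ItemAggregator` is hypothetical, sketches are flat rows, ages are counted from 1 for the newest interval, and a sketch is folded each time its age reaches a power of two, which reproduces the 16, 8, 8, 4, 4, 4, 4 layout in the figure.

```python
def fold(row):
    """Halve a flat sketch by adding its upper half onto its lower half."""
    half = len(row) // 2
    return [row[j] + row[j + half] for j in range(half)]


class ItemAggregator:
    def __init__(self):
        self.history = []  # history[s] is the sketch of unit interval s + 1

    def tick(self, unit_sketch):
        """Store the newest sketch and fold those whose age hit a power of two."""
        self.history.append(list(unit_sketch))
        t = len(self.history)
        k = 1
        while 2 ** k <= t:
            i = t - 2 ** k  # index of the sketch whose age just reached 2**k
            if len(self.history[i]) > 1:  # a single bin cannot be folded
                self.history[i] = fold(self.history[i])
            k += 1


agg = ItemAggregator()
for _ in range(7):
    agg.tick([1] * 16)  # every unit interval starts with 16 bins

sizes_newest_first = [len(s) for s in reversed(agg.history)]
```

Each age band (1, 2 to 3, 4 to 7) occupies 16 bins in total, which is where the logarithmic growth in space comes from. Folding never changes the total count held in a sketch.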
  • 14. Step 3: Resolution Interpolation
       Time aggregation reports a good estimate over a long time interval.
       Item aggregation reports a poor estimate over a short time interval.
       Marginals of joint distribution: assume independence and interpolate.
       [Figure: joint counts n with marginals n(t) and n(x)]
       Torso and Tail
         Item aggregated estimate $n_x$
         Time aggregated estimate $n_t$
         Count interpolation $\hat{n}_{xt} = \frac{n_x \cdot n_t}{n}$ where $n = \sum_t n_t = \sum_x n_x$
       Head
         Sketch accuracy decreases with $e \cdot t$.
         Use the regular CountMin sketch whenever $\tilde{n}(x, t) > e \cdot t \cdot 2^{-b}$.
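The head and tail rule can be written as a small function. The names `interpolate` and `estimate` are hypothetical, and `direct` stands for the regular sketch estimate $\tilde{n}(x, t)$. The threshold copies the slide's expression; treating t as the time index of the queried interval and $2^b$ as the current number of bins is an assumption.

```python
import math


def interpolate(n_x, n_t, n):
    """Independence-based estimate n_x * n_t / n from the two marginals."""
    if n == 0:
        return 0.0
    return n_x * n_t / n


def estimate(direct, n_x, n_t, n, t, b):
    """Use the direct sketch count for heavy items, interpolation otherwise."""
    threshold = math.e * t * 2.0 ** (-b)  # expression taken from the slide
    if direct > threshold:
        return float(direct)
    return interpolate(n_x, n_t, n)


head = estimate(direct=50, n_x=30, n_t=200, n=1000, t=8, b=4)
tail = estimate(direct=1, n_x=30, n_t=200, n=1000, t=8, b=4)
```

Here the threshold is about 1.36, so the first call trusts the sketch count of 50 while the second falls back to the interpolated value 30 · 200 / 1000 = 6.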
  • 15. Setup and Throughput
       Web query data, 5 days sample: 97.9M unique terms, 378.1M total
       Wikipedia data: 4.5M unique terms, 1291.5M total
       [Figures: log-log plots of number of unique terms against term frequency for each data set]
       Configuration
         Platform
           64-bit Linux
           4-core 2GHz x86
           16GB RAM
           Gigabit network
         Sketch setup
           4 hash functions
           $2^{23}$ bins
           $2^{11}$ aggregation intervals (7 days in 5 minute intervals)
         3-gram interpolation
           12GB sketch with 3 hash functions
           $2^{30}$ bins
  • 16. Setup and Throughput
       Speed
         Software
           Client-server system
           ICE middleware
           1 server, 10 clients
         Throughput/s
           50k inserts
           22k requests (time aggregation)
           8.5k requests (resolution interpolation)
         Limiting factors
           TCP/IP overhead
           Package query
           Memory latency
           Random access
  • 17. Accuracy (aggregate absolute error $\hat{n} - n$)
  • 18. Accuracy (stratified absolute error $\hat{n} - n$)
  • 19. Sketching for Graphical Models
       Goal
         Observe a stream of observations.
         Estimate joint probability in O(1) time.
         CountMin is good for the head, but interpolation is better for torso and tail.
       General Strategy
         Markov network with junction tree: cliques $\mathcal{C}$ and separator sets $\mathcal{S}$.
         Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate
         $\hat{p}(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} \hat{n}_{x_C} \prod_{S \in \mathcal{S}} \hat{n}_{x_S}^{-1}$.
         Estimates are fast: only lookups in the CountMin sketch.
         No need to solve a convex program for graphical model inference.
       Markov Chain
         Unigrams: $p(abc) \approx n^{-3} \cdot \hat{n}_a \cdot \hat{n}_b \cdot \hat{n}_c$
         Bigrams: $p(abc) \approx n^{-1} \cdot \frac{\hat{n}_{ab} \cdot \hat{n}_{bc}}{\hat{n}_b}$
         Backoff smoothing (e.g. Kneser-Ney) in practice.
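The general junction-tree formula $n^{|S| - |C|} \prod_C \hat{n}_{x_C} \prod_S \hat{n}_{x_S}^{-1}$ can be evaluated directly once the clique and separator counts have been looked up. In this minimal sketch the function name `junction_tree_estimate` is hypothetical and the counts are made-up numbers standing in for sketch lookups. For the bigram chain a, b, c (two cliques, one separator) the formula gives a prefactor of $n^{-1}$.

```python
def junction_tree_estimate(n, clique_counts, separator_counts):
    """Estimate p(x) from the total count n and per-clique/separator counts."""
    p = float(n) ** (len(separator_counts) - len(clique_counts))
    for c in clique_counts:
        p *= c
    for s in separator_counts:
        if s == 0:
            return 0.0  # unseen separator: no evidence for this configuration
        p /= s
    return p


# Unigram model: cliques {a}, {b}, {c}, no separators.
p_unigram = junction_tree_estimate(10, [2, 5, 1], [])

# Bigram chain: cliques {a, b}, {b, c}, separator {b}.
p_bigram = junction_tree_estimate(10, [2, 1], [5])
```

With these toy counts the unigram estimate is 2 · 5 · 1 / 10^3 = 0.01 and the bigram estimate is 2 · 1 / (10 · 5) = 0.04.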
  • 20. n-gram Interpolation
       Trigram approximation on the Wikipedia dataset (1291.5M terms, 405M unique trigrams)
       Method                         Absolute error   Relative error
       Unigram approximation          2.50 · 10^7      0.266
       Bigram approximation           1.22 · 10^6      0.013
       Trigram sketching (CountMin)   8.35 · 10^6      0.089
       Sketching trigrams is not accurate enough on the tail.
  • 21. Summary
       Fast and simple algorithm to aggregate statistics of data streams.
       Effective compressed representation of the temporal data.
       Works well for graphical models.
       High-performance scalable implementation with O(1) time access.
       Can be distributed over many servers.
       Image: Hokusai Katsushika, Great Wave off Kanagawa