Hokusai
Sketching streams in real time
Sergiy Matusevych1
Alexander J. Smola2
Amr Ahmed2
1Yahoo! Research, Santa Clara, CA
2Google, Mountain View, CA
UAI 2012
Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1
Thanks
Alex Smola
Google and CMU
Amr Ahmed
Google
Motivation
Compute frequencies of elements in the data stream
Item frequencies change over time.
Number of items unknown and variable.
Example - logging query frequency over time.
Applications
Flow counting for IP traffic (who sent what, when and how much)
Spam detection and filtering (detect bursts immediately)
Website analytics (feedback to editors, trend detection)
State of the art
CountMin sketch is instantaneous but does not log time.
Naive snapshotting costs linear memory.
MapReduce batch job provides exact counts but long delays.
Resource constraints
Fixed memory footprint for entire sketch regardless of duration
High query throughput
Real time aggregation and response
Strategy
1. Use CountMin sketch to store snapshots of data.
(this solves the real time logging problem)
2. Compress snapshots linearly as they age
We care most about recent events
Logarithmic storage since $\sum_{t=1}^{T} t^{-1} = O(\log T)$
3. Exploit CountMin data structure for efficient compression
Variant 1: reduce storage per snapshot
Variant 2: increase timespan per snapshot
4. Interpolate between both variants for improved accuracy
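The logarithmic storage claim in step 2 can be checked numerically. The sketch below assumes, as one reading of the slide, that a snapshot of age t keeps a 1/t share of a full snapshot; the function name `harmonic` is chosen here for illustration.

```python
import math

def harmonic(T):
    """Total storage, in units of one full snapshot, when the snapshot
    of age t keeps a 1/t share (assumed reading of the slide)."""
    return sum(1.0 / t for t in range(1, T + 1))

# Compare the sum with the standard bound 1 + ln(T).
for T in (10, 1_000, 100_000):
    print(T, round(harmonic(T), 3), round(1 + math.log(T), 3))
```

The sum stays below 1 + ln T for every T, so total storage grows only logarithmically in the number of snapshots.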
CountMin Sketch (Cormode & Muthukrishnan)
$M \in \mathbb{R}^{d \times n}$ matrix: d hash functions (rows), n bins (columns)
[Figure: item x is hashed by $h_1, h_2, h_3$ into one bin in each row of M]
In-memory data structure for instantaneous retrieval
Aggregate statistic of observation interval (instantaneous retrieval)
Intuition — Bloom filter with integers
Algorithm
insert(x):
  for i = 1 to d do
    M[i, h_i(x)] ← M[i, h_i(x)] + 1
  end for
query(x):
  n̂_x ← min over i ∈ {1, ..., d} of M[i, h_i(x)]
  return n̂_x
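The insert and query routines above translate almost line by line into Python. The following is a minimal sketch, not the authors' implementation: the class name `CountMinSketch` and the choice of salted BLAKE2b digests as the d hash functions are assumptions made for illustration.

```python
import hashlib

class CountMinSketch:
    def __init__(self, d, n):
        self.d = d                                 # number of hash functions (rows)
        self.n = n                                 # number of bins (columns)
        self.M = [[0] * n for _ in range(d)]

    def _bin(self, i, x):
        # Row i uses BLAKE2b salted with i (an assumed hash family).
        digest = hashlib.blake2b(
            x.encode("utf-8"), digest_size=8, salt=i.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.n

    def insert(self, x, count=1):
        for i in range(self.d):
            self.M[i][self._bin(i, x)] += count

    def query(self, x):
        # Collisions only ever add to a bin, so the minimum over rows
        # is the tightest upper bound on the true count.
        return min(self.M[i][self._bin(i, x)] for i in range(self.d))

sketch = CountMinSketch(d=4, n=1024)
for _ in range(3):
    sketch.insert("hokusai")
sketch.insert("wave")
print(sketch.query("hokusai"))  # never less than the true count of 3
```

Because counters are only incremented, `query` can overestimate but never underestimate.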
Guarantees
Approximation guarantee
For a sketch with $d = \log \frac{1}{\delta}$ and $n = \frac{e}{\epsilon}$ we have with probability
$1 - \delta$ that the estimate $\hat{n}_x$ deviates from the count $n_x$ via
$n_x \le \hat{n}_x \le n_x + \epsilon \sum_{x'} n_{x'}$ for all x.
Linear statistic of the data
Power law distributions with exponent z only use $O(N^{-1/z})$ space.
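The guarantee can be turned into a sizing rule. Writing ε for the additive error as a fraction of the total count and δ for the failure probability, a small helper might look like the sketch below. The function name `sketch_dimensions`, the use of the natural logarithm, and the rounding up to integers are assumptions made here.

```python
import math

def sketch_dimensions(epsilon, delta):
    """Return (d, n): rows and bins for additive error epsilon * total count
    with failure probability delta (natural log and ceilings assumed)."""
    d = math.ceil(math.log(1.0 / delta))
    n = math.ceil(math.e / epsilon)
    return d, n

d, n = sketch_dimensions(epsilon=0.01, delta=0.01)
print(d, n)
```

With ε = δ = 0.01 this gives 5 rows of 272 bins, since ln 100 ≈ 4.6 and e / 0.01 ≈ 271.8.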
Step 1: Combining time intervals
$M_T$ and $M_{T'}$ are sketches for time intervals $T$ and $T'$ with $T \cap T' = \emptyset$.
Combine sketches by adding them up: $M_{T \cup T'} = M_T + M_{T'}$
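Because the sketch is a linear statistic of the data, combining two disjoint intervals is an entry-wise sum. A minimal sketch, assuming both sketches are stored as lists of rows and were built with the same hash functions and dimensions (the function name `merge` is hypothetical):

```python
def merge(sketch_a, sketch_b):
    """Entry-wise sum of two sketches with identical shape and hash functions."""
    if len(sketch_a) != len(sketch_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(sketch_a, sketch_b)
    ):
        raise ValueError("sketches must have the same dimensions")
    return [
        [a + b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(sketch_a, sketch_b)
    ]

monday = [[1, 0], [2, 3]]
tuesday = [[4, 1], [0, 1]]
print(merge(monday, tuesday))  # [[5, 1], [2, 4]]
```

Querying the merged sketch gives the same answer as a single sketch that saw both intervals.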
Step 1: Efficient computation
Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \dots, 2^m\}$.
Insert into the leftmost aggregation interval.
Aggregate as cumulative sum from the left using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$
Computation is $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time, $O(\log t)$ space.
[Figure: aggregation intervals of length 1, 1, 2, 4, 8 cascading into longer intervals as time advances]
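One way to realise this cascade is sketched below. It follows the description on the slide (insert into the leftmost interval, then carry a cumulative sum to the right), but the class name `TimeAggregator`, the `tick` method, and the use of `collections.Counter` as a stand-in for a CountMin sketch are assumptions made for illustration. Any structure that supports addition would work in its place.

```python
from collections import Counter

class TimeAggregator:
    def __init__(self):
        self.levels = []          # levels[j]: latest completed block of 2**j ticks
        self.current = Counter()  # leftmost interval, still receiving inserts
        self.t = 0

    def insert(self, x, count=1):
        self.current[x] += count

    def tick(self):
        """Close the current unit interval and cascade the sums."""
        self.t += 1
        carry = self.current
        j = 0
        # Level j is refreshed only when t is a multiple of 2**j.
        while self.t % (2 ** j) == 0:
            if j == len(self.levels):
                self.levels.append(Counter())
            backup = carry
            carry = carry + self.levels[j]   # cumulative sum from the left
            self.levels[j] = backup
            j += 1
        self.current = Counter()

agg = TimeAggregator()
for t in range(1, 5):
    agg.insert("q", t)   # unit interval t contributes count t
    agg.tick()
print([level["q"] for level in agg.levels])  # [4, 7, 10]
```

After four ticks, level 0 holds interval 4, level 1 holds intervals 3 and 4, and level 2 holds intervals 1 to 4. Level j is touched on every 2^j-th tick only, which is where the constant amortized cost comes from.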
Step 2: Folding over
$M^b$ is a sketch with $n = 2^b$ bins.
$M^{b-1}$ can be obtained as
$M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$
by “folding over” the sketch
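The folding step is a one-liner per row. The sketch below assumes the sketch is stored as a list of rows with an even number of bins; the function name `fold` is chosen here. It also assumes bin indices are computed as a hash value modulo the number of bins, so that after folding an item is found at its hash value modulo the halved width.

```python
def fold(sketch):
    """Halve the number of bins by adding the right half onto the left half."""
    width = len(sketch[0])
    if width % 2 != 0:
        raise ValueError("number of bins must be even")
    half = width // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in sketch]

m2 = [[1, 2, 3, 4],
      [5, 6, 7, 8]]        # 2 hash functions, 2**2 bins
print(fold(m2))            # [[4, 6], [12, 14]]
```

Each folded bin is the sum of the two bins that share an index modulo the new width, so counts are preserved and estimates remain upper bounds.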
Step 2: Efficient computation
Halve the size of the sketch every $2^t$ intervals.
Computation costs O(1) time and O(log t) space.
[Figure: sketch width by age]
interval 1: 1 x 16 bins
intervals 2-3: 2 x 8 bins
intervals 4-7: 4 x 4 bins
. . .
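The schedule in the figure (16 bins at age 1, 8 bins at ages 2 to 3, 4 bins at ages 4 to 7) can be written as a function of age. This is a sketch under that reading of the figure; the function name `bins_for_age` and the floor of one bin are assumptions.

```python
def bins_for_age(age, b):
    """Bins kept for a snapshot that is `age` intervals old (age >= 1),
    starting from 2**b bins and halving each time the age doubles."""
    halvings = age.bit_length() - 1          # floor(log2(age))
    return 2 ** max(b - halvings, 0)         # never below one bin (assumed)

print([bins_for_age(age, b=4) for age in range(1, 8)])
# [16, 8, 8, 4, 4, 4, 4]
```

Every age band holds the same total number of bins (1 x 16, 2 x 8, 4 x 4), and there are only logarithmically many bands, which matches the stated space bound.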
Step 3: Resolution Interpolation
Time aggregation reports good estimate over long time interval.
Item aggregation reports poor estimate over short time interval.
Marginals of joint distribution: assume independence & interpolate
[Figure: joint count table with marginals n(t), n(x) and total n]
Torso and Tail
Item aggregated estimate $n_x$
Time aggregated estimate $n_t$
Count interpolation
$\hat{n}_{xt} = \frac{n_x \cdot n_t}{n}$ where $n = \sum_t n_t = \sum_x n_x$
Head
Sketch accuracy decreases with e · t
Use regular CountMin sketch whenever $\tilde{n}(x, t) > e \cdot t \cdot 2^{-b}$
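The interpolation rule and the head threshold can be combined into a small selection routine. This is a sketch under stated assumptions: `direct_estimate` stands for the regular CountMin estimate compared against the threshold (reading it as the estimate from the reduced-width sketch is an assumption), and the function names are hypothetical.

```python
import math

def interpolate(n_x, n_t, n):
    """Product of marginals divided by the total count (independence assumed)."""
    return n_x * n_t / n if n else 0.0

def estimate(direct_estimate, n_x, n_t, n, t, b):
    """Use the direct sketch estimate for head items, interpolation otherwise."""
    threshold = math.e * t * 2.0 ** (-b)
    if direct_estimate > threshold:
        return direct_estimate
    return interpolate(n_x, n_t, n)

print(interpolate(30, 200, 1000))                    # 6.0
print(estimate(50.0, 30, 200, 1000, t=8, b=4))       # 50.0, head item
print(estimate(1.0, 30, 200, 1000, t=8, b=4))        # 6.0, torso or tail item
```

With t = 8 and b = 4 the threshold is e / 2, roughly 1.36, so only items whose direct estimate exceeds that value bypass interpolation.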
Setup and Throughput
Web query data, 5 days sample: 97.9M unique terms, 378.1M total
Wikipedia data: 4.5M unique terms, 1291.5M total
[Figure: log-log plots of number of unique terms versus term frequency for both datasets]
Configuration
Platform
64-bit Linux
4-core 2GHz x86
16GB RAM
Gigabit network
Sketch setup
4 hash functions
$2^{23}$ bins
$2^{11}$ aggregation intervals (7 days in 5 minute intervals)
3-gram interpolation
12GB sketch with 3 hash functions
$2^{30}$ bins
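The configuration numbers can be sanity-checked with a few lines of arithmetic. The 4-byte counter width is an assumption made here (the slides do not state it), as is the helper name `sketch_bytes`.

```python
BYTES_PER_COUNTER = 4  # assumed counter width, not stated on the slide

def sketch_bytes(hashes, log2_bins, bytes_per_counter=BYTES_PER_COUNTER):
    """Memory for one sketch with `hashes` rows of 2**log2_bins counters."""
    return hashes * 2 ** log2_bins * bytes_per_counter

five_minute_slots = 7 * 24 * 60 // 5          # 5 minute intervals in 7 days
print(five_minute_slots, 2 ** 11)             # 2016 fits within 2048
print(sketch_bytes(3, 30) / 2 ** 30, "GiB")   # 3-gram sketch: 12.0 GiB
print(sketch_bytes(4, 23) / 2 ** 20, "MiB")   # one time-step sketch: 128.0 MiB
```

Under that assumption the 3-gram sketch comes to exactly 12 GiB, consistent with the figure quoted above.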
Speed
Software
Client-server system
ICE middleware
1 server, 10 clients
Throughput/s
50k inserts
22k requests (time aggregation)
8.5k requests (resolution interp.)
Limiting Factors
TCP/IP Overhead
Package query
Memory latency
Random access
Accuracy (aggregate absolute error $\hat{n} - n$)
Accuracy (stratified absolute error $\hat{n} - n$)
Sketching for Graphical Models
Goal
Observe stream of observations
Estimate joint probability in O(1) time
CountMin is good for head but interpolation better for torso and tail
General Strategy
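Markov network with junction tree: cliques $\mathcal{C}$ and separator sets $\mathcal{S}$.
Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate
$\hat{p}(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}$.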
Estimates are fast — only lookup in CountMin sketch. No need to
solve convex program for graphical model inference.
Markov Chain
Unigrams: $p(abc) \approx n^{-3} \cdot \hat{n}_a \cdot \hat{n}_b \cdot \hat{n}_c$
Bigrams: $p(abc) \approx n^{-1} \cdot \frac{\hat{n}_{ab} \cdot \hat{n}_{bc}}{\hat{n}_b}$
Backoff smoothing (e.g. Kneser-Ney) in practice.
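The general strategy reduces to multiplying clique counts, dividing by separator counts, and normalising by a power of n equal to the number of separators minus the number of cliques. The sketch below applies that rule to the Markov chain case. The function name `junction_tree_estimate` and the toy counts are made up for illustration, and plain numbers stand in for sketch lookups. For the bigram chain there are two cliques (ab, bc) and one separator (b), so the normalisation works out to n to the power -1.

```python
import math

def junction_tree_estimate(clique_counts, separator_counts, n):
    """p_hat = n**(|S| - |C|) * prod(clique counts) / prod(separator counts)."""
    if any(count == 0 for count in separator_counts):
        return 0.0  # unseen separator: no evidence for the sequence
    exponent = len(separator_counts) - len(clique_counts)
    return n ** exponent * math.prod(clique_counts) / math.prod(separator_counts)

n = 100                                   # total number of observations (toy value)
n_a, n_b, n_c = 10, 20, 5                 # toy unigram counts
n_ab, n_bc = 4, 2                         # toy bigram counts

unigram = junction_tree_estimate([n_a, n_b, n_c], [], n)
bigram = junction_tree_estimate([n_ab, n_bc], [n_b], n)
print(unigram, bigram)
```

Each estimate needs only a fixed number of count lookups, which is why no optimisation problem has to be solved at query time.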
n-gram Interpolation
Trigram approximation
Wikipedia dataset (1291.5M terms, 405M unique trigrams)
                               Absolute error       Relative error
Unigram approximation          $2.50 \cdot 10^7$    0.266
Bigram approximation           $1.22 \cdot 10^6$    0.013
Trigram sketching (CountMin)   $8.35 \cdot 10^6$    0.089
Sketching trigrams is not accurate enough on the tail.
Summary
Fast and simple algorithm to aggregate statistics of data streams.
Effective compressed representation of the temporal data.
Works well for graphical models.
High-performance scalable implementation with O(1) time access.
Can be distributed over many servers.
Hokusai Katsushika
Great Wave off Kanagawa
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx
 

Hokusai - Sketching streams in real time

  • 1. Hokusai: Sketching streams in real time. Sergiy Matusevych (Yahoo! Research, Santa Clara, CA), Alexander J. Smola (Google, Mountain View, CA), Amr Ahmed (Google, Mountain View, CA). UAI 2012.
  • 2. Thanks: Alex Smola (Google and CMU), Amr Ahmed (Google).
  • 3. Motivation: Compute frequencies of elements in the data stream. Item frequencies change over time. The number of items is unknown and variable. Example: logging query frequency over time.
  • 4. Motivation: Compute frequencies of elements in the data stream. Item frequencies change over time. The number of items is unknown and variable. Example: logging query frequency over time.
     Applications: flow counting for IP traffic (who sent what, when and how much); spam detection and filtering (detect bursts immediately); website analytics (feedback to editors, trend detection).
     State of the art: a CountMin sketch is instantaneous but does not log time; naive snapshotting costs linear memory; a MapReduce batch job provides exact counts but with long delays.
     Resource constraints: fixed memory footprint for the entire sketch regardless of duration; high query throughput; real-time aggregation and response.
  • 5. Strategy
     1. Use a CountMin sketch to store snapshots of the data (this solves the real-time logging problem).
     2. Compress snapshots linearly as they age. We care most about recent events. Storage is logarithmic since $\sum_{t=1}^{T} t^{-1} = O(\log T)$.
     3. Exploit the CountMin data structure for efficient compression. Variant 1: reduce storage per snapshot. Variant 2: increase timespan per snapshot.
     4. Interpolate between both variants for improved accuracy.
  • 6. CountMin Sketch (Cormode & Muthukrishnan)
     $M \in \mathbb{R}^{d \times n}$ is a matrix with one row per hash function ($d$ hash functions $h_1, \ldots, h_d$) and $n$ bins per row. An item $x$ is hashed to one bin in each row.
     In-memory data structure for instantaneous retrieval. Aggregate statistic of the observation interval. Intuition: a Bloom filter with integers.
     Algorithm
       insert(x): for i = 1 to d do M[i, h_i(x)] ← M[i, h_i(x)] + 1 end for
       query(x): \hat{n}_x ← min over i ∈ {1, ..., d} of M[i, h_i(x)]; return \hat{n}_x
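The insert and query steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `CountMinSketch` and the hash family (a BLAKE2b digest salted with the row index) are assumptions chosen only to make the example self-contained.

```python
import hashlib


class CountMinSketch:
    """d rows of n integer counters; row i is addressed by hash function h_i."""

    def __init__(self, d, n):
        self.d = d
        self.n = n
        self.table = [[0] * n for _ in range(d)]

    def _bin(self, i, x):
        # Assumed hash family: BLAKE2b salted with the row index i.
        digest = hashlib.blake2b(
            str(x).encode("utf-8"),
            digest_size=8,
            salt=i.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.n

    def insert(self, x, count=1):
        # Increment one counter per row.
        for i in range(self.d):
            self.table[i][self._bin(i, x)] += count

    def query(self, x):
        # Collisions only ever add to a counter, so the minimum over the
        # rows is the tightest available upper bound on the true count.
        return min(self.table[i][self._bin(i, x)] for i in range(self.d))


sketch = CountMinSketch(d=4, n=1024)
for _ in range(3):
    sketch.insert("weather")
sketch.insert("news")
estimate = sketch.query("weather")  # never below the true count of 3
```

Because every insert touches exactly one counter per row, each row always sums to the total number of inserted events, which is a cheap sanity check on an implementation.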
  • 7. Guarantees
     Approximation guarantee: for a sketch with $d = \log\frac{1}{\delta}$ and $n = \frac{e}{\epsilon}$ we have, with probability $1 - \delta$, that the estimate $\hat{n}_x$ deviates from the true count $n_x$ via $n_x \le \hat{n}_x \le n_x + \epsilon \sum_{x'} n_{x'}$ for each item $x$.
     The sketch is a linear statistic of the data.
     Power law distributions with exponent $z$ only use $O(N^{-1/z})$ space.
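As a worked example of the guarantee, the helper below turns a target error $\epsilon$ and failure probability $\delta$ into sketch dimensions. The function name `sketch_dimensions` is hypothetical, and rounding both values up to integers is an assumption, since a sketch needs whole rows and bins.

```python
import math


def sketch_dimensions(epsilon, delta):
    """Return (d, n) with d = ceil(log(1/delta)) rows and n = ceil(e/epsilon) bins."""
    d = math.ceil(math.log(1.0 / delta))
    n = math.ceil(math.e / epsilon)
    return d, n


# Overestimate of at most 0.1% of the stream length, with 99% probability.
d, n = sketch_dimensions(epsilon=0.001, delta=0.01)
```

With these inputs the sketch needs 5 rows of 2719 bins, independent of how many distinct items the stream contains.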
  • 8. Step 1: Combining time intervals. $M_T$ and $M_{T'}$ are sketches for time intervals $T$ and $T'$ with $T \cap T' = \emptyset$. Combine the sketches by adding them up entrywise: $M_{T \cup T'} = M_T + M_{T'}$.
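Adding two sketches works because the sketch is a linear statistic: each counter is a plain sum of events. A minimal sketch of the merge, assuming both sketches are stored as lists of rows with the same shape and were built with the same hash functions (the function name `add_sketches` is illustrative):

```python
def add_sketches(m_a, m_b):
    """Entrywise sum of two count matrices of identical shape."""
    same_shape = len(m_a) == len(m_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(m_a, m_b)
    )
    if not same_shape:
        raise ValueError("sketches must have the same number of rows and bins")
    return [
        [a + b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(m_a, m_b)
    ]


morning = [[1, 0, 2], [0, 3, 0]]
evening = [[0, 4, 1], [2, 0, 3]]
whole_day = add_sketches(morning, evening)
```

Querying `whole_day` gives the same answer as a single sketch that had observed both intervals directly.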
  • 9. Step 1: Efficient computation
     Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \ldots, 2^m\}$.
     Insert into the leftmost aggregation interval.
     Aggregate as a cumulative sum from the left using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$.
     Computation is $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time and $O(\log t)$ space.
     [Figure: aggregation intervals over successive time steps.]
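One way to realize the cumulative-sum cascade is sketched below. It is an illustration under stated assumptions, not the authors' code: a `collections.Counter` stands in for a CountMin sketch (both support addition), the class name `TimeAggregator` is hypothetical, and level `j` is assumed to hold the most recently completed block of `2**j` unit intervals.

```python
from collections import Counter


class TimeAggregator:
    """Keeps one aggregate per power-of-two interval length."""

    def __init__(self):
        self.levels = []          # levels[j] covers 2**j unit intervals
        self.current = Counter()  # the open unit interval
        self.t = 0                # number of closed unit intervals

    def insert(self, x):
        self.current[x] += 1

    def tick(self):
        """Close the current unit interval and cascade the sums."""
        self.t += 1
        carry = self.current
        j = 0
        # Level j is refreshed only when t is a multiple of 2**j, so the
        # cascade costs O(1) amortized work per tick.
        while self.t % (2 ** j) == 0:
            if j == len(self.levels):
                self.levels.append(Counter())
            older = self.levels[j]
            self.levels[j] = carry
            carry = carry + older  # cumulative sum passed to the next level
            j += 1
        self.current = Counter()


agg = TimeAggregator()
for _ in range(4):
    agg.insert("query")
    agg.tick()
counts = [level["query"] for level in agg.levels]
```

After four ticks with one event each, the three levels hold the counts for the last 1, 2 and 4 unit intervals. Only about log2(t) levels exist at time t, which is where the logarithmic space bound comes from.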
  • 12. Step 2: Folding over. $M^b$ is a sketch with $n = 2^b$ bins. $M^{b-1}$ can be obtained by "folding over" the sketch: $M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$.
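The folding step can be illustrated in a few lines. The sketch below assumes the count matrix is a list of rows whose width is a power of two, and that the bin for a width-$2^b$ sketch is computed as the raw hash value modulo $2^b$, so that an item's bin in the folded sketch is its old bin modulo the new width. The function name `fold` is illustrative.

```python
def fold(sketch):
    """Halve the width: new[i][j] = old[i][j] + old[i][j + width // 2]."""
    width = len(sketch[0])
    if width < 2 or width & (width - 1) != 0:
        raise ValueError("width must be a power of two and at least 2")
    half = width // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in sketch]


wide = [[1, 2, 3, 4], [5, 6, 7, 8]]
narrow = fold(wide)

# An item with raw hash value 13 lands in bin 13 % 8 = 5 at width 8 and in
# bin 13 % 4 = 1 at width 4, which is exactly where fold() moves its counter.
raw_hash = 13
old_bin, new_bin = raw_hash % 8, raw_hash % 4
```

Folding preserves the total mass in every row, so queries on the narrower sketch still never underestimate; they only see more collisions.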
  • 13. Step 2: Efficient computation. Halve the size of the sketch every $2^t$ intervals. Computation costs $O(1)$ time and $O(\log t)$ space.
     [Figure: interval 1 is stored as 1 × 16 bins, intervals 2 and 3 as 2 × 8 bins, intervals 4 to 7 as 4 × 4 bins, and so on.]
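The storage schedule in the figure can be written down as a small function. This is a reading of the figure labels (1 × 16, 2 × 8, 4 × 4 bins), not a statement of the authors' exact procedure; the function name `bins_at_age` and the floor of one bin are assumptions.

```python
def bins_at_age(age, full_bins):
    """Bins kept for a snapshot that is `age` intervals old (age >= 1)."""
    if age < 1:
        raise ValueError("age must be at least 1")
    halvings = age.bit_length() - 1  # floor(log2(age))
    return max(1, full_bins >> halvings)


widths = [bins_at_age(age, 16) for age in range(1, 8)]
total_bins = sum(widths)
```

Each group of `2**k` snapshots uses `2**k` times `full_bins / 2**k` bins, so every group costs the same amount of memory as one fresh snapshot. Since there are only about log2(t) groups after t intervals, total storage grows logarithmically.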
  • 14. Step 3: Resolution Interpolation
     Time aggregation reports a good estimate over a long time interval. Item aggregation reports a poor estimate over a short time interval. Treat both as marginals of the joint distribution: assume independence and interpolate.
     Torso and tail: with the item aggregated estimate $n_x$ and the time aggregated estimate $n_t$, the interpolated count is $\hat{n}_{xt} = \frac{n_x \cdot n_t}{n}$, where $n = \sum_t n_t = \sum_x n_x$.
     Head: sketch accuracy decreases with $e \cdot t$. Use the regular CountMin sketch estimate $\tilde{n}(x, t)$ whenever $\tilde{n}(x, t) > e \cdot t \cdot 2^{-b}$.
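The decision rule above can be sketched as follows. The function names are illustrative, and the parameters `t` and `b` are passed through exactly as they appear in the threshold $e \cdot t \cdot 2^{-b}$; taking `b` to be the base-2 logarithm of the number of bins follows the folding slide and is an assumption here.

```python
import math


def interpolate(n_x, n_t, n):
    """Independence-based estimate n_x * n_t / n from the two marginals."""
    return n_x * n_t / n if n > 0 else 0.0


def estimate_count(n_tilde_xt, n_x, n_t, n, t, b):
    """Use the direct sketch estimate for heavy items, else interpolate."""
    threshold = math.e * t * 2.0 ** (-b)
    if n_tilde_xt > threshold:
        return float(n_tilde_xt)      # head: the sketch is accurate enough
    return interpolate(n_x, n_t, n)   # torso and tail


head = estimate_count(n_tilde_xt=50, n_x=30, n_t=20, n=100, t=8, b=4)
tail = estimate_count(n_tilde_xt=1, n_x=30, n_t=20, n=100, t=8, b=4)
```

Here the threshold is roughly 1.36, so the first call returns the direct estimate of 50 and the second falls back to the interpolated value 30 · 20 / 100 = 6.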
  • 15. Setup and Throughput
     Web query data, 5 days sample: 97.9M unique terms, 378.1M total. [Figure: number of unique terms vs. term frequency, log-log axes.]
     Wikipedia data: 4.5M unique terms, 1291.5M total. [Figure: number of unique terms vs. term frequency, log-log axes.]
     Configuration
       Platform: 64-bit Linux, 4-core 2 GHz x86, 16 GB RAM, Gigabit network.
       Sketch setup: 4 hash functions, $2^{23}$ bins, $2^{11}$ aggregation intervals (7 days in 5-minute intervals).
       3-gram interpolation: 12 GB sketch with 3 hash functions and $2^{30}$ bins.
  • 16. Setup and Throughput
     Speed
       Software: client-server system, ICE middleware, 1 server, 10 clients.
       Throughput per second: 50k inserts; 22k requests (time aggregation); 8.5k requests (resolution interpolation).
       Limiting factors: TCP/IP overhead (package query); memory latency (random access).
  • 17. Accuracy (aggregate absolute error $\hat{n} - n$). [Figure only.]
  • 18. Accuracy (stratified absolute error $\hat{n} - n$). [Figure only.]
  • 19. Sketching for Graphical Models
     Goal: observe a stream of observations and estimate the joint probability in $O(1)$ time. CountMin is good for the head, but interpolation is better for the torso and tail.
     General strategy: take a Markov network with a junction tree, with cliques $\mathcal{C}$ and separator sets $\mathcal{S}$. Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate $\hat{p}(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}$.
     Estimates are fast: they need only lookups in the CountMin sketch. There is no need to solve a convex program for graphical model inference.
     Markov chain
       Unigrams: $p(abc) \approx n^{-3} \cdot \hat{n}_a \cdot \hat{n}_b \cdot \hat{n}_c$
       Bigrams: $p(abc) \approx n^{-1} \cdot \frac{\hat{n}_{ab} \cdot \hat{n}_{bc}}{\hat{n}_b}$ (two cliques and one separator, so the prefactor is $n^{|\mathcal{S}| - |\mathcal{C}|} = n^{-1}$)
     Backoff smoothing (e.g. Kneser-Ney) is used in practice.
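The general clique and separator formula can be evaluated directly from sketch lookups. The snippet below is a minimal sketch with made-up counts; the function name `junction_tree_estimate` is hypothetical, and returning zero when a separator count is zero is an assumption, not something the slides specify.

```python
def junction_tree_estimate(clique_counts, separator_counts, n):
    """p_hat = n**(|S| - |C|) * prod(clique counts) / prod(separator counts)."""
    p = float(n) ** (len(separator_counts) - len(clique_counts))
    for count in clique_counts:
        p *= count
    for count in separator_counts:
        if count == 0:
            return 0.0  # assumed convention for unseen separators
        p /= count
    return p


n = 100  # total number of observed events (illustrative)

# Unigram model for "a b c": three cliques, no separators.
p_unigram = junction_tree_estimate([10, 20, 30], [], n)

# Bigram (Markov chain) model: cliques {a,b} and {b,c}, separator {b}.
p_bigram = junction_tree_estimate([8, 6], [12], n)
```

With two cliques and one separator the prefactor is $n^{1-2} = n^{-1}$, which matches writing the chain as $p(ab) \, p(bc) / p(b)$ with each probability estimated as a count divided by $n$.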
  • 20. n-gram Interpolation
     Trigram approximation, Wikipedia dataset (1291.5M terms, 405M unique trigrams)

       Method                         Absolute error   Relative error
       Unigram approximation          2.50 · 10^7      0.266
       Bigram approximation           1.22 · 10^6      0.013
       Trigram sketching (CountMin)   8.35 · 10^6      0.089

     Sketching trigrams is not accurate enough on the tail.
  • 21. Summary
     Fast and simple algorithm to aggregate statistics of data streams.
     Effective compressed representation of the temporal data.
     Works well for graphical models.
     High-performance scalable implementation with $O(1)$ time access.
     Can be distributed over many servers.
     [Image: Hokusai Katsushika, Great Wave off Kanagawa.]