Hokusai
Sketching streams in real time
Sergiy Matusevych¹
Alexander J. Smola²
Amr Ahmed²
¹ Yahoo! Research, Santa Clara, CA
² Google, Mountain View, CA
UAI 2012
Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1
Thanks
Alex Smola
Google and CMU
Amr Ahmed
Google
Motivation
Compute frequencies of elements in the data stream
Item frequencies change over time.
Number of items unknown and variable.
Example - logging query frequency over time.
Applications
Flow counting for IP traffic (who sent what, when and how much)
Spam detection and filtering (detect bursts immediately)
Website analytics (feedback to editors, trend detection)
State of the art
CountMin sketch is instantaneous but does not log time.
Naive snapshotting costs linear memory.
MapReduce batch job provides exact counts but long delays.
Resource constraints
Fixed memory footprint for entire sketch regardless of duration
High query throughput
Real time aggregation and response
Strategy
1. Use CountMin sketch to store snapshots of data.
(this solves the real time logging problem)
2. Compress snapshots linearly as they age
We care most about recent events
Logarithmic storage since $\sum_{t=1}^{T} t^{-1} = O(\log T)$
3. Exploit CountMin data structure for efficient compression
Variant 1: reduce storage per snapshot
Variant 2: increase timespan per snapshot
4. Interpolate between both variants for improved accuracy
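The logarithmic bound in item 2 follows from comparing the sum of 1/t for t = 1, ..., T with an integral. A short derivation, added here for reference:

```latex
\sum_{t=1}^{T} \frac{1}{t} \;\le\; 1 + \int_{1}^{T} \frac{dt}{t} \;=\; 1 + \ln T \;=\; O(\log T)
```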
CountMin Sketch (Cormode & Muthukrishnan)
$M \in \mathbb{R}^{d \times n}$ matrix
d hash functions (one per row)
n bins (one per column)
hash h1: M11 M12 M13 M14 M15 M16 . . . M1n
hash h2: M21 M22 M23 M24 M25 M26 . . . M2n
hash h3: M31 M32 M33 M34 M35 M36 . . . M3n
Item x is hashed to one bin in each row.
In-memory data structure for instantaneous retrieval
Aggregate statistic of observation interval (instantaneous retrieval)
Intuition — Bloom filter with integers
Algorithm
insert(x):
  for i = 1 to d do
    M[i, h_i(x)] ← M[i, h_i(x)] + 1
  end for
query(x):
  \hat{n}_x ← \min_{i ∈ {1, ..., d}} M[i, h_i(x)]
  return \hat{n}_x
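The insert and query routines can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `CountMinSketch`, the use of salted BLAKE2b as the family of d hash functions, and the restriction to string items are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """d x n table of counters; one salted hash function per row (assumed scheme)."""

    def __init__(self, depth, width):
        self.depth = depth    # d hash functions
        self.width = width    # n bins per row
        self.table = [[0] * width for _ in range(depth)]

    def _bin(self, row, item):
        # The row index is used as the hash salt, giving a different function per row.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def insert(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._bin(row, item)] += count

    def query(self, item):
        # Collisions only add to a counter, so the minimum is the tightest estimate.
        return min(self.table[row][self._bin(row, item)]
                   for row in range(self.depth))
```

Because collisions can only increase a counter, `query` never returns less than the true count of an item.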
Guarantees
Approximation guarantee
For a sketch with $d = \log \frac{1}{\delta}$ and $n = \frac{e}{\epsilon}$ we have with probability $1 - \delta$ that the estimate $\hat{n}_x$ deviates from the count $n_x$ via
$n_x \le \hat{n}_x \le n_x + \epsilon \sum_{x'} n_{x'}$ for all x.
Linear statistic of the data
Power law distributions with exponent z only use $O(N^{-1/z})$ space.
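To make the guarantee concrete, the helper below turns a target error and failure probability into sketch dimensions. It reads the guarantee in the standard CountMin parameterization, d = log(1/δ) and n = e/ε. Rounding up to integers and the function name `sketch_dimensions` are assumptions of this example.

```python
import math


def sketch_dimensions(epsilon, delta):
    """Return (d, n): number of hash functions and number of bins per row."""
    if not (0 < epsilon and 0 < delta < 1):
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    depth = math.ceil(math.log(1.0 / delta))   # d = log(1/delta), natural log
    width = math.ceil(math.e / epsilon)        # n = e/epsilon
    return depth, width
```

For example, an additive error of 0.1% of the stream length with 99% confidence needs 5 rows of 2719 bins under this reading.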
Step 1: Combining time intervals
$M_T$ and $M_{T'}$ are sketches for time intervals $T$ and $T'$ with $T \cap T' = \emptyset$.
Combine sketches by adding them up: $M_{T \cup T'} = M_T + M_{T'}$
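Because insertion only increments counters, the sketch is linear in the data, and two sketches built with the same hash functions can be merged entry by entry. A minimal illustration, assuming each sketch is stored as a list of rows (the name `add_sketches` is hypothetical):

```python
def add_sketches(a, b):
    """Elementwise sum of two sketches with identical shape and hash functions."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("sketches must have the same shape")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
```

The result is the sketch that would have been obtained by inserting both intervals' items into a single table.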
Step 1: Efficient computation
Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \ldots, 2^m\}$.
Insert into the leftmost aggregation interval.
Aggregate as cumulative sum from the left using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$
Computation is $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time, $O(\log t)$ space.
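One way to realize this update is a binary-counter style carry: at time t, the cumulative sum is pushed through levels 0 up to the number of trailing zero bits of t. The sketch below is an illustration under simplifying assumptions: each sketch is flattened to a single list of counters, and the names `TimeAggregator`, `insert` and `tick` are hypothetical.

```python
class TimeAggregator:
    """Keeps aggregates over the last 1, 2, 4, ..., 2**m unit intervals."""

    def __init__(self, width):
        self.width = width
        self.t = 0
        self.current = [0] * width   # open unit interval (leftmost)
        self.levels = []             # levels[j] covers 2**j unit intervals

    def insert(self, bin_index, count=1):
        self.current[bin_index] += count

    def tick(self):
        """Close the current unit interval and cascade cumulative sums."""
        self.t += 1
        # Highest level touched: number of trailing zero bits of t.
        top = (self.t & -self.t).bit_length() - 1
        carry = self.current
        for j in range(top + 1):
            if j == len(self.levels):
                self.levels.append([0] * self.width)
            backup = carry
            # Cumulative sum from the left now spans 2**(j+1) intervals.
            carry = [a + b for a, b in zip(carry, self.levels[j])]
            self.levels[j] = backup
        self.current = [0] * self.width
```

Level j is rewritten only every 2^j ticks, which is where the amortized O(1) cost per tick comes from.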
Step 2: Folding over
$M^b$ is a sketch with $n = 2^b$ bins.
$M^{b-1}$ can be obtained as
$M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$
by “folding over” the sketch
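Folding can be sketched in a few lines of Python, assuming the sketch is a list of rows whose length is a power of two (the function name `fold` is hypothetical):

```python
def fold(sketch):
    """Halve the number of bins: new[i][j] = old[i][j] + old[i][j + n/2]."""
    folded = []
    for row in sketch:
        n = len(row)
        if n < 2 or n % 2 != 0:
            raise ValueError("row length must be even and at least 2")
        half = n // 2
        folded.append([row[j] + row[j + half] for j in range(half)])
    return folded
```

After folding, an item that used bin h(x) in the wide sketch is found in bin h(x) mod 2^(b-1), so the narrow sketch has to be queried with the hash value reduced modulo the new width.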
Step 2: Efficient computation
Halve the size of the sketch every $2^t$ intervals.
Computation costs O(1) time and O(log t) space.
interval 1: 1 x 16 bins
intervals 2-3: 2 x 8 bins
intervals 4-7: 4 x 4 bins
. . .
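The schedule above can be illustrated with a small Python sketch that keeps one row per unit interval and folds a row each time its age reaches a power of two (2, 4, 8, ...). The single-row representation, the power-of-two bin count, and the names `ItemAggregator`, `tick` and `fold_row` are assumptions of this example.

```python
def fold_row(row):
    """Halve a row by adding its upper half onto its lower half."""
    half = len(row) // 2
    return [row[j] + row[j + half] for j in range(half)]


class ItemAggregator:
    """Stores one row per unit interval; older rows get fewer bins."""

    def __init__(self):
        self.history = []   # history[s - 1] is the row for interval s

    def tick(self, row):
        """Append the newest interval and shrink rows whose age hits 2**k."""
        self.history.append(list(row))
        t = len(self.history)
        k = 1
        while (1 << k) < t:              # interval t - 2**k exists
            s = t - (1 << k)
            if len(self.history[s - 1]) >= 2:
                self.history[s - 1] = fold_row(self.history[s - 1])
            k += 1
```

With 16 initial bins, rows of age 2 to 3 end up with 8 bins and rows of age 4 to 7 with 4 bins, so every dyadic age range occupies the same total amount of memory.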
Step 3: Resolution Interpolation
Time aggregation reports good estimate over long time interval.
Item aggregation reports poor estimate over short time interval.
Marginals of joint distribution: assume independence & interpolate
[Figure: marginals n(t) and n(x), total n]
Torso and Tail
Item aggregated estimate $n_x$
Time aggregated estimate $n_t$
Count interpolation
$\hat{n}_{xt} = \frac{n_x \cdot n_t}{n}$ where $n = \sum_t n_t = \sum_x n_x$
Head
Sketch accuracy decreases with $e \cdot t$
Use regular CountMin sketch whenever
$\tilde{n}(x, t) > e \cdot t \cdot 2^{-b}$
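The two regimes can be combined in a small selection routine. This is a sketch under stated assumptions: `direct` stands for the count read directly from the sketch, ñ(x, t), the symbols t and b are used exactly as in the threshold above, and both function names are hypothetical.

```python
import math


def interpolate_count(n_x, n_t, n):
    """Independence-based estimate n_x * n_t / n (0.0 for an empty stream)."""
    if n <= 0:
        return 0.0
    return n_x * n_t / n


def estimate_count(direct, n_x, n_t, n, t, b):
    """Use the direct sketch count for head items, interpolation otherwise."""
    threshold = math.e * t * 2.0 ** (-b)
    if direct > threshold:
        return float(direct)
    return interpolate_count(n_x, n_t, n)
```

Frequent items clear the threshold and keep their direct count; for the torso and tail the product of marginals is used instead.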
Setup and Throughput
Web query data, 5 days sample: 97.9M unique terms, 378.1M total
[Plot: number of unique terms vs. term frequency, log-log axes]
Wikipedia data: 4.5M unique terms, 1291.5M total
[Plot: number of unique terms vs. term frequency, log-log axes]
Configuration
Platform
64-bit Linux
4-core 2GHz x86
16GB RAM
Gigabit network
Sketch setup
4 hash functions
$2^{23}$ bins
$2^{11}$ aggregation intervals (7 days in 5 minute intervals)
3-gram interpolation
12GB sketch with 3 hash functions and $2^{30}$ bins
Speed
Software
Client-server system
ICE middleware
1 server, 10 clients
Throughput/s
50k inserts
22k requests (time aggregation)
8.5k requests (resolution interp.)
Limiting Factors
TCP/IP Overhead
Package query
Memory latency
Random access
Accuracy (aggregate absolute error $\hat{n} - n$)
Accuracy (stratified absolute error $\hat{n} - n$)
Sketching for Graphical Models
Goal
Observe stream of observations
Estimate joint probability in O(1) time
CountMin is good for head but interpolation better for torso and tail
General Strategy
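Markov network with junction tree: cliques $\mathcal{C}$ and separator sets $\mathcal{S}$.
Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate
$\hat{p}(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}$.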
Estimates are fast — only lookup in CountMin sketch. No need to
solve convex program for graphical model inference.
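The junction tree estimate reduces to a product of looked-up counts. A minimal sketch, assuming the clique and separator counts have already been read from the sketch (the function name `junction_tree_estimate` is hypothetical):

```python
from math import prod


def junction_tree_estimate(n, clique_counts, separator_counts):
    """n^(|S| - |C|) * product of clique counts / product of separator counts."""
    if n <= 0 or any(c <= 0 for c in separator_counts):
        return 0.0   # nothing observed yet for this configuration
    exponent = len(separator_counts) - len(clique_counts)
    return n ** exponent * prod(clique_counts) / prod(separator_counts)
```

For a chain a, b, c the cliques are {a, b} and {b, c} with separator {b}, so the call takes the two bigram counts and the count of b. With three singleton cliques and no separators it reduces to the product of unigram frequencies.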
Markov Chain
Unigrams: $p(abc) \approx n^{-3} \cdot \hat{n}_a \cdot \hat{n}_b \cdot \hat{n}_c$
Bigrams: $p(abc) \approx n^{-1} \cdot \frac{\hat{n}_{ab} \cdot \hat{n}_{bc}}{\hat{n}_b}$
Backoff smoothing (e.g. Kneser-Ney) in practice.
n-gram Interpolation
Trigram approximation
Wikipedia dataset (1291.5M terms, 405M unique trigrams)
Method                         Absolute error     Relative error
Unigram approximation          2.50 · 10^7        0.266
Bigram approximation           1.22 · 10^6        0.013
Trigram sketching (CountMin)   8.35 · 10^6        0.089
Sketching trigrams is not accurate enough on the tail.
Summary
Fast and simple algorithm to aggregate statistics of data streams.
Effective compressed representation of the temporal data.
Works well for graphical models.
High-performance scalable implementation with O(1) time access.
Can be distributed over many servers.
Hokusai Katsushika
Great Wave off Kanagawa
Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 21

More Related Content

What's hot

Moment Preserving Approximation of Independent Components for the Reconstruct...
Moment Preserving Approximation of Independent Components for the Reconstruct...Moment Preserving Approximation of Independent Components for the Reconstruct...
Moment Preserving Approximation of Independent Components for the Reconstruct...
rahulmonikasharma
 

What's hot (20)

MVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priorsMVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priors
 
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
 
A common fixed point theorem for two random operators using random mann itera...
A common fixed point theorem for two random operators using random mann itera...A common fixed point theorem for two random operators using random mann itera...
A common fixed point theorem for two random operators using random mann itera...
 
Problem Understanding through Landscape Theory
Problem Understanding through Landscape TheoryProblem Understanding through Landscape Theory
Problem Understanding through Landscape Theory
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Maneuvering target track prediction model
Maneuvering target track prediction modelManeuvering target track prediction model
Maneuvering target track prediction model
 
SIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithmsSIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithms
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihood
 
Moment Preserving Approximation of Independent Components for the Reconstruct...
Moment Preserving Approximation of Independent Components for the Reconstruct...Moment Preserving Approximation of Independent Components for the Reconstruct...
Moment Preserving Approximation of Independent Components for the Reconstruct...
 
LupoPasini_SIAMCSE15
LupoPasini_SIAMCSE15LupoPasini_SIAMCSE15
LupoPasini_SIAMCSE15
 
Non Deterministic and Deterministic Problems
Non Deterministic and Deterministic Problems Non Deterministic and Deterministic Problems
Non Deterministic and Deterministic Problems
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
 
Constrained Support Vector Quantile Regression for Conditional Quantile Estim...
Constrained Support Vector Quantile Regression for Conditional Quantile Estim...Constrained Support Vector Quantile Regression for Conditional Quantile Estim...
Constrained Support Vector Quantile Regression for Conditional Quantile Estim...
 
Distributed Support Vector Machines
Distributed Support Vector MachinesDistributed Support Vector Machines
Distributed Support Vector Machines
 
Lec17 sparse signal processing & applications
Lec17 sparse signal processing & applicationsLec17 sparse signal processing & applications
Lec17 sparse signal processing & applications
 
Dictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix FactorizationDictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix Factorization
 
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
 
Convex Optimization Modelling with CVXOPT
Convex Optimization Modelling with CVXOPTConvex Optimization Modelling with CVXOPT
Convex Optimization Modelling with CVXOPT
 

Viewers also liked

Jones_Talei_EDE202_Assessment2ppt
Jones_Talei_EDE202_Assessment2pptJones_Talei_EDE202_Assessment2ppt
Jones_Talei_EDE202_Assessment2ppt
Talei85
 
Jackson Pollock
Jackson PollockJackson Pollock
Jackson Pollock
Eric
 
Japanese printmaking elementary lesson ppt
Japanese printmaking elementary lesson pptJapanese printmaking elementary lesson ppt
Japanese printmaking elementary lesson ppt
dandeliondandelion23
 
Vincent van gogh
Vincent van goghVincent van gogh
Vincent van gogh
mkredford
 

Viewers also liked (20)

Jones_Talei_EDE202_Assessment2ppt
Jones_Talei_EDE202_Assessment2pptJones_Talei_EDE202_Assessment2ppt
Jones_Talei_EDE202_Assessment2ppt
 
Hokusai
HokusaiHokusai
Hokusai
 
Hokusai
HokusaiHokusai
Hokusai
 
George seurat
George seuratGeorge seurat
George seurat
 
Post impressionism Art Period Study Guide
Post impressionism Art Period Study GuidePost impressionism Art Period Study Guide
Post impressionism Art Period Study Guide
 
Seurat powerpoint
Seurat powerpointSeurat powerpoint
Seurat powerpoint
 
Jackson Pollock
Jackson PollockJackson Pollock
Jackson Pollock
 
Japanese printmaking elementary lesson ppt
Japanese printmaking elementary lesson pptJapanese printmaking elementary lesson ppt
Japanese printmaking elementary lesson ppt
 
Henri Matisse
Henri MatisseHenri Matisse
Henri Matisse
 
Paul klee.ppt
Paul klee.pptPaul klee.ppt
Paul klee.ppt
 
Paul Klee
Paul KleePaul Klee
Paul Klee
 
Hokusai~ The Last Series ~ Pictures for 100 Poems by 100 Poets (nx power lite)
Hokusai~ The Last Series ~ Pictures for 100 Poems by 100 Poets  (nx power lite)Hokusai~ The Last Series ~ Pictures for 100 Poems by 100 Poets  (nx power lite)
Hokusai~ The Last Series ~ Pictures for 100 Poems by 100 Poets (nx power lite)
 
PaulKlee
PaulKleePaulKlee
PaulKlee
 
Leonardo da Vinci
Leonardo da VinciLeonardo da Vinci
Leonardo da Vinci
 
Hokusai Nº2
Hokusai Nº2Hokusai Nº2
Hokusai Nº2
 
Henri Matisse
Henri MatisseHenri Matisse
Henri Matisse
 
Vincent van gogh
Vincent van goghVincent van gogh
Vincent van gogh
 
Bauhaus
BauhausBauhaus
Bauhaus
 
Bauhaus
BauhausBauhaus
Bauhaus
 
Periods of Art
Periods of ArtPeriods of Art
Periods of Art
 

Similar to Hokusai - Sketching streams in real time

!Business statistics tekst
!Business statistics tekst!Business statistics tekst
!Business statistics tekst
King Nisar
 
Introduction to Machine Vision
Introduction to Machine VisionIntroduction to Machine Vision
Introduction to Machine Vision
Nasir Jumani
 
11.optimal nonlocal means algorithm for denoising ultrasound image
11.optimal nonlocal means algorithm for denoising ultrasound image11.optimal nonlocal means algorithm for denoising ultrasound image
11.optimal nonlocal means algorithm for denoising ultrasound image
Alexander Decker
 
Projection methods for stochastic structural dynamics
Projection methods for stochastic structural dynamicsProjection methods for stochastic structural dynamics
Projection methods for stochastic structural dynamics
University of Glasgow
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
Frank Nielsen
 
Ch-2 final exam documet compler design elements
Ch-2 final exam documet compler design elementsCh-2 final exam documet compler design elements
Ch-2 final exam documet compler design elements
MAHERMOHAMED27
 

Similar to Hokusai - Sketching streams in real time (20)

D143136
D143136D143136
D143136
 
!Business statistics tekst
!Business statistics tekst!Business statistics tekst
!Business statistics tekst
 
Cg
CgCg
Cg
 
Introduction to Machine Vision
Introduction to Machine VisionIntroduction to Machine Vision
Introduction to Machine Vision
 
Analysis of Algorithum
Analysis of AlgorithumAnalysis of Algorithum
Analysis of Algorithum
 
13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdf13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdf
 
Optimal nonlocal means algorithm for denoising ultrasound image
Optimal nonlocal means algorithm for denoising ultrasound imageOptimal nonlocal means algorithm for denoising ultrasound image
Optimal nonlocal means algorithm for denoising ultrasound image
 
11.optimal nonlocal means algorithm for denoising ultrasound image
11.optimal nonlocal means algorithm for denoising ultrasound image11.optimal nonlocal means algorithm for denoising ultrasound image
11.optimal nonlocal means algorithm for denoising ultrasound image
 
Model-counting Approaches For Nonlinear Numerical Constraints
Model-counting Approaches For Nonlinear Numerical ConstraintsModel-counting Approaches For Nonlinear Numerical Constraints
Model-counting Approaches For Nonlinear Numerical Constraints
 
Projection methods for stochastic structural dynamics
Projection methods for stochastic structural dynamicsProjection methods for stochastic structural dynamics
Projection methods for stochastic structural dynamics
 
Viii sem
Viii semViii sem
Viii sem
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
 
Teknik Simulasi
Teknik SimulasiTeknik Simulasi
Teknik Simulasi
 
Ch-2 final exam documet compler design elements
Ch-2 final exam documet compler design elementsCh-2 final exam documet compler design elements
Ch-2 final exam documet compler design elements
 
Meshing for computer graphics
Meshing for computer graphicsMeshing for computer graphics
Meshing for computer graphics
 
AINL 2016: Strijov
AINL 2016: StrijovAINL 2016: Strijov
AINL 2016: Strijov
 
Design and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation AlgorithmsDesign and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation Algorithms
 
Lecture _Line Scan Conversion.ppt
Lecture _Line Scan Conversion.pptLecture _Line Scan Conversion.ppt
Lecture _Line Scan Conversion.ppt
 
Atomic algorithm and the servers' s use to find the Hamiltonian cycles
Atomic algorithm and the servers' s use to find the Hamiltonian cyclesAtomic algorithm and the servers' s use to find the Hamiltonian cycles
Atomic algorithm and the servers' s use to find the Hamiltonian cycles
 

Recently uploaded

Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
Silpa
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptx
 
An introduction on sequence tagged site mapping
An introduction on sequence tagged site mappingAn introduction on sequence tagged site mapping
An introduction on sequence tagged site mapping
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 

Hokusai - Sketching streams in real time

  • 1. Hokusai Sketching streams in real time Sergiy Matusevych1 Alexander J. Smola2 Amr Ahmed2 1Yahoo! Research, Santa Clara, CA 2Google, Mountain View, CA UAI 2012 Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1
  • 2. Thanks Alex Smola Google and CMU Amr Ahmed Google Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 2
  • 3. Motivation Compute frequencies of elements in the data stream Item frequencies change over time. Number of items unkonwn and variable. Example - logging query frequency over time. Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 3
  • 4. Motivation Compute frequencies of elements in the data stream Item frequencies change over time. Number of items unkonwn and variable. Example - logging query frequency over time. Applications Flow counting for IP traffic (who sent what, when and how much) Spam detection and filtering (detect bursts immediately) Website analytics (feedback to editors, trend detection) State of the art CountMin sketch is instantaneous but does not log time. Naive snapshotting costs linear memory. MapReduce batch job provides exact counts but long delays. Resource constraints Fixed memory footprint for entire sketch regardless of duration High query throughput Real time aggregation and response Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 4
  • 5. Strategy 1. Use CountMin sketch to store snapshots of data. (this solves the real time logging problem) 2. Compress snapshots linearly as they age We care most about recent events Logarithmic storage since T t=1 t−1 = O(log T) 3. Exploit CountMin data structure for efficient compression Variant 1: reduce storage per snapshot Variant 2: increase timespan per snapshot 4. Interpolate between both variants for improved accuracy Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 5
  • 6. CountMin Sketch (Cormode & Muthukrishnan) M ∈ Rd×n matrix d hash functions n bins hash h1 M11 M12 M13 M14 M15 M16 . . . M1n hash h2 M21 M22 M23 M24 M25 M26 . . . M2n hash h3 M31 M32 M33 M34 M35 M36 . . . M3n x In-memory data structure for instantaneous retrieval Aggregate statistic of observation interval (instantanous retrieval) Intuition — Bloom filter with integers Algorithm insert(x): for i = 1 to d do M[i, hi (x)] ← M[i, hi (x)] + 1 end for query(x): ˆnx ← min i∈{1,...d} M[i, hi (x)] return ˆnx Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 6
  • 7. Guarantees M ∈ Rd×n matrix d hash functions n bins hash h1 M11 M12 M13 M14 M15 M16 . . . M1n hash h2 M21 M22 M23 M24 M25 M26 . . . M2n hash h3 M31 M32 M33 M34 M35 M36 . . . M3n x Approximation guarantee For sketch with d = log 1 δ and n = e we have with probability 1 − δ that the estimate ˆnx deviates from the count nx via nx ≤ ˆnx ≤ nx + x nx for all x. Linear statistic of the data Power law distributions with exponent z only use O(N −1/z) space. Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 7
  • 8. Step 1: Combining time intervals M ∈ Rd×n matrix d hash functions n bins hash h1 M11 M12 M13 M14 M15 M16 . . . M1n hash h2 M21 M22 M23 M24 M25 M26 . . . M2n hash h3 M31 M32 M33 M34 M35 M36 . . . M3n x MT and MT sketches at time intervals T and T with T ∩ T = ∅. Combine sketches by adding them up + Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 8
  • 9. Step 1: Efficient computation Keep aggregates for time intervals of length {1, 1, 2, 4, 8, . . . , 2m}. Insert into the leftmost aggregation interval. Aggregate as cumulative sum from the left using 1 + n i=0 2i = 2n+1 Computation is ∞ n=1 n · 2−n = O(1) amortized time, O(log t) space. 4 2 1 1 1 1 2 1 1 1 1 1 1 1 1 42 4 2 1 2 1 1 1 1 1 2 4 Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 9
  • 10. Step 1: Efficient computation Keep aggregates for time intervals of length {1, 1, 2, 4, 8, . . . , 2m}. Insert into the leftmost aggregation interval. Aggregate as cumulative sum from the left using 1 + n i=0 2i = 2n+1 Computation is ∞ n=1 n · 2−n = O(1) amortized time, O(log t) space. 2 2 8 1 1 1 1 1 1 1 1 421 8 8 8 4 4 4 2 4 2 1 1 1 1 1 1 42 4 2 1 1 1 1 1 2 4 8 8 8 8 Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 10
  • 11. Step 1: Efficient computation Keep aggregates for time intervals of length {1, 1, 2, 4, 8, . . . , 2m}. Insert into the leftmost aggregation interval. Aggregate as cumulative sum from the left using 1 + n i=0 2i = 2n+1 Computation is ∞ n=1 n · 2−n = O(1) amortized time, O(log t) space. 2 2 8 1 1 1 1 1 1 1 1 421 8 8 8 4 4 4 2 4 2 1 1 1 1 1 1 42 4 2 1 1 1 1 1 2 4 8 8 8 8 Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 11
  • 12. Step 2: Folding over. [Figure: sketch matrix $M \in \mathbb{R}^{d \times n}$ with $d$ hash functions and $n$ bins.] $M^b$ is a sketch with $n = 2^b$ bins. $M^{b-1}$ can be obtained as $M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$ by "folding over" the sketch.
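Folding works when the bin index is the hash value taken modulo the number of bins, so that dropping the top bit of the index gives the bin in the smaller sketch. The example below checks this on a small table; the helper names and the hash are assumptions chosen for illustration.

```python
import hashlib


def hash_value(row, item):
    # Illustrative hash (an assumption): salt the item with the row index.
    digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(items, d, bits):
    """Build a sketch with 2**bits bins per row, using hash mod 2**bits."""
    n = 1 << bits
    table = [[0] * n for _ in range(d)]
    for item in items:
        for row in range(d):
            table[row][hash_value(row, item) % n] += 1
    return table


def fold(sketch):
    """Halve the bins: folded[i][j] = sketch[i][j] + sketch[i][j + n/2]."""
    half = len(sketch[0]) // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in sketch]


words = ["wave", "fuji", "wave", "boat", "sea", "fuji", "wave"]
big = build_sketch(words, d=3, bits=4)    # 16 bins per row
small = build_sketch(words, d=3, bits=3)  # 8 bins per row
folded = fold(big)
```

The folded 16-bin sketch equals the 8-bin sketch built directly from the same data, so an old sketch can be compressed without revisiting the stream.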
  • 13. Step 2: Efficient computation. Halve the size of the sketch every $2^t$ intervals. Computation costs $O(1)$ time and $O(\log t)$ space. [Figure: interval 1 stored with 16 bins, intervals 2 and 3 with 8 bins each, intervals 4 to 7 with 4 bins each.]
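Reading the figure literally, a sketch loses half of its bins each time its age crosses a power of two. The arithmetic below shows why total storage then grows only logarithmically; `bins_at_age` and `total_bins` are hypothetical helpers, and the floor of one bin is an added assumption to keep the numbers positive.

```python
def bins_at_age(age, newest_bins=16):
    """Bins kept for the sketch of the interval that is `age` steps old (age >= 1)."""
    halvings = age.bit_length() - 1  # floor(log2(age))
    return max(1, newest_bins >> halvings)


def total_bins(oldest_age, newest_bins=16):
    """Total bins per row across all stored intervals up to `oldest_age`."""
    return sum(bins_at_age(a, newest_bins) for a in range(1, oldest_age + 1))


schedule = [bins_at_age(a) for a in range(1, 8)]
```

Each group of ages from $2^k$ to $2^{k+1} - 1$ contributes the same total number of bins as the newest sketch alone, so doubling the history adds only one more group.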
  • 14. Step 3: Resolution Interpolation. Time aggregation reports a good estimate over a long time interval. Item aggregation reports a poor estimate over a short time interval. Marginals of the joint distribution: assume independence and interpolate. [Figure: marginals $n(t)$ and $n(x)$ of the count table with total $n$.] Torso and tail: with item aggregated estimate $n_x$ and time aggregated estimate $n_t$, the count interpolation is $\hat{n}_{xt} = \frac{n_x \cdot n_t}{n}$, where $n = \sum_t n_t = \sum_x n_x$. Head: sketch accuracy decreases with $e \cdot t$; use the regular CountMin sketch whenever $\tilde{n}(x, t) > e \cdot t \cdot 2^{-b}$.
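The decision rule on this slide can be written as a small function. This is a sketch under the assumption that the threshold is taken exactly as printed ($e \cdot t \cdot 2^{-b}$); the function and argument names are illustrative, and the inputs are plain numbers rather than live sketch lookups.

```python
import math


def interpolate(n_x, n_t, n):
    """Independence estimate n_x * n_t / n from the two marginal counts."""
    return n_x * n_t / n if n else 0.0


def estimate(direct, n_x, n_t, n, t, b):
    """Use the direct CountMin estimate for head items, else interpolate.

    direct: estimate read from the regular sketch
    n_x, n_t: item aggregated and time aggregated estimates
    n: total count; t, b: the quantities in the slide's threshold
    """
    threshold = math.e * t * 2 ** (-b)
    return direct if direct > threshold else interpolate(n_x, n_t, n)


head = estimate(direct=500, n_x=20, n_t=50, n=100, t=8, b=4)
tail = estimate(direct=1, n_x=20, n_t=50, n=100, t=8, b=4)
```

With these numbers the threshold is about 1.36, so the first call keeps the direct estimate while the second falls back to the interpolated count.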
  • 15. Setup and Throughput. Web query data (5 days sample): 97.9M unique terms, 378.1M total. Wikipedia data: 4.5M unique terms, 1291.5M total. [Plots: number of unique terms versus term frequency on log-log axes for both datasets.] Configuration. Platform: 64-bit Linux, 4-core 2 GHz x86, 16 GB RAM, Gigabit network. Sketch setup: 4 hash functions, $2^{23}$ bins, $2^{11}$ aggregation intervals (7 days in 5 minute intervals). 3-gram interpolation: 12 GB sketch with 3 hash functions and $2^{30}$ bins.
  • 16. Setup and Throughput. Web query data (5 days sample): 97.9M unique terms, 378.1M total. Wikipedia data: 4.5M unique terms, 1291.5M total. [Plots: number of unique terms versus term frequency on log-log axes for both datasets.] Speed. Software: client-server system, ICE middleware, 1 server, 10 clients. Throughput per second: 50k inserts, 22k requests (time aggregation), 8.5k requests (resolution interpolation). Limiting factors: TCP/IP overhead (package query), memory latency (random access).
  • 17. Accuracy (aggregate absolute error $\hat{n} - n$)
  • 18. Accuracy (stratified absolute error $\hat{n} - n$)
  • 19. Sketching for Graphical Models. Goal: observe a stream of observations and estimate the joint probability in $O(1)$ time. CountMin is good for the head, but interpolation is better for the torso and tail. General strategy: for a Markov network with junction tree, cliques $\mathcal{C}$, and separator sets $\mathcal{S}$, estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate $\hat{p}(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}$. Estimates are fast: only lookups in the CountMin sketch. No need to solve a convex program for graphical model inference. Markov chain: unigrams give $p(abc) \approx n^{-3} \cdot \hat{n}_a \cdot \hat{n}_b \cdot \hat{n}_c$; bigrams give $p(abc) \approx n^{-1} \cdot \frac{\hat{n}_{ab} \cdot \hat{n}_{bc}}{\hat{n}_b}$ (two cliques and one separator, so the exponent is $|\mathcal{S}| - |\mathcal{C}| = -1$). Use backoff smoothing (e.g. Kneser-Ney) in practice.
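For a chain $a, b, c$ the general formula reduces to products and ratios of count lookups. The example below uses plain dictionaries in place of sketch queries, and the function names are illustrative assumptions. For the bigram case it applies the exponent $|\mathcal{S}| - |\mathcal{C}| = 1 - 2 = -1$ that the general formula gives for two cliques and one separator.

```python
def p_unigram(a, b, c, uni, n):
    """Three singleton cliques, no separators: n^-3 * n_a * n_b * n_c."""
    return uni.get(a, 0) * uni.get(b, 0) * uni.get(c, 0) / n ** 3


def p_bigram(a, b, c, uni, bi, n):
    """Cliques {a,b}, {b,c} and separator {b}: n^-1 * n_ab * n_bc / n_b."""
    n_b = uni.get(b, 0)
    if n_b == 0:
        return 0.0  # unseen separator; smoothing would be needed in practice
    return bi.get((a, b), 0) * bi.get((b, c), 0) / (n_b * n)


# Hypothetical counts standing in for CountMin lookups.
n = 100
uni = {"a": 10, "b": 20, "c": 5}
bi = {("a", "b"): 8, ("b", "c"): 4}

p1 = p_unigram("a", "b", "c", uni, n)
p2 = p_bigram("a", "b", "c", uni, bi, n)
```

Each estimate costs a fixed number of lookups, independent of the stream length, which is the $O(1)$ property the slide refers to.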
  • 20. n-gram Interpolation. Trigram approximation on the Wikipedia dataset (1291.5M terms, 405M unique trigrams). Unigram approximation: absolute error $2.50 \cdot 10^7$, relative error 0.266. Bigram approximation: absolute error $1.22 \cdot 10^6$, relative error 0.013. Trigram sketching (CountMin): absolute error $8.35 \cdot 10^6$, relative error 0.089. Sketching trigrams is not accurate enough on the tail.
  • 21. Summary. Fast and simple algorithm to aggregate statistics of data streams. Effective compressed representation of the temporal data. Works well for graphical models. High-performance scalable implementation with $O(1)$ time access. Can be distributed over many servers. [Image: Katsushika Hokusai, The Great Wave off Kanagawa.]