Approximation Data Structures for
Streaming Data Applications
Debasish Ghosh
(@debasishg)
Big Data => Fast Data
•Volume

•Variety

•Velocity
https://whatsthebigdata.com/2016/04/22/what-happens-on-the-internet-in-60-seconds/
Credit: http://www.doc.govt.nz/nature/habitats/freshwater/
A fundamental change in the shape of data that we need to process
Data Stream Model
• So big that it doesn’t fit in a single computer
(unbounded)

• So big that a polynomial running time isn’t good
enough

• An algorithm processing such data can only access
data in a single pass

• And yet data needs to be processed with a low
latency feedback loop with the consumers
Motivating Use Cases
• Monitor events when a user visits a web site. Event streams
drive analytics and generate various metrics on user
behaviors

• Traffic monitoring in network routers based on IP addresses -
explore heavy hitters (top traffic intensive IP addresses)

• Processing financial data streams (stock quotes & orders) to
facilitate real time decision making

• Online clustering algorithms - similarity detection in real time

• Real time anomaly detection on data streams
Algorithm Ideas
• Continuous processing of unbounded streams of data

• Single pass over the data

• Memory and time bounded - sublinear space

• Queries need not be answered with exact accuracy -
some error is acceptable
Can we have a deterministic and/or
exact algorithm that meets all of
these requirements?
Distinct Elements Problem
• Input: Stream of integers $i_1, \ldots, i_m \in [n]$

• Where: [n] denotes the set $\{1, 2, \ldots, n\}$

• Output: The number of distinct elements seen in the
stream

• Goal: Minimize space consumption
Distinct Elements Problem
• Solution 1: Keep a bit array of length n, initialized to all
zeroes. When you see i in the stream, set the ith bit to 1.

• Space required: n bits of memory

• Solution 2: Store the whole stream in memory explicitly

• Space required: $\lceil m \log_2 n \rceil$ bits of memory
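Solution 1 can be sketched in a few lines of Scala. This is only an illustration of the bit-array idea, not code from the talk; the names `ExactDistinctCounter` and `ExactDistinctDemo` are made up for the example.

```scala
import scala.collection.mutable

// Exact distinct counting over a stream of integers drawn from [n] = {1, ..., n}.
// Space: about n bits, independent of the stream length m.
final class ExactDistinctCounter(n: Int) {
  private val seen = new mutable.BitSet(n + 1)

  // When i shows up in the stream, set the i-th bit to 1.
  def add(i: Int): Unit = {
    require(i >= 1 && i <= n, s"element $i is outside [1, $n]")
    seen.add(i)
    ()
  }

  // The answer is the number of bits that are set.
  def distinct: Int = seen.size
}

object ExactDistinctDemo {
  def run(): Int = {
    val counter = new ExactDistinctCounter(10)
    Seq(3, 7, 3, 1, 7, 7, 10).foreach(counter.add)
    counter.distinct
  }
}
```

The stream in `ExactDistinctDemo` has seven elements but only four distinct values; the counter uses the same amount of memory however long the stream gets.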
Can we have a deterministic and/or
exact algorithm that beats this space
bound of $\min\{n, \lceil m \log_2 n \rceil\}$?
Sublinear with Deterministic
& Exact - Possible ?
• The set of distinct elements in a stream over [n] can be represented by
n bits (bit i is 1 if i has appeared), so every stream maps to an element
of $\{0,1\}^n$

• Suppose a deterministic & exact algorithm exists that uses s bits of
space where s < n

• Then its memory state defines a mapping from n-bit strings to s-bit
strings i.e. $\{0,1\}^n \to \{0,1\}^s$

• And this mapping has to be injective (no 2 elements of the domain
can map to the same element in co-domain), otherwise two streams with
different sets of elements would become indistinguishable

• Such a mapping does not exist (by the pigeonhole principle there cannot
be an injective mapping from a larger finite set to a smaller one)
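The counting step behind the last bullet can be written out as follows. The symbols $x$, $y$ and $j$ are introduced here for illustration and do not appear on the slides.

```latex
\[
  \bigl|\{0,1\}^n\bigr| = 2^n \;>\; 2^s = \bigl|\{0,1\}^s\bigr| \qquad \text{for } s < n ,
\]
\[
  \text{so there exist } x \neq y \in \{0,1\}^n
  \text{ whose streams leave the algorithm in the same } s\text{-bit state.}
\]
```

Since the state is the same, the algorithm answers identically for $x$ and $y$, now and after any further input. Pick an index $j$ where $x$ and $y$ differ and append element $j$ to both streams: the true distinct count grows for exactly one of them, so the two identical answers cannot both be exact before and after the update.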
There exists NO deterministic and/or
exact algorithm that implements
Distinct Elements problem in
sublinear space
Randomized & Approximate
• Estimators - the algorithm returns an estimator in
response to a query (Unbiased? Variance?)

• Error bound - f(x) is accurate up to a certain bound
($\epsilon$ bound)

• Confidence of accuracy - probability that the estimator
will be within the above bound ($1 - \delta$)
ϵ − δ Approximation
Accuracy within $\pm\epsilon$ bounds with a failure
probability of $\delta$
$\mathbb{P}(|\tilde{n} - n| > \epsilon n) < \delta$
[Figure: the data X is compressed into a summary C(X); f(X) is computed from the summary]
Sketch
• A Sketch C(X) of some data set X with respect to some
function f is a compression of X that allows us to
compute, or approximately compute f(X), given access
only to C(X)
Alice: Data set X, which is a list of Integers
Bob: Data set Y, which is a list of Integers
$f(X, Y) = \sum_{z \in X \cup Y} z$
Alice: Maintain Sketch of X as the running sum of the integers
Bob: Maintain Sketch of Y as the running sum of the integers
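A minimal Scala sketch of the Alice and Bob example, assuming that the union of two lists means their concatenation (duplicates are summed). `SumSketch` and `AliceBobDemo` are hypothetical names used only for this illustration.

```scala
// A sketch C(X) for f = "sum of all integers": a single running total.
final case class SumSketch(total: Long) {
  def add(z: Int): SumSketch = SumSketch(total + z)
  // Two sketches combine without access to the underlying data.
  def merge(other: SumSketch): SumSketch = SumSketch(total + other.total)
}

object SumSketch {
  val empty: SumSketch = SumSketch(0L)
  def of(xs: Seq[Int]): SumSketch = xs.foldLeft(empty)((acc, z) => acc.add(z))
}

object AliceBobDemo {
  val x = Seq(4, 8, 15)   // Alice's list
  val y = Seq(16, 23, 42) // Bob's list
  // f(X, Y) computed from the two sketches only
  val combined: Long = SumSketch.of(x).merge(SumSketch.of(y)).total
}
```

Alice and Bob each send one number instead of their whole list, and the merged sketch gives f(X, Y) exactly. Most sketches in the rest of the talk trade that exactness for space.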
Source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
Show me some data!
• Membership Query with 4% error - Bloom Filter
• Exact Membership Query, Cardinality Estimation - Sorted IDs or Hash Table
• Frequencies of top-100 most frequent elements with 4% error - Count Min Sketch
• Top-100 most frequent elements with 4% error - Stream-Summary
• Cardinality Estimation with 4% error - Loglog Counter
• Cardinality Estimation with 4% error - Linear Counter
• Exact Frequency Estimation, Range Query - Sorted Table or Hash Map
• Raw Data
A Simple Counter
• Use Case - Monitor a stream of events

• At any point in time output (an estimate of) the number of
events seen so far. You may have to report from multiple
counters aggregated by event types

• Idea is to beat $O(\log_2 n)$ space. The trivial exact counter
needs about $\log_2 n$ bits

• Using a suitable sketch, there exists an algorithm that
returns an estimator of the counter within a bound of $k(1 \pm \epsilon)$
for a true count k

• and a small probability of failure $\delta$
ϵ − δ Approximation
Approximate Counting
(Morris ’78)
Counting Large Numbers of Events in Small Registers - Robert Morris, CACM, Volume 21,
Issue 10, Oct 1978: https://dl.acm.org/citation.cfm?id=359627
$\mathbb{P}(|\tilde{n} - n| > \epsilon n) < \delta$
1. Initialize $X \leftarrow 0$.
2. For each update, increment $X$ with probability $1/2^X$.
3. For a query, output $\tilde{n} = 2^X - 1$.
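The three steps of the Morris counter translate almost line by line into Scala. The sketch below is an illustration under stated assumptions: `MorrisCounter` and `MorrisDemo` are made-up names, and averaging several independent counters is shown only as one common way to reduce the variance of the estimate.

```scala
import scala.util.Random

// Morris counter: stores only the exponent X, which needs about log2(log2 n) bits.
final class MorrisCounter(rng: Random) {
  private var x: Int = 0 // step 1: X <- 0

  // Step 2: increment X with probability 1 / 2^X.
  def increment(): Unit =
    if (rng.nextDouble() < math.pow(2.0, -x)) x += 1

  // Step 3: the estimate is 2^X - 1.
  def estimate: Double = math.pow(2.0, x) - 1.0

  def exponent: Int = x
}

object MorrisDemo {
  // Average the estimates of `copies` independent counters after n updates.
  def averagedEstimate(n: Int, copies: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val counters = Vector.fill(copies)(new MorrisCounter(rng))
    for (_ <- 1 to n; c <- counters) c.increment()
    counters.map(_.estimate).sum / copies
  }
}
```

A single counter is noisy, so a call such as `MorrisDemo.averagedEstimate(100000, 200, 1L)` will typically land much closer to the true count than any one of its 200 counters.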
The steps to analyze this algorithm
generalize beautifully to all
approximation data structures used
to handle streaming data
Generalization steps ..
• Compute the expected value of the estimator. In [Morris
’78] we have $\mathbb{E}[2^X - 1] = n$

• Compute the variance of the estimator. In [Morris ’78] we
have $\mathrm{var}[2^X - 1] = O(n^2)$

• Using median trick, establish ϵ − δ Approximation
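For the Morris counter these steps can be worked out explicitly. The derivation below is the standard one and is added as a sketch; $X_n$ (the register after $n$ updates), $s$ and $t$ are notation introduced here.

```latex
\begin{align*}
\mathbb{E}\bigl[2^{X_{n+1}} \mid X_n\bigr]
  &= 2^{-X_n}\, 2^{X_n + 1} + \bigl(1 - 2^{-X_n}\bigr)\, 2^{X_n}
   = 2^{X_n} + 1 \\
\mathbb{E}\bigl[2^{X_n}\bigr] &= n + 1
  \quad\Longrightarrow\quad \mathbb{E}\bigl[2^{X_n} - 1\bigr] = n \\
\mathbb{E}\bigl[4^{X_{n+1}} \mid X_n\bigr]
  &= 2^{-X_n}\, 4^{X_n + 1} + \bigl(1 - 2^{-X_n}\bigr)\, 4^{X_n}
   = 4^{X_n} + 3 \cdot 2^{X_n} \\
\mathbb{E}\bigl[4^{X_n}\bigr] &= 1 + \tfrac{3}{2}\, n (n + 1)
  \quad\Longrightarrow\quad
  \mathrm{var}\bigl[2^{X_n} - 1\bigr] = \tfrac{1}{2}\, n (n - 1) = O(n^2)
\end{align*}
```

With variance of order $n^2$ a single counter is too noisy for Chebyshev's inequality to give a useful bound. Averaging $s = O(1/\epsilon^2)$ independent counters brings the failure probability for $|\tilde{n} - n| > \epsilon n$ below a constant such as $1/3$, and taking the median of $t = O(\log(1/\delta))$ such averages drives it below $\delta$.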
Sketch based Query Model
[Figure: an algorithm consumes the data stream and maintains a data sketch; a query f(x) is answered from the sketch as the response]
Use Case
• Continuous stream of IP
addresses hitting a router

• Updates of the form $(i, \Delta)$,
which means the count of IP
address $i$ has to increase
by $\Delta$

• Want an estimate of how
many times IP address $i$ has
hit the router at any point in
time (Frequency Estimation)
Credit: http://voipstuff.net.au/routers/
Count Min Sketch
An Improved Data Stream Summary: The Count-Min Sketch and its Applications
- Graham Cormode and S. Muthukrishnan (http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf)
[Figure: a table of counters of width w with d rows, one row per hash function]
• When an update $(i, \Delta)$ comes, hash $i$ using pairwise independent hash
functions $h_1, h_2, h_3, \ldots, h_d$ and add $\Delta$ to the cell $h_j(i)$ in each row $j$

• Each cell holds a sum of frequencies, e.g. the cell w5 in row 2 holds the sum of
frequencies of all items $i$ that hash to w5 using hash function $h_2$
query(i)
• Hash i using all d hash functions 

• The results point to d cells in the table, each containing some frequency value

• Return the minimum of the d values as an estimate of query(i)
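The update and query steps can be put together in a small self-contained Scala sketch. It is an illustration only, not Algebird's implementation: the class name, the restriction to non-negative `Int` keys, and the hash family $h_j(i) = ((a_j i + b_j) \bmod p) \bmod w$ (a common Carter-Wegman style choice) are assumptions made for this example.

```scala
import scala.util.Random

// Minimal Count-Min Sketch over non-negative Int keys: d rows of w counters each.
final class CountMinSketch(val w: Int, val d: Int, seed: Long) {
  private val p = 2147483647L // the prime 2^31 - 1
  private val rng = new Random(seed)
  // One (a, b) pair per row, a in [1, p-1] and b in [0, p-1].
  private val a = Array.fill(d)(1L + rng.nextInt(Int.MaxValue - 1))
  private val b = Array.fill(d)(rng.nextInt(Int.MaxValue).toLong)
  private val table = Array.ofDim[Long](d, w)

  // h_row(i): the column that key i maps to in the given row.
  private def cell(row: Int, i: Int): Int = {
    require(i >= 0, "keys must be non-negative in this sketch")
    (((a(row) * i + b(row)) % p) % w).toInt
  }

  // Update (i, delta): add delta to one cell in every row.
  def update(i: Int, delta: Long): Unit =
    for (row <- 0 until d) table(row)(cell(row, i)) += delta

  // query(i): the minimum over the d cells that i hashes to.
  def query(i: Int): Long =
    (0 until d).map(row => table(row)(cell(row, i))).min
}
```

With non-negative updates every cell that $i$ hashes to holds at least the true count of $i$, so the minimum is the least contaminated cell and the sketch never underestimates.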
Count Min Sketch
Claim
1. For an $\epsilon$-point query with failure probability $\delta$:
2. $\mathrm{query}(i) = x_i \pm \epsilon \|x\|_1$ with prob $\geq 1 - \delta$.
3. Set $w = \lceil 2/\epsilon \rceil$ and $d = \lceil \log_2(1/\delta) \rceil$.
4. Space required is $O(\epsilon^{-1} \log_2(1/\delta))$.
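To get a feel for the sizes involved, the settings in item 3 can be evaluated directly. The helper below is a hypothetical illustration (`CmsDimensions` is not a library object). The intuition behind the formulas: with width $w \geq 2/\epsilon$ the expected overcount in one row is at most $\epsilon \|x\|_1 / 2$, so by Markov's inequality a row exceeds the bound with probability at most 1/2, and all $d$ independent rows do so with probability at most $2^{-d} \leq \delta$.

```scala
object CmsDimensions {
  // Width as set in the claim: w = ceil(2 / eps).
  def width(eps: Double): Int = math.ceil(2.0 / eps).toInt

  // Depth as set in the claim: d = ceil(log2(1 / delta)).
  def depth(delta: Double): Int =
    math.ceil(math.log(1.0 / delta) / math.log(2.0)).toInt

  // Total number of counters in the table, w * d.
  def counters(eps: Double, delta: Double): Int = width(eps) * depth(delta)
}
```

For example, a 3% error bound with a 0.1% failure probability needs only a few hundred counters, regardless of how many items the stream contains.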
Count Min Sketch in Spark
https://twitter.github.io/algebird/
Algebra of a Monoid
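given a set A
a binary operation $\phi : A \times A \to A$
associative: $(a \,\phi\, b) \,\phi\, c = a \,\phi\, (b \,\phi\, c)$ for all $a, b, c \in A$
identity: there is an $I \in A$ such that $a \,\phi\, I = I \,\phi\, a = a$ for all $a \in A$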
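In code a monoid is usually expressed as a small type class. The sketch below uses a hand-written `Monoid` trait (a stand-in defined here, not Algebird's own `Monoid`) and shows that counter tables of the same shape, such as those of a Count Min Sketch, form a monoid under cell-wise addition.

```scala
// A monoid as a type class: a set A, an associative operation, and an identity.
trait Monoid[A] {
  def zero: A              // the identity element I
  def plus(x: A, y: A): A  // the binary operation phi
}

object Monoid {
  type Table = Vector[Vector[Long]]

  // d x w counter tables with cell-wise addition; the all-zero table is the identity.
  def tableMonoid(d: Int, w: Int): Monoid[Table] = new Monoid[Table] {
    val zero: Table = Vector.fill(d, w)(0L)
    def plus(x: Table, y: Table): Table =
      x.zip(y).map { case (rx, ry) => rx.zip(ry).map { case (a, b) => a + b } }
  }

  // Because plus is associative, the inputs can be grouped (and so
  // distributed across machines) in any way before being combined.
  def sum[A](xs: Seq[A])(implicit m: Monoid[A]): A =
    xs.foldLeft(m.zero)(m.plus)
}
```

Associativity is what lets partial sketches be built per partition, per batch or per window and merged later in any grouping.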
CMS in the wild
[Figure: a stream of host IPs hitting the router forms the original DStream (time 1 to time 5); a window-based operation yields the windowed DStream (windows at time 1, 3 and 5). A Frequency Sketch / Heavy Hitter Sketch is kept for each batch, for each window, and globally. Kafka, HDFS and a Dashboard are the surrounding systems.]
Streaming CMS
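import com.twitter.algebird.CMS
import org.apache.spark.streaming.dstream.DStream

// CMS parameters
val DELTA = 1E-3 // failure probability
val EPS = 0.01   // error bound
val SEED = 1

// create CMS; Algebird's CMS.monoid takes (eps, delta, seed) in that order
val cmsMonoid = CMS.monoid[String](EPS, DELTA, SEED)
var globalCMS = cmsMonoid.zero

// Generate data stream
val hosts: DStream[String] = lines.flatMap(r =>
  LogParseUtil.parseHost(r.value).toOption)

// load data into CMS: one sketch per element, merged into one sketch per batch
val approxHosts: DStream[CMS[String]] = hosts.mapPartitions(ids => {
  val cms = CMS.monoid[String](EPS, DELTA, SEED)
  ids.map(cms.create)
}).reduce(_ ++ _)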
Streaming CMS
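approxHosts.foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    // after reduce, each batch RDD holds a single merged sketch
    val cmsThisBatch: CMS[String] = rdd.first
    // merge the batch sketch into the running global sketch
    globalCMS ++= cmsThisBatch
    // f1 is the total count of items added to the sketch
    val f1ThisBatch = cmsThisBatch.f1
    val freqThisBatch = cmsThisBatch.frequency("world.std.com")
    val f1Overall = globalCMS.f1
    val freqOverall = globalCMS.frequency("world.std.com")
    // ..
  }
})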
Motivation of Streaming
CMS
• Prepare the sketch online on streaming data

• Store it offline for future analytics

• It’s a small structure - hence ideal for serialization &
storage

• It’s a commutative monoid and hence you can distribute
many of them across multiple machines, do parallel
computations and again aggregate the results
Count Min Sketch -
Applications
• AT&T has used it in network switches to perform network analyses on streaming
network traffic with limited memory [1].

• Streaming log analysis

• Join size estimation for database query planners

• Heavy hitters -
  • Top-k active users on Twitter
  • Popular products - most viewed products page
  • Compute frequent search queries
  • Identify heavy TCP flow
  • Identify volatile stocks
[1] G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, and D. Srivastava. Holistic UDAFs at streaming speeds. In Proceedings of the 2004 ACM SIGMOD International
Conference on Management of Data, pages 35–46, 2004.
Heavy Hitters Problem
• Using a single pass over a data stream, find all elements
with frequencies greater than k percent of the total number of
elements seen so far.

• unbounded data stream

• will have to use sublinear space

• Fact: There is no deterministic algorithm that solves the
Heavy Hitters problem in 1 pass while using sublinear space

• Hence the ϵ-approximate Heavy Hitters Problem
Approximate Heavy Hitters
using Count Min Sketch
[Figure: a data stream of elements feeds a Count Min Sketch and a Heap; N is the count seen so far]
(1) Element Xi comes
(2) Add Xi to CMS
(3) Check freq of Xi > Threshold?
(4) If yes, add to Heap
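The flow in the diagram can be sketched as below. This is an illustration under stated assumptions, not the Algebird implementation used in the talk: `HeavyHitters` is a hypothetical class, the threshold is taken to be a fraction `phi` of the count N seen so far, MurmurHash3 seeded with the row index stands in for pairwise independent hash functions, and a plain set of candidates replaces the heap to keep the example short.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Approximate heavy hitters: a small Count-Min Sketch plus a set of candidates.
final class HeavyHitters(w: Int, d: Int, phi: Double) {
  private val table = Array.ofDim[Long](d, w)
  private val candidates = mutable.Set.empty[String]
  private var n = 0L // N: count of elements seen so far

  private def cell(row: Int, x: String): Int =
    Math.floorMod(MurmurHash3.stringHash(x, row), w)

  private def estimate(x: String): Long =
    (0 until d).map(row => table(row)(cell(row, x))).min

  def add(x: String): Unit = {
    n += 1                                               // (1) element arrives
    for (row <- 0 until d) table(row)(cell(row, x)) += 1 // (2) add it to the CMS
    if (estimate(x) > phi * n) candidates += x           // (3) + (4) keep if above threshold
  }

  // Re-check the candidates against the current threshold, most frequent first.
  def heavyHitters: Seq[(String, Long)] =
    candidates.toSeq
      .map(x => x -> estimate(x))
      .filter { case (_, f) => f > phi * n }
      .sortBy { case (_, f) => -f }
}
```

Because the sketch only overestimates, a true heavy hitter is never dropped at step (3); the price is that an occasional light element may be reported when its cells collide with heavy ones.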
Streaming Approximate
Heavy Hitters
// create heavy hitter CMS; TopPctCMS.monoid takes (eps, delta, seed, heavyHittersPct)
val approxHH: DStream[TopCMS[String]] = hosts.mapPartitions(ids => {
  val cms = TopPctCMS.monoid[String](EPS, DELTA, SEED, 0.15)
  ids.map(cms.create(_))
}).reduce(_ ++ _)

// analyze in microbatch
approxHH.foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    val hhThisBatch: TopCMS[String] = rdd.first
    hhThisBatch.heavyHitters.foreach(println)
  }
})
Bloom Filter
• Another sketching data structure (based on hashing)

• Solves the same problem as Hash Map but with much
less space

• Great tool to have if you want approximate membership
query with sublinear storage

• Can give false positives
Bloom Filter - Under the
Hood
• Ingredients
  • Array A of n bits. If we store a dataset S, then number of bits used per object = n/|S|
  • k hash functions ($h_1, h_2, \ldots, h_k$) (usually k is small)

• Insert(x)
  • For i = 1, 2, .., k set $A[h_i(x)] = 1$ irrespective of what the previous values of those
  bits were

• Query(x)
  • If $A[h_i(x)] = 1$ for every i = 1, 2, .., k return true, otherwise return false

• No false negatives

• Can have false positives
Space/time trade-offs in hash coding with allowable errors - B. H. Bloom.
Communications of the ACM 13(7): 422-426. 1970.
By David Eppstein - self-made, originally for a talk at WADS 2007, Public Domain, https://commons.wikimedia.org/w/index.p
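The Insert and Query operations fit in a few lines of Scala. The sketch below is illustrative and is not Algebird's `BF`: the class name is made up, keys are restricted to `String`, and MurmurHash3 seeded with i plays the role of the hash function $h_i$.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Bloom filter with an n-bit array A and k hash functions.
final class BloomFilter(n: Int, k: Int) {
  private val bits = new mutable.BitSet(n)

  // h_i(x): position in [0, n) for the i-th hash function.
  private def h(i: Int, x: String): Int =
    Math.floorMod(MurmurHash3.stringHash(x, i), n)

  // Insert(x): set A[h_i(x)] = 1 for i = 1..k, whatever the bits held before.
  def insert(x: String): Unit =
    for (i <- 1 to k) bits += h(i, x)

  // Query(x): true only if every A[h_i(x)] is 1.
  def mightContain(x: String): Boolean =
    (1 to k).forall(i => bits(h(i, x)))
}
```

Bits are only ever set, never cleared, which is why an inserted element can never be reported as absent, while an element that was never inserted may still find all of its k bits set by others.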
Bloom Filter as Application
State
[Figure: a data stream in a Kafka topic with Partition #1, #2 and #3 is consumed by 2 instances of the same Kafka Streams application, each holding its own local state, with rebalancing between them]
Bloom Filter State Store
import com.twitter.algebird.{Approximate, BF, BloomFilterMonoid, Hash128}
import org.apache.kafka.streams.processor.StateStore

// Bloom Filter as a StateStore. The only query it supports is membership.
class BFStore[T: Hash128](
    override val name: String,
    val loggingEnabled: Boolean = true,
    val numHashes: Int = 6,
    val width: Int = 32,
    val seed: Int = 1) extends WriteableBFStore[T] with StateStore {

  // monoid!
  private val bfMonoid =
    new BloomFilterMonoid[T](numHashes, width)

  // initialize
  private[processor] var bf: BF[T] = bfMonoid.zero
  // ..
}
Bloom Filter State Store
// Bloom Filter as a StateStore. The only query it supports is membership.
class BFStore[T: Hash128](
    override val name: String,
    val loggingEnabled: Boolean = true,
    val numHashes: Int = 6,
    val width: Int = 32,
    val seed: Int = 1) extends WriteableBFStore[T] with StateStore {
  // ..

  // add an item by replacing the filter with an updated one
  def +(item: T): Unit = bf = bf + item

  // true only if the filter says yes with high enough probability
  def contains(item: T): Boolean = {
    val v = bf.contains(item)
    v.isTrue && v.withProb > ACCEPTABLE_PROBABILITY
  }

  def maybeContains(item: T): Boolean = bf.maybeContains(item)

  def size: Approximate[Long] = bf.size
}
BF Store with Kafka
Streams Processor
import org.apache.kafka.streams.processor.{AbstractProcessor, ProcessorContext}
import scala.util.{Failure, Success}

// the Kafka Streams processor that will be part of the topology
class WeblogProcessor extends AbstractProcessor[String, String] {

  // the store instance
  private var bfStore: BFStore[String] = _

  override def init(context: ProcessorContext): Unit = {
    super.init(context)
    // ..
    bfStore = this.context.getStateStore(
      WeblogDriver.LOG_COUNT_STATE_STORE).asInstanceOf[BFStore[String]]
  }

  override def process(dummy: String, record: String): Unit =
    LogParseUtil.parseLine(record) match {
      case Success(r) => {
        // add the host to the filter and log the change for recovery
        bfStore + r.host
        bfStore.changeLogger.logChange(bfStore.changelogKey, bfStore.bf)
      }
      case Failure(ex) => // ..
    }
  // ..
}
https://www.lightbend.com/products/fast-data-platform
Questions?

Functional and Algebraic Domain ModelingFunctional and Algebraic Domain Modeling
Functional and Algebraic Domain Modeling
 
Architectural Patterns in Building Modular Domain Models
Architectural Patterns in Building Modular Domain ModelsArchitectural Patterns in Building Modular Domain Models
Architectural Patterns in Building Modular Domain Models
 
Mining Functional Patterns
Mining Functional PatternsMining Functional Patterns
Mining Functional Patterns
 
An Algebraic Approach to Functional Domain Modeling
An Algebraic Approach to Functional Domain ModelingAn Algebraic Approach to Functional Domain Modeling
An Algebraic Approach to Functional Domain Modeling
 
Functional and Algebraic Domain Modeling
Functional and Algebraic Domain ModelingFunctional and Algebraic Domain Modeling
Functional and Algebraic Domain Modeling
 
From functional to Reactive - patterns in domain modeling
From functional to Reactive - patterns in domain modelingFrom functional to Reactive - patterns in domain modeling
From functional to Reactive - patterns in domain modeling
 
Domain Modeling with Functions - an algebraic approach
Domain Modeling with Functions - an algebraic approachDomain Modeling with Functions - an algebraic approach
Domain Modeling with Functions - an algebraic approach
 
Functional Patterns in Domain Modeling
Functional Patterns in Domain ModelingFunctional Patterns in Domain Modeling
Functional Patterns in Domain Modeling
 
Property based Testing - generative data & executable domain rules
Property based Testing - generative data & executable domain rulesProperty based Testing - generative data & executable domain rules
Property based Testing - generative data & executable domain rules
 
Big Data - architectural concerns for the new age
Big Data - architectural concerns for the new ageBig Data - architectural concerns for the new age
Big Data - architectural concerns for the new age
 
Domain Modeling in a Functional World
Domain Modeling in a Functional WorldDomain Modeling in a Functional World
Domain Modeling in a Functional World
 
Functional and Event Driven - another approach to domain modeling
Functional and Event Driven - another approach to domain modelingFunctional and Event Driven - another approach to domain modeling
Functional and Event Driven - another approach to domain modeling
 
DSL - expressive syntax on top of a clean semantic model
DSL - expressive syntax on top of a clean semantic modelDSL - expressive syntax on top of a clean semantic model
DSL - expressive syntax on top of a clean semantic model
 
Dependency Injection in Scala - Beyond the Cake Pattern
Dependency Injection in Scala - Beyond the Cake PatternDependency Injection in Scala - Beyond the Cake Pattern
Dependency Injection in Scala - Beyond the Cake Pattern
 

Recently uploaded

SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
1C_PNS.pdf Philippines National standard
1C_PNS.pdf Philippines National standard1C_PNS.pdf Philippines National standard
1C_PNS.pdf Philippines National standardraffietividad53
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
Lecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).pptLecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).pptesrabilgic2
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 

Recently uploaded (20)

SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
1C_PNS.pdf Philippines National standard
1C_PNS.pdf Philippines National standard1C_PNS.pdf Philippines National standard
1C_PNS.pdf Philippines National standard
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
Lecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).pptLecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).ppt
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 

Approximation Data Structures for Streaming Applications

  • 1. Approximation Data Structures for Streaming Data Applications Debasish Ghosh (@debasishg)
  • 3. Big Data => Fast Data • Volume • Variety • Velocity
  • 5. Credit: http://www.doc.govt.nz/nature/habitats/freshwater/ A fundamental change in the shape of data that we need to process
  • 10. Data Stream Model • So big that it doesn’t fit in a single computer (unbounded) • So big that a polynomial running time isn’t good enough • An algorithm processing such data can only access data in a single pass • And yet data needs to be processed with a low latency feedback loop with the consumers
  • 11. Motivating Use Cases • Monitor events when a user visits a web site. Event streams drive analytics and generate various metrics on user behaviors • Traffic monitoring in network routers based on IP addresses - explore heavy hitters (top traffic intensive IP addresses) • Processing financial data streams (stock quotes & orders) to facilitate real time decision making • Online clustering algorithms - similarity detection in real time • Real time anomaly detection on data streams
  • 12. Algorithm Ideas • Continuous processing of unbounded streams of data • Single pass over the data • Memory and time bounded - sublinear space • Queries may not have to be served with hard accuracy - some affordance of errors allowed
  • 13. Can we have a deterministic and/or exact algorithm that meets all of these requirements?
  • 14. Distinct Elements Problem • Input: Stream of integers $i_1, \ldots, i_m \in [n]$ • Where: $[n]$ denotes the set $\{1, 2, \ldots, n\}$ • Output: The number of distinct elements seen in the stream • Goal: Minimize space consumption
  • 16. Distinct Elements Problem • Solution 1: Keep a bit array of length n, initialized to all zeroes. When you see i in the stream, set the i-th bit to 1. • Space required: $n$ bits of memory • Solution 2: Store the whole stream in memory explicitly • Space required: $\lceil m \log_2 n \rceil$ bits of memory
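Solution 1 can be sketched in a few lines of Scala. This is only an illustration under stated assumptions: the class name `ExactDistinctCounter` is hypothetical, and the extra `distinct` field is a convenience that costs a few more bits than the n-bit array described on the slide.

```scala
import scala.collection.mutable

// Solution 1: one bit per possible value in [n] = {1, ..., n}.
// ExactDistinctCounter is a hypothetical name used only for this sketch.
final class ExactDistinctCounter(n: Int) {
  private val seen = new mutable.BitSet(n + 1) // index 0 is unused
  private var distinct = 0                     // convenience counter, not part of the n bits

  // Record one stream element i, which must lie in 1..n.
  def update(i: Int): Unit = {
    require(i >= 1 && i <= n, s"element $i is outside [1, $n]")
    if (!seen(i)) {
      seen += i
      distinct += 1
    }
  }

  // Exact number of distinct elements seen so far.
  def count: Int = distinct
}
```

The space is linear in n no matter how short the stream is, which is the bound the following slides try to beat.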
  • 17. Can we have a deterministic and/or exact algorithm that beats this space bound of $\min\{n, \lceil m \log_2 n \rceil\}$?
  • 18. Sublinear with Deterministic & Exact - Possible? • Each element of the stream can be represented by n bits. The entire stream can then be mapped to $\{0,1\}^n$ • Suppose a deterministic & exact algorithm exists that uses s bits of space where s < n • Then there must exist some mapping from n-bit strings to s-bit strings, i.e. from $\{0,1\}^n$ to $\{0,1\}^s$ • And this mapping has to be injective (no 2 elements of the domain can map to the same element in the co-domain) • It can be proved that such a mapping does not exist (there cannot be an injective mapping from a larger set to a smaller set)
  • 19. There exists NO deterministic and/or exact algorithm that implements Distinct Elements problem in sublinear space
  • 22. Randomized & Approximate • Estimators - the algorithm returns an estimator in response to a query. Unbiased? Variance?
  • 24. Randomized & Approximate • Estimators - the algorithm returns an estimator in response to a query • Error bound - f(x) is accurate up to a certain bound ($\epsilon$ bound) • Confidence of accuracy - probability that the estimator will be within the above bound ($1 - \delta$)
  • 27. $\epsilon$-$\delta$ Approximation • Accuracy within $\pm\epsilon$ bounds with a failure probability of $\delta$: $\mathbb{P}(|\tilde{n} - n| > \epsilon n) < \delta$
  • 30. • A Sketch C(X) of some data set X with respect to some function f is a compression of X that allows us to compute, or approximately compute f(X), given access only to C(X)
  • 32. Alice: data set X, which is a list of integers • Bob: data set Y, which is a list of integers • $f(X, Y) = \sum_{z \in X \cup Y} z$ • Alice maintains the sketch of X as the running sum of the integers • Bob maintains the sketch of Y as the running sum of the integers
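The Alice and Bob example can be written down as a tiny sketch type. This is a minimal illustration, assuming that X ∪ Y is read as the concatenation of the two lists (duplicates counted); the name `SumSketch` and its methods are hypothetical.

```scala
// A sketch of a list of integers with respect to f = sum: just the running sum.
// SumSketch is a hypothetical name used only for this illustration.
final case class SumSketch(total: Long) {
  // Fold one more integer into the sketch.
  def add(z: Int): SumSketch = SumSketch(total + z)

  // Combine two sketches without access to the underlying lists.
  def merge(other: SumSketch): SumSketch = SumSketch(total + other.total)
}

object SumSketch {
  val empty: SumSketch = SumSketch(0L)

  // Build a sketch by streaming over the data once.
  def of(xs: Seq[Int]): SumSketch = xs.foldLeft(empty)((acc, z) => acc.add(z))
}
```

Alice and Bob each exchange a single number, and `merge` recovers f(X, Y) from the two sketches alone.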
  • 33. Show me some data! (figure: data structures for querying the same raw data) • Membership Query with 4% error - Bloom Filter • Exact Membership Query, Cardinality Estimation - Sorted IDs or Hash Table • Frequencies of top-100 most frequent elements with 4% error - Count Min Sketch • Top-100 most frequent elements with 4% error - Stream-Summary • Cardinality Estimation with 4% error - Loglog Counter • Cardinality Estimation with 4% error - Linear Counter • Exact Frequency Estimation, Range Query - Sorted Table or Hash Map • Raw Data Source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
  • 34. A Simple Counter • Use Case - Monitor a stream of events • At any point in time, output (an estimate of) the number of events seen so far. You may have to report from multiple counters aggregated by event types • The idea is to beat $O(\log_2 n)$ space. The trivial algorithm implements this using $\log_2 n$ bits
  • 35. $\epsilon$-$\delta$ Approximation • Using a suitable sketch, there exists an algorithm that returns an estimator of the counter within a bound of $k(1 \pm \epsilon)$ • and a small probability of failure $\delta$
  • 36. Approximate Counting (Morris ’78) 1. Initialize $X \leftarrow 0$. 2. For each update, increment $X$ with probability $1/2^X$. 3. For a query, output $\tilde{n} = 2^X - 1$. Target guarantee: $\mathbb{P}(|\tilde{n} - n| > \epsilon n) < \delta$. Reference: Counting Large Numbers of Events in Small Registers - Robert Morris, CACM, Volume 21, Issue 10, Oct 1978: https://dl.acm.org/citation.cfm?id=359627
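The three steps of the Morris counter translate almost directly into code. The following is a minimal sketch of a single counter, without the averaging or median steps needed for the full guarantee; the class name `MorrisCounter` and the injectable random source are assumptions made for this illustration.

```scala
import scala.util.Random

// A single Morris counter; MorrisCounter is a hypothetical name for this sketch.
final class MorrisCounter(rng: Random = new Random()) {
  // Step 1: initialize X to 0. Only X is stored, which needs about log2(log2 n) bits.
  private var x: Int = 0

  // Step 2: on each update, increment X with probability 1 / 2^X.
  def update(): Unit =
    if (rng.nextDouble() < math.pow(2.0, -x)) x += 1

  // Step 3: on a query, output the estimator 2^X - 1.
  def estimate: Double = math.pow(2.0, x) - 1.0

  // Exposed for inspection only.
  def register: Int = x
}
```

A single counter has high variance, so any one run can be far from the true count. The generalization steps on the following slides address this.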
  • 37. The steps to analyze this algorithm generalize beautifully to all approximation data structures used to handle streaming data
  • 38. Generalization steps .. • Compute the expected value of the estimator. In [Morris ’78] we have $\mathbb{E}[2^X - 1] = n$ • Compute the variance of the estimator. In [Morris ’78] we have $\mathrm{var}[2^X - 1] = O(n^2)$ • Using the median trick, establish the $\epsilon$-$\delta$ approximation
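The expectation step can be worked out by induction on the number of updates. The derivation below is a standard argument rather than something shown on the slides; $X_n$ denotes the register value after $n$ updates, a symbol introduced here for convenience.

```latex
% Condition on the register value j after n updates:
\mathbb{E}\left[2^{X_{n+1}} \mid X_n = j\right]
  = 2^{j+1} \cdot \frac{1}{2^{j}} + 2^{j}\left(1 - \frac{1}{2^{j}}\right)
  = 2^{j} + 1
% Taking expectations over X_n, with X_0 = 0:
\mathbb{E}\left[2^{X_{n+1}}\right] = \mathbb{E}\left[2^{X_n}\right] + 1,
\qquad \mathbb{E}\left[2^{X_0}\right] = 1
\;\Longrightarrow\; \mathbb{E}\left[2^{X_n} - 1\right] = n
% Chebyshev then turns the variance bound into a tail bound:
\mathbb{P}\left(|\tilde{n} - n| > \epsilon n\right)
  \le \frac{\mathrm{var}[\tilde{n}]}{\epsilon^{2} n^{2}}
```

With variance $O(n^2)$ the Chebyshev bound for a single counter is only a constant over $\epsilon^2$, which is why independent copies are averaged to shrink the variance and the median of several averages is taken to reach failure probability $\delta$.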
  • 40. Use Case • Continuous stream of IP addresses hitting a router • Updates of the form (i, Δ), which means the count of IP address i has to increase by Δ • Want an estimate of how many times IP address i has hit the router at any point in time (Frequency Estimation) Credit: http://voipstuff.net.au/routers/
  • 41. Count Min Sketch (figure: table of width w with d hash functions) Reference: An Improved Data Stream Summary: The Count-Min Sketch and its Applications - Graham Cormode and S. Muthukrishnan (http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf)
  • 43. Count Min Sketch (figure: table of width w with d hash functions) • An update (i, Δ) comes • Hash i using pairwise independent hash functions h1(i), h2(i), h3(i), ..., hd(i) • Add Δ to the cell that each hash function points to
  • 44. Count Min Sketch (figure: table of width w with d hash functions) • The cell in row h2, column w5 holds the sum of frequencies of all items i that hash to w5 using hash function h2
  • 45. query(i) (figure: table of width w with d hash functions; h1(i), h2(i), h3(i), ..., hd(i) point to the cells that received +Δ) • Hash i using all d hash functions • The results point to d cells in the table, each containing some frequency value • Return the minimum of the d values as an estimate of query(i)
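  • 46. Count Min Sketch Claim 1. For an $\epsilon$-point query with failure probability $\delta$: 2. $\mathrm{query}(i) = x_i \pm \epsilon \lVert x \rVert_1$ with probability $\geq 1 - \delta$. 3. Set $w = \lceil 2/\epsilon \rceil$ and $d = \lceil \log_2(1/\delta) \rceil$. 4. Space required is $O(\epsilon^{-1} \log_2(1/\delta))$.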
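The update and query rules, together with the sizing from the claim, fit in a short self-contained class. This is a sketch under stated assumptions, not Algebird's implementation: the class name, the use of `hashCode` to turn items into integers, and the `(a * x + b) mod p mod w` hash family with p = 2^31 - 1 are all choices made for this illustration.

```scala
import scala.util.Random

// Minimal Count-Min Sketch; names and hash family are assumptions of this sketch.
final class CountMinSketch(eps: Double, delta: Double, seed: Long = 1L) {
  // Sizing from the claim: w = ceil(2 / eps), d = ceil(log2(1 / delta)).
  val width: Int = math.ceil(2.0 / eps).toInt
  val depth: Int = math.ceil(math.log(1.0 / delta) / math.log(2.0)).toInt.max(1)

  private val prime = 2147483647L // 2^31 - 1
  private val rng = new Random(seed)
  // One (a, b) pair per row for h(x) = ((a * x + b) mod p) mod w.
  private val coeffs: Array[(Long, Long)] =
    Array.fill(depth)((1L + rng.nextInt(Int.MaxValue - 1), rng.nextInt(Int.MaxValue).toLong))
  private val table: Array[Array[Long]] = Array.ofDim[Long](depth, width)

  // Column that the given row's hash function assigns to the item.
  private def cell(row: Int, item: Any): Int = {
    val (a, b) = coeffs(row)
    val x = item.hashCode.toLong & 0x7fffffffL // non-negative integer key
    (((a * x + b) % prime) % width).toInt
  }

  // Update (i, amount): add amount to one cell in every row.
  def update(item: Any, amount: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(cell(row, item)) += amount

  // query(i): the minimum over the d cells that i hashes to.
  def query(item: Any): Long =
    (0 until depth).map(row => table(row)(cell(row, item))).min
}
```

With non-negative updates the estimate never falls below the true count, because collisions can only add to a cell. Taking the minimum across rows picks the cell with the least collision noise.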
  • 47. Count Min Sketch in Spark
  • 49. Algebra of a Monoid • A set $A$ • given a binary operation $\phi : A \times A \to A$ • associative: $(a \,\phi\, b) \,\phi\, c = a \,\phi\, (b \,\phi\, c)$ for $a, b, c \in A$ • identity: $a \,\phi\, I = I \,\phi\, a = a$ for $a, I \in A$
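The monoid laws can be made concrete with a small type class. This is a hand-rolled sketch for illustration and not Algebird's `Monoid`; the instance shown adds equally sized counter rows cell by cell, which is the same idea that lets two Count-Min Sketch tables of identical shape be merged.

```scala
// A hand-rolled monoid type class; hypothetical, not Algebird's definition.
trait Monoid[A] {
  def zero: A             // identity element I
  def plus(a: A, b: A): A // associative binary operation
}

object Monoid {
  // Cell-wise addition of counter rows of a fixed size.
  def counters(size: Int): Monoid[Vector[Long]] = new Monoid[Vector[Long]] {
    val zero: Vector[Long] = Vector.fill(size)(0L)
    def plus(a: Vector[Long], b: Vector[Long]): Vector[Long] =
      a.zip(b).map { case (x, y) => x + y }
  }

  // Because plus is associative, the fold can be split across partitions
  // and the partial results combined afterwards.
  def sum[A](xs: Seq[A])(implicit m: Monoid[A]): A = xs.foldLeft(m.zero)(m.plus)
}
```

Associativity is what allows per-partition sketches to be reduced in any grouping, and the identity element gives empty partitions a well-defined result.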
  • 52. CMS in the wild (figure: a stream of host IPs hitting the router forms the original DStream across time 1 to time 5; a window-based operation produces the windowed DStream with windows at time 1, time 3, and time 5) • Frequency Sketch / Heavy Hitter Sketch for this batch • Frequency Sketch / Heavy Hitter Sketch for this window • Frequency Sketch / Heavy Hitter Sketch global • Other labels in the figure: Kafka, HDFS, Dashboard
  • 53. Streaming CMS
      // CMS parameters
      val DELTA = 1E-3
      val EPS = 0.01
      val SEED = 1

      // create CMS (Algebird's factory takes eps before delta)
      val cmsMonoid = CMS.monoid[String](EPS, DELTA, SEED)
      var globalCMS = cmsMonoid.zero

      // Generate data stream
      val hosts: DStream[String] =
        lines.flatMap(r => LogParseUtil.parseHost(r.value).toOption)

      // load data into CMS: one sketch per element, merged per batch
      val approxHosts: DStream[CMS[String]] = hosts.mapPartitions(ids => {
        val cms = CMS.monoid[String](EPS, DELTA, SEED)
        ids.map(cms.create)
      }).reduce(_ ++ _)
  • 54. Streaming CMS
      approxHosts.foreachRDD(rdd => {
        if (rdd.count() != 0) {
          val cmsThisBatch: CMS[String] = rdd.first
          globalCMS ++= cmsThisBatch

          val f1ThisBatch = cmsThisBatch.f1
          val freqThisBatch = cmsThisBatch.frequency("world.std.com")

          val f1Overall = globalCMS.f1
          val freqOverall = globalCMS.frequency("world.std.com")
          // ..
        }
      })
  • 55. Motivation of Streaming CMS • Prepare the sketch online on streaming data • Store it offline for future analytics • It’s a small structure - hence ideal for serialization & storage • It’s a commutative monoid and hence you can distribute many of them across multiple machines, do parallel computations and again aggregate the results
  • 56. Count Min Sketch - Applications • AT&T has used it in network switches to perform network analyses on streaming network traffic with limited memory [1]. • Streaming log analysis • Join size estimation for database query planners • Heavy hitters - • Top-k active users on Twitter • Popular products - most viewed products page • Compute frequent search queries • Identify heavy TCP flow • Identify volatile stocks [1] G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, and D. Srivastava. Holistic UDAFs at streaming speeds. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 35–46, 2004.
  • 57. Heavy Hitters Problem • Using a single pass over a data stream, find all elements with frequencies greater than k percent of the total number of elements seen so far. • The data stream is unbounded • We will have to use sublinear space • Fact: There is no deterministic algorithm that solves the Heavy Hitters problem in 1 pass while using sublinear space • Hence the $\epsilon$-approximate Heavy Hitters Problem
  • 58. Approximate Heavy Hitters using Count Min Sketch (figure: a data stream of elements feeds a Count Min Sketch and a Heap; N = count seen so far) • (1) Element Xi comes • (2) Add Xi to CMS • (3) Check freq of Xi > Threshold? • Yes: (4) Add to Heap • No: not added
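The four steps in the figure can be sketched end to end in one class. This is an illustration under stated assumptions: the threshold is taken to be a fraction of N, a plain set of candidates stands in for the heap shown on the slide, and the embedded Count-Min Sketch uses a simple `(a * x + b) mod p mod w` hash family. The class name `HeavyHitters` is hypothetical.

```scala
import scala.collection.mutable
import scala.util.Random

// Approximate heavy hitters on top of a small Count-Min Sketch (hypothetical sketch).
final class HeavyHitters(width: Int, depth: Int, fraction: Double, seed: Long = 1L) {
  private val prime = 2147483647L // 2^31 - 1
  private val rng = new Random(seed)
  private val coeffs: Array[(Long, Long)] =
    Array.fill(depth)((1L + rng.nextInt(Int.MaxValue - 1), rng.nextInt(Int.MaxValue).toLong))
  private val table: Array[Array[Long]] = Array.ofDim[Long](depth, width)
  private val candidates = mutable.Set.empty[String] // stands in for the heap
  private var n = 0L                                 // N: count seen so far

  private def cell(row: Int, item: String): Int = {
    val (a, b) = coeffs(row)
    (((a * (item.hashCode.toLong & 0x7fffffffL) + b) % prime) % width).toInt
  }

  private def estimate(item: String): Long =
    (0 until depth).map(r => table(r)(cell(r, item))).min

  def add(item: String): Unit = {
    n += 1                                                   // (1) element arrives
    for (r <- 0 until depth) table(r)(cell(r, item)) += 1    // (2) add it to the CMS
    if (estimate(item) > fraction * n) candidates += item    // (3) check, (4) keep if above threshold
  }

  // Re-check candidates against the current threshold, since N keeps growing.
  def heavyHitters: Map[String, Long] =
    candidates.iterator.map(k => k -> estimate(k)).filter(_._2 > fraction * n).toMap
}
```

Because the sketch never underestimates, every true heavy hitter is reported. The error shows up only as occasional extra items whose counts were inflated by collisions.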
  • 59. Streaming Approximate Heavy Hitters
      // create heavy hitter CMS (Algebird's factory takes eps before delta);
      // 0.15 is the heavy hitters percentage
      val approxHH: DStream[TopCMS[String]] = hosts.mapPartitions(ids => {
        val cms = TopPctCMS.monoid[String](EPS, DELTA, SEED, 0.15)
        ids.map(cms.create(_))
      }).reduce(_ ++ _)

      // analyze in microbatch
      approxHH.foreachRDD(rdd => {
        if (rdd.count() != 0) {
          val hhThisBatch: TopCMS[String] = rdd.first
          hhThisBatch.heavyHitters.foreach(println)
        }
      })
  • 60. Bloom Filter • Another sketching data structure (based on hashing) • Solves the same problem as Hash Map but with much less space • Great tool to have if you want approximate membership query with sublinear storage • Can give false positives
  • 61. Bloom Filter - Under the Hood • Ingredients • Array A of n bits. If we store a dataset S, then number of bits used per object = n/|S| • k hash functions (h1, h2, ..., hk) (usually k is small) • Insert(x) • For i = 1, 2, ..., k set A[hi(x)] = 1 irrespective of what the previous values of those bits were • Query(x) • If A[hi(x)] = 1 for every i = 1, 2, ..., k, return true • No false negatives • Can have false positives Reference: Space/time trade-offs in hash coding with allowable errors - B. H. Bloom. Communications of the ACM 13(7): 422-426. 1970. Image credit: By David Eppstein - self-made, originally for a talk at WADS 2007, Public Domain, https://commons.wikimedia.org/w/index.p
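The insert and query rules above can be sketched with the standard library alone. This is a minimal illustration and not Algebird's `BloomFilter`: the class name is hypothetical, and deriving the k hash functions by seeding `MurmurHash3.stringHash` with the index i is an assumption made to keep the example short.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Minimal Bloom filter over strings; SimpleBloomFilter is a hypothetical name.
final class SimpleBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new mutable.BitSet(numBits) // the array A of n bits

  // h_1(x), ..., h_k(x): one MurmurHash3 seed per hash function,
  // mapped to a non-negative index below numBits.
  private def indices(item: String): Seq[Int] =
    (0 until numHashes).map { i =>
      val h = MurmurHash3.stringHash(item, i)
      ((h % numBits) + numBits) % numBits
    }

  // Insert(x): set all k bits, regardless of their previous values.
  def insert(item: String): Unit = indices(item).foreach(bits += _)

  // Query(x): true only if all k bits are set.
  // Never a false negative; a false positive is possible.
  def mightContain(item: String): Boolean = indices(item).forall(bits.contains)
}
```

An inserted item always sets its own k bits, so a later query for it cannot fail. A false positive occurs when other items happen to have set all k bits of an item that was never inserted.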
  • 62. Bloom Filter as Application State (figure: a Kafka topic carries the data stream in Partition #1, Partition #2, and Partition #3; 2 instances of the same Kafka Streams application consume it, each with its own local state; an arrow between the instances is labelled Rebalancing)
  • 63. Bloom Filter State Store
      // Bloom Filter as a StateStore. The only query it supports is membership.
      class BFStore[T: Hash128](
          override val name: String,
          val loggingEnabled: Boolean = true,
          val numHashes: Int = 6,
          val width: Int = 32,
          val seed: Int = 1) extends WriteableBFStore[T] with StateStore {

        // monoid!
        private val bfMonoid = new BloomFilterMonoid[T](numHashes, width)

        // initialize
        private[processor] var bf: BF[T] = bfMonoid.zero

        // ..
      }
  • 64. Bloom Filter State Store
      // Bloom Filter as a StateStore. The only query it supports is membership.
      class BFStore[T: Hash128](
          override val name: String,
          val loggingEnabled: Boolean = true,
          val numHashes: Int = 6,
          val width: Int = 32,
          val seed: Int = 1) extends WriteableBFStore[T] with StateStore {
        // ..

        def +(item: T): Unit = bf = bf + item

        def contains(item: T): Boolean = {
          val v = bf.contains(item)
          v.isTrue && v.withProb > ACCEPTABLE_PROBABILITY
        }

        def maybeContains(item: T): Boolean = bf.maybeContains(item)

        def size: Approximate[Long] = bf.size
      }
  • 65. BF Store with Kafka Streams Processor
      // the Kafka Streams processor that will be part of the topology
      class WeblogProcessor extends AbstractProcessor[String, String] {

        // the store instance
        private var bfStore: BFStore[String] = _

        override def init(context: ProcessorContext): Unit = {
          super.init(context)
          // ..
          bfStore = this.context.getStateStore(
            WeblogDriver.LOG_COUNT_STATE_STORE).asInstanceOf[BFStore[String]]
        }

        override def process(dummy: String, record: String): Unit =
          LogParseUtil.parseLine(record) match {
            case Success(r) => {
              // add the host to the filter and log the change
              bfStore + r.host
              bfStore.changeLogger.logChange(bfStore.changelogKey, bfStore.bf)
            }
            case Failure(ex) => // ..
          }
        // ..
      }