Streaming Algorithms:
Data without a disk
H. Andrew Schwartz
CSE545
Spring 2023
Big Data Analytics, The Class
Goal: Generalizations
A model or summarization of the data.

Data Workflow Frameworks: Hadoop File System, MapReduce, Spark, Deep Learning Frameworks
Analytics and Algorithms: Similarity Search, Recommendation Systems, Link Analysis,
Transformers/Self-Supervision, Streaming, Hypothesis Testing
What is Streaming?
Broadly: RECORD IN → Process → RECORD GONE
Why Streaming?
(1) Direct: Often, data …
● … cannot be stored (too big, privacy concerns)
● … are not practical to access repeatedly (reading takes too long)
● … are rapidly arriving (need rapidly updated "results")
Examples: Google search queries
Satellite imagery data
Text messages, status updates
Click streams

(2) Indirect: The constraints of streaming data force one toward solutions
that are often efficient even when storing the data is an option.
Streaming Approx Random Sample
Distributed IO (MapReduce, Spark)
Often translates into O(N) or strictly N algorithms.

RECORD IN → Process → RECORD GONE
Streaming Topics
● General Stream Processing Model
● Sampling
● Counting Distinct Elements
● Filtering data according to a criterion
Standing Queries:
Stored and permanently executing.
Ad-Hoc:
One-time questions
-- must store expected parts / summaries of streams

RECORD IN → Process for stream queries → RECORD GONE

E.g. How would you handle:
What is the mean of values seen so far?

Important difference from typical database management:
● Input is not controlled by system staff.
● Input timing/rate is often unknown, controlled by users.

Might hold a sliding window of
records instead of a single record.
.. , i, h, g, f, e, d, c, b, a
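For instance, a minimal sketch of such a standing query -- maintaining the running mean without storing the stream -- assuming only a hypothetical iterable of numeric records:

def running_mean(stream):
    # Constant memory, independent of stream length.
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count   # current answer to the standing query

# e.g. list(running_mean([4, 3, 11, 2, 0])) -> [4.0, 3.5, 6.0, 5.0, 4.0]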
General Stream Processing Model

Input stream
…, 4, 3, 11, 2, 0, 5, 8, 1, 4
A stream of records
(also often referred to as "elements", "tuples", "lines", or "rows")
Theoretically, could be anything! search queries, numbers, bits, image files, ...

Processor → Output (Generalization, Summarization)

ad-hoc queries -- one-time questions
standing queries -- asked at all times
limited memory
archival storage -- not suitable for fast queries
(Leskovec et al., 2014)
Sampling
Create a random sample for statistical analysis.

RECORD IN → Process → RECORD GONE
Keep? -- yes --> limited memory --> run statistical analysis sometime in the future
Sampling: 2 Versions
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.
2. Hierarchical Sampling: Sample an attribute of a record.
(e.g. records are tweets, but we wish to sample users)
(diagram: a stream of tweets grouped by the users who wrote them)
Sampling
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.

record = stream.next()
if random() <= .05:  #keep: true 5% of the time
    memory.write(record)

Problem: records/rows often are not the units-of-analysis for statistical analyses
E.g. user_ids for searches, tweets; location_ids for satellite images
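A runnable version of this simple-sampling idea, assuming records arrive from a hypothetical iterable and kept records go to an in-memory list:

import random

def simple_sample(stream, rate=0.05):
    # Keep each record independently with probability `rate`.
    kept = []
    for record in stream:
        if random.random() <= rate:   # true ~5% of the time by default
            kept.append(record)       # otherwise the record is gone
    return kept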
Sampling
2. Hierarchical Sampling: Sample an attribute of a record.
(e.g. records are tweets, but we wish to sample users)

record = stream.next()
if random() <= .05:  #keep: true 5% of the time -- samples tweets, not users!
    memory.write(record)

Solution: instead of checking a random digit, hash the attribute being sampled.
– streaming: only need to store the hash functions; may be part of a standing query

record = stream.next()
if hash(record['user_id']) == 1:  #keep
    memory.write(record)

How many buckets to hash into?
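As a minimal sketch (the user_id field and the 20-bucket choice are illustrative assumptions): hashing the user id into 20 buckets and keeping one bucket yields roughly a 5% sample of users, and every tweet from a sampled user is kept.

import hashlib

def user_bucket(user_id, n_buckets=20):
    # Deterministic hash of the sampled attribute (here: a hypothetical user_id).
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % n_buckets

def hierarchical_sample(stream, n_buckets=20):
    kept = []
    for record in stream:                       # record assumed to be a dict with 'user_id'
        if user_bucket(record['user_id'], n_buckets) == 0:
            kept.append(record)                 # keep all records for ~1/20 of users
    return kept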
Streaming Topics
● General Stream Processing Model
● Sampling
● Counting Distinct Elements
● Filtering data according to a criterion
Counting Moments
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is the sum over distinct elements of m_i raised to the k: Σ_i (m_i)^k
● 0th moment: count of distinct elements
● 1st moment: length of stream
  (Trivial: just increment a counter)
● 2nd moment: sum of squares
  (measures unevenness; related to variance)
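To make the definition concrete, a small sketch computing the moments exactly from an in-memory list (the streaming algorithms below approximate the 0th moment without this kind of storage):

from collections import Counter

def kth_moment(elements, k):
    # Exact kth moment: sum over distinct elements i of (m_i)^k.
    counts = Counter(elements)                 # m_i for each distinct element
    return sum(m ** k for m in counts.values())

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 4]
# kth_moment(stream, 0) == 4    distinct elements
# kth_moment(stream, 1) == 10   length of stream
# kth_moment(stream, 2) == 26   sum of squares: 3^2 + 3^2 + 2^2 + 2^2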
Counting Moments: 0th moment
Applications -- counting…
distinct words in a large document.
distinct websites (URLs).
users that visit a site, without storing them all.
unique queries to Alexa.

One Solution: Just keep a set (hashmap, dictionary, heap)
Problem: Can't maintain that many in memory; disk storage is too slow
Streaming Solution (0th moment): Flajolet-Martin Algorithm
General idea:
n -- suspected overestimate of the total number of elements observed
pick a hash, h, that maps each element to log2(n) bits (buckets)
--------------------------------
R = 0  #current max number of trailing zeros
for each stream element, e:
    r(e) = trailZeros(h(e))  #num of trailing 0s in h(e)
    R = r(e) if r(e) > R
estimated_distinct_elements = 2**R   # ≈ m, the number of distinct elements
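A minimal runnable sketch of the single-hash version (the md5-truncated-to-64-bits hash is an assumption; any well-mixed hash works):

import hashlib

def h64(e):
    # Hash an element to 64 bits (assumes md5 behaves like a random function).
    return int(hashlib.md5(str(e).encode()).hexdigest(), 16) & ((1 << 64) - 1)

def trail_zeros(x):
    # Number of trailing 0 bits; x & -x isolates the lowest set bit.
    return 64 if x == 0 else (x & -x).bit_length() - 1

def fm_estimate(stream):
    R = 0                                   # max number of trailing zeros seen so far
    for e in stream:
        R = max(R, trail_zeros(h64(e)))
    return 2 ** R                           # estimated number of distinct elements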
Mathematical Intuition
P( trailZeros(h(e)) >= i ) = 2^-i
    # P(h(e) ends in 0) = .5; P(h(e) ends in 00) = .25; …
P( trailZeros(h(e)) < i ) = 1 - 2^-i
    for m distinct elements: (1 - 2^-i)^m
P( at least one e has trailZeros >= i ) = 1 - (1 - 2^-i)^m ≈ 1 - e^(-m * 2^-i)
If 2^R >> m, then 1 - (1 - 2^-R)^m ≈ 0   (unlikely to see that many trailing zeros)
If 2^R << m, then 1 - (1 - 2^-R)^m ≈ 1   (almost surely some element reaches R)
So 2^R is a reasonable estimate of m.
Problem: Unstable in practice.
Solution: Multiple hash functions -- but how to combine them?
Solution: Multiple hash functions
1. Partition the estimates into groups of size log n
2. Take the mean within each group
3. Take the median of the group means

Rs = list()
for h in hashes:
    R = 0  #max number of trailing zeros for this hash
    for each stream element, e:
        r(e) = trailZeros(h(e))  #num of trailing 0s in h(e)
        R = r(e) if r(e) > R
    Rs.append(2**R)
groupRs = [Rs[i:i+log n] for i in range(0, len(Rs), log n)]
estimated_distinct_elements = median(map(mean, groupRs))

A good approach anytime one has many "low resolution" estimates of a true value.
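A runnable sketch of the median-of-group-means combination; the seeded md5 hashes and the fixed group size of 8 (the slide suggests groups of roughly log n) are illustrative assumptions:

import hashlib
from statistics import mean, median

def seeded_trail_zeros(e, seed):
    # One 64-bit hash per seed (assumes salting md5 gives roughly independent hashes).
    x = int(hashlib.md5(f"{seed}:{e}".encode()).hexdigest(), 16) & ((1 << 64) - 1)
    return 64 if x == 0 else (x & -x).bit_length() - 1

def fm_multi_estimate(stream, n_hashes=64, group_size=8):
    Rs = [0] * n_hashes                        # one running max per hash, single pass
    for e in stream:
        for seed in range(n_hashes):
            Rs[seed] = max(Rs[seed], seeded_trail_zeros(e, seed))
    estimates = [2 ** R for R in Rs]
    groups = [estimates[i:i + group_size] for i in range(0, len(estimates), group_size)]
    return median(mean(g) for g in groups)     # median of group means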
Counting Moments: 2nd moment
(sum of squares; measures unevenness, related to variance)
Streaming Solution: Alon-Matias-Szegedy Algorithm
(Exercise; out of scope; see MMDS)
Counting Moments: standard deviation
(square-root of variance, for numeric data)
For streaming, just need to store (1) the number of elements, (2) the sum of
elements, and (3) the sum of squares.
However, challenge: the sum of squares can blow up!
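A minimal sketch of this bookkeeping, using the identity var = E[x^2] - (E[x])^2; the growing sum of squares is exactly the overflow/precision caveat raised above:

import math

class RunningStats:
    # Constant-memory accumulators for a numeric stream.
    def __init__(self):
        self.n = 0          # (1) number of elements
        self.s = 0.0        # (2) sum of elements
        self.ss = 0.0       # (3) sum of squares -- can grow very large
    def add(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x
    def mean(self):
        return self.s / self.n
    def std(self):
        # variance = E[x^2] - (E[x])^2
        return math.sqrt(max(self.ss / self.n - self.mean() ** 2, 0.0))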
Filtering Data
Filtering: Select elements with property x
Example: 40B safe email addresses for a spam filter

The Bloom Filter (approximate; allows false positives but not false negatives)
Given:
|S| keys to filter; will be mapped to |B| bits
hashes = h_1, h_2, …, h_k independent hash functions

Algorithm:
Setup filter:
set all B to 0  #B is a bit vector
for each i in hashes, for each s in S:
    set B[h_i(s)] = 1  #all bits resulting from the hashes are turned on

Apply filter:
…  #usually embedded in other code
while key x arrives next in stream:
    if B[h_i(x)] == 1 for all i in hashes:
        #do as if x is in S
    else:
        #do as if x is not in S
Filtering Data: false positives
What is the probability of a false positive (FP)?

Q: What fraction of |B| are 1s?
A: Analogy: throw |S| * k darts at n targets (n = |B| bits).
1 dart: a given target is hit with probability 1/n
d darts: P(a given target is hit by none) = (1 - 1/n)^d
       = ((1 - 1/n)^n)^(d/n) ≈ e^(-d/n) for large n
so a fraction e^(-d/n) of the bits are 0s,
thus (1 - e^(-d/n)) are 1s.

Probability of all k bits being 1 (a false positive):
(1 - e^(-(|S|*k)/n))^k
|S|: size of the set
k: number of hash functions
n: number of bits (buckets)

Note: Can expand S as the stream continues, as long as |B| has room
(e.g. adding verified email addresses)
(Leskovec et al., 2014)
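A compact runnable sketch of both phases, plus a check of the false-positive formula; the salted-md5 hashing and the set/bit-array sizes are assumptions for illustration:

import hashlib, math

def bloom_hashes(key, k, n_bits):
    # k "independent" hash positions via salted md5 (one common choice, not the only one).
    return [int(hashlib.md5(f"{i}:{key}".encode()).hexdigest(), 16) % n_bits
            for i in range(k)]

def build_filter(S, k, n_bits):
    B = [0] * n_bits                       # setup: set bits for every key in S
    for s in S:
        for pos in bloom_hashes(s, k, n_bits):
            B[pos] = 1
    return B

def maybe_in_S(B, x, k):
    # Apply: all k bits on -> treat x as in S (possible FP); any bit off -> definitely not in S.
    return all(B[pos] == 1 for pos in bloom_hashes(x, k, len(B)))

# False-positive rate predicted by the formula, for hypothetical sizes:
S_size, k, n = 1_000_000, 5, 8_000_000
fp = (1 - math.exp(-(S_size * k) / n)) ** k
# fp ≈ 0.022, i.e. about a 2% false-positive rate with 8 bits per key and 5 hashes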
Side Note on Generating Hash Functions:
What hash functions to use?
Start with 2 decent hash functions
e.g. h_a(x) = ascii(string) % large_prime_number
     h_b(x) = (3*ascii(string) + 16) % large_prime_number
Add them together, multiplying the second by i:
h_i(x) = (h_a(x) + i*h_b(x)) % |BUCKETS|
e.g. h_5(x) = (h_a(x) + 5*h_b(x)) % 100
https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
Popular choices: md5 (fast, predictable); mmh3 (easy to seed; fast)
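A small sketch of this double-hashing trick; the two base hashes mirror the slide's ascii-sum examples and are illustrative assumptions, not production hashes:

LARGE_PRIME = 1_000_003
N_BUCKETS = 100

def ascii_sum(s):
    return sum(ord(c) for c in s)

def h_a(s):
    return ascii_sum(s) % LARGE_PRIME

def h_b(s):
    return (3 * ascii_sum(s) + 16) % LARGE_PRIME

def h_i(s, i, n_buckets=N_BUCKETS):
    # Family of hash functions built from two base hashes: h_i = (h_a + i*h_b) mod buckets.
    return (h_a(s) + i * h_b(s)) % n_buckets

# e.g. [h_i("user42", i) for i in range(5)] gives 5 different bucket assignments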
Streaming Topics
● General Stream Processing Model
● Sampling
  ○ approx. random
  ○ hierarchical approx. random
● Counting Elements
  ○ distinct elements
  ○ mean, standard deviation
● Filtering data according to a criterion
  ○ Bloom filter setup + application
  ○ calculating false positives
