Streaming Algorithms:
Data without a disk
H. Andrew Schwartz
CSE545
Spring 2023
Big Data Analytics, The Class
Goal: Generalizations
A model or summarization of the data.
Data Workflow Frameworks: Hadoop File System, MapReduce, Spark, Deep Learning Frameworks
Analytics and Algorithms: Similarity Search, Recommendation Systems, Link Analysis, Transformers/Self-Supervision, Streaming, Hypothesis Testing
Broadly:
Process
RECORD GONE
RECORD IN
What is Streaming?
(1) Direct: Often, data …
● … cannot be stored (too big, privacy concerns)
● … are not practical to access repeatedly (reading is too long)
● … are rapidly arriving (need rapidly updated "results")
Why Streaming?
(1) Direct: Often, data …
● … cannot be stored (too big, privacy concerns)
● … are not practical to access repeatedly (reading is too long)
● … are rapidly arriving (need rapidly updated "results")
Examples: Google search queries
Satellite imagery data
Text Messages, Status updates
Click Streams
Why Streaming?
(1) Direct: Often, data …
● … cannot be stored (too big, privacy concerns)
● … are not practical to access repeatedly (reading is too long)
● … are rapidly arriving (need rapidly updated "results")
(2) Indirect: The constraints of streaming data force one toward solutions that are often efficient even when the data could be stored.
Streaming Approx Random Sample
Distributed IO (MapReduce, Spark)
Why Streaming?
(1) Direct: Often, data …
● … cannot be stored (too big, privacy concerns)
● … are not practical to access repeatedly (reading is too long)
● … are rapidly arriving (need rapidly updated "results")
(2) Indirect: The constraints of streaming data force one toward solutions that are often efficient even when the data could be stored.
Streaming Approx Random Sample
Distributed IO (MapReduce, Spark)
Often translates into O(N) or strictly N algorithms.
Process
RECORD GONE
RECORD IN
Why Streaming?
● General Stream Processing Model
● Sampling
● Counting Distinct Elements
● Filtering data according to a criterion
Streaming Topics
Standing Queries:
Stored and permanently executing.
Ad-Hoc:
One-time questions
-- must store expected parts /
summaries of streams
Process
for
stream queries
RECORD GONE
RECORD IN
Standing Queries:
Stored and permanently executing.
Ad-Hoc:
One-time questions
-- must store expected parts /
summaries of streams
Process
for
stream queries
RECORD GONE
RECORD IN
E.g. How would you handle:
What is the mean of values seen so far?
Standing Queries:
Stored and permanently executing.
Ad-Hoc:
One-time questions
-- must store expected parts /
summaries of streams
Process
for
stream queries
RECORD GONE
RECORD IN
E.g. How would you handle:
What is the mean of values seen so far?
Important difference from typical database management:
● Input is not controlled by system staff.
● Input timing/rate is often unknown, controlled by users.
Standing Queries:
Stored and permanently executing.
Ad-Hoc:
One-time questions
-- must store expected parts /
summaries of streams
Process
for
stream queries
RECORD GONE
RECORD IN
Important difference from typical database management:
● Input is not controlled by system staff.
● Input timing/rate is often unknown, controlled by users.
Might hold a sliding window of
records instead of single record.
.. , i, h, g, f, e, d, c, b, a
E.g. How would you handle:
What is the mean of values seen so far?
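As a concrete illustration of the running-mean question above, here is a minimal Python sketch (not from the slides): only a count and a running sum need to be stored, and each record can be discarded after it is processed.

```python
# Minimal sketch of a standing query: mean of all values seen so far.
# Only a count and a running sum are kept; each record is gone after processing.
def running_mean(stream):              # `stream` is any iterable of numbers (hypothetical source)
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count            # current answer to the standing query

for mean in running_mean([4, 3, 11, 2, 0, 5, 8, 1, 4]):
    print(mean)
```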
Input stream
…, 4, 3, 11, 2, 0, 5, 8, 1, 4
Processor
Output
(Generalization,
Summarization)
A stream of records
(also often referred to as “elements” , “tuples”, "lines", or "rows")
Theoretically, could be anything! search queries, numbers, bits, image files, ...
(Leskovec et al., 2014)
General Stream Processing Model
ad-hoc queries
Input stream
…, 4, 3, 11, 2, 0, 5, 8, 1, 4
Processor
Output
(Generalization,
Summarization)
-- one-time questions
General Stream Processing Model
ad-hoc queries
Input stream
…, 4, 3, 11, 2, 0, 5, 8, 1, 4
Processor
Output
(Generalization,
Summarization)
standing
queries
-- asked at all times.
General Stream Processing Model
ad-hoc queries
Input stream
…, 4, 3, 11, 2, 0, 5, 8, 1, 4
Processor
Output
(Generalization,
Summarization)
standing
queries
General Stream Processing Model
limited
memory
ad-hoc queries
Input stream
…, 4, 3, 11, 2, 0, 5, 8, 1, 4
Processor
Output
(Generalization,
Summarization)
limited
memory archival storage
standing
queries
-- not suitable for
fast queries.
General Stream Processing Model
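To make the model above concrete, here is a rough Python sketch of a processor with limited memory (a sliding window), a standing query, and an archival sink; the window size and the file-based archive are assumptions for illustration.

```python
from collections import deque

# Rough sketch of the general stream processing model: a processor with limited
# memory (a sliding window), a standing query answered on every record, and an
# archival store that is written to but not suitable for fast queries.
def process_stream(stream, archive, window_size=1000):
    window = deque(maxlen=window_size)      # limited memory: only recent records
    for record in stream:                   # ..., 4, 3, 11, 2, 0, 5, 8, 1, 4
        window.append(record)
        archive.write(f"{record}\n")        # archival storage (slow to query later)
        yield sum(window) / len(window)     # standing query: mean over the window

with open("archive.txt", "w") as archive:
    for answer in process_stream([4, 3, 11, 2, 0, 5, 8, 1, 4], archive, window_size=4):
        print(answer)
```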
Create a random sample for statistical analysis.
Process RECORD GONE
RECORD IN
Sampling
Create a random sample for statistical analysis.
Process RECORD GONE
RECORD IN
Keep?
limited
memory
yes
Sampling
Create a random sample for statistical analysis.
Process RECORD GONE
RECORD IN
Keep?
limited
memory
yes
run statistical
analysis
sometime in
future
Sampling
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.
Sampling: 2 Versions
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.
2. Hierarchical Sampling: Sample an attribute of a record.
(e.g. records are tweets, but we wish to sample users)
Sampling: 2 Versions
(slide graphic: a large stream of tweets)
Sampling: 2 Versions
(slide graphic: the full stream of tweets)
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.
2. Hierarchical Sampling: Sample an attribute of a record.
(e.g. records are tweets, but we wish to sample users)
(slide graphic: a smaller, sampled set of tweets)
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.
Sampling
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.
record = stream.next()
if ?: #keep: e.g., true 5% of the time
memory.write(record)
yes
limited
memory
Sampling
RECORD GONE
RECORD IN
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.
record = stream.next()
if random() <= .05: #keep: true 5% of the time
memory.write(record)
random() < .05?
yes
limited
memory
Sampling
RECORD GONE
RECORD IN
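A runnable version of the simple-sampling pseudocode above might look like the following sketch; the range standing in for the input stream is just an illustration.

```python
import random

# Simple sampling sketch: keep each record independently with probability .05.
def sample_stream(stream, rate=0.05):
    kept = []                           # the "limited memory" holding the sample
    for record in stream:
        if random.random() <= rate:     # keep: true ~5% of the time
            kept.append(record)
        # otherwise the record is gone
    return kept

sample = sample_stream(range(100_000))
print(len(sample))                      # roughly 5,000
```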
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.
record = stream.next()
if random() <= .05: #keep: true 5% of the time
memory.write(record)
Problem: records/rows often are not units-of-analysis for statistical analyses
E.g. user_ids for searches, tweets; location_ids for satellite images
limited
memory
run statistical
analysis
sometime in
future
Sampling
2. Hierarchical Sampling: Sample an attribute of a record.
(e.g. records are tweets, but we wish to sample users)
record = stream.next()
if random() <= .05: #keep: true 5% of the time
memory.write(record)
Solution: ?
Sampling
Sampling
(slide graphic: a large stream of tweets)
2. Hierarchical Sampling: Sample an attribute of a record.
(e.g. records are tweets, but we wish to sample users)
record = stream.next()
if ??: #keep
memory.write(record)
Solution: ?
2. Hierarchical Sampling: Sample an attribute of a record.
(e.g. records are tweets, but we wish to sample users)
record = stream.next()
if ??: #keep:
memory.write(record)
Solution: instead of checking a random number, hash the attribute being sampled.
– streaming: only need to store hash functions; may be part of standing query
Sampling
2. Hierarchical Sampling: Sample an attribute of a record.
(e.g. records are tweets, but we wish to sample users)
record = stream.next()
if hash(record['user_id']) == 1: #keep
memory.write(record)
Solution: instead of checking a random number, hash the attribute being sampled.
– streaming: only need to store hash functions; may be part of standing query
How many buckets to hash into?
Sampling
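A possible runnable sketch of the hashed hierarchical sampling above: hashing user_id into 20 buckets and keeping a single bucket retains roughly 5% of users along with all of their tweets. The md5-based hash and the field names are assumptions.

```python
import hashlib

# Hierarchical sampling sketch: sample *users*, not individual tweets.
# Keeping one of 20 buckets retains ~1/20 = 5% of users, and the same users
# are kept consistently for the rest of the stream.
NUM_BUCKETS = 20

def user_bucket(user_id):
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def sample_by_user(stream, keep_bucket=0):
    for record in stream:                       # record fields are assumptions
        if user_bucket(record["user_id"]) == keep_bucket:
            yield record

tweets = [{"user_id": u, "text": "tweet!"} for u in range(1000) for _ in range(3)]
kept = list(sample_by_user(tweets))
print(len(kept))   # ~5% of the users, with all of their tweets
```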
● General Stream Processing Model
● Sampling
● Counting Distinct Elements
● Filtering data according to a criterion
Streaming Topics
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
Counting Moments
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
Counting Moments
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
Trivial: just increment a counter
Counting Moments
Counting Moments
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
0th moment
Applications
Counting…
distinct words in large document.
distinct websites (URLs).
users that visit a site without storing.
unique queries to Alexa.
Counting Moments
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
0th moment
One Solution: Just keep a set (hashmap, dictionary, heap)
Problem: Can’t maintain that many in memory; disk storage is too slow
Applications
Counting…
distinct words in large document.
distinct websites (URLs).
users that visit a site without storing.
unique queries to Alexa.
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
0th moment
Streaming Solution: Flajolet-Martin Algorithm
General idea:
n -- suspected total number of elements observed
pick a hash, h, to map each element to log2(n) bits (buckets)
Counting Moments
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
0th moment
Streaming Solution: Flajolet-Martin Algorithm
General idea:
n -- suspected overestimate of total number of elements observed
pick a hash, h, to map each element to log2(n) bits (buckets)
--------------------------------
R = 0  # current max number of trailing zeros
for each stream element, e:
    r(e) = trailZeros(h(e))  # num of trailing 0s in h(e)
    R = r(e) if r(e) > R
estimated_distinct_elements = 2^R
Counting Moments
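A minimal runnable sketch of the single-hash Flajolet-Martin idea above; the md5-derived 64-bit hash is an assumption (real implementations use cheaper hashes).

```python
import hashlib

# Flajolet-Martin sketch with one hash: estimate distinct elements as 2^R,
# where R is the max number of trailing zero bits seen in any hashed element.
def trail_zeros(x):
    if x == 0:
        return 0                      # convention: treat an all-zero hash as 0
    return (x & -x).bit_length() - 1  # index of the lowest set bit

def h(element):                       # assumed hash: low 64 bits of md5
    return int(hashlib.md5(str(element).encode()).hexdigest(), 16) & ((1 << 64) - 1)

def fm_estimate(stream):
    R = 0                             # current max number of trailing zeros
    for e in stream:
        R = max(R, trail_zeros(h(e)))
    return 2 ** R

print(fm_estimate(["a", "b", "c", "a", "b", "d"]))   # rough estimate of 4
```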
Counting Moments
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
0th moment
Streaming Solution: Flajolet-Martin Algorithm
General idea:
n -- suspected total number of elements observed
pick a hash, h, to map each element to log2(n) bits (buckets)
--------------------------------
R = 0  # current max number of trailing zeros
for each stream element, e:
    r(e) = trailZeros(h(e))  # num of trailing 0s in h(e)
    R = r(e) if r(e) > R
estimated_distinct_elements = 2^R  # ≈ m
Mathematical Intuition
P( trailZeros(h(e)) >= i ) = 2^-i
    # P(h(e) == ...0) = .5; P(h(e) == ...00) = .25; …
P( trailZeros(h(e)) < i ) = 1 - 2^-i
    for m distinct elements: (1 - 2^-i)^m
P( at least one e has trailZeros >= i ) = 1 - (1 - 2^-i)^m ≈ 1 - e^(-m·2^-i)
If 2^R >> m, then 1 - (1 - 2^-R)^m ≈ 0 (unlikely to have observed R trailing zeros)
If 2^R << m, then 1 - (1 - 2^-R)^m ≈ 1 (almost surely would have observed R trailing zeros)
⇒ 2^R is a reasonable estimate of m
Counting Moments
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
0th moment
Streaming Solution: Flajolet-Martin Algorithm
General idea:
n -- suspected total number of elements observed
pick a hash, h, to map each element to log2(n) bits (buckets)
--------------------------------
R = 0  # current max number of trailing zeros
for each stream element, e:
    r(e) = trailZeros(h(e))  # num of trailing 0s in h(e)
    R = r(e) if r(e) > R
estimated_distinct_elements = 2^R  # ≈ m
Mathematical Intuition
P( trailZeros(h(e)) >= i ) = 2^-i
    # P(h(e) == ...0) = .5; P(h(e) == ...00) = .25; …
P( trailZeros(h(e)) < i ) = 1 - 2^-i
    for m distinct elements: (1 - 2^-i)^m
P( at least one e has trailZeros >= i ) = 1 - (1 - 2^-i)^m ≈ 1 - e^(-m·2^-i)
If 2^R >> m, then 1 - (1 - 2^-R)^m ≈ 0 (unlikely to have observed R trailing zeros)
If 2^R << m, then 1 - (1 - 2^-R)^m ≈ 1 (almost surely would have observed R trailing zeros)
⇒ 2^R is a reasonable estimate of m
Problem:
Unstable in practice.
Solution:
Multiple hash functions
but how to combine?
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
Counting Moments
0th moment
Streaming Solution: Flajolet-Martin Algorithm
General idea:
n -- suspected total number of elements observed
pick a hash, h, to map each element to log2(n) bits (buckets)
--------------------------------
Rs = list()
for h in hashes:
    R = 0  # max number of trailing zeros for this hash
    for each stream element, e:
        r(e) = trailZeros(h(e))  # num of trailing 0s in h(e)
        R = r(e) if r(e) > R
    Rs.append(2^R)
groupRs = [Rs[i:i+log n] for i in range(0, len(Rs), log n)]
estimated_distinct_elements = median(map(mean, groupRs))
Problem:
Unstable in practice.
Solution: Multiple hash functions
1. Partition into groups of size log n
2. Take mean in groups
3. Take median of group means
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
Counting Moments
0th moment
Streaming Solution: Flajolet-Martin Algorithm
General idea:
n -- suspected total number of elements observed
pick a hash, h, to map each element to log2(n) bits (buckets)
--------------------------------
Rs = list()
for h in hashes:
    R = 0  # max number of trailing zeros for this hash
    for each stream element, e:
        r(e) = trailZeros(h(e))  # num of trailing 0s in h(e)
        R = r(e) if r(e) > R
    Rs.append(2^R)
groupRs = [Rs[i:i+log n] for i in range(0, len(Rs), log n)]
estimated_distinct_elements = median(map(mean, groupRs))
Problem:
Unstable in practice.
Solution: Multiple hash functions
1. Partition into groups of size log n
2. Take mean in groups
3. Take median of group means
A good approach anytime one has many "low resolution" estimates of a true value.
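A runnable sketch of the combined estimator above (partition the per-hash estimates, take the mean within groups, then the median of group means); the md5-seeded hash family and the group size are assumptions, and for brevity this sketch re-reads a cached list rather than updating every hash's R in a single pass.

```python
import hashlib
from statistics import mean, median

# Flajolet-Martin with many hashes: 2^R per hash, mean within groups,
# median of the group means.
def trail_zeros(x):
    return (x & -x).bit_length() - 1 if x else 0

def make_hash(seed):                       # assumed hash family: md5 of seed + element
    def h(element):
        return int(hashlib.md5(f"{seed}:{element}".encode()).hexdigest(), 16)
    return h

def fm_estimate(stream, num_hashes=24, group_size=6):
    elements = list(stream)                # cached here; a true one-pass version
    estimates = []                         # would update all hashes per element
    for seed in range(num_hashes):
        h = make_hash(seed)
        R = max((trail_zeros(h(e)) for e in elements), default=0)
        estimates.append(2 ** R)
    groups = [estimates[i:i + group_size] for i in range(0, len(estimates), group_size)]
    return median(map(mean, groups))

print(fm_estimate(str(i % 500) for i in range(5000)))   # true answer: 500
```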
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is Σ_i (m_i)^k
Examples
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
2nd moment
Streaming Solution: Alon-Matias-Szegedy Algorithm
(Exercise; Out of Scope; see in MMDS)
Counting Moments
Counting Moments
standard deviation
(square-root of variance for numeric data)
Counting Moments
standard deviation
(square-root of variance for numeric data)
Counting Moments
For streaming, just need to store (1) number of elements, (2) sum of elements, and (3) sum of squares.
standard deviation
(square-root of variance for numeric data)
Counting Moments
For streaming, just need to store (1) number of elements, (2) sum of elements, and (3) sum of squares.
However, challenge: the sum of squares can blow up!
standard deviation
(square-root of variance for numeric data)
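A sketch of the three running quantities above, using the population-variance identity Var = Σx²/N − (Σx/N)² (reconstructed here, not shown on the slide); note the slide's caveat that the sum of squares can blow up on long streams.

```python
from math import sqrt

# Streaming standard deviation: keep only (1) count, (2) sum, (3) sum of squares.
# Caveat from the slide: sum_sq can grow very large / lose precision over long streams.
class RunningStd:
    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.sum_sq = 0.0

    def add(self, x):
        self.n += 1
        self.total += x
        self.sum_sq += x * x

    def std(self):
        mean = self.total / self.n
        return sqrt(self.sum_sq / self.n - mean * mean)   # population std

stats = RunningStd()
for x in [4, 3, 11, 2, 0, 5, 8, 1, 4]:
    stats.add(x)
print(stats.std())
```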
Filtering: Select elements with property x
Example: 40B safe email addresses for spam detector
Filtering Data
Filtering: Select elements with property x
Example: 40B safe email addresses for spam filter
The Bloom Filter (approximates; allows false positives but not false negatives)
Given:
|S| keys to filter; will be mapped to |B| bits
hashes = h_1, h_2, …, h_k independent hash functions
Filtering Data
Filtering: Select elements with property x
Example: 40B safe email addresses for spam filter
The Bloom Filter (approximates; allows false positives but not false negatives)
Given:
|S| keys to filter; will be mapped to |B| bits
hashes = h_1, h_2, …, h_k independent hash functions
Algorithm:
set all B to 0                 # B is a bit vector
for each i in hashes, for each s in S:
    set B[h_i(s)] = 1          # all bits hit by any hash of any member of S become 1
Filtering Data
Filtering: Select elements with property x
Example: 40B safe email addresses for spam detector
The Bloom Filter (approximates; allows false positives but not false negatives)
Given:
|S| keys to filter; will be mapped to |B| bits
hashes = h_1, h_2, …, h_k independent hash functions
Algorithm:
set all B to 0                 # B is a bit vector
for each i in hashes, for each s in S:
    set B[h_i(s)] = 1          # all bits hit by any hash of any member of S become 1
…                              # usually embedded in other code
while key x arrives next in stream:   # filter
    if B[h_i(x)] == 1 for all i in hashes:
        # do as if x is in S
    else:
        # do as if x is not in S
Filtering Data
Filtering: Select elements with property x
Example: 40B safe email addresses for spam filter
The Bloom Filter (approximates; allows false positives but not false negatives)
Given:
|S| keys to filter; will be mapped to |B| bits
hashes = h_1, h_2, …, h_k independent hash functions
Algorithm:
set all B to 0                 # B is a bit vector
for each i in hashes, for each s in S:
    set B[h_i(s)] = 1          # all bits hit by any hash of any member of S become 1
…                              # usually embedded in other code
while key x arrives next in stream:   # filter
    if B[h_i(x)] == 1 for all i in hashes:
        # do as if x is in S
    else:
        # do as if x is not in S
Filtering Data
Setup filter
Apply Filter
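Putting the setup and apply steps together, a runnable Bloom-filter sketch might look like this; the bit-array size, number of hashes, and md5-seeded hash family are assumptions.

```python
import hashlib

# Bloom filter sketch: set up the bit vector B from the safe set S, then filter
# the stream. False positives are possible; false negatives are not.
class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.n = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits)        # B (one byte per bit, for simplicity)

    def _positions(self, key):
        for i in range(self.k):                # k "independent" hashes via seeded md5
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):                        # setup: set B[h_i(s)] = 1 for each hash
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):              # apply: all k bits must be 1
        return all(self.bits[pos] for pos in self._positions(key))

safe = BloomFilter(num_bits=10_000, num_hashes=3)
for address in ["alice@example.com", "bob@example.com"]:
    safe.add(address)

for x in ["alice@example.com", "mallory@example.com"]:
    print(x, "treated as in S" if safe.might_contain(x) else "treated as not in S")
```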
Filtering Data
Filtering: Select elements with property x
Example: 40B safe email addresses for spam filter
The Bloom Filter (approximates; allows FPs)
Given:
|S| keys to filter; will be mapped to |B| bits
hashes = h_1, h_2, …, h_k independent hash functions
Algorithm:
set all B to 0
for each i in hashes, for each s in S:
    set B[h_i(s)] = 1
…                              # usually embedded in other code
while key x arrives next in stream:   # filter
    if B[h_i(x)] == 1 for all i in hashes:
        # do as if x is in S
    else:
        # do as if x is not in S
What is the probability of a false
positive (FP)?
Q: What fraction of |B| are 1s?
(Leskovec et al., 2014)
Filtering Data
Filtering: Select elements with property x
Example: 40B safe email addresses for spam filter
The Bloom Filter (approximates; allows FPs)
Given:
|S| keys to filter; will be mapped to |B| bits
hashes = h_1, h_2, …, h_k independent hash functions
Algorithm:
set all B to 0
for each i in hashes, for each s in S:
    set B[h_i(s)] = 1
…                              # usually embedded in other code
while key x arrives next in stream:   # filter
    if B[h_i(x)] == 1 for all i in hashes:
        # do as if x is in S
    else:
        # do as if x is not in S
What is the probability of a false
positive?
Q: What fraction of |B| are 1s?
A: Analogy:
Throw |S| * k darts at n targets.
1 dart: P(a given target is hit) = 1/n
d darts: P(a given target is still 0) = (1 - 1/n)^d ≈ e^(-d/n)
⇒ e^(-d/n) of the bits are 0s
(Leskovec et al., 2014)
Filtering Data
Filtering: Select elements with property x
Example: 40B safe email addresses for spam filter
The Bloom Filter (approximates; allows FPs)
Given:
|S| keys to filter; will be mapped to |B| bits
hashes = h_1, h_2, …, h_k independent hash functions
Algorithm:
set all B to 0
for each i in hashes, for each s in S:
    set B[h_i(s)] = 1
…                              # usually embedded in other code
while key x arrives next in stream:   # filter
    if B[h_i(x)] == 1 for all i in hashes:
        # do as if x is in S
    else:
        # do as if x is not in S
What is the probability of a false
positive?
Q: What fraction of |B| are 1s?
A: Analogy:
Throw |S| * k darts at n targets.
1 dart: P(a given target is hit) = 1/n
d darts: P(a given target is still 0) = (1 - 1/n)^d
       = ((1 - 1/n)^n)^(d/n) ≈ (e^-1)^(d/n) = e^(-d/n) for large n
⇒ e^(-d/n) of the bits are 0s
(Leskovec et al., 2014)
Filtering Data
Filtering: Select elements with property x
Example: 40B safe email addresses for spam filter
The Bloom Filter (approximates; allows FPs)
Given:
|S| keys to filter; will be mapped to |B| bits
hashes = h_1, h_2, …, h_k independent hash functions
Algorithm:
set all B to 0
for each i in hashes, for each s in S:
    set B[h_i(s)] = 1
…                              # usually embedded in other code
while key x arrives next in stream:   # filter
    if B[h_i(x)] == 1 for all i in hashes:
        # do as if x is in S
    else:
        # do as if x is not in S
What is the probability of a false
positive?
Q: What fraction of |B| are 1s?
A: Analogy:
Throw |S| * k darts at n targets.
1 dart: P(a given target is hit) = 1/n
d darts: P(a given target is still 0) = (1 - 1/n)^d ≈ e^(-d/n)
⇒ e^(-d/n) of the bits are 0s; thus (1 - e^(-d/n)) are 1s
probability of all k bits being 1?
(Leskovec et al., 2014)
Filtering Data
Filtering: Select elements with property x
Example: 40B safe email addresses for spam filter
The Bloom Filter (approximates; allows FPs)
Given:
|S| keys to filter; will be mapped to |B| bits
hashes = h_1, h_2, …, h_k independent hash functions
Algorithm:
set all B to 0
for each i in hashes, for each s in S:
    set B[h_i(s)] = 1
…                              # usually embedded in other code
while key x arrives next in stream:   # filter
    if B[h_i(x)] == 1 for all i in hashes:
        # do as if x is in S
    else:
        # do as if x is not in S
What is the probability of a false
positive?
Q: What fraction of |B| are 1s?
A: Analogy:
Throw |S| * k darts at n targets.
1 dart: P(a given target is hit) = 1/n
d darts: P(a given target is still 0) = (1 - 1/n)^d ≈ e^(-d/n)
⇒ e^(-d/n) of the bits are 0s; thus (1 - e^(-d/n)) are 1s
probability of all k bits being 1 (a false positive): (1 - e^(-(|S|*k)/n))^k
|S|: size of the key set
k: number of hash functions
n: number of bits (buckets) in B
Note: Can expand S as the stream continues, as long as |B| has room
(e.g. adding verified email addresses)
(Leskovec et al., 2014)
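Plugging illustrative numbers into the false-positive formula above (these particular |S|, n, and k are assumptions, not from the slides):

```python
from math import exp

# False-positive probability of a Bloom filter: (1 - e^(-|S|*k/n))^k
def bloom_fp_rate(set_size, num_bits, num_hashes):
    return (1 - exp(-set_size * num_hashes / num_bits)) ** num_hashes

# Illustrative (assumed) numbers: 1e9 keys, 8 bits per key, k = 1 vs k = 5
print(bloom_fp_rate(1_000_000_000, 8_000_000_000, 1))   # ≈ 0.118
print(bloom_fp_rate(1_000_000_000, 8_000_000_000, 5))   # ≈ 0.022
```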
Side Note on Generating Hash Functions:
What hash functions to use?
Start with 2 decent hash functions
e.g.  h_a(x) = ascii(string) % large_prime_number
      h_b(x) = (3*ascii(string) + 16) % large_prime_number
Add them together, multiplying the second by i:
      h_i(x) = (h_a(x) + i*h_b(x)) % |BUCKETS|
e.g.  h_5(x) = (h_a(x) + 5*h_b(x)) % 100
https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
Popular choices: md5 (fast, predictable); mmh3 (easy to seed; fast)
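A runnable sketch of the h_i(x) = h_a(x) + i·h_b(x) construction above, deriving the two base hashes from md5 rather than ascii sums (an assumption):

```python
import hashlib

# Build a family of hash functions h_i(x) = (h_a(x) + i*h_b(x)) % num_buckets
# from two base hashes, as in the side note; md5-derived bases are an assumption.
def _base_hashes(x):
    digest = hashlib.md5(str(x).encode()).digest()
    h_a = int.from_bytes(digest[:8], "big")       # first 8 bytes of md5
    h_b = int.from_bytes(digest[8:], "big") | 1   # last 8 bytes, forced odd (nonzero)
    return h_a, h_b

def make_hash_family(num_hashes, num_buckets):
    def make_h(i):
        def h(x):
            h_a, h_b = _base_hashes(x)
            return (h_a + i * h_b) % num_buckets
        return h
    return [make_h(i) for i in range(num_hashes)]

hashes = make_hash_family(num_hashes=5, num_buckets=100)
print([h("example@domain.com") for h in hashes])
```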
● General Stream Processing Model
● Sampling
○ approx. random
○ hierarchical approx. random
● Counting Elements
○ distinct elements
○ mean, standard deviation
● Filtering data according to a criterion
○ bloom filter setup + application
○ calculating false positives
Streaming Topics
