Streaming Algorithms
Joe Kelley
Data Engineer
July 2013
CONFIDENTIAL | 2
Accelerating Your Time to Value
IMAGINE: Strategy and Roadmap
ILLUMINATE: Training and Education
IMPLEMENT: Hands-On Data Science and Data Engineering
Leading Provider of Data Science & Engineering for Big Analytics
What is a Streaming Algorithm?
• Operates on a continuous stream of data
• Unknown or infinite size
• Only one pass; options:
• Store it
• Lose it
• Store an approximation
• Limited processing time per item
• Limited total memory
[Diagram labels: Input, Algorithm, Standing Query, Ad-hoc Query, Output, Memory, Disk]
Why use a Streaming Algorithm?
• Compare to typical “Big Data” approach: store
everything, analyze later, scale linearly
• Streaming Pros:
• Lower latency
• Lower storage cost
• Streaming Cons:
• Less flexibility
• Lower precision (sometimes)
• Answer?
• Why not both?
[Diagram: Streaming Algorithm → Result (Initial Answer); Long-term Storage → Batch Algorithm → Result (Authoritative Answer)]
General Techniques
1. Tunable Approximation
2. Sampling
• Sliding window
• Fixed number
• Fixed percentage
3. Hashing: useful randomness
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
• Not enough space to store everything
• Simple queries → storing 1% is good enough
Device-1
(Device-1, event-1, 10001123)
(Device-1, event-3, 10001126)
(Device-1, event-1, 10001129)
...
Device-2
(Device-2, event-2, 10001124)
(Device-2, ERROR, 10001130)
(Device-2, event-4, 10001132)
...
Device-3
(Device-3, event-3, 10001122)
(Device-3, event-1, 10001127)
(Device-3, ERROR, 10001135)
...
Input (the merged stream, ordered by timestamp):
(Device-3, event-3, 10001122)
(Device-1, event-1, 10001123)
(Device-2, event-2, 10001124)
(Device-1, event-3, 10001126)
(Device-3, event-1, 10001127)
(Device-1, event-1, 10001129)
(Device-2, ERROR, 10001130)
(Device-2, event-4, 10001132)
(Device-3, ERROR, 10001135)
...
Example 1: Sampling device error rates
Algorithm:
for each element e:
    with probability 0.01:
        store e
    else:
        throw out e
Can lead to some insidious statistical “bugs”…
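A minimal Python sketch of the per-event sampling described above may help make it concrete. The function name `bernoulli_sample`, the seed, and the synthetic `events` list are assumptions made for illustration; they are not part of the original slides.

```python
import random

def bernoulli_sample(stream, p=0.01, rng=None):
    """Yield each element of `stream` independently with probability p."""
    rng = rng or random.Random()
    for e in stream:
        if rng.random() < p:
            yield e  # store e
        # otherwise e is thrown out and never seen again

# Hypothetical stream of (device_id, event, timestamp) tuples.
events = [(f"Device-{i % 3 + 1}", "event-1", 10001122 + i) for i in range(100_000)]
sample = list(bernoulli_sample(events, p=0.01, rng=random.Random(42)))
print(len(sample))  # roughly 1% of the stream; the exact size varies with the seed
```

Note that the sample size is itself random here: only the expected fraction is fixed, not the count.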
Example 1: Sampling device error rates
Query:
How many errors has the average device encountered?
Answer:
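SELECT AVG(n) FROM (
    SELECT COUNT(*) AS n FROM events
    WHERE event = 'ERROR'
    GROUP BY device_id
) AS per_device
Simple… but off by up to 100x. Each device had only 1% of its events sampled.
Can we just multiply by 100? Not safely: a device whose errors all missed the sample drops out of the GROUP BY entirely, so the scaled average is biased upward.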
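The effect can be checked with a small simulation. This is a sketch on made-up data (2,000 hypothetical devices with exactly 2 errors each); the helper `avg_errors_per_device` mirrors the query above and is an illustrative name, not something from the slides.

```python
import random
from collections import Counter

def avg_errors_per_device(events):
    """Mirror the SQL: count ERROR rows per device, then average over devices that have any."""
    per_device = Counter(dev for dev, event, _ in events if event == "ERROR")
    return sum(per_device.values()) / len(per_device) if per_device else 0.0

rng = random.Random(7)
# Hypothetical data: 2,000 devices, each with exactly 2 errors among 100 events.
events = [
    (f"Device-{d}", "ERROR" if i < 2 else "event-1", i)
    for d in range(2_000)
    for i in range(100)
]
# Per-event 1% sample, as in the first algorithm.
sample = [e for e in events if rng.random() < 0.01]

true_avg = avg_errors_per_device(events)          # 2.0 by construction
scaled_avg = 100 * avg_errors_per_device(sample)  # naive "multiply by 100" estimate
print(true_avg, scaled_avg)
```

Every device that survives into the sampled GROUP BY has at least one sampled error, so the scaled figure cannot fall below 100 even though the true average is 2.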
Example 1: Sampling device error rates
Better Algorithm:
for each element e:
    if (hash(e.device_id) mod 100) == 0:
        store e
    else:
        throw out e
Choose the hash key carefully (it should match the key the query groups by)... or keep a separate sample for each key you may need to query
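The hash-based variant can be sketched in Python as follows. The helper names `bucket` and `sample_by_device` are assumptions, as is the choice of SHA-256; the slides do not say which hash to use. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings is randomized per process by default.

```python
import hashlib

def bucket(device_id, buckets=100):
    """Map a device_id to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def sample_by_device(stream, buckets=100, keep=0):
    """Keep every event of the devices that hash to bucket `keep`; drop all others."""
    for e in stream:
        device_id = e[0]
        if bucket(device_id, buckets) == keep:
            yield e

# Hypothetical stream: 1,000 devices with 20 events each, interleaved by time.
events = [(f"Device-{d}", "event-1", 10001122 + t) for t in range(20) for d in range(1000)]
sample = list(sample_by_device(events))
kept_devices = {e[0] for e in sample}
```

Because the keep/drop decision depends only on `device_id`, a sampled device keeps its complete event history, so per-device counts in the sample are exact for about 1% of devices.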
Example 2: Sampling fixed number
Want to sample a fixed count (k), not a fixed percentage.
Algorithm:
Let arr = array of size k
for each element e:
    if arr is not yet full:
        add e to arr
    else:
        with probability p:
            replace a random element of arr with e
        else:
            throw out e
Choice of p is crucial:
• p = constant → prefer more recent elements. Higher p = more recent
• p = k/n (n = number of elements seen so far) → sample uniformly from entire stream
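The fixed-count sampling algorithm of Example 2, with p = k/n, is commonly known as reservoir sampling. A minimal Python sketch follows; the function name `reservoir_sample` and the seed are illustrative assumptions.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k elements drawn uniformly from a stream of unknown length."""
    rng = rng or random.Random()
    arr = []
    for n, e in enumerate(stream, start=1):  # n = number of elements seen so far
        if len(arr) < k:
            arr.append(e)                    # arr is not yet full
        elif rng.random() < k / n:           # with probability p = k/n
            arr[rng.randrange(k)] = e        # replace a random element of arr
        # otherwise e is thrown out
    return arr

sample = reservoir_sample(range(1_000_000), k=10, rng=random.Random(1))
print(sample)
```

Replacing `k / n` with a constant gives the recency-biased variant mentioned in the slide.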
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over
a time period
• Naïve approach:
• Store all user_ids in a list/tree/hashtable
• Millions of users = a lot of memory
• Better approach:
• Store all user_ids in a database
• Good, but maybe it’s not fast enough…
• What if an approximate count is ok?
Example 3: Counting unique users
• Approximate count is ok
• Flajolet-Martin Idea:
• Hash each user_id into a bit string
• Count the trailing zeros
• Remember maximum number of trailing zeros seen
user_id      H(user_id)   trailing zeros   max(trailing zeros)
john_doe     0111001001   0                0
jane_doe     1011011100   2                2
alan_t       0010111000   3                3
EWDijkstra   1101011110   1                3
jane_doe     1011011100   2                3
Example 3: Counting unique users
• Intuition:
• If we had seen 2 distinct users, we would expect 1 trailing zero
• If we had seen 4, we would expect 2 trailing zeros
• If we had seen 2^r, we would expect r trailing zeros
• In general, if there has been a maximum of r trailing zeros, 2^r is a reasonable estimate of the number of distinct users
• Want more precision? Use more independent hash functions, and combine the results
• Median = only get powers of two
• Mean = subject to skew
• Median of means of groups works well in practice
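The intuition can be made slightly more precise under the idealized assumption that each hash behaves like a uniformly random bit string. The symbols m (number of distinct users) and r (a candidate number of trailing zeros) are introduced here for illustration.

```latex
\Pr[\text{a given hash ends in at least } r \text{ zeros}] = 2^{-r}

\Pr[\text{none of } m \text{ distinct hashes ends in at least } r \text{ zeros}]
  = \left(1 - 2^{-r}\right)^{m} \approx e^{-m / 2^{r}}
```

When m is much larger than 2^r this probability is close to 0, and when m is much smaller than 2^r it is close to 1, so the largest number of trailing zeros observed tends to sit near log2(m).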
Example 3: Counting unique users
Flajolet-Martin, all together:
arr = int[k]
for each item e:
    for i in 0...k-1:
        z = trailing_zeros(hash_i(e))
        if z > arr[i]:
            arr[i] = z
means = group_means(arr)
median = median(means)
return pow(2, median)
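The pseudocode above can be turned into a runnable Python sketch. Several details here are assumptions not given in the slides: the k hash functions are derived by prefixing the index i to the item before SHA-256, hashes are truncated to 64 bits, and groups are consecutive slices of `group_size` entries. The result is a rough estimate only; no bias correction is applied.

```python
import hashlib
from statistics import mean, median

def hash_i(i, item):
    """i-th hash function (assumed construction): SHA-256 of index and item, truncated to 64 bits."""
    data = f"{i}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def trailing_zeros(x, width=64):
    """Number of trailing zero bits in x (width if x == 0)."""
    return width if x == 0 else (x & -x).bit_length() - 1

def flajolet_martin(stream, k=64, group_size=8):
    """Estimate the number of distinct items using k hash functions."""
    arr = [0] * k  # arr[i] = max trailing zeros seen under hash i
    for e in stream:
        for i in range(k):
            z = trailing_zeros(hash_i(i, e))
            if z > arr[i]:
                arr[i] = z
    # Median of the per-group means, then 2 to that power, as in the pseudocode.
    means = [mean(arr[g:g + group_size]) for g in range(0, k, group_size)]
    return 2 ** median(means)

users = [f"user_{n}" for n in range(1000)]
estimate = flajolet_martin(users + users)  # repeated users do not change the sketch
print(estimate)
```

Memory use is k small integers regardless of how many users pass through, which is the point of the technique.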
Example 3: Counting unique users
Flajolet-Martin in practice
• Devil is in the details
• Tunable precision
• more hash functions = more precise
• See the paper for bounds on precision
• Tunable latency
• more hash functions = higher latency
• faster hash functions = lower latency
• faster hash functions = more possibility of
correlation = less precision
Remember: streaming algorithm for quick, imprecise
answer. Back-end batch algorithm for slower, exact
answer
Example 4: Counting Individual Item Frequencies
Want to keep track of how many times each item has
appeared in the stream
Many applications:
• How popular is each search term?
• How many times has this hashtag been tweeted?
• Which IP addresses are DDoS’ing me?
Again, two obvious approaches:
• In-memory hashmap of item → count
• Database
But can we be more clever?
Example 4: Counting Individual Item Frequencies
Idea:
• Maintain array of counts
• Hash each item, increment array at that index
To check the count of an item, hash again and check
array at that index
• Over-estimates because of hash “collisions”
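The single-array idea can be sketched in a few lines of Python. The class name `SingleHashCounter`, the helper `slot`, and the SHA-256-based hash are illustrative assumptions; the deliberately tiny width forces collisions so the over-estimation is visible.

```python
import hashlib

def slot(item, width):
    """Stable hash of `item` into [0, width)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % width

class SingleHashCounter:
    """One array of counts indexed by a single hash function."""

    def __init__(self, width):
        self.width = width
        self.counts = [0] * width

    def add(self, item):
        self.counts[slot(item, self.width)] += 1

    def estimate(self, item):
        # Includes the counts of every other item that hashed to the same slot.
        return self.counts[slot(item, self.width)]

# Hypothetical stream: 10 distinct terms once each, plus "python" 5 more times.
stream = [f"term-{n}" for n in range(10)] + ["python"] * 5
counter = SingleHashCounter(width=4)
for item in stream:
    counter.add(item)
print(counter.estimate("python"))  # at least 5; collisions can only push it higher
```

An estimate is never below the true count, because counts are only ever added to a slot.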
Example 4: Counting Individual Item Frequencies
Count-Min Sketch algorithm:
• Maintain 2-d array of size w x d
• Choose d different hash functions; each row in array corresponds to one
hash function
• Hash each item with every hash function, increment the appropriate
position in each row
• To query an item, hash it d times again, take the minimum value from all
rows
Example 4: Counting Individual Item Frequencies
Count-Min Sketch, all together:
arr = int[d][w]
for each item e:
    for i in 0...d-1:
        j = hash_i(e) mod w
        arr[i][j]++
def frequency(q):
    min = +infinity
    for i in 0...d-1:
        j = hash_i(q) mod w
        if arr[i][j] < min:
            min = arr[i][j]
    return min
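A runnable Python sketch of the Count-Min Sketch pseudocode follows. The class name, the way the d hash functions are derived (prefixing the row index before SHA-256), and the sample stream are assumptions made for illustration.

```python
import hashlib
from collections import Counter

class CountMinSketch:
    """d rows of w counters; each row has its own hash function."""

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.arr = [[0] * w for _ in range(d)]

    def _index(self, i, item):
        # Assumed construction of the i-th hash: SHA-256 of "i:item", reduced mod w.
        data = f"{i}:{item}".encode("utf-8")
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % self.w

    def add(self, item, count=1):
        for i in range(self.d):
            self.arr[i][self._index(i, item)] += count

    def frequency(self, item):
        # Every row over-estimates, so the minimum is the tightest bound.
        return min(self.arr[i][self._index(i, item)] for i in range(self.d))

# Hypothetical stream of hashtags with skewed popularity.
stream = ["#a"] * 500 + ["#b"] * 50 + [f"#rare{n}" for n in range(200)]
cms = CountMinSketch(w=64, d=4)
for tag in stream:
    cms.add(tag)
truth = Counter(stream)
print(cms.frequency("#a"), truth["#a"])
```

The sketch uses w × d counters no matter how many distinct items appear, and its answers are never below the true counts.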
Example 4: Counting Individual Item Frequencies
Count-Min Sketch in practice
• Devil is in the details
• Tunable precision
• Bigger array = more precise
• See the paper for bounds on precision
• Tunable latency
• more hash functions = higher latency
• Better at estimating more frequent items
• Can subtract out estimation of collisions
Remember: streaming algorithm for quick, imprecise
answer. Back-end batch algorithm for slower, exact
answer
Questions?
• Feel free to reach out
• www.thinkbiganalytics.com
• joe.kelley@thinkbiganalytics.com
• www.slideshare.net/jfkelley1
• References:
• http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
• http://infolab.stanford.edu/~ullman/mmds.html
We’re hiring! Engineers and Data Scientists

More Related Content

What's hot

An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms Hakky St
 
Feature selection
Feature selectionFeature selection
Feature selectionDong Guo
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretizationKrish_ver2
 
Pagerank Algorithm Explained
Pagerank Algorithm ExplainedPagerank Algorithm Explained
Pagerank Algorithm Explainedjdhaar
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based ClusteringSSA KPI
 
Hyperparameter Optimization for Machine Learning
Hyperparameter Optimization for Machine LearningHyperparameter Optimization for Machine Learning
Hyperparameter Optimization for Machine LearningFrancesco Casalegno
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Raja Chiky
 
data mining
data miningdata mining
data mininguoitc
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel ProgrammingUday Sharma
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
R language tutorial
R language tutorialR language tutorial
R language tutorialDavid Chiu
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReducePietro Michiardi
 
Unit 1 - SNA QUESTION BANK
Unit 1 - SNA QUESTION BANKUnit 1 - SNA QUESTION BANK
Unit 1 - SNA QUESTION BANKUsha Rani M
 

What's hot (20)

An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms
 
Feature selection
Feature selectionFeature selection
Feature selection
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
Machine learning & Time Series Analysis
Machine learning & Time Series AnalysisMachine learning & Time Series Analysis
Machine learning & Time Series Analysis
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
 
Pagerank Algorithm Explained
Pagerank Algorithm ExplainedPagerank Algorithm Explained
Pagerank Algorithm Explained
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
 
Hyperparameter Optimization for Machine Learning
Hyperparameter Optimization for Machine LearningHyperparameter Optimization for Machine Learning
Hyperparameter Optimization for Machine Learning
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
 
Graph Analytics
Graph AnalyticsGraph Analytics
Graph Analytics
 
data mining
data miningdata mining
data mining
 
03 data mining : data warehouse
03 data mining : data warehouse03 data mining : data warehouse
03 data mining : data warehouse
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
R language tutorial
R language tutorialR language tutorial
R language tutorial
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReduce
 
Unit 1 - SNA QUESTION BANK
Unit 1 - SNA QUESTION BANKUnit 1 - SNA QUESTION BANK
Unit 1 - SNA QUESTION BANK
 

Viewers also liked

Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Yueshen Xu
 
Hash - A probabilistic approach for big data
Hash - A probabilistic approach for big dataHash - A probabilistic approach for big data
Hash - A probabilistic approach for big dataLuca Mastrostefano
 
Detecting Anomalies in Streaming Data
Detecting Anomalies in Streaming DataDetecting Anomalies in Streaming Data
Detecting Anomalies in Streaming DataSubutai Ahmad
 
Big Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di MilanoBig Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di MilanoMarco Brambilla
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithmsSandeep Joshi
 
[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms
[RakutenTechConf2013] [D-3_2] Counting Big Databy Streaming Algorithms[RakutenTechConf2013] [D-3_2] Counting Big Databy Streaming Algorithms
[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming AlgorithmsRakuten Group, Inc.
 
NAF2SEM and cross-document Event Coreference
NAF2SEM and cross-document Event CoreferenceNAF2SEM and cross-document Event Coreference
NAF2SEM and cross-document Event CoreferenceFilip Ilievski
 
Copyright And Streaming Media Presentation
Copyright And Streaming Media PresentationCopyright And Streaming Media Presentation
Copyright And Streaming Media PresentationWill Ritter
 
Twarql Architecture - Streaming Annotated Tweets
Twarql Architecture - Streaming Annotated TweetsTwarql Architecture - Streaming Annotated Tweets
Twarql Architecture - Streaming Annotated TweetsPablo Mendes
 
Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm Hamza Aslam
 
Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016Adrianos Dadis
 
Protocol For Streaming Media
Protocol For Streaming MediaProtocol For Streaming Media
Protocol For Streaming MediaKaniska Mandal
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
Paradigmas de Procesamiento en Big Data: Arquitecturas y Tecnologías aplicadas
Paradigmas de Procesamiento en Big Data: Arquitecturas y Tecnologías aplicadasParadigmas de Procesamiento en Big Data: Arquitecturas y Tecnologías aplicadas
Paradigmas de Procesamiento en Big Data: Arquitecturas y Tecnologías aplicadasBig-Data-Summit
 
Discover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalDiscover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalHortonworks
 
Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink- Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink- Flink Forward
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsData Con LA
 
Data Stream Analytics - Why they are important
Data Stream Analytics - Why they are importantData Stream Analytics - Why they are important
Data Stream Analytics - Why they are importantParis Carbone
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 

Viewers also liked (20)

Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)
 
Hash - A probabilistic approach for big data
Hash - A probabilistic approach for big dataHash - A probabilistic approach for big data
Hash - A probabilistic approach for big data
 
Detecting Anomalies in Streaming Data
Detecting Anomalies in Streaming DataDetecting Anomalies in Streaming Data
Detecting Anomalies in Streaming Data
 
Chapter 2.1 : Data Stream
Chapter 2.1 : Data StreamChapter 2.1 : Data Stream
Chapter 2.1 : Data Stream
 
Big Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di MilanoBig Data and Stream Data Analysis at Politecnico di Milano
Big Data and Stream Data Analysis at Politecnico di Milano
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 
[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms
[RakutenTechConf2013] [D-3_2] Counting Big Databy Streaming Algorithms[RakutenTechConf2013] [D-3_2] Counting Big Databy Streaming Algorithms
[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms
 
NAF2SEM and cross-document Event Coreference
NAF2SEM and cross-document Event CoreferenceNAF2SEM and cross-document Event Coreference
NAF2SEM and cross-document Event Coreference
 
Copyright And Streaming Media Presentation
Copyright And Streaming Media PresentationCopyright And Streaming Media Presentation
Copyright And Streaming Media Presentation
 
Twarql Architecture - Streaming Annotated Tweets
Twarql Architecture - Streaming Annotated TweetsTwarql Architecture - Streaming Annotated Tweets
Twarql Architecture - Streaming Annotated Tweets
 
Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm Data Stream Outlier Detection Algorithm
Data Stream Outlier Detection Algorithm
 
Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016Stream processing using Apache Storm - Big Data Meetup Athens 2016
Stream processing using Apache Storm - Big Data Meetup Athens 2016
 
Protocol For Streaming Media
Protocol For Streaming MediaProtocol For Streaming Media
Protocol For Streaming Media
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Paradigmas de Procesamiento en Big Data: Arquitecturas y Tecnologías aplicadas
Paradigmas de Procesamiento en Big Data: Arquitecturas y Tecnologías aplicadasParadigmas de Procesamiento en Big Data: Arquitecturas y Tecnologías aplicadas
Paradigmas de Procesamiento en Big Data: Arquitecturas y Tecnologías aplicadas
 
Discover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalDiscover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.final
 
Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink- Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink-
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
 
Data Stream Analytics - Why they are important
Data Stream Analytics - Why they are importantData Stream Analytics - Why they are important
Data Stream Analytics - Why they are important
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 

Similar to Streaming Algorithms

Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer visionEran Shlomo
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsDebasish Ghosh
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidStreamNative
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeDataWorks Summit
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...huguk
 
Digital Document Preservation Simulation - Boston Python User's Group
Digital Document  Preservation Simulation - Boston Python User's GroupDigital Document  Preservation Simulation - Boston Python User's Group
Digital Document Preservation Simulation - Boston Python User's GroupMicah Altman
 
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Subrata Kumer Paul
 
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Experfy
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfAlexanderKyalo3
 
The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?Raffael Marty
 
Algorithms Analysis.pdf
Algorithms Analysis.pdfAlgorithms Analysis.pdf
Algorithms Analysis.pdfShaistaRiaz4
 
ADS Introduction
ADS IntroductionADS Introduction
ADS IntroductionNagendraK18
 
Chapter 01 Introduction DM.pptx
Chapter 01 Introduction DM.pptxChapter 01 Introduction DM.pptx
Chapter 01 Introduction DM.pptxssuser957b41
 
Lecture7-QuantitativeAnalysis2.pptx
Lecture7-QuantitativeAnalysis2.pptxLecture7-QuantitativeAnalysis2.pptx
Lecture7-QuantitativeAnalysis2.pptxssuser0d0f881
 
Implementing and analyzing online experiments
Implementing and analyzing online experimentsImplementing and analyzing online experiments
Implementing and analyzing online experimentsSean Taylor
 
The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?Raffael Marty
 
Unit 1, ADA.pptx
Unit 1, ADA.pptxUnit 1, ADA.pptx
Unit 1, ADA.pptxjinkhatima
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptxShree Shree
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6Rod Soto
 
anti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIHanti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIHLeo Chu
 

Similar to Streaming Algorithms (20)

Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
 
Digital Document Preservation Simulation - Boston Python User's Group
Digital Document  Preservation Simulation - Boston Python User's GroupDigital Document  Preservation Simulation - Boston Python User's Group
Digital Document Preservation Simulation - Boston Python User's Group
 
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
 
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Unsupervised Learning: Clustering
Unsupervised Learning: Clustering
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
 
The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?
 
Algorithms Analysis.pdf
Algorithms Analysis.pdfAlgorithms Analysis.pdf
Algorithms Analysis.pdf
 
ADS Introduction
ADS IntroductionADS Introduction
ADS Introduction
 
Chapter 01 Introduction DM.pptx
Chapter 01 Introduction DM.pptxChapter 01 Introduction DM.pptx
Chapter 01 Introduction DM.pptx
 
Lecture7-QuantitativeAnalysis2.pptx
Lecture7-QuantitativeAnalysis2.pptxLecture7-QuantitativeAnalysis2.pptx
Lecture7-QuantitativeAnalysis2.pptx
 
Implementing and analyzing online experiments
Implementing and analyzing online experimentsImplementing and analyzing online experiments
Implementing and analyzing online experiments
 
The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?
 
Unit 1, ADA.pptx
Unit 1, ADA.pptxUnit 1, ADA.pptx
Unit 1, ADA.pptx
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx
 
BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6BsidesLVPresso2016_JZeditsv6
BsidesLVPresso2016_JZeditsv6
 
anti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIHanti-ddos GNTC based on P4 /BIH
anti-ddos GNTC based on P4 /BIH
 

Recently uploaded

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Streaming Algorithms

  • 2. CONFIDENTIAL | 2
      Accelerating Your Time to Value
      • IMAGINE: Strategy and Roadmap
      • ILLUMINATE: Training and Education
      • IMPLEMENT: Hands-On Data Science and Data Engineering
      Leading Provider of Data Science & Engineering for Big Analytics
  • 3. What is a Streaming Algorithm?
      • Operates on a continuous stream of data
      • Unknown or infinite size
      • Only one pass; options:
          • Store it
          • Lose it
          • Store an approximation
      • Limited processing time per item
      • Limited total memory
      Diagram labels: Algorithm, Standing Query, Ad-hoc Query, Input, Output, Memory, Disk
  • 4. Why use a Streaming Algorithm?
      • Compare to typical “Big Data” approach: store everything, analyze later, scale linearly
      • Streaming Pros:
          • Lower latency
          • Lower storage cost
      • Streaming Cons:
          • Less flexibility
          • Lower precision (sometimes)
      • Answer?
          • Why not both?
      Diagram labels: Streaming Algorithm, Result (Initial Answer); Long-term Storage, Batch Algorithm, Result (Authoritative Answer)
  • 5. General Techniques
      1. Tunable Approximation
      2. Sampling
          • Sliding window
          • Fixed number
          • Fixed percentage
      3. Hashing: useful randomness
  • 6. Example 1: Sampling device error rates
      • Stream of (device_id, event, timestamp)
      • Scenario:
          • Not enough space to store everything
          • Simple queries → storing 1% is good enough
      Per-device streams:
          Device-1: (Device-1, event-1, 10001123), (Device-1, event-3, 10001126), (Device-1, event-1, 10001129), ...
          Device-2: (Device-2, event-2, 10001124), (Device-2, ERROR, 10001130), (Device-2, event-4, 10001132), ...
          Device-3: (Device-3, event-3, 10001122), (Device-3, event-1, 10001127), (Device-3, ERROR, 10001135), ...
      Input:
          (Device-3, event-3, 10001122)
          (Device-1, event-1, 10001123)
          (Device-2, event-2, 10001124)
          (Device-1, event-3, 10001126)
          (Device-3, event-1, 10001127)
          (Device-1, event-1, 10001129)
          (Device-2, ERROR, 10001130)
          (Device-2, event-4, 10001132)
          (Device-3, ERROR, 10001135)
          ...
  • 7. Example 1: Sampling device error rates
      • Stream of (device_id, event, timestamp)
      • Scenario:
          • Not enough space to store everything
          • Simple queries → storing 1% is good enough
      Algorithm:
          for each element e:
              with probability 0.01:
                  store e
              else:
                  throw out e
      Can lead to some insidious statistical “bugs”…
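The per-element sampling rule above can be sketched in Python roughly as follows. This is a minimal illustration, not the presenter's code: the function name `sample_events`, the `rng` parameter, and the synthetic stream are assumptions made for the example.

```python
import random

def sample_events(events, rate=0.01, rng=None):
    """Yield each event independently with probability `rate` (hypothetical helper)."""
    rng = rng or random.Random()
    for e in events:
        # random() is uniform on [0, 1), so this branch is taken with probability `rate`
        if rng.random() < rate:
            yield e  # store e
        # otherwise: throw out e

# Synthetic (device_id, event, timestamp) stream, invented for illustration only.
stream = [("Device-%d" % (i % 3 + 1), "event-1", 10001122 + i) for i in range(100_000)]
kept = list(sample_events(stream, rate=0.01, rng=random.Random(42)))
print(len(kept), "of", len(stream), "events kept")
```

Memory use is proportional to the number of events kept, so on an unbounded stream the sample still grows without limit; only the growth rate drops.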
  • 8. Example 1: Sampling device error rates
      • Stream of (device_id, event, timestamp)
      • Scenario:
          • Not enough space to store everything
          • Simple queries → storing 1% is good enough
      Query: How many errors has the average device encountered?
      Answer:
          SELECT AVG(n) FROM (
              SELECT COUNT(*) AS n
              FROM events
              WHERE event = 'ERROR'
              GROUP BY device_id
          ) AS per_device_errors
      Simple… but off by up to 100x. Each device had only 1% of its events sampled.
      Can we just multiply by 100?
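One way to see why a fixed correction factor is unreliable is to run the same query logic on two invented workloads. The sketch below is an assumption-laden illustration: the workloads, the seed, and the helper names (`sample`, `avg_errors_per_device`, `workload`) are all hypothetical. Devices whose errors were all discarded drop out of the GROUP BY entirely, so the right scaling factor depends on how errors are distributed across devices.

```python
import random
from collections import Counter

def sample(events, rate, rng):
    """Keep each event independently with probability `rate`."""
    return [e for e in events if rng.random() < rate]

def avg_errors_per_device(events):
    """Mirror of the SQL above: average of per-device ERROR counts.
    Devices with no stored ERROR rows do not appear in the grouping at all."""
    per_device = Counter(dev for dev, event, _ts in events if event == "ERROR")
    return sum(per_device.values()) / len(per_device) if per_device else 0.0

def workload(devices, errors_per_device):
    """Hypothetical stream in which every device logs the same number of errors."""
    return [(f"Device-{d}", "ERROR", t)
            for d in range(devices) for t in range(errors_per_device)]

rng = random.Random(7)
light = workload(10_000, 1)   # true average: 1 error per device
heavy = workload(50, 1_000)   # true average: 1000 errors per device

light_est = avg_errors_per_device(sample(light, 0.01, rng))
heavy_est = avg_errors_per_device(sample(heavy, 0.01, rng))

# In the light workload every surviving device shows exactly one error, so the
# unscaled answer already matches the truth and multiplying by 100 overshoots.
# In the heavy workload the unscaled answer is about 100x too low.
print("light:", light_est, "scaled:", light_est * 100)
print("heavy:", heavy_est, "scaled:", heavy_est * 100)
```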
  • 9. Example 1: Sampling device error rates
      • Stream of (device_id, event, timestamp)
      • Scenario:
          • Not enough space to store everything
          • Simple queries → storing 1% is good enough
      Better Algorithm:
          for each element e:
              if (hash(e.device_id) mod 100) == 0:
                  store e
              else:
                  throw out e
      Choose how to hash carefully... or hash every different way
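A minimal Python sketch of the device-level rule above, under stated assumptions: SHA-256 from `hashlib` stands in for the unspecified hash function, and `keep_device` and `sample_by_device` are hypothetical names. Python's built-in `hash()` is avoided for strings because it is salted per process by default, which would select a different set of devices on every run.

```python
import hashlib

def keep_device(device_id: str, buckets: int = 100) -> bool:
    """Decide once per device, not per event, whether the device is in the sample."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and keep one bucket out of `buckets`.
    return int.from_bytes(digest[:8], "big") % buckets == 0

def sample_by_device(events, buckets: int = 100):
    """Yield every event of the sampled devices and nothing from the others."""
    for e in events:
        if keep_device(e[0], buckets):  # e = (device_id, event, timestamp)
            yield e

# Synthetic stream: 1000 devices with 5 events each (invented for illustration).
stream = [(f"Device-{d}", "event-1", 10001122 + t) for t in range(5) for d in range(1000)]
kept = list(sample_by_device(stream))
print(len({e[0] for e in kept}), "devices sampled, each with its full history")
```

Because a sampled device keeps all of its events, per-device statistics such as error counts are exact for the devices in the sample.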
  • 10. Example 2: Sampling fixed number
      Want to sample a fixed count (k), not a fixed percentage.
      Algorithm:
          Let arr = array of size k
          for each element e:
              if arr is not yet full:
                  add e to arr
              else:
                  with probability p:
                      replace a random element of arr with e
                  else:
                      throw out e
      Choice of p is crucial (n = number of elements seen so far):
          • p = constant → prefer more recent elements. Higher p = more recent
          • p = k/n → sample uniformly from entire stream
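With p = k/n this is the classic reservoir sampling scheme. A minimal Python sketch, assuming n counts the elements seen so far including the current one; the function name `reservoir_sample` and the `rng` parameter are illustrative choices, not part of the slides.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k elements from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, e in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(e)              # arr is not yet full
        elif rng.random() < k / n:           # with probability p = k/n
            reservoir[rng.randrange(k)] = e  # replace a random element
        # otherwise: throw out e
    return reservoir

picked = reservoir_sample(range(1_000_000), k=10, rng=random.Random(0))
print(picked)
```

Replacing `k / n` with a constant turns this into the recency-biased variant mentioned above, since older elements keep getting evicted at a fixed rate.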
  • 11. Example 2: Sampling fixed number
  • 12. Example 3: Counting unique users
      • Input: stream of (user_id, action, timestamp)
      • Want to know how many distinct users are seen over a time period
      • Naïve approach:
          • Store all user_ids in a list/tree/hashtable
          • Millions of users = a lot of memory
      • Better approach:
          • Store all user_ids in a database
          • Good, but maybe it’s not fast enough…
      • What if an approximate count is ok?
  • 13. Example 3: Counting unique users
      • Input: stream of (user_id, action, timestamp)
      • Want to know how many distinct users are seen over a time period
      • Approximate count is ok
      • Flajolet-Martin Idea:
          • Hash each user_id into a bit string
          • Count the trailing zeros
          • Remember maximum number of trailing zeros seen

      user_id      H(user_id)   trailing zeros   max(trailing zeros)
      john_doe     0111001001   0                0
      jane_doe     1011011100   2                2
      alan_t       0010111000   3                3
      EWDijkstra   1101011110   1                3
      jane_doe     1011011100   2                3
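The table above can be reproduced with a few lines of Python. The bit strings are copied from the slide; the helper name `trailing_zeros` and the `history` list are illustrative.

```python
def trailing_zeros(bits: str) -> int:
    """Number of '0' characters at the end of a bit string."""
    return len(bits) - len(bits.rstrip("0"))

# (user_id, H(user_id)) pairs taken from the slide's table.
rows = [
    ("john_doe", "0111001001"),
    ("jane_doe", "1011011100"),
    ("alan_t", "0010111000"),
    ("EWDijkstra", "1101011110"),
    ("jane_doe", "1011011100"),
]

running_max = 0
history = []
for user_id, hashed in rows:
    z = trailing_zeros(hashed)
    running_max = max(running_max, z)  # the only state the algorithm keeps
    history.append((user_id, z, running_max))

for row in history:
    print(row)
```

Note that the repeated `jane_doe` row leaves the maximum unchanged, which is why duplicates do not inflate the estimate.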
  • 14. Example 3: Counting unique users
      • Input: stream of (user_id, action, timestamp)
      • Want to know how many distinct users are seen over a time period
      • Intuition:
          • If we had seen 2 distinct users, we would expect 1 trailing zero
          • If we had seen 4, we would expect 2 trailing zeros
          • If we had seen 2^r, we would expect r
          • In general, if there has been a maximum of r trailing zeros, 2^r is a reasonable estimate of the number of distinct users
      • Want more precision? Use more independent hash functions, and combine the results
          • Median = only get powers of two
          • Mean = subject to skew
          • Median of means of groups works well in practice
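The intuition above can be made slightly more precise with a short calculation. This is a sketch under the assumption that the hash behaves like a uniformly random bit string; the symbols $n$ (number of distinct users) and $r$ (number of trailing zeros) are notation chosen here, not taken from the slides.

```latex
\begin{align*}
\Pr[\text{one hash ends in at least } r \text{ zeros}] &= 2^{-r} \\
\Pr[\text{none of } n \text{ distinct hashes ends in at least } r \text{ zeros}]
  &= \left(1 - 2^{-r}\right)^{n} \approx e^{-n / 2^{r}}
\end{align*}
```

When $n \gg 2^{r}$ this probability is close to 0, so some hash almost certainly reaches $r$ trailing zeros; when $n \ll 2^{r}$ it is close to 1, so reaching $r$ is unlikely. The largest $r$ observed therefore tends to satisfy $2^{r} \approx n$.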
  • 15. Example 3: Counting unique users
      • Input: stream of (user_id, action, timestamp)
      • Want to know how many distinct users are seen over a time period
      Flajolet-Martin, all together:
          arr = int[k]
          for each item e:
              for i in 0...k-1:
                  z = trailing_zeros(hash_i(e))
                  if z > arr[i]:
                      arr[i] = z
          means = group_means(arr)
          median = median(means)
          return pow(2, median)
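A runnable Python sketch of the pseudocode above, under several assumptions: the k hash functions are simulated by salting BLAKE2b with the index i, the groups are consecutive slices of fixed size, and the class and helper names (`FlajoletMartin`, `hash_i`, `trailing_zeros`) are hypothetical. It follows the slide in averaging the zero counts within each group before taking the median and raising 2 to that power; it makes no claim about the accuracy bounds in the referenced material.

```python
import hashlib
from statistics import mean, median

def trailing_zeros(x: int, width: int = 64) -> int:
    """Count trailing zero bits of a non-negative integer (width if x == 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1  # x & -x isolates the lowest set bit

def hash_i(i: int, item: str) -> int:
    """i-th hash function, simulated by salting BLAKE2b with i (an assumption)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=i.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")

class FlajoletMartin:
    def __init__(self, num_hashes: int = 64, group_size: int = 8):
        self.max_zeros = [0] * num_hashes  # arr = int[k]
        self.group_size = group_size

    def add(self, item: str) -> None:
        for i in range(len(self.max_zeros)):
            z = trailing_zeros(hash_i(i, item))
            if z > self.max_zeros[i]:
                self.max_zeros[i] = z

    def estimate(self) -> float:
        groups = [self.max_zeros[j:j + self.group_size]
                  for j in range(0, len(self.max_zeros), self.group_size)]
        means = [mean(g) for g in groups]  # group_means(arr)
        return 2 ** median(means)          # pow(2, median)

fm = FlajoletMartin()
for n in range(2000):
    fm.add(f"user-{n}")
print("estimated distinct users:", fm.estimate())
```

The state is a fixed array of k small integers regardless of how many users appear, which is the memory saving the slides are after.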
  • 16. Example 3: Counting unique users
      Flajolet-Martin in practice
      • Devil is in the details
      • Tunable precision
          • more hash functions = more precise
          • See the paper for bounds on precision
      • Tunable latency
          • more hash functions = higher latency
          • faster hash functions = lower latency
          • faster hash functions = more possibility of correlation = less precision
      Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
  • 17. Example 4: Counting Individual Item Frequencies
      Want to keep track of how many times each item has appeared in the stream
      Many applications:
          • How popular is each search term?
          • How many times has this hashtag been tweeted?
          • Which IP addresses are DDoS’ing me?
      Again, two obvious approaches:
          • In-memory hashmap of item → count
          • Database
      But can we be more clever?
  • 18. Example 4: Counting Individual Item Frequencies
      Want to keep track of how many times each item has appeared in the stream
      Idea:
          • Maintain array of counts
          • Hash each item, increment array at that index
      To check the count of an item, hash again and check array at that index
          • Over-estimates because of hash “collisions”
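The single-array idea above might look like this in Python. It is a sketch only: SHA-256 stands in for the unspecified hash function, and the class name `SingleRowCounter` and the tiny example stream are invented for illustration.

```python
import hashlib
from collections import Counter

class SingleRowCounter:
    """One fixed-size array of counts, indexed by a hash of the item."""

    def __init__(self, width: int = 1024):
        self.counts = [0] * width

    def _index(self, item: str) -> int:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.counts)

    def add(self, item: str) -> None:
        self.counts[self._index(item)] += 1

    def estimate(self, item: str) -> int:
        # Any other item hashing to the same slot is counted here too,
        # so the result can only be too high, never too low.
        return self.counts[self._index(item)]

stream = ["#python", "#scala", "#python", "#go", "#python", "#scala"]
counter = SingleRowCounter(width=4)  # deliberately small so collisions are likely
for tag in stream:
    counter.add(tag)

exact = Counter(stream)
for tag in exact:
    print(tag, "exact:", exact[tag], "estimate:", counter.estimate(tag))
```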
  • 19. Example 4: Counting Individual Item Frequencies
      Count-Min Sketch algorithm:
          • Maintain 2-d array of size w x d
          • Choose d different hash functions; each row in array corresponds to one hash function
          • Hash each item with every hash function, increment the appropriate position in each row
          • To query an item, hash it d times again, take the minimum value from all rows
  • 20. Example 4: Counting Individual Item Frequencies
      Want to keep track of how many times each item has appeared in the stream
      Count-Min Sketch, all together:
          arr = int[d][w]
          for each item e:
              for i in 0...d-1:
                  j = hash_i(e) mod w
                  arr[i][j]++

          def frequency(q):
              min = +infinity
              for i in 0...d-1:
                  j = hash_i(q) mod w
                  if arr[i][j] < min:
                      min = arr[i][j]
              return min
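A compact Python version of the Count-Min Sketch pseudocode, offered as a sketch under assumptions: the d hash functions are simulated by salting BLAKE2b with the row index, and the class name, default sizes, and example stream are illustrative rather than taken from the slides or the paper.

```python
import hashlib
from collections import Counter

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width   # w: counters per row
        self.depth = depth   # d: number of rows / hash functions
        self.table = [[0] * width for _ in range(depth)]  # arr = int[d][w]

    def _index(self, row: int, item: str) -> int:
        """Column for `item` in `row`; salting by row simulates d hash functions."""
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def frequency(self, item: str) -> int:
        # Every row over-estimates, so the minimum is the tightest available bound.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

# Invented stream of search terms with a skewed distribution.
stream = ["term-%d" % (n % 50) for n in range(5000)] + ["hot"] * 3000
cms = CountMinSketch(width=64, depth=4)
for term in stream:
    cms.add(term)

exact = Counter(stream)
print("hot:", cms.frequency("hot"), "exact:", exact["hot"])
```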
  • 21. Example 4: Counting Individual Item Frequencies
      Count-Min Sketch in practice
      • Devil is in the details
      • Tunable precision
          • Bigger array = more precise
          • See the paper for bounds on precision
      • Tunable latency
          • more hash functions = higher latency
      • Better at estimating more frequent items
      • Can subtract out estimation of collisions
      Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
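For a sense of what "tunable precision" means in numbers, the bound usually quoted for the Count-Min Sketch is: with width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉, an estimate exceeds the true count by more than ε·N (N being the total number of items counted) with probability at most δ. The helper below simply evaluates those two formulas; the function name `cms_dimensions` and the example values of ε and δ are assumptions for illustration, and the paper remains the authority on the exact conditions.

```python
import math

def cms_dimensions(epsilon: float, delta: float) -> tuple:
    """Width and depth for additive error epsilon * N with failure probability delta."""
    width = math.ceil(math.e / epsilon)      # w = ceil(e / epsilon)
    depth = math.ceil(math.log(1 / delta))   # d = ceil(ln(1 / delta))
    return width, depth

width, depth = cms_dimensions(epsilon=0.001, delta=0.01)
print("width:", width, "depth:", depth, "counters:", width * depth)
```

The additive error scales with the total stream length N, which is one way to read the slide's remark that the sketch is better at estimating the more frequent items.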
  • 22. Questions?
      • Feel free to reach out
          • www.thinkbiganalytics.com
          • joe.kelley@thinkbiganalytics.com
          • www.slideshare.net/jfkelley1
      • References:
          • http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
          • http://infolab.stanford.edu/~ullman/mmds.html
      We’re hiring! Engineers and Data Scientists