Efficient Data Stream Classification via
Probabilistic Adaptive Windows
Albert Bifet (1), Jesse Read (2),
Bernhard Pfahringer (3), Geoff Holmes (3)
(1) Yahoo! Research Barcelona
(2) Universidad Carlos III, Madrid, Spain
(3) University of Waikato, Hamilton, New Zealand
SAC 2013, 19 March 2013
Data Streams
Big Data & Real Time
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed
it is discarded or archived
Approximation algorithms
Small error rate with high probability
An algorithm (ε, δ)-approximates F if it outputs F̃ for which
Pr[|F̃ − F| > εF] < δ.
Data Stream Sliding Window
Sampling algorithms
Giving equal weight to old and new examples: RESERVOIR
SAMPLING
Giving more weight to recent examples: PROBABILISTIC
APPROXIMATE WINDOW
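The slides name reservoir sampling without showing it. As a minimal sketch, assuming the classic single-pass variant often called Algorithm R, the equal-weight sampler could look like this (the function name `reservoir_sample` and its parameters are hypothetical, not taken from the talk):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(item)      # the first k items fill the reservoir
        else:
            j = rng.randrange(t)        # uniform integer in [0, t)
            if j < k:
                reservoir[j] = item     # item t replaces a slot with probability k/t
    return reservoir

sample = reservoir_sample(range(1000), 10, random.Random(0))
print(sorted(sample))
```

Every item seen so far has the same chance of being in the sample, so old and new examples carry equal weight, which is the property the slide contrasts with the probabilistic approximate window.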
8 Bits Counter
1 0 1 0 1 0 1 0
What is the largest number we can
store in 8 bits?
[Plots: f(x) = log(1 + x)/log(2), shown for x from 0 to 100 and for x from 0 to 10]
f(0) = 0, f(1) = 1
[Plots: f(x) = log(1 + x/30)/log(1 + 1/30), shown for x from 0 to 10 and for x from 0 to 100]
f(0) = 0, f(1) = 1
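To make the plotted functions concrete, here is a small sketch that evaluates f and its inverse. The parameter name `a` is my own label for the constant (1 in the first pair of plots, 30 in the second), and the inverse is derived from the formula rather than stated on the slides. It shows how large a count an 8-bit value (at most 255) would stand for under each scale.

```python
import math

def f(x, a=1.0):
    """Counter value for a true count x: log(1 + x/a) / log(1 + 1/a)."""
    return math.log(1 + x / a) / math.log(1 + 1 / a)

def f_inverse(c, a=1.0):
    """True count represented by counter value c (algebraic inverse of f)."""
    return a * ((1 + 1 / a) ** c - 1)

for a in (1.0, 30.0):
    print(f"a = {a:g}: f(0) = {f(0, a):.1f}, f(1) = {f(1, a):.1f}, "
          f"count represented by c = 255: {f_inverse(255, a):.3g}")
```

A larger `a` flattens the curve near the origin: small counts are stored almost exactly, at the price of a smaller maximum representable count.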
8 bits Counter
MORRIS APPROXIMATE COUNTING ALGORITHM
Init counter c ← 0
for every event in the stream
    do rand = random number between 0 and 1
       if rand < p
          then c ← c + 1
What is the largest number we can
store in 8 bits?
With p = 1/2 we can store 2 × 256
with standard deviation σ = √n/2
With p = 2^(−c), E[2^c] = n + 2, with
variance σ² = n(n + 1)/2
If p = b^(−c), then E[b^c] = n(b − 1) + b,
σ² = (b − 1)n(n + 1)/2
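The counting scheme above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the class and function names are hypothetical, and the counter starts at c = 0 as in the pseudocode. With that start the standard derivation gives E[b^c] = n(b − 1) + 1, so the sketch uses (b^c − 1)/(b − 1) as its estimate; the slide's constant (+ b instead of + 1) corresponds to a counter that starts at 1.

```python
import random

class MorrisCounter:
    """Approximate counter: increment c with probability b**(-c)."""

    def __init__(self, b=2.0, rng=None):
        self.b = b
        self.c = 0                          # the small integer actually stored
        self.rng = rng or random.Random()

    def add(self):
        # p = b**(-c): the larger c gets, the rarer an increment becomes.
        if self.rng.random() < self.b ** (-self.c):
            self.c += 1

    def estimate(self):
        # Assumes c started at 0, so E[b**c] = n(b - 1) + 1.
        return (self.b ** self.c - 1) / (self.b - 1)

def mean_estimate(n, b, runs, seed=0):
    """Average the estimate over several independent counters fed n events each."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        counter = MorrisCounter(b, rng)
        for _ in range(n):
            counter.add()
        total += counter.estimate()
    return total / runs

print(mean_estimate(n=500, b=2.0, runs=2000))
```

A single counter is noisy, as the variance formula indicates; choosing b closer to 1 lowers the variance but makes c grow faster, so fewer events fit in 8 bits.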
PROBABILISTIC APPROXIMATE WINDOW
Init window w ← ∅
for every instance i in the stream
    do store the new instance i in window w
       for every instance j in the window
           do rand = random number between 0 and 1
              if rand > b^(−1)
                 then remove instance j from window w
PAW maintains a sample of instances
in logarithmic memory, giving greater
weight to newer instances
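A literal transcription of the PAW pseudocode into Python might look like the sketch below. The class name, the default b = 1.05 and the list-based storage are my assumptions; the paper's implementation may differ in detail. Under this literal reading, an instance that has been through k pruning passes is still present with probability b^(−k), which is what biases the sample towards recent instances.

```python
import random

class ProbabilisticApproximateWindow:
    """Sample of a stream in which each stored instance survives a step with probability 1/b."""

    def __init__(self, b=1.05, rng=None):
        self.b = b                      # hypothetical default; b > 1
        self.window = []
        self.rng = rng or random.Random()

    def add(self, instance):
        self.window.append(instance)    # store the new instance
        keep_prob = 1.0 / self.b
        # Remove instance j when rand > b**(-1); keep it otherwise.
        self.window = [j for j in self.window
                       if self.rng.random() <= keep_prob]

paw = ProbabilisticApproximateWindow(b=1.1, rng=random.Random(0))
for i in range(5000):
    paw.add(i)
print(len(paw.window), paw.window)
```

The printed window is short and dominated by indices close to 5000, the most recent arrivals.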
Experiments: Methods
Abbr.    Classifier                      Parameters
NB       Naive Bayes
HT       Hoeffding Tree
HTLB     Leveraging Bagging with HT      n = 10
kNN      k Nearest Neighbour             w = 1000, k = 10
kNNW     kNN with PAW                    w = 1000, k = 10
kNNWA    kNN with PAW+ADWIN              w = 1000, k = 10
kNNLBW   Leveraging Bagging with kNNW    n = 10
The methods we consider. Leveraging Bagging
methods use n models. kNNWA empties its
window (of max w) when drift is detected (using
the ADWIN drift detector).
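For orientation, the plain kNN baseline over a window of at most w instances could be sketched as follows. This is only an illustration: the class name, Euclidean distance and majority voting are my assumptions, and `reset()` merely stands in for what kNNWA does when ADWIN signals drift (the drift detector itself is not implemented here).

```python
from collections import Counter, deque
import math

class WindowedKNN:
    """k nearest neighbours over the most recent w labelled instances."""

    def __init__(self, w=1000, k=10):
        self.k = k
        self.window = deque(maxlen=w)   # the oldest instance is dropped when full

    def learn(self, x, y):
        self.window.append((x, y))

    def predict(self, x):
        if not self.window:
            return None                 # nothing seen yet
        nearest = sorted(self.window,
                         key=lambda item: math.dist(item[0], x))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    def reset(self):
        # Placeholder for the kNNWA behaviour: empty the window on detected drift.
        self.window.clear()

clf = WindowedKNN(w=4, k=3)
for x, y in [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]:
    clf.learn(x, y)
print(clf.predict((5, 5)))
```

Prediction scans the whole window, so time per example grows with w.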
Experimental Evaluation
Table: The window size for kNN and corresponding performance (accuracy).
               -w 100   -w 500   -w 1000   -w 5000
Real Avg.       77.88    77.78     79.59     78.23
Synth. Avg.     57.99    81.93     84.74     86.03
Overall Avg.    62.53    80.28     82.59     83.11
Results
Table: The window size for kNN and corresponding performance (time, seconds).
               -w 100   -w 500   -w 1000   -w 5000
Real Tot.         297      998      1754      7900
Synth. Tot.       371     1297      2313     10671
Overall Tot.      668     2295      4067     18570
Table: The window size for kNN and corresponding performance (RAM-Hours).
               -w 100   -w 500   -w 1000   -w 5000
Real Tot.       0.007    0.082     0.269     5.884
Synth. Tot.     0.002    0.026     0.088     1.988
Overall Tot.    0.009    0.108     0.357     7.872
Table: Summary of Efficiency: Accuracy and RAM-Hours.
              NB      HT    HTLB     kNN    kNNW   kNNWA   kNNLBW
Accuracy   56.19   73.95   83.75   82.59   82.92   83.19    84.67
RAM-Hrs     0.02    1.57  300.02    0.36    8.08    8.80   250.98
Conclusions
Sampling algorithms for kNN
Giving equal weight to old and new examples: RESERVOIR
SAMPLING
Giving more weight to recent examples: PROBABILISTIC
APPROXIMATE WINDOW
Thanks!

Communications Mining Series - Zero to Hero - Session 1
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 

Efficient Data Stream Classification via Probabilistic Adaptive Windows

  • 1. Efficient Data Stream Classification via Probabilistic Adaptive Windows. Albert Bifet (Yahoo! Research Barcelona), Jesse Read (Universidad Carlos III, Madrid, Spain), Bernhard Pfahringer and Geoff Holmes (University of Waikato, Hamilton, New Zealand). SAC 2013, 19 March 2013
  • 2. Data Streams Big Data & Real Time
  • 3. Data Streams: Sequence is potentially infinite. High amount of data: sublinear space. High speed of arrival: sublinear time per example. Once an element from a data stream has been processed, it is discarded or archived.
  • 4. Data Streams: Approximation algorithms. Small error rate with high probability. An algorithm $(\epsilon, \delta)$-approximates $F$ if it outputs $\tilde{F}$ for which $\Pr[|\tilde{F} - F| > \epsilon F] < \delta$.
  • 5. Data Stream Sliding Window: Sampling algorithms. Giving equal weight to old and new examples: RESERVOIR SAMPLING. Giving more weight to recent examples: PROBABILISTIC APPROXIMATE WINDOW.
  • 6. 8 Bits Counter 1 0 1 0 1 0 1 0 What is the largest number we can store in 8 bits?
  • 7. 8 Bits Counter What is the largest number we can store in 8 bits?
  • 8. 8 Bits Counter: [Plot] $f(x) = \log(1 + x)/\log(2)$ for $x$ from 0 to 100, with $f(0) = 0$, $f(1) = 1$
  • 9. 8 Bits Counter: [Plot] $f(x) = \log(1 + x)/\log(2)$ for $x$ from 0 to 10, with $f(0) = 0$, $f(1) = 1$
  • 10. 8 Bits Counter: [Plot] $f(x) = \log(1 + x/30)/\log(1 + 1/30)$ for $x$ from 0 to 10, with $f(0) = 0$, $f(1) = 1$
  • 11. 8 Bits Counter: [Plot] $f(x) = \log(1 + x/30)/\log(1 + 1/30)$ for $x$ from 0 to 100, with $f(0) = 0$, $f(1) = 1$
  • 12. 8 bits Counter: MORRIS APPROXIMATE COUNTING ALGORITHM
        Init counter c ← 0
        for every event in the stream
            do rand = random number between 0 and 1
               if rand < p
                  then c ← c + 1
      What is the largest number we can store in 8 bits?
  • 13. 8 bits Counter: MORRIS APPROXIMATE COUNTING ALGORITHM (same pseudocode as slide 12). With $p = 1/2$ we can store $2 \times 256$ with standard deviation $\sigma = \sqrt{n}/2$
  • 14. 8 bits Counter: MORRIS APPROXIMATE COUNTING ALGORITHM (same pseudocode as slide 12). With $p = 2^{-c}$, $E[2^c] = n + 2$ with variance $\sigma^2 = n(n + 1)/2$
  • 15. 8 bits Counter: MORRIS APPROXIMATE COUNTING ALGORITHM (same pseudocode as slide 12). If $p = b^{-c}$ then $E[b^c] = n(b - 1) + b$, $\sigma^2 = (b - 1)n(n + 1)/2$
  • 16. PROBABILISTIC APPROXIMATE WINDOW
        Init window w ← ∅
        for every instance i in the stream
            do store the new instance i in window w
               for every instance j in the window
                   do rand = random number between 0 and 1
                      if rand > $b^{-1}$
                         then remove instance j from window w
      PAW maintains a sample of instances in logarithmic memory, giving greater weight to newer instances.
  • 17. Experiments: Methods
        Abbr.     Classifier                     Parameters
        NB        Naive Bayes
        HT        Hoeffding Tree
        HTLB      Leveraging Bagging with HT     n = 10
        kNN       k Nearest Neighbour            w = 1000, k = 10
        kNNW      kNN with PAW                   w = 1000, k = 10
        kNNWA     kNN with PAW+ADWIN             w = 1000, k = 10
        kNNLB W   Leveraging Bagging with kNNW   n = 10
      The methods we consider. Leveraging Bagging methods use n models. kNNWA empties its window (of max w) when drift is detected (using the ADWIN drift detector).
  • 18. Experimental Evaluation (Results)
      Table: The window size for kNN and corresponding performance. Accuracy
                       −w 100   −w 500   −w 1000   −w 5000
        Real Avg.       77.88    77.78     79.59     78.23
        Synth. Avg.     57.99    81.93     84.74     86.03
        Overall Avg.    62.53    80.28     82.59     83.11
  • 19. Experimental Evaluation
      Table: The window size for kNN and corresponding performance. Time (seconds)
                       −w 100   −w 500   −w 1000   −w 5000
        Real Tot.         297      998      1754      7900
        Synth. Tot.       371     1297      2313     10671
        Overall Tot.      668     2295      4067     18570
  • 20. Experimental Evaluation
      Table: The window size for kNN and corresponding performance. RAM Hours
                       −w 100   −w 500   −w 1000   −w 5000
        Real Tot.       0.007    0.082     0.269     5.884
        Synth. Tot.     0.002    0.026     0.088     1.988
        Overall Tot.    0.009    0.108     0.357     7.872
  • 21. Experimental Evaluation
      Table: Summary of Efficiency: Accuracy and RAM-Hours.
                     NB      HT     HTLB     kNN    kNNW   kNNWA   kNNLB W
        Accuracy   56.19   73.95   83.75   82.59   82.92   83.19     84.67
        RAM-Hrs     0.02    1.57  300.02    0.36    8.08    8.80    250.98
  • 22. Conclusions: Sampling algorithms for kNN. Giving equal weight to old and new examples: RESERVOIR SAMPLING. Giving more weight to recent examples: PROBABILISTIC APPROXIMATE WINDOW.
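
The slides name reservoir sampling as the scheme that gives equal weight to old and new examples, but do not spell it out. A minimal sketch of the classic single-pass variant (often called Algorithm R) might look like the following; the function name `reservoir_sample` and the parameter `k` are illustrative choices, not taken from the talk.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration; not the authors' code.
    """
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream):
        if t < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number t+1 replaces a random slot with probability k/(t+1).
            j = rng.randint(0, t)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(10_000), k=10, rng=random.Random(42)))
```

Every item seen so far has the same chance of being in the reservoir, which is why this scheme does not favour recent examples.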
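
The Morris approximate counter from slides 12 to 15 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the class name `MorrisCounter`, the helper `mean_estimate`, and the `estimate` formula are introduced here. Because the pseudocode initialises c to 0, the expected value works out to E[b^c] = n(b − 1) + 1, so the sketch estimates n as (b^c − 1)/(b − 1); the slide's E[b^c] = n(b − 1) + b corresponds to a counter that starts at 1.

```python
import random


class MorrisCounter:
    """Approximate counter that increments c with probability base**(-c)."""

    def __init__(self, base=2.0, seed=None):
        if base <= 1.0:
            raise ValueError("base must be greater than 1")
        self.base = base
        self.c = 0  # stored exponent, as in "Init counter c <- 0"
        self._rng = random.Random(seed)

    def add(self):
        # p = base^(-c): the larger the counter, the rarer an increment.
        if self._rng.random() < self.base ** (-self.c):
            self.c += 1

    def estimate(self):
        # Assumes c started at 0, so E[base^c] = n(base - 1) + 1.
        return (self.base ** self.c - 1.0) / (self.base - 1.0)


def mean_estimate(n_events, base=2.0, trials=2000, seed=0):
    """Average the estimate over independent counters fed n_events each."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        counter = MorrisCounter(base, seed=rng.getrandbits(32))
        for _ in range(n_events):
            counter.add()
        total += counter.estimate()
    return total / trials


if __name__ == "__main__":
    print(mean_estimate(100))  # close to 100 on average
```

A single counter is noisy, which is what the variance formulas on the slides quantify. A base closer to 1 reduces the variance, but the exponent c then grows faster and fewer events fit in 8 bits.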
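
The Probabilistic Approximate Window pseudocode on slide 16 translates almost line for line into Python. The sketch below follows the slide literally: each arriving instance is stored, then every instance in the window, including the new one, is dropped when a fresh random number exceeds 1/b. The function name `paw_window` and the value of `b` in the usage example are assumptions made here; the slides do not state how b is chosen.

```python
import random


def paw_window(stream, b, rng=None):
    """Return the window left after feeding the whole stream through PAW.

    Hypothetical helper for illustration; b >= 1 is the base from slide 16.
    """
    if b < 1.0:
        raise ValueError("b must be at least 1")
    rng = rng or random.Random()
    keep_threshold = 1.0 / b  # b^(-1) in the pseudocode
    window = []
    for instance in stream:
        window.append(instance)  # store the new instance i in window w
        # Remove instance j when rand > b^(-1); otherwise keep it.
        window = [j for j in window if not rng.random() > keep_threshold]
    return window


if __name__ == "__main__":
    w = paw_window(range(10_000), b=1.01, rng=random.Random(7))
    print(len(w), w[:5], w[-5:])
```

An instance that arrived s steps ago has survived s independent removal tests, so its chance of still being in the window shrinks geometrically with age. This is how the window gives greater weight to newer instances.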