Efficient Data Stream Classification via Probabilistic Adaptive Windows

This document discusses efficient data stream classification using probabilistic adaptive windows. It introduces data streams: potentially infinite sequences of high-speed data that must be processed in real time with limited memory. It then describes the probabilistic approximate window (PAW) algorithm, which maintains a sample of data instances in logarithmic memory by giving greater weight to newer instances. The document evaluates several data stream classification methods on real and synthetic data streams. In the summary table, k-nearest neighbours with PAW reaches an accuracy of 82.92 at 8.08 RAM-Hours, close to the accuracy of the bagging ensembles (83.75 and 84.67) at a small fraction of their RAM-Hours (300.02 and 250.98).

Efficient Data Stream Classification via Probabilistic Adaptive Windows
Albert Bifet (1), Jesse Read (2), Bernhard Pfahringer (3), Geoff Holmes (3)
(1) Yahoo! Research Barcelona
(2) Universidad Carlos III, Madrid, Spain
(3) University of Waikato, Hamilton, New Zealand
SAC 2013, 19 March 2013

Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed, it is discarded or archived
Big Data & Real Time

Data Streams
Approximation algorithms
Small error rate with high probability
An algorithm $(\epsilon, \delta)$-approximates $F$ if it outputs $\tilde{F}$ for which
$\Pr[\,|\tilde{F} - F| > \epsilon F\,] < \delta$.
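As a rough illustration of this definition, the sketch below measures the failure probability Pr[|F~ - F| > eps * F] empirically for a randomized estimator. The helper names `failure_rate` and `sampled_count`, the sampling estimator itself, and all parameter values are assumptions chosen for illustration; none of them come from the slides.

```python
import random


def failure_rate(estimator, true_value, eps, trials, rng):
    """Fraction of trials in which |estimate - F| > eps * F."""
    failures = 0
    for _ in range(trials):
        estimate = estimator(rng)
        if abs(estimate - true_value) > eps * true_value:
            failures += 1
    return failures / trials


def sampled_count(n, p):
    """Hypothetical estimator: count n events by keeping each one with
    probability p and scaling the kept count by 1/p."""
    def estimator(rng):
        kept = sum(1 for _ in range(n) if rng.random() < p)
        return kept / p
    return estimator


rng = random.Random(42)
rate = failure_rate(sampled_count(n=2000, p=0.5),
                    true_value=2000, eps=0.1, trials=200, rng=rng)
print(f"empirical failure rate: {rate:.3f}")
```

If the measured rate stays below a chosen delta, the estimator behaves like an (eps, delta)-approximation on this input; an empirical check of this kind is only suggestive and does not replace a proof.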

Data Stream Sliding Window
Sampling algorithms
Giving equal weight to old and new examples: RESERVOIR SAMPLING
Giving more weight to recent examples: PROBABILISTIC APPROXIMATE WINDOW
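For comparison with the recency-weighted window, here is a minimal sketch of classic reservoir sampling (often called Algorithm R), which keeps a uniform sample of k items from a stream of unknown length. The function name `reservoir_sample` and its parameters are illustrative choices, not taken from the slides.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly at random from an iterable stream."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # The n-th item replaces a random slot with probability k/n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(10_000), k=10, rng=random.Random(1))
print(sample)
```

Every item seen so far has the same chance of being in the sample, so old and new examples are weighted equally, which is the behaviour the slide contrasts with the probabilistic approximate window.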

8 Bits Counter
1 0 1 0 1 0 1 0
What is the largest number we can
store in 8 bits?

8 Bits Counter
[Plot: f(x) = log(1 + x)/ log(2), shown for x from 0 to 100 and again for x from 0 to 10]
f(0) = 0, f(1) = 1

8 Bits Counter
[Plot: f(x) = log(1 + x/30)/ log(1 + 1/30), shown for x from 0 to 10 and again for x from 0 to 100]
f(0) = 0, f(1) = 1
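The two curves can be read as one family of logarithmic maps from a true count x to a stored counter value. The sketch below assumes the general form f(x) = log(1 + x(b - 1)) / log(b), which reduces to the two slide formulas for b = 2 and b = 1 + 1/30. The base parameter `b` and the helper names `f` and `largest_count` are assumptions introduced here for illustration.

```python
import math


def f(x, b=2.0):
    """Counter value stored for a true count x (assumed general form)."""
    return math.log(1 + x * (b - 1)) / math.log(b)


def largest_count(c, b=2.0):
    """Inverse of f: the true count represented by counter value c."""
    return (b ** c - 1) / (b - 1)


for base in (2.0, 1 + 1 / 30):
    print(f"b = {base:.4f}: f(0) = {f(0, base):.1f}, f(1) = {f(1, base):.4f}, "
          f"count at c = 255: {largest_count(255, base):.3e}")
```

An 8-bit register holds counter values up to 255. Stored exactly, that is a count of 255; stored through f, the same register covers a far larger range, and a base closer to 1 gives up range in exchange for a finer resolution.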

8 Bits Counter
MORRIS APPROXIMATE COUNTING ALGORITHM
Init counter c ← 0
for every event in the stream
    do rand = random number between 0 and 1
       if rand < p
          then c ← c + 1
What is the largest number we can store in 8 bits?
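The pseudocode leaves p unspecified. A minimal runnable sketch follows, assuming the usual choice for Morris counting, p = b^(-c) for a base b > 1, with the estimate (b^c - 1)/(b - 1). The class name `MorrisCounter`, the `base` parameter and the seed are illustrative assumptions, not details given on the slide.

```python
import random


class MorrisCounter:
    """Approximate counter that stores only the small integer c."""

    def __init__(self, base=2.0, rng=None):
        self.base = base
        self.c = 0
        self.rng = rng or random.Random()

    def add(self):
        # Assumption: increment with probability p = base ** (-c).
        if self.rng.random() < self.base ** (-self.c):
            self.c += 1

    def estimate(self):
        # Estimate of the number of events seen so far.
        return (self.base ** self.c - 1) / (self.base - 1)


counter = MorrisCounter(base=2.0, rng=random.Random(7))
for _ in range(10_000):
    counter.add()
print(f"c = {counter.c}, estimate = {counter.estimate():.0f}")
```

Under this assumption the first event always sets c to 1, and with base 2 an 8-bit c (up to 255) can stand for counts up to about 2^255 - 1. The price is variance: a single base-2 counter can be off by a large factor, which is why a base closer to 1 is attractive when accuracy matters.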

PROBABILISTIC APPROXIMATE WINDOW
Init window w ← ∅
for every instance i in the stream
    do store the new instance i in window w
       for every instance j in the window
           do rand = random number between 0 and 1
              if rand > b^(−1)
                 then remove instance j from window w
PAW maintains a sample of instances in logarithmic memory, giving greater weight to newer instances
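A literal Python reading of the PAW pseudocode could look like the sketch below. The function name `paw_update`, the value b = 1.01 and the integer "instances" are assumptions for illustration; the slide does not state how b is chosen, and this version applies the removal test to the newly stored instance as well, which is one possible reading of the loop.

```python
import random


def paw_update(window, instance, b, rng):
    """Store the new instance, then keep each stored instance with
    probability 1/b (it is removed when rand > 1/b)."""
    window.append(instance)
    keep_prob = 1.0 / b
    return [j for j in window if not (rng.random() > keep_prob)]


rng = random.Random(0)
window = []
for i in range(5000):
    window = paw_update(window, i, b=1.01, rng=rng)

print(f"window size: {len(window)}, oldest kept: {window[0]}, newest kept: {window[-1]}")
```

An instance that arrived n steps ago survives with probability b^(-n), so the sample is dominated by recent instances while a few older ones remain. This direct version scans the whole window for every arrival; it is meant to show the sampling rule, not an efficient implementation.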

Experiments: Methods
Abbr.    Classifier                      Parameters
NB       Naive Bayes
HT       Hoeffding Tree
HTLB     Leveraging Bagging with HT      n = 10
kNN      k Nearest Neighbour             w = 1000, k = 10
kNNW     kNN with PAW                    w = 1000, k = 10
kNNWA    kNN with PAW+ADWIN              w = 1000, k = 10
kNNLBW   Leveraging Bagging with kNNW    n = 10
The methods we consider. Leveraging Bagging methods use n models. kNNWA empties its window (of max w) when drift is detected (using the ADWIN drift detector).
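To make the kNN entries concrete, the sketch below classifies a query point by majority vote among its k nearest neighbours in the current window. Euclidean distance, the (features, label) tuple layout and the name `knn_predict` are assumptions made for illustration; the slides give only the parameters w and k.

```python
import math
from collections import Counter


def knn_predict(window, x, k=10):
    """Predict the label of x from a window of (features, label) pairs."""
    if not window:
        return None  # nothing stored yet, so no prediction
    # Sort stored instances by Euclidean distance to the query point.
    nearest = sorted(window, key=lambda inst: math.dist(inst[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


window = [((0.0, 0.1), "a"), ((0.2, 0.0), "a"), ((0.1, 0.3), "a"),
          ((9.8, 10.0), "b"), ((10.1, 9.9), "b"), ((10.0, 10.2), "b")]
print(knn_predict(window, (0.5, 0.5), k=3))
print(knn_predict(window, (9.0, 9.5), k=3))
```

In a streaming setting the window would be whatever the sampling scheme currently holds, for example a fixed-size sliding window for kNN or a PAW sample for kNNW, and each labelled instance would be added to the window after it has been used for prediction.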

Experimental Evaluation
Table: The window size for kNN and corresponding performance.
Accuracy
               -w 100   -w 500   -w 1000   -w 5000
Real Avg.       77.88    77.78     79.59     78.23
Synth. Avg.     57.99    81.93     84.74     86.03
Overall Avg.    62.53    80.28     82.59     83.11
Results

Experimental Evaluation
Table: The window size for kNN and corresponding performance.
Time (seconds)
               -w 100   -w 500   -w 1000   -w 5000
Real Tot.         297      998      1754      7900
Synth. Tot.       371     1297      2313     10671
Overall Tot.      668     2295      4067     18570

Experimental Evaluation
Table: The window size for kNN and corresponding performance.
RAM Hours
               -w 100   -w 500   -w 1000   -w 5000
Real Tot.       0.007    0.082     0.269     5.884
Synth. Tot.     0.002    0.026     0.088     1.988
Overall Tot.    0.009    0.108     0.357     7.872

Experimental Evaluation
Table: Summary of Efficiency: Accuracy and RAM-Hours.
              NB      HT     HTLB     kNN    kNNW   kNNWA   kNNLBW
Accuracy   56.19   73.95    83.75   82.59   82.92   83.19    84.67
RAM-Hrs     0.02    1.57   300.02    0.36    8.08    8.80   250.98

Conclusions
Sampling algorithms for kNN
Giving equal weight to old and new examples: RESERVOIR SAMPLING
Giving more weight to recent examples: PROBABILISTIC APPROXIMATE WINDOW


Count-Distinct Problem

Kai Zhang

The document discusses count-distinct algorithms for estimating the cardinality of large data streams. It provides an overview of the history of count-distinct algorithms, from early linear counting approaches to modern algorithms like LogLog counting and HyperLogLog counting. The document then describes the basic ideas, algorithms, and implementations of LogLog counting and HyperLogLog counting. It analyzes the performance of these algorithms and discusses open issues like how to handle small and large cardinalities more accurately.

Model-counting Approaches For Nonlinear Numerical Constraints

Quoc-Sang Phan

This document summarizes research on using model counting approaches to analyze nonlinear numerical constraints that arise in applications like probabilistic inference, reliability analysis, and side-channel analysis. It presents two implementations of modular exponentiation with nonlinear constraints and evaluates the performance of various exact and approximate model counting tools on the path conditions extracted from symbolic execution. The results show that for small domains, brute force counting works best, while approximate model counting scales better to larger problems.

Big Data and Small Devices by Katharina Morik

BigMine

How can we learn from the data of small ubiquitous systems? Do we need to send the data to a server or cloud and do all learning there? Or can we learn on some small devices directly? Are smartphones small? Are navigation systems small? How complex is learning allowed to be in times of big data? What about graphical models? Can they be applied on small devices or even learned on restricted processors? Big data are produced by various sources. Most often, they are distributedly stored at computing farms or clouds. Analytics on the Hadoop Distributed File System (HDFS) then follows the MapReduce programming model. According to the Lambda architecture of Nathan Marz and James Warren, this is the batch layer. It is complemented by the speed layer, which aggregates and integrates incoming data streams in real time. When considering big data and small devices, obviously, we imagine the small devices being hosts of the speed layer, only. Analytics on the small devices is restricted by memory and computation resources. The interplay of streaming and batch analytics offers a multitude of configurations. In this talk, we discuss opportunities for using sophisticated models for learning spatio-temporal models. In particular, we investigate graphical models, which generate the probabilities for connected (sensor) nodes. First, we present spatio-temporal random fields that take as input data from small devices, are computed at a server, and send results to -possibly different — small devices. Second, we go even further: the Integer Markov Random Field approximates the likelihood estimates such that it can be computed on small devices. We illustrate our learning models by applications from traffic management.

Selective and incremental re-computation in reaction to changes: an exercise ...

Paolo Missier

Secure information aggregation in sensor networks

Aleksandr Yampolskiy

The document summarizes the paper "Secure Information Aggregation in Sensor Networks" which proposes a framework called aggregate-commit-prove for securely computing aggregation functions like median, min/max, counting distinct elements in sensor networks even if sensors or aggregators are compromised. It describes the sensor network model, attack model, and gives concrete sublinear protocols for computing specific aggregation functions that allow the base station to detect incorrect results with high probability.

Data_Structure_and_Algorithms_Lecture_1.ppt

ISHANAMRITSRIVASTAVA

The document discusses various topics related to algorithms including introduction to algorithms, algorithm design, complexity analysis, asymptotic notations, and data structures. It provides definitions and examples of algorithms, their properties and categories. It also covers algorithm design methods and approaches. Complexity analysis covers time and space complexity. Asymptotic notations like Big-O, Omega, and Theta notations are introduced to analyze algorithms. Examples are provided to find the upper and lower bounds of algorithms.

Similar to Efficient Data Stream Classification via Probabilistic Adaptive Windows (20)

Streaming multiscale anomaly detection

Mining Data Streams

Real-Time Data Mining for Event Streams

Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen

Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...

Opensample: A Low-latency, Sampling-based Measurement Platform for Software D...

ESAI-CEU-UCH solution for American Epilepsy Society Seizure Prediction Challenge

"An adaptive modular approach to the mining of sensor network ...

20110620 amst rdam_kpb

Fast detection of transformed data leaks[mithun_p_c]

SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS

DSD-INT 2018 Algorithmic Differentiation - Markus

Complex models in ecology: challenges and solutions

streamingalgo88585858585858585pppppp.pptx

Count-Distinct Problem

Model-counting Approaches For Nonlinear Numerical Constraints

Big Data and Small Devices by Katharina Morik

Selective and incremental re-computation in reaction to changes: an exercise ...

Secure information aggregation in sensor networks

Data_Structure_and_Algorithms_Lecture_1.ppt

1. Efficient Data Stream Classification via Probabilistic Adaptive Windows
Albert Bifet (1), Jesse Read (2), Bernhard Pfahringer (3), Geoff Holmes (3)
(1) Yahoo! Research Barcelona
(2) Universidad Carlos III, Madrid, Spain
(3) University of Waikato, Hamilton, New Zealand
SAC 2013, 19 March 2013

2. Data Streams
Big Data & Real Time

3. Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed, it is discarded or archived

4. Data Streams
Approximation algorithms
Small error rate with high probability
An algorithm $(\epsilon, \delta)$-approximates $F$ if it outputs $\tilde{F}$ for which $\Pr[|\tilde{F} - F| > \epsilon F] < \delta$.
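The definition above can be checked empirically: run the estimator several times and count how often it misses the relative-error band. The sketch below is only an illustration; the helper names and the list of estimates are made up for the example and do not come from the slides.

```python
def violates(estimate, truth, eps):
    """True when |estimate - truth| > eps * truth (the relative-error test)."""
    return abs(estimate - truth) > eps * truth


def empirical_failure_rate(estimates, truth, eps):
    """Fraction of runs whose estimate falls outside the eps band around truth."""
    estimates = list(estimates)
    return sum(violates(e, truth, eps) for e in estimates) / len(estimates)


# Hypothetical outputs of ten independent runs of some estimator of F = 100.
runs = [98, 103, 91, 100, 112, 99, 101, 95, 104, 100]
rate = empirical_failure_rate(runs, truth=100, eps=0.05)
```

An estimator would count as an (epsilon, delta)-approximation only if this failure rate stays below delta in the limit of many runs; ten runs are far too few to conclude anything and are used here only to keep the example small.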

5. Data Stream Sliding Window
Sampling algorithms
Giving equal weight to old and new examples: RESERVOIR SAMPLING
Giving more weight to recent examples: PROBABILISTIC APPROXIMATE WINDOW
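The slides name reservoir sampling without spelling it out. A minimal sketch of the classic single-pass version follows, assuming a fixed sample size k; the function name and parameters are chosen for this example only.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # The n-th item replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(10_000), k=10, rng=random.Random(42))
```

Every item seen so far has the same chance of being in the reservoir, which is why this scheme gives old and new examples equal weight.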

6. 8 Bits Counter
1 0 1 0 1 0 1 0
What is the largest number we can store in 8 bits?

7. 8 Bits Counter
What is the largest number we can store in 8 bits?

8. 8 Bits Counter
[Plot of f(x) = log(1 + x)/log(2) for x from 0 to 100]
f(0) = 0, f(1) = 1

9. 8 Bits Counter
[Plot of f(x) = log(1 + x)/log(2) for x from 0 to 10]
f(0) = 0, f(1) = 1

10. 8 Bits Counter
[Plot of f(x) = log(1 + x/30)/log(1 + 1/30) for x from 0 to 10]
f(0) = 0, f(1) = 1

11. 8 Bits Counter
[Plot of f(x) = log(1 + x/30)/log(1 + 1/30) for x from 0 to 100]
f(0) = 0, f(1) = 1
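One way to read these curves is as a map from a true count x to the value f(x) actually stored, so that the largest count an 8-bit register can represent is the inverse of f at 255. The sketch below works under that reading; the parameter name `a` (1 for the first pair of plots, 30 for the second) is introduced here for illustration.

```python
import math


def f(x, a=30.0):
    """Stored value for true count x: log(1 + x/a) / log(1 + 1/a)."""
    return math.log(1 + x / a) / math.log(1 + 1 / a)


def f_inverse(v, a=30.0):
    """True count represented by stored value v (inverse of f)."""
    return a * ((1 + 1 / a) ** v - 1)


# Largest count reachable with the maximum 8-bit stored value, for both curves.
max_count_a1 = f_inverse(255, a=1.0)    # f(x) = log(1 + x) / log(2)
max_count_a30 = f_inverse(255, a=30.0)  # f(x) = log(1 + x/30) / log(1 + 1/30)
```

With a = 1 the range is enormous but the resolution is coarse; a larger a, such as 30, gives up range in exchange for finer steps between representable counts.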

12. 8 bits Counter
MORRIS APPROXIMATE COUNTING ALGORITHM
Init counter c ← 0
for every event in the stream
    do rand = random number between 0 and 1
       if rand < p
          then c ← c + 1
What is the largest number we can store in 8 bits?

13. 8 bits Counter
MORRIS APPROXIMATE COUNTING ALGORITHM (pseudocode as on slide 12)
With $p = 1/2$ we can store $2 \times 256$ with standard deviation $\sigma = \sqrt{n}/2$

14. 8 bits Counter
MORRIS APPROXIMATE COUNTING ALGORITHM (pseudocode as on slide 12)
With $p = 2^{-c}$, $E[2^c] = n + 2$ with variance $\sigma^2 = n(n + 1)/2$

15. 8 bits Counter
MORRIS APPROXIMATE COUNTING ALGORITHM (pseudocode as on slide 12)
If $p = b^{-c}$, then $E[b^c] = n(b - 1) + b$ and $\sigma^2 = (b - 1)n(n + 1)/2$
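The pseudocode can be sketched in Python as follows. This is an illustration, not the authors' implementation: the class name and the `estimate` method are assumptions. The counter starts at 0 as in the pseudocode, in which case the expectation of b^c works out to n(b - 1) + 1, so the estimate inverts that expression; the constants quoted on the slides appear to correspond to a counter started at 1.

```python
import random


class MorrisCounter:
    """Approximate counter: increments c with probability base ** (-c)."""

    def __init__(self, base=2.0, rng=None):
        self.base = base
        self.c = 0  # small stored value; grows roughly like log_base(n)
        self.rng = rng or random.Random()

    def add(self):
        # Increment less and less often as c grows.
        if self.rng.random() < self.base ** (-self.c):
            self.c += 1

    def estimate(self):
        # Unbiased for a counter started at 0: E[base ** c] = n * (base - 1) + 1.
        return (self.base ** self.c - 1) / (self.base - 1)


# Average many independent counters to see that the estimate is centred on n.
rng = random.Random(7)
estimates = []
for _ in range(200):
    counter = MorrisCounter(base=2.0, rng=rng)
    for _ in range(1000):
        counter.add()
    estimates.append(counter.estimate())
mean_estimate = sum(estimates) / len(estimates)
```

A single counter is noisy, which is what the variance formulas on the slides quantify; a base closer to 1 reduces the variance at the cost of a larger stored value c.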

16. PROBABILISTIC APPROXIMATE WINDOW
Init window w ← ∅
for every instance i in the stream
    do store the new instance i in window w
       for every instance j in the window
           do rand = random number between 0 and 1
              if rand > $b^{-1}$
                 then remove instance j from window w
PAW maintains a sample of instances in logarithmic memory, giving greater weight to newer instances
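A literal Python reading of the PAW pseudocode is sketched below. The function name is hypothetical, and the value b = 1.01 is picked arbitrarily for the demonstration; the slides do not say how b should be set.

```python
import random


def paw_update(window, instance, b, rng):
    """Store the new instance, then keep each stored instance with probability 1/b."""
    window.append(instance)
    keep_prob = 1.0 / b
    # An instance is removed when rand > 1/b, so it survives when rand <= 1/b.
    window[:] = [x for x in window if rng.random() <= keep_prob]


rng = random.Random(0)
window = []
for i in range(10_000):
    paw_update(window, i, b=1.01, rng=rng)
```

Because every stored instance faces the same removal test at each step, older instances have survived more tests and are therefore rarer in the window, which is how the sample ends up weighted toward recent data.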

17. Experiments: Methods

Abbr.    Classifier                      Parameters
NB       Naive Bayes
HT       Hoeffding Tree
HTLB     Leveraging Bagging with HT      n = 10
kNN      k Nearest Neighbour             w = 1000, k = 10
kNNW     kNN with PAW                    w = 1000, k = 10
kNNWA    kNN with PAW+ADWIN              w = 1000, k = 10
kNNLBW   Leveraging Bagging with kNNW    n = 10

The methods we consider. Leveraging Bagging methods use n models. kNNWA empties its window (of max w) when drift is detected (using the ADWIN drift detector).
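For orientation, the plain kNN baseline with a window of the last w instances can be sketched as below. This is a minimal illustration with Euclidean distance and majority voting, both assumed here; the class and method names are hypothetical and this is not the implementation used in the experiments.

```python
import math
from collections import Counter, deque


class WindowKNN:
    """k-nearest-neighbour classifier over the most recent w labelled instances."""

    def __init__(self, k=10, w=1000):
        self.k = k
        self.window = deque(maxlen=w)  # oldest instance drops out automatically

    def learn(self, x, y):
        self.window.append((tuple(x), y))

    def predict(self, x):
        if not self.window:
            return None
        x = tuple(x)
        # Sort stored instances by distance to x and vote among the k closest.
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], x))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# With w = 5, the ten older "a" points are forgotten once five "b" points arrive.
clf = WindowKNN(k=3, w=5)
for i in range(10):
    clf.learn((i * 0.01, 0.0), "a")
for i in range(5):
    clf.learn((1.0 + i * 0.01, 1.0), "b")
prediction = clf.predict((0.0, 0.0))
```

The kNNW variant in the table would replace the fixed-size deque with a PAW-maintained sample, and kNNWA would additionally clear the window when a drift detector fires.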

18. Experimental Evaluation
Table: The window size for kNN and corresponding performance.

Accuracy        -w 100    -w 500    -w 1000    -w 5000
Real Avg.        77.88     77.78      79.59      78.23
Synth. Avg.      57.99     81.93      84.74      86.03
Overall Avg.     62.53     80.28      82.59      83.11

19. Experimental Evaluation
Table: The window size for kNN and corresponding performance.

Time (seconds)  -w 100    -w 500    -w 1000    -w 5000
Real Tot.          297       998       1754       7900
Synth. Tot.        371      1297       2313      10671
Overall Tot.       668      2295       4067      18570

20. Experimental Evaluation
Table: The window size for kNN and corresponding performance.

RAM Hours       -w 100    -w 500    -w 1000    -w 5000
Real Tot.        0.007     0.082      0.269      5.884
Synth. Tot.      0.002     0.026      0.088      1.988
Overall Tot.     0.009     0.108      0.357      7.872

21. Experimental Evaluation
Table: Summary of Efficiency: Accuracy and RAM-Hours.

            NB      HT      HTLB     kNN     kNNW    kNNWA   kNNLBW
Accuracy    56.19   73.95   83.75    82.59   82.92   83.19   84.67
RAM-Hrs     0.02    1.57    300.02   0.36    8.08    8.80    250.98
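One way to read the summary table is to ask which methods are not beaten on both accuracy and RAM-Hours at once. The sketch below computes that set from the numbers in the table; the function name is hypothetical, and "kNNLBW" is used as the key for the entry that appears as "kNNLB W" in the extracted text.

```python
# (accuracy, RAM-Hours) per method, copied from the summary table.
results = {
    "NB": (56.19, 0.02),
    "HT": (73.95, 1.57),
    "HTLB": (83.75, 300.02),
    "kNN": (82.59, 0.36),
    "kNNW": (82.92, 8.08),
    "kNNWA": (83.19, 8.80),
    "kNNLBW": (84.67, 250.98),
}


def pareto_front(results):
    """Methods for which no other method is at least as accurate and as cheap, and better on one."""
    front = []
    for name, (acc, ram) in results.items():
        dominated = any(
            other != name
            and o_acc >= acc
            and o_ram <= ram
            and (o_acc > acc or o_ram < ram)
            for other, (o_acc, o_ram) in results.items()
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(results)
```

On these figures both Hoeffding Tree variants are dominated: HT by plain kNN, and HTLB by the leveraging-bagging kNN ensemble, while every kNN variant stays on the front.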

22. Conclusions
Sampling algorithms for kNN
Giving equal weight to old and new examples: RESERVOIR SAMPLING
Giving more weight to recent examples: PROBABILISTIC APPROXIMATE WINDOW

23. Thanks!
