The document discusses data stream classification and algorithms for handling data streams. It begins with an introduction to data stream characteristics and challenges. It then discusses approximation algorithms for data streams, including maintaining statistics over sliding windows. Classification algorithms for data streams discussed include Naive Bayes classifiers, perceptrons, and Hoeffding trees, which are decision trees adapted for data streams using the Hoeffding bound inequality to determine the optimal split attribute.
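As a rough illustration (not the exact implementation used by any particular system), the Hoeffding bound that Hoeffding trees rely on can be computed as follows; the numbers and the threshold check against the gain gap of the two best attributes are illustrative assumptions about how a typical implementation applies it.

import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a random
    variable with the given range is within epsilon of the mean observed over
    n independent samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Decide whether the best split attribute is reliably better than the
# second best after n examples (values are illustrative).
n = 500                          # examples seen at this leaf
delta = 1e-7                     # allowed failure probability
info_gain_range = math.log2(2)   # range of information gain for 2 classes
epsilon = hoeffding_bound(info_gain_range, delta, n)
gain_best, gain_second = 0.25, 0.17
if gain_best - gain_second > epsilon:
    print("split on the best attribute")  # observed winner is the true winner w.h.p.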
Data science involves extracting insights from large volumes of data. It is an interdisciplinary field that uses techniques from statistics, machine learning, and other domains. The document provides examples of classification algorithms like k-nearest neighbors, naive Bayes, and perceptrons that are commonly used in data science to build models for tasks like spam filtering or sentiment analysis. It also discusses clustering, frequent pattern mining, and other machine learning concepts.
The document discusses real-time big data management and Apache Flink. It provides an overview of Apache Flink, including its architecture, components, and APIs for batch and streaming data processing. It also provides examples of word count programs in Java, Scala, and Java 8 that demonstrate how to write Flink programs for batch and streaming data.
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
STRIP: stream learning of influence probabilities (Albert Bifet)
This document presents a method called STRIP (Streaming Learning of Influence Probabilities) for learning influence probabilities between users in a social network from a streaming log of propagations. It describes three solutions: (1) storing the whole social graph in memory, (2) using min-wise independent hashing to estimate probabilities while using sublinear space, and (3) estimating probabilities only for the most active users to be more space efficient. Experimental results on a Twitter dataset showed these solutions provided good approximations while using reasonable memory and processing time.
Efficient Data Stream Classification via Probabilistic Adaptive Windows (Albert Bifet)
This document discusses efficient data stream classification using probabilistic adaptive windows. It introduces the concept of data streams which have potentially infinite sequences of high-speed data that must be processed in real-time with limited memory. It then describes the probabilistic approximate window (PAW) algorithm, which maintains a sample of data instances in logarithmic memory by giving greater weight to newer instances. The document evaluates several data stream classification methods on real and synthetic data streams and finds that k-nearest neighbors with PAW has higher accuracy and lower memory usage than other methods.
Efficient Online Evaluation of Big Data Stream Classifiers (Albert Bifet)
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
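A minimal sketch of the prequential (test-then-train) setting the paper starts from; the `model` object with `predict` and `learn` methods is a hypothetical interface, and the incremental accuracy below is the standard interleaved test-then-train estimate, not the paper's proposed methodology.

def prequential_accuracy(model, stream):
    """Interleaved test-then-train: each instance is first used for testing,
    then for training, so every example yields an out-of-sample prediction."""
    correct = total = 0
    for x, y in stream:            # stream yields (features, label) pairs
        if model.predict(x) == y:  # test first ...
            correct += 1
        total += 1
        model.learn(x, y)          # ... then train on the same instance
    return correct / total if total else 0.0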
Mining Frequent Closed Graphs on Evolving Data Streams (Albert Bifet)
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this talk we present a framework for studying graph pattern mining on time-varying streams and large datasets.
Leveraging Bagging for Evolving Data Streams (Albert Bifet)
The document presents new methods for leveraging bagging for evolving data streams. It discusses using randomization techniques like Poisson distributions for input data and random output codes to increase diversity among classifiers. Experimental results on data streams with concept drift show the proposed methods like Leveraging Bagging and Leveraging Bagging MC improve accuracy over baselines like Hoeffding Trees and Online Bagging, while methods like Leveraging Bagging ME reduce RAM-Hours usage. The paper aims to improve accuracy and resource usage for data stream mining under concept drift.
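A minimal sketch of the Poisson-based input randomization: online bagging weights each incoming example with Poisson(1), while Leveraging Bagging increases diversity by drawing from Poisson(lambda) with a larger lambda (6 is the commonly quoted default). The ensemble and model interfaces below are assumptions, not the paper's code.

import math
import random

def poisson(lam):
    # Knuth's method for sampling a Poisson(lam) variate.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= threshold:
            return k - 1

def train_on_instance(ensemble, x, y, lam=6):
    """Leveraging-Bagging-style update: each base learner sees the example a
    random (Poisson-distributed) number of times; lam=1 recovers online bagging."""
    for model in ensemble:
        weight = poisson(lam)
        for _ in range(weight):
            model.learn(x, y)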
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams (Albert Bifet)
This document discusses mining frequent closed unlabeled rooted trees in data streams. It introduces the problem of finding frequent closed trees in a data stream of unlabeled rooted trees. It describes some of the challenges of data streams, including that the sequence is potentially infinite, there is a high amount of data requiring sublinear space, and a high speed of arrival requiring sublinear time per example. The document outlines an approach using ADWIN, an adaptive sliding window algorithm, to detect concept drift and adapt the window size accordingly.
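A simplified illustration of the idea behind ADWIN's change test, not the bucket-based, logarithmic-memory implementation from the original paper: every split of the current window into an old and a recent part is checked, and the older part is dropped when the two sub-window means differ by more than a Hoeffding-style threshold. The threshold formula below is a simplification and the delta value is arbitrary.

import math

def adwin_like_shrink(window, delta=0.002):
    """Drop the oldest elements while some split of the window into old|recent
    parts shows a mean difference larger than eps_cut."""
    changed = True
    while changed and len(window) > 1:
        changed = False
        total = sum(window)
        for split in range(1, len(window)):
            n0, n1 = split, len(window) - split
            s0 = sum(window[:split])
            mu0, mu1 = s0 / n0, (total - s0) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)               # harmonic mean of sizes
            eps_cut = math.sqrt(math.log(4.0 / delta) / (2.0 * m))
            if abs(mu0 - mu1) > eps_cut:
                del window[:split]                         # forget the stale prefix
                changed = True
                break
    return window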
Fast Perceptron Decision Tree Learning from Evolving Data Streams (Albert Bifet)
The document proposes using perceptron learners at the leaves of Hoeffding decision trees to improve performance on data streams. It introduces a new evaluation metric called RAM-Hours that considers both time and memory usage. The authors empirically evaluate different classifier models, including Hoeffding trees with perceptron and naive Bayes learners at leaves, on several datasets. Results show that hybrid models like Hoeffding naive Bayes perceptron trees often provide the best balance of accuracy, time and memory usage.
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
Pitfalls in benchmarking data stream classification and how to avoid them (Albert Bifet)
This document discusses pitfalls in benchmarking data stream classification and proposes ways to avoid them. It analyzes the electricity market dataset, a popular benchmark, and finds that it exhibits temporal dependence that favors classifiers that simply predict the previous value. It introduces new evaluation metrics like kappa plus that account for temporal dependence by comparing to a "no change" classifier. It also proposes a temporally aware classifier called SWT that incorporates previous labels into its predictions. Experiments on electricity and forest cover datasets show SWT and the new metrics better capture classifier performance on temporally dependent streaming data.
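A sketch of how a kappa-style statistic can be computed against the "no change" (persistence) baseline argued for above. The formula is the generic kappa form with the persistence classifier supplying the reference accuracy; the paper's exact definition may differ in details.

def kappa_vs_no_change(y_true, y_pred):
    """Kappa computed against a classifier that always predicts the previous label.
    Values near 0 (or below) mean the classifier is no better than persistence."""
    n = len(y_true)
    p_model = sum(p == t for p, t in zip(y_pred, y_true)) / n
    p_persist = sum(y_true[i] == y_true[i - 1] for i in range(1, n)) / (n - 1)
    return (p_model - p_persist) / (1 - p_persist) if p_persist < 1 else 0.0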
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16 (MLconf)
The document discusses new techniques for improving the k-means clustering algorithm. It begins by describing the standard k-means algorithm and Lloyd's method. It then discusses issues with random initialization for k-means. It proposes using furthest point initialization (k-means++) as an improvement. The document also discusses parallelizing k-means initialization (k-means||) and using nearest neighbor data structures to speed up assigning points to clusters, which allows k-means to scale to many clusters. Experimental results show these techniques provide faster and higher quality clustering compared to standard k-means.
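A small sketch of the D²-weighted seeding behind k-means++: each new center is sampled with probability proportional to the squared distance to the nearest center chosen so far (furthest-point seeding is the deterministic, non-randomized variant). This is an illustrative implementation, not the one from the talk.

import random

def kmeans_pp_init(points, k):
    """Choose k initial centers from a list of points (tuples) using k-means++ seeding."""
    centers = [random.choice(points)]

    def d2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    while len(centers) < k:
        # Squared distance of every point to its nearest chosen center.
        dists = [min(d2(p, c) for c in centers) for p in points]
        total = sum(dists)
        r, acc = random.uniform(0, total), 0.0
        for p, d in zip(points, dists):
            acc += d
            if acc >= r:
                centers.append(p)
                break
    return centers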
Artificial intelligence and data stream mining (Albert Bifet)
Big Data and Artificial Intelligence have the potential to
fundamentally shift the way we interact with our surroundings. The
challenge of deriving insights from data streams has been recognized
as one of the most exciting and key opportunities for both academia
and industry. Advanced analysis of big data streams from sensors and
devices is bound to become a key area of artificial intelligence
research as the number of applications requiring such processing
increases. Dealing with the evolution over time of such data streams,
i.e., with concepts that drift or change completely, is one of the
core issues in stream mining. In this talk, I will present an overview
of data stream mining, industrial applications, open source tools, and
current challenges of data stream mining.
An overview of streaming algorithms: what they are, what general principles govern them, and how they fit into a big data architecture, along with four specific examples of streaming algorithms and their use cases.
Mining Big Data Streams with APACHE SAMOA (Albert Bifet)
In this talk, we present Apache SAMOA, an open-source platform for
mining big data streams with Apache Flink, Storm and Samza. Real time analytics is
becoming the fastest and most efficient way to obtain useful knowledge
from what is happening now, allowing organizations to react quickly
when problems appear or to detect new trends helping to improve their
performance. Apache SAMOA includes algorithms for the most common
machine learning tasks such as classification and clustering. It
provides a pluggable architecture that allows it to run not only on Apache
Flink but also on several other distributed stream processing engines such
as Storm and Samza.
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME (HONGJOO LEE)
A 45-minute talk about collecting home network performance measurements, analyzing and forecasting time series data, and building an anomaly detection system.
In this talk, we will go through the whole process of data mining and knowledge discovery. First we write a script to run a speed test periodically and log the metrics. Then we parse the log data, convert it into a time series, and visualize the data for a certain period.
Next we conduct some data analysis: finding trends, forecasting, and detecting anomalous data. Several statistical and deep learning techniques are used for the analysis, including ARIMA (Autoregressive Integrated Moving Average) and LSTM (Long Short-Term Memory).
TensorFrames: Google Tensorflow on Apache Spark (Databricks)
Presentation at Bay Area Spark Meetup by Databricks Software Engineer and Spark committer Tim Hunter.
This presentation covers how you can use TensorFrames with TensorFlow to do distributed computing on GPUs.
The document outlines the PAC-Bayesian bound for deep learning. It discusses how the PAC-Bayesian bound provides a generalization guarantee that depends on the KL divergence between the prior and posterior distributions over hypotheses. This allows the bound to account for factors like model complexity and noise in the training data, avoiding some limitations of other generalization bounds. The document also explains how the PAC-Bayesian bound can be applied to stochastic neural networks by placing distributions over the network weights.
Distributed GLM with H2O - Atlanta Meetup (Sri Ambati)
The document outlines a presentation about H2O's distributed generalized linear model (GLM) algorithm. The presentation includes sections about the company H2O.ai, an overview of the H2O software, a 30-minute section explaining H2O's distributed GLM in detail, a 15-minute GLM demo, and a question-and-answer period. The document provides background on H2O.ai and H2O, and outlines the topics that will be covered in the distributed GLM section, including the algorithm, input parameters, outputs, runtime costs, and best practices.
1) The document discusses algorithms for computing statistics like minimum, maximum, average over data streams using limited memory in a single pass. It covers algorithms for computing cardinality, heavy hitters, order statistics and histograms.
2) Cardinality can be estimated using the Flajolet-Martin algorithm, which tracks the position of the rightmost zero bit in a bitmap (a toy sketch is given after this list). Heavy hitters can be found using the Count-Min sketch. Order statistics like the median can be approximated using the Frugal and T-Digest algorithms. Wavelet-based approaches can be used to compute histograms over data streams.
3) The document provides high-level explanations of these streaming algorithms along with references for further reading, but does not go into implementation-level detail.
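As an illustration of the Flajolet-Martin estimator described above, here is a toy single-pass version: hash each item, record the position of its lowest set bit in a bitmap, and estimate the cardinality from the first position never observed. The hash choice and correction constant follow the classical description, but this is a sketch, not a production implementation.

import hashlib

def flajolet_martin(items, bits=32):
    """Estimate the number of distinct items in one pass with O(bits) memory."""
    bitmap = [0] * bits
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        # rho = position of the least significant 1-bit of the hash
        rho = 0
        while rho < bits and not (h >> rho) & 1:
            rho += 1
        if rho < bits:
            bitmap[rho] = 1
    # R = position of the first 0 in the bitmap
    R = next((i for i, b in enumerate(bitmap) if b == 0), bits)
    return (2 ** R) / 0.77351   # 0.77351 is the classical FM correction factor

print(flajolet_martin(range(10000)))  # rough estimate of 10000 distinct items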
1) The document outlines PAC-Bayesian bounds, which provide probabilistic guarantees on the generalization error of a learning algorithm.
2) PAC-Bayesian bounds relate the expected generalization error of the output distribution Q to the training error, number of samples, and KL divergence between the prior P and posterior Q distributions over hypotheses.
3) The bounds show that better generalization requires a smaller divergence between P and Q, meaning the training process should not alter the distribution of hypotheses too much. This provides insights into reducing overfitting in deep learning models.
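For reference, one standard (McAllester-style) form of the PAC-Bayesian bound being summarized is the following; the exact constants vary between statements of the theorem, so treat this as an assumed representative form. With probability at least 1 - delta over an i.i.d. sample of size n, for all posteriors Q:

\mathbb{E}_{h \sim Q}[R(h)] \;\le\; \mathbb{E}_{h \sim Q}[\hat{R}(h)] \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}

where R and R-hat denote the true and empirical risks, P the prior, and Q the posterior distribution over hypotheses.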
The document outlines the theory of domain adaptation. It discusses how the generalization bound from learning in a single domain does not apply when testing on a different target domain. The key challenges are the distance between the source and target features and the distance between their labeling functions. Domain adaptation aims to reduce these distances and provide a generalization bound by estimating these distances using a hypothesis trained on samples from both domains. An example approach is to find the hypothesis that minimizes the sum of source and target errors.
Scipy 2011 Time Series Analysis in Python (Wes McKinney)
1) The document discusses statsmodels, a Python library for statistical modeling that implements standard statistical models. It includes tools for linear regression, descriptive statistics, statistical tests, time series analysis, and more.
2) The talk provides an overview of using statsmodels for time series analysis, including descriptive statistics, autoregressive moving average (ARMA) models, vector autoregression (VAR) models, and filtering tools.
3) The discussion highlights the development of statsmodels and the need for integrated statistical data structures and user interfaces to make Python more competitive with R for data analysis and statistics.
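A short sketch of the kind of time-series workflow described above, written against the current statsmodels API; the talk predates this interface, so the class names and options below are assumptions about a modern equivalent rather than the 2011 code.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series with a trend plus noise (stand-in for real data).
idx = pd.date_range("2010-01-01", periods=120, freq="M")
series = pd.Series(np.linspace(0, 10, 120) + np.random.normal(0, 1, 120), index=idx)

model = ARIMA(series, order=(1, 1, 1))   # ARIMA(p, d, q)
result = model.fit()
print(result.summary())
print(result.forecast(steps=12))          # 12-step-ahead forecast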
The document discusses decision tree learning, which is a machine learning approach for classification that builds classification models in the form of a decision tree. It describes the ID3 algorithm, which is a popular method for generating a decision tree from a set of training data. The ID3 algorithm uses information gain as the splitting criterion to recursively split the training data into purer subsets based on the values of the attributes. It selects the attribute with the highest information gain to make decisions at each node in the tree. Entropy from information theory is used to measure the information gain, with the goal being to build a tree that best classifies the training instances into target classes. An example applying the ID3 algorithm to a tennis-playing dataset is provided to illustrate the algorithm.
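A small sketch of the entropy and information-gain computations at the heart of ID3, on the kind of categorical data the tennis example uses; the tiny dataset here is hypothetical.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute_index):
    """Entropy reduction from splitting (rows, labels) on one categorical attribute."""
    base = entropy(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return base - remainder

# Tiny illustrative dataset: [outlook, windy] -> play?
rows = [["sunny", "no"], ["sunny", "yes"], ["rain", "no"], ["rain", "yes"]]
labels = ["no", "no", "yes", "no"]
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))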
This document provides an overview of kernel methods for machine learning. It discusses the evolution of learning methods from perceptrons in the 1950s to kernel methods in the 1990s. Kernel methods embed data into a higher-dimensional Hilbert space to allow for linear classification of non-linear relationships. The kernel trick replaces the inner product in this space with a kernel function, avoiding the need to explicitly define the embedding. Common kernel functions include polynomial kernels and Gaussian RBF kernels. The document provides code examples of kernel ridge regression in Python and discusses applications of string kernels and normalization techniques.
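A compact sketch of kernel ridge regression with a Gaussian RBF kernel, in the spirit of the code examples the overview mentions; the data and hyperparameters are made up.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def fit_krr(X, y, lam=1e-2, gamma=1.0):
    """Closed-form dual solution: alpha = (K + lam * I)^-1 y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_krr(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

X = np.random.rand(50, 1) * 6
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(50)
alpha = fit_krr(X, y)
print(predict_krr(X, alpha, np.array([[1.5], [3.0]])))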
This document provides an introduction to probabilistic programming using PyMC3 and Edward. It discusses the differences between frequentist and Bayesian approaches. Bayesian inference accounts for prior beliefs and provides probabilities rather than binary outcomes. Markov chain Monte Carlo and variational inference are introduced as methods for approximating posterior distributions. Examples are given for Bayesian statistical analysis of coin toss data using these probabilistic programming tools.
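A minimal PyMC3-style coin-toss model of the kind the introduction describes; the prior, sampler settings, and variable names are illustrative assumptions rather than the original example code.

import numpy as np
import pymc3 as pm

tosses = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # 1 = heads (made-up data)

with pm.Model():
    p = pm.Beta("p", alpha=1, beta=1)                # uniform prior on the coin bias
    pm.Bernoulli("obs", p=p, observed=tosses)        # likelihood of the observed tosses
    trace = pm.sample(2000, tune=1000, cores=1)      # MCMC approximation of the posterior

print(trace["p"].mean())                             # posterior mean of the coin bias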
The document summarizes the Count-Min Sketch streaming algorithm. It uses a two-dimensional array and d independent hash functions to estimate item frequencies in a data stream using sublinear space. It works by incrementing the appropriate counters in each row when an item arrives. The estimated frequency of an item is the minimum value across the rows. Analysis shows that for an array width w proportional to 1/ε, the estimate will be within an additive error of ε times the total frequency with high probability.
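A compact Count-Min sketch implementation matching the description above (width about e/epsilon, depth about ln(1/delta)); the hash construction is a simple illustrative choice, not the one from the slides.

import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon=0.01, delta=0.01):
        self.w = math.ceil(math.e / epsilon)          # width controls additive error eps * N
        self.d = math.ceil(math.log(1.0 / delta))     # depth controls failure probability delta
        self.table = [[0] * self.w for _ in range(self.d)]

    def _hash(self, item, row):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # Minimum across rows: an overestimate by at most eps * total count, w.h.p.
        return min(self.table[row][self._hash(item, row)] for row in range(self.d))

cms = CountMinSketch()
for word in "to be or not to be".split():
    cms.add(word)
print(cms.estimate("to"), cms.estimate("question"))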
Isolation Forest is an anomaly detection algorithm that builds decision trees to isolate anomalies from normal data points. It works by constructing isolation trees on randomly selected sub-samples of the data, and computes an anomaly score based on the path length of each data point in the trees. The algorithm has linear time complexity and low memory requirements, making it scalable to large, high-dimensional datasets. Empirical experiments show Isolation Forest achieves high AUC scores comparable to other algorithms while using less processing time, especially as the number of trees increases. It is also effective at detecting anomalies in the presence of irrelevant attributes.
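A minimal scikit-learn usage example of Isolation Forest as summarized above; the data is synthetic and the parameters are library defaults, not values from the paper.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
normal = rng.normal(0, 1, size=(500, 2))          # dense cluster of normal points
outliers = rng.uniform(-6, 6, size=(10, 2))       # sparse, far-away anomalies
X = np.vstack([normal, outliers])

clf = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
clf.fit(X)
labels = clf.predict(X)            # -1 for anomalies, 1 for normal points
scores = clf.decision_function(X)  # lower scores = more anomalous (shorter path lengths)
print((labels == -1).sum(), "points flagged as anomalies")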
This document discusses applying deep learning techniques like variational autoencoders to cyber security and anomaly detection in network traffic. It notes that while deep learning has made progress in related areas, modeling categorical network flow data poses unique challenges. It proposes using variational inference with a Gumbel softmax relaxation to train a generative model on network flows in an unsupervised manner. The trained model could then be used for tasks like anomaly detection based on the model's predictions or a sample's reconstruction error.
A paper review. This presentation introduces Abductive Commonsense Reasoning, a paper published at ICLR 2020. In this paper, the authors use commonsense reasoning to generate plausible hypotheses. They create a new dataset, 'ART', and propose new models for the 'aNLI' and 'aNLG' tasks using BERT and GPT.
This document summarizes Robert Fry's presentation on computation and design of autonomous intelligent systems. It outlines a computational theory of intelligence based on defining questions and answers within a system. Key points include:
- Intelligent systems acquire information to make decisions to achieve goals.
- Questions are defined as sets of possible answers. Boolean algebra is used to represent questions and assertions.
- Probability and entropy theories are derived from this logical framework.
- A simple protozoan system is used to illustrate how a system maps information to decisions.
- Neural computation is modeled using this theory, with neurons posing questions and making optimal decisions.
- Hebbian learning allows neural systems to adapt optimally via dual-matching.
JAIST Summer School 2016, "Theories for Understanding the Brain", Lecture 04: Neural Networks and Neuroscience (hirokazutanaka)
This document summarizes key concepts from a lecture on neural networks and neuroscience:
- Single-layer neural networks like perceptrons can only learn linearly separable patterns, while multi-layer networks can approximate any function. Backpropagation enables training multi-layer networks.
- Recurrent neural networks incorporate memory through recurrent connections between units. Backpropagation through time extends backpropagation to train recurrent networks.
- The cerebellum functions similarly to a perceptron for motor learning and control. Its feedforward circuitry from mossy fibers to Purkinje cells maps to the layers of a perceptron.
The document summarizes several linear sorting algorithms, including bucket sort, counting sort, general bucket sort, and radix sort. Counting sort runs in O(n+k) time and O(k) space, where k is the range of integer keys, and is stable. Radix sort uses a stable sorting algorithm like counting sort to sort based on each digit of d-digit numbers, resulting in O(d(n+k)) time for sorting n numbers with d digits in the range [1,k].
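A short counting-sort sketch for integer keys in a known range, illustrating the stable O(n + k) behavior described above.

def counting_sort(values, k):
    """Stable counting sort for integers in the range [0, k]; O(n + k) time, O(k) extra space."""
    counts = [0] * (k + 1)
    for v in values:
        counts[v] += 1
    for i in range(1, k + 1):          # prefix sums give final positions
        counts[i] += counts[i - 1]
    output = [0] * len(values)
    for v in reversed(values):         # reverse pass keeps equal keys in original order
        counts[v] -= 1
        output[counts[v]] = v
    return output

print(counting_sort([4, 1, 3, 4, 0, 2, 1], k=4))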
This document provides an overview of Dirichlet processes and their applications. It begins with background on probability mass functions and density functions. It then discusses the probability simplex and the Dirichlet distribution. The Dirichlet process is defined as a distribution over distributions that allows modeling probability distributions over infinite sample spaces. An example application involves using Dirichlet processes to learn hierarchical morphology paradigms by modeling stems and suffixes as being generated independently from Dirichlet processes. References for further reading are also provided.
This document provides an overview of sorting algorithms including bubble sort, insertion sort, shellsort, and others. It discusses why sorting is important, provides pseudocode for common sorting algorithms, and gives examples of how each algorithm works on sample data. The runtime of sorting algorithms like insertion sort and shellsort are analyzed, with insertion sort having quadratic runtime in the worst case and shellsort having unknown but likely better than quadratic runtime.
The world is ever changing. As a result, many of the systems and phenomena we are interested in evolve over time, resulting in time-evolving datasets. Time series often display many interesting properties and levels of correlation. In this tutorial we introduce students to the use of Recurrent Neural Networks and LSTMs to model and forecast different kinds of time series.
GitHub: https://github.com/DataForScience/RNN
This document discusses Bayesian neural networks. It begins with an introduction to Bayesian inference and variational inference. It then explains how variational inference can be used to approximate the posterior distribution in a Bayesian neural network. Several numerical methods for obtaining the posterior distribution are covered, including Metropolis-Hastings, Hamiltonian Monte Carlo, and Stochastic Gradient Langevin Dynamics. Finally, it provides an example of classifying MNIST digits with a Bayesian neural network and analyzing model uncertainties.
This document discusses algorithms and their analysis. It begins by defining an algorithm and its key characteristics like being finite, definite, and terminating after a finite number of steps. It then discusses designing algorithms to minimize cost and analyzing algorithms to predict their performance. Various algorithm design techniques are covered like divide and conquer, binary search, and its recursive implementation. Asymptotic notations like Big-O, Omega, and Theta are introduced to analyze time and space complexity. Specific algorithms like merge sort, quicksort, and their recursive implementations are explained in detail.
The document discusses machine learning decision trees and the ID3 algorithm for constructing decision trees from training data. ID3 is a top-down, greedy search algorithm that uses information gain to select the attribute that best splits the training examples at each node, without backtracking. It recursively builds the tree by creating child nodes for each value of the selected attribute, then applies the same process to partition the examples at each child node.
Maximum likelihood estimation (MLE) is a technique for estimating parameters in a probabilistic model based on observed data. MLE finds the parameter values that maximize the likelihood function, or the probability of obtaining the observed data given the parameters. This involves writing the log likelihood function, taking its derivative with respect to the parameters, and solving for the parameter values that set the derivative to zero. MLE was demonstrated for Bernoulli, normal, Poisson, and Markov chain models using both theoretical examples and observed data. In practice, MLE provides a principled approach for learning probability distributions from samples.
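A small numerical illustration of the MLE recipe described above: for Bernoulli data the log-likelihood derivative gives p-hat = sample mean, and for Poisson data lambda-hat = sample mean. The datasets below are made up.

import numpy as np

bernoulli_sample = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
p_hat = bernoulli_sample.mean()       # MLE for the Bernoulli parameter

poisson_sample = np.array([2, 3, 1, 4, 2, 2, 5, 3])
lambda_hat = poisson_sample.mean()    # MLE for the Poisson rate

print(p_hat, lambda_hat)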
Monte Carlo methods use random sampling to solve quantitative problems. They were first used by Stanislaw Ulam and Nicholas Metropolis to solve non-random problems by transforming them into random forms. Monte Carlo simulations play a major role in experimental physics by designing experiments, evaluating potential outputs and risks, and validating results. Random numbers are generated using pseudorandom number generators or by transforming uniform random variables using probability distribution functions. The accuracy of Monte Carlo simulations improves as the number of samples increases, with the standard error declining in proportion to the inverse of the square root of the number of samples.
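A classic Monte Carlo example illustrating the 1/sqrt(N) behavior of the standard error: estimating pi by sampling points in the unit square.

import random

def estimate_pi(n):
    inside = sum(1 for _ in range(n)
                 if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return 4.0 * inside / n   # fraction inside the quarter circle, scaled to pi

for n in (100, 10_000, 1_000_000):
    print(n, estimate_pi(n))   # error shrinks roughly like 1/sqrt(n)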
This document provides an overview of clustering techniques. It defines clustering as grouping a set of similar objects into classes, with objects within a cluster being similar to each other and dissimilar to objects in other clusters. The document then discusses partitioning, hierarchical, and density-based clustering methods. It also covers mathematical elements of clustering like partitions, distances, and data types. The goal of clustering is to minimize a similarity function to create high similarity within clusters and low similarity between clusters.
This document provides an overview of machine learning concepts including:
1. Machine learning aims to create computer programs that improve with experience by learning from data. It involves tasks like classification, regression, and clustering.
2. Data comes in different types like text, numbers, images and is generated in massive quantities daily from sources like Google, Facebook, and sensors.
3. Machine learning algorithms are either supervised, using labeled training data, or unsupervised, using unlabeled data. Common supervised techniques are decision trees, neural networks, and support vector machines while clustering is a major unsupervised technique.
Apache Samoa: Mining Big Data Streams with Apache Flink (Albert Bifet)
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries, and frameworks.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection, and recommendations.
3) A key challenge addressed by SAMOA is how to perform distributed stream mining on high-volume, high-velocity data streams at low latency using approaches like Apache Flink that can scale to handle large, fast data.
This document provides an introduction to big data and MapReduce frameworks. It discusses:
- What big data is and examples of large datasets.
- An overview of MapReduce, including how it allows programmers to break problems into parallelizable map and reduce tasks.
- Details of how MapReduce frameworks like Apache Hadoop work, including distributed processing, fault tolerance, and the roles of mappers, reducers, and other components.
Multi-label Classification with Meta-labels (Albert Bifet)
The area of multi-label classification has rapidly developed in recent years. It has become widely known that the baseline binary relevance approach suffers from class imbalance and a restricted hypothesis space that negatively affects its predictive performance, and can easily be outperformed by methods which learn labels together. A number of methods have grown around the label powerset approach, which models label combinations together as class values in a multi-class problem. We describe the label-powerset-based solutions under a general framework of meta-labels. We provide theoretical justification for this framework which has been lacking, by viewing meta-labels as a hidden layer in an artificial neural network. We explain how meta-labels essentially allow a random projection into a space where non-linearities can easily be tackled with established linear learning algorithms. The proposed framework enables comparison and combination of related approaches to different multi-label problems. Indeed, we present a novel model in the framework and evaluate it empirically against several high-performing methods, with respect to predictive performance and scalability, on a number of datasets and evaluation metrics. Our deployment of an ensemble of meta-label classifiers obtains competitive accuracy for a fraction of the computation required by the current meta-label methods for multi-label classification.
Big Data is a new term used to identify datasets that we can not manage with current methodologies or data mining software tools due to their large size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary due to the volume, variability, and velocity, of such data.
Streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends helping to improve their performance. Evolving data streams are contributing to the growth of data created over the last few years. We now create the same quantity of data every two days as we created from the dawn of time up until 2003. Evolving data stream methods are becoming a low-cost, green methodology for real-time online prediction and analysis. We discuss the current and future trends of mining evolving data streams, and the challenges that the field will have to overcome during the next years.
The document outlines a tutorial on handling concept drift in machine learning. It discusses the challenges of concept drift when applying supervised learning algorithms to streaming data where the underlying data distribution changes over time. The tutorial aims to provide an integrated view of adaptive learning methods and how they can handle concept drift. It covers topics such as the problem of concept drift, techniques for handling drift, evaluating adaptive learning approaches, and applications that experience concept drift.
MOA is a framework for online learning from data streams. It is closely related to WEKA and includes tools for evaluation such as boosting, bagging, and Hoeffding Trees. MOA deals with evolving data streams and is easy to use and extend.
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos [Adaptive Data Mining and Learning Methods for Data Streams] (Albert Bifet)
This document discusses methods for mining data streams, which are potentially infinite sequences of data that change over time. It describes using the ADWIN algorithm, which is an adaptive sliding window technique without parameters, to extract information from data streams using few resources. It also covers mining massive data, where the amount of digital information created now exceeds available storage. Algorithmic efficiency is important for green computing approaches to efficiently using computing resources. The document provides an example of finding a missing number in an increasing sequence and using random sampling to find a number in the upper half of a sorted list using sublinear space and time.
Adaptive XML Tree Mining on Evolving Data Streams (Albert Bifet)
This document discusses mining frequent closed trees from XML data streams. It presents three algorithms for mining closed trees incrementally, with sliding windows, and adaptively using ADWIN to monitor change. Experimental results on real datasets show the adaptive approach using ADWIN achieves good accuracy while using limited memory.
Adaptive Learning and Mining for Data Streams and Frequent Patterns (Albert Bifet)
This document summarizes Albert Bifet's 2009 PhD dissertation on adaptive learning and mining for data streams and frequent patterns. It introduces the challenges of mining massive, evolving structured data streams in real-time. It describes the ADWIN algorithm for detecting concept drift in data streams and outlines methods for mining evolving tree data streams, including incremental, sliding window, adaptive and logarithmic relaxed support approaches.
Mining Implications from Lattices of Closed Trees (Albert Bifet)
The document proposes a method for extracting association rules from datasets of unlabeled trees by representing the trees with closure operators and propositional logic. It defines implicit rules as those that always hold for any dataset. The method was experimentally validated on a dataset, showing it can detect many implicit rules and extract a smaller set of high-confidence rules.
Kalman Filters and Adaptive Windows for Learning in Data Streams (Albert Bifet)
This document proposes combining the Kalman filter and the ADWIN (Adaptive Windowing) algorithm to create an algorithm called K-ADWIN for learning from data streams. The Kalman filter is used as an estimator of statistics from data streams, while ADWIN acts as a change detector to detect changes in the data distribution over time. This allows K-ADWIN to adaptively update its learning window size based on the detected changes, providing more accurate estimations that account for concept drift in non-stationary data streams. The document outlines the key components of K-ADWIN and provides experimental validation of its effectiveness.
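For intuition only, here is a one-dimensional Kalman filter estimating a slowly drifting mean from noisy observations, the kind of estimator role described for the Kalman filter in K-ADWIN. The noise parameters are illustrative, and the real system couples such an estimator with ADWIN's change detection.

def kalman_1d(observations, q=1e-4, r=0.1):
    """Scalar Kalman filter: q = process noise variance, r = measurement noise variance."""
    x, p = observations[0], 1.0           # initial state estimate and its variance
    estimates = [x]
    for z in observations[1:]:
        p = p + q                          # predict: uncertainty grows
        k = p / (p + r)                    # Kalman gain
        x = x + k * (z - x)                # update with the new observation
        p = (1 - k) * p
        estimates.append(x)
    return estimates

print(kalman_1d([1.0, 1.2, 0.9, 1.1, 3.0, 3.1, 2.9])[-1])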
5. introduction: data streams
Data Streams
• Sequence is potentially infinite
• High amount of data: sublinear space
• High speed of arrival: sublinear time per example
• Once an element from a data stream has been processed it is discarded or archived
Example
Puzzle: Finding Missing Numbers
• Let π be a permutation of {1,...,n}.
• Let π−1 be π with one element missing.
• π−1[i] arrives in increasing order
Task: Determine the missing number
• Naive solution: use an n-bit vector to memorize all the numbers (O(n) space)
• Data stream solution: O(log(n)) space. Store n(n+1)/2 − ∑_{j≤i} π−1[j]
4
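A small sketch of the streaming solution (illustrative, not from the slides): only the running sum is kept, so the memory needed is O(log n) bits.

// Minimal sketch: recover the missing number from a stream of n−1 distinct
// values out of {1,...,n} using only a running sum (O(log n) bits of state).
public class MissingNumber {
    public static long findMissing(long n, long[] stream) {
        long expected = n * (n + 1) / 2;    // sum of 1..n
        long seen = 0;                      // running sum of the stream
        for (long x : stream) {
            seen += x;
        }
        return expected - seen;             // the single missing value
    }

    public static void main(String[] args) {
        long[] stream = {1, 2, 3, 5, 6};    // 4 is missing from {1,...,6}
        System.out.println(findMissing(6, stream)); // prints 4
    }
}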
9. data streams
Data Streams
• Sequence is potentially infinite
• High amount of data: sublinear space
• High speed of arrival: sublinear time per example
• Once an element from a data stream has been processed it
is discarded or archived
Tools:
• approximation
• randomization, sampling
• sketching
5
10. data streams
Approximation algorithms
• Small error rate with high probability
• An algorithm (ε,δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ
5
11. data streams approximation algorithms
1011000111 1010101
Sliding Window
We can maintain simple statistics over sliding windows, using O((1/ε) log² N) space, where
• N is the length of the sliding window
• ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani.
Maintaining stream statistics over sliding windows. 2002
6
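As an illustration of how such sliding-window statistics can be kept in sublinear space, here is a simplified sketch in the spirit of the exponential histogram of Datar et al.: it approximately counts the 1s among the last N bits, keeping at most two buckets per power-of-two size (a coarser setting than the O((1/ε) log² N) variant on the slide, so the guarantee here is only a 50% relative error).

import java.util.ArrayList;
import java.util.List;

// Simplified DGIM-style exponential histogram (sketch, not the paper's exact
// algorithm): approximate count of 1s among the last N bits of a 0/1 stream.
public class DgimCounter {
    // bucket = {timestamp of the most recent 1 it covers, size (a power of two)},
    // stored newest-first, so sizes are non-decreasing toward the end of the list.
    private final List<long[]> buckets = new ArrayList<>();
    private final long windowSize;
    private long time = 0;

    public DgimCounter(long windowSize) { this.windowSize = windowSize; }

    public void add(int bit) {
        time++;
        // Expire the oldest bucket once it slides completely out of the window.
        int last = buckets.size() - 1;
        if (last >= 0 && buckets.get(last)[0] <= time - windowSize) {
            buckets.remove(last);
        }
        if (bit == 1) {
            buckets.add(0, new long[]{time, 1});
            merge();
        }
    }

    // Whenever three consecutive buckets share a size, merge the two oldest of them.
    private void merge() {
        int i = 0;
        while (i + 2 < buckets.size()) {
            long s = buckets.get(i)[1];
            if (s == buckets.get(i + 1)[1] && s == buckets.get(i + 2)[1]) {
                long newerTs = buckets.get(i + 1)[0];   // more recent of the two oldest
                buckets.remove(i + 2);
                buckets.set(i + 1, new long[]{newerTs, 2 * s});
            } else {
                i++;
            }
        }
    }

    // Estimate: total size of all buckets minus half of the (possibly partial) oldest one.
    public double estimateOnes() {
        if (buckets.isEmpty()) return 0.0;
        double total = 0;
        for (long[] b : buckets) total += b[1];
        return total - buckets.get(buckets.size() - 1)[1] / 2.0;
    }

    public static void main(String[] args) {
        DgimCounter c = new DgimCounter(100);
        for (int i = 0; i < 1000; i++) c.add(i % 3 == 0 ? 1 : 0);
        // True count in the last 100 bits is 34; the estimate is within the error bound.
        System.out.println("approx. 1s in last 100 bits: " + c.estimateOnes());
    }
}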
18. classification
Definition
Given nC different classes, a classifier algorithm builds a model that predicts, for every unlabelled instance I, the class C to which it belongs, with high accuracy.
Example
A spam filter
Example
Twitter Sentiment analysis: analyze tweets with positive or
negative feelings
8
19. data stream classification cycle
1 Process an example at a time,
and inspect it only once (at
most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
9
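This cycle suggests a very small interface for stream learners. The sketch below is a hypothetical, simplified contract (in the spirit of, but not identical to, MOA's learner API); it only fixes the shape of the two operations the cycle requires.

// Hypothetical, simplified contract for a data stream classifier following the
// cycle above; a sketch, not MOA's actual API.
public interface StreamClassifier {
    /** Inspect one labelled example exactly once and update the model (bounded time and memory). */
    void trainOnInstance(double[] attributes, int classLabel);

    /** Be ready to predict at any point: return the predicted class index for an instance. */
    int predict(double[] attributes);
}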
20. classification
Data set that describes e-mail features for deciding if it is spam.
Example
Contains "Money" | Domain type | Has attach. | Time received | spam
yes              | com         | yes         | night         | yes
yes              | edu         | no          | night         | yes
no               | com         | yes         | night         | yes
no               | edu         | no          | day           | no
no               | com         | no          | day           | no
yes              | cat         | no          | day           | yes
Assume we have to classify the following new instance:
Contains "Money" | Domain type | Has attach. | Time received | spam
yes              | edu         | yes         | day           | ?
10
21. bayes classifiers
Naïve Bayes
• Based on Bayes' Theorem:
  P(c|d) = P(c) P(d|c) / P(d)
  posterior = prior × likelihood / evidence
• Estimates the probability of observing attribute a and the prior probability P(c)
• Probability of class c given an instance d:
  P(c|d) = P(c) ∏a∈d P(a|c) / P(d)
11
22. bayes classifiers
Multinomial Naïve Bayes
• Considers a document as a bag-of-words.
• Estimates the probability of observing word w and the prior probability P(c)
• Probability of class c given a test document d:
  P(c|d) = P(c) ∏w∈d P(w|c)^(n_wd) / P(d)
12
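A minimal incremental Naive Bayes sketch for categorical attributes (illustrative only; MOA's implementation differs). Counters are updated one example at a time, so it fits the stream setting; the add-one smoothing is an assumption of this sketch. The main method replays the e-mail table above.

import java.util.HashMap;
import java.util.Map;

// Hedged sketch: incremental Naive Bayes over categorical attributes.
public class StreamingNaiveBayes {
    private final Map<String, Integer> classCounts = new HashMap<>();
    // key = class + "|" + attributeIndex + "|" + attributeValue
    private final Map<String, Integer> attrCounts = new HashMap<>();
    private long total = 0;

    public void train(String[] attributes, String label) {
        total++;
        classCounts.merge(label, 1, Integer::sum);
        for (int i = 0; i < attributes.length; i++) {
            attrCounts.merge(label + "|" + i + "|" + attributes[i], 1, Integer::sum);
        }
    }

    public String predict(String[] attributes) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> e : classCounts.entrySet()) {
            String c = e.getKey();
            int nc = e.getValue();
            double score = Math.log((double) nc / total);          // log prior P(c)
            for (int i = 0; i < attributes.length; i++) {
                int count = attrCounts.getOrDefault(c + "|" + i + "|" + attributes[i], 0);
                score += Math.log((count + 1.0) / (nc + 2.0));     // smoothed log P(a|c)
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        StreamingNaiveBayes nb = new StreamingNaiveBayes();
        // The e-mail data set from the slide: contains "Money", domain, attachment, time.
        String[][] data = {
            {"yes", "com", "yes", "night"}, {"yes", "edu", "no", "night"},
            {"no", "com", "yes", "night"},  {"no", "edu", "no", "day"},
            {"no", "com", "no", "day"},     {"yes", "cat", "no", "day"}};
        String[] labels = {"yes", "yes", "yes", "no", "no", "yes"};
        for (int i = 0; i < data.length; i++) nb.train(data[i], labels[i]);
        // Predicted class for the new instance (yes, edu, yes, day).
        System.out.println(nb.predict(new String[]{"yes", "edu", "yes", "day"}));
    }
}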
26. perceptron
Perceptron Learning(Stream, η)
1  for each class
2    do Perceptron Learning(Stream, class, η)

Perceptron Learning(Stream, class, η)
1  ▷ Let w0 and w be randomly initialized
2  for each example (x, y) in Stream
3    do if class = y
4      then δ = (1 − hw(x)) · hw(x) · (1 − hw(x))
5      else δ = (0 − hw(x)) · hw(x) · (1 − hw(x))
6    w = w + η · δ · x

Perceptron Prediction(x)
1  return argmax_class hw_class(x)
14
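The pseudocode above translates into a short online learner. The sketch below is one possible reading of it: the sigmoid output hw(x) and one weight vector per class follow the pseudocode, while the explicit bias input and the random initialization range are implementation assumptions.

import java.util.Random;

// Hedged sketch of the one-vs-rest sigmoid perceptron update shown above.
public class StreamingPerceptron {
    private final double[][] w;   // one weight vector per class (last weight = bias)
    private final double eta;

    public StreamingPerceptron(int numClasses, int numAttributes, double eta, long seed) {
        this.eta = eta;
        this.w = new double[numClasses][numAttributes + 1];
        Random r = new Random(seed);
        for (double[] wc : w)
            for (int i = 0; i < wc.length; i++) wc[i] = r.nextDouble() * 0.01;
    }

    private double h(double[] wc, double[] x) {        // sigmoid output hw(x)
        double z = wc[wc.length - 1];                   // bias term
        for (int i = 0; i < x.length; i++) z += wc[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public void trainOnInstance(double[] x, int y) {
        for (int c = 0; c < w.length; c++) {
            double out = h(w[c], x);
            double target = (c == y) ? 1.0 : 0.0;
            double delta = (target - out) * out * (1 - out);    // as in the pseudocode
            for (int i = 0; i < x.length; i++) w[c][i] += eta * delta * x[i];
            w[c][w[c].length - 1] += eta * delta;               // bias input is 1
        }
    }

    public int predict(double[] x) {
        int best = 0;
        for (int c = 1; c < w.length; c++)
            if (h(w[c], x) > h(w[best], x)) best = c;
        return best;
    }

    public static void main(String[] args) {
        StreamingPerceptron p = new StreamingPerceptron(2, 1, 0.5, 7);
        for (int i = 0; i < 2000; i++) {
            double v = (i % 2 == 0) ? 0.0 : 1.0;
            p.trainOnInstance(new double[]{v}, (int) v);   // class equals the input
        }
        System.out.println(p.predict(new double[]{1.0}));  // should print 1
    }
}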
27. classification
Data set that describes e-mail features for deciding if it is spam.
Example
Contains "Money" | Domain type | Has attach. | Time received | spam
yes              | com         | yes         | night         | yes
yes              | edu         | no          | night         | yes
no               | com         | yes         | night         | yes
no               | edu         | no          | day           | no
no               | com         | no          | day           | no
yes              | cat         | no          | day           | yes
Assume we have to classify the following new instance:
Contains "Money" | Domain type | Has attach. | Time received | spam
yes              | edu         | yes         | day           | ?
15
28. classification
• Assume we have to classify the following new instance:
Contains "Money" | Domain type | Has attach. | Time received | spam
yes              | edu         | yes         | day           | ?
[Decision tree figure: Time = Night → YES; Time = Day → Contains "Money"? Yes → YES, No → NO]
15
29. decision trees
Basic induction strategy:
• A ← the “best” decision attribute for next node
• Assign A as decision attribute for node
• For each value of A, create new descendant of node
• Sort training examples to leaf nodes
• If training examples perfectly classified, Then STOP, Else
iterate over new leaf nodes
16
30. hoeffding trees
Hoeffding Tree: VFDT
Pedro Domingos and Geoff Hulten.
Mining high-speed data streams. 2000
• With high probability, constructs a model identical to the one a traditional (greedy) method would learn
• With theoretical guarantees on the error rate
[Decision tree figure: Time = Night → YES; Time = Day → Contains "Money"? Yes → YES, No → NO]
17
32. hoeffding bound inequality
Let X = ∑i Xi where X1,...,Xn are independent and identically distributed in [0,1]. Then
1  Chernoff: for each ε < 1,
   Pr[X > (1+ε)E[X]] ≤ exp(−ε²E[X]/3)
2  Hoeffding: for each t > 0,
   Pr[X > E[X]+t] ≤ exp(−2t²/n)
3  Bernstein: let σ² = ∑i σi² be the variance of X. If Xi − E[Xi] ≤ b for each i ∈ [n], then for each t > 0,
   Pr[X > E[X]+t] ≤ exp(−t²/(2σ² + (2/3)bt))
19
33. hoeffding tree or vfdt
HT(Stream, δ)
1  ▷ Let HT be a tree with a single leaf (root)
2  ▷ Init counts n_ijk at root
3  for each example (x, y) in Stream
4    do HTGrow((x, y), HT, δ)

HTGrow((x, y), HT, δ)
1  ▷ Sort (x, y) to leaf l using HT
2  ▷ Update counts n_ijk at leaf l
3  if examples seen so far at l are not all of the same class
4    then ▷ Compute G for each attribute
5      if G(Best Attr.) − G(2nd best) > sqrt(R² ln(1/δ) / (2n))
6        then ▷ Split leaf on best attribute
7          for each branch
8            do ▷ Start new leaf and initialize counts
20
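To make the split test concrete, here is a small, self-contained sketch of the bound computation and the decision rule (including the τ tie-break described on the next slide); the parameter values in main are illustrative.

// Hedged sketch of the split decision at a Hoeffding tree leaf: split on the best
// attribute once the observed gain advantage over the runner-up exceeds the
// Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)). R is the range of the gain measure
// and n the number of examples seen at the leaf.
public class HoeffdingSplit {
    public static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Returns true if the leaf should split on the best attribute. */
    public static boolean shouldSplit(double bestGain, double secondBestGain,
                                      double range, double delta, long n, double tau) {
        double eps = hoeffdingBound(range, delta, n);
        // Split when the advantage is significant, or when the bound is so small
        // that the two attributes are effectively tied (tie-breaking with τ).
        return (bestGain - secondBestGain > eps) || (eps < tau);
    }

    public static void main(String[] args) {
        double range = 1.0;   // log2(numClasses) = 1 for a two-class problem with information gain
        System.out.println(shouldSplit(0.25, 0.10, range, 1e-7, 1000, 0.05)); // prints true
    }
}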
35. hoeffding trees
HT features
• With high probability, constructs a model identical to the one a traditional (greedy) method would learn
• Ties: when two attributes have similar G, split if
  G(Best Attr.) − G(2nd best) < sqrt(R² ln(1/δ) / (2n)) < τ
• Compute G every n_min instances
• Memory: deactivate least promising nodes with lower p_l × e_l
  • p_l is the probability to reach leaf l
  • e_l is the error in the node
21
36. hoeffding naive bayes tree
Hoeffding Tree
Majority Class learner at leaves
Hoeffding Naive Bayes Tree
G. Holmes, R. Kirkby, and B. Pfahringer.
Stress-testing Hoeffding trees, 2005.
• monitors accuracy of a Majority Class learner
• monitors accuracy of a Naive Bayes learner
• predicts using the most accurate method
22
37. bagging
Example
Dataset of 4 Instances : A, B, C, D
Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C
Bagging builds a set of M base models, with a bootstrap
sample created by drawing random samples with
replacement.
23
38. bagging
Example
Dataset of 4 Instances : A, B, C, D
Classifier 1: A, B, B, C
Classifier 2: A, B, D, D
Classifier 3: A, B, B, C
Classifier 4: B, B, B, C
Classifier 5: A, C, C, D
Bagging builds a set of M base models, with a bootstrap
sample created by drawing random samples with
replacement.
23
39. bagging
Example
Dataset of 4 Instances : A, B, C, D
Classifier 1: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 2: A, B, D, D: A(1) B(1) C(0) D(2)
Classifier 3: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 4: B, B, B, C: A(0) B(3) C(1) D(0)
Classifier 5: A, C, C, D: A(1) B(0) C(2) D(1)
Each base model's training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution.
23
40. bagging
Figure 1: Poisson(1) Distribution.
Each base model's training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution; as the number of examples grows, this binomial tends to a Poisson(1) distribution, which is what online bagging uses.
23
41. oza and russell's online bagging for m models
1: Initialize base models h_m for all m ∈ {1,2,...,M}
2: for all training examples do
3:   for m = 1,2,...,M do
4:     Set w = Poisson(1)
5:     Update h_m with the current example with weight w
6: anytime output:
7: return hypothesis: h_fin(x) = argmax_{y∈Y} ∑_{t=1}^{T} I(h_t(x) = y)
24
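A hedged sketch of the online bagging loop above: each base model sees the current example k times, where k ~ Poisson(1). The BaseLearner interface and the way the weight is applied (by repeating the update) are assumptions of this sketch.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

// Sketch of Oza and Russell's online bagging with Poisson(1) example weights.
public class OnlineBagging {
    public interface BaseLearner {
        void train(double[] x, int y);
        int predict(double[] x);
    }

    private final List<BaseLearner> models = new ArrayList<>();
    private final Random rnd = new Random(42);
    private final int numClasses;

    public OnlineBagging(int ensembleSize, int numClasses, Supplier<BaseLearner> factory) {
        this.numClasses = numClasses;
        for (int m = 0; m < ensembleSize; m++) models.add(factory.get());
    }

    // Knuth's method for sampling from Poisson(1).
    private int poisson1() {
        double l = Math.exp(-1.0), p = 1.0;
        int k = 0;
        do { k++; p *= rnd.nextDouble(); } while (p > l);
        return k - 1;
    }

    public void trainOnInstance(double[] x, int y) {
        for (BaseLearner h : models) {
            int w = poisson1();                        // weight ~ Poisson(1)
            for (int i = 0; i < w; i++) h.train(x, y); // apply the example w times
        }
    }

    public int predict(double[] x) {
        int[] votes = new int[numClasses];
        for (BaseLearner h : models) votes[h.predict(x)]++;
        int best = 0;
        for (int c = 1; c < numClasses; c++) if (votes[c] > votes[best]) best = c;
        return best;                                   // majority vote
    }

    public static void main(String[] args) {
        // Toy base learner: predicts the majority class it has seen so far.
        OnlineBagging bag = new OnlineBagging(5, 2, () -> new BaseLearner() {
            private final int[] counts = new int[2];
            public void train(double[] x, int y) { counts[y]++; }
            public int predict(double[] x) { return counts[1] > counts[0] ? 1 : 0; }
        });
        for (int i = 0; i < 100; i++) bag.trainOnInstance(new double[]{i}, i % 4 == 0 ? 1 : 0);
        System.out.println(bag.predict(new double[]{0}));  // prints 0, the majority class
    }
}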
45. optimal change detector and predictor
• High accuracy
• Low false positives and false negatives ratios
• Theoretical guarantees
• Fast detection of change
• Low computational cost: minimum space and time needed
• No parameters needed
27
46. algorithm adaptive sliding window
Example
W = 101010110111111
ADWIN checks, in turn, every split of W into W = W0 · W1:
W0 = 1          W1 = 01010110111111
W0 = 10         W1 = 1010110111111
W0 = 101        W1 = 010110111111
W0 = 1010       W1 = 10110111111
W0 = 10101      W1 = 0110111111
W0 = 101010     W1 = 110111111
W0 = 1010101    W1 = 10111111
W0 = 10101011   W1 = 0111111
W0 = 101010110  W1 = 111111    |µ̂W0 − µ̂W1| ≥ εc : CHANGE DETECTED!
When change is detected, elements are dropped from the tail of W:
W = 01010110111111

ADWIN: Adaptive Windowing Algorithm
1  Initialize Window W
2  for each t > 0
3    do W ← W ∪ {xt} (i.e., add xt to the head of W)
4      repeat Drop elements from the tail of W
5      until |µ̂W0 − µ̂W1| < εc holds
6        for every split of W into W = W0 · W1
7      Output µ̂W
28
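A minimal sketch of this idea (not the algorithm as published): the version below stores the window explicitly and checks every split with a Hoeffding-style threshold, so it uses O(W) memory and time instead of the bucketed O(log W) structure; the exact εc constants in the ADWIN paper differ slightly from the ones assumed here.

import java.util.ArrayDeque;
import java.util.Deque;

// Hedged, simplified ADWIN-style change detector (O(W) memory and time).
public class SimpleAdwin {
    private final Deque<Double> window = new ArrayDeque<>(); // head = most recent element
    private final double delta;

    public SimpleAdwin(double delta) { this.delta = delta; }

    /** Add a value; returns true if change was detected (and the window was shrunk). */
    public boolean add(double x) {
        window.addFirst(x);
        boolean changed = false;
        boolean shrunk = true;
        while (shrunk) {
            shrunk = false;
            int n = window.size();
            double[] values = window.stream().mapToDouble(Double::doubleValue).toArray();
            double totalSum = 0;
            for (double v : values) totalSum += v;
            double headSum = 0;               // sum of the newest n1 elements (W1)
            for (int n1 = 1; n1 < n; n1++) {
                headSum += values[n1 - 1];
                int n0 = n - n1;              // size of the older part (W0)
                double mu1 = headSum / n1;
                double mu0 = (totalSum - headSum) / n0;
                double m = 1.0 / (1.0 / n0 + 1.0 / n1);
                // One Hoeffding-style instantiation of the cut threshold (approximate constants).
                double epsCut = Math.sqrt(Math.log(4.0 * n / delta) / (2.0 * m));
                if (Math.abs(mu0 - mu1) >= epsCut) {
                    window.removeLast();       // drop one element from the tail and re-check
                    changed = true;
                    shrunk = true;
                    break;
                }
            }
        }
        return changed;
    }

    public double estimate() {
        if (window.isEmpty()) return 0;
        double s = 0;
        for (double v : window) s += v;
        return s / window.size();
    }

    public static void main(String[] args) {
        SimpleAdwin adwin = new SimpleAdwin(0.01);
        for (int i = 0; i < 200; i++) {
            double x = (i < 100) ? 0.0 : 1.0;  // abrupt change at i = 100
            if (adwin.add(x)) System.out.println("change detected at i=" + i);
        }
    }
}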
58. algorithm adaptive sliding window
Theorem
At every time step we have:
1  (False positive rate bound). If µt remains constant within W, the probability that ADWIN shrinks the window at this step is at most δ.
2  (False negative rate bound). Suppose that for some partition of W into two parts W0 · W1 (where W1 contains the most recent items) we have |µW0 − µW1| > 2εc. Then with probability 1−δ ADWIN shrinks W to W1, or shorter.
ADWIN tunes itself to the data stream at hand, with no need for the user to hardwire or precompute parameters.
29
59. algorithm adaptive sliding window
ADWIN using a Data Stream Sliding Window Model,
• can provide the exact counts of 1's in O(1) time per point.
• tries O(log W) cutpoints
• uses O((1/ε) log W) memory words
• the processing time per example is O(log W) (amortized and worst-case).
Sliding Window Model
Buckets:   1010101 | 101 | 11 | 1 | 1
Content:         4 |   2 |  2 | 1 | 1
Capacity:        7 |   3 |  2 | 1 | 1
30
60. vfdt / cvfdt
Concept-adapting Very Fast Decision Trees: CVFDT
G. Hulten, L. Spencer, and P. Domingos.
Mining time-changing data streams. 2001
• It keeps its model consistent with a sliding window of examples
• Constructs "alternative branches" as preparation for changes
• If the alternative branch becomes more accurate, the tree branches are switched
[Decision tree figure: Time = Night → YES; Time = Day → Contains "Money"? Yes → YES, No → NO]
31
61. decision trees: cvfdt
[Decision tree figure: Time = Night → YES; Time = Day → Contains "Money"? Yes → YES, No → NO]
No theoretical guarantees on the error rate of CVFDT
CVFDT parameters:
1  W: the example window size.
2  T0: number of examples used to check at each node if the splitting attribute is still the best.
3  T1: number of examples used to build the alternate tree.
4  T2: number of examples used to test the accuracy of the alternate tree.
32
62. decision trees: hoeffding adaptive tree
Hoeffding Adaptive Tree:
• replace frequency statistics counters by estimators
• no need for a window to store examples, since the estimators maintain the required statistics
• change the way alternate subtrees are checked for substitution, using a change detector with theoretical guarantees
Advantages over CVFDT:
1  Theoretical guarantees
2  No parameters
33
63. adwin bagging (kdd’09)
ADWIN
An adaptive sliding window whose size is recomputed online
according to the rate of change observed.
ADWIN has rigorous guarantees (theorems)
• On ratio of false positives and negatives
• On the relation of the size of the current window and change
rates
ADWIN Bagging
When a change is detected, the worst classifier is removed
and a new classifier is added.
34
64. Randomization as a powerful tool to increase accuracy and
diversity
There are three ways of using randomization:
• Manipulating the input data
• Manipulating the classifier algorithms
• Manipulating the output targets
35
65. leveraging bagging for evolving data streams
Leveraging Bagging
• Using Poisson(λ)
Leveraging Bagging MC
• Using Poisson(λ) and Random Output Codes
Fast Leveraging Bagging ME
• if an instance is misclassified: weight = 1
• if not: weight = e_T/(1 − e_T)
36
66. empirical evaluation
                         Accuracy   RAM-Hours
Hoeffding Tree              74.03        0.01
Online Bagging              77.15        2.98
ADWIN Bagging               79.24        1.48
Leveraging Bagging          85.54       20.17
Leveraging Bagging MC       85.37       22.04
Leveraging Bagging ME       80.77        0.87
Leveraging Bagging
• Leveraging Bagging: using Poisson(λ)
• Leveraging Bagging MC: using Poisson(λ) and Random Output Codes
• Leveraging Bagging ME: using weight 1 if misclassified, otherwise e_T/(1 − e_T)
37
68. clustering
Definition
Clustering is the partition of a set of instances into previously unknown groups according to some common relations or affinities.
Example
Market segmentation of customers
Example
Social network communities
39
69. clustering
Definition
Given
• a set of instances I
• a number of clusters K
• an objective function cost(I)
a clustering algorithm computes an assignment of a cluster
for each instance
f : I → {1,...,K}
that minimizes the objective function cost(I)
40
70. clustering
Definition
Given
• a set of instances I
• a number of clusters K
• an objective function cost(C, I)
a clustering algorithm computes a set C of instances with |C| = K that minimizes the objective function
  cost(C, I) = ∑x∈I d²(x, C)
where
• d(x, c): distance function between x and c
• d²(x, C) = min_{c∈C} d²(x, c): distance from x to the nearest point in C
41
71. k-means
• 1. Choose k initial centers C = {c1,...,ck}
• 2. while stopping criterion has not been met
  • For i = 1,...,N
    • find the closest center ck ∈ C to instance pi
    • assign instance pi to cluster Ck
  • For k = 1,...,K
    • set ck to be the center of mass of all points in Ck
42
72. k-means++
• 1. Choose an initial center c1 uniformly at random
  • For k = 2,...,K
    • select ck = p ∈ I with probability d²(p, C)/cost(C, I)
• 2. while stopping criterion has not been met
  • For i = 1,...,N
    • find the closest center ck ∈ C to instance pi
    • assign instance pi to cluster Ck
  • For k = 1,...,K
    • set ck to be the center of mass of all points in Ck
43
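For concreteness, a batch sketch of k-means with k-means++ seeding as outlined above (streaming variants such as StreamKM++ reuse the same seeding idea on coresets); the fixed iteration count stands in for the unspecified stopping criterion.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hedged sketch: k-means++ seeding followed by Lloyd iterations.
public class KMeansPlusPlus {
    public static double[][] cluster(double[][] points, int k, int iterations, long seed) {
        Random rnd = new Random(seed);
        List<double[]> centers = new ArrayList<>();
        centers.add(points[rnd.nextInt(points.length)]);        // first center: uniform
        while (centers.size() < k) {                             // next: probability ∝ d²(p, C)
            double[] d2 = new double[points.length];
            double total = 0;
            for (int i = 0; i < points.length; i++) {
                d2[i] = nearestDist2(points[i], centers);
                total += d2[i];
            }
            double r = rnd.nextDouble() * total;
            int chosen = 0;
            for (double acc = 0; chosen < points.length - 1; chosen++) {
                acc += d2[chosen];
                if (acc >= r) break;
            }
            centers.add(points[chosen]);
        }
        int dim = points[0].length;
        double[][] c = new double[k][];
        for (int j = 0; j < k; j++) c[j] = centers.get(j).clone();
        for (int it = 0; it < iterations; it++) {                // Lloyd iterations
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int best = nearestIndex(p, c);
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += p[d];
            }
            for (int j = 0; j < k; j++)
                if (counts[j] > 0)
                    for (int d = 0; d < dim; d++) c[j][d] = sums[j][d] / counts[j];
        }
        return c;
    }

    private static double nearestDist2(double[] p, List<double[]> centers) {
        double best = Double.MAX_VALUE;
        for (double[] ctr : centers) best = Math.min(best, dist2(p, ctr));
        return best;
    }

    private static int nearestIndex(double[] p, double[][] centers) {
        int best = 0;
        for (int j = 1; j < centers.length; j++)
            if (dist2(p, centers[j]) < dist2(p, centers[best])) best = j;
        return best;
    }

    private static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {1, 0}, {9, 9}, {9, 10}, {10, 9}};
        System.out.println(Arrays.deepToString(cluster(pts, 2, 10, 1)));
    }
}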
73. performance measures
Internal Measures
• Sum square distance
• Dunn index D = d_min / d_max
• C-Index C = (S − S_min) / (S_max − S_min)
External Measures
• Rand Measure
• F Measure
• Jaccard
• Purity
44
74. birch
Balanced Iterative Reducing and Clustering using
Hierarchies
• Clustering Features CF = (N,LS,SS)
• N: number of data points
• LS: linear sum of the N data points
• SS: square sum of the N data points
• Properties:
• Additivity: CF1 +CF2 = (N1 +N2,LS1 +LS2,SS1 +SS2)
• Easy to compute: average inter-cluster distance
and average intra-cluster distance
• Uses CF tree
• Height-balanced tree with two parameters
• B: branching factor
• T: radius leaf threshold
45
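A small sketch of the clustering feature CF = (N, LS, SS) and its additivity; the centroid and radius formulas follow directly from the three statistics, while the CF-tree bookkeeping (branching factor B, threshold T) is omitted.

// Hedged sketch of a BIRCH clustering feature for d-dimensional points.
public class ClusteringFeature {
    private long n;            // N: number of points
    private final double[] ls; // LS: per-dimension linear sum
    private double ss;         // SS: sum of squared norms of the points

    public ClusteringFeature(int dim) { this.ls = new double[dim]; }

    public void add(double[] x) {
        n++;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += x[d];
            ss += x[d] * x[d];
        }
    }

    /** Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2). */
    public void merge(ClusteringFeature other) {
        n += other.n;
        ss += other.ss;
        for (int d = 0; d < ls.length; d++) ls[d] += other.ls[d];
    }

    public double[] centroid() {
        double[] c = new double[ls.length];
        for (int d = 0; d < ls.length; d++) c[d] = ls[d] / n;
        return c;
    }

    /** Average distance of the points from the centroid: sqrt(SS/N − ||LS/N||²). */
    public double radius() {
        double centroidNorm2 = 0;
        for (double v : centroid()) centroidNorm2 += v * v;
        return Math.sqrt(Math.max(0, ss / n - centroidNorm2));
    }

    public static void main(String[] args) {
        ClusteringFeature cf = new ClusteringFeature(2);
        cf.add(new double[]{0, 0});
        cf.add(new double[]{2, 2});
        // centroid = (1, 1), radius = sqrt(2)
        System.out.println("centroid=(" + cf.centroid()[0] + "," + cf.centroid()[1]
                + ") radius=" + cf.radius());
    }
}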
75. birch
Balanced Iterative Reducing and Clustering using
Hierarchies
Phase 1: Scan all data and build an initial in-memory CF
tree
Phase 2: Condense into desirable range by building a
smaller CF tree (optional)
Phase 3: Global clustering
Phase 4: Cluster refining (optional and offline, as it requires more passes)
46
76. clu-stream
Clu-Stream
• Uses micro-clusters to store statistics on-line
• Clustering Features CF = (N,LS,SS,LT,ST)
• N: number of data points
• LS: linear sum of the N data points
• SS: square sum of the N data points
• LT: linear sum of the time stamps
• ST: square sum of the time stamps
• Uses pyramidal time frame
47
77. clu-stream
On-line Phase
• For each new point that arrives, either
  • the point is absorbed by an existing micro-cluster, or
  • the point starts a new micro-cluster of its own; in that case,
    delete the oldest micro-cluster or merge two of the oldest micro-clusters
Off-line Phase
• Apply k-means using micro-clusters as points
48
78. streamkm++: coresets
Coreset of a set P with respect to some problem
Small subset that approximates the original set P.
• Solving the problem for the coreset provides an approximate
solution for the problem on P.
(k,ε)-coreset
A (k,ε)-coreset S of P is a weighted subset of P such that for each C of size k
(1 − ε) cost(P, C) ≤ cost_w(S, C) ≤ (1 + ε) cost(P, C)
49
79. streamkm++: coresets
Coreset Tree
• Choose a leaf node l at random
• Choose a new sample point denoted by qt+1 from Pl according to d²
• Based on ql and qt+1, split Pl into two subclusters and create two child nodes
StreamKM++
• Maintain L = ⌈log2(n/m) + 2⌉ buckets B0, B1, ..., BL−1
50
81. frequent patterns
Suppose D is a dataset of patterns, t ∈ D, and min_sup is a
constant.
Definition
Support (t): number of
patterns in D that are
superpatterns of t.
Definition
Pattern t is frequent if
Support (t) ≥ min_sup.
Frequent Subpattern Problem
Given D and min_sup, find all frequent subpatterns of patterns
in D.
52
88. itemset mining
d1  abce
d2  cde
d3  abce
d4  acde
d5  abcde
d6  bcd

Support | Frequent                    | Gen    | Closed | Max
6       | c                           | c      | c      |
5       | e, ce                       | e      | ce     |
4       | a, ac, ae, ace              | a      | ace    |
4       | b, bc                       | b      | bc     |
4       | d, cd                       | d      | cd     |
3       | ab, abc, abe, be, bce, abce | ab, be | abce   | abce
3       | de, cde                     | de     | cde    | cde

Closures of the generators: e → ce, a → ace.
55
96. closed patterns
Usually, there are too many frequent patterns. We can compute a smaller set, while keeping the same information.
Example
A set of 1000 items has 2^1000 ≈ 10^301 subsets, which is more than the number of atoms in the universe (≈ 10^79).
59
97. closed patterns
A priori property
If t′ is a subpattern of t, then Support (t′) ≥ Support (t).
Definition
A frequent pattern t is closed if none of its proper
superpatterns has the same support as it has.
Frequent subpatterns and their supports can be generated
from closed patterns.
59
98. maximal patterns
Definition
A frequent pattern t is maximal if none of its proper
superpatterns is frequent.
Frequent subpatterns can be generated from maximal
patterns, but not with their support.
All maximal patterns are closed, but not all closed patterns are
maximal.
60
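To make the definitions concrete, the brute-force sketch below recomputes the table from the itemset-mining example (transactions d1..d6, min_sup = 3) and marks each frequent itemset as closed and/or maximal; it is for illustration only and does not scale.

// Hedged sketch: enumerate all itemsets over {a,b,c,d,e}, compute supports on the
// six example transactions, and flag closed and maximal frequent itemsets.
// Itemsets and transactions are encoded as 5-bit masks (bit 0 = a, ..., bit 4 = e).
public class ClosedMaximalItemsets {
    public static void main(String[] args) {
        String items = "abcde";
        int[] transactions = {
            mask("abce", items), mask("cde", items), mask("abce", items),
            mask("acde", items), mask("abcde", items), mask("bcd", items)};
        int minSup = 3;

        int numItemsets = 1 << items.length();
        int[] support = new int[numItemsets];
        for (int s = 1; s < numItemsets; s++)
            for (int t : transactions)
                if ((s & t) == s) support[s]++;          // t is a superpattern of s

        for (int s = 1; s < numItemsets; s++) {
            if (support[s] < minSup) continue;            // not frequent
            boolean closed = true, maximal = true;
            for (int item = 0; item < items.length(); item++) {
                if ((s & (1 << item)) != 0) continue;
                int sup = support[s | (1 << item)];       // proper superpattern, one item larger
                if (sup == support[s]) closed = false;    // same support -> not closed
                if (sup >= minSup) maximal = false;       // frequent superpattern -> not maximal
            }
            System.out.printf("%-6s support=%d closed=%b maximal=%b%n",
                    toItems(s, items), support[s], closed, maximal);
        }
    }

    private static int mask(String itemset, String items) {
        int m = 0;
        for (char c : itemset.toCharArray()) m |= 1 << items.indexOf(c);
        return m;
    }

    private static String toItems(int mask, String items) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < items.length(); i++)
            if ((mask & (1 << i)) != 0) sb.append(items.charAt(i));
        return sb.toString();
    }
}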
99. non streaming frequent itemset miners
Representation:
• Horizontal layout
T1: a, b, c
T2: b, c, e
T3: b, d, e
• Vertical layout
a: 1 0 0
b: 1 1 1
c: 1 1 0
Search:
• Breadth-first (levelwise): Apriori
• Depth-first: Eclat, FP-Growth
61
100. mining patterns over data streams
Requirements: fast, using a small amount of memory, and adaptive
• Type:
• Exact
• Approximate
• Per batch, per transaction
• Incremental, Sliding Window, Adaptive
• Frequent, Closed, Maximal patterns
62
101. moment
• Computes closed frequent itemsets in a sliding window
• Uses a Closed Enumeration Tree
• Uses 4 types of nodes:
  • Closed Nodes
  • Intermediate Nodes
  • Unpromising Gateway Nodes
  • Infrequent Gateway Nodes
• Adding transactions: closed itemsets remain closed
• Removing transactions: infrequent itemsets remain infrequent
63
102. fp-stream
• Mining Frequent Itemsets at Multiple Time Granularities
• Based on FP-Growth
• Maintains
  • a pattern tree
  • a tilted-time window
• Allows answering time-sensitive queries
• Gives more importance to recent data
• Drawback: time and memory complexity
64
103. tree and graph mining: dealing with time changes
• Keep a window on recent stream elements
• Actually, just its lattice of closed sets!
• Keep track of number of closed patterns in lattice, N
• Use some change detector on N
• When change is detected:
• Drop stale part of the window
• Update lattice to reflect this deletion, using deletion rule
Alternatively, sliding window of some fixed size
65
105. overview of big data science
Short Course Summary
1 Introduction to Big Data
2 Big Data Science
3 Real Time Big Data Management
4 Internet of Things Data Science
Open Source Software
1 MOA: http://moa.cms.waikato.ac.nz/
2 SAMOA: http://samoa-project.net/
67