Methodologies for Stream Data Processing and Stream Data Systems: random sampling, sliding windows, histograms, multiresolution methods, sketches, synopses.
Randomized Algorithms to Analyze Data Streams: Randomized algorithms, in the form of random sampling and sketching, are often used to deal with massive, high-dimensional data streams.
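As an illustration of random sampling over a stream, the sketch below uses reservoir sampling, one common technique for keeping a uniform sample of fixed size from a stream of unknown length. The slides do not name a specific sampling method, so the choice of reservoir sampling and the function name `reservoir_sample` are assumptions made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    The stream is read once and only k items are held in memory.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(0))
```

Each element of the stream ends up in the sample with equal probability, which is why the sample can stand in for the full stream when estimating aggregates.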
Data Stream Management Systems and Stream Queries: In traditional database systems, data are stored in finite, persistent databases. Stream data, by contrast, are potentially infinite and impossible to store fully in a database. In a Data Stream Management System (DSMS), there may be multiple data streams. Once an element from a data stream has been processed, it is discarded or archived, and it cannot be easily retrieved unless it is explicitly stored in memory.
Critical Layers of a Stream Data Cube: There are two critical cuboids (or layers). The first, called the minimal interest layer, is the minimally interesting layer that an analyst would like to study. The second, called the observation layer, is the layer at which an analyst (or an automated system) would like to continuously study the data.
Hoeffding Tree Algorithm: The Hoeffding tree algorithm is a decision tree learning method for stream data classification. It was initially used to track Web click streams and construct models to predict which Web hosts and Web sites a user is likely to access. It typically runs in sublinear time and produces a decision tree nearly identical to that of traditional batch learners. It uses Hoeffding trees, which exploit the idea that a small sample can often be enough to choose an optimal splitting attribute.
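The "small sample is enough" idea rests on the Hoeffding bound: after n independent observations of a variable with range R, the true mean lies within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the observed mean with probability 1 - delta. The sketch below computes this bound and applies the usual split test. The function names and the example numbers are illustrative assumptions, not taken from the slides.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean is within epsilon of the
    observed mean with probability 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain, second_gain, value_range, delta, n):
    """Split once the observed gap between the two best attributes
    exceeds the Hoeffding bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical gains for the two best attributes after 1000 examples.
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
decision = can_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
```

Because epsilon shrinks as n grows, a leaf that cannot yet separate its two best attributes simply waits for more examples before splitting.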
Very Fast Decision Tree (VFDT): The VFDT algorithm makes several modifications to the Hoeffding tree algorithm. The modifications include breaking near-ties during attribute selection more aggressively, computing the G function (the split evaluation measure) only after a number of training examples, deactivating the least promising leaves whenever memory is running low, dropping poor splitting attributes, and improving the initialization method. VFDT works well on stream data and also compares extremely well to traditional classifiers in both speed and accuracy. To adapt to concept-drifting data streams, VFDT was further developed into the Concept-adapting Very Fast Decision Tree (CVFDT) algorithm.
Concept-adapting Very Fast Decision Tree Algorithm (CVFDT): CVFDT also uses a sliding window approach; however, it does not construct a new model from scratch each time. Rather, it updates statistics at the nodes by incrementing the counts associated with new examples and decrementing the counts associated with old ones. As a result, if concept drift occurs, some nodes may no longer pass the Hoeffding bound. When this happens, an alternate subtree is grown, with the new best splitting attribute at the root.
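The increment-and-decrement bookkeeping can be sketched as follows. This covers only the count maintenance over a sliding window for a single attribute. It does not implement CVFDT's per-node statistics or alternate-subtree growth, and the class name `WindowedCounts` is a hypothetical one chosen for this example.

```python
from collections import Counter, deque


class WindowedCounts:
    """Maintain (attribute value, class label) counts over the
    most recent window_size examples."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def add(self, attribute_value, label):
        key = (attribute_value, label)
        # Increment the count for the new example.
        self.window.append(key)
        self.counts[key] += 1
        # Decrement the count for the example that falls out of the window.
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]


wc = WindowedCounts(window_size=3)
for value, label in [("a", "yes"), ("a", "yes"), ("b", "no"), ("b", "no")]:
    wc.add(value, label)
```

Each update touches a constant number of counters, which is what makes this cheaper than rebuilding a model from the window contents.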
A Classifier Ensemble Approach to Stream Data Classification: The idea is to train an ensemble, or group, of classifiers (using, say, naïve Bayes) from sequential chunks of the data stream. Whenever a new chunk arrives, we build a new classifier from it. The individual classifiers are weighted based on their expected classification accuracy in a time-changing environment. Only the top-k classifiers are kept. The decisions are then based on the weighted votes of the classifiers.
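A minimal sketch of the chunk-based ensemble follows, under two assumptions that are not in the slides: each classifier's weight is estimated as its accuracy on the newest chunk, and a trivial majority-label learner (`train_majority`) stands in for naïve Bayes to keep the example short. All names are hypothetical.

```python
from collections import Counter, defaultdict


def train_majority(chunk):
    """Stand-in base learner: always predicts the chunk's most common label."""
    label = Counter(y for _, y in chunk).most_common(1)[0][0]
    return lambda x: label


def accuracy(clf, chunk):
    return sum(clf(x) == y for x, y in chunk) / len(chunk)


class ChunkEnsemble:
    def __init__(self, k, train=train_majority):
        self.k = k
        self.train = train
        self.members = []  # list of (weight, classifier) pairs

    def update(self, chunk):
        """Train a classifier on the new chunk, re-weight every
        classifier on that chunk, and keep only the top k."""
        candidates = [clf for _, clf in self.members] + [self.train(chunk)]
        scored = [(accuracy(clf, chunk), clf) for clf in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        self.members = scored[: self.k]

    def predict(self, x):
        """Return the label with the largest total weighted vote."""
        votes = defaultdict(float)
        for weight, clf in self.members:
            votes[clf(x)] += weight
        return max(votes, key=votes.get)


ens = ChunkEnsemble(k=2)
ens.update([(0, "a"), (1, "a"), (2, "b")])
first = ens.predict(5)
ens.update([(0, "b"), (1, "b"), (2, "b"), (3, "a")])
second = ens.predict(5)
```

Scoring the newest classifier on its own training chunk is optimistic; a held-out portion of the chunk or cross-validation would give a fairer weight.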
Clustering in Evolving Data Streams: compute and store summaries of past data; apply a divide-and-conquer strategy; cluster incoming data streams incrementally; perform microclustering as well as macroclustering analysis; explore multiple time granularities for the analysis of cluster evolution; divide stream clustering into online and offline processes.
Mining Time-Series Data: A time-series database consists of sequences of values or events obtained over repeated measurements of time. Topics: trend analysis; similarity search in time-series analysis.
Markov Chain for Sequence Analysis: A (first-order) Markov chain is a model that generates sequences in which the probability of a symbol depends only on the previous symbol.
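To make the definition concrete, the sketch below estimates transition probabilities from example sequences and scores a new sequence as the start probability times the product of the transition probabilities. The function names and the toy sequences are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_markov_chain(sequences):
    """Estimate P(current symbol | previous symbol) from observed sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {
        prev: {cur: c / sum(nxt.values()) for cur, c in nxt.items()}
        for prev, nxt in counts.items()
    }


def sequence_probability(seq, transitions, start_prob):
    """P(seq) = P(first symbol) * product of P(symbol | previous symbol)."""
    p = start_prob.get(seq[0], 0.0)
    for prev, cur in zip(seq, seq[1:]):
        p *= transitions.get(prev, {}).get(cur, 0.0)
    return p


transitions = fit_markov_chain(["AAB", "ABB"])
p = sequence_probability("AAB", transitions, {"A": 1.0})
```

Transitions never seen in the training sequences get probability zero here; a practical model would usually smooth these counts.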
Tasks using hidden Markov models include: Evaluation: Given a sequence, x, determine the probability, P(x), of obtaining x in the model. Decoding: Given a sequence, determine the most probable path through the model that produced the sequence. Learning: Given a model and a set of training sequences, find the model parameters (i.e., the transition and emission probabilities) that explain the training sequences with relatively high probability.
Different Algorithms in Sequence Analysis: the forward algorithm (evaluation), the Viterbi algorithm (decoding), and the Baum-Welch algorithm (learning).
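The forward and Viterbi algorithms can be sketched compactly for a small hidden Markov model stored in dictionaries. The two-state model and its probabilities below are made-up illustrative values, and Baum-Welch is omitted for length. For long sequences a real implementation would work with log probabilities to avoid underflow.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Evaluation: P(obs), summed over all hidden state paths."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Decoding: the most probable hidden state path and its probability."""
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Best predecessor state for reaching s at this step.
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda t: t[0],
            )
            new_best[s] = (prob * emit_p[s][o], path + [s])
        best = new_best
    prob, path = max(best.values(), key=lambda t: t[0])
    return path, prob


# Hypothetical two-state model used only for illustration.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
obs = ["normal", "cold", "dizzy"]

total_prob = forward(obs, states, start_p, trans_p, emit_p)
best_path, best_prob = viterbi(obs, states, start_p, trans_p, emit_p)
```

The two functions differ only in replacing the sum over predecessor states with a maximum, so the probability of the single best path can never exceed the total probability of the sequence.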