Data Mining: Mining stream time series and sequence data


Published in: Technology

  1. Mining Stream, Time Series, and Sequence Data
  2. Methodologies for Stream Data Processing and Stream Data Systems
       • Random Sampling
       • Sliding Windows
       • Histograms
       • Multiresolution Methods
       • Sketches (Synopses)
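The slide names random sampling without saying which scheme is meant. One common choice for a stream of unknown length is reservoir sampling, shown here as a minimal sketch. The function name `reservoir_sample` and its parameters are illustrative assumptions, not taken from the slides.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Assumption: the stream is any iterable that is read exactly once.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), k=10, rng=random.Random(42))
```

Memory use stays at k items regardless of how long the stream runs, which is the property that makes the technique usable on unbounded data.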
  3. Randomized Algorithms to Analyze Data Streams
     Randomized algorithms, in the form of random sampling and sketching, are often used to deal with massive, high-dimensional data streams.
  4. Data Stream Management Systems and Stream Queries
     In traditional database systems, data are stored in finite, persistent databases.
     Stream data, by contrast, are effectively infinite and cannot be stored in full in a database.
     In a Data Stream Management System (DSMS), there may be multiple data streams.
     Once an element from a data stream has been processed, it is discarded or archived, and it cannot be easily retrieved unless it is explicitly stored in memory.
  5. Critical Layers of a Stream Data Cube
     A stream data cube has two critical cuboids (or layers):
       • The first layer, called the minimal interest layer, is the minimally interesting layer that an analyst would like to study.
       • The second layer, called the observation layer, is the layer at which an analyst (or an automated system) would like to continuously study the data.
  6. Hoeffding Tree Algorithm
     The Hoeffding tree algorithm is a decision tree learning method for stream data classification.
     It was initially used to track Web click streams and construct models to predict which Web hosts and Web sites a user is likely to access.
     It typically runs in sublinear time and produces a nearly identical decision tree to that of traditional batch learners.
     It uses Hoeffding trees, which exploit the idea that a small sample can often be enough to choose an optimal splitting attribute.
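The "small sample is enough" idea rests on the Hoeffding bound: after n independent observations of a variable with range R, the true mean lies within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the observed mean with probability 1 - delta. The sketch below shows how a split decision could use it. The function names and the example numbers are illustrative assumptions.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon for a variable with the given range after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon.

    Assumption: the gains are averages of the split measure over the n
    examples seen at this leaf. For information gain with c classes,
    value_range is log2(c).
    """
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# With few examples the bound is loose, so the leaf keeps waiting.
early = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=100)
# With more examples the same gap becomes statistically convincing.
later = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
```

Because epsilon shrinks as 1/sqrt(n), a leaf only needs to buffer counts until the gap between the top two attributes exceeds the bound, rather than storing the examples themselves.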
  7. Very Fast Decision Tree (VFDT)
     The VFDT (Very Fast Decision Tree) algorithm makes several modifications to the Hoeffding tree algorithm.
     The modifications include breaking near-ties during attribute selection more aggressively, computing the G function only after a number of training examples have arrived, deactivating the least promising leaves whenever memory is running low, dropping poor splitting attributes, and improving the initialization method.
     VFDT works well on stream data and also compares extremely well to traditional classifiers in both speed and accuracy.
     To adapt to concept-drifting data streams, VFDT was further developed into the Concept-adapting Very Fast Decision Tree algorithm (CVFDT).
  8. Concept-adapting Very Fast Decision Tree Algorithm (CVFDT)
     CVFDT uses a sliding window approach, but it does not construct a new model from scratch each time.
     Rather, it updates statistics at the nodes by incrementing the counts associated with new examples and decrementing the counts associated with old ones.
     Therefore, if there is a concept drift, some nodes may no longer pass the Hoeffding bound. When this happens, an alternate subtree will be grown, with the new best splitting attribute at the root.
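The increment/decrement bookkeeping can be sketched for a single node as follows. This is not the CVFDT algorithm itself, only the windowed count update it relies on. The class name `WindowedCounts` and the (attribute value, label) key layout are illustrative assumptions.

```python
from collections import Counter, deque


class WindowedCounts:
    """Per-node counts over the most recent window_size examples."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()    # examples currently inside the window
        self.counts = Counter()  # (attribute_value, label) -> count

    def add(self, attribute_value, label):
        key = (attribute_value, label)
        # A new example enters the window: increment its count.
        self.window.append(key)
        self.counts[key] += 1
        # The oldest example leaves the window: decrement its count.
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]


node = WindowedCounts(window_size=3)
for value, label in [("sunny", "yes"), ("sunny", "yes"), ("rain", "no"), ("rain", "no")]:
    node.add(value, label)
```

Each update costs constant time, so the statistics always describe the current window without retraining on it.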
  9. A Classifier Ensemble Approach to Stream Data Classification
     The idea is to train an ensemble, or group, of classifiers (using, say, naïve Bayes) from sequential chunks of the data stream.
     Whenever a new chunk arrives, we build a new classifier from it.
     The individual classifiers are weighted based on their expected classification accuracy in a time-changing environment.
     Only the top-k classifiers are kept. The decisions are then based on the weighted votes of the classifiers.
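A minimal sketch of the keep-top-k and weighted-vote steps follows. It assumes each classifier is a plain callable and estimates "expected accuracy" by accuracy on the most recent chunk, which is a simplification: scoring a new classifier on its own training chunk is optimistic, and a fuller version would use held-out data. The names `accuracy`, `update_ensemble`, and `predict` are hypothetical.

```python
from collections import defaultdict


def accuracy(classifier, chunk):
    """Fraction of (x, y) pairs in a non-empty chunk that the classifier gets right."""
    correct = sum(1 for x, y in chunk if classifier(x) == y)
    return correct / len(chunk)


def update_ensemble(ensemble, new_classifier, latest_chunk, k):
    """Re-weight all classifiers on the latest chunk and keep the k best.

    ensemble is a list of (classifier, weight) pairs.
    """
    candidates = [clf for clf, _ in ensemble] + [new_classifier]
    scored = [(clf, accuracy(clf, latest_chunk)) for clf in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def predict(ensemble, x):
    """Return the label with the largest total weight across the ensemble."""
    votes = defaultdict(float)
    for clf, weight in ensemble:
        votes[clf(x)] += weight
    return max(votes, key=votes.get)
```

Re-weighting on every chunk lets classifiers trained on outdated concepts lose influence and eventually drop out of the top k.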
  10. Clustering in Evolving Data Streams
       • Compute and store summaries of past data
       • Apply a divide-and-conquer strategy
       • Perform incremental clustering of incoming data streams
       • Perform microclustering as well as macroclustering analysis
       • Explore multiple time granularities for the analysis of cluster evolution
       • Divide stream clustering into on-line and off-line processes
  11. Mining Time-Series Data
      A time-series database consists of sequences of values or events obtained over repeated measurements of time.
       • Trend Analysis
       • Similarity Search in Time-Series Analysis
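The slide does not say which trend analysis technique is meant. A moving average is one standard way to smooth a series so that its trend is easier to see, and it is used here as an assumed example. The function name and the sample data are illustrative.

```python
def moving_average(values, order):
    """Return the moving average of the given order over a numeric sequence."""
    if order < 1 or order > len(values):
        raise ValueError("order must be between 1 and len(values)")
    window_sum = sum(values[:order])
    averages = [window_sum / order]
    for i in range(order, len(values)):
        # Slide the window: add the new value, drop the oldest one.
        window_sum += values[i] - values[i - order]
        averages.append(window_sum / order)
    return averages


smoothed = moving_average([3, 7, 2, 0, 4, 5, 9, 7, 2], order=3)
```

A larger order smooths more aggressively but responds more slowly to genuine changes in the series.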
  12. Markov Chain for Sequence Analysis
      A (first-order) Markov chain is a model that generates sequences in which the probability of a symbol depends only on the previous symbol.
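Under this definition, the probability of a whole sequence is the probability of its first symbol times the product of the transition probabilities between consecutive symbols. The sketch below computes it in log space to avoid underflow on long sequences. The two-symbol model and all of its numbers are made up for illustration.

```python
import math


def sequence_log_probability(sequence, initial, transition):
    """Log-probability of a non-empty sequence under a first-order Markov chain.

    initial[s] is P(first symbol = s); transition[a][b] is P(next = b | current = a).
    Assumption: every probability that is looked up is strictly positive.
    """
    log_p = math.log(initial[sequence[0]])
    for prev, cur in zip(sequence, sequence[1:]):
        log_p += math.log(transition[prev][cur])
    return log_p


# Hypothetical two-symbol model.
initial = {"A": 0.5, "B": 0.5}
transition = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.2, "B": 0.8},
}
log_p = sequence_log_probability("AAB", initial, transition)
```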
  13. Tasks Using Hidden Markov Models
       • Evaluation: Given a sequence, x, determine the probability, P(x), of obtaining x in the model.
       • Decoding: Given a sequence, determine the most probable path through the model that produced the sequence.
       • Learning: Given a model and a set of training sequences, find the model parameters (i.e., the transition and emission probabilities) that explain the training sequences with relatively high probability.
  14. Different Algorithms in Sequence Analysis
       • Forward Algorithm (evaluation)
       • Viterbi Algorithm (decoding)
       • Baum-Welch Algorithm (learning)
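The first two algorithms can be sketched compactly for a discrete hidden Markov model. The forward algorithm sums over all hidden paths to get P(x), while Viterbi keeps only the best path into each state. The dictionary-based model layout and the two-state example are illustrative assumptions. Baum-Welch is omitted because it does not fit in a short sketch.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Evaluation: P(obs) summed over all hidden state paths."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Decoding: (probability, path) of the most probable hidden state path."""
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for symbol in obs[1:]:
        new_best = {}
        for s in states:
            # Pick the best predecessor state r for landing in state s.
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda pair: pair[0],
            )
            new_best[s] = (prob * emit_p[s][symbol], path + [s])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])


# Hypothetical two-state model.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
obs = ["normal", "cold", "dizzy"]
total_probability = forward(obs, states, start_p, trans_p, emit_p)
best_probability, best_path = viterbi(obs, states, start_p, trans_p, emit_p)
```

Both routines multiply raw probabilities, which underflows on long sequences. A practical version would work in log space or rescale at each step.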