Classification and Prediction When the data analysis task is classification, a model or classifier is constructed to predict categorical labels. When the task is numeric prediction, the model constructed predicts a continuous-valued function, or ordered value, as opposed to a categorical label. Such a model is called a predictor.
Steps and issues in preparing the Data for Classification and Prediction Data cleaning; relevance analysis; data transformation and reduction. Comparing Classification and Prediction Methods Accuracy; speed; robustness; scalability; interpretability.
Classification by Decision Tree Induction Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.
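The structure described above can be sketched as a nested dictionary. This is only an illustration of how a finished tree classifies a tuple, not of the induction step that learns it. The attribute names, outcomes, and class labels are hypothetical.

```python
# Hypothetical toy tree (not learned from data): each internal node tests an
# attribute, each branch is an outcome of the test, each leaf is a class label.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "middle_aged": "buys",
        "senior": {
            "attribute": "credit",
            "branches": {"fair": "buys", "excellent": "does_not_buy"},
        },
    },
}


def classify(node, record):
    """Follow the branches matching the record until a leaf (a string) is reached."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node


label = classify(tree, {"age": "youth", "student": "yes"})
```

Here `label` is `"buys"`, reached by testing `age` at the root and then `student` on the `youth` branch.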
Tree Pruning When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of over-fitting the data. Scalability and Decision Tree Induction Problem: most often, the training data will not fit in memory. Decision tree construction therefore becomes inefficient due to swapping of the training tuples in and out of main and cache memories. That is why scalable decision tree induction methods are necessary.
Bayesian Classification Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
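As a minimal sketch of how class membership probabilities can be obtained, the function below applies Bayes' theorem to a prior and a likelihood for each class. The class names and all numbers are hypothetical and chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(class | tuple) for each class via Bayes' theorem.

    priors[c]      = P(class = c)
    likelihoods[c] = P(tuple | class = c)
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # P(tuple), summed over all classes
    return {c: joint[c] / evidence for c in joint}


# Hypothetical values, not taken from any data set.
priors = {"buys": 0.4, "does_not_buy": 0.6}
likelihoods = {"buys": 0.5, "does_not_buy": 0.1}

probabilities = posterior(priors, likelihoods)
predicted = max(probabilities, key=probabilities.get)
```

The classifier would predict the class with the highest posterior probability, here `"buys"`.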
Bayesian belief network A Bayesian network, belief network, or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG).
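A two-node network is enough to show the idea: one arc from a parent variable to a child, with a conditional probability table at the child. The sketch below assumes hypothetical variables (`rain`, `wet`) and made-up probabilities, and computes a joint probability as the product of each variable's probability given its parents.

```python
# Hypothetical two-node DAG: Rain -> WetGrass. All probabilities are made up.
p_rain = {True: 0.2, False: 0.8}

# Conditional probability table: p_wet_given_rain[rain][wet] = P(wet | rain)
p_wet_given_rain = {
    True: {True: 0.9, False: 0.1},
    False: {True: 0.1, False: 0.9},
}


def joint(rain, wet):
    """P(rain, wet) = P(rain) * P(wet | rain), following the arc of the DAG."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]


def marginal_wet(wet):
    """P(wet), obtained by summing the joint over the parent variable."""
    return sum(joint(rain, wet) for rain in (True, False))


p_rain_and_wet = joint(True, True)
p_wet = marginal_wet(True)
```

With these numbers, `p_rain_and_wet` is approximately 0.18 and `p_wet` approximately 0.26.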
Training Bayesian Belief Networks In the learning or training of a belief network, a number of scenarios are possible. The network topology (or “layout” of nodes and arcs) may be given in advance or inferred from the data. The network variables may be observable or hidden in all or some of the training tuples. The case of hidden data is also referred to as missing values or incomplete data.
Back propagation Back propagation is a neural network learning algorithm. The field of neural networks was originally kindled by psychologists and neurobiologists who sought to develop and test computational analogues of neurons. Back propagation learns by iteratively processing a data set of training tuples, comparing the network’s prediction for each tuple with the actual known target value. The weights are then adjusted, working backward from the output layer, to reduce the error between the two.
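The iterative loop can be sketched with a very small network: two inputs, two sigmoid hidden units, and one sigmoid output, trained on a single tuple. The layer sizes, initial weights, learning rate, and the omission of bias terms are all simplifying assumptions made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One forward pass and one backward weight update for a single tuple."""
    # Forward pass: compute hidden activations, then the network's prediction.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

    # Backward pass: error term at the output, then propagated to hidden units.
    delta_out = (output - target) * output * (1 - output)
    delta_hidden = [
        delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
        for j in range(len(hidden))
    ]

    # Gradient-descent updates for both layers.
    new_w_out = [w_out[j] - lr * delta_out * hidden[j] for j in range(len(hidden))]
    new_w_hidden = [
        [w_hidden[j][i] - lr * delta_hidden[j] * x[i] for i in range(len(x))]
        for j in range(len(hidden))
    ]
    return output, new_w_hidden, new_w_out


# Hypothetical training tuple and initial weights.
x, target = [1.0, 0.0], 1.0
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [0.2, -0.1]

errors = []
for _ in range(200):
    output, w_hidden, w_out = train_step(x, target, w_hidden, w_out)
    errors.append((target - output) ** 2)
```

The list `errors` records the squared error before each update, so comparing its first and last entries shows whether the prediction moved toward the target.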
Classification by Association Rule Analysis Frequent patterns and their corresponding association or correlation rules characterize interesting relationships between attribute conditions and class labels, and thus have recently been used for effective classification. Association rules show strong associations between attribute-value pairs (or items) that occur frequently in a given data set. Association rules are commonly used to analyze the purchasing patterns of customers in a store.
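How "frequent" and "strong" a rule is can be measured with support and confidence, the two standard measures for association rules. The sketch below computes both for a rule whose consequent is a class label. The tuples and the attribute-value strings are hypothetical.

```python
def support(itemset, rows):
    """Fraction of rows that contain every item in the itemset."""
    return sum(1 for row in rows if itemset <= row) / len(rows)


def confidence(antecedent, consequent, rows):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent, rows) / support(antecedent, rows)


# Hypothetical tuples, each a set of attribute-value pairs plus a class label.
rows = [
    {"age=youth", "student=yes", "class=buys"},
    {"age=youth", "student=yes", "class=buys"},
    {"age=youth", "student=no", "class=no"},
    {"age=senior", "student=yes", "class=buys"},
    {"age=senior", "student=no", "class=no"},
]

antecedent = {"age=youth", "student=yes"}
consequent = {"class=buys"}

rule_support = support(antecedent | consequent, rows)
rule_confidence = confidence(antecedent, consequent, rows)
```

For this toy data the rule holds in 2 of 5 tuples (support 0.4) and in every tuple that matches the antecedent (confidence 1.0).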
Training tuples Eager learners, when given a set of training tuples, construct a generalization (i.e., classification) model before receiving new (e.g., test) tuples to classify. Lazy learners instead wait until the last minute before doing any model construction in order to classify a given test tuple. That is, when given a training tuple, a lazy learner simply stores it (or does only minor processing) and waits until it is given a test tuple.
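A nearest-neighbor classifier is one common example of a lazy learner, chosen here as an assumption since the text does not name a specific method. In the sketch, training only stores tuples, and all the work happens when a test tuple arrives. The class name and the data points are hypothetical.

```python
import math


class LazyNearestNeighbor:
    """Stores training tuples as-is and defers all work to classification time."""

    def __init__(self):
        self.stored = []

    def add(self, features, label):
        # "Training" is just storage: no model is built here.
        self.stored.append((features, label))

    def classify(self, query):
        # Work happens only now: find the stored tuple closest to the query.
        nearest = min(self.stored, key=lambda item: math.dist(item[0], query))
        return nearest[1]


learner = LazyNearestNeighbor()
learner.add((1.0, 1.0), "A")
learner.add((1.2, 0.8), "A")
learner.add((5.0, 5.0), "B")

predicted_label = learner.classify((4.5, 5.2))
```

The query point lies closest to `(5.0, 5.0)`, so `predicted_label` is `"B"`.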
Other classification methods Genetic Algorithms: genetic algorithms attempt to incorporate ideas of natural evolution. Rough Set Approach: rough set theory can be used for classification to discover structural relationships within imprecise or noisy data. Fuzzy Set Approaches: rule-based systems for classification have the disadvantage that they involve sharp cutoffs for continuous attributes. Fuzzy sets soften these cutoffs by allowing degrees of membership.
Prediction in Data mining Linear Regression: straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is the simplest form of regression, and models y as a linear function of x. Nonlinear Regression: a polynomial regression model can be transformed into a linear regression model, which is then used to predict the values.
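Straight-line regression can be sketched with the standard least-squares formulas for the slope and intercept. The function name and the data points are hypothetical; the data were chosen to lie exactly on y = 2x + 1 so the fit is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1 * x; returns (w0, w1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    w1 = numerator / denominator
    w0 = y_mean - w1 * x_mean
    return w0, w1


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w0, w1 = fit_line(xs, ys)
prediction = w0 + w1 * 5.0  # predicted y for a new x = 5
```

For this data the fit recovers an intercept of 1 and a slope of 2, so the prediction at x = 5 is 11.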
Ensemble Methods for Increasing the Accuracy in prediction Bagging and Boosting The bagging algorithm creates an ensemble of models (classifiers or predictors) for a learning scheme where each model gives an equally-weighted prediction. In boosting, weights are assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to “pay more attention” to the training tuples that were misclassified by Mi.
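Both ideas can be sketched in a few lines. The first function combines equally weighted votes, as in bagging. The second applies an AdaBoost-style weight update, which is an assumption here since the text does not name a specific update rule: correctly classified tuples are scaled down by error / (1 - error), and the weights are then renormalized so misclassified tuples carry relatively more weight. Function names and values are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Bagging: every model's prediction counts equally."""
    return Counter(predictions).most_common(1)[0][0]


def update_weights(weights, correct):
    """AdaBoost-style reweighting after one classifier has been learned.

    weights[i] is the current weight of training tuple i;
    correct[i] says whether the classifier labeled tuple i correctly.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    factor = error / (1 - error)
    # Shrink the weights of correctly classified tuples only.
    scaled = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled]


vote = majority_vote(["yes", "no", "yes"])

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]  # the last tuple was misclassified
new_weights = update_weights(weights, correct)
```

After the update, the misclassified tuple holds half of the total weight, so the next classifier in the series is pushed to get it right.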