  1. 1. Data Mining Methods and Models By Daniel T. Larose, Ph.D. Chapter Summaries and Keywords Preface. The preface begins by discussing why Data Mining Methods and Models is needed. Because of the powerful data mining software platforms currently available, a strong caveat is given against glib application of data mining methods and techniques. In other words, data mining is easy to do badly. The best way to avoid these costly errors, which stem from a blind black-box approach to data mining, is to instead apply a “white-box” methodology, which emphasizes an understanding of the algorithmic and statistical model structures underlying the software. Data Mining Methods and Models applies this white-box approach by (1) walking the reader through the operations and nuances of the various algorithms, using small sample data sets, so that the reader gets a true appreciation of what is really going on inside the algorithm, (2) providing examples of the application of the various algorithms on actual large data sets, (3) supplying chapter exercises, which allow readers to assess their depth of understanding of the material, as well as have a little fun playing with numbers and data, and (4) providing the reader with hands-on analysis problems, representing an opportunity for the reader to apply his or her newly-acquired data mining expertise to solving real problems using large data sets. Data mining is presented as a well-structured standard process, namely, the Cross-Industry Standard Process for Data Mining (CRISP-DM). A graphical approach to data analysis is emphasized, stressing in particular exploratory data analysis. Data Mining Methods and Models naturally fits the role of textbook for an introductory course in data mining. 
Instructors may appreciate (1) the presentation of data mining as a process, (2) the “white-box” approach, emphasizing an understanding of the underlying algorithmic structures, (3) the graphical approach, emphasizing exploratory data analysis, and (4) the logical presentation, flowing naturally from the CRISP-DM standard process and the set of data mining tasks. Particularly useful for the instructor is the companion website, which provides ancillary materials for teaching a course using Data Mining Methods and Models, including PowerPoint® presentations, answer keys, and sample projects. The book is appropriate for advanced undergraduate or graduate-level courses. No computer programming or database expertise is required. The software used in the book includes Clementine, Minitab, SPSS, and WEKA. Free trial versions of Minitab and SPSS are available for download from their company websites. WEKA is open-source data mining software freely available for download. Keywords: Data Mining Methods and Models Copyright © by Daniel T. Larose, Ph.D. Chapter Summary and Keywords
  2. 2. Algorithm walk-throughs, hands-on analysis problems, chapter exercises, “white-box” approach, data mining as a process, graphical and exploratory approach, companion website, Clementine, Minitab, SPSS, WEKA. Chapter 1: Dimension Reduction Methods Chapter One begins with an assessment of the need for dimension reduction in data mining. Principal components analysis is demonstrated, in the context of a real-world example using the Houses data set. Various criteria are compared for determining how many components should be extracted. Emphasis is given to profiling the principal components for the end-user, along with the importance of validating the principal components using the usual hold-out methods in data mining. Next, factor analysis is introduced and demonstrated using the real-world Adult data set. The need for factor rotation is discussed, which clarifies the definition of the factors. Finally, user-defined composites are briefly discussed, using an example. Key Words: Principal components, factor analysis, communality, variation, scree plot, eigenvalues, component weights, factor loadings, factor rotation, user-defined composite. Chapter 2: Regression Modeling Chapter Two begins by using an example to introduce simple linear regression and the concept of least squares. The usefulness of the regression is then measured by the coefficient of determination r², and the typical prediction error is estimated using the standard error of the estimate s. The correlation coefficient r is discussed, along with the ANOVA table for succinct display of results. Outliers, high leverage points, and influential observations are discussed in detail. Moving from descriptive methods to inference, the regression model is introduced. 
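As a rough illustration of the Chapter 2 quantities named above, the following sketch fits a least-squares line and computes the coefficient of determination r² and the standard error of the estimate s. The five data points and the function name `simple_linear_regression` are made up for this example and are not taken from the book or its data sets.

```python
import math

def simple_linear_regression(x, y):
    """Fit y = b0 + b1*x by least squares; return (b0, b1, r2, s)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope
    b0 = y_bar - b1 * x_bar        # intercept
    # SSE: variation left unexplained; SST: total variation in y
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - sse / sst             # coefficient of determination
    s = math.sqrt(sse / (n - 2))   # standard error of the estimate
    return b0, b1, r2, s

# Hypothetical illustrative data (not from the book)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1, r2, s = simple_linear_regression(x, y)
print(f"y-hat = {b0:.2f} + {b1:.2f} x, r^2 = {r2:.2f}, s = {s:.3f}")
```

Here r² is the share of the variation in y accounted for by the fitted line, while s gives the size of a typical prediction error in the units of y.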
The t-test for the relationship between x and y is shown, along with the confidence interval for the slope of the regression line, the confidence interval for the mean value of y given x, and the prediction interval for a randomly chosen value of y given x. Methods are shown for verifying the assumptions underlying the regression model. Detailed examples are provided using the Baseball and California data sets. Finally, methods of applying transformations to achieve linearity are provided. Key Words: Simple linear regression, least squares, prediction error, outlier, high leverage point, influential observation, confidence interval, prediction interval, transformations. Chapter 3: Multiple Regression and Model Building Multiple regression, where more than one predictor variable is used to estimate a response variable, is introduced by way of an example. To allow for inference, the
  3. 3. multiple regression model is defined, with both model and inferential methods representing extensions of the simple linear regression case. Next, regression with categorical predictors (indicator variables) is explained. The problems of multicollinearity are examined; multicollinearity represents an unstable response surface due to overly correlated predictors. The variance inflation factor is defined, as an aid in identifying multicollinear predictors. Variable selection methods are then provided, including forward selection, backward elimination, stepwise, and best-subsets regression. Mallows’ C_p statistic is defined, as an aid in variable selection. Finally, methods for using the principal components as predictors in multiple regression are discussed. Key Words: Categorical predictors, indicator variables, multicollinearity, variance inflation factor, model selection methods, forward selection, backward elimination, stepwise regression, best-subsets. Chapter 4: Logistic Regression Logistic regression is introduced by way of a simple example for predicting the presence of disease based on age. The maximum likelihood estimation methods for logistic regression are outlined. Emphasis is placed on interpreting logistic regression output. Inference within the framework of the logistic regression model is discussed, including determining whether the predictors are significant. Methods for interpreting the logistic regression model are examined, including for dichotomous, polychotomous, and continuous predictors. The assumption of linearity is discussed, as well as methods for tackling the zero-cell problem. We then turn to multiple logistic regression, where more than one predictor is used to classify a response. Methods are discussed for introducing higher order terms to handle non-linearity. As usual, the logistic regression model must be validated. 
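To make the idea of maximum likelihood estimation for logistic regression concrete, here is a minimal sketch that fits a one-predictor model of disease on age using Newton-Raphson iterations. The ages and disease indicators are invented for illustration, the function name `fit_logistic` is hypothetical, and Newton-Raphson is one common way to maximize the likelihood, not necessarily the procedure used in the book or its software.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Maximum likelihood fit of P(y=1 | x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood (the score equations)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # 2x2 information matrix (negative Hessian of the log-likelihood)
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system by hand
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: age in years and disease indicator (1 = present)
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
disease = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(ages, disease)
odds_ratio = math.exp(b1)  # multiplicative change in the odds per extra year of age
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, odds ratio per year = {odds_ratio:.3f}")
```

For a continuous predictor such as age, exp(b1) is read as the factor by which the odds of disease change for each additional unit of the predictor.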
Finally, the application of logistic regression using the freely available software WEKA is demonstrated, using a small example. Key Words: Maximum likelihood estimation, categorical response, classification, the zero-cell problem, multiple logistic regression, WEKA. Chapter 5: Naïve Bayes and Bayesian Networks Chapter Five begins by contrasting the Bayesian approach with the usual (frequentist) approach to probability. The maximum a posteriori (MAP) classification is defined, which is used to select the preferred response classification. Odds ratios are discussed, including the posterior odds ratio. The importance of balancing the data is discussed. Naïve Bayes classification is derived, using a simplifying assumption which greatly reduces the search space. Methods for handling numeric predictors for Naïve Bayes classification are demonstrated. An example of using WEKA for Naïve Bayes is provided. Then, Bayesian Belief Networks (Bayes Nets) are introduced and defined.
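The Naïve Bayes classification mentioned in this chapter can be sketched as follows for categorical predictors. The simplifying assumption is that the predictors are conditionally independent given the class, so each class score is the class prior multiplied by one conditional probability per predictor, and the MAP classification is the class with the largest score. The records, labels, and function names are hypothetical, and no smoothing is applied, so a value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Count class frequencies and per-class frequencies of each predictor value."""
    class_counts = Counter(labels)
    # feature_counts[c][j][v] = number of class-c records with value v at position j
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for rec, c in zip(records, labels):
        for j, v in enumerate(rec):
            feature_counts[c][j][v] += 1
    return class_counts, feature_counts

def classify_map(record, class_counts, feature_counts):
    """Return the MAP class and the unnormalized posterior score of every class."""
    n = sum(class_counts.values())
    scores = {}
    for c, count in class_counts.items():
        score = count / n  # prior P(class)
        for j, v in enumerate(record):
            # Conditional independence: multiply by P(value | class)
            score *= feature_counts[c][j][v] / count
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical training data: (marital status, income level) -> response
records = [("single", "high"), ("single", "low"), ("married", "high"),
           ("married", "high"), ("married", "low"), ("single", "low"),
           ("single", "high")]
labels = ["yes", "no", "yes", "yes", "no", "no", "no"]

class_counts, feature_counts = train_naive_bayes(records, labels)
best, scores = classify_map(("married", "high"), class_counts, feature_counts)
print(best, scores)
```

The ratio of the two class scores is the posterior odds ratio for the record being classified.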
  4. 4. Methods for using the Bayesian network to find probabilities are discussed. Finally, an example of using Bayes nets in WEKA is provided. Key Words: Bayesian approach, maximum a posteriori classification, odds ratio, posterior odds ratio, balancing the data, Naïve Bayes classification, Bayesian belief networks, WEKA. Chapter 6: Genetic Algorithms Chapter Six begins by introducing genetic algorithms by way of analogy with the biological processes at work in the evolution of organisms. The basic framework of a genetic algorithm is provided, including the three basic operators: Selection, Crossover, and Mutation. A simple example of a genetic algorithm at work is examined, with each step explained and demonstrated. Next, modifications and enhancements from the literature are discussed, especially for the selection and crossover operators. Genetic algorithms for real-valued variables are discussed. The use of genetic algorithms as optimizers within a neural network is demonstrated, where the genetic algorithm replaces the usual backpropagation algorithm. Finally, an example of the use of WEKA for genetic algorithms is provided. Key Words: Selection, crossover, mutation, optimization, global optimum, selection pressure, crowding, fitness, WEKA. Chapter 7: Case Study: Modeling Response to Direct Mail Marketing The case study begins with an overview of the cross-industry standard process for data mining: CRISP-DM. For the business understanding phase, the direct mail marketing response problem is defined, with particular emphasis on the construction of an accurate cost / benefit table, which will be used to assess the usefulness of all later models. In the data understanding and data preparation phases, the Clothing Store data set is explored. Transformations to achieve normality or symmetry are applied, as are standardization and the construction of flag variables. Useful new variables are derived. 
The relationships between the predictors and the response are explored, and the correlation structure among the predictors is investigated. Next comes the modeling phase. Here, two principal components are derived, using principal components analysis. Clustering analysis is performed, using the BIRCH clustering algorithm. Emphasis is laid on the effects of balancing (and over-balancing) the training data set. The baseline model performance is established. Two sets of models are examined, Collection A, which uses the principal components, and Collection B, which does not. The technique of using over-balancing as a surrogate for misclassification costs is applied. The method of combining models via voting is demonstrated, as is the method of combining models using the mean response probabilities.
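The two model-combination ideas just mentioned can be sketched as follows. Voting counts how many models predict a positive response and compares the count with a required number of votes, while the mean-probability method averages the response probabilities from the models and applies a threshold. The three sets of model outputs, the function names, and the 0.5 threshold are all assumptions made for this illustration, not values from the case study.

```python
def combine_by_voting(predictions, min_votes):
    """predictions: one list of 0/1 predictions per model, aligned by customer."""
    return [1 if sum(votes) >= min_votes else 0 for votes in zip(*predictions)]

def combine_by_mean_probability(probabilities, threshold=0.5):
    """probabilities: one list of response probabilities per model, aligned by customer."""
    combined = []
    for probs in zip(*probabilities):
        mean_p = sum(probs) / len(probs)
        combined.append(1 if mean_p >= threshold else 0)
    return combined

# Hypothetical outputs of three models for four customers
preds = [[1, 0, 1, 0],
         [1, 1, 0, 0],
         [1, 0, 1, 0]]
probs = [[0.90, 0.20, 0.60, 0.10],
         [0.80, 0.70, 0.40, 0.30],
         [0.70, 0.30, 0.65, 0.20]]

majority = combine_by_voting(preds, min_votes=2)    # at least two models agree
unanimous = combine_by_voting(preds, min_votes=3)   # all three models must agree
any_model = combine_by_voting(preds, min_votes=1)   # a single positive vote suffices
by_mean = combine_by_mean_probability(probs)
print(majority, unanimous, any_model, by_mean)
```

Raising the number of required votes makes the combined model more conservative about predicting a response, which matters when the costs of the two kinds of misclassification differ.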
  5. 5. Key Words: CRISP-DM standard process for data mining, BIRCH clustering algorithm, over-balancing, misclassification costs, cost / benefit analysis, model combination, voting, mean response probabilities.