Data Mining In Market Research



  1. 1. Data Mining in Market Research <ul><li>What is data mining? </li></ul><ul><ul><li>Methods for finding interesting structure in large databases </li></ul></ul><ul><ul><ul><li>E.g. patterns, prediction rules, unusual cases </li></ul></ul></ul><ul><ul><li>Focus on efficient, scalable algorithms </li></ul></ul><ul><ul><ul><li>Contrasts with emphasis on correct inference in statistics </li></ul></ul></ul><ul><ul><li>Related to data warehousing, machine learning </li></ul></ul><ul><li>Why is data mining important? </li></ul><ul><ul><li>Well marketed; now a large industry; pays well </li></ul></ul><ul><ul><li>Handles large databases directly </li></ul></ul><ul><ul><li>Can make data analysis more accessible to end users </li></ul></ul><ul><ul><ul><li>Semi-automation of analysis </li></ul></ul></ul><ul><ul><ul><li>Results can be easier to interpret than e.g. regression models </li></ul></ul></ul><ul><ul><ul><li>Strong focus on decisions and their implementation </li></ul></ul></ul>
  2. 2. CRISP-DM Process Model
  3. 3. Data Mining Software <ul><li>Many providers of data mining software </li></ul><ul><ul><li>SAS Enterprise Miner, SPSS Clementine, Statistica Data Miner, MS SQL Server, Polyanalyst, KnowledgeSTUDIO, … </li></ul></ul><ul><ul><li>See for a list </li></ul></ul><ul><ul><li>Good algorithms important, but also need good facilities for handling data and meta-data </li></ul></ul><ul><li>We’ll use: </li></ul><ul><ul><li>WEKA (Waikato Environment for Knowledge Analysis) </li></ul></ul><ul><ul><ul><li>Free (GPLed) Java package with GUI </li></ul></ul></ul><ul><ul><ul><li>Online at </li></ul></ul></ul><ul><ul><ul><li>Witten and Frank, 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. </li></ul></ul></ul><ul><ul><li>R packages </li></ul></ul><ul><ul><ul><li>E.g. rpart, class, tree, nnet, cclust, deal, GeneSOM, knnTree, mlbench, randomForest, subselect </li></ul></ul></ul>
  4. 4. Data Mining Terms <ul><li>Different names for familiar statistical concepts, from database and AI communities </li></ul><ul><ul><li>Observation = case, record, instance </li></ul></ul><ul><ul><li>Variable = field, attribute </li></ul></ul><ul><ul><li>Analysis of dependence vs interdependence = Supervised vs unsupervised learning </li></ul></ul><ul><ul><li>Relationship = association, concept </li></ul></ul><ul><ul><li>Dependent variable = response, output </li></ul></ul><ul><ul><li>Independent variable = predictor, input </li></ul></ul>
  5. 5. Common Data Mining Techniques <ul><li>Predictive modeling </li></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><ul><li>Derive classification rules </li></ul></ul></ul><ul><ul><ul><li>Decision trees </li></ul></ul></ul><ul><ul><li>Numeric prediction </li></ul></ul><ul><ul><ul><li>Regression trees, model trees </li></ul></ul></ul><ul><li>Association rules </li></ul><ul><li>Meta-learning methods </li></ul><ul><ul><li>Cross-validation, bagging, boosting </li></ul></ul><ul><li>Other data mining methods include: </li></ul><ul><ul><li>artificial neural networks, genetic algorithms, density estimation, clustering, abstraction, discretisation, visualisation, detecting changes in data or models </li></ul></ul>
  6. 6. Classification <ul><li>Methods for predicting a discrete response </li></ul><ul><ul><li>One kind of supervised learning </li></ul></ul><ul><ul><li>Note: in biological and other sciences, classification has long had a different meaning, referring to cluster analysis </li></ul></ul><ul><li>Applications include: </li></ul><ul><ul><li>Identifying good prospects for specific marketing or sales efforts </li></ul></ul><ul><ul><ul><li>Cross-selling, up-selling – when to offer products </li></ul></ul></ul><ul><ul><ul><li>Customers likely to be especially profitable </li></ul></ul></ul><ul><ul><ul><li>Customers likely to defect </li></ul></ul></ul><ul><ul><li>Identifying poor credit risks </li></ul></ul><ul><ul><li>Diagnosing customer problems </li></ul></ul>
  7. 7. Weather/Game-Playing Data <ul><li>Small dataset </li></ul><ul><ul><li>14 instances </li></ul></ul><ul><ul><li>5 attributes </li></ul></ul><ul><ul><ul><li>Outlook - nominal </li></ul></ul></ul><ul><ul><ul><li>Temperature - numeric </li></ul></ul></ul><ul><ul><ul><li>Humidity - numeric </li></ul></ul></ul><ul><ul><ul><li>Wind - nominal </li></ul></ul></ul><ul><ul><ul><li>Play </li></ul></ul></ul><ul><ul><ul><ul><li>Whether or not a certain game would be played </li></ul></ul></ul></ul><ul><ul><ul><ul><li>This is what we want to understand and predict </li></ul></ul></ul></ul>
  8. 8. ARFF file for the weather data.
  9. 9. German Credit Risk Dataset <ul><li>1000 instances (people), 21 attributes </li></ul><ul><ul><li>“class” attribute describes people as good or bad credit risks </li></ul></ul><ul><ul><li>Other attributes include financial information and demographics </li></ul></ul><ul><ul><ul><li>E.g. checking_status, duration, credit_history, purpose, credit_amount, savings_status, employment, Age, housing, job, num_dependents, own_telephone, foreign_worker </li></ul></ul></ul><ul><li>Want to predict credit risk </li></ul><ul><li>Data available at UCI machine learning data repository </li></ul><ul><ul><ul><li> </li></ul></ul></ul><ul><ul><li>and on 747 web page </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul>
  10. 10. Classification Algorithms <ul><li>Many methods available in WEKA </li></ul><ul><ul><li>0R, 1R, NaiveBayes, DecisionTable, ID3, PRISM, Instance-based learner (IB1, IBk), C4.5 (J48), PART, Support vector machine (SMO) </li></ul></ul><ul><li>Usually train on part of the data, test on the rest </li></ul><ul><li>Simple method – Zero-rule, or 0R </li></ul><ul><ul><li>Predict the most common category </li></ul></ul><ul><ul><ul><li>Class ZeroR in WEKA </li></ul></ul></ul><ul><ul><li>Too simple for practical use, but a useful baseline for evaluating performance of more complex methods </li></ul></ul>
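The zero-rule baseline described above is simple enough to sketch in a few lines. The following is an illustrative Python version (not part of the original deck, and not Weka's ZeroR class itself):

```python
from collections import Counter

def zero_rule(train_labels):
    """0R: ignore all attributes and always predict the most common class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _instance: majority

# Weather data: 9 "yes" days vs 5 "no" days, so 0R always predicts "yes"
predict = zero_rule(["yes"] * 9 + ["no"] * 5)
```

Any more elaborate classifier should at least beat this baseline's error rate (5/14 on the weather data), which is what makes 0R a useful yardstick.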
  11. 11. 1-Rule (1R) Algorithm <ul><li>Based on single predictor </li></ul><ul><ul><li>Predict mode within each value of that predictor </li></ul></ul><ul><li>Look at error rate for each predictor on training dataset, and choose best predictor </li></ul><ul><li>Called OneR in WEKA </li></ul><ul><li>Must group numerical predictor values for this method </li></ul><ul><ul><li>Common method is to split at each change in the response </li></ul></ul><ul><ul><li>Collapse buckets until each contains at least 6 instances </li></ul></ul>
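The 1R procedure above can be sketched directly for nominal attributes (an illustrative Python version, not Weka's OneR; the slide's bucketing of numeric predictors is omitted):

```python
from collections import Counter, defaultdict

def one_rule(instances, labels):
    """1R: for each attribute, build a value -> majority-class rule,
    then keep the attribute whose rule makes the fewest training errors."""
    best = None
    for attr in instances[0]:
        # count responses within each value of this attribute
        by_value = defaultdict(Counter)
        for x, y in zip(instances, labels):
            by_value[x[attr]][y] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[x[attr]] != y for x, y in zip(instances, labels))
        if best is None or errors < best[1]:
            best = (attr, errors, rule)
    return best  # (chosen attribute, training errors, value -> class rule)
```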
  12. 12. 1R Algorithm (continued) <ul><li>Biased towards predictors with more categories </li></ul><ul><ul><li>These can result in over-fitting to the training data </li></ul></ul><ul><li>But found to perform surprisingly well </li></ul><ul><ul><li>Study on 16 widely used datasets </li></ul></ul><ul><ul><ul><li>Holte (1993), Machine Learning 11, 63-91 </li></ul></ul></ul><ul><ul><li>Often error rate only a few percentage points higher than more sophisticated methods (e.g. decision trees) </li></ul></ul><ul><ul><li>Produced rules that were much simpler and more easily understood </li></ul></ul>
  13. 13. Naïve Bayes Method <ul><li>Calculates probabilities of each response value, assuming independence of attribute effects </li></ul><ul><li>Response value with highest probability is predicted </li></ul><ul><li>Numeric attributes are assumed to follow a normal distribution within each response value </li></ul><ul><ul><li>Contribution to probability calculated from normal density function </li></ul></ul><ul><ul><li>Instead can use kernel density estimate, or simply discretise the numerical attributes </li></ul></ul>
  14. 14. Naïve Bayes Calculations <ul><li>Observed counts and probabilities above </li></ul><ul><ul><li>Temperature and humidity have been discretised </li></ul></ul><ul><li>Consider new day </li></ul><ul><ul><li>Outlook=sunny, temperature=cool, humidity=high, windy=true </li></ul></ul><ul><ul><li>Probability(play=yes) ∝ 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053 </li></ul></ul><ul><ul><li>Probability(play=no) ∝ 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206 </li></ul></ul><ul><ul><li>Normalising, Probability(play=no) = 0.0206/(0.0053+0.0206) = 79.5% </li></ul></ul><ul><ul><ul><li>“no” is about four times more likely than “yes” </li></ul></ul></ul>
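These probability products can be checked in a few lines of plain Python (a verification sketch, not part of the original deck; the fractions are the conditional counts from the weather table, e.g. P(outlook=sunny | yes) = 2/9):

```python
# attribute likelihoods given each class, times the class priors 9/14 and 5/14
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # sunny, cool, high, windy | yes
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # same attributes | no

print(round(p_yes, 4))                  # 0.0053
print(round(p_no, 4))                   # 0.0206
print(round(p_no / (p_yes + p_no), 3))  # 0.795, i.e. 79.5% for play=no
```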
  15. 15. Naïve Bayes Method <ul><li>If any of the component probabilities are zero, the whole probability is zero </li></ul><ul><ul><li>Effectively a veto on that response value </li></ul></ul><ul><ul><li>Add one to each cell’s count to get around this problem </li></ul></ul><ul><ul><ul><li>Corresponds to weak positive prior information </li></ul></ul></ul><ul><li>Naïve Bayes effectively assumes that attributes are equally important </li></ul><ul><ul><li>Several highly correlated attributes could drown out an important variable that would add new information </li></ul></ul><ul><li>However this method often works well in practice </li></ul>
  16. 16. Decision Trees <ul><li>Classification rules can be expressed in a tree structure </li></ul><ul><ul><li>Move from the top of the tree, down through various nodes, to the leaves </li></ul></ul><ul><ul><li>At each node, a decision is made using a simple test based on attribute values </li></ul></ul><ul><ul><li>The leaf you reach holds the appropriate predicted value </li></ul></ul><ul><li>Decision trees are appealing and easily used </li></ul><ul><ul><li>However they can be verbose </li></ul></ul><ul><ul><li>Depending on the tests being used, they may obscure rather than reveal the true pattern </li></ul></ul><ul><li>More info online at http://recursive- / </li></ul>
  17. 17. Decision tree with a replicated subtree If x=1 and y=1 then class = a If z=1 and w=1 then class = a Otherwise class = b
  18. 18. Problems with Univariate Splits
  19. 19. Constructing Decision Trees <ul><li>Develop tree recursively </li></ul><ul><ul><li>Start with all data in one root node </li></ul></ul><ul><ul><li>Need to choose attribute that defines first split </li></ul></ul><ul><ul><ul><li>For now, we assume univariate splits are used </li></ul></ul></ul><ul><ul><li>For accurate predictions, want leaf nodes to be as pure as possible </li></ul></ul><ul><ul><li>Choose the attribute that maximises the average purity of the daughter nodes </li></ul></ul><ul><ul><ul><li>The measure of purity used is the entropy of the node </li></ul></ul></ul><ul><ul><ul><li>This is the amount of information needed to specify the value of an instance in that node, measured in bits </li></ul></ul></ul>
  20. 20. Tree stumps for the weather data (a) (b) (c) (d)
  21. 21. Weather Example <ul><li>First node from outlook split is for “sunny”, with entropy – 2/5 * log 2 (2/5) – 3/5 * log 2 (3/5) = 0.971 </li></ul><ul><li>Average entropy of nodes from outlook split is </li></ul><ul><ul><li>5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971= 0.693 </li></ul></ul><ul><li>Entropy of root node is 0.940 bits </li></ul><ul><li>Gain of 0.247 bits </li></ul><ul><li>Other splits yield: </li></ul><ul><ul><li>Gain(temperature)=0.029 bits </li></ul></ul><ul><ul><li>Gain(humidity)=0.152 bits </li></ul></ul><ul><ul><li>Gain(windy)=0.048 bits </li></ul></ul><ul><li>So “outlook” is the best attribute to split on </li></ul>
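The entropy figures on this slide can be verified directly (an illustrative check, not part of the original deck; note the averaged entropy is 0.6935, which the slide truncates to 0.693):

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a node with the given class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

root = entropy([9, 5])   # 9 yes vs 5 no at the root: about 0.940 bits
# outlook splits the 14 days into sunny (2 yes/3 no), overcast (4/0), rainy (3/2)
split = (5/14) * entropy([2, 3]) + (4/14) * entropy([4, 0]) + (5/14) * entropy([3, 2])
gain = root - split      # about 0.247 bits, the largest of the four attributes
```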
  22. 22. Expanded tree stumps for weather data (a) (b) (c)
  23. 23. Decision tree for the weather data
  24. 24. Decision Tree Algorithms <ul><li>The algorithm described in the preceding slides is known as ID3 </li></ul><ul><ul><li>Due to Quinlan (1986) </li></ul></ul><ul><li>Tends to choose attributes with many values </li></ul><ul><ul><li>Using information gain ratio helps solve this problem </li></ul></ul><ul><li>Several more improvements have been made to handle numeric attributes (via univariate splits), missing values and noisy data (via pruning) </li></ul><ul><ul><li>Resulting algorithm known as C4.5 </li></ul></ul><ul><ul><ul><li>Described by Quinlan (1993) </li></ul></ul></ul><ul><ul><li>Widely used (as is the commercial version C5.0) </li></ul></ul><ul><ul><li>WEKA has a version called J4.8 </li></ul></ul>
  25. 25. Classification Trees <ul><li>Described (along with regression trees) in: </li></ul><ul><ul><li>L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, 1984. Classification and Regression Trees . </li></ul></ul><ul><li>More sophisticated method than ID3 </li></ul><ul><ul><li>However Quinlan’s (1993) C4.5 method caught up with CART in most areas </li></ul></ul><ul><li>CART also incorporates methods for pruning, missing values and numeric attributes </li></ul><ul><ul><li>Multivariate splits are possible, as well as univariate </li></ul></ul><ul><ul><ul><li>Split on linear combination Σ c j x j > d </li></ul></ul></ul><ul><ul><li>CART typically uses Gini measure of node purity to determine best splits </li></ul></ul><ul><ul><ul><li>This is of the form Σ p (1- p ) </li></ul></ul></ul><ul><ul><li>But information/entropy measure also available </li></ul></ul>
  26. 26. Regression Trees <ul><li>Trees can also be used to predict numeric attributes </li></ul><ul><ul><li>Predict using average value of the response in the appropriate node </li></ul></ul><ul><ul><ul><li>Implemented in CART and C4.5 frameworks </li></ul></ul></ul><ul><ul><li>Can use a model at each node instead </li></ul></ul><ul><ul><ul><li>Implemented in Weka’s M5’ algorithm </li></ul></ul></ul><ul><ul><ul><li>Harder to interpret than regression trees </li></ul></ul></ul><ul><li>Classification and regression trees are implemented in R’s rpart package </li></ul><ul><ul><li>See Ch 10 in Venables and Ripley, MASS 3rd Ed. </li></ul></ul>
  27. 27. Problems with Trees <ul><li>Can be unnecessarily verbose </li></ul><ul><li>Structure often unstable </li></ul><ul><ul><li>“Greedy” hierarchical algorithm </li></ul></ul><ul><ul><ul><li>Small variations can change chosen splits at high level nodes, which then changes subtree below </li></ul></ul></ul><ul><ul><ul><li>Conclusions about attribute importance can be unreliable </li></ul></ul></ul><ul><li>Direct methods tend to overfit training dataset </li></ul><ul><ul><li>This problem can be reduced by pruning the tree </li></ul></ul><ul><li>Another approach that often works well is to fit the tree, remove all training cases that are not correctly predicted, and refit the tree on the reduced dataset </li></ul><ul><ul><li>Typically gives a smaller tree </li></ul></ul><ul><ul><li>This usually works almost as well on the training data </li></ul></ul><ul><ul><li>But generalises better, e.g. works better on test data </li></ul></ul><ul><li>Bagging the tree algorithm also gives more stable results </li></ul><ul><ul><li>Will discuss bagging later </li></ul></ul>
  28. 28. Classification Tree Example <ul><li>Use Weka’s J4.8 algorithm on German credit data (with default options) </li></ul><ul><ul><li>1000 instances, 21 attributes </li></ul></ul><ul><li>Produces a pruned tree with 140 nodes, 103 leaves </li></ul>
  29. 29. <ul><li>=== Run information === </li></ul><ul><li>Scheme: weka.classifiers.j48.J48 -C 0.25 -M 2 </li></ul><ul><li>Relation: german_credit </li></ul><ul><li>Instances: 1000 </li></ul><ul><li>Attributes: 21 </li></ul><ul><li>Number of Leaves : 103 </li></ul><ul><li>Size of the tree : 140 </li></ul><ul><li>=== Stratified cross-validation === </li></ul><ul><li>=== Summary === </li></ul><ul><li>Correctly Classified Instances 739 73.9 % </li></ul><ul><li>Incorrectly Classified Instances 261 26.1 % </li></ul><ul><li>Kappa statistic 0.3153 </li></ul><ul><li>Mean absolute error 0.3241 </li></ul><ul><li>Root mean squared error 0.4604 </li></ul><ul><li>Relative absolute error 77.134 % </li></ul><ul><li>Root relative squared error 100.4589 % </li></ul><ul><li>Total Number of Instances 1000 </li></ul><ul><li>=== Detailed Accuracy By Class === </li></ul><ul><li>TP Rate FP Rate Precision Recall F-Measure Class </li></ul><ul><li>0.883 0.597 0.775 0.883 0.826 good </li></ul><ul><li>0.403 0.117 0.596 0.403 0.481 bad </li></ul><ul><li>=== Confusion Matrix === </li></ul><ul><li>a b <-- classified as </li></ul><ul><li>618 82 | a = good </li></ul><ul><li>179 121 | b = bad </li></ul>
  30. 30. Cross-Validation <ul><li>Due to over-fitting, cannot estimate prediction error directly on the training dataset </li></ul><ul><li>Cross-validation is a simple and widely used method for estimating prediction error </li></ul><ul><li>Simple approach </li></ul><ul><ul><li>Set aside a test dataset </li></ul></ul><ul><ul><li>Train learner on the remainder (the training dataset) </li></ul></ul><ul><ul><li>Estimate prediction error by using the resulting prediction model on the test dataset </li></ul></ul><ul><li>This is only feasible where there is enough data to set aside a test dataset and still have enough to reliably train the learning algorithm </li></ul>
  31. 31. k -fold Cross-Validation <ul><li>For smaller datasets, use k -fold cross-validation </li></ul><ul><ul><li>Split dataset into k roughly equal parts </li></ul></ul><ul><ul><li>For each part, train on the other k -1 parts and use this part as the test dataset </li></ul></ul><ul><ul><li>Do this for each of the k parts, and average the resulting prediction errors </li></ul></ul><ul><li>This method measures the prediction error when training the learner on a fraction ( k -1)/ k of the data </li></ul><ul><li>If k is small, this will overestimate the prediction error </li></ul><ul><ul><li>k =10 is usually enough </li></ul></ul>Tr Tr Tr Tr Tr Tr Tr Tr Test
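The k-fold procedure above is easy to sketch from scratch (illustrative Python, not part of the original deck; the "majority" learner is a hypothetical stand-in model used only to exercise the loop):

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k roughly equal consecutive folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_error(instances, labels, learner, k=10):
    """Train on k-1 folds, test on the held-out fold, average the error."""
    errors = 0
    for test_idx in k_fold_indices(len(instances), k):
        train_idx = [i for i in range(len(instances)) if i not in test_idx]
        model = learner([instances[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        errors += sum(model(instances[i]) != labels[i] for i in test_idx)
    return errors / len(instances)

# stand-in "learner": ignores attributes, predicts the majority training class
majority = lambda X, y: (lambda x: Counter(y).most_common(1)[0][0])
```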
  32. 32. Regression Tree Example <ul><li>library(rpart) </li></ul><ul><li>data(car.test.frame) </li></ul><ul><li>fit <- rpart(Mileage ~ Weight, car.test.frame) </li></ul><ul><li>post(fit, file=&quot;&quot;) </li></ul><ul><li>summary(fit) </li></ul><ul><ul><li>(“fit” stands in for the object name lost from the original slide) </li></ul></ul>
  33. 34. <ul><li>Call: </li></ul><ul><li>rpart(formula = Mileage ~ Weight, data = car.test.frame) </li></ul><ul><li>n= 60 </li></ul><ul><li>CP nsplit rel error xerror xstd </li></ul><ul><li>1 0.59534912 0 1.0000000 1.0322233 0.17981796 </li></ul><ul><li>2 0.13452819 1 0.4046509 0.6081645 0.11371656 </li></ul><ul><li>3 0.01282843 2 0.2701227 0.4557341 0.09178782 </li></ul><ul><li>4 0.01000000 3 0.2572943 0.4659556 0.09134201 </li></ul><ul><li>Node number 1: 60 observations, complexity param=0.5953491 </li></ul><ul><li>mean=24.58333, MSE=22.57639 </li></ul><ul><li>left son=2 (45 obs) right son=3 (15 obs) </li></ul><ul><li>Primary splits: </li></ul><ul><li>Weight < 2567.5 to the right, improve=0.5953491, (0 missing) </li></ul><ul><li>Node number 2: 45 observations, complexity param=0.1345282 </li></ul><ul><li>mean=22.46667, MSE=8.026667 </li></ul><ul><li>left son=4 (22 obs) right son=5 (23 obs) </li></ul><ul><li>Primary splits: </li></ul><ul><li>Weight < 3087.5 to the right, improve=0.5045118, (0 missing) </li></ul><ul><li>… (continued on next page)… </li></ul>
  34. 35. <ul><li>Node number 3: 15 observations </li></ul><ul><li>mean=30.93333, MSE=12.46222 </li></ul><ul><li>Node number 4: 22 observations </li></ul><ul><li>mean=20.40909, MSE=2.78719 </li></ul><ul><li>Node number 5: 23 observations, complexity param=0.01282843 </li></ul><ul><li>mean=24.43478, MSE=5.115312 </li></ul><ul><li>left son=10 (15 obs) right son=11 (8 obs) </li></ul><ul><li>Primary splits: </li></ul><ul><li>Weight < 2747.5 to the right, improve=0.1476996, (0 missing) </li></ul><ul><li>Node number 10: 15 observations </li></ul><ul><li>mean=23.8, MSE=4.026667 </li></ul><ul><li>Node number 11: 8 observations </li></ul><ul><li>mean=25.625, MSE=4.984375 </li></ul>
  35. 36. Regression Tree Example (continued) <ul><li>plotcp(fit) </li></ul><ul><li>fit2 <- prune(fit, cp=0.1) </li></ul><ul><li>post(fit2, file=&quot;&quot;, cex=1) </li></ul><ul><ul><li>(“fit” and “fit2” stand in for object names lost from the original slide) </li></ul></ul>
  36. 37. Complexity Parameter Plot
  37. 39. Pruned Regression Tree
  38. 40. Classification Methods <ul><li>Project the attribute space into decision regions </li></ul><ul><ul><li>Decision trees: piecewise constant approximation </li></ul></ul><ul><ul><li>Logistic regression: linear log-odds approximation </li></ul></ul><ul><ul><li>Discriminant analysis and neural nets: linear & non-linear separators </li></ul></ul><ul><li>Density estimation coupled with a decision rule </li></ul><ul><ul><li>E.g. Naïve Bayes </li></ul></ul><ul><li>Define a metric space and decide based on proximity </li></ul><ul><ul><li>One type of instance-based learning </li></ul></ul><ul><ul><li>K-nearest neighbour methods </li></ul></ul><ul><ul><ul><li>IBk algorithm in Weka </li></ul></ul></ul><ul><ul><li>Would like to drop noisy and unnecessary points </li></ul></ul><ul><ul><ul><li>Simple algorithm based on success rate confidence intervals available in Weka </li></ul></ul></ul><ul><ul><ul><ul><li>Compares naïve prediction with predictions using that instance </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Must choose suitable acceptance and rejection confidence levels </li></ul></ul></ul></ul><ul><li>Many of these approaches can produce probability distributions as well as predictions </li></ul><ul><ul><li>Depending on the application, this information may be useful </li></ul></ul><ul><ul><ul><li>Such as when results reported to expert (e.g. loan officer) as input to their decision </li></ul></ul></ul>
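The proximity-based approach above can be made concrete with a minimal k-nearest-neighbour sketch (illustrative Python for numeric attribute vectors, not Weka's IBk; no instance-dropping is attempted):

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (squared Euclidean distance on numeric attribute vectors)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(xi, x)), yi)
        for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Counting votes over the k neighbours also yields a rough class probability distribution, the kind of extra information mentioned above for reporting to an expert.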
  39. 41. Numeric Prediction Methods <ul><li>Linear regression </li></ul><ul><li>Splines, including smoothing splines and multivariate adaptive regression splines (MARS) </li></ul><ul><li>Generalised additive models (GAM) </li></ul><ul><li>Locally weighted regression (lowess, loess) </li></ul><ul><li>Regression and Model Trees </li></ul><ul><ul><li>CART, C4.5, M5’ </li></ul></ul><ul><li>Artificial neural networks (ANNs) </li></ul>
  40. 42. Artificial Neural Networks (ANNs) <ul><li>An ANN is a network of many simple processors (or units), that are connected by communication channels that carry numeric data </li></ul><ul><li>ANNs are very flexible, encompassing nonlinear regression models, discriminant models, and data reduction models </li></ul><ul><ul><li>They do require some expertise to set up </li></ul></ul><ul><ul><li>An appropriate architecture needs to be selected and tuned for each application </li></ul></ul><ul><li>They can be useful tools for learning from examples to find patterns in data and predict outputs </li></ul><ul><ul><li>However on their own, they tend to overfit the training data </li></ul></ul><ul><ul><li>Meta-learning tools are needed to choose the best fit </li></ul></ul><ul><li>Various network architectures in common use </li></ul><ul><ul><li>Multilayer perceptron (MLP) </li></ul></ul><ul><ul><li>Radial basis functions (RBF) </li></ul></ul><ul><ul><li>Self-organising maps (SOM) </li></ul></ul><ul><li>ANNs have been applied to data editing and imputation, but not widely </li></ul>
  41. 43. Meta-Learning Methods - Bagging <ul><li>General methods for improving the performance of most learning algorithms </li></ul><ul><li>Bootstrap aggregation, bagging for short </li></ul><ul><ul><li>Select B bootstrap samples from the data </li></ul></ul><ul><ul><ul><li>Selected with replacement, same # of instances </li></ul></ul></ul><ul><ul><ul><ul><li>Can use parametric or non-parametric bootstrap </li></ul></ul></ul></ul><ul><ul><li>Fit the model/learner on each bootstrap sample </li></ul></ul><ul><ul><li>The bagged estimate is the average prediction from all these B models </li></ul></ul><ul><li>E.g. for a tree learner, the bagged estimate is the average prediction from the resulting B trees </li></ul><ul><li>Note that this is not a tree </li></ul><ul><ul><li>In general, bagging a model or learner does not produce a model or learner of the same form </li></ul></ul><ul><li>Bagging reduces the variance of unstable procedures like regression trees, and can greatly improve prediction accuracy </li></ul><ul><ul><li>However it does not always work for poor 0-1 predictors </li></ul></ul>
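The bootstrap-and-average recipe above can be sketched for a numeric predictor (illustrative Python, not part of the original deck; "mean_learner" is a hypothetical stand-in for a real tree learner):

```python
import random

def bag_predictions(instances, labels, learner, x, B=25, seed=1):
    """Fit the learner on B bootstrap samples (drawn with replacement,
    same size as the original data) and average their predictions at x."""
    rng = random.Random(seed)
    n = len(instances)
    preds = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        model = learner([instances[i] for i in idx], [labels[i] for i in idx])
        preds.append(model(x))
    return sum(preds) / B  # average for numeric prediction; vote for classes

# stand-in "learner": predicts the training-set mean regardless of x
mean_learner = lambda X, y: (lambda x: sum(y) / len(y))
```

Note that, as the slide says, the averaged predictor is not itself a tree even when each bootstrap model is one.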
  42. 44. Meta-Learning Methods - Boosting <ul><li>Boosting is a powerful technique for improving accuracy </li></ul><ul><li>The “AdaBoost.M1” method (for classifiers): </li></ul><ul><ul><li>Give each instance an initial weight of 1/ n </li></ul></ul><ul><ul><li>For m =1 to M : </li></ul></ul><ul><ul><ul><li>Fit model using the current weights, & store resulting model m </li></ul></ul></ul><ul><ul><ul><li>If prediction error rate “err” is zero or >= 0.5, terminate loop. </li></ul></ul></ul><ul><ul><ul><li>Otherwise calculate α m =log((1-err)/err) </li></ul></ul></ul><ul><ul><ul><ul><li>This is the log odds of success </li></ul></ul></ul></ul><ul><ul><ul><li>Then adjust weights for incorrectly classified cases by multiplying them by exp( α m ), and repeat </li></ul></ul></ul><ul><ul><li>Predict using a weighted majority vote: Σ α m G m ( x ), where G m ( x ) is the prediction from model m </li></ul></ul>
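A single reweighting round of the AdaBoost.M1 loop above can be sketched as follows (illustrative Python, not part of the original deck; fitting the weak model itself is left abstract):

```python
from math import exp, log

def adaboost_round(weights, correct):
    """One AdaBoost.M1 reweighting step.
    weights: current instance weights; correct: True where the current
    model classified that instance correctly."""
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    if err == 0 or err >= 0.5:
        return None, weights          # the slide's termination test
    alpha = log((1 - err) / err)      # log-odds of success
    new_weights = [w * exp(alpha) if not c else w
                   for w, c in zip(weights, correct)]
    return alpha, new_weights
```

With an error rate of 0.25, alpha = log(3) and each misclassified instance's weight is tripled, so the next model concentrates on the hard cases.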
  43. 45. Meta-Learning Methods - Boosting <ul><li>For example, for the German credit dataset: </li></ul><ul><ul><li>using 100 iterations of AdaBoost.M1 with the DecisionStump algorithm, </li></ul></ul><ul><ul><li>10-fold cross-validation gives an error rate of 24.9% (compared to 26.1% for J4.8) </li></ul></ul>
  44. 46. Association Rules <ul><li>Data on n purchase baskets in form (id, item 1 , item 2 , …, item k ) </li></ul><ul><ul><li>For example, purchases from a supermarket </li></ul></ul><ul><li>Association rules are statements of the form: </li></ul><ul><ul><li>“ When people buy tea, they also often buy coffee.” </li></ul></ul><ul><li>May be useful for product placement decisions or cross-selling recommendations </li></ul><ul><li>We say there is an association rule i 1 ->i 2 if </li></ul><ul><ul><li>i 1 and i 2 occur together in at least s% of the n baskets (the support) </li></ul></ul><ul><ul><li>And at least c% of the baskets containing item i 1 also contain i 2 (the confidence) </li></ul></ul><ul><li>The confidence criterion ensures that “often” is a large enough proportion of the antecedent cases to be interesting </li></ul><ul><li>The support criterion should be large enough that the resulting rules have practical importance </li></ul><ul><ul><li>Also helps to ensure reliability of the conclusions </li></ul></ul>
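The support and confidence definitions above are simple counts; a small sketch with made-up baskets (illustrative Python, not part of the original deck):

```python
def support_confidence(baskets, i1, i2):
    """Support and confidence of the association rule i1 -> i2."""
    n = len(baskets)
    n_i1 = sum(i1 in b for b in baskets)
    n_both = sum(i1 in b and i2 in b for b in baskets)
    return n_both / n, (n_both / n_i1 if n_i1 else 0.0)

# tea -> coffee holds in 2 of 5 baskets (support 40%),
# and in 2 of the 3 baskets that contain tea (confidence ~67%)
baskets = [{"tea", "coffee"}, {"tea", "coffee", "milk"},
           {"tea"}, {"coffee"}, {"milk"}]
sup, conf = support_confidence(baskets, "tea", "coffee")
```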
  45. 47. Association rules <ul><li>The support/confidence approach is widely used </li></ul><ul><ul><li>Efficiently implemented in the Apriori algorithm </li></ul></ul><ul><ul><ul><li>First identify item sets with sufficient support </li></ul></ul></ul><ul><ul><ul><li>Then turn each item set into sets of rules with sufficient confidence </li></ul></ul></ul><ul><li>This method was originally developed in the database community, so there has been a focus on efficient methods for large databases </li></ul><ul><ul><li>“Large” means up to around 100 million instances, and about ten thousand binary attributes </li></ul></ul><ul><li>However this approach can find a vast number of rules, and it can be difficult to make sense of these </li></ul><ul><li>One useful extension is to identify only the rules with high enough lift (or odds ratio) </li></ul>
  46. 48. Classification vs Association Rules <ul><li>Classification rules predict the value of a pre-specified attribute, e.g. </li></ul><ul><ul><ul><li>If outlook=sunny and humidity=high then play =no </li></ul></ul></ul><ul><li>Association rules predict the value of an arbitrary attribute (or combination of attributes) </li></ul><ul><ul><ul><li>E.g. If temperature=cool then humidity=normal </li></ul></ul></ul><ul><ul><ul><li>If humidity=normal and play=no then windy=true </li></ul></ul></ul><ul><ul><ul><li>If temperature=high and humidity=high then play=no </li></ul></ul></ul>
  47. 49. Clustering – EM Algorithm <ul><li>Assume that the data is from a mixture of normal distributions </li></ul><ul><ul><li>I.e. one normal component for each cluster </li></ul></ul><ul><li>For simplicity, consider one attribute x and two components or clusters </li></ul><ul><ul><li>Model has five parameters: ( p , μ 1 , σ 1 , μ 2 , σ 2 ) = θ </li></ul></ul><ul><li>Log-likelihood: log L( θ ) = Σ i log[ p φ( x i ; μ 1 , σ 1 ) + (1 – p ) φ( x i ; μ 2 , σ 2 )], where φ is the normal density </li></ul><ul><li>This is hard to maximise directly </li></ul><ul><ul><li>Use the expectation-maximisation (EM) algorithm instead </li></ul></ul>
  48. 50. Clustering – EM Algorithm <ul><li>Think of data as being augmented by a latent 0/1 variable d i indicating membership of cluster 1 </li></ul><ul><li>If the values of this variable were known, the log-likelihood would be: Σ i [ d i log( p φ( x i ; μ 1 , σ 1 )) + (1 – d i ) log((1 – p ) φ( x i ; μ 2 , σ 2 ))] </li></ul><ul><li>Starting with initial values for the parameters, calculate the expected value of d i </li></ul><ul><li>Then substitute this into the above log-likelihood and maximise to obtain new parameter values </li></ul><ul><ul><li>This will have increased the log-likelihood </li></ul></ul><ul><li>Repeat until the log-likelihood converges </li></ul>
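The E- and M-steps above can be sketched for the one-attribute, two-cluster case (illustrative Python, not part of the original deck; φ is the normal density, and a small floor on the standard deviations is an added safeguard against degenerate clusters):

```python
from math import exp, pi, sqrt

def norm_pdf(x, mu, sigma):
    """Normal density φ(x; mu, sigma)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def em_two_gaussians(xs, p, mu1, s1, mu2, s2, iters=50):
    """EM for a two-component normal mixture with parameters (p, mu1, s1, mu2, s2)."""
    for _ in range(iters):
        # E-step: expected cluster-1 membership d_i for each point
        d = [p * norm_pdf(x, mu1, s1) /
             (p * norm_pdf(x, mu1, s1) + (1 - p) * norm_pdf(x, mu2, s2))
             for x in xs]
        # M-step: weighted updates of mixing proportion, means, and sds
        n1 = sum(d)
        n2 = len(xs) - n1
        p = n1 / len(xs)
        mu1 = sum(di * x for di, x in zip(d, xs)) / n1
        mu2 = sum((1 - di) * x for di, x in zip(d, xs)) / n2
        s1 = max(sqrt(sum(di * (x - mu1) ** 2 for di, x in zip(d, xs)) / n1), 1e-6)
        s2 = max(sqrt(sum((1 - di) * (x - mu2) ** 2 for di, x in zip(d, xs)) / n2), 1e-6)
    return p, mu1, s1, mu2, s2
```

On two well-separated clumps of points the estimated means settle near the clump centres; as the next slide notes, different starting values should be tried since only a local maximum is guaranteed.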
  49. 51. Clustering – EM Algorithm <ul><li>Resulting estimates may only be a local maximum </li></ul><ul><ul><li>Run several times with different starting points to find global maximum (hopefully) </li></ul></ul><ul><li>With parameter estimates, can calculate segment membership probabilities for each case </li></ul>
  50. 52. Clustering – EM Algorithm <ul><li>Extending to more latent classes is easy </li></ul><ul><ul><li>Information criteria such as AIC and BIC are often used to decide how many are appropriate </li></ul></ul><ul><li>Extending to multiple attributes is easy if we assume they are independent, at least conditioning on segment membership </li></ul><ul><ul><li>It is possible to introduce associations, but this can rapidly increase the number of parameters required </li></ul></ul><ul><li>Nominal attributes can be accommodated by allowing different discrete distributions in each latent class, and assuming conditional independence between attributes </li></ul><ul><li>Can extend this approach to a handle joint clustering and prediction models, as mentioned in the MVA lectures </li></ul>
  51. 53. Clustering - Scalability Issues <ul><li>k -means algorithm is also widely used </li></ul><ul><li>However this and the EM-algorithm are slow on large databases </li></ul><ul><li>So is hierarchical clustering - requires O( n 2 ) time </li></ul><ul><li>Iterative clustering methods require full DB scan at each iteration </li></ul><ul><li>Scalable clustering algorithms are an area of active research </li></ul><ul><li>A few recent algorithms: </li></ul><ul><ul><li>Distance-based/ k -Means </li></ul></ul><ul><ul><ul><li>Multi-Resolution kd-Tree for K-Means [PM99] </li></ul></ul></ul><ul><ul><ul><li>CLIQUE [AGGR98] </li></ul></ul></ul><ul><ul><ul><li>Scalable K-Means [BFR98a] </li></ul></ul></ul><ul><ul><ul><li>CLARANS [NH94] </li></ul></ul></ul><ul><ul><li>Probabilistic/EM </li></ul></ul><ul><ul><ul><li>Multi-Resolution kd-Tree for EM [Moore99] </li></ul></ul></ul><ul><ul><ul><li>Scalable EM [BRF98b] </li></ul></ul></ul><ul><ul><ul><li>CF Kernel Density Estimation [ZRL99] </li></ul></ul></ul>
  52. 54. Ethics of Data Mining <ul><li>Data mining and data warehousing raise ethical and legal issues </li></ul><ul><li>Combining information via data warehousing could violate Privacy Act </li></ul><ul><ul><li>Must tell people how their information will be used when the data is obtained </li></ul></ul><ul><li>Data mining raises ethical issues mainly during application of results </li></ul><ul><ul><li>E.g. using ethnicity as a factor in loan approval decisions </li></ul></ul><ul><ul><li>E.g. screening job applications based on age or sex (where not directly relevant) </li></ul></ul><ul><ul><li>E.g. declining insurance coverage based on neighbourhood if this is related to race (“red-lining” is illegal in much of the US) </li></ul></ul><ul><li>Whether something is ethical depends on the application </li></ul><ul><ul><li>E.g. probably ethical to use ethnicity to diagnose and choose treatments for a medical problem, but not to decline medical insurance </li></ul></ul>