  1. 1. An Overview and Example of Data Mining Daniel T. Larose, Ph.D. Professor of Statistics Director, Data Mining @CCSU Editor, Wiley Series on Methods and Applications in Data Mining [email_address] www.math.ccsu.edu/larose University of Rhode Island Department of Computer Science and Statistics March 30, 2007
  2. 2. Overview <ul><li>Part One: </li></ul><ul><ul><li>A Brief Overview of Data Mining </li></ul></ul><ul><li>Part Two: </li></ul><ul><ul><li>An Example of Data Mining: </li></ul></ul><ul><ul><li>Modeling Response to Direct Mail Marketing </li></ul></ul><ul><li>But first, a shameless plug … </li></ul>
  3. 3. Master of Science in DM at CCSU Faculty <ul><li>Dr. Roger Bilisoly (from Ohio State Univ., Statistics) </li></ul><ul><ul><li>Text Mining, Intro to Data Mining </li></ul></ul><ul><li>Dr. Darius Dziuda (from Warsaw Polytechnic Univ, CS) </li></ul><ul><ul><li>Data Mining for Genomics and Proteomics, Biomarker Discovery </li></ul></ul><ul><li>Dr. Zdravko Markov (from Sofia Univ, CS) </li></ul><ul><ul><li>Data Mining (CS perspective), Machine Learning </li></ul></ul><ul><li>Dr. Daniel Miller (from UConn, Statistics) </li></ul><ul><ul><li>Applied Multivariate Analysis, Mathematical Statistics II, Intro to Data Mining </li></ul></ul><ul><li>Dr. Krishna Saha (from Univ of Windsor, Statistics) </li></ul><ul><ul><li>Intro to Data Mining using R </li></ul></ul><ul><li>Dr. Daniel Larose (Program Director) (from UConn, Statistics) </li></ul><ul><ul><li>Intro to Data Mining, Data Mining Methods, Applied Data Mining, Web Mining </li></ul></ul>
  4. 4. Master of Science in DM at CCSU Program (36 credits) <ul><li>Core Courses (27 credits) All available online. </li></ul><ul><ul><li>Stat 521 Introduction to Data Mining (4 cr) </li></ul></ul><ul><ul><li>Stat 522 Data Mining Methods (4 cr) </li></ul></ul><ul><ul><li>Stat 523 Applied Data Mining (4 cr) </li></ul></ul><ul><ul><li>Stat 525 Web Mining </li></ul></ul><ul><ul><li>Stat 526 Data Mining for Genomics and Proteomics </li></ul></ul><ul><ul><li>Stat 527 Text Mining </li></ul></ul><ul><ul><li>Stat 416 Mathematical Statistics II </li></ul></ul><ul><ul><li>Stat 570 Applied Multivariate Analysis </li></ul></ul><ul><li>Electives ( 6 credits. Choose two)  </li></ul><ul><ul><li>CS 570 Topics in Artificial Intelligence: Machine Learning </li></ul></ul><ul><ul><li>CS 580 Topics in Advanced Database: Data Mining </li></ul></ul><ul><ul><li>Stat 455 Experimental Design </li></ul></ul><ul><ul><li>Stat 551 Applied Stochastic Processes </li></ul></ul><ul><ul><li>Stat 567 Linear Models </li></ul></ul><ul><ul><li>Stat 575 Mathematical Statistics III   </li></ul></ul><ul><ul><li>Stat 529 Current Issues in Data Mining                                </li></ul></ul><ul><li>Capstone Requirement: Stat 599 Thesis (3 credits) </li></ul>
  5. 5. Master of Science in DM at CCSU <ul><li>Only MS in DM that is entirely online. </li></ul><ul><li>Some courses available on campus. </li></ul><ul><li>Students must come to CCSU to present the thesis. </li></ul><ul><li>We reach students in about 30 US States and a dozen foreign countries </li></ul><ul><li>Half of our students already have master’s degrees </li></ul><ul><li>About 15% already have Ph.D.’s </li></ul><ul><li>The typical student is a mid-career professional </li></ul><ul><li>Backgrounds are diverse: Computer Science, Engineering, Finance, Chemistry, Database Admin, Statistics, etc. </li></ul><ul><li>www.ccsu.edu/datamining </li></ul>
  6. 6. Graduate Certificate in Data Mining <ul><li>18 Credits: </li></ul><ul><li>Required Courses (12 credits) </li></ul><ul><ul><li>Stat 521 Introduction to Data Mining </li></ul></ul><ul><ul><li>Stat 522 Data Mining Methods and Models </li></ul></ul><ul><ul><li>Stat 523 Applied Data Mining </li></ul></ul><ul><li>Elective Courses (6 credits. Choose Two): </li></ul><ul><ul><li>Stat 525 Web Mining </li></ul></ul><ul><ul><li>Stat 526 Data Mining for Genomics and Proteomics </li></ul></ul><ul><ul><li>Stat 527 Text Mining </li></ul></ul><ul><ul><li>Stat 529 Current Issues in Data Mining </li></ul></ul><ul><ul><li>Some other graduate-level data mining or statistics course, with approval of advisor. </li></ul></ul><ul><li>No Mathematical Statistics requirement. </li></ul>
  7. 7. Material for Part I Drawn From: Discovering Knowledge in Data: An Introduction to Data Mining (Wiley, 2005) <ul><li>Chapter 1. An Introduction to Data Mining </li></ul><ul><li>Chapter 2. Data Preprocessing </li></ul><ul><li>Chapter 3. Exploratory Data Analysis </li></ul><ul><li>Chapter 4. Statistical Approaches to Estimation and Prediction </li></ul><ul><li>Chapter 5. K-Nearest Neighbor </li></ul><ul><li>Chapter 6. Decision Trees </li></ul><ul><li>Chapter 7. Neural Networks </li></ul><ul><li>Chapter 8. Hierarchical and K-Means Clustering </li></ul><ul><li>Chapter 9. Kohonen Networks </li></ul><ul><li>Chapter 10. Association Rules </li></ul><ul><li>Chapter 11. Model Evaluation Techniques </li></ul>
  8. 8. Material for Part II Drawn From: Data Mining Methods and Models (Wiley, 2006) <ul><li>Chapter 1. Dimension Reduction Methods </li></ul><ul><li>Chapter 2. Regression Modeling </li></ul><ul><li>Chapter 3. Multiple Regression and Model Building </li></ul><ul><li>Chapter 4. Logistic Regression </li></ul><ul><li>Chapter 5. Naïve Bayes Classification and Bayesian Networks </li></ul><ul><li>Chapter 6. Genetic Algorithms </li></ul><ul><li>Chapter 7. Case Study: Modeling Response to Direct-Mail Marketing </li></ul>
  9. 9. No Material Drawn From: Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage (Wiley, April 2007) <ul><li>Part One: Web Structure Mining </li></ul><ul><ul><li>Information Retrieval and Web Search </li></ul></ul><ul><ul><li>Hyperlink-Based Ranking </li></ul></ul><ul><li>Part Two: Web Content Mining </li></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Evaluating Clustering </li></ul></ul><ul><ul><li>Classification </li></ul></ul><ul><li>Part Three: Web Usage Mining </li></ul><ul><ul><li>Data Preprocessing, </li></ul></ul><ul><ul><li>Exploratory Data Analysis, </li></ul></ul><ul><ul><li>Association Rules, Clustering, and Classification for Web Usage Mining </li></ul></ul><ul><li>With Dr. Zdravko Markov, Computer Science, CCSU </li></ul>
  10. 10. Call for Book Proposals Wiley Series on Methods and Applications in Data Mining <ul><li>Suggested topics: </li></ul><ul><ul><li>Data Mining in Bioinformatics </li></ul></ul><ul><ul><li>Emerging Techniques in Data Mining (e.g., SVM) </li></ul></ul><ul><ul><li>Data Mining with Evolutionary Algorithms </li></ul></ul><ul><ul><li>Drug Discovery Using Data Mining </li></ul></ul><ul><ul><li>Mining Data Streams </li></ul></ul><ul><ul><li>Visual Analysis in Data Mining </li></ul></ul><ul><li>Books in press: </li></ul><ul><ul><li>Data Mining for Genomics and Proteomics , by Darius Dziuda </li></ul></ul><ul><ul><li>Practical Text Mining Using Perl , by Roger Bilisoly </li></ul></ul><ul><li>Contact Series Editor at larosed@ccsu.edu </li></ul>
  11. 11. What is Data Mining? <ul><li>“ Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.” </li></ul><ul><ul><li>David Hand, Heikki Mannila & Padhraic Smyth, Principles of Data Mining, MIT Press, 2001 </li></ul></ul>
  12. 12. Why Data Mining? <ul><li>“ We are drowning in information but starved for knowledge.” </li></ul><ul><ul><ul><ul><li>John Naisbitt, Megatrends , 1984. </li></ul></ul></ul></ul><ul><li>“ The problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge, and thence up the taxonomy tree into wisdom.” </li></ul><ul><ul><ul><ul><li>Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining , Wiley, 2005. </li></ul></ul></ul></ul>
  13. 13. Need for Human Direction <ul><li>Automation is no substitute for human supervision and input. </li></ul><ul><ul><li>Humans need to be actively involved at every phase of data mining process. </li></ul></ul><ul><li>“ Rather than asking where humans fit into data mining, we should instead inquire about how we may design data mining into the very human process of problem solving.” </li></ul><ul><ul><li>- Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining , Wiley, 2005. </li></ul></ul>
  14. 14. “Data Mining is Easy to Do Badly” <ul><li>Black box software </li></ul><ul><ul><li>Powerful, “easy-to-use” data mining algorithms </li></ul></ul><ul><ul><li>Their very accessibility makes misuse dangerous. </li></ul></ul><ul><ul><li>It is too easy to point and click your way to disaster. </li></ul></ul><ul><li>What is needed: </li></ul><ul><ul><li>An understanding of the underlying algorithmic and statistical model structures. </li></ul></ul><ul><ul><li>An understanding of which algorithms are most appropriate in which situations and for which types of data. </li></ul></ul>
  15. 15. CRISP-DM: Cross-Industry Standard Process for Data Mining
  16. 16. CRISP: DM as a Process <ul><li>Business / Research Understanding Phase </li></ul><ul><ul><li>Enunciate your objectives </li></ul></ul><ul><li>Data Understanding Phase: EDA </li></ul><ul><li>Data Preparation Phase: Preprocessing </li></ul><ul><li>Modeling Phase: Fun and interesting! </li></ul><ul><li>Evaluation Phase </li></ul><ul><ul><li>Confluence of results? Objectives Met? </li></ul></ul><ul><li>Deployment Phase: Use results to solve problem. </li></ul><ul><ul><li>If desired: Use lessons learned to reformulate business / research objective. </li></ul></ul>
  17. 17. What About Data Dredging? <ul><li>Data Dredging </li></ul><ul><li>“ A sufficiently exhaustive search will certainly throw up patterns of some kind. Many of these patterns will simply be a product of random fluctuations, and will not represent any underlying structure.” </li></ul><ul><ul><ul><ul><ul><li>David J. Hand, Data Mining: Statistics and More? The American Statistician , May, 1998. </li></ul></ul></ul></ul></ul>
  18. 18. Guarding Against Data Dredging: Cross-Validation is the Key <ul><li>Partition the data into a training set and a test set. </li></ul><ul><li>If a pattern shows up in both data sets, this decreases the probability that it represents mere noise. </li></ul><ul><li>More generally, n-fold cross-validation may be used. </li></ul>
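The train/test partition described above can be sketched in a few lines of Python (the function name and the 75/25 split are illustrative, not from the case study):

```python
import random

def train_test_split(records, test_frac=0.25, seed=42):
    """Randomly partition records into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(range(100))
# A pattern found on `train` is then validated against `test`;
# n-fold cross-validation repeats this with n rotating test folds.
```

A pattern that survives evaluation on the held-out set is less likely to be a product of random fluctuation.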
  19. 19. Inference and Huge Data Sets <ul><li>Hypothesis testing becomes overly sensitive at the huge sample sizes prevalent in data mining applications. </li></ul><ul><ul><li>Even very tiny effects will be found statistically significant. </li></ul></ul><ul><ul><li>So, data mining tends to de-emphasize formal inference. </li></ul></ul>
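A quick numeric illustration of this point (the effect size and sample sizes are chosen for illustration): for a fixed, practically negligible mean shift, the z statistic grows with the square root of n, so at data-mining sample sizes almost any effect crosses the usual significance thresholds.

```python
import math

def z_statistic(effect, sd, n):
    """One-sample z statistic for a mean shift of `effect` with known sd."""
    return effect / (sd / math.sqrt(n))

# A shift of 0.01 standard deviations:
z_statistic(0.01, 1.0, 100)         # 0.1  -> nowhere near significant
z_statistic(0.01, 1.0, 10_000_000)  # ~31.6 -> overwhelmingly "significant"
```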
  20. 20. Need for Transparency and Interpretability <ul><li>Data mining models should be transparent </li></ul><ul><ul><li>Results should be interpretable by humans </li></ul></ul><ul><li>Decision Trees are transparent </li></ul><ul><li>Neural Networks tend to be opaque </li></ul><ul><li>If a customer asks why he or she was turned down for credit, we should be able to explain the decision, without saying “Our neural net said so.” </li></ul>
  21. 21. Part Two: Modeling Response to Direct Mail Marketing <ul><li>Business Understanding Phase: </li></ul><ul><ul><li>Clothing Store Purchase Data </li></ul></ul><ul><ul><ul><li>Results of a direct mail marketing campaign </li></ul></ul></ul><ul><li>Task: Construct a classification model </li></ul><ul><ul><li>For classifying customers as either responders or non-responders to the marketing campaign, </li></ul></ul><ul><ul><li>To reduce costs and increase return-on-investment </li></ul></ul>
  22. 22. Data Understanding: The Clothing Store dataset <ul><ul><li>List of fields in the dataset (28,799 customers, 51 fields) </li></ul></ul>
  23. 23. Data Preparation and EDA Phase <ul><li>Not covered in this presentation. </li></ul>
  24. 24. Modeling Strategy <ul><li>Apply principal components analysis to address multicollinearity. </li></ul><ul><li>Apply cluster analysis. Briefly profile clusters. </li></ul><ul><li>Balance the training data set. </li></ul><ul><li>Establish baseline model performance </li></ul><ul><ul><li>In terms of expected profit per customer contacted. </li></ul></ul><ul><li>Apply classification algorithms to training data set: </li></ul><ul><ul><li>CART </li></ul></ul><ul><ul><li>C5.0 (C4.5) </li></ul></ul><ul><ul><li>Neural networks </li></ul></ul><ul><ul><li>Logistic regression. </li></ul></ul>
  25. 25. Modeling Strategy continued <ul><li>Evaluate each model using test data set. </li></ul><ul><li>Apply misclassification costs in line with cost benefit table. </li></ul><ul><li>Apply overbalancing as a surrogate for misclassification costs. </li></ul><ul><ul><li>Find best overbalancing proportion. </li></ul></ul><ul><li>Combine predictions from four models </li></ul><ul><ul><li>Using model voting. </li></ul></ul><ul><ul><li>Using mean response probabilities. </li></ul></ul>
  26. 26. Principal Components Analysis (PCA) <ul><li>Multicollinearity does not degrade prediction accuracy. </li></ul><ul><ul><li>But muddles individual predictor coefficients. </li></ul></ul><ul><li>Interested in predictor characteristics, customer profiling, etc? </li></ul><ul><ul><li>Then PCA is required. </li></ul></ul><ul><li>But, if interested solely in classification (prediction, estimation), </li></ul><ul><ul><li>PCA not strictly required. </li></ul></ul>
  27. 27. Report Two Model Sets: <ul><li>Model Set A: </li></ul><ul><ul><li>Includes principal components </li></ul></ul><ul><ul><li>All purpose model set </li></ul></ul><ul><li>Model Set B: </li></ul><ul><ul><li>Includes correlated predictors, not principal components </li></ul></ul><ul><ul><li>Use restricted to classification </li></ul></ul>
  28. 28. Principal Components Analysis (PCA) <ul><li>Seven correlated variables. </li></ul><ul><ul><li>Two components extracted </li></ul></ul><ul><ul><li>Account for 87% of variability </li></ul></ul>
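For intuition about the extraction step, here is a minimal sketch (not the case-study code) of finding the first principal component for just two correlated variables, where the 2×2 covariance eigenproblem has a closed form; the sample data are illustrative:

```python
import math

def first_pc_2d(xs, ys):
    """Largest eigenvalue of the 2x2 sample covariance matrix,
    and the share of total variance it explains."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]]
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = tr / 2 + math.sqrt(tr * tr / 4 - det)
    return lam, lam / tr   # tr = total variance = sum of eigenvalues

xs = [1, 2, 3, 4, 5]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]   # strongly correlated with xs
lam, share = first_pc_2d(xs, ys)  # one component captures almost all variance
```

With seven correlated predictors the same idea applies to a 7×7 covariance matrix, and here two components suffice to capture 87% of the variability.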
  29. 29. Principal Components Analysis (PCA) <ul><li>Principal Component 1 : </li></ul><ul><ul><li>Purchasing Habits </li></ul></ul><ul><ul><li>Customer general purchasing habits </li></ul></ul><ul><ul><li>Expect component to be strongly indicative of response </li></ul></ul><ul><li>Principal Component 2 : </li></ul><ul><ul><li>Promotion Contacts </li></ul></ul><ul><ul><li>Unclear whether component will be associated with response </li></ul></ul><ul><li>Components validated by test data set </li></ul>
  30. 30. BIRCH Clustering Algorithm <ul><li>Requires only one pass through data set </li></ul><ul><ul><li>Scalable for large data sets </li></ul></ul><ul><li>Benefit: Analyst need not pre-specify number of clusters </li></ul><ul><li>Drawback: Sensitive to initial records encountered </li></ul><ul><ul><li>Leads to widely variable cluster solutions </li></ul></ul><ul><li>Requires “outer loop” to find consistent cluster solution </li></ul><ul><li>Zhang, Ramakrishnan and Livny, BIRCH: A New Data Clustering Algorithm and Its Applications, Data Mining and Knowledge Discovery 1, 1997. </li></ul>
  31. 31. BIRCH Clusters <ul><li>Cluster 3 shows: </li></ul><ul><ul><li>Higher response for flag predictors </li></ul></ul><ul><ul><li>Higher averages for numeric predictors </li></ul></ul>
  32. 32. BIRCH Clusters <ul><li>Cluster 3 has highest response rate (red). </li></ul><ul><ul><li>Cluster 1: 7.6% </li></ul></ul><ul><ul><li>Cluster 2: 7.1% </li></ul></ul><ul><ul><li>Cluster 3: 33.0% </li></ul></ul>
  33. 33. Balancing the Data <ul><li>For “rare” classes, balancing provides a more equitable distribution. </li></ul><ul><li>Drawback: loss of data: </li></ul><ul><ul><li>Here, 40% of non-responders randomly omitted </li></ul></ul><ul><ul><li>All responders retained </li></ul></ul><ul><ul><li>The proportion of responders increases from 16.58% to 24.76% </li></ul></ul><ul><li>The test data set should never be balanced </li></ul>
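The balancing step above can be sketched as random undersampling of the non-responders (the function name, keep fraction, and the 1,000-record sample are illustrative):

```python
import random

def balance(records, is_responder, keep_frac=0.60, seed=1):
    """Keep every responder; keep roughly `keep_frac` of the non-responders."""
    rng = random.Random(seed)
    return [r for r in records
            if is_responder(r) or rng.random() < keep_frac]

# ~16.6% responders, roughly as in the raw training data
data = ["resp"] * 166 + ["non"] * 834
balanced = balance(data, lambda r: r == "resp")
rate = balanced.count("resp") / len(balanced)  # rises toward ~25%
```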
  34. 34. False Positive vs. False Negative: Which is Worse? <ul><li>For direct mail marketing, a false negative error is probably worse than a false positive. </li></ul><ul><li>Generate misclassification costs based on the observed data. </li></ul><ul><ul><li>Construct cost-benefit table </li></ul></ul>
  35. 35. Decision Cost / Benefit Analysis
  Outcome         Classified  Actual  Cost      Rationale
  True Negative   No          No      $0        No contact made; no revenue lost
  True Positive   Yes         Yes     -$26.40   (Anticipated revenue) – (Cost of contact)
  False Negative  No          Yes     $28.40    Loss of anticipated revenue
  False Positive  Yes         No      $2.00     Cost of contact
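The cost/benefit table translates directly into an expected-profit calculation. A sketch (the outcome counts and the 16.6% response rate are illustrative, not the case-study figures):

```python
# Costs per outcome from the cost/benefit table (negative cost = profit)
COST = {"TN": 0.00, "TP": -26.40, "FN": 28.40, "FP": 2.00}

def profit_per_customer(tn, tp, fn, fp):
    """Negated total cost, averaged over all customers in the test set."""
    total_cost = (tn * COST["TN"] + tp * COST["TP"]
                  + fn * COST["FN"] + fp * COST["FP"])
    return -total_cost / (tn + tp + fn + fp)

# "Send to everyone": every responder is a TP, every non-responder an FP
profit_per_customer(tn=0, tp=166, fn=0, fp=834)  # ~ $2.71 per customer
```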
  36. 36. Establish Baseline Model Performance <ul><li>Benchmarks </li></ul><ul><ul><li>“ Don’t Send a Marketing Promotion to Anyone” Model </li></ul></ul><ul><ul><li>“ Send a Marketing Promotion to Everyone” Model </li></ul></ul><ul><ul><ul><li>Will compare candidate models against this baseline error rate. </li></ul></ul></ul>
  37. 37. Model Set A (With 50% Balancing) <ul><li>No model beats the benchmark of $2.63 profit per customer </li></ul><ul><li>But misclassification costs had not yet been applied </li></ul><ul><li>Now define FN cost = $28.40, FP cost = $2.00 </li></ul><ul><ul><li>With these costs, the models outperformed the baseline “Send to everyone” model </li></ul></ul>
  38. 38. Model Set A: Effect of Misclassification Costs <ul><li>For the 447 highlighted records: </li></ul><ul><ul><li>Only 20.8% responded. </li></ul></ul><ul><ul><li>But model predicts positive response. </li></ul></ul><ul><ul><li>Due to high false negative misclassification cost. </li></ul></ul>
  39. 39. Model Set A: PCA Component 1 is Best Predictor <ul><li>First principal component ($F-PCA-1), Purchasing Habits, represents both the root node split and the secondary split </li></ul><ul><ul><li>Most important factor for predicting response </li></ul></ul>
  40. 40. Over-Balancing as a Surrogate for Misclassification Costs <ul><li>Software limitation: </li></ul><ul><li>Neural network and logistic regression models in Clementine: </li></ul><ul><ul><li>Lack methods for applying misclassification costs </li></ul></ul><ul><li>Over-balancing is an alternative method that can achieve similar results </li></ul><ul><li>It starves the classifier of instances of non-response </li></ul>
  41. 41. Over-Balancing as a Surrogate for Misclassification Costs <ul><li>Neural network model results </li></ul><ul><ul><li>Three over-balanced models outperform baseline </li></ul></ul><ul><li>Properly applied, over-balancing can be used as a surrogate for misclassification costs </li></ul>
  42. 42. Over-Balancing as a Surrogate for Misclassification Costs <ul><li>Apply 80% - 20% over-balancing to the other models. </li></ul>
  43. 43. Combination Models: Voting <ul><li>Smooths out strengths and weaknesses of each model </li></ul><ul><ul><li>Each model supplies a prediction for each record </li></ul></ul><ul><ul><li>Count the votes for each record </li></ul></ul><ul><li>Disadvantage of combination models: </li></ul><ul><ul><li>Lack of easy interpretability </li></ul></ul><ul><li>Four competing combination models… </li></ul>
  44. 44. Combination Models: Voting <ul><li>Mail a Promotion only if: </li></ul><ul><li>All four models predict response </li></ul><ul><ul><li>Protects against false positive </li></ul></ul><ul><ul><li>All four classification algorithms must agree on a positive prediction </li></ul></ul><ul><li>At least three models predict response </li></ul><ul><li>At least two models predict response </li></ul><ul><li>Any model predicts response </li></ul><ul><ul><li>Protects against false negatives </li></ul></ul>
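The four voting schemes above reduce to a single threshold on the vote count; a minimal sketch (function and variable names are illustrative):

```python
def mail_promotion(votes, min_yes):
    """Mail only if at least `min_yes` of the models predict response.

    votes: 0/1 predictions from (CART, C5.0, neural net, logistic regression).
    min_yes=4 guards against false positives; min_yes=1 against false negatives.
    """
    return sum(votes) >= min_yes

mail_promotion([1, 1, 1, 0], min_yes=3)  # True: three models agree
mail_promotion([1, 1, 0, 0], min_yes=4)  # False: unanimity required
```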
  45. 45. Combination Models: Voting <ul><li>None beat the logistic regression model: $2.96 profit per customer </li></ul><ul><li>Perhaps combination models will do better with Model Collection B… </li></ul>
  46. 46. Model Collection B: Non-PCA Models <ul><li>Models retain correlated variables </li></ul><ul><ul><li>Use restricted to prediction only </li></ul></ul><ul><li>Since the correlated variables are highly predictive </li></ul><ul><ul><li>Expect Collection B will outperform the PCA models </li></ul></ul>
  47. 47. Model Collection B: CART and C5.0 <ul><li>Using misclassification costs, and 50% balancing </li></ul><ul><li>Both models outperform the best PCA model </li></ul>
  48. 48. Model Collection B: Over-Balancing <ul><li>Apply over-balancing as a surrogate for misclassification costs for all models </li></ul><ul><li>Best performance thus far. </li></ul>
  49. 49. Combination Models: Voting <ul><li>Combine the four models via voting and 80%-20% over-balancing </li></ul><ul><li>Synergy: Combination model outperforms any individual model. </li></ul>
  50. 50. Combining Models Using Mean Response Probabilities <ul><li>Combine the confidences that each model reports for its decisions </li></ul><ul><ul><li>Allows finer tuning of the decision space </li></ul></ul><ul><li>Derive a new variable: </li></ul><ul><ul><li>Mean Response Probability (MRP): </li></ul></ul><ul><ul><ul><li>Average of response confidences of the four models. </li></ul></ul></ul>
  51. 51. Combining Models Using Mean Response Probabilities <ul><li>The multi-modality of the MRP distribution is due to the discontinuity of the transformation used in its derivation </li></ul>
  52. 52. Combining Models Using Mean Response Probabilities <ul><li>Where shall we define response vs. non-response? </li></ul><ul><ul><li>Recall that FN is 14.2 times worse than FP </li></ul></ul><ul><ul><li>Set the partition on the low side, so that fewer FN decisions are made </li></ul></ul>
  53. 53. Combining Models Using Mean Response Probabilities <ul><li>Optimal partition: near 50%. </li></ul><ul><li>Mail a promotion to a prospective customer only if the mean response probability is at least 50% </li></ul><ul><li>Best model in case study. </li></ul><ul><ul><li>MRP = 0.51 </li></ul></ul><ul><ul><ul><li>$3.1744 profit </li></ul></ul></ul><ul><ul><li>“send to everyone” </li></ul></ul><ul><ul><ul><li>$2.63 profit </li></ul></ul></ul><ul><ul><li>20.7% profit enhancement (54.44 cents) </li></ul></ul>
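The MRP rule from the last two slides can be sketched as follows (the 50% threshold is from the slide; the confidence values and function names are illustrative):

```python
def mean_response_probability(confidences):
    """Average the response confidences reported by the four models."""
    return sum(confidences) / len(confidences)

def mail_promotion(confidences, threshold=0.50):
    """Mail a promotion only if the MRP reaches the threshold."""
    return mean_response_probability(confidences) >= threshold

mail_promotion([0.62, 0.44, 0.51, 0.47])  # MRP = 0.510 -> mail
mail_promotion([0.20, 0.35, 0.41, 0.50])  # MRP = 0.365 -> do not mail
```

Unlike simple voting, averaging the reported confidences gives a continuous score, which allows finer tuning of the decision threshold.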
  54. 54. Summary <ul><li>For more on this Case Study, see Data Mining Methods and Models (Wiley, 2006) </li></ul><ul><li>So, the best part about all this is: </li></ul><ul><ul><li>Data mining is fun! </li></ul></ul><ul><ul><li>If you love to play with data, and you love to construct and evaluate models, then data mining is for you. </li></ul></ul>