DATA MINING PROCESS
Lecture 21.11.2012
Barbro Back
DATA MINING PROCESSES - STANDARD PROCESSES
- CRISP-DM: Cross-Industry Standard Process for Data Mining
- SEMMA: specific to SAS
Cross-Industry Standard Process for Data Mining (CRISP-DM) provides an overview of the life cycle of a data mining project.
Six phases:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
Figure: Phases of the CRISP-DM Process Model
CRISP-DM
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
1 BUSINESS UNDERSTANDING
Includes:
- Determining business objectives
  - A managerial need for new knowledge
  - What types of customers are interested in each of our products?
  - What are the typical profiles of our customers, and how much value does each of them provide to us?
- Assessing the current situation
- Establishing data mining goals
- Developing a project plan, including a budget
2 DATA UNDERSTANDING
Select the data
Three important issues:
- Set up a concise and clear description of the problem
- Identify the relevant data for the problem description
  (the selected variables should be independent of each other; whether this matters depends on the method)
- Types of data
  - Demographic data: income, education, gender, etc.
  - Socio-graphic data: hobbies, club memberships, etc.
  - Transactional data: sales records, credit card spending, etc.
  - Quantitative data: numerical values
  - Qualitative data: nominal and ordinal data
SCALES
- Nominal: no order between data points (e.g. gender)
- Ordinal: order between data points (e.g. ranking results)
- Interval: order between data points and equal distances between measurements, but no true zero point
- Ratio: an interval scale with a true zero point (e.g. "sales have doubled": sales were 1 million the previous month and 2 million this month)
Question: Is the Likert scale an ordinal or an interval scale?
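The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below reuses the sales figures from the slide; the temperature values are an illustrative assumption that is not part of the lecture, chosen because Celsius is a familiar interval scale without a true zero point.

```python
def ratio(a, b):
    """Return how many times larger a is than b."""
    return a / b

# Ratio scale (true zero point): "sales have doubled" is a meaningful statement.
sales_ratio = ratio(2_000_000, 1_000_000)

# Interval scale (no true zero point): 20 degrees Celsius is not "twice as warm"
# as 10 degrees Celsius. Converting the same two temperatures to Kelvin, which
# has a true zero point, gives a very different ratio.
celsius_ratio = ratio(20.0, 10.0)
kelvin_ratio = ratio(20.0 + 273.15, 10.0 + 273.15)

print(sales_ratio)    # 2.0
print(celsius_ratio)  # 2.0, but only an artifact of where Celsius puts its zero
print(round(kelvin_ratio, 3))
```

Differences, in contrast, are meaningful on both scales: the gap between 10 and 20 degrees is the same in Celsius and in Kelvin.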
3 DATA PREPARATION
- Clean data for better quality
  - Convert data to be consistent
  - Treatment of missing values
  - Redundant data
- Determine the data types. In SPSS Modeler the following data types are used:
  - RANGE: numeric values (integer, real)
  - FLAG: binary (yes/no, 0/1)
  - SET: data with multiple distinct values (string)
  - TYPELESS: for other types of data
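Two of the preparation steps named on the slide, treating missing values and determining data types, can be sketched in plain Python. This is a minimal illustration under stated assumptions: `fill_missing_with_mean` and `infer_type` are hypothetical helper names, mean imputation is only one of several possible treatments, and the typing rules are a rough reading of the four labels on the slide, not SPSS Modeler's actual logic.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries in a numeric column with the mean of the known values."""
    known = [v for v in values if v is not None]
    column_mean = mean(known)
    return [column_mean if v is None else v for v in values]

def infer_type(values):
    """Guess a slide-style type label for a column (None marks a missing value)."""
    present = [v for v in values if v is not None]
    if not present:
        return "TYPELESS"
    if len(set(present)) == 2:
        return "FLAG"  # exactly two distinct values, e.g. yes/no or 0/1
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "RANGE"  # numeric values
    if all(isinstance(v, str) for v in present):
        return "SET"  # several distinct string values
    return "TYPELESS"

# Hypothetical example columns
income = [1200.0, None, 1800.0]
print(fill_missing_with_mean(income))                      # [1200.0, 1500.0, 1800.0]
print(infer_type([1200.0, 3400.5, None, 2100.0]))          # RANGE
print(infer_type(["yes", "no", "yes"]))                    # FLAG
print(infer_type(["student", "employed", "retired"]))      # SET
```

Whether a missing value should be imputed, flagged, or the record dropped depends on the data and the method used later; the mean is shown here only because it is the simplest choice.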
4 MODELING
- Data treatment
  - Training set, validation set, test set
- Data mining techniques
  - Association
  - Classification
  - Clustering (segmentation)
  - Predictions
  - Sequential patterns
  - Similar time sequences
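The split into training, validation and test sets can be sketched as follows. The function name `split_dataset`, the 60/20/20 proportions and the fixed random seed are assumptions made for illustration; the lecture does not prescribe any of them.

```python
import random

def split_dataset(records, train_share=0.6, validation_share=0.2, seed=0):
    """Shuffle records and cut them into training, validation and test lists."""
    if train_share < 0 or validation_share < 0 or train_share + validation_share > 1:
        raise ValueError("shares must be non-negative and sum to at most 1")
    shuffled = list(records)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = int(len(shuffled) * train_share)
    n_validation = int(len(shuffled) * validation_share)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return training, validation, test

# Hypothetical data: 100 customer ids
customers = list(range(100))
training, validation, test = split_dataset(customers)
print(len(training), len(validation), len(test))  # 60 20 20
```

The model is fitted on the training set, tuned on the validation set, and its final performance is reported on the test set, which it has not seen before.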
5 EVALUATION
- How to recognize the business value of the knowledge discovered
  - A puzzle to be solved between data analysts, business analysts and decision makers
- Which visualization tool to use
  - Pie charts, histograms, box plots, scatter plots, self-organizing maps
6 DEPLOYMENT
- Results need to be reported to project sponsors
- Monitoring for change is important
SEMMA (BY THE SAS INSTITUTE)
- Sample
- Explore
- Modify
- Model
- Assess
See http://www.sas.com/offices/europe/uk/technologies/analytics/datamining/miner/semma.html
AN APPLICATION EXAMPLE (CRISP-DM)
Goal: to predict which customers would become insolvent, early enough for the firm to take preventive action
- Billing period was 2 months
  - Customers used their phone for 4 weeks
  - Received the bill about 1 week later
  - Payment was due 30 days after receiving the bill
  - Action was taken if the bill was not paid within 14 days after the due date
  - Phone was disconnected if the bill exceeded a certain amount
- Hypothesis: customers change their calling behaviour before becoming insolvent
EXAMPLE CONT.
- Data
  - 100 000 customers
  - 17-month period
- Discriminant analysis, decision trees and neural networks were used
  - 2066 cases
  - 46 initial variables
- Costs were allocated to misclassification errors
- Final result: 89.8 % correctly classified with test data and a cost function = 360 €, compared to 14 580 € in the first run
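Allocating costs to misclassification errors means that a classifier is judged by the total cost of its mistakes as well as by the share of correctly classified cases. The sketch below shows one way such a cost function could be computed. The function name `evaluate`, the unit costs (100 for a missed insolvent customer, 10 for a false alarm) and the five example cases are hypothetical; the slide does not report the cost values used in the study.

```python
def evaluate(actual, predicted, cost_missed_insolvent, cost_false_alarm):
    """Return (accuracy, total misclassification cost); 1 = insolvent, 0 = solvent."""
    correct = 0
    total_cost = 0
    for truth, guess in zip(actual, predicted):
        if truth == guess:
            correct += 1
        elif truth == 1:
            # insolvent customer predicted as solvent
            total_cost += cost_missed_insolvent
        else:
            # solvent customer predicted as insolvent
            total_cost += cost_false_alarm
    return correct / len(actual), total_cost

# Hypothetical test cases and unit costs
actual = [1, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0]
accuracy, cost = evaluate(actual, predicted,
                          cost_missed_insolvent=100, cost_false_alarm=10)
print(accuracy, cost)  # 0.6 110
```

With unequal costs, two models with the same accuracy can differ widely in total cost, which is why the final result above is reported as both a percentage and a cost figure.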