CRISP-DM: a methodology for applied computational modelling  (required for CMD cw2) Based on Intro to Data Mining: CRISP-D...
A methodology for projects <ul><li>Cross-Industry Standard Process for Data Mining (CRISP-DM): computational modelling of ...
Why Should There be a Standard Process? <ul><li>Framework for recording experience </li></ul><ul><ul><li>Allows projects t...
Process Standardization <ul><li>CRoss Industry Standard Process for Data Mining </li></ul><ul><li>Initiative launched Sept...
CRISP-DM <ul><li>Non-proprietary </li></ul><ul><li>Application/Industry neutral </li></ul><ul><li>Tool neutral </li></ul><...
CRISP-DM:  Overview
CRISP-DM:  Phases <ul><li>Business Understanding </li></ul><ul><ul><li>Understanding application domain, project objective...
Phases and Tasks Business Understanding Data Understanding Evaluation Data Preparation Modeling Determine  Business Object...
Example applications <ul><li>PAST cw applying CRISP-DM: </li></ul><ul><ul><li>Technologies for Knowledge Discovery (MSc Mo...
Phases in the DM Process (1) <ul><li>Business Understanding: </li></ul><ul><ul><li>Statement of Business Objective </li></...
Phases in TKD’04 cw (1) <ul><li>Business Understanding: </li></ul><ul><ul><li>Explore patient data, what can be predicted?...
Phases in CMD’05 cw (1) <ul><li>Business Understanding: </li></ul><ul><ul><li>Business Objective: school name reflecting g...
Phases in the DM Process (2) <ul><li>Data Understanding </li></ul><ul><ul><li>Explore the data and verify the quality </li...
Phases in TKD’04 cw (2) <ul><li>Data Understanding </li></ul><ul><ul><li>Explore the data and verify the quality: small da...
Phases in CMD’05 cw (2) <ul><li>Data Understanding </li></ul><ul><ul><li>Explore data, decide metric for good/bad “perform...
Phases in the DM Process (3) <ul><li>Data preparation: </li></ul><ul><li>Takes usually over 90% of the time </li></ul><ul>...
Phases in the TKD’04 cw (3) <ul><li>Data preparation: </li></ul><ul><li>Takes usually over 90% of the time </li></ul><ul><...
Phases in CMD’05 cw (3) <ul><li>Data preparation: </li></ul><ul><li>Takes usually over 90% of the time </li></ul><ul><ul><...
Phases in the DM Process(4) <ul><li>Model(l)ing </li></ul><ul><li>Selection of the  modeling techniques is based upon the ...
Phases in the TKD’04 cw (4) <ul><li>Model building </li></ul><ul><ul><ul><li>Try different WEKA models and options </li></...
Phases in CMD’05 cw (4) <ul><li>Model building </li></ul><ul><ul><li>data mining objective: find name attributes which pre...
Types of Models <ul><li>Prediction Models for Predicting and Classifying </li></ul><ul><ul><li>Decision trees, Decision ru...
Phases in the DM Process(5) <ul><li>Model Evaluation </li></ul><ul><ul><li>Evaluation of model:  how well it performed on ...
Phases in TKD’04 cw (5) <ul><li>Model Evaluation </li></ul><ul><ul><li>Evaluation of model:  how well it performed on trai...
Phases in CMD’05 cw (5) <ul><li>Model Evaluation </li></ul><ul><ul><li>Error rates, confusion matrix with names predicting...
Phases in the DM Process (6) <ul><li>Deployment </li></ul><ul><ul><li>Report to client </li></ul></ul><ul><ul><li>Determin...
Phases in TKD’04 cw (6) <ul><li>Deployment </li></ul><ul><ul><li>Report to client (me!) </li></ul></ul><ul><ul><li>Submit ...
Phases in CMD’05 cw (6) <ul><li>Deployment </li></ul><ul><ul><li>Write a Report, section for each CRISP-DM Phase, plus  Ap...
Why CRISP-DM? <ul><li>The data mining process must be reliable and repeatable by people with little data mining skills (e....
Why DM?: Concept Description <ul><li>Descriptive vs. predictive computational modeling </li></ul><ul><ul><li>Descriptive m...
Data Mining vs. Visualization <ul><li>Data Mining:  </li></ul><ul><ul><li>can handle complex data types of the attributes ...
CRISP-DM:  Summary <ul><li>Business Understanding </li></ul><ul><ul><li>Understanding project objectives and requirements ...
Upcoming SlideShare
Loading in …5
×

CRISP-DM methodology for computational modelling projects

1,137 views

Published on

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,137
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
34
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

CRISP-DM methodology for computational modelling projects

  1. 1. CRISP-DM: a methodology for applied computational modelling (required for CMD cw2) Based on Intro to Data Mining: CRISP-DM Prof Chris Clifton, Purdue Univ Thanks to Laura Squier, SPSS for some of the material used
  2. 2. A methodology for projects <ul><li>Cross-Industry Standard Process for Data Mining (CRISP-DM): computational modelling of datasets to derive “knowledge” </li></ul><ul><li>European Community funded effort to develop framework for data mining and computational modeling projects </li></ul><ul><li>Goals: </li></ul><ul><ul><li>Encourage interoperable tools across entire data mining process </li></ul></ul><ul><ul><li>Take the mystery/high-priced expertise out of simple data mining and computational modelling tasks </li></ul></ul>
  3. 3. Why Should There be a Standard Process? <ul><li>Framework for recording experience </li></ul><ul><ul><li>Allows projects to be replicated </li></ul></ul><ul><li>Aid to project planning and management </li></ul><ul><li>“ Comfort factor” for new adopters </li></ul><ul><ul><li>Demonstrates maturity of computational modelling and data mining </li></ul></ul><ul><ul><li>Reduces dependency on “stars” </li></ul></ul>The process must be reliable and repeatable by people with little data mining background.
  4. 4. Process Standardization <ul><li>CRoss Industry Standard Process for Data Mining </li></ul><ul><li>Initiative launched Sept.1996 </li></ul><ul><li>http://www.crisp-dm.org/ </li></ul><ul><li>SPSS/ISL, NCR, Daimler-Benz, OHRA </li></ul><ul><li>Funding from European commission </li></ul><ul><li>Over 200 members of the CRISP-DM SIG worldwide </li></ul><ul><ul><li>DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, .. </li></ul></ul><ul><ul><li>System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche, … </li></ul></ul><ul><ul><li>End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ... </li></ul></ul>
  5. 5. CRISP-DM <ul><li>Non-proprietary </li></ul><ul><li>Application/Industry neutral </li></ul><ul><li>Tool neutral </li></ul><ul><li>Focus on business issues </li></ul><ul><ul><li>modelling is core, but the methodology includes steps before and after involving client business </li></ul></ul><ul><li>Framework for guidance </li></ul><ul><li>Experience base </li></ul><ul><ul><li>Templates for new applications </li></ul></ul>
  6. 6. CRISP-DM: Overview
  7. 7. CRISP-DM: Phases <ul><li>Business Understanding </li></ul><ul><ul><li>Understanding application domain, project objectives and requirements </li></ul></ul><ul><ul><li>Data mining problem definition </li></ul></ul><ul><li>Data Understanding </li></ul><ul><ul><li>Initial data collection and familiarization </li></ul></ul><ul><ul><li>Identify data quality issues </li></ul></ul><ul><ul><li>Initial, obvious results </li></ul></ul><ul><li>Data Preparation </li></ul><ul><ul><li>Record and attribute selection </li></ul></ul><ul><ul><li>Data cleansing </li></ul></ul><ul><li>Modeling </li></ul><ul><ul><li>Run the computational modelling tools, to derive results </li></ul></ul><ul><li>Evaluation </li></ul><ul><ul><li>Determine if results meet business objectives </li></ul></ul><ul><ul><li>Identify business issues that should have been addressed earlier </li></ul></ul><ul><li>Deployment </li></ul><ul><ul><li>Put the resulting models into practice </li></ul></ul><ul><ul><li>Set up for repeated/continuous mining of the data </li></ul></ul>
  8. 8. Phases and Tasks Business Understanding Data Understanding Evaluation Data Preparation Modeling Determine Business Objectives Background Business Objectives Business Success Criteria Situation Assessment Inventory of Resources Requirements, Assumptions, and Constraints Risks and Contingencies Terminology Costs and Benefits Determine Data Mining Goal Data Mining Goals Data Mining Success Criteria Produce Project Plan Project Plan Initial Asessment of Tools and Techniques Collect Initial Data Initial Data Collection Report Describe Data Data Description Report Explore Data Data Exploration Report Verify Data Quality Data Quality Report Data Set Data Set Description Select Data Rationale for Inclusion / Exclusion Clean Data Data Cleaning Report Construct Data Derived Attributes Generated Records Integrate Data Merged Data Format Data Reformatted Data Select Modeling Technique Modeling Technique Modeling Assumptions Generate Test Design Test Design Build Model Parameter Settings Models Model Description Assess Model Model Assessment Revised Parameter Settings Evaluate Results Assessment of Data Mining Results w.r.t. Business Success Criteria Approved Models Review Process Review of Process Determine Next Steps List of Possible Actions Decision Plan Deployment Deployment Plan Plan Monitoring and Maintenance Monitoring and Maintenance Plan Produce Final Report Final Report Final Presentation Review Project Experience Documentation Deployment
  9. 9. Example applications <ul><li>PAST cw applying CRISP-DM: </li></ul><ul><ul><li>Technologies for Knowledge Discovery (MSc Module, discontinued): analysis of patient data from Duke Hospital referrals </li></ul></ul><ul><ul><li>CMD last year: analysis of Schools League Tables to choose a “good” name for a school </li></ul></ul><ul><li>YOU will also need CRISP-DM for YOUR coursework (not the same) </li></ul>
  10. 10. Phases in the DM Process (1) <ul><li>Business Understanding: </li></ul><ul><ul><li>Statement of Business Objective </li></ul></ul><ul><ul><li>Statement of Data Mining objective </li></ul></ul><ul><ul><li>Statement of Success Criteria </li></ul></ul>
  11. 11. Phases in TKD’04 cw (1) <ul><li>Business Understanding: </li></ul><ul><ul><li>Explore patient data, what can be predicted? </li></ul></ul><ul><ul><li>Explore evidence for/against hypotheses </li></ul></ul><ul><ul><li>Statement of Success Criteria: evidence found? </li></ul></ul>
  12. 12. Phases in CMD’05 cw (1) <ul><li>Business Understanding: </li></ul><ul><ul><li>Business Objective: school name reflecting good performance </li></ul></ul><ul><ul><li>Data Mining objective: find name attributes which predict performance </li></ul></ul><ul><ul><li>Success Criteria: evidence for new school name </li></ul></ul>
  13. 13. Phases in the DM Process (2) <ul><li>Data Understanding </li></ul><ul><ul><li>Explore the data and verify the quality </li></ul></ul><ul><ul><li>Find outliers </li></ul></ul>
  14. 14. Phases in TKD’04 cw (2) <ul><li>Data Understanding </li></ul><ul><ul><li>Explore the data and verify the quality: small dataset can be explored with visualization tools </li></ul></ul><ul><ul><li>Find outliers: sparse dataset, many outliers! </li></ul></ul>
  15. 15. Phases in CMD’05 cw (2) <ul><li>Data Understanding </li></ul><ul><ul><li>Explore data, decide metric for good/bad “performance” </li></ul></ul><ul><ul><li>Select and download data for LEAs </li></ul></ul><ul><ul><li>Find outliners, e.g. Special Schools </li></ul></ul>
  16. 16. Phases in the DM Process (3) <ul><li>Data preparation: </li></ul><ul><li>Takes usually over 90% of the time </li></ul><ul><ul><li>Collection </li></ul></ul><ul><ul><li>Assessment </li></ul></ul><ul><ul><li>Consolidation and Cleaning </li></ul></ul><ul><ul><ul><li>table links, aggregation level, missing values, etc </li></ul></ul></ul><ul><ul><li>Data selection </li></ul></ul><ul><ul><ul><li>active role in ignoring non-contributory data? </li></ul></ul></ul><ul><ul><ul><li>outliers? </li></ul></ul></ul><ul><ul><ul><li>Use of samples </li></ul></ul></ul><ul><ul><ul><li>visualization tools </li></ul></ul></ul><ul><ul><li>Transformations - create new variables </li></ul></ul>
  17. 17. Phases in the TKD’04 cw (3) <ul><li>Data preparation: </li></ul><ul><li>Takes usually over 90% of the time </li></ul><ul><ul><li>Not a lot todo: data supplied in ARFF format! </li></ul></ul><ul><ul><li>Transformations - create new variables – but not in this case. </li></ul></ul>
  18. 18. Phases in CMD’05 cw (3) <ul><li>Data preparation: </li></ul><ul><li>Takes usually over 90% of the time </li></ul><ul><ul><li>Download, save as text-file </li></ul></ul><ul><ul><li>Data selection </li></ul></ul><ul><ul><ul><li>ignore non-contributory data e.g. SEN </li></ul></ul></ul><ul><ul><ul><li>Outliers eg Special Schools </li></ul></ul></ul><ul><ul><ul><li>Use of samples? </li></ul></ul></ul><ul><ul><li>Transformations - create new variables: features of name e.g. Grammar, High, town-name, syllables; mapping from existing data to “useful” variables may be non-trivial! </li></ul></ul>
  19. 19. Phases in the DM Process(4) <ul><li>Model(l)ing </li></ul><ul><li>Selection of the modeling techniques is based upon the data mining objective </li></ul><ul><ul><li>Modeling can be an iterative process - different for supervised and unsupervised learning; may model for either description or prediction </li></ul></ul>
  20. 20. Phases in the TKD’04 cw (4) <ul><li>Model building </li></ul><ul><ul><ul><li>Try different WEKA models and options </li></ul></ul></ul><ul><ul><ul><li>Easy to “play and see what happens” </li></ul></ul></ul><ul><ul><ul><li>Keep output or screenshots for models which seem to show evidence </li></ul></ul></ul>
  21. 21. Phases in CMD’05 cw (4) <ul><li>Model building </li></ul><ul><ul><li>data mining objective: find name attributes which predict good performance </li></ul></ul><ul><ul><li>Try various types of model, note any which provide evidence linking an attribute to performance </li></ul></ul>
  22. 22. Types of Models <ul><li>Prediction Models for Predicting and Classifying </li></ul><ul><ul><li>Decision trees, Decision rules, Association rules (symbolic outcome) Regression algorithms (predict numeric outcome), neural networks , rule induction, CART </li></ul></ul><ul><li>Descriptive Models for Grouping and Finding Associations </li></ul><ul><ul><li>Clustering/Grouping algorithms: K-means, Kohonen neural net </li></ul></ul><ul><ul><li>Association algorithms: apriori </li></ul></ul>
  23. 23. Phases in the DM Process(5) <ul><li>Model Evaluation </li></ul><ul><ul><li>Evaluation of model: how well it performed on training/test data </li></ul></ul><ul><ul><li>Methods and criteria depend on model type: </li></ul></ul><ul><ul><ul><li>e.g., confusion matrix, mean error rate </li></ul></ul></ul><ul><ul><li>More importantly: are results “useful” for business objective? </li></ul></ul>
  24. 24. Phases in TKD’04 cw (5) <ul><li>Model Evaluation </li></ul><ul><ul><li>Evaluation of model: how well it performed on training/test data </li></ul></ul><ul><ul><li>Methods and criteria depend on model type: </li></ul></ul><ul><ul><ul><li>e.g., confusion matrix, mean error rate </li></ul></ul></ul><ul><ul><li>More importantly: evidence for/against hypotheses about medical predictions </li></ul></ul>
  25. 25. Phases in CMD’05 cw (5) <ul><li>Model Evaluation </li></ul><ul><ul><li>Error rates, confusion matrix with names predicting “good” schools </li></ul></ul><ul><ul><li>Evaluation wrt business objectives: have you found any indicators of a “good” school name? </li></ul></ul><ul><ul><li>Don’t just present the model, explain how it is evidence </li></ul></ul>
  26. 26. Phases in the DM Process (6) <ul><li>Deployment </li></ul><ul><ul><li>Report to client </li></ul></ul><ul><ul><li>Determine how the results need to be utilized, who needs to use them? </li></ul></ul><ul><ul><li>How often do they need to be used – should the exercise be repeated? </li></ul></ul>
  27. 27. Phases in TKD’04 cw (6) <ul><li>Deployment </li></ul><ul><ul><li>Report to client (me!) </li></ul></ul><ul><ul><li>Submit cw, but no need to think of “future plans” </li></ul></ul>
  28. 28. Phases in CMD’05 cw (6) <ul><li>Deployment </li></ul><ul><ul><li>Write a Report, section for each CRISP-DM Phase, plus Appendices </li></ul></ul><ul><ul><li>Deployment section of report: review of the exercise </li></ul></ul><ul><li>Client should deploy results by: </li></ul><ul><ul><li>Utilizing results as business rules: ?LGS/LGHS merger? </li></ul></ul>
  29. 29. Why CRISP-DM? <ul><li>The data mining process must be reliable and repeatable by people with little data mining skills (e.g. IT consultants, students?...) </li></ul><ul><li>CRISP-DM provides a uniform framework for </li></ul><ul><ul><li>guidelines </li></ul></ul><ul><ul><li>experience documentation </li></ul></ul><ul><li>CRISP-DM is flexible to account for differences </li></ul><ul><ul><li>Different business/agency problems </li></ul></ul><ul><ul><li>Different data types (numeric, database, text, ...) </li></ul></ul>
  30. 30. Why DM?: Concept Description <ul><li>Descriptive vs. predictive computational modeling </li></ul><ul><ul><li>Descriptive modeling : describe concepts or task-relevant data sets in concise, summarative, informative, discriminative forms </li></ul></ul><ul><ul><li>Predictive modeling : Based on data and analysis, construct models for the data, and predict the trend and properties of unknown data </li></ul></ul><ul><li>2 types of Concept description: </li></ul><ul><ul><li>Characterization : provides a concise and succinct summarization of the given collection of data </li></ul></ul><ul><ul><li>Comparison : provides descriptions comparing two or more collections of data </li></ul></ul>
  31. 31. Data Mining vs. Visualization <ul><li>Data Mining: </li></ul><ul><ul><li>can handle complex data types of the attributes and their aggregations </li></ul></ul><ul><ul><li>a more automated process to extract a model </li></ul></ul><ul><li>Visualization: </li></ul><ul><ul><li>restricted to a small number of dimension and measure types </li></ul></ul><ul><ul><li>user-controlled process </li></ul></ul>
  32. 32. CRISP-DM: Summary <ul><li>Business Understanding </li></ul><ul><ul><li>Understanding project objectives and requirements </li></ul></ul><ul><ul><li>Data mining problem definition </li></ul></ul><ul><li>Data Understanding </li></ul><ul><ul><li>Initial data collection and familiarization </li></ul></ul><ul><ul><li>Identify data quality issues </li></ul></ul><ul><ul><li>Initial, obvious results </li></ul></ul><ul><li>Data Preparation </li></ul><ul><ul><li>Record and attribute selection </li></ul></ul><ul><ul><li>Data cleansing </li></ul></ul><ul><li>Modeling </li></ul><ul><ul><li>Run the computational modelling tools </li></ul></ul><ul><li>Evaluation </li></ul><ul><ul><li>Determine if results meet business objectives </li></ul></ul><ul><ul><li>Identify business issues that should have been addressed earlier </li></ul></ul><ul><li>Deployment </li></ul><ul><ul><li>Put the resulting models into practice </li></ul></ul><ul><ul><li>Set up for repeated/continuous mining of the data </li></ul></ul>

×