CRISP-DM methodology for computational modelling projects
 

    Presentation Transcript

    • CRISP-DM: a methodology for applied computational modelling (required for CMD cw2)
      Based on “Intro to Data Mining: CRISP-DM”, Prof. Chris Clifton, Purdue Univ. Thanks to Laura Squier, SPSS, for some of the material used.
    • A methodology for projects
      • Cross-Industry Standard Process for Data Mining (CRISP-DM): computational modelling of datasets to derive “knowledge”
      • European Community funded effort to develop framework for data mining and computational modeling projects
      • Goals:
        • Encourage interoperable tools across entire data mining process
        • Take the mystery/high-priced expertise out of simple data mining and computational modelling tasks
    • Why Should There be a Standard Process?
      • Framework for recording experience
        • Allows projects to be replicated
      • Aid to project planning and management
      • “Comfort factor” for new adopters
        • Demonstrates maturity of computational modelling and data mining
        • Reduces dependency on “stars”
      The process must be reliable and repeatable by people with little data mining background.
    • Process Standardization
      • CRoss Industry Standard Process for Data Mining
      • Initiative launched Sept. 1996
      • http://www.crisp-dm.org/
      • SPSS/ISL, NCR, Daimler-Benz, OHRA
      • Funding from European commission
      • Over 200 members of the CRISP-DM SIG worldwide
        • DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, ..
        • System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche, …
        • End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...
    • CRISP-DM
      • Non-proprietary
      • Application/Industry neutral
      • Tool neutral
      • Focus on business issues
        • Modelling is core, but the methodology includes steps before and after that involve the client's business
      • Framework for guidance
      • Experience base
        • Templates for new applications
    • CRISP-DM: Overview
    • CRISP-DM: Phases
      • Business Understanding
        • Understanding application domain, project objectives and requirements
        • Data mining problem definition
      • Data Understanding
        • Initial data collection and familiarization
        • Identify data quality issues
        • Initial, obvious results
      • Data Preparation
        • Record and attribute selection
        • Data cleansing
      • Modeling
        • Run the computational modelling tools, to derive results
      • Evaluation
        • Determine if results meet business objectives
        • Identify business issues that should have been addressed earlier
      • Deployment
        • Put the resulting models into practice
        • Set up for repeated/continuous mining of the data
    • Phases and Tasks (generic tasks and their outputs in each phase)
      • Business Understanding
        • Determine Business Objectives: Background; Business Objectives; Business Success Criteria
        • Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
        • Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
        • Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
      • Data Understanding
        • Collect Initial Data: Initial Data Collection Report
        • Describe Data: Data Description Report
        • Explore Data: Data Exploration Report
        • Verify Data Quality: Data Quality Report
      • Data Preparation: Data Set; Data Set Description
        • Select Data: Rationale for Inclusion/Exclusion
        • Clean Data: Data Cleaning Report
        • Construct Data: Derived Attributes; Generated Records
        • Integrate Data: Merged Data
        • Format Data: Reformatted Data
      • Modeling
        • Select Modeling Technique: Modeling Technique; Modeling Assumptions
        • Generate Test Design: Test Design
        • Build Model: Parameter Settings; Models; Model Description
        • Assess Model: Model Assessment; Revised Parameter Settings
      • Evaluation
        • Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
        • Review Process: Review of Process
        • Determine Next Steps: List of Possible Actions; Decision
      • Deployment
        • Plan Deployment: Deployment Plan
        • Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
        • Produce Final Report: Final Report; Final Presentation
        • Review Project: Experience Documentation
    • Example applications
      • PAST cw applying CRISP-DM:
        • Technologies for Knowledge Discovery (MSc Module, discontinued): analysis of patient data from Duke Hospital referrals
        • CMD last year: analysis of Schools League Tables to choose a “good” name for a school
      • YOU will also need CRISP-DM for YOUR coursework (not the same)
    • Phases in the DM Process (1)
      • Business Understanding:
        • Statement of Business Objective
        • Statement of Data Mining objective
        • Statement of Success Criteria
    • Phases in TKD’04 cw (1)
      • Business Understanding:
        • Explore patient data, what can be predicted?
        • Explore evidence for/against hypotheses
        • Statement of Success Criteria: evidence found?
    • Phases in CMD’05 cw (1)
      • Business Understanding:
        • Business Objective: school name reflecting good performance
        • Data Mining objective: find name attributes which predict performance
        • Success Criteria: evidence for new school name
    • Phases in the DM Process (2)
      • Data Understanding
        • Explore the data and verify the quality
        • Find outliers
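      A minimal sketch of the outlier-spotting step, assuming the data has been loaded into a pandas DataFrame; the column name and values below are invented for illustration, not taken from any coursework data.

```python
# Hypothetical example: flag rows outside 1.5*IQR of a numeric column.
import pandas as pd

df = pd.DataFrame({"score": [55, 60, 58, 62, 61, 59, 12, 95]})  # made-up values

q1, q3 = df["score"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["score"] < q1 - 1.5 * iqr) | (df["score"] > q3 + 1.5 * iqr)]
print(outliers)  # candidate outliers to inspect during data understanding
```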
    • Phases in TKD’04 cw (2)
      • Data Understanding
        • Explore the data and verify the quality: small dataset can be explored with visualization tools
        • Find outliers: sparse dataset, many outliers!
    • Phases in CMD’05 cw (2)
      • Data Understanding
        • Explore data, decide metric for good/bad “performance”
        • Select and download data for LEAs
        • Find outliers, e.g. Special Schools
    • Phases in the DM Process (3)
      • Data preparation:
      • Usually takes over 90% of the time
        • Collection
        • Assessment
        • Consolidation and Cleaning
          • Table links, aggregation level, missing values, etc.
        • Data selection
          • active role in ignoring non-contributory data?
          • outliers?
          • Use of samples
          • visualization tools
        • Transformations - create new variables
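      As an illustration of the preparation steps above (cleaning, selection, derived variables), here is a minimal Python/pandas sketch; the slides themselves are tool-neutral, and the table and column names below are invented.

```python
# Hypothetical data-preparation pipeline: clean, select, transform.
import pandas as pd

raw = pd.DataFrame({
    "school":    ["A High School", "B Grammar School", None],
    "pupils":    [900, 750, 400],
    "pass_rate": [62.0, None, 48.0],
})

prepared = (
    raw.dropna(subset=["school"])                                                  # cleaning: drop unusable records
       .assign(pass_rate=lambda d: d["pass_rate"].fillna(d["pass_rate"].mean()))   # fill missing values
       .query("pupils > 500")                                                      # selection: ignore non-contributory rows
       .assign(pass_frac=lambda d: d["pass_rate"] / 100)                           # transformation: derive a new variable
)
print(prepared)
```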
    • Phases in the TKD’04 cw (3)
      • Data preparation:
      • Usually takes over 90% of the time
        • Not a lot to do: data supplied in ARFF format!
        • Transformations - create new variables – but not in this case.
    • Phases in CMD’05 cw (3)
      • Data preparation:
      • Usually takes over 90% of the time
        • Download, save as a text file
        • Data selection
          • Ignore non-contributory data, e.g. SEN
          • Outliers, e.g. Special Schools
          • Use of samples?
        • Transformations - create new variables: features of the name, e.g. Grammar, High, town name, syllables; mapping from existing data to “useful” variables may be non-trivial! (See the sketch below.)
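      A hedged sketch of the name-feature transformation suggested above: derive attributes such as “contains Grammar”, “contains High” and a rough syllable count from a school name. The helper name and the syllable heuristic are illustrative assumptions, not part of the coursework specification.

```python
# Hypothetical feature extraction from a school name.
import re

def name_features(name):
    words = name.split()
    return {
        "has_grammar": "Grammar" in words,
        "has_high": "High" in words,
        "n_words": len(words),
        # crude syllable estimate: count groups of vowels in the name
        "syllables": len(re.findall(r"[aeiouy]+", name.lower())),
    }

print(name_features("Loughborough Grammar School"))
```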
    • Phases in the DM Process(4)
      • Model(l)ing
      • Selection of the modeling techniques is based upon the data mining objective
        • Modeling can be an iterative process - different for supervised and unsupervised learning; may model for either description or prediction
    • Phases in the TKD’04 cw (4)
      • Model building
          • Try different WEKA models and options
          • Easy to “play and see what happens”
          • Keep output or screenshots for models which seem to show evidence
    • Phases in CMD’05 cw (4)
      • Model building
        • data mining objective: find name attributes which predict good performance
        • Try various types of model, note any which provide evidence linking an attribute to performance
    • Types of Models
      • Prediction Models for Predicting and Classifying
        • Decision trees, decision rules, association rules (symbolic outcome); regression algorithms (numeric outcome); neural networks, rule induction, CART
      • Descriptive Models for Grouping and Finding Associations
        • Clustering/Grouping algorithms: K-means, Kohonen neural net
        • Association algorithms: apriori
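      Two tiny examples of the model families listed above, sketched in Python with scikit-learn (an assumption; the coursework itself used WEKA): a decision tree as a predictive model and K-means as a descriptive/clustering model.

```python
# Predictive vs. descriptive models on a toy dataset (Iris).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier   # predictive: classification
from sklearn.cluster import KMeans                # descriptive: clustering

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)   # supervised: learns to predict a class label
print(tree.predict(X[:2]))

km = KMeans(n_clusters=3, n_init=10).fit(X)            # unsupervised: groups similar records
print(km.labels_[:10])
```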
    • Phases in the DM Process(5)
      • Model Evaluation
        • Evaluation of model: how well it performed on training/test data
        • Methods and criteria depend on model type:
          • e.g., confusion matrix, mean error rate
        • More importantly: are results “useful” for business objective?
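      A minimal sketch of the evaluation criteria mentioned above: a confusion matrix and a mean error rate on held-out predictions. The labels are invented; with scikit-learn (an assumption, WEKA reports equivalent statistics) it looks roughly like this.

```python
# Hypothetical evaluation of a classifier's predictions.
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = ["good", "good", "bad", "bad", "good", "bad"]   # held-out test labels (made up)
y_pred = ["good", "bad",  "bad", "bad", "good", "good"]  # model predictions (made up)

print(confusion_matrix(y_true, y_pred, labels=["good", "bad"]))
print("error rate:", 1 - accuracy_score(y_true, y_pred))
```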
    • Phases in TKD’04 cw (5)
      • Model Evaluation
        • Evaluation of model: how well it performed on training/test data
        • Methods and criteria depend on model type:
          • e.g., confusion matrix, mean error rate
        • More importantly: evidence for/against hypotheses about medical predictions
    • Phases in CMD’05 cw (5)
      • Model Evaluation
        • Error rates, confusion matrix with names predicting “good” schools
        • Evaluation wrt business objectives: have you found any indicators of a “good” school name?
        • Don’t just present the model, explain how it is evidence
    • Phases in the DM Process (6)
      • Deployment
        • Report to client
        • Determine how the results need to be utilized and who needs to use them
        • How often do they need to be used – should the exercise be repeated?
    • Phases in TKD’04 cw (6)
      • Deployment
        • Report to client (me!)
        • Submit cw, but no need to think of “future plans”
    • Phases in CMD’05 cw (6)
      • Deployment
        • Write a Report, section for each CRISP-DM Phase, plus Appendices
        • Deployment section of report: review of the exercise
      • Client should deploy results by:
        • Utilizing results as business rules: LGS/LGHS merger?
    • Why CRISP-DM?
      • The data mining process must be reliable and repeatable by people with little data mining background (e.g. IT consultants, students?...)
      • CRISP-DM provides a uniform framework for
        • guidelines
        • experience documentation
      • CRISP-DM is flexible to account for differences
        • Different business/agency problems
        • Different data types (numeric, database, text, ...)
    • Why DM?: Concept Description
      • Descriptive vs. predictive computational modeling
        • Descriptive modeling: describe concepts or task-relevant data sets in concise, summarized, informative, discriminative forms
        • Predictive modeling: based on data and analysis, construct models for the data, and predict the trend and properties of unknown data
      • 2 types of Concept description:
        • Characterization: provides a concise and succinct summarization of the given collection of data
        • Comparison: provides descriptions comparing two or more collections of data
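      A small illustration of the two forms of concept description, using pandas on invented data: characterize one collection with summary statistics, then compare two collections side by side.

```python
# Characterization vs. comparison on a toy table.
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 12, 11, 20, 22, 19],
})

print(df["value"].describe())                # characterization: summarize one collection
print(df.groupby("group")["value"].mean())   # comparison: contrast two collections
```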
    • Data Mining vs. Visualization
      • Data Mining:
        • Can handle complex attribute data types and their aggregations
        • a more automated process to extract a model
      • Visualization:
        • Restricted to a small number of dimensions and measure types
        • user-controlled process
    • CRISP-DM: Summary
      • Business Understanding
        • Understanding project objectives and requirements
        • Data mining problem definition
      • Data Understanding
        • Initial data collection and familiarization
        • Identify data quality issues
        • Initial, obvious results
      • Data Preparation
        • Record and attribute selection
        • Data cleansing
      • Modeling
        • Run the computational modelling tools
      • Evaluation
        • Determine if results meet business objectives
        • Identify business issues that should have been addressed earlier
      • Deployment
        • Put the resulting models into practice
        • Set up for repeated/continuous mining of the data