Introduction to Clementine



  1. 1. Introduction to Clementine (Data Mining Tutorial). Tutors: Cecia Chan & Gabriel Fung
  2. 2. A Brief Review of Data Mining (I) <ul><li>Data mining is… </li></ul><ul><ul><li>A process of extracting previously unknown, valid, and actionable knowledge from large databases </li></ul></ul><ul><li>A rule of thumb: </li></ul><ul><ul><li>If we know clearly the shape and likely content of what we are looking for, we are probably not dealing with data mining </li></ul></ul>
  3. 3. A Brief Review of Data Mining (II) <ul><li>Therefore, data mining is not … </li></ul><ul><ul><li>SQL queries against any number of disparate databases or data warehouses </li></ul></ul><ul><ul><li>SQL queries in a parallel or massively parallel environment </li></ul></ul><ul><ul><li>Information retrieval, for example, through intelligent agents </li></ul></ul><ul><ul><li>Multidimensional database analysis (MDA) </li></ul></ul><ul><ul><li>OLAP </li></ul></ul><ul><ul><li>Exploratory data analysis (EDA) </li></ul></ul><ul><ul><li>Graphical visualization </li></ul></ul><ul><ul><li>Traditional statistical processing against a data warehouse </li></ul></ul><ul><li>However, they are all related to data mining </li></ul>
  4. 4. Data Mining Process <ul><li>Business objective(s) determination </li></ul><ul><ul><li>What is your goal? </li></ul></ul><ul><li>Data collection </li></ul><ul><ul><li>You can learn nothing without data… </li></ul></ul><ul><li>Data preprocessing (or data preparation) </li></ul><ul><ul><li>Remove outliers / filter noise / modify fields / etc. </li></ul></ul><ul><li>Modeling </li></ul><ul><ul><li>The core part of data mining </li></ul></ul><ul><li>Evaluation </li></ul><ul><ul><li>See what you have learned! </li></ul></ul>
  5. 5. Data Mining Software <ul><li>Existing data mining software: </li></ul><ul><ul><li>Clementine from SPSS (we have this software), Enterprise Miner from SAS (we have this software), Intelligent Miner from IBM (we have this software), MineSet from Silicon Graphics, K-wiz from Compression Sciences Ltd., DBMiner from DBMiner Tech. Inc., PolyAnalyst from Megaputer Intelligence, StatServer from Mathsoft, etc. </li></ul></ul>
  6. 6. Problem Statement <ul><li>Situation: </li></ul><ul><ul><li>You are a researcher compiling data for a medical study </li></ul></ul><ul><ul><li>You have collected data about a set of patients, all of whom suffered from the same illness </li></ul></ul><ul><ul><li>Each patient responded to one of five drug treatments </li></ul></ul>
  7. 7. Step 1: Business objective <ul><li>Figure out which drug might be appropriate for a future patient with the same illness </li></ul><ul><li>Here are the data collected: </li></ul><ul><ul><li>Age </li></ul></ul><ul><ul><li>Sex (M or F) </li></ul></ul><ul><ul><li>BP (Blood pressure: High, normal, or low) </li></ul></ul><ul><ul><li>Weight (The weight of the patient) </li></ul></ul><ul><ul><li>Cholesterol (Blood cholesterol: Normal or high) </li></ul></ul><ul><ul><li>Na (Blood sodium concentration) </li></ul></ul><ul><ul><li>K (Blood potassium concentration) </li></ul></ul><ul><ul><li>Drug (Drug to which the patient responded) </li></ul></ul>
  8. 8. Using Clementine (1) <ul><li>Clementine is located in… </li></ul><ul><ul><li>Start → All Programs → Clementine 6.0.2 </li></ul></ul>(Screenshot callouts: Models, Nodes, Workspace)
  9. 9. Using Clementine (2) <ul><li>Nodes in the workspace represent different objects and actions. You connect the nodes to form streams, which, when executed, let you visualize relationships and draw conclusions. </li></ul>
  10. 10. Step 2: Data Collection (1) (Screenshot callouts: "Nodes for inputting the collected data"; "Double click")
  11. 11. Data Collection (2) (Screenshot callouts: location of your file; how many columns to use from the file; whether the first row specifies the names of the fields; other details)
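The import settings listed on this slide (file location, number of columns to use, whether the first row holds the field names) have rough equivalents outside Clementine. The following is a minimal Python sketch of the same idea, not a description of how Clementine itself reads files. The function name `read_records`, its parameters, and the sample rows are all assumptions made for illustration; the values are invented and do not come from the tutorial's data file.

```python
import csv
import io

# Hypothetical sample standing in for the tutorial's data file; values are made up.
SAMPLE = """Age,Sex,BP,Weight,Cholesterol,Na,K,Drug
23,F,HIGH,60.5,HIGH,0.79,0.03,drugY
47,M,LOW,82.0,HIGH,0.74,0.06,drugC
"""

def read_records(source, has_header=True, use_columns=None):
    """Read delimited text into a list of dicts, one dict per data row."""
    rows = list(csv.reader(source))
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # No header row: generate placeholder field names instead.
        names = [f"field{i + 1}" for i in range(len(rows[0]))] if rows else []
        data = rows
    if use_columns is not None:
        # Keep only the first `use_columns` fields of every row.
        names = names[:use_columns]
    return [dict(zip(names, row)) for row in data]

records = read_records(io.StringIO(SAMPLE))
print(records[0]["Drug"])
```

To read a real file, an open file object can be passed in place of the `io.StringIO` wrapper.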
  12. 12. Step 3: Data Preparation – Explore the Data (1) <ul><li>Nodes for exploration/visualization: </li></ul><ul><ul><li>Table (in the Output panel) </li></ul></ul><ul><ul><li>Plot (in the Graphs panel) </li></ul></ul><ul><ul><li>Histogram (in the Graphs panel) </li></ul></ul><ul><ul><li>Distribution (in the Graphs panel) </li></ul></ul><ul><ul><li>Web (in the Graphs panel) </li></ul></ul>
  13. 13. Step 3: Data Preparation – Explore the Data (2) <ul><li>Connect the nodes: </li></ul>Note: Connect the nodes by clicking and dragging with the middle mouse button. (Screenshot callout: "Double click")
  14. 14. Step 3: Data Preparation – Explore the Data (3) <ul><li>Execution: </li></ul>Note: Right click on the table node to display this menu
  15. 15. Step 3: Data Preparation – Explore the Data (4) <ul><li>Other nodes (Please try the other nodes yourself): </li></ul><ul><ul><li>Histogram: </li></ul></ul>
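For readers without Clementine at hand, the idea behind the Distribution and Histogram nodes can be approximated in a few lines of Python: count how often each category occurs, and count how many numeric values fall into each bin. This is only a sketch of the concept. The helper names `distribution` and `histogram` and the patient records are hypothetical, and the values are invented.

```python
from collections import Counter

# Hypothetical patient records; values are invented for illustration.
patients = [
    {"Age": 23, "BP": "HIGH", "Drug": "drugY"},
    {"Age": 47, "BP": "LOW", "Drug": "drugC"},
    {"Age": 51, "BP": "HIGH", "Drug": "drugY"},
    {"Age": 28, "BP": "NORMAL", "Drug": "drugX"},
]

def distribution(records, field):
    """Count how often each value of a categorical field occurs."""
    return Counter(r[field] for r in records)

def histogram(records, field, bin_width):
    """Count numeric values per bin; each key is the lower edge of a bin."""
    return Counter((r[field] // bin_width) * bin_width for r in records)

drug_counts = distribution(patients, "Drug")
age_bins = histogram(patients, "Age", 10)

# Print a crude text histogram, one row per occupied bin.
for edge in sorted(age_bins):
    print(f"[{edge}, {edge + 10}): {'*' * age_bins[edge]}")
```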
  16. 16. Step 3: Data Preparation – Modify the Data (1) <ul><li>Replacing values: </li></ul><ul><ul><li>Use Filler node: </li></ul></ul><ul><ul><ul><li>Suppose we want to transform every weight to its log value (Note: we usually log-transform a variable only when it is highly skewed): </li></ul></ul></ul>
  17. 17. Step 3: Data Preparation – Modify the Data (2) <ul><li>Derive a new value: </li></ul><ul><ul><li>Use Derive node: </li></ul></ul><ul><ul><ul><li>Suppose we want to combine Na and K: </li></ul></ul></ul>
  18. 18. Step 3: Data Preparation – Modify the Data (3) <ul><li>Remove some fields </li></ul><ul><ul><li>Use Filter node </li></ul></ul><ul><ul><ul><li>Suppose we have derived a new field Na_Over_K; now we need to remove the fields Na and K: </li></ul></ul></ul>
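The three modifications in this step (replace Weight with its log, derive Na_Over_K, then drop Na and K) can be sketched in plain Python as follows. This mirrors the intent of the Filler, Derive, and Filter nodes rather than their actual implementation. Two points are assumptions: that the log is the natural log (the slides do not name a base), and that Na_Over_K means Na divided by K, as the field name suggests. The `prepare` function and the record values are hypothetical.

```python
import math

# Hypothetical records; values are invented for illustration.
records = [
    {"Age": 23, "Weight": 60.5, "Na": 0.79, "K": 0.03, "Drug": "drugY"},
    {"Age": 47, "Weight": 82.0, "Na": 0.74, "K": 0.06, "Drug": "drugC"},
]

def prepare(record):
    """Apply the three modifications to one record and return a new dict."""
    out = dict(record)  # copy, so the original record is left untouched
    # Filler-style step: replace Weight with its natural log (assumed base).
    out["Weight"] = math.log(out["Weight"])
    # Derive-style step: combine Na and K into a single ratio field.
    out["Na_Over_K"] = out["Na"] / out["K"]
    # Filter-style step: drop the two fields that the new field replaces.
    del out["Na"], out["K"]
    return out

prepared = [prepare(r) for r in records]
print(sorted(prepared[0]))
```

Keeping the steps in one function makes it easy to apply exactly the same preparation to any other data set later.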
  19. 19. Step 4: Modeling – Define fields <ul><li>Define the fields </li></ul><ul><ul><li>Use Type node: </li></ul></ul>
  20. 20. Step 4: Modeling – Build a Model (1) <ul><li>It is the core part of data mining. </li></ul><ul><li>Supervised Learning: </li></ul><ul><ul><li>Train Net (Neural Network) </li></ul></ul><ul><ul><li>C5.0 (C5.0 Decision Tree) </li></ul></ul><ul><ul><li>Linear Reg. (Linear regression) </li></ul></ul><ul><ul><li>C & R Tree (Classification and Regression Tree, CART) </li></ul></ul><ul><li>Unsupervised Learning: </li></ul><ul><ul><li>Train Kohonen (Self-Organizing Map, SOM) </li></ul></ul><ul><ul><li>Train KMeans (K-means Clustering) </li></ul></ul><ul><ul><li>TwoStep (A kind of Hierarchical Clustering) </li></ul></ul><ul><li>Others: </li></ul><ul><ul><li>GRI (Association Rule mining) </li></ul></ul><ul><ul><li>Apriori (Association Rule mining) </li></ul></ul><ul><ul><li>Factor / PCA (Factor analysis, attribute selection technique) </li></ul></ul>
  21. 21. Step 4: Modeling – Build a Model (2) <ul><li>Build what model? </li></ul><ul><ul><li>Recall that our objective is to determine which type of drug is suitable for a specific patient. </li></ul></ul><ul><ul><li>Thus, it is a classification problem (supervised learning) </li></ul></ul><ul><li>In this tutorial, we use: </li></ul><ul><ul><li>C5.0 and C & R Tree </li></ul></ul>
  22. 22. Step 4: Modeling – Build a Model (3) <ul><li>Note: </li></ul><ul><ul><li>There are many complex settings for each model </li></ul></ul><ul><ul><li>In this tutorial, we use the default settings </li></ul></ul><ul><ul><li>Fine-tuning a model requires solid experience in data mining </li></ul></ul>
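Decision-tree learners choose, at each node, the field whose split best separates the classes. The sketch below illustrates one common criterion, information gain (the drop in entropy after a split). It is meant only to convey the idea and is not Clementine's C5.0 or C & R Tree implementation: those add refinements, and CART-style trees commonly use other impurity measures such as Gini. The function names and the patient records are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(records, field, target):
    """Entropy of the target minus the weighted entropy after splitting on field."""
    groups = defaultdict(list)
    for r in records:
        groups[r[field]].append(r[target])
    before = entropy([r[target] for r in records])
    after = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return before - after

# Hypothetical records in which BP fully determines Drug and Sex is irrelevant.
patients = [
    {"BP": "HIGH", "Sex": "F", "Drug": "drugA"},
    {"BP": "HIGH", "Sex": "M", "Drug": "drugA"},
    {"BP": "LOW", "Sex": "F", "Drug": "drugC"},
    {"BP": "LOW", "Sex": "M", "Drug": "drugC"},
]

gain_bp = information_gain(patients, "BP", "Drug")
gain_sex = information_gain(patients, "Sex", "Drug")
print(gain_bp, gain_sex)
```

In this toy data a tree learner using this criterion would split on BP first, because that split leaves no uncertainty about the drug.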
  23. 23. Step 5: Evaluation (1) <ul><li>It means NOTHING even if we have learned SOMETHING, until the knowledge that we have learned is ACTIONABLE and VALID </li></ul><ul><li>Remember: </li></ul><ul><ul><li>The data sets for training and testing are ALWAYS different (why?) </li></ul></ul>
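One answer to the "why?": a model scored on the very records it was trained on can look accurate simply because it has memorized them, which says little about future patients. When only a single data set is available, a common remedy is to hold part of it back for testing. The sketch below shows such a holdout split. The function name `holdout_split`, the 30% test fraction, and the fixed seed are illustrative assumptions, not settings from the tutorial.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into disjoint train/test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))  # stand-ins for ten patient records
train, test = holdout_split(records)
print(len(train), len(test))
```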
  24. 24. Step 5: Evaluation (2) <ul><li>Create the following flow </li></ul>Note: The flow must be the same as in the training stage
  25. 25. Step 5: Evaluation (3) <ul><li>Different results: </li></ul><ul><ul><li>Different models can yield completely different results </li></ul></ul><ul><ul><li>Choosing and tuning a good model is a difficult job </li></ul></ul><ul><ul><li>In this tutorial, we introduce only the process of data mining </li></ul></ul>
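To make "different models can yield completely different results" concrete, the sketch below scores two models on the same held-out records and compares their accuracy. Everything here is hypothetical: the two hand-written rules stand in for trained C5.0 and C & R Tree models, and the thresholds and test records are invented for illustration.

```python
def accuracy(model, records, target):
    """Fraction of records whose target value the model predicts correctly."""
    correct = sum(1 for r in records if model(r) == r[target])
    return correct / len(records)

# Two hypothetical hand-written "models"; a real stream would apply the
# trained C5.0 and C & R Tree models instead. Thresholds are invented.
def model_a(r):
    return "drugY" if r["Na_Over_K"] > 15 else "drugX"

def model_b(r):
    return "drugY" if r["Age"] < 40 else "drugX"

# Hypothetical held-out test records, prepared the same way as the training data.
test_set = [
    {"Age": 23, "Na_Over_K": 25.4, "Drug": "drugY"},
    {"Age": 47, "Na_Over_K": 13.1, "Drug": "drugX"},
    {"Age": 61, "Na_Over_K": 18.0, "Drug": "drugY"},
    {"Age": 35, "Na_Over_K": 9.7, "Drug": "drugX"},
]

acc_a = accuracy(model_a, test_set, "Drug")
acc_b = accuracy(model_b, test_set, "Drug")
print(acc_a, acc_b)
```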
  26. 26. Assignment 1
  27. 27. Assignment 1 – Problem Statement <ul><li>Situation: </li></ul><ul><ul><li>You are a financial analyst at a bank </li></ul></ul><ul><ul><li>You have to predict whether a customer is Good or Bad based on some demographic information </li></ul></ul><ul><li>Data Set: </li></ul><ul><ul><li>A data set about your past customers has been collected </li></ul></ul><ul><ul><li>Each customer is either Good or Bad </li></ul></ul>
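  28. 28. Assignment 1 – Field definitions
<table>
<tr><th>VARIABLE</th><th>ROLE</th><th>DEFINITION</th><th>DESCRIPTION</th></tr>
<tr><td>CHECKING</td><td>Input</td><td>Nominal</td><td>Checking account status</td></tr>
<tr><td>HISTORY</td><td>Input</td><td>Nominal</td><td>Credit history</td></tr>
<tr><td>AMOUNT</td><td>Input</td><td>Interval</td><td>Amount in Bank</td></tr>
<tr><td>SAVINGS</td><td>Input</td><td>Nominal</td><td>No. of Savings (bonds, stocks, etc)</td></tr>
<tr><td>EMPLOYED</td><td>Input</td><td>Nominal</td><td>Employment Type (Gov., private, etc)</td></tr>
<tr><td>INSTALLP</td><td>Input</td><td>Nominal</td><td>Type of installment rate</td></tr>
<tr><td>MARITAL</td><td>Input</td><td>Nominal</td><td>Marital status</td></tr>
<tr><td>PROPERTY</td><td>Input</td><td>Nominal</td><td>Type of Property</td></tr>
<tr><td>AGE</td><td>Input</td><td>Interval</td><td>Age in years</td></tr>
<tr><td>OTHER</td><td>Input</td><td>Nominal</td><td>Type of other installment plan</td></tr>
<tr><td>HOUSING</td><td>Input</td><td>Nominal</td><td>Type of House</td></tr>
<tr><td>EXISTCR</td><td>Input</td><td>Interval</td><td>Number of existing credits</td></tr>
<tr><td>JOB</td><td>Input</td><td>Nominal</td><td>Job Nature</td></tr>
<tr><td>FOREIGN</td><td>Input</td><td>Binary</td><td>Foreign worker or Local worker</td></tr>
<tr><td>GOOD_BAD</td><td>Output</td><td>Binary</td><td>Good or bad credit rating</td></tr>
</table>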
  29. 29. Assignment 1 – Data Mining Process <ul><li>Data Collection </li></ul><ul><ul><li>Please download CreditRisk data set from </li></ul></ul><ul><ul><li>Two data sets: </li></ul></ul><ul><ul><li>(i) creditRisk1.csv is for training </li></ul></ul><ul><ul><li>(ii) creditRisk2.csv is for testing </li></ul></ul><ul><li>Data Preprocessing </li></ul><ul><ul><li>Please explore the data and think critically about whether any data preprocessing is necessary </li></ul></ul><ul><ul><ul><li>Hint: Two of the interval variables are highly skewed </li></ul></ul></ul>
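One way to check whether an interval variable is highly skewed, and whether a log transform helps, is to compute its skewness before and after the transform. The sketch below uses the population skewness formula (mean cubed deviation divided by the cubed standard deviation). The `skewness` function and the sample values are hypothetical; the values are not taken from creditRisk1.csv, and the sketch does not say which fields in that file are skewed.

```python
import math

def skewness(values):
    """Population skewness: mean cubed deviation over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical values with a long right tail (each value doubles the previous one).
values = [250, 500, 1000, 2000, 4000, 8000, 16000]
logged = [math.log(v) for v in values]

raw_skew = skewness(values)
log_skew = skewness(logged)
print(round(raw_skew, 2), round(log_skew, 2))
```

A clearly positive skewness that moves close to zero after the transform suggests the log transform is worth applying to that field.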
  30. 30. Assignment 1 – Data Mining Process <ul><li>Modeling </li></ul><ul><ul><li>Please build a prediction model using the default settings: </li></ul></ul><ul><ul><ul><li>C5.0 Decision Tree </li></ul></ul></ul><ul><li>Model Assessment </li></ul><ul><ul><li>Please use the testing data set to evaluate the performance of the prediction model </li></ul></ul>
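For a two-class target such as GOOD_BAD, performance on the testing data set is often summarized with a confusion matrix, which counts each combination of actual and predicted label. The sketch below is a minimal Python illustration of that idea. The label spellings "Good" and "Bad" and the two lists are assumptions for illustration; in the assignment the actual labels would come from creditRisk2.csv and the predictions from the trained model.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels, invented for illustration.
actual = ["Good", "Good", "Bad", "Bad", "Good", "Bad"]
predicted = ["Good", "Bad", "Bad", "Good", "Good", "Bad"]

matrix = confusion_matrix(actual, predicted)
# Accuracy is the share of records on the matrix diagonal.
accuracy = (matrix[("Good", "Good")] + matrix[("Bad", "Bad")]) / len(actual)

for pair, count in sorted(matrix.items()):
    print(pair, count)
```

The off-diagonal cells matter as much as the accuracy: a Bad customer predicted as Good is usually a costlier mistake for a bank than the reverse.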
  31. 31. Assignment 1 – Submission <ul><li>Save the stream as “id.str” </li></ul><ul><ul><li>E.g., 00123456.str </li></ul></ul><ul><li>Upload your stream to the course account </li></ul><ul><li>Deadline: </li></ul><ul><ul><li>4 April 2004 </li></ul></ul><ul><li>This is an individual assignment </li></ul><ul><li>Note: We strongly encourage you to submit this assignment during the class!!! </li></ul>