DM_Lab1


  1. Data mining exercise with SPSS Clementine, Lab 1
       Winnie Lam
       Email: [email_address]
       Website: http://www.comp.polyu.edu.hk/~cswinnie/
       The Hong Kong Polytechnic University, Department of Computing
       Last update: 09/03/2006
  2. OVERVIEW
       • Clementine is a data mining tool that combines advanced modeling technology with ease of use. It helps you discover and predict interesting and valuable relationships within your data.
       • You can use Clementine for decision-support activities such as:
           • Creating customer profiles and determining customer lifetime value.
           • Detecting and predicting fraud in your organization.
           • Determining and predicting valuable sequences in Web-site data.
           • Predicting future trends in sales and growth.
           • Profiling for direct mailing response and credit risk.
           • Performing churn prediction, classification and segmentation.
  3. OVERVIEW: KDD Process
       Each step and the output it produces:
       • Selection → Target Data
       • Preprocessing → Preprocessed Data
       • Transformation → Transformed Data
       • Data Mining → Patterns
       • Evaluation → Knowledge
  4. Simplified KDD process
       • Data Understanding: define target & discover useful data
       • Data Preparation: obtain clean & useful data
       • Modeling (Data Mining): discover patterns
  5. Clementine interface (screenshot labels): Stream canvas, Node palettes, Object manager, Project
  6. Learning the Nodes
       Palettes: Sources, Record Ops, Field Ops, Graphs, Modeling, Output
  7. NODES: Source nodes
       • Database - import data using ODBC
       • Variable File - free-field ASCII data
       • Fixed File - fixed-field ASCII data
       • SPSS File - import SPSS files
       • SAS File - import files in SAS format
       • User Input - replace existing source nodes
  8. NODES: Record Operations nodes
       Make changes to the data set at the record level.
       • Select
       • Sample
       • Balance
       • Aggregate
       • Sort
       • Merge
       • Append
       • Distinct
  9. NODES: Field Operations nodes
       For data transformation and preparation.
       • Type
       • Filter
       • Derive
       • Filler
       • Reclassify
       • Binning
       • Partition
       • Set to Flag
       • History
       • Field Reorder
  10. NODES: Graph nodes
       Explore & check the distribution and relationships.
       • Plot
       • Multiplot
       • Distribution
       • Histogram
       • Collection
       • Web
       • Evaluation
  11. NODES: Modeling nodes
       Heart of the DM process (machine learning). Each method has certain strengths and is best suited for particular types of problems.
       • Neural Net
       • C5.0
       • Classification and Regression (C&R) Trees
       • QUEST
       • CHAID
       • Kohonen
       • K-Means
       • TwoStep Cluster
       • Apriori
       • Generalized Rule Induction (GRI)
       • CARMA
       • Sequence Detection
       • PCA/Factor Analysis
       • Linear Regression
       • Logistic Regression
  12. NODES: Output nodes
       Obtain information about your data and models; export data in various formats.
       • Table
       • Matrix
       • Analysis
       • Data Audit
       • Statistics
       • Quality
       • Report
       • Set Globals
       • Publisher
       • Database Output
       • Flat File
       • SPSS Export
       • SAS Export
       • Excel
       • SPSS Procedure
  13. Association Tools
       • Apriori discovers association rules in the data. For large problems, Apriori is generally faster to train than GRI. It has no arbitrary limit on the number of rules that can be retained and can handle rules with up to 32 preconditions.
       • GRI (Generalized Rule Induction) extracts a set of rules from the data, similar to Apriori. GRI can handle numeric as well as symbolic input fields.
       • CARMA uses an association rule discovery algorithm to find association rules in the data. The CARMA node does not require In fields or Out fields; this is equivalent to building an Apriori model with all fields set to Both.
       • Sequence discovers patterns in sequential or time-oriented data. A sequence is a list of item sets that tend to occur in a predictable order.
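To make the idea of an association rule concrete, the following is a minimal sketch of the two textbook measures usually attached to a rule, support and confidence. The transactions and product codes are made up for illustration, and the measures Clementine reports may be defined differently.

```python
# Toy transactions; the product codes are hypothetical.
transactions = [
    {"P17", "P39", "P27"},
    {"P17", "P39", "P27"},
    {"P17", "P39"},
    {"P27"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of the antecedent."""
    whole_rule = support(antecedent | consequent, transactions)
    return whole_rule / support(antecedent, transactions)


# Rule: IF P17 and P39 THEN P27
rule_support = support({"P17", "P39", "P27"}, transactions)
rule_confidence = confidence({"P17", "P39"}, {"P27"}, transactions)
print(rule_support, rule_confidence)
```

Here the rule holds in 2 of 4 transactions, and in 2 of the 3 transactions that contain both P17 and P39.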
  14. Classification Tools: Decision Tree
       • C5.0. This method splits the sample based on the field that provides the maximum information gain at each level to produce either a decision tree or a ruleset. The target field must be categorical. Multiple splits into more than two subgroups are allowed.
       • C&RT. The Classification and Regression Trees method is based on minimization of impurity measures. A node is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and predictor fields can be range or categorical; all splits are binary (only two subgroups).
       • CHAID. Chi-squared Automatic Interaction Detector uses chi-squared statistics to identify optimal splits. Target and predictor fields can be range or categorical; nodes can be split into two or more subgroups at each level.
       • QUEST. The Quick, Unbiased, Efficient Statistical Tree method is quick to compute and avoids other methods’ biases in favor of predictors with many categories. Predictor fields can be numeric ranges, but the target field must be categorical. All splits are binary.
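The "maximum information gain" criterion mentioned for C5.0 can be illustrated with a small sketch. It computes plain entropy-based information gain on invented records; the real C5.0 algorithm is more involved, so treat this only as an illustration of the idea.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, field, target):
    """Entropy of the target minus the weighted entropy after splitting on field."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[field], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical records: Discount separates the groups perfectly, Weekday does not.
rows = [
    {"Discount": "Y", "Weekday": 0, "Group": "A"},
    {"Discount": "Y", "Weekday": 1, "Group": "A"},
    {"Discount": "N", "Weekday": 0, "Group": "B"},
    {"Discount": "N", "Weekday": 1, "Group": "B"},
]
gain_discount = information_gain(rows, "Discount", "Group")
gain_weekday = information_gain(rows, "Weekday", "Group")
print(gain_discount, gain_weekday)
```

A tree builder using this criterion would split on Discount first, because that split leaves both subgroups pure.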
  15. Clustering Tools
       • K-means. An approach to clustering that defines k clusters and iteratively assigns records to clusters based on distances from the mean of each cluster until a stable solution is found.
       • TwoStep. A clustering method that involves preclustering the records into a large number of subclusters and then applying a hierarchical clustering technique to those subclusters to define the final clusters.
       • Kohonen networks. A type of neural network used for clustering, also known as a self-organizing map (SOM).
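The K-means loop described above (assign each record to the nearest mean, recompute the means, repeat until nothing changes) can be sketched in a few lines. The 2-D points and starting centers are invented, and the function is a plain illustration rather than Clementine's implementation.

```python
def kmeans(points, centers, max_iter=100):
    """Plain k-means on 2-D points, starting from the given centers."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = []
        for c, members in zip(centers, clusters):
            if members:
                new_centers.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centers.append(c)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # stable solution: no center moved
        centers = new_centers
    return centers, clusters


# Hypothetical data with three obvious groups, so k = 3.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
final_centers, final_clusters = kmeans(points, [(0, 0), (10, 10), (20, 0)])
print(final_centers)
```

The result depends on the starting centers; here they are chosen by hand so the run is deterministic.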
  16. Classification: with predefined classes!
       Figure labels: Class 1, Class 2, Class 3, New Sample
  17. Clustering: no class is defined in advance!
       Figure labels: CROSS, TRIANGLE, STAR; Class 1, Class 2, Class 3
  18. Practical Session
  19. Data Understanding
       Data file: http://www.comp.polyu.edu.hk/~cswinnie/lab/2005-6_sem2_lab1/MyData_lab1.csv
       Data description:
       • Total no. of records: ? (find out by yourself)
       Attributes:
         TID       Transaction ID
         dt        Date
         Discount  Discount offered? (Y/N)
         Group     Product group
         ref_no    Internal ref. no.
         prod_cd   Product code
  20. Data Understanding
       Step 1: Import data into Clementine.
       Add node: Var. File (in Sources palette). Double-click the node, then browse to the data file.
  21. Data Understanding
       Step 2: Analyze the data.
       Add node: Table (in Output palette). Right-click and choose Execute.
  22. Data Understanding
       Step 2: Analyze the data.
       Add node: Data Audit (in Output palette). Execute.
  23. Data Understanding
       Step 2: Analyze the data.
       Add node: Quality (in Output palette). Execute.
  24. Data Preparation
  25. Data Preparation
       Edit node: Var. File (in stream); double-click the node to open it.
       Goal: Define data types and values.
       Step 2 (as numbered on the slide): Redefine the Type of Group and ref_no as “Set”, then press “Read Values” again.
  26. Data Preparation
       Edit node: Var. File (in stream).
       Goal: Define blanks.
  27. Data Preparation
       Add node: Filler (in Field Ops palette).
       Goal: Replace all blanks with a specified value.
       Result: (shown on the slide)
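Outside Clementine, the effect of this Filler step can be approximated as below. This is a sketch only: the sample rows, the date format, and the choice of "-1" as the fill value are assumptions ("-1" is used because it is the value the next slide deletes).

```python
import csv
import io

# Hypothetical sample in the lab file's column layout; the values and the
# date format are invented for illustration.
SAMPLE_CSV = """TID,dt,Discount,Group,ref_no,prod_cd
1,2006-03-01 09:15:00,Y,G1,R10,P17
2,2006-03-01 10:40:00,,G2,R11,P39
3,2006-03-02 18:05:00,N,,R12,P27
"""


def fill_blanks(rows, fill_value="-1"):
    """Return copies of the rows with empty fields replaced by fill_value."""
    return [
        {field: (value if value.strip() else fill_value) for field, value in row.items()}
        for row in rows
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
filled = fill_blanks(rows)
print(filled[1]["Discount"], filled[2]["Group"])
```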
  28. Data Preparation
       Add node: Type (in Field Ops palette).
       Goal: Remove records with blanks.
       Step 4 (as numbered on the slide): Choose “-1” and delete it.
       Q: How many records are left?
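The goal of this slide, dropping every record that still carries the blank marker, amounts to a simple filter. The sketch below assumes blanks were already replaced by "-1"; the records are hypothetical.

```python
def drop_records_with(rows, marker="-1"):
    """Keep only the records in which no field equals the blank marker."""
    return [row for row in rows if marker not in row.values()]


# Hypothetical records after blanks were filled with "-1".
records = [
    {"TID": "1", "Discount": "Y", "Group": "G1"},
    {"TID": "2", "Discount": "-1", "Group": "G2"},
    {"TID": "3", "Discount": "N", "Group": "-1"},
]
clean = drop_records_with(records)
print(len(clean))
```

Counting the rows before and after the filter is one way to answer the question on the slide for the real data file.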
  29. Data Preparation
       Add node: Reclassify (in Field Ops palette).
       Goal: Replace invalid values.
       Step 6 (as numbered on the slide): Map the invalid values to a common set of new values (Y/N).
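Reclassifying is essentially a lookup from each raw value to its canonical form. The sketch below assumes a few inconsistent spellings of the Discount flag; the actual invalid values in the lab file are not listed on the slide, so the mapping is hypothetical.

```python
# Hypothetical spellings of the Discount flag; the real invalid values are
# whatever the Reclassify dialog lists for the lab file.
DISCOUNT_MAP = {"Y": "Y", "y": "Y", "Yes": "Y", "N": "N", "n": "N", "No": "N"}


def reclassify(value, mapping=DISCOUNT_MAP):
    """Map a raw value to its canonical form; unmapped values pass through."""
    return mapping.get(value.strip(), value)


raw = ["Y", "y", "No", " n ", "??"]
cleaned = [reclassify(v) for v in raw]
print(cleaned)
```

Passing unmapped values through unchanged keeps unexpected entries visible, so they can be inspected rather than silently rewritten.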
  30. Data Transformation
  31. Derive New Fields
       Useful node: Derive (in Field Ops palette).
       Goal: Add new attributes “Weekday” and “Hour”.
       • Weekday: datetime_weekday(dt)
       • Hour: datetime_hour(dt)
       For Weekday, 0 represents Sunday, 1 represents Monday, etc.
       Q: How many fields are there in your data?
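The two derived fields can be reproduced in Python as a cross-check. This sketch assumes the dt values look like "2006-03-05 14:30:00"; the real format in the lab file may differ. Python's weekday() counts Monday as 0, so it is shifted to match the slide's Sunday = 0 convention.

```python
from datetime import datetime


def derive_weekday_hour(dt_text, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (weekday, hour) for a timestamp string, with Sunday = 0."""
    dt = datetime.strptime(dt_text, fmt)
    # datetime.weekday() uses Monday = 0 ... Sunday = 6; shift to Sunday = 0.
    weekday = (dt.weekday() + 1) % 7
    return weekday, dt.hour


sunday = derive_weekday_hour("2006-03-05 14:30:00")
monday = derive_weekday_hour("2006-03-06 09:05:00")
print(sunday, monday)
```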
  32. Discretization
       Useful node: Binning (in Field Ops palette).
       Goal: Divide the Hour field into 4 intervals.
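One simple way to divide Hour into 4 intervals is fixed-width binning, sketched below. The slide does not say which binning method was chosen in the Binning node, so the equal-width choice (0-5, 6-11, 12-17, 18-23) is an assumption.

```python
def hour_bin(hour, n_bins=4):
    """Return the 1-based index of the fixed-width bin that contains hour."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    width = 24 // n_bins  # 6 hours per bin when n_bins is 4
    # min() guards the last bin when 24 is not divisible by n_bins.
    return min(hour // width, n_bins - 1) + 1


bins = [hour_bin(h) for h in (0, 5, 6, 11, 12, 17, 18, 23)]
print(bins)
```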
  33. Preprocessed Data
  34. Data Mining
  35. Data Mining
       Add node: Type (in Field Ops palette).
       Goal: Update the type and value of data.
  36. Association
  37. Association
       Add node: SetToFlag (in Field Ops palette).
       Goal: Convert the transactional format to tabular format.
       Select all values.
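The conversion from transactional to tabular format can be pictured as follows: one input row per (transaction, product) pair becomes one output row per transaction with a true/false flag per product. The sketch uses the TID and prod_cd field names from the data description; the records and the "T"/"F" flag values are assumptions.

```python
def set_to_flag(records, key="TID", item="prod_cd"):
    """Pivot (key, item) rows into one row per key with a T/F flag per item."""
    items = sorted({r[item] for r in records})
    table = {}
    for r in records:
        # Start every new transaction with all flags false.
        row = table.setdefault(r[key], {i: "F" for i in items})
        row[r[item]] = "T"
    return table


# Hypothetical transactional records.
records = [
    {"TID": "1", "prod_cd": "P17"},
    {"TID": "1", "prod_cd": "P39"},
    {"TID": "2", "prod_cd": "P27"},
]
table = set_to_flag(records)
print(table)
```

The tabular form is what an association algorithm such as Apriori typically expects: each row is a basket and each column states whether an item is present.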
  38. Association
       Add node: Apriori (in Modeling palette).
       Goal: Perform association with Apriori.
  39. Association
       Goal: View the mining result.
       Right-click and choose Browse.
       Association rules, e.g. the 1st rule: IF P17 and P39 THEN P27
  40. Classification
  41. Classification
       Add node: C5.0 (in Modeling palette).
       Choose the inputs and target.
  42. Classification
       Goal: View the mining result.
       Right-click and choose Browse to see the classification rules.
  43. Classification
       Goal: Find out the classification accuracy.
       Drag the classification result to the stream.
       Add nodes: Classification result (in Model) and Analysis (in Output palette). Right-click and choose Execute.
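Classification accuracy is the share of records whose predicted class matches the actual class. The sketch below shows that calculation on invented labels; it is not a reproduction of the Analysis node's output.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of records whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def confusion_counts(actual, predicted):
    """Count each (actual, predicted) pair, i.e. the cells of a confusion matrix."""
    return Counter(zip(actual, predicted))


# Hypothetical actual and predicted classes for five records.
actual = ["Y", "Y", "N", "N", "Y"]
predicted = ["Y", "N", "N", "N", "Y"]
acc = accuracy(actual, predicted)
counts = confusion_counts(actual, predicted)
print(acc, dict(counts))
```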
  44. Clustering
  45. Clustering
       Add node: K-means (in Modeling palette).
       Choose the inputs and set k = 3.
  46. Clustering
       Goal: View the mining result.
       Right-click and choose Browse to see the clustering result.
  47. Q&A session
  48. SUMMARY
       Today, you’ve learnt:
       • the KDD process
       • the differences between nodes
       • how to build streams in Clementine
       • how to do data preparation with Clementine
       • association modeling
       • classification modeling
       • clustering modeling
