Data Mining in Excel Using XLMiner

1. Data Mining in Excel Using XLMiner™
   Nitin R. Patel, Cytel Software and M.I.T. Sloan

2. Contact Info
• XLMiner is distributed by Resampling Stats, Inc.
• www.xlminer.net
• Contact Peter Bruce: pbruce@resample.com, 703-522-2713

3. What is XLMiner?
• XLMiner is an affordable, easy-to-use tool for business analysts, consultants and business students to:
  – learn the strengths and weaknesses of data mining methods,
  – prototype large-scale data mining applications,
  – implement medium-scale data mining applications.
• More generally, XLMiner is a tool for data analysis in Excel that uses classical and modern, computationally intensive techniques.

4. Available Data Mining Software
• Application-specific: aimed at providing solutions to end-users for common tasks (e.g. Unica for customer relationship management, Urban Science for location and distribution)
• Technique-specific: focused on a few data mining methods (e.g. CART from Salford Associates, neural nets from HNC Software)

5. Technique-Specific Products (source: Elder Research)
(Chart: a matrix of products — CART (Salford), See5, NeuroShell, WizWhy, Cognos — against algorithms ranging over classification and regression trees, linear regression, multilayer neural nets, k-nearest neighbors, radial basis functions, naïve Bayes, rule induction, logistic regression, time series, sequential rules, k-means, association rules and Kohonen nets; each product covers only a few of the algorithms.)
6. Available Data Mining Software
• Horizontal products: designed for data mining analysts (e.g. SAS Enterprise Miner, SPSS Clementine, IBM Intelligent Miner, NCR Teraminer, S-Plus Insightful Miner, Darwin/Oracle)
  – Powerful, comprehensive, easy-to-use; but…
  – Need substantial learning effort
  – Expensive

7. Horizontal Products (source: Elder Research)
(Chart: the same list of algorithms charted against Enterprise Miner (SAS), Clementine (SPSS), Intelligent Miner (IBM), MineSet (SGI), Darwin (Oracle) and PRW (Unica); each horizontal product covers most of the algorithms.)

8. Desiderata for Data Mining and Modern Data Analysis Software
• Easy to use
  – Data import (e.g. cross-platform, various databases)
  – Data handling (e.g. data partitioning, scoring)
  – Invoking and experimenting with procedures
• Comprehensive range of procedures:
  – Statistics (e.g. regression, multivariate procedures)
  – Machine learning (e.g. neural nets, classification trees)
  – Database (e.g. association rules)

9. XLMiner is Unique
• Low cost
• Comprehensive set of data mining models and algorithms that includes statistical, machine learning and database methods
• Based on a prototype used in three years of MBA courses on data mining at the Sloan School, M.I.T.
• Focus on business applications: a book of lecture notes and cases is in preparation (first draft available for examination)
10. Why Data Mining in Excel?
• Leverage the familiarity of MBA students, managers and business analysts with the interface and functionality of Excel to give them hands-on experience in data mining.

11. Advantages
• Low learning hurdle
• Promotes understanding of the strengths and weaknesses of different data mining techniques and processes
• Enables interactive analysis of data (important in the early stages of model building)
• Facilitates incorporation of domain knowledge (often key to successful applications) by empowering end-users to participate actively in data mining projects
• Enables pre-processing of data and post-processing of results using Excel functions, reporting in Word, presentations in PowerPoint

12. Advantages (cont.)
• Supports communication between data miners and end-users
• Supports smooth transition from prototyping to custom solution development (VB and VBA)
• Emphasizes openness:
  – enables integration with other analytic software for optimization (Solver), simulation (Crystal Ball), numerical methods
  – interface modifications (e.g. custom forms and outputs)
  – solution-specific routines (VBA)
• Examples:
  – Boston Celtics: analysis of player statistics
  – Clustering for improving forecasts, optimizing price markdowns
13. Size Limitations
• An Excel worksheet cannot exceed 65,536 rows. If data records are stored as rows in a single worksheet, this is the largest data set that can be accommodated. The number of variables cannot exceed 256 (the number of columns).
• These limits do not apply to deployment of a model to score large databases.
• If Excel is used as a view-port into a database such as Access, MS SQL Server, Oracle or SAS, these limits do not apply.

14. Sampling
• Practical data mining methodologies such as SEMMA (SAS) and CRISP-DM (SPSS and European industry standard) recommend working with a sample (typically 10,000 random cases) in the model and algorithm selection phase. This facilitates interactive development of data mining models.

15. XLMiner
• Free 30-day trial version: limit is 200 records per partition.
• Education version: limit is 2,000 records per partition, so the maximum size for a data set is 6,000 records.
• Standard version (currently in beta test; will be available by end of August): up to 60,000 records, obtained by drawing samples from large databases in accordance with SAS's SEMMA (Sample, Explore, Modify, Model, Assess) methodology.
  – Training data restricted to 10,000 records
  – Sampling from and scoring to Access databases (later SQL Server, Oracle, SAS)
16. Data Mining Procedures in XLMiner
• Partitioning data sets (into training, validation, and test data sets)
• Scoring of training, validation, test and other data
• Prediction (of a continuous variable)
• Classification
• Data reduction and exploration
• Affinity
• Utilities: sampling, graphics, missing data, binning, creation of dummy variables

17. Prediction
• Multiple linear regression with subset selection, residual analysis, and collinearity diagnostics
• K-nearest neighbors
• Regression tree
• Neural net

18. Classification
• Logistic regression with subset selection, residual analysis, and collinearity diagnostics
• Discriminant analysis
• K-nearest neighbors
• Classification tree
• Naïve Bayes
• Neural networks

19. Data Reduction and Exploration
• Principal components
• K-means clustering
• Hierarchical clustering

20. Affinity
• Association rules (market basket analysis)
21. Partitioning
Aim: To construct training, validation, and test data sets from the Boston Housing data.

22. (screenshot)

23. Boston Housing Data

CRIM     ZN  INDUS  CHAS  NOX    RM     AGE   DIS   RAD  TAX  PTRATIO  B    LSTAT  MEDV
0.00632  18  2.31   0     0.538  6.575  65.2  4.09  1    296  15.3     397  4.98   24
0.02731  0   7.07   0     0.469  6.421  78.9  4.97  2    242  17.8     397  9.14   21.6
0.02729  0   7.07   0     0.469  7.185  61.1  4.97  2    242  17.8     393  4.03   34.7
0.03237  0   2.18   0     0.458  6.998  45.8  6.06  3    222  18.7     395  2.94   33.4
0.06905  0   2.18   0     0.458  7.147  54.2  6.06  3    222  18.7     397  5.33   36.2
0.02985  0   2.18   0     0.458  6.43   58.7  6.06  3    222  18.7     394  5.21   28.7
0.08829  13  7.87   0     0.524  6.012  66.6  5.56  5    311  15.2     396  12.43  22.9
0.14455  13  7.87   0     0.524  6.172  96.1  5.95  5    311  15.2     397  19.15  27.1
0.21124  13  7.87   0     0.524  5.631  100   6.08  5    311  15.2     387  29.93  16.5
0.17004  13  7.87   0     0.524  6.004  85.9  6.59  5    311  15.2     387  17.1   18.9
0.22489  13  7.87   0     0.524  6.377  94.3  6.35  5    311  15.2     393  20.45  15
0.11747  13  7.87   0     0.524  6.009  82.9  6.23  5    311  15.2     397  13.27  18.9
0.09378  13  7.87   0     0.524  5.889  39    5.45  5    311  15.2     391  15.71  21.7
0.62976  0   8.14   0     0.538  5.949  61.8  4.71  4    307  21       397  8.26   20.4

24. XLMiner: Data Partition Sheet (date: 29-Jul-2003, ver. 1.2.0.1)
Data source: housing!$A$2:$O$507
Selected variables: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
Partitioning method: randomly chosen (random seed 81801)
# training rows: 253, # validation rows: 152, # test rows: 101
(The sheet then lists the selected variables for each row assigned to each partition, e.g. row 1: 0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98.)
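The random partition above (253/152/101 rows out of the 506 Boston Housing tracts, seed 81801) can be sketched in a few lines. This is an illustrative re-implementation, not XLMiner's actual code; for the same seed, XLMiner's shuffling will generally produce a different row assignment, but the partition sizes match.

```python
import random

def partition(n_rows, frac_train=0.5, frac_valid=0.3, seed=81801):
    """Randomly split row indices into training/validation/test sets,
    as XLMiner's data-partition step does (seed value from the slide)."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    n_train = round(n_rows * frac_train)
    n_valid = round(n_rows * frac_valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])

train, valid, test = partition(506)  # Boston Housing has 506 tracts
print(len(train), len(valid), len(test))  # 253 152 101
```

A 50/30/20 split of 506 rows gives exactly the 253/152/101 counts reported on the partition sheet.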
25. Prediction: Multiple Linear Regression Using Subset Selection
Aim: To estimate median residential property value for a census tract.

26. The Regression Model

Input variables  Coefficient  Std. Error  p-value  SS
Constant term     32.677       7.444      0.000    128852
CRIM              -0.094       0.049      0.054    3566
ZN                 0.055       0.020      0.007    2550
INDUS              0.030       0.091      0.742    1529
CHAS               2.836       1.199      0.019    645
NOX              -15.889       5.463      0.004    143
RM                 3.872       0.597      0.000    4697
AGE                0.007       0.019      0.728    0
DIS               -1.405       0.292      0.000    938
RAD                0.358       0.097      0.000    1
TAX               -0.013       0.005      0.019    174
PTRATIO           -0.934       0.208      0.000    620
B                  0.014       0.004      0.000    502
LSTAT             -0.582       0.073      0.000    1623

Residual df: 239; multiple R-squared: 0.738; std. dev. estimate: 5.025; residual SS: 6036

Scoring summary (total sum of squared errors, RMS error, average error):
Training (253 records):   6036, 4.884,  0.000
Validation (152 records): 2848, 4.329,  0.066
Test (101 records):       2392, 4.866, -1.019

27. Subset Selection (exhaustive enumeration)

Subset size  RSS       Cp      R-Squared  Adj. R-Sq.  Prob    Model (constant present in all models)
2            19472.38  362.75  0.5441     0.5432      0.0000  Constant, LSTAT
3            15439.31  185.65  0.6386     0.6371      0.0000  Constant, RM, LSTAT
4            13727.99  111.65  0.6786     0.6767      0.0000  Constant, RM, PTRATIO, LSTAT
5            13228.91   91.49  0.6903     0.6878      0.0000  Constant, RM, DIS, PTRATIO, LSTAT
6            12469.34   59.75  0.7081     0.7052      0.0000  Constant, NOX, RM, DIS, PTRATIO, LSTAT
7            12141.07   47.18  0.7158     0.7123      0.0000  Constant, CHAS, NOX, RM, DIS, PTRATIO, LSTAT
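The adjusted R-squared and Cp columns come from standard formulas. The helpers below sketch the usual conventions; exactly how XLMiner counts n and the constant term is an assumption (taking n = 506 and k = 2 coefficients reproduces the table's 0.5432 for the {Constant, LSTAT} model).

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k coefficients
    (including the constant) fitted to n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

def mallows_cp(rss_subset, sigma2_full, n, k):
    """Mallows' Cp: RSS of the subset model scaled by the full model's
    error-variance estimate; k = number of coefficients in the subset."""
    return rss_subset / sigma2_full - n + 2 * k

print(round(adjusted_r2(0.5441, 506, 2), 4))  # 0.5432, as in the table
```

Models with Cp close to k and high adjusted R-squared are the usual candidates; here the five-predictor model (subset size 6 or 7) is competitive with the full fourteen-coefficient model.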
28. The Regression Model

Predictor (indep. var.)  Coefficient  Std. Error  p-value  SS
Constant                  42.8367      7.1766     0.0000   126430.60
NOX                      -21.7852      4.6042     0.0000   3404.46
RM                         3.7503      0.6177     0.0000   6583.36
DIS                       -1.4072      0.2535     0.0000   211.69
PTRATIO                   -1.0086      0.1747     0.0000   1453.96
LSTAT                     -0.5907      0.0696     0.0000   2060.27

Residual df: 247; multiple R-squared: 0.6601; std. dev. estimate: 5.3467; residual SS: 7061.16

XLMiner: Multiple Linear Regression — Prediction of Validation Data
Data range: Data_Partition1!$C$273:$P$424
MaxAbsErr = 20.33, RMSErr = 4.9355, AvMEDV = 22.9645, %RMSErr = 21.5%, AvAbsErr = 3.57
(Per-row details follow: predicted value, actual value, the five predictors, absolute error and squared error — e.g. predicted 22.0187 vs. actual 21.1, AbsErr 0.92, SqErr 0.84.)

29. %AvAbsErr = 15.6%

Histogram of absolute errors in the validation dataset:

AbsErr  Frequency
 0       0
 2      61
 4      40
 6      25
 8      10
10       9
12       2
14       3
16       0
18       0
20       1
22       1
30. Prediction: K-Nearest Neighbors
Aim: To estimate median residential property value for a census tract.

31. XLMiner: K-Nearest Neighbors Prediction
Source data worksheet: Data_Partition1
Training data used for building the model: Data_Partition1!$C$19:$Q$322 (304 cases)
Validation data: Data_Partition1!$C$323:$Q$524 (202 cases)
Normalization: TRUE; # nearest neighbors (k): 1
Input variables: NOX RM DIS PTRATIO LSTAT; output variable: MEDV

32. Scoring Summary (k = 1)

Partition                 Total sum of squared errors  RMS error  Average error
Training (253 records)       0                         0           0
Validation (152 records)  3314                         4.669       0.805
Test (101 records)        3895                         6.210      -0.450

Overall time: 3.00 secs. (With k = 1 the training error is zero by construction: each training case is its own nearest neighbor.)
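The prediction step the report summarizes can be sketched as follows. The `knn_predict` helper and the toy data are hypothetical (not the Boston tracts), and the inputs are assumed already normalized, as in the run above (Normalization = TRUE).

```python
import math

def knn_predict(train_X, train_y, query, k=1):
    """Predict a continuous value as the average output of the k nearest
    training cases under Euclidean distance."""
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y))
    nearest = dists[:k]
    return sum(y for _, y in nearest) / k

# toy example (hypothetical data)
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
y = [10.0, 12.0, 14.0, 40.0]
print(knn_predict(X, y, (0.1, 0.1), k=1))  # 10.0
```

Larger k averages over more neighbors and smooths the prediction; XLMiner scores each candidate k on the validation partition.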
33. Validation Data Prediction Details

Row Id.  Predicted  Actual  Residual  #Nearest   CRIM     ZN    INDUS  CHAS  NOX
 3        28.70     34.70    6.00     1          0.02729   0    7.07   0     0.469
 9        14.40     16.50    2.10     1          0.21124  12.5  7.87   0     0.524
13        22.90     21.70   -1.20     1          0.09378  12.5  7.87   0     0.524
15        19.60     18.20   -1.40     1          0.63796   0    8.14   0     0.538
16        20.40     19.90   -0.50     1          0.62739   0    8.14   0     0.538
20        20.40     18.20   -2.20     1          0.7258    0    8.14   0     0.538
25        16.60     15.60   -1.00     1          0.75026   0    8.14   0     0.538
29        19.60     18.40   -1.20     1          0.77299   0    8.14   0     0.538

34. Classification: Classification Tree
Aim: To classify census tracts into high and low residential property value classes.

35. Boston Housing Data

CRIM     ZN  INDUS  CHAS  NOX   RM     AGE   DIS   RAD  TAX  PTRATIO  B       LSTAT  MEDV  HIGHCLASS
0.00632  18  2.31   0     0.54  6.575  65.2  4.09  1    296  15.3     396.9   4.98   24    0
0.02731  0   7.07   0     0.47  6.421  78.9  4.97  2    242  17.8     396.9   9.14   21.6  0
0.02729  0   7.07   0     0.47  7.185  61.1  4.97  2    242  17.8     392.83  4.03   34.7  1
0.03237  0   2.18   0     0.46  6.998  45.8  6.06  3    222  18.7     394.63  2.94   33.4  1
0.06905  0   2.18   0     0.46  7.147  54.2  6.06  3    222  18.7     396.9   5.33   36.2  1
0.02985  0   2.18   0     0.46  6.43   58.7  6.06  3    222  18.7     394.12  5.21   28.7  0
0.08829  13  7.87   0     0.52  6.012  66.6  5.56  5    311  15.2     395.6   12.43  22.9  0
0.14455  13  7.87   0     0.52  6.172  96.1  5.95  5    311  15.2     396.9   19.15  27.1  0
0.21124  13  7.87   0     0.52  5.631  100   6.08  5    311  15.2     386.63  29.93  16.5  0
0.17004  13  7.87   0     0.52  6.004  85.9  6.59  5    311  15.2     386.71  17.1   18.9  0
0.22489  13  7.87   0     0.52  6.377  94.3  6.35  5    311  15.2     392.52  20.45  15    0
0.11747  13  7.87   0     0.52  6.009  82.9  6.23  5    311  15.2     396.9   13.27  18.9  0
0.09378  13  7.87   0     0.52  5.889  39    5.45  5    311  15.2     390.5   15.71  21.7  0

36. Training Log

Growing the tree (#nodes: error):
 0: 13.82   1: 3.45   2: 2.97   3: 0.67   4: 0.65   5: 0.56   6: 0.20   7: 0.14
 8: 0.06    9: 0.05  10: 0.05  11: 0.04  12: 0.02  13: 0.01  14: 0.01  15: 0

Validation misclassification summary — classification confusion matrix (rows = actual, columns = predicted):
           Predicted 0  Predicted 1
Actual 0       152           6
Actual 1         8          36

Error report:
Class    # Cases  # Errors  % Error
0          158       6        3.80
1           44       8       18.18
Overall    202      14        6.93
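The error report is derived directly from the confusion matrix; a minimal sketch, using the validation matrix above:

```python
def error_report(confusion, labels):
    """Per-class and overall error rates from a confusion matrix,
    where confusion[i][j] = # cases of actual class i predicted as j."""
    rows = []
    total_cases = total_errors = 0
    for i, label in enumerate(labels):
        cases = sum(confusion[i])
        errors = cases - confusion[i][i]  # off-diagonal = misclassified
        rows.append((label, cases, errors, 100.0 * errors / cases))
        total_cases += cases
        total_errors += errors
    rows.append(("Overall", total_cases, total_errors,
                 100.0 * total_errors / total_cases))
    return rows

# validation confusion matrix from the slide (actual x predicted)
report = error_report([[152, 6], [8, 36]], labels=["0", "1"])
for row in report:
    print(row)  # reproduces 158/6, 44/8 and 202/14 (6.93%)
```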
37. XLMiner: Classification Tree — Prune Log

# Decision nodes  Error
15                0.0792
14                0.0644
13                0.0644
12                0.0644
11                0.0644
10                0.0644   <-- minimum error (prune std. err. 0.0172708)
 9                0.0743
 8                0.0743
 7                0.0743
 6                0.0693
 5                0.0693
 4                0.0693
 3                0.0693   <-- best prune
 2                0.0990
 1                0.2079
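The two marked trees follow the usual cost-complexity conventions: the minimum-error tree is the smallest tree achieving the lowest validation error, and the best pruned tree is the smallest tree whose error is within one standard error of that minimum. A sketch (that XLMiner applies exactly this 1-SE rule is an assumption, though the prune log's annotations are consistent with it):

```python
def pruning_choices(prune_log, std_err):
    """prune_log: list of (#decision nodes, validation error) pairs.
    Returns (minimum-error tree size, best-pruned tree size)."""
    min_error = min(err for _, err in prune_log)
    min_error_tree = min(n for n, err in prune_log if err == min_error)
    best_pruned = min(n for n, err in prune_log
                      if err <= min_error + std_err)
    return min_error_tree, best_pruned

log = [(15, 0.0792), (14, 0.0644), (13, 0.0644), (12, 0.0644),
       (11, 0.0644), (10, 0.0644), (9, 0.0743), (8, 0.0743),
       (7, 0.0743), (6, 0.0693), (5, 0.0693), (4, 0.0693),
       (3, 0.0693), (2, 0.099), (1, 0.2079)]
print(pruning_choices(log, 0.0172708))  # (10, 3), as marked above
```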
38. Classification Tree: Full Tree
(Tree diagram: the root splits on RM at 6.5505; deeper splits use DIS, RM, CRIM, LSTAT, PTRATIO, ZN and TAX, growing to 15 decision nodes.)

39. Classification Tree: Best Pruned Tree
(Tree diagram: root split on RM at 6.5505, a second split on RM at 6.79, and a split on PTRATIO at 19.45 — the three-decision-node tree selected by the prune log.)

40. Classification Tree: Minimum Error Tree
(Tree diagram: the 10-decision-node tree with the lowest validation error; root split on RM at 6.5505, with further splits on DIS, RM, CRIM, LSTAT, PTRATIO and TAX.)
41. Classification: Neural Network
Aim: To classify census tracts into high and low residential property value classes.

42. XLMiner: Neural Network Classification

Epochs information: number of epochs 30; accumulated trials 9120 (class 0: 7860, class 1: 1260)
Architecture: 1 hidden layer with 25 nodes
Step size for gradient descent: 0.1; weight change momentum: 0.6; weight decay: 0.0
Cost function: squared error; hidden-layer and output-layer sigmoids: standard
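A single-hidden-layer sigmoid network of the kind configured above computes its class-1 score with one forward pass. The weights below are made up purely for illustration; XLMiner learns its 25-hidden-node weights by gradient descent with the step size and momentum shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass: inputs -> sigmoid hidden layer -> sigmoid output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(W_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# tiny 2-input, 3-hidden-node example (hypothetical weights)
p = forward([0.5, -1.0],
            W_hidden=[[0.2, -0.4], [0.7, 0.1], [-0.3, 0.5]],
            b_hidden=[0.0, 0.1, -0.1],
            w_out=[1.0, -1.0, 0.5], b_out=0.0)
print(p)  # a probability-like score, compared to the 0.5 cutoff
```

The score is then thresholded at the cutoff probability (0.5 on the next slide) to assign the predicted class.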
43. Scoring Summary Reports (cutoff probability value for success: 0.5)

Training data — classification confusion matrix (rows = actual, columns = predicted):
           Predicted 1  Predicted 0
Actual 1        40          11
Actual 0         4         249

Error report: class 1: 51 cases, 11 errors (21.57%); class 0: 253 cases, 4 errors (1.58%); overall: 304 cases, 15 errors (4.93%)

Validation data — classification confusion matrix:
           Predicted 1  Predicted 0
Actual 1        26           7
Actual 0         1         168

Error report: class 1: 33 cases, 7 errors (21.21%); class 0: 169 cases, 1 error (0.59%); overall: 202 cases, 8 errors (3.96%)

44. Lift Chart (validation dataset)
(Charts: cumulative HIGHV for cases sorted by predicted value vs. cumulative HIGHV using the average, over the 202 validation cases; and a decile-wise lift chart plotting decile mean / global mean for deciles 1–10.)
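The decile-wise lift chart plots, for each decile of cases sorted by predicted value, the decile's mean response divided by the global mean. A sketch with hypothetical scores and outcomes:

```python
def decile_lift(scores, actuals, n_bins=10):
    """Sort cases by predicted score (descending), then divide each
    decile's mean actual response by the global mean response."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    global_mean = sum(actuals) / len(actuals)
    size = len(order) // n_bins
    lifts = []
    for d in range(n_bins):
        decile = order[d * size:(d + 1) * size]
        decile_mean = sum(actuals[i] for i in decile) / len(decile)
        lifts.append(decile_mean / global_mean)
    return lifts

# hypothetical: 20 scored cases, 5 actual successes
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35,
          0.3, 0.28, 0.25, 0.2, 0.18, 0.15, 0.1, 0.08, 0.05, 0.02]
actual = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(decile_lift(scores, actual))  # top deciles well above 1.0
```

A useful model, like the one charted above, shows lifts well above 1 in the first deciles and near 0 in the last.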
45. Data Reduction and Exploration: Hierarchical Clustering
Aim: To cluster electric utilities into similar groups.

46. Utilities Data

          seq#  x1    x2    x3   x4    x5    x6     x7    x8
Arizona    1    1.06   9.2  151  54.4   1.6   9077   0    0.628
Boston     2    0.89  10.3  202  57.9   2.2   5088  25.3  1.555
Central    3    1.43  15.4  113  53     3.4   9212   0    1.058
Common     4    1.02  11.2  168  56     0.3   6423  34.3  0.7
Consolid   5    1.49   8.8  192  51.2   1     3300  15.6  2.044
Florida    6    1.32  13.5  111  60    -2.2  11127  22.5  1.241
Hawaiian   7    1.22  12.2  175  67.6   2.2   7642   0    1.652
Idaho      8    1.1    9.2  245  57     3.3  13082   0    0.309
Kentucky   9    1.34  13    168  60.4   7.2   8406   0    0.862
Madison   10    1.12  12.4  197  53     2.7   6455  39.2  0.623
Nevada    11    0.75   7.5  173  51.5   6.5  17441   0    0.768
NewEngla  12    1.13  10.9  178  62     3.7   6154   0    1.897
Northern  13    1.15  12.7  199  53.7   6.4   7179  50.2  0.527
Oklahoma  14    1.09  12     96  49.8   1.4   9673   0    0.588
Pacific   15    0.96   7.6  164  62.2  -0.1   6468   0.9  1.4
Puget     16    1.16   9.9  252  56     9.2  15991   0    0.62
SanDiego  17    0.76   6.4  136  61.9   9     5714   8.3  1.92
Southern  18    1.05  12.6  150  56.7   2.7  10140   0    1.108
Texas     19    1.16  11.7  104  54    -2.1  13507   0    0.636
Wisconsi  20    1.2   11.8  148  59.9   3.5   7287  41.1  0.702
United    21    1.04   8.6  204  61     3.5   6650   0    2.116
Virginia  22    1.07   9.3  174  54.3   5.9  10093  26.6  1.306

47. Dendrogram (data range: T12-5!$M$3:$U$24, method: single linkage)
(Dendrogram: distance axis 0–4; leaf order 1 18 14 19 9 2 4 10 13 20 7 12 21 15 22 6 3 8 16 17 11 5.)

48. Predicted Clusters (single linkage)
Single linkage assigns Consolid (5) to cluster 2, Nevada (11) to cluster 3, SanDiego (17) to cluster 4, and all nineteen remaining utilities to cluster 1.

49. Dendrogram (data range: T12-5!$M$3:$U$24, method: complete linkage)
(Dendrogram: distance axis 0–7; leaf order 1 18 14 19 6 3 9 2 22 4 20 10 13 5 7 12 21 15 17 8 16 11.)

50. Predicted Clusters (complete linkage)
Complete linkage gives four more balanced clusters:
• Cluster 1: Arizona, Central, Florida, Kentucky, Oklahoma, Southern, Texas
• Cluster 2: Boston, Common, Consolid, Madison, Northern, Wisconsi, Virginia
• Cluster 3: Hawaiian, NewEngla, Pacific, SanDiego, United
• Cluster 4: Idaho, Nevada, Puget

51. Predicted Clusters (sorted)
The sorted listing groups the utilities' rows by cluster id; the cluster means are:

           x1    x2    x3   x4    x5   x6     x7    x8
Cluster 1  1.21  12.5  128  55.5  1.7  10163   3.2  0.874
Cluster 2  1.13  10.9  183  55.1  3.1   6546  33.2  1.065
Cluster 3  1.02   9.1  171  62.9  3.7   6526   1.8  1.797
Cluster 4  1.00   8.9  223  54.8  6.3  15505   0.0  0.566
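Agglomerative clustering with single linkage (the method behind the first dendrogram) repeatedly merges the two clusters whose closest members are nearest. A 1-D toy sketch; the utilities run uses all eight variables x1–x8 and a Euclidean-style distance:

```python
def single_linkage(points, k):
    """Merge the closest pair of clusters (single-linkage distance =
    minimum pairwise distance) until k clusters remain. 1-D toy version."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

print(single_linkage([0.0, 1.0, 10.0, 11.0, 12.0], k=2))
# two natural groups: {0, 1} and {10, 11, 12}
```

Cutting a dendrogram at a given height is equivalent to stopping the merges at the corresponding number of clusters; complete linkage differs only in using the maximum pairwise distance, which is why it yields the more balanced clusters above.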
52. Affinity: Association Rules (Market Basket Analysis)
Aim: To identify types of books that are likely to be bought by customers based on past purchases of books.

53. (Binary purchase matrix: 2,000 customers by 11 book categories — ChildBks, YouthBks, CookBks, DoItYBks, RefBks, ArtBks, GeogBks, ItalCook, ItalAtlas, ItalArt, Florence — with a 1 wherever a customer bought that type of book.)

54. XLMiner: Association Rules
Input data: Sheet1!$A$1:$K$2001; data format: binary matrix; min. support: 200; min. conf.: 70%; # rules: 19

Rule #  Conf. %  Antecedent (a)         Consequent (c)  Supp(a)  Supp(c)  Supp(a∪c)  Lift ratio
 1      100      ItalCook            => CookBks          227      862      227        2.32
 2       82.19   DoItYBks, ArtBks    => CookBks          247      862      203        1.91
 3       81.89   DoItYBks, GeogBks   => CookBks          265      862      217        1.90
 4       80.33   CookBks, RefBks     => ChildBks         305      846      245        1.90
 5       80      ArtBks, GeogBks     => ChildBks         255      846      204        1.89
 6       81.18   ArtBks, GeogBks     => CookBks          255      862      207        1.88
 7       79.63   YouthBks, CookBks   => ChildBks         324      846      258        1.88
 8       80.86   ChildBks, RefBks    => CookBks          303      862      245        1.88
 9       78.87   DoItYBks, GeogBks   => ChildBks         265      846      209        1.86
10       79.35   ChildBks, DoItYBks  => CookBks          368      862      292        1.84
11       77.87   CookBks, DoItYBks   => ChildBks         375      846      292        1.84
12       77.66   CookBks, GeogBks    => ChildBks         385      846      299        1.84
13       78.18   ChildBks, YouthBks  => CookBks          330      862      258        1.81
14       77.85   ChildBks, ArtBks    => CookBks          325      862      253        1.81
15       75.75   CookBks, ArtBks     => ChildBks         334      846      253        1.79
16       76.67   ChildBks, GeogBks   => CookBks          390      862      299        1.78
17       70.65   GeogBks             => ChildBks         552      846      390        1.67
18       70.63   RefBks              => ChildBks         429      846      303        1.67
19       71.1    RefBks              => CookBks          429      862      305        1.65
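Each row's confidence and lift ratio follow directly from the support counts; a sketch, checked against rule 1 (ItalCook => CookBks):

```python
def rule_stats(n_transactions, support_a, support_c, support_ac):
    """Confidence and lift for a rule a => c from raw support counts:
    confidence = P(c | a); lift = confidence / P(c)."""
    confidence = support_ac / support_a
    lift = confidence / (support_c / n_transactions)
    return confidence, lift

conf, lift = rule_stats(2000, support_a=227, support_c=862, support_ac=227)
print(round(conf * 100, 2), round(lift, 2))  # 100.0 2.32
```

A lift above 1 means the antecedent raises the chance of the consequent beyond its base rate; every Italian-cooking buyer in this data also bought a cookbook, 2.32 times the 43% base rate.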
55. Some Utilities
• Sampling from worksheets and databases
• Database scoring
• Graphics
• Binning

56. Simple Random Sampling (screenshot)

57. Stratified Random Sampling (screenshot)
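Stratified random sampling draws the sample separately within each stratum; a proportional-allocation sketch (the helper and its data are hypothetical, and XLMiner's dialog may also support non-proportional allocations):

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, frac, seed=0):
    """Draw the same fraction from every stratum, so each stratum is
    represented in proportion to its size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[stratum_of(r)].append(r)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, round(frac * len(members))))
    return sample

# hypothetical data: 100 records in stratum A, 50 in stratum B
data = [("A", i) for i in range(100)] + [("B", i) for i in range(50)]
s = stratified_sample(data, stratum_of=lambda r: r[0], frac=0.1)
print(len(s))  # 15: ten from stratum A, five from B
```

Unlike simple random sampling, this guarantees that small but important strata (e.g. rare responder classes) appear in the sample.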
58. Scoring to Databases and Worksheets (screenshot)

59. Binning Continuous Variables (screenshot)
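Binning maps a continuous variable to a small number of intervals, as with the Binned_TAX variable in the box plot later in the deck. An equal-width sketch; whether XLMiner's dialog uses equal-width, equal-count or user-defined bins is not stated on the slide, so this shows just one common scheme:

```python
def equal_width_bins(values, n_bins):
    """Divide [min, max] into n_bins equal-width intervals and return
    each value's 1-based bin id (the max value falls in the last bin)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width) + 1, n_bins) for v in values]

# a few TAX values from the Boston Housing data, binned into 5 intervals
tax = [296, 242, 222, 311, 307, 666, 403]
print(equal_width_bins(tax, 5))
```

Equal-count (quantile) binning is the usual alternative when the variable is skewed, since it keeps the bins equally populated.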
60. Missing Data (screenshot)

61. Graphics: Boston Housing Data
(Box plot and histogram of AGE.)

62. (Box plot and histogram of RM.)

63. (Matrix plot of RM, AGE and TAX — do high-tax towns have fewer rooms on average?)

64. (Box plot of RM by Binned_TAX, bins 1–5.)
65. Future Extensions
• Cross validation
• Bootstrap, bagging and boosting
• Error-based clustering
• Time series and sequences
• Support vector machines
• Collaborative filtering

66. In Conclusion
• XLMiner is a modern tool-belt for data mining. It is an affordable, easy-to-use tool for consultants, MBAs and business analysts to learn, create and deploy data mining methods.
• More generally, XLMiner is a tool for data analysis in Excel that uses classical and modern, computationally intensive techniques.