Comparison of Machine Learning Algorithms

  1. Comparison of Machine Learning Algorithms in Market Segmentation Analysis
     Zhaohua Huang
     Dec 12, 2005

     Abstract

     This project compares four machine learning methods, bagging, random forests (RFA), artificial neural network (ANN) and support vector machine (SVM), using the sales data of an orthopedic equipment company. The four methods show similarly unsatisfactory prediction performance on this dataset. Although each method has its own advantages in predicting some specific categories, ANN is relatively the best based on the misclassification rates.
  2. 1. Introduction

     Bagging, random forests (RFA), artificial neural network (ANN) and support vector machine (SVM) are four useful machine learning methods that can be used to improve classification accuracy. Bagging produces replicated training sets by sampling with replacement from the training set and fits a classifier to each replicate. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The artificial neural network used here is a single-hidden-layer network: the inputs are fed through a series of weights to one layer of hidden units, whose outputs are in turn weighted into the output nodes. The support vector machine for classification creates a hyperplane that separates the data into two classes with the maximum margin. Theoretically, random forests yield better error rates than bagging, at least, and are more robust to noise.

     In this project, we use a real data set to empirically compare their classification performance. The data set contains the sales of a company’s orthopedic equipment in 4703 hospitals and 13 feature variables that could potentially explain the differences in sales among these hospitals. The four classification algorithms above yield the predicted probabilities of four sales categories, ordered from low to high: “no sale”, “low”, “high”, and “very high”. These probabilities are then input to LDA for classification and the misclassification rates are compared.

     2. Analysis and Results

     2.1 Overview

     The procedure of analysis is summarized in the following diagram. Since RFA directly reports the classification result instead of probabilities, LDA is not applied to it.

     [Diagram: Data Transformation -> Bagging, Random Forests, ANN, SVM -> LDA]

     2.2 Data Manipulation and PCA

     The goal and method of data transformation are the same as in the previous project; the transformation details are in Appendices 0 to 3 and the PCA result is in Appendix 4.

     There are 5 closely related and highly correlated variables: “knee 95”, “knee 96”, “hip 95”, “hip 96” and “femur 96”. They are the numbers of knee, hip and femur operations in 1995 and 1996. Simply using them together as predictors would add unnecessary noise to the prediction. Therefore, we apply principal component analysis to these 5 variables after transforming the data. Since the first component explains 0.9138151 of the variance and the second explains only about 0.05, we use only the first principal component, “V1”, which is a linear combination of the five highly correlated variables:

     V1 = -0.456 hip95 - 0.445 knee95 - 0.458 hip96 - 0.445 knee96 - 0.432 femur96

     The other predictors do not show high correlation after transformation.
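As a rough check of the PCA figures quoted above, the following Python sketch recomputes the proportion of variance from the component standard deviations listed in Appendix 4 and evaluates the V1 formula for one record. The `example` record is hypothetical and only illustrates the arithmetic. Note also that R's `princomp` centers each variable before applying the loadings, which the formula as written omits; centering only shifts V1 by a constant.

```python
# Standard deviations of the five principal components, copied from Appendix 4.
sdev = [2.1373123, 0.4777930, 0.3386012, 0.2303204, 0.1866775]

# Each component's share of the variance is its squared standard deviation
# divided by the sum of all squared standard deviations.
variances = [s ** 2 for s in sdev]
total_variance = sum(variances)
proportion = [v / total_variance for v in variances]

# Loadings of the first component, in the order used in the V1 formula.
loadings = {
    "hip95": -0.456,
    "knee95": -0.445,
    "hip96": -0.458,
    "knee96": -0.445,
    "femur96": -0.432,
}


def v1_score(row):
    """Linear combination of the already transformed variables (no centering)."""
    return sum(loadings[name] * row[name] for name in loadings)


# Hypothetical transformed values for one hospital, for illustration only.
example = {"hip95": 3.0, "knee95": 2.0, "hip96": 5.5, "knee96": 3.5, "femur96": 4.5}

print("share of variance, component 1:", round(proportion[0], 4))
print("V1 for the example record:", round(v1_score(example), 4))
```

The first share comes out at about 0.914, which agrees with the "Proportion of Variance" row in Appendix 4.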
  3. In addition, the variable “rbeds” is transformed into a binary categorical variable, which makes it the same as the variable “rehab”, so we drop “rbeds”. We also use cross validation to look for a good combination of predictors. The change in the result is subtle and, in most cases, dropping any one variable weakens the prediction power. Hence, the final predictors are “V1”, “beds”, “outv”, “adm”, “sir”, “th”, “trauma” and “rehab”. The description of these 8 predictors is in the appendix.

     To examine the out-of-sample performance of these four methods, we randomly split the whole data set into two subsets: a training set with 4203 observations and a testing set with 500 observations.

     2.3 Results of bagging, RFA, ANN and SVM

     The results of bagging, RFA, ANN and SVM are shown in Appendices 5 to 8, respectively. All four methods involve some randomness: sometimes the result is good and sometimes it is bad. Hence we run each of them several times and report only its best performance. Specifically, we choose the default settings for bagging; ntree = 1000 for RFA; size = 20, maxit = 1000 for ANN; and kernel = "polynomial", degree = 6, gamma = 0.5, cross = 10 for SVM.

     We do not have many options for bagging in R. We test different numbers of bags from 5 to 100. The results differ slightly, but only because of the randomness of the method. Therefore, we use the default setting.

     The number of trees in RFA has also been tested from 100 to 1000. Although increasing it does not improve the result significantly, we get the best result when we set it to 1000.

     ANN is very unstable. Sometimes it stops after only 5 iterations and sometimes it runs for 800 or more. In our runs, the more iterations, the better the result. The default maximum number of iterations is relatively small, so we set it to 1000, although the iteration count never gets that high. Our best result comes from a run that reaches 740 iterations.

     There are many options for SVM.
     We try different kernels: sigmoid, radial and polynomial. For the polynomial kernel, we also try different degrees. In general, a polynomial kernel with a high degree dominates sigmoid and radial. It fits the training set well, with a misclassification rate as low as 42%, but none of the combinations does well on the testing set. Also, the gamma coefficient is by default 1 over the number of predictors, which is 0.125 in our study. However, the testing set performs better if we increase gamma to 0.6. Therefore, we suspect there is an overfitting issue with SVM.

     In Appendix 9, we compare the misclassification rates to evaluate the prediction accuracy of these 4 classifiers. First, we focus on their performance on the training and testing sets. On the training set, SVM has the lowest misclassification rate, 45.3%. However, its testing set misclassification rate is 50.8%. Compared with the relatively consistent performance of the other 3 methods, this could be a sign of overfitting. The result of bagging is also not good. Its misclassification rate on the training set is 0.485, which is close to ANN’s 0.480, but the 0.498 rate on the testing set means that almost half of bagging's predictions are incorrect.
  4. RFA performs quite stably, as expected: its misclassification rate on the training set is always close to the one on the testing set. But the result is still not satisfactory. Surprisingly, ANN dominates the other three methods. Its 48% rate on the training set is the second best, behind the possibly overfitted SVM, and its 47.4% rate on the testing set is definitely the best. The problem with ANN is its randomness and inconsistency. One has to run ANN many times to get the best result, and there is no way to know whether that result is really the best one can get.

     As far as the specific categories are concerned, the methods' performances are close but slightly different. In general, all four methods classify the “n” group well: the accuracy rates are about 80% for RFA, ANN and SVM and above 70% for bagging. But they do extremely badly in the “l” group, where the misclassification rate is above 80%. They cannot reliably classify the “h” and “v” groups, where correct and incorrect predictions are split about half and half. One reason could be that the categorization is poor: the difference between selling 10 and 50 pieces of equipment can be very subtle. Hence the classification among low, high and very high sales is very difficult. If the number of categories were reduced to only 2, these methods could perform very well.

     The four methods differ in their prediction abilities for different categories of the response variable. In general, bagging does worse in “n” but quite well in “v”. RFA has its weakness in “h” and “v”. SVM does very badly in “l” and “v”, but quite well in “n” and “h”. ANN is acceptable for “l” and “h”, and quite good for “n” and “v”. Since the “n” group is the largest, the method that performs best in that group is very likely to be the best one overall. In our case, it is ANN.

     2.5 Something more

     We also try the data mining tree (DMT) method, but the result does not come out on our computer within two hours.
     Furthermore, the goal of DMT is to find interesting groups, which is a little different from the goal of this project. Therefore, we might try it again after upgrading the computer, but regretfully give it up for now.

     We plan to try different categorizations to further test the prediction abilities, since the result with these 4 categories is far from satisfactory. Besides, the ANN we use is the single-hidden-layer neural net provided by R; a more complicated ANN might perform better. However, due to the time limit, we leave all of these to future study.

     3. Conclusion

     The four methods do not yield ideal results. The artificial neural network is acceptable and random forests show the expected robustness. The prediction abilities for different categories of the response variable differ among the methods. The category “n” is relatively the easiest to predict, while none of the four methods can successfully identify the category “l”. There are several ways to improve the performance, such as reducing the number of categories or adding a region or state factor to the analysis. In summary, ANN relatively dominates the other three methods in the analysis of this orthopedic equipment data.
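The suggestion of reducing the number of categories can be explored, very roughly, by merging rows and columns of an existing confusion matrix. The Python sketch below does this for the ANN testing-set matrix from Appendix 7, using a hypothetical two-group split ("no sale" versus any sale) that is an assumption made here, not a grouping proposed in the text. Collapsing predictions after the fact is not the same as refitting the classifiers on two categories, so the number it prints is only an illustration.

```python
labels = ["n", "l", "h", "v"]

# ANN testing-set confusion matrix from Appendix 7 (rows: true y, columns: predicted).
ann_test = [
    [165, 0, 35, 6],
    [59, 12, 28, 15],
    [48, 3, 64, 15],
    [7, 0, 21, 22],
]


def collapse(matrix, labels, groups):
    """Merge rows and columns of a confusion matrix according to label -> group."""
    names = sorted(set(groups.values()))
    index = {name: i for i, name in enumerate(names)}
    merged = [[0] * len(names) for _ in names]
    for i, true_label in enumerate(labels):
        for j, pred_label in enumerate(labels):
            merged[index[groups[true_label]]][index[groups[pred_label]]] += matrix[i][j]
    return names, merged


def misclassification_rate(matrix):
    """Share of observations that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 1 - correct / total


# Hypothetical grouping, chosen only for illustration.
groups = {"n": "no sale", "l": "sale", "h": "sale", "v": "sale"}
names, two_class = collapse(ann_test, labels, groups)

print(names, two_class)
print("four-category rate:", round(misclassification_rate(ann_test), 3))
print("two-category rate:", round(misclassification_rate(two_class), 3))
```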
  7. Appendix 0 The Notations of Variables

     Response:
       SALES12 : SALES OF REHAB. EQUIP. FOR THE LAST 12 MO

     Features (predictors):
       BEDS    : NUMBER OF HOSPITAL BEDS
       RBEDS   : NUMBER OF REHAB BEDS
       OUT-V   : NUMBER OF OUTPATIENT VISITS
       ADM     : ADMINISTRATIVE COST (In $1000's per year)
       SIR     : REVENUE FROM INPATIENT
       HIP95   : NUMBER OF HIP OPERATIONS FOR 1995
       KNEE95  : NUMBER OF KNEE OPERATIONS FOR 1995
       TH      : TEACHING HOSPITAL? 0, 1
       TRAUMA  : DO THEY HAVE A TRAUMA UNIT? 0, 1
       REHAB   : DO THEY HAVE A REHAB UNIT? 0, 1
       HIP96   : NUMBER HIP OPERATIONS FOR 1996
       KNEE96  : NUMBER KNEE OPERATIONS FOR 1996
       FEMUR96 : NUMBER FEMUR OPERATIONS FOR 1996

     Appendix 1 Transformations of Selected Variables

       beds = log(beds+1)
       rbeds = 1 if rbeds ≠ 1
       outv = 15*log(outv+215)
       adm = 0.0001*log(adm+425)
       sir <- log(0.1*sir+42)
       hip95 <- log(3*hip95+11)
       knee95 <- sqrt(log(3*knee95+15))
       hip96 <- log(25*hip96+150)
       knee96 <- log(5+10*knee96)
       femur96 <- log(20*femur96+60)
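For readers who want to reproduce the transformations in Appendix 1 outside R, a minimal Python sketch follows. It assumes that `log` means the natural logarithm (as it does in R) and that each hospital is a plain dictionary keyed by the lower-case variable names. The `rbeds` rule is stated as a condition rather than a formula, so it is left out here.

```python
import math


def transform(row):
    """Apply the Appendix 1 transformations to one hospital record (a dict)."""
    return {
        "beds": math.log(row["beds"] + 1),
        "outv": 15 * math.log(row["outv"] + 215),
        "adm": 0.0001 * math.log(row["adm"] + 425),
        "sir": math.log(0.1 * row["sir"] + 42),
        "hip95": math.log(3 * row["hip95"] + 11),
        "knee95": math.sqrt(math.log(3 * row["knee95"] + 15)),
        "hip96": math.log(25 * row["hip96"] + 150),
        "knee96": math.log(5 + 10 * row["knee96"]),
        "femur96": math.log(20 * row["femur96"] + 60),
    }


# Hypothetical record with all counts at zero, for illustration only.
zero_row = {name: 0 for name in
            ["beds", "outv", "adm", "sir", "hip95", "knee95", "hip96", "knee96", "femur96"]}
transformed = transform(zero_row)
print(transformed)
```

The additive constants inside each logarithm keep the argument positive when a count is zero, which is why the all-zero record transforms without error.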
  8. Appendix 2 Distributions before Transformation
  9. Appendix 3 Distributions after Transformation
  10. Appendix 4 Result of PCA:

      Call:
      princomp(x = check)

      Standard deviations:
         Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
      2.1373123 0.4777930 0.3386012 0.2303204 0.1866775

      5 variables and 4703 observations.

      Importance of components:
                                Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
      Standard deviation     2.1373123 0.47779303 0.33860116 0.23032037 0.18667749
      Proportion of Variance 0.9138151 0.04566695 0.02293503 0.01061175 0.00697118
      Cumulative Proportion  0.9138151 0.95948204 0.98241707 0.99302882 1.00000000

      Loadings:
           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
      [1,] -0.456        -0.529  0.162  0.697
      [2,] -0.445 -0.548 -0.214 -0.612 -0.286
      [3,] -0.458  0.126 -0.227  0.587 -0.615
      [4,] -0.445 -0.344  0.749  0.261  0.233
      [5,] -0.432  0.751  0.248 -0.432

                     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
      SS loadings       1.0    1.0    1.0    1.0    1.0
      Proportion Var    0.2    0.2    0.2    0.2    0.2
      Cumulative Var    0.2    0.4    0.6    0.8    1.0
  11. Appendix 5 Result of Bagging:

             Length Class      Mode
      y      4203   -none-     numeric
      X         8   data.frame list
      Mtrees   10   -none-     list
      OOB       1   -none-     logical
      comb      1   -none-     logical
      call      4   -none-     call

      Bagging regression trees with 20 bootstrap replications

      Call: bagging.data.frame(formula = yy_ ~ v1 + beds + outv + adm + sir + th + trauma + rehab, data = hos.train, nbagg = 20, coob = T)

      Training set:
               pred
      y          h     l     n     v
          h    473    14   324   261
          l    257    91   401   102
          n    387     2  1314    87
          v    156     1    46   287
      Misclassification rate = 0.485

      Testing set (lda.predict):
               pred
      y          h     l     n     v
          h     66     3    41    20
          l     30    12    52    20
          n     51     2   145     8
          v     18     0     4    28
      Misclassification rate = 0.498
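The "bootstrap replications" in the output above are training sets drawn with replacement from the original training set, one per tree. A minimal Python sketch of that resampling step follows; it shows only how the replicated index sets could be drawn, not how the trees are fitted or combined. The function name `bootstrap_indices` and the seed are assumptions made for this illustration.

```python
import random


def bootstrap_indices(n, n_replicates, seed=0):
    """Draw n_replicates index lists of length n, sampling with replacement."""
    rng = random.Random(seed)
    return [rng.choices(range(n), k=n) for _ in range(n_replicates)]


# 4203 training observations and 20 replications, as in the output above.
replicates = bootstrap_indices(n=4203, n_replicates=20)

# Because sampling is with replacement, each replicate repeats some rows and
# leaves others out; on average about 63% of the rows appear at least once.
unique_share = len(set(replicates[0])) / 4203
print(len(replicates), len(replicates[0]), round(unique_share, 3))
```

The rows left out of a given replicate are what out-of-bag error estimates are computed on.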
  12. Appendix 6 Result of Random Forests:

      Call: randomForest(x = xx.train, y = yf.train, xtest = xx.test, ytest = yf.test, ntree = 1000)
      Type of random forest: classification
      Number of trees: 1000
      No. of variables tried at each split: 2

      OOB estimate of error rate: 49.75%
      Confusion matrix:
               Pre
      y          1     2     3     4   class.error
          1   1424    38   285    43   0.2044693
          2    448   107   242    54   0.8742656
          3    439    65   421   147   0.6072761
          4     88    12   230   160   0.6734694

      Test set error rate: 48.6%
      Confusion matrix:
               Pre
      y          1     2     3     4   class.error
          1    164     2    35     5   0.2038835
          2     58    14    35     7   0.8771930
          3     51     5    61    13   0.5307692
          4      7     0    25    18   0.6400000
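The `class.error` column and the overall error rate can be recomputed directly from the confusion matrix, which is a useful check when reading these tables. The Python sketch below does so for the out-of-bag matrix above, assuming rows are the true classes and columns the predicted ones.

```python
# Out-of-bag confusion matrix from the output above (rows: true y, columns: predicted).
oob_matrix = [
    [1424, 38, 285, 43],
    [448, 107, 242, 54],
    [439, 65, 421, 147],
    [88, 12, 230, 160],
]


def class_errors(matrix):
    """Per-class error: share of each row that falls off the diagonal."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_error(matrix):
    """Overall error: share of all observations that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(row[i] for i, row in enumerate(matrix))
    return 1 - correct / total


errors = class_errors(oob_matrix)
print([round(e, 7) for e in errors])
print(round(100 * overall_error(oob_matrix), 2))
```

The same two functions apply unchanged to the bagging, ANN and SVM matrices in the other appendices.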
  13. Appendix 7 Results of ANN

      a 8-20-4 network with 264 weights
      options were -
      # weights: 264
      initial value 3855.019060
      iter 10 value 2583.006140
      iter 20 value 2409.884200
      iter 30 value 2357.671846
      iter 40 value 2332.398369
      ......
      iter 740 value 2198.189341
      final value 2198.189123
      converged

      LDA result (training set):
               Pred
      y          n     l     h     v
          n   1440     0   292    58
          l    461    92   219    79
          h    423    13   407   229
          v     64     1   178   247
      Misclassification rate = 0.480

      LDA result (testing set):
               Pred
      y          n     l     h     v
          n    165     0    35     6
          l     59    12    28    15
          h     48     3    64    15
          v      7     0    21    22
      Misclassification rate = 0.474
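The figure of 264 weights follows from the 8-20-4 layout if each hidden and output unit also has a bias term and there are no direct input-to-output connections. Both are assumptions of this sketch, although they are consistent with the count reported above.

```python
def n_weights(n_inputs, n_hidden, n_outputs):
    """Weights in a single-hidden-layer network with bias terms and no skip connections."""
    input_to_hidden = (n_inputs + 1) * n_hidden    # +1 for the bias of each hidden unit
    hidden_to_output = (n_hidden + 1) * n_outputs  # +1 for the bias of each output unit
    return input_to_hidden + hidden_to_output


# 8 predictors, 20 hidden units (size = 20), 4 sales categories.
print(n_weights(8, 20, 4))
```

With 4203 training observations, that is roughly 16 observations per weight, which may help explain why the fit varies noticeably from run to run.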
  14. Appendix 8 Results of SVM

      Call: svm.default(x = xx.train, y = yf.train, kernel = "polynomial", degree = 6, gamma = 0.5, cross = 10)

      Parameters:
        SVM-Type:   C-classification
        SVM-Kernel: polynomial
        cost:       1
        degree:     6
        gamma:      0.5
        coef.0:     0

      Number of Support Vectors: 3251 ( 779 1139 932 401 )
      Number of Classes: 4
      Levels: 1 2 3 4

      10-fold cross-validation on training data:
      Total Accuracy: 45.12557
      Single Accuracies:
        51.35135 51.08108 40.81081 40.70081 38.91892 37.02703 44.74394 42.43243 48.64865 55.52561

      LDA result (training set):
               pred
      y          1     2     3     4
          1   1500     9   257    24
          2    513    62   244    32
          3    420     9   579    64
          4     68     3   259   160
      Misclassification rate = 0.453

      LDA result (testing set):
               pred
      y          1     2     3     4
          1    163     4    38     1
          2     70     5    33     6
          3     56     1    65     8
          4      6     0    31    13
      Misclassification rate = 0.508
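The ten single accuracies above can be summarized to see how much the cross-validated accuracy varies between folds. The Python sketch below computes their unweighted mean and range. The mean differs slightly from the reported total accuracy of 45.12557; that would be expected if the total pools predictions over folds of unequal size, but this explanation is an assumption and is not stated in the output.

```python
import statistics

# Single fold accuracies (percent) from the 10-fold cross-validation above.
fold_accuracy = [
    51.35135, 51.08108, 40.81081, 40.70081, 38.91892,
    37.02703, 44.74394, 42.43243, 48.64865, 55.52561,
]

mean_accuracy = statistics.mean(fold_accuracy)
spread = max(fold_accuracy) - min(fold_accuracy)

print("unweighted mean accuracy:", round(mean_accuracy, 3))
print("range across folds:", round(spread, 3))
```

A range of more than 18 percentage points across folds indicates that a single train/test split of this data can easily look better or worse by chance.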
  15. Appendix 9 Comparison of 4 Methods:

      Misclassification Rates     Training Set    N      L      H      V     Testing Set    N      L      H      V
      Bagging                        0.485      0.266  0.893  0.559  0.414     0.498      0.296  0.895  0.492  0.440
      Random Forests                 0.498      0.204  0.874  0.607  0.673     0.486      0.204  0.877  0.531  0.640
      Artificial Neural Network      0.480      0.196  0.892  0.620  0.496     0.474      0.199  0.895  0.508  0.560
      Support Vector Machine         0.453      0.162  0.927  0.460  0.673     0.508      0.209  0.956  0.500  0.740
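One simple way to read this table for signs of overfitting is to compare each method's testing-set rate with its training-set rate. The Python sketch below computes that gap from the overall rates in the table. Treating a large positive gap as a warning sign is a common rule of thumb, not a formal test.

```python
# Overall misclassification rates from the table above: (training set, testing set).
rates = {
    "Bagging": (0.485, 0.498),
    "Random Forests": (0.498, 0.486),
    "Artificial Neural Network": (0.480, 0.474),
    "Support Vector Machine": (0.453, 0.508),
}

# Positive gap: the method does worse on unseen data than on the data it was fitted to.
gaps = {method: test - train for method, (train, test) in rates.items()}

largest_gap = max(gaps, key=gaps.get)
best_on_test = min(rates, key=lambda method: rates[method][1])

for method, gap in gaps.items():
    print(f"{method}: {gap:+.3f}")
print("largest gap:", largest_gap)
print("lowest testing-set rate:", best_on_test)
```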
