International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online), Volume 3, Issue 3, October-December (2012), © IAEME
Improving the performance of k nearest neighbor algorithm for the classification of diabetes dataset with missing values

IMPROVING THE PERFORMANCE OF K-NEAREST NEIGHBOR ALGORITHM FOR THE CLASSIFICATION OF DIABETES DATASET WITH MISSING VALUES

Y. Angeline Christobel(1), P. Sivaprakasam(2)
(1) Research Scholar, Department of Computer Science, Karpagam University, Coimbatore, India
(2) Professor, Department of Computer Science, Sri Vasavi College, Erode, India

International Journal of Computer Engineering and Technology (IJCET), Volume 3, Issue 3, October-December (2012), pp. 155-167. www.iaeme.com/ijcet.asp

ABSTRACT

In today's world, people are affected by many diseases that cannot be completely cured. Diabetes is one such disease and is now a large and growing health problem; it increases the risk of heart attack, kidney failure and renal disease. Data mining techniques have been widely applied to extract knowledge from medical databases. In this paper, we evaluate the performance of the k-nearest neighbor (kNN) algorithm for the classification of diabetes data. We consider data imputation, scaling and normalization techniques to improve the accuracy of the classifier on diabetes data, which may contain many missing values. We chose to explore kNN because it is very simple and faster than most complex classification algorithms. To measure performance, we used accuracy and error rate as the metrics. We found that data imputation by itself does not lead to higher accuracy; rather, it yields correct accuracy in the presence of missing values, and imputation combined with a suitable data preprocessing method increases accuracy.

Keywords: Data Mining, Classification, kNN, Data Imputation Methods, Data Normalization and Scaling

1. INTRODUCTION

Hospitals and medical institutions find it difficult to extract useful information for decision support because of the growth of medical data, so methods for efficient computer-based analysis are essential. It has been shown that introducing machine learning into medical analysis increases diagnostic accuracy, reduces costs and reduces demands on human resources. Earlier detection of disease aids in increased exposure to required patient care and improved cure rates [17]. Diabetes is one of the most deadly diseases observed in many nations. According to the World Health Organization (WHO) [18], millions of people in this world are suffering from diabetes. Data classification is an important problem in engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision, and artificial intelligence. The goal of data classification is to classify objects into a number of categories or classes.

A great deal of research has been done on the Pima Indian diabetes dataset. In [8], Jayalakshmi and Santhakumaran used an ANN for diagnosing diabetes on the Pima Indian diabetes dataset without missing data and obtained 68.56% classification accuracy. Manaswini Pradhan et al. [12] suggested an Artificial Neural Network (ANN) based classification model for classifying diabetic patients, with a best accuracy of 72% and an average accuracy of 72.2%. Umar Sathic Ali and C. Jothi Ventakeswaran [13] proposed an improved evidence-theoretic kNN algorithm that combines Dempster-Shafer theory of evidence and the k-nearest-neighbour rule with a distance-metric-based neighborhood, achieving an average accuracy of 73.57%. Pengyi Yang, Liang Xu, Bing B. Zhou, Zili Zhang and Albert Y. Zomaya [14] proposed a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. Multiple classification algorithms are used to guide the sampling process; the algorithms in the hybrid system composition include Decision Tree (J48), k-Nearest Neighbor (kNN), Naive Bayes (NB), Random Forest (RF) and Logistic Regression (LOG). Four medical datasets (blood, survival, breast cancer and Pima Indian Diabetes) from the UCI Machine Learning Repository and a genome-wide association study (GWAS) dataset from the genotyping of Single Nucleotide Polymorphisms (SNPs) of Age-related Macular Degeneration (AMD) are used.
The methods compared in that study include PSO (Particle Swarm based Hybrid System), RO (Random OverSampling), RU (Random UnderSampling) and Clustering; the metrics are AUC (Area Under the ROC Curve), F-measure and geometric mean. With the kNN classifier on Pima Indian Diabetes, PSO gives an average accuracy of 71.6%, versus 67.3% for RU, 65.8% for RO and 64.9% for clustering. With other classifiers and datasets as well, the study shows that PSO gives better accuracy.

In [15], Biswadip Ghosh applies FCP (Fuzzy Composite Programming) to build a diabetes classifier using the Pima diabetes dataset. He evaluated the performance of the fuzzy classifier using Receiver Operating Characteristic (ROC) curves and compared it against a logistic regression classifier. The comparison of calculated AUC values shows that the FCP classifier is better: the classifier based on logistic regression has an AUC of 64.8%, while the classifier based on FCP has an AUC of 72.2%.

Quinlan applied C4.5 and obtained 71.1% accuracy (Quinlan 1993). In [3], Wahba's group at the University of Wisconsin applied penalized log likelihood smoothing spline analysis of variance (PSA). They eliminated patients with glucose and BMI values of zero, leaving n = 752; they used 500 records for the training set and the remaining 252 as the evaluation set, which showed an accuracy of 72% for the PSA model and 74% for a GLIM model (Wahba, Gu et al. 1992).

Michie et al. used 22 algorithms with 12-fold cross validation and reported the following accuracy rates on the test set: Discrim 77.5%, Quaddisc 73.8%, Logdisc 77.7%, SMART 76.8%, ALLOC80 69.9%, k-NN 67.6%, CASTLE 74.2%, CART 74.5%, IndCART 72.9%, NewID 71.1%, AC2 72.4%, Baytree 72.9%, NaiveBay 73.8%, CN2 71.1%, C4.5 73%, Itrule 75.5%, Cal5 75%, Kohonen 72.7%, DIPOL92 77.6%, Backprop 75.2%, RBF 75.7% and LVQ 72.8% (Michie, Spiegelhalter et al. 1994, p. 158).
J. Oates (1994) used the Multi-Stream Dependency Detection (MSDD) algorithm, training on two-thirds of the dataset; accuracy on the remaining one-third was 71.33%.

Although the cited articles use different techniques, their average accuracy on the Pima Indian Diabetes dataset is 70.64%. The representation of many missing values by zeros itself forms a characteristic of the data and hence affects the accuracy of the classifier. The main objective of this paper is to explore ways to improve the classification accuracy of a classification algorithm using suitable techniques such as data imputation, scaling and normalization for the classification of diabetes data that may contain many missing values.

2. DATA PREPROCESSING

Data in the real world is incomplete, noisy, inconsistent and redundant. Knowledge discovery during the training phase is difficult when there is much irrelevant and redundant information, so data preprocessing is important to produce quality mining results. The main tasks in data preprocessing are data cleaning, data integration, data transformation, data reduction and data discretization. The final dataset used is the product of data preprocessing.

2.1. Missing Values and Imputation of Missing Values

Many existing industrial and research datasets contain missing values, introduced by causes such as manual data entry procedures, equipment errors and incorrect measurements. The most important task in data cleaning is processing missing values, and many methods have been proposed in the literature to treat missing data [21, 22]. Missing data should be handled carefully; otherwise bias may be introduced into the induced knowledge, degrading the quality of the supervised learning process and the performance of classification algorithms [3, 4]. Missing-value imputation is a challenging issue in machine learning and data mining [2]. Because of the difficulty with missing values, many learning algorithms are not well adapted to some application domains (for example, web applications), since existing algorithms are designed under the assumption that there are no missing values in the dataset. This shows the necessity of dealing with missing values. The best solution to missing data is good database design, but good analysis can help reduce the problems [1]. Many techniques can be used to deal with missing data; the right technique should be selected depending on the problem domain and the goal of the data mining process.

In the literature, the main approaches to handling missing values in a dataset are:
1. Ignore the tuple
2. Fill in the missing values manually
3. Use a global constant to fill in the missing values
4. Use the attribute mean to fill in the missing values
5. Use the attribute mean of all samples belonging to the same class as the given tuple

2.2. Mean Substitution

A popular imputation method is to fill in missing values with the variable's mean or median. This method is suitable only if the data are MCAR (Missing Completely At Random). It creates a spiked distribution at the mean in frequency distributions, lowers the correlations between the imputed variables and the other variables, and underestimates variance.
The Simple Data Scaling Algorithm

The following algorithm describes the data scaling method.

Let D = { A1, A2, A3, ..., An }
where D is the set of unnormalized data,
  Ai is the ith attribute column of values of D,
  m is the number of rows (records), and
  n is the number of attributes.

Function Normalize(D)
Begin
  For i = 1 to n
  {
    Maxi ← max(Ai)
    Mini ← min(Ai)
    For r = 1 to m
    {
      Air ← Air - Mini
      Air ← Air / Maxi
      (where Air is the element of Ai at row r)
    }
  }
  Finally we will have the scaled data set.
End

The Mean Substitution Algorithm

The following algorithm describes the commonly used form of the mean substitution method [9].

Let D = { A1, A2, A3, ..., An }
where D is the set of data with missing values,
  Ai is the ith attribute column of values of D, with missing values in some or all columns, and
  n is the number of attributes.

Function MeanSubstitution(D)
Begin
  For i = 1 to n
  {
    ai ← Ai \ mi
    where ai is the column of attribute values without missing values and
    mi is the set of missing values in Ai (missing values denoted by a symbol)
    Let µi be the mean of ai
    Replace all the missing elements of Ai with µi
  }
  Finally we will have the imputed data set.
End
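The two procedures above can be sketched in Python; this is an illustrative rendering (data held as a list of rows, with missing entries marked as None rather than a symbol; function names are ours), not the authors' implementation:

```python
def mean_substitution(rows):
    """Replace each missing entry (None) with the mean of its column's observed values."""
    n_cols = len(rows[0])
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        mu = sum(observed) / len(observed)
        for r in rows:
            if r[j] is None:
                r[j] = mu
    return rows

def scale(rows):
    """Scale each column as in the algorithm above: subtract the column min,
    then divide by the column max. (Classic min-max scaling would divide by
    max - min instead; we follow the paper's formulation.)"""
    n_cols = len(rows[0])
    for j in range(n_cols):
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        for r in rows:
            r[j] = (r[j] - lo) / hi
    return rows
```

Imputation is applied first so that scaling sees a complete column; applying them in the opposite order would compute mins and maxes over columns that still contain placeholders.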
2.3. k-Nearest Neighbor (kNN) Classification Algorithm

kNN classification classifies instances based on their similarity and is one of the most popular algorithms for pattern recognition. It is a type of lazy learning, where the function is only approximated locally and all computation is deferred until classification. An object is classified by a majority vote of its neighbors; k is always a positive integer, and the neighbors are selected from a set of objects for which the correct classification is known.

The kNN algorithm is as follows:
1. Determine k, the number of nearest neighbors.
2. Using the distance measure, calculate the distance between the query instance and all the training samples.
3. Sort the distances of all the training samples and determine the nearest neighbors based on the k minimum distances.
4. Since kNN is supervised learning, get the categories of the training data for the sorted values that fall within the first k.
5. The prediction is determined by the majority vote of the nearest neighbors.

The Pseudocode of the kNN Algorithm

Function kNN(train_patterns, train_targets, test_patterns)
  Uc - a set of unique labels of train_targets
  N - size of test_patterns
  for i = 1:N,
    dist := EuclideanDistance(train_patterns, test_patterns(i))
    idxs := sort(dist)
    topkClasses := train_targets(idxs(1:k))
    c := DominatingClass(topkClasses)
    test_targets(i) := c
  end

2.4. Validating the Performance of the Classification Algorithm

Classifier performance depends on the characteristics of the data to be classified. Performance of the selected algorithm is measured in terms of accuracy and error rate. Various empirical tests can be performed to compare classifiers, such as holdout, random sub-sampling, k-fold cross validation and the bootstrap method.
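For a single query point, the kNN pseudocode above corresponds to the following minimal Python sketch (illustrative, not the authors' implementation):

```python
import math
from collections import Counter

def knn_classify(train_patterns, train_targets, query, k=3):
    """Predict the majority class among the k training points nearest to `query`
    under Euclidean distance."""
    dists = [(math.dist(p, query), t) for p, t in zip(train_patterns, train_targets)]
    dists.sort(key=lambda dt: dt[0])          # nearest first
    top_k = [t for _, t in dists[:k]]         # labels of the k nearest neighbors
    return Counter(top_k).most_common(1)[0][0]  # majority vote
```

An odd k avoids ties in two-class problems such as the diabetes dataset.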
In this study, we selected k-fold cross validation for evaluating the classifiers. In k-fold cross validation, the initial data are randomly partitioned into k mutually exclusive subsets, or folds, d1, d2, ..., dk, each of approximately equal size. Training and testing are performed k times: in the first iteration, subsets d2, ..., dk collectively serve as the training set to obtain the first model, which is tested on d1; the second iteration is trained on subsets d1, d3, ..., dk and tested on d2; and so on [19]. The accuracy of the classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data [19].

Accuracy and error rate are defined as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Error rate = (FP + FN) / (TP + FP + TN + FN)

where
TP is the number of True Positives,
TN is the number of True Negatives,
FP is the number of False Positives, and
FN is the number of False Negatives.

3. EXPERIMENTAL RESULTS

3.1. Pima Indians Diabetes Dataset

The Pima (or Akimel O'odham) are a group of American Indians living in southern Arizona. The name "Akimel O'odham" means "river people". The short name "Pima" is believed to have come from the phrase pi añi mac or pi mac, meaning "I don't know", used repeatedly in their initial meetings with Europeans [25]. According to the World Health Organization (WHO) [18], a population of Pima Indian women who were at least 21 years old was tested for diabetes. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.

Number of Instances: 768
Number of Attributes: 8 (attributes) plus 1 (class label)
All the attributes are numeric-valued.

Sl. No | Attribute | Explanation
1 | Pregnant | Number of times pregnant
2 | Glucose  | Plasma glucose concentration (glucose tolerance test)
3 | Pressure | Diastolic blood pressure (mm Hg)
4 | Triceps  | Triceps skin fold thickness (mm)
5 | Insulin  | 2-hour serum insulin (mu U/ml)
6 | Mass     | Body mass index (weight in kg / (height in m)^2)
7 | Pedigree | Diabetes pedigree function
8 | Age      | Age (years)
9 | Diabetes | Class variable (test for diabetes)

Class Distribution: class value 1 is interpreted as "tested positive for diabetes".
Class value 0: 500 instances
Class value 1: 268 instances

While the UCI repository index claims that there are no missing values, closer inspection of the data shows several physical impossibilities, e.g., a blood pressure or body mass index of 0. These should be treated as missing values to achieve better classification accuracy.

The following 3D plot clearly shows the complex distribution of the positive and negative records in the original Pima Indians Diabetes Database, which complicates the classification task.
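The recoding step implied above, i.e. treating physically impossible zeros in attributes such as Glucose, Pressure, Triceps, Insulin and Mass as missing, can be sketched as follows (an assumption-laden illustration: `pima.csv` is a hypothetical local filename for the UCI data in its usual column order):

```python
import csv

# Columns where a value of 0 cannot be a real measurement, by index in the
# UCI column order: 1=Glucose, 2=Pressure, 3=Triceps, 4=Insulin, 5=Mass.
IMPOSSIBLE_ZERO_COLS = [1, 2, 3, 4, 5]

def load_pima(path):
    """Load the dataset and replace physically impossible zeros with None (missing)."""
    rows = []
    with open(path) as f:
        for rec in csv.reader(f):
            vals = [float(x) for x in rec]
            for j in IMPOSSIBLE_ZERO_COLS:
                if vals[j] == 0.0:
                    vals[j] = None
            rows.append(vals)
    return rows
```

A zero in Pregnant, by contrast, is a legitimate value and is left untouched.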
Fig 1. The 3D plot of the original Pima Indians Diabetes Database

3.2. Results of the 10-Fold Validation

The following table shows the results in terms of accuracy, error and run time for 10-fold validation.

Table 1. 10-Fold Validation Results
Metric   | Actual Data | After Imputation | After Imputation and Scaling
Accuracy | 71.71       | 71.32            | 71.84
Error    | 28.29       | 28.68            | 28.16
Time     | 0.2         | 0.28             | 0.23

As shown in the following graphs, the performance was considerably improved when using imputed and scaled data. The performance after imputation alone is almost equal to, or slightly lower than, that obtained with the original data. The reason is that the original data contain many missing values, and the missing values represented by 0 are treated as a significant feature by the classifier. Even though the performance on the original data is higher than that on imputed data, it should not be relied upon, because it is not accurate: if we train the system with original data containing many missing values and use it to classify real-world data that may not contain any missing values, the system will not classify such ideal data correctly.

Fig 2. Comparison of Accuracy - 10-Fold Validation

The accuracy of classification improved after imputation and scaling.

Fig 3. Comparison of Error - 10-Fold Validation

The classification error was reduced after imputation and scaling.

3.3. Results of the Average of 2- to 10-Fold Validation

The following table shows the average of the results in terms of accuracy, error and run time over 2- to 10-fold validation.

Table 2. Average of 2- to 10-Fold Validation Results
Metric   | Actual Data | After Imputation | After Imputation and Scaling
Accuracy | 71.94       | 71.26            | 73.38
Error    | 28.06       | 28.74            | 26.62
Time     | 0.19        | 0.19             | 0.19
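The evaluation loop behind these results (split into k folds, train on k-1 of them, test on the held-out fold, average the accuracies) can be sketched in Python. This is an illustrative reconstruction, not the authors' code: it uses striped rather than random folds, and `classify` stands for any classifier callable of the form classify(train_patterns, train_targets, query) -> label:

```python
def k_fold_accuracy(patterns, targets, classify, k=10):
    """Average accuracy over k folds; the error rate is 1 minus this value."""
    # Striped folds: fold i holds indices i, i+k, i+2k, ...
    folds = [list(range(i, len(patterns), k)) for i in range(k)]
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_p = [p for i, p in enumerate(patterns) if i not in held_out]
        train_t = [t for i, t in enumerate(targets) if i not in held_out]
        correct = sum(
            classify(train_p, train_t, patterns[i]) == targets[i] for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

With k = 10 on 768 records this trains and tests ten models, matching the protocol of Section 2.4.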
Fig 4. Comparison of Accuracy - Average of 2- to 10-Fold Validations

The accuracy of classification improved significantly after imputation and scaling; the average of the 2- to 10-fold validations clearly shows the difference in accuracy.

Fig 5. Comparison of Error - Average of 2- to 10-Fold Validations

The classification error was significantly reduced after imputation and scaling; the average of the 2- to 10-fold validations clearly shows the difference in classification error.

The graphs above show that imputation alone does not give higher accuracy, but imputation combined with a data preprocessing technique can improve the accuracy on a dataset that has missing values. We also compared our improved kNN with Weka's implementation of kNN. The results show that the improved kNN's accuracy with the imputation and scaling method is 3.2% higher than that of the standard Weka implementation of kNN.

4. CONCLUSION AND SCOPE FOR FURTHER ENHANCEMENTS

The results show that the characteristics and distribution of the data and the quality of the input data have a significant impact on classifier performance. Performance of the kNN classifier was measured in terms of accuracy, error rate and run time using the 10-fold validation method and the average of 2- to 10-fold validations. The application of the imputation method and data normalization had an obvious impact on classification performance and significantly improved the performance of kNN. The performance after imputation alone is almost equal to, or slightly lower than, that obtained with the original data, because the original data contain many missing values and the missing values represented by 0 are treated as a significant feature by the classifier. So data imputation methods do not always lead to so-called "good" results; they give "correct" results. That is, imputation of missing values will not by itself lead to higher accuracy, but imputation along with a suitable data preprocessing method does increase accuracy. Our improved kNN's accuracy with the imputation and scaling method is 3.2% higher than that of the standard Weka implementation of kNN.

To improve the overall accuracy further, we have to improve the performance of the classifier or use a good feature selection method. Future work may address hybrid classification models using kNN with other techniques. Even though the accuracy achieved by the kNN-based method is slightly lower than that of some previous methods, there is still scope for improving it. For example, we may use another distance metric instead of the standard Euclidean distance in the distance calculation part of kNN, or use an evolutionary computing technique such as GA or PSO for feature selection or feature weight selection to attain the maximum possible classification accuracy with kNN. Future work may address ways to incorporate these ideas into the standard kNN.

ANNEXURE - RESULTS OF K-FOLD VALIDATION

The following tables show the k-fold validation results, where k is changed from 2 to 10.
We used the average of these results, as well as the results corresponding to k = 10, for preparing the graphs in the previous sections.

Table 3. Results with the Actual Pima Indian Diabetes Dataset
k   | Accuracy | Error | Time
2   | 72.53    | 27.47 | 0.13
3   | 74.35    | 25.65 | 0.20
4   | 71.61    | 28.39 | 0.16
5   | 70.98    | 29.02 | 0.17
6   | 72.40    | 27.60 | 0.19
7   | 70.90    | 29.10 | 0.25
8   | 72.14    | 27.86 | 0.20
9   | 70.85    | 29.15 | 0.19
10  | 71.71    | 28.29 | 0.20
avg | 71.94    | 28.06 | 0.19
Table 4. Results with Imputed Data
k   | Accuracy | Error | Time
2   | 71.22    | 28.78 | 0.16
3   | 70.44    | 29.56 | 0.19
4   | 70.83    | 29.17 | 0.16
5   | 71.24    | 28.76 | 0.14
6   | 71.88    | 28.13 | 0.19
7   | 71.56    | 28.44 | 0.20
8   | 71.61    | 28.39 | 0.17
9   | 71.24    | 28.76 | 0.19
10  | 71.32    | 28.68 | 0.28
avg | 71.26    | 28.74 | 0.19

Table 5. Results with Imputed and Scaled/Normalized Data
k   | Accuracy | Error | Time
2   | 74.74    | 25.26 | 0.22
3   | 74.09    | 25.91 | 0.13
4   | 71.88    | 28.13 | 0.16
5   | 73.59    | 26.41 | 0.17
6   | 72.92    | 27.08 | 0.17
7   | 74.31    | 25.69 | 0.16
8   | 74.09    | 25.91 | 0.22
9   | 72.94    | 27.06 | 0.22
10  | 71.84    | 28.16 | 0.23
avg | 73.38    | 26.62 | 0.19

5. REFERENCES

[1] Newman, D.J., Hettich, S., Blake, C.L. and Merz, C.J. (1998). UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
[2] Brian D. Ripley (1996), Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge.
[3] Grace Wahba, Chong Gu, Yuedong Wang, and Richard Chappell (1995), "Soft Classification, a.k.a. Risk Estimation, via Penalized Log Likelihood and Smoothing Spline Analysis of Variance", in D. H. Wolpert (Ed.), The Mathematics of Generalization, 331-359, Addison-Wesley, Reading, MA.
[4] S. Sahan, K. Polat, H. Kodaz, and S. Gunes, "The medical applications of attribute weighted artificial immune system (AWAIS): Diagnosis of heart and diabetes diseases", in ICARIS, 2005, pp. 456-468.
[5] M. Salami, A. Shafie, and A. Aibinu, "Application of modeling techniques to diabetes diagnosis", in IEEE EMBS Conference on Biomedical Engineering & Sciences, 2010.
[6] T. Exarchos, D. Fotiadis, and M. Tsipouras, "Automated creation of transparent fuzzy models based on decision trees: application to diabetes diagnosis", in 30th Annual International IEEE EMBS Conference, 2008.
[7] M. Ganji and M. Abadeh, "Using fuzzy ant colony optimization for diagnosis of diabetes disease", IEEE, 2010.
[8] T. Jayalakshmi and A. Santhakumaran, "A novel classification method for diagnosis of diabetes mellitus using artificial neural networks", in International Conference on Data Storage and Data Engineering, 2010.
[9] R.S. Somasundaram and R. Nedunchezhian, "Evaluation of Three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values", International Journal of Computer Applications, ISSN 0975-8887, 2011.
[10] Peterson, H. B., Thacker, S. B., Corso, P. S., Marchbanks, P. A. and Koplan, J. P. (2004), "Hormone therapy: making decisions in the face of uncertainty", Archives of Internal Medicine, 164(21), 2308-12.
[11] Kietikul Jearanaitanakij, "Classifying Continuous Data Set by ID3 Algorithm", Proceedings of the Fifth International Conference on Information, Communications and Signal Processing, 2005.
[12] Manaswini Pradhan and Ranjit Kumar Sahu, "Predict the onset of diabetes disease using Artificial Neural Network (ANN)", International Journal of Computer Science & Emerging Technologies (E-ISSN: 2044-6004), Volume 2, Issue 2, April 2011.
[13] P. Umar Sathic Ali and C. Jothi Ventakeswaran, "Improved Evidence Theoretic kNN Classifier based on Theory of Evidence", International Journal of Computer Applications (0975-8887), Volume 15, No. 5, February 2011.
[14] Pengyi Yang, Liang Xu, Bing B. Zhou, Zili Zhang, and Albert Y. Zomaya, "A particle swarm based hybrid system for imbalanced medical data sampling", BMC Genomics, 2009; 10(Suppl 3): S34.
[15] Biswadip Ghosh, "Using Fuzzy Classification for Chronic Disease", Indian Journal of Economics & Business, March 2012.
[16] Angeline Christobel and Sivaprakasam, "An Empirical Comparison of Data Mining Classification Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[17] Roshawnna Scales and Mark Embrechts, "Computational intelligence techniques for medical diagnostics".
[18] World Health Organization. Available: http://www.who.int
[19] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, USA, 2006.
[20] Agrawal, R., Imielinski, T. and Swami, A., "Database Mining: A Performance Perspective", IEEE Transactions on Knowledge and Data Engineering, pp. 914-925, December 1993.
[21] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy", in D. Banks, L. House, F.R. McMorris, P. Arabie, W. Gaul (Eds.), Classification, Clustering and Data Mining Applications, Springer-Verlag, Berlin-Heidelberg, 2004.
[22] G.E.A.P.A. Batista and M.C. Monard, "An analysis of four missing data treatment methods for supervised learning", Applied Artificial Intelligence 17 (2003).
[23] Chen, M., Han, J. and Yu, P.S., "Data Mining: An Overview from Database Perspective", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
[24] Angeline Christobel, Usha Rani Sridhar and Kalaichelvi Chandrahasan, "Performance of Inductive Learning Algorithms on Medical Datasets", International Journal of Advances in Science and Technology, Vol. 3, No. 4, 2011.
[25] Awawtam, "Pima Stories of the Beginning of the World", The Norton Anthology of American Literature, 7th ed., Vol. A, New York: W. W. Norton, 2007, 22-31.
[26] Quinlan, J. R., "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, 1993.
[27] S.B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques", Informatica 31 (2007), 249-268.
[28] J. R. Quinlan, "Improved use of continuous attributes in C4.5", Journal of Artificial Intelligence Research, 4:77-90, 1996.
[29] J.R. Quinlan, "Induction of decision trees", in Jude W. Shavlik and Thomas G. Dietterich (Eds.), Readings in Machine Learning, Morgan Kaufmann, 1990. Originally published in Machine Learning, vol. 1, 1986, pp. 81-106.
[30] Salvatore Ruggieri, "Efficient C4.5", IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, pp. 438-444, 2002.
[31] P. Domingos and M. Pazzani, "On the optimality of the simple Bayesian classifier under zero-one loss", Machine Learning 29(2-3) (1997), 103-130.
[32] Vapnik, V.N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
[33] Michael J. Sorich, John O. Miners, Ross A. McKinnon, David A. Winkler, Frank R. Burden, and Paul A. Smith, "Comparison of linear and nonlinear classification algorithms for the prediction of drug and chemical metabolism by human UDP-Glucuronosyltransferase isoforms".
[34] UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
[35] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg, "Top 10 algorithms in data mining", Springer, 2007.
[36] Niculescu-Mizil, A. and Caruana, R. (2005), "Predicting good probabilities with supervised learning", Proc. 22nd International Conference on Machine Learning (ICML 05).
[37] Thair Nu Phyu, "Survey of Classification Techniques in Data Mining", Proceedings of the International MultiConference of Engineers and Computer Scientists, 2009, Vol. I, IMECS 2009, Hong Kong.
[38] Arbach, L., Reinhardt, J.M., Bennett, D.L. and Fallouh, G., "Mammographic masses classification: comparison between backpropagation neural network (BNN), K nearest neighbors (KNN), and human readers", 2003 IEEE CCECE.
[39] D. Xhemali, C.J. Hinde, and R.G. Stone, "Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages", International Journal of Computer Science, 4(1):16-23, 2009.
