Clustering Medical Data to Predict the Likelihood of Diseases

Razan Paul, Abu Sayed Md. Latiful Hoque
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh

Abstract

Several studies show that background knowledge of a domain can improve the results of clustering algorithms. In this paper, we illustrate how to use background knowledge of the medical domain in the clustering process to predict the likelihood of diseases. To find the likelihood of a disease, clustering has to be done based on anticipated likelihood attributes together with the core attributes of the disease in each data point. For this purpose we have proposed a constraint k-Means-Mode clustering algorithm. Attributes of medical data are both continuous and categorical, and the developed algorithm can handle both continuous and discrete data while clustering on the anticipated likelihood attributes together with the core attributes of the disease. We have demonstrated its effectiveness by testing it on a real-world patient data set.

1. Introduction

Clustering is an attractive approach for finding similarities in data and putting similar data into groups. Due to the high dimensionality of medical data [1], if clustering is done based on all the attributes of the medical domain, the resulting clusters will not be useful: they are medically irrelevant and contain redundant information. Moreover, this property makes likelihood analysis hard and the partitioning process slow. To find the likelihood of a disease, clustering has to be done based on anticipated likelihood attributes together with the core attributes of the disease in each data point. For example, by clustering a large number of patients with age, weight, sex, smoke, and HbA1c% selected as the data point, but with only age, weight, sex, and smoke allowed into the clustering process, we obtain clusters partitioned by age, weight, sex, and smoke. This way we get clusters whose members have similar age, weight, sex, and smoke values; analyzing each cluster based on HbA1c% can then give likelihood information for diabetes.

Attributes of medical data are both continuous and categorical. K-means clustering [2] is a widely used technique to partition large data sets with numerical attributes. In [3-4], the authors extend the k-means algorithm to partition large data sets with categorical objects. K-means [2] and K-modes [3-4] are thus recognized techniques to partition large data sets based on numerical attributes and categorical attributes respectively. To find the likelihood of a disease, we need a clustering algorithm that can partition objects consisting of both numerical and categorical attributes, and that can set constraints on the presence or absence of items in the clustering process and in the data point.

A number of works [5-11] have proposed techniques to address variants of the conventional clustering problem, including clustering in the presence of information about the problem domain or some background knowledge. Our proposed algorithm likewise performs clustering in the presence of information about the medical domain to predict the likelihood of diseases; however, the way it uses medical background knowledge differs from the techniques of [5-11].

For heart attack prediction, the authors of [12-14] have performed clustering on a preprocessed data warehouse using the K-means clustering algorithm. The data for heart attack prediction are a mixture of continuous and discrete values, but K-means cannot cluster categorical attributes; therefore the approaches of [12-13] will not work to predict heart attack. In [14], the author clusters aperiodical medical data, which are both continuous and discrete, using the K-means clustering algorithm.

2. Mapping complex medical data to mineable items

For knowledge discovery, medical data have to be transformed into a suitable transaction format. We have addressed the problem of mapping complex medical data to items using a domain dictionary and a rule base, as shown in Figure 1. Medical data are of categorical, continuous numerical, boolean, interval, percentage, fraction, and ratio types. Medical domain experts have the knowledge of how to map ranges of numerical data for each attribute to a series of items.

978-1-4244-7571-1/10/$26.00 ©2010 IEEE
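The expert-rule mapping just described can be sketched in a few lines of Python. This is a minimal illustration, not the paper's C# implementation: the function names, the attribute names, and the age thresholds (taken from the example rules shown in Figure 1) are assumptions for demonstration only.

```python
# Sketch of the rule-base / dictionary mapping of Section 2 (hypothetical names).
# Continuous attributes go through expert rules; categorical attributes are
# assigned integer codes from a growing per-attribute dictionary.

def age_rule(age):
    # Example expert rule from Figure 1: age <= 12 -> 1, 13..60 -> 2, > 60 -> 3.
    if age <= 12:
        return 1
    if age <= 60:
        return 2
    return 3

rule_base = {"age": age_rule}   # continuous attribute -> rule function
dictionaries = {}               # categorical attribute -> {raw value: code}

def map_value(attribute, value):
    """Map one raw attribute value to an integer item."""
    if attribute in rule_base:                  # continuous: apply the expert rule
        return rule_base[attribute](value)
    d = dictionaries.setdefault(attribute, {})  # categorical: grow a dictionary
    if value not in d:
        d[value] = len(d) + 1                   # assign the next integer code
    return d[value]

record = {"age": 33, "smoke": "Yes", "diagnosis": "Headache"}
mapped = {a: map_value(a, v) for a, v in record.items()}
print(mapped)   # {'age': 2, 'smoke': 1, 'diagnosis': 1}
```

This mirrors the two phases of Figure 1: the rule base and dictionaries are Phase 1, and `map_value` is Phase 2.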
For example, there are certain conventions for considering a person young, adult, or elderly with respect to age. A set of rules is created for each continuous numerical attribute using the knowledge of medical domain experts, and a rule engine maps continuous numerical data to items using these rules. For data to which medical domain expert knowledge is not applicable, we have used a domain dictionary approach to transform the data to numerical form. As the cardinality of attributes other than continuous numeric data is not high in the medical domain, these attribute values are mapped to integer values using medical domain dictionaries. The mapping process is therefore divided into two phases. Phase 1: a rule base is constructed from the knowledge of medical domain experts, and dictionaries are constructed for attributes where domain expert knowledge is not applicable. Phase 2: attribute values are mapped to integer values using the corresponding rule base and the dictionaries.

Figure 1. Data transformation of medical data. (The figure shows example dictionaries, such as Headache -> 1, Fever -> 2 for the Diagnosis attribute and Yes -> 1, No -> 2 for the Smoke attribute, together with example rules such as "If age <= 12 then 1; if 13 <= age <= 60 then 2; if age > 60 then 3", applied to raw patient records to produce data suitable for knowledge discovery.)

3. The proposed algorithm

Figure 2 shows the proposed hybrid partitioning algorithm, which can handle both continuous and discrete data and performs clustering based on anticipated likelihood attributes together with the core attributes of the disease in each data point. In this algorithm, the user sets which attributes will be used as the data point for a patient and which attributes will participate in the clustering process. The goal of the algorithm is to build clusters from which likelihood can be found. Healthcare data are sparse, as doctors perform only a few different clinical lab tests for a patient over his lifetime, so naturally many patients do not have all the anticipated likelihood attributes. When a patient lacks one or more anticipated attributes, keeping that patient in the clustering process would make the clusters useless for finding likelihood; therefore, we ignore such patients in the clustering process.

3.1. Updating cluster centers

We need to update the k cluster centers dynamically in order to minimize the intra-cluster distance of patients. Here k is the number of clusters we would like to make, Pi is the ith patient attribute, and Ci is the ith mean-mode value of cluster C. As the patient attributes are both continuous and discrete, each cluster center is an array of both average and mode values, where the average and mode are computed for the continuous and discrete attributes respectively. The mean is computed for each continuous attribute by averaging that attribute over the data points in the cluster; the mode is computed for each discrete attribute by taking the most frequent value of that attribute among the data points in the cluster.

3.2. Dissimilarity measure

The object dissimilarity measure is derived from both numeric and categorical attributes. For discrete features, the dissimilarity between two data points depends on the number of differing values in each categorical feature.
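The mean-mode center update of Section 3.1 can be sketched as follows. This is an illustrative Python sketch (the paper's implementation was in C#); the representation of patients as dictionaries and the function name are our assumptions.

```python
from collections import Counter

def update_center(patients, continuous, categorical):
    """Recompute one cluster's mean-mode center (Section 3.1):
    mean over each continuous attribute, mode over each categorical one."""
    center = {}
    for a in continuous:
        vals = [p[a] for p in patients if a in p]
        center[a] = sum(vals) / len(vals)            # mean of the attribute
    for a in categorical:
        vals = [p[a] for p in patients if a in p]
        center[a] = Counter(vals).most_common(1)[0][0]  # most frequent value
    return center

cluster = [{"age": 30, "smoke": 1}, {"age": 50, "smoke": 1}, {"age": 40, "smoke": 2}]
print(update_center(cluster, ["age"], ["smoke"]))   # {'age': 40.0, 'smoke': 1}
```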
Algorithm: Partition patients to find the likelihood of disease based on the MeanMode values of patients.
1. Read the metadata about which attributes may appear in the clustering process.
2. Partition the patient data into k clusters at random and assign each partition to a cluster. To retrieve patient data, use the corresponding RetrieveAllPatientsRecord() for each data model.
3. Repeat
   3.1 Call UpdateMeanModeofClusters(K, M) to update the Mean-Mode values of the k clusters.
   3.2 Move each patient Pi to the cluster with the least distance, finding the distance between a patient and a cluster using the function Distance(P, C, m).
   Until no patient is moved.

Procedure UpdateMeanModeofClusters(K: Number of clusters, M: Medical attributes)
1. For each cluster c in K
   1.1 i = 0
   1.2 For each attribute A in M where A can appear in clustering
       1.2.1 If A is a continuous attribute,
             MeanModec[i] = the mean of the values of attribute A among the data points in cluster c.
       1.2.2 Else if A is a categorical attribute,
             MeanModec[i] = the mode of the values of attribute A among the data points in cluster c.
       1.2.3 i++

Procedure Distance(P: Patient, C: Cluster, m: Number of attributes)
// Here Pi is the ith attribute value of patient P and Ci is the ith MeanMode value of cluster C.
1. For i = 1 to m, where the ith attribute value of the patient can appear in clustering
   1.1 If Pi is continuous
       1.1.1 D1 = D1 + (Pi - Ci)^2
   1.2 Else (categorical)
       1.2.1 D2 = D2 + NumberofOnes(Pi ^ Ci)
   1.3 d = SQRT(D1) + D2
2. Return d

Figure 2. Constraint k-Means-Mode clustering algorithm

For continuous features, the dissimilarity between two data points depends on Euclidean distance. We therefore use two functions to measure dissimilarity: a Euclidean distance function for continuous data and a Hamming distance function for categorical objects. To measure the distance between two objects over several features, we test for each feature whether it is discrete or continuous: if it is continuous, its contribution is measured by squared Euclidean distance and added to D1; if it is discrete, its dissimilarity is measured by Hamming distance and added to D2. The resultant distance is the square root of D1 plus D2. The computational complexity of the algorithm is O((I+1)kp), where p is the number of patients, k the number of clusters, and I the number of iterations.

Let the anticipated likelihood attributes be A = {a1, a2, a3, ...} and let the core attributes of the disease be given separately. In the clustering process, only the anticipated likelihood attributes participate. They consist of both continuous and categorical attributes; let the first attributes of A be continuous and the remaining ones categorical, and let the anticipated likelihood attributes of two data points be x and y. The dissimilarity between the anticipated likelihood attributes of two data points is the sum of the dissimilarity of the continuous attributes and the dissimilarity of the categorical attributes. The distance based on the continuous attributes is the Euclidean distance sqrt( sum_i (x_i - y_i)^2 ), the sum running over the continuous attributes. The distance based on the categorical attributes is the Hamming distance sum_j delta(x_j, y_j), where delta(x_j, y_j) = 0 if x_j == y_j and 1 otherwise.

3.3. Likelihood

Likelihood is the probability of a specified outcome. After clustering using the constraint K-Means-Mode algorithm we get a set of clusters C = {c1, c2, c3, ..., ck}. Each cluster contains a set of data points consisting of the anticipated likelihood attributes and the core attributes of the disease; the data points of cluster cj are Dj = {dj1, dj2, dj3, ..., dju}. A set of boolean functions on the core attributes of the disease determines whether a data point shows presence of the disease. Let this set be F = {f1, f2, f3, ..., fv}. A data point dt shows presence of the disease if f1(dt) AND ... AND fv(dt) is true. In a cluster, the number of data points showing presence of the disease is sum_{j=1..u} [ f1(dj) AND ... AND fv(dj) ], and the total number of data points in the cluster is u. So the likelihood of a cluster for the disease is

    ( sum_{j=1..u} [ f1(dj) AND ... AND fv(dj) ] ) / u,

where each fi returns either one or zero.

Each cluster is represented by its mean-mode value. The mean is calculated over the continuous attributes and the mode over the categorical attributes. Let the mean-mode value of a cluster be MM = {mm1, mm2, mm3, ..., mmz}, where z is the number of attributes in the clustering process, and let the first y attributes of MM be continuous and the remaining z - y categorical. The continuous part of the mean-mode value is MMi (i = 1, ..., y) = the mean of the ith attribute values of cluster c; the categorical part is MMj (j = y+1, ..., z) = the mode of the jth attribute values of cluster c.

4. Results and discussion

The experiments were done on a PC with a Core 2 Duo processor with a clock rate of 1.8 GHz and 3 GB of main memory. The operating system was Microsoft Vista and the implementation language was C#. We used two datasets to verify our method. The first is a patient dataset collected and preprocessed from Bangladeshi hospitals, which has 50273 instances with 514 attributes (150 discrete and 364 numerical). The patient dataset was clustered into 5 classes (Very High Risk, High Risk, Medium Risk, Low Risk, No Risk) using the proposed algorithm to find the likelihood of diabetes. The second is the Zoo Data Set [15] from the UCI Machine Learning Repository, which has characteristics similar to medical data. It contains 101 instances in 7 classes {mammal, bird, reptile, fish, amphibian, insect, invertebrate}, each described by 18 attributes (16 discrete and 2 numerical). We have taken the average value over 10 trials for each test result. Likelihood is the probability of a specified disease; average likelihood is the average over all cluster likelihoods. Actual likelihood is the actual probability of the disease in the data, found using a brute-force approach. Accuracy is the ratio between average likelihood and actual likelihood.

Figure 3. Accuracy of test results for the patient dataset to find the likelihood of diabetes. (Accuracy of K-Means, K-Mode, and K-Means-Mode, each with and without background knowledge (BK), over the number of boolean functions.)

For the patient dataset, Figure 3 presents accuracy results for the K-Means, K-Mode, K-Means-Mode, K-Means with background knowledge (BK), K-Mode with BK, and K-Means-Mode with BK algorithms over the number of boolean functions. The number of boolean functions for each presented result is also indicated. The figure shows that an average accuracy of 95.1% is achieved using the medical background information together with the hybrid clustering algorithm. The K-means algorithm with background knowledge achieves an average accuracy of 56%, and without background knowledge 17.7%. Both K-mode variants perform much worse, averaging 12.1% without and 30.2% with background knowledge. As illustrated in Figure 3, the proposed method improves on k-means with background knowledge by about 39-40% and on k-mode with background knowledge by about 64-65%. It also gives much better accuracy than plain k-means and K-Mode: about 77-78% over k-means and about 82-83% over k-mode.
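To make the full procedure of Figure 2 and Section 3.3 concrete, here is a minimal end-to-end Python sketch. The paper's implementation was in C#, so everything here is an assumption for illustration: the function names, the dictionary-based patient records, and the per-attribute 0/1 mismatch used in place of the paper's NumberofOnes bit count for categorical attributes.

```python
import math
import random
from collections import Counter

def distance(p, center, continuous, categorical):
    """d = sqrt(D1) + D2 as in Figure 2: squared differences summed over
    continuous attributes, 0/1 mismatch (Hamming) over categorical ones."""
    d1 = sum((p[a] - center[a]) ** 2 for a in continuous)
    d2 = sum(1 for a in categorical if p[a] != center[a])
    return math.sqrt(d1) + d2

def cluster_patients(patients, continuous, categorical, k, seed=0):
    """Constraint k-Means-Mode sketch: only the anticipated likelihood
    attributes drive the partitioning; patients missing any of them are
    ignored, as Section 3 prescribes."""
    random.seed(seed)
    attrs = continuous + categorical
    patients = [p for p in patients if all(a in p for a in attrs)]
    assign = [random.randrange(k) for _ in patients]    # random initial partition
    while True:
        centers = []
        for c in range(k):
            members = [p for p, a in zip(patients, assign) if a == c] or patients
            center = {a: sum(m[a] for m in members) / len(members)
                      for a in continuous}              # mean part
            center.update({a: Counter(m[a] for m in members).most_common(1)[0][0]
                           for a in categorical})       # mode part
            centers.append(center)
        new = [min(range(k),
                   key=lambda c: distance(p, centers[c], continuous, categorical))
               for p in patients]
        if new == assign:                               # no patient moved
            return patients, assign
        assign = new

def likelihood(members, tests):
    """Section 3.3: fraction of a cluster's data points on which every
    boolean core-attribute test fi holds."""
    return sum(all(f(p) for f in tests) for p in members) / len(members)
```

For instance, clustering on age and smoke and then calling `likelihood` on each cluster with a test such as `lambda p: p["HbA1c"] > 7` reproduces, in miniature, the diabetes example from the introduction.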
The figure also shows that an average accuracy of only 30.2-56% is achieved by K-Means or K-Mode using the background information alone, while the K-Means-Mode algorithm without background knowledge achieves an average accuracy of about 28%. This demonstrates that neither the medical background information nor the hybrid clustering algorithm alone performs very well, but combining the two produces excellent results.

Figure 4. Accuracy of test results for the Zoo Data Set. (Accuracy of the same six algorithms over the number of boolean functions.)

For the Zoo Data Set [15], Figure 4 shows accuracy results for the K-Means, K-Mode, K-Means-Mode, K-Means with background knowledge (BK), K-Mode with BK, and K-Means-Mode with BK algorithms over the number of boolean functions. The number of boolean functions for each presented result is also indicated. It again demonstrates that neither the background information nor the hybrid clustering algorithm alone performs very well, but combining the two produces excellent results.

5. Conclusion

Clustering medical data is important, as the results of such analysis can be used to improve patient care and treatment. We have proposed a clustering method for medical data that predicts the likelihood of diseases by combining the k-means and k-modes algorithms and incorporating medical background knowledge. It clusters both numerical and categorical data efficiently and allows the user to specify constraints on which attributes participate in the clustering process and which attributes are selected as the data point. The method has been applied to a real-world medical data set and to the Zoo Data Set from the UCI Machine Learning Repository, and we have shown significant improvements in accuracy. We draw the following conclusion from this work: neither the medical background information nor the hybrid clustering algorithm alone performs very well, but combining the two produces excellent results.

6. References

[1] P. B. Torben and J. S. Christian, "Research Issues in Clinical Data Warehousing," in Proceedings of the 10th International Conference on Scientific and Statistical Database Management, Capri, 1998, pp. 43-52.
[2] J. B. MacQueen, "Some methods of classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, Berkeley, CA, 1967, pp. 281-297.
[3] Z. Huang, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 283-304, September 1998.
[4] O. M. San, V. N. Huynh, and Y. Nakamori, "An Alternative Extension of the k-Means Algorithm for Clustering Categorical Data," JAMCS, vol. 14, no. 2, pp. 241-247, 2004.
[5] H. C. Hongch and D. Y. Yeung, "Locally linear metric adaptation for semi-supervised clustering," in Proceedings of the Twenty-First International Conference on Machine Learning, Banff, Alberta, Canada, 2004, pp. 153-160.
[6] H. C. Hongch and D. Y. Yeung, "Locally linear metric adaptation with application to semi-supervised clustering and image retrieval," Pattern Recognition, vol. 39, no. 7, pp. 1253-1264, July 2006.
[7] K. Shin and A. Abraham, "Two Phase Semi-supervised Clustering Using Background Knowledge," Lecture Notes in Computer Science, vol. 4224, pp. 707-712, September 2006.
[8] M. S. Baghshaha and S. B. Shourakib, "Kernel-based metric learning for semi-supervised clustering," Neurocomputing, December 2009.
[9] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, "Constrained K-means Clustering with Background Knowledge," in Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 577-584.
[10] G. Y. Hang, D. Zhang, J. Ren, and C. Hu, "A Hierarchical Clustering Algorithm Based on K-Means with Constraints," in Fourth International Conference on Innovative Computing, Information and Control, Kaohsiung, Taiwan, 2009, pp. 1479-1482.
[11] K. Li, Z. Cao, L. Cao, and R. Zhao, "A novel semi-supervised fuzzy C-means clustering method," in Proceedings of the 21st Annual International Conference on Chinese Control and Decision Conference, Guilin, China, 2009, pp. 3804-3808.
[12] S. B. Patil and Y. S. Kumaraswamy, "Extraction of Significant Patterns from Heart Disease Warehouses for Heart Attack Prediction," International Journal of Computer Science and Network Security, vol. 9, no. 2, pp. 228-235, February 2009.
[13] S. B. Patil and Y. S. Kumaraswamy, "Intelligent and Effective Heart Attack Prediction System Using Data Mining and Artificial Neural Network," European Journal of Scientific Research, vol. 31, no. 4, pp. 642-656, 2009.
[14] M. Sacha, "Clustering of an aperiodical medical data," 2008. [Online]. aperiodical-medical-data.
[15] Zoo Data Set. (n.d.). Retrieved 03 01, 2010, from Machine Learning Repository.