Published on

This paper presents a hybrid data mining approach based on supervised learning and unsupervised learning to identify the closest data patterns in the data base. This technique enables to achieve the maximum accuracy rate with minimal complexity. The proposed algorithm is compared with traditional clustering and classification algorithm and it is also implemented with multidimensional datasets. The implementation results show better prediction accuracy and reliability.

Published in: Education
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. International Journal of Computer Applications Technology and ResearchVolume 2– Issue 3, 214 - 217, 2013www.ijcat.com 214A HYBRID MODEL FOR MINING MULTI DIMENSIONALDATA SETSS. Santhosh KumarPRIST University,Thanjavur,Tamil Nadu, IndiaE.RamarajSchool of ComputingAlagappa University, Karaikudi,Tamil Nadu, IndiaAbstract: This paper presents a hybrid data mining approach based on supervised learning and unsupervised learning to identify theclosest data patterns in the data base. This technique enables to achieve the maximum accuracy rate with minimal complexity. Theproposed algorithm is compared with traditional clustering and classification algorithm and it is also implemented withmultidimensional datasets. The implementation results show better prediction accuracy and reliability.Keywords: Classification, Clustering, C4.5, k-means Algorithm.1. INTRODUCTIONClustering and classification are the two familiar data miningtechniques used for similar and dissimilar grouping of objectsrespectively. Although, due to efficient use of data miningtechniques, clustering and classification techniques are usedas pre-process activities. The clustering technique categoriesthe data and reduces the number of features, removesirrelevant, redundant, or noisy data, and forms the sub groupsof the given data based on its relativity. The classification isused as secondary process which further divides the similar(clustered) groups in to two discrete sub groups based on theattribute value. From our research work, especially for largedata bases, the prior classification is required to minimizemultiple features of data so that it can be mined easily. In thispaper we proposed a combined approach of classification andclustering for gene sub type prediction.2. PRELIMINARIES2.1 C4.5 AlgorithmIt is used to generate a decision tree developed by RossQuinlan and it is an extension of ID3 algorithm. The decisiontrees generated by C4.5 can be used for classification, anC4.5builds decision trees from a set of training data in the sameway as ID3, using the concept of information entropy. Thetraining data is a set of already classified samples. Eachsample consists of a p-dimensional vector, where theyrepresent attributes or features of the sample, as well as theclass in which falls. At each node of the tree, C4.5 chooses theattribute of the data that most effectively splits its set ofsamples into subsets enriched in one class or the other. Thesplitting criterion is the normalized information gain. Theattribute with the highest normalized information gain ischosen to make the decision. The C4.5 algorithm thenrecurses on the smaller sub lists.2.2 K-means AlgorithmThe k-means algorithm was developed by Mac Queen basedon standard algorithm. It is one of the most widely used hardclustering techniques. This is an iterative method where thespecified number of clusters should initialise earlier. Onemust specify the number of clusters beforehand. Thealgorithm can be specified as a given set of observations (x1,x2, …, xn), where each observation is a d-dimensional realvector, k-means clustering aims to partition the n observationsinto k sets (k ≤ n) S = {S1, S2, …, Sk} so as to minimize thewithin-cluster sum of squaresWhere μi is the mean of points in SiThe algorithm works as follows:o The k (number of clusters) value number of clustersmust be initialisedo Randomly select k cluster centres (centroids) in thedata spaceo Assign data points to clusters based on the shortestEuclidean distance to the cluster centerso Re-compute new cluster centers by averaging theobservations assigned to a clustero Repeat above two steps until convergence criterionis satisfiedThe advantage of this approach is its efficiency to handlelarge data sets and can work with compact clusters. The majorlimitation of this technique is the requirement to specify thenumber of clusters beforehand and its assumption that clustersare spherical.3. RELATED STUDIESIn the year 2009 CSVM [1], Clustering based classificationtechnique is proposed by Juanying Xie, Chunxia Wang, YanZhang, Shuai Jiang for unlabelled data prediction. Theycombined different kinds of k-means algorithm with SVMclassifier to achieve better results. In order to avoid the majordrawback of k-means algorithm; k-value initialisation, theCSVM is proposed. In 2010 [2], Pritha Mahata proposed anew hierarchical clustering technique called ECHC(exploratory consensus of hierarchical clustering’s) which isused to sub group the various types of melanoma cancer. Thiswork reveals that, k-means algorithm gives better results forbiological subtype with proper sub tree. In 2010 [3], TaysirHassan A. Soliman, proposed a clustering and classification asa combined approach to classify the different types of diseasesbased on gene selection method. The results had shownimproved accuracy in data prediction. In 2011[4], ReubenEvans, Bernhard Pfahringer, Geoffrey Holmes, proposed atechnique called statistical based clustering technique forlarge datasets. They used k-means algorithm as initial step forcentroid prediction and classification as secondary step. Inmarch 2013[5] Claudio Gentile, Fabio Vitale, GiovanniZappella, implemented the combined technique (clusteringand classification) in networks using signed graphs
  2. 2. International Journal of Computer Applications Technology and ResearchVolume 2– Issue 3, 214 - 217, 2013www.ijcat.com 215(classification) and correlation based grouping (clustering). Inthis work we proposed that for very larger multi dimensionaldata base needs discrete classification as a primary processand its results classified samples that are grouped with the useof clustering methods. For classification the C4.5 classifier isused and k-means clustering is used as a secondary process.Fig 1: Hybrid model for Clustering and classification4. PROPOSED WORKWith the combination of supervised and unsupervisedlearning approach, we proposed thato A combined approach in needed in order to categorizeand search the relevant data from multidimensionaldatasetso For multi dimensional data sets, the primary partitioningis needed ; so that the categorized search gives betterresults with accuracy and time efficiencyo The classified samples can be sub grouped into smallersets; so that the clustering can be done with global andlocal optimization4.1 Proposed ModelBased on the drawbacks in existing approaches, we proposeda hybrid model for multi dimensional data sets. The majordrawback of large datasets is its dimensionality. It isunavoidable, when needed information contains additional orunrelated data then it leads more difficult to search, analyze,and transfer the data. To overcome the above said problem,the major categorization of data based on requirement isneeded. So that major part of large data can be divided in totwo or more groups. One category contains the required dataand other contains unrelated data respectively. We divided ourwork in to two parts; first approach is to divide the large dataset in to major categories (based on its correlation) whichrequires discrete values to classify the data. For that aclassifier should contain the data with related information inhierarchical form is needed. The C4.5 Decision tree algorithmis recommended to classify the data set into major discretegroups. Our second approach is to find the relativity amongthe categorized data. For that similar and dissimilar distancemeasures between the data items are considered. Based on itsweight of the attributes in the group are organized. The k-means algorithm is used for clustering the similar dataattributes in a categorized group.5. EXPERIMENTAL EVALUATIONWe have used colon cancer data set of colorectal cancer datafrom Biological data analysis web site [6]. The Data setcontains 7,460 genes and 272 samples which contains bothnormal and infected genes. For our approach we have takenentire genes with samples and as stated above as a firstapproach we used C4.5 classifier to categorize data in termsof normal and infected genes based on its intensity range. Thecategorized genes are then clustered with the use of k-meansalgorithm. As a first step we implemented the data set intoC4.5 classifier. The parameters taken for classification arementioned below.Table1. Classifier Parameter InitializationDecision Tree (C 4.5) TopMin size of leaves 10Confidence-level 0.25Error Rate 0.8696The following figure shows the graphical representation ofmajor classification of genes as normal and tumor intensity.Fig.2. Normal Vs Tumor Intensity Classifier RepresentationThe second step is the given dataset is implemented directlyinto k-means algorithm for clustering. The parameters takenfor clustering is tabled below.(X1) Intensity inNormalvs.(X2) Intensity inTumor1,2001,1001,000900800700600500400300200100220210200190180170160150140130120110100908070605040302010
  3. 3. International Journal of Computer Applications Technology and ResearchVolume 2– Issue 3, 214 - 217, 2013www.ijcat.com 216Table.2. Cluster Parameter InitializationClusters (k-Means) 10Max Iteration 10Trials 5Distance Normalization varianceThe figure shows the similar and dissimilar generepresentation using k-means algorithm.Fig.3. Normal Vs Tumor Intensity Cluster RepresentationThe final step of our work is the gene type is implementedusing C4.5 classifier and the output obtained afterclassification is given as input for clustering process. Theresult is compared with individual implementation of previoussteps. In order to evaluate the performance of hybrid method;the same parameters are applied for the proposed technique.The following graph shows the classification result of hybridapproach.Fig.4. Normal Vs Tumor Intensity Hybrid ClassifierRepresentationThe above graph denotes that the given genes are classified into major categories which contain relevant attributes in eachcategory.The following graph gives clear identification ofsimilar and dissimilar representation of categorized groups.The irrelevant attributes are also considered and generated asseparate clusters. Whereas the error rate based on the minimalsize clusters are become comparatively less when it iscompared with overall cluster centers.Fig.5. Normal Vs Tumor Intensity Hybrid (Classifier & Cluster)RepresentationThe experimented results error rate, computation time isanalyzed and compared with the individual methods. Theresult shows that hybrid approach attains better results withless computation time. According to error rate, all the givenattributes are represented as relevant or irrelevant groups sothat the error rate must very less in hybrid approach.Table3. Performance of Various Hybrid approach withDifferent groups methodsMethod Data SetNo. ofClusters/ DTressComputationTimeC 4.5 Colon7 Nodes, 4Leaves31 MsK-Means Colon10 Clusters15 MsHybrid Colon 10 Clusters 16 MsThe table 3 shows that computation time of C4.5 classifier ofusing colon gene dataset is 31 milliseconds. The cluster usingthe same data set results 15 Milliseconds. But the combinationof both C4.5 and k-means results 16 milliseconds. It revealsthat the hybrid approach gives accurate results in lesscomputation time.6. CONCLUSIONFrom the analysis based on experimental results, we canconclude that our hybrid approach for multi dimensional database gives better results in comparison with existingapproaches. The large data sets are categorized using C4.5classifier produces decision tree with relevant and irrelevantattributes in a set. This phenomenon makes possible to groupthe similar attributes in to group called cluster. Based on thehybrid approach one can divide the large database in to majorgroups. The major groups can be further clustered in tosimilar group which enables to achieve high accuracy ratewith less computation time. This hybrid approach is suitablefor large data bases having multi dimensional complexity. Byusing this approach one can retrieve exact information from(X1) Intensity in Tumor vs. (X2) Intensity in Normal220200180160140120100806040201,2001,1001,000900800700600500400300200100(X1) Intensity in Normalvs. (X2) Intensity in Tumor by (Y) pred_SpvInstance_11,2001,1001,000900800700600500400300200100220210200190180170160150140130120110100908070605040302010(X1) Intensity inNormalvs.(X2) Intensity inTumor by (Y) Cluster_KMeans_1c_kmeans_1 c_kmeans_2 c_kmeans_3 c_kmeans_4 c_kmeans_5 c_kmeans_6 c_kmeans_7 c_kmeans_8c_kmeans_9 c_kmeans_101,2001,1001,00090080070060050040030020010022020018016014012010080604020
  4. 4. International Journal of Computer Applications Technology and ResearchVolume 2– Issue 3, 214 - 217, 2013www.ijcat.com 217the data base with in stipulated time by removing orminimizing the additional features of large data.7. REFERENCES[1] Juanying Xie ; Chunxia Wang ; Yan Zhang, 2009,ICTM 2009 conference.[2] Pritha Mahata, January-march 2010. Ieee/acmtransactions on computational biology andbioinformatics, vol. 7, no. 1,[3] Taysir Hassan A. Soliman, Adel A. Sewissy, and HishamAbdel LatifSannella, 2010, A gene selection approachfor classifying diseases based on microarray datasets ,IEEE 12th International Conference on Bioinformaticsand Bioengineering, Cyprus[4] Reuben Evans, Bernhard Pfahringer, Geoffrey Holmes.2011, IEEE, Clustering for classification.[5] Brown, L. D., Hua, H., and Gao, C. 2003. A widgetframework for augmented interaction in SCAPE.[6] http://www.ncbi.nlm.nih.gov/gene[7] Chenn-Jung Huang ,Wei-Chen Liao, “A ComparativeStudy of Feature Selection Methods for ProbabilisticNeural Networks in Cancer Classification”, Proceedingsof the 15th IEEE International Conference on Tools withArtificial Intelligence (ICTAI’03),Vol 3, pp1082-3409,2003.[8] http://sdmc.lit.org.sg/GEDatasets/[9] Debahuti Mishra, Barnali Sahu, 2011,”A signal to noiseclassification model for identification of differentiallyexpressed genes from gene expression data,”3rdInternational conference on electronics computertechnology.[10] L. Parsons, E. Haque, and H. Liu, “Subspace clusteringfor high dimensional data: A review,” SIGKDD Explor,Vol. 6, 2004, pp. 90-105.[11] G. Moise, A. Zimek, P. Kroger, H.P. Kriegel, and J.Sander, “Subspace and projected clustering:experimental evaluation and analysis,” Knowl. Inf. Syst.,Vol. 3, 2009, pp. 299-326.[12] H.P. Kriegel, P. Kroger, and A. Zimek, “Clustering high-dimensional data: A survey on subspace clustering,pattern-based clustering, and correlation clustering,”ACM Trans. Knowl. Discov. Data., Vol 3, 2009, pp. 1-58.[13] ErendiraRendon, Itzel Abundez, Alejandra Arizmendi,Elvia M. Quiroz, “Internal versus External clustervalidation indexes,” International Journal of Computersand Communications, Vol. 5, No. 1, 2011, pp. 27-34.[14] Bolshakova, N., Azuaje, F., “Machaon CVE: clustervalidation for gene expression data,”Bioinformatics, Vol.19, No. 18, 2003, pp. 2494-2495.