Published on

International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. J.Shalini, R.Jayasree, K.Vaishnavi / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.884-888884 | P a g eClustering on High Dimensional data Using Locally LinearEmbedding (LLE) TechniquesFirst Author, J.Shalini, Second Author, R.Jayasree, ThirdAuthor,K.VaishnaviAssistant Professor,Department Of Computer ScienceAbstractClustering is the task of grouping a set ofobjects in such a way that objects in the same group(called cluster) are more similar (in some sense oranother) to each other than to those in other groups(clusters). The dimension can be reduced by usingsome techniques of dimension reduction. Recentlynew non linear methods introduced for reducing thedimensionality of such data called Locally LinearEmbedding (LLE).LLE combined with K-meansclustering in to coherent frame work to adaptivelyselect the most discriminant subspace. K-meansclustering use to generate class labels and use LLEto do subspace selection.Keywords - Clustering, High Dimension Data,Locally Linear Embedding, k-means clustering,Principal Component AnalysisI. INTRODUCTIONHigh dimensional datasets present manymathematical challenges as well as someopportunities to bound to give rise to newtheoretical development [23]. The problem isespecially severe when large databases with manyfeatures are searched for patterns without filtering ofimportant features based on prior knowledge. Thegrowing importance of knowledge discovery anddata mining methods in practical applications hasmade the feature selection/extraction problem.II BACKGROUND STUDYDeveloping effective clustering methodsfor high dimensional datasets is a challengingproblem due to the curse of dimensionality. Thereare many approaches to address the problem ofcurse of dimensionality. The simplest approach ofdimension reduction techniques [6], principalcomponent analysis (PCA) (Duda et al., 2000;Jolliffe, 2002) and random projections (Dasgupta,2000). In PCA method, dimension reduction iscarried out as a preprocessing step and is decoupledfrom the clustering process. Once the subspacedimensions are selected, they stay fixed during theclustering process. This is The main draw back ofPCA [11] [27]. An extension of this approach isLocally Linear Embedding Techniques.III PROPOSED WORK OF LLEThe proposed work is carried out bycollecting a large volume of high dimensionalunsupervised gene datasets, by using dimensionreduction techniques such as, LLE describe locallylinear embedding (LLE) an unsupervised learningalgorithm that computes low dimensional andneighborhood preserving embeddings of highdimensional data. Generating highly nonlinearembeddings—do not involve local minima. Thework is implemented in Mat Lab 7.0. The reductioncan be done with the combination of LLE + k-meansclustering.A. LLELLE is commonly called as Locally LinearEmbedding; it is one of the dimension reductiontechniques [23]. The LLE algorithm of Roweis andSaul (2000) it involves mapping high-dimensionalinputs into a low dimensional ―description‖.B. STEPS OF LLE WORKING(1) Assign neighbors to each data point W Xi (forexample by using the K nearest neighbors).(2) compute the weights Wij that best linearlyreconstruct Xi from its neighbors, solving theconstrained least-squares problem.(3) Compute the low-dimensional embeddingvectors Yi best reconstructed by Wij,minimizing the smallest eigenmodes of thesparse symmetric matrix in Although theweights Wij and vectors Yi are computed bymethods in linear algebra, the constraint thatpoints are onlyFig 1. ClustersReconstruction errors are measured by the costfunction.The weights Wij summarize the contribution of thejth data point to the ith reconstruction. To compute
  2. 2. J.Shalini, R.Jayasree, K.Vaishnavi / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.885 | P a g ethe weights Wij, we minimize the cost space with asmany function subject to two constraints: first, thateach data point Xi is reconstructed only from itsneighbors. The optimal weights Wij subject to theseconstraints are found by solving a least-squaresproblem.This cost function, like the previous one, isbased on locally linear reconstruction errors, buthere we fix the weights Wij while optimizing thecoordinates Yi. The embedding cost in Eq. 2 definesa quadratic form in the vectors W Yi. Subject toconstraints that make the problem well-posed, it canbe minimized by solving a sparse N * N eigenvalueproblem[36]. In this work experiments, data pointswere reconstructed from their K nearest neighbors,as measured by Euclidean distance.C. LLE AlgorithmsIV. LLE+K-MeansRoweis and Saul (2000) have recentlyproposed [36]an unsupervised learning algorithm oflow dimensional manifolds, whose goal is to recoverthe non linear structure of high dimensional data.The idea behind their local linear embeddingalgorithm is that nearby points in the highdimensional space remain nearby and similarly co-located in the low dimensional embedding. Startingfrom this intuition each data point xi (i = 1; n) isapproximated by a weighted linear combination ofits neighbors (from the nature of these local andlinear reconstructions the algorithm derives itsname). In its base formulation, the LLE algorithmfinds the linear coefficients wij by minimizing thereconstructionwhich k ¢ k is the Euclidean norm. For each xi theweights wig are enforced to sum to one and areequal to zero if a point ax does not belong to the setof k neighbors for xi. The same weights thatcharacterize the local geometry in the original dataspace are supposed to reconstruct the data points inthe low dimensional embedding. The n coordinatesare then estimated by minimizing the cost function:for reconstructing it by its k neighbors, as in the firststep of LLE. LLE is an unsupervised, non-iterativemethod, which avoids the local minima problemsplaguing many competing methods (e.g. those basedon the EM algorithm)This formula is applied in this proposal onfive available data sets: namely Iris, Wine, Zoo,Cancer, and Drosophila. Consider a set of input datavectors X =(x1, ··· ,xn) in high dimensional space.For simplicity, the data is centered in thepreprocessing step, so that thestandard K-means clustering is to minimize theclustering objective function [11].Where the matrix is the clusterindicator: if xi belongs to the k-th cluster,. Suppose the data consist of real-valuedvectors , each of dimensionality D, sampledfrom some smooth underlying manifold. Itcharacterizes the local geometry of these patches bylinear coefficients that reconstruct each data pointfrom its neighbors. In the simplest formulation ofLLE, one identifies nearest neighbors per data point,as measured by Euclidean distance.V. EXPERIMENT AND RESULTIn this section it describes about theexperiments evaluate the Performance of LLE andLLE +k -means in Dimension reduction andcompare with LDA+K-means, used widely in genedata sets.VI. DATASET DESCRIPTIONThe proposed work has been implementedand tested on five public available different genedata sets namely Iris, Breast Cancer, Drosophila,Wine, and Zoo.The descriptions of these datasets are asfollows.Five datasets including Iris, Wine, Drosophila,cancer and Zoo Zoo gene describes the animal’s hair,feathers, eggs, milk, airborne, back bone, fins, tailsetc. Iris gene describes the flowers sepal length,sepal width, petal length, petal width it’s amultivariate. The attribute used is real.1. Compute the neighbors of each data point Xi2. Compute the Weights Wij that best reconstructeach data point Xi from its neighborsMinimizing the cost by constraints linear fits.3. Compare the vectors Yi best reconstructed bythe weights Wij Minimizing the quadraticform by its bottom nonzero eigen vectors.
  3. 3. J.Shalini, R.Jayasree, K.Vaishnavi / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.886 | P a g e Drosophila describes concise genomic,proteomic, transcriptomic, genetic and functionalinformation on all known and predicted humangenes. Breast Cancer describes tumour size, Class:no-recurrence-events, recurrence-events, age,menopause, and inv-nodes6. Node-caps, Breast-quad: left-up, left-low, right-up, right-low, central.Wine data describes the attribute about Alcohol,Malic acid, Ash Alkalinity of ash, Magnesium,Total phenols, Flavanoids, Nonflavanoidphenols,Proanthocyanins.Table 1: Data set DescriptionTable 2.Clustering accuracy table on UCI dataset. Theresults are obtained by averaging5 trials ofLLE+ K-meansFig 2: k-means clustering with IrisFig 3: LDA+ k-Means with Iris Data setFig 4: LLE+k-means clusteringFig 5: K-Means clustering with wine Data setWith Iris DatasetFig 6: LDA-k means clusteringFig7: LLE+K means clustering with wine DatasetVII. RESULT ANALYSISDatasets #Samples #Dimensions #ClassIris 150 4 3Wine 178 13 3Zoo 101 18 7BreastCancer 286 9 4Drosophila 163 4 3
  4. 4. J.Shalini, R.Jayasree, K.Vaishnavi / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.887 | P a g eAll the above datasets have labels. View all thelabels of the datasets as the objective knowledge onthe structure of the datasets and use accuracy as theclustering performance measure. Accuracydiscovers the one-to-one relationship betweenclusters and classes and measures the extent towhich each cluster contains data points from thecorresponding class and it has been used asperformance measures for clustering analysis.Accuracy can be describedWhere n is the number of data points, Ck denotesthe k-the cluster, and Lm is the m-theclass. Is the number of data points thatbelong to class m are assigned to cluster k. Accuracyis then computed as the maximum sum offor all pairs of clusters and classes, andthese pairs have no overlaps. On the five datasetsdata repository, it compares the LLE-Km algorithmwith standard K-means algorithm. This work is alsocomparing it with PCA-based clustering algorithm:clustering. This shows that LLE-Km clustering isviable and competitive. The subspace clustering isable to perform the subspace selection and datareduction. Note that LLE-Km is also able todiscover clusters in the low dimensional subspace toovercome the curse of dimensionality.IX. CONCLUSIONThe performance of dimension reductiontechniques with k-means clustering method usingreal different gene data sets was compared. In thisthesis, considered various properties of data sets anddata preprocessing procedures, and have confirmedthe importance of preprocessing data prior toperforming core cluster analysis. All clusteringmethods were affected by data dimension and datacharacteristics, such as the overlapping betweenclusters and the presence of noise. In particular theLLE + k m which are commonly used to reduce thedimension of data, The future extension of this workcan be done via some other preprocessingtechniques.REFERENCES[1] Alone et alum Broad Pattern of geneexpression revealed by clustering analysisof tumor and normal colon cancer tissuesprobed by oligonucleatide arrays. PNAS,96(12): 6745-6750, 1999.[2] Anton Schwaighofer. Matlab interface tosvmlight.In (2004)[3] Banerjee, A., Dhillon, I., Ghosh, J.,Merugu, S., & Modha, D. (2004). Ageneralized maximum entropy approach tobregman co-clustering and matrixapproximation. Proc. ACM Int’l ConfKnowledge Disc. Data Mining (KDD).[4] arbara.D,‖ An Introduction to ClusterAnalysis for Data mining‖,[5] room Head .D.S, Kirby, A new approach todimensionality reduction:Theory andalgorithms, SIAM journal of appliedMathematics 60 (6) (2000)2114-2142.[6] Carreira-Perpinan.M.A, Areview ofdimension reduction techniques. Technicalreport CS-96-09, Department of ComputerScience, University of Sheffield, 1997.[7] Cheng, Y., & Church, G. (2000).Biclustering of expression data. Proc. Int’lSymp. Mol. Bio (ISMB), 93–103.[8] Chrisding et al ―K-means clustering ViaPCA‖ 21stInternational Conference onMachine Learning, Canada-2004.[9] Diamantras. K.I and Kung.S-Y.PrincipalComponent Neural Networks. Theory andApplications. John Wiley & Sons, NewYork, Londin, Sydney, 1996.[10] Ding, C., & He, X. (2004). K-meansclustering and principal componentanalysis. Int’l Conf. Machine Learning(ICML)[11] Ding, C., He, X., Zha, H., & Simon, H.(2002). Adaptive dimension reduction forclustering high dimensional data. Proc.IEEE Int’l Conf. Data Mining[12] De la Torre, F., & Kanade, T. (2006).Discriminative cluster analysis. Proc. Int’lConf. Machine Learning.[13] D. DeMers and G.W. Cottrell. Non-lineardimensionality reduction. In C.L. Giles,S.J. Hanson, and J.D. Cowan, editors,Advances in Neural InformationProcessing Systems 5, pages 580–587.Morgan Kaufmann, San Mateo, CA, 1993.[14] Fukunaga .K, Olsen .D.R, An Algorithmfor Finding intrinsic dimensionality of data,IEEE Transactions on Computers 20 (2)(1976) 165-171.[15] Halkidi.M, Batistakis.Y.―On ClusteringValidation Techniques‖, 2001.[16] Han.J and Kamber.M. ―Data MiningConcepts and Techniques‖, the MorganKaufmann Publisher, August 2000. ISBN1-55860-489-8.[17] Hartigan .J.A, Wang.M.A, ―A K-MeansClustering Algorithm‖, Appl.Stat, Vol.28,1979, pp.100-108.[18] Hartuv.E, Schmitt. A, Lange.J, ―AnAlgorithm for Clustering cDNAS for GeneExpression Analysis‖, Third InternationalConference on Computational MolecularBiology (RECOMB) 1999.[19] Hotelling.H. ―Analysis of a complex ofstatistical variables into principal
  5. 5. J.Shalini, R.Jayasree, K.Vaishnavi / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.888 | P a g ecomponents‖. Journal of EducationalPsychology, 24:417–441, 1933[20] Jackson .J.E. ―A Users Guide to PrincipalComponents‖. New York: John Wiley andSons, 1991.[21] Jain .A.K, Dubes .R.C Algorithm forclustering Data, Prentice Hall, 1988.[22] Jain .A.K, Murty.M.N, Flynn.P.J. ―DataClustering: A Review‖ .ACM ComputerSurveys, Vol.31, No.3, 1999.[23] John stone I.M(2009) ―Non-Lineardimensionality reduction by LLE‖.[24] Jolliffe, I. (2002). ―Principal componentanalysis”. Springer. 2ndEdition.[25] Jolliffe. I.T. ―Principal ComponentAnalysis‖. Springer-Verlag, 1986.[26] Jolliffe I.T. ―Principal ComponentAnalysis” (Springer-Verlag, New York,1989).[27] Kambhatla.N and Leen. T. K. ―Dimensionreduction by local principal componentanalysis‖. Neural Computation 9, 1493–1516 (1997).[28] Kaski. S. ―Dimensionality reduction byrandom mapping: fast similaritycomputation for clustering‖. Proc. IEEEInternational Joint Conference on NeuralNetworks, 1:413-418, 1998.[29] Kaufman.L, Rousseeuw.P.J, ―FindingGroups in Data: An Introduction to ClusterAnalysis‖. John Wiley, New York, 1990.[30] Kirby.M, Geometric Data Analysis: ―AnEmpirical Approach to DimensionalityReduction n and the Study of Patterns‖,John Wiley and Sons, 2001.[31] Kohavi.R and John. G.The wrapperapproach. In H. Liu and H. Motoda,editors, Feature Extraction, Constructionand Selection: ―Data Mining Perspective‖.Springer Verlag, 1998.[32] O. Kouropteva, O. Okun, and M.Pietikäinen. Selection of the optimalparameter value for the locally linearembedding algorithm. 2002. Submitted to1st International Conference on FuzzySystems and Knowledge Discovery.[33] M. Kramer. Nonlinear principal componentanalysis using auto associative neuralnetworks. AIChE Journal 37, 233–243(1991).[34] Lee.Y and Lee.C.K, ―Classification onmultiple cancer types by multicategorysupport vector machines using geneexpression data,‖ Bioinformatics, vol.19,pp.1132-1139, 2003[35] Li. K.-C. High dimensional data analysis viathe SIR/PHD approach., April 2000.Lecture notes in progress.[36] Li, T., Ma, S., & Ogihara, M. (2004).Document clustering via Adaptivesubspace iteration. Proc. conf. Researchand development in IR (SIRGIR) (pp. 218–225).[37] Wolfe P.J and Belabbas .A (2008)‖Hessianeigen maps:Locally Lineartechniques for high dimensional data.[38] Lunan .Y, Li.H, ―Clustering of time-coursegene expression data using a mixed effectsmodel with B-splines, ―BioinformaticsVol.19, 2003, pp.474-482.[39] S. T. Roweis and L. K. Saul. Nonlineardimensionality reduction by locally linearembedding. Science 290, 2323-2326(2000).[40] Zha, H., Ding, C., Gu, M., He, X., &Simon, H. (2002). Spectral relaxation forK-means clustering. Advances in NeuralInformation Processing Systems 14(NIPS’01), 1057– place of birth is in Coimbatoreon 07/10/1984 and my educational qualification isB.Sc [CT] M.Sc.,Mphil computer science completedmy ug in sri Krishna college of engineering andtechnology and my postgraduate in sri Ramakrishnacollege of arts and science for women.Now I amworking in PSGR krishnammal college for womenas Assistant Professor..R.JAYASREE my place of birth is in Coimbatoreand my educational qualification is ,MCA inBharathiar university and pursuing M.Phil in PSGRKrishnammal college and now I am working asAssistant Professor in PSGR krishnammal collegefor women.K.VAISHNAVI my place of birth is in Madurai andmy educational qualification is ,MCA in Bharathiaruniversity and pursuing M.Phil in PSGRKrishnammal college and now I am working asAssistant Professor in PSGR krishnammal collegefor women.