ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering
Volume 1, Issue 5, July 2012

Improved K-Means Clustering Technique Using a Distance Determination Approach

Bhanu Sukhija, Sukhvir Singh

Abstract— Powerful systems for collecting data and managing it in large databases are in place in all large and mid-range organizations. The value of raw data (collected over a long time) lies in the ability to extract high-level information: information useful for decision support, for exploration, and for a better understanding of the phenomena generating the data. Traditionally, this task of extracting information was done through analysis, where one or more analysts used statistical techniques to provide summaries and generate reports. Such an approach fails as the volume and dimensionality of the data increase. Hence, tools to aid the automation of analysis tasks are becoming a necessity. Thus data mining evolved, which is the "automatic" extraction of patterns of information from data. Clustering is the grouping of the objects of a database into meaningful subclasses such that the similarity of objects in the same group is maximized and the similarity of objects in different groups is minimized. Our focus is on partitioning methods only, and on proposing a new clustering algorithm for automatic discovery of data clusters.

Index Terms— data mining, data sets, clustering techniques, k-means.

I. INTRODUCTION

The importance of collecting data that reflect business or scientific activities to achieve competitive advantage is widely recognized now. Powerful systems for collecting data and managing it in large databases are in place in all large and mid-range organizations. The value of raw data (collected over a long time) lies in the ability to extract high-level information: information useful for decision support, for exploration, and for a better understanding of the phenomena generating the data. Traditionally, this task of extracting information was done through analysis, where one or more analysts used statistical techniques to provide summaries and generate reports. Such an approach fails as the volume and dimensionality of the data increase. Hence, tools to aid the automation of analysis tasks are becoming a necessity. Thus data mining evolved, which is the "automatic" extraction of patterns of information from data.

Data mining is the analysis of large observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. Data mining also refers to extracting from, or mining, large amounts of data; this is also sometimes referred to as Knowledge Discovery from Data (KDD). Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions, and can answer business questions that traditionally were too time-consuming to resolve.

Clustering is the grouping of the objects of a database into meaningful subclasses such that the similarity of objects in the same group is maximized and the similarity of objects in different groups is minimized. Thus the objects are clustered, or grouped, based on the principle of "maximizing the intra-class similarity and minimizing the inter-class similarity". Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning. Clustering methods can be divided into various types: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, probabilistic techniques, graph-theoretic methods, and fuzzy methods. The focus of this work is on partitioning methods only, and on proposing a new clustering algorithm for automatic discovery of data clusters.

Many international organizations produce more information in a week than many people could read in a lifetime. The situation is even more alarming in worldwide networks like the Internet. Every day hundreds of megabytes of data are distributed around the world, but it is no longer possible to monitor this increasingly rapid development – the growth is nearly exponential. The basic problems with the management of such data can be summarized as follows:

  • Analysis is required to unearth the hidden relationships within the data, i.e. for decision support.
  • A DBMS gives access to the stored data but provides no analysis of the data.
  • The size of databases has increased, and automated techniques for analysis are needed, as databases have grown beyond manual extraction.

Manuscript received July 20, 2012.
Bhanu Sukhija, Computer Science, Kurukshetra University / N.C. College / N.C. and S.D. Group of Institutions, Panipat, Haryana, India, 07206089757 (e-mail: bhanu_sukhija57@yahoo.com).
Sukhvir Singh, Computer Science Department, Kurukshetra University / N.C. College / N.C. and S.D. Group of Institutions, Panipat, Haryana, India (e-mail: boora_s@yahoomail.com).

II. METHODOLOGY

As already stated, this work is based on the methodology of the iterative relocating technique of partitional clustering, i.e. k-means and k-medoids. All the algorithms are implemented in the C++ language and the results are studied in the same environment. But, before going through the study of these algorithms, their
implementation view and results, the following terms are used in the implementations of these algorithms.

A. Distance

Both of these algorithms are based on the distance among objects, so it is necessary to introduce the distance measure used in this work. Let X = {xi | i = 1, 2, …, n} be a data set with n objects, let k be the number of clusters, and let mj be the centroid of cluster cj, where j = 1, 2, …, k. Then the algorithm finds the distance between a data object and a centroid by using the following Euclidean distance formula [1], [9], [10]:

d(xi, mj) = √(|xi – mj|²)

Starting from an initial distribution of cluster centers in data space, each object is assigned to the cluster with the closest center, after which each center itself is updated as the center of mass of all objects belonging to that particular cluster.

B. Data Set

A data set is the group of n data points, of dimension m, on which the partitioning approach is to be applied. Multidimensional data can include any number of measurable attributes; each independent characteristic, or measurement, is one dimension. The consolidation of large, multidimensional data sets is the main purpose of the field of cluster analysis [7]. In this work, the data set (Dn) is two-dimensional and has twenty data points. Each data point is represented by a pixel (x, y), where x represents the data point's first dimension and y represents its second dimension. Visualizations of the data set are also produced in the form of a set of twenty pixels.

C. K-Means Steps

The basic step of k-means clustering is to give the number of clusters k and consider the first k objects from data set D as clusters and their centroids. Then the k-means algorithm performs the three steps below, iterating until stable (no object moves between groups):
1. Determine the centroid coordinates.
2. Determine the distance of each object to each centroid.
3. Group the objects based on minimum distance.

D. K-Medoids, or Partitioning Around Medoids (PAM), Steps

The basic step of k-medoids, or PAM, clustering is to give the number of clusters k and consider the first k objects from data set Dn as clusters and their medoids. Then the k-medoids (PAM) algorithm performs the steps below, iterating until stable (no object changes group):
1. Design the clusters and compute the distance over all the clusters.
2. Randomly select non-medoid objects as new medoids.
3. Compute the new distance with respect to the new medoids.

III. COMPARISON OF ALGORITHMS

A. Comparison of the algorithms' running time

    Name of algorithm    Computational complexity
    k-means              O(nkt)
    k-medoids            O(k(n−k)²)
    D-M                  O(ni)

B. Comparison of SSE (sum of square error) of the algorithms

    Name of algorithm    Two clusters    Four clusters
    k-means              1053.6022       681.4046
    k-medoids            1082.002        464.5469
    D-M                  1053.6022       392.6812

For each object in each cluster, the distance from the object to its cluster center is squared and these distances are summed. In other words, the square error is the sum of distances for all objects in the data set with respect to the centroid or medoid of the cluster each object belongs to. It can also be defined as:

E = Σ_{j=1..k} ( Σ_{i=1..n} |pij – mj|² )

where E is the sum of square error for all objects in the data set, pij is the ith object of the jth cluster, and mj is the mean of the jth cluster (both pij and mj are multidimensional). This criterion tries to make the resulting k clusters as compact and as separate as possible.

IV. VALIDATION

Validation is the measurement of similarity between the actual classes or groups and the cluster results of any clustering technique.
It can be measured by way of different measures such as purity, entropy, F-measures, and the Coefficient of Variation; of these, the Coefficient of Variation is the best measurement of validation.
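The distance measure of Section II-A, the three k-means steps of Section II-C, and the square-error criterion of Section III can be sketched in C++, the language used for the implementation. This is a minimal illustrative sketch, not the authors' actual program: the point values used below are hypothetical, and only a single assignment-and-update pass is shown (repeating `kmeansStep` until no label changes yields the full algorithm).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Euclidean distance between a data object x_i and a centroid m_j,
// as in Section II-A: sqrt(|x_i - m_j|^2).
double dist(const Point& a, const Point& b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    return std::sqrt(dx * dx + dy * dy);
}

// One pass of the three k-means steps from Section II-C: assign each
// object to its nearest centroid, then recompute each centroid as the
// mean of the objects assigned to it. Returns each point's cluster index.
std::vector<std::size_t> kmeansStep(const std::vector<Point>& data,
                                    std::vector<Point>& centroids) {
    std::vector<std::size_t> label(data.size(), 0);
    for (std::size_t i = 0; i < data.size(); ++i)
        for (std::size_t j = 1; j < centroids.size(); ++j)
            if (dist(data[i], centroids[j]) < dist(data[i], centroids[label[i]]))
                label[i] = j;
    for (std::size_t j = 0; j < centroids.size(); ++j) {
        Point sum{0.0, 0.0};
        std::size_t count = 0;
        for (std::size_t i = 0; i < data.size(); ++i)
            if (label[i] == j) { sum.x += data[i].x; sum.y += data[i].y; ++count; }
        if (count > 0) centroids[j] = {sum.x / count, sum.y / count};
    }
    return label;
}

// Sum of square error (Section III-B): E = sum_j sum_i |p_ij - m_j|^2.
double sse(const std::vector<Point>& data, const std::vector<Point>& centroids,
           const std::vector<std::size_t>& label) {
    double e = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        double d = dist(data[i], centroids[label[i]]);
        e += d * d;
    }
    return e;
}
```

A k-medoids variant would differ only in the update step, restricting candidate centers to actual data points instead of taking the arithmetic mean.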
The Coefficient of Variation (CV) is a measure of data dispersion, defined as the ratio of the standard deviation to the mean. The CV is a dimensionless number that allows comparison of the variation of populations with significantly different mean values. In general, the larger the CV value, the greater the variability in the data. The Coefficient of Variation can be calculated as follows:

1. Find the mean of each class individually in the data set: x̄ = Σ_{i=1..n} xi / n, where x̄ is the mean of a class having n objects.
2. Calculate the global mean of all classes using their means: x̄g = Σ_{j=1..N} x̄j / N, where x̄g is the global mean of all N classes and x̄j is the mean of the jth class.
3. Find the standard deviation of all the classes: s = √( Σ_{j=1..N} (x̄j – x̄g)² / N ).
4. Find the Coefficient of Variation: CV = s / x̄g, where s is the standard deviation and x̄g is the global mean.

V. RESULTS

The following are the results of the implementation of the above algorithms, compiled in C++ and run with different values of k (the number of clusters) on the same two-dimensional data set with n = 20.

A. Implementation view of K-Means

As shown in the figure, the value k = 2 yields two clusters represented by centroids CD:1 and CD:2 respectively. The deviation in the clusters is shown on the right-hand side of the figure, i.e. 526.8011 in each cluster, totaling 1053.602. The number of data points is ten in each cluster.

Fig. Output of K-Means with k = 2

B. Implementation view of K-Medoids

As shown in the figure, the value k = 2 yields two clusters represented by medoids CD:1 and CD:2 respectively, which are also data points of the data set. The deviation in the clusters is shown on the right-hand side of the figure, i.e. 465.2147 in the 1st cluster and 616.7882 in the 2nd cluster, totaling 1082.02. There are 11 data points in the 1st cluster and 9 data points in the 2nd cluster. It is also important to mention here that the total distance, i.e. the SSE, of k-medoids is slightly more than that of k-means, i.e. 1053.602.
As shown in Figure 4.9, the value k = 3 yields three clusters represented by medoids CD:1, CD:2 and CD:3 respectively, which are also data points of the data set. The deviation in the clusters is shown on the right-hand side of the figure, i.e. 118.6493 in the 1st cluster, 522.7472 in the 2nd cluster and 145.4263 in the 3rd cluster, totaling 786.8229. There are 5 data points in the 1st cluster, 10 data points in the 2nd cluster and 5 data points in the 3rd cluster. Again, the total distance, i.e. the SSE, of k-medoids is slightly more than that of k-means, i.e. 723.1417.
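The four Coefficient of Variation steps of Section IV can be sketched in C++. This is an illustrative sketch rather than the paper's code; it assumes the per-class means of step 1 have already been computed and takes them as input.

```cpp
#include <cmath>
#include <vector>

// Coefficient of Variation over a set of class means (Section IV):
// step 2 forms the global mean x_g = sum(x_j) / N, step 3 takes the
// standard deviation s = sqrt(sum((x_j - x_g)^2) / N) of the class
// means, and step 4 returns CV = s / x_g.
double coefficientOfVariation(const std::vector<double>& classMeans) {
    const double N = static_cast<double>(classMeans.size());
    double global = 0.0;
    for (double m : classMeans) global += m;
    global /= N;
    double ss = 0.0;
    for (double m : classMeans) ss += (m - global) * (m - global);
    const double s = std::sqrt(ss / N);
    return s / global;
}
```

Because the CV is dimensionless, it can compare the dispersion of clusterings whose means differ greatly in scale, which is why Section IV prefers it over raw deviation measures.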
Fig. Output of K-Medoids with k = 2

VI. CONCLUSION

In partitioning-based clustering algorithms, the number of final clusters (k) needs to be determined beforehand. Such algorithms also have problems: they are susceptible to local optima, sensitive to outliers, and constrained by memory space, validity, and the unknown number of iteration steps required to cluster. An algorithm that explores the number of clusters automatically while addressing all these problems is therefore an active research area, and most research has focused on the effectiveness and scalability of the algorithm. In this paper, a partitioning-based clustering algorithm, D-M (Density Means), has been considered which generates clusters automatically.

ACKNOWLEDGMENT

The authors are thankful to the Computer Science Department at N.C. College of Engineering for supporting our work.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[2] P. S. Bradley and U. M. Fayyad, "Initial Points for K-Means Clustering," in Advances in Knowledge Discovery and Data Mining. MIT Press.
[3] Introduction to Data Mining and Knowledge Discovery, 3rd ed., Two Crows Corporation.
[4] K. Teknomo, "K-Means Clustering Tutorial."
[5] D. Hand et al., Principles of Data Mining.
[6] P. Berkhin, "Survey of Clustering Data Mining Techniques."
[7] V. Faber, "Clustering and the Continuous k-Means Algorithm."
[8] Introduction of Clustering.
[9] Y. Shang and G.-J. Li, "New Crossover Operators in Genetic Algorithms," National Research Center for Intelligent Computing Systems (NCIC), P. R. China, 1991.
[10] V. Singh and S. Choudhary, "Genetic Algorithm for Traveling Salesman Problem: Using Modified Partially-Mapped Crossover Operator," Department of Computer Science & Engineering, Faculty of Engineering & Technology, Mody Institute of Technology & Science, Lakshmangarh, Sikar, Rajasthan, India, 2009.
[11] F. Su et al., "New Crossover Operator of Genetic Algorithms for the TSP," Computer School of Wuhan University, P. R. China, 2009.

Ms. Bhanu Sukhija obtained the B.Tech degree in Computer Science and Engineering from Kurukshetra University, Kurukshetra, India in 2009. She is presently working as a Lecturer in the Department of Computer Science and Engineering at N.C. College of Engineering, Israna, Panipat, and is pursuing her M.Tech from the same institute. She has guided several B.Tech projects. Her research interests include data mining.

Mr. Sukhvir Singh obtained the M.Tech (Integrated) degree in Software Engineering and System Analysis in 1996. He is presently pursuing a PhD from M.D. University, Rohtak. He has worked at P.D.M. Polytechnic, Sarai Aurangabad, Bahadurgarh, and P.D.M. College of Engineering, Bahadurgarh. He is presently heading the Department of Computer Science at N.C. College of Engineering, Israna (Panipat). He has received a grant of Rs. 4 lakh from AICTE for the MODROBS project for an Advanced Computer Network Lab. He is also a member of the Computer Society of India and a life member of ISTE.

All Rights Reserved © 2012 IJARCSEE