K means report

Seminar Report on
K-Means Clustering Algorithm
Submitted in partial fulfillment of the Third year Seminar
of
THIRD YEAR OF ENGINEERING
in
COMPUTER ENGINEERING
by
Mr Gaurav Handa
TE CMPN-A
Roll Number: 40
Under the Guidance of
Mrs. Veena Kulkarni
Assistant Professor
Thakur College of Engineering and Technology
Shyamnarayan Marg, Thakur Village, Kandivali (E), Mumbai-400101
Year 2013-2014

CERTIFICATE
This is to certify that Mr Gaurav Handa is a bonafide student of Thakur College of Engineering
and Technology, Mumbai. He has satisfactorily completed the requirements of the SEMINAR
as prescribed by University of Mumbai while working on seminar topic titled “K-Means
Clustering Algorithm”.
Mrs. Veena Kulkarni
(Guide)
Dr. Rekha Sharma
(HOD CMPN)
Dr. R. R. Sedamkar
(Dean Academics)
Dr. B. K. Mishra
(Principal)
Internal Examiner External Examiner
(Name and Signature with Date) (Name and Signature with Date)
Thakur College of Engineering and Technology
Kandivali (E), Mumbai-400101.
PLACE: Mumbai
DATE:

ACKNOWLEDGEMENT
I would like to express my sincere thanks to Mrs. Veena Kulkarni for her able guidance and
constant support throughout the preparation of this seminar. I would also like to thank my
seminar coordinators for arranging the necessary facilities to carry out the seminar work.
A special thank you to the HOD, Dean Academics, Principal, and Management for their support
through the entire process of preparation.
(Gaurav Handa)

ABSTRACT
Business Intelligence is a more advanced form of:
• Data Mining
• Transactional Databases
• Performance Management
• Enterprise Reporting
• Data Warehouse
Business Intelligence enables a business to make intelligent, fact-based decisions.
Data mining is divided into:
• Association Analysis
• Classification
• Clustering
• Regression
Data clustering is a method in which we group objects that have similar characteristics into
clusters.
Clustering is further divided into:
• Hierarchical
• Partitional
• Density based
The K-means algorithm is a part of partitional clustering.

CONTENTS
Chapter 1 Introduction
1.1 Importance of the seminar topic and its background
Chapter 2 Literature Review
2.1 Problem Definition
2.2 Literature Survey
Chapter 3 Analysis and Planning
3.1 Architecture Overview
3.2 Algorithm
3.3 Flowchart
3.4 Limitations and Drawbacks
Chapter 4 Design and Implementation
4.1 Implementation using graph
4.2 Implementation using Java
4.3 Applications
Chapter 5 Conclusion and Future work
5.1 Experimental Results
Chapter 6 References

Chapter 1: Introduction
1.1 Importance of the seminar topic and its background
Data plays an important role in human activities. Data mining is the knowledge discovery
process of analyzing large volumes of data from various perspectives and organizing them into
useful information. Many organizations generate terabytes of data in a day. Data mining
techniques search this data for valuable information, identify hidden structures in it, and
extract hidden predictive information from it. Organizations are now starting to realize the
importance of data mining.
This paper presents the k-means algorithm from data mining. Along with a brief
description of the algorithm, we provide graphs and arithmetic problems for better
understanding. The paper shows how the k-means algorithm is used to implement
data mining efficiently, along with the drawbacks of the algorithm.
K-means is a method of cluster analysis that belongs to partitional clustering. It is a
simple iterative method that partitions a given data set of n observations into a user-specified
number of clusters k, so that each observation belongs to exactly one cluster, the one with the
nearest mean.

Chapter 2: Literature Review
2.1 Problem Definition
• To carry out the knowledge discovery process by analyzing large volumes of data from
various perspectives and organizing them into useful information.
• To search for valuable information in large volumes of data and to identify hidden
structures in the data.
2.2 Papers on K-Means Clustering Algorithm
• “The Uniqueness of a Good Optimum for K-Means”, Marina Meila, Proceedings of the
23rd International Conference on Machine Learning, 2006: By augmenting k-means with
a simple, randomized seeding technique, the authors obtained an algorithm that is
O(log k)-competitive with the optimal clustering, improving both speed and accuracy.
• “The Effectiveness of Lloyd-Type Methods for the k-Means Problem”, Rafail Ostrovsky,
Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy, SODA, 2007: Polynomial-time
approximation schemes (PTASs) have been obtained for the k-means clustering
problem.
• “Improved Smoothed Analysis of the k-Means Method”, Bodo Manthey and Heiko
Röglin, preprint, 2008: The paper notes that one of the distinguishing features of k-means
is its speed in practice. Its worst-case running time, however, is exponential, leaving a gap
between practical and theoretical performance. The paper aims at closing this gap.

Chapter 3: Analysis and Planning
3.1 Architecture Overview
K-means is a centroid-based technique in which each cluster is represented by the
centre of the cluster.
The algorithm aims at minimizing an objective function, specifically a squared error
function.
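As an illustration, the squared error function is commonly written in the form below. The symbols J, C_j, x_i and \mu_j are notation assumed here for the sketch and do not appear in the original report: C_j is the j-th cluster, x_i an object assigned to it, and \mu_j the mean (centre) of that cluster.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller J means that the objects lie closer to the centres of their own clusters.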

3.2 Algorithm:
Let us give a simple explanation of the k-means algorithm.
Let D be the data set of n objects and let k be the number of clusters. The algorithm distributes
the objects into k clusters such that objects within a cluster are similar to one another and
dissimilar to the objects in other clusters. First it arbitrarily selects k of the objects, each of
which initially represents a cluster mean or center. Each of the remaining objects is assigned to
the cluster to which it is most similar, based on the distance between the object and the cluster
mean. It then computes the new mean for each cluster and the process is repeated. This is an
iterative process which continues until stability is reached, that is, until the assignments no
longer change. In this form of k-means partitioning, each cluster's center is represented by the
mean value of the objects in the cluster.
Input: k = the number of clusters
D = data set containing n objects
Method:
• Randomly choose k objects from D as the initial cluster centers.
• Repeat until no change:
– Allocate each object to the cluster to which it is most similar, based on the mean value
of the objects in the cluster.
– Calculate the new mean value for each cluster.
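The method above can be sketched in Java roughly as follows. This is only a minimal illustration for 2-D points, not the program used in this report: the class name KMeansSketch, its method names, and the sample points are all assumptions made for the example. To keep the run repeatable, the sketch takes the first k points as initial centers instead of choosing them at random.

```java
import java.util.Arrays;

// Minimal k-means sketch for 2-D points (illustrative; names are hypothetical).
class KMeansSketch {

    // Squared Euclidean distance between two 2-D points.
    static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    // Index of the center nearest to point p (ties go to the lower index).
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++)
            if (dist2(p, centers[c]) < dist2(p, centers[best])) best = c;
        return best;
    }

    // Returns the cluster index of every point once assignments stop changing.
    // Assumption: the first k points serve as the initial centers.
    static int[] cluster(double[][] points, int k) {
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = points[c].clone();

        int[] label = new int[points.length];
        Arrays.fill(label, -1);
        boolean changed = true;
        while (changed) {
            changed = false;
            // Assignment step: move each point to its nearest center.
            for (int i = 0; i < points.length; i++) {
                int c = nearest(points[i], centers);
                if (c != label[i]) { label[i] = c; changed = true; }
            }
            // Update step: recompute each center as the mean of its points.
            double[][] sum = new double[k][2];
            int[] count = new int[k];
            for (int i = 0; i < points.length; i++) {
                sum[label[i]][0] += points[i][0];
                sum[label[i]][1] += points[i][1];
                count[label[i]]++;
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0) { // an empty cluster keeps its old center
                    centers[c][0] = sum[c][0] / count[c];
                    centers[c][1] = sum[c][1] / count[c];
                }
        }
        return label;
    }

    public static void main(String[] args) {
        // Made-up sample data: two visibly separate groups.
        double[][] points = { {1, 1}, {8, 8}, {1, 2}, {2, 1}, {9, 8}, {8, 9} };
        System.out.println(Arrays.toString(cluster(points, 2))); // [0, 1, 0, 0, 1, 1]
    }
}
```

The loop stops when an assignment step leaves every label unchanged, which corresponds to "repeat until no change" in the method.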

3.4 Limitations and Drawbacks:
• The space complexity is O(m*n), where m is the number of points and n is the number of
attributes.
• The time complexity is O(I*k*m*n), where I is the number of iterations required for
convergence and k is the number of clusters. I is typically small (5-10).
I can also be easily bounded, as most changes occur in the first few iterations.
• The number of clusters, k, must be specified in advance.
• It is unable to handle noisy data and outliers.
• It is not suitable for discovering clusters with non-convex shapes.
• It is applicable only when the mean is defined.

Chapter 4: Design and Implementation
4.1 Implementation using Graphs
Let us take an example to show the implementation of the K-Means clustering algorithm.
The k-means algorithm requires three user-specified parameters: the number of clusters k, the
cluster initialization, and the distance metric. Typically k-means is run independently for
different values of k, and the partition that appears most meaningful is selected. Different
initializations may lead to different final clusterings, because k-means only converges to a local
minimum. K-means is normally used with the Euclidean metric for computing the distance between
points and cluster centres, so it normally forms spherical or ball-shaped clusters. We try to
choose a natural number of clusters for the data, but in general this notion is not well defined.
Choosing the initial centroids is the key step in the basic k-means algorithm.
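To illustrate why initialization matters, the following sketch runs k-means with k = 2 on the same four points from two different starting positions and reports the final squared error of each run. The class name InitSensitivity, the method finalError, and the data are all made up for this example and are not part of the report's own program.

```java
// Illustrative only: the same data, two initializations, two different results.
class InitSensitivity {

    // Squared Euclidean distance between two 2-D points.
    static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    // Runs k-means with k = 2 from the given starting centres and returns the
    // final squared error (sum of squared distances to the nearest centre).
    static double finalError(double[][] pts, double[][] start) {
        double[][] c = { start[0].clone(), start[1].clone() };
        double previous = Double.MAX_VALUE;
        while (true) {
            double error = 0;
            double[][] sum = new double[2][2];
            int[] count = new int[2];
            for (double[] p : pts) {
                int j = dist2(p, c[0]) <= dist2(p, c[1]) ? 0 : 1;
                error += dist2(p, c[j]);
                sum[j][0] += p[0];
                sum[j][1] += p[1];
                count[j]++;
            }
            if (error >= previous) return previous; // no further improvement
            previous = error;
            for (int j = 0; j < 2; j++)
                if (count[j] > 0) { // an empty cluster keeps its old centre
                    c[j][0] = sum[j][0] / count[j];
                    c[j][1] = sum[j][1] / count[j];
                }
        }
    }

    public static void main(String[] args) {
        // Four corners of a 4 x 1 rectangle (made-up data).
        double[][] pts = { {0, 0}, {0, 1}, {4, 0}, {4, 1} };
        double good = finalError(pts, new double[][] { {0, 0}, {4, 0} });
        double poor = finalError(pts, new double[][] { {0, 0}, {0, 1} });
        System.out.println("left/right start: " + good); // 1.0
        System.out.println("top/bottom start: " + poor); // 16.0
    }
}
```

Both runs reach a stable state, but the second one settles on a clustering whose squared error is sixteen times larger. This is the local-minimum behaviour described above.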

Considering the example given above, we proceed with the implementation of the K-Means
Clustering Algorithm as follows.
Fig 1 shows a set of 7 points plotted at different positions on the graph according to
their characteristics. The two black points represent the randomly chosen initial centroids.
Fig 2 shows that the distance between each chosen centroid and every other point is
calculated using the Euclidean distance metric.
Fig 3 shows that, depending on the resulting distances, each point is assigned to the cluster
with the nearest centroid. As this clustering is only temporary, we need to verify our result.
Fig 4 shows that the entire procedure is repeated, but the centroids, which were initially
chosen at random, are now calculated. The mean of all the x-coordinates in a cluster gives the
x-coordinate of the new centroid for that cluster. Then the mean of all the y-coordinates gives
the y-coordinate of the new centroid for that cluster.
Fig 5 shows that the calculations end with the same results; hence we have verified the
results, showing the successful implementation of the K-Means Clustering Algorithm.
These are the results we get when working with a large number of data items and implementing
the K-Means Clustering Algorithm.

4.2 Implementation using Java
The K-Means clustering algorithm can also be implemented using Java.
The code is given along with its output, showing the successful implementation of the K-Means
Clustering Algorithm.
Code:
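import java.util.Scanner;

class Kmean
{
    public static void main(String args[])
    {
        int i, j = 0, n, x = 0, pass = 0;
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter the no of data:");
        n = sc.nextInt();
        if (n < 2)
        {
            System.out.println("At least 2 data values are needed.");
            return;
        }
        int array[] = new int[n];
        System.out.println("Enter " + n + " data:");
        for (i = 0; i < n; i++)
            array[i] = sc.nextInt();
        // k is fixed at 2: the first two values are the initial means
        float m1 = array[0], m2 = array[1], m1o, m2o, sum1, sum2;
        int k1[] = new int[n]; // members of the 1st cluster
        int k2[] = new int[n]; // members of the 2nd cluster
        do
        {
            pass++;
            m1o = m1;
            m2o = m2;
            x = j = 0;
            sum1 = sum2 = 0;
            // assign every value to the cluster with the nearer mean
            for (i = 0; i < n; i++)
            {
                if (Math.abs(m1 - array[i]) <= Math.abs(m2 - array[i]))
                {
                    k1[x++] = array[i];
                    sum1 += array[i];
                }
                else
                {
                    k2[j++] = array[i];
                    sum2 += array[i];
                }
            }
            // recompute the means; an empty cluster keeps its old mean
            if (x > 0) m1 = sum1 / x;
            if (j > 0) m2 = sum2 / j;
            System.out.println("The 1st cluster in pass " + pass + ":");
            for (i = 0; i < x; i++)
                System.out.print(k1[i] + " ");
            System.out.println();
            System.out.println("The 2nd cluster in pass " + pass + " is:");
            for (i = 0; i < j; i++)
                System.out.print(k2[i] + " ");
            System.out.println();
        } while (m1 != m1o || m2 != m2o); // stop when neither mean changes
        System.out.println("The 1st cluster is:");
        for (i = 0; i < x; i++)
            System.out.print(k1[i] + " ");
        System.out.println();
        System.out.println("The 2nd cluster is:");
        for (i = 0; i < j; i++)
            System.out.print(k2[i] + " ");
        System.out.println();
    }
}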

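OUTPUT:
C:\Program Files (x86)\Java\jdk1.6.0\bin>javac Kmean.java
C:\Program Files (x86)\Java\jdk1.6.0\bin>java Kmean
Enter the no of data:
9
Enter 9 data:
2
4
10
12
3
20
30
11
25
The 1st cluster in pass 1:
2 3
The 2nd cluster in pass 1 is:
4 10 12 20 30 11 25
The 1st cluster in pass 2:
2 4 3
The 2nd cluster in pass 2 is:
10 12 20 30 11 25
The 1st cluster in pass 3:
2 4 10 3
The 2nd cluster in pass 3 is:
12 20 30 11 25
The 1st cluster in pass 4:
2 4 10 12 3 11
The 2nd cluster in pass 4 is:
20 30 25
The 1st cluster in pass 5:
2 4 10 12 3 11
The 2nd cluster in pass 5 is:
20 30 25
The 1st cluster is:
2 4 10 12 3 11
The 2nd cluster is:
20 30 25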

4.3 Applications:
1. Archaeology
The objective here is to cluster the locations of archaeological sites and to make
inferences about political history based on the clusters.
These clusters suggest hypotheses, which can then be tested by actually visiting
the sites.
Clustering locations of archaeological sites in Israel

2. Computational Biology
Here, carp were exposed to different levels of cold, and genes were clustered based on
their response in different tissues.
Green colour indicates that the gene is under-expressed, whereas red colour
indicates that the gene is over-expressed.
We can see in the figure that there are some patterns in different tissues.
Thus clustering is a useful tool with which we can represent a large amount of
information in one plot.
Identification of common set of cold related genes

3. Education
This example is taken from the paper “Teachers as Sources of Middle School Students’
Motivational Identity: Variable Centered and Person Centered Analytic
Approaches”.
In this paper, the survey results of 206 students are clustered.
These clusters are used to identify groups that support an analysis of what affects
motivation.
The number of clusters was selected so as to yield a useful hypothesis, which can
then be verified.

Chapter 5: Conclusion
The K-means algorithm is a simple yet popular method for cluster analysis. Its
performance is determined by the initialisation and by the choice of an appropriate distance
measure. There are several variants of and alternatives to K-means that overcome its
weaknesses:
– K-Medoids: resistance to noise and/or outliers
– K-Modes: extension to categorical data clustering analysis
– CLARA: dealing with large data sets
– Mixture models (EM algorithm): handling uncertainty of clusters
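As a small illustration of why a medoid resists outliers better than a mean, the sketch below compares the two on one made-up 1-D cluster. The class name MeanVsMedoid and the data are assumptions for this example only. It shows just the choice of a representative value, not a full K-Medoids implementation.

```java
// Illustrative only: compares a mean with a medoid on made-up 1-D data.
class MeanVsMedoid {

    // Arithmetic mean of the values.
    static double mean(double[] v) {
        double sum = 0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    // The medoid is the data value with the smallest total distance to the others.
    static double medoid(double[] v) {
        double best = v[0], bestCost = Double.MAX_VALUE;
        for (double candidate : v) {
            double cost = 0;
            for (double x : v) cost += Math.abs(x - candidate);
            if (cost < bestCost) { bestCost = cost; best = candidate; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] cluster = {2, 3, 4, 5, 86}; // 86 is an outlier
        System.out.println("mean   = " + mean(cluster));   // 20.0
        System.out.println("medoid = " + medoid(cluster)); // 4.0
    }
}
```

The single outlier pulls the mean far away from the bulk of the data, while the medoid stays on an actual value inside the main group.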

