Seminar Report on
K-Means Clustering Algorithm
Submitted in partial fulfillment of the Third year Seminar
of
THIRD YEAR OF ENGINEERING
in
COMPUTER ENGINEERING
by
Mr Gaurav Handa
TE CMPN- A
Roll Number:40
Under the Guidance of
Mrs. Veena Kulkarni
Assistant Professor
Thakur College of Engineering and Technology
Shyamnarayan Marg, Thakur Village, Kandivali (E), Mumbai-400101
Year 2013-2014
CERTIFICATE
This is to certify that Mr Gaurav Handa is a bona fide student of Thakur College of Engineering
and Technology, Mumbai. He has satisfactorily completed the requirements of the SEMINAR
as prescribed by the University of Mumbai while working on the seminar topic titled “K-Means
Clustering Algorithm”.
Mrs. Veena Kulkarni
(Guide)
Dr. Rekha Sharma
(HOD CMPN)
Dr. R. R. Sedamkar
(Dean Academics)
Dr. B. K. Mishra
(Principal)
Internal Examiner External Examiner
(Name and Signature with Date) (Name and Signature with Date)
Thakur College of Engineering and Technology
Kandivali (E), Mumbai-400101.
PLACE: Mumbai
DATE:
ACKNOWLEDGEMENT
I would like to express my sincere thanks to Mrs. Veena Kulkarni for her able guidance and
constant support throughout the preparation of this seminar. I would also like to thank my
seminar coordinators for arranging the necessary facilities to carry out the seminar work.
A special thank you to the HOD, Dean Academics, Principal, and Management for their support
through the entire process of preparation.
(Gaurav Handa)
ABSTRACT
Business Intelligence is a more advanced form of
• Data Mining
• Transactional Databases
• Performance Management
• Enterprise Reporting
• Data Warehouse
Business Intelligence enables a business to make intelligent, fact-based decisions.
Data mining is divided into
• Association Analysis
• Classification
• Clustering
• Regression
Data clustering is a method in which we group objects that are similar in their
characteristics into clusters.
Clustering is further divided into
• Hierarchical
• Partitional
• Density-based
The k-means algorithm belongs to partitional clustering.
CONTENTS
Chapter 1 Introduction
1.1 Importance of the seminar topic and its background
Chapter 2 Literature Review
2.1 Problem Definition
2.2 Literature Survey
Chapter 3 Analysis and Planning
3.1 Architecture Overview
3.2 Algorithm
3.3 Flowchart
3.4 Limitations and Drawbacks
Chapter 4 Design and Implementation
4.1 Implementation using Graphs
4.2 Implementation using Java
4.3 Applications
Chapter 5 Conclusion and Future Work
5.1 Experimental Results
Chapter 6 References
Chapter 1: Introduction
1.1 Importance of the seminar topic and its background
Data plays an important role in human activities. Data mining is a knowledge discovery
process that analyzes large volumes of data from various perspectives and organizes them into
useful information. Many organizations generate terabytes of data in a day, and data mining
techniques are used to search this data for valuable information, to identify hidden structures,
and to extract hidden predictive information. Organizations are now starting to realize the
importance of data mining.
This paper presents the k-means algorithm from data mining. Along with a brief
description of the algorithm, we provide graphs and arithmetic problems for a better
understanding of it. The paper shows how the k-means algorithm is used to implement
data mining efficiently, and it also discusses the drawbacks of the algorithm.
K-means is a method of cluster analysis that belongs to partitional clustering. It is a simple
iterative method that partitions a given data set of n observations into a user-specified number
of clusters k, such that each observation belongs to exactly one cluster, namely the one with
the nearest mean.
Chapter 2: Literature Review
2.1 Problem Definition
• To discover knowledge by analyzing large volumes of data from various
perspectives and organizing them into useful information.
• To search for valuable information in large volumes of data and to identify hidden
structures in data.
2.2 Papers on K-Means Clustering Algorithm
• “The Uniqueness of a Good Optimum for K-Means”, Marina Meila, Proceedings of the
23rd International Conference on Machine Learning, 2006. By augmenting k-means with
a simple, randomized seeding technique, one obtains an algorithm that is O(log k)-competitive
with the optimal clustering and that improves both speed and accuracy (this seeding result is
due to the k-means++ work of Arthur and Vassilvitskii, SODA 2007).
• “The Effectiveness of Lloyd-Type Methods for the k-Means Problem”, Rafail Ostrovsky,
Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy, SODA, 2007. Polynomial-time
approximation schemes (PTASs) have been obtained for the k-means clustering
problem.
• “Improved Smoothed Analysis of the k-Means Method”, Bodo Manthey and Heiko
Röglin, preprint, 2008. One of the distinguishing features of k-means is its speed in
practice. Its worst-case running time, however, is exponential, leaving a gap between
practical and theoretical performance. This paper aims at closing this gap.
Chapter 3: Analysis and Planning
3.1 Architecture Overview
The k-means algorithm is a centroid-based technique in which each cluster is represented by the
centre of the cluster.
The algorithm aims at minimizing an objective function, specifically the squared error between
each object and the centre of its cluster, summed over all clusters.
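As a hedged illustration, the squared error objective is commonly written as follows. The symbols C_j (the j-th cluster), \mu_j (its centre) and x_i (an object) are notation assumed here; they do not appear in the report.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller value of J means that the objects lie closer to the centres of their clusters.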
3.2 Algorithm:
Let us give a simple explanation of the k-means algorithm.
Let D be the data set of n objects and let k be the number of clusters. The algorithm distributes the
objects into k clusters such that objects within a cluster are similar to one another and dissimilar to
the objects in other clusters. First, it arbitrarily selects k of the objects, each of which initially represents a
cluster mean or center. Each of the remaining objects is assigned to the cluster to which
it is most similar, based on the distance between the object and the cluster mean. It then computes the
new mean for each cluster, and the process is repeated. This is thus an iterative process, which
continues until stability is reached, that is, until the assignments no longer change. Consider the
k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of
the objects in the cluster.
Input: k = the number of clusters
D = data set containing n objects
Method:
• Randomly choose k objects from D as the initial cluster centers.
• Repeat the following two steps until no change occurs:
  – (Re)assign each object to the cluster to which it is most similar, based on the
    mean value of the objects in the cluster.
  – Update the cluster means, i.e. calculate the new mean value of the objects in each cluster.
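The method above can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not the report's own implementation: the class name KMeansSketch, its method names, the sample points, the use of squared Euclidean distance as the similarity measure, and the choice to leave an empty cluster's center unchanged are all introduced here for illustration.

```java
// Hypothetical helper class for illustration; names and data are not from the report.
class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the center nearest to the point (ties go to the lowest index).
    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Repeats the assignment and update steps until no assignment changes
    // (or maxIterations is reached). 'centers' is modified in place; the
    // returned array maps each object to the index of its cluster.
    static int[] cluster(double[][] data, double[][] centers, int maxIterations) {
        int n = data.length, k = centers.length, dim = data[0].length;
        int[] assignment = new int[n];
        java.util.Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: each object joins the cluster with the nearest center.
            for (int i = 0; i < n; i++) {
                int c = nearestCenter(data[i], centers);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            if (!changed) break;
            // Update step: each center becomes the mean of its assigned objects.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += data[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // assumption: an empty cluster keeps its old center
                for (int d = 0; d < dim; d++) centers[c][d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9} };
        // Assumption: the first k objects serve as the arbitrary initial centers.
        double[][] centers = { {1, 1}, {1, 2} };
        int[] assignment = cluster(data, centers, 100);
        System.out.println(java.util.Arrays.toString(assignment)); // prints [0, 0, 0, 1, 1, 1]
    }
}
```

In this sample run the first pass puts only two points in the first cluster, and the second pass moves the point (1, 2) over once the means have been updated, which is the "repeat until no change" behaviour described in the method.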
3.3 Flowchart:
3.4 Limitations and Drawbacks:
• The space complexity is O(mn), where m is the number of points and n is the number of
attributes.
• The time complexity is O(I*K*m*n), where I is the number of iterations required for
convergence and K is the number of clusters. In practice I is often small (around 5-10),
and it can safely be bounded because most changes occur in the first few iterations.
• The number of clusters K must be specified in advance.
• It is sensitive to noisy data and outliers.
• It is not suitable for discovering clusters with non-convex shapes.
• It is applicable only when the mean is defined.
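To make the sensitivity to outliers concrete, the following small sketch shows how a single extreme value shifts a cluster mean. The class name OutlierSketch and the data values are hypothetical and chosen only for illustration.

```java
// Hypothetical illustration; the class name and the data are not from the report.
class OutlierSketch {

    // Arithmetic mean of the values, i.e. the center k-means would use for this cluster.
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    public static void main(String[] args) {
        double[] clean = {2, 3, 4};
        double[] noisy = {2, 3, 4, 100}; // the same cluster plus one outlier
        System.out.println(mean(clean)); // prints 3.0
        System.out.println(mean(noisy)); // prints 27.25
    }
}
```

With the outlier included, the mean no longer lies near any of the original three values, so the cluster center stops being representative of the cluster.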
Chapter 4: Design and Implementation
4.1 Implementation using Graphs
Let's take an example to show the implementation of the K-Means clustering algorithm.
The k-means algorithm requires three user-specified parameters: the number of clusters k, the cluster
initialization, and the distance metric. Typically, k-means is run independently for different values
of k, and the partition that appears to be most meaningful is selected. Different initializations may
lead to different final clusterings, because k-means only converges to a local minimum. K-means is
normally used with the Euclidean metric for computing the distance between points and cluster
centres. As a result, k-means normally forms spherical or ball-shaped clusters. Ideally, k would match
the natural number of clusters in the data, but in general this notion is not well defined. Choosing the
initial centroids is a key step in the basic k-means algorithm.
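One common way to compare the results of different initializations is to compute the sum of squared Euclidean distances for each final clustering and keep the one with the lower value. The sketch below assumes this selection criterion; the class name RestartSketch, its methods, and the four sample points are hypothetical. Both candidate clusterings shown are stable under the k-means steps, yet they differ greatly in quality.

```java
// Hypothetical illustration; names and data are not from the report.
class RestartSketch {

    // Euclidean distance between two points of equal dimension.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Sum of squared distances from each point to the center of its assigned cluster.
    static double sse(double[][] data, double[][] centers, int[] assignment) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            double dist = euclidean(data[i], centers[assignment[i]]);
            total += dist * dist;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0, 2}, {10, 0}, {10, 2} };

        // Result A: the left pair and the right pair form the two clusters.
        double[][] centersA = { {0, 1}, {10, 1} };
        int[] assignmentA = {0, 0, 1, 1};

        // Result B: the bottom pair and the top pair form the two clusters.
        // Every point is still nearest to its own center, so k-means would not leave this state.
        double[][] centersB = { {5, 0}, {5, 2} };
        int[] assignmentB = {0, 1, 0, 1};

        System.out.println(sse(data, centersA, assignmentA)); // prints 4.0
        System.out.println(sse(data, centersB, assignmentB)); // prints 100.0
    }
}
```

Running k-means from several initializations and keeping the result with the smallest value is one simple way to reduce the effect of a poor starting choice.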
Considering the example given above, we proceed with the implementation of the K-Means
Clustering Algorithm as follows.
Fig 1 shows a set of 7 points that are plotted at different positions depending upon their
characteristics. The two black points represent the randomly chosen initial centroids.
Fig 2 shows that the distance between the chosen centroids and every other
point is calculated using the Euclidean distance metric.
Fig 3 shows that, depending upon these distances, we assign each point to the cluster with the
nearest centroid. Now that a temporary clustering has been made, we need to verify our result.
Fig 4 shows that the entire procedure is repeated. This time, however, the centroids are
calculated rather than randomly chosen. The mean of all the x-coordinates in a cluster gives the
x-coordinate of the new centroid for that cluster, and the mean of all the y-coordinates gives
its y-coordinate.
Fig 5 shows that the calculations end with the same result, which verifies the clustering and
shows the successful implementation of the K-Means Clustering Algorithm.
The same procedure applies when working with a large number of data items.
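The distance calculation of Fig 2 and the centroid update of Fig 4 can be written out as a short worked example. The coordinates below are hypothetical, since the points plotted in the figures are not listed in the text; p denotes a point and c a centroid.

```latex
d(p, c) = \sqrt{(p_x - c_x)^2 + (p_y - c_y)^2},
\qquad
d\big((5,6),\,(2,2)\big) = \sqrt{3^2 + 4^2} = 5

c_{\text{new}} = \left( \frac{1 + 2 + 3}{3},\; \frac{1 + 1 + 4}{3} \right) = (2,\, 2)
\quad \text{for a cluster containing } (1,1),\ (2,1),\ (3,4)
```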
4.2 Implementation using Java
The K-Means clustering algorithm can also be implemented using Java.
The code below, along with its output, shows a successful implementation of the K-Means
Clustering Algorithm.
Code:
import java.util.Scanner;

// One-dimensional k-means for k = 2 clusters.
class Kmean
{
    public static void main(String args[])
    {
        int i, j = 0, n, x = 0, l = 0;
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter the no of data:");
        n = sc.nextInt();
        if (n < 2)
        {
            System.out.println("At least 2 data items are required.");
            return;
        }
        int array[] = new int[n];
        System.out.println("Enter " + n + " data:");
        for (i = 0; i < n; i++)
            array[i] = sc.nextInt();
        // The first two data items serve as the initial means.
        float m1 = array[0], m2 = array[1], m1o, m2o, sum1, sum2;
        int k1[] = new int[n];   // members of the 1st cluster
        int k2[] = new int[n];   // members of the 2nd cluster
        // Repeat until neither mean changes between two passes.
        do
        {
            l++;
            m1o = m1;
            m2o = m2;
            x = j = 0;
            sum1 = sum2 = 0;
            // Assign each item to the cluster with the nearer mean.
            for (i = 0; i < n; i++)
            {
                if (Math.abs(m1 - array[i]) <= Math.abs(m2 - array[i]))
                {
                    k1[x++] = array[i];
                    sum1 += array[i];
                }
                else
                {
                    k2[j++] = array[i];
                    sum2 += array[i];
                }
            }
            // Recompute the means; an empty cluster keeps its old mean.
            if (x > 0) m1 = sum1 / x;
            if (j > 0) m2 = sum2 / j;
            System.out.print("The 1st cluster in pass " + l + ":");
            for (i = 0; i < x; i++)
                System.out.print(k1[i] + " ");
            System.out.print("\t\tThe 2nd cluster in pass " + l + " is:");
            for (i = 0; i < j; i++)
                System.out.print(k2[i] + " ");
            System.out.println();
        } while (m1 != m1o || m2 != m2o);
        System.out.print("The 1st cluster is:");
        for (i = 0; i < x; i++)
            System.out.print(k1[i] + " ");
        System.out.print("\t\tThe 2nd cluster is:");
        for (i = 0; i < j; i++)
            System.out.print(k2[i] + " ");
    }
}
OUTPUT:
C:\Program Files (x86)\Java\jdk1.6.0\bin>javac Kmean.java
C:\Program Files (x86)\Java\jdk1.6.0\bin>java Kmean
Enter the no of data:
9
Enter 9 data:
2
4
10
12
3
20
30
11
25
The 1st cluster in pass 1:
2 3
The 2nd cluster in pass 1 is:
4 10 12 20 30 11 25
The 1st cluster in pass 2:
2 4 3
The 2nd cluster in pass 2 is:
10 12 20 30 11 25
The 1st cluster in pass 3:
2 4 10 3
The 2nd cluster in pass 3 is:
12 20 30 11 25
The 1st cluster in pass 4:
2 4 10 12 3 11
The 2nd cluster in pass 4 is:
20 30 25
The 1st cluster in pass 5:
2 4 10 12 3 11
The 2nd cluster in pass 5 is:
20 30 25
The 1st cluster is:
2 4 10 12 3 11 The 2nd cluster is:
20 30 25
4.3 Applications:
1. Archaeology
The objective here is to cluster the locations of archaeological sites and to make
inferences about political history based on the clusters.
The clusters suggest hypotheses, which can then be tested by
actually visiting the sites.
Figure: Clustering locations of archaeological sites in Israel
2. Computational Biology
Here, carp were exposed to different levels of cold, and genes were clustered based on their
response in different tissues.
In the figure, green indicates that a gene is under-expressed, whereas red
indicates that it is over-expressed.
The figure shows that there are patterns across the different tissues.
Clustering is thus a useful tool for representing a large amount of information in
one plot.
Figure: Identification of a common set of cold-related genes
3. Education
This example is taken from the paper “Teachers as Sources of Middle School Students’
Motivational Identity: Variable-Centered and Person-Centered Analytic
Approaches”.
In this paper, the survey results of 206 students are clustered.
The clusters are used to identify groups that support an analysis of what affects
motivation.
The number of clusters was selected so as to yield meaningful hypotheses, which
can then be verified.
Conclusion:
The k-means algorithm is a simple yet popular method for cluster analysis. Its
performance depends on the initialisation and on the choice of an appropriate distance measure.
There are several variants of and alternatives to k-means that overcome its weaknesses:
– K-Medoids: resistance to noise and/or outliers
– K-Modes: extension to categorical data
– CLARA: dealing with large data sets
– Mixture models (EM algorithm): handling uncertainty of cluster membership
References:
Following below are the list of references for the seminar.
List of References
[1] Bowman, M., Debray, S. K., and Peterson, L. L. 1993. Reasoning about naming systems. .
[2] Ding, W. and Marchionini, G. 1997 A Study on Video Browsing Strategies. Technical
Report. University of Maryland at College Park.
[3] Fröhlich, B. and Plate, J. 2000. The cubic mouse: a new device for three-dimensional input.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
[4] Tavel, P. 2007 Modeling and Simulation Design. AK Peters Ltd.

More Related Content

What's hot

KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
Simplilearn
 

What's hot (20)

K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
K mean-clustering
K mean-clusteringK mean-clustering
K mean-clustering
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
KNN
KNNKNN
KNN
 
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor Classifier
 
Neural Networks: Multilayer Perceptron
Neural Networks: Multilayer PerceptronNeural Networks: Multilayer Perceptron
Neural Networks: Multilayer Perceptron
 
Support Vector Machines ( SVM )
Support Vector Machines ( SVM ) Support Vector Machines ( SVM )
Support Vector Machines ( SVM )
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
Image classification with Deep Neural Networks
Image classification with Deep Neural NetworksImage classification with Deep Neural Networks
Image classification with Deep Neural Networks
 
Decision tree
Decision treeDecision tree
Decision tree
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 
K-Means clustring @jax
K-Means clustring @jaxK-Means clustring @jax
K-Means clustring @jax
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
Cluster Analysis Introduction
Cluster Analysis IntroductionCluster Analysis Introduction
Cluster Analysis Introduction
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clustering
 
Support vector machines (svm)
Support vector machines (svm)Support vector machines (svm)
Support vector machines (svm)
 
Chapter8
Chapter8Chapter8
Chapter8
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 

Similar to K means report

CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
Nandhini S
 
Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2
IAEME Publication
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
Laura Petrosanu
 

Similar to K means report (20)

Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering Algorithm
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
Noura2
Noura2Noura2
Noura2
 
Master's Thesis Presentation
Master's Thesis PresentationMaster's Thesis Presentation
Master's Thesis Presentation
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data Mining
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithm
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracy
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2
 
K-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log DataK-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log Data
 
Mine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means ClusteringMine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means Clustering
 
Analysis and implementation of modified k medoids
Analysis and implementation of modified k medoidsAnalysis and implementation of modified k medoids
Analysis and implementation of modified k medoids
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
 
Bb25322324
Bb25322324Bb25322324
Bb25322324
 
Af4201214217
Af4201214217Af4201214217
Af4201214217
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniques
 

More from Gaurav Handa (9)

Gaurav handa - Big Data and Hadoop
Gaurav handa - Big Data and HadoopGaurav handa - Big Data and Hadoop
Gaurav handa - Big Data and Hadoop
 
Gaurav handa - Business Analytics
Gaurav handa - Business AnalyticsGaurav handa - Business Analytics
Gaurav handa - Business Analytics
 
Gaurav handa - Data Visualization
Gaurav handa - Data VisualizationGaurav handa - Data Visualization
Gaurav handa - Data Visualization
 
A comparative study of hawk eye and goal line
A comparative study of hawk eye and goal lineA comparative study of hawk eye and goal line
A comparative study of hawk eye and goal line
 
Ijca paper template
Ijca paper templateIjca paper template
Ijca paper template
 
Kmeans
KmeansKmeans
Kmeans
 
B.E degree
B.E degreeB.E degree
B.E degree
 
Project ISR
Project ISRProject ISR
Project ISR
 
Project WeLike
Project WeLikeProject WeLike
Project WeLike
 

Recently uploaded

原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
pwgnohujw
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
zifhagzkk
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
saurabvyas476
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
yulianti213969
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
mikehavy0
 
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
wsppdmt
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 

Recently uploaded (20)

Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
DS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .pptDS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .ppt
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTSDBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
jll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdf
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
 
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
 
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
 

K means report

  • 1. 1 Seminar Report on K-Means Clustering Algorithm Submitted in partial fulfillment of the Third year Seminar of THIRD YEAR OF ENGINEERING in COMPUTER ENGINEERING by Mr Gaurav Handa TE CMPN- A Roll Number:40 Under the Guidance of Mrs.Veena Kulkarni Asistanat Professor Thakur college of Engineering and Technology ShyamnarayanMarg, Thakur Village, Kandivali(E), Mumbai-101 Year 2013-2014
  • 2. 2 CERTIFICATE This is to certify that Mr Gaurav Handa is a bonafide student of Thakur College of Engineering and Technology, Mumbai. He has satisfactorily completed the requirements of the SEMINAR as prescribed by University of Mumbai while working on seminar topic titled “K-Means Clustering Algorithm”. Mrs.Veena Kulkarni (Guide) Dr. Rekha Sharma (HOD CMPN) Dr. R. R. Sedamkar (Dean Academics) Dr. B. K. Mishra (Principal) Internal Examiner External Examiner (Name and Signature with Date) (Name and Signature with Date) Thakur College of Engineering and Technology Kandivali (E), Mumbai-400101. PLACE: Mumbai DATE:
  • 3. 3 ACKNOWLEDGEMENT I would like to put my sincere thanks to Mrs. Veena Kulkarni Mam, for her able guidance and constant support, through-out process of preparation of seminar.I would also like to thank my seminar coordinators for arranging the necessary facilities to carry out the seminar work. A special thank you to the HOD, Dean Academics, Principal, and Management for their support, through the entire process of preparation. (Gaurav Handa)
  • 4. 4 ABSTRACT Business Intelligence is a more advanced form of  Data Mining  Transactional Databases  Performance Management  Enterprise Reporting  Dataware House Business Intelligence enables the business to make intelligent and fact-based decisions. It is divided into  Association Analysis  Classification  Clustering  Regression. Data clustering is a method in which we make cluster of objects which are somewhat similar in characteristics. Clustering is further divided into  Hierarchical  Partitional  Density based K-means algorithm is a part of partitional clustering.
  • 5. 5 C O N T E N T S Chapter No. Topic Pg. No. Chapter 1 Introduction 1.1 Importance of the seminar topic and its background 6 Chapter 2 Literature Review 2.1 Problem Defination 7 2.2 Literature Survey 7 Chapter 3 Analysis and Planning 3.1 Architecture Over view 8 3.2 3.3 3.4 Algorithm Flowchart Limitations and Drawbacks 9 10 11 Chapter 4 Designand Implementation 4.1 4.2 4.3 Implementation using graph Implementation using Java Applications 12-13 14-16 17-19 Chapter 5 Conclusion and Future work 5.1 Experimental Results 20 Chapter 6 References 21
  • 6. 6 Importance of the seminar topic and its background Chapter 1: Introduction Data has an important role in human activities. Data mining is a knowledge discovery process by analyzing large volumes of data from various perspectives and organizing them into useful information. Terabytes of data are generated in many organizations in a day. Data mining is search for valuable information in large volumes of data. Data mining is used to identify hidden structures in data.Data mining techniques are used to extract hidden predictive information from large volumes of data. Organizations are now starting to realize the importance of data mining. This paper presents the k-means algorithm from data mining. Along with a brief description of the algorithm we have also provided graphs and arithmetic problems for better understanding of the algorithm. This paper shows how k-means algorithm isused to implement data mining efficiently along with the drawbacks of this algorithm. K-means is an algorithm which is a part of partitional clustering. The k-means algorithm can group data into k number of categories. The k-means algorithm is a simple iterative method to partition a given data set into user specified number of clusters k. K-means is a method of cluster analysis. Its aim is to partition n observations into k clusters and each observation will be a part of any one cluster with the nearest mean.
  • 7. 7 Chapter 2: Literature Review
    2.1 Problem Definition
      • The knowledge discovery process of analyzing large volumes of data from various perspectives and organizing them into useful information.
      • The search for valuable information in large volumes of data, and the identification of hidden structures in that data.
    2.2 Papers on K-Means Clustering Algorithm
      • “The Uniqueness of a Good Optimum for K-Means”, Marina Meila, Proceedings of the 23rd International Conference on Machine Learning, 2006. By augmenting k-means with a simple randomized seeding technique, the authors obtained an algorithm that is O(log k)-competitive with the optimal clustering, improving both speed and accuracy.
      • “The Effectiveness of Lloyd-Type Methods for the k-Means Problem”, Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy, SODA, 2007. Polynomial-time approximation schemes (PTASs) have been obtained for the k-means clustering problem.
      • “Improved Smoothed Analysis of the k-Means Method”, Bodo Manthey and Heiko Röglin, preprint, 2008. The paper notes that one of the distinguishing features of k-means is its speed in practice. Its worst-case running time, however, is exponential, leaving a gap between practical and theoretical performance. The paper aims at closing this gap.
  • 8. 8 Chapter 3: Analysis and Planning
    3.1 Architecture Overview
    The k-means algorithm is a centroid-based technique in which each cluster is represented by the centre of the cluster. The algorithm aims at minimizing an objective function, specifically a squared error function.
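The squared error function mentioned above is commonly written as follows. This is a sketch in standard notation; the symbols are assumptions introduced here and do not appear in the report: C_j is the j-th cluster, mu_j its centre (mean), and x_i a data object.

```latex
\[
J \;=\; \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j \;=\; \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i .
\]
```

Each pass of k-means either lowers J or leaves it unchanged, which is why the iteration eventually stops.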
  • 9. 9 3.2 Algorithm:
    Let us give a simple explanation of the k-means algorithm. Let D be the data set of n objects and let k be the number of clusters. The algorithm distributes the objects into k clusters such that objects within a cluster are similar to one another and dissimilar to the objects in other clusters.
    First, it arbitrarily selects k of the objects, each of which initially represents a cluster mean or centre. Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster, and the process is repeated. This is an iterative process which continues until the assignments are stable.
    Consider the k-means algorithm for partitioning, where each cluster's centre is represented by the mean value of the objects in the cluster.
    Input:
      k = the number of clusters
      D = data set containing n objects
    Method:
      1. Randomly choose k objects from D as the initial cluster centres.
      2. Allocate each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster.
      3. Calculate the new mean value for each cluster.
      4. Repeat steps 2 and 3 until there is no change.
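The method above can be sketched in Java roughly as follows. This is a minimal illustration for 2-D points, not the report's own code: the class name KMeansSketch, its method names, and the sample points are all hypothetical, and the initial centres are passed in by the caller rather than drawn at random so that the run is reproducible.

```java
import java.util.Arrays;

// Minimal k-means sketch for 2-D points (illustrative names and data).
class KMeansSketch {

    // Squared Euclidean distance between two 2-D points.
    static double squaredDistance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    // Index of the centre closest to point p (ties go to the lower index).
    static int nearestCenter(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(p, centers[c]) < squaredDistance(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs k-means from the given initial centres and returns the cluster
    // index of every point. The centers array is updated in place.
    static int[] cluster(double[][] points, double[][] centers) {
        int k = centers.length;
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        boolean changed = true;
        while (changed) {
            changed = false;
            // Step 2: allocate each object to the nearest centre.
            for (int i = 0; i < points.length; i++) {
                int c = nearestCenter(points[i], centers);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            // Step 3: recompute each centre as the mean of its members.
            double[][] sum = new double[k][2];
            int[] count = new int[k];
            for (int i = 0; i < points.length; i++) {
                sum[assignment[i]][0] += points[i][0];
                sum[assignment[i]][1] += points[i][1];
                count[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) { // an empty cluster keeps its old centre
                    centers[c][0] = sum[c][0] / count[c];
                    centers[c][1] = sum[c][1] / count[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: 7 points, 2 clusters.
        double[][] points = {
            {1, 1}, {1.5, 2}, {3, 4}, {5, 7}, {3.5, 5}, {4.5, 5}, {3.5, 4.5}
        };
        double[][] centers = {{1, 1}, {5, 7}};
        int[] result = cluster(points, centers);
        System.out.println("Assignments: " + Arrays.toString(result));
        System.out.println("Centres:     " + Arrays.deepToString(centers));
    }
}
```

With these sample points the first two points end up in one cluster and the remaining five in the other. The loop stops as soon as a pass changes no assignment, which corresponds to "repeat until no change" in the method.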
  • 11. 11 3.4 Limitations and Drawbacks:
      - The space complexity is O(mn), where m is the number of points and n is the number of attributes.
      - The time complexity is O(I*k*m*n), where I is the number of iterations required for convergence. I is typically small (5-10) and can safely be bounded, as most changes occur in the first few iterations.
      - The number of clusters, k, must be specified in advance.
      - Sensitive to noisy data and outliers.
      - Not suitable for discovering clusters with non-convex shapes.
      - Applicable only when a mean is defined.
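The sensitivity to outliers can be illustrated with a small Java sketch. It is an illustration under assumed data, not part of the report: the class OutlierEffect and the sample values are hypothetical. It compares the mean, which k-means uses as the cluster centre, with a medoid, an actual data value that minimizes the total distance to all the others.

```java
// Illustrative sketch: how one outlier moves the mean but not the medoid.
class OutlierEffect {

    // Arithmetic mean of the values.
    static double mean(double[] v) {
        double sum = 0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    // The value in v with the smallest total absolute distance to all values
    // (the first such value if several tie).
    static double medoid(double[] v) {
        double best = v[0];
        double bestCost = Double.POSITIVE_INFINITY;
        for (double candidate : v) {
            double cost = 0;
            for (double x : v) {
                cost += Math.abs(candidate - x);
            }
            if (cost < bestCost) {
                bestCost = cost;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] clean = {1, 2, 3, 4, 5};       // hypothetical cluster
        double[] noisy = {1, 2, 3, 4, 5, 100};  // same cluster plus one outlier
        System.out.println("mean without outlier:   " + mean(clean));
        System.out.println("mean with outlier:      " + mean(noisy));
        System.out.println("medoid without outlier: " + medoid(clean));
        System.out.println("medoid with outlier:    " + medoid(noisy));
    }
}
```

In this example the single value 100 pulls the mean from 3 to about 19.2, a position where no data actually lies, while the medoid stays at 3.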
  • 12. 12 Chapter 4: Design and Implementation
    4.1 Implementation using Graphs
    Let's take an example to show the implementation of the K-Means clustering algorithm. The k-means algorithm requires 3 user-specified parameters: the number of clusters k, the cluster initialization, and the distance metric. Typically k-means is run separately for different values of k, and the partition that appears to be most meaningful is selected. Different initializations may lead to different final clusterings, because k-means converges only to a local minimum. K-means is normally used with the Euclidean metric for computing the distance between points and cluster centres, so it normally forms spherical or ball-shaped clusters. We try to choose the natural number of clusters, but in general this notion is not well defined. Choosing the initial centroids is the key step in the basic k-means algorithm.
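The effect of initialization can be seen in a small one-dimensional Java sketch. Everything in it is an assumption made for illustration: the class InitSensitivity, the method finalSse, and the six data values are hypothetical. The same data is clustered twice with k = 2, and the final sum of squared errors is reported for each starting point.

```java
import java.util.Arrays;

// Illustrative sketch: the same data clustered from two different initial centres.
class InitSensitivity {

    // Runs 1-D k-means from the given initial centres and returns the final
    // sum of squared errors (SSE). The caller's array is not modified.
    static double finalSse(double[] data, double[] initialCenters) {
        double[] centers = initialCenters.clone();
        int[] assign = new int[data.length];
        Arrays.fill(assign, -1);
        boolean changed = true;
        while (changed) {
            changed = false;
            // Assign every value to its nearest centre.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (Math.abs(data[i] - centers[c]) < Math.abs(data[i] - centers[best])) {
                        best = c;
                    }
                }
                if (best != assign[i]) {
                    assign[i] = best;
                    changed = true;
                }
            }
            // Recompute every centre as the mean of its members.
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            for (int i = 0; i < data.length; i++) {
                sum[assign[i]] += data[i];
                count[assign[i]]++;
            }
            for (int c = 0; c < centers.length; c++) {
                if (count[c] > 0) { // an empty cluster keeps its old centre
                    centers[c] = sum[c] / count[c];
                }
            }
        }
        double sse = 0;
        for (int i = 0; i < data.length; i++) {
            double d = data[i] - centers[assign[i]];
            sse += d * d;
        }
        return sse;
    }

    public static void main(String[] args) {
        double[] data = {0, 1, 10, 11, 25, 26}; // hypothetical values
        System.out.println("SSE from centres {0, 1}:   " + finalSse(data, new double[]{0, 1}));
        System.out.println("SSE from centres {25, 26}: " + finalSse(data, new double[]{25, 26}));
    }
}
```

Starting from {0, 1} the run settles on the clusters {0, 1} and {10, 11, 25, 26}, while starting from {25, 26} it settles on {0, 1, 10, 11} and {25, 26}, which has a lower squared error. Both are stable under further iterations, which is why k-means is often run from several initializations and the best result kept.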
  • 13. 13 Considering the example given above, we proceed with the implementation of the K-Means Clustering Algorithm as follows.
    Fig 1 shows a given set of 7 points, plotted at different positions according to their characteristics. The two black points represent the randomly chosen initial centroids.
    Fig 2 shows the distance being calculated between the chosen centroids and every other point, using the Euclidean distance metric.
    Fig 3 shows that, depending on these distances, each point is assigned to the cluster with the nearest centroid. As this clustering is only temporary, we need to verify the result.
    Fig 4 shows the procedure being repeated, except that the centroids, which were initially chosen at random, are now calculated. The mean of all the x-coordinates in a cluster gives the x-coordinate of its new centroid, and the mean of all the y-coordinates gives the y-coordinate.
    Fig 5 shows that the recalculation ends with the same clusters, so the result is verified and the K-Means Clustering Algorithm has been implemented successfully.
    The same procedure applies when working with a large number of data items.
  • 14. 14 4.2 Implementation using Java
    The K-Means Clustering Algorithm can also be implemented in Java. The code below, followed by its output, clusters one-dimensional integer data into two clusters.
    Code:

    import java.util.Scanner;

    class Kmean {
        public static void main(String[] args) {
            Scanner sc = new Scanner(System.in);
            System.out.println("Enter the no of data:");
            int n = sc.nextInt();
            if (n < 2) {
                System.out.println("At least 2 data items are needed for 2 clusters.");
                return;
            }
            int[] array = new int[n];
            System.out.println("Enter " + n + " data:");
            for (int i = 0; i < n; i++)
                array[i] = sc.nextInt();

            // The first two values serve as the initial means of the two clusters.
            float m1 = array[0], m2 = array[1];
            float m1o, m2o;          // means from the previous pass
            int[] k1 = new int[n];   // members of cluster 1
            int[] k2 = new int[n];   // members of cluster 2
            int x, j;                // current sizes of cluster 1 and cluster 2
            int l = 0;               // pass counter

            // Repeat until neither mean changes between two passes.
            do {
                l++;
                m1o = m1;
                m2o = m2;
                x = j = 0;
                float sum1 = 0, sum2 = 0;
                // Assign each value to the cluster with the nearer mean.
                for (int i = 0; i < n; i++) {
                    if (Math.abs(m1 - array[i]) <= Math.abs(m2 - array[i])) {
                        k1[x++] = array[i];
                        sum1 += array[i];
                    } else {
                        k2[j++] = array[i];
                        sum2 += array[i];
                    }
                }
                // Recompute the means; an empty cluster keeps its previous mean.
                if (x > 0) m1 = sum1 / x;
                if (j > 0) m2 = sum2 / j;
                System.out.print("The 1st cluster in pass " + l + ":");
                for (int i = 0; i < x; i++)
                    System.out.print(k1[i] + " ");
                System.out.print("\t\tThe 2nd cluster in pass " + l + " is:");
                for (int i = 0; i < j; i++)
                    System.out.print(k2[i] + " ");
                System.out.println();
            } while (m1 != m1o || m2 != m2o);

            System.out.print("The 1st cluster is:");
            for (int i = 0; i < x; i++)
                System.out.print(k1[i] + " ");
            System.out.print("\t\tThe 2nd cluster is:");
            for (int i = 0; i < j; i++)
                System.out.print(k2[i] + " ");
        }
    }
  • 16. 16 OUTPUT:
    C:\Program Files (x86)\Java\jdk1.6.0\bin>javac Kmean.java
    C:\Program Files (x86)\Java\jdk1.6.0\bin>java Kmean
    Enter the no of data:
    9
    Enter 9 data:
    2 4 10 12 3 20 30 11 25
    The 1st cluster in pass 1: 2 3                  The 2nd cluster in pass 1 is: 4 10 12 20 30 11 25
    The 1st cluster in pass 2: 2 4 3                The 2nd cluster in pass 2 is: 10 12 20 30 11 25
    The 1st cluster in pass 3: 2 4 10 3             The 2nd cluster in pass 3 is: 12 20 30 11 25
    The 1st cluster in pass 4: 2 4 10 12 3 11       The 2nd cluster in pass 4 is: 20 30 25
    The 1st cluster in pass 5: 2 4 10 12 3 11       The 2nd cluster in pass 5 is: 20 30 25
    The 1st cluster is: 2 4 10 12 3 11              The 2nd cluster is: 20 30 25
  • 17. 17 4.3 Applications:
    1. Archaeology
    The objective here is to cluster the locations of archaeological sites and to make inferences about political history based on the clusters. The clusters suggest hypotheses, which can then be tested by actually visiting the sites.
    Fig: Clustering locations of archaeological sites in Israel
  • 18. 18 2. Computational Biology
    Here, carp were exposed to different levels of cold, and genes were clustered based on their response in different tissues. Green indicates that a gene is under-expressed, whereas red indicates that it is over-expressed. The figure shows that there are patterns across the different tissues. Clustering is thus a useful tool for representing a large amount of information in one plot.
    Fig: Identification of common set of cold related genes
  • 19. 19 3. Education
    This example is taken from the paper “Teachers as Sources of Middle School Students’ Motivational Identity: Variable Centered and Person Centered Analytic Approaches”. In this paper, the survey results of 206 students are clustered. The clusters are used to identify groups that support an analysis of what affects motivation. The number of clusters was selected so as to yield a useful hypothesis, which can then be verified.
  • 20. 20 Conclusion:
    The k-means algorithm is a simple yet popular method for clustering analysis. Its performance is determined by the initialisation and by the choice of an appropriate distance measure. There are several variants of k-means that overcome its weaknesses:
      – K-Medoids: resistance to noise and/or outliers
      – K-Modes: extension to categorical data clustering analysis
      – CLARA: dealing with large data sets
      – Mixture models (EM algorithm): handling uncertainty of clusters