Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
According to Vladimir Estivill-Castro, the notion of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms.[4] There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models again different algorithms can be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties. Understanding these "cluster models" is key to understanding the differences between the various algorithms. Typical cluster models include:
Connectivity models: for example hierarchical clustering builds models based on distance connectivity.
Centroid models: for example the k-means algorithm represents each cluster by a single mean vector.
Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the Expectation-maximization algorithm.
Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.
Subspace models: in Biclustering (also known as Co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.
Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.
Graph-based models: a clique, i.e., a subset of nodes in a graph such that every two nodes in the subset are connected by an edge can be considered as a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques.
A "clustering" is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished as:
hard clustering: each object belongs to a cluster or not
soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain degree (e.g. a likelihood of belonging to the cluster)
There are also finer distinctions possible, for example:
strict partitioning clustering: here each object belongs to exactly one cluster
strict partitioning clustering with outliers: objects can also belong to no cluster, and are considered outliers.
overlapping clustering (also: alternative clustering, multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster.
MINISTRY OF EDUCATION, YOUTHS, SCIENCE AND SPORTS, FRANCE
Ecole Internationale des Sciences du Traitement de l'Information
Course: DATA MINING
Course Work on CLUSTERS
DEPARTMENT: BUSINESS ANALYTICS (M2)
COURSE TUTOR: PROF. MARIA MALEK
Submitted by: UDEH TOCHUKWU LIVINUS
Cergy Pontoise - 2014
1A. EVALUATING CAR ACCEPTABILITY USING THE K-MEANS ALGORITHM

K = 3, Seed = 10
  Within cluster sum of squared errors: 5709.0
  Incorrectly clustered instances: 807 (46.7014 %)
  Cluster 0 (unacc): 774 (45 %)
  Cluster 1 (acc):   600 (35 %)
  Cluster 2 (good):  354 (20 %)

K = 3, Seed = 100
  Within cluster sum of squared errors: 5547.0
  Incorrectly clustered instances: 1020 (59.0278 %)
  Cluster 0 (acc):   813 (47 %)
  Cluster 1 (unacc): 555 (32 %)
  Cluster 2 (vgood): 360 (21 %)

K = 4, Seed = 10
  Within cluster sum of squared errors: 5390.0
  Incorrectly clustered instances: 979 (56.6551 %)
  Cluster 0 (unacc): 592 (34 %)
  Cluster 1 (acc):   557 (32 %)
  Cluster 2 (good):  327 (19 %)
  Cluster 3 (vgood): 252 (15 %)

K = 4, Seed = 100
  Within cluster sum of squared errors: 5316.0
  Incorrectly clustered instances: 1093 (63.2523 %)
  Cluster 0 (acc):   697 (40 %)
  Cluster 1 (unacc): 496 (29 %)
  Cluster 2 (vgood): 346 (20 %)
  Cluster 3 (good):  189 (11 %)

K = 5, Seed = 10
  Within cluster sum of squared errors: 5106.0
  Incorrectly clustered instances: 1064 (61.5741 %)
  Cluster 0 (unacc):    543 (31 %)
  Cluster 1 (acc):      430 (25 %)
  Cluster 2 (good):     302 (17 %)
  Cluster 3 (no class): 227 (13 %)
  Cluster 4 (vgood):    226 (13 %)

K = 5, Seed = 100
  Within cluster sum of squared errors: 4996.0
  Incorrectly clustered instances: 1162 (67.2454 %)
  Cluster 0 (acc):      586 (34 %)
  Cluster 1 (unacc):    424 (25 %)
  Cluster 2 (vgood):    310 (18 %)
  Cluster 3 (no class): 174 (10 %)
  Cluster 4 (good):     234 (14 %)
ANALYSIS OF THE RESULT:
This algorithm minimizes the total squared distance from the instances to their cluster centers, but it converges to a local, not a global, minimum, so we tend to get different results when we vary the seed. From the table above, at K = 3 we obtained a smaller sum of squared errors when the seed was set to 100 than when it was 10. However, the number of incorrectly clustered instances moves in the opposite direction: while the squared distances of the instances decrease, the number of incorrectly clustered instances increases, and vice versa. Hence, we compare the similarity of the clustered instances across runs rather than relying on any single seed value.
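The seed sensitivity described above can be reproduced with a minimal sketch of Lloyd's algorithm (a toy implementation on synthetic data, not the Weka SimpleKMeans run reported in the table): different seeds pick different initial centers and therefore converge to different local minima of the within-cluster sum of squared errors.

```python
import numpy as np

def kmeans(X, k, seed, n_iter=50):
    """Minimal Lloyd's algorithm: returns labels and the within-cluster
    sum of squared errors (WCSS) for one random initialisation."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    wcss = ((X - centers[labels]) ** 2).sum()
    return labels, wcss

# toy data: three loose one-dimensional-ish blobs in the plane
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in (0, 5, 10)])

for seed in (10, 100):
    _, wcss = kmeans(X, k=3, seed=seed)
    print(f"seed={seed}: WCSS={wcss:.1f}")
```

Running it with seeds 10 and 100 generally yields different WCSS values, which is exactly the behaviour observed in the table.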
FIGURE 1.0
The figure below illustrates the clusters and instances from the table above. The Y-axis represents the class value and the X-axis the instance number. The colour represents the cluster, so we can inspect each instance from the menu and compare similarities before validating our decisions.
Clustered Instances
0 1616 ( 94%)
1 112 ( 6%)
Log likelihood: -7.45474
Class attribute: class
Classes to Clusters:
0 1 <-- assigned to cluster
1140 70 | unacc
352 32 | acc
59 10 | good
65 0 | vgood
Cluster 0 <-- unacc
Cluster 1 <-- acc
Clustered Instances
0 1718 ( 99%)
2 10 ( 1%)
Log likelihood: -7.45474
Class attribute: class
Classes to Clusters:
0 2 <-- assigned to cluster
1200 10 | unacc
384 0 | acc
69 0 | good
65 0 | vgood
Cluster 0 <-- unacc
Cluster 2 <-- No class
ANALYSIS OF THIS ALGORITHM:
This algorithm takes a probabilistic approach to clustering, using expectation maximization to assign instances to clusters. From the table we see that each attribute value has a probability assigned to it: the probability of each value is obtained by dividing its count by the total count over all values of that attribute, and from these we can calculate the probability of each cluster. The overall quality measure used is the log likelihood. For nominal attributes the model stores the probability of each value; for numeric attributes it stores the mean and standard deviation. EM is also an unsupervised learning algorithm. Compared with K-Means, the number of incorrectly classified instances is smaller.
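The attribute-probability calculation described above (each value's count divided by the total over all values of the attribute) can be sketched as follows; the counts and the sample are hypothetical, for illustration only:

```python
import math

# hypothetical value counts for one nominal attribute ("safety")
# inside a single cluster
counts = {"low": 576, "med": 576, "high": 576}

total = sum(counts.values())
probs = {v: c / total for v, c in counts.items()}  # each count / total

# the log likelihood sums the logs of such probabilities over the
# instances; here for a tiny hypothetical sample of attribute values
sample = ["low", "low", "med", "high"]
log_likelihood = sum(math.log(probs[v]) for v in sample)

print(probs)
print(log_likelihood)
```

The probabilities of all values of one attribute sum to 1, and the log likelihood is what Weka reports as the overall quality measure of the clustering.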
2. CLASSIFICATION
A. K-NEAREST NEIGHBOUR (KNN)
(Per-class metrics below; in these runs Recall equals the type-1 rate TPR, and the type-2 rate is the FPR. Wtd. Avg. = weighted average.)

K = 1
  Use training set: Correctly classified 1728 (100 %); Incorrectly classified 0 (0 %)
    Class      Precision  Recall/TPR  FPR
    unacc      1          1           0
    acc        1          1           0
    good       1          1           0
    vgood      1          1           0
    Wtd. Avg.  1          1           0
  Cross-validation (10-fold): Correctly classified 1616 (93.5185 %); Incorrectly classified 112 (6.4815 %)
    Class      Precision  Recall/TPR  FPR
    unacc      0.973      0.998       0.066
    acc        0.818      0.911       0.058
    good       1          0.188       0
    vgood      1          0.708       0
    Wtd. Avg.  0.94       0.935       0.059

K = 5
  Use training set: Correctly classified 1664 (96.2963 %); Incorrectly classified 64 (3.7037 %)
    Class      Precision  Recall/TPR  FPR
    unacc      0.988      1           0.029
    acc        0.883      0.961       0.036
    good       1          0.435       0
    vgood      1          0.846       0
    Wtd. Avg.  0.965      0.963       0.028
  Cross-validation (10-fold): Correctly classified 1616 (93.5185 %); Incorrectly classified 112 (6.4815 %)
    Class      Precision  Recall/TPR  FPR
    unacc      0.973      0.998       0.066
    acc        0.818      0.911       0.058
    good       1          0.188       0
    vgood      1          0.708       0
    Wtd. Avg.  0.94       0.935       0.059

K = 20
  Use training set: Correctly classified 1337 (77.3727 %); Incorrectly classified 391 (22.6273 %)
    Class      Precision  Recall/TPR  FPR
    unacc      0.813      1           0.539
    acc        0.531      0.331       0.083
    good       0          0           0
    vgood      0          0           0
    Wtd. Avg.  0.687      0.774       0.396
  Cross-validation (10-fold): Correctly classified 1327 (76.794 %); Incorrectly classified 401 (23.206 %)
    Class      Precision  Recall/TPR  FPR
    unacc      0.802      1           0.575
    acc        0.528      0.299       0.077
    good       0          0           0
    vgood      1          0.031       0
    Wtd. Avg.  0.717      0.768       0.42
REMARK:
From the table above, all instances were correctly classified on the training set with K = 1, so precision and recall were both 1 for every class: with K = 1 each training instance is its own nearest neighbour. We can see this in the figure below:
IBK 1.0
When we chose different values, K = 5 and K = 20, approximately 4 % and 23 % of the instances were misclassified, which illustrates the effect of noisy instances. What the algorithm does in this case is take the K nearest neighbouring instances and use the majority class among them to classify the unknown point. In the figure below the misclassified points are represented by coloured rectangles.
IBK 1.1
Cross-validation divides the instances into 10 equal-sized folds: in each round 90 % of the instances are used for training and the remaining 10 % for testing, and at the end the performances of the 10 resulting classifiers are averaged. Similar results were obtained for K = 1 and K = 5: approximately 6 % of the instances were misclassified, compared with evaluation on the whole training set. The figure below visualizes the use of cross-validation to evaluate the data.
IBK 1.2
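The 10-fold split described above can be sketched with plain index bookkeeping (Weka additionally stratifies the folds by class, which this sketch omits):

```python
def ten_fold_indices(n, k=10):
    """Yield (train, test) index lists: each fold serves once as the test
    set while the remaining k-1 folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

n = 1728  # size of the car-evaluation data set used above
for train, test in ten_fold_indices(n):
    # every round uses ~90 % of the data for training, ~10 % for testing
    assert len(train) + len(test) == n

print([len(test) for _, test in ten_fold_indices(n)])  # folds of 173 or 172
```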
In summary, this method builds no model; we only use the stored instances to make predictions. As the value of K increases, the percentage of misclassified points also increases; larger values of K mainly help when the instances are noisy. Increasing K drives the accuracy towards the baseline, which for this data set is approximately 70 % (the share of the majority class, unacc). K-NN is a good method, although it is slow, because it has to scan the entire set of training instances before it can make a prediction.
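The majority-vote idea can be sketched in a few lines (toy points and hypothetical class labels, not the car data itself):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Classify x by the majority class among its k nearest training points."""
    d = ((X_train - x) ** 2).sum(axis=1)     # squared Euclidean distances
    nearest = np.argsort(d)[:k]              # indices of the k closest points
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# two hypothetical groups of training points
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], float)
y = np.array(["unacc", "unacc", "unacc", "acc", "acc", "acc"])

print(knn_predict(X, y, np.array([0.2, 0.2]), k=3))  # -> unacc
```

Note that the distance scan over all of X on every query is precisely why the method is slow on large training sets, as remarked above.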
B. DECISION TREE ALGORITHM (ID3 or J48)
ID3 (USE TRAINING SET)
Correctly Classified Instances 1728 (100 %)
Incorrectly Classified Instances 0 (0 %)
ID3 (CROSS VALIDATION, FOLD 10)
Correctly Classified Instances 1544 (89.3519 %)
Incorrectly Classified Instances 61 (3.5301 %)
(the remaining 123 instances were left unclassified by ID3)
=== Confusion Matrix ===
a b c d <-- classified as
1210 0 0 0 | a = unacc
0 384 0 0 | b = acc
0 0 69 0 | c = good
0 0 0 65 | d = vgood
=== Confusion Matrix ===
a b c d <-- classified as
1171 28 3 0 | a = unacc
7 292 9 4 | b = acc
0 5 37 5 | c = good
0 0 0 44 | d = vgood
J48 (USE TRAINING SET)
Correctly Classified Instances 1664 (96.2963 %)
Incorrectly Classified Instances 64 (3.7037 %)
Number of Leaves: 131
Size of the tree: 182
J48 (CROSS VALIDATION)
Correctly Classified Instances 1596 (92.3611 %)
Incorrectly Classified Instances 132 (7.6389 %)
Number of Leaves: 131
Size of the tree: 182
=== Confusion Matrix ===
a b c d <-- classified as
1182 25 3 0 | a = unacc
10 370 2 2 | b = acc
0 9 57 3 | c = good
0 4 6 55 | d = vgood
=== Confusion Matrix ===
a b c d <-- classified as
1164 43 3 0 | a = unacc
33 333 11 7 | b = acc
0 17 42 10 | c = good
0 3 5 57 | d = vgood
ONE R ALGORITHM
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 1210 70.0231 %
Incorrectly Classified Instances 518 29.9769 %
Kappa statistic 0
Mean absolute error 0.1499
Root mean squared error 0.3871
Relative absolute error 65.4574 %
Root relative squared error 114.5023 %
Total Number of Instances 1728
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 1 0.7 1 0.824 0.5 unacc
0 0 0 0 0 0.5 acc
0 0 0 0 0 0.5 good
0 0 0 0 0 0.5 vgood
Weighted Avg. 0.7 0.7 0.49 0.7 0.577 0.5
=== Confusion Matrix ===
a b c d <-- classified as
1210 0 0 0 | a = unacc
384 0 0 0 | b = acc
69 0 0 0 | c = good
65 0 0 0 | d = vgood
PRISM
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 1728 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 1728
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 unacc
1 0 1 1 1 1 acc
1 0 1 1 1 1 good
1 0 1 1 1 1 vgood
Weighted Avg. 1 0 1 1 1 1
=== Confusion Matrix ===
a b c d <-- classified as
1210 0 0 0 | a = unacc
0 384 0 0 | b = acc
0 0 69 0 | c = good
0 0 0 65 | d = vgood
CONCLUSION
The best algorithms in terms of precision are Prism and ID3: on the training set both classified every instance correctly, with no misclassifications. J48 is also a good algorithm, but about 4 % of its training instances were misclassified, so it is less suitable for large data sets. All of these are examples of supervised learning and can be used to support various decisions. The decision trees use entropy as the basis for choosing their splits.
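The entropy criterion mentioned above can be illustrated on this data set's class distribution (1210 unacc, 384 acc, 69 good, 65 vgood). As a split example we use persons=2, whose 576 instances are all unacc according to rule 1 of the Apriori section; the information gain of a split is the parent entropy minus the weighted average entropy of the children.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def info_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

# class distribution of the car data set: unacc, acc, good, vgood
print(entropy([1210, 384, 69, 65]))  # about 1.2 bits

# splitting on persons=2: one child is pure unacc (576 instances),
# the other holds the remaining 1152 instances
print(info_gain([1210, 384, 69, 65], [[576, 0, 0, 0], [634, 384, 69, 65]]))
```

A pure child contributes zero entropy, which is why such splits score a high gain and tend to be chosen near the root of the tree.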
2.3 ASSOCIATION RULES
A. Apriori
=======
Minimum support: 0.1 (173 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 18
Generated sets of large itemsets:
Size of set of large itemsets L(1): 23
Size of set of large itemsets L(2): 52
Size of set of large itemsets L(3): 11
Best rules found:
1. persons=2 576 ==> class=unacc 576 conf:(1)
2. safety=low 576 ==> class=unacc 576 conf:(1)
3. persons=2 lug_boot=small 192 ==> class=unacc 192 conf:(1)
4. persons=2 lug_boot=med 192 ==> class=unacc 192 conf:(1)
5. persons=2 lug_boot=big 192 ==> class=unacc 192 conf:(1)
6. persons=2 safety=low 192 ==> class=unacc 192 conf:(1)
7. persons=2 safety=med 192 ==> class=unacc 192 conf:(1)
8. persons=2 safety=high 192 ==> class=unacc 192 conf:(1)
9. persons=4 safety=low 192 ==> class=unacc 192 conf:(1)
10. persons=more safety=low 192 ==> class=unacc 192 conf:(1)
B. Apriori
=======
Minimum support: 0.1 (173 instances)
Minimum metric <confidence>: 0.5
Number of cycles performed: 18
Generated sets of large itemsets:
Size of set of large itemsets L(1): 23
Large Itemsets L(1):
buying=vhigh 432
buying=high 432
buying=med 432
buying=low 432
maint=vhigh 432
maint=high 432
maint=med 432
maint=low 432
doors=2 432
doors=3 432
18. buying=high 432 ==> class=unacc 324 conf:(0.75)
19. maint=high 432 ==> class=unacc 314 conf:(0.73)
20. doors=3 432 ==> class=unacc 300 conf:(0.69)
21. safety=high class=unacc 277 ==> persons=2 192 conf:(0.69)
22. lug_boot=med 576 ==> class=unacc 392 conf:(0.68)
23. doors=4 432 ==> class=unacc 292 conf:(0.68)
24. doors=5more 432 ==> class=unacc 292 conf:(0.68)
25. lug_boot=big 576 ==> class=unacc 368 conf:(0.64)
26. buying=med 432 ==> class=unacc 268 conf:(0.62)
27. maint=med 432 ==> class=unacc 268 conf:(0.62)
28. maint=low 432 ==> class=unacc 268 conf:(0.62)
29. safety=med 576 ==> class=unacc 357 conf:(0.62)
30. persons=4 class=unacc 312 ==> safety=low 192 conf:(0.62)
31. buying=low 432 ==> class=unacc 258 conf:(0.6)
32. persons=more class=unacc 322 ==> safety=low 192 conf:(0.6)
33. persons=more 576 ==> class=unacc 322 conf:(0.56)
34. persons=4 576 ==> class=unacc 312 conf:(0.54)
35. safety=med class=unacc 357 ==> persons=2 192 conf:(0.54)
36. class=acc 384 ==> safety=high 204 conf:(0.53)
37. lug_boot=big class=unacc 368 ==> persons=2 192 conf:(0.52)
38. lug_boot=big class=unacc 368 ==> safety=low 192 conf:(0.52)
39. class=acc 384 ==> persons=4 198 conf:(0.52)
From the comparison table, the best algorithm depends on the type of problem given to the researcher. If a data set comes with few or no conditions, association rules can be applied to generate the various possible rules or outcomes. The outcomes depend on the confidence limit and on the number of rules to be generated, so this approach suits unsupervised learning.
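The confidence values reported above follow directly from the itemset counts. As a sketch, for rule 1 (persons=2 ==> class=unacc):

```python
# counts taken from rule 1 in the Apriori output above
n_total = 1728       # instances in the car data set
n_antecedent = 576   # instances with persons=2
n_both = 576         # instances with persons=2 AND class=unacc

support = n_both / n_total          # fraction of all instances covered
confidence = n_both / n_antecedent  # P(class=unacc | persons=2)

print(round(support, 2), confidence)  # 0.33 1.0
```

A rule is kept only if its support and confidence clear the minimum thresholds set in the run (here 0.1 and 0.9 or 0.5 respectively).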
Conversely, the Prism rule learner is good when some sets of conditions are given. It performs well because it minimizes the errors and puts forward the best alternatives, but it is limited when it comes to complex decision-making processes.
J48
This decision making process uses tree method. It is not the best Ideal method because of the
divide and conquer rule it uses. It can’t be used in complex decision making process. Many
errors will be generated in the classification process. It is also an example of Supervised
Learning or Classification Algorithm.