MINISTRY OF EDUCATION, YOUTHS, SCIENCE AND SPORTS FRANCE 
École Internationale des Sciences du Traitement de 
l'Information 
Course: DATA MINING 
Course Work on CLUSTERS 
DEPARTMENT: BUSINESS ANALYTICS (M2) 
COURSE TUTOR: PROF. MARIA MALEK 
Submitted by: UDEH TOCHUKWU LIVINUS 
Cergy Pontoise - 2014
1A. EVALUATING CAR ACCEPTABILITY USING THE K-MEANS ALGORITHM 
K-MEANS, K = 3

Seed = 10:
  Within cluster sum of squared errors: 5709.0
  Incorrectly clustered instances: 807.0 (46.7014 %)
  Cluster 0 (unacc): 774 (45 %)
  Cluster 1 (acc):   600 (35 %)
  Cluster 2 (good):  354 (20 %)

Seed = 100:
  Within cluster sum of squared errors: 5547.0
  Incorrectly clustered instances: 1020.0 (59.0278 %)
  Cluster 0 (acc):   813 (47 %)
  Cluster 1 (unacc): 555 (32 %)
  Cluster 2 (vgood): 360 (21 %)
K = 4

Seed = 10:
  Within cluster sum of squared errors: 5390.0
  Incorrectly clustered instances: 979.0 (56.6551 %)
  Cluster 0 (unacc): 592 (34 %)
  Cluster 1 (acc):   557 (32 %)
  Cluster 2 (good):  327 (19 %)
  Cluster 3 (vgood): 252 (15 %)

Seed = 100:
  Within cluster sum of squared errors: 5316.0
  Incorrectly clustered instances: 1093.0 (63.2523 %)
  Cluster 0 (acc):   697 (40 %)
  Cluster 1 (unacc): 496 (29 %)
  Cluster 2 (vgood): 346 (20 %)
  Cluster 3 (good):  189 (11 %)
K = 5

Seed = 10:
  Within cluster sum of squared errors: 5106.0
  Incorrectly clustered instances: 1064.0 (61.5741 %)
  Cluster 0 (unacc):    543 (31 %)
  Cluster 1 (acc):      430 (25 %)
  Cluster 2 (good):     302 (17 %)
  Cluster 3 (no class): 227 (13 %)
  Cluster 4 (vgood):    226 (13 %)

Seed = 100:
  Within cluster sum of squared errors: 4996.0
  Incorrectly clustered instances: 1162.0 (67.2454 %)
  Cluster 0 (acc):      586 (34 %)
  Cluster 1 (unacc):    424 (25 %)
  Cluster 2 (vgood):    310 (18 %)
  Cluster 3 (no class): 174 (10 %)
  Cluster 4 (good):     234 (14 %)
ANALYSIS OF THE RESULT: 
This algorithm minimizes the total squared distance from the instances to their 
cluster centres, but it converges to a local rather than a global minimum, so we 
tend to get different results when we vary the seed. From the table above, at 
K = 3 we obtained a lower within-cluster sum of squared errors with seed = 100 
than with seed = 10. However, there is an inverse relationship with the number 
of incorrectly clustered instances, as the table shows: while the squared 
distances decrease, the number of incorrectly clustered instances increases, and 
vice versa. This is because at each iteration the algorithm only compares the 
similarity of the clustered instances and takes no account of the class values 
or any other conditions. 
FIGURE 1.0 
The figure below shows the clusters and instances from the table above. The 
Y-axis represents the class value, while the X-axis represents the instance 
number. The colour represents the cluster, so we can inspect each cluster by 
selecting instances from the menu and compare their similarity before 
validating our decisions. 
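The seed-sensitivity described above can be sketched with scikit-learn's KMeans standing in for Weka's SimpleKMeans; the data here are a hypothetical numeric stand-in for the car attributes, not the coursework dataset, and the SSE values will not match the table.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))  # hypothetical stand-in data

sse = {}
for seed in (10, 100):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    sse[seed] = km.inertia_  # within-cluster sum of squared errors
    print(f"seed={seed}  SSE={sse[seed]:.1f}")
```

Because each run starts from seed-dependent initial centres and converges to a local minimum, the two SSE values generally differ, just as the 5709/5547 pair above does.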
1 B. ESTIMATING CLUSTERS USING EXPECTATION MAXIMIZATION (EM CLUSTERING) 
K = 3; Seed = 10
Incorrectly clustered instances: 519.0 (30.0347 %)

Cluster:
Attribute      0         1         2
             (0.51)    (0.29)    (0.2)
=======================================
buying
  vhigh    211.4062  134.0433   89.5505
  high     221.0605  127.5401   86.3994
  med      230.866   120.6598   83.4743
  low      221.0605  127.5401   86.3994
  [total]  884.3931  509.7833  345.8236
maint
  vhigh    220.5171  127.8878   86.595

Clustered Instances
  0  1725 (100%)
  1     3 (  0%)
Log likelihood: -7.45474
Class attribute: class
Classes to Clusters:
    0   1   <-- assigned to cluster
 1208   2 | unacc
  383   1 | acc
   69   0 | good
   65   0 | vgood
Cluster 0 <-- unacc
Cluster 1 <-- acc

K = 3; Seed = 100
Incorrectly clustered instances: 545.0 (31.5394 %)

Cluster:
Attribute      0         1         2
             (0.48)    (0.32)    (0.2)
=======================================
buying
  vhigh    209.4287  139.8151   85.7562
  high     198.5154  146.6941   89.7905
  med      209.4287  139.8151   85.7562
  low      220.8453  132.7588   81.3959
  [total]  838.2182  559.083   342.6988
maint
  vhigh    204.4103  138.1582   92.4315

Clustered Instances
  0  1699 ( 98%)
  1    29 (  2%)
Log likelihood: -7.45474
Class attribute: class
Classes to Clusters:
    0   1   <-- assigned to cluster
 1182  28 | unacc
  383   1 | acc
   69   0 | good
   65   0 | vgood
Cluster 0 <-- unacc
Cluster 1 <-- acc
K = 4; Seed = 10
Incorrectly clustered instances: 556.0 (32.1759 %)

Cluster:
Attribute      0         1         2         3
             (0.41)    (0.26)    (0.18)    (0.16)
=================================================
buying
  vhigh    171.8456  115.3853   81.9122   66.8569
  high     175.9113  111.256    79.4969   69.3358
  med      182.9562  108.0125   77.5667   67.4646
  low      173.1743  110.3401   79.4592   73.0264
  [total]  703.8874  444.9938  318.435   276.6838

Clustered Instances
  0  1616 ( 94%)
  1   112 (  6%)
Log likelihood: -7.45474
Class attribute: class
Classes to Clusters:
    0   1   <-- assigned to cluster
 1140  70 | unacc
  352  32 | acc
   59  10 | good
   65   0 | vgood
Cluster 0 <-- unacc
Cluster 1 <-- acc

K = 4; Seed = 100
Incorrectly clustered instances: 528.0 (30.5556 %)

Cluster:
Attribute      0         1         2         3
             (0.42)    (0.18)    (0.26)    (0.14)
=================================================
buying
  vhigh    179.4064   76.6192  118.9071   61.0673
  high     183.026    77.5366  112.1751   63.2622
  med      189.4165   80.2604  106.0872   60.2359
  low      183.1553   75.7134  110.2891   66.8422
  [total]  735.0042  310.1296  447.4586  251.4076

Clustered Instances
  0  1718 ( 99%)
  2    10 (  1%)
Log likelihood: -7.45474
Class attribute: class
Classes to Clusters:
    0   2   <-- assigned to cluster
 1200  10 | unacc
  384   0 | acc
   69   0 | good
   65   0 | vgood
Cluster 0 <-- unacc
Cluster 2 <-- No class
ANALYSIS OF THIS ALGORITHM: 
This algorithm takes a probabilistic approach to clustering: it uses expectation 
maximization to assign instances to clusters. In the tables above, each 
attribute value has an expected count assigned to it; we can obtain the 
probability of each value by dividing it by the total of all the values for that 
attribute, and from these we can calculate the probability of each cluster. EM 
also uses an overall quality measure known as the log likelihood. For nominal 
attributes the model stores the probability of each value, while for numeric 
attributes it stores the mean and standard deviation. Like K-means, it is an 
unsupervised learning algorithm, but when we compare the two we find that the 
number of incorrectly clustered instances is lower for EM than for K-means. 
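The soft, probability-based assignment described above can be illustrated with scikit-learn's GaussianMixture, which runs the same expectation-maximization loop and reports a log-likelihood. This is a rough analogue only: Weka's EM handles the nominal car attributes directly, while this sketch uses hypothetical numeric data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # hypothetical two-cluster data
               rng.normal(4, 1, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=10).fit(X)
probs = gm.predict_proba(X)        # probabilistic (soft) cluster membership
loglik = gm.score(X) * len(X)      # total log-likelihood, EM's quality measure
print(round(loglik, 2))
```

Each row of `probs` sums to 1: unlike K-means, every instance belongs to every cluster with some probability, and the log-likelihood plays the role of the -7.45474 figure in the Weka output.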
2 CLASSIFICATION 
A. K-NEAREST NEIGHBOUR (KNN) 
Results for IBk, comparing evaluation on the training set with 10-fold 
cross-validation. Recall equals the true-positive rate (Type 1, TPR); Type 2 is 
the false-positive rate (FPR). Per-class values are listed in the order unacc, 
acc, good, vgood, weighted average.

K = 1
  Use training set:
    Correctly Classified Instances   1728  100 %
    Incorrectly Classified Instances    0    0 %
    Precision:    1, 1, 1, 1, 1
    Recall (TPR): 1, 1, 1, 1, 1
    FPR:          0, 0, 0, 0, 0
  Cross-validation (10-fold):
    Correctly Classified Instances   1616  93.5185 %
    Incorrectly Classified Instances  112   6.4815 %
    Precision:    0.973, 0.818, 1, 1, 0.94
    Recall (TPR): 0.998, 0.911, 0.188, 0.708, 0.935
    FPR:          0.066, 0.058, 0, 0, 0.059

K = 5
  Use training set:
    Correctly Classified Instances   1664  96.2963 %
    Incorrectly Classified Instances   64   3.7037 %
    Precision:    0.988, 0.883, 1, 1, 0.965
    Recall (TPR): 1, 0.961, 0.435, 0.846, 0.963
    FPR:          0.029, 0.036, 0, 0, 0.028
  Cross-validation (10-fold):
    Correctly Classified Instances   1616  93.5185 %
    Incorrectly Classified Instances  112   6.4815 %
    Precision:    0.973, 0.818, 1, 1, 0.94
    Recall (TPR): 0.998, 0.911, 0.188, 0.708, 0.935
    FPR:          0.066, 0.058, 0, 0, 0.059

K = 20
  Use training set:
    Correctly Classified Instances   1337  77.3727 %
    Incorrectly Classified Instances  391  22.6273 %
    Precision:    0.813, 0.531, 0, 0, 0.687
    Recall (TPR): 1, 0.331, 0, 0, 0.774
    FPR:          0.539, 0.083, 0, 0, 0.396
  Cross-validation (10-fold):
    Correctly Classified Instances   1327  76.794 %
    Incorrectly Classified Instances  401  23.206 %
    Precision:    0.802, 0.528, 0, 1, 0.717
    Recall (TPR): 1, 0.299, 0, 0.031, 0.768
    FPR:          0.575, 0.077, 0, 0, 0.42
REMARK: 
From the table above, with K = 1 on the training set every instance was 
correctly classified, so precision and recall were both 1. This is expected, 
since each training instance is its own nearest neighbour. We can see it in the 
figure below: 
IBK 1.0 
When we choose different values, K = 5 and K = 20, approximately 4 % and 23 % of 
the instances are misclassified, respectively. This illustrates the effect of 
noisy instances: with K = 5, we take the five neighbouring instances and use the 
majority class among them to classify the unknown point. In the figure below the 
misclassified points are represented by coloured rectangles. 
IBK 1.1 
Applying the cross-validation procedure divides the instances into 10 
equal-sized sets, so that in each fold 90 % of the instances are used for 
training and 10 % for testing; at the end it averages the performance of the 10 
classifiers produced from the 10 folds. Similar results were obtained for K = 1 
and K = 5: we had approximately 6 % misclassified instances, compared with 
evaluation on the whole training set. The figure below shows a visual diagram of 
evaluating the data using cross-validation. 
IBK 1.2 
In summary, there is no model in this method; we only use the stored training 
instances to make our predictions. As the value of K increases, the percentage 
of misclassified points also increases, although accuracy improves with larger 
K when the instances are noisy. Increasing K eventually drives the classifier 
towards the baseline; for this dataset the baseline corresponds to roughly 70 % 
accuracy (about 30 % error). K-nearest neighbour is a good method, although it 
is very slow, because it has to scan the entire set of training instances 
before it can make a prediction. 
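The training-set versus 10-fold cross-validation comparison above can be sketched with scikit-learn's KNeighborsClassifier in place of Weka's IBk. The dataset here is a hypothetical synthetic one, so the accuracies will not match the coursework numbers; only the pattern (perfect 1-NN training accuracy, degradation at large K) carries over.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data; the coursework used the 1728-instance car dataset.
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           random_state=0)

results = {}
for k in (1, 5, 20):
    clf = KNeighborsClassifier(n_neighbors=k)
    train_acc = clf.fit(X, y).score(X, y)               # evaluate on training set
    cv_acc = cross_val_score(clf, X, y, cv=10).mean()   # 10-fold cross-validation
    results[k] = (train_acc, cv_acc)
    print(f"K={k:2d}  train={train_acc:.3f}  cv={cv_acc:.3f}")
```

With K = 1 the training accuracy is exactly 1.0 (each point is its own nearest neighbour), while cross-validation gives the more honest estimate, mirroring the 100 % versus 93.5 % split in the table.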
B. DECISION TREE ALGORITHM (ID3 or J48) 
ID3 (USE TRAINING SET):
  Correctly Classified Instances   1728  100 %
  Incorrectly Classified Instances    0    0 %

  === Confusion Matrix ===
     a    b   c   d   <-- classified as
  1210    0   0   0 |  a = unacc
     0  384   0   0 |  b = acc
     0    0  69   0 |  c = good
     0    0   0  65 |  d = vgood

ID3 (CROSS VALIDATION, 10-FOLD):
  Correctly Classified Instances   1544  89.3519 %
  Incorrectly Classified Instances   61   3.5301 %

  === Confusion Matrix ===
     a    b   c   d   <-- classified as
  1171   28   3   0 |  a = unacc
     7  292   9   4 |  b = acc
     0    5  37   5 |  c = good
     0    0   0  44 |  d = vgood
J48 (USE TRAINING SET):
  Correctly Classified Instances   1664  96.2963 %
  Incorrectly Classified Instances   64   3.7037 %
  Number of Leaves: 131
  Size of the tree: 182

  === Confusion Matrix ===
     a    b   c   d   <-- classified as
  1182   25   3   0 |  a = unacc
    10  370   2   2 |  b = acc
     0    9  57   3 |  c = good
     0    4   6  55 |  d = vgood

J48 (CROSS VALIDATION, 10-FOLD):
  Correctly Classified Instances   1596  92.3611 %
  Incorrectly Classified Instances  132   7.6389 %
  Number of Leaves: 131
  Size of the tree: 182

  === Confusion Matrix ===
     a    b   c   d   <-- classified as
  1164   43   3   0 |  a = unacc
    33  333  11   7 |  b = acc
     0   17  42  10 |  c = good
     0    3   5  57 |  d = vgood
ONE R ALGORITHM 
=== Evaluation on training set === 
=== Summary === 
Correctly Classified Instances 1210 70.0231 % 
Incorrectly Classified Instances 518 29.9769 % 
Kappa statistic 0 
Mean absolute error 0.1499 
Root mean squared error 0.3871 
Relative absolute error 65.4574 % 
Root relative squared error 114.5023 %
Total Number of Instances 1728 
=== Detailed Accuracy By Class === 
TP Rate FP Rate Precision Recall F-Measure ROC Area Class 
1 1 0.7 1 0.824 0.5 unacc 
0 0 0 0 0 0.5 acc 
0 0 0 0 0 0.5 good 
0 0 0 0 0 0.5 vgood 
Weighted Avg. 0.7 0.7 0.49 0.7 0.577 0.5 
=== Confusion Matrix === 
a b c d <-- classified as 
1210 0 0 0 | a = unacc 
384 0 0 0 | b = acc 
69 0 0 0 | c = good 
65 0 0 0 | d = vgood 
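The OneR result above collapses to the majority-class baseline: the confusion matrix shows it predicts unacc for every instance, so its accuracy is simply the share of the majority class. A quick check in plain Python, using the class counts from the matrix:

```python
# Class counts taken from the OneR confusion matrix above.
counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}
total = sum(counts.values())                 # 1728 instances
baseline_acc = max(counts.values()) / total  # accuracy of always predicting unacc
print(f"{baseline_acc:.4%}")
```

This reproduces the 70.0231 % correctly-classified figure reported in the summary.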
PRISM 
=== Evaluation on training set === 
=== Summary === 
Correctly Classified Instances 1728 100 % 
Incorrectly Classified Instances 0 0 %
Kappa statistic 1 
Mean absolute error 0 
Root mean squared error 0 
Relative absolute error 0 % 
Root relative squared error 0 % 
Total Number of Instances 1728 
=== Detailed Accuracy By Class === 
TP Rate FP Rate Precision Recall F-Measure ROC Area Class 
1 0 1 1 1 1 unacc 
1 0 1 1 1 1 acc 
1 0 1 1 1 1 good 
1 0 1 1 1 1 vgood 
Weighted Avg. 1 0 1 1 1 1 
=== Confusion Matrix === 
a b c d <-- classified as 
1210 0 0 0 | a = unacc 
0 384 0 0 | b = acc 
0 0 69 0 | c = good 
0 0 0 65 | d = vgood 
CONCLUSION 
The best algorithms in terms of precision are Prism and ID3: on the training set 
both classified every instance exactly, with no instances misclassified. J48 is 
also a good algorithm, but it is less suitable for large datasets, as we saw 
about 4 % of instances misclassified. All of these are examples of supervised 
learning and can be used in making various decisions. The decision trees use 
entropy as the basis for making their decisions. 
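The entropy the trees split against can be checked from the class counts in the confusion matrices above (1210 unacc, 384 acc, 69 good, 65 vgood); a small sketch:

```python
import math

# Entropy of the class distribution, the quantity ID3/J48 reduce at each split.
counts = [1210, 384, 69, 65]   # unacc, acc, good, vgood (1728 instances)
total = sum(counts)
entropy = -sum((c / total) * math.log2(c / total) for c in counts)
print(f"{entropy:.3f} bits")
```

The result is roughly 1.2 bits out of a possible 2 for four classes, reflecting how strongly the dataset is dominated by the unacc class.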
2.3 ASSOCIATION RULES 
A. Apriori 
======= 
Minimum support: 0.1 (173 instances) 
Minimum metric <confidence>: 0.9 
Number of cycles performed: 18 
Generated sets of large itemsets: 
Size of set of large itemsets L(1): 23 
Size of set of large itemsets L(2): 52 
Size of set of large itemsets L(3): 11 
Best rules found: 
1. persons=2 576 ==> class=unacc 576 conf:(1) 
2. safety=low 576 ==> class=unacc 576 conf:(1) 
3. persons=2 lug_boot=small 192 ==> class=unacc 192 conf:(1) 
4. persons=2 lug_boot=med 192 ==> class=unacc 192 conf:(1) 
5. persons=2 lug_boot=big 192 ==> class=unacc 192 conf:(1) 
6. persons=2 safety=low 192 ==> class=unacc 192 conf:(1)
7. persons=2 safety=med 192 ==> class=unacc 192 conf:(1) 
8. persons=2 safety=high 192 ==> class=unacc 192 conf:(1) 
9. persons=4 safety=low 192 ==> class=unacc 192 conf:(1) 
10. persons=more safety=low 192 ==> class=unacc 192 conf:(1) 
B. Apriori 
======= 
Minimum support: 0.1 (173 instances) 
Minimum metric <confidence>: 0.5 
Number of cycles performed: 18 
Generated sets of large itemsets: 
Size of set of large itemsets L(1): 23 
Large Itemsets L(1): 
buying=vhigh 432 
buying=high 432 
buying=med 432 
buying=low 432 
maint=vhigh 432 
maint=high 432 
maint=med 432 
maint=low 432 
doors=2 432 
doors=3 432
doors=4 432 
doors=5more 432 
persons=2 576 
persons=4 576 
persons=more 576 
lug_boot=small 576 
lug_boot=med 576 
lug_boot=big 576 
safety=low 576 
safety=med 576 
safety=high 576 
class=unacc 1210 
class=acc 384 
Size of set of large itemsets L(2): 52 
Large Itemsets L(2): 
buying=vhigh class=unacc 360 
buying=high class=unacc 324 
buying=med class=unacc 268 
buying=low class=unacc 258 
maint=vhigh class=unacc 360 
maint=high class=unacc 314 
maint=med class=unacc 268 
maint=low class=unacc 268 
doors=2 class=unacc 326
doors=3 class=unacc 300 
doors=4 class=unacc 292 
doors=5more class=unacc 292 
persons=2 lug_boot=small 192 
persons=2 lug_boot=med 192 
persons=2 lug_boot=big 192 
persons=2 safety=low 192 
persons=2 safety=med 192 
persons=2 safety=high 192 
persons=2 class=unacc 576 
persons=4 lug_boot=small 192 
persons=4 lug_boot=med 192 
persons=4 lug_boot=big 192 
persons=4 safety=low 192 
persons=4 safety=med 192 
persons=4 safety=high 192 
persons=4 class=unacc 312 
persons=4 class=acc 198 
persons=more lug_boot=small 192 
persons=more lug_boot=med 192 
persons=more lug_boot=big 192 
persons=more safety=low 192 
persons=more safety=med 192 
persons=more safety=high 192 
persons=more class=unacc 322 
persons=more class=acc 186
lug_boot=small safety=low 192 
lug_boot=small safety=med 192 
lug_boot=small safety=high 192 
lug_boot=small class=unacc 450 
lug_boot=med safety=low 192 
lug_boot=med safety=med 192 
lug_boot=med safety=high 192 
lug_boot=med class=unacc 392 
lug_boot=big safety=low 192 
lug_boot=big safety=med 192 
lug_boot=big safety=high 192 
lug_boot=big class=unacc 368 
safety=low class=unacc 576 
safety=med class=unacc 357 
safety=med class=acc 180 
safety=high class=unacc 277 
safety=high class=acc 204 
Size of set of large itemsets L(3): 11 
Large Itemsets L(3): 
persons=2 lug_boot=small class=unacc 192 
persons=2 lug_boot=med class=unacc 192 
persons=2 lug_boot=big class=unacc 192 
persons=2 safety=low class=unacc 192 
persons=2 safety=med class=unacc 192
persons=2 safety=high class=unacc 192 
persons=4 safety=low class=unacc 192 
persons=more safety=low class=unacc 192 
lug_boot=small safety=low class=unacc 192 
lug_boot=med safety=low class=unacc 192 
lug_boot=big safety=low class=unacc 192 
Best rules found: 
1. persons=2 576 ==> class=unacc 576 conf:(1) 
2. safety=low 576 ==> class=unacc 576 conf:(1) 
3. persons=2 lug_boot=small 192 ==> class=unacc 192 conf:(1) 
4. persons=2 lug_boot=med 192 ==> class=unacc 192 conf:(1) 
5. persons=2 lug_boot=big 192 ==> class=unacc 192 conf:(1) 
6. persons=2 safety=low 192 ==> class=unacc 192 conf:(1) 
7. persons=2 safety=med 192 ==> class=unacc 192 conf:(1) 
8. persons=2 safety=high 192 ==> class=unacc 192 conf:(1) 
9. persons=4 safety=low 192 ==> class=unacc 192 conf:(1) 
10. persons=more safety=low 192 ==> class=unacc 192 conf:(1) 
11. lug_boot=small safety=low 192 ==> class=unacc 192 conf:(1) 
12. lug_boot=med safety=low 192 ==> class=unacc 192 conf:(1) 
13. lug_boot=big safety=low 192 ==> class=unacc 192 conf:(1) 
14. buying=vhigh 432 ==> class=unacc 360 conf:(0.83) 
15. maint=vhigh 432 ==> class=unacc 360 conf:(0.83) 
16. lug_boot=small 576 ==> class=unacc 450 conf:(0.78) 
17. doors=2 432 ==> class=unacc 326 conf:(0.75)
18. buying=high 432 ==> class=unacc 324 conf:(0.75) 
19. maint=high 432 ==> class=unacc 314 conf:(0.73) 
20. doors=3 432 ==> class=unacc 300 conf:(0.69) 
21. safety=high class=unacc 277 ==> persons=2 192 conf:(0.69) 
22. lug_boot=med 576 ==> class=unacc 392 conf:(0.68) 
23. doors=4 432 ==> class=unacc 292 conf:(0.68) 
24. doors=5more 432 ==> class=unacc 292 conf:(0.68) 
25. lug_boot=big 576 ==> class=unacc 368 conf:(0.64) 
26. buying=med 432 ==> class=unacc 268 conf:(0.62) 
27. maint=med 432 ==> class=unacc 268 conf:(0.62) 
28. maint=low 432 ==> class=unacc 268 conf:(0.62) 
29. safety=med 576 ==> class=unacc 357 conf:(0.62) 
30. persons=4 class=unacc 312 ==> safety=low 192 conf:(0.62) 
31. buying=low 432 ==> class=unacc 258 conf:(0.6) 
32. persons=more class=unacc 322 ==> safety=low 192 conf:(0.6) 
33. persons=more 576 ==> class=unacc 322 conf:(0.56) 
34. persons=4 576 ==> class=unacc 312 conf:(0.54) 
35. safety=med class=unacc 357 ==> persons=2 192 conf:(0.54) 
36. class=acc 384 ==> safety=high 204 conf:(0.53) 
37. lug_boot=big class=unacc 368 ==> persons=2 192 conf:(0.52) 
38. lug_boot=big class=unacc 368 ==> safety=low 192 conf:(0.52) 
39. class=acc 384 ==> persons=4 198 conf:(0.52) 
From the comparison table, the best algorithm depends on the type of problem 
given to the researcher. If a dataset comes with few or no conditions, 
association rules can be applied to generate the various possible rules or 
outcomes. The outcomes depend on the confidence threshold and the number of 
rules to be generated, which makes this approach well suited to unsupervised 
learning. 
Conversely, the Prism rule learner is good when some sets of conditions are 
given. It performs best here because it reduces the error to its lowest 
proportion and returns the best possible alternatives, but it is limited when it 
comes to complex decision-making processes. 
J48 
This method builds a decision tree using a divide-and-conquer rule, which makes 
it less than ideal for complex decision-making processes, where many errors will 
be generated during classification. It is also an example of a supervised 
learning (classification) algorithm. 

Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 

Recently uploaded (20)

Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 

DATA MINING - EVALUATING CLUSTERING ALGORITHM

  • 1. MINISTRY OF EDUCATION, YOUTH, SCIENCE AND SPORTS, FRANCE. École Internationale des Sciences du Traitement de l'Information. Course: DATA MINING. Course Work on CLUSTERS. DEPARTMENT: BUSINESS ANALYTICS (M2). COURSE TUTOR: PROF. MARIA MALEK. Submitted by: UDEH TOCHUKWU LIVINUS. Cergy-Pontoise, 2014
  • 2. 1A. EVALUATING CAR ACCESSIBILITY USING THE K-MEANS ALGORITHM
  K = 3, Seed 1 = 10: Within cluster sum of squared errors: 5709.0; Incorrectly clustered instances: 807.0 (46.7014 %). Cluster 0 (unacc): 774 (45 %); Cluster 1 (acc): 600 (35 %); Cluster 2 (good): 354 (20 %).
  K = 3, Seed 2 = 100: Within cluster sum of squared errors: 5547.0; Incorrectly clustered instances: 1020.0 (59.0278 %). Cluster 0 (acc): 813 (47 %); Cluster 1 (unacc): 555 (32 %); Cluster 2 (vgood): 360 (21 %).
  K = 4, Seed 1 = 10: Within cluster sum of squared errors: 5390.0; Incorrectly clustered instances: 979.0 (56.6551 %). Cluster 0 (unacc): 592 (34 %); Cluster 1 (acc): 557 (32 %); Cluster 2 (good): 327 (19 %); Cluster 3 (vgood): 252 (15 %).
  K = 4, Seed 2 = 100: Within cluster sum of squared errors: 5316.0; Incorrectly clustered instances: 1093.0 (63.2523 %). Cluster 0 (acc): 697 (40 %); Cluster 1 (unacc): 496 (29 %); Cluster 2 (vgood): 346 (20 %); Cluster 3 (good): 189 (11 %).
  K = 5, Seed 1 = 10: Within cluster sum of squared errors: 5106.0; Incorrectly clustered instances: 1064.0 (61.5741 %). (Cluster breakdown continues on the next slide.)
  • 3. K = 5, Seed 1 = 10 (continued): Cluster 0 (unacc): 543 (31 %); Cluster 1 (acc): 430 (25 %); Cluster 2 (good): 302 (17 %); Cluster 3 (no class): 227 (13 %); Cluster 4 (vgood): 226 (13 %).
  K = 5, Seed 2 = 100: Within cluster sum of squared errors: 4996.0; Incorrectly clustered instances: 1162.0 (67.2454 %). Cluster 0 (acc): 586 (34 %); Cluster 1 (unacc): 424 (25 %); Cluster 2 (vgood): 310 (18 %); Cluster 3 (no class): 174 (10 %); Cluster 4 (good): 234 (14 %).
  ANALYSIS OF THE RESULT: This algorithm minimizes the total squared distance from the instances to their cluster centres, but it reaches only a local, not a global, minimum, so we tend to get different results when we vary the seed. From the table above at K = 3, we obtained a smaller sum of squared errors when the seed was set to 100 than when it was 10. However, there is an inverse relation with the number of incorrectly clustered instances: as the within-cluster squared error falls, the proportion of incorrectly clustered instances rises, and vice versa. K-Means simply compares the similarity of the clustered instances at each iteration and does not take the class values or any other conditions into account.
  FIGURE 1.0 The figure below depicts the clusters and instances described in the table above. The Y-axis represents the class value and the X-axis the instance number; colour indicates the cluster, so we can select each instance from the menu and compare its similarity before validating our decisions.
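The seed sensitivity described above can be reproduced outside Weka. Below is a minimal pure-Python sketch of Lloyd's K-Means on hypothetical 1-D numeric data (not the car dataset, whose attributes are nominal); the data and function name are illustrative only. Running it with two different seeds shows how the algorithm can settle in different local minima with different within-cluster sums of squared errors (WCSS).

```python
import random

def kmeans_1d(points, k, seed, iters=25):
    """Lloyd's algorithm on 1-D data; returns (centroids, WCSS)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # seed-dependent initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            nearest = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]  # update step
                     for i, c in enumerate(clusters)]
    wcss = sum(min((p - m) ** 2 for m in centroids) for p in points)
    return centroids, wcss

data = [1.0, 1.2, 0.8, 4.9, 5.0, 5.2, 8.9, 9.0, 9.1, 1.1]
for seed in (10, 100):
    _, wcss = kmeans_1d(data, k=3, seed=seed)
    print(f"seed={seed}: WCSS={wcss:.3f}")
```

Because only the initial centroids depend on the seed, differing WCSS values across seeds are direct evidence of convergence to different local minima.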
  • 4. (Figure 1.0: Weka visualisation of the K-Means cluster assignments; image not preserved.)
  • 5. 1 B. ESTIMATING CLUSTERS USING EXPECTATION MAXIMIZATION (EM CLUSTERING) K = 3; Seed = 10 Incorrectly clustered instances : 519.0 30.0347 % K = 3; Seed = 100 Incorrectly clustered instances : 545.0 31.5394 % Cluster Attribute 0 1 2 (0.51) (0.29) (0.2) ======================================= buying vhigh 211.4062 134.0433 89.5505 high 221.0605 127.5401 86.3994 med 230.866 120.6598 83.4743 low 221.0605 127.5401 86.3994 [total] 884.3931 509.7833 345.8236 maint Cluster Attribute 0 1 2 (0.48) (0.32) (0.2) ====================================== buying vhigh 209.4287 139.8151 85.7562 high 198.5154 146.6941 89.7905 med 209.4287 139.8151 85.7562 low 220.8453 132.7588 81.3959 [total] 838.2182 559.083 342.6988 maint
  • 6. vhigh 220.5171 127.8878 86.595 Clustered Instances 0 1725 (100%) 1 3 ( 0%) Log likelihood: -7.45474 Class attribute: class Classes to Clusters: 0 1 <-- assigned to cluster 1208 2 | unacc 383 1 | acc 69 0 | good 65 0 | vgood Cluster 0 <-- unacc Cluster 1 <-- acc vhigh 204.4103 138.1582 92.4315 Clustered Instances 0 1699 ( 98%) 1 29 ( 2%) Log likelihood: -7.45474 Class attribute: class Classes to Clusters: 0 1 <-- assigned to cluster 1182 28 | unacc 383 1 | acc 69 0 | good 65 0 | vgood Cluster 0 <-- unacc Cluster 1 <-- acc K = 4; Seed = 10 Incorrectly clustered instances : 556.0 32.1759 % K = 4; Seed = 100 Incorrectly clustered instances : 528.0 30.5556 % Cluster Attribute 0 1 2 3 (0.41) (0.26) (0.18) (0.16) ============================================= buying vhigh 171.8456 115.3853 81.9122 66.8569 high 175.9113 111.256 79.4969 69.3358 med 182.9562 108.0125 77.5667 67.4646 low 173.1743 110.3401 79.4592 73.0264 [total] 703.8874 444.9938 318.435 276.6838 Cluster Attribute 0 1 2 3 (0.42) (0.18) (0.26) (0.14) ========================================== buying vhigh 179.4064 76.6192 118.9071 61.0673 high 183.026 77.5366 112.1751 63.2622 med 189.4165 80.2604 106.0872 60.2359 low 183.1553 75.7134 110.2891 66.8422 [total] 735.0042 310.1296 447.4586 251.4076
  • 7. Clustered Instances 0 1616 ( 94%) 1 112 ( 6%) Log likelihood: -7.45474 Class attribute: class Classes to Clusters: 0 1 <-- assigned to cluster 1140 70 | unacc 352 32 | acc 59 10 | good 65 0 | vgood Cluster 0 <-- unacc Cluster 1 <-- acc Clustered Instances 0 1718 ( 99%) 2 10 ( 1%) Log likelihood: -7.45474 Class attribute: class Classes to Clusters: 0 2 <-- assigned to cluster 1200 10 | unacc 384 0 | acc 69 0 | good 65 0 | vgood Cluster 0 <-- unacc Cluster 2 <-- No class ANALYSIS OF THIS ALGORITHM: This algorithm takes a probabilistic approach to clustering: expectation maximization assigns each instance a probability of membership in every cluster. In the tables, each attribute value carries an expected count; dividing a value's count by the total for its column gives the probability of that value within the cluster, and from these the probability of each cluster can be computed. The overall quality measure is the log likelihood. For nominal attributes the model stores the probability of each value, while for numeric attributes it stores the mean and standard deviation. EM is likewise an unsupervised learning algorithm. Compared with K-Means, the proportion of incorrectly classified instances is noticeably lower.
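The normalisation step described above can be shown directly. The sketch below takes the expected counts for the buying attribute from the K = 3, seed = 10 table and divides each count by its cluster's column total, giving P(value | cluster); within each cluster the probabilities of the attribute's values sum to 1.

```python
# Expected counts for 'buying' per cluster (K = 3, seed = 10 table above).
counts = {
    "vhigh": [211.4062, 134.0433, 89.5505],
    "high":  [221.0605, 127.5401, 86.3994],
    "med":   [230.8660, 120.6598, 83.4743],
    "low":   [221.0605, 127.5401, 86.3994],
}

# Per-cluster totals; these should match the [total] row in the Weka output
# (884.3931, 509.7833, 345.8236, up to rounding).
totals = [sum(col) for col in zip(*counts.values())]

# P(value | cluster) = expected count / cluster total.
probs = {value: [c / t for c, t in zip(cols, totals)]
         for value, cols in counts.items()}

for value, p in probs.items():
    print(value, [round(x, 3) for x in p])
```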
  • 8. 2 CLASSIFICATION A. K-NEAREST NEIGHBOUR (KNN) S/N USE TRAINING SET CROSS VALIDATION (10 FOLD) PRECISION RECALL ERROR PRECISION RECALL ERROR Type 1 (TPR) Type 2 (FPR) Type 1 (TPR) Type 2 (FPR)
  • 9. K = 1 Correctly Classified Instances 1728 100% Incorrectly Classified Instances 0 0% Correctly Classified Instances 1616 93.5185 % Incorrectly Classified Instances 112 6.4815 % 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0.973 0.818 1 1 0.94 0.998 0.911 0.188 0.708 0.935 0.998 0.911 0.188 0.708 0.935 0.066 0.058 0 0 0.059
  K = 5 Correctly Classified Instances 1664 96.2963 % Incorrectly Classified Instances 64 3.7037 % Correctly Classified Instances 1616 93.5185 % Incorrectly Classified Instances 112 6.4815 % 0.988 0.883 1 1 0.965 1 0.961 0.435 0.846 0.963 1 0.961 0.435 0.846 0.963 0.029 0.036 0 0 0.028 0.973 0.818 1 1 0.94 0.998 0.911 0.188 0.708 0.935 0.998 0.911 0.188 0.708 0.935 0.066 0.058 0 0 0.059
  K = 20 Correctly Classified Instances 1337 77.3727 % Incorrectly Classified Instances 391 22.6273 % Correctly Classified Instances 1327 76.794 % Incorrectly Classified Instances 401 23.206 % 0.813 0.531 0 0 0.687 1 0.331 0 0 0.774 1 0.331 0 0 0.774 0.539 0.083 0 0 0.396 0.802 0.528 0 1 0.717 1 0.299 0 0.031 0.768 1 0.299 0 0.031 0.768 0.575 0.077 0 0 0.42
  REMARK: From the table above, every instance was correctly classified on the training set when K = 1, so precision and recall were both equal to 1. This is expected: with K = 1, each training instance is its own nearest neighbour, so the training set is reproduced perfectly. We can see it in the figure below:
  • 10. IBK 1.0 With K = 5 and K = 20 we had approximately 4 % and 23 % of instances misclassified, respectively. This illustrates the effect of noisy instances in the dataset: the class of an unknown instance is decided by a majority vote among its K neighbouring instances, so a larger neighbourhood can be outvoted by noisy or borderline points. In the figure below, the misclassified points are represented by coloured rectangles.
  • 11. IBK 1.1 Applying ten-fold cross-validation divides the instances into 10 equal-sized folds: in each round, 90 % of the instances are used for training and the remaining 10 % for testing. At the end, the performance of the 10 classifiers produced from the 10 folds is averaged. Similar results were obtained for K = 1 and K = 5: approximately 6 % of instances were misclassified, compared with evaluating on the whole training set. The figure below shows the visual result of evaluating the training data with cross-validation.
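The fold construction just described can be sketched as follows. This is an illustrative round-robin split, not Weka's stratified implementation; the per-fold accuracy is a placeholder. With 1728 instances and 10 folds, each round trains on roughly 90 % of the data and tests on the remaining 10 %, and the final score is the mean of the ten per-fold accuracies.

```python
def ten_fold_splits(n, folds=10):
    """Yield (train, test) index lists for round-robin k-fold cross-validation."""
    parts = [list(range(f, n, folds)) for f in range(folds)]
    for held_out in range(folds):
        test = parts[held_out]
        train = [i for f, part in enumerate(parts) if f != held_out for i in part]
        yield train, test

# Average hypothetical per-fold accuracies into one estimate.
fold_accuracies = []
for train, test in ten_fold_splits(1728):
    assert len(train) + len(test) == 1728   # every instance used exactly once
    fold_accuracies.append(0.935)           # placeholder: this fold's accuracy
print(sum(fold_accuracies) / len(fold_accuracies))
```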
  • 12. IBK 1.2 In summary, there is no explicit model in this method: classification is deferred until a test instance must be predicted. As the value of K increases, the percentage of misclassified points also increases; a larger K improves accuracy only when the instances are noisy. Pushing K very high drives the classifier towards the baseline, majority-class behaviour, which for this dataset corresponds to an error of roughly 30 %. K-NN is a good method, although it is slow, because it has to scan the entire set of training instances before it can make a prediction.
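As a concrete illustration of the majority vote on nominal attributes, here is a small sketch. The instances are hypothetical rows in the style of the car dataset, and the distance is simply the number of mismatching attribute values (Hamming distance), one reasonable choice for nominal data.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest neighbours.
    `train` is a list of (attributes, label) pairs with nominal attributes;
    distance is the number of attribute positions that differ."""
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical instances: (buying, maint, persons, lug_boot, safety) -> class
train = [
    (("vhigh", "low", "2", "small", "low"), "unacc"),
    (("low", "low", "4", "big", "high"), "vgood"),
    (("med", "med", "4", "med", "med"), "acc"),
    (("vhigh", "high", "2", "small", "low"), "unacc"),
]
print(knn_predict(train, ("vhigh", "med", "2", "small", "low"), k=3))  # unacc
```

Two of the three nearest neighbours are unacc, so the vote returns unacc even though one neighbour disagrees, which is exactly how a larger K smooths over noisy instances.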
  • 13. B DECISION TREE ALGORITHM (ID3 or J48) ID3 (USE TRAINING SET) ID3(CROSS VALIDATION FOLD 10) Correctly Classified Instances 1728 100% Incorrectly Classified Instances 0 0% Correctly Classified Instances 1544 89.3519% Incorrectly Classified Instances 61 3.5301% === Confusion Matrix === a b c d <-- classified as 1210 0 0 0 | a = unacc 0 384 0 0 | b = acc 0 0 69 0 | c = good 0 0 0 65 | d = vgood === Confusion Matrix === a b c d <-- classified as 1171 28 3 0 | a = unacc 7 292 9 4 | b = acc 0 5 37 5 | c = good 0 0 0 44 | d = vgood
  • 14. J48 (USE TRAINING SET) J48 (CROSS VALIDATION) Correctly Classified Instances 1664 96.2963 % Incorrectly Classified Instances 64 3.7037 % Number of Leaves : 131 Size of the tree : 182 Correctly Classified Instances 1596 92.3611 % Incorrectly Classified Instances 132 7.6389 % Number of Leaves : 131 Size of the tree : 182 === Confusion Matrix === a b c d <-- classified as 1182 25 3 0 | a = unacc 10 370 2 2 | b = acc 0 9 57 3 | c = good 0 4 6 55 | d = vgood === Confusion Matrix === a b c d <-- classified as 1164 43 3 0 | a = unacc 33 333 11 7 | b = acc 0 17 42 10 | c = good 0 3 5 57 | d = vgood ONE R ALGORITHM === Evaluation on training set === === Summary === Correctly Classified Instances 1210 70.0231 % Incorrectly Classified Instances 518 29.9769 % Kappa statistic 0 Mean absolute error 0.1499 Root mean squared error 0.3871 Relative absolute error 65.4574 % Root relative squared error 114.5023 %
  • 15. Total Number of Instances 1728 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 1 0.7 1 0.824 0.5 unacc 0 0 0 0 0 0.5 acc 0 0 0 0 0 0.5 good 0 0 0 0 0 0.5 vgood Weighted Avg. 0.7 0.7 0.49 0.7 0.577 0.5 === Confusion Matrix === a b c d <-- classified as 1210 0 0 0 | a = unacc 384 0 0 0 | b = acc 69 0 0 0 | c = good 65 0 0 0 | d = vgood PRISM === Evaluation on training set === === Summary === Correctly Classified Instances 1728 100 % Incorrectly Classified Instances 0 0 %
  • 16. Kappa statistic 1 Mean absolute error 0 Root mean squared error 0 Relative absolute error 0 % Root relative squared error 0 % Total Number of Instances 1728 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 1 0 1 1 1 1 unacc 1 0 1 1 1 1 acc 1 0 1 1 1 1 good 1 0 1 1 1 1 vgood Weighted Avg. 1 0 1 1 1 1 === Confusion Matrix === a b c d <-- classified as 1210 0 0 0 | a = unacc 0 384 0 0 | b = acc 0 0 69 0 | c = good 0 0 0 65 | d = vgood CONCLUSION In terms of precision, the best algorithms are Prism and ID3: on the training set both classified every instance exactly, with no instances misclassified. J48 is also a good algorithm, but it is less suitable for large datasets, as about 4 % of the instances were misclassified. All of these are examples of supervised
  • 17. learning and can be used to support a variety of decisions; the decision-tree methods use entropy as the basis for choosing their splits. 2.3 ASSOCIATION RULES A. Apriori ======= Minimum support: 0.1 (173 instances) Minimum metric <confidence>: 0.9 Number of cycles performed: 18 Generated sets of large itemsets: Size of set of large itemsets L(1): 23 Size of set of large itemsets L(2): 52 Size of set of large itemsets L(3): 11 Best rules found:
  1. persons=2 576 ==> class=unacc 576 conf:(1)
  2. safety=low 576 ==> class=unacc 576 conf:(1)
  3. persons=2 lug_boot=small 192 ==> class=unacc 192 conf:(1)
  4. persons=2 lug_boot=med 192 ==> class=unacc 192 conf:(1)
  5. persons=2 lug_boot=big 192 ==> class=unacc 192 conf:(1)
  6. persons=2 safety=low 192 ==> class=unacc 192 conf:(1)
  • 18. 7. persons=2 safety=med 192 ==> class=unacc 192 conf:(1) 8. persons=2 safety=high 192 ==> class=unacc 192 conf:(1) 9. persons=4 safety=low 192 ==> class=unacc 192 conf:(1) 10. persons=more safety=low 192 ==> class=unacc 192 conf:(1) B. Apriori ======= Minimum support: 0.1 (173 instances) Minimum metric <confidence>: 0.5 Number of cycles performed: 18 Generated sets of large itemsets: Size of set of large itemsets L(1): 23 Large Itemsets L(1): buying=vhigh 432 buying=high 432 buying=med 432 buying=low 432 maint=vhigh 432 maint=high 432 maint=med 432 maint=low 432 doors=2 432 doors=3 432
  • 19. doors=4 432 doors=5more 432 persons=2 576 persons=4 576 persons=more 576 lug_boot=small 576 lug_boot=med 576 lug_boot=big 576 safety=low 576 safety=med 576 safety=high 576 class=unacc 1210 class=acc 384 Size of set of large itemsets L(2): 52 Large Itemsets L(2): buying=vhigh class=unacc 360 buying=high class=unacc 324 buying=med class=unacc 268 buying=low class=unacc 258 maint=vhigh class=unacc 360 maint=high class=unacc 314 maint=med class=unacc 268 maint=low class=unacc 268 doors=2 class=unacc 326
  • 20. doors=3 class=unacc 300 doors=4 class=unacc 292 doors=5more class=unacc 292 persons=2 lug_boot=small 192 persons=2 lug_boot=med 192 persons=2 lug_boot=big 192 persons=2 safety=low 192 persons=2 safety=med 192 persons=2 safety=high 192 persons=2 class=unacc 576 persons=4 lug_boot=small 192 persons=4 lug_boot=med 192 persons=4 lug_boot=big 192 persons=4 safety=low 192 persons=4 safety=med 192 persons=4 safety=high 192 persons=4 class=unacc 312 persons=4 class=acc 198 persons=more lug_boot=small 192 persons=more lug_boot=med 192 persons=more lug_boot=big 192 persons=more safety=low 192 persons=more safety=med 192 persons=more safety=high 192 persons=more class=unacc 322 persons=more class=acc 186
  • 21. lug_boot=small safety=low 192 lug_boot=small safety=med 192 lug_boot=small safety=high 192 lug_boot=small class=unacc 450 lug_boot=med safety=low 192 lug_boot=med safety=med 192 lug_boot=med safety=high 192 lug_boot=med class=unacc 392 lug_boot=big safety=low 192 lug_boot=big safety=med 192 lug_boot=big safety=high 192 lug_boot=big class=unacc 368 safety=low class=unacc 576 safety=med class=unacc 357 safety=med class=acc 180 safety=high class=unacc 277 safety=high class=acc 204 Size of set of large itemsets L(3): 11 Large Itemsets L(3): persons=2 lug_boot=small class=unacc 192 persons=2 lug_boot=med class=unacc 192 persons=2 lug_boot=big class=unacc 192 persons=2 safety=low class=unacc 192 persons=2 safety=med class=unacc 192
  • 22. persons=2 safety=high class=unacc 192 persons=4 safety=low class=unacc 192 persons=more safety=low class=unacc 192 lug_boot=small safety=low class=unacc 192 lug_boot=med safety=low class=unacc 192 lug_boot=big safety=low class=unacc 192 Best rules found: 1. persons=2 576 ==> class=unacc 576 conf:(1) 2. safety=low 576 ==> class=unacc 576 conf:(1) 3. persons=2 lug_boot=small 192 ==> class=unacc 192 conf:(1) 4. persons=2 lug_boot=med 192 ==> class=unacc 192 conf:(1) 5. persons=2 lug_boot=big 192 ==> class=unacc 192 conf:(1) 6. persons=2 safety=low 192 ==> class=unacc 192 conf:(1) 7. persons=2 safety=med 192 ==> class=unacc 192 conf:(1) 8. persons=2 safety=high 192 ==> class=unacc 192 conf:(1) 9. persons=4 safety=low 192 ==> class=unacc 192 conf:(1) 10. persons=more safety=low 192 ==> class=unacc 192 conf:(1) 11. lug_boot=small safety=low 192 ==> class=unacc 192 conf:(1) 12. lug_boot=med safety=low 192 ==> class=unacc 192 conf:(1) 13. lug_boot=big safety=low 192 ==> class=unacc 192 conf:(1) 14. buying=vhigh 432 ==> class=unacc 360 conf:(0.83) 15. maint=vhigh 432 ==> class=unacc 360 conf:(0.83) 16. lug_boot=small 576 ==> class=unacc 450 conf:(0.78) 17. doors=2 432 ==> class=unacc 326 conf:(0.75)
  • 23.
  18. buying=high 432 ==> class=unacc 324 conf:(0.75)
  19. maint=high 432 ==> class=unacc 314 conf:(0.73)
  20. doors=3 432 ==> class=unacc 300 conf:(0.69)
  21. safety=high class=unacc 277 ==> persons=2 192 conf:(0.69)
  22. lug_boot=med 576 ==> class=unacc 392 conf:(0.68)
  23. doors=4 432 ==> class=unacc 292 conf:(0.68)
  24. doors=5more 432 ==> class=unacc 292 conf:(0.68)
  25. lug_boot=big 576 ==> class=unacc 368 conf:(0.64)
  26. buying=med 432 ==> class=unacc 268 conf:(0.62)
  27. maint=med 432 ==> class=unacc 268 conf:(0.62)
  28. maint=low 432 ==> class=unacc 268 conf:(0.62)
  29. safety=med 576 ==> class=unacc 357 conf:(0.62)
  30. persons=4 class=unacc 312 ==> safety=low 192 conf:(0.62)
  31. buying=low 432 ==> class=unacc 258 conf:(0.6)
  32. persons=more class=unacc 322 ==> safety=low 192 conf:(0.6)
  33. persons=more 576 ==> class=unacc 322 conf:(0.56)
  34. persons=4 576 ==> class=unacc 312 conf:(0.54)
  35. safety=med class=unacc 357 ==> persons=2 192 conf:(0.54)
  36. class=acc 384 ==> safety=high 204 conf:(0.53)
  37. lug_boot=big class=unacc 368 ==> persons=2 192 conf:(0.52)
  38. lug_boot=big class=unacc 368 ==> safety=low 192 conf:(0.52)
  39. class=acc 384 ==> persons=4 198 conf:(0.52)
  From the comparison table, the best algorithm depends on the type of problem the researcher is given. If a dataset comes with few or no conditions attached, association rules can be applied to generate the possible rules and outcomes; how many rules are produced depends on the confidence threshold and on the number of rules requested. This makes the approach well suited to unsupervised learning.
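The confidence values listed above follow the usual definition conf(A => B) = support(A and B) / support(A). A minimal sketch of that computation, using a few hypothetical transactions encoded as attribute=value sets:

```python
def confidence(transactions, antecedent, consequent):
    """conf(A => B) = support(A and B) / support(A)."""
    a = sum(1 for t in transactions if antecedent <= t)
    ab = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return ab / a if a else 0.0

# Hypothetical transactions in the style of the car dataset.
rows = [
    {"persons=2", "safety=low", "class=unacc"},
    {"persons=2", "safety=med", "class=unacc"},
    {"persons=4", "safety=high", "class=acc"},
    {"persons=4", "safety=low", "class=unacc"},
]
print(confidence(rows, {"persons=2"}, {"class=unacc"}))   # 1.0
print(confidence(rows, {"persons=4"}, {"class=unacc"}))   # 0.5
```

A rule like "persons=2 ==> class=unacc conf:(1)" in the Weka output means every transaction containing the antecedent also contains the consequent, exactly as in the first call.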
  • 24. Conversely, Prism is a good choice when a set of conditions is given: it drives the classification error down to its lowest proportion and returns the best alternatives. It is limited, however, when it comes to complex decision-making processes. J48 builds its classifier as a tree using a divide-and-conquer strategy; it is not the ideal method for complex decision-making, since more errors are generated in the classification process. It, too, is an example of a supervised learning (classification) algorithm.
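Since the decision-tree methods above choose their splits by entropy, it may help to show the quantity itself. The sketch below computes the Shannon entropy of the class distribution reported in the confusion matrices (1210 unacc, 384 acc, 69 good, 65 vgood); an attribute's information gain is the drop in this entropy after splitting on that attribute.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Class distribution of the car dataset used throughout this report.
labels = ["unacc"] * 1210 + ["acc"] * 384 + ["good"] * 69 + ["vgood"] * 65
print(round(entropy(labels), 3))  # 1.206
```

ID3 and J48 pick, at each node, the attribute whose split yields the largest reduction from this value, which is why a heavily skewed class distribution (70 % unacc) already starts well below the 2-bit maximum for four classes.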