Clustering Theory

4th International Summer School
Achievements and Applications of Contemporary
Informatics, Mathematics and Physics
National University of Technology of the Ukraine
Kiev, Ukraine, August 5-16, 2009

Data Mining for Quality Improvement
with Nonsmooth Optimization
vs. PAM and k-Means

Gerhard-Wilhelm Weber * and Başak Akteke-Öztürk
Institute of Applied Mathematics
Middle East Technical University, Ankara, Turkey
* Faculty of Economics, Management and Law, University of Siegen, Germany
Center for Research on Optimization and Control, University of Aveiro, Portugal

Outline

• Quality Analysis

• Data Mining for Quality Analysis

• Clustering Methods

• Results and Comparison

• Decision Tree Analysis of a Cluster

• Conclusion

Quality Analysis

• Quality is an essential requirement of
– products,
– processes, and
– services.

• This study is part of a project whose main focus is quality analysis:
the relationship between input and output.

• Modern quality analysis takes advantage of data mining tools.

Data Mining for Quality Analysis

Data mining tools such as
– decision trees (e.g. classification and regression trees (CART)),
– neural networks (NN),
– self-organizing maps (SOM),
– support vector machines (SVM),
are highly preferred for modeling and producing rules for the output.

However, applications of such tools are not yet widespread enough
for industry practitioners to prefer and use them for their
quality analysis needs.

Aim of Our Data Mining Studies

• to identify the data mining approaches that can
effectively improve product and process quality in industrial
organizations:
– classification / prediction,
– clustering and
– association analysis,

• to develop new data mining software and improve the
existing ones for quality analysis.

• Initial study: To identify the most influential variables that
cause defects on the items produced by a casting company
located in Turkey.

Our Data Set

• Our data set: 92 objects (rows),
35 process variables (columns).

• The data belong to a particular product with a high percentage
of defectives and were collected during the first five months
of production in 2006.

• Missing values: filled with the averages of the columns
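
The column-mean filling of missing values mentioned above can be sketched as follows. This is a minimal illustration, not the tool used in the study; the function name and the use of `None` as the missing-value marker are assumptions.

```python
def fill_missing_with_column_means(rows, missing=None):
    """Replace missing entries by the mean of the observed values in the same column."""
    columns = list(zip(*rows))
    means = []
    for col in columns:
        present = [x for x in col if x is not missing]
        # Assumption: a column with no observed values is left unfilled.
        means.append(sum(present) / len(present) if present else missing)
    return [
        [means[j] if x is missing else x for j, x in enumerate(row)]
        for row in rows
    ]


# Toy example (hypothetical values, not the casting data).
toy_rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
filled = fill_missing_with_column_means(toy_rows)
```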

Clustering - 2 Algorithms (Model Free)

Minimal distance procedure:
– choose a random start partition
– compute centroids
– create minimal distance partition
– end partition
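
The minimal distance procedure can be sketched in a few lines of Python. This is an illustrative sketch only: the squared Euclidean distance, the stopping rule (stop when the partition no longer changes) and the handling of emptied clusters are assumptions, and all names are hypothetical.

```python
import random


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minimal_distance_clustering(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Random start partition: shuffled round-robin, so no cluster starts empty.
    order = list(range(len(data)))
    rng.shuffle(order)
    labels = [0] * len(data)
    for pos, i in enumerate(order):
        labels[i] = pos % k
    centroids = []
    for _ in range(max_iter):
        # Compute centroids of the current partition.
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(data, labels) if l == c]
            # Assumption: an emptied cluster is re-seeded with a random object.
            centroids.append(centroid(members) if members else list(rng.choice(data)))
        # Create the minimal distance partition.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in data
        ]
        if new_labels == labels:
            break  # end partition reached
        labels = new_labels
    return labels, centroids
```

With Euclidean distance and centroids as cluster representatives, this loop is the usual batch form of k-means.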

Clustering - 2 Algorithms (Model Free)

Exchange procedure:
– choose a random start partition
– test an object in all clusters
– update the centroids
– end partition
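
A possible reading of the exchange procedure is sketched below. The slide does not name the criterion, so the sum of squared distances to the cluster centroids is assumed here; the criterion is recomputed from scratch after each tentative move for clarity rather than speed, and all names are hypothetical.

```python
import random


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sum_of_squares(data, labels, k):
    """Sum of squared distances of all objects to their cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(data, labels) if l == c]
        if members:
            m = centroid(members)  # centroids are updated on every evaluation
            total += sum(sum((x - y) ** 2 for x, y in zip(p, m)) for p in members)
    return total


def exchange_clustering(data, k, seed=0, max_passes=100):
    rng = random.Random(seed)
    # Random start partition: shuffled round-robin, so no cluster starts empty.
    order = list(range(len(data)))
    rng.shuffle(order)
    labels = [0] * len(data)
    for pos, i in enumerate(order):
        labels[i] = pos % k
    best = sum_of_squares(data, labels, k)
    for _ in range(max_passes):
        moved = False
        for i in range(len(data)):
            home = labels[i]
            if labels.count(home) == 1:
                continue  # assumption: never empty a cluster
            best_cluster = home
            # Test the object in all other clusters.
            for c in range(k):
                if c == home:
                    continue
                labels[i] = c
                value = sum_of_squares(data, labels, k)
                if value < best - 1e-12:
                    best, best_cluster = value, c
            labels[i] = best_cluster  # keep the best placement found
            moved = moved or best_cluster != home
        if not moved:
            break  # end partition: no single move improves the criterion
    return labels, best
```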

Our Clustering

• The data set was scaled to the interval [0,1] before the clustering analysis:

$$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}.$$

• We used k-means, PAM (Partitioning Around Medoids) and
a modified k-means based on nonsmooth analysis:

– to understand the data set by examining the groups in the data,
– to find the outliers of the data set,
– these methods were suitable because our data set was not big.

• These methods use the Euclidean metric by default.
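
The scaling to [0,1] at the top of this slide can be sketched as below. It is assumed here that the minimum and maximum are taken per process variable (per column), which the slide does not state explicitly; the function name is hypothetical.

```python
def min_max_scale(rows):
    """Scale every column of a list of rows to the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # Assumption: a constant column (max == min) is mapped to 0.0.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Toy example (hypothetical values, not the casting data).
toy_scaled = min_max_scale([[1, 10], [3, 20], [5, 40]])
```

Scaling matters for distance-based clustering because otherwise variables with large numeric ranges dominate the Euclidean distance.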

About the Methods

• PAM is more robust than k-means
in the presence of noise and outliers.

• PAM minimizes a sum of dissimilarities
instead of a sum of squared Euclidean distances.

• Medoids are less influenced by the presence of noise and outliers.

• A medoid can be defined as that object of a cluster, whose
average distance (dissimilarity) to all the objects in the cluster
is minimal.
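
The medoid definition above translates directly into code. The sketch below assumes Euclidean distance as the dissimilarity; the function name and the toy points are hypothetical.

```python
import math


def medoid(cluster):
    """Return the object whose average distance to all objects in the cluster is minimal."""
    def average_dissimilarity(candidate):
        return sum(math.dist(candidate, other) for other in cluster) / len(cluster)

    return min(cluster, key=average_dissimilarity)


# Toy cluster with one outlier at x = 20.
toy_cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (20, 0)]
toy_medoid = medoid(toy_cluster)
```

In this toy cluster the medoid is (2, 0), whereas the centroid (mean) is pulled to (5.2, 0) by the outlier, which illustrates why medoids are less sensitive to outliers.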

Nonsmooth Analysis

• k-means takes as input:
the number of clusters and initial cluster centers.

• This problem can be reduced to a nonsmooth optimization problem
--> initial problem for the modified k-means.

– global optimization techniques,
– nonsmooth optimization algorithms and
– derivative free optimization for the modified k-means algorithm.

• The minimum sum of squares problem -->
nonsmooth and nonconvex optimization problem.
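
For orientation, the minimum sum of squares clustering problem is commonly written in the following form. The notation is an assumption introduced here, not taken from the slides: $a^1, \dots, a^m \in \mathbb{R}^n$ are the data objects and $x^1, \dots, x^k$ are the cluster centers.

```latex
\min_{x^1, \dots, x^k \in \mathbb{R}^n} \; f(x^1, \dots, x^k)
  = \frac{1}{m} \sum_{i=1}^{m} \min_{1 \le j \le k} \left\| x^j - a^i \right\|^2
```

The inner minimum over $j$ makes $f$ nonsmooth, and for $k \ge 2$ the function is in general nonconvex, which matches the classification stated above.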

k-Means Results

k=2 cluster_1 (70 objects) – cluster_2 (22 objects) 1.113769

cluster_1 (68 objects) – cluster_2 (22 objects) 1.111567


k-Means Results

• Best result is for k=2.

• The proximities of the clusters for k=3 and k=4 are higher.

• However, the results for k=3 and k=4 are artificial:
one of the clusters contains only 2 objects.

• These objects are outliers.

PAM Results

2 clusters cluster_1 (40 objects) – cluster_2 (52 objects) 1.2838

3 clusters cluster_1 (33 objects) – cluster_2 (34 objects) 1.2838
           cluster_1 (33 objects) – cluster_3 (25 objects) 1.2729

4 clusters

PAM Results

• The proximity of the clusters for k=4 is higher, i.e.,
the clusters are better separated.
• The numbers of objects in the clusters are 20, 34, 25 and 13.
• This is a quite natural grouping of the data.
• Best result is for k=4.
• We can say that clustering conducted by PAM is a
fine tuning of the one done by k-means.

                    PAM
              1.00  2.00  3.00  4.00  Total
k-Means 1.00    20    12    25    13     70
        2.00     0    22     0     0     22
Total           20    34    25    13     92
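
A cross-tabulation like the one above can be built from two label vectors as sketched below. The function name and the toy labels are hypothetical; they are not the labels from the study.

```python
from collections import Counter


def cross_tabulate(labels_a, labels_b):
    """Count how many objects fall into each pair (cluster in A, cluster in B)."""
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    return {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}


# Toy example: five objects clustered by two methods.
toy_table = cross_tabulate([1, 1, 1, 2, 2], [1, 2, 3, 2, 2])
```

A row with a single non-zero entry means that one method's cluster lies entirely inside a cluster of the other method, as is the case for the k-means cluster of 22 objects in the table above.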

Modified k-Means Results

k=2                      k=3      k=4
cluster_1: 61 objects             cluster_1: 45 objects
cluster_2: 31 objects             cluster_2: 24 objects
                                  cluster_3: 2 objects
                                  cluster_4: 21 objects

For k=4, k-means has 2 clusters of fewer than 10 objects.
Modified k-means has only 1 cluster of fewer than 10 objects;
the others all have more than 20.
Best result is for k=2.

                Modified global k-Means
              1.00  2.00  Total
k-Means 1.00    61     9     70
        2.00     0    22     22
Total           61    31     92

Modified k-Means Results

• Modified k-means gave more natural results than k-means.

• The clusters found by this modified method are more balanced in
terms of the numbers of objects.

• As k increases, k-means gives artificial results;
however, modified global k-means gives reasonable clusters
except for one cluster.

• This new algorithm can be used when k is not known a priori.

• It is easy to use and the running time of the algorithm is
very short (seconds in all of our runs).

Studies on Found Clusters

We obtained the rule sets for k-means when k = 2, 3 and 4.

These rule sets show which values of the process variables
together characterize each class of objects.

These results are meaningful for the decision maker,
which in our case is the company.

Instead of the rule sets, it is more instructive here to look at the
decision tree analysis of the clusters.

We applied CART (classification and regression trees)
of SPSS Clementine® 10.1 to the group found by
k-means for k=2.

Results

• We chose the big cluster of 70 objects as our dataset for
CART.

• We formed 7 different training sets of 60 objects randomly
and 7 test sets from the remaining 10 objects.

• There is one output variable (i.e., response variable), which represents
the total number of defective items.

• We obtained 7 decision tree models from these training and
test sets.
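
The repeated random 60/10 split described above can be sketched as follows. The slides do not say how the random sets were drawn, so independent shuffles with a fixed seed are an assumption, and the function name is hypothetical.

```python
import random


def random_holdout_splits(n_objects, n_train, n_splits, seed=0):
    """Draw n_splits random (train, test) index splits of range(n_objects)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        indices = list(range(n_objects))
        rng.shuffle(indices)
        train = sorted(indices[:n_train])
        test = sorted(indices[n_train:])  # the remaining objects form the test set
        splits.append((train, test))
    return splits


# 7 splits of the 70-object cluster into 60 training and 10 test objects.
splits = random_holdout_splits(n_objects=70, n_train=60, n_splits=7)
```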

Results

We used three main measures to compare these models:
– Mean error (ME)
– Mean absolute error (MAE)
– Correlation

                      Average  Model 1  Model 2  Model 3  Model 4  Model 5  Model 6  Model 7
Training ME             0        0.0      0.0      0.0      0.0      0.0      0.0      0.0
Training MAE            2.8      2.6      3.1      3.0      2.5      3.2      2.4      2.8
Training correlation    0.887    0.922    0.840    0.871    0.917    0.874    0.911    0.872
Test ME                -0.004    0.008    0.031    0.053   -0.064    0.002   -0.02    -0.04
Test MAE                7.74     5.2      7.7      6.9      9.5      5.5      7.7     11.7
Test correlation        0.040   -0.453   -0.046    0.555    0.146   -0.378    0.535   -0.08
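
The three measures in the table can be computed as sketched below. The sign convention for the mean error (predicted minus actual) and the use of the Pearson correlation between actual and predicted values are assumptions, since the slides do not define them; the function names and toy values are hypothetical.

```python
import math


def mean_error(actual, predicted):
    """Average signed error; assumed convention: predicted minus actual."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average magnitude of the error, regardless of sign."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def pearson_correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Toy values (hypothetical, not the study's defect counts).
toy_actual = [10, 20, 30, 40]
toy_predicted = [12, 18, 33, 37]
```

A mean error close to zero only shows that positive and negative errors cancel; the mean absolute error and the correlation are needed to judge how well individual predictions match.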

Results

                      Cluster of 70 objects   Whole data set of 92 objects
Training ME            0                       0
Training MAE           2.8                     3.23
Training correlation   0.887                   0.8098
Test ME               -0.004                  -0.21
Test MAE               7.74                    6.85
Test correlation       0.040                   0.0757

Our studies show that it is better to perform clustering
before building models and extracting rule sets.

We obtained the 4 most important variables for the response
variable.

2 of these important variables are also the most important
ones for the whole data set.

Conclusion

• When the data mining techniques used for classification /
prediction cannot produce accurate results or cannot build
models that are capable of predicting correctly, it is better
to first find the homogeneous groups in the data set.

• Clustering algorithms can produce highly different results;
one should choose the most efficient and natural one.

• Modified k-means can be preferred to k-means.

