International OPEN ACCESS Journal of Modern Engineering Research (IJMER)
| IJMER | ISSN: 2249–6645 | www.ijmer.com | Vol. 4 | Iss. 4 | Apr. 2014 |

MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data

N. Aparna¹, M. Kalaiarasu²
¹,²(Department of Computer Science / Department of Information Technology, Sri Ramakrishna Engineering College, Coimbatore)
Abstract: Clustering mixed type data is one of the major research topics in data mining. In this paper, a new algorithm for clustering mixed type data is proposed, in which the concept of a distribution centroid is used to represent the prototype of the categorical variables in a cluster; this is combined with the mean to represent the prototype of clusters with mixed type variables. In the method, data is observed from different views and the variables are grouped into different views. Instances that can be viewed differently from different viewpoints are called multiview data. The differences among views are usually ignored during clustering. Here, view weights and variable weights are computed simultaneously. The view weight measures the importance of a view in the whole data set, and the variable weight measures the significance of each variable within its view. Both weights are used in the distance function to determine the cluster of each object. The proposed method enhances k-prototypes so that it automatically computes both view and variable weights. The proposed MK-Prototypes algorithm is compared with two other clustering algorithms.

Keywords: clustering, mixed data, multiview, variable weighting, view weighting, k-prototypes.

I. Introduction
Clustering is a fundamental technique of unsupervised learning in machine learning and statistics. It is generally used to find groups of similar items in a set of unlabeled data. The aim of clustering is to divide a set of data objects into clusters so that data objects that belong to the same cluster are more similar to each other than to those in other clusters [1-4]. Real-world datasets usually contain both numeric and categorical variables [5,6]. However, most existing clustering algorithms assume that all variables are of one type, either numeric or categorical; examples include the k-means [7], k-modes [8], and fuzzy k-modes [9] algorithms. Moreover, data is often observed from multiple viewpoints and in multiple types of dimensions. For example, in a student data set the variables can be divided into a personal view giving the student's personal information, an academic view describing the student's academic performance, and an extra-curricular view giving the student's extra-curricular activities and achievements.

Traditional methods treat multiple views as one flat set of variables and do not take into account the differences among views [10], [11], [12]. Multiview clustering, in contrast, uses the information from multiple views and also considers the variations among them, which produces a more precise and efficient partitioning of the data.

In this paper, a new algorithm, Multi-viewpoint K-prototypes (MK-Prototypes), for clustering mixed type data is proposed. It is an enhancement of the usual k-prototypes algorithm. In order to differentiate the effects of different views and different variables in clustering, both view weights and variable weights are applied in the distance function. While computing the view weights, the complete set of variables is considered; while computing the weights of the variables in a view, only the part of the data covered by that view is considered. Thus, the view weights show the significance of views in the complete data, and the variable weights in a view show the significance of variables within that view alone.

II. Related Works
To date, a number of algorithms and methods exist that deal directly with mixed type data. In [13], Cen Li and Gautam Biswas proposed Similarity-Based Agglomerative Clustering (SBAC), an algorithm that works well for data with mixed attributes. It adopts a similarity measure proposed by Goodall [14] for biological taxonomy, in which infrequent attribute value matches are assigned higher weight, and it makes no assumptions about the underlying characteristics of the attribute values. An agglomerative algorithm is used to generate a dendrogram, and a simple distinctness heuristic is used to extract a partition of the data. Hsu and Chen proposed CAVE [15], a clustering algorithm based on Variance and Entropy for clustering mixed data. It builds a distance hierarchy for every categorical attribute, which requires domain expertise. Hsu et al. [16] proposed an extension of the self-organizing map for analyzing mixed data, in which the distance hierarchy is constructed automatically using the values of class attributes.
In [17], Chatzis proposed the KL-FCM-GM algorithm, which assumes that the data derived from the clusters are Gaussian and is designed for Gauss-Multinomial distributed data.
Huang presented the k-prototypes algorithm [18], in which k-means is integrated with k-modes to partition mixed data. Bezdek et al. considered the fuzzy nature of objects in the fuzzy k-prototypes algorithm [19], and Zheng et al. [20] proposed an evolutionary k-prototypes algorithm by introducing an evolutionary algorithm framework.
III. Proposed System
The motivation for the proposed system is, on one hand, to provide a better representation for the categorical part of mixed data, since the numeric variables can already be well represented by their mean. On the other hand, it considers the importance of view weights and variable weights in the clustering process. The concept of the distribution centroid represents the cluster center for the categorical part, and Huang's evaluation strategy is used to compute both the view weights and the variable weights.
A. The distribution centroid
The idea of the distribution centroid, which gives a better representation of categorical variables, is inspired by the fuzzy centroid proposed by Kim et al. [21]. It makes use of a fuzzy scenario to represent the cluster centers for the categorical part of the data.

For Dom(V_j) = \{c_j^1, c_j^2, \dots, c_j^t\}, the distribution centroid of a cluster o, denoted C'_o, is represented as

C'_o = [c'_{o1}, c'_{o2}, \dots, c'_{oj}, \dots, c'_{om}]    (1)

where

c'_{oj} = \{(c_j^1, w_{oj}^1), (c_j^2, w_{oj}^2), \dots, (c_j^k, w_{oj}^k), \dots, (c_j^t, w_{oj}^t)\}    (2)

In the above equation,

w_{oj}^k = \sum_{i=1}^{n} \mu(x_{ij})    (3)

where

\mu(x_{ij}) = u_{io} / \sum_{i=1}^{n} u_{io}  if x_{ij} = c_j^k,  and  \mu(x_{ij}) = 0  if x_{ij} \ne c_j^k    (4)

Here, u_{io} is assigned the value 1 if the data object x_i belongs to cluster o, and 0 if it does not.

From the above equations it is clear that the computation of the distribution centroid considers the number of times each categorical value occurs in a cluster. Thus, to denote the center of a cluster, it takes into account the distribution features of the categorical variables.
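To make Eqs. (3)-(4) concrete: the weight of each category in a cluster is simply its relative frequency among the cluster's members. A minimal sketch in Python (function and variable names are illustrative, not from the paper):

```python
from collections import Counter

def distribution_centroid_entry(values):
    """Entry c'_oj of the distribution centroid for one categorical
    variable: each category paired with its relative frequency among
    the objects assigned to the cluster (Eqs. (3)-(4))."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}

# Values of one categorical variable over the members of a cluster
entry = distribution_centroid_entry(["a", "b", "a", "a", "c"])  # a:0.6, b:0.2, c:0.2
```

The weights of an entry always sum to 1, so the entry is a full distribution over the variable's categories, not a single mode as in k-modes.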
B. Weight calculation using Huangโ€™s approach
The weight of a variable identifies the effect of that variable in the clustering process. In 2005, Huang et al. proposed an approach to calculate variable weights [22]. In their method, the weights are computed by minimizing an objective function.

The principle for assigning a variable weight is to allocate a larger value to a variable that has a smaller sum of within-cluster distances (WCD), and vice versa:

w_j \propto 1 / D_j    (5)

where w_j is the weight of variable j, \propto denotes direct proportionality, and D_j is the sum of the within-cluster distances for that variable.
MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data
| IJMER | ISSN: 2249โ€“6645 | www.ijmer.com | Vol. 4 | Iss. 4 | Apr. 2014 | 57 |
C. Multiview concept
FIGURE 1 : Multiview concept
In 2013, Chen et al. [23] proposed TW-k-means, which introduced the concept of multiview data; Figure 1 illustrates the concept. In conventional clustering, the differences among views are not considered. In multiview clustering, in addition to variable weights, the variables are grouped according to their characteristic properties; each group is termed a view, and a weight is assigned to each view. The view weights are assigned according to Huang's approach.
D. The proposed algorithm
The proposed algorithm, MK-Prototypes, puts together the concepts of Sections III-A, III-B and III-C above. Figure 2 describes the steps involved in the algorithm:
Steps in the proposed algorithm:
1. Compute the distribution centroid to represent the categorical variable centroid
2. Compute the mean for the numerical variables
3. Integrate the distribution centroid and mean to represent the prototype for the mixed data
4. Compute the view weights and variable weights.
5. Measure the similarity between the data objects and the prototypes
6. Assign the data object to that prototype to which the considered data object is the closest
7. Repeat steps 1-6 until an effective clustering result is obtained.
E. The optimization model
The clustering process that partitions the dataset X into k clusters while considering both view weights and variable weights is formulated, following the framework of [23], as the minimization of the objective function

P(U, Z, R, V) = \sum_{o=1}^{k} \sum_{i=1}^{n} \sum_{t=1}^{Q} \sum_{s \in G_t} u_{i,o} \, v_t \, r_s \, d(x_{i,s}, z_{o,s})    (6)

subject to

\sum_{o=1}^{k} u_{i,o} = 1, \quad u_{i,o} \in \{0, 1\}, \quad 1 \le i \le n
\sum_{t=1}^{Q} v_t = 1, \quad 0 \le v_t \le 1, \quad 1 \le t \le Q
\sum_{j \in G_t} r_j = 1, \quad 0 \le r_j \le 1

where U is an n x k partition matrix whose elements u_{i,o} are binary, with u_{i,o} = 1 indicating that object i is allocated to cluster o; Z = \{z_1, z_2, \dots, z_k\} is a set of k vectors representing the centers of the k clusters; V = \{v_1, v_2, \dots, v_Q\} are the Q weights for the Q views; R = \{r_1, r_2, \dots, r_s\} are the s weights for the s variables; and d(x_{i,s}, z_{o,s}) is a distance or dissimilarity measure on the s-th variable between the i-th object and the center of the o-th cluster.
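Eq. (6) can be transcribed directly, for example to check convergence of the alternating optimization described next. A sketch (names are illustrative; `dist` stands for the per-variable measure d defined later):

```python
def objective(U, V, R, views, X, Z, dist):
    """P(U, Z, R, V) of Eq. (6): the view- and variable-weighted sum of
    per-variable distances between each object and the center of the
    cluster it is assigned to. U[i][o] in {0,1}; V[t]: view weights;
    R[s]: variable weights; views: list of variable-index lists G_t."""
    total = 0.0
    for o in range(len(Z)):
        for i in range(len(X)):
            if U[i][o] == 1:
                for t, G_t in enumerate(views):
                    for s in G_t:
                        total += V[t] * R[s] * dist(X[i][s], Z[o][s], s)
    return total
```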
FIGURE 2. Flowchart for the proposed algorithm
In order to minimize Eq. (6), the problem is divided into four sub-problems:
1. Sub-problem 1: fix Z = Ẑ, R = R̂ and V = V̂ and solve the reduced problem P(U, Ẑ, R̂, V̂).
2. Sub-problem 2: fix U = Û, R = R̂ and V = V̂ and solve the reduced problem P(Û, Z, R̂, V̂).
3. Sub-problem 3: fix Z = Ẑ, U = Û and V = V̂ and solve the reduced problem P(Û, Ẑ, R, V̂).
4. Sub-problem 4: fix Z = Ẑ, R = R̂ and U = Û and solve the reduced problem P(Û, Ẑ, R̂, V).

Sub-problem 1 is solved by setting

u_{i,o} = 1    (7)

if

\sum_{s=1}^{m} v_t \, r_s \, d(x_{i,s}, z_{o,s}) \le \sum_{s=1}^{m} v_t \, r_s \, d(x_{i,s}, z_{e,s})    (8)

for 1 \le e \le k, and u_{i,e} = 0 for e \ne o.
Sub-problem 2 is solved for the numeric variables by

z_{o,s} = \frac{\sum_{i=1}^{n} u_{i,o} \, x_{i,s}}{\sum_{i=1}^{n} u_{i,o}}    (9)

and for the categorical variables by z_{o,s} = c'_{o,s}, the distribution centroid already defined.

The per-variable distance is

d(x_{i,s}, z_{o,s}) = |x_{i,s} - z_{o,s}|  if the s-th variable is numeric,
d(x_{i,s}, z_{o,s}) = \varphi(x_{i,s}, z_{o,s})  if the s-th variable is categorical,

where \varphi(x_{i,s}, z_{o,s}) = \sum_{k=1}^{t} \delta(x_{i,s}, c_j^k), with \delta(x_{i,s}, c_j^k) = w_{o,j}^k if x_{i,s} \ne c_j^k and 0 if x_{i,s} = c_j^k. Thus an object is close to a centroid whose distribution concentrates on the object's own category.
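Under the definitions above, and with a distribution-centroid entry stored as a mapping from category to weight, the per-variable distance can be sketched as follows (illustrative names; assumes the weights of each entry sum to 1):

```python
def mixed_distance(x_s, z_s):
    """d(x_is, z_os): absolute difference for a numeric variable; for a
    categorical variable, phi sums the distribution-centroid weights of
    every category different from x_is, so a value that matches a
    frequent category in the cluster is close to the centroid."""
    if isinstance(z_s, dict):  # categorical: z_s = {category: weight}
        return sum(w for category, w in z_s.items() if category != x_s)
    return abs(x_s - z_s)      # numeric: z_s is the cluster mean
```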
The solution to sub-problem 3 is as follows. Let Z = Ẑ, U = Û and V = V̂ be fixed. Then the reduced problem P(Û, Ẑ, R, V̂) is minimized by

r_s = \frac{1}{\sum_{h \in G_t} (D_s / D_h)^{1/\gamma}}    (10)

where

D_s = \sum_{o=1}^{k} \sum_{i=1}^{n} \hat{u}_{i,o} \, \hat{v}_t \, d(x_{i,s}, \hat{z}_{o,s})    (11)
Sub-problem 4 is solved by

v_t = \frac{1}{\sum_{h=1}^{Q} (F_t / F_h)^{1/\mu}}    (12)

where

F_t = \sum_{o=1}^{k} \sum_{i=1}^{n} \sum_{s \in G_t} \hat{u}_{i,o} \, \hat{r}_s \, d(x_{i,s}, \hat{z}_{o,s})    (13)
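The two closed-form updates share the same shape and differ only in the grouping and the exponent: Eq. (10) runs over the variables of one view with exponent 1/γ, Eq. (12) over the views with exponent 1/μ. A sketch of the shared form, as reconstructed above (names illustrative):

```python
def closed_form_weights(distance_sums, exponent):
    """Shared form of Eqs. (10) and (12):
    weight_s = 1 / sum_h (D_s / D_h)**(1/exponent),
    computed over one group (the variables of one view, or all views).
    distance_sums: positive sums D_s (Eq. (11)) or F_t (Eq. (13))."""
    return [1.0 / sum((D_s / D_h) ** (1.0 / exponent) for D_h in distance_sums)
            for D_s in distance_sums]

# A larger exponent gives flatter weights (cf. Tables 1 and 2 below)
sharp = closed_form_weights([2.0, 4.0, 8.0], exponent=1.0)
flat = closed_form_weights([2.0, 4.0, 8.0], exponent=10.0)
```

The weights of each group sum to 1 by construction, satisfying the constraints of Eq. (6).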
Having presented the detailed computations required for the important variables, the proposed MK-Prototypes algorithm can be described as follows:
1. Choose the number of iterations, the number of clusters k and the values of μ and γ; randomly choose k distinct data objects and convert them into initial prototypes; initialize the view weights and variable weights; set t = 0.
2. Fix Z, R, V as Z^t, R^t, V^t and minimize P(U, Z^t, R^t, V^t) to obtain U^{t+1}.
3. Fix U, R, V as U^t, R^t, V^t and minimize P(U^t, Z, R^t, V^t) to obtain Z^{t+1}.
4. Fix U, Z, V as U^t, Z^t, V^t and minimize P(U^t, Z^t, R, V^t) to obtain R^{t+1}.
5. Fix U, Z, R as U^t, Z^t, R^t and minimize P(U^t, Z^t, R^t, V) to obtain V^{t+1}.
6. If there is no improvement in P, or if the maximum number of iterations is reached, stop. Otherwise increment t by 1, decrement the remaining number of iterations by 1, and go to step 2.
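Putting the pieces together, the alternating optimization of steps 1-6 can be sketched end to end. This is a simplified illustration, not the paper's exact implementation: it uses hard assignments, frequency-based distribution centroids, and the weight updates as reconstructed in Eqs. (10)-(13); all names are illustrative, and a small constant guards against division by zero:

```python
import random
from collections import Counter

def mk_prototypes(X, is_cat, views, k, mu=4.0, gamma=4.0, max_iter=20, seed=0):
    """Alternate the four sub-problem updates: assignments U, prototypes Z,
    variable weights R and view weights V (steps 1-6)."""
    rng = random.Random(seed)
    n, m, Q = len(X), len(X[0]), len(views)
    view_of = {s: t for t, G in enumerate(views) for s in G}
    R = [1.0 / len(views[view_of[s]]) for s in range(m)]   # variable weights
    V = [1.0 / Q] * Q                                      # view weights
    # Step 1: k distinct data objects become the initial prototypes
    Z = [[{X[i][s]: 1.0} if is_cat[s] else X[i][s] for s in range(m)]
         for i in rng.sample(range(n), k)]

    def d(x, z, s):  # per-variable distance (numeric / categorical phi)
        return sum(w for c, w in z.items() if c != x) if is_cat[s] else abs(x - z)

    U = [0] * n
    for _ in range(max_iter):
        # Sub-problem 1: assign each object to its closest prototype
        for i in range(n):
            U[i] = min(range(k), key=lambda o: sum(
                V[view_of[s]] * R[s] * d(X[i][s], Z[o][s], s) for s in range(m)))
        # Sub-problem 2: mean / distribution centroid per cluster
        for o in range(k):
            members = [i for i in range(n) if U[i] == o]
            if not members:
                continue
            for s in range(m):
                col = [X[i][s] for i in members]
                Z[o][s] = ({c: v / len(col) for c, v in Counter(col).items()}
                           if is_cat[s] else sum(col) / len(col))
        # Sub-problem 3: variable weights, Eqs. (10)-(11)
        D = [sum(V[view_of[s]] * d(X[i][s], Z[U[i]][s], s) for i in range(n)) + 1e-9
             for s in range(m)]
        for G in views:
            for s in G:
                R[s] = 1.0 / sum((D[s] / D[h]) ** (1.0 / gamma) for h in G)
        # Sub-problem 4: view weights, Eqs. (12)-(13)
        F = [sum(R[s] * d(X[i][s], Z[U[i]][s], s) for i in range(n) for s in G) + 1e-9
             for G in views]
        V = [1.0 / sum((F[t] / F[h]) ** (1.0 / mu) for h in range(Q)) for t in range(Q)]
    return U, Z, R, V
```

On a toy mixed dataset with two well-separated groups, the sketch recovers the grouping within a few iterations regardless of which objects seed the prototypes.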
IV. Experiments on the Performance of the MK-Prototypes Algorithm
In order to measure the performance of the proposed algorithm, it is used to cluster a real-world dataset, Heart (disease), taken from the UCI Machine Learning Repository.

The proposed algorithm is compared with the k-prototypes and SBAC algorithms, both well known for clustering mixed type data. In this paper, clustering accuracy is measured using one of the most commonly used criteria. The clustering accuracy r is given by

r = \frac{\sum_{i=1}^{k} a_i}{n}    (14)
where a_i is the number of data objects that occur in both the i-th cluster and its corresponding true class, and n is the number of data objects in the data set.
The higher the value of r, the higher the clustering accuracy; a perfect clustering gives r = 1.0.
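Eq. (14) can be computed by matching each cluster to a true class and counting the agreeing objects. A common choice, assumed here since the paper does not spell out the matching, is to pair each cluster with its majority class (illustrative sketch):

```python
from collections import Counter

def clustering_accuracy(pred, true):
    """r = (sum_i a_i) / n (Eq. (14)): a_i counts the objects that occur
    in both cluster i and the class matched to it (majority matching)."""
    n = len(true)
    total = 0
    for cluster in set(pred):
        members = [true[j] for j in range(n) if pred[j] == cluster]
        total += Counter(members).most_common(1)[0][1]
    return total / n
```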
A. Dataset description
The Heart disease data set is a mixed dataset containing 303 patient instances. The full data set contains 76 variables, of which 14 are usually considered. In the proposed algorithm, 19 of the 76 variables are considered here in order to define three views: seven numeric variables and twelve categorical variables.

These 19 variables can be naturally divided into 3 views:
1. Personal data view: the variables that describe a patient's personal data.
2. Historical data view: the variables that describe a patient's history, such as habits.
3. Test output view: the variables that describe the results of the various tests conducted on the patient.

Here, G_1, G_2, G_3 represent the personal, historical and test-output views, respectively.
B. Results and analysis
Below are graphical representations of the clustering results. Fig. 3 shows the variation in variable weights for varying γ values and fixed μ values, and Fig. 4 shows the variation in view weights for varying μ values and fixed γ values.
From Table 1, it is observed that as γ increases, the variance of the variable weights decreases rapidly. This result can be explained from equation (10): as γ increases, the variable weights become flatter. The graphical representation of Table 1 is shown below.

Table 1: Variable weights vs γ value for fixed μ value

γ     μ=1    μ=4    μ=12
10    0.01   0      0
15    0      0.01   0
20    0.7    0      0.05
25    0.03   0.4    0.1
30    0      0.1    0.02
35    0.02   0      0

Table 2 shows that as μ increases, the variance of the view weights decreases rapidly. This result can be explained from equation (12): as μ increases, the view weights become flatter. The graphical representation is shown below.

Fig 3: Variable weights vs γ value for fixed μ value (curves for μ = 1, 4, 12)
Table 2: View weights vs μ value for fixed γ value

μ     γ=1     γ=4     γ=12
10    0.05    0.075   0.01
20    0.075   0.14    0.015
30    0.095   0.05    0.04
40    0.16    0.06    0.01
50    0.04    0.07    0.005
60    0.05    0.04    0.02
70    0.07    0.035   0.01
From the above analysis, it can be summarized that the two types of weight distributions in the MK-Prototypes algorithm can be controlled by setting different values of γ and μ.

Figure 4: View weights vs μ value for fixed γ value (curves for γ = 1, 4, 12)

The experiments were conducted for three different values of μ and γ while varying the other parameter.
1. A large γ makes more variables contribute to the clustering, while a small γ makes only the important variables contribute.
2. A large μ makes more views contribute to the clustering, while a small μ makes only the important views contribute.
Table 3: Comparison of accuracy rates on the dataset considering all views

Algorithm        Clustering accuracy r
k-prototypes     0.521
SBAC             0.747
MK-Prototypes    0.846

From the above table, it is clear that the proposed algorithm achieves better clustering accuracy than the existing k-prototypes and SBAC algorithms.

V. Conclusion
Mixed type data are encountered everywhere in the real world. In this paper, a new multiview-based clustering algorithm for mixed type data has been proposed. Compared with existing algorithms, the proposed algorithm makes several significant contributions. It captures the characteristics of clusters with mixed type variables more effectively, since it includes the distribution information of both the numeric and the categorical variables.

It also takes into account the importance of the various variables and views during clustering, by using Huang's approach and a new dissimilarity measure.
It computes weights for views and for individual variables simultaneously during the clustering process. With the two types of weights, dense views and significant variables can be identified, and the effect of low-quality views and noisy variables can be reduced.

Because of these contributions, the proposed algorithm obtains higher clustering accuracy, which has been validated by the experimental results.
REFERENCES
[1] Z.X. Huang, Extensions to the k-means algorithm for clustering large datasets with categorical values, Data Min. Knowl. Discovery 2 (3) (1998) 283–304.
[2] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, New Jersey, 1988.
[3] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a survey, ACM Comput. Surv. 31 (3) (1999) 264–323.
[4] J.W. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2001.
[5] C. Hsu, Y.P. Huang, Incremental clustering of mixed data based on distance hierarchy, Expert Syst. Appl. 35 (3) (2008) 1177–1185.
[6] C. Hsu, S. Lin, W. Tai, Apply extended self-organizing map to cluster and classify mixed-type data, Neurocomputing 74 (18) (2011) 3832–3842.
[7] S. Lloyd, Least squares quantization in PCM, IEEE Trans. Inf. Theory 28 (2) (1982) 129–137.
[8] Z.X. Huang, Extensions to the k-means algorithm for clustering large datasets with categorical values, Data Min. Knowl. Discovery 2 (3) (1998) 283–304.
[9] Z.X. Huang, M.K. Ng, A fuzzy k-modes algorithm for clustering categorical data, IEEE Trans. Fuzzy Syst. 7 (4) (1999) 446–452.
[10] J. Mui, K. Fu, Automated classification of nucleated blood cells using a binary tree classifier, IEEE Trans. Pattern Analysis and Machine Intelligence 2 (5) (1980) 429–443.
[11] J. Wang, H. Zeng, Z. Chen, H. Lu, L. Tao, W. Ma, ReCoM: reinforcement clustering of multi-type interrelated data objects, Proc. 26th Ann. Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 274–281.
[12] S. Bickel, T. Scheffer, Multi-view clustering, Proc. IEEE Fourth Int'l Conf. on Data Mining, 2004, pp. 19–26.
[13] C. Li, G. Biswas, Unsupervised learning with mixed numeric and nominal data, IEEE Trans. Knowl. Data Eng. 14 (4) (2002) 673–690.
[14] D.W. Goodall, A new similarity index based on probability, Biometrics 22 (4) (1966) 882–907.
[15] C.C. Hsu, Y.C. Chen, Mining of mixed data with application to catalog marketing, Expert Syst. Appl. 32 (1) (2007) 12–27.
[16] C. Hsu, S. Lin, W. Tai, Apply extended self-organizing map to cluster and classify mixed-type data, Neurocomputing 74 (18) (2011) 3832–3842.
[17] S.P. Chatzis, A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional, Expert Syst. Appl. 38 (7) (2011) 8684–8689.
[18] Z.X. Huang, Clustering large datasets with mixed numeric and categorical values, Proc. First Pacific-Asia Conf. on Knowledge Discovery and Data Mining, 1997, pp. 21–34.
[19] J.C. Bezdek, J. Keller, R. Krisnapuram, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer Academic Publishers, Boston, 1999.
[20] Z. Zheng, M.G. Gong, J.J. Ma, L.C. Jiao, Unsupervised evolutionary clustering algorithm for mixed type data, Proc. IEEE Congress on Evolutionary Computation (CEC), 2010, pp. 1–8.
[21] W. Kim, K.H. Lee, D. Lee, Fuzzy clustering of categorical data using fuzzy centroids, Pattern Recognition Lett. 25 (11) (2004) 1263–1271.
[22] Z.X. Huang, M.K. Ng, H.Q. Rong, Z.C. Li, Automated variable weighting in k-means type clustering, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 657–668.
[23] X. Chen, X. Xu, J.Z. Huang, Y. Ye, TW-k-means: automated two-level variable weighting clustering algorithm for multiview data, IEEE Trans. Knowl. Data Eng. 25 (4) (2013) 932–945.
Editor IJCATR
ย 
Bj24390398
Bj24390398Bj24390398
Bj24390398
IJERA Editor
ย 

Similar to MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data (20)

Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
ย 
K044055762
K044055762K044055762
K044055762
ย 
47 292-298
47 292-29847 292-298
47 292-298
ย 
A03202001005
A03202001005A03202001005
A03202001005
ย 
I017235662
I017235662I017235662
I017235662
ย 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
ย 
Max stable set problem to found the initial centroids in clustering problem
Max stable set problem to found the initial centroids in clustering problemMax stable set problem to found the initial centroids in clustering problem
Max stable set problem to found the initial centroids in clustering problem
ย 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
ย 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
ย 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
ย 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
ย 
Clustering Using Shared Reference Points Algorithm Based On a Sound Data Model
Clustering Using Shared Reference Points Algorithm Based On a Sound Data ModelClustering Using Shared Reference Points Algorithm Based On a Sound Data Model
Clustering Using Shared Reference Points Algorithm Based On a Sound Data Model
ย 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
ย 
One Graduate Paper
One Graduate PaperOne Graduate Paper
One Graduate Paper
ย 
Clustering techniques final
Clustering techniques finalClustering techniques final
Clustering techniques final
ย 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
ย 
Image Segmentation Using Two Weighted Variable Fuzzy K Means
Image Segmentation Using Two Weighted Variable Fuzzy K MeansImage Segmentation Using Two Weighted Variable Fuzzy K Means
Image Segmentation Using Two Weighted Variable Fuzzy K Means
ย 
Bj24390398
Bj24390398Bj24390398
Bj24390398
ย 
A New Framework for Kmeans Algorithm by Combining the Dispersions of Clusters
A New Framework for Kmeans Algorithm by Combining the Dispersions of ClustersA New Framework for Kmeans Algorithm by Combining the Dispersions of Clusters
A New Framework for Kmeans Algorithm by Combining the Dispersions of Clusters
ย 
A0310112
A0310112A0310112
A0310112
ย 

More from IJMER

Fabrication & Characterization of Bio Composite Materials Based On Sunnhemp F...
Fabrication & Characterization of Bio Composite Materials Based On Sunnhemp F...Fabrication & Characterization of Bio Composite Materials Based On Sunnhemp F...
Fabrication & Characterization of Bio Composite Materials Based On Sunnhemp F...
IJMER
ย 
Material Parameter and Effect of Thermal Load on Functionally Graded Cylinders
Material Parameter and Effect of Thermal Load on Functionally Graded CylindersMaterial Parameter and Effect of Thermal Load on Functionally Graded Cylinders
Material Parameter and Effect of Thermal Load on Functionally Graded Cylinders
IJMER
ย 

More from IJMER (20)

A Study on Translucent Concrete Product and Its Properties by Using Optical F...
A Study on Translucent Concrete Product and Its Properties by Using Optical F...A Study on Translucent Concrete Product and Its Properties by Using Optical F...
A Study on Translucent Concrete Product and Its Properties by Using Optical F...
ย 
Developing Cost Effective Automation for Cotton Seed Delinting
Developing Cost Effective Automation for Cotton Seed DelintingDeveloping Cost Effective Automation for Cotton Seed Delinting
Developing Cost Effective Automation for Cotton Seed Delinting
ย 
Study & Testing Of Bio-Composite Material Based On Munja Fibre
Study & Testing Of Bio-Composite Material Based On Munja FibreStudy & Testing Of Bio-Composite Material Based On Munja Fibre
Study & Testing Of Bio-Composite Material Based On Munja Fibre
ย 
Hybrid Engine (Stirling Engine + IC Engine + Electric Motor)
Hybrid Engine (Stirling Engine + IC Engine + Electric Motor)Hybrid Engine (Stirling Engine + IC Engine + Electric Motor)
Hybrid Engine (Stirling Engine + IC Engine + Electric Motor)
ย 
Fabrication & Characterization of Bio Composite Materials Based On Sunnhemp F...
Fabrication & Characterization of Bio Composite Materials Based On Sunnhemp F...Fabrication & Characterization of Bio Composite Materials Based On Sunnhemp F...
Fabrication & Characterization of Bio Composite Materials Based On Sunnhemp F...
ย 
Geochemistry and Genesis of Kammatturu Iron Ores of Devagiri Formation, Sandu...
Geochemistry and Genesis of Kammatturu Iron Ores of Devagiri Formation, Sandu...Geochemistry and Genesis of Kammatturu Iron Ores of Devagiri Formation, Sandu...
Geochemistry and Genesis of Kammatturu Iron Ores of Devagiri Formation, Sandu...
ย 
Experimental Investigation on Characteristic Study of the Carbon Steel C45 in...
Experimental Investigation on Characteristic Study of the Carbon Steel C45 in...Experimental Investigation on Characteristic Study of the Carbon Steel C45 in...
Experimental Investigation on Characteristic Study of the Carbon Steel C45 in...
ย 
Non linear analysis of Robot Gun Support Structure using Equivalent Dynamic A...
Non linear analysis of Robot Gun Support Structure using Equivalent Dynamic A...Non linear analysis of Robot Gun Support Structure using Equivalent Dynamic A...
Non linear analysis of Robot Gun Support Structure using Equivalent Dynamic A...
ย 
Static Analysis of Go-Kart Chassis by Analytical and Solid Works Simulation
Static Analysis of Go-Kart Chassis by Analytical and Solid Works SimulationStatic Analysis of Go-Kart Chassis by Analytical and Solid Works Simulation
Static Analysis of Go-Kart Chassis by Analytical and Solid Works Simulation
ย 
High Speed Effortless Bicycle
High Speed Effortless BicycleHigh Speed Effortless Bicycle
High Speed Effortless Bicycle
ย 
Integration of Struts & Spring & Hibernate for Enterprise Applications
Integration of Struts & Spring & Hibernate for Enterprise ApplicationsIntegration of Struts & Spring & Hibernate for Enterprise Applications
Integration of Struts & Spring & Hibernate for Enterprise Applications
ย 
Microcontroller Based Automatic Sprinkler Irrigation System
Microcontroller Based Automatic Sprinkler Irrigation SystemMicrocontroller Based Automatic Sprinkler Irrigation System
Microcontroller Based Automatic Sprinkler Irrigation System
ย 
On some locally closed sets and spaces in Ideal Topological Spaces
On some locally closed sets and spaces in Ideal Topological SpacesOn some locally closed sets and spaces in Ideal Topological Spaces
On some locally closed sets and spaces in Ideal Topological Spaces
ย 
Natural Language Ambiguity and its Effect on Machine Learning
Natural Language Ambiguity and its Effect on Machine LearningNatural Language Ambiguity and its Effect on Machine Learning
Natural Language Ambiguity and its Effect on Machine Learning
ย 
Evolvea Frameworkfor SelectingPrime Software DevelopmentProcess
Evolvea Frameworkfor SelectingPrime Software DevelopmentProcessEvolvea Frameworkfor SelectingPrime Software DevelopmentProcess
Evolvea Frameworkfor SelectingPrime Software DevelopmentProcess
ย 
Material Parameter and Effect of Thermal Load on Functionally Graded Cylinders
Material Parameter and Effect of Thermal Load on Functionally Graded CylindersMaterial Parameter and Effect of Thermal Load on Functionally Graded Cylinders
Material Parameter and Effect of Thermal Load on Functionally Graded Cylinders
ย 
Studies On Energy Conservation And Audit
Studies On Energy Conservation And AuditStudies On Energy Conservation And Audit
Studies On Energy Conservation And Audit
ย 
An Implementation of I2C Slave Interface using Verilog HDL
An Implementation of I2C Slave Interface using Verilog HDLAn Implementation of I2C Slave Interface using Verilog HDL
An Implementation of I2C Slave Interface using Verilog HDL
ย 
Discrete Model of Two Predators competing for One Prey
Discrete Model of Two Predators competing for One PreyDiscrete Model of Two Predators competing for One Prey
Discrete Model of Two Predators competing for One Prey
ย 
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
ย 

Recently uploaded

Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar โ‰ผ๐Ÿ” Delhi door step de...
Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar  โ‰ผ๐Ÿ” Delhi door step de...Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar  โ‰ผ๐Ÿ” Delhi door step de...
Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar โ‰ผ๐Ÿ” Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
ย 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
ย 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
ย 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
ย 

Recently uploaded (20)

(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
ย 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
ย 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
ย 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
ย 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
ย 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
ย 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
ย 
Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar โ‰ผ๐Ÿ” Delhi door step de...
Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar  โ‰ผ๐Ÿ” Delhi door step de...Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar  โ‰ผ๐Ÿ” Delhi door step de...
Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar โ‰ผ๐Ÿ” Delhi door step de...
ย 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
ย 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
ย 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
ย 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ย 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
ย 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
ย 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
ย 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
ย 
22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf
ย 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
ย 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
ย 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ย 

MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data

effects of different views and of individual variables in clustering, view weights and variable weights are applied in the distance function. The view weights are computed over the complete set of variables, whereas the weight of a variable is computed only from the part of the data that belongs to its view. Thus the view weights indicate the significance of each view in the complete data, and the variable weights indicate the significance of each variable within its own view.

Abstract: Clustering mixed-type data is one of the major research topics in the area of data mining. In this paper, a new algorithm for clustering mixed-type data is proposed in which the concept of a distribution centroid is used to represent the prototype of the categorical variables in a cluster; this is then combined with the mean to represent the prototype of clusters with mixed-type variables. In the method, data are observed from different views and the variables are grouped into these views. Instances that can be viewed differently from different viewpoints are called multiview data. The usual clustering process ignores the differences among views. Here, view weights and variable weights are computed simultaneously: the view weight determines the closeness or density of a view, and the variable weight identifies the significance of each variable. Both weights are used in the distance function to determine the cluster of each object. The proposed method enhances k-prototypes so that it automatically computes both view and variable weights. The proposed MK-Prototypes algorithm is compared with two other clustering algorithms.

Keywords: clustering, mixed data, multiview, variable weighting, view weighting, k-prototypes.

II. Related Works
To date, a number of algorithms and methods exist that deal directly with mixed-type data. In [13], Cen Li and Gautam Biswas proposed Similarity-Based Agglomerative Clustering (SBAC), which works well for data with mixed attributes. It adopts a similarity measure proposed by Goodall [14] for biological taxonomy: when computing similarity, a higher weight is assigned to matches on infrequent attribute values, and no assumptions are made about the underlying characteristics of the attribute values. An agglomerative algorithm generates a dendrogram, and a simple distinctness heuristic extracts a partition of the data. Hsu and Chen proposed CAVE [15], a clustering algorithm based on Variance and Entropy for
clustering mixed data. It builds a distance hierarchy for every categorical attribute, which requires domain expertise. Hsu et al. [16] proposed an extension of the self-organizing map to analyze mixed data, in which the distance hierarchy is constructed automatically from the values of the class attributes. In [17] Chatzis proposed the KL-FCM-GM algorithm, which assumes that the data in each cluster are Gaussian and is designed for Gauss-Multinomial distributed data. Huang presented the k-prototypes algorithm [18], in which k-means is integrated with k-modes to partition mixed data. Bezdek et al. considered the fuzzy nature of objects in the fuzzy k-prototypes [19], and Zheng et al. [20] proposed an evolutionary k-prototypes algorithm by introducing an evolutionary algorithm framework.

III. Proposed System
The motivation for the proposed system is, on one hand, to provide a better representation of the categorical part of mixed data, since the numerical variables can already be represented well by the mean; on the other hand, it considers the importance of view weights and variable weights in the clustering process. The concept of the distribution centroid represents the cluster centroid for the categorical variables, and Huang's strategy of evaluation is used for the computation of both view weights and variable weights.

A. The distribution centroid
The idea of the distribution centroid, a better representation for categorical variables, is inspired by the fuzzy centroid proposed by Kim et al. [21]. It uses a fuzzy scenario to represent the cluster centers for the categorical part of the data.
Let $\mathrm{Dom}(V_j) = \{v_j^1, v_j^2, \ldots, v_j^t\}$ be the domain of the $j$th categorical variable. The distribution centroid of a cluster $o$, denoted $C'_o$, is represented as

$C'_o = [c'_{o1}, c'_{o2}, \ldots, c'_{oj}, \ldots, c'_{om}]$   (1)

where

$c'_{oj} = \{(v_j^1, w_{oj}^1), (v_j^2, w_{oj}^2), \ldots, (v_j^k, w_{oj}^k), \ldots, (v_j^t, w_{oj}^t)\}$   (2)

In the above equation,

$w_{oj}^k = \sum_{i=1}^{n} \mu(x_{ij})$   (3)

where

$\mu(x_{ij}) = \begin{cases} u_{io} \,/\, \sum_{i=1}^{n} u_{io} & \text{if } x_{ij} = v_j^k \\ 0 & \text{if } x_{ij} \neq v_j^k \end{cases}$   (4)

Here, $u_{io}$ is 1 if the data object $x_i$ belongs to cluster $o$, and 0 otherwise. From these equations it is clear that the computation of the distribution centroid counts how often each categorical value occurs in a cluster: $w_{oj}^k$ is the fraction of the objects in cluster $o$ that take the value $v_j^k$. The center of a cluster thus takes into account the distribution features of its categorical variables.

B. Weight calculation using Huang's approach
The weight of a variable identifies the effect of that variable on the clustering process. In 2005, Huang et al. proposed an approach to calculate variable weights [22] in which the weights are computed by minimizing an objective function. The criterion for assigning the weight of a variable is to allocate a larger value to a variable that has a smaller sum of within-cluster distances (WCD), and vice versa. This principle is given by

$w_j \propto 1 / D_j$   (5)

where $w_j$ is the weight of variable $j$ and $D_j$ is the sum of the within-cluster distances for this variable.
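As a concrete illustration, with hard (0/1) memberships the distribution centroid of Eqs. (1)-(4) for a single categorical variable reduces to the relative frequency of each category value inside the cluster. The sketch below is illustrative only; the function name and toy data are not from the paper:

```python
from collections import Counter

def distribution_centroid(values, membership, cluster):
    """Distribution centroid (Eqs. 1-4) for one categorical variable:
    each category value observed in the cluster is paired with its
    relative frequency w_oj^k among the cluster's objects."""
    in_cluster = [v for v, u in zip(values, membership) if u == cluster]
    counts = Counter(in_cluster)
    n = len(in_cluster)
    # w_oj^k = (objects in cluster o with value v_j^k) / |cluster o|
    return {v: c / n for v, c in counts.items()}

# toy data: one categorical variable over five objects, two clusters
vals = ["red", "red", "blue", "green", "blue"]
memb = [0, 0, 0, 1, 1]
print(distribution_centroid(vals, memb, 0))  # red -> 2/3, blue -> 1/3
```

Because the weights are frequencies, they sum to 1 for each variable, which is what makes the categorical distance of Section III-E well scaled.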
C. Multiview concept
FIGURE 1: Multiview concept

In 2013, Chen et al. [23] proposed TW-k-means, in which the concept of multiview data was introduced; Figure 1 illustrates the concept. Ordinary clustering does not consider the differences among views. In multiview clustering, in addition to variable weights, the variables are grouped according to their characteristic properties; each group is termed a view, and a weight is assigned to each view according to Huang's approach.

D. The proposed algorithm
The proposed algorithm, MK-Prototypes, puts together the concepts of Sections III-A, III-B and III-C. Figure 2 describes the steps involved:
1. Compute the distribution centroid to represent the centroid of the categorical variables.
2. Compute the mean of the numerical variables.
3. Integrate the distribution centroid and the mean to represent the prototype of the mixed data.
4. Compute the view weights and variable weights.
5. Measure the similarity between the data objects and the prototypes.
6. Assign each data object to the prototype to which it is closest.
7. Repeat steps 1-6 until an effective clustering result is obtained.

E. The optimization model
The clustering process that partitions the dataset X into k clusters while considering both view weights and variable weights is represented, following the framework of [23], as the minimization of the following objective function.
๐‘ƒ ๐‘ˆ, ๐‘, ๐‘…, ๐‘‰ = ๐‘ข๐‘–,๐‘œ ๐‘ฃ๐‘ก ๐‘Ÿ๐‘  ๐‘‘(๐‘ฅ๐‘–,๐‘  ๐‘ โˆˆ๐บ๐‘ก ๐‘„ ๐‘ก=1 , ๐‘› ๐‘–=1 ๐‘ง ๐‘œ,๐‘  ๐‘˜ ๐‘œ=1 ) (6) subject to ๐‘ข๐‘–,๐‘œ = 1๐‘˜ ๐‘œ=1 , ๐‘ข๐‘–,๐‘™ โˆˆ 0,1 , 1 โ‰ค ๐‘– โ‰ค ๐‘› ๐‘ฃ๐‘ก = 1, 0 โ‰ค ๐‘ฃ๐‘ก ๐‘„ ๐‘–=1 โ‰ค 1, 0 โ‰ค ๐‘Ÿ๐‘— โ‰ค 1, 1 โ‰ค ๐‘ก โ‰ค ๐‘„, ๐‘Ÿ๐‘— = 1 ๐‘—โˆˆ๐บ๐‘ก where U is an n x k partition matrix whose elements ๐‘ข๐‘–,๐‘œ are binary where ๐‘ข๐‘–,๐‘œ = 1 indicates that object i is allocated to cluster o.๐‘ = {๐‘1, ๐‘2, โ€ฆ . ๐‘ ๐‘˜ } is a set of k vectors on behalf of the centers of the k clusters.๐‘‰ = {๐‘‰1, ๐‘‰2, โ€ฆ . ๐‘‰๐‘„ } are Q weights for Q views. ๐‘… = {๐‘Ÿ1, ๐‘Ÿ2โ€ฆ๐‘Ÿ๐‘ } are s weights for s variables.๐‘‘(๐‘ฅ๐‘–,๐‘ , ๐‘ง ๐‘œ,๐‘ ) is a distance or dissimilarity measure on the ๐‘  ๐‘กโ„Ž variable between the ๐‘– ๐‘กโ„Ž object and the center of the ๐‘œ ๐‘กโ„Ž cluster.
FIGURE 2: Flowchart for the proposed algorithm

To minimize Equation (6), the problem is divided into four sub-problems:
1. Sub-problem 1: fix Z = Z^, R = R^ and V = V^, and solve the reduced problem P(U, Z^, R^, V^).
2. Sub-problem 2: fix U = U^, R = R^ and V = V^, and solve the reduced problem P(U^, Z, R^, V^).
3. Sub-problem 3: fix Z = Z^, U = U^ and V = V^, and solve the reduced problem P(U^, Z^, R, V^).
4. Sub-problem 4: fix Z = Z^, R = R^ and U = U^, and solve the reduced problem P(U^, Z^, R^, V).

Sub-problem 1 is solved by

$u_{i,o} = 1$   (7)

if

$\sum_{s=1}^{m} v_t \, r_s \, d(x_{i,s}, z_{o,s}) \le \sum_{s=1}^{m} v_t \, r_s \, d(x_{i,s}, z_{e,s}), \quad 1 \le e \le k$   (8)

and $u_{i,e} = 0$ for $e \neq o$.

Sub-problem 2 is solved for the numeric variables by
$z_{o,s} = \dfrac{\sum_{i=1}^{n} u_{i,o} \, x_{i,s}}{\sum_{i=1}^{n} u_{i,o}}$   (9)

and for the categorical variables by $z_{o,s} = c'_{o,s}$, the distribution centroid defined earlier. The distance is

$d(x_{i,s}, z_{o,s}) = |x_{i,s} - z_{o,s}|$ if the $s$th variable is numeric, and
$d(x_{i,s}, z_{o,s}) = \varphi(x_{i,s}, z_{o,s})$ if the $s$th variable is categorical,

where $\varphi(x_{i,s}, z_{o,s}) = \sum_{k=1}^{t} \delta(x_{i,s}, v_j^k)$, with $\delta(x_{i,s}, v_j^k) = 0$ if $x_{i,s} \neq v_j^k$ and $\delta(x_{i,s}, v_j^k) = w_{o,j}^k$ if $x_{i,s} = v_j^k$.

Sub-problem 3 is solved as follows. Let Z = Z^, U = U^ and V = V^ be fixed. Then the reduced problem P(U^, Z^, R, V^) is minimized if

$r_s = \dfrac{1}{\sum_{h \in G_t} \left( D_s / D_h \right)^{1/\gamma}}$   (10)

where

$D_s = \sum_{o=1}^{k} \sum_{i=1}^{n} u'_{i,o} \, v'_t \, d(x_{i,s}, z'_{o,s})$   (11)

Sub-problem 4 is solved by

$v_t = \dfrac{1}{\sum_{h=1}^{Q} \left( F_t / F_h \right)^{1/\mu}}$   (12)

where

$F_t = \sum_{o=1}^{k} \sum_{i=1}^{n} \sum_{s \in G_t} u'_{i,o} \, r'_s \, d(x_{i,s}, z'_{o,s})$   (13)

Having presented the computations required for the key quantities, the proposed MK-Prototypes algorithm can be described as follows:
1. Choose the number of iterations, the number of clusters k, and the values of μ and γ; randomly choose k distinct data objects and convert them into initial prototypes; initialize the view weights and variable weights.
2. Fix Z', R', V' as Z^t, R^t, V^t respectively and minimize the problem P(U, Z', R', V') to obtain U^{t+1}.
3.
Fix U', R', V' as U^t, R^t, V^t respectively and minimize the problem P(U', Z, R', V') to obtain Z^{t+1}.
4. Fix U', Z', V' as U^t, Z^t, V^t respectively and minimize the problem P(U', Z', R, V') to obtain R^{t+1}.
5. Fix U', Z', R' as U^t, Z^t, R^t respectively and minimize the problem P(U', Z', R', V) to obtain V^{t+1}.
6. If there is no improvement in P, or if the maximum number of iterations is reached, stop. Otherwise increment t by 1, decrement the number of remaining iterations by 1, and go to Step 2.

IV. Experiments on the Performance of the MK-Prototypes Algorithm
To measure the performance of the proposed algorithm, it is used to cluster a real-world dataset, Heart (disease), taken from the UCI Machine Learning Repository. The proposed algorithm is compared with the k-prototypes and SBAC algorithms, both well known for clustering mixed-type data. In this paper, clustering accuracy is measured using one of the most commonly used criteria:

$r = \dfrac{\sum_{i=1}^{k} a_i}{n}$   (14)
In Eq. (14), a_i is the number of data objects that occur in both the ith cluster and its corresponding true class, and n is the number of data objects in the data set. The higher the value of r, the higher the clustering accuracy; a perfect clustering gives r = 1.0.

A. Dataset description
The Heart disease data set is a mixed data set containing 303 patient instances. The full data set contains 76 variables, of which 14 are usually considered. To define three views for the proposed algorithm, 19 of the 76 variables are considered here: seven numeric variables and twelve categorical variables. These 19 variables divide naturally into three views:
1. Personal data view: the variables that describe a patient's personal data.
2. Historical data view: the variables that describe a patient's history, such as habits.
3. Test output view: the variables that describe the results of the various tests conducted on the patient.
Here G_1, G_2 and G_3 represent the personal, historical and test-output views respectively.

B. Results and analysis
The clustering results are represented graphically below. Fig 3 plots the variable weights against γ for the three μ values, and Fig 4 plots the view weights against μ for the three γ values. From Table 1, it is observed that as μ increased, the variance of the variable weights decreased rapidly. This result can be explained from equation (10): as μ increases, the variable-weight distribution becomes flatter. The graphical representation of Table 1 is given in Fig 3.

Table 1: Variable weights vs γ value for fixed μ value

    γ      μ=1     μ=4     μ=12
    10     0.01    0       0
    15     0       0.01    0
    20     0.7     0       0.05
    25     0.03    0.4     0.1
    30     0       0.1     0.02
    35     0.02    0       0

Fig 3: Variable weights vs γ value for fixed μ value

Table 2 shows that as γ increased, the variance of the view weights decreased rapidly. This result can be explained from equation (11): as γ increases, the view-weight distribution becomes flatter.
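This flattening effect can be illustrated numerically with closed-form weights of the shape used in Eqs. (10) and (12). The sketch below uses illustrative cost values; the function name and the sample costs are assumptions, not from the paper.

```python
import numpy as np

def closed_form_weights(costs, exponent):
    """Weights of the form weight_s = 1 / sum_h (cost_s / cost_h)**(1/exponent),
    as in Eqs. (10) and (12): a smaller cost yields a larger weight, and a
    larger exponent (gamma or mu) flattens the weight distribution."""
    costs = np.asarray(costs, dtype=float)
    return np.array([1.0 / np.sum((c / costs) ** (1.0 / exponent)) for c in costs])

costs = [1.0, 2.0, 4.0]                    # illustrative per-variable costs D_s
peaked = closed_form_weights(costs, 1.0)   # small exponent: weights concentrate
flat = closed_form_weights(costs, 12.0)    # large exponent: weights flatten

# In both cases the weights sum to 1, but the variance shrinks as the
# exponent grows, i.e. more variables (or views) contribute.
```

This mirrors the observation in the tables: increasing the exponent parameter spreads the weight more evenly over the variables or views.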
Table 2: View weights vs μ value for fixed γ value

    μ      γ=1     γ=4     γ=12
    10     0.05    0.075   0.01
    20     0.075   0.14    0.015
    30     0.095   0.05    0.04
    40     0.16    0.06    0.01
    50     0.04    0.07    0.005
    60     0.05    0.04    0.02
    70     0.07    0.035   0.01

Figure 4: View weights vs μ value for fixed γ value

From the above analysis, it can be summarized that the two types of weight distributions in the MK-Prototypes algorithm can be controlled by setting different values of γ and μ. The experiments were conducted for three different values of μ and γ, over varying values of γ and μ respectively.
1. A large μ makes more variables contribute to the clustering, while a small μ makes only the important variables contribute.
2. A large γ makes more views contribute to the clustering, while a small γ makes only the important views contribute.

Table 3: Comparison of accuracy rates on the dataset considering all views

    Algorithm        Clustering accuracy
    k-prototypes     0.521
    SBAC             0.747
    MK-Prototypes    0.846

From Table 3, it is clear that the proposed algorithm achieves better clustering accuracy than the existing k-prototypes and SBAC algorithms.

V. Conclusion
Mixed-type data are encountered everywhere in the real world. In this paper, a new multi-viewpoint-based clustering algorithm for mixed-type data has been proposed. Compared with the existing algorithms, the proposed algorithm makes several significant contributions. It encapsulates the characteristics of clusters with mixed-type variables more effectively, since it includes the distribution information of both the numeric and the categorical variables. It also takes into account the importance of the various variables and views during the clustering process, by using Huang's approach and a new dissimilarity measure.
It can compute the weights for views and for individual variables simultaneously in the clustering process. With these two types of weights, dense views and significant variables can be identified, and the effect of low-quality views and noise variables can be reduced. Because of these contributions, the proposed algorithm obtains higher clustering accuracy, which has been validated by the experimental results.

REFERENCES
[1] Z.X. Huang, Extensions to the k-means algorithm for clustering large datasets with categorical values, Data Min. Knowl. Discovery 2 (3) (1998) 283–304.
[2] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, New Jersey, 1988.
[3] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a survey, ACM Comput. Surv. 31 (3) (1999) 264–323.
[4] J.W. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2001.
[5] C. Hsu, Y.P. Huang, Incremental clustering of mixed data based on distance hierarchy, Expert Syst. Appl. 35 (3) (2008) 1177–1185.
[6] C. Hsu, S. Lin, W. Tai, Apply extended self-organizing map to cluster and classify mixed-type data, Neurocomputing 74 (18) (2011) 3832–3842.
[7] S. Lloyd, Least squares quantization in PCM, IEEE Trans. Inf. Theory 28 (2) (1982) 129–137.
[8] Z.X. Huang, Extensions to the k-means algorithm for clustering large datasets with categorical values, Data Min. Knowl. Discovery 2 (3) (1998) 283–304.
[9] Z.X. Huang, M.K. Ng, A fuzzy k-modes algorithm for clustering categorical data, IEEE Trans. Fuzzy Syst. 7 (4) (1999) 446–452.
[10] J. Mui, K. Fu, Automated classification of nucleated blood cells using a binary tree classifier, IEEE Trans. Pattern Anal. Mach. Intell. 2 (5) (1980) 429–443.
[11] J. Wang, H. Zeng, Z. Chen, H. Lu, L. Tao, W. Ma, ReCoM: reinforcement clustering of multi-type interrelated data objects, in: Proc. 26th Ann. Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 274–281.
[12] S. Bickel, T. Scheffer, Multi-view clustering, in: Proc. Fourth IEEE Int'l Conf. on Data Mining, 2004, pp. 19–26.
[13] C. Li, G. Biswas, Unsupervised learning with mixed numeric and nominal data, IEEE Trans. Knowl. Data Eng. 14 (4) (2002) 673–690.
[14] D.W. Goodall, A new similarity index based on probability, Biometrics 22 (4) (1966) 882–907.
[15] C.C. Hsu, Y.C. Chen, Mining of mixed data with application to catalog marketing, Expert Syst. Appl. 32 (1) (2007) 12–27.
[16] C. Hsu, S. Lin, W. Tai, Apply extended self-organizing map to cluster and classify mixed-type data, Neurocomputing 74 (18) (2011) 3832–3842.
[17] S.P. Chatzis, A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional, Expert Syst. Appl. 38 (7) (2011) 8684–8689.
[18] Z.X. Huang, Clustering large datasets with mixed numeric and categorical values, in: Proceedings of the First Pacific-Asia Knowledge Discovery and Data Mining Conference, 1997, pp. 21–34.
[19] J.C. Bezdek, J. Keller, R. Krisnapuram, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer Academic Publishers, Boston, 1999.
[20] Z. Zheng, M.G. Gong, J.J. Ma, L.C. Jiao, Unsupervised evolutionary clustering algorithm for mixed type data, in: Proceedings of the IEEE Congress on Evolutionary Computation (CEC), 2010, pp. 1–8.
[21] W. Kim, K.H. Lee, D. Lee, Fuzzy clustering of categorical data using fuzzy centroid, Pattern Recognition Lett. 25 (11) (2004) 1263–1271.
[22] Z.X. Huang, M.K. Ng, H.Q. Rong, Z.C. Li, Automated variable weighting in k-means type clustering, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 657–668.
[23] X. Chen, X. Xu, J.Z. Huang, Y. Ye, TW-k-means: automated two-level variable weighting clustering algorithm for multiview data, IEEE Trans. Knowl. Data Eng. 25 (4) (2013) 932–945.