Comparison Between Clustering Algorithms for Microarray Data Analysis
Thesis (presentation)
Nguyen Gia Toan
Nguyen Lam Vu Tuan
Advisor: Dr. Nguyen Dinh Thuan
Data mining in healthcare
Improved k-means algorithm
2. Outline
1. Introduction
2. K-means
3. Improved k-means
3.1. Dealing with mixed categorical and numeric data
3.2. Building initial cluster centers
3.3. Determining appropriate k
3.4. Improved k-means algorithm
3.5. Complexity
4. Cluster analysis tool
5. Analysis and results
6. Conclusion
3. 1. Introduction
Data mining in healthcare is a constant concern. Dealing with the
diversity of the data is what drives researchers to develop new
algorithms.
In the field of disease prediction, the analyzed data often contain
the status and habits of patients, with different data types, and
each record has no class label in advance.
→ Clustering is applied in this area.
4. 1. Introduction (cont.)
Why k-means?
One of the most widely used methods to partition a dataset into
groups of patterns.
Easy to understand and easy to set up, allowing researchers to
extend it in flexible ways.
The k-means method also has many weaknesses, depending on the
properties of the collected data.
5. 2. K-means
Algorithm:
1. Input: the number of clusters k, and a dataset D containing n
objects.
2. Randomly choose k objects from D as the initial cluster centers;
3. Repeat
4. (Re)assign each object to the cluster to which it is most
similar, based on the distance between the object and the mean
value of the objects in the cluster;
5. Calculate the new mean value of the objects for each cluster;
6. Until no change;
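The steps above can be sketched in a few lines. Our tool is written in C#; this is only a minimal Python sketch of the same loop, run on hypothetical toy 2-D data:

```python
import random

def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(data, k, max_iter=100, seed=0):
    """Traditional k-means: random initial centers, then reassign/recompute."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)          # step 2: random initial centers
    assignment = [None] * len(data)
    for _ in range(max_iter):
        # step 4: (re)assign each object to its nearest center
        new_assignment = [min(range(k), key=lambda c: euclidean(p, centers[c]))
                          for p in data]
        if new_assignment == assignment:   # step 6: stop when nothing changes
            break
        assignment = new_assignment
        # step 5: recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment

# two obvious groups, near (0, 0) and (10, 10)
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = kmeans(pts, 2)
```

On well-separated toy data like this, any random initialization converges to the two natural groups; the weaknesses on the next slides show up on less convenient data.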
6. 2. K-means (cont.)
Advantages:
One of the most widely used clustering methods.
Simple, and easily modified to deal with different scenarios.
Fast to compute.
7. 2. K-means (cont.)
Disadvantages:
1. The traditional k-means is limited to numeric data.
2. The initial centers are chosen randomly; a poor initialization
can lead to very poor clusters.
3. It is difficult to predict a suitable k.
8. 3.1. Dealing with mixed-type data
A method proposed by Ming-Yi Shih, Jar-Wen Jheng and Lien-Fu Lai
converts the items of categorical attributes into numeric values
based on the relationships among the items.
If two items always show up together in the same objects, there is
a strong similarity between them.
When a pair of categorical items has a higher similarity, they are
assigned closer numeric values.
10. 3.1. Dealing with mixed-type data (cont.)
4. Find the numeric attribute that minimizes the within-group
variance with respect to the base attribute.
5. Quantify every base item by assigning it the mean of the mapped
values in the selected numeric attribute.
6. Quantify all other categorical items.
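The co-occurrence idea behind the method can be illustrated with a sketch. This is a hypothetical simplification, not the authors' exact formulas: it scores a pair of categorical items by how often they appear together in the same record, normalized by the count of the rarer item, so that items which always show up together score 1.0.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_similarity(records):
    """Score every pair of (attribute, value) items by how often they
    appear together in the same record, normalized by the count of the
    rarer item (a simplified stand-in for the paper's similarity)."""
    pair_count, item_count = Counter(), Counter()
    for rec in records:
        items = sorted(rec.items())            # (attribute, value) pairs
        for item in items:
            item_count[item] += 1
        for a, b in combinations(items, 2):
            pair_count[(a, b)] += 1
    return {pair: cnt / min(item_count[pair[0]], item_count[pair[1]])
            for pair, cnt in pair_count.items()}

# toy records using two of the thesis attributes
records = [
    {"smoking": "current", "gender": "Male"},
    {"smoking": "current", "gender": "Male"},
    {"smoking": "never",   "gender": "Female"},
    {"smoking": "never",   "gender": "Male"},
]
sim = cooccurrence_similarity(records)
```

Here "current" and "Male" always co-occur, so their pair scores 1.0, while "never" and "Male" co-occur in only half of the "never" records and score 0.5; such scores are then turned into nearby numeric values.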
11. 3.1. Dealing with mixed types data (cont.)
Since every attribute in the dataset now contains only numeric
values, existing distance-based clustering algorithms can be
applied directly. For numeric data, the Euclidean distance is
often used.
12. 3.2. Determining initial cluster centers
The two-step method proposed by Ming-Yi Shih, Jar-Wen Jheng and
Lien-Fu Lai first uses agglomerative hierarchical clustering to
cluster the original dataset into a number of subsets, which become
the initial set of clusters for the k-means algorithm.
16. 3.3. Choosing appropriate k
D.T. Nguyen and H. Doan's approach: select k based on information
obtained during the k-means clustering operation itself.
New metric: two coefficients α and β.
19. 3.3. Choosing appropriate k (cont.)
A cluster needs to be split:
Two clusters need to be merged:
[Figure: Cluster 1 and Cluster 2, with the center of cluster 1,
ϕmin, and dmax marked]
20. 3.4. Improved k-means algorithm
Input: n objects and the number of clusters k (1 ≤ k ≤ n).
Apply agglomerative hierarchical clustering: place each object in
its own cluster; the two clusters with the closest distance are
merged into a larger cluster. Continue merging clusters until all
of the objects are in k clusters.
From then on, apply the k-means algorithm: compute the mean of the
objects in each cluster, then reassign objects to clusters.
Repeat the above step until there is no change.
Calculate αmax and βmax.
→ Based on αmax and βmax, we know whether k should be increased or
decreased.
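A minimal Python sketch of the overall loop, assuming a simple centroid-linkage merge for the hierarchical step (the α/β adjustment of k, whose formulas are on the earlier slides, is left out here):

```python
def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hierarchical_centers(data, k):
    """Agglomerative initialization: each object starts in its own
    cluster; repeatedly merge the two clusters whose centroids are
    closest, until k clusters remain. Returns the k centroids."""
    clusters = [[p] for p in data]
    while len(clusters) > k:
        cents = [[sum(col) / len(c) for col in zip(*c)] for c in clusters]
        i, j = min(((i, j) for i in range(len(cents))
                           for j in range(i + 1, len(cents))),
                   key=lambda ij: euclidean(cents[ij[0]], cents[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [[sum(col) / len(c) for col in zip(*c)] for c in clusters]

def improved_kmeans(data, k, max_iter=100):
    """k-means seeded with hierarchical centers instead of random ones."""
    centers = hierarchical_centers(data, k)
    assignment = [None] * len(data)
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: euclidean(p, centers[c]))
               for p in data]
        if new == assignment:              # stop when nothing changes
            break
        assignment = new
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment

pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = improved_kmeans(pts, 2)
```

Because the initial centers already lie inside the natural groups, the k-means phase converges in very few iterations and is deterministic, unlike the randomly seeded version.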
22. 4. Cluster analysis tool
We implemented data mining software in C# that clusters data into
groups using the improved k-means algorithm and, for comparison,
the traditional one.
The tool can also help decide a suitable number of clusters k.
24. 5. Analysis and results
Approximately one thousand patient records from the MQIC database,
the data that were used to develop the Health Visualizer.
Every object has 8 attributes: gender, age, diab, hypertension,
stroke, chd, smoking and BMI.
We assume the attributes have equal weight, and the distance
measure used is the Euclidean distance.
25. Data information
Name: Value
Gender: Male; Female
Age: Numeric
Diab: Binary
Hypertension: Binary
Stroke: Binary
Chd: Binary
Smoking: never; former; not current; current; ever
BMI: Numeric
26. Sample records
ID gender age diab hypertension stroke chd smoking BMI
1 Female 80 0 0 0 1 never 25.19
2 Female 36 0 0 0 0 current 23.45
3 Male 76 0 1 0 1 current 20.14
4 Female 44 1 0 0 0 never 19.31
5 Male 42 0 0 0 0 never 33.64
6 Female 54 0 0 0 0 former 54.7
7 Female 78 0 0 0 0 former 36.05
8 Female 67 0 0 1 0 never 25.69
9 Male 15 0 0 0 0 never 30.36
10 Female 42 0 0 0 0 never 24.48
... ... ... ... ... ... ... ... ...
29. Results (cont.)
The graph shows the variation of αmax, βmax and the
Davies-Bouldin index.
30. Results (cont.)
The suitable number of clusters is likely located where the red
line and the blue line indicate the same choice of the number of
clusters.
→ Choose k = 3. The similarity of the data objects within each
cluster is rather good, and the Davies-Bouldin index is smallest.
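The Davies-Bouldin index used in this comparison can be computed as below; this is a pure-Python sketch of the standard definition (lower is better), not the exact implementation in our tool:

```python
def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def davies_bouldin(data, labels):
    """Davies-Bouldin index: average, over clusters, of the worst-case
    ratio (scatter_i + scatter_j) / distance(center_i, center_j).
    Lower values mean more compact, better-separated clusters."""
    ks = sorted(set(labels))
    groups = {k: [p for p, l in zip(data, labels) if l == k] for k in ks}
    centers = {k: [sum(col) / len(g) for col in zip(*g)]
               for k, g in groups.items()}
    scatter = {k: sum(euclidean(p, centers[k]) for p in g) / len(g)
               for k, g in groups.items()}
    total = 0.0
    for i in ks:
        total += max((scatter[i] + scatter[j]) /
                     euclidean(centers[i], centers[j])
                     for j in ks if j != i)
    return total / len(ks)

pts  = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
good = [0, 0, 0, 1, 1, 1]   # matches the natural grouping
bad  = [0, 1, 0, 1, 0, 1]   # mixes the two groups together
```

On this toy data the good partition scores far lower than the bad one, which is why the smallest index on the graph points at the suitable k.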
33. 6. Conclusion
The advantages of the improved k-means algorithm:
It can handle mixed categorical and numeric data.
It provides good initial cluster means and reduces the number of
iterations of k-means, so we can obtain high-quality clusters
without having to run the traditional k-means many times.
α and β form a new basis for selecting the number of clusters k.
34. 6. Conclusion (cont.)
Disadvantage:
Because k-means is combined with the agglomerative hierarchical
clustering algorithm, which is slow and only suitable for small
and medium datasets, running time becomes the biggest disadvantage
of the new algorithm.
35. 6. Conclusion (cont.)
Limits:
The new method is only appropriate for the data collected in this
thesis, not for other kinds of data in the healthcare industry.
For large, multidimensional data, our program may not produce good
results.
Because of limited time, as well as the difficulty of keeping up
with the latest optimizations for hierarchical clustering, our
program still has many shortcomings.
36. 6. Conclusion (cont.)
Development orientation:
We propose several ways to improve the speed of our program (using
SLINK or CLINK), its flexibility for different kinds of datasets,
and its ability to handle unusual and missing data.
37. 6. Conclusion (cont.)
In data mining, successful clustering often depends on good data
rather than on good algorithms. If the dataset is huge and
unclear, the choice of clustering algorithm may not matter much in
terms of the resulting quality, so the algorithm can be chosen
based on speed or ease of use instead.
Iteration: the important point here is the improved k-means: the
three lines flatten out very quickly, by iteration 3 or 4. K: with
the traditional k-means, we cannot guess the appropriate k for
this dataset, because with random initial points the values of α
and β differ on each run.