Nguyen Gia Toan
Nguyen Lam Vu Tuan
Advisor: Dr. Nguyen Dinh Thuan
Data mining in healthcare
Improved k-means algorithm
1
Outline
1. Introduction
2. K-means
3. Improved k-means
3.1. Dealing with mixed categorical and numeric data
3.2. Building initial cluster centers
3.3. Determining appropriate k
3.4. Improved k-means algorithm
3.5. Complexity
4. Cluster analysis tool
5. Analysis and results
6. Conclusion
2
1. Introduction
 Data mining in healthcare is a constant concern; the diversity of the data is what drives researchers to develop new algorithms.
 In disease prediction, the analyzed data often describe patient status and habits in mixed data types, and records carry no class labels in advance.
→ Clustering is applied in this area.
3
1. Introduction (cont.)
Why k-means?
 One of the most widely used methods to partition a dataset into groups of patterns.
 Easy to understand and easy to implement, so researchers can extend it in flexible ways.
 However, the k-means method has many weaknesses, which we address based on the properties of the collected data.
4
2. K-means
Algorithm:
1. Input: the number of clusters k and a dataset D containing n objects.
2. Randomly choose k objects from D as the initial cluster
centers;
3. Repeat
4. (Re)assign each object to the cluster to which the object
is the most similar, based on the distance between the
object and the mean value of objects in the cluster;
5. Calculate the new mean value of the objects for each
cluster;
6. Until no change;
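A minimal C# sketch of these six steps, assuming numeric feature vectors and Euclidean distance; the names and structure are illustrative, not the thesis implementation:

```csharp
// Minimal k-means following the six steps above (illustrative sketch).
using System;
using System.Linq;

static class KMeansSketch
{
    // centers: k initial cluster centers (step 2 chooses them elsewhere).
    public static int[] Cluster(double[][] data, double[][] centers)
    {
        int n = data.Length, k = centers.Length;
        var assign = new int[n];
        bool changed = true;
        while (changed)                                 // step 6: until no change
        {
            changed = false;
            for (int i = 0; i < n; i++)                 // step 4: (re)assign objects
            {
                int best = Enumerable.Range(0, k)
                    .OrderBy(c => Distance(data[i], centers[c]))
                    .First();
                if (best != assign[i]) { assign[i] = best; changed = true; }
            }
            for (int c = 0; c < k; c++)                 // step 5: recompute means
            {
                var members = data.Where((_, i) => assign[i] == c).ToArray();
                if (members.Length == 0) continue;      // keep old center if empty
                for (int d = 0; d < centers[c].Length; d++)
                    centers[c][d] = members.Average(m => m[d]);
            }
        }
        return assign;
    }

    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
}
```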
5
2. K-means (cont.)
Advantages:
 One of the most widely used methods for clustering.
 Simple, can be easily modified to deal with different
scenarios.
 Fast to compute.
6
2. K-means (cont.)
Disadvantages:
1. The traditional k-means is limited to numeric data.
2. Initial centers are chosen randomly; a poor initialization can lead to very poor clusters.
3. The number of clusters k is difficult to choose in advance.
7
3.1. Dealing with mixed-type data
 A method proposed by Ming-Yi Shih, Jar-Wen Jheng
and Lien-Fu Lai converts items in categorical
attributes into numeric value based on the
relationships among items.
 If two items always show up together in the same objects, there is a strong similarity between them.
 When a pair of categorical items has a higher similarity, they are assigned closer numeric values, as sketched below.
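The co-occurrence idea can be made concrete with a small sketch — illustrative code, not the authors' exact formula — that measures how often item b accompanies item a in the same object:

```csharp
// Illustrative co-occurrence measure between items of two
// categorical attributes; not the published formula.
static class CoOccurrence
{
    // records[i] = (item of attribute A, item of attribute B) for object i.
    public static double Similarity(
        (string A, string B)[] records, string a, string b)
    {
        int together = 0, countA = 0;
        foreach (var r in records)
        {
            if (r.A == a) countA++;                  // objects containing a
            if (r.A == a && r.B == b) together++;    // ... that also contain b
        }
        // Conditional frequency: how often b accompanies a.
        return countA == 0 ? 0.0 : (double)together / countA;
    }
}
```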
8
3.1. Dealing with mixed-type data (cont.)
(Steps 1–3 — selecting a base categorical attribute and building co-occurrence-based similarities between its items — were given as formulas on the original slide.)
9
3.1. Dealing with mixed-type data (cont.)
4. Find the numeric attribute that minimizes the within-group variance with respect to the base attribute.
5. Quantify every base item by assigning it the mean of its mapped values in the selected numeric attribute.
6. Quantify all other categorical items.
10
3.1. Dealing with mixed-type data (cont.)
 Since all attributes in the dataset now contain only numeric values, existing distance-based clustering algorithms can be applied directly. For numeric data, Euclidean distance is often used.
11
3.2. Determining initial cluster centers
 The two-step method proposed by Ming-Yi Shih, Jar-Wen Jheng and Lien-Fu Lai first applies agglomerative hierarchical clustering to partition the original dataset into subsets, which become the initial clusters for the k-means algorithm (see the sketch below).
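A hedged sketch of that seeding step, merging until k clusters remain; centroid-distance linkage is an assumption here, since the slides do not name one:

```csharp
// Agglomerative merging down to k clusters (centroid linkage assumed).
using System;
using System.Collections.Generic;
using System.Linq;

static class InitialCenters
{
    public static List<List<double[]>> Agglomerate(double[][] data, int k)
    {
        // Start with each object in its own cluster.
        var clusters = data.Select(p => new List<double[]> { p }).ToList();
        while (clusters.Count > k)
        {
            int bi = 0, bj = 1;
            double best = double.MaxValue;
            for (int i = 0; i < clusters.Count; i++)       // find the closest pair
                for (int j = i + 1; j < clusters.Count; j++)
                {
                    double d = Distance(Mean(clusters[i]), Mean(clusters[j]));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            clusters[bi].AddRange(clusters[bj]);           // merge the pair
            clusters.RemoveAt(bj);
        }
        return clusters;                                   // k initial clusters
    }

    static double[] Mean(List<double[]> c) =>
        Enumerable.Range(0, c[0].Length)
                  .Select(d => c.Average(p => p[d])).ToArray();

    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
}
```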
12
3.2. Determining initial cluster centers (cont.)
(Dendrogram figure; vertical axis: dissimilarity.)
13
3.2. Determining initial cluster centers (cont.)
14
3.2. Determining initial cluster centers (cont.)
15
3.3. Choosing appropriate k
 D. T. Nguyen and H. Doan's approach: select k based on information obtained during the k-means clustering operation itself.
 A new metric: the two coefficients α and β.
16
3.3. Choosing appropriate k (cont.)
(The formulas defining α and β appeared on this and the following slide; they are not recoverable from the transcript.)
17
3.3. Choosing appropriate k (cont.)
(Continuation of the α and β formulas.)
18
3.3. Choosing appropriate k (cont.)
A cluster needs to be split: (criterion given as a formula on the original slide)
Two clusters need to be grouped: (criterion given as a formula on the original slide)
(Figure: Cluster 1 and Cluster 2, with the center of cluster 1 and the distances ϕmin and dmax labeled.)
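The exact α and β formulas did not survive the transcript, but the two labeled quantities can be sketched: dmax, the largest distance from a cluster's center to one of its members (a large value suggests splitting), and ϕmin, the smallest distance between two cluster centers (a small value suggests merging). Illustrative code only — the thesis builds its α/β criteria on top of quantities like these:

```csharp
// Illustrative computation of the two figure quantities.
using System;
using System.Linq;

static class SplitMergeSignals
{
    // Largest center-to-member distance within one cluster.
    public static double DMax(double[][] members, double[] center) =>
        members.Max(m => Distance(m, center));

    // Smallest distance between any two cluster centers (needs k >= 2).
    public static double PhiMin(double[][] centers)
    {
        double best = double.MaxValue;
        for (int i = 0; i < centers.Length; i++)
            for (int j = i + 1; j < centers.Length; j++)
                best = Math.Min(best, Distance(centers[i], centers[j]));
        return best;
    }

    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
}
```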
19
3.4. Improved k-means algorithm
 Input: n objects and the number of clusters k (1 ≤ k ≤ n).
 Apply agglomerative hierarchical clustering: place each object in its own cluster, then merge the two clusters with the closest distance into a larger cluster.
 Continue merging clusters until all objects are in k clusters.
 From there, apply the k-means algorithm: compute the mean of the objects in each cluster, then reassign objects to clusters.
 Repeat the above step until no change.
 Calculate αmax and βmax.
→ Based on αmax and βmax, we know whether k should be increased or decreased (one possible automation is sketched below).
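A control-flow sketch of the whole procedure in C#. The helper signatures are stand-ins for the routines sketched earlier and for the thesis's own α/β tests; the loop that adjusts k automatically is one possible reading of the final step, not the authors' stated design:

```csharp
// Control-flow sketch only: stubs stand in for the routines
// sketched earlier and for the thesis's α/β decision rules.
using System;

static class ImprovedKMeans
{
    public static int ChooseK(double[][] data, int k)
    {
        while (true)
        {
            double[][] centers = Agglomerate(data, k);   // hierarchical seeding
            int[] assign = KMeans(data, centers);        // refine with k-means
            double aMax = AlphaMax(data, assign, centers);
            double bMax = BetaMax(centers);
            if (ShouldSplit(aMax)) k++;                  // some cluster too loose
            else if (ShouldMerge(bMax)) k--;             // two clusters overlap
            else return k;                               // k is acceptable
        }
    }

    // Stubs: see the earlier sketches and the thesis for the real bodies.
    static double[][] Agglomerate(double[][] d, int k) => throw new NotImplementedException();
    static int[] KMeans(double[][] d, double[][] centers) => throw new NotImplementedException();
    static double AlphaMax(double[][] d, int[] a, double[][] c) => throw new NotImplementedException();
    static double BetaMax(double[][] c) => throw new NotImplementedException();
    static bool ShouldSplit(double aMax) => throw new NotImplementedException();
    static bool ShouldMerge(double bMax) => throw new NotImplementedException();
}
```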
20
3.5. Complexity
 The complexity expressions shown on this slide are not recoverable from the transcript. In standard terms, the agglomerative seeding step dominates: the naive algorithm takes O(n³) time (O(n² log n) with a priority queue) and O(n²) space, while the k-means phase costs O(n·k·t) for t iterations.
21
4. Cluster analysis tool
 We implemented a data mining tool in C# that clusters data into groups using the improved k-means algorithm and, for comparison, the traditional one.
 This tool can also help decide a suitable number of clusters k.
22
4. Cluster analysis tool (cont.)
 Demo
23
5. Analysis and results
 Approximately one thousand patient records from the MQIC database, the same data used to develop the Health Visualizer.
 Every object has 8 attributes: gender, age, diab, hypertension, stroke, chd, smoking and BMI.
 All attributes are assumed to have equal weight, and the distance measure is Euclidean distance.
24
Data information
Name Value
Gender Male; Female
Age Numeric
Diab Binary
Hypertension Binary
Stroke Binary
Chd Binary
Smoking never; former; not current; current; ever
BMI Numeric
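As a reference point, a hypothetical C# record mirroring this attribute table; the names follow the slide, and the types are inferred from the Value column:

```csharp
// Illustrative record type for one patient object (types assumed).
public record PatientRecord(
    string Gender,        // Male; Female
    double Age,           // numeric
    bool Diab,            // binary
    bool Hypertension,    // binary
    bool Stroke,          // binary
    bool Chd,             // binary
    string Smoking,       // never; former; not current; current; ever
    double Bmi);          // numeric
```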
25
Sample records
ID gender age diab hypertension stroke chd smoking BMI
1 Female 80 0 0 0 1 never 25.19
2 Female 36 0 0 0 0 current 23.45
3 Male 76 0 1 0 1 current 20.14
4 Female 44 1 0 0 0 never 19.31
5 Male 42 0 0 0 0 never 33.64
6 Female 54 0 0 0 0 former 54.7
7 Female 78 0 0 0 0 former 36.05
8 Female 67 0 0 1 0 never 25.69
9 Male 15 0 0 0 0 never 30.36
10 Female 42 0 0 0 0 never 24.48
... ... ... ... ... ... ... ... ...
26
Sample preprocessed records
ID gender age diab hypertension stroke chd smoking BMI
1 0.580 1 0.54 0.55 0.57 0.06 0 0.24
2 0.580 0.44 0.54 0.55 0.57 0.56 1 0.21
3 0.583 0.94 0.54 0.11 0.57 0.06 1 0.15
4 0.580 0.54 0.13 0.55 0.57 0.56 0 0.14
5 0.583 0.51 0.54 0.55 0.57 0.56 0 0.39
6 0.580 0.66 0.54 0.55 0.57 0.56 0.5 0.74
7 0.580 0.97 0.54 0.55 0.57 0.56 0.5 0.43
8 0.580 0.83 0.54 0.55 0.02 0.56 0 0.25
9 0.583 0.17 0.54 0.55 0.57 0.56 0 0.33
10 0.580 0.51 0.54 0.55 0.57 0.56 0 0.23
... ... ... ... ... ... ... ... ...
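In the preprocessed table, the categorical columns carry the numeric mapping from section 3.1, while the numeric columns (age, BMI) appear to be min-max scaled into [0,1] — e.g., age 80 → 1.00, age 15 → 0.17. A sketch under the assumption of min-max scaling:

```csharp
// Min-max scaling of a numeric column into [0,1]; an assumption
// about the preprocessing, consistent with the sample values above.
using System.Linq;

static class Normalize
{
    public static double[] MinMax(double[] column)
    {
        double lo = column.Min(), hi = column.Max();
        if (hi == lo) return column.Select(_ => 0.0).ToArray();
        return column.Select(v => (v - lo) / (hi - lo)).ToArray();
    }
}
```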
27
Results
 Statistics after running the improved k-means on 500 records
Clusters αmax βmax Davies-Bouldin index
2 1.500190508 0.628650915 0.508987479
3 1.500190508 0.628650915 0.508987479
4 1.490102736 0.642023258 0.615070732
5 1.492035333 0.727242247 0.886508179
6 1.495168725 0.842329214 0.888768971
7 1.47234299 0.903728206 0.941857373
8 1.456478952 0.867580149 0.973333409
9 1.483208611 0.91568044 0.913030254
10 1.482619659 0.890154561 1.050418667
28
Results (cont.)
 The graph shows how αmax, βmax and the Davies-Bouldin index vary with the number of clusters
29
Results (cont.)
 The suitable number of clusters likely lies where the red and blue lines in the graph intersect.
→ Choose k = 3. The similarity of the data objects within each cluster is rather good, and the Davies-Bouldin index is smallest.
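The Davies-Bouldin index reported in the table is a standard cluster-validity measure (lower is better). A self-contained sketch of its usual definition — not necessarily the exact implementation used in the tool:

```csharp
// Davies-Bouldin index (standard definition, needs k >= 2):
// S_i = mean member-to-center distance, M_ij = center-to-center distance,
// DB = (1/k) * sum_i max_{j != i} (S_i + S_j) / M_ij.
using System;
using System.Linq;

static class DaviesBouldin
{
    public static double Index(double[][] data, int[] assign, double[][] centers)
    {
        int k = centers.Length;
        var s = new double[k];
        for (int c = 0; c < k; c++)
        {
            var members = data.Where((_, i) => assign[i] == c).ToArray();
            s[c] = members.Length == 0 ? 0
                 : members.Average(m => Distance(m, centers[c]));
        }
        double sum = 0;
        for (int i = 0; i < k; i++)
            sum += Enumerable.Range(0, k).Where(j => j != i)
                   .Max(j => (s[i] + s[j]) / Distance(centers[i], centers[j]));
        return sum / k;
    }

    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
}
```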
30
Algorithm evaluation
31
32
6. Conclusion
The advantages of improved k-means algorithm:
 Can handle mixed categorical and numeric data.
 Provides good initial cluster means and reduces the number of k-means iterations, so high-quality clusters can be obtained without running the traditional k-means many times.
 α and β provide a new basis for selecting the number of clusters k.
33
6. Conclusion (cont.)
Disadvantage:
 Because k-means is combined with the agglomerative hierarchical clustering algorithm, which is slow and suitable only for small and medium datasets, running time is the biggest disadvantage of the new algorithm.
34
6. Conclusion (cont.)
Limits:
 The new method is only appropriate for the data collected in this thesis, not for other kinds of healthcare data. For large, multidimensional data, our program may not produce good results.
 Because of limited time, and the difficulty of incorporating the latest optimizations for hierarchical clustering, our program still has many shortcomings.
35
6. Conclusion (cont.)
Development orientation:
 We propose several ways to improve the program: its speed (using SLINK or CLINK), its flexibility across different kinds of datasets, and its ability to handle unusual and missing data.
36
6. Conclusion (cont.)
In data mining, the success of clustering often depends on good data rather than good algorithms. If the dataset is huge and not clean, the choice of clustering algorithm may not matter much in terms of quality, so the algorithm can be chosen for speed or ease of use instead.
37
Thanks for listening!
38
Editor's Notes
1. Iteration: the key point is the improved k-means — the three lines flatten out almost immediately, by iteration 3 or 4. k: with the traditional algorithm we cannot guess the appropriate k for this dataset, because the random initial points make the values of α and β differ on every run.