Classifying Readmissions of Diabetic Patient Encounters (Mayur Srinivasan)
Readmission rates in hospitals are a key indicator of the quality of patient care and a clear indication of the total cost and inconvenience associated with treatment. Patients with serious medical conditions such as diabetes mellitus are key drivers of readmission rates owing to the complexity of their illness. Being able to predict from patient features whether a patient will need readmission can therefore help doctors and hospitals provide better care initially and avoid financial penalties under the Affordable Care Act's readmission policy.
Next generation electronic medical records and search a test implementation i... (lucenerevolution)
Presented by David Piraino, Chief Imaging Information Officer, Imaging Institute, Cleveland Clinic
& Daniel Palmer, Chief Imaging Information Officer, Imaging Institute, Cleveland Clinic
Most patient-specific medical information is document oriented with varying amounts of associated metadata. Most patient medical information is textual and semi-structured. Electronic Medical Record (EMR) systems are not optimized to present this textual information to users in the most understandable ways; present EMRs show information only in a reverse-time-oriented, patient-specific manner. This talk describes the construction and use of Solr search technologies to provide relevant historical information at the point of care while interpreting radiology images.
Radiology reports over a 4-year period were extracted from our Radiology Information System (RIS) and passed through a text processing engine to extract the results, impression, exam description, location, history, and date. Fifteen cases reported during clinical practice were used as test cases to determine whether "similar" historical cases were found. The results were evaluated by the number of searches that returned any result in less than 3 seconds and by the number of cases that illustrated the questioned diagnosis in the top 10 results returned, as determined by a bone and joint radiologist. Methods to better optimize the search results were also reviewed.
An average of 7.8 out of the 10 highest rated reports showed a similar case highly related to the present case. The best search showed 10 out of 10 cases that were good examples, and the lowest-matching search showed 2 out of 10. The talk will highlight this specific use case and the issues and advances of using Solr search technology in medicine, with a focus on point-of-care applications.
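As a rough illustration of the kind of point-of-care query described here, the sketch below builds a Solr /select request for reports similar to the current exam. The core name ("radreports"), field names ("impression", "history"), and search terms are illustrative assumptions, not the presenters' actual schema.

```python
from urllib.parse import urlencode

def build_similarity_query(base_url, impression_terms, top_k=10):
    """Build a Solr /select URL for reports similar to the current exam."""
    params = {
        "q": " ".join(impression_terms),  # free-text terms from the impression
        "df": "impression",               # default search field (assumed name)
        "fq": "history:[* TO *]",         # only reports carrying a history section
        "rows": top_k,                    # top-10 results, as in the evaluation
        "sort": "score desc",
    }
    return f"{base_url}/radreports/select?{urlencode(params)}"

url = build_similarity_query("http://localhost:8983/solr",
                             ["osteochondral", "lesion", "talus"])
```

In a deployment, the returned JSON's top hits would be shown to the radiologist alongside the current study; here only the request construction is sketched.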
Leveraging Text Classification Strategies for Clinical and Public Health Appl... (Karin Verspoor)
Human-generated text is a critical component of recorded clinical data, yet remains an under-utilised resource in clinical informatics applications due to minimal standards for sharing of unstructured data as well as concerns about patient privacy. Where we can access and analyse clinical text, we find that it provides a hugely valuable resource. In this talk, I will describe two projects where we have used text classification as the basis for addressing a clinical objective: (1) a syndromic surveillance project where the task is the monitoring of health and social media data sources for changes that indicate the onset of disease outbreaks, and (2) the analysis of hospital records to enable retrieval of specific disease cases, for monitoring of the hospital case mix as well as for construction of patient cohorts for clinical research studies. I will end by briefly discussing the huge potential for clinical text analysis to support changing the way modern medicine is practised.
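A minimal sketch of the kind of text classifier such projects typically start from: a multinomial Naive Bayes over bag-of-words counts with Laplace smoothing. The "syndromic surveillance" labels and snippets below are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (label, text) pairs. Returns counts needed for NB."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict(model, text):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            # Laplace (add-one) smoothing for unseen words
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([
    ("outbreak", "fever cough cluster reported"),
    ("outbreak", "sudden fever spike in region"),
    ("routine", "annual checkup scheduled"),
    ("routine", "routine followup visit scheduled"),
])
print(predict(model, "fever cluster in region"))  # outbreak
```

Real systems layer richer features (n-grams, clinical concept normalisation) on top, but the classification core is the same.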
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ... (IJDKP)
A large amount of heterogeneous medical data is generated every day in various healthcare organizations. These data could yield insights for improving monitoring and care delivery in the Intensive Care Unit. At the same time, they present a challenge: reducing the amount of data without information loss. Dimension reduction is considered the most popular approach for reducing data size and for reducing noise and redundancy in data. In this paper, we investigate the effect of the average laboratory test value and the total number of laboratory tests in predicting patient deterioration in the Intensive Care Unit, where we treat laboratory tests as features. Choosing a subset of features means choosing the most important lab tests to perform. Our approach therefore uses state-of-the-art feature selection to identify the most discriminative attributes, giving a better understanding of the patient deterioration problem. If the number of tests can be reduced by identifying the most important ones, then the redundant tests can also be identified. By omitting redundant tests, observation time could be reduced and early treatment could be provided to avoid risk; unnecessary monetary cost would also be avoided. We apply our technique to the publicly available MIMIC-II database and show the effectiveness of the feature selection. We also provide a detailed analysis of the best features identified by our approach.
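As a toy illustration of the filter-style selection this abstract describes, the sketch below scores each lab test by the normalised gap between its class means and ranks tests by that score. The scoring rule and the synthetic values are assumptions for illustration, not the paper's actual method or MIMIC-II data.

```python
def rank_features(rows, labels, feature_names):
    """Rank features by |mean(deteriorated) - mean(stable)| / value range."""
    def mean(xs):
        return sum(xs) / len(xs)
    scores = {}
    for j, name in enumerate(feature_names):
        col = [r[j] for r in rows]
        pos = [v for v, y in zip(col, labels) if y == 1]  # deteriorated
        neg = [v for v, y in zip(col, labels) if y == 0]  # stable
        spread = (max(col) - min(col)) or 1.0  # guard against constant features
        scores[name] = abs(mean(pos) - mean(neg)) / spread
    return sorted(feature_names, key=lambda n: scores[n], reverse=True)

# rows: (lactate, sodium) per patient; label 1 = deteriorated (synthetic values)
rows = [(4.1, 138), (3.8, 141), (1.0, 139), (1.2, 140)]
labels = [1, 1, 0, 0]
print(rank_features(rows, labels, ["lactate", "sodium"]))  # ['lactate', 'sodium']
```

Here lactate separates the classes while sodium does not, so lactate ranks first; keeping only the top-ranked tests is what lets the paper drop redundant ones.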
The Life-Changing Impact of AI in Healthcare (Kalin Hitrov)
For IT Leaders in the healthcare and pharmaceutical industries looking to understand the impact of AI on their industries and how to overcome the ethical and efficiency challenges that come with its use.
Preoperative Factors Predict Perioperative Morbidity and Mortality After Pancreaticoduodenectomy
David Yu Greenblatt, MD, MSPH, Kaitlyn J. Kelly, MD, Victoria Rajamanickam, MS, Yin Wan, MS,
Todd Hanson, BS, Robert Rettammel, MA, Emily R. Winslow, MD, Clifford S. Cho, MD, FACS,
and Sharon M. Weber, MD, FACS
Department of Surgery, University of Wisconsin, Madison, WI.
Original article:
Data Con LA 2019 - Best Practices for Prototyping Machine Learning Models for... (Data Con LA)
Medical institutions, universities and software giants like Google and Microsoft are dedicating increasing resources to machine learning for healthcare. This is a very exciting but relatively young field, and best practices for methods and reporting of results are not yet fully established. I have 2.5 years of experience as a data scientist at a national cancer center, working on clinical data, evaluating external vendors and peer reviewing machine learning in healthcare papers. The talk gives an overview of best practices in prototyping machine learning models on data from the patient electronic health record (EHR). The topics addressed are:
1. Introduction to the EHR
2. Overview of machine learning applications to the EHR
3. Cohort definition for survival problems
4. Data cleaning
5. Performance metrics
Excerpts of papers from renowned institutions will be critically reviewed. The material is intended to be useful not only to machine learning for healthcare professionals, but to practitioners dealing with very unbalanced datasets in the temporal domain. For example, customer churn prediction can be modeled as a survival problem.
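One of the talk's themes, performance metrics on very unbalanced data, can be shown in a few lines: on a cohort with one positive in ten, a classifier that predicts "all negative" scores 90% accuracy yet has zero precision and recall. The toy labels are invented for illustration.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 1 positive in 10: predicting "all negative" gets 90% accuracy but 0 recall.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10
print(precision_recall(y_true, all_negative))  # (0.0, 0.0)
```

This is why unbalanced clinical problems are reported with precision/recall-style metrics (or time-to-event metrics, for the survival framing the talk discusses) rather than accuracy.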
Predictive Analytics and Machine Learning for Healthcare - Diabetes (Dr Purnendu Sekhar Das)
Machine learning on clinical datasets to predict the risk of chronic disease conditions like Type 2 diabetes mellitus in advance, as well as to predict outcomes like hospital readmission using EMR real-world evidence (RWE) data.
Machine learning and operations research to find diabetics at risk for readmission.
A team of researchers was able to apply machine learning to reduce readmissions for diabetics; see "Identifying diabetic patients with high risk of readmission" (Bhuvan, Kumar, Zafar, and Kishore, 2016).
A Quick Start To Blockchain (Seval Çapraz)
Blockchain is one of the most innovative discoveries of the past century.
The first cryptocurrency, Bitcoin, was proposed in 2008 by Satoshi Nakamoto with a white paper.
Binary Search Implementation in Assembly Language (Seval Çapraz)
Given an array of numbers sorted in ascending order, we determine whether a requested number is present. A binary search implementation in assembly language.
Importance of software quality assurance to prevent and reduce software failu... (Seval Çapraz)
Importance of software quality assurance to prevent and reduce software failures:
Document Management System In Defence Industry Case Study by Seval Çapraz
What is Data Mining? Which algorithms can be used for data mining? (Seval Çapraz)
This presentation covers what data mining is and which techniques and algorithms are available, helping you understand the core concepts of data mining.
Statistical Data Analysis on Diabetes 130-US hospitals for years 1999-2008 Data Set
Statistical Data Analysis
on
Diabetes 130-US hospitals
for years 1999-2008 Data Set
Document Version: 1.0
(Date: 12/01/15)
Seval Ünver
unver.seval@metu.edu.tr
Student Number: 1900810 (M.Sc.)
Department of Computer Engineering, Middle East Technical University
Ankara, TURKEY
Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis, by Seval Unver
Version History
Version | Status | Date | Responsible | Version Definition
0.1 | Sent via email | 29/10/14 | Seval Unver | Projection by PCA (6 hours)
0.2 | Uploaded to OdtuClass | 05/11/14 | Seval Unver | Projection by MDS (6 hours)
0.3 | Uploaded to OdtuClass | 19/11/14 | Seval Unver | Data clustering by hierarchical and k-means clustering (6 hours)
0.4 | Uploaded to OdtuClass | 26/11/14 | Seval Unver | Cluster Validation (5 hours)
1.0 | Final Report | 12/01/15 | Seval Unver | Spectral Clustering (6 hours)
Table of Contents
1. Data Set Description.........................................................................................................................4
2. Data projection by PCA....................................................................................................................8
2.1. Eigenvalues and Eigenvectors..................................................................................................8
2.2. Plot directions of the first and second principal components on the original coordinate
system............................................................................................................................11
2.3. Transformed data set onto a new coordinate system by using the first two principal
components....................................................................................................................12
2.4. Personal Observations and Comments...................................................................................12
2.5. Details of Implementation......................................................................................................13
3. Data projection by MDS.................................................................................................................16
3.1. Classical Metric......................................................................................................................16
3.2. Sammon Mapping and isoMDS..............................................................................................16
3.3. Use the projection of data onto first two principal axes (as a result of PCA) to initialize MDS
(sammon and isoMDS). Plot the final projections.........................................................................19
3.4. Observations and Comments..................................................................................................20
3.5. Self-Reflection about MDS....................................................................................................20
4. Clustering.......................................................................................................................................20
4.1. Hierarchical Clustering...........................................................................................................20
4.2. K-Means Clustering................................................................................................................24
4.2.1. K-Means algorithm for different 5 k values – Plot Error................................................25
4.2.2. K-Means with 5 different initial configurations when k is 100 and when k is 25 – Error
Plot............................................................................................................................................25
4.2.3. Plot the data in 2D...........................................................................................................26
4.3. Self-Reflection About Clustering............................................................................................27
5. Cluster Validation...........................................................................................................................28
5.1. Comparison of Actual Labels and Predicted Labels ..............................................................28
5.2 Dunn Index and Davies-Bouldin.............................................................................................28
5.2.1 Dunn Index Measurements..............................................................................................30
5.2.2 Davies-Bouldin Measurements........................................................................................30
5.3 Self-Reflection About Validation.............................................................................................31
6. Spectral Clustering.........................................................................................................................31
7. References......................................................................................................................................37
8. Appendix.........................................................................................................................................38
8.1. Used Scripts & Programs........................................................................................................38
8.1.1. Scripts of PCA Projection...............................................................................................38
8.1.2. Scripts of MDS Projection..............................................................................................39
8.1.3 Scripts of Clustering.........................................................................................................39
8.1.4 Scripts of Cluster Validation............................................................................................40
8.1.5. Scripts of Spectral Clustering.........................................................................................43
1. Data Set Description
"Diabetes 130-US hospitals for years 1999-2008 Data Set" is selected for this research. This data
has been prepared to analyze factors related to readmission as well as other outcomes pertaining to
patients with diabetes. The dataset represents 10 years (1999-2008) of clinical care at 130 US
hospitals and integrated delivery networks. It includes 50 features representing patient and hospital
outcomes.
The original large database has 74 million unique encounters corresponding to 17 million unique
patients. The database consists of 41 tables in a fact-dimension schema and a total of 117 features.
Almost 70,000 inpatient diabetes encounters were identified with sufficient detail for this analysis.
In earlier research, this database was used to show the relationship between the measurement of
HbA1c and early readmission while controlling for covariates such as demographics, severity and
type of the disease, and type of admission. The dataset was created in two steps. First, encounters of
interest were extracted from the database with 55 attributes. Second, preliminary analysis and
preprocessing of the data were performed, retaining only those features (attributes) and
encounters that could be used in further analysis, that is, those containing sufficient information [1].
Information was extracted from the database for encounters that satisfied the following criteria:
1. It is an inpatient encounter (a hospital admission).
2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the
system as a diagnosis.
3. The length of stay was at least 1 day and at most 14 days.
4. Laboratory tests were performed during the encounter.
5. Medications were administered during the encounter.
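The five inclusion criteria above can be expressed as a simple filter over encounter records. The flat dictionary fields below are illustrative assumptions; the study's extraction actually ran against a 41-table fact-dimension schema.

```python
def eligible(enc):
    """Apply the study's five inclusion criteria to one encounter record."""
    return (enc["inpatient"]                      # 1. hospital admission
            and enc["diabetes_diagnosis"]         # 2. any diabetes diagnosis entered
            and 1 <= enc["length_of_stay"] <= 14  # 3. stay between 1 and 14 days
            and enc["num_lab_tests"] > 0          # 4. lab tests performed
            and enc["num_medications"] > 0)       # 5. medications administered

encounters = [
    {"inpatient": True, "diabetes_diagnosis": True, "length_of_stay": 3,
     "num_lab_tests": 25, "num_medications": 8},
    {"inpatient": True, "diabetes_diagnosis": True, "length_of_stay": 20,
     "num_lab_tests": 25, "num_medications": 8},   # excluded: stay too long
]
print([eligible(e) for e in encounters])  # [True, False]
```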
Data Set Download Link: http://archive.ics.uci.edu/ml/datasets/Diabetes+130-
US+hospitals+for+years+1999-2008
Date Donated: 05/03/14
Source: The data are submitted on behalf of the Center for Clinical and
Translational Research, Virginia Commonwealth University, a
recipient of NIH CTSA grant UL1 TR00058 and a recipient of the
CERNER data. John Clore (jclore '@' vcu.edu), Krzysztof J. Cios
(kcios '@' vcu.edu), Jon DeShazo (jpdeshazo '@' vcu.edu), and Beata
Strack (strackb '@' vcu.edu). This data is a de-identified abstract of
the Health Facts database (Cerner Corporation, Kansas City, MO).
Table 1. Source of data set
There are 100,000 instances and 50 columns in this data set. The data set is multivariate, since
there are many variables. Its domain is the life sciences, and the data are real-world records; for
this reason, there are missing values. Classification and clustering methods can be used on this
data.
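As a small illustration of the clustering methods applied later in this report, the sketch below runs a plain k-means on two numeric features (e.g. time in hospital and number of medications). The toy points are synthetic, not drawn from the data set.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: assign to nearest center, recompute means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

points = [(1, 2), (1, 3), (2, 2), (9, 9), (10, 8), (9, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

On these well-separated points the two recovered clusters each contain three points; the report's later sections evaluate such clusterings with the Dunn and Davies-Bouldin indices.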
The data contain attributes such as patient number, race, gender, age, admission type, time in
hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test
result, diagnosis, number of medications, diabetic medications, and number of outpatient, inpatient,
and emergency visits in the year before the hospitalization. The whole attribute list can be seen in
Table 2.
No | Feature name | Type | Description and values | % missing
1 | Encounter ID | Numeric | Unique identifier of an encounter | 0%
2 | Patient number | Numeric | Unique identifier of a patient | 0%
3 | Race | Nominal | Values: Caucasian, Asian, African American, Hispanic, and other | 2%
4 | Gender | Nominal | Values: male, female, and unknown/invalid | 0%
5 | Age | Nominal | Grouped in 10-year intervals: [0, 10), [10, 20), ..., [90, 100) | 0%
6 | Weight | Numeric | Weight in pounds | 97%
7 | Admission type | Nominal | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available | 0%
8 | Discharge disposition | Nominal | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available | 0%
9 | Admission source | Nominal | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital | 0%
10 | Time in hospital | Numeric | Integer number of days between admission and discharge | 0%
11 | Payer code | Nominal | Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay | 52%
12 | Medical specialty | Nominal | Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon | 53%
13 | Number of lab procedures | Numeric | Number of lab tests performed during the encounter | 0%
14 | Number of procedures | Numeric | Number of procedures (other than lab tests) performed during the encounter | 0%
15 | Number of medications | Numeric | Number of distinct generic names administered during the encounter | 0%
16 | Number of outpatient visits | Numeric | Number of outpatient visits of the patient in the year preceding the encounter | 0%
17 | Number of emergency visits | Numeric | Number of emergency visits of the patient in the year preceding the encounter | 0%
18 | Number of inpatient visits | Numeric | Number of inpatient visits of the patient in the year preceding the encounter | 0%
19 | Diagnosis 1 | Nominal | The primary diagnosis (coded as first three digits of ICD9); 848 distinct values | 0%
20 | Diagnosis 2 | Nominal | Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values | 0%
21 | Diagnosis 3 | Nominal | Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values | 1%
22 | Number of diagnoses | Numeric | Number of diagnoses entered to the system | 0%
23 | Glucose serum test result | Nominal | Indicates the range of the result or if the test was not taken. Values: ">200", ">300", "normal", and "none" if not measured | 0%
24 | A1c test result | Nominal | Indicates the range of the result or if the test was not taken. Values: ">8" if the result was greater than 8%, ">7" if the result was greater than 7% but less than 8%, "normal" if the result was less than 7%, and "none" if not measured | 0%
25-47 | 23 features for medications | Nominal | Indicates whether the drug was prescribed or there was a change in the dosage. Values: "up" if the dosage was increased during the encounter, "down" if decreased, "steady" if unchanged, and "no" if the drug was not prescribed | 0%
48 | Change of medications | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: "change" and "no change" | 0%
49 | Diabetes medications | Nominal | Indicates if there was any diabetic medication prescribed. Values: "yes" and "no" | 0%
50 | Readmitted | Nominal | Days to inpatient readmission. Values: "<30" if the patient was readmitted in less than 30 days, ">30" if readmitted in more than 30 days, and "No" for no record of readmission | 0%
Table 2. List of features and their descriptions in the initial dataset
The data are real-world data, so they contain incomplete, redundant, and noisy information. Some
features have a high percentage of missing values: weight (97% missing), payer code (52%), and
medical specialty (53%). Weight could be directly relevant to diabetes, but it is too sparse in this
database, so it can be removed. Payer code can also be removed because it is not relevant to
diabetes. The medical specialty attribute, which records the admitting physician's specialty, can be
removed too; it might be important, but it is not the focus of this research. Therefore the three
features with the highest proportion of missing values are removed.
To summarize, the dataset consists of hospital admissions of length between 1 and 14 days that did
not result in a patient death or discharge to a hospice. Each encounter corresponds to a unique
patient diagnosed with diabetes, although the primary diagnosis may be different. During each of
the analyzed encounters, lab tests were ordered and medication was administered [1].
Four groups of encounters are considered: (1) no HbA1c test performed, (2) HbA1c performed and
in normal range, (3) HbA1c performed and the result is greater than 8% with no change in diabetic
medications, and (4) HbA1c performed, result is greater than 8%, and diabetic medication was
changed.
A subset of 1000 instances is separated out for analysis as a training set. In this smaller data set,
the distribution of HbA1c changes, as can be seen by comparing Table 3 and Table 4 below.
The share of the population that did not have an HbA1c test is 81.60% in the whole data set and
81.50% in the training data set; the corresponding readmission rates are 9.40% and 9.32%. The
share of the population with an HbA1c result above 8% is 8.90% in the whole data set and 11.80%
in the training data set, and the readmission rates are close to each other. In the group with a high
result and a change of medication, the share readmitted within 30 days is 8.90% for the whole
data set and 10.00% for the training data set.
Encounter ID and patient number are removed because we are interested in a summary of this data.
Diagnosis 1, Diagnosis 2, and Diagnosis 3 are removed because they are nominal and each has more
than 900 distinct values. Race is removed from the data set because it is not necessary at this point.
The 23 medication features are removed because they are nominal and they are not
Table 3. Distribution of HbA1c in the whole data set:

HbA1c | Number of encounters | % of the population | Readmitted: number of encounters | Readmitted: % in group
No test was performed | 57080 | 81.60% | 5342 | 9.40%
Result was high and the diabetic medication was changed | 4071 | 5.80% | 361 | 8.90%
Result was high but the diabetic medication was not changed | 2196 | 3.10% | 166 | 7.60%
Normal result of the test | 6637 | 9.50% | 590 | 8.90%

Table 4. Distribution of HbA1c in the training data set:

HbA1c | Number of encounters | % of the population | Readmitted: number of encounters | Readmitted: % in group
No test was performed | 815 | 81.50% | 76 | 9.32%
Result was high and the diabetic medication was changed | 60 | 6.00% | 6 | 10.00%
Result was high but the diabetic medication was not changed | 58 | 5.80% | 6 | 10.34%
Normal result of the test | 67 | 6.70% | 6 | 8.95%
useful for Principal Component Analysis. Our aim is to find a relation between the HbA1c test and
the readmission rate, so we keep that information. Gender is expressed as 1 for female and 0 for
male, and the other nominal values are converted to integer representations.
The nominal values were recoded as integers (the mapping is tabulated below). After removing the
unnecessary features, there are now 18 features in the training data set. Types of discrete data:
• count data (time_in_hospital, num_lab_procedures, num_procedures, num_medications,
number_outpatient, number_emergency, number_inpatient, number_diagnoses)
• nominal data (gender, admission_type_id, discharge_disposition_id, admission_source_id,
diabetesMed, change)
• ordinal data (age, A1Cresult, max_glu_serum, readmitted)
2. Data projection by PCA
PCA (Principal Component Analysis) is performed with the GNU R function prcomp, using the
eigenvalues of the covariance matrix as the method.
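The report performs this step with prcomp in R. As an illustration of the same computation (eigen-decomposition of the covariance matrix of the centered data), here is a minimal Python sketch using NumPy; it is not the code used in the report.

```python
import numpy as np

def pca(X):
    """PCA via eigen-decomposition of the covariance matrix.

    Returns eigenvalues (variances) in decreasing order, the matching
    eigenvectors (principal axes) as columns, and the projected scores.
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    return eigvals[order], eigvecs[:, order], Xc @ eigvecs[:, order]
```

R's prcomp returns the same quantities (sdev, rotation, x), up to the arbitrary signs of the eigenvectors.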
2.1. Eigenvalues and Eigenvectors
Image: First 18 eigenvalues of the whole data
Changed nominal values:
gender: Female = 1, Male = 0
age: [0-10) = 1, [10-20) = 2, ..., [90-100) = 10
A1Cresult: None = 0, Normal = 1, >7 = 2, >8 = 3
max_glu_serum: None = 0, Normal = 1, >200 = 2, >300 = 3
change: None = 0, change = 1
diabetesMed: No = 0, Yes = 1
readmitted: No = 0, <30 = 1, >30 = 2
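As a minimal sketch of this recoding step (the report does it in Excel and R; the value spellings below follow the mapping above, while the raw CSV uses slightly different strings such as "Norm" and "Ch", so the dictionary keys would need adjusting for the real file):

```python
# Integer codes for the nominal features, mirroring the mapping above.
RECODE = {
    "gender":        {"Female": 1, "Male": 0},
    "A1Cresult":     {"None": 0, "Normal": 1, ">7": 2, ">8": 3},
    "max_glu_serum": {"None": 0, "Normal": 1, ">200": 2, ">300": 3},
    "change":        {"None": 0, "change": 1},
    "diabetesMed":   {"No": 0, "Yes": 1},
    "readmitted":    {"No": 0, "<30": 1, ">30": 2},
}
# Age intervals "[0-10)" ... "[90-100)" map to 1 ... 10.
AGE_CODE = {f"[{10 * i}-{10 * (i + 1)})": i + 1 for i in range(10)}

def recode_row(row):
    """Return a copy of the row with nominal values replaced by integer codes."""
    out = {k: RECODE[k].get(v, v) if k in RECODE else v for k, v in row.items()}
    if "age" in out:
        out["age"] = AGE_CODE.get(out["age"], out["age"])
    return out
```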
First we look at the whole data and plot it. It is much easier to explain PCA for two dimensions and
then generalize from there, so two numeric features are selected: A1Cresult and time_in_hospital.
Since the A1Cresult feature is categorical, it is replaced with num_lab_procedures.
Look at the colorful plot of num_lab_procedures against num_medications.
If we look at the PCA components, we can see that the first component is much larger than the
second.
PCA components of a subset of the data:
2.2. Plot directions of the first and second principal components on the
original coordinate system
PCA of whole data:
When we use the 2 features:
Image: 2D projections of data on the principal components
2.3. Transformed data set onto a new coordinate system by using the
first two principal components
Image: Cumulative Percentages of Eigenvalues
Image: Component 1 vs. Component 2
2.4. Personal Observations and Comments
This data set includes not only numerical values but also nominal values; there are both continuous
and categorical data. However, PCA is developed for and suited to continuous (ideally, multivariate
normal) data. There are no obvious outliers in the data. Although a PCA applied to binary data would
yield results comparable to those obtained from a Multiple Correspondence Analysis (factor scores
and eigenvalues are linearly related), there are more appropriate techniques to deal with mixed data
types, namely Multiple Factor Analysis for mixed data, available in the FactoMineR R package
(AFDM()). If the variables can be considered as structured subsets of descriptive attributes, then
Multiple Factor Analysis (MFA()) is also an option.
The challenge with categorical variables is to find a suitable way to represent distances between
variable categories and individuals in the factorial space. To overcome this problem, one can look
for a non-linear transformation of each variable (whether nominal, ordinal, polynomial, or
numerical) with optimal scaling. This is well explained in "Gifi Methods for Optimal Scaling in R:
The Package homals" [2], and an implementation is available in the corresponding R package
homals.
2.5. Details of Implementation
The data file ends with ".csv", which stands for comma-separated values. There are 101768 lines in
the file. The data are imported into R:
> diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
To get the summary of data:
> summary(diabetic_data)
The data are opened in Excel and three feature columns are deleted: weight (97% of values
missing), payer code (52% missing), and medical specialty (53% missing). Then a subset of the
data, 1000 rows, is selected as training data. This training set is the first 1000 rows of the original
data, and it represents the whole data correctly because these 1000 rows were chosen randomly by
the creators of the data set.
The training data are imported into R. The summary of the training data differs little from that of
the original data; for example, approximately half of the patients are women and half are men, and
the mean values are close to each other. Of course there are differences between the original data
and the training data, but using a training set speeds up the analysis.
To get the new distribution of the HbA1c result, a subset of this data is shown in R. After removing
the unnecessary columns and replacing nominal values with integer representations, the data set has
18 features, so it is imported again.
To run the Principal Component Analysis, we can use the princomp function in R. To use the
correlation matrix, we give it the cor=TRUE parameter. It shows the importance of components:
Image: Summary of new data set (with 18 features)
Image: Plot of Diabetic_Data
It looks like Component 1 is very strong. When it is plotted, it gives the results which can be seen
below.
Image: Plot of PCA
A scree plot is a graphical display of the variance of each component in the dataset, used to
determine how many components should be retained in order to explain a high percentage of the
variation in the data. The first and second components are the largest, so we should keep them.
Now let's see the biplot. Biplots are a type of exploratory graph used in statistics, a generalization of
the simple two-variable scatterplot. A biplot allows information on both samples and variables of a
data matrix to be displayed graphically. Samples are displayed as points while variables are
displayed either as vectors, linear axes or nonlinear trajectories. In the case of categorical variables,
category level points may be used to represent the levels of a categorical variable. A generalised
biplot displays information on both continuous and categorical variables.
Image: biplot of PCA
It is hard to read the information in this biplot image. However, we can see that some of the red
lines point in the same direction, which means those variables are associated.
3. Data projection by MDS
For visualisation, three MDS (Multidimensional Scaling) methods are used: classical
multidimensional scaling, Sammon mapping, and non-metric MDS.
Classical multidimensional scaling is done in 2D with the cmdscale and dist functions, using the
Euclidean distance between samples. As samples, the raw data and the first two dimensions of the
PCA result are used.
Sammon mapping is done with the sammon function in the MASS package. For the initial
configuration, the result of PCA and several instances of uniformly distributed random points are used.
Similar to Sammon mapping, non-metric MDS is done with the isoMDS function in the MASS
package. For the initial configuration, the same values as in the Sammon mapping analysis are used.
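The report uses R's cmdscale for the classical step; for illustration, the Torgerson computation it performs can be sketched in a few lines of Python (a sketch under the assumption of a Euclidean distance matrix, not a drop-in replacement):

```python
import numpy as np

def classical_mds(D, k=2):
    """Torgerson's classical MDS: recover k-dimensional coordinates from a
    matrix of pairwise distances D (the computation behind R's cmdscale)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]     # keep the k largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * scale          # embedded coordinates
```

When D is a Euclidean distance matrix, the embedding reproduces the pairwise distances up to rotation and reflection, which is why classical MDS on Euclidean distances is essentially PCA, as noted later in Section 3.4.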
3.1. Classical Metric
Image: Classical MDS
3.2. Sammon Mapping and isoMDS
At least 5 different random initial configurations were tried; the one that gives the minimum error
was chosen, and the final MDS projection onto two dimensions was plotted. If the result of PCA is
used, Sammon mapping converges in just a few iterations and leaves the configuration almost
unchanged. The result is very sensitive to the magic parameter, which controls the step size of the
iterations, as indicated by the MASS documentation. Here, the magic parameter is chosen as 0.05.
Image: Sammon Mapping with PCA
Image: Magic parameter is 0.05
If the magic parameter is 0.05:
Initial stress : 0.79437
stress after 2 iters: 0.61243
Image: Magic parameter is 0.01
If the magic parameter is 0.01:
Initial stress : 0.79437
stress after 10 iters: 0.50231, magic = 0.115
stress after 10 iters: 0.50231
Image: Magic parameter is 0.02
If the magic parameter is 0.02:
Initial stress : 0.79437
stress after 10 iters: 0.41383, magic = 0.231
stress after 20 iters: 0.36066, magic = 0.021
stress after 30 iters: 0.34217, magic = 0.002
stress after 35 iters: 0.33437
3.3. Use the projection of data onto the first two principal axes (as a result
of PCA) to initialize MDS (sammon and isoMDS). Plot the final
projections.
Image: isoMDS (Non-metric Mapping with PCA)
initial value 11.809209
final value 11.804621
converged
Image: Non-Metric mapping with random configuration
initial value 48.305275
final value 48.304120
converged
3.4. Observations and Comments
There are many features in this data set, which means high dimensionality. Although we removed
most of the unnecessary features and use a training set of 1000 instances, the data are still not easily
clusterable in 2D, and clusters are not easily visible. The iterative methods used here yield outliers
which are not present in PCA, but this behaviour depends strongly on parameters other than the
distance data. Classic Torgerson metric MDS is actually done by transforming distances into
similarities and performing PCA (eigen-decomposition or singular value decomposition) on those.
So, PCA might be called the algorithm of the simplest MDS.
Thus, MDS and PCA are not at the same level, to be aligned with or opposed to each other. PCA is
just a method, while MDS is a class of analyses. As a mapping, PCA is a particular case of MDS.
On the other hand, PCA is a particular case of factor analysis which, being a data reduction, is more
than only a mapping, while MDS is only a mapping.
3.5. Self-Reflection about MDS
As seen in this assignment, MDS gives much more information than PCA. This assignment
provides a general perspective on the measures of similarity and dissimilarity. Both MDS and PCA
use proximity measures such as the correlation coefficient or Euclidean distance to generate a
spatial configuration (map) of points in multidimensional space where distances between points
reflect the similarity among isolates.
4. Clustering
Clustering is a technique for finding similarity groups in data, called clusters. In other words,
clustering is the task of grouping a set of objects in such a way that objects in the same group
(called a cluster) are more similar (in some sense or another) to each other than to those in other
groups (clusters). On this data, two clustering methods are used for data visualisation.
The first is hierarchical clustering with three different linkage methods, implemented in the GNU R
function hclust with the methods "average", "complete", and "ward". Their dendrograms are plotted.
The second is k-means clustering with different k values (5, 10, 25, 100, 200) and several random
runs at the "elbow" value of k, which is around 100, consistent with the ground truth.
The Euclidean distance between normalized samples is used as the distance between samples for
both methods.
4.1. Hierarchical Clustering
Dendrograms of hierarchical clustering for the three linkages are illustrated under this title. Of the
three, the "ward" method is the easiest to interpret visually, even for a high number of clusters. The
"average" and "complete" methods yield visually similar results but are not as easy to interpret. The
most interesting observation is that, beyond a certain number of clusters, they look similar, even
though they look different when only a few clusters are picked.
Linkage, or the distance from a newly formed node to all other nodes, can be computed in several
different ways: single, complete, and average. The figure below roughly demonstrates what each
linkage evaluates [3]:
Average linkage clustering uses the average similarity of observations between two groups as the
measure between the two groups. Complete linkage clustering uses the farthest pair of observations
between two groups to determine the similarity of the two groups. Single linkage clustering, on the
other hand, computes the similarity between two groups as the similarity of the closest pair of
observations between the two groups [4].
Ward's linkage is distinct from all the other methods because it uses an analysis of variance
approach to evaluate the distances between clusters. In short, this method attempts to minimize the
Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at each step. In general,
this method is regarded as very efficient, however, it tends to create clusters of small size [4].
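The three pairwise linkage rules described above can be stated compactly; this is an illustrative Python sketch of the between-group distances themselves, not of the full hclust algorithm used in the report:

```python
import numpy as np

def group_distance(A, B, linkage="single"):
    """Distance between two groups of points A and B under a linkage rule."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))  # all cross pairs
    if linkage == "single":
        return d.min()    # closest pair of observations
    if linkage == "complete":
        return d.max()    # farthest pair of observations
    if linkage == "average":
        return d.mean()   # average over all pairs
    raise ValueError(f"unknown linkage: {linkage}")
```

Ward's method is different in kind: instead of a pairwise distance rule, it merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares.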
Image: Hierarchical Clustering with Ward Method
Image: Dendrograms for 5 (red), 25 (green), 100 (blue) clusters on Hierarchical Clustering with
Ward Method
Image: Hierarchical Clustering with average method
Image: Dendrograms for 5, 25, 100 clusters on Hierarchical Clustering with Average Method
Image: Hierarchical Clustering with complete method
Image: Dendrograms for 5, 25, 100 clusters on Hierarchical Clustering with Complete Method
Based on the results obtained above, I recommend 25 clusters for this dataset. It is difficult to
estimate the accurate number of clusters, but as seen with the Ward method, using 25 clusters
seems the best choice.
4.2. K-Means Clustering
K-means clustering aims to partition n observations into k clusters in which each observation
belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. Numbers of
clusters 5, 10, 25, 100, and 200 are examined under this title. The total sum of squares within
clusters, as given by the "tot.withinss" property of the kmeans function, is used as the error function
to evaluate the best number of clusters.
22 is close to the "rule of thumb" value for 1000 samples (k ≈ √(n/2)). The ground truth of 100 is
close to the "elbow" value, beyond which the error does not improve as dramatically. In the next
steps, five random runs each of k=100 and k=25 and their errors are illustrated.
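The report uses R's kmeans and its tot.withinss value; a minimal Python sketch of the same error measure (Lloyd's algorithm plus the within-cluster sum of squares used for the elbow scan) could look like this. It is an illustration, not the report's code:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm. Returns labels and the total within-cluster
    sum of squares (the analogue of R's tot.withinss)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)            # nearest center for each point
        for j in range(k):
            if (labels == j).any():          # keep old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    tot_withinss = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, tot_withinss

# Elbow scan as in the report (there: k = 5, 10, 25, 100, 200 on 1000 samples):
# errors = {k: kmeans(X, k)[1] for k in (5, 10, 25, 100, 200)}
```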
4.2.1. K-Means algorithm for 5 different k values – Error Plot
Image: "K-Means Clustering With k=5,10,25,100,200" vs. "Error Plot"
4.2.2. K-Means with 5 different initial configurations when k is 100 and
when k is 25 – Error Plot
Two different k values are tried in this task. Because determining the number of clusters is difficult,
k = 25 and k = 100 are each tried with 5 different initial configurations. When k is 100, the third try
is best because its error is lowest; when k is 25, the fifth try is best, again because its error is lowest.
Image: "5 random runs with k=100" vs. "Error Plot"
Image: “5 random runs with k=25” v.s. “Error Plot”
4.2.3. Plot the data in 2D
In the previous step, the error plot was shown in a graph. Of the five initial configurations, the third
is chosen because its error is the lowest when k=100. Here is the 2D plot of the third configuration
with k=100 in the K-Means algorithm.
Image: Third initial configuration with k=100 – Clusters in 2D
Here is the 2D plot of the fifth configuration with k=25 in the K-Means algorithm.
Image: Fifth initial configuration with k=25 – Clusters in 2D
4.3. Self-Reflection About Clustering
This assignment helped me to estimate the number of clusters in my data; after it, I chose k=25 as
the cluster number. The figures and projections were very beneficial in this task.
In the k-means algorithm, my difficulty was specifying k. In addition, the algorithm is very
sensitive to outliers, and my data has many outliers because it is real-world data. A weakness of
k-means is that it is only applicable when the mean is defined; for categorical data there is
k-modes, in which the centroid is represented by the most frequent values. Therefore we cannot say
that k-means is the best way to estimate the number of clusters, and the other algorithms have their
own weaknesses. Comparing different clustering algorithms is a difficult task, since no one knows
the correct clusters.
It is very hard, if not impossible, to know what distribution the application data follow. The data
may not fully follow any “ideal” structure or distribution required by the algorithms. One also needs
to decide how to standardize the data, to choose a suitable distance function and to select other
parameter values.
Hierarchical clustering has O(n^2) complexity, which makes it hard to use for large data sets. I
used a sample of 1000 instances in this task, so it did not take a long time to run.
My data has many nominal attributes with more than two states or values; for these, the commonly
used distance measure is based on the simple matching method.
To sum up, this assignment is very helpful for students who want to apply clustering to unlabeled
big data.
5. Cluster Validation
Cluster validation is concerned with the quality of clusters generated by an algorithm for data
clustering. Given the partitioning of a data set, it attempts to answer questions such as: How
pronounced is the cluster structure that has been identified? How do clustering solutions from
different algorithms compare? How do clustering solutions for different parameters (e.g. the
number of clusters) compare? [6]
5.1. Comparison of Actual Labels and Predicted Labels
"Ground truth" means a set of measurements that is known to be much more accurate than
measurements from the system you are testing. In Diabetes 130-US hospitals for years 1999-2008
Data Set, there was no labels to determine classes. I labeled the four group in a new column with a
Java console program. The column name is label. This column holds numbers which ranges from 1
to 4.
In this dataset, four groups of encounters are considered:
(1) no HbA1c test performed (A1Cresult=0 ),
(2) HbA1c performed and in normal range(A1Cresult=1 or A1Cresult=2),
(3) HbA1c performed and the result is greater than 8% with no change in diabetic medications
(A1Cresult=3 and change=0),
(4) HbA1c performed, result is greater than 8%, and diabetic medication was changed
(A1Cresult=3 and change=1).
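The report does this labeling with a Java console program; the grouping rule itself, using the integer codes from the recoding table (A1Cresult: 0 = none, 1 = normal, 2 = >7%, 3 = >8%; change: 0 = no change, 1 = changed), can be sketched as:

```python
def hba1c_group(a1c_result, change):
    """Map one encounter to its ground-truth group (1-4) as defined above."""
    if a1c_result == 0:
        return 1                        # no HbA1c test performed
    if a1c_result in (1, 2):
        return 2                        # test performed, result in normal range
    return 3 if change == 0 else 4      # result > 8%: meds unchanged vs. changed
```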
Only a cluster number of 4 (the ground truth) is considered.
Method | Precision
H. clustering (ward) | 0.482
H. clustering (average) | 0.993
H. clustering (complete) | 0.976
K-means | 0.139
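The report does not state exactly how this precision was computed. One common choice for scoring clusters against ground-truth labels is purity, where each cluster votes with its majority label; this sketch is an assumption about the method, not a reconstruction of it:

```python
from collections import Counter

def cluster_purity(true_labels, cluster_labels):
    """Fraction of points whose cluster's majority ground-truth label
    matches their own label (a.k.a. purity)."""
    members = {}
    for t, c in zip(true_labels, cluster_labels):
        members.setdefault(c, []).append(t)
    majority_hits = sum(Counter(ts).most_common(1)[0][1] for ts in members.values())
    return majority_hits / len(true_labels)
```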
5.2 Dunn Index and Davies-Bouldin
The goal of using an index is to determine the optimal clustering parameters. Greater intercluster
distances and lesser intracluster distances are desired. Different distance measures can be used for
the index calculations.
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio
between the minimal inter-cluster distance and the maximal intra-cluster distance. Let S and T be
two nonempty subsets of R^N [5]. Then the diameter of S is defined as

  Δ(S) = max_{x,y ∈ S} d(x, y)

and the set distance between S and T is defined as

  δ(S, T) = min_{x ∈ S, y ∈ T} d(x, y).

Here, d(x, y) indicates the distance between points x and y. For any partition into clusters
C_1, ..., C_K, Dunn defined the following index [11]:

  V_D = min_{i ≠ j} δ(C_i, C_j) / max_k Δ(C_k)

Larger values of V_D correspond to good clusters, and the number of clusters that maximizes V_D
is taken as the optimal number of clusters.
The Davies-Bouldin index [9] is a metric for evaluating clustering algorithms. It is an internal
evaluation scheme, where the validation of how well the clustering has been done uses quantities
and features inherent to the dataset. The index is a function of the ratio of within-cluster scatter to
between-cluster separation [5]. The scatter within the ith cluster, S_i, is computed as

  S_i = (1/|C_i|) Σ_{x ∈ C_i} d(x, z_i)

and the distance between clusters C_i and C_j, denoted by d_ij, is defined as

  d_ij = d(z_i, z_j).

Here, z_i represents the ith cluster center. The Davies-Bouldin (DB) index is then defined as

  DB = (1/K) Σ_{i=1..K} R_i,  where  R_i = max_{j ≠ i} (S_i + S_j) / d_ij.

The objective is to minimize the DB index for achieving proper clustering.
The most practical difference between the two indexes is that a higher Dunn index is better, while a
lower Davies-Bouldin index is better. The distances discussed here are Euclidean distances.
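Both indexes can be computed directly from the definitions above. Here is a small Python sketch for illustration (Euclidean distances; single-link separation and complete diameter for Dunn, centroid-based scatter for Davies-Bouldin, matching the formulas; the report's own tables were computed in R):

```python
import numpy as np

def _dist(A, B):
    """All pairwise Euclidean distances between rows of A and rows of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def dunn_index(X, labels):
    """Min inter-cluster distance / max cluster diameter (larger is better)."""
    ks = list(np.unique(labels))
    diam = max(_dist(X[labels == k], X[labels == k]).max() for k in ks)
    sep = min(_dist(X[labels == a], X[labels == b]).min()
              for a in ks for b in ks if a < b)
    return sep / diam

def davies_bouldin(X, labels):
    """Mean over clusters of the worst (S_i + S_j) / d_ij ratio (smaller is better)."""
    ks = list(np.unique(labels))
    centers = np.array([X[labels == k].mean(axis=0) for k in ks])
    S = np.array([_dist(X[labels == k], centers[i:i + 1]).mean()
                  for i, k in enumerate(ks)])
    R = [max((S[i] + S[j]) / _dist(centers[i:i + 1], centers[j:j + 1])[0, 0]
             for j in range(len(ks)) if j != i)
         for i in range(len(ks))]
    return float(np.mean(R))
```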
5.2.1 Dunn Index Measurements
Hierarchical Clustering (Ward):
Hierarchical Clustering (Average):
Hierarchical Clustering (Complete):
K-Means:
Hierarchical methods give better results than K-Means.
5.2.2 Davies-Bouldin Measurements
Hierarchical Clustering (Ward):
Hierarchical Clustering (Average):
Hierarchical Clustering (Complete):
K-Means:
Again, hierarchical methods give better results than K-Means.
5.3 Self-Reflection About Validation
The data set does not have well-separated clusters, so this task was difficult to implement. The aim
of this assignment was to understand and encourage the use of cluster-validation techniques in the
analysis of data. In particular, the assignment attempts to familiarize students with some of the
fundamental concepts behind cluster-validation techniques, and to assist them in making more
informed choices of the measures to be used. However, implementing better cluster validation
requires essential background knowledge. There are many different types of validation techniques,
and some articles in the literature propose their effective use. In conclusion, validation should be
done after more research and with more background knowledge.
6. Spectral Clustering
Clustering is widely applied in science and engineering, including bioinformatics, image
segmentation, and web information retrieval. The essential task of data clustering is partitioning
data points into disjoint clusters so that objects in the same cluster are similar and objects in
different clusters are dissimilar [7]. Many clustering algorithms have shortcomings, so spectral
clustering has been proposed as a promising alternative.
The spectral clustering method is a clustering method based on graph theory. It uses the top
eigenvectors of a matrix derived from the distances between points. Such algorithms have been
used successfully in many applications, including computer vision and VLSI design [8]. Through
spectral analysis of the affinity matrix of a data set, spectral clustering can obtain promising
clustering results [7]. Because its main computation is an eigendecomposition rather than an
iterative search, spectral clustering avoids getting trapped in local minima the way K-means can.
The process of spectral clustering can be summarized as follows [7][8] (suppose the data set
X = {x1, x2, ..., xn} has k classes):
Spectral Clustering Algorithm
STEP 1. Construct the affinity matrix W: if i ≠ j, then w_ij = exp(−‖x_i − x_j‖² / (2σ²)); else
w_ii = 0.
STEP 2. Define the diagonal matrix D, where d_ii = Σ_j w_ij, and define the Laplacian matrix
L = D^(−1/2) W D^(−1/2).
STEP 3. Compute the k eigenvectors u_1, ..., u_k corresponding to the k largest eigenvalues of the
matrix L, and form the matrix U = [u_1 u_2 ... u_k]. Then obtain the matrix Y by normalizing each
row of U to unit length, y_ij = u_ij / (Σ_j u_ij²)^(1/2).
STEP 4. Treat each row of Y as a point in R^k and cluster the rows into k clusters via K-means.
Assign the original point x_i to cluster j iff row i of the matrix Y was assigned to cluster j.
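The four steps above can be sketched directly in Python as an illustration (a NumPy translation of the algorithm, not the report's kernlab code; the deterministic farthest-point initialization of the final k-means step is a simplification introduced here):

```python
import numpy as np

def spectral_cluster(X, k, sigma=1.0, n_iter=50):
    """Spectral clustering following STEP 1-4 above."""
    # STEP 1: Gaussian affinity matrix with zero diagonal.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # STEP 2: degree matrix and L = D^(-1/2) W D^(-1/2).
    d_inv_sqrt = np.diag(W.sum(axis=1) ** -0.5)
    L = d_inv_sqrt @ W @ d_inv_sqrt
    # STEP 3: top-k eigenvectors, rows normalized to unit length.
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)
    # STEP 4: k-means on the rows of Y (farthest-point init for determinism).
    idx = [0]
    for _ in range(1, k):
        d = ((Y[:, None, :] - Y[idx][None]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d.argmax()))
    centers = Y[idx].copy()
    for _ in range(n_iter):
        labels = ((Y[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Y[labels == j].mean(axis=0)
    return labels
```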
On this dataset, the algorithm given above is used for spectral clustering. To implement it, there is
an extensible R package for kernel-based machine learning named kernlab; with it, spectral
clustering can be done easily in a few steps, so the "specc" method from the "kernlab" package is
used. The R scripts can be found in the Appendix, section 8.1.5, Scripts of Spectral Clustering.
In spectral clustering, the similarity between data points is often defined by a Gaussian kernel [7].
The scale hyperparameter σ of the Gaussian kernel greatly influences the final clustering results, so
a parameter estimation is done first to find the best σ. After that, several runs with the same
parameters are compared.
Parameter Estimation
Kernlab includes an S4 method called specc implementing this algorithm, which can be used
through a formula interface or a matrix interface. The S4 object returned by the method extends the
class "vector" and contains the assigned cluster for each point, along with information on the
centers, the size, and the within-cluster sum of squares for each cluster. When a Gaussian RBF
kernel is used, a model-selection process can determine the optimal value of the σ hyperparameter.
For a good value of σ the values of Y tend to cluster tightly, and it turns out that the within-cluster
sum of squares is a good indicator of the "quality" of the sigma parameter found. We then iterate
through sigma values to find an optimal value for σ.
The number of clusters is estimated as 4, 25, and 40; each is tried with the specc method. The
results are shown on two data subsets: {time_in_hospital, num_lab_procedures} and
{num_medications, num_lab_procedures}.
The estimated value for 4 clusters is σ = 4.40010321258815. Random runs are done with this
hyperparameter sigma.
If we estimate 4 clusters:
Image: Plot of data {Time in Hospital, Number of Procedures}
for 4 clusters estimation
Image: Plot of data {Number of Medications, Number of Lab Procedures}
for 4 cluster estimation
If we estimate 25 clusters:
Image: Plot of data {Time in Hospital, Number of Procedures}
for 25 clusters estimation
Image: Plot of data {Number of Medications, Number of Procedures}
for 25 clusters estimation
If we estimate 40 clusters:
Image: Plot of data {Time in Hospital, Number of Procedures}
for 40 clusters estimation
If we estimate 35 clusters:
Image: Plot of data {Number of Medications, Number of Procedures}
for 35 clusters estimation
From the dataset, the "num_lab_procedures" and "num_medications" features are chosen for the 2D
plot. The random run results are presented in the image below; the results are approximately the same.
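To make "approximately the same" precise, the agreement between two random runs can be scored with a label-permutation-invariant measure such as the Rand index. A minimal Python sketch (not part of the original analysis, which compares the runs visually):

```python
from itertools import combinations

def rand_index(a, b):
    """Fraction of point pairs on which two clusterings agree: the pair is
    either together in both clusterings or separated in both."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

run1 = [1, 1, 2, 2, 3, 3]
run2 = [3, 3, 1, 1, 2, 2]        # same partition, clusters merely relabeled
score = rand_index(run1, run2)   # -> 1.0
```

Because the score compares pairs rather than raw labels, two runs that find the same partition under different cluster numbering still score 1.0.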
Image: Random Run Results with Centers=4 and σ=4.40010321258815.
Image: Cluster Sizes for {Time in Hospital, Number of Procedures}
for 40 clusters estimation
Image: Cluster Sizes for {Number of Medications, Number of Procedures}
First Random Run with centers=4 and σ=4.4
Validation
From the random runs, the first result is chosen for validation. Dunn index and Davies-Bouldin
index results are given for comparison. Recall that a higher Dunn index is better, while a lower
Davies-Bouldin index is better. For both the Dunn index and the Davies-Bouldin index, Centroid
Diameter with Complete Link gives the best result. Since the ground truth has 4 clusters, the sizes
of the clusters are consistent with the ground truth.
Spectral (Dunn) Complete diameter Average diameter Centroid diameter
Single link 0.00668823 0.04265670 0.06039251
Complete link 0.52512892 3.34920700 4.74174095
Average link 0.15697465 1.00116480 1.41742928
Centroid link 0.09740126 0.62121320 0.87950129
Table: Spectral Clustering Result Validation with Dunn Index
Spectral (DB) Complete diameter Average diameter Centroid diameter
Single link 194.38845600 37.45235040 26.37252550
Complete link 1.83402600 0.43463780 0.30763710
Average link 7.44275800 1.32189570 0.92798800
Centroid link 10.87935700 1.93850020 1.36070810
Table: Spectral Clustering Result Validation with Davies-Bouldin Index
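As a sketch of how the two indices behave, the following illustrative Python version uses the centroid-diameter and complete-link variants singled out above (the report itself uses the clv package, whose convention of centroid diameter as twice the mean distance to the centroid is assumed here):

```python
import numpy as np

def centroid_diameter(C):
    """Intra-cluster 'centroid diameter': twice the mean distance to the centroid."""
    c = C.mean(axis=0)
    return 2.0 * np.mean(np.linalg.norm(C - c, axis=1))

def complete_link(Ci, Cj):
    """Inter-cluster 'complete link': largest pairwise distance between clusters."""
    return max(np.linalg.norm(p - q) for p in Ci for q in Cj)

def dunn(clusters):
    """Smallest inter-cluster distance over largest diameter; higher is better."""
    k = len(clusters)
    inter = min(complete_link(clusters[i], clusters[j])
                for i in range(k) for j in range(i + 1, k))
    return inter / max(centroid_diameter(C) for C in clusters)

def davies_bouldin(clusters):
    """Mean over clusters of the worst (diam_i + diam_j) / dist_ij; lower is better."""
    k = len(clusters)
    diam = [centroid_diameter(C) for C in clusters]
    return float(np.mean([max((diam[i] + diam[j]) / complete_link(clusters[i], clusters[j])
                              for j in range(k) if j != i) for i in range(k)]))

# two tight, far-apart toy clusters: Dunn is large, Davies-Bouldin is small
A = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0]])
B = A + 10.0
```

Moving the two toy clusters closer together lowers the Dunn index and raises Davies-Bouldin, which is the behavior the tables above exploit when comparing clusterings.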
7. References
[1] Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura,
Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission
Rates: Analysis of 70,000 Clinical Database Patient Records”, BioMed Research International, vol.
2014, Article ID 781670, 11 pages, 2014.
[2] Jan de Leeuw, Patrick Mair, “Gifi Methods for Optimal Scaling in R: The Package homals”,
Journal of Statistical Software August 2009, Volume 31, Issue 4.
[3] Laura Mulvey, Julian Gingold, “Microarray Clustering Methods and Gene Ontology”, 2007.
[4] Phil Ender, “Multivariate Analysis: Hierarchical Cluster Analysis”, 1998.
[5] Maulik U, Bandyopadhyay S., “Performance evaluation of some clustering algorithms and
validity indices”, IEEE Transactions on Pattern Analysis Machine Intelligence, 2002, 24(12): 1650-
1654.
[6] Julia Handl, Joshua Knowles, Douglas Kell, “Computational cluster validation in post-genomic
data analysis”, Bioinformatics 21(15):3201-3212, 2005.
[7] Lai Wei, “Path-based Relative Similarity Spectral Clustering”, 2010 Second WRI Global
Congress on Intelligent Systems, 16-17 Dec. 2010.
[8] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss, “On spectral clustering: Analysis and an
algorithm”, Advances in Neural Information Processing Systems 14 (NIPS 2001).
[9] D.L. Davies and D.W. Bouldin, “A Cluster Separation Measure”, IEEE Trans. Pattern Analysis
and Machine Intelligence, vol. 1, pp. 224-227, 1979.
[10] J.C. Dunn, “A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-
Separated Clusters”, J. Cybernetics, vol. 3, pp. 32-57, 1973.
8. Appendix
8.1. Used Scripts & Programs
The R programming language is used (R version 3.1.1). In this part of the document, you can find
the R scripts and commands used to implement the given tasks: projection with PCA, projection
with MDS, clustering, validation, and spectral clustering.
8.1.1. Scripts of PCA Projection
R Commands
#load data
diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
#see summary
summary(diabetic_data)
#to work in two dimensions, create a new data frame with two selected features
myvars <- c("num_lab_procedures","time_in_hospital")
newdata <- diabetic_data[myvars]
#PCA (prcomp is used so the $rotation, $x and $sdev fields referenced below exist)
my.pca <- prcomp(diabetic_data, scale. = TRUE)
#see the components
plot(my.pca)
biplot(my.pca)
# calculate covariance
my.cov <- cov(diabetic_data)
# calculate eigen values
my.eigen <- eigen(my.cov)
# see the eigen vectors in plot
pc1.slope <- my.eigen$vectors[1,1]/my.eigen$vectors[2,1]
pc2.slope <- my.eigen$vectors[1,2]/my.eigen$vectors[2,2]
abline(0,pc1.slope, col="red")
abline(0,pc2.slope, col="blue")
# cumulative percentages of eigen values
r<-my.pca$rotation
plot(cumsum(my.pca$sdev^2)/sum(my.pca$sdev^2))
# rotated data
biplot(my.pca,choices=c(2,1))
8.1.2. Scripts of MDS Projection
R Commands
library(MASS)
my.dist<-dist(diabetic_data)
randomdata<-cbind(runif(1000,min=-0.5,max=0.5),runif(1000,min=-0.5,max=0.5))
# classical mds
plot(cmdscale(my.dist))
# sammon mapping with PCA
plot(sammon(my.dist,y=my.pca$x[,c(1,2)],magic=0.05)$points)
# sammon mapping with random configuration
plot(sammon(my.dist,y=randomdata,magic=0.05)$points)
# non-metric mapping with PCA
plot(isoMDS(my.dist,y=my.pca$x[,c(1,2)])$points)
# non-metric mapping with random configuration
plot(isoMDS(my.dist,y=randomdata)$points)
8.1.3 Scripts of Clustering
R Commands
library(cluster)
ds<-dist(scale(diabetic_data))
# hierarchical clustering with ward method
hward<-hclust(ds,method="ward")
plot(hward)
# hierarchical clustering with average method
havg<-hclust(ds,method="average")
plot(havg)
# hierarchical clustering with complete method
hcomp<-hclust(ds,method="complete")
plot(hcomp)
#dendrograms for 5,25,100 clusters
rect.hclust(hward, k=100, border="blue")
rect.hclust(hward, k=25, border="green")
rect.hclust(hward, k=5, border="red")
#k-means with k=5,10,25,100,200
k1<-kmeans(scale(diabetic_data),5)
k2<-kmeans(scale(diabetic_data),10)
k3<-kmeans(scale(diabetic_data),25)
k4<-kmeans(scale(diabetic_data),100)
k5<-kmeans(scale(diabetic_data),200)
#k vs. error plot
plot(c(length(k1$size),length(k2$size),length(k3$size),length(k4$size),length(k5$size)),c(k1$tot.withinss,k2$tot.withinss,k3$tot.withinss,k4$tot.withinss,k5$tot.withinss),type="l")
# 5 random runs with k=25
kk1<-kmeans(scale(diabetic_data),25)
kk2<-kmeans(scale(diabetic_data),25)
kk3<-kmeans(scale(diabetic_data),25)
kk4<-kmeans(scale(diabetic_data),25)
kk5<-kmeans(scale(diabetic_data),25)
#run vs. error plot
plot(1:5,c(kk1$tot.withinss,kk2$tot.withinss,kk3$tot.withinss,kk4$tot.withinss,kk5$tot.withinss),type="l")
#clusters in 2D
clusplot(scale(diabetic_data),kk5$cluster,lines=0)
8.1.4 Scripts of Cluster Validation
Java code for labeling
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        String satir = "";
        try {
            File file = new File("final.csv");
            if (!file.exists()) {
                file.createNewFile();
            }
            FileWriter fileWriter = new FileWriter(file, false);
            BufferedWriter bWriter = new BufferedWriter(fileWriter);
            File inputfile = new File("data.csv");
            BufferedReader reader = new BufferedReader(new FileReader(inputfile));
            // column names
            satir = reader.readLine();
            bWriter.write(satir + ";label");
            bWriter.newLine();
            satir = reader.readLine();
            while (satir != null) {
                String[] columns = satir.split(";");
                if (columns[14].equals("0")) {
                    bWriter.write(satir + ";1");
                } else if (columns[14].equals("1") || columns[14].equals("2")) {
                    bWriter.write(satir + ";2");
                } else if (columns[14].equals("3") && columns[15].equals("0")) {
                    bWriter.write(satir + ";3");
                } else if (columns[14].equals("3") && columns[15].equals("1")) {
                    bWriter.write(satir + ";4");
                }
                bWriter.newLine();
                satir = reader.readLine();
            }
            bWriter.close();
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
R Commands
#the Java labeling step writes a semicolon-separated file, so sep=";" is used
labeled_data <- read.table("C:/diabetic_dataset/final.csv", header = TRUE, sep = ";")
library(clv)
#ground truth labels
gt<-c(labeled_data$label)
#pick mode of labels found within a ground truth label
findmapping<-function(cluster,ground){sapply(as.numeric(names(table(ground))),function(x)as.numeric(names(sort(table(cluster[ground==x]),decreasing=TRUE))[1]))}
#for each ground truth label, compare it with the found label
findmatches<-function(cluster,ground){findmapping(cluster,ground)[ground]==cluster}
#precision for h.c. ward
mean(findmatches(cutree(hward,4),gt))
#precision for h.c. average
mean(findmatches(cutree(havg,4),gt))
#precision for h.c. complete
mean(findmatches(cutree(hcomp,4),gt))
#precision for kmeans
mean(findmatches(kk5$cluster,gt))
#Dunn index for h.c. ward
clv.Dunn(cls.scatt.data(scale(diabetic_data),cutree(hward,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Dunn index for h.c. average
clv.Dunn(cls.scatt.data(scale(diabetic_data),cutree(havg,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Dunn index for h.c. complete
clv.Dunn(cls.scatt.data(scale(diabetic_data),cutree(hcomp,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Dunn index for kmeans
clv.Dunn(cls.scatt.data(scale(diabetic_data),kk5$cluster),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for h.c. ward
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),cutree(hward,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for h.c. average
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),cutree(havg,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for h.c. complete
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),cutree(hcomp,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for kmeans
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),kk5$cluster),c("complete","average","centroid"),c("single","complete","average","centroid"))
8.1.5. Scripts of Spectral Clustering
R Commands
#load data
diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
#create new datasets with two features
myvars <- c("num_lab_procedures","time_in_hospital")
newdata <- diabetic_data[myvars]
scaled_data<-scale(as.data.frame(newdata))
myvars2 <- c("num_lab_procedures","num_medications")
newdata2 <- diabetic_data[myvars2]
scaled_data2<-scale(as.data.frame(newdata2))
library(ggplot2)
library(kernlab)
#runs with different cluster numbers
sc1<-specc(scaled_data,centers=4)
plot(scaled_data, col = sc1)
sc2<-specc(scaled_data,centers=25)
plot(scaled_data, col = sc2)
sc3<-specc(scaled_data,centers=40)
plot(scaled_data, col = sc3)
sc4<-specc(scaled_data2,centers=4)
plot(scaled_data2, col = sc4)
sc5<-specc(scaled_data2,centers=25)
plot(scaled_data2, col = sc5)
sc6<-specc(scaled_data2,centers=35)
plot(scaled_data2, col = sc6)
#find sigma
kernelf(sc4)
#random runs with estimated sigma
sce1<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce2<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce3<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce4<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce5<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce6<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
clv.Dunn(cls.scatt.data(scale(scaled_data2),as.vector(sce1)),c("complete","average","centroid"),c("single","complete","average","centroid"))
clv.Davies.Bouldin(cls.scatt.data(scale(scaled_data2),as.vector(sce1)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#cluster sizes
plot(1:40,sort(size(sc3),decreasing=T))
plot(1:4,sort(size(sce1),decreasing=T))