Statistical Data Analysis on a Data Set
(Diabetes 130-US hospitals
for years 1999-2008 Data Set)
Seval Ünver
Dept. of Computer Engineering, Middle East Technical University
Ankara, TURKEY
Abstract—Data analysis methods are becoming increasingly important because of their wide range of applications. The main purpose of this paper is to introduce the concepts and techniques of clustering and of multivariate and exploratory data analysis using data visualization and projection. The techniques are applied to a real data set containing diabetes records from 130 US hospitals for the years 1999-2008. The analysis starts with linear projections and principal component analysis, then continues with multidimensional scaling. After that, hierarchical clustering and k-means clustering are applied, and the validity of the resulting clusters is discussed. Lastly, as an advanced topic, spectral clustering is applied, since it can handle data sets with arbitrary distributions and is easy to implement; it is an emerging research topic with numerous applications, such as dimensionality reduction and image segmentation. For all tasks, the statistical software R is used to summarize the data numerically and visually and to perform the analysis.
Keywords—clustering; spectral clustering; data analysis; k-means; multidimensional scaling
I. DATASET DESCRIPTION
“Diabetes 130-US hospitals for years 1999-2008 Data Set” [2] is selected for this research. The data set represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. In this paper, the first 1000 instances are used for the analysis (a minimal loading sketch in R is given at the end of this section); in this smaller subset, the distribution of HbA1c does not change substantially. After cleaning, 18 features remain in the training data set. Types of discrete data:
• count data (time_in_hospital, num_lab_procedures, num_procedures, num_medications, number_outpatient, number_emergency, number_inpatient, number_diagnoses)
• nominal data (gender, admission_type_id,
discharge_disposition_id, admission_source_id,
diabetesMed, change)
• ordinal data (age, A1Cresult, max_glu_serum,
readmitted)
In this data set, four groups of encounters are considered:
(1) no HbA1c test performed (A1Cresult=0),
(2) HbA1c performed and in the normal range (A1Cresult=1 or A1Cresult=2),
(3) HbA1c performed and the result greater than 8%, with no change in diabetic medications (A1Cresult=3 and change=0),
(4) HbA1c performed, the result greater than 8%, and the diabetic medication changed (A1Cresult=3 and change=1).
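The following is a minimal loading sketch in R; the file name diabetic_data.csv and the use of "?" as the missing-value marker are assumptions about the local copy of the UCI data, not part of the original analysis.

    # Minimal loading sketch (file name and missing-value marker are assumptions)
    raw  <- read.csv("diabetic_data.csv", na.strings = "?")
    diab <- raw[1:1000, ]   # first 1000 instances, as used in this paper
    str(diab)               # inspect the remaining features after cleaning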
II. DATA PROJECTION BY PCA
In PCA (Principal Component Analysis), the eigenvalues of the covariance matrix are used. It is much easier to explain PCA for two dimensions and then generalize from there, so two numeric features are selected: num_lab_procedures and time_in_hospital. Looking at the PCA components, the first component explains much more of the variance than the second.
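A minimal PCA sketch is shown below; it assumes the cleaned training set is available in the data frame diab introduced in Section I.

    # PCA on two numeric features (a sketch; `diab` is assumed from Section I)
    feats <- diab[, c("num_lab_procedures", "time_in_hospital")]
    pca   <- prcomp(feats, center = TRUE, scale. = TRUE)  # PCA on standardized data
    summary(pca)                    # proportion of variance per component
    plot(pca$x[, 1], pca$x[, 2],    # projection onto the first two components
         xlab = "PC1", ylab = "PC2")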
Fig. 1. PCA of Data Set
III. DATA PROJECTION BY MDS
Three MDS (Multidimensional Scaling) methods are used for visualisation: classical multidimensional scaling, Sammon mapping, and non-metric MDS. Classical metric MDS, Sammon mapping, and isoMDS are compared in Fig. 2. In the Sammon mapping method, the result is very sensitive to the magic parameter, which controls the step size of the iterations, as indicated by the MASS documentation. There are many features in this data set, which means high dimensionality. Although most of the unnecessary features were removed and a training set of 1000 instances is used, the data is still not easily clusterable in 2D and the clusters are not easily visible. In this analysis, MDS gives much more information than PCA.
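A sketch of the three MDS variants follows; it assumes the normalized numeric features are collected in a matrix X built from diab.

    # Three MDS variants on the normalized numeric features (a sketch)
    library(MASS)
    X <- scale(diab[, sapply(diab, is.numeric)])  # normalized numeric features
    d <- dist(X)                        # Euclidean distances between samples
    mds_classic <- cmdscale(d, k = 2)   # classical (metric) MDS
    # sammon() and isoMDS() stop on zero distances, so duplicate rows may need removal
    mds_sammon  <- sammon(d, k = 2)     # Sammon mapping (step size set by `magic`)
    mds_nonmet  <- isoMDS(d, k = 2)     # non-metric (Kruskal) MDS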
Fig. 2. Data projection by MDS
IV. CLUSTERING
Clustering is a technique for finding similarity groups in data, called clusters. On this data set, two clustering methods are used for data visualisation. The first is hierarchical clustering with the “average”, “complete” and “ward” linkages. Linkage, the distance from a newly formed node to all other nodes, can be computed in several different ways, such as single, complete, and average. The corresponding dendrograms are plotted. The second is k-means clustering with different k values (5, 10, 25, 100, 200) and several random runs; the “elbow” value for k is around 100, consistent with the ground truth. The Euclidean distance between normalized samples is used as the distance measure for both methods.
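A sketch of both methods on the scaled matrix X from Section III is given below; the ward.D2 linkage name is the one used by recent versions of R (older versions call it "ward").

    # Hierarchical clustering with three linkages and a k-means elbow sweep (a sketch)
    d <- dist(X)                                  # Euclidean distances
    hc_avg  <- hclust(d, method = "average")
    hc_comp <- hclust(d, method = "complete")
    hc_ward <- hclust(d, method = "ward.D2")      # "ward" in older R versions
    plot(hc_avg)                                  # dendrogram for average linkage

    ks  <- c(5, 10, 25, 100, 200)
    wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
    plot(ks, wss, type = "b", xlab = "k",
         ylab = "total within-cluster SS")        # elbow plot over k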
Fig. 3. Hierarchical Clustering
In the k-means algorithm, the difficulty is to specify k. In addition, the algorithm is highly sensitive to outliers, and this data set has many outliers because it is real-world data. A further weakness of k-means is that it is only applicable when the mean is defined; for categorical data the k-modes variant is used instead, where the centroid is represented by the most frequent values. Therefore, k-means cannot be considered the best way to estimate the number of clusters. On the other hand, hierarchical clustering has O(n^2) complexity, which makes it hard to use for large data sets.
Fig. 4. K-Means Clustering
V. CLUSTER VALIDATION
Cluster validation is concerned with the quality of clusters generated by a clustering algorithm. Given a partitioning of a data set, it attempts to answer questions such as: How pronounced is the cluster structure that has been identified? How do clustering solutions from different algorithms compare? How do clustering solutions for different parameters (e.g. the number of clusters) compare? [6]
"Ground truth" means a set of measurements that is known
to be much more accurate than measurements from the system
you are testing. In Diabetes 130-US hospitals for years 1999-
2008 Data Set, there was no labels to determine classes. I
labeled the four group in a new column with a Java console
program. The column name is label. This column holds
numbers which ranges from 1 to 4.
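An equivalent labeling can be expressed in a few lines of R; this is only a sketch (the labeling itself was done by the Java program), and it assumes A1Cresult and change have already been recoded to the numeric values used in Section I.

    # Equivalent group labeling in R (assumes the recoded columns from Section I)
    diab$label <- with(diab, ifelse(A1Cresult == 0, 1,
                             ifelse(A1Cresult %in% c(1, 2), 2,
                             ifelse(A1Cresult == 3 & change == 0, 3, 4))))
    table(diab$label)   # sizes of the four ground-truth groups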
Method                     Precision
H. clustering (ward)       0.482
H. clustering (average)    0.993
H. clustering (complete)   0.976
K-means                    0.139
Fig. 5. Clustering Validation
The goal of using an index is to determine the optimal clustering parameters. Smaller intracluster distances and greater intercluster distances are desired. Different distance measures can be used for the index calculations. For spectral clustering, the Dunn index and the Davies-Bouldin index are used for validation. The most practical difference between the two indices is that a higher Dunn index is better, while a lower Davies-Bouldin index is better. The distances discussed here are Euclidean distances. The hierarchical methods give better results than k-means.
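One way to compute the two indices in R is sketched below; it assumes the add-on packages clValid (for the Dunn index) and clusterSim (for the Davies-Bouldin index) are available, and reuses d, X, and hc_avg from the clustering sketch above.

    # Dunn and Davies-Bouldin indices for one clustering (a sketch)
    library(clValid)      # dunn()
    library(clusterSim)   # index.DB()
    cl <- cutree(hc_avg, k = 4)                    # e.g. 4 clusters, average linkage
    dunn(distance = as.matrix(d), clusters = cl)   # higher is better
    index.DB(X, cl)$DB                             # lower is better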
VI. SPECTRAL CLUSTERING
Spectral clustering has been proposed as a clustering method based on graph theory. It uses the top eigenvectors of a matrix derived from the distances between points. Such algorithms have been used successfully in many applications, including computer vision and VLSI design [8]. Through spectral analysis of the affinity matrix of a data set, spectral clustering can obtain promising clustering results [7]. Because there is no iterative procedure in the algorithm, spectral clustering avoids getting trapped in local minima the way k-means does. The process of spectral clustering can be summarized as follows [7][8] (suppose the data set X = {x1, x2, …, xn} has k classes):
Spectral Clustering Algorithm:
STEP 1. Construct the affinity matrix W: if i ≠ j, then w_ij = exp(−||x_i − x_j||^2 / (2σ^2)); otherwise w_ij = 0.
STEP 2. Define the diagonal matrix D, where d_ii = Σ_j w_ij, and define the normalized Laplacian matrix L = D^(−1/2) W D^(−1/2).
STEP 3. Compute the k eigenvectors v_1, …, v_k corresponding to the k largest eigenvalues of L and stack them as columns of the matrix V = [v_1 v_2 … v_k] in R^(n×k). Then form the matrix Y by normalizing each row of V to unit length: y_ij = v_ij / (Σ_j v_ij^2)^(1/2).
STEP 4. Treat each row of Y as a point in R^k and cluster these points into k clusters via k-means. Assign the original point x_i to cluster j iff row i of Y was assigned to cluster j.
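For illustration, the four steps can be written directly in R as follows. This is only a sketch (the analysis itself uses kernlab's specc, as described below), and the function name spectral_cluster is hypothetical.

    # From-scratch sketch of the algorithm above (illustrative only)
    spectral_cluster <- function(X, k, sigma) {
      d2 <- as.matrix(dist(X))^2                      # squared Euclidean distances
      W  <- exp(-d2 / (2 * sigma^2)); diag(W) <- 0    # STEP 1: affinity matrix
      Dh <- 1 / sqrt(rowSums(W))                      # D^(-1/2) as a vector
      L  <- outer(Dh, Dh) * W                         # STEP 2: normalized Laplacian
      V  <- eigen(L, symmetric = TRUE)$vectors[, 1:k] # STEP 3: top k eigenvectors
      Y  <- V / sqrt(rowSums(V^2))                    #         row-normalize
      kmeans(Y, centers = k, nstart = 10)$cluster     # STEP 4: k-means on rows of Y
    }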
On this data set, the algorithm given above is used for spectral clustering. To implement it, the extensible R package kernlab, which provides kernel-based machine learning methods, is used; with this package, spectral clustering can be done easily in a few steps via the “specc” function.
In spectral clustering, the similarity between data points is often defined by a Gaussian kernel [7]. The scale hyperparameter σ of the Gaussian kernel greatly influences the final clustering results, so to find the best σ, a parameter estimation is done first. After that, several runs with the same parameters are compared.
The number of clusters is estimated as 4, 25 and 35, and each value is tried with the specc method. The results are shown on a data subset: {num_medications, num_lab_procedures}. The estimated value for 4 clusters is σ = 4.40010321258815, and random runs are done with this sigma hyperparameter.
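A minimal specc sketch is shown below; diab is assumed from Section I, and note that kernlab's rbfdot kernel is parameterized as exp(−σ·||x − y||^2), so passing the sigma value from the text on that convention is an assumption.

    # Spectral clustering of the two-feature subset with kernlab (a sketch)
    library(kernlab)
    sub <- as.matrix(diab[, c("num_medications", "num_lab_procedures")])
    sc  <- specc(sub, centers = 4, kernel = "rbfdot",
                 kpar = list(sigma = 4.40010321258815))
    plot(sub, col = sc)       # colour points by cluster assignment
    centers(sc); size(sc)     # cluster centres and sizes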
Fig. 6. Spectral Clustering with different numbers of centers
The results of the random runs are presented in Fig. 7; they are approximately the same.
Fig. 7. Random Run Results with Centers=4 and σ=4.4001.
VII. VALIDATION OF SPECTRAL CLUSTERING
The first result of the random runs is chosen for validation. The Dunn index and Davies-Bouldin index results are given for comparison; recall that a higher Dunn index is better, while a lower Davies-Bouldin index is better. For both indices, the centroid diameter combined with complete linkage gives the best result. Since the ground truth has 4 clusters, the sizes of the clusters are consistent with the ground truth.
Spectral (Dunn)    Complete diameter   Average diameter   Centroid diameter
Single link        0.00668823          0.04265670         0.06039251
Complete link      0.52512892          3.34920700         4.74174095
Average link       0.15697465          1.00116480         1.41742928
Centroid link      0.09740126          0.62121320         0.87950129
Fig. 8. Spectral Clustering Result Validation with Dunn Index
Spectral (DB)      Complete diameter   Average diameter   Centroid diameter
Single link        194.38845600        37.45235040        26.37252550
Complete link      1.83402600          0.43463780         0.30763710
Average link       7.44275800          1.32189570         0.92798800
Centroid link      10.87935700         1.93850020         1.36070810
Fig. 9. Spectral Clustering Result Validation with Davies-Bouldin Index
REFERENCES
[1] Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records”, BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.
[2] John Clore, Krzysztof J. Cios, Jon DeShazo, Beata Strack, “Diabetes 130-US hospitals for years 1999-2008 Data Set”, 05.03.2014, Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058 and a recipient of the CERNER data. This data is a de-identified abstract of the Health Facts database (Cerner Corporation, Kansas City, MO).
[3] Laura Mulvey, Julian Gingold, “Microarray Clustering Methods and Gene Ontology”, 2007.
[4] Phil Ender, “Multivariate Analysis: Hierarchical Cluster Analysis”, 1998.
[5] U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650-1654, 2002.
[6] Julia Handl, Joshua Knowles, Douglas Kell, “Computational cluster validation in post-genomic data analysis”, Bioinformatics, vol. 21, no. 15, pp. 3201-3212, 2005.
[7] Lai Wei, “Path-based Relative Similarity Spectral Clustering”, 2010 Second WRI Global Congress on Intelligent Systems, 16-17 Dec. 2010.
[8] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss, “On spectral clustering: Analysis and an algorithm”, Advances in Neural Information Processing Systems, 2001.
[9] D.L. Davies and D.W. Bouldin, “A Cluster Separation Measure”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, pp. 224-227, 1979.
[10] J.C. Dunn, “A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters”, J. Cybernetics, vol. 3, pp. 32-57, 1973.
