Statistical Data Analysis
on
Diabetes 130-US hospitals
for years 1999-2008 Data Set
Document Version: 1.0
(Date: 12/01/15)
Seval Ünver
unver.seval@metu.edu.tr
Student Number: 1900810 (M.Sc.)
Department of Computer Engineering, Middle East Technical University
Ankara, TURKEY
Version History
Version Status* Date Responsible Version Definition
0.1 Sent via Email 29/10/14 Seval Unver Projection by PCA (6 hours)
0.2 Uploaded to OdtuClass 05/11/14 Seval Unver Projection by MDS (6 hours)
0.3 Uploaded to OdtuClass 19/11/14 Seval Unver Data clustering by hierarchical and k-means clustering (6 hours)
0.4 Uploaded to OdtuClass 26/11/14 Seval Unver Cluster Validation (5 hours)
1.0 Final Report 12/01/15 Seval Unver Spectral Clustering (6 hours)
Table of Contents
1. Data Set Description
2. Data projection by PCA
2.1. Eigenvalues and Eigenvectors
2.2. Plot directions of the first and second principal components on the original coordinate system
2.3. Transformed data set onto a new coordinate system by using the first two principal components
2.4. Personal Observations and Comments
2.5. Details of Implementation
3. Data projection by MDS
3.1. Classical Metric
3.2. Sammon Mapping and isoMDS
3.3. Use the projection of data onto first two principal axes (as a result of PCA) to initialize MDS (sammon and isoMDS). Plot the final projections
3.4. Observations and Comments
3.5. Self-Reflection about MDS
4. Clustering
4.1. Hierarchical Clustering
4.2. K-Means Clustering
4.2.1. K-Means algorithm for 5 different k values – Error Plot
4.2.2. K-Means with 5 different initial configurations when k is 100 and when k is 25 – Error Plot
4.2.3. Plot the data in 2D
4.3. Self-Reflection About Clustering
5. Cluster Validation
5.1. Comparison of Actual Labels and Predicted Labels
5.2. Dunn Index and Davies-Bouldin
5.2.1. Dunn Index Measurements
5.2.2. Davies-Bouldin Measurements
5.3. Self-Reflection About Validation
6. Spectral Clustering
7. References
8. Appendix
8.1. Used Scripts & Programs
8.1.1. Scripts of PCA Projection
8.1.2. Scripts of MDS Projection
8.1.3. Scripts of Clustering
8.1.4. Scripts of Cluster Validation
8.1.5. Scripts of Spectral Clustering
1. Data Set Description
"Diabetes 130-US hospitals for years 1999-2008 Data Set" is selected for this research. This data
has been prepared to analyze factors related to readmission as well as other outcomes pertaining to
patients with diabetes. The dataset represents 10 years (1999-2008) of clinical care at 130 US
hospitals and integrated delivery networks. It includes 50 features representing patient and hospital
outcomes.
The original large database has 74 million unique encounters corresponding to 17 million unique
patients. The database consists of 41 tables in a fact-dimension schema and a total of 117 features.
Almost 70,000 inpatient diabetes encounters were identified with sufficient detail for this analysis.
In earlier research, this database was used to show the relationship between the measurement of
HbA1c and early readmission while controlling for covariates such as demographics, severity and
type of the disease, and type of admission. The dataset was created in two steps. First, encounters of
interest were extracted from the database with 55 attributes. Second, preliminary analysis and
preprocessing of the data were performed, retaining only those features (attributes) and
encounters that could be used in further analysis, that is, those containing sufficient information [1].
Information was extracted from the database for encounters that satisfied the following criteria:
1. It is an inpatient encounter (a hospital admission).
2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the
system as a diagnosis.
3. The length of stay was at least 1 day and at most 14 days.
4. Laboratory tests were performed during the encounter.
5. Medications were administered during the encounter.
Data Set Download Link: http://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
Date Donated: 05/03/14
Source: The data are submitted on behalf of the Center for Clinical and
Translational Research, Virginia Commonwealth University, a
recipient of NIH CTSA grant UL1 TR00058 and a recipient of the
CERNER data. John Clore (jclore '@' vcu.edu), Krzysztof J. Cios
(kcios '@' vcu.edu), Jon DeShazo (jpdeshazo '@' vcu.edu), and Beata
Strack (strackb '@' vcu.edu). This data is a de-identified abstract of
the Health Facts database (Cerner Corporation, Kansas City, MO).
Table 1. Source of data set
There are approximately 100,000 instances and 50 columns in this data set. The data set is
multivariate, since there are many variables; its subject area is life (medicine), and the values come
from real-world records, so there are missing values. Classification and clustering methods can be
used on this data.
The data contains attributes such as patient number, race, gender, age, admission type, time in
hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test
result, diagnosis, number of medications, diabetic medications, and number of outpatient, inpatient,
and emergency visits in the year before the hospitalization. The whole attribute list can be seen in
Table 2.
No | Feature name | Type | Description and values | % missing
1 | Encounter ID | Numeric | Unique identifier of an encounter | 0%
2 | Patient number | Numeric | Unique identifier of a patient | 0%
3 | Race | Nominal | Values: Caucasian, Asian, African American, Hispanic, and other | 2%
4 | Gender | Nominal | Values: male, female, and unknown/invalid | 0%
5 | Age | Nominal | Grouped in 10-year intervals: [0, 10), [10, 20), ..., [90, 100) | 0%
6 | Weight | Numeric | Weight in pounds | 97%
7 | Admission type | Nominal | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available | 0%
8 | Discharge disposition | Nominal | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available | 0%
9 | Admission source | Nominal | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital | 0%
10 | Time in hospital | Numeric | Integer number of days between admission and discharge | 0%
11 | Payer code | Nominal | Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay | 52%
12 | Medical specialty | Nominal | Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon | 53%
13 | Number of lab procedures | Numeric | Number of lab tests performed during the encounter | 0%
14 | Number of procedures | Numeric | Number of procedures (other than lab tests) performed during the encounter | 0%
15 | Number of medications | Numeric | Number of distinct generic names administered during the encounter | 0%
16 | Number of outpatient visits | Numeric | Number of outpatient visits of the patient in the year preceding the encounter | 0%
17 | Number of emergency visits | Numeric | Number of emergency visits of the patient in the year preceding the encounter | 0%
18 | Number of inpatient visits | Numeric | Number of inpatient visits of the patient in the year preceding the encounter | 0%
19 | Diagnosis 1 | Nominal | The primary diagnosis (coded as first three digits of ICD9); 848 distinct values | 0%
20 | Diagnosis 2 | Nominal | Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values | 0%
21 | Diagnosis 3 | Nominal | Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values | 1%
22 | Number of diagnoses | Numeric | Number of diagnoses entered to the system | 0%
23 | Glucose serum test result | Nominal | Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured | 0%
24 | A1c test result | Nominal | Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured | 0%
25-47 | 23 features for medications | Nominal | The feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed | 0%
48 | Change of medications | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change” | 0%
49 | Diabetes medications | Nominal | Indicates if there was any diabetic medication prescribed. Values: “yes” and “no” | 0%
50 | Readmitted | Nominal | Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission | 0%
Table 2. List of features and their descriptions in the initial dataset
The data are real-world data, so there is incomplete, redundant, and noisy information. Some
features have a high percentage of missing values: weight (97% of values missing), payer code
(52%), and medical specialty (53%). Weight could be directly relevant to diabetes, but it is too
sparse in this database, so it can be removed. Payer code can also be removed because it is not
relevant to diabetes. The medical specialty attribute can be removed too: it shows the physician's
speciality, which might be important, but it is not the focus of this research. Therefore, the three
features with the most missing values are removed.
To summarize, the dataset consists of hospital admissions of length between 1 and 14 days that did
not result in a patient death or discharge to a hospice. Each encounter corresponds to a unique
patient diagnosed with diabetes, although the primary diagnosis may be different. During each of
the analyzed encounters, lab tests were ordered and medication was administered [1].
Four groups of encounters are considered: (1) no HbA1c test performed, (2) HbA1c performed and
in normal range, (3) HbA1c performed and the result is greater than 8% with no change in diabetic
medications, and (4) HbA1c performed, result is greater than 8%, and diabetic medication was
changed.
HbA1c group | Number of encounters | % of the population | Readmitted encounters | Readmitted % in group
No test was performed | 57,080 | 81.60% | 5,342 | 9.40%
Result was high and the diabetic medication was changed | 4,071 | 5.80% | 361 | 8.90%
Result was high but the diabetic medication was not changed | 2,196 | 3.10% | 166 | 7.60%
Normal result of the test | 6,637 | 9.50% | 590 | 8.90%
Table 3. Distribution of HbA1c in whole data set
1000 instances are separated out to analyse as a training set. In this small data set, the distribution
of HbA1c changes slightly, as can be seen by comparing Table 3 and Table 4.
HbA1c group | Number of encounters | % of the population | Readmitted encounters | Readmitted % in group
No test was performed | 815 | 81.50% | 76 | 9.32%
Result was high and the diabetic medication was changed | 60 | 6.00% | 6 | 10.00%
Result was high but the diabetic medication was not changed | 58 | 5.80% | 6 | 10.34%
Normal result of the test | 67 | 6.70% | 6 | 8.95%
Table 4. Distribution of HbA1c in training data set
The ratio of the population who did not have the HbA1c test is 81.60% in the whole data set and
81.50% in the training data set; the readmission rate for this group is 9.40% in the whole data set
and 9.32% in the training data set. The ratio of the population with a result higher than 8% in the
HbA1c test is 8.90% in the whole data set and 11.80% in the training data set. The readmission
rates are close to each other: in the group that had higher results and changed the medication, the
ratio of patients readmitted to the hospital within 30 days is 8.90% for the whole data set and
10.00% for the training data set.
Encounter ID and Patient Number are removed because we are interested in a summary of this
data. Diagnosis 1, Diagnosis 2, and Diagnosis 3 are removed because they are nominal and each
has more than 900 distinct values. Race is removed from the data set because it is not necessary at
this point. The 23 medication features are removed because they are nominal and not useful for
Principal Component Analysis. Our aim is to find a relation between the HbA1c test and the
readmission rate, so we keep that information. Gender is expressed as 1 for women and 0 for men.
The other nominal values are changed to integer representations.
Changed nominal values:
• Gender: Female = 1, Male = 0
• Age: [0, 10) = 1, [10, 20) = 2, [20, 30) = 3, [30, 40) = 4, [40, 50) = 5, [50, 60) = 6, [60, 70) = 7, [70, 80) = 8, [80, 90) = 9, [90, 100) = 10
• A1Cresult: None = 0, Normal = 1, >7 = 2, >8 = 3
• max_glu_serum: None = 0, Normal = 1, >200 = 2, >300 = 3
• change: None = 0, Change = 1
• diabetesMed: Yes = 1, No = 0
• readmitted: No = 0, <30 = 1, >30 = 2
After removing unnecessary features, there are now 18 features in the training data set. Types of
discrete data:
• count data (time_in_hospital, num_lab_procedures, num_procedures, num_medications, number_outpatient, number_emergency, number_inpatient, number_diagnoses)
• nominal data (gender, admission_type_id, discharge_disposition_id, admission_source_id, diabetesMed, change)
• ordinal data (age, A1Cresult, max_glu_serum, readmitted)
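The recoding itself can be scripted rather than done by hand. A minimal sketch in R using named lookup vectors follows; the exact level strings in the raw file (e.g. "Female", "None") are assumptions here and should be checked against summary(diabetic_data) first.
# recode nominal columns to integers with named lookup vectors
# (the level strings are assumptions; verify them before running)
recode <- function(x, map) unname(map[as.character(x)])
diabetic_data$gender <- recode(diabetic_data$gender, c("Female" = 1, "Male" = 0))
diabetic_data$A1Cresult <- recode(diabetic_data$A1Cresult, c("None" = 0, "Normal" = 1, ">7" = 2, ">8" = 3))
diabetic_data$readmitted <- recode(diabetic_data$readmitted, c("No" = 0, "<30" = 1, ">30" = 2))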
2. Data projection by PCA
PCA (Principal Component Analysis) is done using the GNU R function princomp. As a method,
the eigenvalues of the covariance matrix are used.
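As an illustration of this method, here is a minimal sketch (assuming "training" is the numeric 18-feature training set) of PCA via the eigen-decomposition of the covariance matrix; it matches princomp up to sign and a divisor convention (princomp divides by n rather than n-1).
# PCA by hand: eigen-decomposition of the covariance matrix
X <- scale(training, center = TRUE, scale = FALSE)  # center the data
ev <- eigen(cov(X))                                 # ev$values are the component variances
scores <- X %*% ev$vectors                          # data projected onto the principal axes
my.pca <- princomp(training)                        # compare my.pca$sdev^2 with ev$values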
2.1. Eigenvalues and Eigenvectors
Image: First 18 eigenvalues of whole data
First we look at the whole data and plot it. It is much easier to explain PCA for two dimensions
and then generalize from there, so two numeric features are selected: A1Cresult and
time_in_hospital. Since the A1Cresult feature is categorical, it is replaced with
num_lab_procedures. Look at the colorful plot of num_lab_procedures against num_medications.
If we look at the PCA components, we can see that the first component is much stronger than the
second.
PCA Components of subset of data:
2.2. Plot directions of the first and second principal components on the
original coordinate system
PCA of whole data:
When we use the two features:
Image: 2D projections of data on principal components
2.3. Transformed data set onto a new coordinate system by using the
first two principal components
Image: Cumulative Percentages of Eigenvalues
Image: Component 1 vs. Component 2
2.4. Personal Observations and Comments
This data set includes not only numerical values but also nominal values; there are both continuous
and categorical data. However, PCA is developed for and suited to continuous (ideally, multivariate
normal) data. There are no obvious outliers in the data. Although a PCA applied to binary data
would yield results comparable to those obtained from a Multiple Correspondence Analysis (factor
scores and eigenvalues are linearly related), there are more appropriate techniques to deal with
mixed data types, namely Multiple Factor Analysis for mixed data, available in the FactoMineR R
package (AFDM()). If the variables can be considered as structured subsets of descriptive
attributes, then Multiple Factor Analysis (MFA()) is also an option.
The challenge with categorical variables is to find a suitable way to represent distances between
variable categories and individuals in the factorial space. To overcome this problem, one can look
for a non-linear transformation of each variable (whether nominal, ordinal, polynomial, or
numerical) with optimal scaling. This is well explained in "Gifi Methods for Optimal Scaling in R:
The Package homals" [2], and an implementation is available in the corresponding R package
homals.
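For reference, a hedged sketch of the mixed-data alternative mentioned above; note that in recent FactoMineR releases the AFDM() function has been superseded by FAMD(), and keeping the nominal columns as factors (rather than the integer recodes) is assumed.
# Factor Analysis of Mixed Data: a sketch, not a definitive implementation
library(FactoMineR)
res <- FAMD(diabetic_data, ncp = 2, graph = TRUE)  # nominal columns kept as factors
res$eig                                            # eigenvalues and explained variance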
2.5. Details of Implementation
The data file ends with ‘.csv’, i.e., Comma-Separated Values. There are 101,768 lines in the file.
Data is imported into R software:
> diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
To get the summary of data:
> summary(diabetic_data)
The data is opened with Excel and the 3 feature columns with the most missing values are deleted:
weight (97% of values missing), payer code (52%), and medical specialty (53%). Then a subset of
this data, the first 1000 rows, is selected as training data. It represents the whole data reasonably
well because these 1000 rows were chosen randomly by the developers of this data set.
The training data is imported into R. The summary of the training data does not differ much from
the original data: for example, approximately half of the patients are women and the other half are
men, and the mean values are close to each other. Of course there is a difference between the
original data and the training data, but using training data speeds up the analysis.
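A minimal sketch of building that training set, assuming the same file path as the import above:
# import the cleaned file and take the first 1000 rows as training data
diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
training <- diabetic_data[1:1000, ]
summary(training)  # compare with summary(diabetic_data)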
To get the new distribution of the HbA1c result, the subset of this data is shown in R. After
removing the unnecessary columns and changing values to integer representations, the data set has
18 features, so it is imported again.
To run Principal Component Analysis, we can use the princomp function in R. To use the
correlation matrix, we pass the cor=TRUE parameter. It shows the Importance of Components:
Image: Summary of new data set (with 18 features)
Image: Plot of Diabetic_Data
It looks like Component 1 is very strong. When it is plotted, it gives the results that can be seen
below.
Image: Plot of PCA
A scree plot is a graphical display of the variance of each component in the dataset, used to
determine how many components should be retained in order to explain a high percentage of the
variation in the data. The first and second components are the largest, so we should keep them.
Now let's see the biplot. Biplots are a type of exploratory graph used in statistics, a generalization
of the simple two-variable scatterplot. A biplot allows information on both samples and variables of
a data matrix to be displayed graphically. Samples are displayed as points while variables are
displayed as vectors, linear axes, or nonlinear trajectories. In the case of categorical variables,
category level points may be used to represent the levels of a categorical variable. A generalised
biplot displays information on both continuous and categorical variables.
Image: biplot of PCA
It is hard to read the information in this biplot image. However, we can see that some of the red
arrows point in the same direction, which means the corresponding variables are associated.
3. Data projection by MDS
For visualisation, three MDS (Multidimensional Scaling) methods are used: classical
multidimensional scaling, Sammon mapping, and non-metric MDS.
Classical multidimensional scaling is done in 2D with the cmdscale and dist functions, which use
the Euclidean distance between samples. As samples, the raw data and the first two dimensions of
the PCA result are used.
Sammon mapping is done with the sammon function in the MASS package. For the initial
configuration, the result of PCA and several instances of uniformly distributed random points are
used.
Similar to Sammon mapping, non-metric MDS is done with the isoMDS function in the MASS
package. For the initial configuration, the same values as in the Sammon mapping analysis are used.
3.1. Classical Metric
Image: Classical MDS
3.2. Sammon Mapping and isoMDS
At least 5 different random initial configurations were tried; the one that gives the minimum error
was chosen and its final MDS projection onto two dimensions plotted. If the result of PCA is used,
Sammon mapping converges in just a few iterations and leaves the configuration almost unchanged.
The result is very sensitive to the magic parameter, which controls the step size of the iterations, as
indicated by the MASS documentation. Here, the magic parameter is chosen as 0.05.
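A minimal sketch of that random-restart procedure, assuming my.dist from Appendix 8.1.2 (sammon requires that no two distinct points have zero distance):
# try five random initial configurations and keep the lowest-stress fit
library(MASS)
set.seed(1)
fits <- lapply(1:5, function(i) {
  init <- cbind(runif(1000, -0.5, 0.5), runif(1000, -0.5, 0.5))
  sammon(my.dist, y = init, magic = 0.05, trace = FALSE)
})
best <- fits[[which.min(sapply(fits, function(f) f$stress))]]
plot(best$points)  # projection with the minimum error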
Image: Sammon Mapping with PCA
Image: Magic parameter is 0.05
If the magic parameter is 0.05:
Initial stress : 0.79437
stress after 2 iters: 0.61243
Image: Magic parameter is 0.01
If the magic parameter is 0.01:
Initial stress : 0.79437
stress after 10 iters: 0.50231, magic = 0.115
stress after 10 iters: 0.50231
Image: Magic parameter is 0.02
If the magic parameter is 0.02:
Initial stress : 0.79437
stress after 10 iters: 0.41383, magic = 0.231
stress after 20 iters: 0.36066, magic = 0.021
stress after 30 iters: 0.34217, magic = 0.002
stress after 35 iters: 0.33437
3.3. Use the projection of data onto first two principal axes (as a result
of PCA) to initialize MDS (sammon and isoMDS). Plot the final
projections.
Image: isoMDS (Non-metric Mapping with PCA)
initial value 11.809209
final value 11.804621
converged
Image: Non-Metric mapping with random configuration
initial value 48.305275
final value 48.304120
converged
3.4. Observations and Comments
There are many features in this data set, which means high dimensionality. Although we removed
most of the unnecessary features and used a training set of 1000 instances, the data is still not
easily clusterable in 2D, and clusters are not easily visible. The iterative methods used here yield
outliers which are not present in PCA, but this behaviour is very dependent on parameters other
than the distance data. Classic Torgerson metric MDS is actually done by transforming distances
into similarities and performing PCA (eigen-decomposition or singular value decomposition) on
those. So, PCA might be called the algorithm of the simplest MDS.
Thus, MDS and PCA are not at the same level, to be aligned with or opposed to each other. PCA is
just a method, while MDS is a class of analyses. As a mapping, PCA is a particular case of MDS.
On the other hand, PCA is a particular case of Factor Analysis which, being a data reduction, is
more than only a mapping, while MDS is only a mapping.
3.5. Self-Reflection about MDS
As seen in this assignment, MDS gives much more information than PCA. This assignment
provides a general perspective on measures of similarity and dissimilarity. Both MDS and PCA use
proximity measures, such as the correlation coefficient or Euclidean distance, to generate a spatial
configuration (map) of points in multidimensional space where distances between points reflect the
similarity among observations.
4. Clustering
Clustering is a technique for finding similarity groups in data, called clusters. In other words,
clustering is the task of grouping a set of objects in such a way that objects in the same group
(called a cluster) are more similar (in some sense or another) to each other than to those in other
groups (clusters). On this data, two clustering methods are used for data visualisation.
The first is hierarchical clustering with three different linkage methods, implemented in the GNU R
function hclust with methods "average", "complete", and "ward". Their dendrograms are plotted.
The second is k-means clustering with different k values (5, 10, 25, 100, 200) and several random
runs around the "elbow" value for k, which is about 100, consistent with the ground truth.
The Euclidean distance between normalized samples is used as the distance measure for both
methods.
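A minimal sketch of the two set-ups, assuming the recoded 18-feature training data; note that from R 3.1.0 onward hclust's Ward method is requested as "ward.D" (or "ward.D2"), with plain "ward" kept as a deprecated alias:
ds <- dist(scale(diabetic_data))                 # Euclidean distance on normalized samples
hward <- hclust(ds, method = "ward.D")           # hierarchical clustering, Ward linkage
plot(hward)                                      # dendrogram
km <- kmeans(scale(diabetic_data), centers = 25) # k-means with k = 25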
4.1. Hierarchical Clustering
Dendrograms of hierarchical clustering for three linkages are illustrated under this title. Of the
three, the "ward" method is the easiest to interpret visually, even for a high number of chosen
clusters. The "average" and "complete" methods yield visually similar results that are not as easy
to interpret. The most interesting observation is that, beyond a certain number of clusters, the
methods look similar, even though their results for low numbers of clusters look different.
Linkage, or the distance from a newly formed node to all other nodes, can be computed in several
different ways: single, complete, and average. The figure below roughly demonstrates what each
linkage evaluates [3]:
Average linkage clustering uses the average similarity of observations between two groups as the
measure between the two groups. Complete linkage clustering uses the farthest pair of observations
between two groups to determine the similarity of the two groups. Single linkage clustering, on the
other hand, computes the similarity between two groups as the similarity of the closest pair of
observations between the two groups [4].
Ward's linkage is distinct from all the other methods because it uses an analysis of variance
approach to evaluate the distances between clusters. In short, this method attempts to minimize the
Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at each step. In general,
this method is regarded as very efficient, however, it tends to create clusters of small size [4].
Image: Hierarchical Clustering with Ward Method
Image: Dendrograms for 5 (red), 25 (green), 100 (blue) clusters on Hierarchical Clustering with
Ward Method
Image: Hierarchical Clustering with average method
Image: Dendrograms for 5, 25, 100 clusters on Hierarchical Clustering with Average Method
Image: Hierarchical Clustering with complete method
Image: Dendrograms for 5, 25, 100 clusters on Hierarchical Clustering with Complete Method
I recommend 25 clusters for this dataset based on the results obtained above. It is difficult to
estimate an accurate number of clusters, but as seen with the Ward method, using 25 clusters seems
the best choice.
4.2. K-Means Clustering
K-means clustering aims to partition n observations into k clusters in which each observation
belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. Cluster
counts of 5, 10, 25, 100, and 200 are examined under this title. The total sum of squares within
clusters, as given by the "tot.withinss" property of the kmeans function, is used as the error
function to evaluate the best number of clusters.
25 is close to the "rule of thumb" value for 1000 samples (k ≈ √(n/2) ≈ 22). The ground truth of 100
is close to the "elbow" value, beyond which the error does not improve as dramatically. In the next
steps, five random runs each for k=100 and k=25 and their errors are illustrated.
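A minimal sketch of the elbow search described above, on the same scaled data:
# sweep k and plot the total within-cluster sum of squares
ks <- c(5, 10, 25, 100, 200)
wss <- sapply(ks, function(k) kmeans(scale(diabetic_data), centers = k)$tot.withinss)
plot(ks, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")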
4.2.1. K-Means algorithm for 5 different k values – Error Plot
Image: “K-Means Clustering with k=5,10,25,100,200” vs. “Error Plot”
4.2.2. K-Means with 5 different initial configurations when k is 100 and
when k is 25 – Error Plot
Two different k values are tried in this task. Because determining the number of clusters is
difficult, k=25 and k=100 are each tried with 5 different initial configurations. When k is 100, the
third try is the best because its error is the lowest; when k is 25, the fifth try is the best, again
because its error is the lowest.
Image: “5 random runs with k=100” vs. “Error Plot”
Image: “5 random runs with k=25” vs. “Error Plot”
4.2.3. Plot the data in 2D
In the previous step, the error plot was shown in a graph. From the five initial configurations, the
third configuration is chosen because its error is the lowest when k=100. Here is the 2D plot of the
third configuration with k=100 in the K-Means algorithm.
Image: Third initial configuration with k=100 – Clusters in 2D
Here is the 2D plot of the fifth configuration with k=25 in the K-Means algorithm.
Image: Fifth initial configuration with k=25 – Clusters in 2D
4.3. Self-Reflection About Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in
the same group (called a cluster) are more similar (in some sense or another) to each other than to
those in other groups (clusters). This assignment helped me estimate the number of clusters in my
data; after this assignment, I chose k=25 as the cluster number. The figures and projections were
very beneficial in this task.
In the k-means algorithm, my difficulty was specifying k. In addition, the algorithm is very
sensitive to outliers, and my data has a lot of outliers because it is real-world data. A weakness of
k-means is that the algorithm is only applicable when the mean is defined; for categorical data
there is k-modes, where the centroid is represented by the most frequent values. Therefore, we
cannot say that k-means is the best solution for estimating the number of clusters. The other
algorithms have their own weaknesses. Comparing different clustering algorithms is a difficult
task: no one knows the correct clusters.
It is very hard, if not impossible, to know what distribution the application data follow. The data
may not fully follow any “ideal” structure or distribution required by the algorithms. One also needs
to decide how to standardize the data, to choose a suitable distance function and to select other
parameter values.
Hierarchical clustering has O(n^2) complexity; due to this complexity, it is hard to use for large
data sets. I used a sample of 1000 instances in this task, so it did not take a long time to run.
My data has a lot of nominal attributes with more than two states or values. The commonly used
distance measure is based on the simple matching method.
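As an illustration, the simple matching distance between two nominal vectors is just the fraction of attributes on which they disagree; a toy sketch:
simple_matching <- function(a, b) mean(a != b)
simple_matching(c("yes", "no", "up"), c("yes", "steady", "up"))  # 1/3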
To sum up, this assignment is very helpful for students who want to implement clustering on
unlabeled big data.
5. Cluster Validation
Cluster validation is concerned with the quality of clusters generated by a data clustering
algorithm. Given the partitioning of a data set, it attempts to answer questions such as: How
pronounced is the cluster structure that has been identified? How do clustering solutions from
different algorithms compare? How do clustering solutions for different parameters (e.g., the
number of clusters) compare? [6]
5.1. Comparison of Actual Labels and Predicted Labels
"Ground truth" means a set of measurements that is known to be much more accurate than
measurements from the system you are testing. In Diabetes 130-US hospitals for years 1999-2008
Data Set, there was no labels to determine classes. I labeled the four group in a new column with a
Java console program. The column name is label. This column holds numbers which ranges from 1
to 4.
In this dataset, four groups of encounters are considered:
(1) no HbA1c test performed (A1Cresult=0),
(2) HbA1c performed and in normal range (A1Cresult=1 or A1Cresult=2),
(3) HbA1c performed and the result is greater than 8% with no change in diabetic medications (A1Cresult=3 and change=0),
(4) HbA1c performed, result is greater than 8%, and diabetic medication was changed (A1Cresult=3 and change=1).
Only a cluster count of 4 is considered (the ground truth).
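The same labeling rule can also be expressed directly in R; a minimal sketch, assuming the recoded training data:
labeled_data <- diabetic_data
labeled_data$label <- with(diabetic_data,
  ifelse(A1Cresult == 0, 1,                           # group 1: no test
  ifelse(A1Cresult %in% c(1, 2), 2,                   # group 2: normal range
  ifelse(A1Cresult == 3 & change == 0, 3, 4))))       # groups 3 and 4
table(labeled_data$label)  # sizes of the four ground-truth groups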
Method Precision
H. clustering (ward) 0.482
H. clustering (average) 0.993
H. clustering (complete) 0.976
K-means 0.139
5.2. Dunn Index and Davies-Bouldin
The goal of using an index is to determine the optimal clustering parameters. Smaller intracluster
distances and greater intercluster distances are desired. Different distance measures can be used for
the index calculations.
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio
between the minimal inter-cluster distance and the maximal intra-cluster distance. Let S and T be
two nonempty subsets of $\mathbb{R}^N$ [5]. Then, the diameter of S is defined as

$\Delta(S) = \max_{x, y \in S} d(x, y)$

and the set distance between S and T is defined as

$\delta(S, T) = \min_{x \in S,\, y \in T} d(x, y)$

Here, $d(x, y)$ indicates the distance between points x and y. For any partition into clusters $C_1, \dots, C_K$, Dunn defined the following index [10]:

$V_D = \min_{1 \le i \le K} \left\{ \min_{j \ne i} \frac{\delta(C_i, C_j)}{\max_{1 \le k \le K} \Delta(C_k)} \right\}$

Larger values of $V_D$ correspond to good clusters, and the number of clusters that maximizes
$V_D$ is taken as the optimal number of clusters.

The Davies-Bouldin index [9] is a metric for evaluating clustering algorithms. It is an internal
evaluation scheme, where the validation of how well the clustering has been done uses quantities
and features inherent to the dataset. The index is a function of the ratio of within-cluster scatter to
between-cluster separation [5]. The scatter within the i-th cluster, $S_i$, is computed as

$S_i = \frac{1}{|C_i|} \sum_{x \in C_i} \|x - z_i\|$

and the distance between clusters $C_i$ and $C_j$, denoted by $d_{ij}$, is defined as
$d_{ij} = \|z_i - z_j\|$, where $z_i$ represents the i-th cluster center. The Davies-Bouldin (DB)
index is then defined as

$DB = \frac{1}{K} \sum_{i=1}^{K} R_i, \quad \text{where} \quad R_i = \max_{j \ne i} \frac{S_i + S_j}{d_{ij}}$

The objective is to minimize the DB index to achieve proper clustering.
The most practical difference between the two indexes is that a higher Dunn index is better, while a
lower Davies-Bouldin index is better. The distances discussed here are Euclidean distances.
5.2.1. Dunn Index Measurements
Hierarchical Clustering (Ward):
Hierarchical Clustering (Average):
Hierarchical Clustering (Complete):
K-Means:
Hierarchical methods give better results than K-Means.
5.2.2. Davies-Bouldin Measurements
Hierarchical Clustering (Ward):
Hierarchical Clustering (Average):
Hierarchical Clustering (Complete):
K-Means:
Again, hierarchical methods give better results than K-Means.
5.3. Self-Reflection About Validation
The data set does not have well-separated clusters, so this task was difficult. The aim of this
assignment was to understand and encourage the use of cluster-validation techniques in the
analysis of data. In particular, the assignment attempts to familiarize students with some of the
fundamental concepts behind cluster-validation techniques, and to assist them in making more
informed choices of the measures to be used. However, implementing better cluster validation
requires essential background information. There are many different types of validation techniques,
and some articles in the literature propose their effective use. In conclusion, validation should be
done after more research and with more background knowledge.
6. Spectral Clustering
Clustering is widely applied in science and engineering, including bioinformatics, image
segmentation, web information retrieval, etc. The essential task of data clustering is partitioning
data points into disjoint clusters so that objects in the same cluster are similar and objects in
different clusters are dissimilar [7]. Many clustering algorithms have shortcomings, so spectral
clustering has been proposed as a promising alternative.
Spectral clustering is a kind of clustering method based on graph theory. It uses the top
eigenvectors of a matrix derived from the distances between points. Such algorithms have been
successfully used in many applications, including computer vision and VLSI design [8]. Through
spectral analysis of the affinity matrix of a data set, spectral clustering can obtain promising
clustering results [7]. Because the main algorithm does not iterate, spectral clustering avoids
getting trapped in local minima the way K-means can. The process of spectral clustering can be
summarized as follows [7][8] (suppose the data set X = {x1, x2, ..., xn} has k classes):
Spectral Clustering Algorithm
STEP 1. Construct the affinity matrix $W \in \mathbb{R}^{n \times n}$: if $i \neq j$, then
$W_{ij} = \exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)$, else $W_{ij} = 0$.
STEP 2. Define the diagonal matrix D, where $D_{ii} = \sum_{j=1}^{n} W_{ij}$. Meanwhile,
define the Laplacian matrix $L = D^{-1/2} W D^{-1/2}$.
STEP 3. Compute the k eigenvectors corresponding to the k largest eigenvalues of matrix L, and
constitute the matrix $X = [u_1\ u_2\ \cdots\ u_k] \in \mathbb{R}^{n \times k}$. Then we can get
the matrix Y, where $Y_{ij} = X_{ij} / \left(\sum_{j} X_{ij}^2\right)^{1/2}$.
STEP 4. Treat each row of Y as a point in $\mathbb{R}^k$ and cluster the rows into k clusters via
K-means. Assign the original point $x_i$ to cluster j iff row i of the matrix Y was assigned to
cluster j.
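For concreteness, a from-scratch sketch of these four steps in R (illustrative only; it is not the kernlab implementation used below):
spectral_sketch <- function(X, k, sigma = 1) {
  W <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))  # Step 1: affinity matrix
  diag(W) <- 0
  Dhalf <- diag(1 / sqrt(rowSums(W)))              # D^(-1/2)
  L <- Dhalf %*% W %*% Dhalf                       # Step 2: normalized Laplacian
  U <- eigen(L, symmetric = TRUE)$vectors[, 1:k]   # Step 3: top-k eigenvectors
  Y <- U / sqrt(rowSums(U^2))                      # normalize each row to unit length
  kmeans(Y, centers = k)$cluster                   # Step 4: k-means on the rows of Y
}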
On this dataset, the algorithm given above is used for spectral clustering. To implement it, there is
an extensible R package named kernlab for kernel-based machine learning methods. Using this
package, spectral clustering can be done easily in a few steps, so the "specc" method from the
"kernlab" package is used. The R scripts can be found in Appendix 8.1.5, Scripts of Spectral
Clustering.
In spectral clustering, the similarity between data points is often defined by a Gaussian kernel [7].
The scale hyperparameter σ in the Gaussian kernel greatly influences the final clustering results.
So, to find the best σ hyperparameter, a parameter estimation is done first. After that, several runs
with the same parameters are compared.
Parameter Estimation
Kernlab includes an S4 method called specc implementing this algorithm, which can be used
through a formula interface or a matrix interface. The S4 object returned by the method extends the
class "vector" and contains the assigned cluster for each point along with information on the
centers, size, and within-cluster sum of squares of each cluster. In case a Gaussian RBF kernel is
being used, a model selection process can be applied to determine the optimal value of the σ
hyperparameter. For a good value of σ, the values of Y tend to cluster tightly, and it turns out that
the within-cluster sum of squares is a good indicator of the "quality" of the sigma parameter found.
We then iterate through the sigma values to find an optimal value for σ.
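A minimal sketch of this, with scaled_data2 as in Appendix 8.1.5: let specc estimate sigma automatically, then read the fitted value back for repeated runs.
library(kernlab)
sc <- specc(scaled_data2, centers = 4)
sigma <- kpar(kernelf(sc))$sigma  # the fitted RBF hyperparameter
sc2 <- specc(scaled_data2, centers = 4, kernel = "rbfdot", kpar = list(sigma = sigma))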
The number of clusters is estimated as 4, 25, and 40; these values are tried in the specc method.
The results are shown on two data subsets: {time_in_hospital, num_lab_procedures} and
{num_medications, num_lab_procedures}.
The estimated value for 4 clusters is σ=4.40010321258815. Random runs are done with this
hyperparameter sigma.
If we estimate 4 clusters:
Image: Plot of data {Time in Hospital, Number of Procedures}
for 4 clusters estimation
Image: Plot of data {Number of Medications, Number of Lab Procedures}
for 4 cluster estimation
If we estimate 25 clusters:
Image: Plot of data {Time in Hospital, Number of Procedures}
for 25 clusters estimation
Image: Plot of data {Number of Medications, Number of Procedures}
for 25 clusters estimation
If we estimate 40 clusters:
Image: Plot of data {Time in Hospital, Number of Procedures}
for 40 clusters estimation
If we estimate 35 clusters:
Image: Plot of data {Number of Medications, Number of Procedures}
for 35 clusters estimation
From the dataset, the "num_lab_procedures" and "num_medications" features are chosen for the
2D plot. The random-run results are presented in the image below; the results are approximately the
same.
Image: Random Run Results with Centers=4 and σ=4.40010321258815.
Image: Cluster Sizes for {Time in Hospital, Number of Procedures}
for 40 clusters estimation
Image: Cluster Sizes for {Number of Medications, Number of Procedures}
First Random Run with centers=4 and σ=4.4
Validation
From the random runs, the first result is chosen for validation. Dunn index and Davies-Bouldin
index results are given for comparison. Recall that a higher Dunn index is better, while a lower
Davies-Bouldin index is better. In both the Dunn index and the Davies-Bouldin index, centroid
diameter with complete link gives the best result. Since the ground truth has 4 clusters, the sizes of
the clusters are conformable with the ground truth.
Spectral (Dunn) Complete diameter Average diameter Centroid diameter
Single link 0.00668823 0.04265670 0.06039251
Complete link 0.52512892 3.34920700 4.74174095
Average link 0.15697465 1.00116480 1.41742928
Centroid link 0.09740126 0.62121320 0.87950129
Table: Spectral Clustering Result Validation with Dunn Index
Spectral (DB) Complete diameter Average diameter Centroid diameter
Single link 194.38845600 37.45235040 26.37252550
Complete link 1.83402600 0.43463780 0.30763710
Average link 7.44275800 1.32189570 0.92798800
Centroid link 10.87935700 1.93850020 1.36070810
Table: Spectral Clustering Result Validation with Davies-Bouldin Index
7. References
[1] Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura,
Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission
Rates: Analysis of 70,000 Clinical Database Patient Records”, BioMed Research International, vol.
2014, Article ID 781670, 11 pages, 2014.
[2] Jan de Leeuw, Patrick Mair, “Gifi Methods for Optimal Scaling in R: The Package homals”,
Journal of Statistical Software August 2009, Volume 31, Issue 4.
[3] Laura Mulvey, Julian Gingold, “Microarray Clustering Methods and Gene Ontology”, 2007.
[4] Phil Ender, “Multivariate Analysis: Hierarchical Cluster Analysis”, 1998.
[5] Maulik U., Bandyopadhyay S., "Performance evaluation of some clustering algorithms and
validity indices", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(12):
1650-1654.
[6] Julia Handl, Joshua Knowles, Douglas Kell, “Computational cluster validation in post-genomic
data analysis”, Bioinformatics 21(15):3201-3212, 2005.
[7] Lai Wei, “Path-based Relative Similarity Spectral Clustering”, 2010 Second WRI Global
Congress on Intelligent Systems, 16-17 Dec. 2010.
[8] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss, “On spectral clustering: Analysis and an
algorithm”, Neural Information Processing Symposium 2001.
[9] D.L. Davies and D.W. Bouldin, “A Cluster Separation Measure”, IEEE Trans. Pattern Analysis
and Machine Intelligence, vol. 1, pp. 224-227, 1979.
[10] J.C. Dunn, “A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-
Separated Clusters”, J. Cybernetics, vol. 3, pp. 32-57, 1973.
8. Appendix
8.1. Used Scripts & Programs
The R programming language is used (R version 3.1.1). In this part of the document, you can find
the R scripts and commands used to implement the given tasks: projection with PCA, projection
with MDS, clustering, validation, and spectral clustering.
8.1.1. Scripts of PCA Projection
R Commands
#load data
diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
#see summary
summary(diabetic_data)
#if you want to use two dimensions, create a new data set with two features
myvars <- c("num_lab_procedures","time_in_hospital")
newdata <- diabetic_data[myvars]
#PCA
my.pca <- princomp(diabetic_data, scores=TRUE, cor=TRUE)
#see the components
plot(my.pca)
biplot(my.pca)
# calculate covariance
my.cov <- cov(diabetic_data)
# calculate eigen values
my.eigen <- eigen(my.cov)
# see the eigen vectors in plot
pc1.slope <- my.eigen$vectors[1,1]/my.eigen$vectors[2,1]
pc2.slope <- my.eigen$vectors[1,2]/my.eigen$vectors[2,2]
abline(0,pc1.slope, col="red")
abline(0,pc2.slope, col="blue")
# cumulative percentages of eigen values
r <- my.pca$loadings # note: princomp stores loadings; $rotation belongs to prcomp
plot(cumsum(my.pca$sdev^2)/sum(my.pca$sdev^2))
# rotated data
biplot(my.pca,choices=c(2,1))
8.1.2. Scripts of MDS Projection
R Commands
library(MASS)
my.dist<-dist(diabetic_data)
randomdata<-cbind(runif(1000,min=-0.5,max=0.5),runif(1000,min=-0.5,max=0.5))
# classical mds
plot(cmdscale(my.dist))
# sammon mapping with PCA
plot(sammon(my.dist,y=my.pca$scores[,c(1,2)],magic=0.05)$points) # princomp stores scores in $scores
# sammon mapping with random configuration
plot(sammon(my.dist,y=randomdata,magic=0.05)$points)
# non-metric mapping with PCA
plot(isoMDS(my.dist,y=my.pca$scores[,c(1,2)])$points)
# non-metric mapping with random configuration
plot(isoMDS(my.dist,y=randomdata)$points)
8.1.3. Scripts of Clustering
R Commands
library(cluster)
ds<-dist(scale(diabetic_data))
# hierarchical clustering with ward method
# (in R >= 3.1.0 this method is named "ward.D"; "ward" is a deprecated alias)
hward<-hclust(ds,method="ward")
plot(hward)
# hierarchical clustering with average method
havg<-hclust(ds,method="average")
plot(havg)
# hierarchical clustering with complete method
hcomp<-hclust(ds,method="complete")
plot(hcomp)
#dendrograms for 5,25,100 clusters
rect.hclust(hward, k=100, border="blue")
rect.hclust(hward, k=25, border="green")
rect.hclust(hward, k=5, border="red")
#k-means with k=5,10,25,100,200
k1<-kmeans(scale(diabetic_data),5)
k2<-kmeans(scale(diabetic_data),10)
k3<-kmeans(scale(diabetic_data),25)
k4<-kmeans(scale(diabetic_data),100)
k5<-kmeans(scale(diabetic_data),200)
#k v.s. error plot
plot(c(length(k1$size),length(k2$size),length(k3$size),length(k4$size),length(k5$size)),c(k1$tot.withinss,k2$tot.withinss,k3$tot.withinss,k4$tot.withinss,k5$tot.withinss),type="l")
# 5 random runs with k=25
kk1<-kmeans(scale(diabetic_data),25)
kk2<-kmeans(scale(diabetic_data),25)
kk3<-kmeans(scale(diabetic_data),25)
kk4<-kmeans(scale(diabetic_data),25)
kk5<-kmeans(scale(diabetic_data),25)
#run v.s. error plot
plot(1:5,c(kk1$tot.withinss,kk2$tot.withinss,kk3$tot.withinss,kk4$tot.withinss,kk5$tot.withinss),type="l")
#clusters in 2D
clusplot(scale(diabetic_data),kk5$cluster,lines=0)
8.1.4. Scripts of Cluster Validation
JAVA Code for labeling
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
public class Main {
    public static void main(String[] args) {
        String satir = "";
        try {
            File file = new File("final.csv");
            if (!file.exists()) {
                file.createNewFile();
            }
            FileWriter fileWriter = new FileWriter(file, false);
            BufferedWriter bWriter = new BufferedWriter(fileWriter);
            File inputfile = new File("data.csv");
            BufferedReader reader = new BufferedReader(new FileReader(inputfile));
            // copy the header line and append the new column name
            satir = reader.readLine();
            bWriter.write(satir + ";label");
            bWriter.newLine();
            satir = reader.readLine();
            while (satir != null) {
                String[] columns = satir.split(";");
                // columns[14] holds the recoded A1Cresult, columns[15] the change flag
                if (columns[14].equals("0")) {
                    bWriter.write(satir + ";1"); // group 1: no HbA1c test
                } else if (columns[14].equals("1") || columns[14].equals("2")) {
                    bWriter.write(satir + ";2"); // group 2: normal range
                } else if (columns[14].equals("3") && columns[15].equals("0")) {
                    bWriter.write(satir + ";3"); // group 3: high, no medication change
                } else if (columns[14].equals("3") && columns[15].equals("1")) {
                    bWriter.write(satir + ";4"); // group 4: high, medication changed
                }
                bWriter.newLine();
                satir = reader.readLine();
            }
            bWriter.close();
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
R Commands
labeled_data <- read.table("C:/diabetic_dataset/final.csv", header = TRUE, sep = ";") # final.csv is written with ';' separators by the Java program
library(clv)
#ground truth labels
gt<-c(labeled_data$label)
#pick mode of labels found within a ground truth label
findmapping<-function(cluster,ground){sapply(as.numeric(names(table(ground))),function(x)as.numeric(names(sort(table(cluster[ground==x]),decreasing=TRUE))[1]))}
#for each ground truth label, compare it with the found label
findmatches<-function(cluster,ground){findmapping(cluster,ground)[ground]==cluster}
#precision for h.c. ward
mean(findmatches(cutree(hward,4),gt))
#precision for h.c. average
mean(findmatches(cutree(havg,4),gt))
#precision for h.c. complete
mean(findmatches(cutree(hcomp,4),gt))
#precision for kmeans
mean(findmatches(kk5$cluster,gt))
#Dunn index for h.c. ward
clv.Dunn(cls.scatt.data(scale(diabetic_data),cutree(hward,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Dunn index for h.c. average
clv.Dunn(cls.scatt.data(scale(diabetic_data),cutree(havg,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Dunn index for h.c. complete
clv.Dunn(cls.scatt.data(scale(diabetic_data),cutree(hcomp,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Dunn index for kmeans
clv.Dunn(cls.scatt.data(scale(diabetic_data),kk5$cluster),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for h.c. ward
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),cutree(hward,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for h.c. average
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),cutree(havg,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for h.c. complete
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),cutree(hcomp,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for kmeans
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),kk5$cluster),c("complete","average","centroid"),c("single","complete","average","centroid"))
8.1.5. Scripts of Spectral Clustering
R Commands
#load data
diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
#create new datasets with two features
myvars <- c("num_lab_procedures","time_in_hospital")
newdata <- diabetic_data[myvars]
scaled_data<-scale(as.data.frame(newdata))
myvars2 <- c("num_lab_procedures","num_medications")
newdata2 <- diabetic_data[myvars2]
scaled_data2<-scale(as.data.frame(newdata2))
library(ggplot2)
library(kernlab)
#runs with different cluster numbers
sc1<-specc(scaled_data,centers=4)
plot(scaled_data, col = sc1)
sc2<-specc(scaled_data,centers=25)
plot(scaled_data, col = sc2)
sc3<-specc(scaled_data,centers=40)
plot(scaled_data, col = sc3)
sc4<-specc(scaled_data2,centers=4)
plot(scaled_data2, col = sc4)
sc5<-specc(scaled_data2,centers=25)
plot(scaled_data2, col = sc5)
sc6<-specc(scaled_data2,centers=35)
plot(scaled_data2, col = sc6)
#find sigma
kernelf(sc4)
#random runs with estimated sigma
sce1<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce2<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce3<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce4<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce5<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce6<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
clv.Dunn(cls.scatt.data(scale(scaled_data2),as.vector(sce1)),c("complete","average","centroid"),c("single","complete","average","centroid"))
clv.Davies.Bouldin(cls.scatt.data(scale(scaled_data2),as.vector(sce1)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#cluster sizes
plot(1:40,sort(size(sc3),decreasing=T))
plot(1:4,sort(size(sce1),decreasing=T))
44 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
Statistical Data Analysis on Diabetes 130-US hospitals for years 1999-2008 Data Set

  • 1. Statistical Data Analysis on Diabetes 130-US hospitals for years 1999-2008 Data Set Document Version: 1.0 (Date: 12/01/15) Seval Ünver unver.seval@metu.edu.tr Student Number: 1900810 (M.Sc.) Department of Computer Engineering, Middle East Technical University Ankara, TURKEY 1 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 2. Version History Version Status* Date Responsible Version Definition 0.1 Send via Email 29/10/14 Seval Unver Projection by PCA (6 hours) 0.2 Uploaded to OdtuClass 05/11/14 Seval Unver Projection by MDS (6 hours) 0.3 Uploaded to OdtuClass 19/11/14 Seval Unver Data clustering by hierarchical and k-means clustering (6 hours) 0.4 Uploaded to OdtuClass 26/11/14 Seval Unver Cluster Validation (5 hours) 1 Final Report 12/01/15 Seval Unver Spectral Clustering (6 hours) 2 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 3. Table of Content 1. Data Set Description.........................................................................................................................4 2. Data projection by PCA....................................................................................................................8 2.1. Eigenvalues and Eigenvectors..................................................................................................8 2.2. Plot directions of the first and second principle components on the original coordinate system............................................................................................................................................11 2.3. Transformed data set onto to a new coordinate system by using the first two principle components....................................................................................................................................12 2.4. Personal Observations and Comments...................................................................................12 2.5. Details of Implementation......................................................................................................13 3. Data projection by MDS.................................................................................................................16 3.1. Classical Metric......................................................................................................................16 3.2. Sammon Mapping and isoMDS..............................................................................................16 3.3. Use the projection of data onto first two principal axes (as a result of PCA) to initialize MDS (sammon and isoMDS). Plot the final projections.........................................................................19 3.4. Observations and Comments..................................................................................................20 3.5. Self-Reflection about MDS....................................................................................................20 4. Clustering.......................................................................................................................................20 4.1. Hierarchical Clustering...........................................................................................................20 4.2. K-Means Clustering................................................................................................................24 4.2.1. K-Means algorithm for different 5 k values – Plot Error................................................25 4.2.2. K-Means with 5 different initial configurations when k is 100 and when k is 25 – Error Plot............................................................................................................................................25 4.2.3. Plot the data in 2D...........................................................................................................26 4.3. Self-Reflection About Clustering............................................................................................27 5. Cluster Validation...........................................................................................................................28 5.1. 
Comparison of Actual Labels and Predicted Labels ..............................................................28 5.2 Dunn Index and Davies-Bouldin.............................................................................................28 5.2.1 Dunn Index Measurements..............................................................................................30 5.2.2 Davies-Bouldin Measurements........................................................................................30 5.3 Self-Reflection About Validation.............................................................................................31 6. Spectral Clustering.........................................................................................................................31 7. References......................................................................................................................................37 8. Appendix.........................................................................................................................................38 8.1. Used Scripts & Programs........................................................................................................38 8.1.1. Scripts of PCA Projection...............................................................................................38 8.1.2. Scripts of MDS Projection..............................................................................................39 8.1.3 Scripts of Clustering.........................................................................................................39 8.1.4 Scripts of Cluster Validation............................................................................................40 8.1.5. Scripts of Spectral Clustering.........................................................................................43 3 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 4. 1. Data Set Description "Diabetes 130-US hospitals for years 1999-2008 Data Set" is selected for this research. This data has been prepared to analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes. The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes 50 features representing patient and hospital outcomes. The original large database has 74 million unique encounters corresponding to 17 million unique patients. The database consists of 41 tables in a fact-dimension schema and a total of 117 features. Almost 70,000 inpatient diabetes encounters were identified with sufficient detail for this analysis. In an early research, this database is used to show the relationship between the measurement of HbA1c and early readmission while controlling for covariates such as demographics, severity and type of the disease, and type of admission. The dataset was created in two steps. First, encounters of interest were extracted from the database with 55 attributes. Second, preliminary analysis and preprocessing of the data were performed resulting in retaining only these features (attributes) and encounters that could be used in further analysis, that is, contain sufficient information [1]. Information was extracted from the database for encounters that satisfied the following criteria: 1. It is an inpatient encounter (a hospital admission). 2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis. 3. The length of stay was at least 1 day and at most 14 days. 4. Laboratory tests were performed during the encounter. 5. Medications were administered during the encounter. Data Set Download Link: http://archive.ics.uci.edu/ml/datasets/Diabetes+130- US+hospitals+for+years+1999-2008 Date Donated: 05/03/14 Source: The data are submitted on behalf of the Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058 and a recipient of the CERNER data. John Clore (jclore '@' vcu.edu), Krzysztof J. Cios (kcios '@' vcu.edu), Jon DeShazo (jpdeshazo '@' vcu.edu), and Beata Strack (strackb '@' vcu.edu). This data is a de-identified abstract of the Health Facts database (Cerner Corporation, Kansas City, MO). Table 1. Source of data set There are 100,000 instances and 50 columns in this data set. Characteristics of the data set is Multivariate since there are many variables. The main area of this data is life and the data are real. Because of this reason, there are missing values. We can use Classification and Clustering methods on this data. The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test result, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc. The whole attribute list can be seen in the Table 2. 4 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 5. No Feature name Type Description and values % missing 1 Encounter ID Numeric Unique identifier of an encounter 0,00% 2 Patient number Numeric Unique identifier of a patient 0,00% 3 Race Nominal Values: Caucasian, Asian, African American, Hispanic, and other 2,00% 4 Gender Nominal Values: male, female, and unknown/invalid 0,00% 5 Age Nominal Grouped in 10-year intervals:[0, 10),[10, 20), …,[90, 100) 0,00% 6 Weight Numeric Weight in pounds. 97,00% 7 Admission type Nominal Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available 0,00% 8 Discharge disposition Nominal Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available 0,00% 9 Admission source Nominal Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital 0,00% 10 Time in hospital Numeric Integer number of days between admission and discharge 0,00% 11 Payer code Nominal Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay 52,00% 12 Medical specialty Nominal Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon 53,00% 13 Number of lab procedures Numeric Number of lab tests performed during the encounter 0,00% 14 Number of procedures Numeric Number of procedures (other than lab tests) performed during the encounter 0,00% 15 Number of medications Numeric Number of distinct generic names administered during the encounter 0,00% 16 Number of outpatient visits Numeric Number of outpatient visits of the patient in the year preceding the encounter 0,00% 17 Number of emergency visits Numeric Number of emergency visits of the patient in the year preceding the encounter 0,00% 18 Number of inpatient visits Numeric Number of inpatient visits of the patient in the year preceding the encounter 0,00% 5 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 6. 19 Diagnosis 1 Nominal The primary diagnosis (coded as first three digits of ICD9); 848 distinct values 0,00% 20 Diagnosis 2 Nominal Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values 0,00% 21 Diagnosis 3 Nominal Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values 1,00% 22 Number of diagnoses Numeric Number of diagnoses entered to the system 0,00% 23 Glucose serum test result Nominal Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured 0,00% 24 A1c test result Nominal Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured. 0,00% 25- 47 23 features for medications Nominal The feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed 0,00% 48 Change of medications Nominal Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change” 0,00% 49 Diabetes medications Nominal Indicates if there was any diabetic medication prescribed. Values: “yes” and “no” 0,00% 50 Readmitted Nominal Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission. 0,00% Table 2. List of features and their descriptions in the initial dataset The data are the real-world data, so there is incomplete, redundant, and noisy information. Some features have high percentage of missing values like weight (97% values missing), payer code (40%), and medical specialty (47%). Weight can directly relevant to Diabet but it is too sparse in this database, so it can be removed. Also, payer code can be removed because it is not relevant to Diabet. Medical specialty attribute can be removed too, because it shows physician's speciality. This might be important but focus of this research will not be about this issue. Therefore three features which have high missing values are removed. To summarize, the dataset consists of hospital admissions of length between 1 and 14 days that did not result in a patient death or discharge to a hospice. Each encounter corresponds to a unique 6 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 7. patient diagnosed with diabetes, although the primary diagnosis may be different. During each of the analyzed encounters, lab tests were ordered and medication was administered [1]. Four groups of encounters are considered: (1) no HbA1c test performed, (2) HbA1c performed and in normal range, (3) HbA1c performed and the result is greater than 8% with no change in diabetic medications, and (4) HbA1c performed, result is greater than 8%, and diabetic medication was changed. Table 3. Distribution of HbA1c in whole data set 1000 instance is separated to analyse as a traning set. In this small data set, the distribution of HbA1c is changed. It can be seen if Table 3 and Table 4 are compared. Table 4. Distribution of HbA1c in training data set The ratio of the population who did not have HbA1c test is 81.60% in whole data set and 81.50% in training data set. The readmission rate is 9.40% for whole data set, 9.32 for training data set. The ratio of the population who has higher result than 8% in HbA1c test is 8.90% in whole data set, 11.80% in training data set. Readmission rates are close to each other. In the group who had higher results and changed the medication, the ratio of population who readmitted the hospital in 30 days is 8.90% for whole data set, 10.00% for training data set. Encounter Id and Patient Numbers are removed because we are interested in summary of this data. Diagnosis 1, Diagnosis 2 and Diagnosis 3 is removed from the data, because they are nominal and they have more than 900 distinct values. Race is removed from the data set because it is not necessary at this point. 23 medication result is removed because they are nominal and they are not 7 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver Variable Readmitted % in group HbA1c          No test was performed 57080 81.60% 5342 9.40% 4071 5.80% 361 8.90% 2196 3.10% 166 7.60%  Normal result of the test 6637 9.50% 590 8.90% Number of encounters % of the  population Number of encounters  Result was high and the diabetic medication was changed  Result was high but the diabetic medication was not changed Variable Readmitted % in group HbA1c          No test was performed 815 81.50% 76 9.32% 60 6.00% 6 10.00% 58 5.80% 6 10.34%  Normal result of the test 67 6.70% 6 8.95% Number of encounters % of the  population Number of encounters  Result was high and the diabetic medication was changed  Result was high but the diabetic medication was not changed
  • 8. useful for Principle Component Analysis. Our aim is to find a relation between HbA1c test and Readmission rate. So we still keep that information. Gender is expressed by 1 for woman and 0 for man. The other nominal values are changed to integer representations. Changed nominal values: After removing unnecessary features, there is now 18 features in training data set. Types of discrete data: • count data (time_in_hospital, num_lab_procedures, num_procedures, num_medications, number_outpatient, number_emergency, number_inpatient, number_dianoses) • nominal data (gender, admission_type_id, discharge_disposition_id, admission_source_id, diabetesMed, change) • ordinal data (age, A1Cresult, max_glu_serum, readmitted) 2. Data projection by PCA PCA(Principle Component Analysis) is done by using GNU R function prcomp. As a method, eigenvalues of covariance matrix is used. 2.1. Eigenvalues and Eigenvectors Image: First 18 Eigen Values of whole data 8 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver Gender A1Cresult readmitted Female 1 None 0 >30 2 Male 0 Normal 1 <30 1 >7 2 No 0 Age >8 3 1-10 1 10-20 2 change max_glu_serum 20-30 3 None 0 None 0 30-40 4 change 1 Normal 1 40-50 5 >200 2 50-60 6 diabetesMed >300 3 60-70 7 Yes 1 70-80 8 No 0 80-90 9 90-100 10
  • 9. First we look at the whole data and plot it. It's much easier to explain PCA for two dimensions and then generalize from there. So two numeric features are selected: A1Cresult, time_in_hospital. The A1Cresult feature is categorical, so it is changed with num_lab_procedures. Look at the colorful plot between num_lab_procedures and num_medications. If we look at the PCA components, we can see that first component is very high than second component. 9 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 10. PCA Components of subset of data: 10 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 11. 2.2. Plot directions of the first and second principle components on the original coordinate system PCA of whole data: When we use the 2 feature: Image: 2D Projections of data on Principal 11 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 12. 2.3. Transformed data set onto to a new coordinate system by using the first two principle components Image: Cumulative Percentages of Eigenvalues Image: Component 1 vs. Component 2 2.4. Personal Observations and Comments This data set includes not only numerical values but also nominal values. There are both continuous and categorical data. However PCA is developed and suitable for continuous (ideally, mutlivariate normal) data. There is no obvious outliers in the data. Although a PCA applied on binary data would yield results comparable to those obtained from a Multiple Correspondence Analysis (factor scores and eigenvalues are linearly related), there are more appropriate techniques to deal with mixed data types, namely Multiple Factor Analysis for mixed data available in the FactoMineR R package (AFDM()). If your variables can be considered as structured subsets of descriptive attributes, then 12 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 13. Multiple Factor Analysis (MFA()) is also an option. The challenge with categorical variables is to find a suitable way to represent distances between variable categories and individuals in the factorial space. To overcome this problem, you can look for a non-linear transformation of each variable--whether it be nominal, ordinal, polynomial, or numerical--with optimal scaling. This is well explained in “Gifi Methods for Optimal Scaling in R: The Package homals [2]”, and an implementation is available in the corresponding R package homals. 2.5. Details of Implementation Data file is ending with ‘.csv’ which is Comma Seperated Value. There are 101768 lines in the file. Data is imported into R software: > diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",") To get the summary of data: > summary(diabetic_data) The data is opened with Excel and 3 feature columns are deleted which are weight (97% values missing), payer code (40% values missing), and medical specialty (47% values missing). Then a subset of this data is selected as a training data which includes 1000 rows. This training data is the first 1000 rows in original data and it represents the whole data correctly because these 1000 data is choosen randomly from the developers of this data set. Training data is imported to R. The summary of training data has not much difference with original data. For example, approximately half of data is women, the other half of data is men. The mean values are near each other. Ofcourse there is a difference between original data and training data, but using a training data makes faster the analysis. 13 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 14. To get the new distribution of HbA1c Result, subset of this data is shown in R. After removing the unnecessary columns and changing with integer representations, the data set has 18 features now. So it is imported again. To run the Principle Component Analysis, we can use princomp function in R. To use Correlation Matrix, we give it cor=TRUE parameter. It shows Importance of Components: Image: Summary of new data set (with 18 features) Image: Plot of Diabetic_Data 14 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 15. It looks like Component 1 is very strong. When it is plotted, it gives the results which can be seen in below. Image: Plot of PCA A scree plot is a graphical display of the variance of each component in the dataset which is used to determine how many components should be retained in order to explain a high percentage of the variation in the data. First component and second component are higher so we should keep them. Now lets see the Biplot. Biplots are a type of exploratory graph used in statistics, a generalization of the simple two-variable scatterplot. A biplot allows information on both samples and variables of a data matrix to be displayed graphically. Samples are displayed as points while variables are displayed either as vectors, linear axes or nonlinear trajectories. In the case of categorical variables, category level points may be used to represent the levels of a categorical variable. A generalised biplot displays information on both continuous and categorical variables. Image: biplot of PCA It is hard to read the information in this biplot image. However we can see that some of red lines are going to the same direction. This means that they have association. 15 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 16. 3. Data projection by MDS For visualisation, three methods for MDS (Multidimensional Scaling) are used for visualisation; classical multidimensional scaling, Sammon mapping and non-metric MDS. Classical multidimensional scaling is done with cmdscale in 2D and dist functions which uses euclidian distance between samples. As samples, raw data and first two dimensions of the PCA result is used. Sammon mapping is done with sammon function in MASS package. For initial configuration, result of PCA and several instances uniform distributed random points is used. Similar to sammon mapping, non-metric MDS is done with isoMDS function in MASS package. For initial configuration, same values as sammon mapping analysis is used. 3.1. Classical Metric Image: Classical MDS 3.2. Sammon Mapping and isoMDS Tried at least 5 different random initial configurations, choosen one that gives minimum error and plot the final MDS projection onto two dimensions. If result of PCA is used, sammon mapping converges in just a few iterations and leaves the configuration almost unchanged. Result is very sensitive to the magic parameter which is used for step size of iterations as indicated by MASS documentation. Here, magic parameter is chosen as 0.05. 16 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 17. Image: Sammon Mapping with PCA Image: Magic parameter is 0.05 If the magic parameter is 0.05: Initial stress : 0.79437 stress after 2 iters: 0.61243 17 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 18. Image: Magic parameter is 0.01 If the magic parameter is 0.01: Initial stress : 0.79437 stress after 10 iters: 0.50231, magic = 0.115 stress after 10 iters: 0.50231 Image: Magic parameter is 0.02 If the magic parameter is 0.02: Initial stress : 0.79437 stress after 10 iters: 0.41383, magic = 0.231 stress after 20 iters: 0.36066, magic = 0.021 stress after 30 iters: 0.34217, magic = 0.002 stress after 35 iters: 0.33437 18 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 19. 3.3. Use the projection of data onto first two principal axes (as a result of PCA) to initialize MDS (sammon and isoMDS). Plot the final projections. Image: isoMDS (Non-metric Mapping with PCA) initial value 11.809209 final value 11.804621 converged Image: Non-Metric mapping with random configuration initial value 48.305275 final value 48.304120 converged 19 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 20. 3.4. Observations and Comments There is so much features in this data set, this means high dimentionality. Although we removed most of unnecessary features and use a training data which includes 1000 instance, still data is not easily clusterable in 2D or clusters are not easily visible. Iterative methods used here yields outliers which are not present in PCA but this behaviour is very dependent on parameters other than distance data. Classic Torgerson metric MDS is actually done by transforming distances into similarities and performing PCA (eigen-decomposition or singular-value-decomposition) on those. So, PCA might be called the algorithm of the simplest MDS. Thus, MDS and PCA are not at the same level to be in line or opposite to each other. PCA is just a method while MDS is a class of analysis. As mapping, PCA is a particular case of MDS. On the other hand, PCA is a particular case of Factor analysis which, being a data reduction, is more than only a mapping, while MDS is only a mapping. 3.5. Self-Reflection about MDS As I see in this assignment, MDS gives much more information than PCA. This assignment provides a general perspective on the measures of similarity and dissimilarity. Both MDS and PCA use proximity measures such as the correlation coefficient or Euclidean distance to generate a spatial configuration (map) of points in multidimensional space where distances between points reflect the similarity among isolates. 4. Clustering Clustering is a technique for finding similarity groups in data, called clusters. In other words, clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). On this data, two methods of clustering is used for data visualisation. First one is Hierarchical Clustering with three different methods, implemented in GNU R function hclust with methods of “average”, ”complete” and ”ward”. Their dendograms are plotted. Second one is K-means Clustering with different k values (5,10,25,100,200) and several random runs of the “elbow” value for k which is around 100, consistent with the ground truth. Euclidian distance of normalized samples is used as the distance between samples for the both methods. 4.1. Hierarchical Clustering Dendrograms of hierarchical clustering for three linkages are illustrated under this title. Out of the three, “ward” method is the easiest to interpret visually, even for high number of clusters chosen. “Average” and “complete“ methods yield visually similar results but not as easy to interpret. Most interesting observation is, beyond a certain number of clusters picked, they look similar even though picks of low numbers of clusters looks different. 20 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 21. Linkage, or the distance from a newly fourmed node to all other nodes, can computed in several different ways: single, complete, and average. The figure below roughly demonstrates what each linkage evaluates [3] : Average linkage clustering uses the average similarity of observations between two groups as the measure between the two groups. Complete linkage clustering uses the farthest pair of observations between two groups to determine the similarity of the two groups. Single linkage clustering, on the other hand, computes the similarity between two groups as the similarity of the closest pair of observations between the two groups [4]. Ward's linkage is distinct from all the other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In short, this method attempts to minimize the Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at each step. In general, this method is regarded as very efficient, however, it tends to create clusters of small size [4]. Image: Hierarchical Clustering with Ward Method 21 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 22. Image: Dendograms for 5(red),25(green),100(blue) clusters on Hierarchical Clustering with Ward Method Image: Hierarchical Clustering with average method 22 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
  • 23. Image: Dendograms for 5,25,100 clusters on Hierarchical Clustering with Average Method Image: Hierarchical Clustering with complete method 23 Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis– by Seval Unver
Image: Dendrograms for 5, 25 and 100 clusters on Hierarchical Clustering with Complete Method

Based on the results obtained above, I recommend 25 clusters for this dataset. It is difficult to estimate an accurate number of clusters, but as the Ward dendrogram shows, 25 clusters seems the best choice.

4.2. K-Means Clustering

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. Cluster counts of 5, 10, 25, 100 and 200 are examined under this title. The total within-cluster sum of squares, given by the tot.withinss property of the kmeans function, is used as the error function to evaluate the best number of clusters. The rule of thumb k ≈ √(n/2) gives about 22 for 1,000 samples, close to the chosen 25. The ground truth of 100 is close to the "elbow" value, beyond which the error no longer improves as dramatically. In the next steps, five random runs with k = 100 and k = 25 and their errors are illustrated; a minimal sketch of the elbow computation follows.
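As a minimal sketch of this elbow search, assuming the cleaned data is already loaded as diabetic_data (as in the appendix scripts), tot.withinss can be collected for each candidate k and plotted against k:

# Minimal sketch of the elbow method: run kmeans for each candidate k
# and plot the total within-cluster sum of squares against k.
ks  <- c(5, 10, 25, 100, 200)
err <- sapply(ks, function(k) kmeans(scale(diabetic_data), k)$tot.withinss)
plot(ks, err, type = "b", xlab = "k", ylab = "tot.withinss")
# The "elbow" is the k beyond which the error stops dropping sharply.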
4.2.1. K-Means Algorithm for Five Different k Values – Error Plot

Image: "K-Means Clustering with k = 5, 10, 25, 100, 200" vs. "Error Plot"

4.2.2. K-Means with Five Different Initial Configurations for k = 100 and k = 25 – Error Plot

Two different k values are tried in this task. Because determining the number of clusters is difficult, both 25 and 100 are tried with five different initial configurations. For k = 100 the third run is the best, because its error is the lowest; for k = 25 the fifth run is the best, again because its error is the lowest.
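Selecting the best of several random initializations can be written compactly. A minimal sketch, again assuming diabetic_data is loaded; note that kmeans also offers the nstart argument, which performs the same selection internally:

# Minimal sketch: five random initializations for a fixed k, keeping
# the run with the lowest total within-cluster sum of squares.
runs <- lapply(1:5, function(i) kmeans(scale(diabetic_data), centers = 100))
errs <- sapply(runs, function(r) r$tot.withinss)
best <- runs[[which.min(errs)]]
# Equivalent built-in shortcut: kmeans(..., centers = 100, nstart = 5)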
Image: "5 random runs with k = 100" vs. "Error Plot"

Image: "5 random runs with k = 25" vs. "Error Plot"

4.2.3. Plot the Data in 2D

In the previous step, the error plots compared the five initial configurations. For k = 100 the third configuration is chosen, because its error is the lowest. Below is the 2D plot of the third configuration with k = 100.

Image: Third initial configuration with k = 100 – Clusters in 2D
Below is the corresponding 2D plot for k = 25, using the fifth initial configuration, which had the lowest error.

Image: Fifth initial configuration with k = 25 – Clusters in 2D

4.3. Self-Reflection About Clustering

Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group (a cluster) are more similar, in some sense, to each other than to those in other groups. This assignment helped me estimate the number of clusters in my data; after it, I chose k = 25. The figures and projections were very helpful in this task. In the k-means algorithm, my difficulty was specifying k. In addition, the algorithm is very sensitive to outliers, and my data has many outliers because it is real-world data. A weakness of k-means is that it is only applicable when the mean is defined; for categorical data, k-modes is used instead, where the centroid is represented by the most frequent value of each attribute (see the sketch below). Therefore, we cannot say that k-means is the best way to estimate the number of clusters, and the other algorithms have weaknesses of their own. Comparing different clustering algorithms is a difficult task: no one knows the correct clusters, and it is very hard, if not impossible, to know what distribution the application data follow. The data may not fully follow any "ideal" structure or distribution required by the algorithms. One also needs to decide how to standardize the data, to choose a suitable distance function, and to select other parameter values. Hierarchical clustering has O(n²) complexity, which makes it hard to use for large data sets.
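As a minimal sketch of the k-modes idea mentioned above (this only illustrates the mode-based centroid on hypothetical toy data, not the full k-modes algorithm, and is not part of the report's scripts):

# Minimal sketch: the "centroid" of a categorical cluster under
# k-modes is the per-attribute mode (most frequent value).
mode_centroid <- function(df) {
  sapply(df, function(col) names(sort(table(col), decreasing = TRUE))[1])
}
toy <- data.frame(race   = c("A", "A", "B"),
                  gender = c("F", "M", "F"))
mode_centroid(toy)  # race = "A", gender = "F"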
I used a sample of 1,000 instances in this task, so it did not take a long time to run. My data has many nominal attributes with more than two states or values; the commonly used distance measure for such attributes is based on simple matching. To sum up, this assignment is very helpful for students who want to apply clustering to unlabeled big data.

5. Cluster Validation

Cluster validation is concerned with the quality of the clusters generated by a data-clustering algorithm. Given a partitioning of a data set, it attempts to answer questions such as: How pronounced is the cluster structure that has been identified? How do clustering solutions from different algorithms compare? How do clustering solutions for different parameters (e.g. different numbers of clusters) compare? [6]

5.1. Comparison of Actual Labels and Predicted Labels

"Ground truth" means a set of measurements known to be much more accurate than the measurements from the system under test. The Diabetes 130-US hospitals for years 1999-2008 Data Set has no labels that determine classes, so I labeled four groups in a new column, named label, with a Java console program (a minimal R sketch of the same rule is given after the table below). The column holds numbers from 1 to 4. Four groups of encounters are considered: (1) no HbA1c test performed (A1Cresult = 0); (2) HbA1c performed and in the normal range (A1Cresult = 1 or A1Cresult = 2); (3) HbA1c performed, result greater than 8%, and no change in diabetic medications (A1Cresult = 3 and change = 0); (4) HbA1c performed, result greater than 8%, and diabetic medication changed (A1Cresult = 3 and change = 1). Only a cluster count of 4 is considered (the ground truth).

Method                     Precision
H. clustering (ward)       0.482
H. clustering (average)    0.993
H. clustering (complete)   0.976
K-means                    0.139
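The same labeling rule can be expressed in a few lines of R. This is a hedged sketch, assuming the encoded columns A1Cresult and change exist with the numeric codes described above; the report itself does this with the Java program given in the appendix:

# Minimal sketch of the four-group labeling rule in R (the report uses
# a Java program for this; column codes are as described in the text).
lab <- with(diabetic_data,
  ifelse(A1Cresult == 0, 1,
  ifelse(A1Cresult %in% c(1, 2), 2,
  ifelse(A1Cresult == 3 & change == 0, 3, 4))))
diabetic_data$label <- lab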
5.2. Dunn Index and Davies-Bouldin

The goal of using a validity index is to determine the optimal clustering parameters. Large intercluster distances and small intracluster distances are desired, and different distance measures can be used for the index calculations.

The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance and the maximal intra-cluster distance. Let S and T be two nonempty subsets of R^N [5]. Then the diameter of S is defined as

Δ(S) = max_{x,y ∈ S} d(x, y)

and the set distance between S and T is defined as

δ(S, T) = min_{x ∈ S, y ∈ T} d(x, y),

where d(x, y) indicates the distance between points x and y. For any partition into clusters C_1, …, C_K, Dunn defined the following index [10]:

V_D = min_{1 ≤ i ≤ K} min_{j ≠ i} [ δ(C_i, C_j) / max_{1 ≤ k ≤ K} Δ(C_k) ].

Larger values of V_D correspond to good clusters, and the number of clusters that maximizes V_D is taken as the optimal number of clusters.

The Davies-Bouldin index [9] is a metric for evaluating clustering algorithms. It is an internal evaluation scheme: the validation of how well the clustering has been done uses only quantities and features inherent to the dataset. The index is a function of the ratio of the within-cluster scatter to the between-cluster separation [5]. The scatter within the i-th cluster C_i, with center z_i, is computed as

S_i = (1/|C_i|) Σ_{x ∈ C_i} ||x − z_i||,

and the distance between clusters C_i and C_j, denoted d_ij, is defined as

d_ij = ||z_i − z_j||.

The Davies-Bouldin (DB) index is then defined as

DB = (1/K) Σ_{i=1}^{K} R_i, where R_i = max_{j ≠ i} (S_i + S_j) / d_ij.

The objective is to minimize the DB index to achieve proper clustering. The most practical difference between the two indices is that a higher Dunn index is better, while a lower Davies-Bouldin index is better. All distances discussed here are Euclidean.
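As a concrete illustration, both indices can be computed with the clv package used elsewhere in this report. A minimal sketch on hypothetical toy data with two obvious clusters, using the same call pattern as the appendix scripts:

# Minimal sketch: Dunn and Davies-Bouldin indices with the clv package,
# on toy data, with the same calls as in Appendix 8.1.4.
library(clv)
set.seed(7)
x  <- rbind(matrix(rnorm(40), ncol = 2), matrix(rnorm(40, mean = 5), ncol = 2))
cl <- rep(1:2, each = 20)
scatt <- cls.scatt.data(x, cl)
clv.Dunn(scatt, c("complete"), c("single"))           # higher is better
clv.Davies.Bouldin(scatt, c("complete"), c("single")) # lower is better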
5.2.1. Dunn Index Measurements

Image: Dunn index values for Hierarchical Clustering (Ward)
Image: Dunn index values for Hierarchical Clustering (Average)
Image: Dunn index values for Hierarchical Clustering (Complete)
Image: Dunn index values for K-Means

The hierarchical methods give better results than K-Means.

5.2.2. Davies-Bouldin Measurements

Image: Davies-Bouldin index values for Hierarchical Clustering (Ward)
Image: Davies-Bouldin index values for Hierarchical Clustering (Average)
Image: Davies-Bouldin index values for Hierarchical Clustering (Complete)
Image: Davies-Bouldin index values for K-Means

Again, the hierarchical methods give better results than K-Means.

5.3. Self-Reflection About Validation

The data set does not have well-separated clusters, so this task was difficult to implement. The aim of this assignment was to understand and to encourage the use of cluster-validation techniques in the analysis of data. In particular, the assignment attempts to familiarize students with some of the fundamental concepts behind cluster-validation techniques, and to assist them in making more informed choices of the measures to be used. However, implementing better cluster validation requires essential background information: there are many different types of validation techniques, and several articles in the literature propose their effective use. In conclusion, validation should be done after further research and with more background knowledge.

6. Spectral Clustering

Clustering is widely applied in science and engineering, including bioinformatics, image segmentation, and web information retrieval. The essential task of data clustering is partitioning data points into disjoint clusters so that objects in the same cluster are similar and objects in different clusters are dissimilar [7]. Many clustering algorithms have shortcomings, so spectral clustering has been proposed as a promising alternative. Spectral clustering is a clustering method based on graph theory: it uses the top eigenvectors of a matrix derived from the distances between points. Such algorithms have been used successfully in many applications, including computer vision and VLSI design [8]. Through spectral analysis of the affinity matrix of a data set, spectral clustering can obtain promising clustering results [7]. Because the eigendecomposition involves no iterative assignment procedure, spectral clustering avoids getting trapped in local minima the way k-means does. The process of spectral clustering can be summarized as follows [7][8] (suppose the data set X = {x1, x2, …, xn} has k classes):

Spectral Clustering Algorithm
STEP 1. Construct the affinity matrix W: if i ≠ j, then w_ij = exp(−||x_i − x_j||² / (2σ²)); otherwise w_ii = 0.

STEP 2. Define the diagonal degree matrix D, where D_ii = Σ_{j=1}^{n} w_ij, and the normalized Laplacian matrix L = D^{−1/2} W D^{−1/2}.

STEP 3. Compute the k eigenvectors v_1, …, v_k corresponding to the k largest eigenvalues of L, and stack them as the columns of the matrix X = [v_1 v_2 … v_k]. Then form the matrix Y by renormalizing each row of X to unit length: Y_ij = X_ij / (Σ_j X_ij²)^{1/2}.

STEP 4. Treat each row of Y as a point in R^k and cluster the rows into k clusters via K-means. Assign the original point x_i to cluster j iff row i of the matrix Y was assigned to cluster j. (A minimal R sketch of these four steps is given at the end of this subsection.)

On this dataset, the algorithm given above is used for spectral clustering. To implement it, R has an extensible package for kernel-based machine learning methods named kernlab, with which spectral clustering can be done easily in a few steps; the specc method from kernlab is used here. The R scripts can be found in Appendix section 8.1.5, Scripts of Spectral Clustering. In spectral clustering, the similarity between data points is often defined by a Gaussian kernel [7], and the scale hyperparameter σ of that kernel greatly influences the final clustering results. To find the best σ, a parameter estimation is done first; after that, several runs with the same parameters are compared.

Parameter Estimation

Kernlab includes an S4 method called specc implementing this algorithm, which can be used through a formula interface or a matrix interface. The S4 object returned by the method extends the class "vector" and contains the assigned cluster for each point, along with information on the centers, the cluster sizes, and the within-cluster sum of squares for each cluster. When a Gaussian RBF kernel is used, a model-selection process can determine the optimal value of the σ hyperparameter: for a good value of σ the rows of Y cluster tightly, and the within-cluster sum of squares turns out to be a good indicator of the quality of the σ found, so specc iterates through candidate σ values to find a good one. The numbers of clusters are estimated as 4, 25 and 40, and each is tried with specc.
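The following is a minimal R sketch of the four steps above: a direct, unoptimized implementation on hypothetical toy data. In the report itself the kernlab specc method is used instead.

# Minimal sketch of the spectral clustering algorithm above (toy data;
# the report uses kernlab::specc in practice).
set.seed(3)
x <- rbind(matrix(rnorm(60), ncol = 2), matrix(rnorm(60, mean = 4), ncol = 2))
k <- 2; sigma <- 1
# STEP 1: Gaussian affinity matrix with zero diagonal
W <- exp(-as.matrix(dist(x))^2 / (2 * sigma^2)); diag(W) <- 0
# STEP 2: degree matrix and normalized Laplacian L = D^-1/2 W D^-1/2
Dm <- diag(1 / sqrt(rowSums(W)))
L  <- Dm %*% W %*% Dm
# STEP 3: top-k eigenvectors, with rows renormalized to unit length
X <- eigen(L, symmetric = TRUE)$vectors[, 1:k]
Y <- X / sqrt(rowSums(X^2))
# STEP 4: k-means on the rows of Y gives the final cluster labels
labels <- kmeans(Y, centers = k)$cluster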
The results are shown on two data subsets: {time_in_hospital, num_lab_procedures} and {num_medications, num_lab_procedures}. The value estimated for 4 clusters is σ = 4.40010321258815, and the random runs are done with this hyperparameter. If we estimate 4 clusters:

Image: Plot of data {Time in Hospital, Number of Lab Procedures} for the 4-cluster estimation

Image: Plot of data {Number of Medications, Number of Lab Procedures} for the 4-cluster estimation
If we estimate 25 clusters:

Image: Plot of data {Time in Hospital, Number of Lab Procedures} for the 25-cluster estimation

Image: Plot of data {Number of Medications, Number of Lab Procedures} for the 25-cluster estimation
If we estimate 40 clusters:

Image: Plot of data {Time in Hospital, Number of Lab Procedures} for the 40-cluster estimation

If we estimate 35 clusters:

Image: Plot of data {Number of Medications, Number of Lab Procedures} for the 35-cluster estimation

From the dataset, the num_lab_procedures and num_medications features are chosen for the 2D plots. The random-run results are presented in the images below; the results are approximately the same.
Image: Random-run results with centers = 4 and σ = 4.40010321258815

Image: Cluster sizes for {Time in Hospital, Number of Lab Procedures} for the 40-cluster estimation

Image: Cluster sizes for {Number of Medications, Number of Lab Procedures}, first random run with centers = 4 and σ = 4.4
Validation

From the random runs, the first result is chosen for validation. Dunn index results and Davies-Bouldin index results are given for comparison; recall that a higher Dunn index is better, while a lower Davies-Bouldin index is better. Under both indices, the centroid diameter with complete link gives the best result. Since the ground truth has 4 clusters, the sizes of the clusters conform to the ground truth.

Spectral (Dunn)   Complete diameter   Average diameter   Centroid diameter
Single link       0.00668823          0.04265670         0.06039251
Complete link     0.52512892          3.34920700         4.74174095
Average link      0.15697465          1.00116480         1.41742928
Centroid link     0.09740126          0.62121320         0.87950129

Table: Spectral Clustering Result Validation with Dunn Index

Spectral (DB)     Complete diameter   Average diameter   Centroid diameter
Single link       194.38845600        37.45235040        26.37252550
Complete link     1.83402600          0.43463780         0.30763710
Average link      7.44275800          1.32189570         0.92798800
Centroid link     10.87935700         1.93850020         1.36070810

Table: Spectral Clustering Result Validation with Davies-Bouldin Index

7. References

[1] B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J. Cios, and J. N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records", BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.
[2] J. de Leeuw and P. Mair, "Gifi Methods for Optimal Scaling in R: The Package homals", Journal of Statistical Software, vol. 31, no. 4, August 2009.
[3] L. Mulvey and J. Gingold, "Microarray Clustering Methods and Gene Ontology", 2007.
[4] P. Ender, "Multivariate Analysis: Hierarchical Cluster Analysis", 1998.
[5] U. Maulik and S. Bandyopadhyay, "Performance Evaluation of Some Clustering Algorithms and Validity Indices", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650-1654, 2002.
[6] J. Handl, J. Knowles, and D. Kell, "Computational Cluster Validation in Post-Genomic Data Analysis", Bioinformatics, vol. 21, no. 15, pp. 3201-3212, 2005.
[7] L. Wei, "Path-based Relative Similarity Spectral Clustering", 2010 Second WRI Global Congress on Intelligent Systems, 16-17 Dec. 2010.
[8] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm", Neural Information Processing Systems (NIPS), 2001.
[9] D. L. Davies and D. W. Bouldin, "A Cluster Separation Measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, pp. 224-227, 1979.
[10] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, vol. 3, pp. 32-57, 1973.

8. Appendix

8.1. Used Scripts & Programs

The R programming language is used (R version 3.1.1). In this part of the document you can find the R scripts and commands used to implement the given tasks: projection with PCA, projection with MDS, clustering, validation, and spectral clustering.

8.1.1. Scripts of PCA Projection

R Commands

# load data
diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
# see summary
summary(diabetic_data)
# to work in two dimensions, create a new data set with two chosen features
myvars <- c("num_lab_procedures", "time_in_hospital")
newdata <- diabetic_data[myvars]
# PCA on standardized data (prcomp provides the $rotation, $sdev and $x
# components used below and in the MDS scripts; princomp does not)
my.pca <- prcomp(diabetic_data, center = TRUE, scale. = TRUE)
# see the components
plot(my.pca)
biplot(my.pca)
# covariance matrix and its eigendecomposition
my.cov <- cov(diabetic_data)
my.eigen <- eigen(my.cov)
# slopes of the first two eigenvectors, for plotting on the original axes
pc1.slope <- my.eigen$vectors[1,1] / my.eigen$vectors[2,1]
pc2.slope <- my.eigen$vectors[1,2] / my.eigen$vectors[2,2]
abline(0, pc1.slope, col = "red")
abline(0, pc2.slope, col = "blue")
# cumulative percentages of the eigenvalues (variance explained)
r <- my.pca$rotation
plot(cumsum(my.pca$sdev^2) / sum(my.pca$sdev^2))
# rotated data
biplot(my.pca, choices = c(2,1))

8.1.2. Scripts of MDS Projection

R Commands

library(MASS)
my.dist <- dist(diabetic_data)
# random initial configuration for the 1000 samples
randomdata <- cbind(runif(1000, min = -0.5, max = 0.5), runif(1000, min = -0.5, max = 0.5))
# classical MDS
plot(cmdscale(my.dist))
# Sammon mapping initialized with the PCA projection
plot(sammon(my.dist, y = my.pca$x[, c(1,2)], magic = 0.05)$points)
# Sammon mapping with a random initial configuration
plot(sammon(my.dist, y = randomdata, magic = 0.05)$points)
# non-metric MDS initialized with the PCA projection
plot(isoMDS(my.dist, y = my.pca$x[, c(1,2)])$points)
# non-metric MDS with a random initial configuration
plot(isoMDS(my.dist, y = randomdata)$points)

8.1.3. Scripts of Clustering

R Commands

library(cluster)
ds <- dist(scale(diabetic_data))
# hierarchical clustering with the Ward method
# (named "ward.D" in R >= 3.1.0)
hward <- hclust(ds, method = "ward")
plot(hward)
# hierarchical clustering with the average method
havg <- hclust(ds, method = "average")
plot(havg)
# hierarchical clustering with the complete method
hcomp <- hclust(ds, method = "complete")
plot(hcomp)
# dendrogram rectangles for 5, 25 and 100 clusters
rect.hclust(hward, k = 100, border = "blue")
rect.hclust(hward, k = 25, border = "green")
rect.hclust(hward, k = 5, border = "red")
# k-means with k = 5, 10, 25, 100, 200
k1 <- kmeans(scale(diabetic_data), 5)
k2 <- kmeans(scale(diabetic_data), 10)
k3 <- kmeans(scale(diabetic_data), 25)
k4 <- kmeans(scale(diabetic_data), 100)
k5 <- kmeans(scale(diabetic_data), 200)
# k vs. error plot
plot(c(length(k1$size), length(k2$size), length(k3$size), length(k4$size), length(k5$size)),
     c(k1$tot.withinss, k2$tot.withinss, k3$tot.withinss, k4$tot.withinss, k5$tot.withinss),
     type = "l")
# 5 random runs with k = 25
kk1 <- kmeans(scale(diabetic_data), 25)
kk2 <- kmeans(scale(diabetic_data), 25)
kk3 <- kmeans(scale(diabetic_data), 25)
kk4 <- kmeans(scale(diabetic_data), 25)
kk5 <- kmeans(scale(diabetic_data), 25)
# run vs. error plot
plot(1:5, c(kk1$tot.withinss, kk2$tot.withinss, kk3$tot.withinss, kk4$tot.withinss, kk5$tot.withinss),
     type = "l")
# clusters in 2D
clusplot(scale(diabetic_data), kk5$cluster, lines = 0)

8.1.4. Scripts of Cluster Validation

Java Code for Labeling

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        String satir = "";
        try {
            File file = new File("final.csv");
            if (!file.exists()) {
                file.createNewFile();
            }
            FileWriter fileWriter = new FileWriter(file, false);
            BufferedWriter bWriter = new BufferedWriter(fileWriter);
            File inputfile = new File("data.csv");
            BufferedReader reader = new BufferedReader(new FileReader(inputfile));
            // header line: copy the column names and append the new label column
            satir = reader.readLine();
            bWriter.write(satir + ";label");
            bWriter.newLine();
            satir = reader.readLine();
            while (satir != null) {
                // columns[14] and columns[15] hold the encoded A1Cresult and change values
                String[] columns = satir.split(";");
                if (columns[14].equals("0")) {
                    bWriter.write(satir + ";1");
                } else if (columns[14].equals("1") || columns[14].equals("2")) {
                    bWriter.write(satir + ";2");
                } else if (columns[14].equals("3") && columns[15].equals("0")) {
                    bWriter.write(satir + ";3");
                } else if (columns[14].equals("3") && columns[15].equals("1")) {
                    bWriter.write(satir + ";4");
                }
                bWriter.newLine();
                satir = reader.readLine();
            }
            bWriter.close();
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

R Commands

# final.csv is written by the Java program with ";" separators
labeled_data <- read.table("C:/diabetic_dataset/final.csv", header = TRUE, sep = ";")
library(clv)

# ground truth labels
gt <- c(labeled_data$label)

# for each ground-truth label, pick the mode of the cluster labels found within it
findmapping <- function(cluster, ground) {
  sapply(as.numeric(names(table(ground))),
         function(x) as.numeric(names(sort(table(cluster[ground == x]), decreasing = TRUE))[1]))
}
# for each ground-truth label, compare it with the mapped cluster label
findmatches <- function(cluster, ground) {
  findmapping(cluster, ground)[ground] == cluster
}

# precision for h.c. ward
mean(findmatches(cutree(hward, 4), gt))
# precision for h.c. average
mean(findmatches(cutree(havg, 4), gt))
# precision for h.c. complete
mean(findmatches(cutree(hcomp, 4), gt))
# precision for k-means
mean(findmatches(kk5$cluster, gt))

# Dunn index for h.c. ward
clv.Dunn(cls.scatt.data(scale(diabetic_data), cutree(hward, 4)),
         c("complete", "average", "centroid"), c("single", "complete", "average", "centroid"))
# Dunn index for h.c. average
clv.Dunn(cls.scatt.data(scale(diabetic_data), cutree(havg, 4)),
         c("complete", "average", "centroid"), c("single", "complete", "average", "centroid"))
# Dunn index for h.c. complete
clv.Dunn(cls.scatt.data(scale(diabetic_data), cutree(hcomp, 4)),
         c("complete", "average", "centroid"), c("single", "complete", "average", "centroid"))
# Dunn index for k-means
clv.Dunn(cls.scatt.data(scale(diabetic_data), kk5$cluster),
         c("complete", "average", "centroid"), c("single", "complete", "average", "centroid"))

# Davies-Bouldin index for h.c. ward
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data), cutree(hward, 4)),
                   c("complete", "average", "centroid"), c("single", "complete", "average", "centroid"))
# Davies-Bouldin index for h.c. average
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data), cutree(havg, 4)),
                   c("complete", "average", "centroid"), c("single", "complete", "average", "centroid"))
# Davies-Bouldin index for h.c. complete
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data), cutree(hcomp, 4)),
                   c("complete", "average", "centroid"), c("single", "complete", "average", "centroid"))
# Davies-Bouldin index for k-means
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data), kk5$cluster),
                   c("complete", "average", "centroid"), c("single", "complete", "average", "centroid"))

8.1.5. Scripts of Spectral Clustering

R Commands

# load data
diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
# create new data sets with two features each
myvars <- c("num_lab_procedures", "time_in_hospital")
newdata <- diabetic_data[myvars]
scaled_data <- scale(as.data.frame(newdata))
myvars2 <- c("num_lab_procedures", "num_medications")
newdata2 <- diabetic_data[myvars2]
scaled_data2 <- scale(as.data.frame(newdata2))

library(ggplot2)
library(kernlab)
# runs with different cluster numbers
sc1 <- specc(scaled_data, centers = 4)
plot(scaled_data, col = sc1)
sc2 <- specc(scaled_data, centers = 25)
plot(scaled_data, col = sc2)
sc3 <- specc(scaled_data, centers = 40)
plot(scaled_data, col = sc3)
sc4 <- specc(scaled_data2, centers = 4)
plot(scaled_data2, col = sc4)
sc5 <- specc(scaled_data2, centers = 25)
plot(scaled_data2, col = sc5)
sc6 <- specc(scaled_data2, centers = 35)
plot(scaled_data2, col = sc6)
# read the estimated sigma from the fitted kernel
kernelf(sc4)
# random runs with the estimated sigma
sce1 <- specc(scaled_data2, centers = 4, kernel = "rbfdot", kpar = list(sigma = 4.40010321258815))
sce2 <- specc(scaled_data2, centers = 4, kernel = "rbfdot", kpar = list(sigma = 4.40010321258815))
sce3 <- specc(scaled_data2, centers = 4, kernel = "rbfdot", kpar = list(sigma = 4.40010321258815))
sce4 <- specc(scaled_data2, centers = 4, kernel = "rbfdot", kpar = list(sigma = 4.40010321258815))
sce5 <- specc(scaled_data2, centers = 4, kernel = "rbfdot", kpar = list(sigma = 4.40010321258815))
sce6 <- specc(scaled_data2, centers = 4, kernel = "rbfdot", kpar = list(sigma = 4.40010321258815))

# validation of the first random run
library(clv)
clv.Dunn(cls.scatt.data(scale(scaled_data2), as.vector(sce1)),
         c("complete", "average", "centroid"), c("single", "complete", "average", "centroid"))
clv.Davies.Bouldin(cls.scatt.data(scale(scaled_data2), as.vector(sce1)),
                   c("complete", "average", "centroid"), c("single", "complete", "average", "centroid"))

# cluster sizes
plot(1:40, sort(size(sc3), decreasing = TRUE))
plot(1:4, sort(size(sce1), decreasing = TRUE))