Classifying Readmissions of Diabetic Patient Encounters (Mayur Srinivasan)
Readmission rates in hospitals are a key indicator of the quality of patient care and a clear indication of the total cost and inconvenience associated with treatment. Patients with serious medical conditions such as diabetes mellitus are key drivers of readmission rates owing to the complexity of their illness. Being able to predict from patient features whether a patient will need readmission can therefore help doctors and hospitals provide better care initially and avoid financial penalties under the Affordable Care Act's readmission policy.
Next generation electronic medical records and search a test implementation i... (lucenerevolution)
Presented by David Piraino, Chief Imaging Information Officer, Imaging Institute, Cleveland Clinic
& Daniel Palmer, Chief Imaging Information Officer, Imaging Institute, Cleveland Clinic
Most patient-specific medical information is document oriented with varying amounts of associated metadata. Most patient medical information is textual and semi-structured. Electronic Medical Record (EMR) systems are not optimized to present this textual information to users in the most understandable ways; present EMRs show information only in a reverse-time-oriented, patient-specific manner. This talk describes the construction and use of Solr search technologies to provide relevant historical information at the point of care while interpreting radiology images.
Radiology reports over a 4-year period were extracted from our Radiology Information System (RIS) and passed through a text processing engine to extract the results, impression, exam description, location, history, and date. Fifteen cases reported during clinical practice were used as test cases to determine whether "similar" historical cases were found. The results were evaluated by the number of searches that returned any result in less than 3 seconds and by the number of cases that illustrated the questioned diagnosis in the top 10 results returned, as determined by a bone and joint radiologist. Methods to better optimize the search results were also reviewed.
An average of 7.8 out of the 10 highest rated reports showed a similar case highly related to the present case. The best search showed 10 out of 10 cases that were good examples, and the lowest-matching search showed 2 out of 10. The talk will highlight this specific use case and the issues and advances of using Solr search technology in medicine, with a focus on point-of-care applications.
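As a rough illustration of the kind of point-of-care query described here, the sketch below builds a Solr /select request for reports similar to the current exam. The core name ("radreports"), field names ("impression", "history"), and search terms are illustrative assumptions, not the presenters' actual schema.

```python
from urllib.parse import urlencode

def build_similarity_query(base_url, impression_terms, top_k=10):
    """Build a Solr /select URL for reports similar to the current exam."""
    params = {
        "q": " ".join(impression_terms),  # free-text terms from the impression
        "df": "impression",               # default search field (assumed name)
        "fq": "history:[* TO *]",         # only reports carrying a history section
        "rows": top_k,                    # top-10 results, as in the evaluation
        "sort": "score desc",
    }
    return f"{base_url}/radreports/select?{urlencode(params)}"

url = build_similarity_query("http://localhost:8983/solr",
                             ["osteochondral", "lesion", "talus"])
```

In a deployment, the returned JSON's top hits would be shown to the radiologist alongside the current study; here only the request construction is sketched.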
Leveraging Text Classification Strategies for Clinical and Public Health Appl... (Karin Verspoor)
Human-generated text is a critical component of recorded clinical data, yet remains an under-utilised resource in clinical informatics applications due to minimal standards for sharing of unstructured data as well as concerns about patient privacy. Where we can access and analyse clinical text, we find that it provides a hugely valuable resource. In this talk, I will describe two projects where we have used text classification as the basis for addressing a clinical objective: (1) a syndromic surveillance project where the task is the monitoring of health and social media data sources for changes that indicate the onset of disease outbreaks, and (2) the analysis of hospital records to enable retrieval of specific disease cases, for monitoring of the hospital case mix as well as for construction of patient cohorts for clinical research studies. I will end by briefly discussing the huge potential for clinical text analysis to support changing the way modern medicine is practised.
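A minimal sketch of the kind of text classifier such projects typically start from: a multinomial Naive Bayes over bag-of-words counts with Laplace smoothing. The "syndromic surveillance" labels and snippets below are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (label, text) pairs. Returns counts needed for NB."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict(model, text):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            # Laplace (add-one) smoothing for unseen words
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([
    ("outbreak", "fever cough cluster reported"),
    ("outbreak", "sudden fever spike in region"),
    ("routine", "annual checkup scheduled"),
    ("routine", "routine followup visit scheduled"),
])
print(predict(model, "fever cluster in region"))  # outbreak
```

Real systems layer richer features (n-grams, clinical concept normalisation) on top, but the classification core is the same.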
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ... (IJDKP)
A large amount of heterogeneous medical data is generated every day in various healthcare organizations. These data could yield insights for improving monitoring and care delivery in the Intensive Care Unit. At the same time, they present a challenge: reducing the amount of data without information loss. Dimension reduction is considered the most popular approach for reducing data size and for reducing noise and redundancy in data. In this paper, we investigate the effect of the average laboratory test value and the total number of laboratory tests in predicting patient deterioration in the Intensive Care Unit, where we treat laboratory tests as features. Choosing a subset of features means choosing the most important lab tests to perform. Our approach therefore uses state-of-the-art feature selection to identify the most discriminative attributes, giving a better understanding of the patient deterioration problem. If the number of tests can be reduced by identifying the most important ones, then the redundant tests can also be identified. By omitting redundant tests, observation time could be reduced and early treatment could be provided to avoid risk; unnecessary monetary cost would also be avoided. We apply our technique to the publicly available MIMIC-II database and show the effectiveness of the feature selection. We also provide a detailed analysis of the best features identified by our approach.
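As a toy illustration of the filter-style selection this abstract describes, the sketch below scores each lab test by the normalised gap between its class means and ranks tests by that score. The scoring rule and the synthetic values are assumptions for illustration, not the paper's actual method or MIMIC-II data.

```python
def rank_features(rows, labels, feature_names):
    """Rank features by |mean(deteriorated) - mean(stable)| / value range."""
    def mean(xs):
        return sum(xs) / len(xs)
    scores = {}
    for j, name in enumerate(feature_names):
        col = [r[j] for r in rows]
        pos = [v for v, y in zip(col, labels) if y == 1]  # deteriorated
        neg = [v for v, y in zip(col, labels) if y == 0]  # stable
        spread = (max(col) - min(col)) or 1.0  # guard against constant features
        scores[name] = abs(mean(pos) - mean(neg)) / spread
    return sorted(feature_names, key=lambda n: scores[n], reverse=True)

# rows: (lactate, sodium) per patient; label 1 = deteriorated (synthetic values)
rows = [(4.1, 138), (3.8, 141), (1.0, 139), (1.2, 140)]
labels = [1, 1, 0, 0]
print(rank_features(rows, labels, ["lactate", "sodium"]))  # ['lactate', 'sodium']
```

Here lactate separates the classes while sodium does not, so lactate ranks first; keeping only the top-ranked tests is what lets the paper drop redundant ones.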
The Life-Changing Impact of AI in Healthcare (Kalin Hitrov)
For IT Leaders in the healthcare and pharmaceutical industries looking to understand the impact of AI on their industries and how to overcome the ethical and efficiency challenges that come with its use.
Preoperative Factors Predict Perioperative Morbidity and Mortality After Pancreaticoduodenectomy
David Yu Greenblatt, MD, MSPH, Kaitlyn J. Kelly, MD, Victoria Rajamanickam, MS, Yin Wan, MS,
Todd Hanson, BS, Robert Rettammel, MA, Emily R. Winslow, MD, Clifford S. Cho, MD, FACS,
and Sharon M. Weber, MD, FACS
Department of Surgery, University of Wisconsin, Madison, WI.
Original article:
Data Con LA 2019 - Best Practices for Prototyping Machine Learning Models for... (Data Con LA)
Medical institutions, universities and software giants like Google and Microsoft are dedicating increasing resources to machine learning for healthcare. This is a very exciting but relatively young field, and best practices for methods and reporting of results are not yet fully established. I have 2.5 years of experience as a data scientist at a national cancer center, working on clinical data, evaluating external vendors and peer reviewing machine learning in healthcare papers. The talk gives an overview of best practices in prototyping machine learning models on data from the patient electronic health record (EHR). The topics addressed are:
1. Introduction to the EHR
2. Overview of machine learning applications to the EHR
3. Cohort definition for survival problems
4. Data cleaning
5. Performance metrics
Excerpts of papers from renowned institutions will be critically reviewed. The material is intended to be useful not only to machine learning for healthcare professionals, but to practitioners dealing with very unbalanced datasets in the temporal domain. For example, customer churn prediction can be modeled as a survival problem.
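One of the talk's themes, performance metrics on very unbalanced data, can be shown in a few lines: on a cohort with one positive in ten, a classifier that predicts "all negative" scores 90% accuracy yet has zero precision and recall. The toy labels are invented for illustration.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 1 positive in 10: predicting "all negative" gets 90% accuracy but 0 recall.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10
print(precision_recall(y_true, all_negative))  # (0.0, 0.0)
```

This is why unbalanced clinical problems are reported with precision/recall-style metrics (or time-to-event metrics, for the survival framing the talk discusses) rather than accuracy.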
Predictive Analytics and Machine Learning for Healthcare - Diabetes (Dr Purnendu Sekhar Das)
Machine learning on clinical datasets to predict the risk of chronic disease conditions like Type 2 diabetes mellitus in advance, as well as to predict outcomes like hospital readmission using EMR real-world evidence (RWE) data.
Machine learning and operations research to find diabetics at risk for readmission.
A team of researchers was able to apply machine learning to reduce readmissions for diabetics; see "Identifying diabetic patients with high risk of readmission" (Bhuvan, Kumar, Zafar, and Kishore, 2016).
A Quick Start To Blockchain (Seval Çapraz)
Blockchain is one of the most innovative discoveries of the past century.
The first cryptocurrency, Bitcoin, was proposed in 2008 by Satoshi Nakamoto with a white paper.
Binary Search Implementation in Assembly Language (Seval Çapraz)
Given an array of numbers sorted in ascending order, we determine whether a requested number is present. A binary search implementation in assembly language.
Importance of software quality assurance to prevent and reduce software failu... (Seval Çapraz)
Importance of software quality assurance to prevent and reduce software failures:
Document Management System In Defence Industry Case Study by Seval Çapraz
What is Data Mining? Which algorithms can be used for data mining? (Seval Çapraz)
This presentation covers what data mining is and which techniques and algorithms are available, helping you understand the core concepts of data mining.
Statistical Data Analysis on Diabetes 130-US hospitals for years 1999-2008 Data Set
Statistical Data Analysis
on
Diabetes 130-US hospitals
for years 1999-2008 Data Set
Document Version: 1.0
(Date: 12/01/15)
Seval Ünver
unver.seval@metu.edu.tr
Student Number: 1900810 (M.Sc.)
Department of Computer Engineering, Middle East Technical University
Ankara, TURKEY
Diabetes 130-US hospitals for years 1999-2008 Data Set Analysis, by Seval Unver
Version History
Version | Status | Date | Responsible | Version Definition
0.1 | Sent via email | 29/10/14 | Seval Unver | Projection by PCA (6 hours)
0.2 | Uploaded to OdtuClass | 05/11/14 | Seval Unver | Projection by MDS (6 hours)
0.3 | Uploaded to OdtuClass | 19/11/14 | Seval Unver | Data clustering by hierarchical and k-means clustering (6 hours)
0.4 | Uploaded to OdtuClass | 26/11/14 | Seval Unver | Cluster Validation (5 hours)
1.0 | Final Report | 12/01/15 | Seval Unver | Spectral Clustering (6 hours)
Table of Contents
1. Data Set Description.........................................................................................................................4
2. Data projection by PCA....................................................................................................................8
2.1. Eigenvalues and Eigenvectors..................................................................................................8
2.2. Plot directions of the first and second principal components on the original coordinate
system............................................................................................................................11
2.3. Transformed data set onto a new coordinate system by using the first two principal
components....................................................................................................................12
2.4. Personal Observations and Comments...................................................................................12
2.5. Details of Implementation......................................................................................................13
3. Data projection by MDS.................................................................................................................16
3.1. Classical Metric......................................................................................................................16
3.2. Sammon Mapping and isoMDS..............................................................................................16
3.3. Use the projection of data onto first two principal axes (as a result of PCA) to initialize MDS
(sammon and isoMDS). Plot the final projections.........................................................................19
3.4. Observations and Comments..................................................................................................20
3.5. Self-Reflection about MDS....................................................................................................20
4. Clustering.......................................................................................................................................20
4.1. Hierarchical Clustering...........................................................................................................20
4.2. K-Means Clustering................................................................................................................24
4.2.1. K-Means algorithm for different 5 k values – Plot Error................................................25
4.2.2. K-Means with 5 different initial configurations when k is 100 and when k is 25 – Error
Plot............................................................................................................................................25
4.2.3. Plot the data in 2D...........................................................................................................26
4.3. Self-Reflection About Clustering............................................................................................27
5. Cluster Validation...........................................................................................................................28
5.1. Comparison of Actual Labels and Predicted Labels ..............................................................28
5.2 Dunn Index and Davies-Bouldin.............................................................................................28
5.2.1 Dunn Index Measurements..............................................................................................30
5.2.2 Davies-Bouldin Measurements........................................................................................30
5.3 Self-Reflection About Validation.............................................................................................31
6. Spectral Clustering.........................................................................................................................31
7. References......................................................................................................................................37
8. Appendix.........................................................................................................................................38
8.1. Used Scripts & Programs........................................................................................................38
8.1.1. Scripts of PCA Projection...............................................................................................38
8.1.2. Scripts of MDS Projection..............................................................................................39
8.1.3 Scripts of Clustering.........................................................................................................39
8.1.4 Scripts of Cluster Validation............................................................................................40
8.1.5. Scripts of Spectral Clustering.........................................................................................43
1. Data Set Description
"Diabetes 130-US hospitals for years 1999-2008 Data Set" is selected for this research. This data
has been prepared to analyze factors related to readmission as well as other outcomes pertaining to
patients with diabetes. The dataset represents 10 years (1999-2008) of clinical care at 130 US
hospitals and integrated delivery networks. It includes 50 features representing patient and hospital
outcomes.
The original large database has 74 million unique encounters corresponding to 17 million unique
patients. The database consists of 41 tables in a fact-dimension schema and a total of 117 features.
Almost 70,000 inpatient diabetes encounters were identified with sufficient detail for this analysis.
In earlier research, this database was used to show the relationship between the measurement of
HbA1c and early readmission while controlling for covariates such as demographics, severity and
type of the disease, and type of admission. The dataset was created in two steps. First, encounters of
interest were extracted from the database with 55 attributes. Second, preliminary analysis and
preprocessing of the data were performed, retaining only those features (attributes) and
encounters that could be used in further analysis, that is, those containing sufficient information [1].
Information was extracted from the database for encounters that satisfied the following criteria:
1. It is an inpatient encounter (a hospital admission).
2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the
system as a diagnosis.
3. The length of stay was at least 1 day and at most 14 days.
4. Laboratory tests were performed during the encounter.
5. Medications were administered during the encounter.
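The five inclusion criteria above can be expressed as a simple filter over encounter records. The flat dictionary fields below are illustrative assumptions; the study's extraction actually ran against a 41-table fact-dimension schema.

```python
def eligible(enc):
    """Apply the study's five inclusion criteria to one encounter record."""
    return (enc["inpatient"]                      # 1. hospital admission
            and enc["diabetes_diagnosis"]         # 2. any diabetes diagnosis entered
            and 1 <= enc["length_of_stay"] <= 14  # 3. stay between 1 and 14 days
            and enc["num_lab_tests"] > 0          # 4. lab tests performed
            and enc["num_medications"] > 0)       # 5. medications administered

encounters = [
    {"inpatient": True, "diabetes_diagnosis": True, "length_of_stay": 3,
     "num_lab_tests": 25, "num_medications": 8},
    {"inpatient": True, "diabetes_diagnosis": True, "length_of_stay": 20,
     "num_lab_tests": 25, "num_medications": 8},   # excluded: stay too long
]
print([eligible(e) for e in encounters])  # [True, False]
```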
Data Set Download Link: http://archive.ics.uci.edu/ml/datasets/Diabetes+130-
US+hospitals+for+years+1999-2008
Date Donated: 05/03/14
Source: The data are submitted on behalf of the Center for Clinical and
Translational Research, Virginia Commonwealth University, a
recipient of NIH CTSA grant UL1 TR00058 and a recipient of the
CERNER data. John Clore (jclore '@' vcu.edu), Krzysztof J. Cios
(kcios '@' vcu.edu), Jon DeShazo (jpdeshazo '@' vcu.edu), and Beata
Strack (strackb '@' vcu.edu). This data is a de-identified abstract of
the Health Facts database (Cerner Corporation, Kansas City, MO).
Table 1. Source of data set
There are 100,000 instances and 50 columns in this data set. The data set is multivariate, since
there are many variables. Its domain is the life sciences, and the data are real-world records; for
this reason, there are missing values. Classification and clustering methods can be used on this
data.
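As a small illustration of the clustering methods applied later in this report, the sketch below runs a plain k-means on two numeric features (e.g. time in hospital and number of medications). The toy points are synthetic, not drawn from the data set.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: assign to nearest center, recompute means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

points = [(1, 2), (1, 3), (2, 2), (9, 9), (10, 8), (9, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

On these well-separated points the two recovered clusters each contain three points; the report's later sections evaluate such clusterings with the Dunn and Davies-Bouldin indices.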
The data contain attributes such as patient number, race, gender, age, admission type, time in
hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test
result, diagnosis, number of medications, diabetic medications, and number of outpatient, inpatient,
and emergency visits in the year before the hospitalization. The whole attribute list can be seen in
Table 2.
No | Feature name | Type | Description and values | % missing
1 | Encounter ID | Numeric | Unique identifier of an encounter | 0%
2 | Patient number | Numeric | Unique identifier of a patient | 0%
3 | Race | Nominal | Values: Caucasian, Asian, African American, Hispanic, and other | 2%
4 | Gender | Nominal | Values: male, female, and unknown/invalid | 0%
5 | Age | Nominal | Grouped in 10-year intervals: [0, 10), [10, 20), ..., [90, 100) | 0%
6 | Weight | Numeric | Weight in pounds | 97%
7 | Admission type | Nominal | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available | 0%
8 | Discharge disposition | Nominal | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available | 0%
9 | Admission source | Nominal | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital | 0%
10 | Time in hospital | Numeric | Integer number of days between admission and discharge | 0%
11 | Payer code | Nominal | Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay | 52%
12 | Medical specialty | Nominal | Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon | 53%
13 | Number of lab procedures | Numeric | Number of lab tests performed during the encounter | 0%
14 | Number of procedures | Numeric | Number of procedures (other than lab tests) performed during the encounter | 0%
15 | Number of medications | Numeric | Number of distinct generic names administered during the encounter | 0%
16 | Number of outpatient visits | Numeric | Number of outpatient visits of the patient in the year preceding the encounter | 0%
17 | Number of emergency visits | Numeric | Number of emergency visits of the patient in the year preceding the encounter | 0%
18 | Number of inpatient visits | Numeric | Number of inpatient visits of the patient in the year preceding the encounter | 0%
19 | Diagnosis 1 | Nominal | The primary diagnosis (coded as first three digits of ICD9); 848 distinct values | 0%
20 | Diagnosis 2 | Nominal | Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values | 0%
21 | Diagnosis 3 | Nominal | Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values | 1%
22 | Number of diagnoses | Numeric | Number of diagnoses entered to the system | 0%
23 | Glucose serum test result | Nominal | Indicates the range of the result or if the test was not taken. Values: ">200", ">300", "normal", and "none" if not measured | 0%
24 | A1c test result | Nominal | Indicates the range of the result or if the test was not taken. Values: ">8" if the result was greater than 8%, ">7" if the result was greater than 7% but less than 8%, "normal" if the result was less than 7%, and "none" if not measured | 0%
25-47 | 23 features for medications | Nominal | Indicates whether the drug was prescribed or there was a change in the dosage. Values: "up" if the dosage was increased during the encounter, "down" if decreased, "steady" if unchanged, and "no" if the drug was not prescribed | 0%
48 | Change of medications | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: "change" and "no change" | 0%
49 | Diabetes medications | Nominal | Indicates if there was any diabetic medication prescribed. Values: "yes" and "no" | 0%
50 | Readmitted | Nominal | Days to inpatient readmission. Values: "<30" if the patient was readmitted in less than 30 days, ">30" if readmitted in more than 30 days, and "No" for no record of readmission | 0%
Table 2. List of features and their descriptions in the initial dataset
The data are real-world data, so they contain incomplete, redundant, and noisy information. Some
features have a high percentage of missing values: weight (97% missing), payer code (52%), and
medical specialty (53%). Weight could be directly relevant to diabetes, but it is too sparse in this
database, so it can be removed. Payer code can also be removed because it is not relevant to
diabetes. The medical specialty attribute, which records the admitting physician's specialty, can be
removed too; it might be important, but it is not the focus of this research. Therefore the three
features with the highest proportion of missing values are removed.
To summarize, the dataset consists of hospital admissions of length between 1 and 14 days that did
not result in a patient death or discharge to a hospice. Each encounter corresponds to a unique
patient diagnosed with diabetes, although the primary diagnosis may be different. During each of
the analyzed encounters, lab tests were ordered and medication was administered [1].
Four groups of encounters are considered: (1) no HbA1c test performed, (2) HbA1c performed and
in normal range, (3) HbA1c performed and the result is greater than 8% with no change in diabetic
medications, and (4) HbA1c performed, result is greater than 8%, and diabetic medication was
changed.
A subset of 1000 instances is separated out for analysis as a training set. In this smaller data set,
the distribution of HbA1c changes, as can be seen by comparing Table 3 and Table 4 below.
The share of the population that did not have an HbA1c test is 81.60% in the whole data set and
81.50% in the training data set; the corresponding readmission rates are 9.40% and 9.32%. The
share of the population with an HbA1c result above 8% is 8.90% in the whole data set and 11.80%
in the training data set, and the readmission rates are close to each other. In the group with a high
result and a change of medication, the share readmitted within 30 days is 8.90% for the whole
data set and 10.00% for the training data set.
Encounter ID and patient number are removed because we are interested in a summary of this data.
Diagnosis 1, Diagnosis 2, and Diagnosis 3 are removed because they are nominal and each has more
than 900 distinct values. Race is removed from the data set because it is not necessary at this point.
The 23 medication features are removed because they are nominal and they are not
Table 3. Distribution of HbA1c in the whole data set:

HbA1c | Number of encounters | % of the population | Readmitted: number of encounters | Readmitted: % in group
No test was performed | 57080 | 81.60% | 5342 | 9.40%
Result was high and the diabetic medication was changed | 4071 | 5.80% | 361 | 8.90%
Result was high but the diabetic medication was not changed | 2196 | 3.10% | 166 | 7.60%
Normal result of the test | 6637 | 9.50% | 590 | 8.90%

Table 4. Distribution of HbA1c in the training data set:

HbA1c | Number of encounters | % of the population | Readmitted: number of encounters | Readmitted: % in group
No test was performed | 815 | 81.50% | 76 | 9.32%
Result was high and the diabetic medication was changed | 60 | 6.00% | 6 | 10.00%
Result was high but the diabetic medication was not changed | 58 | 5.80% | 6 | 10.34%
Normal result of the test | 67 | 6.70% | 6 | 8.95%
useful for Principal Component Analysis. Our aim is to find a relation between the HbA1c test and
the readmission rate, so we keep that information. Gender is expressed as 1 for female and 0 for
male, and the other nominal values are converted to integer representations.
The nominal values were recoded as integers (the mapping is tabulated below). After removing the
unnecessary features, there are now 18 features in the training data set. Types of discrete data:
• count data (time_in_hospital, num_lab_procedures, num_procedures, num_medications,
number_outpatient, number_emergency, number_inpatient, number_diagnoses)
• nominal data (gender, admission_type_id, discharge_disposition_id, admission_source_id,
diabetesMed, change)
• ordinal data (age, A1Cresult, max_glu_serum, readmitted)
2. Data projection by PCA
PCA (Principal Component Analysis) is performed with the GNU R function prcomp, using the
eigenvalues of the covariance matrix as the method.
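The report performs this step with prcomp in R. As an illustration of the same computation (eigen-decomposition of the covariance matrix of the centered data), here is a minimal Python sketch using NumPy; it is not the code used in the report.

```python
import numpy as np

def pca(X):
    """PCA via eigen-decomposition of the covariance matrix.

    Returns eigenvalues (variances) in decreasing order, the matching
    eigenvectors (principal axes) as columns, and the projected scores.
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    return eigvals[order], eigvecs[:, order], Xc @ eigvecs[:, order]
```

R's prcomp returns the same quantities (sdev, rotation, x), up to the arbitrary signs of the eigenvectors.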
2.1. Eigenvalues and Eigenvectors
Image: First 18 eigenvalues of the whole data
Changed nominal values:
gender: Female = 1, Male = 0
age: [0-10) = 1, [10-20) = 2, ..., [90-100) = 10
A1Cresult: None = 0, Normal = 1, >7 = 2, >8 = 3
max_glu_serum: None = 0, Normal = 1, >200 = 2, >300 = 3
change: None = 0, change = 1
diabetesMed: No = 0, Yes = 1
readmitted: No = 0, <30 = 1, >30 = 2
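As a minimal sketch of this recoding step (the report does it in Excel and R; the value spellings below follow the mapping above, while the raw CSV uses slightly different strings such as "Norm" and "Ch", so the dictionary keys would need adjusting for the real file):

```python
# Integer codes for the nominal features, mirroring the mapping above.
RECODE = {
    "gender":        {"Female": 1, "Male": 0},
    "A1Cresult":     {"None": 0, "Normal": 1, ">7": 2, ">8": 3},
    "max_glu_serum": {"None": 0, "Normal": 1, ">200": 2, ">300": 3},
    "change":        {"None": 0, "change": 1},
    "diabetesMed":   {"No": 0, "Yes": 1},
    "readmitted":    {"No": 0, "<30": 1, ">30": 2},
}
# Age intervals "[0-10)" ... "[90-100)" map to 1 ... 10.
AGE_CODE = {f"[{10 * i}-{10 * (i + 1)})": i + 1 for i in range(10)}

def recode_row(row):
    """Return a copy of the row with nominal values replaced by integer codes."""
    out = {k: RECODE[k].get(v, v) if k in RECODE else v for k, v in row.items()}
    if "age" in out:
        out["age"] = AGE_CODE.get(out["age"], out["age"])
    return out
```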
First we look at the whole data and plot it. It is much easier to explain PCA for two dimensions and
then generalize from there, so two numeric features are selected: A1Cresult and time_in_hospital.
Since the A1Cresult feature is categorical, it is replaced with num_lab_procedures.
Look at the colorful plot of num_lab_procedures against num_medications.
If we look at the PCA components, we can see that the first component is much larger than the
second.
PCA components of a subset of the data:
2.2. Plot directions of the first and second principal components on the
original coordinate system
PCA of whole data:
When we use the 2 features:
Image: 2D projections of data on the principal components
2.3. Transformed data set onto a new coordinate system by using the
first two principal components
Image: Cumulative Percentages of Eigenvalues
Image: Component 1 vs. Component 2
2.4. Personal Observations and Comments
This data set includes not only numerical values but also nominal values; there are both continuous
and categorical data. However, PCA is developed for and suited to continuous (ideally, multivariate
normal) data. There are no obvious outliers in the data. Although a PCA applied to binary data would
yield results comparable to those obtained from a Multiple Correspondence Analysis (factor scores
and eigenvalues are linearly related), there are more appropriate techniques to deal with mixed data
types, namely Multiple Factor Analysis for mixed data, available in the FactoMineR R package
(AFDM()). If the variables can be considered as structured subsets of descriptive attributes, then
Multiple Factor Analysis (MFA()) is also an option.
The challenge with categorical variables is to find a suitable way to represent distances between
variable categories and individuals in the factorial space. To overcome this problem, one can look
for a non-linear transformation of each variable (whether nominal, ordinal, polynomial, or
numerical) with optimal scaling. This is well explained in "Gifi Methods for Optimal Scaling in R:
The Package homals" [2], and an implementation is available in the corresponding R package
homals.
2.5. Details of Implementation
The data file ends with ".csv", which stands for comma-separated values. There are 101768 lines in
the file. The data are imported into R:
> diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
To get the summary of data:
> summary(diabetic_data)
The data are opened in Excel and three feature columns are deleted: weight (97% of values
missing), payer code (52% missing), and medical specialty (53% missing). Then a subset of the
data, 1000 rows, is selected as training data. This training set is the first 1000 rows of the original
data, and it represents the whole data correctly because these 1000 rows were chosen randomly by
the creators of the data set.
The training data are imported into R. The summary of the training data differs little from that of
the original data; for example, approximately half of the patients are women and half are men, and
the mean values are close to each other. Of course there are differences between the original data
and the training data, but using a training set speeds up the analysis.
To get the new distribution of the HbA1c result, a subset of this data is shown in R. After removing
the unnecessary columns and replacing nominal values with integer representations, the data set has
18 features, so it is imported again.
To run the Principal Component Analysis, we can use the princomp function in R. To use the
correlation matrix, we give it the cor=TRUE parameter. It shows the importance of components:
Image: Summary of new data set (with 18 features)
Image: Plot of Diabetic_Data
It looks like Component 1 is very strong. When it is plotted, it gives the results which can be seen
below.
Image: Plot of PCA
A scree plot is a graphical display of the variance of each component in the dataset, used to
determine how many components should be retained in order to explain a high percentage of the
variation in the data. The first and second components are the largest, so we should keep them.
Now let's see the biplot. Biplots are a type of exploratory graph used in statistics, a generalization of
the simple two-variable scatterplot. A biplot allows information on both samples and variables of a
data matrix to be displayed graphically. Samples are displayed as points while variables are
displayed either as vectors, linear axes or nonlinear trajectories. In the case of categorical variables,
category level points may be used to represent the levels of a categorical variable. A generalised
biplot displays information on both continuous and categorical variables.
Image: biplot of PCA
It is hard to read the information in this biplot image. However, we can see that some of the red
lines point in the same direction, which means those variables are associated.
3. Data projection by MDS
For visualisation, three MDS (Multidimensional Scaling) methods are used: classical
multidimensional scaling, Sammon mapping, and non-metric MDS.
Classical multidimensional scaling is done in 2D with the cmdscale and dist functions, using the
Euclidean distance between samples. As samples, the raw data and the first two dimensions of the
PCA result are used.
Sammon mapping is done with the sammon function in the MASS package. For the initial
configuration, the result of PCA and several instances of uniformly distributed random points are used.
Similar to Sammon mapping, non-metric MDS is done with the isoMDS function in the MASS
package. For the initial configuration, the same values as in the Sammon mapping analysis are used.
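The report uses R's cmdscale for the classical step; for illustration, the Torgerson computation it performs can be sketched in a few lines of Python (a sketch under the assumption of a Euclidean distance matrix, not a drop-in replacement):

```python
import numpy as np

def classical_mds(D, k=2):
    """Torgerson's classical MDS: recover k-dimensional coordinates from a
    matrix of pairwise distances D (the computation behind R's cmdscale)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]     # keep the k largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * scale          # embedded coordinates
```

When D is a Euclidean distance matrix, the embedding reproduces the pairwise distances up to rotation and reflection, which is why classical MDS on Euclidean distances is essentially PCA, as noted later in Section 3.4.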
3.1. Classical Metric
Image: Classical MDS
3.2. Sammon Mapping and isoMDS
At least 5 different random initial configurations were tried; the one that gives the minimum error
was chosen, and the final MDS projection onto two dimensions was plotted. If the result of PCA is
used, Sammon mapping converges in just a few iterations and leaves the configuration almost
unchanged. The result is very sensitive to the magic parameter, which controls the step size of the
iterations, as indicated by the MASS documentation. Here, the magic parameter is chosen as 0.05.
Image: Sammon Mapping with PCA
Image: Magic parameter is 0.05
If the magic parameter is 0.05:
Initial stress : 0.79437
stress after 2 iters: 0.61243
Image: Magic parameter is 0.01
If the magic parameter is 0.01:
Initial stress : 0.79437
stress after 10 iters: 0.50231, magic = 0.115
stress after 10 iters: 0.50231
Image: Magic parameter is 0.02
If the magic parameter is 0.02:
Initial stress : 0.79437
stress after 10 iters: 0.41383, magic = 0.231
stress after 20 iters: 0.36066, magic = 0.021
stress after 30 iters: 0.34217, magic = 0.002
stress after 35 iters: 0.33437
3.3. Use the projection of data onto the first two principal axes (as a result
of PCA) to initialize MDS (sammon and isoMDS). Plot the final
projections.
Image: isoMDS (Non-metric Mapping with PCA)
initial value 11.809209
final value 11.804621
converged
Image: Non-Metric mapping with random configuration
initial value 48.305275
final value 48.304120
converged
3.4. Observations and Comments
There are many features in this data set, which means high dimensionality. Although we removed
most of the unnecessary features and use a training set of 1000 instances, the data are still not easily
clusterable in 2D, and clusters are not easily visible. The iterative methods used here yield outliers
which are not present in PCA, but this behaviour depends strongly on parameters other than the
distance data. Classic Torgerson metric MDS is actually done by transforming distances into
similarities and performing PCA (eigen-decomposition or singular value decomposition) on those.
So, PCA might be called the algorithm of the simplest MDS.
Thus, MDS and PCA are not at the same level, to be aligned with or opposed to each other. PCA is
just a method, while MDS is a class of analyses. As a mapping, PCA is a particular case of MDS.
On the other hand, PCA is a particular case of factor analysis which, being a data reduction, is more
than only a mapping, while MDS is only a mapping.
3.5. Self-Reflection about MDS
As seen in this assignment, MDS gives much more information than PCA. This assignment
provides a general perspective on the measures of similarity and dissimilarity. Both MDS and PCA
use proximity measures such as the correlation coefficient or Euclidean distance to generate a
spatial configuration (map) of points in multidimensional space where distances between points
reflect the similarity among isolates.
4. Clustering
Clustering is a technique for finding similarity groups in data, called clusters. In other words,
clustering is the task of grouping a set of objects in such a way that objects in the same group
(called a cluster) are more similar (in some sense or another) to each other than to those in other
groups (clusters). On this data, two clustering methods are used for data visualisation.
The first is hierarchical clustering with three different linkage methods, implemented in the GNU R
function hclust with the methods "average", "complete", and "ward". Their dendrograms are plotted.
The second is k-means clustering with different k values (5, 10, 25, 100, 200) and several random
runs at the "elbow" value of k, which is around 100, consistent with the ground truth.
The Euclidean distance between normalized samples is used as the distance between samples for
both methods.
4.1. Hierarchical Clustering
Dendrograms of hierarchical clustering for the three linkages are illustrated under this title. Of the
three, the "ward" method is the easiest to interpret visually, even for a high number of clusters. The
"average" and "complete" methods yield visually similar results but are not as easy to interpret. The
most interesting observation is that, beyond a certain number of clusters, they look similar, even
though they look different when only a few clusters are picked.
Linkage, or the distance from a newly formed node to all other nodes, can be computed in several
different ways: single, complete, and average. The figure below roughly demonstrates what each
linkage evaluates [3]:
Average linkage clustering uses the average similarity of observations between two groups as the
measure between the two groups. Complete linkage clustering uses the farthest pair of observations
between two groups to determine the similarity of the two groups. Single linkage clustering, on the
other hand, computes the similarity between two groups as the similarity of the closest pair of
observations between the two groups [4].
Ward's linkage is distinct from all the other methods because it uses an analysis of variance
approach to evaluate the distances between clusters. In short, this method attempts to minimize the
Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at each step. In general,
this method is regarded as very efficient, however, it tends to create clusters of small size [4].
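The three pairwise linkage rules described above can be stated compactly; this is an illustrative Python sketch of the between-group distances themselves, not of the full hclust algorithm used in the report:

```python
import numpy as np

def group_distance(A, B, linkage="single"):
    """Distance between two groups of points A and B under a linkage rule."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))  # all cross pairs
    if linkage == "single":
        return d.min()    # closest pair of observations
    if linkage == "complete":
        return d.max()    # farthest pair of observations
    if linkage == "average":
        return d.mean()   # average over all pairs
    raise ValueError(f"unknown linkage: {linkage}")
```

Ward's method is different in kind: instead of a pairwise distance rule, it merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares.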
Image: Hierarchical Clustering with Ward Method
Image: Dendrograms for 5 (red), 25 (green), 100 (blue) clusters on Hierarchical Clustering with
Ward Method
Image: Hierarchical Clustering with average method
Image: Dendrograms for 5, 25, 100 clusters on Hierarchical Clustering with Average Method
Image: Hierarchical Clustering with complete method
Image: Dendrograms for 5, 25, 100 clusters on Hierarchical Clustering with Complete Method
Based on the results obtained above, I recommend 25 clusters for this dataset. It is difficult to
estimate the accurate number of clusters, but as seen with the Ward method, using 25 clusters
seems the best choice.
4.2. K-Means Clustering
K-means clustering aims to partition n observations into k clusters in which each observation
belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. Numbers of
clusters 5, 10, 25, 100, and 200 are examined under this title. The total sum of squares within
clusters, as given by the "tot.withinss" property of the kmeans function, is used as the error function
to evaluate the best number of clusters.
22 is close to the "rule of thumb" value for 1000 samples (k ≈ √(n/2)). The ground truth of 100 is
close to the "elbow" value, beyond which the error does not improve as dramatically. In the next
steps, five random runs each of k=100 and k=25 and their errors are illustrated.
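The report uses R's kmeans and its tot.withinss value; a minimal Python sketch of the same error measure (Lloyd's algorithm plus the within-cluster sum of squares used for the elbow scan) could look like this. It is an illustration, not the report's code:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm. Returns labels and the total within-cluster
    sum of squares (the analogue of R's tot.withinss)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)            # nearest center for each point
        for j in range(k):
            if (labels == j).any():          # keep old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    tot_withinss = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, tot_withinss

# Elbow scan as in the report (there: k = 5, 10, 25, 100, 200 on 1000 samples):
# errors = {k: kmeans(X, k)[1] for k in (5, 10, 25, 100, 200)}
```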
4.2.1. K-Means algorithm for 5 different k values – Error Plot
Image: "K-Means Clustering With k=5,10,25,100,200" vs. "Error Plot"
4.2.2. K-Means with 5 different initial configurations when k is 100 and
when k is 25 – Error Plot
Two different k values are tried in this task. Because determining the number of clusters is difficult,
k = 25 and k = 100 are each tried with 5 different initial configurations. When k is 100, the third try
is best because its error is lowest; when k is 25, the fifth try is best, again because its error is lowest.
Image: "5 random runs with k=100" vs. "Error Plot"
Image: “5 random runs with k=25” v.s. “Error Plot”
4.2.3. Plot the data in 2D
In the previous step, the error plot was shown in a graph. Of the five initial configurations, the third
is chosen because its error is the lowest when k=100. Here is the 2D plot of the third configuration
with k=100 in the K-Means algorithm.
Image: Third initial configuration with k=100 – Clusters in 2D
Here is the 2D plot of the fifth configuration with k=25 in the K-Means algorithm.
Image: Fifth initial configuration with k=25 – Clusters in 2D
4.3. Self-Reflection About Clustering
This assignment helped me to estimate the number of clusters in my data; after it, I chose k=25 as
the cluster number. The figures and projections were very beneficial in this task.
In the k-means algorithm, my difficulty was specifying k. In addition, the algorithm is very
sensitive to outliers, and my data has many outliers because it is real-world data. A weakness of
k-means is that it is only applicable when the mean is defined; for categorical data there is
k-modes, in which the centroid is represented by the most frequent values. Therefore we cannot say
that k-means is the best way to estimate the number of clusters, and the other algorithms have their
own weaknesses. Comparing different clustering algorithms is a difficult task, since no one knows
the correct clusters.
It is very hard, if not impossible, to know what distribution the application data follow. The data
may not fully follow any “ideal” structure or distribution required by the algorithms. One also needs
to decide how to standardize the data, to choose a suitable distance function and to select other
parameter values.
Hierarchical clustering has O(n^2) complexity, which makes it hard to use for large data sets. I
used a sample of 1000 instances in this task, so it did not take a long time to run.
My data has many nominal attributes with more than two states or values; for these, the commonly
used distance measure is based on the simple matching method.
To sum up, this assignment is very helpful for students who want to apply clustering to unlabeled
big data.
5. Cluster Validation
Cluster validation is concerned with the quality of clusters generated by an algorithm for data
clustering. Given the partitioning of a data set, it attempts to answer questions such as: How
pronounced is the cluster structure that has been identified? How do clustering solutions from
different algorithms compare? How do clustering solutions for different parameters (e.g. the
number of clusters) compare? [6]
5.1. Comparison of Actual Labels and Predicted Labels
"Ground truth" means a set of measurements that is known to be much more accurate than
measurements from the system you are testing. In Diabetes 130-US hospitals for years 1999-2008
Data Set, there was no labels to determine classes. I labeled the four group in a new column with a
Java console program. The column name is label. This column holds numbers which ranges from 1
to 4.
In this dataset, four groups of encounters are considered:
(1) no HbA1c test performed (A1Cresult=0 ),
(2) HbA1c performed and in normal range(A1Cresult=1 or A1Cresult=2),
(3) HbA1c performed and the result is greater than 8% with no change in diabetic medications
(A1Cresult=3 and change=0),
(4) HbA1c performed, result is greater than 8%, and diabetic medication was changed
(A1Cresult=3 and change=1).
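The report does this labeling with a Java console program; the grouping rule itself, using the integer codes from the recoding table (A1Cresult: 0 = none, 1 = normal, 2 = >7%, 3 = >8%; change: 0 = no change, 1 = changed), can be sketched as:

```python
def hba1c_group(a1c_result, change):
    """Map one encounter to its ground-truth group (1-4) as defined above."""
    if a1c_result == 0:
        return 1                        # no HbA1c test performed
    if a1c_result in (1, 2):
        return 2                        # test performed, result in normal range
    return 3 if change == 0 else 4      # result > 8%: meds unchanged vs. changed
```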
Only a cluster number of 4 (the ground truth) is considered.
Method | Precision
H. clustering (ward) | 0.482
H. clustering (average) | 0.993
H. clustering (complete) | 0.976
K-means | 0.139
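The report does not state exactly how this precision was computed. One common choice for scoring clusters against ground-truth labels is purity, where each cluster votes with its majority label; this sketch is an assumption about the method, not a reconstruction of it:

```python
from collections import Counter

def cluster_purity(true_labels, cluster_labels):
    """Fraction of points whose cluster's majority ground-truth label
    matches their own label (a.k.a. purity)."""
    members = {}
    for t, c in zip(true_labels, cluster_labels):
        members.setdefault(c, []).append(t)
    majority_hits = sum(Counter(ts).most_common(1)[0][1] for ts in members.values())
    return majority_hits / len(true_labels)
```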
5.2 Dunn Index and Davies-Bouldin
The goal of using an index is to determine the optimal clustering parameters. Greater intercluster
distances and lesser intracluster distances are desired. Different distance measures can be used for
the index calculations.
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio
between the minimal inter-cluster distance and the maximal intra-cluster distance. Let S and T be
two nonempty subsets of R^N [5]. Then the diameter of S is defined as

  Δ(S) = max_{x,y ∈ S} d(x, y)

and the set distance between S and T is defined as

  δ(S, T) = min_{x ∈ S, y ∈ T} d(x, y).

Here, d(x, y) indicates the distance between points x and y. For any partition into clusters
C_1, ..., C_K, Dunn defined the following index [11]:

  V_D = min_{i ≠ j} δ(C_i, C_j) / max_k Δ(C_k)

Larger values of V_D correspond to good clusters, and the number of clusters that maximizes V_D
is taken as the optimal number of clusters.
The Davies-Bouldin index [9] is a metric for evaluating clustering algorithms. It is an internal
evaluation scheme, where the validation of how well the clustering has been done uses quantities
and features inherent to the dataset. The index is a function of the ratio of within-cluster scatter to
between-cluster separation [5]. The scatter within the ith cluster, S_i, is computed as

  S_i = (1/|C_i|) Σ_{x ∈ C_i} d(x, z_i)

and the distance between clusters C_i and C_j, denoted by d_ij, is defined as

  d_ij = d(z_i, z_j).

Here, z_i represents the ith cluster center. The Davies-Bouldin (DB) index is then defined as

  DB = (1/K) Σ_{i=1..K} R_i,  where  R_i = max_{j ≠ i} (S_i + S_j) / d_ij.

The objective is to minimize the DB index for achieving proper clustering.
The most practical difference between the two indexes is that a higher Dunn index is better, while a
lower Davies-Bouldin index is better. The distances discussed here are Euclidean distances.
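Both indexes can be computed directly from the definitions above. Here is a small Python sketch for illustration (Euclidean distances; single-link separation and complete diameter for Dunn, centroid-based scatter for Davies-Bouldin, matching the formulas; the report's own tables were computed in R):

```python
import numpy as np

def _dist(A, B):
    """All pairwise Euclidean distances between rows of A and rows of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def dunn_index(X, labels):
    """Min inter-cluster distance / max cluster diameter (larger is better)."""
    ks = list(np.unique(labels))
    diam = max(_dist(X[labels == k], X[labels == k]).max() for k in ks)
    sep = min(_dist(X[labels == a], X[labels == b]).min()
              for a in ks for b in ks if a < b)
    return sep / diam

def davies_bouldin(X, labels):
    """Mean over clusters of the worst (S_i + S_j) / d_ij ratio (smaller is better)."""
    ks = list(np.unique(labels))
    centers = np.array([X[labels == k].mean(axis=0) for k in ks])
    S = np.array([_dist(X[labels == k], centers[i:i + 1]).mean()
                  for i, k in enumerate(ks)])
    R = [max((S[i] + S[j]) / _dist(centers[i:i + 1], centers[j:j + 1])[0, 0]
             for j in range(len(ks)) if j != i)
         for i in range(len(ks))]
    return float(np.mean(R))
```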
5.2.1 Dunn Index Measurements
Hierarchical Clustering (Ward):
Hierarchical Clustering (Average):
Hierarchical Clustering (Complete):
K-Means:
Hierarchical methods give better results than K-Means.
5.2.2 Davies-Bouldin Measurements
Hierarchical Clustering (Ward):
Hierarchical Clustering (Average):
Hierarchical Clustering (Complete):
K-Means:
Again, hierarchical methods give better results than K-Means.
5.3 Self-Reflection About Validation
The data set does not have well-separated clusters, so this task was difficult to implement. The aim
of this assignment was to understand and encourage the use of cluster-validation techniques in the
analysis of data. In particular, the assignment attempts to familiarize students with some of the
fundamental concepts behind cluster-validation techniques, and to assist them in making more
informed choices of the measures to be used. However, implementing better cluster validation
requires essential background knowledge. There are many different types of validation techniques,
and some articles in the literature propose their effective use. In conclusion, validation should be
done after more research and with more background knowledge.
6. Spectral Clustering
Clustering is widely applied in science and engineering, including bioinformatics, image
segmentation, and web information retrieval. The essential task of data clustering is partitioning
data points into disjoint clusters so that objects in the same cluster are similar and objects in
different clusters are dissimilar [7]. Many clustering algorithms have shortcomings, so spectral
clustering has been proposed as a promising alternative.
The spectral clustering method is a clustering method based on graph theory. It uses the top
eigenvectors of a matrix derived from the distances between points. Such algorithms have been
used successfully in many applications, including computer vision and VLSI design [8]. Through
spectral analysis of the affinity matrix of a data set, spectral clustering can obtain promising
clustering results [7]. Because its main computation is an eigendecomposition rather than an
iterative search, spectral clustering avoids getting trapped in local minima the way K-means can.
The process of spectral clustering can be summarized as follows [7][8] (suppose the data set
X = {x1, x2, ..., xn} has k classes):
Spectral Clustering Algorithm
STEP 1. Construct the affinity matrix W: if i ≠ j, then w_ij = exp(−‖x_i − x_j‖² / (2σ²)); else
w_ii = 0.
STEP 2. Define the diagonal matrix D, where d_ii = Σ_j w_ij, and define the Laplacian matrix
L = D^(−1/2) W D^(−1/2).
STEP 3. Compute the k eigenvectors u_1, ..., u_k corresponding to the k largest eigenvalues of the
matrix L, and form the matrix U = [u_1 u_2 ... u_k]. Then obtain the matrix Y by normalizing each
row of U to unit length, y_ij = u_ij / (Σ_j u_ij²)^(1/2).
STEP 4. Treat each row of Y as a point in R^k and cluster the rows into k clusters via K-means.
Assign the original point x_i to cluster j iff row i of the matrix Y was assigned to cluster j.
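The four steps above can be sketched directly in Python as an illustration (a NumPy translation of the algorithm, not the report's kernlab code; the deterministic farthest-point initialization of the final k-means step is a simplification introduced here):

```python
import numpy as np

def spectral_cluster(X, k, sigma=1.0, n_iter=50):
    """Spectral clustering following STEP 1-4 above."""
    # STEP 1: Gaussian affinity matrix with zero diagonal.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # STEP 2: degree matrix and L = D^(-1/2) W D^(-1/2).
    d_inv_sqrt = np.diag(W.sum(axis=1) ** -0.5)
    L = d_inv_sqrt @ W @ d_inv_sqrt
    # STEP 3: top-k eigenvectors, rows normalized to unit length.
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)
    # STEP 4: k-means on the rows of Y (farthest-point init for determinism).
    idx = [0]
    for _ in range(1, k):
        d = ((Y[:, None, :] - Y[idx][None]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d.argmax()))
    centers = Y[idx].copy()
    for _ in range(n_iter):
        labels = ((Y[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Y[labels == j].mean(axis=0)
    return labels
```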
On this dataset, the algorithm given above is used for spectral clustering. To implement it, there is
an extensible R package for kernel-based machine learning named kernlab; with it, spectral
clustering can be done easily in a few steps, so the "specc" method from the "kernlab" package is
used. The R scripts can be found in the Appendix, section 8.1.5, Scripts of Spectral Clustering.
In spectral clustering, the similarity between data points is often defined by a Gaussian kernel [7].
The scale hyperparameter σ of the Gaussian kernel greatly influences the final clustering results, so
a parameter estimation is done first to find the best σ. After that, several runs with the same
parameters are compared.
Parameter Estimation
Kernlab includes an S4 method called specc implementing this algorithm, which can be used
through a formula interface or a matrix interface. The S4 object returned by the method extends the
class "vector" and contains the assigned cluster for each point, along with information on the
centers, the size, and the within-cluster sum of squares for each cluster. When a Gaussian RBF
kernel is used, a model-selection process can determine the optimal value of the σ hyperparameter.
For a good value of σ the values of Y tend to cluster tightly, and it turns out that the within-cluster
sum of squares is a good indicator of the "quality" of the sigma parameter found. We then iterate
through sigma values to find an optimal value for σ.
The number of clusters is estimated as 4, 25, and 40; each is tried with the specc method. The
results are shown on two data subsets: {time_in_hospital, num_lab_procedures} and
{num_medications, num_lab_procedures}.
The estimated value for 4 clusters is σ = 4.40010321258815. Random runs are done with this
hyperparameter sigma.
If we estimate 4 clusters:
Image: Plot of data {Time in Hospital, Number of Procedures}
for 4 clusters estimation
Image: Plot of data {Number of Medications, Number of Lab Procedures}
for 4 cluster estimation
If we estimate 25 clusters:
Image: Plot of data {Time in Hospital, Number of Procedures}
for 25 clusters estimation
Image: Plot of data {Number of Medications, Number of Procedures}
for 25 clusters estimation
If we estimate 40 clusters:
Image: Plot of data {Time in Hospital, Number of Procedures}
for 40 clusters estimation
If we estimate 35 clusters:
Image: Plot of data {Number of Medications, Number of Procedures}
for 35 clusters estimation
From the dataset, the "num_lab_procedures" and "num_medications" features are chosen for the 2D
plot. The random run results are presented in the image below; the results are approximately the same.
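To make "approximately the same" precise, the agreement between two random runs can be scored with a label-permutation-invariant measure such as the Rand index. A minimal Python sketch (not part of the original analysis, which compares the runs visually):

```python
from itertools import combinations

def rand_index(a, b):
    """Fraction of point pairs on which two clusterings agree: the pair is
    either together in both clusterings or separated in both."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

run1 = [1, 1, 2, 2, 3, 3]
run2 = [3, 3, 1, 1, 2, 2]        # same partition, clusters merely relabeled
score = rand_index(run1, run2)   # -> 1.0
```

Because the score compares pairs rather than raw labels, two runs that find the same partition under different cluster numbering still score 1.0.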
Image: Random Run Results with Centers=4 and σ=4.40010321258815.
Image: Cluster Sizes for {Time in Hospital, Number of Procedures}
for 40 clusters estimation
Image: Cluster Sizes for {Number of Medications, Number of Procedures}
First Random Run with centers=4 and σ=4.4
Validation
From the random runs, the first result is chosen for validation. Dunn index and Davies-Bouldin
index results are given for comparison. Recall that a higher Dunn index is better, while a lower
Davies-Bouldin index is better. For both the Dunn index and the Davies-Bouldin index, Centroid
Diameter with Complete Link gives the best result. Since the ground truth has 4 clusters, the sizes
of the clusters are consistent with the ground truth.
Spectral (Dunn) Complete diameter Average diameter Centroid diameter
Single link 0.00668823 0.04265670 0.06039251
Complete link 0.52512892 3.34920700 4.74174095
Average link 0.15697465 1.00116480 1.41742928
Centroid link 0.09740126 0.62121320 0.87950129
Table: Spectral Clustering Result Validation with Dunn Index
Spectral (DB) Complete diameter Average diameter Centroid diameter
Single link 194.38845600 37.45235040 26.37252550
Complete link 1.83402600 0.43463780 0.30763710
Average link 7.44275800 1.32189570 0.92798800
Centroid link 10.87935700 1.93850020 1.36070810
Table: Spectral Clustering Result Validation with Davies-Bouldin Index
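As a sketch of how the two indices behave, the following illustrative Python version uses the centroid-diameter and complete-link variants singled out above (the report itself uses the clv package, whose convention of centroid diameter as twice the mean distance to the centroid is assumed here):

```python
import numpy as np

def centroid_diameter(C):
    """Intra-cluster 'centroid diameter': twice the mean distance to the centroid."""
    c = C.mean(axis=0)
    return 2.0 * np.mean(np.linalg.norm(C - c, axis=1))

def complete_link(Ci, Cj):
    """Inter-cluster 'complete link': largest pairwise distance between clusters."""
    return max(np.linalg.norm(p - q) for p in Ci for q in Cj)

def dunn(clusters):
    """Smallest inter-cluster distance over largest diameter; higher is better."""
    k = len(clusters)
    inter = min(complete_link(clusters[i], clusters[j])
                for i in range(k) for j in range(i + 1, k))
    return inter / max(centroid_diameter(C) for C in clusters)

def davies_bouldin(clusters):
    """Mean over clusters of the worst (diam_i + diam_j) / dist_ij; lower is better."""
    k = len(clusters)
    diam = [centroid_diameter(C) for C in clusters]
    return float(np.mean([max((diam[i] + diam[j]) / complete_link(clusters[i], clusters[j])
                              for j in range(k) if j != i) for i in range(k)]))

# two tight, far-apart toy clusters: Dunn is large, Davies-Bouldin is small
A = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0]])
B = A + 10.0
```

Moving the two toy clusters closer together lowers the Dunn index and raises Davies-Bouldin, which is the behavior the tables above exploit when comparing clusterings.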
7. References
[1] Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura,
Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission
Rates: Analysis of 70,000 Clinical Database Patient Records”, BioMed Research International, vol.
2014, Article ID 781670, 11 pages, 2014.
[2] Jan de Leeuw, Patrick Mair, “Gifi Methods for Optimal Scaling in R: The Package homals”,
Journal of Statistical Software August 2009, Volume 31, Issue 4.
[3] Laura Mulvey, Julian Gingold, “Microarray Clustering Methods and Gene Ontology”, 2007.
[4] Phil Ender, “Multivariate Analysis: Hierarchical Cluster Analysis”, 1998.
[5] Maulik U, Bandyopadhyay S., “Performance evaluation of some clustering algorithms and
validity indices”, IEEE Transactions on Pattern Analysis Machine Intelligence, 2002, 24(12): 1650-
1654.
[6] Julia Handl, Joshua Knowles, Douglas Kell, “Computational cluster validation in post-genomic
data analysis”, Bioinformatics 21(15):3201-3212, 2005.
[7] Lai Wei, “Path-based Relative Similarity Spectral Clustering”, 2010 Second WRI Global
Congress on Intelligent Systems, 16-17 Dec. 2010.
[8] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss, “On spectral clustering: Analysis and an
algorithm”, Advances in Neural Information Processing Systems 14 (NIPS 2001).
[9] D.L. Davies and D.W. Bouldin, “A Cluster Separation Measure”, IEEE Trans. Pattern Analysis
and Machine Intelligence, vol. 1, pp. 224-227, 1979.
[10] J.C. Dunn, “A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-
Separated Clusters”, J. Cybernetics, vol. 3, pp. 32-57, 1973.
8. Appendix
8.1. Used Scripts & Programs
The R programming language is used (R version 3.1.1). In this part of the document, you can find
the R scripts and commands used to implement the given tasks: projection with PCA, projection
with MDS, clustering, validation, and spectral clustering.
8.1.1. Scripts of PCA Projection
R Commands
#load data
diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
#see summary
summary(diabetic_data)
#to work in two dimensions, create a new data frame with two selected features
myvars <- c("num_lab_procedures","time_in_hospital")
newdata <- diabetic_data[myvars]
#PCA (prcomp is used so the $rotation, $x and $sdev fields referenced below exist)
my.pca <- prcomp(diabetic_data, scale. = TRUE)
#see the components
plot(my.pca)
biplot(my.pca)
# calculate covariance
my.cov <- cov(diabetic_data)
# calculate eigen values
my.eigen <- eigen(my.cov)
# see the eigen vectors in plot
pc1.slope <- my.eigen$vectors[1,1]/my.eigen$vectors[2,1]
pc2.slope <- my.eigen$vectors[1,2]/my.eigen$vectors[2,2]
abline(0,pc1.slope, col="red")
abline(0,pc2.slope, col="blue")
# cumulative percentages of eigen values
r<-my.pca$rotation
plot(cumsum(my.pca$sdev^2)/sum(my.pca$sdev^2))
# rotated data
biplot(my.pca,choices=c(2,1))
8.1.2. Scripts of MDS Projection
R Commands
library(MASS)
my.dist<-dist(diabetic_data)
randomdata<-cbind(runif(1000,min=-0.5,max=0.5),runif(1000,min=-0.5,max=0.5))
# classical mds
plot(cmdscale(my.dist))
# sammon mapping with PCA
plot(sammon(my.dist,y=my.pca$x[,c(1,2)],magic=0.05)$points)
# sammon mapping with random configuration
plot(sammon(my.dist,y=randomdata,magic=0.05)$points)
# non-metric mapping with PCA
plot(isoMDS(my.dist,y=my.pca$x[,c(1,2)])$points)
# non-metric mapping with random configuration
plot(isoMDS(my.dist,y=randomdata)$points)
8.1.3 Scripts of Clustering
R Commands
library(cluster)
ds<-dist(scale(diabetic_data))
# hierarchical clustering with ward method
hward<-hclust(ds,method="ward")
plot(hward)
# hierarchical clustering with average method
havg<-hclust(ds,method="average")
plot(havg)
# hierarchical clustering with complete method
hcomp<-hclust(ds,method="complete")
plot(hcomp)
#dendrograms for 5,25,100 clusters
rect.hclust(hward, k=100, border="blue")
rect.hclust(hward, k=25, border="green")
rect.hclust(hward, k=5, border="red")
#k-means with k=5,10,25,100,200
k1<-kmeans(scale(diabetic_data),5)
k2<-kmeans(scale(diabetic_data),10)
k3<-kmeans(scale(diabetic_data),25)
k4<-kmeans(scale(diabetic_data),100)
k5<-kmeans(scale(diabetic_data),200)
#k vs. error plot
plot(c(length(k1$size),length(k2$size),length(k3$size),length(k4$size),length(k5$size)),c(k1$tot.withinss,k2$tot.withinss,k3$tot.withinss,k4$tot.withinss,k5$tot.withinss),type="l")
# 5 random runs with k=25
kk1<-kmeans(scale(diabetic_data),25)
kk2<-kmeans(scale(diabetic_data),25)
kk3<-kmeans(scale(diabetic_data),25)
kk4<-kmeans(scale(diabetic_data),25)
kk5<-kmeans(scale(diabetic_data),25)
#run vs. error plot
plot(1:5,c(kk1$tot.withinss,kk2$tot.withinss,kk3$tot.withinss,kk4$tot.withinss,kk5$tot.withinss),type="l")
#clusters in 2D
clusplot(scale(diabetic_data),kk5$cluster,lines=0)
8.1.4 Scripts of Cluster Validation
Java code for labeling
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        String satir = "";
        try {
            File file = new File("final.csv");
            if (!file.exists()) {
                file.createNewFile();
            }
            FileWriter fileWriter = new FileWriter(file, false);
            BufferedWriter bWriter = new BufferedWriter(fileWriter);
            File inputfile = new File("data.csv");
            BufferedReader reader = new BufferedReader(new FileReader(inputfile));
            // column names
            satir = reader.readLine();
            bWriter.write(satir + ";label");
            bWriter.newLine();
            satir = reader.readLine();
            while (satir != null) {
                String[] columns = satir.split(";");
                if (columns[14].equals("0")) {
                    bWriter.write(satir + ";1");
                } else if (columns[14].equals("1") || columns[14].equals("2")) {
                    bWriter.write(satir + ";2");
                } else if (columns[14].equals("3") && columns[15].equals("0")) {
                    bWriter.write(satir + ";3");
                } else if (columns[14].equals("3") && columns[15].equals("1")) {
                    bWriter.write(satir + ";4");
                }
                bWriter.newLine();
                satir = reader.readLine();
            }
            bWriter.close();
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
R Commands
#the Java labeling step writes a semicolon-separated file, so sep=";" is used
labeled_data <- read.table("C:/diabetic_dataset/final.csv", header = TRUE, sep = ";")
library(clv)
#ground truth labels
gt<-c(labeled_data$label)
#pick mode of labels found within a ground truth label
findmapping<-function(cluster,ground){sapply(as.numeric(names(table(ground))),function(x)as.numeric(names(sort(table(cluster[ground==x]),decreasing=TRUE))[1]))}
#for each ground truth label, compare it with the found label
findmatches<-function(cluster,ground){findmapping(cluster,ground)[ground]==cluster}
#precision for h.c. ward
mean(findmatches(cutree(hward,4),gt))
#precision for h.c. average
mean(findmatches(cutree(havg,4),gt))
#precision for h.c. complete
mean(findmatches(cutree(hcomp,4),gt))
#precision for kmeans
mean(findmatches(kk5$cluster,gt))
#Dunn index for h.c. ward
clv.Dunn(cls.scatt.data(scale(diabetic_data),cutree(hward,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Dunn index for h.c. average
clv.Dunn(cls.scatt.data(scale(diabetic_data),cutree(havg,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Dunn index for h.c. complete
clv.Dunn(cls.scatt.data(scale(diabetic_data),cutree(hcomp,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Dunn index for kmeans
clv.Dunn(cls.scatt.data(scale(diabetic_data),kk5$cluster),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for h.c. ward
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),cutree(hward,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for h.c. average
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),cutree(havg,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for h.c. complete
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),cutree(hcomp,4)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#Davies-Bouldin index for kmeans
clv.Davies.Bouldin(cls.scatt.data(scale(diabetic_data),kk5$cluster),c("complete","average","centroid"),c("single","complete","average","centroid"))
8.1.5. Scripts of Spectral Clustering
R Commands
#load data
diabetic_data <- read.table("C:/dataset/diabetic_data.csv", header = TRUE, sep = ",")
#create new datasets with two features
myvars <- c("num_lab_procedures","time_in_hospital")
newdata <- diabetic_data[myvars]
scaled_data<-scale(as.data.frame(newdata))
myvars2 <- c("num_lab_procedures","num_medications")
newdata2 <- diabetic_data[myvars2]
scaled_data2<-scale(as.data.frame(newdata2))
library(ggplot2)
library(kernlab)
#runs with different cluster numbers
sc1<-specc(scaled_data,centers=4)
plot(scaled_data, col = sc1)
sc2<-specc(scaled_data,centers=25)
plot(scaled_data, col = sc2)
sc3<-specc(scaled_data,centers=40)
plot(scaled_data, col = sc3)
sc4<-specc(scaled_data2,centers=4)
plot(scaled_data2, col = sc4)
sc5<-specc(scaled_data2,centers=25)
plot(scaled_data2, col = sc5)
sc6<-specc(scaled_data2,centers=35)
plot(scaled_data2, col = sc6)
#find sigma
kernelf(sc4)
#random runs with estimated sigma
sce1<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce2<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce3<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce4<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce5<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
sce6<-specc(scaled_data2,centers=4,kernel="rbfdot",kpar=list(sigma=4.40010321258815))
clv.Dunn(cls.scatt.data(scale(scaled_data2),as.vector(sce1)),c("complete","average","centroid"),c("single","complete","average","centroid"))
clv.Davies.Bouldin(cls.scatt.data(scale(scaled_data2),as.vector(sce1)),c("complete","average","centroid"),c("single","complete","average","centroid"))
#cluster sizes
plot(1:40,sort(size(sc3),decreasing=T))
plot(1:4,sort(size(sce1),decreasing=T))