1. CIS-5210 HEALTHCARE DATA ANALYTICS
Diabetic Encounter Analysis
Monika Mishra
Sushant Burde
CIS 5210: Healthcare Data Analytics
Submitted to: Professor Shilpa Balan
Table of Contents
1 DATA SET
   1. Data Set URL
   2. About the dataset
   3. Dataset details
   4. Column details
2 DATA REFINEMENT
   1. Removing duplicates
   2. Removing unwanted column
   3. Removing unwanted spaces
   4. Converting Text to Columns
3 ANALYSIS & VISUALIZATIONS
   1. Bar Chart
   2. Box Plot
   3. Line Chart
   4. Pie Chart
   5. Mosaic Plot
   6. Bar-Line Chart
4 STATISTICAL SUMMARY
5 STATISTICAL TEST
   1. One-Way Frequency
   2. Correlation Analysis
   3. T-Test
6 REFERENCES
DATA SET
1. Data Set URL:
https://www.kaggle.com/brandao/diabetes
2. About the dataset:
The data set represents 10 years (1999-2008) of clinical care at 130 US hospitals and
integrated delivery networks. It includes over 50 features representing patient and
hospital outcomes. Information was extracted from the database for encounters that
satisfied the following criteria:
- It is an inpatient encounter (a hospital admission).
- It is a diabetic encounter, that is, one during which any kind of diabetes was
  entered into the system as a diagnosis.
- The length of stay was at least 1 day and at most 14 days.
- Laboratory tests were performed during the encounter.
- Medications were administered during the encounter.
The data contains attributes such as patient number, race, gender, age, admission type,
time in hospital, medical specialty of the admitting physician, number of lab tests performed,
HbA1c test result, diagnosis, number of medications, diabetic medications, and the number of
outpatient, inpatient, and emergency visits in the year before the hospitalization.
3. Dataset details:
                     Original     Modified for the analysis
File size            19.2 MB      6 MB
Number of columns    55           15
Number of rows       101,767      68,379
File format          CSV          CSV
4. Column details:
The original dataset had 55 columns. For our analysis, we have reduced it to 15 columns.
The details of the columns are given below:
Column Name                    Column Detail
Encounter ID                   Unique identifier of an encounter
Patient number                 Unique identifier of a patient
Race                           Patient’s race
Gender                         Patient’s gender
Age                            Patient’s age group
Time in hospital               Integer number of days between admission and discharge
Medical specialty              Department that treated the patient
Number of lab procedures       Number of lab tests performed during the encounter
Number of procedures           Number of procedures (other than lab tests) performed during the encounter
Number of medications          Number of distinct generic names administered during the encounter
Number of outpatient visits    Number of outpatient visits of the patient in the year preceding the encounter
Number of emergency visits     Number of emergency visits of the patient in the year preceding the encounter
Number of inpatient visits     Number of inpatient visits of the patient in the year preceding the encounter
Number of diagnoses            Number of diagnoses entered into the system
Diabetes medications           Indicates whether any diabetic medication was prescribed
DATA REFINEMENT
Removing Duplicates
[Screenshots: Before, After, Process]
Explanation:
The dataset contained many duplicate rows. We used Excel's “Remove Duplicates” feature to
remove them. The feature can be found through the path
Data > Table Tools > Remove Duplicates.
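The same step can be sketched outside Excel. The following is a minimal illustration in Python, not the procedure used for this report: it assumes that a duplicate means a row that matches an earlier row in every column, and the sample rows are hypothetical.

```python
def remove_duplicates(rows):
    """Return rows with exact duplicates dropped, keeping the first occurrence."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so compare rows as tuples
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical sample rows (encounter ID, patient number, race); not real data.
sample_rows = [
    ["1001", "501", "Caucasian"],
    ["1002", "502", "Asian"],
    ["1001", "501", "Caucasian"],  # exact duplicate of the first row
]
deduplicated = remove_duplicates(sample_rows)
print(len(sample_rows), "rows before,", len(deduplicated), "rows after")
```

Keeping the first occurrence preserves the original row order. This is assumed here to match what a spreadsheet user would expect.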
Removing Unwanted Columns
[Screenshots: Before, After, Process]
Explanation:
Many columns contained just one value and were not required for the visualizations, so
we deleted them. One of the deleted columns is “max_glu_serum”. To delete a column, we
selected it, right-clicked on it, and then clicked “Delete”.
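A column that holds a single value in every row can also be detected programmatically. The sketch below is only an illustration of that idea in Python. The function name and the sample data are hypothetical, and the report itself removed the columns by hand in Excel.

```python
def drop_constant_columns(header, rows):
    """Drop every column whose value is identical in all rows."""
    keep = [
        index
        for index in range(len(header))
        if len({row[index] for row in rows}) > 1  # more than one distinct value
    ]
    new_header = [header[index] for index in keep]
    new_rows = [[row[index] for index in keep] for row in rows]
    return new_header, new_rows


# Hypothetical sample; the constant value in max_glu_serum is made up.
header = ["race", "max_glu_serum", "gender"]
rows = [
    ["Caucasian", "None", "Female"],
    ["Asian", "None", "Male"],
    ["Hispanic", "None", "Female"],
]
new_header, new_rows = drop_constant_columns(header, rows)
print(new_header)
```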
Removing Unwanted Spaces
[Screenshots: Before, After, Process]
Explanation:
The medical specialty column contained unwanted white space around and between the words.
We created a new column and used the formula builder to apply the TRIM function to the
medical specialty column. This removed the leading and trailing spaces and reduced repeated
spaces between words to a single space.
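For readers who want to reproduce the effect of TRIM outside Excel, here is a minimal Python sketch. The helper name `excel_trim` is hypothetical, and the sketch assumes that only ordinary space characters need handling.

```python
def excel_trim(text):
    """Mimic Excel's TRIM: strip leading/trailing spaces and collapse
    runs of spaces between words into a single space."""
    words = [part for part in text.split(" ") if part]  # drop empty pieces
    return " ".join(words)


# Hypothetical example values for the medical specialty column.
print(excel_trim("  Internal   Medicine "))
print(excel_trim("Surgery-General"))
```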
Converting text to columns
[Screenshots: Before, After, Process]
Explanation:
Two columns, race and gender, were merged into one column. We used the “Convert Text to
Columns” wizard to separate the two details into two columns, using a comma as the delimiter.
The wizard can be found through the path Data > Text to Columns.
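The split performed by the wizard can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and it assumes each merged cell holds exactly one race and one gender separated by a comma.

```python
def split_race_gender(value, delimiter=","):
    """Split a merged 'race,gender' cell into its two parts."""
    race, gender = (part.strip() for part in value.split(delimiter, 1))
    return race, gender


# Hypothetical merged cell values.
merged_cells = ["Caucasian,Female", "AfricanAmerican, Male"]
separated = [split_race_gender(cell) for cell in merged_cells]
print(separated)
```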
ANALYSIS & VISUALIZATIONS
1. Which race had the most diabetic encounters?
Chart used:
Bar Chart
Analysis:
The bar chart shows the number of diabetic encounters for each race. The Caucasian race
had the largest number of diabetic encounters, 51,042. It is followed by the African
American race with a frequency of 12,604. The Hispanic race has a frequency of 1,372. The
Asian race has the lowest number of diabetic encounters.
2. What are the statistics of the number of diagnoses?
Chart used:
Box Plot
Analysis:
A box plot is a graphical summary of data based on the minimum, first quartile,
median, third quartile, and maximum. Here it shows the statistics for the number of
diagnoses. The mean is about 7.6 and the median is 9. The first quartile is 6 and the
third quartile is 9. The minimum value is 1 and the maximum value is 16. These numbers are
based on all 68,379 observations of the variable number of diagnoses.
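The five values behind a box plot can be computed with Python's standard `statistics` module, as in the sketch below. The sample values are hypothetical, and the "inclusive" quartile method is an assumption. Statistical packages use different quartile conventions, so results may differ slightly from those of the charting tool.

```python
import statistics


def five_number_summary(values):
    """Return the minimum, quartiles, and maximum used to draw a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical values, not taken from the dataset.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```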
3. Which age group has the highest number of inpatient encounters?
Chart used:
Line Chart
Analysis:
The line chart shows the frequency of inpatient encounters by age group. The age group
70-80 had the highest number of inpatient encounters, followed by the age group 60-70.
The age group 0-10 had the fewest inpatient encounters. In general, the number of
encounters increases with age up to the 70-80 group. After that, a decline is observed.
4. Which medical specialty was involved in the highest number of patient encounters?
Chart used:
Pie Chart
Analysis:
Pie charts show the relative contribution of the parts to the whole. The size of a slice
represents the contribution of that category to the total. The Internal Medicine
department had the highest number of diabetic patient encounters. The Surgery-General
department had the fewest.
5. Are there more females than males who take diabetic medicines?
Chart used:
Mosaic Plot
Analysis:
Mosaic plots display tiles that correspond to the cells of a crosstabulation table. The
areas of the tiles are proportional to the frequencies of the table cells. Most males and
females admitted to the hospitals take diabetic medicines. The number of females who take
diabetic medicines is lower than the number of males.
6. Which race accounts for the maximum and minimum number of inpatients and
outpatients?
Chart used:
Bar-Line Chart
Analysis:
The chart displays the number of outpatients and the number of inpatients grouped by
race. The Caucasian race has the highest number of both outpatients and inpatients.
The Asian race has the lowest number of both outpatients and inpatients.
STATISTICAL SUMMARY
Analysis:
Statistics                     Value    Meaning
Mean                           4.28     The average time spent in hospital: the sum of all times spent in hospital divided by the total number of observations (68,379)
Std Dev (Standard Deviation)   2.92     Indicates how far the time spent in hospital typically deviates from the mean. In this case, it is large relative to the mean (4.28)
Minimum                        1        The lowest value of the time spent in hospital
Maximum                        14       The highest value of the time spent in hospital
Median                         4        The middle value when the observations are ordered by rank
N                              68,379   The total number of observations, that is, the total number of rows in the table
We have taken the analysis variable as the time spent in hospital. The above table shows the
statistical summary with explanation.
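The statistics in the table can be reproduced with Python's standard `statistics` module. The sketch below uses hypothetical values rather than the dataset. It also assumes the sample standard deviation (divisor n - 1), which may or may not match the convention of the tool used for this report.

```python
import statistics


def summarize(values):
    """Compute the summary statistics listed in the table above."""
    return {
        "Mean": statistics.mean(values),
        "Std Dev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "Minimum": min(values),
        "Maximum": max(values),
        "Median": statistics.median(values),
        "N": len(values),
    }


# Hypothetical lengths of stay in days; not taken from the dataset.
stay_summary = summarize([1, 2, 3, 4, 5, 6, 7])
for name, value in stay_summary.items():
    print(name, value)
```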
STATISTICAL TEST
1. One-Way Frequency
Analysis:
For the one-way frequency test, we have taken gender as the analysis variable and number of
inpatients as the frequency count. We want to know which gender had more inpatient
encounters.
From the table and the “Distribution of gender” graph, it can be seen that the number of
inpatients is higher for the female gender than for the male gender. The female gender has a
frequency count of 24,985, which is 55.13%, while the male gender has 20,339, which is 44.87%.
The frequency of an element in a set is the number of times that element occurs in the set.
Cumulative frequency is a running total of frequencies: the sum of all frequencies up to
and including the current point.
The cumulative frequency is important when analyzing data because its value indicates the
number of elements in the data set that lie at or below the current value.
The cumulative frequency adds up to the total number of observations, which in the above
case is 45,324. The cumulative percentage is always 100% for the last group, which in our
analysis is the Male gender. The “Cumulative Distribution of gender” graph displays the
cumulative frequency distribution.
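The percentages and cumulative figures follow directly from the two frequency counts reported above. The short Python sketch below shows the arithmetic. The function name is hypothetical, and the only inputs are the counts 24,985 (Female) and 20,339 (Male) from the analysis.

```python
def frequency_table(counts):
    """Build frequency, percent, and cumulative columns from category counts."""
    total = sum(counts.values())
    table = []
    running_total = 0
    for category, frequency in counts.items():
        running_total += frequency  # running total of frequencies so far
        table.append({
            "category": category,
            "frequency": frequency,
            "percent": 100 * frequency / total,
            "cumulative_frequency": running_total,
            "cumulative_percent": 100 * running_total / total,
        })
    return table


# Counts reported in the one-way frequency analysis above.
gender_table = frequency_table({"Female": 24985, "Male": 20339})
for row in gender_table:
    print(
        f"{row['category']:<8}{row['frequency']:>8}{row['percent']:>8.2f}"
        f"{row['cumulative_frequency']:>8}{row['cumulative_percent']:>8.2f}"
    )
```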
2. Correlation Analysis
Analysis:
Correlation analysis provides statistics for investigating associations among
variables. In the above case, the correlation analysis is performed for the
variables time_in_hospital and number_diagnoses, and the correlation coefficient is
0.21469. This means the two variables are weakly positively correlated. A value close
to 1 or -1 signifies a strong correlation, while a value close to 0 signifies a weak one.
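To make the coefficient concrete, the sketch below computes a correlation coefficient in plain Python. It assumes the reported value is a Pearson coefficient, which the report does not state explicitly, and the input lists are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical values: a perfectly increasing and a perfectly decreasing pair.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [6, 4, 2]))
```

The first pair lies on a rising straight line and the second on a falling one, so they illustrate the two extremes of the coefficient.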
3. T-Test
Analysis:
A t-test is a type of inferential statistic used to determine whether there is a
significant difference between the means of two groups, or between a sample mean and
a hypothesized value. It is a hypothesis testing tool that allows an assumption about
a population to be tested.
For our analysis, we used a one-sample t-test with time_in_hospital as the
analysis variable. A one-sample t-test compares the mean of the sample to the mean
stated in the null hypothesis.
The output also reports tests for normality. The Kolmogorov-Smirnov, Cramer-von Mises,
and Anderson-Darling statistics test whether the variable follows a normal distribution,
not whether its mean differs from the null value. Since p < alpha for each of them
(p < 0.0100 for Kolmogorov-Smirnov and p < 0.0050 for Cramer-von Mises and
Anderson-Darling), the distribution of time_in_hospital differs significantly from a
normal distribution.
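The one-sample t statistic itself is straightforward to compute. The sketch below is a minimal Python illustration with hypothetical data and a hypothetical null mean. It returns only the statistic and the degrees of freedom, because a p-value requires the t distribution, which the standard library does not provide.

```python
import math
import statistics


def one_sample_t(sample, null_mean):
    """Return the t statistic and degrees of freedom for a one-sample t-test."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t_statistic = (sample_mean - null_mean) / standard_error
    return t_statistic, n - 1


# Hypothetical lengths of stay and a hypothetical null mean of 2 days.
t_statistic, degrees_of_freedom = one_sample_t([1, 2, 3, 4, 5], null_mean=2)
print(t_statistic, degrees_of_freedom)
```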