United States Death Cause Analysis

Roshik Ganesan, CIS 5810
Death Cause Analysis in United States
Roshik Ganesan
305641224
California State University Los Angeles

A. Data Sets:
URL:
https://catalog.data.gov/dataset/nchs-potentially-excess-deaths-from-the-five-leading-causes-of-death
Twitter Hashtags:
#Cancer, #lowerrespiratorydiesease, #HeartDisease, #Stroke, #UnintentionalInjury.
B. Data Description:
This dataset covers potentially excess deaths in the United States of America and is rich in
detail. It contains data from 2005 to 2015, roughly a decade, and has more than 2 million records,
which gives us a good opportunity to drill down and gain better insights. The data records the
cause of death, but only for the top 5 causes of death in the United States: Cancer, Chronic
lower respiratory disease, Heart disease, Stroke and Unintentional injury. The state in which the
deaths occurred is provided, which lets us analyze the results by geographic location within the
country. State abbreviations are also provided for easier understanding and analysis. The dataset
includes the HHS region code, which identifies the Health and Human Services region to which each
state is assigned. The age group of the deceased is provided, which helps in analyzing the major
cause of death in each age group by state. The dataset drills deeper still and indicates whether
the deaths occurred in a metropolitan or a non-metropolitan locality, which helps us compare the
facilities in each state based on these counts. It then gives the number of observed deaths in
each state for metropolitan localities, for non-metropolitan localities, and for both combined.
The population column gives the population of the locality in that year. The dataset also gives
the expected number of deaths for the year, which helps in identifying the excess deaths. The
Potentially Excess Death column is the observed deaths for the year minus the expected deaths for
the year; it tells us how many unexpected deaths occurred in a particular year for a particular
region. Finally, the data provides a percentage column for the potentially excess deaths.
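The excess-death column described above can be illustrated with a minimal sketch. The record layout, field names and sample values are assumptions made for illustration; they are not rows or column names taken from the dataset.

```python
from dataclasses import dataclass


@dataclass
class DeathRecord:
    # Field names are hypothetical stand-ins for the dataset's columns.
    state: str
    year: int
    observed_deaths: int
    expected_deaths: int

    @property
    def potential_excess_deaths(self) -> int:
        # Excess deaths as described in the text: observed minus expected.
        return self.observed_deaths - self.expected_deaths


# Illustrative values only, not real data.
sample = DeathRecord(state="Example State", year=2005,
                     observed_deaths=1200, expected_deaths=950)
print(sample.potential_excess_deaths)  # 250
```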
C. Data Refinement:
Category 1: Missing Value Removal
Pre-Refinement:
A few states have no Non-Metropolitan data. Because these records cannot be used for analysis,
they are removed.

Post-Refinement:
Category 2: Improper column name
Pre-Refinement:
The column is improperly named “Potentially Excess Death”, so it is renamed to “Excess
Death Observed”.
Post-Refinement:

Category 3: Setting conditions/Filters
A) For regular dataset
Pre-Refinement:
The “locality” field contains Metropolitan, Non-Metropolitan and “ALL”. “ALL” is the
consolidation of Metropolitan and Non-Metropolitan, so it is excluded.
Post-Refinement:

B) For Twitter Dataset:
As the analysis covers only the United States of America, the Twitter hashtag data is filtered to
tweets from the United States only.
Category 4: Correcting Misfed Values
Pre-Refinement:
The name of the state Arizona is misspelt in the data and is corrected.
Post-Refinement:
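Taken together, refinement categories 1 to 4 for the regular dataset could be sketched roughly as follows. This is not the tool's own procedure: the rows, the exact column spellings and the misspelling "Arizon" are all hypothetical, and category 1 is approximated by dropping any row with a missing value.

```python
# Hypothetical sample rows; the column names and values are assumptions.
rows = [
    {"State": "Arizon", "Locality": "Metropolitan", "Potentially Excess Death": 120},
    {"State": "Arizon", "Locality": "All", "Potentially Excess Death": 150},
    {"State": "Texas", "Locality": "Nonmetropolitan", "Potentially Excess Death": None},
    {"State": "Texas", "Locality": "Metropolitan", "Potentially Excess Death": 300},
]

# The actual misspelling is not given in the report; "Arizon" is a placeholder.
SPELLING_FIXES = {"Arizon": "Arizona"}


def refine(raw_rows):
    cleaned = []
    for row in raw_rows:
        # Category 1: drop rows that have a missing value.
        if any(value is None for value in row.values()):
            continue
        # Category 3: exclude the consolidated "ALL" locality.
        if row["Locality"].upper() == "ALL":
            continue
        new_row = dict(row)
        # Category 2: rename the excess-death column.
        new_row["Excess Death Observed"] = new_row.pop("Potentially Excess Death")
        # Category 4: correct the misspelt state name.
        new_row["State"] = SPELLING_FIXES.get(new_row["State"], new_row["State"])
        cleaned.append(new_row)
    return cleaned


refined = refine(rows)
for row in refined:
    print(row)
```

With the sample rows above, two rows survive: one for Arizona and one for Texas, both Metropolitan.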

Calculated Field:
Formula Used: (Observed Death/Population) * 100
This calculated field gives the percentage of the population for which death was actually
observed in the state, for both metropolitan and non-metropolitan localities. It tells us what
percentage of the total population died.
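As a minimal sketch, the formula above translates directly into a small function. The function name and the sample figures are assumptions for illustration only.

```python
def observed_death_percentage(observed_deaths: int, population: int) -> float:
    """(Observed Death / Population) * 100, as in the formula above."""
    if population <= 0:
        # Guard against division by zero or nonsensical input.
        raise ValueError("population must be positive")
    return observed_deaths / population * 100


# Illustrative values only, not taken from the dataset.
print(f"{observed_death_percentage(500, 100_000):.2f}%")  # 0.50%
```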
Data Grouping:

This data grouping groups the US states by the region in which they lie. Four major groups are
created, East Coast, West Coast, Central East and Central West, and each state is assigned to the
corresponding group.
D. Data Quality:
Before the empty values were eliminated, the dataset had a quality score of 77. After eliminating
the empty values and carrying out the further refinement, the data quality reached 80.

These screenshots show the change in the column quality of the data before and after refinement.
E. Data Exploration:
Question 1:
How does the number of Age Range Compare by Cause of Death?

This visualization shows which cause accounts for the most deaths over the entire 10-year span.
The top 3 causes are unintentional injury, followed by cancer and then heart disease.
Unintentional injury tops the list with a total of 26400 deaths over the 10 years. Cancer caused
26370 deaths in the same span, followed by heart disease with 26346. Stroke is 4th in the top 5
with 24414 deaths, followed by chronic lower respiratory disease with 24147 deaths.
Question 2:
What is the trend of Expected death and Observed death over Years?
This visualization shows the trend of deaths over the years. We infer that the difference between
expected and observed deaths was considerably larger in the earlier years and has narrowed since.
In 2005 the average observed deaths were 2277 and the average expected deaths 1514, a difference
of 763; by 2015 the difference had fallen to 644. The alarming sign is that, although the gap has
narrowed, suggesting that the expected figures now track the observed ones more closely, the
number of deaths fell initially and has since risen steadily. The average observed deaths
increased from a record low of 2233 to 2390 in a span of 5 years.
Question 3:
How do the value of Observed Death compare by Year and Age Range?
Drilling further into the previous visualization, we carry out this analysis to identify which
age ranges account for the increase in the average number of deaths over the years. This
visualization shows the average deaths for each age range. The average for the age range “0-84”
decreased from 2005 to 2010 and has risen since: it increased from 4479 to 4760 between 2010 and
2015, a rise of 281 in the yearly average. A similar increase is seen for the age range “0-79”,
from 3581 to 3901, an alarming rise of 320 in the yearly average. Note that for the lower age
ranges “0-49” and “0-59” the difference is very small, about 10 to 20. The learning from this
insight is that ages from 49 to 89 account for the increase in deaths over the years.
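The increases quoted above can be checked with a short sketch. It uses only the averages given in the text and assumes that the “0-79” figures refer to the same 2010 and 2015 endpoints as the “0-84” figures; the variable and function names are illustrative.

```python
# Average observed deaths quoted in the text.
# Assumption: the "0-79" values are for 2010 and 2015, like the "0-84" values.
averages = {
    "0-84": {2010: 4479, 2015: 4760},
    "0-79": {2010: 3581, 2015: 3901},
}


def change(series: dict, start: int, end: int) -> int:
    # Difference in the yearly average between two years.
    return series[end] - series[start]


changes = {age: change(values, 2010, 2015) for age, values in averages.items()}
print(changes)  # {'0-84': 281, '0-79': 320}
```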
Question 4:
What is the relationship between Population and Observed Deaths by State?
This analysis shows the relationship between the population of a state and its observed deaths.
We filter the analysis to the top 10 states in the country. Unsurprisingly, California, which has
the highest population, also has the highest average of observed deaths. It is followed by Texas,
which is also a fairly large state. The 3rd, 4th and 5th positions are taken by New York, Florida
and New Jersey respectively. It is interesting that, although these 3 states do not differ starkly
in population, the observed deaths for Florida are comparatively high next to those of New York
and New Jersey, New Jersey being the lowest among them. Florida, despite a lower population than
Texas, differs only trivially from Texas in observed deaths. The last 5 states in the list are
Illinois, Pennsylvania, Ohio, Michigan and North Carolina. Notably, Illinois has a higher
population than Pennsylvania and Ohio but fewer observed deaths than either. This may mean that
the health care facilities in Illinois offer better service than those in the other 2 states.
Question 5:
How do the values of Observed Deaths compare by States Based on Region and Locality?
In this visualization we analyze the observed deaths by region of the USA. Using the regional
grouping done earlier, the country is divided into 4 regions: East Coast, Central East, Central
West and West Coast. The interesting insight is that the West Coast, which includes California,
the state with the largest observed deaths and population in the previous analysis, has lower
observed deaths than the East Coast. The East Coast tops the list with the highest average
observed deaths in both metropolitan and non-metropolitan localities (5924 and 1729). The
difference between metropolitan and non-metropolitan observed deaths is also greater on the East
Coast than on the West Coast: 4195 and 2183 respectively, so the East Coast gap exceeds the West
Coast gap by 2012 deaths. This might be evidence that the health care facilities on the East
Coast are slightly inferior to those on the West Coast.
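The regional figures quoted above can be reproduced with a few lines of arithmetic. Only numbers given in the text are used. The West Coast metropolitan and non-metropolitan averages are not given, so only its gap appears; the variable names are illustrative.

```python
# East Coast average observed deaths quoted in the text.
east_metro, east_nonmetro = 5924, 1729

# Metropolitan vs. non-metropolitan gap on the East Coast.
east_gap = east_metro - east_nonmetro

# West Coast gap as quoted; its underlying averages are not stated in the text.
west_gap = 2183

# How much larger the East Coast gap is than the West Coast gap.
gap_difference = east_gap - west_gap

print(east_gap, west_gap, gap_difference)  # 4195 2183 2012
```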
Question 6:
What is the Breakdown of the number of Author name by Matching Hashtags?
This simple pie chart from social media (Twitter) shows what people have been talking about.
Recalling from the first visualization that unintentional injury caused the most deaths, it is
surprising to see that nobody has tweeted about unintentional injury. From this analysis we see
that the talk on social media has mostly been about Cancer, which has been on the rise since
2010, followed by Stroke and Heart disease. Note that there has not been any tweet about lower
respiratory disease either.
Question 7:
How does the number of Author Name compare by Author State and Matching Hashtags?
Analyzing further to find the states in which people have been most active in tweeting, it is no
surprise that California tops the list of 10, with its large population being the likely reason.
A closer look at this chart shows that it agrees with the analysis from question 4. There we saw
that population and observed deaths were highest in California, Texas, New York and Florida. This
analysis shows that most tweets come from the same states that recorded the highest observed
deaths. Interestingly, although New Jersey was among the Top 5 states with the most observed
deaths, it is not one of the states that tweets much about deaths.

F. Prediction:
Fig F.1
Fig F.2

Fig F.3
Fig F.4

Fig F.5
Fig F.6

The observed death predictor is built in order to predict the number of observed deaths in the
future. The prediction takes the following columns as input: Year, State, HHS Region, Population,
Expected Deaths, Excess Death Observed, Age Range, Locality and percentage of potential excess
death. The prediction has a strength of only 21.7%, as most of the values are not closely related
to each other. It indicates that observed death is strongly influenced by 6 top relations among
the columns Expected Death, Excess Death Observed and Population. From Fig F.4 we see that
Observed Death is linearly regressed on Expected Death and Excess Death Observed. The decision
tree in Fig F.2 indicates that observed death is most influenced by Expected Death and 7 other
columns; the highest prediction strength, 21.7%, is achieved using those columns. The 7 other
columns include HHS Region, Excess Death Observed, Percent Potential excess death, Population,
Excess death observed, Age range and Cause of Death. The decision table in Fig F.3 shows that
observed death is a continuous target, and hence the algorithm used by Watson for the prediction
is a CHAID regression tree (Fig F.6). The second strongest prediction uses the combination of the
columns Excess Death Observed and HHS Region, with a strength of 18.6%. This prediction also uses
a regression algorithm, Linear Regression (ANOVA). Another 2-field prediction, using HHS Region
and Expected Death together, predicts observed death with a strength of 13.9%. The details of
this prediction (Fig F.5) show that, observed death being a continuous value, the same Linear
Regression (ANOVA) algorithm is used. Thus we learn that, of the 12 fields used for prediction,
only 8 potentially influence the prediction of observed deaths in the states, and the remaining 4
have no impact on the target. As the columns do not seem to be highly correlated, the prediction
strength for observed deaths tends to be low.

G. Dashboard:
The dashboard shows 4 important visualizations, of which 2 are analyzed in depth in the
exploration part. The first visualization shows which cause of death has taken the greatest toll.
Unintentional injury has cost the most lives, followed by cancer and heart disease. The second
visualization, a pie chart, shows the top 10 states with the highest excess death observed. The
Top 5 of them include Texas, Florida, Ohio and California. California, though it has the highest
population and the highest observed deaths, is ranked 5th in this list, which suggests that the
expected deaths per year are predicted better for California. The 3rd visualization, a scatter
plot, was analyzed in depth in question 4 and shows the relationship between population and
observed deaths by state. The Top 5 states here are California, Texas, New York, Florida and New
Jersey. The final visualization shows the number of tweets separated by gender for each hashtag.
From this analysis we learn that there are more tweets about cancer from females than from males,
whereas for heart disease and stroke there are more tweets from males than from females. This may
be evidence supporting the claim that men are more prone to heart disease and strokes whereas
women are more prone to cancer. Thus, this analysis gives an in-depth review of the deaths that
have occurred in the United States of America. We have analyzed the major causes by age range, by
state, and by the trend of deaths over the years. This analysis would serve as good guidance for
health care facilities to tailor their services for each Age Range and Cause of Death.
