The Burden of Disease: Data analysis, interpretation and linear regression

The Burden of Disease
Group 30
Aman Desai, Jim Huang, Gloria Marín, Carmen Chen, Dimitris Charitos, Lorenzo Gherardi
Index
• Introduction to the case
• Methodology
• Data Visualization
• Initial Regressions
• Moving Forward

Index
• Introduction to the case
• Main variables
• Exploratory Analysis
• Initial Regressions
• Moving Forward
DALYs
Disability
Adjusted
Life
Years
• Metric used to measure the
Burden of Disease
• DALY includes the sum of
mortality and morbidity due
to a specific disease
• One DALY = loss of 1 year
in good health because of
• Premature death
• Disease
• Disability
- Mortality is commonly used to assess a
population's health
- Through ‘child mortality’
- Through ‘life expectancy’
- The problem with this method is that it does
not account for people who live with
suffering caused by a disease that
prevents a normal life.
- To improve a population's health, attention needs to
be given to the impact on the lives of people
living with disease. Years of contribution to
one’s community, industry, and nation
are lost.
ASSIGNMENT PURPOSE:
1. Understand causes
2. Identify factors that magnify its impact
Introduction to the Case

Introduction to the Case: background and preparation
Communicable diseases
Non-communicable
diseases (NCDs)
Injuries
Diarrhoea, lower
respiratory & other
common infectious
diseases
Cardiovascular diseases
(inc. stroke, heart disease
and heart failure)
Road injuries
Neonatal disorders Cancers
Other transport
injuries
Maternal disorders Respiratory disease Falls
Malaria & neglected
tropical diseases
Diabetes, blood and
endocrine diseases
Drowning
Nutritional deficiencies
Mental and substance use
disorders
Fire, heat and hot
substances
HIV/AIDS Liver diseases Poisonings
Tuberculosis Digestive diseases Self-harm
Other communicable
diseases
Musculoskeletal disorders Interpersonal violence
Neurological disorders
(including dementia)
Conflict & terrorism
Other NCDs Natural disasters

Methodology: Linear Model and Data Visualization
Step 1. Identify relevant variables
• Explanatory Variables (x): Select factors covering the following dimensions from hundreds of
candidate factors: Diet habits (e.g., fruit consumption), Healthcare level (e.g., healthcare expense),
Living habits (e.g., smoking %), and Other demographics (e.g., education, overweight %)
• Response Variable (y): Consider ‘Overall DALYs’, ‘Communicable Diseases DALYs’, ‘Non-Communicable
Diseases DALYs’, and ‘Injuries DALYs’ as response variables, chosen from 24
possible variables by comparing the models
Step 2. Check for non-linear relations
Step 3. Generate the linear regression and prediction model
• Dropped all statistically insignificant variables
• Checked the VIF to limit the risk of multicollinearity
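Step 3 can be sketched roughly as follows. This is a minimal base-R illustration on synthetic data, not the project's dataset: the variables `x1`, `x2`, `x3` and the helper `vif_manual()` are hypothetical, and VIF is computed by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others.

```r
# Synthetic data (assumption): x1 and x2 drive y, x3 is unrelated noise
set.seed(30)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)   # moderately correlated with x1
x3 <- rnorm(n)
y  <- 5 + 2 * x1 - 1.5 * x2 + rnorm(n)
d  <- data.frame(y, x1, x2, x3)

# Fit the full model and keep only predictors with Pr(>|t|) < 0.01
fit_full <- lm(y ~ x1 + x2 + x3, data = d)
p_values <- summary(fit_full)$coefficients[-1, "Pr(>|t|)"]
keep     <- names(p_values)[p_values < 0.01]
fit_final <- lm(reformulate(keep, response = "y"), data = d)

# Hypothetical helper: VIF_j = 1 / (1 - R^2 of predictor j on the other predictors)
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    if (length(others) == 0) return(1)  # a single predictor cannot be collinear
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(d, keep)
print(round(vifs, 2))
print(all(vifs < 10))  # the threshold used on the slides
```

In practice a packaged function such as `vif()` from the `car` package does the same job; the manual version is shown only to make the calculation visible.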
Data Visualization
• Bar Chart and Stacked Bar Chart: Compare causes of DALYs by continent
• Area Chart: Look into the DALY rate over 2000~2017 by continent
• Scatter Plot: Plot DALYs against the proportion of GDP spent on healthcare, by continent
Linear Model

Accumulated DALYs per Capita 1980 - 2017
Due to Communicable Diseases
Due to Non-Communicable
Diseases
Due to Injuries
• Overall, Africa has the highest accumulated
average DALY rate (9), followed by Asia (5),
and Oceania (4).
• Africa's high contrast is mainly due to
communicable diseases, with a rate triple
that of the next highest continent, Asia.
• DALY rates for non-communicable diseases
and injuries are relatively uniform around the
world.
• Africa's communicable DALY rate has
declined since 2008. Despite this, the
burden of disease on the continent remains
high and this leaves room to consider the
causes and potential solutions for this.
Summary
Communicable DALYs over 1980 - 2017
Results & Conclusion: Data Visualization (1)

• The variables affecting DALY have been
further broken down. The largest contributing
factors found were:
Ø Cardiovascular Diseases for Non-
Communicable Diseases, and
Ø Unintentional Injuries for Injuries.
• No significant difference in the causes of
disease was found across the continents.
• GDP has a negative correlation with DALY
for Communicable Diseases; however, the
proportion of GDP spent on healthcare has a positive
correlation with it. This could be
because poorer countries have a higher
likelihood of having to combat communicable
diseases and as a result spend more of their
GDP on healthcare.
Summary
Results & Conclusion: Data Visualization (2)
Causes of DALYs by Continent
due to Non-Communicable Disease
Causes of DALYs by Continent
due to Injuries
Healthcare Expense vs. DALY from
Non-Communicable Disease
Healthcare Expense vs. DALY from
Communicable Disease
Analysis of the Correlation among Variables

1. All explanatory variables’ Pr(>|t|) < 0.01
Results & Conclusion: Final Model
DALY = 66523 - 202.37 overweight% - 33.86 veg_consump - 1030.84 animal_protein_consump - 534.61 education - 8.67 pocket_per_cap
- 40.86 fruit_consump - 7140.44 Asia + 13792.58 Africa - 9335.46 NorthA - 5196.99 Europe - 9146.72 SouthA
Model of best fit
2. VIF (Variance Inflation Factor) is < 10
• After ruling out all insignificant variables,
the best model had 7 variables.
• The risk of multicollinearity was checked
by ensuring that VIF < 10.
• The R-squared obtained (75.35%)
indicates that the model explains about
three quarters of the variance of DALY.
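To make the fitted equation concrete, the sketch below evaluates it for one set of inputs. The coefficients are copied from the slide; everything else is an assumption: the function name `predict_daly()` is hypothetical, Oceania is treated as the baseline continent only because it has no dummy term in the equation, the units of each input are not stated on the slide, and the example input values are made up.

```r
# Coefficients as reported on the slide
coefs <- c(
  intercept              = 66523,
  overweight_pct         = -202.37,
  veg_consump            = -33.86,
  animal_protein_consump = -1030.84,
  education              = -534.61,
  pocket_per_cap         = -8.67,
  fruit_consump          = -40.86
)

# Continent dummies; Oceania = 0 is an assumption (inferred baseline)
continent_offset <- c(
  Oceania = 0,
  Asia    = -7140.44,
  Africa  = 13792.58,
  NorthA  = -9335.46,
  Europe  = -5196.99,
  SouthA  = -9146.72
)

# Hypothetical helper that evaluates the linear equation
predict_daly <- function(overweight_pct, veg_consump, animal_protein_consump,
                         education, pocket_per_cap, fruit_consump, continent) {
  stopifnot(all(continent %in% names(continent_offset)))
  unname(
    coefs["intercept"] +
      coefs["overweight_pct"]         * overweight_pct +
      coefs["veg_consump"]            * veg_consump +
      coefs["animal_protein_consump"] * animal_protein_consump +
      coefs["education"]              * education +
      coefs["pocket_per_cap"]         * pocket_per_cap +
      coefs["fruit_consump"]          * fruit_consump +
      continent_offset[continent]
  )
}

# Made-up inputs, for illustration only
example <- predict_daly(
  overweight_pct = 40, veg_consump = 100, animal_protein_consump = 10,
  education = 8, pocket_per_cap = 200, fruit_consump = 80,
  continent = "Europe"
)
print(example)
```

Holding the other inputs fixed, the continent terms alone imply a gap of 13792.58 + 7140.44 = 20933.02 between Africa and Asia.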
Summary

Results & Conclusion: Prediction
Step 1. Goal: estimate the worldwide DALY rates for
2013 using only our data for the years up to 2012.
Step 2. The data was filtered to all periods before 2013 and a linear model was fitted.
Step 3. Using this linear model, DALY rates for 2013 were predicted.
Step 4. The predictions were compared with the actual 2013 data to determine the accuracy.
Prediction accuracy was 85.7%
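The four steps amount to a hold-out test by year, which can be sketched as below. The panel is synthetic and the predictors (`education`, `overweight`) are stand-ins. The slides do not say how the accuracy figure was calculated, so the sketch assumes accuracy = 1 - MAPE (mean absolute percentage error); treat that definition as a guess rather than the group's method.

```r
# Synthetic country-year panel (assumption): 40 locations, 2005-2013
set.seed(2013)
panel <- expand.grid(location = paste0("loc", 1:40), period = 2005:2013)
panel$education  <- runif(nrow(panel), 2, 13)
panel$overweight <- runif(nrow(panel), 10, 70)
panel$daly <- 60000 - 500 * panel$education - 200 * panel$overweight +
  rnorm(nrow(panel), sd = 3000)

# Step 2: fit on all periods before 2013
train <- subset(panel, period < 2013)
test  <- subset(panel, period == 2013)
fit   <- lm(daly ~ education + overweight, data = train)

# Step 3: predict the held-out year
pred <- predict(fit, newdata = test)

# Step 4: compare with the actual values (assumed metric: 1 - MAPE)
mape     <- mean(abs(test$daly - pred) / abs(test$daly))
accuracy <- 1 - mape
print(round(accuracy, 3))
```

Splitting by year rather than at random keeps future observations out of the training data, which matches the "predict 2013 from earlier years" framing.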
Prediction

Moving Forward: Adding New Variables
What other ‘external’ elements may be magnifying results?
COMMON TO ALL
• Percentage of population insured with health insurance.
• Number of medical doctors per 1,000 people.
• Number of nurses per 1,000 people.
• Out-of-pocket expenditure for healthcare.
SPECIFIC TO
a) Communicable, maternal, neonatal, and nutritional diseases
• Nutritional deficiencies.
• Hygiene practices.
• Housing space per person.
b) Non-communicable diseases
• Physical inactivity.
• Wellbeing.
• Genetics.
c) Injuries
• Surveillance.
• Regulations for safety.

Group 30 - Aman Desai, Jim Huang, Gloria Marín, Carmen Chen, Dimitris Charitos, Lorenzo Gherardi
Introduction
A glance of DALY
Linear Regressions
DALY= 66523 - 202.37 overweight% - 33.86 veg_consum - 1030.84 animal_consum -534.61 education - 8.67 pocket/cap
- 40.86 fruit_consum -7140.44 Asia + 13792.58 Africa -9335.46 NorthA -5196.99 Europe -9146.72 SouthA
All explanatory variables’ Pr(>|t|) < 0.01 VIF (Variance Inflation Factor) is <10
Conclusions
Moving Forward
Methodology (Model)
Model of best fit
Disability
Adjusted
Life
Years
DALYs
• Metric used to measure the Burden of Disease
• It includes the sum of mortality and morbidity
• DALY = loss of 1 year in good health because of
Premature death, Disease, Disability
Aim of study
• Understand causes
• Identify factors that magnify the impact
Background & Preparation
Burden of Disease, 2017
Disease Burden due to Communicable disease vs GDP per capita
Category of Disease
• Communicable disease
• Non-Communicable
disease (e.g., Cancers)
• Injuries (e.g., Falls, Fire)
DALY Around the World Due to Communicable Disease
Due to Non-Communicable Disease
Step 1. Identify relevant variables
• Explanatory Variables (x): Select factors covering the following
dimensions from hundreds of candidate factors: Diet habits (e.g.,
fruit consumption), Healthcare level (e.g., healthcare expense),
Living habits (e.g., smoking %), and Other demographics (e.g.,
education, overweight %)
• Response Variable (y): Consider ‘Overall DALYs’, ‘Communicable
Diseases DALYs’, ‘Non-Communicable Diseases DALYs’, and
‘Injuries DALYs’ as response variables, chosen from 24 possible
variables by comparing the models
Step 2. Check for non-linear relations
Step 3. Generate the regression model
• Dropped all statistically insignificant variables
• Checked the VIF to limit the risk of multicollinearity
Statistical Technique
Healthcare Expense vs.
DALY from Non- and
Communicable Disease
Stacked Bar Chart - Causes of
DALYs by continent for Non-
Communicable Disease and
Injuries
What other ‘external’ elements may be magnifying results?
• Common to all:
• Percentage of population insured with health insurance
• Number of medical doctors/nurses per 1,000 people
• Specific to
a) Communicable diseases : Family size
b) Non-communicable diseases: Literacy rate
c) Injuries: Alcohol consumption
Linear Regression
• The best model had 7 variables (overweight%, veg_consum, animal_consum,
education, pocket/cap, fruit_consum, continent), with
all variables having Pr(>|t|) < 0.01 and VIF < 10
• The R-squared (75.35%) indicates the model explains about three quarters of the variance of DALY
Data Visualization
• Africa has the highest avg. DALY rate (c. 9), followed by Asia (c. 5), and Oceania.
• Africa's high contrast is mainly due to communicable diseases, with a rate
more than triple that of the second-highest continent.
• Africa's communicable DALY rate has declined since 2008 but remains higher than on other
continents, leaving room to further consider the causes and potential solutions
Communicable DALYs over 1980 - 2017
Analysis of Correlation among
Variables

22/02/2021 CM30_GroupProject_SG30
CM30_GroupProject_SG30
Team 30
2021-02-14
1 Burden of Disease
Mortality rates are a common way to assess a population’s health. Rates often used for such an assessment
include child mortality and life expectancy. However, a focus on mortality neglects the suffering of people who
still live with the disease. A disease affects, directly or indirectly, the ability to live a normal life. Potential
contributions to one’s community, work, or nation are often lost.
Our study, therefore, seeks to understand the magnitude of the burden of disease by disease type, as
well as to identify factors that amplify such effects.
The metric used to measure disease burden is the DALY, which stands for Disability Adjusted Life Years.
This metric is the sum of mortality and morbidity. One DALY represents one year of good health lost due to either
premature death, disease, or disability.
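As a worked illustration of how the two components add up, the sketch below uses the standard decomposition DALY = YLL + YLD (years of life lost to premature death plus years lived with disability). The report itself does not spell out this formula, and all input numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Mortality component: years of life lost (YLL)
deaths                    <- 100   # hypothetical deaths from the disease
remaining_life_expectancy <- 30    # hypothetical years lost per death
yll <- deaths * remaining_life_expectancy

# Morbidity component: years lived with disability (YLD)
cases              <- 2000  # hypothetical non-fatal cases
disability_weight  <- 0.2   # 0 = full health, 1 = equivalent to death
avg_duration_years <- 5     # hypothetical average duration of the condition
yld <- cases * disability_weight * avg_duration_years

# Total burden, and the rate per 100,000 people used in the datasets below
daly               <- yll + yld
population         <- 1e6
daly_rate_per_100k <- daly / population * 1e5

print(c(yll = yll, yld = yld, daly = daly, rate_per_100k = daly_rate_per_100k))
```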
1.1 Data import and inspection
1.1.0.1 Importing data for overall disease burden (DALY)
Rows: 48,698
Columns: 7
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ total_population_gapminder_hyde_un <dbl> …
$ continent <chr> …
$ health_expenditure_per_capita_current_us <dbl> …
$ dal_ys_disability_adjusted_life_years_all_causes_sex_both_age_age_standardized_rate <dbl> …
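# Packages the code below relies on (assumed to be loaded in a setup chunk that is not shown)
library(tidyverse)  # read_csv(), mutate(), select(), gather(), glimpse(), ggplot()
library(janitor)    # clean_names()
#source: https://ourworldindata.org/burden-of-disease
# Reading first file
daly_total <- read_csv(here::here('Data',"disease-burden-vs-health-expenditure-per-capita.csv")) %>%
clean_names()
# Checking for variable types
glimpse(daly_total)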
# Changing variable names and variable types
daly_total<- daly_total %>%
mutate(
location=as.factor(entity),
period=year,
health_expenditure_per_capita=health_expenditure_per_capita_current_us,
daly_adjusted=dal_ys_disability_adjusted_life_years_all_causes_sex_both_age_age_standardized_rate,
total_population = total_population_gapminder_hyde_un) %>%
select(location,period,daly_adjusted,health_expenditure_per_capita,total_population)

Although important as a whole, DALY rates can be further divided into 3 sub-categories of disease cause:
communicable diseases, non-communicable diseases, and injuries. We therefore included the datasets for each
individual sub-category below.
1.1.0.2 Adding data for burden of non-communicable diseases
Rows: 6,468
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ dal_ys_disability_adjusted_life_years_non_communicable_diseases_sex_both_age_age_standardized_rate <dbl> …
Rows: 6,468
Columns: 3
$ location <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
$ period <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,…
$ daly_ncds <dbl> 41145.51, 40587.17, 39644.60, 39821.31, 40641.76, 40790.73,…
1.1.0.3 Adding data for burden from communicable, neonatal, maternal and nutritional diseases
Rows: 6,468
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ dal_ys_disability_adjusted_life_years_communicable_maternal_neonatal_and_nutritional_diseases_sex_both_age_age_standardized_rate <dbl> …
#source:https://ourworldindata.org/burden-of-disease
#Reading the file
ncds <- read_csv(here::here('Data',"burden-of-disease-rates-from-ncds.csv")) %>%
clean_names()
glimpse(ncds)
ncds<- ncds %>%
mutate(location=as.factor(entity),
period=year,
daly_ncds=dal_ys_disability_adjusted_life_years_non_communicable_diseases_sex_both_age_age_standardized_rate) %>%
select(location,period,daly_ncds)
glimpse(ncds)
#Merging data frames
total <- merge(daly_total,ncds,by=c("location","period"))
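One thing worth keeping in mind for this and every later merge: base `merge()` performs an inner join by default, so any location-period pair missing from either table is silently dropped. The toy example below, with made-up values, shows the effect and one way to list the pairs that fail to match.

```r
# Toy data frames (made-up values), keyed like the report's tables
daly <- data.frame(
  location      = c("A", "A", "B", "C"),
  period        = c(2000, 2001, 2000, 2000),
  daly_adjusted = c(10, 11, 20, 30)
)
ncds <- data.frame(
  location  = c("A", "B", "D"),
  period    = c(2000, 2000, 2000),
  daly_ncds = c(5, 8, 9)
)

# Default merge(): inner join, keeps only keys present in both tables
inner <- merge(daly, ncds, by = c("location", "period"))

# all.x = TRUE: left join, keeps every row of the first table
left <- merge(daly, ncds, by = c("location", "period"), all.x = TRUE)

# Keys from the first table that found no partner in the second
unmatched <- left[is.na(left$daly_ncds), c("location", "period")]

print(nrow(inner))
print(unmatched)
```

Checking `unmatched` after each merge is a cheap way to see how many observations are lost to missing years or differently spelled location names.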
#Reading the file
cnmnd <- read_csv(here::here('Data',"burden-of-disease-rates-from-communicable-neonatal-maternal-nutritional-diseases.csv")) %>%
clean_names()
glimpse(cnmnd)

Rows: 6,468
Columns: 3
$ location <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghan…
$ period <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999…
$ daly_cnmnd <dbl> 51181.84, 47263.29, 38908.25, 36882.69, 38809.79, 38262.20…
1.1.0.4 Adding data for burden from injuries, violence, self-harm and accidents
Rows: 6,468
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ dal_ys_disability_adjusted_life_years_injuries_sex_both_age_age_standardized_rate <dbl> …
Rows: 6,468
Columns: 3
$ location <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
$ period <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,…
$ daly_ivsa <dbl> 11775.715, 13390.289, 12365.622, 11530.363, 13546.148, 1238…
Each of the 3 sub-categories of disease cause comprises specific diseases. We included all
of these categories in our dataset.
1.1.0.5 Adding data for disease burden by cause (DALY by cause)
cnmnd<- cnmnd %>%
mutate(location=as.factor(entity),
period=year,
daly_cnmnd=dal_ys_disability_adjusted_life_years_communicable_maternal_neonatal_and_nutritional_diseases_sex_both_age_age_standardized_rate) %>%
select(location,period,daly_cnmnd)
glimpse(cnmnd)
total <- merge(total,cnmnd,by=c("location","period"))
#Reading the file
ivsa <- read_csv(here::here('Data',"burden-of-disease-rates-from-injuries.csv")) %>%
clean_names()
glimpse(ivsa)
ivsa<- ivsa %>%
mutate(location=as.factor(entity),
period=year,
daly_ivsa=dal_ys_disability_adjusted_life_years_injuries_sex_both_age_age_standardized_rate) %>%
select(location,period,daly_ivsa)
glimpse(ivsa)
total <- merge(total,ivsa,by=c("location","period"))

Aside from the main variables, additional variables that may contribute to the final DALY rates were
included in the dataset.
1.1.0.6 Adding data for GDP per capita
#source: https://ourworldindata.org/burden-of-disease
# Reading second file
daly_by_cause <- read_csv(here::here('Data',"burden-of-disease-by-cause.csv")) %>%
clean_names()
#glimpse(daly_by_cause)
daly_by_cause <- daly_by_cause %>%
mutate(
location=as.factor(entity),
period=year,
daly_conflict_terrorism=dal_ys_disability_adjusted_life_years_conflict_and_terrorism_sex_both_age_all_ages_number,
daly_hiv_tuberculosis=dal_ys_disability_adjusted_life_years_hiv_aids_and_tuberculosis_sex_both_age_all_ages_number,
daly_diahrrea_respiratory=dal_ys_disability_adjusted_life_years_diarrhea_lower_respiratory_and_other_common_infectious_diseases_sex_both_age_all_ages_number,
daly_cvs=dal_ys_disability_adjusted_life_years_cardiovascular_diseases_sex_both_age_all_ages_number,
daly_self_harm=dal_ys_disability_adjusted_life_years_self_harm_sex_both_age_all_ages_number,
daly_violence=dal_ys_disability_adjusted_life_years_interpersonal_violence_sex_both_age_all_ages_number,
daly_nutritional_deficiencies=dal_ys_disability_adjusted_life_years_nutritional_deficiencies_sex_both_age_all_ages_number,
daly_transport_injuries=dal_ys_disability_adjusted_life_years_transport_injuries_sex_both_age_all_ages_number,
daly_unintentional_injuries=dal_ys_disability_adjusted_life_years_unintentional_injuries_sex_both_age_all_ages_number,
daly_maternal_disorders=dal_ys_disability_adjusted_life_years_maternal_disorders_sex_both_age_all_ages_number,
daly_neonatal_disorders=dal_ys_disability_adjusted_life_years_neonatal_disorders_sex_both_age_all_ages_number,
daly_other_communicable=dal_ys_disability_adjusted_life_years_other_communicable_maternal_neonatal_and_nutritional_diseases_sex_both_age_all_ages_number,
daly_nature_forces=dal_ys_disability_adjusted_life_years_exposure_to_forces_of_nature_sex_both_age_all_ages_number,
daly_chronic_respiratory=dal_ys_disability_adjusted_life_years_chronic_respiratory_diseases_sex_both_age_all_ages_number,
daly_chronic_liver=dal_ys_disability_adjusted_life_years_cirrhosis_and_other_chronic_liver_diseases_sex_both_age_all_ages_number,
daly_digestive=dal_ys_disability_adjusted_life_years_digestive_diseases_sex_both_age_all_ages_number,
daly_tropical_and_malaria=dal_ys_disability_adjusted_life_years_neglected_tropical_diseases_and_malaria_sex_both_age_all_ages_number,
daly_musculoskeletal=dal_ys_disability_adjusted_life_years_musculoskeletal_disorders_sex_both_age_all_ages_number,
daly_other_non_communicable=dal_ys_disability_adjusted_life_years_other_non_communicable_diseases_sex_both_age_all_ages_number,
daly_neurological=dal_ys_disability_adjusted_life_years_neurological_disorders_sex_both_age_all_ages_number,
daly_mental_and_substance=dal_ys_disability_adjusted_life_years_mental_and_substance_use_disorders_sex_both_age_all_ages_number,
daly_diabetes_urogenital_blood_endocrine=dal_ys_disability_adjusted_life_years_diabetes_urogenital_blood_and_endocrine_diseases_sex_both_age_all_ages_number,
daly_neoplasms=dal_ys_disability_adjusted_life_years_neoplasms_sex_both_age_all_ages_number)%>%
select(location, period,daly_conflict_terrorism,daly_hiv_tuberculosis,daly_diahrrea_respiratory,daly_cvs,
daly_self_harm,daly_violence,daly_nutritional_deficiencies,daly_transport_injuries,daly_unintentional_injuries,
daly_maternal_disorders,daly_neonatal_disorders,daly_other_communicable,daly_nature_forces,daly_chronic_respiratory,
daly_chronic_liver,daly_digestive,daly_tropical_and_malaria,daly_musculoskeletal,daly_other_non_communicable,
daly_neurological,daly_mental_and_substance,daly_diabetes_urogenital_blood_endocrine,daly_neoplasms)
#glimpse(daly_by_cause)
# Merging dataframes
total <- merge(total,daly_by_cause,by=c("location","period"))
#We will consider taking out health expenditure per capita since it has a complete rate of 57.4% and may distort the final data.

#source: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
# Reading third file
gdp <- read_csv(here::here('Data',"API_NY.GDP.PCAP.CD_DS2_en_csv_v2_1926744.csv"),skip=3) %>%
clean_names()
glimpse(gdp)

Rows: 264
Columns: 66
$ country_name <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra"…
$ country_code <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARB", "ARE", "ARG"…
$ indicator_name <chr> "GDP per capita (current US$)", "GDP per capita (curre…
$ indicator_code <chr> "NY.GDP.PCAP.CD", "NY.GDP.PCAP.CD", "NY.GDP.PCAP.CD", …
$ x1960 <dbl> NA, 59.77319, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1807…
$ x1961 <dbl> NA, 59.86087, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1874…
$ x1962 <dbl> NA, 58.45801, NA, NA, NA, NA, NA, 1155.89017, NA, NA, …
$ x1963 <dbl> NA, 78.70639, NA, NA, NA, NA, NA, 850.30474, NA, NA, N…
$ x1964 <dbl> NA, 82.09523, NA, NA, NA, NA, NA, 1173.23821, NA, NA, …
$ x1965 <dbl> NA, 101.10830, NA, NA, NA, NA, NA, 1279.11343, NA, NA,…
$ x1968 <dbl> NA, 129.10832, NA, NA, NA, 224.87811, NA, 1141.08048, …
$ x1969 <dbl> NA, 129.32971, NA, NA, NA, 240.03563, NA, 1329.05866, …
$ x1970 <dbl> NA, 156.5189, NA, NA, 3238.5568, 262.8663, NA, 1322.59…
$ x1971 <dbl> NA, 159.56758, NA, NA, 3498.17365, 295.97104, NA, 1372…
$ x1972 <dbl> NA, 135.31731, NA, NA, 4217.17358, 343.56582, NA, 1408…
$ x1973 <dbl> NA, 143.14465, NA, NA, 5342.16856, 423.13508, NA, 2097…
$ x1974 <dbl> NA, 173.65376, NA, NA, 6319.73903, 777.56068, NA, 2844…
$ x1975 <dbl> NA, 186.5109, NA, NA, 7169.1010, 836.2083, 26847.7944,…
$ x1976 <dbl> NA, 197.4455, NA, NA, 7152.3751, 1007.1404, 30118.1378…
$ x1977 <dbl> NA, 224.2248, NA, NA, 7751.3702, 1123.1433, 33823.3196…
$ x1978 <dbl> NA, 247.3541, NA, NA, 9129.7062, 1193.7456, 28456.7374…
$ x1979 <dbl> NA, 275.7382, NA, NA, 11820.8494, 1563.7035, 33512.741…
$ x1980 <dbl> NA, 272.6553, 710.9816, NA, 12377.4116, 2052.9558, 427…
$ x1981 <dbl> NA, 264.1113, 642.3839, NA, 10372.2328, 2050.7698, 449…
$ x1982 <dbl> NA, NA, 619.9614, NA, 9610.2663, 1864.8707, 40026.1663…
$ x1983 <dbl> NA, NA, 623.4406, NA, 8022.6548, 1699.2152, 34843.1029…
$ x1984 <dbl> NA, NA, 637.7152, 639.4847, 7728.9067, 1672.2788, 3230…
$ x1985 <dbl> NA, NA, 758.2376, 639.8659, 7774.3938, 1606.7558, 2972…
$ x1986 <dbl> 6472.5020, NA, 685.2701, 693.8735, 10361.8160, 1489.84…
$ x1987 <dbl> 7885.7965, NA, 756.2619, 674.7934, 12616.1676, 1543.51…
$ x1988 <dbl> 9764.7900, NA, 792.3031, 652.7743, 14304.3570, 1476.04…
$ x1989 <dbl> 11392.4558, NA, 890.5541, 697.9956, 15166.4379, 1505.5…
$ x1990 <dbl> 12307.3117, NA, 947.7042, 617.2304, 18878.5060, 2009.4…
$ x1991 <dbl> 13496.0031, NA, 865.6927, 336.5870, 19532.5402, 1929.6…
$ x1992 <dbl> 14046.5038, NA, 656.3618, 200.8522, 20547.7118, 2027.8…
$ x1993 <dbl> 14936.8272, NA, 441.2007, 367.2792, 16516.4710, 1996.9…
$ x1994 <dbl> 16241.0465, NA, 328.6733, 586.4163, 16234.8090, 1989.4…
$ x1995 <dbl> 16439.3564, NA, 397.1795, 750.6044, 18461.0649, 2072.7…
$ x1996 <dbl> 16586.0684, NA, 522.6438, 1009.9777, 19017.1746, 2235.…
$ x1997 <dbl> 17927.7496, NA, 514.2952, 717.3806, 18353.0597, 2319.0…
$ x1998 <dbl> 19078.3432, NA, 423.5937, 813.7903, 18894.5215, 2188.9…
$ x1999 <dbl> 19356.2034, NA, 387.7843, 1033.2417, 19261.7105, 2331.…
$ x2000 <dbl> 20620.7006, NA, 556.8363, 1126.6833, 21854.2468, 2605.…
$ x2001 <dbl> 20669.0320, NA, 527.3335, 1281.6594, 22971.5355, 2506.…
$ x2002 <dbl> 20436.8871, 179.4266, 872.4945, 1425.1248, 25066.8822,…
$ x2003 <dbl> 20833.7616, 190.6838, 982.9609, 1846.1188, 32271.9639,…
$ x2004 <dbl> 22569.9750, 211.3821, 1255.5640, 2373.5798, 37969.1750…
$ x2005 <dbl> 23300.0396, 242.0313, 1902.4223, 2673.7873, 40066.2569…
$ x2006 <dbl> 24045.2725, 263.7337, 2599.5665, 2972.7433, 42675.8128…
$ x2007 <dbl> 25835.1327, 359.6932, 3121.9956, 3595.0372, 47803.6936…
$ x2008 <dbl> 27084.7037, 364.6607, 4080.9414, 4370.5401, 48718.4969…
$ x2009 <dbl> 24630.4537, 438.0760, 3122.7808, 4114.1401, 43503.1855…
$ x2010 <dbl> 23512.6026, 543.3030, 3587.8838, 4094.3503, 40852.6668…
$ x2011 <dbl> 24985.9933, 591.1628, 4615.4680, 4437.1429, 43335.3289…
$ x2012 <dbl> 24713.6980, 641.8715, 5100.0958, 4247.6300, 38686.4613…
$ x2013 <dbl> 26189.4355, 637.1655, 5254.8823, 4413.0609, 39538.7667…
$ x2014 <dbl> 26647.9381, 613.8567, 5408.4105, 4578.6320, 41303.9294…
$ x2015 <dbl> 27980.8807, 578.4664, 4166.9797, 3952.8012, 35762.5231…
$ x2016 <dbl> 28281.3505, 509.2187, 3506.0729, 4124.0557, 37474.6654…
$ x2017 <dbl> 29007.6930, 519.8848, 4095.8129, 4531.0208, 38962.8804…
$ x2018 <dbl> NA, 493.7504, 3289.6467, 5284.3802, 41793.0553, 6601.8…
$ x2019 <dbl> NA, 507.1034, 2790.7266, 5353.2449, 40886.3912, 6584.7…
$ x2020 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ x66 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

1.1.0.7 Adding data for smoking percentages
1.1.0.8 Adding data for healthcare expenditure per capita
Rows: 4,675
Columns: 4
$ entity <chr> "Afghan…
$ code <chr> "AFG", …
$ year <dbl> 2002, 2…
$ health_expenditure_per_capita_ppp_constant_2011_international <dbl> 75.9835…
gdp <- gdp %>%
gather(year, gdp,-c(country_name, country_code,indicator_name,indicator_code)) %>%
mutate(location=as.factor(country_name),
period=readr::parse_number(year)) %>%
select(location,period,gdp)
# Merging dataframes
total <- merge(total,gdp,by=c("location","period"))
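The World Bank file stores one column per year (`x1960`, `x1961`, ...), so it has to be reshaped into one row per location and year before it can be merged. The code above does this with `gather()`, which tidyr now documents as superseded by `pivot_longer()`. As a dependency-free illustration of what the reshape does, the base-R sketch below works on a two-country toy table; the values are rounded from the output shown earlier and are for illustration only.

```r
# Toy wide table in the World Bank layout (values rounded, illustration only)
wide <- data.frame(
  country_name = c("Aruba", "Angola"),
  country_code = c("ABW", "AGO"),
  x2000 = c(20620.7, 556.8),
  x2001 = c(20669.0, 527.3)
)

# Identify the year columns and stack them into long format
year_cols <- grep("^x[0-9]{4}$", names(wide), value = TRUE)
long <- data.frame(
  location = rep(wide$country_name, times = length(year_cols)),
  period   = rep(as.numeric(sub("^x", "", year_cols)), each = nrow(wide)),
  gdp      = unlist(wide[year_cols], use.names = FALSE)
)

print(long)
```

Each original cell becomes one row, so a table with 2 countries and 2 year columns yields 4 rows keyed by `location` and `period`, the same keys used for merging throughout this section.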
#skim(total)
#source: http://ghdx.healthdata.org/record/ihme-data/gbd-2015-smoking-prevalence-1980-2015
#Reading fourth file
smoking_percentage <- read_csv(here::here('Data',"IHME_GBD_2015_SMOKING_PREVALENCE_1980_2015_Y2017M04D05.CSV")) %>%
clean_names()
#skim(smoking_percentage)
smoking_percentage <- smoking_percentage %>%
filter(age_group_name=="Age-standardized",
metric=="Percent",
sex=="Both") %>%
mutate(location=as.factor(location_name),
period=year_id,
smoking_percentage=mean) %>%
select(location,period,smoking_percentage)
#skim(smoking_percentage)
total <- merge(total,smoking_percentage,by=c("location","period"))
#source:https://ourworldindata.org/grapher/annual-healthcare-expenditure-per-capita?tab=chart&time=1995..2014&region=World
#Reading fifth file
healthcare_expenditure <- read_csv(here::here('Data',"annual-healthcare-expenditure-per-capita.CSV")) %>%
clean_names()
glimpse(healthcare_expenditure)

1.1.0.9 Adding data for percentage of population being overweight
Rows: 8,316
Columns: 4
$ entity <chr> "Afghanistan", "A…
$ code <chr> "AFG", "AFG", "AF…
$ year <dbl> 1975, 1976, 1977,…
$ prevalence_of_overweight_adults_both_sexes_who_2019 <dbl> 5.3, 5.5, 5.7, 5.…
1.1.0.10 Adding data for fruit consumption per capita
Rows: 11,028
Columns: 4
$ entity <chr> "Afg…
$ code <chr> "AFG…
$ year <dbl> 1961…
$ fruits_excluding_wine_food_supply_quantity_kg_capita_yr_fao_2020 <dbl> 41.1…
healthcare_expenditure <- healthcare_expenditure %>%
mutate(location=as.factor(entity),
period=year,
healthcare_expenditure=health_expenditure_per_capita_ppp_constant_2011_international) %>%
select(location,period,healthcare_expenditure)
#glimpse(healthcare_expenditure)
total <- merge(total,healthcare_expenditure,by=c("location","period"))
#source: https://ourworldindata.org/obesity
#Reading sixth file
percentage_overweight <- read_csv(here::here('Data',"share-of-adults-who-are-overweight.csv")) %>%
clean_names()
glimpse(percentage_overweight)
percentage_overweight <- percentage_overweight %>%
mutate(location=as.factor(entity),
period=year,
percentage_overweight=prevalence_of_overweight_adults_both_sexes_who_2019) %>%
select(location,period,percentage_overweight)
#glimpse(percentage_overweight)
total <- merge(total,percentage_overweight,by=c("location","period"))
#source: https://ourworldindata.org/diet-compositions
#Reading seventh file
fruit_consumption <- read_csv(here::here('Data',"fruit-consumption-per-capita.csv")) %>%
clean_names()
glimpse(fruit_consumption)

1.1.0.11 Adding data for vegetable consumption per capita
Rows: 11,028
Columns: 4
$ entity <chr> "Afghanistan", …
$ code <chr> "AFG", "AFG", "…
$ year <dbl> 1961, 1962, 196…
$ vegetables_food_supply_quantity_kg_capita_yr_fao_2020 <dbl> 36.75, 37.47, 3…
1.1.0.12 Adding data for animal based foods consumption per capita
fruit_consumption <- fruit_consumption %>%
mutate(location=as.factor(entity),
period=year,
fruit_consumption=fruits_excluding_wine_food_supply_quantity_kg_capita_yr_fao_2020) %>%
select(location,period,fruit_consumption)
#glimpse(fruit_consumption)
total <- merge(total,fruit_consumption,by=c("location","period"))
#Reading eighth file
vegetable_consumption <- read_csv(here::here('Data',"vegetable-consumption-per-capita.csv")) %>%
clean_names()
#Checking for variable types
glimpse(vegetable_consumption)
## Changing variable names and variable types
vegetable_consumption <- vegetable_consumption %>%
mutate(location=as.factor(entity),
period=year,
vegetable_consumption=vegetables_food_supply_quantity_kg_capita_yr_fao_2020) %>%
select(location,period,vegetable_consumption)
#glimpse(vegetable_consumption)
#Merging dataframes
total <- merge(total,vegetable_consumption,by=c("location","period"))
#skim(total)
#Reading ninth file
animal_protein_consumption <- read_csv(here::here('Data',"share-of-calories-from-animal-protein-vs-gdp-per-capita.csv")) %>%
clean_names()
glimpse(animal_protein_consumption)

Rows: 24,472
Columns: 7
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ total_population_gapminder <dbl> …
$ continent <chr> …
$ share_of_calories_from_animal_protein_fao_2017 <dbl> …
$ real_gdp_per_capita_in_2011us_2011_benchmark_maddison_project_database_2018 <dbl> …
1.1.0.13 Adding data for mean years of schooling
1.1.0.14 Adding data for physicians per 1000 people
#Changing variable names and type
animal_protein_consumption <- animal_protein_consumption %>%
mutate(location=as.factor(entity),
period=year,
animal_protein_consumption=share_of_calories_from_animal_protein_fao_2017) %>%
select(location,period,animal_protein_consumption)
#glimpse(animal_protein_consumption)
#Merging dataframes
total <- merge(total,animal_protein_consumption,by=c("location","period"))
#glimpse(total)
#source: https://ourworldindata.org/global-education
#Reading file
education_years <- read_csv(here::here('Data',"mean-years-of-schooling-1.csv")) %>%
clean_names()
#glimpse(education_years)
education_years <- education_years %>%
mutate(location=as.factor(entity),
period=year,
education_years=average_total_years_of_schooling_for_adult_population_lee_lee_2016_barro_lee_2018_and_undp_2018) %>%
select(location,period,education_years)
#glimpse(education_years)
#Merging dataframes
total <- merge(total,education_years,by=c("location","period"))

1.1.0.15 Adding data for nurses per 1000 people
Rows: 1,542
Columns: 4
$ entity <chr> "Afghanistan", "Afghanistan", "A…
$ code <chr> "AFG", "AFG", "AFG", "AFG", "AFG…
$ year <dbl> 2005, 2006, 2007, 2008, 2009, 20…
$ nurses_and_midwives_per_1_000_people <dbl> 0.612000, 0.462000, 0.519000, 0.…
The nurses dataset had too few observations. Thus, it was not included in our final dataset.
1.1.0.16 Adding data for out-of-pocket expenditure
Rows: 3,002
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ out_of_pocket_expenditure_per_capita_on_healthcare_ppp_usd_who_global_health_expenditure <dbl> …
#source:https://ourworldindata.org/grapher/physicians-per-1000-people
#Reading file
physicians <- read_csv(here::here('Data',"physicians-per-1000-people.csv")) %>%
clean_names()
#glimpse(physicians)
physicians <- physicians %>%
mutate(location=as.factor(entity),
period=year,
physicians_1000=physicians_per_1_000_people) %>%
select(location,period,physicians_1000)
#glimpse(physicians)
#Merging dataframes
total <- merge(total,physicians,by=c("location","period"))
#source:https://ourworldindata.org/grapher/nurses-and-midwives-per-1000-people?
#Reading file
nurses <- read_csv(here::here('Data',"nurses-and-midwives-per-1000-people.csv")) %>%
clean_names()
glimpse(nurses)
#source:https://ourworldindata.org/grapher/out-of-pocket-expenditure-per-capita-on-healthcare
#Reading file
pocket_exp <- read_csv(here::here('Data',"out-of-pocket-expenditure-per-capita-on-healthcare.csv")) %>%
clean_names()
glimpse(pocket_exp)

1.1.0.17 Adding data for health protection coverage
Rows: 162
Columns: 4
$ entity <chr> "Albania", "…
$ code <chr> "ALB", "DZA"…
$ year <dbl> 2008, 2005, …
$ share_of_population_covered_by_health_insurance_ilo_2014 <dbl> 23.6, 85.2, …
The health coverage dataset had too few observations. Thus, it was not included in our final dataset.
1.1.0.18 Adding data for literacy rate
Rows: 215
Columns: 4
$ entity <chr> "Afghanistan", "Albania", "Algeria", …
$ code <chr> "AFG", "ALB", "DZA", "ASM", "AND", "A…
$ year <dbl> 2000, 2011, 2006, 1980, 2011, 2011, 1…
$ literacy_rate_cia_factbook_2016 <dbl> 28.1, 96.8, 72.6, 97.0, 100.0, 70.4, …
The literacy dataset had too few observations. Thus, it was not included in our final dataset.
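A simple way to back up a decision like this is to compute the share of non-missing values per variable once the candidate columns sit in one table. The sketch below does so on a small made-up data frame; the 50% cut-off is an arbitrary illustration, not the threshold we applied.

```r
# Made-up table with deliberately sparse columns
toy <- data.frame(
  location = c("A", "B", "C", "D", "E"),
  gdp      = c(1, 2, 3, 4, NA),
  nurses   = c(NA, NA, 0.6, NA, NA),
  literacy = c(28.1, NA, NA, NA, 96.8)
)

# Share of non-missing values per column (the "complete rate")
complete_rate <- colMeans(!is.na(toy))

# Columns below an arbitrary 50% cut-off (assumption for illustration)
sparse <- names(complete_rate)[complete_rate < 0.5]

print(round(complete_rate, 2))
print(sparse)
```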
1.1.0.19 Adding data for grouping locations into continents
Rows: 194
Columns: 2
$ continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Africa",…
$ country <chr> "Algeria", "Angola", "Benin", "Botswana", "Burkina", "Burun…
pocket_exp <- pocket_exp %>%
period=year,
pocket_per_cap=out_of_pocket_expenditure_per_capita_on_healthcare_ppp_usd_who_global_health_expenditure) %>%
select(location,period,pocket_per_cap)
#Merging dataframes
total <- merge(total,pocket_exp,by=c("location","period"))
#Reading file
health_protect <- read_csv(here::here('Data',"health-protection-coverage.csv")) %>%
clean_names()
glimpse(health_protect)
#Reading file
literacy <- read_csv(here::here('Data',"literacy-rate-by-country.csv")) %>%
clean_names()
glimpse(literacy)
#source: https://github.com/dbouquin/IS_608/blob/master/NanosatDB_munging/Countries-Continents.csv
#Reading file
continents <- read_csv(here::here('Data',"Continents.csv")) %>%
clean_names()
glimpse(continents)

Rows: 194
Columns: 2
$ location <fct> Algeria, Angola, Benin, Botswana, Burkina, Burundi, Cameroo…
$ continent <fct> Africa, Africa, Africa, Africa, Africa, Africa, Africa, Afr…
1.1.0.20 Dealing with NAs
After including all potentially relevant and significant variables in our dataset, we made an initial exploration of the data.
1.2 Exploratory Data Analysis
1.2.0.1 DALY Rates per Continent
continents <- continents %>%
mutate(location=as.factor(country),
continent=as.factor(continent))%>%
select(location, continent)
glimpse(continents)
#Merging dataframes
total <- merge(total,continents,by=c("location"))
#Adding variables of per capita healthcare expenditure - per capita gdp
total <- total%>%
mutate(healthcare_gdp_rate = healthcare_expenditure/gdp)
#skim(total)
total <- total %>%
na.omit()
#skim(total)
#Selecting data only from 1980 - onward (to gain better insights on the recent situation)
total_short <-total %>%
filter(period>=1980)
#Re-coding DALY variables as averages per continent, per year
total_cont<-total_short%>%
group_by(period,continent)%>%
  # Note: daly_ivsa is divided by 10,000, whereas the other series are divided by 100,000
  summarise(daly_adjusted = mean(daly_adjusted/100000),
            daly_cnmnd = mean(daly_cnmnd/100000),
            daly_ncds = mean(daly_ncds/100000),
            daly_ivsa = mean(daly_ivsa/10000))
#Plotting for average DALY rates per capita accumulated from 1980 to 2017
ggplot(total_cont, aes(x = continent, y = daly_adjusted, fill = continent)) +
  geom_bar(stat = "identity") +
  labs(x= "Continent", y = "Overall DALYs", title = "Accumulated Average DALYs per Capita, per Continent 1980 - 2017")

ggplot(total_cont, aes(x = continent, y = daly_cnmnd, fill = continent)) +
  geom_bar(stat = "identity")+
  labs(x= "Continent", y = "Communicable Diseases DALYs", title = "Accumulated Average DALYs per Capita from Communicable Diseases, per Continent 1980 - 2017")

ggplot(total_cont, aes(x = continent, y = daly_ncds, fill = continent)) +
  geom_bar(stat = "identity") +
  labs(x= "Continent", y = "Non-Communicable Diseases DALYs", title = "Accumulated Average DALYs per Capita from Non-Communicable Diseases, per Continent 1980 - 2017")

ggplot(total_cont, aes(x = continent, y = daly_ivsa, fill = continent)) +
  geom_bar(stat = "identity") +
  labs(x= "Continent", y = "Injuries DALYs", title = "Accumulated Average DALYs per Capita from Injuries, per Continent 1980 - 2017")

Overall, we find that Africa has the highest accumulated average DALY rate per capita of all continents (c. 90), followed by Asia (c. 50) and Oceania (c. 40). The sharp contrast between Africa and the other continents is mainly due to its high accumulated average for communicable diseases. In this category, Africa's value is more than triple that of the second-highest continent (c. 55 for Africa compared to c. 17 for Asia).
When it comes to non-communicable diseases and injuries, rates are fairly even. For non-communicable diseases, DALY rates range from c. 27 to c. 33 (North America being the lowest and Africa the highest). Injury rates are much lower, ranging from c. 4 to c. 6 (Europe being the lowest and Africa the highest).
Consequently, communicable diseases are found to impose the highest burden on the population, with Africa bearing (or having borne) the largest share. We took a closer look at these rates to better understand their evolution through time.
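To make the term "accumulated average" concrete, here is a minimal sketch of what a stacked identity bar adds up. It uses made-up numbers and hypothetical column names, not the report's dataset: the continent means are computed per year first and are then summed over years.

```r
# Toy data (made up for illustration; not taken from the report's dataset)
toy <- data.frame(
  period    = rep(c(2000, 2001), each = 4),
  continent = rep(c("Africa", "Africa", "Asia", "Asia"), times = 2),
  daly      = c(90, 80, 50, 40, 88, 78, 48, 38)
)

# Step 1: average DALY per continent and year
yearly <- aggregate(daly ~ period + continent, data = toy, FUN = mean)

# Step 2: sum the yearly averages per continent; this is the bar height
# that geom_bar(stat = "identity") shows when it stacks one segment per year
accumulated <- aggregate(daly ~ continent, data = yearly, FUN = sum)
print(accumulated)
```

Because the bar height grows with the number of years included, the accumulated values are best read as a relative comparison between continents rather than as a rate for any single year.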
1.2.1 Communicable Diseases
graph1 <- total_cont %>%
  ggplot(aes(x=period, y=daly_cnmnd, fill=continent, text=continent)) +
  geom_area(alpha = 1) +
  theme(legend.position="none") +
  labs(x= "Year", y = "DALY for communicable disease", title = "Time Series Average DALYs per Capita from Communicable Diseases per Continent")
ggplotly(graph1)
[Figure: Time Series Average DALYs per Capita from Communicable Diseases per Continent. x-axis: Year; y-axis: DALY for communicable disease]

As seen from the graph, Africa's communicable DALY rate seems to have been in decline since 2008. However, the continent has consistently ranked above the others, which leaves room to further consider the causes and potential solutions.
From the Our World in Data report, neonatal disorders are the top communicable cause in terms of total share of burden (7.45% of all causes). There is also a strong negative correlation between GDP and DALYs from communicable diseases. Similarly, a negative correlation is found between health expenditure per capita and DALYs from communicable diseases.
What about healthcare expenditure as a percentage of GDP?
ggplot(total_short, aes(x = healthcare_gdp_rate, y = daly_cnmnd, color = continent))+
  geom_point()+
  labs(x= "Healthcare Expenditure as percentage of GDP", y = "DALY from Communicable Diseases", title = "Rates due to Proportion of GDP spent on Healthcare")
# No clear correlation yet, but interesting
total_short%>%
select(daly_cnmnd, healthcare_gdp_rate, gdp, pocket_per_cap)%>%
ggpairs()

> A higher GDP per country seems to have a significant negative correlation with DALYs from communicable diseases, whereas the proportion of GDP spent on healthcare seems to have a significant positive correlation with them. GDP also seems to have a significant negative correlation with the proportion of GDP spent on healthcare. This could indicate that poorer countries are more likely to have to combat communicable diseases and, consequently, spend a greater proportion of their GDP on healthcare than richer countries. Out-of-pocket expenditure is also highly negatively correlated with DALYs from communicable diseases, although highly positively correlated with GDP. This leads to the interpretation that poor countries in which the population is individually responsible for investing in its own medical care are the most likely to have higher communicable disease DALY rates.
1.2.2 Injuries
With DALY rates for injuries and additional causes being similar across all continents, we decided to first take a closer look at which types of causes were most prominent overall.
#This plot shows injury related DALY in a stacked bar chart.
start <- total %>%
  group_by(continent) %>%
  summarise(daly_conflict_terrorism = mean(daly_conflict_terrorism/total_population),
            daly_self_harm = mean(daly_self_harm/total_population),
            daly_violence = mean(daly_violence/total_population),
            daly_transport_injuries = mean(daly_transport_injuries/total_population),
            daly_nature_forces = mean(daly_nature_forces/total_population),
            daly_unintentional_injuries = mean(daly_unintentional_injuries/total_population))
pivot <- pivot_longer(start,
                      cols = c(daly_conflict_terrorism, daly_self_harm, daly_violence,
                               daly_transport_injuries, daly_unintentional_injuries, daly_nature_forces),
                      names_to = "diseases", values_to = "value")
#select columns from dataset
plots <- pivot %>%
  select(continent,diseases,value)
ggplot(plots, aes(fill=diseases, y=value, x=continent)) +
  geom_bar(position="stack", stat="identity") +
  labs(x= "Continent", y = "Injuries DALYs", title = "Accumulated Average DALYs per Capita from Injuries, per Continent 1980 - 2017")

#Plot on Terrorism and Violence
terrorism_violence <- start %>%
  select(daly_conflict_terrorism, daly_violence, continent)
terrorism_violence <- pivot_longer(terrorism_violence,
                                   c(daly_conflict_terrorism, daly_violence),
                                   names_to = "diseases", values_to = "value")
#select columns from dataset
terrorism_violence <- terrorism_violence %>%
  select(diseases,value,continent)
#stacked bar chart
ggplot(terrorism_violence, aes(fill=diseases, y=value, x=continent)) +
  geom_bar(position="stack", stat="identity") +
  labs(x= "Continent", y = "Injuries DALYs", title = "Accumulated Average DALYs per Capita from Terrorism and Violence 1980 - 2017")

total_short%>%
select(daly_ivsa, gdp, daly_mental_and_substance, physicians_1000, education_years)%>%
ggpairs()

1.2.3 Non-Communicable Diseases

start1 <- total %>%
  group_by(continent) %>%
  summarise(daly_cvs = mean(daly_cvs/total_population),
            daly_nutritional_deficiencies = mean(daly_nutritional_deficiencies/total_population),
            daly_maternal_disorders = mean(daly_maternal_disorders/total_population),
            daly_musculoskeletal = mean(daly_musculoskeletal/total_population),
            daly_other_non_communicable = mean(daly_other_non_communicable/total_population),
            daly_neurological = mean(daly_neurological/total_population),
            daly_mental_and_substance = mean(daly_mental_and_substance/total_population),
            daly_diabetes_urogenital_blood_endocrine = mean(daly_diabetes_urogenital_blood_endocrine/total_population),
            daly_neoplasms = mean(daly_neoplasms/total_population),
            daly_chronic_liver = mean(daly_chronic_liver/total_population))
pivot1 <- pivot_longer(start1,
                       c(daly_cvs, daly_nutritional_deficiencies, daly_maternal_disorders, daly_musculoskeletal,
                         daly_other_non_communicable, daly_neurological, daly_mental_and_substance,
                         daly_diabetes_urogenital_blood_endocrine, daly_neoplasms, daly_chronic_liver),
                       names_to = "diseases", values_to = "value")
#select columns from data set
total_short_ncds <- pivot1 %>%
  select(continent,diseases,value)
#stacked bar chart
# This stacked bar chart shows the DALY once again for non-communicable diseases, adjusted to show data per 100000 population.
# Additionally, the data has been colored to show the different categories of non-communicable diseases.
# Asia has the highest DALY for non-communicable diseases, closely followed by Europe. There are reasons to suggest why
# DALY remains high in both regions. For Asia, the lack of affordability, the lack of doctors, and healthcare that is not
# up to the highest standards may all contribute towards this. Due to Europe's aging population, non-communicable
# diseases are more likely to be present among its population. As seen in the graphs earlier, as nations become modern
# and developed, their populations transition from suffering from communicable diseases towards non-communicable
# diseases, which come with age.
ggplot(total_short_ncds, aes(fill=diseases, y=value, x=continent)) +
  geom_bar(position="stack", stat="identity") +
  labs(x= "Continent", y = "Non-Comm DALYs", title = "Accumulated Average DALYs per Capita from Non-Comm, per Continent 1980 - 2017")
# Looking into CVS in more detail.
ggplot(total_short, aes(x= continent, y = daly_cvs))+
  geom_col()+
  labs(x= "Continent", y = "Daly due to CVS related conditions", title = "DALY per capita due to CVS conditions per continent")
# Looking into neoplasms in more detail.
ggplot(total_short, aes(x= continent, y = daly_neoplasms))+
geom_col()+
labs(x= "Continent", y = "Daly due to neoplasm", title = "DALY per capita due to neoplasms per continent")

# Looking into diabetes, urogenital, blood, endocrine in more detail.
ggplot(total_short, aes(x= continent, y = daly_diabetes_urogenital_blood_endocrine))+
  geom_col()+
  labs(x= "Continent", y = "Daly due to diabetes, urogenital, blood and endocrine related conditions.", title = "DALY per capita due to diabetes, urogenital, blood and endocrine related conditions per continent")

ggplot(total_short, aes(x= continent, y = daly_mental_and_substance))+
geom_col()+
  labs(x= "Continent", y = "Daly due to mental and substance related conditions.", title = "DALY per capita due to mental and substance related conditions per continent")

ggplot(total_short, aes(x = healthcare_gdp_rate, y = daly_ncds/100000, color = continent))+
geom_point()+
  labs(x= "Healthcare Expenditure as percentage of GDP", y = "DALY from Non-Communicable Diseases", title = "Rates due to Proportion of GDP spent on Healthcare")

total_short%>%
select(daly_cnmnd, healthcare_gdp_rate, gdp, pocket_per_cap)%>%
ggpairs()

1.3 Regression analysis
Although DALY rates are highly complex, with many different societal and economic variables affecting them, we decided to look into certain variables that had enough data to be used for our analysis. These variables, which affect both DALY rates by cause and general DALY rates, can be divided into several categories.
Diet habit variables (fruit consumption per capita per year, percentage of animal protein consumption out of total daily calories, vegetable consumption, percentage of population being overweight), healthcare variables (annual healthcare expenditure, out-of-pocket expenditure on healthcare, healthcare per GDP, and number of physicians per 1,000 people), living habits (smoking percentages), and other demographics (education years).
In addition to these elements, we considered the effect of each continent separately by transforming them into dummy variables.
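As an aside on the dummy-variable step, the sketch below shows how base R builds the same kind of 0/1 columns from a factor. The data frame and its column names are hypothetical toy values rather than the report's data. Note that R drops one level as the reference category, just as one continent is left without a dummy in the models below.

```r
# Hypothetical toy data; values and column names are illustrative only
toy <- data.frame(
  daly      = c(90, 85, 50, 48, 30, 33),
  continent = factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe"))
)

# model.matrix() expands the factor into 0/1 columns and drops the first
# level ("Africa") as the reference category
dummies <- model.matrix(~ continent, data = toy)
print(colnames(dummies))

# lm() performs the same expansion internally when given a factor, so each
# coefficient is the difference from the reference continent's mean
fit_factor <- lm(daly ~ continent, data = toy)
print(coef(fit_factor))
```

With hand-made dummies, as in this report, the continent without its own column plays the role of the reference category.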
1.3.0.1 Models 0 and 1
#Transforming continent factors into dummy variables
total <- total %>%
  mutate(Asia = case_when(continent == "Asia" ~ 1, TRUE ~ 0),
         Europe = case_when(continent == "Europe" ~ 1, TRUE ~ 0),
         NorthA = case_when(continent == "North America" ~ 1, TRUE ~ 0),
         Africa = case_when(continent == "Africa" ~ 1, TRUE ~ 0),
         SouthA = case_when(continent == "South America" ~ 1, TRUE ~ 0))

Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
fruit_consumption + vegetable_consumption + animal_protein_consumption +
education_years + physicians_1000 + pocket_per_cap + healthcare_gdp_rate +
daly_ivsa + daly_ncds + daly_cnmnd, data = total, subset = gdp)
Residuals:
Min 1Q Median 3Q Max
-6.602e-11 -4.258e-12 -5.730e-13 2.894e-12 1.220e-10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.867e-12 1.064e-11 -3.640e-01 0.716590
smoking_percentage -1.016e-10 1.900e-11 -5.346e+00 2.49e-07 ***
percentage_overweight -6.735e-13 1.006e-13 -6.694e+00 2.26e-10 ***
fruit_consumption -7.039e-14 2.012e-14 -3.499e+00 0.000579 ***
vegetable_consumption -1.385e-14 2.245e-14 -6.170e-01 0.537948
animal_protein_consumption 5.153e-12 7.264e-13 7.094e+00 2.35e-11 ***
education_years 1.769e-12 6.270e-13 2.821e+00 0.005277 **
physicians_1000 3.200e-12 1.696e-12 1.887e+00 0.060675 .
pocket_per_cap -2.376e-14 8.464e-15 -2.807e+00 0.005506 **
healthcare_gdp_rate 3.037e-11 1.838e-11 1.652e+00 0.100074
daly_ivsa 1.000e+00 1.045e-15 9.570e+14 < 2e-16 ***
daly_ncds 1.000e+00 3.813e-16 2.623e+15 < 2e-16 ***
daly_cnmnd 1.000e+00 1.157e-16 8.645e+15 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.446e-11 on 195 degrees of freedom
(817 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 3.025e+31 on 12 and 195 DF, p-value: < 2.2e-16
# lm0 was created to show that daly_ivsa, daly_ncds and daly_cnmnd make up daly_adjusted. As a result, these three variables are not included in the linear models.
# Note: the positional `gdp` argument is matched to `subset` (see `subset = gdp` in the Call output), so gdp is not a predictor here.
lm0 = lm(daly_adjusted ~ smoking_percentage + percentage_overweight + fruit_consumption + vegetable_consumption +
           animal_protein_consumption + education_years + physicians_1000 + pocket_per_cap + healthcare_gdp_rate +
           daly_ivsa + daly_ncds + daly_cnmnd, gdp, data = total)
summary(lm0)
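The reason lm0 returns coefficients of exactly 1 and an R-squared of 1 can be illustrated with a small simulation. This is a sketch under the assumption that the total is the exact sum of its three components. The toy columns reuse the report's suffixes only for readability, and the values are random.

```r
set.seed(1)
n <- 50

# Three made-up components and a total that is their exact sum
toy <- data.frame(
  cnmnd = runif(n, 0, 100),
  ncds  = runif(n, 0, 100),
  ivsa  = runif(n, 0, 100)
)
toy$total <- toy$cnmnd + toy$ncds + toy$ivsa

# Regressing the total on its own components recovers the accounting identity
fit_identity <- lm(total ~ cnmnd + ncds + ivsa, data = toy)
print(round(coef(fit_identity), 6))

# R-squared computed by hand (summary() warns on an essentially perfect fit)
r_squared <- 1 - sum(resid(fit_identity)^2) / sum((toy$total - mean(toy$total))^2)
print(r_squared)
```

Any other predictor added to such a model has nothing left to explain, which is why the component variables are left out of the later models.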
lm1 = lm(daly_adjusted ~ smoking_percentage + percentage_overweight + fruit_consumption + vegetable_consumption +
           animal_protein_consumption + education_years + physicians_1000 + pocket_per_cap + healthcare_gdp_rate +
           gdp, data = total)
summary(lm1)

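Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
fruit_consumption + vegetable_consumption + animal_protein_consumption +
education_years + physicians_1000 + pocket_per_cap + healthcare_gdp_rate +
gdp, data = total)
Residuals:
Min 1Q Median 3Q Max
-25122 -6233 -812 4866 59542
Coefficients:
Estimate Std. Error t value Pr(>|t|)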
(Intercept) 7.921e+04 1.894e+03 41.814 < 2e-16 ***
smoking_percentage -2.391e+04 6.139e+03 -3.894 0.000105 ***
percentage_overweight -2.590e+02 3.531e+01 -7.335 4.53e-13 ***
fruit_consumption -5.051e+01 8.655e+00 -5.836 7.19e-09 ***
vegetable_consumption -3.828e+01 6.980e+00 -5.484 5.26e-08 ***
animal_protein_consumption -1.799e+03 2.719e+02 -6.615 6.00e-11 ***
education_years -1.270e+03 2.027e+02 -6.268 5.41e-10 ***
physicians_1000 7.964e+02 4.986e+02 1.597 0.110538
pocket_per_cap -1.011e+01 2.508e+00 -4.031 5.96e-05 ***
healthcare_gdp_rate 6.335e+03 6.803e+03 0.931 0.351968
gdp 6.325e-02 3.290e-02 1.923 0.054808 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11340 on 1014 degrees of freedom
Multiple R-squared: 0.6371, Adjusted R-squared: 0.6335
F-statistic: 178 on 10 and 1014 DF, p-value: < 2.2e-16
Already in model 1 we reach an adjusted R-squared of 0.6335, meaning these factors explain approximately 63 percent of the variation in overall DALYs. For the models below, the variable with the highest p-value was dropped at each step.
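The manual procedure of dropping the least significant variable can be written as a small helper. The sketch below is not the code used for this report: `drop_least_significant` is a hypothetical function, and it is run on R's built-in `mtcars` data so that it stays self-contained. It assumes numeric predictors, so that coefficient names match term names.

```r
# Hypothetical helper: refit a linear model without its least significant term
drop_least_significant <- function(fit) {
  coefs <- summary(fit)$coefficients
  terms_only <- setdiff(rownames(coefs), "(Intercept)")
  p_values <- setNames(coefs[terms_only, "Pr(>|t|)"], terms_only)
  worst <- names(which.max(p_values))        # term with the highest p-value
  update(fit, as.formula(paste(". ~ . -", worst)))
}

# Illustration on built-in data (not the DALY dataset)
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced_fit <- drop_least_significant(full_fit)
print(formula(reduced_fit))
```

Calling the helper repeatedly, and stopping once every remaining p-value is below the chosen threshold, mirrors the sequence of models that follows.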
lm2 = lm( daly_adjusted~smoking_percentage+ percentage_overweight+ vegetable_consumption+ animal_protein_consumption+
education_years+ physicians_1000+ pocket_per_cap+ fruit_consumption + gdp, data = total)
summary(lm2)

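Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
vegetable_consumption + animal_protein_consumption + education_years +
physicians_1000 + pocket_per_cap + fruit_consumption + gdp,
data = total)
Residuals:
Min 1Q Median 3Q Max
-25123 -6213 -846 4897 59243
Coefficients:
Estimate Std. Error t value Pr(>|t|)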
(Intercept) 8.034e+04 1.453e+03 55.295 < 2e-16 ***
physicians_1000 9.062e+02 4.845e+02 1.871 0.061701 .
gdp 5.507e-02 3.170e-02 1.737 0.082661 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F-statistic: 197.7 on 9 and 1015 DF, p-value: < 2.2e-16
After dropping the healthcare-to-GDP rate, out-of-pocket expenditure remains significant.
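Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
vegetable_consumption + animal_protein_consumption + education_years +
pocket_per_cap + fruit_consumption + physicians_1000, data = total)
Residuals:
Min 1Q Median 3Q Max
-25203 -6215 -447 4669 59274
Coefficients:
Estimate Std. Error t value Pr(>|t|)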
(Intercept) 79726.798 1410.663 56.517 < 2e-16 ***
smoking_percentage -24292.378 6127.035 -3.965 7.86e-05 ***
percentage_overweight -266.174 35.115 -7.580 7.76e-14 ***
vegetable_consumption -40.473 6.894 -5.871 5.87e-09 ***
animal_protein_consumption -1741.370 258.204 -6.744 2.58e-11 ***
education_years -1222.730 201.228 -6.076 1.74e-09 ***
pocket_per_cap -7.568 2.052 -3.688 0.000238 ***
fruit_consumption -47.366 8.442 -5.611 2.59e-08 ***
physicians_1000 859.752 484.202 1.776 0.076097 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1.3.0.2 Drop physicians_1000
lm3 = lm(daly_adjusted ~ smoking_percentage + percentage_overweight + vegetable_consumption + animal_protein_consumption +
           education_years + pocket_per_cap + fruit_consumption + physicians_1000, data = total)
summary(lm3)

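Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
vegetable_consumption + animal_protein_consumption + education_years +
pocket_per_cap + fruit_consumption, data = total)
Residuals:
Min 1Q Median 3Q Max
-25602 -6210 -408 4778 59363
Coefficients:
Estimate Std. Error t value Pr(>|t|)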
(Intercept) 78595.709 1259.974 62.379 < 2e-16 ***
smoking_percentage -21927.213 5986.815 -3.663 0.000263 ***
education_years -1107.609 190.699 -5.808 8.44e-09 ***
pocket_per_cap -6.709 1.996 -3.361 0.000806 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All variables are now significant, leading to a model with 0.632 as its adjusted R-squared.
1.3.1 Stepwise regression & VIF exam
We can also use stepwise regression to find the optimal model. The stepwise method is more systematic than dropping variables manually, since it allows dropped variables to be added back in later steps if doing so improves the model (lowers the model's AIC), and it re-evaluates the model after each addition or removal.
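As a minimal, self-contained sketch of this procedure, `step()` can be run on R's built-in `mtcars` data (chosen only so the example does not depend on the DALY dataset). With `direction = "both"`, each step considers removing a term or re-adding one dropped earlier, and a move is accepted only if it lowers the AIC.

```r
# Illustration on built-in data (not the DALY dataset)
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# trace = 0 suppresses the step-by-step table shown elsewhere in this report
stepped_fit <- step(full_fit, direction = "both", trace = 0)

print(formula(stepped_fit))
print(c(full = AIC(full_fit), stepped = AIC(stepped_fit)))
```

Note that `step()` selects on AIC alone, so the selected model may still contain terms whose p-values exceed the usual thresholds.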
lm4 = lm(daly_adjusted ~ smoking_percentage + percentage_overweight + vegetable_consumption + animal_protein_consumption +
           education_years + pocket_per_cap + fruit_consumption, data = total)
summary(lm4)
fit1_step=step(lm1,direction="both")

Start: AIC=19149.68
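daly_adjusted ~ smoking_percentage + percentage_overweight +
fruit_consumption + vegetable_consumption + animal_protein_consumption +
education_years + physicians_1000 + pocket_per_cap + healthcare_gdp_rate +
gdp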
Df Sum of Sq RSS AIC
- healthcare_gdp_rate 1 111483768 1.3048e+11 19149
<none> 1.3036e+11 19150
- physicians_1000 1 327963566 1.3069e+11 19150
- gdp 1 475232954 1.3084e+11 19151
- smoking_percentage 1 1949782662 1.3231e+11 19163
- pocket_per_cap 1 2089322026 1.3245e+11 19164
- vegetable_consumption 1 3866128944 1.3423e+11 19178
- fruit_consumption 1 4378585222 1.3474e+11 19182
- education_years 1 5050374107 1.3541e+11 19187
- animal_protein_consumption 1 5625702511 1.3599e+11 19191
- percentage_overweight 1 6917422353 1.3728e+11 19201
Step: AIC=19148.55
daly_adjusted ~ smoking_percentage + percentage_overweight +
fruit_consumption + vegetable_consumption + animal_protein_consumption +
education_years + physicians_1000 + pocket_per_cap + gdp
Df Sum of Sq RSS AIC
<none> 1.3048e+11 19149
- gdp 1 387923531 1.3086e+11 19150
+ healthcare_gdp_rate 1 111483768 1.3036e+11 19150
- physicians_1000 1 449760026 1.3093e+11 19150
- smoking_percentage 1 1912380306 1.3239e+11 19162
- pocket_per_cap 1 2075803261 1.3255e+11 19163
- vegetable_consumption 1 4040506535 1.3452e+11 19178
- fruit_consumption 1 4418260882 1.3489e+11 19181
- education_years 1 5027016883 1.3550e+11 19185
- animal_protein_consumption 1 6245781148 1.3672e+11 19194
- percentage_overweight 1 7139008696 1.3761e+11 19201
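Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
fruit_consumption + vegetable_consumption + animal_protein_consumption +
education_years + physicians_1000 + pocket_per_cap + gdp,
data = total)
Residuals:
Min 1Q Median 3Q Max
-25123 -6213 -846 4897 59243
Coefficients:
Estimate Std. Error t value Pr(>|t|)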
(Intercept) 8.034e+04 1.453e+03 55.295 < 2e-16 ***
physicians_1000 9.062e+02 4.845e+02 1.871 0.061701 .
gdp 5.507e-02 3.170e-02 1.737 0.082661 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit1_step)

smoking_percentage percentage_overweight
1.668205 2.913117
fruit_consumption vegetable_consumption
1.425008 1.502601
animal_protein_consumption education_years
3.110474 3.088943
physicians_1000 pocket_per_cap
3.754590 3.210832
gdp
2.999374
From the final result we can see that six variables are significant with a p-value lower than 0.1. Expense and pocket_per_cap are both significant in this case. However, dropping one of them may make the other insignificant. This could be because the two have a joint effect on the burden of disease. We can choose between these two models according to our chosen confidence level.
Continents were also considered as part of the model to see their effect.
Call:
lm(formula = daly_adjusted ~ percentage_overweight + vegetable_consumption +
animal_protein_consumption + education_years + pocket_per_cap +
fruit_consumption + Asia + Africa + NorthA + Europe + SouthA +
Asia + Africa + NorthA + Europe + SouthA, data = total)
Residuals:
Min 1Q Median 3Q Max
-29935 -4148 -431 3996 50866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66522.998 2345.279 28.365 < 2e-16 ***
education_years -534.613 165.120 -3.238 0.00124 **
pocket_per_cap -8.669 1.678 -5.168 2.86e-07 ***
Asia -7140.437 1807.499 -3.950 8.34e-05 ***
Africa 13792.577 1918.746 7.188 1.27e-12 ***
NorthA -9335.463 1794.310 -5.203 2.38e-07 ***
Europe -5196.987 1650.294 -3.149 0.00169 **
SouthA -9146.724 1917.915 -4.769 2.12e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vif(fit1_step)
fit = lm(daly_adjusted ~ percentage_overweight + vegetable_consumption + animal_protein_consumption + education_years +
           pocket_per_cap + fruit_consumption + Asia + Africa + NorthA + Europe + SouthA +
           Asia + Africa + NorthA + Europe + SouthA, data = total)
print(summary(fit))
print(vif(fit))

percentage_overweight vegetable_consumption
3.908936 1.772149
animal_protein_consumption education_years
2.885232 3.048594
pocket_per_cap fruit_consumption
2.136245 1.318169
Asia Africa
7.257346 6.590683
NorthA Europe
3.209642 7.556342
SouthA
2.440735
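For readers who want to see what a variance inflation factor measures, the sketch below computes it by hand as 1 / (1 - R²), where R² comes from regressing each predictor on all the others. `vif_by_hand` is a hypothetical helper written for illustration and run on the built-in `mtcars` data. The values in this report come from `vif()` itself, not from this function.

```r
# Hypothetical helper: VIF for each predictor of a fitted linear model
# (assumes an intercept in the first column and at least two predictors)
vif_by_hand <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)                               # VIF_j = 1 / (1 - R_j^2)
  })
}

# Illustration on built-in data (not the DALY dataset)
toy_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
vifs <- vif_by_hand(toy_fit)
print(round(vifs, 3))
```

A value near 1 means a predictor is almost unrelated to the others. The larger values for the Asia, Africa and Europe dummies above reflect that the continent indicators are strongly related to one another.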
1.3.2 Interpretation of the final model
Our final model had 11 variables.
Call:
lm(formula = daly_adjusted ~ percentage_overweight + vegetable_consumption +
animal_protein_consumption + education_years + pocket_per_cap +
fruit_consumption + Asia + Africa + NorthA + Europe + SouthA +
Asia + Africa + NorthA + Europe + SouthA, data = total)
Residuals:
Min 1Q Median 3Q Max
-29935 -4148 -431 3996 50866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66522.998 2345.279 28.365 < 2e-16 ***
education_years -534.613 165.120 -3.238 0.00124 **
pocket_per_cap -8.669 1.678 -5.168 2.86e-07 ***
Asia -7140.437 1807.499 -3.950 8.34e-05 ***
Africa 13792.577 1918.746 7.188 1.27e-12 ***
NorthA -9335.463 1794.310 -5.203 2.38e-07 ***
Europe -5196.987 1650.294 -3.149 0.00169 **
SouthA -9146.724 1917.915 -4.769 2.12e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
percentage_overweight vegetable_consumption
3.908936 1.772149
animal_protein_consumption education_years
2.885232 3.048594
pocket_per_cap fruit_consumption
2.136245 1.318169
Asia Africa
7.257346 6.590683
NorthA Europe
3.209642 7.556342
SouthA
2.440735
continent_fit=fit
summary(continent_fit)
vif(continent_fit)

huxtable::huxreg(lm1,lm2,lm3,lm4, continent_fit,
                 number_format = "%.2f")

                            (1)             (2)             (3)             (4)             (5)
(Intercept)                 79209.24 ***    80341.02 ***    79726.80 ***    78595.71 ***    66523.00 ***
                            (1894.33)       (1452.94)       (1410.66)       (1259.97)       (2345.28)
smoking_percentage          -23905.42 ***   -23651.69 ***   -24292.38 ***   -21927.21 ***
                            (6138.51)       (6132.06)       (6127.03)       (5986.82)
percentage_overweight       -259.02 ***     -262.03 ***     -266.17 ***     -256.08 ***     -202.37 ***
                            (35.31)         (35.16)         (35.11)         (34.69)         (33.40)
fruit_consumption           -50.51 ***      -50.72 ***      -47.37 ***      -49.24 ***      -40.86 ***
                            (8.65)          (8.65)          (8.44)          (8.38)          (6.82)
vegetable_consumption       -38.28 ***      -38.93 ***      -40.47 ***      -37.82 ***      -33.86 ***
                            (6.98)          (6.94)          (6.89)          (6.74)          (6.18)
animal_protein_consumption  -1798.59 ***    -1852.18 ***    -1741.37 ***    -1623.10 ***    -1030.84 ***
                            (271.90)        (265.72)        (258.20)        (249.73)        (209.88)
education_years             -1270.47 ***    -1267.36 ***    -1222.73 ***    -1107.61 ***    -534.61 **
                            (202.70)        (202.66)        (201.23)        (190.70)        (165.12)
physicians_1000             796.40          906.18          859.75
                            (498.63)        (484.46)        (484.20)
pocket_per_cap              -10.11 ***      -10.08 ***      -7.57 ***       -6.71 ***       -8.67 ***
                            (2.51)          (2.51)          (2.05)          (2.00)          (1.68)
healthcare_gdp_rate         6334.55
                            (6802.52)
gdp                         0.06            0.06
                            (0.03)          (0.03)
Asia                                                                                        -7140.44 ***
                                                                                            (1807.50)
Africa                                                                                      13792.58 ***
                                                                                            (1918.75)
NorthA                                                                                      -9335.46 ***
                                                                                            (1794.31)
Europe                                                                                      -5196.99 **
                                                                                            (1650.29)
SouthA                                                                                      -9146.72 ***
                                                                                            (1917.91)
N                           1025            1025            1025            1025            1025
R2                          0.64            0.64            0.64            0.63            0.76
logLik                      -11018.25       -11018.69       -11020.21       -11021.80       -10814.42
AIC                         22060.50        22059.38        22060.42        22061.60        21654.84
*** p < 0.001; ** p < 0.01; * p < 0.05.
actual predicted
actual 1.0000000 0.8572182
predicted 0.8572182 1.0000000
From the 5 models, continent_fit was chosen as the final model because all of its variables are significant and it has the highest R squared (0.76). As our predictions show, predicted and actual DALY rates for 2013 have a correlation of 0.8572, which we read as roughly 85.72 percent accuracy.
best_model <- continent_fit
#Part 2: We wanted to test the predictive ability of our model by checking that it could predict, with a certain
# level of confidence, the DALYs for the last full year of data (2013)
train <- total %>%
  filter(period<2013)
# Hold-out set: observations for 2013 only
holdout <- total %>%
  filter(period == 2013)
# Refit the final model's formula on the training years only
continent_fit2 <- lm(formula(continent_fit), data = train)
final_prediction <- predict(continent_fit2, newdata = holdout)
ac_pred <- data.frame(cbind(actual = holdout$daly_adjusted, predicted = final_prediction))
correlation_accuracy <- cor(ac_pred)
correlation_accuracy
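A correlation between actual and predicted values measures how well the two move together, but not how far apart they are. The sketch below uses simulated data with hypothetical column names to show one way the same train and hold-out split could also report a root mean squared error (RMSE), which is expressed in the units of the outcome.

```r
set.seed(42)

# Simulated panel (made up): 10 observations per year, 2000 to 2013
toy <- data.frame(year = rep(2000:2013, each = 10), x = rnorm(140))
toy$y <- 5 + 2 * toy$x + rnorm(140, sd = 0.5)

# Same split logic as above: fit on earlier years, predict the last year
train_toy   <- subset(toy, year < 2013)
holdout_toy <- subset(toy, year == 2013)
toy_fit     <- lm(y ~ x, data = train_toy)
pred        <- predict(toy_fit, newdata = holdout_toy)

# Correlation shows co-movement; RMSE shows the typical size of the error
metrics <- c(correlation = cor(holdout_toy$y, pred),
             rmse        = sqrt(mean((holdout_toy$y - pred)^2)))
print(metrics)
```

Reporting both numbers side by side makes it easier to judge whether a high correlation also corresponds to small prediction errors.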
