The Burden of Disease: Data analysis, interpretation and linear regression

22/02/2021 CM30_GroupProject_SG30
CM30_GroupProject_SG30
Team 30
2021-02-14
1 Burden of Disease
Mortality rates are a common way to assess a population’s health. Measures often used for such assessments
include child mortality and life expectancy. However, a focus on mortality neglects the suffering of people who
live with a disease. A disease affects, directly or indirectly, a person’s ability to live a normal life, and potential
contributions to one’s community, work, or nation are often lost.
Our study therefore seeks to understand the magnitude of the burden of disease by disease type, and to
identify factors that amplify it.
The metric used to measure disease burden is the DALY, which stands for Disability-Adjusted Life Years.
It combines mortality and morbidity: one DALY represents one year of good health lost to
premature death, disease, or disability.
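As a rough illustration of how the metric adds up, the sketch below assumes the usual decomposition of DALYs into years of life lost to premature death (YLL) and years lived with disability (YLD). The data frame `burden` and all of its numbers are hypothetical and are not taken from our dataset.

```r
# Hypothetical example: all values are invented for illustration only
burden <- data.frame(
  cause = c("cause_a", "cause_b"),
  yll   = c(120, 30),  # years of life lost to premature death
  yld   = c(15, 80)    # years lived with disability
)

# DALY is the sum of the mortality (YLL) and morbidity (YLD) components
burden$daly <- burden$yll + burden$yld
print(burden)
```

A cause with few deaths can therefore still carry a large burden if it causes many years of disability.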
1.1 Data import and inspection
1.1.0.1 Importing data for overall disease burden (DALY)
Rows: 48,698
Columns: 7
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ total_population_gapminder_hyde_un <dbl> …
$ continent <chr> …
$ health_expenditure_per_capita_current_us <dbl> …
$ dal_ys_disability_adjusted_life_years_all_causes_sex_both_age_age_standardized_rate <dbl> …
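# Loading the packages the code below relies on (the setup chunk is not shown, so this list is an assumption)
library(tidyverse)
library(janitor)
library(here)
library(GGally)
library(plotly)
#source: https://ourworldindata.org/burden-of-disease
# Reading first file
daly_total <- read_csv(here::here('Data',"disease-burden-vs-health-expenditure-per-capita.csv")) %>%
  clean_names()
# Checking for variable types
glimpse(daly_total)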
# Changing variable names and variable types
daly_total<- daly_total %>%
mutate(
location=as.factor(entity),
period=year,
health_expenditure_per_capita=health_expenditure_per_capita_current_us,
daly_adjusted=dal_ys_disability_adjusted_life_years_all_causes_sex_both_age_age_standardized_rate,
total_population = total_population_gapminder_hyde_un) %>%
select(location,period,daly_adjusted,health_expenditure_per_capita,total_population)

Although important as a whole, DALY rates can further be divided into three sub-categories by cause:
communicable diseases, non-communicable diseases, and injuries. We therefore included the dataset for each
sub-category below.
1.1.0.2 Adding data for burden of non-communicable diseases
Rows: 6,468
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ dal_ys_disability_adjusted_life_years_non_communicable_diseases_sex_both_age_age_standardized_rate <dbl> …
Rows: 6,468
Columns: 3
$ location <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
$ period <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,…
$ daly_ncds <dbl> 41145.51, 40587.17, 39644.60, 39821.31, 40641.76, 40790.73,…
1.1.0.3 Adding data for burden from communicable, neonatal, maternal and nutritional diseases
Rows: 6,468
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ dal_ys_disability_adjusted_life_years_communicable_maternal_neonatal_and_nutritional_diseases_sex_both_age_age_standardized_rate <dbl> …
#source:https://ourworldindata.org/burden-of-disease
#Reading the file
ncds <- read_csv(here::here('Data',"burden-of-disease-rates-from-ncds.csv")) %>%
  clean_names()
glimpse(ncds)
ncds <- ncds %>%
  mutate(location=as.factor(entity),
         period=year,
         daly_ncds=dal_ys_disability_adjusted_life_years_non_communicable_diseases_sex_both_age_age_standardized_rate) %>%
  select(location,period,daly_ncds)
glimpse(ncds)
#Merging data frames
total <- merge(daly_total,ncds,by=c("location","period"))
#Reading the file
cnmnd <- read_csv(here::here('Data',"burden-of-disease-rates-from-communicable-neonatal-maternal-nutritional-diseases.csv")) %>%
  clean_names()
glimpse(cnmnd)

Rows: 6,468
Columns: 3
$ location <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghan…
$ period <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999…
$ daly_cnmnd <dbl> 51181.84, 47263.29, 38908.25, 36882.69, 38809.79, 38262.20…
1.1.0.4 Adding data for burden from injuries, violence, self-harm and accidents
Rows: 6,468
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ dal_ys_disability_adjusted_life_years_injuries_sex_both_age_age_standardized_rate <dbl> …
Rows: 6,468
Columns: 3
$ location <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
$ period <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,…
$ daly_ivsa <dbl> 11775.715, 13390.289, 12365.622, 11530.363, 13546.148, 1238…
Within each of the three sub-categories of disease cause, there are specific diseases. We included all of these
categories in our dataset.
1.1.0.5 Adding data for disease burden by cause (DALY by cause)
cnmnd <- cnmnd %>%
  mutate(location=as.factor(entity),
         period=year,
         daly_cnmnd=dal_ys_disability_adjusted_life_years_communicable_maternal_neonatal_and_nutritional_diseases_sex_both_age_age_standardized_rate) %>%
  select(location,period,daly_cnmnd)
glimpse(cnmnd)
total <- merge(total,cnmnd,by=c("location","period"))
#Reading the file
ivsa <- read_csv(here::here('Data',"burden-of-disease-rates-from-injuries.csv")) %>%
clean_names()
glimpse(ivsa)
ivsa <- ivsa %>%
  mutate(location=as.factor(entity),
         period=year,
         daly_ivsa=dal_ys_disability_adjusted_life_years_injuries_sex_both_age_age_standardized_rate) %>%
  select(location,period,daly_ivsa)
glimpse(ivsa)
total <- merge(total,ivsa,by=c("location","period"))

Aside from the main variables, additional variables that may contribute to the final DALY rates were
included in the dataset.
1.1.0.6 Adding data for GDP per capita
#source: https://ourworldindata.org/burden-of-disease
# Reading second file
daly_by_cause <- read_csv(here::here('Data',"burden-of-disease-by-cause.csv")) %>%
  clean_names()
#glimpse(daly_by_cause)
daly_by_cause <- daly_by_cause %>%
  mutate(
    location=as.factor(entity),
    period=year,
    daly_conflict_terrorism=dal_ys_disability_adjusted_life_years_conflict_and_terrorism_sex_both_age_all_ages_number,
    daly_hiv_tuberculosis=dal_ys_disability_adjusted_life_years_hiv_aids_and_tuberculosis_sex_both_age_all_ages_number,
    daly_diahrrea_respiratory=dal_ys_disability_adjusted_life_years_diarrhea_lower_respiratory_and_other_common_infectious_diseases_sex_both_age_all_ages_number,
    daly_cvs=dal_ys_disability_adjusted_life_years_cardiovascular_diseases_sex_both_age_all_ages_number,
    daly_self_harm=dal_ys_disability_adjusted_life_years_self_harm_sex_both_age_all_ages_number,
    daly_violence=dal_ys_disability_adjusted_life_years_interpersonal_violence_sex_both_age_all_ages_number,
    daly_nutritional_deficiencies=dal_ys_disability_adjusted_life_years_nutritional_deficiencies_sex_both_age_all_ages_number,
    daly_transport_injuries=dal_ys_disability_adjusted_life_years_transport_injuries_sex_both_age_all_ages_number,
    daly_unintentional_injuries=dal_ys_disability_adjusted_life_years_unintentional_injuries_sex_both_age_all_ages_number,
    daly_maternal_disorders=dal_ys_disability_adjusted_life_years_maternal_disorders_sex_both_age_all_ages_number,
    daly_neonatal_disorders=dal_ys_disability_adjusted_life_years_neonatal_disorders_sex_both_age_all_ages_number,
    daly_other_communicable=dal_ys_disability_adjusted_life_years_other_communicable_maternal_neonatal_and_nutritional_diseases_sex_both_age_all_ages_number,
    daly_nature_forces=dal_ys_disability_adjusted_life_years_exposure_to_forces_of_nature_sex_both_age_all_ages_number,
    daly_chronic_respiratory=dal_ys_disability_adjusted_life_years_chronic_respiratory_diseases_sex_both_age_all_ages_number,
    daly_chronic_liver=dal_ys_disability_adjusted_life_years_cirrhosis_and_other_chronic_liver_diseases_sex_both_age_all_ages_number,
    daly_digestive=dal_ys_disability_adjusted_life_years_digestive_diseases_sex_both_age_all_ages_number,
    daly_tropical_and_malaria=dal_ys_disability_adjusted_life_years_neglected_tropical_diseases_and_malaria_sex_both_age_all_ages_number,
    daly_musculoskeletal=dal_ys_disability_adjusted_life_years_musculoskeletal_disorders_sex_both_age_all_ages_number,
    daly_other_non_communicable=dal_ys_disability_adjusted_life_years_other_non_communicable_diseases_sex_both_age_all_ages_number,
    daly_neurological=dal_ys_disability_adjusted_life_years_neurological_disorders_sex_both_age_all_ages_number,
    daly_mental_and_substance=dal_ys_disability_adjusted_life_years_mental_and_substance_use_disorders_sex_both_age_all_ages_number,
    daly_diabetes_urogenital_blood_endocrine=dal_ys_disability_adjusted_life_years_diabetes_urogenital_blood_and_endocrine_diseases_sex_both_age_all_ages_number,
    daly_neoplasms=dal_ys_disability_adjusted_life_years_neoplasms_sex_both_age_all_ages_number) %>%
  select(location, period, daly_conflict_terrorism, daly_hiv_tuberculosis, daly_diahrrea_respiratory, daly_cvs,
         daly_self_harm, daly_violence, daly_nutritional_deficiencies, daly_transport_injuries,
         daly_unintentional_injuries, daly_maternal_disorders, daly_neonatal_disorders, daly_other_communicable,
         daly_nature_forces, daly_chronic_respiratory, daly_chronic_liver, daly_digestive,
         daly_tropical_and_malaria, daly_musculoskeletal, daly_other_non_communicable, daly_neurological,
         daly_mental_and_substance, daly_diabetes_urogenital_blood_endocrine, daly_neoplasms)
#glimpse(daly_by_cause)
# Merging dataframes
total <- merge(total,daly_by_cause,by=c("location","period"))
#We will consider taking out health expenditure per capita since it has a complete rate of 57.4% and may distort the final data.

#source: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
# Reading third file
gdp <- read_csv(here::here('Data',"API_NY.GDP.PCAP.CD_DS2_en_csv_v2_1926744.csv"),skip=3) %>%
clean_names()
glimpse(gdp)

Rows: 264
Columns: 66
$ country_name <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra"…
$ country_code <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARB", "ARE", "ARG"…
$ indicator_name <chr> "GDP per capita (current US$)", "GDP per capita (curre…
$ indicator_code <chr> "NY.GDP.PCAP.CD", "NY.GDP.PCAP.CD", "NY.GDP.PCAP.CD", …
$ x1960 <dbl> NA, 59.77319, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1807…
$ x1961 <dbl> NA, 59.86087, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1874…
$ x1962 <dbl> NA, 58.45801, NA, NA, NA, NA, NA, 1155.89017, NA, NA, …
$ x1963 <dbl> NA, 78.70639, NA, NA, NA, NA, NA, 850.30474, NA, NA, N…
$ x1964 <dbl> NA, 82.09523, NA, NA, NA, NA, NA, 1173.23821, NA, NA, …
$ x1965 <dbl> NA, 101.10830, NA, NA, NA, NA, NA, 1279.11343, NA, NA,…
$ x1968 <dbl> NA, 129.10832, NA, NA, NA, 224.87811, NA, 1141.08048, …
$ x1969 <dbl> NA, 129.32971, NA, NA, NA, 240.03563, NA, 1329.05866, …
$ x1970 <dbl> NA, 156.5189, NA, NA, 3238.5568, 262.8663, NA, 1322.59…
$ x1971 <dbl> NA, 159.56758, NA, NA, 3498.17365, 295.97104, NA, 1372…
$ x1972 <dbl> NA, 135.31731, NA, NA, 4217.17358, 343.56582, NA, 1408…
$ x1973 <dbl> NA, 143.14465, NA, NA, 5342.16856, 423.13508, NA, 2097…
$ x1974 <dbl> NA, 173.65376, NA, NA, 6319.73903, 777.56068, NA, 2844…
$ x1975 <dbl> NA, 186.5109, NA, NA, 7169.1010, 836.2083, 26847.7944,…
$ x1976 <dbl> NA, 197.4455, NA, NA, 7152.3751, 1007.1404, 30118.1378…
$ x1977 <dbl> NA, 224.2248, NA, NA, 7751.3702, 1123.1433, 33823.3196…
$ x1978 <dbl> NA, 247.3541, NA, NA, 9129.7062, 1193.7456, 28456.7374…
$ x1979 <dbl> NA, 275.7382, NA, NA, 11820.8494, 1563.7035, 33512.741…
$ x1980 <dbl> NA, 272.6553, 710.9816, NA, 12377.4116, 2052.9558, 427…
$ x1981 <dbl> NA, 264.1113, 642.3839, NA, 10372.2328, 2050.7698, 449…
$ x1982 <dbl> NA, NA, 619.9614, NA, 9610.2663, 1864.8707, 40026.1663…
$ x1983 <dbl> NA, NA, 623.4406, NA, 8022.6548, 1699.2152, 34843.1029…
$ x1984 <dbl> NA, NA, 637.7152, 639.4847, 7728.9067, 1672.2788, 3230…
$ x1985 <dbl> NA, NA, 758.2376, 639.8659, 7774.3938, 1606.7558, 2972…
$ x1986 <dbl> 6472.5020, NA, 685.2701, 693.8735, 10361.8160, 1489.84…
$ x1987 <dbl> 7885.7965, NA, 756.2619, 674.7934, 12616.1676, 1543.51…
$ x1988 <dbl> 9764.7900, NA, 792.3031, 652.7743, 14304.3570, 1476.04…
$ x1989 <dbl> 11392.4558, NA, 890.5541, 697.9956, 15166.4379, 1505.5…
$ x1990 <dbl> 12307.3117, NA, 947.7042, 617.2304, 18878.5060, 2009.4…
$ x1991 <dbl> 13496.0031, NA, 865.6927, 336.5870, 19532.5402, 1929.6…
$ x1992 <dbl> 14046.5038, NA, 656.3618, 200.8522, 20547.7118, 2027.8…
$ x1993 <dbl> 14936.8272, NA, 441.2007, 367.2792, 16516.4710, 1996.9…
$ x1994 <dbl> 16241.0465, NA, 328.6733, 586.4163, 16234.8090, 1989.4…
$ x1995 <dbl> 16439.3564, NA, 397.1795, 750.6044, 18461.0649, 2072.7…
$ x1996 <dbl> 16586.0684, NA, 522.6438, 1009.9777, 19017.1746, 2235.…
$ x1997 <dbl> 17927.7496, NA, 514.2952, 717.3806, 18353.0597, 2319.0…
$ x1998 <dbl> 19078.3432, NA, 423.5937, 813.7903, 18894.5215, 2188.9…
$ x1999 <dbl> 19356.2034, NA, 387.7843, 1033.2417, 19261.7105, 2331.…
$ x2000 <dbl> 20620.7006, NA, 556.8363, 1126.6833, 21854.2468, 2605.…
$ x2001 <dbl> 20669.0320, NA, 527.3335, 1281.6594, 22971.5355, 2506.…
$ x2002 <dbl> 20436.8871, 179.4266, 872.4945, 1425.1248, 25066.8822,…
$ x2003 <dbl> 20833.7616, 190.6838, 982.9609, 1846.1188, 32271.9639,…
$ x2004 <dbl> 22569.9750, 211.3821, 1255.5640, 2373.5798, 37969.1750…
$ x2005 <dbl> 23300.0396, 242.0313, 1902.4223, 2673.7873, 40066.2569…
$ x2006 <dbl> 24045.2725, 263.7337, 2599.5665, 2972.7433, 42675.8128…
$ x2007 <dbl> 25835.1327, 359.6932, 3121.9956, 3595.0372, 47803.6936…
$ x2008 <dbl> 27084.7037, 364.6607, 4080.9414, 4370.5401, 48718.4969…
$ x2009 <dbl> 24630.4537, 438.0760, 3122.7808, 4114.1401, 43503.1855…
$ x2010 <dbl> 23512.6026, 543.3030, 3587.8838, 4094.3503, 40852.6668…
$ x2011 <dbl> 24985.9933, 591.1628, 4615.4680, 4437.1429, 43335.3289…
$ x2012 <dbl> 24713.6980, 641.8715, 5100.0958, 4247.6300, 38686.4613…
$ x2013 <dbl> 26189.4355, 637.1655, 5254.8823, 4413.0609, 39538.7667…
$ x2014 <dbl> 26647.9381, 613.8567, 5408.4105, 4578.6320, 41303.9294…
$ x2015 <dbl> 27980.8807, 578.4664, 4166.9797, 3952.8012, 35762.5231…
$ x2016 <dbl> 28281.3505, 509.2187, 3506.0729, 4124.0557, 37474.6654…
$ x2017 <dbl> 29007.6930, 519.8848, 4095.8129, 4531.0208, 38962.8804…
$ x2018 <dbl> NA, 493.7504, 3289.6467, 5284.3802, 41793.0553, 6601.8…
$ x2019 <dbl> NA, 507.1034, 2790.7266, 5353.2449, 40886.3912, 6584.7…
$ x2020 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ x66 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

1.1.0.7 Adding data for smoking percentages
1.1.0.8 Adding data for healthcare expenditure per capita
Rows: 4,675
Columns: 4
$ entity <chr> "Afghan…
$ code <chr> "AFG", …
$ year <dbl> 2002, 2…
$ health_expenditure_per_capita_ppp_constant_2011_international <dbl> 75.9835…
gdp <- gdp %>%
gather(year, gdp,-c(country_name, country_code,indicator_name,indicator_code)) %>%
mutate(location=as.factor(country_name),
period=readr::parse_number(year)) %>%
select(location,period,gdp)
# Merging dataframes
total <- merge(total,gdp,by=c("location","period"))
#skim(total)
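The reshaping step above turns one column per year into one row per country and year. A minimal sketch of the same idea on a toy table is shown below, using tidyr's pivot_longer(), which the tidyr documentation recommends over gather() for new code. The table `gdp_wide_toy` and its values are hypothetical.

```r
library(dplyr)
library(tidyr)
library(readr)

# Toy wide table mimicking the World Bank layout after clean_names()
gdp_wide_toy <- tibble(
  country_name = c("A", "B"),
  x2000 = c(100, 200),
  x2001 = c(110, NA)
)

# One row per country and year; "x2000" is parsed into the number 2000
gdp_long_toy <- gdp_wide_toy %>%
  pivot_longer(cols = starts_with("x"), names_to = "year", values_to = "gdp") %>%
  mutate(location = as.factor(country_name),
         period = parse_number(year)) %>%
  select(location, period, gdp)

print(gdp_long_toy)
```

Missing values in the wide table are carried over as NA rows in the long table, so they still need to be handled later.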
#source: http://ghdx.healthdata.org/record/ihme-data/gbd-2015-smoking-prevalence-1980-2015
#Reading fourth file
smoking_percentage <- read_csv(here::here('Data',"IHME_GBD_2015_SMOKING_PREVALENCE_1980_2015_Y2017M04D05.CSV")) %>%
clean_names()
#skim(smoking_percentage)
smoking_percentage <- smoking_percentage %>%
filter(age_group_name=="Age-standardized",
metric=="Percent",
sex=="Both") %>%
mutate(location=as.factor(location_name),
period=year_id,
smoking_percentage=mean) %>%
select(location,period,smoking_percentage)
#skim(smoking_percentage)
total <- merge(total,smoking_percentage,by=c("location","period"))
#source:https://ourworldindata.org/grapher/annual-healthcare-expenditure-per-capita?tab=chart&time=1995..2014&region=World
#Reading fifth file
healthcare_expenditure <- read_csv(here::here('Data',"annual-healthcare-expenditure-per-capita.CSV")) %>%
clean_names()
glimpse(healthcare_expenditure)

1.1.0.9 Adding data for percentage of population being overweight
Rows: 8,316
Columns: 4
$ entity <chr> "Afghanistan", "A…
$ code <chr> "AFG", "AFG", "AF…
$ year <dbl> 1975, 1976, 1977,…
$ prevalence_of_overweight_adults_both_sexes_who_2019 <dbl> 5.3, 5.5, 5.7, 5.…
1.1.0.10 Adding data for fruit consumption per capita
Rows: 11,028
Columns: 4
$ entity <chr> "Afg…
$ code <chr> "AFG…
$ year <dbl> 1961…
$ fruits_excluding_wine_food_supply_quantity_kg_capita_yr_fao_2020 <dbl> 41.1…
healthcare_expenditure <- healthcare_expenditure %>%
  mutate(location=as.factor(entity),
         period=year,
         healthcare_expenditure=health_expenditure_per_capita_ppp_constant_2011_international) %>%
  select(location,period,healthcare_expenditure)
#glimpse(healthcare_expenditure)
total <- merge(total,healthcare_expenditure,by=c("location","period"))
#source: https://ourworldindata.org/obesity
#Reading sixth file
percentage_overweight <- read_csv(here::here('Data',"share-of-adults-who-are-overweight.csv")) %>%
clean_names()
glimpse(percentage_overweight)
percentage_overweight <- percentage_overweight %>%
  mutate(location=as.factor(entity),
         period=year,
         percentage_overweight=prevalence_of_overweight_adults_both_sexes_who_2019) %>%
  select(location,period,percentage_overweight)
#glimpse(percentage_overweight)
total <- merge(total,percentage_overweight,by=c("location","period"))
#source: https://ourworldindata.org/diet-compositions
#Reading seventh file
fruit_consumption <- read_csv(here::here('Data',"fruit-consumption-per-capita.csv")) %>%
clean_names()
glimpse(fruit_consumption)

1.1.0.11 Adding data for vegetable consumption per capita
Rows: 11,028
Columns: 4
$ entity <chr> "Afghanistan", …
$ code <chr> "AFG", "AFG", "…
$ year <dbl> 1961, 1962, 196…
$ vegetables_food_supply_quantity_kg_capita_yr_fao_2020 <dbl> 36.75, 37.47, 3…
1.1.0.12 Adding data for animal based foods consumption per capita
fruit_consumption <- fruit_consumption %>%
  mutate(location=as.factor(entity),
         period=year,
         fruit_consumption=fruits_excluding_wine_food_supply_quantity_kg_capita_yr_fao_2020) %>%
  select(location,period,fruit_consumption)
#glimpse(fruit_consumption)
total <- merge(total,fruit_consumption,by=c("location","period"))
#Reading eighth file
vegetable_consumption <- read_csv(here::here('Data',"vegetable-consumption-per-capita.csv")) %>%
clean_names()
#Checking for variable types
glimpse(vegetable_consumption)
## Changing variable names and variable types
vegetable_consumption <- vegetable_consumption %>%
  mutate(location=as.factor(entity),
         period=year,
         vegetable_consumption=vegetables_food_supply_quantity_kg_capita_yr_fao_2020) %>%
  select(location,period,vegetable_consumption)
#glimpse(vegetable_consumption)
#Merging dataframes
total <- merge(total,vegetable_consumption,by=c("location","period"))
#skim(total)
#Reading ninth file
animal_protein_consumption <- read_csv(here::here('Data',"share-of-calories-from-animal-protein-vs-gdp-per-capita.csv")) %>%
clean_names()
glimpse(animal_protein_consumption)

Rows: 24,472
Columns: 7
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ total_population_gapminder <dbl> …
$ continent <chr> …
$ share_of_calories_from_animal_protein_fao_2017 <dbl> …
$ real_gdp_per_capita_in_2011us_2011_benchmark_maddison_project_database_2018 <dbl> …
1.1.0.13 Adding data for mean years of schooling
1.1.0.14 Adding data for physicians per 1000 people
#Changing variable names and type
animal_protein_consumption <- animal_protein_consumption %>%
  mutate(location=as.factor(entity),
         period=year,
         animal_protein_consumption=share_of_calories_from_animal_protein_fao_2017) %>%
  select(location,period,animal_protein_consumption)
#glimpse(animal_protein_consumption)
#Merging dataframes
total <- merge(total,animal_protein_consumption,by=c("location","period"))
#glimpse(total)
#source: https://ourworldindata.org/global-education
#Reading file
education_years <- read_csv(here::here('Data',"mean-years-of-schooling-1.csv")) %>%
clean_names()
#glimpse(education_years)
education_years <- education_years %>%
  mutate(location=as.factor(entity),
         period=year,
         education_years=average_total_years_of_schooling_for_adult_population_lee_lee_2016_barro_lee_2018_and_undp_2018) %>%
  select(location,period,education_years)
#glimpse(education_years)
#Merging dataframes
total <- merge(total,education_years,by=c("location","period"))

1.1.0.15 Adding data for nurses per 1000 people
Rows: 1,542
Columns: 4
$ entity <chr> "Afghanistan", "Afghanistan", "A…
$ code <chr> "AFG", "AFG", "AFG", "AFG", "AFG…
$ year <dbl> 2005, 2006, 2007, 2008, 2009, 20…
$ nurses_and_midwives_per_1_000_people <dbl> 0.612000, 0.462000, 0.519000, 0.…
The nurses data had too few observations. Thus, it was not included in our final dataset.
1.1.0.16 Adding data for out-of-pocket expenditure
Rows: 3,002
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ out_of_pocket_expenditure_per_capita_on_healthcare_ppp_usd_who_global_health_expenditure <dbl> …
#source:https://ourworldindata.org/grapher/physicians-per-1000-people
#Reading file
physicians <- read_csv(here::here('Data',"physicians-per-1000-people.csv")) %>%
clean_names()
#glimpse(physicians)
physicians <- physicians %>%
  mutate(location=as.factor(entity),
         period=year,
         physicians_1000=physicians_per_1_000_people) %>%
  select(location,period,physicians_1000)
#glimpse(physicians)
#Merging dataframes
total <- merge(total,physicians,by=c("location","period"))
#source:https://ourworldindata.org/grapher/nurses-and-midwives-per-1000-people?
#Reading file
nurses <- read_csv(here::here('Data',"nurses-and-midwives-per-1000-people.csv")) %>%
clean_names()
glimpse(nurses)
#source:https://ourworldindata.org/grapher/out-of-pocket-expenditure-per-capita-on-healthcare
#Reading file
pocket_exp <- read_csv(here::here('Data',"out-of-pocket-expenditure-per-capita-on-healthcare.csv")) %>%
clean_names()
glimpse(pocket_exp)

1.1.0.17 Adding data for health protection coverage
Rows: 162
Columns: 4
$ entity <chr> "Albania", "…
$ code <chr> "ALB", "DZA"…
$ year <dbl> 2008, 2005, …
$ share_of_population_covered_by_health_insurance_ilo_2014 <dbl> 23.6, 85.2, …
The health coverage data had too few observations. Thus, it was not included in our final dataset.
1.1.0.18 Adding data for literacy rate
Rows: 215
Columns: 4
$ entity <chr> "Afghanistan", "Albania", "Algeria", …
$ code <chr> "AFG", "ALB", "DZA", "ASM", "AND", "A…
$ year <dbl> 2000, 2011, 2006, 1980, 2011, 2011, 1…
$ literacy_rate_cia_factbook_2016 <dbl> 28.1, 96.8, 72.6, 97.0, 100.0, 70.4, …
The literacy data had too few observations. Thus, it was not included in our final dataset.
1.1.0.19 Adding data for grouping locations into continents
Rows: 194
Columns: 2
$ continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Africa",…
$ country <chr> "Algeria", "Angola", "Benin", "Botswana", "Burkina", "Burun…
pocket_exp <- pocket_exp %>%
  mutate(location=as.factor(entity),
         period=year,
         pocket_per_cap=out_of_pocket_expenditure_per_capita_on_healthcare_ppp_usd_who_global_health_expenditure) %>%
  select(location,period,pocket_per_cap)
#Merging dataframes
total <- merge(total,pocket_exp,by=c("location","period"))
#Reading file
health_protect <- read_csv(here::here('Data',"health-protection-coverage.csv")) %>%
clean_names()
glimpse(health_protect)
#Reading file
literacy <- read_csv(here::here('Data',"literacy-rate-by-country.csv")) %>%
clean_names()
glimpse(literacy)
#source: https://github.com/dbouquin/IS_608/blob/master/NanosatDB_munging/Countries-Continents.csv
#Reading file
continents <- read_csv(here::here('Data',"Continents.csv")) %>%
clean_names()
glimpse(continents)

Rows: 194
Columns: 2
$ location <fct> Algeria, Angola, Benin, Botswana, Burkina, Burundi, Cameroo…
$ continent <fct> Africa, Africa, Africa, Africa, Africa, Africa, Africa, Afr…
1.1.0.20 Dealing with NAs
After including all potentially relevant and significant variables in our dataset, we made an initial exploration
of the data.
1.2 Exploratory Data Analysis
1.2.0.1 DALY Rates per Continent
continents <- continents %>%
mutate(location=as.factor(country),
continent=as.factor(continent))%>%
select(location, continent)
glimpse(continents)
#Merging dataframes
total <- merge(total,continents,by=c("location"))
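Because merge() keeps only rows whose keys match in both tables by default, differently spelled country names silently drop out. The sketch below shows one way to list unmatched names before merging. The toy tables are hypothetical, although the spelling "Burkina" does appear in the continents file shown above.

```r
library(dplyr)

# Toy stand-ins for `total` and `continents` (values are hypothetical)
total_toy <- tibble(location = c("Algeria", "Burkina Faso", "Benin"),
                    daly_adjusted = c(1, 2, 3))
continents_toy <- tibble(location = c("Algeria", "Burkina", "Benin"),
                         continent = "Africa")

# Locations in the main table with no match in the lookup table
unmatched <- anti_join(total_toy, continents_toy, by = "location")
print(unmatched$location)

# An inner join (merge's default) would therefore drop these rows
merged_toy <- merge(total_toy, continents_toy, by = "location")
print(nrow(merged_toy))
```

Running such a check after each merge would show how many locations are lost at each step, which matters here because the sources use different naming conventions.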
#Adding variables of per capita healthcare expenditure - per capita gdp
total <- total%>%
mutate(healthcare_gdp_rate = healthcare_expenditure/gdp)
#skim(total)
total <- total %>%
na.omit()
#skim(total)
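Since na.omit() discards every row with at least one missing value, it can help to check how complete each column is first. Below is a small sketch on a hypothetical data frame; the column names only mimic ours and the values are invented.

```r
# Hypothetical data frame with missing values
toy <- data.frame(
  daly_adjusted = c(10, 20, 30, 40),
  gdp = c(1, NA, 3, 4),
  healthcare_expenditure = c(NA, NA, 5, 6)
)

# Share of non-missing values per column (the "complete rate")
complete_rate <- colMeans(!is.na(toy))
print(complete_rate)

# Rows that survive na.omit() versus rows in the original data
rows_kept <- nrow(na.omit(toy))
print(c(before = nrow(toy), after = rows_kept))
```

A single sparsely populated column can remove most of the rows, which is one reason to drop such variables before calling na.omit().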
#Selecting data only from 1980 - onward (to gain better insights on the recent situation)
total_short <- total %>%
  filter(period>=1980)
#Re-coding DALY variables as averages per continent, per year
#Note: daly_ivsa is divided by 10,000, whereas the other rates are divided by 100,000
total_cont <- total_short %>%
  group_by(period,continent) %>%
  summarise(daly_adjusted = mean(daly_adjusted/100000),
            daly_cnmnd = mean(daly_cnmnd/100000),
            daly_ncds = mean(daly_ncds/100000),
            daly_ivsa = mean(daly_ivsa/10000))
#Plotting for average DALY rates per capita accumulated from 1980 to 2017
ggplot(total_cont, aes(x = continent, y = daly_adjusted, fill = continent)) +
  geom_bar(stat = "identity") +
  labs(x= "Continent", y = "Overall DALYs", title = "Accumulated Average DALYs per Capita, per Continent 1980 - 2017")

ggplot(total_cont, aes(x = continent, y = daly_cnmnd, fill = continent)) +
  geom_bar(stat = "identity") +
  labs(x= "Continent", y = "Communicable Diseases DALYs", title = "Accumulated Average DALYs per Capita from Communicable Diseases, per Continent 1980 - 2017")

ggplot(total_cont, aes(x = continent, y = daly_ncds, fill = continent)) +
  geom_bar(stat = "identity") +
  labs(x= "Continent", y = "Non-Communicable Diseases DALYs", title = "Accumulated Average DALYs per Capita from Non-Communicable Diseases, per Continent 1980 - 2017")

ggplot(total_cont, aes(x = continent, y = daly_ivsa, fill = continent)) +
  geom_bar(stat = "identity") +
  labs(x= "Continent", y = "Injuries DALYs", title = "Accumulated Average DALYs per Capita from Injuries, per Continent 1980 - 2017")

Overall, we find that Africa has the highest accumulated average DALY rate per capita of all continents (c. 90), followed
by Asia (c. 50) and Oceania (c. 40). The stark contrast between Africa and the other continents is mainly due to its high
accumulated average for communicable diseases. In this category, Africa more than triples the second-highest
continent (c. 55 for Africa compared to c. 17 for Asia).
When it comes to non-communicable diseases and injuries, rates are fairly even. For non-communicable diseases, DALY
rates range from c. 27 to 33 (North America being the lowest and Africa the highest). Although much lower,
injury rates range from c. 4 to 6 (Europe being the lowest and Africa the highest).
Consequently, communicable diseases are found to carry the highest burden in the population, with Africa bearing (or
having borne) the largest share. We took a closer look at these rates to better understand their evolution through
time.
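To make the term "accumulated average" concrete: when geom_bar(stat = "identity") receives several rows for the same continent, it stacks them by default, so each bar's height is the sum over years of that continent's yearly mean. The sketch below reproduces that arithmetic on hypothetical numbers; the table `toy` is not part of our data.

```r
library(dplyr)

# Hypothetical country-level rates for two years
toy <- tibble(
  continent = c("X", "X", "X", "X", "Y", "Y"),
  period    = c(2000, 2000, 2001, 2001, 2000, 2001),
  daly      = c(0.2, 0.4, 0.1, 0.3, 0.5, 0.7)
)

# Step 1: yearly mean per continent (the same shape as total_cont)
yearly_mean <- toy %>%
  group_by(period, continent) %>%
  summarise(daly = mean(daly), .groups = "drop")

# Step 2: stacked bar height = sum of the yearly means per continent
bar_height <- yearly_mean %>%
  group_by(continent) %>%
  summarise(accumulated = sum(daly))

print(bar_height)
```

Because the heights are sums over years, they grow with the number of years in the data and should be compared across continents rather than read as annual rates.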
1.2.1 Communicable Diseases
graph1 <- total_cont %>%
  ggplot(aes(x=period, y=daly_cnmnd, fill=continent, text=continent)) +
  geom_area(alpha = 1) +
  theme(legend.position="none") +
  labs(x= "Year", y = "DALY for communicable disease", title = "Time Series Average DALYs per Capita from Communicable Diseases per Continent")
ggplotly(graph1)
[Figure: Time Series Average DALYs per Capita from Communicable Diseases per Continent. x-axis: Year; y-axis: DALY for communicable disease]

As seen from the graph, Africa’s communicable DALY rate seems to have been in decline since 2008. However, this
continent has consistently ranked higher than the others, which leaves room to further consider the causes
and potential solutions.
According to the Our World in Data report, neonatal disorders are the top communicable disease category in terms of
total share of burden (7.45% of all causes). It is also known that there is a strong negative correlation between GDP and
DALY from communicable diseases. Similarly, a negative correlation is found between health expenditure per capita
and DALY from communicable diseases.
What about healthcare expenditure as a percentage of GDP?
ggplot(total_short, aes(x = healthcare_gdp_rate, y = daly_cnmnd, color = continent))+
  geom_point()+
  labs(x= "Healthcare Expenditure as percentage of GDP", y = "DALY from Communicable Diseases", title = "Rates due to Proportion of GDP spent on Healthcare")
# No clear correlation yet, but interesting
total_short%>%
select(daly_cnmnd, healthcare_gdp_rate, gdp, pocket_per_cap)%>%
ggpairs()

A higher GDP per country seems to have a significant negative correlation with DALY from communicable diseases. However, the proportion of GDP used for
healthcare seems to have a significant positive correlation with DALY from communicable diseases. GDP seems to have a significant negative correlation with the
proportion of GDP spent on healthcare. This could indicate that poorer countries are more likely to have to combat communicable diseases.
Consequently, they spend a greater proportion of their GDP on healthcare than richer countries. Out-of-pocket expenditure is also highly negatively
correlated with DALY from communicable diseases, although highly positively correlated with GDP. This leads to the interpretation that poor countries in which the
population is individually responsible for investing in its medical care are most likely to have higher DALY rates from communicable diseases.
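The correlations read off the ggpairs() panel can also be checked numerically. A minimal sketch with base R's cor.test() follows; the two vectors are hypothetical and only mimic the direction of the GDP relationship described above, and the log transform is our own assumption for skewed variables.

```r
# Hypothetical values: higher GDP paired with lower communicable DALY rates
gdp_toy        <- c(500, 1000, 5000, 10000, 20000, 40000)
daly_cnmnd_toy <- c(40000, 30000, 12000, 6000, 2500, 1200)

# Pearson correlation on the log scale, with a significance test
result <- cor.test(log(gdp_toy), log(daly_cnmnd_toy))
print(result$estimate)  # correlation coefficient
print(result$p.value)   # p-value of the test
```

On the real data, the same call could be applied to pairs of columns in total_short, bearing in mind that a correlation alone does not establish the direction of causality.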
1.2.2 Injuries
With DALY rates for injuries and additional causes being similar across all continents, we decided to first take a
closer look at which types of causes were most prominent overall.
#This plot shows injury related DALY in a stacked bar chart.
start <- total %>%
  group_by(continent) %>%
  summarise(daly_conflict_terrorism = mean(daly_conflict_terrorism/total_population),
            daly_self_harm = mean(daly_self_harm/total_population),
            daly_violence = mean(daly_violence/total_population),
            daly_transport_injuries = mean(daly_transport_injuries/total_population),
            daly_nature_forces = mean(daly_nature_forces/total_population),
            daly_unintentional_injuries = mean(daly_unintentional_injuries/total_population))
pivot <- pivot_longer(start, cols=c(daly_conflict_terrorism, daly_self_harm, daly_violence, daly_transport_injuries, daly_unintentional_injuries, daly_nature_forces), names_to = "diseases", values_to = "value")
#select columns from dataset
plots <- pivot %>%
  select(continent,diseases,value)
ggplot(plots, aes(fill=diseases, y=value, x=continent)) +
  geom_bar(position="stack", stat="identity") +
  labs(x= "Continent", y = "Injuries DALYs", title = "Accumulated Average DALYs per Capita from Injuries, per Continent 1980 - 2017")

#Plot on Terrorism and Violence
terrorism_violence <- start %>%
  select(daly_conflict_terrorism, daly_violence, continent)
terrorism_violence <- pivot_longer(terrorism_violence, c(daly_conflict_terrorism, daly_violence),
                                   names_to = "diseases", values_to = "value")
#select columns from dataset
terrorism_violence <- terrorism_violence %>%
  select(diseases,value,continent)
#stacked bar chart
ggplot(terrorism_violence, aes(fill=diseases, y=value, x=continent)) +
  geom_bar(position="stack", stat="identity") +
  labs(x= "Continent", y = "Injuries DALYs", title = "Accumulated Average DALYs per Capita from Terrorism and Violence 1980 - 2017")

total_short%>%
select(daly_ivsa, gdp, daly_mental_and_substance, physicians_1000, education_years)%>%
ggpairs()

1.2.3 Non-Communicable Diseases
Hide

start1 <- total%>%
group_by(continent)%>%
summarise(daly_cvs = mean(daly_cvs/total_population), daly_nutritional_deficiencies = mean(daly_nutritional_deficie
ncies/total_population), daly_maternal_disorders = mean(daly_maternal_disorders/total_population), daly_muscu
loskeletal = mean(daly_musculoskeletal/total_population), daly_other_non_communicable = mean(daly_other_non_c
ommunicable/total_population), daly_neurological = mean(daly_neurological/total_population), daly_mental_and_
substance = mean(daly_mental_and_substance/total_population), daly_diabetes_urogenital_blood_endocrine = mean
(daly_diabetes_urogenital_blood_endocrine/ total_population), daly_neoplasms = mean(daly_neoplasms/total_popu
lation), daly_chronic_liver = mean(daly_chronic_liver/total_population))
pivot1 <- pivot_longer(start1, c(daly_cvs,daly_nutritional_deficiencies,daly_maternal_disorders,daly_musculoskeletal,
daly_other_non_communicable,daly_neurological,daly_mental_and_substance,daly_diabetes_urogenital_blood_endocr
ine,daly_neoplasms,daly_chronic_liver), names_to = "diseases",values_to = "value")
#select columns from data set
total_short_ncds <- pivot1%>%
select(continent,diseases,value)
#stacked bar chart
# This stacked bar chart again shows the DALYs for non-communicable diseases, adjusted for population size.
# The bars are colored by category of non-communicable disease.
# Asia has the highest DALY for non-communicable diseases, closely followed by Europe. Several factors may
# explain why DALY remains high in both regions. For Asia, limited affordability, a shortage of doctors, and
# healthcare that falls short of the highest standards may all contribute. Europe's aging population makes
# non-communicable diseases more likely among its people. As seen in the earlier graphs, as nations modernise
# and develop, their populations transition from suffering from communicable diseases towards
# non-communicable diseases, which come with age.
ggplot(total_short_ncds, aes(fill=diseases, y=value, x=continent)) +
geom_bar(position="stack", stat="identity") +
labs(x= "Continent", y = "Non-Comm DALYs",
  title = "Accumulated Average DALYs per Capita from Non-Comm, per Continent 1980 - 2017")
Hide

# Looking into CVS in more detail.
ggplot(total_short, aes(x= continent, y = daly_cvs))+
geom_col()+
labs(x= "Continent", y = "Daly due to CVS related conditions",
  title = "DALY per capita due to CVS conditions per continent")
Hide
# Looking into neoplasms in more detail.
ggplot(total_short, aes(x= continent, y = daly_neoplasms))+
geom_col()+
labs(x= "Continent", y = "Daly due to neoplasm", title = "DALY per capita due to neoplasms per continent")

Hide
# Looking into diabetes, urogenital, blood, endocrine in more detail.
ggplot(total_short, aes(x= continent, y = daly_diabetes_urogenital_blood_endocrine))+
geom_col()+
labs(x= "Continent", y = "Daly due to diabetes, urogenital, blood and endocrine related conditions.", title
= "DALY per capita due to diabetes, urogenital, blood and endocrine related conditions per continent")

Hide
ggplot(total_short, aes(x= continent, y = daly_mental_and_substance))+
geom_col()+
labs(x= "Continent", y = "Daly due to mental and substance related conditions.",
  title = "DALY per capita due to mental and substance related conditions per continent")

Hide
ggplot(total_short, aes(x = healthcare_gdp_rate, y = daly_ncds/100000, color = continent))+
geom_point()+
labs(x= "Healthcare Expenditure as percentage of GDP", y = "DALY from Non- Communicable Diseases", title =
"Rates due to Proportion of GDP spent on Healthcare")

Hide
total_short%>%
select(daly_cnmnd, healthcare_gdp_rate, gdp, pocket_per_cap)%>%
ggpairs()

1.3 Regression analysis
Although DALY rates are highly complex and shaped by many societal and economic variables, we restricted the
analysis to variables that had enough data to be used. These variables, which affect both DALY rates by cause
and general DALY rates, can be divided into several categories.
Diet variables (fruit consumption per capita per year, animal protein as a percentage of total daily calories,
vegetable consumption, percentage of the population being overweight), healthcare variables (annual healthcare
expenditure, out-of-pocket expenditure on healthcare, healthcare expenditure as a share of GDP, and number of
physicians per 1,000 people), living habits (smoking percentages), and other demographics (years of education).
In addition to these elements, we considered the effect of each continent separately by transforming the
continents into dummy variables.
1.3.0.1 Models 0 and 1
Hide
#Transforming continent factors into dummy variables
total <- total %>%
mutate(
  Asia   = case_when(continent == "Asia" ~ 1, TRUE ~ 0),
  Europe = case_when(continent == "Europe" ~ 1, TRUE ~ 0),
  NorthA = case_when(continent == "North America" ~ 1, TRUE ~ 0),
  Africa = case_when(continent == "Africa" ~ 1, TRUE ~ 0),
  SouthA = case_when(continent == "South America" ~ 1, TRUE ~ 0))
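As an alternative to hand-built dummies, `lm()` expands a factor into indicator columns on its own. The sketch below uses a small made-up data frame (the `toy` object and its values are invented for illustration and are not taken from the study data) to show the columns that `model.matrix()` generates. One level is absorbed into the intercept as the baseline.

```r
# Toy data: values are invented purely for illustration
toy <- data.frame(
  continent = factor(c("Africa", "Asia", "Europe", "Asia", "Africa", "Europe")),
  daly      = c(60, 40, 35, 42, 58, 33)
)

# model.matrix() shows the indicator columns lm() would build from the factor;
# the first level ("Africa") becomes the baseline absorbed by the intercept
dummies <- model.matrix(~ continent, data = toy)
print(colnames(dummies))

# Fitting with the factor directly: the intercept is the baseline mean and
# each other coefficient is that continent's difference from the baseline
toy_fit <- lm(daly ~ continent, data = toy)
print(coef(toy_fit))
```

With this coding each continent coefficient reads as a difference from the baseline continent, which is the same interpretation the 0/1 columns give relative to whichever continent has no dummy.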
Hide

Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
fruit_consumption + vegetable_consumption + animal_protein_consumption +
education_years + physicians_1000 + pocket_per_cap + healthcare_gdp_rate +
daly_ivsa + daly_ncds + daly_cnmnd, data = total, subset = gdp)
Residuals:
Min 1Q Median 3Q Max
-6.602e-11 -4.258e-12 -5.730e-13 2.894e-12 1.220e-10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.867e-12 1.064e-11 -3.640e-01 0.716590
smoking_percentage -1.016e-10 1.900e-11 -5.346e+00 2.49e-07 ***
percentage_overweight -6.735e-13 1.006e-13 -6.694e+00 2.26e-10 ***
fruit_consumption -7.039e-14 2.012e-14 -3.499e+00 0.000579 ***
vegetable_consumption -1.385e-14 2.245e-14 -6.170e-01 0.537948
animal_protein_consumption 5.153e-12 7.264e-13 7.094e+00 2.35e-11 ***
education_years 1.769e-12 6.270e-13 2.821e+00 0.005277 **
physicians_1000 3.200e-12 1.696e-12 1.887e+00 0.060675 .
pocket_per_cap -2.376e-14 8.464e-15 -2.807e+00 0.005506 **
healthcare_gdp_rate 3.037e-11 1.838e-11 1.652e+00 0.100074
daly_ivsa 1.000e+00 1.045e-15 9.570e+14 < 2e-16 ***
daly_ncds 1.000e+00 3.813e-16 2.623e+15 < 2e-16 ***
daly_cnmnd 1.000e+00 1.157e-16 8.645e+15 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.446e-11 on 195 degrees of freedom
(817 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 3.025e+31 on 12 and 195 DF, p-value: < 2.2e-16
# lm0 was created to show that daly_ivsa, daly_ncds and daly_cnmnd make up daly_adjusted. As a result, these
# three variables are not included in the later linear models.
# Note: gdp is passed positionally here, so lm() matches it to its subset argument (see "subset = gdp" in the
# Call output) rather than treating it as a predictor.
lm0= lm(daly_adjusted ~ smoking_percentage+ percentage_overweight+ fruit_consumption+ vegetable_consumption+
  animal_protein_consumption+ education_years+ physicians_1000+ pocket_per_cap+ healthcare_gdp_rate +
  daly_ivsa + daly_ncds + daly_cnmnd, gdp, data = total)
summary(lm0)
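The perfect fit in the lm0 output (R-squared of 1, coefficients of exactly 1 on the three component variables, residuals around 1e-11) is what an exact accounting identity produces. A minimal sketch with simulated data, where `part_a`, `part_b`, `part_c` and `other` are hypothetical stand-ins rather than the study's variables:

```r
set.seed(1)
n <- 200
# Three simulated components and one unrelated covariate
part_a <- runif(n, 0, 100)
part_b <- runif(n, 0, 100)
part_c <- runif(n, 0, 100)
other  <- rnorm(n)
# The response is exactly the sum of its components
total_y <- part_a + part_b + part_c

identity_fit <- lm(total_y ~ other + part_a + part_b + part_c)

# R-squared computed by hand from the residuals
r2 <- 1 - sum(residuals(identity_fit)^2) / sum((total_y - mean(total_y))^2)

# Component coefficients come out as 1 and the unrelated covariate as 0,
# up to floating-point error
print(round(coef(identity_fit), 6))
print(r2)
```

Once the components are in the model, every other predictor has nothing left to explain, so its coefficient collapses towards zero. That is why the components have to be left out before the remaining variables can be assessed.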
Hide
lm1= lm(daly_adjusted ~ smoking_percentage+ percentage_overweight+ fruit_consumption+ vegetable_consumption+
  animal_protein_consumption+ education_years+ physicians_1000+ pocket_per_cap+ healthcare_gdp_rate + gdp,
  data = total)
summary(lm1)

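Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
fruit_consumption + vegetable_consumption + animal_protein_consumption +
education_years + physicians_1000 + pocket_per_cap + healthcare_gdp_rate +
gdp, data = total)
Residuals:
Min 1Q Median 3Q Max
-25122 -6233 -812 4866 59542
Coefficients:
Estimate Std. Error t value Pr(>|t|)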
(Intercept) 7.921e+04 1.894e+03 41.814 < 2e-16 ***
smoking_percentage -2.391e+04 6.139e+03 -3.894 0.000105 ***
percentage_overweight -2.590e+02 3.531e+01 -7.335 4.53e-13 ***
fruit_consumption -5.051e+01 8.655e+00 -5.836 7.19e-09 ***
vegetable_consumption -3.828e+01 6.980e+00 -5.484 5.26e-08 ***
animal_protein_consumption -1.799e+03 2.719e+02 -6.615 6.00e-11 ***
education_years -1.270e+03 2.027e+02 -6.268 5.41e-10 ***
physicians_1000 7.964e+02 4.986e+02 1.597 0.110538
pocket_per_cap -1.011e+01 2.508e+00 -4.031 5.96e-05 ***
healthcare_gdp_rate 6.335e+03 6.803e+03 0.931 0.351968
gdp 6.325e-02 3.290e-02 1.923 0.054808 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11340 on 1014 degrees of freedom
Multiple R-squared: 0.6371, Adjusted R-squared: 0.6335
F-statistic: 178 on 10 and 1014 DF, p-value: < 2.2e-16
Already from model one we reach an adjusted R-squared of 0.6335, meaning these factors can explain approximately
63 percent of the fluctuation in general DALY rates. For the models below, the variable with the highest p-value
was dropped, one variable at a time.
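The manual procedure described above (refit, drop the least significant term, repeat) can be sketched as a loop. This uses R's built-in `mtcars` data and a hypothetical helper `drop_by_p()`. The 0.05 threshold is an assumption for illustration, not the cutoff used in this study.

```r
# Hypothetical helper: backward elimination by largest p-value.
# Assumes numeric predictors, so coefficient names equal term names.
drop_by_p <- function(fit, alpha = 0.05) {
  repeat {
    coefs <- summary(fit)$coefficients
    pvals <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)", drop = FALSE]
    if (nrow(pvals) == 0) break            # nothing left to drop
    worst <- which.max(pvals[, 1])         # least significant term
    if (pvals[worst, 1] <= alpha) break    # every remaining term passes
    fit <- update(fit, as.formula(paste(". ~ . -", rownames(pvals)[worst])))
  }
  fit
}

start_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced   <- drop_by_p(start_fit)
print(formula(reduced))
print(summary(reduced)$coefficients)
```

Dropping one term per refit matters because p-values shift each time the model changes, so removing several terms at once based on a single summary can discard variables that would have become significant.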
Hide
lm2 = lm( daly_adjusted~smoking_percentage+ percentage_overweight+ vegetable_consumption+ animal_protein_consumption+
education_years+ physicians_1000+ pocket_per_cap+ fruit_consumption + gdp, data = total)
summary(lm2)

Call:
vegetable_consumption + animal_protein_consumption + education_years +
physicians_1000 + pocket_per_cap + fruit_consumption + gdp,
data = total)
Residuals:
-25123 -6213 -846 4897 59243
Coefficients:
(Intercept) 8.034e+04 1.453e+03 55.295 < 2e-16 ***
physicians_1000 9.062e+02 4.845e+02 1.871 0.061701 .
gdp 5.507e-02 3.170e-02 1.737 0.082661 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F-statistic: 197.7 on 9 and 1015 DF, p-value: < 2.2e-16
After dropping the healthcare-to-GDP percentage, out-of-pocket expenditure remains significant.
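Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
vegetable_consumption + animal_protein_consumption + education_years +
pocket_per_cap + fruit_consumption + physicians_1000, data = total)
Residuals:
Min 1Q Median 3Q Max
-25203 -6215 -447 4669 59274
Coefficients:
Estimate Std. Error t value Pr(>|t|)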
(Intercept) 79726.798 1410.663 56.517 < 2e-16 ***
smoking_percentage -24292.378 6127.035 -3.965 7.86e-05 ***
percentage_overweight -266.174 35.115 -7.580 7.76e-14 ***
vegetable_consumption -40.473 6.894 -5.871 5.87e-09 ***
animal_protein_consumption -1741.370 258.204 -6.744 2.58e-11 ***
education_years -1222.730 201.228 -6.076 1.74e-09 ***
pocket_per_cap -7.568 2.052 -3.688 0.000238 ***
fruit_consumption -47.366 8.442 -5.611 2.59e-08 ***
physicians_1000 859.752 484.202 1.776 0.076097 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1.3.0.2 Drop physicians_1000
Hide
lm3=lm( daly_adjusted~smoking_percentage+ percentage_overweight+ vegetable_consumption+ animal_protein_consumption+
  education_years+ pocket_per_cap+ fruit_consumption + physicians_1000, data = total)
summary(lm3)
Hide

Call:
pocket_per_cap + fruit_consumption, data = total)
Residuals:
-25602 -6210 -408 4778 59363
Coefficients:
(Intercept) 78595.709 1259.974 62.379 < 2e-16 ***
smoking_percentage -21927.213 5986.815 -3.663 0.000263 ***
education_years -1107.609 190.699 -5.808 8.44e-09 ***
pocket_per_cap -6.709 1.996 -3.361 0.000806 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All variables are now significant, leading to a model with an adjusted R-squared of 0.632.
1.3.1 Stepwise regression & VIF exam
We can also use stepwise regression to find the optimal model. The stepwise method is more systematic than dropping
variables manually: at a later step it can add a dropped variable back if doing so improves the model (lowers the
model's AIC), and it re-examines the remaining variables after each addition or removal.
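A small sketch of what `step()` optimises, using the built-in `mtcars` data rather than the study data. For `lm` fits, `step()` reports the value from `extractAIC()`, which is n·log(RSS/n) + 2k for k estimated coefficients. This differs from the likelihood-based `AIC()` by a constant that depends only on n, which would explain why the AIC printed by `step()` (19149.68 for the starting model) does not match the AIC row of the regression table (22060.50) even though both describe the same model.

```r
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
# trace = 0 suppresses the per-step printout
stepped <- step(full, direction = "both", trace = 0)

n   <- nrow(mtcars)
rss <- sum(residuals(full)^2)
k   <- length(coef(full))                    # number of estimated coefficients

aic_step_scale <- n * log(rss / n) + 2 * k   # the scale step() reports for lm fits
aic_loglik     <- AIC(full)                  # likelihood-based AIC

# The two differ by a constant in n: n * (log(2 * pi) + 1) + 2
print(c(step_scale = aic_step_scale,
        loglik     = aic_loglik,
        offset     = n * (log(2 * pi) + 1) + 2))
print(formula(stepped))
```

Because the offset is the same for every model fitted to the same rows, both scales rank the candidate models identically. With n = 1025 the offset is about 2910.82, which matches the gap between the two AIC values reported for model 1.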
lm4=lm( daly_adjusted~smoking_percentage+ percentage_overweight+ vegetable_consumption+ animal_protein_consumption+
  education_years+ pocket_per_cap+ fruit_consumption, data = total)
summary(lm4)
Hide
fit1_step=step(lm1,direction="both")

Start: AIC=19149.68
daly_adjusted ~ smoking_percentage + percentage_overweight +
gdp
Df Sum of Sq RSS AIC
- healthcare_gdp_rate 1 111483768 1.3048e+11 19149
<none> 1.3036e+11 19150
- physicians_1000 1 327963566 1.3069e+11 19150
- gdp 1 475232954 1.3084e+11 19151
- smoking_percentage 1 1949782662 1.3231e+11 19163
- pocket_per_cap 1 2089322026 1.3245e+11 19164
- vegetable_consumption 1 3866128944 1.3423e+11 19178
- fruit_consumption 1 4378585222 1.3474e+11 19182
- education_years 1 5050374107 1.3541e+11 19187
- animal_protein_consumption 1 5625702511 1.3599e+11 19191
- percentage_overweight 1 6917422353 1.3728e+11 19201
Step: AIC=19148.55
daly_adjusted ~ smoking_percentage + percentage_overweight +
education_years + physicians_1000 + pocket_per_cap + gdp
Df Sum of Sq RSS AIC
<none> 1.3048e+11 19149
- gdp 1 387923531 1.3086e+11 19150
+ healthcare_gdp_rate 1 111483768 1.3036e+11 19150
- physicians_1000 1 449760026 1.3093e+11 19150
- smoking_percentage 1 1912380306 1.3239e+11 19162
- pocket_per_cap 1 2075803261 1.3255e+11 19163
- vegetable_consumption 1 4040506535 1.3452e+11 19178
- fruit_consumption 1 4418260882 1.3489e+11 19181
- education_years 1 5027016883 1.3550e+11 19185
- animal_protein_consumption 1 6245781148 1.3672e+11 19194
- percentage_overweight 1 7139008696 1.3761e+11 19201
Call:
education_years + physicians_1000 + pocket_per_cap + gdp,
data = total)
Residuals:
-25123 -6213 -846 4897 59243
Coefficients:
(Intercept) 8.034e+04 1.453e+03 55.295 < 2e-16 ***
physicians_1000 9.062e+02 4.845e+02 1.871 0.061701 .
gdp 5.507e-02 3.170e-02 1.737 0.082661 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hide
summary(fit1_step)

smoking_percentage percentage_overweight
1.668205 2.913117
fruit_consumption vegetable_consumption
1.425008 1.502601
animal_protein_consumption education_years
3.110474 3.088943
physicians_1000 pocket_per_cap
3.754590 3.210832
gdp
2.999374
From the final result we can see that six variables are significant with a p-value lower than 0.1. Expense and
pocket_per_cap are both significant in this case. However, dropping one of them may make the other insignificant.
This could be because the two have a joint effect on the burden of disease. We can choose between these two models
according to the confidence level we require.
Continents were also considered as part of the model to see their effect.
Call:
lm(formula = daly_adjusted ~ percentage_overweight + vegetable_consumption +
animal_protein_consumption + education_years + pocket_per_cap +
fruit_consumption + Asia + Africa + NorthA + Europe + SouthA +
Asia + Africa + NorthA + Europe + SouthA, data = total)
Residuals:
-29935 -4148 -431 3996 50866
Coefficients:
(Intercept) 66522.998 2345.279 28.365 < 2e-16 ***
education_years -534.613 165.120 -3.238 0.00124 **
pocket_per_cap -8.669 1.678 -5.168 2.86e-07 ***
Asia -7140.437 1807.499 -3.950 8.34e-05 ***
Africa 13792.577 1918.746 7.188 1.27e-12 ***
NorthA -9335.463 1794.310 -5.203 2.38e-07 ***
Europe -5196.987 1650.294 -3.149 0.00169 **
SouthA -9146.724 1917.915 -4.769 2.12e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hide
vif(fit1_step)
Hide
# The continent dummies appear twice in the formula; R keeps each term only once, so the fit is unaffected.
fit = lm(daly_adjusted~ percentage_overweight+ vegetable_consumption+ animal_protein_consumption+ education_years+
  pocket_per_cap+ fruit_consumption+Asia+Africa+NorthA+Europe+SouthA+ Asia+ Africa+ NorthA+ Europe+ SouthA,
  data = total)
print(summary(fit))
Hide
print(vif(fit))

percentage_overweight vegetable_consumption
3.908936 1.772149
animal_protein_consumption education_years
2.885232 3.048594
pocket_per_cap fruit_consumption
2.136245 1.318169
Asia Africa
7.257346 6.590683
NorthA Europe
3.209642 7.556342
SouthA
2.440735
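The variance inflation factors printed above come from `vif()` (assumed here to be the function from the `car` package, since the loading code is not shown). For a numeric predictor j, the VIF is 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A base-R sketch on the built-in `mtcars` data, with `manual_vif()` as a hypothetical helper:

```r
# Hypothetical helper: VIF for each column of a data frame of numeric predictors
manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    # Auxiliary regression of predictor v on all the other predictors
    aux <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(aux)$r.squared)
  })
}

X    <- mtcars[, c("wt", "hp", "qsec", "drat")]
vifs <- manual_vif(X)
print(round(vifs, 2))

# Cross-check: the same values sit on the diagonal of the inverse correlation matrix
print(round(diag(solve(cor(X))), 2))
```

A VIF of 1 means a predictor is uncorrelated with the others, and common rules of thumb flag values above 5 or 10. In the output above, the Asia and Europe dummies (about 7.3 and 7.6) are the highest, so their standard errors are the most inflated by overlap with the other terms.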
1.3.2 Interpretation of the final model
Our final model had 11 variables.
Call:
lm(formula = daly_adjusted ~ percentage_overweight + vegetable_consumption +
animal_protein_consumption + education_years + pocket_per_cap +
fruit_consumption + Asia + Africa + NorthA + Europe + SouthA +
Asia + Africa + NorthA + Europe + SouthA, data = total)
Residuals:
-29935 -4148 -431 3996 50866
Coefficients:
(Intercept) 66522.998 2345.279 28.365 < 2e-16 ***
education_years -534.613 165.120 -3.238 0.00124 **
pocket_per_cap -8.669 1.678 -5.168 2.86e-07 ***
Asia -7140.437 1807.499 -3.950 8.34e-05 ***
Africa 13792.577 1918.746 7.188 1.27e-12 ***
NorthA -9335.463 1794.310 -5.203 2.38e-07 ***
Europe -5196.987 1650.294 -3.149 0.00169 **
SouthA -9146.724 1917.915 -4.769 2.12e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
percentage_overweight vegetable_consumption
3.908936 1.772149
animal_protein_consumption education_years
2.885232 3.048594
pocket_per_cap fruit_consumption
2.136245 1.318169
Asia Africa
7.257346 6.590683
NorthA Europe
3.209642 7.556342
SouthA
2.440735
Hide
continent_fit=fit
summary(continent_fit)
Hide
vif(continent_fit)

Hide
huxtable::huxreg(lm1,lm2,lm3,lm4, continent_fit,
number_format = "%.2f")

                            (1)             (2)             (3)             (4)             (5)
(Intercept)                 79209.24 ***    80341.02 ***    79726.80 ***    78595.71 ***    66523.00 ***
                            (1894.33)       (1452.94)       (1410.66)       (1259.97)       (2345.28)
smoking_percentage          -23905.42 ***   -23651.69 ***   -24292.38 ***   -21927.21 ***
                            (6138.51)       (6132.06)       (6127.03)       (5986.82)
percentage_overweight       -259.02 ***     -262.03 ***     -266.17 ***     -256.08 ***     -202.37 ***
                            (35.31)         (35.16)         (35.11)         (34.69)         (33.40)
fruit_consumption           -50.51 ***      -50.72 ***      -47.37 ***      -49.24 ***      -40.86 ***
                            (8.65)          (8.65)          (8.44)          (8.38)          (6.82)
vegetable_consumption       -38.28 ***      -38.93 ***      -40.47 ***      -37.82 ***      -33.86 ***
                            (6.98)          (6.94)          (6.89)          (6.74)          (6.18)
animal_protein_consumption  -1798.59 ***    -1852.18 ***    -1741.37 ***    -1623.10 ***    -1030.84 ***
                            (271.90)        (265.72)        (258.20)        (249.73)        (209.88)
education_years             -1270.47 ***    -1267.36 ***    -1222.73 ***    -1107.61 ***    -534.61 **
                            (202.70)        (202.66)        (201.23)        (190.70)        (165.12)
physicians_1000             796.40          906.18          859.75
                            (498.63)        (484.46)        (484.20)
pocket_per_cap              -10.11 ***      -10.08 ***      -7.57 ***       -6.71 ***       -8.67 ***
                            (2.51)          (2.51)          (2.05)          (2.00)          (1.68)
healthcare_gdp_rate         6334.55
                            (6802.52)
gdp                         0.06            0.06
                            (0.03)          (0.03)
Asia                                                                                        -7140.44 ***
                                                                                            (1807.50)
Africa                                                                                      13792.58 ***
                                                                                            (1918.75)
NorthA                                                                                      -9335.46 ***
                                                                                            (1794.31)
Europe                                                                                      -5196.99 **
                                                                                            (1650.29)
SouthA                                                                                      -9146.72 ***
                                                                                            (1917.91)
N                           1025            1025            1025            1025            1025
R2                          0.64            0.64            0.64            0.63            0.76
logLik                      -11018.25       -11018.69       -11020.21       -11021.80       -10814.42
AIC                         22060.50        22059.38        22060.42        22061.60        21654.84
*** p < 0.001; ** p < 0.01; * p < 0.05.
actual predicted
actual 1.0000000 0.8572182
predicted 0.8572182 1.0000000
From the 5 models, continent_fit was chosen as the final model because all of its variables are significant and it
has the highest R-squared (0.76). As can be seen from our predictions, the DALY rates the model predicts for 2013
have a correlation of 0.8572 (about 85.72 percent) with the actual rates.
Hide
best_model <- continent_fit
#Part 2: We wanted to test the predictive performance of our model by checking that it could predict, with a
# certain level of confidence, the DALYs for the last full year of data (2013)
train <- total %>%
filter(period < 2013)
# named `test` rather than `predict` so the data frame does not mask the predict() function
test <- total %>%
filter(period == 2013)
# refit the final model's formula on the training years only
continent_fit2 <- lm(formula(continent_fit), data = train)
final_prediction <- predict(continent_fit2, newdata = test)
ac_pred <- data.frame(actual = test$daly_adjusted, predicted = final_prediction)
correlation_accuracy <- cor(ac_pred)
correlation_accuracy
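Correlation between actual and predicted values shows how closely the predictions track the observed values, but it is insensitive to systematic bias or scale errors. A sketch of complementary hold-out metrics on the built-in `mtcars` data (the split and the variable names are illustrative assumptions, not part of the study):

```r
# Illustrative split: odd rows for training, even rows for testing
train_rows <- seq(1, nrow(mtcars), by = 2)
train_set  <- mtcars[train_rows, ]
test_set   <- mtcars[-train_rows, ]

holdout_fit <- lm(mpg ~ wt + hp, data = train_set)
pred <- predict(holdout_fit, newdata = test_set)

r    <- cor(test_set$mpg, pred)                # correlation, as used in the text
rmse <- sqrt(mean((test_set$mpg - pred)^2))    # typical error in units of the response
mae  <- mean(abs(test_set$mpg - pred))         # mean absolute error

print(c(correlation = r, r_squared = r^2, rmse = rmse, mae = mae))
```

Reporting an error measure such as RMSE next to the correlation states how far off a typical prediction is in DALY units, which the correlation alone does not convey.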
